"""
Define a simple format for saving numpy arrays to disk with the full
information about them.

The ``.npy`` format is the standard binary file format in NumPy for
persisting a *single* arbitrary NumPy array on disk. The format stores all
of the shape and dtype information necessary to reconstruct the array
correctly even on another machine with a different architecture.
The format is designed to be as simple as possible while achieving
its limited goals.

The ``.npz`` format is the standard format for persisting *multiple* NumPy
arrays on disk. A ``.npz`` file is a zip file containing multiple ``.npy``
files, one for each array.

Capabilities
------------

- Can represent all NumPy arrays including nested record arrays and
  object arrays.

- Represents the data in its native binary form.

- Supports Fortran-contiguous arrays directly.

- Stores all of the necessary information to reconstruct the array
  including shape and dtype on a machine of a different
  architecture. Both little-endian and big-endian arrays are
  supported, and a file with little-endian numbers will yield
  a little-endian array on any machine reading the file. The
  types are described in terms of their actual sizes. For example,
  if a machine with a 64-bit C "long int" writes out an array with
  "long ints", a reading machine with 32-bit C "long ints" will yield
  an array with 64-bit integers.

- Is straightforward to reverse engineer. Datasets often live longer than
  the programs that created them. A competent developer should be
  able to create a solution in their preferred programming language to
  read most ``.npy`` files that they have been given without much
  documentation.

- Allows memory-mapping of the data. See `open_memmap`.

- Can be read from a filelike stream object instead of an actual file.

- Stores object arrays, i.e. arrays containing elements that are arbitrary
  Python objects. Files with object arrays cannot be memory-mapped, but
  can be read and written to disk.

Limitations
-----------

- Arbitrary subclasses of numpy.ndarray are not completely preserved.
  Subclasses will be accepted for writing, but only the array data will
  be written out. A regular numpy.ndarray object will be created
  upon reading the file.

.. warning::

  Due to limitations in the interpretation of structured dtypes, dtypes
  with fields with empty names will have the names replaced by 'f0', 'f1',
  etc. Such arrays will not round-trip through the format entirely
  accurately. The data is intact; only the field names will differ. We are
  working on a fix for this. This fix will not require a change in the
  file format. The arrays with such structures can still be saved and
  restored, and the correct dtype may be restored by using the
  ``loadedarray.view(correct_dtype)`` method.

File extensions
---------------

We recommend using the ``.npy`` and ``.npz`` extensions for files saved
in this format. This is by no means a requirement; applications may wish
to use these file formats but use an extension specific to the
application. In the absence of an obvious alternative, however,
we suggest using ``.npy`` and ``.npz``.

Version numbering
-----------------

The version numbering of these formats is independent of NumPy version
numbering. If the format is upgraded, the code in `numpy.io` will still
be able to read and write Version 1.0 files.

Format Version 1.0
------------------

The first 6 bytes are a magic string: exactly ``\\x93NUMPY``.

The next 1 byte is an unsigned byte: the major version number of the file
format, e.g. ``\\x01``.

The next 1 byte is an unsigned byte: the minor version number of the file
format, e.g. ``\\x00``. Note: the version of the file format is not tied
to the version of the numpy package.

The next 2 bytes form a little-endian unsigned short int: the length of
the header data HEADER_LEN.

The next HEADER_LEN bytes form the header data describing the array's
format. It is an ASCII string which contains a Python literal expression
of a dictionary. It is terminated by a newline (``\\n``) and padded with
spaces (``\\x20``) to make the total length of
``magic string + 4 + HEADER_LEN`` be evenly divisible by 16 for alignment
purposes.

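As a rough illustration of the byte layout described above, the following
sketch assembles a version 1.0 preamble using only the standard library.
The helper name ``build_preamble_1_0`` and the sample dictionary text are
assumptions made for this example. The sketch adds the minimum amount of
padding, which is only one of the choices that satisfies the alignment rule.

```python
import struct

# The 6-byte magic string: byte 0x93 followed by the ASCII text NUMPY.
MAGIC_PREFIX = bytes([0x93]) + b"NUMPY"


def build_preamble_1_0(header_dict_text):
    # Hypothetical helper: header_dict_text is the ASCII text of the
    # dictionary literal, without padding or trailing newline.
    header = header_dict_text.encode("ascii")
    # 6 magic bytes + 2 version bytes + 2 length bytes come before the
    # header; the extra 1 accounts for the terminating newline.
    unpadded = len(MAGIC_PREFIX) + 2 + 2 + len(header) + 1
    pad = (16 - unpadded % 16) % 16
    header = header + b" " * pad + bytes([10])  # 10 is the ASCII newline
    if len(header) >= 2 ** 16:
        raise ValueError("header too long for a 2-byte length field")
    length = struct.pack("<H", len(header))  # little-endian unsigned short
    return MAGIC_PREFIX + bytes([1, 0]) + length + header


preamble = build_preamble_1_0(
    "{'descr': '<f8', 'fortran_order': False, 'shape': (3, 4), }")
```

The array data would start immediately after ``preamble``, at an offset
that is a multiple of 16.
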
The dictionary contains three keys:

    "descr" : dtype.descr
      An object that can be passed as an argument to the `numpy.dtype`
      constructor to create the array's dtype.
    "fortran_order" : bool
      Whether the array data is Fortran-contiguous or not. Since
      Fortran-contiguous arrays are a common form of non-C-contiguity,
      we allow them to be written directly to disk for efficiency.
    "shape" : tuple of int
      The shape of the array.

For repeatability and readability, the dictionary keys are sorted in
alphabetic order. This is for convenience only. A writer SHOULD implement
this if possible. A reader MUST NOT depend on this.

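A reader for this header can be sketched with the standard library alone.
The function name ``read_header`` and the sample stream are assumptions for
illustration, and ``ast.literal_eval`` stands in for whatever safe
evaluator a reader prefers. The sketch handles only version 1.0 and does
not deal with Python 2 style long-integer suffixes. The keys are compared
after sorting, so the sketch does not depend on their order in the file.

```python
import ast
import io
import struct


def read_header(fp):
    # Hypothetical reader: returns (descr, fortran_order, shape).
    if fp.read(6) != bytes([0x93]) + b"NUMPY":
        raise ValueError("magic string is not correct")
    major, minor = fp.read(2)
    if (major, minor) != (1, 0):
        raise ValueError("this sketch only handles format version 1.0")
    (header_len,) = struct.unpack("<H", fp.read(2))
    header = fp.read(header_len).decode("ascii")
    d = ast.literal_eval(header)
    if not isinstance(d, dict):
        raise ValueError("header is not a dictionary")
    if sorted(d) != ["descr", "fortran_order", "shape"]:
        raise ValueError("header does not contain the correct keys")
    return d["descr"], d["fortran_order"], d["shape"]


# Build a small in-memory stream to exercise the reader.
text = "{'descr': '<i4', 'fortran_order': False, 'shape': (2, 3), }"
body = text.encode("ascii") + bytes([10])  # 10 is the ASCII newline
stream = io.BytesIO(bytes([0x93]) + b"NUMPY" + bytes([1, 0])
                    + struct.pack("<H", len(body)) + body)
result = read_header(stream)
```

After the call, the stream position sits at the first byte of array data.
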
Following the header comes the array data. If the dtype contains Python
objects (i.e. ``dtype.hasobject is True``), then the data is a Python
pickle of the array. Otherwise the data is the contiguous (either C-
or Fortran-, depending on ``fortran_order``) bytes of the array.
Consumers can figure out the number of bytes by multiplying the number
of elements given by the shape (noting that ``shape=()`` means there is
1 element) by ``dtype.itemsize``.

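The byte count for non-object data can be worked out as in the sketch
below. The name ``data_nbytes`` is an assumption for this example, and
``itemsize`` is passed in as a plain integer rather than taken from a
dtype object.

```python
def data_nbytes(shape, itemsize):
    # Hypothetical helper: number of data bytes that follow the header
    # for an array without Python objects in its dtype.
    count = 1  # an empty shape, (), describes exactly one element
    for dim in shape:
        count *= dim
    return count * itemsize


scalar_bytes = data_nbytes((), 8)       # one 8-byte element
matrix_bytes = data_nbytes((3, 4), 8)   # twelve 8-byte elements
empty_bytes = data_nbytes((0, 5), 4)    # no elements at all
```
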
Notes
-----
The ``.npy`` format, including reasons for creating it and a comparison of
alternatives, is described fully in the "npy-format" NEP.

"""
from __future__ import division, absolute_import, print_function

import numpy
import sys
import io
import warnings
from numpy.lib.utils import safe_eval
from numpy.compat import asbytes, asstr, isfileobj, long, basestring

if sys.version_info[0] >= 3:
    import pickle
else:
    import cPickle as pickle

MAGIC_PREFIX = asbytes('\x93NUMPY')
MAGIC_LEN = len(MAGIC_PREFIX) + 2
BUFFER_SIZE = 2**18  # size of buffer for reading npz files in bytes

# difference between version 1.0 and 2.0 is a 4 byte (I) header length
# instead of 2 bytes (H) allowing storage of large structured arrays

def _check_version(version):
    if version not in [(1, 0), (2, 0), None]:
        msg = "we only support format version (1,0) and (2, 0), not %s"
        raise ValueError(msg % (version,))

def magic(major, minor):
    """ Return the magic string for the given file format version.

    Parameters
    ----------
    major : int in [0, 255]
    minor : int in [0, 255]

    Returns
    -------
    magic : str

    Raises
    ------
    ValueError if the version cannot be formatted.
    """
    if major < 0 or major > 255:
        raise ValueError("major version must be 0 <= major < 256")
    if minor < 0 or minor > 255:
        raise ValueError("minor version must be 0 <= minor < 256")
    if sys.version_info[0] < 3:
        return MAGIC_PREFIX + chr(major) + chr(minor)
    else:
        return MAGIC_PREFIX + bytes([major, minor])

def read_magic(fp):
    """ Read the magic string to get the version of the file format.

    Parameters
    ----------
    fp : filelike object

    Returns
    -------
    major : int
    minor : int
    """
    magic_str = _read_bytes(fp, MAGIC_LEN, "magic string")
    if magic_str[:-2] != MAGIC_PREFIX:
        msg = "the magic string is not correct; expected %r, got %r"
        raise ValueError(msg % (MAGIC_PREFIX, magic_str[:-2]))
    if sys.version_info[0] < 3:
        major, minor = map(ord, magic_str[-2:])
    else:
        major, minor = magic_str[-2:]
    return major, minor

def dtype_to_descr(dtype):
    """
    Get a serializable descriptor from the dtype.

    The .descr attribute of a dtype object cannot be round-tripped through
    the dtype() constructor. Simple types, like dtype('float32'), have
    a descr which looks like a record array with one field with '' as
    a name. The dtype() constructor interprets this as a request to give
    a default name. Instead, we construct a descriptor that can be passed
    to dtype().

    Parameters
    ----------
    dtype : dtype
        The dtype of the array that will be written to disk.

    Returns
    -------
    descr : object
        An object that can be passed to `numpy.dtype()` in order to
        replicate the input dtype.

    """
    if dtype.names is not None:
        # This is a record array. The .descr is fine.  XXX: parts of the
        # record array with an empty name, like padding bytes, still get
        # fiddled with. This needs to be fixed in the C implementation of
        # dtype().
        return dtype.descr
    else:
        return dtype.str

def header_data_from_array_1_0(array):
    """ Get the dictionary of header metadata from a numpy.ndarray.

    Parameters
    ----------
    array : numpy.ndarray

    Returns
    -------
    d : dict
        This has the appropriate entries for writing its string representation
        to the header of the file.
    """
    d = {}
    d['shape'] = array.shape
    if array.flags.c_contiguous:
        d['fortran_order'] = False
    elif array.flags.f_contiguous:
        d['fortran_order'] = True
    else:
        # Totally non-contiguous data. We will have to make it C-contiguous
        # before writing. Note that we need to test for C_CONTIGUOUS first
        # because a 1-D array is both C_CONTIGUOUS and F_CONTIGUOUS.
        d['fortran_order'] = False

    d['descr'] = dtype_to_descr(array.dtype)
    return d

def _write_array_header(fp, d, version=None):
    """ Write the header for an array and return the version used.

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string representation
        to the header of the file.
    version : tuple or None
        None means use the oldest version that works. An explicit version
        will raise a ValueError if the format does not allow saving this
        data. Default: None

    Returns
    -------
    version : tuple of int
        The file version which needs to be used to store the data.
    """
    import struct
    header = ["{"]
    for key, value in sorted(d.items()):
        # Need to use repr here, since we eval these when reading
        header.append("'%s': %s, " % (key, repr(value)))
    header.append("}")
    header = "".join(header)
    # Pad the header with spaces and a final newline such that the magic
    # string, the header-length short and the header are aligned on a
    # 16-byte boundary.  Hopefully, some system, possibly memory-mapping,
    # can take advantage of our premature optimization.
    current_header_len = MAGIC_LEN + 2 + len(header) + 1  # 1 for the newline
    topad = 16 - (current_header_len % 16)
    header = header + ' '*topad + '\n'
    header = asbytes(_filter_header(header))

    if len(header) >= (256*256) and version == (1, 0):
        raise ValueError("header does not fit inside %s bytes required by the"
                         " 1.0 format" % (256*256))
    if len(header) < (256*256):
        header_len_str = struct.pack('<H', len(header))
        version = (1, 0)
    elif len(header) < (2**32):
        header_len_str = struct.pack('<I', len(header))
        version = (2, 0)
    else:
        raise ValueError("header does not fit inside 4 GiB required by "
                         "the 2.0 format")

    fp.write(magic(*version))
    fp.write(header_len_str)
    fp.write(header)
    return version

def write_array_header_1_0(fp, d):
    """ Write the header for an array using the 1.0 format.

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    _write_array_header(fp, d, (1, 0))


def write_array_header_2_0(fp, d):
    """ Write the header for an array using the 2.0 format.
        The 2.0 format allows storing very large structured arrays.

    .. versionadded:: 1.9.0

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    _write_array_header(fp, d, (2, 0))

def read_array_header_1_0(fp):
    """
    Read an array header from a filelike object using the 1.0 file format
    version.

    This will leave the file object located just after the header.

    Parameters
    ----------
    fp : filelike object
        A file object or something with a `.read()` method like a file.

    Returns
    -------
    shape : tuple of int
        The shape of the array.
    fortran_order : bool
        The array data will be written out directly if it is either
        C-contiguous or Fortran-contiguous. Otherwise, it will be made
        contiguous before writing it out.
    dtype : dtype
        The dtype of the file's data.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    # Return the parsed header; without this the function yields None.
    return _read_array_header(fp, version=(1, 0))

def read_array_header_2_0(fp):
    """
    Read an array header from a filelike object using the 2.0 file format
    version.

    This will leave the file object located just after the header.

    .. versionadded:: 1.9.0

    Parameters
    ----------
    fp : filelike object
        A file object or something with a `.read()` method like a file.

    Returns
    -------
    shape : tuple of int
        The shape of the array.
    fortran_order : bool
        The array data will be written out directly if it is either
        C-contiguous or Fortran-contiguous. Otherwise, it will be made
        contiguous before writing it out.
    dtype : dtype
        The dtype of the file's data.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    # Return the parsed header; without this the function yields None.
    return _read_array_header(fp, version=(2, 0))


def _filter_header(s):
    """Clean up 'L' in npz header ints.

    Cleans up the 'L' in strings representing integers. Needed to allow npz
    headers produced in Python2 to be read in Python3.

    Parameters
    ----------
    s : byte string
        Npy file header.

    Returns
    -------
    header : str
        Cleaned up header.

    """
    import tokenize
    if sys.version_info[0] >= 3:
        from io import StringIO
    else:
        from StringIO import StringIO

    tokens = []
    last_token_was_number = False
    for token in tokenize.generate_tokens(StringIO(asstr(s)).read):
        token_type = token[0]
        token_string = token[1]
        if (last_token_was_number and
                token_type == tokenize.NAME and
                token_string == "L"):
            continue
        else:
            tokens.append(token)
        last_token_was_number = (token_type == tokenize.NUMBER)
    return tokenize.untokenize(tokens)


def _read_array_header(fp, version):
    """
    see read_array_header_1_0
    """
    # Read an unsigned, little-endian short int which has the length of the
    # header.
    import struct
    if version == (1, 0):
        hlength_str = _read_bytes(fp, 2, "array header length")
        header_length = struct.unpack('<H', hlength_str)[0]
        header = _read_bytes(fp, header_length, "array header")
    elif version == (2, 0):
        hlength_str = _read_bytes(fp, 4, "array header length")
        header_length = struct.unpack('<I', hlength_str)[0]
        header = _read_bytes(fp, header_length, "array header")
    else:
        # Wrap the tuple so that %-formatting treats it as one argument.
        raise ValueError("Invalid version %r" % (version,))

    # The header is a pretty-printed string representation of a literal
    # Python dictionary with trailing newlines padded to a 16-byte
    # boundary. The keys are strings.
    #   "shape" : tuple of int
    #   "fortran_order" : bool
    #   "descr" : dtype.descr
    header = _filter_header(header)
    try:
        d = safe_eval(header)
    except SyntaxError as e:
        msg = "Cannot parse header: %r\nException: %r"
        raise ValueError(msg % (header, e))
    if not isinstance(d, dict):
        msg = "Header is not a dictionary: %r"
        raise ValueError(msg % d)
    keys = sorted(d.keys())
    if keys != ['descr', 'fortran_order', 'shape']:
        msg = "Header does not contain the correct keys: %r"
        raise ValueError(msg % (keys,))

    # Sanity-check the values.
    if (not isinstance(d['shape'], tuple) or
            not numpy.all([isinstance(x, (int, long)) for x in d['shape']])):
        msg = "shape is not valid: %r"
        raise ValueError(msg % (d['shape'],))
    if not isinstance(d['fortran_order'], bool):
        msg = "fortran_order is not a valid bool: %r"
        raise ValueError(msg % (d['fortran_order'],))
    try:
        dtype = numpy.dtype(d['descr'])
    except TypeError as e:
        msg = "descr is not a valid dtype descriptor: %r"
        raise ValueError(msg % (d['descr'],))

    return d['shape'], d['fortran_order'], dtype

def write_array(fp, array, version=None):
    """
    Write an array to an NPY file, including a header.

    If the array is neither C-contiguous nor Fortran-contiguous AND the
    file_like object is not a real file object, this function will have to
    copy data in memory.

    Parameters
    ----------
    fp : file_like object
        An open, writable file object, or similar object with a
        ``.write()`` method.
    array : ndarray
        The array to write to disk.
    version : (int, int) or None, optional
        The version number of the format. None means use the oldest
        supported version that is able to store the data.  Default: None

    Raises
    ------
    ValueError
        If the array cannot be persisted.
    Various other errors
        If the array contains Python objects as part of its dtype, the
        process of pickling them may raise various errors if the objects
        are not picklable.

    """
    _check_version(version)
    used_ver = _write_array_header(fp, header_data_from_array_1_0(array),
                                   version)
    # this warning can be removed when 1.9 has aged enough
    if version != (2, 0) and used_ver == (2, 0):
        warnings.warn("Stored array in format 2.0. It can only be "
                      "read by NumPy >= 1.9", UserWarning)

    # Set buffer size to 16 MiB to hide the Python loop overhead.
    buffersize = max(16 * 1024 ** 2 // array.itemsize, 1)

    if array.dtype.hasobject:
        # We contain Python objects so we cannot write out the data
        # directly.  Instead, we will pickle it out with version 2 of the
        # pickle protocol.
        pickle.dump(array, fp, protocol=2)
    elif array.flags.f_contiguous and not array.flags.c_contiguous:
        if isfileobj(fp):
            array.T.tofile(fp)
        else:
            for chunk in numpy.nditer(
                    array, flags=['external_loop', 'buffered', 'zerosize_ok'],
                    buffersize=buffersize, order='F'):
                fp.write(chunk.tobytes('C'))
    else:
        if isfileobj(fp):
            array.tofile(fp)
        else:
            for chunk in numpy.nditer(
                    array, flags=['external_loop', 'buffered', 'zerosize_ok'],
                    buffersize=buffersize, order='C'):
                fp.write(chunk.tobytes('C'))


def read_array(fp):
    """
    Read an array from an NPY file.

    Parameters
    ----------
    fp : file_like object
        If this is not a real file object, then this may take extra memory
        and time.

    Returns
    -------
    array : ndarray
        The array from the data on disk.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    version = read_magic(fp)
    _check_version(version)
    shape, fortran_order, dtype = _read_array_header(fp, version)
    if len(shape) == 0:
        count = 1
    else:
        count = numpy.multiply.reduce(shape)

    # Now read the actual data.
    if dtype.hasobject:
        # The array contained Python objects. We need to unpickle the data.
        array = pickle.load(fp)
    else:
        if isfileobj(fp):
            # We can use the fast fromfile() function.
            array = numpy.fromfile(fp, dtype=dtype, count=count)
        else:
            # This is not a real file. We have to read it the
            # memory-intensive way.
            # crc32 module fails on reads greater than 2 ** 32 bytes,
            # breaking large reads from gzip streams. Chunk reads to
            # BUFFER_SIZE bytes to avoid issue and reduce memory overhead
            # of the read. In non-chunked case count < max_read_count, so
            # only one read is performed.

            max_read_count = BUFFER_SIZE // min(BUFFER_SIZE, dtype.itemsize)

            array = numpy.empty(count, dtype=dtype)
            for i in range(0, count, max_read_count):
                read_count = min(max_read_count, count - i)
                read_size = int(read_count * dtype.itemsize)
                data = _read_bytes(fp, read_size, "array data")
                array[i:i+read_count] = numpy.frombuffer(data, dtype=dtype,
                                                         count=read_count)

        if fortran_order:
            array.shape = shape[::-1]
            array = array.transpose()
        else:
            array.shape = shape

    return array


def open_memmap(filename, mode='r+', dtype=None, shape=None,
                fortran_order=False, version=None):
    """
    Open a .npy file as a memory-mapped array.

    This may be used to read an existing file or create a new one.

    Parameters
    ----------
    filename : str
        The name of the file on disk.  This may *not* be a file-like
        object.
    mode : str, optional
        The mode in which to open the file; the default is 'r+'.  In
        addition to the standard file modes, 'c' is also accepted to mean
        "copy on write."  See `memmap` for the available mode strings.
    dtype : data-type, optional
        The data type of the array if we are creating a new file in "write"
        mode, if not, `dtype` is ignored.  The default value is None, which
        results in a data-type of `float64`.
    shape : tuple of int
        The shape of the array if we are creating a new file in "write"
        mode, in which case this parameter is required.  Otherwise, this
        parameter is ignored and is thus optional.
    fortran_order : bool, optional
        Whether the array should be Fortran-contiguous (True) or
        C-contiguous (False, the default) if we are creating a new file in
        "write" mode.
    version : tuple of int (major, minor) or None
        If the mode is a "write" mode, then this is the version of the file
        format used to create the file.  None means use the oldest
        supported version that is able to store the data.  Default: None

    Returns
    -------
    marray : memmap
        The memory-mapped array.

    Raises
    ------
    ValueError
        If the data or the mode is invalid.
    IOError
        If the file is not found or cannot be opened correctly.

    See Also
    --------
    memmap

    """
    if not isinstance(filename, basestring):
        raise ValueError("Filename must be a string.  Memmap cannot use"
                         " existing file handles.")

    if 'w' in mode:
        # We are creating the file, not reading it.
        # Check if we ought to create the file.
        _check_version(version)
        # Ensure that the given dtype is an authentic dtype object rather
        # than just something that can be interpreted as a dtype object.
        dtype = numpy.dtype(dtype)
        if dtype.hasobject:
            msg = "Array can't be memory-mapped: Python objects in dtype."
            raise ValueError(msg)
        d = dict(
            descr=dtype_to_descr(dtype),
            fortran_order=fortran_order,
            shape=shape,
        )
        # If we got here, then it should be safe to create the file.
        fp = open(filename, mode+'b')
        try:
            used_ver = _write_array_header(fp, d, version)
            # this warning can be removed when 1.9 has aged enough
            if version != (2, 0) and used_ver == (2, 0):
                warnings.warn("Stored array in format 2.0. It can only be "
                              "read by NumPy >= 1.9", UserWarning)
            offset = fp.tell()
        finally:
            fp.close()
    else:
        # Read the header of the file first.
        fp = open(filename, 'rb')
        try:
            version = read_magic(fp)
            _check_version(version)

            shape, fortran_order, dtype = _read_array_header(fp, version)
            if dtype.hasobject:
                msg = "Array can't be memory-mapped: Python objects in dtype."
                raise ValueError(msg)
            offset = fp.tell()
        finally:
            fp.close()

    if fortran_order:
        order = 'F'
    else:
        order = 'C'

    # We need to change a write-only mode to a read-write mode since we've
    # already written data to the file.
    if mode == 'w+':
        mode = 'r+'

    marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order,
                          mode=mode, offset=offset)

    return marray


def _read_bytes(fp, size, error_template="ran out of data"):
    """
    Read from file-like object until size bytes are read.
    Raises ValueError if EOF is encountered before size bytes are read.
    Non-blocking objects are only supported if they derive from io objects.

    Required as e.g. ZipExtFile in python 2.6 can return less data than
    requested.
    """
    data = bytes()
    while True:
        # io files (default in python3) return None or raise on
        # would-block, python2 file will truncate, probably nothing can be
        # done about that.  note that regular files can't be non-blocking
        try:
            r = fp.read(size - len(data))
            data += r
            if len(r) == 0 or len(data) == size:
                break
        except io.BlockingIOError:
            pass
    if len(data) != size:
        msg = "EOF: reading %s, expected %d bytes got %d"
        raise ValueError(msg % (error_template, size, len(data)))
    else:
        return data