annotate DEPENDENCIES/mingw32/Python27/Lib/site-packages/numpy/lib/format.py @ 133:4acb5d8d80b6 tip

Don't fail environmental check if README.md exists (but .txt and no-suffix don't)
author Chris Cannam
date Tue, 30 Jul 2019 12:25:44 +0100
parents 2a2c65a20a8b
children
rev   line source
Chris@87 1 """
Chris@87 2 Define a simple format for saving numpy arrays to disk with the full
Chris@87 3 information about them.
Chris@87 4
Chris@87 5 The ``.npy`` format is the standard binary file format in NumPy for
Chris@87 6 persisting a *single* arbitrary NumPy array on disk. The format stores all
Chris@87 7 of the shape and dtype information necessary to reconstruct the array
Chris@87 8 correctly even on another machine with a different architecture.
Chris@87 9 The format is designed to be as simple as possible while achieving
Chris@87 10 its limited goals.
Chris@87 11
Chris@87 12 The ``.npz`` format is the standard format for persisting *multiple* NumPy
Chris@87 13 arrays on disk. A ``.npz`` file is a zip file containing multiple ``.npy``
Chris@87 14 files, one for each array.
Chris@87 15
Chris@87 16 Capabilities
Chris@87 17 ------------
Chris@87 18
Chris@87 19 - Can represent all NumPy arrays including nested record arrays and
Chris@87 20 object arrays.
Chris@87 21
Chris@87 22 - Represents the data in its native binary form.
Chris@87 23
Chris@87 24 - Supports Fortran-contiguous arrays directly.
Chris@87 25
Chris@87 26 - Stores all of the necessary information to reconstruct the array
Chris@87 27 including shape and dtype on a machine of a different
Chris@87 28 architecture. Both little-endian and big-endian arrays are
Chris@87 29 supported, and a file with little-endian numbers will yield
Chris@87 30 a little-endian array on any machine reading the file. The
Chris@87 31 types are described in terms of their actual sizes. For example,
Chris@87 32 if a machine with a 64-bit C "long int" writes out an array with
Chris@87 33 "long ints", a reading machine with 32-bit C "long ints" will yield
Chris@87 34 an array with 64-bit integers.
Chris@87 35
Chris@87 36 - Is straightforward to reverse engineer. Datasets often live longer than
Chris@87 37 the programs that created them. A competent developer should be
Chris@87 38 able to create a solution in his preferred programming language to
Chris@87 39 read most ``.npy`` files that he has been given without much
Chris@87 40 documentation.
Chris@87 41
Chris@87 42 - Allows memory-mapping of the data. See `open_memmap`.
Chris@87 43
Chris@87 44 - Can be read from a filelike stream object instead of an actual file.
Chris@87 45
Chris@87 46 - Stores object arrays, i.e. arrays containing elements that are arbitrary
Chris@87 47 Python objects. Files with object arrays cannot be memory-mapped, but
Chris@87 48 can be read and written to disk.
Chris@87 49
Chris@87 50 Limitations
Chris@87 51 -----------
Chris@87 52
Chris@87 53 - Arbitrary subclasses of numpy.ndarray are not completely preserved.
Chris@87 54 Subclasses will be accepted for writing, but only the array data will
Chris@87 55 be written out. A regular numpy.ndarray object will be created
Chris@87 56 upon reading the file.
Chris@87 57
Chris@87 58 .. warning::
Chris@87 59
Chris@87 60 Due to limitations in the interpretation of structured dtypes, dtypes
Chris@87 61 with fields with empty names will have the names replaced by 'f0', 'f1',
Chris@87 62 etc. Such arrays will not round-trip through the format entirely
Chris@87 63 accurately. The data is intact; only the field names will differ. We are
Chris@87 64 working on a fix for this. This fix will not require a change in the
Chris@87 65 file format. The arrays with such structures can still be saved and
Chris@87 66 restored, and the correct dtype may be restored by using the
Chris@87 67 ``loadedarray.view(correct_dtype)`` method.
Chris@87 68
Chris@87 69 File extensions
Chris@87 70 ---------------
Chris@87 71
Chris@87 72 We recommend using the ``.npy`` and ``.npz`` extensions for files saved
Chris@87 73 in this format. This is by no means a requirement; applications may wish
Chris@87 74 to use these file formats but use an extension specific to the
Chris@87 75 application. In the absence of an obvious alternative, however,
Chris@87 76 we suggest using ``.npy`` and ``.npz``.
Chris@87 77
Chris@87 78 Version numbering
Chris@87 79 -----------------
Chris@87 80
Chris@87 81 The version numbering of these formats is independent of NumPy version
Chris@87 82 numbering. If the format is upgraded, the code in `numpy.lib.format` will still
Chris@87 83 be able to read and write Version 1.0 files.
Chris@87 84
Chris@87 85 Format Version 1.0
Chris@87 86 ------------------
Chris@87 87
Chris@87 88 The first 6 bytes are a magic string: exactly ``\\x93NUMPY``.
Chris@87 89
Chris@87 90 The next 1 byte is an unsigned byte: the major version number of the file
Chris@87 91 format, e.g. ``\\x01``.
Chris@87 92
Chris@87 93 The next 1 byte is an unsigned byte: the minor version number of the file
Chris@87 94 format, e.g. ``\\x00``. Note: the version of the file format is not tied
Chris@87 95 to the version of the numpy package.
Chris@87 96
Chris@87 97 The next 2 bytes form a little-endian unsigned short int: the length of
Chris@87 98 the header data HEADER_LEN.
Chris@87 99
Chris@87 100 The next HEADER_LEN bytes form the header data describing the array's
Chris@87 101 format. It is an ASCII string which contains a Python literal expression
Chris@87 102 of a dictionary. It is terminated by a newline (``\\n``) and padded with
Chris@87 103 spaces (``\\x20``) to make the total length of
Chris@87 104 ``magic string + 4 + HEADER_LEN`` be evenly divisible by 16 for alignment
Chris@87 105 purposes.
Chris@87 106
Chris@87 107 The dictionary contains three keys:
Chris@87 108
Chris@87 109 "descr" : dtype.descr
Chris@87 110 An object that can be passed as an argument to the `numpy.dtype`
Chris@87 111 constructor to create the array's dtype.
Chris@87 112 "fortran_order" : bool
Chris@87 113 Whether the array data is Fortran-contiguous or not. Since
Chris@87 114 Fortran-contiguous arrays are a common form of non-C-contiguity,
Chris@87 115 we allow them to be written directly to disk for efficiency.
Chris@87 116 "shape" : tuple of int
Chris@87 117 The shape of the array.
Chris@87 118
Chris@87 119 For repeatability and readability, the dictionary keys are sorted in
Chris@87 120 alphabetic order. This is for convenience only. A writer SHOULD implement
Chris@87 121 this if possible. A reader MUST NOT depend on this.
Chris@87 122
Chris@87 123 Following the header comes the array data. If the dtype contains Python
Chris@87 124 objects (i.e. ``dtype.hasobject is True``), then the data is a Python
Chris@87 125 pickle of the array. Otherwise the data is the contiguous (either C-
Chris@87 126 or Fortran-, depending on ``fortran_order``) bytes of the array.
Chris@87 127 Consumers can figure out the number of bytes by multiplying the number
Chris@87 128 of elements given by the shape (noting that ``shape=()`` means there is
Chris@87 129 1 element) by ``dtype.itemsize``.
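The layout described above is simple enough to parse by hand. The following is a minimal sketch, assuming a version 1.0 file held entirely in memory. The helper name ``parse_npy_v1_header`` is hypothetical (it is not part of NumPy), and ``ast.literal_eval`` stands in for the ``safe_eval`` call this module uses.

```python
import ast
import io
import struct

import numpy as np


def parse_npy_v1_header(buf):
    """Hypothetical helper: parse a version 1.0 .npy header from bytes."""
    # 6-byte magic string: byte 0x93 followed by the ASCII text NUMPY.
    magic = bytes(bytearray([0x93])) + b"NUMPY"
    if buf[:6] != magic:
        raise ValueError("not an .npy file")
    # Two unsigned bytes: major and minor format version.
    major, minor = struct.unpack("BB", buf[6:8])
    if (major, minor) != (1, 0):
        raise ValueError("this sketch only handles format version 1.0")
    # Little-endian unsigned short: HEADER_LEN.
    (header_len,) = struct.unpack("<H", buf[8:10])
    header = buf[10:10 + header_len].decode("ascii")
    # The header is a Python dict literal padded with spaces and a newline.
    meta = ast.literal_eval(header.strip())
    return meta, 10 + header_len


# Produce a small file in memory, then read it back without numpy.load.
stream = io.BytesIO()
np.save(stream, np.arange(6, dtype="<f8").reshape(2, 3))
raw = stream.getvalue()

meta, offset = parse_npy_v1_header(raw)

# Number of elements is the product of the shape (1 for shape=()).
count = 1
for n in meta["shape"]:
    count *= n

order = "F" if meta["fortran_order"] else "C"
values = np.frombuffer(
    raw, dtype=np.dtype(meta["descr"]), count=count, offset=offset
).reshape(meta["shape"], order=order)
```

The sketch does not handle object arrays, whose data section is a pickle rather than raw bytes.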
Chris@87 130
Chris@87 131 Notes
Chris@87 132 -----
Chris@87 133 The ``.npy`` format, including reasons for creating it and a comparison of
Chris@87 134 alternatives, is described fully in the "npy-format" NEP.
Chris@87 135
Chris@87 136 """
Chris@87 137 from __future__ import division, absolute_import, print_function
Chris@87 138
Chris@87 139 import numpy
Chris@87 140 import sys
Chris@87 141 import io
Chris@87 142 import warnings
Chris@87 143 from numpy.lib.utils import safe_eval
Chris@87 144 from numpy.compat import asbytes, asstr, isfileobj, long, basestring
Chris@87 145
Chris@87 146 if sys.version_info[0] >= 3:
Chris@87 147 import pickle
Chris@87 148 else:
Chris@87 149 import cPickle as pickle
Chris@87 150
Chris@87 151 MAGIC_PREFIX = asbytes('\x93NUMPY')
Chris@87 152 MAGIC_LEN = len(MAGIC_PREFIX) + 2
Chris@87 153 BUFFER_SIZE = 2**18 # size of buffer for reading npz files in bytes
Chris@87 154
Chris@87 155 # difference between version 1.0 and 2.0 is a 4 byte (I) header length
Chris@87 156 # instead of 2 bytes (H) allowing storage of large structured arrays
Chris@87 157
Chris@87 158 def _check_version(version):
Chris@87 159 if version not in [(1, 0), (2, 0), None]:
Chris@87 160 msg = "we only support format versions (1, 0) and (2, 0), not %s"
Chris@87 161 raise ValueError(msg % (version,))
Chris@87 162
Chris@87 163 def magic(major, minor):
Chris@87 164 """ Return the magic string for the given file format version.
Chris@87 165
Chris@87 166 Parameters
Chris@87 167 ----------
Chris@87 168 major : int in [0, 255]
Chris@87 169 minor : int in [0, 255]
Chris@87 170
Chris@87 171 Returns
Chris@87 172 -------
Chris@87 173 magic : str
Chris@87 174
Chris@87 175 Raises
Chris@87 176 ------
Chris@87 177 ValueError if the version cannot be formatted.
Chris@87 178 """
Chris@87 179 if major < 0 or major > 255:
Chris@87 180 raise ValueError("major version must be 0 <= major < 256")
Chris@87 181 if minor < 0 or minor > 255:
Chris@87 182 raise ValueError("minor version must be 0 <= minor < 256")
Chris@87 183 if sys.version_info[0] < 3:
Chris@87 184 return MAGIC_PREFIX + chr(major) + chr(minor)
Chris@87 185 else:
Chris@87 186 return MAGIC_PREFIX + bytes([major, minor])
Chris@87 187
Chris@87 188 def read_magic(fp):
Chris@87 189 """ Read the magic string to get the version of the file format.
Chris@87 190
Chris@87 191 Parameters
Chris@87 192 ----------
Chris@87 193 fp : filelike object
Chris@87 194
Chris@87 195 Returns
Chris@87 196 -------
Chris@87 197 major : int
Chris@87 198 minor : int
Chris@87 199 """
Chris@87 200 magic_str = _read_bytes(fp, MAGIC_LEN, "magic string")
Chris@87 201 if magic_str[:-2] != MAGIC_PREFIX:
Chris@87 202 msg = "the magic string is not correct; expected %r, got %r"
Chris@87 203 raise ValueError(msg % (MAGIC_PREFIX, magic_str[:-2]))
Chris@87 204 if sys.version_info[0] < 3:
Chris@87 205 major, minor = map(ord, magic_str[-2:])
Chris@87 206 else:
Chris@87 207 major, minor = magic_str[-2:]
Chris@87 208 return major, minor
Chris@87 209
Chris@87 210 def dtype_to_descr(dtype):
Chris@87 211 """
Chris@87 212 Get a serializable descriptor from the dtype.
Chris@87 213
Chris@87 214 The .descr attribute of a dtype object cannot be round-tripped through
Chris@87 215 the dtype() constructor. Simple types, like dtype('float32'), have
Chris@87 216 a descr which looks like a record array with one field with '' as
Chris@87 217 a name. The dtype() constructor interprets this as a request to give
Chris@87 218 a default name. Instead, we construct a descriptor that can be passed to
Chris@87 219 dtype().
Chris@87 220
Chris@87 221 Parameters
Chris@87 222 ----------
Chris@87 223 dtype : dtype
Chris@87 224 The dtype of the array that will be written to disk.
Chris@87 225
Chris@87 226 Returns
Chris@87 227 -------
Chris@87 228 descr : object
Chris@87 229 An object that can be passed to `numpy.dtype()` in order to
Chris@87 230 replicate the input dtype.
Chris@87 231
Chris@87 232 """
Chris@87 233 if dtype.names is not None:
Chris@87 234 # This is a record array. The .descr is fine. XXX: parts of the
Chris@87 235 # record array with an empty name, like padding bytes, still get
Chris@87 236 # fiddled with. This needs to be fixed in the C implementation of
Chris@87 237 # dtype().
Chris@87 238 return dtype.descr
Chris@87 239 else:
Chris@87 240 return dtype.str
Chris@87 241
Chris@87 242 def header_data_from_array_1_0(array):
Chris@87 243 """ Get the dictionary of header metadata from a numpy.ndarray.
Chris@87 244
Chris@87 245 Parameters
Chris@87 246 ----------
Chris@87 247 array : numpy.ndarray
Chris@87 248
Chris@87 249 Returns
Chris@87 250 -------
Chris@87 251 d : dict
Chris@87 252 This has the appropriate entries for writing its string representation
Chris@87 253 to the header of the file.
Chris@87 254 """
Chris@87 255 d = {}
Chris@87 256 d['shape'] = array.shape
Chris@87 257 if array.flags.c_contiguous:
Chris@87 258 d['fortran_order'] = False
Chris@87 259 elif array.flags.f_contiguous:
Chris@87 260 d['fortran_order'] = True
Chris@87 261 else:
Chris@87 262 # Totally non-contiguous data. We will have to make it C-contiguous
Chris@87 263 # before writing. Note that we need to test for C_CONTIGUOUS first
Chris@87 264 # because a 1-D array is both C_CONTIGUOUS and F_CONTIGUOUS.
Chris@87 265 d['fortran_order'] = False
Chris@87 266
Chris@87 267 d['descr'] = dtype_to_descr(array.dtype)
Chris@87 268 return d
Chris@87 269
Chris@87 270 def _write_array_header(fp, d, version=None):
Chris@87 271 """ Write the header for an array and return the version used.
Chris@87 272
Chris@87 273 Parameters
Chris@87 274 ----------
Chris@87 275 fp : filelike object
Chris@87 276 d : dict
Chris@87 277 This has the appropriate entries for writing its string representation
Chris@87 278 to the header of the file.
Chris@87 279 version: tuple or None
Chris@87 280 None means use oldest that works
Chris@87 281 explicit version will raise a ValueError if the format does not
Chris@87 282 allow saving this data. Default: None
Chris@87 283 Returns
Chris@87 284 -------
Chris@87 285 version : tuple of int
Chris@87 286 the file version which needs to be used to store the data
Chris@87 287 """
Chris@87 288 import struct
Chris@87 289 header = ["{"]
Chris@87 290 for key, value in sorted(d.items()):
Chris@87 291 # Need to use repr here, since we eval these when reading
Chris@87 292 header.append("'%s': %s, " % (key, repr(value)))
Chris@87 293 header.append("}")
Chris@87 294 header = "".join(header)
Chris@87 295 # Pad the header with spaces and a final newline such that the magic
Chris@87 296 # string, the header-length short and the header are aligned on a
Chris@87 297 # 16-byte boundary. Hopefully, some system, possibly memory-mapping,
Chris@87 298 # can take advantage of our premature optimization.
Chris@87 299 current_header_len = MAGIC_LEN + 2 + len(header) + 1 # 1 for the newline
Chris@87 300 topad = 16 - (current_header_len % 16)
Chris@87 301 header = header + ' '*topad + '\n'
Chris@87 302 header = asbytes(_filter_header(header))
Chris@87 303
Chris@87 304 if len(header) >= (256*256) and version == (1, 0):
Chris@87 305 raise ValueError("header does not fit inside %s bytes required by the"
Chris@87 306 " 1.0 format" % (256*256))
Chris@87 307 if len(header) < (256*256):
Chris@87 308 header_len_str = struct.pack('<H', len(header))
Chris@87 309 version = (1, 0)
Chris@87 310 elif len(header) < (2**32):
Chris@87 311 header_len_str = struct.pack('<I', len(header))
Chris@87 312 version = (2, 0)
Chris@87 313 else:
Chris@87 314 raise ValueError("header does not fit inside 4 GiB required by "
Chris@87 315 "the 2.0 format")
Chris@87 316
Chris@87 317 fp.write(magic(*version))
Chris@87 318 fp.write(header_len_str)
Chris@87 319 fp.write(header)
Chris@87 320 return version
Chris@87 321
Chris@87 322 def write_array_header_1_0(fp, d):
Chris@87 323 """ Write the header for an array using the 1.0 format.
Chris@87 324
Chris@87 325 Parameters
Chris@87 326 ----------
Chris@87 327 fp : filelike object
Chris@87 328 d : dict
Chris@87 329 This has the appropriate entries for writing its string
Chris@87 330 representation to the header of the file.
Chris@87 331 """
Chris@87 332 _write_array_header(fp, d, (1, 0))
Chris@87 333
Chris@87 334
Chris@87 335 def write_array_header_2_0(fp, d):
Chris@87 336 """ Write the header for an array using the 2.0 format.
Chris@87 337 The 2.0 format allows storing very large structured arrays.
Chris@87 338
Chris@87 339 .. versionadded:: 1.9.0
Chris@87 340
Chris@87 341 Parameters
Chris@87 342 ----------
Chris@87 343 fp : filelike object
Chris@87 344 d : dict
Chris@87 345 This has the appropriate entries for writing its string
Chris@87 346 representation to the header of the file.
Chris@87 347 """
Chris@87 348 _write_array_header(fp, d, (2, 0))
Chris@87 349
Chris@87 350 def read_array_header_1_0(fp):
Chris@87 351 """
Chris@87 352 Read an array header from a filelike object using the 1.0 file format
Chris@87 353 version.
Chris@87 354
Chris@87 355 This will leave the file object located just after the header.
Chris@87 356
Chris@87 357 Parameters
Chris@87 358 ----------
Chris@87 359 fp : filelike object
Chris@87 360 A file object or something with a `.read()` method like a file.
Chris@87 361
Chris@87 362 Returns
Chris@87 363 -------
Chris@87 364 shape : tuple of int
Chris@87 365 The shape of the array.
Chris@87 366 fortran_order : bool
Chris@87 367 The array data will be written out directly if it is either
Chris@87 368 C-contiguous or Fortran-contiguous. Otherwise, it will be made
Chris@87 369 contiguous before writing it out.
Chris@87 370 dtype : dtype
Chris@87 371 The dtype of the file's data.
Chris@87 372
Chris@87 373 Raises
Chris@87 374 ------
Chris@87 375 ValueError
Chris@87 376 If the data is invalid.
Chris@87 377
Chris@87 378 """
Chris@87 379 return _read_array_header(fp, version=(1, 0))
Chris@87 380
Chris@87 381 def read_array_header_2_0(fp):
Chris@87 382 """
Chris@87 383 Read an array header from a filelike object using the 2.0 file format
Chris@87 384 version.
Chris@87 385
Chris@87 386 This will leave the file object located just after the header.
Chris@87 387
Chris@87 388 .. versionadded:: 1.9.0
Chris@87 389
Chris@87 390 Parameters
Chris@87 391 ----------
Chris@87 392 fp : filelike object
Chris@87 393 A file object or something with a `.read()` method like a file.
Chris@87 394
Chris@87 395 Returns
Chris@87 396 -------
Chris@87 397 shape : tuple of int
Chris@87 398 The shape of the array.
Chris@87 399 fortran_order : bool
Chris@87 400 The array data will be written out directly if it is either
Chris@87 401 C-contiguous or Fortran-contiguous. Otherwise, it will be made
Chris@87 402 contiguous before writing it out.
Chris@87 403 dtype : dtype
Chris@87 404 The dtype of the file's data.
Chris@87 405
Chris@87 406 Raises
Chris@87 407 ------
Chris@87 408 ValueError
Chris@87 409 If the data is invalid.
Chris@87 410
Chris@87 411 """
Chris@87 412 return _read_array_header(fp, version=(2, 0))
Chris@87 413
Chris@87 414
Chris@87 415 def _filter_header(s):
Chris@87 416 """Clean up 'L' in npz header ints.
Chris@87 417
Chris@87 418 Cleans up the 'L' in strings representing integers. Needed to allow npz
Chris@87 419 headers produced in Python2 to be read in Python3.
Chris@87 420
Chris@87 421 Parameters
Chris@87 422 ----------
Chris@87 423 s : byte string
Chris@87 424 Npy file header.
Chris@87 425
Chris@87 426 Returns
Chris@87 427 -------
Chris@87 428 header : str
Chris@87 429 Cleaned up header.
Chris@87 430
Chris@87 431 """
Chris@87 432 import tokenize
Chris@87 433 if sys.version_info[0] >= 3:
Chris@87 434 from io import StringIO
Chris@87 435 else:
Chris@87 436 from StringIO import StringIO
Chris@87 437
Chris@87 438 tokens = []
Chris@87 439 last_token_was_number = False
Chris@87 440 for token in tokenize.generate_tokens(StringIO(asstr(s)).read):
Chris@87 441 token_type = token[0]
Chris@87 442 token_string = token[1]
Chris@87 443 if (last_token_was_number and
Chris@87 444 token_type == tokenize.NAME and
Chris@87 445 token_string == "L"):
Chris@87 446 continue
Chris@87 447 else:
Chris@87 448 tokens.append(token)
Chris@87 449 last_token_was_number = (token_type == tokenize.NUMBER)
Chris@87 450 return tokenize.untokenize(tokens)
Chris@87 451
Chris@87 452
Chris@87 453 def _read_array_header(fp, version):
Chris@87 454 """
Chris@87 455 see read_array_header_1_0
Chris@87 456 """
Chris@87 457 # Read an unsigned, little-endian short int which has the length of the
Chris@87 458 # header.
Chris@87 459 import struct
Chris@87 460 if version == (1, 0):
Chris@87 461 hlength_str = _read_bytes(fp, 2, "array header length")
Chris@87 462 header_length = struct.unpack('<H', hlength_str)[0]
Chris@87 463 header = _read_bytes(fp, header_length, "array header")
Chris@87 464 elif version == (2, 0):
Chris@87 465 hlength_str = _read_bytes(fp, 4, "array header length")
Chris@87 466 header_length = struct.unpack('<I', hlength_str)[0]
Chris@87 467 header = _read_bytes(fp, header_length, "array header")
Chris@87 468 else:
Chris@87 469 raise ValueError("Invalid version %r" % version)
Chris@87 470
Chris@87 471 # The header is a pretty-printed string representation of a literal
Chris@87 472 # Python dictionary with trailing newlines padded to a 16-byte
Chris@87 473 # boundary. The keys are strings.
Chris@87 474 # "shape" : tuple of int
Chris@87 475 # "fortran_order" : bool
Chris@87 476 # "descr" : dtype.descr
Chris@87 477 header = _filter_header(header)
Chris@87 478 try:
Chris@87 479 d = safe_eval(header)
Chris@87 480 except SyntaxError as e:
Chris@87 481 msg = "Cannot parse header: %r\nException: %r"
Chris@87 482 raise ValueError(msg % (header, e))
Chris@87 483 if not isinstance(d, dict):
Chris@87 484 msg = "Header is not a dictionary: %r"
Chris@87 485 raise ValueError(msg % d)
Chris@87 486 keys = sorted(d.keys())
Chris@87 487 if keys != ['descr', 'fortran_order', 'shape']:
Chris@87 488 msg = "Header does not contain the correct keys: %r"
Chris@87 489 raise ValueError(msg % (keys,))
Chris@87 490
Chris@87 491 # Sanity-check the values.
Chris@87 492 if (not isinstance(d['shape'], tuple) or
Chris@87 493 not numpy.all([isinstance(x, (int, long)) for x in d['shape']])):
Chris@87 494 msg = "shape is not valid: %r"
Chris@87 495 raise ValueError(msg % (d['shape'],))
Chris@87 496 if not isinstance(d['fortran_order'], bool):
Chris@87 497 msg = "fortran_order is not a valid bool: %r"
Chris@87 498 raise ValueError(msg % (d['fortran_order'],))
Chris@87 499 try:
Chris@87 500 dtype = numpy.dtype(d['descr'])
Chris@87 501 except TypeError as e:
Chris@87 502 msg = "descr is not a valid dtype descriptor: %r"
Chris@87 503 raise ValueError(msg % (d['descr'],))
Chris@87 504
Chris@87 505 return d['shape'], d['fortran_order'], dtype
Chris@87 506
Chris@87 507 def write_array(fp, array, version=None):
Chris@87 508 """
Chris@87 509 Write an array to an NPY file, including a header.
Chris@87 510
Chris@87 511 If the array is neither C-contiguous nor Fortran-contiguous AND the
Chris@87 512 file_like object is not a real file object, this function will have to
Chris@87 513 copy data in memory.
Chris@87 514
Chris@87 515 Parameters
Chris@87 516 ----------
Chris@87 517 fp : file_like object
Chris@87 518 An open, writable file object, or similar object with a
Chris@87 519 ``.write()`` method.
Chris@87 520 array : ndarray
Chris@87 521 The array to write to disk.
Chris@87 522 version : (int, int) or None, optional
Chris@87 523 The version number of the format. None means use the oldest
Chris@87 524 supported version that is able to store the data. Default: None
Chris@87 525
Chris@87 526 Raises
Chris@87 527 ------
Chris@87 528 ValueError
Chris@87 529 If the array cannot be persisted.
Chris@87 530 Various other errors
Chris@87 531 If the array contains Python objects as part of its dtype, the
Chris@87 532 process of pickling them may raise various errors if the objects
Chris@87 533 are not picklable.
Chris@87 534
Chris@87 535 """
Chris@87 536 _check_version(version)
Chris@87 537 used_ver = _write_array_header(fp, header_data_from_array_1_0(array),
Chris@87 538 version)
Chris@87 539 # this warning can be removed when 1.9 has aged enough
Chris@87 540 if version != (2, 0) and used_ver == (2, 0):
Chris@87 541 warnings.warn("Stored array in format 2.0. It can only be "
Chris@87 542 "read by NumPy >= 1.9", UserWarning)
Chris@87 543
Chris@87 544 # Set buffer size to 16 MiB to hide the Python loop overhead.
Chris@87 545 buffersize = max(16 * 1024 ** 2 // array.itemsize, 1)
Chris@87 546
Chris@87 547 if array.dtype.hasobject:
Chris@87 548 # We contain Python objects so we cannot write out the data
Chris@87 549 # directly. Instead, we will pickle it out with version 2 of the
Chris@87 550 # pickle protocol.
Chris@87 551 pickle.dump(array, fp, protocol=2)
Chris@87 552 elif array.flags.f_contiguous and not array.flags.c_contiguous:
Chris@87 553 if isfileobj(fp):
Chris@87 554 array.T.tofile(fp)
Chris@87 555 else:
Chris@87 556 for chunk in numpy.nditer(
Chris@87 557 array, flags=['external_loop', 'buffered', 'zerosize_ok'],
Chris@87 558 buffersize=buffersize, order='F'):
Chris@87 559 fp.write(chunk.tobytes('C'))
Chris@87 560 else:
Chris@87 561 if isfileobj(fp):
Chris@87 562 array.tofile(fp)
Chris@87 563 else:
Chris@87 564 for chunk in numpy.nditer(
Chris@87 565 array, flags=['external_loop', 'buffered', 'zerosize_ok'],
Chris@87 566 buffersize=buffersize, order='C'):
Chris@87 567 fp.write(chunk.tobytes('C'))
Chris@87 568
Chris@87 569
Chris@87 570 def read_array(fp):
Chris@87 571 """
Chris@87 572 Read an array from an NPY file.
Chris@87 573
Chris@87 574 Parameters
Chris@87 575 ----------
Chris@87 576 fp : file_like object
Chris@87 577 If this is not a real file object, then this may take extra memory
Chris@87 578 and time.
Chris@87 579
Chris@87 580 Returns
Chris@87 581 -------
Chris@87 582 array : ndarray
Chris@87 583 The array from the data on disk.
Chris@87 584
Chris@87 585 Raises
Chris@87 586 ------
Chris@87 587 ValueError
Chris@87 588 If the data is invalid.
Chris@87 589
Chris@87 590 """
Chris@87 591 version = read_magic(fp)
Chris@87 592 _check_version(version)
Chris@87 593 shape, fortran_order, dtype = _read_array_header(fp, version)
Chris@87 594 if len(shape) == 0:
Chris@87 595 count = 1
Chris@87 596 else:
Chris@87 597 count = numpy.multiply.reduce(shape)
Chris@87 598
Chris@87 599 # Now read the actual data.
Chris@87 600 if dtype.hasobject:
Chris@87 601 # The array contained Python objects. We need to unpickle the data.
Chris@87 602 array = pickle.load(fp)
Chris@87 603 else:
Chris@87 604 if isfileobj(fp):
Chris@87 605 # We can use the fast fromfile() function.
Chris@87 606 array = numpy.fromfile(fp, dtype=dtype, count=count)
Chris@87 607 else:
Chris@87 608 # This is not a real file. We have to read it the
Chris@87 609 # memory-intensive way.
Chris@87 610 # crc32 module fails on reads greater than 2 ** 32 bytes,
Chris@87 611 # breaking large reads from gzip streams. Chunk reads to
Chris@87 612 # BUFFER_SIZE bytes to avoid issue and reduce memory overhead
Chris@87 613 # of the read. In non-chunked case count < max_read_count, so
Chris@87 614 # only one read is performed.
Chris@87 615
Chris@87 616 max_read_count = BUFFER_SIZE // min(BUFFER_SIZE, dtype.itemsize)
Chris@87 617
Chris@87 618 array = numpy.empty(count, dtype=dtype)
Chris@87 619 for i in range(0, count, max_read_count):
Chris@87 620 read_count = min(max_read_count, count - i)
Chris@87 621 read_size = int(read_count * dtype.itemsize)
Chris@87 622 data = _read_bytes(fp, read_size, "array data")
Chris@87 623 array[i:i+read_count] = numpy.frombuffer(data, dtype=dtype,
Chris@87 624 count=read_count)
Chris@87 625
Chris@87 626 if fortran_order:
Chris@87 627 array.shape = shape[::-1]
Chris@87 628 array = array.transpose()
Chris@87 629 else:
Chris@87 630 array.shape = shape
Chris@87 631
Chris@87 632 return array
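
A short round trip through `write_array` and `read_array` may help illustrate the two functions above. This is a sketch only, assuming an in-memory `io.BytesIO` stream in place of a real file (so the chunked, non-`tofile` code paths are taken); the variable names are illustrative.

```python
import io

import numpy as np
from numpy.lib import format as npy_format

# A Fortran-contiguous array, so the header records fortran_order=True.
original = np.asfortranarray(np.arange(12, dtype=np.int32).reshape(3, 4))

# Write magic string, header and data to an in-memory stream.
stream = io.BytesIO()
npy_format.write_array(stream, original)

# The magic string carries the format version chosen by the writer.
stream.seek(0)
version = npy_format.read_magic(stream)

# read_array expects the stream positioned at the start of the file.
stream.seek(0)
restored = npy_format.read_array(stream)
```

With `version=None` the writer picks the oldest format able to hold the header, which for a small array like this one is expected to be 1.0.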
Chris@87 633
Chris@87 634
Chris@87 635 def open_memmap(filename, mode='r+', dtype=None, shape=None,
Chris@87 636 fortran_order=False, version=None):
Chris@87 637 """
Chris@87 638 Open a .npy file as a memory-mapped array.
Chris@87 639
Chris@87 640 This may be used to read an existing file or create a new one.
Chris@87 641
Chris@87 642 Parameters
Chris@87 643 ----------
Chris@87 644 filename : str
Chris@87 645 The name of the file on disk. This may *not* be a file-like
Chris@87 646 object.
Chris@87 647 mode : str, optional
Chris@87 648 The mode in which to open the file; the default is 'r+'. In
Chris@87 649 addition to the standard file modes, 'c' is also accepted to mean
Chris@87 650 "copy on write." See `memmap` for the available mode strings.
Chris@87 651 dtype : data-type, optional
Chris@87 652 The data type of the array if we are creating a new file in "write"
Chris@87 653 mode, if not, `dtype` is ignored. The default value is None, which
Chris@87 654 results in a data-type of `float64`.
Chris@87 655 shape : tuple of int
Chris@87 656 The shape of the array if we are creating a new file in "write"
Chris@87 657 mode, in which case this parameter is required. Otherwise, this
Chris@87 658 parameter is ignored and is thus optional.
Chris@87 659 fortran_order : bool, optional
Chris@87 660 Whether the array should be Fortran-contiguous (True) or
Chris@87 661 C-contiguous (False, the default) if we are creating a new file in
Chris@87 662 "write" mode.
Chris@87 663 version : tuple of int (major, minor) or None
Chris@87 664 If the mode is a "write" mode, then this is the version of the file
Chris@87 665 format used to create the file. None means use the oldest
Chris@87 666 supported version that is able to store the data. Default: None
Chris@87 667
Chris@87 668 Returns
Chris@87 669 -------
Chris@87 670 marray : memmap
Chris@87 671 The memory-mapped array.
Chris@87 672
Chris@87 673 Raises
Chris@87 674 ------
Chris@87 675 ValueError
Chris@87 676 If the data or the mode is invalid.
Chris@87 677 IOError
Chris@87 678 If the file is not found or cannot be opened correctly.
Chris@87 679
Chris@87 680 See Also
Chris@87 681 --------
Chris@87 682 memmap
Chris@87 683
Chris@87 684 """
Chris@87 685 if not isinstance(filename, basestring):
Chris@87 686 raise ValueError("Filename must be a string. Memmap cannot use"
Chris@87 687 " existing file handles.")
Chris@87 688
Chris@87 689 if 'w' in mode:
Chris@87 690 # We are creating the file, not reading it.
Chris@87 691 # Check if we ought to create the file.
Chris@87 692 _check_version(version)
Chris@87 693 # Ensure that the given dtype is an authentic dtype object rather
Chris@87 694 # than just something that can be interpreted as a dtype object.
Chris@87 695 dtype = numpy.dtype(dtype)
Chris@87 696 if dtype.hasobject:
Chris@87 697 msg = "Array can't be memory-mapped: Python objects in dtype."
Chris@87 698 raise ValueError(msg)
Chris@87 699 d = dict(
Chris@87 700 descr=dtype_to_descr(dtype),
Chris@87 701 fortran_order=fortran_order,
Chris@87 702 shape=shape,
Chris@87 703 )
Chris@87 704 # If we got here, then it should be safe to create the file.
Chris@87 705 fp = open(filename, mode+'b')
Chris@87 706 try:
Chris@87 707 used_ver = _write_array_header(fp, d, version)
Chris@87 708 # this warning can be removed when 1.9 has aged enough
Chris@87 709 if version != (2, 0) and used_ver == (2, 0):
Chris@87 710 warnings.warn("Stored array in format 2.0. It can only be "
Chris@87 711 "read by NumPy >= 1.9", UserWarning)
Chris@87 712 offset = fp.tell()
Chris@87 713 finally:
Chris@87 714 fp.close()
Chris@87 715 else:
Chris@87 716 # Read the header of the file first.
Chris@87 717 fp = open(filename, 'rb')
Chris@87 718 try:
Chris@87 719 version = read_magic(fp)
Chris@87 720 _check_version(version)
Chris@87 721
Chris@87 722 shape, fortran_order, dtype = _read_array_header(fp, version)
Chris@87 723 if dtype.hasobject:
Chris@87 724 msg = "Array can't be memory-mapped: Python objects in dtype."
Chris@87 725 raise ValueError(msg)
Chris@87 726 offset = fp.tell()
Chris@87 727 finally:
Chris@87 728 fp.close()
Chris@87 729
Chris@87 730 if fortran_order:
Chris@87 731 order = 'F'
Chris@87 732 else:
Chris@87 733 order = 'C'
Chris@87 734
Chris@87 735 # We need to change a write-only mode to a read-write mode since we've
Chris@87 736 # already written data to the file.
Chris@87 737 if mode == 'w+':
Chris@87 738 mode = 'r+'
Chris@87 739
Chris@87 740 marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order,
Chris@87 741 mode=mode, offset=offset)
Chris@87 742
Chris@87 743 return marray
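
As an illustration of `open_memmap`, the sketch below creates a new `.npy` file in "write" mode, fills it through the memory map, and reopens it read-only. The temporary directory and the file name `example.npy` are assumptions made for the example.

```python
import os
import shutil
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "example.npy")

# Create the file: dtype and shape are required because nothing exists yet.
writer = open_memmap(path, mode="w+", dtype=np.float64, shape=(4, 2))
writer[:] = np.arange(8, dtype=np.float64).reshape(4, 2)
writer.flush()
del writer  # release the mapping before reopening

# Reopen read-only: dtype and shape now come from the file header.
reader = open_memmap(path, mode="r")
shape_seen = reader.shape
total = float(reader.sum())
del reader

# The result is an ordinary .npy file, so numpy.load can read it too.
loaded = np.load(path)

shutil.rmtree(tmpdir, ignore_errors=True)
```

Passing a file object instead of a path raises `ValueError`, since the underlying `numpy.memmap` needs to open the file by name.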
Chris@87 744
Chris@87 745
Chris@87 746 def _read_bytes(fp, size, error_template="ran out of data"):
Chris@87 747 """
Chris@87 748 Read from file-like object until size bytes are read.
Chris@87 749 Raises ValueError if EOF is encountered before size bytes are read.
Chris@87 750 Non-blocking objects only supported if they derive from io objects.
Chris@87 751
Chris@87 752 Required as e.g. ZipExtFile in python 2.6 can return less data than
Chris@87 753 requested.
Chris@87 754 """
Chris@87 755 data = bytes()
Chris@87 756 while True:
Chris@87 757 # io files (default in python3) return None or raise on
Chris@87 758 # would-block, python2 file will truncate, probably nothing can be
Chris@87 759 # done about that. note that regular files can't be non-blocking
Chris@87 760 try:
Chris@87 761 r = fp.read(size - len(data))
Chris@87 762 data += r
Chris@87 763 if len(r) == 0 or len(data) == size:
Chris@87 764 break
Chris@87 765 except io.BlockingIOError:
Chris@87 766 pass
Chris@87 767 if len(data) != size:
Chris@87 768 msg = "EOF: reading %s, expected %d bytes got %d"
Chris@87 769 raise ValueError(msg % (error_template, size, len(data)))
Chris@87 770 else:
Chris@87 771 return data