Chris@87: """
Chris@87: Define a simple format for saving numpy arrays to disk with the full
Chris@87: information about them.
Chris@87: 
Chris@87: The ``.npy`` format is the standard binary file format in NumPy for
Chris@87: persisting a *single* arbitrary NumPy array on disk. The format stores all
Chris@87: of the shape and dtype information necessary to reconstruct the array
Chris@87: correctly even on another machine with a different architecture.
Chris@87: The format is designed to be as simple as possible while achieving
Chris@87: its limited goals.
Chris@87: 
Chris@87: The ``.npz`` format is the standard format for persisting *multiple* NumPy
Chris@87: arrays on disk. A ``.npz`` file is a zip file containing multiple ``.npy``
Chris@87: files, one for each array.
Chris@87: 
Chris@87: Capabilities
Chris@87: ------------
Chris@87: 
Chris@87: - Can represent all NumPy arrays including nested record arrays and
Chris@87:   object arrays.
Chris@87: 
Chris@87: - Represents the data in its native binary form.
Chris@87: 
Chris@87: - Supports Fortran-contiguous arrays directly.
Chris@87: 
Chris@87: - Stores all of the necessary information to reconstruct the array
Chris@87:   including shape and dtype on a machine of a different
Chris@87:   architecture.  Both little-endian and big-endian arrays are
Chris@87:   supported, and a file with little-endian numbers will yield
Chris@87:   a little-endian array on any machine reading the file. The
Chris@87:   types are described in terms of their actual sizes. For example,
Chris@87:   if a machine with a 64-bit C "long int" writes out an array with
Chris@87:   "long ints", a reading machine with 32-bit C "long ints" will yield
Chris@87:   an array with 64-bit integers.
Chris@87: 
Chris@87: - Is straightforward to reverse engineer. Datasets often live longer than
Chris@87:   the programs that created them. A competent developer should be
Chris@87:   able to create a solution in his preferred programming language to
Chris@87:   read most ``.npy`` files that he has been given without much
Chris@87:   documentation.
Chris@87: 
Chris@87: - Allows memory-mapping of the data. See `open_memmep`.
Chris@87: 
Chris@87: - Can be read from a filelike stream object instead of an actual file.
Chris@87: 
Chris@87: - Stores object arrays, i.e. arrays containing elements that are arbitrary
Chris@87:   Python objects. Files with object arrays are not to be mmapable, but
Chris@87:   can be read and written to disk.
Chris@87: 
Chris@87: Limitations
Chris@87: -----------
Chris@87: 
Chris@87: - Arbitrary subclasses of numpy.ndarray are not completely preserved.
Chris@87:   Subclasses will be accepted for writing, but only the array data will
Chris@87:   be written out. A regular numpy.ndarray object will be created
Chris@87:   upon reading the file.
Chris@87: 
Chris@87: .. warning::
Chris@87: 
Chris@87:   Due to limitations in the interpretation of structured dtypes, dtypes
Chris@87:   with fields with empty names will have the names replaced by 'f0', 'f1',
Chris@87:   etc. Such arrays will not round-trip through the format entirely
Chris@87:   accurately. The data is intact; only the field names will differ. We are
Chris@87:   working on a fix for this. This fix will not require a change in the
Chris@87:   file format. The arrays with such structures can still be saved and
Chris@87:   restored, and the correct dtype may be restored by using the
Chris@87:   ``loadedarray.view(correct_dtype)`` method.
Chris@87: 
Chris@87: File extensions
Chris@87: ---------------
Chris@87: 
Chris@87: We recommend using the ``.npy`` and ``.npz`` extensions for files saved
Chris@87: in this format. This is by no means a requirement; applications may wish
Chris@87: to use these file formats but use an extension specific to the
Chris@87: application. In the absence of an obvious alternative, however,
Chris@87: we suggest using ``.npy`` and ``.npz``.
Chris@87: 
Chris@87: Version numbering
Chris@87: -----------------
Chris@87: 
Chris@87: The version numbering of these formats is independent of NumPy version
Chris@87: numbering. If the format is upgraded, the code in `numpy.io` will still
Chris@87: be able to read and write Version 1.0 files.
Chris@87: 
Chris@87: Format Version 1.0
Chris@87: ------------------
Chris@87: 
Chris@87: The first 6 bytes are a magic string: exactly ``\\x93NUMPY``.
Chris@87: 
Chris@87: The next 1 byte is an unsigned byte: the major version number of the file
Chris@87: format, e.g. ``\\x01``.
Chris@87: 
Chris@87: The next 1 byte is an unsigned byte: the minor version number of the file
Chris@87: format, e.g. ``\\x00``. Note: the version of the file format is not tied
Chris@87: to the version of the numpy package.
Chris@87: 
Chris@87: The next 2 bytes form a little-endian unsigned short int: the length of
Chris@87: the header data HEADER_LEN.
Chris@87: 
Chris@87: The next HEADER_LEN bytes form the header data describing the array's
Chris@87: format. It is an ASCII string which contains a Python literal expression
Chris@87: of a dictionary. It is terminated by a newline (``\\n``) and padded with
Chris@87: spaces (``\\x20``) to make the total length of
Chris@87: ``magic string + 4 + HEADER_LEN`` be evenly divisible by 16 for alignment
Chris@87: purposes.
Chris@87: 
Chris@87: The dictionary contains three keys:
Chris@87: 
Chris@87:     "descr" : dtype.descr
Chris@87:       An object that can be passed as an argument to the `numpy.dtype`
Chris@87:       constructor to create the array's dtype.
Chris@87:     "fortran_order" : bool
Chris@87:       Whether the array data is Fortran-contiguous or not. Since
Chris@87:       Fortran-contiguous arrays are a common form of non-C-contiguity,
Chris@87:       we allow them to be written directly to disk for efficiency.
Chris@87:     "shape" : tuple of int
Chris@87:       The shape of the array.
Chris@87: 
Chris@87: For repeatability and readability, the dictionary keys are sorted in
Chris@87: alphabetic order. This is for convenience only. A writer SHOULD implement
Chris@87: this if possible. A reader MUST NOT depend on this.
Chris@87: 
Chris@87: Following the header comes the array data. If the dtype contains Python
Chris@87: objects (i.e. ``dtype.hasobject is True``), then the data is a Python
Chris@87: pickle of the array. Otherwise the data is the contiguous (either C-
Chris@87: or Fortran-, depending on ``fortran_order``) bytes of the array.
Chris@87: Consumers can figure out the number of bytes by multiplying the number
Chris@87: of elements given by the shape (noting that ``shape=()`` means there is
Chris@87: 1 element) by ``dtype.itemsize``.
Chris@87: 
Chris@87: Notes
Chris@87: -----
Chris@87: The ``.npy`` format, including reasons for creating it and a comparison of
Chris@87: alternatives, is described fully in the "npy-format" NEP.
Chris@87: 
Chris@87: """
Chris@87: from __future__ import division, absolute_import, print_function
Chris@87: 
Chris@87: import numpy
Chris@87: import sys
Chris@87: import io
Chris@87: import warnings
Chris@87: from numpy.lib.utils import safe_eval
Chris@87: from numpy.compat import asbytes, asstr, isfileobj, long, basestring
Chris@87: 
Chris@87: if sys.version_info[0] >= 3:
Chris@87:     import pickle
Chris@87: else:
Chris@87:     import cPickle as pickle
Chris@87: 
Chris@87: MAGIC_PREFIX = asbytes('\x93NUMPY')
Chris@87: MAGIC_LEN = len(MAGIC_PREFIX) + 2
Chris@87: BUFFER_SIZE = 2**18  # size of buffer for reading npz files in bytes
Chris@87: 
Chris@87: # difference between version 1.0 and 2.0 is a 4 byte (I) header length
Chris@87: # instead of 2 bytes (H) allowing storage of large structured arrays
Chris@87: 
Chris@87: def _check_version(version):
Chris@87:     if version not in [(1, 0), (2, 0), None]:
Chris@87:         msg = "we only support format version (1,0) and (2, 0), not %s"
Chris@87:         raise ValueError(msg % (version,))
Chris@87: 
Chris@87: def magic(major, minor):
Chris@87:     """ Return the magic string for the given file format version.
Chris@87: 
Chris@87:     Parameters
Chris@87:     ----------
Chris@87:     major : int in [0, 255]
Chris@87:     minor : int in [0, 255]
Chris@87: 
Chris@87:     Returns
Chris@87:     -------
Chris@87:     magic : str
Chris@87: 
Chris@87:     Raises
Chris@87:     ------
Chris@87:     ValueError if the version cannot be formatted.
Chris@87:     """
Chris@87:     if major < 0 or major > 255:
Chris@87:         raise ValueError("major version must be 0 <= major < 256")
Chris@87:     if minor < 0 or minor > 255:
Chris@87:         raise ValueError("minor version must be 0 <= minor < 256")
Chris@87:     if sys.version_info[0] < 3:
Chris@87:         return MAGIC_PREFIX + chr(major) + chr(minor)
Chris@87:     else:
Chris@87:         return MAGIC_PREFIX + bytes([major, minor])
Chris@87: 
Chris@87: def read_magic(fp):
Chris@87:     """ Read the magic string to get the version of the file format.
Chris@87: 
Chris@87:     Parameters
Chris@87:     ----------
Chris@87:     fp : filelike object
Chris@87: 
Chris@87:     Returns
Chris@87:     -------
Chris@87:     major : int
Chris@87:     minor : int
Chris@87:     """
Chris@87:     magic_str = _read_bytes(fp, MAGIC_LEN, "magic string")
Chris@87:     if magic_str[:-2] != MAGIC_PREFIX:
Chris@87:         msg = "the magic string is not correct; expected %r, got %r"
Chris@87:         raise ValueError(msg % (MAGIC_PREFIX, magic_str[:-2]))
Chris@87:     if sys.version_info[0] < 3:
Chris@87:         major, minor = map(ord, magic_str[-2:])
Chris@87:     else:
Chris@87:         major, minor = magic_str[-2:]
Chris@87:     return major, minor
Chris@87: 
Chris@87: def dtype_to_descr(dtype):
Chris@87:     """
Chris@87:     Get a serializable descriptor from the dtype.
Chris@87: 
Chris@87:     The .descr attribute of a dtype object cannot be round-tripped through
Chris@87:     the dtype() constructor. Simple types, like dtype('float32'), have
Chris@87:     a descr which looks like a record array with one field with '' as
Chris@87:     a name. The dtype() constructor interprets this as a request to give
Chris@87:     a default name.  Instead, we construct descriptor that can be passed to
Chris@87:     dtype().
Chris@87: 
Chris@87:     Parameters
Chris@87:     ----------
Chris@87:     dtype : dtype
Chris@87:         The dtype of the array that will be written to disk.
Chris@87: 
Chris@87:     Returns
Chris@87:     -------
Chris@87:     descr : object
Chris@87:         An object that can be passed to `numpy.dtype()` in order to
Chris@87:         replicate the input dtype.
Chris@87: 
Chris@87:     """
Chris@87:     if dtype.names is not None:
Chris@87:         # This is a record array. The .descr is fine.  XXX: parts of the
Chris@87:         # record array with an empty name, like padding bytes, still get
Chris@87:         # fiddled with. This needs to be fixed in the C implementation of
Chris@87:         # dtype().
Chris@87:         return dtype.descr
Chris@87:     else:
Chris@87:         return dtype.str
Chris@87: 
Chris@87: def header_data_from_array_1_0(array):
Chris@87:     """ Get the dictionary of header metadata from a numpy.ndarray.
Chris@87: 
Chris@87:     Parameters
Chris@87:     ----------
Chris@87:     array : numpy.ndarray
Chris@87: 
Chris@87:     Returns
Chris@87:     -------
Chris@87:     d : dict
Chris@87:         This has the appropriate entries for writing its string representation
Chris@87:         to the header of the file.
Chris@87:     """
Chris@87:     d = {}
Chris@87:     d['shape'] = array.shape
Chris@87:     if array.flags.c_contiguous:
Chris@87:         d['fortran_order'] = False
Chris@87:     elif array.flags.f_contiguous:
Chris@87:         d['fortran_order'] = True
Chris@87:     else:
Chris@87:         # Totally non-contiguous data. We will have to make it C-contiguous
Chris@87:         # before writing. Note that we need to test for C_CONTIGUOUS first
Chris@87:         # because a 1-D array is both C_CONTIGUOUS and F_CONTIGUOUS.
Chris@87:         d['fortran_order'] = False
Chris@87: 
Chris@87:     d['descr'] = dtype_to_descr(array.dtype)
Chris@87:     return d
Chris@87: 
Chris@87: def _write_array_header(fp, d, version=None):
Chris@87:     """ Write the header for an array and returns the version used
Chris@87: 
Chris@87:     Parameters
Chris@87:     ----------
Chris@87:     fp : filelike object
Chris@87:     d : dict
Chris@87:         This has the appropriate entries for writing its string representation
Chris@87:         to the header of the file.
Chris@87:     version: tuple or None
Chris@87:         None means use oldest that works
Chris@87:         explicit version will raise a ValueError if the format does not
Chris@87:         allow saving this data.  Default: None
Chris@87:     Returns
Chris@87:     -------
Chris@87:     version : tuple of int
Chris@87:         the file version which needs to be used to store the data
Chris@87:     """
Chris@87:     import struct
Chris@87:     header = ["{"]
Chris@87:     for key, value in sorted(d.items()):
Chris@87:         # Need to use repr here, since we eval these when reading
Chris@87:         header.append("'%s': %s, " % (key, repr(value)))
Chris@87:     header.append("}")
Chris@87:     header = "".join(header)
Chris@87:     # Pad the header with spaces and a final newline such that the magic
Chris@87:     # string, the header-length short and the header are aligned on a
Chris@87:     # 16-byte boundary.  Hopefully, some system, possibly memory-mapping,
Chris@87:     # can take advantage of our premature optimization.
Chris@87:     current_header_len = MAGIC_LEN + 2 + len(header) + 1  # 1 for the newline
Chris@87:     topad = 16 - (current_header_len % 16)
Chris@87:     header = header + ' '*topad + '\n'
Chris@87:     header = asbytes(_filter_header(header))
Chris@87: 
Chris@87:     if len(header) >= (256*256) and version == (1, 0):
Chris@87:         raise ValueError("header does not fit inside %s bytes required by the"
Chris@87:                          " 1.0 format" % (256*256))
Chris@87:     if len(header) < (256*256):
Chris@87:         header_len_str = struct.pack('<H', len(header))
Chris@87:         version = (1, 0)
Chris@87:     elif len(header) < (2**32):
Chris@87:         header_len_str = struct.pack('<I', len(header))
Chris@87:         version = (2, 0)
Chris@87:     else:
Chris@87:         raise ValueError("header does not fit inside 4 GiB required by "
Chris@87:                          "the 2.0 format")
Chris@87: 
Chris@87:     fp.write(magic(*version))
Chris@87:     fp.write(header_len_str)
Chris@87:     fp.write(header)
Chris@87:     return version
Chris@87: 
Chris@87: def write_array_header_1_0(fp, d):
Chris@87:     """ Write the header for an array using the 1.0 format.
Chris@87: 
Chris@87:     Parameters
Chris@87:     ----------
Chris@87:     fp : filelike object
Chris@87:     d : dict
Chris@87:         This has the appropriate entries for writing its string
Chris@87:         representation to the header of the file.
Chris@87:     """
Chris@87:     _write_array_header(fp, d, (1, 0))
Chris@87: 
Chris@87: 
Chris@87: def write_array_header_2_0(fp, d):
Chris@87:     """ Write the header for an array using the 2.0 format.
Chris@87:         The 2.0 format allows storing very large structured arrays.
Chris@87: 
Chris@87:     .. versionadded:: 1.9.0
Chris@87: 
Chris@87:     Parameters
Chris@87:     ----------
Chris@87:     fp : filelike object
Chris@87:     d : dict
Chris@87:         This has the appropriate entries for writing its string
Chris@87:         representation to the header of the file.
Chris@87:     """
Chris@87:     _write_array_header(fp, d, (2, 0))
Chris@87: 
Chris@87: def read_array_header_1_0(fp):
Chris@87:     """
Chris@87:     Read an array header from a filelike object using the 1.0 file format
Chris@87:     version.
Chris@87: 
Chris@87:     This will leave the file object located just after the header.
Chris@87: 
Chris@87:     Parameters
Chris@87:     ----------
Chris@87:     fp : filelike object
Chris@87:         A file object or something with a `.read()` method like a file.
Chris@87: 
Chris@87:     Returns
Chris@87:     -------
Chris@87:     shape : tuple of int
Chris@87:         The shape of the array.
Chris@87:     fortran_order : bool
Chris@87:         The array data will be written out directly if it is either
Chris@87:         C-contiguous or Fortran-contiguous. Otherwise, it will be made
Chris@87:         contiguous before writing it out.
Chris@87:     dtype : dtype
Chris@87:         The dtype of the file's data.
Chris@87: 
Chris@87:     Raises
Chris@87:     ------
Chris@87:     ValueError
Chris@87:         If the data is invalid.
Chris@87: 
Chris@87:     """
Chris@87:     _read_array_header(fp, version=(1, 0))
Chris@87: 
Chris@87: def read_array_header_2_0(fp):
Chris@87:     """
Chris@87:     Read an array header from a filelike object using the 2.0 file format
Chris@87:     version.
Chris@87: 
Chris@87:     This will leave the file object located just after the header.
Chris@87: 
Chris@87:     .. versionadded:: 1.9.0
Chris@87: 
Chris@87:     Parameters
Chris@87:     ----------
Chris@87:     fp : filelike object
Chris@87:         A file object or something with a `.read()` method like a file.
Chris@87: 
Chris@87:     Returns
Chris@87:     -------
Chris@87:     shape : tuple of int
Chris@87:         The shape of the array.
Chris@87:     fortran_order : bool
Chris@87:         The array data will be written out directly if it is either
Chris@87:         C-contiguous or Fortran-contiguous. Otherwise, it will be made
Chris@87:         contiguous before writing it out.
Chris@87:     dtype : dtype
Chris@87:         The dtype of the file's data.
Chris@87: 
Chris@87:     Raises
Chris@87:     ------
Chris@87:     ValueError
Chris@87:         If the data is invalid.
Chris@87: 
Chris@87:     """
Chris@87:     _read_array_header(fp, version=(2, 0))
Chris@87: 
Chris@87: 
Chris@87: def _filter_header(s):
Chris@87:     """Clean up 'L' in npz header ints.
Chris@87: 
Chris@87:     Cleans up the 'L' in strings representing integers. Needed to allow npz
Chris@87:     headers produced in Python2 to be read in Python3.
Chris@87: 
Chris@87:     Parameters
Chris@87:     ----------
Chris@87:     s : byte string
Chris@87:         Npy file header.
Chris@87: 
Chris@87:     Returns
Chris@87:     -------
Chris@87:     header : str
Chris@87:         Cleaned up header.
Chris@87: 
Chris@87:     """
Chris@87:     import tokenize
Chris@87:     if sys.version_info[0] >= 3:
Chris@87:         from io import StringIO
Chris@87:     else:
Chris@87:         from StringIO import StringIO
Chris@87: 
Chris@87:     tokens = []
Chris@87:     last_token_was_number = False
Chris@87:     for token in tokenize.generate_tokens(StringIO(asstr(s)).read):
Chris@87:         token_type = token[0]
Chris@87:         token_string = token[1]
Chris@87:         if (last_token_was_number and
Chris@87:                 token_type == tokenize.NAME and
Chris@87:                 token_string == "L"):
Chris@87:             continue
Chris@87:         else:
Chris@87:             tokens.append(token)
Chris@87:         last_token_was_number = (token_type == tokenize.NUMBER)
Chris@87:     return tokenize.untokenize(tokens)
Chris@87: 
Chris@87: 
Chris@87: def _read_array_header(fp, version):
Chris@87:     """
Chris@87:     see read_array_header_1_0
Chris@87:     """
Chris@87:     # Read an unsigned, little-endian short int which has the length of the
Chris@87:     # header.
Chris@87:     import struct
Chris@87:     if version == (1, 0):
Chris@87:         hlength_str = _read_bytes(fp, 2, "array header length")
Chris@87:         header_length = struct.unpack('<H', hlength_str)[0]
Chris@87:         header = _read_bytes(fp, header_length, "array header")
Chris@87:     elif version == (2, 0):
Chris@87:         hlength_str = _read_bytes(fp, 4, "array header length")
Chris@87:         header_length = struct.unpack('<I', hlength_str)[0]
Chris@87:         header = _read_bytes(fp, header_length, "array header")
Chris@87:     else:
Chris@87:         raise ValueError("Invalid version %r" % version)
Chris@87: 
Chris@87:     # The header is a pretty-printed string representation of a literal
Chris@87:     # Python dictionary with trailing newlines padded to a 16-byte
Chris@87:     # boundary. The keys are strings.
Chris@87:     #   "shape" : tuple of int
Chris@87:     #   "fortran_order" : bool
Chris@87:     #   "descr" : dtype.descr
Chris@87:     header = _filter_header(header)
Chris@87:     try:
Chris@87:         d = safe_eval(header)
Chris@87:     except SyntaxError as e:
Chris@87:         msg = "Cannot parse header: %r\nException: %r"
Chris@87:         raise ValueError(msg % (header, e))
Chris@87:     if not isinstance(d, dict):
Chris@87:         msg = "Header is not a dictionary: %r"
Chris@87:         raise ValueError(msg % d)
Chris@87:     keys = sorted(d.keys())
Chris@87:     if keys != ['descr', 'fortran_order', 'shape']:
Chris@87:         msg = "Header does not contain the correct keys: %r"
Chris@87:         raise ValueError(msg % (keys,))
Chris@87: 
Chris@87:     # Sanity-check the values.
Chris@87:     if (not isinstance(d['shape'], tuple) or
Chris@87:             not numpy.all([isinstance(x, (int, long)) for x in d['shape']])):
Chris@87:         msg = "shape is not valid: %r"
Chris@87:         raise ValueError(msg % (d['shape'],))
Chris@87:     if not isinstance(d['fortran_order'], bool):
Chris@87:         msg = "fortran_order is not a valid bool: %r"
Chris@87:         raise ValueError(msg % (d['fortran_order'],))
Chris@87:     try:
Chris@87:         dtype = numpy.dtype(d['descr'])
Chris@87:     except TypeError as e:
Chris@87:         msg = "descr is not a valid dtype descriptor: %r"
Chris@87:         raise ValueError(msg % (d['descr'],))
Chris@87: 
Chris@87:     return d['shape'], d['fortran_order'], dtype
Chris@87: 
Chris@87: def write_array(fp, array, version=None):
Chris@87:     """
Chris@87:     Write an array to an NPY file, including a header.
Chris@87: 
Chris@87:     If the array is neither C-contiguous nor Fortran-contiguous AND the
Chris@87:     file_like object is not a real file object, this function will have to
Chris@87:     copy data in memory.
Chris@87: 
Chris@87:     Parameters
Chris@87:     ----------
Chris@87:     fp : file_like object
Chris@87:         An open, writable file object, or similar object with a
Chris@87:         ``.write()`` method.
Chris@87:     array : ndarray
Chris@87:         The array to write to disk.
Chris@87:     version : (int, int) or None, optional
Chris@87:         The version number of the format. None means use the oldest
Chris@87:         supported version that is able to store the data.  Default: None
Chris@87: 
Chris@87:     Raises
Chris@87:     ------
Chris@87:     ValueError
Chris@87:         If the array cannot be persisted.
Chris@87:     Various other errors
Chris@87:         If the array contains Python objects as part of its dtype, the
Chris@87:         process of pickling them may raise various errors if the objects
Chris@87:         are not picklable.
Chris@87: 
Chris@87:     """
Chris@87:     _check_version(version)
Chris@87:     used_ver = _write_array_header(fp, header_data_from_array_1_0(array),
Chris@87:                                    version)
Chris@87:     # this warning can be removed when 1.9 has aged enough
Chris@87:     if version != (2, 0) and used_ver == (2, 0):
Chris@87:         warnings.warn("Stored array in format 2.0. It can only be"
Chris@87:                       "read by NumPy >= 1.9", UserWarning)
Chris@87: 
Chris@87:     # Set buffer size to 16 MiB to hide the Python loop overhead.
Chris@87:     buffersize = max(16 * 1024 ** 2 // array.itemsize, 1)
Chris@87: 
Chris@87:     if array.dtype.hasobject:
Chris@87:         # We contain Python objects so we cannot write out the data
Chris@87:         # directly.  Instead, we will pickle it out with version 2 of the
Chris@87:         # pickle protocol.
Chris@87:         pickle.dump(array, fp, protocol=2)
Chris@87:     elif array.flags.f_contiguous and not array.flags.c_contiguous:
Chris@87:         if isfileobj(fp):
Chris@87:             array.T.tofile(fp)
Chris@87:         else:
Chris@87:             for chunk in numpy.nditer(
Chris@87:                     array, flags=['external_loop', 'buffered', 'zerosize_ok'],
Chris@87:                     buffersize=buffersize, order='F'):
Chris@87:                 fp.write(chunk.tobytes('C'))
Chris@87:     else:
Chris@87:         if isfileobj(fp):
Chris@87:             array.tofile(fp)
Chris@87:         else:
Chris@87:             for chunk in numpy.nditer(
Chris@87:                     array, flags=['external_loop', 'buffered', 'zerosize_ok'],
Chris@87:                     buffersize=buffersize, order='C'):
Chris@87:                 fp.write(chunk.tobytes('C'))
Chris@87: 
Chris@87: 
Chris@87: def read_array(fp):
Chris@87:     """
Chris@87:     Read an array from an NPY file.
Chris@87: 
Chris@87:     Parameters
Chris@87:     ----------
Chris@87:     fp : file_like object
Chris@87:         If this is not a real file object, then this may take extra memory
Chris@87:         and time.
Chris@87: 
Chris@87:     Returns
Chris@87:     -------
Chris@87:     array : ndarray
Chris@87:         The array from the data on disk.
Chris@87: 
Chris@87:     Raises
Chris@87:     ------
Chris@87:     ValueError
Chris@87:         If the data is invalid.
Chris@87: 
Chris@87:     """
Chris@87:     version = read_magic(fp)
Chris@87:     _check_version(version)
Chris@87:     shape, fortran_order, dtype = _read_array_header(fp, version)
Chris@87:     if len(shape) == 0:
Chris@87:         count = 1
Chris@87:     else:
Chris@87:         count = numpy.multiply.reduce(shape)
Chris@87: 
Chris@87:     # Now read the actual data.
Chris@87:     if dtype.hasobject:
Chris@87:         # The array contained Python objects. We need to unpickle the data.
Chris@87:         array = pickle.load(fp)
Chris@87:     else:
Chris@87:         if isfileobj(fp):
Chris@87:             # We can use the fast fromfile() function.
Chris@87:             array = numpy.fromfile(fp, dtype=dtype, count=count)
Chris@87:         else:
Chris@87:             # This is not a real file. We have to read it the
Chris@87:             # memory-intensive way.
Chris@87:             # crc32 module fails on reads greater than 2 ** 32 bytes,
Chris@87:             # breaking large reads from gzip streams. Chunk reads to
Chris@87:             # BUFFER_SIZE bytes to avoid issue and reduce memory overhead
Chris@87:             # of the read. In non-chunked case count < max_read_count, so
Chris@87:             # only one read is performed.
Chris@87: 
Chris@87:             max_read_count = BUFFER_SIZE // min(BUFFER_SIZE, dtype.itemsize)
Chris@87: 
Chris@87:             array = numpy.empty(count, dtype=dtype)
Chris@87:             for i in range(0, count, max_read_count):
Chris@87:                 read_count = min(max_read_count, count - i)
Chris@87:                 read_size = int(read_count * dtype.itemsize)
Chris@87:                 data = _read_bytes(fp, read_size, "array data")
Chris@87:                 array[i:i+read_count] = numpy.frombuffer(data, dtype=dtype,
Chris@87:                                                          count=read_count)
Chris@87: 
Chris@87:         if fortran_order:
Chris@87:             array.shape = shape[::-1]
Chris@87:             array = array.transpose()
Chris@87:         else:
Chris@87:             array.shape = shape
Chris@87: 
Chris@87:     return array
Chris@87: 
Chris@87: 
Chris@87: def open_memmap(filename, mode='r+', dtype=None, shape=None,
Chris@87:                 fortran_order=False, version=None):
Chris@87:     """
Chris@87:     Open a .npy file as a memory-mapped array.
Chris@87: 
Chris@87:     This may be used to read an existing file or create a new one.
Chris@87: 
Chris@87:     Parameters
Chris@87:     ----------
Chris@87:     filename : str
Chris@87:         The name of the file on disk.  This may *not* be a file-like
Chris@87:         object.
Chris@87:     mode : str, optional
Chris@87:         The mode in which to open the file; the default is 'r+'.  In
Chris@87:         addition to the standard file modes, 'c' is also accepted to mean
Chris@87:         "copy on write."  See `memmap` for the available mode strings.
Chris@87:     dtype : data-type, optional
Chris@87:         The data type of the array if we are creating a new file in "write"
Chris@87:         mode, if not, `dtype` is ignored.  The default value is None, which
Chris@87:         results in a data-type of `float64`.
Chris@87:     shape : tuple of int
Chris@87:         The shape of the array if we are creating a new file in "write"
Chris@87:         mode, in which case this parameter is required.  Otherwise, this
Chris@87:         parameter is ignored and is thus optional.
Chris@87:     fortran_order : bool, optional
Chris@87:         Whether the array should be Fortran-contiguous (True) or
Chris@87:         C-contiguous (False, the default) if we are creating a new file in
Chris@87:         "write" mode.
Chris@87:     version : tuple of int (major, minor) or None
Chris@87:         If the mode is a "write" mode, then this is the version of the file
Chris@87:         format used to create the file.  None means use the oldest
Chris@87:         supported version that is able to store the data.  Default: None
Chris@87: 
Chris@87:     Returns
Chris@87:     -------
Chris@87:     marray : memmap
Chris@87:         The memory-mapped array.
Chris@87: 
Chris@87:     Raises
Chris@87:     ------
Chris@87:     ValueError
Chris@87:         If the data or the mode is invalid.
Chris@87:     IOError
Chris@87:         If the file is not found or cannot be opened correctly.
Chris@87: 
Chris@87:     See Also
Chris@87:     --------
Chris@87:     memmap
Chris@87: 
Chris@87:     """
Chris@87:     if not isinstance(filename, basestring):
Chris@87:         raise ValueError("Filename must be a string.  Memmap cannot use"
Chris@87:                          " existing file handles.")
Chris@87: 
Chris@87:     if 'w' in mode:
Chris@87:         # We are creating the file, not reading it.
Chris@87:         # Check if we ought to create the file.
Chris@87:         _check_version(version)
Chris@87:         # Ensure that the given dtype is an authentic dtype object rather
Chris@87:         # than just something that can be interpreted as a dtype object.
Chris@87:         dtype = numpy.dtype(dtype)
Chris@87:         if dtype.hasobject:
Chris@87:             msg = "Array can't be memory-mapped: Python objects in dtype."
Chris@87:             raise ValueError(msg)
Chris@87:         d = dict(
Chris@87:             descr=dtype_to_descr(dtype),
Chris@87:             fortran_order=fortran_order,
Chris@87:             shape=shape,
Chris@87:         )
Chris@87:         # If we got here, then it should be safe to create the file.
Chris@87:         fp = open(filename, mode+'b')
Chris@87:         try:
Chris@87:             used_ver = _write_array_header(fp, d, version)
Chris@87:             # this warning can be removed when 1.9 has aged enough
Chris@87:             if version != (2, 0) and used_ver == (2, 0):
Chris@87:                 warnings.warn("Stored array in format 2.0. It can only be"
Chris@87:                               "read by NumPy >= 1.9", UserWarning)
Chris@87:             offset = fp.tell()
Chris@87:         finally:
Chris@87:             fp.close()
Chris@87:     else:
Chris@87:         # Read the header of the file first.
Chris@87:         fp = open(filename, 'rb')
Chris@87:         try:
Chris@87:             version = read_magic(fp)
Chris@87:             _check_version(version)
Chris@87: 
Chris@87:             shape, fortran_order, dtype = _read_array_header(fp, version)
Chris@87:             if dtype.hasobject:
Chris@87:                 msg = "Array can't be memory-mapped: Python objects in dtype."
Chris@87:                 raise ValueError(msg)
Chris@87:             offset = fp.tell()
Chris@87:         finally:
Chris@87:             fp.close()
Chris@87: 
Chris@87:     if fortran_order:
Chris@87:         order = 'F'
Chris@87:     else:
Chris@87:         order = 'C'
Chris@87: 
Chris@87:     # We need to change a write-only mode to a read-write mode since we've
Chris@87:     # already written data to the file.
Chris@87:     if mode == 'w+':
Chris@87:         mode = 'r+'
Chris@87: 
Chris@87:     marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order,
Chris@87:         mode=mode, offset=offset)
Chris@87: 
Chris@87:     return marray
Chris@87: 
Chris@87: 
Chris@87: def _read_bytes(fp, size, error_template="ran out of data"):
Chris@87:     """
Chris@87:     Read from file-like object until size bytes are read.
Chris@87:     Raises ValueError if not EOF is encountered before size bytes are read.
Chris@87:     Non-blocking objects only supported if they derive from io objects.
Chris@87: 
Chris@87:     Required as e.g. ZipExtFile in python 2.6 can return less data than
Chris@87:     requested.
Chris@87:     """
Chris@87:     data = bytes()
Chris@87:     while True:
Chris@87:         # io files (default in python3) return None or raise on
Chris@87:         # would-block, python2 file will truncate, probably nothing can be
Chris@87:         # done about that.  note that regular files can't be non-blocking
Chris@87:         try:
Chris@87:             r = fp.read(size - len(data))
Chris@87:             data += r
Chris@87:             if len(r) == 0 or len(data) == size:
Chris@87:                 break
Chris@87:         except io.BlockingIOError:
Chris@87:             pass
Chris@87:     if len(data) != size:
Chris@87:         msg = "EOF: reading %s, expected %d bytes got %d"
Chris@87:         raise ValueError(msg % (error_template, size, len(data)))
Chris@87:     else:
Chris@87:         return data