Chris@87: """ Chris@87: Define a simple format for saving numpy arrays to disk with the full Chris@87: information about them. Chris@87: Chris@87: The ``.npy`` format is the standard binary file format in NumPy for Chris@87: persisting a *single* arbitrary NumPy array on disk. The format stores all Chris@87: of the shape and dtype information necessary to reconstruct the array Chris@87: correctly even on another machine with a different architecture. Chris@87: The format is designed to be as simple as possible while achieving Chris@87: its limited goals. Chris@87: Chris@87: The ``.npz`` format is the standard format for persisting *multiple* NumPy Chris@87: arrays on disk. A ``.npz`` file is a zip file containing multiple ``.npy`` Chris@87: files, one for each array. Chris@87: Chris@87: Capabilities Chris@87: ------------ Chris@87: Chris@87: - Can represent all NumPy arrays including nested record arrays and Chris@87: object arrays. Chris@87: Chris@87: - Represents the data in its native binary form. Chris@87: Chris@87: - Supports Fortran-contiguous arrays directly. Chris@87: Chris@87: - Stores all of the necessary information to reconstruct the array Chris@87: including shape and dtype on a machine of a different Chris@87: architecture. Both little-endian and big-endian arrays are Chris@87: supported, and a file with little-endian numbers will yield Chris@87: a little-endian array on any machine reading the file. The Chris@87: types are described in terms of their actual sizes. For example, Chris@87: if a machine with a 64-bit C "long int" writes out an array with Chris@87: "long ints", a reading machine with 32-bit C "long ints" will yield Chris@87: an array with 64-bit integers. Chris@87: Chris@87: - Is straightforward to reverse engineer. Datasets often live longer than Chris@87: the programs that created them. 
A competent developer should be Chris@87: able to create a solution in his preferred programming language to Chris@87: read most ``.npy`` files that he has been given without much Chris@87: documentation. Chris@87: Chris@87: - Allows memory-mapping of the data. See `open_memmep`. Chris@87: Chris@87: - Can be read from a filelike stream object instead of an actual file. Chris@87: Chris@87: - Stores object arrays, i.e. arrays containing elements that are arbitrary Chris@87: Python objects. Files with object arrays are not to be mmapable, but Chris@87: can be read and written to disk. Chris@87: Chris@87: Limitations Chris@87: ----------- Chris@87: Chris@87: - Arbitrary subclasses of numpy.ndarray are not completely preserved. Chris@87: Subclasses will be accepted for writing, but only the array data will Chris@87: be written out. A regular numpy.ndarray object will be created Chris@87: upon reading the file. Chris@87: Chris@87: .. warning:: Chris@87: Chris@87: Due to limitations in the interpretation of structured dtypes, dtypes Chris@87: with fields with empty names will have the names replaced by 'f0', 'f1', Chris@87: etc. Such arrays will not round-trip through the format entirely Chris@87: accurately. The data is intact; only the field names will differ. We are Chris@87: working on a fix for this. This fix will not require a change in the Chris@87: file format. The arrays with such structures can still be saved and Chris@87: restored, and the correct dtype may be restored by using the Chris@87: ``loadedarray.view(correct_dtype)`` method. Chris@87: Chris@87: File extensions Chris@87: --------------- Chris@87: Chris@87: We recommend using the ``.npy`` and ``.npz`` extensions for files saved Chris@87: in this format. This is by no means a requirement; applications may wish Chris@87: to use these file formats but use an extension specific to the Chris@87: application. In the absence of an obvious alternative, however, Chris@87: we suggest using ``.npy`` and ``.npz``. 
Chris@87: Chris@87: Version numbering Chris@87: ----------------- Chris@87: Chris@87: The version numbering of these formats is independent of NumPy version Chris@87: numbering. If the format is upgraded, the code in `numpy.io` will still Chris@87: be able to read and write Version 1.0 files. Chris@87: Chris@87: Format Version 1.0 Chris@87: ------------------ Chris@87: Chris@87: The first 6 bytes are a magic string: exactly ``\\x93NUMPY``. Chris@87: Chris@87: The next 1 byte is an unsigned byte: the major version number of the file Chris@87: format, e.g. ``\\x01``. Chris@87: Chris@87: The next 1 byte is an unsigned byte: the minor version number of the file Chris@87: format, e.g. ``\\x00``. Note: the version of the file format is not tied Chris@87: to the version of the numpy package. Chris@87: Chris@87: The next 2 bytes form a little-endian unsigned short int: the length of Chris@87: the header data HEADER_LEN. Chris@87: Chris@87: The next HEADER_LEN bytes form the header data describing the array's Chris@87: format. It is an ASCII string which contains a Python literal expression Chris@87: of a dictionary. It is terminated by a newline (``\\n``) and padded with Chris@87: spaces (``\\x20``) to make the total length of Chris@87: ``magic string + 4 + HEADER_LEN`` be evenly divisible by 16 for alignment Chris@87: purposes. Chris@87: Chris@87: The dictionary contains three keys: Chris@87: Chris@87: "descr" : dtype.descr Chris@87: An object that can be passed as an argument to the `numpy.dtype` Chris@87: constructor to create the array's dtype. Chris@87: "fortran_order" : bool Chris@87: Whether the array data is Fortran-contiguous or not. Since Chris@87: Fortran-contiguous arrays are a common form of non-C-contiguity, Chris@87: we allow them to be written directly to disk for efficiency. Chris@87: "shape" : tuple of int Chris@87: The shape of the array. Chris@87: Chris@87: For repeatability and readability, the dictionary keys are sorted in Chris@87: alphabetic order. 
This is for convenience only. A writer SHOULD implement Chris@87: this if possible. A reader MUST NOT depend on this. Chris@87: Chris@87: Following the header comes the array data. If the dtype contains Python Chris@87: objects (i.e. ``dtype.hasobject is True``), then the data is a Python Chris@87: pickle of the array. Otherwise the data is the contiguous (either C- Chris@87: or Fortran-, depending on ``fortran_order``) bytes of the array. Chris@87: Consumers can figure out the number of bytes by multiplying the number Chris@87: of elements given by the shape (noting that ``shape=()`` means there is Chris@87: 1 element) by ``dtype.itemsize``. Chris@87: Chris@87: Notes Chris@87: ----- Chris@87: The ``.npy`` format, including reasons for creating it and a comparison of Chris@87: alternatives, is described fully in the "npy-format" NEP. Chris@87: Chris@87: """ Chris@87: from __future__ import division, absolute_import, print_function Chris@87: Chris@87: import numpy Chris@87: import sys Chris@87: import io Chris@87: import warnings Chris@87: from numpy.lib.utils import safe_eval Chris@87: from numpy.compat import asbytes, asstr, isfileobj, long, basestring Chris@87: Chris@87: if sys.version_info[0] >= 3: Chris@87: import pickle Chris@87: else: Chris@87: import cPickle as pickle Chris@87: Chris@87: MAGIC_PREFIX = asbytes('\x93NUMPY') Chris@87: MAGIC_LEN = len(MAGIC_PREFIX) + 2 Chris@87: BUFFER_SIZE = 2**18 # size of buffer for reading npz files in bytes Chris@87: Chris@87: # difference between version 1.0 and 2.0 is a 4 byte (I) header length Chris@87: # instead of 2 bytes (H) allowing storage of large structured arrays Chris@87: Chris@87: def _check_version(version): Chris@87: if version not in [(1, 0), (2, 0), None]: Chris@87: msg = "we only support format version (1,0) and (2, 0), not %s" Chris@87: raise ValueError(msg % (version,)) Chris@87: Chris@87: def magic(major, minor): Chris@87: """ Return the magic string for the given file format version. 
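The byte layout described in the docstring above can be exercised without this module at all. The sketch below (pure standard library; `build_npy_header_v1` and `parse_npy_header_v1` are illustrative names, not part of the format API) builds and parses a version 1.0 preamble: magic string, two version bytes, a little-endian unsigned short HEADER_LEN, then a space-padded, newline-terminated dict literal.

```python
import ast
import struct

MAGIC = b'\x93NUMPY'

def build_npy_header_v1(descr, fortran_order, shape):
    # Dict literal with keys in alphabetic order, as the format recommends.
    header = "{'descr': %r, 'fortran_order': %r, 'shape': %r, }" % (
        descr, fortran_order, shape)
    # Pad with spaces and terminate with a newline so that
    # magic(6) + version(2) + HEADER_LEN field(2) + header
    # is a multiple of 16 bytes.
    unpadded = len(MAGIC) + 2 + 2 + len(header) + 1  # +1 for the newline
    header = header + ' ' * (16 - unpadded % 16) + '\n'
    return (MAGIC + bytes([1, 0]) + struct.pack('<H', len(header))
            + header.encode('latin1'))

def parse_npy_header_v1(buf):
    # Mirror the layout: magic, major, minor, HEADER_LEN, dict literal.
    assert buf[:6] == MAGIC, "not an .npy file"
    major, minor = buf[6], buf[7]
    (hlen,) = struct.unpack('<H', buf[8:10])
    meta = ast.literal_eval(buf[10:10 + hlen].decode('latin1'))
    return (major, minor), meta
```

A reader that only needs the metadata can stop after `parse_npy_header_v1`; the array data starts immediately after the header bytes.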
def magic(major, minor):
    """ Return the magic string for the given file format version.

    Parameters
    ----------
    major : int in [0, 255]
    minor : int in [0, 255]

    Returns
    -------
    magic : str

    Raises
    ------
    ValueError if the version cannot be formatted.
    """
    if major < 0 or major > 255:
        raise ValueError("major version must be 0 <= major < 256")
    if minor < 0 or minor > 255:
        raise ValueError("minor version must be 0 <= minor < 256")
    if sys.version_info[0] < 3:
        return MAGIC_PREFIX + chr(major) + chr(minor)
    else:
        return MAGIC_PREFIX + bytes([major, minor])


def read_magic(fp):
    """ Read the magic string to get the version of the file format.

    Parameters
    ----------
    fp : filelike object

    Returns
    -------
    major : int
    minor : int
    """
    magic_str = _read_bytes(fp, MAGIC_LEN, "magic string")
    if magic_str[:-2] != MAGIC_PREFIX:
        msg = "the magic string is not correct; expected %r, got %r"
        raise ValueError(msg % (MAGIC_PREFIX, magic_str[:-2]))
    if sys.version_info[0] < 3:
        major, minor = map(ord, magic_str[-2:])
    else:
        major, minor = magic_str[-2:]
    return major, minor


def dtype_to_descr(dtype):
    """
    Get a serializable descriptor from the dtype.

    The .descr attribute of a dtype object cannot be round-tripped through
    the dtype() constructor. Simple types, like dtype('float32'), have
    a descr which looks like a record array with one field with '' as
    a name. The dtype() constructor interprets this as a request to give
    a default name. Instead, we construct a descriptor that can be passed
    to dtype().

    Parameters
    ----------
    dtype : dtype
        The dtype of the array that will be written to disk.

    Returns
    -------
    descr : object
        An object that can be passed to `numpy.dtype()` in order to
        replicate the input dtype.

    """
    if dtype.names is not None:
        # This is a record array. The .descr is fine. XXX: parts of the
        # record array with an empty name, like padding bytes, still get
        # fiddled with. This needs to be fixed in the C implementation of
        # dtype().
        return dtype.descr
    else:
        return dtype.str


def header_data_from_array_1_0(array):
    """ Get the dictionary of header metadata from a numpy.ndarray.

    Parameters
    ----------
    array : numpy.ndarray

    Returns
    -------
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    d = {}
    d['shape'] = array.shape
    if array.flags.c_contiguous:
        d['fortran_order'] = False
    elif array.flags.f_contiguous:
        d['fortran_order'] = True
    else:
        # Totally non-contiguous data. We will have to make it C-contiguous
        # before writing. Note that we need to test for C_CONTIGUOUS first
        # because a 1-D array is both C_CONTIGUOUS and F_CONTIGUOUS.
        d['fortran_order'] = False

    d['descr'] = dtype_to_descr(array.dtype)
    return d


def _write_array_header(fp, d, version=None):
    """ Write the header for an array and return the version used.

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    version : tuple or None
        None means use the oldest version that works. An explicit version
        will raise a ValueError if the format does not allow saving this
        data. Default: None

    Returns
    -------
    version : tuple of int
        The file version which needs to be used to store the data.
    """
    import struct
    header = ["{"]
    for key, value in sorted(d.items()):
        # Need to use repr here, since we eval these when reading
        header.append("'%s': %s, " % (key, repr(value)))
    header.append("}")
    header = "".join(header)
    # Pad the header with spaces and a final newline such that the magic
    # string, the header-length short and the header are aligned on a
    # 16-byte boundary. Hopefully, some system, possibly memory-mapping,
    # can take advantage of our premature optimization.
    current_header_len = MAGIC_LEN + 2 + len(header) + 1  # 1 for the newline
    topad = 16 - (current_header_len % 16)
    header = header + ' '*topad + '\n'
    header = asbytes(_filter_header(header))

    if len(header) >= (256*256) and version == (1, 0):
        raise ValueError("header does not fit inside %s bytes required by the"
                         " 1.0 format" % (256*256))
    if len(header) < (256*256):
        header_len_str = struct.pack('<H', len(header))
        version = (1, 0)
    elif len(header) < (2**32):
        header_len_str = struct.pack('<I', len(header))
        version = (2, 0)
    else:
        raise ValueError("header does not fit inside 4 GiB required by "
                         "the 2.0 format")

    fp.write(magic(*version))
    fp.write(header_len_str)
    fp.write(header)
    return version


def write_array_header_1_0(fp, d):
    """ Write the header for an array using the 1.0 format.

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    _write_array_header(fp, d, (1, 0))


def write_array_header_2_0(fp, d):
    """ Write the header for an array using the 2.0 format.
        The 2.0 format allows storing very large structured arrays.

    .. versionadded:: 1.9.0

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    _write_array_header(fp, d, (2, 0))


def read_array_header_1_0(fp):
    """
    Read an array header from a filelike object using the 1.0 file format
    version.

    This will leave the file object located just after the header.

    Parameters
    ----------
    fp : filelike object
        A file object or something with a `.read()` method like a file.

    Returns
    -------
    shape : tuple of int
        The shape of the array.
    fortran_order : bool
        Whether the array data is Fortran-contiguous or not.
    dtype : dtype
        The dtype of the file's data.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    return _read_array_header(fp, version=(1, 0))


def read_array_header_2_0(fp):
    """
    Read an array header from a filelike object using the 2.0 file format
    version.

    This will leave the file object located just after the header.

    .. versionadded:: 1.9.0

    Parameters
    ----------
    fp : filelike object
        A file object or something with a `.read()` method like a file.

    Returns
    -------
    shape : tuple of int
        The shape of the array.
    fortran_order : bool
        Whether the array data is Fortran-contiguous or not.
    dtype : dtype
        The dtype of the file's data.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    return _read_array_header(fp, version=(2, 0))


def _filter_header(s):
    """Clean up 'L' in npz header ints.

    Cleans up the 'L' in strings representing integers. Needed to allow npz
    headers produced in Python 2 to be read in Python 3.

    Parameters
    ----------
    s : byte string
        Npy file header.

    Returns
    -------
    header : str
        Cleaned up header.

    """
    import tokenize
    if sys.version_info[0] >= 3:
        from io import StringIO
    else:
        from StringIO import StringIO

    tokens = []
    last_token_was_number = False
    for token in tokenize.generate_tokens(StringIO(asstr(s)).read):
        token_type = token[0]
        token_string = token[1]
        if (last_token_was_number and
                token_type == tokenize.NAME and
                token_string == "L"):
            continue
        else:
            tokens.append(token)
        last_token_was_number = (token_type == tokenize.NUMBER)
    return tokenize.untokenize(tokens)


def _read_array_header(fp, version):
    """
    see read_array_header_1_0
    """
    # Read an unsigned, little-endian short int which has the length of the
    # header.
    import struct
    if version == (1, 0):
        hlength_str = _read_bytes(fp, 2, "array header length")
        header_length = struct.unpack('<H', hlength_str)[0]
        header = _read_bytes(fp, header_length, "array header")
    elif version == (2, 0):
        hlength_str = _read_bytes(fp, 4, "array header length")
        header_length = struct.unpack('<I', hlength_str)[0]
        header = _read_bytes(fp, header_length, "array header")
    else:
        raise ValueError("Invalid version %r" % (version,))

    # The header is a pretty-printed string representation of a literal
    # Python dictionary with trailing newlines padded to a 16-byte
    # boundary. The keys are strings.
    #   "shape" : tuple of int
    #   "fortran_order" : bool
    #   "descr" : dtype.descr
    header = _filter_header(header)
    try:
        d = safe_eval(header)
    except SyntaxError as e:
        msg = "Cannot parse header: %r\nException: %r"
        raise ValueError(msg % (header, e))
    if not isinstance(d, dict):
        msg = "Header is not a dictionary: %r"
        raise ValueError(msg % d)
    keys = sorted(d.keys())
    if keys != ['descr', 'fortran_order', 'shape']:
        msg = "Header does not contain the correct keys: %r"
        raise ValueError(msg % (keys,))

    # Sanity-check the values.
    if (not isinstance(d['shape'], tuple) or
            not numpy.all([isinstance(x, (int, long)) for x in d['shape']])):
        msg = "shape is not valid: %r"
        raise ValueError(msg % (d['shape'],))
    if not isinstance(d['fortran_order'], bool):
        msg = "fortran_order is not a valid bool: %r"
        raise ValueError(msg % (d['fortran_order'],))
    try:
        dtype = numpy.dtype(d['descr'])
    except TypeError:
        msg = "descr is not a valid dtype descriptor: %r"
        raise ValueError(msg % (d['descr'],))

    return d['shape'], d['fortran_order'], dtype
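The only wire-level difference between the two format versions is the width of the HEADER_LEN field. A small standalone sketch (`pack_header_len` is an illustrative name, not part of this module's API) of the selection logic used above:

```python
import struct

def pack_header_len(hlen):
    # Version (1, 0) stores HEADER_LEN as a 2-byte little-endian '<H';
    # version (2, 0) widens the field to a 4-byte '<I' so headers of
    # large structured arrays still fit.
    if hlen < 2**16:
        return (1, 0), struct.pack('<H', hlen)
    elif hlen < 2**32:
        return (2, 0), struct.pack('<I', hlen)
    else:
        raise ValueError("header too large even for the 2.0 format")
```

As in `_write_array_header`, the smallest version that can represent the header is preferred, keeping files readable by older readers whenever possible.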
def write_array(fp, array, version=None):
    """
    Write an array to an NPY file, including a header.

    If the array is contiguous, it is written directly to the file. If it
    is not contiguous, it is written in buffered chunks so that it does not
    have to be copied in its entirety.

    Parameters
    ----------
    fp : file_like object
        An open, writable file object, or similar object with a
        ``.write()`` method.
    array : ndarray
        The array to write to disk.
    version : (int, int) or None, optional
        The version number of the format. None means use the oldest
        supported version that is able to store the data. Default: None

    Raises
    ------
    ValueError
        If the array cannot be persisted.
    Various other errors
        If the array contains Python objects as part of its dtype, the
        process of pickling them may raise various errors if the objects
        are not picklable.

    """
    _check_version(version)
    used_ver = _write_array_header(fp, header_data_from_array_1_0(array),
                                   version)
    # this warning can be removed when 1.9 has aged enough
    if version != (2, 0) and used_ver == (2, 0):
        warnings.warn("Stored array in format 2.0. It can only be "
                      "read by NumPy >= 1.9", UserWarning)

    # Set buffer size to 16 MiB to hide the Python loop overhead.
    buffersize = max(16 * 1024 ** 2 // array.itemsize, 1)

    if array.dtype.hasobject:
        # We contain Python objects so we cannot write out the data
        # directly. Instead, we will pickle it out with version 2 of the
        # pickle protocol.
        pickle.dump(array, fp, protocol=2)
    elif array.flags.f_contiguous and not array.flags.c_contiguous:
        if isfileobj(fp):
            array.T.tofile(fp)
        else:
            for chunk in numpy.nditer(
                    array, flags=['external_loop', 'buffered', 'zerosize_ok'],
                    buffersize=buffersize, order='F'):
                fp.write(chunk.tobytes('C'))
    else:
        if isfileobj(fp):
            array.tofile(fp)
        else:
            for chunk in numpy.nditer(
                    array, flags=['external_loop', 'buffered', 'zerosize_ok'],
                    buffersize=buffersize, order='C'):
                fp.write(chunk.tobytes('C'))


def read_array(fp):
    """
    Read an array from an NPY file.

    Parameters
    ----------
    fp : file_like object
        If this is not a real file object, then this may take extra memory
        and time.

    Returns
    -------
    array : ndarray
        The array from the data on disk.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    version = read_magic(fp)
    _check_version(version)
    shape, fortran_order, dtype = _read_array_header(fp, version)
    if len(shape) == 0:
        count = 1
    else:
        count = numpy.multiply.reduce(shape)

    # Now read the actual data.
    if dtype.hasobject:
        # The array contained Python objects. We need to unpickle the data.
        array = pickle.load(fp)
    else:
        if isfileobj(fp):
            # We can use the fast fromfile() function.
            array = numpy.fromfile(fp, dtype=dtype, count=count)
        else:
            # This is not a real file. We have to read it the
            # memory-intensive way.
            # crc32 module fails on reads greater than 2 ** 32 bytes,
            # breaking large reads from gzip streams. Chunk reads to
            # BUFFER_SIZE bytes to avoid issue and reduce memory overhead
            # of the read. In non-chunked case count < max_read_count, so
            # only one read is performed.

            max_read_count = BUFFER_SIZE // min(BUFFER_SIZE, dtype.itemsize)

            array = numpy.empty(count, dtype=dtype)
            for i in range(0, count, max_read_count):
                read_count = min(max_read_count, count - i)
                read_size = int(read_count * dtype.itemsize)
                data = _read_bytes(fp, read_size, "array data")
                array[i:i+read_count] = numpy.frombuffer(data, dtype=dtype,
                                                         count=read_count)

        if fortran_order:
            array.shape = shape[::-1]
            array = array.transpose()
        else:
            array.shape = shape

    return array
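The chunking arithmetic in `read_array` above can be sketched in isolation. The generator below (`chunk_reads` is an illustrative name, not part of this module's API) yields the same (element offset, byte count) pairs the reading loop uses, so each individual read stays at or under the buffer size:

```python
def chunk_reads(count, itemsize, buffer_size=2**18):
    # Mirror of read_array's chunking: read at most buffer_size bytes at a
    # time so zlib's crc32 never sees a single read larger than the cap.
    max_read_count = buffer_size // min(buffer_size, itemsize)
    for i in range(0, count, max_read_count):
        read_count = min(max_read_count, count - i)
        yield i, read_count * itemsize
```

For example, with a toy 16-byte buffer and 8-byte items, five elements are covered in reads of 16, 16, and 8 bytes; when `count * itemsize` fits in one buffer, exactly one read is performed.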
def open_memmap(filename, mode='r+', dtype=None, shape=None,
                fortran_order=False, version=None):
    """
    Open a .npy file as a memory-mapped array.

    This may be used to read an existing file or create a new one.

    Parameters
    ----------
    filename : str
        The name of the file on disk. This may *not* be a file-like
        object.
    mode : str, optional
        The mode in which to open the file; the default is 'r+'. In
        addition to the standard file modes, 'c' is also accepted to mean
        "copy on write." See `memmap` for the available mode strings.
    dtype : data-type, optional
        The data type of the array if we are creating a new file in "write"
        mode, if not, `dtype` is ignored. The default value is None, which
        results in a data-type of `float64`.
    shape : tuple of int
        The shape of the array if we are creating a new file in "write"
        mode, in which case this parameter is required. Otherwise, this
        parameter is ignored and is thus optional.
    fortran_order : bool, optional
        Whether the array should be Fortran-contiguous (True) or
        C-contiguous (False, the default) if we are creating a new file in
        "write" mode.
    version : tuple of int (major, minor) or None
        If the mode is a "write" mode, then this is the version of the file
        format used to create the file. None means use the oldest
        supported version that is able to store the data. Default: None

    Returns
    -------
    marray : memmap
        The memory-mapped array.

    Raises
    ------
    ValueError
        If the data or the mode is invalid.
    IOError
        If the file is not found or cannot be opened correctly.

    See Also
    --------
    memmap

    """
    if not isinstance(filename, basestring):
        raise ValueError("Filename must be a string. Memmap cannot use"
                         " existing file handles.")

    if 'w' in mode:
        # We are creating the file, not reading it.
        # Check if we ought to create the file.
        _check_version(version)
        # Ensure that the given dtype is an authentic dtype object rather
        # than just something that can be interpreted as a dtype object.
        dtype = numpy.dtype(dtype)
        if dtype.hasobject:
            msg = "Array can't be memory-mapped: Python objects in dtype."
            raise ValueError(msg)
        d = dict(
            descr=dtype_to_descr(dtype),
            fortran_order=fortran_order,
            shape=shape,
        )
        # If we got here, then it should be safe to create the file.
        fp = open(filename, mode+'b')
        try:
            used_ver = _write_array_header(fp, d, version)
            # this warning can be removed when 1.9 has aged enough
            if version != (2, 0) and used_ver == (2, 0):
                warnings.warn("Stored array in format 2.0. It can only be "
                              "read by NumPy >= 1.9", UserWarning)
            offset = fp.tell()
        finally:
            fp.close()
    else:
        # Read the header of the file first.
        fp = open(filename, 'rb')
        try:
            version = read_magic(fp)
            _check_version(version)

            shape, fortran_order, dtype = _read_array_header(fp, version)
            if dtype.hasobject:
                msg = "Array can't be memory-mapped: Python objects in dtype."
                raise ValueError(msg)
            offset = fp.tell()
        finally:
            fp.close()

    if fortran_order:
        order = 'F'
    else:
        order = 'C'

    # We need to change a write-only mode to a read-write mode since we've
    # already written data to the file.
    if mode == 'w+':
        mode = 'r+'

    marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order,
                          mode=mode, offset=offset)

    return marray


def _read_bytes(fp, size, error_template="ran out of data"):
    """
    Read from file-like object until size bytes are read.
    Raises ValueError if EOF is encountered before size bytes are read.
    Non-blocking objects only supported if they derive from io objects.

    Required as e.g. ZipExtFile in python 2.6 can return less data than
    requested.
Chris@87: """ Chris@87: data = bytes() Chris@87: while True: Chris@87: # io files (default in python3) return None or raise on Chris@87: # would-block, python2 file will truncate, probably nothing can be Chris@87: # done about that. note that regular files can't be non-blocking Chris@87: try: Chris@87: r = fp.read(size - len(data)) Chris@87: data += r Chris@87: if len(r) == 0 or len(data) == size: Chris@87: break Chris@87: except io.BlockingIOError: Chris@87: pass Chris@87: if len(data) != size: Chris@87: msg = "EOF: reading %s, expected %d bytes got %d" Chris@87: raise ValueError(msg % (error_template, size, len(data))) Chris@87: else: Chris@87: return data