Chris@87
|
1 """
|
Chris@87
|
2 ===============
|
Chris@87
|
3 Array Internals
|
Chris@87
|
4 ===============
|
Chris@87
|
5
|
Chris@87
|
6 Internal organization of numpy arrays
|
Chris@87
|
7 =====================================
|
Chris@87
|
8
|
Chris@87
|
9 It helps to understand a bit about how numpy arrays are handled under the covers to help understand numpy better. This section will not go into great detail. Those wishing to understand the full details are referred to Travis Oliphant's book "Guide to Numpy".
|
Chris@87
|
10
|
Chris@87
|
11 Numpy arrays consist of two major components, the raw array data (from now on,
|
Chris@87
|
12 referred to as the data buffer), and the information about the raw array data.
|
Chris@87
|
13 The data buffer is typically what people think of as arrays in C or Fortran,
|
Chris@87
|
14 a contiguous (and fixed) block of memory containing fixed sized data items.
|
Chris@87
|
15 Numpy also contains a significant set of data that describes how to interpret
|
Chris@87
|
16 the data in the data buffer. This extra information contains (among other things):
|
Chris@87
|
17
|
Chris@87
|
18 1) The basic data element's size in bytes
|
Chris@87
|
19 2) The start of the data within the data buffer (an offset relative to the
|
Chris@87
|
20 beginning of the data buffer).
|
Chris@87
|
21 3) The number of dimensions and the size of each dimension
|
Chris@87
|
22 4) The separation between elements for each dimension (the 'stride'). This
|
Chris@87
|
23 does not have to be a multiple of the element size
|
Chris@87
|
24 5) The byte order of the data (which may not be the native byte order)
|
Chris@87
|
25 6) Whether the buffer is read-only
|
Chris@87
|
26 7) Information (via the dtype object) about the interpretation of the basic
|
Chris@87
|
27 data element. The basic data element may be as simple as a int or a float,
|
Chris@87
|
28 or it may be a compound object (e.g., struct-like), a fixed character field,
|
Chris@87
|
29 or Python object pointers.
|
Chris@87
|
30 8) Whether the array is to interpreted as C-order or Fortran-order.
|
Chris@87
|
31
|
Chris@87
|
32 This arrangement allow for very flexible use of arrays. One thing that it allows
|
Chris@87
|
33 is simple changes of the metadata to change the interpretation of the array buffer.
|
Chris@87
|
34 Changing the byteorder of the array is a simple change involving no rearrangement
|
Chris@87
|
35 of the data. The shape of the array can be changed very easily without changing
|
Chris@87
|
36 anything in the data buffer or any data copying at all
|
Chris@87
|
37
|
Chris@87
|
38 Among other things that are made possible is one can create a new array metadata
|
Chris@87
|
39 object that uses the same data buffer
|
Chris@87
|
40 to create a new view of that data buffer that has a different interpretation
|
Chris@87
|
41 of the buffer (e.g., different shape, offset, byte order, strides, etc) but
|
Chris@87
|
42 shares the same data bytes. Many operations in numpy do just this such as
|
Chris@87
|
43 slices. Other operations, such as transpose, don't move data elements
|
Chris@87
|
44 around in the array, but rather change the information about the shape and strides so that the indexing of the array changes, but the data in the doesn't move.
|
Chris@87
|
45
|
Chris@87
|
46 Typically these new versions of the array metadata but the same data buffer are
|
Chris@87
|
47 new 'views' into the data buffer. There is a different ndarray object, but it
|
Chris@87
|
48 uses the same data buffer. This is why it is necessary to force copies through
|
Chris@87
|
49 use of the .copy() method if one really wants to make a new and independent
|
Chris@87
|
50 copy of the data buffer.
|
Chris@87
|
51
|
Chris@87
|
52 New views into arrays mean the the object reference counts for the data buffer
|
Chris@87
|
53 increase. Simply doing away with the original array object will not remove the
|
Chris@87
|
54 data buffer if other views of it still exist.
|
Chris@87
|
55
|
Chris@87
|
56 Multidimensional Array Indexing Order Issues
|
Chris@87
|
57 ============================================
|
Chris@87
|
58
|
Chris@87
|
59 What is the right way to index
|
Chris@87
|
60 multi-dimensional arrays? Before you jump to conclusions about the one and
|
Chris@87
|
61 true way to index multi-dimensional arrays, it pays to understand why this is
|
Chris@87
|
62 a confusing issue. This section will try to explain in detail how numpy
|
Chris@87
|
63 indexing works and why we adopt the convention we do for images, and when it
|
Chris@87
|
64 may be appropriate to adopt other conventions.
|
Chris@87
|
65
|
Chris@87
|
66 The first thing to understand is
|
Chris@87
|
67 that there are two conflicting conventions for indexing 2-dimensional arrays.
|
Chris@87
|
68 Matrix notation uses the first index to indicate which row is being selected and
|
Chris@87
|
69 the second index to indicate which column is selected. This is opposite the
|
Chris@87
|
70 geometrically oriented-convention for images where people generally think the
|
Chris@87
|
71 first index represents x position (i.e., column) and the second represents y
|
Chris@87
|
72 position (i.e., row). This alone is the source of much confusion;
|
Chris@87
|
73 matrix-oriented users and image-oriented users expect two different things with
|
Chris@87
|
74 regard to indexing.
|
Chris@87
|
75
|
Chris@87
|
76 The second issue to understand is how indices correspond
|
Chris@87
|
77 to the order the array is stored in memory. In Fortran the first index is the
|
Chris@87
|
78 most rapidly varying index when moving through the elements of a two
|
Chris@87
|
79 dimensional array as it is stored in memory. If you adopt the matrix
|
Chris@87
|
80 convention for indexing, then this means the matrix is stored one column at a
|
Chris@87
|
81 time (since the first index moves to the next row as it changes). Thus Fortran
|
Chris@87
|
82 is considered a Column-major language. C has just the opposite convention. In
|
Chris@87
|
83 C, the last index changes most rapidly as one moves through the array as
|
Chris@87
|
84 stored in memory. Thus C is a Row-major language. The matrix is stored by
|
Chris@87
|
85 rows. Note that in both cases it presumes that the matrix convention for
|
Chris@87
|
86 indexing is being used, i.e., for both Fortran and C, the first index is the
|
Chris@87
|
87 row. Note this convention implies that the indexing convention is invariant
|
Chris@87
|
88 and that the data order changes to keep that so.
|
Chris@87
|
89
|
Chris@87
|
90 But that's not the only way
|
Chris@87
|
91 to look at it. Suppose one has large two-dimensional arrays (images or
|
Chris@87
|
92 matrices) stored in data files. Suppose the data are stored by rows rather than
|
Chris@87
|
93 by columns. If we are to preserve our index convention (whether matrix or
|
Chris@87
|
94 image) that means that depending on the language we use, we may be forced to
|
Chris@87
|
95 reorder the data if it is read into memory to preserve our indexing
|
Chris@87
|
96 convention. For example if we read row-ordered data into memory without
|
Chris@87
|
97 reordering, it will match the matrix indexing convention for C, but not for
|
Chris@87
|
98 Fortran. Conversely, it will match the image indexing convention for Fortran,
|
Chris@87
|
99 but not for C. For C, if one is using data stored in row order, and one wants
|
Chris@87
|
100 to preserve the image index convention, the data must be reordered when
|
Chris@87
|
101 reading into memory.
|
Chris@87
|
102
|
Chris@87
|
103 In the end, which you do for Fortran or C depends on
|
Chris@87
|
104 which is more important, not reordering data or preserving the indexing
|
Chris@87
|
105 convention. For large images, reordering data is potentially expensive, and
|
Chris@87
|
106 often the indexing convention is inverted to avoid that.
|
Chris@87
|
107
|
Chris@87
|
108 The situation with
|
Chris@87
|
109 numpy makes this issue yet more complicated. The internal machinery of numpy
|
Chris@87
|
110 arrays is flexible enough to accept any ordering of indices. One can simply
|
Chris@87
|
111 reorder indices by manipulating the internal stride information for arrays
|
Chris@87
|
112 without reordering the data at all. Numpy will know how to map the new index
|
Chris@87
|
113 order to the data without moving the data.
|
Chris@87
|
114
|
Chris@87
|
115 So if this is true, why not choose
|
Chris@87
|
116 the index order that matches what you most expect? In particular, why not define
|
Chris@87
|
117 row-ordered images to use the image convention? (This is sometimes referred
|
Chris@87
|
118 to as the Fortran convention vs the C convention, thus the 'C' and 'FORTRAN'
|
Chris@87
|
119 order options for array ordering in numpy.) The drawback of doing this is
|
Chris@87
|
120 potential performance penalties. It's common to access the data sequentially,
|
Chris@87
|
121 either implicitly in array operations or explicitly by looping over rows of an
|
Chris@87
|
122 image. When that is done, then the data will be accessed in non-optimal order.
|
Chris@87
|
123 As the first index is incremented, what is actually happening is that elements
|
Chris@87
|
124 spaced far apart in memory are being sequentially accessed, with usually poor
|
Chris@87
|
125 memory access speeds. For example, for a two dimensional image 'im' defined so
|
Chris@87
|
126 that im[0, 10] represents the value at x=0, y=10. To be consistent with usual
|
Chris@87
|
127 Python behavior then im[0] would represent a column at x=0. Yet that data
|
Chris@87
|
128 would be spread over the whole array since the data are stored in row order.
|
Chris@87
|
129 Despite the flexibility of numpy's indexing, it can't really paper over the fact
|
Chris@87
|
130 basic operations are rendered inefficient because of data order or that getting
|
Chris@87
|
131 contiguous subarrays is still awkward (e.g., im[:,0] for the first row, vs
|
Chris@87
|
132 im[0]), thus one can't use an idiom such as for row in im; for col in im does
|
Chris@87
|
133 work, but doesn't yield contiguous column data.
|
Chris@87
|
134
|
Chris@87
|
135 As it turns out, numpy is
|
Chris@87
|
136 smart enough when dealing with ufuncs to determine which index is the most
|
Chris@87
|
137 rapidly varying one in memory and uses that for the innermost loop. Thus for
|
Chris@87
|
138 ufuncs there is no large intrinsic advantage to either approach in most cases.
|
Chris@87
|
139 On the other hand, use of .flat with an FORTRAN ordered array will lead to
|
Chris@87
|
140 non-optimal memory access as adjacent elements in the flattened array (iterator,
|
Chris@87
|
141 actually) are not contiguous in memory.
|
Chris@87
|
142
|
Chris@87
|
143 Indeed, the fact is that Python
|
Chris@87
|
144 indexing on lists and other sequences naturally leads to an outside-to inside
|
Chris@87
|
145 ordering (the first index gets the largest grouping, the next the next largest,
|
Chris@87
|
146 and the last gets the smallest element). Since image data are normally stored
|
Chris@87
|
147 by rows, this corresponds to position within rows being the last item indexed.
|
Chris@87
|
148
|
Chris@87
|
149 If you do want to use Fortran ordering realize that
|
Chris@87
|
150 there are two approaches to consider: 1) accept that the first index is just not
|
Chris@87
|
151 the most rapidly changing in memory and have all your I/O routines reorder
|
Chris@87
|
152 your data when going from memory to disk or visa versa, or use numpy's
|
Chris@87
|
153 mechanism for mapping the first index to the most rapidly varying data. We
|
Chris@87
|
154 recommend the former if possible. The disadvantage of the latter is that many
|
Chris@87
|
155 of numpy's functions will yield arrays without Fortran ordering unless you are
|
Chris@87
|
156 careful to use the 'order' keyword. Doing this would be highly inconvenient.
|
Chris@87
|
157
|
Chris@87
|
158 Otherwise we recommend simply learning to reverse the usual order of indices
|
Chris@87
|
159 when accessing elements of an array. Granted, it goes against the grain, but
|
Chris@87
|
160 it is more in line with Python semantics and the natural order of the data.
|
Chris@87
|
161
|
Chris@87
|
162 """
|
Chris@87
|
163 from __future__ import division, absolute_import, print_function
|