Mercurial > hg > sv-dependency-builds
comparison src/fftw-3.3.5/doc/mpi.texi @ 127:7867fa7e1b6b
Current fftw source
author | Chris Cannam <cannam@all-day-breakfast.com> |
---|---|
date | Tue, 18 Oct 2016 13:40:26 +0100 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
126:4a7071416412 | 127:7867fa7e1b6b |
---|---|
1 @node Distributed-memory FFTW with MPI, Calling FFTW from Modern Fortran, Multi-threaded FFTW, Top | |
2 @chapter Distributed-memory FFTW with MPI | |
3 @cindex MPI | |
4 | |
5 @cindex parallel transform | |
6 In this chapter we document the parallel FFTW routines for parallel | |
7 systems supporting the MPI message-passing interface. Unlike the | |
8 shared-memory threads described in the previous chapter, MPI allows | |
9 you to use @emph{distributed-memory} parallelism, where each CPU has | |
10 its own separate memory, and which can scale up to clusters of many | |
11 thousands of processors. This capability comes at a price, however: | |
12 each process only stores a @emph{portion} of the data to be | |
13 transformed, which means that the data structures and | |
14 programming-interface are quite different from the serial or threads | |
15 versions of FFTW. | |
16 @cindex data distribution | |
17 | |
18 | |
19 Distributed-memory parallelism is especially useful when you are | |
20 transforming arrays so large that they do not fit into the memory of a | |
21 single processor. The storage per-process required by FFTW's MPI | |
22 routines is proportional to the total array size divided by the number | |
23 of processes. Conversely, distributed-memory parallelism can easily | |
24 pose an unacceptably high communications overhead for small problems; | |
25 the threshold problem size for which parallelism becomes advantageous | |
26 will depend on the precise problem you are interested in, your | |
27 hardware, and your MPI implementation. | |
28 | |
29 A note on terminology: in MPI, you divide the data among a set of | |
30 ``processes'' which each run in their own memory address space. | |
31 Generally, each process runs on a different physical processor, but | |
32 this is not required. A set of processes in MPI is described by an | |
33 opaque data structure called a ``communicator,'' the most common of | |
34 which is the predefined communicator @code{MPI_COMM_WORLD} which | |
35 refers to @emph{all} processes. For more information on these and | |
36 other concepts common to all MPI programs, we refer the reader to the | |
37 documentation at @uref{http://www.mcs.anl.gov/research/projects/mpi/, the MPI home | |
38 page}. | |
39 @cindex MPI communicator | |
40 @ctindex MPI_COMM_WORLD | |
41 | |
42 | |
43 We assume in this chapter that the reader is familiar with the usage | |
44 of the serial (uniprocessor) FFTW, and focus only on the concepts new | |
45 to the MPI interface. | |
46 | |
47 @menu | |
48 * FFTW MPI Installation:: | |
49 * Linking and Initializing MPI FFTW:: | |
50 * 2d MPI example:: | |
51 * MPI Data Distribution:: | |
52 * Multi-dimensional MPI DFTs of Real Data:: | |
53 * Other Multi-dimensional Real-data MPI Transforms:: | |
54 * FFTW MPI Transposes:: | |
55 * FFTW MPI Wisdom:: | |
56 * Avoiding MPI Deadlocks:: | |
57 * FFTW MPI Performance Tips:: | |
58 * Combining MPI and Threads:: | |
59 * FFTW MPI Reference:: | |
60 * FFTW MPI Fortran Interface:: | |
61 @end menu | |
62 | |
63 @c ------------------------------------------------------------ | |
64 @node FFTW MPI Installation, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI, Distributed-memory FFTW with MPI | |
65 @section FFTW MPI Installation | |
66 | |
67 All of the FFTW MPI code is located in the @code{mpi} subdirectory of | |
68 the FFTW package. On Unix systems, the FFTW MPI libraries and header | |
69 files are automatically configured, compiled, and installed along with | |
70 the uniprocessor FFTW libraries simply by including | |
71 @code{--enable-mpi} in the flags to the @code{configure} script | |
72 (@pxref{Installation on Unix}). | |
73 @fpindex configure | |
74 | |
75 | |
76 Any implementation of the MPI standard, version 1 or later, should | |
77 work with FFTW. The @code{configure} script will attempt to | |
78 automatically detect how to compile and link code using your MPI | |
79 implementation. In some cases, especially if you have multiple | |
80 different MPI implementations installed or have an unusual MPI | |
81 software package, you may need to provide this information explicitly. | |
82 | |
83 Most commonly, one compiles MPI code by invoking a special compiler | |
84 command, typically @code{mpicc} for C code. The @code{configure} | |
85 script knows the most common names for this command, but you can | |
86 specify the MPI compilation command explicitly by setting the | |
87 @code{MPICC} variable, as in @samp{./configure MPICC=mpicc ...}. | |
88 @fpindex mpicc | |
89 | |
90 | |
91 If, instead of a special compiler command, you need to link a certain | |
92 library, you can specify the link command via the @code{MPILIBS} | |
93 variable, as in @samp{./configure MPILIBS=-lmpi ...}. Note that if | |
94 your MPI library is installed in a non-standard location (one the | |
95 compiler does not know about by default), you may also have to specify | |
96 the location of the library and header files via @code{LDFLAGS} and | |
97 @code{CPPFLAGS} variables, respectively, as in @samp{./configure | |
98 LDFLAGS=-L/path/to/mpi/libs CPPFLAGS=-I/path/to/mpi/include ...}. | |
99 | |
100 @c ------------------------------------------------------------ | |
101 @node Linking and Initializing MPI FFTW, 2d MPI example, FFTW MPI Installation, Distributed-memory FFTW with MPI | |
102 @section Linking and Initializing MPI FFTW | |
103 | |
104 Programs using the MPI FFTW routines should be linked with | |
105 @code{-lfftw3_mpi -lfftw3 -lm} on Unix in double precision, | |
106 @code{-lfftw3f_mpi -lfftw3f -lm} in single precision, and so on | |
107 (@pxref{Precision}). You will also need to link with whatever library | |
108 is responsible for MPI on your system; in most MPI implementations, | |
109 there is a special compiler alias named @code{mpicc} to compile and | |
110 link MPI code. | |
111 @fpindex mpicc | |
112 @cindex linking on Unix | |
113 @cindex precision | |
114 | |
115 | |
116 @findex fftw_init_threads | |
117 Before calling any FFTW routines except possibly | |
118 @code{fftw_init_threads} (@pxref{Combining MPI and Threads}), but after calling | |
119 @code{MPI_Init}, you should call the function: | |
120 | |
121 @example | |
122 void fftw_mpi_init(void); | |
123 @end example | |
124 @findex fftw_mpi_init | |
125 | |
126 If, at the end of your program, you want to get rid of all memory and | |
127 other resources allocated internally by FFTW, for both the serial and | |
128 MPI routines, you can call: | |
129 | |
130 @example | |
131 void fftw_mpi_cleanup(void); | |
132 @end example | |
133 @findex fftw_mpi_cleanup | |
134 | |
135 which is much like the @code{fftw_cleanup()} function except that it | |
136 also gets rid of FFTW's MPI-related data. You must @emph{not} execute | |
137 any previously created plans after calling this function. | |
138 | |
139 @c ------------------------------------------------------------ | |
140 @node 2d MPI example, MPI Data Distribution, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI | |
141 @section 2d MPI example | |
142 | |
143 Before we document the FFTW MPI interface in detail, we begin with a | |
144 simple example outlining how one would perform a two-dimensional | |
145 @code{N0} by @code{N1} complex DFT. | |
146 | |
147 @example | |
148 #include <fftw3-mpi.h> | |
149 | |
150 int main(int argc, char **argv) | |
151 @{ | |
152 const ptrdiff_t N0 = ..., N1 = ...; | |
153 fftw_plan plan; | |
154 fftw_complex *data; | |
155 ptrdiff_t alloc_local, local_n0, local_0_start, i, j; | |
156 | |
157 MPI_Init(&argc, &argv); | |
158 fftw_mpi_init(); | |
159 | |
160 /* @r{get local data size and allocate} */ | |
161 alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD, | |
162 &local_n0, &local_0_start); | |
163 data = fftw_alloc_complex(alloc_local); | |
164 | |
165 /* @r{create plan for in-place forward DFT} */ | |
166 plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD, | |
167 FFTW_FORWARD, FFTW_ESTIMATE); | |
168 | |
169 /* @r{initialize data to some function} my_function(x,y) */ | |
170 for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j) | |
171 data[i*N1 + j] = my_function(local_0_start + i, j); | |
172 | |
173 /* @r{compute transforms, in-place, as many times as desired} */ | |
174 fftw_execute(plan); | |
175 | |
176 fftw_destroy_plan(plan); | |
177 | |
178 MPI_Finalize(); | |
179 @} | |
180 @end example | |
181 | |
182 As can be seen above, the MPI interface follows the same basic style | |
183 of allocate/plan/execute/destroy as the serial FFTW routines. All of | |
184 the MPI-specific routines are prefixed with @samp{fftw_mpi_} instead | |
185 of @samp{fftw_}. There are a few important differences, however: | |
186 | |
187 First, we must call @code{fftw_mpi_init()} after calling | |
188 @code{MPI_Init} (required in all MPI programs) and before calling any | |
189 other @samp{fftw_mpi_} routine. | |
190 @findex MPI_Init | |
191 @findex fftw_mpi_init | |
192 | |
193 | |
194 Second, when we create the plan with @code{fftw_mpi_plan_dft_2d}, | |
195 analogous to @code{fftw_plan_dft_2d}, we pass an additional argument: | |
196 the communicator, indicating which processes will participate in the | |
197 transform (here @code{MPI_COMM_WORLD}, indicating all processes). | |
198 Whenever you create, execute, or destroy a plan for an MPI transform, | |
199 you must call the corresponding FFTW routine on @emph{all} processes | |
200 in the communicator for that transform. (That is, these are | |
201 @emph{collective} calls.) Note that the plan for the MPI transform | |
202 uses the standard @code{fftw_execute} and @code{fftw_destroy} routines | |
203 (on the other hand, there are MPI-specific new-array execute functions | |
204 documented below). | |
205 @cindex collective function | |
206 @findex fftw_mpi_plan_dft_2d | |
207 @ctindex MPI_COMM_WORLD | |
208 | |
209 | |
210 Third, all of the FFTW MPI routines take @code{ptrdiff_t} arguments | |
211 instead of @code{int} as for the serial FFTW. @code{ptrdiff_t} is a | |
212 standard C integer type which is (at least) 32 bits wide on a 32-bit | |
213 machine and 64 bits wide on a 64-bit machine. This is to make it easy | |
214 to specify very large parallel transforms on a 64-bit machine. (You | |
215 can specify 64-bit transform sizes in the serial FFTW, too, but only | |
216 by using the @samp{guru64} planner interface. @xref{64-bit Guru | |
217 Interface}.) | |
218 @tindex ptrdiff_t | |
219 @cindex 64-bit architecture | |
220 | |
221 | |
222 Fourth, and most importantly, you don't allocate the entire | |
223 two-dimensional array on each process. Instead, you call | |
224 @code{fftw_mpi_local_size_2d} to find out what @emph{portion} of the | |
225 array resides on each processor, and how much space to allocate. | |
226 Here, the portion of the array on each process is a @code{local_n0} by | |
227 @code{N1} slice of the total array, starting at index | |
228 @code{local_0_start}. The total number of @code{fftw_complex} numbers | |
229 to allocate is given by the @code{alloc_local} return value, which | |
230 @emph{may} be greater than @code{local_n0 * N1} (in case some | |
231 intermediate calculations require additional storage). The data | |
232 distribution in FFTW's MPI interface is described in more detail by | |
233 the next section. | |
234 @findex fftw_mpi_local_size_2d | |
235 @cindex data distribution | |
236 | |
237 | |
238 Given the portion of the array that resides on the local process, it | |
239 is straightforward to initialize the data (here to a function | |
240 @code{myfunction}) and otherwise manipulate it. Of course, at the end | |
241 of the program you may want to output the data somehow, but | |
242 synchronizing this output is up to you and is beyond the scope of this | |
243 manual. (One good way to output a large multi-dimensional distributed | |
244 array in MPI to a portable binary file is to use the free HDF5 | |
245 library; see the @uref{http://www.hdfgroup.org/, HDF home page}.) | |
246 @cindex HDF5 | |
247 @cindex MPI I/O | |
248 | |
249 @c ------------------------------------------------------------ | |
250 @node MPI Data Distribution, Multi-dimensional MPI DFTs of Real Data, 2d MPI example, Distributed-memory FFTW with MPI | |
251 @section MPI Data Distribution | |
252 @cindex data distribution | |
253 | |
254 The most important concept to understand in using FFTW's MPI interface | |
255 is the data distribution. With a serial or multithreaded FFT, all of | |
256 the inputs and outputs are stored as a single contiguous chunk of | |
257 memory. With a distributed-memory FFT, the inputs and outputs are | |
258 broken into disjoint blocks, one per process. | |
259 | |
260 In particular, FFTW uses a @emph{1d block distribution} of the data, | |
261 distributed along the @emph{first dimension}. For example, if you | |
262 want to perform a @twodims{100,200} complex DFT, distributed over 4 | |
263 processes, each process will get a @twodims{25,200} slice of the data. | |
264 That is, process 0 will get rows 0 through 24, process 1 will get rows | |
265 25 through 49, process 2 will get rows 50 through 74, and process 3 | |
266 will get rows 75 through 99. If you take the same array but | |
267 distribute it over 3 processes, then it is not evenly divisible so the | |
268 different processes will have unequal chunks. FFTW's default choice | |
269 in this case is to assign 34 rows to processes 0 and 1, and 32 rows to | |
270 process 2. | |
271 @cindex block distribution | |
272 | |
273 | |
274 FFTW provides several @samp{fftw_mpi_local_size} routines that you can | |
275 call to find out what portion of an array is stored on the current | |
276 process. In most cases, you should use the default block sizes picked | |
277 by FFTW, but it is also possible to specify your own block size. For | |
278 example, with a @twodims{100,200} array on three processes, you can | |
279 tell FFTW to use a block size of 40, which would assign 40 rows to | |
280 processes 0 and 1, and 20 rows to process 2. FFTW's default is to | |
281 divide the data equally among the processes if possible, and as best | |
282 it can otherwise. The rows are always assigned in ``rank order,'' | |
283 i.e. process 0 gets the first block of rows, then process 1, and so | |
284 on. (You can change this by using @code{MPI_Comm_split} to create a | |
285 new communicator with re-ordered processes.) However, you should | |
286 always call the @samp{fftw_mpi_local_size} routines, if possible, | |
287 rather than trying to predict FFTW's distribution choices. | |
288 | |
289 In particular, it is critical that you allocate the storage size that | |
290 is returned by @samp{fftw_mpi_local_size}, which is @emph{not} | |
291 necessarily the size of the local slice of the array. The reason is | |
292 that intermediate steps of FFTW's algorithms involve transposing the | |
293 array and redistributing the data, so at these intermediate steps FFTW | |
294 may require more local storage space (albeit always proportional to | |
295 the total size divided by the number of processes). The | |
296 @samp{fftw_mpi_local_size} functions know how much storage is required | |
297 for these intermediate steps and tell you the correct amount to | |
298 allocate. | |
299 | |
300 @menu | |
301 * Basic and advanced distribution interfaces:: | |
302 * Load balancing:: | |
303 * Transposed distributions:: | |
304 * One-dimensional distributions:: | |
305 @end menu | |
306 | |
307 @node Basic and advanced distribution interfaces, Load balancing, MPI Data Distribution, MPI Data Distribution | |
308 @subsection Basic and advanced distribution interfaces | |
309 | |
310 As with the planner interface, the @samp{fftw_mpi_local_size} | |
311 distribution interface is broken into basic and advanced | |
312 (@samp{_many}) interfaces, where the latter allows you to specify the | |
313 block size manually and also to request block sizes when computing | |
314 multiple transforms simultaneously. These functions are documented | |
315 more exhaustively by the FFTW MPI Reference, but we summarize the | |
316 basic ideas here using a couple of two-dimensional examples. | |
317 | |
318 For the @twodims{100,200} complex-DFT example, above, we would find | |
319 the distribution by calling the following function in the basic | |
320 interface: | |
321 | |
322 @example | |
323 ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, | |
324 ptrdiff_t *local_n0, ptrdiff_t *local_0_start); | |
325 @end example | |
326 @findex fftw_mpi_local_size_2d | |
327 | |
328 Given the total size of the data to be transformed (here, @code{n0 = | |
329 100} and @code{n1 = 200}) and an MPI communicator (@code{comm}), this | |
330 function provides three numbers. | |
331 | |
332 First, it describes the shape of the local data: the current process | |
333 should store a @code{local_n0} by @code{n1} slice of the overall | |
334 dataset, in row-major order (@code{n1} dimension contiguous), starting | |
335 at index @code{local_0_start}. That is, if the total dataset is | |
336 viewed as a @code{n0} by @code{n1} matrix, the current process should | |
337 store the rows @code{local_0_start} to | |
338 @code{local_0_start+local_n0-1}. Obviously, if you are running with | |
339 only a single MPI process, that process will store the entire array: | |
340 @code{local_0_start} will be zero and @code{local_n0} will be | |
341 @code{n0}. @xref{Row-major Format}. | |
342 @cindex row-major | |
343 | |
344 | |
345 Second, the return value is the total number of data elements (e.g., | |
346 complex numbers for a complex DFT) that should be allocated for the | |
347 input and output arrays on the current process (ideally with | |
348 @code{fftw_malloc} or an @samp{fftw_alloc} function, to ensure optimal | |
349 alignment). It might seem that this should always be equal to | |
350 @code{local_n0 * n1}, but this is @emph{not} the case. FFTW's | |
351 distributed FFT algorithms require data redistributions at | |
352 intermediate stages of the transform, and in some circumstances this | |
353 may require slightly larger local storage. This is discussed in more | |
354 detail below, under @ref{Load balancing}. | |
355 @findex fftw_malloc | |
356 @findex fftw_alloc_complex | |
357 | |
358 | |
359 @cindex advanced interface | |
360 The advanced-interface @samp{local_size} function for multidimensional | |
361 transforms returns the same three things (@code{local_n0}, | |
362 @code{local_0_start}, and the total number of elements to allocate), | |
363 but takes more inputs: | |
364 | |
365 @example | |
366 ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, | |
367 ptrdiff_t howmany, | |
368 ptrdiff_t block0, | |
369 MPI_Comm comm, | |
370 ptrdiff_t *local_n0, | |
371 ptrdiff_t *local_0_start); | |
372 @end example | |
373 @findex fftw_mpi_local_size_many | |
374 | |
375 The two-dimensional case above corresponds to @code{rnk = 2} and an | |
376 array @code{n} of length 2 with @code{n[0] = n0} and @code{n[1] = n1}. | |
377 This routine is for any @code{rnk > 1}; one-dimensional transforms | |
378 have their own interface because they work slightly differently, as | |
379 discussed below. | |
380 | |
381 First, the advanced interface allows you to perform multiple | |
382 transforms at once, of interleaved data, as specified by the | |
383 @code{howmany} parameter. (@code{hoamany} is 1 for a single | |
384 transform.) | |
385 | |
386 Second, here you can specify your desired block size in the @code{n0} | |
387 dimension, @code{block0}. To use FFTW's default block size, pass | |
388 @code{FFTW_MPI_DEFAULT_BLOCK} (0) for @code{block0}. Otherwise, on | |
389 @code{P} processes, FFTW will return @code{local_n0} equal to | |
390 @code{block0} on the first @code{P / block0} processes (rounded down), | |
391 return @code{local_n0} equal to @code{n0 - block0 * (P / block0)} on | |
392 the next process, and @code{local_n0} equal to zero on any remaining | |
393 processes. In general, we recommend using the default block size | |
394 (which corresponds to @code{n0 / P}, rounded up). | |
395 @ctindex FFTW_MPI_DEFAULT_BLOCK | |
396 @cindex block distribution | |
397 | |
398 | |
399 For example, suppose you have @code{P = 4} processes and @code{n0 = | |
400 21}. The default will be a block size of @code{6}, which will give | |
401 @code{local_n0 = 6} on the first three processes and @code{local_n0 = | |
402 3} on the last process. Instead, however, you could specify | |
403 @code{block0 = 5} if you wanted, which would give @code{local_n0 = 5} | |
404 on processes 0 to 2, @code{local_n0 = 6} on process 3. (This choice, | |
405 while it may look superficially more ``balanced,'' has the same | |
406 critical path as FFTW's default but requires more communications.) | |
407 | |
408 @node Load balancing, Transposed distributions, Basic and advanced distribution interfaces, MPI Data Distribution | |
409 @subsection Load balancing | |
410 @cindex load balancing | |
411 | |
412 Ideally, when you parallelize a transform over some @math{P} | |
413 processes, each process should end up with work that takes equal time. | |
414 Otherwise, all of the processes end up waiting on whichever process is | |
415 slowest. This goal is known as ``load balancing.'' In this section, | |
416 we describe the circumstances under which FFTW is able to load-balance | |
417 well, and in particular how you should choose your transform size in | |
418 order to load balance. | |
419 | |
420 Load balancing is especially difficult when you are parallelizing over | |
421 heterogeneous machines; for example, if one of your processors is a | |
422 old 486 and another is a Pentium IV, obviously you should give the | |
423 Pentium more work to do than the 486 since the latter is much slower. | |
424 FFTW does not deal with this problem, however---it assumes that your | |
425 processes run on hardware of comparable speed, and that the goal is | |
426 therefore to divide the problem as equally as possible. | |
427 | |
428 For a multi-dimensional complex DFT, FFTW can divide the problem | |
429 equally among the processes if: (i) the @emph{first} dimension | |
430 @code{n0} is divisible by @math{P}; and (ii), the @emph{product} of | |
431 the subsequent dimensions is divisible by @math{P}. (For the advanced | |
432 interface, where you can specify multiple simultaneous transforms via | |
433 some ``vector'' length @code{howmany}, a factor of @code{howmany} is | |
434 included in the product of the subsequent dimensions.) | |
435 | |
436 For a one-dimensional complex DFT, the length @code{N} of the data | |
437 should be divisible by @math{P} @emph{squared} to be able to divide | |
438 the problem equally among the processes. | |
439 | |
440 @node Transposed distributions, One-dimensional distributions, Load balancing, MPI Data Distribution | |
441 @subsection Transposed distributions | |
442 | |
443 Internally, FFTW's MPI transform algorithms work by first computing | |
444 transforms of the data local to each process, then by globally | |
445 @emph{transposing} the data in some fashion to redistribute the data | |
446 among the processes, transforming the new data local to each process, | |
447 and transposing back. For example, a two-dimensional @code{n0} by | |
448 @code{n1} array, distributed across the @code{n0} dimension, is | |
449 transformd by: (i) transforming the @code{n1} dimension, which are | |
450 local to each process; (ii) transposing to an @code{n1} by @code{n0} | |
451 array, distributed across the @code{n1} dimension; (iii) transforming | |
452 the @code{n0} dimension, which is now local to each process; (iv) | |
453 transposing back. | |
454 @cindex transpose | |
455 | |
456 | |
457 However, in many applications it is acceptable to compute a | |
458 multidimensional DFT whose results are produced in transposed order | |
459 (e.g., @code{n1} by @code{n0} in two dimensions). This provides a | |
460 significant performance advantage, because it means that the final | |
461 transposition step can be omitted. FFTW supports this optimization, | |
462 which you specify by passing the flag @code{FFTW_MPI_TRANSPOSED_OUT} | |
463 to the planner routines. To compute the inverse transform of | |
464 transposed output, you specify @code{FFTW_MPI_TRANSPOSED_IN} to tell | |
465 it that the input is transposed. In this section, we explain how to | |
466 interpret the output format of such a transform. | |
467 @ctindex FFTW_MPI_TRANSPOSED_OUT | |
468 @ctindex FFTW_MPI_TRANSPOSED_IN | |
469 | |
470 | |
471 Suppose you have are transforming multi-dimensional data with (at | |
472 least two) dimensions @ndims{}. As always, it is distributed along | |
473 the first dimension @dimk{0}. Now, if we compute its DFT with the | |
474 @code{FFTW_MPI_TRANSPOSED_OUT} flag, the resulting output data are stored | |
475 with the first @emph{two} dimensions transposed: @ndimstrans{}, | |
476 distributed along the @dimk{1} dimension. Conversely, if we take the | |
477 @ndimstrans{} data and transform it with the | |
478 @code{FFTW_MPI_TRANSPOSED_IN} flag, then the format goes back to the | |
479 original @ndims{} array. | |
480 | |
481 There are two ways to find the portion of the transposed array that | |
482 resides on the current process. First, you can simply call the | |
483 appropriate @samp{local_size} function, passing @ndimstrans{} (the | |
484 transposed dimensions). This would mean calling the @samp{local_size} | |
485 function twice, once for the transposed and once for the | |
486 non-transposed dimensions. Alternatively, you can call one of the | |
487 @samp{local_size_transposed} functions, which returns both the | |
488 non-transposed and transposed data distribution from a single call. | |
489 For example, for a 3d transform with transposed output (or input), you | |
490 might call: | |
491 | |
492 @example | |
493 ptrdiff_t fftw_mpi_local_size_3d_transposed( | |
494 ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm, | |
495 ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | |
496 ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | |
497 @end example | |
498 @findex fftw_mpi_local_size_3d_transposed | |
499 | |
500 Here, @code{local_n0} and @code{local_0_start} give the size and | |
501 starting index of the @code{n0} dimension for the | |
502 @emph{non}-transposed data, as in the previous sections. For | |
503 @emph{transposed} data (e.g. the output for | |
504 @code{FFTW_MPI_TRANSPOSED_OUT}), @code{local_n1} and | |
505 @code{local_1_start} give the size and starting index of the @code{n1} | |
506 dimension, which is the first dimension of the transposed data | |
507 (@code{n1} by @code{n0} by @code{n2}). | |
508 | |
509 (Note that @code{FFTW_MPI_TRANSPOSED_IN} is completely equivalent to | |
510 performing @code{FFTW_MPI_TRANSPOSED_OUT} and passing the first two | |
511 dimensions to the planner in reverse order, or vice versa. If you | |
512 pass @emph{both} the @code{FFTW_MPI_TRANSPOSED_IN} and | |
513 @code{FFTW_MPI_TRANSPOSED_OUT} flags, it is equivalent to swapping the | |
514 first two dimensions passed to the planner and passing @emph{neither} | |
515 flag.) | |
516 | |
517 @node One-dimensional distributions, , Transposed distributions, MPI Data Distribution | |
518 @subsection One-dimensional distributions | |
519 | |
520 For one-dimensional distributed DFTs using FFTW, matters are slightly | |
521 more complicated because the data distribution is more closely tied to | |
522 how the algorithm works. In particular, you can no longer pass an | |
523 arbitrary block size and must accept FFTW's default; also, the block | |
524 sizes may be different for input and output. Also, the data | |
525 distribution depends on the flags and transform direction, in order | |
526 for forward and backward transforms to work correctly. | |
527 | |
528 @example | |
529 ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm, | |
530 int sign, unsigned flags, | |
531 ptrdiff_t *local_ni, ptrdiff_t *local_i_start, | |
532 ptrdiff_t *local_no, ptrdiff_t *local_o_start); | |
533 @end example | |
534 @findex fftw_mpi_local_size_1d | |
535 | |
536 This function computes the data distribution for a 1d transform of | |
537 size @code{n0} with the given transform @code{sign} and @code{flags}. | |
538 Both input and output data use block distributions. The input on the | |
539 current process will consist of @code{local_ni} numbers starting at | |
540 index @code{local_i_start}; e.g. if only a single process is used, | |
541 then @code{local_ni} will be @code{n0} and @code{local_i_start} will | |
542 be @code{0}. Similarly for the output, with @code{local_no} numbers | |
543 starting at index @code{local_o_start}. The return value of | |
544 @code{fftw_mpi_local_size_1d} will be the total number of elements to | |
545 allocate on the current process (which might be slightly larger than | |
546 the local size due to intermediate steps in the algorithm). | |
547 | |
548 As mentioned above (@pxref{Load balancing}), the data will be divided | |
549 equally among the processes if @code{n0} is divisible by the | |
550 @emph{square} of the number of processes. In this case, | |
551 @code{local_ni} will equal @code{local_no}. Otherwise, they may be | |
552 different. | |
553 | |
554 For some applications, such as convolutions, the order of the output | |
555 data is irrelevant. In this case, performance can be improved by | |
556 specifying that the output data be stored in an FFTW-defined | |
557 ``scrambled'' format. (In particular, this is the analogue of | |
558 transposed output in the multidimensional case: scrambled output saves | |
559 a communications step.) If you pass @code{FFTW_MPI_SCRAMBLED_OUT} in | |
560 the flags, then the output is stored in this (undocumented) scrambled | |
561 order. Conversely, to perform the inverse transform of data in | |
562 scrambled order, pass the @code{FFTW_MPI_SCRAMBLED_IN} flag. | |
563 @ctindex FFTW_MPI_SCRAMBLED_OUT | |
564 @ctindex FFTW_MPI_SCRAMBLED_IN | |
565 | |
566 | |
567 In MPI FFTW, only composite sizes @code{n0} can be parallelized; we | |
568 have not yet implemented a parallel algorithm for large prime sizes. | |
569 | |
570 @c ------------------------------------------------------------ | |
571 @node Multi-dimensional MPI DFTs of Real Data, Other Multi-dimensional Real-data MPI Transforms, MPI Data Distribution, Distributed-memory FFTW with MPI | |
572 @section Multi-dimensional MPI DFTs of Real Data | |
573 | |
574 FFTW's MPI interface also supports multi-dimensional DFTs of real | |
575 data, similar to the serial r2c and c2r interfaces. (Parallel | |
576 one-dimensional real-data DFTs are not currently supported; you must | |
577 use a complex transform and set the imaginary parts of the inputs to | |
578 zero.) | |
579 | |
580 The key points to understand for r2c and c2r MPI transforms (compared | |
581 to the MPI complex DFTs or the serial r2c/c2r transforms), are: | |
582 | |
583 @itemize @bullet | |
584 | |
585 @item | |
586 Just as for serial transforms, r2c/c2r DFTs transform @ndims{} real | |
587 data to/from @ndimshalf{} complex data: the last dimension of the | |
588 complex data is cut in half (rounded down), plus one. As for the | |
589 serial transforms, the sizes you pass to the @samp{plan_dft_r2c} and | |
590 @samp{plan_dft_c2r} are the @ndims{} dimensions of the real data. | |
591 | |
592 @item | |
593 @cindex padding | |
594 Although the real data is @emph{conceptually} @ndims{}, it is | |
595 @emph{physically} stored as an @ndimspad{} array, where the last | |
596 dimension has been @emph{padded} to make it the same size as the | |
597 complex output. This is much like the in-place serial r2c/c2r | |
598 interface (@pxref{Multi-Dimensional DFTs of Real Data}), except that | |
599 in MPI the padding is required even for out-of-place data. The extra | |
600 padding numbers are ignored by FFTW (they are @emph{not} like | |
601 zero-padding the transform to a larger size); they are only used to | |
602 determine the data layout. | |
603 | |
604 @item | |
605 @cindex data distribution | |
606 The data distribution in MPI for @emph{both} the real and complex data | |
607 is determined by the shape of the @emph{complex} data. That is, you | |
608 call the appropriate @samp{local size} function for the @ndimshalf{} | |
609 complex data, and then use the @emph{same} distribution for the real | |
610 data except that the last complex dimension is replaced by a (padded) | |
611 real dimension of twice the length. | |
612 | |
613 @end itemize | |
614 | |
615 For example suppose we are performing an out-of-place r2c transform of | |
616 @threedims{L,M,N} real data [padded to @threedims{L,M,2(N/2+1)}], | |
617 resulting in @threedims{L,M,N/2+1} complex data. Similar to the | |
618 example in @ref{2d MPI example}, we might do something like: | |
619 | |
620 @example | |
621 #include <fftw3-mpi.h> | |
622 | |
623 int main(int argc, char **argv) | |
624 @{ | |
625 const ptrdiff_t L = ..., M = ..., N = ...; | |
626 fftw_plan plan; | |
627 double *rin; | |
628 fftw_complex *cout; | |
629 ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k; | |
630 | |
631 MPI_Init(&argc, &argv); | |
632 fftw_mpi_init(); | |
633 | |
634 /* @r{get local data size and allocate} */ | |
635 alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD, | |
636 &local_n0, &local_0_start); | |
637 rin = fftw_alloc_real(2 * alloc_local); | |
638 cout = fftw_alloc_complex(alloc_local); | |
639 | |
640 /* @r{create plan for out-of-place r2c DFT} */ | |
641 plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD, | |
642 FFTW_MEASURE); | |
643 | |
644 /* @r{initialize rin to some function} my_func(x,y,z) */ | |
645 for (i = 0; i < local_n0; ++i) | |
646 for (j = 0; j < M; ++j) | |
647 for (k = 0; k < N; ++k) | |
648 rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k); | |
649 | |
650 /* @r{compute transforms as many times as desired} */ | |
651 fftw_execute(plan); | |
652 | |
653 fftw_destroy_plan(plan); | |
654 | |
655 MPI_Finalize(); | |
656 @} | |
657 @end example | |
658 | |
659 @findex fftw_alloc_real | |
660 @cindex row-major | |
661 Note that we allocated @code{rin} using @code{fftw_alloc_real} with an | |
662 argument of @code{2 * alloc_local}: since @code{alloc_local} is the | |
663 number of @emph{complex} values to allocate, the number of @emph{real} | |
664 values is twice as many. The @code{rin} array is then | |
665 @threedims{local_n0,M,2(N/2+1)} in row-major order, so its | |
666 @code{(i,j,k)} element is at the index @code{(i*M + j) * (2*(N/2+1)) + | |
667 k} (@pxref{Multi-dimensional Array Format }). | |
668 | |
669 @cindex transpose | |
670 @ctindex FFTW_TRANSPOSED_OUT | |
671 @ctindex FFTW_TRANSPOSED_IN | |
672 As for the complex transforms, improved performance can be obtained by | |
673 specifying that the output is the transpose of the input or vice versa | |
674 (@pxref{Transposed distributions}). In our @threedims{L,M,N} r2c | |
675 example, including @code{FFTW_TRANSPOSED_OUT} in the flags means that | |
676 the input would be a padded @threedims{L,M,2(N/2+1)} real array | |
677 distributed over the @code{L} dimension, while the output would be a | |
678 @threedims{M,L,N/2+1} complex array distributed over the @code{M} | |
679 dimension. To perform the inverse c2r transform with the same data | |
680 distributions, you would use the @code{FFTW_TRANSPOSED_IN} flag. | |
681 | |
682 @c ------------------------------------------------------------ | |
683 @node Other Multi-dimensional Real-data MPI Transforms, FFTW MPI Transposes, Multi-dimensional MPI DFTs of Real Data, Distributed-memory FFTW with MPI | |
684 @section Other multi-dimensional Real-Data MPI Transforms | |
685 | |
686 @cindex r2r | |
687 FFTW's MPI interface also supports multi-dimensional @samp{r2r} | |
688 transforms of all kinds supported by the serial interface | |
689 (e.g. discrete cosine and sine transforms, discrete Hartley | |
690 transforms, etc.). Only multi-dimensional @samp{r2r} transforms, not | |
691 one-dimensional transforms, are currently parallelized. | |
692 | |
693 @tindex fftw_r2r_kind | |
694 These are used much like the multidimensional complex DFTs discussed | |
695 above, except that the data is real rather than complex, and one needs | |
696 to pass an r2r transform kind (@code{fftw_r2r_kind}) for each | |
697 dimension as in the serial FFTW (@pxref{More DFTs of Real Data}). | |
698 | |
699 For example, one might perform a two-dimensional @twodims{L,M} that is | |
700 an REDFT10 (DCT-II) in the first dimension and an RODFT10 (DST-II) in | |
701 the second dimension with code like: | |
702 | |
703 @example | |
704 const ptrdiff_t L = ..., M = ...; | |
705 fftw_plan plan; | |
706 double *data; | |
707 ptrdiff_t alloc_local, local_n0, local_0_start, i, j; | |
708 | |
709 /* @r{get local data size and allocate} */ | |
710 alloc_local = fftw_mpi_local_size_2d(L, M, MPI_COMM_WORLD, | |
711 &local_n0, &local_0_start); | |
712 data = fftw_alloc_real(alloc_local); | |
713 | |
714 /* @r{create plan for in-place REDFT10 x RODFT10} */ | |
715 plan = fftw_mpi_plan_r2r_2d(L, M, data, data, MPI_COMM_WORLD, | |
716 FFTW_REDFT10, FFTW_RODFT10, FFTW_MEASURE); | |
717 | |
718 /* @r{initialize data to some function} my_function(x,y) */ | |
719 for (i = 0; i < local_n0; ++i) for (j = 0; j < M; ++j) | |
720 data[i*M + j] = my_function(local_0_start + i, j); | |
721 | |
722 /* @r{compute transforms, in-place, as many times as desired} */ | |
723 fftw_execute(plan); | |
724 | |
725 fftw_destroy_plan(plan); | |
726 @end example | |
727 | |
728 @findex fftw_alloc_real | |
729 Notice that we use the same @samp{local_size} functions as we did for | |
730 complex data, only now we interpret the sizes in terms of real rather | |
731 than complex values, and correspondingly use @code{fftw_alloc_real}. | |
732 | |
733 @c ------------------------------------------------------------ | |
734 @node FFTW MPI Transposes, FFTW MPI Wisdom, Other Multi-dimensional Real-data MPI Transforms, Distributed-memory FFTW with MPI | |
735 @section FFTW MPI Transposes | |
736 @cindex transpose | |
737 | |
738 The FFTW's MPI Fourier transforms rely on one or more @emph{global | |
739 transposition} step for their communications. For example, the | |
740 multidimensional transforms work by transforming along some | |
741 dimensions, then transposing to make the first dimension local and | |
742 transforming that, then transposing back. Because global | |
743 transposition of a block-distributed matrix has many other potential | |
744 uses besides FFTs, FFTW's transpose routines can be called directly, | |
745 as documented in this section. | |
746 | |
747 @menu | |
748 * Basic distributed-transpose interface:: | |
749 * Advanced distributed-transpose interface:: | |
750 * An improved replacement for MPI_Alltoall:: | |
751 @end menu | |
752 | |
753 @node Basic distributed-transpose interface, Advanced distributed-transpose interface, FFTW MPI Transposes, FFTW MPI Transposes | |
754 @subsection Basic distributed-transpose interface | |
755 | |
756 In particular, suppose that we have an @code{n0} by @code{n1} array in | |
757 row-major order, block-distributed across the @code{n0} dimension. To | |
758 transpose this into an @code{n1} by @code{n0} array block-distributed | |
759 across the @code{n1} dimension, we would create a plan by calling the | |
760 following function: | |
761 | |
762 @example | |
763 fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1, | |
764 double *in, double *out, | |
765 MPI_Comm comm, unsigned flags); | |
766 @end example | |
767 @findex fftw_mpi_plan_transpose | |
768 | |
769 The input and output arrays (@code{in} and @code{out}) can be the | |
770 same. The transpose is actually executed by calling | |
771 @code{fftw_execute} on the plan, as usual. | |
772 @findex fftw_execute | |
773 | |
774 | |
775 The @code{flags} are the usual FFTW planner flags, but support | |
776 two additional flags: @code{FFTW_MPI_TRANSPOSED_OUT} and/or | |
777 @code{FFTW_MPI_TRANSPOSED_IN}. What these flags indicate, for | |
778 transpose plans, is that the output and/or input, respectively, are | |
779 @emph{locally} transposed. That is, on each process input data is | |
780 normally stored as a @code{local_n0} by @code{n1} array in row-major | |
781 order, but for an @code{FFTW_MPI_TRANSPOSED_IN} plan the input data is | |
782 stored as @code{n1} by @code{local_n0} in row-major order. Similarly, | |
783 @code{FFTW_MPI_TRANSPOSED_OUT} means that the output is @code{n0} by | |
784 @code{local_n1} instead of @code{local_n1} by @code{n0}. | |
785 @ctindex FFTW_MPI_TRANSPOSED_OUT | |
786 @ctindex FFTW_MPI_TRANSPOSED_IN | |
787 | |
788 | |
789 To determine the local size of the array on each process before and | |
790 after the transpose, as well as the amount of storage that must be | |
791 allocated, one should call @code{fftw_mpi_local_size_2d_transposed}, | |
792 just as for a 2d DFT as described in the previous section: | |
793 @cindex data distribution | |
794 | |
795 @example | |
796 ptrdiff_t fftw_mpi_local_size_2d_transposed | |
797 (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, | |
798 ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | |
799 ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | |
800 @end example | |
801 @findex fftw_mpi_local_size_2d_transposed | |
802 | |
803 Again, the return value is the local storage to allocate, which in | |
804 this case is the number of @emph{real} (@code{double}) values rather | |
805 than complex numbers as in the previous examples. | |
806 | |
807 @node Advanced distributed-transpose interface, An improved replacement for MPI_Alltoall, Basic distributed-transpose interface, FFTW MPI Transposes | |
808 @subsection Advanced distributed-transpose interface | |
809 | |
810 The above routines are for a transpose of a matrix of numbers (of type | |
811 @code{double}), using FFTW's default block sizes. More generally, one | |
812 can perform transposes of @emph{tuples} of numbers, with | |
813 user-specified block sizes for the input and output: | |
814 | |
815 @example | |
816 fftw_plan fftw_mpi_plan_many_transpose | |
817 (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany, | |
818 ptrdiff_t block0, ptrdiff_t block1, | |
819 double *in, double *out, MPI_Comm comm, unsigned flags); | |
820 @end example | |
821 @findex fftw_mpi_plan_many_transpose | |
822 | |
823 In this case, one is transposing an @code{n0} by @code{n1} matrix of | |
824 @code{howmany}-tuples (e.g. @code{howmany = 2} for complex numbers). | |
825 The input is distributed along the @code{n0} dimension with block size | |
826 @code{block0}, and the @code{n1} by @code{n0} output is distributed | |
827 along the @code{n1} dimension with block size @code{block1}. If | |
828 @code{FFTW_MPI_DEFAULT_BLOCK} (0) is passed for a block size then FFTW | |
829 uses its default block size. To get the local size of the data on | |
830 each process, you should then call @code{fftw_mpi_local_size_many_transposed}. | |
831 @ctindex FFTW_MPI_DEFAULT_BLOCK | |
832 @findex fftw_mpi_local_size_many_transposed | |
833 | |
834 @node An improved replacement for MPI_Alltoall, , Advanced distributed-transpose interface, FFTW MPI Transposes | |
835 @subsection An improved replacement for MPI_Alltoall | |
836 | |
837 We close this section by noting that FFTW's MPI transpose routines can | |
838 be thought of as a generalization for the @code{MPI_Alltoall} function | |
839 (albeit only for floating-point types), and in some circumstances can | |
840 function as an improved replacement. | |
841 @findex MPI_Alltoall | |
842 | |
843 | |
844 @code{MPI_Alltoall} is defined by the MPI standard as: | |
845 | |
846 @example | |
847 int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype, | |
848 void *recvbuf, int recvcnt, MPI_Datatype recvtype, | |
849 MPI_Comm comm); | |
850 @end example | |
851 | |
852 In particular, for @code{double*} arrays @code{in} and @code{out}, | |
853 consider the call: | |
854 | |
855 @example | |
856 MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany MPI_DOUBLE, comm); | |
857 @end example | |
858 | |
859 This is completely equivalent to: | |
860 | |
861 @example | |
862 MPI_Comm_size(comm, &P); | |
863 plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE); | |
864 fftw_execute(plan); | |
865 fftw_destroy_plan(plan); | |
866 @end example | |
867 | |
868 That is, computing a @twodims{P,P} transpose on @code{P} processes, | |
869 with a block size of 1, is just a standard all-to-all communication. | |
870 | |
871 However, using the FFTW routine instead of @code{MPI_Alltoall} may | |
872 have certain advantages. First of all, FFTW's routine can operate | |
873 in-place (@code{in == out}) whereas @code{MPI_Alltoall} can only | |
874 operate out-of-place. | |
875 @cindex in-place | |
876 | |
877 | |
878 Second, even for out-of-place plans, FFTW's routine may be faster, | |
879 especially if you need to perform the all-to-all communication many | |
880 times and can afford to use @code{FFTW_MEASURE} or | |
881 @code{FFTW_PATIENT}. It should certainly be no slower, not including | |
882 the time to create the plan, since one of the possible algorithms that | |
883 FFTW uses for an out-of-place transpose @emph{is} simply to call | |
884 @code{MPI_Alltoall}. However, FFTW also considers several other | |
885 possible algorithms that, depending on your MPI implementation and | |
886 your hardware, may be faster. | |
887 @ctindex FFTW_MEASURE | |
888 @ctindex FFTW_PATIENT | |
889 | |
890 @c ------------------------------------------------------------ | |
891 @node FFTW MPI Wisdom, Avoiding MPI Deadlocks, FFTW MPI Transposes, Distributed-memory FFTW with MPI | |
892 @section FFTW MPI Wisdom | |
893 @cindex wisdom | |
894 @cindex saving plans to disk | |
895 | |
896 FFTW's ``wisdom'' facility (@pxref{Words of Wisdom-Saving Plans}) can | |
897 be used to save MPI plans as well as to save uniprocessor plans. | |
898 However, for MPI there are several unavoidable complications. | |
899 | |
900 @cindex MPI I/O | |
901 First, the MPI standard does not guarantee that every process can | |
902 perform file I/O (at least, not using C stdio routines)---in general, | |
903 we may only assume that process 0 is capable of I/O.@footnote{In fact, | |
904 even this assumption is not technically guaranteed by the standard, | |
905 although it seems to be universal in actual MPI implementations and is | |
906 widely assumed by MPI-using software. Technically, you need to query | |
907 the @code{MPI_IO} attribute of @code{MPI_COMM_WORLD} with | |
908 @code{MPI_Attr_get}. If this attribute is @code{MPI_PROC_NULL}, no | |
909 I/O is possible. If it is @code{MPI_ANY_SOURCE}, any process can | |
910 perform I/O. Otherwise, it is the rank of a process that can perform | |
911 I/O ... but since it is not guaranteed to yield the @emph{same} rank | |
912 on all processes, you have to do an @code{MPI_Allreduce} of some kind | |
913 if you want all processes to agree about which is going to do I/O. | |
914 And even then, the standard only guarantees that this process can | |
915 perform output, but not input. See e.g. @cite{Parallel Programming | |
916 with MPI} by P. S. Pacheco, section 8.1.3. Needless to say, in our | |
917 experience virtually no MPI programmers worry about this.} So, if we | |
918 want to export the wisdom from a single process to a file, we must | |
919 first export the wisdom to a string, then send it to process 0, then | |
920 write it to a file. | |
921 | |
922 Second, in principle we may want to have separate wisdom for every | |
923 process, since in general the processes may run on different hardware | |
924 even for a single MPI program. However, in practice FFTW's MPI code | |
925 is designed for the case of homogeneous hardware (@pxref{Load | |
926 balancing}), and in this case it is convenient to use the same wisdom | |
927 for every process. Thus, we need a mechanism to synchronize the wisdom. | |
928 | |
929 To address both of these problems, FFTW provides the following two | |
930 functions: | |
931 | |
932 @example | |
933 void fftw_mpi_broadcast_wisdom(MPI_Comm comm); | |
934 void fftw_mpi_gather_wisdom(MPI_Comm comm); | |
935 @end example | |
936 @findex fftw_mpi_gather_wisdom | |
937 @findex fftw_mpi_broadcast_wisdom | |
938 | |
939 Given a communicator @code{comm}, @code{fftw_mpi_broadcast_wisdom} | |
940 will broadcast the wisdom from process 0 to all other processes. | |
941 Conversely, @code{fftw_mpi_gather_wisdom} will collect wisdom from all | |
942 processes onto process 0. (If the plans created for the same problem | |
943 by different processes are not the same, @code{fftw_mpi_gather_wisdom} | |
944 will arbitrarily choose one of the plans.) Both of these functions | |
945 may result in suboptimal plans for different processes if the | |
946 processes are running on non-identical hardware. Both of these | |
947 functions are @emph{collective} calls, which means that they must be | |
948 executed by all processes in the communicator. | |
949 @cindex collective function | |
950 | |
951 | |
952 So, for example, a typical code snippet to import wisdom from a file | |
953 and use it on all processes would be: | |
954 | |
955 @example | |
956 @{ | |
957 int rank; | |
958 | |
959 fftw_mpi_init(); | |
960 MPI_Comm_rank(MPI_COMM_WORLD, &rank); | |
961 if (rank == 0) fftw_import_wisdom_from_filename("mywisdom"); | |
962 fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD); | |
963 @} | |
964 @end example | |
965 | |
966 (Note that we must call @code{fftw_mpi_init} before importing any | |
967 wisdom that might contain MPI plans.) Similarly, a typical code | |
968 snippet to export wisdom from all processes to a file is: | |
969 @findex fftw_mpi_init | |
970 | |
971 @example | |
972 @{ | |
973 int rank; | |
974 | |
975 fftw_mpi_gather_wisdom(MPI_COMM_WORLD); | |
976 MPI_Comm_rank(MPI_COMM_WORLD, &rank); | |
977 if (rank == 0) fftw_export_wisdom_to_filename("mywisdom"); | |
978 @} | |
979 @end example | |
980 | |
981 @c ------------------------------------------------------------ | |
982 @node Avoiding MPI Deadlocks, FFTW MPI Performance Tips, FFTW MPI Wisdom, Distributed-memory FFTW with MPI | |
983 @section Avoiding MPI Deadlocks | |
984 @cindex deadlock | |
985 | |
986 An MPI program can @emph{deadlock} if one process is waiting for a | |
987 message from another process that never gets sent. To avoid deadlocks | |
988 when using FFTW's MPI routines, it is important to know which | |
989 functions are @emph{collective}: that is, which functions must | |
990 @emph{always} be called in the @emph{same order} from @emph{every} | |
991 process in a given communicator. (For example, @code{MPI_Barrier} is | |
992 the canonical example of a collective function in the MPI standard.) | |
993 @cindex collective function | |
994 @findex MPI_Barrier | |
995 | |
996 | |
997 The functions in FFTW that are @emph{always} collective are: every | |
998 function beginning with @samp{fftw_mpi_plan}, as well as | |
999 @code{fftw_mpi_broadcast_wisdom} and @code{fftw_mpi_gather_wisdom}. | |
1000 Also, the following functions from the ordinary FFTW interface are | |
1001 collective when they are applied to a plan created by an | |
1002 @samp{fftw_mpi_plan} function: @code{fftw_execute}, | |
1003 @code{fftw_destroy_plan}, and @code{fftw_flops}. | |
1004 @findex fftw_execute | |
1005 @findex fftw_destroy_plan | |
1006 @findex fftw_flops | |
1007 | |
1008 @c ------------------------------------------------------------ | |
1009 @node FFTW MPI Performance Tips, Combining MPI and Threads, Avoiding MPI Deadlocks, Distributed-memory FFTW with MPI | |
1010 @section FFTW MPI Performance Tips | |
1011 | |
1012 In this section, we collect a few tips on getting the best performance | |
1013 out of FFTW's MPI transforms. | |
1014 | |
1015 First, because of the 1d block distribution, FFTW's parallelization is | |
1016 currently limited by the size of the first dimension. | |
1017 (Multidimensional block distributions may be supported by a future | |
1018 version.) More generally, you should ideally arrange the dimensions so | |
1019 that FFTW can divide them equally among the processes. @xref{Load | |
1020 balancing}. | |
1021 @cindex block distribution | |
1022 @cindex load balancing | |
1023 | |
1024 | |
1025 Second, if it is not too inconvenient, you should consider working | |
1026 with transposed output for multidimensional plans, as this saves a | |
1027 considerable amount of communications. @xref{Transposed distributions}. | |
1028 @cindex transpose | |
1029 | |
1030 | |
1031 Third, the fastest choices are generally either an in-place transform | |
1032 or an out-of-place transform with the @code{FFTW_DESTROY_INPUT} flag | |
1033 (which allows the input array to be used as scratch space). In-place | |
1034 is especially beneficial if the amount of data per process is large. | |
1035 @ctindex FFTW_DESTROY_INPUT | |
1036 | |
1037 | |
1038 Fourth, if you have multiple arrays to transform at once, rather than | |
1039 calling FFTW's MPI transforms several times it usually seems to be | |
1040 faster to interleave the data and use the advanced interface. (This | |
1041 groups the communications together instead of requiring separate | |
1042 messages for each transform.) | |
1043 | |
1044 @c ------------------------------------------------------------ | |
1045 @node Combining MPI and Threads, FFTW MPI Reference, FFTW MPI Performance Tips, Distributed-memory FFTW with MPI | |
1046 @section Combining MPI and Threads | |
1047 @cindex threads | |
1048 | |
1049 In certain cases, it may be advantageous to combine MPI | |
1050 (distributed-memory) and threads (shared-memory) parallelization. | |
1051 FFTW supports this, with certain caveats. For example, if you have a | |
1052 cluster of 4-processor shared-memory nodes, you may want to use | |
1053 threads within the nodes and MPI between the nodes, instead of MPI for | |
1054 all parallelization. | |
1055 | |
1056 In particular, it is possible to seamlessly combine the MPI FFTW | |
1057 routines with the multi-threaded FFTW routines (@pxref{Multi-threaded | |
1058 FFTW}). However, some care must be taken in the initialization code, | |
1059 which should look something like this: | |
1060 | |
1061 @example | |
1062 int threads_ok; | |
1063 | |
1064 int main(int argc, char **argv) | |
1065 @{ | |
1066 int provided; | |
1067 MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided); | |
1068 threads_ok = provided >= MPI_THREAD_FUNNELED; | |
1069 | |
1070 if (threads_ok) threads_ok = fftw_init_threads(); | |
1071 fftw_mpi_init(); | |
1072 | |
1073 ... | |
1074 if (threads_ok) fftw_plan_with_nthreads(...); | |
1075 ... | |
1076 | |
1077 MPI_Finalize(); | |
1078 @} | |
1079 @end example | |
1080 @findex fftw_mpi_init | |
1081 @findex fftw_init_threads | |
1082 @findex fftw_plan_with_nthreads | |
1083 | |
1084 First, note that instead of calling @code{MPI_Init}, you should call | |
1085 @code{MPI_Init_threads}, which is the initialization routine defined | |
1086 by the MPI-2 standard to indicate to MPI that your program will be | |
1087 multithreaded. We pass @code{MPI_THREAD_FUNNELED}, which indicates | |
1088 that we will only call MPI routines from the main thread. (FFTW will | |
1089 launch additional threads internally, but the extra threads will not | |
1090 call MPI code.) (You may also pass @code{MPI_THREAD_SERIALIZED} or | |
1091 @code{MPI_THREAD_MULTIPLE}, which requests additional multithreading | |
1092 support from the MPI implementation, but this is not required by | |
1093 FFTW.) The @code{provided} parameter returns what level of threads | |
1094 support is actually supported by your MPI implementation; this | |
1095 @emph{must} be at least @code{MPI_THREAD_FUNNELED} if you want to call | |
1096 the FFTW threads routines, so we define a global variable | |
1097 @code{threads_ok} to record this. You should only call | |
1098 @code{fftw_init_threads} or @code{fftw_plan_with_nthreads} if | |
1099 @code{threads_ok} is true. For more information on thread safety in | |
1100 MPI, see the | |
1101 @uref{http://www.mpi-forum.org/docs/mpi-20-html/node162.htm, MPI and | |
1102 Threads} section of the MPI-2 standard. | |
1103 @cindex thread safety | |
1104 | |
1105 | |
1106 Second, we must call @code{fftw_init_threads} @emph{before} | |
1107 @code{fftw_mpi_init}. This is critical for technical reasons having | |
1108 to do with how FFTW initializes its list of algorithms. | |
1109 | |
1110 Then, if you call @code{fftw_plan_with_nthreads(N)}, @emph{every} MPI | |
1111 process will launch (up to) @code{N} threads to parallelize its transforms. | |
1112 | |
1113 For example, in the hypothetical cluster of 4-processor nodes, you | |
1114 might wish to launch only a single MPI process per node, and then call | |
1115 @code{fftw_plan_with_nthreads(4)} on each process to use all | |
1116 processors in the nodes. | |
1117 | |
1118 This may or may not be faster than simply using as many MPI processes | |
1119 as you have processors, however. On the one hand, using threads | |
1120 within a node eliminates the need for explicit message passing within | |
1121 the node. On the other hand, FFTW's transpose routines are not | |
1122 multi-threaded, and this means that the communications that do take | |
1123 place will not benefit from parallelization within the node. | |
1124 Moreover, many MPI implementations already have optimizations to | |
1125 exploit shared memory when it is available, so adding the | |
1126 multithreaded FFTW on top of this may be superfluous. | |
1127 @cindex transpose | |
1128 | |
1129 @c ------------------------------------------------------------ | |
1130 @node FFTW MPI Reference, FFTW MPI Fortran Interface, Combining MPI and Threads, Distributed-memory FFTW with MPI | |
1131 @section FFTW MPI Reference | |
1132 | |
1133 This section provides a complete reference to all FFTW MPI functions, | |
1134 datatypes, and constants. See also @ref{FFTW Reference} for information | |
1135 on functions and types in common with the serial interface. | |
1136 | |
1137 @menu | |
1138 * MPI Files and Data Types:: | |
1139 * MPI Initialization:: | |
1140 * Using MPI Plans:: | |
1141 * MPI Data Distribution Functions:: | |
1142 * MPI Plan Creation:: | |
1143 * MPI Wisdom Communication:: | |
1144 @end menu | |
1145 | |
1146 @node MPI Files and Data Types, MPI Initialization, FFTW MPI Reference, FFTW MPI Reference | |
1147 @subsection MPI Files and Data Types | |
1148 | |
1149 All programs using FFTW's MPI support should include its header file: | |
1150 | |
1151 @example | |
1152 #include <fftw3-mpi.h> | |
1153 @end example | |
1154 | |
1155 Note that this header file includes the serial-FFTW @code{fftw3.h} | |
1156 header file, and also the @code{mpi.h} header file for MPI, so you | |
1157 need not include those files separately. | |
1158 | |
1159 You must also link to @emph{both} the FFTW MPI library and to the | |
1160 serial FFTW library. On Unix, this means adding @code{-lfftw3_mpi | |
1161 -lfftw3 -lm} at the end of the link command. | |
1162 | |
1163 @cindex precision | |
1164 Different precisions are handled as in the serial interface: | |
1165 @xref{Precision}. That is, @samp{fftw_} functions become | |
1166 @code{fftwf_} (in single precision) etcetera, and the libraries become | |
1167 @code{-lfftw3f_mpi -lfftw3f -lm} etcetera on Unix. Long-double | |
1168 precision is supported in MPI, but quad precision (@samp{fftwq_}) is | |
1169 not, owing to the lack of MPI support for this type. | |
1170 | |
1171 @node MPI Initialization, Using MPI Plans, MPI Files and Data Types, FFTW MPI Reference | |
1172 @subsection MPI Initialization | |
1173 | |
1174 Before calling any other FFTW MPI (@samp{fftw_mpi_}) function, and | |
1175 before importing any wisdom for MPI problems, you must call: | |
1176 | |
1177 @findex fftw_mpi_init | |
1178 @example | |
1179 void fftw_mpi_init(void); | |
1180 @end example | |
1181 | |
1182 @findex fftw_init_threads | |
1183 If FFTW threads support is used, however, @code{fftw_mpi_init} should | |
1184 be called @emph{after} @code{fftw_init_threads} (@pxref{Combining MPI | |
1185 and Threads}). Calling @code{fftw_mpi_init} additional times (before | |
1186 @code{fftw_mpi_cleanup}) has no effect. | |
1187 | |
1188 | |
1189 If you want to deallocate all persistent data and reset FFTW to the | |
1190 pristine state it was in when you started your program, you can call: | |
1191 | |
1192 @findex fftw_mpi_cleanup | |
1193 @example | |
1194 void fftw_mpi_cleanup(void); | |
1195 @end example | |
1196 | |
1197 @findex fftw_cleanup | |
1198 (This calls @code{fftw_cleanup}, so you need not call the serial | |
1199 cleanup routine too, although it is safe to do so.) After calling | |
1200 @code{fftw_mpi_cleanup}, all existing plans become undefined, and you | |
1201 should not attempt to execute or destroy them. You must call | |
1202 @code{fftw_mpi_init} again after @code{fftw_mpi_cleanup} if you want | |
1203 to resume using the MPI FFTW routines. | |
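
For reference, a minimal program skeleton using these routines (a
sketch, without FFTW threads; with threads, @code{fftw_init_threads}
would precede @code{fftw_mpi_init} as described above) is:

@example
#include <fftw3-mpi.h>

int main(int argc, char **argv)
@{
     MPI_Init(&argc, &argv);
     fftw_mpi_init();    /* before any other fftw_mpi_ call */

     /* ... create, execute, and destroy MPI plans ... */

     fftw_mpi_cleanup(); /* also performs the serial fftw_cleanup */
     MPI_Finalize();
     return 0;
@}
@end example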
1204 | |
1205 @node Using MPI Plans, MPI Data Distribution Functions, MPI Initialization, FFTW MPI Reference | |
1206 @subsection Using MPI Plans | |
1207 | |
1208 Once an MPI plan is created, you can execute and destroy it using | |
1209 @code{fftw_execute}, @code{fftw_destroy_plan}, and the other functions | |
1210 in the serial interface that operate on generic plans (@pxref{Using | |
1211 Plans}). | |
1212 | |
1213 @cindex collective function | |
1214 @cindex MPI communicator | |
1215 The @code{fftw_execute} and @code{fftw_destroy_plan} functions, applied to | |
1216 MPI plans, are @emph{collective} calls: they must be called for all processes | |
1217 in the communicator that was used to create the plan. | |
1218 | |
1219 @cindex new-array execution | |
1220 You must @emph{not} use the serial new-array plan-execution functions | |
1221 @code{fftw_execute_dft} and so on (@pxref{New-array Execute | |
1222 Functions}) with MPI plans. Such functions are specialized to the | |
1223 problem type, and there are specific new-array execute functions for MPI plans: | |
1224 | |
1225 @findex fftw_mpi_execute_dft | |
1226 @findex fftw_mpi_execute_dft_r2c | |
1227 @findex fftw_mpi_execute_dft_c2r | |
1228 @findex fftw_mpi_execute_r2r | |
1229 @example | |
1230 void fftw_mpi_execute_dft(fftw_plan p, fftw_complex *in, fftw_complex *out); | |
1231 void fftw_mpi_execute_dft_r2c(fftw_plan p, double *in, fftw_complex *out); | |
1232 void fftw_mpi_execute_dft_c2r(fftw_plan p, fftw_complex *in, double *out); | |
1233 void fftw_mpi_execute_r2r(fftw_plan p, double *in, double *out); | |
1234 @end example | |
1235 | |
1236 @cindex alignment | |
1237 @findex fftw_malloc | |
1238 These functions have the same restrictions as those of the serial | |
1239 new-array execute functions. They are @emph{always} safe to apply to | |
1240 the @emph{same} @code{in} and @code{out} arrays that were used to | |
1241 create the plan. They can only be applied to new arrays if those | |
1242 arrays have the same types, dimensions, in-placeness, and alignment as | |
1243 the original arrays, where the best way to ensure the same alignment | |
1244 is to use FFTW's @code{fftw_malloc} and related allocation functions | |
1245 for all arrays (@pxref{Memory Allocation}). Note that distributed | |
1246 transposes (@pxref{FFTW MPI Transposes}) use | |
1247 @code{fftw_mpi_execute_r2r}, since they count as rank-zero r2r plans | |
1248 from FFTW's perspective. | |
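
For example, assuming an MPI plan @code{plan} created for an in-place
complex DFT and an allocation size @code{alloc_local} returned by the
corresponding @samp{local_size} call, a sketch of executing the same
plan on a second array is:

@example
/* second array with the same size, in-placeness, and alignment */
fftw_complex *other = fftw_alloc_complex(alloc_local);
/* ... fill `other' with new input data ... */
fftw_mpi_execute_dft(plan, other, other); /* collective call */
fftw_free(other);
@end example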
1249 | |
1250 @node MPI Data Distribution Functions, MPI Plan Creation, Using MPI Plans, FFTW MPI Reference | |
1251 @subsection MPI Data Distribution Functions | |
1252 | |
1253 @cindex data distribution | |
1254 As described above (@pxref{MPI Data Distribution}), in order to | |
1255 allocate your arrays, @emph{before} creating a plan, you must first | |
1256 call one of the following routines to determine the required | |
1257 allocation size and the portion of the array locally stored on a given | |
1258 process. The @code{MPI_Comm} communicator passed here must be | |
1259 equivalent to the communicator used below for plan creation. | |
1260 | |
1261 The basic interface for multidimensional transforms consists of the | |
1262 functions: | |
1263 | |
1264 @findex fftw_mpi_local_size_2d | |
1265 @findex fftw_mpi_local_size_3d | |
1266 @findex fftw_mpi_local_size | |
1267 @findex fftw_mpi_local_size_2d_transposed | |
1268 @findex fftw_mpi_local_size_3d_transposed | |
1269 @findex fftw_mpi_local_size_transposed | |
1270 @example | |
1271 ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, | |
1272 ptrdiff_t *local_n0, ptrdiff_t *local_0_start); | |
1273 ptrdiff_t fftw_mpi_local_size_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | |
1274 MPI_Comm comm, | |
1275 ptrdiff_t *local_n0, ptrdiff_t *local_0_start); | |
1276 ptrdiff_t fftw_mpi_local_size(int rnk, const ptrdiff_t *n, MPI_Comm comm, | |
1277 ptrdiff_t *local_n0, ptrdiff_t *local_0_start); | |
1278 | |
1279 ptrdiff_t fftw_mpi_local_size_2d_transposed(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, | |
1280 ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | |
1281 ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | |
1282 ptrdiff_t fftw_mpi_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | |
1283 MPI_Comm comm, | |
1284 ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | |
1285 ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | |
1286 ptrdiff_t fftw_mpi_local_size_transposed(int rnk, const ptrdiff_t *n, MPI_Comm comm, | |
1287 ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | |
1288 ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | |
1289 @end example | |
1290 | |
1291 These functions return the number of elements to allocate (complex | |
1292 numbers for DFT/r2c/c2r plans, real numbers for r2r plans), whereas | |
1293 the @code{local_n0} and @code{local_0_start} arguments return the portion | |
1294 (@code{local_0_start} to @code{local_0_start + local_n0 - 1}) of the | |
1295 first dimension of an @ndims{} array that is stored on the local | |
1296 process. @xref{Basic and advanced distribution interfaces}. For | |
1297 @code{FFTW_MPI_TRANSPOSED_OUT} plans, the @samp{_transposed} variants | |
1298 are useful in order to also return the local portion of the first | |
1299 dimension in the @ndimstrans{} transposed output. | |
1300 @xref{Transposed distributions}. | |
1301 The advanced interface for multidimensional transforms is: | |
1302 | |
1303 @cindex advanced interface | |
1304 @findex fftw_mpi_local_size_many | |
1305 @findex fftw_mpi_local_size_many_transposed | |
1306 @example | |
1307 ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, ptrdiff_t howmany, | |
1308 ptrdiff_t block0, MPI_Comm comm, | |
1309 ptrdiff_t *local_n0, ptrdiff_t *local_0_start); | |
1310 ptrdiff_t fftw_mpi_local_size_many_transposed(int rnk, const ptrdiff_t *n, ptrdiff_t howmany, | |
1311 ptrdiff_t block0, ptrdiff_t block1, MPI_Comm comm, | |
1312 ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | |
1313 ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | |
1314 @end example | |
1315 | |
1316 These differ from the basic interface in only two ways. First, they | |
1317 allow you to specify block sizes @code{block0} and @code{block1} (the | |
1318 latter for the transposed output); you can pass | |
1319 @code{FFTW_MPI_DEFAULT_BLOCK} to use FFTW's default block size as in | |
1320 the basic interface. Second, you can pass a @code{howmany} parameter, | |
1321 corresponding to the advanced planning interface below: this is for | |
1322 transforms of contiguous @code{howmany}-tuples of numbers | |
1323 (@code{howmany = 1} in the basic interface). | |
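
For instance, a sketch of the transposed advanced call for a
hypothetical @twodims{N0,N1} transform, using the default block sizes
and @code{howmany = 1}, is:

@example
const ptrdiff_t n[2] = @{N0, N1@};
ptrdiff_t local_n0, local_0_start, local_n1, local_1_start;
ptrdiff_t alloc = fftw_mpi_local_size_many_transposed(
     2, n, 1, FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK,
     MPI_COMM_WORLD, &local_n0, &local_0_start,
     &local_n1, &local_1_start);
/* local_n0 rows of the N0 x N1 input start at row local_0_start;
   for FFTW_MPI_TRANSPOSED_OUT output, local_n1 rows of the
   transposed N1 x N0 array start at row local_1_start. */
@end example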
1324 | |
1325 The corresponding basic and advanced routines for one-dimensional | |
1326 transforms (currently only complex DFTs) are: | |
1327 | |
1328 @findex fftw_mpi_local_size_1d | |
1329 @findex fftw_mpi_local_size_many_1d | |
1330 @example | |
1331 ptrdiff_t fftw_mpi_local_size_1d( | |
1332 ptrdiff_t n0, MPI_Comm comm, int sign, unsigned flags, | |
1333 ptrdiff_t *local_ni, ptrdiff_t *local_i_start, | |
1334 ptrdiff_t *local_no, ptrdiff_t *local_o_start); | |
1335 ptrdiff_t fftw_mpi_local_size_many_1d( | |
1336 ptrdiff_t n0, ptrdiff_t howmany, | |
1337 MPI_Comm comm, int sign, unsigned flags, | |
1338 ptrdiff_t *local_ni, ptrdiff_t *local_i_start, | |
1339 ptrdiff_t *local_no, ptrdiff_t *local_o_start); | |
1340 @end example | |
1341 | |
1342 @ctindex FFTW_MPI_SCRAMBLED_OUT | |
1343 @ctindex FFTW_MPI_SCRAMBLED_IN | |
1344 As above, the return value is the number of elements to allocate | |
1345 (complex numbers, for complex DFTs). The @code{local_ni} and | |
1346 @code{local_i_start} arguments return the portion | |
1347 (@code{local_i_start} to @code{local_i_start + local_ni - 1}) of the | |
1348 1d array that is stored on this process for the transform | |
1349 @emph{input}, and @code{local_no} and @code{local_o_start} are the | |
1350 corresponding quantities for the @emph{output}. The @code{sign} | |
1351 (@code{FFTW_FORWARD} or @code{FFTW_BACKWARD}) and @code{flags} must | |
1352 match the arguments passed when creating a plan. Although the inputs | |
1353 and outputs have different data distributions in general, it is | |
1354 guaranteed that the @emph{output} data distribution of an | |
1355 @code{FFTW_FORWARD} plan will match the @emph{input} data distribution | |
1356 of an @code{FFTW_BACKWARD} plan and vice versa; similarly for the | |
1357 @code{FFTW_MPI_SCRAMBLED_OUT} and @code{FFTW_MPI_SCRAMBLED_IN} flags. | |
1358 @xref{One-dimensional distributions}. | |
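
For example, a sketch of allocating a one-dimensional forward
transform of hypothetical (composite) length @code{N0} is:

@example
ptrdiff_t local_ni, local_i_start, local_no, local_o_start;
ptrdiff_t alloc = fftw_mpi_local_size_1d(N0, MPI_COMM_WORLD,
                                         FFTW_FORWARD, FFTW_ESTIMATE,
                                         &local_ni, &local_i_start,
                                         &local_no, &local_o_start);
fftw_complex *in = fftw_alloc_complex(alloc);
fftw_complex *out = fftw_alloc_complex(alloc);
/* the same sign and flags must be passed to fftw_mpi_plan_dft_1d */
@end example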
1359 | |
1360 @node MPI Plan Creation, MPI Wisdom Communication, MPI Data Distribution Functions, FFTW MPI Reference | |
1361 @subsection MPI Plan Creation | |
1362 | |
1363 @subsubheading Complex-data MPI DFTs | |
1364 | |
1365 Plans for complex-data DFTs (@pxref{2d MPI example}) are created by: | |
1366 | |
1367 @findex fftw_mpi_plan_dft_1d | |
1368 @findex fftw_mpi_plan_dft_2d | |
1369 @findex fftw_mpi_plan_dft_3d | |
1370 @findex fftw_mpi_plan_dft | |
1371 @findex fftw_mpi_plan_many_dft | |
1372 @example | |
1373 fftw_plan fftw_mpi_plan_dft_1d(ptrdiff_t n0, fftw_complex *in, fftw_complex *out, | |
1374 MPI_Comm comm, int sign, unsigned flags); | |
1375 fftw_plan fftw_mpi_plan_dft_2d(ptrdiff_t n0, ptrdiff_t n1, | |
1376 fftw_complex *in, fftw_complex *out, | |
1377 MPI_Comm comm, int sign, unsigned flags); | |
1378 fftw_plan fftw_mpi_plan_dft_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | |
1379 fftw_complex *in, fftw_complex *out, | |
1380 MPI_Comm comm, int sign, unsigned flags); | |
1381 fftw_plan fftw_mpi_plan_dft(int rnk, const ptrdiff_t *n, | |
1382 fftw_complex *in, fftw_complex *out, | |
1383 MPI_Comm comm, int sign, unsigned flags); | |
1384 fftw_plan fftw_mpi_plan_many_dft(int rnk, const ptrdiff_t *n, | |
1385 ptrdiff_t howmany, ptrdiff_t block, ptrdiff_t tblock, | |
1386 fftw_complex *in, fftw_complex *out, | |
1387 MPI_Comm comm, int sign, unsigned flags); | |
1388 @end example | |
1389 | |
1390 @cindex MPI communicator | |
1391 @cindex collective function | |
1392 These are similar to their serial counterparts (@pxref{Complex DFTs}) | |
1393 in specifying the dimensions, sign, and flags of the transform. The | |
1394 @code{comm} argument gives an MPI communicator that specifies the set | |
1395 of processes to participate in the transform; plan creation is a | |
1396 collective function that must be called for all processes in the | |
1397 communicator. The @code{in} and @code{out} pointers refer only to a | |
1398 portion of the overall transform data (@pxref{MPI Data Distribution}) | |
1399 as specified by the @samp{local_size} functions in the previous | |
1400 section. Unless @code{flags} contains @code{FFTW_ESTIMATE}, these | |
1401 arrays are overwritten during plan creation as for the serial | |
1402 interface. For multi-dimensional transforms, any dimensions @code{> | |
1403 1} are supported; for one-dimensional transforms, only composite | |
1404 (non-prime) @code{n0} are currently supported (unlike the serial | |
1405 FFTW). Requesting an unsupported transform size will yield a | |
1406 @code{NULL} plan. (As in the serial interface, highly composite sizes | |
1407 generally yield the best performance.) | |
1408 | |
1409 @cindex advanced interface | |
1410 @ctindex FFTW_MPI_DEFAULT_BLOCK | |
1411 @cindex stride | |
1412 The advanced-interface @code{fftw_mpi_plan_many_dft} additionally | |
1413 allows you to specify the block sizes for the first dimension | |
1414 (@code{block}) of the @ndims{} input data and the first dimension | |
1415 (@code{tblock}) of the @ndimstrans{} transposed data (at intermediate | |
1416 steps of the transform, and for the output if | |
1417 @code{FFTW_MPI_TRANSPOSED_OUT} is specified in @code{flags}). These must | |
1418 be the same block sizes as were passed to the corresponding | |
1419 @samp{local_size} function; you can pass @code{FFTW_MPI_DEFAULT_BLOCK} | |
1420 to use FFTW's default block size as in the basic interface. Also, the | |
1421 @code{howmany} parameter specifies that the transform is of contiguous | |
1422 @code{howmany}-tuples rather than individual complex numbers; this | |
1423 corresponds to the same parameter in the serial advanced interface | |
1424 (@pxref{Advanced Complex DFTs}) with @code{stride = howmany} and | |
1425 @code{dist = 1}. | |
1426 | |
1427 @subsubheading MPI flags | |
1428 | |
1429 The @code{flags} can be any of those for the serial FFTW | |
1430 (@pxref{Planner Flags}), and in addition may include one or more of | |
1431 the following MPI-specific flags, which improve performance at the | |
1432 cost of changing the output or input data formats. | |
1433 | |
1434 @itemize @bullet | |
1435 | |
1436 @item | |
1437 @ctindex FFTW_MPI_SCRAMBLED_OUT | |
1438 @ctindex FFTW_MPI_SCRAMBLED_IN | |
1439 @code{FFTW_MPI_SCRAMBLED_OUT}, @code{FFTW_MPI_SCRAMBLED_IN}: valid for | |
1440 1d transforms only, these flags indicate that the output/input of the | |
1441 transform are in an undocumented ``scrambled'' order. A forward | |
1442 @code{FFTW_MPI_SCRAMBLED_OUT} transform can be inverted by a backward | |
1443 @code{FFTW_MPI_SCRAMBLED_IN} (times the usual 1/@i{N} normalization). | |
1444 @xref{One-dimensional distributions}. | |
1445 | |
1446 @item | |
1447 @ctindex FFTW_MPI_TRANSPOSED_OUT | |
1448 @ctindex FFTW_MPI_TRANSPOSED_IN | |
1449 @code{FFTW_MPI_TRANSPOSED_OUT}, @code{FFTW_MPI_TRANSPOSED_IN}: valid | |
1450 for multidimensional (@code{rnk > 1}) transforms only, these flags | |
1451 specify that the output or input of an @ndims{} transform is | |
1452 transposed to @ndimstrans{}. @xref{Transposed distributions}. | |
1453 | |
1454 @end itemize | |
1455 | |
1456 @subsubheading Real-data MPI DFTs | |
1457 | |
1458 @cindex r2c | |
1459 Plans for real-input/output (r2c/c2r) DFTs (@pxref{Multi-dimensional | |
1460 MPI DFTs of Real Data}) are created by: | |
1461 | |
1462 @findex fftw_mpi_plan_dft_r2c_2d | |
1464 @findex fftw_mpi_plan_dft_r2c_3d | |
1465 @findex fftw_mpi_plan_dft_r2c | |
1466 @findex fftw_mpi_plan_dft_c2r_2d | |
1468 @findex fftw_mpi_plan_dft_c2r_3d | |
1469 @findex fftw_mpi_plan_dft_c2r | |
1470 @example | |
1471 fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1, | |
1472 double *in, fftw_complex *out, | |
1473 MPI_Comm comm, unsigned flags); | |
1477 fftw_plan fftw_mpi_plan_dft_r2c_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | |
1478 double *in, fftw_complex *out, | |
1479 MPI_Comm comm, unsigned flags); | |
1480 fftw_plan fftw_mpi_plan_dft_r2c(int rnk, const ptrdiff_t *n, | |
1481 double *in, fftw_complex *out, | |
1482 MPI_Comm comm, unsigned flags); | |
1483 fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1, | |
1484 fftw_complex *in, double *out, | |
1485 MPI_Comm comm, unsigned flags); | |
1489 fftw_plan fftw_mpi_plan_dft_c2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | |
1490 fftw_complex *in, double *out, | |
1491 MPI_Comm comm, unsigned flags); | |
1492 fftw_plan fftw_mpi_plan_dft_c2r(int rnk, const ptrdiff_t *n, | |
1493 fftw_complex *in, double *out, | |
1494 MPI_Comm comm, unsigned flags); | |
1495 @end example | |
1496 | |
1497 Similar to the serial interface (@pxref{Real-data DFTs}), these | |
1498 transform logically @ndims{} real data to/from @ndimshalf{} complex | |
1499 data, representing the non-redundant half of the conjugate-symmetry | |
1500 output of a real-input DFT (@pxref{Multi-dimensional Transforms}). | |
1501 However, the real array must be stored within a padded @ndimspad{} | |
1502 array (much like the in-place serial r2c transforms, but here for | |
1503 out-of-place transforms as well). Currently, only multi-dimensional | |
1504 (@code{rnk > 1}) r2c/c2r transforms are supported (requesting a plan | |
1505 for @code{rnk = 1} will yield @code{NULL}). As explained above | |
1506 (@pxref{Multi-dimensional MPI DFTs of Real Data}), the data | |
1507 distribution of both the real and complex arrays is given by the | |
1508 @samp{local_size} function called for the dimensions of the | |
1509 @emph{complex} array. Similar to the other planning functions, the | |
1510 input and output arrays are overwritten when the plan is created | |
1511 except in @code{FFTW_ESTIMATE} mode. | |
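
For example, a sketch of an in-place r2c plan for a hypothetical
@twodims{N0,N1} real array, where the allocation is obtained from the
complex dimensions @twodims{N0,N1/2+1} and the real data is stored in
the padded format, is:

@example
ptrdiff_t local_n0, local_0_start;
ptrdiff_t alloc = fftw_mpi_local_size_2d(N0, N1/2 + 1, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);
fftw_complex *cdata = fftw_alloc_complex(alloc);
double *rdata = (double *) cdata; /* in-place: real view of same memory */
fftw_plan plan = fftw_mpi_plan_dft_r2c_2d(N0, N1, rdata, cdata,
                                          MPI_COMM_WORLD, FFTW_MEASURE);
/* rdata is locally a local_n0 x 2*(N1/2+1) row-major array, of which
   only the first N1 entries of each row are actual data. */
@end example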
1512 | |
1513 As for the complex DFTs above, there is an advanced interface that | |
1514 allows you to manually specify block sizes and to transform contiguous | |
1515 @code{howmany}-tuples of real/complex numbers: | |
1516 | |
1517 @findex fftw_mpi_plan_many_dft_r2c | |
1518 @findex fftw_mpi_plan_many_dft_c2r | |
1519 @example | |
1520 fftw_plan fftw_mpi_plan_many_dft_r2c | |
1521 (int rnk, const ptrdiff_t *n, ptrdiff_t howmany, | |
1522 ptrdiff_t iblock, ptrdiff_t oblock, | |
1523 double *in, fftw_complex *out, | |
1524 MPI_Comm comm, unsigned flags); | |
1525 fftw_plan fftw_mpi_plan_many_dft_c2r | |
1526 (int rnk, const ptrdiff_t *n, ptrdiff_t howmany, | |
1527 ptrdiff_t iblock, ptrdiff_t oblock, | |
1528 fftw_complex *in, double *out, | |
1529 MPI_Comm comm, unsigned flags); | |
1530 @end example | |
1531 | |
1532 @subsubheading MPI r2r transforms | |
1533 | |
1534 @cindex r2r | |
1535 There are corresponding plan-creation routines for r2r | |
1536 transforms (@pxref{More DFTs of Real Data}), currently supporting | |
1537 multidimensional (@code{rnk > 1}) transforms only (@code{rnk = 1} will | |
1538 yield a @code{NULL} plan): | |
1539 | |
1540 @example | |
1541 fftw_plan fftw_mpi_plan_r2r_2d(ptrdiff_t n0, ptrdiff_t n1, | |
1542 double *in, double *out, | |
1543 MPI_Comm comm, | |
1544 fftw_r2r_kind kind0, fftw_r2r_kind kind1, | |
1545 unsigned flags); | |
1546 fftw_plan fftw_mpi_plan_r2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | |
1547 double *in, double *out, | |
1548 MPI_Comm comm, | |
1549 fftw_r2r_kind kind0, fftw_r2r_kind kind1, fftw_r2r_kind kind2, | |
1550 unsigned flags); | |
1551 fftw_plan fftw_mpi_plan_r2r(int rnk, const ptrdiff_t *n, | |
1552 double *in, double *out, | |
1553 MPI_Comm comm, const fftw_r2r_kind *kind, | |
1554 unsigned flags); | |
1555 fftw_plan fftw_mpi_plan_many_r2r(int rnk, const ptrdiff_t *n, | |
1556 ptrdiff_t iblock, ptrdiff_t oblock, | |
1557 double *in, double *out, | |
1558 MPI_Comm comm, const fftw_r2r_kind *kind, | |
1559 unsigned flags); | |
1560 @end example | |
1561 | |
1562 The parameters are much the same as for the complex DFTs above, except | |
1563 that the arrays are of real numbers (and hence the outputs of the | |
1564 @samp{local_size} data-distribution functions should be interpreted as | |
1565 counts of real rather than complex numbers). Also, the @code{kind} | |
1566 parameters specify the r2r kinds along each dimension as for the | |
1567 serial interface (@pxref{Real-to-Real Transform Kinds}). @xref{Other | |
1568 Multi-dimensional Real-data MPI Transforms}. | |
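
For instance, a sketch of an in-place two-dimensional DCT-II
(@code{FFTW_REDFT10} along both dimensions) of a hypothetical
@twodims{N0,N1} real array is:

@example
ptrdiff_t local_n0, local_0_start;
ptrdiff_t alloc = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);
double *data = fftw_alloc_real(alloc); /* alloc counts real numbers here */
fftw_plan plan = fftw_mpi_plan_r2r_2d(N0, N1, data, data, MPI_COMM_WORLD,
                                      FFTW_REDFT10, FFTW_REDFT10,
                                      FFTW_MEASURE);
@end example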
1569 | |
1570 @subsubheading MPI transposition | |
1571 @cindex transpose | |
1572 | |
1573 FFTW also provides routines to plan a transpose of a distributed | |
1574 @code{n0} by @code{n1} array of real numbers, or an array of | |
1575 @code{howmany}-tuples of real numbers with specified block sizes | |
1576 (@pxref{FFTW MPI Transposes}): | |
1577 | |
1578 @findex fftw_mpi_plan_transpose | |
1579 @findex fftw_mpi_plan_many_transpose | |
1580 @example | |
1581 fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1, | |
1582 double *in, double *out, | |
1583 MPI_Comm comm, unsigned flags); | |
1584 fftw_plan fftw_mpi_plan_many_transpose | |
1585 (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany, | |
1586 ptrdiff_t block0, ptrdiff_t block1, | |
1587 double *in, double *out, MPI_Comm comm, unsigned flags); | |
1588 @end example | |
1589 | |
1590 @cindex new-array execution | |
1591 @findex fftw_mpi_execute_r2r | |
1592 These plans are used with the @code{fftw_mpi_execute_r2r} new-array | |
1593 execute function (@pxref{Using MPI Plans}), since they count as (rank | |
1594 zero) r2r plans from FFTW's perspective. | |
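
For example, a sketch of an out-of-place transpose of a hypothetical
@twodims{N0,N1} real array distributed over @code{MPI_COMM_WORLD} is:

@example
ptrdiff_t local_n0, local_0_start, local_n1, local_1_start;
ptrdiff_t alloc = fftw_mpi_local_size_2d_transposed(N0, N1, MPI_COMM_WORLD,
                                                    &local_n0, &local_0_start,
                                                    &local_n1, &local_1_start);
double *in = fftw_alloc_real(alloc);
double *out = fftw_alloc_real(alloc);
fftw_plan plan = fftw_mpi_plan_transpose(N0, N1, in, out,
                                         MPI_COMM_WORLD, FFTW_MEASURE);
fftw_execute(plan); /* out now holds local_n1 rows of the N1 x N0 transpose */
@end example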
1595 | |
1596 @node MPI Wisdom Communication, , MPI Plan Creation, FFTW MPI Reference | |
1597 @subsection MPI Wisdom Communication | |
1598 | |
1599 To facilitate synchronizing wisdom among the different MPI processes, | |
1600 we provide two functions: | |
1601 | |
1602 @findex fftw_mpi_gather_wisdom | |
1603 @findex fftw_mpi_broadcast_wisdom | |
1604 @example | |
1605 void fftw_mpi_gather_wisdom(MPI_Comm comm); | |
1606 void fftw_mpi_broadcast_wisdom(MPI_Comm comm); | |
1607 @end example | |
1608 | |
1609 The @code{fftw_mpi_gather_wisdom} function gathers all wisdom in the | |
1610 given communicator @code{comm} to the process of rank 0 in the | |
1611 communicator: that process obtains the union of all wisdom on all the | |
1612 processes. As a side effect, some other processes will gain | |
1613 additional wisdom from other processes, but only process 0 will gain | |
1614 the complete union. | |
1615 | |
1616 The @code{fftw_mpi_broadcast_wisdom} function does the reverse: it exports | |
1617 wisdom from process 0 in @code{comm} to all other processes in the | |
1618 communicator, replacing any wisdom they currently have. | |
1619 | |
1620 @xref{FFTW MPI Wisdom}. | |
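
For example, a common pattern (a sketch, using the serial wisdom
import/export routines and a hypothetical file name) is to read wisdom
on process 0 and broadcast it before planning, then gather and save it
afterwards:

@example
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

/* before planning: process 0 reads wisdom and shares it */
if (rank == 0) fftw_import_wisdom_from_filename("wisdom.dat");
fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD);

/* ... create plans ... */

/* after planning: collect wisdom on process 0 and save it */
fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
if (rank == 0) fftw_export_wisdom_to_filename("wisdom.dat");
@end example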
1621 | |
1622 @c ------------------------------------------------------------ | |
1623 @node FFTW MPI Fortran Interface, , FFTW MPI Reference, Distributed-memory FFTW with MPI | |
1624 @section FFTW MPI Fortran Interface | |
1625 @cindex Fortran interface | |
1626 | |
1627 @cindex iso_c_binding | |
1628 The FFTW MPI interface is callable from modern Fortran compilers | |
1629 supporting the Fortran 2003 @code{iso_c_binding} standard for calling | |
1630 C functions. As described in @ref{Calling FFTW from Modern Fortran}, | |
1631 this means that you can directly call FFTW's C interface from Fortran | |
1632 with only minor changes in syntax. There are, however, a few things | |
1633 specific to the MPI interface to keep in mind: | |
1634 | |
1635 @itemize @bullet | |
1636 | |
1637 @item | |
1638 Instead of including @code{fftw3.f03} as in @ref{Overview of Fortran | |
1639 interface}, you should @code{include 'fftw3-mpi.f03'} (after | |
1640 @code{use, intrinsic :: iso_c_binding} as before). The | |
1641 @code{fftw3-mpi.f03} file includes @code{fftw3.f03}, so you should | |
1642 @emph{not} @code{include} them both yourself. (You will also want to | |
1643 include the MPI header file, usually via @code{include 'mpif.h'} or | |
1644 similar, although this is not needed by @code{fftw3-mpi.f03} | |
1645 @i{per se}.) (To use the @samp{fftwl_} @code{long double} extended-precision routines in supporting compilers, you should include @code{fftw3l-mpi.f03} in @emph{addition} to @code{fftw3-mpi.f03}. @xref{Extended and quadruple precision in Fortran}.) | |
1646 | |
1647 @item | |
1648 Because of the different storage conventions between C and Fortran, | |
1649 you reverse the order of your array dimensions when passing them to | |
1650 FFTW (@pxref{Reversing array dimensions}). This is merely a | |
1651 difference in notation and incurs no performance overhead. However, | |
1652 it means that, whereas in C the @emph{first} dimension is distributed, | |
1653 in Fortran the @emph{last} dimension of your array is distributed. | |
1654 | |
1655 @item | |
1656 @cindex MPI communicator | |
1657 In Fortran, communicators are stored as @code{integer} types; there is | |
1658 no @code{MPI_Comm} type, nor is there any way to access a C | |
1659 @code{MPI_Comm}. Fortunately, this is taken care of for you by the | |
1660 FFTW Fortran interface: whenever the C interface expects an | |
1661 @code{MPI_Comm} type, you should pass the Fortran communicator as an | |
1662 @code{integer}.@footnote{Technically, this is because you aren't | |
1663 actually calling the C functions directly. You are calling wrapper | |
1664 functions that translate the communicator with @code{MPI_Comm_f2c} | |
1665 before calling the ordinary C interface. This is all done | |
1666 transparently, however, since the @code{fftw3-mpi.f03} interface file | |
1667 renames the wrappers so that they are called in Fortran with the same | |
1668 names as the C interface functions.} | |
1669 | |
1670 @item | |
1671 Because you need to call the @samp{local_size} function to find out | |
1672 how much space to allocate, and this may be @emph{larger} than the | |
1673 local portion of the array (@pxref{MPI Data Distribution}), you should | |
1674 @emph{always} allocate your arrays dynamically using FFTW's allocation | |
1675 routines as described in @ref{Allocating aligned memory in Fortran}. | |
1676 (Coincidentally, this also provides the best performance by | |
1677 guaranteeing proper data alignment.) | |
1678 | |
1679 @item | |
1680 Because all sizes in the MPI FFTW interface are declared as | |
1681 @code{ptrdiff_t} in C, you should use @code{integer(C_INTPTR_T)} in | |
1682 Fortran (@pxref{FFTW Fortran type reference}). | |
1683 | |
1684 @item | |
1685 @findex fftw_execute_dft | |
1686 @findex fftw_mpi_execute_dft | |
1687 @cindex new-array execution | |
1688 In Fortran, because of the language semantics, we generally recommend | |
1689 using the new-array execute functions for all plans, even in the | |
1690 common case where you are executing the plan on the same arrays for | |
1691 which the plan was created (@pxref{Plan execution in Fortran}). | |
1692 However, note that in the MPI interface these functions are changed: | |
1693 @code{fftw_execute_dft} becomes @code{fftw_mpi_execute_dft}, | |
1694 etcetera. @xref{Using MPI Plans}. | |
1695 | |
1696 @end itemize | |
1697 | |
1698 For example, here is a Fortran code snippet to perform a distributed | |
1699 @twodims{L,M} complex DFT in-place. (This assumes you have already | |
1700 initialized MPI with @code{MPI_init} and have also performed | |
1701 @code{call fftw_mpi_init}.) | |
1702 | |
1703 @example | |
1704 use, intrinsic :: iso_c_binding | |
1705 include 'fftw3-mpi.f03' | |
1706 integer(C_INTPTR_T), parameter :: L = ... | |
1707 integer(C_INTPTR_T), parameter :: M = ... | |
1708 type(C_PTR) :: plan, cdata | |
1709 complex(C_DOUBLE_COMPLEX), pointer :: data(:,:) | |
1710 integer(C_INTPTR_T) :: i, j, alloc_local, local_M, local_j_offset | |
1711 | |
1712 ! @r{get local data size and allocate (note dimension reversal)} | |
1713 alloc_local = fftw_mpi_local_size_2d(M, L, MPI_COMM_WORLD, & | |
1714 local_M, local_j_offset) | |
1715 cdata = fftw_alloc_complex(alloc_local) | |
1716 call c_f_pointer(cdata, data, [L,local_M]) | |
1717 | |
1718 ! @r{create MPI plan for in-place forward DFT (note dimension reversal)} | |
1719 plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, & | |
1720 FFTW_FORWARD, FFTW_MEASURE) | |
1721 | |
1722 ! @r{initialize data to some function} my_function(i,j) | |
1723 do j = 1, local_M | |
1724 do i = 1, L | |
1725 data(i, j) = my_function(i, j + local_j_offset) | |
1726 end do | |
1727 end do | |
1728 | |
1729 ! @r{compute transform (as many times as desired)} | |
1730 call fftw_mpi_execute_dft(plan, data, data) | |
1731 | |
1732 call fftw_destroy_plan(plan) | |
1733 call fftw_free(cdata) | |
1734 @end example | |
1735 | |
1736 Note that we called @code{fftw_mpi_local_size_2d} and | |
1737 @code{fftw_mpi_plan_dft_2d} with the dimensions in reversed order, | |
1738 since a @twodims{L,M} Fortran array is viewed by FFTW in C as a | |
1739 @twodims{M, L} array. This means that the array was distributed over | |
1740 the @code{M} dimension, the local portion of which is a | |
1741 @twodims{L,local_M} array in Fortran. (You must @emph{not} use an | |
1742 @code{allocate} statement to allocate an @twodims{L,local_M} array, | |
1743 however; you must allocate @code{alloc_local} complex numbers, which | |
1744 may be greater than @code{L * local_M}, in order to reserve space for | |
1745 intermediate steps of the transform.) Finally, we mention that | |
1746 because C's array indices are zero-based, the @code{local_j_offset} | |
1747 argument can conveniently be interpreted as an offset in the 1-based | |
1748 @code{j} index (rather than as a starting index as in C). | |
1749 | |
1750 If instead you had used the @code{ior(FFTW_MEASURE, | |
1751 FFTW_MPI_TRANSPOSED_OUT)} flag, the output of the transform would be a | |
1752 transposed @twodims{M,local_L} array, associated with the @emph{same} | |
1753 @code{cdata} allocation (since the transform is in-place), and which | |
1754 you could declare with: | |
1755 | |
1756 @example | |
1757 complex(C_DOUBLE_COMPLEX), pointer :: tdata(:,:) | |
1758 ... | |
1759 call c_f_pointer(cdata, tdata, [M,local_L]) | |
1760 @end example | |
1761 | |
1762 where @code{local_L} would have been obtained by changing the | |
1763 @code{fftw_mpi_local_size_2d} call to: | |
1764 | |
1765 @example | |
1766 alloc_local = fftw_mpi_local_size_2d_transposed(M, L, MPI_COMM_WORLD, & | |
1767 local_M, local_j_offset, local_L, local_i_offset) | |
1768 @end example |