1 @node Distributed-memory FFTW with MPI, Calling FFTW from Modern Fortran, Multi-threaded FFTW, Top
2 @chapter Distributed-memory FFTW with MPI
3 @cindex MPI
4
5 @cindex parallel transform
6 In this chapter we document the parallel FFTW routines for parallel
7 systems supporting the MPI message-passing interface. Unlike the
8 shared-memory threads described in the previous chapter, MPI allows
9 you to use @emph{distributed-memory} parallelism, where each CPU has
10 its own separate memory, and which can scale up to clusters of many
11 thousands of processors. This capability comes at a price, however:
12 each process only stores a @emph{portion} of the data to be
13 transformed, which means that the data structures and
14 programming-interface are quite different from the serial or threads
15 versions of FFTW.
16 @cindex data distribution
17
18
19 Distributed-memory parallelism is especially useful when you are
20 transforming arrays so large that they do not fit into the memory of a
21 single processor. The storage per-process required by FFTW's MPI
22 routines is proportional to the total array size divided by the number
23 of processes. Conversely, distributed-memory parallelism can easily
24 pose an unacceptably high communications overhead for small problems;
25 the threshold problem size for which parallelism becomes advantageous
26 will depend on the precise problem you are interested in, your
27 hardware, and your MPI implementation.
28
29 A note on terminology: in MPI, you divide the data among a set of
30 ``processes'' which each run in their own memory address space.
31 Generally, each process runs on a different physical processor, but
32 this is not required. A set of processes in MPI is described by an
33 opaque data structure called a ``communicator,'' the most common of
34 which is the predefined communicator @code{MPI_COMM_WORLD} which
35 refers to @emph{all} processes. For more information on these and
36 other concepts common to all MPI programs, we refer the reader to the
37 documentation at @uref{http://www.mcs.anl.gov/research/projects/mpi/, the MPI home
38 page}.
39 @cindex MPI communicator
40 @ctindex MPI_COMM_WORLD
41
42
43 We assume in this chapter that the reader is familiar with the usage
44 of the serial (uniprocessor) FFTW, and focus only on the concepts new
45 to the MPI interface.
46
47 @menu
48 * FFTW MPI Installation::
49 * Linking and Initializing MPI FFTW::
50 * 2d MPI example::
51 * MPI Data Distribution::
52 * Multi-dimensional MPI DFTs of Real Data::
53 * Other Multi-dimensional Real-data MPI Transforms::
54 * FFTW MPI Transposes::
55 * FFTW MPI Wisdom::
56 * Avoiding MPI Deadlocks::
57 * FFTW MPI Performance Tips::
58 * Combining MPI and Threads::
59 * FFTW MPI Reference::
60 * FFTW MPI Fortran Interface::
61 @end menu
62
63 @c ------------------------------------------------------------
64 @node FFTW MPI Installation, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI, Distributed-memory FFTW with MPI
65 @section FFTW MPI Installation
66
67 All of the FFTW MPI code is located in the @code{mpi} subdirectory of
68 the FFTW package. On Unix systems, the FFTW MPI libraries and header
69 files are automatically configured, compiled, and installed along with
70 the uniprocessor FFTW libraries simply by including
71 @code{--enable-mpi} in the flags to the @code{configure} script
72 (@pxref{Installation on Unix}).
73 @fpindex configure
74
75
76 Any implementation of the MPI standard, version 1 or later, should
77 work with FFTW. The @code{configure} script will attempt to
78 automatically detect how to compile and link code using your MPI
79 implementation. In some cases, especially if you have multiple
80 different MPI implementations installed or have an unusual MPI
81 software package, you may need to provide this information explicitly.
82
83 Most commonly, one compiles MPI code by invoking a special compiler
84 command, typically @code{mpicc} for C code. The @code{configure}
85 script knows the most common names for this command, but you can
86 specify the MPI compilation command explicitly by setting the
87 @code{MPICC} variable, as in @samp{./configure MPICC=mpicc ...}.
88 @fpindex mpicc
89
90
91 If, instead of a special compiler command, you need to link a certain
92 library, you can specify the link command via the @code{MPILIBS}
93 variable, as in @samp{./configure MPILIBS=-lmpi ...}. Note that if
94 your MPI library is installed in a non-standard location (one the
95 compiler does not know about by default), you may also have to specify
96 the location of the library and header files via @code{LDFLAGS} and
97 @code{CPPFLAGS} variables, respectively, as in @samp{./configure
98 LDFLAGS=-L/path/to/mpi/libs CPPFLAGS=-I/path/to/mpi/include ...}.
99
100 @c ------------------------------------------------------------
101 @node Linking and Initializing MPI FFTW, 2d MPI example, FFTW MPI Installation, Distributed-memory FFTW with MPI
102 @section Linking and Initializing MPI FFTW
103
104 Programs using the MPI FFTW routines should be linked with
105 @code{-lfftw3_mpi -lfftw3 -lm} on Unix in double precision,
106 @code{-lfftw3f_mpi -lfftw3f -lm} in single precision, and so on
107 (@pxref{Precision}). You will also need to link with whatever library
108 is responsible for MPI on your system; in most MPI implementations,
109 there is a special compiler alias named @code{mpicc} to compile and
110 link MPI code.
111 @fpindex mpicc
112 @cindex linking on Unix
113 @cindex precision
114
115
116 @findex fftw_init_threads
117 Before calling any FFTW routines except possibly
118 @code{fftw_init_threads} (@pxref{Combining MPI and Threads}), but after calling
119 @code{MPI_Init}, you should call the function:
120
121 @example
122 void fftw_mpi_init(void);
123 @end example
124 @findex fftw_mpi_init
125
126 If, at the end of your program, you want to get rid of all memory and
127 other resources allocated internally by FFTW, for both the serial and
128 MPI routines, you can call:
129
130 @example
131 void fftw_mpi_cleanup(void);
132 @end example
133 @findex fftw_mpi_cleanup
134
135 which is much like the @code{fftw_cleanup()} function except that it
136 also gets rid of FFTW's MPI-related data. You must @emph{not} execute
137 any previously created plans after calling this function.
138
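Putting these together, the overall initialization and cleanup sequence
of an MPI FFTW program might look something like the following sketch
(error checking omitted; the @code{...} stands for whatever planning,
execution, and plan destruction your program performs):

@example
#include <fftw3-mpi.h>

int main(int argc, char **argv)
@{
    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    ... /* @r{create, execute, and destroy plans as usual} */

    fftw_mpi_cleanup();  /* @r{optional: release FFTW's internal data} */
    MPI_Finalize();
@}
@end example
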
139 @c ------------------------------------------------------------
140 @node 2d MPI example, MPI Data Distribution, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI
141 @section 2d MPI example
142
143 Before we document the FFTW MPI interface in detail, we begin with a
144 simple example outlining how one would perform a two-dimensional
145 @code{N0} by @code{N1} complex DFT.
146
147 @example
148 #include <fftw3-mpi.h>
149
150 int main(int argc, char **argv)
151 @{
152 const ptrdiff_t N0 = ..., N1 = ...;
153 fftw_plan plan;
154 fftw_complex *data;
155 ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
156
157 MPI_Init(&argc, &argv);
158 fftw_mpi_init();
159
160 /* @r{get local data size and allocate} */
161 alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
162 &local_n0, &local_0_start);
163 data = fftw_alloc_complex(alloc_local);
164
165 /* @r{create plan for in-place forward DFT} */
166 plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
167 FFTW_FORWARD, FFTW_ESTIMATE);
168
169 /* @r{initialize data to some function} my_function(x,y) */
170 for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j)
171 data[i*N1 + j] = my_function(local_0_start + i, j);
172
173 /* @r{compute transforms, in-place, as many times as desired} */
174 fftw_execute(plan);
175
176 fftw_destroy_plan(plan);
177
178 MPI_Finalize();
179 @}
180 @end example
181
182 As can be seen above, the MPI interface follows the same basic style
183 of allocate/plan/execute/destroy as the serial FFTW routines. All of
184 the MPI-specific routines are prefixed with @samp{fftw_mpi_} instead
185 of @samp{fftw_}. There are a few important differences, however:
186
187 First, we must call @code{fftw_mpi_init()} after calling
188 @code{MPI_Init} (required in all MPI programs) and before calling any
189 other @samp{fftw_mpi_} routine.
190 @findex MPI_Init
191 @findex fftw_mpi_init
192
193
194 Second, when we create the plan with @code{fftw_mpi_plan_dft_2d},
195 analogous to @code{fftw_plan_dft_2d}, we pass an additional argument:
196 the communicator, indicating which processes will participate in the
197 transform (here @code{MPI_COMM_WORLD}, indicating all processes).
198 Whenever you create, execute, or destroy a plan for an MPI transform,
199 you must call the corresponding FFTW routine on @emph{all} processes
200 in the communicator for that transform. (That is, these are
201 @emph{collective} calls.) Note that the plan for the MPI transform
uses the standard @code{fftw_execute} and @code{fftw_destroy_plan} routines
203 (on the other hand, there are MPI-specific new-array execute functions
204 documented below).
205 @cindex collective function
206 @findex fftw_mpi_plan_dft_2d
207 @ctindex MPI_COMM_WORLD
208
209
210 Third, all of the FFTW MPI routines take @code{ptrdiff_t} arguments
211 instead of @code{int} as for the serial FFTW. @code{ptrdiff_t} is a
212 standard C integer type which is (at least) 32 bits wide on a 32-bit
213 machine and 64 bits wide on a 64-bit machine. This is to make it easy
214 to specify very large parallel transforms on a 64-bit machine. (You
215 can specify 64-bit transform sizes in the serial FFTW, too, but only
216 by using the @samp{guru64} planner interface. @xref{64-bit Guru
217 Interface}.)
218 @tindex ptrdiff_t
219 @cindex 64-bit architecture
220
221
222 Fourth, and most importantly, you don't allocate the entire
223 two-dimensional array on each process. Instead, you call
224 @code{fftw_mpi_local_size_2d} to find out what @emph{portion} of the
225 array resides on each processor, and how much space to allocate.
226 Here, the portion of the array on each process is a @code{local_n0} by
227 @code{N1} slice of the total array, starting at index
228 @code{local_0_start}. The total number of @code{fftw_complex} numbers
229 to allocate is given by the @code{alloc_local} return value, which
230 @emph{may} be greater than @code{local_n0 * N1} (in case some
231 intermediate calculations require additional storage). The data
232 distribution in FFTW's MPI interface is described in more detail by
233 the next section.
234 @findex fftw_mpi_local_size_2d
235 @cindex data distribution
236
237
238 Given the portion of the array that resides on the local process, it
239 is straightforward to initialize the data (here to a function
@code{my_function}) and otherwise manipulate it.  Of course, at the end
241 of the program you may want to output the data somehow, but
242 synchronizing this output is up to you and is beyond the scope of this
243 manual. (One good way to output a large multi-dimensional distributed
244 array in MPI to a portable binary file is to use the free HDF5
245 library; see the @uref{http://www.hdfgroup.org/, HDF home page}.)
246 @cindex HDF5
247 @cindex MPI I/O
248
249 @c ------------------------------------------------------------
250 @node MPI Data Distribution, Multi-dimensional MPI DFTs of Real Data, 2d MPI example, Distributed-memory FFTW with MPI
251 @section MPI Data Distribution
252 @cindex data distribution
253
254 The most important concept to understand in using FFTW's MPI interface
255 is the data distribution. With a serial or multithreaded FFT, all of
256 the inputs and outputs are stored as a single contiguous chunk of
257 memory. With a distributed-memory FFT, the inputs and outputs are
258 broken into disjoint blocks, one per process.
259
260 In particular, FFTW uses a @emph{1d block distribution} of the data,
261 distributed along the @emph{first dimension}. For example, if you
262 want to perform a @twodims{100,200} complex DFT, distributed over 4
263 processes, each process will get a @twodims{25,200} slice of the data.
264 That is, process 0 will get rows 0 through 24, process 1 will get rows
265 25 through 49, process 2 will get rows 50 through 74, and process 3
266 will get rows 75 through 99. If you take the same array but
267 distribute it over 3 processes, then it is not evenly divisible so the
268 different processes will have unequal chunks. FFTW's default choice
269 in this case is to assign 34 rows to processes 0 and 1, and 32 rows to
270 process 2.
271 @cindex block distribution
272
273
274 FFTW provides several @samp{fftw_mpi_local_size} routines that you can
275 call to find out what portion of an array is stored on the current
276 process. In most cases, you should use the default block sizes picked
277 by FFTW, but it is also possible to specify your own block size. For
278 example, with a @twodims{100,200} array on three processes, you can
279 tell FFTW to use a block size of 40, which would assign 40 rows to
280 processes 0 and 1, and 20 rows to process 2. FFTW's default is to
281 divide the data equally among the processes if possible, and as best
282 it can otherwise. The rows are always assigned in ``rank order,''
283 i.e. process 0 gets the first block of rows, then process 1, and so
284 on. (You can change this by using @code{MPI_Comm_split} to create a
285 new communicator with re-ordered processes.) However, you should
286 always call the @samp{fftw_mpi_local_size} routines, if possible,
287 rather than trying to predict FFTW's distribution choices.
288
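For example, a short sketch of how each process might query and report
its own portion of a @twodims{100,200} array (assuming @code{MPI_Init}
and @code{fftw_mpi_init} have already been called, and that the usual C
headers are included):

@example
ptrdiff_t alloc_local, local_n0, local_0_start;
int rank;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
alloc_local = fftw_mpi_local_size_2d(100, 200, MPI_COMM_WORLD,
                                     &local_n0, &local_0_start);
printf("process %d: rows %d..%d, allocate %d complex numbers\n",
       rank, (int) local_0_start,
       (int) (local_0_start + local_n0 - 1), (int) alloc_local);
@end example

On 4 processes this would report 25 rows per process, while on 3
processes it would report FFTW's default uneven split described above.
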
289 In particular, it is critical that you allocate the storage size that
290 is returned by @samp{fftw_mpi_local_size}, which is @emph{not}
291 necessarily the size of the local slice of the array. The reason is
292 that intermediate steps of FFTW's algorithms involve transposing the
293 array and redistributing the data, so at these intermediate steps FFTW
294 may require more local storage space (albeit always proportional to
295 the total size divided by the number of processes). The
296 @samp{fftw_mpi_local_size} functions know how much storage is required
297 for these intermediate steps and tell you the correct amount to
298 allocate.
299
300 @menu
301 * Basic and advanced distribution interfaces::
302 * Load balancing::
303 * Transposed distributions::
304 * One-dimensional distributions::
305 @end menu
306
307 @node Basic and advanced distribution interfaces, Load balancing, MPI Data Distribution, MPI Data Distribution
308 @subsection Basic and advanced distribution interfaces
309
310 As with the planner interface, the @samp{fftw_mpi_local_size}
311 distribution interface is broken into basic and advanced
312 (@samp{_many}) interfaces, where the latter allows you to specify the
313 block size manually and also to request block sizes when computing
314 multiple transforms simultaneously. These functions are documented
315 more exhaustively by the FFTW MPI Reference, but we summarize the
316 basic ideas here using a couple of two-dimensional examples.
317
318 For the @twodims{100,200} complex-DFT example, above, we would find
319 the distribution by calling the following function in the basic
320 interface:
321
322 @example
323 ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
324 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
325 @end example
326 @findex fftw_mpi_local_size_2d
327
328 Given the total size of the data to be transformed (here, @code{n0 =
329 100} and @code{n1 = 200}) and an MPI communicator (@code{comm}), this
330 function provides three numbers.
331
332 First, it describes the shape of the local data: the current process
333 should store a @code{local_n0} by @code{n1} slice of the overall
334 dataset, in row-major order (@code{n1} dimension contiguous), starting
335 at index @code{local_0_start}. That is, if the total dataset is
336 viewed as a @code{n0} by @code{n1} matrix, the current process should
337 store the rows @code{local_0_start} to
338 @code{local_0_start+local_n0-1}. Obviously, if you are running with
339 only a single MPI process, that process will store the entire array:
340 @code{local_0_start} will be zero and @code{local_n0} will be
341 @code{n0}. @xref{Row-major Format}.
342 @cindex row-major
343
344
345 Second, the return value is the total number of data elements (e.g.,
346 complex numbers for a complex DFT) that should be allocated for the
347 input and output arrays on the current process (ideally with
348 @code{fftw_malloc} or an @samp{fftw_alloc} function, to ensure optimal
349 alignment). It might seem that this should always be equal to
350 @code{local_n0 * n1}, but this is @emph{not} the case. FFTW's
351 distributed FFT algorithms require data redistributions at
352 intermediate stages of the transform, and in some circumstances this
353 may require slightly larger local storage. This is discussed in more
354 detail below, under @ref{Load balancing}.
355 @findex fftw_malloc
356 @findex fftw_alloc_complex
357
358
359 @cindex advanced interface
360 The advanced-interface @samp{local_size} function for multidimensional
361 transforms returns the same three things (@code{local_n0},
362 @code{local_0_start}, and the total number of elements to allocate),
363 but takes more inputs:
364
365 @example
366 ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n,
367 ptrdiff_t howmany,
368 ptrdiff_t block0,
369 MPI_Comm comm,
370 ptrdiff_t *local_n0,
371 ptrdiff_t *local_0_start);
372 @end example
373 @findex fftw_mpi_local_size_many
374
375 The two-dimensional case above corresponds to @code{rnk = 2} and an
376 array @code{n} of length 2 with @code{n[0] = n0} and @code{n[1] = n1}.
377 This routine is for any @code{rnk > 1}; one-dimensional transforms
378 have their own interface because they work slightly differently, as
379 discussed below.
380
381 First, the advanced interface allows you to perform multiple
382 transforms at once, of interleaved data, as specified by the
@code{howmany} parameter.  (@code{howmany} is 1 for a single
384 transform.)
385
386 Second, here you can specify your desired block size in the @code{n0}
387 dimension, @code{block0}. To use FFTW's default block size, pass
388 @code{FFTW_MPI_DEFAULT_BLOCK} (0) for @code{block0}. Otherwise, on
389 @code{P} processes, FFTW will return @code{local_n0} equal to
390 @code{block0} on the first @code{P / block0} processes (rounded down),
391 return @code{local_n0} equal to @code{n0 - block0 * (P / block0)} on
392 the next process, and @code{local_n0} equal to zero on any remaining
393 processes. In general, we recommend using the default block size
394 (which corresponds to @code{n0 / P}, rounded up).
395 @ctindex FFTW_MPI_DEFAULT_BLOCK
396 @cindex block distribution
397
398
399 For example, suppose you have @code{P = 4} processes and @code{n0 =
400 21}. The default will be a block size of @code{6}, which will give
401 @code{local_n0 = 6} on the first three processes and @code{local_n0 =
402 3} on the last process. Instead, however, you could specify
403 @code{block0 = 5} if you wanted, which would give @code{local_n0 = 5}
on processes 0 to 2 and @code{local_n0 = 6} on process 3.  (This choice,
405 while it may look superficially more ``balanced,'' has the same
406 critical path as FFTW's default but requires more communications.)
407
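For illustration, a sketch of calling the advanced interface for two
simultaneous (interleaved) transforms of this same @twodims{100,200}
size, using FFTW's default block size (the variable names here are
arbitrary):

@example
const ptrdiff_t n[2] = @{100, 200@};
const ptrdiff_t howmany = 2;   /* @r{two interleaved transforms} */
ptrdiff_t alloc_local, local_n0, local_0_start;
fftw_complex *data;

alloc_local = fftw_mpi_local_size_many(2, n, howmany,
                                       FFTW_MPI_DEFAULT_BLOCK,
                                       MPI_COMM_WORLD,
                                       &local_n0, &local_0_start);
/* @r{the return value accounts for all howmany transforms} */
data = fftw_alloc_complex(alloc_local);
@end example

The corresponding plan would then be created with
@code{fftw_mpi_plan_many_dft} (@pxref{MPI Plan Creation}), passing the
same @code{n}, @code{howmany}, and block arguments.
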
408 @node Load balancing, Transposed distributions, Basic and advanced distribution interfaces, MPI Data Distribution
409 @subsection Load balancing
410 @cindex load balancing
411
412 Ideally, when you parallelize a transform over some @math{P}
413 processes, each process should end up with work that takes equal time.
414 Otherwise, all of the processes end up waiting on whichever process is
415 slowest. This goal is known as ``load balancing.'' In this section,
416 we describe the circumstances under which FFTW is able to load-balance
417 well, and in particular how you should choose your transform size in
418 order to load balance.
419
420 Load balancing is especially difficult when you are parallelizing over
heterogeneous machines; for example, if one of your processors is an
422 old 486 and another is a Pentium IV, obviously you should give the
423 Pentium more work to do than the 486 since the latter is much slower.
424 FFTW does not deal with this problem, however---it assumes that your
425 processes run on hardware of comparable speed, and that the goal is
426 therefore to divide the problem as equally as possible.
427
428 For a multi-dimensional complex DFT, FFTW can divide the problem
429 equally among the processes if: (i) the @emph{first} dimension
430 @code{n0} is divisible by @math{P}; and (ii), the @emph{product} of
431 the subsequent dimensions is divisible by @math{P}. (For the advanced
432 interface, where you can specify multiple simultaneous transforms via
433 some ``vector'' length @code{howmany}, a factor of @code{howmany} is
434 included in the product of the subsequent dimensions.)
435
436 For a one-dimensional complex DFT, the length @code{N} of the data
437 should be divisible by @math{P} @emph{squared} to be able to divide
438 the problem equally among the processes.
439
440 @node Transposed distributions, One-dimensional distributions, Load balancing, MPI Data Distribution
441 @subsection Transposed distributions
442
443 Internally, FFTW's MPI transform algorithms work by first computing
444 transforms of the data local to each process, then by globally
445 @emph{transposing} the data in some fashion to redistribute the data
446 among the processes, transforming the new data local to each process,
447 and transposing back. For example, a two-dimensional @code{n0} by
448 @code{n1} array, distributed across the @code{n0} dimension, is
transformed by: (i) transforming the @code{n1} dimension, which is
450 local to each process; (ii) transposing to an @code{n1} by @code{n0}
451 array, distributed across the @code{n1} dimension; (iii) transforming
452 the @code{n0} dimension, which is now local to each process; (iv)
453 transposing back.
454 @cindex transpose
455
456
457 However, in many applications it is acceptable to compute a
458 multidimensional DFT whose results are produced in transposed order
459 (e.g., @code{n1} by @code{n0} in two dimensions). This provides a
460 significant performance advantage, because it means that the final
461 transposition step can be omitted. FFTW supports this optimization,
462 which you specify by passing the flag @code{FFTW_MPI_TRANSPOSED_OUT}
463 to the planner routines. To compute the inverse transform of
464 transposed output, you specify @code{FFTW_MPI_TRANSPOSED_IN} to tell
465 it that the input is transposed. In this section, we explain how to
466 interpret the output format of such a transform.
467 @ctindex FFTW_MPI_TRANSPOSED_OUT
468 @ctindex FFTW_MPI_TRANSPOSED_IN
469
470
Suppose you are transforming multi-dimensional data with (at
472 least two) dimensions @ndims{}. As always, it is distributed along
473 the first dimension @dimk{0}. Now, if we compute its DFT with the
474 @code{FFTW_MPI_TRANSPOSED_OUT} flag, the resulting output data are stored
475 with the first @emph{two} dimensions transposed: @ndimstrans{},
476 distributed along the @dimk{1} dimension. Conversely, if we take the
477 @ndimstrans{} data and transform it with the
478 @code{FFTW_MPI_TRANSPOSED_IN} flag, then the format goes back to the
479 original @ndims{} array.
480
481 There are two ways to find the portion of the transposed array that
482 resides on the current process. First, you can simply call the
483 appropriate @samp{local_size} function, passing @ndimstrans{} (the
484 transposed dimensions). This would mean calling the @samp{local_size}
485 function twice, once for the transposed and once for the
486 non-transposed dimensions. Alternatively, you can call one of the
487 @samp{local_size_transposed} functions, which returns both the
488 non-transposed and transposed data distribution from a single call.
489 For example, for a 3d transform with transposed output (or input), you
490 might call:
491
492 @example
493 ptrdiff_t fftw_mpi_local_size_3d_transposed(
494 ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm,
495 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
496 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
497 @end example
498 @findex fftw_mpi_local_size_3d_transposed
499
500 Here, @code{local_n0} and @code{local_0_start} give the size and
501 starting index of the @code{n0} dimension for the
502 @emph{non}-transposed data, as in the previous sections. For
503 @emph{transposed} data (e.g. the output for
504 @code{FFTW_MPI_TRANSPOSED_OUT}), @code{local_n1} and
505 @code{local_1_start} give the size and starting index of the @code{n1}
506 dimension, which is the first dimension of the transposed data
507 (@code{n1} by @code{n0} by @code{n2}).
508
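For illustration, a sketch of planning an in-place 3d complex DFT with
transposed output might look like this (assuming @code{N0}, @code{N1},
and @code{N2} are defined elsewhere):

@example
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
fftw_complex *data;
fftw_plan plan;

alloc_local = fftw_mpi_local_size_3d_transposed(
                  N0, N1, N2, MPI_COMM_WORLD,
                  &local_n0, &local_0_start,
                  &local_n1, &local_1_start);
data = fftw_alloc_complex(alloc_local);

/* @r{input is local_n0 by N1 by N2; output is local_n1 by N0 by N2} */
plan = fftw_mpi_plan_dft_3d(N0, N1, N2, data, data, MPI_COMM_WORLD,
                            FFTW_FORWARD,
                            FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);
@end example
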
509 (Note that @code{FFTW_MPI_TRANSPOSED_IN} is completely equivalent to
510 performing @code{FFTW_MPI_TRANSPOSED_OUT} and passing the first two
511 dimensions to the planner in reverse order, or vice versa. If you
512 pass @emph{both} the @code{FFTW_MPI_TRANSPOSED_IN} and
513 @code{FFTW_MPI_TRANSPOSED_OUT} flags, it is equivalent to swapping the
514 first two dimensions passed to the planner and passing @emph{neither}
515 flag.)
516
517 @node One-dimensional distributions, , Transposed distributions, MPI Data Distribution
518 @subsection One-dimensional distributions
519
520 For one-dimensional distributed DFTs using FFTW, matters are slightly
521 more complicated because the data distribution is more closely tied to
522 how the algorithm works. In particular, you can no longer pass an
523 arbitrary block size and must accept FFTW's default; also, the block
sizes may be different for input and output.  Moreover, the data
525 distribution depends on the flags and transform direction, in order
526 for forward and backward transforms to work correctly.
527
528 @example
529 ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm,
530 int sign, unsigned flags,
531 ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
532 ptrdiff_t *local_no, ptrdiff_t *local_o_start);
533 @end example
534 @findex fftw_mpi_local_size_1d
535
536 This function computes the data distribution for a 1d transform of
537 size @code{n0} with the given transform @code{sign} and @code{flags}.
538 Both input and output data use block distributions. The input on the
539 current process will consist of @code{local_ni} numbers starting at
540 index @code{local_i_start}; e.g. if only a single process is used,
541 then @code{local_ni} will be @code{n0} and @code{local_i_start} will
542 be @code{0}. Similarly for the output, with @code{local_no} numbers
543 starting at index @code{local_o_start}. The return value of
544 @code{fftw_mpi_local_size_1d} will be the total number of elements to
545 allocate on the current process (which might be slightly larger than
546 the local size due to intermediate steps in the algorithm).
547
548 As mentioned above (@pxref{Load balancing}), the data will be divided
549 equally among the processes if @code{n0} is divisible by the
550 @emph{square} of the number of processes. In this case,
551 @code{local_ni} will equal @code{local_no}. Otherwise, they may be
552 different.
553
554 For some applications, such as convolutions, the order of the output
555 data is irrelevant. In this case, performance can be improved by
556 specifying that the output data be stored in an FFTW-defined
557 ``scrambled'' format. (In particular, this is the analogue of
558 transposed output in the multidimensional case: scrambled output saves
559 a communications step.) If you pass @code{FFTW_MPI_SCRAMBLED_OUT} in
560 the flags, then the output is stored in this (undocumented) scrambled
561 order. Conversely, to perform the inverse transform of data in
562 scrambled order, pass the @code{FFTW_MPI_SCRAMBLED_IN} flag.
563 @ctindex FFTW_MPI_SCRAMBLED_OUT
564 @ctindex FFTW_MPI_SCRAMBLED_IN
565
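For example, a sketch of a forward/backward pair of 1d transforms
(e.g. for a convolution) that keeps the intermediate data in FFTW's
scrambled order, assuming @code{N0} is defined elsewhere:

@example
ptrdiff_t alloc_local, local_ni, local_i_start, local_no, local_o_start;
fftw_complex *in, *out;
fftw_plan fwd, bwd;

alloc_local = fftw_mpi_local_size_1d(N0, MPI_COMM_WORLD,
                  FFTW_FORWARD, FFTW_MPI_SCRAMBLED_OUT,
                  &local_ni, &local_i_start,
                  &local_no, &local_o_start);
in = fftw_alloc_complex(alloc_local);
out = fftw_alloc_complex(alloc_local);

fwd = fftw_mpi_plan_dft_1d(N0, in, out, MPI_COMM_WORLD, FFTW_FORWARD,
                           FFTW_MEASURE | FFTW_MPI_SCRAMBLED_OUT);
bwd = fftw_mpi_plan_dft_1d(N0, out, in, MPI_COMM_WORLD, FFTW_BACKWARD,
                           FFTW_MEASURE | FFTW_MPI_SCRAMBLED_IN);
@end example

(Strictly speaking, one could call @code{fftw_mpi_local_size_1d} a
second time with @code{FFTW_BACKWARD} and @code{FFTW_MPI_SCRAMBLED_IN}
to size the arrays for the backward plan; here we assume, for brevity,
that the same allocation suffices for both directions.)
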
566
567 In MPI FFTW, only composite sizes @code{n0} can be parallelized; we
568 have not yet implemented a parallel algorithm for large prime sizes.
569
570 @c ------------------------------------------------------------
571 @node Multi-dimensional MPI DFTs of Real Data, Other Multi-dimensional Real-data MPI Transforms, MPI Data Distribution, Distributed-memory FFTW with MPI
572 @section Multi-dimensional MPI DFTs of Real Data
573
574 FFTW's MPI interface also supports multi-dimensional DFTs of real
575 data, similar to the serial r2c and c2r interfaces. (Parallel
576 one-dimensional real-data DFTs are not currently supported; you must
577 use a complex transform and set the imaginary parts of the inputs to
578 zero.)
579
580 The key points to understand for r2c and c2r MPI transforms (compared
to the MPI complex DFTs or the serial r2c/c2r transforms) are:
582
583 @itemize @bullet
584
585 @item
586 Just as for serial transforms, r2c/c2r DFTs transform @ndims{} real
587 data to/from @ndimshalf{} complex data: the last dimension of the
588 complex data is cut in half (rounded down), plus one. As for the
589 serial transforms, the sizes you pass to the @samp{plan_dft_r2c} and
590 @samp{plan_dft_c2r} are the @ndims{} dimensions of the real data.
591
592 @item
593 @cindex padding
594 Although the real data is @emph{conceptually} @ndims{}, it is
595 @emph{physically} stored as an @ndimspad{} array, where the last
596 dimension has been @emph{padded} to make it the same size as the
597 complex output. This is much like the in-place serial r2c/c2r
598 interface (@pxref{Multi-Dimensional DFTs of Real Data}), except that
599 in MPI the padding is required even for out-of-place data. The extra
600 padding numbers are ignored by FFTW (they are @emph{not} like
601 zero-padding the transform to a larger size); they are only used to
602 determine the data layout.
603
604 @item
605 @cindex data distribution
606 The data distribution in MPI for @emph{both} the real and complex data
607 is determined by the shape of the @emph{complex} data. That is, you
608 call the appropriate @samp{local size} function for the @ndimshalf{}
609 complex data, and then use the @emph{same} distribution for the real
610 data except that the last complex dimension is replaced by a (padded)
611 real dimension of twice the length.
612
613 @end itemize
614
For example, suppose we are performing an out-of-place r2c transform of
616 @threedims{L,M,N} real data [padded to @threedims{L,M,2(N/2+1)}],
617 resulting in @threedims{L,M,N/2+1} complex data. Similar to the
618 example in @ref{2d MPI example}, we might do something like:
619
620 @example
621 #include <fftw3-mpi.h>
622
623 int main(int argc, char **argv)
624 @{
625 const ptrdiff_t L = ..., M = ..., N = ...;
626 fftw_plan plan;
627 double *rin;
628 fftw_complex *cout;
629 ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k;
630
631 MPI_Init(&argc, &argv);
632 fftw_mpi_init();
633
634 /* @r{get local data size and allocate} */
635 alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD,
636 &local_n0, &local_0_start);
637 rin = fftw_alloc_real(2 * alloc_local);
638 cout = fftw_alloc_complex(alloc_local);
639
640 /* @r{create plan for out-of-place r2c DFT} */
641 plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
642 FFTW_MEASURE);
643
644 /* @r{initialize rin to some function} my_func(x,y,z) */
645 for (i = 0; i < local_n0; ++i)
646 for (j = 0; j < M; ++j)
647 for (k = 0; k < N; ++k)
648 rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k);
649
650 /* @r{compute transforms as many times as desired} */
651 fftw_execute(plan);
652
653 fftw_destroy_plan(plan);
654
655 MPI_Finalize();
656 @}
657 @end example
658
659 @findex fftw_alloc_real
660 @cindex row-major
661 Note that we allocated @code{rin} using @code{fftw_alloc_real} with an
662 argument of @code{2 * alloc_local}: since @code{alloc_local} is the
663 number of @emph{complex} values to allocate, the number of @emph{real}
664 values is twice as many. The @code{rin} array is then
665 @threedims{local_n0,M,2(N/2+1)} in row-major order, so its
666 @code{(i,j,k)} element is at the index @code{(i*M + j) * (2*(N/2+1)) +
k} (@pxref{Multi-dimensional Array Format}).
668
669 @cindex transpose
@ctindex FFTW_MPI_TRANSPOSED_OUT
@ctindex FFTW_MPI_TRANSPOSED_IN
672 As for the complex transforms, improved performance can be obtained by
673 specifying that the output is the transpose of the input or vice versa
674 (@pxref{Transposed distributions}). In our @threedims{L,M,N} r2c
example, including @code{FFTW_MPI_TRANSPOSED_OUT} in the flags means that
676 the input would be a padded @threedims{L,M,2(N/2+1)} real array
677 distributed over the @code{L} dimension, while the output would be a
678 @threedims{M,L,N/2+1} complex array distributed over the @code{M}
679 dimension. To perform the inverse c2r transform with the same data
distributions, you would use the @code{FFTW_MPI_TRANSPOSED_IN} flag.
681
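For illustration, a sketch of creating such a transposed r2c/c2r pair
(using the same @code{L}, @code{M}, and @code{N} as above) might look
like:

@example
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
double *rin;
fftw_complex *cout;
fftw_plan fwd, bwd;

/* @r{the distribution is that of the L x M x (N/2+1) complex data} */
alloc_local = fftw_mpi_local_size_3d_transposed(
                  L, M, N/2+1, MPI_COMM_WORLD,
                  &local_n0, &local_0_start,
                  &local_n1, &local_1_start);
rin = fftw_alloc_real(2 * alloc_local);
cout = fftw_alloc_complex(alloc_local);

fwd = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
                               FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);
bwd = fftw_mpi_plan_dft_c2r_3d(L, M, N, cout, rin, MPI_COMM_WORLD,
                               FFTW_MEASURE | FFTW_MPI_TRANSPOSED_IN);
@end example

Here the input @code{rin} is @code{local_n0} by @code{M} by
@code{2(N/2+1)} (padded, distributed over @code{L}), while the
transposed output @code{cout} is @code{local_n1} by @code{L} by
@code{N/2+1} (distributed over @code{M}).
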
682 @c ------------------------------------------------------------
683 @node Other Multi-dimensional Real-data MPI Transforms, FFTW MPI Transposes, Multi-dimensional MPI DFTs of Real Data, Distributed-memory FFTW with MPI
@section Other Multi-dimensional Real-data MPI Transforms
685
686 @cindex r2r
687 FFTW's MPI interface also supports multi-dimensional @samp{r2r}
688 transforms of all kinds supported by the serial interface
689 (e.g. discrete cosine and sine transforms, discrete Hartley
690 transforms, etc.). Only multi-dimensional @samp{r2r} transforms, not
691 one-dimensional transforms, are currently parallelized.
692
693 @tindex fftw_r2r_kind
694 These are used much like the multidimensional complex DFTs discussed
695 above, except that the data is real rather than complex, and one needs
696 to pass an r2r transform kind (@code{fftw_r2r_kind}) for each
697 dimension as in the serial FFTW (@pxref{More DFTs of Real Data}).
698
For example, one might perform a two-dimensional @twodims{L,M} r2r transform that is
700 an REDFT10 (DCT-II) in the first dimension and an RODFT10 (DST-II) in
701 the second dimension with code like:
702
703 @example
704 const ptrdiff_t L = ..., M = ...;
705 fftw_plan plan;
706 double *data;
707 ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
708
709 /* @r{get local data size and allocate} */
710 alloc_local = fftw_mpi_local_size_2d(L, M, MPI_COMM_WORLD,
711 &local_n0, &local_0_start);
712 data = fftw_alloc_real(alloc_local);
713
714 /* @r{create plan for in-place REDFT10 x RODFT10} */
715 plan = fftw_mpi_plan_r2r_2d(L, M, data, data, MPI_COMM_WORLD,
716 FFTW_REDFT10, FFTW_RODFT10, FFTW_MEASURE);
717
718 /* @r{initialize data to some function} my_function(x,y) */
719 for (i = 0; i < local_n0; ++i) for (j = 0; j < M; ++j)
720 data[i*M + j] = my_function(local_0_start + i, j);
721
722 /* @r{compute transforms, in-place, as many times as desired} */
723 fftw_execute(plan);
724
725 fftw_destroy_plan(plan);
726 @end example
727
728 @findex fftw_alloc_real
729 Notice that we use the same @samp{local_size} functions as we did for
730 complex data, only now we interpret the sizes in terms of real rather
731 than complex values, and correspondingly use @code{fftw_alloc_real}.
732
733 @c ------------------------------------------------------------
734 @node FFTW MPI Transposes, FFTW MPI Wisdom, Other Multi-dimensional Real-data MPI Transforms, Distributed-memory FFTW with MPI
735 @section FFTW MPI Transposes
736 @cindex transpose
737
FFTW's MPI Fourier transforms rely on one or more @emph{global
transposition} steps for their communications.  For example, the
740 multidimensional transforms work by transforming along some
741 dimensions, then transposing to make the first dimension local and
742 transforming that, then transposing back. Because global
743 transposition of a block-distributed matrix has many other potential
744 uses besides FFTs, FFTW's transpose routines can be called directly,
745 as documented in this section.
746
747 @menu
748 * Basic distributed-transpose interface::
749 * Advanced distributed-transpose interface::
750 * An improved replacement for MPI_Alltoall::
751 @end menu
752
753 @node Basic distributed-transpose interface, Advanced distributed-transpose interface, FFTW MPI Transposes, FFTW MPI Transposes
754 @subsection Basic distributed-transpose interface
755
756 In particular, suppose that we have an @code{n0} by @code{n1} array in
757 row-major order, block-distributed across the @code{n0} dimension. To
758 transpose this into an @code{n1} by @code{n0} array block-distributed
759 across the @code{n1} dimension, we would create a plan by calling the
760 following function:
761
762 @example
763 fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
764 double *in, double *out,
765 MPI_Comm comm, unsigned flags);
766 @end example
767 @findex fftw_mpi_plan_transpose
768
769 The input and output arrays (@code{in} and @code{out}) can be the
770 same. The transpose is actually executed by calling
771 @code{fftw_execute} on the plan, as usual.
772 @findex fftw_execute
773
774
775 The @code{flags} are the usual FFTW planner flags, but support
776 two additional flags: @code{FFTW_MPI_TRANSPOSED_OUT} and/or
777 @code{FFTW_MPI_TRANSPOSED_IN}. What these flags indicate, for
778 transpose plans, is that the output and/or input, respectively, are
779 @emph{locally} transposed. That is, on each process input data is
780 normally stored as a @code{local_n0} by @code{n1} array in row-major
781 order, but for an @code{FFTW_MPI_TRANSPOSED_IN} plan the input data is
782 stored as @code{n1} by @code{local_n0} in row-major order. Similarly,
783 @code{FFTW_MPI_TRANSPOSED_OUT} means that the output is @code{n0} by
784 @code{local_n1} instead of @code{local_n1} by @code{n0}.
785 @ctindex FFTW_MPI_TRANSPOSED_OUT
786 @ctindex FFTW_MPI_TRANSPOSED_IN
787
788
789 To determine the local size of the array on each process before and
790 after the transpose, as well as the amount of storage that must be
791 allocated, one should call @code{fftw_mpi_local_size_2d_transposed},
792 just as for a 2d DFT as described in the previous section:
793 @cindex data distribution
794
795 @example
796 ptrdiff_t fftw_mpi_local_size_2d_transposed
797 (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
798 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
799 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
800 @end example
801 @findex fftw_mpi_local_size_2d_transposed
802
803 Again, the return value is the local storage to allocate, which in
804 this case is the number of @emph{real} (@code{double}) values rather
805 than complex numbers as in the previous examples.
806
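Putting these together, a sketch of an in-place distributed transpose
of an @code{n0} by @code{n1} real matrix might look like (with
@code{n0} and @code{n1} defined elsewhere):

@example
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
double *data;
fftw_plan tplan;

alloc_local = fftw_mpi_local_size_2d_transposed(
                  n0, n1, MPI_COMM_WORLD,
                  &local_n0, &local_0_start,
                  &local_n1, &local_1_start);
data = fftw_alloc_real(alloc_local);

/* @r{before execution: local_n0 by n1; after: local_n1 by n0} */
tplan = fftw_mpi_plan_transpose(n0, n1, data, data, MPI_COMM_WORLD,
                                FFTW_MEASURE);
fftw_execute(tplan);
@end example
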
807 @node Advanced distributed-transpose interface, An improved replacement for MPI_Alltoall, Basic distributed-transpose interface, FFTW MPI Transposes
808 @subsection Advanced distributed-transpose interface
809
810 The above routines are for a transpose of a matrix of numbers (of type
811 @code{double}), using FFTW's default block sizes. More generally, one
812 can perform transposes of @emph{tuples} of numbers, with
813 user-specified block sizes for the input and output:
814
815 @example
816 fftw_plan fftw_mpi_plan_many_transpose
817 (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
818 ptrdiff_t block0, ptrdiff_t block1,
819 double *in, double *out, MPI_Comm comm, unsigned flags);
820 @end example
821 @findex fftw_mpi_plan_many_transpose
822
823 In this case, one is transposing an @code{n0} by @code{n1} matrix of
824 @code{howmany}-tuples (e.g. @code{howmany = 2} for complex numbers).
825 The input is distributed along the @code{n0} dimension with block size
826 @code{block0}, and the @code{n1} by @code{n0} output is distributed
827 along the @code{n1} dimension with block size @code{block1}. If
828 @code{FFTW_MPI_DEFAULT_BLOCK} (0) is passed for a block size then FFTW
829 uses its default block size. To get the local size of the data on
830 each process, you should then call @code{fftw_mpi_local_size_many_transposed}.
831 @ctindex FFTW_MPI_DEFAULT_BLOCK
832 @findex fftw_mpi_local_size_many_transposed
833
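For example, to transpose an @code{n0} by @code{n1} matrix of complex
numbers stored as an array of @code{double}s, one might use
@code{howmany = 2} (a sketch, with default block sizes and @code{n0},
@code{n1} defined elsewhere):

@example
const ptrdiff_t n[2] = @{n0, n1@};
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
double *data;
fftw_plan tplan;

alloc_local = fftw_mpi_local_size_many_transposed(2, n, 2,
                  FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK,
                  MPI_COMM_WORLD,
                  &local_n0, &local_0_start,
                  &local_n1, &local_1_start);
data = fftw_alloc_real(alloc_local);

tplan = fftw_mpi_plan_many_transpose(n0, n1, 2,
                  FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK,
                  data, data, MPI_COMM_WORLD, FFTW_MEASURE);
@end example
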
834 @node An improved replacement for MPI_Alltoall, , Advanced distributed-transpose interface, FFTW MPI Transposes
835 @subsection An improved replacement for MPI_Alltoall
836
837 We close this section by noting that FFTW's MPI transpose routines can
be thought of as a generalization of the @code{MPI_Alltoall} function
839 (albeit only for floating-point types), and in some circumstances can
840 function as an improved replacement.
841 @findex MPI_Alltoall
842
843
844 @code{MPI_Alltoall} is defined by the MPI standard as:
845
846 @example
847 int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
848 void *recvbuf, int recvcnt, MPI_Datatype recvtype,
849 MPI_Comm comm);
850 @end example
851
852 In particular, for @code{double*} arrays @code{in} and @code{out},
853 consider the call:
854
855 @example
MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm);
857 @end example
858
859 This is completely equivalent to:
860
861 @example
862 MPI_Comm_size(comm, &P);
863 plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE);
864 fftw_execute(plan);
865 fftw_destroy_plan(plan);
866 @end example
867
868 That is, computing a @twodims{P,P} transpose on @code{P} processes,
869 with a block size of 1, is just a standard all-to-all communication.
870
871 However, using the FFTW routine instead of @code{MPI_Alltoall} may
872 have certain advantages. First of all, FFTW's routine can operate
873 in-place (@code{in == out}) whereas @code{MPI_Alltoall} can only
874 operate out-of-place.
875 @cindex in-place
876
877
878 Second, even for out-of-place plans, FFTW's routine may be faster,
879 especially if you need to perform the all-to-all communication many
880 times and can afford to use @code{FFTW_MEASURE} or
881 @code{FFTW_PATIENT}. It should certainly be no slower, not including
882 the time to create the plan, since one of the possible algorithms that
883 FFTW uses for an out-of-place transpose @emph{is} simply to call
884 @code{MPI_Alltoall}. However, FFTW also considers several other
885 possible algorithms that, depending on your MPI implementation and
886 your hardware, may be faster.
887 @ctindex FFTW_MEASURE
888 @ctindex FFTW_PATIENT
889
890 @c ------------------------------------------------------------
891 @node FFTW MPI Wisdom, Avoiding MPI Deadlocks, FFTW MPI Transposes, Distributed-memory FFTW with MPI
892 @section FFTW MPI Wisdom
893 @cindex wisdom
894 @cindex saving plans to disk
895
896 FFTW's ``wisdom'' facility (@pxref{Words of Wisdom-Saving Plans}) can
897 be used to save MPI plans as well as to save uniprocessor plans.
898 However, for MPI there are several unavoidable complications.
899
900 @cindex MPI I/O
901 First, the MPI standard does not guarantee that every process can
902 perform file I/O (at least, not using C stdio routines)---in general,
903 we may only assume that process 0 is capable of I/O.@footnote{In fact,
904 even this assumption is not technically guaranteed by the standard,
905 although it seems to be universal in actual MPI implementations and is
906 widely assumed by MPI-using software. Technically, you need to query
907 the @code{MPI_IO} attribute of @code{MPI_COMM_WORLD} with
908 @code{MPI_Attr_get}. If this attribute is @code{MPI_PROC_NULL}, no
909 I/O is possible. If it is @code{MPI_ANY_SOURCE}, any process can
910 perform I/O. Otherwise, it is the rank of a process that can perform
911 I/O ... but since it is not guaranteed to yield the @emph{same} rank
912 on all processes, you have to do an @code{MPI_Allreduce} of some kind
913 if you want all processes to agree about which is going to do I/O.
914 And even then, the standard only guarantees that this process can
915 perform output, but not input. See e.g. @cite{Parallel Programming
916 with MPI} by P. S. Pacheco, section 8.1.3. Needless to say, in our
917 experience virtually no MPI programmers worry about this.} So, if we
918 want to export the wisdom from a single process to a file, we must
919 first export the wisdom to a string, then send it to process 0, then
920 write it to a file.
921
922 Second, in principle we may want to have separate wisdom for every
923 process, since in general the processes may run on different hardware
924 even for a single MPI program. However, in practice FFTW's MPI code
925 is designed for the case of homogeneous hardware (@pxref{Load
926 balancing}), and in this case it is convenient to use the same wisdom
927 for every process. Thus, we need a mechanism to synchronize the wisdom.
928
929 To address both of these problems, FFTW provides the following two
930 functions:
931
932 @example
933 void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
934 void fftw_mpi_gather_wisdom(MPI_Comm comm);
935 @end example
936 @findex fftw_mpi_gather_wisdom
937 @findex fftw_mpi_broadcast_wisdom
938
939 Given a communicator @code{comm}, @code{fftw_mpi_broadcast_wisdom}
940 will broadcast the wisdom from process 0 to all other processes.
941 Conversely, @code{fftw_mpi_gather_wisdom} will collect wisdom from all
942 processes onto process 0. (If the plans created for the same problem
943 by different processes are not the same, @code{fftw_mpi_gather_wisdom}
944 will arbitrarily choose one of the plans.) Both of these functions
945 may result in suboptimal plans for different processes if the
946 processes are running on non-identical hardware. Both of these
947 functions are @emph{collective} calls, which means that they must be
948 executed by all processes in the communicator.
949 @cindex collective function
950
951
952 So, for example, a typical code snippet to import wisdom from a file
953 and use it on all processes would be:
954
955 @example
956 @{
957 int rank;
958
959 fftw_mpi_init();
960 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
961 if (rank == 0) fftw_import_wisdom_from_filename("mywisdom");
962 fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD);
963 @}
964 @end example
965
966 (Note that we must call @code{fftw_mpi_init} before importing any
967 wisdom that might contain MPI plans.) Similarly, a typical code
968 snippet to export wisdom from all processes to a file is:
969 @findex fftw_mpi_init
970
971 @example
972 @{
973 int rank;
974
975 fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
976 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
977 if (rank == 0) fftw_export_wisdom_to_filename("mywisdom");
978 @}
979 @end example
980
981 @c ------------------------------------------------------------
982 @node Avoiding MPI Deadlocks, FFTW MPI Performance Tips, FFTW MPI Wisdom, Distributed-memory FFTW with MPI
983 @section Avoiding MPI Deadlocks
984 @cindex deadlock
985
986 An MPI program can @emph{deadlock} if one process is waiting for a
987 message from another process that never gets sent. To avoid deadlocks
988 when using FFTW's MPI routines, it is important to know which
989 functions are @emph{collective}: that is, which functions must
990 @emph{always} be called in the @emph{same order} from @emph{every}
991 process in a given communicator. (For example, @code{MPI_Barrier} is
992 the canonical example of a collective function in the MPI standard.)
993 @cindex collective function
994 @findex MPI_Barrier
995
996
997 The functions in FFTW that are @emph{always} collective are: every
998 function beginning with @samp{fftw_mpi_plan}, as well as
999 @code{fftw_mpi_broadcast_wisdom} and @code{fftw_mpi_gather_wisdom}.
1000 Also, the following functions from the ordinary FFTW interface are
1001 collective when they are applied to a plan created by an
1002 @samp{fftw_mpi_plan} function: @code{fftw_execute},
1003 @code{fftw_destroy_plan}, and @code{fftw_flops}.
1004 @findex fftw_execute
1005 @findex fftw_destroy_plan
1006 @findex fftw_flops
1007
1008 @c ------------------------------------------------------------
1009 @node FFTW MPI Performance Tips, Combining MPI and Threads, Avoiding MPI Deadlocks, Distributed-memory FFTW with MPI
1010 @section FFTW MPI Performance Tips
1011
1012 In this section, we collect a few tips on getting the best performance
1013 out of FFTW's MPI transforms.
1014
1015 First, because of the 1d block distribution, FFTW's parallelization is
1016 currently limited by the size of the first dimension.
1017 (Multidimensional block distributions may be supported by a future
1018 version.) More generally, you should ideally arrange the dimensions so
1019 that FFTW can divide them equally among the processes. @xref{Load
1020 balancing}.
1021 @cindex block distribution
1022 @cindex load balancing
1023
1024
1025 Second, if it is not too inconvenient, you should consider working
1026 with transposed output for multidimensional plans, as this saves a
1027 considerable amount of communications. @xref{Transposed distributions}.
1028 @cindex transpose
1029
1030
1031 Third, the fastest choices are generally either an in-place transform
1032 or an out-of-place transform with the @code{FFTW_DESTROY_INPUT} flag
1033 (which allows the input array to be used as scratch space). In-place
1034 is especially beneficial if the amount of data per process is large.
1035 @ctindex FFTW_DESTROY_INPUT
1036
1037
1038 Fourth, if you have multiple arrays to transform at once, rather than
1039 calling FFTW's MPI transforms several times it usually seems to be
1040 faster to interleave the data and use the advanced interface. (This
1041 groups the communications together instead of requiring separate
1042 messages for each transform.)
1043
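For example, a sketch of transforming two interleaved @twodims{N0,N1}
complex arrays with a single plan via the advanced interface (a
hypothetical setup; @code{N0} and @code{N1} are assumed to be defined
elsewhere):

@example
const ptrdiff_t n[2] = @{N0, N1@};
const ptrdiff_t howmany = 2;
ptrdiff_t alloc_local, local_n0, local_0_start;
fftw_complex *data;
fftw_plan plan;

alloc_local = fftw_mpi_local_size_many(2, n, howmany,
                                       FFTW_MPI_DEFAULT_BLOCK,
                                       MPI_COMM_WORLD,
                                       &local_n0, &local_0_start);
data = fftw_alloc_complex(alloc_local);

/* @r{the howmany transforms are interleaved: element (i,j) of}
   @r{transform t is stored at} data[(i*N1 + j)*howmany + t] */
plan = fftw_mpi_plan_many_dft(2, n, howmany,
                              FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK,
                              data, data, MPI_COMM_WORLD,
                              FFTW_FORWARD, FFTW_MEASURE);
@end example
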
1044 @c ------------------------------------------------------------
1045 @node Combining MPI and Threads, FFTW MPI Reference, FFTW MPI Performance Tips, Distributed-memory FFTW with MPI
1046 @section Combining MPI and Threads
1047 @cindex threads
1048
1049 In certain cases, it may be advantageous to combine MPI
1050 (distributed-memory) and threads (shared-memory) parallelization.
1051 FFTW supports this, with certain caveats. For example, if you have a
1052 cluster of 4-processor shared-memory nodes, you may want to use
1053 threads within the nodes and MPI between the nodes, instead of MPI for
1054 all parallelization.
1055
1056 In particular, it is possible to seamlessly combine the MPI FFTW
1057 routines with the multi-threaded FFTW routines (@pxref{Multi-threaded
1058 FFTW}). However, some care must be taken in the initialization code,
1059 which should look something like this:
1060
1061 @example
1062 int threads_ok;
1063
1064 int main(int argc, char **argv)
1065 @{
1066 int provided;
1067 MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
1068 threads_ok = provided >= MPI_THREAD_FUNNELED;
1069
1070 if (threads_ok) threads_ok = fftw_init_threads();
1071 fftw_mpi_init();
1072
1073 ...
1074 if (threads_ok) fftw_plan_with_nthreads(...);
1075 ...
1076
1077 MPI_Finalize();
1078 @}
1079 @end example
1080 @findex fftw_mpi_init
1081 @findex fftw_init_threads
1082 @findex fftw_plan_with_nthreads
1083
1084 First, note that instead of calling @code{MPI_Init}, you should call
@code{MPI_Init_thread}, which is the initialization routine defined
1086 by the MPI-2 standard to indicate to MPI that your program will be
1087 multithreaded. We pass @code{MPI_THREAD_FUNNELED}, which indicates
1088 that we will only call MPI routines from the main thread. (FFTW will
1089 launch additional threads internally, but the extra threads will not
1090 call MPI code.) (You may also pass @code{MPI_THREAD_SERIALIZED} or
1091 @code{MPI_THREAD_MULTIPLE}, which requests additional multithreading
1092 support from the MPI implementation, but this is not required by
1093 FFTW.) The @code{provided} parameter returns what level of threads
1094 support is actually supported by your MPI implementation; this
1095 @emph{must} be at least @code{MPI_THREAD_FUNNELED} if you want to call
1096 the FFTW threads routines, so we define a global variable
1097 @code{threads_ok} to record this. You should only call
1098 @code{fftw_init_threads} or @code{fftw_plan_with_nthreads} if
1099 @code{threads_ok} is true. For more information on thread safety in
1100 MPI, see the
1101 @uref{http://www.mpi-forum.org/docs/mpi-20-html/node162.htm, MPI and
1102 Threads} section of the MPI-2 standard.
1103 @cindex thread safety
1104
1105
1106 Second, we must call @code{fftw_init_threads} @emph{before}
1107 @code{fftw_mpi_init}. This is critical for technical reasons having
1108 to do with how FFTW initializes its list of algorithms.
1109
1110 Then, if you call @code{fftw_plan_with_nthreads(N)}, @emph{every} MPI
1111 process will launch (up to) @code{N} threads to parallelize its transforms.
1112
1113 For example, in the hypothetical cluster of 4-processor nodes, you
1114 might wish to launch only a single MPI process per node, and then call
1115 @code{fftw_plan_with_nthreads(4)} on each process to use all
1116 processors in the nodes.
1117
1118 This may or may not be faster than simply using as many MPI processes
1119 as you have processors, however. On the one hand, using threads
1120 within a node eliminates the need for explicit message passing within
1121 the node. On the other hand, FFTW's transpose routines are not
1122 multi-threaded, and this means that the communications that do take
1123 place will not benefit from parallelization within the node.
1124 Moreover, many MPI implementations already have optimizations to
1125 exploit shared memory when it is available, so adding the
1126 multithreaded FFTW on top of this may be superfluous.
1127 @cindex transpose
1128
1129 @c ------------------------------------------------------------
1130 @node FFTW MPI Reference, FFTW MPI Fortran Interface, Combining MPI and Threads, Distributed-memory FFTW with MPI
1131 @section FFTW MPI Reference
1132
1133 This chapter provides a complete reference to all FFTW MPI functions,
1134 datatypes, and constants. See also @ref{FFTW Reference} for information
1135 on functions and types in common with the serial interface.
1136
1137 @menu
1138 * MPI Files and Data Types::
1139 * MPI Initialization::
1140 * Using MPI Plans::
1141 * MPI Data Distribution Functions::
1142 * MPI Plan Creation::
1143 * MPI Wisdom Communication::
1144 @end menu
1145
1146 @node MPI Files and Data Types, MPI Initialization, FFTW MPI Reference, FFTW MPI Reference
1147 @subsection MPI Files and Data Types
1148
1149 All programs using FFTW's MPI support should include its header file:
1150
1151 @example
1152 #include <fftw3-mpi.h>
1153 @end example
1154
1155 Note that this header file includes the serial-FFTW @code{fftw3.h}
1156 header file, and also the @code{mpi.h} header file for MPI, so you
1157 need not include those files separately.
1158
1159 You must also link to @emph{both} the FFTW MPI library and to the
1160 serial FFTW library. On Unix, this means adding @code{-lfftw3_mpi
1161 -lfftw3 -lm} at the end of the link command.
1162
1163 @cindex precision
1164 Different precisions are handled as in the serial interface:
1165 @xref{Precision}. That is, @samp{fftw_} functions become
1166 @code{fftwf_} (in single precision) etcetera, and the libraries become
1167 @code{-lfftw3f_mpi -lfftw3f -lm} etcetera on Unix. Long-double
1168 precision is supported in MPI, but quad precision (@samp{fftwq_}) is
1169 not due to the lack of MPI support for this type.
1170
1171 @node MPI Initialization, Using MPI Plans, MPI Files and Data Types, FFTW MPI Reference
1172 @subsection MPI Initialization
1173
1174 Before calling any other FFTW MPI (@samp{fftw_mpi_}) function, and
1175 before importing any wisdom for MPI problems, you must call:
1176
1177 @findex fftw_mpi_init
1178 @example
1179 void fftw_mpi_init(void);
1180 @end example
1181
1182 @findex fftw_init_threads
1183 If FFTW threads support is used, however, @code{fftw_mpi_init} should
1184 be called @emph{after} @code{fftw_init_threads} (@pxref{Combining MPI
1185 and Threads}). Calling @code{fftw_mpi_init} additional times (before
1186 @code{fftw_mpi_cleanup}) has no effect.
1187
1188
1189 If you want to deallocate all persistent data and reset FFTW to the
1190 pristine state it was in when you started your program, you can call:
1191
1192 @findex fftw_mpi_cleanup
1193 @example
1194 void fftw_mpi_cleanup(void);
1195 @end example
1196
1197 @findex fftw_cleanup
1198 (This calls @code{fftw_cleanup}, so you need not call the serial
1199 cleanup routine too, although it is safe to do so.) After calling
1200 @code{fftw_mpi_cleanup}, all existing plans become undefined, and you
1201 should not attempt to execute or destroy them. You must call
1202 @code{fftw_mpi_init} again after @code{fftw_mpi_cleanup} if you want
1203 to resume using the MPI FFTW routines.
1204
1205 @node Using MPI Plans, MPI Data Distribution Functions, MPI Initialization, FFTW MPI Reference
1206 @subsection Using MPI Plans
1207
1208 Once an MPI plan is created, you can execute and destroy it using
1209 @code{fftw_execute}, @code{fftw_destroy_plan}, and the other functions
1210 in the serial interface that operate on generic plans (@pxref{Using
1211 Plans}).
1212
1213 @cindex collective function
1214 @cindex MPI communicator
1215 The @code{fftw_execute} and @code{fftw_destroy_plan} functions, applied to
1216 MPI plans, are @emph{collective} calls: they must be called for all processes
1217 in the communicator that was used to create the plan.
1218
1219 @cindex new-array execution
1220 You must @emph{not} use the serial new-array plan-execution functions
1221 @code{fftw_execute_dft} and so on (@pxref{New-array Execute
1222 Functions}) with MPI plans. Such functions are specialized to the
1223 problem type, and there are specific new-array execute functions for MPI plans:
1224
1225 @findex fftw_mpi_execute_dft
1226 @findex fftw_mpi_execute_dft_r2c
1227 @findex fftw_mpi_execute_dft_c2r
1228 @findex fftw_mpi_execute_r2r
1229 @example
1230 void fftw_mpi_execute_dft(fftw_plan p, fftw_complex *in, fftw_complex *out);
1231 void fftw_mpi_execute_dft_r2c(fftw_plan p, double *in, fftw_complex *out);
1232 void fftw_mpi_execute_dft_c2r(fftw_plan p, fftw_complex *in, double *out);
1233 void fftw_mpi_execute_r2r(fftw_plan p, double *in, double *out);
1234 @end example
1235
1236 @cindex alignment
1237 @findex fftw_malloc
1238 These functions have the same restrictions as those of the serial
1239 new-array execute functions. They are @emph{always} safe to apply to
1240 the @emph{same} @code{in} and @code{out} arrays that were used to
1241 create the plan. They can only be applied to new arrays if those
1242 arrays have the same types, dimensions, in-placeness, and alignment as
1243 the original arrays, where the best way to ensure the same alignment
1244 is to use FFTW's @code{fftw_malloc} and related allocation functions
1245 for all arrays (@pxref{Memory Allocation}). Note that distributed
1246 transposes (@pxref{FFTW MPI Transposes}) use
1247 @code{fftw_mpi_execute_r2r}, since they count as rank-zero r2r plans
1248 from FFTW's perspective.
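
For example, an out-of-place MPI plan can be re-applied to a second,
identically laid-out pair of arrays (a sketch, where @code{plan} and
@code{alloc_local} are assumed to come from the corresponding plan
creation and @samp{local_size} calls):

@example
fftw_complex *in2  = fftw_alloc_complex(alloc_local);
fftw_complex *out2 = fftw_alloc_complex(alloc_local);
/* ... initialize in2 ... */
fftw_mpi_execute_dft(plan, in2, out2);  /* collective, like fftw_execute */
@end example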
1249
1250 @node MPI Data Distribution Functions, MPI Plan Creation, Using MPI Plans, FFTW MPI Reference
1251 @subsection MPI Data Distribution Functions
1252
1253 @cindex data distribution
1254 As described above (@pxref{MPI Data Distribution}), in order to
1255 allocate your arrays, @emph{before} creating a plan, you must first
1256 call one of the following routines to determine the required
1257 allocation size and the portion of the array locally stored on a given
1258 process. The @code{MPI_Comm} communicator passed here must be
1259 equivalent to the communicator used below for plan creation.
1260
1261 The basic interface for multidimensional transforms consists of the
1262 functions:
1263
1264 @findex fftw_mpi_local_size_2d
1265 @findex fftw_mpi_local_size_3d
1266 @findex fftw_mpi_local_size
1267 @findex fftw_mpi_local_size_2d_transposed
1268 @findex fftw_mpi_local_size_3d_transposed
1269 @findex fftw_mpi_local_size_transposed
1270 @example
1271 ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
1272 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
1273 ptrdiff_t fftw_mpi_local_size_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
1274 MPI_Comm comm,
1275 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
1276 ptrdiff_t fftw_mpi_local_size(int rnk, const ptrdiff_t *n, MPI_Comm comm,
1277 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
1278
1279 ptrdiff_t fftw_mpi_local_size_2d_transposed(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
1280 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
1281 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
1282 ptrdiff_t fftw_mpi_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
1283 MPI_Comm comm,
1284 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
1285 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
1286 ptrdiff_t fftw_mpi_local_size_transposed(int rnk, const ptrdiff_t *n, MPI_Comm comm,
1287 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
1288 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
1289 @end example
1290
1291 These functions return the number of elements to allocate (complex
1292 numbers for DFT/r2c/c2r plans, real numbers for r2r plans), whereas
1293 the @code{local_n0} and @code{local_0_start} arguments return the portion
1294 (@code{local_0_start} to @code{local_0_start + local_n0 - 1}) of the
1295 first dimension of an @ndims{} array that is stored on the local
1296 process. @xref{Basic and advanced distribution interfaces}. For
1297 @code{FFTW_MPI_TRANSPOSED_OUT} plans, the @samp{_transposed} variants
1298 are useful in order to also return the local portion of the first
1299 dimension in the @ndimstrans{} transposed output.
1300 @xref{Transposed distributions}.
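
For example, allocation for a two-dimensional complex DFT might look
like the following sketch (where @code{n0} and @code{n1} stand for the
overall transform dimensions):

@example
ptrdiff_t local_n0, local_0_start;
ptrdiff_t alloc_local = fftw_mpi_local_size_2d(n0, n1, MPI_COMM_WORLD,
                                               &local_n0, &local_0_start);
fftw_complex *data = fftw_alloc_complex(alloc_local);
/* this process stores rows local_0_start .. local_0_start+local_n0-1;
   alloc_local may exceed local_n0 * n1 */
@end example
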
1301 The advanced interface for multidimensional transforms is:
1302
1303 @cindex advanced interface
1304 @findex fftw_mpi_local_size_many
1305 @findex fftw_mpi_local_size_many_transposed
1306 @example
1307 ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
1308 ptrdiff_t block0, MPI_Comm comm,
1309 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
1310 ptrdiff_t fftw_mpi_local_size_many_transposed(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
1311 ptrdiff_t block0, ptrdiff_t block1, MPI_Comm comm,
1312 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
1313 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
1314 @end example
1315
1316 These differ from the basic interface in only two ways. First, they
1317 allow you to specify block sizes @code{block0} and @code{block1} (the
1318 latter for the transposed output); you can pass
1319 @code{FFTW_MPI_DEFAULT_BLOCK} to use FFTW's default block size as in
1320 the basic interface. Second, you can pass a @code{howmany} parameter,
1321 corresponding to the advanced planning interface below: this is for
1322 transforms of contiguous @code{howmany}-tuples of numbers
1323 (@code{howmany = 1} in the basic interface).
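
For instance, transforming contiguous 3-tuples (say, a three-component
field) on an @twodims{n0, n1} grid with the default block size might be
set up as in this sketch:

@example
const ptrdiff_t n[2] = @{n0, n1@};
ptrdiff_t local_n0, local_0_start;
ptrdiff_t alloc_local =
    fftw_mpi_local_size_many(2, n, 3, FFTW_MPI_DEFAULT_BLOCK,
                             MPI_COMM_WORLD, &local_n0, &local_0_start);
fftw_complex *data = fftw_alloc_complex(alloc_local);  /* includes the factor of 3 */
@end example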
1324
1325 The corresponding basic and advanced routines for one-dimensional
1326 transforms (currently only complex DFTs) are:
1327
1328 @findex fftw_mpi_local_size_1d
1329 @findex fftw_mpi_local_size_many_1d
1330 @example
1331 ptrdiff_t fftw_mpi_local_size_1d(
1332 ptrdiff_t n0, MPI_Comm comm, int sign, unsigned flags,
1333 ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
1334 ptrdiff_t *local_no, ptrdiff_t *local_o_start);
1335 ptrdiff_t fftw_mpi_local_size_many_1d(
1336 ptrdiff_t n0, ptrdiff_t howmany,
1337 MPI_Comm comm, int sign, unsigned flags,
1338 ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
1339 ptrdiff_t *local_no, ptrdiff_t *local_o_start);
1340 @end example
1341
1342 @ctindex FFTW_MPI_SCRAMBLED_OUT
1343 @ctindex FFTW_MPI_SCRAMBLED_IN
1344 As above, the return value is the number of elements to allocate
1345 (complex numbers, for complex DFTs). The @code{local_ni} and
1346 @code{local_i_start} arguments return the portion
1347 (@code{local_i_start} to @code{local_i_start + local_ni - 1}) of the
1348 1d array that is stored on this process for the transform
1349 @emph{input}, and @code{local_no} and @code{local_o_start} are the
1350 corresponding quantities for the @emph{output}. The @code{sign}
1351 (@code{FFTW_FORWARD} or @code{FFTW_BACKWARD}) and @code{flags} must
1352 match the arguments passed when creating a plan. Although the inputs
1353 and outputs have different data distributions in general, it is
1354 guaranteed that the @emph{output} data distribution of an
1355 @code{FFTW_FORWARD} plan will match the @emph{input} data distribution
1356 of an @code{FFTW_BACKWARD} plan and vice versa; similarly for the
1357 @code{FFTW_MPI_SCRAMBLED_OUT} and @code{FFTW_MPI_SCRAMBLED_IN} flags.
1358 @xref{One-dimensional distributions}.
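
A sketch of the corresponding allocation for a one-dimensional
transform, assuming the plan will later be created with
@code{FFTW_FORWARD} and @code{FFTW_MEASURE} and no MPI-specific flags:

@example
ptrdiff_t local_ni, local_i_start, local_no, local_o_start;
ptrdiff_t alloc_local =
    fftw_mpi_local_size_1d(n0, MPI_COMM_WORLD, FFTW_FORWARD, FFTW_MEASURE,
                           &local_ni, &local_i_start,
                           &local_no, &local_o_start);
fftw_complex *in  = fftw_alloc_complex(alloc_local);
fftw_complex *out = fftw_alloc_complex(alloc_local);
@end example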
1359
1360 @node MPI Plan Creation, MPI Wisdom Communication, MPI Data Distribution Functions, FFTW MPI Reference
1361 @subsection MPI Plan Creation
1362
1363 @subsubheading Complex-data MPI DFTs
1364
1365 Plans for complex-data DFTs (@pxref{2d MPI example}) are created by:
1366
1367 @findex fftw_mpi_plan_dft_1d
1368 @findex fftw_mpi_plan_dft_2d
1369 @findex fftw_mpi_plan_dft_3d
1370 @findex fftw_mpi_plan_dft
1371 @findex fftw_mpi_plan_many_dft
1372 @example
1373 fftw_plan fftw_mpi_plan_dft_1d(ptrdiff_t n0, fftw_complex *in, fftw_complex *out,
1374 MPI_Comm comm, int sign, unsigned flags);
1375 fftw_plan fftw_mpi_plan_dft_2d(ptrdiff_t n0, ptrdiff_t n1,
1376 fftw_complex *in, fftw_complex *out,
1377 MPI_Comm comm, int sign, unsigned flags);
1378 fftw_plan fftw_mpi_plan_dft_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
1379 fftw_complex *in, fftw_complex *out,
1380 MPI_Comm comm, int sign, unsigned flags);
1381 fftw_plan fftw_mpi_plan_dft(int rnk, const ptrdiff_t *n,
1382 fftw_complex *in, fftw_complex *out,
1383 MPI_Comm comm, int sign, unsigned flags);
1384 fftw_plan fftw_mpi_plan_many_dft(int rnk, const ptrdiff_t *n,
1385 ptrdiff_t howmany, ptrdiff_t block, ptrdiff_t tblock,
1386 fftw_complex *in, fftw_complex *out,
1387 MPI_Comm comm, int sign, unsigned flags);
1388 @end example
1389
1390 @cindex MPI communicator
1391 @cindex collective function
1392 These are similar to their serial counterparts (@pxref{Complex DFTs})
1393 in specifying the dimensions, sign, and flags of the transform. The
1394 @code{comm} argument gives an MPI communicator that specifies the set
1395 of processes to participate in the transform; plan creation is a
1396 collective function that must be called for all processes in the
1397 communicator. The @code{in} and @code{out} pointers refer only to a
1398 portion of the overall transform data (@pxref{MPI Data Distribution})
1399 as specified by the @samp{local_size} functions in the previous
1400 section. Unless @code{flags} contains @code{FFTW_ESTIMATE}, these
1401 arrays are overwritten during plan creation as for the serial
1402 interface. For multi-dimensional transforms, any dimensions @code{>
1403 1} are supported; for one-dimensional transforms, only composite
1404 (non-prime) @code{n0} are currently supported (unlike the serial
1405 FFTW). Requesting an unsupported transform size will yield a
1406 @code{NULL} plan. (As in the serial interface, highly composite sizes
1407 generally yield the best performance.)
1408
1409 @cindex advanced interface
1410 @ctindex FFTW_MPI_DEFAULT_BLOCK
1411 @cindex stride
1412 The advanced-interface @code{fftw_mpi_plan_many_dft} additionally
1413 allows you to specify the block sizes for the first dimension
1414 (@code{block}) of the @ndims{} input data and the first dimension
1415 (@code{tblock}) of the @ndimstrans{} transposed data (at intermediate
1416 steps of the transform, and for the output if
1417 @code{FFTW_MPI_TRANSPOSED_OUT} is specified in @code{flags}). These must
1418 be the same block sizes as were passed to the corresponding
1419 @samp{local_size} function; you can pass @code{FFTW_MPI_DEFAULT_BLOCK}
1420 to use FFTW's default block size as in the basic interface. Also, the
1421 @code{howmany} parameter specifies that the transform is of contiguous
1422 @code{howmany}-tuples rather than individual complex numbers; this
1423 corresponds to the same parameter in the serial advanced interface
1424 (@pxref{Advanced Complex DFTs}) with @code{stride = howmany} and
1425 @code{dist = 1}.
1426
1427 @subsubheading MPI flags
1428
1429 The @code{flags} can be any of those for the serial FFTW
1430 (@pxref{Planner Flags}), and in addition may include one or more of
1431 the following MPI-specific flags, which improve performance at the
1432 cost of changing the output or input data formats.
1433
1434 @itemize @bullet
1435
1436 @item
1437 @ctindex FFTW_MPI_SCRAMBLED_OUT
1438 @ctindex FFTW_MPI_SCRAMBLED_IN
1439 @code{FFTW_MPI_SCRAMBLED_OUT}, @code{FFTW_MPI_SCRAMBLED_IN}: valid for
1440 1d transforms only, these flags indicate that the output/input of the
1441 transform are in an undocumented ``scrambled'' order. A forward
1442 @code{FFTW_MPI_SCRAMBLED_OUT} transform can be inverted by a backward
1443 @code{FFTW_MPI_SCRAMBLED_IN} (times the usual 1/@i{N} normalization).
1444 @xref{One-dimensional distributions}.
1445
1446 @item
1447 @ctindex FFTW_MPI_TRANSPOSED_OUT
1448 @ctindex FFTW_MPI_TRANSPOSED_IN
1449 @code{FFTW_MPI_TRANSPOSED_OUT}, @code{FFTW_MPI_TRANSPOSED_IN}: valid
1450 for multidimensional (@code{rnk > 1}) transforms only, these flags
1451 specify that the output or input of an @ndims{} transform is
1452 transposed to @ndimstrans{}; see the sketch after this list. @xref{Transposed distributions}.
1453
1454 @end itemize
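
As an illustration, the following sketch combines
@code{FFTW_MPI_TRANSPOSED_OUT} with the matching @samp{_transposed}
allocation call for an in-place two-dimensional complex DFT (again,
@code{n0} and @code{n1} are the overall transform dimensions):

@example
ptrdiff_t local_n0, local_0_start, local_n1, local_1_start;
ptrdiff_t alloc_local =
    fftw_mpi_local_size_2d_transposed(n0, n1, MPI_COMM_WORLD,
                                      &local_n0, &local_0_start,
                                      &local_n1, &local_1_start);
fftw_complex *data = fftw_alloc_complex(alloc_local);
fftw_plan plan =
    fftw_mpi_plan_dft_2d(n0, n1, data, data, MPI_COMM_WORLD, FFTW_FORWARD,
                         FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);
/* on output, this process holds local_n1 rows (starting at local_1_start)
   of the transposed n1 x n0 array */
@end example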
1455
1456 @subsubheading Real-data MPI DFTs
1457
1458 @cindex r2c
1459 Plans for real-input/output (r2c/c2r) DFTs (@pxref{Multi-dimensional
1460 MPI DFTs of Real Data}) are created by:
1461
1462 @findex fftw_mpi_plan_dft_r2c_2d
1464 @findex fftw_mpi_plan_dft_r2c_3d
1465 @findex fftw_mpi_plan_dft_r2c
1466 @findex fftw_mpi_plan_dft_c2r_2d
1468 @findex fftw_mpi_plan_dft_c2r_3d
1469 @findex fftw_mpi_plan_dft_c2r
1470 @example
1471 fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1,
1472 double *in, fftw_complex *out,
1473 MPI_Comm comm, unsigned flags);
1477 fftw_plan fftw_mpi_plan_dft_r2c_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
1478 double *in, fftw_complex *out,
1479 MPI_Comm comm, unsigned flags);
1480 fftw_plan fftw_mpi_plan_dft_r2c(int rnk, const ptrdiff_t *n,
1481 double *in, fftw_complex *out,
1482 MPI_Comm comm, unsigned flags);
1483 fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1,
1484 fftw_complex *in, double *out,
1485 MPI_Comm comm, unsigned flags);
1489 fftw_plan fftw_mpi_plan_dft_c2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
1490 fftw_complex *in, double *out,
1491 MPI_Comm comm, unsigned flags);
1492 fftw_plan fftw_mpi_plan_dft_c2r(int rnk, const ptrdiff_t *n,
1493 fftw_complex *in, double *out,
1494 MPI_Comm comm, unsigned flags);
1495 @end example
1496
1497 Similar to the serial interface (@pxref{Real-data DFTs}), these
1498 transform logically @ndims{} real data to/from @ndimshalf{} complex
1499 data, representing the non-redundant half of the conjugate-symmetry
1500 output of a real-input DFT (@pxref{Multi-dimensional Transforms}).
1501 However, the real array must be stored within a padded @ndimspad{}
1502 array (much like the in-place serial r2c transforms, but here for
1503 out-of-place transforms as well). Currently, only multi-dimensional
1504 (@code{rnk > 1}) r2c/c2r transforms are supported (requesting a plan
1505 for @code{rnk = 1} will yield @code{NULL}). As explained above
1506 (@pxref{Multi-dimensional MPI DFTs of Real Data}), the data
1507 distribution of both the real and complex arrays is given by the
1508 @samp{local_size} function called for the dimensions of the
1509 @emph{complex} array. Similar to the other planning functions, the
1510 input and output arrays are overwritten when the plan is created
1511 except in @code{FFTW_ESTIMATE} mode.
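
A sketch of the corresponding setup for a two-dimensional out-of-place
r2c transform (with @code{n0} and @code{n1} the overall real-array
dimensions):

@example
ptrdiff_t local_n0, local_0_start;
/* note: the local size is computed for the complex dimensions n0 x (n1/2+1) */
ptrdiff_t alloc_local =
    fftw_mpi_local_size_2d(n0, n1/2 + 1, MPI_COMM_WORLD,
                           &local_n0, &local_0_start);
double *rin = fftw_alloc_real(2 * alloc_local);   /* padded rows of 2*(n1/2+1) */
fftw_complex *cout = fftw_alloc_complex(alloc_local);
fftw_plan plan = fftw_mpi_plan_dft_r2c_2d(n0, n1, rin, cout,
                                          MPI_COMM_WORLD, FFTW_MEASURE);
@end example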
1512
1513 As for the complex DFTs above, there is an advanced interface that
1514 allows you to manually specify block sizes and to transform contiguous
1515 @code{howmany}-tuples of real/complex numbers:
1516
1517 @findex fftw_mpi_plan_many_dft_r2c
1518 @findex fftw_mpi_plan_many_dft_c2r
1519 @example
1520 fftw_plan fftw_mpi_plan_many_dft_r2c
1521 (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
1522 ptrdiff_t iblock, ptrdiff_t oblock,
1523 double *in, fftw_complex *out,
1524 MPI_Comm comm, unsigned flags);
1525 fftw_plan fftw_mpi_plan_many_dft_c2r
1526 (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
1527 ptrdiff_t iblock, ptrdiff_t oblock,
1528 fftw_complex *in, double *out,
1529 MPI_Comm comm, unsigned flags);
1530 @end example
1531
1532 @subsubheading MPI r2r transforms
1533
1534 @cindex r2r
1535 There are corresponding plan-creation routines for r2r
1536 transforms (@pxref{More DFTs of Real Data}), currently supporting
1537 multidimensional (@code{rnk > 1}) transforms only (@code{rnk = 1} will
1538 yield a @code{NULL} plan):
1539
1540 @example
1541 fftw_plan fftw_mpi_plan_r2r_2d(ptrdiff_t n0, ptrdiff_t n1,
1542 double *in, double *out,
1543 MPI_Comm comm,
1544 fftw_r2r_kind kind0, fftw_r2r_kind kind1,
1545 unsigned flags);
1546 fftw_plan fftw_mpi_plan_r2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
1547 double *in, double *out,
1548 MPI_Comm comm,
1549 fftw_r2r_kind kind0, fftw_r2r_kind kind1, fftw_r2r_kind kind2,
1550 unsigned flags);
1551 fftw_plan fftw_mpi_plan_r2r(int rnk, const ptrdiff_t *n,
1552 double *in, double *out,
1553 MPI_Comm comm, const fftw_r2r_kind *kind,
1554 unsigned flags);
1555 fftw_plan fftw_mpi_plan_many_r2r(int rnk, const ptrdiff_t *n,
1556 ptrdiff_t iblock, ptrdiff_t oblock,
1557 double *in, double *out,
1558 MPI_Comm comm, const fftw_r2r_kind *kind,
1559 unsigned flags);
1560 @end example
1561
1562 The parameters are much the same as for the complex DFTs above, except
1563 that the arrays are of real numbers (and hence the outputs of the
1564 @samp{local_size} data-distribution functions should be interpreted as
1565 counts of real rather than complex numbers). Also, the @code{kind}
1566 parameters specify the r2r kinds along each dimension as for the
1567 serial interface (@pxref{Real-to-Real Transform Kinds}). @xref{Other
1568 Multi-dimensional Real-data MPI Transforms}.
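
For example, a two-dimensional type-II DCT (@code{FFTW_REDFT10}) along
both dimensions might be planned as in this sketch:

@example
ptrdiff_t local_n0, local_0_start;
ptrdiff_t alloc_local = fftw_mpi_local_size_2d(n0, n1, MPI_COMM_WORLD,
                                               &local_n0, &local_0_start);
double *in  = fftw_alloc_real(alloc_local);   /* counts are real numbers for r2r */
double *out = fftw_alloc_real(alloc_local);
fftw_plan plan = fftw_mpi_plan_r2r_2d(n0, n1, in, out, MPI_COMM_WORLD,
                                      FFTW_REDFT10, FFTW_REDFT10,
                                      FFTW_MEASURE);
@end example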
1569
1570 @subsubheading MPI transposition
1571 @cindex transpose
1572
1573 FFTW also provides routines to plan a transpose of a distributed
1574 @code{n0} by @code{n1} array of real numbers, or an array of
1575 @code{howmany}-tuples of real numbers with specified block sizes
1576 (@pxref{FFTW MPI Transposes}):
1577
1578 @findex fftw_mpi_plan_transpose
1579 @findex fftw_mpi_plan_many_transpose
1580 @example
1581 fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
1582 double *in, double *out,
1583 MPI_Comm comm, unsigned flags);
1584 fftw_plan fftw_mpi_plan_many_transpose
1585 (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
1586 ptrdiff_t block0, ptrdiff_t block1,
1587 double *in, double *out, MPI_Comm comm, unsigned flags);
1588 @end example
1589
1590 @cindex new-array execution
1591 @findex fftw_mpi_execute_r2r
1592 These plans are used with the @code{fftw_mpi_execute_r2r} new-array
1593 execute function (@pxref{Using MPI Plans}), since they count as (rank
1594 zero) r2r plans from FFTW's perspective.
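
For example, an out-of-place transpose of a distributed @twodims{n0, n1}
real array might be planned and executed as follows (a sketch, assuming
@code{in} and @code{out} were allocated according to
@code{fftw_mpi_local_size_2d_transposed}):

@example
fftw_plan tplan = fftw_mpi_plan_transpose(n0, n1, in, out,
                                          MPI_COMM_WORLD, FFTW_MEASURE);
fftw_execute(tplan);            /* or fftw_mpi_execute_r2r(tplan, in, out) */
fftw_destroy_plan(tplan);
@end example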
1595
1596 @node MPI Wisdom Communication, , MPI Plan Creation, FFTW MPI Reference
1597 @subsection MPI Wisdom Communication
1598
1599 To facilitate synchronizing wisdom among the different MPI processes,
1600 we provide two functions:
1601
1602 @findex fftw_mpi_gather_wisdom
1603 @findex fftw_mpi_broadcast_wisdom
1604 @example
1605 void fftw_mpi_gather_wisdom(MPI_Comm comm);
1606 void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
1607 @end example
1608
1609 The @code{fftw_mpi_gather_wisdom} function gathers all wisdom in the
1610 given communicator @code{comm} to the process of rank 0 in the
1611 communicator: that process obtains the union of all wisdom on all the
1612 processes. As a side effect, some other processes will gain
1613 additional wisdom from other processes, but only process 0 will gain
1614 the complete union.
1615
1616 The @code{fftw_mpi_broadcast_wisdom} function does the reverse: it exports
1617 wisdom from process 0 in @code{comm} to all other processes in the
1618 communicator, replacing any wisdom they currently have.
1619
1620 @xref{FFTW MPI Wisdom}.
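
A typical pattern (a sketch; the file name @code{wis.dat} is just an
example) is to import any saved wisdom on process 0, broadcast it
before planning, and gather it back afterwards:

@example
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (rank == 0) fftw_import_wisdom_from_filename("wis.dat");
fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD);

/* ... create plans ... */

fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
if (rank == 0) fftw_export_wisdom_to_filename("wis.dat");
@end example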
1621
1622 @c ------------------------------------------------------------
1623 @node FFTW MPI Fortran Interface, , FFTW MPI Reference, Distributed-memory FFTW with MPI
1624 @section FFTW MPI Fortran Interface
1625 @cindex Fortran interface
1626
1627 @cindex iso_c_binding
1628 The FFTW MPI interface is callable from modern Fortran compilers
1629 supporting the Fortran 2003 @code{iso_c_binding} standard for calling
1630 C functions. As described in @ref{Calling FFTW from Modern Fortran},
1631 this means that you can directly call FFTW's C interface from Fortran
1632 with only minor changes in syntax. There are, however, a few things
1633 specific to the MPI interface to keep in mind:
1634
1635 @itemize @bullet
1636
1637 @item
1638 Instead of including @code{fftw3.f03} as in @ref{Overview of Fortran
1639 interface}, you should @code{include 'fftw3-mpi.f03'} (after
1640 @code{use, intrinsic :: iso_c_binding} as before). The
1641 @code{fftw3-mpi.f03} file includes @code{fftw3.f03}, so you should
1642 @emph{not} @code{include} them both yourself. (You will also want to
1643 include the MPI header file, usually via @code{include 'mpif.h'} or
1644 similar, although this is not needed by @code{fftw3-mpi.f03}
1645 @i{per se}.) (To use the @samp{fftwl_} @code{long double} extended-precision routines in supporting compilers, you should include @code{fftw3l-mpi.f03} in @emph{addition} to @code{fftw3-mpi.f03}. @xref{Extended and quadruple precision in Fortran}.)
1646
1647 @item
1648 Because of the different storage conventions between C and Fortran,
1649 you reverse the order of your array dimensions when passing them to
1650 FFTW (@pxref{Reversing array dimensions}). This is merely a
1651 difference in notation and incurs no performance overhead. However,
1652 it means that, whereas in C the @emph{first} dimension is distributed,
1653 in Fortran the @emph{last} dimension of your array is distributed.
1654
1655 @item
1656 @cindex MPI communicator
1657 In Fortran, communicators are stored as @code{integer} types; there is
1658 no @code{MPI_Comm} type, nor is there any way to access a C
1659 @code{MPI_Comm}. Fortunately, this is taken care of for you by the
1660 FFTW Fortran interface: whenever the C interface expects an
1661 @code{MPI_Comm} type, you should pass the Fortran communicator as an
1662 @code{integer}.@footnote{Technically, this is because you aren't
1663 actually calling the C functions directly. You are calling wrapper
1664 functions that translate the communicator with @code{MPI_Comm_f2c}
1665 before calling the ordinary C interface. This is all done
1666 transparently, however, since the @code{fftw3-mpi.f03} interface file
1667 renames the wrappers so that they are called in Fortran with the same
1668 names as the C interface functions.}
1669
1670 @item
1671 Because you need to call the @samp{local_size} function to find out
1672 how much space to allocate, and this may be @emph{larger} than the
1673 local portion of the array (@pxref{MPI Data Distribution}), you should
1674 @emph{always} allocate your arrays dynamically using FFTW's allocation
1675 routines as described in @ref{Allocating aligned memory in Fortran}.
1676 (Coincidentally, this also provides the best performance by
1677 guaranteeing proper data alignment.)
1678
1679 @item
1680 Because all sizes in the MPI FFTW interface are declared as
1681 @code{ptrdiff_t} in C, you should use @code{integer(C_INTPTR_T)} in
1682 Fortran (@pxref{FFTW Fortran type reference}).
1683
1684 @item
1685 @findex fftw_execute_dft
1686 @findex fftw_mpi_execute_dft
1687 @cindex new-array execution
1688 In Fortran, because of the language semantics, we generally recommend
1689 using the new-array execute functions for all plans, even in the
1690 common case where you are executing the plan on the same arrays for
1691 which the plan was created (@pxref{Plan execution in Fortran}).
1692 However, note that in the MPI interface these functions are changed:
1693 @code{fftw_execute_dft} becomes @code{fftw_mpi_execute_dft},
1694 etcetera. @xref{Using MPI Plans}.
1695
1696 @end itemize
1697
1698 For example, here is a Fortran code snippet to perform a distributed
1699 @twodims{L,M} complex DFT in-place. (This assumes you have already
1700 initialized MPI with @code{MPI_init} and have also performed
1701 @code{call fftw_mpi_init}.)
1702
1703 @example
1704 use, intrinsic :: iso_c_binding
1705 include 'fftw3-mpi.f03'
1706 integer(C_INTPTR_T), parameter :: L = ...
1707 integer(C_INTPTR_T), parameter :: M = ...
1708 type(C_PTR) :: plan, cdata
1709 complex(C_DOUBLE_COMPLEX), pointer :: data(:,:)
1710 integer(C_INTPTR_T) :: i, j, alloc_local, local_M, local_j_offset
1711
1712 ! @r{get local data size and allocate (note dimension reversal)}
1713 alloc_local = fftw_mpi_local_size_2d(M, L, MPI_COMM_WORLD, &
1714 local_M, local_j_offset)
1715 cdata = fftw_alloc_complex(alloc_local)
1716 call c_f_pointer(cdata, data, [L,local_M])
1717
1718 ! @r{create MPI plan for in-place forward DFT (note dimension reversal)}
1719 plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, &
1720 FFTW_FORWARD, FFTW_MEASURE)
1721
1722 ! @r{initialize data to some function} my_function(i,j)
1723 do j = 1, local_M
1724 do i = 1, L
1725 data(i, j) = my_function(i, j + local_j_offset)
1726 end do
1727 end do
1728
1729 ! @r{compute transform (as many times as desired)}
1730 call fftw_mpi_execute_dft(plan, data, data)
1731
1732 call fftw_destroy_plan(plan)
1733 call fftw_free(cdata)
1734 @end example
1735
1736 Note that we called @code{fftw_mpi_local_size_2d} and
1737 @code{fftw_mpi_plan_dft_2d} with the dimensions in reversed order,
1738 since a @twodims{L,M} Fortran array is viewed by FFTW in C as a
1739 @twodims{M, L} array. This means that the array was distributed over
1740 the @code{M} dimension, the local portion of which is a
1741 @twodims{L,local_M} array in Fortran. (You must @emph{not} use an
1742 @code{allocate} statement to allocate an @twodims{L,local_M} array,
1743 however; you must allocate @code{alloc_local} complex numbers, which
1744 may be greater than @code{L * local_M}, in order to reserve space for
1745 intermediate steps of the transform.) Finally, we mention that
1746 because C's array indices are zero-based, the @code{local_j_offset}
1747 argument can conveniently be interpreted as an offset in the 1-based
1748 @code{j} index (rather than as a starting index as in C).
1749
1750 If instead you had used the @code{ior(FFTW_MEASURE,
1751 FFTW_MPI_TRANSPOSED_OUT)} flag, the output of the transform would be a
1752 transposed @twodims{M,local_L} array, associated with the @emph{same}
1753 @code{cdata} allocation (since the transform is in-place), and which
1754 you could declare with:
1755
1756 @example
1757 complex(C_DOUBLE_COMPLEX), pointer :: tdata(:,:)
1758 ...
1759 call c_f_pointer(cdata, tdata, [M,local_L])
1760 @end example
1761
1762 where @code{local_L} would have been obtained by changing the
1763 @code{fftw_mpi_local_size_2d} call to:
1764
1765 @example
1766 alloc_local = fftw_mpi_local_size_2d_transposed(M, L, MPI_COMM_WORLD, &
1767 local_M, local_j_offset, local_L, local_i_offset)
1768 @end example