cannam@127
|
1 @node Distributed-memory FFTW with MPI, Calling FFTW from Modern Fortran, Multi-threaded FFTW, Top
|
cannam@127
|
2 @chapter Distributed-memory FFTW with MPI
|
cannam@127
|
3 @cindex MPI
|
cannam@127
|
4
|
cannam@127
|
5 @cindex parallel transform
|
cannam@127
|
6 In this chapter we document the parallel FFTW routines for parallel
|
cannam@127
|
7 systems supporting the MPI message-passing interface. Unlike the
|
cannam@127
|
8 shared-memory threads described in the previous chapter, MPI allows
|
cannam@127
|
9 you to use @emph{distributed-memory} parallelism, where each CPU has
|
cannam@127
|
10 its own separate memory, and which can scale up to clusters of many
|
cannam@127
|
11 thousands of processors. This capability comes at a price, however:
|
cannam@127
|
12 each process only stores a @emph{portion} of the data to be
|
cannam@127
|
13 transformed, which means that the data structures and
|
cannam@127
|
14 programming-interface are quite different from the serial or threads
|
cannam@127
|
15 versions of FFTW.
|
cannam@127
|
16 @cindex data distribution
|
cannam@127
|
17
|
cannam@127
|
18
|
cannam@127
|
19 Distributed-memory parallelism is especially useful when you are
|
cannam@127
|
20 transforming arrays so large that they do not fit into the memory of a
|
cannam@127
|
21 single processor. The storage per-process required by FFTW's MPI
|
cannam@127
|
22 routines is proportional to the total array size divided by the number
|
cannam@127
|
23 of processes. Conversely, distributed-memory parallelism can easily
|
cannam@127
|
24 pose an unacceptably high communications overhead for small problems;
|
cannam@127
|
25 the threshold problem size for which parallelism becomes advantageous
|
cannam@127
|
26 will depend on the precise problem you are interested in, your
|
cannam@127
|
27 hardware, and your MPI implementation.
|
cannam@127
|
28
|
cannam@127
|
29 A note on terminology: in MPI, you divide the data among a set of
|
cannam@127
|
30 ``processes'' which each run in their own memory address space.
|
cannam@127
|
31 Generally, each process runs on a different physical processor, but
|
cannam@127
|
32 this is not required. A set of processes in MPI is described by an
|
cannam@127
|
33 opaque data structure called a ``communicator,'' the most common of
|
cannam@127
|
34 which is the predefined communicator @code{MPI_COMM_WORLD} which
|
cannam@127
|
35 refers to @emph{all} processes. For more information on these and
|
cannam@127
|
36 other concepts common to all MPI programs, we refer the reader to the
|
cannam@127
|
37 documentation at @uref{http://www.mcs.anl.gov/research/projects/mpi/, the MPI home
|
cannam@127
|
38 page}.
|
cannam@127
|
39 @cindex MPI communicator
|
cannam@127
|
40 @ctindex MPI_COMM_WORLD
|
cannam@127
|
41
|
cannam@127
|
42
|
cannam@127
|
43 We assume in this chapter that the reader is familiar with the usage
|
cannam@127
|
44 of the serial (uniprocessor) FFTW, and focus only on the concepts new
|
cannam@127
|
45 to the MPI interface.
|
cannam@127
|
46
|
cannam@127
|
47 @menu
|
cannam@127
|
48 * FFTW MPI Installation::
|
cannam@127
|
49 * Linking and Initializing MPI FFTW::
|
cannam@127
|
50 * 2d MPI example::
|
cannam@127
|
51 * MPI Data Distribution::
|
cannam@127
|
52 * Multi-dimensional MPI DFTs of Real Data::
|
cannam@127
|
53 * Other Multi-dimensional Real-data MPI Transforms::
|
cannam@127
|
54 * FFTW MPI Transposes::
|
cannam@127
|
55 * FFTW MPI Wisdom::
|
cannam@127
|
56 * Avoiding MPI Deadlocks::
|
cannam@127
|
57 * FFTW MPI Performance Tips::
|
cannam@127
|
58 * Combining MPI and Threads::
|
cannam@127
|
59 * FFTW MPI Reference::
|
cannam@127
|
60 * FFTW MPI Fortran Interface::
|
cannam@127
|
61 @end menu
|
cannam@127
|
62
|
cannam@127
|
63 @c ------------------------------------------------------------
|
cannam@127
|
64 @node FFTW MPI Installation, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI, Distributed-memory FFTW with MPI
|
cannam@127
|
65 @section FFTW MPI Installation
|
cannam@127
|
66
|
cannam@127
|
67 All of the FFTW MPI code is located in the @code{mpi} subdirectory of
|
cannam@127
|
68 the FFTW package. On Unix systems, the FFTW MPI libraries and header
|
cannam@127
|
69 files are automatically configured, compiled, and installed along with
|
cannam@127
|
70 the uniprocessor FFTW libraries simply by including
|
cannam@127
|
71 @code{--enable-mpi} in the flags to the @code{configure} script
|
cannam@127
|
72 (@pxref{Installation on Unix}).
|
cannam@127
|
73 @fpindex configure
|
cannam@127
|
74
|
cannam@127
|
75
|
cannam@127
|
76 Any implementation of the MPI standard, version 1 or later, should
|
cannam@127
|
77 work with FFTW. The @code{configure} script will attempt to
|
cannam@127
|
78 automatically detect how to compile and link code using your MPI
|
cannam@127
|
79 implementation. In some cases, especially if you have multiple
|
cannam@127
|
80 different MPI implementations installed or have an unusual MPI
|
cannam@127
|
81 software package, you may need to provide this information explicitly.
|
cannam@127
|
82
|
cannam@127
|
83 Most commonly, one compiles MPI code by invoking a special compiler
|
cannam@127
|
84 command, typically @code{mpicc} for C code. The @code{configure}
|
cannam@127
|
85 script knows the most common names for this command, but you can
|
cannam@127
|
86 specify the MPI compilation command explicitly by setting the
|
cannam@127
|
87 @code{MPICC} variable, as in @samp{./configure MPICC=mpicc ...}.
|
cannam@127
|
88 @fpindex mpicc
|
cannam@127
|
89
|
cannam@127
|
90
|
cannam@127
|
91 If, instead of a special compiler command, you need to link a certain
|
cannam@127
|
92 library, you can specify the link command via the @code{MPILIBS}
|
cannam@127
|
93 variable, as in @samp{./configure MPILIBS=-lmpi ...}. Note that if
|
cannam@127
|
94 your MPI library is installed in a non-standard location (one the
|
cannam@127
|
95 compiler does not know about by default), you may also have to specify
|
cannam@127
|
96 the location of the library and header files via @code{LDFLAGS} and
|
cannam@127
|
97 @code{CPPFLAGS} variables, respectively, as in @samp{./configure
|
cannam@127
|
98 LDFLAGS=-L/path/to/mpi/libs CPPFLAGS=-I/path/to/mpi/include ...}.
|
cannam@127
|
99
|
cannam@127
|
100 @c ------------------------------------------------------------
|
cannam@127
|
101 @node Linking and Initializing MPI FFTW, 2d MPI example, FFTW MPI Installation, Distributed-memory FFTW with MPI
|
cannam@127
|
102 @section Linking and Initializing MPI FFTW
|
cannam@127
|
103
|
cannam@127
|
104 Programs using the MPI FFTW routines should be linked with
|
cannam@127
|
105 @code{-lfftw3_mpi -lfftw3 -lm} on Unix in double precision,
|
cannam@127
|
106 @code{-lfftw3f_mpi -lfftw3f -lm} in single precision, and so on
|
cannam@127
|
107 (@pxref{Precision}). You will also need to link with whatever library
|
cannam@127
|
108 is responsible for MPI on your system; in most MPI implementations,
|
cannam@127
|
109 there is a special compiler alias named @code{mpicc} to compile and
|
cannam@127
|
110 link MPI code.
|
cannam@127
|
111 @fpindex mpicc
|
cannam@127
|
112 @cindex linking on Unix
|
cannam@127
|
113 @cindex precision
|
cannam@127
|
114
|
cannam@127
|
115
|
cannam@127
|
116 @findex fftw_init_threads
|
cannam@127
|
117 Before calling any FFTW routines except possibly
|
cannam@127
|
118 @code{fftw_init_threads} (@pxref{Combining MPI and Threads}), but after calling
|
cannam@127
|
119 @code{MPI_Init}, you should call the function:
|
cannam@127
|
120
|
cannam@127
|
121 @example
|
cannam@127
|
122 void fftw_mpi_init(void);
|
cannam@127
|
123 @end example
|
cannam@127
|
124 @findex fftw_mpi_init
|
cannam@127
|
125
|
cannam@127
|
126 If, at the end of your program, you want to get rid of all memory and
|
cannam@127
|
127 other resources allocated internally by FFTW, for both the serial and
|
cannam@127
|
128 MPI routines, you can call:
|
cannam@127
|
129
|
cannam@127
|
130 @example
|
cannam@127
|
131 void fftw_mpi_cleanup(void);
|
cannam@127
|
132 @end example
|
cannam@127
|
133 @findex fftw_mpi_cleanup
|
cannam@127
|
134
|
cannam@127
|
135 which is much like the @code{fftw_cleanup()} function except that it
|
cannam@127
|
136 also gets rid of FFTW's MPI-related data. You must @emph{not} execute
|
cannam@127
|
137 any previously created plans after calling this function.
|
cannam@127
|
138
|
cannam@127
|
139 @c ------------------------------------------------------------
|
cannam@127
|
140 @node 2d MPI example, MPI Data Distribution, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI
|
cannam@127
|
141 @section 2d MPI example
|
cannam@127
|
142
|
cannam@127
|
143 Before we document the FFTW MPI interface in detail, we begin with a
|
cannam@127
|
144 simple example outlining how one would perform a two-dimensional
|
cannam@127
|
145 @code{N0} by @code{N1} complex DFT.
|
cannam@127
|
146
|
cannam@127
|
147 @example
|
cannam@127
|
148 #include <fftw3-mpi.h>
|
cannam@127
|
149
|
cannam@127
|
150 int main(int argc, char **argv)
|
cannam@127
|
151 @{
|
cannam@127
|
152 const ptrdiff_t N0 = ..., N1 = ...;
|
cannam@127
|
153 fftw_plan plan;
|
cannam@127
|
154 fftw_complex *data;
|
cannam@127
|
155 ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
|
cannam@127
|
156
|
cannam@127
|
157 MPI_Init(&argc, &argv);
|
cannam@127
|
158 fftw_mpi_init();
|
cannam@127
|
159
|
cannam@127
|
160 /* @r{get local data size and allocate} */
|
cannam@127
|
161 alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
|
cannam@127
|
162 &local_n0, &local_0_start);
|
cannam@127
|
163 data = fftw_alloc_complex(alloc_local);
|
cannam@127
|
164
|
cannam@127
|
165 /* @r{create plan for in-place forward DFT} */
|
cannam@127
|
166 plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
|
cannam@127
|
167 FFTW_FORWARD, FFTW_ESTIMATE);
|
cannam@127
|
168
|
cannam@127
|
169 /* @r{initialize data to some function} my_function(x,y) */
|
cannam@127
|
170 for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j)
|
cannam@127
|
171 data[i*N1 + j] = my_function(local_0_start + i, j);
|
cannam@127
|
172
|
cannam@127
|
173 /* @r{compute transforms, in-place, as many times as desired} */
|
cannam@127
|
174 fftw_execute(plan);
|
cannam@127
|
175
|
cannam@127
|
176 fftw_destroy_plan(plan);
|
cannam@127
|
177
|
cannam@127
|
178 MPI_Finalize();
|
cannam@127
|
179 @}
|
cannam@127
|
180 @end example
|
cannam@127
|
181
|
cannam@127
|
182 As can be seen above, the MPI interface follows the same basic style
|
cannam@127
|
183 of allocate/plan/execute/destroy as the serial FFTW routines. All of
|
cannam@127
|
184 the MPI-specific routines are prefixed with @samp{fftw_mpi_} instead
|
cannam@127
|
185 of @samp{fftw_}. There are a few important differences, however:
|
cannam@127
|
186
|
cannam@127
|
187 First, we must call @code{fftw_mpi_init()} after calling
|
cannam@127
|
188 @code{MPI_Init} (required in all MPI programs) and before calling any
|
cannam@127
|
189 other @samp{fftw_mpi_} routine.
|
cannam@127
|
190 @findex MPI_Init
|
cannam@127
|
191 @findex fftw_mpi_init
|
cannam@127
|
192
|
cannam@127
|
193
|
cannam@127
|
194 Second, when we create the plan with @code{fftw_mpi_plan_dft_2d},
|
cannam@127
|
195 analogous to @code{fftw_plan_dft_2d}, we pass an additional argument:
|
cannam@127
|
196 the communicator, indicating which processes will participate in the
|
cannam@127
|
197 transform (here @code{MPI_COMM_WORLD}, indicating all processes).
|
cannam@127
|
198 Whenever you create, execute, or destroy a plan for an MPI transform,
|
cannam@127
|
199 you must call the corresponding FFTW routine on @emph{all} processes
|
cannam@127
|
200 in the communicator for that transform. (That is, these are
|
cannam@127
|
201 @emph{collective} calls.) Note that the plan for the MPI transform
|
cannam@127
|
202 uses the standard @code{fftw_execute} and @code{fftw_destroy} routines
|
cannam@127
|
203 (on the other hand, there are MPI-specific new-array execute functions
|
cannam@127
|
204 documented below).
|
cannam@127
|
205 @cindex collective function
|
cannam@127
|
206 @findex fftw_mpi_plan_dft_2d
|
cannam@127
|
207 @ctindex MPI_COMM_WORLD
|
cannam@127
|
208
|
cannam@127
|
209
|
cannam@127
|
210 Third, all of the FFTW MPI routines take @code{ptrdiff_t} arguments
|
cannam@127
|
211 instead of @code{int} as for the serial FFTW. @code{ptrdiff_t} is a
|
cannam@127
|
212 standard C integer type which is (at least) 32 bits wide on a 32-bit
|
cannam@127
|
213 machine and 64 bits wide on a 64-bit machine. This is to make it easy
|
cannam@127
|
214 to specify very large parallel transforms on a 64-bit machine. (You
|
cannam@127
|
215 can specify 64-bit transform sizes in the serial FFTW, too, but only
|
cannam@127
|
216 by using the @samp{guru64} planner interface. @xref{64-bit Guru
|
cannam@127
|
217 Interface}.)
|
cannam@127
|
218 @tindex ptrdiff_t
|
cannam@127
|
219 @cindex 64-bit architecture
|
cannam@127
|
220
|
cannam@127
|
221
|
cannam@127
|
222 Fourth, and most importantly, you don't allocate the entire
|
cannam@127
|
223 two-dimensional array on each process. Instead, you call
|
cannam@127
|
224 @code{fftw_mpi_local_size_2d} to find out what @emph{portion} of the
|
cannam@127
|
225 array resides on each processor, and how much space to allocate.
|
cannam@127
|
226 Here, the portion of the array on each process is a @code{local_n0} by
|
cannam@127
|
227 @code{N1} slice of the total array, starting at index
|
cannam@127
|
228 @code{local_0_start}. The total number of @code{fftw_complex} numbers
|
cannam@127
|
229 to allocate is given by the @code{alloc_local} return value, which
|
cannam@127
|
230 @emph{may} be greater than @code{local_n0 * N1} (in case some
|
cannam@127
|
231 intermediate calculations require additional storage). The data
|
cannam@127
|
232 distribution in FFTW's MPI interface is described in more detail by
|
cannam@127
|
233 the next section.
|
cannam@127
|
234 @findex fftw_mpi_local_size_2d
|
cannam@127
|
235 @cindex data distribution
|
cannam@127
|
236
|
cannam@127
|
237
|
cannam@127
|
238 Given the portion of the array that resides on the local process, it
|
cannam@127
|
239 is straightforward to initialize the data (here to a function
|
cannam@127
|
240 @code{myfunction}) and otherwise manipulate it. Of course, at the end
|
cannam@127
|
241 of the program you may want to output the data somehow, but
|
cannam@127
|
242 synchronizing this output is up to you and is beyond the scope of this
|
cannam@127
|
243 manual. (One good way to output a large multi-dimensional distributed
|
cannam@127
|
244 array in MPI to a portable binary file is to use the free HDF5
|
cannam@127
|
245 library; see the @uref{http://www.hdfgroup.org/, HDF home page}.)
|
cannam@127
|
246 @cindex HDF5
|
cannam@127
|
247 @cindex MPI I/O
|
cannam@127
|
248
|
cannam@127
|
249 @c ------------------------------------------------------------
|
cannam@127
|
250 @node MPI Data Distribution, Multi-dimensional MPI DFTs of Real Data, 2d MPI example, Distributed-memory FFTW with MPI
|
cannam@127
|
251 @section MPI Data Distribution
|
cannam@127
|
252 @cindex data distribution
|
cannam@127
|
253
|
cannam@127
|
254 The most important concept to understand in using FFTW's MPI interface
|
cannam@127
|
255 is the data distribution. With a serial or multithreaded FFT, all of
|
cannam@127
|
256 the inputs and outputs are stored as a single contiguous chunk of
|
cannam@127
|
257 memory. With a distributed-memory FFT, the inputs and outputs are
|
cannam@127
|
258 broken into disjoint blocks, one per process.
|
cannam@127
|
259
|
cannam@127
|
260 In particular, FFTW uses a @emph{1d block distribution} of the data,
|
cannam@127
|
261 distributed along the @emph{first dimension}. For example, if you
|
cannam@127
|
262 want to perform a @twodims{100,200} complex DFT, distributed over 4
|
cannam@127
|
263 processes, each process will get a @twodims{25,200} slice of the data.
|
cannam@127
|
264 That is, process 0 will get rows 0 through 24, process 1 will get rows
|
cannam@127
|
265 25 through 49, process 2 will get rows 50 through 74, and process 3
|
cannam@127
|
266 will get rows 75 through 99. If you take the same array but
|
cannam@127
|
267 distribute it over 3 processes, then it is not evenly divisible so the
|
cannam@127
|
268 different processes will have unequal chunks. FFTW's default choice
|
cannam@127
|
269 in this case is to assign 34 rows to processes 0 and 1, and 32 rows to
|
cannam@127
|
270 process 2.
|
cannam@127
|
271 @cindex block distribution
|
cannam@127
|
272
|
cannam@127
|
273
|
cannam@127
|
274 FFTW provides several @samp{fftw_mpi_local_size} routines that you can
|
cannam@127
|
275 call to find out what portion of an array is stored on the current
|
cannam@127
|
276 process. In most cases, you should use the default block sizes picked
|
cannam@127
|
277 by FFTW, but it is also possible to specify your own block size. For
|
cannam@127
|
278 example, with a @twodims{100,200} array on three processes, you can
|
cannam@127
|
279 tell FFTW to use a block size of 40, which would assign 40 rows to
|
cannam@127
|
280 processes 0 and 1, and 20 rows to process 2. FFTW's default is to
|
cannam@127
|
281 divide the data equally among the processes if possible, and as best
|
cannam@127
|
282 it can otherwise. The rows are always assigned in ``rank order,''
|
cannam@127
|
283 i.e. process 0 gets the first block of rows, then process 1, and so
|
cannam@127
|
284 on. (You can change this by using @code{MPI_Comm_split} to create a
|
cannam@127
|
285 new communicator with re-ordered processes.) However, you should
|
cannam@127
|
286 always call the @samp{fftw_mpi_local_size} routines, if possible,
|
cannam@127
|
287 rather than trying to predict FFTW's distribution choices.
|
cannam@127
|
288
|
cannam@127
|
289 In particular, it is critical that you allocate the storage size that
|
cannam@127
|
290 is returned by @samp{fftw_mpi_local_size}, which is @emph{not}
|
cannam@127
|
291 necessarily the size of the local slice of the array. The reason is
|
cannam@127
|
292 that intermediate steps of FFTW's algorithms involve transposing the
|
cannam@127
|
293 array and redistributing the data, so at these intermediate steps FFTW
|
cannam@127
|
294 may require more local storage space (albeit always proportional to
|
cannam@127
|
295 the total size divided by the number of processes). The
|
cannam@127
|
296 @samp{fftw_mpi_local_size} functions know how much storage is required
|
cannam@127
|
297 for these intermediate steps and tell you the correct amount to
|
cannam@127
|
298 allocate.
|
cannam@127
|
299
|
cannam@127
|
300 @menu
|
cannam@127
|
301 * Basic and advanced distribution interfaces::
|
cannam@127
|
302 * Load balancing::
|
cannam@127
|
303 * Transposed distributions::
|
cannam@127
|
304 * One-dimensional distributions::
|
cannam@127
|
305 @end menu
|
cannam@127
|
306
|
cannam@127
|
307 @node Basic and advanced distribution interfaces, Load balancing, MPI Data Distribution, MPI Data Distribution
|
cannam@127
|
308 @subsection Basic and advanced distribution interfaces
|
cannam@127
|
309
|
cannam@127
|
310 As with the planner interface, the @samp{fftw_mpi_local_size}
|
cannam@127
|
311 distribution interface is broken into basic and advanced
|
cannam@127
|
312 (@samp{_many}) interfaces, where the latter allows you to specify the
|
cannam@127
|
313 block size manually and also to request block sizes when computing
|
cannam@127
|
314 multiple transforms simultaneously. These functions are documented
|
cannam@127
|
315 more exhaustively by the FFTW MPI Reference, but we summarize the
|
cannam@127
|
316 basic ideas here using a couple of two-dimensional examples.
|
cannam@127
|
317
|
cannam@127
|
318 For the @twodims{100,200} complex-DFT example, above, we would find
|
cannam@127
|
319 the distribution by calling the following function in the basic
|
cannam@127
|
320 interface:
|
cannam@127
|
321
|
cannam@127
|
322 @example
|
cannam@127
|
323 ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
|
cannam@127
|
324 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
|
cannam@127
|
325 @end example
|
cannam@127
|
326 @findex fftw_mpi_local_size_2d
|
cannam@127
|
327
|
cannam@127
|
328 Given the total size of the data to be transformed (here, @code{n0 =
|
cannam@127
|
329 100} and @code{n1 = 200}) and an MPI communicator (@code{comm}), this
|
cannam@127
|
330 function provides three numbers.
|
cannam@127
|
331
|
cannam@127
|
332 First, it describes the shape of the local data: the current process
|
cannam@127
|
333 should store a @code{local_n0} by @code{n1} slice of the overall
|
cannam@127
|
334 dataset, in row-major order (@code{n1} dimension contiguous), starting
|
cannam@127
|
335 at index @code{local_0_start}. That is, if the total dataset is
|
cannam@127
|
336 viewed as a @code{n0} by @code{n1} matrix, the current process should
|
cannam@127
|
337 store the rows @code{local_0_start} to
|
cannam@127
|
338 @code{local_0_start+local_n0-1}. Obviously, if you are running with
|
cannam@127
|
339 only a single MPI process, that process will store the entire array:
|
cannam@127
|
340 @code{local_0_start} will be zero and @code{local_n0} will be
|
cannam@127
|
341 @code{n0}. @xref{Row-major Format}.
|
cannam@127
|
342 @cindex row-major
|
cannam@127
|
343
|
cannam@127
|
344
|
cannam@127
|
345 Second, the return value is the total number of data elements (e.g.,
|
cannam@127
|
346 complex numbers for a complex DFT) that should be allocated for the
|
cannam@127
|
347 input and output arrays on the current process (ideally with
|
cannam@127
|
348 @code{fftw_malloc} or an @samp{fftw_alloc} function, to ensure optimal
|
cannam@127
|
349 alignment). It might seem that this should always be equal to
|
cannam@127
|
350 @code{local_n0 * n1}, but this is @emph{not} the case. FFTW's
|
cannam@127
|
351 distributed FFT algorithms require data redistributions at
|
cannam@127
|
352 intermediate stages of the transform, and in some circumstances this
|
cannam@127
|
353 may require slightly larger local storage. This is discussed in more
|
cannam@127
|
354 detail below, under @ref{Load balancing}.
|
cannam@127
|
355 @findex fftw_malloc
|
cannam@127
|
356 @findex fftw_alloc_complex
|
cannam@127
|
357
|
cannam@127
|
358
|
cannam@127
|
359 @cindex advanced interface
|
cannam@127
|
360 The advanced-interface @samp{local_size} function for multidimensional
|
cannam@127
|
361 transforms returns the same three things (@code{local_n0},
|
cannam@127
|
362 @code{local_0_start}, and the total number of elements to allocate),
|
cannam@127
|
363 but takes more inputs:
|
cannam@127
|
364
|
cannam@127
|
365 @example
|
cannam@127
|
366 ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n,
|
cannam@127
|
367 ptrdiff_t howmany,
|
cannam@127
|
368 ptrdiff_t block0,
|
cannam@127
|
369 MPI_Comm comm,
|
cannam@127
|
370 ptrdiff_t *local_n0,
|
cannam@127
|
371 ptrdiff_t *local_0_start);
|
cannam@127
|
372 @end example
|
cannam@127
|
373 @findex fftw_mpi_local_size_many
|
cannam@127
|
374
|
cannam@127
|
375 The two-dimensional case above corresponds to @code{rnk = 2} and an
|
cannam@127
|
376 array @code{n} of length 2 with @code{n[0] = n0} and @code{n[1] = n1}.
|
cannam@127
|
377 This routine is for any @code{rnk > 1}; one-dimensional transforms
|
cannam@127
|
378 have their own interface because they work slightly differently, as
|
cannam@127
|
379 discussed below.
|
cannam@127
|
380
|
cannam@127
|
381 First, the advanced interface allows you to perform multiple
|
cannam@127
|
382 transforms at once, of interleaved data, as specified by the
|
cannam@127
|
383 @code{howmany} parameter. (@code{hoamany} is 1 for a single
|
cannam@127
|
384 transform.)
|
cannam@127
|
385
|
cannam@127
|
386 Second, here you can specify your desired block size in the @code{n0}
|
cannam@127
|
387 dimension, @code{block0}. To use FFTW's default block size, pass
|
cannam@127
|
388 @code{FFTW_MPI_DEFAULT_BLOCK} (0) for @code{block0}. Otherwise, on
|
cannam@127
|
389 @code{P} processes, FFTW will return @code{local_n0} equal to
|
cannam@127
|
390 @code{block0} on the first @code{P / block0} processes (rounded down),
|
cannam@127
|
391 return @code{local_n0} equal to @code{n0 - block0 * (P / block0)} on
|
cannam@127
|
392 the next process, and @code{local_n0} equal to zero on any remaining
|
cannam@127
|
393 processes. In general, we recommend using the default block size
|
cannam@127
|
394 (which corresponds to @code{n0 / P}, rounded up).
|
cannam@127
|
395 @ctindex FFTW_MPI_DEFAULT_BLOCK
|
cannam@127
|
396 @cindex block distribution
|
cannam@127
|
397
|
cannam@127
|
398
|
cannam@127
|
399 For example, suppose you have @code{P = 4} processes and @code{n0 =
|
cannam@127
|
400 21}. The default will be a block size of @code{6}, which will give
|
cannam@127
|
401 @code{local_n0 = 6} on the first three processes and @code{local_n0 =
|
cannam@127
|
402 3} on the last process. Instead, however, you could specify
|
cannam@127
|
403 @code{block0 = 5} if you wanted, which would give @code{local_n0 = 5}
|
cannam@127
|
404 on processes 0 to 2, @code{local_n0 = 6} on process 3. (This choice,
|
cannam@127
|
405 while it may look superficially more ``balanced,'' has the same
|
cannam@127
|
406 critical path as FFTW's default but requires more communications.)
|
cannam@127
|
407
|
cannam@127
|
408 @node Load balancing, Transposed distributions, Basic and advanced distribution interfaces, MPI Data Distribution
|
cannam@127
|
409 @subsection Load balancing
|
cannam@127
|
410 @cindex load balancing
|
cannam@127
|
411
|
cannam@127
|
412 Ideally, when you parallelize a transform over some @math{P}
|
cannam@127
|
413 processes, each process should end up with work that takes equal time.
|
cannam@127
|
414 Otherwise, all of the processes end up waiting on whichever process is
|
cannam@127
|
415 slowest. This goal is known as ``load balancing.'' In this section,
|
cannam@127
|
416 we describe the circumstances under which FFTW is able to load-balance
|
cannam@127
|
417 well, and in particular how you should choose your transform size in
|
cannam@127
|
418 order to load balance.
|
cannam@127
|
419
|
cannam@127
|
420 Load balancing is especially difficult when you are parallelizing over
|
cannam@127
|
421 heterogeneous machines; for example, if one of your processors is a
|
cannam@127
|
422 old 486 and another is a Pentium IV, obviously you should give the
|
cannam@127
|
423 Pentium more work to do than the 486 since the latter is much slower.
|
cannam@127
|
424 FFTW does not deal with this problem, however---it assumes that your
|
cannam@127
|
425 processes run on hardware of comparable speed, and that the goal is
|
cannam@127
|
426 therefore to divide the problem as equally as possible.
|
cannam@127
|
427
|
cannam@127
|
428 For a multi-dimensional complex DFT, FFTW can divide the problem
|
cannam@127
|
429 equally among the processes if: (i) the @emph{first} dimension
|
cannam@127
|
430 @code{n0} is divisible by @math{P}; and (ii), the @emph{product} of
|
cannam@127
|
431 the subsequent dimensions is divisible by @math{P}. (For the advanced
|
cannam@127
|
432 interface, where you can specify multiple simultaneous transforms via
|
cannam@127
|
433 some ``vector'' length @code{howmany}, a factor of @code{howmany} is
|
cannam@127
|
434 included in the product of the subsequent dimensions.)
|
cannam@127
|
435
|
cannam@127
|
436 For a one-dimensional complex DFT, the length @code{N} of the data
|
cannam@127
|
437 should be divisible by @math{P} @emph{squared} to be able to divide
|
cannam@127
|
438 the problem equally among the processes.
|
cannam@127
|
439
|
cannam@127
|
440 @node Transposed distributions, One-dimensional distributions, Load balancing, MPI Data Distribution
|
cannam@127
|
441 @subsection Transposed distributions
|
cannam@127
|
442
|
cannam@127
|
443 Internally, FFTW's MPI transform algorithms work by first computing
|
cannam@127
|
444 transforms of the data local to each process, then by globally
|
cannam@127
|
445 @emph{transposing} the data in some fashion to redistribute the data
|
cannam@127
|
446 among the processes, transforming the new data local to each process,
|
cannam@127
|
447 and transposing back. For example, a two-dimensional @code{n0} by
|
cannam@127
|
448 @code{n1} array, distributed across the @code{n0} dimension, is
|
cannam@127
|
449 transformd by: (i) transforming the @code{n1} dimension, which are
|
cannam@127
|
450 local to each process; (ii) transposing to an @code{n1} by @code{n0}
|
cannam@127
|
451 array, distributed across the @code{n1} dimension; (iii) transforming
|
cannam@127
|
452 the @code{n0} dimension, which is now local to each process; (iv)
|
cannam@127
|
453 transposing back.
|
cannam@127
|
454 @cindex transpose
|
cannam@127
|
455
|
cannam@127
|
456
|
cannam@127
|
457 However, in many applications it is acceptable to compute a
|
cannam@127
|
458 multidimensional DFT whose results are produced in transposed order
|
cannam@127
|
459 (e.g., @code{n1} by @code{n0} in two dimensions). This provides a
|
cannam@127
|
460 significant performance advantage, because it means that the final
|
cannam@127
|
461 transposition step can be omitted. FFTW supports this optimization,
|
cannam@127
|
462 which you specify by passing the flag @code{FFTW_MPI_TRANSPOSED_OUT}
|
cannam@127
|
463 to the planner routines. To compute the inverse transform of
|
cannam@127
|
464 transposed output, you specify @code{FFTW_MPI_TRANSPOSED_IN} to tell
|
cannam@127
|
465 it that the input is transposed. In this section, we explain how to
|
cannam@127
|
466 interpret the output format of such a transform.
|
cannam@127
|
467 @ctindex FFTW_MPI_TRANSPOSED_OUT
|
cannam@127
|
468 @ctindex FFTW_MPI_TRANSPOSED_IN
|
cannam@127
|
469
|
cannam@127
|
470
|
cannam@127
|
471 Suppose you have are transforming multi-dimensional data with (at
|
cannam@127
|
472 least two) dimensions @ndims{}. As always, it is distributed along
|
cannam@127
|
473 the first dimension @dimk{0}. Now, if we compute its DFT with the
|
cannam@127
|
474 @code{FFTW_MPI_TRANSPOSED_OUT} flag, the resulting output data are stored
|
cannam@127
|
475 with the first @emph{two} dimensions transposed: @ndimstrans{},
|
cannam@127
|
476 distributed along the @dimk{1} dimension. Conversely, if we take the
|
cannam@127
|
477 @ndimstrans{} data and transform it with the
|
cannam@127
|
478 @code{FFTW_MPI_TRANSPOSED_IN} flag, then the format goes back to the
|
cannam@127
|
479 original @ndims{} array.
|
cannam@127
|
480
|
cannam@127
|
481 There are two ways to find the portion of the transposed array that
|
cannam@127
|
482 resides on the current process. First, you can simply call the
|
cannam@127
|
483 appropriate @samp{local_size} function, passing @ndimstrans{} (the
|
cannam@127
|
484 transposed dimensions). This would mean calling the @samp{local_size}
|
cannam@127
|
485 function twice, once for the transposed and once for the
|
cannam@127
|
486 non-transposed dimensions. Alternatively, you can call one of the
|
cannam@127
|
487 @samp{local_size_transposed} functions, which returns both the
|
cannam@127
|
488 non-transposed and transposed data distribution from a single call.
|
cannam@127
|
489 For example, for a 3d transform with transposed output (or input), you
|
cannam@127
|
490 might call:
|
cannam@127
|
491
|
cannam@127
|
492 @example
|
cannam@127
|
493 ptrdiff_t fftw_mpi_local_size_3d_transposed(
|
cannam@127
|
494 ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm,
|
cannam@127
|
495 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
|
cannam@127
|
496 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
|
cannam@127
|
497 @end example
|
cannam@127
|
498 @findex fftw_mpi_local_size_3d_transposed
|
cannam@127
|
499
|
cannam@127
|
500 Here, @code{local_n0} and @code{local_0_start} give the size and
|
cannam@127
|
501 starting index of the @code{n0} dimension for the
|
cannam@127
|
502 @emph{non}-transposed data, as in the previous sections. For
|
cannam@127
|
503 @emph{transposed} data (e.g. the output for
|
cannam@127
|
504 @code{FFTW_MPI_TRANSPOSED_OUT}), @code{local_n1} and
|
cannam@127
|
505 @code{local_1_start} give the size and starting index of the @code{n1}
|
cannam@127
|
506 dimension, which is the first dimension of the transposed data
|
cannam@127
|
507 (@code{n1} by @code{n0} by @code{n2}).
|
cannam@127
|
508
|
cannam@127
|
509 (Note that @code{FFTW_MPI_TRANSPOSED_IN} is completely equivalent to
|
cannam@127
|
510 performing @code{FFTW_MPI_TRANSPOSED_OUT} and passing the first two
|
cannam@127
|
511 dimensions to the planner in reverse order, or vice versa. If you
|
cannam@127
|
512 pass @emph{both} the @code{FFTW_MPI_TRANSPOSED_IN} and
|
cannam@127
|
513 @code{FFTW_MPI_TRANSPOSED_OUT} flags, it is equivalent to swapping the
|
cannam@127
|
514 first two dimensions passed to the planner and passing @emph{neither}
|
cannam@127
|
515 flag.)
|
cannam@127
|
516
|
cannam@127
|
517 @node One-dimensional distributions, , Transposed distributions, MPI Data Distribution
|
cannam@127
|
518 @subsection One-dimensional distributions
|
cannam@127
|
519
|
cannam@127
|
520 For one-dimensional distributed DFTs using FFTW, matters are slightly
|
cannam@127
|
521 more complicated because the data distribution is more closely tied to
|
cannam@127
|
522 how the algorithm works. In particular, you can no longer pass an
|
cannam@127
|
523 arbitrary block size and must accept FFTW's default; also, the block
|
cannam@127
|
524 sizes may be different for input and output. Also, the data
|
cannam@127
|
525 distribution depends on the flags and transform direction, in order
|
cannam@127
|
526 for forward and backward transforms to work correctly.
|
cannam@127
|
527
|
cannam@127
|
528 @example
|
cannam@127
|
529 ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm,
|
cannam@127
|
530 int sign, unsigned flags,
|
cannam@127
|
531 ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
|
cannam@127
|
532 ptrdiff_t *local_no, ptrdiff_t *local_o_start);
|
cannam@127
|
533 @end example
|
cannam@127
|
534 @findex fftw_mpi_local_size_1d
|
cannam@127
|
535
|
cannam@127
|
536 This function computes the data distribution for a 1d transform of
|
cannam@127
|
537 size @code{n0} with the given transform @code{sign} and @code{flags}.
|
cannam@127
|
538 Both input and output data use block distributions. The input on the
|
cannam@127
|
539 current process will consist of @code{local_ni} numbers starting at
|
cannam@127
|
540 index @code{local_i_start}; e.g. if only a single process is used,
|
cannam@127
|
541 then @code{local_ni} will be @code{n0} and @code{local_i_start} will
|
cannam@127
|
542 be @code{0}. Similarly for the output, with @code{local_no} numbers
|
cannam@127
|
543 starting at index @code{local_o_start}. The return value of
|
cannam@127
|
544 @code{fftw_mpi_local_size_1d} will be the total number of elements to
|
cannam@127
|
545 allocate on the current process (which might be slightly larger than
|
cannam@127
|
546 the local size due to intermediate steps in the algorithm).
|
cannam@127
|
547
|
cannam@127
|
548 As mentioned above (@pxref{Load balancing}), the data will be divided
|
cannam@127
|
549 equally among the processes if @code{n0} is divisible by the
|
cannam@127
|
550 @emph{square} of the number of processes. In this case,
|
cannam@127
|
551 @code{local_ni} will equal @code{local_no}. Otherwise, they may be
|
cannam@127
|
552 different.
|
cannam@127
|
553
|
cannam@127
|
554 For some applications, such as convolutions, the order of the output
|
cannam@127
|
555 data is irrelevant. In this case, performance can be improved by
|
cannam@127
|
556 specifying that the output data be stored in an FFTW-defined
|
cannam@127
|
557 ``scrambled'' format. (In particular, this is the analogue of
|
cannam@127
|
558 transposed output in the multidimensional case: scrambled output saves
|
cannam@127
|
559 a communications step.) If you pass @code{FFTW_MPI_SCRAMBLED_OUT} in
|
cannam@127
|
560 the flags, then the output is stored in this (undocumented) scrambled
|
cannam@127
|
561 order. Conversely, to perform the inverse transform of data in
|
cannam@127
|
562 scrambled order, pass the @code{FFTW_MPI_SCRAMBLED_IN} flag.
|
cannam@127
|
563 @ctindex FFTW_MPI_SCRAMBLED_OUT
|
cannam@127
|
564 @ctindex FFTW_MPI_SCRAMBLED_IN
|
cannam@127
|
565
|
cannam@127
|
566
|
cannam@127
|
567 In MPI FFTW, only composite sizes @code{n0} can be parallelized; we
|
cannam@127
|
568 have not yet implemented a parallel algorithm for large prime sizes.
|
cannam@127
|
569
|
cannam@127
|
570 @c ------------------------------------------------------------
|
cannam@127
|
571 @node Multi-dimensional MPI DFTs of Real Data, Other Multi-dimensional Real-data MPI Transforms, MPI Data Distribution, Distributed-memory FFTW with MPI
|
cannam@127
|
572 @section Multi-dimensional MPI DFTs of Real Data
|
cannam@127
|
573
|
cannam@127
|
574 FFTW's MPI interface also supports multi-dimensional DFTs of real
|
cannam@127
|
575 data, similar to the serial r2c and c2r interfaces. (Parallel
|
cannam@127
|
576 one-dimensional real-data DFTs are not currently supported; you must
|
cannam@127
|
577 use a complex transform and set the imaginary parts of the inputs to
|
cannam@127
|
578 zero.)
|
cannam@127
|
579
|
cannam@127
|
580 The key points to understand for r2c and c2r MPI transforms (compared
|
cannam@127
|
581 to the MPI complex DFTs or the serial r2c/c2r transforms), are:
|
cannam@127
|
582
|
cannam@127
|
583 @itemize @bullet
|
cannam@127
|
584
|
cannam@127
|
585 @item
|
cannam@127
|
586 Just as for serial transforms, r2c/c2r DFTs transform @ndims{} real
|
cannam@127
|
587 data to/from @ndimshalf{} complex data: the last dimension of the
|
cannam@127
|
588 complex data is cut in half (rounded down), plus one. As for the
|
cannam@127
|
589 serial transforms, the sizes you pass to the @samp{plan_dft_r2c} and
|
cannam@127
|
590 @samp{plan_dft_c2r} are the @ndims{} dimensions of the real data.
|
cannam@127
|
591
|
cannam@127
|
592 @item
|
cannam@127
|
593 @cindex padding
|
cannam@127
|
594 Although the real data is @emph{conceptually} @ndims{}, it is
|
cannam@127
|
595 @emph{physically} stored as an @ndimspad{} array, where the last
|
cannam@127
|
596 dimension has been @emph{padded} to make it the same size as the
|
cannam@127
|
597 complex output. This is much like the in-place serial r2c/c2r
|
cannam@127
|
598 interface (@pxref{Multi-Dimensional DFTs of Real Data}), except that
|
cannam@127
|
599 in MPI the padding is required even for out-of-place data. The extra
|
cannam@127
|
600 padding numbers are ignored by FFTW (they are @emph{not} like
|
cannam@127
|
601 zero-padding the transform to a larger size); they are only used to
|
cannam@127
|
602 determine the data layout.
|
cannam@127
|
603
|
cannam@127
|
604 @item
|
cannam@127
|
605 @cindex data distribution
|
cannam@127
|
606 The data distribution in MPI for @emph{both} the real and complex data
|
cannam@127
|
607 is determined by the shape of the @emph{complex} data. That is, you
|
cannam@127
|
608 call the appropriate @samp{local size} function for the @ndimshalf{}
|
cannam@127
|
609 complex data, and then use the @emph{same} distribution for the real
|
cannam@127
|
610 data except that the last complex dimension is replaced by a (padded)
|
cannam@127
|
611 real dimension of twice the length.
|
cannam@127
|
612
|
cannam@127
|
613 @end itemize
|
cannam@127
|
614
|
cannam@127
|
615 For example suppose we are performing an out-of-place r2c transform of
|
cannam@127
|
616 @threedims{L,M,N} real data [padded to @threedims{L,M,2(N/2+1)}],
|
cannam@127
|
617 resulting in @threedims{L,M,N/2+1} complex data. Similar to the
|
cannam@127
|
618 example in @ref{2d MPI example}, we might do something like:
|
cannam@127
|
619
|
cannam@127
|
620 @example
|
cannam@127
|
621 #include <fftw3-mpi.h>
|
cannam@127
|
622
|
cannam@127
|
623 int main(int argc, char **argv)
|
cannam@127
|
624 @{
|
cannam@127
|
625 const ptrdiff_t L = ..., M = ..., N = ...;
|
cannam@127
|
626 fftw_plan plan;
|
cannam@127
|
627 double *rin;
|
cannam@127
|
628 fftw_complex *cout;
|
cannam@127
|
629 ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k;
|
cannam@127
|
630
|
cannam@127
|
631 MPI_Init(&argc, &argv);
|
cannam@127
|
632 fftw_mpi_init();
|
cannam@127
|
633
|
cannam@127
|
634 /* @r{get local data size and allocate} */
|
cannam@127
|
635 alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD,
|
cannam@127
|
636 &local_n0, &local_0_start);
|
cannam@127
|
637 rin = fftw_alloc_real(2 * alloc_local);
|
cannam@127
|
638 cout = fftw_alloc_complex(alloc_local);
|
cannam@127
|
639
|
cannam@127
|
640 /* @r{create plan for out-of-place r2c DFT} */
|
cannam@127
|
641 plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
|
cannam@127
|
642 FFTW_MEASURE);
|
cannam@127
|
643
|
cannam@127
|
644 /* @r{initialize rin to some function} my_func(x,y,z) */
|
cannam@127
|
645 for (i = 0; i < local_n0; ++i)
|
cannam@127
|
646 for (j = 0; j < M; ++j)
|
cannam@127
|
647 for (k = 0; k < N; ++k)
|
cannam@127
|
648 rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k);
|
cannam@127
|
649
|
cannam@127
|
650 /* @r{compute transforms as many times as desired} */
|
cannam@127
|
651 fftw_execute(plan);
|
cannam@127
|
652
|
cannam@127
|
653 fftw_destroy_plan(plan);
|
cannam@127
|
654
|
cannam@127
|
655 MPI_Finalize();
|
cannam@127
|
656 @}
|
cannam@127
|
657 @end example
|
cannam@127
|
658
|
cannam@127
|
659 @findex fftw_alloc_real
|
cannam@127
|
660 @cindex row-major
|
cannam@127
|
661 Note that we allocated @code{rin} using @code{fftw_alloc_real} with an
|
cannam@127
|
662 argument of @code{2 * alloc_local}: since @code{alloc_local} is the
|
cannam@127
|
663 number of @emph{complex} values to allocate, the number of @emph{real}
|
cannam@127
|
664 values is twice as many. The @code{rin} array is then
|
cannam@127
|
665 @threedims{local_n0,M,2(N/2+1)} in row-major order, so its
|
cannam@127
|
666 @code{(i,j,k)} element is at the index @code{(i*M + j) * (2*(N/2+1)) +
|
cannam@127
|
667 k} (@pxref{Multi-dimensional Array Format }).
|
cannam@127
|
668
|
cannam@127
|
669 @cindex transpose
|
cannam@127
|
670 @ctindex FFTW_TRANSPOSED_OUT
|
cannam@127
|
671 @ctindex FFTW_TRANSPOSED_IN
|
cannam@127
|
672 As for the complex transforms, improved performance can be obtained by
|
cannam@127
|
673 specifying that the output is the transpose of the input or vice versa
|
cannam@127
|
674 (@pxref{Transposed distributions}). In our @threedims{L,M,N} r2c
|
cannam@127
|
675 example, including @code{FFTW_TRANSPOSED_OUT} in the flags means that
|
cannam@127
|
676 the input would be a padded @threedims{L,M,2(N/2+1)} real array
|
cannam@127
|
677 distributed over the @code{L} dimension, while the output would be a
|
cannam@127
|
678 @threedims{M,L,N/2+1} complex array distributed over the @code{M}
|
cannam@127
|
679 dimension. To perform the inverse c2r transform with the same data
|
cannam@127
|
680 distributions, you would use the @code{FFTW_TRANSPOSED_IN} flag.
|
cannam@127
|
681
|
cannam@127
|
682 @c ------------------------------------------------------------
|
cannam@127
|
683 @node Other Multi-dimensional Real-data MPI Transforms, FFTW MPI Transposes, Multi-dimensional MPI DFTs of Real Data, Distributed-memory FFTW with MPI
|
cannam@127
|
684 @section Other multi-dimensional Real-Data MPI Transforms
|
cannam@127
|
685
|
cannam@127
|
686 @cindex r2r
|
cannam@127
|
687 FFTW's MPI interface also supports multi-dimensional @samp{r2r}
|
cannam@127
|
688 transforms of all kinds supported by the serial interface
|
cannam@127
|
689 (e.g. discrete cosine and sine transforms, discrete Hartley
|
cannam@127
|
690 transforms, etc.). Only multi-dimensional @samp{r2r} transforms, not
|
cannam@127
|
691 one-dimensional transforms, are currently parallelized.
|
cannam@127
|
692
|
cannam@127
|
693 @tindex fftw_r2r_kind
|
cannam@127
|
694 These are used much like the multidimensional complex DFTs discussed
|
cannam@127
|
695 above, except that the data is real rather than complex, and one needs
|
cannam@127
|
696 to pass an r2r transform kind (@code{fftw_r2r_kind}) for each
|
cannam@127
|
697 dimension as in the serial FFTW (@pxref{More DFTs of Real Data}).
|
cannam@127
|
698
|
cannam@127
|
699 For example, one might perform a two-dimensional @twodims{L,M} that is
|
cannam@127
|
700 an REDFT10 (DCT-II) in the first dimension and an RODFT10 (DST-II) in
|
cannam@127
|
701 the second dimension with code like:
|
cannam@127
|
702
|
cannam@127
|
703 @example
|
cannam@127
|
704 const ptrdiff_t L = ..., M = ...;
|
cannam@127
|
705 fftw_plan plan;
|
cannam@127
|
706 double *data;
|
cannam@127
|
707 ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
|
cannam@127
|
708
|
cannam@127
|
709 /* @r{get local data size and allocate} */
|
cannam@127
|
710 alloc_local = fftw_mpi_local_size_2d(L, M, MPI_COMM_WORLD,
|
cannam@127
|
711 &local_n0, &local_0_start);
|
cannam@127
|
712 data = fftw_alloc_real(alloc_local);
|
cannam@127
|
713
|
cannam@127
|
714 /* @r{create plan for in-place REDFT10 x RODFT10} */
|
cannam@127
|
715 plan = fftw_mpi_plan_r2r_2d(L, M, data, data, MPI_COMM_WORLD,
|
cannam@127
|
716 FFTW_REDFT10, FFTW_RODFT10, FFTW_MEASURE);
|
cannam@127
|
717
|
cannam@127
|
718 /* @r{initialize data to some function} my_function(x,y) */
|
cannam@127
|
719 for (i = 0; i < local_n0; ++i) for (j = 0; j < M; ++j)
|
cannam@127
|
720 data[i*M + j] = my_function(local_0_start + i, j);
|
cannam@127
|
721
|
cannam@127
|
722 /* @r{compute transforms, in-place, as many times as desired} */
|
cannam@127
|
723 fftw_execute(plan);
|
cannam@127
|
724
|
cannam@127
|
725 fftw_destroy_plan(plan);
|
cannam@127
|
726 @end example
|
cannam@127
|
727
|
cannam@127
|
728 @findex fftw_alloc_real
|
cannam@127
|
729 Notice that we use the same @samp{local_size} functions as we did for
|
cannam@127
|
730 complex data, only now we interpret the sizes in terms of real rather
|
cannam@127
|
731 than complex values, and correspondingly use @code{fftw_alloc_real}.
|
cannam@127
|
732
|
cannam@127
|
733 @c ------------------------------------------------------------
|
cannam@127
|
734 @node FFTW MPI Transposes, FFTW MPI Wisdom, Other Multi-dimensional Real-data MPI Transforms, Distributed-memory FFTW with MPI
|
cannam@127
|
735 @section FFTW MPI Transposes
|
cannam@127
|
736 @cindex transpose
|
cannam@127
|
737
|
cannam@127
|
738 The FFTW's MPI Fourier transforms rely on one or more @emph{global
|
cannam@127
|
739 transposition} step for their communications. For example, the
|
cannam@127
|
740 multidimensional transforms work by transforming along some
|
cannam@127
|
741 dimensions, then transposing to make the first dimension local and
|
cannam@127
|
742 transforming that, then transposing back. Because global
|
cannam@127
|
743 transposition of a block-distributed matrix has many other potential
|
cannam@127
|
744 uses besides FFTs, FFTW's transpose routines can be called directly,
|
cannam@127
|
745 as documented in this section.
|
cannam@127
|
746
|
cannam@127
|
747 @menu
|
cannam@127
|
748 * Basic distributed-transpose interface::
|
cannam@127
|
749 * Advanced distributed-transpose interface::
|
cannam@127
|
750 * An improved replacement for MPI_Alltoall::
|
cannam@127
|
751 @end menu
|
cannam@127
|
752
|
cannam@127
|
753 @node Basic distributed-transpose interface, Advanced distributed-transpose interface, FFTW MPI Transposes, FFTW MPI Transposes
|
cannam@127
|
754 @subsection Basic distributed-transpose interface
|
cannam@127
|
755
|
cannam@127
|
756 In particular, suppose that we have an @code{n0} by @code{n1} array in
|
cannam@127
|
757 row-major order, block-distributed across the @code{n0} dimension. To
|
cannam@127
|
758 transpose this into an @code{n1} by @code{n0} array block-distributed
|
cannam@127
|
759 across the @code{n1} dimension, we would create a plan by calling the
|
cannam@127
|
760 following function:
|
cannam@127
|
761
|
cannam@127
|
762 @example
|
cannam@127
|
763 fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
|
cannam@127
|
764 double *in, double *out,
|
cannam@127
|
765 MPI_Comm comm, unsigned flags);
|
cannam@127
|
766 @end example
|
cannam@127
|
767 @findex fftw_mpi_plan_transpose
|
cannam@127
|
768
|
cannam@127
|
769 The input and output arrays (@code{in} and @code{out}) can be the
|
cannam@127
|
770 same. The transpose is actually executed by calling
|
cannam@127
|
771 @code{fftw_execute} on the plan, as usual.
|
cannam@127
|
772 @findex fftw_execute
|
cannam@127
|
773
|
cannam@127
|
774
|
cannam@127
|
775 The @code{flags} are the usual FFTW planner flags, but support
|
cannam@127
|
776 two additional flags: @code{FFTW_MPI_TRANSPOSED_OUT} and/or
|
cannam@127
|
777 @code{FFTW_MPI_TRANSPOSED_IN}. What these flags indicate, for
|
cannam@127
|
778 transpose plans, is that the output and/or input, respectively, are
|
cannam@127
|
779 @emph{locally} transposed. That is, on each process input data is
|
cannam@127
|
780 normally stored as a @code{local_n0} by @code{n1} array in row-major
|
cannam@127
|
781 order, but for an @code{FFTW_MPI_TRANSPOSED_IN} plan the input data is
|
cannam@127
|
782 stored as @code{n1} by @code{local_n0} in row-major order. Similarly,
|
cannam@127
|
783 @code{FFTW_MPI_TRANSPOSED_OUT} means that the output is @code{n0} by
|
cannam@127
|
784 @code{local_n1} instead of @code{local_n1} by @code{n0}.
|
cannam@127
|
785 @ctindex FFTW_MPI_TRANSPOSED_OUT
|
cannam@127
|
786 @ctindex FFTW_MPI_TRANSPOSED_IN
|
cannam@127
|
787
|
cannam@127
|
788
|
cannam@127
|
789 To determine the local size of the array on each process before and
|
cannam@127
|
790 after the transpose, as well as the amount of storage that must be
|
cannam@127
|
791 allocated, one should call @code{fftw_mpi_local_size_2d_transposed},
|
cannam@127
|
792 just as for a 2d DFT as described in the previous section:
|
cannam@127
|
793 @cindex data distribution
|
cannam@127
|
794
|
cannam@127
|
795 @example
|
cannam@127
|
796 ptrdiff_t fftw_mpi_local_size_2d_transposed
|
cannam@127
|
797 (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
|
cannam@127
|
798 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
|
cannam@127
|
799 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
|
cannam@127
|
800 @end example
|
cannam@127
|
801 @findex fftw_mpi_local_size_2d_transposed
|
cannam@127
|
802
|
cannam@127
|
803 Again, the return value is the local storage to allocate, which in
|
cannam@127
|
804 this case is the number of @emph{real} (@code{double}) values rather
|
cannam@127
|
805 than complex numbers as in the previous examples.
|
cannam@127
|
806
|
cannam@127
|
807 @node Advanced distributed-transpose interface, An improved replacement for MPI_Alltoall, Basic distributed-transpose interface, FFTW MPI Transposes
|
cannam@127
|
808 @subsection Advanced distributed-transpose interface
|
cannam@127
|
809
|
cannam@127
|
810 The above routines are for a transpose of a matrix of numbers (of type
|
cannam@127
|
811 @code{double}), using FFTW's default block sizes. More generally, one
|
cannam@127
|
812 can perform transposes of @emph{tuples} of numbers, with
|
cannam@127
|
813 user-specified block sizes for the input and output:
|
cannam@127
|
814
|
cannam@127
|
815 @example
|
cannam@127
|
816 fftw_plan fftw_mpi_plan_many_transpose
|
cannam@127
|
817 (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
|
cannam@127
|
818 ptrdiff_t block0, ptrdiff_t block1,
|
cannam@127
|
819 double *in, double *out, MPI_Comm comm, unsigned flags);
|
cannam@127
|
820 @end example
|
cannam@127
|
821 @findex fftw_mpi_plan_many_transpose
|
cannam@127
|
822
|
cannam@127
|
823 In this case, one is transposing an @code{n0} by @code{n1} matrix of
|
cannam@127
|
824 @code{howmany}-tuples (e.g. @code{howmany = 2} for complex numbers).
|
cannam@127
|
825 The input is distributed along the @code{n0} dimension with block size
|
cannam@127
|
826 @code{block0}, and the @code{n1} by @code{n0} output is distributed
|
cannam@127
|
827 along the @code{n1} dimension with block size @code{block1}. If
|
cannam@127
|
828 @code{FFTW_MPI_DEFAULT_BLOCK} (0) is passed for a block size then FFTW
|
cannam@127
|
829 uses its default block size. To get the local size of the data on
|
cannam@127
|
830 each process, you should then call @code{fftw_mpi_local_size_many_transposed}.
|
cannam@127
|
831 @ctindex FFTW_MPI_DEFAULT_BLOCK
|
cannam@127
|
832 @findex fftw_mpi_local_size_many_transposed
|
cannam@127
|
833
|
cannam@127
|
834 @node An improved replacement for MPI_Alltoall, , Advanced distributed-transpose interface, FFTW MPI Transposes
|
cannam@127
|
835 @subsection An improved replacement for MPI_Alltoall
|
cannam@127
|
836
|
cannam@127
|
837 We close this section by noting that FFTW's MPI transpose routines can
|
cannam@127
|
838 be thought of as a generalization for the @code{MPI_Alltoall} function
|
cannam@127
|
839 (albeit only for floating-point types), and in some circumstances can
|
cannam@127
|
840 function as an improved replacement.
|
cannam@127
|
841 @findex MPI_Alltoall
|
cannam@127
|
842
|
cannam@127
|
843
|
cannam@127
|
844 @code{MPI_Alltoall} is defined by the MPI standard as:
|
cannam@127
|
845
|
cannam@127
|
846 @example
|
cannam@127
|
847 int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
|
cannam@127
|
848 void *recvbuf, int recvcnt, MPI_Datatype recvtype,
|
cannam@127
|
849 MPI_Comm comm);
|
cannam@127
|
850 @end example
|
cannam@127
|
851
|
cannam@127
|
852 In particular, for @code{double*} arrays @code{in} and @code{out},
|
cannam@127
|
853 consider the call:
|
cannam@127
|
854
|
cannam@127
|
855 @example
|
cannam@127
|
856 MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany MPI_DOUBLE, comm);
|
cannam@127
|
857 @end example
|
cannam@127
|
858
|
cannam@127
|
859 This is completely equivalent to:
|
cannam@127
|
860
|
cannam@127
|
861 @example
|
cannam@127
|
862 MPI_Comm_size(comm, &P);
|
cannam@127
|
863 plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE);
|
cannam@127
|
864 fftw_execute(plan);
|
cannam@127
|
865 fftw_destroy_plan(plan);
|
cannam@127
|
866 @end example
|
cannam@127
|
867
|
cannam@127
|
868 That is, computing a @twodims{P,P} transpose on @code{P} processes,
|
cannam@127
|
869 with a block size of 1, is just a standard all-to-all communication.
|
cannam@127
|
870
|
cannam@127
|
871 However, using the FFTW routine instead of @code{MPI_Alltoall} may
|
cannam@127
|
872 have certain advantages. First of all, FFTW's routine can operate
|
cannam@127
|
873 in-place (@code{in == out}) whereas @code{MPI_Alltoall} can only
|
cannam@127
|
874 operate out-of-place.
|
cannam@127
|
875 @cindex in-place
|
cannam@127
|
876
|
cannam@127
|
877
|
cannam@127
|
878 Second, even for out-of-place plans, FFTW's routine may be faster,
|
cannam@127
|
879 especially if you need to perform the all-to-all communication many
|
cannam@127
|
880 times and can afford to use @code{FFTW_MEASURE} or
|
cannam@127
|
881 @code{FFTW_PATIENT}. It should certainly be no slower, not including
|
cannam@127
|
882 the time to create the plan, since one of the possible algorithms that
|
cannam@127
|
883 FFTW uses for an out-of-place transpose @emph{is} simply to call
|
cannam@127
|
884 @code{MPI_Alltoall}. However, FFTW also considers several other
|
cannam@127
|
885 possible algorithms that, depending on your MPI implementation and
|
cannam@127
|
886 your hardware, may be faster.
|
cannam@127
|
887 @ctindex FFTW_MEASURE
|
cannam@127
|
888 @ctindex FFTW_PATIENT
|
cannam@127
|
889
|
cannam@127
|
890 @c ------------------------------------------------------------
|
cannam@127
|
891 @node FFTW MPI Wisdom, Avoiding MPI Deadlocks, FFTW MPI Transposes, Distributed-memory FFTW with MPI
|
cannam@127
|
892 @section FFTW MPI Wisdom
|
cannam@127
|
893 @cindex wisdom
|
cannam@127
|
894 @cindex saving plans to disk
|
cannam@127
|
895
|
cannam@127
|
896 FFTW's ``wisdom'' facility (@pxref{Words of Wisdom-Saving Plans}) can
|
cannam@127
|
897 be used to save MPI plans as well as to save uniprocessor plans.
|
cannam@127
|
898 However, for MPI there are several unavoidable complications.
|
cannam@127
|
899
|
cannam@127
|
900 @cindex MPI I/O
|
cannam@127
|
901 First, the MPI standard does not guarantee that every process can
|
cannam@127
|
902 perform file I/O (at least, not using C stdio routines)---in general,
|
cannam@127
|
903 we may only assume that process 0 is capable of I/O.@footnote{In fact,
|
cannam@127
|
904 even this assumption is not technically guaranteed by the standard,
|
cannam@127
|
905 although it seems to be universal in actual MPI implementations and is
|
cannam@127
|
906 widely assumed by MPI-using software. Technically, you need to query
|
cannam@127
|
907 the @code{MPI_IO} attribute of @code{MPI_COMM_WORLD} with
|
cannam@127
|
908 @code{MPI_Attr_get}. If this attribute is @code{MPI_PROC_NULL}, no
|
cannam@127
|
909 I/O is possible. If it is @code{MPI_ANY_SOURCE}, any process can
|
cannam@127
|
910 perform I/O. Otherwise, it is the rank of a process that can perform
|
cannam@127
|
911 I/O ... but since it is not guaranteed to yield the @emph{same} rank
|
cannam@127
|
912 on all processes, you have to do an @code{MPI_Allreduce} of some kind
|
cannam@127
|
913 if you want all processes to agree about which is going to do I/O.
|
cannam@127
|
914 And even then, the standard only guarantees that this process can
|
cannam@127
|
915 perform output, but not input. See e.g. @cite{Parallel Programming
|
cannam@127
|
916 with MPI} by P. S. Pacheco, section 8.1.3. Needless to say, in our
|
cannam@127
|
917 experience virtually no MPI programmers worry about this.} So, if we
|
cannam@127
|
918 want to export the wisdom from a single process to a file, we must
|
cannam@127
|
919 first export the wisdom to a string, then send it to process 0, then
|
cannam@127
|
920 write it to a file.
|
cannam@127
|
921
|
cannam@127
|
922 Second, in principle we may want to have separate wisdom for every
|
cannam@127
|
923 process, since in general the processes may run on different hardware
|
cannam@127
|
924 even for a single MPI program. However, in practice FFTW's MPI code
|
cannam@127
|
925 is designed for the case of homogeneous hardware (@pxref{Load
|
cannam@127
|
926 balancing}), and in this case it is convenient to use the same wisdom
|
cannam@127
|
927 for every process. Thus, we need a mechanism to synchronize the wisdom.
|
cannam@127
|
928
|
cannam@127
|
929 To address both of these problems, FFTW provides the following two
|
cannam@127
|
930 functions:
|
cannam@127
|
931
|
cannam@127
|
932 @example
|
cannam@127
|
933 void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
|
cannam@127
|
934 void fftw_mpi_gather_wisdom(MPI_Comm comm);
|
cannam@127
|
935 @end example
|
cannam@127
|
936 @findex fftw_mpi_gather_wisdom
|
cannam@127
|
937 @findex fftw_mpi_broadcast_wisdom
|
cannam@127
|
938
|
cannam@127
|
939 Given a communicator @code{comm}, @code{fftw_mpi_broadcast_wisdom}
|
cannam@127
|
940 will broadcast the wisdom from process 0 to all other processes.
|
cannam@127
|
941 Conversely, @code{fftw_mpi_gather_wisdom} will collect wisdom from all
|
cannam@127
|
942 processes onto process 0. (If the plans created for the same problem
|
cannam@127
|
943 by different processes are not the same, @code{fftw_mpi_gather_wisdom}
|
cannam@127
|
944 will arbitrarily choose one of the plans.) Both of these functions
|
cannam@127
|
945 may result in suboptimal plans for different processes if the
|
cannam@127
|
946 processes are running on non-identical hardware. Both of these
|
cannam@127
|
947 functions are @emph{collective} calls, which means that they must be
|
cannam@127
|
948 executed by all processes in the communicator.
|
cannam@127
|
949 @cindex collective function
|
cannam@127
|
950
|
cannam@127
|
951
|
cannam@127
|
952 So, for example, a typical code snippet to import wisdom from a file
|
cannam@127
|
953 and use it on all processes would be:
|
cannam@127
|
954
|
cannam@127
|
955 @example
|
cannam@127
|
956 @{
|
cannam@127
|
957 int rank;
|
cannam@127
|
958
|
cannam@127
|
959 fftw_mpi_init();
|
cannam@127
|
960 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
|
cannam@127
|
961 if (rank == 0) fftw_import_wisdom_from_filename("mywisdom");
|
cannam@127
|
962 fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD);
|
cannam@127
|
963 @}
|
cannam@127
|
964 @end example
|
cannam@127
|
965
|
cannam@127
|
966 (Note that we must call @code{fftw_mpi_init} before importing any
|
cannam@127
|
967 wisdom that might contain MPI plans.) Similarly, a typical code
|
cannam@127
|
968 snippet to export wisdom from all processes to a file is:
|
cannam@127
|
969 @findex fftw_mpi_init
|
cannam@127
|
970
|
cannam@127
|
971 @example
|
cannam@127
|
972 @{
|
cannam@127
|
973 int rank;
|
cannam@127
|
974
|
cannam@127
|
975 fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
|
cannam@127
|
976 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
|
cannam@127
|
977 if (rank == 0) fftw_export_wisdom_to_filename("mywisdom");
|
cannam@127
|
978 @}
|
cannam@127
|
979 @end example
|
cannam@127
|
980
|
cannam@127
|
981 @c ------------------------------------------------------------
|
cannam@127
|
982 @node Avoiding MPI Deadlocks, FFTW MPI Performance Tips, FFTW MPI Wisdom, Distributed-memory FFTW with MPI
|
cannam@127
|
983 @section Avoiding MPI Deadlocks
|
cannam@127
|
984 @cindex deadlock
|
cannam@127
|
985
|
cannam@127
|
986 An MPI program can @emph{deadlock} if one process is waiting for a
|
cannam@127
|
987 message from another process that never gets sent. To avoid deadlocks
|
cannam@127
|
988 when using FFTW's MPI routines, it is important to know which
|
cannam@127
|
989 functions are @emph{collective}: that is, which functions must
|
cannam@127
|
990 @emph{always} be called in the @emph{same order} from @emph{every}
|
cannam@127
|
991 process in a given communicator. (For example, @code{MPI_Barrier} is
|
cannam@127
|
992 the canonical example of a collective function in the MPI standard.)
|
cannam@127
|
993 @cindex collective function
|
cannam@127
|
994 @findex MPI_Barrier
|
cannam@127
|
995
|
cannam@127
|
996
|
cannam@127
|
997 The functions in FFTW that are @emph{always} collective are: every
|
cannam@127
|
998 function beginning with @samp{fftw_mpi_plan}, as well as
|
cannam@127
|
999 @code{fftw_mpi_broadcast_wisdom} and @code{fftw_mpi_gather_wisdom}.
|
cannam@127
|
1000 Also, the following functions from the ordinary FFTW interface are
|
cannam@127
|
1001 collective when they are applied to a plan created by an
|
cannam@127
|
1002 @samp{fftw_mpi_plan} function: @code{fftw_execute},
|
cannam@127
|
1003 @code{fftw_destroy_plan}, and @code{fftw_flops}.
|
cannam@127
|
1004 @findex fftw_execute
|
cannam@127
|
1005 @findex fftw_destroy_plan
|
cannam@127
|
1006 @findex fftw_flops
|
cannam@127
|
1007
|
cannam@127
|
1008 @c ------------------------------------------------------------
|
cannam@127
|
1009 @node FFTW MPI Performance Tips, Combining MPI and Threads, Avoiding MPI Deadlocks, Distributed-memory FFTW with MPI
|
cannam@127
|
1010 @section FFTW MPI Performance Tips
|
cannam@127
|
1011
|
cannam@127
|
1012 In this section, we collect a few tips on getting the best performance
|
cannam@127
|
1013 out of FFTW's MPI transforms.
|
cannam@127
|
1014
|
cannam@127
|
1015 First, because of the 1d block distribution, FFTW's parallelization is
|
cannam@127
|
1016 currently limited by the size of the first dimension.
|
cannam@127
|
1017 (Multidimensional block distributions may be supported by a future
|
cannam@127
|
1018 version.) More generally, you should ideally arrange the dimensions so
|
cannam@127
|
1019 that FFTW can divide them equally among the processes. @xref{Load
|
cannam@127
|
1020 balancing}.
|
cannam@127
|
1021 @cindex block distribution
|
cannam@127
|
1022 @cindex load balancing
|
cannam@127
|
1023
|
cannam@127
|
1024
|
cannam@127
|
1025 Second, if it is not too inconvenient, you should consider working
|
cannam@127
|
1026 with transposed output for multidimensional plans, as this saves a
|
cannam@127
|
1027 considerable amount of communications. @xref{Transposed distributions}.
|
cannam@127
|
1028 @cindex transpose
|
cannam@127
|
1029
|
cannam@127
|
1030
|
cannam@127
|
1031 Third, the fastest choices are generally either an in-place transform
|
cannam@127
|
1032 or an out-of-place transform with the @code{FFTW_DESTROY_INPUT} flag
|
cannam@127
|
1033 (which allows the input array to be used as scratch space). In-place
|
cannam@127
|
1034 is especially beneficial if the amount of data per process is large.
|
cannam@127
|
1035 @ctindex FFTW_DESTROY_INPUT
|
cannam@127
|
1036
|
cannam@127
|
1037
|
cannam@127
|
1038 Fourth, if you have multiple arrays to transform at once, rather than
|
cannam@127
|
1039 calling FFTW's MPI transforms several times it usually seems to be
|
cannam@127
|
1040 faster to interleave the data and use the advanced interface. (This
|
cannam@127
|
1041 groups the communications together instead of requiring separate
|
cannam@127
|
1042 messages for each transform.)
|
cannam@127
|
1043
|
cannam@127
|
1044 @c ------------------------------------------------------------
|
cannam@127
|
1045 @node Combining MPI and Threads, FFTW MPI Reference, FFTW MPI Performance Tips, Distributed-memory FFTW with MPI
|
cannam@127
|
1046 @section Combining MPI and Threads
|
cannam@127
|
1047 @cindex threads
|
cannam@127
|
1048
|
cannam@127
|
1049 In certain cases, it may be advantageous to combine MPI
|
cannam@127
|
1050 (distributed-memory) and threads (shared-memory) parallelization.
|
cannam@127
|
1051 FFTW supports this, with certain caveats. For example, if you have a
|
cannam@127
|
1052 cluster of 4-processor shared-memory nodes, you may want to use
|
cannam@127
|
1053 threads within the nodes and MPI between the nodes, instead of MPI for
|
cannam@127
|
1054 all parallelization.
|
cannam@127
|
1055
|
cannam@127
|
1056 In particular, it is possible to seamlessly combine the MPI FFTW
|
cannam@127
|
1057 routines with the multi-threaded FFTW routines (@pxref{Multi-threaded
|
cannam@127
|
1058 FFTW}). However, some care must be taken in the initialization code,
|
cannam@127
|
1059 which should look something like this:
|
cannam@127
|
1060
|
cannam@127
|
1061 @example
|
cannam@127
|
1062 int threads_ok;
|
cannam@127
|
1063
|
cannam@127
|
1064 int main(int argc, char **argv)
|
cannam@127
|
1065 @{
|
cannam@127
|
1066 int provided;
|
cannam@127
|
1067 MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
|
cannam@127
|
1068 threads_ok = provided >= MPI_THREAD_FUNNELED;
|
cannam@127
|
1069
|
cannam@127
|
1070 if (threads_ok) threads_ok = fftw_init_threads();
|
cannam@127
|
1071 fftw_mpi_init();
|
cannam@127
|
1072
|
cannam@127
|
1073 ...
|
cannam@127
|
1074 if (threads_ok) fftw_plan_with_nthreads(...);
|
cannam@127
|
1075 ...
|
cannam@127
|
1076
|
cannam@127
|
1077 MPI_Finalize();
|
cannam@127
|
1078 @}
|
cannam@127
|
1079 @end example
|
cannam@127
|
1080 @findex fftw_mpi_init
|
cannam@127
|
1081 @findex fftw_init_threads
|
cannam@127
|
1082 @findex fftw_plan_with_nthreads
|
cannam@127
|
1083
|
cannam@127
|
1084 First, note that instead of calling @code{MPI_Init}, you should call
|
cannam@127
|
1085 @code{MPI_Init_threads}, which is the initialization routine defined
|
cannam@127
|
1086 by the MPI-2 standard to indicate to MPI that your program will be
|
cannam@127
|
1087 multithreaded. We pass @code{MPI_THREAD_FUNNELED}, which indicates
|
cannam@127
|
1088 that we will only call MPI routines from the main thread. (FFTW will
|
cannam@127
|
1089 launch additional threads internally, but the extra threads will not
|
cannam@127
|
1090 call MPI code.) (You may also pass @code{MPI_THREAD_SERIALIZED} or
|
cannam@127
|
1091 @code{MPI_THREAD_MULTIPLE}, which requests additional multithreading
|
cannam@127
|
1092 support from the MPI implementation, but this is not required by
|
cannam@127
|
1093 FFTW.) The @code{provided} parameter returns what level of threads
|
cannam@127
|
1094 support is actually supported by your MPI implementation; this
|
cannam@127
|
1095 @emph{must} be at least @code{MPI_THREAD_FUNNELED} if you want to call
|
cannam@127
|
1096 the FFTW threads routines, so we define a global variable
|
cannam@127
|
1097 @code{threads_ok} to record this. You should only call
|
cannam@127
|
1098 @code{fftw_init_threads} or @code{fftw_plan_with_nthreads} if
|
cannam@127
|
1099 @code{threads_ok} is true. For more information on thread safety in
|
cannam@127
|
1100 MPI, see the
|
cannam@127
|
1101 @uref{http://www.mpi-forum.org/docs/mpi-20-html/node162.htm, MPI and
|
cannam@127
|
1102 Threads} section of the MPI-2 standard.
|
cannam@127
|
1103 @cindex thread safety
|
cannam@127
|
1104
|
cannam@127
|
1105
|
cannam@127
|
1106 Second, we must call @code{fftw_init_threads} @emph{before}
|
cannam@127
|
1107 @code{fftw_mpi_init}. This is critical for technical reasons having
|
cannam@127
|
1108 to do with how FFTW initializes its list of algorithms.
|
cannam@127
|
1109
|
cannam@127
|
1110 Then, if you call @code{fftw_plan_with_nthreads(N)}, @emph{every} MPI
|
cannam@127
|
1111 process will launch (up to) @code{N} threads to parallelize its transforms.
|
cannam@127
|
1112
|
cannam@127
|
1113 For example, in the hypothetical cluster of 4-processor nodes, you
|
cannam@127
|
1114 might wish to launch only a single MPI process per node, and then call
|
cannam@127
|
1115 @code{fftw_plan_with_nthreads(4)} on each process to use all
|
cannam@127
|
1116 processors in the nodes.
|
cannam@127
|
1117
|
cannam@127
|
1118 This may or may not be faster than simply using as many MPI processes
|
cannam@127
|
1119 as you have processors, however. On the one hand, using threads
|
cannam@127
|
1120 within a node eliminates the need for explicit message passing within
|
cannam@127
|
1121 the node. On the other hand, FFTW's transpose routines are not
|
cannam@127
|
1122 multi-threaded, and this means that the communications that do take
|
cannam@127
|
1123 place will not benefit from parallelization within the node.
|
cannam@127
|
1124 Moreover, many MPI implementations already have optimizations to
|
cannam@127
|
1125 exploit shared memory when it is available, so adding the
|
cannam@127
|
1126 multithreaded FFTW on top of this may be superfluous.
|
cannam@127
|
1127 @cindex transpose
|
cannam@127
|
1128
|
cannam@127
|
1129 @c ------------------------------------------------------------
|
cannam@127
|
1130 @node FFTW MPI Reference, FFTW MPI Fortran Interface, Combining MPI and Threads, Distributed-memory FFTW with MPI
|
cannam@127
|
1131 @section FFTW MPI Reference
|
cannam@127
|
1132
|
cannam@127
|
1133 This chapter provides a complete reference to all FFTW MPI functions,
|
cannam@127
|
1134 datatypes, and constants. See also @ref{FFTW Reference} for information
|
cannam@127
|
1135 on functions and types in common with the serial interface.
|
cannam@127
|
1136
|
cannam@127
|
1137 @menu
|
cannam@127
|
1138 * MPI Files and Data Types::
|
cannam@127
|
1139 * MPI Initialization::
|
cannam@127
|
1140 * Using MPI Plans::
|
cannam@127
|
1141 * MPI Data Distribution Functions::
|
cannam@127
|
1142 * MPI Plan Creation::
|
cannam@127
|
1143 * MPI Wisdom Communication::
|
cannam@127
|
1144 @end menu
|
cannam@127
|
1145
|
cannam@127
|
1146 @node MPI Files and Data Types, MPI Initialization, FFTW MPI Reference, FFTW MPI Reference
|
cannam@127
|
1147 @subsection MPI Files and Data Types
|
cannam@127
|
1148
|
cannam@127
|
1149 All programs using FFTW's MPI support should include its header file:
|
cannam@127
|
1150
|
cannam@127
|
1151 @example
|
cannam@127
|
1152 #include <fftw3-mpi.h>
|
cannam@127
|
1153 @end example
|
cannam@127
|
1154
|
cannam@127
|
1155 Note that this header file includes the serial-FFTW @code{fftw3.h}
|
cannam@127
|
1156 header file, and also the @code{mpi.h} header file for MPI, so you
|
cannam@127
|
1157 need not include those files separately.
|
cannam@127
|
1158
|
cannam@127
|
1159 You must also link to @emph{both} the FFTW MPI library and to the
|
cannam@127
|
1160 serial FFTW library. On Unix, this means adding @code{-lfftw3_mpi
|
cannam@127
|
1161 -lfftw3 -lm} at the end of the link command.
|
cannam@127
|
1162
|
cannam@127
|
1163 @cindex precision
|
cannam@127
|
1164 Different precisions are handled as in the serial interface:
|
cannam@127
|
1165 @xref{Precision}. That is, @samp{fftw_} functions become
|
cannam@127
|
1166 @code{fftwf_} (in single precision) etcetera, and the libraries become
|
cannam@127
|
1167 @code{-lfftw3f_mpi -lfftw3f -lm} etcetera on Unix. Long-double
|
cannam@127
|
1168 precision is supported in MPI, but quad precision (@samp{fftwq_}) is
|
cannam@127
|
1169 not due to the lack of MPI support for this type.
|
cannam@127
|
1170
|
cannam@127
|
1171 @node MPI Initialization, Using MPI Plans, MPI Files and Data Types, FFTW MPI Reference
|
cannam@127
|
1172 @subsection MPI Initialization
|
cannam@127
|
1173
|
cannam@127
|
1174 Before calling any other FFTW MPI (@samp{fftw_mpi_}) function, and
|
cannam@127
|
1175 before importing any wisdom for MPI problems, you must call:
|
cannam@127
|
1176
|
cannam@127
|
1177 @findex fftw_mpi_init
|
cannam@127
|
1178 @example
|
cannam@127
|
1179 void fftw_mpi_init(void);
|
cannam@127
|
1180 @end example
|
cannam@127
|
1181
|
cannam@127
|
1182 @findex fftw_init_threads
|
cannam@127
|
1183 If FFTW threads support is used, however, @code{fftw_mpi_init} should
|
cannam@127
|
1184 be called @emph{after} @code{fftw_init_threads} (@pxref{Combining MPI
|
cannam@127
|
1185 and Threads}). Calling @code{fftw_mpi_init} additional times (before
|
cannam@127
|
1186 @code{fftw_mpi_cleanup}) has no effect.
|
cannam@127
|
1187
|
cannam@127
|
1188
|
cannam@127
|
1189 If you want to deallocate all persistent data and reset FFTW to the
|
cannam@127
|
1190 pristine state it was in when you started your program, you can call:
|
cannam@127
|
1191
|
cannam@127
|
1192 @findex fftw_mpi_cleanup
|
cannam@127
|
1193 @example
|
cannam@127
|
1194 void fftw_mpi_cleanup(void);
|
cannam@127
|
1195 @end example
|
cannam@127
|
1196
|
cannam@127
|
1197 @findex fftw_cleanup
|
cannam@127
|
1198 (This calls @code{fftw_cleanup}, so you need not call the serial
|
cannam@127
|
1199 cleanup routine too, although it is safe to do so.) After calling
|
cannam@127
|
1200 @code{fftw_mpi_cleanup}, all existing plans become undefined, and you
|
cannam@127
|
1201 should not attempt to execute or destroy them. You must call
|
cannam@127
|
1202 @code{fftw_mpi_init} again after @code{fftw_mpi_cleanup} if you want
|
cannam@127
|
1203 to resume using the MPI FFTW routines.
|
cannam@127
|
1204
|
cannam@127
|
1205 @node Using MPI Plans, MPI Data Distribution Functions, MPI Initialization, FFTW MPI Reference
|
cannam@127
|
1206 @subsection Using MPI Plans
|
cannam@127
|
1207
|
cannam@127
|
1208 Once an MPI plan is created, you can execute and destroy it using
|
cannam@127
|
1209 @code{fftw_execute}, @code{fftw_destroy_plan}, and the other functions
|
cannam@127
|
1210 in the serial interface that operate on generic plans (@pxref{Using
|
cannam@127
|
1211 Plans}).
|
cannam@127
|
1212
|
cannam@127
|
1213 @cindex collective function
|
cannam@127
|
1214 @cindex MPI communicator
|
cannam@127
|
1215 The @code{fftw_execute} and @code{fftw_destroy_plan} functions, applied to
|
cannam@127
|
1216 MPI plans, are @emph{collective} calls: they must be called for all processes
|
cannam@127
|
1217 in the communicator that was used to create the plan.
|
cannam@127
|
1218
|
cannam@127
|
1219 @cindex new-array execution
|
cannam@127
|
1220 You must @emph{not} use the serial new-array plan-execution functions
|
cannam@127
|
1221 @code{fftw_execute_dft} and so on (@pxref{New-array Execute
|
cannam@127
|
1222 Functions}) with MPI plans. Such functions are specialized to the
|
cannam@127
|
1223 problem type, and there are specific new-array execute functions for MPI plans:
|
cannam@127
|
1224
|
cannam@127
|
1225 @findex fftw_mpi_execute_dft
|
cannam@127
|
1226 @findex fftw_mpi_execute_dft_r2c
|
cannam@127
|
1227 @findex fftw_mpi_execute_dft_c2r
|
cannam@127
|
1228 @findex fftw_mpi_execute_r2r
|
cannam@127
|
1229 @example
|
cannam@127
|
1230 void fftw_mpi_execute_dft(fftw_plan p, fftw_complex *in, fftw_complex *out);
|
cannam@127
|
1231 void fftw_mpi_execute_dft_r2c(fftw_plan p, double *in, fftw_complex *out);
|
cannam@127
|
1232 void fftw_mpi_execute_dft_c2r(fftw_plan p, fftw_complex *in, double *out);
|
cannam@127
|
1233 void fftw_mpi_execute_r2r(fftw_plan p, double *in, double *out);
|
cannam@127
|
1234 @end example
|
cannam@127
|
1235
|
cannam@127
|
1236 @cindex alignment
|
cannam@127
|
1237 @findex fftw_malloc
|
cannam@127
|
1238 These functions have the same restrictions as those of the serial
|
cannam@127
|
1239 new-array execute functions. They are @emph{always} safe to apply to
|
cannam@127
|
1240 the @emph{same} @code{in} and @code{out} arrays that were used to
|
cannam@127
|
1241 create the plan. They can only be applied to new arrarys if those
|
cannam@127
|
1242 arrays have the same types, dimensions, in-placeness, and alignment as
|
cannam@127
|
1243 the original arrays, where the best way to ensure the same alignment
|
cannam@127
|
1244 is to use FFTW's @code{fftw_malloc} and related allocation functions
|
cannam@127
|
1245 for all arrays (@pxref{Memory Allocation}). Note that distributed
|
cannam@127
|
1246 transposes (@pxref{FFTW MPI Transposes}) use
|
cannam@127
|
1247 @code{fftw_mpi_execute_r2r}, since they count as rank-zero r2r plans
|
cannam@127
|
1248 from FFTW's perspective.
|
cannam@127
|
1249
|
cannam@127
|
1250 @node MPI Data Distribution Functions, MPI Plan Creation, Using MPI Plans, FFTW MPI Reference
|
cannam@127
|
1251 @subsection MPI Data Distribution Functions
|
cannam@127
|
1252
|
cannam@127
|
1253 @cindex data distribution
|
cannam@127
|
1254 As described above (@pxref{MPI Data Distribution}), in order to
|
cannam@127
|
1255 allocate your arrays, @emph{before} creating a plan, you must first
|
cannam@127
|
1256 call one of the following routines to determine the required
|
cannam@127
|
1257 allocation size and the portion of the array locally stored on a given
|
cannam@127
|
1258 process. The @code{MPI_Comm} communicator passed here must be
|
cannam@127
|
1259 equivalent to the communicator used below for plan creation.
|
cannam@127
|
1260
|
cannam@127
|
1261 The basic interface for multidimensional transforms consists of the
|
cannam@127
|
1262 functions:
|
cannam@127
|
1263
|
cannam@127
|
1264 @findex fftw_mpi_local_size_2d
|
cannam@127
|
1265 @findex fftw_mpi_local_size_3d
|
cannam@127
|
1266 @findex fftw_mpi_local_size
|
cannam@127
|
1267 @findex fftw_mpi_local_size_2d_transposed
|
cannam@127
|
1268 @findex fftw_mpi_local_size_3d_transposed
|
cannam@127
|
1269 @findex fftw_mpi_local_size_transposed
|
cannam@127
|
1270 @example
|
cannam@127
|
1271 ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
|
cannam@127
|
1272 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
|
cannam@127
|
1273 ptrdiff_t fftw_mpi_local_size_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
|
cannam@127
|
1274 MPI_Comm comm,
|
cannam@127
|
1275 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
|
cannam@127
|
1276 ptrdiff_t fftw_mpi_local_size(int rnk, const ptrdiff_t *n, MPI_Comm comm,
|
cannam@127
|
1277 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
|
cannam@127
|
1278
|
cannam@127
|
1279 ptrdiff_t fftw_mpi_local_size_2d_transposed(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
|
cannam@127
|
1280 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
|
cannam@127
|
1281 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
|
cannam@127
|
1282 ptrdiff_t fftw_mpi_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
|
cannam@127
|
1283 MPI_Comm comm,
|
cannam@127
|
1284 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
|
cannam@127
|
1285 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
|
cannam@127
|
1286 ptrdiff_t fftw_mpi_local_size_transposed(int rnk, const ptrdiff_t *n, MPI_Comm comm,
|
cannam@127
|
1287 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
|
cannam@127
|
1288 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
|
cannam@127
|
1289 @end example
|
cannam@127
|
1290
|
cannam@127
|
1291 These functions return the number of elements to allocate (complex
|
cannam@127
|
1292 numbers for DFT/r2c/c2r plans, real numbers for r2r plans), whereas
|
cannam@127
|
1293 the @code{local_n0} and @code{local_0_start} return the portion
|
cannam@127
|
1294 (@code{local_0_start} to @code{local_0_start + local_n0 - 1}) of the
|
cannam@127
|
1295 first dimension of an @ndims{} array that is stored on the local
|
cannam@127
|
1296 process. @xref{Basic and advanced distribution interfaces}. For
|
cannam@127
|
1297 @code{FFTW_MPI_TRANSPOSED_OUT} plans, the @samp{_transposed} variants
|
cannam@127
|
1298 are useful in order to also return the local portion of the first
|
cannam@127
|
1299 dimension in the @ndimstrans{} transposed output.
|
cannam@127
|
1300 @xref{Transposed distributions}.
|
cannam@127
|
1301 The advanced interface for multidimensional transforms is:
|
cannam@127
|
1302
|
cannam@127
|
1303 @cindex advanced interface
|
cannam@127
|
1304 @findex fftw_mpi_local_size_many
|
cannam@127
|
1305 @findex fftw_mpi_local_size_many_transposed
|
cannam@127
|
1306 @example
|
cannam@127
|
1307 ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
|
cannam@127
|
1308 ptrdiff_t block0, MPI_Comm comm,
|
cannam@127
|
1309 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
|
cannam@127
|
1310 ptrdiff_t fftw_mpi_local_size_many_transposed(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
|
cannam@127
|
1311 ptrdiff_t block0, ptrdiff_t block1, MPI_Comm comm,
|
cannam@127
|
1312 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
|
cannam@127
|
1313 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
|
cannam@127
|
1314 @end example
|
cannam@127
|
1315
|
cannam@127
|
1316 These differ from the basic interface in only two ways. First, they
|
cannam@127
|
1317 allow you to specify block sizes @code{block0} and @code{block1} (the
|
cannam@127
|
1318 latter for the transposed output); you can pass
|
cannam@127
|
1319 @code{FFTW_MPI_DEFAULT_BLOCK} to use FFTW's default block size as in
|
cannam@127
|
1320 the basic interface. Second, you can pass a @code{howmany} parameter,
|
cannam@127
|
1321 corresponding to the advanced planning interface below: this is for
|
cannam@127
|
1322 transforms of contiguous @code{howmany}-tuples of numbers
|
cannam@127
|
1323 (@code{howmany = 1} in the basic interface).
|
cannam@127
|
1324
|
cannam@127
|
1325 The corresponding basic and advanced routines for one-dimensional
|
cannam@127
|
1326 transforms (currently only complex DFTs) are:
|
cannam@127
|
1327
|
cannam@127
|
1328 @findex fftw_mpi_local_size_1d
|
cannam@127
|
1329 @findex fftw_mpi_local_size_many_1d
|
cannam@127
|
1330 @example
|
cannam@127
|
1331 ptrdiff_t fftw_mpi_local_size_1d(
|
cannam@127
|
1332 ptrdiff_t n0, MPI_Comm comm, int sign, unsigned flags,
|
cannam@127
|
1333 ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
|
cannam@127
|
1334 ptrdiff_t *local_no, ptrdiff_t *local_o_start);
|
cannam@127
|
1335 ptrdiff_t fftw_mpi_local_size_many_1d(
|
cannam@127
|
1336 ptrdiff_t n0, ptrdiff_t howmany,
|
cannam@127
|
1337 MPI_Comm comm, int sign, unsigned flags,
|
cannam@127
|
1338 ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
|
cannam@127
|
1339 ptrdiff_t *local_no, ptrdiff_t *local_o_start);
|
cannam@127
|
1340 @end example
|
cannam@127
|
1341
|
cannam@127
|
1342 @ctindex FFTW_MPI_SCRAMBLED_OUT
|
cannam@127
|
1343 @ctindex FFTW_MPI_SCRAMBLED_IN
|
cannam@127
|
1344 As above, the return value is the number of elements to allocate
|
cannam@127
|
1345 (complex numbers, for complex DFTs). The @code{local_ni} and
|
cannam@127
|
1346 @code{local_i_start} arguments return the portion
|
cannam@127
|
1347 (@code{local_i_start} to @code{local_i_start + local_ni - 1}) of the
|
cannam@127
|
1348 1d array that is stored on this process for the transform
|
cannam@127
|
1349 @emph{input}, and @code{local_no} and @code{local_o_start} are the
|
cannam@127
|
1350 corresponding quantities for the input. The @code{sign}
|
cannam@127
|
1351 (@code{FFTW_FORWARD} or @code{FFTW_BACKWARD}) and @code{flags} must
|
cannam@127
|
1352 match the arguments passed when creating a plan. Although the inputs
|
cannam@127
|
1353 and outputs have different data distributions in general, it is
|
cannam@127
|
1354 guaranteed that the @emph{output} data distribution of an
|
cannam@127
|
1355 @code{FFTW_FORWARD} plan will match the @emph{input} data distribution
|
cannam@127
|
1356 of an @code{FFTW_BACKWARD} plan and vice versa; similarly for the
|
cannam@127
|
1357 @code{FFTW_MPI_SCRAMBLED_OUT} and @code{FFTW_MPI_SCRAMBLED_IN} flags.
|
cannam@127
|
1358 @xref{One-dimensional distributions}.
|
cannam@127
|
1359
|
cannam@127
|
1360 @node MPI Plan Creation, MPI Wisdom Communication, MPI Data Distribution Functions, FFTW MPI Reference
|
cannam@127
|
1361 @subsection MPI Plan Creation
|
cannam@127
|
1362
|
cannam@127
|
1363 @subsubheading Complex-data MPI DFTs
|
cannam@127
|
1364
|
cannam@127
|
1365 Plans for complex-data DFTs (@pxref{2d MPI example}) are created by:
|
cannam@127
|
1366
|
cannam@127
|
1367 @findex fftw_mpi_plan_dft_1d
|
cannam@127
|
1368 @findex fftw_mpi_plan_dft_2d
|
cannam@127
|
1369 @findex fftw_mpi_plan_dft_3d
|
cannam@127
|
1370 @findex fftw_mpi_plan_dft
|
cannam@127
|
1371 @findex fftw_mpi_plan_many_dft
|
cannam@127
|
1372 @example
|
cannam@127
|
1373 fftw_plan fftw_mpi_plan_dft_1d(ptrdiff_t n0, fftw_complex *in, fftw_complex *out,
|
cannam@127
|
1374 MPI_Comm comm, int sign, unsigned flags);
|
cannam@127
|
1375 fftw_plan fftw_mpi_plan_dft_2d(ptrdiff_t n0, ptrdiff_t n1,
|
cannam@127
|
1376 fftw_complex *in, fftw_complex *out,
|
cannam@127
|
1377 MPI_Comm comm, int sign, unsigned flags);
|
cannam@127
|
1378 fftw_plan fftw_mpi_plan_dft_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
|
cannam@127
|
1379 fftw_complex *in, fftw_complex *out,
|
cannam@127
|
1380 MPI_Comm comm, int sign, unsigned flags);
|
cannam@127
|
1381 fftw_plan fftw_mpi_plan_dft(int rnk, const ptrdiff_t *n,
|
cannam@127
|
1382 fftw_complex *in, fftw_complex *out,
|
cannam@127
|
1383 MPI_Comm comm, int sign, unsigned flags);
|
cannam@127
|
1384 fftw_plan fftw_mpi_plan_many_dft(int rnk, const ptrdiff_t *n,
|
cannam@127
|
1385 ptrdiff_t howmany, ptrdiff_t block, ptrdiff_t tblock,
|
cannam@127
|
1386 fftw_complex *in, fftw_complex *out,
|
cannam@127
|
1387 MPI_Comm comm, int sign, unsigned flags);
|
cannam@127
|
1388 @end example
|
cannam@127
|
1389
|
cannam@127
|
1390 @cindex MPI communicator
|
cannam@127
|
1391 @cindex collective function
|
cannam@127
|
1392 These are similar to their serial counterparts (@pxref{Complex DFTs})
|
cannam@127
|
1393 in specifying the dimensions, sign, and flags of the transform. The
|
cannam@127
|
1394 @code{comm} argument gives an MPI communicator that specifies the set
|
cannam@127
|
1395 of processes to participate in the transform; plan creation is a
|
cannam@127
|
1396 collective function that must be called for all processes in the
|
cannam@127
|
1397 communicator. The @code{in} and @code{out} pointers refer only to a
|
cannam@127
|
1398 portion of the overall transform data (@pxref{MPI Data Distribution})
|
cannam@127
|
1399 as specified by the @samp{local_size} functions in the previous
|
cannam@127
|
1400 section. Unless @code{flags} contains @code{FFTW_ESTIMATE}, these
|
cannam@127
|
1401 arrays are overwritten during plan creation as for the serial
|
cannam@127
|
1402 interface. For multi-dimensional transforms, any dimensions @code{>
|
cannam@127
|
1403 1} are supported; for one-dimensional transforms, only composite
|
cannam@127
|
1404 (non-prime) @code{n0} are currently supported (unlike the serial
|
cannam@127
|
1405 FFTW). Requesting an unsupported transform size will yield a
|
cannam@127
|
1406 @code{NULL} plan. (As in the serial interface, highly composite sizes
|
cannam@127
|
1407 generally yield the best performance.)
|
cannam@127
|
1408
|
cannam@127
|
1409 @cindex advanced interface
|
cannam@127
|
1410 @ctindex FFTW_MPI_DEFAULT_BLOCK
|
cannam@127
|
1411 @cindex stride
|
cannam@127
|
1412 The advanced-interface @code{fftw_mpi_plan_many_dft} additionally
|
cannam@127
|
1413 allows you to specify the block sizes for the first dimension
|
cannam@127
|
1414 (@code{block}) of the @ndims{} input data and the first dimension
|
cannam@127
|
1415 (@code{tblock}) of the @ndimstrans{} transposed data (at intermediate
|
cannam@127
|
1416 steps of the transform, and for the output if
|
cannam@127
|
1417 @code{FFTW_TRANSPOSED_OUT} is specified in @code{flags}). These must
|
cannam@127
|
1418 be the same block sizes as were passed to the corresponding
|
cannam@127
|
1419 @samp{local_size} function; you can pass @code{FFTW_MPI_DEFAULT_BLOCK}
|
cannam@127
|
1420 to use FFTW's default block size as in the basic interface. Also, the
|
cannam@127
|
1421 @code{howmany} parameter specifies that the transform is of contiguous
|
cannam@127
|
1422 @code{howmany}-tuples rather than individual complex numbers; this
|
cannam@127
|
1423 corresponds to the same parameter in the serial advanced interface
|
cannam@127
|
1424 (@pxref{Advanced Complex DFTs}) with @code{stride = howmany} and
|
cannam@127
|
1425 @code{dist = 1}.
|
cannam@127
|
1426
|
cannam@127
|
1427 @subsubheading MPI flags
|
cannam@127
|
1428
|
cannam@127
|
1429 The @code{flags} can be any of those for the serial FFTW
|
cannam@127
|
1430 (@pxref{Planner Flags}), and in addition may include one or more of
|
cannam@127
|
1431 the following MPI-specific flags, which improve performance at the
|
cannam@127
|
1432 cost of changing the output or input data formats.
|
cannam@127
|
1433
|
cannam@127
|
1434 @itemize @bullet
|
cannam@127
|
1435
|
cannam@127
|
1436 @item
|
cannam@127
|
1437 @ctindex FFTW_MPI_SCRAMBLED_OUT
|
cannam@127
|
1438 @ctindex FFTW_MPI_SCRAMBLED_IN
|
cannam@127
|
1439 @code{FFTW_MPI_SCRAMBLED_OUT}, @code{FFTW_MPI_SCRAMBLED_IN}: valid for
|
cannam@127
|
1440 1d transforms only, these flags indicate that the output/input of the
|
cannam@127
|
1441 transform are in an undocumented ``scrambled'' order. A forward
|
cannam@127
|
1442 @code{FFTW_MPI_SCRAMBLED_OUT} transform can be inverted by a backward
|
cannam@127
|
1443 @code{FFTW_MPI_SCRAMBLED_IN} (times the usual 1/@i{N} normalization).
|
cannam@127
|
1444 @xref{One-dimensional distributions}.
|
cannam@127
|
1445
|
cannam@127
|
1446 @item
|
cannam@127
|
1447 @ctindex FFTW_MPI_TRANSPOSED_OUT
|
cannam@127
|
1448 @ctindex FFTW_MPI_TRANSPOSED_IN
|
cannam@127
|
1449 @code{FFTW_MPI_TRANSPOSED_OUT}, @code{FFTW_MPI_TRANSPOSED_IN}: valid
|
cannam@127
|
1450 for multidimensional (@code{rnk > 1}) transforms only, these flags
|
cannam@127
|
1451 specify that the output or input of an @ndims{} transform is
|
cannam@127
|
1452 transposed to @ndimstrans{}. @xref{Transposed distributions}.
|
cannam@127
|
1453
|
cannam@127
|
1454 @end itemize
|
cannam@127
|
1455
|
cannam@127
|
1456 @subsubheading Real-data MPI DFTs
|
cannam@127
|
1457
|
cannam@127
|
1458 @cindex r2c
|
cannam@127
|
1459 Plans for real-input/output (r2c/c2r) DFTs (@pxref{Multi-dimensional
|
cannam@127
|
1460 MPI DFTs of Real Data}) are created by:
|
cannam@127
|
1461
|
cannam@127
|
1462 @findex fftw_mpi_plan_dft_r2c_2d
|
cannam@127
|
1463 @findex fftw_mpi_plan_dft_r2c_2d
|
cannam@127
|
1464 @findex fftw_mpi_plan_dft_r2c_3d
|
cannam@127
|
1465 @findex fftw_mpi_plan_dft_r2c
|
cannam@127
|
1466 @findex fftw_mpi_plan_dft_c2r_2d
|
cannam@127
|
1467 @findex fftw_mpi_plan_dft_c2r_2d
|
cannam@127
|
1468 @findex fftw_mpi_plan_dft_c2r_3d
|
cannam@127
|
1469 @findex fftw_mpi_plan_dft_c2r
|
cannam@127
|
1470 @example
|
cannam@127
|
1471 fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1,
|
cannam@127
|
1472 double *in, fftw_complex *out,
|
cannam@127
|
1473 MPI_Comm comm, unsigned flags);
|
cannam@127
|
1474 fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1,
|
cannam@127
|
1475 double *in, fftw_complex *out,
|
cannam@127
|
1476 MPI_Comm comm, unsigned flags);
|
cannam@127
|
1477 fftw_plan fftw_mpi_plan_dft_r2c_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
|
cannam@127
|
1478 double *in, fftw_complex *out,
|
cannam@127
|
1479 MPI_Comm comm, unsigned flags);
|
cannam@127
|
1480 fftw_plan fftw_mpi_plan_dft_r2c(int rnk, const ptrdiff_t *n,
|
cannam@127
|
1481 double *in, fftw_complex *out,
|
cannam@127
|
1482 MPI_Comm comm, unsigned flags);
|
cannam@127
|
1483 fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1,
|
cannam@127
|
1484 fftw_complex *in, double *out,
|
cannam@127
|
1485 MPI_Comm comm, unsigned flags);
|
cannam@127
|
1486 fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1,
|
cannam@127
|
1487 fftw_complex *in, double *out,
|
cannam@127
|
1488 MPI_Comm comm, unsigned flags);
|
cannam@127
|
1489 fftw_plan fftw_mpi_plan_dft_c2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
|
cannam@127
|
1490 fftw_complex *in, double *out,
|
cannam@127
|
1491 MPI_Comm comm, unsigned flags);
|
cannam@127
|
1492 fftw_plan fftw_mpi_plan_dft_c2r(int rnk, const ptrdiff_t *n,
|
cannam@127
|
1493 fftw_complex *in, double *out,
|
cannam@127
|
1494 MPI_Comm comm, unsigned flags);
|
cannam@127
|
1495 @end example
|
cannam@127
|
1496
|
cannam@127
|
1497 Similar to the serial interface (@pxref{Real-data DFTs}), these
|
cannam@127
|
1498 transform logically @ndims{} real data to/from @ndimshalf{} complex
|
cannam@127
|
1499 data, representing the non-redundant half of the conjugate-symmetry
|
cannam@127
|
1500 output of a real-input DFT (@pxref{Multi-dimensional Transforms}).
|
cannam@127
|
1501 However, the real array must be stored within a padded @ndimspad{}
|
cannam@127
|
1502 array (much like the in-place serial r2c transforms, but here for
|
cannam@127
|
1503 out-of-place transforms as well). Currently, only multi-dimensional
|
cannam@127
|
1504 (@code{rnk > 1}) r2c/c2r transforms are supported (requesting a plan
|
cannam@127
|
1505 for @code{rnk = 1} will yield @code{NULL}). As explained above
|
cannam@127
|
1506 (@pxref{Multi-dimensional MPI DFTs of Real Data}), the data
|
cannam@127
|
1507 distribution of both the real and complex arrays is given by the
|
cannam@127
|
1508 @samp{local_size} function called for the dimensions of the
|
cannam@127
|
1509 @emph{complex} array. Similar to the other planning functions, the
|
cannam@127
|
1510 input and output arrays are overwritten when the plan is created
|
cannam@127
|
1511 except in @code{FFTW_ESTIMATE} mode.
|
cannam@127
|
1512
|
cannam@127
|
1513 As for the complex DFTs above, there is an advance interface that
|
cannam@127
|
1514 allows you to manually specify block sizes and to transform contiguous
|
cannam@127
|
1515 @code{howmany}-tuples of real/complex numbers:
|
cannam@127
|
1516
|
cannam@127
|
1517 @findex fftw_mpi_plan_many_dft_r2c
|
cannam@127
|
1518 @findex fftw_mpi_plan_many_dft_c2r
|
cannam@127
|
1519 @example
|
cannam@127
|
1520 fftw_plan fftw_mpi_plan_many_dft_r2c
|
cannam@127
|
1521 (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
|
cannam@127
|
1522 ptrdiff_t iblock, ptrdiff_t oblock,
|
cannam@127
|
1523 double *in, fftw_complex *out,
|
cannam@127
|
1524 MPI_Comm comm, unsigned flags);
|
cannam@127
|
1525 fftw_plan fftw_mpi_plan_many_dft_c2r
|
cannam@127
|
1526 (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
|
cannam@127
|
1527 ptrdiff_t iblock, ptrdiff_t oblock,
|
cannam@127
|
1528 fftw_complex *in, double *out,
|
cannam@127
|
1529 MPI_Comm comm, unsigned flags);
|
cannam@127
|
1530 @end example
|
cannam@127
|
1531
|
cannam@127
|
1532 @subsubheading MPI r2r transforms
|
cannam@127
|
1533
|
cannam@127
|
1534 @cindex r2r
|
cannam@127
|
1535 There are corresponding plan-creation routines for r2r
|
cannam@127
|
1536 transforms (@pxref{More DFTs of Real Data}), currently supporting
|
cannam@127
|
1537 multidimensional (@code{rnk > 1}) transforms only (@code{rnk = 1} will
|
cannam@127
|
1538 yield a @code{NULL} plan):
|
cannam@127
|
1539
|
cannam@127
|
1540 @example
|
cannam@127
|
1541 fftw_plan fftw_mpi_plan_r2r_2d(ptrdiff_t n0, ptrdiff_t n1,
|
cannam@127
|
1542 double *in, double *out,
|
cannam@127
|
1543 MPI_Comm comm,
|
cannam@127
|
1544 fftw_r2r_kind kind0, fftw_r2r_kind kind1,
|
cannam@127
|
1545 unsigned flags);
|
cannam@127
|
1546 fftw_plan fftw_mpi_plan_r2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
|
cannam@127
|
1547 double *in, double *out,
|
cannam@127
|
1548 MPI_Comm comm,
|
cannam@127
|
1549 fftw_r2r_kind kind0, fftw_r2r_kind kind1, fftw_r2r_kind kind2,
|
cannam@127
|
1550 unsigned flags);
|
cannam@127
|
1551 fftw_plan fftw_mpi_plan_r2r(int rnk, const ptrdiff_t *n,
|
cannam@127
|
1552 double *in, double *out,
|
cannam@127
|
1553 MPI_Comm comm, const fftw_r2r_kind *kind,
|
cannam@127
|
1554 unsigned flags);
|
cannam@127
|
1555 fftw_plan fftw_mpi_plan_many_r2r(int rnk, const ptrdiff_t *n,
|
cannam@127
|
1556 ptrdiff_t iblock, ptrdiff_t oblock,
|
cannam@127
|
1557 double *in, double *out,
|
cannam@127
|
1558 MPI_Comm comm, const fftw_r2r_kind *kind,
|
cannam@127
|
1559 unsigned flags);
|
cannam@127
|
1560 @end example
|
cannam@127
|
1561
|
cannam@127
|
1562 The parameters are much the same as for the complex DFTs above, except
|
cannam@127
|
1563 that the arrays are of real numbers (and hence the outputs of the
|
cannam@127
|
1564 @samp{local_size} data-distribution functions should be interpreted as
|
cannam@127
|
1565 counts of real rather than complex numbers). Also, the @code{kind}
|
cannam@127
|
1566 parameters specify the r2r kinds along each dimension as for the
|
cannam@127
|
1567 serial interface (@pxref{Real-to-Real Transform Kinds}). @xref{Other
|
cannam@127
|
1568 Multi-dimensional Real-data MPI Transforms}.
|
cannam@127
|
1569
|
cannam@127
|
1570 @subsubheading MPI transposition
|
cannam@127
|
1571 @cindex transpose
|
cannam@127
|
1572
|
cannam@127
|
1573 FFTW also provides routines to plan a transpose of a distributed
|
cannam@127
|
1574 @code{n0} by @code{n1} array of real numbers, or an array of
|
cannam@127
|
1575 @code{howmany}-tuples of real numbers with specified block sizes
|
cannam@127
|
1576 (@pxref{FFTW MPI Transposes}):
|
cannam@127
|
1577
|
cannam@127
|
1578 @findex fftw_mpi_plan_transpose
|
cannam@127
|
1579 @findex fftw_mpi_plan_many_transpose
|
cannam@127
|
1580 @example
|
cannam@127
|
1581 fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
|
cannam@127
|
1582 double *in, double *out,
|
cannam@127
|
1583 MPI_Comm comm, unsigned flags);
|
cannam@127
|
1584 fftw_plan fftw_mpi_plan_many_transpose
|
cannam@127
|
1585 (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
|
cannam@127
|
1586 ptrdiff_t block0, ptrdiff_t block1,
|
cannam@127
|
1587 double *in, double *out, MPI_Comm comm, unsigned flags);
|
cannam@127
|
1588 @end example
|
cannam@127
|
1589
|
cannam@127
|
1590 @cindex new-array execution
|
cannam@127
|
1591 @findex fftw_mpi_execute_r2r
|
cannam@127
|
1592 These plans are used with the @code{fftw_mpi_execute_r2r} new-array
|
cannam@127
|
1593 execute function (@pxref{Using MPI Plans }), since they count as (rank
|
cannam@127
|
1594 zero) r2r plans from FFTW's perspective.
|
cannam@127
|
1595
|
cannam@127
|
1596 @node MPI Wisdom Communication, , MPI Plan Creation, FFTW MPI Reference
|
cannam@127
|
1597 @subsection MPI Wisdom Communication
|
cannam@127
|
1598
|
cannam@127
|
1599 To facilitate synchronizing wisdom among the different MPI processes,
|
cannam@127
|
1600 we provide two functions:
|
cannam@127
|
1601
|
cannam@127
|
1602 @findex fftw_mpi_gather_wisdom
|
cannam@127
|
1603 @findex fftw_mpi_broadcast_wisdom
|
cannam@127
|
1604 @example
|
cannam@127
|
1605 void fftw_mpi_gather_wisdom(MPI_Comm comm);
|
cannam@127
|
1606 void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
|
cannam@127
|
1607 @end example
|
cannam@127
|
1608
|
cannam@127
|
1609 The @code{fftw_mpi_gather_wisdom} function gathers all wisdom in the
|
cannam@127
|
1610 given communicator @code{comm} to the process of rank 0 in the
|
cannam@127
|
1611 communicator: that process obtains the union of all wisdom on all the
|
cannam@127
|
1612 processes. As a side effect, some other processes will gain
|
cannam@127
|
1613 additional wisdom from other processes, but only process 0 will gain
|
cannam@127
|
1614 the complete union.
|
cannam@127
|
1615
|
cannam@127
|
1616 The @code{fftw_mpi_broadcast_wisdom} does the reverse: it exports
|
cannam@127
|
1617 wisdom from process 0 in @code{comm} to all other processes in the
|
cannam@127
|
1618 communicator, replacing any wisdom they currently have.
|
cannam@127
|
1619
|
cannam@127
|
1620 @xref{FFTW MPI Wisdom}.
|
cannam@127
|
1621
|
cannam@127
|
1622 @c ------------------------------------------------------------
|
cannam@127
|
1623 @node FFTW MPI Fortran Interface, , FFTW MPI Reference, Distributed-memory FFTW with MPI
|
cannam@127
|
1624 @section FFTW MPI Fortran Interface
|
cannam@127
|
1625 @cindex Fortran interface
|
cannam@127
|
1626
|
cannam@127
|
1627 @cindex iso_c_binding
|
cannam@127
|
1628 The FFTW MPI interface is callable from modern Fortran compilers
|
cannam@127
|
1629 supporting the Fortran 2003 @code{iso_c_binding} standard for calling
|
cannam@127
|
1630 C functions. As described in @ref{Calling FFTW from Modern Fortran},
|
cannam@127
|
1631 this means that you can directly call FFTW's C interface from Fortran
|
cannam@127
|
1632 with only minor changes in syntax. There are, however, a few things
|
cannam@127
|
1633 specific to the MPI interface to keep in mind:
|
cannam@127
|
1634
|
cannam@127
|
1635 @itemize @bullet
|
cannam@127
|
1636
|
cannam@127
|
1637 @item
|
cannam@127
|
1638 Instead of including @code{fftw3.f03} as in @ref{Overview of Fortran
|
cannam@127
|
1639 interface }, you should @code{include 'fftw3-mpi.f03'} (after
|
cannam@127
|
1640 @code{use, intrinsic :: iso_c_binding} as before). The
|
cannam@127
|
1641 @code{fftw3-mpi.f03} file includes @code{fftw3.f03}, so you should
|
cannam@127
|
1642 @emph{not} @code{include} them both yourself. (You will also want to
|
cannam@127
|
1643 include the MPI header file, usually via @code{include 'mpif.h'} or
|
cannam@127
|
1644 similar, although though this is not needed by @code{fftw3-mpi.f03}
|
cannam@127
|
1645 @i{per se}.) (To use the @samp{fftwl_} @code{long double} extended-precision routines in supporting compilers, you should include @code{fftw3f-mpi.f03} in @emph{addition} to @code{fftw3-mpi.f03}. @xref{Extended and quadruple precision in Fortran}.)
|
cannam@127
|
1646
|
cannam@127
|
1647 @item
|
cannam@127
|
1648 Because of the different storage conventions between C and Fortran,
|
cannam@127
|
1649 you reverse the order of your array dimensions when passing them to
|
cannam@127
|
1650 FFTW (@pxref{Reversing array dimensions}). This is merely a
|
cannam@127
|
1651 difference in notation and incurs no performance overhead. However,
|
cannam@127
|
1652 it means that, whereas in C the @emph{first} dimension is distributed,
|
cannam@127
|
1653 in Fortran the @emph{last} dimension of your array is distributed.
|
cannam@127
|
1654
|
cannam@127
|
1655 @item
|
cannam@127
|
1656 @cindex MPI communicator
|
cannam@127
|
1657 In Fortran, communicators are stored as @code{integer} types; there is
|
cannam@127
|
1658 no @code{MPI_Comm} type, nor is there any way to access a C
|
cannam@127
|
1659 @code{MPI_Comm}. Fortunately, this is taken care of for you by the
|
cannam@127
|
1660 FFTW Fortran interface: whenever the C interface expects an
|
cannam@127
|
1661 @code{MPI_Comm} type, you should pass the Fortran communicator as an
|
cannam@127
|
1662 @code{integer}.@footnote{Technically, this is because you aren't
|
cannam@127
|
1663 actually calling the C functions directly. You are calling wrapper
|
cannam@127
|
1664 functions that translate the communicator with @code{MPI_Comm_f2c}
|
cannam@127
|
1665 before calling the ordinary C interface. This is all done
|
cannam@127
|
1666 transparently, however, since the @code{fftw3-mpi.f03} interface file
|
cannam@127
|
1667 renames the wrappers so that they are called in Fortran with the same
|
cannam@127
|
1668 names as the C interface functions.}
|
cannam@127
|
1669
|
cannam@127
|
1670 @item
|
cannam@127
|
1671 Because you need to call the @samp{local_size} function to find out
|
cannam@127
|
1672 how much space to allocate, and this may be @emph{larger} than the
|
cannam@127
|
1673 local portion of the array (@pxref{MPI Data Distribution}), you should
|
cannam@127
|
1674 @emph{always} allocate your arrays dynamically using FFTW's allocation
|
cannam@127
|
1675 routines as described in @ref{Allocating aligned memory in Fortran}.
|
cannam@127
|
1676 (Coincidentally, this also provides the best performance by
|
cannam@127
|
1677 guaranteeding proper data alignment.)
|
cannam@127
|
1678
|
cannam@127
|
1679 @item
|
cannam@127
|
1680 Because all sizes in the MPI FFTW interface are declared as
|
cannam@127
|
1681 @code{ptrdiff_t} in C, you should use @code{integer(C_INTPTR_T)} in
|
cannam@127
|
1682 Fortran (@pxref{FFTW Fortran type reference}).
|
cannam@127
|
1683
|
cannam@127
|
1684 @item
|
cannam@127
|
1685 @findex fftw_execute_dft
|
cannam@127
|
1686 @findex fftw_mpi_execute_dft
|
cannam@127
|
1687 @cindex new-array execution
|
cannam@127
|
1688 In Fortran, because of the language semantics, we generally recommend
|
cannam@127
|
1689 using the new-array execute functions for all plans, even in the
|
cannam@127
|
1690 common case where you are executing the plan on the same arrays for
|
cannam@127
|
1691 which the plan was created (@pxref{Plan execution in Fortran}).
|
cannam@127
|
1692 However, note that in the MPI interface these functions are changed:
|
cannam@127
|
1693 @code{fftw_execute_dft} becomes @code{fftw_mpi_execute_dft},
|
cannam@127
|
1694 etcetera. @xref{Using MPI Plans}.
|
cannam@127
|
1695
|
cannam@127
|
1696 @end itemize
|
cannam@127
|
1697
|
cannam@127
|
1698 For example, here is a Fortran code snippet to perform a distributed
|
cannam@127
|
1699 @twodims{L,M} complex DFT in-place. (This assumes you have already
|
cannam@127
|
1700 initialized MPI with @code{MPI_init} and have also performed
|
cannam@127
|
1701 @code{call fftw_mpi_init}.)
|
cannam@127
|
1702
|
cannam@127
|
1703 @example
|
cannam@127
|
1704 use, intrinsic :: iso_c_binding
|
cannam@127
|
1705 include 'fftw3-mpi.f03'
|
cannam@127
|
1706 integer(C_INTPTR_T), parameter :: L = ...
|
cannam@127
|
1707 integer(C_INTPTR_T), parameter :: M = ...
|
cannam@127
|
1708 type(C_PTR) :: plan, cdata
|
cannam@127
|
1709 complex(C_DOUBLE_COMPLEX), pointer :: data(:,:)
|
cannam@127
|
1710 integer(C_INTPTR_T) :: i, j, alloc_local, local_M, local_j_offset
|
cannam@127
|
1711
|
cannam@127
|
1712 ! @r{get local data size and allocate (note dimension reversal)}
|
cannam@127
|
1713 alloc_local = fftw_mpi_local_size_2d(M, L, MPI_COMM_WORLD, &
|
cannam@127
|
1714 local_M, local_j_offset)
|
cannam@127
|
1715 cdata = fftw_alloc_complex(alloc_local)
|
cannam@127
|
1716 call c_f_pointer(cdata, data, [L,local_M])
|
cannam@127
|
1717
|
cannam@127
|
1718 ! @r{create MPI plan for in-place forward DFT (note dimension reversal)}
|
cannam@127
|
1719 plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, &
|
cannam@127
|
1720 FFTW_FORWARD, FFTW_MEASURE)
|
cannam@127
|
1721
|
cannam@127
|
1722 ! @r{initialize data to some function} my_function(i,j)
|
cannam@127
|
1723 do j = 1, local_M
|
cannam@127
|
1724 do i = 1, L
|
cannam@127
|
1725 data(i, j) = my_function(i, j + local_j_offset)
|
cannam@127
|
1726 end do
|
cannam@127
|
1727 end do
|
cannam@127
|
1728
|
cannam@127
|
1729 ! @r{compute transform (as many times as desired)}
|
cannam@127
|
1730 call fftw_mpi_execute_dft(plan, data, data)
|
cannam@127
|
1731
|
cannam@127
|
1732 call fftw_destroy_plan(plan)
|
cannam@127
|
1733 call fftw_free(cdata)
|
cannam@127
|
1734 @end example
|
cannam@127
|
1735
|
cannam@127
|
1736 Note that when we called @code{fftw_mpi_local_size_2d} and
|
cannam@127
|
1737 @code{fftw_mpi_plan_dft_2d} with the dimensions in reversed order,
|
cannam@127
|
1738 since a @twodims{L,M} Fortran array is viewed by FFTW in C as a
|
cannam@127
|
1739 @twodims{M, L} array. This means that the array was distributed over
|
cannam@127
|
1740 the @code{M} dimension, the local portion of which is a
|
cannam@127
|
1741 @twodims{L,local_M} array in Fortran. (You must @emph{not} use an
|
cannam@127
|
1742 @code{allocate} statement to allocate an @twodims{L,local_M} array,
|
cannam@127
|
1743 however; you must allocate @code{alloc_local} complex numbers, which
|
cannam@127
|
1744 may be greater than @code{L * local_M}, in order to reserve space for
|
cannam@127
|
1745 intermediate steps of the transform.) Finally, we mention that
|
cannam@127
|
1746 because C's array indices are zero-based, the @code{local_j_offset}
|
cannam@127
|
1747 argument can conveniently be interpreted as an offset in the 1-based
|
cannam@127
|
1748 @code{j} index (rather than as a starting index as in C).
|
cannam@127
|
1749
|
cannam@127
|
1750 If instead you had used the @code{ior(FFTW_MEASURE,
|
cannam@127
|
1751 FFTW_MPI_TRANSPOSED_OUT)} flag, the output of the transform would be a
|
cannam@127
|
1752 transposed @twodims{M,local_L} array, associated with the @emph{same}
|
cannam@127
|
1753 @code{cdata} allocation (since the transform is in-place), and which
|
cannam@127
|
1754 you could declare with:
|
cannam@127
|
1755
|
cannam@127
|
1756 @example
|
cannam@127
|
1757 complex(C_DOUBLE_COMPLEX), pointer :: tdata(:,:)
|
cannam@127
|
1758 ...
|
cannam@127
|
1759 call c_f_pointer(cdata, tdata, [M,local_L])
|
cannam@127
|
1760 @end example
|
cannam@127
|
1761
|
cannam@127
|
1762 where @code{local_L} would have been obtained by changing the
|
cannam@127
|
1763 @code{fftw_mpi_local_size_2d} call to:
|
cannam@127
|
1764
|
cannam@127
|
1765 @example
|
cannam@127
|
1766 alloc_local = fftw_mpi_local_size_2d_transposed(M, L, MPI_COMM_WORLD, &
|
cannam@127
|
1767 local_M, local_j_offset, local_L, local_i_offset)
|
cannam@127
|
1768 @end example
|