Chris@42 1 @node Distributed-memory FFTW with MPI, Calling FFTW from Modern Fortran, Multi-threaded FFTW, Top
Chris@42 2 @chapter Distributed-memory FFTW with MPI
Chris@42 3 @cindex MPI
Chris@42 4
Chris@42 5 @cindex parallel transform
Chris@42 6 In this chapter we document the parallel FFTW routines for parallel
Chris@42 7 systems supporting the MPI message-passing interface. Unlike the
Chris@42 8 shared-memory threads described in the previous chapter, MPI allows
Chris@42 9 you to use @emph{distributed-memory} parallelism, where each CPU has
Chris@42 10 its own separate memory, and which can scale up to clusters of many
Chris@42 11 thousands of processors. This capability comes at a price, however:
Chris@42 12 each process only stores a @emph{portion} of the data to be
Chris@42 13 transformed, which means that the data structures and
Chris@42 14 programming interface are quite different from the serial or threads
Chris@42 15 versions of FFTW.
Chris@42 16 @cindex data distribution
Chris@42 17
Chris@42 18
Chris@42 19 Distributed-memory parallelism is especially useful when you are
Chris@42 20 transforming arrays so large that they do not fit into the memory of a
Chris@42 21 single processor. The per-process storage required by FFTW's MPI
Chris@42 22 routines is proportional to the total array size divided by the number
Chris@42 23 of processes. Conversely, distributed-memory parallelism can easily
Chris@42 24 pose an unacceptably high communications overhead for small problems;
Chris@42 25 the threshold problem size for which parallelism becomes advantageous
Chris@42 26 will depend on the precise problem you are interested in, your
Chris@42 27 hardware, and your MPI implementation.
Chris@42 28
Chris@42 29 A note on terminology: in MPI, you divide the data among a set of
Chris@42 30 ``processes'' which each run in their own memory address space.
Chris@42 31 Generally, each process runs on a different physical processor, but
Chris@42 32 this is not required. A set of processes in MPI is described by an
Chris@42 33 opaque data structure called a ``communicator,'' the most common of
Chris@42 34 which is the predefined communicator @code{MPI_COMM_WORLD} which
Chris@42 35 refers to @emph{all} processes. For more information on these and
Chris@42 36 other concepts common to all MPI programs, we refer the reader to the
Chris@42 37 documentation at @uref{http://www.mcs.anl.gov/research/projects/mpi/, the MPI home
Chris@42 38 page}.
Chris@42 39 @cindex MPI communicator
Chris@42 40 @ctindex MPI_COMM_WORLD
Chris@42 41
Chris@42 42
Chris@42 43 We assume in this chapter that the reader is familiar with the usage
Chris@42 44 of the serial (uniprocessor) FFTW, and focus only on the concepts new
Chris@42 45 to the MPI interface.
Chris@42 46
Chris@42 47 @menu
Chris@42 48 * FFTW MPI Installation::
Chris@42 49 * Linking and Initializing MPI FFTW::
Chris@42 50 * 2d MPI example::
Chris@42 51 * MPI Data Distribution::
Chris@42 52 * Multi-dimensional MPI DFTs of Real Data::
Chris@42 53 * Other Multi-dimensional Real-data MPI Transforms::
Chris@42 54 * FFTW MPI Transposes::
Chris@42 55 * FFTW MPI Wisdom::
Chris@42 56 * Avoiding MPI Deadlocks::
Chris@42 57 * FFTW MPI Performance Tips::
Chris@42 58 * Combining MPI and Threads::
Chris@42 59 * FFTW MPI Reference::
Chris@42 60 * FFTW MPI Fortran Interface::
Chris@42 61 @end menu
Chris@42 62
Chris@42 63 @c ------------------------------------------------------------
Chris@42 64 @node FFTW MPI Installation, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI, Distributed-memory FFTW with MPI
Chris@42 65 @section FFTW MPI Installation
Chris@42 66
Chris@42 67 All of the FFTW MPI code is located in the @code{mpi} subdirectory of
Chris@42 68 the FFTW package. On Unix systems, the FFTW MPI libraries and header
Chris@42 69 files are automatically configured, compiled, and installed along with
Chris@42 70 the uniprocessor FFTW libraries simply by including
Chris@42 71 @code{--enable-mpi} in the flags to the @code{configure} script
Chris@42 72 (@pxref{Installation on Unix}).
Chris@42 73 @fpindex configure
Chris@42 74
Chris@42 75
Chris@42 76 Any implementation of the MPI standard, version 1 or later, should
Chris@42 77 work with FFTW. The @code{configure} script will attempt to
Chris@42 78 automatically detect how to compile and link code using your MPI
Chris@42 79 implementation. In some cases, especially if you have multiple
Chris@42 80 different MPI implementations installed or have an unusual MPI
Chris@42 81 software package, you may need to provide this information explicitly.
Chris@42 82
Chris@42 83 Most commonly, one compiles MPI code by invoking a special compiler
Chris@42 84 command, typically @code{mpicc} for C code. The @code{configure}
Chris@42 85 script knows the most common names for this command, but you can
Chris@42 86 specify the MPI compilation command explicitly by setting the
Chris@42 87 @code{MPICC} variable, as in @samp{./configure MPICC=mpicc ...}.
Chris@42 88 @fpindex mpicc
Chris@42 89
Chris@42 90
Chris@42 91 If, instead of a special compiler command, you need to link a certain
Chris@42 92 library, you can specify the link command via the @code{MPILIBS}
Chris@42 93 variable, as in @samp{./configure MPILIBS=-lmpi ...}. Note that if
Chris@42 94 your MPI library is installed in a non-standard location (one the
Chris@42 95 compiler does not know about by default), you may also have to specify
Chris@42 96 the location of the library and header files via @code{LDFLAGS} and
Chris@42 97 @code{CPPFLAGS} variables, respectively, as in @samp{./configure
Chris@42 98 LDFLAGS=-L/path/to/mpi/libs CPPFLAGS=-I/path/to/mpi/include ...}.
Chris@42 99
Chris@42 100 @c ------------------------------------------------------------
Chris@42 101 @node Linking and Initializing MPI FFTW, 2d MPI example, FFTW MPI Installation, Distributed-memory FFTW with MPI
Chris@42 102 @section Linking and Initializing MPI FFTW
Chris@42 103
Chris@42 104 Programs using the MPI FFTW routines should be linked with
Chris@42 105 @code{-lfftw3_mpi -lfftw3 -lm} on Unix in double precision,
Chris@42 106 @code{-lfftw3f_mpi -lfftw3f -lm} in single precision, and so on
Chris@42 107 (@pxref{Precision}). You will also need to link with whatever library
Chris@42 108 is responsible for MPI on your system; in most MPI implementations,
Chris@42 109 there is a special compiler alias named @code{mpicc} to compile and
Chris@42 110 link MPI code.
Chris@42 111 @fpindex mpicc
Chris@42 112 @cindex linking on Unix
Chris@42 113 @cindex precision
Chris@42 114
Chris@42 115
Chris@42 116 @findex fftw_init_threads
Chris@42 117 Before calling any FFTW routines except possibly
Chris@42 118 @code{fftw_init_threads} (@pxref{Combining MPI and Threads}), but after calling
Chris@42 119 @code{MPI_Init}, you should call the function:
Chris@42 120
Chris@42 121 @example
Chris@42 122 void fftw_mpi_init(void);
Chris@42 123 @end example
Chris@42 124 @findex fftw_mpi_init
Chris@42 125
Chris@42 126 If, at the end of your program, you want to get rid of all memory and
Chris@42 127 other resources allocated internally by FFTW, for both the serial and
Chris@42 128 MPI routines, you can call:
Chris@42 129
Chris@42 130 @example
Chris@42 131 void fftw_mpi_cleanup(void);
Chris@42 132 @end example
Chris@42 133 @findex fftw_mpi_cleanup
Chris@42 134
Chris@42 135 which is much like the @code{fftw_cleanup()} function except that it
Chris@42 136 also gets rid of FFTW's MPI-related data. You must @emph{not} execute
Chris@42 137 any previously created plans after calling this function.
Chris@42 138
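For reference, here is a minimal skeleton (a sketch only, with error checking
omitted) showing where these initialization and cleanup calls fit relative to
@code{MPI_Init} and @code{MPI_Finalize}; the allocation, planning, and
execution steps are filled in by the examples of the following sections:

@example
#include <fftw3-mpi.h>

int main(int argc, char **argv)
@{
    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* @r{... allocate arrays, create and execute plans (see below) ...} */

    fftw_mpi_cleanup(); /* @r{after this, no previously created plan may be executed} */
    MPI_Finalize();
    return 0;
@}
@end example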
Chris@42 139 @c ------------------------------------------------------------
Chris@42 140 @node 2d MPI example, MPI Data Distribution, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI
Chris@42 141 @section 2d MPI example
Chris@42 142
Chris@42 143 Before we document the FFTW MPI interface in detail, we begin with a
Chris@42 144 simple example outlining how one would perform a two-dimensional
Chris@42 145 @code{N0} by @code{N1} complex DFT.
Chris@42 146
Chris@42 147 @example
Chris@42 148 #include <fftw3-mpi.h>
Chris@42 149
Chris@42 150 int main(int argc, char **argv)
Chris@42 151 @{
Chris@42 152 const ptrdiff_t N0 = ..., N1 = ...;
Chris@42 153 fftw_plan plan;
Chris@42 154 fftw_complex *data;
Chris@42 155 ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
Chris@42 156
Chris@42 157 MPI_Init(&argc, &argv);
Chris@42 158 fftw_mpi_init();
Chris@42 159
Chris@42 160 /* @r{get local data size and allocate} */
Chris@42 161 alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
Chris@42 162 &local_n0, &local_0_start);
Chris@42 163 data = fftw_alloc_complex(alloc_local);
Chris@42 164
Chris@42 165 /* @r{create plan for in-place forward DFT} */
Chris@42 166 plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
Chris@42 167 FFTW_FORWARD, FFTW_ESTIMATE);
Chris@42 168
Chris@42 169 /* @r{initialize data to some function} my_function(x,y) */
Chris@42 170 for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j)
Chris@42 171 data[i*N1 + j] = my_function(local_0_start + i, j);
Chris@42 172
Chris@42 173 /* @r{compute transforms, in-place, as many times as desired} */
Chris@42 174 fftw_execute(plan);
Chris@42 175
Chris@42 176 fftw_destroy_plan(plan);
Chris@42 177
Chris@42 178 MPI_Finalize();
Chris@42 179 @}
Chris@42 180 @end example
Chris@42 181
Chris@42 182 As can be seen above, the MPI interface follows the same basic style
Chris@42 183 of allocate/plan/execute/destroy as the serial FFTW routines. All of
Chris@42 184 the MPI-specific routines are prefixed with @samp{fftw_mpi_} instead
Chris@42 185 of @samp{fftw_}. There are a few important differences, however:
Chris@42 186
Chris@42 187 First, we must call @code{fftw_mpi_init()} after calling
Chris@42 188 @code{MPI_Init} (required in all MPI programs) and before calling any
Chris@42 189 other @samp{fftw_mpi_} routine.
Chris@42 190 @findex MPI_Init
Chris@42 191 @findex fftw_mpi_init
Chris@42 192
Chris@42 193
Chris@42 194 Second, when we create the plan with @code{fftw_mpi_plan_dft_2d},
Chris@42 195 analogous to @code{fftw_plan_dft_2d}, we pass an additional argument:
Chris@42 196 the communicator, indicating which processes will participate in the
Chris@42 197 transform (here @code{MPI_COMM_WORLD}, indicating all processes).
Chris@42 198 Whenever you create, execute, or destroy a plan for an MPI transform,
Chris@42 199 you must call the corresponding FFTW routine on @emph{all} processes
Chris@42 200 in the communicator for that transform. (That is, these are
Chris@42 201 @emph{collective} calls.) Note that the plan for the MPI transform
Chris@42 202 uses the standard @code{fftw_execute} and @code{fftw_destroy_plan} routines
Chris@42 203 (on the other hand, there are MPI-specific new-array execute functions
Chris@42 204 documented below).
Chris@42 205 @cindex collective function
Chris@42 206 @findex fftw_mpi_plan_dft_2d
Chris@42 207 @ctindex MPI_COMM_WORLD
Chris@42 208
Chris@42 209
Chris@42 210 Third, all of the FFTW MPI routines take @code{ptrdiff_t} arguments
Chris@42 211 instead of @code{int} as for the serial FFTW. @code{ptrdiff_t} is a
Chris@42 212 standard C integer type which is (at least) 32 bits wide on a 32-bit
Chris@42 213 machine and 64 bits wide on a 64-bit machine. This is to make it easy
Chris@42 214 to specify very large parallel transforms on a 64-bit machine. (You
Chris@42 215 can specify 64-bit transform sizes in the serial FFTW, too, but only
Chris@42 216 by using the @samp{guru64} planner interface. @xref{64-bit Guru
Chris@42 217 Interface}.)
Chris@42 218 @tindex ptrdiff_t
Chris@42 219 @cindex 64-bit architecture
Chris@42 220
Chris@42 221
Chris@42 222 Fourth, and most importantly, you don't allocate the entire
Chris@42 223 two-dimensional array on each process. Instead, you call
Chris@42 224 @code{fftw_mpi_local_size_2d} to find out what @emph{portion} of the
Chris@42 225 array resides on each processor, and how much space to allocate.
Chris@42 226 Here, the portion of the array on each process is a @code{local_n0} by
Chris@42 227 @code{N1} slice of the total array, starting at index
Chris@42 228 @code{local_0_start}. The total number of @code{fftw_complex} numbers
Chris@42 229 to allocate is given by the @code{alloc_local} return value, which
Chris@42 230 @emph{may} be greater than @code{local_n0 * N1} (in case some
Chris@42 231 intermediate calculations require additional storage). The data
Chris@42 232 distribution in FFTW's MPI interface is described in more detail by
Chris@42 233 the next section.
Chris@42 234 @findex fftw_mpi_local_size_2d
Chris@42 235 @cindex data distribution
Chris@42 236
Chris@42 237
Chris@42 238 Given the portion of the array that resides on the local process, it
Chris@42 239 is straightforward to initialize the data (here to a function
Chris@42 240 @code{my_function}) and otherwise manipulate it. Of course, at the end
Chris@42 241 of the program you may want to output the data somehow, but
Chris@42 242 synchronizing this output is up to you and is beyond the scope of this
Chris@42 243 manual. (One good way to output a large multi-dimensional distributed
Chris@42 244 array in MPI to a portable binary file is to use the free HDF5
Chris@42 245 library; see the @uref{http://www.hdfgroup.org/, HDF home page}.)
Chris@42 246 @cindex HDF5
Chris@42 247 @cindex MPI I/O
Chris@42 248
Chris@42 249 @c ------------------------------------------------------------
Chris@42 250 @node MPI Data Distribution, Multi-dimensional MPI DFTs of Real Data, 2d MPI example, Distributed-memory FFTW with MPI
Chris@42 251 @section MPI Data Distribution
Chris@42 252 @cindex data distribution
Chris@42 253
Chris@42 254 The most important concept to understand in using FFTW's MPI interface
Chris@42 255 is the data distribution. With a serial or multithreaded FFT, all of
Chris@42 256 the inputs and outputs are stored as a single contiguous chunk of
Chris@42 257 memory. With a distributed-memory FFT, the inputs and outputs are
Chris@42 258 broken into disjoint blocks, one per process.
Chris@42 259
Chris@42 260 In particular, FFTW uses a @emph{1d block distribution} of the data,
Chris@42 261 distributed along the @emph{first dimension}. For example, if you
Chris@42 262 want to perform a @twodims{100,200} complex DFT, distributed over 4
Chris@42 263 processes, each process will get a @twodims{25,200} slice of the data.
Chris@42 264 That is, process 0 will get rows 0 through 24, process 1 will get rows
Chris@42 265 25 through 49, process 2 will get rows 50 through 74, and process 3
Chris@42 266 will get rows 75 through 99. If you take the same array but
Chris@42 267 distribute it over 3 processes, then it is not evenly divisible so the
Chris@42 268 different processes will have unequal chunks. FFTW's default choice
Chris@42 269 in this case is to assign 34 rows to processes 0 and 1, and 32 rows to
Chris@42 270 process 2.
Chris@42 271 @cindex block distribution
Chris@42 272
Chris@42 273
Chris@42 274 FFTW provides several @samp{fftw_mpi_local_size} routines that you can
Chris@42 275 call to find out what portion of an array is stored on the current
Chris@42 276 process. In most cases, you should use the default block sizes picked
Chris@42 277 by FFTW, but it is also possible to specify your own block size. For
Chris@42 278 example, with a @twodims{100,200} array on three processes, you can
Chris@42 279 tell FFTW to use a block size of 40, which would assign 40 rows to
Chris@42 280 processes 0 and 1, and 20 rows to process 2. FFTW's default is to
Chris@42 281 divide the data equally among the processes if possible, and as best
Chris@42 282 it can otherwise. The rows are always assigned in ``rank order,''
Chris@42 283 i.e. process 0 gets the first block of rows, then process 1, and so
Chris@42 284 on. (You can change this by using @code{MPI_Comm_split} to create a
Chris@42 285 new communicator with re-ordered processes.) However, you should
Chris@42 286 always call the @samp{fftw_mpi_local_size} routines, if possible,
Chris@42 287 rather than trying to predict FFTW's distribution choices.
Chris@42 288
Chris@42 289 In particular, it is critical that you allocate the storage size that
Chris@42 290 is returned by @samp{fftw_mpi_local_size}, which is @emph{not}
Chris@42 291 necessarily the size of the local slice of the array. The reason is
Chris@42 292 that intermediate steps of FFTW's algorithms involve transposing the
Chris@42 293 array and redistributing the data, so at these intermediate steps FFTW
Chris@42 294 may require more local storage space (albeit always proportional to
Chris@42 295 the total size divided by the number of processes). The
Chris@42 296 @samp{fftw_mpi_local_size} functions know how much storage is required
Chris@42 297 for these intermediate steps and tell you the correct amount to
Chris@42 298 allocate.
Chris@42 299
Chris@42 300 @menu
Chris@42 301 * Basic and advanced distribution interfaces::
Chris@42 302 * Load balancing::
Chris@42 303 * Transposed distributions::
Chris@42 304 * One-dimensional distributions::
Chris@42 305 @end menu
Chris@42 306
Chris@42 307 @node Basic and advanced distribution interfaces, Load balancing, MPI Data Distribution, MPI Data Distribution
Chris@42 308 @subsection Basic and advanced distribution interfaces
Chris@42 309
Chris@42 310 As with the planner interface, the @samp{fftw_mpi_local_size}
Chris@42 311 distribution interface is broken into basic and advanced
Chris@42 312 (@samp{_many}) interfaces, where the latter allows you to specify the
Chris@42 313 block size manually and also to request block sizes when computing
Chris@42 314 multiple transforms simultaneously. These functions are documented
Chris@42 315 more exhaustively by the FFTW MPI Reference, but we summarize the
Chris@42 316 basic ideas here using a couple of two-dimensional examples.
Chris@42 317
Chris@42 318 For the @twodims{100,200} complex-DFT example, above, we would find
Chris@42 319 the distribution by calling the following function in the basic
Chris@42 320 interface:
Chris@42 321
Chris@42 322 @example
Chris@42 323 ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
Chris@42 324 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
Chris@42 325 @end example
Chris@42 326 @findex fftw_mpi_local_size_2d
Chris@42 327
Chris@42 328 Given the total size of the data to be transformed (here, @code{n0 =
Chris@42 329 100} and @code{n1 = 200}) and an MPI communicator (@code{comm}), this
Chris@42 330 function provides three numbers.
Chris@42 331
Chris@42 332 First, it describes the shape of the local data: the current process
Chris@42 333 should store a @code{local_n0} by @code{n1} slice of the overall
Chris@42 334 dataset, in row-major order (@code{n1} dimension contiguous), starting
Chris@42 335 at index @code{local_0_start}. That is, if the total dataset is
Chris@42 336 viewed as a @code{n0} by @code{n1} matrix, the current process should
Chris@42 337 store the rows @code{local_0_start} to
Chris@42 338 @code{local_0_start+local_n0-1}. Obviously, if you are running with
Chris@42 339 only a single MPI process, that process will store the entire array:
Chris@42 340 @code{local_0_start} will be zero and @code{local_n0} will be
Chris@42 341 @code{n0}. @xref{Row-major Format}.
Chris@42 342 @cindex row-major
Chris@42 343
Chris@42 344
Chris@42 345 Second, the return value is the total number of data elements (e.g.,
Chris@42 346 complex numbers for a complex DFT) that should be allocated for the
Chris@42 347 input and output arrays on the current process (ideally with
Chris@42 348 @code{fftw_malloc} or an @samp{fftw_alloc} function, to ensure optimal
Chris@42 349 alignment). It might seem that this should always be equal to
Chris@42 350 @code{local_n0 * n1}, but this is @emph{not} the case. FFTW's
Chris@42 351 distributed FFT algorithms require data redistributions at
Chris@42 352 intermediate stages of the transform, and in some circumstances this
Chris@42 353 may require slightly larger local storage. This is discussed in more
Chris@42 354 detail below, under @ref{Load balancing}.
Chris@42 355 @findex fftw_malloc
Chris@42 356 @findex fftw_alloc_complex
Chris@42 357
Chris@42 358
Chris@42 359 @cindex advanced interface
Chris@42 360 The advanced-interface @samp{local_size} function for multidimensional
Chris@42 361 transforms returns the same three things (@code{local_n0},
Chris@42 362 @code{local_0_start}, and the total number of elements to allocate),
Chris@42 363 but takes more inputs:
Chris@42 364
Chris@42 365 @example
Chris@42 366 ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n,
Chris@42 367 ptrdiff_t howmany,
Chris@42 368 ptrdiff_t block0,
Chris@42 369 MPI_Comm comm,
Chris@42 370 ptrdiff_t *local_n0,
Chris@42 371 ptrdiff_t *local_0_start);
Chris@42 372 @end example
Chris@42 373 @findex fftw_mpi_local_size_many
Chris@42 374
Chris@42 375 The two-dimensional case above corresponds to @code{rnk = 2} and an
Chris@42 376 array @code{n} of length 2 with @code{n[0] = n0} and @code{n[1] = n1}.
Chris@42 377 This routine is for any @code{rnk > 1}; one-dimensional transforms
Chris@42 378 have their own interface because they work slightly differently, as
Chris@42 379 discussed below.
Chris@42 380
Chris@42 381 First, the advanced interface allows you to perform multiple
Chris@42 382 transforms at once, of interleaved data, as specified by the
Chris@42 383 @code{howmany} parameter. (@code{howmany} is 1 for a single
Chris@42 384 transform.)
Chris@42 385
Chris@42 386 Second, here you can specify your desired block size in the @code{n0}
Chris@42 387 dimension, @code{block0}. To use FFTW's default block size, pass
Chris@42 388 @code{FFTW_MPI_DEFAULT_BLOCK} (0) for @code{block0}. Otherwise, on
Chris@42 389 @code{P} processes, FFTW will return @code{local_n0} equal to
Chris@42 390 @code{block0} on the first @code{n0 / block0} processes (rounded down),
Chris@42 391 return @code{local_n0} equal to @code{n0 - block0 * (n0 / block0)} on
Chris@42 392 the next process, and @code{local_n0} equal to zero on any remaining
Chris@42 393 processes. In general, we recommend using the default block size
Chris@42 394 (which corresponds to @code{n0 / P}, rounded up).
Chris@42 395 @ctindex FFTW_MPI_DEFAULT_BLOCK
Chris@42 396 @cindex block distribution
Chris@42 397
Chris@42 398
Chris@42 399 For example, suppose you have @code{P = 4} processes and @code{n0 =
Chris@42 400 21}. The default will be a block size of @code{6}, which will give
Chris@42 401 @code{local_n0 = 6} on the first three processes and @code{local_n0 =
Chris@42 402 3} on the last process. Instead, however, you could specify
Chris@42 403 @code{block0 = 5} if you wanted, which would give @code{local_n0 = 5}
Chris@42 404 on processes 0 to 2 and @code{local_n0 = 6} on process 3. (This choice,
Chris@42 405 while it may look superficially more ``balanced,'' has the same
Chris@42 406 critical path as FFTW's default but requires more communications.)
Chris@42 407
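Returning to the @twodims{100,200} example, a call to the advanced interface
requesting @code{howmany = 2} interleaved transforms with the default block
size might look like the following sketch (the variable names are ours, not
part of the API):

@example
const ptrdiff_t n[2] = @{100, 200@};
ptrdiff_t alloc_local, local_n0, local_0_start;

alloc_local = fftw_mpi_local_size_many(2, n, 2 /* @r{howmany} */,
                                       FFTW_MPI_DEFAULT_BLOCK,
                                       MPI_COMM_WORLD,
                                       &local_n0, &local_0_start);
@end example

You would then allocate the @code{alloc_local} elements it returns, exactly as
in the basic-interface examples.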
Chris@42 408 @node Load balancing, Transposed distributions, Basic and advanced distribution interfaces, MPI Data Distribution
Chris@42 409 @subsection Load balancing
Chris@42 410 @cindex load balancing
Chris@42 411
Chris@42 412 Ideally, when you parallelize a transform over some @math{P}
Chris@42 413 processes, each process should end up with work that takes equal time.
Chris@42 414 Otherwise, all of the processes end up waiting on whichever process is
Chris@42 415 slowest. This goal is known as ``load balancing.'' In this section,
Chris@42 416 we describe the circumstances under which FFTW is able to load-balance
Chris@42 417 well, and in particular how you should choose your transform size in
Chris@42 418 order to load balance.
Chris@42 419
Chris@42 420 Load balancing is especially difficult when you are parallelizing over
Chris@42 421 heterogeneous machines; for example, if one of your processors is an
Chris@42 422 old 486 and another is a Pentium IV, obviously you should give the
Chris@42 423 Pentium more work to do than the 486 since the latter is much slower.
Chris@42 424 FFTW does not deal with this problem, however---it assumes that your
Chris@42 425 processes run on hardware of comparable speed, and that the goal is
Chris@42 426 therefore to divide the problem as equally as possible.
Chris@42 427
Chris@42 428 For a multi-dimensional complex DFT, FFTW can divide the problem
Chris@42 429 equally among the processes if: (i) the @emph{first} dimension
Chris@42 430 @code{n0} is divisible by @math{P}; and (ii), the @emph{product} of
Chris@42 431 the subsequent dimensions is divisible by @math{P}. (For the advanced
Chris@42 432 interface, where you can specify multiple simultaneous transforms via
Chris@42 433 some ``vector'' length @code{howmany}, a factor of @code{howmany} is
Chris@42 434 included in the product of the subsequent dimensions.)
Chris@42 435
Chris@42 436 For a one-dimensional complex DFT, the length @code{N} of the data
Chris@42 437 should be divisible by @math{P} @emph{squared} to be able to divide
Chris@42 438 the problem equally among the processes.
Chris@42 439
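As a concrete illustration, a hypothetical helper (not part of the FFTW API)
that checks these conditions for a rank-3 complex DFT might read:

@example
/* @r{hypothetical check: can an n0 x n1 x n2 complex DFT be divided} */
/* @r{equally among P processes?} */
int divides_equally(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, int P)
@{
    return (n0 % P == 0) && ((n1 * n2) % P == 0);
@}
@end example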
Chris@42 440 @node Transposed distributions, One-dimensional distributions, Load balancing, MPI Data Distribution
Chris@42 441 @subsection Transposed distributions
Chris@42 442
Chris@42 443 Internally, FFTW's MPI transform algorithms work by first computing
Chris@42 444 transforms of the data local to each process, then by globally
Chris@42 445 @emph{transposing} the data in some fashion to redistribute the data
Chris@42 446 among the processes, transforming the new data local to each process,
Chris@42 447 and transposing back. For example, a two-dimensional @code{n0} by
Chris@42 448 @code{n1} array, distributed across the @code{n0} dimension, is
Chris@42 449 transformed by: (i) transforming the @code{n1} dimension, which is
Chris@42 450 local to each process; (ii) transposing to an @code{n1} by @code{n0}
Chris@42 451 array, distributed across the @code{n1} dimension; (iii) transforming
Chris@42 452 the @code{n0} dimension, which is now local to each process; (iv)
Chris@42 453 transposing back.
Chris@42 454 @cindex transpose
Chris@42 455
Chris@42 456
Chris@42 457 However, in many applications it is acceptable to compute a
Chris@42 458 multidimensional DFT whose results are produced in transposed order
Chris@42 459 (e.g., @code{n1} by @code{n0} in two dimensions). This provides a
Chris@42 460 significant performance advantage, because it means that the final
Chris@42 461 transposition step can be omitted. FFTW supports this optimization,
Chris@42 462 which you specify by passing the flag @code{FFTW_MPI_TRANSPOSED_OUT}
Chris@42 463 to the planner routines. To compute the inverse transform of
Chris@42 464 transposed output, you specify @code{FFTW_MPI_TRANSPOSED_IN} to tell
Chris@42 465 it that the input is transposed. In this section, we explain how to
Chris@42 466 interpret the output format of such a transform.
Chris@42 467 @ctindex FFTW_MPI_TRANSPOSED_OUT
Chris@42 468 @ctindex FFTW_MPI_TRANSPOSED_IN
Chris@42 469
Chris@42 470
Chris@42 471 Suppose you are transforming multi-dimensional data with (at
Chris@42 472 least two) dimensions @ndims{}. As always, it is distributed along
Chris@42 473 the first dimension @dimk{0}. Now, if we compute its DFT with the
Chris@42 474 @code{FFTW_MPI_TRANSPOSED_OUT} flag, the resulting output data are stored
Chris@42 475 with the first @emph{two} dimensions transposed: @ndimstrans{},
Chris@42 476 distributed along the @dimk{1} dimension. Conversely, if we take the
Chris@42 477 @ndimstrans{} data and transform it with the
Chris@42 478 @code{FFTW_MPI_TRANSPOSED_IN} flag, then the format goes back to the
Chris@42 479 original @ndims{} array.
Chris@42 480
Chris@42 481 There are two ways to find the portion of the transposed array that
Chris@42 482 resides on the current process. First, you can simply call the
Chris@42 483 appropriate @samp{local_size} function, passing @ndimstrans{} (the
Chris@42 484 transposed dimensions). This would mean calling the @samp{local_size}
Chris@42 485 function twice, once for the transposed and once for the
Chris@42 486 non-transposed dimensions. Alternatively, you can call one of the
Chris@42 487 @samp{local_size_transposed} functions, which returns both the
Chris@42 488 non-transposed and transposed data distribution from a single call.
Chris@42 489 For example, for a 3d transform with transposed output (or input), you
Chris@42 490 might call:
Chris@42 491
Chris@42 492 @example
Chris@42 493 ptrdiff_t fftw_mpi_local_size_3d_transposed(
Chris@42 494 ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm,
Chris@42 495 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
Chris@42 496 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
Chris@42 497 @end example
Chris@42 498 @findex fftw_mpi_local_size_3d_transposed
Chris@42 499
Chris@42 500 Here, @code{local_n0} and @code{local_0_start} give the size and
Chris@42 501 starting index of the @code{n0} dimension for the
Chris@42 502 @emph{non}-transposed data, as in the previous sections. For
Chris@42 503 @emph{transposed} data (e.g. the output for
Chris@42 504 @code{FFTW_MPI_TRANSPOSED_OUT}), @code{local_n1} and
Chris@42 505 @code{local_1_start} give the size and starting index of the @code{n1}
Chris@42 506 dimension, which is the first dimension of the transposed data
Chris@42 507 (@code{n1} by @code{n0} by @code{n2}).
Chris@42 508
Chris@42 509 (Note that @code{FFTW_MPI_TRANSPOSED_IN} is completely equivalent to
Chris@42 510 performing @code{FFTW_MPI_TRANSPOSED_OUT} and passing the first two
Chris@42 511 dimensions to the planner in reverse order, or vice versa. If you
Chris@42 512 pass @emph{both} the @code{FFTW_MPI_TRANSPOSED_IN} and
Chris@42 513 @code{FFTW_MPI_TRANSPOSED_OUT} flags, it is equivalent to swapping the
Chris@42 514 first two dimensions passed to the planner and passing @emph{neither}
Chris@42 515 flag.)
Chris@42 516
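For example, a sketch of planning an in-place 3d complex DFT with transposed
output, using the function above, might look like this (the sizes @code{n0},
@code{n1}, and @code{n2} are assumed to be defined elsewhere):

@example
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
fftw_complex *data;
fftw_plan plan;

alloc_local = fftw_mpi_local_size_3d_transposed(n0, n1, n2, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
data = fftw_alloc_complex(alloc_local);

/* @r{input: local_n0 x n1 x n2; output: local_n1 x n0 x n2 (transposed)} */
plan = fftw_mpi_plan_dft_3d(n0, n1, n2, data, data, MPI_COMM_WORLD,
                            FFTW_FORWARD,
                            FFTW_ESTIMATE | FFTW_MPI_TRANSPOSED_OUT);
@end example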
Chris@42 517 @node One-dimensional distributions, , Transposed distributions, MPI Data Distribution
Chris@42 518 @subsection One-dimensional distributions
Chris@42 519
Chris@42 520 For one-dimensional distributed DFTs using FFTW, matters are slightly
Chris@42 521 more complicated because the data distribution is more closely tied to
Chris@42 522 how the algorithm works. In particular, you can no longer pass an
Chris@42 523 arbitrary block size and must accept FFTW's default, and the block
Chris@42 524 sizes may be different for input and output. Moreover, the data
Chris@42 525 distribution depends on the flags and transform direction, in order
Chris@42 526 for forward and backward transforms to work correctly.
Chris@42 527
Chris@42 528 @example
Chris@42 529 ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm,
Chris@42 530 int sign, unsigned flags,
Chris@42 531 ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
Chris@42 532 ptrdiff_t *local_no, ptrdiff_t *local_o_start);
Chris@42 533 @end example
Chris@42 534 @findex fftw_mpi_local_size_1d
Chris@42 535
Chris@42 536 This function computes the data distribution for a 1d transform of
Chris@42 537 size @code{n0} with the given transform @code{sign} and @code{flags}.
Chris@42 538 Both input and output data use block distributions. The input on the
Chris@42 539 current process will consist of @code{local_ni} numbers starting at
Chris@42 540 index @code{local_i_start}; e.g. if only a single process is used,
Chris@42 541 then @code{local_ni} will be @code{n0} and @code{local_i_start} will
Chris@42 542 be @code{0}. Similarly for the output, with @code{local_no} numbers
Chris@42 543 starting at index @code{local_o_start}. The return value of
Chris@42 544 @code{fftw_mpi_local_size_1d} will be the total number of elements to
Chris@42 545 allocate on the current process (which might be slightly larger than
Chris@42 546 the local size due to intermediate steps in the algorithm).
Chris@42 547
Chris@42 548 As mentioned above (@pxref{Load balancing}), the data will be divided
Chris@42 549 equally among the processes if @code{n0} is divisible by the
Chris@42 550 @emph{square} of the number of processes. In this case,
Chris@42 551 @code{local_ni} will equal @code{local_no}. Otherwise, they may be
Chris@42 552 different.
Chris@42 553
Chris@42 554 For some applications, such as convolutions, the order of the output
Chris@42 555 data is irrelevant. In this case, performance can be improved by
Chris@42 556 specifying that the output data be stored in an FFTW-defined
Chris@42 557 ``scrambled'' format. (In particular, this is the analogue of
Chris@42 558 transposed output in the multidimensional case: scrambled output saves
Chris@42 559 a communications step.) If you pass @code{FFTW_MPI_SCRAMBLED_OUT} in
Chris@42 560 the flags, then the output is stored in this (undocumented) scrambled
Chris@42 561 order. Conversely, to perform the inverse transform of data in
Chris@42 562 scrambled order, pass the @code{FFTW_MPI_SCRAMBLED_IN} flag.
Chris@42 563 @ctindex FFTW_MPI_SCRAMBLED_OUT
Chris@42 564 @ctindex FFTW_MPI_SCRAMBLED_IN
Chris@42 565
Chris@42 566
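For example, a sketch of a forward 1d transform whose output is left in
scrambled order (with the inverse transform then taking scrambled input)
might look like this, with @code{n0} assumed to be defined elsewhere:

@example
ptrdiff_t alloc_local, local_ni, local_i_start, local_no, local_o_start;
fftw_complex *data;
fftw_plan plan;

alloc_local = fftw_mpi_local_size_1d(n0, MPI_COMM_WORLD,
                                     FFTW_FORWARD, FFTW_MPI_SCRAMBLED_OUT,
                                     &local_ni, &local_i_start,
                                     &local_no, &local_o_start);
data = fftw_alloc_complex(alloc_local);

plan = fftw_mpi_plan_dft_1d(n0, data, data, MPI_COMM_WORLD,
                            FFTW_FORWARD,
                            FFTW_ESTIMATE | FFTW_MPI_SCRAMBLED_OUT);
@end example

Note that the same sign and scrambling flags are passed to both the
@samp{local_size} call and the planner, since the 1d data distribution
depends on them.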
Chris@42 567 In MPI FFTW, only composite sizes @code{n0} can be parallelized; we
Chris@42 568 have not yet implemented a parallel algorithm for large prime sizes.
Chris@42 569
Chris@42 570 @c ------------------------------------------------------------
Chris@42 571 @node Multi-dimensional MPI DFTs of Real Data, Other Multi-dimensional Real-data MPI Transforms, MPI Data Distribution, Distributed-memory FFTW with MPI
Chris@42 572 @section Multi-dimensional MPI DFTs of Real Data
Chris@42 573
Chris@42 574 FFTW's MPI interface also supports multi-dimensional DFTs of real
Chris@42 575 data, similar to the serial r2c and c2r interfaces. (Parallel
Chris@42 576 one-dimensional real-data DFTs are not currently supported; you must
Chris@42 577 use a complex transform and set the imaginary parts of the inputs to
Chris@42 578 zero.)
Chris@42 579
Chris@42 580 The key points to understand for r2c and c2r MPI transforms (compared
Chris@42 581 to the MPI complex DFTs or the serial r2c/c2r transforms) are:
Chris@42 582
Chris@42 583 @itemize @bullet
Chris@42 584
Chris@42 585 @item
Chris@42 586 Just as for serial transforms, r2c/c2r DFTs transform @ndims{} real
Chris@42 587 data to/from @ndimshalf{} complex data: the last dimension of the
Chris@42 588 complex data is cut in half (rounded down), plus one. As for the
Chris@42 589 serial transforms, the sizes you pass to the @samp{plan_dft_r2c} and
Chris@42 590 @samp{plan_dft_c2r} are the @ndims{} dimensions of the real data.
Chris@42 591
Chris@42 592 @item
Chris@42 593 @cindex padding
Chris@42 594 Although the real data is @emph{conceptually} @ndims{}, it is
Chris@42 595 @emph{physically} stored as an @ndimspad{} array, where the last
Chris@42 596 dimension has been @emph{padded} to make it the same size as the
Chris@42 597 complex output. This is much like the in-place serial r2c/c2r
Chris@42 598 interface (@pxref{Multi-Dimensional DFTs of Real Data}), except that
Chris@42 599 in MPI the padding is required even for out-of-place data. The extra
Chris@42 600 padding numbers are ignored by FFTW (they are @emph{not} like
Chris@42 601 zero-padding the transform to a larger size); they are only used to
Chris@42 602 determine the data layout.
Chris@42 603
Chris@42 604 @item
Chris@42 605 @cindex data distribution
Chris@42 606 The data distribution in MPI for @emph{both} the real and complex data
Chris@42 607 is determined by the shape of the @emph{complex} data. That is, you
Chris@42 608 call the appropriate @samp{local_size} function for the @ndimshalf{}
Chris@42 609 complex data, and then use the @emph{same} distribution for the real
Chris@42 610 data except that the last complex dimension is replaced by a (padded)
Chris@42 611 real dimension of twice the length.
Chris@42 612
Chris@42 613 @end itemize
Chris@42 614
Chris@42 615 For example, suppose we are performing an out-of-place r2c transform of
Chris@42 616 @threedims{L,M,N} real data [padded to @threedims{L,M,2(N/2+1)}],
Chris@42 617 resulting in @threedims{L,M,N/2+1} complex data. Similar to the
Chris@42 618 example in @ref{2d MPI example}, we might do something like:
Chris@42 619
Chris@42 620 @example
Chris@42 621 #include <fftw3-mpi.h>
Chris@42 622
Chris@42 623 int main(int argc, char **argv)
Chris@42 624 @{
Chris@42 625 const ptrdiff_t L = ..., M = ..., N = ...;
Chris@42 626 fftw_plan plan;
Chris@42 627 double *rin;
Chris@42 628 fftw_complex *cout;
Chris@42 629 ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k;
Chris@42 630
Chris@42 631 MPI_Init(&argc, &argv);
Chris@42 632 fftw_mpi_init();
Chris@42 633
Chris@42 634 /* @r{get local data size and allocate} */
Chris@42 635 alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD,
Chris@42 636 &local_n0, &local_0_start);
Chris@42 637 rin = fftw_alloc_real(2 * alloc_local);
Chris@42 638 cout = fftw_alloc_complex(alloc_local);
Chris@42 639
Chris@42 640 /* @r{create plan for out-of-place r2c DFT} */
Chris@42 641 plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
Chris@42 642 FFTW_MEASURE);
Chris@42 643
Chris@42 644 /* @r{initialize rin to some function} my_func(x,y,z) */
Chris@42 645 for (i = 0; i < local_n0; ++i)
Chris@42 646 for (j = 0; j < M; ++j)
Chris@42 647 for (k = 0; k < N; ++k)
Chris@42 648 rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k);
Chris@42 649
Chris@42 650 /* @r{compute transforms as many times as desired} */
Chris@42 651 fftw_execute(plan);
Chris@42 652
Chris@42 653 fftw_destroy_plan(plan);
Chris@42 654
Chris@42 655 MPI_Finalize();
Chris@42 656 @}
Chris@42 657 @end example
Chris@42 658
Chris@42 659 @findex fftw_alloc_real
Chris@42 660 @cindex row-major
Chris@42 661 Note that we allocated @code{rin} using @code{fftw_alloc_real} with an
Chris@42 662 argument of @code{2 * alloc_local}: since @code{alloc_local} is the
Chris@42 663 number of @emph{complex} values to allocate, the number of @emph{real}
Chris@42 664 values is twice as many. The @code{rin} array is then
Chris@42 665 @threedims{local_n0,M,2(N/2+1)} in row-major order, so its
Chris@42 666 @code{(i,j,k)} element is at the index @code{(i*M + j) * (2*(N/2+1)) +
Chris@42 667 k} (@pxref{Multi-dimensional Array Format}).
Chris@42 668
Chris@42 669 @cindex transpose
Chris@42 670 @ctindex FFTW_MPI_TRANSPOSED_OUT
Chris@42 671 @ctindex FFTW_MPI_TRANSPOSED_IN
Chris@42 672 As for the complex transforms, improved performance can be obtained by
Chris@42 673 specifying that the output is the transpose of the input or vice versa
Chris@42 674 (@pxref{Transposed distributions}). In our @threedims{L,M,N} r2c
Chris@42 675 example, including @code{FFTW_MPI_TRANSPOSED_OUT} in the flags means that
Chris@42 676 the input would be a padded @threedims{L,M,2(N/2+1)} real array
Chris@42 677 distributed over the @code{L} dimension, while the output would be a
Chris@42 678 @threedims{M,L,N/2+1} complex array distributed over the @code{M}
Chris@42 679 dimension. To perform the inverse c2r transform with the same data
Chris@42 680 distributions, you would use the @code{FFTW_MPI_TRANSPOSED_IN} flag.
Chris@42 681
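A sketch of the corresponding calls for this transposed r2c case, reusing the
variable names of the example above plus two extra @code{ptrdiff_t} variables
@code{local_n1} and @code{local_1_start}, might be:

@example
alloc_local = fftw_mpi_local_size_3d_transposed(L, M, N/2+1, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
rin = fftw_alloc_real(2 * alloc_local);
cout = fftw_alloc_complex(alloc_local);

/* @r{input: local_n0 x M x 2(N/2+1) reals, distributed over L;} */
/* @r{output: local_n1 x L x (N/2+1) complex, distributed over M} */
plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
                                FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);
@end example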
Chris@42 682 @c ------------------------------------------------------------
Chris@42 683 @node Other Multi-dimensional Real-data MPI Transforms, FFTW MPI Transposes, Multi-dimensional MPI DFTs of Real Data, Distributed-memory FFTW with MPI
Chris@42 684 @section Other Multi-dimensional Real-data MPI Transforms
Chris@42 685
Chris@42 686 @cindex r2r
Chris@42 687 FFTW's MPI interface also supports multi-dimensional @samp{r2r}
Chris@42 688 transforms of all kinds supported by the serial interface
Chris@42 689 (e.g. discrete cosine and sine transforms, discrete Hartley
Chris@42 690 transforms, etc.). Only multi-dimensional @samp{r2r} transforms, not
Chris@42 691 one-dimensional transforms, are currently parallelized.
Chris@42 692
Chris@42 693 @tindex fftw_r2r_kind
Chris@42 694 These are used much like the multidimensional complex DFTs discussed
Chris@42 695 above, except that the data is real rather than complex, and one needs
Chris@42 696 to pass an r2r transform kind (@code{fftw_r2r_kind}) for each
Chris@42 697 dimension as in the serial FFTW (@pxref{More DFTs of Real Data}).
Chris@42 698
Chris@42 699 For example, one might perform a two-dimensional @twodims{L,M} r2r transform that is
Chris@42 700 an REDFT10 (DCT-II) in the first dimension and an RODFT10 (DST-II) in
Chris@42 701 the second dimension with code like:
Chris@42 702
Chris@42 703 @example
Chris@42 704 const ptrdiff_t L = ..., M = ...;
Chris@42 705 fftw_plan plan;
Chris@42 706 double *data;
Chris@42 707 ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
Chris@42 708
Chris@42 709 /* @r{get local data size and allocate} */
Chris@42 710 alloc_local = fftw_mpi_local_size_2d(L, M, MPI_COMM_WORLD,
Chris@42 711 &local_n0, &local_0_start);
Chris@42 712 data = fftw_alloc_real(alloc_local);
Chris@42 713
Chris@42 714 /* @r{create plan for in-place REDFT10 x RODFT10} */
Chris@42 715 plan = fftw_mpi_plan_r2r_2d(L, M, data, data, MPI_COMM_WORLD,
Chris@42 716 FFTW_REDFT10, FFTW_RODFT10, FFTW_MEASURE);
Chris@42 717
Chris@42 718 /* @r{initialize data to some function} my_function(x,y) */
Chris@42 719 for (i = 0; i < local_n0; ++i) for (j = 0; j < M; ++j)
Chris@42 720 data[i*M + j] = my_function(local_0_start + i, j);
Chris@42 721
Chris@42 722 /* @r{compute transforms, in-place, as many times as desired} */
Chris@42 723 fftw_execute(plan);
Chris@42 724
Chris@42 725 fftw_destroy_plan(plan);
Chris@42 726 @end example
Chris@42 727
Chris@42 728 @findex fftw_alloc_real
Chris@42 729 Notice that we use the same @samp{local_size} functions as we did for
Chris@42 730 complex data, only now we interpret the sizes in terms of real rather
Chris@42 731 than complex values, and correspondingly use @code{fftw_alloc_real}.
Chris@42 732
Chris@42 733 @c ------------------------------------------------------------
Chris@42 734 @node FFTW MPI Transposes, FFTW MPI Wisdom, Other Multi-dimensional Real-data MPI Transforms, Distributed-memory FFTW with MPI
Chris@42 735 @section FFTW MPI Transposes
Chris@42 736 @cindex transpose
Chris@42 737
Chris@42 738 FFTW's MPI Fourier transforms rely on one or more @emph{global
Chris@42 739 transposition} steps for their communications. For example, the
Chris@42 740 multidimensional transforms work by transforming along some
Chris@42 741 dimensions, then transposing to make the first dimension local and
Chris@42 742 transforming that, then transposing back. Because global
Chris@42 743 transposition of a block-distributed matrix has many other potential
Chris@42 744 uses besides FFTs, FFTW's transpose routines can be called directly,
Chris@42 745 as documented in this section.
Chris@42 746
Chris@42 747 @menu
Chris@42 748 * Basic distributed-transpose interface::
Chris@42 749 * Advanced distributed-transpose interface::
Chris@42 750 * An improved replacement for MPI_Alltoall::
Chris@42 751 @end menu
Chris@42 752
Chris@42 753 @node Basic distributed-transpose interface, Advanced distributed-transpose interface, FFTW MPI Transposes, FFTW MPI Transposes
Chris@42 754 @subsection Basic distributed-transpose interface
Chris@42 755
Chris@42 756 In particular, suppose that we have an @code{n0} by @code{n1} array in
Chris@42 757 row-major order, block-distributed across the @code{n0} dimension. To
Chris@42 758 transpose this into an @code{n1} by @code{n0} array block-distributed
Chris@42 759 across the @code{n1} dimension, we would create a plan by calling the
Chris@42 760 following function:
Chris@42 761
Chris@42 762 @example
Chris@42 763 fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
Chris@42 764 double *in, double *out,
Chris@42 765 MPI_Comm comm, unsigned flags);
Chris@42 766 @end example
Chris@42 767 @findex fftw_mpi_plan_transpose
Chris@42 768
Chris@42 769 The input and output arrays (@code{in} and @code{out}) can be the
Chris@42 770 same. The transpose is actually executed by calling
Chris@42 771 @code{fftw_execute} on the plan, as usual.
Chris@42 772 @findex fftw_execute
Chris@42 773
Chris@42 774
Chris@42 775 The @code{flags} are the usual FFTW planner flags, but support
Chris@42 776 two additional flags: @code{FFTW_MPI_TRANSPOSED_OUT} and/or
Chris@42 777 @code{FFTW_MPI_TRANSPOSED_IN}. What these flags indicate, for
Chris@42 778 transpose plans, is that the output and/or input, respectively, are
Chris@42 779 @emph{locally} transposed. That is, on each process input data is
Chris@42 780 normally stored as a @code{local_n0} by @code{n1} array in row-major
Chris@42 781 order, but for an @code{FFTW_MPI_TRANSPOSED_IN} plan the input data is
Chris@42 782 stored as @code{n1} by @code{local_n0} in row-major order. Similarly,
Chris@42 783 @code{FFTW_MPI_TRANSPOSED_OUT} means that the output is @code{n0} by
Chris@42 784 @code{local_n1} instead of @code{local_n1} by @code{n0}.
Chris@42 785 @ctindex FFTW_MPI_TRANSPOSED_OUT
Chris@42 786 @ctindex FFTW_MPI_TRANSPOSED_IN
Chris@42 787
Chris@42 788
Chris@42 789 To determine the local size of the array on each process before and
Chris@42 790 after the transpose, as well as the amount of storage that must be
Chris@42 791 allocated, one should call @code{fftw_mpi_local_size_2d_transposed},
Chris@42 792 just as for a 2d DFT as described in the previous section:
Chris@42 793 @cindex data distribution
Chris@42 794
Chris@42 795 @example
Chris@42 796 ptrdiff_t fftw_mpi_local_size_2d_transposed
Chris@42 797 (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
Chris@42 798 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
Chris@42 799 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
Chris@42 800 @end example
Chris@42 801 @findex fftw_mpi_local_size_2d_transposed
Chris@42 802
Chris@42 803 Again, the return value is the local storage to allocate, which in
Chris@42 804 this case is the number of @emph{real} (@code{double}) values rather
Chris@42 805 than complex numbers as in the previous examples.
Chris@42 806
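For example, a sketch of an in-place transpose of an @code{n0} by @code{n1}
real matrix using these two routines might be:

@example
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
double *data;
fftw_plan plan;

alloc_local = fftw_mpi_local_size_2d_transposed(n0, n1, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
data = fftw_alloc_real(alloc_local);

/* @r{... fill the local_n0 x n1 block of the input matrix ...} */

plan = fftw_mpi_plan_transpose(n0, n1, data, data, MPI_COMM_WORLD,
                               FFTW_ESTIMATE);
fftw_execute(plan); /* @r{data now holds the local_n1 x n0 block of the transpose} */
fftw_destroy_plan(plan);
@end example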
Chris@42 807 @node Advanced distributed-transpose interface, An improved replacement for MPI_Alltoall, Basic distributed-transpose interface, FFTW MPI Transposes
Chris@42 808 @subsection Advanced distributed-transpose interface
Chris@42 809
Chris@42 810 The above routines are for a transpose of a matrix of numbers (of type
Chris@42 811 @code{double}), using FFTW's default block sizes. More generally, one
Chris@42 812 can perform transposes of @emph{tuples} of numbers, with
Chris@42 813 user-specified block sizes for the input and output:
Chris@42 814
Chris@42 815 @example
Chris@42 816 fftw_plan fftw_mpi_plan_many_transpose
Chris@42 817 (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
Chris@42 818 ptrdiff_t block0, ptrdiff_t block1,
Chris@42 819 double *in, double *out, MPI_Comm comm, unsigned flags);
Chris@42 820 @end example
Chris@42 821 @findex fftw_mpi_plan_many_transpose
Chris@42 822
Chris@42 823 In this case, one is transposing an @code{n0} by @code{n1} matrix of
Chris@42 824 @code{howmany}-tuples (e.g. @code{howmany = 2} for complex numbers).
Chris@42 825 The input is distributed along the @code{n0} dimension with block size
Chris@42 826 @code{block0}, and the @code{n1} by @code{n0} output is distributed
Chris@42 827 along the @code{n1} dimension with block size @code{block1}. If
Chris@42 828 @code{FFTW_MPI_DEFAULT_BLOCK} (0) is passed for a block size then FFTW
Chris@42 829 uses its default block size. To get the local size of the data on
Chris@42 830 each process, you should then call @code{fftw_mpi_local_size_many_transposed}.
Chris@42 831 @ctindex FFTW_MPI_DEFAULT_BLOCK
Chris@42 832 @findex fftw_mpi_local_size_many_transposed
Chris@42 833
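For instance, a sketch of transposing an @code{n0} by @code{n1} matrix of
complex numbers, stored as @code{howmany = 2} real values each and using
FFTW's default blocks, might be (see the FFTW MPI Reference for the full
prototype of @code{fftw_mpi_local_size_many_transposed}):

@example
const ptrdiff_t n[2] = @{n0, n1@};
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
double *data;
fftw_plan plan;

alloc_local = fftw_mpi_local_size_many_transposed(
     2, n, 2 /* @r{howmany} */,
     FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK, MPI_COMM_WORLD,
     &local_n0, &local_0_start, &local_n1, &local_1_start);
data = fftw_alloc_real(alloc_local);

plan = fftw_mpi_plan_many_transpose(n0, n1, 2,
                                    FFTW_MPI_DEFAULT_BLOCK,
                                    FFTW_MPI_DEFAULT_BLOCK,
                                    data, data, MPI_COMM_WORLD,
                                    FFTW_ESTIMATE);
@end example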
Chris@42 834 @node An improved replacement for MPI_Alltoall, , Advanced distributed-transpose interface, FFTW MPI Transposes
Chris@42 835 @subsection An improved replacement for MPI_Alltoall
Chris@42 836
Chris@42 837 We close this section by noting that FFTW's MPI transpose routines can
Chris@42 838 be thought of as a generalization of the @code{MPI_Alltoall} function
Chris@42 839 (albeit only for floating-point types), and in some circumstances can
Chris@42 840 function as an improved replacement.
Chris@42 841 @findex MPI_Alltoall
Chris@42 842
Chris@42 843
Chris@42 844 @code{MPI_Alltoall} is defined by the MPI standard as:
Chris@42 845
Chris@42 846 @example
Chris@42 847 int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
Chris@42 848 void *recvbuf, int recvcnt, MPI_Datatype recvtype,
Chris@42 849 MPI_Comm comm);
Chris@42 850 @end example
Chris@42 851
Chris@42 852 In particular, for @code{double*} arrays @code{in} and @code{out},
Chris@42 853 consider the call:
Chris@42 854
Chris@42 855 @example
Chris@42 856 MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm);
Chris@42 857 @end example
Chris@42 858
Chris@42 859 This is completely equivalent to:
Chris@42 860
Chris@42 861 @example
Chris@42 862 MPI_Comm_size(comm, &P);
Chris@42 863 plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE);
Chris@42 864 fftw_execute(plan);
Chris@42 865 fftw_destroy_plan(plan);
Chris@42 866 @end example
Chris@42 867
Chris@42 868 That is, computing a @twodims{P,P} transpose on @code{P} processes,
Chris@42 869 with a block size of 1, is just a standard all-to-all communication.
Chris@42 870
Chris@42 871 However, using the FFTW routine instead of @code{MPI_Alltoall} may
Chris@42 872 have certain advantages. First of all, FFTW's routine can operate
Chris@42 873 in-place (@code{in == out}) whereas @code{MPI_Alltoall} can only
Chris@42 874 operate out-of-place.
Chris@42 875 @cindex in-place
Chris@42 876
Chris@42 877
Chris@42 878 Second, even for out-of-place plans, FFTW's routine may be faster,
Chris@42 879 especially if you need to perform the all-to-all communication many
Chris@42 880 times and can afford to use @code{FFTW_MEASURE} or
Chris@42 881 @code{FFTW_PATIENT}. It should certainly be no slower, not including
Chris@42 882 the time to create the plan, since one of the possible algorithms that
Chris@42 883 FFTW uses for an out-of-place transpose @emph{is} simply to call
Chris@42 884 @code{MPI_Alltoall}. However, FFTW also considers several other
Chris@42 885 possible algorithms that, depending on your MPI implementation and
Chris@42 886 your hardware, may be faster.
Chris@42 887 @ctindex FFTW_MEASURE
Chris@42 888 @ctindex FFTW_PATIENT
Chris@42 889
Chris@42 890 @c ------------------------------------------------------------
Chris@42 891 @node FFTW MPI Wisdom, Avoiding MPI Deadlocks, FFTW MPI Transposes, Distributed-memory FFTW with MPI
Chris@42 892 @section FFTW MPI Wisdom
Chris@42 893 @cindex wisdom
Chris@42 894 @cindex saving plans to disk
Chris@42 895
Chris@42 896 FFTW's ``wisdom'' facility (@pxref{Words of Wisdom-Saving Plans}) can
Chris@42 897 be used to save MPI plans as well as to save uniprocessor plans.
Chris@42 898 However, for MPI there are several unavoidable complications.
Chris@42 899
Chris@42 900 @cindex MPI I/O
Chris@42 901 First, the MPI standard does not guarantee that every process can
Chris@42 902 perform file I/O (at least, not using C stdio routines)---in general,
Chris@42 903 we may only assume that process 0 is capable of I/O.@footnote{In fact,
Chris@42 904 even this assumption is not technically guaranteed by the standard,
Chris@42 905 although it seems to be universal in actual MPI implementations and is
Chris@42 906 widely assumed by MPI-using software. Technically, you need to query
Chris@42 907 the @code{MPI_IO} attribute of @code{MPI_COMM_WORLD} with
Chris@42 908 @code{MPI_Attr_get}. If this attribute is @code{MPI_PROC_NULL}, no
Chris@42 909 I/O is possible. If it is @code{MPI_ANY_SOURCE}, any process can
Chris@42 910 perform I/O. Otherwise, it is the rank of a process that can perform
Chris@42 911 I/O ... but since it is not guaranteed to yield the @emph{same} rank
Chris@42 912 on all processes, you have to do an @code{MPI_Allreduce} of some kind
Chris@42 913 if you want all processes to agree about which is going to do I/O.
Chris@42 914 And even then, the standard only guarantees that this process can
Chris@42 915 perform output, but not input. See e.g. @cite{Parallel Programming
Chris@42 916 with MPI} by P. S. Pacheco, section 8.1.3. Needless to say, in our
Chris@42 917 experience virtually no MPI programmers worry about this.} So, if we
Chris@42 918 want to export the wisdom from a single process to a file, we must
Chris@42 919 first export the wisdom to a string, then send it to process 0, then
Chris@42 920 write it to a file.
Chris@42 921
Chris@42 922 Second, in principle we may want to have separate wisdom for every
Chris@42 923 process, since in general the processes may run on different hardware
Chris@42 924 even for a single MPI program. However, in practice FFTW's MPI code
Chris@42 925 is designed for the case of homogeneous hardware (@pxref{Load
Chris@42 926 balancing}), and in this case it is convenient to use the same wisdom
Chris@42 927 for every process. Thus, we need a mechanism to synchronize the wisdom.
Chris@42 928
Chris@42 929 To address both of these problems, FFTW provides the following two
Chris@42 930 functions:
Chris@42 931
Chris@42 932 @example
Chris@42 933 void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
Chris@42 934 void fftw_mpi_gather_wisdom(MPI_Comm comm);
Chris@42 935 @end example
Chris@42 936 @findex fftw_mpi_gather_wisdom
Chris@42 937 @findex fftw_mpi_broadcast_wisdom
Chris@42 938
Chris@42 939 Given a communicator @code{comm}, @code{fftw_mpi_broadcast_wisdom}
Chris@42 940 will broadcast the wisdom from process 0 to all other processes.
Chris@42 941 Conversely, @code{fftw_mpi_gather_wisdom} will collect wisdom from all
Chris@42 942 processes onto process 0. (If the plans created for the same problem
Chris@42 943 by different processes are not the same, @code{fftw_mpi_gather_wisdom}
Chris@42 944 will arbitrarily choose one of the plans.) Both of these functions
Chris@42 945 may result in suboptimal plans for different processes if the
Chris@42 946 processes are running on non-identical hardware. Both of these
Chris@42 947 functions are @emph{collective} calls, which means that they must be
Chris@42 948 executed by all processes in the communicator.
Chris@42 949 @cindex collective function
Chris@42 950
Chris@42 951
Chris@42 952 So, for example, a typical code snippet to import wisdom from a file
Chris@42 953 and use it on all processes would be:
Chris@42 954
Chris@42 955 @example
Chris@42 956 @{
Chris@42 957 int rank;
Chris@42 958
Chris@42 959 fftw_mpi_init();
Chris@42 960 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
Chris@42 961 if (rank == 0) fftw_import_wisdom_from_filename("mywisdom");
Chris@42 962 fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD);
Chris@42 963 @}
Chris@42 964 @end example
Chris@42 965
Chris@42 966 (Note that we must call @code{fftw_mpi_init} before importing any
Chris@42 967 wisdom that might contain MPI plans.) Similarly, a typical code
Chris@42 968 snippet to export wisdom from all processes to a file is:
Chris@42 969 @findex fftw_mpi_init
Chris@42 970
Chris@42 971 @example
Chris@42 972 @{
Chris@42 973 int rank;
Chris@42 974
Chris@42 975 fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
Chris@42 976 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
Chris@42 977 if (rank == 0) fftw_export_wisdom_to_filename("mywisdom");
Chris@42 978 @}
Chris@42 979 @end example
Chris@42 980
Chris@42 981 @c ------------------------------------------------------------
Chris@42 982 @node Avoiding MPI Deadlocks, FFTW MPI Performance Tips, FFTW MPI Wisdom, Distributed-memory FFTW with MPI
Chris@42 983 @section Avoiding MPI Deadlocks
Chris@42 984 @cindex deadlock
Chris@42 985
Chris@42 986 An MPI program can @emph{deadlock} if one process is waiting for a
Chris@42 987 message from another process that never gets sent. To avoid deadlocks
Chris@42 988 when using FFTW's MPI routines, it is important to know which
Chris@42 989 functions are @emph{collective}: that is, which functions must
Chris@42 990 @emph{always} be called in the @emph{same order} from @emph{every}
Chris@42 991 process in a given communicator. (For example, @code{MPI_Barrier} is
Chris@42 992 the canonical example of a collective function in the MPI standard.)
Chris@42 993 @cindex collective function
Chris@42 994 @findex MPI_Barrier
Chris@42 995
Chris@42 996
Chris@42 997 The functions in FFTW that are @emph{always} collective are: every
Chris@42 998 function beginning with @samp{fftw_mpi_plan}, as well as
Chris@42 999 @code{fftw_mpi_broadcast_wisdom} and @code{fftw_mpi_gather_wisdom}.
Chris@42 1000 Also, the following functions from the ordinary FFTW interface are
Chris@42 1001 collective when they are applied to a plan created by an
Chris@42 1002 @samp{fftw_mpi_plan} function: @code{fftw_execute},
Chris@42 1003 @code{fftw_destroy_plan}, and @code{fftw_flops}.
Chris@42 1004 @findex fftw_execute
Chris@42 1005 @findex fftw_destroy_plan
Chris@42 1006 @findex fftw_flops
Chris@42 1007
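For example, here is a minimal sketch (assuming a plan @code{p} that was
created collectively on @code{MPI_COMM_WORLD} and a @code{rank} obtained
from @code{MPI_Comm_rank}) of a call pattern that deadlocks, together
with the correct pattern:

@example
/* WRONG: fftw_execute on an MPI plan is collective; process 0 blocks
   waiting for data from processes that never enter the call. */
if (rank == 0) fftw_execute(p);

/* RIGHT: every process in the communicator executes the plan. */
fftw_execute(p);
@end example
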
Chris@42 1008 @c ------------------------------------------------------------
Chris@42 1009 @node FFTW MPI Performance Tips, Combining MPI and Threads, Avoiding MPI Deadlocks, Distributed-memory FFTW with MPI
Chris@42 1010 @section FFTW MPI Performance Tips
Chris@42 1011
Chris@42 1012 In this section, we collect a few tips on getting the best performance
Chris@42 1013 out of FFTW's MPI transforms.
Chris@42 1014
Chris@42 1015 First, because of the 1d block distribution, FFTW's parallelization is
Chris@42 1016 currently limited by the size of the first dimension.
Chris@42 1017 (Multidimensional block distributions may be supported by a future
Chris@42 1018 version.) More generally, you should ideally arrange the dimensions so
Chris@42 1019 that FFTW can divide them equally among the processes. @xref{Load
Chris@42 1020 balancing}.
Chris@42 1021 @cindex block distribution
Chris@42 1022 @cindex load balancing
Chris@42 1023
Chris@42 1024
Chris@42 1025 Second, if it is not too inconvenient, you should consider working
Chris@42 1026 with transposed output for multidimensional plans, as this saves a
Chris@42 1027 considerable amount of communications. @xref{Transposed distributions}.
Chris@42 1028 @cindex transpose
Chris@42 1029
Chris@42 1030
Chris@42 1031 Third, the fastest choices are generally either an in-place transform
Chris@42 1032 or an out-of-place transform with the @code{FFTW_DESTROY_INPUT} flag
Chris@42 1033 (which allows the input array to be used as scratch space). In-place
Chris@42 1034 is especially beneficial if the amount of data per process is large.
Chris@42 1035 @ctindex FFTW_DESTROY_INPUT
Chris@42 1036
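For example, an out-of-place plan that allows FFTW to use its input
array as scratch might be created as in the following sketch (the sizes
and arrays are assumed to come from the usual @samp{local_size} and
allocation calls described later in this chapter):

@example
fftw_plan plan = fftw_mpi_plan_dft_2d(n0, n1, in, out, MPI_COMM_WORLD,
                                      FFTW_FORWARD,
                                      FFTW_MEASURE | FFTW_DESTROY_INPUT);
@end example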
Chris@42 1037
Chris@42 1038 Fourth, if you have multiple arrays to transform at once, rather than
Chris@42 1039 calling FFTW's MPI transforms several times, it is usually faster
Chris@42 1040 to interleave the data and use the advanced interface. (This
Chris@42 1041 groups the communications together instead of requiring separate
Chris@42 1042 messages for each transform.)
Chris@42 1043
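For example (a sketch with placeholder sizes; the advanced-interface
routines are documented later in this chapter), two @twodims{n0,n1}
complex arrays can be stored as contiguous 2-tuples and transformed by
a single plan:

@example
const ptrdiff_t n[2] = @{n0, n1@};
ptrdiff_t local_n0, local_0_start;
ptrdiff_t alloc_local =
     fftw_mpi_local_size_many(2, n, 2, FFTW_MPI_DEFAULT_BLOCK,
                              MPI_COMM_WORLD, &local_n0, &local_0_start);
fftw_complex *data = fftw_alloc_complex(alloc_local);
/* element i of array k (k = 0 or 1) is stored at data[i*2 + k] */
fftw_plan plan =
     fftw_mpi_plan_many_dft(2, n, 2, FFTW_MPI_DEFAULT_BLOCK,
                            FFTW_MPI_DEFAULT_BLOCK, data, data,
                            MPI_COMM_WORLD, FFTW_FORWARD, FFTW_MEASURE);
@end example
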
Chris@42 1044 @c ------------------------------------------------------------
Chris@42 1045 @node Combining MPI and Threads, FFTW MPI Reference, FFTW MPI Performance Tips, Distributed-memory FFTW with MPI
Chris@42 1046 @section Combining MPI and Threads
Chris@42 1047 @cindex threads
Chris@42 1048
Chris@42 1049 In certain cases, it may be advantageous to combine MPI
Chris@42 1050 (distributed-memory) and threads (shared-memory) parallelization.
Chris@42 1051 FFTW supports this, with certain caveats. For example, if you have a
Chris@42 1052 cluster of 4-processor shared-memory nodes, you may want to use
Chris@42 1053 threads within the nodes and MPI between the nodes, instead of MPI for
Chris@42 1054 all parallelization.
Chris@42 1055
Chris@42 1056 In particular, it is possible to seamlessly combine the MPI FFTW
Chris@42 1057 routines with the multi-threaded FFTW routines (@pxref{Multi-threaded
Chris@42 1058 FFTW}). However, some care must be taken in the initialization code,
Chris@42 1059 which should look something like this:
Chris@42 1060
Chris@42 1061 @example
Chris@42 1062 int threads_ok;
Chris@42 1063
Chris@42 1064 int main(int argc, char **argv)
Chris@42 1065 @{
Chris@42 1066 int provided;
Chris@42 1067 MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
Chris@42 1068 threads_ok = provided >= MPI_THREAD_FUNNELED;
Chris@42 1069
Chris@42 1070 if (threads_ok) threads_ok = fftw_init_threads();
Chris@42 1071 fftw_mpi_init();
Chris@42 1072
Chris@42 1073 ...
Chris@42 1074 if (threads_ok) fftw_plan_with_nthreads(...);
Chris@42 1075 ...
Chris@42 1076
Chris@42 1077 MPI_Finalize();
Chris@42 1078 @}
Chris@42 1079 @end example
Chris@42 1080 @findex fftw_mpi_init
Chris@42 1081 @findex fftw_init_threads
Chris@42 1082 @findex fftw_plan_with_nthreads
Chris@42 1083
Chris@42 1084 First, note that instead of calling @code{MPI_Init}, you should call
Chris@42 1085 @code{MPI_Init_thread}, which is the initialization routine defined
Chris@42 1086 by the MPI-2 standard to indicate to MPI that your program will be
Chris@42 1087 multithreaded. We pass @code{MPI_THREAD_FUNNELED}, which indicates
Chris@42 1088 that we will only call MPI routines from the main thread. (FFTW will
Chris@42 1089 launch additional threads internally, but the extra threads will not
Chris@42 1090 call MPI code.) (You may also pass @code{MPI_THREAD_SERIALIZED} or
Chris@42 1091 @code{MPI_THREAD_MULTIPLE}, which requests additional multithreading
Chris@42 1092 support from the MPI implementation, but this is not required by
Chris@42 1093 FFTW.) The @code{provided} parameter returns what level of threads
Chris@42 1094 support is actually supported by your MPI implementation; this
Chris@42 1095 @emph{must} be at least @code{MPI_THREAD_FUNNELED} if you want to call
Chris@42 1096 the FFTW threads routines, so we define a global variable
Chris@42 1097 @code{threads_ok} to record this. You should only call
Chris@42 1098 @code{fftw_init_threads} or @code{fftw_plan_with_nthreads} if
Chris@42 1099 @code{threads_ok} is true. For more information on thread safety in
Chris@42 1100 MPI, see the
Chris@42 1101 @uref{http://www.mpi-forum.org/docs/mpi-20-html/node162.htm, MPI and
Chris@42 1102 Threads} section of the MPI-2 standard.
Chris@42 1103 @cindex thread safety
Chris@42 1104
Chris@42 1105
Chris@42 1106 Second, we must call @code{fftw_init_threads} @emph{before}
Chris@42 1107 @code{fftw_mpi_init}. This is critical for technical reasons having
Chris@42 1108 to do with how FFTW initializes its list of algorithms.
Chris@42 1109
Chris@42 1110 Then, if you call @code{fftw_plan_with_nthreads(N)}, @emph{every} MPI
Chris@42 1111 process will launch (up to) @code{N} threads to parallelize its transforms.
Chris@42 1112
Chris@42 1113 For example, in the hypothetical cluster of 4-processor nodes, you
Chris@42 1114 might wish to launch only a single MPI process per node, and then call
Chris@42 1115 @code{fftw_plan_with_nthreads(4)} on each process to use all
Chris@42 1116 processors in the nodes.
Chris@42 1117
Chris@42 1118 This may or may not be faster than simply using as many MPI processes
Chris@42 1119 as you have processors, however. On the one hand, using threads
Chris@42 1120 within a node eliminates the need for explicit message passing within
Chris@42 1121 the node. On the other hand, FFTW's transpose routines are not
Chris@42 1122 multi-threaded, and this means that the communications that do take
Chris@42 1123 place will not benefit from parallelization within the node.
Chris@42 1124 Moreover, many MPI implementations already have optimizations to
Chris@42 1125 exploit shared memory when it is available, so adding the
Chris@42 1126 multithreaded FFTW on top of this may be superfluous.
Chris@42 1127 @cindex transpose
Chris@42 1128
Chris@42 1129 @c ------------------------------------------------------------
Chris@42 1130 @node FFTW MPI Reference, FFTW MPI Fortran Interface, Combining MPI and Threads, Distributed-memory FFTW with MPI
Chris@42 1131 @section FFTW MPI Reference
Chris@42 1132
Chris@42 1133 This section provides a complete reference to all FFTW MPI functions,
Chris@42 1134 datatypes, and constants. See also @ref{FFTW Reference} for information
Chris@42 1135 on functions and types in common with the serial interface.
Chris@42 1136
Chris@42 1137 @menu
Chris@42 1138 * MPI Files and Data Types::
Chris@42 1139 * MPI Initialization::
Chris@42 1140 * Using MPI Plans::
Chris@42 1141 * MPI Data Distribution Functions::
Chris@42 1142 * MPI Plan Creation::
Chris@42 1143 * MPI Wisdom Communication::
Chris@42 1144 @end menu
Chris@42 1145
Chris@42 1146 @node MPI Files and Data Types, MPI Initialization, FFTW MPI Reference, FFTW MPI Reference
Chris@42 1147 @subsection MPI Files and Data Types
Chris@42 1148
Chris@42 1149 All programs using FFTW's MPI support should include its header file:
Chris@42 1150
Chris@42 1151 @example
Chris@42 1152 #include <fftw3-mpi.h>
Chris@42 1153 @end example
Chris@42 1154
Chris@42 1155 Note that this header file includes the serial-FFTW @code{fftw3.h}
Chris@42 1156 header file, and also the @code{mpi.h} header file for MPI, so you
Chris@42 1157 need not include those files separately.
Chris@42 1158
Chris@42 1159 You must also link to @emph{both} the FFTW MPI library and to the
Chris@42 1160 serial FFTW library. On Unix, this means adding @code{-lfftw3_mpi
Chris@42 1161 -lfftw3 -lm} at the end of the link command.
Chris@42 1162
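For example, a program might be compiled and linked with a command like
the following (the @code{mpicc} compiler wrapper here is only an
illustration; the details depend on your MPI installation):

@example
mpicc myprogram.c -o myprogram -lfftw3_mpi -lfftw3 -lm
@end example
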
Chris@42 1163 @cindex precision
Chris@42 1164 Different precisions are handled as in the serial interface:
Chris@42 1165 @xref{Precision}. That is, @samp{fftw_} functions become
Chris@42 1166 @code{fftwf_} (in single precision) etcetera, and the libraries become
Chris@42 1167 @code{-lfftw3f_mpi -lfftw3f -lm} etcetera on Unix. Long-double
Chris@42 1168 precision is supported in MPI, but quad precision (@samp{fftwq_}) is
Chris@42 1169 not due to the lack of MPI support for this type.
Chris@42 1170
Chris@42 1171 @node MPI Initialization, Using MPI Plans, MPI Files and Data Types, FFTW MPI Reference
Chris@42 1172 @subsection MPI Initialization
Chris@42 1173
Chris@42 1174 Before calling any other FFTW MPI (@samp{fftw_mpi_}) function, and
Chris@42 1175 before importing any wisdom for MPI problems, you must call:
Chris@42 1176
Chris@42 1177 @findex fftw_mpi_init
Chris@42 1178 @example
Chris@42 1179 void fftw_mpi_init(void);
Chris@42 1180 @end example
Chris@42 1181
Chris@42 1182 @findex fftw_init_threads
Chris@42 1183 If FFTW threads support is used, however, @code{fftw_mpi_init} should
Chris@42 1184 be called @emph{after} @code{fftw_init_threads} (@pxref{Combining MPI
Chris@42 1185 and Threads}). Calling @code{fftw_mpi_init} additional times (before
Chris@42 1186 @code{fftw_mpi_cleanup}) has no effect.
Chris@42 1187
Chris@42 1188
Chris@42 1189 If you want to deallocate all persistent data and reset FFTW to the
Chris@42 1190 pristine state it was in when you started your program, you can call:
Chris@42 1191
Chris@42 1192 @findex fftw_mpi_cleanup
Chris@42 1193 @example
Chris@42 1194 void fftw_mpi_cleanup(void);
Chris@42 1195 @end example
Chris@42 1196
Chris@42 1197 @findex fftw_cleanup
Chris@42 1198 (This calls @code{fftw_cleanup}, so you need not call the serial
Chris@42 1199 cleanup routine too, although it is safe to do so.) After calling
Chris@42 1200 @code{fftw_mpi_cleanup}, all existing plans become undefined, and you
Chris@42 1201 should not attempt to execute or destroy them. You must call
Chris@42 1202 @code{fftw_mpi_init} again after @code{fftw_mpi_cleanup} if you want
Chris@42 1203 to resume using the MPI FFTW routines.
Chris@42 1204
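A minimal sketch of this lifecycle, inside @code{main} and without
threads (error checking omitted), is:

@example
MPI_Init(&argc, &argv);
fftw_mpi_init();

/* ... create, execute, and destroy MPI plans ... */

fftw_mpi_cleanup(); /* also calls fftw_cleanup */
MPI_Finalize();
@end example
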
Chris@42 1205 @node Using MPI Plans, MPI Data Distribution Functions, MPI Initialization, FFTW MPI Reference
Chris@42 1206 @subsection Using MPI Plans
Chris@42 1207
Chris@42 1208 Once an MPI plan is created, you can execute and destroy it using
Chris@42 1209 @code{fftw_execute}, @code{fftw_destroy_plan}, and the other functions
Chris@42 1210 in the serial interface that operate on generic plans (@pxref{Using
Chris@42 1211 Plans}).
Chris@42 1212
Chris@42 1213 @cindex collective function
Chris@42 1214 @cindex MPI communicator
Chris@42 1215 The @code{fftw_execute} and @code{fftw_destroy_plan} functions, applied to
Chris@42 1216 MPI plans, are @emph{collective} calls: they must be called for all processes
Chris@42 1217 in the communicator that was used to create the plan.
Chris@42 1218
Chris@42 1219 @cindex new-array execution
Chris@42 1220 You must @emph{not} use the serial new-array plan-execution functions
Chris@42 1221 @code{fftw_execute_dft} and so on (@pxref{New-array Execute
Chris@42 1222 Functions}) with MPI plans. Such functions are specialized to the
Chris@42 1223 problem type, and there are specific new-array execute functions for MPI plans:
Chris@42 1224
Chris@42 1225 @findex fftw_mpi_execute_dft
Chris@42 1226 @findex fftw_mpi_execute_dft_r2c
Chris@42 1227 @findex fftw_mpi_execute_dft_c2r
Chris@42 1228 @findex fftw_mpi_execute_r2r
Chris@42 1229 @example
Chris@42 1230 void fftw_mpi_execute_dft(fftw_plan p, fftw_complex *in, fftw_complex *out);
Chris@42 1231 void fftw_mpi_execute_dft_r2c(fftw_plan p, double *in, fftw_complex *out);
Chris@42 1232 void fftw_mpi_execute_dft_c2r(fftw_plan p, fftw_complex *in, double *out);
Chris@42 1233 void fftw_mpi_execute_r2r(fftw_plan p, double *in, double *out);
Chris@42 1234 @end example
Chris@42 1235
Chris@42 1236 @cindex alignment
Chris@42 1237 @findex fftw_malloc
Chris@42 1238 These functions have the same restrictions as those of the serial
Chris@42 1239 new-array execute functions. They are @emph{always} safe to apply to
Chris@42 1240 the @emph{same} @code{in} and @code{out} arrays that were used to
Chris@42 1241 create the plan. They can only be applied to new arrays if those
Chris@42 1242 arrays have the same types, dimensions, in-placeness, and alignment as
Chris@42 1243 the original arrays, where the best way to ensure the same alignment
Chris@42 1244 is to use FFTW's @code{fftw_malloc} and related allocation functions
Chris@42 1245 for all arrays (@pxref{Memory Allocation}). Note that distributed
Chris@42 1246 transposes (@pxref{FFTW MPI Transposes}) use
Chris@42 1247 @code{fftw_mpi_execute_r2r}, since they count as rank-zero r2r plans
Chris@42 1248 from FFTW's perspective.
Chris@42 1249
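For example, assuming a plan @code{p} created for out-of-place arrays
@code{in} and @code{out} of @code{alloc_local} complex numbers each
(the names here are placeholders), the same plan can be re-applied to a
second, identically allocated pair of arrays:

@example
fftw_complex *in2 = fftw_alloc_complex(alloc_local);
fftw_complex *out2 = fftw_alloc_complex(alloc_local);
/* ... fill in2 with new data ... */
fftw_mpi_execute_dft(p, in2, out2);
@end example
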
Chris@42 1250 @node MPI Data Distribution Functions, MPI Plan Creation, Using MPI Plans, FFTW MPI Reference
Chris@42 1251 @subsection MPI Data Distribution Functions
Chris@42 1252
Chris@42 1253 @cindex data distribution
Chris@42 1254 As described above (@pxref{MPI Data Distribution}), in order to
Chris@42 1255 allocate your arrays, @emph{before} creating a plan, you must first
Chris@42 1256 call one of the following routines to determine the required
Chris@42 1257 allocation size and the portion of the array locally stored on a given
Chris@42 1258 process. The @code{MPI_Comm} communicator passed here must be
Chris@42 1259 equivalent to the communicator used below for plan creation.
Chris@42 1260
Chris@42 1261 The basic interface for multidimensional transforms consists of the
Chris@42 1262 functions:
Chris@42 1263
Chris@42 1264 @findex fftw_mpi_local_size_2d
Chris@42 1265 @findex fftw_mpi_local_size_3d
Chris@42 1266 @findex fftw_mpi_local_size
Chris@42 1267 @findex fftw_mpi_local_size_2d_transposed
Chris@42 1268 @findex fftw_mpi_local_size_3d_transposed
Chris@42 1269 @findex fftw_mpi_local_size_transposed
Chris@42 1270 @example
Chris@42 1271 ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
Chris@42 1272 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
Chris@42 1273 ptrdiff_t fftw_mpi_local_size_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
Chris@42 1274 MPI_Comm comm,
Chris@42 1275 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
Chris@42 1276 ptrdiff_t fftw_mpi_local_size(int rnk, const ptrdiff_t *n, MPI_Comm comm,
Chris@42 1277 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
Chris@42 1278
Chris@42 1279 ptrdiff_t fftw_mpi_local_size_2d_transposed(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
Chris@42 1280 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
Chris@42 1281 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
Chris@42 1282 ptrdiff_t fftw_mpi_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
Chris@42 1283 MPI_Comm comm,
Chris@42 1284 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
Chris@42 1285 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
Chris@42 1286 ptrdiff_t fftw_mpi_local_size_transposed(int rnk, const ptrdiff_t *n, MPI_Comm comm,
Chris@42 1287 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
Chris@42 1288 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
Chris@42 1289 @end example
Chris@42 1290
Chris@42 1291 These functions return the number of elements to allocate (complex
Chris@42 1292 numbers for DFT/r2c/c2r plans, real numbers for r2r plans), whereas
Chris@42 1293 the @code{local_n0} and @code{local_0_start} arguments return the portion
Chris@42 1294 (@code{local_0_start} to @code{local_0_start + local_n0 - 1}) of the
Chris@42 1295 first dimension of an @ndims{} array that is stored on the local
Chris@42 1296 process. @xref{Basic and advanced distribution interfaces}. For
Chris@42 1297 @code{FFTW_MPI_TRANSPOSED_OUT} plans, the @samp{_transposed} variants
Chris@42 1298 are useful in order to also return the local portion of the first
Chris@42 1299 dimension in the @ndimstrans{} transposed output.
Chris@42 1300 @xref{Transposed distributions}.
Chris@42 1301 The advanced interface for multidimensional transforms is:
Chris@42 1302
Chris@42 1303 @cindex advanced interface
Chris@42 1304 @findex fftw_mpi_local_size_many
Chris@42 1305 @findex fftw_mpi_local_size_many_transposed
Chris@42 1306 @example
Chris@42 1307 ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
Chris@42 1308 ptrdiff_t block0, MPI_Comm comm,
Chris@42 1309 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
Chris@42 1310 ptrdiff_t fftw_mpi_local_size_many_transposed(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
Chris@42 1311 ptrdiff_t block0, ptrdiff_t block1, MPI_Comm comm,
Chris@42 1312 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
Chris@42 1313 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
Chris@42 1314 @end example
Chris@42 1315
Chris@42 1316 These differ from the basic interface in only two ways. First, they
Chris@42 1317 allow you to specify block sizes @code{block0} and @code{block1} (the
Chris@42 1318 latter for the transposed output); you can pass
Chris@42 1319 @code{FFTW_MPI_DEFAULT_BLOCK} to use FFTW's default block size as in
Chris@42 1320 the basic interface. Second, you can pass a @code{howmany} parameter,
Chris@42 1321 corresponding to the advanced planning interface below: this is for
Chris@42 1322 transforms of contiguous @code{howmany}-tuples of numbers
Chris@42 1323 (@code{howmany = 1} in the basic interface).
Chris@42 1324
Chris@42 1325 The corresponding basic and advanced routines for one-dimensional
Chris@42 1326 transforms (currently only complex DFTs) are:
Chris@42 1327
Chris@42 1328 @findex fftw_mpi_local_size_1d
Chris@42 1329 @findex fftw_mpi_local_size_many_1d
Chris@42 1330 @example
Chris@42 1331 ptrdiff_t fftw_mpi_local_size_1d(
Chris@42 1332 ptrdiff_t n0, MPI_Comm comm, int sign, unsigned flags,
Chris@42 1333 ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
Chris@42 1334 ptrdiff_t *local_no, ptrdiff_t *local_o_start);
Chris@42 1335 ptrdiff_t fftw_mpi_local_size_many_1d(
Chris@42 1336 ptrdiff_t n0, ptrdiff_t howmany,
Chris@42 1337 MPI_Comm comm, int sign, unsigned flags,
Chris@42 1338 ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
Chris@42 1339 ptrdiff_t *local_no, ptrdiff_t *local_o_start);
Chris@42 1340 @end example
Chris@42 1341
Chris@42 1342 @ctindex FFTW_MPI_SCRAMBLED_OUT
Chris@42 1343 @ctindex FFTW_MPI_SCRAMBLED_IN
Chris@42 1344 As above, the return value is the number of elements to allocate
Chris@42 1345 (complex numbers, for complex DFTs). The @code{local_ni} and
Chris@42 1346 @code{local_i_start} arguments return the portion
Chris@42 1347 (@code{local_i_start} to @code{local_i_start + local_ni - 1}) of the
Chris@42 1348 1d array that is stored on this process for the transform
Chris@42 1349 @emph{input}, and @code{local_no} and @code{local_o_start} are the
Chris@42 1350 corresponding quantities for the @emph{output}. The @code{sign}
Chris@42 1351 (@code{FFTW_FORWARD} or @code{FFTW_BACKWARD}) and @code{flags} must
Chris@42 1352 match the arguments passed when creating a plan. Although the inputs
Chris@42 1353 and outputs have different data distributions in general, it is
Chris@42 1354 guaranteed that the @emph{output} data distribution of an
Chris@42 1355 @code{FFTW_FORWARD} plan will match the @emph{input} data distribution
Chris@42 1356 of an @code{FFTW_BACKWARD} plan and vice versa; similarly for the
Chris@42 1357 @code{FFTW_MPI_SCRAMBLED_OUT} and @code{FFTW_MPI_SCRAMBLED_IN} flags.
Chris@42 1358 @xref{One-dimensional distributions}.
Chris@42 1359
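For example, a forward 1d transform of a placeholder composite size
@code{N0} might be set up as in this sketch; note that the same
@code{sign} and @code{flags} are passed to both the @samp{local_size}
call and the planner:

@example
ptrdiff_t local_ni, local_i_start, local_no, local_o_start;
ptrdiff_t alloc_local =
     fftw_mpi_local_size_1d(N0, MPI_COMM_WORLD, FFTW_FORWARD, FFTW_MEASURE,
                            &local_ni, &local_i_start,
                            &local_no, &local_o_start);
fftw_complex *in = fftw_alloc_complex(alloc_local);
fftw_complex *out = fftw_alloc_complex(alloc_local);
fftw_plan plan = fftw_mpi_plan_dft_1d(N0, in, out, MPI_COMM_WORLD,
                                      FFTW_FORWARD, FFTW_MEASURE);
@end example
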
Chris@42 1360 @node MPI Plan Creation, MPI Wisdom Communication, MPI Data Distribution Functions, FFTW MPI Reference
Chris@42 1361 @subsection MPI Plan Creation
Chris@42 1362
Chris@42 1363 @subsubheading Complex-data MPI DFTs
Chris@42 1364
Chris@42 1365 Plans for complex-data DFTs (@pxref{2d MPI example}) are created by:
Chris@42 1366
Chris@42 1367 @findex fftw_mpi_plan_dft_1d
Chris@42 1368 @findex fftw_mpi_plan_dft_2d
Chris@42 1369 @findex fftw_mpi_plan_dft_3d
Chris@42 1370 @findex fftw_mpi_plan_dft
Chris@42 1371 @findex fftw_mpi_plan_many_dft
Chris@42 1372 @example
Chris@42 1373 fftw_plan fftw_mpi_plan_dft_1d(ptrdiff_t n0, fftw_complex *in, fftw_complex *out,
Chris@42 1374 MPI_Comm comm, int sign, unsigned flags);
Chris@42 1375 fftw_plan fftw_mpi_plan_dft_2d(ptrdiff_t n0, ptrdiff_t n1,
Chris@42 1376 fftw_complex *in, fftw_complex *out,
Chris@42 1377 MPI_Comm comm, int sign, unsigned flags);
Chris@42 1378 fftw_plan fftw_mpi_plan_dft_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
Chris@42 1379 fftw_complex *in, fftw_complex *out,
Chris@42 1380 MPI_Comm comm, int sign, unsigned flags);
Chris@42 1381 fftw_plan fftw_mpi_plan_dft(int rnk, const ptrdiff_t *n,
Chris@42 1382 fftw_complex *in, fftw_complex *out,
Chris@42 1383 MPI_Comm comm, int sign, unsigned flags);
Chris@42 1384 fftw_plan fftw_mpi_plan_many_dft(int rnk, const ptrdiff_t *n,
Chris@42 1385 ptrdiff_t howmany, ptrdiff_t block, ptrdiff_t tblock,
Chris@42 1386 fftw_complex *in, fftw_complex *out,
Chris@42 1387 MPI_Comm comm, int sign, unsigned flags);
Chris@42 1388 @end example
Chris@42 1389
Chris@42 1390 @cindex MPI communicator
Chris@42 1391 @cindex collective function
Chris@42 1392 These are similar to their serial counterparts (@pxref{Complex DFTs})
Chris@42 1393 in specifying the dimensions, sign, and flags of the transform. The
Chris@42 1394 @code{comm} argument gives an MPI communicator that specifies the set
Chris@42 1395 of processes to participate in the transform; plan creation is a
Chris@42 1396 collective function that must be called for all processes in the
Chris@42 1397 communicator. The @code{in} and @code{out} pointers refer only to a
Chris@42 1398 portion of the overall transform data (@pxref{MPI Data Distribution})
Chris@42 1399 as specified by the @samp{local_size} functions in the previous
Chris@42 1400 section. Unless @code{flags} contains @code{FFTW_ESTIMATE}, these
Chris@42 1401 arrays are overwritten during plan creation as for the serial
Chris@42 1402 interface. For multi-dimensional transforms, any dimensions @code{>
Chris@42 1403 1} are supported; for one-dimensional transforms, only composite
Chris@42 1404 (non-prime) @code{n0} are currently supported (unlike the serial
Chris@42 1405 FFTW). Requesting an unsupported transform size will yield a
Chris@42 1406 @code{NULL} plan. (As in the serial interface, highly composite sizes
Chris@42 1407 generally yield the best performance.)
Chris@42 1408
Chris@42 1409 @cindex advanced interface
Chris@42 1410 @ctindex FFTW_MPI_DEFAULT_BLOCK
Chris@42 1411 @cindex stride
Chris@42 1412 The advanced-interface @code{fftw_mpi_plan_many_dft} additionally
Chris@42 1413 allows you to specify the block sizes for the first dimension
Chris@42 1414 (@code{block}) of the @ndims{} input data and the first dimension
Chris@42 1415 (@code{tblock}) of the @ndimstrans{} transposed data (at intermediate
Chris@42 1416 steps of the transform, and for the output if
Chris@42 1417 @code{FFTW_MPI_TRANSPOSED_OUT} is specified in @code{flags}). These must
Chris@42 1418 be the same block sizes as were passed to the corresponding
Chris@42 1419 @samp{local_size} function; you can pass @code{FFTW_MPI_DEFAULT_BLOCK}
Chris@42 1420 to use FFTW's default block size as in the basic interface. Also, the
Chris@42 1421 @code{howmany} parameter specifies that the transform is of contiguous
Chris@42 1422 @code{howmany}-tuples rather than individual complex numbers; this
Chris@42 1423 corresponds to the same parameter in the serial advanced interface
Chris@42 1424 (@pxref{Advanced Complex DFTs}) with @code{stride = howmany} and
Chris@42 1425 @code{dist = 1}.
Chris@42 1426
Chris@42 1427 @subsubheading MPI flags
Chris@42 1428
Chris@42 1429 The @code{flags} can be any of those for the serial FFTW
Chris@42 1430 (@pxref{Planner Flags}), and in addition may include one or more of
Chris@42 1431 the following MPI-specific flags, which improve performance at the
Chris@42 1432 cost of changing the output or input data formats.
Chris@42 1433
Chris@42 1434 @itemize @bullet
Chris@42 1435
Chris@42 1436 @item
Chris@42 1437 @ctindex FFTW_MPI_SCRAMBLED_OUT
Chris@42 1438 @ctindex FFTW_MPI_SCRAMBLED_IN
Chris@42 1439 @code{FFTW_MPI_SCRAMBLED_OUT}, @code{FFTW_MPI_SCRAMBLED_IN}: valid for
Chris@42 1440 1d transforms only, these flags indicate that the output/input of the
Chris@42 1441 transform are in an undocumented ``scrambled'' order. A forward
Chris@42 1442 @code{FFTW_MPI_SCRAMBLED_OUT} transform can be inverted by a backward
Chris@42 1443 @code{FFTW_MPI_SCRAMBLED_IN} (times the usual 1/@i{N} normalization).
Chris@42 1444 @xref{One-dimensional distributions}.
Chris@42 1445
Chris@42 1446 @item
Chris@42 1447 @ctindex FFTW_MPI_TRANSPOSED_OUT
Chris@42 1448 @ctindex FFTW_MPI_TRANSPOSED_IN
Chris@42 1449 @code{FFTW_MPI_TRANSPOSED_OUT}, @code{FFTW_MPI_TRANSPOSED_IN}: valid
Chris@42 1450 for multidimensional (@code{rnk > 1}) transforms only, these flags
Chris@42 1451 specify that the output or input of an @ndims{} transform is
Chris@42 1452 transposed to @ndimstrans{}. @xref{Transposed distributions}.
Chris@42 1453
Chris@42 1454 @end itemize
Chris@42 1455
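These MPI-specific flags are combined with the ordinary planner flags
by a bitwise OR. For example (a sketch only; the size @code{N0} and the
arrays @code{in} and @code{out} are assumed to have been set up via the
usual @samp{local_size} and allocation calls), a scrambled-order 1d
round trip pairs the two flags like this:

@example
fftw_plan fwd = fftw_mpi_plan_dft_1d(N0, in, out, MPI_COMM_WORLD,
                                     FFTW_FORWARD,
                                     FFTW_MEASURE | FFTW_MPI_SCRAMBLED_OUT);
fftw_plan bwd = fftw_mpi_plan_dft_1d(N0, out, in, MPI_COMM_WORLD,
                                     FFTW_BACKWARD,
                                     FFTW_MEASURE | FFTW_MPI_SCRAMBLED_IN);
@end example
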
Chris@42 1456 @subsubheading Real-data MPI DFTs
Chris@42 1457
Chris@42 1458 @cindex r2c
Chris@42 1459 Plans for real-input/output (r2c/c2r) DFTs (@pxref{Multi-dimensional
Chris@42 1460 MPI DFTs of Real Data}) are created by:
Chris@42 1461
Chris@42 1462 @findex fftw_mpi_plan_dft_r2c_2d
Chris@42 1464 @findex fftw_mpi_plan_dft_r2c_3d
Chris@42 1465 @findex fftw_mpi_plan_dft_r2c
Chris@42 1466 @findex fftw_mpi_plan_dft_c2r_2d
Chris@42 1468 @findex fftw_mpi_plan_dft_c2r_3d
Chris@42 1469 @findex fftw_mpi_plan_dft_c2r
Chris@42 1470 @example
Chris@42 1471 fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1,
Chris@42 1472 double *in, fftw_complex *out,
Chris@42 1473 MPI_Comm comm, unsigned flags);
Chris@42 1477 fftw_plan fftw_mpi_plan_dft_r2c_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
Chris@42 1478 double *in, fftw_complex *out,
Chris@42 1479 MPI_Comm comm, unsigned flags);
Chris@42 1480 fftw_plan fftw_mpi_plan_dft_r2c(int rnk, const ptrdiff_t *n,
Chris@42 1481 double *in, fftw_complex *out,
Chris@42 1482 MPI_Comm comm, unsigned flags);
Chris@42 1483 fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1,
Chris@42 1484 fftw_complex *in, double *out,
Chris@42 1485 MPI_Comm comm, unsigned flags);
Chris@42 1489 fftw_plan fftw_mpi_plan_dft_c2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
Chris@42 1490 fftw_complex *in, double *out,
Chris@42 1491 MPI_Comm comm, unsigned flags);
Chris@42 1492 fftw_plan fftw_mpi_plan_dft_c2r(int rnk, const ptrdiff_t *n,
Chris@42 1493 fftw_complex *in, double *out,
Chris@42 1494 MPI_Comm comm, unsigned flags);
Chris@42 1495 @end example
Chris@42 1496
Chris@42 1497 Similar to the serial interface (@pxref{Real-data DFTs}), these
Chris@42 1498 transform logically @ndims{} real data to/from @ndimshalf{} complex
Chris@42 1499 data, representing the non-redundant half of the conjugate-symmetry
Chris@42 1500 output of a real-input DFT (@pxref{Multi-dimensional Transforms}).
Chris@42 1501 However, the real array must be stored within a padded @ndimspad{}
Chris@42 1502 array (much like the in-place serial r2c transforms, but here for
Chris@42 1503 out-of-place transforms as well). Currently, only multi-dimensional
Chris@42 1504 (@code{rnk > 1}) r2c/c2r transforms are supported (requesting a plan
Chris@42 1505 for @code{rnk = 1} will yield @code{NULL}). As explained above
Chris@42 1506 (@pxref{Multi-dimensional MPI DFTs of Real Data}), the data
Chris@42 1507 distribution of both the real and complex arrays is given by the
Chris@42 1508 @samp{local_size} function called for the dimensions of the
Chris@42 1509 @emph{complex} array. Similar to the other planning functions, the
Chris@42 1510 input and output arrays are overwritten when the plan is created
Chris@42 1511 except in @code{FFTW_ESTIMATE} mode.
Chris@42 1512
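For example (a sketch with placeholder sizes @code{n0} and @code{n1}),
an out-of-place 2d r2c transform obtains its data distribution from the
@emph{complex} dimensions and pads the real allocation accordingly:

@example
ptrdiff_t local_n0, local_0_start;
ptrdiff_t alloc_local =
     fftw_mpi_local_size_2d(n0, n1/2+1, MPI_COMM_WORLD,
                            &local_n0, &local_0_start);
double *rin = fftw_alloc_real(2 * alloc_local);
fftw_complex *cout = fftw_alloc_complex(alloc_local);
fftw_plan plan = fftw_mpi_plan_dft_r2c_2d(n0, n1, rin, cout,
                                          MPI_COMM_WORLD, FFTW_MEASURE);
@end example
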
Chris@42 1513 As for the complex DFTs above, there is an advanced interface that
Chris@42 1514 allows you to manually specify block sizes and to transform contiguous
Chris@42 1515 @code{howmany}-tuples of real/complex numbers:
Chris@42 1516
Chris@42 1517 @findex fftw_mpi_plan_many_dft_r2c
Chris@42 1518 @findex fftw_mpi_plan_many_dft_c2r
Chris@42 1519 @example
Chris@42 1520 fftw_plan fftw_mpi_plan_many_dft_r2c
Chris@42 1521 (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
Chris@42 1522 ptrdiff_t iblock, ptrdiff_t oblock,
Chris@42 1523 double *in, fftw_complex *out,
Chris@42 1524 MPI_Comm comm, unsigned flags);
Chris@42 1525 fftw_plan fftw_mpi_plan_many_dft_c2r
Chris@42 1526 (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
Chris@42 1527 ptrdiff_t iblock, ptrdiff_t oblock,
Chris@42 1528 fftw_complex *in, double *out,
Chris@42 1529 MPI_Comm comm, unsigned flags);
Chris@42 1530 @end example
Chris@42 1531
Chris@42 1532 @subsubheading MPI r2r transforms
Chris@42 1533
Chris@42 1534 @cindex r2r
Chris@42 1535 There are corresponding plan-creation routines for r2r
Chris@42 1536 transforms (@pxref{More DFTs of Real Data}), currently supporting
Chris@42 1537 multidimensional (@code{rnk > 1}) transforms only (@code{rnk = 1} will
Chris@42 1538 yield a @code{NULL} plan):
Chris@42 1539
Chris@42 1540 @example
Chris@42 1541 fftw_plan fftw_mpi_plan_r2r_2d(ptrdiff_t n0, ptrdiff_t n1,
Chris@42 1542 double *in, double *out,
Chris@42 1543 MPI_Comm comm,
Chris@42 1544 fftw_r2r_kind kind0, fftw_r2r_kind kind1,
Chris@42 1545 unsigned flags);
Chris@42 1546 fftw_plan fftw_mpi_plan_r2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
Chris@42 1547 double *in, double *out,
Chris@42 1548 MPI_Comm comm,
Chris@42 1549 fftw_r2r_kind kind0, fftw_r2r_kind kind1, fftw_r2r_kind kind2,
Chris@42 1550 unsigned flags);
Chris@42 1551 fftw_plan fftw_mpi_plan_r2r(int rnk, const ptrdiff_t *n,
Chris@42 1552 double *in, double *out,
Chris@42 1553 MPI_Comm comm, const fftw_r2r_kind *kind,
Chris@42 1554 unsigned flags);
Chris@42 1555 fftw_plan fftw_mpi_plan_many_r2r(int rnk, const ptrdiff_t *n,
Chris@42 1556 ptrdiff_t iblock, ptrdiff_t oblock,
Chris@42 1557 double *in, double *out,
Chris@42 1558 MPI_Comm comm, const fftw_r2r_kind *kind,
Chris@42 1559 unsigned flags);
Chris@42 1560 @end example
Chris@42 1561
Chris@42 1562 The parameters are much the same as for the complex DFTs above, except
Chris@42 1563 that the arrays are of real numbers (and hence the outputs of the
Chris@42 1564 @samp{local_size} data-distribution functions should be interpreted as
Chris@42 1565 counts of real rather than complex numbers). Also, the @code{kind}
Chris@42 1566 parameters specify the r2r kinds along each dimension as for the
Chris@42 1567 serial interface (@pxref{Real-to-Real Transform Kinds}). @xref{Other
Chris@42 1568 Multi-dimensional Real-data MPI Transforms}.
Chris@42 1569
Chris@42 1570 @subsubheading MPI transposition
Chris@42 1571 @cindex transpose
Chris@42 1572
Chris@42 1573 FFTW also provides routines to plan a transpose of a distributed
Chris@42 1574 @code{n0} by @code{n1} array of real numbers, or an array of
Chris@42 1575 @code{howmany}-tuples of real numbers with specified block sizes
Chris@42 1576 (@pxref{FFTW MPI Transposes}):
Chris@42 1577
Chris@42 1578 @findex fftw_mpi_plan_transpose
Chris@42 1579 @findex fftw_mpi_plan_many_transpose
Chris@42 1580 @example
Chris@42 1581 fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
Chris@42 1582 double *in, double *out,
Chris@42 1583 MPI_Comm comm, unsigned flags);
Chris@42 1584 fftw_plan fftw_mpi_plan_many_transpose
Chris@42 1585 (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
Chris@42 1586 ptrdiff_t block0, ptrdiff_t block1,
Chris@42 1587 double *in, double *out, MPI_Comm comm, unsigned flags);
Chris@42 1588 @end example
Chris@42 1589
Chris@42 1590 @cindex new-array execution
Chris@42 1591 @findex fftw_mpi_execute_r2r
Chris@42 1592 These plans are used with the @code{fftw_mpi_execute_r2r} new-array
Chris@42 1593 execute function (@pxref{Using MPI Plans}), since they count as (rank
Chris@42 1594 zero) r2r plans from FFTW's perspective.
Chris@42 1595
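For example, here is a sketch of planning and executing a global
transpose of an @code{n0} by @code{n1} array of real numbers (the sizes
are placeholders):

@example
ptrdiff_t local_n0, local_0_start, local_n1, local_1_start;
ptrdiff_t alloc_local =
     fftw_mpi_local_size_2d_transposed(n0, n1, MPI_COMM_WORLD,
                                       &local_n0, &local_0_start,
                                       &local_n1, &local_1_start);
double *in = fftw_alloc_real(alloc_local);
double *out = fftw_alloc_real(alloc_local);
fftw_plan plan = fftw_mpi_plan_transpose(n0, n1, in, out,
                                         MPI_COMM_WORLD, FFTW_MEASURE);
/* ... fill the local local_n0 x n1 block of in ... */
fftw_execute(plan); /* out holds a local_n1 x n0 block of the transpose */
@end example
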
Chris@42 1596 @node MPI Wisdom Communication, , MPI Plan Creation, FFTW MPI Reference
Chris@42 1597 @subsection MPI Wisdom Communication
Chris@42 1598
Chris@42 1599 To facilitate synchronizing wisdom among the different MPI processes,
Chris@42 1600 we provide two functions:
Chris@42 1601
Chris@42 1602 @findex fftw_mpi_gather_wisdom
Chris@42 1603 @findex fftw_mpi_broadcast_wisdom
Chris@42 1604 @example
Chris@42 1605 void fftw_mpi_gather_wisdom(MPI_Comm comm);
Chris@42 1606 void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
Chris@42 1607 @end example
Chris@42 1608
Chris@42 1609 The @code{fftw_mpi_gather_wisdom} function gathers all wisdom in the
Chris@42 1610 given communicator @code{comm} to the process of rank 0 in the
Chris@42 1611 communicator: that process obtains the union of all wisdom on all the
Chris@42 1612 processes. As a side effect, some other processes will gain
Chris@42 1613 additional wisdom from other processes, but only process 0 will gain
Chris@42 1614 the complete union.
Chris@42 1615
Chris@42 1616 The @code{fftw_mpi_broadcast_wisdom} function does the reverse: it exports
Chris@42 1617 wisdom from process 0 in @code{comm} to all other processes in the
Chris@42 1618 communicator, replacing any wisdom they currently have.
Chris@42 1619
Chris@42 1620 @xref{FFTW MPI Wisdom}.
Chris@42 1621
Chris@42 1622 @c ------------------------------------------------------------
Chris@42 1623 @node FFTW MPI Fortran Interface, , FFTW MPI Reference, Distributed-memory FFTW with MPI
Chris@42 1624 @section FFTW MPI Fortran Interface
Chris@42 1625 @cindex Fortran interface
Chris@42 1626
Chris@42 1627 @cindex iso_c_binding
Chris@42 1628 The FFTW MPI interface is callable from modern Fortran compilers
Chris@42 1629 supporting the Fortran 2003 @code{iso_c_binding} standard for calling
Chris@42 1630 C functions. As described in @ref{Calling FFTW from Modern Fortran},
Chris@42 1631 this means that you can directly call FFTW's C interface from Fortran
Chris@42 1632 with only minor changes in syntax. There are, however, a few things
Chris@42 1633 specific to the MPI interface to keep in mind:
Chris@42 1634
Chris@42 1635 @itemize @bullet
Chris@42 1636
Chris@42 1637 @item
Chris@42 1638 Instead of including @code{fftw3.f03} as in @ref{Overview of Fortran
Chris@42 1639 interface}, you should @code{include 'fftw3-mpi.f03'} (after
Chris@42 1640 @code{use, intrinsic :: iso_c_binding} as before). The
Chris@42 1641 @code{fftw3-mpi.f03} file includes @code{fftw3.f03}, so you should
Chris@42 1642 @emph{not} @code{include} them both yourself. (You will also want to
Chris@42 1643 include the MPI header file, usually via @code{include 'mpif.h'} or
Chris@42 1644 similar, although this is not needed by @code{fftw3-mpi.f03}
Chris@42 1645 @i{per se}.) (To use the @samp{fftwl_} @code{long double} extended-precision routines in supporting compilers, you should include @code{fftw3l-mpi.f03} in @emph{addition} to @code{fftw3-mpi.f03}. @xref{Extended and quadruple precision in Fortran}.)
Chris@42 1646
Chris@42 1647 @item
Chris@42 1648 Because of the different storage conventions between C and Fortran,
Chris@42 1649 you reverse the order of your array dimensions when passing them to
Chris@42 1650 FFTW (@pxref{Reversing array dimensions}). This is merely a
Chris@42 1651 difference in notation and incurs no performance overhead. However,
Chris@42 1652 it means that, whereas in C the @emph{first} dimension is distributed,
Chris@42 1653 in Fortran the @emph{last} dimension of your array is distributed.
Chris@42 1654
Chris@42 1655 @item
Chris@42 1656 @cindex MPI communicator
Chris@42 1657 In Fortran, communicators are stored as @code{integer} types; there is
Chris@42 1658 no @code{MPI_Comm} type, nor is there any way to access a C
Chris@42 1659 @code{MPI_Comm}. Fortunately, this is taken care of for you by the
Chris@42 1660 FFTW Fortran interface: whenever the C interface expects an
Chris@42 1661 @code{MPI_Comm} type, you should pass the Fortran communicator as an
Chris@42 1662 @code{integer}.@footnote{Technically, this is because you aren't
Chris@42 1663 actually calling the C functions directly. You are calling wrapper
Chris@42 1664 functions that translate the communicator with @code{MPI_Comm_f2c}
Chris@42 1665 before calling the ordinary C interface. This is all done
Chris@42 1666 transparently, however, since the @code{fftw3-mpi.f03} interface file
Chris@42 1667 renames the wrappers so that they are called in Fortran with the same
Chris@42 1668 names as the C interface functions.}
Chris@42 1669
Chris@42 1670 @item
Chris@42 1671 Because you need to call the @samp{local_size} function to find out
Chris@42 1672 how much space to allocate, and this may be @emph{larger} than the
Chris@42 1673 local portion of the array (@pxref{MPI Data Distribution}), you should
Chris@42 1674 @emph{always} allocate your arrays dynamically using FFTW's allocation
Chris@42 1675 routines as described in @ref{Allocating aligned memory in Fortran}.
Chris@42 1676 (Coincidentally, this also provides the best performance by
Chris@42 1677 guaranteeing proper data alignment.)
Chris@42 1678
Chris@42 1679 @item
Chris@42 1680 Because all sizes in the MPI FFTW interface are declared as
Chris@42 1681 @code{ptrdiff_t} in C, you should use @code{integer(C_INTPTR_T)} in
Chris@42 1682 Fortran (@pxref{FFTW Fortran type reference}).
Chris@42 1683
Chris@42 1684 @item
Chris@42 1685 @findex fftw_execute_dft
Chris@42 1686 @findex fftw_mpi_execute_dft
Chris@42 1687 @cindex new-array execution
Chris@42 1688 In Fortran, because of the language semantics, we generally recommend
Chris@42 1689 using the new-array execute functions for all plans, even in the
Chris@42 1690 common case where you are executing the plan on the same arrays for
Chris@42 1691 which the plan was created (@pxref{Plan execution in Fortran}).
Chris@42 1692 However, note that in the MPI interface these functions are changed:
Chris@42 1693 @code{fftw_execute_dft} becomes @code{fftw_mpi_execute_dft},
Chris@42 1694 etcetera. @xref{Using MPI Plans}.
Chris@42 1695
Chris@42 1696 @end itemize
Chris@42 1697
Chris@42 1698 For example, here is a Fortran code snippet to perform a distributed
Chris@42 1699 @twodims{L,M} complex DFT in-place. (This assumes you have already
Chris@42 1700 initialized MPI with @code{MPI_init} and have also performed
Chris@42 1701 @code{call fftw_mpi_init}.)
Chris@42 1702
Chris@42 1703 @example
Chris@42 1704 use, intrinsic :: iso_c_binding
Chris@42 1705 include 'fftw3-mpi.f03'
Chris@42 1706 integer(C_INTPTR_T), parameter :: L = ...
Chris@42 1707 integer(C_INTPTR_T), parameter :: M = ...
Chris@42 1708 type(C_PTR) :: plan, cdata
Chris@42 1709 complex(C_DOUBLE_COMPLEX), pointer :: data(:,:)
Chris@42 1710 integer(C_INTPTR_T) :: i, j, alloc_local, local_M, local_j_offset
Chris@42 1711
Chris@42 1712 ! @r{get local data size and allocate (note dimension reversal)}
Chris@42 1713 alloc_local = fftw_mpi_local_size_2d(M, L, MPI_COMM_WORLD, &
Chris@42 1714 local_M, local_j_offset)
Chris@42 1715 cdata = fftw_alloc_complex(alloc_local)
Chris@42 1716 call c_f_pointer(cdata, data, [L,local_M])
Chris@42 1717
Chris@42 1718 ! @r{create MPI plan for in-place forward DFT (note dimension reversal)}
Chris@42 1719 plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, &
Chris@42 1720 FFTW_FORWARD, FFTW_MEASURE)
Chris@42 1721
Chris@42 1722 ! @r{initialize data to some function} my_function(i,j)
Chris@42 1723 do j = 1, local_M
Chris@42 1724 do i = 1, L
Chris@42 1725 data(i, j) = my_function(i, j + local_j_offset)
Chris@42 1726 end do
Chris@42 1727 end do
Chris@42 1728
Chris@42 1729 ! @r{compute transform (as many times as desired)}
Chris@42 1730 call fftw_mpi_execute_dft(plan, data, data)
Chris@42 1731
Chris@42 1732 call fftw_destroy_plan(plan)
Chris@42 1733 call fftw_free(cdata)
Chris@42 1734 @end example
Chris@42 1735
Chris@42 1736 Note that we called @code{fftw_mpi_local_size_2d} and
Chris@42 1737 @code{fftw_mpi_plan_dft_2d} with the dimensions in reversed order,
Chris@42 1738 since a @twodims{L,M} Fortran array is viewed by FFTW in C as a
Chris@42 1739 @twodims{M, L} array. This means that the array was distributed over
Chris@42 1740 the @code{M} dimension, the local portion of which is a
Chris@42 1741 @twodims{L,local_M} array in Fortran. (You must @emph{not} use an
Chris@42 1742 @code{allocate} statement to allocate an @twodims{L,local_M} array,
Chris@42 1743 however; you must allocate @code{alloc_local} complex numbers, which
Chris@42 1744 may be greater than @code{L * local_M}, in order to reserve space for
Chris@42 1745 intermediate steps of the transform.) Finally, we mention that
Chris@42 1746 because C's array indices are zero-based, the @code{local_j_offset}
Chris@42 1747 argument can conveniently be interpreted as an offset in the 1-based
Chris@42 1748 @code{j} index (rather than as a starting index as in C).
Chris@42 1749
Chris@42 1750 If instead you had used the @code{ior(FFTW_MEASURE,
Chris@42 1751 FFTW_MPI_TRANSPOSED_OUT)} flag, the output of the transform would be a
Chris@42 1752 transposed @twodims{M,local_L} array, associated with the @emph{same}
Chris@42 1753 @code{cdata} allocation (since the transform is in-place), and which
Chris@42 1754 you could declare with:
Chris@42 1755
Chris@42 1756 @example
Chris@42 1757 complex(C_DOUBLE_COMPLEX), pointer :: tdata(:,:)
Chris@42 1758 ...
Chris@42 1759 call c_f_pointer(cdata, tdata, [M,local_L])
Chris@42 1760 @end example
Chris@42 1761
Chris@42 1762 where @code{local_L} would have been obtained by changing the
Chris@42 1763 @code{fftw_mpi_local_size_2d} call to:
Chris@42 1764
Chris@42 1765 @example
Chris@42 1766 alloc_local = fftw_mpi_local_size_2d_transposed(M, L, MPI_COMM_WORLD, &
Chris@42 1767 local_M, local_j_offset, local_L, local_i_offset)
Chris@42 1768 @end example