@node Distributed-memory FFTW with MPI, Calling FFTW from Modern Fortran, Multi-threaded FFTW, Top
@chapter Distributed-memory FFTW with MPI
@cindex MPI

@cindex parallel transform
In this chapter we document the parallel FFTW routines for parallel
systems supporting the MPI message-passing interface.  Unlike the
shared-memory threads described in the previous chapter, MPI allows
you to use @emph{distributed-memory} parallelism, where each CPU has
its own separate memory, and which can scale up to clusters of many
thousands of processors.  This capability comes at a price, however:
each process only stores a @emph{portion} of the data to be
transformed, which means that the data structures and programming
interface are quite different from the serial or threads versions of
FFTW.
@cindex data distribution


Distributed-memory parallelism is especially useful when you are
transforming arrays so large that they do not fit into the memory of a
single processor.  The storage per-process required by FFTW's MPI
routines is proportional to the total array size divided by the number
of processes.  Conversely, distributed-memory parallelism can easily
pose an unacceptably high communications overhead for small problems;
the threshold problem size for which parallelism becomes advantageous
will depend on the precise problem you are interested in, your
hardware, and your MPI implementation.

A note on terminology: in MPI, you divide the data among a set of
``processes'' which each run in their own memory address space.
Generally, each process runs on a different physical processor, but
this is not required.  A set of processes in MPI is described by an
opaque data structure called a ``communicator,'' the most common of
which is the predefined communicator @code{MPI_COMM_WORLD} which
refers to @emph{all} processes.  For more information on these and
other concepts common to all MPI programs, we refer the reader to the
documentation at @uref{http://www.mcs.anl.gov/research/projects/mpi/, the MPI home page}.
@cindex MPI communicator
@ctindex MPI_COMM_WORLD


We assume in this chapter that the reader is familiar with the usage
of the serial (uniprocessor) FFTW, and focus only on the concepts new
to the MPI interface.

@menu
* FFTW MPI Installation::
* Linking and Initializing MPI FFTW::
* 2d MPI example::
* MPI Data Distribution::
* Multi-dimensional MPI DFTs of Real Data::
* Other Multi-dimensional Real-data MPI Transforms::
* FFTW MPI Transposes::
* FFTW MPI Wisdom::
* Avoiding MPI Deadlocks::
* FFTW MPI Performance Tips::
* Combining MPI and Threads::
* FFTW MPI Reference::
* FFTW MPI Fortran Interface::
@end menu

@c ------------------------------------------------------------
@node FFTW MPI Installation, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI, Distributed-memory FFTW with MPI
@section FFTW MPI Installation

All of the FFTW MPI code is located in the @code{mpi} subdirectory of
the FFTW package.  On Unix systems, the FFTW MPI libraries and header
files are automatically configured, compiled, and installed along with
the uniprocessor FFTW libraries simply by including
@code{--enable-mpi} in the flags to the @code{configure} script
(@pxref{Installation on Unix}).
@fpindex configure


Any implementation of the MPI standard, version 1 or later, should
work with FFTW.  The @code{configure} script will attempt to
automatically detect how to compile and link code using your MPI
implementation.  In some cases, especially if you have multiple
different MPI implementations installed or have an unusual MPI
software package, you may need to provide this information explicitly.

Most commonly, one compiles MPI code by invoking a special compiler
command, typically @code{mpicc} for C code.  The @code{configure}
script knows the most common names for this command, but you can
specify the MPI compilation command explicitly by setting the
@code{MPICC} variable, as in @samp{./configure MPICC=mpicc ...}.
@fpindex mpicc


If, instead of a special compiler command, you need to link a certain
library, you can specify the link command via the @code{MPILIBS}
variable, as in @samp{./configure MPILIBS=-lmpi ...}.  Note that if
your MPI library is installed in a non-standard location (one the
compiler does not know about by default), you may also have to specify
the location of the library and header files via the @code{LDFLAGS}
and @code{CPPFLAGS} variables, respectively, as in @samp{./configure
LDFLAGS=-L/path/to/mpi/libs CPPFLAGS=-I/path/to/mpi/include ...}.

@c ------------------------------------------------------------
@node Linking and Initializing MPI FFTW, 2d MPI example, FFTW MPI Installation, Distributed-memory FFTW with MPI
@section Linking and Initializing MPI FFTW

Programs using the MPI FFTW routines should be linked with
@code{-lfftw3_mpi -lfftw3 -lm} on Unix in double precision,
@code{-lfftw3f_mpi -lfftw3f -lm} in single precision, and so on
(@pxref{Precision}).
You will also need to link with whatever library
is responsible for MPI on your system; in most MPI implementations,
there is a special compiler alias named @code{mpicc} to compile and
link MPI code.
@fpindex mpicc
@cindex linking on Unix
@cindex precision


@findex fftw_init_threads
Before calling any FFTW routines except possibly
@code{fftw_init_threads} (@pxref{Combining MPI and Threads}), but after calling
@code{MPI_Init}, you should call the function:

@example
void fftw_mpi_init(void);
@end example
@findex fftw_mpi_init

If, at the end of your program, you want to get rid of all memory and
other resources allocated internally by FFTW, for both the serial and
MPI routines, you can call:

@example
void fftw_mpi_cleanup(void);
@end example
@findex fftw_mpi_cleanup

which is much like the @code{fftw_cleanup()} function except that it
also gets rid of FFTW's MPI-related data.  You must @emph{not} execute
any previously created plans after calling this function.

@c ------------------------------------------------------------
@node 2d MPI example, MPI Data Distribution, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI
@section 2d MPI example

Before we document the FFTW MPI interface in detail, we begin with a
simple example outlining how one would perform a two-dimensional
@code{N0} by @code{N1} complex DFT.

@example
#include <fftw3-mpi.h>

int main(int argc, char **argv)
@{
    const ptrdiff_t N0 = ..., N1 = ...;
    fftw_plan plan;
    fftw_complex *data;
    ptrdiff_t alloc_local, local_n0, local_0_start, i, j;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* @r{get local data size and allocate} */
    alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);
    data = fftw_alloc_complex(alloc_local);

    /* @r{create plan for in-place forward DFT} */
    plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
                                FFTW_FORWARD, FFTW_ESTIMATE);

    /* @r{initialize data to some function} my_function(x,y) */
    for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j)
       data[i*N1 + j] = my_function(local_0_start + i, j);

    /* @r{compute transforms, in-place, as many times as desired} */
    fftw_execute(plan);

    fftw_destroy_plan(plan);

    MPI_Finalize();
@}
@end example

As can be seen above, the MPI interface follows the same basic style
of allocate/plan/execute/destroy as the serial FFTW routines.  All of
the MPI-specific routines are prefixed with @samp{fftw_mpi_} instead
of @samp{fftw_}.
There are a few important differences, however:

First, we must call @code{fftw_mpi_init()} after calling
@code{MPI_Init} (required in all MPI programs) and before calling any
other @samp{fftw_mpi_} routine.
@findex MPI_Init
@findex fftw_mpi_init


Second, when we create the plan with @code{fftw_mpi_plan_dft_2d},
analogous to @code{fftw_plan_dft_2d}, we pass an additional argument:
the communicator, indicating which processes will participate in the
transform (here @code{MPI_COMM_WORLD}, indicating all processes).
Whenever you create, execute, or destroy a plan for an MPI transform,
you must call the corresponding FFTW routine on @emph{all} processes
in the communicator for that transform.  (That is, these are
@emph{collective} calls.)  Note that the plan for the MPI transform
uses the standard @code{fftw_execute} and @code{fftw_destroy_plan}
routines (on the other hand, there are MPI-specific new-array execute
functions documented below).
@cindex collective function
@findex fftw_mpi_plan_dft_2d
@ctindex MPI_COMM_WORLD


Third, all of the FFTW MPI routines take @code{ptrdiff_t} arguments
instead of @code{int} as for the serial FFTW.  @code{ptrdiff_t} is a
standard C integer type which is (at least) 32 bits wide on a 32-bit
machine and 64 bits wide on a 64-bit machine.  This is to make it easy
to specify very large parallel transforms on a 64-bit machine.  (You
can specify 64-bit transform sizes in the serial FFTW, too, but only
by using the @samp{guru64} planner interface.  @xref{64-bit Guru
Interface}.)
@tindex ptrdiff_t
@cindex 64-bit architecture


Fourth, and most importantly, you don't allocate the entire
two-dimensional array on each process.  Instead, you call
@code{fftw_mpi_local_size_2d} to find out what @emph{portion} of the
array resides on each processor, and how much space to allocate.
Here, the portion of the array on each process is a @code{local_n0} by
@code{N1} slice of the total array, starting at index
@code{local_0_start}.  The total number of @code{fftw_complex} numbers
to allocate is given by the @code{alloc_local} return value, which
@emph{may} be greater than @code{local_n0 * N1} (in case some
intermediate calculations require additional storage).  The data
distribution in FFTW's MPI interface is described in more detail by
the next section.
@findex fftw_mpi_local_size_2d
@cindex data distribution


Given the portion of the array that resides on the local process, it
is straightforward to initialize the data (here to a function
@code{my_function}) and otherwise manipulate it.  Of course, at the
end of the program you may want to output the data somehow, but
synchronizing this output is up to you and is beyond the scope of this
manual.
(One good way to output a large multi-dimensional distributed
array in MPI to a portable binary file is to use the free HDF5
library; see the @uref{http://www.hdfgroup.org/, HDF home page}.)
@cindex HDF5
@cindex MPI I/O

@c ------------------------------------------------------------
@node MPI Data Distribution, Multi-dimensional MPI DFTs of Real Data, 2d MPI example, Distributed-memory FFTW with MPI
@section MPI Data Distribution
@cindex data distribution

The most important concept to understand in using FFTW's MPI interface
is the data distribution.  With a serial or multithreaded FFT, all of
the inputs and outputs are stored as a single contiguous chunk of
memory.  With a distributed-memory FFT, the inputs and outputs are
broken into disjoint blocks, one per process.

In particular, FFTW uses a @emph{1d block distribution} of the data,
distributed along the @emph{first dimension}.  For example, if you
want to perform a @twodims{100,200} complex DFT, distributed over 4
processes, each process will get a @twodims{25,200} slice of the data.
That is, process 0 will get rows 0 through 24, process 1 will get rows
25 through 49, process 2 will get rows 50 through 74, and process 3
will get rows 75 through 99.  If you take the same array but
distribute it over 3 processes, then it is not evenly divisible so the
different processes will have unequal chunks.  FFTW's default choice
in this case is to assign 34 rows to processes 0 and 1, and 32 rows to
process 2.
@cindex block distribution


FFTW provides several @samp{fftw_mpi_local_size} routines that you can
call to find out what portion of an array is stored on the current
process.  In most cases, you should use the default block sizes picked
by FFTW, but it is also possible to specify your own block size.  For
example, with a @twodims{100,200} array on three processes, you can
tell FFTW to use a block size of 40, which would assign 40 rows to
processes 0 and 1, and 20 rows to process 2.  FFTW's default is to
divide the data equally among the processes if possible, and as best
it can otherwise.  The rows are always assigned in ``rank order,''
i.e. process 0 gets the first block of rows, then process 1, and so
on.  (You can change this by using @code{MPI_Comm_split} to create a
new communicator with re-ordered processes.)  However, you should
always call the @samp{fftw_mpi_local_size} routines, if possible,
rather than trying to predict FFTW's distribution choices.

In particular, it is critical that you allocate the storage size that
is returned by @samp{fftw_mpi_local_size}, which is @emph{not}
necessarily the size of the local slice of the array.  The reason is
that intermediate steps of FFTW's algorithms involve transposing the
array and redistributing the data, so at these intermediate steps FFTW
may require more local storage space (albeit always proportional to
the total size divided by the number of processes).
The @samp{fftw_mpi_local_size} functions know how much storage is
required for these intermediate steps and tell you the correct amount
to allocate.

@menu
* Basic and advanced distribution interfaces::
* Load balancing::
* Transposed distributions::
* One-dimensional distributions::
@end menu

@node Basic and advanced distribution interfaces, Load balancing, MPI Data Distribution, MPI Data Distribution
@subsection Basic and advanced distribution interfaces

As with the planner interface, the @samp{fftw_mpi_local_size}
distribution interface is broken into basic and advanced
(@samp{_many}) interfaces, where the latter allows you to specify the
block size manually and also to request block sizes when computing
multiple transforms simultaneously.  These functions are documented
more exhaustively by the FFTW MPI Reference, but we summarize the
basic ideas here using a couple of two-dimensional examples.

For the @twodims{100,200} complex-DFT example, above, we would find
the distribution by calling the following function in the basic
interface:

@example
ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
                                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
@end example
@findex fftw_mpi_local_size_2d

Given the total size of the data to be transformed (here, @code{n0 =
100} and @code{n1 = 200}) and an MPI communicator (@code{comm}), this
function provides three numbers.

First, it describes the shape of the local data: the current process
should store a @code{local_n0} by @code{n1} slice of the overall
dataset, in row-major order (@code{n1} dimension contiguous), starting
at index @code{local_0_start}.  That is, if the total dataset is
viewed as an @code{n0} by @code{n1} matrix, the current process should
store the rows @code{local_0_start} to
@code{local_0_start+local_n0-1}.  Obviously, if you are running with
only a single MPI process, that process will store the entire array:
@code{local_0_start} will be zero and @code{local_n0} will be
@code{n0}.  @xref{Row-major Format}.
@cindex row-major


Second, the return value is the total number of data elements (e.g.,
complex numbers for a complex DFT) that should be allocated for the
input and output arrays on the current process (ideally with
@code{fftw_malloc} or an @samp{fftw_alloc} function, to ensure optimal
alignment).  It might seem that this should always be equal to
@code{local_n0 * n1}, but this is @emph{not} the case.  FFTW's
distributed FFT algorithms require data redistributions at
intermediate stages of the transform, and in some circumstances this
may require slightly larger local storage.  This is discussed in more
detail below, under @ref{Load balancing}.
@findex fftw_malloc
@findex fftw_alloc_complex
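
As an illustration (a minimal sketch, not one of the manual's example
programs), the following program simply queries and prints the
distribution on each process.  For the @twodims{100,200} example above
on 3 processes, FFTW's default assignment would be 34, 34, and 32 rows
on ranks 0, 1, and 2, and @code{alloc_local} may exceed
@code{local_n0 * n1}.

@example
#include <fftw3-mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
@{
    const ptrdiff_t N0 = 100, N1 = 200;  /* global array size */
    ptrdiff_t alloc_local, local_n0, local_0_start;
    int rank;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ask FFTW which rows of the N0 x N1 array this process owns,
       and how many complex numbers it must allocate */
    alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);

    printf("rank %d: rows %td .. %td (local_n0 = %td), alloc_local = %td\n",
           rank, local_0_start, local_0_start + local_n0 - 1,
           local_n0, alloc_local);

    MPI_Finalize();
    return 0;
@}
@end example
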
@cindex advanced interface
The advanced-interface @samp{local_size} function for multidimensional
transforms returns the same three things (@code{local_n0},
@code{local_0_start}, and the total number of elements to allocate),
but takes more inputs:

@example
ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n,
                                   ptrdiff_t howmany,
                                   ptrdiff_t block0,
                                   MPI_Comm comm,
                                   ptrdiff_t *local_n0,
                                   ptrdiff_t *local_0_start);
@end example
@findex fftw_mpi_local_size_many

The two-dimensional case above corresponds to @code{rnk = 2} and an
array @code{n} of length 2 with @code{n[0] = n0} and @code{n[1] = n1}.
This routine is for any @code{rnk > 1}; one-dimensional transforms
have their own interface because they work slightly differently, as
discussed below.

First, the advanced interface allows you to perform multiple
transforms at once, of interleaved data, as specified by the
@code{howmany} parameter.  (@code{howmany} is 1 for a single
transform.)

Second, here you can specify your desired block size in the @code{n0}
dimension, @code{block0}.  To use FFTW's default block size, pass
@code{FFTW_MPI_DEFAULT_BLOCK} (0) for @code{block0}.  Otherwise, on
@code{P} processes, FFTW will return @code{local_n0} equal to
@code{block0} on the first @code{P / block0} processes (rounded down),
return @code{local_n0} equal to @code{n0 - block0 * (P / block0)} on
the next process, and @code{local_n0} equal to zero on any remaining
processes.  In general, we recommend using the default block size
(which corresponds to @code{n0 / P}, rounded up).
@ctindex FFTW_MPI_DEFAULT_BLOCK
@cindex block distribution


For example, suppose you have @code{P = 4} processes and @code{n0 =
21}.  The default will be a block size of @code{6}, which will give
@code{local_n0 = 6} on the first three processes and @code{local_n0 =
3} on the last process.  Instead, however, you could specify
@code{block0 = 5} if you wanted, which would give @code{local_n0 = 5}
on processes 0 to 2, @code{local_n0 = 6} on process 3.  (This choice,
while it may look superficially more ``balanced,'' has the same
critical path as FFTW's default but requires more communications.)
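
As a brief illustration of the advanced interface (a sketch, not taken
from the manual's worked examples, and assuming the interleaved layout
of the advanced planner interface in which the @code{howmany}
transforms are stored contiguously with stride @code{howmany}), one
might request the distribution for three interleaved @twodims{100,200}
complex transforms with the default block size like this; the
corresponding planner routine is @code{fftw_mpi_plan_many_dft},
described in the FFTW MPI Reference.

@example
const ptrdiff_t n[2] = @{100, 200@};
const ptrdiff_t howmany = 3;   /* three interleaved transforms */
ptrdiff_t alloc_local, local_n0, local_0_start;
fftw_complex *data;

alloc_local = fftw_mpi_local_size_many(2, n, howmany,
                                       FFTW_MPI_DEFAULT_BLOCK,
                                       MPI_COMM_WORLD,
                                       &local_n0, &local_0_start);
data = fftw_alloc_complex(alloc_local);

/* interleaved layout: element (i,j) of transform t is stored at
   data[(i*n[1] + j) * howmany + t], for local rows 0 <= i < local_n0 */
@end example
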
@node Load balancing, Transposed distributions, Basic and advanced distribution interfaces, MPI Data Distribution
@subsection Load balancing
@cindex load balancing

Ideally, when you parallelize a transform over some @math{P}
processes, each process should end up with work that takes equal time.
Otherwise, all of the processes end up waiting on whichever process is
slowest.  This goal is known as ``load balancing.''  In this section,
we describe the circumstances under which FFTW is able to load-balance
well, and in particular how you should choose your transform size in
order to load balance.

Load balancing is especially difficult when you are parallelizing over
heterogeneous machines; for example, if one of your processors is an
old 486 and another is a Pentium IV, obviously you should give the
Pentium more work to do than the 486 since the latter is much slower.
FFTW does not deal with this problem, however---it assumes that your
processes run on hardware of comparable speed, and that the goal is
therefore to divide the problem as equally as possible.

For a multi-dimensional complex DFT, FFTW can divide the problem
equally among the processes if: (i) the @emph{first} dimension
@code{n0} is divisible by @math{P}; and (ii) the @emph{product} of
the subsequent dimensions is divisible by @math{P}.  (For the advanced
interface, where you can specify multiple simultaneous transforms via
some ``vector'' length @code{howmany}, a factor of @code{howmany} is
included in the product of the subsequent dimensions.)

For a one-dimensional complex DFT, the length @code{N} of the data
should be divisible by @math{P} @emph{squared} to be able to divide
the problem equally among the processes.

@node Transposed distributions, One-dimensional distributions, Load balancing, MPI Data Distribution
@subsection Transposed distributions

Internally, FFTW's MPI transform algorithms work by first computing
transforms of the data local to each process, then by globally
@emph{transposing} the data in some fashion to redistribute the data
among the processes, transforming the new data local to each process,
and transposing back.  For example, a two-dimensional @code{n0} by
@code{n1} array, distributed across the @code{n0} dimension, is
transformed by: (i) transforming the @code{n1} dimension, which is
local to each process; (ii) transposing to an @code{n1} by @code{n0}
array, distributed across the @code{n1} dimension; (iii) transforming
the @code{n0} dimension, which is now local to each process; (iv)
transposing back.
@cindex transpose


However, in many applications it is acceptable to compute a
multidimensional DFT whose results are produced in transposed order
(e.g., @code{n1} by @code{n0} in two dimensions).  This provides a
significant performance advantage, because it means that the final
transposition step can be omitted.  FFTW supports this optimization,
which you specify by passing the flag @code{FFTW_MPI_TRANSPOSED_OUT}
to the planner routines.  To compute the inverse transform of
transposed output, you specify @code{FFTW_MPI_TRANSPOSED_IN} to tell
it that the input is transposed.  In this section, we explain how to
interpret the output format of such a transform.
@ctindex FFTW_MPI_TRANSPOSED_OUT
@ctindex FFTW_MPI_TRANSPOSED_IN


Suppose you are transforming multi-dimensional data with (at
least two) dimensions @ndims{}.  As always, it is distributed along
the first dimension @dimk{0}.
Now, if we compute its DFT with the
@code{FFTW_MPI_TRANSPOSED_OUT} flag, the resulting output data are stored
with the first @emph{two} dimensions transposed: @ndimstrans{},
distributed along the @dimk{1} dimension.  Conversely, if we take the
@ndimstrans{} data and transform it with the
@code{FFTW_MPI_TRANSPOSED_IN} flag, then the format goes back to the
original @ndims{} array.

There are two ways to find the portion of the transposed array that
resides on the current process.  First, you can simply call the
appropriate @samp{local_size} function, passing @ndimstrans{} (the
transposed dimensions).  This would mean calling the @samp{local_size}
function twice, once for the transposed and once for the
non-transposed dimensions.  Alternatively, you can call one of the
@samp{local_size_transposed} functions, which returns both the
non-transposed and transposed data distribution from a single call.
For example, for a 3d transform with transposed output (or input), you
might call:

@example
ptrdiff_t fftw_mpi_local_size_3d_transposed(
                ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm,
                ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
@end example
@findex fftw_mpi_local_size_3d_transposed

Here, @code{local_n0} and @code{local_0_start} give the size and
starting index of the @code{n0} dimension for the
@emph{non}-transposed data, as in the previous sections.  For
@emph{transposed} data (e.g. the output for
@code{FFTW_MPI_TRANSPOSED_OUT}), @code{local_n1} and
@code{local_1_start} give the size and starting index of the @code{n1}
dimension, which is the first dimension of the transposed data
(@code{n1} by @code{n0} by @code{n2}).

(Note that @code{FFTW_MPI_TRANSPOSED_IN} is completely equivalent to
performing @code{FFTW_MPI_TRANSPOSED_OUT} and passing the first two
dimensions to the planner in reverse order, or vice versa.  If you
pass @emph{both} the @code{FFTW_MPI_TRANSPOSED_IN} and
@code{FFTW_MPI_TRANSPOSED_OUT} flags, it is equivalent to swapping the
first two dimensions passed to the planner and passing @emph{neither}
flag.)
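
To make this concrete, here is a sketch (an in-place 3d complex DFT;
an illustration rather than one of the manual's worked examples) that
combines @code{fftw_mpi_local_size_3d_transposed} with a plan created
using @code{FFTW_MPI_TRANSPOSED_OUT}:

@example
const ptrdiff_t N0 = ..., N1 = ..., N2 = ...;
fftw_complex *data;
fftw_plan plan;
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;

alloc_local = fftw_mpi_local_size_3d_transposed(N0, N1, N2, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
data = fftw_alloc_complex(alloc_local);

plan = fftw_mpi_plan_dft_3d(N0, N1, N2, data, data, MPI_COMM_WORLD,
                            FFTW_FORWARD,
                            FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);

/* input: this process holds a local_n0 x N1 x N2 slab starting at
   row local_0_start, with element (i,j,k) at data[(i*N1 + j)*N2 + k];
   output after fftw_execute(plan): an N1 x N0 x N2 array of which this
   process holds a local_n1 x N0 x N2 slab starting at local_1_start,
   with element (j,i,k) at data[(j*N0 + i)*N2 + k], where 0 <= j < local_n1
   corresponds to global index local_1_start + j */
@end example
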
@node One-dimensional distributions, , Transposed distributions, MPI Data Distribution
@subsection One-dimensional distributions

For one-dimensional distributed DFTs using FFTW, matters are slightly
more complicated because the data distribution is more closely tied to
how the algorithm works.  In particular, you can no longer pass an
arbitrary block size and must accept FFTW's default; also, the block
sizes may be different for input and output.  Moreover, the data
distribution depends on the flags and transform direction, in order
for forward and backward transforms to work correctly.

@example
ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm,
                int sign, unsigned flags,
                ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
                ptrdiff_t *local_no, ptrdiff_t *local_o_start);
@end example
@findex fftw_mpi_local_size_1d

This function computes the data distribution for a 1d transform of
size @code{n0} with the given transform @code{sign} and @code{flags}.
Both input and output data use block distributions.  The input on the
current process will consist of @code{local_ni} numbers starting at
index @code{local_i_start}; e.g. if only a single process is used,
then @code{local_ni} will be @code{n0} and @code{local_i_start} will
be @code{0}.  Similarly for the output, with @code{local_no} numbers
starting at index @code{local_o_start}.  The return value of
@code{fftw_mpi_local_size_1d} will be the total number of elements to
allocate on the current process (which might be slightly larger than
the local size due to intermediate steps in the algorithm).

As mentioned above (@pxref{Load balancing}), the data will be divided
equally among the processes if @code{n0} is divisible by the
@emph{square} of the number of processes.  In this case,
@code{local_ni} will equal @code{local_no}.  Otherwise, they may be
different.

For some applications, such as convolutions, the order of the output
data is irrelevant.  In this case, performance can be improved by
specifying that the output data be stored in an FFTW-defined
``scrambled'' format.  (In particular, this is the analogue of
transposed output in the multidimensional case: scrambled output saves
a communications step.)  If you pass @code{FFTW_MPI_SCRAMBLED_OUT} in
the flags, then the output is stored in this (undocumented) scrambled
order.  Conversely, to perform the inverse transform of data in
scrambled order, pass the @code{FFTW_MPI_SCRAMBLED_IN} flag.
@ctindex FFTW_MPI_SCRAMBLED_OUT
@ctindex FFTW_MPI_SCRAMBLED_IN


In MPI FFTW, only composite sizes @code{n0} can be parallelized; we
have not yet implemented a parallel algorithm for large prime sizes.
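
Putting the pieces of this subsection together, a sketch (not one of
the manual's worked examples, and assuming the output order is
irrelevant as in a convolution) might combine
@code{fftw_mpi_local_size_1d} with @code{fftw_mpi_plan_dft_1d},
passing the same @code{sign} and scrambling flag to both:

@example
const ptrdiff_t N0 = ...;   /* should be composite to parallelize */
fftw_complex *in, *out;
fftw_plan plan;
ptrdiff_t alloc_local, local_ni, local_i_start, local_no, local_o_start;

alloc_local = fftw_mpi_local_size_1d(N0, MPI_COMM_WORLD,
                                     FFTW_FORWARD, FFTW_MPI_SCRAMBLED_OUT,
                                     &local_ni, &local_i_start,
                                     &local_no, &local_o_start);
in = fftw_alloc_complex(alloc_local);
out = fftw_alloc_complex(alloc_local);

plan = fftw_mpi_plan_dft_1d(N0, in, out, MPI_COMM_WORLD, FFTW_FORWARD,
                            FFTW_MEASURE | FFTW_MPI_SCRAMBLED_OUT);

/* this process supplies input elements local_i_start ..
   local_i_start + local_ni - 1; after fftw_execute(plan) it holds
   local_no output elements, in FFTW's scrambled order */
@end example
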
@c ------------------------------------------------------------
@node Multi-dimensional MPI DFTs of Real Data, Other Multi-dimensional Real-data MPI Transforms, MPI Data Distribution, Distributed-memory FFTW with MPI
@section Multi-dimensional MPI DFTs of Real Data

FFTW's MPI interface also supports multi-dimensional DFTs of real
data, similar to the serial r2c and c2r interfaces.  (Parallel
one-dimensional real-data DFTs are not currently supported; you must
use a complex transform and set the imaginary parts of the inputs to
zero.)

The key points to understand for r2c and c2r MPI transforms (compared
to the MPI complex DFTs or the serial r2c/c2r transforms) are:

@itemize @bullet

@item
Just as for serial transforms, r2c/c2r DFTs transform @ndims{} real
data to/from @ndimshalf{} complex data: the last dimension of the
complex data is cut in half (rounded down), plus one.  As for the
serial transforms, the sizes you pass to the @samp{plan_dft_r2c} and
@samp{plan_dft_c2r} are the @ndims{} dimensions of the real data.

@item
@cindex padding
Although the real data is @emph{conceptually} @ndims{}, it is
@emph{physically} stored as an @ndimspad{} array, where the last
dimension has been @emph{padded} to make it the same size as the
complex output.  This is much like the in-place serial r2c/c2r
interface (@pxref{Multi-Dimensional DFTs of Real Data}), except that
in MPI the padding is required even for out-of-place data.  The extra
padding numbers are ignored by FFTW (they are @emph{not} like
zero-padding the transform to a larger size); they are only used to
determine the data layout.

@item
@cindex data distribution
The data distribution in MPI for @emph{both} the real and complex data
is determined by the shape of the @emph{complex} data.  That is, you
call the appropriate @samp{local size} function for the @ndimshalf{}
complex data, and then use the @emph{same} distribution for the real
data except that the last complex dimension is replaced by a (padded)
real dimension of twice the length.

@end itemize

For example, suppose we are performing an out-of-place r2c transform of
@threedims{L,M,N} real data [padded to @threedims{L,M,2(N/2+1)}],
resulting in @threedims{L,M,N/2+1} complex data.
Similar to the
example in @ref{2d MPI example}, we might do something like:

@example
#include <fftw3-mpi.h>

int main(int argc, char **argv)
@{
    const ptrdiff_t L = ..., M = ..., N = ...;
    fftw_plan plan;
    double *rin;
    fftw_complex *cout;
    ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* @r{get local data size and allocate} */
    alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);
    rin = fftw_alloc_real(2 * alloc_local);
    cout = fftw_alloc_complex(alloc_local);

    /* @r{create plan for out-of-place r2c DFT} */
    plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
                                    FFTW_MEASURE);

    /* @r{initialize rin to some function} my_func(x,y,z) */
    for (i = 0; i < local_n0; ++i)
       for (j = 0; j < M; ++j)
         for (k = 0; k < N; ++k)
           rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k);

    /* @r{compute transforms as many times as desired} */
    fftw_execute(plan);

    fftw_destroy_plan(plan);

    MPI_Finalize();
@}
@end example

@findex fftw_alloc_real
@cindex row-major
Note that we allocated @code{rin} using @code{fftw_alloc_real} with an
argument of @code{2 * alloc_local}: since @code{alloc_local} is the
number of @emph{complex} values to allocate, the number of @emph{real}
values is twice as many.  The @code{rin} array is then
@threedims{local_n0,M,2(N/2+1)} in row-major order, so its
@code{(i,j,k)} element is at the index @code{(i*M + j) * (2*(N/2+1)) +
k} (@pxref{Multi-dimensional Array Format}).

@cindex transpose
@ctindex FFTW_MPI_TRANSPOSED_OUT
@ctindex FFTW_MPI_TRANSPOSED_IN
As for the complex transforms, improved performance can be obtained by
specifying that the output is the transpose of the input or vice versa
(@pxref{Transposed distributions}).  In our @threedims{L,M,N} r2c
example, including @code{FFTW_MPI_TRANSPOSED_OUT} in the flags means that
the input would be a padded @threedims{L,M,2(N/2+1)} real array
distributed over the @code{L} dimension, while the output would be an
@threedims{M,L,N/2+1} complex array distributed over the @code{M}
dimension.  To perform the inverse c2r transform with the same data
distributions, you would use the @code{FFTW_MPI_TRANSPOSED_IN} flag.

@c ------------------------------------------------------------
@node Other Multi-dimensional Real-data MPI Transforms, FFTW MPI Transposes, Multi-dimensional MPI DFTs of Real Data, Distributed-memory FFTW with MPI
@section Other Multi-dimensional Real-data MPI Transforms

@cindex r2r
FFTW's MPI interface also supports multi-dimensional @samp{r2r}
transforms of all kinds supported by the serial interface
(e.g. discrete cosine and sine transforms, discrete Hartley
transforms, etc.).
Only multi-dimensional @samp{r2r} transforms, not
one-dimensional transforms, are currently parallelized.

@tindex fftw_r2r_kind
These are used much like the multidimensional complex DFTs discussed
above, except that the data is real rather than complex, and one needs
to pass an r2r transform kind (@code{fftw_r2r_kind}) for each
dimension as in the serial FFTW (@pxref{More DFTs of Real Data}).

For example, one might perform a two-dimensional @twodims{L,M} r2r
transform that is an REDFT10 (DCT-II) in the first dimension and an
RODFT10 (DST-II) in the second dimension with code like:

@example
const ptrdiff_t L = ..., M = ...;
fftw_plan plan;
double *data;
ptrdiff_t alloc_local, local_n0, local_0_start, i, j;

/* @r{get local data size and allocate} */
alloc_local = fftw_mpi_local_size_2d(L, M, MPI_COMM_WORLD,
                                     &local_n0, &local_0_start);
data = fftw_alloc_real(alloc_local);

/* @r{create plan for in-place REDFT10 x RODFT10} */
plan = fftw_mpi_plan_r2r_2d(L, M, data, data, MPI_COMM_WORLD,
                            FFTW_REDFT10, FFTW_RODFT10, FFTW_MEASURE);

/* @r{initialize data to some function} my_function(x,y) */
for (i = 0; i < local_n0; ++i) for (j = 0; j < M; ++j)
   data[i*M + j] = my_function(local_0_start + i, j);

/* @r{compute transforms, in-place, as many times as desired} */
fftw_execute(plan);

fftw_destroy_plan(plan);
@end example

@findex fftw_alloc_real
Notice that we use the same @samp{local_size} functions as we did for
complex data, only now we interpret the sizes in terms of real rather
than complex values, and correspondingly use @code{fftw_alloc_real}.

@c ------------------------------------------------------------
@node FFTW MPI Transposes, FFTW MPI Wisdom, Other Multi-dimensional Real-data MPI Transforms, Distributed-memory FFTW with MPI
@section FFTW MPI Transposes
@cindex transpose

FFTW's MPI Fourier transforms rely on one or more @emph{global
transposition} steps for their communications.  For example, the
multidimensional transforms work by transforming along some
dimensions, then transposing to make the first dimension local and
transforming that, then transposing back.  Because global
transposition of a block-distributed matrix has many other potential
uses besides FFTs, FFTW's transpose routines can be called directly,
as documented in this section.

@menu
* Basic distributed-transpose interface::
* Advanced distributed-transpose interface::
* An improved replacement for MPI_Alltoall::
@end menu

@node Basic distributed-transpose interface, Advanced distributed-transpose interface, FFTW MPI Transposes, FFTW MPI Transposes
@subsection Basic distributed-transpose interface

In particular, suppose that we have an @code{n0} by @code{n1} array in
row-major order, block-distributed across the @code{n0} dimension.
To
transpose this into an @code{n1} by @code{n0} array block-distributed
across the @code{n1} dimension, we would create a plan by calling the
following function:

@example
fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
                                  double *in, double *out,
                                  MPI_Comm comm, unsigned flags);
@end example
@findex fftw_mpi_plan_transpose

The input and output arrays (@code{in} and @code{out}) can be the
same.  The transpose is actually executed by calling
@code{fftw_execute} on the plan, as usual.
@findex fftw_execute


The @code{flags} are the usual FFTW planner flags, but support
two additional flags: @code{FFTW_MPI_TRANSPOSED_OUT} and/or
@code{FFTW_MPI_TRANSPOSED_IN}.  What these flags indicate, for
transpose plans, is that the output and/or input, respectively, are
@emph{locally} transposed.  That is, on each process the input data is
normally stored as a @code{local_n0} by @code{n1} array in row-major
order, but for an @code{FFTW_MPI_TRANSPOSED_IN} plan the input data is
stored as @code{n1} by @code{local_n0} in row-major order.  Similarly,
@code{FFTW_MPI_TRANSPOSED_OUT} means that the output is @code{n0} by
@code{local_n1} instead of @code{local_n1} by @code{n0}.
@ctindex FFTW_MPI_TRANSPOSED_OUT
@ctindex FFTW_MPI_TRANSPOSED_IN


To determine the local size of the array on each process before and
after the transpose, as well as the amount of storage that must be
allocated, one should call @code{fftw_mpi_local_size_2d_transposed},
just as for a 2d DFT as described in the previous section:
@cindex data distribution

@example
ptrdiff_t fftw_mpi_local_size_2d_transposed
                (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
@end example
@findex fftw_mpi_local_size_2d_transposed

Again, the return value is the local storage to allocate, which in
this case is the number of @emph{real} (@code{double}) values rather
than complex numbers as in the previous examples.
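
Putting these two routines together, a sketch of a complete in-place
transpose might look like the following (an illustration, not a
canonical example from the manual; @code{n0} and @code{n1} are assumed
to be @code{ptrdiff_t} global dimensions declared elsewhere):

@example
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
double *data;
fftw_plan tplan;

alloc_local = fftw_mpi_local_size_2d_transposed(n0, n1, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
data = fftw_alloc_real(alloc_local);

/* before execution: this process holds a local_n0 x n1 block of the
   n0 x n1 array; afterwards it holds a local_n1 x n0 block of the
   transposed n1 x n0 array */
tplan = fftw_mpi_plan_transpose(n0, n1, data, data, MPI_COMM_WORLD,
                                FFTW_MEASURE);
/* ... initialize data after planning (FFTW_MEASURE overwrites it) ... */
fftw_execute(tplan);
fftw_destroy_plan(tplan);
@end example
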
@node Advanced distributed-transpose interface, An improved replacement for MPI_Alltoall, Basic distributed-transpose interface, FFTW MPI Transposes
@subsection Advanced distributed-transpose interface

The above routines are for a transpose of a matrix of numbers (of type
@code{double}), using FFTW's default block sizes.  More generally, one
can perform transposes of @emph{tuples} of numbers, with
user-specified block sizes for the input and output:

@example
fftw_plan fftw_mpi_plan_many_transpose
                (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
                 ptrdiff_t block0, ptrdiff_t block1,
                 double *in, double *out, MPI_Comm comm, unsigned flags);
@end example
@findex fftw_mpi_plan_many_transpose

In this case, one is transposing an @code{n0} by @code{n1} matrix of
@code{howmany}-tuples (e.g. @code{howmany = 2} for complex numbers).
The input is distributed along the @code{n0} dimension with block size
@code{block0}, and the @code{n1} by @code{n0} output is distributed
along the @code{n1} dimension with block size @code{block1}.  If
@code{FFTW_MPI_DEFAULT_BLOCK} (0) is passed for a block size then FFTW
uses its default block size.  To get the local size of the data on
each process, you should then call @code{fftw_mpi_local_size_many_transposed}.
@ctindex FFTW_MPI_DEFAULT_BLOCK
@findex fftw_mpi_local_size_many_transposed

@node An improved replacement for MPI_Alltoall, , Advanced distributed-transpose interface, FFTW MPI Transposes
@subsection An improved replacement for MPI_Alltoall

We close this section by noting that FFTW's MPI transpose routines can
be thought of as a generalization of the @code{MPI_Alltoall} function
(albeit only for floating-point types), and in some circumstances can
function as an improved replacement.
@findex MPI_Alltoall


@code{MPI_Alltoall} is defined by the MPI standard as:

@example
int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 void *recvbuf, int recvcnt, MPI_Datatype recvtype,
                 MPI_Comm comm);
@end example

In particular, for @code{double*} arrays @code{in} and @code{out},
consider the call:

@example
MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm);
@end example

This is completely equivalent to:

@example
MPI_Comm_size(comm, &P);
plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE);
fftw_execute(plan);
fftw_destroy_plan(plan);
@end example

That is, computing a @twodims{P,P} transpose on @code{P} processes,
with a block size of 1, is just a standard all-to-all communication.

However, using the FFTW routine instead of @code{MPI_Alltoall} may
have certain advantages.  First of all, FFTW's routine can operate
in-place (@code{in == out}) whereas @code{MPI_Alltoall} can only
operate out-of-place.
@cindex in-place


Second, even for out-of-place plans, FFTW's routine may be faster,
especially if you need to perform the all-to-all communication many
times and can afford to use @code{FFTW_MEASURE} or
@code{FFTW_PATIENT}.  It should certainly be no slower, not including
the time to create the plan, since one of the possible algorithms that
FFTW uses for an out-of-place transpose @emph{is} simply to call
@code{MPI_Alltoall}.  However, FFTW also considers several other
possible algorithms that, depending on your MPI implementation and
your hardware, may be faster.
@ctindex FFTW_MEASURE
@ctindex FFTW_PATIENT
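
For instance, to exploit the in-place capability, the exchange above
could be performed without a separate output buffer.  The following is
a sketch (not from the manual): @code{buf} is assumed to be a
@code{double} array holding this process's @code{P*howmany} values,
allocated with the size returned by
@code{fftw_mpi_local_size_many_transposed}.

@example
int P;
MPI_Comm_size(comm, &P);

/* plan once, before filling buf: planning with FFTW_MEASURE
   overwrites the array contents */
plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1,
                                    buf, buf, /* in == out */
                                    comm, FFTW_MEASURE);

/* ... fill buf, then exchange as often as needed ... */
fftw_execute(plan);

fftw_destroy_plan(plan);
@end example
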
@c ------------------------------------------------------------
@node FFTW MPI Wisdom, Avoiding MPI Deadlocks, FFTW MPI Transposes, Distributed-memory FFTW with MPI
@section FFTW MPI Wisdom
@cindex wisdom
@cindex saving plans to disk

FFTW's ``wisdom'' facility (@pxref{Words of Wisdom-Saving Plans}) can
be used to save MPI plans as well as to save uniprocessor plans.
However, for MPI there are several unavoidable complications.

@cindex MPI I/O
First, the MPI standard does not guarantee that every process can
perform file I/O (at least, not using C stdio routines)---in general,
we may only assume that process 0 is capable of I/O.@footnote{In fact,
even this assumption is not technically guaranteed by the standard,
although it seems to be universal in actual MPI implementations and is
widely assumed by MPI-using software.  Technically, you need to query
the @code{MPI_IO} attribute of @code{MPI_COMM_WORLD} with
@code{MPI_Attr_get}.  If this attribute is @code{MPI_PROC_NULL}, no
I/O is possible.  If it is @code{MPI_ANY_SOURCE}, any process can
perform I/O.  Otherwise, it is the rank of a process that can perform
I/O ... but since it is not guaranteed to yield the @emph{same} rank
on all processes, you have to do an @code{MPI_Allreduce} of some kind
if you want all processes to agree about which is going to do I/O.
And even then, the standard only guarantees that this process can
perform output, but not input.  See e.g. @cite{Parallel Programming
with MPI} by P. S. Pacheco, section 8.1.3.  Needless to say, in our
experience virtually no MPI programmers worry about this.}  So, if we
want to export the wisdom from a single process to a file, we must
first export the wisdom to a string, then send it to process 0, then
write it to a file.

Second, in principle we may want to have separate wisdom for every
process, since in general the processes may run on different hardware
even for a single MPI program.  However, in practice FFTW's MPI code
is designed for the case of homogeneous hardware (@pxref{Load
balancing}), and in this case it is convenient to use the same wisdom
for every process.  Thus, we need a mechanism to synchronize the wisdom.

To address both of these problems, FFTW provides the following two
functions:

@example
void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
void fftw_mpi_gather_wisdom(MPI_Comm comm);
@end example
@findex fftw_mpi_gather_wisdom
@findex fftw_mpi_broadcast_wisdom

Given a communicator @code{comm}, @code{fftw_mpi_broadcast_wisdom}
will broadcast the wisdom from process 0 to all other processes.
Conversely, @code{fftw_mpi_gather_wisdom} will collect wisdom from all
processes onto process 0.  (If the plans created for the same problem
by different processes are not the same, @code{fftw_mpi_gather_wisdom}
will arbitrarily choose one of the plans.)
Both of these functions
may result in suboptimal plans for different processes if the
processes are running on non-identical hardware.  Both of these
functions are @emph{collective} calls, which means that they must be
executed by all processes in the communicator.
@cindex collective function


So, for example, a typical code snippet to import wisdom from a file
and use it on all processes would be:

@example
@{
    int rank;

    fftw_mpi_init();
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) fftw_import_wisdom_from_filename("mywisdom");
    fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD);
@}
@end example

(Note that we must call @code{fftw_mpi_init} before importing any
wisdom that might contain MPI plans.)  Similarly, a typical code
snippet to export wisdom from all processes to a file is:
@findex fftw_mpi_init

@example
@{
    int rank;

    fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) fftw_export_wisdom_to_filename("mywisdom");
@}
@end example

@c ------------------------------------------------------------
@node Avoiding MPI Deadlocks, FFTW MPI Performance Tips, FFTW MPI Wisdom, Distributed-memory FFTW with MPI
@section Avoiding MPI Deadlocks
@cindex deadlock

An MPI program can @emph{deadlock} if one process is waiting for a
message from another process that never gets sent.  To avoid deadlocks
when using FFTW's MPI routines, it is important to know which
functions are @emph{collective}: that is, which functions must
@emph{always} be called in the @emph{same order} from @emph{every}
process in a given communicator.  (For example, @code{MPI_Barrier} is
the canonical example of a collective function in the MPI standard.)
@cindex collective function
@findex MPI_Barrier


The functions in FFTW that are @emph{always} collective are: every
function beginning with @samp{fftw_mpi_plan}, as well as
@code{fftw_mpi_broadcast_wisdom} and @code{fftw_mpi_gather_wisdom}.
Also, the following functions from the ordinary FFTW interface are
collective when they are applied to a plan created by an
@samp{fftw_mpi_plan} function: @code{fftw_execute},
@code{fftw_destroy_plan}, and @code{fftw_flops}.
@findex fftw_execute
@findex fftw_destroy_plan
@findex fftw_flops
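
As a sketch of the kind of mistake this rules out (hypothetical
variables @code{rank}, @code{N0}, @code{N1}, and @code{data}; this is
an illustration, not an example from the manual), calling a collective
FFTW MPI routine on only a subset of the processes can hang the
program:

@example
/* WRONG: only rank 0 enters the (collective) planner, so the other
   processes never participate and the program can hang */
if (rank == 0)
     plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
                                 FFTW_FORWARD, FFTW_MEASURE);

/* RIGHT: every process in the communicator makes the same calls,
   in the same order */
plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
                            FFTW_FORWARD, FFTW_MEASURE);
fftw_execute(plan);       /* collective for MPI plans */
fftw_destroy_plan(plan);  /* likewise collective */
@end example
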
cannam@167: (Multidimensional block distributions may be supported by a future cannam@167: version.) More generally, you should ideally arrange the dimensions so cannam@167: that FFTW can divide them equally among the processes. @xref{Load cannam@167: balancing}. cannam@167: @cindex block distribution cannam@167: @cindex load balancing cannam@167: cannam@167: cannam@167: Second, if it is not too inconvenient, you should consider working cannam@167: with transposed output for multidimensional plans, as this saves a cannam@167: considerable amount of communications. @xref{Transposed distributions}. cannam@167: @cindex transpose cannam@167: cannam@167: cannam@167: Third, the fastest choices are generally either an in-place transform cannam@167: or an out-of-place transform with the @code{FFTW_DESTROY_INPUT} flag cannam@167: (which allows the input array to be used as scratch space). In-place cannam@167: is especially beneficial if the amount of data per process is large. cannam@167: @ctindex FFTW_DESTROY_INPUT cannam@167: cannam@167: cannam@167: Fourth, if you have multiple arrays to transform at once, rather than cannam@167: calling FFTW's MPI transforms several times it usually seems to be cannam@167: faster to interleave the data and use the advanced interface. (This cannam@167: groups the communications together instead of requiring separate cannam@167: messages for each transform.) cannam@167: cannam@167: @c ------------------------------------------------------------ cannam@167: @node Combining MPI and Threads, FFTW MPI Reference, FFTW MPI Performance Tips, Distributed-memory FFTW with MPI cannam@167: @section Combining MPI and Threads cannam@167: @cindex threads cannam@167: cannam@167: In certain cases, it may be advantageous to combine MPI cannam@167: (distributed-memory) and threads (shared-memory) parallelization. cannam@167: FFTW supports this, with certain caveats. For example, if you have a cannam@167: cluster of 4-processor shared-memory nodes, you may want to use cannam@167: threads within the nodes and MPI between the nodes, instead of MPI for cannam@167: all parallelization. cannam@167: cannam@167: In particular, it is possible to seamlessly combine the MPI FFTW cannam@167: routines with the multi-threaded FFTW routines (@pxref{Multi-threaded cannam@167: FFTW}). However, some care must be taken in the initialization code, cannam@167: which should look something like this: cannam@167: cannam@167: @example cannam@167: int threads_ok; cannam@167: cannam@167: int main(int argc, char **argv) cannam@167: @{ cannam@167: int provided; cannam@167: MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided); cannam@167: threads_ok = provided >= MPI_THREAD_FUNNELED; cannam@167: cannam@167: if (threads_ok) threads_ok = fftw_init_threads(); cannam@167: fftw_mpi_init(); cannam@167: cannam@167: ... cannam@167: if (threads_ok) fftw_plan_with_nthreads(...); cannam@167: ... cannam@167: cannam@167: MPI_Finalize(); cannam@167: @} cannam@167: @end example cannam@167: @findex fftw_mpi_init cannam@167: @findex fftw_init_threads cannam@167: @findex fftw_plan_with_nthreads cannam@167: cannam@167: First, note that instead of calling @code{MPI_Init}, you should call cannam@167: @code{MPI_Init_threads}, which is the initialization routine defined cannam@167: by the MPI-2 standard to indicate to MPI that your program will be cannam@167: multithreaded. We pass @code{MPI_THREAD_FUNNELED}, which indicates cannam@167: that we will only call MPI routines from the main thread. 
(FFTW will
launch additional threads internally, but the extra threads will not
call MPI code.)  (You may also pass @code{MPI_THREAD_SERIALIZED} or
@code{MPI_THREAD_MULTIPLE}, which requests additional multithreading
support from the MPI implementation, but this is not required by
FFTW.)  The @code{provided} parameter returns what level of threads
support is actually provided by your MPI implementation; this
@emph{must} be at least @code{MPI_THREAD_FUNNELED} if you want to call
the FFTW threads routines, so we define a global variable
@code{threads_ok} to record this.  You should only call
@code{fftw_init_threads} or @code{fftw_plan_with_nthreads} if
@code{threads_ok} is true.  For more information on thread safety in
MPI, see the
@uref{http://www.mpi-forum.org/docs/mpi-20-html/node162.htm, MPI and
Threads} section of the MPI-2 standard.
@cindex thread safety


Second, we must call @code{fftw_init_threads} @emph{before}
@code{fftw_mpi_init}.  This is critical for technical reasons having
to do with how FFTW initializes its list of algorithms.

Then, if you call @code{fftw_plan_with_nthreads(N)}, @emph{every} MPI
process will launch (up to) @code{N} threads to parallelize its
transforms.

For example, in the hypothetical cluster of 4-processor nodes, you
might wish to launch only a single MPI process per node, and then call
@code{fftw_plan_with_nthreads(4)} on each process to use all
processors in the nodes.

This may or may not be faster than simply using as many MPI processes
as you have processors, however.  On the one hand, using threads
within a node eliminates the need for explicit message passing within
the node.  On the other hand, FFTW's transpose routines are not
multi-threaded, and this means that the communications that do take
place will not benefit from parallelization within the node.
Moreover, many MPI implementations already have optimizations to
exploit shared memory when it is available, so adding the
multithreaded FFTW on top of this may be superfluous.
@cindex transpose

@c ------------------------------------------------------------
@node FFTW MPI Reference, FFTW MPI Fortran Interface, Combining MPI and Threads, Distributed-memory FFTW with MPI
@section FFTW MPI Reference

This section provides a complete reference to all FFTW MPI functions,
datatypes, and constants.  See also @ref{FFTW Reference} for information
on functions and types in common with the serial interface.

@menu
* MPI Files and Data Types::
* MPI Initialization::
* Using MPI Plans::
* MPI Data Distribution Functions::
* MPI Plan Creation::
* MPI Wisdom Communication::
@end menu

@node MPI Files and Data Types, MPI Initialization, FFTW MPI Reference, FFTW MPI Reference
@subsection MPI Files and Data Types

All programs using FFTW's MPI support should include its header file:

@example
#include <fftw3-mpi.h>
@end example

Note that this header file includes the serial-FFTW @code{fftw3.h}
header file, and also the @code{mpi.h} header file for MPI, so you
need not include those files separately.

You must also link to @emph{both} the FFTW MPI library and to the
serial FFTW library.  On Unix, this means adding @code{-lfftw3_mpi
-lfftw3 -lm} at the end of the link command.

@cindex precision
Different precisions are handled as in the serial interface:
@xref{Precision}.  That is, @samp{fftw_} functions become
@code{fftwf_} (in single precision) etcetera, and the libraries become
@code{-lfftw3f_mpi -lfftw3f -lm} etcetera on Unix.  Long-double
precision is supported in MPI, but quad precision (@samp{fftwq_}) is
not due to the lack of MPI support for this type.

@node MPI Initialization, Using MPI Plans, MPI Files and Data Types, FFTW MPI Reference
@subsection MPI Initialization

Before calling any other FFTW MPI (@samp{fftw_mpi_}) function, and
before importing any wisdom for MPI problems, you must call:

@findex fftw_mpi_init
@example
void fftw_mpi_init(void);
@end example

@findex fftw_init_threads
If FFTW threads support is used, however, @code{fftw_mpi_init} should
be called @emph{after} @code{fftw_init_threads} (@pxref{Combining MPI
and Threads}).  Calling @code{fftw_mpi_init} additional times (before
@code{fftw_mpi_cleanup}) has no effect.


If you want to deallocate all persistent data and reset FFTW to the
pristine state it was in when you started your program, you can call:

@findex fftw_mpi_cleanup
@example
void fftw_mpi_cleanup(void);
@end example

@findex fftw_cleanup
(This calls @code{fftw_cleanup}, so you need not call the serial
cleanup routine too, although it is safe to do so.)  After calling
@code{fftw_mpi_cleanup}, all existing plans become undefined, and you
should not attempt to execute or destroy them.  You must call
@code{fftw_mpi_init} again after @code{fftw_mpi_cleanup} if you want
to resume using the MPI FFTW routines.
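
For example, a minimal program skeleton (a sketch only, with error
checking omitted) pairs these calls with the MPI initialization and
finalization routines:

@example
int main(int argc, char **argv)
@{
    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* ... create, execute, and destroy MPI plans here ... */

    fftw_mpi_cleanup();   /* optional: frees FFTW's persistent data */
    MPI_Finalize();
    return 0;
@}
@end example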


@node Using MPI Plans, MPI Data Distribution Functions, MPI Initialization, FFTW MPI Reference
@subsection Using MPI Plans

Once an MPI plan is created, you can execute and destroy it using
@code{fftw_execute}, @code{fftw_destroy_plan}, and the other functions
in the serial interface that operate on generic plans (@pxref{Using
Plans}).

@cindex collective function
@cindex MPI communicator
The @code{fftw_execute} and @code{fftw_destroy_plan} functions, applied to
MPI plans, are @emph{collective} calls: they must be called for all processes
in the communicator that was used to create the plan.

@cindex new-array execution
You must @emph{not} use the serial new-array plan-execution functions
@code{fftw_execute_dft} and so on (@pxref{New-array Execute
Functions}) with MPI plans.  Such functions are specialized to the
problem type, and there are specific new-array execute functions for MPI plans:

@findex fftw_mpi_execute_dft
@findex fftw_mpi_execute_dft_r2c
@findex fftw_mpi_execute_dft_c2r
@findex fftw_mpi_execute_r2r
@example
void fftw_mpi_execute_dft(fftw_plan p, fftw_complex *in, fftw_complex *out);
void fftw_mpi_execute_dft_r2c(fftw_plan p, double *in, fftw_complex *out);
void fftw_mpi_execute_dft_c2r(fftw_plan p, fftw_complex *in, double *out);
void fftw_mpi_execute_r2r(fftw_plan p, double *in, double *out);
@end example

@cindex alignment
@findex fftw_malloc
These functions have the same restrictions as those of the serial
new-array execute functions.  They are @emph{always} safe to apply to
the @emph{same} @code{in} and @code{out} arrays that were used to
create the plan.  They can only be applied to new arrays if those
arrays have the same types, dimensions, in-placeness, and alignment as
the original arrays, where the best way to ensure the same alignment
is to use FFTW's @code{fftw_malloc} and related allocation functions
for all arrays (@pxref{Memory Allocation}).  Note that distributed
transposes (@pxref{FFTW MPI Transposes}) use
@code{fftw_mpi_execute_r2r}, since they count as rank-zero r2r plans
from FFTW's perspective.
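
For example, if @code{plan} is an out-of-place complex-DFT MPI plan
whose arrays each hold @code{alloc_local} complex numbers (as returned
by the @samp{local_size} functions in the next section), re-applying
it to a freshly allocated pair of arrays might look like this sketch
(the names @code{in2} and @code{out2} are hypothetical):

@example
fftw_complex *in2 = fftw_alloc_complex(alloc_local);
fftw_complex *out2 = fftw_alloc_complex(alloc_local);
/* ... fill in2 with new data ... */
fftw_mpi_execute_dft(plan, in2, out2);  /* collective, like fftw_execute */
fftw_free(in2); fftw_free(out2);
@end example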

@node MPI Data Distribution Functions, MPI Plan Creation, Using MPI Plans, FFTW MPI Reference
@subsection MPI Data Distribution Functions

@cindex data distribution
As described above (@pxref{MPI Data Distribution}), in order to
allocate your arrays, @emph{before} creating a plan, you must first
call one of the following routines to determine the required
allocation size and the portion of the array locally stored on a given
process.  The @code{MPI_Comm} communicator passed here must be
equivalent to the communicator used below for plan creation.

The basic interface for multidimensional transforms consists of the
functions:

@findex fftw_mpi_local_size_2d
@findex fftw_mpi_local_size_3d
@findex fftw_mpi_local_size
@findex fftw_mpi_local_size_2d_transposed
@findex fftw_mpi_local_size_3d_transposed
@findex fftw_mpi_local_size_transposed
@example
ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
                                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
ptrdiff_t fftw_mpi_local_size_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                                 MPI_Comm comm,
                                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
ptrdiff_t fftw_mpi_local_size(int rnk, const ptrdiff_t *n, MPI_Comm comm,
                              ptrdiff_t *local_n0, ptrdiff_t *local_0_start);

ptrdiff_t fftw_mpi_local_size_2d_transposed(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
                                            ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                                            ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
ptrdiff_t fftw_mpi_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                                            MPI_Comm comm,
                                            ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                                            ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
ptrdiff_t fftw_mpi_local_size_transposed(int rnk, const ptrdiff_t *n, MPI_Comm comm,
                                         ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                                         ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
@end example

These functions return the number of elements to allocate (complex
numbers for DFT/r2c/c2r plans, real numbers for r2r plans), whereas
@code{local_n0} and @code{local_0_start} return the portion
(@code{local_0_start} to @code{local_0_start + local_n0 - 1}) of the
first dimension of an @ndims{} array that is stored on the local
process.  @xref{Basic and advanced distribution interfaces}.  For
@code{FFTW_MPI_TRANSPOSED_OUT} plans, the @samp{_transposed} variants
are useful in order to also return the local portion of the first
dimension in the @ndimstrans{} transposed output.
@xref{Transposed distributions}.
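
For instance, a typical allocation of a distributed @code{n0} by
@code{n1} complex array would look something like the following
sketch, where @code{n0} and @code{n1} are the overall transform
dimensions (compare the 2d MPI example earlier in this chapter):

@example
ptrdiff_t alloc_local, local_n0, local_0_start;
fftw_complex *data;

alloc_local = fftw_mpi_local_size_2d(n0, n1, MPI_COMM_WORLD,
                                     &local_n0, &local_0_start);
data = fftw_alloc_complex(alloc_local);
/* this process stores rows local_0_start ... local_0_start+local_n0-1 */
@end example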

The advanced interface for multidimensional transforms is:

@cindex advanced interface
@findex fftw_mpi_local_size_many
@findex fftw_mpi_local_size_many_transposed
@example
ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
                                   ptrdiff_t block0, MPI_Comm comm,
                                   ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
ptrdiff_t fftw_mpi_local_size_many_transposed(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
                                              ptrdiff_t block0, ptrdiff_t block1, MPI_Comm comm,
                                              ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                                              ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
@end example

These differ from the basic interface in only two ways.  First, they
allow you to specify block sizes @code{block0} and @code{block1} (the
latter for the transposed output); you can pass
@code{FFTW_MPI_DEFAULT_BLOCK} to use FFTW's default block size as in
the basic interface.  Second, you can pass a @code{howmany} parameter,
corresponding to the advanced planning interface below: this is for
transforms of contiguous @code{howmany}-tuples of numbers
(@code{howmany = 1} in the basic interface).

The corresponding basic and advanced routines for one-dimensional
transforms (currently only complex DFTs) are:

@findex fftw_mpi_local_size_1d
@findex fftw_mpi_local_size_many_1d
@example
ptrdiff_t fftw_mpi_local_size_1d(
             ptrdiff_t n0, MPI_Comm comm, int sign, unsigned flags,
             ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
             ptrdiff_t *local_no, ptrdiff_t *local_o_start);
ptrdiff_t fftw_mpi_local_size_many_1d(
             ptrdiff_t n0, ptrdiff_t howmany,
             MPI_Comm comm, int sign, unsigned flags,
             ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
             ptrdiff_t *local_no, ptrdiff_t *local_o_start);
@end example

@ctindex FFTW_MPI_SCRAMBLED_OUT
@ctindex FFTW_MPI_SCRAMBLED_IN
As above, the return value is the number of elements to allocate
(complex numbers, for complex DFTs).  The @code{local_ni} and
@code{local_i_start} arguments return the portion
(@code{local_i_start} to @code{local_i_start + local_ni - 1}) of the
1d array that is stored on this process for the transform
@emph{input}, and @code{local_no} and @code{local_o_start} are the
corresponding quantities for the transform @emph{output}.  The
@code{sign} (@code{FFTW_FORWARD} or @code{FFTW_BACKWARD}) and
@code{flags} must match the arguments passed when creating a plan.
Although the inputs and outputs have different data distributions in
general, it is guaranteed that the @emph{output} data distribution of
an @code{FFTW_FORWARD} plan will match the @emph{input} data
distribution of an @code{FFTW_BACKWARD} plan and vice versa; similarly
for the @code{FFTW_MPI_SCRAMBLED_OUT} and @code{FFTW_MPI_SCRAMBLED_IN}
flags.  @xref{One-dimensional distributions}.
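
For example, a forward 1d transform of size @code{n0} might allocate
its local input and output portions as in this sketch (the same
@code{sign} and @code{flags} must later be given to the planner):

@example
ptrdiff_t alloc_local, local_ni, local_i_start, local_no, local_o_start;
fftw_complex *in, *out;

alloc_local = fftw_mpi_local_size_1d(n0, MPI_COMM_WORLD,
                                     FFTW_FORWARD, FFTW_MEASURE,
                                     &local_ni, &local_i_start,
                                     &local_no, &local_o_start);
in = fftw_alloc_complex(alloc_local);
out = fftw_alloc_complex(alloc_local);
@end example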


@node MPI Plan Creation, MPI Wisdom Communication, MPI Data Distribution Functions, FFTW MPI Reference
@subsection MPI Plan Creation

@subsubheading Complex-data MPI DFTs

Plans for complex-data DFTs (@pxref{2d MPI example}) are created by:

@findex fftw_mpi_plan_dft_1d
@findex fftw_mpi_plan_dft_2d
@findex fftw_mpi_plan_dft_3d
@findex fftw_mpi_plan_dft
@findex fftw_mpi_plan_many_dft
@example
fftw_plan fftw_mpi_plan_dft_1d(ptrdiff_t n0, fftw_complex *in, fftw_complex *out,
                               MPI_Comm comm, int sign, unsigned flags);
fftw_plan fftw_mpi_plan_dft_2d(ptrdiff_t n0, ptrdiff_t n1,
                               fftw_complex *in, fftw_complex *out,
                               MPI_Comm comm, int sign, unsigned flags);
fftw_plan fftw_mpi_plan_dft_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                               fftw_complex *in, fftw_complex *out,
                               MPI_Comm comm, int sign, unsigned flags);
fftw_plan fftw_mpi_plan_dft(int rnk, const ptrdiff_t *n,
                            fftw_complex *in, fftw_complex *out,
                            MPI_Comm comm, int sign, unsigned flags);
fftw_plan fftw_mpi_plan_many_dft(int rnk, const ptrdiff_t *n,
                                 ptrdiff_t howmany, ptrdiff_t block, ptrdiff_t tblock,
                                 fftw_complex *in, fftw_complex *out,
                                 MPI_Comm comm, int sign, unsigned flags);
@end example

@cindex MPI communicator
@cindex collective function
These are similar to their serial counterparts (@pxref{Complex DFTs})
in specifying the dimensions, sign, and flags of the transform.  The
@code{comm} argument gives an MPI communicator that specifies the set
of processes to participate in the transform; plan creation is a
collective function that must be called for all processes in the
communicator.  The @code{in} and @code{out} pointers refer only to a
portion of the overall transform data (@pxref{MPI Data Distribution})
as specified by the @samp{local_size} functions in the previous
section.  Unless @code{flags} contains @code{FFTW_ESTIMATE}, these
arrays are overwritten during plan creation as for the serial
interface.  For multi-dimensional transforms, any dimensions @code{>
1} are supported; for one-dimensional transforms, only composite
(non-prime) @code{n0} are currently supported (unlike the serial
FFTW).  Requesting an unsupported transform size will yield a
@code{NULL} plan.  (As in the serial interface, highly composite sizes
generally yield the best performance.)

@cindex advanced interface
@ctindex FFTW_MPI_DEFAULT_BLOCK
@cindex stride
The advanced-interface @code{fftw_mpi_plan_many_dft} additionally
allows you to specify the block sizes for the first dimension
(@code{block}) of the @ndims{} input data and the first dimension
(@code{tblock}) of the @ndimstrans{} transposed data (at intermediate
steps of the transform, and for the output if
@code{FFTW_MPI_TRANSPOSED_OUT} is specified in @code{flags}).  These
must be the same block sizes as were passed to the corresponding
@samp{local_size} function; you can pass @code{FFTW_MPI_DEFAULT_BLOCK}
to use FFTW's default block size as in the basic interface.  Also, the
@code{howmany} parameter specifies that the transform is of contiguous
@code{howmany}-tuples rather than individual complex numbers; this
corresponds to the same parameter in the serial advanced interface
(@pxref{Advanced Complex DFTs}) with @code{stride = howmany} and
@code{dist = 1}.
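
For instance, a plan for two interleaved @code{n0} by @code{n1}
complex arrays, stored as contiguous pairs, might look like this
sketch (assuming @code{in} and @code{out} were allocated using
@code{fftw_mpi_local_size_many} with the same @code{howmany = 2}):

@example
const ptrdiff_t n[2] = @{n0, n1@};
fftw_plan plan = fftw_mpi_plan_many_dft(2, n, 2,
                        FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK,
                        in, out, MPI_COMM_WORLD,
                        FFTW_FORWARD, FFTW_MEASURE);
@end example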

@subsubheading MPI flags

The @code{flags} can be any of those for the serial FFTW
(@pxref{Planner Flags}), and in addition may include one or more of
the following MPI-specific flags, which improve performance at the
cost of changing the output or input data formats.

@itemize @bullet

@item
@ctindex FFTW_MPI_SCRAMBLED_OUT
@ctindex FFTW_MPI_SCRAMBLED_IN
@code{FFTW_MPI_SCRAMBLED_OUT}, @code{FFTW_MPI_SCRAMBLED_IN}: valid for
1d transforms only, these flags indicate that the output/input of the
transform are in an undocumented ``scrambled'' order.  A forward
@code{FFTW_MPI_SCRAMBLED_OUT} transform can be inverted by a backward
@code{FFTW_MPI_SCRAMBLED_IN} (times the usual 1/@i{N} normalization).
@xref{One-dimensional distributions}.

@item
@ctindex FFTW_MPI_TRANSPOSED_OUT
@ctindex FFTW_MPI_TRANSPOSED_IN
@code{FFTW_MPI_TRANSPOSED_OUT}, @code{FFTW_MPI_TRANSPOSED_IN}: valid
for multidimensional (@code{rnk > 1}) transforms only, these flags
specify that the output or input of an @ndims{} transform is
transposed to @ndimstrans{}.  @xref{Transposed distributions}.

@end itemize
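
As an illustration of @code{FFTW_MPI_TRANSPOSED_OUT}, the following
sketch (declarations omitted) plans a 2d transform whose output is
left transposed, pairing the flag with the corresponding
@samp{_transposed} allocation call:

@example
alloc_local = fftw_mpi_local_size_2d_transposed(n0, n1, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
data = fftw_alloc_complex(alloc_local);
plan = fftw_mpi_plan_dft_2d(n0, n1, data, data, MPI_COMM_WORLD,
                            FFTW_FORWARD,
                            FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);
/* output: local_n1 rows (starting at local_1_start) of the n1 x n0 array */
@end example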

@subsubheading Real-data MPI DFTs

@cindex r2c
Plans for real-input/output (r2c/c2r) DFTs (@pxref{Multi-dimensional
MPI DFTs of Real Data}) are created by:

@findex fftw_mpi_plan_dft_r2c_2d
@findex fftw_mpi_plan_dft_r2c_3d
@findex fftw_mpi_plan_dft_r2c
@findex fftw_mpi_plan_dft_c2r_2d
@findex fftw_mpi_plan_dft_c2r_3d
@findex fftw_mpi_plan_dft_c2r
@example
fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1,
                                   double *in, fftw_complex *out,
                                   MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_r2c_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                                   double *in, fftw_complex *out,
                                   MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_r2c(int rnk, const ptrdiff_t *n,
                                double *in, fftw_complex *out,
                                MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1,
                                   fftw_complex *in, double *out,
                                   MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_c2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                                   fftw_complex *in, double *out,
                                   MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_c2r(int rnk, const ptrdiff_t *n,
                                fftw_complex *in, double *out,
                                MPI_Comm comm, unsigned flags);
@end example

Similar to the serial interface (@pxref{Real-data DFTs}), these
transform logically @ndims{} real data to/from @ndimshalf{} complex
data, representing the non-redundant half of the conjugate-symmetry
output of a real-input DFT (@pxref{Multi-dimensional Transforms}).
However, the real array must be stored within a padded @ndimspad{}
array (much like the in-place serial r2c transforms, but here for
out-of-place transforms as well).  Currently, only multi-dimensional
(@code{rnk > 1}) r2c/c2r transforms are supported (requesting a plan
for @code{rnk = 1} will yield @code{NULL}).  As explained above
(@pxref{Multi-dimensional MPI DFTs of Real Data}), the data
distribution of both the real and complex arrays is given by the
@samp{local_size} function called for the dimensions of the
@emph{complex} array.  Similar to the other planning functions, the
input and output arrays are overwritten when the plan is created
except in @code{FFTW_ESTIMATE} mode.
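
For example, an out-of-place 2d r2c transform might be set up as in
the following sketch (declarations omitted): the @samp{local_size}
call is made with the @emph{complex} dimensions @code{n0} by
@code{n1/2+1}, and the real array is allocated with twice that many
real numbers to account for the padding:

@example
alloc_local = fftw_mpi_local_size_2d(n0, n1/2+1, MPI_COMM_WORLD,
                                     &local_n0, &local_0_start);
rin = fftw_alloc_real(2 * alloc_local);     /* padded real slab */
cout = fftw_alloc_complex(alloc_local);
plan = fftw_mpi_plan_dft_r2c_2d(n0, n1, rin, cout, MPI_COMM_WORLD,
                                FFTW_MEASURE);
@end example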

As for the complex DFTs above, there is an advanced interface that
allows you to manually specify block sizes and to transform contiguous
@code{howmany}-tuples of real/complex numbers:

@findex fftw_mpi_plan_many_dft_r2c
@findex fftw_mpi_plan_many_dft_c2r
@example
fftw_plan fftw_mpi_plan_many_dft_r2c
          (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
           ptrdiff_t iblock, ptrdiff_t oblock,
           double *in, fftw_complex *out,
           MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_many_dft_c2r
          (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
           ptrdiff_t iblock, ptrdiff_t oblock,
           fftw_complex *in, double *out,
           MPI_Comm comm, unsigned flags);
@end example

@subsubheading MPI r2r transforms

@cindex r2r
There are corresponding plan-creation routines for r2r
transforms (@pxref{More DFTs of Real Data}), currently supporting
multidimensional (@code{rnk > 1}) transforms only (@code{rnk = 1} will
yield a @code{NULL} plan):

@example
fftw_plan fftw_mpi_plan_r2r_2d(ptrdiff_t n0, ptrdiff_t n1,
                               double *in, double *out,
                               MPI_Comm comm,
                               fftw_r2r_kind kind0, fftw_r2r_kind kind1,
                               unsigned flags);
fftw_plan fftw_mpi_plan_r2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                               double *in, double *out,
                               MPI_Comm comm,
                               fftw_r2r_kind kind0, fftw_r2r_kind kind1, fftw_r2r_kind kind2,
                               unsigned flags);
fftw_plan fftw_mpi_plan_r2r(int rnk, const ptrdiff_t *n,
                            double *in, double *out,
                            MPI_Comm comm, const fftw_r2r_kind *kind,
                            unsigned flags);
fftw_plan fftw_mpi_plan_many_r2r(int rnk, const ptrdiff_t *n,
                                 ptrdiff_t howmany,
                                 ptrdiff_t iblock, ptrdiff_t oblock,
                                 double *in, double *out,
                                 MPI_Comm comm, const fftw_r2r_kind *kind,
                                 unsigned flags);
@end example

The parameters are much the same as for the complex DFTs above, except
that the arrays are of real numbers (and hence the outputs of the
@samp{local_size} data-distribution functions should be interpreted as
counts of real rather than complex numbers).  Also, the @code{kind}
parameters specify the r2r kinds along each dimension as for the
serial interface (@pxref{Real-to-Real Transform Kinds}).  @xref{Other
Multi-dimensional Real-data MPI Transforms}.
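
For example, a 2d discrete cosine transform (DCT-II along both
dimensions) could be planned as in this sketch, with @code{in} and
@code{out} each holding the number of @emph{real} values returned by
the corresponding @samp{local_size} call:

@example
plan = fftw_mpi_plan_r2r_2d(n0, n1, in, out, MPI_COMM_WORLD,
                            FFTW_REDFT10, FFTW_REDFT10,
                            FFTW_MEASURE);
@end example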

@subsubheading MPI transposition
@cindex transpose

FFTW also provides routines to plan a transpose of a distributed
@code{n0} by @code{n1} array of real numbers, or an array of
@code{howmany}-tuples of real numbers with specified block sizes
(@pxref{FFTW MPI Transposes}):

@findex fftw_mpi_plan_transpose
@findex fftw_mpi_plan_many_transpose
@example
fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
                                  double *in, double *out,
                                  MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_many_transpose
          (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
           ptrdiff_t block0, ptrdiff_t block1,
           double *in, double *out, MPI_Comm comm, unsigned flags);
@end example

@cindex new-array execution
@findex fftw_mpi_execute_r2r
These plans are used with the @code{fftw_mpi_execute_r2r} new-array
execute function (@pxref{Using MPI Plans}), since they count as (rank
zero) r2r plans from FFTW's perspective.
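
For example, a global transpose of a distributed @code{n0} by
@code{n1} real matrix might be planned and executed as in this sketch
(assuming @code{in} and @code{out} were sized according to
@code{fftw_mpi_local_size_2d_transposed}):

@example
plan = fftw_mpi_plan_transpose(n0, n1, in, out, MPI_COMM_WORLD,
                               FFTW_MEASURE);
fftw_execute(plan);
/* or, for freshly allocated arrays of the same size and alignment
   (hypothetical in2/out2):
   fftw_mpi_execute_r2r(plan, in2, out2);  */
@end example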

@node MPI Wisdom Communication, , MPI Plan Creation, FFTW MPI Reference
@subsection MPI Wisdom Communication

To facilitate synchronizing wisdom among the different MPI processes,
we provide two functions:

@findex fftw_mpi_gather_wisdom
@findex fftw_mpi_broadcast_wisdom
@example
void fftw_mpi_gather_wisdom(MPI_Comm comm);
void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
@end example

The @code{fftw_mpi_gather_wisdom} function gathers all wisdom in the
given communicator @code{comm} to the process of rank 0 in the
communicator: that process obtains the union of all wisdom on all the
processes.  As a side effect, some other processes will gain
additional wisdom from other processes, but only process 0 will gain
the complete union.

The @code{fftw_mpi_broadcast_wisdom} function does the reverse: it
exports wisdom from process 0 in @code{comm} to all other processes in
the communicator, replacing any wisdom they currently have.

@xref{FFTW MPI Wisdom}.

@c ------------------------------------------------------------
@node FFTW MPI Fortran Interface, , FFTW MPI Reference, Distributed-memory FFTW with MPI
@section FFTW MPI Fortran Interface
@cindex Fortran interface

@cindex iso_c_binding
The FFTW MPI interface is callable from modern Fortran compilers
supporting the Fortran 2003 @code{iso_c_binding} standard for calling
C functions.  As described in @ref{Calling FFTW from Modern Fortran},
this means that you can directly call FFTW's C interface from Fortran
with only minor changes in syntax.  There are, however, a few things
specific to the MPI interface to keep in mind:

@itemize @bullet

@item
Instead of including @code{fftw3.f03} as in @ref{Overview of Fortran
interface}, you should @code{include 'fftw3-mpi.f03'} (after
@code{use, intrinsic :: iso_c_binding} as before).  The
@code{fftw3-mpi.f03} file includes @code{fftw3.f03}, so you should
@emph{not} @code{include} them both yourself.  (You will also want to
include the MPI header file, usually via @code{include 'mpif.h'} or
similar, although this is not needed by @code{fftw3-mpi.f03}
@i{per se}.)  (To use the @samp{fftwl_} @code{long double}
extended-precision routines in supporting compilers, you should
include @code{fftw3l-mpi.f03} in @emph{addition} to
@code{fftw3-mpi.f03}.  @xref{Extended and quadruple precision in
Fortran}.)

@item
Because of the different storage conventions between C and Fortran,
you reverse the order of your array dimensions when passing them to
FFTW (@pxref{Reversing array dimensions}).  This is merely a
difference in notation and incurs no performance overhead.  However,
it means that, whereas in C the @emph{first} dimension is distributed,
in Fortran the @emph{last} dimension of your array is distributed.

@item
@cindex MPI communicator
In Fortran, communicators are stored as @code{integer} types; there is
no @code{MPI_Comm} type, nor is there any way to access a C
@code{MPI_Comm}.  Fortunately, this is taken care of for you by the
FFTW Fortran interface: whenever the C interface expects an
@code{MPI_Comm} type, you should pass the Fortran communicator as an
@code{integer}.@footnote{Technically, this is because you aren't
actually calling the C functions directly.  You are calling wrapper
functions that translate the communicator with @code{MPI_Comm_f2c}
before calling the ordinary C interface.  This is all done
transparently, however, since the @code{fftw3-mpi.f03} interface file
renames the wrappers so that they are called in Fortran with the same
names as the C interface functions.}

@item
Because you need to call the @samp{local_size} function to find out
how much space to allocate, and this may be @emph{larger} than the
local portion of the array (@pxref{MPI Data Distribution}), you should
@emph{always} allocate your arrays dynamically using FFTW's allocation
routines as described in @ref{Allocating aligned memory in Fortran}.
(Coincidentally, this also provides the best performance by
guaranteeing proper data alignment.)

@item
Because all sizes in the MPI FFTW interface are declared as
@code{ptrdiff_t} in C, you should use @code{integer(C_INTPTR_T)} in
Fortran (@pxref{FFTW Fortran type reference}).

@item
@findex fftw_execute_dft
@findex fftw_mpi_execute_dft
@cindex new-array execution
In Fortran, because of the language semantics, we generally recommend
using the new-array execute functions for all plans, even in the
common case where you are executing the plan on the same arrays for
which the plan was created (@pxref{Plan execution in Fortran}).
However, note that in the MPI interface these functions are changed:
@code{fftw_execute_dft} becomes @code{fftw_mpi_execute_dft},
etcetera.  @xref{Using MPI Plans}.

@end itemize

For example, here is a Fortran code snippet to perform a distributed
@twodims{L,M} complex DFT in-place.  (This assumes you have already
initialized MPI with @code{MPI_Init} and have also performed
@code{call fftw_mpi_init}.)

@example
  use, intrinsic :: iso_c_binding
  include 'fftw3-mpi.f03'
  integer(C_INTPTR_T), parameter :: L = ...
  integer(C_INTPTR_T), parameter :: M = ...
  type(C_PTR) :: plan, cdata
  complex(C_DOUBLE_COMPLEX), pointer :: data(:,:)
  integer(C_INTPTR_T) :: i, j, alloc_local, local_M, local_j_offset

! @r{get local data size and allocate (note dimension reversal)}
  alloc_local = fftw_mpi_local_size_2d(M, L, MPI_COMM_WORLD, &
                                       local_M, local_j_offset)
  cdata = fftw_alloc_complex(alloc_local)
  call c_f_pointer(cdata, data, [L,local_M])

! @r{create MPI plan for in-place forward DFT (note dimension reversal)}
  plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, &
                              FFTW_FORWARD, FFTW_MEASURE)

! @r{initialize data to some function} my_function(i,j)
  do j = 1, local_M
    do i = 1, L
      data(i, j) = my_function(i, j + local_j_offset)
    end do
  end do

! @r{compute transform (as many times as desired)}
  call fftw_mpi_execute_dft(plan, data, data)

  call fftw_destroy_plan(plan)
  call fftw_free(cdata)
@end example

Note that we called @code{fftw_mpi_local_size_2d} and
@code{fftw_mpi_plan_dft_2d} with the dimensions in reversed order,
since an @twodims{L,M} Fortran array is viewed by FFTW in C as a
@twodims{M, L} array.  This means that the array was distributed over
the @code{M} dimension, the local portion of which is an
@twodims{L,local_M} array in Fortran.  (You must @emph{not} use an
@code{allocate} statement to allocate an @twodims{L,local_M} array,
however; you must allocate @code{alloc_local} complex numbers, which
may be greater than @code{L * local_M}, in order to reserve space for
intermediate steps of the transform.)  Finally, we mention that
because C's array indices are zero-based, the @code{local_j_offset}
argument can conveniently be interpreted as an offset in the 1-based
@code{j} index (rather than as a starting index as in C).

If instead you had used the @code{ior(FFTW_MEASURE,
FFTW_MPI_TRANSPOSED_OUT)} flag, the output of the transform would be a
transposed @twodims{M,local_L} array, associated with the @emph{same}
@code{cdata} allocation (since the transform is in-place), and which
you could declare with:

@example
  complex(C_DOUBLE_COMPLEX), pointer :: tdata(:,:)
  ...
  call c_f_pointer(cdata, tdata, [M,local_L])
@end example

where @code{local_L} would have been obtained by changing the
@code{fftw_mpi_local_size_2d} call to:

@example
alloc_local = fftw_mpi_local_size_2d_transposed(M, L, MPI_COMM_WORLD, &
                         local_M, local_j_offset, local_L, local_i_offset)
@end example