



6.4.1 Basic and advanced distribution interfaces


As with the planner interface, the `fftw_mpi_local_size' distribution interface is broken into basic and advanced (`_many') interfaces, where the latter allows you to specify the block size manually and also to request block sizes when computing multiple transforms simultaneously. These functions are documented more exhaustively by the FFTW MPI Reference, but we summarize the basic ideas here using a couple of two-dimensional examples.

For the 100 × 200 complex-DFT example, above, we would find the distribution by calling the following function in the basic interface:

     ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
                                      ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
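As a hedged sketch (not part of the manual's own examples), a typical call in a complete program might look like the following, where the 100 × 200 size and the use of MPI_COMM_WORLD are merely illustrative; the meaning of the outputs is explained below:

     #include <mpi.h>
     #include <fftw3-mpi.h>

     int main(int argc, char **argv)
     {
          ptrdiff_t local_n0, local_0_start, alloc_local;
          fftw_complex *data;

          MPI_Init(&argc, &argv);
          fftw_mpi_init();

          /* ask FFTW how the 100 x 200 array is distributed over the
             processes in MPI_COMM_WORLD */
          alloc_local = fftw_mpi_local_size_2d(100, 200, MPI_COMM_WORLD,
                                               &local_n0, &local_0_start);

          /* allocate the number of elements returned (not local_n0 * 200;
             see below), ideally with fftw_malloc for optimal alignment */
          data = fftw_malloc(sizeof(fftw_complex) * alloc_local);

          /* ... initialize the local rows, create and execute a plan ... */

          fftw_free(data);
          MPI_Finalize();
          return 0;
     }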

Given the total size of the data to be transformed (here, n0 = 100 and n1 = 200) and an MPI communicator (comm), this function provides three numbers.

First, it describes the shape of the local data: the current process should store a local_n0 by n1 slice of the overall dataset, in row-major order (n1 dimension contiguous), starting at index local_0_start. That is, if the total dataset is viewed as an n0 by n1 matrix, the current process should store the rows local_0_start to local_0_start+local_n0-1. Obviously, if you are running with only a single MPI process, that process will store the entire array: local_0_start will be zero and local_n0 will be n0. See Row-major Format.

Second, the return value is the total number of data elements (e.g., complex numbers for a complex DFT) that should be allocated for the input and output arrays on the current process (ideally with fftw_malloc, to ensure optimal alignment). It might seem that this should always be equal to local_n0 * n1, but this is not the case. FFTW's distributed FFT algorithms require data redistributions at intermediate stages of the transform, and in some circumstances this may require slightly larger local storage. This is discussed in more detail below, under Load balancing.

The advanced-interface `local_size' function for multidimensional transforms returns the same three things (local_n0, local_0_start, and the total number of elements to allocate), but takes more inputs:

     ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n,
                                        ptrdiff_t howmany,
                                        ptrdiff_t block0,
                                        MPI_Comm comm,
                                        ptrdiff_t *local_n0,
                                        ptrdiff_t *local_0_start);

The two-dimensional case above corresponds to rnk = 2 and an array n of length 2 with n[0] = n0 and n[1] = n1. This routine is for any rnk > 1; one-dimensional transforms have their own interface because they work slightly differently, as discussed below.
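For instance, a hedged sketch of the advanced-interface call corresponding to the basic 100 × 200 example above (with a single transform and FFTW's default block size, both of which are described next) might be:

     ptrdiff_t n[2] = {100, 200};   /* n[0] = n0, n[1] = n1 */
     ptrdiff_t local_n0, local_0_start;
     ptrdiff_t alloc_local =
          fftw_mpi_local_size_many(2, n, 1, FFTW_MPI_DEFAULT_BLOCK,
                                   MPI_COMM_WORLD,
                                   &local_n0, &local_0_start);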

First, the advanced interface allows you to perform multiple transforms at once, of interleaved data, as specified by the howmany parameter. (howmany is 1 for a single transform.)

Second, here you can specify your desired block size in the n0 dimension, block0. To use FFTW's default block size, pass FFTW_MPI_DEFAULT_BLOCK (0) for block0. Otherwise, on P processes, FFTW will return local_n0 equal to block0 on the first P / block0 processes (rounded down), return local_n0 equal to n0 - block0 * (P / block0) on the next process, and local_n0 equal to zero on any remaining processes. In general, we recommend using the default block size (which corresponds to n0 / P, rounded up).

For example, suppose you have P = 4 processes and n0 = 21. The default will be a block size of 6, which will give local_n0 = 6 on the first three processes and local_n0 = 3 on the last process. Instead, however, you could specify block0 = 5 if you wanted, which would give local_n0 = 5 on processes 0 to 2, local_n0 = 6 on process 3. (This choice, while it may look superficially more “balanced,” has the same critical path as FFTW's default but requires more communications.)
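To make these parameters concrete, here is a hedged sketch (the 21 × 40 size, the two interleaved transforms, and the explicit block size of 6 are illustrative assumptions, not taken from the manual) in which each process reports its share of the n0 = 21 rows:

     #include <stdio.h>
     #include <mpi.h>
     #include <fftw3-mpi.h>

     int main(int argc, char **argv)
     {
          ptrdiff_t n[2] = {21, 40};   /* n0 = 21, n1 = 40 */
          ptrdiff_t howmany = 2;       /* two interleaved transforms */
          ptrdiff_t block0 = 6;        /* same as the default block for P = 4 */
          ptrdiff_t local_n0, local_0_start, alloc_local;
          int rank;

          MPI_Init(&argc, &argv);
          fftw_mpi_init();
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          alloc_local = fftw_mpi_local_size_many(2, n, howmany, block0,
                                                 MPI_COMM_WORLD,
                                                 &local_n0, &local_0_start);

          /* each process reports its slice of the n0 dimension and the total
             number of elements it should allocate (the return value) */
          printf("rank %d: local_0_start = %ld, local_n0 = %ld, allocate %ld\n",
                 rank, (long) local_0_start, (long) local_n0, (long) alloc_local);

          MPI_Finalize();
          return 0;
     }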