As with the planner interface, the ‘fftw_mpi_local_size’ distribution interface is broken into basic and advanced (‘_many’) interfaces, where the latter allows you to specify the block size manually and also to request block sizes when computing multiple transforms simultaneously. These functions are documented more exhaustively by the FFTW MPI Reference, but we summarize the basic ideas here using a couple of two-dimensional examples.
For the 100 × 200 complex-DFT example, above, we would find the distribution by calling the following function in the basic interface:
     ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
                                      ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
Given the total size of the data to be transformed (here, n0 = 100 and n1 = 200) and an MPI communicator (comm), this function provides three numbers.
First, it describes the shape of the local data: the current process should store a local_n0 by n1 slice of the overall dataset, in row-major order (n1 dimension contiguous), starting at index local_0_start. That is, if the total dataset is viewed as an n0 by n1 matrix, the current process should store the rows local_0_start to local_0_start+local_n0-1. Obviously, if you are running with only a single MPI process, that process will store the entire array: local_0_start will be zero and local_n0 will be n0. See Row-major Format.
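The row mapping can be made concrete with a small helper. The following is only a sketch, not part of FFTW: the name global_to_local_index is hypothetical, and it assumes that local_n0 and local_0_start were already obtained from fftw_mpi_local_size_2d. It returns the offset of global element (i, j) within the local row-major slice, or -1 if row i is stored on another process.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an FFTW function): map global indices (i, j)
   of an n0 x n1 row-major array to an offset in this process's slice.
   local_n0 and local_0_start are assumed to come from
   fftw_mpi_local_size_2d. Returns -1 if row i is stored elsewhere. */
ptrdiff_t global_to_local_index(ptrdiff_t i, ptrdiff_t j, ptrdiff_t n1,
                                ptrdiff_t local_n0, ptrdiff_t local_0_start)
{
    assert(n1 > 0 && j >= 0 && j < n1);
    if (i < local_0_start || i >= local_0_start + local_n0)
        return -1;                        /* row owned by another process */
    return (i - local_0_start) * n1 + j;  /* n1 dimension is contiguous */
}
```

With the illustrative values n1 = 200, local_0_start = 25, and local_n0 = 25, global element (26, 3) sits at local offset 1 * 200 + 3 = 203.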
Second, the return value is the total number of data elements (e.g., complex numbers for a complex DFT) that should be allocated for the input and output arrays on the current process (ideally with fftw_malloc or an ‘fftw_alloc’ function, to ensure optimal alignment). It might seem that this should always be equal to local_n0 * n1, but this is not the case. FFTW's distributed FFT algorithms require data redistributions at intermediate stages of the transform, and in some circumstances this may require slightly larger local storage. This is discussed in more detail below, under Load balancing.
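In practice this means sizing the buffer from the return value while looping only over the local_n0 by n1 slice. The sketch below illustrates that pattern in plain C. It is not FFTW code: fill_local_slice is a hypothetical helper, alloc_local stands for the value returned by fftw_mpi_local_size_2d, double complex stands in for the complex element type, and the pattern it writes is arbitrary. Elements beyond local_n0 * n1 are left untouched, on the assumption that they serve only as extra working space.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical helper (not an FFTW function): initialize the local slice
   of a distributed n0 x n1 array. The buffer holds alloc_local elements,
   a count assumed to come from fftw_mpi_local_size_2d, which may exceed
   local_n0 * n1. Only the first local_n0 * n1 elements are input data. */
void fill_local_slice(double complex *data, ptrdiff_t alloc_local,
                      ptrdiff_t n1, ptrdiff_t local_n0,
                      ptrdiff_t local_0_start)
{
    assert(alloc_local >= local_n0 * n1);
    for (ptrdiff_t i = 0; i < local_n0; ++i) {
        for (ptrdiff_t j = 0; j < n1; ++j) {
            /* Arbitrary pattern: real part = global row, imaginary = column. */
            data[i * n1 + j] = (double)(local_0_start + i) + (double)j * I;
        }
    }
}
```

The loop bounds use local_n0 and n1, while the allocation (not shown) would use alloc_local; the assertion documents that the two are allowed to differ.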
The advanced-interface ‘local_size’ function for multidimensional transforms returns the same three things (local_n0, local_0_start, and the total number of elements to allocate), but takes more inputs:
     ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n,
                                        ptrdiff_t howmany,
                                        ptrdiff_t block0,
                                        MPI_Comm comm,
                                        ptrdiff_t *local_n0,
                                        ptrdiff_t *local_0_start);
The two-dimensional case above corresponds to rnk = 2 and an array n of length 2 with n[0] = n0 and n[1] = n1. This routine is for any rnk > 1; one-dimensional transforms have their own interface because they work slightly differently, as discussed below.
First, the advanced interface allows you to perform multiple transforms at once, of interleaved data, as specified by the howmany parameter. (howmany is 1 for a single transform.)
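To make "interleaved" concrete: we take it to mean that the howmany transforms are stored as contiguous howmany-tuples, so that element (i, j) of transform k immediately follows element (i, j) of transform k - 1 in memory. The helper below is a sketch under that assumption; interleaved_index is a hypothetical name, not an FFTW function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an FFTW function): offset of element
   (i_local, j) belonging to transform k, assuming the howmany transforms
   are interleaved as contiguous howmany-tuples in a row-major slice. */
ptrdiff_t interleaved_index(ptrdiff_t i_local, ptrdiff_t j, ptrdiff_t k,
                            ptrdiff_t n1, ptrdiff_t howmany)
{
    assert(howmany >= 1 && k >= 0 && k < howmany);
    assert(n1 > 0 && i_local >= 0 && j >= 0 && j < n1);
    return (i_local * n1 + j) * howmany + k;
}
```

With howmany = 1 this reduces to the ordinary row-major offset i_local * n1 + j.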
Second, here you can specify your desired block size in the n0 dimension, block0. To use FFTW's default block size, pass FFTW_MPI_DEFAULT_BLOCK (0) for block0. Otherwise, on P processes, FFTW will return local_n0 equal to block0 on the first n0 / block0 processes (rounded down), return local_n0 equal to n0 - block0 * (n0 / block0) on the next process, and local_n0 equal to zero on any remaining processes. In general, we recommend using the default block size (which corresponds to n0 / P, rounded up).
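The block rule can be checked with a few lines of arithmetic. The sketch below is a model, not FFTW's implementation: block_distribution is a hypothetical helper, it reads the rule as giving full blocks to the first n0 / block0 processes (rounded down), which is the reading that matches the P = 4, n0 = 21 default figures in this section, and it treats block0 = 0 as a request for the default block size of n0 / P rounded up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model (not FFTW's code) of the block distribution along n0.
   block0 == 0 selects the default block size, n0 / P rounded up.
   Writes the number of rows and the first row owned by process `rank`. */
void block_distribution(ptrdiff_t n0, ptrdiff_t block0, int P, int rank,
                        ptrdiff_t *local_n0, ptrdiff_t *local_0_start)
{
    assert(n0 > 0 && P > 0 && rank >= 0 && rank < P && block0 >= 0);
    if (block0 == 0)
        block0 = (n0 + P - 1) / P;          /* default: ceil(n0 / P) */

    ptrdiff_t full = n0 / block0;           /* processes with a full block */
    ptrdiff_t rest = n0 - block0 * full;    /* rows left for the next one */
    /* Assumed precondition of this model: all rows fit on P processes. */
    assert(full + (rest > 0 ? 1 : 0) <= P);

    if (rank < full) {
        *local_n0 = block0;
        *local_0_start = rank * block0;
    } else if (rank == full && rest > 0) {
        *local_n0 = rest;
        *local_0_start = rank * block0;
    } else {
        *local_n0 = 0;
        *local_0_start = 0;  /* arbitrary choice here for empty processes */
    }
}
```

With n0 = 21 and P = 4, passing block0 = 0 in this model yields 6, 6, 6, and 3 rows on ranks 0 through 3, starting at rows 0, 6, 12, and 18.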
For example, suppose you have P = 4 processes and n0 = 21. The default will be a block size of 6, which will give local_n0 = 6 on the first three processes and local_n0 = 3 on the last process. Instead, however, you could specify block0 = 5 if you wanted, which would give local_n0 = 5 on processes 0 to 2, local_n0 = 6 on process 3. (This choice, while it may look superficially more “balanced,” has the same critical path as FFTW's default but requires more communications.)