Basic and advanced distribution interfaces


6.4.1 Basic and advanced distribution interfaces

As with the planner interface, the ‘fftw_mpi_local_size’ distribution interface is broken into basic and advanced (‘_many’) interfaces, where the latter allows you to specify the block size manually and also to request block sizes when computing multiple transforms simultaneously. These functions are documented more exhaustively by the FFTW MPI Reference, but we summarize the basic ideas here using a couple of two-dimensional examples.

For the 100 × 200 complex-DFT example above, we would find the distribution by calling the two-dimensional function of the basic interface, fftw_mpi_local_size_2d(n0, n1, comm, &local_n0, &local_0_start).

Given the total size of the data to be transformed (here, n0 = 100 and n1 = 200) and an MPI communicator (comm), this function provides three numbers.

First, it describes the shape of the local data: the current process should store a local_n0 by n1 slice of the overall dataset, in row-major order (n1 dimension contiguous), starting at index local_0_start. That is, if the total dataset is viewed as an n0 by n1 matrix, the current process should store the rows local_0_start to local_0_start+local_n0-1. Obviously, if you are running with only a single MPI process, that process will store the entire array: local_0_start will be zero and local_n0 will be n0. See Row-major Format.
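The row-major layout described above can be sketched as a small index calculation. This is only an illustration: local_offset is a hypothetical helper, not an FFTW function, and it assumes a single transform with no interleaving.

```c
#include <stddef.h>

/* Hypothetical helper (not part of FFTW): offset of global element (i, j)
   within this process's local slice, which holds rows
   local_0_start .. local_0_start + local_n0 - 1 in row-major order.
   Returns -1 if row i is not stored on this process. */
static ptrdiff_t local_offset(ptrdiff_t i, ptrdiff_t j, ptrdiff_t n1,
                              ptrdiff_t local_n0, ptrdiff_t local_0_start)
{
    if (i < local_0_start || i >= local_0_start + local_n0)
        return -1;                        /* row lives on another process */
    return (i - local_0_start) * n1 + j;  /* n1 dimension is contiguous */
}
```

For instance, if a process holds 25 rows starting at row 25 of the 100 × 200 array, global element (26, 3) sits at local offset 1 * 200 + 3 = 203, while any element of row 10 belongs to a different process.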

Second, the return value is the total number of data elements (e.g., complex numbers for a complex DFT) that should be allocated for the input and output arrays on the current process (ideally with fftw_malloc or an ‘fftw_alloc’ function, to ensure optimal alignment). It might seem that this should always be equal to local_n0 * n1, but this is not the case. FFTW's distributed FFT algorithms require data redistributions at intermediate stages of the transform, and in some circumstances this may require slightly larger local storage. This is discussed in more detail below, under Load balancing.

The advanced-interface ‘local_size’ function for multidimensional transforms, fftw_mpi_local_size_many, returns the same three things (local_n0, local_0_start, and the total number of elements to allocate), but takes more inputs: rnk, n, howmany, and block0, in addition to comm.

The two-dimensional case above corresponds to rnk = 2 and an array n of length 2 with n[0] = n0 and n[1] = n1. This routine is for any rnk > 1; one-dimensional transforms have their own interface because they work slightly differently, as discussed below.

First, the advanced interface allows you to perform multiple transforms at once, of interleaved data, as specified by the howmany parameter. (howmany is 1 for a single transform.)
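To make "interleaved" concrete, here is a minimal sketch of the indexing it implies, assuming that the howmany values belonging to one grid point are stored next to each other, so the transform index varies fastest. interleaved_offset is a hypothetical helper, not an FFTW function.

```c
#include <stddef.h>

/* Hypothetical helper (not part of FFTW): position of the value belonging
   to transform k (0 <= k < howmany) at local grid point (i_local, j),
   assuming the howmany transforms are interleaved so that k varies fastest. */
static ptrdiff_t interleaved_offset(ptrdiff_t i_local, ptrdiff_t j,
                                    ptrdiff_t k, ptrdiff_t n1,
                                    ptrdiff_t howmany)
{
    return (i_local * n1 + j) * howmany + k;
}
```

With howmany = 1 and k = 0 this reduces to the plain row-major offset i_local * n1 + j.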

Second, here you can specify your desired block size in the n0 dimension, block0. To use FFTW's default block size, pass FFTW_MPI_DEFAULT_BLOCK (0) for block0. Otherwise, on P processes, FFTW will return local_n0 equal to block0 on the first n0 / block0 processes (rounded down), return local_n0 equal to n0 - block0 * (n0 / block0) on the next process, and local_n0 equal to zero on any remaining processes. In general, we recommend using the default block size (which corresponds to n0 / P, rounded up).
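The block rule can be sketched in a few lines of C. This is an illustration, not FFTW's implementation: default_block and sketch_local_n0 are hypothetical helpers, and the sketch assumes that full blocks of block0 rows are handed out in rank order until the rows run out, with the next process taking the remainder.

```c
#include <stddef.h>

/* Hypothetical helper: the default block size, n0 / P rounded up. */
static ptrdiff_t default_block(ptrdiff_t n0, int P)
{
    return (n0 + P - 1) / P;
}

/* Hypothetical helper: number of rows assigned to process `rank` when
   blocks of block0 rows are handed out in rank order. */
static ptrdiff_t sketch_local_n0(ptrdiff_t n0, ptrdiff_t block0, int rank)
{
    ptrdiff_t remaining = n0 - (ptrdiff_t)rank * block0;
    if (remaining <= 0)
        return 0;                                    /* no rows left */
    return remaining < block0 ? remaining : block0;  /* full or partial block */
}
```

For n0 = 21 on P = 4 processes, default_block gives 6, and sketch_local_n0 then yields 6, 6, 6, and 3 rows for ranks 0 to 3. A larger block such as 7 would leave the last process with no rows at all.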

For example, suppose you have P = 4 processes and n0 = 21. The default will be a block size of 6, which will give local_n0 = 6 on the first three processes and local_n0 = 3 on the last process. Instead, however, you could specify block0 = 5 if you wanted, which would give local_n0 = 5 on processes 0 to 2 and local_n0 = 6 on process 3. (This choice, while it may look superficially more “balanced,” has the same critical path as FFTW's default but requires more communications.)