The most important concept to understand in using FFTW’s MPI interface is the data distribution. With a serial or multithreaded FFT, all of the inputs and outputs are stored as a single contiguous chunk of memory. With a distributed-memory FFT, the inputs and outputs are broken into disjoint blocks, one per process.
In particular, FFTW uses a 1d block distribution of the data, distributed along the first dimension. For example, if you want to perform a 100 × 200 complex DFT, distributed over 4 processes, each process will get a 25 × 200 slice of the data. That is, process 0 will get rows 0 through 24, process 1 will get rows 25 through 49, process 2 will get rows 50 through 74, and process 3 will get rows 75 through 99. If you take the same array but distribute it over 3 processes, then it is not evenly divisible, so the different processes will have unequal chunks. FFTW’s default choice in this case is to assign 34 rows to processes 0 and 1, and 32 rows to process 2.
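Purely to illustrate the arithmetic of this default splitting (this is not FFTW’s own code, and in real programs you should instead query the distribution with the routines described below), the following sketch reproduces the 25/25/25/25 and 34/34/32 splits above; the ceil(n0 / nproc) block formula is an assumption consistent with those numbers:

#include <stdio.h>

/* Hypothetical illustration of the default block splitting of the
   first dimension (n0 rows) over nproc processes, assuming a block
   size of ceil(n0 / nproc) assigned in rank order. */
int main(void)
{
    const int n0 = 100;
    for (int nproc = 3; nproc <= 4; ++nproc) {
        int block = (n0 + nproc - 1) / nproc;  /* ceil(n0 / nproc) */
        printf("%d processes, block size %d:\n", nproc, block);
        for (int p = 0; p < nproc; ++p) {
            int start = p * block;
            int rows  = start >= n0 ? 0 :
                        (start + block > n0 ? n0 - start : block);
            printf("  process %d: rows %d..%d (%d rows)\n",
                   p, start, start + rows - 1, rows);
        }
    }
    return 0;
}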
FFTW provides several ‘fftw_mpi_local_size’ routines that you can call to find out what portion of an array is stored on the current process. In most cases, you should use the default block sizes picked by FFTW, but it is also possible to specify your own block size. For example, with a 100 × 200 array on three processes, you can tell FFTW to use a block size of 40, which would assign 40 rows to processes 0 and 1, and 20 rows to process 2. FFTW’s default is to divide the data equally among the processes if possible, and as best it can otherwise. The rows are always assigned in “rank order,” i.e. process 0 gets the first block of rows, then process 1, and so on. (You can change this by using MPI_Comm_split to create a new communicator with re-ordered processes.) However, you should always call the ‘fftw_mpi_local_size’ routines, if possible, rather than trying to predict FFTW’s distribution choices.
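As a concrete sketch of such a query for the 100 × 200 complex-DFT case above (the surrounding MPI setup is the usual boilerplate, not something prescribed by this section), one can call fftw_mpi_local_size_2d and inspect the result on each process:

#include <fftw3-mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    ptrdiff_t alloc_local, local_n0, local_0_start;
    int rank;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ask FFTW which rows of a 100 x 200 complex array this process
       owns under the default block distribution, and how much storage
       it must allocate locally. */
    alloc_local = fftw_mpi_local_size_2d(100, 200, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);

    printf("process %d: %td rows starting at row %td "
           "(allocate %td complex numbers)\n",
           rank, local_n0, local_0_start, alloc_local);

    fftw_mpi_cleanup();
    MPI_Finalize();
    return 0;
}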
In particular, it is critical that you allocate the storage size that is returned by ‘fftw_mpi_local_size’, which is not necessarily the size of the local slice of the array. The reason is that intermediate steps of FFTW’s algorithms involve transposing the array and redistributing the data, so at these intermediate steps FFTW may require more local storage space (albeit always proportional to the total size divided by the number of processes). The ‘fftw_mpi_local_size’ functions know how much storage is required for these intermediate steps and tell you the correct amount to allocate.
The following subsections cover the data-distribution interfaces in more detail:

• Basic and advanced distribution interfaces
• Load balancing
• Transposed distributions
• One-dimensional distributions