The most important concept to understand in using FFTW's MPI interface
is the data distribution.  With a serial or multithreaded FFT, all of
the inputs and outputs are stored as a single contiguous chunk of
memory.  With a distributed-memory FFT, the inputs and outputs are
broken into disjoint blocks, one per process.

In particular, FFTW uses a 1d block distribution of the data,
distributed along the first dimension.  For example, if you want to
perform a 100 × 200 complex DFT, distributed over 4 processes, each
process will get a 25 × 200 slice of the data.  That is, process 0
will get rows 0 through 24, process 1 will get rows 25 through 49,
process 2 will get rows 50 through 74, and process 3 will get rows 75
through 99.  If you take the same array but distribute it over 3
processes, then it is not evenly divisible and the different processes
will have unequal chunks.  FFTW's default choice in this case is to
assign 34 rows to processes 0 and 1, and 32 rows to process 2.
FFTW provides several `fftw_mpi_local_size' routines that you can
call to find out what portion of an array is stored on the current
process.  In most cases, you should use the default block sizes picked
by FFTW, but it is also possible to specify your own block size.  For
example, with a 100 × 200 array on three processes, you can tell FFTW
to use a block size of 40, which would assign 40 rows to processes 0
and 1, and 20 rows to process 2.  FFTW's default is to divide the data
equally among the processes if possible, and as best it can otherwise.
The rows are always assigned in “rank order,” i.e. process 0 gets the
first block of rows, then process 1, and so on.  (You can change this
by using `MPI_Comm_split' to create a new communicator with re-ordered
processes.)  However, you should always call the
`fftw_mpi_local_size' routines, if possible, rather than trying to
predict FFTW's distribution choices.
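If you do want to specify your own block size, the advanced
`fftw_mpi_local_size_many' routine takes an explicit block argument
for the first dimension (the special value `FFTW_MPI_DEFAULT_BLOCK'
requests FFTW's default).  As a sketch, the call below requests the
40/40/20 row split mentioned above for a 100 × 200 array on 3
processes:

     #include <fftw3-mpi.h>
     #include <stdio.h>

     int main(int argc, char **argv)
     {
         const ptrdiff_t n[2] = {100, 200};
         ptrdiff_t alloc_local, local_n0, local_0_start;
         int rank;

         MPI_Init(&argc, &argv);
         fftw_mpi_init();
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);

         /* ask for a block size of 40 rows rather than the default
            (FFTW_MPI_DEFAULT_BLOCK): on 3 processes this yields
            40, 40, and 20 rows for processes 0, 1, and 2 */
         alloc_local = fftw_mpi_local_size_many(2, n, 1 /* howmany */,
                                                40 /* block0 */,
                                                MPI_COMM_WORLD,
                                                &local_n0, &local_0_start);

         printf("process %d: %d rows, starting at row %d (alloc %d)\n",
                rank, (int) local_n0, (int) local_0_start,
                (int) alloc_local);

         MPI_Finalize();
         return 0;
     }

Note that if you use a non-default block size when querying the
distribution, the corresponding advanced plan-creation routine
(e.g. `fftw_mpi_plan_many_dft') takes matching block arguments, so the
plan and the allocation must agree.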