Next: Multi-dimensional MPI DFTs of Real Data, Previous: 2d MPI example, Up: Distributed-memory FFTW with MPI
The most important concept to understand in using FFTW’s MPI interface is the data distribution. With a serial or multithreaded FFT, all of the inputs and outputs are stored as a single contiguous chunk of memory. With a distributed-memory FFT, the inputs and outputs are broken into disjoint blocks, one per process.
In particular, FFTW uses a 1d block distribution of the data, distributed along the first dimension. For example, if you want to perform a 100 × 200 complex DFT, distributed over 4 processes, each process will get a 25 × 200 slice of the data. That is, process 0 will get rows 0 through 24, process 1 will get rows 25 through 49, process 2 will get rows 50 through 74, and process 3 will get rows 75 through 99. If you take the same array but distribute it over 3 processes, then the 100 rows are not evenly divisible by 3, so the different processes will have unequal chunks. FFTW’s default choice in this case is to assign 34 rows to processes 0 and 1, and 32 rows to process 2.
FFTW provides several ‘fftw_mpi_local_size’ routines that you can call to find out what portion of an array is stored on the current process. In most cases, you should use the default block sizes picked by FFTW, but it is also possible to specify your own block size. For example, with a 100 × 200 array on three processes, you can tell FFTW to use a block size of 40, which would assign 40 rows to processes 0 and 1, and 20 rows to process 2. FFTW’s default is to divide the data equally among the processes if possible, and as best it can otherwise. The rows are always assigned in “rank order,” i.e. process 0 gets the first block of rows, then process 1, and so on. (You can change this by using MPI_Comm_split to create a new communicator with re-ordered processes.) However, you should always call the ‘fftw_mpi_local_size’ routines, if possible, rather than trying to predict FFTW’s distribution choices.
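The row counts quoted above (25 rows each on 4 processes, 34/34/32 on 3 processes, and 40/40/20 with a block size of 40) can be reproduced with a small sketch of the block arithmetic. The helper names default_block, block_start, and block_rows are hypothetical and are not part of FFTW. The sketch also assumes that the default block size is the row count divided by the number of processes, rounded up, which is consistent with the figures above but is an assumption here. It is meant only to illustrate the layout; real programs should still obtain these values from the ‘fftw_mpi_local_size’ routines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not an FFTW function.
   Assumed default block size: ceil(n / n_procs). */
static ptrdiff_t default_block(ptrdiff_t n, int n_procs)
{
    assert(n >= 0 && n_procs > 0);
    return (n + n_procs - 1) / n_procs;
}

/* First row owned by `rank` when blocks are handed out in rank order.
   Clamped to n so that trailing ranks with no data start at the end. */
static ptrdiff_t block_start(ptrdiff_t n, ptrdiff_t block, int rank)
{
    assert(n >= 0 && block > 0 && rank >= 0);
    ptrdiff_t start = block * rank;
    return start < n ? start : n;
}

/* Number of rows owned by `rank`; the last non-empty block may be short,
   and ranks past the end of the array own zero rows. */
static ptrdiff_t block_rows(ptrdiff_t n, ptrdiff_t block, int rank)
{
    ptrdiff_t start = block_start(n, block, rank);
    ptrdiff_t end = start + block;
    if (end > n)
        end = n;
    return end - start;
}
```

For instance, block_rows(100, default_block(100, 3), 2) evaluates to 32 with a starting row of 68, and block_rows(100, 40, 2) evaluates to 20 with a starting row of 80.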
In particular, it is critical that you allocate the storage size that is returned by ‘fftw_mpi_local_size’, which is not necessarily the size of the local slice of the array. The reason is that intermediate steps of FFTW’s algorithms involve transposing the array and redistributing the data, so at these intermediate steps FFTW may require more local storage space (albeit always proportional to the total size divided by the number of processes). The ‘fftw_mpi_local_size’ functions know how much storage is required for these intermediate steps and tell you the correct amount to allocate.
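To see how a transposed intermediate layout could need more room than the local slice, consider the following toy model. It is not FFTW’s actual calculation: the helpers rows_on_rank and toy_alloc_estimate are hypothetical, and the model assumes a block size of the dimension divided by the number of processes, rounded up, and considers only one transposed layout. The value to allocate in a real program is whatever ‘fftw_mpi_local_size’ returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rows of an n-row dimension owned by `rank`,
   assuming a block size of ceil(n / n_procs) assigned in rank order. */
static ptrdiff_t rows_on_rank(ptrdiff_t n, int n_procs, int rank)
{
    assert(n >= 0 && n_procs > 0 && rank >= 0 && rank < n_procs);
    ptrdiff_t block = (n + n_procs - 1) / n_procs;
    ptrdiff_t start = block * rank;
    if (start > n)
        start = n;
    ptrdiff_t end = start + block;
    if (end > n)
        end = n;
    return end - start;
}

/* Toy estimate (an assumption of this sketch, not FFTW's formula):
   enough elements to hold either the local slice of the n0 x n1 array
   or the local slice of its n1 x n0 transpose, whichever is larger. */
static ptrdiff_t toy_alloc_estimate(ptrdiff_t n0, ptrdiff_t n1,
                                    int n_procs, int rank)
{
    ptrdiff_t slice = rows_on_rank(n0, n_procs, rank) * n1;
    ptrdiff_t transposed = rows_on_rank(n1, n_procs, rank) * n0;
    return slice > transposed ? slice : transposed;
}
```

Under this model, process 2 of 3 holds a 32 × 200 slice (6400 elements) of a 100 × 200 array, but a 66 × 100 slice (6600 elements) of the transposed array, so toy_alloc_estimate(100, 200, 3, 2) evaluates to 6600, more than the local slice alone.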
• Basic and advanced distribution interfaces
• Load balancing
• Transposed distributions
• One-dimensional distributions