6.11 Combining MPI and Threads


In certain cases, it may be advantageous to combine MPI (distributed-memory) and threads (shared-memory) parallelization. FFTW supports this, with certain caveats. For example, if you have a cluster of 4-processor shared-memory nodes, you may want to use threads within the nodes and MPI between the nodes, instead of MPI for all parallelization.

In particular, it is possible to seamlessly combine the MPI FFTW routines with the multi-threaded FFTW routines (see Multi-threaded FFTW). However, some care must be taken in the initialization code, which should look something like this:

     #include <fftw3-mpi.h>   /* also includes <fftw3.h> and <mpi.h> */
     
     int threads_ok;          /* global flag: may we use the FFTW threads routines? */
     
     int main(int argc, char **argv)
     {
         int provided;
         MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
         threads_ok = provided >= MPI_THREAD_FUNNELED;
     
         if (threads_ok) threads_ok = fftw_init_threads();
         fftw_mpi_init();
     
         ...
         if (threads_ok) fftw_plan_with_nthreads(...);
         ...
     
         MPI_Finalize();
     }

First, note that instead of calling MPI_Init, you should call MPI_Init_thread, which is the initialization routine defined by the MPI-2 standard to indicate to MPI that your program will be multithreaded. We pass MPI_THREAD_FUNNELED, which indicates that we will only call MPI routines from the main thread. (FFTW will launch additional threads internally, but the extra threads will not call MPI code.) You may also pass MPI_THREAD_SERIALIZED or MPI_THREAD_MULTIPLE, which request additional multithreading support from the MPI implementation, but this is not required by FFTW. The provided parameter returns the level of thread support actually provided by your MPI implementation; this must be at least MPI_THREAD_FUNNELED if you want to call the FFTW threads routines, so we define a global variable threads_ok to record this. You should only call fftw_init_threads or fftw_plan_with_nthreads if threads_ok is true. For more information on thread safety in MPI, see the MPI and Threads section of the MPI-2 standard.

Second, we must call fftw_init_threads before fftw_mpi_init. This is critical for technical reasons having to do with how FFTW initializes its list of algorithms.

Then, if you call fftw_plan_with_nthreads(N), every MPI process will launch (up to) N threads to parallelize its transforms.

For example, in the hypothetical cluster of 4-processor nodes, you might wish to launch only a single MPI process per node, and then call fftw_plan_with_nthreads(4) on each process to use all processors in the nodes.
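
As a concrete illustration, here is a minimal sketch of such a one-process-per-node setup for an in-place 2-d complex DFT. The 256 x 256 transform size and the fixed count of 4 threads per process are merely assumptions chosen to match the example above (in practice you might compute the thread count at run time); the FFTW calls themselves (fftw_mpi_local_size_2d, fftw_mpi_plan_dft_2d, and so on) are the ordinary FFTW MPI interface, with fftw_plan_with_nthreads added before planning. You would link against the MPI, threads, and serial FFTW libraries (typically -lfftw3_mpi -lfftw3_threads -lfftw3).

     #include <fftw3-mpi.h>   /* also includes <fftw3.h> and <mpi.h> */
     
     int main(int argc, char **argv)
     {
         const ptrdiff_t N0 = 256, N1 = 256;   /* hypothetical transform size */
         ptrdiff_t alloc_local, local_n0, local_0_start;
         fftw_complex *data;
         fftw_plan plan;
         int provided, threads_ok;
     
         MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
         threads_ok = provided >= MPI_THREAD_FUNNELED;
         if (threads_ok) threads_ok = fftw_init_threads();
         fftw_mpi_init();
     
         /* 4 threads per MPI process, as in the 4-processor-node example */
         if (threads_ok) fftw_plan_with_nthreads(4);
     
         /* find this process's slab of the distributed array and allocate it */
         alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                              &local_n0, &local_0_start);
         data = fftw_alloc_complex(alloc_local);
     
         /* plan an in-place forward DFT; each process's work is multi-threaded */
         plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
                                     FFTW_FORWARD, FFTW_ESTIMATE);
     
         /* ... initialize data[0 .. local_n0*N1-1] here ... */
     
         fftw_execute(plan);
     
         fftw_destroy_plan(plan);
         fftw_free(data);
         MPI_Finalize();
         return 0;
     }

Apart from the fftw_plan_with_nthreads call, this is the same as a pure-MPI FFTW program: the threading is internal to each process's plan creation and execution, so no further changes are required.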

This may or may not be faster than simply using as many MPI processes as you have processors, however. On the one hand, using threads within a node eliminates the need for explicit message passing within the node. On the other hand, FFTW's transpose routines are not multi-threaded, and this means that the communications that do take place will not benefit from parallelization within the node. Moreover, many MPI implementations already have optimizations to exploit shared memory when it is available, so adding the multithreaded FFTW on top of this may be superfluous.