6.11 Combining MPI and Threads


In certain cases, it may be advantageous to combine MPI (distributed-memory) and threads (shared-memory) parallelization. FFTW supports this, with certain caveats. For example, if you have a cluster of 4-processor shared-memory nodes, you may want to use threads within the nodes and MPI between the nodes, instead of MPI for all parallelization.

In particular, it is possible to seamlessly combine the MPI FFTW routines with the multi-threaded FFTW routines (see Multi-threaded FFTW). However, some care must be taken in the initialization code, which should look something like this:

     #include <mpi.h>
     #include <fftw3-mpi.h>   /* MPI FFTW interface; also includes fftw3.h */
     
     int threads_ok;          /* nonzero if the FFTW threads routines may be used */
     
     int main(int argc, char **argv)
     {
         int provided;
         MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
         threads_ok = provided >= MPI_THREAD_FUNNELED;
     
         if (threads_ok) threads_ok = fftw_init_threads();
         fftw_mpi_init();     /* must come after fftw_init_threads */
     
         ...
         if (threads_ok) fftw_plan_with_nthreads(...);
         ...
     
         MPI_Finalize();
     }

First, note that instead of calling MPI_Init, you should call MPI_Init_thread, which is the initialization routine defined by the MPI-2 standard to indicate to MPI that your program will be multithreaded. We pass MPI_THREAD_FUNNELED, which indicates that we will only call MPI routines from the main thread. (FFTW will launch additional threads internally, but the extra threads will not call MPI code.) (You may also pass MPI_THREAD_SERIALIZED or MPI_THREAD_MULTIPLE, which request additional multithreading support from the MPI implementation, but this is not required by FFTW.) The provided parameter returns the level of thread support actually provided by your MPI implementation; this must be at least MPI_THREAD_FUNNELED if you want to call the FFTW threads routines, so we define a global variable threads_ok to record this. You should only call fftw_init_threads or fftw_plan_with_nthreads if threads_ok is true. For more information on thread safety in MPI, see the MPI and Threads section of the MPI-2 standard.
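
As an aside, the same test can be performed after the fact if MPI was initialized elsewhere, say by a host application that calls MPI_Init_thread before handing control to your code. The following sketch uses the standard MPI-2 routine MPI_Query_thread, which reports the thread level granted at initialization; the helper name mpi_threads_ok is hypothetical:

     #include <mpi.h>
     #include <fftw3-mpi.h>
     
     /* Hypothetical helper: decide whether the FFTW threads routines may
        be used, given whatever thread level MPI was initialized with. */
     static int mpi_threads_ok(void)
     {
         int provided;
         MPI_Query_thread(&provided);   /* level granted at MPI init time */
         return provided >= MPI_THREAD_FUNNELED;
     }

As before, fftw_init_threads and fftw_plan_with_nthreads should be called only if this test succeeds.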

Second, we must call fftw_init_threads before fftw_mpi_init. This is critical for technical reasons having to do with how FFTW initializes its list of algorithms.

Then, if you call fftw_plan_with_nthreads(N), every MPI process will launch (up to) N threads to parallelize its transforms.

For example, in the hypothetical cluster of 4-processor nodes, you might wish to launch only a single MPI process per node, and then call fftw_plan_with_nthreads(4) on each process to use all processors in the nodes.
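
For concreteness, here is a minimal sketch of that setup, assuming a hypothetical 1024 x 1024 in-place complex DFT, one MPI process per node, and 4 threads per process (error checking omitted):

     #include <mpi.h>
     #include <fftw3-mpi.h>
     
     int main(int argc, char **argv)
     {
         const ptrdiff_t N0 = 1024, N1 = 1024;
         ptrdiff_t alloc_local, local_n0, local_0_start;
         fftw_complex *data;
         fftw_plan plan;
         int provided, threads_ok;
     
         MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
         threads_ok = provided >= MPI_THREAD_FUNNELED;
         if (threads_ok) threads_ok = fftw_init_threads();
         fftw_mpi_init();
         if (threads_ok) fftw_plan_with_nthreads(4); /* 4 threads per MPI process */
     
         /* each process stores local_n0 contiguous rows of the N0 x N1 array */
         alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                              &local_n0, &local_0_start);
         data = fftw_alloc_complex(alloc_local);
     
         plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
                                     FFTW_FORWARD, FFTW_ESTIMATE);
     
         /* ... initialize the local rows of data, then: */
         fftw_execute(plan);
     
         fftw_destroy_plan(plan);
         fftw_free(data);
         MPI_Finalize();
     }

Such a program is typically compiled with your MPI compiler wrapper and linked against both the MPI and threads FFTW libraries, e.g. -lfftw3_mpi -lfftw3_threads -lfftw3 -lm.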

This may or may not be faster than simply using as many MPI processes as you have processors, however. On the one hand, using threads within a node eliminates the need for explicit message passing within the node. On the other hand, FFTW's transpose routines are not multi-threaded, and this means that the communications that do take place will not benefit from parallelization within the node. Moreover, many MPI implementations already have optimizations to exploit shared memory when it is available, so adding the multithreaded FFTW on top of this may be superfluous.