Combining MPI and Threads - FFTW 3.2alpha3

6.11 Combining MPI and Threads

In certain cases, it may be advantageous to combine MPI (distributed-memory) and threads (shared-memory) parallelization. FFTW supports this, with certain caveats. For example, if you have a cluster of 4-processor shared-memory nodes, you may want to use threads within the nodes and MPI between the nodes, instead of MPI for all parallelization. FFTW's MPI code can also transparently use FFTW's Cell processor support (e.g. for clusters of Cell processors).

In particular, it is possible to seamlessly combine the MPI FFTW routines with the multi-threaded FFTW routines (see Multi-threaded FFTW). In this case, you begin your program by calling both fftw_mpi_init() and fftw_init_threads(). If you then call fftw_plan_with_nthreads(N), every MPI process will launch N threads to parallelize its transforms.

For example, in the hypothetical cluster of 4-processor nodes, you might launch only a single MPI process per node, and then call fftw_plan_with_nthreads(4) on each process to use all processors in the node.

This may or may not be faster than simply using as many MPI processes as you have processors, however. On the one hand, using threads within a node eliminates the need for explicit message passing within the node. On the other hand, FFTW's transpose routines are not multi-threaded, so the communications that do take place will not benefit from parallelization within the node. Moreover, many MPI implementations already have optimizations to exploit shared memory when it is available.

(Note that this is quite independent of whether MPI itself is thread-safe or multi-threaded: regardless of how many threads you specify with fftw_plan_with_nthreads, FFTW will perform all of its MPI communication only from the parent process, not from the threads it launches.)