In certain cases, it may be advantageous to combine MPI
(distributed-memory) and threads (shared-memory) parallelization.
FFTW supports this, with certain caveats. For example, if you have a
cluster of 4-processor shared-memory nodes, you may want to use
threads within the nodes and MPI between the nodes, instead of MPI for
all parallelization. FFTW's MPI code can also transparently use
FFTW's Cell processor support (e.g. for clusters of Cell processors).

In particular, it is possible to seamlessly combine the MPI FFTW
routines with the multi-threaded FFTW routines (see Multi-threaded FFTW).
In this case, you will begin your program by calling both
fftw_mpi_init() and fftw_init_threads(). Then, if you call
fftw_plan_with_nthreads(N), every MPI process will launch N threads
to parallelize its transforms.

For example, in the hypothetical cluster of 4-processor nodes, you
might wish to launch only a single MPI process per node, and then call
fftw_plan_with_nthreads(4) on each process to use all processors in
the nodes.

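The arithmetic behind this choice can be sketched with a small helper. The function threads_per_process and its parameter names are hypothetical, introduced only for illustration. It assumes the processors of a node are divided evenly among the MPI processes launched on that node.

```c
#include <assert.h>

/* Hypothetical helper: number of threads each MPI process should request
   so that the processes on one node together use all of its processors.
   Never returns less than 1, even if the node is oversubscribed. */
int threads_per_process(int processors_per_node, int mpi_processes_per_node)
{
    int n;

    assert(processors_per_node > 0);
    assert(mpi_processes_per_node > 0);

    /* Integer division: any leftover processors are simply not used. */
    n = processors_per_node / mpi_processes_per_node;
    return n > 0 ? n : 1;
}
```

With 4-processor nodes, one MPI process per node gives threads_per_process(4, 1) == 4, the value passed to fftw_plan_with_nthreads in the example above, whereas four MPI processes per node gives 1, which corresponds to using MPI for all parallelization.
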
This may or may not be faster than simply using as many MPI processes
as you have processors, however. On the one hand, using threads within a
node eliminates the need for explicit message passing within the node.
On the other hand, FFTW's transpose routines are not multi-threaded,
and this means that the communications that do take place will not
benefit from parallelization within the node. Moreover, many MPI
implementations already have optimizations to exploit shared memory
when it is available.

(Note that this is quite independent of whether MPI itself is
thread-safe or multi-threaded: regardless of how many threads you
specify with fftw_plan_with_nthreads, FFTW will perform all of
its MPI communication only from the parent process.)