6 Distributed-memory FFTW with MPI


In this chapter we document the parallel FFTW routines for parallel systems supporting the MPI message-passing interface. Unlike the shared-memory threads described in the previous chapter, MPI allows you to use distributed-memory parallelism, where each CPU has its own separate memory, and which can scale up to clusters of many thousands of processors. This capability comes at a price, however: each process only stores a portion of the data to be transformed, which means that the data structures and programming interface are quite different from the serial or threads versions of FFTW.
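As a rough preview of how the interface differs, a minimal sketch of a program that computes a two-dimensional complex DFT distributed over all processes might look as follows. The array dimensions and initialization values are arbitrary placeholders, and each of the fftw_mpi_ calls (initialization, local-size query, and distributed planning) is documented in the sections that follow.

#include <fftw3-mpi.h>

int main(int argc, char **argv)
{
    const ptrdiff_t N0 = 256, N1 = 256;   /* global array size (placeholder) */
    fftw_plan plan;
    fftw_complex *data;
    ptrdiff_t alloc_local, local_n0, local_0_start, i, j;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* ask FFTW how much of the array this process stores, and allocate it */
    alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);
    data = fftw_alloc_complex(alloc_local);

    /* plan an in-place forward DFT of the distributed array */
    plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
                                FFTW_FORWARD, FFTW_ESTIMATE);

    /* initialize only the locally stored rows (placeholder values) */
    for (i = 0; i < local_n0; ++i)
        for (j = 0; j < N1; ++j) {
            data[i*N1 + j][0] = local_0_start + i + j;  /* real part */
            data[i*N1 + j][1] = 0.0;                    /* imaginary part */
        }

    fftw_execute(plan);   /* compute the transform */

    fftw_destroy_plan(plan);
    fftw_mpi_cleanup();
    MPI_Finalize();
    return 0;
}

Such a program is typically linked against both the MPI and serial FFTW libraries (e.g. -lfftw3_mpi -lfftw3) and launched with mpirun or mpiexec.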


Distributed-memory parallelism is especially useful when you are transforming arrays so large that they do not fit into the memory of a single processor. The storage per process required by FFTW’s MPI routines is proportional to the total array size divided by the number of processes. Conversely, distributed-memory parallelism can easily pose an unacceptably high communications overhead for small problems; the threshold problem size for which parallelism becomes advantageous will depend on the precise problem you are interested in, your hardware, and your MPI implementation.
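For instance, one can ask FFTW how much of a global N0 × N1 array a given process will store, before allocating anything. A minimal sketch (the dimensions are placeholder values; the local-size routines are described later in this chapter):

#include <stdio.h>
#include <fftw3-mpi.h>

int main(int argc, char **argv)
{
    const ptrdiff_t N0 = 1024, N1 = 1024;  /* global array size (placeholder) */
    ptrdiff_t alloc_local, local_n0, local_0_start;
    int rank;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* how many of the N0 rows, starting where, does this process store? */
    alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);

    printf("rank %d: rows %ld..%ld, %ld complex values of local storage\n",
           rank, (long) local_0_start,
           (long) (local_0_start + local_n0 - 1), (long) alloc_local);

    fftw_mpi_cleanup();
    MPI_Finalize();
    return 0;
}

Run on P processes, each rank reports roughly N0/P of the rows, i.e. roughly N0*N1/P complex values of storage.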


A note on terminology: in MPI, you divide the data among a set of “processes” which each run in their own memory address space. Generally, each process runs on a different physical processor, but this is not required. A set of processes in MPI is described by an opaque data structure called a “communicator,” the most common of which is the predefined communicator MPI_COMM_WORLD, which refers to all processes. For more information on these and other concepts common to all MPI programs, we refer the reader to the documentation at the MPI home page.
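For readers new to MPI, the following sketch shows the bare boilerplate common to MPI programs, independent of FFTW: initializing MPI, querying the rank (index) and size of the predefined MPI_COMM_WORLD communicator, and shutting MPI down again.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start up MPI */

    /* MPI_COMM_WORLD is the predefined communicator of all processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's index, 0..size-1 */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    printf("process %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down MPI */
    return 0;
}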


We assume in this chapter that the reader is familiar with the usage of the serial (uniprocessor) FFTW, and focus only on the concepts new to the MPI interface.
