In this section, we collect a few tips on getting the best performance out of FFTW's MPI transforms.
First, because of the 1d block distribution, FFTW's parallelization is currently limited by the size of the first dimension. (Multidimensional block distributions may be supported by a future version.) More generally, you should ideally arrange the dimensions so that FFTW can divide them equally among the processes. See Load balancing.
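For example, for a two-dimensional n0 x n1 complex DFT, one quick way to see how the load is balanced is to query each process's share of the first dimension. This is only a sketch: the sizes and communicator are assumptions, and the reporting helper is purely illustrative, but fftw_mpi_local_size_2d is the standard call for obtaining the local slab.

     #include <stdio.h>
     #include <fftw3-mpi.h>

     /* Illustrative helper: print each process's slab of the first
        dimension.  If n0 is a multiple of the number of processes,
        every slab is the same size and the load is balanced. */
     void report_balance(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm)
     {
          ptrdiff_t local_n0, local_0_start;
          int rank;
          MPI_Comm_rank(comm, &rank);
          fftw_mpi_local_size_2d(n0, n1, comm, &local_n0, &local_0_start);
          printf("rank %d: %td rows starting at %td\n",
                 rank, local_n0, local_0_start);
     }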
Second, if it is not too inconvenient, you should consider working with transposed output for multidimensional plans, as this saves a considerable amount of communication. See Transposed distributions.
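As an illustration, the sketch below plans a two-dimensional complex DFT with transposed output. The sizes, communicator, and function name are assumptions; in a complete program the arrays would be allocated using the sizes returned by fftw_mpi_local_size_2d_transposed.

     #include <fftw3-mpi.h>

     /* Sketch: with FFTW_MPI_TRANSPOSED_OUT the result is left
        distributed over the second dimension (an n1 x n0 transposed
        array), which skips the final global transpose and its
        communication. */
     fftw_plan make_transposed_plan(ptrdiff_t n0, ptrdiff_t n1,
                                    fftw_complex *in, fftw_complex *out)
     {
          return fftw_mpi_plan_dft_2d(n0, n1, in, out, MPI_COMM_WORLD,
                                      FFTW_FORWARD,
                                      FFTW_ESTIMATE | FFTW_MPI_TRANSPOSED_OUT);
     }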
Third, the fastest choices are generally either an in-place transform or an out-of-place transform with the FFTW_DESTROY_INPUT flag (which allows the input array to be used as scratch space). In-place is especially beneficial if the amount of data per process is large.
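For instance, a sketch of these two choices for a two-dimensional complex DFT might look as follows; the function names are illustrative, and allocation via fftw_mpi_local_size_2d and the MPI setup are omitted.

     #include <fftw3-mpi.h>

     /* In-place: pass the same array as both input and output. */
     fftw_plan plan_inplace(ptrdiff_t n0, ptrdiff_t n1, fftw_complex *data)
     {
          return fftw_mpi_plan_dft_2d(n0, n1, data, data, MPI_COMM_WORLD,
                                      FFTW_FORWARD, FFTW_ESTIMATE);
     }

     /* Out-of-place, but allowing FFTW to overwrite the input array
        and use it as scratch space. */
     fftw_plan plan_destroy_input(ptrdiff_t n0, ptrdiff_t n1,
                                  fftw_complex *in, fftw_complex *out)
     {
          return fftw_mpi_plan_dft_2d(n0, n1, in, out, MPI_COMM_WORLD,
                                      FFTW_FORWARD,
                                      FFTW_ESTIMATE | FFTW_DESTROY_INPUT);
     }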
Fourth, if you have multiple arrays to transform at once, rather than calling FFTW's MPI transforms several times it usually seems to be faster to interleave the data and use the advanced interface. (This groups the communications together instead of requiring separate messages for each transform.)
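A sketch of this approach for a batch of howmany interleaved n0 x n1 complex transforms, using the advanced interface, could look like the following; the batching function name is an assumption, and the interleaved data layout follows the advanced-interface convention.

     #include <fftw3-mpi.h>

     /* Plan `howmany' transforms at once; FFTW interleaves them so that
        all communication for the batch is grouped into the same set of
        messages. */
     fftw_plan plan_batch(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
                          fftw_complex **data_out)
     {
          const ptrdiff_t n[2] = {n0, n1};
          ptrdiff_t local_n0, local_0_start;
          ptrdiff_t alloc = fftw_mpi_local_size_many(2, n, howmany,
                                                     FFTW_MPI_DEFAULT_BLOCK,
                                                     MPI_COMM_WORLD,
                                                     &local_n0,
                                                     &local_0_start);
          fftw_complex *data = fftw_alloc_complex(alloc);
          *data_out = data;  /* caller frees with fftw_free */
          return fftw_mpi_plan_many_dft(2, n, howmany,
                                        FFTW_MPI_DEFAULT_BLOCK,
                                        FFTW_MPI_DEFAULT_BLOCK,
                                        data, data, MPI_COMM_WORLD,
                                        FFTW_FORWARD, FFTW_ESTIMATE);
     }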