In this section, we collect a few tips on getting the best performance out of FFTW’s MPI transforms.
First, because of the 1d block distribution, FFTW’s parallelization is currently limited by the size of the first dimension. (Multidimensional block distributions may be supported by a future version.) More generally, you should ideally arrange the dimensions so that FFTW can divide them equally among the processes. See Load balancing.
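To make the first tip concrete, the sketch below estimates how many rows of the first dimension each process would own under a 1d block distribution. It assumes the block size is n0 divided by the number of processes, rounded up; the helper names `block_local_n0` and `idle_processes` are hypothetical and not part of FFTW. In a real program the authoritative values come from the `fftw_mpi_local_size` family of functions.

```c
#include <assert.h>
#include <stddef.h>

/* Rows of the first dimension owned by `rank`, ASSUMING a 1d block
   distribution with block size ceil(n0 / nprocs). Hypothetical helper,
   not an FFTW function. */
ptrdiff_t block_local_n0(ptrdiff_t n0, int nprocs, int rank)
{
    assert(n0 > 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    ptrdiff_t block = (n0 + nprocs - 1) / nprocs; /* ceil(n0 / nprocs) */
    ptrdiff_t start = block * rank;               /* first row owned by rank */
    if (start >= n0)
        return 0;                                 /* nothing left for this rank */
    ptrdiff_t remaining = n0 - start;
    return remaining < block ? remaining : block;
}

/* Number of processes that end up with no rows at all. */
int idle_processes(ptrdiff_t n0, int nprocs)
{
    int idle = 0;
    for (int r = 0; r < nprocs; ++r)
        if (block_local_n0(n0, nprocs, r) == 0)
            ++idle;
    return idle;
}
```

Under this assumption, a first dimension of 128 on 8 processes gives every process 16 rows, while a first dimension of 4 on 8 processes leaves half of the processes without any rows, which is the sense in which the first dimension limits the parallelization.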
Second, if it is not too inconvenient, you should consider working with transposed output for multidimensional plans, as this saves a considerable amount of communications. See Transposed distributions.
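Working with transposed output mostly means indexing the result differently. The sketch below shows the local index of global element (i, j) of a 2d complex n0 × n1 transform in the normal layout and in the transposed layout, where the output is stored as an n1 × n0 array distributed along n1. The helper names are hypothetical; the start offsets are assumed to be the `local_0_start` and `local_1_start` values reported by `fftw_mpi_local_size_2d_transposed`, and the plan is assumed to have been created with the `FFTW_MPI_TRANSPOSED_OUT` flag.

```c
#include <assert.h>
#include <stddef.h>

/* Normal layout: this process holds rows [local_0_start, local_0_start + local_n0)
   of the n0 x n1 array, stored row-major. Hypothetical helper. */
ptrdiff_t local_index_normal(ptrdiff_t i, ptrdiff_t j,
                             ptrdiff_t n1, ptrdiff_t local_0_start)
{
    assert(i >= local_0_start && j >= 0 && j < n1);
    return (i - local_0_start) * n1 + j;
}

/* Transposed layout: this process holds rows [local_1_start, local_1_start + local_n1)
   of the n1 x n0 array, stored row-major. Hypothetical helper. */
ptrdiff_t local_index_transposed(ptrdiff_t i, ptrdiff_t j,
                                 ptrdiff_t n0, ptrdiff_t local_1_start)
{
    assert(j >= local_1_start && i >= 0 && i < n0);
    return (j - local_1_start) * n0 + i;
}
```

If the code that consumes the transform output can loop over j in the outer loop and i in the inner loop, it can use the transposed layout directly and the final transposition back to the original distribution is not needed.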
Third, the fastest choices are generally either an in-place transform or an out-of-place transform with the FFTW_DESTROY_INPUT flag (which allows the input array to be used as scratch space). In-place is especially beneficial if the amount of data per process is large.
Fourth, if you have multiple arrays to transform at once, rather than calling FFTW’s MPI transforms several times, it usually seems to be faster to interleave the data and use the advanced interface. (This groups the communications together instead of requiring separate messages for each transform.)
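As a rough illustration of the interleaving step, the sketch below packs `howmany` separate local arrays into one buffer in which the values belonging to the same array position are contiguous. That is the layout the advanced interface (for example `fftw_mpi_plan_many_dft` with its `howmany` parameter) is understood to expect. The `cplx` typedef mirrors FFTW's default `fftw_complex` layout of two doubles, and the `interleave` and `deinterleave` helpers are hypothetical, not FFTW functions. Here `n` is assumed to be the number of local elements of each array on this process.

```c
#include <assert.h>
#include <stddef.h>

typedef double cplx[2]; /* same layout as FFTW's default fftw_complex */

/* Pack `howmany` arrays of `n` elements each into one buffer so that
   dst[e * howmany + k] holds element e of array k. Hypothetical helper. */
void interleave(cplx *dst, cplx *const *src, size_t howmany, size_t n)
{
    assert(dst != NULL && src != NULL && howmany > 0);
    for (size_t e = 0; e < n; ++e)
        for (size_t k = 0; k < howmany; ++k) {
            dst[e * howmany + k][0] = src[k][e][0]; /* real part */
            dst[e * howmany + k][1] = src[k][e][1]; /* imaginary part */
        }
}

/* Inverse of interleave: unpack the buffer back into separate arrays. */
void deinterleave(cplx *const *dst, const cplx *src, size_t howmany, size_t n)
{
    assert(dst != NULL && src != NULL && howmany > 0);
    for (size_t e = 0; e < n; ++e)
        for (size_t k = 0; k < howmany; ++k) {
            dst[k][e][0] = src[e * howmany + k][0];
            dst[k][e][1] = src[e * howmany + k][1];
        }
}
```

If the application can store its fields interleaved from the start, the packing and unpacking copies disappear entirely. The interleaved buffer should be allocated with the size reported by the matching advanced `local_size` function rather than a size computed by hand.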