In this section, we collect a few tips on getting the best performance
out of FFTW's MPI transforms.

First, because of the 1d block distribution, FFTW's parallelization is
currently limited by the size of the first dimension.
(Multidimensional block distributions may be supported by a future
version.)  More generally, you should ideally arrange the dimensions so
that FFTW can divide them equally among the processes.  See Load
balancing.
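
For example, on P processes a three-dimensional transform of size
n0 x n1 x n2 is split into slabs of roughly n0/P planes each, so it
helps to choose n0 to be a multiple of P.  The following sketch (the
sizes and the printed diagnostics are ours, purely for illustration)
queries the slab assigned to each process with fftw_mpi_local_size_3d:

     #include <fftw3-mpi.h>
     #include <stdio.h>

     int main(int argc, char **argv)
     {
          const ptrdiff_t n0 = 256, n1 = 256, n2 = 256; /* hypothetical sizes */
          ptrdiff_t alloc_local, local_n0, local_0_start;
          int rank;

          MPI_Init(&argc, &argv);
          fftw_mpi_init();
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          /* how many n1 x n2 planes does this process own? */
          alloc_local = fftw_mpi_local_size_3d(n0, n1, n2, MPI_COMM_WORLD,
                                               &local_n0, &local_0_start);

          /* if n0 is not divisible by the number of processes, some
             processes receive a larger slab (or none at all), and the
             most heavily loaded one limits the overall speed */
          printf("rank %d: %td planes starting at %td (alloc %td)\n",
                 rank, local_n0, local_0_start, alloc_local);

          fftw_mpi_cleanup();
          MPI_Finalize();
          return 0;
     }
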
Second, if it is not too inconvenient, you should consider working
with transposed output for multidimensional plans, as this saves a
considerable amount of communication.  See Transposed distributions.
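
Concretely (the array and variable names here are ours, for
illustration), a plan with transposed output is created by adding the
FFTW_MPI_TRANSPOSED_OUT flag and sizing the array with the
transposed-aware local-size routine; the output is then left in
n1 x n0 x n2 order, with the first (n1) dimension distributed across
the processes, and the final global transposition is skipped:

     const ptrdiff_t n0 = 256, n1 = 256, n2 = 256; /* hypothetical sizes */
     ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
     fftw_complex *data;
     fftw_plan plan;

     /* local sizes for both the input layout (first dimension
        distributed) and the transposed output layout */
     alloc_local = fftw_mpi_local_size_3d_transposed(
          n0, n1, n2, MPI_COMM_WORLD,
          &local_n0, &local_0_start, &local_n1, &local_1_start);
     data = fftw_alloc_complex(alloc_local);

     /* in-place forward transform whose output remains transposed */
     plan = fftw_mpi_plan_dft_3d(n0, n1, n2, data, data, MPI_COMM_WORLD,
                                 FFTW_FORWARD,
                                 FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);

An inverse transform that accepts this transposed layout directly can
be planned with FFTW_MPI_TRANSPOSED_IN, so that a forward/backward
pair avoids the extra transpositions entirely.
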
Third, the fastest choices are generally either an in-place transform
or an out-of-place transform with the FFTW_DESTROY_INPUT flag
(which allows the input array to be used as scratch space).  In-place
is especially beneficial if the amount of data per process is large.
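
A brief sketch of these two choices (sizes and names again hypothetical):

     const ptrdiff_t n0 = 256, n1 = 256, n2 = 256; /* hypothetical sizes */
     ptrdiff_t local_n0, local_0_start;
     ptrdiff_t alloc_local = fftw_mpi_local_size_3d(n0, n1, n2, MPI_COMM_WORLD,
                                                    &local_n0, &local_0_start);
     fftw_complex *in  = fftw_alloc_complex(alloc_local);
     fftw_complex *out = fftw_alloc_complex(alloc_local);

     /* choice 1: in-place transform (the output overwrites the input),
        especially attractive when the per-process data is large */
     fftw_plan p_inplace =
          fftw_mpi_plan_dft_3d(n0, n1, n2, in, in, MPI_COMM_WORLD,
                               FFTW_FORWARD, FFTW_MEASURE);

     /* choice 2: out-of-place transform that is allowed to overwrite
        the input array as scratch space */
     fftw_plan p_scratch =
          fftw_mpi_plan_dft_3d(n0, n1, n2, in, out, MPI_COMM_WORLD,
                               FFTW_FORWARD,
                               FFTW_MEASURE | FFTW_DESTROY_INPUT);
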
Fourth, if you have multiple arrays to transform at once, rather than
calling FFTW's MPI transforms several times it usually seems to be
faster to interleave the data and use the advanced interface.  (This
groups the communications together instead of requiring separate
messages for each transform.)
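
For instance (the number of transforms here is an arbitrary choice of
ours), the advanced interface can plan howmany interleaved transforms
in one call; the transform index becomes the last, contiguous
dimension of the local arrays, and all the transforms share a single
communication phase:

     const ptrdiff_t n[3] = {256, 256, 256};  /* hypothetical 3d size */
     const ptrdiff_t howmany = 4;             /* e.g. four fields at once */
     ptrdiff_t alloc_local, local_n0, local_0_start;
     fftw_complex *data;
     fftw_plan plan;

     /* local storage for all howmany interleaved transforms */
     alloc_local = fftw_mpi_local_size_many(3, n, howmany,
                                            FFTW_MPI_DEFAULT_BLOCK,
                                            MPI_COMM_WORLD,
                                            &local_n0, &local_0_start);
     data = fftw_alloc_complex(alloc_local);

     /* one plan, and one set of messages, for all four transforms */
     plan = fftw_mpi_plan_many_dft(3, n, howmany,
                                   FFTW_MPI_DEFAULT_BLOCK,
                                   FFTW_MPI_DEFAULT_BLOCK,
                                   data, data, MPI_COMM_WORLD,
                                   FFTW_FORWARD, FFTW_MEASURE);

The transforms are interleaved element by element, with the transform
index as the fastest-varying dimension, and a single fftw_execute(plan)
then performs all four transforms at once.
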