FFTW 3.3.8: One-dimensional distributions

For one-dimensional distributed DFTs using FFTW, matters are slightly more complicated because the data distribution is more closely tied to how the algorithm works. In particular, you can no longer pass an arbitrary block size and must accept FFTW’s default; the block sizes may also differ between input and output. In addition, the data distribution depends on the flags and transform direction, in order for forward and backward transforms to work correctly.


ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm,
                int sign, unsigned flags,
                ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
                ptrdiff_t *local_no, ptrdiff_t *local_o_start);

This function computes the data distribution for a 1d transform of size n0 with the given transform sign and flags. Both input and output data use block distributions. The input on the current process will consist of local_ni numbers starting at index local_i_start; e.g. if only a single process is used, then local_ni will be n0 and local_i_start will be 0. Similarly for the output, with local_no numbers starting at index local_o_start. The return value of fftw_mpi_local_size_1d will be the total number of elements to allocate on the current process (which might be slightly larger than the local size due to intermediate steps in the algorithm).

As mentioned above (see Load balancing), the data will be divided equally among the processes if n0 is divisible by the square of the number of processes. In this case, local_ni will equal local_no. Otherwise, they may be different.
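The divisibility condition can be checked ahead of time with plain integer arithmetic. The helpers below are hypothetical (not FFTW functions) and only restate the rule given above; the actual distribution should always be taken from fftw_mpi_local_size_1d.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 when n0 is divisible by the square of the
   number of processes, i.e. when the text above says the data are divided
   equally; returns 0 otherwise. */
static int splits_equally(ptrdiff_t n0, int nprocs)
{
    assert(n0 > 0 && nprocs > 0);
    ptrdiff_t p = nprocs;
    return n0 % (p * p) == 0;
}

/* Hypothetical helper: per-process share in the equal case only. */
static ptrdiff_t equal_block(ptrdiff_t n0, int nprocs)
{
    assert(splits_equally(n0, nprocs));
    return n0 / nprocs;
}
```

For example, with n0 = 1024 on 4 processes, 1024 is divisible by 16, so each process would hold 256 elements; with n0 = 1000 on 4 processes the condition fails and local_ni and local_no may differ.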

For some applications, such as convolutions, the order of the output data is irrelevant. In this case, performance can be improved by specifying that the output data be stored in an FFTW-defined “scrambled” format. (In particular, this is the analogue of transposed output in the multidimensional case: scrambled output saves a communications step.) If you pass FFTW_MPI_SCRAMBLED_OUT in the flags, then the output is stored in this (undocumented) scrambled order. Conversely, to perform the inverse transform of data in scrambled order, pass the FFTW_MPI_SCRAMBLED_IN flag.

In MPI FFTW, only composite sizes n0 can be parallelized; we have not yet implemented a parallel algorithm for large prime sizes.
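Since only composite sizes can be parallelized, an application may want to test n0 before planning. The helper below is a hypothetical sketch using simple trial division; it is not part of FFTW and says nothing about how FFTW itself behaves when given a prime size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if n0 has a nontrivial divisor
   (i.e. is composite), 0 if n0 is 1 or prime. Trial division up to
   sqrt(n0); the test d <= n0 / d avoids overflow in d * d. */
static int is_composite(ptrdiff_t n0)
{
    assert(n0 > 0);
    for (ptrdiff_t d = 2; d <= n0 / d; ++d) {
        if (n0 % d == 0)
            return 1;
    }
    return 0;
}
```

A caller could use this to choose a different (for example, padded) size up front when n0 turns out to be prime.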

6.4.4 One-dimensional distributions