FFTW 3.3.8: Basic distributed-transpose interface
Next: , Previous: , Up: FFTW MPI Transposes   [Contents][Index]


6.7.1 Basic distributed-transpose interface


In particular, suppose that we have an n0 by n1 array in row-major order, block-distributed across the n0 dimension. To transpose this into an n1 by n0 array block-distributed across the n1 dimension, we would create a plan by calling the following function:

fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
                                  double *in, double *out,
                                  MPI_Comm comm, unsigned flags);

The input and output arrays (in and out) can be the same. The transpose is actually executed by calling fftw_execute on the plan, as usual.


The flags are the usual FFTW planner flags, but support two additional flags: FFTW_MPI_TRANSPOSED_OUT and/or FFTW_MPI_TRANSPOSED_IN. What these flags indicate, for transpose plans, is that the output and/or input, respectively, are locally transposed. That is, on each process, the input data is normally stored as a local_n0 by n1 array in row-major order, but for an FFTW_MPI_TRANSPOSED_IN plan the input data is stored as n1 by local_n0 in row-major order. Similarly, FFTW_MPI_TRANSPOSED_OUT means that the output is n0 by local_n1 instead of local_n1 by n0.


To determine the local size of the array on each process before and after the transpose, as well as the amount of storage that must be allocated, one should call fftw_mpi_local_size_2d_transposed, just as for a 2d DFT as described in the previous section:

ptrdiff_t fftw_mpi_local_size_2d_transposed
                (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);

Again, the return value is the local storage to allocate, which in this case is the number of real (double) values rather than complex numbers as in the previous examples.
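Putting the pieces together, a minimal usage sketch might look like the following (it requires an MPI environment and the FFTW MPI library to run; the global dimensions n0 and n1 and the fill pattern are purely illustrative):

```c
#include <fftw3-mpi.h>

int main(int argc, char **argv)
{
    const ptrdiff_t n0 = 8, n1 = 6; /* illustrative global dimensions */
    ptrdiff_t local_n0, local_0_start, local_n1, local_1_start;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* Number of real (double) values to allocate locally; this is
       enough for both the input and the transposed output. */
    ptrdiff_t alloc_local = fftw_mpi_local_size_2d_transposed(
        n0, n1, MPI_COMM_WORLD,
        &local_n0, &local_0_start, &local_n1, &local_1_start);

    double *data = fftw_alloc_real(alloc_local);

    /* In-place transpose plan (in == out is allowed). */
    fftw_plan plan = fftw_mpi_plan_transpose(n0, n1, data, data,
                                             MPI_COMM_WORLD, FFTW_MEASURE);

    /* Fill this process's local rows local_0_start .. local_0_start
       + local_n0 - 1 of the n0 x n1 input, in row-major order. */
    for (ptrdiff_t i = 0; i < local_n0; ++i)
        for (ptrdiff_t j = 0; j < n1; ++j)
            data[i * n1 + j] = (double)((local_0_start + i) * n1 + j);

    fftw_execute(plan); /* data now holds local_n1 rows of the
                           transposed n1 x n0 array */

    fftw_destroy_plan(plan);
    fftw_free(data);
    MPI_Finalize();
    return 0;
}
```

Note that planning with FFTW_MEASURE may overwrite the arrays, so the input is filled only after the plan is created, as with any FFTW plan.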
