Transposed distributions


6.4.3 Transposed distributions

Internally, FFTW's MPI transform algorithms work by first computing transforms of the data local to each process, then by globally transposing the data in some fashion to redistribute the data among the processes, transforming the new data local to each process, and transposing back. For example, a two-dimensional n0 by n1 array, distributed across the n0 dimension, is transformed by: (i) transforming the n1 dimension, which is local to each process; (ii) transposing to an n1 by n0 array, distributed across the n1 dimension; (iii) transforming the n0 dimension, which is now local to each process; (iv) transposing back.

However, in many applications it is acceptable to compute a multidimensional DFT whose results are produced in transposed order (e.g., n1 by n0 in two dimensions). This provides a significant performance advantage, because it means that the final transposition step can be omitted. FFTW supports this optimization, which you specify by passing the flag FFTW_MPI_TRANSPOSED_OUT to the planner routines. To compute the inverse transform of transposed output, you specify FFTW_MPI_TRANSPOSED_IN to tell the planner that the input is transposed. In this section, we explain how to interpret the output format of such a transform.

Suppose you are transforming multi-dimensional data with (at least two) dimensions n₀ × n₁ × n₂ × … × n_{d-1}. As always, it is distributed along the first dimension n₀. Now, if we compute its DFT with the FFTW_MPI_TRANSPOSED_OUT flag, the resulting output data are stored with the first two dimensions transposed: n₁ × n₀ × n₂ × … × n_{d-1}, distributed along the n₁ dimension. Conversely, if we take the n₁ × n₀ × n₂ × … × n_{d-1} data and transform it with the FFTW_MPI_TRANSPOSED_IN flag, then the format goes back to the original n₀ × n₁ × n₂ × … × n_{d-1} array.

There are two ways to find the portion of the transposed array that resides on the current process. First, you can simply call the appropriate ‘local_size’ function, passing n₁ × n₀ × n₂ × … × n_{d-1} (the transposed dimensions). This would mean calling the ‘local_size’ function twice, once for the transposed and once for the non-transposed dimensions. Alternatively, you can call one of the ‘local_size_transposed’ functions, which return both the non-transposed and transposed data distribution from a single call. For example, for a 3d transform with transposed output (or input), you might call fftw_mpi_local_size_3d_transposed, passing n0, n1, n2, and the communicator, together with pointers that receive local_n0, local_0_start, local_n1, and local_1_start.
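To make the two distributions concrete without an MPI run, the following is a minimal sketch that models what such a query might report. It assumes each dimension is split into blocks of ceil(n / P) rows over P processes, and that the allocation size is simply the larger of the two local layouts; FFTW itself may request more space for intermediate steps. The names model_block and model_local_size_3d_transposed are hypothetical and are not part of FFTW; the real call is shown only in a comment.

```c
#include <assert.h>
#include <stddef.h>

/* With FFTW itself, the query is a single call made inside an MPI program:
 *
 *   ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
 *   alloc_local = fftw_mpi_local_size_3d_transposed(n0, n1, n2, MPI_COMM_WORLD,
 *                                                   &local_n0, &local_0_start,
 *                                                   &local_n1, &local_1_start);
 *
 * Everything below is a stand-alone model of that result, not FFTW code. */

/* Block owned by process `rank` when `n` rows are split over `nprocs`
   processes in blocks of ceil(n / nprocs) rows (assumed distribution).
   Trailing processes may own fewer rows, or none; an empty block is
   reported with start 0, which is a choice made by this model. */
static void model_block(ptrdiff_t n, int nprocs, int rank,
                        ptrdiff_t *local_n, ptrdiff_t *local_start)
{
    ptrdiff_t block = (n + nprocs - 1) / nprocs;
    ptrdiff_t start = block * rank;
    ptrdiff_t left = n - start;
    *local_n = left <= 0 ? 0 : (left < block ? left : block);
    *local_start = left <= 0 ? 0 : start;
}

/* Hypothetical model of a 3d 'local_size_transposed' query: fills in the
   n0 block (non-transposed layout) and the n1 block (transposed layout)
   and returns the element count of the larger local array. */
ptrdiff_t model_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                                         int nprocs, int rank,
                                         ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                                         ptrdiff_t *local_n1, ptrdiff_t *local_1_start)
{
    assert(n0 > 0 && n1 > 0 && n2 > 0);
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);

    model_block(n0, nprocs, rank, local_n0, local_0_start);
    model_block(n1, nprocs, rank, local_n1, local_1_start);

    ptrdiff_t plain = *local_n0 * n1 * n2;      /* local_n0 x n1 x n2 */
    ptrdiff_t transposed = *local_n1 * n0 * n2; /* local_n1 x n0 x n2 */
    return plain > transposed ? plain : transposed;
}
```

Under these assumptions, an 8 by 6 by 4 array on four processes gives rank 3 two rows of the n0 dimension but no rows of the n1 dimension, so a process can hold input data and still receive no transposed output.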

Here, local_n0 and local_0_start give the size and starting index of the n0 dimension for the non-transposed data, as in the previous sections. For transposed data (e.g. the output for FFTW_MPI_TRANSPOSED_OUT), local_n1 and local_1_start give the size and starting index of the n1 dimension, which is the first dimension of the transposed data (n1 by n0 by n2).
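As an illustration of how local_n1 and local_1_start are used, the sketch below locates one element of the transposed n1 by n0 by n2 output inside a process's local array. It assumes complex data stored in row-major order with no padding in the last dimension; the helper name transposed_local_offset is hypothetical and not an FFTW function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of FFTW): offset, in elements, of the
   output element with global indices (k0, k1, k2) inside the local block
   of the transposed n1 x n0 x n2 array. The local block holds rows
   local_1_start .. local_1_start + local_n1 - 1 of the n1 dimension,
   stored row-major. Returns -1 if the element is on another process. */
ptrdiff_t transposed_local_offset(ptrdiff_t k0, ptrdiff_t k1, ptrdiff_t k2,
                                  ptrdiff_t n0, ptrdiff_t n2,
                                  ptrdiff_t local_n1, ptrdiff_t local_1_start)
{
    assert(k0 >= 0 && k0 < n0);
    assert(k2 >= 0 && k2 < n2);
    assert(k1 >= 0 && local_n1 >= 0);

    if (k1 < local_1_start || k1 >= local_1_start + local_n1)
        return -1; /* this n1 row is owned by a different process */

    /* k1 is the slowest index after transposition, then k0, then k2. */
    return ((k1 - local_1_start) * n0 + k0) * n2 + k2;
}
```

For example, with n0 = 4, n2 = 3, and a process owning n1 rows 2 and 3, the element (k0, k1, k2) = (1, 3, 2) sits at local offset ((3 - 2) * 4 + 1) * 3 + 2 = 17.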

(Note that FFTW_MPI_TRANSPOSED_IN is completely equivalent to performing FFTW_MPI_TRANSPOSED_OUT and passing the first two dimensions to the planner in reverse order, or vice versa. If you pass both the FFTW_MPI_TRANSPOSED_IN and FFTW_MPI_TRANSPOSED_OUT flags, it is equivalent to swapping the first two dimensions passed to the planner and passing neither flag.)
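The equivalences in this note can be checked with a small bookkeeping model that tracks only the storage order of the first two dimensions. This is a sketch under the assumption that each flag does nothing more than swap those two dimensions on its side of the transform; it does not compute any transform. The MODEL_ constants and function names are hypothetical stand-ins and do not have FFTW's actual flag values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two flags; these are NOT FFTW's values. */
enum { MODEL_TRANSPOSED_IN = 1, MODEL_TRANSPOSED_OUT = 2 };

/* Storage order of the leading two dimensions of input and output. */
typedef struct {
    ptrdiff_t in_rows, in_cols;
    ptrdiff_t out_rows, out_cols;
} model_layout;

/* Layouts implied by planning an n0 x n1 transform with the given flags:
   each flag swaps the first two dimensions on its side of the transform. */
model_layout model_plan_layout(ptrdiff_t n0, ptrdiff_t n1, unsigned flags)
{
    assert(n0 > 0 && n1 > 0);
    model_layout l = { n0, n1, n0, n1 };
    if (flags & MODEL_TRANSPOSED_IN) {
        l.in_rows = n1;
        l.in_cols = n0;
    }
    if (flags & MODEL_TRANSPOSED_OUT) {
        l.out_rows = n1;
        l.out_cols = n0;
    }
    return l;
}

/* Returns 1 if two plans use the same input and output storage order. */
int model_same_layout(model_layout a, model_layout b)
{
    return a.in_rows == b.in_rows && a.in_cols == b.in_cols &&
           a.out_rows == b.out_rows && a.out_cols == b.out_cols;
}
```

In this model, planning 8 by 6 with the transposed-input flag gives the same layouts as planning 6 by 8 with the transposed-output flag, and planning 8 by 6 with both flags matches planning 6 by 8 with neither.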