We close this section by noting that FFTW's MPI transpose routines can
be thought of as a generalization of the MPI_Alltoall function
(albeit only for floating-point types), and in some circumstances can
function as an improved replacement.
MPI_Alltoall is defined by the MPI standard as:

     int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                      void *recvbuf, int recvcnt, MPI_Datatype recvtype,
                      MPI_Comm comm);
In particular, for double* arrays in and out, consider the call:
     MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm);
This is completely equivalent to:

     MPI_Comm_size(comm, &P);
     plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1,
                                         in, out, comm, FFTW_ESTIMATE);
     fftw_execute(plan);
     fftw_destroy_plan(plan);
That is, computing a P × P transpose on P processes,
with a block size of 1, is just a standard all-to-all communication.
However, using the FFTW routine instead of MPI_Alltoall may
have certain advantages. First of all, FFTW's routine can operate
in-place (in == out) whereas MPI_Alltoall can only
operate out-of-place.
Second, even for out-of-place plans, FFTW's routine may be faster,
especially if you need to perform the all-to-all communication many
times and can afford to use FFTW_MEASURE or
FFTW_PATIENT. It should certainly be no slower, not including
the time to create the plan, since one of the possible algorithms that
FFTW uses for an out-of-place transpose is simply to call
MPI_Alltoall. However, FFTW also considers several other
possible algorithms that, depending on your MPI implementation and
your hardware, may be faster.