6.7.3 An improved replacement for MPI_Alltoall


We close this section by noting that FFTW's MPI transpose routines can be thought of as a generalization of the MPI_Alltoall function (albeit only for floating-point types), and in some circumstances can function as an improved replacement.

MPI_Alltoall is defined by the MPI standard as:

     int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                      void *recvbuf, int recvcnt, MPI_Datatype recvtype,
                      MPI_Comm comm);

In particular, for double* arrays in and out, consider the call:

     MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm);

This is completely equivalent to:

     MPI_Comm_size(comm, &P);
     plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE);
     fftw_execute(plan);
     fftw_destroy_plan(plan);

That is, computing a P × P transpose on P processes, with a block size of 1, is just a standard all-to-all communication.
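
To make the equivalence concrete, the following is a minimal, self-contained sketch of a complete program that performs the all-to-all exchange through an FFTW transpose plan. The value of howmany, the test data, and the use of MPI_COMM_WORLD are purely illustrative.

     #include <stddef.h>
     #include <mpi.h>
     #include <fftw3-mpi.h>
     
     int main(int argc, char **argv)
     {
          const ptrdiff_t howmany = 4;  /* illustrative: 4 doubles per destination */
          int P, rank;
          double *in, *out;
          fftw_plan plan;
          ptrdiff_t i;
     
          MPI_Init(&argc, &argv);
          fftw_mpi_init();
          MPI_Comm_size(MPI_COMM_WORLD, &P);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     
          /* with n0 = n1 = P and block sizes of 1, each process holds
             exactly one row: P blocks of howmany doubles each */
          in = fftw_malloc(sizeof(double) * P * howmany);
          out = fftw_malloc(sizeof(double) * P * howmany);
     
          plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1,
                                              in, out, MPI_COMM_WORLD,
                                              FFTW_ESTIMATE);
     
          for (i = 0; i < P * howmany; ++i)
               in[i] = rank + 0.001 * i;  /* arbitrary test data */
     
          /* same data motion as:
             MPI_Alltoall(in, howmany, MPI_DOUBLE,
                          out, howmany, MPI_DOUBLE, MPI_COMM_WORLD); */
          fftw_execute(plan);
     
          fftw_destroy_plan(plan);
          fftw_free(in);
          fftw_free(out);
          fftw_mpi_cleanup();
          MPI_Finalize();
          return 0;
     }

Such a program would typically be compiled with the MPI compiler wrapper and linked against both the FFTW MPI and serial libraries (e.g. -lfftw3_mpi -lfftw3 -lm).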

However, using the FFTW routine instead of MPI_Alltoall may have certain advantages. First of all, FFTW's routine can operate in-place (in == out) whereas MPI_Alltoall can only operate out-of-place.

Second, even for out-of-place plans, FFTW's routine may be faster, especially if you need to perform the all-to-all communication many times and can afford to use FFTW_MEASURE or FFTW_PATIENT. It should certainly be no slower, not including the time to create the plan, since one of the possible algorithms that FFTW uses for an out-of-place transpose is simply to call MPI_Alltoall. However, FFTW also considers several other possible algorithms that, depending on your MPI implementation and your hardware, may be faster.
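
For instance, an in-place replacement that amortizes the planning cost over many exchanges might look like the following sketch, reusing P, howmany, and comm from the snippets above; the iteration count niter and the loop body are placeholders. Note that FFTW_MEASURE overwrites the array during planning, so the plan is created before the buffer is filled with real data.

     double *buf = fftw_malloc(sizeof(double) * P * howmany);
     
     /* in == out gives an in-place all-to-all, which MPI_Alltoall itself
        cannot do; FFTW_MEASURE times several candidate algorithms and
        keeps the fastest, overwriting buf in the process. */
     fftw_plan plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1,
                                                   buf, buf, comm,
                                                   FFTW_MEASURE);
     
     for (int iter = 0; iter < niter; ++iter) {
          /* ... fill buf with this iteration's data ... */
          fftw_execute(plan);  /* in-place exchange, in lieu of calling
                                  MPI_Alltoall on every iteration */
          /* ... use the exchanged data in buf ... */
     }
     
     fftw_destroy_plan(plan);
     fftw_free(buf);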