Ideally, when you parallelize a transform over some P processes, each process should end up with work that takes equal time. Otherwise, all of the processes end up waiting on whichever process is slowest. This goal is known as “load balancing.” In this section, we describe the circumstances under which FFTW is able to load-balance well, and in particular how you should choose your transform size in order to load balance.

Load balancing is especially difficult when you are parallelizing over heterogeneous machines; for example, if one of your processors is an old 486 and another is a Pentium IV, obviously you should give the Pentium more work to do than the 486 since the latter is much slower. FFTW does not deal with this problem, however: it assumes that your processes run on hardware of comparable speed, and that the goal is therefore to divide the problem as equally as possible.

For a multi-dimensional complex DFT, FFTW can divide the problem equally among the processes if: (i) the first dimension n0 is divisible by P; and (ii) the product of the subsequent dimensions is divisible by P. (For the advanced interface, where you can specify multiple simultaneous transforms via some “vector” length howmany, a factor of howmany is included in the product of the subsequent dimensions.)

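The two conditions for the multi-dimensional case can be checked before planning with a few lines of C. The following is only a sketch: `balanced_multidim` is a hypothetical helper, not an FFTW function, and it tests nothing more than the divisibility rules stated above. It takes the dimensions in the same order as the transform (n[0] first), the vector length howmany (pass 1 for the basic interface), and the number of processes P as `nprocs`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of the FFTW API).
 * Returns true when (i) n[0] is divisible by nprocs and (ii) the product
 * howmany * n[1] * ... * n[rnk-1] is divisible by nprocs. */
bool balanced_multidim(int rnk, const ptrdiff_t *n,
                       ptrdiff_t howmany, ptrdiff_t nprocs)
{
    assert(n != NULL);

    if (rnk < 2 || howmany <= 0 || nprocs <= 0)
        return false;

    /* Condition (i): the first dimension splits evenly. */
    if (n[0] <= 0 || n[0] % nprocs != 0)
        return false;

    /* Condition (ii): work modulo nprocs so that the running product
     * stays small even when the dimensions are large. */
    ptrdiff_t rem = howmany % nprocs;
    for (int i = 1; i < rnk; ++i) {
        if (n[i] <= 0)
            return false;
        rem = (rem * (n[i] % nprocs)) % nprocs;
    }
    return rem == 0;
}
```

For example, a 128 x 64 x 64 transform on 16 processes passes both tests, whereas a 16 x 3 x 5 transform on 4 processes fails condition (ii) unless howmany contributes the missing factor of 4.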
For a one-dimensional complex DFT, the length N of the data should be divisible by P squared to be able to divide the problem equally among the processes.

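The one-dimensional rule is simple enough to test directly. The sketch below uses two hypothetical helpers, neither of which belongs to FFTW: `balanced_1d` checks whether N is divisible by P squared, and `round_up_balanced_1d` returns the smallest multiple of P squared that is at least N, which may be useful if your application can pad its data. Note that the rounded-up length is chosen only for divisibility; it is an assumption of this sketch that padding is acceptable, and the resulting size is not guaranteed to be an efficient FFT length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of the FFTW API).
 * Returns true when n is divisible by nprocs squared. */
bool balanced_1d(ptrdiff_t n, ptrdiff_t nprocs)
{
    if (n <= 0 || nprocs <= 0)
        return false;
    ptrdiff_t p2 = nprocs * nprocs;
    return n % p2 == 0;
}

/* Hypothetical helper (not part of the FFTW API).
 * Returns the smallest multiple of nprocs squared that is >= n. */
ptrdiff_t round_up_balanced_1d(ptrdiff_t n, ptrdiff_t nprocs)
{
    assert(n > 0 && nprocs > 0);
    ptrdiff_t p2 = nprocs * nprocs;
    return ((n + p2 - 1) / p2) * p2;
}
```

With 16 processes, N = 1024 satisfies the rule because 1024 is a multiple of 256; with 64 processes it does not, because 64 squared is 4096.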