Ideally, when you parallelize a transform over some P processes, each process should end up with work that takes equal time. Otherwise, all of the processes end up waiting on whichever process is slowest. This goal is known as &ldquo;load balancing.&rdquo; In this section, we describe the circumstances under which FFTW is able to load-balance well, and in particular how you should choose your transform size in order to load balance.
Load balancing is especially difficult when you are parallelizing over heterogeneous machines; for example, if one of your processors is an old 486 and another is a Pentium IV, obviously you should give the Pentium more work to do than the 486 since the latter is much slower. FFTW does not deal with this problem, however: it assumes that your processes run on hardware of comparable speed, and that the goal is therefore to divide the problem as equally as possible.
For a multi-dimensional complex DFT, FFTW can divide the problem equally among the processes if: (i) the first dimension n0 is divisible by P; and (ii) the product of the subsequent dimensions is divisible by P. (For the advanced interface, where you can specify multiple simultaneous transforms via some &ldquo;vector&rdquo; length howmany, a factor of howmany is included in the product of the subsequent dimensions.)
For a one-dimensional complex DFT, the length N of the data should be divisible by P squared to be able to divide the problem equally among the processes.