Ideally, when you parallelize a transform over some P processes, each process should end up with work that takes equal time. Otherwise, all of the processes end up waiting on whichever process is slowest. This goal is known as “load balancing.” In this section, we describe the circumstances under which FFTW is able to load-balance well, and in particular how you should choose your transform size in order to load balance.

Load balancing is especially difficult when you are parallelizing over heterogeneous machines; for example, if one of your processors is an old 486 and another is a Pentium IV, obviously you should give the Pentium more work to do than the 486 since the latter is much slower. FFTW does not deal with this problem, however—it assumes that your processes run on hardware of comparable speed, and that the goal is therefore to divide the problem as equally as possible.

For a multi-dimensional complex DFT, FFTW can divide the problem equally among the processes if: (i) the first dimension n0 is divisible by P; and (ii) the product of the subsequent dimensions is divisible by P. (For the advanced interface, where you can specify multiple simultaneous transforms via some “vector” length howmany, a factor of howmany is included in the product of the subsequent dimensions.)

For a one-dimensional complex DFT, the length N of the data should be divisible by P squared to be able to divide the problem equally among the processes.