@node Multi-threaded FFTW, Distributed-memory FFTW with MPI, FFTW Reference, Top
@chapter Multi-threaded FFTW

@cindex parallel transform
In this chapter we document the parallel FFTW routines for
shared-memory parallel hardware. These routines, which support
parallel one- and multi-dimensional transforms of both real and
complex data, are the easiest way to take advantage of multiple
processors with FFTW. They work just like the corresponding
uniprocessor transform routines, except that you have an extra
initialization routine to call, and there is a routine to set the
number of threads to employ. Any program that uses the uniprocessor
FFTW can therefore be trivially modified to use the multi-threaded
FFTW.

A shared-memory machine is one in which all CPUs can directly access
the same main memory, and such machines are now common due to the
ubiquity of multi-core CPUs. FFTW's multi-threading support allows
you to utilize these additional CPUs transparently from a single
program. However, this does not necessarily translate into
performance gains---when multiple threads/CPUs are employed, there is
an overhead required for synchronization that may outweigh the
computational parallelism. Therefore, you can only benefit from
threads if your problem is sufficiently large.
@cindex shared-memory
@cindex threads

@menu
* Installation and Supported Hardware/Software::
* Usage of Multi-threaded FFTW::
* How Many Threads to Use?::
* Thread safety::
@end menu

@c ------------------------------------------------------------
@node Installation and Supported Hardware/Software, Usage of Multi-threaded FFTW, Multi-threaded FFTW, Multi-threaded FFTW
@section Installation and Supported Hardware/Software

All of the FFTW threads code is located in the @code{threads}
subdirectory of the FFTW package. On Unix systems, the FFTW threads
libraries and header files can be automatically configured, compiled,
and installed along with the uniprocessor FFTW libraries simply by
including @code{--enable-threads} in the flags to the @code{configure}
script (@pxref{Installation on Unix}), or @code{--enable-openmp} to use
@uref{http://www.openmp.org,OpenMP} threads.
@fpindex configure
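
For example, on a typical Unix system you might build and install both
the serial and the POSIX-threads libraries with something like the
following (an illustrative sketch only; add whatever other
@code{configure} options your installation needs):

@example
./configure --enable-threads
make
make install
@end example

Substitute (or add) @code{--enable-openmp} if you want the OpenMP
version instead.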


@cindex portability
@cindex OpenMP
The threads routines require your operating system to have some sort
of shared-memory threads support. Specifically, the FFTW threads
package works with POSIX threads (available on most Unix variants,
from GNU/Linux to MacOS X) and Win32 threads. OpenMP threads, which
many common compilers (e.g. gcc) provide, are also supported and may
give better performance on some systems. (OpenMP threads are also
useful if you are employing OpenMP in your own code, in order to
minimize conflicts between threading models.) If you have a
shared-memory machine that uses a different threads API, it should be
a simple matter of programming to include support for it; see the file
@code{threads/threads.c} for more detail.

You can compile FFTW with @emph{both} @code{--enable-threads} and
@code{--enable-openmp} at the same time, since they install libraries
with different names (@samp{fftw3_threads} and @samp{fftw3_omp}, as
described below). However, your programs may only link to @emph{one}
of these two libraries at a time.

Ideally, of course, you should also have multiple processors in order to
get any benefit from the threaded transforms.

@c ------------------------------------------------------------
@node Usage of Multi-threaded FFTW, How Many Threads to Use?, Installation and Supported Hardware/Software, Multi-threaded FFTW
@section Usage of Multi-threaded FFTW

Here, it is assumed that the reader is already familiar with the usage
of the uniprocessor FFTW routines, described elsewhere in this manual.
We only describe what one has to change in order to use the
multi-threaded routines.

@cindex OpenMP
First, programs using the parallel complex transforms should be linked
with @code{-lfftw3_threads -lfftw3 -lm} on Unix, or @code{-lfftw3_omp
-lfftw3 -lm} if you compiled with OpenMP. You will also need to link
with whatever library is responsible for threads on your system
(e.g. @code{-lpthread} on GNU/Linux) or include whatever compiler flag
enables OpenMP (e.g. @code{-fopenmp} with gcc).
@cindex linking on Unix
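
For example, with gcc on a GNU/Linux system using POSIX threads, the
link command might look something like the following (an illustrative
sketch; @code{myprogram.c} is a hypothetical source file):

@example
gcc myprogram.c -lfftw3_threads -lfftw3 -lm -lpthread
@end example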


Second, before calling @emph{any} FFTW routines, you should call the
function:

@example
int fftw_init_threads(void);
@end example
@findex fftw_init_threads

This function, which need only be called once, performs any one-time
initialization required to use threads on your system. It returns zero
if there was some error (which should not happen under normal
circumstances) and a non-zero value otherwise.
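
Since the error case should not arise in practice, a minimal check
such as the following sketch is usually sufficient (assuming
@code{<stdio.h>} and @code{<stdlib.h>} have been included):

@example
if (!fftw_init_threads())
@{
     fprintf(stderr, "fftw_init_threads failed\n");
     exit(EXIT_FAILURE);
@}
@end example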

Third, before creating a plan that you want to parallelize, you should
call:

@example
void fftw_plan_with_nthreads(int nthreads);
@end example
@findex fftw_plan_with_nthreads

The @code{nthreads} argument indicates the number of threads you want
FFTW to use (or actually, the maximum number). All plans subsequently
created with any planner routine will use that many threads. You can
call @code{fftw_plan_with_nthreads}, create some plans, call
@code{fftw_plan_with_nthreads} again with a different argument, and
create some more plans for a new number of threads. Plans already created
before a call to @code{fftw_plan_with_nthreads} are unaffected. If you
pass an @code{nthreads} argument of @code{1} (the default), threads are
disabled for subsequent plans.
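
For example, the following sketch (the sizes are arbitrary, and the
arrays @code{in1}, @code{out1}, @code{in2}, and @code{out2} are assumed
to have been allocated earlier with @code{fftw_malloc}) plans a large
transform with four threads and a small one with a single thread:

@example
fftw_plan big, small;

fftw_plan_with_nthreads(4);
big = fftw_plan_dft_1d(1048576, in1, out1, FFTW_FORWARD, FFTW_MEASURE);

fftw_plan_with_nthreads(1);
small = fftw_plan_dft_1d(256, in2, out2, FFTW_FORWARD, FFTW_MEASURE);
@end example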

@cindex OpenMP
With OpenMP, to configure FFTW to use all of the currently running
OpenMP threads (set by @code{omp_set_num_threads(nthreads)} or by the
@code{OMP_NUM_THREADS} environment variable), you can do:
@code{fftw_plan_with_nthreads(omp_get_max_threads())}. (The @samp{omp_}
OpenMP functions are declared via @code{#include <omp.h>}.)

@cindex thread safety
Given a plan, you then execute it as usual with
@code{fftw_execute(plan)}, and the execution will use the number of
threads specified when the plan was created. When done, you destroy
it as usual with @code{fftw_destroy_plan}. As described in
@ref{Thread safety}, plan @emph{execution} is thread-safe, but plan
creation and destruction are @emph{not}: you should create/destroy
plans only from a single thread, but can safely execute multiple plans
in parallel.

There is one additional routine: if you want to get rid of all memory
and other resources allocated internally by FFTW, you can call:

@example
void fftw_cleanup_threads(void);
@end example
@findex fftw_cleanup_threads

which is much like the @code{fftw_cleanup()} function except that it
also gets rid of threads-related data. You must @emph{not} execute any
previously created plans after calling this function.
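
Putting these pieces together, a complete (if minimal) multi-threaded
program might look like the following sketch; the transform size,
thread count, and input data are arbitrary illustrative choices:

@example
#include <fftw3.h>

int main(void)
@{
     const int N = 65536;
     fftw_complex *in, *out;
     fftw_plan p;
     int i;

     if (!fftw_init_threads())         /* one-time initialization */
          return 1;
     fftw_plan_with_nthreads(4);       /* threads for subsequent plans */

     in = fftw_malloc(sizeof(fftw_complex) * N);
     out = fftw_malloc(sizeof(fftw_complex) * N);
     p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

     for (i = 0; i < N; ++i) @{        /* fill the input array */
          in[i][0] = 1.0;              /* real part */
          in[i][1] = 0.0;              /* imaginary part */
     @}

     fftw_execute(p);                  /* uses up to 4 threads */

     fftw_destroy_plan(p);
     fftw_free(in);
     fftw_free(out);
     fftw_cleanup_threads();           /* release threads-related data */
     return 0;
@}
@end example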

We should also mention one other restriction: if you save wisdom from a
program using the multi-threaded FFTW, that wisdom @emph{cannot be used}
by a program using only the single-threaded FFTW (i.e. not calling
@code{fftw_init_threads}). @xref{Words of Wisdom-Saving Plans}.

@c ------------------------------------------------------------
@node How Many Threads to Use?, Thread safety, Usage of Multi-threaded FFTW, Multi-threaded FFTW
@section How Many Threads to Use?

@cindex number of threads
There is a fair amount of overhead involved in synchronizing threads,
so the optimal number of threads to use depends upon the size of the
transform as well as on the number of processors you have.

As a general rule, you don't want to use more threads than you have
processors. (Using more threads will work, but there will be extra
overhead with no benefit.) In fact, if the problem size is too small,
you may want to use fewer threads than you have processors.

You will have to experiment with your system to see what level of
parallelization is best for your problem size. Typically, the problem
will have to involve at least a few thousand data points before threads
become beneficial. If you plan with @code{FFTW_PATIENT}, it will
automatically disable threads for sizes that don't benefit from
parallelization.
@ctindex FFTW_PATIENT
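
One way to experiment is simply to time the same transform planned with
different thread counts, as in the following sketch (illustrative only;
it assumes a POSIX system for @code{clock_gettime}, and the transform
size and thread counts are arbitrary):

@example
#include <stdio.h>
#include <time.h>
#include <fftw3.h>

static double now(void)          /* wall-clock time in seconds */
@{
     struct timespec ts;
     clock_gettime(CLOCK_MONOTONIC, &ts);
     return ts.tv_sec + 1e-9 * ts.tv_nsec;
@}

int main(void)
@{
     const int N = 1048576;
     fftw_complex *in = fftw_malloc(sizeof(fftw_complex) * N);
     fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * N);
     int nt, i;

     fftw_init_threads();
     for (nt = 1; nt <= 8; nt *= 2) @{
          fftw_plan p;
          double t0;

          fftw_plan_with_nthreads(nt);
          p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_MEASURE);
          for (i = 0; i < N; ++i) @{   /* refill input: FFTW_MEASURE
                                          overwrites the arrays */
               in[i][0] = 1.0;
               in[i][1] = 0.0;
          @}

          t0 = now();
          for (i = 0; i < 10; ++i)
               fftw_execute(p);
          printf("%d threads: %g seconds per transform\n",
                 nt, (now() - t0) / 10);
          fftw_destroy_plan(p);
     @}
     fftw_free(in);
     fftw_free(out);
     fftw_cleanup_threads();
     return 0;
@}
@end example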

@c ------------------------------------------------------------
@node Thread safety, , How Many Threads to Use?, Multi-threaded FFTW
@section Thread safety

@cindex threads
@cindex OpenMP
@cindex thread safety
Users writing multi-threaded programs (including OpenMP) must concern
themselves with the @dfn{thread safety} of the libraries they
use---that is, whether it is safe to call routines in parallel from
multiple threads. FFTW can be used in such an environment, but some
care must be taken because the planner routines share data
(e.g. wisdom and trigonometric tables) between calls and plans.

The upshot is that the only thread-safe (re-entrant) routine in FFTW is
@code{fftw_execute} (and the new-array variants thereof). All other routines
(e.g. the planner) should only be called from one thread at a time. So,
for example, you can wrap a semaphore lock around any calls to the
planner; even more simply, you can just create all of your plans from
one thread. We do not think this should be an important restriction
(FFTW is designed for the situation where the only performance-sensitive
code is the actual execution of the transform), and the benefits of
shared data between plans are great.
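
If you nevertheless need to create plans from more than one thread, one
simple approach is to protect every planner (and plan-destruction) call
with a single mutex. For example, a sketch using POSIX threads, with a
hypothetical wrapper function:

@example
#include <pthread.h>
#include <fftw3.h>

static pthread_mutex_t planner_mutex = PTHREAD_MUTEX_INITIALIZER;

fftw_plan plan_dft_1d_locked(int n, fftw_complex *in, fftw_complex *out)
@{
     fftw_plan p;
     pthread_mutex_lock(&planner_mutex);
     p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
     pthread_mutex_unlock(&planner_mutex);
     return p;
@}
@end example

(Calls to @code{fftw_destroy_plan} would need to be protected by the
same mutex.)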

Note also that, since the plan is not modified by @code{fftw_execute},
it is safe to execute the @emph{same plan} in parallel by multiple
threads. However, since a given plan operates by default on a fixed
array, you need to use one of the new-array execute functions
(@pxref{New-array Execute Functions}) so that different threads
compute the transform of different data.
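
For instance, with OpenMP you might transform many independent,
contiguous blocks of data using one shared plan, along the following
lines (a sketch only: @code{p} is assumed to be a plan for a
size-@code{n} transform, @code{in} and @code{out} to hold
@code{nblocks} such blocks and to have been allocated with
@code{fftw_malloc}, and @code{b}, @code{n}, and @code{nblocks} to be
@code{int} variables; see @ref{New-array Execute Functions} for the
alignment caveats):

@example
#pragma omp parallel for
for (b = 0; b < nblocks; ++b)
     fftw_execute_dft(p, in + (size_t) b * n, out + (size_t) b * n);
@end example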

(Users should note that these comments only apply to programs using
shared-memory threads or OpenMP. Parallelism using MPI or forked processes
involves a separate address-space and global variables for each process,
and is not susceptible to problems of this sort.)

If you configured FFTW with the @code{--enable-debug} or
@code{--enable-debug-malloc} flags (@pxref{Installation on Unix}),
then @code{fftw_execute} is not thread-safe. These flags are not
documented because they are intended only for developing
and debugging FFTW, but if you must use @code{--enable-debug} then you
should also specifically pass @code{--disable-debug-malloc} for
@code{fftw_execute} to be thread-safe.
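
For example (an illustrative @code{configure} invocation only):

@example
./configure --enable-threads --enable-debug --disable-debug-malloc
@end example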