This is fftw3.info, produced by makeinfo version 4.13 from fftw3.texi.

This manual is for FFTW (version 3.3.3, 25 November 2012).

Copyright (C) 2003 Matteo Frigo.

Copyright (C) 2003 Massachusetts Institute of Technology.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission
notice are preserved on all copies.

Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided
that the entire resulting derived work is distributed under the
terms of a permission notice identical to this one.

Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for
modified versions, except that this permission notice may be
stated in a translation approved by the Free Software Foundation.

INFO-DIR-SECTION Texinfo documentation system
START-INFO-DIR-ENTRY
* fftw3: (fftw3). FFTW User's Manual.
END-INFO-DIR-ENTRY


File: fftw3.info, Node: Top, Next: Introduction, Prev: (dir), Up: (dir)

FFTW User Manual
****************

Welcome to FFTW, the Fastest Fourier Transform in the West. FFTW is a
collection of fast C routines to compute the discrete Fourier transform.
This manual documents FFTW version 3.3.3.

* Menu:

* Introduction::
* Tutorial::
* Other Important Topics::
* FFTW Reference::
* Multi-threaded FFTW::
* Distributed-memory FFTW with MPI::
* Calling FFTW from Modern Fortran::
* Calling FFTW from Legacy Fortran::
* Upgrading from FFTW version 2::
* Installation and Customization::
* Acknowledgments::
* License and Copyright::
* Concept Index::
* Library Index::


File: fftw3.info, Node: Introduction, Next: Tutorial, Prev: Top, Up: Top

1 Introduction
**************

This manual documents version 3.3.3 of FFTW, the _Fastest Fourier
Transform in the West_. FFTW is a comprehensive collection of fast C
routines for computing the discrete Fourier transform (DFT) and various
special cases thereof.
   * FFTW computes the DFT of complex data, real data, even- or
     odd-symmetric real data (these symmetric transforms are usually
     known as the discrete cosine or sine transform, respectively), and
     the discrete Hartley transform (DHT) of real data.

   * The input data can have arbitrary length. FFTW employs O(n
     log n) algorithms for all lengths, including prime numbers.

   * FFTW supports arbitrary multi-dimensional data.

   * FFTW supports the SSE, SSE2, AVX, Altivec, and MIPS PS instruction
     sets.

   * FFTW includes parallel (multi-threaded) transforms for
     shared-memory systems.

   * Starting with version 3.3, FFTW includes distributed-memory
     parallel transforms using MPI.

We assume herein that you are familiar with the properties and uses
of the DFT that are relevant to your application. Otherwise, see e.g.
`The Fast Fourier Transform and Its Applications' by E. O. Brigham
(Prentice-Hall, Englewood Cliffs, NJ, 1988). Our web page
(http://www.fftw.org) also has links to FFT-related information online.

In order to use FFTW effectively, you need to learn one basic concept
of FFTW's internal structure: FFTW does not use a fixed algorithm for
computing the transform, but instead it adapts the DFT algorithm to
details of the underlying hardware in order to maximize performance.
Hence, the computation of the transform is split into two phases.
First, FFTW's "planner" "learns" the fastest way to compute the
transform on your machine. The planner produces a data structure
called a "plan" that contains this information. Subsequently, the plan
is "executed" to transform the array of input data as dictated by the
plan. The plan can be reused as many times as needed. In typical
high-performance applications, many transforms of the same size are
computed and, consequently, a relatively expensive initialization of
this sort is acceptable. On the other hand, if you need a single
transform of a given size, the one-time cost of the planner becomes
significant. For this case, FFTW provides fast planners based on
heuristics or on previously computed plans.

FFTW supports transforms of data with arbitrary length, rank,
multiplicity, and a general memory layout. In simple cases, however,
this generality may be unnecessary and confusing. Consequently, we
organized the interface to FFTW into three levels of increasing
generality.
   * The "basic interface" computes a single transform of
     contiguous data.

   * The "advanced interface" computes transforms of multiple or
     strided arrays.

   * The "guru interface" supports the most general data layouts,
     multiplicities, and strides.
We expect that most users will be best served by the basic interface,
whereas the guru interface requires careful attention to the
documentation to avoid problems.

Besides the automatic performance adaptation performed by the
planner, it is also possible for advanced users to customize FFTW
manually. For example, if code space is a concern, we provide a tool
that links only the subset of FFTW needed by your application.
Conversely, you may need to extend FFTW because the standard
distribution is not sufficient for your needs. For example, the
standard FFTW distribution works most efficiently for arrays whose size
can be factored into small primes (2, 3, 5, and 7), and otherwise it
uses a slower general-purpose routine. If you need efficient
transforms of other sizes, you can use FFTW's code generator, which
produces fast C programs ("codelets") for any particular array size you
may care about. For example, if you need transforms of size 513 = 19 x
3^3, you can customize FFTW to support the factor 19 efficiently.
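
The small-prime criterion above is easy to check before choosing a
transform size. The following is a minimal sketch in plain C; the
function name `has_only_small_prime_factors' is hypothetical and is not
part of the FFTW API, and the check only mirrors the rule of thumb
stated in the text, not the planner's actual decision logic. Sizes
that fail the check are still supported, just possibly more slowly.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if n factors completely into the primes 2, 3, 5, and 7.
 * Hypothetical helper for illustration; not part of the FFTW API. */
bool has_only_small_prime_factors(int n)
{
    static const int primes[] = {2, 3, 5, 7};
    assert(n > 0);
    for (int i = 0; i < 4; ++i) {
        /* Divide out every power of this prime. */
        while (n % primes[i] == 0)
            n /= primes[i];
    }
    /* Anything left over is a product of larger primes. */
    return n == 1;
}
```

For the sizes mentioned in the text, 512 = 2^9 passes the check, while
513 = 19 x 3^3 does not, because the factor 19 remains.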

For more information regarding FFTW, see the paper, "The Design and
Implementation of FFTW3," by M. Frigo and S. G. Johnson, which was an
invited paper in `Proc. IEEE' 93 (2), p. 216 (2005). The code
generator is described in the paper "A fast Fourier transform compiler", by
M. Frigo, in the `Proceedings of the 1999 ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI), Atlanta,
Georgia, May 1999'. These papers, along with the latest version of
FFTW, the FAQ, benchmarks, and other links, are available at the FFTW
home page (http://www.fftw.org).

The current version of FFTW incorporates many good ideas from the
past thirty years of FFT literature. In one way or another, FFTW uses
the Cooley-Tukey algorithm, the prime factor algorithm, Rader's
algorithm for prime sizes, and a split-radix algorithm (with a
"conjugate-pair" variation pointed out to us by Dan Bernstein). FFTW's
code generator also produces new algorithms that we do not completely
understand. The reader is referred to the cited papers for the
appropriate references.

The rest of this manual is organized as follows. We first discuss
the sequential (single-processor) implementation. We start by
describing the basic interface/features of FFTW in *note Tutorial::.
Next, *note Other Important Topics:: discusses data alignment (*note
SIMD alignment and fftw_malloc::), the storage scheme of
multi-dimensional arrays (*note Multi-dimensional Array Format::), and
FFTW's mechanism for storing plans on disk (*note Words of
Wisdom-Saving Plans::). Next, *note FFTW Reference:: provides
comprehensive documentation of all FFTW's features. Parallel
transforms are discussed in their own chapters: *note Multi-threaded
FFTW:: and *note Distributed-memory FFTW with MPI::. Fortran
programmers can also use FFTW, as described in *note Calling FFTW from
Legacy Fortran:: and *note Calling FFTW from Modern Fortran::. *note
Installation and Customization:: explains how to install FFTW in your
computer system and how to adapt FFTW to your needs. License and
copyright information is given in *note License and Copyright::.
Finally, we thank all the people who helped us in *note
Acknowledgments::.


File: fftw3.info, Node: Tutorial, Next: Other Important Topics, Prev: Introduction, Up: Top

2 Tutorial
**********

* Menu:

* Complex One-Dimensional DFTs::
* Complex Multi-Dimensional DFTs::
* One-Dimensional DFTs of Real Data::
* Multi-Dimensional DFTs of Real Data::
* More DFTs of Real Data::

This chapter describes the basic usage of FFTW, i.e., how to compute the
Fourier transform of a single array. This chapter tells the truth, but
not the _whole_ truth. Specifically, FFTW implements additional
routines and flags that are not documented here, although in many cases
we try to indicate where added capabilities exist. For more complete
information, see *note FFTW Reference::. (Note that you need to
compile and install FFTW before you can use it in a program. For the
details of the installation, see *note Installation and
Customization::.)

We recommend that you read this tutorial in order.(1) At the least,
read the first section (*note Complex One-Dimensional DFTs::) before
reading any of the others, even if your main interest lies in one of
the other transform types.

Users of FFTW version 2 and earlier may also want to read *note
Upgrading from FFTW version 2::.

---------- Footnotes ----------

(1) You can read the tutorial in bit-reversed order after computing
your first transform.


File: fftw3.info, Node: Complex One-Dimensional DFTs, Next: Complex Multi-Dimensional DFTs, Prev: Tutorial, Up: Tutorial

2.1 Complex One-Dimensional DFTs
================================

     Plan: To bother about the best method of accomplishing an
     accidental result. [Ambrose Bierce, `The Enlarged Devil's
     Dictionary'.]

The basic usage of FFTW to compute a one-dimensional DFT of size `N'
is simple, and it typically looks something like this code:

     #include <fftw3.h>
     ...
     {
         fftw_complex *in, *out;
         fftw_plan p;
         ...
         in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
         out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
         p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
         ...
         fftw_execute(p); /* repeat as needed */
         ...
         fftw_destroy_plan(p);
         fftw_free(in); fftw_free(out);
     }

You must link this code with the `fftw3' library. On Unix systems,
link with `-lfftw3 -lm'.

The example code first allocates the input and output arrays. You
can allocate them in any way that you like, but we recommend using
`fftw_malloc', which behaves like `malloc' except that it properly
aligns the array when SIMD instructions (such as SSE and Altivec) are
available (*note SIMD alignment and fftw_malloc::). [Alternatively, we
provide a convenient wrapper function `fftw_alloc_complex(N)' which has
the same effect.]

The data is an array of type `fftw_complex', which is by default a
`double[2]' composed of the real (`in[i][0]') and imaginary
(`in[i][1]') parts of a complex number.

The next step is to create a "plan", which is an object that
contains all the data that FFTW needs to compute the FFT. This
function creates the plan:

     fftw_plan fftw_plan_dft_1d(int n, fftw_complex *in, fftw_complex *out,
                                int sign, unsigned flags);

The first argument, `n', is the size of the transform you are trying
to compute. The size `n' can be any positive integer, but sizes that
are products of small factors are transformed most efficiently
(although prime sizes still use an O(n log n) algorithm).

The next two arguments are pointers to the input and output arrays of
the transform. These pointers can be equal, indicating an "in-place"
transform.

The fourth argument, `sign', can be either `FFTW_FORWARD' (`-1') or
`FFTW_BACKWARD' (`+1'), and indicates the direction of the transform
you are interested in; technically, it is the sign of the exponent in
the transform.

The `flags' argument is usually either `FFTW_MEASURE' or `FFTW_ESTIMATE'.
`FFTW_MEASURE' instructs FFTW to run and measure the execution time of
several FFTs in order to find the best way to compute the transform of
size `n'. This process takes some time (usually a few seconds),
depending on your machine and on the size of the transform.
`FFTW_ESTIMATE', on the contrary, does not run any computation and just
builds a reasonable plan that is probably sub-optimal. In short, if
your program performs many transforms of the same size and
initialization time is not important, use `FFTW_MEASURE'; otherwise use
the estimate.

_You must create the plan before initializing the input_, because
`FFTW_MEASURE' overwrites the `in'/`out' arrays. (Technically,
`FFTW_ESTIMATE' does not touch your arrays, but you should always
create plans first just to be sure.)

Once the plan has been created, you can use it as many times as you
like for transforms on the specified `in'/`out' arrays, computing the
actual transforms via `fftw_execute(plan)':
     void fftw_execute(const fftw_plan plan);

The DFT results are stored in-order in the array `out', with the
zero-frequency (DC) component in `out[0]'. If `in != out', the
transform is "out-of-place" and the input array `in' is not modified.
Otherwise, the input array is overwritten with the transform.

If you want to transform a _different_ array of the same size, you
can create a new plan with `fftw_plan_dft_1d' and FFTW automatically
reuses the information from the previous plan, if possible.
Alternatively, with the "guru" interface you can apply a given plan to
a different array, if you are careful. *Note FFTW Reference::.

When you are done with the plan, you deallocate it by calling
`fftw_destroy_plan(plan)':
     void fftw_destroy_plan(fftw_plan plan);
If you allocate an array with `fftw_malloc()' you must deallocate it
with `fftw_free()'. Do not use `free()' or, heaven forbid, `delete'.

FFTW computes an _unnormalized_ DFT. Thus, computing a forward
followed by a backward transform (or vice versa) results in the original
array scaled by `n'. For the definition of the DFT, see *note What
FFTW Really Computes::.

If you have a C compiler, such as `gcc', that supports the C99
standard, and you `#include <complex.h>' _before_ `<fftw3.h>', then
`fftw_complex' is the native double-precision complex type and you can
manipulate it with ordinary arithmetic. Otherwise, FFTW defines its
own complex type, which is bit-compatible with the C99 complex type.
*Note Complex numbers::. (The C++ `<complex>' template class may also
be usable via a typecast.)
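
The bit-compatibility mentioned above can be illustrated without
FFTW. In this sketch, `pair_complex' is a hypothetical stand-in for
the default `double[2]' layout, named differently so that the code
compiles without `<fftw3.h>'; the cast relies on the C99 rule that a
complex type has the same representation and alignment as an array of
two elements of the corresponding real type.

```c
#include <assert.h>
#include <complex.h>

/* Stand-in for the double[2] layout described in the text.
 * Hypothetical name; the real type comes from <fftw3.h>. */
typedef double pair_complex[2];

/* Copies a C99 complex value into the {real, imaginary} pair layout. */
void to_pair(double complex z, pair_complex out)
{
    out[0] = creal(z);  /* real part, like in[i][0] */
    out[1] = cimag(z);  /* imaginary part, like in[i][1] */
}

/* Views a C99 complex array as an array of pairs without copying. */
pair_complex *as_pairs(double complex *z)
{
    assert(z != 0);
    return (pair_complex *)z;
}
```

With either view, element `i' keeps its real part at index 0 and its
imaginary part at index 1, which is why the same buffer can be filled
with complex arithmetic and read back through the pair layout.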

To use single or long-double precision versions of FFTW, replace the
`fftw_' prefix by `fftwf_' or `fftwl_' and link with `-lfftw3f' or
`-lfftw3l', but use the _same_ `<fftw3.h>' header file.

Many more flags exist besides `FFTW_MEASURE' and `FFTW_ESTIMATE'.
For example, use `FFTW_PATIENT' if you're willing to wait even longer
for a possibly even faster plan (*note FFTW Reference::). You can also
save plans for future use, as described by *note Words of Wisdom-Saving
Plans::.


File: fftw3.info, Node: Complex Multi-Dimensional DFTs, Next: One-Dimensional DFTs of Real Data, Prev: Complex One-Dimensional DFTs, Up: Tutorial

2.2 Complex Multi-Dimensional DFTs
==================================

Multi-dimensional transforms work much the same way as one-dimensional
transforms: you allocate arrays of `fftw_complex' (preferably using
`fftw_malloc'), create an `fftw_plan', execute it as many times as you
want with `fftw_execute(plan)', and clean up with
`fftw_destroy_plan(plan)' (and `fftw_free').

FFTW provides two routines for creating plans for 2d and 3d
transforms, and one routine for creating plans of arbitrary
dimensionality. The 2d and 3d routines have the following signature:
     fftw_plan fftw_plan_dft_2d(int n0, int n1,
                                fftw_complex *in, fftw_complex *out,
                                int sign, unsigned flags);
     fftw_plan fftw_plan_dft_3d(int n0, int n1, int n2,
                                fftw_complex *in, fftw_complex *out,
                                int sign, unsigned flags);

These routines create plans for `n0' by `n1' two-dimensional (2d)
transforms and `n0' by `n1' by `n2' 3d transforms, respectively. All
of these transforms operate on contiguous arrays in the C-standard
"row-major" order, so that the last dimension has the fastest-varying
index in the array. This layout is described further in *note
Multi-dimensional Array Format::.

FFTW can also compute transforms of higher dimensionality. In order
to avoid confusion between the various meanings of the word
"dimension", we use the term _rank_ to denote the number of independent
indices in an array.(1) For example, we say that a 2d transform has
rank 2, a 3d transform has rank 3, and so on. You can plan transforms
of arbitrary rank by means of the following function:

     fftw_plan fftw_plan_dft(int rank, const int *n,
                             fftw_complex *in, fftw_complex *out,
                             int sign, unsigned flags);

Here, `n' is a pointer to an array `n[rank]' denoting an `n[0]' by
`n[1]' by ... by `n[rank-1]' transform. Thus, for example, the call
     fftw_plan_dft_2d(n0, n1, in, out, sign, flags);
is equivalent to the following code fragment:
     int n[2];
     n[0] = n0;
     n[1] = n1;
     fftw_plan_dft(2, n, in, out, sign, flags);
`fftw_plan_dft' is not restricted to 2d and 3d transforms, however,
but it can plan transforms of arbitrary rank.
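
To locate one element of such an `n[0]' by `n[1]' by ... by
`n[rank-1]' array in memory, the row-major rule (last index varies
fastest) can be written as a short loop. This is a sketch in plain C;
`row_major_offset' is a hypothetical helper and not an FFTW function.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (idx[0], idx[1], ..., idx[rank-1]) in a contiguous
 * row-major array of dimensions n[0] x n[1] x ... x n[rank-1].
 * Hypothetical helper for illustration; not part of the FFTW API. */
size_t row_major_offset(int rank, const int *n, const int *idx)
{
    size_t offset = 0;
    assert(rank > 0);
    for (int d = 0; d < rank; ++d) {
        assert(idx[d] >= 0 && idx[d] < n[d]);
        /* Horner's rule: each earlier index is scaled by all later sizes. */
        offset = offset * (size_t)n[d] + (size_t)idx[d];
    }
    return offset;
}
```

For a 2 by 3 by 4 array, stepping the last index by one moves one
element, stepping the middle index moves 4 elements, and stepping the
first index moves 12 elements.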

You may have noticed that all the planner routines described so far
have overlapping functionality. For example, you can plan a 1d or 2d
transform by using `fftw_plan_dft' with a `rank' of `1' or `2', or even
by calling `fftw_plan_dft_3d' with `n0' and/or `n1' equal to `1' (with
no loss in efficiency). This pattern continues, and FFTW's planning
routines in general form a "partial order," sequences of interfaces
with strictly increasing generality but correspondingly greater
complexity.

`fftw_plan_dft' is the most general complex-DFT routine that we
describe in this tutorial, but there are also the advanced and guru
interfaces, which allow one to efficiently combine multiple/strided
transforms into a single FFTW plan, transform a subset of a larger
multi-dimensional array, and/or to handle more general complex-number
formats. For more information, see *note FFTW Reference::.

---------- Footnotes ----------

(1) The term "rank" is commonly used in the APL, FORTRAN, and Common
Lisp traditions, although it is not so common in the C world.


File: fftw3.info, Node: One-Dimensional DFTs of Real Data, Next: Multi-Dimensional DFTs of Real Data, Prev: Complex Multi-Dimensional DFTs, Up: Tutorial

2.3 One-Dimensional DFTs of Real Data
=====================================

In many practical applications, the input data `in[i]' are purely real
numbers, in which case the DFT output satisfies the "Hermitian" redundancy:
`out[i]' is the conjugate of `out[n-i]'. It is possible to take
advantage of these circumstances in order to achieve roughly a factor
of two improvement in both speed and memory usage.

In exchange for these speed and space advantages, the user sacrifices
some of the simplicity of FFTW's complex transforms. First of all, the
input and output arrays are of _different sizes and types_: the input
is `n' real numbers, while the output is `n/2+1' complex numbers (the
non-redundant outputs); this also requires slight "padding" of the
input array for in-place transforms. Second, the inverse transform
(complex to real) has the side-effect of _overwriting its input array_,
by default. Neither of these inconveniences should pose a serious
problem for users, but it is important to be aware of them.

The routines to perform real-data transforms are almost the same as
those for complex transforms: you allocate arrays of `double' and/or
`fftw_complex' (preferably using `fftw_malloc' or
`fftw_alloc_complex'), create an `fftw_plan', execute it as many times
as you want with `fftw_execute(plan)', and clean up with
`fftw_destroy_plan(plan)' (and `fftw_free'). The only differences are
that the input (or output) is of type `double' and there are new
routines to create the plan. In one dimension:

     fftw_plan fftw_plan_dft_r2c_1d(int n, double *in, fftw_complex *out,
                                    unsigned flags);
     fftw_plan fftw_plan_dft_c2r_1d(int n, fftw_complex *in, double *out,
                                    unsigned flags);

for the real input to complex-Hermitian output ("r2c") and
complex-Hermitian input to real output ("c2r") transforms. Unlike the
complex DFT planner, there is no `sign' argument. Instead, r2c DFTs
are always `FFTW_FORWARD' and c2r DFTs are always `FFTW_BACKWARD'. (For
single/long-double precision `fftwf' and `fftwl', `double' should be
replaced by `float' and `long double', respectively.)

Here, `n' is the "logical" size of the DFT, not necessarily the
physical size of the array. In particular, the real (`double') array
has `n' elements, while the complex (`fftw_complex') array has `n/2+1'
elements (where the division is rounded down). For an in-place
transform, `in' and `out' are aliased to the same array, which must be
big enough to hold both; so, the real array would actually have
`2*(n/2+1)' elements, where the elements beyond the first `n' are
unused padding. (Note that this is very different from the concept of
"zero-padding" a transform to a larger length, which changes the
logical size of the DFT by actually adding new input data.) The kth
element of the complex array is exactly the same as the kth element of
the corresponding complex DFT. All positive `n' are supported;
products of small factors are most efficient, but an O(n log n)
algorithm is used even for prime sizes.
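
The size rules in this paragraph reduce to two small formulas. The
helpers below are a sketch in plain C with hypothetical names
(`r2c_complex_count', `r2c_inplace_real_count'); they only restate
the arithmetic from the text and are not part of the FFTW API.

```c
#include <assert.h>
#include <stddef.h>

/* Number of complex outputs of an r2c transform of logical size n. */
size_t r2c_complex_count(int n)
{
    assert(n > 0);
    return (size_t)(n / 2) + 1;  /* integer division rounds down */
}

/* Number of doubles the shared buffer of an in-place r2c transform
 * must hold: the n logical inputs plus the unused padding. */
size_t r2c_inplace_real_count(int n)
{
    return 2 * r2c_complex_count(n);
}
```

For `n' = 8 this gives 5 complex outputs and a 10-element real buffer
(two padding elements); for `n' = 7 it gives 4 complex outputs and an
8-element real buffer (one padding element).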

As noted above, the c2r transform destroys its input array even for
out-of-place transforms. This can be prevented, if necessary, by
including `FFTW_PRESERVE_INPUT' in the `flags', with unfortunately some
sacrifice in performance. This flag is also not currently supported
for multi-dimensional real DFTs (next section).

Readers familiar with DFTs of real data will recall that the 0th (the
"DC") and `n/2'-th (the "Nyquist" frequency, when `n' is even) elements
of the complex output are purely real. Some implementations therefore
store the Nyquist element where the DC imaginary part would go, in
order to make the input and output arrays the same size. Such packing,
however, does not generalize well to multi-dimensional transforms, and
the space savings are minuscule in any case; FFTW does not support it.

An alternative interface for one-dimensional r2c and c2r DFTs can be
found in the `r2r' interface (*note The Halfcomplex-format DFT::), with
"halfcomplex"-format output that _is_ the same size (and type) as the
input array. That interface, although it is not very useful for
multi-dimensional transforms, may sometimes yield better performance.


File: fftw3.info, Node: Multi-Dimensional DFTs of Real Data, Next: More DFTs of Real Data, Prev: One-Dimensional DFTs of Real Data, Up: Tutorial

2.4 Multi-Dimensional DFTs of Real Data
=======================================

Multi-dimensional DFTs of real data use the following planner routines:

     fftw_plan fftw_plan_dft_r2c_2d(int n0, int n1,
                                    double *in, fftw_complex *out,
                                    unsigned flags);
     fftw_plan fftw_plan_dft_r2c_3d(int n0, int n1, int n2,
                                    double *in, fftw_complex *out,
                                    unsigned flags);
     fftw_plan fftw_plan_dft_r2c(int rank, const int *n,
                                 double *in, fftw_complex *out,
                                 unsigned flags);

as well as the corresponding `c2r' routines with the input/output
types swapped. These routines work similarly to their complex
analogues, except for the fact that here the complex output array is cut
roughly in half and the real array requires padding for in-place
transforms (as in 1d, above).

Chris@10
|
513 As before, `n' is the logical size of the array, and the
|
Chris@10
|
514 consequences of this on the the format of the complex arrays deserve
|
Chris@10
|
515 careful attention. Suppose that the real data has dimensions n[0] x
|
Chris@10
|
516 n[1] x n[2] x ... x n[d-1] (in row-major order). Then, after an r2c
|
Chris@10
|
517 transform, the output is an n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1)
|
Chris@10
|
518 array of `fftw_complex' values in row-major order, corresponding to
|
Chris@10
|
519 slightly over half of the output of the corresponding complex DFT.
|
Chris@10
|
520 (The division is rounded down.) The ordering of the data is otherwise
|
Chris@10
|
521 exactly the same as in the complex-DFT case.
|
Chris@10
|
522
|
Chris@10
|
523 For out-of-place transforms, this is the end of the story: the real
|
Chris@10
|
524 data is stored as a row-major array of size n[0] x n[1] x n[2] x ... x
|
Chris@10
|
525 n[d-1] and the complex data is stored as a row-major array of size
|
Chris@10
|
526 n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1).
|
Chris@10
|
527
|
Chris@10
|
528 For in-place transforms, however, extra padding of the real-data
|
Chris@10
|
529 array is necessary because the complex array is larger than the real
|
Chris@10
|
530 array, and the two arrays share the same memory locations. Thus, for
|
Chris@10
|
531 in-place transforms, the final dimension of the real-data array must be
|
Chris@10
|
532 padded with extra values to accommodate the size of the complex
|
Chris@10
|
533 data--two values if the last dimension is even and one if it is odd. That
|
Chris@10
|
534 is, the last dimension of the real data must physically contain 2 *
|
Chris@10
|
535 (n[d-1]/2+1) `double' values (exactly enough to hold the complex data).
|
Chris@10
|
536 This physical array size does not, however, change the _logical_ array
|
Chris@10
|
537 size--only n[d-1] values are actually stored in the last dimension, and
|
Chris@10
|
538 n[d-1] is the last dimension passed to the plan-creation routine.
|
Chris@10
|
539
|
Chris@10
|
540 For example, consider the transform of a two-dimensional real array
|
Chris@10
|
541 of size `n0' by `n1'. The output of the r2c transform is a
|
Chris@10
|
542 two-dimensional complex array of size `n0' by `n1/2+1', where the `y'
|
Chris@10
|
543 dimension has been cut nearly in half because of redundancies in the
|
Chris@10
|
544 output. Because `fftw_complex' is twice the size of `double', the
|
Chris@10
|
545 output array is slightly bigger than the input array. Thus, if we want
|
Chris@10
|
546 to compute the transform in place, we must _pad_ the input array so
|
Chris@10
|
547 that it is of size `n0' by `2*(n1/2+1)'. If `n1' is even, then there
|
Chris@10
|
548 are two padding elements at the end of each row (which need not be
|
Chris@10
|
549 initialized, as they are only used for output).
|
Chris@10
|
550
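The size bookkeeping for in-place r2c transforms can be sketched in a few lines of C. This is only an illustration of the arithmetic described above: the helper names `r2c_complex_last_dim`, `r2c_padded_last_dim`, and `r2c_inplace_real_index` are hypothetical and are not part of the FFTW API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers (not part of the FFTW API). */

/* Number of fftw_complex values stored along the last dimension of the
   r2c output, for a logical last dimension n_last. */
size_t r2c_complex_last_dim(size_t n_last)
{
    assert(n_last > 0);
    return n_last / 2 + 1;      /* integer division rounds down */
}

/* Physical number of doubles per row of the real array for an in-place
   transform: exactly enough to hold one row of complex output. */
size_t r2c_padded_last_dim(size_t n_last)
{
    return 2 * r2c_complex_last_dim(n_last);
}

/* Offset of real element (i0, i1) in a padded, in-place n0 x n1 array. */
size_t r2c_inplace_real_index(size_t i0, size_t i1, size_t n1)
{
    assert(i1 < n1);            /* padding slots hold no input data */
    return i0 * r2c_padded_last_dim(n1) + i1;
}
```

For `n1 = 4` each padded row holds 6 doubles (two of padding), and for `n1 = 5` it also holds 6 (one of padding), in agreement with the even/odd rule stated earlier.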
|
Chris@10
|
551 These transforms are unnormalized, so an r2c followed by a c2r
|
Chris@10
|
552 transform (or vice versa) will result in the original data scaled by
|
Chris@10
|
553 the number of real data elements--that is, the product of the (logical)
|
Chris@10
|
554 dimensions of the real data.
|
Chris@10
|
555
|
Chris@10
|
556 (Because the last dimension is treated specially, if it is equal to
|
Chris@10
|
557 `1' the transform is _not_ equivalent to a lower-dimensional r2c/c2r
|
Chris@10
|
558 transform. In that case, the last complex dimension also has size `1'
|
Chris@10
|
559 (`=1/2+1'), and no advantage is gained over the complex transforms.)
|
Chris@10
|
560
|
Chris@10
|
561
|
Chris@10
|
562 File: fftw3.info, Node: More DFTs of Real Data, Prev: Multi-Dimensional DFTs of Real Data, Up: Tutorial
|
Chris@10
|
563
|
Chris@10
|
564 2.5 More DFTs of Real Data
|
Chris@10
|
565 ==========================
|
Chris@10
|
566
|
Chris@10
|
567 * Menu:
|
Chris@10
|
568
|
Chris@10
|
569 * The Halfcomplex-format DFT::
|
Chris@10
|
570 * Real even/odd DFTs (cosine/sine transforms)::
|
Chris@10
|
571 * The Discrete Hartley Transform::
|
Chris@10
|
572
|
Chris@10
|
573 FFTW supports several other transform types via a unified "r2r"
|
Chris@10
|
574 (real-to-real) interface, so called because it takes a real (`double')
|
Chris@10
|
575 array and outputs a real array of the same size. These r2r transforms
|
Chris@10
|
576 currently fall into three categories: DFTs of real input and
|
Chris@10
|
577 complex-Hermitian output in halfcomplex format, DFTs of real input with
|
Chris@10
|
578 even/odd symmetry (a.k.a. discrete cosine/sine transforms, DCTs/DSTs),
|
Chris@10
|
579 and discrete Hartley transforms (DHTs), all described in more detail by
|
Chris@10
|
580 the following sections.
|
Chris@10
|
581
|
Chris@10
|
582 The r2r transforms follow the now-familiar interface of creating
|
Chris@10
|
583 an `fftw_plan', executing it with `fftw_execute(plan)', and destroying
|
Chris@10
|
584 it with `fftw_destroy_plan(plan)'. Furthermore, all r2r transforms
|
Chris@10
|
585 share the same planner interface:
|
Chris@10
|
586
|
Chris@10
|
587 fftw_plan fftw_plan_r2r_1d(int n, double *in, double *out,
|
Chris@10
|
588 fftw_r2r_kind kind, unsigned flags);
|
Chris@10
|
589 fftw_plan fftw_plan_r2r_2d(int n0, int n1, double *in, double *out,
|
Chris@10
|
590 fftw_r2r_kind kind0, fftw_r2r_kind kind1,
|
Chris@10
|
591 unsigned flags);
|
Chris@10
|
592 fftw_plan fftw_plan_r2r_3d(int n0, int n1, int n2,
|
Chris@10
|
593 double *in, double *out,
|
Chris@10
|
594 fftw_r2r_kind kind0,
|
Chris@10
|
595 fftw_r2r_kind kind1,
|
Chris@10
|
596 fftw_r2r_kind kind2,
|
Chris@10
|
597 unsigned flags);
|
Chris@10
|
598 fftw_plan fftw_plan_r2r(int rank, const int *n, double *in, double *out,
|
Chris@10
|
599 const fftw_r2r_kind *kind, unsigned flags);
|
Chris@10
|
600
|
Chris@10
|
601 Just as for the complex DFT, these plan 1d/2d/3d/multi-dimensional
|
Chris@10
|
602 transforms for contiguous arrays in row-major order, transforming (real)
|
Chris@10
|
603 input to output of the same size, where `n' specifies the _physical_
|
Chris@10
|
604 dimensions of the arrays. All positive `n' are supported (with the
|
Chris@10
|
605 exception of `n=1' for the `FFTW_REDFT00' kind, noted in the real-even
|
Chris@10
|
606 subsection below); products of small factors are most efficient
|
Chris@10
|
607 (factorizing `n-1' and `n+1' for `FFTW_REDFT00' and `FFTW_RODFT00'
|
Chris@10
|
608 kinds, described below), but an O(n log n) algorithm is used even for
|
Chris@10
|
609 prime sizes.
|
Chris@10
|
610
|
Chris@10
|
611 Each dimension has a "kind" parameter, of type `fftw_r2r_kind',
|
Chris@10
|
612 specifying the kind of r2r transform to be used for that dimension. (In
|
Chris@10
|
613 the case of `fftw_plan_r2r', this is an array `kind[rank]' where
|
Chris@10
|
614 `kind[i]' is the transform kind for the dimension `n[i]'.) The kind
|
Chris@10
|
615 can be one of a set of predefined constants, defined in the following
|
Chris@10
|
616 subsections.
|
Chris@10
|
617
|
Chris@10
|
618 In other words, FFTW computes the separable product of the specified
|
Chris@10
|
619 r2r transforms over each dimension, which can be used e.g. for partial
|
Chris@10
|
620 differential equations with mixed boundary conditions. (For some r2r
|
Chris@10
|
621 kinds, notably the halfcomplex DFT and the DHT, such a separable
|
Chris@10
|
622 product is somewhat problematic in more than one dimension, however, as
|
Chris@10
|
623 is described below.)
|
Chris@10
|
624
|
Chris@10
|
625 In the current version of FFTW, all r2r transforms except for the
|
Chris@10
|
626 halfcomplex type are computed via pre- or post-processing of
|
Chris@10
|
627 halfcomplex transforms, and they are therefore not as fast as they
|
Chris@10
|
628 could be. Since most other general DCT/DST codes employ a similar
|
Chris@10
|
629 algorithm, however, FFTW's implementation should provide at least
|
Chris@10
|
630 competitive performance.
|
Chris@10
|
631
|
Chris@10
|
632
|
Chris@10
|
633 File: fftw3.info, Node: The Halfcomplex-format DFT, Next: Real even/odd DFTs (cosine/sine transforms), Prev: More DFTs of Real Data, Up: More DFTs of Real Data
|
Chris@10
|
634
|
Chris@10
|
635 2.5.1 The Halfcomplex-format DFT
|
Chris@10
|
636 --------------------------------
|
Chris@10
|
637
|
Chris@10
|
638 An r2r kind of `FFTW_R2HC' ("r2hc") corresponds to an r2c DFT (*note
|
Chris@10
|
639 One-Dimensional DFTs of Real Data::) but with "halfcomplex" format
|
Chris@10
|
640 output, and may sometimes be faster and/or more convenient than the
|
Chris@10
|
641 latter. The inverse "hc2r" transform is of kind `FFTW_HC2R'. The halfcomplex format
|
Chris@10
|
642 consists of the non-redundant half of the complex output for a 1d
|
Chris@10
|
643 real-input DFT of size `n', stored as a sequence of `n' real numbers
|
Chris@10
|
644 (`double') in the format:
|
Chris@10
|
645
|
Chris@10
|
646 r0, r1, r2, ..., r(n/2), i((n+1)/2-1), ..., i2, i1
|
Chris@10
|
647
|
Chris@10
|
648 Here, rk is the real part of the kth output, and ik is the imaginary
|
Chris@10
|
649 part. (Division by 2 is rounded down.) For a halfcomplex array
|
Chris@10
|
650 `hc[n]', the kth component thus has its real part in `hc[k]' and its
|
Chris@10
|
651 imaginary part in `hc[n-k]', with the exception of `k == 0' or
|
Chris@10
|
652 `n/2' (the latter only if `n' is even)--in these two cases, the
|
Chris@10
|
653 imaginary part is zero due to symmetries of the real-input DFT, and is
|
Chris@10
|
654 not stored. Thus, the r2hc transform of `n' real values is a
|
Chris@10
|
655 halfcomplex array of length `n', and vice versa for hc2r.
|
Chris@10
|
656
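The indexing rule for a halfcomplex array can be written as a pair of small accessors. This is a minimal sketch; the names `hc_real_part` and `hc_imag_part` are hypothetical and not part of FFTW, and both assume `0 <= k <= n/2`.

```c
#include <assert.h>

/* Hypothetical accessors (not part of the FFTW API) for a halfcomplex
   array hc of length n. Both are valid for 0 <= k <= n/2. */

/* Real part of the kth output. */
double hc_real_part(const double *hc, int n, int k)
{
    assert(n > 0 && k >= 0 && k <= n / 2);
    return hc[k];
}

/* Imaginary part of the kth output. */
double hc_imag_part(const double *hc, int n, int k)
{
    assert(n > 0 && k >= 0 && k <= n / 2);
    /* For k == 0, and for k == n/2 when n is even, the imaginary part
       is zero by symmetry and is not stored in the array. */
    if (k == 0 || 2 * k == n)
        return 0.0;
    return hc[n - k];
}
```

For example, with `n = 5` the array layout is r0, r1, r2, i2, i1, so `hc_imag_part(hc, 5, 2)` reads `hc[3]`.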
|
Chris@10
|
657 Aside from the differing format, the output of
|
Chris@10
|
658 `FFTW_R2HC'/`FFTW_HC2R' is otherwise exactly the same as for the
|
Chris@10
|
659 corresponding 1d r2c/c2r transform (i.e. `FFTW_FORWARD'/`FFTW_BACKWARD'
|
Chris@10
|
660 transforms, respectively). Recall that these transforms are
|
Chris@10
|
661 unnormalized, so r2hc followed by hc2r will result in the original data
|
Chris@10
|
662 multiplied by `n'. Furthermore, like the c2r transform, an
|
Chris@10
|
663 out-of-place hc2r transform will _destroy its input_ array.
|
Chris@10
|
664
|
Chris@10
|
665 Although these halfcomplex transforms can be used with the
|
Chris@10
|
666 multi-dimensional r2r interface, the interpretation of such a separable
|
Chris@10
|
667 product of transforms along each dimension is problematic. For example,
|
Chris@10
|
668 consider a two-dimensional `n0' by `n1', r2hc by r2hc transform planned
|
Chris@10
|
669 by `fftw_plan_r2r_2d(n0, n1, in, out, FFTW_R2HC, FFTW_R2HC,
|
Chris@10
|
670 FFTW_MEASURE)'. Conceptually, FFTW first transforms the rows (of size
|
Chris@10
|
671 `n1') to produce halfcomplex rows, and then transforms the columns (of
|
Chris@10
|
672 size `n0'). Half of these column transforms, however, are of imaginary
|
Chris@10
|
673 parts, and should therefore be multiplied by i and combined with the
|
Chris@10
|
674 r2hc transforms of the real columns to produce the 2d DFT amplitudes;
|
Chris@10
|
675 FFTW's r2r transform does _not_ perform this combination for you.
|
Chris@10
|
676 Thus, if a multi-dimensional real-input/output DFT is required, we
|
Chris@10
|
677 recommend using the ordinary r2c/c2r interface (*note Multi-Dimensional
|
Chris@10
|
678 DFTs of Real Data::).
|
Chris@10
|
679
|
Chris@10
|
680
|
Chris@10
|
681 File: fftw3.info, Node: Real even/odd DFTs (cosine/sine transforms), Next: The Discrete Hartley Transform, Prev: The Halfcomplex-format DFT, Up: More DFTs of Real Data
|
Chris@10
|
682
|
Chris@10
|
683 2.5.2 Real even/odd DFTs (cosine/sine transforms)
|
Chris@10
|
684 -------------------------------------------------
|
Chris@10
|
685
|
Chris@10
|
686 The Fourier transform of a real-even function f(-x) = f(x) is
|
Chris@10
|
687 real-even, and i times the Fourier transform of a real-odd function
|
Chris@10
|
688 f(-x) = -f(x) is real-odd. Similar results hold for a discrete Fourier
|
Chris@10
|
689 transform, and thus for these symmetries the need for complex
|
Chris@10
|
690 inputs/outputs is entirely eliminated. Moreover, one gains a factor of
|
Chris@10
|
691 two in speed/space from the fact that the data are real, and an
|
Chris@10
|
692 additional factor of two from the even/odd symmetry: only the
|
Chris@10
|
693 non-redundant (first) half of the array need be stored. The result is
|
Chris@10
|
694 the real-even DFT ("REDFT") and the real-odd DFT ("RODFT"), also known
|
Chris@10
|
695 as the discrete cosine and sine transforms ("DCT" and "DST"),
|
Chris@10
|
696 respectively.
|
Chris@10
|
697
|
Chris@10
|
698 (In this section, we describe the 1d transforms; multi-dimensional
|
Chris@10
|
699 transforms are just a separable product of these transforms operating
|
Chris@10
|
700 along each dimension.)
|
Chris@10
|
701
|
Chris@10
|
702 Because of the discrete sampling, one has an additional choice: is
|
Chris@10
|
703 the data even/odd around a sampling point, or around the point halfway
|
Chris@10
|
704 between two samples? The latter corresponds to _shifting_ the samples
|
Chris@10
|
705 by _half_ an interval, and gives rise to several transform variants
|
Chris@10
|
706 denoted by REDFTab and RODFTab: a and b are 0 or 1, and indicate
|
Chris@10
|
707 whether the input (a) and/or output (b) are shifted by half a sample (1
|
Chris@10
|
708 means it is shifted). These are also known as types I-IV of the DCT
|
Chris@10
|
709 and DST, and all four types are supported by FFTW's r2r interface.(1)
|
Chris@10
|
710
|
Chris@10
|
711 The r2r kinds for the various REDFT and RODFT types supported by
|
Chris@10
|
712 FFTW, along with the boundary conditions at both ends of the _input_
|
Chris@10
|
713 array (`n' real numbers `in[j=0..n-1]'), are:
|
Chris@10
|
714
|
Chris@10
|
715 * `FFTW_REDFT00' (DCT-I): even around j=0 and even around j=n-1.
|
Chris@10
|
716
|
Chris@10
|
717 * `FFTW_REDFT10' (DCT-II, "the" DCT): even around j=-0.5 and even
|
Chris@10
|
718 around j=n-0.5.
|
Chris@10
|
719
|
Chris@10
|
720 * `FFTW_REDFT01' (DCT-III, "the" IDCT): even around j=0 and odd
|
Chris@10
|
721 around j=n.
|
Chris@10
|
722
|
Chris@10
|
723 * `FFTW_REDFT11' (DCT-IV): even around j=-0.5 and odd around j=n-0.5.
|
Chris@10
|
724
|
Chris@10
|
725 * `FFTW_RODFT00' (DST-I): odd around j=-1 and odd around j=n.
|
Chris@10
|
726
|
Chris@10
|
727 * `FFTW_RODFT10' (DST-II): odd around j=-0.5 and odd around j=n-0.5.
|
Chris@10
|
728
|
Chris@10
|
729 * `FFTW_RODFT01' (DST-III): odd around j=-1 and even around j=n-1.
|
Chris@10
|
730
|
Chris@10
|
731 * `FFTW_RODFT11' (DST-IV): odd around j=-0.5 and even around j=n-0.5.
|
Chris@10
|
732
|
Chris@10
|
733
|
Chris@10
|
734 Note that these symmetries apply to the "logical" array being
|
Chris@10
|
735 transformed; *there are no constraints on your physical input data*.
|
Chris@10
|
736 So, for example, if you specify a size-5 REDFT00 (DCT-I) of the data
|
Chris@10
|
737 abcde, it corresponds to the DFT of the logical even array abcdedcb of
|
Chris@10
|
738 size 8. A size-4 REDFT10 (DCT-II) of the data abcd corresponds to the
|
Chris@10
|
739 size-8 logical DFT of the even array abcddcba, shifted by half a sample.
|
Chris@10
|
740
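The two logical arrays in the examples above (abcde to abcdedcb, and abcd to abcddcba) can be constructed explicitly. This sketch is only meant to make the symmetry concrete; FFTW never builds these arrays, and the function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical helpers (not part of the FFTW API). */

/* REDFT00 (DCT-I): n inputs correspond to a logical even array of
   length 2*(n-1), e.g. abcde -> abcdedcb. Requires n >= 2. */
void redft00_logical_array(const double *in, int n, double *out)
{
    int j;
    assert(n >= 2);
    for (j = 0; j < n; ++j)
        out[j] = in[j];
    for (j = n; j < 2 * (n - 1); ++j)
        out[j] = in[2 * (n - 1) - j];   /* mirror about j = n-1 */
}

/* REDFT10 (DCT-II): n inputs correspond to a logical even array of
   length 2*n, e.g. abcd -> abcddcba (before the half-sample shift). */
void redft10_logical_array(const double *in, int n, double *out)
{
    int j;
    assert(n >= 1);
    for (j = 0; j < n; ++j) {
        out[j] = in[j];
        out[2 * n - 1 - j] = in[j];     /* mirror about j = n-0.5 */
    }
}
```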
|
Chris@10
|
741 All of these transforms are invertible. The inverse of R*DFT00 is
|
Chris@10
|
742 R*DFT00; of R*DFT10 is R*DFT01 and vice versa (these are often called
|
Chris@10
|
743 simply "the" DCT and IDCT, respectively); and of R*DFT11 is R*DFT11.
|
Chris@10
|
744 However, the transforms computed by FFTW are unnormalized, exactly like
|
Chris@10
|
745 the corresponding real and complex DFTs, so computing a transform
|
Chris@10
|
746 followed by its inverse yields the original array scaled by N, where N
|
Chris@10
|
747 is the _logical_ DFT size. For REDFT00, N=2(n-1); for RODFT00,
|
Chris@10
|
748 N=2(n+1); otherwise, N=2n.
|
Chris@10
|
749
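The normalization rule can be captured in a small helper that returns the logical size N and divides a round-tripped array by it. This is a sketch under stated assumptions: the enum `symmetric_kind` and both function names are hypothetical stand-ins, not FFTW's `fftw_r2r_kind` or any FFTW routine.

```c
#include <assert.h>

/* Hypothetical classification (not FFTW's fftw_r2r_kind). */
enum symmetric_kind {
    KIND_REDFT00,   /* DCT-I:  N = 2(n-1) */
    KIND_RODFT00,   /* DST-I:  N = 2(n+1) */
    KIND_SHIFTED    /* R*DFT10, R*DFT01, R*DFT11: N = 2n */
};

/* Logical DFT size N for a physical size n. */
int logical_dft_size(enum symmetric_kind kind, int n)
{
    assert(n >= 1);
    switch (kind) {
    case KIND_REDFT00:
        assert(n >= 2);         /* n = 1 would give N = 0 (undefined) */
        return 2 * (n - 1);
    case KIND_RODFT00:
        return 2 * (n + 1);
    default:
        return 2 * n;
    }
}

/* Divide by N to recover the original data after a transform followed
   by its inverse. */
void undo_roundtrip_scaling(double *data, int n, int logical_size)
{
    int j;
    assert(logical_size > 0);
    for (j = 0; j < n; ++j)
        data[j] /= (double)logical_size;
}
```

A size-5 REDFT00 and a size-4 REDFT10 both give N = 8, matching the logical arrays abcdedcb and abcddcba.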
|
Chris@10
|
750 Note that the boundary conditions of the transform output array are
|
Chris@10
|
751 given by the input boundary conditions of the inverse transform. Thus,
|
Chris@10
|
752 the above transforms are all inequivalent in terms of input/output
|
Chris@10
|
753 boundary conditions, even neglecting the 0.5 shift difference.
|
Chris@10
|
754
|
Chris@10
|
755 FFTW is most efficient when N is a product of small factors; note
|
Chris@10
|
756 that this _differs_ from the factorization of the physical size `n' for
|
Chris@10
|
757 REDFT00 and RODFT00! There is another oddity: `n=1' REDFT00 transforms
|
Chris@10
|
758 correspond to N=0, and so are _not defined_ (the planner will return
|
Chris@10
|
759 `NULL'). Otherwise, any positive `n' is supported.
|
Chris@10
|
760
|
Chris@10
|
761 For the precise mathematical definitions of these transforms as used
|
Chris@10
|
762 by FFTW, see *note What FFTW Really Computes::. (For people accustomed
|
Chris@10
|
763 to the DCT/DST, FFTW's definitions have a coefficient of 2 in front of
|
Chris@10
|
764 the cos/sin functions so that they correspond precisely to an even/odd
|
Chris@10
|
765 DFT of size N. Some authors also include additional multiplicative
|
Chris@10
|
766 factors of sqrt(2) for selected inputs and outputs; this makes the
|
Chris@10
|
767 transform orthogonal, but sacrifices the direct equivalence to a
|
Chris@10
|
768 symmetric DFT.)
|
Chris@10
|
769
|
Chris@10
|
770 Which type do you need?
|
Chris@10
|
771 .......................
|
Chris@10
|
772
|
Chris@10
|
773 Since the required flavor of even/odd DFT depends upon your problem,
|
Chris@10
|
774 you are the best judge of this choice, but we can make a few comments
|
Chris@10
|
775 on relative efficiency to help you in your selection. In particular,
|
Chris@10
|
776 R*DFT01 and R*DFT10 tend to be slightly faster than R*DFT11 (especially
|
Chris@10
|
777 for odd sizes), while the R*DFT00 transforms are sometimes
|
Chris@10
|
778 significantly slower (especially for even sizes).(2)
|
Chris@10
|
779
|
Chris@10
|
780 Thus, if only the boundary conditions on the transform inputs are
|
Chris@10
|
781 specified, we generally recommend R*DFT10 over R*DFT00 and R*DFT01 over
|
Chris@10
|
782 R*DFT11 (unless the half-sample shift or the self-inverse property is
|
Chris@10
|
783 significant for your problem).
|
Chris@10
|
784
|
Chris@10
|
785 If performance is important to you and you are using only small sizes
|
Chris@10
|
786 (say n<200), e.g. for multi-dimensional transforms, then you might
|
Chris@10
|
787 consider generating hard-coded transforms of those sizes and types that
|
Chris@10
|
788 you are interested in (*note Generating your own code::).
|
Chris@10
|
789
|
Chris@10
|
790 We are interested in hearing what types of symmetric transforms you
|
Chris@10
|
791 find most useful.
|
Chris@10
|
792
|
Chris@10
|
793 ---------- Footnotes ----------
|
Chris@10
|
794
|
Chris@10
|
795 (1) There are also type V-VIII transforms, which correspond to a
|
Chris@10
|
796 logical DFT of _odd_ size N, independent of whether the physical size
|
Chris@10
|
797 `n' is odd, but we do not support these variants.
|
Chris@10
|
798
|
Chris@10
|
799 (2) R*DFT00 is sometimes slower in FFTW because we discovered that
|
Chris@10
|
800 the standard algorithm for computing this by a pre/post-processed real
|
Chris@10
|
801 DFT--the algorithm used in FFTPACK, Numerical Recipes, and other
|
Chris@10
|
802 sources for decades now--has serious numerical problems: it already
|
Chris@10
|
803 loses several decimal places of accuracy for 16k sizes. There seem to
|
Chris@10
|
804 be only two alternatives in the literature that do not suffer
|
Chris@10
|
805 similarly: a recursive decomposition into smaller DCTs, which would
|
Chris@10
|
806 require a large set of codelets for efficiency and generality, or
|
Chris@10
|
807 sacrificing a factor of 2 in speed to use a real DFT of twice the size.
|
Chris@10
|
808 We currently employ the latter technique for general n, as well as a
|
Chris@10
|
809 limited form of the former method: a split-radix decomposition when n
|
Chris@10
|
810 is odd (N a multiple of 4). For N containing many factors of 2, the
|
Chris@10
|
811 split-radix method seems to recover most of the speed of the standard
|
Chris@10
|
812 algorithm without the accuracy tradeoff.
|
Chris@10
|
813
|
Chris@10
|
814
|
Chris@10
|
815 File: fftw3.info, Node: The Discrete Hartley Transform, Prev: Real even/odd DFTs (cosine/sine transforms), Up: More DFTs of Real Data
|
Chris@10
|
816
|
Chris@10
|
817 2.5.3 The Discrete Hartley Transform
|
Chris@10
|
818 ------------------------------------
|
Chris@10
|
819
|
Chris@10
|
820 If you are planning to use the DHT because you've heard that it is
|
Chris@10
|
821 "faster" than the DFT (FFT), *stop here*. The DHT is not faster than
|
Chris@10
|
822 the DFT. That story is an old but enduring misconception that was
|
Chris@10
|
823 debunked in 1987.
|
Chris@10
|
824
|
Chris@10
|
825 The discrete Hartley transform (DHT) is an invertible linear
|
Chris@10
|
826 transform closely related to the DFT. In the DFT, one multiplies each
|
Chris@10
|
827 input by cos - i * sin (a complex exponential), whereas in the DHT each
|
Chris@10
|
828 input is multiplied by simply cos + sin. Thus, the DHT transforms `n'
|
Chris@10
|
829 real numbers to `n' real numbers, and has the convenient property of
|
Chris@10
|
830 being its own inverse. In FFTW, a DHT (of any positive `n') can be
|
Chris@10
|
831 specified by an r2r kind of `FFTW_DHT'.
|
Chris@10
|
832
|
Chris@10
|
833 Like the DFT, in FFTW the DHT is unnormalized, so computing a DHT of
|
Chris@10
|
834 size `n' followed by another DHT of the same size will result in the
|
Chris@10
|
835 original array multiplied by `n'.
|
Chris@10
|
836
|
Chris@10
|
837 The DHT was originally proposed as a more efficient alternative to
|
Chris@10
|
838 the DFT for real data, but it was subsequently shown that a specialized
|
Chris@10
|
839 DFT (such as FFTW's r2hc or r2c transforms) could be just as fast. In
|
Chris@10
|
840 FFTW, the DHT is actually computed by post-processing an r2hc
|
Chris@10
|
841 transform, so there is ordinarily no reason to prefer it from a
|
Chris@10
|
842 performance perspective.(1) However, we have heard rumors that the DHT
|
Chris@10
|
843 might be the most appropriate transform in its own right for certain
|
Chris@10
|
844 applications, and we would be very interested to hear from anyone who
|
Chris@10
|
845 finds it useful.
|
Chris@10
|
846
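The relationship between the DHT and the r2hc output follows from the definitions: if the DFT output is Y[k] = sum of x[j] * (cos - i * sin), then the DHT output is Re Y[k] - Im Y[k]. The following sketch applies that identity to a halfcomplex array. It illustrates the mathematics only and is not claimed to be FFTW's internal post-processing code; the function name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper (not part of the FFTW API): convert the
   halfcomplex output hc[n] of a size-n r2hc transform into the
   corresponding DHT output dht[n]. */
void dht_from_halfcomplex(const double *hc, int n, double *dht)
{
    int k;
    assert(n > 0);
    for (k = 0; k <= n / 2; ++k) {
        int self_conjugate = (k == 0 || 2 * k == n);
        double re = hc[k];
        double im = self_conjugate ? 0.0 : hc[n - k];
        dht[k] = re - im;               /* H[k]   = Re Y[k] - Im Y[k] */
        if (!self_conjugate)
            dht[n - k] = re + im;       /* Y[n-k] = conj(Y[k])        */
    }
}
```

For the input 1, 2, 3, 4 the DFT is 10, -2+2i, -2, -2-2i, so the halfcomplex array is 10, -2, -2, 2 and the DHT is 10, -4, -2, 0.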
|
Chris@10
|
847 If `FFTW_DHT' is specified for multiple dimensions of a
|
Chris@10
|
848 multi-dimensional transform, FFTW computes the separable product of 1d
|
Chris@10
|
849 DHTs along each dimension. Unfortunately, this is not quite the same
|
Chris@10
|
850 thing as a true multi-dimensional DHT; you can compute the latter, if
|
Chris@10
|
851 necessary, with at most `rank-1' post-processing passes [see e.g. H.
|
Chris@10
|
852 Hao and R. N. Bracewell, Proc. IEEE 75, 264-266 (1987)].
|
Chris@10
|
853
|
Chris@10
|
854 For the precise mathematical definition of the DHT as used by FFTW,
|
Chris@10
|
855 see *note What FFTW Really Computes::.
|
Chris@10
|
856
|
Chris@10
|
857 ---------- Footnotes ----------
|
Chris@10
|
858
|
Chris@10
|
859 (1) We provide the DHT mainly as a byproduct of some internal
|
Chris@10
|
860 algorithms. FFTW computes a real input/output DFT of _prime_ size by
|
Chris@10
|
861 re-expressing it as a DHT plus post/pre-processing and then using
|
Chris@10
|
862 Rader's prime-DFT algorithm adapted to the DHT.
|
Chris@10
|
863
|
Chris@10
|
864
|
Chris@10
|
865 File: fftw3.info, Node: Other Important Topics, Next: FFTW Reference, Prev: Tutorial, Up: Top
|
Chris@10
|
866
|
Chris@10
|
867 3 Other Important Topics
|
Chris@10
|
868 ************************
|
Chris@10
|
869
|
Chris@10
|
870 * Menu:
|
Chris@10
|
871
|
Chris@10
|
872 * SIMD alignment and fftw_malloc::
|
Chris@10
|
873 * Multi-dimensional Array Format::
|
Chris@10
|
874 * Words of Wisdom-Saving Plans::
|
Chris@10
|
875 * Caveats in Using Wisdom::
|
Chris@10
|
876
|
Chris@10
|
877
|
Chris@10
|
878 File: fftw3.info, Node: SIMD alignment and fftw_malloc, Next: Multi-dimensional Array Format, Prev: Other Important Topics, Up: Other Important Topics
|
Chris@10
|
879
|
Chris@10
|
880 3.1 SIMD alignment and fftw_malloc
|
Chris@10
|
881 ==================================
|
Chris@10
|
882
|
Chris@10
|
883 SIMD, which stands for "Single Instruction Multiple Data," is a set of
|
Chris@10
|
884 special operations supported by some processors to perform a single
|
Chris@10
|
885 operation on several numbers (usually 2 or 4) simultaneously. SIMD
|
Chris@10
|
886 floating-point instructions are available on several popular CPUs:
|
Chris@10
|
887 SSE/SSE2/AVX on recent x86/x86-64 processors, AltiVec (single precision)
|
Chris@10
|
888 on some PowerPCs (Apple G4 and higher), NEON on some ARM models, and
|
Chris@10
|
889 MIPS Paired Single (currently only in FFTW 3.2.x). FFTW can be
|
Chris@10
|
890 compiled to support the SIMD instructions on any of these systems.
|
Chris@10
|
891
|
Chris@10
|
892 A program linking to an FFTW library compiled with SIMD support can
|
Chris@10
|
893 obtain a nonnegligible speedup for most complex and r2c/c2r transforms.
|
Chris@10
|
894 In order to obtain this speedup, however, the arrays of complex (or
|
Chris@10
|
895 real) data passed to FFTW must be specially aligned in memory
|
Chris@10
|
896 (typically 16-byte aligned), and often this alignment is more stringent
|
Chris@10
|
897 than that provided by the usual `malloc' (etc.) allocation routines.
|
Chris@10
|
898
|
Chris@10
|
899 To guarantee proper alignment for SIMD, in case your program is ever
|
Chris@10
|
900 linked against a SIMD-enabled FFTW, we therefore recommend
|
Chris@10
|
901 allocating your transform data with `fftw_malloc' and de-allocating it
|
Chris@10
|
902 with `fftw_free'. These have exactly the same interface and behavior as
|
Chris@10
|
903 `malloc'/`free', except that for a SIMD FFTW they ensure that the
|
Chris@10
|
904 returned pointer has the necessary alignment (by calling `memalign' or
|
Chris@10
|
905 its equivalent on your OS).
|
Chris@10
|
906
|
Chris@10
|
907 You are not _required_ to use `fftw_malloc'. You can allocate your
|
Chris@10
|
908 data in any way that you like, from `malloc' to `new' (in C++) to a
|
Chris@10
|
909 fixed-size array declaration. If the array happens not to be properly
|
Chris@10
|
910 aligned, FFTW will not use the SIMD extensions.
|
Chris@10
|
911
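If you allocate memory yourself, you can check whether a pointer meets a given alignment. The sketch below assumes the "typical" 16-byte requirement mentioned above; the actual requirement depends on the SIMD instruction set FFTW was built for, and `is_aligned_to` is a hypothetical helper, not an FFTW routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the FFTW API): returns nonzero if
   the address p is a multiple of alignment, which must be a power of
   two (for example 16). */
int is_aligned_to(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (uintptr_t)(alignment - 1)) == 0;
}
```

A call such as `is_aligned_to(data, 16)` on a buffer obtained from plain `malloc` may or may not succeed, which is why `fftw_malloc' is the safer choice.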
|
Chris@10
|
912 Since `fftw_malloc' only ever needs to be used for real and complex
|
Chris@10
|
913 arrays, we provide two convenient wrapper routines `fftw_alloc_real(N)'
|
Chris@10
|
914 and `fftw_alloc_complex(N)' that are equivalent to
|
Chris@10
|
915 `(double*)fftw_malloc(sizeof(double) * N)' and
|
Chris@10
|
916 `(fftw_complex*)fftw_malloc(sizeof(fftw_complex) * N)', respectively
|
Chris@10
|
917 (or their equivalents in other precisions).
|
Chris@10
|
918
|
Chris@10
|
919
|
Chris@10
|
920 File: fftw3.info, Node: Multi-dimensional Array Format, Next: Words of Wisdom-Saving Plans, Prev: SIMD alignment and fftw_malloc, Up: Other Important Topics
|
Chris@10
|
921
|
Chris@10
|
922 3.2 Multi-dimensional Array Format
|
Chris@10
|
923 ==================================
|
Chris@10
|
924
|
Chris@10
|
925 This section describes the format in which multi-dimensional arrays are
|
Chris@10
|
926 stored in FFTW. We felt that a detailed discussion of this topic was
|
Chris@10
|
927 necessary. Since several different formats are common, this topic is
|
Chris@10
|
928 often a source of confusion.
|
Chris@10
|
929
|
Chris@10
|
930 * Menu:
|
Chris@10
|
931
|
Chris@10
|
932 * Row-major Format::
|
Chris@10
|
933 * Column-major Format::
|
Chris@10
|
934 * Fixed-size Arrays in C::
|
Chris@10
|
935 * Dynamic Arrays in C::
|
Chris@10
|
936 * Dynamic Arrays in C-The Wrong Way::
|
Chris@10
|
937
|
Chris@10
|
938
|
Chris@10
|
939 File: fftw3.info, Node: Row-major Format, Next: Column-major Format, Prev: Multi-dimensional Array Format, Up: Multi-dimensional Array Format
|
Chris@10
|
940
|
Chris@10
|
941 3.2.1 Row-major Format
|
Chris@10
|
942 ----------------------
|
Chris@10
|
943
|
Chris@10
|
944 The multi-dimensional arrays passed to `fftw_plan_dft' etcetera are
|
Chris@10
|
945 expected to be stored as a single contiguous block in "row-major" order
|
Chris@10
|
946 (sometimes called "C order"). Basically, this means that as you step
|
Chris@10
|
947 through adjacent memory locations, the first dimension's index varies
|
Chris@10
|
948 most slowly and the last dimension's index varies most quickly.
|
Chris@10
|
949
|
Chris@10
|
950 To be more explicit, let us consider an array of rank d whose
|
Chris@10
|
951 dimensions are n[0] x n[1] x n[2] x ... x n[d-1]. Now, we specify a
|
Chris@10
|
952 location in the array by a sequence of d (zero-based) indices, one for
|
Chris@10
|
953 each dimension: (i[0], i[1], ..., i[d-1]). If the array is stored in
|
Chris@10
|
954 row-major order, then this element is located at the position i[d-1] +
|
Chris@10
|
955 n[d-1] * (i[d-2] + n[d-2] * (... + n[1] * i[0])).
|
Chris@10
|
956
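The position formula above is a Horner-style evaluation and can be coded as a short loop. The function name `row_major_index` is a hypothetical helper, not part of FFTW.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the FFTW API): position of element
   (i[0], ..., i[rank-1]) in a row-major array with dimensions
   n[0] x ... x n[rank-1]. Evaluates
   i[d-1] + n[d-1] * (i[d-2] + n[d-2] * (... + n[1] * i[0])). */
size_t row_major_index(int rank, const int *n, const int *i)
{
    size_t pos = 0;
    int d;
    assert(rank > 0);
    for (d = 0; d < rank; ++d) {
        assert(i[d] >= 0 && i[d] < n[d]);
        pos = pos * (size_t)n[d] + (size_t)i[d];
    }
    return pos;
}
```

For a 5 x 12 x 27 array, element (1, 2, 3) is at position (1*12 + 2)*27 + 3 = 381.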
|
Chris@10
|
957 Note that, for the ordinary complex DFT, each element of the array
|
Chris@10
|
958 must be of type `fftw_complex'; i.e. a (real, imaginary) pair of
|
Chris@10
|
959 (double-precision) numbers.
|
Chris@10
|
960
|
Chris@10
|
961 In the advanced FFTW interface, the physical dimensions n from which
|
Chris@10
|
962 the indices are computed can be different from (larger than) the
|
Chris@10
|
963 logical dimensions of the transform to be computed, in order to
|
Chris@10
|
964 transform a subset of a larger array. Note also that, in the advanced
|
Chris@10
|
965 interface, the expression above is multiplied by a "stride" to get the
|
Chris@10
|
966 actual array index--this is useful in situations where each element of
|
Chris@10
|
967 the multi-dimensional array is actually a data structure (or another
|
Chris@10
|
968 array), and you just want to transform a single field. In the basic
|
Chris@10
|
969 interface, however, the stride is 1.
|
Chris@10
|
970
|
Chris@10
|
971
|
Chris@10
|
972 File: fftw3.info, Node: Column-major Format, Next: Fixed-size Arrays in C, Prev: Row-major Format, Up: Multi-dimensional Array Format
|
Chris@10
|
973
|
Chris@10
|
974 3.2.2 Column-major Format
|
Chris@10
|
975 -------------------------
|
Chris@10
|
976
|
Chris@10
|
977 Readers from the Fortran world are used to arrays stored in
|
Chris@10
|
978 "column-major" order (sometimes called "Fortran order"). This is
|
Chris@10
|
979 essentially the exact opposite of row-major order in that, here, the
|
Chris@10
|
980 _first_ dimension's index varies most quickly.
|
Chris@10
|
981
|
Chris@10
|
982 If you have an array stored in column-major order and wish to
|
Chris@10
|
983 transform it using FFTW, it is quite easy to do. When creating the
|
Chris@10
|
984 plan, simply pass the dimensions of the array to the planner in
|
Chris@10
|
985 _reverse order_. For example, if your array is a rank three `N x M x
|
Chris@10
|
986 L' matrix in column-major order, you should pass the dimensions of the
|
Chris@10
|
987 array as if it were an `L x M x N' matrix (which it is, from the
|
Chris@10
|
988 perspective of FFTW). This is done for you _automatically_ by the FFTW
|
Chris@10
|
989 legacy-Fortran interface (*note Calling FFTW from Legacy Fortran::),
|
Chris@10
|
990 but you must do it manually with the modern Fortran interface (*note
|
Chris@10
|
991 Reversing array dimensions::).
|
Chris@10
|
992
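The claim that a column-major `N x M x L' array is, to FFTW, a row-major `L x M x N' array can be checked with index arithmetic. In this sketch all three function names are hypothetical helpers, not FFTW routines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers (not part of the FFTW API). */

/* Row-major position: the last index varies most quickly. */
size_t row_major_index(int rank, const int *n, const int *i)
{
    size_t pos = 0;
    int d;
    for (d = 0; d < rank; ++d)
        pos = pos * (size_t)n[d] + (size_t)i[d];
    return pos;
}

/* Column-major position: the first index varies most quickly. */
size_t column_major_index(int rank, const int *n, const int *i)
{
    size_t pos = 0;
    int d;
    for (d = rank - 1; d >= 0; --d)
        pos = pos * (size_t)n[d] + (size_t)i[d];
    return pos;
}

/* Copy src into dst in reverse order, as done for the dimensions
   passed to the planner. */
void reverse_ints(int rank, const int *src, int *dst)
{
    int d;
    assert(rank > 0);
    for (d = 0; d < rank; ++d)
        dst[d] = src[rank - 1 - d];
}
```

Reversing both the dimensions and the indices maps every column-major position to the same row-major position, so no data needs to be moved.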
|
Chris@10
|
993
|
Chris@10
|
994 File: fftw3.info, Node: Fixed-size Arrays in C, Next: Dynamic Arrays in C, Prev: Column-major Format, Up: Multi-dimensional Array Format
|
Chris@10
|
995
|
Chris@10
|
996 3.2.3 Fixed-size Arrays in C
|
Chris@10
|
997 ----------------------------
|
Chris@10
|
998
|
Chris@10
|
999 A multi-dimensional array whose size is declared at compile time in C
|
Chris@10
|
1000 is _already_ in row-major order. You don't have to do anything special
|
Chris@10
|
1001 to transform it. For example:
|
Chris@10
|
1002
|
Chris@10
|
1003 {
|
Chris@10
|
1004 fftw_complex data[N0][N1][N2];
|
Chris@10
|
1005 fftw_plan plan;
|
Chris@10
|
1006 ...
|
Chris@10
|
1007 plan = fftw_plan_dft_3d(N0, N1, N2, &data[0][0][0], &data[0][0][0],
|
Chris@10
|
1008 FFTW_FORWARD, FFTW_ESTIMATE);
|
Chris@10
|
1009 ...
|
Chris@10
|
1010 }
|
Chris@10
|
1011
|
Chris@10
|
1012 This will plan a 3d in-place transform of size `N0 x N1 x N2'.
|
Chris@10
|
1013 Notice how we took the address of the zero-th element to pass to the
|
Chris@10
|
1014 planner (we could also have used a typecast).
|
Chris@10
|
1015
|
Chris@10
|
1016 However, we tend to _discourage_ users from declaring their arrays
|
Chris@10
|
1017 in this way, for two reasons. First, this allocates the array on the
|
Chris@10
|
1018 stack ("automatic" storage), which has a very limited size on most
|
Chris@10
|
1019 operating systems (declaring an array with more than a few thousand
|
Chris@10
|
1020 elements will often cause a crash). (You can get around this
|
Chris@10
|
1021 limitation on many systems by declaring the array as `static' and/or
|
Chris@10
|
1022 global, but that has its own drawbacks.) Second, it may not optimally
|
Chris@10
|
1023 align the array for use with a SIMD FFTW (*note SIMD alignment and
|
Chris@10
|
1024 fftw_malloc::). Instead, we recommend using `fftw_malloc', as
|
Chris@10
|
1025 described below.
|
Chris@10
|
1026
|
Chris@10
|
1027
|
Chris@10
|
1028 File: fftw3.info, Node: Dynamic Arrays in C, Next: Dynamic Arrays in C-The Wrong Way, Prev: Fixed-size Arrays in C, Up: Multi-dimensional Array Format
|
Chris@10
|
1029
|
Chris@10
|
1030 3.2.4 Dynamic Arrays in C
|
Chris@10
|
1031 -------------------------
|
Chris@10
|
1032
|
Chris@10
|
1033 We recommend allocating most arrays dynamically, with `fftw_malloc'.
|
Chris@10
|
1034 This isn't too hard to do, although it is not as straightforward for
|
Chris@10
|
1035 multi-dimensional arrays as it is for one-dimensional arrays.
|
Chris@10
|
1036
|
Chris@10
|
1037 Creating the array is simple: using a dynamic-allocation routine like
|
Chris@10
|
1038 `fftw_malloc', allocate an array big enough to store N `fftw_complex'
|
Chris@10
|
1039 values (for a complex DFT), where N is the product of the sizes of the
|
Chris@10
|
1040 array dimensions (i.e. the total number of complex values in the
|
Chris@10
|
1041 array). For example, here is code to allocate a 5 x 12 x 27 rank-3
|
Chris@10
|
1042 array:
|
Chris@10
|
1043
|
Chris@10
|
1044 fftw_complex *an_array;
|
Chris@10
|
1045 an_array = (fftw_complex*) fftw_malloc(5*12*27 * sizeof(fftw_complex));
|
Chris@10
|
1046
|
Chris@10
|
1047 Accessing the array elements, however, is trickier--you can't
|
Chris@10
|
1048 simply use multiple applications of the `[]' operator like you could
|
Chris@10
|
1049 for fixed-size arrays. Instead, you have to explicitly compute the
|
Chris@10
|
1050 offset into the array using the formula given earlier for row-major
|
Chris@10
|
1051 arrays. For example, to reference the (i,j,k)-th element of the array
|
Chris@10
|
1052 allocated above, you would use the expression `an_array[k + 27 * (j +
|
Chris@10
|
1053 12 * i)]'.
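
As a minimal sketch, the offset expression above can be wrapped in a small helper so the arithmetic is written only once. The names `idx3', `N0', `N1', and `N2' are hypothetical and are fixed here to the 5 x 12 x 27 example:

```c
#include <assert.h>
#include <stddef.h>

/* Dimensions of the example array from the text. */
enum { N0 = 5, N1 = 12, N2 = 27 };

/* Row-major offset of element (i,j,k): the last index varies fastest. */
static size_t idx3(size_t i, size_t j, size_t k)
{
    assert(i < (size_t) N0 && j < (size_t) N1 && k < (size_t) N2);
    return k + (size_t) N2 * (j + (size_t) N1 * i);
}
```

With this helper, the real part of the (i,j,k)-th element would be reached as `an_array[idx3(i, j, k)][0]'.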
|
Chris@10
|
1054
|
Chris@10
|
1055 This pain can be alleviated somewhat by defining appropriate macros,
|
Chris@10
|
1056 or, in C++, creating a class and overloading the `()' operator. The
|
Chris@10
|
1057 recent C99 standard provides a way to reinterpret the dynamic array as
|
Chris@10
|
1058 a "variable-length" multi-dimensional array amenable to `[]', but this
|
Chris@10
|
1059 feature is not yet widely supported by compilers.
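
Where the compiler does support C99 variable-length array types, the reinterpretation might look like the following sketch. It uses a stand-in type `cplx' in place of `fftw_complex' and plain `malloc' in place of `fftw_malloc' so that it needs no FFTW headers; the function name `vla_view_matches' is hypothetical. It writes through the `[]' view and then confirms, through the flat pointer, that every element landed at its row-major offset:

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for fftw_complex so the sketch needs no FFTW headers. */
typedef double cplx[2];

/* Returns 1 if indexing through a pointer to a variable-length array
   type reaches the same storage as the explicit row-major formula. */
static int vla_view_matches(int n0, int n1, int n2)
{
    size_t total = (size_t) n0 * (size_t) n1 * (size_t) n2;
    cplx *flat = malloc(total * sizeof(cplx));
    int ok = 1;

    if (!flat)
        return 0;

    /* Reinterpret the flat block as an n0 x n1 x n2 array. */
    cplx (*a)[n1][n2] = (cplx (*)[n1][n2]) flat;

    for (int i = 0; i < n0; ++i)
        for (int j = 0; j < n1; ++j)
            for (int k = 0; k < n2; ++k) {
                /* Store each element's own row-major offset. */
                a[i][j][k][0] = (double) (k + n2 * (j + n1 * i));
                a[i][j][k][1] = 0.0;
            }

    /* Read back through the flat pointer. */
    for (size_t p = 0; p < total; ++p)
        if (flat[p][0] != (double) p)
            ok = 0;

    free(flat);
    return ok;
}
```

The flat pointer is still what would be handed to the planner; the array view is only a convenience for indexing.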
|
Chris@10
|
1060
|
Chris@10
|
1061
|
Chris@10
|
1062 File: fftw3.info, Node: Dynamic Arrays in C-The Wrong Way, Prev: Dynamic Arrays in C, Up: Multi-dimensional Array Format
|
Chris@10
|
1063
|
Chris@10
|
1064 3.2.5 Dynamic Arrays in C--The Wrong Way
|
Chris@10
|
1065 ----------------------------------------
|
Chris@10
|
1066
|
Chris@10
|
1067 A different method for allocating multi-dimensional arrays in C is
|
Chris@10
|
1068 often suggested that is incompatible with FFTW: _using it will cause
|
Chris@10
|
1069 FFTW to die a painful death_. We discuss the technique here, however,
|
Chris@10
|
1070 because it is so commonly known and used. This method is to create
|
Chris@10
|
1071 arrays of pointers to arrays of pointers to ... etcetera. For example,
|
Chris@10
|
1072 the analogue in this method to the example above is:
|
Chris@10
|
1073
|
Chris@10
|
1074 int i,j;
|
Chris@10
|
1075 fftw_complex ***a_bad_array; /* another way to make a 5x12x27 array */
|
Chris@10
|
1076
|
Chris@10
|
1077 a_bad_array = (fftw_complex ***) malloc(5 * sizeof(fftw_complex **));
|
Chris@10
|
1078 for (i = 0; i < 5; ++i) {
|
Chris@10
|
1079 a_bad_array[i] =
|
Chris@10
|
1080 (fftw_complex **) malloc(12 * sizeof(fftw_complex *));
|
Chris@10
|
1081 for (j = 0; j < 12; ++j)
|
Chris@10
|
1082 a_bad_array[i][j] =
|
Chris@10
|
1083 (fftw_complex *) malloc(27 * sizeof(fftw_complex));
|
Chris@10
|
1084 }
|
Chris@10
|
1085
|
Chris@10
|
1086 As you can see, this sort of array is inconvenient to allocate (and
|
Chris@10
|
1087 deallocate). On the other hand, it has the advantage that the
|
Chris@10
|
1088 (i,j,k)-th element can be referenced simply by `a_bad_array[i][j][k]'.
|
Chris@10
|
1089
|
Chris@10
|
1090 If you like this technique and want to maximize convenience in
|
Chris@10
|
1091 accessing the array, but still want to pass the array to FFTW, you can
|
Chris@10
|
1092 use a hybrid method. Allocate the array as one contiguous block, but
|
Chris@10
|
1093 also declare an array of arrays of pointers that point to appropriate
|
Chris@10
|
1094 places in the block. That sort of trick is beyond the scope of this
|
Chris@10
|
1095 documentation; for more information on multi-dimensional arrays in C,
|
Chris@10
|
1096 see the `comp.lang.c' FAQ (http://c-faq.com/aryptr/dynmuldimary.html).
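
One possible shape of the hybrid method is sketched below; it is an illustration under stated assumptions, not a recommended implementation. The type `cplx' stands in for `fftw_complex', plain `malloc' stands in for `fftw_malloc', and the names `array3d', `array3d_alloc', and `array3d_free' are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>

typedef double cplx[2];   /* stand-in for fftw_complex */

typedef struct {
    cplx *block;    /* contiguous n0*n1*n2 data: the part FFTW would use */
    cplx **rows;    /* n0*n1 pointers into block */
    cplx ***view;   /* n0 pointers into rows; index as view[i][j][k] */
} array3d;

/* Returns 1 on success, 0 on allocation failure. */
static int array3d_alloc(array3d *a, size_t n0, size_t n1, size_t n2)
{
    assert(a && n0 > 0 && n1 > 0 && n2 > 0);
    a->block = malloc(n0 * n1 * n2 * sizeof(cplx));
    a->rows = malloc(n0 * n1 * sizeof(cplx *));
    a->view = malloc(n0 * sizeof(cplx **));
    if (!a->block || !a->rows || !a->view) {
        free(a->block);
        free(a->rows);
        free(a->view);
        return 0;
    }
    for (size_t i = 0; i < n0; ++i) {
        a->view[i] = a->rows + i * n1;
        for (size_t j = 0; j < n1; ++j)
            a->view[i][j] = a->block + n2 * (j + n1 * i);
    }
    return 1;
}

static void array3d_free(array3d *a)
{
    free(a->view);
    free(a->rows);
    free(a->block);
}
```

In real use the block would presumably come from `fftw_malloc' and be released with `fftw_free'; `block' is the pointer to pass to the planner, while `view' serves only for `[i][j][k]' access.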
|
Chris@10
|
1097
|
Chris@10
|
1098
|
Chris@10
|
1099 File: fftw3.info, Node: Words of Wisdom-Saving Plans, Next: Caveats in Using Wisdom, Prev: Multi-dimensional Array Format, Up: Other Important Topics
|
Chris@10
|
1100
|
Chris@10
|
1101 3.3 Words of Wisdom--Saving Plans
|
Chris@10
|
1102 =================================
|
Chris@10
|
1103
|
Chris@10
|
1104 FFTW implements a method for saving plans to disk and restoring them.
|
Chris@10
|
1105 In fact, what FFTW does is more general than just saving and loading
|
Chris@10
|
1106 plans. The mechanism is called "wisdom". Here, we describe this
|
Chris@10
|
1107 feature at a high level. *Note FFTW Reference::, for a less casual but
|
Chris@10
|
1108 more complete discussion of how to use wisdom in FFTW.
|
Chris@10
|
1109
|
Chris@10
|
1110 Plans created with the `FFTW_MEASURE', `FFTW_PATIENT', or
|
Chris@10
|
1111 `FFTW_EXHAUSTIVE' options produce near-optimal FFT performance, but may
|
Chris@10
|
1112 require a long time to compute because FFTW must measure the runtime of
|
Chris@10
|
1113 many possible plans and select the best one. This setup is designed
|
Chris@10
|
1114 for the situations where so many transforms of the same size must be
|
Chris@10
|
1115 computed that the start-up time is irrelevant. For short
|
Chris@10
|
1116 initialization times, but slower transforms, we have provided
|
Chris@10
|
1117 `FFTW_ESTIMATE'. The `wisdom' mechanism is a way to get the best of
|
Chris@10
|
1118 both worlds: you compute a good plan once, save it to disk, and later
|
Chris@10
|
1119 reload it as many times as necessary. The wisdom mechanism can
|
Chris@10
|
1120 actually save and reload many plans at once, not just one.
|
Chris@10
|
1121
|
Chris@10
|
1122 Whenever you create a plan, the FFTW planner accumulates wisdom,
|
Chris@10
|
1123 which is information sufficient to reconstruct the plan. After
|
Chris@10
|
1124 planning, you can save this information to disk by means of the
|
Chris@10
|
1125 function:
|
Chris@10
|
1126 int fftw_export_wisdom_to_filename(const char *filename);
|
Chris@10
|
1127 (This function returns non-zero on success.)
|
Chris@10
|
1128
|
Chris@10
|
1129 The next time you run the program, you can restore the wisdom with
|
Chris@10
|
1130 `fftw_import_wisdom_from_filename' (which also returns non-zero on
|
Chris@10
|
1131 success), and then recreate the plan using the same flags as before.
|
Chris@10
|
1132 int fftw_import_wisdom_from_filename(const char *filename);
|
Chris@10
|
1133
|
Chris@10
|
1134 Wisdom is automatically used for any size to which it is applicable,
|
Chris@10
|
1135 as long as the planner flags are not more "patient" than those with
|
Chris@10
|
1136 which the wisdom was created. For example, wisdom created with
|
Chris@10
|
1137 `FFTW_MEASURE' can be used if you later plan with `FFTW_ESTIMATE' or
|
Chris@10
|
1138 `FFTW_MEASURE', but not with `FFTW_PATIENT'.
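
The rule can be modelled as a simple ordering, as in the sketch below. The `rigor' enum and `wisdom_applies' function are hypothetical names for illustration only; the enum values are a ranking, not the values of FFTW's flag constants:

```c
#include <assert.h>

/* Hypothetical ranking of planner rigor, least to most patient.
   These are NOT the values of FFTW's flag constants. */
typedef enum {
    RIGOR_ESTIMATE = 0,
    RIGOR_MEASURE,
    RIGOR_PATIENT,
    RIGOR_EXHAUSTIVE
} rigor;

/* Returns 1 if wisdom created at level `created' can serve a plan
   requested at level `requested': the request must not be more
   patient than the wisdom. */
static int wisdom_applies(rigor created, rigor requested)
{
    return requested <= created;
}
```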
|
Chris@10
|
1139
|
Chris@10
|
1140 The `wisdom' is cumulative, and is stored in a global, private data
|
Chris@10
|
1141 structure managed internally by FFTW. The storage space required is
|
Chris@10
|
1142 minimal, proportional to the logarithm of the sizes the wisdom was
|
Chris@10
|
1143 generated from. If memory usage is a concern, however, the wisdom can
|
Chris@10
|
1144 be forgotten and its associated memory freed by calling:
|
Chris@10
|
1145 void fftw_forget_wisdom(void);
|
Chris@10
|
1146
|
Chris@10
|
1147 Wisdom can be exported to a file, a string, or any other medium.
|
Chris@10
|
1148 For details, see *note Wisdom::.
|
Chris@10
|
1149
|
Chris@10
|
1150
|
Chris@10
|
1151 File: fftw3.info, Node: Caveats in Using Wisdom, Prev: Words of Wisdom-Saving Plans, Up: Other Important Topics
|
Chris@10
|
1152
|
Chris@10
|
1153 3.4 Caveats in Using Wisdom
|
Chris@10
|
1154 ===========================
|
Chris@10
|
1155
|
Chris@10
|
1156 For in much wisdom is much grief, and he that increaseth knowledge
|
Chris@10
|
1157 increaseth sorrow. [Ecclesiastes 1:18]
|
Chris@10
|
1158
|
Chris@10
|
1159 There are pitfalls to using wisdom, in that it can negate FFTW's
|
Chris@10
|
1160 ability to adapt to changing hardware and other conditions. For
|
Chris@10
|
1161 example, it would be perfectly possible to export wisdom from a program
|
Chris@10
|
1162 running on one processor and import it into a program running on
|
Chris@10
|
1163 another processor. Doing so, however, would mean that the second
|
Chris@10
|
1164 program would use plans optimized for the first processor, instead of
|
Chris@10
|
1165 the one it is running on.
|
Chris@10
|
1166
|
Chris@10
|
1167 It should be safe to reuse wisdom as long as the hardware and program
|
Chris@10
|
1168 binaries remain unchanged. (Actually, the optimal plan may change even
|
Chris@10
|
1169 between runs of the same binary on identical hardware, due to
|
Chris@10
|
1170 differences in the virtual memory environment, etcetera. Users
|
Chris@10
|
1171 seriously interested in performance should worry about this problem,
|
Chris@10
|
1172 too.) It is likely that, if the same wisdom is used for two different
|
Chris@10
|
1173 program binaries, even running on the same machine, the plans may be
|
Chris@10
|
1174 sub-optimal because of differing code alignments. It is therefore wise
|
Chris@10
|
1175 to recreate wisdom every time an application is recompiled. The more
|
Chris@10
|
1176 the underlying hardware and software changes between the creation of
|
Chris@10
|
1177 wisdom and its use, the greater grows the risk of sub-optimal plans.
|
Chris@10
|
1178
|
Chris@10
|
1179 Nevertheless, if the choice is between using `FFTW_ESTIMATE' or
|
Chris@10
|
1180 using possibly-suboptimal wisdom (created on the same machine, but for a
|
Chris@10
|
1181 different binary), the wisdom is likely to be better. For this reason,
|
Chris@10
|
1182 we provide a function to import wisdom from a standard system-wide
|
Chris@10
|
1183 location (`/etc/fftw/wisdom' on Unix):
|
Chris@10
|
1184
|
Chris@10
|
1185 int fftw_import_system_wisdom(void);
|
Chris@10
|
1186
|
Chris@10
|
1187 FFTW also provides a standalone program, `fftw-wisdom' (described by
|
Chris@10
|
1188 its own `man' page on Unix) with which users can create wisdom, e.g.
|
Chris@10
|
1189 for a canonical set of sizes to store in the system wisdom file. *Note
|
Chris@10
|
1190 Wisdom Utilities::.
|
Chris@10
|
1191
|
Chris@10
|
1192
|
Chris@10
|
1193 File: fftw3.info, Node: FFTW Reference, Next: Multi-threaded FFTW, Prev: Other Important Topics, Up: Top
|
Chris@10
|
1194
|
Chris@10
|
1195 4 FFTW Reference
|
Chris@10
|
1196 ****************
|
Chris@10
|
1197
|
Chris@10
|
1198 This chapter provides a complete reference for all sequential (i.e.,
|
Chris@10
|
1199 one-processor) FFTW functions. Parallel transforms are described in
|
Chris@10
|
1200 later chapters.
|
Chris@10
|
1201
|
Chris@10
|
1202 * Menu:
|
Chris@10
|
1203
|
Chris@10
|
1204 * Data Types and Files::
|
Chris@10
|
1205 * Using Plans::
|
Chris@10
|
1206 * Basic Interface::
|
Chris@10
|
1207 * Advanced Interface::
|
Chris@10
|
1208 * Guru Interface::
|
Chris@10
|
1209 * New-array Execute Functions::
|
Chris@10
|
1210 * Wisdom::
|
Chris@10
|
1211 * What FFTW Really Computes::
|
Chris@10
|
1212
|
Chris@10
|
1213
|
Chris@10
|
1214 File: fftw3.info, Node: Data Types and Files, Next: Using Plans, Prev: FFTW Reference, Up: FFTW Reference
|
Chris@10
|
1215
|
Chris@10
|
1216 4.1 Data Types and Files
|
Chris@10
|
1217 ========================
|
Chris@10
|
1218
|
Chris@10
|
1219 All programs using FFTW should include its header file:
|
Chris@10
|
1220
|
Chris@10
|
1221 #include <fftw3.h>
|
Chris@10
|
1222
|
Chris@10
|
1223 You must also link to the FFTW library. On Unix, this means adding
|
Chris@10
|
1224 `-lfftw3 -lm' at the _end_ of the link command.
|
Chris@10
|
1225
|
Chris@10
|
1226 * Menu:
|
Chris@10
|
1227
|
Chris@10
|
1228 * Complex numbers::
|
Chris@10
|
1229 * Precision::
|
Chris@10
|
1230 * Memory Allocation::
|
Chris@10
|
1231
|
Chris@10
|
1232
|
Chris@10
|
1233 File: fftw3.info, Node: Complex numbers, Next: Precision, Prev: Data Types and Files, Up: Data Types and Files
|
Chris@10
|
1234
|
Chris@10
|
1235 4.1.1 Complex numbers
|
Chris@10
|
1236 ---------------------
|
Chris@10
|
1237
|
Chris@10
|
1238 The default FFTW interface uses `double' precision for all
|
Chris@10
|
1239 floating-point numbers, and defines a `fftw_complex' type to hold
|
Chris@10
|
1240 complex numbers as:
|
Chris@10
|
1241
|
Chris@10
|
1242 typedef double fftw_complex[2];
|
Chris@10
|
1243
|
Chris@10
|
1244 Here, the `[0]' element holds the real part and the `[1]' element
|
Chris@10
|
1245 holds the imaginary part.
|
Chris@10
|
1246
|
Chris@10
|
1247 Alternatively, if you have a C compiler (such as `gcc') that
|
Chris@10
|
1248 supports the C99 revision of the ANSI C standard, you can use C's new
|
Chris@10
|
1249 native complex type (which is binary-compatible with the typedef above).
|
Chris@10
|
1250 In particular, if you `#include <complex.h>' _before_ `<fftw3.h>', then
|
Chris@10
|
1251 `fftw_complex' is defined to be the native complex type and you can
|
Chris@10
|
1252 manipulate it with ordinary arithmetic (e.g. `x = y * (3+4*I)', where
|
Chris@10
|
1253 `x' and `y' are `fftw_complex' and `I' is the standard symbol for the
|
Chris@10
|
1254 imaginary unit).
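
The binary compatibility can be illustrated without FFTW, as in this sketch for a C99 compiler. The type `pair' mirrors the `double[2]' typedef above, and the function name `times_3_plus_4i' is hypothetical. It does the arithmetic with the native complex type, then copies the result byte for byte into a two-element array:

```c
#include <assert.h>
#include <complex.h>
#include <string.h>

/* Same layout as the double[2] typedef shown above. */
typedef double pair[2];

/* Multiply y by (3+4i) using native complex arithmetic, then copy the
   result into a two-double pair to show that the layouts coincide. */
static void times_3_plus_4i(const double *y, double *out)
{
    double complex yc = y[0] + y[1] * I;
    double complex x = yc * (3 + 4 * I);

    assert(sizeof x == sizeof(pair));
    memcpy(out, &x, sizeof x);   /* out[0] = real part, out[1] = imaginary */
}
```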
|
Chris@10
|
1255
|
Chris@10
|
1256 C++ has its own `complex<T>' template class, defined in the standard
|
Chris@10
|
1257 `<complex>' header file. Reportedly, the C++ standards committee has
|
Chris@10
|
1258 recently agreed to mandate that the storage format used for this type
|
Chris@10
|
1259 be binary-compatible with the C99 type, i.e. an array `T[2]' with
|
Chris@10
|
1260 consecutive real `[0]' and imaginary `[1]' parts. (See report
|
Chris@10
|
1261 `http://www.open-std.org/jtc1/sc22/WG21/docs/papers/2002/n1388.pdf
|
Chris@10
|
1262 WG21/N1388'.) Although not part of the official standard as of this
|
Chris@10
|
1263 writing, the proposal stated that: "This solution has been tested with
|
Chris@10
|
1264 all current major implementations of the standard library and shown to
|
Chris@10
|
1265 be working." To the extent that this is true, if you have a variable
|
Chris@10
|
1266 `complex<double> *x', you can pass it directly to FFTW via
|
Chris@10
|
1267 `reinterpret_cast<fftw_complex*>(x)'.
|
Chris@10
|
1268
|
Chris@10
|
1269
|
Chris@10
|
1270 File: fftw3.info, Node: Precision, Next: Memory Allocation, Prev: Complex numbers, Up: Data Types and Files
|
Chris@10
|
1271
|
Chris@10
|
1272 4.1.2 Precision
|
Chris@10
|
1273 ---------------
|
Chris@10
|
1274
|
Chris@10
|
1275 You can install single and long-double precision versions of FFTW,
|
Chris@10
|
1276 which replace `double' with `float' and `long double', respectively
|
Chris@10
|
1277 (*note Installation and Customization::). To use these interfaces, you:
|
Chris@10
|
1278
|
Chris@10
|
1279 * Link to the single/long-double libraries; on Unix, `-lfftw3f' or
|
Chris@10
|
1280 `-lfftw3l' instead of (or in addition to) `-lfftw3'. (You can
|
Chris@10
|
1281 link to the different-precision libraries simultaneously.)
|
Chris@10
|
1282
|
Chris@10
|
1283 * Include the _same_ `<fftw3.h>' header file.
|
Chris@10
|
1284
|
Chris@10
|
1285 * Replace all lowercase instances of `fftw_' with `fftwf_' or
|
Chris@10
|
1286 `fftwl_' for single or long-double precision, respectively.
|
Chris@10
|
1287 (`fftw_complex' becomes `fftwf_complex', `fftw_execute' becomes
|
Chris@10
|
1288 `fftwf_execute', etcetera.)
|
Chris@10
|
1289
|
Chris@10
|
1290 * Uppercase names, i.e. names beginning with `FFTW_', remain the
|
Chris@10
|
1291 same.
|
Chris@10
|
1292
|
Chris@10
|
1293 * Replace `double' with `float' or `long double' for subroutine
|
Chris@10
|
1294 parameters.
|
Chris@10
|
1295
|
Chris@10
|
1296
|
Chris@10
|
1297 Depending upon your compiler and/or hardware, `long double' may not
|
Chris@10
|
1298 be any more precise than `double' (or may not be supported at all,
|
Chris@10
|
1299 although it is standard in C99).
|
Chris@10
|
1300
|
Chris@10
|
1301 We also support using the nonstandard `__float128'
|
Chris@10
|
1302 quadruple-precision type provided by recent versions of `gcc' on 32-
|
Chris@10
|
1303 and 64-bit x86 hardware (*note Installation and Customization::). To
|
Chris@10
|
1304 use this type, link with `-lfftw3q -lquadmath -lm' (the `libquadmath'
|
Chris@10
|
1305 library provided by `gcc' is needed for quadruple-precision
|
Chris@10
|
1306 trigonometric functions) and use `fftwq_' identifiers.
|
Chris@10
|
1307
|
Chris@10
|
1308
|
Chris@10
|
1309 File: fftw3.info, Node: Memory Allocation, Prev: Precision, Up: Data Types and Files
|
Chris@10
|
1310
|
Chris@10
|
1311 4.1.3 Memory Allocation
|
Chris@10
|
1312 -----------------------
|
Chris@10
|
1313
|
Chris@10
|
1314 void *fftw_malloc(size_t n);
|
Chris@10
|
1315 void fftw_free(void *p);
|
Chris@10
|
1316
|
Chris@10
|
1317 These are functions that behave identically to `malloc' and `free',
|
Chris@10
|
1318 except that they guarantee that the returned pointer obeys any special
|
Chris@10
|
1319 alignment restrictions imposed by any algorithm in FFTW (e.g. for SIMD
|
Chris@10
|
1320 acceleration). *Note SIMD alignment and fftw_malloc::.
|
Chris@10
|
1321
|
Chris@10
|
1322 Data allocated by `fftw_malloc' _must_ be deallocated by `fftw_free'
|
Chris@10
|
1323 and not by the ordinary `free'.
|
Chris@10
|
1324
|
Chris@10
|
1325 These routines simply call through to your operating system's
|
Chris@10
|
1326 `malloc' or, if necessary, its aligned equivalent (e.g. `memalign'), so
|
Chris@10
|
1327 you normally need not worry about any significant time or space
|
Chris@10
|
1328 overhead. You are _not required_ to use them to allocate your data,
|
Chris@10
|
1329 but we strongly recommend it.
|
Chris@10
|
1330
|
Chris@10
|
1331 Note: in C++, just as with ordinary `malloc', you must typecast the
|
Chris@10
|
1332 output of `fftw_malloc' to whatever pointer type you are allocating.
|
Chris@10
|
1333
|
Chris@10
|
1334 We also provide the following two convenience functions to allocate
|
Chris@10
|
1335 real and complex arrays with `n' elements, which are equivalent to
|
Chris@10
|
1336 `(double *) fftw_malloc(sizeof(double) * n)' and `(fftw_complex *)
|
Chris@10
|
1337 fftw_malloc(sizeof(fftw_complex) * n)', respectively:
|
Chris@10
|
1338
|
Chris@10
|
1339 double *fftw_alloc_real(size_t n);
|
Chris@10
|
1340 fftw_complex *fftw_alloc_complex(size_t n);
|
Chris@10
|
1341
|
Chris@10
|
1342 The equivalent functions in other precisions allocate arrays of `n'
|
Chris@10
|
1343 elements in that precision; e.g. `fftwf_alloc_real(n)' is equivalent
|
Chris@10
|
1344 to `(float *) fftwf_malloc(sizeof(float) * n)'.
|
Chris@10
|
1345
|
Chris@10
|
1346
|
Chris@10
|
1347 File: fftw3.info, Node: Using Plans, Next: Basic Interface, Prev: Data Types and Files, Up: FFTW Reference
|
Chris@10
|
1348
|
Chris@10
|
1349 4.2 Using Plans
|
Chris@10
|
1350 ===============
|
Chris@10
|
1351
|
Chris@10
|
1352 Plans for all transform types in FFTW are stored as type `fftw_plan'
|
Chris@10
|
1353 (an opaque pointer type), and are created by one of the various
|
Chris@10
|
1354 planning routines described in the following sections. An `fftw_plan'
|
Chris@10
|
1355 contains all information necessary to compute the transform, including
|
Chris@10
|
1356 the pointers to the input and output arrays.
|
Chris@10
|
1357
|
Chris@10
|
1358 void fftw_execute(const fftw_plan plan);
|
Chris@10
|
1359
|
Chris@10
|
1360 This executes the `plan', to compute the corresponding transform on
|
Chris@10
|
1361 the arrays for which it was planned (which must still exist). The plan
|
Chris@10
|
1362 is not modified, and `fftw_execute' can be called as many times as
|
Chris@10
|
1363 desired.
|
Chris@10
|
1364
|
Chris@10
|
1365 To apply a given plan to a different array, you can use the
|
Chris@10
|
1366 new-array execute interface. *Note New-array Execute Functions::.
|
Chris@10
|
1367
|
Chris@10
|
1368 `fftw_execute' (and equivalents) is the only function in FFTW
|
Chris@10
|
1369 guaranteed to be thread-safe; see *note Thread safety::.
|
Chris@10
|
1370
|
Chris@10
|
1371 This function:
|
Chris@10
|
1372 void fftw_destroy_plan(fftw_plan plan);
|
Chris@10
|
1373 deallocates the `plan' and all its associated data.
|
Chris@10
|
1374
|
Chris@10
|
1375 FFTW's planner saves some other persistent data, such as the
|
Chris@10
|
1376 accumulated wisdom and a list of algorithms available in the current
|
Chris@10
|
1377 configuration. If you want to deallocate all of that and reset FFTW to
|
Chris@10
|
1378 the pristine state it was in when you started your program, you can
|
Chris@10
|
1379 call:
|
Chris@10
|
1380
|
Chris@10
|
1381 void fftw_cleanup(void);
|
Chris@10
|
1382
|
Chris@10
|
1383 After calling `fftw_cleanup', all existing plans become undefined,
|
Chris@10
|
1384 and you should not attempt to execute them nor to destroy them. You can
|
Chris@10
|
1385 however create and execute/destroy new plans, in which case FFTW starts
|
Chris@10
|
1386 accumulating wisdom information again.
|
Chris@10
|
1387
|
Chris@10
|
1388 `fftw_cleanup' does not deallocate your plans, however. To prevent
|
Chris@10
|
1389 memory leaks, you must still call `fftw_destroy_plan' before executing
|
Chris@10
|
1390 `fftw_cleanup'.
|
Chris@10
|
1391
|
Chris@10
|
1392 Occasionally, it may be useful to know FFTW's internal "cost" metric
|
Chris@10
|
1393 that it uses to compare plans to one another; this cost is proportional
|
Chris@10
|
1394 to the execution time of the plan, in undocumented units, if the plan
|
Chris@10
|
1395 was created with the `FFTW_MEASURE' or other timing-based options, or
|
Chris@10
|
1396 alternatively is a heuristic cost function for `FFTW_ESTIMATE' plans.
|
Chris@10
|
1397 (The cost values of measured and estimated plans are not comparable,
|
Chris@10
|
1398 being in different units. Also, costs from different FFTW versions or
|
Chris@10
|
1399 the same version compiled differently may not be in the same units.
|
Chris@10
|
1400 Plans created from wisdom have a cost of 0 since no timing measurement
|
Chris@10
|
1401 is performed for them. Finally, certain problems for which only one
|
Chris@10
|
1402 top-level algorithm was possible may have required no measurements of
|
Chris@10
|
1403 the cost of the whole plan, in which case `fftw_cost' will also return
|
Chris@10
|
1404 0.) The cost metric for a given plan is returned by:
|
Chris@10
|
1405
|
Chris@10
|
1406 double fftw_cost(const fftw_plan plan);
|
Chris@10
|
1407
|
Chris@10
|
1408 The following two routines are provided purely for academic purposes
|
Chris@10
|
1409 (that is, for entertainment).
|
Chris@10
|
1410
|
Chris@10
|
1411 void fftw_flops(const fftw_plan plan,
|
Chris@10
|
1412 double *add, double *mul, double *fma);
|
Chris@10
|
1413
|
Chris@10
|
1414 Given a `plan', set `add', `mul', and `fma' to an exact count of the
|
Chris@10
|
1415 number of floating-point additions, multiplications, and fused
|
Chris@10
|
1416 multiply-add operations involved in the plan's execution. The total
|
Chris@10
|
1417 number of floating-point operations (flops) is `add + mul + 2*fma', or
|
Chris@10
|
1418 `add + mul + fma' if the hardware supports fused multiply-add
|
Chris@10
|
1419 instructions (although the number of FMA operations is only approximate
|
Chris@10
|
1420 because of compiler voodoo). (The number of operations should be an
|
Chris@10
|
1421 integer, but we use `double' to avoid overflowing `int' for large
|
Chris@10
|
1422 transforms; the arguments are of type `double' even for single and
|
Chris@10
|
1423 long-double precision versions of FFTW.)
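
The two totals can be written as a small helper, sketched here. The name `total_flops' and the `has_fma' parameter are hypothetical; the three counts would be the values that `fftw_flops' stores through its `add', `mul', and `fma' arguments:

```c
#include <assert.h>

/* Total flops from the three counts reported by fftw_flops.
   has_fma is non-zero when the hardware has a fused multiply-add
   instruction, so each FMA counts once instead of twice. */
static double total_flops(double add, double mul, double fma, int has_fma)
{
    assert(add >= 0.0 && mul >= 0.0 && fma >= 0.0);
    return add + mul + (has_fma ? fma : 2.0 * fma);
}
```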
|
Chris@10
|
1424
|
Chris@10
|
1425 void fftw_fprint_plan(const fftw_plan plan, FILE *output_file);
|
Chris@10
|
1426 void fftw_print_plan(const fftw_plan plan);
|
Chris@10
|
1427
|
Chris@10
|
1428 This outputs a "nerd-readable" representation of the `plan' to the
|
Chris@10
|
1429 given file or to `stdout', respectively.
|
Chris@10
|
1430
|
Chris@10
|
1431
|
Chris@10
|
1432 File: fftw3.info, Node: Basic Interface, Next: Advanced Interface, Prev: Using Plans, Up: FFTW Reference
|
Chris@10
|
1433
|
Chris@10
|
1434 4.3 Basic Interface
|
Chris@10
|
1435 ===================
|
Chris@10
|
1436
|
Chris@10
|
1437 Recall that the FFTW API is divided into three parts(1): the "basic
|
Chris@10
|
1438 interface" computes a single transform of contiguous data, the "advanced
|
Chris@10
|
1439 interface" computes transforms of multiple or strided arrays, and the
|
Chris@10
|
1440 "guru interface" supports the most general data layouts,
|
Chris@10
|
1441 multiplicities, and strides. This section describes the basic
|
Chris@10
|
1442 interface, which we expect to satisfy the needs of most users.
|
Chris@10
|
1443
|
Chris@10
|
1444 * Menu:
|
Chris@10
|
1445
|
Chris@10
|
1446 * Complex DFTs::
|
Chris@10
|
1447 * Planner Flags::
|
Chris@10
|
1448 * Real-data DFTs::
|
Chris@10
|
1449 * Real-data DFT Array Format::
|
Chris@10
|
1450 * Real-to-Real Transforms::
|
Chris@10
|
1451 * Real-to-Real Transform Kinds::
|
Chris@10
|
1452
|
Chris@10
|
1453 ---------- Footnotes ----------
|
Chris@10
|
1454
|
Chris@10
|
1455 (1) Gallia est omnis divisa in partes tres (Julius Caesar).
|
Chris@10
|
1456
|
Chris@10
|
1457
|
Chris@10
|
1458 File: fftw3.info, Node: Complex DFTs, Next: Planner Flags, Prev: Basic Interface, Up: Basic Interface
|
Chris@10
|
1459
|
Chris@10
|
1460 4.3.1 Complex DFTs
|
Chris@10
|
1461 ------------------
|
Chris@10
|
1462
|
Chris@10
|
1463 fftw_plan fftw_plan_dft_1d(int n0,
|
Chris@10
|
1464 fftw_complex *in, fftw_complex *out,
|
Chris@10
|
1465 int sign, unsigned flags);
|
Chris@10
|
1466 fftw_plan fftw_plan_dft_2d(int n0, int n1,
|
Chris@10
|
1467 fftw_complex *in, fftw_complex *out,
|
Chris@10
|
1468 int sign, unsigned flags);
|
Chris@10
|
1469 fftw_plan fftw_plan_dft_3d(int n0, int n1, int n2,
|
Chris@10
|
1470 fftw_complex *in, fftw_complex *out,
|
Chris@10
|
1471 int sign, unsigned flags);
|
Chris@10
|
1472 fftw_plan fftw_plan_dft(int rank, const int *n,
|
Chris@10
|
1473 fftw_complex *in, fftw_complex *out,
|
Chris@10
|
1474 int sign, unsigned flags);
|
Chris@10
|
1475
|
Chris@10
|
1476 Plan a complex input/output discrete Fourier transform (DFT) in zero
|
Chris@10
|
1477 or more dimensions, returning an `fftw_plan' (*note Using Plans::).
|
Chris@10
|
1478
|
Chris@10
|
1479 Once you have created a plan for a certain transform type and
|
Chris@10
|
1480 parameters, then creating another plan of the same type and parameters,
|
Chris@10
|
1481 but for different arrays, is fast and shares constant data with the
|
Chris@10
|
1482 first plan (if it still exists).
|
Chris@10
|
1483
|
Chris@10
|
1484 The planner returns `NULL' if the plan cannot be created. In the
|
Chris@10
|
1485 standard FFTW distribution, the basic interface is guaranteed to return
|
Chris@10
|
1486 a non-`NULL' plan. A plan may be `NULL', however, if you are using a
|
Chris@10
|
1487 customized FFTW configuration supporting a restricted set of transforms.
|
Chris@10
|
1488
|
Chris@10
|
1489 Arguments
|
Chris@10
|
1490 .........
|
Chris@10
|
1491
|
Chris@10
|
1492 * `rank' is the rank of the transform (it should be the size of the
|
Chris@10
|
1493 array `*n'), and can be any non-negative integer. (*Note Complex
|
Chris@10
|
1494 Multi-Dimensional DFTs::, for the definition of "rank".) The
|
Chris@10
|
1495 `_1d', `_2d', and `_3d' planners correspond to a `rank' of `1',
|
Chris@10
|
1496 `2', and `3', respectively. The rank may be zero, which is
|
Chris@10
|
1497 equivalent to a rank-1 transform of size 1, i.e. a copy of one
|
Chris@10
|
1498 number from input to output.
|
Chris@10
|
1499
|
Chris@10
|
1500 * `n0', `n1', `n2', or `n[0..rank-1]' (as appropriate for each
|
Chris@10
|
1501 routine) specify the size of the transform dimensions. They can
|
Chris@10
|
1502 be any positive integer.
|
Chris@10
|
1503
|
Chris@10
|
1504 - Multi-dimensional arrays are stored in row-major order with
|
Chris@10
|
1505 dimensions: `n0' x `n1'; or `n0' x `n1' x `n2'; or `n[0]' x
|
Chris@10
|
1506 `n[1]' x ... x `n[rank-1]'. *Note Multi-dimensional Array
|
Chris@10
|
1507 Format::.
|
Chris@10
|
1508
|
Chris@10
|
1509 - FFTW is best at handling sizes of the form 2^a 3^b 5^c 7^d
|
Chris@10
|
1510 11^e 13^f, where e+f is either 0 or 1, and the other exponents
|
Chris@10
|
1511 are arbitrary. Other sizes are computed by means of a slow,
|
Chris@10
|
1512 general-purpose algorithm (which nevertheless retains O(n log
|
Chris@10
|
1513 n) performance even for prime sizes). It is possible to
|
Chris@10
|
1514 customize FFTW for different array sizes; see *note
|
Chris@10
|
1515 Installation and Customization::. Transforms whose sizes are
|
Chris@10
|
1516 powers of 2 are especially fast.
|
Chris@10
|
1517
|
Chris@10
|
1518 * `in' and `out' point to the input and output arrays of the
|
Chris@10
|
1519 transform, which may be the same (yielding an in-place transform). These
|
Chris@10
|
1520 arrays are overwritten during planning, unless `FFTW_ESTIMATE' is
|
Chris@10
|
1521 used in the flags. (The arrays need not be initialized, but they
|
Chris@10
|
1522 must be allocated.)
|
Chris@10
|
1523
|
Chris@10
|
1524 If `in == out', the transform is "in-place" and the input array is
|
Chris@10
|
1525 overwritten. If `in != out', the two arrays must not overlap (but
|
Chris@10
|
1526 FFTW does not check for this condition).
|
Chris@10
|
1527
|
Chris@10
|
1528 * `sign' is the sign of the exponent in the formula that defines the
|
Chris@10
|
1529 Fourier transform. It can be -1 (= `FFTW_FORWARD') or +1 (=
|
Chris@10
|
1530 `FFTW_BACKWARD').
|
Chris@10
|
1531
|
Chris@10
|
1532 * `flags' is a bitwise OR (`|') of zero or more planner flags, as
|
Chris@10
|
1533 defined in *note Planner Flags::.
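
The size guideline in the argument list above (sizes of the form 2^a 3^b 5^c 7^d 11^e 13^f with e+f either 0 or 1) can be checked with a few lines of plain C. The function name `is_preferred_size' is hypothetical, and the check describes only the default configuration mentioned in the text:

```c
#include <assert.h>

/* Returns 1 if n = 2^a 3^b 5^c 7^d 11^e 13^f with e+f equal to 0 or 1,
   and 0 otherwise (including n <= 0). */
static int is_preferred_size(int n)
{
    static const int base[] = { 2, 3, 5, 7 };
    int big = 0;   /* e + f: combined count of factors 11 and 13 */

    if (n <= 0)
        return 0;
    for (int s = 0; s < 4; ++s)
        while (n % base[s] == 0)
            n /= base[s];
    while (n % 11 == 0) { n /= 11; ++big; }
    while (n % 13 == 0) { n /= 13; ++big; }
    return n == 1 && big <= 1;
}
```

A caller free to pad its data could, for instance, increase the size until this check passes before creating the plan.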
|
Chris@10
|
1534
|
Chris@10
|
1535
|
Chris@10
|
1536 FFTW computes an unnormalized transform: computing a forward
|
Chris@10
|
1537 followed by a backward transform (or vice versa) will result in the
|
Chris@10
|
1538 original data multiplied by the size of the transform (the product of
|
Chris@10
|
1539 the dimensions). For more information, see *note What FFTW Really
|
Chris@10
|
1540 Computes::.
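
The scaling can be seen in a size-4 transform written out directly from the definition. The sketch below is a naive DFT, not FFTW's algorithm; the names `dft4' and `roundtrip_scales_by_n' are hypothetical. For size 4 the twiddle factors are exact powers of i, so the result is exact for small integer inputs:

```c
#include <assert.h>

enum { DFT_N = 4 };

/* exp(+2*pi*i*m/4) for m = 0..3, i.e. the powers of i. */
static const double W_RE[DFT_N] = { 1.0, 0.0, -1.0, 0.0 };
static const double W_IM[DFT_N] = { 0.0, 1.0, 0.0, -1.0 };

/* Unnormalized size-4 DFT:
   out[k] = sum over j of in[j] * exp(sign * 2*pi*i * j*k / 4),
   with sign = -1 (forward) or +1 (backward). */
static void dft4(double in[DFT_N][2], double out[DFT_N][2], int sign)
{
    assert(sign == -1 || sign == 1);
    for (int k = 0; k < DFT_N; ++k) {
        out[k][0] = out[k][1] = 0.0;
        for (int j = 0; j < DFT_N; ++j) {
            int m = (j * k) % DFT_N;
            double wr = W_RE[m];
            double wi = sign * W_IM[m];
            out[k][0] += in[j][0] * wr - in[j][1] * wi;
            out[k][1] += in[j][0] * wi + in[j][1] * wr;
        }
    }
}

/* Forward then backward; returns 1 if every result equals 4 * input. */
static int roundtrip_scales_by_n(double x[DFT_N][2])
{
    double f[DFT_N][2], b[DFT_N][2];

    dft4(x, f, -1);
    dft4(f, b, +1);
    for (int j = 0; j < DFT_N; ++j)
        if (b[j][0] != DFT_N * x[j][0] || b[j][1] != DFT_N * x[j][1])
            return 0;
    return 1;
}
```

To recover the original data after such a round trip, divide each element by the transform size.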
|
Chris@10
|
1541
|
Chris@10
|
1542
|
Chris@10
|
1543 File: fftw3.info, Node: Planner Flags, Next: Real-data DFTs, Prev: Complex DFTs, Up: Basic Interface
|
Chris@10
|
1544
|
Chris@10
|
1545 4.3.2 Planner Flags
|
Chris@10
|
1546 -------------------
|
Chris@10
|
1547
|
Chris@10
|
1548 All of the planner routines in FFTW accept an integer `flags' argument,
|
Chris@10
|
1549 which is a bitwise OR (`|') of zero or more of the flag constants
|
Chris@10
|
1550 defined below. These flags control the rigor (and time) of the
|
Chris@10
|
1551 planning process, and can also impose (or lift) restrictions on the
|
Chris@10
|
1552 type of transform algorithm that is employed.
|
Chris@10
|
1553
|
Chris@10
|
1554 _Important:_ the planner overwrites the input array during planning
|
Chris@10
|
1555 unless a saved plan (*note Wisdom::) is available for that problem, so
|
Chris@10
|
1556 you should initialize your input data after creating the plan. The
|
Chris@10
|
1557 only exceptions to this are the `FFTW_ESTIMATE' and `FFTW_WISDOM_ONLY'
|
Chris@10
|
1558 flags, as mentioned below.
|
Chris@10
|
1559
|
Chris@10
|
1560 In all cases, if wisdom is available for the given problem that
|
Chris@10
|
1561 was created with equal-or-greater planning rigor, then the more
|
Chris@10
|
1562 rigorous wisdom is used. For example, in `FFTW_ESTIMATE' mode any
|
Chris@10
|
1563 available wisdom is used, whereas in `FFTW_PATIENT' mode only wisdom
|
Chris@10
|
1564 created in patient or exhaustive mode can be used. *Note Words of
|
Chris@10
|
1565 Wisdom-Saving Plans::.
|
Chris@10
|
1566
|
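The rule for when existing wisdom applies can be written as a one-line
comparison. This is only a model of the rule stated above, not FFTW
code: the `rigor' enum and the function `wisdom_usable' are hypothetical
names, and the enum values are an ordering chosen for illustration, not
the values of FFTW's flag constants.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ordering of the planning-rigor levels (not FFTW's flag values). */
typedef enum {
    RIGOR_ESTIMATE = 0,
    RIGOR_MEASURE,
    RIGOR_PATIENT,
    RIGOR_EXHAUSTIVE
} rigor;

/* Wisdom is usable when it was created with at least the requested rigor. */
static bool wisdom_usable(rigor requested, rigor created_with)
{
    assert(requested <= RIGOR_EXHAUSTIVE && created_with <= RIGOR_EXHAUSTIVE);
    return created_with >= requested;
}
```

For instance, `wisdom_usable(RIGOR_ESTIMATE, RIGOR_MEASURE)' is true,
while `wisdom_usable(RIGOR_PATIENT, RIGOR_MEASURE)' is false.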
Chris@10
|
1567 Planning-rigor flags
|
Chris@10
|
1568 ....................
|
Chris@10
|
1569
|
Chris@10
|
1570 * `FFTW_ESTIMATE' specifies that, instead of actual measurements of
|
Chris@10
|
1571 different algorithms, a simple heuristic is used to pick a
|
Chris@10
|
1572 (probably sub-optimal) plan quickly. With this flag, the
|
Chris@10
|
1573 input/output arrays are not overwritten during planning.
|
Chris@10
|
1574
|
Chris@10
|
1575 * `FFTW_MEASURE' tells FFTW to find an optimized plan by actually
|
Chris@10
|
1576 _computing_ several FFTs and measuring their execution time.
|
Chris@10
|
1577 Depending on your machine, this can take some time (often a few
|
Chris@10
|
1578 seconds). `FFTW_MEASURE' is the default planning option.
|
Chris@10
|
1579
|
Chris@10
|
1580 * `FFTW_PATIENT' is like `FFTW_MEASURE', but considers a wider range
|
Chris@10
|
1581 of algorithms and often produces a "more optimal" plan (especially
|
Chris@10
|
1582 for large transforms), but at the expense of several times longer
|
Chris@10
|
1583 planning time (especially for large transforms).
|
Chris@10
|
1584
|
Chris@10
|
1585 * `FFTW_EXHAUSTIVE' is like `FFTW_PATIENT', but considers an even
|
Chris@10
|
1586 wider range of algorithms, including many that we think are
|
Chris@10
|
1587 unlikely to be fast, to produce the most optimal plan but with a
|
Chris@10
|
1588 substantially increased planning time.
|
Chris@10
|
1589
|
Chris@10
|
1590 * `FFTW_WISDOM_ONLY' is a special planning mode in which the plan is
|
Chris@10
|
1591 only created if wisdom is available for the given problem, and
|
Chris@10
|
1592 otherwise a `NULL' plan is returned. This can be combined with
|
Chris@10
|
1593 other flags, e.g. `FFTW_WISDOM_ONLY | FFTW_PATIENT' creates a plan
|
Chris@10
|
1594 only if wisdom is available that was created in `FFTW_PATIENT' or
|
Chris@10
|
1595 `FFTW_EXHAUSTIVE' mode. The `FFTW_WISDOM_ONLY' flag is intended
|
Chris@10
|
1596 for users who need to detect whether wisdom is available; for
|
Chris@10
|
1597 example, if wisdom is not available one may wish to allocate new
|
Chris@10
|
1598 arrays for planning so that user data is not overwritten.
|
Chris@10
|
1599
|
Chris@10
|
1600
|
Chris@10
|
1601 Algorithm-restriction flags
|
Chris@10
|
1602 ...........................
|
Chris@10
|
1603
|
Chris@10
|
1604 * `FFTW_DESTROY_INPUT' specifies that an out-of-place transform is
|
Chris@10
|
1605 allowed to _overwrite its input_ array with arbitrary data; this
|
Chris@10
|
1606 can sometimes allow more efficient algorithms to be employed.
|
Chris@10
|
1607
|
Chris@10
|
1608 * `FFTW_PRESERVE_INPUT' specifies that an out-of-place transform must
|
Chris@10
|
1609 _not change its input_ array. This is ordinarily the _default_,
|
Chris@10
|
1610 except for c2r and hc2r (i.e. complex-to-real) transforms for
|
Chris@10
|
1611 which `FFTW_DESTROY_INPUT' is the default. In the latter cases,
|
Chris@10
|
1612 passing `FFTW_PRESERVE_INPUT' will attempt to use algorithms that
|
Chris@10
|
1613 do not destroy the input, at the expense of worse performance; for
|
Chris@10
|
1614 multi-dimensional c2r transforms, however, no input-preserving
|
Chris@10
|
1615 algorithms are implemented and the planner will return `NULL' if
|
Chris@10
|
1616 one is requested.
|
Chris@10
|
1617
|
Chris@10
|
1618 * `FFTW_UNALIGNED' specifies that the algorithm may not impose any
|
Chris@10
|
1619 unusual alignment requirements on the input/output arrays (i.e. no
|
Chris@10
|
1620 SIMD may be used). This flag is normally _not necessary_, since
|
Chris@10
|
1621 the planner automatically detects misaligned arrays. The only use
|
Chris@10
|
1622 for this flag is if you want to use the new-array execute
|
Chris@10
|
1623 interface to execute a given plan on a different array that may
|
Chris@10
|
1624 not be aligned like the original. (Using `fftw_malloc' makes this
|
Chris@10
|
1625 flag unnecessary even then.)
|
Chris@10
|
1626
|
Chris@10
|
1627
|
Chris@10
|
1628 Limiting planning time
|
Chris@10
|
1629 ......................
|
Chris@10
|
1630
|
Chris@10
|
1631 extern void fftw_set_timelimit(double seconds);
|
Chris@10
|
1632
|
Chris@10
|
1633 This function instructs FFTW to spend at most `seconds' seconds
|
Chris@10
|
1634 (approximately) in the planner. If `seconds == FFTW_NO_TIMELIMIT' (the
|
Chris@10
|
1635 default value, which is negative), then planning time is unbounded.
|
Chris@10
|
1636 Otherwise, FFTW plans with a progressively wider range of algorithms
|
Chris@10
|
1637 until the given time limit is reached or the given range of
|
Chris@10
|
1638 algorithms is explored, returning the best available plan.
|
Chris@10
|
1639
|
Chris@10
|
1640 For example, specifying `FFTW_PATIENT' first plans in
|
Chris@10
|
1641 `FFTW_ESTIMATE' mode, then in `FFTW_MEASURE' mode, then finally (time
|
Chris@10
|
1642 permitting) in `FFTW_PATIENT'. If `FFTW_EXHAUSTIVE' is specified
|
Chris@10
|
1643 instead, the planner will further progress to `FFTW_EXHAUSTIVE' mode.
|
Chris@10
|
1644
|
Chris@10
|
1645 Note that the `seconds' argument specifies only a rough limit; in
|
Chris@10
|
1646 practice, the planner may use somewhat more time if the time limit is
|
Chris@10
|
1647 reached when the planner is in the middle of an operation that cannot
|
Chris@10
|
1648 be interrupted. At the very least, the planner will complete planning
|
Chris@10
|
1649 in `FFTW_ESTIMATE' mode (which is thus equivalent to a time limit of 0).
|
Chris@10
|
1650
|
Chris@10
|
1651
|
Chris@10
|
1652 File: fftw3.info, Node: Real-data DFTs, Next: Real-data DFT Array Format, Prev: Planner Flags, Up: Basic Interface
|
Chris@10
|
1653
|
Chris@10
|
1654 4.3.3 Real-data DFTs
|
Chris@10
|
1655 --------------------
|
Chris@10
|
1656
|
Chris@10
|
1657 fftw_plan fftw_plan_dft_r2c_1d(int n0,
|
Chris@10
|
1658 double *in, fftw_complex *out,
|
Chris@10
|
1659 unsigned flags);
|
Chris@10
|
1660 fftw_plan fftw_plan_dft_r2c_2d(int n0, int n1,
|
Chris@10
|
1661 double *in, fftw_complex *out,
|
Chris@10
|
1662 unsigned flags);
|
Chris@10
|
1663 fftw_plan fftw_plan_dft_r2c_3d(int n0, int n1, int n2,
|
Chris@10
|
1664 double *in, fftw_complex *out,
|
Chris@10
|
1665 unsigned flags);
|
Chris@10
|
1666 fftw_plan fftw_plan_dft_r2c(int rank, const int *n,
|
Chris@10
|
1667 double *in, fftw_complex *out,
|
Chris@10
|
1668 unsigned flags);
|
Chris@10
|
1669
|
Chris@10
|
1670 Plan a real-input/complex-output discrete Fourier transform (DFT) in
|
Chris@10
|
1671 zero or more dimensions, returning an `fftw_plan' (*note Using Plans::).
|
Chris@10
|
1672
|
Chris@10
|
1673 Once you have created a plan for a certain transform type and
|
Chris@10
|
1674 parameters, then creating another plan of the same type and parameters,
|
Chris@10
|
1675 but for different arrays, is fast and shares constant data with the
|
Chris@10
|
1676 first plan (if it still exists).
|
Chris@10
|
1677
|
Chris@10
|
1678 The planner returns `NULL' if the plan cannot be created. A
|
Chris@10
|
1679 non-`NULL' plan is always returned by the basic interface unless you
|
Chris@10
|
1680 are using a customized FFTW configuration supporting a restricted set
|
Chris@10
|
1681 of transforms, or if you use the `FFTW_PRESERVE_INPUT' flag with a
|
Chris@10
|
1682 multi-dimensional out-of-place c2r transform (see below).
|
Chris@10
|
1683
|
Chris@10
|
1684 Arguments
|
Chris@10
|
1685 .........
|
Chris@10
|
1686
|
Chris@10
|
1687 * `rank' is the rank of the transform (it should be the size of the
|
Chris@10
|
1688 array `*n'), and can be any non-negative integer. (*Note Complex
|
Chris@10
|
1689 Multi-Dimensional DFTs::, for the definition of "rank".) The
|
Chris@10
|
1690 `_1d', `_2d', and `_3d' planners correspond to a `rank' of `1',
|
Chris@10
|
1691 `2', and `3', respectively. The rank may be zero, which is
|
Chris@10
|
1692 equivalent to a rank-1 transform of size 1, i.e. a copy of one
|
Chris@10
|
1693 real number (with zero imaginary part) from input to output.
|
Chris@10
|
1694
|
Chris@10
|
1695 * `n0', `n1', `n2', or `n[0..rank-1]' (as appropriate for each
|
Chris@10
|
1696 routine) specify the size of the transform dimensions. They can
|
Chris@10
|
1697 be any positive integer. This is different in general from the
|
Chris@10
|
1698 _physical_ array dimensions, which are described in *note
|
Chris@10
|
1699 Real-data DFT Array Format::.
|
Chris@10
|
1700
|
Chris@10
|
1701 - FFTW is best at handling sizes of the form 2^a 3^b 5^c 7^d
|
Chris@10
|
1702 11^e 13^f, where e+f is either 0 or 1, and the other exponents
|
Chris@10
|
1703 are arbitrary. Other sizes are computed by means of a slow,
|
Chris@10
|
1704 general-purpose algorithm (which nevertheless retains O(n log
|
Chris@10
|
1705 n) performance even for prime sizes). (It is possible to
|
Chris@10
|
1706 customize FFTW for different array sizes; see *note
|
Chris@10
|
1707 Installation and Customization::.) Transforms whose sizes
|
Chris@10
|
1708 are powers of 2 are especially fast, and it is generally
|
Chris@10
|
1709 beneficial for the _last_ dimension of an r2c/c2r transform
|
Chris@10
|
1710 to be _even_.
|
Chris@10
|
1711
|
Chris@10
|
1712 * `in' and `out' point to the input and output arrays of the
|
Chris@10
|
1713 transform, which may be the same (yielding an in-place transform). These
|
Chris@10
|
1714 arrays are overwritten during planning, unless `FFTW_ESTIMATE' is
|
Chris@10
|
1715 used in the flags. (The arrays need not be initialized, but they
|
Chris@10
|
1716 must be allocated.) For an in-place transform, it is important to
|
Chris@10
|
1717 remember that the real array will require padding, described in
|
Chris@10
|
1718 *note Real-data DFT Array Format::.
|
Chris@10
|
1719
|
Chris@10
|
1720 * `flags' is a bitwise OR (`|') of zero or more planner flags, as
|
Chris@10
|
1721 defined in *note Planner Flags::.
|
Chris@10
|
1722
|
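The size recommendation in the list above can be tested mechanically.
The following sketch is not part of FFTW; the names `is_preferred_size'
and `next_preferred_size' are hypothetical, and the test applies to the
default configuration described above, not to a customized build.

```c
#include <assert.h>
#include <stdbool.h>

/* True if n = 2^a 3^b 5^c 7^d 11^e 13^f with e + f equal to 0 or 1. */
static bool is_preferred_size(int n)
{
    static const int small[4] = {2, 3, 5, 7};
    assert(n > 0);
    for (int i = 0; i < 4; ++i)
        while (n % small[i] == 0)
            n /= small[i];
    /* after removing 2, 3, 5 and 7, at most one factor of 11 or 13 may remain */
    return n == 1 || n == 11 || n == 13;
}

/* Smallest preferred size >= n (sketch: no guard against int overflow). */
static int next_preferred_size(int n)
{
    assert(n > 0);
    while (!is_preferred_size(n))
        ++n;
    return n;
}
```

A program that is free to choose its transform length could round up
with `next_preferred_size'. Note that padding changes the transform
being computed, so this is only appropriate when the application allows
it.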
Chris@10
|
1723
|
Chris@10
|
1724 The inverse transforms, taking complex input (storing the
|
Chris@10
|
1725 non-redundant half of a logically Hermitian array) to real output, are
|
Chris@10
|
1726 given by:
|
Chris@10
|
1727
|
Chris@10
|
1728 fftw_plan fftw_plan_dft_c2r_1d(int n0,
|
Chris@10
|
1729 fftw_complex *in, double *out,
|
Chris@10
|
1730 unsigned flags);
|
Chris@10
|
1731 fftw_plan fftw_plan_dft_c2r_2d(int n0, int n1,
|
Chris@10
|
1732 fftw_complex *in, double *out,
|
Chris@10
|
1733 unsigned flags);
|
Chris@10
|
1734 fftw_plan fftw_plan_dft_c2r_3d(int n0, int n1, int n2,
|
Chris@10
|
1735 fftw_complex *in, double *out,
|
Chris@10
|
1736 unsigned flags);
|
Chris@10
|
1737 fftw_plan fftw_plan_dft_c2r(int rank, const int *n,
|
Chris@10
|
1738 fftw_complex *in, double *out,
|
Chris@10
|
1739 unsigned flags);
|
Chris@10
|
1740
|
Chris@10
|
1741 The arguments are the same as for the r2c transforms, except that the
|
Chris@10
|
1742 input and output data formats are reversed.
|
Chris@10
|
1743
|
Chris@10
|
1744 FFTW computes an unnormalized transform: computing an r2c followed
|
Chris@10
|
1745 by a c2r transform (or vice versa) will result in the original data
|
Chris@10
|
1746 multiplied by the size of the transform (the product of the logical
|
Chris@10
|
1747 dimensions). An r2c transform produces the same output as a
|
Chris@10
|
1748 `FFTW_FORWARD' complex DFT of the same input, and a c2r transform is
|
Chris@10
|
1749 correspondingly equivalent to `FFTW_BACKWARD'. For more information,
|
Chris@10
|
1750 see *note What FFTW Really Computes::.
|
Chris@10
|
1751
|
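The relation between the r2c output and the full `FFTW_FORWARD' complex
DFT can be illustrated in one dimension. The sketch below does not call
FFTW; `cpx' is a hypothetical stand-in for `fftw_complex' and
`expand_half_spectrum' is a hypothetical helper. It rebuilds the full
n-point spectrum from the n/2+1 stored values using the Hermitian
symmetry X[n-k] = conj(X[k]) of a real-input DFT.

```c
#include <assert.h>

typedef struct { double re, im; } cpx;  /* hypothetical stand-in for fftw_complex */

/* Rebuild the full n-point spectrum of a real 1d signal from the n/2+1
   values an r2c transform stores.  `half' holds n/2+1 values, `full' n. */
static void expand_half_spectrum(int n, const cpx *half, cpx *full)
{
    assert(n > 0);
    for (int k = 0; k <= n / 2; ++k)
        full[k] = half[k];                 /* stored non-negative frequencies */
    for (int k = n / 2 + 1; k < n; ++k) {
        full[k].re =  half[n - k].re;      /* X[k] = conj(X[n-k]) */
        full[k].im = -half[n - k].im;
    }
}
```

For the real input 1, 2, 3, 4 the stored half is 10, -2+2i, -2, and the
expansion appends -2-2i as the fourth value.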
Chris@10
|
1752
|
Chris@10
|
1753 File: fftw3.info, Node: Real-data DFT Array Format, Next: Real-to-Real Transforms, Prev: Real-data DFTs, Up: Basic Interface
|
Chris@10
|
1754
|
Chris@10
|
1755 4.3.4 Real-data DFT Array Format
|
Chris@10
|
1756 --------------------------------
|
Chris@10
|
1757
|
Chris@10
|
1758 The output of a DFT of real data (r2c) contains symmetries that, in
|
Chris@10
|
1759 principle, make half of the outputs redundant (*note What FFTW Really
|
Chris@10
|
1760 Computes::). (Similarly for the input of an inverse c2r transform.) In
|
Chris@10
|
1761 practice, it is not possible to entirely realize these savings in an
|
Chris@10
|
1762 efficient and understandable format that generalizes to
|
Chris@10
|
1763 multi-dimensional transforms. Instead, the output of the r2c
|
Chris@10
|
1764 transforms is _slightly_ over half of the output of the corresponding
|
Chris@10
|
1765 complex transform. We do not "pack" the data in any way, but store it
|
Chris@10
|
1766 as an ordinary array of `fftw_complex' values. In fact, this data is
|
Chris@10
|
1767 simply a subsection of what would be the array in the corresponding
|
Chris@10
|
1768 complex transform.
|
Chris@10
|
1769
|
Chris@10
|
1770 Specifically, for a real transform of d (= `rank') dimensions n[0] x
|
Chris@10
|
1771 n[1] x n[2] x ... x n[d-1] , the complex data is an n[0] x n[1] x n[2]
|
Chris@10
|
1772 x ... x (n[d-1]/2 + 1) array of `fftw_complex' values in row-major
|
Chris@10
|
1773 order (with the division rounded down). That is, we only store the
|
Chris@10
|
1774 _lower_ half (non-negative frequencies), plus one element, of the last
|
Chris@10
|
1775 dimension of the data from the ordinary complex transform. (We could
|
Chris@10
|
1776 have instead taken half of any other dimension, but implementation
|
Chris@10
|
1777 turns out to be simpler if the last, contiguous, dimension is used.)
|
Chris@10
|
1778
|
Chris@10
|
1779 For an out-of-place transform, the real data is simply an array with
|
Chris@10
|
1780 physical dimensions n[0] x n[1] x n[2] x ... x n[d-1] in row-major
|
Chris@10
|
1781 order.
|
Chris@10
|
1782
|
Chris@10
|
1783 For an in-place transform, some complications arise since the
|
Chris@10
|
1784 complex data is slightly larger than the real data. In this case, the
|
Chris@10
|
1785 final dimension of the real data must be _padded_ with extra values to
|
Chris@10
|
1786 accommodate the size of the complex data--two extra if the last
|
Chris@10
|
1787 dimension is even and one if it is odd. That is, the last dimension of
|
Chris@10
|
1788 the real data must physically contain 2 * (n[d-1]/2+1) `double' values
|
Chris@10
|
1789 (exactly enough to hold the complex data). This physical array size
|
Chris@10
|
1790 does not, however, change the _logical_ array size--only n[d-1] values
|
Chris@10
|
1791 are actually stored in the last dimension, and n[d-1] is the last
|
Chris@10
|
1792 dimension passed to the planner.
|
Chris@10
|
1793
|
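The array sizes described in this section reduce to a few lines of
integer arithmetic. The helper names below are hypothetical and the
sketch omits the rank-0 case; it only restates the formulas above so
that allocation sizes can be computed in one place.

```c
#include <assert.h>
#include <stddef.h>

/* Number of fftw_complex values in the r2c output (c2r input):
   n[0] x ... x n[rank-2] x (n[rank-1]/2 + 1). */
static size_t r2c_complex_count(int rank, const int *n)
{
    size_t count = 1;
    assert(rank >= 1 && n != NULL);        /* rank 0 omitted in this sketch */
    for (int i = 0; i < rank; ++i) {
        assert(n[i] > 0);
        count *= (size_t)(i == rank - 1 ? n[i] / 2 + 1 : n[i]);
    }
    return count;
}

/* Number of doubles the real array must hold.  Out-of-place: the logical
   size.  In-place: the last dimension is padded to 2 * (n[rank-1]/2 + 1). */
static size_t r2c_real_count(int rank, const int *n, int in_place)
{
    size_t count = 1;
    assert(rank >= 1 && n != NULL);
    for (int i = 0; i < rank; ++i) {
        int last = (i == rank - 1);
        assert(n[i] > 0);
        count *= (size_t)(last && in_place ? 2 * (n[i] / 2 + 1) : n[i]);
    }
    return count;
}

/* Offset (in doubles) of real element (i, j) of an in-place 2d array
   whose logical last dimension is n1. */
static size_t padded_index_2d(int n1, int i, int j)
{
    assert(n1 > 0 && i >= 0 && j >= 0 && j < n1);
    return (size_t)i * (size_t)(2 * (n1 / 2 + 1)) + (size_t)j;
}
```

For a 5 x 6 transform this gives 20 complex values, 30 doubles
out-of-place, and 40 doubles in-place; for 4 x 7 the in-place real array
holds 32 doubles, one padding value per row.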
Chris@10
|
1794
|
Chris@10
|
1795 File: fftw3.info, Node: Real-to-Real Transforms, Next: Real-to-Real Transform Kinds, Prev: Real-data DFT Array Format, Up: Basic Interface
|
Chris@10
|
1796
|
Chris@10
|
1797 4.3.5 Real-to-Real Transforms
|
Chris@10
|
1798 -----------------------------
|
Chris@10
|
1799
|
Chris@10
|
1800 fftw_plan fftw_plan_r2r_1d(int n, double *in, double *out,
|
Chris@10
|
1801 fftw_r2r_kind kind, unsigned flags);
|
Chris@10
|
1802 fftw_plan fftw_plan_r2r_2d(int n0, int n1, double *in, double *out,
|
Chris@10
|
1803 fftw_r2r_kind kind0, fftw_r2r_kind kind1,
|
Chris@10
|
1804 unsigned flags);
|
Chris@10
|
1805 fftw_plan fftw_plan_r2r_3d(int n0, int n1, int n2,
|
Chris@10
|
1806 double *in, double *out,
|
Chris@10
|
1807 fftw_r2r_kind kind0,
|
Chris@10
|
1808 fftw_r2r_kind kind1,
|
Chris@10
|
1809 fftw_r2r_kind kind2,
|
Chris@10
|
1810 unsigned flags);
|
Chris@10
|
1811 fftw_plan fftw_plan_r2r(int rank, const int *n, double *in, double *out,
|
Chris@10
|
1812 const fftw_r2r_kind *kind, unsigned flags);
|
Chris@10
|
1813
|
Chris@10
|
1814 Plan a real input/output (r2r) transform of various kinds in zero or
|
Chris@10
|
1815 more dimensions, returning an `fftw_plan' (*note Using Plans::).
|
Chris@10
|
1816
|
Chris@10
|
1817 Once you have created a plan for a certain transform type and
|
Chris@10
|
1818 parameters, then creating another plan of the same type and parameters,
|
Chris@10
|
1819 but for different arrays, is fast and shares constant data with the
|
Chris@10
|
1820 first plan (if it still exists).
|
Chris@10
|
1821
|
Chris@10
|
1822 The planner returns `NULL' if the plan cannot be created. A
|
Chris@10
|
1823 non-`NULL' plan is always returned by the basic interface unless you
|
Chris@10
|
1824 are using a customized FFTW configuration supporting a restricted set
|
Chris@10
|
1825 of transforms, or for size-1 `FFTW_REDFT00' kinds (which are not
|
Chris@10
|
1826 defined).
|
Chris@10
|
1827
|
Chris@10
|
1828 Arguments
|
Chris@10
|
1829 .........
|
Chris@10
|
1830
|
Chris@10
|
1831 * `rank' is the dimensionality of the transform (it should be the
|
Chris@10
|
1832 size of the arrays `*n' and `*kind'), and can be any non-negative
|
Chris@10
|
1833 integer. The `_1d', `_2d', and `_3d' planners correspond to a
|
Chris@10
|
1834 `rank' of `1', `2', and `3', respectively. A `rank' of zero is
|
Chris@10
|
1835 equivalent to a copy of one number from input to output.
|
Chris@10
|
1836
|
Chris@10
|
1837 * `n', or `n0'/`n1'/`n2', or `n[rank]', respectively, gives the
|
Chris@10
|
1838 (physical) size of the transform dimensions. They can be any
|
Chris@10
|
1839 positive integer.
|
Chris@10
|
1840
|
Chris@10
|
1841 - Multi-dimensional arrays are stored in row-major order with
|
Chris@10
|
1842 dimensions: `n0' x `n1'; or `n0' x `n1' x `n2'; or `n[0]' x
|
Chris@10
|
1843 `n[1]' x ... x `n[rank-1]'. *Note Multi-dimensional Array
|
Chris@10
|
1844 Format::.
|
Chris@10
|
1845
|
Chris@10
|
1846 - FFTW is generally best at handling sizes of the form 2^a 3^b
|
Chris@10
|
1847 5^c 7^d 11^e 13^f, where e+f is either 0 or 1, and the other
|
Chris@10
|
1848 exponents are arbitrary. Other sizes are computed by means
|
Chris@10
|
1849 of a slow, general-purpose algorithm (which nevertheless
|
Chris@10
|
1850 retains O(n log n) performance even for prime sizes). (It
|
Chris@10
|
1851 is possible to customize FFTW for different array sizes; see
|
Chris@10
|
1852 *note Installation and Customization::.) Transforms whose
|
Chris@10
|
1853 sizes are powers of 2 are especially fast.
|
Chris@10
|
1854
|
Chris@10
|
1855 - For a `REDFT00' or `RODFT00' transform kind in a dimension of
|
Chris@10
|
1856 size n, it is n-1 or n+1, respectively, that should be
|
Chris@10
|
1857 factorizable in the above form.
|
Chris@10
|
1858
|
Chris@10
|
1859 * `in' and `out' point to the input and output arrays of the
|
Chris@10
|
1860 transform, which may be the same (yielding an in-place transform). These
|
Chris@10
|
1861 arrays are overwritten during planning, unless `FFTW_ESTIMATE' is
|
Chris@10
|
1862 used in the flags. (The arrays need not be initialized, but they
|
Chris@10
|
1863 must be allocated.)
|
Chris@10
|
1864
|
Chris@10
|
1865 * `kind', or `kind0'/`kind1'/`kind2', or `kind[rank]', is the kind
|
Chris@10
|
1866 of r2r transform used for the corresponding dimension. The valid
|
Chris@10
|
1867 kind constants are described in *note Real-to-Real Transform
|
Chris@10
|
1868 Kinds::. In a multi-dimensional transform, what is computed is
|
Chris@10
|
1869 the separable product formed by taking each transform kind along
|
Chris@10
|
1870 the corresponding dimension, one dimension after another.
|
Chris@10
|
1871
|
Chris@10
|
1872 * `flags' is a bitwise OR (`|') of zero or more planner flags, as
|
Chris@10
|
1873 defined in *note Planner Flags::.
|
Chris@10
|
1874
|
Chris@10
|
1875
|
Chris@10
|
1876
|
Chris@10
|
1877 File: fftw3.info, Node: Real-to-Real Transform Kinds, Prev: Real-to-Real Transforms, Up: Basic Interface
|
Chris@10
|
1878
|
Chris@10
|
1879 4.3.6 Real-to-Real Transform Kinds
|
Chris@10
|
1880 ----------------------------------
|
Chris@10
|
1881
|
Chris@10
|
1882 FFTW currently supports 11 different r2r transform kinds, specified by
|
Chris@10
|
1883 one of the constants below. For the precise definitions of these
|
Chris@10
|
1884 transforms, see *note What FFTW Really Computes::. For a more
|
Chris@10
|
1885 colloquial introduction to these transform kinds, see *note More DFTs
|
Chris@10
|
1886 of Real Data::.
|
Chris@10
|
1887
|
Chris@10
|
1888 For a dimension of size `n', there is a corresponding "logical"
|
Chris@10
|
1889 dimension `N' that determines the normalization (and the optimal
|
Chris@10
|
1890 factorization); the formula for `N' is given for each kind below.
|
Chris@10
|
1891 Also, with each transform kind is listed its corresponding inverse
|
Chris@10
|
1892 transform. FFTW computes unnormalized transforms: a transform followed
|
Chris@10
|
1893 by its inverse will result in the original data multiplied by `N' (or
|
Chris@10
|
1894 the product of the `N''s for each dimension, in multi-dimensions).
|
Chris@10
|
1895
|
Chris@10
|
1896 * `FFTW_R2HC' computes a real-input DFT with output in "halfcomplex"
|
Chris@10
|
1897 format, i.e. real and imaginary parts for a transform of size `n'
|
Chris@10
|
1898 stored as: r0, r1, r2, ..., r(n/2), i((n+1)/2-1), ..., i2, i1. (Logical
|
Chris@10
|
1899 `N=n', inverse is `FFTW_HC2R'.)
|
Chris@10
|
1900
|
Chris@10
|
1901 * `FFTW_HC2R' computes the reverse of `FFTW_R2HC', above. (Logical
|
Chris@10
|
1902 `N=n', inverse is `FFTW_R2HC'.)
|
Chris@10
|
1903
|
Chris@10
|
1904 * `FFTW_DHT' computes a discrete Hartley transform. (Logical `N=n',
|
Chris@10
|
1905 inverse is `FFTW_DHT'.)
|
Chris@10
|
1906
|
Chris@10
|
1907 * `FFTW_REDFT00' computes an REDFT00 transform, i.e. a DCT-I.
|
Chris@10
|
1908 (Logical `N=2*(n-1)', inverse is `FFTW_REDFT00'.)
|
Chris@10
|
1909
|
Chris@10
|
1910 * `FFTW_REDFT10' computes an REDFT10 transform, i.e. a DCT-II
|
Chris@10
|
1911 (sometimes called "the" DCT). (Logical `N=2*n', inverse is
|
Chris@10
|
1912 `FFTW_REDFT01'.)
|
Chris@10
|
1913
|
Chris@10
|
1914 * `FFTW_REDFT01' computes an REDFT01 transform, i.e. a DCT-III
|
Chris@10
|
1915 (sometimes called "the" IDCT, being the inverse of DCT-II).
|
Chris@10
|
1916 (Logical `N=2*n', inverse is `FFTW_REDFT10'.)
|
Chris@10
|
1917
|
Chris@10
|
1918 * `FFTW_REDFT11' computes an REDFT11 transform, i.e. a DCT-IV.
|
Chris@10
|
1919 (Logical `N=2*n', inverse is `FFTW_REDFT11'.)
|
Chris@10
|
1920
|
Chris@10
|
1921 * `FFTW_RODFT00' computes an RODFT00 transform, i.e. a DST-I.
|
Chris@10
|
1922 (Logical `N=2*(n+1)', inverse is `FFTW_RODFT00'.)
|
Chris@10
|
1923
|
Chris@10
|
1924 * `FFTW_RODFT10' computes an RODFT10 transform, i.e. a DST-II.
|
Chris@10
|
1925 (Logical `N=2*n', inverse is `FFTW_RODFT01'.)
|
Chris@10
|
1926
|
Chris@10
|
1927 * `FFTW_RODFT01' computes an RODFT01 transform, i.e. a DST-III.
|
Chris@10
|
1928 (Logical `N=2*n', inverse is `FFTW_RODFT10'.)
|
Chris@10
|
1929
|
Chris@10
|
1930 * `FFTW_RODFT11' computes an RODFT11 transform, i.e. a DST-IV.
|
Chris@10
|
1931 (Logical `N=2*n', inverse is `FFTW_RODFT11'.)
|
Chris@10
|
1932
|
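The halfcomplex layout used by `FFTW_R2HC' and `FFTW_HC2R' can be read
with a small accessor. This is an illustrative sketch: `cpx' and
`halfcomplex_get' are hypothetical names, and the index rule follows the
ordering given above, in which the real part of frequency k sits at
index k and its imaginary part at index n-k.

```c
#include <assert.h>

typedef struct { double re, im; } cpx;  /* hypothetical {re, im} pair */

/* Read frequency k (0 <= k <= n/2) out of a halfcomplex array of length n. */
static cpx halfcomplex_get(int n, const double *hc, int k)
{
    cpx v;
    assert(n > 0 && hc != 0 && k >= 0 && k <= n / 2);
    v.re = hc[k];
    /* i0 is not stored, and neither is i(n/2) when n is even: both are zero */
    v.im = (k > 0 && k < (n + 1) / 2) ? hc[n - k] : 0.0;
    return v;
}
```

For n = 4 the array holds r0, r1, r2, i1, and for n = 5 it holds
r0, r1, r2, i2, i1.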
Chris@10
|
1933
|
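The logical sizes listed for each kind can be collected into one
function, which is convenient for normalizing a round trip. The enum
`r2r_kind' and both function names are hypothetical labels for this
sketch; FFTW's own constants are the `FFTW_'-prefixed names above.

```c
#include <assert.h>

/* Hypothetical labels for the 11 r2r kinds (not FFTW's constants). */
typedef enum {
    KIND_R2HC, KIND_HC2R, KIND_DHT,
    KIND_REDFT00, KIND_REDFT10, KIND_REDFT01, KIND_REDFT11,
    KIND_RODFT00, KIND_RODFT10, KIND_RODFT01, KIND_RODFT11
} r2r_kind;

/* Logical size N for one dimension of physical size n. */
static int logical_size(r2r_kind kind, int n)
{
    assert(n > 0);
    switch (kind) {
    case KIND_R2HC:
    case KIND_HC2R:
    case KIND_DHT:
        return n;
    case KIND_REDFT00:
        return 2 * (n - 1);
    case KIND_RODFT00:
        return 2 * (n + 1);
    default:                    /* the remaining DCT and DST kinds */
        return 2 * n;
    }
}

/* Factor by which a transform followed by its inverse scales the data:
   the product of the logical sizes over all dimensions. */
static double roundtrip_scale(int rank, const r2r_kind *kind, const int *n)
{
    double scale = 1.0;
    assert(rank >= 0);
    for (int i = 0; i < rank; ++i)
        scale *= logical_size(kind[i], n[i]);
    return scale;
}
```

Dividing the round-trip result by `roundtrip_scale(...)' recovers the
original data. For example, a 4 x 6 transform with a DCT-II along the
first dimension and R2HC along the second is scaled by 8 * 6 = 48.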
Chris@10
|
1934
|
Chris@10
|
1935 File: fftw3.info, Node: Advanced Interface, Next: Guru Interface, Prev: Basic Interface, Up: FFTW Reference
|
Chris@10
|
1936
|
Chris@10
|
1937 4.4 Advanced Interface
|
Chris@10
|
1938 ======================
|
Chris@10
|
1939
|
Chris@10
|
1940 FFTW's "advanced" interface supplements the basic interface with four
|
Chris@10
|
1941 new planner routines, providing a new level of flexibility: you can plan
|
Chris@10
|
1942 a transform of multiple arrays simultaneously, operate on non-contiguous
|
Chris@10
|
1943 (strided) data, and transform a subset of a larger multi-dimensional
|
Chris@10
|
1944 array. Other than these additional features, the planner operates in
|
Chris@10
|
1945 the same fashion as in the basic interface, and the resulting
|
Chris@10
|
1946 `fftw_plan' is used in the same way (*note Using Plans::).
|
Chris@10
|
1947
|
Chris@10
|
1948 * Menu:
|
Chris@10
|
1949
|
Chris@10
|
1950 * Advanced Complex DFTs::
|
Chris@10
|
1951 * Advanced Real-data DFTs::
|
Chris@10
|
1952 * Advanced Real-to-real Transforms::
|
Chris@10
|
1953
|
Chris@10
|
1954
|
Chris@10
|
1955 File: fftw3.info, Node: Advanced Complex DFTs, Next: Advanced Real-data DFTs, Prev: Advanced Interface, Up: Advanced Interface
|
Chris@10
|
1956
|
Chris@10
|
1957 4.4.1 Advanced Complex DFTs
|
Chris@10
|
1958 ---------------------------
|
Chris@10
|
1959
|
Chris@10
|
1960 fftw_plan fftw_plan_many_dft(int rank, const int *n, int howmany,
|
Chris@10
|
1961 fftw_complex *in, const int *inembed,
|
Chris@10
|
1962 int istride, int idist,
|
Chris@10
|
1963 fftw_complex *out, const int *onembed,
|
Chris@10
|
1964 int ostride, int odist,
|
Chris@10
|
1965 int sign, unsigned flags);
|
Chris@10
|
1966
|
Chris@10
|
1967 This routine plans multiple multidimensional complex DFTs, and it
|
Chris@10
|
1968 extends the `fftw_plan_dft' routine (*note Complex DFTs::) to compute
|
Chris@10
|
1969 `howmany' transforms, each having rank `rank' and size `n'. In
|
Chris@10
|
1970 addition, the transform data need not be contiguous, but it may be laid
|
Chris@10
|
1971 out in memory with an arbitrary stride. To account for these
|
Chris@10
|
1972 possibilities, `fftw_plan_many_dft' adds the new parameters `howmany',
|
Chris@10
|
1973 {`i',`o'}`nembed', {`i',`o'}`stride', and {`i',`o'}`dist'. The FFTW
|
Chris@10
|
1974 basic interface (*note Complex DFTs::) provides routines specialized
|
Chris@10
|
1975 for ranks 1, 2, and 3, but the advanced interface handles only the
|
Chris@10
|
1976 general-rank case.
|
Chris@10
|
1977
|
Chris@10
|
1978 `howmany' is the number of transforms to compute. The resulting
|
Chris@10
|
1979 plan computes `howmany' transforms, where the input of the `k'-th
|
Chris@10
|
1980 transform is at location `in+k*idist' (in C pointer arithmetic), and
|
Chris@10
|
1981 its output is at location `out+k*odist'. Plans obtained in this way
|
Chris@10
|
1982 can often be faster than calling FFTW multiple times for the individual
|
Chris@10
|
1983 transforms. The basic `fftw_plan_dft' interface corresponds to
|
Chris@10
|
1984 `howmany=1' (in which case the `dist' parameters are ignored).
|
Chris@10
|
1985
|
Chris@10
|
1986 Each of the `howmany' transforms has rank `rank' and size `n', as in
|
Chris@10
|
1987 the basic interface. In addition, the advanced interface allows the
|
Chris@10
|
1988 input and output arrays of each transform to be row-major subarrays of
|
Chris@10
|
1989 larger rank-`rank' arrays, described by `inembed' and `onembed'
|
Chris@10
|
1990 parameters, respectively. {`i',`o'}`nembed' must be arrays of length
|
Chris@10
|
1991 `rank', and `n' should be elementwise less than or equal to
|
Chris@10
|
1992 {`i',`o'}`nembed'. Passing `NULL' for an `nembed' parameter is
|
Chris@10
|
1993 equivalent to passing `n' (i.e. same physical and logical dimensions,
|
Chris@10
|
1994 as in the basic interface).
|
Chris@10
|
1995
|
Chris@10
|
1996 The `stride' parameters indicate that the `j'-th element of the
|
Chris@10
|
1997 input or output arrays is located at `j*istride' or `j*ostride',
|
Chris@10
|
1998 respectively. (For a multi-dimensional array, `j' is the ordinary
|
Chris@10
|
1999 row-major index.) When combined with the `k'-th transform in a
|
Chris@10
|
2000 `howmany' loop, from above, this means that the (`j',`k')-th element is
|
Chris@10
|
2001 at `j*stride+k*dist'. (The basic `fftw_plan_dft' interface corresponds
|
Chris@10
|
2002 to a stride of 1.)
|
Chris@10
|
2003
|
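The addressing rule for `stride' and `dist' can be spelled out as plain
index arithmetic. The helpers below are hypothetical and do not call
FFTW. The second one assumes, following the description of `nembed'
above, that the row-major index `j' of a 2d transform is taken within
the larger embedding array.

```c
#include <assert.h>
#include <stddef.h>

/* Offset, in elements from the base pointer, of row-major element j of
   the k-th transform: j*stride + k*dist. */
static ptrdiff_t element_offset(ptrdiff_t j, int k, int stride, int dist)
{
    assert(j >= 0 && k >= 0);
    return j * stride + (ptrdiff_t)k * dist;
}

/* Row-major index of (i0, i1) inside an embedding array whose second
   dimension is nembed1; this is the j to pass to element_offset(). */
static ptrdiff_t embedded_index_2d(int i0, int i1, int nembed1)
{
    assert(i0 >= 0 && i1 >= 0 && i1 < nembed1);
    return (ptrdiff_t)i0 * nembed1 + i1;
}
```

For the columns of a 10 x 3 row-major array (stride 3, dist 1), element
4 of column 2 is at offset 4*3 + 2 = 14, which is exactly row 4, column
2. For three contiguous 5 x 6 arrays (stride 1, dist 30), element (2,3)
of the second array is at offset 15 + 30 = 45.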
Chris@10
|
2004 For in-place transforms, the input and output `stride' and `dist'
|
Chris@10
|
2005 parameters should be the same; otherwise, the planner may return `NULL'.
|
Chris@10
|
2006
|
Chris@10
|
2007 Arrays `n', `inembed', and `onembed' are not used after this
|
Chris@10
|
2008 function returns. You can safely free or reuse them.
|
Chris@10
|
2009
|
Chris@10
|
2010 *Examples*: One transform of one 5 by 6 array contiguous in memory:
|
Chris@10
|
2011 int rank = 2;
|
Chris@10
|
2012 int n[] = {5, 6};
|
Chris@10
|
2013 int howmany = 1;
|
Chris@10
|
2014 int idist = 0, odist = 0; /* unused because howmany = 1 */
|
Chris@10
|
2015 int istride = 1, ostride = 1; /* array is contiguous in memory */
|
Chris@10
|
2016 int *inembed = n, *onembed = n;
|
Chris@10
|
2017
|
Chris@10
|
2018 Transform of three 5 by 6 arrays, each contiguous in memory, stored
|
Chris@10
|
2019 in memory one after another:
|
Chris@10
|
2020 int rank = 2;
|
Chris@10
|
2021 int n[] = {5, 6};
|
Chris@10
|
2022 int howmany = 3;
|
Chris@10
|
2023 int idist = n[0]*n[1], odist = idist; /* = 30, the distance in memory
|
Chris@10
|
2024 between the first element
|
Chris@10
|
2025 of the first array and the
|
Chris@10
|
2026 first element of the second array */
|
Chris@10
|
2027 int istride = 1, ostride = 1; /* array is contiguous in memory */
|
Chris@10
|
2028 int *inembed = n, *onembed = n;
|
Chris@10
|
2029
|
Chris@10
|
2030 Transform each column of a 2d array with 10 rows and 3 columns:
|
Chris@10
|
2031 int rank = 1; /* not 2: we are computing 1d transforms */
|
Chris@10
|
2032 int n[] = {10}; /* 1d transforms of length 10 */
|
Chris@10
|
2033 int howmany = 3;
|
Chris@10
|
2034 int idist = 1, odist = 1;
|
Chris@10
|
2035 int istride = 3, ostride = 3; /* distance between two elements in
|
Chris@10
|
2036 the same column */
|
Chris@10
|
2037 int *inembed = n, *onembed = n;
|
Chris@10
|
2038
|
Chris@10
|
2039
|
Chris@10
|
2040 File: fftw3.info, Node: Advanced Real-data DFTs, Next: Advanced Real-to-real Transforms, Prev: Advanced Complex DFTs, Up: Advanced Interface
|
Chris@10
|
2041
|
Chris@10
|
2042 4.4.2 Advanced Real-data DFTs
|
Chris@10
|
2043 -----------------------------
|
Chris@10
|
2044
|
Chris@10
|
2045 fftw_plan fftw_plan_many_dft_r2c(int rank, const int *n, int howmany,
|
Chris@10
|
2046 double *in, const int *inembed,
|
Chris@10
|
2047 int istride, int idist,
|
Chris@10
|
2048 fftw_complex *out, const int *onembed,
|
Chris@10
|
2049 int ostride, int odist,
|
Chris@10
|
2050 unsigned flags);
|
Chris@10
|
2051 fftw_plan fftw_plan_many_dft_c2r(int rank, const int *n, int howmany,
|
Chris@10
|
2052 fftw_complex *in, const int *inembed,
|
Chris@10
|
2053 int istride, int idist,
|
Chris@10
|
2054 double *out, const int *onembed,
|
Chris@10
|
2055 int ostride, int odist,
|
Chris@10
|
2056 unsigned flags);
|
Chris@10
|
2057
|
Chris@10
|
2058 Like `fftw_plan_many_dft', these two functions add `howmany',
|
Chris@10
|
2059 `nembed', `stride', and `dist' parameters to the `fftw_plan_dft_r2c'
|
Chris@10
|
2060 and `fftw_plan_dft_c2r' functions, but otherwise behave the same as the
|
Chris@10
|
2061 basic interface.
|
Chris@10
|
2062
|
Chris@10
|
2063 The interpretation of `howmany', `stride', and `dist' is the same
|
Chris@10
|
2064 as for `fftw_plan_many_dft', above. Note that the `stride' and `dist'
|
Chris@10
|
2065 for the real array are in units of `double', and for the complex array
|
Chris@10
|
2066 are in units of `fftw_complex'.
|
Chris@10
|
2067
|
Chris@10
|
2068 If an `nembed' parameter is `NULL', it is interpreted as what it
|
Chris@10
|
2069 would be in the basic interface, as described in *note Real-data DFT
|
Chris@10
|
2070 Array Format::. That is, for the complex array the size is assumed to
|
Chris@10
|
2071 be the same as `n', but with the last dimension cut roughly in half.
|
Chris@10
|
2072 For the real array, the size is assumed to be `n' if the transform is
|
Chris@10
|
2073 out-of-place, or `n' with the last dimension "padded" if the transform
|
Chris@10
|
2074 is in-place.
|
Chris@10
|
2075
|
Chris@10
|
2076 If an `nembed' parameter is non-`NULL', it is interpreted as the
|
Chris@10
|
2077 physical size of the corresponding array, in row-major order, just as
|
Chris@10
|
2078 for `fftw_plan_many_dft'. In this case, each dimension of `nembed'
|
Chris@10
|
2079 should be `>=' what it would be in the basic interface (e.g. the halved
|
Chris@10
|
2080 or padded `n').
|
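As a rough illustration of the default sizes described above, the following sketch computes the physical size that a `NULL' `nembed' would imply for the real and complex arrays. The helper names (`complex_last_dim', `real_last_dim', `default_nembed') are hypothetical and are not part of FFTW. The padded length 2*(n/2+1) for in-place transforms is taken from the basic-interface array format.

```c
/* Hypothetical helpers, not part of the FFTW API. */

/* Last-dimension length of the complex array: n/2+1, division rounded down. */
static int complex_last_dim(int n_last)
{
    return n_last / 2 + 1;
}

/* Last-dimension length of the real array, in units of double.
   In-place transforms need room for the complex output, hence the padding. */
static int real_last_dim(int n_last, int in_place)
{
    return in_place ? 2 * (n_last / 2 + 1) : n_last;
}

/* Fill nembed_out[0..rank-1] with the size implied by a NULL nembed.
   is_complex selects the complex array, otherwise the real array. */
static void default_nembed(int rank, const int *n, int is_complex,
                           int in_place, int *nembed_out)
{
    int i;
    for (i = 0; i < rank; ++i)
        nembed_out[i] = n[i];
    if (rank > 0)
        nembed_out[rank - 1] = is_complex
            ? complex_last_dim(n[rank - 1])
            : real_last_dim(n[rank - 1], in_place);
}
```

A non-`NULL' `nembed' passed to the planner should then be at least as large, in every dimension, as the array this helper fills in.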
Chris@10
|
2081
|
Chris@10
|
2082 Arrays `n', `inembed', and `onembed' are not used after this
|
Chris@10
|
2083 function returns. You can safely free or reuse them.
|
Chris@10
|
2084
|
Chris@10
|
2085
|
Chris@10
|
2086 File: fftw3.info, Node: Advanced Real-to-real Transforms, Prev: Advanced Real-data DFTs, Up: Advanced Interface
|
Chris@10
|
2087
|
Chris@10
|
2088 4.4.3 Advanced Real-to-real Transforms
|
Chris@10
|
2089 --------------------------------------
|
Chris@10
|
2090
|
Chris@10
|
2091 fftw_plan fftw_plan_many_r2r(int rank, const int *n, int howmany,
|
Chris@10
|
2092 double *in, const int *inembed,
|
Chris@10
|
2093 int istride, int idist,
|
Chris@10
|
2094 double *out, const int *onembed,
|
Chris@10
|
2095 int ostride, int odist,
|
Chris@10
|
2096 const fftw_r2r_kind *kind, unsigned flags);
|
Chris@10
|
2097
|
Chris@10
|
2098 Like `fftw_plan_many_dft', this function adds `howmany', `nembed',
|
Chris@10
|
2099 `stride', and `dist' parameters to the `fftw_plan_r2r' function, but
|
Chris@10
|
2100 otherwise behaves the same as the basic interface. The interpretation
|
Chris@10
|
2101 of those additional parameters is the same as for
|
Chris@10
|
2102 `fftw_plan_many_dft'. (Of course, the `stride' and `dist' parameters
|
Chris@10
|
2103 are now in units of `double', not `fftw_complex'.)
|
Chris@10
|
2104
|
Chris@10
|
2105 Arrays `n', `inembed', `onembed', and `kind' are not used after this
|
Chris@10
|
2106 function returns. You can safely free or reuse them.
|
Chris@10
|
2107
|
Chris@10
|
2108
|
Chris@10
|
2109 File: fftw3.info, Node: Guru Interface, Next: New-array Execute Functions, Prev: Advanced Interface, Up: FFTW Reference
|
Chris@10
|
2110
|
Chris@10
|
2111 4.5 Guru Interface
|
Chris@10
|
2112 ==================
|
Chris@10
|
2113
|
Chris@10
|
2114 The "guru" interface to FFTW is intended to expose as much as possible
|
Chris@10
|
2115 of the flexibility in the underlying FFTW architecture. It allows one
|
Chris@10
|
2116 to compute multi-dimensional "vectors" (loops) of multi-dimensional
|
Chris@10
|
2117 transforms, where each vector/transform dimension has an independent
|
Chris@10
|
2118 size and stride. One can also use more general complex-number formats,
|
Chris@10
|
2119 e.g. separate real and imaginary arrays.
|
Chris@10
|
2120
|
Chris@10
|
2121 For those users who require the flexibility of the guru interface,
|
Chris@10
|
2122 it is important that they pay special attention to the documentation
|
Chris@10
|
2123 lest they shoot themselves in the foot.
|
Chris@10
|
2124
|
Chris@10
|
2125 * Menu:
|
Chris@10
|
2126
|
Chris@10
|
2127 * Interleaved and split arrays::
|
Chris@10
|
2128 * Guru vector and transform sizes::
|
Chris@10
|
2129 * Guru Complex DFTs::
|
Chris@10
|
2130 * Guru Real-data DFTs::
|
Chris@10
|
2131 * Guru Real-to-real Transforms::
|
Chris@10
|
2132 * 64-bit Guru Interface::
|
Chris@10
|
2133
|
Chris@10
|
2134
|
Chris@10
|
2135 File: fftw3.info, Node: Interleaved and split arrays, Next: Guru vector and transform sizes, Prev: Guru Interface, Up: Guru Interface
|
Chris@10
|
2136
|
Chris@10
|
2137 4.5.1 Interleaved and split arrays
|
Chris@10
|
2138 ----------------------------------
|
Chris@10
|
2139
|
Chris@10
|
2140 The guru interface supports two representations of complex numbers,
|
Chris@10
|
2141 which we call the interleaved and the split format.
|
Chris@10
|
2142
|
Chris@10
|
2143 The "interleaved" format is the same one used by the basic and
|
Chris@10
|
2144 advanced interfaces, and it is documented in *note Complex numbers::.
|
Chris@10
|
2145 In the interleaved format, you provide pointers to the real part of a
|
Chris@10
|
2146 complex number, and the imaginary part is understood to be stored in the
|
Chris@10
|
2147 next memory location.
|
Chris@10
|
2148
|
Chris@10
|
2149 The "split" format allows separate pointers to the real and
|
Chris@10
|
2150 imaginary parts of a complex array.
|
Chris@10
|
2151
|
Chris@10
|
2152 Technically, the interleaved format is redundant, because you can
|
Chris@10
|
2153 always express an interleaved array in terms of a split array with
|
Chris@10
|
2154 appropriate pointers and strides. On the other hand, the interleaved
|
Chris@10
|
2155 format is simpler to use, and it is common in practice. Hence, FFTW
|
Chris@10
|
2156 supports it as a special case.
|
Chris@10
|
2157
|
Chris@10
|
2158
|
Chris@10
|
2159 File: fftw3.info, Node: Guru vector and transform sizes, Next: Guru Complex DFTs, Prev: Interleaved and split arrays, Up: Guru Interface
|
Chris@10
|
2160
|
Chris@10
|
2161 4.5.2 Guru vector and transform sizes
|
Chris@10
|
2162 -------------------------------------
|
Chris@10
|
2163
|
Chris@10
|
2164 The guru interface introduces one basic new data structure,
|
Chris@10
|
2165 `fftw_iodim', that is used to specify sizes and strides for
|
Chris@10
|
2166 multi-dimensional transforms and vectors:
|
Chris@10
|
2167
|
Chris@10
|
2168 typedef struct {
|
Chris@10
|
2169 int n;
|
Chris@10
|
2170 int is;
|
Chris@10
|
2171 int os;
|
Chris@10
|
2172 } fftw_iodim;
|
Chris@10
|
2173
|
Chris@10
|
2174 Here, `n' is the size of the dimension, and `is' and `os' are the
|
Chris@10
|
2175 strides of that dimension for the input and output arrays. (The stride
|
Chris@10
|
2176 is the separation of consecutive elements along this dimension.)
|
Chris@10
|
2177
|
Chris@10
|
2178 The meaning of the stride parameter depends on the type of the array
|
Chris@10
|
2179 that the stride refers to. _If the array is interleaved complex,
|
Chris@10
|
2180 strides are expressed in units of complex numbers (`fftw_complex'). If
|
Chris@10
|
2181 the array is split complex or real, strides are expressed in units of
|
Chris@10
|
2182 real numbers (`double')._ This convention is consistent with the usual
|
Chris@10
|
2183 pointer arithmetic in the C language. An interleaved array is denoted
|
Chris@10
|
2184 by a pointer `p' to `fftw_complex', so that `p+1' points to the next
|
Chris@10
|
2185 complex number. Split arrays are denoted by pointers to `double', in
|
Chris@10
|
2186 which case pointer arithmetic operates in units of `sizeof(double)'.
|
Chris@10
|
2187
|
Chris@10
|
2188 The guru planner interfaces all take a (`rank', `dims[rank]') pair
|
Chris@10
|
2189 describing the transform size, and a (`howmany_rank',
|
Chris@10
|
2190 `howmany_dims[howmany_rank]') pair describing the "vector" size (a
|
Chris@10
|
2191 multi-dimensional loop of transforms to perform), where `dims' and
|
Chris@10
|
2192 `howmany_dims' are arrays of `fftw_iodim'.
|
Chris@10
|
2193
|
Chris@10
|
2194 For example, the `howmany' parameter in the advanced complex-DFT
|
Chris@10
|
2195 interface corresponds to `howmany_rank' = 1, `howmany_dims[0].n' =
|
Chris@10
|
2196 `howmany', `howmany_dims[0].is' = `idist', and `howmany_dims[0].os' =
|
Chris@10
|
2197 `odist'. (To compute a single transform, you can just use
|
Chris@10
|
2198 `howmany_rank' = 0.)
|
Chris@10
|
2199
|
Chris@10
|
2200 A row-major multidimensional array with dimensions `n[rank]' (*note
|
Chris@10
|
2201 Row-major Format::) corresponds to `dims[i].n' = `n[i]' and the
|
Chris@10
|
2202 recurrence `dims[i].is' = `n[i+1] * dims[i+1].is' (similarly for `os').
|
Chris@10
|
2203 The stride of the last (`i=rank-1') dimension is the overall stride of
|
Chris@10
|
2204 the array. For example, to be equivalent to the advanced complex-DFT
|
Chris@10
|
2205 interface, you would have `dims[rank-1].is' = `istride' and
|
Chris@10
|
2206 `dims[rank-1].os' = `ostride'.
|
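The row-major recurrence and the `howmany' correspondence above can be sketched in a few lines of C. To keep the example compilable without `fftw3.h', it uses a stand-in struct `iodim_sketch' with the same three fields as `fftw_iodim'; that struct and the function names are assumptions made for illustration only. In real code you would fill an `fftw_iodim' array the same way.

```c
/* Stand-in with the same fields as fftw_iodim (hypothetical name). */
typedef struct {
    int n;   /* size of the dimension */
    int is;  /* input stride */
    int os;  /* output stride */
} iodim_sketch;

/* Fill dims[0..rank-1] for a row-major array of dimensions n[rank].
   The last dimension gets the overall strides; each earlier dimension
   follows dims[i].is = n[i+1] * dims[i+1].is (and likewise for os). */
static void fill_row_major(int rank, const int *n,
                           int istride, int ostride, iodim_sketch *dims)
{
    int i;
    if (rank <= 0)
        return;
    dims[rank - 1].n = n[rank - 1];
    dims[rank - 1].is = istride;
    dims[rank - 1].os = ostride;
    for (i = rank - 2; i >= 0; --i) {
        dims[i].n = n[i];
        dims[i].is = n[i + 1] * dims[i + 1].is;
        dims[i].os = n[i + 1] * dims[i + 1].os;
    }
}

/* The advanced interface's howmany/idist/odist as a single vector dimension. */
static iodim_sketch howmany_as_iodim(int howmany, int idist, int odist)
{
    iodim_sketch d;
    d.n = howmany;
    d.is = idist;
    d.os = odist;
    return d;
}
```

For a 3x4x5 array with `istride' 1, this yields input strides 20, 4... more precisely 20, 5, and 1 for dimensions 0, 1, and 2, which is the usual row-major layout.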
Chris@10
|
2207
|
Chris@10
|
2208 In general, we only guarantee FFTW to return a non-`NULL' plan if
|
Chris@10
|
2209 the vector and transform dimensions correspond to a set of distinct
|
Chris@10
|
2210 indices, and for in-place transforms the input/output strides should be
|
Chris@10
|
2211 the same.
|
Chris@10
|
2212
|
Chris@10
|
2213
|
Chris@10
|
2214 File: fftw3.info, Node: Guru Complex DFTs, Next: Guru Real-data DFTs, Prev: Guru vector and transform sizes, Up: Guru Interface
|
Chris@10
|
2215
|
Chris@10
|
2216 4.5.3 Guru Complex DFTs
|
Chris@10
|
2217 -----------------------
|
Chris@10
|
2218
|
Chris@10
|
2219 fftw_plan fftw_plan_guru_dft(
|
Chris@10
|
2220 int rank, const fftw_iodim *dims,
|
Chris@10
|
2221 int howmany_rank, const fftw_iodim *howmany_dims,
|
Chris@10
|
2222 fftw_complex *in, fftw_complex *out,
|
Chris@10
|
2223 int sign, unsigned flags);
|
Chris@10
|
2224
|
Chris@10
|
2225 fftw_plan fftw_plan_guru_split_dft(
|
Chris@10
|
2226 int rank, const fftw_iodim *dims,
|
Chris@10
|
2227 int howmany_rank, const fftw_iodim *howmany_dims,
|
Chris@10
|
2228 double *ri, double *ii, double *ro, double *io,
|
Chris@10
|
2229 unsigned flags);
|
Chris@10
|
2230
|
Chris@10
|
2231 These two functions plan a complex-data, multi-dimensional DFT for
|
Chris@10
|
2232 the interleaved and split format, respectively. Transform dimensions
|
Chris@10
|
2233 are given by (`rank', `dims') over a multi-dimensional vector (loop) of
|
Chris@10
|
2234 dimensions (`howmany_rank', `howmany_dims'). `dims' and `howmany_dims'
|
Chris@10
|
2235 should point to `fftw_iodim' arrays of length `rank' and
|
Chris@10
|
2236 `howmany_rank', respectively.
|
Chris@10
|
2237
|
Chris@10
|
2238 `flags' is a bitwise OR (`|') of zero or more planner flags, as
|
Chris@10
|
2239 defined in *note Planner Flags::.
|
Chris@10
|
2240
|
Chris@10
|
2241 In the `fftw_plan_guru_dft' function, the pointers `in' and `out'
|
Chris@10
|
2242 point to the interleaved input and output arrays, respectively. The
|
Chris@10
|
2243 sign can be either -1 (= `FFTW_FORWARD') or +1 (= `FFTW_BACKWARD'). If
|
Chris@10
|
2244 the pointers are equal, the transform is in-place.
|
Chris@10
|
2245
|
Chris@10
|
2246 In the `fftw_plan_guru_split_dft' function, `ri' and `ii' point to
|
Chris@10
|
2247 the real and imaginary input arrays, and `ro' and `io' point to the
|
Chris@10
|
2248 real and imaginary output arrays. The input and output pointers may be
|
Chris@10
|
2249 the same, indicating an in-place transform. For example, for
|
Chris@10
|
2250 `fftw_complex' pointers `in' and `out', the corresponding parameters
|
Chris@10
|
2251 are:
|
Chris@10
|
2252
|
Chris@10
|
2253 ri = (double *) in;
|
Chris@10
|
2254 ii = (double *) in + 1;
|
Chris@10
|
2255 ro = (double *) out;
|
Chris@10
|
2256 io = (double *) out + 1;
|
Chris@10
|
2257
|
Chris@10
|
2258 Because `fftw_plan_guru_split_dft' accepts split arrays, strides are
|
Chris@10
|
2259 expressed in units of `double'. For a contiguous `fftw_complex' array,
|
Chris@10
|
2260 the overall stride of the transform should be 2, the distance between
|
Chris@10
|
2261 consecutive real parts or between consecutive imaginary parts; see
|
Chris@10
|
2262 *note Guru vector and transform sizes::. Note that the dimension
|
Chris@10
|
2263 strides are applied equally to the real and imaginary parts; real and
|
Chris@10
|
2264 imaginary arrays with different strides are not supported.
|
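The pointer arithmetic behind this stride of 2 can be checked without calling FFTW at all. The sketch below assumes the default representation of `fftw_complex' as an array of two doubles and uses a stand-in type `complex_sketch' for it; the `split_view' struct and its accessors are hypothetical names introduced only for this illustration.

```c
/* Stand-in for fftw_complex, assumed here to be double[2] (re, im). */
typedef double complex_sketch[2];

/* A split-format view: separate real/imaginary pointers and one stride,
   expressed in units of double. */
typedef struct {
    double *re;
    double *im;
    int stride;
} split_view;

/* View a contiguous interleaved array as a split array. */
static split_view as_split(complex_sketch *a)
{
    split_view v;
    v.re = (double *) a;      /* first real part */
    v.im = (double *) a + 1;  /* first imaginary part */
    v.stride = 2;             /* distance between consecutive real parts */
    return v;
}

/* Element k of the real and imaginary arrays, using the shared stride. */
static double split_re(split_view v, int k) { return v.re[k * v.stride]; }
static double split_im(split_view v, int k) { return v.im[k * v.stride]; }
```

Both accessors use the same stride, which mirrors the restriction that real and imaginary arrays cannot have different strides.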
Chris@10
|
2265
|
Chris@10
|
2266 There is no `sign' parameter in `fftw_plan_guru_split_dft'. This
|
Chris@10
|
2267 function always plans for an `FFTW_FORWARD' transform. To plan for an
|
Chris@10
|
2268 `FFTW_BACKWARD' transform, you can exploit the identity that the
|
Chris@10
|
2269 backwards DFT is equal to the forwards DFT with the real and imaginary
|
Chris@10
|
2270 parts swapped. For example, in the case of the `fftw_complex' arrays
|
Chris@10
|
2271 above, the `FFTW_BACKWARD' transform is computed by the parameters:
|
Chris@10
|
2272
|
Chris@10
|
2273 ri = (double *) in + 1;
|
Chris@10
|
2274 ii = (double *) in;
|
Chris@10
|
2275 ro = (double *) out + 1;
|
Chris@10
|
2276 io = (double *) out;
|
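The swap identity can be verified on a small case without FFTW. The following sketch is a naive size-4 DFT on split arrays, written only for illustration; it is not how FFTW computes transforms, and the names `dft4_split' and `backward_via_forward4' are hypothetical. Size 4 is chosen because its twiddle factors are exactly 1, i, -1, and -i, so no math library is needed.

```c
/* Naive size-4 DFT on split arrays: sign = -1 is forward, +1 is backward.
   The twiddle factor exp(sign*2*pi*i*m/4) is wr[m] + i*sign*wi[m]. */
static void dft4_split(int sign, const double *ri, const double *ii,
                       double *ro, double *io)
{
    static const double wr[4] = {1.0, 0.0, -1.0, 0.0};
    static const double wi[4] = {0.0, 1.0, 0.0, -1.0};
    int j, k;
    for (k = 0; k < 4; ++k) {
        double sr = 0.0, si = 0.0;
        for (j = 0; j < 4; ++j) {
            int m = (j * k) % 4;
            double c = wr[m], s = sign * wi[m];
            /* (ri + i*ii) * (c + i*s) */
            sr += ri[j] * c - ii[j] * s;
            si += ri[j] * s + ii[j] * c;
        }
        ro[k] = sr;
        io[k] = si;
    }
}

/* Backward transform obtained from the forward one by swapping the
   real and imaginary pointers on both input and output. */
static void backward_via_forward4(const double *ri, const double *ii,
                                  double *ro, double *io)
{
    dft4_split(-1, ii, ri, io, ro);
}
```

Calling `dft4_split' with sign +1 and calling `backward_via_forward4' on the same input should give the same output, which is the property that the pointer swap relies on.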
Chris@10
|
2277
|
Chris@10
|
2278
|
Chris@10
|
2279 File: fftw3.info, Node: Guru Real-data DFTs, Next: Guru Real-to-real Transforms, Prev: Guru Complex DFTs, Up: Guru Interface
|
Chris@10
|
2280
|
Chris@10
|
2281 4.5.4 Guru Real-data DFTs
|
Chris@10
|
2282 -------------------------
|
Chris@10
|
2283
|
Chris@10
|
2284 fftw_plan fftw_plan_guru_dft_r2c(
|
Chris@10
|
2285 int rank, const fftw_iodim *dims,
|
Chris@10
|
2286 int howmany_rank, const fftw_iodim *howmany_dims,
|
Chris@10
|
2287 double *in, fftw_complex *out,
|
Chris@10
|
2288 unsigned flags);
|
Chris@10
|
2289
|
Chris@10
|
2290 fftw_plan fftw_plan_guru_split_dft_r2c(
|
Chris@10
|
2291 int rank, const fftw_iodim *dims,
|
Chris@10
|
2292 int howmany_rank, const fftw_iodim *howmany_dims,
|
Chris@10
|
2293 double *in, double *ro, double *io,
|
Chris@10
|
2294 unsigned flags);
|
Chris@10
|
2295
|
Chris@10
|
2296 fftw_plan fftw_plan_guru_dft_c2r(
|
Chris@10
|
2297 int rank, const fftw_iodim *dims,
|
Chris@10
|
2298 int howmany_rank, const fftw_iodim *howmany_dims,
|
Chris@10
|
2299 fftw_complex *in, double *out,
|
Chris@10
|
2300 unsigned flags);
|
Chris@10
|
2301
|
Chris@10
|
2302 fftw_plan fftw_plan_guru_split_dft_c2r(
|
Chris@10
|
2303 int rank, const fftw_iodim *dims,
|
Chris@10
|
2304 int howmany_rank, const fftw_iodim *howmany_dims,
|
Chris@10
|
2305 double *ri, double *ii, double *out,
|
Chris@10
|
2306 unsigned flags);
|
Chris@10
|
2307
|
Chris@10
|
2308 Plan a real-input (r2c) or real-output (c2r), multi-dimensional DFT
|
Chris@10
|
2309 with transform dimensions given by (`rank', `dims') over a
|
Chris@10
|
2310 multi-dimensional vector (loop) of dimensions (`howmany_rank',
|
Chris@10
|
2311 `howmany_dims'). `dims' and `howmany_dims' should point to
|
Chris@10
|
2312 `fftw_iodim' arrays of length `rank' and `howmany_rank', respectively.
|
Chris@10
|
2313 As for the basic and advanced interfaces, an r2c transform is
|
Chris@10
|
2314 `FFTW_FORWARD' and a c2r transform is `FFTW_BACKWARD'.
|
Chris@10
|
2315
|
Chris@10
|
2316 The _last_ dimension of `dims' is interpreted specially: that
|
Chris@10
|
2317 dimension of the real array has size `dims[rank-1].n', but that
|
Chris@10
|
2318 dimension of the complex array has size `dims[rank-1].n/2+1' (division
|
Chris@10
|
2319 rounded down). The strides, on the other hand, are taken to be exactly
|
Chris@10
|
2320 as specified. It is up to the user to specify the strides
|
Chris@10
|
2321 appropriately for the peculiar dimensions of the data, and we do not
|
Chris@10
|
2322 guarantee that the planner will succeed (return non-`NULL') for any
|
Chris@10
|
2323 dimensions other than those described in *note Real-data DFT Array
|
Chris@10
|
2324 Format:: and generalized in *note Advanced Real-data DFTs::. (That is,
|
Chris@10
|
2325 for an in-place transform, each individual dimension should be able to
|
Chris@10
|
2326 operate in place.)
|
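As an illustration of choosing strides for the "peculiar" dimensions, the sketch below fills the two transform dimensions for a 2D r2c transform on contiguous row-major arrays. It assumes the layouts of the basic interface: a real array of n0 by n1 doubles (or n0 by 2*(n1/2+1) when in-place) and a complex array of n0 by (n1/2+1) elements. The struct `iodim_sketch' stands in for `fftw_iodim' so the example compiles without `fftw3.h'; it and `r2c_dims_2d' are hypothetical names.

```c
/* Stand-in with the same fields as fftw_iodim (hypothetical name). */
typedef struct {
    int n;   /* logical size of the dimension */
    int is;  /* input stride, in units of double for the real array */
    int os;  /* output stride, in units of fftw_complex */
} iodim_sketch;

/* Dimensions for an n0 x n1 r2c transform on contiguous row-major data. */
static void r2c_dims_2d(int n0, int n1, int in_place, iodim_sketch dims[2])
{
    int nc = n1 / 2 + 1;                    /* complex row length */
    int real_row = in_place ? 2 * nc : n1;  /* real row length, padded in-place */

    /* Last dimension: n is the real (logical) size, both strides are 1. */
    dims[1].n = n1;
    dims[1].is = 1;
    dims[1].os = 1;

    /* First dimension: step over one full row of each array. */
    dims[0].n = n0;
    dims[0].is = real_row;
    dims[0].os = nc;
}
```

Note that `dims[1].n' stays at the logical size n1; only the row strides reflect the halved complex rows and the in-place padding.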
Chris@10
|
2327
|
Chris@10
|
2328 `in' and `out' point to the input and output arrays for r2c and c2r
|
Chris@10
|
2329 transforms, respectively. For split arrays, `ri' and `ii' point to the
|
Chris@10
|
2330 real and imaginary input arrays for a c2r transform, and `ro' and `io'
|
Chris@10
|
2331 point to the real and imaginary output arrays for an r2c transform.
|
Chris@10
|
2332 `in' and `ro' or `ri' and `out' may be the same, indicating an in-place
|
Chris@10
|
2333 transform. (In-place transforms where `in' and `io' or `ii' and `out'
|
Chris@10
|
2334 are the same are not currently supported.)
|
Chris@10
|
2335
|
Chris@10
|
2336 `flags' is a bitwise OR (`|') of zero or more planner flags, as
|
Chris@10
|
2337 defined in *note Planner Flags::.
|
Chris@10
|
2338
|
Chris@10
|
2339 In-place transforms of rank greater than 1 are currently only
|
Chris@10
|
2340 supported for interleaved arrays. For split arrays, the planner will
|
Chris@10
|
2341 return `NULL'.
|
Chris@10
|
2342
|
Chris@10
|
2343
|
Chris@10
|
2344 File: fftw3.info, Node: Guru Real-to-real Transforms, Next: 64-bit Guru Interface, Prev: Guru Real-data DFTs, Up: Guru Interface
|
Chris@10
|
2345
|
Chris@10
|
2346 4.5.5 Guru Real-to-real Transforms
|
Chris@10
|
2347 ----------------------------------
|
Chris@10
|
2348
|
Chris@10
|
2349 fftw_plan fftw_plan_guru_r2r(int rank, const fftw_iodim *dims,
|
Chris@10
|
2350 int howmany_rank,
|
Chris@10
|
2351 const fftw_iodim *howmany_dims,
|
Chris@10
|
2352 double *in, double *out,
|
Chris@10
|
2353 const fftw_r2r_kind *kind,
|
Chris@10
|
2354 unsigned flags);
|
Chris@10
|
2355
|
Chris@10
|
2356 Plan a real-to-real (r2r) multi-dimensional transform
|
Chris@10
|
2357 with transform dimensions given by (`rank', `dims') over a
|
Chris@10
|
2358 multi-dimensional vector (loop) of dimensions (`howmany_rank',
|
Chris@10
|
2359 `howmany_dims'). `dims' and `howmany_dims' should point to
|
Chris@10
|
2360 `fftw_iodim' arrays of length `rank' and `howmany_rank', respectively.
|
Chris@10
|
2361
|
Chris@10
|
2362 The transform kind of each dimension is given by the `kind'
|
Chris@10
|
2363 parameter, which should point to an array of length `rank'. Valid
|
Chris@10
|
2364 `fftw_r2r_kind' constants are given in *note Real-to-Real Transform
|
Chris@10
|
2365 Kinds::.
|
Chris@10
|
2366
|
Chris@10
|
2367 `in' and `out' point to the real input and output arrays; they may
|
Chris@10
|
2368 be the same, indicating an in-place transform.
|
Chris@10
|
2369
|
Chris@10
|
2370 `flags' is a bitwise OR (`|') of zero or more planner flags, as
|
Chris@10
|
2371 defined in *note Planner Flags::.
|
Chris@10
|
2372
|
Chris@10
|
2373
|
Chris@10
|
2374 File: fftw3.info, Node: 64-bit Guru Interface, Prev: Guru Real-to-real Transforms, Up: Guru Interface
|
Chris@10
|
2375
|
Chris@10
|
2376 4.5.6 64-bit Guru Interface
|
Chris@10
|
2377 ---------------------------
|
Chris@10
|
2378
|
Chris@10
|
2379 When compiled in 64-bit mode on a 64-bit architecture (where addresses
|
Chris@10
|
2380 are 64 bits wide), FFTW uses 64-bit quantities internally for all
|
Chris@10
|
2381 transform sizes, strides, and so on--you don't have to do anything
|
Chris@10
|
2382 special to exploit this. However, in the ordinary FFTW interfaces, you
|
Chris@10
|
2383 specify the transform size by an `int' quantity, which is normally only
|
Chris@10
|
2384 32 bits wide. This means that, even though FFTW is using 64-bit sizes
|
Chris@10
|
2385 internally, you cannot specify a single transform dimension larger than
|
Chris@10
|
2386 2^31-1 numbers.
|
Chris@10
|
2387
|
Chris@10
|
2388 We expect that few users will require transforms larger than this,
|
Chris@10
|
2389 but, for those who do, we provide a 64-bit version of the guru
|
Chris@10
|
2390 interface in which all sizes are specified as integers of type
|
Chris@10
|
2391 `ptrdiff_t' instead of `int'. (`ptrdiff_t' is a signed integer type
|
Chris@10
|
2392 defined by the C standard to be wide enough to represent address
|
Chris@10
|
2393 differences, and thus must be at least 64 bits wide on a 64-bit
|
Chris@10
|
2394 machine.) We stress that there is _no performance advantage_ to using
|
Chris@10
|
2395 this interface--the same internal FFTW code is employed regardless--and
|
Chris@10
|
2396 it is only necessary if you want to specify very large transform sizes.
|
Chris@10
|
2397
|
Chris@10
|
2398 In particular, the 64-bit guru interface is a set of planner routines
|
Chris@10
|
2399 that are exactly the same as the guru planner routines, except that
|
Chris@10
|
2400 they are named with `guru64' instead of `guru' and they take arguments
|
Chris@10
|
2401 of type `fftw_iodim64' instead of `fftw_iodim'. For example, instead
|
Chris@10
|
2402 of `fftw_plan_guru_dft', we have `fftw_plan_guru64_dft'.
|
Chris@10
|
2403
|
Chris@10
|
2404 fftw_plan fftw_plan_guru64_dft(
|
Chris@10
|
2405 int rank, const fftw_iodim64 *dims,
|
Chris@10
|
2406 int howmany_rank, const fftw_iodim64 *howmany_dims,
|
Chris@10
|
2407 fftw_complex *in, fftw_complex *out,
|
Chris@10
|
2408 int sign, unsigned flags);
|
Chris@10
|
2409
|
Chris@10
|
2410 The `fftw_iodim64' type is similar to `fftw_iodim', with the same
|
Chris@10
|
2411 interpretation, except that it uses type `ptrdiff_t' instead of type
|
Chris@10
|
2412 `int'.
|
Chris@10
|
2413
|
Chris@10
|
2414 typedef struct {
|
Chris@10
|
2415 ptrdiff_t n;
|
Chris@10
|
2416 ptrdiff_t is;
|
Chris@10
|
2417 ptrdiff_t os;
|
Chris@10
|
2418 } fftw_iodim64;
|
Chris@10
|
2419
|
Chris@10
|
2420 Every other `fftw_plan_guru' function also has a `fftw_plan_guru64'
|
Chris@10
|
2421 equivalent, but we do not repeat their documentation here since they
|
Chris@10
|
2422 are identical to the 32-bit versions except as noted above.
|
Chris@10
|
2423
|
Chris@10
|
2424
|
Chris@10
|
2425 File: fftw3.info, Node: New-array Execute Functions, Next: Wisdom, Prev: Guru Interface, Up: FFTW Reference
|
Chris@10
|
2426
|
Chris@10
|
2427 4.6 New-array Execute Functions
|
Chris@10
|
2428 ===============================
|
Chris@10
|
2429
|
Chris@10
|
2430 Normally, one executes a plan for the arrays with which the plan was
|
Chris@10
|
2431 created, by calling `fftw_execute(plan)' as described in *note Using
|
Chris@10
|
2432 Plans::. However, it is possible for sophisticated users to apply a
|
Chris@10
|
2433 given plan to a _different_ array using the "new-array execute"
|
Chris@10
|
2434 functions detailed below, provided that the following conditions are
|
Chris@10
|
2435 met:
|
Chris@10
|
2436
|
Chris@10
|
2437 * The array size, strides, etcetera are the same (since those are
|
Chris@10
|
2438 set by the plan).
|
Chris@10
|
2439
|
Chris@10
|
2440 * The input and output arrays are the same (in-place) or different
|
Chris@10
|
2441 (out-of-place) if the plan was originally created to be in-place or
|
Chris@10
|
2442 out-of-place, respectively.
|
Chris@10
|
2443
|
Chris@10
|
2444 * For split arrays, the separations between the real and imaginary
|
Chris@10
|
2445 parts, `ii-ri' and `io-ro', are the same as they were for the
|
Chris@10
|
2446 input and output arrays when the plan was created. (This
|
Chris@10
|
2447 condition is automatically satisfied for interleaved arrays.)
|
Chris@10
|
2448
|
Chris@10
|
2449 * The "alignment" of the new input/output arrays is the same as that
|
Chris@10
|
2450 of the input/output arrays when the plan was created, unless the
|
Chris@10
|
2451 plan was created with the `FFTW_UNALIGNED' flag. Here, the
|
Chris@10
|
2452 alignment is a platform-dependent quantity (for example, it is the
|
Chris@10
|
2453 address modulo 16 if SSE SIMD instructions are used, but the
|
Chris@10
|
2454 address modulo 4 for non-SIMD single-precision FFTW on the same
|
Chris@10
|
2455 machine). In general, only arrays allocated with `fftw_malloc'
|
Chris@10
|
2456 are guaranteed to be equally aligned (*note SIMD alignment and
|
Chris@10
|
2457 fftw_malloc::).
|
Chris@10
|
2458
|
Chris@10
|
2459
|
Chris@10
|
2460 The alignment issue is especially critical, because if you don't use
|
Chris@10
|
2461 `fftw_malloc' then you may have little control over the alignment of
|
Chris@10
|
2462 arrays in memory. For example, neither the C++ `new' operator nor the
|
Chris@10
|
2463 Fortran `allocate' statement provides strong enough guarantees about
|
Chris@10
|
2464 data alignment. If you don't use `fftw_malloc', therefore, you
|
Chris@10
|
2465 probably have to use `FFTW_UNALIGNED' (which disables most SIMD
|
Chris@10
|
2466 support). If possible, it is probably better for you to simply create
|
Chris@10
|
2467 multiple plans (creating a new plan is quick once one exists for a
|
Chris@10
|
2468 given size), or better yet re-use the same array for your transforms.
|
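Two of the conditions above, equal alignment and equal split separation, can be checked by the caller before reusing a plan. The sketch below is only a rough guard: the modulus of 16 is an assumption taken from the SSE example in the text, the real requirement is platform-dependent, and both function names are hypothetical rather than part of FFTW.

```c
#include <stdint.h>

/* Nonzero if a and b have the same address modulo 16.
   The value 16 is an assumption (SSE); other SIMD widths may differ. */
static int same_alignment_mod16(const void *a, const void *b)
{
    return ((uintptr_t) a % 16) == ((uintptr_t) b % 16);
}

/* Nonzero if the byte separation between the imaginary and real
   pointers matches the one used when the plan was created. */
static int same_split_separation(const double *ri_plan, const double *ii_plan,
                                 const double *ri_new, const double *ii_new)
{
    uintptr_t d_plan = (uintptr_t) ii_plan - (uintptr_t) ri_plan;
    uintptr_t d_new = (uintptr_t) ii_new - (uintptr_t) ri_new;
    return d_plan == d_new;
}
```

A check like this does not replace allocating with `fftw_malloc'; it only catches some mismatches early instead of leaving them to fail at execution time.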
Chris@10
|
2469
|
Chris@10
|
2470 If you are tempted to use the new-array execute interface because you
|
Chris@10
|
2471 want to transform a known bunch of arrays of the same size, you should
|
Chris@10
|
2472 probably use the advanced interface instead (*note Advanced
|
Chris@10
|
2473 Interface::).
|
Chris@10
|
2474
|
Chris@10
|
2475 The new-array execute functions are:
|
Chris@10
|
2476
|
Chris@10
|
2477 void fftw_execute_dft(
|
Chris@10
|
2478 const fftw_plan p,
|
Chris@10
|
2479 fftw_complex *in, fftw_complex *out);
|
Chris@10
|
2480
|
Chris@10
|
2481 void fftw_execute_split_dft(
|
Chris@10
|
2482 const fftw_plan p,
|
Chris@10
|
2483 double *ri, double *ii, double *ro, double *io);
|
Chris@10
|
2484
|
Chris@10
|
2485 void fftw_execute_dft_r2c(
|
Chris@10
|
2486 const fftw_plan p,
|
Chris@10
|
2487 double *in, fftw_complex *out);
|
Chris@10
|
2488
|
Chris@10
|
2489 void fftw_execute_split_dft_r2c(
|
Chris@10
|
2490 const fftw_plan p,
|
Chris@10
|
2491 double *in, double *ro, double *io);
|
Chris@10
|
2492
|
Chris@10
|
2493 void fftw_execute_dft_c2r(
|
Chris@10
|
2494 const fftw_plan p,
|
Chris@10
|
2495 fftw_complex *in, double *out);
|
Chris@10
|
2496
|
Chris@10
|
2497 void fftw_execute_split_dft_c2r(
|
Chris@10
|
2498 const fftw_plan p,
|
Chris@10
|
2499 double *ri, double *ii, double *out);
|
Chris@10
|
2500
|
Chris@10
|
2501 void fftw_execute_r2r(
|
Chris@10
|
2502 const fftw_plan p,
|
Chris@10
|
2503 double *in, double *out);
|
Chris@10
|
2504
|
Chris@10
|
2505 These execute the `plan' to compute the corresponding transform on
|
Chris@10
|
2506 the input/output arrays specified by the subsequent arguments. The
|
Chris@10
|
2507 input/output array arguments have the same meanings as the ones passed
|
Chris@10
|
2508 to the guru planner routines in the preceding sections. The `plan' is
|
Chris@10
|
2509 not modified, and these routines can be called as many times as
|
Chris@10
|
2510 desired, or intermixed with calls to the ordinary `fftw_execute'.
|
Chris@10
|
2511
|
Chris@10
|
2512 The `plan' _must_ have been created for the transform type
|
Chris@10
|
2513 corresponding to the execute function, e.g. it must be a complex-DFT
|
Chris@10
|
2514 plan for `fftw_execute_dft'. Any of the planner routines for that
|
Chris@10
|
2515 transform type, from the basic to the guru interface, could have been
|
Chris@10
|
2516 used to create the plan, however.
|
Chris@10
|
2517
|
Chris@10
|
2518
|
Chris@10
|
2519 File: fftw3.info, Node: Wisdom, Next: What FFTW Really Computes, Prev: New-array Execute Functions, Up: FFTW Reference
|
Chris@10
|
2520
|
Chris@10
|
2521 4.7 Wisdom
|
Chris@10
|
2522 ==========
|
Chris@10
|
2523
|
Chris@10
|
2524 This section documents the FFTW mechanism for saving and restoring
|
Chris@10
|
2525 plans from disk. This mechanism is called "wisdom".
|
Chris@10
|
2526
|
Chris@10
|
2527 * Menu:
|
Chris@10
|
2528
|
Chris@10
|
2529 * Wisdom Export::
|
Chris@10
|
2530 * Wisdom Import::
|
Chris@10
|
2531 * Forgetting Wisdom::
|
Chris@10
|
2532 * Wisdom Utilities::
|
Chris@10
|
2533
|
Chris@10
|
2534
|
Chris@10
|
2535 File: fftw3.info, Node: Wisdom Export, Next: Wisdom Import, Prev: Wisdom, Up: Wisdom
|
Chris@10
|
2536
|
Chris@10
|
2537 4.7.1 Wisdom Export
|
Chris@10
|
2538 -------------------
|
Chris@10
|
2539
|
Chris@10
|
2540 int fftw_export_wisdom_to_filename(const char *filename);
|
Chris@10
|
2541 void fftw_export_wisdom_to_file(FILE *output_file);
|
Chris@10
|
2542 char *fftw_export_wisdom_to_string(void);
|
Chris@10
|
2543 void fftw_export_wisdom(void (*write_char)(char c, void *), void *data);
|
Chris@10
|
2544
|
Chris@10
|
2545 These functions allow you to export all currently accumulated wisdom
|
Chris@10
|
2546 in a form from which it can be later imported and restored, even during
|
Chris@10
|
2547 a separate run of the program. (*Note Words of Wisdom-Saving Plans::.)
|
Chris@10
|
2548 The current store of wisdom is not affected by calling any of these
|
Chris@10
|
2549 routines.
|
Chris@10
|
2550
|
Chris@10
|
2551 `fftw_export_wisdom' exports the wisdom to any output medium, as
|
Chris@10
|
2552 specified by the callback function `write_char'. `write_char' is a
|
Chris@10
|
2553 `putc'-like function that writes the character `c' to some output; its
|
Chris@10
|
2554 second parameter is the `data' pointer passed to `fftw_export_wisdom'.
|
Chris@10
|
2555 For convenience, the following three "wrapper" routines are provided:
|
Chris@10
|
2556
|
Chris@10
|
2557 `fftw_export_wisdom_to_filename' writes wisdom to a file named
|
Chris@10
|
2558 `filename' (which is created or overwritten), returning `1' on success
|
Chris@10
|
2559 and `0' on failure. A lower-level function, which requires you to open
|
Chris@10
|
2560 and close the file yourself (e.g. if you want to write wisdom to a
|
Chris@10
|
2561 portion of a larger file) is `fftw_export_wisdom_to_file'. This writes
|
Chris@10
|
2562 the wisdom to the current position in `output_file', which should be
|
Chris@10
|
2563 open with write permission; upon exit, the file remains open and is
|
Chris@10
|
2564 positioned at the end of the wisdom data.
|
Chris@10
|
2565
|
Chris@10
|
2566 `fftw_export_wisdom_to_string' returns a pointer to a
|
Chris@10
|
2567 null-terminated string holding the wisdom data. This string is
|
Chris@10
|
2568 dynamically allocated, and it is the responsibility of the caller to
|
Chris@10
|
2569 deallocate it with `free' when it is no longer needed.
|
Chris@10
|
2570
|
Chris@10
|
2571 All of these routines export the wisdom in the same format, which we
|
Chris@10
|
2572 will not document here except to say that it is LISP-like ASCII text
|
Chris@10
|
2573 that is insensitive to white space.
|
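A `write_char' callback for `fftw_export_wisdom' might look like the following sketch, which appends each character to a fixed-size memory buffer. Only the callback signature, void (*)(char, void *), comes from the prototype above. The `char_sink' struct, `sink_write_char', and the stand-in driver `drive_write_char' are hypothetical; the driver exists so the example can run without linking FFTW.

```c
#include <stddef.h>

/* Destination buffer passed through the callback's void *data argument. */
typedef struct {
    char *buf;     /* storage owned by the caller */
    size_t cap;    /* total capacity, including the terminating '\0' */
    size_t len;    /* characters stored so far */
    int overflow;  /* set when a character had to be dropped */
} char_sink;

/* putc-like callback: store c, keeping the buffer null-terminated. */
static void sink_write_char(char c, void *data)
{
    char_sink *s = (char_sink *) data;
    if (s->len + 1 < s->cap) {
        s->buf[s->len++] = c;
        s->buf[s->len] = '\0';
    } else {
        s->overflow = 1;
    }
}

/* Stand-in driver (not FFTW): feeds text to the callback one character
   at a time, the way an exporter would call it. */
static void drive_write_char(const char *text,
                             void (*write_char)(char c, void *), void *data)
{
    for (; *text; ++text)
        write_char(*text, data);
}
```

With FFTW linked, the same callback and a pointer to a `char_sink' would be passed as the two arguments of `fftw_export_wisdom'.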
Chris@10
|
2574
|
Chris@10
|
2575
|
Chris@10
|
2576 File: fftw3.info, Node: Wisdom Import, Next: Forgetting Wisdom, Prev: Wisdom Export, Up: Wisdom
|
Chris@10
|
2577
|
Chris@10
|
2578 4.7.2 Wisdom Import
|
Chris@10
|
2579 -------------------
|
Chris@10
|
2580
|
Chris@10
|
2581 int fftw_import_system_wisdom(void);
|
Chris@10
|
2582 int fftw_import_wisdom_from_filename(const char *filename);
|
Chris@10
|
2583 int fftw_import_wisdom_from_string(const char *input_string);
|
Chris@10
|
2584 int fftw_import_wisdom(int (*read_char)(void *), void *data);
|
Chris@10
|
2585
|
Chris@10
|
2586 These functions import wisdom into a program from data stored by the
|
Chris@10
|
2587 `fftw_export_wisdom' functions above. (*Note Words of Wisdom-Saving
|
Chris@10
|
2588 Plans::.) The imported wisdom replaces any wisdom already accumulated
|
Chris@10
|
2589 by the running program.
|
Chris@10
|
2590
|
Chris@10
|
2591 `fftw_import_wisdom' imports wisdom from any input medium, as
|
Chris@10
|
2592 specified by the callback function `read_char'. `read_char' is a
|
Chris@10
|
2593 `getc'-like function that returns the next character in the input; its
|
Chris@10
|
2594 parameter is the `data' pointer passed to `fftw_import_wisdom'. If the
|
Chris@10
|
2595 end of the input data is reached (which should never happen for valid
|
Chris@10
|
2596 data), `read_char' should return `EOF' (as defined in `<stdio.h>').
|
Chris@10
|
2597 For convenience, the following three "wrapper" routines are provided:
|
Chris@10
|
2598
|
Chris@10
|
2599 `fftw_import_wisdom_from_filename' reads wisdom from a file named
|
Chris@10
|
2600 `filename'. A lower-level function, which requires you to open and
|
Chris@10
|
2601 close the file yourself (e.g. if you want to read wisdom from a portion
|
Chris@10
|
2602 of a larger file) is `fftw_import_wisdom_from_file'. This reads wisdom
|
Chris@10
|
2603 from the current position in `input_file' (which should be open with
|
Chris@10
|
2604 read permission); upon exit, the file remains open, but the position of
|
Chris@10
|
2605 the read pointer is unspecified.
|
Chris@10
|
2606
|
Chris@10
|
2607 `fftw_import_wisdom_from_string' reads wisdom from the
|
Chris@10
|
2608 null-terminated string `input_string'.
|
Chris@10
|
2609
|
Chris@10
|
2610 `fftw_import_system_wisdom' reads wisdom from an
|
Chris@10
|
2611 implementation-defined standard file (`/etc/fftw/wisdom' on Unix and
|
Chris@10
|
2612 GNU systems).
|
Chris@10
|
2613
|
Chris@10
|
2614 The return value of these import routines is `1' if the wisdom was
|
Chris@10
|
2615 read successfully and `0' otherwise. Note that, in all of these
|
Chris@10
|
2616 functions, any data in the input stream past the end of the wisdom data
|
Chris@10
|
2617 is simply ignored.
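As an illustration of the `read_char' callback described above, here is a minimal sketch of a `getc'-like reader over an in-memory buffer. The names `mem_reader' and `mem_read_char' are hypothetical and not part of FFTW. Under these assumptions the callback would be passed as `fftw_import_wisdom(mem_read_char, &reader)'; that call is left out so the fragment compiles without FFTW.

```c
#include <stdio.h>   /* EOF */
#include <stddef.h>  /* size_t */

/* Hypothetical reader state (not an FFTW type): a byte buffer and a cursor. */
struct mem_reader {
    const char *buf;  /* start of the stored wisdom data */
    size_t len;       /* number of valid bytes in buf */
    size_t pos;       /* index of the next byte to return */
};

/* getc-like callback: returns the next byte as a non-negative int, or EOF
   once the buffer is exhausted.  `data' is the pointer that would be given
   to fftw_import_wisdom as its second argument. */
int mem_read_char(void *data)
{
    struct mem_reader *r = data;
    if (r->pos >= r->len)
        return EOF;
    return (unsigned char) r->buf[r->pos++];
}
```

Returning the byte through `unsigned char' keeps it distinct from `EOF', the same convention `getc' follows.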
|
Chris@10
|
2618
|
Chris@10
|
2619
|
Chris@10
|
2620 File: fftw3.info, Node: Forgetting Wisdom, Next: Wisdom Utilities, Prev: Wisdom Import, Up: Wisdom
|
Chris@10
|
2621
|
Chris@10
|
2622 4.7.3 Forgetting Wisdom
|
Chris@10
|
2623 -----------------------
|
Chris@10
|
2624
|
Chris@10
|
2625 void fftw_forget_wisdom(void);
|
Chris@10
|
2626
|
Chris@10
|
2627 Calling `fftw_forget_wisdom' causes all accumulated `wisdom' to be
|
Chris@10
|
2628 discarded and its associated memory to be freed. (New `wisdom' can
|
Chris@10
|
2629 still be gathered subsequently, however.)
|
Chris@10
|
2630
|
Chris@10
|
2631
|
Chris@10
|
2632 File: fftw3.info, Node: Wisdom Utilities, Prev: Forgetting Wisdom, Up: Wisdom
|
Chris@10
|
2633
|
Chris@10
|
2634 4.7.4 Wisdom Utilities
|
Chris@10
|
2635 ----------------------
|
Chris@10
|
2636
|
Chris@10
|
2637 FFTW includes two standalone utility programs that deal with wisdom. We
|
Chris@10
|
2638 merely summarize them here, since they come with their own `man' pages
|
Chris@10
|
2639 for Unix and GNU systems (with HTML versions on our web site).
|
Chris@10
|
2640
|
Chris@10
|
2641 The first program is `fftw-wisdom' (or `fftwf-wisdom' in single
|
Chris@10
|
2642 precision, etcetera), which can be used to create a wisdom file
|
Chris@10
|
2643 containing plans for any of the transform sizes and types supported by
|
Chris@10
|
2644 FFTW. It is preferable to create wisdom directly from your executable
|
Chris@10
|
2645 (*note Caveats in Using Wisdom::), but this program is useful for
|
Chris@10
|
2646 creating global wisdom files for `fftw_import_system_wisdom'.
|
Chris@10
|
2647
|
Chris@10
|
2648 The second program is `fftw-wisdom-to-conf', which takes a wisdom
|
Chris@10
|
2649 file as input and produces a "configuration routine" as output. The
|
Chris@10
|
2650 latter is a C subroutine that you can compile and link into your
|
Chris@10
|
2651 program, replacing a routine of the same name in the FFTW library, that
|
Chris@10
|
2652 determines which parts of FFTW are callable by your program.
|
Chris@10
|
2653 `fftw-wisdom-to-conf' produces a configuration routine that links to
|
Chris@10
|
2654 only those parts of FFTW needed by the saved plans in the wisdom,
|
Chris@10
|
2655 greatly reducing the size of statically linked executables (which should
|
Chris@10
|
2656 only attempt to create plans corresponding to those in the wisdom,
|
Chris@10
|
2657 however).
|
Chris@10
|
2658
|
Chris@10
|
2659
|
Chris@10
|
2660 File: fftw3.info, Node: What FFTW Really Computes, Prev: Wisdom, Up: FFTW Reference
|
Chris@10
|
2661
|
Chris@10
|
2662 4.8 What FFTW Really Computes
|
Chris@10
|
2663 =============================
|
Chris@10
|
2664
|
Chris@10
|
2665 In this section, we provide precise mathematical definitions for the
|
Chris@10
|
2666 transforms that FFTW computes. These transform definitions are fairly
|
Chris@10
|
2667 standard, but some authors follow slightly different conventions for the
|
Chris@10
|
2668 normalization of the transform (the constant factor in front) and the
|
Chris@10
|
2669 sign of the complex exponent. We begin by presenting the
|
Chris@10
|
2670 one-dimensional (1d) transform definitions, and then give the
|
Chris@10
|
2671 straightforward extension to multi-dimensional transforms.
|
Chris@10
|
2672
|
Chris@10
|
2673 * Menu:
|
Chris@10
|
2674
|
Chris@10
|
2675 * The 1d Discrete Fourier Transform (DFT)::
|
Chris@10
|
2676 * The 1d Real-data DFT::
|
Chris@10
|
2677 * 1d Real-even DFTs (DCTs)::
|
Chris@10
|
2678 * 1d Real-odd DFTs (DSTs)::
|
Chris@10
|
2679 * 1d Discrete Hartley Transforms (DHTs)::
|
Chris@10
|
2680 * Multi-dimensional Transforms::
|
Chris@10
|
2681
|
Chris@10
|
2682
|
Chris@10
|
2683 File: fftw3.info, Node: The 1d Discrete Fourier Transform (DFT), Next: The 1d Real-data DFT, Prev: What FFTW Really Computes, Up: What FFTW Really Computes
|
Chris@10
|
2684
|
Chris@10
|
2685 4.8.1 The 1d Discrete Fourier Transform (DFT)
|
Chris@10
|
2686 ---------------------------------------------
|
Chris@10
|
2687
|
Chris@10
|
2688 The forward (`FFTW_FORWARD') discrete Fourier transform (DFT) of a 1d
|
Chris@10
|
2689 complex array X of size n computes an array Y, where: Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(-2 pi j k sqrt(-1)/n) .
|
Chris@10
|
2690 The backward (`FFTW_BACKWARD') DFT computes: Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(2 pi j k sqrt(-1)/n) .
|
Chris@10
|
2691 FFTW computes an unnormalized transform, in that there is no
|
Chris@10
|
2692 coefficient in front of the summation in the DFT. In other words,
|
Chris@10
|
2693 applying the forward and then the backward transform will multiply the
|
Chris@10
|
2694 input by n.
|
Chris@10
|
2695
|
Chris@10
|
2696 From above, an `FFTW_FORWARD' transform corresponds to a sign of -1
|
Chris@10
|
2697 in the exponent of the DFT. Note also that we use the standard
|
Chris@10
|
2698 "in-order" output ordering--the k-th output corresponds to the
|
Chris@10
|
2699 frequency k/n (or k/T, where T is your total sampling period). For
|
Chris@10
|
2700 those who like to think in terms of positive and negative frequencies,
|
Chris@10
|
2701 this means that the positive frequencies are stored in the first half
|
Chris@10
|
2702 of the output and the negative frequencies are stored in backwards
|
Chris@10
|
2703 order in the second half of the output. (The frequency -k/n is the
|
Chris@10
|
2704 same as the frequency (n-k)/n.)
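For readers who prefer signed frequencies, the in-order layout can be translated with a small helper. This is a sketch with a hypothetical name (`signed_freq_index'); it assumes the convention that, for even n, the bin k = n/2 is reported as positive, although it represents the frequencies +1/2 and -1/2 equally.

```c
/* Map an output index k (0 <= k < n) to a signed frequency index f such
   that the k-th output corresponds to the frequency f/n.
   Assumed convention: for even n, k = n/2 is returned as +n/2. */
int signed_freq_index(int k, int n)
{
    return (k <= n / 2) ? k : k - n;
}
```

For n = 8, indices 0 through 4 map to themselves, while indices 5, 6, and 7 map to -3, -2, and -1.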
|
Chris@10
|
2705
|
Chris@10
|
2706
|
Chris@10
|
2707 File: fftw3.info, Node: The 1d Real-data DFT, Next: 1d Real-even DFTs (DCTs), Prev: The 1d Discrete Fourier Transform (DFT), Up: What FFTW Really Computes
|
Chris@10
|
2708
|
Chris@10
|
2709 4.8.2 The 1d Real-data DFT
|
Chris@10
|
2710 --------------------------
|
Chris@10
|
2711
|
Chris@10
|
2712 The real-input (r2c) DFT in FFTW computes the _forward_ transform Y of
|
Chris@10
|
2713 the size `n' real array X, exactly as defined above, i.e. Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(-2 pi j k sqrt(-1)/n) .
|
Chris@10
|
2714 This output array Y can easily be shown to possess the "Hermitian"
|
Chris@10
|
2715 symmetry Y[k] = Y[n-k]*, where we take Y to be periodic so that Y[n] =
|
Chris@10
|
2716 Y[0].
|
Chris@10
|
2717
|
Chris@10
|
2718 As a result of this symmetry, half of the output Y is redundant
|
Chris@10
|
2719 (being the complex conjugate of the other half), and so the 1d r2c
|
Chris@10
|
2720 transforms only output elements 0...n/2 of Y (n/2+1 complex numbers),
|
Chris@10
|
2721 where the division by 2 is rounded down.
|
Chris@10
|
2722
|
Chris@10
|
2723 Moreover, the Hermitian symmetry implies that Y[0] and, if n is
|
Chris@10
|
2724 even, the Y[n/2] element, are purely real. So, for the `R2HC' r2r
|
Chris@10
|
2725 transform, the (zero) imaginary parts of these elements are not stored in the halfcomplex output
|
Chris@10
|
2726 format.
|
Chris@10
|
2727
|
Chris@10
|
2728 The c2r and `HC2R' r2r transforms compute the backward DFT of the
|
Chris@10
|
2729 _complex_ array X with Hermitian symmetry, stored in the r2c/`R2HC'
|
Chris@10
|
2730 output formats, respectively, where the backward transform is defined
|
Chris@10
|
2731 exactly as for the complex case: Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(2 pi j k sqrt(-1)/n) .
|
Chris@10
|
2732 The outputs `Y' of this transform can easily be seen to be purely
|
Chris@10
|
2733 real, and are stored as an array of real numbers.
|
Chris@10
|
2734
|
Chris@10
|
2735 Like FFTW's complex DFT, these transforms are unnormalized. In other
|
Chris@10
|
2736 words, applying the real-to-complex (forward) and then the
|
Chris@10
|
2737 complex-to-real (backward) transform will multiply the input by n.
|
Chris@10
|
2738
|
Chris@10
|
2739
|
Chris@10
|
2740 File: fftw3.info, Node: 1d Real-even DFTs (DCTs), Next: 1d Real-odd DFTs (DSTs), Prev: The 1d Real-data DFT, Up: What FFTW Really Computes
|
Chris@10
|
2741
|
Chris@10
|
2742 4.8.3 1d Real-even DFTs (DCTs)
|
Chris@10
|
2743 ------------------------------
|
Chris@10
|
2744
|
Chris@10
|
2745 The Real-even symmetry DFTs in FFTW are exactly equivalent to the
|
Chris@10
|
2746 unnormalized forward (and backward) DFTs as defined above, where the
|
Chris@10
|
2747 input array X of length N is purely real and also has "even" symmetry.
|
Chris@10
|
2748 In this case, the output array is likewise real with even symmetry.
|
Chris@10
|
2749
|
Chris@10
|
2750 For the case of `REDFT00', this even symmetry means that X[j] =
|
Chris@10
|
2751 X[N-j], where we take X to be periodic so that X[N] = X[0]. Because of
|
Chris@10
|
2752 this redundancy, only the first n real numbers are actually stored,
|
Chris@10
|
2753 where N = 2(n-1).
|
Chris@10
|
2754
|
Chris@10
|
2755 The proper definition of even symmetry for `REDFT10', `REDFT01', and
|
Chris@10
|
2756 `REDFT11' transforms is somewhat more intricate because of the shifts
|
Chris@10
|
2757 by 1/2 of the input and/or output, although the corresponding boundary
|
Chris@10
|
2758 conditions are given in *note Real even/odd DFTs (cosine/sine
|
Chris@10
|
2759 transforms)::. Because of the even symmetry, however, the sine terms
|
Chris@10
|
2760 in the DFT all cancel and the remaining cosine terms are written
|
Chris@10
|
2761 explicitly below. This formulation often leads people to call such a
|
Chris@10
|
2762 transform a "discrete cosine transform" (DCT), although it is really
|
Chris@10
|
2763 just a special case of the DFT.
|
Chris@10
|
2764
|
Chris@10
|
2765 In each of the definitions below, we transform a real array X of
|
Chris@10
|
2766 length n to a real array Y of length n:
|
Chris@10
|
2767
|
Chris@10
|
2768 REDFT00 (DCT-I)
|
Chris@10
|
2769 ...............
|
Chris@10
|
2770
|
Chris@10
|
2771 An `REDFT00' transform (type-I DCT) in FFTW is defined by: Y[k] = X[0]
|
Chris@10
|
2772 + (-1)^k X[n-1] + 2 (sum for j = 1 to n-2 of X[j] cos(pi jk /(n-1))).
|
Chris@10
|
2773 Note that this transform is not defined for n=1. For n=2, the
|
Chris@10
|
2774 summation term above is dropped as you might expect.
|
Chris@10
|
2775
|
Chris@10
|
2776 REDFT10 (DCT-II)
|
Chris@10
|
2777 ................
|
Chris@10
|
2778
|
Chris@10
|
2779 An `REDFT10' transform (type-II DCT, sometimes called "the" DCT) in
|
Chris@10
|
2780 FFTW is defined by: Y[k] = 2 (sum for j = 0 to n-1 of X[j] cos(pi
|
Chris@10
|
2781 (j+1/2) k / n)).
|
Chris@10
|
2782
|
Chris@10
|
2783 REDFT01 (DCT-III)
|
Chris@10
|
2784 .................
|
Chris@10
|
2785
|
Chris@10
|
2786 An `REDFT01' transform (type-III DCT) in FFTW is defined by: Y[k] =
|
Chris@10
|
2787 X[0] + 2 (sum for j = 1 to n-1 of X[j] cos(pi j (k+1/2) / n)). In the
|
Chris@10
|
2788 case of n=1, this reduces to Y[0] = X[0]. Up to a scale factor (see
|
Chris@10
|
2789 below), this is the inverse of `REDFT10' ("the" DCT), and so the
|
Chris@10
|
2790 `REDFT01' (DCT-III) is sometimes called the "IDCT".
|
Chris@10
|
2791
|
Chris@10
|
2792 REDFT11 (DCT-IV)
|
Chris@10
|
2793 ................
|
Chris@10
|
2794
|
Chris@10
|
2795 An `REDFT11' transform (type-IV DCT) in FFTW is defined by: Y[k] = 2
|
Chris@10
|
2796 (sum for j = 0 to n-1 of X[j] cos(pi (j+1/2) (k+1/2) / n)).
|
Chris@10
|
2797
|
Chris@10
|
2798 Inverses and Normalization
|
Chris@10
|
2799 ..........................
|
Chris@10
|
2800
|
Chris@10
|
2801 These definitions correspond directly to the unnormalized DFTs used
|
Chris@10
|
2802 elsewhere in FFTW (hence the factors of 2 in front of the summations).
|
Chris@10
|
2803 The unnormalized inverse of `REDFT00' is `REDFT00', of `REDFT10' is
|
Chris@10
|
2804 `REDFT01' and vice versa, and of `REDFT11' is `REDFT11'. Each
|
Chris@10
|
2805 unnormalized inverse results in the original array multiplied by N,
|
Chris@10
|
2806 where N is the _logical_ DFT size. For `REDFT00', N=2(n-1) (note that
|
Chris@10
|
2807 n=1 is not defined); otherwise, N=2n.
|
Chris@10
|
2808
|
Chris@10
|
2809 In defining the discrete cosine transform, some authors also include
|
Chris@10
|
2810 additional factors of sqrt(2) (or its inverse) multiplying selected
|
Chris@10
|
2811 inputs and/or outputs. This is a mostly cosmetic change that makes the
|
Chris@10
|
2812 transform orthogonal, but sacrifices the direct equivalence to a
|
Chris@10
|
2813 symmetric DFT.
|
Chris@10
|
2814
|
Chris@10
|
2815
|
Chris@10
|
2816 File: fftw3.info, Node: 1d Real-odd DFTs (DSTs), Next: 1d Discrete Hartley Transforms (DHTs), Prev: 1d Real-even DFTs (DCTs), Up: What FFTW Really Computes
|
Chris@10
|
2817
|
Chris@10
|
2818 4.8.4 1d Real-odd DFTs (DSTs)
|
Chris@10
|
2819 -----------------------------
|
Chris@10
|
2820
|
Chris@10
|
2821 The Real-odd symmetry DFTs in FFTW are exactly equivalent to the
|
Chris@10
|
2822 unnormalized forward (and backward) DFTs as defined above, where the
|
Chris@10
|
2823 input array X of length N is purely real and also has "odd" symmetry. In
|
Chris@10
|
2824 this case, the output has odd symmetry and is purely imaginary.
|
Chris@10
|
2825
|
Chris@10
|
2826 For the case of `RODFT00', this odd symmetry means that X[j] =
|
Chris@10
|
2827 -X[N-j], where we take X to be periodic so that X[N] = X[0]. Because
|
Chris@10
|
2828 of this redundancy, only the first n real numbers starting at j=1 are
|
Chris@10
|
2829 actually stored (the j=0 element is zero), where N = 2(n+1).
|
Chris@10
|
2830
|
Chris@10
|
2831 The proper definition of odd symmetry for `RODFT10', `RODFT01', and
|
Chris@10
|
2832 `RODFT11' transforms is somewhat more intricate because of the shifts
|
Chris@10
|
2833 by 1/2 of the input and/or output, although the corresponding boundary
|
Chris@10
|
2834 conditions are given in *note Real even/odd DFTs (cosine/sine
|
Chris@10
|
2835 transforms)::. Because of the odd symmetry, however, the cosine terms
|
Chris@10
|
2836 in the DFT all cancel and the remaining sine terms are written
|
Chris@10
|
2837 explicitly below. This formulation often leads people to call such a
|
Chris@10
|
2838 transform a "discrete sine transform" (DST), although it is really just
|
Chris@10
|
2839 a special case of the DFT.
|
Chris@10
|
2840
|
Chris@10
|
2841 In each of the definitions below, we transform a real array X of
|
Chris@10
|
2842 length n to a real array Y of length n:
|
Chris@10
|
2843
|
Chris@10
|
2844 RODFT00 (DST-I)
|
Chris@10
|
2845 ...............
|
Chris@10
|
2846
|
Chris@10
|
2847 An `RODFT00' transform (type-I DST) in FFTW is defined by: Y[k] = 2
|
Chris@10
|
2848 (sum for j = 0 to n-1 of X[j] sin(pi (j+1)(k+1) / (n+1))).
|
Chris@10
|
2849
|
Chris@10
|
2850 RODFT10 (DST-II)
|
Chris@10
|
2851 ................
|
Chris@10
|
2852
|
Chris@10
|
2853 An `RODFT10' transform (type-II DST) in FFTW is defined by: Y[k] = 2
|
Chris@10
|
2854 (sum for j = 0 to n-1 of X[j] sin(pi (j+1/2) (k+1) / n)).
|
Chris@10
|
2855
|
Chris@10
|
2856 RODFT01 (DST-III)
|
Chris@10
|
2857 .................
|
Chris@10
|
2858
|
Chris@10
|
2859 An `RODFT01' transform (type-III DST) in FFTW is defined by: Y[k] =
|
Chris@10
|
2860 (-1)^k X[n-1] + 2 (sum for j = 0 to n-2 of X[j] sin(pi (j+1) (k+1/2) /
|
Chris@10
|
2861 n)). In the case of n=1, this reduces to Y[0] = X[0].
|
Chris@10
|
2862
|
Chris@10
|
2863 RODFT11 (DST-IV)
|
Chris@10
|
2864 ................
|
Chris@10
|
2865
|
Chris@10
|
2866 An `RODFT11' transform (type-IV DST) in FFTW is defined by: Y[k] = 2
|
Chris@10
|
2867 (sum for j = 0 to n-1 of X[j] sin(pi (j+1/2) (k+1/2) / n)).
|
Chris@10
|
2868
|
Chris@10
|
2869 Inverses and Normalization
|
Chris@10
|
2870 ..........................
|
Chris@10
|
2871
|
Chris@10
|
2872 These definitions correspond directly to the unnormalized DFTs used
|
Chris@10
|
2873 elsewhere in FFTW (hence the factors of 2 in front of the summations).
|
Chris@10
|
2874 The unnormalized inverse of `RODFT00' is `RODFT00', of `RODFT10' is
|
Chris@10
|
2875 `RODFT01' and vice versa, and of `RODFT11' is `RODFT11'. Each
|
Chris@10
|
2876 unnormalized inverse results in the original array multiplied by N,
|
Chris@10
|
2877 where N is the _logical_ DFT size. For `RODFT00', N=2(n+1); otherwise,
|
Chris@10
|
2878 N=2n.
|
Chris@10
|
2879
|
Chris@10
|
2880 In defining the discrete sine transform, some authors also include
|
Chris@10
|
2881 additional factors of sqrt(2) (or its inverse) multiplying selected
|
Chris@10
|
2882 inputs and/or outputs. This is a mostly cosmetic change that makes the
|
Chris@10
|
2883 transform orthogonal, but sacrifices the direct equivalence to an
|
Chris@10
|
2884 antisymmetric DFT.
|
Chris@10
|
2885
|
Chris@10
|
2886
|
Chris@10
|
2887 File: fftw3.info, Node: 1d Discrete Hartley Transforms (DHTs), Next: Multi-dimensional Transforms, Prev: 1d Real-odd DFTs (DSTs), Up: What FFTW Really Computes
|
Chris@10
|
2888
|
Chris@10
|
2889 4.8.5 1d Discrete Hartley Transforms (DHTs)
|
Chris@10
|
2890 -------------------------------------------
|
Chris@10
|
2891
|
Chris@10
|
2892 The discrete Hartley transform (DHT) of a 1d real array X of size n
|
Chris@10
|
2893 computes a real array Y of the same size, where: Y[k] = sum for j = 0 to (n - 1) of X[j] * [cos(2 pi j k / n) + sin(2 pi j k / n)].
|
Chris@10
|
2894 FFTW computes an unnormalized transform, in that there is no
|
Chris@10
|
2895 coefficient in front of the summation in the DHT. In other words,
|
Chris@10
|
2896 applying the transform twice (the DHT is its own inverse) will multiply
|
Chris@10
|
2897 the input by n.
|
Chris@10
|
2898
|
Chris@10
|
2899
|
Chris@10
|
2900 File: fftw3.info, Node: Multi-dimensional Transforms, Prev: 1d Discrete Hartley Transforms (DHTs), Up: What FFTW Really Computes
|
Chris@10
|
2901
|
Chris@10
|
2902 4.8.6 Multi-dimensional Transforms
|
Chris@10
|
2903 ----------------------------------
|
Chris@10
|
2904
|
Chris@10
|
2905 The multi-dimensional transforms of FFTW, in general, compute simply the
|
Chris@10
|
2906 separable product of the given 1d transform along each dimension of the
|
Chris@10
|
2907 array. Since each of these transforms is unnormalized, computing the
|
Chris@10
|
2908 forward followed by the backward/inverse multi-dimensional transform
|
Chris@10
|
2909 will result in the original array scaled by the product of the
|
Chris@10
|
2910 normalization factors for each dimension (e.g. the product of the
|
Chris@10
|
2911 dimension sizes, for a multi-dimensional DFT).
|
Chris@10
|
2912
|
Chris@10
|
2913 The definition of FFTW's multi-dimensional DFT of real data (r2c)
|
Chris@10
|
2914 deserves special attention. In this case, we logically compute the full
|
Chris@10
|
2915 multi-dimensional DFT of the input data; since the input data are purely
|
Chris@10
|
2916 real, the output data have the Hermitian symmetry and therefore only one
|
Chris@10
|
2917 non-redundant half need be stored. More specifically, for an n[0] x
|
Chris@10
|
2918 n[1] x n[2] x ... x n[d-1] multi-dimensional real-input DFT, the full
|
Chris@10
|
2919 (logical) complex output array Y[k[0], k[1], ..., k[d-1]] has the
|
Chris@10
|
2920 symmetry: Y[k[0], k[1], ..., k[d-1]] = Y[n[0] - k[0], n[1] - k[1], ...,
|
Chris@10
|
2921 n[d-1] - k[d-1]]* (where each dimension is periodic). Because of this
|
Chris@10
|
2922 symmetry, we only store the k[d-1] = 0...n[d-1]/2 elements of the
|
Chris@10
|
2923 _last_ dimension (division by 2 is rounded down). (We could instead
|
Chris@10
|
2924 have cut any other dimension in half, but the last dimension proved
|
Chris@10
|
2925 computationally convenient.) This results in the peculiar array format
|
Chris@10
|
2926 described in more detail by *note Real-data DFT Array Format::.
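The sizes discussed in the two paragraphs above reduce to simple integer arithmetic. The helpers below are a sketch with hypothetical names; they assume rank >= 1 and positive dimension sizes.

```c
#include <stddef.h>

/* Number of complex outputs stored by an r2c transform of dimensions
   n[0] x ... x n[rank-1]: the last dimension is cut to n[rank-1]/2 + 1
   (integer division, rounded down); the others are kept in full. */
size_t r2c_output_count(int rank, const int *n)
{
    size_t count = 1;
    for (int i = 0; i < rank - 1; ++i)
        count *= (size_t) n[i];
    return count * (size_t) (n[rank - 1] / 2 + 1);
}

/* Factor by which a forward transform followed by a backward transform
   scales the data for a multi-dimensional DFT: the product of all
   dimension sizes. */
size_t dft_roundtrip_scale(int rank, const int *n)
{
    size_t scale = 1;
    for (int i = 0; i < rank; ++i)
        scale *= (size_t) n[i];
    return scale;
}
```

For a 4 x 6 real input, for example, the stored output has 4 * (6/2 + 1) = 16 complex elements, and a forward/backward round trip scales the data by 24.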
|
Chris@10
|
2927
|
Chris@10
|
2928 The multi-dimensional c2r transform is simply the unnormalized
|
Chris@10
|
2929 inverse of the r2c transform; i.e., it is the same as FFTW's complex
|
Chris@10
|
2930 backward multi-dimensional DFT, operating on a Hermitian input array in
|
Chris@10
|
2931 the peculiar format mentioned above and outputting a real array (since
|
Chris@10
|
2932 the DFT output is purely real).
|
Chris@10
|
2933
|
Chris@10
|
2934 We should remind the user that the separable product of 1d transforms
|
Chris@10
|
2935 along each dimension, as computed by FFTW, is not always the same thing
|
Chris@10
|
2936 as the usual multi-dimensional transform. A multi-dimensional `R2HC'
|
Chris@10
|
2937 (or `HC2R') transform is not identical to the multi-dimensional DFT,
|
Chris@10
|
2938 requiring some post-processing to combine the requisite real and
|
Chris@10
|
2939 imaginary parts, as was described in *note The Halfcomplex-format
|
Chris@10
|
2940 DFT::. Likewise, FFTW's multidimensional `FFTW_DHT' r2r transform is
|
Chris@10
|
2941 not the same thing as the logical multi-dimensional discrete Hartley
|
Chris@10
|
2942 transform defined in the literature, as discussed in *note The Discrete
|
Chris@10
|
2943 Hartley Transform::.
|
Chris@10
|
2944
|
Chris@10
|
2945
|
Chris@10
|
2946 File: fftw3.info, Node: Multi-threaded FFTW, Next: Distributed-memory FFTW with MPI, Prev: FFTW Reference, Up: Top
|
Chris@10
|
2947
|
Chris@10
|
2948 5 Multi-threaded FFTW
|
Chris@10
|
2949 *********************
|
Chris@10
|
2950
|
Chris@10
|
2951 In this chapter we document the parallel FFTW routines for
|
Chris@10
|
2952 shared-memory parallel hardware. These routines, which support
|
Chris@10
|
2953 parallel one- and multi-dimensional transforms of both real and complex
|
Chris@10
|
2954 data, are the easiest way to take advantage of multiple processors with
|
Chris@10
|
2955 FFTW. They work just like the corresponding uniprocessor transform
|
Chris@10
|
2956 routines, except that you have an extra initialization routine to call,
|
Chris@10
|
2957 and there is a routine to set the number of threads to employ. Any
|
Chris@10
|
2958 program that uses the uniprocessor FFTW can therefore be trivially
|
Chris@10
|
2959 modified to use the multi-threaded FFTW.
|
Chris@10
|
2960
|
Chris@10
|
2961 A shared-memory machine is one in which all CPUs can directly access
|
Chris@10
|
2962 the same main memory, and such machines are now common due to the
|
Chris@10
|
2963 ubiquity of multi-core CPUs. FFTW's multi-threading support allows you
|
Chris@10
|
2964 to utilize these additional CPUs transparently from a single program.
|
Chris@10
|
2965 However, this does not necessarily translate into performance
|
Chris@10
|
2966 gains--when multiple threads/CPUs are employed, there is an overhead
|
Chris@10
|
2967 required for synchronization that may outweigh the computational
|
Chris@10
|
2968 parallelism. Therefore, you can only benefit from threads if your
|
Chris@10
|
2969 problem is sufficiently large.
|
Chris@10
|
2970
|
Chris@10
|
2971 * Menu:
|
Chris@10
|
2972
|
Chris@10
|
2973 * Installation and Supported Hardware/Software::
|
Chris@10
|
2974 * Usage of Multi-threaded FFTW::
|
Chris@10
|
2975 * How Many Threads to Use?::
|
Chris@10
|
2976 * Thread safety::
|
Chris@10
|
2977
|
Chris@10
|
2978
|
Chris@10
|
2979 File: fftw3.info, Node: Installation and Supported Hardware/Software, Next: Usage of Multi-threaded FFTW, Prev: Multi-threaded FFTW, Up: Multi-threaded FFTW
|
Chris@10
|
2980
|
Chris@10
|
2981 5.1 Installation and Supported Hardware/Software
|
Chris@10
|
2982 ================================================
|
Chris@10
|
2983
|
Chris@10
|
2984 All of the FFTW threads code is located in the `threads' subdirectory
|
Chris@10
|
2985 of the FFTW package. On Unix systems, the FFTW threads libraries and
|
Chris@10
|
2986 header files can be automatically configured, compiled, and installed
|
Chris@10
|
2987 along with the uniprocessor FFTW libraries simply by including
|
Chris@10
|
2988 `--enable-threads' in the flags to the `configure' script (*note
|
Chris@10
|
2989 Installation on Unix::), or `--enable-openmp' to use OpenMP
|
Chris@10
|
2990 (http://www.openmp.org) threads.
|
Chris@10
|
2991
|
Chris@10
|
2992 The threads routines require your operating system to have some sort
|
Chris@10
|
2993 of shared-memory threads support. Specifically, the FFTW threads
|
Chris@10
|
2994 package works with POSIX threads (available on most Unix variants, from
|
Chris@10
|
2995 GNU/Linux to MacOS X) and Win32 threads. OpenMP threads, which are
|
Chris@10
|
2996 supported in many common compilers (e.g. gcc), are also supported, and
|
Chris@10
|
2997 may give better performance on some systems. (OpenMP threads are also
|
Chris@10
|
2998 useful if you are employing OpenMP in your own code, in order to
|
Chris@10
|
2999 minimize conflicts between threading models.) If you have a
|
Chris@10
|
3000 shared-memory machine that uses a different threads API, it should be a
|
Chris@10
|
3001 simple matter of programming to include support for it; see the file
|
Chris@10
|
3002 `threads/threads.c' for more detail.
|
Chris@10
|
3003
|
Chris@10
|
3004 You can compile FFTW with _both_ `--enable-threads' and
|
Chris@10
|
3005 `--enable-openmp' at the same time, since they install libraries with
|
Chris@10
|
3006 different names (`fftw3_threads' and `fftw3_omp', as described below).
|
Chris@10
|
3007 However, your programs may only link to _one_ of these two libraries at
|
Chris@10
|
3008 a time.
|
Chris@10
|
3009
|
Chris@10
|
3010 Ideally, of course, you should also have multiple processors in
|
Chris@10
|
3011 order to get any benefit from the threaded transforms.
|
Chris@10
|
3012
|
Chris@10
|
3013
|
Chris@10
|
3014 File: fftw3.info, Node: Usage of Multi-threaded FFTW, Next: How Many Threads to Use?, Prev: Installation and Supported Hardware/Software, Up: Multi-threaded FFTW
|
Chris@10
|
3015
|
Chris@10
|
3016 5.2 Usage of Multi-threaded FFTW
|
Chris@10
|
3017 ================================
|
Chris@10
|
3018
|
Chris@10
|
3019 Here, it is assumed that the reader is already familiar with the usage
|
Chris@10
|
3020 of the uniprocessor FFTW routines, described elsewhere in this manual.
|
Chris@10
|
3021 We only describe what one has to change in order to use the
|
Chris@10
|
3022 multi-threaded routines.
|
Chris@10
|
3023
|
Chris@10
|
3024 First, programs using the parallel complex transforms should be
|
Chris@10
|
3025 linked with `-lfftw3_threads -lfftw3 -lm' on Unix, or `-lfftw3_omp
|
Chris@10
|
3026 -lfftw3 -lm' if you compiled with OpenMP. You will also need to link
|
Chris@10
|
3027 with whatever library is responsible for threads on your system (e.g.
|
Chris@10
|
3028 `-lpthread' on GNU/Linux) or include whatever compiler flag enables
|
Chris@10
|
3029 OpenMP (e.g. `-fopenmp' with gcc).
|
Chris@10
|
3030
|
Chris@10
|
3031 Second, before calling _any_ FFTW routines, you should call the
|
Chris@10
|
3032 function:
|
Chris@10
|
3033
|
Chris@10
|
3034 int fftw_init_threads(void);
|
Chris@10
|
3035
|
Chris@10
|
3036 This function, which need only be called once, performs any one-time
|
Chris@10
|
3037 initialization required to use threads on your system. It returns zero
|
Chris@10
|
3038 if there was some error (which should not happen under normal
|
Chris@10
|
3039 circumstances) and a non-zero value otherwise.
|
Chris@10
|
3040
|
Chris@10
|
3041 Third, before creating a plan that you want to parallelize, you
|
Chris@10
|
3042 should call:
|
Chris@10
|
3043
|
Chris@10
|
3044 void fftw_plan_with_nthreads(int nthreads);
|
Chris@10
|
3045
|
Chris@10
|
3046 The `nthreads' argument indicates the number of threads you want
|
Chris@10
|
3047 FFTW to use (or actually, the maximum number). All plans subsequently
|
Chris@10
|
3048 created with any planner routine will use that many threads. You can
|
Chris@10
|
3049 call `fftw_plan_with_nthreads', create some plans, call
|
Chris@10
|
3050 `fftw_plan_with_nthreads' again with a different argument, and create
|
Chris@10
|
3051 some more plans for a new number of threads. Plans already created
|
Chris@10
|
3052 before a call to `fftw_plan_with_nthreads' are unaffected. If you pass
|
Chris@10
|
3053 an `nthreads' argument of `1' (the default), threads are disabled for
|
Chris@10
|
3054 subsequent plans.
|
Chris@10
|
3055
|
Chris@10
|
3056 With OpenMP, to configure FFTW to use all of the currently running
|
Chris@10
|
3057 OpenMP threads (set by `omp_set_num_threads(nthreads)' or by the
|
Chris@10
|
3058 `OMP_NUM_THREADS' environment variable), you can do:
|
Chris@10
|
3059 `fftw_plan_with_nthreads(omp_get_max_threads())'. (The `omp_' OpenMP
|
Chris@10
|
3060 functions are declared via `#include <omp.h>'.)
|
Chris@10
|
3061
|
Chris@10
|
3062 Given a plan, you then execute it as usual with
|
Chris@10
|
3063 `fftw_execute(plan)', and the execution will use the number of threads
|
Chris@10
|
3064 specified when the plan was created. When done, you destroy it as
|
Chris@10
|
3065 usual with `fftw_destroy_plan'. As described in *note Thread safety::,
|
Chris@10
|
3066 plan _execution_ is thread-safe, but plan creation and destruction are
|
Chris@10
|
3067 _not_: you should create/destroy plans only from a single thread, but
|
Chris@10
|
3068 can safely execute multiple plans in parallel.
|
Chris@10
|
3069
|
Chris@10
|
3070 There is one additional routine: if you want to get rid of all memory
|
Chris@10
|
3071 and other resources allocated internally by FFTW, you can call:
|
Chris@10
|
3072
|
Chris@10
|
3073 void fftw_cleanup_threads(void);
|
Chris@10
|
3074
|
Chris@10
|
3075 which is much like the `fftw_cleanup()' function except that it also
|
Chris@10
|
3076 gets rid of threads-related data. You must _not_ execute any
|
Chris@10
|
3077 previously created plans after calling this function.
|
Chris@10
|
3078
|
Chris@10
|
3079 We should also mention one other restriction: if you save wisdom
|
Chris@10
|
3080 from a program using the multi-threaded FFTW, that wisdom _cannot be
|
Chris@10
|
3081 used_ by a program using only the single-threaded FFTW (i.e. not calling
|
Chris@10
|
3082 `fftw_init_threads'). *Note Words of Wisdom-Saving Plans::.
|
Chris@10
|
3083
|
Chris@10
|
3084
|
Chris@10
|
3085 File: fftw3.info, Node: How Many Threads to Use?, Next: Thread safety, Prev: Usage of Multi-threaded FFTW, Up: Multi-threaded FFTW
|
Chris@10
|
3086
|
Chris@10
|
3087 5.3 How Many Threads to Use?
|
Chris@10
|
3088 ============================
|
Chris@10
|
3089
|
Chris@10
|
3090 There is a fair amount of overhead involved in synchronizing threads,
|
Chris@10
|
3091 so the optimal number of threads to use depends upon the size of the
|
Chris@10
|
3092 transform as well as on the number of processors you have.
|
Chris@10
|
3093
|
Chris@10
|
3094 As a general rule, you don't want to use more threads than you have
|
Chris@10
|
3095 processors. (Using more threads will work, but there will be extra
|
Chris@10
|
3096 overhead with no benefit.) In fact, if the problem size is too small,
|
Chris@10
|
3097 you may want to use fewer threads than you have processors.
|
Chris@10
|
3098
|
Chris@10
|
3099 You will have to experiment with your system to see what level of
|
Chris@10
|
3100 parallelization is best for your problem size. Typically, the problem
|
Chris@10
|
3101 will have to involve at least a few thousand data points before threads
|
Chris@10
|
3102 become beneficial. If you plan with `FFTW_PATIENT', it will
|
Chris@10
|
3103 automatically disable threads for sizes that don't benefit from
|
Chris@10
|
3104 parallelization.
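The rules of thumb above could be encoded in a small helper whose result is then passed to `fftw_plan_with_nthreads'. This is only a sketch: the function name and the `min_points' threshold are hypothetical, and a suitable threshold has to be found by experiment on your own system.

```c
#include <stddef.h>

/* Hypothetical heuristic (not part of FFTW): use a single thread for
   small problems, and otherwise never more threads than processors.
   min_points is a tunable placeholder, e.g. a few thousand data points. */
int choose_nthreads(size_t npoints, int nprocs, size_t min_points)
{
    if (nprocs < 1 || npoints < min_points)
        return 1;
    return nprocs;
}
```

With a threshold of 4096 points and 8 processors, a 512-point transform would be planned with one thread and a transform of about a million points with eight.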
|
Chris@10
|
3105
|
Chris@10
|
3106
|
Chris@10
|
3107 File: fftw3.info, Node: Thread safety, Prev: How Many Threads to Use?, Up: Multi-threaded FFTW
|
Chris@10
|
3108
|
Chris@10
|
3109 5.4 Thread safety
|
Chris@10
|
3110 =================
|
Chris@10
|
3111
|
Chris@10
|
3112 Users writing multi-threaded programs (including OpenMP) must concern
|
Chris@10
|
3113 themselves with the "thread safety" of the libraries they use--that is,
|
Chris@10
|
3114 whether it is safe to call routines in parallel from multiple threads.
|
Chris@10
|
3115 FFTW can be used in such an environment, but some care must be taken
|
Chris@10
|
3116 because the planner routines share data (e.g. wisdom and trigonometric
|
Chris@10
|
3117 tables) between calls and plans.
|
Chris@10
|
3118
|
Chris@10
|
3119 The upshot is that the only thread-safe (re-entrant) routine in FFTW
|
Chris@10
|
3120 is `fftw_execute' (and the new-array variants thereof). All other
|
Chris@10
|
3121 routines (e.g. the planner) should only be called from one thread at a
|
Chris@10
|
3122 time. So, for example, you can wrap a semaphore lock around any calls
|
Chris@10
|
3123 to the planner; even more simply, you can just create all of your plans
|
Chris@10
|
3124 from one thread. We do not think this should be an important
|
Chris@10
|
3125 restriction (FFTW is designed for the situation where the only
|
Chris@10
|
3126 performance-sensitive code is the actual execution of the transform),
|
Chris@10
|
3127 and the benefits of shared data between plans are great.
|
Chris@10
|
3128
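One way to honor the "one thread at a time" rule is to route every planner call through a single mutex. The sketch below assumes POSIX threads are available. `create_plan_stub' is a hypothetical stand-in for a real planner call such as `fftw_plan_dft_1d', and the names `planner_lock', `plans_created', and `create_plan_locked' are illustrative only; none of them is part of FFTW.

```c
#include <assert.h>
#include <pthread.h>

/* One process-wide lock guarding every planner call (illustrative name). */
static pthread_mutex_t planner_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the planner's shared state (wisdom, trig tables, ...). */
int plans_created = 0;

/* Hypothetical placeholder for a real planner call such as
   fftw_plan_dft_1d(); it only touches the shared counter. */
int create_plan_stub(int n)
{
    assert(n > 0);
    plans_created += 1;
    return n; /* a real planner would return an fftw_plan here */
}

/* Serialize planning: only one thread at a time runs the planner. */
int create_plan_locked(int n)
{
    int plan;

    pthread_mutex_lock(&planner_lock);
    plan = create_plan_stub(n);
    pthread_mutex_unlock(&planner_lock);
    return plan;
}
```

Only planning needs the lock in this scheme; per the text above, `fftw_execute' and its new-array variants may run concurrently without it.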
|
Chris@10
|
3129 Note also that, since the plan is not modified by `fftw_execute', it
|
Chris@10
|
3130 is safe to execute the _same plan_ in parallel by multiple threads.
|
Chris@10
|
3131 However, since a given plan operates by default on a fixed array, you
|
Chris@10
|
3132 need to use one of the new-array execute functions (*note New-array
|
Chris@10
|
3133 Execute Functions::) so that different threads compute the transform of
|
Chris@10
|
3134 different data.
|
Chris@10
|
3135
|
Chris@10
|
3136 (Users should note that these comments only apply to programs using
|
Chris@10
|
3137 shared-memory threads or OpenMP. Parallelism using MPI or forked
|
Chris@10
|
3138 processes involves a separate address-space and global variables for
|
Chris@10
|
3139 each process, and is not susceptible to problems of this sort.)
|
Chris@10
|
3140
|
Chris@10
|
3141 If you configured FFTW with the `--enable-debug' or
|
Chris@10
|
3142 `--enable-debug-malloc' flags (*note Installation on Unix::), then
|
Chris@10
|
3143 `fftw_execute' is not thread-safe. These flags are not documented
|
Chris@10
|
3144 because they are intended only for developing and debugging FFTW, but
|
Chris@10
|
3145 if you must use `--enable-debug' then you should also specifically pass
|
Chris@10
|
3146 `--disable-debug-malloc' for `fftw_execute' to be thread-safe.
|
Chris@10
|
3147
|
Chris@10
|
3148
|
Chris@10
|
3149 File: fftw3.info, Node: Distributed-memory FFTW with MPI, Next: Calling FFTW from Modern Fortran, Prev: Multi-threaded FFTW, Up: Top
|
Chris@10
|
3150
|
Chris@10
|
3151 6 Distributed-memory FFTW with MPI
|
Chris@10
|
3152 **********************************
|
Chris@10
|
3153
|
Chris@10
|
3154 In this chapter we document the parallel FFTW routines for parallel
|
Chris@10
|
3155 systems supporting the MPI message-passing interface. Unlike the
|
Chris@10
|
3156 shared-memory threads described in the previous chapter, MPI allows you
|
Chris@10
|
3157 to use _distributed-memory_ parallelism, where each CPU has its own
|
Chris@10
|
3158 separate memory, and which can scale up to clusters of many thousands
|
Chris@10
|
3159 of processors. This capability comes at a price, however: each process
|
Chris@10
|
3160 only stores a _portion_ of the data to be transformed, which means that
|
Chris@10
|
3161 the data structures and programming-interface are quite different from
|
Chris@10
|
3162 the serial or threads versions of FFTW.
|
Chris@10
|
3163
|
Chris@10
|
3164 Distributed-memory parallelism is especially useful when you are
|
Chris@10
|
3165 transforming arrays so large that they do not fit into the memory of a
|
Chris@10
|
3166 single processor. The storage per-process required by FFTW's MPI
|
Chris@10
|
3167 routines is proportional to the total array size divided by the number
|
Chris@10
|
3168 of processes. Conversely, distributed-memory parallelism can easily
|
Chris@10
|
3169 pose an unacceptably high communications overhead for small problems;
|
Chris@10
|
3170 the threshold problem size for which parallelism becomes advantageous
|
Chris@10
|
3171 will depend on the precise problem you are interested in, your
|
Chris@10
|
3172 hardware, and your MPI implementation.
|
Chris@10
|
3173
|
Chris@10
|
3174 A note on terminology: in MPI, you divide the data among a set of
|
Chris@10
|
3175 "processes" which each run in their own memory address space.
|
Chris@10
|
3176 Generally, each process runs on a different physical processor, but
|
Chris@10
|
3177 this is not required. A set of processes in MPI is described by an
|
Chris@10
|
3178 opaque data structure called a "communicator," the most common of which
|
Chris@10
|
3179 is the predefined communicator `MPI_COMM_WORLD' which refers to _all_
|
Chris@10
|
3180 processes. For more information on these and other concepts common to
|
Chris@10
|
3181 all MPI programs, we refer the reader to the documentation at the MPI
|
Chris@10
|
3182 home page (http://www.mcs.anl.gov/research/projects/mpi/).
|
Chris@10
|
3183
|
Chris@10
|
3184 We assume in this chapter that the reader is familiar with the usage
|
Chris@10
|
3185 of the serial (uniprocessor) FFTW, and focus only on the concepts new
|
Chris@10
|
3186 to the MPI interface.
|
Chris@10
|
3187
|
Chris@10
|
3188 * Menu:
|
Chris@10
|
3189
|
Chris@10
|
3190 * FFTW MPI Installation::
|
Chris@10
|
3191 * Linking and Initializing MPI FFTW::
|
Chris@10
|
3192 * 2d MPI example::
|
Chris@10
|
3193 * MPI Data Distribution::
|
Chris@10
|
3194 * Multi-dimensional MPI DFTs of Real Data::
|
Chris@10
|
3195 * Other Multi-dimensional Real-data MPI Transforms::
|
Chris@10
|
3196 * FFTW MPI Transposes::
|
Chris@10
|
3197 * FFTW MPI Wisdom::
|
Chris@10
|
3198 * Avoiding MPI Deadlocks::
|
Chris@10
|
3199 * FFTW MPI Performance Tips::
|
Chris@10
|
3200 * Combining MPI and Threads::
|
Chris@10
|
3201 * FFTW MPI Reference::
|
Chris@10
|
3202 * FFTW MPI Fortran Interface::
|
Chris@10
|
3203
|
Chris@10
|
3204
|
Chris@10
|
3205 File: fftw3.info, Node: FFTW MPI Installation, Next: Linking and Initializing MPI FFTW, Prev: Distributed-memory FFTW with MPI, Up: Distributed-memory FFTW with MPI
|
Chris@10
|
3206
|
Chris@10
|
3207 6.1 FFTW MPI Installation
|
Chris@10
|
3208 =========================
|
Chris@10
|
3209
|
Chris@10
|
3210 All of the FFTW MPI code is located in the `mpi' subdirectory of the
|
Chris@10
|
3211 FFTW package. On Unix systems, the FFTW MPI libraries and header files
|
Chris@10
|
3212 are automatically configured, compiled, and installed along with the
|
Chris@10
|
3213 uniprocessor FFTW libraries simply by including `--enable-mpi' in the
|
Chris@10
|
3214 flags to the `configure' script (*note Installation on Unix::).
|
Chris@10
|
3215
|
Chris@10
|
3216 Any implementation of the MPI standard, version 1 or later, should
|
Chris@10
|
3217 work with FFTW. The `configure' script will attempt to automatically
|
Chris@10
|
3218 detect how to compile and link code using your MPI implementation. In
|
Chris@10
|
3219 some cases, especially if you have multiple different MPI
|
Chris@10
|
3220 implementations installed or have an unusual MPI software package, you
|
Chris@10
|
3221 may need to provide this information explicitly.
|
Chris@10
|
3222
|
Chris@10
|
3223 Most commonly, one compiles MPI code by invoking a special compiler
|
Chris@10
|
3224 command, typically `mpicc' for C code. The `configure' script knows
|
Chris@10
|
3225 the most common names for this command, but you can specify the MPI
|
Chris@10
|
3226 compilation command explicitly by setting the `MPICC' variable, as in
|
Chris@10
|
3227 `./configure MPICC=mpicc ...'.
|
Chris@10
|
3228
|
Chris@10
|
3229 If, instead of a special compiler command, you need to link a certain
|
Chris@10
|
3230 library, you can specify the link command via the `MPILIBS' variable,
|
Chris@10
|
3231 as in `./configure MPILIBS=-lmpi ...'. Note that if your MPI library
|
Chris@10
|
3232 is installed in a non-standard location (one the compiler does not know
|
Chris@10
|
3233 about by default), you may also have to specify the location of the
|
Chris@10
|
3234 library and header files via `LDFLAGS' and `CPPFLAGS' variables,
|
Chris@10
|
3235 respectively, as in `./configure LDFLAGS=-L/path/to/mpi/libs
|
Chris@10
|
3236 CPPFLAGS=-I/path/to/mpi/include ...'.
|
Chris@10
|
3237
|
Chris@10
|
3238
|
Chris@10
|
3239 File: fftw3.info, Node: Linking and Initializing MPI FFTW, Next: 2d MPI example, Prev: FFTW MPI Installation, Up: Distributed-memory FFTW with MPI
|
Chris@10
|
3240
|
Chris@10
|
3241 6.2 Linking and Initializing MPI FFTW
|
Chris@10
|
3242 =====================================
|
Chris@10
|
3243
|
Chris@10
|
3244 Programs using the MPI FFTW routines should be linked with `-lfftw3_mpi
|
Chris@10
|
3245 -lfftw3 -lm' on Unix in double precision, `-lfftw3f_mpi -lfftw3f -lm'
|
Chris@10
|
3246 in single precision, and so on (*note Precision::). You will also need
|
Chris@10
|
3247 to link with whatever library is responsible for MPI on your system; in
|
Chris@10
|
3248 most MPI implementations, there is a special compiler alias named
|
Chris@10
|
3249 `mpicc' to compile and link MPI code.
|
Chris@10
|
3250
|
Chris@10
|
3251 Before calling any FFTW routines except possibly `fftw_init_threads'
|
Chris@10
|
3252 (*note Combining MPI and Threads::), but after calling `MPI_Init', you
|
Chris@10
|
3253 should call the function:
|
Chris@10
|
3254
|
Chris@10
|
3255 void fftw_mpi_init(void);
|
Chris@10
|
3256
|
Chris@10
|
3257 If, at the end of your program, you want to get rid of all memory and
|
Chris@10
|
3258 other resources allocated internally by FFTW, for both the serial and
|
Chris@10
|
3259 MPI routines, you can call:
|
Chris@10
|
3260
|
Chris@10
|
3261 void fftw_mpi_cleanup(void);
|
Chris@10
|
3262
|
Chris@10
|
3263 which is much like the `fftw_cleanup()' function except that it also
|
Chris@10
|
3264 gets rid of FFTW's MPI-related data. You must _not_ execute any
|
Chris@10
|
3265 previously created plans after calling this function.
|
Chris@10
|
3266
|
Chris@10
|
3267
|
Chris@10
|
3268 File: fftw3.info, Node: 2d MPI example, Next: MPI Data Distribution, Prev: Linking and Initializing MPI FFTW, Up: Distributed-memory FFTW with MPI
|
Chris@10
|
3269
|
Chris@10
|
3270 6.3 2d MPI example
|
Chris@10
|
3271 ==================
|
Chris@10
|
3272
|
Chris@10
|
3273 Before we document the FFTW MPI interface in detail, we begin with a
|
Chris@10
|
3274 simple example outlining how one would perform a two-dimensional `N0'
|
Chris@10
|
3275 by `N1' complex DFT.
|
Chris@10
|
3276
|
Chris@10
|
3277 #include <fftw3-mpi.h>
|
Chris@10
|
3278
|
Chris@10
|
3279 int main(int argc, char **argv)
|
Chris@10
|
3280 {
|
Chris@10
|
3281 const ptrdiff_t N0 = ..., N1 = ...;
|
Chris@10
|
3282 fftw_plan plan;
|
Chris@10
|
3283 fftw_complex *data;
|
Chris@10
|
3284 ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
|
Chris@10
|
3285
|
Chris@10
|
3286 MPI_Init(&argc, &argv);
|
Chris@10
|
3287 fftw_mpi_init();
|
Chris@10
|
3288
|
Chris@10
|
3289 /* get local data size and allocate */
|
Chris@10
|
3290 alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
|
Chris@10
|
3291 &local_n0, &local_0_start);
|
Chris@10
|
3292 data = fftw_alloc_complex(alloc_local);
|
Chris@10
|
3293
|
Chris@10
|
3294 /* create plan for in-place forward DFT */
|
Chris@10
|
3295 plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
|
Chris@10
|
3296 FFTW_FORWARD, FFTW_ESTIMATE);
|
Chris@10
|
3297
|
Chris@10
|
3298 /* initialize data to some function my_function(x,y) */
|
Chris@10
|
3299 for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j)
|
Chris@10
|
3300 data[i*N1 + j] = my_function(local_0_start + i, j);
|
Chris@10
|
3301
|
Chris@10
|
3302 /* compute transforms, in-place, as many times as desired */
|
Chris@10
|
3303 fftw_execute(plan);
|
Chris@10
|
3304
|
Chris@10
|
3305 fftw_destroy_plan(plan);
|
Chris@10
|
3306
|
Chris@10
|
3307 MPI_Finalize();
|
Chris@10
|
3308 }
|
Chris@10
|
3309
|
Chris@10
|
3310 As can be seen above, the MPI interface follows the same basic style
|
Chris@10
|
3311 of allocate/plan/execute/destroy as the serial FFTW routines. All of
|
Chris@10
|
3312 the MPI-specific routines are prefixed with `fftw_mpi_' instead of
|
Chris@10
|
3313 `fftw_'. There are a few important differences, however:
|
Chris@10
|
3314
|
Chris@10
|
3315 First, we must call `fftw_mpi_init()' after calling `MPI_Init'
|
Chris@10
|
3316 (required in all MPI programs) and before calling any other `fftw_mpi_'
|
Chris@10
|
3317 routine.
|
Chris@10
|
3318
|
Chris@10
|
3319 Second, when we create the plan with `fftw_mpi_plan_dft_2d',
|
Chris@10
|
3320 analogous to `fftw_plan_dft_2d', we pass an additional argument: the
|
Chris@10
|
3321 communicator, indicating which processes will participate in the
|
Chris@10
|
3322 transform (here `MPI_COMM_WORLD', indicating all processes). Whenever
|
Chris@10
|
3323 you create, execute, or destroy a plan for an MPI transform, you must
|
Chris@10
|
3324 call the corresponding FFTW routine on _all_ processes in the
|
Chris@10
|
3325 communicator for that transform. (That is, these are _collective_
|
Chris@10
|
3326 calls.) Note that the plan for the MPI transform uses the standard
|
Chris@10
|
3327 `fftw_execute' and `fftw_destroy_plan' routines (on the other hand, there
|
Chris@10
|
3328 are MPI-specific new-array execute functions documented below).
|
Chris@10
|
3329
|
Chris@10
|
3330 Third, all of the FFTW MPI routines take `ptrdiff_t' arguments
|
Chris@10
|
3331 instead of `int' as for the serial FFTW. `ptrdiff_t' is a standard C
|
Chris@10
|
3332 integer type which is (at least) 32 bits wide on a 32-bit machine and
|
Chris@10
|
3333 64 bits wide on a 64-bit machine. This is to make it easy to specify
|
Chris@10
|
3334 very large parallel transforms on a 64-bit machine. (You can specify
|
Chris@10
|
3335 64-bit transform sizes in the serial FFTW, too, but only by using the
|
Chris@10
|
3336 `guru64' planner interface. *Note 64-bit Guru Interface::.)
|
Chris@10
|
3337
|
Chris@10
|
3338 Fourth, and most importantly, you don't allocate the entire
|
Chris@10
|
3339 two-dimensional array on each process. Instead, you call
|
Chris@10
|
3340 `fftw_mpi_local_size_2d' to find out what _portion_ of the array
|
Chris@10
|
3341 resides on each process, and how much space to allocate. Here, the
|
Chris@10
|
3342 portion of the array on each process is a `local_n0' by `N1' slice of
|
Chris@10
|
3343 the total array, starting at index `local_0_start'. The total number
|
Chris@10
|
3344 of `fftw_complex' numbers to allocate is given by the `alloc_local'
|
Chris@10
|
3345 return value, which _may_ be greater than `local_n0 * N1' (in case some
|
Chris@10
|
3346 intermediate calculations require additional storage). The data
|
Chris@10
|
3347 distribution in FFTW's MPI interface is described in more detail by the
|
Chris@10
|
3348 next section.
|
Chris@10
|
3349
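Returning to the third point above, a quick calculation shows why `int' sizes become limiting for large parallel transforms. The helper below is an illustrative sketch (the name `fits_in_int' is hypothetical, not an FFTW routine) that checks whether the element count `n0 * n1' is still representable as an `int'.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Return 1 if the element count n0 * n1 fits in an int, 0 otherwise.
   Assumes 0 < n0, n1 <= 2^31 so the 64-bit product cannot overflow. */
int fits_in_int(int64_t n0, int64_t n1)
{
    assert(n0 > 0 && n1 > 0);
    return n0 * n1 <= (int64_t)INT_MAX;
}
```

For instance, a 65536 by 65536 array has 2^32 elements, which is more than a 32-bit `int' can count, while a 64-bit `ptrdiff_t' represents it without trouble.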
|
Chris@10
|
3350 Given the portion of the array that resides on the local process, it
|
Chris@10
|
3351 is straightforward to initialize the data (here to a function
|
Chris@10
|
3352 `my_function') and otherwise manipulate it. Of course, at the end of
|
Chris@10
|
3353 the program you may want to output the data somehow, but synchronizing
|
Chris@10
|
3354 this output is up to you and is beyond the scope of this manual. (One
|
Chris@10
|
3355 good way to output a large multi-dimensional distributed array in MPI
|
Chris@10
|
3356 to a portable binary file is to use the free HDF5 library; see the HDF
|
Chris@10
|
3357 home page (http://www.hdfgroup.org/).)
|
Chris@10
|
3358
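The indexing used in the initialization loop of the example can be isolated in a small helper. This is a sketch only: it uses plain `double' elements instead of `fftw_complex' so that it compiles without FFTW, and the names `fill_local_slice' and `row_col_code' are hypothetical. The point it illustrates is that the local array is indexed from row 0, while the function being sampled sees the global row `local_0_start + i'.

```c
#include <assert.h>
#include <stddef.h>

/* Fill a local_n0-by-n1 row-major slice whose first row is global row
   local_0_start. Each element receives f(global_row, column). */
void fill_local_slice(double *slice, ptrdiff_t local_n0,
                      ptrdiff_t local_0_start, ptrdiff_t n1,
                      double (*f)(ptrdiff_t, ptrdiff_t))
{
    ptrdiff_t i, j;

    assert(slice != NULL && f != NULL);
    for (i = 0; i < local_n0; ++i)
        for (j = 0; j < n1; ++j)
            slice[i * n1 + j] = f(local_0_start + i, j);
}

/* Example function: encodes the global coordinates in the value. */
double row_col_code(ptrdiff_t row, ptrdiff_t col)
{
    return (double)(1000 * row + col);
}
```

In a real program the values of `local_n0' and `local_0_start' would come from `fftw_mpi_local_size_2d', as in the example above.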
|
Chris@10
|
3359
|
Chris@10
|
3360 File: fftw3.info, Node: MPI Data Distribution, Next: Multi-dimensional MPI DFTs of Real Data, Prev: 2d MPI example, Up: Distributed-memory FFTW with MPI
|
Chris@10
|
3361
|
Chris@10
|
3362 6.4 MPI Data Distribution
|
Chris@10
|
3363 =========================
|
Chris@10
|
3364
|
Chris@10
|
3365 The most important concept to understand in using FFTW's MPI interface
|
Chris@10
|
3366 is the data distribution. With a serial or multithreaded FFT, all of
|
Chris@10
|
3367 the inputs and outputs are stored as a single contiguous chunk of
|
Chris@10
|
3368 memory. With a distributed-memory FFT, the inputs and outputs are
|
Chris@10
|
3369 broken into disjoint blocks, one per process.
|
Chris@10
|
3370
|
Chris@10
|
3371 In particular, FFTW uses a _1d block distribution_ of the data,
|
Chris@10
|
3372 distributed along the _first dimension_. For example, if you want to
|
Chris@10
|
3373 perform a 100 x 200 complex DFT, distributed over 4 processes, each
|
Chris@10
|
3374 process will get a 25 x 200 slice of the data. That is, process 0
|
Chris@10
|
3375 will get rows 0 through 24, process 1 will get rows 25 through 49,
|
Chris@10
|
3376 process 2 will get rows 50 through 74, and process 3 will get rows 75
|
Chris@10
|
3377 through 99. If you take the same array but distribute it over 3
|
Chris@10
|
3378 processes, then it is not evenly divisible so the different processes
|
Chris@10
|
3379 will have unequal chunks. FFTW's default choice in this case is to
|
Chris@10
|
3380 assign 34 rows to processes 0 and 1, and 32 rows to process 2.
|
Chris@10
|
3381
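The row ranges quoted above follow from simple arithmetic: take a block of `n0 / P' rows, rounded up, and hand blocks out in rank order. The sketch below reproduces that arithmetic for illustration only; `default_rows' is a hypothetical helper, not an FFTW routine, and real programs should obtain these numbers from the `fftw_mpi_local_size' routines.

```c
#include <assert.h>
#include <stddef.h>

/* Rows owned by `rank' when n0 rows are split over nprocs processes in
   blocks of n0 / nprocs rows (rounded up), assigned in rank order.
   Writes the first row and the row count; the count may be zero. */
void default_rows(ptrdiff_t n0, ptrdiff_t nprocs, ptrdiff_t rank,
                  ptrdiff_t *first, ptrdiff_t *count)
{
    ptrdiff_t block, start;

    assert(n0 > 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    block = (n0 + nprocs - 1) / nprocs; /* n0 / nprocs, rounded up */
    start = rank * block;

    if (start >= n0) {  /* this rank holds no rows */
        *first = 0;
        *count = 0;
        return;
    }
    *first = start;
    *count = (n0 - start < block) ? (n0 - start) : block;
}
```

With 100 rows on 4 processes, rank 1 gets 25 rows starting at row 25; on 3 processes, rank 2 gets the last 32 rows starting at row 68.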
|
Chris@10
|
3382 FFTW provides several `fftw_mpi_local_size' routines that you can
|
Chris@10
|
3383 call to find out what portion of an array is stored on the current
|
Chris@10
|
3384 process. In most cases, you should use the default block sizes picked
|
Chris@10
|
3385 by FFTW, but it is also possible to specify your own block size. For
|
Chris@10
|
3386 example, with a 100 x 200 array on three processes, you can tell FFTW
|
Chris@10
|
3387 to use a block size of 40, which would assign 40 rows to processes 0
|
Chris@10
|
3388 and 1, and 20 rows to process 2. FFTW's default is to divide the data
|
Chris@10
|
3389 equally among the processes if possible, and as best it can otherwise.
|
Chris@10
|
3390 The rows are always assigned in "rank order," i.e. process 0 gets the
|
Chris@10
|
3391 first block of rows, then process 1, and so on. (You can change this
|
Chris@10
|
3392 by using `MPI_Comm_split' to create a new communicator with re-ordered
|
Chris@10
|
3393 processes.) However, you should always call the `fftw_mpi_local_size'
|
Chris@10
|
3394 routines, if possible, rather than trying to predict FFTW's
|
Chris@10
|
3395 distribution choices.
|
Chris@10
|
3396
|
Chris@10
|
3397 In particular, it is critical that you allocate the storage size that
|
Chris@10
|
3398 is returned by `fftw_mpi_local_size', which is _not_ necessarily the
|
Chris@10
|
3399 size of the local slice of the array. The reason is that intermediate
|
Chris@10
|
3400 steps of FFTW's algorithms involve transposing the array and
|
Chris@10
|
3401 redistributing the data, so at these intermediate steps FFTW may
|
Chris@10
|
3402 require more local storage space (albeit always proportional to the
|
Chris@10
|
3403 total size divided by the number of processes). The
|
Chris@10
|
3404 `fftw_mpi_local_size' functions know how much storage is required for
|
Chris@10
|
3405 these intermediate steps and tell you the correct amount to allocate.
|
Chris@10
|
3406
|
Chris@10
|
3407 * Menu:
|
Chris@10
|
3408
|
Chris@10
|
3409 * Basic and advanced distribution interfaces::
|
Chris@10
|
3410 * Load balancing::
|
Chris@10
|
3411 * Transposed distributions::
|
Chris@10
|
3412 * One-dimensional distributions::
|
Chris@10
|
3413
|
Chris@10
|
3414
|
Chris@10
|
3415 File: fftw3.info, Node: Basic and advanced distribution interfaces, Next: Load balancing, Prev: MPI Data Distribution, Up: MPI Data Distribution
|
Chris@10
|
3416
|
Chris@10
|
3417 6.4.1 Basic and advanced distribution interfaces
|
Chris@10
|
3418 ------------------------------------------------
|
Chris@10
|
3419
|
Chris@10
|
3420 As with the planner interface, the `fftw_mpi_local_size' distribution
|
Chris@10
|
3421 interface is broken into basic and advanced (`_many') interfaces, where
|
Chris@10
|
3422 the latter allows you to specify the block size manually and also to
|
Chris@10
|
3423 request block sizes when computing multiple transforms simultaneously.
|
Chris@10
|
3424 These functions are documented more exhaustively by the FFTW MPI
|
Chris@10
|
3425 Reference, but we summarize the basic ideas here using a couple of
|
Chris@10
|
3426 two-dimensional examples.
|
Chris@10
|
3427
|
Chris@10
|
3428 For the 100 x 200 complex-DFT example, above, we would find the
|
Chris@10
|
3429 distribution by calling the following function in the basic interface:
|
Chris@10
|
3430
|
Chris@10
|
3431 ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
|
Chris@10
|
3432 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
|
Chris@10
|
3433
|
Chris@10
|
3434 Given the total size of the data to be transformed (here, `n0 = 100'
|
Chris@10
|
3435 and `n1 = 200') and an MPI communicator (`comm'), this function
|
Chris@10
|
3436 provides three numbers.
|
Chris@10
|
3437
|
Chris@10
|
3438 First, it describes the shape of the local data: the current process
|
Chris@10
|
3439 should store a `local_n0' by `n1' slice of the overall dataset, in
|
Chris@10
|
3440 row-major order (`n1' dimension contiguous), starting at index
|
Chris@10
|
3441 `local_0_start'. That is, if the total dataset is viewed as a `n0' by
|
Chris@10
|
3442 `n1' matrix, the current process should store the rows `local_0_start'
|
Chris@10
|
3443 to `local_0_start+local_n0-1'. Obviously, if you are running with only
|
Chris@10
|
3444 a single MPI process, that process will store the entire array:
|
Chris@10
|
3445 `local_0_start' will be zero and `local_n0' will be `n0'. *Note
|
Chris@10
|
3446 Row-major Format::.
|
Chris@10
|
3447
|
Chris@10
|
3448 Second, the return value is the total number of data elements (e.g.,
|
Chris@10
|
3449 complex numbers for a complex DFT) that should be allocated for the
|
Chris@10
|
3450 input and output arrays on the current process (ideally with
|
Chris@10
|
3451 `fftw_malloc' or an `fftw_alloc' function, to ensure optimal
|
Chris@10
|
3452 alignment). It might seem that this should always be equal to
|
Chris@10
|
3453 `local_n0 * n1', but this is _not_ the case. FFTW's distributed FFT
|
Chris@10
|
3454 algorithms require data redistributions at intermediate stages of the
|
Chris@10
|
3455 transform, and in some circumstances this may require slightly larger
|
Chris@10
|
3456 local storage. This is discussed in more detail below, under *note
|
Chris@10
|
3457 Load balancing::.
|
Chris@10
|
3458
|
Chris@10
|
3459 The advanced-interface `local_size' function for multidimensional
|
Chris@10
|
3460 transforms returns the same three things (`local_n0', `local_0_start',
|
Chris@10
|
3461 and the total number of elements to allocate), but takes more inputs:
|
Chris@10
|
3462
|
Chris@10
|
3463 ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n,
|
Chris@10
|
3464 ptrdiff_t howmany,
|
Chris@10
|
3465 ptrdiff_t block0,
|
Chris@10
|
3466 MPI_Comm comm,
|
Chris@10
|
3467 ptrdiff_t *local_n0,
|
Chris@10
|
3468 ptrdiff_t *local_0_start);
|
Chris@10
|
3469
|
Chris@10
|
3470 The two-dimensional case above corresponds to `rnk = 2' and an array
|
Chris@10
|
3471 `n' of length 2 with `n[0] = n0' and `n[1] = n1'. This routine is for
|
Chris@10
|
3472 any `rnk > 1'; one-dimensional transforms have their own interface
|
Chris@10
|
3473 because they work slightly differently, as discussed below.
|
Chris@10
|
3474
|
Chris@10
|
3475 First, the advanced interface allows you to perform multiple
|
Chris@10
|
3476 transforms at once, of interleaved data, as specified by the `howmany'
|
Chris@10
|
3477 parameter. (`howmany' is 1 for a single transform.)
|
Chris@10
|
3478
|
Chris@10
|
3479 Second, here you can specify your desired block size in the `n0'
|
Chris@10
|
3480 dimension, `block0'. To use FFTW's default block size, pass
|
Chris@10
|
3481 `FFTW_MPI_DEFAULT_BLOCK' (0) for `block0'. Otherwise, on `P'
|
Chris@10
|
3482 processes, FFTW will return `local_n0' equal to `block0' on the first
|
Chris@10
|
3483 `n0 / block0' processes (rounded down), return `local_n0' equal to `n0 -
|
Chris@10
|
3484 block0 * (n0 / block0)' on the next process, and `local_n0' equal to
|
Chris@10
|
3485 zero on any remaining processes. In general, we recommend using the
|
Chris@10
|
3486 default block size (which corresponds to `n0 / P', rounded up).
|
Chris@10
|
3487
|
Chris@10
|
3488 For example, suppose you have `P = 4' processes and `n0 = 21'. The
|
Chris@10
|
3489 default will be a block size of `6', which will give `local_n0 = 6' on
|
Chris@10
|
3490 the first three processes and `local_n0 = 3' on the last process.
|
Chris@10
|
3491 Instead, however, you could specify `block0 = 5' if you wanted, which
|
Chris@10
|
3492 would give `local_n0 = 5' on processes 0 to 2, `local_n0 = 6' on
|
Chris@10
|
3493 process 3. (This choice, while it may look superficially more
|
Chris@10
|
3494 "balanced," has the same critical path as FFTW's default but requires
|
Chris@10
|
3495 more communications.)
|
Chris@10
|
3496
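The effect of an explicit block size can be sketched in a few lines. The helper below is hypothetical (`rows_for_block' is not an FFTW routine) and assumes that `block0' times the number of processes is at least `n0', so that every row has an owner. Passing 0 selects the default block, mirroring the role of `FFTW_MPI_DEFAULT_BLOCK'.

```c
#include <assert.h>
#include <stddef.h>

/* Number of rows held by `rank' for a chosen block size block0.
   block0 == 0 selects the default block, n0 / nprocs rounded up.
   Precondition (assumed here): block0 * nprocs >= n0. */
ptrdiff_t rows_for_block(ptrdiff_t n0, ptrdiff_t nprocs,
                         ptrdiff_t block0, ptrdiff_t rank)
{
    ptrdiff_t full, rest;

    assert(n0 > 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    if (block0 == 0)
        block0 = (n0 + nprocs - 1) / nprocs;
    assert(block0 > 0 && block0 * nprocs >= n0);

    full = n0 / block0;         /* processes holding a full block */
    rest = n0 - block0 * full;  /* leftover rows on the next process */
    if (rank < full)
        return block0;
    if (rank == full)
        return rest;
    return 0;                   /* remaining processes hold nothing */
}
```

For `n0 = 21' on 4 processes the default gives 6, 6, 6, 3; for 100 rows on 3 processes with `block0 = 40' it gives 40, 40, 20.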
|
Chris@10
|
3497
|
Chris@10
|
3498 File: fftw3.info, Node: Load balancing, Next: Transposed distributions, Prev: Basic and advanced distribution interfaces, Up: MPI Data Distribution
|
Chris@10
|
3499
|
Chris@10
|
3500 6.4.2 Load balancing
|
Chris@10
|
3501 --------------------
|
Chris@10
|
3502
|
Chris@10
|
3503 Ideally, when you parallelize a transform over some P processes, each
|
Chris@10
|
3504 process should end up with work that takes equal time. Otherwise, all
|
Chris@10
|
3505 of the processes end up waiting on whichever process is slowest. This
|
Chris@10
|
3506 goal is known as "load balancing." In this section, we describe the
|
Chris@10
|
3507 circumstances under which FFTW is able to load-balance well, and in
|
Chris@10
|
3508 particular how you should choose your transform size in order to load
|
Chris@10
|
3509 balance.
|
Chris@10
|
3510
|
Chris@10
|
3511 Load balancing is especially difficult when you are parallelizing
|
Chris@10
|
3512 over heterogeneous machines; for example, if one of your processors is an
|
Chris@10
|
3513 old 486 and another is a Pentium IV, obviously you should give the
|
Chris@10
|
3514 Pentium more work to do than the 486 since the latter is much slower.
|
Chris@10
|
3515 FFTW does not deal with this problem, however--it assumes that your
|
Chris@10
|
3516 processes run on hardware of comparable speed, and that the goal is
|
Chris@10
|
3517 therefore to divide the problem as equally as possible.
|
Chris@10
|
3518
|
Chris@10
|
3519 For a multi-dimensional complex DFT, FFTW can divide the problem
|
Chris@10
|
3520 equally among the processes if: (i) the _first_ dimension `n0' is
|
Chris@10
|
3521 divisible by P; and (ii), the _product_ of the subsequent dimensions is
|
Chris@10
|
3522 divisible by P. (For the advanced interface, where you can specify
|
Chris@10
|
3523 multiple simultaneous transforms via some "vector" length `howmany', a
|
Chris@10
|
3524 factor of `howmany' is included in the product of the subsequent
|
Chris@10
|
3525 dimensions.)
|
Chris@10
|
3526
|
Chris@10
|
3527 For a one-dimensional complex DFT, the length `N' of the data should
|
Chris@10
|
3528 be divisible by P _squared_ to be able to divide the problem equally
|
Chris@10
|
3529 among the processes.
|
Chris@10
|
3530
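The divisibility conditions above are easy to check before choosing a transform size. The following is a minimal sketch; the function names `balanced_multid' and `balanced_1d' are hypothetical and not part of FFTW, and the product of the trailing dimensions is assumed not to overflow `ptrdiff_t'.

```c
#include <assert.h>
#include <stddef.h>

/* Multi-dimensional case: n[0] and howmany * n[1] * ... * n[rnk-1]
   must both be divisible by nprocs. */
int balanced_multid(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
                    ptrdiff_t nprocs)
{
    ptrdiff_t rest = howmany;
    int d;

    assert(rnk >= 2 && n != NULL && howmany > 0 && nprocs > 0);
    for (d = 1; d < rnk; ++d)
        rest *= n[d];
    return n[0] % nprocs == 0 && rest % nprocs == 0;
}

/* One-dimensional case: the length must be divisible by nprocs squared. */
int balanced_1d(ptrdiff_t n, ptrdiff_t nprocs)
{
    assert(n > 0 && nprocs > 0);
    return n % (nprocs * nprocs) == 0;
}
```

For example, a 100 x 200 transform divides equally over 4 processes but not over 3, and a 1d transform of length 1024 divides equally over 4 processes because 1024 is a multiple of 16.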
|
Chris@10
|
3531
|
Chris@10
|
3532 File: fftw3.info, Node: Transposed distributions, Next: One-dimensional distributions, Prev: Load balancing, Up: MPI Data Distribution
|
Chris@10
|
3533
|
Chris@10
|
3534 6.4.3 Transposed distributions
|
Chris@10
|
3535 ------------------------------
|
Chris@10
|
3536
|
Chris@10
|
3537 Internally, FFTW's MPI transform algorithms work by first computing
|
Chris@10
|
3538 transforms of the data local to each process, then by globally
|
Chris@10
|
3539 _transposing_ the data in some fashion to redistribute the data among
|
Chris@10
|
3540 the processes, transforming the new data local to each process, and
|
Chris@10
|
3541 transposing back. For example, a two-dimensional `n0' by `n1' array,
|
Chris@10
|
3542 distributed across the `n0' dimension, is transformed by: (i)
|
Chris@10
|
3543 transforming the `n1' dimension, which is local to each process; (ii)
|
Chris@10
|
3544 transposing to an `n1' by `n0' array, distributed across the `n1'
|
Chris@10
|
3545 dimension; (iii) transforming the `n0' dimension, which is now local to
|
Chris@10
|
3546 each process; (iv) transposing back.
|
Chris@10
|
3547
|
Chris@10
|
3548 However, in many applications it is acceptable to compute a
|
Chris@10
|
3549 multidimensional DFT whose results are produced in transposed order
|
Chris@10
|
3550 (e.g., `n1' by `n0' in two dimensions). This provides a significant
|
Chris@10
|
3551 performance advantage, because it means that the final transposition
|
Chris@10
|
3552 step can be omitted. FFTW supports this optimization, which you
|
Chris@10
|
3553 specify by passing the flag `FFTW_MPI_TRANSPOSED_OUT' to the planner
|
Chris@10
|
3554 routines. To compute the inverse transform of transposed output, you
|
Chris@10
|
3555 specify `FFTW_MPI_TRANSPOSED_IN' to tell it that the input is
|
Chris@10
|
3556 transposed. In this section, we explain how to interpret the output
|
Chris@10
|
3557 format of such a transform.
|
Chris@10
|
3558
|
Chris@10
|
3559 Suppose you are transforming multi-dimensional data with (at
|
Chris@10
|
3560 least two) dimensions n[0] x n[1] x n[2] x ... x n[d-1] . As always,
|
Chris@10
|
3561 it is distributed along the first dimension n[0] . Now, if we compute
|
Chris@10
|
3562 its DFT with the `FFTW_MPI_TRANSPOSED_OUT' flag, the resulting output
|
Chris@10
|
3563 data are stored with the first _two_ dimensions transposed: n[1] x n[0]
|
Chris@10
|
3564 x n[2] x ... x n[d-1] , distributed along the n[1] dimension.
|
Chris@10
|
3565 Conversely, if we take the n[1] x n[0] x n[2] x ... x n[d-1] data and
|
Chris@10
|
3566 transform it with the `FFTW_MPI_TRANSPOSED_IN' flag, then the format
|
Chris@10
|
3567 goes back to the original n[0] x n[1] x n[2] x ... x n[d-1] array.
|
Chris@10
|
3568
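The shape bookkeeping described above amounts to swapping the first two entries of the dimension array. The helper below is an illustrative sketch (the name `transposed_shape' is hypothetical, not an FFTW routine); it computes only the logical shape of the transposed data and does not move any data.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the rnk dimensions in n to out with the first two swapped,
   i.e. the logical shape n[1] x n[0] x n[2] x ... of transposed data. */
void transposed_shape(int rnk, const ptrdiff_t *n, ptrdiff_t *out)
{
    int d;

    assert(rnk >= 2 && n != NULL && out != NULL);
    for (d = 0; d < rnk; ++d)
        out[d] = n[d];
    out[0] = n[1];
    out[1] = n[0];
}
```

Applying the helper twice returns the original shape, matching the round trip from `FFTW_MPI_TRANSPOSED_OUT' to `FFTW_MPI_TRANSPOSED_IN' described above.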
|
Chris@10
|
3569 There are two ways to find the portion of the transposed array that
|
Chris@10
|
3570 resides on the current process. First, you can simply call the
|
Chris@10
|
3571 appropriate `local_size' function, passing n[1] x n[0] x n[2] x ... x
|
Chris@10
|
3572 n[d-1] (the transposed dimensions). This would mean calling the
|
Chris@10
|
3573 `local_size' function twice, once for the transposed and once for the
|
Chris@10
|
3574 non-transposed dimensions. Alternatively, you can call one of the
|
Chris@10
|
3575 `local_size_transposed' functions, which returns both the
|
Chris@10
|
3576 non-transposed and transposed data distribution from a single call.
|
Chris@10
|
3577 For example, for a 3d transform with transposed output (or input), you
|
Chris@10
|
3578 might call:
|
Chris@10
|
3579
|
Chris@10
|
3580 ptrdiff_t fftw_mpi_local_size_3d_transposed(
|
Chris@10
|
3581 ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm,
|
Chris@10
|
3582 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
|
Chris@10
|
3583 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
|
Chris@10
|
3584
|
Chris@10
|
3585 Here, `local_n0' and `local_0_start' give the size and starting
|
Chris@10
|
3586 index of the `n0' dimension for the _non_-transposed data, as in the
|
Chris@10
|
3587 previous sections. For _transposed_ data (e.g. the output for
|
Chris@10
|
3588 `FFTW_MPI_TRANSPOSED_OUT'), `local_n1' and `local_1_start' give the
|
Chris@10
|
3589 size and starting index of the `n1' dimension, which is the first
|
Chris@10
|
3590 dimension of the transposed data (`n1' by `n0' by `n2').
|
Chris@10
|
3591
|
Chris@10
|
3592 (Note that `FFTW_MPI_TRANSPOSED_IN' is completely equivalent to
|
Chris@10
|
3593 performing `FFTW_MPI_TRANSPOSED_OUT' and passing the first two
|
Chris@10
|
3594 dimensions to the planner in reverse order, or vice versa. If you pass
|
Chris@10
|
3595 _both_ the `FFTW_MPI_TRANSPOSED_IN' and `FFTW_MPI_TRANSPOSED_OUT'
|
Chris@10
|
3596 flags, it is equivalent to swapping the first two dimensions passed to
|
Chris@10
|
3597 the planner and passing _neither_ flag.)
|
Chris@10
|
3598
|
Chris@10
|
3599
|
Chris@10
|
3600 File: fftw3.info, Node: One-dimensional distributions, Prev: Transposed distributions, Up: MPI Data Distribution
|
Chris@10
|
3601
|
Chris@10
|
3602 6.4.4 One-dimensional distributions
|
Chris@10
|
3603 -----------------------------------
|
Chris@10
|
3604
|
Chris@10
|
3605 For one-dimensional distributed DFTs using FFTW, matters are slightly
|
Chris@10
|
3606 more complicated because the data distribution is more closely tied to
|
Chris@10
|
3607 how the algorithm works. In particular, you can no longer pass an
|
Chris@10
|
3608 arbitrary block size and must accept FFTW's default; also, the block
|
Chris@10
|
3609 sizes may be different for input and output. Furthermore, the data
|
Chris@10
|
3610 distribution depends on the flags and transform direction, in order for
|
Chris@10
|
3611 forward and backward transforms to work correctly.
|
Chris@10
|
3612
|
Chris@10
|
3613 ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm,
|
Chris@10
|
3614 int sign, unsigned flags,
|
Chris@10
|
3615 ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
|
Chris@10
|
3616 ptrdiff_t *local_no, ptrdiff_t *local_o_start);
|
Chris@10
|
3617
|
Chris@10
|
3618 This function computes the data distribution for a 1d transform of
|
Chris@10
|
3619 size `n0' with the given transform `sign' and `flags'. Both input and
|
Chris@10
|
3620 output data use block distributions. The input on the current process
|
Chris@10
|
3621 will consist of `local_ni' numbers starting at index `local_i_start';
|
Chris@10
|
3622 e.g. if only a single process is used, then `local_ni' will be `n0' and
|
Chris@10
|
3623 `local_i_start' will be `0'. Similarly for the output, with `local_no'
|
Chris@10
|
3624 numbers starting at index `local_o_start'. The return value of
|
Chris@10
|
3625 `fftw_mpi_local_size_1d' will be the total number of elements to
|
Chris@10
|
3626 allocate on the current process (which might be slightly larger than
|
Chris@10
|
3627 the local size due to intermediate steps in the algorithm).
|
Chris@10
|
3628
|
Chris@10
|
3629 As mentioned above (*note Load balancing::), the data will be divided
|
Chris@10
|
3630 equally among the processes if `n0' is divisible by the _square_ of the
|
Chris@10
|
3631 number of processes. In this case, `local_ni' will equal `local_no'.
|
Chris@10
|
3632 Otherwise, they may be different.
|
Chris@10
|
3633
|
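The divisibility condition above can be tested before planning. The following is a minimal sketch in plain C; the helper name `divides_evenly_1d' is hypothetical (it is not part of FFTW), and it only checks the condition stated in the text. It does not ask FFTW for the actual distribution, which still has to come from `fftw_mpi_local_size_1d'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an FFTW function): returns 1 if a 1d
   transform of size n0 meets the condition stated above for an equal
   division among nprocs processes, i.e. n0 is divisible by the square
   of the number of processes; returns 0 otherwise. */
static int divides_evenly_1d(ptrdiff_t n0, int nprocs)
{
    assert(n0 > 0 && nprocs > 0);
    ptrdiff_t p2 = (ptrdiff_t) nprocs * (ptrdiff_t) nprocs;
    return n0 % p2 == 0;
}
```

When the helper returns 0, `local_ni' and `local_no' may differ, so code that reuses one count for both input and output should not be relied upon.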
Chris@10
|
3634 For some applications, such as convolutions, the order of the output
|
Chris@10
|
3635 data is irrelevant. In this case, performance can be improved by
|
Chris@10
|
3636 specifying that the output data be stored in an FFTW-defined
|
Chris@10
|
3637 "scrambled" format. (In particular, this is the analogue of transposed
|
Chris@10
|
3638 output in the multidimensional case: scrambled output saves a
|
Chris@10
|
3639 communications step.) If you pass `FFTW_MPI_SCRAMBLED_OUT' in the
|
Chris@10
|
3640 flags, then the output is stored in this (undocumented) scrambled
|
Chris@10
|
3641 order. Conversely, to perform the inverse transform of data in
|
Chris@10
|
3642 scrambled order, pass the `FFTW_MPI_SCRAMBLED_IN' flag.
|
Chris@10
|
3643
|
Chris@10
|
3644 In MPI FFTW, only composite sizes `n0' can be parallelized; we have
|
Chris@10
|
3645 not yet implemented a parallel algorithm for large prime sizes.
|
Chris@10
|
3646
|
Chris@10
|
3647
|
Chris@10
|
3648 File: fftw3.info, Node: Multi-dimensional MPI DFTs of Real Data, Next: Other Multi-dimensional Real-data MPI Transforms, Prev: MPI Data Distribution, Up: Distributed-memory FFTW with MPI
|
Chris@10
|
3649
|
Chris@10
|
3650 6.5 Multi-dimensional MPI DFTs of Real Data
|
Chris@10
|
3651 ===========================================
|
Chris@10
|
3652
|
Chris@10
|
3653 FFTW's MPI interface also supports multi-dimensional DFTs of real data,
|
Chris@10
|
3654 similar to the serial r2c and c2r interfaces. (Parallel
|
Chris@10
|
3655 one-dimensional real-data DFTs are not currently supported; you must
|
Chris@10
|
3656 use a complex transform and set the imaginary parts of the inputs to
|
Chris@10
|
3657 zero.)
|
Chris@10
|
3658
|
Chris@10
|
3659 The key points to understand for r2c and c2r MPI transforms (compared
|
Chris@10
|
3660 to the MPI complex DFTs or the serial r2c/c2r transforms) are:
|
Chris@10
|
3661
|
Chris@10
|
3662 * Just as for serial transforms, r2c/c2r DFTs transform n[0] x n[1]
|
Chris@10
|
3663 x n[2] x ... x n[d-1] real data to/from n[0] x n[1] x n[2] x ...
|
Chris@10
|
3664 x (n[d-1]/2 + 1) complex data: the last dimension of the complex
|
Chris@10
|
3665 data is cut in half (rounded down), plus one. As for the serial
|
Chris@10
|
3666 transforms, the sizes you pass to the `plan_dft_r2c' and
|
Chris@10
|
3667 `plan_dft_c2r' are the n[0] x n[1] x n[2] x ... x n[d-1]
|
Chris@10
|
3668 dimensions of the real data.
|
Chris@10
|
3669
|
Chris@10
|
3670 * Although the real data is _conceptually_ n[0] x n[1] x n[2] x ...
|
Chris@10
|
3671 x n[d-1] , it is _physically_ stored as an n[0] x n[1] x n[2] x
|
Chris@10
|
3672 ... x [2 (n[d-1]/2 + 1)] array, where the last dimension has been
|
Chris@10
|
3673 _padded_ to make it the same size as the complex output. This is
|
Chris@10
|
3674 much like the in-place serial r2c/c2r interface (*note
|
Chris@10
|
3675 Multi-Dimensional DFTs of Real Data::), except that in MPI the
|
Chris@10
|
3676 padding is required even for out-of-place data. The extra padding
|
Chris@10
|
3677 numbers are ignored by FFTW (they are _not_ like zero-padding the
|
Chris@10
|
3678 transform to a larger size); they are only used to determine the
|
Chris@10
|
3679 data layout.
|
Chris@10
|
3680
|
Chris@10
|
3681 * The data distribution in MPI for _both_ the real and complex data
|
Chris@10
|
3682 is determined by the shape of the _complex_ data. That is, you
|
Chris@10
|
3683 call the appropriate `local_size' function for the n[0] x n[1] x
|
Chris@10
|
3684 n[2] x ... x (n[d-1]/2 + 1)
|
Chris@10
|
3685
|
Chris@10
|
3686 complex data, and then use the _same_ distribution for the real
|
Chris@10
|
3687 data except that the last complex dimension is replaced by a
|
Chris@10
|
3688 (padded) real dimension of twice the length.
|
Chris@10
|
3689
|
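The size arithmetic in the three points above can be sketched in plain C. The helper names `complex_last_dim' and `padded_real_last_dim' are hypothetical (they are not FFTW functions); they only restate the n[d-1]/2 + 1 and 2 (n[d-1]/2 + 1) formulas, with integer division rounding down.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: last dimension of the complex data for a real
   last dimension n, i.e. n/2 + 1 with the division rounded down. */
static ptrdiff_t complex_last_dim(ptrdiff_t n)
{
    assert(n > 0);
    return n / 2 + 1;
}

/* Hypothetical helper: last dimension of the padded real array, which
   is twice the length of the complex last dimension. */
static ptrdiff_t padded_real_last_dim(ptrdiff_t n)
{
    return 2 * complex_last_dim(n);
}
```

For n = 8 this gives 5 complex values and 10 real slots per row (2 of them padding); for n = 7 it gives 4 complex values and 8 real slots (1 of them padding).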
Chris@10
|
3690
|
Chris@10
|
3691 For example, suppose we are performing an out-of-place r2c transform
|
Chris@10
|
3692 of L x M x N real data [padded to L x M x 2(N/2+1) ], resulting in L x
|
Chris@10
|
3693 M x (N/2+1) complex data. Similar to the example in *note 2d MPI
|
Chris@10
|
3694 example::, we might do something like:
|
Chris@10
|
3695
|
Chris@10
|
3696 #include <fftw3-mpi.h>
|
Chris@10
|
3697
|
Chris@10
|
3698 int main(int argc, char **argv)
|
Chris@10
|
3699 {
|
Chris@10
|
3700 const ptrdiff_t L = ..., M = ..., N = ...;
|
Chris@10
|
3701 fftw_plan plan;
|
Chris@10
|
3702 double *rin;
|
Chris@10
|
3703 fftw_complex *cout;
|
Chris@10
|
3704 ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k;
|
Chris@10
|
3705
|
Chris@10
|
3706 MPI_Init(&argc, &argv);
|
Chris@10
|
3707 fftw_mpi_init();
|
Chris@10
|
3708
|
Chris@10
|
3709 /* get local data size and allocate */
|
Chris@10
|
3710 alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD,
|
Chris@10
|
3711 &local_n0, &local_0_start);
|
Chris@10
|
3712 rin = fftw_alloc_real(2 * alloc_local);
|
Chris@10
|
3713 cout = fftw_alloc_complex(alloc_local);
|
Chris@10
|
3714
|
Chris@10
|
3715 /* create plan for out-of-place r2c DFT */
|
Chris@10
|
3716 plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
|
Chris@10
|
3717 FFTW_MEASURE);
|
Chris@10
|
3718
|
Chris@10
|
3719 /* initialize rin to some function my_func(x,y,z) */
|
Chris@10
|
3720 for (i = 0; i < local_n0; ++i)
|
Chris@10
|
3721 for (j = 0; j < M; ++j)
|
Chris@10
|
3722 for (k = 0; k < N; ++k)
|
Chris@10
|
3723 rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k);
|
Chris@10
|
3724
|
Chris@10
|
3725 /* compute transforms as many times as desired */
|
Chris@10
|
3726 fftw_execute(plan);
|
Chris@10
|
3727
|
Chris@10
|
3728 fftw_destroy_plan(plan);
|
Chris@10
|
3729
|
Chris@10
|
3730 MPI_Finalize();
|
Chris@10
|
3731 }
|
Chris@10
|
3732
|
Chris@10
|
3733 Note that we allocated `rin' using `fftw_alloc_real' with an
|
Chris@10
|
3734 argument of `2 * alloc_local': since `alloc_local' is the number of
|
Chris@10
|
3735 _complex_ values to allocate, the number of _real_ values is twice as
|
Chris@10
|
3736 many. The `rin' array is then local_n0 x M x 2(N/2+1) in row-major
|
Chris@10
|
3737 order, so its `(i,j,k)' element is at the index `(i*M + j) *
|
Chris@10
|
3738 (2*(N/2+1)) + k' (*note Multi-dimensional Array Format::).
|
Chris@10
|
3739
|
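The index formula quoted above can be isolated in a small function, which makes the role of the padding easier to see. This is a sketch in plain C; the name `padded_index' is hypothetical and the function performs no FFTW calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: offset of the (i,j,k) real element inside a
   local_n0 x M x 2(N/2+1) padded array stored in row-major order.
   Only 0 <= k < N addresses real data; the remaining slots of each
   row are padding. */
static ptrdiff_t padded_index(ptrdiff_t i, ptrdiff_t j, ptrdiff_t k,
                              ptrdiff_t M, ptrdiff_t N)
{
    assert(0 <= i && 0 <= j && j < M && 0 <= k && k < N);
    return (i * M + j) * (2 * (N / 2 + 1)) + k;
}
```

With M = 3 and N = 4 each row occupies 6 slots, so consecutive rows start at offsets 0, 6, 12, and so on, rather than at multiples of N.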
Chris@10
|
3740 As for the complex transforms, improved performance can be obtained
|
Chris@10
|
3741 by specifying that the output is the transpose of the input or vice
|
Chris@10
|
3742 versa (*note Transposed distributions::). In our L x M x N r2c
|
Chris@10
|
3743 example, including `FFTW_MPI_TRANSPOSED_OUT' in the flags means that the
|
Chris@10
|
3744 input would be a padded L x M x 2(N/2+1) real array distributed over
|
Chris@10
|
3745 the `L' dimension, while the output would be an M x L x (N/2+1) complex
|
Chris@10
|
3746 array distributed over the `M' dimension. To perform the inverse c2r
|
Chris@10
|
3747 transform with the same data distributions, you would use the
|
Chris@10
|
3748 `FFTW_MPI_TRANSPOSED_IN' flag.
|
Chris@10
|
3749
|
Chris@10
|
3750
|
Chris@10
|
3751 File: fftw3.info, Node: Other Multi-dimensional Real-data MPI Transforms, Next: FFTW MPI Transposes, Prev: Multi-dimensional MPI DFTs of Real Data, Up: Distributed-memory FFTW with MPI
|
Chris@10
|
3752
|
Chris@10
|
3753 6.6 Other multi-dimensional Real-Data MPI Transforms
|
Chris@10
|
3754 ====================================================
|
Chris@10
|
3755
|
Chris@10
|
3756 FFTW's MPI interface also supports multi-dimensional `r2r' transforms
|
Chris@10
|
3757 of all kinds supported by the serial interface (e.g. discrete cosine
|
Chris@10
|
3758 and sine transforms, discrete Hartley transforms, etc.). Only
|
Chris@10
|
3759 multi-dimensional `r2r' transforms, not one-dimensional transforms, are
|
Chris@10
|
3760 currently parallelized.
|
Chris@10
|
3761
|
Chris@10
|
3762 These are used much like the multidimensional complex DFTs discussed
|
Chris@10
|
3763 above, except that the data is real rather than complex, and one needs
|
Chris@10
|
3764 to pass an r2r transform kind (`fftw_r2r_kind') for each dimension as
|
Chris@10
|
3765 in the serial FFTW (*note More DFTs of Real Data::).
|
Chris@10
|
3766
|
Chris@10
|
3767 For example, one might perform a two-dimensional L x M transform that is an
|
Chris@10
|
3768 REDFT10 (DCT-II) in the first dimension and an RODFT10 (DST-II) in the
|
Chris@10
|
3769 second dimension with code like:
|
Chris@10
|
3770
|
Chris@10
|
3771 const ptrdiff_t L = ..., M = ...;
|
Chris@10
|
3772 fftw_plan plan;
|
Chris@10
|
3773 double *data;
|
Chris@10
|
3774 ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
|
Chris@10
|
3775
|
Chris@10
|
3776 /* get local data size and allocate */
|
Chris@10
|
3777 alloc_local = fftw_mpi_local_size_2d(L, M, MPI_COMM_WORLD,
|
Chris@10
|
3778 &local_n0, &local_0_start);
|
Chris@10
|
3779 data = fftw_alloc_real(alloc_local);
|
Chris@10
|
3780
|
Chris@10
|
3781 /* create plan for in-place REDFT10 x RODFT10 */
|
Chris@10
|
3782 plan = fftw_mpi_plan_r2r_2d(L, M, data, data, MPI_COMM_WORLD,
|
Chris@10
|
3783 FFTW_REDFT10, FFTW_RODFT10, FFTW_MEASURE);
|
Chris@10
|
3784
|
Chris@10
|
3785 /* initialize data to some function my_function(x,y) */
|
Chris@10
|
3786 for (i = 0; i < local_n0; ++i) for (j = 0; j < M; ++j)
|
Chris@10
|
3787 data[i*M + j] = my_function(local_0_start + i, j);
|
Chris@10
|
3788
|
Chris@10
|
3789 /* compute transforms, in-place, as many times as desired */
|
Chris@10
|
3790 fftw_execute(plan);
|
Chris@10
|
3791
|
Chris@10
|
3792 fftw_destroy_plan(plan);
|
Chris@10
|
3793
|
Chris@10
|
3794 Notice that we use the same `local_size' functions as we did for
|
Chris@10
|
3795 complex data, only now we interpret the sizes in terms of real rather
|
Chris@10
|
3796 than complex values, and correspondingly use `fftw_alloc_real'.
|
Chris@10
|
3797
|
Chris@10
|
3798
|
Chris@10
|
3799 File: fftw3.info, Node: FFTW MPI Transposes, Next: FFTW MPI Wisdom, Prev: Other Multi-dimensional Real-data MPI Transforms, Up: Distributed-memory FFTW with MPI
|
Chris@10
|
3800
|
Chris@10
|
3801 6.7 FFTW MPI Transposes
|
Chris@10
|
3802 =======================
|
Chris@10
|
3803
|
Chris@10
|
3804 FFTW's MPI Fourier transforms rely on one or more _global
|
Chris@10
|
3805 transposition_ steps for their communications. For example, the
|
Chris@10
|
3806 multidimensional transforms work by transforming along some dimensions,
|
Chris@10
|
3807 then transposing to make the first dimension local and transforming
|
Chris@10
|
3808 that, then transposing back. Because global transposition of a
|
Chris@10
|
3809 block-distributed matrix has many other potential uses besides FFTs,
|
Chris@10
|
3810 FFTW's transpose routines can be called directly, as documented in this
|
Chris@10
|
3811 section.
|
Chris@10
|
3812
|
Chris@10
|
3813 * Menu:
|
Chris@10
|
3814
|
Chris@10
|
3815 * Basic distributed-transpose interface::
|
Chris@10
|
3816 * Advanced distributed-transpose interface::
|
Chris@10
|
3817 * An improved replacement for MPI_Alltoall::
|
Chris@10
|
3818
|
Chris@10
|
3819
|
Chris@10
|
3820 File: fftw3.info, Node: Basic distributed-transpose interface, Next: Advanced distributed-transpose interface, Prev: FFTW MPI Transposes, Up: FFTW MPI Transposes
|
Chris@10
|
3821
|
Chris@10
|
3822 6.7.1 Basic distributed-transpose interface
|
Chris@10
|
3823 -------------------------------------------
|
Chris@10
|
3824
|
Chris@10
|
3825 In particular, suppose that we have an `n0' by `n1' array in row-major
|
Chris@10
|
3826 order, block-distributed across the `n0' dimension. To transpose this
|
Chris@10
|
3827 into an `n1' by `n0' array block-distributed across the `n1' dimension,
|
Chris@10
|
3828 we would create a plan by calling the following function:
|
Chris@10
|
3829
|
Chris@10
|
3830 fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
|
Chris@10
|
3831 double *in, double *out,
|
Chris@10
|
3832 MPI_Comm comm, unsigned flags);
|
Chris@10
|
3833
|
Chris@10
|
3834 The input and output arrays (`in' and `out') can be the same. The
|
Chris@10
|
3835 transpose is actually executed by calling `fftw_execute' on the plan,
|
Chris@10
|
3836 as usual.
|
Chris@10
|
3837
|
Chris@10
|
3838 The `flags' are the usual FFTW planner flags, but support two
|
Chris@10
|
3839 additional flags: `FFTW_MPI_TRANSPOSED_OUT' and/or
|
Chris@10
|
3840 `FFTW_MPI_TRANSPOSED_IN'. What these flags indicate, for transpose
|
Chris@10
|
3841 plans, is that the output and/or input, respectively, are _locally_
|
Chris@10
|
3842 transposed. That is, on each process input data is normally stored as
|
Chris@10
|
3843 a `local_n0' by `n1' array in row-major order, but for an
|
Chris@10
|
3844 `FFTW_MPI_TRANSPOSED_IN' plan the input data is stored as `n1' by
|
Chris@10
|
3845 `local_n0' in row-major order. Similarly, `FFTW_MPI_TRANSPOSED_OUT'
|
Chris@10
|
3846 means that the output is `n0' by `local_n1' instead of `local_n1' by
|
Chris@10
|
3847 `n0'.
|
Chris@10
|
3848
|
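The two local input layouts described above differ only in which index varies fastest. A minimal sketch in plain C, assuming `i' is the row index local to the process (0 <= i < local_n0) and `j' is the column index (0 <= j < n1); the name `input_offset' is hypothetical and not part of FFTW.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: offset of local element (i,j) in the input of
   a transpose plan.  Without FFTW_MPI_TRANSPOSED_IN the local block is
   local_n0 by n1 in row-major order; with it, the block is stored as
   n1 by local_n0 in row-major order. */
static ptrdiff_t input_offset(ptrdiff_t i, ptrdiff_t j,
                              ptrdiff_t local_n0, ptrdiff_t n1,
                              int transposed_in)
{
    assert(0 <= i && i < local_n0 && 0 <= j && j < n1);
    return transposed_in ? j * local_n0 + i : i * n1 + j;
}
```

The output side is analogous, with `local_n1' and `n0' in place of `local_n0' and `n1'.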
Chris@10
|
3849 To determine the local size of the array on each process before and
|
Chris@10
|
3850 after the transpose, as well as the amount of storage that must be
|
Chris@10
|
3851 allocated, one should call `fftw_mpi_local_size_2d_transposed', just as
|
Chris@10
|
3852 for a 2d DFT as described in the previous section:
|
Chris@10
|
3853
|
Chris@10
|
3854 ptrdiff_t fftw_mpi_local_size_2d_transposed
|
Chris@10
|
3855 (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
|
Chris@10
|
3856 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
|
Chris@10
|
3857 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
|
Chris@10
|
3858
|
Chris@10
|
3859 Again, the return value is the local storage to allocate, which in
|
Chris@10
|
3860 this case is the number of _real_ (`double') values rather than complex
|
Chris@10
|
3861 numbers as in the previous examples.
|
Chris@10
|
3862
|
Chris@10
|
3863
|
Chris@10
|
3864 File: fftw3.info, Node: Advanced distributed-transpose interface, Next: An improved replacement for MPI_Alltoall, Prev: Basic distributed-transpose interface, Up: FFTW MPI Transposes
|
Chris@10
|
3865
|
Chris@10
|
3866 6.7.2 Advanced distributed-transpose interface
|
Chris@10
|
3867 ----------------------------------------------
|
Chris@10
|
3868
|
Chris@10
|
3869 The above routines are for a transpose of a matrix of numbers (of type
|
Chris@10
|
3870 `double'), using FFTW's default block sizes. More generally, one can
|
Chris@10
|
3871 perform transposes of _tuples_ of numbers, with user-specified block
|
Chris@10
|
3872 sizes for the input and output:
|
Chris@10
|
3873
|
Chris@10
|
3874 fftw_plan fftw_mpi_plan_many_transpose
|
Chris@10
|
3875 (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
|
Chris@10
|
3876 ptrdiff_t block0, ptrdiff_t block1,
|
Chris@10
|
3877 double *in, double *out, MPI_Comm comm, unsigned flags);
|
Chris@10
|
3878
|
Chris@10
|
3879 In this case, one is transposing an `n0' by `n1' matrix of
|
Chris@10
|
3880 `howmany'-tuples (e.g. `howmany = 2' for complex numbers). The input
|
Chris@10
|
3881 is distributed along the `n0' dimension with block size `block0', and
|
Chris@10
|
3882 the `n1' by `n0' output is distributed along the `n1' dimension with
|
Chris@10
|
3883 block size `block1'. If `FFTW_MPI_DEFAULT_BLOCK' (0) is passed for a
|
Chris@10
|
3884 block size then FFTW uses its default block size. To get the local
|
Chris@10
|
3885 size of the data on each process, you should then call
|
Chris@10
|
3886 `fftw_mpi_local_size_many_transposed'.
|
Chris@10
|
3887
|
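As an illustration of what a matrix of `howmany'-tuples looks like in memory, the sketch below computes the offset of one component of one tuple. It assumes that the components of each tuple are stored contiguously (interleaved), as is the case for complex numbers stored as pairs of `double'; this layout and the name `tuple_offset' are assumptions made for the example, not statements taken from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: offset of component c of the tuple at row i,
   column j of a row-major matrix with n1 columns, assuming the howmany
   components of each tuple are stored contiguously. */
static ptrdiff_t tuple_offset(ptrdiff_t i, ptrdiff_t j, ptrdiff_t c,
                              ptrdiff_t n1, ptrdiff_t howmany)
{
    assert(0 <= i && 0 <= j && j < n1 && 0 <= c && c < howmany);
    return (i * n1 + j) * howmany + c;
}
```

With `howmany = 2', component 0 would be the real part and component 1 the imaginary part of a complex entry.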
Chris@10
|
3888
|
Chris@10
|
3889 File: fftw3.info, Node: An improved replacement for MPI_Alltoall, Prev: Advanced distributed-transpose interface, Up: FFTW MPI Transposes
|
Chris@10
|
3890
|
Chris@10
|
3891 6.7.3 An improved replacement for MPI_Alltoall
|
Chris@10
|
3892 ----------------------------------------------
|
Chris@10
|
3893
|
Chris@10
|
3894 We close this section by noting that FFTW's MPI transpose routines can
|
Chris@10
|
3895 be thought of as a generalization of the `MPI_Alltoall' function
|
Chris@10
|
3896 (albeit only for floating-point types), and in some circumstances can
|
Chris@10
|
3897 function as an improved replacement.
|
Chris@10
|
3898
|
Chris@10
|
3899 `MPI_Alltoall' is defined by the MPI standard as:
|
Chris@10
|
3900
|
Chris@10
|
3901 int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
|
Chris@10
|
3902 void *recvbuf, int recvcount, MPI_Datatype recvtype,
|
Chris@10
|
3903 MPI_Comm comm);
|
Chris@10
|
3904
|
Chris@10
|
3905 In particular, for `double*' arrays `in' and `out', consider the
|
Chris@10
|
3906 call:
|
Chris@10
|
3907
|
Chris@10
|
3908 MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm);
|
Chris@10
|
3909
|
Chris@10
|
3910 This is completely equivalent to:
|
Chris@10
|
3911
|
Chris@10
|
3912 MPI_Comm_size(comm, &P);
|
Chris@10
|
3913 plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE);
|
Chris@10
|
3914 fftw_execute(plan);
|
Chris@10
|
3915 fftw_destroy_plan(plan);
|
Chris@10
|
3916
|
Chris@10
|
3917 That is, computing a P x P transpose on `P' processes, with a block
|
Chris@10
|
3918 size of 1, is just a standard all-to-all communication.
|
Chris@10
|
3919
|
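To see why a P x P transpose with block size 1 amounts to an all-to-all, the exchange can be modelled serially, with no MPI calls at all. The sketch below is only such a model: the name `alltoall_model' is hypothetical, and the P per-process buffers are simply laid end to end in one array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Serial model of an all-to-all exchange among P "processes" (no MPI
   involved).  send and recv each hold P buffers laid end to end, and
   each buffer holds P chunks of howmany doubles.  Chunk q of buffer p
   is delivered to chunk p of buffer q, which is a P x P transpose of
   howmany-tuples. */
static void alltoall_model(const double *send, double *recv,
                           int P, int howmany)
{
    assert(send != recv && P > 0 && howmany > 0);
    for (int p = 0; p < P; ++p)
        for (int q = 0; q < P; ++q)
            memcpy(recv + ((size_t) q * P + p) * howmany,
                   send + ((size_t) p * P + q) * howmany,
                   (size_t) howmany * sizeof(double));
}
```

Each "process" p owns row p of the P x P matrix before the exchange and row p of its transpose afterwards, which matches the `n0 = n1 = P', `block0 = block1 = 1' arguments in the plan above.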
Chris@10
|
3920 However, using the FFTW routine instead of `MPI_Alltoall' may have
|
Chris@10
|
3921 certain advantages. First of all, FFTW's routine can operate in-place
|
Chris@10
|
3922 (`in == out') whereas `MPI_Alltoall' can only operate out-of-place.
|
Chris@10
|
3923
|
Chris@10
|
3924 Second, even for out-of-place plans, FFTW's routine may be faster,
|
Chris@10
|
3925 especially if you need to perform the all-to-all communication many
|
Chris@10
|
3926 times and can afford to use `FFTW_MEASURE' or `FFTW_PATIENT'. It
|
Chris@10
|
3927 should certainly be no slower, not including the time to create the
|
Chris@10
|
3928 plan, since one of the possible algorithms that FFTW uses for an
|
Chris@10
|
3929 out-of-place transpose _is_ simply to call `MPI_Alltoall'. However,
|
Chris@10
|
3930 FFTW also considers several other possible algorithms that, depending
|
Chris@10
|
3931 on your MPI implementation and your hardware, may be faster.
|
Chris@10
|
3932
|
Chris@10
|
3933
|
Chris@10
|
3934 File: fftw3.info, Node: FFTW MPI Wisdom, Next: Avoiding MPI Deadlocks, Prev: FFTW MPI Transposes, Up: Distributed-memory FFTW with MPI
|
Chris@10
|
3935
|
Chris@10
|
3936 6.8 FFTW MPI Wisdom
|
Chris@10
|
3937 ===================
|
Chris@10
|
3938
|
Chris@10
|
3939 FFTW's "wisdom" facility (*note Words of Wisdom-Saving Plans::) can be
|
Chris@10
|
3940 used to save MPI plans as well as to save uniprocessor plans. However,
|
Chris@10
|
3941 for MPI there are several unavoidable complications.
|
Chris@10
|
3942
|
Chris@10
|
3943 First, the MPI standard does not guarantee that every process can
|
Chris@10
|
3944 perform file I/O (at least, not using C stdio routines)--in general, we
|
Chris@10
|
3945 may only assume that process 0 is capable of I/O.(1) So, if we want to
|
Chris@10
|
3946 export the wisdom from a single process to a file, we must first export
|
Chris@10
|
3947 the wisdom to a string, then send it to process 0, then write it to a
|
Chris@10
|
3948 file.
|
Chris@10
|
3949
|
Chris@10
|
3950 Second, in principle we may want to have separate wisdom for every
|
Chris@10
|
3951 process, since in general the processes may run on different hardware
|
Chris@10
|
3952 even for a single MPI program. However, in practice FFTW's MPI code is
|
Chris@10
|
3953 designed for the case of homogeneous hardware (*note Load balancing::),
|
Chris@10
|
3954 and in this case it is convenient to use the same wisdom for every
|
Chris@10
|
3955 process. Thus, we need a mechanism to synchronize the wisdom.
|
Chris@10
|
3956
|
Chris@10
|
3957 To address both of these problems, FFTW provides the following two
|
Chris@10
|
3958 functions:
|
Chris@10
|
3959
|
Chris@10
|
3960 void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
|
Chris@10
|
3961 void fftw_mpi_gather_wisdom(MPI_Comm comm);
|
Chris@10
|
3962
|
Chris@10
|
3963 Given a communicator `comm', `fftw_mpi_broadcast_wisdom' will
|
Chris@10
|
3964 broadcast the wisdom from process 0 to all other processes.
|
Chris@10
|
3965 Conversely, `fftw_mpi_gather_wisdom' will collect wisdom from all
|
Chris@10
|
3966 processes onto process 0. (If the plans created for the same problem
|
Chris@10
|
3967 by different processes are not the same, `fftw_mpi_gather_wisdom' will
|
Chris@10
|
3968 arbitrarily choose one of the plans.) Both of these functions may
|
Chris@10
|
3969 result in suboptimal plans for different processes if the processes are
|
Chris@10
|
3970 running on non-identical hardware. Both of these functions are
|
Chris@10
|
3971 _collective_ calls, which means that they must be executed by all
|
Chris@10
|
3972 processes in the communicator.
|
Chris@10
|
3973
|
Chris@10
|
3974 So, for example, a typical code snippet to import wisdom from a file
|
Chris@10
|
3975 and use it on all processes would be:
|
Chris@10
|
3976
|
Chris@10
|
3977 {
|
Chris@10
|
3978 int rank;
|
Chris@10
|
3979
|
Chris@10
|
3980 fftw_mpi_init();
|
Chris@10
|
3981 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
|
Chris@10
|
3982 if (rank == 0) fftw_import_wisdom_from_filename("mywisdom");
|
Chris@10
|
3983 fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD);
|
Chris@10
|
3984 }
|
Chris@10
|
3985
|
Chris@10
|
3986 (Note that we must call `fftw_mpi_init' before importing any wisdom
|
Chris@10
|
3987 that might contain MPI plans.) Similarly, a typical code snippet to
|
Chris@10
|
3988 export wisdom from all processes to a file is:
|
Chris@10
|
3989
|
Chris@10
|
3990 {
|
Chris@10
|
3991 int rank;
|
Chris@10
|
3992
|
Chris@10
|
3993 fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
|
Chris@10
|
3994 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
|
Chris@10
|
3995 if (rank == 0) fftw_export_wisdom_to_filename("mywisdom");
|
Chris@10
|
3996 }
|
Chris@10
|
3997
|
Chris@10
|
3998 ---------- Footnotes ----------
|
Chris@10
|
3999
|
Chris@10
|
4000 (1) In fact, even this assumption is not technically guaranteed by
|
Chris@10
|
4001 the standard, although it seems to be universal in actual MPI
|
Chris@10
|
4002 implementations and is widely assumed by MPI-using software.
|
Chris@10
|
4003 Technically, you need to query the `MPI_IO' attribute of
|
Chris@10
|
4004 `MPI_COMM_WORLD' with `MPI_Attr_get'. If this attribute is
|
Chris@10
|
4005 `MPI_PROC_NULL', no I/O is possible. If it is `MPI_ANY_SOURCE', any
|
Chris@10
|
4006 process can perform I/O. Otherwise, it is the rank of a process that
|
Chris@10
|
4007 can perform I/O ... but since it is not guaranteed to yield the _same_
|
Chris@10
|
4008 rank on all processes, you have to do an `MPI_Allreduce' of some kind
|
Chris@10
|
4009 if you want all processes to agree about which is going to do I/O. And
|
Chris@10
|
4010 even then, the standard only guarantees that this process can perform
|
Chris@10
|
4011 output, but not input. See e.g. `Parallel Programming with MPI' by P.
|
Chris@10
|
4012 S. Pacheco, section 8.1.3. Needless to say, in our experience
|
Chris@10
|
4013 virtually no MPI programmers worry about this.
|
Chris@10
|
4014
|
Chris@10
|
4015
|
Chris@10
|
4016 File: fftw3.info, Node: Avoiding MPI Deadlocks, Next: FFTW MPI Performance Tips, Prev: FFTW MPI Wisdom, Up: Distributed-memory FFTW with MPI
|
Chris@10
|
4017
|
Chris@10
|
4018 6.9 Avoiding MPI Deadlocks
|
Chris@10
|
4019 ==========================
|
Chris@10
|
4020
|
Chris@10
|
4021 An MPI program can _deadlock_ if one process is waiting for a message
|
Chris@10
|
4022 from another process that never gets sent. To avoid deadlocks when
|
Chris@10
|
4023 using FFTW's MPI routines, it is important to know which functions are
|
Chris@10
|
4024 _collective_: that is, which functions must _always_ be called in the
|
Chris@10
|
4025 _same order_ from _every_ process in a given communicator. (For
|
Chris@10
|
4026 example, `MPI_Barrier' is the canonical collective
|
Chris@10
|
4027 function in the MPI standard.)
|
Chris@10
|
4028
|
Chris@10
|
4029 The functions in FFTW that are _always_ collective are: every
|
Chris@10
|
4030 function beginning with `fftw_mpi_plan', as well as
|
Chris@10
|
4031 `fftw_mpi_broadcast_wisdom' and `fftw_mpi_gather_wisdom'. Also, the
|
Chris@10
|
4032 following functions from the ordinary FFTW interface are collective
|
Chris@10
|
4033 when they are applied to a plan created by an `fftw_mpi_plan' function:
|
Chris@10
|
4034 `fftw_execute', `fftw_destroy_plan', and `fftw_flops'.
|
Chris@10
|
4035
|
Chris@10
|
4036
|
Chris@10
|
4037 File: fftw3.info, Node: FFTW MPI Performance Tips, Next: Combining MPI and Threads, Prev: Avoiding MPI Deadlocks, Up: Distributed-memory FFTW with MPI
|
Chris@10
|
4038
|
Chris@10
|
4039 6.10 FFTW MPI Performance Tips
|
Chris@10
|
4040 ==============================
|
Chris@10
|
4041
|
Chris@10
|
4042 In this section, we collect a few tips on getting the best performance
|
Chris@10
|
4043 out of FFTW's MPI transforms.
|
Chris@10
|
4044
|
Chris@10
|
4045 First, because of the 1d block distribution, FFTW's parallelization
|
Chris@10
|
4046 is currently limited by the size of the first dimension.
|
Chris@10
|
4047 (Multidimensional block distributions may be supported by a future
|
Chris@10
|
4048 version.) More generally, you should ideally arrange the dimensions so
|
Chris@10
|
4049 that FFTW can divide them equally among the processes. *Note Load
|
Chris@10
|
4050 balancing::.
|
Chris@10
|
4051
|
Chris@10
|
4052 Second, if it is not too inconvenient, you should consider working
|
Chris@10
|
4053 with transposed output for multidimensional plans, as this saves a
|
Chris@10
|
4054 considerable amount of communications. *Note Transposed
|
Chris@10
|
4055 distributions::.
|
Chris@10
|
4056
|
Chris@10
|
4057 Third, the fastest choices are generally either an in-place transform
|
Chris@10
|
4058 or an out-of-place transform with the `FFTW_DESTROY_INPUT' flag (which
|
Chris@10
|
4059 allows the input array to be used as scratch space). In-place is
|
Chris@10
|
4060 especially beneficial if the amount of data per process is large.
|
Chris@10
|
4061
|
Chris@10
|
4062 Fourth, if you have multiple arrays to transform at once, rather than
|
Chris@10
|
4063 calling FFTW's MPI transforms several times it usually seems to be
|
Chris@10
|
4064 faster to interleave the data and use the advanced interface. (This
|
Chris@10
|
4065 groups the communications together instead of requiring separate
|
Chris@10
|
4066 messages for each transform.)
|
Chris@10
|
4067
|
Chris@10
|
4068
|
Chris@10
|
4069 File: fftw3.info, Node: Combining MPI and Threads, Next: FFTW MPI Reference, Prev: FFTW MPI Performance Tips, Up: Distributed-memory FFTW with MPI
|
Chris@10
|
4070
|
Chris@10
|
4071 6.11 Combining MPI and Threads
|
Chris@10
|
4072 ==============================
|
Chris@10
|
4073
|
Chris@10
|
4074 In certain cases, it may be advantageous to combine MPI
|
Chris@10
|
4075 (distributed-memory) and threads (shared-memory) parallelization. FFTW
|
Chris@10
|
4076 supports this, with certain caveats. For example, if you have a
|
Chris@10
|
4077 cluster of 4-processor shared-memory nodes, you may want to use threads
|
Chris@10
|
4078 within the nodes and MPI between the nodes, instead of MPI for all
|
Chris@10
|
4079 parallelization.
|
Chris@10
|
4080
|
Chris@10
|
4081 In particular, it is possible to seamlessly combine the MPI FFTW
|
Chris@10
|
4082 routines with the multi-threaded FFTW routines (*note Multi-threaded
|
Chris@10
|
4083 FFTW::). However, some care must be taken in the initialization code,
|
Chris@10
|
4084 which should look something like this:
|
Chris@10
|
4085
|
Chris@10
|
4086 int threads_ok;
|
Chris@10
|
4087
|
Chris@10
|
4088 int main(int argc, char **argv)
|
Chris@10
|
4089 {
|
Chris@10
|
4090 int provided;
|
Chris@10
|
4091 MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
|
Chris@10
|
4092 threads_ok = provided >= MPI_THREAD_FUNNELED;
|
Chris@10
|
4093
|
Chris@10
|
4094 if (threads_ok) threads_ok = fftw_init_threads();
|
Chris@10
|
4095 fftw_mpi_init();
|
Chris@10
|
4096
|
Chris@10
|
4097 ...
|
Chris@10
|
4098 if (threads_ok) fftw_plan_with_nthreads(...);
|
Chris@10
|
4099 ...
|
Chris@10
|
4100
|
Chris@10
|
4101 MPI_Finalize();
|
Chris@10
|
4102 }
|
Chris@10
|
4103
|
Chris@10
|
4104 First, note that instead of calling `MPI_Init', you should call
|
Chris@10
|
4105 `MPI_Init_thread', which is the initialization routine defined by the
|
Chris@10
|
4106 MPI-2 standard to indicate to MPI that your program will be
|
Chris@10
|
4107 multithreaded. We pass `MPI_THREAD_FUNNELED', which indicates that we
|
Chris@10
|
4108 will only call MPI routines from the main thread. (FFTW will launch
|
Chris@10
|
4109 additional threads internally, but the extra threads will not call MPI
|
Chris@10
|
4110 code.) (You may also pass `MPI_THREAD_SERIALIZED' or
|
Chris@10
|
4111 `MPI_THREAD_MULTIPLE', which requests additional multithreading support
|
Chris@10
|
4112 from the MPI implementation, but this is not required by FFTW.) The
|
Chris@10
|
4113 `provided' parameter reports the level of thread support actually
|
Chris@10
|
4114 provided by your MPI implementation; this _must_ be at least
|
Chris@10
|
4115 `MPI_THREAD_FUNNELED' if you want to call the FFTW threads routines, so
|
Chris@10
|
4116 we define a global variable `threads_ok' to record this. You should
|
Chris@10
|
4117 only call `fftw_init_threads' or `fftw_plan_with_nthreads' if
|
Chris@10
|
4118 `threads_ok' is true. For more information on thread safety in MPI,
|
Chris@10
|
4119 see the MPI and Threads
|
Chris@10
|
4120 (http://www.mpi-forum.org/docs/mpi-20-html/node162.htm) section of the
|
Chris@10
|
4121 MPI-2 standard.
|
Chris@10
|
4122
|
Chris@10
|
4123 Second, we must call `fftw_init_threads' _before_ `fftw_mpi_init'.
|
Chris@10
|
4124 This is critical for technical reasons having to do with how FFTW
|
Chris@10
|
4125 initializes its list of algorithms.
|
Chris@10
|
4126
|
Chris@10
|
4127 Then, if you call `fftw_plan_with_nthreads(N)', _every_ MPI process
|
Chris@10
|
4128 will launch (up to) `N' threads to parallelize its transforms.
|
Chris@10
|
4129
|
Chris@10
|
4130 For example, in the hypothetical cluster of 4-processor nodes, you
|
Chris@10
|
4131 might wish to launch only a single MPI process per node, and then call
|
Chris@10
|
4132 `fftw_plan_with_nthreads(4)' on each process to use all processors in
|
Chris@10
|
4133 the nodes.
|
Chris@10
|
4134
|
Chris@10
|
4135 This may or may not be faster than simply using as many MPI processes
|
Chris@10
|
4136 as you have processors, however. On the one hand, using threads within
|
Chris@10
|
4137 a node eliminates the need for explicit message passing within the
|
Chris@10
|
4138 node. On the other hand, FFTW's transpose routines are not
|
Chris@10
|
4139 multi-threaded, and this means that the communications that do take
|
Chris@10
|
4140 place will not benefit from parallelization within the node. Moreover,
|
Chris@10
|
4141 many MPI implementations already have optimizations to exploit shared
|
Chris@10
|
4142 memory when it is available, so adding the multithreaded FFTW on top of
|
Chris@10
|
4143 this may be superfluous.
|
Chris@10
|
4144
|
Chris@10
|
4145
|
Chris@10
|
4146 File: fftw3.info, Node: FFTW MPI Reference, Next: FFTW MPI Fortran Interface, Prev: Combining MPI and Threads, Up: Distributed-memory FFTW with MPI
|
Chris@10
|
4147
|
Chris@10
|
4148 6.12 FFTW MPI Reference
|
Chris@10
|
4149 =======================
|
Chris@10
|
4150
|
Chris@10
|
4151 This chapter provides a complete reference to all FFTW MPI functions,
|
Chris@10
|
4152 datatypes, and constants. See also *note FFTW Reference:: for
|
Chris@10
|
4153 information on functions and types in common with the serial interface.
|
Chris@10
|
4154
|
Chris@10
|
4155 * Menu:
|
Chris@10
|
4156
|
Chris@10
|
4157 * MPI Files and Data Types::
|
Chris@10
|
4158 * MPI Initialization::
|
Chris@10
|
4159 * Using MPI Plans::
|
Chris@10
|
4160 * MPI Data Distribution Functions::
|
Chris@10
|
4161 * MPI Plan Creation::
|
Chris@10
|
4162 * MPI Wisdom Communication::
|
Chris@10
|
4163
|
Chris@10
|
4164
|
Chris@10
|
4165 File: fftw3.info, Node: MPI Files and Data Types, Next: MPI Initialization, Prev: FFTW MPI Reference, Up: FFTW MPI Reference
|
Chris@10
|
4166
|
Chris@10
|
4167 6.12.1 MPI Files and Data Types
|
Chris@10
|
4168 -------------------------------
|
Chris@10
|
4169
|
Chris@10
|
4170 All programs using FFTW's MPI support should include its header file:
|
Chris@10
|
4171
|
Chris@10
|
4172 #include <fftw3-mpi.h>
|
Chris@10
|
4173
|
Chris@10
|
4174 Note that this header file includes the serial-FFTW `fftw3.h' header
|
Chris@10
|
4175 file, and also the `mpi.h' header file for MPI, so you need not include
|
Chris@10
|
4176 those files separately.
|
Chris@10
|
4177
|
Chris@10
|
4178 You must also link to _both_ the FFTW MPI library and to the serial
|
Chris@10
|
4179 FFTW library. On Unix, this means adding `-lfftw3_mpi -lfftw3 -lm' at
|
Chris@10
|
4180 the end of the link command.
|
Chris@10
|
4181
|
Chris@10
|
4182 Different precisions are handled as in the serial interface: *Note
|
Chris@10
|
4183 Precision::. That is, `fftw_' functions become `fftwf_' (in single
|
Chris@10
|
4184 precision) etcetera, and the libraries become `-lfftw3f_mpi -lfftw3f
|
Chris@10
|
4185 -lm' etcetera on Unix. Long-double precision is supported in MPI, but
|
Chris@10
|
4186 quad precision (`fftwq_') is not due to the lack of MPI support for
|
Chris@10
|
4187 this type.
|
Chris@10
|
4188
|
Chris@10
|
4189
|
Chris@10
|
4190 File: fftw3.info, Node: MPI Initialization, Next: Using MPI Plans, Prev: MPI Files and Data Types, Up: FFTW MPI Reference
|
Chris@10
|
4191
|
Chris@10
|
4192 6.12.2 MPI Initialization
|
Chris@10
|
4193 -------------------------
|
Chris@10
|
4194
|
Chris@10
|
4195 Before calling any other FFTW MPI (`fftw_mpi_') function, and before
|
Chris@10
|
4196 importing any wisdom for MPI problems, you must call:
|
Chris@10
|
4197
|
Chris@10
|
4198 void fftw_mpi_init(void);
|
Chris@10
|
4199
|
Chris@10
|
4200 If FFTW threads support is used, however, `fftw_mpi_init' should be
|
Chris@10
|
4201 called _after_ `fftw_init_threads' (*note Combining MPI and Threads::).
|
Chris@10
|
4202 Calling `fftw_mpi_init' additional times (before `fftw_mpi_cleanup')
|
Chris@10
|
4203 has no effect.
|
Chris@10
|
4204
|
Chris@10
|
4205 If you want to deallocate all persistent data and reset FFTW to the
|
Chris@10
|
4206 pristine state it was in when you started your program, you can call:
|
Chris@10
|
4207
|
Chris@10
|
4208 void fftw_mpi_cleanup(void);
|
Chris@10
|
4209
|
Chris@10
|
4210 (This calls `fftw_cleanup', so you need not call the serial cleanup
|
Chris@10
|
4211 routine too, although it is safe to do so.) After calling
|
Chris@10
|
4212 `fftw_mpi_cleanup', all existing plans become undefined, and you should
|
Chris@10
|
4213 not attempt to execute or destroy them. You must call `fftw_mpi_init'
|
Chris@10
|
4214 again after `fftw_mpi_cleanup' if you want to resume using the MPI FFTW
|
Chris@10
|
4215 routines.
|
Chris@10
|
4216
|
Chris@10
|
4217
|
Chris@10
|
4218 File: fftw3.info, Node: Using MPI Plans, Next: MPI Data Distribution Functions, Prev: MPI Initialization, Up: FFTW MPI Reference
|
Chris@10
|
4219
|
Chris@10
|
4220 6.12.3 Using MPI Plans
|
Chris@10
|
4221 ----------------------
|
Chris@10
|
4222
|
Chris@10
|
4223 Once an MPI plan is created, you can execute and destroy it using
|
Chris@10
|
4224 `fftw_execute', `fftw_destroy_plan', and the other functions in the
|
Chris@10
|
4225 serial interface that operate on generic plans (*note Using Plans::).
|
Chris@10
|
4226
|
Chris@10
|
4227 The `fftw_execute' and `fftw_destroy_plan' functions, applied to MPI
|
Chris@10
|
4228 plans, are _collective_ calls: they must be called for all processes in
|
Chris@10
|
4229 the communicator that was used to create the plan.
|
Chris@10
|
4230
|
Chris@10
|
4231 You must _not_ use the serial new-array plan-execution functions
|
Chris@10
|
4232 `fftw_execute_dft' and so on (*note New-array Execute Functions::) with
|
Chris@10
|
4233 MPI plans. Such functions are specialized to the problem type, and
|
Chris@10
|
4234 there are specific new-array execute functions for MPI plans:
|
Chris@10
|
4235
|
Chris@10
|
4236 void fftw_mpi_execute_dft(fftw_plan p, fftw_complex *in, fftw_complex *out);
|
Chris@10
|
4237 void fftw_mpi_execute_dft_r2c(fftw_plan p, double *in, fftw_complex *out);
|
Chris@10
|
4238 void fftw_mpi_execute_dft_c2r(fftw_plan p, fftw_complex *in, double *out);
|
Chris@10
|
4239 void fftw_mpi_execute_r2r(fftw_plan p, double *in, double *out);
|
Chris@10
|
4240
|
Chris@10
|
4241 These functions have the same restrictions as those of the serial
|
Chris@10
|
4242 new-array execute functions. They are _always_ safe to apply to the
|
Chris@10
|
4243 _same_ `in' and `out' arrays that were used to create the plan. They
|
Chris@10
|
4244 can only be applied to new arrays if those arrays have the same types,
|
Chris@10
|
4245 dimensions, in-placeness, and alignment as the original arrays, where
|
Chris@10
|
4246 the best way to ensure the same alignment is to use FFTW's
|
Chris@10
|
4247 `fftw_malloc' and related allocation functions for all arrays (*note
|
Chris@10
|
4248 Memory Allocation::). Note that distributed transposes (*note FFTW MPI
|
Chris@10
|
4249 Transposes::) use `fftw_mpi_execute_r2r', since they count as rank-zero
|
Chris@10
|
4250 r2r plans from FFTW's perspective.
|
Chris@10
|
4251
|
Chris@10
|
4252
|
Chris@10
|
4253 File: fftw3.info, Node: MPI Data Distribution Functions, Next: MPI Plan Creation, Prev: Using MPI Plans, Up: FFTW MPI Reference
|
Chris@10
|
4254
|
Chris@10
|
4255 6.12.4 MPI Data Distribution Functions
|
Chris@10
|
4256 --------------------------------------
|
Chris@10
|
4257
|
Chris@10
|
4258 As described above (*note MPI Data Distribution::), in order to
|
Chris@10
|
4259 allocate your arrays, _before_ creating a plan, you must first call one
|
Chris@10
|
4260 of the following routines to determine the required allocation size and
|
Chris@10
|
4261 the portion of the array locally stored on a given process. The
|
Chris@10
|
4262 `MPI_Comm' communicator passed here must be equivalent to the
|
Chris@10
|
4263 communicator used below for plan creation.
|
Chris@10
|
4264
|
Chris@10
|
4265 The basic interface for multidimensional transforms consists of the
|
Chris@10
|
4266 functions:
|
Chris@10
|
4267
|
Chris@10
|
4268 ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
|
Chris@10
|
4269 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
|
Chris@10
|
4270 ptrdiff_t fftw_mpi_local_size_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
|
Chris@10
|
4271 MPI_Comm comm,
|
Chris@10
|
4272 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
|
Chris@10
|
4273 ptrdiff_t fftw_mpi_local_size(int rnk, const ptrdiff_t *n, MPI_Comm comm,
|
Chris@10
|
4274 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
|
Chris@10
|
4275
|
Chris@10
|
4276 ptrdiff_t fftw_mpi_local_size_2d_transposed(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
|
Chris@10
|
4277 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
|
Chris@10
|
4278 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
|
Chris@10
|
4279 ptrdiff_t fftw_mpi_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
|
Chris@10
|
4280 MPI_Comm comm,
|
Chris@10
|
4281 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
|
Chris@10
|
4282 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
|
Chris@10
|
4283 ptrdiff_t fftw_mpi_local_size_transposed(int rnk, const ptrdiff_t *n, MPI_Comm comm,
|
Chris@10
|
4284 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
|
Chris@10
|
4285 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
|
Chris@10
|
4286
|
Chris@10
|
4287 These functions return the number of elements to allocate (complex
|
Chris@10
|
4288 numbers for DFT/r2c/c2r plans, real numbers for r2r plans), whereas the
|
Chris@10
|
4289 `local_n0' and `local_0_start' return the portion (`local_0_start' to
|
Chris@10
|
4290 `local_0_start + local_n0 - 1') of the first dimension of an n[0] x
|
Chris@10
|
4291 n[1] x n[2] x ... x n[d-1] array that is stored on the local process.
|
Chris@10
|
4292 *Note Basic and advanced distribution interfaces::. For
|
Chris@10
|
4293 `FFTW_MPI_TRANSPOSED_OUT' plans, the `_transposed' variants are useful
|
Chris@10
|
4294 in order to also return the local portion of the first dimension in the
|
Chris@10
|
4295 n[1] x n[0] x n[2] x ... x n[d-1] transposed output. *Note Transposed
|
Chris@10
|
4296 distributions::. The advanced interface for multidimensional
|
Chris@10
|
4297 transforms is:
|
Chris@10
|
4298
|
Chris@10
|
4299 ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
|
Chris@10
|
4300 ptrdiff_t block0, MPI_Comm comm,
|
Chris@10
|
4301 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
|
Chris@10
|
4302 ptrdiff_t fftw_mpi_local_size_many_transposed(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
|
Chris@10
|
4303 ptrdiff_t block0, ptrdiff_t block1, MPI_Comm comm,
|
Chris@10
|
4304 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
|
Chris@10
|
4305 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
|
Chris@10
|
4306
|
Chris@10
|
4307 These differ from the basic interface in only two ways. First, they
|
Chris@10
|
4308 allow you to specify block sizes `block0' and `block1' (the latter for
|
Chris@10
|
4309 the transposed output); you can pass `FFTW_MPI_DEFAULT_BLOCK' to use
|
Chris@10
|
4310 FFTW's default block size as in the basic interface. Second, you can
|
Chris@10
|
4311 pass a `howmany' parameter, corresponding to the advanced planning
|
Chris@10
|
4312 interface below: this is for transforms of contiguous `howmany'-tuples
|
Chris@10
|
4313 of numbers (`howmany = 1' in the basic interface).
|
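As an illustration of what `local_n0' and `local_0_start' describe, the
following self-contained C sketch models a block distribution of the
first dimension. It does not call FFTW: `block_range', `default_block',
and `block_range_for' are hypothetical names, and the ceiling-division
default block size is an assumption made for this example rather than
something this section specifies. In real programs, always use the
values returned by the `local_size' functions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model only; none of these names are part of FFTW. */
typedef struct {
    ptrdiff_t local_n0;      /* number of rows stored on this process */
    ptrdiff_t local_0_start; /* global index of the first local row */
} block_range;

/* ASSUMPTION: the default block size is ceil(n0 / nprocs). */
ptrdiff_t default_block(ptrdiff_t n0, int nprocs)
{
    assert(n0 > 0 && nprocs > 0);
    return (n0 + nprocs - 1) / nprocs;
}

/* Rows [local_0_start, local_0_start + local_n0 - 1] held by `rank'
   when the first dimension n0 is cut into blocks of size `block'. */
block_range block_range_for(ptrdiff_t n0, ptrdiff_t block, int rank)
{
    block_range r;
    ptrdiff_t start;

    assert(n0 > 0 && block > 0 && rank >= 0);
    start = block * rank;
    if (start >= n0) {
        /* This process holds no rows; the start index is a placeholder. */
        r.local_n0 = 0;
        r.local_0_start = 0;
    } else {
        r.local_0_start = start;
        r.local_n0 = (n0 - start < block) ? n0 - start : block;
    }
    return r;
}
```

Under these assumptions, 100 rows over 16 processes give a block of 7:
ranks 0 through 13 hold 7 rows each, rank 14 holds the remaining 2, and
rank 15 holds none.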
Chris@10
|
4314
|
Chris@10
|
4315 The corresponding basic and advanced routines for one-dimensional
|
Chris@10
|
4316 transforms (currently only complex DFTs) are:
|
Chris@10
|
4317
|
Chris@10
|
4318 ptrdiff_t fftw_mpi_local_size_1d(
|
Chris@10
|
4319 ptrdiff_t n0, MPI_Comm comm, int sign, unsigned flags,
|
Chris@10
|
4320 ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
|
Chris@10
|
4321 ptrdiff_t *local_no, ptrdiff_t *local_o_start);
|
Chris@10
|
4322 ptrdiff_t fftw_mpi_local_size_many_1d(
|
Chris@10
|
4323 ptrdiff_t n0, ptrdiff_t howmany,
|
Chris@10
|
4324 MPI_Comm comm, int sign, unsigned flags,
|
Chris@10
|
4325 ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
|
Chris@10
|
4326 ptrdiff_t *local_no, ptrdiff_t *local_o_start);
|
Chris@10
|
4327
|
Chris@10
|
4328 As above, the return value is the number of elements to allocate
|
Chris@10
|
4329 (complex numbers, for complex DFTs). The `local_ni' and
|
Chris@10
|
4330 `local_i_start' arguments return the portion (`local_i_start' to
|
Chris@10
|
4331 `local_i_start + local_ni - 1') of the 1d array that is stored on this
|
Chris@10
|
4332 process for the transform _input_, and `local_no' and `local_o_start'
|
Chris@10
|
4333 are the corresponding quantities for the output. The `sign'
|
Chris@10
|
4334 (`FFTW_FORWARD' or `FFTW_BACKWARD') and `flags' must match the
|
Chris@10
|
4335 arguments passed when creating a plan. Although the inputs and outputs
|
Chris@10
|
4336 have different data distributions in general, it is guaranteed that the
|
Chris@10
|
4337 _output_ data distribution of an `FFTW_FORWARD' plan will match the
|
Chris@10
|
4338 _input_ data distribution of an `FFTW_BACKWARD' plan and vice versa;
|
Chris@10
|
4339 similarly for the `FFTW_MPI_SCRAMBLED_OUT' and `FFTW_MPI_SCRAMBLED_IN'
|
Chris@10
|
4340 flags. *Note One-dimensional distributions::.
|
Chris@10
|
4341
|
Chris@10
|
4342
|
Chris@10
|
4343 File: fftw3.info, Node: MPI Plan Creation, Next: MPI Wisdom Communication, Prev: MPI Data Distribution Functions, Up: FFTW MPI Reference
|
Chris@10
|
4344
|
Chris@10
|
4345 6.12.5 MPI Plan Creation
|
Chris@10
|
4346 ------------------------
|
Chris@10
|
4347
|
Chris@10
|
4348 Complex-data MPI DFTs
|
Chris@10
|
4349 .....................
|
Chris@10
|
4350
|
Chris@10
|
4351 Plans for complex-data DFTs (*note 2d MPI example::) are created by:
|
Chris@10
|
4352
|
Chris@10
|
4353 fftw_plan fftw_mpi_plan_dft_1d(ptrdiff_t n0, fftw_complex *in, fftw_complex *out,
|
Chris@10
|
4354 MPI_Comm comm, int sign, unsigned flags);
|
Chris@10
|
4355 fftw_plan fftw_mpi_plan_dft_2d(ptrdiff_t n0, ptrdiff_t n1,
|
Chris@10
|
4356 fftw_complex *in, fftw_complex *out,
|
Chris@10
|
4357 MPI_Comm comm, int sign, unsigned flags);
|
Chris@10
|
4358 fftw_plan fftw_mpi_plan_dft_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
|
Chris@10
|
4359 fftw_complex *in, fftw_complex *out,
|
Chris@10
|
4360 MPI_Comm comm, int sign, unsigned flags);
|
Chris@10
|
4361 fftw_plan fftw_mpi_plan_dft(int rnk, const ptrdiff_t *n,
|
Chris@10
|
4362 fftw_complex *in, fftw_complex *out,
|
Chris@10
|
4363 MPI_Comm comm, int sign, unsigned flags);
|
Chris@10
|
4364 fftw_plan fftw_mpi_plan_many_dft(int rnk, const ptrdiff_t *n,
|
Chris@10
|
4365 ptrdiff_t howmany, ptrdiff_t block, ptrdiff_t tblock,
|
Chris@10
|
4366 fftw_complex *in, fftw_complex *out,
|
Chris@10
|
4367 MPI_Comm comm, int sign, unsigned flags);
|
Chris@10
|
4368
|
Chris@10
|
4369 These are similar to their serial counterparts (*note Complex DFTs::)
|
Chris@10
|
4370 in specifying the dimensions, sign, and flags of the transform. The
|
Chris@10
|
4371 `comm' argument gives an MPI communicator that specifies the set of
|
Chris@10
|
4372 processes to participate in the transform; plan creation is a
|
Chris@10
|
4373 collective function that must be called for all processes in the
|
Chris@10
|
4374 communicator. The `in' and `out' pointers refer only to a portion of
|
Chris@10
|
4375 the overall transform data (*note MPI Data Distribution::) as specified
|
Chris@10
|
4376 by the `local_size' functions in the previous section. Unless `flags'
|
Chris@10
|
4377 contains `FFTW_ESTIMATE', these arrays are overwritten during plan
|
Chris@10
|
4378 creation as for the serial interface. For multi-dimensional
|
Chris@10
|
4379 transforms, any dimensions `> 1' are supported; for one-dimensional
|
Chris@10
|
4380 transforms, only composite (non-prime) `n0' are currently supported
|
Chris@10
|
4381 (unlike the serial FFTW). Requesting an unsupported transform size
|
Chris@10
|
4382 will yield a `NULL' plan. (As in the serial interface, highly
|
Chris@10
|
4383 composite sizes generally yield the best performance.)
|
Chris@10
|
4384
|
Chris@10
|
4385 The advanced-interface `fftw_mpi_plan_many_dft' additionally allows
|
Chris@10
|
4386 you to specify the block sizes for the first dimension (`block') of the
|
Chris@10
|
4387 n[0] x n[1] x n[2] x ... x n[d-1] input data and the first dimension
|
Chris@10
|
4388 (`tblock') of the n[1] x n[0] x n[2] x ... x n[d-1] transposed data
|
Chris@10
|
4389 (at intermediate steps of the transform, and for the output if
|
Chris@10
|
4390 `FFTW_MPI_TRANSPOSED_OUT' is specified in `flags'). These must be the same
|
Chris@10
|
4391 block sizes as were passed to the corresponding `local_size' function;
|
Chris@10
|
4392 you can pass `FFTW_MPI_DEFAULT_BLOCK' to use FFTW's default block size
|
Chris@10
|
4393 as in the basic interface. Also, the `howmany' parameter specifies
|
Chris@10
|
4394 that the transform is of contiguous `howmany'-tuples rather than
|
Chris@10
|
4395 individual complex numbers; this corresponds to the same parameter in
|
Chris@10
|
4396 the serial advanced interface (*note Advanced Complex DFTs::) with
|
Chris@10
|
4397 `stride = howmany' and `dist = 1'.
|
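To make the `howmany' layout concrete, the sketch below computes where
component `k' of element `elem' sits in an array of contiguous
`howmany'-tuples, which is the `stride = howmany', `dist = 1'
arrangement described above. The helper name `tuple_offset' is
hypothetical and is not an FFTW function.

```c
#include <assert.h>
#include <stddef.h>

/* Offset (in numbers, not bytes) of component k of element `elem'
   when the data are stored as contiguous howmany-tuples.
   Equivalent to k * dist + elem * stride with dist = 1 and
   stride = howmany. Illustrative helper, not part of FFTW. */
ptrdiff_t tuple_offset(ptrdiff_t elem, ptrdiff_t k, ptrdiff_t howmany)
{
    assert(howmany > 0 && elem >= 0 && k >= 0 && k < howmany);
    return elem * howmany + k;
}
```

With `howmany = 3', for instance, the three interleaved transforms read
offsets 0, 3, 6, ..., then 1, 4, 7, ..., then 2, 5, 8, ...; with
`howmany = 1' the offset reduces to the element index.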
Chris@10
|
4398
|
Chris@10
|
4399 MPI flags
|
Chris@10
|
4400 .........
|
Chris@10
|
4401
|
Chris@10
|
4402 The `flags' can be any of those for the serial FFTW (*note Planner
|
Chris@10
|
4403 Flags::), and in addition may include one or more of the following
|
Chris@10
|
4404 MPI-specific flags, which improve performance at the cost of changing
|
Chris@10
|
4405 the output or input data formats.
|
Chris@10
|
4406
|
Chris@10
|
4407 * `FFTW_MPI_SCRAMBLED_OUT', `FFTW_MPI_SCRAMBLED_IN': valid for 1d
|
Chris@10
|
4408 transforms only, these flags indicate that the output/input of the
|
Chris@10
|
4409 transform are in an undocumented "scrambled" order. A forward
|
Chris@10
|
4410 `FFTW_MPI_SCRAMBLED_OUT' transform can be inverted by a backward
|
Chris@10
|
4411 `FFTW_MPI_SCRAMBLED_IN' (times the usual 1/N normalization).
|
Chris@10
|
4412 *Note One-dimensional distributions::.
|
Chris@10
|
4413
|
Chris@10
|
4414 * `FFTW_MPI_TRANSPOSED_OUT', `FFTW_MPI_TRANSPOSED_IN': valid for
|
Chris@10
|
4415 multidimensional (`rnk > 1') transforms only, these flags specify
|
Chris@10
|
4416 that the output or input of an n[0] x n[1] x n[2] x ... x n[d-1]
|
Chris@10
|
4417 transform is transposed to n[1] x n[0] x n[2] x ... x n[d-1] .
|
Chris@10
|
4418 *Note Transposed distributions::.
|
Chris@10
|
4419
|
Chris@10
|
4420
|
Chris@10
|
4421 Real-data MPI DFTs
|
Chris@10
|
4422 ..................
|
Chris@10
|
4423
|
Chris@10
|
4424 Plans for real-input/output (r2c/c2r) DFTs (*note Multi-dimensional MPI
|
Chris@10
|
4425 DFTs of Real Data::) are created by:
|
Chris@10
|
4426
|
Chris@10
|
4427 fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1,
|
Chris@10
|
4428 double *in, fftw_complex *out,
|
Chris@10
|
4429 MPI_Comm comm, unsigned flags);
|
Chris@10
|
4433 fftw_plan fftw_mpi_plan_dft_r2c_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
|
Chris@10
|
4434 double *in, fftw_complex *out,
|
Chris@10
|
4435 MPI_Comm comm, unsigned flags);
|
Chris@10
|
4436 fftw_plan fftw_mpi_plan_dft_r2c(int rnk, const ptrdiff_t *n,
|
Chris@10
|
4437 double *in, fftw_complex *out,
|
Chris@10
|
4438 MPI_Comm comm, unsigned flags);
|
Chris@10
|
4439 fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1,
|
Chris@10
|
4440 fftw_complex *in, double *out,
|
Chris@10
|
4441 MPI_Comm comm, unsigned flags);
|
Chris@10
|
4445 fftw_plan fftw_mpi_plan_dft_c2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
|
Chris@10
|
4446 fftw_complex *in, double *out,
|
Chris@10
|
4447 MPI_Comm comm, unsigned flags);
|
Chris@10
|
4448 fftw_plan fftw_mpi_plan_dft_c2r(int rnk, const ptrdiff_t *n,
|
Chris@10
|
4449 fftw_complex *in, double *out,
|
Chris@10
|
4450 MPI_Comm comm, unsigned flags);
|
Chris@10
|
4451
|
Chris@10
|
4452 Similar to the serial interface (*note Real-data DFTs::), these
|
Chris@10
|
4453 transform logically n[0] x n[1] x n[2] x ... x n[d-1] real data
|
Chris@10
|
4454 to/from n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1) complex data,
|
Chris@10
|
4455 representing the non-redundant half of the conjugate-symmetry output of
|
Chris@10
|
4456 a real-input DFT (*note Multi-dimensional Transforms::). However, the
|
Chris@10
|
4457 real array must be stored within a padded n[0] x n[1] x n[2] x ... x [2
|
Chris@10
|
4458 (n[d-1]/2 + 1)]
|
Chris@10
|
4459
|
Chris@10
|
4460 array (much like the in-place serial r2c transforms, but here for
|
Chris@10
|
4461 out-of-place transforms as well). Currently, only multi-dimensional
|
Chris@10
|
4462 (`rnk > 1') r2c/c2r transforms are supported (requesting a plan for
|
Chris@10
|
4463 `rnk = 1' will yield `NULL'). As explained above (*note
|
Chris@10
|
4464 Multi-dimensional MPI DFTs of Real Data::), the data distribution of
|
Chris@10
|
4465 both the real and complex arrays is given by the `local_size' function
|
Chris@10
|
4466 called for the dimensions of the _complex_ array. Similar to the other
|
Chris@10
|
4467 planning functions, the input and output arrays are overwritten when
|
Chris@10
|
4468 the plan is created except in `FFTW_ESTIMATE' mode.
|
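The padding arithmetic above can be sketched in plain C. The helpers
below are hypothetical (they are not FFTW functions) and only restate
the formulas from this section: the complex array has a last dimension
of n[d-1]/2 + 1, and the real array is stored with a padded last
dimension of 2 (n[d-1]/2 + 1). The 2d indexing helper assumes a
row-major local slab, as in C.

```c
#include <assert.h>
#include <stddef.h>

/* Last dimension of the complex array: n_last/2 + 1, where the
   division rounds down. Illustrative helper, not part of FFTW. */
ptrdiff_t complex_last_dim(ptrdiff_t n_last)
{
    assert(n_last > 0);
    return n_last / 2 + 1;
}

/* Padded last dimension of the real array: 2 * (n_last/2 + 1). */
ptrdiff_t padded_real_last_dim(ptrdiff_t n_last)
{
    return 2 * complex_last_dim(n_last);
}

/* Row-major offset of real element (i, j) of a local slab whose
   logical row length is n1 but whose stored row length is padded. */
ptrdiff_t padded_real_index_2d(ptrdiff_t i, ptrdiff_t j, ptrdiff_t n1)
{
    assert(i >= 0 && j >= 0 && j < n1);
    return i * padded_real_last_dim(n1) + j;
}
```

For n1 = 8 each stored row holds 10 real numbers, of which the last 2
are padding; for n1 = 7 each stored row holds 8, with 1 padding entry.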
Chris@10
|
4469
|
Chris@10
|
4470 As for the complex DFTs above, there is an advanced interface that
|
Chris@10
|
4471 allows you to manually specify block sizes and to transform contiguous
|
Chris@10
|
4472 `howmany'-tuples of real/complex numbers:
|
Chris@10
|
4473
|
Chris@10
|
4474 fftw_plan fftw_mpi_plan_many_dft_r2c
|
Chris@10
|
4475 (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
|
Chris@10
|
4476 ptrdiff_t iblock, ptrdiff_t oblock,
|
Chris@10
|
4477 double *in, fftw_complex *out,
|
Chris@10
|
4478 MPI_Comm comm, unsigned flags);
|
Chris@10
|
4479 fftw_plan fftw_mpi_plan_many_dft_c2r
|
Chris@10
|
4480 (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
|
Chris@10
|
4481 ptrdiff_t iblock, ptrdiff_t oblock,
|
Chris@10
|
4482 fftw_complex *in, double *out,
|
Chris@10
|
4483 MPI_Comm comm, unsigned flags);
|
Chris@10
|
4484
|
Chris@10
|
4485 MPI r2r transforms
|
Chris@10
|
4486 ..................
|
Chris@10
|
4487
|
Chris@10
|
4488 There are corresponding plan-creation routines for r2r transforms
|
Chris@10
|
4489 (*note More DFTs of Real Data::), currently supporting multidimensional
|
Chris@10
|
4490 (`rnk > 1') transforms only (`rnk = 1' will yield a `NULL' plan):
|
Chris@10
|
4491
|
Chris@10
|
4492 fftw_plan fftw_mpi_plan_r2r_2d(ptrdiff_t n0, ptrdiff_t n1,
|
Chris@10
|
4493 double *in, double *out,
|
Chris@10
|
4494 MPI_Comm comm,
|
Chris@10
|
4495 fftw_r2r_kind kind0, fftw_r2r_kind kind1,
|
Chris@10
|
4496 unsigned flags);
|
Chris@10
|
4497 fftw_plan fftw_mpi_plan_r2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
|
Chris@10
|
4498 double *in, double *out,
|
Chris@10
|
4499 MPI_Comm comm,
|
Chris@10
|
4500 fftw_r2r_kind kind0, fftw_r2r_kind kind1, fftw_r2r_kind kind2,
|
Chris@10
|
4501 unsigned flags);
|
Chris@10
|
4502 fftw_plan fftw_mpi_plan_r2r(int rnk, const ptrdiff_t *n,
|
Chris@10
|
4503 double *in, double *out,
|
Chris@10
|
4504 MPI_Comm comm, const fftw_r2r_kind *kind,
|
Chris@10
|
4505 unsigned flags);
|
Chris@10
|
4506 fftw_plan fftw_mpi_plan_many_r2r(int rnk, const ptrdiff_t *n,
|
Chris@10
|
4507 ptrdiff_t howmany, ptrdiff_t iblock, ptrdiff_t oblock,
|
Chris@10
|
4508 double *in, double *out,
|
Chris@10
|
4509 MPI_Comm comm, const fftw_r2r_kind *kind,
|
Chris@10
|
4510 unsigned flags);
|
Chris@10
|
4511
|
Chris@10
|
4512 The parameters are much the same as for the complex DFTs above,
|
Chris@10
|
4513 except that the arrays are of real numbers (and hence the outputs of the
|
Chris@10
|
4514 `local_size' data-distribution functions should be interpreted as
|
Chris@10
|
4515 counts of real rather than complex numbers). Also, the `kind'
|
Chris@10
|
4516 parameters specify the r2r kinds along each dimension as for the serial
|
Chris@10
|
4517 interface (*note Real-to-Real Transform Kinds::). *Note Other
|
Chris@10
|
4518 Multi-dimensional Real-data MPI Transforms::.
|
Chris@10
|
4519
|
Chris@10
|
4520 MPI transposition
|
Chris@10
|
4521 .................
|
Chris@10
|
4522
|
Chris@10
|
4523 FFTW also provides routines to plan a transpose of a distributed `n0'
|
Chris@10
|
4524 by `n1' array of real numbers, or an array of `howmany'-tuples of real
|
Chris@10
|
4525 numbers with specified block sizes (*note FFTW MPI Transposes::):
|
Chris@10
|
4526
|
Chris@10
|
4527 fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
|
Chris@10
|
4528 double *in, double *out,
|
Chris@10
|
4529 MPI_Comm comm, unsigned flags);
|
Chris@10
|
4530 fftw_plan fftw_mpi_plan_many_transpose
|
Chris@10
|
4531 (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
|
Chris@10
|
4532 ptrdiff_t block0, ptrdiff_t block1,
|
Chris@10
|
4533 double *in, double *out, MPI_Comm comm, unsigned flags);
|
Chris@10
|
4534
|
Chris@10
|
4535 These plans are used with the `fftw_mpi_execute_r2r' new-array
|
Chris@10
|
4536 execute function (*note Using MPI Plans::), since they count as (rank
|
Chris@10
|
4537 zero) r2r plans from FFTW's perspective.
|
Chris@10
|
4538
|
Chris@10
|
4539
|
Chris@10
|
4540 File: fftw3.info, Node: MPI Wisdom Communication, Prev: MPI Plan Creation, Up: FFTW MPI Reference
|
Chris@10
|
4541
|
Chris@10
|
4542 6.12.6 MPI Wisdom Communication
|
Chris@10
|
4543 -------------------------------
|
Chris@10
|
4544
|
Chris@10
|
4545 To facilitate synchronizing wisdom among the different MPI processes,
|
Chris@10
|
4546 we provide two functions:
|
Chris@10
|
4547
|
Chris@10
|
4548 void fftw_mpi_gather_wisdom(MPI_Comm comm);
|
Chris@10
|
4549 void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
|
Chris@10
|
4550
|
Chris@10
|
4551 The `fftw_mpi_gather_wisdom' function gathers all wisdom in the
|
Chris@10
|
4552 given communicator `comm' to the process of rank 0 in the communicator:
|
Chris@10
|
4553 that process obtains the union of all wisdom on all the processes. As
|
Chris@10
|
4554 a side effect, some other processes will gain additional wisdom from
|
Chris@10
|
4555 other processes, but only process 0 will gain the complete union.
|
Chris@10
|
4556
|
Chris@10
|
4557 The `fftw_mpi_broadcast_wisdom' function does the reverse: it exports wisdom
|
Chris@10
|
4558 from process 0 in `comm' to all other processes in the communicator,
|
Chris@10
|
4559 replacing any wisdom they currently have.
|
Chris@10
|
4560
|
Chris@10
|
4561 *Note FFTW MPI Wisdom::.
|
Chris@10
|
4562
|
Chris@10
|
4563
|
Chris@10
|
4564 File: fftw3.info, Node: FFTW MPI Fortran Interface, Prev: FFTW MPI Reference, Up: Distributed-memory FFTW with MPI
|
Chris@10
|
4565
|
Chris@10
|
4566 6.13 FFTW MPI Fortran Interface
|
Chris@10
|
4567 ===============================
|
Chris@10
|
4568
|
Chris@10
|
4569 The FFTW MPI interface is callable from modern Fortran compilers
|
Chris@10
|
4570 supporting the Fortran 2003 `iso_c_binding' standard for calling C
|
Chris@10
|
4571 functions. As described in *note Calling FFTW from Modern Fortran::,
|
Chris@10
|
4572 this means that you can directly call FFTW's C interface from Fortran
|
Chris@10
|
4573 with only minor changes in syntax. There are, however, a few things
|
Chris@10
|
4574 specific to the MPI interface to keep in mind:
|
Chris@10
|
4575
|
Chris@10
|
4576 * Instead of including `fftw3.f03' as in *note Overview of Fortran
|
Chris@10
|
4577 interface::, you should `include 'fftw3-mpi.f03'' (after `use,
|
Chris@10
|
4578 intrinsic :: iso_c_binding' as before). The `fftw3-mpi.f03' file
|
Chris@10
|
4579 includes `fftw3.f03', so you should _not_ `include' them both
|
Chris@10
|
4580 yourself. (You will also want to include the MPI header file,
|
Chris@10
|
4581 usually via `include 'mpif.h'' or similar, although this is
|
Chris@10
|
4582 not needed by `fftw3-mpi.f03' per se.) (To use the `fftwl_' `long
|
Chris@10
|
4583 double' extended-precision routines in supporting compilers, you
|
Chris@10
|
4584 should include `fftw3l-mpi.f03' in _addition_ to `fftw3-mpi.f03'.
|
Chris@10
|
4585 *Note Extended and quadruple precision in Fortran::.)
|
Chris@10
|
4586
|
Chris@10
|
4587 * Because of the different storage conventions between C and Fortran,
|
Chris@10
|
4588 you reverse the order of your array dimensions when passing them to
|
Chris@10
|
4589 FFTW (*note Reversing array dimensions::). This is merely a
|
Chris@10
|
4590 difference in notation and incurs no performance overhead.
|
Chris@10
|
4591 However, it means that, whereas in C the _first_ dimension is
|
Chris@10
|
4592 distributed, in Fortran the _last_ dimension of your array is
|
Chris@10
|
4593 distributed.
|
Chris@10
|
4594
|
Chris@10
|
4595 * In Fortran, communicators are stored as `integer' types; there is
|
Chris@10
|
4596 no `MPI_Comm' type, nor is there any way to access a C `MPI_Comm'.
|
Chris@10
|
4597 Fortunately, this is taken care of for you by the FFTW Fortran
|
Chris@10
|
4598 interface: whenever the C interface expects an `MPI_Comm' type,
|
Chris@10
|
4599 you should pass the Fortran communicator as an `integer'.(1)
|
Chris@10
|
4600
|
Chris@10
|
4601 * Because you need to call the `local_size' function to find out how
|
Chris@10
|
4602 much space to allocate, and this may be _larger_ than the local
|
Chris@10
|
4603 portion of the array (*note MPI Data Distribution::), you should
|
Chris@10
|
4604 _always_ allocate your arrays dynamically using FFTW's allocation
|
Chris@10
|
4605 routines as described in *note Allocating aligned memory in
|
Chris@10
|
4606 Fortran::. (Coincidentally, this also provides the best
|
Chris@10
|
4607 performance by guaranteeing proper data alignment.)
|
Chris@10
|
4608
|
Chris@10
|
4609 * Because all sizes in the MPI FFTW interface are declared as
|
Chris@10
|
4610 `ptrdiff_t' in C, you should use `integer(C_INTPTR_T)' in Fortran
|
Chris@10
|
4611 (*note FFTW Fortran type reference::).
|
Chris@10
|
4612
|
Chris@10
|
4613 * In Fortran, because of the language semantics, we generally
|
Chris@10
|
4614 recommend using the new-array execute functions for all plans,
|
Chris@10
|
4615 even in the common case where you are executing the plan on the
|
Chris@10
|
4616 same arrays for which the plan was created (*note Plan execution
|
Chris@10
|
4617 in Fortran::). However, note that in the MPI interface these
|
Chris@10
|
4618 functions are changed: `fftw_execute_dft' becomes
|
Chris@10
|
4619 `fftw_mpi_execute_dft', etcetera. *Note Using MPI Plans::.
|
Chris@10
|
4620
|
Chris@10
|
4621
|
Chris@10
|
4622 For example, here is a Fortran code snippet to perform a distributed
|
Chris@10
|
4623 L x M complex DFT in-place. (This assumes you have already
|
Chris@10
|
4624 initialized MPI with `MPI_init' and have also performed `call
|
Chris@10
|
4625 fftw_mpi_init'.)
|
Chris@10
|
4626
|
Chris@10
|
4627 use, intrinsic :: iso_c_binding
|
Chris@10
|
4628 include 'fftw3-mpi.f03'
|
Chris@10
|
4629 integer(C_INTPTR_T), parameter :: L = ...
|
Chris@10
|
4630 integer(C_INTPTR_T), parameter :: M = ...
|
Chris@10
|
4631 type(C_PTR) :: plan, cdata
|
Chris@10
|
4632 complex(C_DOUBLE_COMPLEX), pointer :: data(:,:)
|
Chris@10
|
4633 integer(C_INTPTR_T) :: i, j, alloc_local, local_M, local_j_offset
|
Chris@10
|
4634
|
Chris@10
|
4635 ! get local data size and allocate (note dimension reversal)
|
Chris@10
|
4636 alloc_local = fftw_mpi_local_size_2d(M, L, MPI_COMM_WORLD, &
|
Chris@10
|
4637 local_M, local_j_offset)
|
Chris@10
|
4638 cdata = fftw_alloc_complex(alloc_local)
|
Chris@10
|
4639 call c_f_pointer(cdata, data, [L,local_M])
|
Chris@10
|
4640
|
Chris@10
|
4641 ! create MPI plan for in-place forward DFT (note dimension reversal)
|
Chris@10
|
4642 plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, &
|
Chris@10
|
4643 FFTW_FORWARD, FFTW_MEASURE)
|
Chris@10
|
4644
|
Chris@10
|
4645 ! initialize data to some function my_function(i,j)
|
Chris@10
|
4646 do j = 1, local_M
|
Chris@10
|
4647 do i = 1, L
|
Chris@10
|
4648 data(i, j) = my_function(i, j + local_j_offset)
|
Chris@10
|
4649 end do
|
Chris@10
|
4650 end do
|
Chris@10
|
4651
|
Chris@10
|
4652 ! compute transform (as many times as desired)
|
Chris@10
|
4653 call fftw_mpi_execute_dft(plan, data, data)
|
Chris@10
|
4654
|
Chris@10
|
4655 call fftw_destroy_plan(plan)
|
Chris@10
|
4656 call fftw_free(cdata)
|
Chris@10
|
4657
|
Chris@10
|
4658 Note that we called `fftw_mpi_local_size_2d' and
|
Chris@10
|
4659 `fftw_mpi_plan_dft_2d' with the dimensions in reversed order, since a L
|
Chris@10
|
4660 x M Fortran array is viewed by FFTW in C as a M x L array. This
|
Chris@10
|
4661 means that the array was distributed over the `M' dimension, the local
|
Chris@10
|
4662 portion of which is a L x local_M array in Fortran. (You must _not_
|
Chris@10
|
4663 use an `allocate' statement to allocate an L x local_M array, however;
|
Chris@10
|
4664 you must allocate `alloc_local' complex numbers, which may be greater
|
Chris@10
|
4665 than `L * local_M', in order to reserve space for intermediate steps of
|
Chris@10
|
4666 the transform.) Finally, we mention that because C's array indices are
|
Chris@10
|
4667 zero-based, the `local_j_offset' argument can conveniently be
|
Chris@10
|
4668 interpreted as an offset in the 1-based `j' index (rather than as a
|
Chris@10
|
4669 starting index as in C).
|
Chris@10
|
4670
|
Chris@10
|
4671 If instead you had used the `ior(FFTW_MEASURE,
|
Chris@10
|
4672 FFTW_MPI_TRANSPOSED_OUT)' flag, the output of the transform would be a
|
Chris@10
|
4673 transposed M x local_L array, associated with the _same_ `cdata'
|
Chris@10
|
4674 allocation (since the transform is in-place), and which you could
|
Chris@10
|
4675 declare with:
|
Chris@10
|
4676
|
Chris@10
|
4677 complex(C_DOUBLE_COMPLEX), pointer :: tdata(:,:)
|
Chris@10
|
4678 ...
|
Chris@10
|
4679 call c_f_pointer(cdata, tdata, [M,local_L])
|
Chris@10
|
4680
|
Chris@10
|
4681 where `local_L' would have been obtained by changing the
|
Chris@10
|
4682 `fftw_mpi_local_size_2d' call to:
|
Chris@10
|
4683
|
Chris@10
|
4684 alloc_local = fftw_mpi_local_size_2d_transposed(M, L, MPI_COMM_WORLD, &
|
Chris@10
|
4685 local_M, local_j_offset, local_L, local_i_offset)
|
Chris@10
|
4686
|
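The transposed-output fragments above can be assembled into one program. The following is only a sketch under stated assumptions: the program name, the sizes `L = 32' and `M = 16', the use of the `mpi' module, and the initialization formula are illustrative choices and do not come from the manual.

```fortran
program mpi_transposed_sketch
  use, intrinsic :: iso_c_binding
  use mpi                        ! assumption: your MPI provides the mpi module
  implicit none
  include 'fftw3-mpi.f03'

  integer(C_INTPTR_T), parameter :: L = 32, M = 16   ! illustrative sizes
  type(C_PTR) :: plan, cdata
  complex(C_DOUBLE_COMPLEX), pointer :: data(:,:), tdata(:,:)
  integer(C_INTPTR_T) :: i, j, alloc_local
  integer(C_INTPTR_T) :: local_M, local_j_offset, local_L, local_i_offset
  integer :: ierr

  call MPI_Init(ierr)
  call fftw_mpi_init()

  ! one call returns both the input and the transposed-output distributions
  alloc_local = fftw_mpi_local_size_2d_transposed(M, L, MPI_COMM_WORLD, &
                    local_M, local_j_offset, local_L, local_i_offset)
  cdata = fftw_alloc_complex(alloc_local)

  ! two views of the same allocation: input layout and transposed output layout
  call c_f_pointer(cdata, data,  [L, local_M])
  call c_f_pointer(cdata, tdata, [M, local_L])

  plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, FFTW_FORWARD, &
                              ior(FFTW_MEASURE, FFTW_MPI_TRANSPOSED_OUT))

  ! initialize after planning, since FFTW_MEASURE may overwrite the arrays
  do j = 1, local_M
    do i = 1, L
      data(i, j) = cmplx(i, j + local_j_offset, kind=C_DOUBLE_COMPLEX)
    end do
  end do

  call fftw_mpi_execute_dft(plan, data, data)
  ! the result is now addressed as tdata(j, i) with j = 1..M and
  ! i = 1..local_L (global i index: i + local_i_offset)

  call fftw_destroy_plan(plan)
  call fftw_free(cdata)
  call MPI_Finalize(ierr)
end program mpi_transposed_sketch
```

Both pointers refer to the single `cdata' allocation, so no second buffer is needed; which view is meaningful depends on whether the plan has been executed yet.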
Chris@10
|
4687 ---------- Footnotes ----------
|
Chris@10
|
4688
|
Chris@10
|
4689 (1) Technically, this is because you aren't actually calling the C
|
Chris@10
|
4690 functions directly. You are calling wrapper functions that translate
|
Chris@10
|
4691 the communicator with `MPI_Comm_f2c' before calling the ordinary C
|
Chris@10
|
4692 interface. This is all done transparently, however, since the
|
Chris@10
|
4693 `fftw3-mpi.f03' interface file renames the wrappers so that they are
|
Chris@10
|
4694 called in Fortran with the same names as the C interface functions.
|
Chris@10
|
4695
|
Chris@10
|
4696
|
Chris@10
|
4697 File: fftw3.info, Node: Calling FFTW from Modern Fortran, Next: Calling FFTW from Legacy Fortran, Prev: Distributed-memory FFTW with MPI, Up: Top
|
Chris@10
|
4698
|
Chris@10
|
4699 7 Calling FFTW from Modern Fortran
|
Chris@10
|
4700 **********************************
|
Chris@10
|
4701
|
Chris@10
|
4702 Fortran 2003 standardized ways for Fortran code to call C libraries,
|
Chris@10
|
4703 and this allows us to support a direct translation of the FFTW C API
|
Chris@10
|
4704 into Fortran. Compared to the legacy Fortran 77 interface (*note
|
Chris@10
|
4705 Calling FFTW from Legacy Fortran::), this direct interface offers many
|
Chris@10
|
4706 advantages, especially compile-time type-checking and aligned memory
|
Chris@10
|
4707 allocation. As of this writing, support for these C interoperability
|
Chris@10
|
4708 features seems widespread, having been implemented in nearly all major
|
Chris@10
|
4709 Fortran compilers (e.g. GNU, Intel, IBM, Oracle/Solaris, Portland
|
Chris@10
|
4710 Group, NAG).
|
Chris@10
|
4711
|
Chris@10
|
4712 This chapter documents that interface. For the most part, since this
|
Chris@10
|
4713 interface allows Fortran to call the C interface directly, the usage is
|
Chris@10
|
4714 identical to C translated to Fortran syntax. However, there are a few
|
Chris@10
|
4715 subtle points such as memory allocation, wisdom, and data types that
|
Chris@10
|
4716 deserve closer attention.
|
Chris@10
|
4717
|
Chris@10
|
4718 * Menu:
|
Chris@10
|
4719
|
Chris@10
|
4720 * Overview of Fortran interface::
|
Chris@10
|
4721 * Reversing array dimensions::
|
Chris@10
|
4722 * FFTW Fortran type reference::
|
Chris@10
|
4723 * Plan execution in Fortran::
|
Chris@10
|
4724 * Allocating aligned memory in Fortran::
|
Chris@10
|
4725 * Accessing the wisdom API from Fortran::
|
Chris@10
|
4726 * Defining an FFTW module::
|
Chris@10
|
4727
|
Chris@10
|
4728
|
Chris@10
|
4729 File: fftw3.info, Node: Overview of Fortran interface, Next: Reversing array dimensions, Prev: Calling FFTW from Modern Fortran, Up: Calling FFTW from Modern Fortran
|
Chris@10
|
4730
|
Chris@10
|
4731 7.1 Overview of Fortran interface
|
Chris@10
|
4732 =================================
|
Chris@10
|
4733
|
Chris@10
|
4734 FFTW provides a file `fftw3.f03' that defines Fortran 2003 interfaces
|
Chris@10
|
4735 for all of its C routines, except for the MPI routines described
|
Chris@10
|
4736 elsewhere; this file can be found in the same directory as `fftw3.h' (the C
|
Chris@10
|
4737 header file). In any Fortran subroutine where you want to use FFTW
|
Chris@10
|
4738 functions, you should begin with:
|
Chris@10
|
4739
|
Chris@10
|
4740 use, intrinsic :: iso_c_binding
|
Chris@10
|
4741 include 'fftw3.f03'
|
Chris@10
|
4742
|
Chris@10
|
4743 This includes the interface definitions and the standard
|
Chris@10
|
4744 `iso_c_binding' module (which defines the equivalents of C types). You
|
Chris@10
|
4745 can also put the FFTW functions into a module if you prefer (*note
|
Chris@10
|
4746 Defining an FFTW module::).
|
Chris@10
|
4747
|
Chris@10
|
4748 At this point, you can now call anything in the FFTW C interface
|
Chris@10
|
4749 directly, almost exactly as in C other than minor changes in syntax.
|
Chris@10
|
4750 For example:
|
Chris@10
|
4751
|
Chris@10
|
4752 type(C_PTR) :: plan
|
Chris@10
|
4753 complex(C_DOUBLE_COMPLEX), dimension(1024,1000) :: in, out
|
Chris@10
|
4754 plan = fftw_plan_dft_2d(1000,1024, in,out, FFTW_FORWARD,FFTW_ESTIMATE)
|
Chris@10
|
4755 ...
|
Chris@10
|
4756 call fftw_execute_dft(plan, in, out)
|
Chris@10
|
4757 ...
|
Chris@10
|
4758 call fftw_destroy_plan(plan)
|
Chris@10
|
4759
|
Chris@10
|
4760 A few important things to keep in mind are:
|
Chris@10
|
4761
|
Chris@10
|
4762 * FFTW plans are `type(C_PTR)'. Other C types are mapped in the
|
Chris@10
|
4763 obvious way via the `iso_c_binding' standard: `int' turns into
|
Chris@10
|
4764 `integer(C_INT)', `fftw_complex' turns into
|
Chris@10
|
4765 `complex(C_DOUBLE_COMPLEX)', `double' turns into `real(C_DOUBLE)',
|
Chris@10
|
4766 and so on. *Note FFTW Fortran type reference::.
|
Chris@10
|
4767
|
Chris@10
|
4768 * Functions in C become functions in Fortran if they have a return
|
Chris@10
|
4769 value, and subroutines in Fortran otherwise.
|
Chris@10
|
4770
|
Chris@10
|
4771 * The ordering of the Fortran array dimensions must be _reversed_
|
Chris@10
|
4772 when they are passed to the FFTW plan creation, thanks to
|
Chris@10
|
4773 differences in array indexing conventions (*note Multi-dimensional
|
Chris@10
|
4774 Array Format::). This is _unlike_ the legacy Fortran interface
|
Chris@10
|
4775 (*note Fortran-interface routines::), which reversed the dimensions
|
Chris@10
|
4776 for you. *Note Reversing array dimensions::.
|
Chris@10
|
4777
|
Chris@10
|
4778 * Using ordinary Fortran array declarations like this works, but may
|
Chris@10
|
4779 yield suboptimal performance because the data may not be
|
Chris@10
|
4780 aligned to exploit SIMD instructions on modern processors (*note
|
Chris@10
|
4781 SIMD alignment and fftw_malloc::). Better performance will often
|
Chris@10
|
4782 be obtained by allocating with `fftw_alloc'. *Note Allocating
|
Chris@10
|
4783 aligned memory in Fortran::.
|
Chris@10
|
4784
|
Chris@10
|
4785 * Similar to the legacy Fortran interface (*note FFTW Execution in
|
Chris@10
|
4786 Fortran::), we currently recommend _not_ using `fftw_execute' but
|
Chris@10
|
4787 rather using the more specialized functions like
|
Chris@10
|
4788 `fftw_execute_dft' (*note New-array Execute Functions::).
|
Chris@10
|
4789 However, you should execute the plan on the `same arrays' as the
|
Chris@10
|
4790 ones for which you created the plan, unless you are especially
|
Chris@10
|
4791 careful. *Note Plan execution in Fortran::. To prevent you from
|
Chris@10
|
4792 using `fftw_execute' by mistake, the `fftw3.f03' file does not
|
Chris@10
|
4793 provide an `fftw_execute' interface declaration.
|
Chris@10
|
4794
|
Chris@10
|
4795 * Multiple planner flags are combined with `ior' (equivalent to `|'
|
Chris@10
|
4796 in C). e.g. `FFTW_MEASURE | FFTW_DESTROY_INPUT' becomes
|
Chris@10
|
4797 `ior(FFTW_MEASURE, FFTW_DESTROY_INPUT)'. (You can also use `+' as
|
Chris@10
|
4798 long as you don't try to include a given flag more than once.)
|
Chris@10
|
4799
|
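The points in the list above can be seen together in one small program. This is a minimal sketch, not a reference implementation: the program name and the sizes `L = 8' and `M = 4' are assumptions made for illustration, and it must be linked against the double-precision library (`-lfftw3' on Unix).

```fortran
program fftw_overview_sketch
  use, intrinsic :: iso_c_binding
  implicit none
  include 'fftw3.f03'

  integer(C_INT), parameter :: L = 8, M = 4     ! illustrative sizes
  type(C_PTR) :: plan, pin, pout                ! plans and raw pointers are type(C_PTR)
  complex(C_DOUBLE_COMPLEX), pointer :: in(:,:), out(:,:)

  ! aligned allocation, then view the memory as L x M Fortran arrays
  pin  = fftw_alloc_complex(int(L * M, C_SIZE_T))
  pout = fftw_alloc_complex(int(L * M, C_SIZE_T))
  call c_f_pointer(pin,  in,  [L, M])
  call c_f_pointer(pout, out, [L, M])

  ! dimensions are passed in reversed order (M, L); flags are combined with ior
  plan = fftw_plan_dft_2d(M, L, in, out, FFTW_FORWARD, &
                          ior(FFTW_MEASURE, FFTW_DESTROY_INPUT))

  ! initialize after planning, since FFTW_MEASURE may overwrite the arrays
  in = (1.0_C_DOUBLE, 0.0_C_DOUBLE)

  ! new-array execute function, applied to the arrays the plan was created for
  call fftw_execute_dft(plan, in, out)

  ! for all-ones input the zero-frequency term equals L*M
  print *, 'out(1,1) = ', out(1,1)

  call fftw_destroy_plan(plan)
  call fftw_free(pin)
  call fftw_free(pout)
end program fftw_overview_sketch
```

`fftw_plan_dft_2d' is called as a function because it returns a value, while `fftw_execute_dft', `fftw_destroy_plan' and `fftw_free' are called as subroutines.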
Chris@10
|
4800
|
Chris@10
|
4801 * Menu:
|
Chris@10
|
4802
|
Chris@10
|
4803 * Extended and quadruple precision in Fortran::
|
Chris@10
|
4804
|
Chris@10
|
4805
|
Chris@10
|
4806 File: fftw3.info, Node: Extended and quadruple precision in Fortran, Prev: Overview of Fortran interface, Up: Overview of Fortran interface
|
Chris@10
|
4807
|
Chris@10
|
4808 7.1.1 Extended and quadruple precision in Fortran
|
Chris@10
|
4809 -------------------------------------------------
|
Chris@10
|
4810
|
Chris@10
|
4811 If FFTW is compiled in `long double' (extended) precision (*note
|
Chris@10
|
4812 Installation and Customization::), you may be able to call the
|
Chris@10
|
4813 resulting `fftwl_' routines (*note Precision::) from Fortran if your
|
Chris@10
|
4814 compiler supports the `C_LONG_DOUBLE_COMPLEX' type code.
|
Chris@10
|
4815
|
Chris@10
|
4816 Because some Fortran compilers do not support
|
Chris@10
|
4817 `C_LONG_DOUBLE_COMPLEX', the `fftwl_' declarations are segregated into
|
Chris@10
|
4818 a separate interface file `fftw3l.f03', which you should include _in
|
Chris@10
|
4819 addition_ to `fftw3.f03' (which declares precision-independent `FFTW_'
|
Chris@10
|
4820 constants):
|
Chris@10
|
4821
|
Chris@10
|
4822 use, intrinsic :: iso_c_binding
|
Chris@10
|
4823 include 'fftw3.f03'
|
Chris@10
|
4824 include 'fftw3l.f03'
|
Chris@10
|
4825
|
Chris@10
|
4826 We also support using the nonstandard `__float128'
|
Chris@10
|
4827 quadruple-precision type provided by recent versions of `gcc' on 32-
|
Chris@10
|
4828 and 64-bit x86 hardware (*note Installation and Customization::), using
|
Chris@10
|
4829 the corresponding `real(16)' and `complex(16)' types supported by
|
Chris@10
|
4830 `gfortran'. The quadruple-precision `fftwq_' functions (*note
|
Chris@10
|
4831 Precision::) are declared in a `fftw3q.f03' interface file, which
|
Chris@10
|
4832 should be included in addition to `fftw3.f03', as above. You should
|
Chris@10
|
4833 also link with `-lfftw3q -lquadmath -lm' as in C.
|
Chris@10
|
4834
|
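As an illustration of the two-file include for `long double' precision, here is a short sketch. It assumes that your compiler supports `C_LONG_DOUBLE_COMPLEX' and that FFTW was built in `long double' precision (linked with `-lfftw3l'); the program name and the size `n = 8' are made up for the example.

```fortran
program long_double_sketch
  use, intrinsic :: iso_c_binding
  implicit none
  include 'fftw3.f03'     ! precision-independent FFTW_ constants
  include 'fftw3l.f03'    ! fftwl_ interface declarations

  integer(C_INT), parameter :: n = 8            ! illustrative size
  type(C_PTR) :: plan
  complex(C_LONG_DOUBLE_COMPLEX) :: x(n), y(n)

  plan = fftwl_plan_dft_1d(n, x, y, FFTW_FORWARD, FFTW_ESTIMATE)

  x = (1.0_C_LONG_DOUBLE, 0.0_C_LONG_DOUBLE)
  call fftwl_execute_dft(plan, x, y)

  call fftwl_destroy_plan(plan)
end program long_double_sketch
```

Only the routine prefix changes (`fftwl_' instead of `fftw_'); the `FFTW_' constants are the same ones declared in `fftw3.f03'.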
Chris@10
|
4835
|
Chris@10
|
4836 File: fftw3.info, Node: Reversing array dimensions, Next: FFTW Fortran type reference, Prev: Overview of Fortran interface, Up: Calling FFTW from Modern Fortran
|
Chris@10
|
4837
|
Chris@10
|
4838 7.2 Reversing array dimensions
|
Chris@10
|
4839 ==============================
|
Chris@10
|
4840
|
Chris@10
|
4841 A minor annoyance in calling FFTW from Fortran is that FFTW's array
|
Chris@10
|
4842 dimensions are defined in the C convention (row-major order), while
|
Chris@10
|
4843 Fortran's array dimensions are the opposite convention (column-major
|
Chris@10
|
4844 order). *Note Multi-dimensional Array Format::. This is just a
|
Chris@10
|
4845 bookkeeping difference, with no effect on performance. The only
|
Chris@10
|
4846 consequence of this is that, whenever you create an FFTW plan for a
|
Chris@10
|
4847 multi-dimensional transform, you must always _reverse the ordering of
|
Chris@10
|
4848 the dimensions_.
|
Chris@10
|
4849
|
Chris@10
|
4850 For example, consider the three-dimensional (L x M x N ) arrays:
|
Chris@10
|
4851
|
Chris@10
|
4852 complex(C_DOUBLE_COMPLEX), dimension(L,M,N) :: in, out
|
Chris@10
|
4853
|
Chris@10
|
4854 To plan a DFT for these arrays using `fftw_plan_dft_3d', you could
|
Chris@10
|
4855 do:
|
Chris@10
|
4856
|
Chris@10
|
4857 plan = fftw_plan_dft_3d(N,M,L, in,out, FFTW_FORWARD,FFTW_ESTIMATE)
|
Chris@10
|
4858
|
Chris@10
|
4859 That is, from FFTW's perspective this is a N x M x L array. _No
|
Chris@10
|
4860 data transposition need occur_, as this is _only notation_. Similarly,
|
Chris@10
|
4861 to use the more generic routine `fftw_plan_dft' with the same arrays,
|
Chris@10
|
4862 you could do:
|
Chris@10
|
4863
|
Chris@10
|
4864 integer(C_INT), dimension(3) :: n = [N,M,L]
|
Chris@10
|
4865 plan = fftw_plan_dft(3, n, in,out, FFTW_FORWARD,FFTW_ESTIMATE)
|
Chris@10
|
4866
|
Chris@10
|
4867 Note, by the way, that this is different from the legacy Fortran
|
Chris@10
|
4868 interface (*note Fortran-interface routines::), which automatically
|
Chris@10
|
4869 reverses the order of the array dimension for you. Here, you are
|
Chris@10
|
4870 calling the C interface directly, so there is no "translation" layer.
|
Chris@10
|
4871
|
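To see why reversing the dimensions is only bookkeeping, the following sketch checks that element `a(i,j,k)' of an L x M x N Fortran array sits at the same memory offset as element `[k-1][j-1][i-1]' of a row-major N x M x L C array. It uses only the standard `iso_c_binding' module and does not call FFTW; the program name and the sizes are illustrative assumptions.

```fortran
program dimension_order_sketch
  use, intrinsic :: iso_c_binding
  implicit none

  integer(C_INT), parameter :: L = 2, M = 3, N = 4   ! illustrative sizes
  integer(C_INT), target  :: a(L, M, N)
  integer(C_INT), pointer :: flat(:)
  integer(C_INT) :: i, j, k, c_offset
  logical :: ok

  ! give every element a distinct value
  do k = 1, N
    do j = 1, M
      do i = 1, L
        a(i, j, k) = 100*i + 10*j + k
      end do
    end do
  end do

  ! view the same storage as a flat 1d array, as a C routine would see it
  call c_f_pointer(c_loc(a), flat, [L * M * N])

  ok = .true.
  do k = 1, N
    do j = 1, M
      do i = 1, L
        ! row-major offset of [k-1][j-1][i-1] in a C array of shape N x M x L
        c_offset = (i - 1) + L * ((j - 1) + M * (k - 1))
        if (flat(c_offset + 1) /= a(i, j, k)) ok = .false.
      end do
    end do
  end do

  print *, 'layouts agree: ', ok
end program dimension_order_sketch
```

Because the two index formulas coincide, passing `N,M,L' to the planner describes the existing Fortran storage exactly, with no data movement.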
Chris@10
|
4872 An important thing to keep in mind is the implication of this for
|
Chris@10
|
4873 multidimensional real-to-complex transforms (*note Multi-Dimensional
|
Chris@10
|
4874 DFTs of Real Data::). In C, a multidimensional real-to-complex DFT
|
Chris@10
|
4875 chops the last dimension roughly in half (N x M x L real input goes to
|
Chris@10
|
4876 N x M x L/2+1 complex output). In Fortran, because the array
|
Chris@10
|
4877 dimension notation is reversed, the _first_ dimension of the complex
|
Chris@10
|
4878 data is chopped roughly in half. For example consider the `r2c'
|
Chris@10
|
4879 transform of L x M x N real input in Fortran:
|
Chris@10
|
4880
|
Chris@10
|
4881 type(C_PTR) :: plan
|
Chris@10
|
4882 real(C_DOUBLE), dimension(L,M,N) :: in
|
Chris@10
|
4883 complex(C_DOUBLE_COMPLEX), dimension(L/2+1,M,N) :: out
|
Chris@10
|
4884 plan = fftw_plan_dft_r2c_3d(N,M,L, in,out, FFTW_ESTIMATE)
|
Chris@10
|
4885 ...
|
Chris@10
|
4886 call fftw_execute_dft_r2c(plan, in, out)
|
Chris@10
|
4887
|
Chris@10
|
4888 Alternatively, for an in-place r2c transform, as described in the C
|
Chris@10
|
4889 documentation we must _pad_ the _first_ dimension of the real input
|
Chris@10
|
4890 with an extra two entries (which are ignored by FFTW) so as to leave
|
Chris@10
|
4891 enough space for the complex output. The input is _allocated_ as a
|
Chris@10
|
4892 2[L/2+1] x M x N array, even though only L x M x N of it is actually
|
Chris@10
|
4893 used. In this example, we will allocate the array as a pointer type,
|
Chris@10
|
4894 using `fftw_alloc' to ensure aligned memory for maximum performance
|
Chris@10
|
4895 (*note Allocating aligned memory in Fortran::); this also makes it easy
|
Chris@10
|
4896 to reference the same memory as both a real array and a complex array.
|
Chris@10
|
4897
|
Chris@10
|
4898 real(C_DOUBLE), pointer :: in(:,:,:)
|
Chris@10
|
4899 complex(C_DOUBLE_COMPLEX), pointer :: out(:,:,:)
|
Chris@10
|
4900 type(C_PTR) :: plan, data
|
Chris@10
|
4901 data = fftw_alloc_complex(int((L/2+1) * M * N, C_SIZE_T))
|
Chris@10
|
4902 call c_f_pointer(data, in, [2*(L/2+1),M,N])
|
Chris@10
|
4903 call c_f_pointer(data, out, [L/2+1,M,N])
|
Chris@10
|
4904 plan = fftw_plan_dft_r2c_3d(N,M,L, in,out, FFTW_ESTIMATE)
|
Chris@10
|
4905 ...
|
Chris@10
|
4906 call fftw_execute_dft_r2c(plan, in, out)
|
Chris@10
|
4907 ...
|
Chris@10
|
4908 call fftw_destroy_plan(plan)
|
Chris@10
|
4909 call fftw_free(data)
|
Chris@10
|
4910
|
Chris@10
|
4911
|
Chris@10
|
4912 File: fftw3.info, Node: FFTW Fortran type reference, Next: Plan execution in Fortran, Prev: Reversing array dimensions, Up: Calling FFTW from Modern Fortran
|
Chris@10
|
4913
|
Chris@10
|
4914 7.3 FFTW Fortran type reference
|
Chris@10
|
4915 ===============================
|
Chris@10
|
4916
|
Chris@10
|
4917 The following are the most important type correspondences between the C
|
Chris@10
|
4918 interface and Fortran:
|
Chris@10
|
4919
|
Chris@10
|
4920 * Plans (`fftw_plan' and variants) are `type(C_PTR)' (i.e. an opaque
|
Chris@10
|
4921 pointer).
|
Chris@10
|
4922
|
Chris@10
|
4923 * The C floating-point types `double', `float', and `long double'
|
Chris@10
|
4924 correspond to `real(C_DOUBLE)', `real(C_FLOAT)', and
|
Chris@10
|
4925 `real(C_LONG_DOUBLE)', respectively. The C complex types
|
Chris@10
|
4926 `fftw_complex', `fftwf_complex', and `fftwl_complex' correspond in
|
Chris@10
|
4927 Fortran to `complex(C_DOUBLE_COMPLEX)',
|
Chris@10
|
4928 `complex(C_FLOAT_COMPLEX)', and `complex(C_LONG_DOUBLE_COMPLEX)',
|
Chris@10
|
4929 respectively. Just as in C (*note Precision::), the FFTW
|
Chris@10
|
4930 subroutines and types are prefixed with `fftw_', `fftwf_', and
|
Chris@10
|
4931 `fftwl_' for the different precisions, and link to different
|
Chris@10
|
4932 libraries (`-lfftw3', `-lfftw3f', and `-lfftw3l' on Unix), but use
|
Chris@10
|
4933 the _same_ include file `fftw3.f03' and the _same_ constants (all
|
Chris@10
|
4934 of which begin with `FFTW_'). The exception is `long double'
|
Chris@10
|
4935 precision, for which you should _also_ include `fftw3l.f03' (*note
|
Chris@10
|
4936 Extended and quadruple precision in Fortran::).
|
Chris@10
|
4937
|
Chris@10
|
4938 * The C integer types `int' and `unsigned' (used for planner flags)
|
Chris@10
|
4939 become `integer(C_INT)'. The C integer type `ptrdiff_t' (e.g. in
|
Chris@10
|
4940 the *note 64-bit Guru Interface::) becomes `integer(C_INTPTR_T)',
|
Chris@10
|
4941 and `size_t' (in `fftw_malloc' etc.) becomes `integer(C_SIZE_T)'.
|
Chris@10
|
4942
|
Chris@10
|
4943 * The `fftw_r2r_kind' type (*note Real-to-Real Transform Kinds::)
|
Chris@10
|
4944 becomes `integer(C_FFTW_R2R_KIND)'. The various constant values
|
Chris@10
|
4945 of the C enumerated type (`FFTW_R2HC' etc.) become simply integer
|
Chris@10
|
4946 constants of the same names in Fortran.
|
Chris@10
|
4947
|
Chris@10
|
4948 * Numeric array pointer arguments (e.g. `double *') become
|
Chris@10
|
4949 `dimension(*), intent(out)' arrays of the same type, or
|
Chris@10
|
4950 `dimension(*), intent(in)' if they are pointers to constant data
|
Chris@10
|
4951 (e.g. `const int *'). There are a few exceptions where numeric
|
Chris@10
|
4952 pointers refer to scalar outputs (e.g. for `fftw_flops'), in which
|
Chris@10
|
4953 case they are `intent(out)' scalar arguments in Fortran too. For
|
Chris@10
|
4954 the new-array execute functions (*note New-array Execute
|
Chris@10
|
4955 Functions::), the input arrays are declared `dimension(*),
|
Chris@10
|
4956 intent(inout)', since they can be modified in the case of in-place
|
Chris@10
|
4957 or `FFTW_DESTROY_INPUT' transforms.
|
Chris@10
|
4958
|
Chris@10
|
4959 * Pointer _return_ values (e.g. `double *') become `type(C_PTR)'.
|
Chris@10
|
4960 (If they are pointers to arrays, as for `fftw_alloc_real', you can
|
Chris@10
|
4961 convert them back to Fortran array pointers with the standard
|
Chris@10
|
4962 intrinsic function `c_f_pointer'.)
|
Chris@10
|
4963
|
Chris@10
|
4964 * The `fftw_iodim' type in the guru interface (*note Guru vector and
|
Chris@10
|
4965 transform sizes::) becomes `type(fftw_iodim)' in Fortran, a
|
Chris@10
|
4966 derived data type (the Fortran analogue of C's `struct') with
|
Chris@10
|
4967 three `integer(C_INT)' components: `n', `is', and `os', with the
|
Chris@10
|
4968 same meanings as in C. The `fftw_iodim64' type in the 64-bit guru
|
Chris@10
|
4969 interface (*note 64-bit Guru Interface::) is the same, except that
|
Chris@10
|
4970 its components are of type `integer(C_INTPTR_T)'.
|
Chris@10
|
4971
|
Chris@10
|
4972 * Using the wisdom import/export functions from Fortran is a bit
|
Chris@10
|
4973 tricky, and is discussed in *note Accessing the wisdom API from
|
Chris@10
|
4974 Fortran::. In brief, the `FILE *' arguments map to `type(C_PTR)',
|
Chris@10
|
4975 `const char *' to `character(C_CHAR), dimension(*), intent(in)'
|
Chris@10
|
4976 (null-terminated!), and the generic read-char/write-char functions
|
Chris@10
|
4977 map to `type(C_FUNPTR)'.
|
Chris@10
|
4978
|
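Several of the correspondences listed above appear together in the following sketch. The program and variable names (`r2r_kind', `dims', and so on) and the size `n = 16' are assumptions chosen for illustration; the `dims' variable is filled in only to show the components of the derived type and is not passed to FFTW.

```fortran
program type_mapping_sketch
  use, intrinsic :: iso_c_binding
  implicit none
  include 'fftw3.f03'

  integer(C_INT), parameter :: n = 16        ! C: int
  type(C_PTR) :: plan                        ! C: fftw_plan
  real(C_DOUBLE) :: x(n), y(n)               ! C: double *
  integer(C_FFTW_R2R_KIND) :: r2r_kind       ! C: fftw_r2r_kind
  real(C_DOUBLE) :: add, mul, fmas           ! C: double * (scalar outputs)
  type(fftw_iodim) :: dims                   ! C: fftw_iodim

  r2r_kind = FFTW_REDFT10
  plan = fftw_plan_r2r_1d(n, x, y, r2r_kind, FFTW_ESTIMATE)

  x = 1.0_C_DOUBLE
  call fftw_execute_r2r(plan, x, y)

  ! scalar intent(out) arguments, corresponding to double * outputs in C
  call fftw_flops(plan, add, mul, fmas)
  print *, 'adds, multiplies, fmas: ', add, mul, fmas

  ! the derived type has three integer(C_INT) components
  dims%n  = n
  dims%is = 1
  dims%os = 1

  call fftw_destroy_plan(plan)
end program type_mapping_sketch
```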
Chris@10
|
4979
|
Chris@10
|
4980 You may be wondering if you need to search-and-replace
|
Chris@10
|
4981 `real(kind(0.0d0))' (or whatever your favorite Fortran spelling of
|
Chris@10
|
4982 "double precision" is) with `real(C_DOUBLE)' everywhere in your
|
Chris@10
|
4983 program, and similarly for `complex' and `integer' types. The answer
|
Chris@10
|
4984 is no; you can still use your existing types. As long as these types
|
Chris@10
|
4985 match their C counterparts, things should work without a hitch. The
|
Chris@10
|
4986 worst that can happen, e.g. in the (unlikely) event of a system where
|
Chris@10
|
4987 `real(kind(0.0d0))' is different from `real(C_DOUBLE)', is that the
|
Chris@10
|
4988 compiler will give you a type-mismatch error. That is, if you don't
|
Chris@10
|
4989 use the `iso_c_binding' kinds you need to accept at least the
|
Chris@10
|
4990 theoretical possibility of having to change your code in response to
|
Chris@10
|
4991 compiler errors on some future machine, but you don't need to worry
|
Chris@10
|
4992 about silently compiling incorrect code that yields runtime errors.
|
Chris@10
|
4993
|
Chris@10
|
4994
|
Chris@10
|
4995 File: fftw3.info, Node: Plan execution in Fortran, Next: Allocating aligned memory in Fortran, Prev: FFTW Fortran type reference, Up: Calling FFTW from Modern Fortran
|
Chris@10
|
4996
|
Chris@10
|
4997 7.4 Plan execution in Fortran
|
Chris@10
|
4998 =============================
|
Chris@10
|
4999
|
Chris@10
|
5000 In C, in order to use a plan, one normally calls `fftw_execute', which
|
Chris@10
|
5001 executes the plan to perform the transform on the input/output arrays
|
Chris@10
|
5002 passed when the plan was created (*note Using Plans::). The
|
Chris@10
|
5003 corresponding subroutine call in modern Fortran is:
|
Chris@10
|
5004 call fftw_execute(plan)
|
Chris@10
|
5005
|
Chris@10
|
5006 However, we have had reports that this causes problems with some
|
Chris@10
|
5007 recent optimizing Fortran compilers. The problem is, because the
|
Chris@10
|
5008 input/output arrays are not passed as explicit arguments to
|
Chris@10
|
5009 `fftw_execute', the semantics of Fortran (unlike C) allow the compiler
|
Chris@10
|
5010 to assume that the input/output arrays are not changed by
|
Chris@10
|
5011 `fftw_execute'. As a consequence, certain compilers end up
|
Chris@10
|
5012 repositioning the call to `fftw_execute', assuming incorrectly that it
|
Chris@10
|
5013 does nothing to the arrays.
|
Chris@10
|
5014
|
Chris@10
|
5015 There are various workarounds to this, but the safest and simplest
|
Chris@10
|
5016 thing is to not use `fftw_execute' in Fortran. Instead, use the
|
Chris@10
|
5017 functions described in *note New-array Execute Functions::, which take
|
Chris@10
|
5018 the input/output arrays as explicit arguments. For example, if the
|
Chris@10
|
5019 plan is for a complex-data DFT and was created for the arrays `in' and
|
Chris@10
|
5020 `out', you would do:
|
Chris@10
|
5021 call fftw_execute_dft(plan, in, out)
|
Chris@10
|
5022
|
Chris@10
|
5023 There are a few things to be careful of, however:
|
Chris@10
|
5024
|
Chris@10
|
5025 * You must use the correct type of execute function, matching the way
|
Chris@10
|
5026 the plan was created. Complex DFT plans should use
|
Chris@10
|
5027 `fftw_execute_dft', real-input (r2c) DFT plans should use
|
Chris@10
|
5028 `fftw_execute_dft_r2c', and real-output (c2r) DFT plans should use
|
Chris@10
|
5029 `fftw_execute_dft_c2r'. The various r2r plans should use
|
Chris@10
|
5030 `fftw_execute_r2r'. Fortunately, if you use the wrong one you
|
Chris@10
|
5031 will get a compile-time type-mismatch error (unlike legacy
|
Chris@10
|
5032 Fortran).
|
Chris@10
|
5033
|
Chris@10
|
5034 * You should normally pass the same input/output arrays that were
|
Chris@10
|
5035 used when creating the plan. This is always safe.
|
Chris@10
|
5036
|
Chris@10
|
5037 * _If_ you pass _different_ input/output arrays compared to those
|
Chris@10
|
5038 used when creating the plan, you must abide by all the
|
Chris@10
|
5039 restrictions of the new-array execute functions (*note New-array
|
Chris@10
|
5040 Execute Functions::). The most tricky of these is the requirement
|
Chris@10
|
5041 that the new arrays have the same alignment as the original
|
Chris@10
|
5042 arrays; the best (and possibly only) way to guarantee this is to
|
Chris@10
|
5043 use the `fftw_alloc' functions to allocate your arrays (*note
|
Chris@10
|
5044 Allocating aligned memory in Fortran::). Alternatively, you can
|
Chris@10
|
5045 use the `FFTW_UNALIGNED' flag when creating the plan, in which
|
Chris@10
|
5046 case the plan does not depend on the alignment, but this may
|
Chris@10
|
5047 sacrifice substantial performance on architectures (like x86) with
|
Chris@10
|
5048 SIMD instructions (*note SIMD alignment and fftw_malloc::).
|
Chris@10
|
5049
|
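The rules above can be illustrated with a plan that is created for one pair of arrays and then applied to a different input array. This is a sketch under the assumption that all arrays come from `fftw_alloc_complex', which is what makes their alignment match; the program name, the array names, and the size `n = 64' are illustrative.

```fortran
program new_array_execute_sketch
  use, intrinsic :: iso_c_binding
  implicit none
  include 'fftw3.f03'

  integer(C_INT), parameter :: n = 64        ! illustrative size
  type(C_PTR) :: plan, pa, pb, pc
  complex(C_DOUBLE_COMPLEX), pointer :: a(:), b(:), c(:)

  ! all three arrays come from FFTW's allocator, so they share its alignment
  pa = fftw_alloc_complex(int(n, C_SIZE_T))
  pb = fftw_alloc_complex(int(n, C_SIZE_T))
  pc = fftw_alloc_complex(int(n, C_SIZE_T))
  call c_f_pointer(pa, a, [n])
  call c_f_pointer(pb, b, [n])
  call c_f_pointer(pc, c, [n])

  ! the plan is created for the pair (a, b)
  plan = fftw_plan_dft_1d(n, a, b, FFTW_FORWARD, FFTW_MEASURE)

  ! same arrays as at planning time: always safe
  a = (1.0_C_DOUBLE, 0.0_C_DOUBLE)
  call fftw_execute_dft(plan, a, b)

  ! different input array: same size, same alignment, still out-of-place
  c = (0.0_C_DOUBLE, 1.0_C_DOUBLE)
  call fftw_execute_dft(plan, c, b)

  call fftw_destroy_plan(plan)
  call fftw_free(pa)
  call fftw_free(pb)
  call fftw_free(pc)
end program new_array_execute_sketch
```

Had `c' been an ordinary Fortran array, its alignment could differ from that of `a', and the plan would need to be created with `FFTW_UNALIGNED' for the second call to be safe.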
Chris@10
|
5050
|
Chris@10
|
5051
|
Chris@10
|
5052 File: fftw3.info, Node: Allocating aligned memory in Fortran, Next: Accessing the wisdom API from Fortran, Prev: Plan execution in Fortran, Up: Calling FFTW from Modern Fortran
|
Chris@10
|
5053
|
Chris@10
|
5054 7.5 Allocating aligned memory in Fortran
|
Chris@10
|
5055 ========================================
|
Chris@10
|
5056
|
Chris@10
|
5057 In order to obtain maximum performance in FFTW, you should store your
|
Chris@10
|
5058 data in arrays that have been specially aligned in memory (*note SIMD
|
Chris@10
|
5059 alignment and fftw_malloc::). Enforcing alignment also permits you to
|
Chris@10
|
5060 safely use the new-array execute functions (*note New-array Execute
|
Chris@10
|
5061 Functions::) to apply a given plan to more than one pair of in/out
|
Chris@10
|
5062 arrays. Unfortunately, standard Fortran arrays do _not_ provide any
|
Chris@10
|
5063 alignment guarantees. The _only_ way to allocate aligned memory in
|
Chris@10
|
5064 standard Fortran is to allocate it with an external C function, like
|
Chris@10
|
5065 the `fftw_alloc_real' and `fftw_alloc_complex' functions. Fortunately,
|
Chris@10
|
5066 Fortran 2003 provides a simple way to associate such allocated memory
|
Chris@10
|
5067 with a standard Fortran array pointer that you can then use normally.
|
Chris@10
|
5068
|
Chris@10
|
5069 We therefore recommend allocating all your input/output arrays using
|
Chris@10
|
5070 the following technique:
|
Chris@10
|
5071
|
Chris@10
|
5072 1. Declare a `pointer', `arr', to your array of the desired type and
|
Chris@10
|
5073 dimensions. For example, `real(C_DOUBLE), pointer :: arr(:,:)' for
|
Chris@10
|
5074 a 2d real array, or `complex(C_DOUBLE_COMPLEX), pointer ::
|
Chris@10
|
5075 arr(:,:,:)' for a 3d complex array.
|
Chris@10
|
5076
|
Chris@10
|
5077 2. The number of elements to allocate must be an `integer(C_SIZE_T)'.
|
Chris@10
|
5078 You can either declare a variable of this type, e.g.
|
Chris@10
|
5079 `integer(C_SIZE_T) :: sz', to store the number of elements to
|
Chris@10
|
5080 allocate, or you can use the `int(..., C_SIZE_T)' intrinsic
|
Chris@10
|
5081 function. e.g. set `sz = L * M * N' or use `int(L * M * N,
|
Chris@10
|
5082 C_SIZE_T)' for an L x M x N array.
|
Chris@10
|
5083
|
Chris@10
|
5084 3. Declare a `type(C_PTR) :: p' to hold the return value from FFTW's
|
Chris@10
|
5085 allocation routine. Set `p = fftw_alloc_real(sz)' for a real
|
Chris@10
|
5086 array, or `p = fftw_alloc_complex(sz)' for a complex array.
|
Chris@10
|
5087
|
Chris@10
|
5088 4. Associate your pointer `arr' with the allocated memory `p' using
|
Chris@10
|
5089 the standard `c_f_pointer' subroutine: `call c_f_pointer(p, arr,
|
Chris@10
|
5090 [...dimensions...])', where `[...dimensions...]' is an array of
|
Chris@10
|
5091 the dimensions of the array (in the usual Fortran order). e.g.
|
Chris@10
|
5092 `call c_f_pointer(p, arr, [L,M,N])' for an L x M x N array.
|
Chris@10
|
5093 (Alternatively, you can omit the dimensions argument if you
|
Chris@10
|
5094 specified the shape explicitly when declaring `arr'.) You can now
|
Chris@10
|
5095 use `arr' as a usual multidimensional array.
|
Chris@10
|
5096
|
Chris@10
|
5097 5. When you are done using the array, deallocate the memory by `call
|
Chris@10
|
5098 fftw_free(p)' on `p'.
|
Chris@10
|
5099
|
Chris@10
|
5100
|
Chris@10
|
5101 For example, here is how we would allocate an L x M 2d real array:
|
Chris@10
|
5102
|
Chris@10
|
5103 real(C_DOUBLE), pointer :: arr(:,:)
|
Chris@10
|
5104 type(C_PTR) :: p
|
Chris@10
|
5105 p = fftw_alloc_real(int(L * M, C_SIZE_T))
|
Chris@10
|
5106 call c_f_pointer(p, arr, [L,M])
|
Chris@10
|
5107 _...use arr and arr(i,j) as usual..._
|
Chris@10
|
5108 call fftw_free(p)
|
Chris@10
|
5109
|
Chris@10
|
5110 and here is an L x M x N 3d complex array:
|
Chris@10
|
5111
|
Chris@10
|
5112 complex(C_DOUBLE_COMPLEX), pointer :: arr(:,:,:)
|
Chris@10
|
5113 type(C_PTR) :: p
|
Chris@10
|
5114 p = fftw_alloc_complex(int(L * M * N, C_SIZE_T))
|
Chris@10
|
5115 call c_f_pointer(p, arr, [L,M,N])
|
Chris@10
|
5116 _...use arr and arr(i,j,k) as usual..._
|
Chris@10
|
5117 call fftw_free(p)
|
Chris@10
|
5118
|
Chris@10
|
5119 See *note Reversing array dimensions:: for an example allocating a
|
Chris@10
|
5120 single array and associating both real and complex array pointers with
|
Chris@10
|
5121 it, for in-place real-to-complex transforms.
|
Chris@10
|
5122
|
Chris@10
|
5123
|
Chris@10
|
5124 File: fftw3.info, Node: Accessing the wisdom API from Fortran, Next: Defining an FFTW module, Prev: Allocating aligned memory in Fortran, Up: Calling FFTW from Modern Fortran
|
Chris@10
|
5125
|
Chris@10
|
5126 7.6 Accessing the wisdom API from Fortran
|
Chris@10
|
5127 =========================================
|
Chris@10
|
5128
|
Chris@10
|
5129 As explained in *note Words of Wisdom-Saving Plans::, FFTW provides a
|
Chris@10
|
5130 "wisdom" API for saving plans to disk so that they can be recreated
|
Chris@10
|
5131 quickly. The C API for exporting (*note Wisdom Export::) and importing
|
Chris@10
|
5132 (*note Wisdom Import::) wisdom is somewhat tricky to use from Fortran,
|
Chris@10
|
5133 however, because of differences in file I/O and string types between C
|
Chris@10
|
5134 and Fortran.
|
Chris@10
|
5135
|
Chris@10
|
5136 * Menu:
|
Chris@10
|
5137
|
Chris@10
|
5138 * Wisdom File Export/Import from Fortran::
|
Chris@10
|
5139 * Wisdom String Export/Import from Fortran::
|
Chris@10
|
5140 * Wisdom Generic Export/Import from Fortran::
|
Chris@10
|
5141
|
Chris@10
|
5142
|
Chris@10
|
5143 File: fftw3.info, Node: Wisdom File Export/Import from Fortran, Next: Wisdom String Export/Import from Fortran, Prev: Accessing the wisdom API from Fortran, Up: Accessing the wisdom API from Fortran
|
Chris@10
|
5144
|
Chris@10
|
5145 7.6.1 Wisdom File Export/Import from Fortran
|
Chris@10
|
5146 --------------------------------------------
|
Chris@10
|
5147
|
Chris@10
|
5148 The easiest way to export and import wisdom is to do so using
|
Chris@10
|
5149 `fftw_export_wisdom_to_filename' and `fftw_import_wisdom_from_filename'. The
|
Chris@10
|
5150 only trick is that these require you to pass a C string, which is an
|
Chris@10
|
5151 array of type `CHARACTER(C_CHAR)' that is terminated by `C_NULL_CHAR'.
|
Chris@10
|
5152 You can call them like this:
|
Chris@10
|
5153
|
Chris@10
|
5154 integer(C_INT) :: ret
|
Chris@10
|
5155 ret = fftw_export_wisdom_to_filename(C_CHAR_'my_wisdom.dat' // C_NULL_CHAR)
|
Chris@10
|
5156 if (ret .eq. 0) stop 'error exporting wisdom to file'
|
Chris@10
|
5157 ret = fftw_import_wisdom_from_filename(C_CHAR_'my_wisdom.dat' // C_NULL_CHAR)
|
Chris@10
|
5158 if (ret .eq. 0) stop 'error importing wisdom from file'
|
Chris@10
|
5159
|
Chris@10
|
5160 Note that prepending `C_CHAR_' is needed to specify that the literal
|
Chris@10
|
5161 string is of kind `C_CHAR', and we null-terminate the string by
|
Chris@10
|
5162 appending `// C_NULL_CHAR'. These functions return an `integer(C_INT)'
|
Chris@10
|
5163 (`ret') which is `0' if an error occurred during export/import and
|
Chris@10
|
5164 nonzero otherwise.
|
Chris@10
|
5165
|
Chris@10
|
5166 It is also possible to use the lower-level routines
|
Chris@10
|
5167 `fftw_export_wisdom_to_file' and `fftw_import_wisdom_from_file', which
|
Chris@10
|
5168 accept parameters of the C type `FILE*', expressed in Fortran as
|
Chris@10
|
5169 `type(C_PTR)'. However, you are then responsible for creating the
|
Chris@10
|
5170 `FILE*' yourself. You can do this by using `iso_c_binding' to define
|
Chris@10
|
5171 Fortran interfaces for the C library functions `fopen' and `fclose',
|
Chris@10
|
5172 which is a bit strange in Fortran but workable.
|
Chris@10
|
5173
|
Chris@10
|
5174
|
Chris@10
|
5175 File: fftw3.info, Node: Wisdom String Export/Import from Fortran, Next: Wisdom Generic Export/Import from Fortran, Prev: Wisdom File Export/Import from Fortran, Up: Accessing the wisdom API from Fortran
|
Chris@10
|
5176
|
Chris@10
|
5177 7.6.2 Wisdom String Export/Import from Fortran
|
Chris@10
|
5178 ----------------------------------------------
|
Chris@10
|
5179
|
Chris@10
|
5180 Dealing with FFTW's C string export/import is a bit more painful. In
|
Chris@10
|
5181 particular, the `fftw_export_wisdom_to_string' function requires you to
|
Chris@10
|
5182 deal with a dynamically allocated C string. To get its length, you
|
Chris@10
|
5183 must define an interface to the C `strlen' function, and to deallocate
|
Chris@10
|
5184 it you must define an interface to C `free':
|
Chris@10
|
5185
|
Chris@10
|
5186 use, intrinsic :: iso_c_binding
|
Chris@10
|
5187 interface
|
Chris@10
|
5188 integer(C_SIZE_T) function strlen(s) bind(C, name='strlen')
|
Chris@10
|
5189 import
|
Chris@10
|
5190 type(C_PTR), value :: s
|
Chris@10
|
5191 end function strlen
|
Chris@10
|
5192 subroutine free(p) bind(C, name='free')
|
Chris@10
|
5193 import
|
Chris@10
|
5194 type(C_PTR), value :: p
|
Chris@10
|
5195 end subroutine free
|
Chris@10
|
5196 end interface
|
Chris@10
|
5197
|
Chris@10
|
5198 Given these definitions, you can then export wisdom to a Fortran
|
Chris@10
|
5199 character array:
|
Chris@10
|
5200
|
Chris@10
|
5201 character(C_CHAR), pointer :: s(:)
|
Chris@10
|
5202 integer(C_SIZE_T) :: slen
|
Chris@10
|
5203 type(C_PTR) :: p
|
Chris@10
|
5204 p = fftw_export_wisdom_to_string()
|
Chris@10
|
5205 if (.not. c_associated(p)) stop 'error exporting wisdom'
|
Chris@10
|
5206 slen = strlen(p)
|
Chris@10
|
5207 call c_f_pointer(p, s, [slen+1])
|
Chris@10
|
5208 ...
|
Chris@10
|
5209 call free(p)
|
Chris@10
|
5210
|
Chris@10
|
5211 Note that `slen' is the length of the C string, but the length of
|
Chris@10
|
5212 the array is `slen+1' because it includes the terminating null
|
Chris@10
|
5213 character. (You can omit the `+1' if you don't want Fortran to know
|
Chris@10
|
5214 about the null character.) The standard `c_associated' function checks
|
Chris@10
|
5215 whether `p' is a null pointer, which is returned by
|
Chris@10
|
5216 `fftw_export_wisdom_to_string' if there was an error.
|
Chris@10
|
5217
|
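The `slen' versus `slen+1' distinction is the usual C convention: `strlen' counts characters only, while the storage also holds the terminating null. A minimal C sketch of that bookkeeping follows; `demo_dup_cstring' is a hypothetical helper for illustration, not part of FFTW.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicate a null-terminated string.
   strlen() counts characters only, so the copy needs strlen(s) + 1 bytes. */
static char *demo_dup_cstring(const char *s)
{
    size_t slen = strlen(s);          /* excludes the terminator */
    char *copy = malloc(slen + 1);    /* includes the terminator */
    if (copy != NULL)
        memcpy(copy, s, slen + 1);
    return copy;
}
```

The caller owns the returned storage and releases it with `free', just as the Fortran example releases the exported wisdom string.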
Chris@10
|
5218 To import wisdom from a string, use `fftw_import_wisdom_from_string'
|
Chris@10
|
5219 as usual; note that the argument of this function must be a
|
Chris@10
|
5220 `character(C_CHAR)' that is terminated by the `C_NULL_CHAR' character,
|
Chris@10
|
5221 like the `s' array above.
|
Chris@10
|
5222
|
Chris@10
|
5223
|
Chris@10
|
5224 File: fftw3.info, Node: Wisdom Generic Export/Import from Fortran, Prev: Wisdom String Export/Import from Fortran, Up: Accessing the wisdom API from Fortran
|
Chris@10
|
5225
|
Chris@10
|
5226 7.6.3 Wisdom Generic Export/Import from Fortran
|
Chris@10
|
5227 -----------------------------------------------
|
Chris@10
|
5228
|
Chris@10
|
5229 The most generic wisdom export/import functions allow you to provide an
|
Chris@10
|
5230 arbitrary callback function to read/write one character at a time in
|
Chris@10
|
5231 any way you want. However, your callback function must be written in a
|
Chris@10
|
5232 special way, using the `bind(C)' attribute to be passed to a C
|
Chris@10
|
5233 interface.
|
Chris@10
|
5234
|
Chris@10
|
5235 In particular, to call the generic wisdom export function
|
Chris@10
|
5236 `fftw_export_wisdom', you would write a callback subroutine of the form:
|
Chris@10
|
5237
|
Chris@10
|
5238 subroutine my_write_char(c, p) bind(C)
|
Chris@10
|
5239 use, intrinsic :: iso_c_binding
|
Chris@10
|
5240 character(C_CHAR), value :: c
|
Chris@10
|
5241 type(C_PTR), value :: p
|
Chris@10
|
5242 _...write c..._
|
Chris@10
|
5243 end subroutine my_write_char
|
Chris@10
|
5244
|
Chris@10
|
5245 Given such a subroutine (along with the corresponding interface
|
Chris@10
|
5246 definition), you could then export wisdom using:
|
Chris@10
|
5247
|
Chris@10
|
5248 call fftw_export_wisdom(c_funloc(my_write_char), p)
|
Chris@10
|
5249
|
Chris@10
|
5250 The standard `c_funloc' intrinsic converts a Fortran `bind(C)'
|
Chris@10
|
5251 subroutine into a C function pointer. The parameter `p' is a
|
Chris@10
|
5252 `type(C_PTR)' to any arbitrary data that you want to pass to
|
Chris@10
|
5253 `my_write_char' (or `C_NULL_PTR' if none). (Note that you can get a C
|
Chris@10
|
5254 pointer to Fortran data using the intrinsic `c_loc', and convert it
|
Chris@10
|
5255 back to a Fortran pointer in `my_write_char' using `c_f_pointer'.)
|
Chris@10
|
5256
|
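Seen from the C side, the export callback receives one character plus a user-data pointer, which is the shape the `bind(C)' subroutine above mirrors. The following C sketch illustrates only that callback pattern. The names `demo_buf', `demo_write_char', and `demo_export' are hypothetical, and `demo_export' is a stand-in driver, not an FFTW routine.

```c
#include <stddef.h>

/* Hypothetical accumulator reached through the user-data pointer p. */
struct demo_buf {
    char   data[64];
    size_t len;
};

/* Callback: receives one character and the user-data pointer. */
static void demo_write_char(char c, void *p)
{
    struct demo_buf *b = (struct demo_buf *)p;
    if (b->len + 1 < sizeof b->data) {   /* keep room for the terminator */
        b->data[b->len++] = c;
        b->data[b->len] = '\0';
    }
}

/* Stand-in driver (NOT an FFTW function): feeds text to the callback
   one character at a time, passing p through untouched. */
static void demo_export(void (*write_char)(char, void *), void *p,
                        const char *text)
{
    for (; *text != '\0'; ++text)
        write_char(*text, p);
}
```

Calling `demo_export(demo_write_char, &buf, "abc")' with a zero-initialized `struct demo_buf buf' leaves the three characters in `buf.data'. The driver never inspects `p'; only the callback knows what it points to.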
Chris@10
|
5257 Similarly, to use the generic `fftw_import_wisdom', you would define
|
Chris@10
|
5258 a callback function of the form:
|
Chris@10
|
5259
|
Chris@10
|
5260 integer(C_INT) function my_read_char(p) bind(C)
|
Chris@10
|
5261 use, intrinsic :: iso_c_binding
|
Chris@10
|
5262 type(C_PTR), value :: p
|
Chris@10
|
5263 character(C_CHAR) :: c
|
Chris@10
|
5264 _...read a character c..._
|
Chris@10
|
5265 my_read_char = ichar(c, C_INT)
|
Chris@10
|
5266 end function my_read_char
|
Chris@10
|
5267
|
Chris@10
|
5268 ....
|
Chris@10
|
5269
|
Chris@10
|
5270 integer(C_INT) :: ret
|
Chris@10
|
5271 ret = fftw_import_wisdom(c_funloc(my_read_char), p)
|
Chris@10
|
5272 if (ret .eq. 0) stop 'error importing wisdom'
|
Chris@10
|
5273
|
Chris@10
|
5274 Your function can return `-1' if the end of the input is reached.
|
Chris@10
|
5275 Again, `p' is an arbitrary `type(C_PTR)' that is passed through to your
|
Chris@10
|
5276 function. `fftw_import_wisdom' returns `0' if an error occurred and
|
Chris@10
|
5277 nonzero otherwise.
|
Chris@10
|
5278
|
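The read callback follows the same convention as C's character-input functions: a non-negative character code on success and `-1' at the end of the input. A minimal C sketch of such a callback, reading from an in-memory string, is given here. `demo_src' and `demo_read_char' are hypothetical names for illustration.

```c
#include <stddef.h>

/* Hypothetical input source reached through the user-data pointer p. */
struct demo_src {
    const char *text;
    size_t      pos;
};

/* Callback: returns the next character as a non-negative int,
   or -1 once the input is exhausted. */
static int demo_read_char(void *p)
{
    struct demo_src *s = (struct demo_src *)p;
    char c = s->text[s->pos];
    if (c == '\0')
        return -1;               /* end of input */
    s->pos++;
    return (unsigned char)c;     /* avoid negative values for high-bit chars */
}
```

Once the end is reached the callback keeps returning `-1', so calling it again is harmless.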
Chris@10
|
5279
|
Chris@10
|
5280 File: fftw3.info, Node: Defining an FFTW module, Prev: Accessing the wisdom API from Fortran, Up: Calling FFTW from Modern Fortran
|
Chris@10
|
5281
|
Chris@10
|
5282 7.7 Defining an FFTW module
|
Chris@10
|
5283 ===========================
|
Chris@10
|
5284
|
Chris@10
|
5285 Rather than using the `include' statement to include the `fftw3.f03'
|
Chris@10
|
5286 interface file in any subroutine where you want to use FFTW, you might
|
Chris@10
|
5287 prefer to define an FFTW Fortran module. FFTW does not install itself
|
Chris@10
|
5288 as a module, primarily because `fftw3.f03' can be shared between
|
Chris@10
|
5289 different Fortran compilers while modules (in general) cannot.
|
Chris@10
|
5290 However, it is trivial to define your own FFTW module if you want.
|
Chris@10
|
5291 Just create a file containing:
|
Chris@10
|
5292
|
Chris@10
|
5293 module FFTW3
|
Chris@10
|
5294 use, intrinsic :: iso_c_binding
|
Chris@10
|
5295 include 'fftw3.f03'
|
Chris@10
|
5296 end module
|
Chris@10
|
5297
|
Chris@10
|
5298 Compile this file into a module as usual for your compiler (e.g. with
|
Chris@10
|
5299 `gfortran -c' you will get a file `fftw3.mod'). Now, instead of
|
Chris@10
|
5300 `include 'fftw3.f03'', whenever you want to use FFTW routines you can
|
Chris@10
|
5301 just do:
|
Chris@10
|
5302
|
Chris@10
|
5303 use FFTW3
|
Chris@10
|
5304
|
Chris@10
|
5305 as usual for Fortran modules. (You still need to link to the FFTW
|
Chris@10
|
5306 library, of course.)
|
Chris@10
|
5307
|
Chris@10
|
5308
|
Chris@10
|
5309 File: fftw3.info, Node: Calling FFTW from Legacy Fortran, Next: Upgrading from FFTW version 2, Prev: Calling FFTW from Modern Fortran, Up: Top
|
Chris@10
|
5310
|
Chris@10
|
5311 8 Calling FFTW from Legacy Fortran
|
Chris@10
|
5312 **********************************
|
Chris@10
|
5313
|
Chris@10
|
5314 This chapter describes the interface to FFTW callable by Fortran code
|
Chris@10
|
5315 in older compilers not supporting the Fortran 2003 C interoperability
|
Chris@10
|
5316 features (*note Calling FFTW from Modern Fortran::). This interface
|
Chris@10
|
5317 has the major disadvantage that it is not type-checked, so if you
|
Chris@10
|
5318 get the argument types or ordering wrong, your program will compile
|
Chris@10
|
5319 without errors and will likely crash at runtime. So, greater
|
Chris@10
|
5320 care is needed. Also, technically interfacing older Fortran versions
|
Chris@10
|
5321 to C is nonstandard, but in practice we have found that the techniques
|
Chris@10
|
5322 used in this chapter have worked with all known Fortran compilers for
|
Chris@10
|
5323 many years.
|
Chris@10
|
5324
|
Chris@10
|
5325 The legacy Fortran interface differs from the C interface only in the
|
Chris@10
|
5326 prefix (`dfftw_' instead of `fftw_' in double precision) and a few
|
Chris@10
|
5327 other minor details. This Fortran interface is included in the FFTW
|
Chris@10
|
5328 libraries by default, unless a Fortran compiler isn't found on your
|
Chris@10
|
5329 system or `--disable-fortran' is included in the `configure' flags. We
|
Chris@10
|
5330 assume here that the reader is already familiar with the usage of FFTW
|
Chris@10
|
5331 in C, as described elsewhere in this manual.
|
Chris@10
|
5332
|
Chris@10
|
5333 The MPI parallel interface to FFTW is _not_ currently available to
|
Chris@10
|
5334 legacy Fortran.
|
Chris@10
|
5335
|
Chris@10
|
5336 * Menu:
|
Chris@10
|
5337
|
Chris@10
|
5338 * Fortran-interface routines::
|
Chris@10
|
5339 * FFTW Constants in Fortran::
|
Chris@10
|
5340 * FFTW Execution in Fortran::
|
Chris@10
|
5341 * Fortran Examples::
|
Chris@10
|
5342 * Wisdom of Fortran?::
|
Chris@10
|
5343
|
Chris@10
|
5344
|
Chris@10
|
5345 File: fftw3.info, Node: Fortran-interface routines, Next: FFTW Constants in Fortran, Prev: Calling FFTW from Legacy Fortran, Up: Calling FFTW from Legacy Fortran
|
Chris@10
|
5346
|
Chris@10
|
5347 8.1 Fortran-interface routines
|
Chris@10
|
5348 ==============================
|
Chris@10
|
5349
|
Chris@10
|
5350 Nearly all of the FFTW functions have Fortran-callable equivalents.
|
Chris@10
|
5351 The name of the legacy Fortran routine is the same as that of the
|
Chris@10
|
5352 corresponding C routine, but with the `fftw_' prefix replaced by
|
Chris@10
|
5353 `dfftw_'.(1) The single and long-double precision versions use
|
Chris@10
|
5354 `sfftw_' and `lfftw_', respectively, instead of `fftwf_' and `fftwl_';
|
Chris@10
|
5355 quadruple precision (`real*16') is available on some systems as
|
Chris@10
|
5356 `fftwq_' (*note Precision::). (Note that `long double' on x86 hardware
|
Chris@10
|
5357 is usually at most 80-bit extended precision, _not_ quadruple
|
Chris@10
|
5358 precision.)
|
Chris@10
|
5359
|
Chris@10
|
5360 For the most part, all of the arguments to the functions are the
|
Chris@10
|
5361 same, with the following exceptions:
|
Chris@10
|
5362
|
Chris@10
|
5363 * `plan' variables (what would be of type `fftw_plan' in C) must be
|
Chris@10
|
5364 declared as a type that is at least as big as a pointer (address)
|
Chris@10
|
5365 on your machine. We recommend using `integer*8' everywhere, since
|
Chris@10
|
5366 this should always be big enough.
|
Chris@10
|
5367
|
Chris@10
|
5368 * Any function that returns a value (e.g. `fftw_plan_dft') is
|
Chris@10
|
5369 converted into a _subroutine_. The return value is converted into
|
Chris@10
|
5370 an additional _first_ parameter of this subroutine.(2)
|
Chris@10
|
5371
|
Chris@10
|
5372 * The Fortran routines expect multi-dimensional arrays to be in
|
Chris@10
|
5373 _column-major_ order, which is the ordinary format of Fortran
|
Chris@10
|
5374 arrays (*note Multi-dimensional Array Format::). They do this
|
Chris@10
|
5375 transparently and costlessly simply by reversing the order of the
|
Chris@10
|
5376 dimensions passed to FFTW, but this has one important consequence
|
Chris@10
|
5377 for multi-dimensional real-complex transforms, discussed below.
|
Chris@10
|
5378
|
Chris@10
|
5379 * Wisdom import and export is somewhat more tricky because one cannot
|
Chris@10
|
5380 easily pass files or strings between C and Fortran; see *note
|
Chris@10
|
5381 Wisdom of Fortran?::.
|
Chris@10
|
5382
|
Chris@10
|
5383 * Legacy Fortran cannot use the `fftw_malloc' dynamic-allocation
|
Chris@10
|
5384 routine. If you want to exploit the SIMD FFTW (*note SIMD
|
Chris@10
|
5385 alignment and fftw_malloc::), you'll need to figure out some other
|
Chris@10
|
5386 way to ensure that your arrays are at least 16-byte aligned.
|
Chris@10
|
5387
|
Chris@10
|
5388 * Since Fortran 77 does not have data structures, the `fftw_iodim'
|
Chris@10
|
5389 structure from the guru interface (*note Guru vector and transform
|
Chris@10
|
5390 sizes::) must be split into separate arguments. In particular, any
|
Chris@10
|
5391 `fftw_iodim' array arguments in the C guru interface become three
|
Chris@10
|
5392 integer array arguments (`n', `is', and `os') in the Fortran guru
|
Chris@10
|
5393 interface, all of whose lengths should be equal to the
|
Chris@10
|
5394 corresponding `rank' argument.
|
Chris@10
|
5395
|
Chris@10
|
5396 * The guru planner interface in Fortran does _not_ do any automatic
|
Chris@10
|
5397 translation between column-major and row-major; you are responsible
|
Chris@10
|
5398 for setting the strides etcetera to correspond to your Fortran
|
Chris@10
|
5399 arrays. However, as a slight bug that we are preserving for
|
Chris@10
|
5400 backwards compatibility, the `plan_guru_r2r' in Fortran _does_
|
Chris@10
|
5401 reverse the order of its `kind' array parameter, so the `kind'
|
Chris@10
|
5402 array of that routine should be in the reverse of the order of the
|
Chris@10
|
5403 iodim arrays (see above).
|
Chris@10
|
5404
|
Chris@10
|
5405
|
Chris@10
|
5406 In general, you should take care to use Fortran data types that
|
Chris@10
|
5407 correspond to (i.e. are the same size as) the C types used by FFTW. In
|
Chris@10
|
5408 practice, this correspondence is usually straightforward (i.e.
|
Chris@10
|
5409 `integer' corresponds to `int', `real' corresponds to `float',
|
Chris@10
|
5410 etcetera). The native Fortran double/single-precision complex type
|
Chris@10
|
5411 should be compatible with `fftw_complex'/`fftwf_complex'. Such simple
|
Chris@10
|
5412 correspondences are assumed in the examples below.
|
Chris@10
|
5413
|
Chris@10
|
5414 ---------- Footnotes ----------
|
Chris@10
|
5415
|
Chris@10
|
5416 (1) Technically, Fortran 77 identifiers are not allowed to have more
|
Chris@10
|
5417 than 6 characters, nor may they contain underscores. Any compiler that
|
Chris@10
|
5418 enforces this limitation doesn't deserve to link to FFTW.
|
Chris@10
|
5419
|
Chris@10
|
5420 (2) The reason for this is that some Fortran implementations seem to
|
Chris@10
|
5421 have trouble with C function return values, and vice versa.
|
Chris@10
|
5422
|
Chris@10
|
5423
|
Chris@10
|
5424 File: fftw3.info, Node: FFTW Constants in Fortran, Next: FFTW Execution in Fortran, Prev: Fortran-interface routines, Up: Calling FFTW from Legacy Fortran
|
Chris@10
|
5425
|
Chris@10
|
5426 8.2 FFTW Constants in Fortran
|
Chris@10
|
5427 =============================
|
Chris@10
|
5428
|
Chris@10
|
5429 When creating plans in FFTW, a number of constants are used to specify
|
Chris@10
|
5430 options, such as `FFTW_MEASURE' or `FFTW_ESTIMATE'. The same constants
|
Chris@10
|
5431 must be used with the wrapper routines, but of course the C header
|
Chris@10
|
5432 files where the constants are defined can't be incorporated directly
|
Chris@10
|
5433 into Fortran code.
|
Chris@10
|
5434
|
Chris@10
|
5435 Instead, we have placed Fortran equivalents of the FFTW constant
|
Chris@10
|
5436 definitions in the file `fftw3.f', which can be found in the same
|
Chris@10
|
5437 directory as `fftw3.h'. If your Fortran compiler supports a
|
Chris@10
|
5438 preprocessor of some sort, you should be able to `include' or
|
Chris@10
|
5439 `#include' this file; otherwise, you can paste it directly into your
|
Chris@10
|
5440 code.
|
Chris@10
|
5441
|
Chris@10
|
5442 In C, you combine different flags (like `FFTW_PRESERVE_INPUT' and
|
Chris@10
|
5443 `FFTW_MEASURE') using the ``|'' operator; in Fortran you should just
|
Chris@10
|
5444 use ``+''. (Take care not to add in the same flag more than once,
|
Chris@10
|
5445 though. Alternatively, you can use the `ior' intrinsic function
|
Chris@10
|
5446 standardized in Fortran 95.)
|
Chris@10
|
5447
|
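The warning about adding the same flag twice follows from how bit flags behave: addition agrees with bitwise OR only when the operands share no bits. The C sketch below illustrates this with two made-up single-bit flags. `DEMO_FLAG_A' and `DEMO_FLAG_B' are hypothetical and are not the values defined in `fftw3.h'.

```c
/* Hypothetical single-bit flags; NOT the values defined in fftw3.h. */
#define DEMO_FLAG_A (1u << 4)
#define DEMO_FLAG_B (1u << 6)

/* C style: bitwise OR is idempotent, so repeating a flag is harmless. */
static unsigned demo_combine_or(unsigned a, unsigned b)  { return a | b; }

/* Fortran 77 style: addition matches OR only when no bit is shared. */
static unsigned demo_combine_add(unsigned a, unsigned b) { return a + b; }
```

Combining `DEMO_FLAG_A' with `DEMO_FLAG_B' gives the same result either way, but adding `DEMO_FLAG_A' to itself carries into the next bit and silently selects a different flag.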
Chris@10
|
5448
|
Chris@10
|
5449 File: fftw3.info, Node: FFTW Execution in Fortran, Next: Fortran Examples, Prev: FFTW Constants in Fortran, Up: Calling FFTW from Legacy Fortran
|
Chris@10
|
5450
|
Chris@10
|
5451 8.3 FFTW Execution in Fortran
|
Chris@10
|
5452 =============================
|
Chris@10
|
5453
|
Chris@10
|
5454 In C, in order to use a plan, one normally calls `fftw_execute', which
|
Chris@10
|
5455 executes the plan to perform the transform on the input/output arrays
|
Chris@10
|
5456 passed when the plan was created (*note Using Plans::). The
|
Chris@10
|
5457 corresponding subroutine call in legacy Fortran is:
|
Chris@10
|
5458 call dfftw_execute(plan)
|
Chris@10
|
5459
|
Chris@10
|
5460 However, we have had reports that this causes problems with some
|
Chris@10
|
5461 recent optimizing Fortran compilers. The problem is, because the
|
Chris@10
|
5462 input/output arrays are not passed as explicit arguments to
|
Chris@10
|
5463 `dfftw_execute', the semantics of Fortran (unlike C) allow the compiler
|
Chris@10
|
5464 to assume that the input/output arrays are not changed by
|
Chris@10
|
5465 `dfftw_execute'. As a consequence, certain compilers end up optimizing
|
Chris@10
|
5466 out or repositioning the call to `dfftw_execute', assuming incorrectly
|
Chris@10
|
5467 that it does nothing.
|
Chris@10
|
5468
|
Chris@10
|
5469 There are various workarounds to this, but the safest and simplest
|
Chris@10
|
5470 thing is to not use `dfftw_execute' in Fortran. Instead, use the
|
Chris@10
|
5471 functions described in *note New-array Execute Functions::, which take
|
Chris@10
|
5472 the input/output arrays as explicit arguments. For example, if the
|
Chris@10
|
5473 plan is for a complex-data DFT and was created for the arrays `in' and
|
Chris@10
|
5474 `out', you would do:
|
Chris@10
|
5475 call dfftw_execute_dft(plan, in, out)
|
Chris@10
|
5476
|
Chris@10
|
5477 There are a few things to be careful of, however:
|
Chris@10
|
5478
|
Chris@10
|
5479 * You must use the correct type of execute function, matching the way
|
Chris@10
|
5480 the plan was created. Complex DFT plans should use
|
Chris@10
|
5481 `dfftw_execute_dft', real-input (r2c) DFT plans should use
|
Chris@10
|
5482 `dfftw_execute_dft_r2c', and real-output (c2r) DFT plans should
|
Chris@10
|
5483 use `dfftw_execute_dft_c2r'. The various r2r plans should use
|
Chris@10
|
5484 `dfftw_execute_r2r'.
|
Chris@10
|
5485
|
Chris@10
|
5486 * You should normally pass the same input/output arrays that were
|
Chris@10
|
5487 used when creating the plan. This is always safe.
|
Chris@10
|
5488
|
Chris@10
|
5489 * _If_ you pass _different_ input/output arrays compared to those
|
Chris@10
|
5490 used when creating the plan, you must abide by all the
|
Chris@10
|
5491 restrictions of the new-array execute functions (*note New-array
|
Chris@10
|
5492 Execute Functions::). The most difficult of these, in Fortran, is
|
Chris@10
|
5493 the requirement that the new arrays have the same alignment as the
|
Chris@10
|
5494 original arrays, because there seems to be no way in legacy
|
Chris@10
|
5495 Fortran to obtain guaranteed-aligned arrays (analogous to
|
Chris@10
|
5496 `fftw_malloc' in C). You can, of course, use the `FFTW_UNALIGNED'
|
Chris@10
|
5497 flag when creating the plan, in which case the plan does not
|
Chris@10
|
5498 depend on the alignment, but this may sacrifice substantial
|
Chris@10
|
5499 performance on architectures (like x86) with SIMD instructions
|
Chris@10
|
5500 (*note SIMD alignment and fftw_malloc::).
|
Chris@10
|
5501
|
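To make the "same alignment" requirement concrete, the C sketch below compares the offsets of two addresses within a 16-byte block, assuming the 16-byte boundary mentioned in this chapter. `demo_align16' and `demo_same_alignment' are hypothetical helpers, not FFTW routines, and the pointer-to-integer conversion they rely on is implementation-defined, although conventional.

```c
#include <stdint.h>

/* Hypothetical helper: byte offset of p within a 16-byte block.
   Converting a pointer to uintptr_t is implementation-defined, but it
   is the conventional way to inspect alignment. */
static unsigned demo_align16(const void *p)
{
    return (unsigned)((uintptr_t)p % 16u);
}

/* Nonzero if a and b sit at the same offset within a 16-byte block. */
static int demo_same_alignment(const void *a, const void *b)
{
    return demo_align16(a) == demo_align16(b);
}
```

A check of this kind could be run from a small C wrapper before passing new arrays to a plan that was not created with `FFTW_UNALIGNED'.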
Chris@10
|
5502
|
Chris@10
|
5503
|
Chris@10
|
5504 File: fftw3.info, Node: Fortran Examples, Next: Wisdom of Fortran?, Prev: FFTW Execution in Fortran, Up: Calling FFTW from Legacy Fortran
|
Chris@10
|
5505
|
Chris@10
|
5506 8.4 Fortran Examples
|
Chris@10
|
5507 ====================
|
Chris@10
|
5508
|
Chris@10
|
5509 In C, you might have something like the following to transform a
|
Chris@10
|
5510 one-dimensional complex array:
|
Chris@10
|
5511
|
Chris@10
|
5512 fftw_complex in[N], out[N];
|
Chris@10
|
5513 fftw_plan plan;
|
Chris@10
|
5514
|
Chris@10
|
5515 plan = fftw_plan_dft_1d(N,in,out,FFTW_FORWARD,FFTW_ESTIMATE);
|
Chris@10
|
5516 fftw_execute(plan);
|
Chris@10
|
5517 fftw_destroy_plan(plan);
|
Chris@10
|
5518
|
Chris@10
|
5519 In Fortran, you would use the following to accomplish the same thing:
|
Chris@10
|
5520
|
Chris@10
|
5521 double complex in, out
|
Chris@10
|
5522 dimension in(N), out(N)
|
Chris@10
|
5523 integer*8 plan
|
Chris@10
|
5524
|
Chris@10
|
5525 call dfftw_plan_dft_1d(plan,N,in,out,FFTW_FORWARD,FFTW_ESTIMATE)
|
Chris@10
|
5526 call dfftw_execute_dft(plan, in, out)
|
Chris@10
|
5527 call dfftw_destroy_plan(plan)
|
Chris@10
|
5528
|
Chris@10
|
5529 Notice how all routines are called as Fortran subroutines, and the
|
Chris@10
|
5530 plan is returned via the first argument to `dfftw_plan_dft_1d'. Notice
|
Chris@10
|
5531 also that we changed `fftw_execute' to `dfftw_execute_dft' (*note FFTW
|
Chris@10
|
5532 Execution in Fortran::). To do the same thing, but using 8 threads in
|
Chris@10
|
5533 parallel (*note Multi-threaded FFTW::), you would simply prefix these
|
Chris@10
|
5534 calls with:
|
Chris@10
|
5535
|
Chris@10
|
5536 integer iret
|
Chris@10
|
5537 call dfftw_init_threads(iret)
|
Chris@10
|
5538 call dfftw_plan_with_nthreads(8)
|
Chris@10
|
5539
|
Chris@10
|
5540 (You might want to check the value of `iret': if it is zero, it
|
Chris@10
|
5541 indicates an unlikely error during thread initialization.)
|
Chris@10
|
5542
|
Chris@10
|
5543 To transform a three-dimensional array in-place with C, you might do:
|
Chris@10
|
5544
|
Chris@10
|
5545 fftw_complex arr[L][M][N];
|
Chris@10
|
5546 fftw_plan plan;
|
Chris@10
|
5547
|
Chris@10
|
5548 plan = fftw_plan_dft_3d(L,M,N, arr,arr,
|
Chris@10
|
5549 FFTW_FORWARD, FFTW_ESTIMATE);
|
Chris@10
|
5550 fftw_execute(plan);
|
Chris@10
|
5551 fftw_destroy_plan(plan);
|
Chris@10
|
5552
|
Chris@10
|
5553 In Fortran, you would use this instead:
|
Chris@10
|
5554
|
Chris@10
|
5555 double complex arr
|
Chris@10
|
5556 dimension arr(L,M,N)
|
Chris@10
|
5557 integer*8 plan
|
Chris@10
|
5558
|
Chris@10
|
5559 call dfftw_plan_dft_3d(plan, L,M,N, arr,arr,
|
Chris@10
|
5560 & FFTW_FORWARD, FFTW_ESTIMATE)
|
Chris@10
|
5561 call dfftw_execute_dft(plan, arr, arr)
|
Chris@10
|
5562 call dfftw_destroy_plan(plan)
|
Chris@10
|
5563
|
Chris@10
|
5564 Note that we pass the array dimensions in the "natural" order in
|
Chris@10
|
5565 both C and Fortran.
|
Chris@10
|
5566
|
Chris@10
|
5567 To transform a one-dimensional real array in Fortran, you might do:
|
Chris@10
|
5568
|
Chris@10
|
5569 double precision in
|
Chris@10
|
5570 dimension in(N)
|
Chris@10
|
5571 double complex out
|
Chris@10
|
5572 dimension out(N/2 + 1)
|
Chris@10
|
5573 integer*8 plan
|
Chris@10
|
5574
|
Chris@10
|
5575 call dfftw_plan_dft_r2c_1d(plan,N,in,out,FFTW_ESTIMATE)
|
Chris@10
|
5576 call dfftw_execute_dft_r2c(plan, in, out)
|
Chris@10
|
5577 call dfftw_destroy_plan(plan)
|
Chris@10
|
5578
|
Chris@10
|
5579 To transform a two-dimensional real array, out of place, you might
|
Chris@10
|
5580 use the following:
|
Chris@10
|
5581
|
Chris@10
|
5582 double precision in
|
Chris@10
|
5583 dimension in(M,N)
|
Chris@10
|
5584 double complex out
|
Chris@10
|
5585 dimension out(M/2 + 1, N)
|
Chris@10
|
5586 integer*8 plan
|
Chris@10
|
5587
|
Chris@10
|
5588 call dfftw_plan_dft_r2c_2d(plan,M,N,in,out,FFTW_ESTIMATE)
|
Chris@10
|
5589 call dfftw_execute_dft_r2c(plan, in, out)
|
Chris@10
|
5590 call dfftw_destroy_plan(plan)
|
Chris@10
|
5591
|
Chris@10
|
5592 *Important:* Notice that it is the _first_ dimension of the complex
|
Chris@10
|
5593 output array that is cut in half in Fortran, rather than the last
|
Chris@10
|
5594 dimension as in C. This is a consequence of the interface routines
|
Chris@10
|
5595 reversing the order of the array dimensions passed to FFTW so that the
|
Chris@10
|
5596 Fortran program can use its ordinary column-major order.
|
Chris@10
|
5597
|
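The dimension reversal can be checked with plain index arithmetic: a Fortran array declared `a(M,N)' occupies memory exactly like a C array declared `a[N][M]', so the last C dimension, which r2c transforms cut to `M/2 + 1', is Fortran's first. The C sketch below uses hypothetical helper names and makes no FFTW calls.

```c
#include <stddef.h>

/* Offset of Fortran element a(i,j) (1-based) in an array declared a(m,n). */
static size_t demo_fortran_offset(size_t i, size_t j, size_t m)
{
    return (i - 1) + m * (j - 1);
}

/* Offset of C element a[r][c] (0-based) in an array declared a[rows][cols]. */
static size_t demo_c_offset(size_t r, size_t c, size_t cols)
{
    return r * cols + c;
}

/* Length of the halved dimension in r2c output: n/2 + 1 (integer division). */
static size_t demo_r2c_len(size_t n)
{
    return n / 2 + 1;
}
```

For every valid `i' and `j', `demo_fortran_offset(i, j, M)' equals `demo_c_offset(j - 1, i - 1, M)', which is why the wrappers only need to swap the order of the dimensions and never copy data.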
Chris@10
|
5598
|
Chris@10
|
5599 File: fftw3.info, Node: Wisdom of Fortran?, Prev: Fortran Examples, Up: Calling FFTW from Legacy Fortran
|
Chris@10
|
5600
|
Chris@10
|
5601 8.5 Wisdom of Fortran?
|
Chris@10
|
5602 ======================
|
Chris@10
|
5603
|
Chris@10
|
5604 In this section, we discuss how one can import/export FFTW wisdom
|
Chris@10
|
5605 (saved plans) to/from a Fortran program; we assume that the reader is
|
Chris@10
|
5606 already familiar with wisdom, as described in *note Words of
|
Chris@10
|
5607 Wisdom-Saving Plans::.
|
Chris@10
|
5608
|
Chris@10
|
5609 The basic problem is that it is difficult to (portably) pass files and
|
Chris@10
|
5610 strings between Fortran and C, so we cannot provide a direct Fortran
|
Chris@10
|
5611 equivalent to the `fftw_export_wisdom_to_file', etcetera, functions.
|
Chris@10
|
5612 Fortran interfaces _are_ provided for the functions that do not take
|
Chris@10
|
5613 file/string arguments, however: `dfftw_import_system_wisdom',
|
Chris@10
|
5614 `dfftw_import_wisdom', `dfftw_export_wisdom', and `dfftw_forget_wisdom'.
|
Chris@10
|
5615
|
Chris@10
|
5616 So, for example, to import the system-wide wisdom, you would do:
|
Chris@10
|
5617
|
Chris@10
|
5618 integer isuccess
|
Chris@10
|
5619 call dfftw_import_system_wisdom(isuccess)
|
Chris@10
|
5620
|
Chris@10
|
5621 As usual, the C return value is turned into a first parameter;
|
Chris@10
|
5622 `isuccess' is non-zero on success and zero on failure (e.g. if there is
|
Chris@10
|
5623 no system wisdom installed).
|
Chris@10
|
5624
|
Chris@10
|
5625 If you want to import/export wisdom from/to an arbitrary file or
|
Chris@10
|
5626 elsewhere, you can employ the generic `dfftw_import_wisdom' and
|
Chris@10
|
5627 `dfftw_export_wisdom' functions, for which you must supply a subroutine
|
Chris@10
|
5628 to read/write one character at a time. The FFTW package contains an
|
Chris@10
|
5629 example file `doc/f77_wisdom.f' demonstrating how to implement
|
Chris@10
|
5630 `import_wisdom_from_file' and `export_wisdom_to_file' subroutines in
|
Chris@10
|
5631 this way. (These routines cannot be compiled into the FFTW library
|
Chris@10
|
5632 itself, lest all FFTW-using programs be required to link with the
|
Chris@10
|
5633 Fortran I/O library.)
|
Chris@10
|
5634
|
Chris@10
|
5635
|
Chris@10
|
5636 File: fftw3.info, Node: Upgrading from FFTW version 2, Next: Installation and Customization, Prev: Calling FFTW from Legacy Fortran, Up: Top
|
Chris@10
|
5637
|
Chris@10
|
5638 9 Upgrading from FFTW version 2
|
Chris@10
|
5639 *******************************
|
Chris@10
|
5640
|
Chris@10
|
5641 In this chapter, we outline the process for updating codes designed for
|
Chris@10
|
5642 the older FFTW 2 interface to work with FFTW 3. The interface for FFTW
|
Chris@10
|
5643 3 is not backwards-compatible with the interface for FFTW 2 and earlier
|
Chris@10
|
5644 versions; codes written to use those versions will fail to link with
|
Chris@10
|
5645 FFTW 3. Nor is it possible to write "compatibility wrappers" to bridge
|
Chris@10
|
5646 the gap (at least not efficiently), because FFTW 3 has different
|
Chris@10
|
5647 semantics from previous versions. However, upgrading should be a
|
Chris@10
|
5648 straightforward process because the data formats are identical and the
|
Chris@10
|
5649 overall style of planning/execution is essentially the same.
|
Chris@10
|
5650
|
Chris@10
|
5651 Unlike FFTW 2, there are no separate header files for real and
|
Chris@10
|
5652 complex transforms (or even for different precisions) in FFTW 3; all
|
Chris@10
|
5653 interfaces are defined in the `<fftw3.h>' header file.
|
Chris@10
|
5654
|
Chris@10
|
5655 Numeric Types
|
Chris@10
|
5656 =============
|
Chris@10
|
5657
|
Chris@10
|
5658 The main difference in data types is that `fftw_complex' in FFTW 2 was
|
Chris@10
|
5659 defined as a `struct' with macros `c_re' and `c_im' for accessing the
|
Chris@10
|
5660 real/imaginary parts. (This is binary-compatible with FFTW 3 on any
|
Chris@10
|
5661 machine except perhaps for some older Crays in single precision.) The
|
Chris@10
|
5662 equivalent macros for FFTW 3 are:
|
Chris@10
|
5663
|
Chris@10
|
5664 #define c_re(c) ((c)[0])
|
Chris@10
|
5665 #define c_im(c) ((c)[1])
|
Chris@10
|
5666
|
Chris@10
|
5667 This does not work if you are using the C99 complex type, however,
|
Chris@10
|
5668 unless you insert a `double*' typecast into the above macros (*note
|
Chris@10
|
5669 Complex numbers::).
|
Chris@10
|
5670
|
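As an illustration of the macros above, the C sketch below applies them to a two-element array of doubles, which is the default layout of the FFTW 3 complex type. `demo_complex' and `demo_norm2' are hypothetical local names so that the fragment compiles without `<fftw3.h>'.

```c
/* Stand-in for FFTW 3's default complex type, an array of two doubles;
   demo_complex is a hypothetical local name, not an FFTW typedef. */
typedef double demo_complex[2];

#define c_re(c) ((c)[0])
#define c_im(c) ((c)[1])

/* Squared magnitude |z|^2 written with the FFTW 2 style accessors. */
static double demo_norm2(const demo_complex z)
{
    return c_re(z) * c_re(z) + c_im(z) * c_im(z);
}
```

Because the macros expand to array subscripts, they are also usable on the left-hand side of an assignment, for example `c_re(z) = 1.0;'.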
Chris@10
|
5671 Also, FFTW 2 had an `fftw_real' typedef that was an alias for
|
Chris@10
|
5672 `double' (in double precision). In FFTW 3 you should just use `double'
|
Chris@10
|
5673 (or whatever precision you are employing).
|
Chris@10
|
5674
|
Chris@10
|
5675 Plans
|
Chris@10
|
5676 =====
|
Chris@10
|
5677
|
Chris@10
|
5678 The major difference between FFTW 2 and FFTW 3 is in the
|
Chris@10
|
5679 planning/execution division of labor. In FFTW 2, plans were found for a
|
Chris@10
|
5680 given transform size and type, and then could be applied to _any_
|
Chris@10
|
5681 arrays and for _any_ multiplicity/stride parameters. In FFTW 3, you
|
Chris@10
|
5682 specify the particular arrays, stride parameters, etcetera when
|
Chris@10
|
5683 creating the plan, and the plan is then executed for _those_ arrays
|
Chris@10
|
5684 (unless the guru interface is used) and _those_ parameters _only_.
|
Chris@10
|
5685 (FFTW 2 had "specific planner" routines that planned for a particular
|
Chris@10
|
5686 array and stride, but the plan could still be used for other arrays and
|
Chris@10
|
5687 strides.) That is, much of the information that was formerly specified
|
Chris@10
|
5688 at execution time is now specified at planning time.
|
Chris@10
|
5689
|
Chris@10
|
5690 Like FFTW 2's specific planner routines, the FFTW 3 planner
|
Chris@10
|
5691 overwrites the input/output arrays unless you use `FFTW_ESTIMATE'.
|
Chris@10
|
5692
|
Chris@10
|
5693 FFTW 2 had separate data types `fftw_plan', `fftwnd_plan',
|
Chris@10
|
5694 `rfftw_plan', and `rfftwnd_plan' for complex and real one- and
|
Chris@10
|
5695 multi-dimensional transforms, and each type had its own `destroy'
|
Chris@10
|
5696 function. In FFTW 3, all plans are of type `fftw_plan' and all are
|
Chris@10
|
5697 destroyed by `fftw_destroy_plan(plan)'.
|
Chris@10
|
5698
|
Chris@10
|
5699 Where you formerly used `fftw_create_plan' and `fftw_one' to plan
|
Chris@10
|
5700 and compute a single 1d transform, you would now use `fftw_plan_dft_1d'
|
Chris@10
|
5701 to plan the transform. If you used the generic `fftw' function to
|
Chris@10
|
5702 execute the transform with multiplicity (`howmany') and stride
|
Chris@10
|
5703 parameters, you would now use the advanced interface
|
Chris@10
|
5704 `fftw_plan_many_dft' to specify those parameters. The plans are now
|
Chris@10
|
5705 executed with `fftw_execute(plan)', which takes all of its parameters
|
Chris@10
|
5706 (including the input/output arrays) from the plan.
|
Chris@10
|
5707
|
Chris@10
|
5708 In-place transforms no longer interpret their output argument as
|
Chris@10
|
5709 scratch space, nor is there an `FFTW_IN_PLACE' flag. You simply pass
|
Chris@10
|
5710 the same pointer for both the input and output arguments. (Previously,
|
Chris@10
|
5711 the output `ostride' and `odist' parameters were ignored for in-place
|
Chris@10
|
5712 transforms; now, if they are specified via the advanced interface, they
|
Chris@10
|
5713 are significant even in the in-place case, although they should
|
Chris@10
|
5714 normally equal the corresponding input parameters.)
|
Chris@10
|
5715
|
Chris@10
|
5716 The `FFTW_ESTIMATE' and `FFTW_MEASURE' flags have the same meaning
|
Chris@10
|
5717 as before, although the planning time will differ. You may also
|
Chris@10
|
5718 consider using `FFTW_PATIENT', which is like `FFTW_MEASURE' except that
|
Chris@10
|
5719 it takes more time in order to consider a wider variety of algorithms.
|
Chris@10
|
5720
|
Chris@10
|
5721 For multi-dimensional complex DFTs, instead of `fftwnd_create_plan'
|
Chris@10
|
5722 (or `fftw2d_create_plan' or `fftw3d_create_plan'), followed by
|
Chris@10
|
5723 `fftwnd_one', you would use `fftw_plan_dft' (or `fftw_plan_dft_2d' or
|
Chris@10
|
5724 `fftw_plan_dft_3d'), followed by `fftw_execute'. If you used `fftwnd'
|
Chris@10
|
5725 to specify strides etcetera, you would instead specify these via
|
Chris@10
|
5726 `fftw_plan_many_dft'.
|
Chris@10
|
5727
|
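When moving stride and multiplicity arguments into `fftw_plan_many_dft', it can help to write down where each element lives. For one-dimensional transforms, element j of transform k is found at offset j*stride + k*dist from the start of the array. The helper below is a sketch of that rule only; `batch_offset' is a hypothetical name, not an FFTW routine.

```c
#include <stddef.h>

/* Offset, in elements, of sample j of transform k in a batch of
 * one-dimensional transforms laid out with the given stride and dist. */
static size_t batch_offset(size_t j, size_t k, size_t stride, size_t dist)
{
    return j * stride + k * dist;
}
```

Two common layouts are transforms stored back to back (stride 1, dist n) and transforms interleaved element by element (stride `howmany', dist 1).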
Chris@10
|
5728 The analogues to `rfftw_create_plan' and `rfftw_one' with
|
Chris@10
|
5729 `FFTW_REAL_TO_COMPLEX' or `FFTW_COMPLEX_TO_REAL' directions are
|
Chris@10
|
5730 `fftw_plan_r2r_1d' with kind `FFTW_R2HC' or `FFTW_HC2R', followed by
|
Chris@10
|
5731 `fftw_execute'. The stride etcetera arguments of `rfftw' are now in
|
Chris@10
|
5732 `fftw_plan_many_r2r'.
|
Chris@10
|
5733
|
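The `FFTW_R2HC' kind produces the "halfcomplex" layout: for a transform of size n, the real parts r0, r1, ..., r(n/2) are followed by the imaginary parts in reverse order, ending with i1. The accessors below are a sketch for reading such an array; `hc_real' and `hc_imag' are hypothetical names introduced here.

```c
/* Real part of frequency k (0 <= k <= n/2) of a halfcomplex array. */
static double hc_real(const double *hc, int n, int k)
{
    (void) n;  /* unused: real parts sit at the start of the array */
    return hc[k];
}

/* Imaginary part of frequency k (0 <= k <= n/2).  The zero frequency
 * and, for even n, the Nyquist frequency are purely real and are not
 * stored. */
static double hc_imag(const double *hc, int n, int k)
{
    if (k == 0 || 2 * k == n)
        return 0.0;
    return hc[n - k];
}
```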
Chris@10
|
5734 Instead of `rfftwnd_create_plan' (or `rfftw2d_create_plan' or
|
Chris@10
|
5735 `rfftw3d_create_plan') followed by `rfftwnd_one_real_to_complex' or
|
Chris@10
|
5736 `rfftwnd_one_complex_to_real', you now use `fftw_plan_dft_r2c' (or
|
Chris@10
|
5737 `fftw_plan_dft_r2c_2d' or `fftw_plan_dft_r2c_3d') or
|
Chris@10
|
5738 `fftw_plan_dft_c2r' (or `fftw_plan_dft_c2r_2d' or
|
Chris@10
|
5739 `fftw_plan_dft_c2r_3d'), respectively, followed by `fftw_execute'. As
|
Chris@10
|
5740 usual, the strides etcetera of `rfftwnd_real_to_complex' or
|
Chris@10
|
5741 `rfftwnd_complex_to_real' are now specified in the advanced planner
|
Chris@10
|
5742 routines, `fftw_plan_many_dft_r2c' or `fftw_plan_many_dft_c2r'.
|
Chris@10
|
5743
|
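Since the arrays must exist before an r2c or c2r plan is created, it is useful to know how large the complex array has to be. Only the non-redundant half of the spectrum is stored, so the last dimension shrinks to n/2+1 (with integer division) while the other dimensions are unchanged. A small sketch, with the hypothetical helper name `r2c_complex_count':

```c
#include <stddef.h>

/* Number of complex elements in the output of a real-to-complex
 * transform (or the input of a complex-to-real one) whose real
 * dimensions are n[0] x n[1] x ... x n[rank-1]. */
static size_t r2c_complex_count(int rank, const int *n)
{
    size_t count = 1;
    for (int i = 0; i < rank; ++i)
        count *= (size_t) (i == rank - 1 ? n[i] / 2 + 1 : n[i]);
    return count;
}
```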
Chris@10
|
5744 Wisdom
|
Chris@10
|
5745 ======
|
Chris@10
|
5746
|
Chris@10
|
5747 In FFTW 2, you had to supply the `FFTW_USE_WISDOM' flag in order to use
|
Chris@10
|
5748 wisdom; in FFTW 3, wisdom is always used. (You could simulate the FFTW
|
Chris@10
|
5749 2 wisdom-less behavior by calling `fftw_forget_wisdom' after every
|
Chris@10
|
5750 planner call.)
|
Chris@10
|
5751
|
Chris@10
|
5752 The FFTW 3 wisdom import/export routines are almost the same as
|
Chris@10
|
5753 before (although the storage format is entirely different). There is
|
Chris@10
|
5754 one significant difference, however. In FFTW 2, the import routines
|
Chris@10
|
5755 would never read past the end of the wisdom, so you could store extra
|
Chris@10
|
5756 data beyond the wisdom in the same file, for example. In FFTW 3, the
|
Chris@10
|
5757 file-import routine may read up to a few hundred bytes past the end of
|
Chris@10
|
5758 the wisdom, so you cannot store other data just beyond it.(1)
|
Chris@10
|
5759
|
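If you used to append your own data after the wisdom in one file, one way around the read-ahead is to store each item as a length-prefixed record and hand the wisdom to FFTW as a string. The sketch below shows only the record handling. It assumes the wisdom text is obtained with `fftw_export_wisdom_to_string' and restored with `fftw_import_wisdom_from_string'; `write_record' and `read_record' are hypothetical helpers.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Writes `text' as one record: its length on a line of its own, then
 * exactly that many bytes.  Returns 0 on success, -1 on error. */
static int write_record(FILE *f, const char *text)
{
    unsigned long len = (unsigned long) strlen(text);
    if (fprintf(f, "%lu\n", len) < 0)
        return -1;
    return fwrite(text, 1, len, f) == len ? 0 : -1;
}

/* Reads one record written by write_record.  Returns a malloc'ed,
 * NUL-terminated copy (the caller frees it), or NULL on error. */
static char *read_record(FILE *f)
{
    unsigned long len;
    char *buf;

    if (fscanf(f, "%lu", &len) != 1 || fgetc(f) != '\n')
        return NULL;
    buf = malloc(len + 1);
    if (!buf)
        return NULL;
    if (fread(buf, 1, len, f) != len) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    return buf;
}
```

Because the reader consumes exactly the recorded number of bytes, whatever follows the wisdom record in the file is left untouched.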
Chris@10
|
5760 Wisdom has been enhanced by additional humility in FFTW 3: whereas
|
Chris@10
|
5761 FFTW 2 would re-use wisdom for a given transform size regardless of the
|
Chris@10
|
5762 stride etc., in FFTW 3 wisdom is only used with the strides etc. for
|
Chris@10
|
5763 which it was created. Unfortunately, this means FFTW 3 has to create
|
Chris@10
|
5764 new plans from scratch more often than FFTW 2 (in FFTW 2, planning e.g.
|
Chris@10
|
5765 one transform of size 1024 also created wisdom for all smaller powers
|
Chris@10
|
5766 of 2, but this no longer occurs).
|
Chris@10
|
5767
|
Chris@10
|
5768 FFTW 3 also has the new routine `fftw_import_system_wisdom' to
|
Chris@10
|
5769 import wisdom from a standard system-wide location.
|
Chris@10
|
5770
|
Chris@10
|
5771 Memory allocation
|
Chris@10
|
5772 =================
|
Chris@10
|
5773
|
Chris@10
|
5774 In FFTW 3, we recommend allocating your arrays with `fftw_malloc' and
|
Chris@10
|
5775 deallocating them with `fftw_free'; this is not required, but allows
|
Chris@10
|
5776 optimal performance when SIMD acceleration is used. (Those two
|
Chris@10
|
5777 functions actually existed in FFTW 2, and worked the same way, but were
|
Chris@10
|
5778 not documented.)
|
Chris@10
|
5779
|
Chris@10
|
5780 In FFTW 2, there were `fftw_malloc_hook' and `fftw_free_hook'
|
Chris@10
|
5781 functions that allowed the user to replace FFTW's memory-allocation
|
Chris@10
|
5782 routines (e.g. to implement different error-handling, since by default
|
Chris@10
|
5783 FFTW prints an error message and calls `exit' to abort the program if
|
Chris@10
|
5784 `malloc' returns `NULL'). These hooks are not supported in FFTW 3;
|
Chris@10
|
5785 those few users who require this functionality can just directly modify
|
Chris@10
|
5786 the memory-allocation routines in FFTW (they are defined in
|
Chris@10
|
5787 `kernel/alloc.c').
|
Chris@10
|
5788
|
Chris@10
|
5789 Fortran interface
|
Chris@10
|
5790 =================
|
Chris@10
|
5791
|
Chris@10
|
5792 In FFTW 2, the subroutine names were obtained by replacing `fftw_' with
|
Chris@10
|
5793 `fftw_f77_'; in FFTW 3, you replace `fftw_' with `dfftw_' (or `sfftw_'
|
Chris@10
|
5794 or `lfftw_', depending upon the precision).
|
Chris@10
|
5795
|
Chris@10
|
5796 In FFTW 3, we have begun recommending that you always declare the
|
Chris@10
|
5797 type used to store plans as `integer*8'. (Too many people didn't notice
|
Chris@10
|
5798 our instruction to switch from `integer' to `integer*8' for 64-bit
|
Chris@10
|
5799 machines.)
|
Chris@10
|
5800
|
Chris@10
|
5801 In FFTW 3, we provide a `fftw3.f' "header file" to include in your
|
Chris@10
|
5802 code (and which is officially installed on Unix systems). (In FFTW 2,
|
Chris@10
|
5803 we supplied a `fftw_f77.i' file, but it was not installed.)
|
Chris@10
|
5804
|
Chris@10
|
5805 Otherwise, the C-Fortran interface relationship is much the same as
|
Chris@10
|
5806 it was before (e.g. return values become initial parameters, and
|
Chris@10
|
5807 multi-dimensional arrays are in column-major order). Unlike FFTW 2, we
|
Chris@10
|
5808 do provide some support for wisdom import/export in Fortran (*note
|
Chris@10
|
5809 Wisdom of Fortran?::).
|
Chris@10
|
5810
|
Chris@10
|
5811 Threads
|
Chris@10
|
5812 =======
|
Chris@10
|
5813
|
Chris@10
|
5814 As in FFTW 2, only the execution routines are thread-safe. All planner
|
Chris@10
|
5815 routines, etcetera, should be called by only a single thread at a time
|
Chris@10
|
5816 (*note Thread safety::). _Unlike_ FFTW 2, there is no special
|
Chris@10
|
5817 `FFTW_THREADSAFE' flag for the planner to allow a given plan to be
|
Chris@10
|
5818 usable by multiple threads in parallel; this is now the case by default.
|
Chris@10
|
5819
|
Chris@10
|
5820 The multi-threaded version of FFTW 2 required you to pass the number
|
Chris@10
|
5821 of threads each time you execute the transform. The number of threads
|
Chris@10
|
5822 is now stored in the plan, and is specified before the planner is
|
Chris@10
|
5823 called by `fftw_plan_with_nthreads'. The threads initialization
|
Chris@10
|
5824 routine used to be called `fftw_threads_init' and would return zero on
|
Chris@10
|
5825 success; the new routine is called `fftw_init_threads' and returns zero
|
Chris@10
|
5826 on failure. *Note Multi-threaded FFTW::.
|
Chris@10
|
5827
|
Chris@10
|
5828 There is no separate threads header file in FFTW 3; all the function
|
Chris@10
|
5829 prototypes are in `<fftw3.h>'. However, you still have to link to a
|
Chris@10
|
5830 separate library (`-lfftw3_threads -lfftw3 -lm' on Unix), as well as to
|
Chris@10
|
5831 the threading library (e.g. POSIX threads on Unix).
|
Chris@10
|
5832
|
Chris@10
|
5833 ---------- Footnotes ----------
|
Chris@10
|
5834
|
Chris@10
|
5835 (1) We do our own buffering because GNU libc I/O routines are
|
Chris@10
|
5836 horribly slow for single-character I/O, apparently for thread-safety
|
Chris@10
|
5837 reasons (whether you are using threads or not).
|
Chris@10
|
5838
|
Chris@10
|
5839
|
Chris@10
|
5840 File: fftw3.info, Node: Installation and Customization, Next: Acknowledgments, Prev: Upgrading from FFTW version 2, Up: Top
|
Chris@10
|
5841
|
Chris@10
|
5842 10 Installation and Customization
|
Chris@10
|
5843 *********************************
|
Chris@10
|
5844
|
Chris@10
|
5845 This chapter describes the installation and customization of FFTW, the
|
Chris@10
|
5846 latest version of which may be downloaded from the FFTW home page
|
Chris@10
|
5847 (http://www.fftw.org).
|
Chris@10
|
5848
|
Chris@10
|
5849 In principle, FFTW should work on any system with an ANSI C compiler
|
Chris@10
|
5850 (`gcc' is fine). However, planner time is drastically reduced if FFTW
|
Chris@10
|
5851 can exploit a hardware cycle counter; FFTW comes with cycle-counter
|
Chris@10
|
5852 support for all modern general-purpose CPUs, but you may need to add a
|
Chris@10
|
5853 couple of lines of code if your compiler is not yet supported (*note
|
Chris@10
|
5854 Cycle Counters::). (On Unix, there will be a warning at the end of the
|
Chris@10
|
5855 `configure' output if no cycle counter is found.)
|
Chris@10
|
5856
|
Chris@10
|
5857 Installation of FFTW is simplest if you have a Unix or a GNU system,
|
Chris@10
|
5858 such as GNU/Linux, and we describe this case in the first section below,
|
Chris@10
|
5859 including the use of special configuration options to e.g. install
|
Chris@10
|
5860 different precisions or exploit optimizations for particular
|
Chris@10
|
5861 architectures (e.g. SIMD). Compilation on non-Unix systems is a more
|
Chris@10
|
5862 manual process, but we outline the procedure in the second section. It
|
Chris@10
|
5863 is also likely that pre-compiled binaries will be available for popular
|
Chris@10
|
5864 systems.
|
Chris@10
|
5865
|
Chris@10
|
5866 Finally, we describe how you can customize FFTW for particular needs
|
Chris@10
|
5867 by generating _codelets_ for fast transforms of sizes not supported
|
Chris@10
|
5868 efficiently by the standard FFTW distribution.
|
Chris@10
|
5869
|
Chris@10
|
5870 * Menu:
|
Chris@10
|
5871
|
Chris@10
|
5872 * Installation on Unix::
|
Chris@10
|
5873 * Installation on non-Unix systems::
|
Chris@10
|
5874 * Cycle Counters::
|
Chris@10
|
5875 * Generating your own code::
|
Chris@10
|
5876
|
Chris@10
|
5877
|
Chris@10
|
5878 File: fftw3.info, Node: Installation on Unix, Next: Installation on non-Unix systems, Prev: Installation and Customization, Up: Installation and Customization
|
Chris@10
|
5879
|
Chris@10
|
5880 10.1 Installation on Unix
|
Chris@10
|
5881 =========================
|
Chris@10
|
5882
|
Chris@10
|
5883 FFTW comes with a `configure' program in the GNU style. Installation
|
Chris@10
|
5884 can be as simple as:
|
Chris@10
|
5885
|
Chris@10
|
5886 ./configure
|
Chris@10
|
5887 make
|
Chris@10
|
5888 make install
|
Chris@10
|
5889
|
Chris@10
|
5890 This will build the uniprocessor complex and real transform libraries
|
Chris@10
|
5891 along with the test programs. (We recommend that you use GNU `make' if
|
Chris@10
|
5892 it is available; on some systems it is called `gmake'.) The "`make
|
Chris@10
|
5893 install'" command installs the FFTW library in standard
|
Chris@10
|
5894 places, and typically requires root privileges (unless you specify a
|
Chris@10
|
5895 different install directory with the `--prefix' flag to `configure').
|
Chris@10
|
5896 You can also type "`make check'" to put the FFTW test programs through
|
Chris@10
|
5897 their paces. If you have problems during configuration or compilation,
|
Chris@10
|
5898 you may want to run "`make distclean'" before trying again; this
|
Chris@10
|
5899 ensures that you don't have any stale files left over from previous
|
Chris@10
|
5900 compilation attempts.
|
Chris@10
|
5901
|
Chris@10
|
5902 The `configure' script chooses the `gcc' compiler by default, if it
|
Chris@10
|
5903 is available; you can select some other compiler with:
|
Chris@10
|
5904 ./configure CC="<the name of your C compiler>"
|
Chris@10
|
5905
|
Chris@10
|
5906 The `configure' script knows good `CFLAGS' (C compiler flags) for a
|
Chris@10
|
5907 few systems. If your system is not known, the `configure' script will
|
Chris@10
|
5908 print out a warning. In this case, you should re-configure FFTW with
|
Chris@10
|
5909 the command
|
Chris@10
|
5910 ./configure CFLAGS="<write your CFLAGS here>"
|
Chris@10
|
5911 and then compile as usual. If you do find an optimal set of
|
Chris@10
|
5912 `CFLAGS' for your system, please let us know what they are (along with
|
Chris@10
|
5913 the output of `config.guess') so that we can include them in future
|
Chris@10
|
5914 releases.
|
Chris@10
|
5915
|
Chris@10
|
5916 `configure' supports all the standard flags defined by the GNU
|
Chris@10
|
5917 Coding Standards; see the `INSTALL' file in FFTW or the GNU web page
|
Chris@10
|
5918 (http://www.gnu.org/prep/standards/html_node/index.html). Note
|
Chris@10
|
5919 especially `--help' to list all flags and `--enable-shared' to create
|
Chris@10
|
5920 shared, rather than static, libraries. `configure' also accepts a few
|
Chris@10
|
5921 FFTW-specific flags, particularly:
|
Chris@10
|
5922
|
Chris@10
|
5923 * `--enable-float': Produces a single-precision version of FFTW
|
Chris@10
|
5924 (`float') instead of the default double-precision (`double').
|
Chris@10
|
5925 *Note Precision::.
|
Chris@10
|
5926
|
Chris@10
|
5927 * `--enable-long-double': Produces a long-double precision version of
|
Chris@10
|
5928 FFTW (`long double') instead of the default double-precision
|
Chris@10
|
5929 (`double'). The `configure' script will halt with an error
|
Chris@10
|
5930 message if `long double' is the same size as `double' on your
|
Chris@10
|
5931 machine/compiler. *Note Precision::.
|
Chris@10
|
5932
|
Chris@10
|
5933 * `--enable-quad-precision': Produces a quadruple-precision version
|
Chris@10
|
5934 of FFTW using the nonstandard `__float128' type provided by `gcc'
|
Chris@10
|
5935 4.6 or later on x86, x86-64, and Itanium architectures, instead of
|
Chris@10
|
5936 the default double-precision (`double'). The `configure' script
|
Chris@10
|
5937 will halt with an error message if the compiler is not `gcc'
|
Chris@10
|
5938 version 4.6 or later or if `gcc''s `libquadmath' library is not
|
Chris@10
|
5939 installed. *Note Precision::.
|
Chris@10
|
5940
|
Chris@10
|
5941 * `--enable-threads': Enables compilation and installation of the
|
Chris@10
|
5942 FFTW threads library (*note Multi-threaded FFTW::), which provides
|
Chris@10
|
5943 a simple interface to parallel transforms for SMP systems. By
|
Chris@10
|
5944 default, the threads routines are not compiled.
|
Chris@10
|
5945
|
Chris@10
|
5946 * `--enable-openmp': Like `--enable-threads', but using OpenMP
|
Chris@10
|
5947 compiler directives in order to induce parallelism rather than
|
Chris@10
|
5948 spawning its own threads directly, and installing an `fftw3_omp'
|
Chris@10
|
5949 library rather than an `fftw3_threads' library (*note
|
Chris@10
|
5950 Multi-threaded FFTW::). You can use both `--enable-openmp' and
|
Chris@10
|
5951 `--enable-threads' since they compile/install libraries with
|
Chris@10
|
5952 different names. By default, the OpenMP routines are not compiled.
|
Chris@10
|
5953
|
Chris@10
|
5954 * `--with-combined-threads': By default, if `--enable-threads' is
|
Chris@10
|
5955 used, the threads support is compiled into a separate library that
|
Chris@10
|
5956 must be linked in addition to the main FFTW library. This is so
|
Chris@10
|
5957 that users of the serial library do not need to link the system
|
Chris@10
|
5958 threads libraries. If `--with-combined-threads' is specified,
|
Chris@10
|
5959 however, then no separate threads library is created, and threads
|
Chris@10
|
5960 are included in the main FFTW library. This is mainly useful
|
Chris@10
|
5961 under Windows, where no system threads library is required and
|
Chris@10
|
5962 inter-library dependencies are problematic.
|
Chris@10
|
5963
|
Chris@10
|
5964 * `--enable-mpi': Enables compilation and installation of the FFTW
|
Chris@10
|
5965 MPI library (*note Distributed-memory FFTW with MPI::), which
|
Chris@10
|
5966 provides parallel transforms for distributed-memory systems with
|
Chris@10
|
5967 MPI. (By default, the MPI routines are not compiled.) *Note FFTW
|
Chris@10
|
5968 MPI Installation::.
|
Chris@10
|
5969
|
Chris@10
|
5970 * `--disable-fortran': Disables inclusion of legacy-Fortran wrapper
|
Chris@10
|
5971 routines (*note Calling FFTW from Legacy Fortran::) in the standard
|
Chris@10
|
5972 FFTW libraries. These wrapper routines increase the library size
|
Chris@10
|
5973 by only a negligible amount, so they are included by default as
|
Chris@10
|
5974 long as the `configure' script finds a Fortran compiler on your
|
Chris@10
|
5975 system. (To specify a particular Fortran compiler foo, pass
|
Chris@10
|
5976 `F77='foo to `configure'.)
|
Chris@10
|
5977
|
Chris@10
|
5978 * `--with-g77-wrappers': By default, when Fortran wrappers are
|
Chris@10
|
5979 included, the wrappers employ the linking conventions of the
|
Chris@10
|
5980 Fortran compiler detected by the `configure' script. If this
|
Chris@10
|
5981 compiler is GNU `g77', however, then _two_ versions of the
|
Chris@10
|
5982 wrappers are included: one with `g77''s idiosyncratic convention
|
Chris@10
|
5983 of appending two underscores to identifiers, and one with the more
|
Chris@10
|
5984 common convention of appending only a single underscore. This
|
Chris@10
|
5985 way, the same FFTW library will work with both `g77' and other
|
Chris@10
|
5986 Fortran compilers, such as GNU `gfortran'. However, the converse
|
Chris@10
|
5987 is not true: if you configure with a different compiler, then the
|
Chris@10
|
5988 `g77'-compatible wrappers are not included. By specifying
|
Chris@10
|
5989 `--with-g77-wrappers', the `g77'-compatible wrappers are included
|
Chris@10
|
5990 in addition to wrappers for whatever Fortran compiler `configure'
|
Chris@10
|
5991 finds.
|
Chris@10
|
5992
|
Chris@10
|
5993 * `--with-slow-timer': Disables the use of hardware cycle counters,
|
Chris@10
|
5994 and falls back on `gettimeofday' or `clock'. This greatly worsens
|
Chris@10
|
5995 performance, and should generally not be used (unless you don't
|
Chris@10
|
5996 have a cycle counter but still really want an optimized plan
|
Chris@10
|
5997 regardless of the time). *Note Cycle Counters::.
|
Chris@10
|
5998
|
Chris@10
|
5999 * `--enable-sse', `--enable-sse2', `--enable-avx',
|
Chris@10
|
6000 `--enable-altivec', `--enable-neon': Enable the compilation of
|
Chris@10
|
6001 SIMD code for SSE (Pentium III+), SSE2 (Pentium IV+), AVX (Sandy
|
Chris@10
|
6002 Bridge, Interlagos), AltiVec (PowerPC G4+), and NEON (some ARM
|
Chris@10
|
6003 processors). SSE, AltiVec, and NEON only work with
|
Chris@10
|
6004 `--enable-float' (above). SSE2 works in both single and double
|
Chris@10
|
6005 precision (and is simply SSE in single precision). The resulting
|
Chris@10
|
6006 code will _still work_ on earlier CPUs lacking the SIMD extensions
|
Chris@10
|
6007 (SIMD is automatically disabled, although the FFTW library is
|
Chris@10
|
6008 still larger).
|
Chris@10
|
6009 - These options require a compiler supporting SIMD extensions,
|
Chris@10
|
6010 and compiler support is always a bit flaky: see the FFTW FAQ
|
Chris@10
|
6011 for a list of compiler versions that have problems compiling
|
Chris@10
|
6012 FFTW.
|
Chris@10
|
6013
|
Chris@10
|
6014 - With AltiVec and `gcc', you may have to use the
|
Chris@10
|
6015 `-mabi=altivec' option when compiling any code that links to
|
Chris@10
|
6016 FFTW, in order to properly align the stack; otherwise, FFTW
|
Chris@10
|
6017 could crash when it tries to use an AltiVec feature. (This
|
Chris@10
|
6018 is not necessary on MacOS X.)
|
Chris@10
|
6019
|
Chris@10
|
6020 - With SSE/SSE2 and `gcc', you should use a version of gcc that
|
Chris@10
|
6021 properly aligns the stack when compiling any code that links
|
Chris@10
|
6022 to FFTW. By default, `gcc' 2.95 and later versions align the
|
Chris@10
|
6023 stack as needed, but you should not compile FFTW with the
|
Chris@10
|
6024 `-Os' option or the `-mpreferred-stack-boundary' option with
|
Chris@10
|
6025 an argument less than 4.
|
Chris@10
|
6026
|
Chris@10
|
6027 - Because of the large variety of ARM processors and ABIs, FFTW
|
Chris@10
|
6028 does not attempt to guess the correct `gcc' flags for
|
Chris@10
|
6029 generating NEON code. In general, you will have to provide
|
Chris@10
|
6030 them on the command line. This command line is known to have
|
Chris@10
|
6031 worked at least once:
|
Chris@10
|
6032 ./configure --with-slow-timer --host=arm-linux-gnueabi \
|
Chris@10
|
6033 --enable-single --enable-neon \
|
Chris@10
|
6034 "CC=arm-linux-gnueabi-gcc -march=armv7-a -mfloat-abi=softfp"
|
Chris@10
|
6035
|
Chris@10
|
6036
|
Chris@10
|
6037 To force `configure' to use a particular C compiler foo (instead of
|
Chris@10
|
6038 the default, usually `gcc'), pass `CC='foo to the `configure' script;
|
Chris@10
|
6039 you may also need to set the flags via the variable `CFLAGS' as
|
Chris@10
|
6040 described above.
|
Chris@10
|
6041
|
Chris@10
|
6042
|
Chris@10
|
6043 File: fftw3.info, Node: Installation on non-Unix systems, Next: Cycle Counters, Prev: Installation on Unix, Up: Installation and Customization
|
Chris@10
|
6044
|
Chris@10
|
6045 10.2 Installation on non-Unix systems
|
Chris@10
|
6046 =====================================
|
Chris@10
|
6047
|
Chris@10
|
6048 It should be relatively straightforward to compile FFTW even on non-Unix
|
Chris@10
|
6049 systems lacking the niceties of a `configure' script. Basically, you
|
Chris@10
|
6050 need to edit the `config.h' header (copy it from `config.h.in') to
|
Chris@10
|
6051 `#define' the various options and compiler characteristics, and then
|
Chris@10
|
6052 compile all the `.c' files in the relevant directories.
|
Chris@10
|
6053
|
Chris@10
|
6054 The `config.h' header contains about 100 options to set, each one
|
Chris@10
|
6055 initially an `#undef', each documented with a comment, and most of them
|
Chris@10
|
6056 fairly obvious. For most of the options, you should simply `#define'
|
Chris@10
|
6057 them to `1' if they are applicable, although a few options require a
|
Chris@10
|
6058 particular value (e.g. `SIZEOF_LONG_LONG' should be defined to the size
|
Chris@10
|
6059 of the `long long' type, in bytes, or zero if it is not supported). We
|
Chris@10
|
6060 will likely post some sample `config.h' files for various operating
|
Chris@10
|
6061 systems and compilers for you to use (at least as a starting point).
|
Chris@10
|
6062 Please let us know if you have to hand-create a configuration file
|
Chris@10
|
6063 (and/or a pre-compiled binary) that you want to share.
|
Chris@10
|
6064
|
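For options that need a particular value, a tiny program compiled with the target compiler can report what to write. The sketch below formats the line for `SIZEOF_LONG_LONG'; it assumes the compiler supports `long long' at all (otherwise the value is zero, as noted above), and `sizeof_long_long_define' is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats the config.h line for SIZEOF_LONG_LONG into buf.  Returns
 * the number of characters needed, as snprintf does. */
static int sizeof_long_long_define(char *buf, size_t size)
{
    return snprintf(buf, size, "#define SIZEOF_LONG_LONG %u",
                    (unsigned) sizeof(long long));
}
```

Printing the buffer with `puts' gives a line that can be pasted into `config.h'; the same pattern works for the other `SIZEOF_' options.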
Chris@10
|
6065 To create the FFTW library, you will then need to compile all of the
|
Chris@10
|
6066 `.c' files in the `kernel', `dft', `dft/scalar', `dft/scalar/codelets',
|
Chris@10
|
6067 `rdft', `rdft/scalar', `rdft/scalar/r2cf', `rdft/scalar/r2cb',
|
Chris@10
|
6068 `rdft/scalar/r2r', `reodft', and `api' directories. If you are
|
Chris@10
|
6069 compiling with SIMD support (e.g. you defined `HAVE_SSE2' in
|
Chris@10
|
6070 `config.h'), then you also need to compile the `.c' files in the
|
Chris@10
|
6071 `simd-support', `{dft,rdft}/simd', `{dft,rdft}/simd/*' directories.
|
Chris@10
|
6072
|
Chris@10
|
6073 Once these files are all compiled, link them into a library, or a
|
Chris@10
|
6074 shared library, or directly into your program.
|
Chris@10
|
6075
|
Chris@10
|
6076 To compile the FFTW test program, additionally compile the code in
|
Chris@10
|
6077 the `libbench2/' directory, and link it into a library. Then compile
|
Chris@10
|
6078 the code in the `tests/' directory and link it to the `libbench2' and
|
Chris@10
|
6079 FFTW libraries. To compile the `fftw-wisdom' (command-line) tool
|
Chris@10
|
6080 (*note Wisdom Utilities::), compile `tools/fftw-wisdom.c' and link it
|
Chris@10
|
6081 to the `libbench2' and FFTW libraries.
|
Chris@10
|
6082
|
Chris@10
|
6083
|
Chris@10
|
6084 File: fftw3.info, Node: Cycle Counters, Next: Generating your own code, Prev: Installation on non-Unix systems, Up: Installation and Customization
|
Chris@10
|
6085
|
Chris@10
|
6086 10.3 Cycle Counters
|
Chris@10
|
6087 ===================
|
Chris@10
|
6088
|
Chris@10
|
6089 FFTW's planner actually executes and times different possible FFT
|
Chris@10
|
6090 algorithms in order to pick the fastest plan for a given n. In order
|
Chris@10
|
6091 to do this in as short a time as possible, however, the timer must have
|
Chris@10
|
6092 a very high resolution, and to accomplish this we employ the hardware
|
Chris@10
|
6093 "cycle counters" that are available on most CPUs. Currently, FFTW
|
Chris@10
|
6094 supports the cycle counters on x86, PowerPC/POWER, Alpha, UltraSPARC
|
Chris@10
|
6095 (SPARC v9), IA64, PA-RISC, and MIPS processors.
|
Chris@10
|
6096
|
Chris@10
|
6097 Access to the cycle counters, unfortunately, is a compiler and/or
|
Chris@10
|
6098 operating-system dependent task, often requiring inline assembly
|
Chris@10
|
6099 language, and it may be that your compiler is not supported. If you are
|
Chris@10
|
6100 _not_ supported, FFTW will by default fall back on its estimator
|
Chris@10
|
6101 (effectively using `FFTW_ESTIMATE' for all plans).
|
Chris@10
|
6102
|
Chris@10
|
6103 You can add support by editing the file `kernel/cycle.h'; normally,
|
Chris@10
|
6104 this will involve adapting one of the examples already present in order
|
Chris@10
|
6105 to use the inline-assembler syntax for your C compiler, and will only
|
Chris@10
|
6106 require a couple of lines of code. Anyone adding support for a new
|
Chris@10
|
6107 system to `cycle.h' is encouraged to email us at <fftw@fftw.org>.
|
Chris@10
|
6108
|
Chris@10
|
6109 If a cycle counter is not available on your system (e.g. some
|
Chris@10
|
6110 embedded processor), and you don't want to use estimated plans, as a
|
Chris@10
|
6111 last resort you can use the `--with-slow-timer' option to `configure'
|
Chris@10
|
6112 (on Unix) or `#define WITH_SLOW_TIMER' in `config.h' (elsewhere). This
|
Chris@10
|
6113 will use the much lower-resolution `gettimeofday' function, or even
|
Chris@10
|
6114 `clock' if the former is unavailable, and planning will be extremely
|
Chris@10
|
6115 slow.
|
Chris@10
|
6116
|
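To see why a low-resolution timer makes measurement slow, consider timing a routine with `clock' alone: the routine has to be repeated until the total time is well above the timer's granularity. The sketch below illustrates that idea only and is not FFTW's timing code; `time_with_clock' and `busy_work' are hypothetical names.

```c
#include <limits.h>
#include <time.h>

/* Sample workload: bumps a counter through a volatile pointer so the
 * loop is not optimized away. */
static void busy_work(void *arg)
{
    volatile unsigned long *counter = arg;
    for (int i = 0; i < 1000; ++i)
        ++*counter;
}

/* Estimates the time in seconds of one call to fn(arg) using clock().
 * The repetition count doubles until the whole batch takes at least
 * min_seconds.  Returns a negative value if clock() is unavailable. */
static double time_with_clock(void (*fn)(void *), void *arg,
                              double min_seconds)
{
    for (unsigned long reps = 1; ; reps *= 2) {
        clock_t t0 = clock();
        if (t0 == (clock_t) -1)
            return -1.0;
        for (unsigned long i = 0; i < reps; ++i)
            fn(arg);
        double total = (double) (clock() - t0) / CLOCKS_PER_SEC;
        /* Stop once the batch is long enough, or before reps would
         * overflow on the next doubling. */
        if (total >= min_seconds || reps > ULONG_MAX / 2)
            return total / (double) reps;
    }
}
```

With a cycle counter a single run is usually enough to get a usable measurement, which is why planning is so much faster when one is available.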
Chris@10
|
6117
|
Chris@10
|
6118 File: fftw3.info, Node: Generating your own code, Prev: Cycle Counters, Up: Installation and Customization
|
Chris@10
|
6119
|
Chris@10
|
6120 10.4 Generating your own code
|
Chris@10
|
6121 =============================
|
Chris@10
|
6122
|
Chris@10
|
6123 The directory `genfft' contains the programs that were used to generate
|
Chris@10
|
6124 FFTW's "codelets," which are hard-coded transforms of small sizes. We
|
Chris@10
|
6125 do not expect casual users to employ the generator, which is a rather
|
Chris@10
|
6126 sophisticated program that generates directed acyclic graphs of FFT
|
Chris@10
|
6127 algorithms and performs algebraic simplifications on them. It was
|
Chris@10
|
6128 written in Objective Caml, a dialect of ML, which is available at
|
Chris@10
|
6129 `http://caml.inria.fr/ocaml/index.en.html'.
|
Chris@10
|
6130
|
Chris@10
|
6131 If you have Objective Caml installed (along with recent versions of
|
Chris@10
|
6132 GNU `autoconf', `automake', and `libtool'), then you can change the set
|
Chris@10
|
6133 of codelets that are generated or play with the generation options.
|
Chris@10
|
6134 The set of generated codelets is specified by the
|
Chris@10
|
6135 `{dft,rdft}/{codelets,simd}/*/Makefile.am' files. For example, you can
|
Chris@10
|
6136 add efficient REDFT codelets of small sizes by modifying
|
Chris@10
|
6137 `rdft/codelets/r2r/Makefile.am'. After you modify any `Makefile.am'
|
Chris@10
|
6138 files, you can type `sh bootstrap.sh' in the top-level directory
|
Chris@10
|
6139 followed by `make' to re-generate the files.
|
Chris@10
|
6140
|
Chris@10
|
6141 We do not provide more details about the code-generation process,
|
Chris@10
|
6142 since we do not expect that most users will need to generate their own
|
Chris@10
|
6143 code. However, feel free to contact us at <fftw@fftw.org> if you are
|
Chris@10
|
6144 interested in the subject.
|
Chris@10
|
6145
|
Chris@10
|
6146 You might find it interesting to learn Caml and/or some modern
|
Chris@10
|
6147 programming techniques that we used in the generator (including monadic
|
Chris@10
|
6148 programming), especially if you heard the rumor that Java and
|
Chris@10
|
6149 object-oriented programming are the latest advancement in the field.
|
Chris@10
|
6150 The internal operation of the codelet generator is described in the
|
Chris@10
|
6151 paper, "A Fast Fourier Transform Compiler," by M. Frigo, which is
|
Chris@10
|
6152 available from the FFTW home page (http://www.fftw.org) and also
|
Chris@10
|
6153 appeared in the `Proceedings of the 1999 ACM SIGPLAN Conference on
|
Chris@10
|
6154 Programming Language Design and Implementation (PLDI)'.
|
Chris@10
|
6155
|
Chris@10
|
6156
|
Chris@10
|
6157 File: fftw3.info, Node: Acknowledgments, Next: License and Copyright, Prev: Installation and Customization, Up: Top
|
Chris@10
|
6158
|
Chris@10
|
6159 11 Acknowledgments
|
Chris@10
|
6160 ******************
|
Chris@10
|
6161
|
Chris@10
|
6162 Matteo Frigo was supported in part by the Special Research Program SFB
|
Chris@10
|
6163 F011 "AURORA" of the Austrian Science Fund FWF and by MIT Lincoln
|
Chris@10
|
6164 Laboratory. For previous versions of FFTW, he was supported in part by
|
Chris@10
|
6165 the Defense Advanced Research Projects Agency (DARPA), under Grants
|
Chris@10
|
6166 N00014-94-1-0985 and F30602-97-1-0270, and by a Digital Equipment
|
Chris@10
|
6167 Corporation Fellowship.
|
Chris@10
|
6168
|
Chris@10
|
6169 Steven G. Johnson was supported in part by a Dept. of Defense NDSEG
|
Chris@10
|
6170 Fellowship, an MIT Karl Taylor Compton Fellowship, and by the Materials
|
Chris@10
|
6171 Research Science and Engineering Center program of the National Science
|
Chris@10
|
6172 Foundation under award DMR-9400334.
|
Chris@10
|
6173
|
Chris@10
|
6174 Code for the Cell Broadband Engine was graciously donated to the FFTW
|
Chris@10
|
6175 project by the IBM Austin Research Lab and included in fftw-3.2. (This
|
Chris@10
|
6176 code was removed in fftw-3.3.)
|
Chris@10
|
6177
|
Chris@10
|
6178 Code for the MIPS paired-single SIMD support was graciously donated
|
Chris@10
|
6179 to the FFTW project by CodeSourcery, Inc.
|
Chris@10
|
6180
|
Chris@10
|
6181 We are grateful to Sun Microsystems Inc. for its donation of a
|
Chris@10
|
6182 cluster of nine 8-processor Ultra HPC 5000 SMPs (24 Gflops peak). These
|
Chris@10
|
6183 machines served as the primary platform for the development of early
|
Chris@10
|
6184 versions of FFTW.
|
Chris@10
|
6185
|
Chris@10
|
6186 We thank Intel Corporation for donating a four-processor Pentium Pro
|
Chris@10
|
6187 machine. We thank the GNU/Linux community for giving us a decent OS to
|
Chris@10
|
6188 run on that machine.
|
Chris@10
|
6189
|
Chris@10
|
6190 We are thankful to the AMD corporation for donating an AMD Athlon XP
|
Chris@10
|
6191 1700+ computer to the FFTW project.
|
Chris@10
|
6192
|
Chris@10
|
6193 We thank the Compaq/HP testdrive program and VA Software Corporation
|
Chris@10
|
6194 (SourceForge.net) for providing remote access to machines that were used
|
Chris@10
|
6195 to test FFTW.
|
Chris@10
|
6196
|
Chris@10
|
6197 The `genfft' suite of code generators was written using Objective
|
Chris@10
|
6198 Caml, a dialect of ML. Objective Caml is a small and elegant language
|
Chris@10
|
6199 developed by Xavier Leroy. The implementation is available from
|
Chris@10
|
6200 `http://caml.inria.fr/' (http://caml.inria.fr/). In previous releases
|
Chris@10
|
6201 of FFTW, `genfft' was written in Caml Light, by the same authors. An
|
Chris@10
|
6202 even earlier implementation of `genfft' was written in Scheme, but Caml
|
Chris@10
|
6203 is definitely better for this kind of application.
|
Chris@10
|
6204
|
Chris@10
|
6205 FFTW uses many tools from the GNU project, including `automake',
|
Chris@10
|
6206 `texinfo', and `libtool'.
|
Chris@10
|
6207
|
Chris@10
|
6208 Prof. Charles E. Leiserson of MIT provided continuous support and
|
Chris@10
|
6209 encouragement. This program would not exist without him. Charles also
|
Chris@10
|
6210 proposed the name "codelets" for the basic FFT blocks.
|
Chris@10
|
6211
|
Chris@10
|
6212 Prof. John D. Joannopoulos of MIT demonstrated continuing tolerance
|
Chris@10
|
6213 of Steven's "extra-curricular" computer-science activities, as well as
|
Chris@10
|
6214 remarkable creativity in working them into his grant proposals.
|
Chris@10
|
6215 Steven's physics degree would not exist without him.
|
Chris@10
|
6216
|
Chris@10
|
6217 Franz Franchetti wrote SIMD extensions to FFTW 2, which eventually
|
Chris@10
|
6218 led to the SIMD support in FFTW 3.
|
Chris@10
|
6219
|
Chris@10
|
6220 Stefan Kral wrote most of the K7 code generator distributed with FFTW
|
Chris@10
|
6221 3.0.x and 3.1.x.
|
Chris@10
|
6222
|
Chris@10
|
6223 Andrew Sterian contributed the Windows timing code in FFTW 2.
|
Chris@10
|
6224
|
Chris@10
|
6225 Didier Miras reported a bug in the test procedure used in FFTW 1.2.
|
Chris@10
|
6226 We now use a completely different test algorithm by Funda Ergun that
|
Chris@10
|
6227 does not require a separate FFT program to compare against.
|
Chris@10
|
6228
|
Chris@10
|
6229 Wolfgang Reimer contributed the Pentium cycle counter and a few fixes
|
Chris@10
|
6230 that help portability.
|
Chris@10
|
6231
|
Chris@10
|
6232 Ming-Chang Liu uncovered a well-hidden bug in the complex transforms
|
Chris@10
|
6233 of FFTW 2.0 and supplied a patch to correct it.
|
Chris@10
|
6234
|
Chris@10
|
6235 The FFTW FAQ was written in `bfnn' (Bizarre Format With No Name) and
|
Chris@10
|
6236 formatted using the tools developed by Ian Jackson for the Linux FAQ.
|
Chris@10
|
6237
|
Chris@10
|
6238 _We are especially thankful to all of our users for their continuing
|
Chris@10
|
6239 support, feedback, and interest during our development of FFTW._
|
Chris@10
|
6240
|
Chris@10
|
6241
|
Chris@10
|
6242 File: fftw3.info, Node: License and Copyright, Next: Concept Index, Prev: Acknowledgments, Up: Top
|
Chris@10
|
6243
|
Chris@10
|
6244 12 License and Copyright
|
Chris@10
|
6245 ************************
|
Chris@10
|
6246
|
Chris@10
|
6247 FFTW is Copyright (C) 2003, 2007-11 Matteo Frigo, Copyright (C) 2003,
|
Chris@10
|
6248 2007-11 Massachusetts Institute of Technology.
|
Chris@10
|
6249
|
Chris@10
|
6250 FFTW is free software; you can redistribute it and/or modify it
|
Chris@10
|
6251 under the terms of the GNU General Public License as published by the
|
Chris@10
|
6252 Free Software Foundation; either version 2 of the License, or (at your
|
Chris@10
|
6253 option) any later version.
|
Chris@10
|
6254
|
Chris@10
|
6255 This program is distributed in the hope that it will be useful, but
|
Chris@10
|
6256 WITHOUT ANY WARRANTY; without even the implied warranty of
|
Chris@10
|
6257 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
Chris@10
|
6258 General Public License for more details.
|
Chris@10
|
6259
|
Chris@10
|
6260 You should have received a copy of the GNU General Public License
|
Chris@10
|
6261 along with this program; if not, write to the Free Software Foundation,
|
Chris@10
|
6262 Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. You
|
Chris@10
|
6263 can also find the GPL on the GNU web site
|
Chris@10
|
6264 (http://www.gnu.org/licenses/gpl-2.0.html).
|
Chris@10
|
6265
|
Chris@10
|
6266 In addition, we kindly ask you to acknowledge FFTW and its authors in
|
Chris@10
|
6267 any program or publication in which you use FFTW. (You are not
|
Chris@10
|
6268 _required_ to do so; it is up to your common sense to decide whether
|
Chris@10
|
6269 you want to comply with this request or not.) For general
|
Chris@10
|
6270 publications, we suggest referencing: Matteo Frigo and Steven G.
|
Chris@10
|
6271 Johnson, "The design and implementation of FFTW3," Proc. IEEE 93 (2),
|
Chris@10
|
6272 216-231 (2005).
|
Chris@10
|
6273
|
Chris@10
|
6274 Non-free versions of FFTW are available under terms different from
|
Chris@10
|
6275 those of the General Public License. (For example, they do not require you to
|
Chris@10
|
6276 accompany any object code using FFTW with the corresponding source
|
Chris@10
|
6277 code.) For these alternative terms you must purchase a license from
|
Chris@10
|
6278 MIT's Technology Licensing Office. Users interested in such a license
|
Chris@10
|
6279 should contact us (<fftw@fftw.org>) for more information.
|
Chris@10
|
6280
|