comparison fft/fftw/fftw-3.3.4/doc/fftw3.info-1 @ 19:26056e866c29
Add FFTW to comparison table
author | Chris Cannam |
---|---|
date | Tue, 06 Oct 2015 13:08:39 +0100 |
parents | |
children |
comparison
18:8db794ca3e0b | 19:26056e866c29 |
---|---|
1 This is fftw3.info, produced by makeinfo version 4.13 from fftw3.texi. | |
2 | |
3 This manual is for FFTW (version 3.3.4, 20 September 2013). | |
4 | |
5 Copyright (C) 2003 Matteo Frigo. | |
6 | |
7 Copyright (C) 2003 Massachusetts Institute of Technology. | |
8 | |
9 Permission is granted to make and distribute verbatim copies of | |
10 this manual provided the copyright notice and this permission | |
11 notice are preserved on all copies. | |
12 | |
13 Permission is granted to copy and distribute modified versions of | |
14 this manual under the conditions for verbatim copying, provided | |
15 that the entire resulting derived work is distributed under the | |
16 terms of a permission notice identical to this one. | |
17 | |
18 Permission is granted to copy and distribute translations of this | |
19 manual into another language, under the above conditions for | |
20 modified versions, except that this permission notice may be | |
21 stated in a translation approved by the Free Software Foundation. | |
22 | |
23 INFO-DIR-SECTION Development | |
24 START-INFO-DIR-ENTRY | |
25 * fftw3: (fftw3). FFTW User's Manual. | |
26 END-INFO-DIR-ENTRY | |
27 | |
28 | |
29 File: fftw3.info, Node: Top, Next: Introduction, Prev: (dir), Up: (dir) | |
30 | |
31 FFTW User Manual | |
32 **************** | |
33 | |
34 Welcome to FFTW, the Fastest Fourier Transform in the West. FFTW is a | |
35 collection of fast C routines to compute the discrete Fourier transform. | |
36 This manual documents FFTW version 3.3.4. | |
37 | |
38 * Menu: | |
39 | |
40 * Introduction:: | |
41 * Tutorial:: | |
42 * Other Important Topics:: | |
43 * FFTW Reference:: | |
44 * Multi-threaded FFTW:: | |
45 * Distributed-memory FFTW with MPI:: | |
46 * Calling FFTW from Modern Fortran:: | |
47 * Calling FFTW from Legacy Fortran:: | |
48 * Upgrading from FFTW version 2:: | |
49 * Installation and Customization:: | |
50 * Acknowledgments:: | |
51 * License and Copyright:: | |
52 * Concept Index:: | |
53 * Library Index:: | |
54 | |
55 | |
56 File: fftw3.info, Node: Introduction, Next: Tutorial, Prev: Top, Up: Top | |
57 | |
58 1 Introduction | |
59 ************** | |
60 | |
61 This manual documents version 3.3.4 of FFTW, the _Fastest Fourier | |
62 Transform in the West_. FFTW is a comprehensive collection of fast C | |
63 routines for computing the discrete Fourier transform (DFT) and various | |
64 special cases thereof. | |
65 * FFTW computes the DFT of complex data, real data, even- or | |
66 odd-symmetric real data (these symmetric transforms are usually | |
67 known as the discrete cosine or sine transform, respectively), and | |
68 the discrete Hartley transform (DHT) of real data. | |
69 | |
70 * The input data can have arbitrary length. FFTW employs O(n | |
71 log n) algorithms for all lengths, including prime numbers. | |
72 | |
73 * FFTW supports arbitrary multi-dimensional data. | |
74 | |
75 * FFTW supports the SSE, SSE2, AVX, Altivec, and MIPS PS instruction | |
76 sets. | |
77 | |
78 * FFTW includes parallel (multi-threaded) transforms for | |
79 shared-memory systems. | |
80 | |
81 * Starting with version 3.3, FFTW includes distributed-memory | |
82 parallel transforms using MPI. | |
83 | |
84 We assume herein that you are familiar with the properties and uses | |
85 of the DFT that are relevant to your application. Otherwise, see e.g. | |
86 `The Fast Fourier Transform and Its Applications' by E. O. Brigham | |
87 (Prentice-Hall, Englewood Cliffs, NJ, 1988). Our web page | |
88 (http://www.fftw.org) also has links to FFT-related information online. | |
89 | |
90 In order to use FFTW effectively, you need to learn one basic concept | |
91 of FFTW's internal structure: FFTW does not use a fixed algorithm for | |
92 computing the transform, but instead it adapts the DFT algorithm to | |
93 details of the underlying hardware in order to maximize performance. | |
94 Hence, the computation of the transform is split into two phases. | |
95 First, FFTW's "planner" "learns" the fastest way to compute the | |
96 transform on your machine. The planner produces a data structure | |
97 called a "plan" that contains this information. Subsequently, the plan | |
98 is "executed" to transform the array of input data as dictated by the | |
99 plan. The plan can be reused as many times as needed. In typical | |
100 high-performance applications, many transforms of the same size are | |
101 computed and, consequently, a relatively expensive initialization of | |
102 this sort is acceptable. On the other hand, if you need a single | |
103 transform of a given size, the one-time cost of the planner becomes | |
104 significant. For this case, FFTW provides fast planners based on | |
105 heuristics or on previously computed plans. | |
106 | |
107 FFTW supports transforms of data with arbitrary length, rank, | |
108 multiplicity, and a general memory layout. In simple cases, however, | |
109 this generality may be unnecessary and confusing. Consequently, we | |
110 organized the interface to FFTW into three levels of increasing | |
111 generality. | |
112 * The "basic interface" computes a single transform of | |
113 contiguous data. | |
114 | |
115 * The "advanced interface" computes transforms of multiple or | |
116 strided arrays. | |
117 | |
118 * The "guru interface" supports the most general data layouts, | |
119 multiplicities, and strides. | |
120 We expect that most users will be best served by the basic interface, | |
121 whereas the guru interface requires careful attention to the | |
122 documentation to avoid problems. | |
123 | |
124 Besides the automatic performance adaptation performed by the | |
125 planner, it is also possible for advanced users to customize FFTW | |
126 manually. For example, if code space is a concern, we provide a tool | |
127 that links only the subset of FFTW needed by your application. | |
128 Conversely, you may need to extend FFTW because the standard | |
129 distribution is not sufficient for your needs. For example, the | |
130 standard FFTW distribution works most efficiently for arrays whose size | |
131 can be factored into small primes (2, 3, 5, and 7), and otherwise it | |
132 uses a slower general-purpose routine. If you need efficient | |
133 transforms of other sizes, you can use FFTW's code generator, which | |
134 produces fast C programs ("codelets") for any particular array size you | |
135 may care about. For example, if you need transforms of size 513 = 19 x | |
136 3^3, you can customize FFTW to support the factor 19 efficiently. | |
137 | |
138 For more information regarding FFTW, see the paper, "The Design and | |
139 Implementation of FFTW3," by M. Frigo and S. G. Johnson, which was an | |
140 invited paper in `Proc. IEEE' 93 (2), p. 216 (2005). The code | |
141 generator is described in the paper "A fast Fourier transform compiler," by | |
142 M. Frigo, in the `Proceedings of the 1999 ACM SIGPLAN Conference on | |
143 Programming Language Design and Implementation (PLDI), Atlanta, | |
144 Georgia, May 1999'. These papers, along with the latest version of | |
145 FFTW, the FAQ, benchmarks, and other links, are available at the FFTW | |
146 home page (http://www.fftw.org). | |
147 | |
148 The current version of FFTW incorporates many good ideas from the | |
149 past thirty years of FFT literature. In one way or another, FFTW uses | |
150 the Cooley-Tukey algorithm, the prime factor algorithm, Rader's | |
151 algorithm for prime sizes, and a split-radix algorithm (with a | |
152 "conjugate-pair" variation pointed out to us by Dan Bernstein). FFTW's | |
153 code generator also produces new algorithms that we do not completely | |
154 understand. The reader is referred to the cited papers for the | |
155 appropriate references. | |
156 | |
157 The rest of this manual is organized as follows. We first discuss | |
158 the sequential (single-processor) implementation. We start by | |
159 describing the basic interface/features of FFTW in *note Tutorial::. | |
160 Next, *note Other Important Topics:: discusses data alignment (*note | |
161 SIMD alignment and fftw_malloc::), the storage scheme of | |
162 multi-dimensional arrays (*note Multi-dimensional Array Format::), and | |
163 FFTW's mechanism for storing plans on disk (*note Words of | |
164 Wisdom-Saving Plans::). Next, *note FFTW Reference:: provides | |
165 comprehensive documentation of all FFTW's features. Parallel | |
166 transforms are discussed in their own chapters: *note Multi-threaded | |
167 FFTW:: and *note Distributed-memory FFTW with MPI::. Fortran | |
168 programmers can also use FFTW, as described in *note Calling FFTW from | |
169 Legacy Fortran:: and *note Calling FFTW from Modern Fortran::. *note | |
170 Installation and Customization:: explains how to install FFTW in your | |
171 computer system and how to adapt FFTW to your needs. License and | |
172 copyright information is given in *note License and Copyright::. | |
173 Finally, we thank all the people who helped us in *note | |
174 Acknowledgments::. | |
175 | |
176 | |
177 File: fftw3.info, Node: Tutorial, Next: Other Important Topics, Prev: Introduction, Up: Top | |
178 | |
179 2 Tutorial | |
180 ********** | |
181 | |
182 * Menu: | |
183 | |
184 * Complex One-Dimensional DFTs:: | |
185 * Complex Multi-Dimensional DFTs:: | |
186 * One-Dimensional DFTs of Real Data:: | |
187 * Multi-Dimensional DFTs of Real Data:: | |
188 * More DFTs of Real Data:: | |
189 | |
190 This chapter describes the basic usage of FFTW, i.e., how to compute the | |
191 Fourier transform of a single array. This chapter tells the truth, but | |
192 not the _whole_ truth. Specifically, FFTW implements additional | |
193 routines and flags that are not documented here, although in many cases | |
194 we try to indicate where added capabilities exist. For more complete | |
195 information, see *note FFTW Reference::. (Note that you need to | |
196 compile and install FFTW before you can use it in a program. For the | |
197 details of the installation, see *note Installation and | |
198 Customization::.) | |
199 | |
200 We recommend that you read this tutorial in order.(1) At the least, | |
201 read the first section (*note Complex One-Dimensional DFTs::) before | |
202 reading any of the others, even if your main interest lies in one of | |
203 the other transform types. | |
204 | |
205 Users of FFTW version 2 and earlier may also want to read *note | |
206 Upgrading from FFTW version 2::. | |
207 | |
208 ---------- Footnotes ---------- | |
209 | |
210 (1) You can read the tutorial in bit-reversed order after computing | |
211 your first transform. | |
212 | |
213 | |
214 File: fftw3.info, Node: Complex One-Dimensional DFTs, Next: Complex Multi-Dimensional DFTs, Prev: Tutorial, Up: Tutorial | |
215 | |
216 2.1 Complex One-Dimensional DFTs | |
217 ================================ | |
218 | |
219 Plan: To bother about the best method of accomplishing an | |
220 accidental result. [Ambrose Bierce, `The Enlarged Devil's | |
221 Dictionary'.] | |
222 | |
223 The basic usage of FFTW to compute a one-dimensional DFT of size `N' | |
224 is simple, and it typically looks something like this code: | |
225 | |
226 #include <fftw3.h> | |
227 ... | |
228 { | |
229 fftw_complex *in, *out; | |
230 fftw_plan p; | |
231 ... | |
232 in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N); | |
233 out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N); | |
234 p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE); | |
235 ... | |
236 fftw_execute(p); /* repeat as needed */ | |
237 ... | |
238 fftw_destroy_plan(p); | |
239 fftw_free(in); fftw_free(out); | |
240 } | |
241 | |
242 You must link this code with the `fftw3' library. On Unix systems, | |
243 link with `-lfftw3 -lm'. | |
244 | |
245 The example code first allocates the input and output arrays. You | |
246 can allocate them in any way that you like, but we recommend using | |
247 `fftw_malloc', which behaves like `malloc' except that it properly | |
248 aligns the array when SIMD instructions (such as SSE and Altivec) are | |
249 available (*note SIMD alignment and fftw_malloc::). [Alternatively, we | |
250 provide a convenient wrapper function `fftw_alloc_complex(N)' which has | |
251 the same effect.] | |
252 | |
253 The data is an array of type `fftw_complex', which is by default a | |
254 `double[2]' composed of the real (`in[i][0]') and imaginary | |
255 (`in[i][1]') parts of a complex number. | |
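
As a minimal sketch of this layout, the fragment below uses a local stand-in type `cplx_t' (hypothetical, declared as `double[2]' so that the example compiles without `<fftw3.h>'). The helpers `fill_ramp' and `flat_at' are illustrative names only, not part of FFTW. It shows that the real and imaginary parts sit next to each other in memory.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for fftw_complex (assumption: the default double[2] layout). */
typedef double cplx_t[2];

/* Hypothetical helper: store the complex number i - i*sqrt(-1) in element i. */
void fill_ramp(cplx_t *a, size_t n)
{
    assert(a != NULL);
    for (size_t i = 0; i < n; ++i) {
        a[i][0] = (double)i;   /* real part */
        a[i][1] = -(double)i;  /* imaginary part */
    }
}

/* Viewed as a flat array of doubles, the data is interleaved:
 * re0, im0, re1, im1, ... */
double flat_at(cplx_t *a, size_t k)
{
    return ((double *)a)[k];
}
```

With this layout, element `i' occupies flat positions `2*i' (real) and `2*i+1' (imaginary).
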
256 | |
257 The next step is to create a "plan", which is an object that | |
258 contains all the data that FFTW needs to compute the FFT. This | |
259 function creates the plan: | |
260 | |
261 fftw_plan fftw_plan_dft_1d(int n, fftw_complex *in, fftw_complex *out, | |
262 int sign, unsigned flags); | |
263 | |
264 The first argument, `n', is the size of the transform you are trying | |
265 to compute. The size `n' can be any positive integer, but sizes that | |
266 are products of small factors are transformed most efficiently | |
267 (although prime sizes still use an O(n log n) algorithm). | |
268 | |
269 The next two arguments are pointers to the input and output arrays of | |
270 the transform. These pointers can be equal, indicating an "in-place" | |
271 transform. | |
272 | |
273 The fourth argument, `sign', can be either `FFTW_FORWARD' (`-1') or | |
274 `FFTW_BACKWARD' (`+1'), and indicates the direction of the transform | |
275 you are interested in; technically, it is the sign of the exponent in | |
276 the transform. | |
277 | |
278 The `flags' argument is usually either `FFTW_MEASURE' or `FFTW_ESTIMATE'. | |
279 `FFTW_MEASURE' instructs FFTW to run and measure the execution time of | |
280 several FFTs in order to find the best way to compute the transform of | |
281 size `n'. This process takes some time (usually a few seconds), | |
282 depending on your machine and on the size of the transform. | |
283 `FFTW_ESTIMATE', on the contrary, does not run any computation and just | |
284 builds a reasonable plan that is probably sub-optimal. In short, if | |
285 your program performs many transforms of the same size and | |
286 initialization time is not important, use `FFTW_MEASURE'; otherwise use | |
287 the estimate. | |
288 | |
289 _You must create the plan before initializing the input_, because | |
290 `FFTW_MEASURE' overwrites the `in'/`out' arrays. (Technically, | |
291 `FFTW_ESTIMATE' does not touch your arrays, but you should always | |
292 create plans first just to be sure.) | |
293 | |
294 Once the plan has been created, you can use it as many times as you | |
295 like for transforms on the specified `in'/`out' arrays, computing the | |
296 actual transforms via `fftw_execute(plan)': | |
297 void fftw_execute(const fftw_plan plan); | |
298 | |
299 The DFT results are stored in-order in the array `out', with the | |
300 zero-frequency (DC) component in `out[0]'. If `in != out', the | |
301 transform is "out-of-place" and the input array `in' is not modified. | |
302 Otherwise, the input array is overwritten with the transform. | |
303 | |
304 If you want to transform a _different_ array of the same size, you | |
305 can create a new plan with `fftw_plan_dft_1d' and FFTW automatically | |
306 reuses the information from the previous plan, if possible. | |
307 Alternatively, with the "guru" interface you can apply a given plan to | |
308 a different array, if you are careful. *Note FFTW Reference::. | |
309 | |
310 When you are done with the plan, you deallocate it by calling | |
311 `fftw_destroy_plan(plan)': | |
312 void fftw_destroy_plan(fftw_plan plan); | |
313 If you allocate an array with `fftw_malloc()' you must deallocate it | |
314 with `fftw_free()'. Do not use `free()' or, heaven forbid, `delete'. | |
315 | |
316 FFTW computes an _unnormalized_ DFT. Thus, computing a forward | |
317 followed by a backward transform (or vice versa) results in the original | |
318 array scaled by `n'. For the definition of the DFT, see *note What | |
319 FFTW Really Computes::. | |
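
The scaling by `n' can be checked without FFTW. The following is only a sketch: `dft4' is a hypothetical naive transform of fixed size 4 (not FFTW's algorithm), and `cplx_t' is a local stand-in for the `double[2]' form of `fftw_complex'. It uses the same sign convention as `FFTW_FORWARD' (`-1') and `FFTW_BACKWARD' (`+1'). Size 4 is chosen because its twiddle factors are exactly 1, -i, -1 and +i, so no trigonometric functions are needed.

```c
#include <assert.h>

/* Stand-in for fftw_complex (assumption: double[2], real then imaginary). */
typedef double cplx_t[2];

/* Naive, unnormalized, out-of-place DFT of fixed size 4 (in != out).
 * sign = -1 uses exp(-2*pi*i*j*k/4); sign = +1 uses the conjugate. */
void dft4(cplx_t *in, cplx_t *out, int sign)
{
    /* cos and sin of 2*pi*m/4 for m = 0..3 */
    static const double c[4] = { 1.0, 0.0, -1.0, 0.0 };
    static const double s[4] = { 0.0, 1.0, 0.0, -1.0 };

    assert(sign == -1 || sign == 1);
    for (int k = 0; k < 4; ++k) {
        double re = 0.0, im = 0.0;
        for (int j = 0; j < 4; ++j) {
            int m = (j * k) % 4;
            double wr = c[m];
            double wi = sign * s[m];
            /* accumulate in[j] * (wr + i*wi) */
            re += in[j][0] * wr - in[j][1] * wi;
            im += in[j][0] * wi + in[j][1] * wr;
        }
        out[k][0] = re;
        out[k][1] = im;
    }
}
```

Applying `dft4' with sign `-1' and then with sign `+1' returns every input element multiplied by 4, so a caller who wants the original data back divides by `n'.
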
320 | |
321 If you have a C compiler, such as `gcc', that supports the C99 | |
322 standard, and you `#include <complex.h>' _before_ `<fftw3.h>', then | |
323 `fftw_complex' is the native double-precision complex type and you can | |
324 manipulate it with ordinary arithmetic. Otherwise, FFTW defines its | |
325 own complex type, which is bit-compatible with the C99 complex type. | |
326 *Note Complex numbers::. (The C++ `<complex>' template class may also | |
327 be usable via a typecast.) | |
328 | |
329 To use single or long-double precision versions of FFTW, replace the | |
330 `fftw_' prefix by `fftwf_' or `fftwl_' and link with `-lfftw3f' or | |
331 `-lfftw3l', but use the _same_ `<fftw3.h>' header file. | |
332 | |
333 Many more flags exist besides `FFTW_MEASURE' and `FFTW_ESTIMATE'. | |
334 For example, use `FFTW_PATIENT' if you're willing to wait even longer | |
335 for a possibly even faster plan (*note FFTW Reference::). You can also | |
336 save plans for future use, as described by *note Words of Wisdom-Saving | |
337 Plans::. | |
338 | |
339 | |
340 File: fftw3.info, Node: Complex Multi-Dimensional DFTs, Next: One-Dimensional DFTs of Real Data, Prev: Complex One-Dimensional DFTs, Up: Tutorial | |
341 | |
342 2.2 Complex Multi-Dimensional DFTs | |
343 ================================== | |
344 | |
345 Multi-dimensional transforms work much the same way as one-dimensional | |
346 transforms: you allocate arrays of `fftw_complex' (preferably using | |
347 `fftw_malloc'), create an `fftw_plan', execute it as many times as you | |
348 want with `fftw_execute(plan)', and clean up with | |
349 `fftw_destroy_plan(plan)' (and `fftw_free'). | |
350 | |
351 FFTW provides two routines for creating plans for 2d and 3d | |
352 transforms, and one routine for creating plans of arbitrary | |
353 dimensionality. The 2d and 3d routines have the following signature: | |
354 fftw_plan fftw_plan_dft_2d(int n0, int n1, | |
355 fftw_complex *in, fftw_complex *out, | |
356 int sign, unsigned flags); | |
357 fftw_plan fftw_plan_dft_3d(int n0, int n1, int n2, | |
358 fftw_complex *in, fftw_complex *out, | |
359 int sign, unsigned flags); | |
360 | |
361 These routines create plans for `n0' by `n1' two-dimensional (2d) | |
362 transforms and `n0' by `n1' by `n2' 3d transforms, respectively. All | |
363 of these transforms operate on contiguous arrays in the C-standard | |
364 "row-major" order, so that the last dimension has the fastest-varying | |
365 index in the array. This layout is described further in *note | |
366 Multi-dimensional Array Format::. | |
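
To make the row-major convention concrete, here is a small sketch of the offset arithmetic. The helper names `index_2d' and `index_3d' are hypothetical and are not FFTW functions.

```c
#include <assert.h>
#include <stddef.h>

/* Flat offset of element (i0, i1) in an n0 x n1 row-major array. */
size_t index_2d(size_t i0, size_t i1, size_t n0, size_t n1)
{
    assert(i0 < n0 && i1 < n1);
    return i0 * n1 + i1;
}

/* Flat offset of element (i0, i1, i2) in an n0 x n1 x n2 row-major array. */
size_t index_3d(size_t i0, size_t i1, size_t i2,
                size_t n0, size_t n1, size_t n2)
{
    assert(i0 < n0 && i1 < n1 && i2 < n2);
    return (i0 * n1 + i1) * n2 + i2;
}
```

Stepping the last index by one moves one element in memory, while stepping the first index moves a whole `n1' (or `n1*n2') block.
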
367 | |
368 FFTW can also compute transforms of higher dimensionality. In order | |
369 to avoid confusion between the various meanings of the word | |
370 "dimension", we use the term _rank_ to denote the number of independent | |
371 indices in an array.(1) For example, we say that a 2d transform has | |
372 rank 2, a 3d transform has rank 3, and so on. You can plan transforms | |
373 of arbitrary rank by means of the following function: | |
374 | |
375 fftw_plan fftw_plan_dft(int rank, const int *n, | |
376 fftw_complex *in, fftw_complex *out, | |
377 int sign, unsigned flags); | |
378 | |
379 Here, `n' is a pointer to an array `n[rank]' denoting an `n[0]' by | |
380 `n[1]' by ... by `n[rank-1]' transform. Thus, for example, the call | |
381 fftw_plan_dft_2d(n0, n1, in, out, sign, flags); | |
382 is equivalent to the following code fragment: | |
383 int n[2]; | |
384 n[0] = n0; | |
385 n[1] = n1; | |
386 fftw_plan_dft(2, n, in, out, sign, flags); | |
387 `fftw_plan_dft' is not restricted to 2d and 3d transforms, however; | |
388 it can plan transforms of arbitrary rank. | |
389 | |
390 You may have noticed that all the planner routines described so far | |
391 have overlapping functionality. For example, you can plan a 1d or 2d | |
392 transform by using `fftw_plan_dft' with a `rank' of `1' or `2', or even | |
393 by calling `fftw_plan_dft_3d' with `n0' and/or `n1' equal to `1' (with | |
394 no loss in efficiency). This pattern continues, and FFTW's planning | |
395 routines in general form a "partial order," sequences of interfaces | |
396 with strictly increasing generality but correspondingly greater | |
397 complexity. | |
398 | |
399 `fftw_plan_dft' is the most general complex-DFT routine that we | |
400 describe in this tutorial, but there are also the advanced and guru | |
401 interfaces, which allow one to efficiently combine multiple/strided | |
402 transforms into a single FFTW plan, transform a subset of a larger | |
403 multi-dimensional array, and/or to handle more general complex-number | |
404 formats. For more information, see *note FFTW Reference::. | |
405 | |
406 ---------- Footnotes ---------- | |
407 | |
408 (1) The term "rank" is commonly used in the APL, FORTRAN, and Common | |
409 Lisp traditions, although it is not so common in the C world. | |
410 | |
411 | |
412 File: fftw3.info, Node: One-Dimensional DFTs of Real Data, Next: Multi-Dimensional DFTs of Real Data, Prev: Complex Multi-Dimensional DFTs, Up: Tutorial | |
413 | |
414 2.3 One-Dimensional DFTs of Real Data | |
415 ===================================== | |
416 | |
417 In many practical applications, the input data `in[i]' are purely real | |
418 numbers, in which case the DFT output satisfies the "Hermitian" redundancy: | |
419 `out[i]' is the conjugate of `out[n-i]'. It is possible to take | |
420 advantage of these circumstances in order to achieve roughly a factor | |
421 of two improvement in both speed and memory usage. | |
422 | |
423 In exchange for these speed and space advantages, the user sacrifices | |
424 some of the simplicity of FFTW's complex transforms. First of all, the | |
425 input and output arrays are of _different sizes and types_: the input | |
426 is `n' real numbers, while the output is `n/2+1' complex numbers (the | |
427 non-redundant outputs); this also requires slight "padding" of the | |
428 input array for in-place transforms. Second, the inverse transform | |
429 (complex to real) has the side-effect of _overwriting its input array_, | |
430 by default. Neither of these inconveniences should pose a serious | |
431 problem for users, but it is important to be aware of them. | |
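
Because of the Hermitian redundancy, the full set of `n' outputs can be rebuilt from the `n/2+1' stored ones. The sketch below illustrates this. `expand_hermitian' is a hypothetical helper, not an FFTW routine, and `cplx_t' is a local stand-in for the `double[2]' form of `fftw_complex'.

```c
#include <assert.h>

/* Stand-in for fftw_complex (assumption: double[2], real then imaginary). */
typedef double cplx_t[2];

/* Rebuild all n outputs from the n/2+1 non-redundant ones:
 * full[k] = half[k] for k <= n/2, and full[k] = conj(half[n-k]) otherwise.
 * half must hold n/2+1 elements and full must hold n elements. */
void expand_hermitian(int n, cplx_t *half, cplx_t *full)
{
    assert(n > 0);
    for (int k = 0; k < n; ++k) {
        if (k <= n / 2) {
            full[k][0] = half[k][0];
            full[k][1] = half[k][1];
        } else {
            full[k][0] = half[n - k][0];
            full[k][1] = -half[n - k][1];  /* complex conjugate */
        }
    }
}
```

For example, with `n = 4' and stored outputs 10, -2+2i, -2 (the DFT of the real input 1, 2, 3, 4), the missing element `full[3]' is -2-2i.
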
432 | |
433 The routines to perform real-data transforms are almost the same as | |
434 those for complex transforms: you allocate arrays of `double' and/or | |
435 `fftw_complex' (preferably using `fftw_malloc' or | |
436 `fftw_alloc_complex'), create an `fftw_plan', execute it as many times | |
437 as you want with `fftw_execute(plan)', and clean up with | |
438 `fftw_destroy_plan(plan)' (and `fftw_free'). The only differences are | |
439 that the input (or output) is of type `double' and there are new | |
440 routines to create the plan. In one dimension: | |
441 | |
442 fftw_plan fftw_plan_dft_r2c_1d(int n, double *in, fftw_complex *out, | |
443 unsigned flags); | |
444 fftw_plan fftw_plan_dft_c2r_1d(int n, fftw_complex *in, double *out, | |
445 unsigned flags); | |
446 | |
447 for the real input to complex-Hermitian output ("r2c") and | |
448 complex-Hermitian input to real output ("c2r") transforms. Unlike the | |
449 complex DFT planner, there is no `sign' argument. Instead, r2c DFTs | |
450 are always `FFTW_FORWARD' and c2r DFTs are always `FFTW_BACKWARD'. (For | |
451 single/long-double precision `fftwf' and `fftwl', `double' should be | |
452 replaced by `float' and `long double', respectively.) | |
453 | |
454 Here, `n' is the "logical" size of the DFT, not necessarily the | |
455 physical size of the array. In particular, the real (`double') array | |
456 has `n' elements, while the complex (`fftw_complex') array has `n/2+1' | |
457 elements (where the division is rounded down). For an in-place | |
458 transform, `in' and `out' are aliased to the same array, which must be | |
459 big enough to hold both; so, the real array would actually have | |
460 `2*(n/2+1)' elements, where the elements beyond the first `n' are | |
461 unused padding. (Note that this is very different from the concept of | |
462 "zero-padding" a transform to a larger length, which changes the | |
463 logical size of the DFT by actually adding new input data.) The kth | |
464 element of the complex array is exactly the same as the kth element of | |
465 the corresponding complex DFT. All positive `n' are supported; | |
466 products of small factors are most efficient, but an O(n log n) | |
467 algorithm is used even for prime sizes. | |
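
The size rules above reduce to two small formulas, sketched here with hypothetical helper names (`r2c_complex_len' and `r2c_inplace_real_len' are not part of FFTW):

```c
#include <assert.h>
#include <stddef.h>

/* Number of complex outputs of an r2c transform of logical size n
 * (integer division rounds down). */
size_t r2c_complex_len(size_t n)
{
    assert(n > 0);
    return n / 2 + 1;
}

/* Number of doubles the real array must hold for an in-place transform:
 * enough to overlay the complex output, i.e. 2*(n/2+1). */
size_t r2c_inplace_real_len(size_t n)
{
    return 2 * r2c_complex_len(n);
}
```

For `n = 8' this gives 5 complex outputs and a 10-element real array (2 padding elements); for `n = 7' it gives 4 complex outputs and an 8-element real array (1 padding element).
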
468 | |
469 As noted above, the c2r transform destroys its input array even for | |
470 out-of-place transforms. This can be prevented, if necessary, by | |
471 including `FFTW_PRESERVE_INPUT' in the `flags', with unfortunately some | |
472 sacrifice in performance. This flag is also not currently supported | |
473 for multi-dimensional real DFTs (next section). | |
474 | |
475 Readers familiar with DFTs of real data will recall that the 0th (the | |
476 "DC") and `n/2'-th (the "Nyquist" frequency, when `n' is even) elements | |
477 of the complex output are purely real. Some implementations therefore | |
478 store the Nyquist element where the DC imaginary part would go, in | |
479 order to make the input and output arrays the same size. Such packing, | |
480 however, does not generalize well to multi-dimensional transforms, and | |
481 the space savings are minuscule in any case; FFTW does not support it. | |
482 | |
483 An alternative interface for one-dimensional r2c and c2r DFTs can be | |
484 found in the `r2r' interface (*note The Halfcomplex-format DFT::), with | |
485 "halfcomplex"-format output that _is_ the same size (and type) as the | |
486 input array. That interface, although it is not very useful for | |
487 multi-dimensional transforms, may sometimes yield better performance. | |
488 | |
489 | |
490 File: fftw3.info, Node: Multi-Dimensional DFTs of Real Data, Next: More DFTs of Real Data, Prev: One-Dimensional DFTs of Real Data, Up: Tutorial | |
491 | |
492 2.4 Multi-Dimensional DFTs of Real Data | |
493 ======================================= | |
494 | |
495 Multi-dimensional DFTs of real data use the following planner routines: | |
496 | |
497 fftw_plan fftw_plan_dft_r2c_2d(int n0, int n1, | |
498 double *in, fftw_complex *out, | |
499 unsigned flags); | |
500 fftw_plan fftw_plan_dft_r2c_3d(int n0, int n1, int n2, | |
501 double *in, fftw_complex *out, | |
502 unsigned flags); | |
503 fftw_plan fftw_plan_dft_r2c(int rank, const int *n, | |
504 double *in, fftw_complex *out, | |
505 unsigned flags); | |
506 | |
507 as well as the corresponding `c2r' routines with the input/output | |
508 types swapped. These routines work similarly to their complex | |
509 analogues, except for the fact that here the complex output array is cut | |
510 roughly in half and the real array requires padding for in-place | |
511 transforms (as in 1d, above). | |
512 | |
513 As before, `n' is the logical size of the array, and the | |
514 consequences of this on the format of the complex arrays deserve | |
515 careful attention. Suppose that the real data has dimensions n[0] x | |
516 n[1] x n[2] x ... x n[d-1] (in row-major order). Then, after an r2c | |
517 transform, the output is an n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1) | |
518 array of `fftw_complex' values in row-major order, corresponding to | |
519 slightly over half of the output of the corresponding complex DFT. | |
520 (The division is rounded down.) The ordering of the data is otherwise | |
521 exactly the same as in the complex-DFT case. | |
522 | |
523 For out-of-place transforms, this is the end of the story: the real | |
524 data is stored as a row-major array of size n[0] x n[1] x n[2] x ... x | |
525 n[d-1] and the complex data is stored as a row-major array of size | |
526 n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1) . | |
527 | |
528 For in-place transforms, however, extra padding of the real-data | |
529 array is necessary because the complex array is larger than the real | |
530 array, and the two arrays share the same memory locations. Thus, for | |
531 in-place transforms, the final dimension of the real-data array must be | |
532 padded with extra values to accommodate the size of the complex | |
533 data--two values if the last dimension is even and one if it is odd. That | |
534 is, the last dimension of the real data must physically contain 2 * | |
535 (n[d-1]/2+1) `double' values (exactly enough to hold the complex data). | |
536 This physical array size does not, however, change the _logical_ array | |
537 size--only n[d-1] values are actually stored in the last dimension, and | |
538 n[d-1] is the last dimension passed to the plan-creation routine. | |
539 | |
540 For example, consider the transform of a two-dimensional real array | |
541 of size `n0' by `n1'. The output of the r2c transform is a | |
542 two-dimensional complex array of size `n0' by `n1/2+1', where the `y' | |
543 dimension has been cut nearly in half because of redundancies in the | |
544 output. Because `fftw_complex' is twice the size of `double', the | |
545 output array is slightly bigger than the input array. Thus, if we want | |
546 to compute the transform in place, we must _pad_ the input array so | |
547 that it is of size `n0' by `2*(n1/2+1)'. If `n1' is even, then there | |
548 are two padding elements at the end of each row (which need not be | |
549 initialized, as they are only used for output). | |
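
The padded in-place layout changes how the real data is indexed: the row stride is the padded length, not `n1'. The following sketch shows the arithmetic. The helper names are hypothetical and are not FFTW functions.

```c
#include <assert.h>
#include <stddef.h>

/* Physical length, in doubles, of one row of the padded real array. */
size_t padded_row_len(size_t n1)
{
    assert(n1 > 0);
    return 2 * (n1 / 2 + 1);
}

/* Flat offset of real element (i0, i1) in the padded n0 x n1 array.
 * Only i1 < n1 holds data; the rest of each row is padding. */
size_t padded_index(size_t i0, size_t i1, size_t n1)
{
    assert(i1 < n1);
    return i0 * padded_row_len(n1) + i1;
}

/* Total number of doubles to allocate for the in-place transform. */
size_t padded_total(size_t n0, size_t n1)
{
    return n0 * padded_row_len(n1);
}
```

For a 3 by 4 real array, each row physically holds 6 doubles, element (1, 0) sits at offset 6 rather than 4, and 18 doubles are allocated in total.
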
550 | |
551 These transforms are unnormalized, so an r2c followed by a c2r | |
552 transform (or vice versa) will result in the original data scaled by | |
553 the number of real data elements--that is, the product of the (logical) | |
554 dimensions of the real data. | |
555 | |
556 (Because the last dimension is treated specially, if it is equal to | |
557 `1' the transform is _not_ equivalent to a lower-dimensional r2c/c2r | |
558 transform. In that case, the last complex dimension also has size `1' | |
559 (`=1/2+1'), and no advantage is gained over the complex transforms.) | |
560 | |
561 | |
562 File: fftw3.info, Node: More DFTs of Real Data, Prev: Multi-Dimensional DFTs of Real Data, Up: Tutorial | |
563 | |
564 2.5 More DFTs of Real Data | |
565 ========================== | |
566 | |
567 * Menu: | |
568 | |
569 * The Halfcomplex-format DFT:: | |
570 * Real even/odd DFTs (cosine/sine transforms):: | |
571 * The Discrete Hartley Transform:: | |
572 | |
573 FFTW supports several other transform types via a unified "r2r" | |
574 (real-to-real) interface, so called because it takes a real (`double') | |
575 array and outputs a real array of the same size. These r2r transforms | |
576 currently fall into three categories: DFTs of real input and | |
577 complex-Hermitian output in halfcomplex format, DFTs of real input with | |
578 even/odd symmetry (a.k.a. discrete cosine/sine transforms, DCTs/DSTs), | |
579 and discrete Hartley transforms (DHTs), all described in more detail by | |
580 the following sections. | |
581 | |
582 The r2r transforms follow the by now familiar interface of creating | |
583 an `fftw_plan', executing it with `fftw_execute(plan)', and destroying | |
584 it with `fftw_destroy_plan(plan)'. Furthermore, all r2r transforms | |
585 share the same planner interface: | |
586 | |
587 fftw_plan fftw_plan_r2r_1d(int n, double *in, double *out, | |
588 fftw_r2r_kind kind, unsigned flags); | |
589 fftw_plan fftw_plan_r2r_2d(int n0, int n1, double *in, double *out, | |
590 fftw_r2r_kind kind0, fftw_r2r_kind kind1, | |
591 unsigned flags); | |
592 fftw_plan fftw_plan_r2r_3d(int n0, int n1, int n2, | |
593 double *in, double *out, | |
594 fftw_r2r_kind kind0, | |
595 fftw_r2r_kind kind1, | |
596 fftw_r2r_kind kind2, | |
597 unsigned flags); | |
598 fftw_plan fftw_plan_r2r(int rank, const int *n, double *in, double *out, | |
599 const fftw_r2r_kind *kind, unsigned flags); | |
600 | |
601 Just as for the complex DFT, these plan 1d/2d/3d/multi-dimensional | |
602 transforms for contiguous arrays in row-major order, transforming (real) | |
603 input to output of the same size, where `n' specifies the _physical_ | |
604 dimensions of the arrays. All positive `n' are supported (with the | |
605 exception of `n=1' for the `FFTW_REDFT00' kind, noted in the real-even | |
606 subsection below); products of small factors are most efficient | |
607 (factorizing `n-1' and `n+1' for `FFTW_REDFT00' and `FFTW_RODFT00' | |
608 kinds, described below), but an O(n log n) algorithm is used even for | |
609 prime sizes. | |
610 | |
611 Each dimension has a "kind" parameter, of type `fftw_r2r_kind', | |
612 specifying the kind of r2r transform to be used for that dimension. (In | |
613 the case of `fftw_plan_r2r', this is an array `kind[rank]' where | |
614 `kind[i]' is the transform kind for the dimension `n[i]'.) The kind | |
615 can be one of a set of predefined constants, defined in the following | |
616 subsections. | |
617 | |
618 In other words, FFTW computes the separable product of the specified | |
619 r2r transforms over each dimension, which can be used e.g. for partial | |
620 differential equations with mixed boundary conditions. (For some r2r | |
621 kinds, notably the halfcomplex DFT and the DHT, such a separable | |
622 product is somewhat problematic in more than one dimension, however, as | |
623 is described below.) | |
624 | |
625 In the current version of FFTW, all r2r transforms except for the | |
626 halfcomplex type are computed via pre- or post-processing of | |
627 halfcomplex transforms, and they are therefore not as fast as they | |
628 could be. Since most other general DCT/DST codes employ a similar | |
629 algorithm, however, FFTW's implementation should provide at least | |
630 competitive performance. | |
631 | |
632 | |
633 File: fftw3.info, Node: The Halfcomplex-format DFT, Next: Real even/odd DFTs (cosine/sine transforms), Prev: More DFTs of Real Data, Up: More DFTs of Real Data | |
634 | |
635 2.5.1 The Halfcomplex-format DFT | |
636 -------------------------------- | |
637 | |
638 An r2r kind of `FFTW_R2HC' ("r2hc") corresponds to an r2c DFT (*note | |
639 One-Dimensional DFTs of Real Data::) but with "halfcomplex" format | |
640 output, and may sometimes be faster and/or more convenient than the | |
641 latter. The inverse "hc2r" transform is of kind `FFTW_HC2R'. The | |
642 halfcomplex format consists of the non-redundant half of the complex | |
643 output for a 1d real-input DFT of size `n', stored as a sequence of `n' real numbers | |
644 (`double') in the format: | |
645 | |
646 r0, r1, r2, ..., r(n/2), i((n+1)/2-1), ..., i2, i1 | |
647 | |
648 Here, rk is the real part of the kth output, and ik is the imaginary | |
649 part. (Division by 2 is rounded down.) For a halfcomplex array | |
650 `hc[n]', the kth component thus has its real part in `hc[k]' and its | |
651 imaginary part in `hc[n-k]', with the exception of `k == 0' or | |
652 `k == n/2' (the latter only if `n' is even)--in these two cases, the | |
653 imaginary part is zero due to symmetries of the real-input DFT, and is | |
654 not stored. Thus, the r2hc transform of `n' real values is a | |
655 halfcomplex array of length `n', and vice versa for hc2r. | |
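
As an illustration of this layout, here is a minimal sketch of two accessor helpers for a halfcomplex array. The names `hc_real' and `hc_imag' are hypothetical (they are not part of FFTW), and the sketch assumes 0 <= k <= n/2, i.e. the non-redundant half of the spectrum:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, not part of FFTW: read the k-th DFT output from a
   halfcomplex array hc[n]. Valid for 0 <= k <= n/2. */

/* Real part of the k-th output: stored directly in hc[k]. */
double hc_real(const double *hc, size_t n, size_t k)
{
    assert(k <= n / 2);
    return hc[k];
}

/* Imaginary part of the k-th output: stored in hc[n-k], except for k == 0
   and (when n is even) k == n/2, where it is zero and not stored. */
double hc_imag(const double *hc, size_t n, size_t k)
{
    assert(k <= n / 2);
    if (k == 0 || (n % 2 == 0 && k == n / 2))
        return 0.0;
    return hc[n - k];
}
```

For `n = 6' the array holds r0, r1, r2, r3, i2, i1, so `hc_imag(hc, 6, 1)' reads `hc[5]'; for `n = 5' it holds r0, r1, r2, i2, i1.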
656 | |
657 Aside from the differing format, the output of | |
658 `FFTW_R2HC'/`FFTW_HC2R' is otherwise exactly the same as for the | |
659 corresponding 1d r2c/c2r transform (i.e. `FFTW_FORWARD'/`FFTW_BACKWARD' | |
660 transforms, respectively). Recall that these transforms are | |
661 unnormalized, so r2hc followed by hc2r will result in the original data | |
662 multiplied by `n'. Furthermore, like the c2r transform, an | |
663 out-of-place hc2r transform will _destroy its input_ array. | |
664 | |
665 Although these halfcomplex transforms can be used with the | |
666 multi-dimensional r2r interface, the interpretation of such a separable | |
667 product of transforms along each dimension is problematic. For example, | |
668 consider a two-dimensional `n0' by `n1', r2hc by r2hc transform planned | |
669 by `fftw_plan_r2r_2d(n0, n1, in, out, FFTW_R2HC, FFTW_R2HC, | |
670 FFTW_MEASURE)'. Conceptually, FFTW first transforms the rows (of size | |
671 `n1') to produce halfcomplex rows, and then transforms the columns (of | |
672 size `n0'). Half of these column transforms, however, are of imaginary | |
673 parts, and should therefore be multiplied by i and combined with the | |
674 r2hc transforms of the real columns to produce the 2d DFT amplitudes; | |
675 FFTW's r2r transform does _not_ perform this combination for you. | |
676 Thus, if a multi-dimensional real-input/output DFT is required, we | |
677 recommend using the ordinary r2c/c2r interface (*note Multi-Dimensional | |
678 DFTs of Real Data::). | |
679 | |
680 | |
681 File: fftw3.info, Node: Real even/odd DFTs (cosine/sine transforms), Next: The Discrete Hartley Transform, Prev: The Halfcomplex-format DFT, Up: More DFTs of Real Data | |
682 | |
683 2.5.2 Real even/odd DFTs (cosine/sine transforms) | |
684 ------------------------------------------------- | |
685 | |
686 The Fourier transform of a real-even function f(-x) = f(x) is | |
687 real-even, and i times the Fourier transform of a real-odd function | |
688 f(-x) = -f(x) is real-odd. Similar results hold for a discrete Fourier | |
689 transform, and thus for these symmetries the need for complex | |
690 inputs/outputs is entirely eliminated. Moreover, one gains a factor of | |
691 two in speed/space from the fact that the data are real, and an | |
692 additional factor of two from the even/odd symmetry: only the | |
693 non-redundant (first) half of the array need be stored. The result is | |
694 the real-even DFT ("REDFT") and the real-odd DFT ("RODFT"), also known | |
695 as the discrete cosine and sine transforms ("DCT" and "DST"), | |
696 respectively. | |
697 | |
698 (In this section, we describe the 1d transforms; multi-dimensional | |
699 transforms are just a separable product of these transforms operating | |
700 along each dimension.) | |
701 | |
702 Because of the discrete sampling, one has an additional choice: is | |
703 the data even/odd around a sampling point, or around the point halfway | |
704 between two samples? The latter corresponds to _shifting_ the samples | |
705 by _half_ an interval, and gives rise to several transform variants | |
706 denoted by REDFTab and RODFTab: a and b are 0 or 1, and indicate | |
707 whether the input (a) and/or output (b) are shifted by half a sample (1 | |
708 means it is shifted). These are also known as types I-IV of the DCT | |
709 and DST, and all four types are supported by FFTW's r2r interface.(1) | |
710 | |
711 The r2r kinds for the various REDFT and RODFT types supported by | |
712 FFTW, along with the boundary conditions at both ends of the _input_ | |
713 array (`n' real numbers `in[j=0..n-1]'), are: | |
714 | |
715 * `FFTW_REDFT00' (DCT-I): even around j=0 and even around j=n-1. | |
716 | |
717 * `FFTW_REDFT10' (DCT-II, "the" DCT): even around j=-0.5 and even | |
718 around j=n-0.5. | |
719 | |
720 * `FFTW_REDFT01' (DCT-III, "the" IDCT): even around j=0 and odd | |
721 around j=n. | |
722 | |
723 * `FFTW_REDFT11' (DCT-IV): even around j=-0.5 and odd around j=n-0.5. | |
724 | |
725 * `FFTW_RODFT00' (DST-I): odd around j=-1 and odd around j=n. | |
726 | |
727 * `FFTW_RODFT10' (DST-II): odd around j=-0.5 and odd around j=n-0.5. | |
728 | |
729 * `FFTW_RODFT01' (DST-III): odd around j=-1 and even around j=n-1. | |
730 | |
731 * `FFTW_RODFT11' (DST-IV): odd around j=-0.5 and even around j=n-0.5. | |
732 | |
733 | |
734 Note that these symmetries apply to the "logical" array being | |
735 transformed; *there are no constraints on your physical input data*. | |
736 So, for example, if you specify a size-5 REDFT00 (DCT-I) of the data | |
737 abcde, it corresponds to the DFT of the logical even array abcdedcb of | |
738 size 8. A size-4 REDFT10 (DCT-II) of the data abcd corresponds to the | |
739 size-8 logical DFT of the even array abcddcba, shifted by half a sample. | |
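
The two examples above can be sketched in code. The helpers below are hypothetical (FFTW never materializes the logical array; this is for illustration only) and build the logical even array for REDFT00 and REDFT10 from the physical input. The half-sample shift of REDFT10 is a property of the transform and is not represented in the array itself:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration, not FFTW code. */

/* Logical even array for a size-n REDFT00 (DCT-I): abcde -> abcdedcb.
   Writes N = 2(n-1) values to out and returns N. Requires n >= 2. */
size_t redft00_logical(const double *in, size_t n, double *out)
{
    size_t N, j;
    assert(n >= 2);
    N = 2 * (n - 1);
    for (j = 0; j < n; ++j)
        out[j] = in[j];
    for (j = n; j < N; ++j)
        out[j] = in[N - j];        /* mirror around j = n-1 */
    return N;
}

/* Logical even array for a size-n REDFT10 (DCT-II): abcd -> abcddcba.
   Writes N = 2n values to out and returns N. */
size_t redft10_logical(const double *in, size_t n, double *out)
{
    size_t N = 2 * n, j;
    for (j = 0; j < n; ++j)
        out[j] = in[j];
    for (j = n; j < N; ++j)
        out[j] = in[N - 1 - j];    /* mirror around j = n-0.5 */
    return N;
}
```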
740 | |
741 All of these transforms are invertible. The inverse of R*DFT00 is | |
742 R*DFT00; of R*DFT10 is R*DFT01 and vice versa (these are often called | |
743 simply "the" DCT and IDCT, respectively); and of R*DFT11 is R*DFT11. | |
744 However, the transforms computed by FFTW are unnormalized, exactly like | |
745 the corresponding real and complex DFTs, so computing a transform | |
746 followed by its inverse yields the original array scaled by N, where N | |
747 is the _logical_ DFT size. For REDFT00, N=2(n-1); for RODFT00, | |
748 N=2(n+1); otherwise, N=2n. | |
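
The normalization rule can be written as a small helper. This is a sketch under stated assumptions: the enum below is a hypothetical stand-in for the `fftw_r2r_kind' constants (real code would include `fftw3.h' and use `FFTW_REDFT00' etc.), and `logical_dft_size' is not an FFTW function:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the FFTW r2r kinds, used only so that this
   sketch compiles without fftw3.h. */
enum r2r_kind_model {
    MODEL_REDFT00, MODEL_REDFT10, MODEL_REDFT01, MODEL_REDFT11,
    MODEL_RODFT00, MODEL_RODFT10, MODEL_RODFT01, MODEL_RODFT11
};

/* Logical DFT size N for a physical size n: the factor by which a
   transform followed by its inverse scales the data. */
size_t logical_dft_size(enum r2r_kind_model kind, size_t n)
{
    assert(n >= 1);
    switch (kind) {
    case MODEL_REDFT00: return 2 * (n - 1);
    case MODEL_RODFT00: return 2 * (n + 1);
    default:            return 2 * n;
    }
}
```

To recover the original data after a round trip, one would divide each output element by this N.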
749 | |
750 Note that the boundary conditions of the transform output array are | |
751 given by the input boundary conditions of the inverse transform. Thus, | |
752 the above transforms are all inequivalent in terms of input/output | |
753 boundary conditions, even neglecting the 0.5 shift difference. | |
754 | |
755 FFTW is most efficient when N is a product of small factors; note | |
756 that this _differs_ from the factorization of the physical size `n' for | |
757 REDFT00 and RODFT00! There is another oddity: `n=1' REDFT00 transforms | |
758 correspond to N=0, and so are _not defined_ (the planner will return | |
759 `NULL'). Otherwise, any positive `n' is supported. | |
760 | |
761 For the precise mathematical definitions of these transforms as used | |
762 by FFTW, see *note What FFTW Really Computes::. (For people accustomed | |
763 to the DCT/DST, FFTW's definitions have a coefficient of 2 in front of | |
764 the cos/sin functions so that they correspond precisely to an even/odd | |
765 DFT of size N. Some authors also include additional multiplicative | |
766 factors of sqrt(2) for selected inputs and outputs; this makes the | |
767 transform orthogonal, but sacrifices the direct equivalence to a | |
768 symmetric DFT.) | |
769 | |
770 Which type do you need? | |
771 ....................... | |
772 | |
773 Since the required flavor of even/odd DFT depends upon your problem, | |
774 you are the best judge of this choice, but we can make a few comments | |
775 on relative efficiency to help you in your selection. In particular, | |
776 R*DFT01 and R*DFT10 tend to be slightly faster than R*DFT11 (especially | |
777 for odd sizes), while the R*DFT00 transforms are sometimes | |
778 significantly slower (especially for even sizes).(2) | |
779 | |
780 Thus, if only the boundary conditions on the transform inputs are | |
781 specified, we generally recommend R*DFT10 over R*DFT00 and R*DFT01 over | |
782 R*DFT11 (unless the half-sample shift or the self-inverse property is | |
783 significant for your problem). | |
784 | |
785 If performance is important to you and you are using only small sizes | |
786 (say n<200), e.g. for multi-dimensional transforms, then you might | |
787 consider generating hard-coded transforms of those sizes and types that | |
788 you are interested in (*note Generating your own code::). | |
789 | |
790 We are interested in hearing what types of symmetric transforms you | |
791 find most useful. | |
792 | |
793 ---------- Footnotes ---------- | |
794 | |
795 (1) There are also type V-VIII transforms, which correspond to a | |
796 logical DFT of _odd_ size N, independent of whether the physical size | |
797 `n' is odd, but we do not support these variants. | |
798 | |
799 (2) R*DFT00 is sometimes slower in FFTW because we discovered that | |
800 the standard algorithm for computing this by a pre/post-processed real | |
801 DFT--the algorithm used in FFTPACK, Numerical Recipes, and other | |
802 sources for decades now--has serious numerical problems: it already | |
803 loses several decimal places of accuracy for 16k sizes. There seem to | |
804 be only two alternatives in the literature that do not suffer | |
805 similarly: a recursive decomposition into smaller DCTs, which would | |
806 require a large set of codelets for efficiency and generality, or | |
807 sacrificing a factor of 2 in speed to use a real DFT of twice the size. | |
808 We currently employ the latter technique for general n, as well as a | |
809 limited form of the former method: a split-radix decomposition when n | |
810 is odd (N a multiple of 4). For N containing many factors of 2, the | |
811 split-radix method seems to recover most of the speed of the standard | |
812 algorithm without the accuracy tradeoff. | |
813 | |
814 | |
815 File: fftw3.info, Node: The Discrete Hartley Transform, Prev: Real even/odd DFTs (cosine/sine transforms), Up: More DFTs of Real Data | |
816 | |
817 2.5.3 The Discrete Hartley Transform | |
818 ------------------------------------ | |
819 | |
820 If you are planning to use the DHT because you've heard that it is | |
821 "faster" than the DFT (FFT), *stop here*. The DHT is not faster than | |
822 the DFT. That story is an old but enduring misconception that was | |
823 debunked in 1987. | |
824 | |
825 The discrete Hartley transform (DHT) is an invertible linear | |
826 transform closely related to the DFT. In the DFT, one multiplies each | |
827 input by cos - i * sin (a complex exponential), whereas in the DHT each | |
828 input is multiplied by simply cos + sin. Thus, the DHT transforms `n' | |
829 real numbers to `n' real numbers, and has the convenient property of | |
830 being its own inverse. In FFTW, a DHT (of any positive `n') can be | |
831 specified by an r2r kind of `FFTW_DHT'. | |
832 | |
833 Like the DFT, in FFTW the DHT is unnormalized, so computing a DHT of | |
834 size `n' followed by another DHT of the same size will result in the | |
835 original array multiplied by `n'. | |
836 | |
837 The DHT was originally proposed as a more efficient alternative to | |
838 the DFT for real data, but it was subsequently shown that a specialized | |
839 DFT (such as FFTW's r2hc or r2c transforms) could be just as fast. In | |
840 FFTW, the DHT is actually computed by post-processing an r2hc | |
841 transform, so there is ordinarily no reason to prefer it from a | |
842 performance perspective.(1) However, we have heard rumors that the DHT | |
843 might be the most appropriate transform in its own right for certain | |
844 applications, and we would be very interested to hear from anyone who | |
845 finds it useful. | |
846 | |
847 If `FFTW_DHT' is specified for multiple dimensions of a | |
848 multi-dimensional transform, FFTW computes the separable product of 1d | |
849 DHTs along each dimension. Unfortunately, this is not quite the same | |
850 thing as a true multi-dimensional DHT; you can compute the latter, if | |
851 necessary, with at most `rank-1' post-processing passes [see e.g. H. | |
852 Hao and R. N. Bracewell, Proc. IEEE 75, 264-266 (1987)]. | |
853 | |
854 For the precise mathematical definition of the DHT as used by FFTW, | |
855 see *note What FFTW Really Computes::. | |
856 | |
857 ---------- Footnotes ---------- | |
858 | |
859 (1) We provide the DHT mainly as a byproduct of some internal | |
860 algorithms. FFTW computes a real input/output DFT of _prime_ size by | |
861 re-expressing it as a DHT plus post/pre-processing and then using | |
862 Rader's prime-DFT algorithm adapted to the DHT. | |
863 | |
864 | |
865 File: fftw3.info, Node: Other Important Topics, Next: FFTW Reference, Prev: Tutorial, Up: Top | |
866 | |
867 3 Other Important Topics | |
868 ************************ | |
869 | |
870 * Menu: | |
871 | |
872 * SIMD alignment and fftw_malloc:: | |
873 * Multi-dimensional Array Format:: | |
874 * Words of Wisdom-Saving Plans:: | |
875 * Caveats in Using Wisdom:: | |
876 | |
877 | |
878 File: fftw3.info, Node: SIMD alignment and fftw_malloc, Next: Multi-dimensional Array Format, Prev: Other Important Topics, Up: Other Important Topics | |
879 | |
880 3.1 SIMD alignment and fftw_malloc | |
881 ================================== | |
882 | |
883 SIMD, which stands for "Single Instruction Multiple Data," is a set of | |
884 special operations supported by some processors to perform a single | |
885 operation on several numbers (usually 2 or 4) simultaneously. SIMD | |
886 floating-point instructions are available on several popular CPUs: | |
887 SSE/SSE2/AVX on recent x86/x86-64 processors, AltiVec (single precision) | |
888 on some PowerPCs (Apple G4 and higher), NEON on some ARM models, and | |
889 MIPS Paired Single (currently only in FFTW 3.2.x). FFTW can be | |
890 compiled to support the SIMD instructions on any of these systems. | |
891 | |
892 A program linking to an FFTW library compiled with SIMD support can | |
893 obtain a nonnegligible speedup for most complex and r2c/c2r transforms. | |
894 In order to obtain this speedup, however, the arrays of complex (or | |
895 real) data passed to FFTW must be specially aligned in memory | |
896 (typically 16-byte aligned), and often this alignment is more stringent | |
897 than that provided by the usual `malloc' (etc.) allocation routines. | |
898 | |
899 In order to guarantee proper alignment for SIMD, therefore, in case | |
900 your program is ever linked against a SIMD-using FFTW, we recommend | |
901 allocating your transform data with `fftw_malloc' and de-allocating it | |
902 with `fftw_free'. These have exactly the same interface and behavior as | |
903 `malloc'/`free', except that for a SIMD FFTW they ensure that the | |
904 returned pointer has the necessary alignment (by calling `memalign' or | |
905 its equivalent on your OS). | |
906 | |
907 You are not _required_ to use `fftw_malloc'. You can allocate your | |
908 data in any way that you like, from `malloc' to `new' (in C++) to a | |
909 fixed-size array declaration. If the array happens not to be properly | |
910 aligned, FFTW will not use the SIMD extensions. | |
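
If you allocate the data yourself, you may want to check its alignment. The helpers below are a hypothetical sketch, not FFTW functions; the 16-byte figure used in the usage note is the "typical" value mentioned above and is an assumption that may not hold for every SIMD instruction set:

```c
#include <stdint.h>

/* Hypothetical helpers, not part of FFTW. `alignment' must be a power
   of two. */

/* Nonzero if p is a multiple of `alignment' bytes. */
int is_aligned(const void *p, uintptr_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Round p up to the next address that is a multiple of `alignment'. */
void *align_up(void *p, uintptr_t alignment)
{
    uintptr_t addr = (uintptr_t)p;
    return (void *)((addr + alignment - 1) & ~(alignment - 1));
}
```

For example, `is_aligned(data, 16)' reports whether `data' satisfies a 16-byte alignment requirement.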
911 | |
912 Since `fftw_malloc' only ever needs to be used for real and complex | |
913 arrays, we provide two convenient wrapper routines `fftw_alloc_real(N)' | |
914 and `fftw_alloc_complex(N)' that are equivalent to | |
915 `(double*)fftw_malloc(sizeof(double) * N)' and | |
916 `(fftw_complex*)fftw_malloc(sizeof(fftw_complex) * N)', respectively | |
917 (or their equivalents in other precisions). | |
918 | |
919 | |
920 File: fftw3.info, Node: Multi-dimensional Array Format, Next: Words of Wisdom-Saving Plans, Prev: SIMD alignment and fftw_malloc, Up: Other Important Topics | |
921 | |
922 3.2 Multi-dimensional Array Format | |
923 ================================== | |
924 | |
925 This section describes the format in which multi-dimensional arrays are | |
926 stored in FFTW. Several different formats are in common use, and this | |
927 topic is often a source of confusion, so we felt that a detailed | |
928 discussion was necessary. | |
929 | |
930 * Menu: | |
931 | |
932 * Row-major Format:: | |
933 * Column-major Format:: | |
934 * Fixed-size Arrays in C:: | |
935 * Dynamic Arrays in C:: | |
936 * Dynamic Arrays in C-The Wrong Way:: | |
937 | |
938 | |
939 File: fftw3.info, Node: Row-major Format, Next: Column-major Format, Prev: Multi-dimensional Array Format, Up: Multi-dimensional Array Format | |
940 | |
941 3.2.1 Row-major Format | |
942 ---------------------- | |
943 | |
944 The multi-dimensional arrays passed to `fftw_plan_dft' etcetera are | |
945 expected to be stored as a single contiguous block in "row-major" order | |
946 (sometimes called "C order"). Basically, this means that as you step | |
947 through adjacent memory locations, the first dimension's index varies | |
948 most slowly and the last dimension's index varies most quickly. | |
949 | |
950 To be more explicit, let us consider an array of rank d whose | |
951 dimensions are n[0] x n[1] x n[2] x ... x n[d-1] . Now, we specify a | |
952 location in the array by a sequence of d (zero-based) indices, one for | |
953 each dimension: (i[0], i[1], ..., i[d-1]). If the array is stored in | |
954 row-major order, then this element is located at the position i[d-1] + | |
955 n[d-1] * (i[d-2] + n[d-2] * (... + n[1] * i[0])). | |
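
The position formula can be evaluated with a single loop over the dimensions (Horner's rule). The function name `row_major_index' is hypothetical; this is a sketch, not an FFTW routine:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of FFTW: position of the element with
   zero-based indices i[0..rank-1] in a row-major array of dimensions
   n[0] x n[1] x ... x n[rank-1]. */
size_t row_major_index(int rank, const int *n, const int *i)
{
    size_t pos = 0;
    int d;
    for (d = 0; d < rank; ++d) {
        assert(i[d] >= 0 && i[d] < n[d]);
        /* after this step, pos = i[d] + n[d] * (i[d-1] + n[d-1] * ...) */
        pos = pos * (size_t)n[d] + (size_t)i[d];
    }
    return pos;
}
```

For a 5 x 12 x 27 array, the element (1, 2, 3) is at position 3 + 27 * (2 + 12 * 1) = 381.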
956 | |
957 Note that, for the ordinary complex DFT, each element of the array | |
958 must be of type `fftw_complex'; i.e. a (real, imaginary) pair of | |
959 (double-precision) numbers. | |
960 | |
961 In the advanced FFTW interface, the physical dimensions n from which | |
962 the indices are computed can be different from (larger than) the | |
963 logical dimensions of the transform to be computed, in order to | |
964 transform a subset of a larger array. Note also that, in the advanced | |
965 interface, the expression above is multiplied by a "stride" to get the | |
966 actual array index--this is useful in situations where each element of | |
967 the multi-dimensional array is actually a data structure (or another | |
968 array), and you just want to transform a single field. In the basic | |
969 interface, however, the stride is 1. | |
970 | |
971 | |
972 File: fftw3.info, Node: Column-major Format, Next: Fixed-size Arrays in C, Prev: Row-major Format, Up: Multi-dimensional Array Format | |
973 | |
974 3.2.2 Column-major Format | |
975 ------------------------- | |
976 | |
977 Readers from the Fortran world are used to arrays stored in | |
978 "column-major" order (sometimes called "Fortran order"). This is | |
979 essentially the exact opposite of row-major order in that, here, the | |
980 _first_ dimension's index varies most quickly. | |
981 | |
982 If you have an array stored in column-major order and wish to | |
983 transform it using FFTW, it is quite easy to do. When creating the | |
984 plan, simply pass the dimensions of the array to the planner in | |
985 _reverse order_. For example, if your array is a rank three `N x M x | |
986 L' matrix in column-major order, you should pass the dimensions of the | |
987 array as if it were an `L x M x N' matrix (which it is, from the | |
988 perspective of FFTW). This is done for you _automatically_ by the FFTW | |
989 legacy-Fortran interface (*note Calling FFTW from Legacy Fortran::), | |
990 but you must do it manually with the modern Fortran interface (*note | |
991 Reversing array dimensions::). | |
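
The following sketch shows why reversing the dimensions works. The helper names are hypothetical (they are not FFTW functions): an element's offset in a column-major `N x M x L' array equals its row-major offset in an `L x M x N' array with the indices taken in reverse order:

```c
#include <stddef.h>

/* Hypothetical helpers, not part of FFTW. */

/* Offset of element i[0..rank-1] in a column-major array of dimensions
   n[0] x ... x n[rank-1] (the first index varies most quickly). */
size_t column_major_index(int rank, const int *n, const int *i)
{
    size_t pos = 0;
    int d;
    for (d = rank - 1; d >= 0; --d)
        pos = pos * (size_t)n[d] + (size_t)i[d];
    return pos;
}

/* Reverse the dimension list in place, giving the order in which a
   column-major array's dimensions would be passed to a row-major planner. */
void reverse_dims(int rank, int *n)
{
    int lo, hi;
    for (lo = 0, hi = rank - 1; lo < hi; ++lo, --hi) {
        int tmp = n[lo];
        n[lo] = n[hi];
        n[hi] = tmp;
    }
}
```

For a column-major 5 x 12 x 27 array, element (1, 2, 3) is at offset 1 + 5 * (2 + 12 * 3) = 191, which is also the row-major offset of element (3, 2, 1) in a 27 x 12 x 5 array.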
992 | |
993 | |
994 File: fftw3.info, Node: Fixed-size Arrays in C, Next: Dynamic Arrays in C, Prev: Column-major Format, Up: Multi-dimensional Array Format | |
995 | |
996 3.2.3 Fixed-size Arrays in C | |
997 ---------------------------- | |
998 | |
999 A multi-dimensional array whose size is declared at compile time in C | |
1000 is _already_ in row-major order. You don't have to do anything special | |
1001 to transform it. For example: | |
1002 | |
1003 { | |
1004 fftw_complex data[N0][N1][N2]; | |
1005 fftw_plan plan; | |
1006 ... | |
1007 plan = fftw_plan_dft_3d(N0, N1, N2, &data[0][0][0], &data[0][0][0], | |
1008 FFTW_FORWARD, FFTW_ESTIMATE); | |
1009 ... | |
1010 } | |
1011 | |
1012 This will plan a 3d in-place transform of size `N0 x N1 x N2'. | |
1013 Notice how we took the address of the zero-th element to pass to the | |
1014 planner (we could also have used a typecast). | |
1015 | |
1016 However, we tend to _discourage_ users from declaring their arrays | |
1017 in this way, for two reasons. First, this allocates the array on the | |
1018 stack ("automatic" storage), which has a very limited size on most | |
1019 operating systems (declaring an array with more than a few thousand | |
1020 elements will often cause a crash). (You can get around this | |
1021 limitation on many systems by declaring the array as `static' and/or | |
1022 global, but that has its own drawbacks.) Second, it may not optimally | |
1023 align the array for use with a SIMD FFTW (*note SIMD alignment and | |
1024 fftw_malloc::). Instead, we recommend using `fftw_malloc', as | |
1025 described below. | |
1026 | |
1027 | |
1028 File: fftw3.info, Node: Dynamic Arrays in C, Next: Dynamic Arrays in C-The Wrong Way, Prev: Fixed-size Arrays in C, Up: Multi-dimensional Array Format | |
1029 | |
1030 3.2.4 Dynamic Arrays in C | |
1031 ------------------------- | |
1032 | |
1033 We recommend allocating most arrays dynamically, with `fftw_malloc'. | |
1034 This isn't too hard to do, although it is not as straightforward for | |
1035 multi-dimensional arrays as it is for one-dimensional arrays. | |
1036 | |
1037 Creating the array is simple: using a dynamic-allocation routine like | |
1038 `fftw_malloc', allocate an array big enough to store N `fftw_complex' | |
1039 values (for a complex DFT), where N is the product of the sizes of the | |
1040 array dimensions (i.e. the total number of complex values in the | |
1041 array). For example, here is code to allocate a 5 x 12 x 27 rank-3 | |
1042 array: | |
1043 | |
1044 fftw_complex *an_array; | |
1045 an_array = (fftw_complex*) fftw_malloc(5*12*27 * sizeof(fftw_complex)); | |
1046 | |
1047 Accessing the array elements, however, is trickier--you can't | |
1048 simply use multiple applications of the `[]' operator like you could | |
1049 for fixed-size arrays. Instead, you have to explicitly compute the | |
1050 offset into the array using the formula given earlier for row-major | |
1051 arrays. For example, to reference the (i,j,k)-th element of the array | |
1052 allocated above, you would use the expression `an_array[k + 27 * (j + | |
1053 12 * i)]'. | |
1054 | |
1055 This pain can be alleviated somewhat by defining appropriate macros, | |
1056 or, in C++, creating a class and overloading the `()' operator. The | |
1057 recent C99 standard provides a way to reinterpret the dynamic array as | |
1058 a "variable-length" multi-dimensional array amenable to `[]', but this | |
1059 feature is not yet widely supported by compilers. | |
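
Both approaches can be sketched as follows. This is an illustration under stated assumptions: the macro `IDX3' and the function `vla_view_matches_flat' are hypothetical names, plain `double' elements and `malloc' stand in for `fftw_complex' and `fftw_malloc' so that the sketch compiles without FFTW, and the second half requires a compiler with C99 variable-length array support:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical macro: row-major offset of element (i,j,k) in an
   n0 x n1 x n2 array (n0 is not needed for the offset). */
#define IDX3(i, j, k, n1, n2) \
    ((size_t)(k) + (size_t)(n2) * ((size_t)(j) + (size_t)(n1) * (size_t)(i)))

/* Reinterpret a flat block as a C99 variable-length 3d array and check
   that a[i][j][k] and flat[IDX3(i,j,k,n1,n2)] name the same element.
   Returns 1 if they always agree, 0 if not, -1 on allocation failure. */
int vla_view_matches_flat(size_t n0, size_t n1, size_t n2)
{
    size_t i, j, k;
    int ok = 1;
    double *flat;

    assert(n0 > 0 && n1 > 0 && n2 > 0);
    flat = malloc(n0 * n1 * n2 * sizeof *flat);  /* stand-in for fftw_malloc */
    if (!flat)
        return -1;

    {
        /* pointer to a variable-length n1 x n2 slab; a[i][j][k] then works */
        double (*a)[n1][n2] = (double (*)[n1][n2])flat;
        for (i = 0; i < n0; ++i)
            for (j = 0; j < n1; ++j)
                for (k = 0; k < n2; ++k)
                    if (&a[i][j][k] != &flat[IDX3(i, j, k, n1, n2)])
                        ok = 0;
    }

    free(flat);
    return ok;
}
```

With the macro, the expression from the previous paragraph becomes `an_array[IDX3(i, j, k, 12, 27)]'.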
1060 | |
1061 | |
1062 File: fftw3.info, Node: Dynamic Arrays in C-The Wrong Way, Prev: Dynamic Arrays in C, Up: Multi-dimensional Array Format | |
1063 | |
1064 3.2.5 Dynamic Arrays in C--The Wrong Way | |
1065 ---------------------------------------- | |
1066 | |
1067 A different method for allocating multi-dimensional arrays in C is | |
1068 often suggested that is incompatible with FFTW: _using it will cause | |
1069 FFTW to die a painful death_. We discuss the technique here, however, | |
1070 because it is so commonly known and used. This method is to create | |
1071 arrays of pointers to arrays of pointers to ... etcetera. For example, | |
1072 the analogue in this method to the example above is: | |
1073 | |
1074 int i,j; | |
1075 fftw_complex ***a_bad_array; /* another way to make a 5x12x27 array */ | |
1076 | |
1077 a_bad_array = (fftw_complex ***) malloc(5 * sizeof(fftw_complex **)); | |
1078 for (i = 0; i < 5; ++i) { | |
1079 a_bad_array[i] = | |
1080 (fftw_complex **) malloc(12 * sizeof(fftw_complex *)); | |
1081 for (j = 0; j < 12; ++j) | |
1082 a_bad_array[i][j] = | |
1083 (fftw_complex *) malloc(27 * sizeof(fftw_complex)); | |
1084 } | |
1085 | |
1086 As you can see, this sort of array is inconvenient to allocate (and | |
1087 deallocate). On the other hand, it has the advantage that the | |
1088 (i,j,k)-th element can be referenced simply by `a_bad_array[i][j][k]'. | |
1089 | |
1090 If you like this technique and want to maximize convenience in | |
1091 accessing the array, but still want to pass the array to FFTW, you can | |
1092 use a hybrid method. Allocate the array as one contiguous block, but | |
1093 also declare an array of arrays of pointers that point to appropriate | |
1094 places in the block. That sort of trick is beyond the scope of this | |
1095 documentation; for more information on multi-dimensional arrays in C, | |
1096 see the `comp.lang.c' FAQ (http://c-faq.com/aryptr/dynmuldimary.html). | |
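
For readers who want a starting point, here is one possible sketch of the hybrid method for a rank-3 array. Everything in it is an assumption made for illustration: the `array3d' type and function names are hypothetical, `double' elements stand in for `fftw_complex', and `malloc' stands in for `fftw_malloc' (which you would want for the data block in real code, for SIMD alignment):

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical type, not part of FFTW. */
struct array3d {
    double *block;  /* contiguous row-major data: pass this to the planner */
    double **rows;  /* n0*n1 pointers, one per row of n2 elements */
    double ***at;   /* at[i][j][k] aliases block[k + n2*(j + n1*i)] */
};

/* Allocate an n0 x n1 x n2 array. Returns 1 on success, 0 on failure. */
int array3d_alloc(struct array3d *a, size_t n0, size_t n1, size_t n2)
{
    size_t i, j;
    a->block = malloc(n0 * n1 * n2 * sizeof *a->block);
    a->rows = malloc(n0 * n1 * sizeof *a->rows);
    a->at = malloc(n0 * sizeof *a->at);
    if (!a->block || !a->rows || !a->at) {
        free(a->block);
        free(a->rows);
        free(a->at);
        return 0;
    }
    for (i = 0; i < n0; ++i) {
        a->at[i] = a->rows + i * n1;
        for (j = 0; j < n1; ++j)
            a->rows[i * n1 + j] = a->block + n2 * (j + n1 * i);
    }
    return 1;
}

/* Release everything allocated by array3d_alloc. */
void array3d_free(struct array3d *a)
{
    free(a->at);
    free(a->rows);
    free(a->block);
}
```

Elements are then accessed as `a.at[i][j][k]', while `a.block' remains a single contiguous row-major block.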
1097 | |
1098 | |
1099 File: fftw3.info, Node: Words of Wisdom-Saving Plans, Next: Caveats in Using Wisdom, Prev: Multi-dimensional Array Format, Up: Other Important Topics | |
1100 | |
1101 3.3 Words of Wisdom--Saving Plans | |
1102 ================================= | |
1103 | |
1104 FFTW implements a method for saving plans to disk and restoring them. | |
1105 In fact, what FFTW does is more general than just saving and loading | |
1106 plans. The mechanism is called "wisdom". Here, we describe this | |
1107 feature at a high level. *Note FFTW Reference::, for a less casual but | |
1108 more complete discussion of how to use wisdom in FFTW. | |
1109 | |
1110 Plans created with the `FFTW_MEASURE', `FFTW_PATIENT', or | |
1111 `FFTW_EXHAUSTIVE' options produce near-optimal FFT performance, but may | |
1112 require a long time to compute because FFTW must measure the runtime of | |
1113 many possible plans and select the best one. This setup is designed | |
1114 for the situations where so many transforms of the same size must be | |
1115 computed that the start-up time is irrelevant. For short | |
1116 initialization times, but slower transforms, we have provided | |
1117 `FFTW_ESTIMATE'. The `wisdom' mechanism is a way to get the best of | |
1118 both worlds: you compute a good plan once, save it to disk, and later | |
1119 reload it as many times as necessary. The wisdom mechanism can | |
1120 actually save and reload many plans at once, not just one. | |
1121 | |
1122 Whenever you create a plan, the FFTW planner accumulates wisdom, | |
1123 which is information sufficient to reconstruct the plan. After | |
1124 planning, you can save this information to disk by means of the | |
1125 function: | |
1126 int fftw_export_wisdom_to_filename(const char *filename); | |
1127 (This function returns non-zero on success.) | |
1128 | |
1129 The next time you run the program, you can restore the wisdom with | |
1130 `fftw_import_wisdom_from_filename' (which also returns non-zero on | |
1131 success), and then recreate the plan using the same flags as before. | |
1132 int fftw_import_wisdom_from_filename(const char *filename); | |
1133 | |
1134 Wisdom is automatically used for any size to which it is applicable, | |
1135 as long as the planner flags are not more "patient" than those with | |
1136 which the wisdom was created. For example, wisdom created with | |
1137 `FFTW_MEASURE' can be used if you later plan with `FFTW_ESTIMATE' or | |
1138 `FFTW_MEASURE', but not with `FFTW_PATIENT'. | |
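
This rule can be modelled in a few lines. The sketch below is not FFTW code: the enum and the function `wisdom_applies' are hypothetical, and the ordering ESTIMATE < MEASURE < PATIENT < EXHAUSTIVE is our assumption about relative "patience", consistent with the example above:

```c
/* Hypothetical model of planner patience levels, least patient first.
   These are not the FFTW flag values. */
enum patience_model {
    MODEL_ESTIMATE = 0,
    MODEL_MEASURE = 1,
    MODEL_PATIENT = 2,
    MODEL_EXHAUSTIVE = 3
};

/* Nonzero if wisdom created at level `created_with' can be used when
   planning at level `planning_with', i.e. if the new flags are not more
   patient than those with which the wisdom was created. */
int wisdom_applies(enum patience_model created_with,
                   enum patience_model planning_with)
{
    return planning_with <= created_with;
}
```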
1139 | |
1140 The `wisdom' is cumulative, and is stored in a global, private data | |
1141 structure managed internally by FFTW. The storage space required is | |
1142 minimal, proportional to the logarithm of the sizes the wisdom was | |
1143 generated from. If memory usage is a concern, however, the wisdom can | |
1144 be forgotten and its associated memory freed by calling: | |
1145 void fftw_forget_wisdom(void); | |
1146 | |
1147 Wisdom can be exported to a file, a string, or any other medium. | |
1148 For details, see *note Wisdom::. | |
1149 | |
1150 | |
1151 File: fftw3.info, Node: Caveats in Using Wisdom, Prev: Words of Wisdom-Saving Plans, Up: Other Important Topics | |
1152 | |
1153 3.4 Caveats in Using Wisdom | |
1154 =========================== | |
1155 | |
1156 For in much wisdom is much grief, and he that increaseth knowledge | |
1157 increaseth sorrow. [Ecclesiastes 1:18] | |
1158 | |
1159 There are pitfalls to using wisdom, in that it can negate FFTW's | |
1160 ability to adapt to changing hardware and other conditions. For | |
1161 example, it would be perfectly possible to export wisdom from a program | |
1162 running on one processor and import it into a program running on | |
1163 another processor. Doing so, however, would mean that the second | |
1164 program would use plans optimized for the first processor, instead of | |
1165 the one it is running on. | |
1166 | |
1167 It should be safe to reuse wisdom as long as the hardware and program | |
1168 binaries remain unchanged. (Actually, the optimal plan may change even | |
1169 between runs of the same binary on identical hardware, due to | |
1170 differences in the virtual memory environment, etcetera. Users | |
1171 seriously interested in performance should worry about this problem, | |
1172 too.) It is likely that, if the same wisdom is used for two different | |
1173 program binaries, even running on the same machine, the plans may be | |
1174 sub-optimal because of differing code alignments. It is therefore wise | |
1175 to recreate wisdom every time an application is recompiled. The more | |
1176 the underlying hardware and software changes between the creation of | |
1177 wisdom and its use, the greater grows the risk of sub-optimal plans. | |
1178 | |
1179 Nevertheless, if the choice is between using `FFTW_ESTIMATE' or | |
1180 using possibly-suboptimal wisdom (created on the same machine, but for a | |
1181 different binary), the wisdom is likely to be better. For this reason, | |
1182 we provide a function to import wisdom from a standard system-wide | |
1183 location (`/etc/fftw/wisdom' on Unix): | |
1184 | |
1185 int fftw_import_system_wisdom(void); | |
1186 | |
1187 FFTW also provides a standalone program, `fftw-wisdom' (described by | |
1188 its own `man' page on Unix) with which users can create wisdom, e.g. | |
1189 for a canonical set of sizes to store in the system wisdom file. *Note | |
1190 Wisdom Utilities::. | |
1191 | |
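As a rough sketch of how the system wisdom file might be used, the program below tries `fftw_import_system_wisdom' first and only then plans. The transform size and the policy of falling back to `FFTW_ESTIMATE' when no wisdom is found are assumptions made for illustration, not recommendations from this manual.

```c
#include <stdio.h>
#include <stdlib.h>
#include <fftw3.h>

int main(void)
{
    const int n = 1024;                     /* illustrative size (assumption) */

    /* Nonzero if the system wisdom file was read successfully. */
    int have_wisdom = fftw_import_system_wisdom();
    if (!have_wisdom)
        fprintf(stderr, "no system wisdom; falling back to FFTW_ESTIMATE\n");

    fftw_complex *in = fftw_alloc_complex(n);
    fftw_complex *out = fftw_alloc_complex(n);
    if (in == NULL || out == NULL) {
        if (in != NULL) fftw_free(in);
        if (out != NULL) fftw_free(out);
        return EXIT_FAILURE;
    }

    /* One possible policy: measure (reusing any applicable wisdom) when
       wisdom was imported, otherwise accept a quick heuristic plan. */
    unsigned flags = have_wisdom ? FFTW_MEASURE : FFTW_ESTIMATE;
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, flags);
    if (p != NULL)
        fftw_destroy_plan(p);

    fftw_free(in);
    fftw_free(out);
    return EXIT_SUCCESS;
}
```

On Unix this would be linked with `-lfftw3 -lm'. If the imported wisdom does not cover the requested size, `FFTW_MEASURE' still times candidate plans as usual.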
1192 | |
1193 File: fftw3.info, Node: FFTW Reference, Next: Multi-threaded FFTW, Prev: Other Important Topics, Up: Top | |
1194 | |
1195 4 FFTW Reference | |
1196 **************** | |
1197 | |
1198 This chapter provides a complete reference for all sequential (i.e., | |
1199 one-processor) FFTW functions. Parallel transforms are described in | |
1200 later chapters. | |
1201 | |
1202 * Menu: | |
1203 | |
1204 * Data Types and Files:: | |
1205 * Using Plans:: | |
1206 * Basic Interface:: | |
1207 * Advanced Interface:: | |
1208 * Guru Interface:: | |
1209 * New-array Execute Functions:: | |
1210 * Wisdom:: | |
1211 * What FFTW Really Computes:: | |
1212 | |
1213 | |
1214 File: fftw3.info, Node: Data Types and Files, Next: Using Plans, Prev: FFTW Reference, Up: FFTW Reference | |
1215 | |
1216 4.1 Data Types and Files | |
1217 ======================== | |
1218 | |
1219 All programs using FFTW should include its header file: | |
1220 | |
1221 #include <fftw3.h> | |
1222 | |
1223 You must also link to the FFTW library. On Unix, this means adding | |
1224 `-lfftw3 -lm' at the _end_ of the link command. | |
1225 | |
1226 * Menu: | |
1227 | |
1228 * Complex numbers:: | |
1229 * Precision:: | |
1230 * Memory Allocation:: | |
1231 | |
1232 | |
1233 File: fftw3.info, Node: Complex numbers, Next: Precision, Prev: Data Types and Files, Up: Data Types and Files | |
1234 | |
1235 4.1.1 Complex numbers | |
1236 --------------------- | |
1237 | |
1238 The default FFTW interface uses `double' precision for all | |
1239 floating-point numbers, and defines a `fftw_complex' type to hold | |
1240 complex numbers as: | |
1241 | |
1242 typedef double fftw_complex[2]; | |
1243 | |
1244 Here, the `[0]' element holds the real part and the `[1]' element | |
1245 holds the imaginary part. | |
1246 | |
1247 Alternatively, if you have a C compiler (such as `gcc') that | |
1248 supports the C99 revision of the ANSI C standard, you can use C's new | |
1249 native complex type (which is binary-compatible with the typedef above). | |
1250 In particular, if you `#include <complex.h>' _before_ `<fftw3.h>', then | |
1251 `fftw_complex' is defined to be the native complex type and you can | |
1252 manipulate it with ordinary arithmetic (e.g. `x = y * (3+4*I)', where | |
1253 `x' and `y' are `fftw_complex' and `I' is the standard symbol for the | |
1254 imaginary unit). | |
1255 | |
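The layout claim above can be checked without FFTW at all. The sketch below uses a local stand-in type, `pair_complex' (a hypothetical name, defined the same way as the default `fftw_complex'), copies C99 complex values into it byte for byte, and multiplies them by hand using the `[0]' (real) and `[1]' (imaginary) elements.

```c
#include <assert.h>
#include <complex.h>
#include <string.h>

/* Local stand-in with the same definition as the default fftw_complex. */
typedef double pair_complex[2];

/* Copy a C99 complex value into the two-double layout, byte for byte. */
void to_pair(double complex z, pair_complex out)
{
    assert(sizeof(double complex) == sizeof(pair_complex));
    memcpy(out, &z, sizeof(pair_complex));
}

/* Complex product computed on the two-double layout:
   [0] holds the real part, [1] the imaginary part. */
void pair_mul(const double *a, const double *b, double *out)
{
    double re = a[0] * b[0] - a[1] * b[1];
    double im = a[0] * b[1] + a[1] * b[0];
    out[0] = re;
    out[1] = im;
}
```

For instance, converting 1+2i and 3+4i with `to_pair' and passing them to `pair_mul' gives the pair (-5, 10), the same value that `(1+2*I) * (3+4*I)' yields with the native type.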
1256 C++ has its own `complex<T>' template class, defined in the standard | |
1257 `<complex>' header file. Reportedly, the C++ standards committee has | |
1258 recently agreed to mandate that the storage format used for this type | |
1259 be binary-compatible with the C99 type, i.e. an array `T[2]' with | |
1260 consecutive real `[0]' and imaginary `[1]' parts. (See report | |
1261 `http://www.open-std.org/jtc1/sc22/WG21/docs/papers/2002/n1388.pdf | |
1262 WG21/N1388'.) Although not part of the official standard as of this | |
1263 writing, the proposal stated that: "This solution has been tested with | |
1264 all current major implementations of the standard library and shown to | |
1265 be working." To the extent that this is true, if you have a variable | |
1266 `complex<double> *x', you can pass it directly to FFTW via | |
1267 `reinterpret_cast<fftw_complex*>(x)'. | |
1268 | |
1269 | |
1270 File: fftw3.info, Node: Precision, Next: Memory Allocation, Prev: Complex numbers, Up: Data Types and Files | |
1271 | |
1272 4.1.2 Precision | |
1273 --------------- | |
1274 | |
1275 You can install single and long-double precision versions of FFTW, | |
1276 which replace `double' with `float' and `long double', respectively | |
1277 (*note Installation and Customization::). To use these interfaces, you: | |
1278 | |
1279 * Link to the single/long-double libraries; on Unix, `-lfftw3f' or | |
1280 `-lfftw3l' instead of (or in addition to) `-lfftw3'. (You can | |
1281 link to the different-precision libraries simultaneously.) | |
1282 | |
1283 * Include the _same_ `<fftw3.h>' header file. | |
1284 | |
1285 * Replace all lowercase instances of `fftw_' with `fftwf_' or | |
1286 `fftwl_' for single or long-double precision, respectively. | |
1287 (`fftw_complex' becomes `fftwf_complex', `fftw_execute' becomes | |
1288 `fftwf_execute', etcetera.) | |
1289 | |
1290 * Uppercase names, i.e. names beginning with `FFTW_', remain the | |
1291 same. | |
1292 | |
1293 * Replace `double' with `float' or `long double' for subroutine | |
1294 parameters. | |
1295 | |
1296 | |
1297 Depending upon your compiler and/or hardware, `long double' may not | |
1298 be any more precise than `double' (or may not be supported at all, | |
1299 although it is standard in C99). | |
1300 | |
1301 We also support using the nonstandard `__float128' | |
1302 quadruple-precision type provided by recent versions of `gcc' on 32- | |
1303 and 64-bit x86 hardware (*note Installation and Customization::). To | |
1304 use this type, link with `-lfftw3q -lquadmath -lm' (the `libquadmath' | |
1305 library provided by `gcc' is needed for quadruple-precision | |
1306 trigonometric functions) and use `fftwq_' identifiers. | |
1307 | |
1308 | |
1309 File: fftw3.info, Node: Memory Allocation, Prev: Precision, Up: Data Types and Files | |
1310 | |
1311 4.1.3 Memory Allocation | |
1312 ----------------------- | |
1313 | |
1314 void *fftw_malloc(size_t n); | |
1315 void fftw_free(void *p); | |
1316 | |
1317 These are functions that behave identically to `malloc' and `free', | |
1318 except that they guarantee that the returned pointer obeys any special | |
1319 alignment restrictions imposed by any algorithm in FFTW (e.g. for SIMD | |
1320 acceleration). *Note SIMD alignment and fftw_malloc::. | |
1321 | |
1322 Data allocated by `fftw_malloc' _must_ be deallocated by `fftw_free' | |
1323 and not by the ordinary `free'. | |
1324 | |
1325 These routines simply call through to your operating system's | |
1326 `malloc' or, if necessary, its aligned equivalent (e.g. `memalign'), so | |
1327 you normally need not worry about any significant time or space | |
1328 overhead. You are _not required_ to use them to allocate your data, | |
1329 but we strongly recommend it. | |
1330 | |
1331 Note: in C++, just as with ordinary `malloc', you must typecast the | |
1332 output of `fftw_malloc' to whatever pointer type you are allocating. | |
1333 | |
1334 We also provide the following two convenience functions to allocate | |
1335 real and complex arrays with `n' elements, which are equivalent to | |
1336 `(double *) fftw_malloc(sizeof(double) * n)' and `(fftw_complex *) | |
1337 fftw_malloc(sizeof(fftw_complex) * n)', respectively: | |
1338 | |
1339 double *fftw_alloc_real(size_t n); | |
1340 fftw_complex *fftw_alloc_complex(size_t n); | |
1341 | |
1342 The equivalent functions in other precisions allocate arrays of `n' | |
1343 elements in that precision; e.g. `fftwf_alloc_real(n)' is equivalent | |
1344 to `(float *) fftwf_malloc(sizeof(float) * n)'. | |
1345 | |
1346 | |
1347 File: fftw3.info, Node: Using Plans, Next: Basic Interface, Prev: Data Types and Files, Up: FFTW Reference | |
1348 | |
1349 4.2 Using Plans | |
1350 =============== | |
1351 | |
1352 Plans for all transform types in FFTW are stored as type `fftw_plan' | |
1353 (an opaque pointer type), and are created by one of the various | |
1354 planning routines described in the following sections. An `fftw_plan' | |
1355 contains all information necessary to compute the transform, including | |
1356 the pointers to the input and output arrays. | |
1357 | |
1358 void fftw_execute(const fftw_plan plan); | |
1359 | |
1360 This executes the `plan', to compute the corresponding transform on | |
1361 the arrays for which it was planned (which must still exist). The plan | |
1362 is not modified, and `fftw_execute' can be called as many times as | |
1363 desired. | |
1364 | |
1365 To apply a given plan to a different array, you can use the | |
1366 new-array execute interface. *Note New-array Execute Functions::. | |
1367 | |
1368 `fftw_execute' (and equivalents) is the only function in FFTW | |
1369 guaranteed to be thread-safe; see *note Thread safety::. | |
1370 | |
1371 This function: | |
1372 void fftw_destroy_plan(fftw_plan plan); | |
1373 deallocates the `plan' and all its associated data. | |
1374 | |
1375 FFTW's planner saves some other persistent data, such as the | |
1376 accumulated wisdom and a list of algorithms available in the current | |
1377 configuration. If you want to deallocate all of that and reset FFTW to | |
1378 the pristine state it was in when you started your program, you can | |
1379 call: | |
1380 | |
1381 void fftw_cleanup(void); | |
1382 | |
1383 After calling `fftw_cleanup', all existing plans become undefined, | |
1384 and you should not attempt to execute them nor to destroy them. You can | |
1385 however create and execute/destroy new plans, in which case FFTW starts | |
1386 accumulating wisdom information again. | |
1387 | |
1388 `fftw_cleanup' does not deallocate your plans, however. To prevent | |
1389 memory leaks, you must still call `fftw_destroy_plan' before executing | |
1390 `fftw_cleanup'. | |
1391 | |
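The calls described so far might be combined as in the following minimal sketch. The size 8 and the ramp input are arbitrary choices for illustration. The order of operations is the point: plan, initialize, execute, destroy the plan, free the arrays, and only then call `fftw_cleanup'.

```c
#include <stdio.h>
#include <stdlib.h>
#include <fftw3.h>

int main(void)
{
    const int n = 8;                        /* illustrative size (assumption) */
    int status = EXIT_FAILURE;

    fftw_complex *in = fftw_alloc_complex(n);
    fftw_complex *out = fftw_alloc_complex(n);
    if (in != NULL && out != NULL) {
        /* Plan first: FFTW_MEASURE may overwrite both arrays. */
        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);
        if (p != NULL) {
            /* Initialize the input only after the plan exists. */
            for (int i = 0; i < n; ++i) {
                in[i][0] = (double)i;       /* real part */
                in[i][1] = 0.0;             /* imaginary part */
            }
            fftw_execute(p);                /* may be repeated any number of times */
            printf("out[0] = %g + %gi\n", out[0][0], out[0][1]);

            fftw_destroy_plan(p);           /* must precede fftw_cleanup */
            status = EXIT_SUCCESS;
        }
    }
    if (in != NULL) fftw_free(in);
    if (out != NULL) fftw_free(out);

    fftw_cleanup();                         /* drops accumulated wisdom as well */
    return status;
}
```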
1392 Occasionally, it may be useful to know FFTW's internal "cost" metric | |
1393 that it uses to compare plans to one another; this cost is proportional | |
1394 to an execution time of the plan, in undocumented units, if the plan | |
1395 was created with the `FFTW_MEASURE' or other timing-based options, or | |
1396 alternatively is a heuristic cost function for `FFTW_ESTIMATE' plans. | |
1397 (The cost values of measured and estimated plans are not comparable, | |
1398 being in different units. Also, costs from different FFTW versions or | |
1399 the same version compiled differently may not be in the same units. | |
1400 Plans created from wisdom have a cost of 0 since no timing measurement | |
1401 is performed for them. Finally, certain problems for which only one | |
1402 top-level algorithm was possible may have required no measurements of | |
1403 the cost of the whole plan, in which case `fftw_cost' will also return | |
1404 0.) The cost metric for a given plan is returned by: | |
1405 | |
1406 double fftw_cost(const fftw_plan plan); | |
1407 | |
1408 The following two routines are provided purely for academic purposes | |
1409 (that is, for entertainment). | |
1410 | |
1411 void fftw_flops(const fftw_plan plan, | |
1412 double *add, double *mul, double *fma); | |
1413 | |
1414 Given a `plan', set `add', `mul', and `fma' to an exact count of the | |
1415 number of floating-point additions, multiplications, and fused | |
1416 multiply-add operations involved in the plan's execution. The total | |
1417 number of floating-point operations (flops) is `add + mul + 2*fma', or | |
1418 `add + mul + fma' if the hardware supports fused multiply-add | |
1419 instructions (although the number of FMA operations is only approximate | |
1420 because of compiler voodoo). (The number of operations should be an | |
1421 integer, but we use `double' to avoid overflowing `int' for large | |
1422 transforms; the arguments are of type `double' even for single and | |
1423 long-double precision versions of FFTW.) | |
1424 | |
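The two totals quoted above differ only in how a fused multiply-add is counted. A small helper, written here as a sketch with the hypothetical name `total_flops', makes the arithmetic explicit; its three counts are meant to be the values that `fftw_flops' stores through its `add', `mul', and `fma' arguments.

```c
#include <assert.h>
#include <stdbool.h>

/* Combine the three counts reported by fftw_flops into one total.
   An FMA counts as one operation on hardware with fused multiply-add
   instructions and as two (a multiply plus an add) otherwise. */
double total_flops(double add, double mul, double fma, bool hardware_fma)
{
    assert(add >= 0.0 && mul >= 0.0 && fma >= 0.0);
    return add + mul + (hardware_fma ? 1.0 : 2.0) * fma;
}
```

With `add = 10', `mul = 4', and `fma = 3', the total is 20 without hardware FMA and 17 with it.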
1425 void fftw_fprint_plan(const fftw_plan plan, FILE *output_file); | |
1426 void fftw_print_plan(const fftw_plan plan); | |
1427 char *fftw_sprint_plan(const fftw_plan plan); | |
1428 | |
1429 This outputs a "nerd-readable" representation of the `plan' to the | |
1430 given file, to `stdout', or to a newly allocated NUL-terminated string | |
1431 (which the caller is responsible for deallocating with `free'), | |
1432 respectively. | |
1433 | |
1434 | |
1435 File: fftw3.info, Node: Basic Interface, Next: Advanced Interface, Prev: Using Plans, Up: FFTW Reference | |
1436 | |
1437 4.3 Basic Interface | |
1438 =================== | |
1439 | |
1440 Recall that the FFTW API is divided into three parts(1): the "basic | |
1441 interface" computes a single transform of contiguous data, the "advanced | |
1442 interface" computes transforms of multiple or strided arrays, and the | |
1443 "guru interface" supports the most general data layouts, | |
1444 multiplicities, and strides. This section describes the basic | |
1445 interface, which we expect to satisfy the needs of most users. | |
1446 | |
1447 * Menu: | |
1448 | |
1449 * Complex DFTs:: | |
1450 * Planner Flags:: | |
1451 * Real-data DFTs:: | |
1452 * Real-data DFT Array Format:: | |
1453 * Real-to-Real Transforms:: | |
1454 * Real-to-Real Transform Kinds:: | |
1455 | |
1456 ---------- Footnotes ---------- | |
1457 | |
1458 (1) Gallia est omnis divisa in partes tres (Julius Caesar). | |
1459 | |
1460 | |
1461 File: fftw3.info, Node: Complex DFTs, Next: Planner Flags, Prev: Basic Interface, Up: Basic Interface | |
1462 | |
1463 4.3.1 Complex DFTs | |
1464 ------------------ | |
1465 | |
1466 fftw_plan fftw_plan_dft_1d(int n0, | |
1467 fftw_complex *in, fftw_complex *out, | |
1468 int sign, unsigned flags); | |
1469 fftw_plan fftw_plan_dft_2d(int n0, int n1, | |
1470 fftw_complex *in, fftw_complex *out, | |
1471 int sign, unsigned flags); | |
1472 fftw_plan fftw_plan_dft_3d(int n0, int n1, int n2, | |
1473 fftw_complex *in, fftw_complex *out, | |
1474 int sign, unsigned flags); | |
1475 fftw_plan fftw_plan_dft(int rank, const int *n, | |
1476 fftw_complex *in, fftw_complex *out, | |
1477 int sign, unsigned flags); | |
1478 | |
1479 Plan a complex input/output discrete Fourier transform (DFT) in zero | |
1480 or more dimensions, returning an `fftw_plan' (*note Using Plans::). | |
1481 | |
1482 Once you have created a plan for a certain transform type and | |
1483 parameters, then creating another plan of the same type and parameters, | |
1484 but for different arrays, is fast and shares constant data with the | |
1485 first plan (if it still exists). | |
1486 | |
1487 The planner returns `NULL' if the plan cannot be created. In the | |
1488 standard FFTW distribution, the basic interface is guaranteed to return | |
1489 a non-`NULL' plan. A plan may be `NULL', however, if you are using a | |
1490 customized FFTW configuration supporting a restricted set of transforms. | |
1491 | |
1492 Arguments | |
1493 ......... | |
1494 | |
1495 * `rank' is the rank of the transform (it should be the size of the | |
1496 array `*n'), and can be any non-negative integer. (*Note Complex | |
1497 Multi-Dimensional DFTs::, for the definition of "rank".) The | |
1498 `_1d', `_2d', and `_3d' planners correspond to a `rank' of `1', | |
1499 `2', and `3', respectively. The rank may be zero, which is | |
1500 equivalent to a rank-1 transform of size 1, i.e. a copy of one | |
1501 number from input to output. | |
1502 | |
1503 * `n0', `n1', `n2', or `n[0..rank-1]' (as appropriate for each | |
1504 routine) specify the size of the transform dimensions. They can | |
1505 be any positive integer. | |
1506 | |
1507 - Multi-dimensional arrays are stored in row-major order with | |
1508 dimensions: `n0' x `n1'; or `n0' x `n1' x `n2'; or `n[0]' x | |
1509 `n[1]' x ... x `n[rank-1]'. *Note Multi-dimensional Array | |
1510 Format::. | |
1511 | |
1512 - FFTW is best at handling sizes of the form 2^a 3^b 5^c 7^d | |
1513 11^e 13^f, where e+f is either 0 or 1, and the other exponents | |
1514 are arbitrary. Other sizes are computed by means of a slow, | |
1515 general-purpose algorithm (which nevertheless retains O(n log | |
1516 n) performance even for prime sizes). It is possible to | |
1517 customize FFTW for different array sizes; see *note | |
1518 Installation and Customization::. Transforms whose sizes are | |
1519 powers of 2 are especially fast. | |
1520 | |
1521 * `in' and `out' point to the input and output arrays of the | |
1522 transform, which may be the same (yielding an in-place transform). These | |
1523 arrays are overwritten during planning, unless `FFTW_ESTIMATE' is | |
1524 used in the flags. (The arrays need not be initialized, but they | |
1525 must be allocated.) | |
1526 | |
1527 If `in == out', the transform is "in-place" and the input array is | |
1528 overwritten. If `in != out', the two arrays must not overlap (but | |
1529 FFTW does not check for this condition). | |
1530 | |
1531 * `sign' is the sign of the exponent in the formula that defines the | |
1532 Fourier transform. It can be -1 (= `FFTW_FORWARD') or +1 (= | |
1533 `FFTW_BACKWARD'). | |
1534 | |
1535 * `flags' is a bitwise OR (`|') of zero or more planner flags, as | |
1536 defined in *note Planner Flags::. | |
1537 | |
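The size rule in the argument list above (2^a 3^b 5^c 7^d 11^e 13^f with e+f equal to 0 or 1) is easy to test before choosing a transform length. The helper below is a sketch. `strip_factor' and `is_fast_size' are hypothetical names, not part of FFTW, and the check reflects only the rule as stated here for the standard distribution.

```c
#include <assert.h>
#include <stdbool.h>

/* Divide every factor p out of *n and return how many were removed. */
int strip_factor(int *n, int p)
{
    int count = 0;
    while (*n % p == 0) {
        *n /= p;
        ++count;
    }
    return count;
}

/* True if n = 2^a 3^b 5^c 7^d 11^e 13^f with e + f equal to 0 or 1. */
bool is_fast_size(int n)
{
    assert(n > 0);
    const int small[] = {2, 3, 5, 7};
    for (int i = 0; i < 4; ++i)
        strip_factor(&n, small[i]);         /* exponents a..d are unrestricted */
    int e = strip_factor(&n, 11);
    int f = strip_factor(&n, 13);
    return n == 1 && e + f <= 1;
}
```

For example, 1000 = 2^3 5^3 and 88 = 2^3 11 pass, while 121 = 11^2, 143 = 11 x 13, and the prime 17 do not. Sizes that fail are still transformed correctly, only by the slower general-purpose algorithm.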
1538 | |
1539 FFTW computes an unnormalized transform: computing a forward | |
1540 followed by a backward transform (or vice versa) will result in the | |
1541 original data multiplied by the size of the transform (the product of | |
1542 the dimensions). For more information, see *note What FFTW Really | |
1543 Computes::. | |
1544 | |
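The sign convention and the missing normalization can be seen in a direct evaluation of the DFT sum. The sketch below is not how FFTW computes anything. It is a naive size-4 transform, chosen because the factors exp(sign * 2*pi*i*j*k/4) are exactly 1, i, -1, or -i, so no math library is needed. `dft4' is a hypothetical name.

```c
#include <assert.h>

/* Naive size-4 DFT on arrays of (real, imaginary) pairs.
   sign is -1 for a forward transform and +1 for a backward one. */
void dft4(double in[4][2], double out[4][2], int sign)
{
    assert(sign == -1 || sign == +1);
    /* cos(2*pi*m/4) and sin(2*pi*m/4) for m = 0..3 */
    static const double cos4[4] = {1.0, 0.0, -1.0, 0.0};
    static const double sin4[4] = {0.0, 1.0, 0.0, -1.0};

    for (int k = 0; k < 4; ++k) {
        double re = 0.0, im = 0.0;
        for (int j = 0; j < 4; ++j) {
            int m = (j * k) % 4;
            double wr = cos4[m];
            double wi = sign * sin4[m];
            /* accumulate in[j] * (wr + i*wi) */
            re += in[j][0] * wr - in[j][1] * wi;
            im += in[j][0] * wi + in[j][1] * wr;
        }
        out[k][0] = re;
        out[k][1] = im;
    }
}
```

Calling `dft4' with sign -1 and then with sign +1 on the result returns the original data multiplied by 4, the size of the transform. Dividing by that size afterwards is left to the caller, just as with FFTW's own plans.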
1545 | |
1546 File: fftw3.info, Node: Planner Flags, Next: Real-data DFTs, Prev: Complex DFTs, Up: Basic Interface | |
1547 | |
1548 4.3.2 Planner Flags | |
1549 ------------------- | |
1550 | |
1551 All of the planner routines in FFTW accept an integer `flags' argument, | |
1552 which is a bitwise OR (`|') of zero or more of the flag constants | |
1553 defined below. These flags control the rigor (and time) of the | |
1554 planning process, and can also impose (or lift) restrictions on the | |
1555 type of transform algorithm that is employed. | |
1556 | |
1557 _Important:_ the planner overwrites the input array during planning | |
1558 unless a saved plan (*note Wisdom::) is available for that problem, so | |
1559 you should initialize your input data after creating the plan. The | |
1560 only exceptions to this are the `FFTW_ESTIMATE' and `FFTW_WISDOM_ONLY' | |
1561 flags, as mentioned below. | |
1562 | |
1563 In all cases, if wisdom is available for the given problem that | |
1564 was created with equal-or-greater planning rigor, then the more | |
1565 rigorous wisdom is used. For example, in `FFTW_ESTIMATE' mode any | |
1566 available wisdom is used, whereas in `FFTW_PATIENT' mode only wisdom | |
1567 created in patient or exhaustive mode can be used. *Note Words of | |
1568 Wisdom-Saving Plans::. | |
1569 | |
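The rule for when wisdom applies amounts to a comparison on an ordered scale of planning rigor. The model below is only an illustration. The `rigor_level' enum and `wisdom_is_usable' are hypothetical names, and the enum values are not the numeric values of the FFTW flag constants.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ordering of planning rigor, from least to most patient. */
typedef enum {
    RIGOR_ESTIMATE = 0,
    RIGOR_MEASURE,
    RIGOR_PATIENT,
    RIGOR_EXHAUSTIVE
} rigor_level;

/* Wisdom is usable when it was created with rigor equal to or greater
   than the rigor requested for the new plan. */
bool wisdom_is_usable(rigor_level created_with, rigor_level requested)
{
    assert(created_with <= RIGOR_EXHAUSTIVE);
    assert(requested <= RIGOR_EXHAUSTIVE);
    return requested <= created_with;
}
```

Under this model, wisdom created in measure mode serves a later estimate or measure request but not a patient one, and a request in estimate mode accepts any wisdom.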
1570 Planning-rigor flags | |
1571 .................... | |
1572 | |
1573 * `FFTW_ESTIMATE' specifies that, instead of actual measurements of | |
1574 different algorithms, a simple heuristic is used to pick a | |
1575 (probably sub-optimal) plan quickly. With this flag, the | |
1576 input/output arrays are not overwritten during planning. | |
1577 | |
1578 * `FFTW_MEASURE' tells FFTW to find an optimized plan by actually | |
1579 _computing_ several FFTs and measuring their execution time. | |
1580 Depending on your machine, this can take some time (often a few | |
1581 seconds). `FFTW_MEASURE' is the default planning option. | |
1582 | |
1583 * `FFTW_PATIENT' is like `FFTW_MEASURE', but considers a wider range | |
1584 of algorithms and often produces a "more optimal" plan (especially | |
1585 for large transforms), but at the expense of several times longer | |
1586 planning time (especially for large transforms). | |
1587 | |
1588 * `FFTW_EXHAUSTIVE' is like `FFTW_PATIENT', but considers an even | |
1589 wider range of algorithms, including many that we think are | |
1590 unlikely to be fast, to produce the most optimal plan but with a | |
1591 substantially increased planning time. | |
1592 | |
1593 * `FFTW_WISDOM_ONLY' is a special planning mode in which the plan is | |
1594 only created if wisdom is available for the given problem, and | |
1595 otherwise a `NULL' plan is returned. This can be combined with | |
1596 other flags, e.g. `FFTW_WISDOM_ONLY | FFTW_PATIENT' creates a plan | |
1597 only if wisdom is available that was created in `FFTW_PATIENT' or | |
1598 `FFTW_EXHAUSTIVE' mode. The `FFTW_WISDOM_ONLY' flag is intended | |
1599 for users who need to detect whether wisdom is available; for | |
1600 example, if wisdom is not available one may wish to allocate new | |
1601 arrays for planning so that user data is not overwritten. | |
1602 | |
1603 | |
1604 Algorithm-restriction flags | |
1605 ........................... | |
1606 | |
1607 * `FFTW_DESTROY_INPUT' specifies that an out-of-place transform is | |
1608 allowed to _overwrite its input_ array with arbitrary data; this | |
1609 can sometimes allow more efficient algorithms to be employed. | |
1610 | |
1611 * `FFTW_PRESERVE_INPUT' specifies that an out-of-place transform must | |
1612 _not change its input_ array. This is ordinarily the _default_, | |
1613 except for c2r and hc2r (i.e. complex-to-real) transforms for | |
1614 which `FFTW_DESTROY_INPUT' is the default. In the latter cases, | |
1615 passing `FFTW_PRESERVE_INPUT' will attempt to use algorithms that | |
1616 do not destroy the input, at the expense of worse performance; for | |
1617 multi-dimensional c2r transforms, however, no input-preserving | |
1618 algorithms are implemented and the planner will return `NULL' if | |
1619 one is requested. | |
1620 | |
1621 * `FFTW_UNALIGNED' specifies that the algorithm may not impose any | |
1622 unusual alignment requirements on the input/output arrays (i.e. no | |
1623 SIMD may be used). This flag is normally _not necessary_, since | |
1624 the planner automatically detects misaligned arrays. The only use | |
1625 for this flag is if you want to use the new-array execute | |
1626 interface to execute a given plan on a different array that may | |
1627 not be aligned like the original. (Using `fftw_malloc' makes this | |
1628 flag unnecessary even then. You can also use `fftw_alignment_of' | |
1629 to detect whether two arrays are equivalently aligned.) | |
1630 | |
1631 | |
1632 Limiting planning time | |
1633 ...................... | |
1634 | |
1635 extern void fftw_set_timelimit(double seconds); | |
1636 | |
1637 This function instructs FFTW to spend at most `seconds' seconds | |
1638 (approximately) in the planner. If `seconds == FFTW_NO_TIMELIMIT' (the | |
1639 default value, which is negative), then planning time is unbounded. | |
1640 Otherwise, FFTW plans with a progressively wider range of algorithms | |
1641 until the given time limit is reached or the given range of | |
1642 algorithms is explored, returning the best available plan. | |
1643 | |
1644 For example, specifying `FFTW_PATIENT' first plans in | |
1645 `FFTW_ESTIMATE' mode, then in `FFTW_MEASURE' mode, then finally (time | |
1646 permitting) in `FFTW_PATIENT'. If `FFTW_EXHAUSTIVE' is specified | |
1647 instead, the planner will further progress to `FFTW_EXHAUSTIVE' mode. | |
1648 | |
1649 Note that the `seconds' argument specifies only a rough limit; in | |
1650 practice, the planner may use somewhat more time if the time limit is | |
1651 reached when the planner is in the middle of an operation that cannot | |
1652 be interrupted. At the very least, the planner will complete planning | |
1653 in `FFTW_ESTIMATE' mode (which is thus equivalent to a time limit of 0). | |
1654 | |
1655 | |
1656 File: fftw3.info, Node: Real-data DFTs, Next: Real-data DFT Array Format, Prev: Planner Flags, Up: Basic Interface | |
1657 | |
1658 4.3.3 Real-data DFTs | |
1659 -------------------- | |
1660 | |
1661 fftw_plan fftw_plan_dft_r2c_1d(int n0, | |
1662 double *in, fftw_complex *out, | |
1663 unsigned flags); | |
1664 fftw_plan fftw_plan_dft_r2c_2d(int n0, int n1, | |
1665 double *in, fftw_complex *out, | |
1666 unsigned flags); | |
1667 fftw_plan fftw_plan_dft_r2c_3d(int n0, int n1, int n2, | |
1668 double *in, fftw_complex *out, | |
1669 unsigned flags); | |
1670 fftw_plan fftw_plan_dft_r2c(int rank, const int *n, | |
1671 double *in, fftw_complex *out, | |
1672 unsigned flags); | |
1673 | |
1674 Plan a real-input/complex-output discrete Fourier transform (DFT) in | |
1675 zero or more dimensions, returning an `fftw_plan' (*note Using Plans::). | |
1676 | |
1677 Once you have created a plan for a certain transform type and | |
1678 parameters, then creating another plan of the same type and parameters, | |
1679 but for different arrays, is fast and shares constant data with the | |
1680 first plan (if it still exists). | |
1681 | |
1682 The planner returns `NULL' if the plan cannot be created. A | |
1683 non-`NULL' plan is always returned by the basic interface unless you | |
1684 are using a customized FFTW configuration supporting a restricted set | |
1685 of transforms, or if you use the `FFTW_PRESERVE_INPUT' flag with a | |
1686 multi-dimensional out-of-place c2r transform (see below). | |
1687 | |
1688 Arguments | |
1689 ......... | |
1690 | |
1691 * `rank' is the rank of the transform (it should be the size of the | |
1692 array `*n'), and can be any non-negative integer. (*Note Complex | |
1693 Multi-Dimensional DFTs::, for the definition of "rank".) The | |
1694 `_1d', `_2d', and `_3d' planners correspond to a `rank' of `1', | |
1695 `2', and `3', respectively. The rank may be zero, which is | |
1696 equivalent to a rank-1 transform of size 1, i.e. a copy of one | |
1697 real number (with zero imaginary part) from input to output. | |
1698 | |
1699 * `n0', `n1', `n2', or `n[0..rank-1]' (as appropriate for each | |
1700 routine) specify the size of the transform dimensions. They can | |
1701 be any positive integer. This is different in general from the | |
1702 _physical_ array dimensions, which are described in *note | |
1703 Real-data DFT Array Format::. | |
1704 | |
1705 - FFTW is best at handling sizes of the form 2^a 3^b 5^c 7^d | |
1706 11^e 13^f, where e+f is either 0 or 1, and the other exponents | |
1707 are arbitrary. Other sizes are computed by means of a slow, | |
1708 general-purpose algorithm (which nevertheless retains O(n log | |
1709 n) performance even for prime sizes). (It is possible to | |
1710 customize FFTW for different array sizes; see *note | |
1711 Installation and Customization::.) Transforms whose sizes | |
1712 are powers of 2 are especially fast, and it is generally | |
1713 beneficial for the _last_ dimension of an r2c/c2r transform | |
1714 to be _even_. | |
1715 | |
1716 * `in' and `out' point to the input and output arrays of the | |
1717 transform, which may be the same (yielding an in-place transform). These | |
1718 arrays are overwritten during planning, unless `FFTW_ESTIMATE' is | |
1719 used in the flags. (The arrays need not be initialized, but they | |
1720 must be allocated.) For an in-place transform, it is important to | |
1721 remember that the real array will require padding, described in | |
1722 *note Real-data DFT Array Format::. | |
1723 | |
1724 * `flags' is a bitwise OR (`|') of zero or more planner flags, as | |
1725 defined in *note Planner Flags::. | |
1726 | |
1727 | |
1728 The inverse transforms, taking complex input (storing the | |
1729 non-redundant half of a logically Hermitian array) to real output, are | |
1730 given by: | |
1731 | |
1732 fftw_plan fftw_plan_dft_c2r_1d(int n0, | |
1733 fftw_complex *in, double *out, | |
1734 unsigned flags); | |
1735 fftw_plan fftw_plan_dft_c2r_2d(int n0, int n1, | |
1736 fftw_complex *in, double *out, | |
1737 unsigned flags); | |
1738 fftw_plan fftw_plan_dft_c2r_3d(int n0, int n1, int n2, | |
1739 fftw_complex *in, double *out, | |
1740 unsigned flags); | |
1741 fftw_plan fftw_plan_dft_c2r(int rank, const int *n, | |
1742 fftw_complex *in, double *out, | |
1743 unsigned flags); | |
1744 | |
1745 The arguments are the same as for the r2c transforms, except that the | |
1746 input and output data formats are reversed. | |
1747 | |
1748 FFTW computes an unnormalized transform: computing an r2c followed | |
1749 by a c2r transform (or vice versa) will result in the original data | |
1750 multiplied by the size of the transform (the product of the logical | |
1751 dimensions). An r2c transform produces the same output as a | |
1752 `FFTW_FORWARD' complex DFT of the same input, and a c2r transform is | |
1753 correspondingly equivalent to `FFTW_BACKWARD'. For more information, | |
1754 see *note What FFTW Really Computes::. | |
1755 | |
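A one-dimensional, out-of-place round trip might look like the following sketch. The size 16 and the test signal are arbitrary choices. `FFTW_ESTIMATE' is used so that planning does not overwrite the arrays, and the final division by `n' undoes the scaling described above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <fftw3.h>

int main(void)
{
    const int n = 16;                       /* logical size (assumption) */
    const int nc = n / 2 + 1;               /* complex outputs of the r2c transform */
    int status = EXIT_FAILURE;

    double *x = fftw_alloc_real(n);
    fftw_complex *X = fftw_alloc_complex(nc);
    if (x != NULL && X != NULL) {
        fftw_plan fwd = fftw_plan_dft_r2c_1d(n, x, X, FFTW_ESTIMATE);
        fftw_plan bwd = fftw_plan_dft_c2r_1d(n, X, x, FFTW_ESTIMATE);
        if (fwd != NULL && bwd != NULL) {
            for (int i = 0; i < n; ++i)
                x[i] = (double)(i % 4);     /* arbitrary test signal */

            fftw_execute(fwd);              /* x -> X */
            fftw_execute(bwd);              /* X -> x; c2r may destroy X */

            for (int i = 0; i < n; ++i)
                x[i] /= n;                  /* undo the factor of n */
            printf("x[1] after round trip: %g\n", x[1]);
            status = EXIT_SUCCESS;
        }
        if (fwd != NULL) fftw_destroy_plan(fwd);
        if (bwd != NULL) fftw_destroy_plan(bwd);
    }
    if (x != NULL) fftw_free(x);
    if (X != NULL) fftw_free(X);
    return status;
}
```

Since c2r transforms default to `FFTW_DESTROY_INPUT', the contents of `X' should not be relied upon after the backward transform has run.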
1756 | |
1757 File: fftw3.info, Node: Real-data DFT Array Format, Next: Real-to-Real Transforms, Prev: Real-data DFTs, Up: Basic Interface | |
1758 | |
1759 4.3.4 Real-data DFT Array Format | |
1760 -------------------------------- | |
1761 | |
1762 The output of a DFT of real data (r2c) contains symmetries that, in | |
1763 principle, make half of the outputs redundant (*note What FFTW Really | |
1764 Computes::). (Similarly for the input of an inverse c2r transform.) In | |
1765 practice, it is not possible to entirely realize these savings in an | |
1766 efficient and understandable format that generalizes to | |
1767 multi-dimensional transforms. Instead, the output of the r2c | |
1768 transforms is _slightly_ over half of the output of the corresponding | |
1769 complex transform. We do not "pack" the data in any way, but store it | |
1770 as an ordinary array of `fftw_complex' values. In fact, this data is | |
1771 simply a subsection of what would be the array in the corresponding | |
1772 complex transform. | |
1773 | |
1774 Specifically, for a real transform of d (= `rank') dimensions n[0] x | |
1775 n[1] x n[2] x ... x n[d-1] , the complex data is an n[0] x n[1] x n[2] | |
1776 x ... x (n[d-1]/2 + 1) array of `fftw_complex' values in row-major | |
1777 order (with the division rounded down). That is, we only store the | |
1778 _lower_ half (non-negative frequencies), plus one element, of the last | |
1779 dimension of the data from the ordinary complex transform. (We could | |
1780 have instead taken half of any other dimension, but implementation | |
1781 turns out to be simpler if the last, contiguous, dimension is used.) | |
1782 | |
1783 For an out-of-place transform, the real data is simply an array with | |
1784 physical dimensions n[0] x n[1] x n[2] x ... x n[d-1] in row-major | |
1785 order. | |
1786 | |
1787 For an in-place transform, some complications arise since the | |
1788 complex data is slightly larger than the real data. In this case, the | |
1789 final dimension of the real data must be _padded_ with extra values to | |
1790 accommodate the size of the complex data--two extra if the last | |
1791 dimension is even and one if it is odd. That is, the last dimension of | |
1792 the real data must physically contain 2 * (n[d-1]/2+1) `double' values | |
1793 (exactly enough to hold the complex data). This physical array size | |
1794 does not, however, change the _logical_ array size--only n[d-1] values | |
1795 are actually stored in the last dimension, and n[d-1] is the last | |
1796 dimension passed to the planner. | |
1797 | |
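The array sizes described in this section follow directly from the logical dimensions. The two helpers below are a sketch with hypothetical names. They return the number of `fftw_complex' values in the r2c output and the number of `double' values the real array must hold when the transform is in-place.

```c
#include <assert.h>
#include <stddef.h>

/* Number of fftw_complex values in the r2c output:
   n[0] x n[1] x ... x (n[rank-1]/2 + 1). */
size_t r2c_complex_count(int rank, const int *n)
{
    assert(rank >= 1 && n != NULL);
    size_t count = 1;
    for (int i = 0; i < rank - 1; ++i) {
        assert(n[i] > 0);
        count *= (size_t)n[i];
    }
    assert(n[rank - 1] > 0);
    /* integer division rounds down, as the text requires */
    return count * (size_t)(n[rank - 1] / 2 + 1);
}

/* Number of doubles in the padded real array of an in-place transform:
   the last dimension physically holds 2 * (n[rank-1]/2 + 1) values. */
size_t r2c_inplace_real_count(int rank, const int *n)
{
    return 2 * r2c_complex_count(rank, n);
}
```

For a single dimension of 8, the complex array has 5 elements and the padded real array 10 (two extra values). For 5, the counts are 3 and 6 (one extra value). For a 4 x 6 transform they are 16 and 32.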
1798 | |
1799 File: fftw3.info, Node: Real-to-Real Transforms, Next: Real-to-Real Transform Kinds, Prev: Real-data DFT Array Format, Up: Basic Interface | |
1800 | |
1801 4.3.5 Real-to-Real Transforms | |
1802 ----------------------------- | |
1803 | |
1804 fftw_plan fftw_plan_r2r_1d(int n, double *in, double *out, | |
1805 fftw_r2r_kind kind, unsigned flags); | |
1806 fftw_plan fftw_plan_r2r_2d(int n0, int n1, double *in, double *out, | |
1807 fftw_r2r_kind kind0, fftw_r2r_kind kind1, | |
1808 unsigned flags); | |
1809 fftw_plan fftw_plan_r2r_3d(int n0, int n1, int n2, | |
1810 double *in, double *out, | |
1811 fftw_r2r_kind kind0, | |
1812 fftw_r2r_kind kind1, | |
1813 fftw_r2r_kind kind2, | |
1814 unsigned flags); | |
1815 fftw_plan fftw_plan_r2r(int rank, const int *n, double *in, double *out, | |
1816 const fftw_r2r_kind *kind, unsigned flags); | |
1817 | |
1818 Plan a real input/output (r2r) transform of various kinds in zero or | |
1819 more dimensions, returning an `fftw_plan' (*note Using Plans::). | |
1820 | |
1821 Once you have created a plan for a certain transform type and | |
1822 parameters, then creating another plan of the same type and parameters, | |
1823 but for different arrays, is fast and shares constant data with the | |
1824 first plan (if it still exists). | |
1825 | |
1826 The planner returns `NULL' if the plan cannot be created. A | |
1827 non-`NULL' plan is always returned by the basic interface unless you | |
1828 are using a customized FFTW configuration supporting a restricted set | |
1829 of transforms, or for size-1 `FFTW_REDFT00' kinds (which are not | |
1830 defined). | |
1831 | |
1832 Arguments | |
1833 ......... | |
1834 | |
1835 * `rank' is the dimensionality of the transform (it should be the | |
1836 size of the arrays `*n' and `*kind'), and can be any non-negative | |
1837 integer. The `_1d', `_2d', and `_3d' planners correspond to a | |
1838 `rank' of `1', `2', and `3', respectively. A `rank' of zero is | |
1839 equivalent to a copy of one number from input to output. | |
1840 | |
1841 * `n', or `n0'/`n1'/`n2', or `n[rank]', respectively, gives the | |
1842 (physical) size of the transform dimensions. They can be any | |
1843 positive integer. | |
1844 | |
1845 - Multi-dimensional arrays are stored in row-major order with | |
1846 dimensions: `n0' x `n1'; or `n0' x `n1' x `n2'; or `n[0]' x | |
1847 `n[1]' x ... x `n[rank-1]'. *Note Multi-dimensional Array | |
1848 Format::. | |
1849 | |
1850 - FFTW is generally best at handling sizes of the form 2^a 3^b | |
1851 5^c 7^d 11^e 13^f, where e+f is either 0 or 1, and the other | |
1852 exponents are arbitrary. Other sizes are computed by means | |
1853 of a slow, general-purpose algorithm (which nevertheless | |
1854 retains O(n log n) performance even for prime sizes). (It | |
1855 is possible to customize FFTW for different array sizes; see | |
1856 *note Installation and Customization::.) Transforms whose | |
1857 sizes are powers of 2 are especially fast. | |
1858 | |
1859 - For a `REDFT00' or `RODFT00' transform kind in a dimension of | |
1860 size n, it is n-1 or n+1, respectively, that should be | |
1861 factorizable in the above form. | |
1862 | |
1863 * `in' and `out' point to the input and output arrays of the | |
1864 transform, which may be the same (yielding an in-place transform). These | |
1865 arrays are overwritten during planning, unless `FFTW_ESTIMATE' is | |
1866 used in the flags. (The arrays need not be initialized, but they | |
1867 must be allocated.) | |
1868 | |
1869 * `kind', or `kind0'/`kind1'/`kind2', or `kind[rank]', is the kind | |
1870 of r2r transform used for the corresponding dimension. The valid | |
1871 kind constants are described in *note Real-to-Real Transform | |
1872 Kinds::. In a multi-dimensional transform, what is computed is | |
1873 the separable product formed by taking each transform kind along | |
1874 the corresponding dimension, one dimension after another. | |
1875 | |
1876 * `flags' is a bitwise OR (`|') of zero or more planner flags, as | |
1877 defined in *note Planner Flags::. | |
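The size recommendation in the argument list above can be checked with a short factorization loop. This is a sketch, not FFTW code: the function names are hypothetical, and the check only reflects the default 2^a 3^b 5^c 7^d 11^e 13^f rule quoted above, not any customized build.

```c
#include <assert.h>

/* Returns 1 if n = 2^a 3^b 5^c 7^d 11^e 13^f with e + f <= 1, else 0. */
int is_fast_size(int n)
{
    static const int small[] = {2, 3, 5, 7};
    int i, big = 0;

    assert(n > 0);
    for (i = 0; i < 4; ++i)
        while (n % small[i] == 0)
            n /= small[i];
    while (n % 11 == 0) { n /= 11; ++big; }   /* counts e */
    while (n % 13 == 0) { n /= 13; ++big; }   /* counts f */
    return n == 1 && big <= 1;
}

/* For REDFT00 it is n-1 that should factor well. */
int is_fast_redft00_size(int n)
{
    return n > 1 && is_fast_size(n - 1);
}

/* For RODFT00 it is n+1 that should factor well. */
int is_fast_rodft00_size(int n)
{
    return n > 0 && is_fast_size(n + 1);
}
```

Sizes that fail this test are still valid; they are just expected to fall back to the slower general-purpose algorithm.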
1878 | |
1879 | |
1880 | |
1881 File: fftw3.info, Node: Real-to-Real Transform Kinds, Prev: Real-to-Real Transforms, Up: Basic Interface | |
1882 | |
1883 4.3.6 Real-to-Real Transform Kinds | |
1884 ---------------------------------- | |
1885 | |
1886 FFTW currently supports 11 different r2r transform kinds, specified by | |
1887 one of the constants below. For the precise definitions of these | |
1888 transforms, see *note What FFTW Really Computes::. For a more | |
1889 colloquial introduction to these transform kinds, see *note More DFTs | |
1890 of Real Data::. | |
1891 | |
1892 For a dimension of size `n', there is a corresponding "logical" | |
1893 dimension `N' that determines the normalization (and the optimal | |
1894 factorization); the formula for `N' is given for each kind below. | |
1895 Also, with each transform kind is listed its corresponding inverse | |
1896 transform. FFTW computes unnormalized transforms: a transform followed | |
1897 by its inverse will result in the original data multiplied by `N' (or | |
1898 the product of the `N''s for each dimension, in multi-dimensions). | |
1899 | |
1900 * `FFTW_R2HC' computes a real-input DFT with output in "halfcomplex" | |
1901 format, i.e. real and imaginary parts for a transform of size `n' | |
1902 stored as: r0, r1, r2, ..., r(n/2), i((n+1)/2-1), ..., i2, i1 (Logical | |
1903 `N=n', inverse is `FFTW_HC2R'.) | |
1904 | |
1905 * `FFTW_HC2R' computes the reverse of `FFTW_R2HC', above. (Logical | |
1906 `N=n', inverse is `FFTW_R2HC'.) | |
1907 | |
1908 * `FFTW_DHT' computes a discrete Hartley transform. (Logical `N=n', | |
1909 inverse is `FFTW_DHT'.) | |
1910 | |
1911 * `FFTW_REDFT00' computes an REDFT00 transform, i.e. a DCT-I. | |
1912 (Logical `N=2*(n-1)', inverse is `FFTW_REDFT00'.) | |
1913 | |
1914 * `FFTW_REDFT10' computes an REDFT10 transform, i.e. a DCT-II | |
1915 (sometimes called "the" DCT). (Logical `N=2*n', inverse is | |
1916 `FFTW_REDFT01'.) | |
1917 | |
1918 * `FFTW_REDFT01' computes an REDFT01 transform, i.e. a DCT-III | |
1919 (sometimes called "the" IDCT, being the inverse of DCT-II). | |
1920 (Logical `N=2*n', inverse is `FFTW_REDFT10'.) | |
1921 | |
1922 * `FFTW_REDFT11' computes an REDFT11 transform, i.e. a DCT-IV. | |
1923 (Logical `N=2*n', inverse is `FFTW_REDFT11'.) | |
1924 | |
1925 * `FFTW_RODFT00' computes an RODFT00 transform, i.e. a DST-I. | |
1926 (Logical `N=2*(n+1)', inverse is `FFTW_RODFT00'.) | |
1927 | |
1928 * `FFTW_RODFT10' computes an RODFT10 transform, i.e. a DST-II. | |
1929 (Logical `N=2*n', inverse is `FFTW_RODFT01'.) | |
1930 | |
1931 * `FFTW_RODFT01' computes an RODFT01 transform, i.e. a DST-III. | |
1932 (Logical `N=2*n', inverse is `FFTW_RODFT10'.) | |
1933 | |
1934 * `FFTW_RODFT11' computes an RODFT11 transform, i.e. a DST-IV. | |
1935 (Logical `N=2*n', inverse is `FFTW_RODFT11'.) | |
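The logical sizes and the halfcomplex layout listed above reduce to a little integer arithmetic. The sketch below uses a local stand-in enum (deliberately not FFTW's `fftw_r2r_kind') and hypothetical helper names, so it compiles without the FFTW headers.

```c
#include <assert.h>

/* Local stand-in for the eleven kinds; NOT the fftw_r2r_kind type. */
typedef enum {
    KIND_R2HC, KIND_HC2R, KIND_DHT,
    KIND_REDFT00, KIND_REDFT10, KIND_REDFT01, KIND_REDFT11,
    KIND_RODFT00, KIND_RODFT10, KIND_RODFT01, KIND_RODFT11
} r2r_kind_sketch;

/* Logical size N for one dimension of physical size n. */
int logical_size(r2r_kind_sketch kind, int n)
{
    assert(n > 0);
    switch (kind) {
    case KIND_R2HC: case KIND_HC2R: case KIND_DHT:
        return n;
    case KIND_REDFT00:
        assert(n > 1);                  /* size-1 REDFT00 is not defined */
        return 2 * (n - 1);
    case KIND_RODFT00:
        return 2 * (n + 1);
    default:                            /* the remaining DCT/DST kinds */
        return 2 * n;
    }
}

/* Factor by which a transform followed by its inverse scales the data. */
long normalization(int rank, const r2r_kind_sketch *kind, const int *n)
{
    long scale = 1;
    int d;
    for (d = 0; d < rank; ++d)
        scale *= logical_size(kind[d], n[d]);
    return scale;
}

/* Position of the k-th real part in a halfcomplex array of size n. */
int hc_real_index(int k, int n)
{
    assert(0 <= k && k <= n / 2);
    return k;
}

/* Position of the k-th imaginary part; only 0 < k < (n+1)/2 are stored. */
int hc_imag_index(int k, int n)
{
    assert(0 < k && k < (n + 1) / 2);
    return n - k;
}
```

Dividing the round-trip result by the value of `normalization' would recover the original data.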
1936 | |
1937 | |
1938 | |
1939 File: fftw3.info, Node: Advanced Interface, Next: Guru Interface, Prev: Basic Interface, Up: FFTW Reference | |
1940 | |
1941 4.4 Advanced Interface | |
1942 ====================== | |
1943 | |
1944 FFTW's "advanced" interface supplements the basic interface with four | |
1945 new planner routines, providing a new level of flexibility: you can plan | |
1946 a transform of multiple arrays simultaneously, operate on non-contiguous | |
1947 (strided) data, and transform a subset of a larger multi-dimensional | |
1948 array. Other than these additional features, the planner operates in | |
1949 the same fashion as in the basic interface, and the resulting | |
1950 `fftw_plan' is used in the same way (*note Using Plans::). | |
1951 | |
1952 * Menu: | |
1953 | |
1954 * Advanced Complex DFTs:: | |
1955 * Advanced Real-data DFTs:: | |
1956 * Advanced Real-to-real Transforms:: | |
1957 | |
1958 | |
1959 File: fftw3.info, Node: Advanced Complex DFTs, Next: Advanced Real-data DFTs, Prev: Advanced Interface, Up: Advanced Interface | |
1960 | |
1961 4.4.1 Advanced Complex DFTs | |
1962 --------------------------- | |
1963 | |
1964 fftw_plan fftw_plan_many_dft(int rank, const int *n, int howmany, | |
1965 fftw_complex *in, const int *inembed, | |
1966 int istride, int idist, | |
1967 fftw_complex *out, const int *onembed, | |
1968 int ostride, int odist, | |
1969 int sign, unsigned flags); | |
1970 | |
1971 This routine plans multiple multidimensional complex DFTs, and it | |
1972 extends the `fftw_plan_dft' routine (*note Complex DFTs::) to compute | |
1973 `howmany' transforms, each having rank `rank' and size `n'. In | |
1974 addition, the transform data need not be contiguous, but it may be laid | |
1975 out in memory with an arbitrary stride. To account for these | |
1976 possibilities, `fftw_plan_many_dft' adds the new parameters `howmany', | |
1977 {`i',`o'}`nembed', {`i',`o'}`stride', and {`i',`o'}`dist'. The FFTW | |
1978 basic interface (*note Complex DFTs::) provides routines specialized | |
1979 for ranks 1, 2, and 3, but the advanced interface handles only the | |
1980 general-rank case. | |
1981 | |
1982 `howmany' is the number of transforms to compute. The resulting | |
1983 plan computes `howmany' transforms, where the input of the `k'-th | |
1984 transform is at location `in+k*idist' (in C pointer arithmetic), and | |
1985 its output is at location `out+k*odist'. Plans obtained in this way | |
1986 can often be faster than calling FFTW multiple times for the individual | |
1987 transforms. The basic `fftw_plan_dft' interface corresponds to | |
1988 `howmany=1' (in which case the `dist' parameters are ignored). | |
1989 | |
1990 Each of the `howmany' transforms has rank `rank' and size `n', as in | |
1991 the basic interface. In addition, the advanced interface allows the | |
1992 input and output arrays of each transform to be row-major subarrays of | |
1993 larger rank-`rank' arrays, described by `inembed' and `onembed' | |
1994 parameters, respectively. {`i',`o'}`nembed' must be arrays of length | |
1995 `rank', and `n' should be elementwise less than or equal to | |
1996 {`i',`o'}`nembed'. Passing `NULL' for an `nembed' parameter is | |
1997 equivalent to passing `n' (i.e. same physical and logical dimensions, | |
1998 as in the basic interface). | |
1999 | |
2000 The `stride' parameters indicate that the `j'-th element of the | |
2001 input or output arrays is located at `j*istride' or `j*ostride', | |
2002 respectively. (For a multi-dimensional array, `j' is the ordinary | |
2003 row-major index.) When combined with the `k'-th transform in a | |
2004 `howmany' loop, from above, this means that the (`j',`k')-th element is | |
2005 at `j*stride+k*dist'. (The basic `fftw_plan_dft' interface corresponds | |
2006 to a stride of 1.) | |
2007 | |
2008 For in-place transforms, the input and output `stride' and `dist' | |
2009 parameters should be the same; otherwise, the planner may return `NULL'. | |
2010 | |
2011 Arrays `n', `inembed', and `onembed' are not used after this | |
2012 function returns. You can safely free or reuse them. | |
2013 | |
2014 *Examples*: One transform of one 5 by 6 array contiguous in memory: | |
2015 int rank = 2; | |
2016 int n[] = {5, 6}; | |
2017 int howmany = 1; | |
2018 int idist = 0, odist = 0; /* unused because howmany = 1 */ | |
2019 int istride = 1, ostride = 1; /* array is contiguous in memory */ | |
2020 int *inembed = n, *onembed = n; | |
2021 | |
2022 Transform of three 5 by 6 arrays, each contiguous in memory, stored | |
2023 in memory one after another: | |
2024 int rank = 2; | |
2025 int n[] = {5, 6}; | |
2026 int howmany = 3; | |
2027 int idist = n[0]*n[1], odist = n[0]*n[1]; /* = 30, the distance in memory | |
2028 between the first element | |
2029 of the first array and the | |
2030 first element of the second array */ | |
2031 int istride = 1, ostride = 1; /* array is contiguous in memory */ | |
2032 int *inembed = n, *onembed = n; | |
2033 | |
2034 Transform each column of a 2d array with 10 rows and 3 columns: | |
2035 int rank = 1; /* not 2: we are computing 1d transforms */ | |
2036 int n[] = {10}; /* 1d transforms of length 10 */ | |
2037 int howmany = 3; | |
2038 int idist = 1, odist = 1; | |
2039 int istride = 3, ostride = 3; /* distance between two elements in | |
2040 the same column */ | |
2041 int *inembed = n, *onembed = n; | |
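The addressing rule behind these examples (element `j' of transform `k' at `j*stride+k*dist', with `j' the row-major index in the embedding array) can be sketched as a single helper. The name `many_offset' is hypothetical, and treating `nembed' as the row pitch is an interpretation of the description above rather than code taken from FFTW.

```c
#include <assert.h>
#include <stddef.h>

/* Offset, in array elements, of the element with multi-index idx[0..rank-1]
   belonging to the k-th transform of a fftw_plan_many_dft-style layout. */
ptrdiff_t many_offset(int rank, const int *idx, const int *nembed,
                      int stride, int dist, int k)
{
    ptrdiff_t j = 0;                    /* row-major index within nembed */
    int d;

    for (d = 0; d < rank; ++d) {
        assert(0 <= idx[d] && idx[d] < nembed[d]);
        j = j * nembed[d] + idx[d];
    }
    return j * stride + (ptrdiff_t)k * dist;
}
```

With the column example above (`rank' 1, stride 3, dist 1), row 2 of column 1 lands at offset 7, which is where a row-major 10 x 3 array keeps that element.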
2042 | |
2043 | |
2044 File: fftw3.info, Node: Advanced Real-data DFTs, Next: Advanced Real-to-real Transforms, Prev: Advanced Complex DFTs, Up: Advanced Interface | |
2045 | |
2046 4.4.2 Advanced Real-data DFTs | |
2047 ----------------------------- | |
2048 | |
2049 fftw_plan fftw_plan_many_dft_r2c(int rank, const int *n, int howmany, | |
2050 double *in, const int *inembed, | |
2051 int istride, int idist, | |
2052 fftw_complex *out, const int *onembed, | |
2053 int ostride, int odist, | |
2054 unsigned flags); | |
2055 fftw_plan fftw_plan_many_dft_c2r(int rank, const int *n, int howmany, | |
2056 fftw_complex *in, const int *inembed, | |
2057 int istride, int idist, | |
2058 double *out, const int *onembed, | |
2059 int ostride, int odist, | |
2060 unsigned flags); | |
2061 | |
2062 Like `fftw_plan_many_dft', these two functions add `howmany', | |
2063 `nembed', `stride', and `dist' parameters to the `fftw_plan_dft_r2c' | |
2064 and `fftw_plan_dft_c2r' functions, but otherwise behave the same as the | |
2065 basic interface. | |
2066 | |
2067 The interpretation of `howmany', `stride', and `dist' is the same | |
2068 as for `fftw_plan_many_dft', above. Note that the `stride' and `dist' | |
2069 for the real array are in units of `double', and for the complex array | |
2070 are in units of `fftw_complex'. | |
2071 | |
2072 If an `nembed' parameter is `NULL', it is interpreted as what it | |
2073 would be in the basic interface, as described in *note Real-data DFT | |
2074 Array Format::. That is, for the complex array the size is assumed to | |
2075 be the same as `n', but with the last dimension cut roughly in half. | |
2076 For the real array, the size is assumed to be `n' if the transform is | |
2077 out-of-place, or `n' with the last dimension "padded" if the transform | |
2078 is in-place. | |
2079 | |
2080 If an `nembed' parameter is non-`NULL', it is interpreted as the | |
2081 physical size of the corresponding array, in row-major order, just as | |
2082 for `fftw_plan_many_dft'. In this case, each dimension of `nembed' | |
2083 should be `>=' what it would be in the basic interface (e.g. the halved | |
2084 or padded `n'). | |
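As a sketch of what a `NULL' `nembed' stands for, the helper below fills in the default physical sizes described above. The function name and the `in_place' argument are hypothetical; FFTW works these defaults out internally.

```c
#include <assert.h>

/* Fill real_embed[] and complex_embed[] (each of length rank) with the
   sizes that a NULL nembed is described as meaning for r2c/c2r plans. */
void default_r2c_nembed(int rank, const int *n, int in_place,
                        int *real_embed, int *complex_embed)
{
    int d;

    assert(rank >= 1);
    for (d = 0; d < rank; ++d) {
        real_embed[d] = n[d];
        complex_embed[d] = n[d];
    }
    /* Complex array: last dimension cut roughly in half. */
    complex_embed[rank - 1] = n[rank - 1] / 2 + 1;
    /* Real array: last dimension padded only when transforming in place. */
    if (in_place)
        real_embed[rank - 1] = 2 * (n[rank - 1] / 2 + 1);
}
```

Any explicit `nembed' passed to the planner should then be at least as large, dimension by dimension, as these values.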
2085 | |
2086 Arrays `n', `inembed', and `onembed' are not used after this | |
2087 function returns. You can safely free or reuse them. | |
2088 | |
2089 | |
2090 File: fftw3.info, Node: Advanced Real-to-real Transforms, Prev: Advanced Real-data DFTs, Up: Advanced Interface | |
2091 | |
2092 4.4.3 Advanced Real-to-real Transforms | |
2093 -------------------------------------- | |
2094 | |
2095 fftw_plan fftw_plan_many_r2r(int rank, const int *n, int howmany, | |
2096 double *in, const int *inembed, | |
2097 int istride, int idist, | |
2098 double *out, const int *onembed, | |
2099 int ostride, int odist, | |
2100 const fftw_r2r_kind *kind, unsigned flags); | |
2101 | |
2102 Like `fftw_plan_many_dft', this function adds `howmany', `nembed', | |
2103 `stride', and `dist' parameters to the `fftw_plan_r2r' function, but | |
2104 otherwise behaves the same as the basic interface. The interpretation | |
2105 of those additional parameters is the same as for | |
2106 `fftw_plan_many_dft'. (Of course, the `stride' and `dist' parameters | |
2107 are now in units of `double', not `fftw_complex'.) | |
2108 | |
2109 Arrays `n', `inembed', `onembed', and `kind' are not used after this | |
2110 function returns. You can safely free or reuse them. | |
2111 | |
2112 | |
2113 File: fftw3.info, Node: Guru Interface, Next: New-array Execute Functions, Prev: Advanced Interface, Up: FFTW Reference | |
2114 | |
2115 4.5 Guru Interface | |
2116 ================== | |
2117 | |
2118 The "guru" interface to FFTW is intended to expose as much as possible | |
2119 of the flexibility in the underlying FFTW architecture. It allows one | |
2120 to compute multi-dimensional "vectors" (loops) of multi-dimensional | |
2121 transforms, where each vector/transform dimension has an independent | |
2122 size and stride. One can also use more general complex-number formats, | |
2123 e.g. separate real and imaginary arrays. | |
2124 | |
2125 For those users who require the flexibility of the guru interface, | |
2126 it is important that they pay special attention to the documentation | |
2127 lest they shoot themselves in the foot. | |
2128 | |
2129 * Menu: | |
2130 | |
2131 * Interleaved and split arrays:: | |
2132 * Guru vector and transform sizes:: | |
2133 * Guru Complex DFTs:: | |
2134 * Guru Real-data DFTs:: | |
2135 * Guru Real-to-real Transforms:: | |
2136 * 64-bit Guru Interface:: | |
2137 | |
2138 | |
2139 File: fftw3.info, Node: Interleaved and split arrays, Next: Guru vector and transform sizes, Prev: Guru Interface, Up: Guru Interface | |
2140 | |
2141 4.5.1 Interleaved and split arrays | |
2142 ---------------------------------- | |
2143 | |
2144 The guru interface supports two representations of complex numbers, | |
2145 which we call the interleaved and the split format. | |
2146 | |
2147 The "interleaved" format is the same one used by the basic and | |
2148 advanced interfaces, and it is documented in *note Complex numbers::. | |
2149 In the interleaved format, you provide pointers to the real part of a | |
2150 complex number, with the imaginary part understood to be stored in the | |
2151 next memory location. | |
2152 | |
2153 The "split" format allows separate pointers to the real and | |
2154 imaginary parts of a complex array. | |
2155 | |
2156 Technically, the interleaved format is redundant, because you can | |
2157 always express an interleaved array in terms of a split array with | |
2158 appropriate pointers and strides. On the other hand, the interleaved | |
2159 format is simpler to use, and it is common in practice. Hence, FFTW | |
2160 supports it as a special case. | |
2161 | |
2162 | |
2163 File: fftw3.info, Node: Guru vector and transform sizes, Next: Guru Complex DFTs, Prev: Interleaved and split arrays, Up: Guru Interface | |
2164 | |
2165 4.5.2 Guru vector and transform sizes | |
2166 ------------------------------------- | |
2167 | |
2168 The guru interface introduces one basic new data structure, | |
2169 `fftw_iodim', that is used to specify sizes and strides for | |
2170 multi-dimensional transforms and vectors: | |
2171 | |
2172 typedef struct { | |
2173 int n; | |
2174 int is; | |
2175 int os; | |
2176 } fftw_iodim; | |
2177 | |
2178 Here, `n' is the size of the dimension, and `is' and `os' are the | |
2179 strides of that dimension for the input and output arrays. (The stride | |
2180 is the separation of consecutive elements along this dimension.) | |
2181 | |
2182 The meaning of the stride parameter depends on the type of the array | |
2183 that the stride refers to. _If the array is interleaved complex, | |
2184 strides are expressed in units of complex numbers (`fftw_complex'). If | |
2185 the array is split complex or real, strides are expressed in units of | |
2186 real numbers (`double')._ This convention is consistent with the usual | |
2187 pointer arithmetic in the C language. An interleaved array is denoted | |
2188 by a pointer `p' to `fftw_complex', so that `p+1' points to the next | |
2189 complex number. Split arrays are denoted by pointers to `double', in | |
2190 which case pointer arithmetic operates in units of `sizeof(double)'. | |
2191 | |
2192 The guru planner interfaces all take a (`rank', `dims[rank]') pair | |
2193 describing the transform size, and a (`howmany_rank', | |
2194 `howmany_dims[howmany_rank]') pair describing the "vector" size (a | |
2195 multi-dimensional loop of transforms to perform), where `dims' and | |
2196 `howmany_dims' are arrays of `fftw_iodim'. | |
2197 | |
2198 For example, the `howmany' parameter in the advanced complex-DFT | |
2199 interface corresponds to `howmany_rank' = 1, `howmany_dims[0].n' = | |
2200 `howmany', `howmany_dims[0].is' = `idist', and `howmany_dims[0].os' = | |
2201 `odist'. (To compute a single transform, you can just use | |
2202 `howmany_rank' = 0.) | |
2203 | |
2204 A row-major multidimensional array with dimensions `n[rank]' (*note | |
2205 Row-major Format::) corresponds to `dims[i].n' = `n[i]' and the | |
2206 recurrence `dims[i].is' = `n[i+1] * dims[i+1].is' (similarly for `os'). | |
2207 The stride of the last (`i=rank-1') dimension is the overall stride of | |
2208 the array. For example, to be equivalent to the advanced complex-DFT | |
2209 interface, you would have `dims[rank-1].is' = `istride' and | |
2210 `dims[rank-1].os' = `ostride'. | |
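The row-major recurrence above can be written out as a short loop. The struct below is a local stand-in with the same three fields as `fftw_iodim' so that the sketch compiles without the FFTW headers; the function name is hypothetical.

```c
#include <assert.h>

/* Local stand-in with the same fields as fftw_iodim. */
typedef struct {
    int n;
    int is;
    int os;
} iodim_sketch;

/* Describe a row-major array of dimensions n[0..rank-1] whose overall
   input/output strides are istride/ostride. */
void row_major_dims(int rank, const int *n, int istride, int ostride,
                    iodim_sketch *dims)
{
    int i;

    assert(rank >= 1);
    /* The last dimension carries the overall stride of the array. */
    dims[rank - 1].n = n[rank - 1];
    dims[rank - 1].is = istride;
    dims[rank - 1].os = ostride;
    /* Earlier dimensions follow dims[i].is = n[i+1] * dims[i+1].is. */
    for (i = rank - 2; i >= 0; --i) {
        dims[i].n = n[i];
        dims[i].is = n[i + 1] * dims[i + 1].is;
        dims[i].os = n[i + 1] * dims[i + 1].os;
    }
}
```

With the real library one would presumably fill an array of `fftw_iodim' the same way before calling a guru planner.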
2211 | |
2212 In general, we only guarantee FFTW to return a non-`NULL' plan if | |
2213 the vector and transform dimensions correspond to a set of distinct | |
2214 indices, and for in-place transforms the input/output strides should be | |
2215 the same. | |
2216 | |
2217 | |
2218 File: fftw3.info, Node: Guru Complex DFTs, Next: Guru Real-data DFTs, Prev: Guru vector and transform sizes, Up: Guru Interface | |
2219 | |
2220 4.5.3 Guru Complex DFTs | |
2221 ----------------------- | |
2222 | |
2223 fftw_plan fftw_plan_guru_dft( | |
2224 int rank, const fftw_iodim *dims, | |
2225 int howmany_rank, const fftw_iodim *howmany_dims, | |
2226 fftw_complex *in, fftw_complex *out, | |
2227 int sign, unsigned flags); | |
2228 | |
2229 fftw_plan fftw_plan_guru_split_dft( | |
2230 int rank, const fftw_iodim *dims, | |
2231 int howmany_rank, const fftw_iodim *howmany_dims, | |
2232 double *ri, double *ii, double *ro, double *io, | |
2233 unsigned flags); | |
2234 | |
2235 These two functions plan a complex-data, multi-dimensional DFT for | |
2236 the interleaved and split format, respectively. Transform dimensions | |
2237 are given by (`rank', `dims') over a multi-dimensional vector (loop) of | |
2238 dimensions (`howmany_rank', `howmany_dims'). `dims' and `howmany_dims' | |
2239 should point to `fftw_iodim' arrays of length `rank' and | |
2240 `howmany_rank', respectively. | |
2241 | |
2242 `flags' is a bitwise OR (`|') of zero or more planner flags, as | |
2243 defined in *note Planner Flags::. | |
2244 | |
2245 In the `fftw_plan_guru_dft' function, the pointers `in' and `out' | |
2246 point to the interleaved input and output arrays, respectively. The | |
2247 sign can be either -1 (= `FFTW_FORWARD') or +1 (= `FFTW_BACKWARD'). If | |
2248 the pointers are equal, the transform is in-place. | |
2249 | |
2250 In the `fftw_plan_guru_split_dft' function, `ri' and `ii' point to | |
2251 the real and imaginary input arrays, and `ro' and `io' point to the | |
2252 real and imaginary output arrays. The input and output pointers may be | |
2253 the same, indicating an in-place transform. For example, for | |
2254 `fftw_complex' pointers `in' and `out', the corresponding parameters | |
2255 are: | |
2256 | |
2257 ri = (double *) in; | |
2258 ii = (double *) in + 1; | |
2259 ro = (double *) out; | |
2260 io = (double *) out + 1; | |
2261 | |
2262 Because `fftw_plan_guru_split_dft' accepts split arrays, strides are | |
2263 expressed in units of `double'. For a contiguous `fftw_complex' array, | |
2264 the overall stride of the transform should be 2, the distance between | |
2265 consecutive real parts or between consecutive imaginary parts; see | |
2266 *note Guru vector and transform sizes::. Note that the dimension | |
2267 strides are applied equally to the real and imaginary parts; real and | |
2268 imaginary arrays with different strides are not supported. | |
2269 | |
2270 There is no `sign' parameter in `fftw_plan_guru_split_dft'. This | |
2271 function always plans for an `FFTW_FORWARD' transform. To plan for an | |
2272 `FFTW_BACKWARD' transform, you can exploit the identity that the | |
2273 backwards DFT is equal to the forwards DFT with the real and imaginary | |
2274 parts swapped. For example, in the case of the `fftw_complex' arrays | |
2275 above, the `FFTW_BACKWARD' transform is computed by the parameters: | |
2276 | |
2277 ri = (double *) in + 1; | |
2278 ii = (double *) in; | |
2279 ro = (double *) out + 1; | |
2280 io = (double *) out; | |
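The swap identity can be checked without FFTW on a toy case. The sketch below is a naive length-4 forward DFT on split arrays (length 4 is chosen because its twiddle factors are exactly 0 and +/-1); it is not how FFTW computes anything, and both function names are hypothetical.

```c
#include <assert.h>

/* cos and sin of 2*pi*m/4 for m = 0..3; exact values, no libm needed. */
static const double COS4[4] = {1.0, 0.0, -1.0, 0.0};
static const double SIN4[4] = {0.0, 1.0, 0.0, -1.0};

/* Naive out-of-place forward (sign -1) DFT of length 4 on split arrays.
   The stride is in units of double and applies to all four arrays. */
void split_forward_dft4(const double *ri, const double *ii,
                        double *ro, double *io, int stride)
{
    int j, k;

    assert(stride >= 1 && ri != ro && ii != io);
    for (k = 0; k < 4; ++k) {
        double re = 0.0, im = 0.0;
        for (j = 0; j < 4; ++j) {
            double c = COS4[(j * k) % 4], s = SIN4[(j * k) % 4];
            double xr = ri[j * stride], xi = ii[j * stride];
            re += xr * c + xi * s;      /* real part of x * exp(-i*theta) */
            im += xi * c - xr * s;      /* imaginary part */
        }
        ro[k * stride] = re;
        io[k * stride] = im;
    }
}

/* Backward DFT of 4 interleaved complex numbers (8 doubles each for in
   and out), obtained by swapping real and imaginary pointers. */
void interleaved_backward_dft4(const double *in, double *out)
{
    split_forward_dft4(in + 1, in, out + 1, out, 2);
}
```

The stride of 2 here matches the remark above that a contiguous interleaved array has an overall stride of 2 when viewed as split data.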
2281 | |
2282 | |
2283 File: fftw3.info, Node: Guru Real-data DFTs, Next: Guru Real-to-real Transforms, Prev: Guru Complex DFTs, Up: Guru Interface | |
2284 | |
2285 4.5.4 Guru Real-data DFTs | |
2286 ------------------------- | |
2287 | |
2288 fftw_plan fftw_plan_guru_dft_r2c( | |
2289 int rank, const fftw_iodim *dims, | |
2290 int howmany_rank, const fftw_iodim *howmany_dims, | |
2291 double *in, fftw_complex *out, | |
2292 unsigned flags); | |
2293 | |
2294 fftw_plan fftw_plan_guru_split_dft_r2c( | |
2295 int rank, const fftw_iodim *dims, | |
2296 int howmany_rank, const fftw_iodim *howmany_dims, | |
2297 double *in, double *ro, double *io, | |
2298 unsigned flags); | |
2299 | |
2300 fftw_plan fftw_plan_guru_dft_c2r( | |
2301 int rank, const fftw_iodim *dims, | |
2302 int howmany_rank, const fftw_iodim *howmany_dims, | |
2303 fftw_complex *in, double *out, | |
2304 unsigned flags); | |
2305 | |
2306 fftw_plan fftw_plan_guru_split_dft_c2r( | |
2307 int rank, const fftw_iodim *dims, | |
2308 int howmany_rank, const fftw_iodim *howmany_dims, | |
2309 double *ri, double *ii, double *out, | |
2310 unsigned flags); | |
2311 | |
2312 Plan a real-input (r2c) or real-output (c2r), multi-dimensional DFT | |
2313 with transform dimensions given by (`rank', `dims') over a | |
2314 multi-dimensional vector (loop) of dimensions (`howmany_rank', | |
2315 `howmany_dims'). `dims' and `howmany_dims' should point to | |
2316 `fftw_iodim' arrays of length `rank' and `howmany_rank', respectively. | |
2317 As for the basic and advanced interfaces, an r2c transform is | |
2318 `FFTW_FORWARD' and a c2r transform is `FFTW_BACKWARD'. | |
2319 | |
2320 The _last_ dimension of `dims' is interpreted specially: that | |
2321 dimension of the real array has size `dims[rank-1].n', but that | |
2322 dimension of the complex array has size `dims[rank-1].n/2+1' (division | |
2323 rounded down). The strides, on the other hand, are taken to be exactly | |
2324 as specified. It is up to the user to specify the strides | |
2325 appropriately for the peculiar dimensions of the data, and we do not | |
2326 guarantee that the planner will succeed (return non-`NULL') for any | |
2327 dimensions other than those described in *note Real-data DFT Array | |
2328 Format:: and generalized in *note Advanced Real-data DFTs::. (That is, | |
2329 for an in-place transform, each individual dimension should be able to | |
2330 operate in place.) | |
2331 | |
2332 `in' and `out' point to the input and output arrays for r2c and c2r | |
2333 transforms, respectively. For split arrays, `ri' and `ii' point to the | |
2334 real and imaginary input arrays for a c2r transform, and `ro' and `io' | |
2335 point to the real and imaginary output arrays for an r2c transform. | |
2336 `in' and `ro' or `ri' and `out' may be the same, indicating an in-place | |
2337 transform. (In-place transforms where `in' and `io' or `ii' and `out' | |
2338 are the same are not currently supported.) | |
2339 | |
2340 `flags' is a bitwise OR (`|') of zero or more planner flags, as | |
2341 defined in *note Planner Flags::. | |
2342 | |
2343 In-place transforms of rank greater than 1 are currently only | |
2344 supported for interleaved arrays. For split arrays, the planner will | |
2345 return `NULL'. | |
2346 | |
2347 | |
2348 File: fftw3.info, Node: Guru Real-to-real Transforms, Next: 64-bit Guru Interface, Prev: Guru Real-data DFTs, Up: Guru Interface | |
2349 | |
2350 4.5.5 Guru Real-to-real Transforms | |
2351 ---------------------------------- | |
2352 | |
2353 fftw_plan fftw_plan_guru_r2r(int rank, const fftw_iodim *dims, | |
2354 int howmany_rank, | |
2355 const fftw_iodim *howmany_dims, | |
2356 double *in, double *out, | |
2357 const fftw_r2r_kind *kind, | |
2358 unsigned flags); | |
2359 | |
2360 Plan a real-to-real (r2r) multi-dimensional transform | |
2361 with transform dimensions given by (`rank', `dims') over a | |
2362 multi-dimensional vector (loop) of dimensions (`howmany_rank', | |
2363 `howmany_dims'). `dims' and `howmany_dims' should point to | |
2364 `fftw_iodim' arrays of length `rank' and `howmany_rank', respectively. | |
2365 | |
2366 The transform kind of each dimension is given by the `kind' | |
2367 parameter, which should point to an array of length `rank'. Valid | |
2368 `fftw_r2r_kind' constants are given in *note Real-to-Real Transform | |
2369 Kinds::. | |
2370 | |
2371 `in' and `out' point to the real input and output arrays; they may | |
2372 be the same, indicating an in-place transform. | |
2373 | |
2374 `flags' is a bitwise OR (`|') of zero or more planner flags, as | |
2375 defined in *note Planner Flags::. | |
2376 | |
2377 | |
2378 File: fftw3.info, Node: 64-bit Guru Interface, Prev: Guru Real-to-real Transforms, Up: Guru Interface | |
2379 | |
2380 4.5.6 64-bit Guru Interface | |
2381 --------------------------- | |
2382 | |
2383 When compiled in 64-bit mode on a 64-bit architecture (where addresses | |
2384 are 64 bits wide), FFTW uses 64-bit quantities internally for all | |
2385 transform sizes, strides, and so on--you don't have to do anything | |
2386 special to exploit this. However, in the ordinary FFTW interfaces, you | |
2387 specify the transform size by an `int' quantity, which is normally only | |
2388 32 bits wide. This means that, even though FFTW is using 64-bit sizes | |
2389 internally, you cannot specify a single transform dimension larger than | |
2390 2^31-1 numbers. | |
2391 | |
2392 We expect that few users will require transforms larger than this, | |
2393 but, for those who do, we provide a 64-bit version of the guru | |
2394 interface in which all sizes are specified as integers of type | |
2395 `ptrdiff_t' instead of `int'. (`ptrdiff_t' is a signed integer type | |
2396 defined by the C standard to be wide enough to represent address | |
2397 differences, and thus must be at least 64 bits wide on a 64-bit | |
2398 machine.) We stress that there is _no performance advantage_ to using | |
2399 this interface--the same internal FFTW code is employed regardless--and | |
2400 it is only necessary if you want to specify very large transform sizes. | |
2401 | |
2402 In particular, the 64-bit guru interface is a set of planner routines | |
2403 that are exactly the same as the guru planner routines, except that | |
2404 they are named with `guru64' instead of `guru' and they take arguments | |
2405 of type `fftw_iodim64' instead of `fftw_iodim'. For example, instead | |
2406 of `fftw_plan_guru_dft', we have `fftw_plan_guru64_dft'. | |
2407 | |
2408 fftw_plan fftw_plan_guru64_dft( | |
2409 int rank, const fftw_iodim64 *dims, | |
2410 int howmany_rank, const fftw_iodim64 *howmany_dims, | |
2411 fftw_complex *in, fftw_complex *out, | |
2412 int sign, unsigned flags); | |
2413 | |
2414 The `fftw_iodim64' type is similar to `fftw_iodim', with the same | |
2415 interpretation, except that it uses type `ptrdiff_t' instead of type | |
2416 `int'. | |
2417 | |
2418 typedef struct { | |
2419 ptrdiff_t n; | |
2420 ptrdiff_t is; | |
2421 ptrdiff_t os; | |
2422 } fftw_iodim64; | |
2423 | |
2424 Every other `fftw_plan_guru' function also has a `fftw_plan_guru64' | |
2425 equivalent, but we do not repeat their documentation here since they | |
2426 are identical to the 32-bit versions except as noted above. | |
2427 | |
2428 | |
2429 File: fftw3.info, Node: New-array Execute Functions, Next: Wisdom, Prev: Guru Interface, Up: FFTW Reference | |
2430 | |
2431 4.6 New-array Execute Functions | |
2432 =============================== | |
2433 | |
2434 Normally, one executes a plan for the arrays with which the plan was | |
2435 created, by calling `fftw_execute(plan)' as described in *note Using | |
2436 Plans::. However, it is possible for sophisticated users to apply a | |
2437 given plan to a _different_ array using the "new-array execute" | |
2438 functions detailed below, provided that the following conditions are | |
2439 met: | |
2440 | |
2441 * The array size, strides, etcetera are the same (since those are | |
2442 set by the plan). | |
2443 | |
2444 * The input and output arrays are the same (in-place) or different | |
2445 (out-of-place) if the plan was originally created to be in-place or | |
2446 out-of-place, respectively. | |
2447 | |
2448 * For split arrays, the separations between the real and imaginary | |
2449 parts, `ii-ri' and `io-ro', are the same as they were for the | |
2450 input and output arrays when the plan was created. (This | |
2451 condition is automatically satisfied for interleaved arrays.) | |
2452 | |
2453 * The "alignment" of the new input/output arrays is the same as that | |
2454 of the input/output arrays when the plan was created, unless the | |
2455 plan was created with the `FFTW_UNALIGNED' flag. Here, the | |
2456 alignment is a platform-dependent quantity (for example, it is the | |
2457 address modulo 16 if SSE SIMD instructions are used, but the | |
2458 address modulo 4 for non-SIMD single-precision FFTW on the same | |
2459 machine). In general, only arrays allocated with `fftw_malloc' | |
2460 are guaranteed to be equally aligned (*note SIMD alignment and | |
2461 fftw_malloc::). | |
2462 | |
2463 | |
2464 The alignment issue is especially critical, because if you don't use | |
2465 `fftw_malloc' then you may have little control over the alignment of | |
2466 arrays in memory. For example, neither the C++ `new' operator nor the | |
2467 Fortran `allocate' statement provides strong enough guarantees about | |
2468 data alignment. If you don't use `fftw_malloc', therefore, you | |
2469 probably have to use `FFTW_UNALIGNED' (which disables most SIMD | |
2470 support). If possible, it is probably better for you to simply create | |
2471 multiple plans (creating a new plan is quick once one exists for a | |
2472 given size), or better yet re-use the same array for your transforms. | |
2473 | |
2474 For rare circumstances in which you cannot control the alignment of | |
2475 allocated memory, but wish to determine whether a given array is aligned | |
2476 like the original array for which a plan was created, you can use the | |
2477 `fftw_alignment_of' function: | |
2478 int fftw_alignment_of(double *p); | |
2479 Two arrays have equivalent alignment (for the purposes of applying a | |
2480 plan) if and only if `fftw_alignment_of' returns the same value for the | |
2481 corresponding pointers to their data (typecast to `double*' if | |
2482 necessary). | |
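As a rough illustration of what "the address modulo 16" means in the SSE case mentioned above, the following sketch compares two pointers by their offset from a byte boundary. The helper names `address_mod' and `same_address_mod' are hypothetical and are not part of FFTW; the real test for whether a plan can be applied to a new array remains `fftw_alignment_of', whose result is platform-dependent.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an FFTW function): offset of p from the
   previous `boundary`-byte boundary, e.g. boundary = 16 for SSE. */
unsigned address_mod(const void *p, unsigned boundary)
{
    assert(boundary != 0);
    return (unsigned)((uintptr_t)p % boundary);
}

/* Nonzero if both pointers have the same offset modulo `boundary`. */
int same_address_mod(const void *a, const void *b, unsigned boundary)
{
    return address_mod(a, boundary) == address_mod(b, boundary);
}
```

With 8-byte doubles, `&buf[0]' and `&buf[2]' of the same array are 16 bytes apart and therefore compare equal modulo 16, whereas `&buf[0]' and `&buf[1]' do not.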
2483 | |
2484 If you are tempted to use the new-array execute interface because you | |
2485 want to transform a known bunch of arrays of the same size, you should | |
2486 probably go use the advanced interface instead (*note Advanced | |
2487 Interface::). | |
2488 | |
2489 The new-array execute functions are: | |
2490 | |
2491 void fftw_execute_dft( | |
2492 const fftw_plan p, | |
2493 fftw_complex *in, fftw_complex *out); | |
2494 | |
2495 void fftw_execute_split_dft( | |
2496 const fftw_plan p, | |
2497 double *ri, double *ii, double *ro, double *io); | |
2498 | |
2499 void fftw_execute_dft_r2c( | |
2500 const fftw_plan p, | |
2501 double *in, fftw_complex *out); | |
2502 | |
2503 void fftw_execute_split_dft_r2c( | |
2504 const fftw_plan p, | |
2505 double *in, double *ro, double *io); | |
2506 | |
2507 void fftw_execute_dft_c2r( | |
2508 const fftw_plan p, | |
2509 fftw_complex *in, double *out); | |
2510 | |
2511 void fftw_execute_split_dft_c2r( | |
2512 const fftw_plan p, | |
2513 double *ri, double *ii, double *out); | |
2514 | |
2515 void fftw_execute_r2r( | |
2516 const fftw_plan p, | |
2517 double *in, double *out); | |
2518 | |
2519 These execute the `plan' to compute the corresponding transform on | |
2520 the input/output arrays specified by the subsequent arguments. The | |
2521 input/output array arguments have the same meanings as the ones passed | |
2522 to the guru planner routines in the preceding sections. The `plan' is | |
2523 not modified, and these routines can be called as many times as | |
2524 desired, or intermixed with calls to the ordinary `fftw_execute'. | |
2525 | |
2526 The `plan' _must_ have been created for the transform type | |
2527 corresponding to the execute function, e.g. it must be a complex-DFT | |
2528 plan for `fftw_execute_dft'. Any of the planner routines for that | |
2529 transform type, from the basic to the guru interface, could have been | |
2530 used to create the plan, however. | |
2531 | |
2532 | |
2533 File: fftw3.info, Node: Wisdom, Next: What FFTW Really Computes, Prev: New-array Execute Functions, Up: FFTW Reference | |
2534 | |
2535 4.7 Wisdom | |
2536 ========== | |
2537 | |
2538 This section documents the FFTW mechanism for saving and restoring | |
2539 plans from disk. This mechanism is called "wisdom". | |
2540 | |
2541 * Menu: | |
2542 | |
2543 * Wisdom Export:: | |
2544 * Wisdom Import:: | |
2545 * Forgetting Wisdom:: | |
2546 * Wisdom Utilities:: | |
2547 | |
2548 | |
2549 File: fftw3.info, Node: Wisdom Export, Next: Wisdom Import, Prev: Wisdom, Up: Wisdom | |
2550 | |
2551 4.7.1 Wisdom Export | |
2552 ------------------- | |
2553 | |
2554 int fftw_export_wisdom_to_filename(const char *filename); | |
2555 void fftw_export_wisdom_to_file(FILE *output_file); | |
2556 char *fftw_export_wisdom_to_string(void); | |
2557 void fftw_export_wisdom(void (*write_char)(char c, void *), void *data); | |
2558 | |
2559 These functions allow you to export all currently accumulated wisdom | |
2560 in a form from which it can be later imported and restored, even during | |
2561 a separate run of the program. (*Note Words of Wisdom-Saving Plans::.) | |
2562 The current store of wisdom is not affected by calling any of these | |
2563 routines. | |
2564 | |
2565 `fftw_export_wisdom' exports the wisdom to any output medium, as | |
2566 specified by the callback function `write_char'. `write_char' is a | |
2567 `putc'-like function that writes the character `c' to some output; its | |
2568 second parameter is the `data' pointer passed to `fftw_export_wisdom'. | |
2569 For convenience, the following three "wrapper" routines are provided: | |
2570 | |
2571 `fftw_export_wisdom_to_filename' writes wisdom to a file named | |
2572 `filename' (which is created or overwritten), returning `1' on success | |
2573 and `0' on failure. A lower-level function, which requires you to open | |
2574 and close the file yourself (e.g. if you want to write wisdom to a | |
2575 portion of a larger file) is `fftw_export_wisdom_to_file'. This writes | |
2576 the wisdom to the current position in `output_file', which should be | |
2577 open with write permission; upon exit, the file remains open and is | |
2578 positioned at the end of the wisdom data. | |
2579 | |
2580 `fftw_export_wisdom_to_string' returns a pointer to a | |
2581 NUL-terminated string holding the wisdom data. This string is | |
2582 dynamically allocated, and it is the responsibility of the caller to | |
2583 deallocate it with `free' when it is no longer needed. | |
2584 | |
2585 All of these routines export the wisdom in the same format, which we | |
2586 will not document here except to say that it is LISP-like ASCII text | |
2587 that is insensitive to white space. | |
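The `write_char' callback contract described above can be sketched as follows. The `wisdom_sink' type and the `sink_write_char' function are hypothetical names chosen for this illustration; only the callback signature, `void (*)(char, void *)', comes from the prototype of `fftw_export_wisdom'. The sketch collects characters in a growable, always-terminated memory buffer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable buffer, passed as the `data` argument. */
typedef struct {
    char *buf;     /* collected characters, always '\0'-terminated */
    size_t len;    /* number of characters stored */
    size_t cap;    /* allocated size of buf in bytes */
    int failed;    /* set to 1 if an allocation failed */
} wisdom_sink;

/* putc-like callback with the signature void (*)(char, void *). */
void sink_write_char(char c, void *data)
{
    wisdom_sink *s = (wisdom_sink *)data;
    assert(s != NULL);
    if (s->failed)
        return;
    if (s->len + 1 >= s->cap) {          /* keep room for the '\0' */
        size_t new_cap = s->cap ? 2 * s->cap : 64;
        char *p = realloc(s->buf, new_cap);
        if (p == NULL) {
            s->failed = 1;
            return;
        }
        s->buf = p;
        s->cap = new_cap;
    }
    s->buf[s->len++] = c;
    s->buf[s->len] = '\0';
}
```

In a program linked against FFTW, one would presumably start from a zero-initialized `wisdom_sink s = {0};', call `fftw_export_wisdom(sink_write_char, &s);', check `s.failed', and eventually release `s.buf' with `free'.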
2588 | |
2589 | |
2590 File: fftw3.info, Node: Wisdom Import, Next: Forgetting Wisdom, Prev: Wisdom Export, Up: Wisdom | |
2591 | |
2592 4.7.2 Wisdom Import | |
2593 ------------------- | |
2594 | |
2595 int fftw_import_system_wisdom(void); | |
2596 int fftw_import_wisdom_from_filename(const char *filename); | |
2597 int fftw_import_wisdom_from_string(const char *input_string); | |
2598 int fftw_import_wisdom(int (*read_char)(void *), void *data); | |
2599 | |
2600 These functions import wisdom into a program from data stored by the | |
2601 `fftw_export_wisdom' functions above. (*Note Words of Wisdom-Saving | |
2602 Plans::.) The imported wisdom replaces any wisdom already accumulated | |
2603 by the running program. | |
2604 | |
2605 `fftw_import_wisdom' imports wisdom from any input medium, as | |
2606 specified by the callback function `read_char'. `read_char' is a | |
2607 `getc'-like function that returns the next character in the input; its | |
2608 parameter is the `data' pointer passed to `fftw_import_wisdom'. If the | |
2609 end of the input data is reached (which should never happen for valid | |
2610 data), `read_char' should return `EOF' (as defined in `<stdio.h>'). | |
2611 For convenience, the following three "wrapper" routines are provided: | |
2612 | |
2613 `fftw_import_wisdom_from_filename' reads wisdom from a file named | |
2614 `filename'. A lower-level function, which requires you to open and | |
2615 close the file yourself (e.g. if you want to read wisdom from a portion | |
2616 of a larger file) is `fftw_import_wisdom_from_file'. This reads wisdom | |
2617 from the current position in `input_file' (which should be open with | |
2618 read permission); upon exit, the file remains open, but the position of | |
2619 the read pointer is unspecified. | |
2620 | |
2621 `fftw_import_wisdom_from_string' reads wisdom from the | |
2622 NUL-terminated string `input_string'. | |
2623 | |
2624 `fftw_import_system_wisdom' reads wisdom from an | |
2625 implementation-defined standard file (`/etc/fftw/wisdom' on Unix and | |
2626 GNU systems). | |
2627 | |
2628 The return value of these import routines is `1' if the wisdom was | |
2629 read successfully and `0' otherwise. Note that, in all of these | |
2630 functions, any data in the input stream past the end of the wisdom data | |
2631 is simply ignored. | |
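The `read_char' callback contract described above can be sketched in the same spirit. The `wisdom_source' type and the `source_read_char' function are hypothetical names for this illustration; only the callback signature, `int (*)(void *)', and the use of `EOF' come from the text. The sketch reads characters from a memory region of known length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>   /* EOF */

/* Hypothetical in-memory input, passed as the `data` argument. */
typedef struct {
    const char *buf;   /* wisdom text */
    size_t len;        /* number of valid characters in buf */
    size_t pos;        /* index of the next character to return */
} wisdom_source;

/* getc-like callback with the signature int (*)(void *):
   returns the next character, or EOF once the input is exhausted. */
int source_read_char(void *data)
{
    wisdom_source *s = (wisdom_source *)data;
    assert(s != NULL);
    if (s->pos >= s->len)
        return EOF;
    return (unsigned char)s->buf[s->pos++];
}
```

In a program linked against FFTW, such a source would be handed over as `fftw_import_wisdom(source_read_char, &src)', and the return value of that call checked for `1'.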
2632 | |
2633 | |
2634 File: fftw3.info, Node: Forgetting Wisdom, Next: Wisdom Utilities, Prev: Wisdom Import, Up: Wisdom | |
2635 | |
2636 4.7.3 Forgetting Wisdom | |
2637 ----------------------- | |
2638 | |
2639 void fftw_forget_wisdom(void); | |
2640 | |
2641 Calling `fftw_forget_wisdom' causes all accumulated `wisdom' to be | |
2642 discarded and its associated memory to be freed. (New `wisdom' can | |
2643 still be gathered subsequently, however.) | |
2644 | |
2645 | |
2646 File: fftw3.info, Node: Wisdom Utilities, Prev: Forgetting Wisdom, Up: Wisdom | |
2647 | |
2648 4.7.4 Wisdom Utilities | |
2649 ---------------------- | |
2650 | |
2651 FFTW includes two standalone utility programs that deal with wisdom. We | |
2652 merely summarize them here, since they come with their own `man' pages | |
2653 for Unix and GNU systems (with HTML versions on our web site). | |
2654 | |
2655 The first program is `fftw-wisdom' (or `fftwf-wisdom' in single | |
2656 precision, etcetera), which can be used to create a wisdom file | |
2657 containing plans for any of the transform sizes and types supported by | |
2658 FFTW. It is preferable to create wisdom directly from your executable | |
2659 (*note Caveats in Using Wisdom::), but this program is useful for | |
2660 creating global wisdom files for `fftw_import_system_wisdom'. | |
2661 | |
2662 The second program is `fftw-wisdom-to-conf', which takes a wisdom | |
2663 file as input and produces a "configuration routine" as output. The | |
2664 latter is a C subroutine that you can compile and link into your | |
2665 program, replacing a routine of the same name in the FFTW library, that | |
2666 determines which parts of FFTW are callable by your program. | |
2667 `fftw-wisdom-to-conf' produces a configuration routine that links to | |
2668 only those parts of FFTW needed by the saved plans in the wisdom, | |
2669 greatly reducing the size of statically linked executables (which should | |
2670 only attempt to create plans corresponding to those in the wisdom, | |
2671 however). | |
2672 | |
2673 | |
2674 File: fftw3.info, Node: What FFTW Really Computes, Prev: Wisdom, Up: FFTW Reference | |
2675 | |
2676 4.8 What FFTW Really Computes | |
2677 ============================= | |
2678 | |
2679 In this section, we provide precise mathematical definitions for the | |
2680 transforms that FFTW computes. These transform definitions are fairly | |
2681 standard, but some authors follow slightly different conventions for the | |
2682 normalization of the transform (the constant factor in front) and the | |
2683 sign of the complex exponent. We begin by presenting the | |
2684 one-dimensional (1d) transform definitions, and then give the | |
2685 straightforward extension to multi-dimensional transforms. | |
2686 | |
2687 * Menu: | |
2688 | |
2689 * The 1d Discrete Fourier Transform (DFT):: | |
2690 * The 1d Real-data DFT:: | |
2691 * 1d Real-even DFTs (DCTs):: | |
2692 * 1d Real-odd DFTs (DSTs):: | |
2693 * 1d Discrete Hartley Transforms (DHTs):: | |
2694 * Multi-dimensional Transforms:: | |
2695 | |
2696 | |
2697 File: fftw3.info, Node: The 1d Discrete Fourier Transform (DFT), Next: The 1d Real-data DFT, Prev: What FFTW Really Computes, Up: What FFTW Really Computes | |
2698 | |
2699 4.8.1 The 1d Discrete Fourier Transform (DFT) | |
2700 --------------------------------------------- | |
2701 | |
2702 The forward (`FFTW_FORWARD') discrete Fourier transform (DFT) of a 1d | |
2703 complex array X of size n computes an array Y, where: Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(-2 pi j k sqrt(-1)/n) . | |
2704 The backward (`FFTW_BACKWARD') DFT computes: Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(2 pi j k sqrt(-1)/n) . | |
2705 FFTW computes an unnormalized transform, in that there is no | |
2706 coefficient in front of the summation in the DFT. In other words, | |
2707 applying the forward and then the backward transform will multiply the | |
2708 input by n. | |
2709 | |
2710 From above, an `FFTW_FORWARD' transform corresponds to a sign of -1 | |
2711 in the exponent of the DFT. Note also that we use the standard | |
2712 "in-order" output ordering--the k-th output corresponds to the | |
2713 frequency k/n (or k/T, where T is your total sampling period). For | |
2714 those who like to think in terms of positive and negative frequencies, | |
2715 this means that the positive frequencies are stored in the first half | |
2716 of the output and the negative frequencies are stored in backwards | |
2717 order in the second half of the output. (The frequency -k/n is the | |
2718 same as the frequency (n-k)/n.) | |
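The in-order output layout can be made concrete with a small index mapping. The function name `signed_frequency_index' is hypothetical; the sketch returns the signed index m such that output k corresponds to the frequency m/n. For even n, the bin k = n/2 is reported as +n/2, although it equally represents -n/2.

```c
#include <assert.h>

/* Signed frequency index for output bin k of a size-n DFT:
   bins 0..n/2 map to 0..n/2, the remaining bins map to k - n. */
int signed_frequency_index(int k, int n)
{
    assert(n > 0 && k >= 0 && k < n);
    return (k <= n / 2) ? k : k - n;
}
```

For n = 8 this maps bins 0..7 to 0, 1, 2, 3, 4, -3, -2, -1.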
2719 | |
2720 | |
2721 File: fftw3.info, Node: The 1d Real-data DFT, Next: 1d Real-even DFTs (DCTs), Prev: The 1d Discrete Fourier Transform (DFT), Up: What FFTW Really Computes | |
2722 | |
2723 4.8.2 The 1d Real-data DFT | |
2724 -------------------------- | |
2725 | |
2726 The real-input (r2c) DFT in FFTW computes the _forward_ transform Y of | |
2727 the size `n' real array X, exactly as defined above, i.e. Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(-2 pi j k sqrt(-1)/n) . | |
2728 This output array Y can easily be shown to possess the "Hermitian" | |
2729 symmetry Y[k] = Y[n-k]*, where we take Y to be periodic so that Y[n] = | |
2730 Y[0]. | |
2731 | |
2732 As a result of this symmetry, half of the output Y is redundant | |
2733 (being the complex conjugate of the other half), and so the 1d r2c | |
2734 transforms only output elements 0...n/2 of Y (n/2+1 complex numbers), | |
2735 where the division by 2 is rounded down. | |
2736 | |
2737 Moreover, the Hermitian symmetry implies that Y[0] and, if n is | |
2738 even, the Y[n/2] element, are purely real. So, for the `R2HC' r2r | |
2739 transform, the halfcomplex output format does not store the (zero) | |
2740 imaginary parts of these elements. | |
2741 | |
2742 The c2r and `HC2R' r2r transforms compute the backward DFT of the | |
2743 _complex_ array X with Hermitian symmetry, stored in the r2c/`R2HC' | |
2744 output formats, respectively, where the backward transform is defined | |
2745 exactly as for the complex case: Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(2 pi j k sqrt(-1)/n) . | |
2746 The outputs `Y' of this transform can easily be seen to be purely | |
2747 real, and are stored as an array of real numbers. | |
2748 | |
2749 Like FFTW's complex DFT, these transforms are unnormalized. In other | |
2750 words, applying the real-to-complex (forward) and then the | |
2751 complex-to-real (backward) transform will multiply the input by n. | |
2752 | |
2753 | |
2754 File: fftw3.info, Node: 1d Real-even DFTs (DCTs), Next: 1d Real-odd DFTs (DSTs), Prev: The 1d Real-data DFT, Up: What FFTW Really Computes | |
2755 | |
2756 4.8.3 1d Real-even DFTs (DCTs) | |
2757 ------------------------------ | |
2758 | |
2759 The real-even symmetry DFTs in FFTW are exactly equivalent to the | |
2760 unnormalized forward (and backward) DFTs as defined above, where the | |
2761 input array X of length N is purely real and also has "even" symmetry. | |
2762 In this case, the output array is likewise real and has even symmetry. | |
2763 | |
2764 For the case of `REDFT00', this even symmetry means that X[j] = | |
2765 X[N-j], where we take X to be periodic so that X[N] = X[0]. Because of | |
2766 this redundancy, only the first n real numbers are actually stored, | |
2767 where N = 2(n-1). | |
2768 | |
2769 The proper definition of even symmetry for `REDFT10', `REDFT01', and | |
2770 `REDFT11' transforms is somewhat more intricate because of the shifts | |
2771 by 1/2 of the input and/or output, although the corresponding boundary | |
2772 conditions are given in *note Real even/odd DFTs (cosine/sine | |
2773 transforms)::. Because of the even symmetry, however, the sine terms | |
2774 in the DFT all cancel and the remaining cosine terms are written | |
2775 explicitly below. This formulation often leads people to call such a | |
2776 transform a "discrete cosine transform" (DCT), although it is really | |
2777 just a special case of the DFT. | |
2778 | |
2779 In each of the definitions below, we transform a real array X of | |
2780 length n to a real array Y of length n: | |
2781 | |
2782 REDFT00 (DCT-I) | |
2783 ............... | |
2784 | |
2785 An `REDFT00' transform (type-I DCT) in FFTW is defined by: Y[k] = X[0] | |
2786 + (-1)^k X[n-1] + 2 (sum for j = 1 to n-2 of X[j] cos(pi jk /(n-1))). | |
2787 Note that this transform is not defined for n=1. For n=2, the | |
2788 summation term above is dropped as you might expect. | |
2789 | |
2790 REDFT10 (DCT-II) | |
2791 ................ | |
2792 | |
2793 An `REDFT10' transform (type-II DCT, sometimes called "the" DCT) in | |
2794 FFTW is defined by: Y[k] = 2 (sum for j = 0 to n-1 of X[j] cos(pi | |
2795 (j+1/2) k / n)). | |
2796 | |
2797 REDFT01 (DCT-III) | |
2798 ................. | |
2799 | |
2800 An `REDFT01' transform (type-III DCT) in FFTW is defined by: Y[k] = | |
2801 X[0] + 2 (sum for j = 1 to n-1 of X[j] cos(pi j (k+1/2) / n)). In the | |
2802 case of n=1, this reduces to Y[0] = X[0]. Up to a scale factor (see | |
2803 below), this is the inverse of `REDFT10' ("the" DCT), and so the | |
2804 `REDFT01' (DCT-III) is sometimes called the "IDCT". | |
2805 | |
2806 REDFT11 (DCT-IV) | |
2807 ................ | |
2808 | |
2809 An `REDFT11' transform (type-IV DCT) in FFTW is defined by: Y[k] = 2 | |
2810 (sum for j = 0 to n-1 of X[j] cos(pi (j+1/2) (k+1/2) / n)). | |
2811 | |
2812 Inverses and Normalization | |
2813 .......................... | |
2814 | |
2815 These definitions correspond directly to the unnormalized DFTs used | |
2816 elsewhere in FFTW (hence the factors of 2 in front of the summations). | |
2817 The unnormalized inverse of `REDFT00' is `REDFT00', of `REDFT10' is | |
2818 `REDFT01' and vice versa, and of `REDFT11' is `REDFT11'. Each | |
2819 unnormalized inverse results in the original array multiplied by N, | |
2820 where N is the _logical_ DFT size. For `REDFT00', N=2(n-1) (note that | |
2821 n=1 is not defined); otherwise, N=2n. | |
2822 | |
2823 In defining the discrete cosine transform, some authors also include | |
2824 additional factors of sqrt(2) (or its inverse) multiplying selected | |
2825 inputs and/or outputs. This is a mostly cosmetic change that makes the | |
2826 transform orthogonal, but sacrifices the direct equivalence to a | |
2827 symmetric DFT. | |
2828 | |
2829 | |
2830 File: fftw3.info, Node: 1d Real-odd DFTs (DSTs), Next: 1d Discrete Hartley Transforms (DHTs), Prev: 1d Real-even DFTs (DCTs), Up: What FFTW Really Computes | |
2831 | |
2832 4.8.4 1d Real-odd DFTs (DSTs) | |
2833 ----------------------------- | |
2834 | |
2835 The real-odd symmetry DFTs in FFTW are exactly equivalent to the | |
2836 unnormalized forward (and backward) DFTs as defined above, where the | |
2837 input array X of length N is purely real and also has "odd" symmetry. In | |
2838 this case, the output has odd symmetry and is purely imaginary. | |
2839 | |
2840 For the case of `RODFT00', this odd symmetry means that X[j] = | |
2841 -X[N-j], where we take X to be periodic so that X[N] = X[0]. Because | |
2842 of this redundancy, only the first n real numbers starting at j=1 are | |
2843 actually stored (the j=0 element is zero), where N = 2(n+1). | |
2844 | |
2845 The proper definition of odd symmetry for `RODFT10', `RODFT01', and | |
2846 `RODFT11' transforms is somewhat more intricate because of the shifts | |
2847 by 1/2 of the input and/or output, although the corresponding boundary | |
2848 conditions are given in *note Real even/odd DFTs (cosine/sine | |
2849 transforms)::. Because of the odd symmetry, however, the cosine terms | |
2850 in the DFT all cancel and the remaining sine terms are written | |
2851 explicitly below. This formulation often leads people to call such a | |
2852 transform a "discrete sine transform" (DST), although it is really just | |
2853 a special case of the DFT. | |
2854 | |
2855 In each of the definitions below, we transform a real array X of | |
2856 length n to a real array Y of length n: | |
2857 | |
2858 RODFT00 (DST-I) | |
2859 ............... | |
2860 | |
2861 An `RODFT00' transform (type-I DST) in FFTW is defined by: Y[k] = 2 | |
2862 (sum for j = 0 to n-1 of X[j] sin(pi (j+1)(k+1) / (n+1))). | |
2863 | |
2864 RODFT10 (DST-II) | |
2865 ................ | |
2866 | |
2867 An `RODFT10' transform (type-II DST) in FFTW is defined by: Y[k] = 2 | |
2868 (sum for j = 0 to n-1 of X[j] sin(pi (j+1/2) (k+1) / n)). | |
2869 | |
2870 RODFT01 (DST-III) | |
2871 ................. | |
2872 | |
2873 An `RODFT01' transform (type-III DST) in FFTW is defined by: Y[k] = | |
2874 (-1)^k X[n-1] + 2 (sum for j = 0 to n-2 of X[j] sin(pi (j+1) (k+1/2) / | |
2875 n)). In the case of n=1, this reduces to Y[0] = X[0]. | |
2876 | |
2877 RODFT11 (DST-IV) | |
2878 ................ | |
2879 | |
2880 An `RODFT11' transform (type-IV DST) in FFTW is defined by: Y[k] = 2 | |
2881 (sum for j = 0 to n-1 of X[j] sin(pi (j+1/2) (k+1/2) / n)). | |
2882 | |
2883 Inverses and Normalization | |
2884 .......................... | |
2885 | |
2886 These definitions correspond directly to the unnormalized DFTs used | |
2887 elsewhere in FFTW (hence the factors of 2 in front of the summations). | |
2888 The unnormalized inverse of `RODFT00' is `RODFT00', of `RODFT10' is | |
2889 `RODFT01' and vice versa, and of `RODFT11' is `RODFT11'. Each | |
2890 unnormalized inverse results in the original array multiplied by N, | |
2891 where N is the _logical_ DFT size. For `RODFT00', N=2(n+1); otherwise, | |
2892 N=2n. | |
2893 | |
2894 In defining the discrete sine transform, some authors also include | |
2895 additional factors of sqrt(2) (or its inverse) multiplying selected | |
2896 inputs and/or outputs. This is a mostly cosmetic change that makes the | |
2897 transform orthogonal, but sacrifices the direct equivalence to an | |
2898 antisymmetric DFT. | |
2899 | |
2900 | |
2901 File: fftw3.info, Node: 1d Discrete Hartley Transforms (DHTs), Next: Multi-dimensional Transforms, Prev: 1d Real-odd DFTs (DSTs), Up: What FFTW Really Computes | |
2902 | |
2903 4.8.5 1d Discrete Hartley Transforms (DHTs) | |
2904 ------------------------------------------- | |
2905 | |
2906 The discrete Hartley transform (DHT) of a 1d real array X of size n | |
2907 computes a real array Y of the same size, where: Y[k] = sum for j = 0 to (n - 1) of X[j] * [cos(2 pi j k / n) + sin(2 pi j k / n)]. | |
2908 FFTW computes an unnormalized transform, in that there is no | |
2909 coefficient in front of the summation in the DHT. In other words, | |
2910 applying the transform twice (the DHT is its own inverse) will multiply | |
2911 the input by n. | |
2912 | |
2913 | |
2914 File: fftw3.info, Node: Multi-dimensional Transforms, Prev: 1d Discrete Hartley Transforms (DHTs), Up: What FFTW Really Computes | |
2915 | |
2916 4.8.6 Multi-dimensional Transforms | |
2917 ---------------------------------- | |
2918 | |
2919 The multi-dimensional transforms of FFTW, in general, compute simply the | |
2920 separable product of the given 1d transform along each dimension of the | |
2921 array. Since each of these transforms is unnormalized, computing the | |
2922 forward followed by the backward/inverse multi-dimensional transform | |
2923 will result in the original array scaled by the product of the | |
2924 normalization factors for each dimension (e.g. the product of the | |
2925 dimension sizes, for a multi-dimensional DFT). | |
2926 | |
2927 The definition of FFTW's multi-dimensional DFT of real data (r2c) | |
2928 deserves special attention. In this case, we logically compute the full | |
2929 multi-dimensional DFT of the input data; since the input data are purely | |
2930 real, the output data have the Hermitian symmetry and therefore only one | |
2931 non-redundant half need be stored. More specifically, for an n[0] x | |
2932 n[1] x n[2] x ... x n[d-1] multi-dimensional real-input DFT, the full | |
2933 (logical) complex output array Y[k[0], k[1], ..., k[d-1]] has the | |
2934 symmetry: Y[k[0], k[1], ..., k[d-1]] = Y[n[0] - k[0], n[1] - k[1], ..., | |
2935 n[d-1] - k[d-1]]* (where each dimension is periodic). Because of this | |
2936 symmetry, we only store the k[d-1] = 0...n[d-1]/2 elements of the | |
2937 _last_ dimension (division by 2 is rounded down). (We could instead | |
2938 have cut any other dimension in half, but the last dimension proved | |
2939 computationally convenient.) This results in the peculiar array format | |
2940 described in more detail by *note Real-data DFT Array Format::. | |
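The storage consequence of cutting the last dimension in half can be computed directly. The helper below, with the hypothetical name `r2c_complex_count', returns how many complex numbers are stored for an n[0] x n[1] x ... x n[d-1] real-input transform: the product of all dimensions except the last, times n[d-1]/2 + 1.

```c
#include <assert.h>
#include <stddef.h>

/* Number of complex outputs stored by a multi-dimensional r2c transform
   with dimensions n[0..rank-1]; only the last dimension is halved. */
size_t r2c_complex_count(int rank, const int *n)
{
    size_t count = 1;
    assert(rank > 0 && n != NULL);
    for (int i = 0; i < rank - 1; ++i) {
        assert(n[i] > 0);
        count *= (size_t)n[i];
    }
    assert(n[rank - 1] > 0);
    return count * (size_t)(n[rank - 1] / 2 + 1);   /* division rounds down */
}
```

For example, a 4 x 6 real input yields 4 * (6/2 + 1) = 16 stored complex outputs instead of the 24 of the full logical array.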
2941 | |
2942 The multi-dimensional c2r transform is simply the unnormalized | |
2943 inverse of the r2c transform, i.e. it is the same as FFTW's complex | |
2944 backward multi-dimensional DFT, operating on a Hermitian input array in | |
2945 the peculiar format mentioned above and outputting a real array (since | |
2946 the DFT output is purely real). | |
2947 | |
2948 We should remind the user that the separable product of 1d transforms | |
2949 along each dimension, as computed by FFTW, is not always the same thing | |
2950 as the usual multi-dimensional transform. A multi-dimensional `R2HC' | |
2951 (or `HC2R') transform is not identical to the multi-dimensional DFT, | |
2952 requiring some post-processing to combine the requisite real and | |
2953 imaginary parts, as was described in *note The Halfcomplex-format | |
2954 DFT::. Likewise, FFTW's multidimensional `FFTW_DHT' r2r transform is | |
2955 not the same thing as the logical multi-dimensional discrete Hartley | |
2956 transform defined in the literature, as discussed in *note The Discrete | |
2957 Hartley Transform::. | |
2958 | |
2959 | |
2960 File: fftw3.info, Node: Multi-threaded FFTW, Next: Distributed-memory FFTW with MPI, Prev: FFTW Reference, Up: Top | |
2961 | |
2962 5 Multi-threaded FFTW | |
2963 ********************* | |
2964 | |
2965 In this chapter we document the parallel FFTW routines for | |
2966 shared-memory parallel hardware. These routines, which support | |
2967 parallel one- and multi-dimensional transforms of both real and complex | |
2968 data, are the easiest way to take advantage of multiple processors with | |
2969 FFTW. They work just like the corresponding uniprocessor transform | |
2970 routines, except that you have an extra initialization routine to call, | |
2971 and there is a routine to set the number of threads to employ. Any | |
2972 program that uses the uniprocessor FFTW can therefore be trivially | |
2973 modified to use the multi-threaded FFTW. | |
2974 | |
2975 A shared-memory machine is one in which all CPUs can directly access | |
2976 the same main memory, and such machines are now common due to the | |
2977 ubiquity of multi-core CPUs. FFTW's multi-threading support allows you | |
2978 to utilize these additional CPUs transparently from a single program. | |
2979 However, this does not necessarily translate into performance | |
2980 gains--when multiple threads/CPUs are employed, there is an overhead | |
2981 required for synchronization that may outweigh the computational | |
2982 parallelism. Therefore, you can only benefit from threads if your | |
2983 problem is sufficiently large. | |
2984 | |
2985 * Menu: | |
2986 | |
2987 * Installation and Supported Hardware/Software:: | |
2988 * Usage of Multi-threaded FFTW:: | |
2989 * How Many Threads to Use?:: | |
2990 * Thread safety:: | |
2991 | |
2992 | |
2993 File: fftw3.info, Node: Installation and Supported Hardware/Software, Next: Usage of Multi-threaded FFTW, Prev: Multi-threaded FFTW, Up: Multi-threaded FFTW | |
2994 | |
2995 5.1 Installation and Supported Hardware/Software | |
2996 ================================================ | |
2997 | |
2998 All of the FFTW threads code is located in the `threads' subdirectory | |
2999 of the FFTW package. On Unix systems, the FFTW threads libraries and | |
3000 header files can be automatically configured, compiled, and installed | |
3001 along with the uniprocessor FFTW libraries simply by including | |
3002 `--enable-threads' in the flags to the `configure' script (*note | |
3003 Installation on Unix::), or `--enable-openmp' to use OpenMP | |
3004 (http://www.openmp.org) threads. | |
3005 | |
3006 The threads routines require your operating system to have some sort | |
3007 of shared-memory threads support. Specifically, the FFTW threads | |
3008 package works with POSIX threads (available on most Unix variants, from | |
3009 GNU/Linux to MacOS X) and Win32 threads. OpenMP threads, which are | |
3010 supported in many common compilers (e.g. gcc) are also supported, and | |
3011 may give better performance on some systems. (OpenMP threads are also | |
3012 useful if you are employing OpenMP in your own code, in order to | |
3013 minimize conflicts between threading models.) If you have a | |
3014 shared-memory machine that uses a different threads API, it should be a | |
3015 simple matter of programming to include support for it; see the file | |
3016 `threads/threads.c' for more detail. | |
3017 | |
3018 You can compile FFTW with _both_ `--enable-threads' and | |
3019 `--enable-openmp' at the same time, since they install libraries with | |
3020 different names (`fftw3_threads' and `fftw3_omp', as described below). | |
3021 However, your programs may only link to _one_ of these two libraries at | |
3022 a time. | |
3023 | |
3024 Ideally, of course, you should also have multiple processors in | |
3025 order to get any benefit from the threaded transforms. | |
3026 | |
3027 | |
3028 File: fftw3.info, Node: Usage of Multi-threaded FFTW, Next: How Many Threads to Use?, Prev: Installation and Supported Hardware/Software, Up: Multi-threaded FFTW | |
3029 | |
3030 5.2 Usage of Multi-threaded FFTW | |
3031 ================================ | |
3032 | |
3033 Here, it is assumed that the reader is already familiar with the usage | |
3034 of the uniprocessor FFTW routines, described elsewhere in this manual. | |
3035 We only describe what one has to change in order to use the | |
3036 multi-threaded routines. | |
3037 | |
3038 First, programs using the parallel complex transforms should be | |
3039 linked with `-lfftw3_threads -lfftw3 -lm' on Unix, or `-lfftw3_omp | |
3040 -lfftw3 -lm' if you compiled with OpenMP. You will also need to link | |
3041 with whatever library is responsible for threads on your system (e.g. | |
3042 `-lpthread' on GNU/Linux) or include whatever compiler flag enables | |
3043 OpenMP (e.g. `-fopenmp' with gcc). | |
3044 | |
3045 Second, before calling _any_ FFTW routines, you should call the | |
3046 function: | |
3047 | |
3048 int fftw_init_threads(void); | |
3049 | |
3050 This function, which need only be called once, performs any one-time | |
3051 initialization required to use threads on your system. It returns zero | |
3052 if there was some error (which should not happen under normal | |
3053 circumstances) and a non-zero value otherwise. | |
3054 | |
3055 Third, before creating a plan that you want to parallelize, you | |
3056 should call: | |
3057 | |
3058 void fftw_plan_with_nthreads(int nthreads); | |
3059 | |
3060 The `nthreads' argument indicates the number of threads you want | |
3061 FFTW to use (or actually, the maximum number). All plans subsequently | |
3062 created with any planner routine will use that many threads. You can | |
3063 call `fftw_plan_with_nthreads', create some plans, call | |
3064 `fftw_plan_with_nthreads' again with a different argument, and create | |
3065 some more plans for a new number of threads. Plans already created | |
3066 before a call to `fftw_plan_with_nthreads' are unaffected. If you pass | |
3067 an `nthreads' argument of `1' (the default), threads are disabled for | |
3068 subsequent plans. | |
3069 | |
3070 With OpenMP, to configure FFTW to use all of the currently running | |
3071 OpenMP threads (set by `omp_set_num_threads(nthreads)' or by the | |
3072 `OMP_NUM_THREADS' environment variable), you can do: | |
3073 `fftw_plan_with_nthreads(omp_get_max_threads())'. (The `omp_' OpenMP | |
3074 functions are declared via `#include <omp.h>'.) | |
3075 | |
3076 Given a plan, you then execute it as usual with | |
3077 `fftw_execute(plan)', and the execution will use the number of threads | |
3078 specified when the plan was created. When done, you destroy it as | |
3079 usual with `fftw_destroy_plan'. As described in *note Thread safety::, | |
3080 plan _execution_ is thread-safe, but plan creation and destruction are | |
3081 _not_: you should create/destroy plans only from a single thread, but | |
3082 can safely execute multiple plans in parallel. | |
3083 | |
3084 There is one additional routine: if you want to get rid of all memory | |
3085 and other resources allocated internally by FFTW, you can call: | |
3086 | |
3087 void fftw_cleanup_threads(void); | |
3088 | |
3089 which is much like the `fftw_cleanup()' function except that it also | |
3090 gets rid of threads-related data. You must _not_ execute any | |
3091 previously created plans after calling this function. | |
3092 | |
3093 We should also mention one other restriction: if you save wisdom | |
3094 from a program using the multi-threaded FFTW, that wisdom _cannot be | |
3095 used_ by a program using only the single-threaded FFTW (i.e. not calling | |
3096 `fftw_init_threads'). *Note Words of Wisdom-Saving Plans::. | |
3097 | |
3098 | |
3099 File: fftw3.info, Node: How Many Threads to Use?, Next: Thread safety, Prev: Usage of Multi-threaded FFTW, Up: Multi-threaded FFTW | |
3100 | |
3101 5.3 How Many Threads to Use? | |
3102 ============================ | |
3103 | |
3104 There is a fair amount of overhead involved in synchronizing threads, | |
3105 so the optimal number of threads to use depends upon the size of the | |
3106 transform as well as on the number of processors you have. | |
3107 | |
3108 As a general rule, you don't want to use more threads than you have | |
3109 processors. (Using more threads will work, but there will be extra | |
3110 overhead with no benefit.) In fact, if the problem size is too small, | |
3111 you may want to use fewer threads than you have processors. | |
3112 | |
3113 You will have to experiment with your system to see what level of | |
3114 parallelization is best for your problem size. Typically, the problem | |
3115 will have to involve at least a few thousand data points before threads | |
3116 become beneficial. If you plan with `FFTW_PATIENT', it will | |
3117 automatically disable threads for sizes that don't benefit from | |
3118 parallelization. | |
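
As a starting point for such experiments, the rules of thumb above can be written down as a small helper. This is only a sketch: `suggest_nthreads' and the cutoff `MIN_POINTS_PER_THREAD' are hypothetical names and values chosen for illustration, not part of FFTW, and the right cutoff for a given machine has to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff, not an FFTW constant: assume each thread needs
   at least this many data points before it pays for its overhead. */
#define MIN_POINTS_PER_THREAD 2048

/* Suggest a first guess for the argument of fftw_plan_with_nthreads(). */
int suggest_nthreads(size_t n_points, int n_processors)
{
    assert(n_processors >= 1);

    /* Number of threads that the problem size alone can justify. */
    size_t by_size = n_points / MIN_POINTS_PER_THREAD;
    if (by_size < 1)
        by_size = 1;                       /* small transform: one thread */

    /* Never use more threads than there are processors. */
    if (by_size > (size_t)n_processors)
        by_size = (size_t)n_processors;

    return (int)by_size;
}
```

The result would be passed to `fftw_plan_with_nthreads' before planning, and then adjusted after timing the resulting plans.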
3119 | |
3120 | |
3121 File: fftw3.info, Node: Thread safety, Prev: How Many Threads to Use?, Up: Multi-threaded FFTW | |
3122 | |
3123 5.4 Thread safety | |
3124 ================= | |
3125 | |
3126 Users writing multi-threaded programs (including OpenMP) must concern | |
3127 themselves with the "thread safety" of the libraries they use--that is, | |
3128 whether it is safe to call routines in parallel from multiple threads. | |
3129 FFTW can be used in such an environment, but some care must be taken | |
3130 because the planner routines share data (e.g. wisdom and trigonometric | |
3131 tables) between calls and plans. | |
3132 | |
3133 The upshot is that the only thread-safe (re-entrant) routine in FFTW | |
3134 is `fftw_execute' (and the new-array variants thereof). All other | |
3135 routines (e.g. the planner) should only be called from one thread at a | |
3136 time. So, for example, you can wrap a semaphore lock around any calls | |
3137 to the planner; even more simply, you can just create all of your plans | |
3138 from one thread. We do not think this should be an important | |
3139 restriction (FFTW is designed for the situation where the only | |
3140 performance-sensitive code is the actual execution of the transform), | |
3141 and the benefits of shared data between plans are great. | |
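
The "lock around the planner" approach might look like the following sketch. It uses a POSIX mutex rather than a semaphore, and it is written against a generic callback so that it does not depend on any particular planner routine. The names `call_planner_locked', `planner_fn', and `count_calls' are hypothetical; in a real program the callback would call an FFTW planner routine and return the resulting plan.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* A callback that creates (or destroys) a plan and returns it as an
   opaque pointer. Hypothetical type, used only for this sketch. */
typedef void *(*planner_fn)(void *arg);

/* One process-wide lock shared by every planner call. */
static pthread_mutex_t planner_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Run make_plan(arg) while holding the lock, so that at most one
   thread is inside the planner at any time. */
void *call_planner_locked(planner_fn make_plan, void *arg)
{
    assert(make_plan != NULL);
    pthread_mutex_lock(&planner_mutex);
    void *plan = make_plan(arg);
    pthread_mutex_unlock(&planner_mutex);
    return plan;
}

/* Stand-in callback that only counts how often it was called; a real
   callback would invoke a planner routine here. */
void *count_calls(void *arg)
{
    int *calls = arg;
    ++*calls;
    return arg;
}
```

Only plan creation and destruction need to go through such a wrapper; calls to `fftw_execute' do not.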
3142 | |
3143 Note also that, since the plan is not modified by `fftw_execute', it | |
3144 is safe to execute the _same plan_ in parallel by multiple threads. | |
3145 However, since a given plan operates by default on a fixed array, you | |
3146 need to use one of the new-array execute functions (*note New-array | |
3147 Execute Functions::) so that different threads compute the transform of | |
3148 different data. | |
3149 | |
3150 (Users should note that these comments only apply to programs using | |
3151 shared-memory threads or OpenMP. Parallelism using MPI or forked | |
3152 processes involves a separate address-space and global variables for | |
3153 each process, and is not susceptible to problems of this sort.) | |
3154 | |
3155 If you configured FFTW with the `--enable-debug' or | |
3156 `--enable-debug-malloc' flags (*note Installation on Unix::), then | |
3157 `fftw_execute' is not thread-safe. These flags are not documented | |
3158 because they are intended only for developing and debugging FFTW, but | |
3159 if you must use `--enable-debug' then you should also specifically pass | |
3160 `--disable-debug-malloc' for `fftw_execute' to be thread-safe. | |
3161 | |
3162 | |
3163 File: fftw3.info, Node: Distributed-memory FFTW with MPI, Next: Calling FFTW from Modern Fortran, Prev: Multi-threaded FFTW, Up: Top | |
3164 | |
3165 6 Distributed-memory FFTW with MPI | |
3166 ********************************** | |
3167 | |
3168 In this chapter we document the parallel FFTW routines for parallel | |
3169 systems supporting the MPI message-passing interface. Unlike the | |
3170 shared-memory threads described in the previous chapter, MPI allows you | |
3171 to use _distributed-memory_ parallelism, where each CPU has its own | |
3172 separate memory, and which can scale up to clusters of many thousands | |
3173 of processors. This capability comes at a price, however: each process | |
3174 only stores a _portion_ of the data to be transformed, which means that | |
3175 the data structures and programming-interface are quite different from | |
3176 the serial or threads versions of FFTW. | |
3177 | |
3178 Distributed-memory parallelism is especially useful when you are | |
3179 transforming arrays so large that they do not fit into the memory of a | |
3180 single processor. The storage per-process required by FFTW's MPI | |
3181 routines is proportional to the total array size divided by the number | |
3182 of processes. Conversely, distributed-memory parallelism can easily | |
3183 pose an unacceptably high communications overhead for small problems; | |
3184 the threshold problem size for which parallelism becomes advantageous | |
3185 will depend on the precise problem you are interested in, your | |
3186 hardware, and your MPI implementation. | |
3187 | |
3188 A note on terminology: in MPI, you divide the data among a set of | |
3189 "processes" which each run in their own memory address space. | |
3190 Generally, each process runs on a different physical processor, but | |
3191 this is not required. A set of processes in MPI is described by an | |
3192 opaque data structure called a "communicator," the most common of which | |
3193 is the predefined communicator `MPI_COMM_WORLD' which refers to _all_ | |
3194 processes. For more information on these and other concepts common to | |
3195 all MPI programs, we refer the reader to the documentation at the MPI | |
3196 home page (http://www.mcs.anl.gov/research/projects/mpi/). | |
3197 | |
3198 We assume in this chapter that the reader is familiar with the usage | |
3199 of the serial (uniprocessor) FFTW, and focus only on the concepts new | |
3200 to the MPI interface. | |
3201 | |
3202 * Menu: | |
3203 | |
3204 * FFTW MPI Installation:: | |
3205 * Linking and Initializing MPI FFTW:: | |
3206 * 2d MPI example:: | |
3207 * MPI Data Distribution:: | |
3208 * Multi-dimensional MPI DFTs of Real Data:: | |
3209 * Other Multi-dimensional Real-data MPI Transforms:: | |
3210 * FFTW MPI Transposes:: | |
3211 * FFTW MPI Wisdom:: | |
3212 * Avoiding MPI Deadlocks:: | |
3213 * FFTW MPI Performance Tips:: | |
3214 * Combining MPI and Threads:: | |
3215 * FFTW MPI Reference:: | |
3216 * FFTW MPI Fortran Interface:: | |
3217 | |
3218 | |
3219 File: fftw3.info, Node: FFTW MPI Installation, Next: Linking and Initializing MPI FFTW, Prev: Distributed-memory FFTW with MPI, Up: Distributed-memory FFTW with MPI | |
3220 | |
3221 6.1 FFTW MPI Installation | |
3222 ========================= | |
3223 | |
3224 All of the FFTW MPI code is located in the `mpi' subdirectory of the | |
3225 FFTW package. On Unix systems, the FFTW MPI libraries and header files | |
3226 are automatically configured, compiled, and installed along with the | |
3227 uniprocessor FFTW libraries simply by including `--enable-mpi' in the | |
3228 flags to the `configure' script (*note Installation on Unix::). | |
3229 | |
3230 Any implementation of the MPI standard, version 1 or later, should | |
3231 work with FFTW. The `configure' script will attempt to automatically | |
3232 detect how to compile and link code using your MPI implementation. In | |
3233 some cases, especially if you have multiple different MPI | |
3234 implementations installed or have an unusual MPI software package, you | |
3235 may need to provide this information explicitly. | |
3236 | |
3237 Most commonly, one compiles MPI code by invoking a special compiler | |
3238 command, typically `mpicc' for C code. The `configure' script knows | |
3239 the most common names for this command, but you can specify the MPI | |
3240 compilation command explicitly by setting the `MPICC' variable, as in | |
3241 `./configure MPICC=mpicc ...'. | |
3242 | |
3243 If, instead of a special compiler command, you need to link a certain | |
3244 library, you can specify the link command via the `MPILIBS' variable, | |
3245 as in `./configure MPILIBS=-lmpi ...'. Note that if your MPI library | |
3246 is installed in a non-standard location (one the compiler does not know | |
3247 about by default), you may also have to specify the location of the | |
3248 library and header files via `LDFLAGS' and `CPPFLAGS' variables, | |
3249 respectively, as in `./configure LDFLAGS=-L/path/to/mpi/libs | |
3250 CPPFLAGS=-I/path/to/mpi/include ...'. | |
3251 | |
3252 | |
3253 File: fftw3.info, Node: Linking and Initializing MPI FFTW, Next: 2d MPI example, Prev: FFTW MPI Installation, Up: Distributed-memory FFTW with MPI | |
3254 | |
3255 6.2 Linking and Initializing MPI FFTW | |
3256 ===================================== | |
3257 | |
3258 Programs using the MPI FFTW routines should be linked with `-lfftw3_mpi | |
3259 -lfftw3 -lm' on Unix in double precision, `-lfftw3f_mpi -lfftw3f -lm' | |
3260 in single precision, and so on (*note Precision::). You will also need | |
3261 to link with whatever library is responsible for MPI on your system; in | |
3262 most MPI implementations, there is a special compiler alias named | |
3263 `mpicc' to compile and link MPI code. | |
3264 | |
3265 Before calling any FFTW routines except possibly `fftw_init_threads' | |
3266 (*note Combining MPI and Threads::), but after calling `MPI_Init', you | |
3267 should call the function: | |
3268 | |
3269 void fftw_mpi_init(void); | |
3270 | |
3271 If, at the end of your program, you want to get rid of all memory and | |
3272 other resources allocated internally by FFTW, for both the serial and | |
3273 MPI routines, you can call: | |
3274 | |
3275 void fftw_mpi_cleanup(void); | |
3276 | |
3277 which is much like the `fftw_cleanup()' function except that it also | |
3278 gets rid of FFTW's MPI-related data. You must _not_ execute any | |
3279 previously created plans after calling this function. | |
3280 | |
3281 | |
3282 File: fftw3.info, Node: 2d MPI example, Next: MPI Data Distribution, Prev: Linking and Initializing MPI FFTW, Up: Distributed-memory FFTW with MPI | |
3283 | |
3284 6.3 2d MPI example | |
3285 ================== | |
3286 | |
3287 Before we document the FFTW MPI interface in detail, we begin with a | |
3288 simple example outlining how one would perform a two-dimensional `N0' | |
3289 by `N1' complex DFT. | |
3290 | |
3291 #include <fftw3-mpi.h> | |
3292 | |
3293 int main(int argc, char **argv) | |
3294 { | |
3295 const ptrdiff_t N0 = ..., N1 = ...; | |
3296 fftw_plan plan; | |
3297 fftw_complex *data; | |
3298 ptrdiff_t alloc_local, local_n0, local_0_start, i, j; | |
3299 | |
3300 MPI_Init(&argc, &argv); | |
3301 fftw_mpi_init(); | |
3302 | |
3303 /* get local data size and allocate */ | |
3304 alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD, | |
3305 &local_n0, &local_0_start); | |
3306 data = fftw_alloc_complex(alloc_local); | |
3307 | |
3308 /* create plan for in-place forward DFT */ | |
3309 plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD, | |
3310 FFTW_FORWARD, FFTW_ESTIMATE); | |
3311 | |
3312 /* initialize data to some function my_function(x,y) */ | |
3313 for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j) | |
3314 data[i*N1 + j] = my_function(local_0_start + i, j); | |
3315 | |
3316 /* compute transforms, in-place, as many times as desired */ | |
3317 fftw_execute(plan); | |
3318 | |
3319 fftw_destroy_plan(plan); | |
3320 | |
3321 MPI_Finalize(); | |
3322 } | |
3323 | |
3324 As can be seen above, the MPI interface follows the same basic style | |
3325 of allocate/plan/execute/destroy as the serial FFTW routines. All of | |
3326 the MPI-specific routines are prefixed with `fftw_mpi_' instead of | |
3327 `fftw_'. There are a few important differences, however: | |
3328 | |
3329 First, we must call `fftw_mpi_init()' after calling `MPI_Init' | |
3330 (required in all MPI programs) and before calling any other `fftw_mpi_' | |
3331 routine. | |
3332 | |
3333 Second, when we create the plan with `fftw_mpi_plan_dft_2d', | |
3334 analogous to `fftw_plan_dft_2d', we pass an additional argument: the | |
3335 communicator, indicating which processes will participate in the | |
3336 transform (here `MPI_COMM_WORLD', indicating all processes). Whenever | |
3337 you create, execute, or destroy a plan for an MPI transform, you must | |
3338 call the corresponding FFTW routine on _all_ processes in the | |
3339 communicator for that transform. (That is, these are _collective_ | |
3340 calls.) Note that the plan for the MPI transform uses the standard | |
3341 `fftw_execute' and `fftw_destroy_plan' routines (on the other hand, there | |
3342 are MPI-specific new-array execute functions documented below). | |
3343 | |
3344 Third, all of the FFTW MPI routines take `ptrdiff_t' arguments | |
3345 instead of `int' as for the serial FFTW. `ptrdiff_t' is a standard C | |
3346 integer type which is (at least) 32 bits wide on a 32-bit machine and | |
3347 64 bits wide on a 64-bit machine. This is to make it easy to specify | |
3348 very large parallel transforms on a 64-bit machine. (You can specify | |
3349 64-bit transform sizes in the serial FFTW, too, but only by using the | |
3350 `guru64' planner interface. *Note 64-bit Guru Interface::.) | |
3351 | |
3352 Fourth, and most importantly, you don't allocate the entire | |
3353 two-dimensional array on each process. Instead, you call | |
3354 `fftw_mpi_local_size_2d' to find out what _portion_ of the array | |
3355 resides on each process, and how much space to allocate. Here, the | |
3356 portion of the array on each process is a `local_n0' by `N1' slice of | |
3357 the total array, starting at index `local_0_start'. The total number | |
3358 of `fftw_complex' numbers to allocate is given by the `alloc_local' | |
3359 return value, which _may_ be greater than `local_n0 * N1' (in case some | |
3360 intermediate calculations require additional storage). The data | |
3361 distribution in FFTW's MPI interface is described in more detail by the | |
3362 next section. | |
3363 | |
3364 Given the portion of the array that resides on the local process, it | |
3365 is straightforward to initialize the data (here to a function | |
3366 `my_function') and otherwise manipulate it. Of course, at the end of | |
3367 the program you may want to output the data somehow, but synchronizing | |
3368 this output is up to you and is beyond the scope of this manual. (One | |
3369 good way to output a large multi-dimensional distributed array in MPI | |
3370 to a portable binary file is to use the free HDF5 library; see the HDF | |
3371 home page (http://www.hdfgroup.org/).) | |
3372 | |
3373 | |
3374 File: fftw3.info, Node: MPI Data Distribution, Next: Multi-dimensional MPI DFTs of Real Data, Prev: 2d MPI example, Up: Distributed-memory FFTW with MPI | |
3375 | |
3376 6.4 MPI Data Distribution | |
3377 ========================= | |
3378 | |
3379 The most important concept to understand in using FFTW's MPI interface | |
3380 is the data distribution. With a serial or multithreaded FFT, all of | |
3381 the inputs and outputs are stored as a single contiguous chunk of | |
3382 memory. With a distributed-memory FFT, the inputs and outputs are | |
3383 broken into disjoint blocks, one per process. | |
3384 | |
3385 In particular, FFTW uses a _1d block distribution_ of the data, | |
3386 distributed along the _first dimension_. For example, if you want to | |
3387 perform a 100 x 200 complex DFT, distributed over 4 processes, each | |
3388 process will get a 25 x 200 slice of the data. That is, process 0 | |
3389 will get rows 0 through 24, process 1 will get rows 25 through 49, | |
3390 process 2 will get rows 50 through 74, and process 3 will get rows 75 | |
3391 through 99. If you take the same array but distribute it over 3 | |
3392 processes, then it is not evenly divisible so the different processes | |
3393 will have unequal chunks. FFTW's default choice in this case is to | |
3394 assign 34 rows to processes 0 and 1, and 32 rows to process 2. | |
3395 | |
3396 FFTW provides several `fftw_mpi_local_size' routines that you can | |
3397 call to find out what portion of an array is stored on the current | |
3398 process. In most cases, you should use the default block sizes picked | |
3399 by FFTW, but it is also possible to specify your own block size. For | |
3400 example, with a 100 x 200 array on three processes, you can tell FFTW | |
3401 to use a block size of 40, which would assign 40 rows to processes 0 | |
3402 and 1, and 20 rows to process 2. FFTW's default is to divide the data | |
3403 equally among the processes if possible, and as best it can otherwise. | |
3404 The rows are always assigned in "rank order," i.e. process 0 gets the | |
3405 first block of rows, then process 1, and so on. (You can change this | |
3406 by using `MPI_Comm_split' to create a new communicator with re-ordered | |
3407 processes.) However, you should always call the `fftw_mpi_local_size' | |
3408 routines, if possible, rather than trying to predict FFTW's | |
3409 distribution choices. | |
3410 | |
3411 In particular, it is critical that you allocate the storage size that | |
3412 is returned by `fftw_mpi_local_size', which is _not_ necessarily the | |
3413 size of the local slice of the array. The reason is that intermediate | |
3414 steps of FFTW's algorithms involve transposing the array and | |
3415 redistributing the data, so at these intermediate steps FFTW may | |
3416 require more local storage space (albeit always proportional to the | |
3417 total size divided by the number of processes). The | |
3418 `fftw_mpi_local_size' functions know how much storage is required for | |
3419 these intermediate steps and tell you the correct amount to allocate. | |
3420 | |
3421 * Menu: | |
3422 | |
3423 * Basic and advanced distribution interfaces:: | |
3424 * Load balancing:: | |
3425 * Transposed distributions:: | |
3426 * One-dimensional distributions:: | |
3427 | |
3428 | |
3429 File: fftw3.info, Node: Basic and advanced distribution interfaces, Next: Load balancing, Prev: MPI Data Distribution, Up: MPI Data Distribution | |
3430 | |
3431 6.4.1 Basic and advanced distribution interfaces | |
3432 ------------------------------------------------ | |
3433 | |
3434 As with the planner interface, the `fftw_mpi_local_size' distribution | |
3435 interface is broken into basic and advanced (`_many') interfaces, where | |
3436 the latter allows you to specify the block size manually and also to | |
3437 request block sizes when computing multiple transforms simultaneously. | |
3438 These functions are documented more exhaustively by the FFTW MPI | |
3439 Reference, but we summarize the basic ideas here using a couple of | |
3440 two-dimensional examples. | |
3441 | |
3442 For the 100 x 200 complex-DFT example, above, we would find the | |
3443 distribution by calling the following function in the basic interface: | |
3444 | |
3445 ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, | |
3446 ptrdiff_t *local_n0, ptrdiff_t *local_0_start); | |
3447 | |
3448 Given the total size of the data to be transformed (here, `n0 = 100' | |
3449 and `n1 = 200') and an MPI communicator (`comm'), this function | |
3450 provides three numbers. | |
3451 | |
3452 First, it describes the shape of the local data: the current process | |
3453 should store a `local_n0' by `n1' slice of the overall dataset, in | |
3454 row-major order (`n1' dimension contiguous), starting at index | |
3455 `local_0_start'. That is, if the total dataset is viewed as a `n0' by | |
3456 `n1' matrix, the current process should store the rows `local_0_start' | |
3457 to `local_0_start+local_n0-1'. Obviously, if you are running with only | |
3458 a single MPI process, that process will store the entire array: | |
3459 `local_0_start' will be zero and `local_n0' will be `n0'. *Note | |
3460 Row-major Format::. | |
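
To make the local layout concrete, the following sketch maps a global element (i, j) of the `n0' by `n1' array to its position in the local array of the process that owns row i. The helper names `owns_row' and `local_offset' are hypothetical and not part of FFTW; the values of `local_n0' and `local_0_start' are assumed to come from `fftw_mpi_local_size_2d'.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if global row i lies in the local slice
   [local_0_start, local_0_start + local_n0). */
int owns_row(ptrdiff_t i, ptrdiff_t local_n0, ptrdiff_t local_0_start)
{
    return i >= local_0_start && i < local_0_start + local_n0;
}

/* Offset of global element (i, j) within the local array, which holds
   local_n0 rows of length n1 in row-major order. */
ptrdiff_t local_offset(ptrdiff_t i, ptrdiff_t j, ptrdiff_t n1,
                       ptrdiff_t local_n0, ptrdiff_t local_0_start)
{
    assert(owns_row(i, local_n0, local_0_start));
    assert(j >= 0 && j < n1);
    return (i - local_0_start) * n1 + j;   /* rows are counted from the slice start */
}
```

For the 100 x 200 example on 4 processes, process 1 holds rows 25 through 49, so its element (26, 3) sits at local offset 203.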
3461 | |
3462 Second, the return value is the total number of data elements (e.g., | |
3463 complex numbers for a complex DFT) that should be allocated for the | |
3464 input and output arrays on the current process (ideally with | |
3465 `fftw_malloc' or an `fftw_alloc' function, to ensure optimal | |
3466 alignment). It might seem that this should always be equal to | |
3467 `local_n0 * n1', but this is _not_ the case. FFTW's distributed FFT | |
3468 algorithms require data redistributions at intermediate stages of the | |
3469 transform, and in some circumstances this may require slightly larger | |
3470 local storage. This is discussed in more detail below, under *note | |
3471 Load balancing::. | |
3472 | |
3473 The advanced-interface `local_size' function for multidimensional | |
3474 transforms returns the same three things (`local_n0', `local_0_start', | |
3475 and the total number of elements to allocate), but takes more inputs: | |
3476 | |
3477 ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, | |
3478 ptrdiff_t howmany, | |
3479 ptrdiff_t block0, | |
3480 MPI_Comm comm, | |
3481 ptrdiff_t *local_n0, | |
3482 ptrdiff_t *local_0_start); | |
3483 | |
3484 The two-dimensional case above corresponds to `rnk = 2' and an array | |
3485 `n' of length 2 with `n[0] = n0' and `n[1] = n1'. This routine is for | |
3486 any `rnk > 1'; one-dimensional transforms have their own interface | |
3487 because they work slightly differently, as discussed below. | |
3488 | |
3489 First, the advanced interface allows you to perform multiple | |
3490 transforms at once, of interleaved data, as specified by the `howmany' | |
3491 parameter. (`howmany' is 1 for a single transform.) | |
3492 | |
3493 Second, here you can specify your desired block size in the `n0' | |
3494 dimension, `block0'. To use FFTW's default block size, pass | |
3495 `FFTW_MPI_DEFAULT_BLOCK' (0) for `block0'. Otherwise, on `P' | |
3496 processes, FFTW will return `local_n0' equal to `block0' on the first | |
3497 `n0 / block0' processes (rounded down), return `local_n0' equal to `n0 - | |
3498 block0 * (n0 / block0)' on the next process, and `local_n0' equal to | |
3499 zero on any remaining processes. In general, we recommend using the | |
3500 default block size (which corresponds to `n0 / P', rounded up). | |
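
The block rule can be modelled in a few lines of C, which is handy for checking the examples in this chapter by hand. This is an illustrative model only, under the assumption that the chosen block size times `P' is at least `n0' (so that the blocks cover every row); `model_block_slice' is a hypothetical name, and real programs should obtain these numbers from the `fftw_mpi_local_size' routines instead.

```c
#include <assert.h>
#include <stddef.h>

/* Model of the block distribution along the first dimension.
   block0 == 0 selects the default block size, n0 / P rounded up.
   Assumption of this sketch: block0 * P >= n0. */
void model_block_slice(ptrdiff_t n0, int P, ptrdiff_t block0, int rank,
                       ptrdiff_t *local_n0, ptrdiff_t *local_0_start)
{
    assert(n0 > 0 && P > 0 && rank >= 0 && rank < P && block0 >= 0);
    if (block0 == 0)
        block0 = (n0 + P - 1) / P;          /* default: n0 / P, rounded up */
    assert(block0 * P >= n0);               /* blocks must cover all rows */

    ptrdiff_t start = (ptrdiff_t)rank * block0;  /* blocks go out in rank order */
    if (start >= n0) {
        *local_n0 = 0;                      /* this rank receives no rows */
        *local_0_start = 0;                 /* unused when local_n0 is zero */
    } else {
        ptrdiff_t left = n0 - start;        /* rows not yet assigned */
        *local_n0 = left < block0 ? left : block0;
        *local_0_start = start;
    }
}
```

With `n0 = 100' and `P = 3', the model gives 34, 34, and 32 rows for the default block size, and 40, 40, and 20 rows for `block0 = 40'.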
3501 | |
3502 For example, suppose you have `P = 4' processes and `n0 = 21'. The | |
3503 default will be a block size of `6', which will give `local_n0 = 6' on | |
3504 the first three processes and `local_n0 = 3' on the last process. | |
3505 Instead, however, you could specify `block0 = 5' if you wanted, which | |
3506 would give `local_n0 = 5' on processes 0 to 2, `local_n0 = 6' on | |
3507 process 3. (This choice, while it may look superficially more | |
3508 "balanced," has the same critical path as FFTW's default but requires | |
3509 more communications.) | |
3510 | |
3511 | |
3512 File: fftw3.info, Node: Load balancing, Next: Transposed distributions, Prev: Basic and advanced distribution interfaces, Up: MPI Data Distribution | |
3513 | |
3514 6.4.2 Load balancing | |
3515 -------------------- | |
3516 | |
3517 Ideally, when you parallelize a transform over some P processes, each | |
3518 process should end up with work that takes equal time. Otherwise, all | |
3519 of the processes end up waiting on whichever process is slowest. This | |
3520 goal is known as "load balancing." In this section, we describe the | |
3521 circumstances under which FFTW is able to load-balance well, and in | |
3522 particular how you should choose your transform size in order to load | |
3523 balance. | |
3524 | |
3525 Load balancing is especially difficult when you are parallelizing | |
3526 over heterogeneous machines; for example, if one of your processors is an | |
3527 old 486 and another is a Pentium IV, obviously you should give the | |
3528 Pentium more work to do than the 486 since the latter is much slower. | |
3529 FFTW does not deal with this problem, however--it assumes that your | |
3530 processes run on hardware of comparable speed, and that the goal is | |
3531 therefore to divide the problem as equally as possible. | |
3532 | |
3533 For a multi-dimensional complex DFT, FFTW can divide the problem | |
3534 equally among the processes if: (i) the _first_ dimension `n0' is | |
3535 divisible by P; and (ii), the _product_ of the subsequent dimensions is | |
3536 divisible by P. (For the advanced interface, where you can specify | |
3537 multiple simultaneous transforms via some "vector" length `howmany', a | |
3538 factor of `howmany' is included in the product of the subsequent | |
3539 dimensions.) | |
3540 | |
3541 For a one-dimensional complex DFT, the length `N' of the data should | |
3542 be divisible by P _squared_ to be able to divide the problem equally | |
3543 among the processes. | |
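
These divisibility conditions are easy to check before choosing a transform size. The sketch below does so in plain C; the function names `divides_equally_nd' and `divides_equally_1d' are hypothetical and not part of FFTW, and the functions test only the conditions stated above, not actual run times.

```c
#include <assert.h>
#include <stddef.h>

/* Multi-dimensional case: n[0] must be divisible by P, and so must
   howmany * n[1] * ... * n[rnk-1]. */
int divides_equally_nd(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
                       ptrdiff_t P)
{
    assert(rnk > 1 && n != NULL && howmany >= 1 && P >= 1);

    /* Track the product modulo P so that large sizes cannot overflow. */
    ptrdiff_t rest = howmany % P;
    for (int d = 1; d < rnk; ++d)
        rest = (rest * (n[d] % P)) % P;

    return n[0] % P == 0 && rest == 0;
}

/* One-dimensional case: N must be divisible by P squared. */
int divides_equally_1d(ptrdiff_t N, ptrdiff_t P)
{
    assert(N >= 1 && P >= 1);
    return N % (P * P) == 0;
}
```

For instance, a 100 x 200 transform satisfies both conditions on 4 processes but not on 3, since 100 is not divisible by 3.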
3544 | |
3545 | |
3546 File: fftw3.info, Node: Transposed distributions, Next: One-dimensional distributions, Prev: Load balancing, Up: MPI Data Distribution | |
3547 | |
3548 6.4.3 Transposed distributions | |
3549 ------------------------------ | |
3550 | |
3551 Internally, FFTW's MPI transform algorithms work by first computing | |
3552 transforms of the data local to each process, then by globally | |
3553 _transposing_ the data in some fashion to redistribute the data among | |
3554 the processes, transforming the new data local to each process, and | |
3555 transposing back. For example, a two-dimensional `n0' by `n1' array, | |
3556 distributed across the `n0' dimension, is transformed by: (i) | |
3557 transforming the `n1' dimension, which is local to each process; (ii) | |
3558 transposing to an `n1' by `n0' array, distributed across the `n1' | |
3559 dimension; (iii) transforming the `n0' dimension, which is now local to | |
3560 each process; (iv) transposing back. | |
3561 | |
3562 However, in many applications it is acceptable to compute a | |
3563 multidimensional DFT whose results are produced in transposed order | |
3564 (e.g., `n1' by `n0' in two dimensions). This provides a significant | |
3565 performance advantage, because it means that the final transposition | |
3566 step can be omitted. FFTW supports this optimization, which you | |
3567 specify by passing the flag `FFTW_MPI_TRANSPOSED_OUT' to the planner | |
3568 routines. To compute the inverse transform of transposed output, you | |
3569 specify `FFTW_MPI_TRANSPOSED_IN' to tell it that the input is | |
3570 transposed. In this section, we explain how to interpret the output | |
3571 format of such a transform. | |
3572 | |
3573 Suppose you are transforming multi-dimensional data with (at | |
3574 least two) dimensions n[0] x n[1] x n[2] x ... x n[d-1] . As always, | |
3575 it is distributed along the first dimension n[0] . Now, if we compute | |
3576 its DFT with the `FFTW_MPI_TRANSPOSED_OUT' flag, the resulting output | |
3577 data are stored with the first _two_ dimensions transposed: n[1] x n[0] | |
3578 x n[2] x ... x n[d-1] , distributed along the n[1] dimension. | |
3579 Conversely, if we take the n[1] x n[0] x n[2] x ... x n[d-1] data and | |
3580 transform it with the `FFTW_MPI_TRANSPOSED_IN' flag, then the format | |
3581 goes back to the original n[0] x n[1] x n[2] x ... x n[d-1] array. | |
3582 | |
3583 There are two ways to find the portion of the transposed array that | |
3584 resides on the current process. First, you can simply call the | |
3585 appropriate `local_size' function, passing n[1] x n[0] x n[2] x ... x | |
3586 n[d-1] (the transposed dimensions). This would mean calling the | |
3587 `local_size' function twice, once for the transposed and once for the | |
3588 non-transposed dimensions. Alternatively, you can call one of the | |
3589 `local_size_transposed' functions, which returns both the | |
3590 non-transposed and transposed data distribution from a single call. | |
3591 For example, for a 3d transform with transposed output (or input), you | |
3592 might call: | |
3593 | |
3594 ptrdiff_t fftw_mpi_local_size_3d_transposed( | |
3595 ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm, | |
3596 ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | |
3597 ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | |
3598 | |
3599 Here, `local_n0' and `local_0_start' give the size and starting | |
3600 index of the `n0' dimension for the _non_-transposed data, as in the | |
3601 previous sections. For _transposed_ data (e.g. the output for | |
3602 `FFTW_MPI_TRANSPOSED_OUT'), `local_n1' and `local_1_start' give the | |
3603 size and starting index of the `n1' dimension, which is the first | |
3604 dimension of the transposed data (`n1' by `n0' by `n2'). | |
3605 | |
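As a rough illustration of the two layouts, the following sketch computes where a global element (i0, i1, i2) of a 3d array lands inside the local block before and after the transposition. The helper names `offset_normal' and `offset_transposed' are hypothetical (they are not FFTW functions), and the sketch assumes the row-major slab layout described above, with `local_0_start' and `local_1_start' taken from `fftw_mpi_local_size_3d_transposed'. Offsets are counted in array elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an FFTW function): offset of global element
   (i0, i1, i2) inside the local slab of the non-transposed
   n0 x n1 x n2 array, which is distributed along n0. */
ptrdiff_t offset_normal(ptrdiff_t i0, ptrdiff_t i1, ptrdiff_t i2,
                        ptrdiff_t n1, ptrdiff_t n2,
                        ptrdiff_t local_0_start)
{
    assert(i0 >= local_0_start);  /* element must live on this process */
    return ((i0 - local_0_start) * n1 + i1) * n2 + i2;
}

/* Hypothetical helper: offset of the same element inside the local slab
   of the transposed n1 x n0 x n2 array, which is distributed along n1. */
ptrdiff_t offset_transposed(ptrdiff_t i0, ptrdiff_t i1, ptrdiff_t i2,
                            ptrdiff_t n0, ptrdiff_t n2,
                            ptrdiff_t local_1_start)
{
    assert(i1 >= local_1_start);  /* element must live on this process */
    return ((i1 - local_1_start) * n0 + i0) * n2 + i2;
}
```

The only difference between the two helpers is which of the first two indices is made local and which dimension supplies the row stride; the last dimension is untouched by the transposition.
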
3606 (Note that `FFTW_MPI_TRANSPOSED_IN' is completely equivalent to | |
3607 performing `FFTW_MPI_TRANSPOSED_OUT' and passing the first two | |
3608 dimensions to the planner in reverse order, or vice versa. If you pass | |
3609 _both_ the `FFTW_MPI_TRANSPOSED_IN' and `FFTW_MPI_TRANSPOSED_OUT' | |
3610 flags, it is equivalent to swapping the first two dimensions passed to | |
3611 the planner and passing _neither_ flag.) | |
3612 | |
3613 | |
3614 File: fftw3.info, Node: One-dimensional distributions, Prev: Transposed distributions, Up: MPI Data Distribution | |
3615 | |
3616 6.4.4 One-dimensional distributions | |
3617 ----------------------------------- | |
3618 | |
3619 For one-dimensional distributed DFTs using FFTW, matters are slightly | |
3620 more complicated because the data distribution is more closely tied to | |
3621 how the algorithm works. In particular, you can no longer pass an | |
3622 arbitrary block size and must accept FFTW's default; also, the block | |
3623 sizes may be different for input and output. Also, the data | |
3624 distribution depends on the flags and transform direction, in order for | |
3625 forward and backward transforms to work correctly. | |
3626 | |
3627 ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm, | |
3628 int sign, unsigned flags, | |
3629 ptrdiff_t *local_ni, ptrdiff_t *local_i_start, | |
3630 ptrdiff_t *local_no, ptrdiff_t *local_o_start); | |
3631 | |
3632 This function computes the data distribution for a 1d transform of | |
3633 size `n0' with the given transform `sign' and `flags'. Both input and | |
3634 output data use block distributions. The input on the current process | |
3635 will consist of `local_ni' numbers starting at index `local_i_start'; | |
3636 e.g. if only a single process is used, then `local_ni' will be `n0' and | |
3637 `local_i_start' will be `0'. Similarly for the output, with `local_no' | |
3638 numbers starting at index `local_o_start'. The return value of | |
3639 `fftw_mpi_local_size_1d' will be the total number of elements to | |
3640 allocate on the current process (which might be slightly larger than | |
3641 the local size due to intermediate steps in the algorithm). | |
3642 | |
3643 As mentioned above (*note Load balancing::), the data will be divided | |
3644 equally among the processes if `n0' is divisible by the _square_ of the | |
3645 number of processes. In this case, `local_ni' will equal `local_no'. | |
3646 Otherwise, they may be different. | |
3647 | |
3648 For some applications, such as convolutions, the order of the output | |
3649 data is irrelevant. In this case, performance can be improved by | |
3650 specifying that the output data be stored in an FFTW-defined | |
3651 "scrambled" format. (In particular, this is the analogue of transposed | |
3652 output in the multidimensional case: scrambled output saves a | |
3653 communications step.) If you pass `FFTW_MPI_SCRAMBLED_OUT' in the | |
3654 flags, then the output is stored in this (undocumented) scrambled | |
3655 order. Conversely, to perform the inverse transform of data in | |
3656 scrambled order, pass the `FFTW_MPI_SCRAMBLED_IN' flag. | |
3657 | |
3658 In MPI FFTW, only composite sizes `n0' can be parallelized; we have | |
3659 not yet implemented a parallel algorithm for large prime sizes. | |
3660 | |
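The two size conditions stated in this section can be checked before planning. The sketch below is only an illustration: `is_composite' and `divides_equally_1d' are hypothetical helper names, not part of FFTW, and the actual distribution must still be obtained from `fftw_mpi_local_size_1d'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if n0 has a nontrivial factor
   (found by trial division), 0 if n0 is prime or less than 4. */
int is_composite(ptrdiff_t n0)
{
    ptrdiff_t d;
    for (d = 2; d <= n0 / d; ++d)
        if (n0 % d == 0)
            return 1;
    return 0;
}

/* Hypothetical helper: returns 1 if n0 is divisible by the square of
   the number of processes, the condition given above for the data to
   be divided equally among the processes. */
int divides_equally_1d(ptrdiff_t n0, int nprocs)
{
    ptrdiff_t p = nprocs;
    assert(nprocs > 0);
    return n0 % (p * p) == 0;
}
```

For instance, a size of 1024 on 4 processes satisfies both conditions, whereas 1000 on 4 processes is composite but not divisible by 16, so `local_ni' and `local_no' may differ.
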
3661 | |
3662 File: fftw3.info, Node: Multi-dimensional MPI DFTs of Real Data, Next: Other Multi-dimensional Real-data MPI Transforms, Prev: MPI Data Distribution, Up: Distributed-memory FFTW with MPI | |
3663 | |
3664 6.5 Multi-dimensional MPI DFTs of Real Data | |
3665 =========================================== | |
3666 | |
3667 FFTW's MPI interface also supports multi-dimensional DFTs of real data, | |
3668 similar to the serial r2c and c2r interfaces. (Parallel | |
3669 one-dimensional real-data DFTs are not currently supported; you must | |
3670 use a complex transform and set the imaginary parts of the inputs to | |
3671 zero.) | |
3672 | |
3673 The key points to understand for r2c and c2r MPI transforms (compared | |
3674 to the MPI complex DFTs or the serial r2c/c2r transforms), are: | |
3675 | |
3676 * Just as for serial transforms, r2c/c2r DFTs transform n[0] x n[1] | |
3677 x n[2] x ... x n[d-1] real data to/from n[0] x n[1] x n[2] x ... | |
3678 x (n[d-1]/2 + 1) complex data: the last dimension of the complex | |
3679 data is cut in half (rounded down), plus one. As for the serial | |
3680 transforms, the sizes you pass to the `plan_dft_r2c' and | |
3681 `plan_dft_c2r' are the n[0] x n[1] x n[2] x ... x n[d-1] | |
3682 dimensions of the real data. | |
3683 | |
3684 * Although the real data is _conceptually_ n[0] x n[1] x n[2] x ... | |
3685 x n[d-1] , it is _physically_ stored as an n[0] x n[1] x n[2] x | |
3686 ... x [2 (n[d-1]/2 + 1)] array, where the last dimension has been | |
3687 _padded_ to make it the same size as the complex output. This is | |
3688 much like the in-place serial r2c/c2r interface (*note | |
3689 Multi-Dimensional DFTs of Real Data::), except that in MPI the | |
3690 padding is required even for out-of-place data. The extra padding | |
3691 numbers are ignored by FFTW (they are _not_ like zero-padding the | |
3692 transform to a larger size); they are only used to determine the | |
3693 data layout. | |
3694 | |
3695 * The data distribution in MPI for _both_ the real and complex data | |
3696 is determined by the shape of the _complex_ data. That is, you | |
3697 call the appropriate `local_size' function for the n[0] x n[1] x | |
3698 n[2] x ... x (n[d-1]/2 + 1) | |
3699 | |
3700 complex data, and then use the _same_ distribution for the real | |
3701 data except that the last complex dimension is replaced by a | |
3702 (padded) real dimension of twice the length. | |
3703 | |
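The size arithmetic in the points above can be written out as a small sketch. The helper names below are hypothetical and are not FFTW functions; integer division rounds down, as in the n[d-1]/2 + 1 formula, and the index helper assumes the row-major local_n0 x M x 2(N/2+1) layout of the padded real array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of the last dimension of the complex
   data for a real last dimension of length n. */
ptrdiff_t complex_last_dim(ptrdiff_t n)
{
    assert(n > 0);
    return n / 2 + 1;
}

/* Hypothetical helper: length of the padded last dimension of the
   real data, i.e. twice the complex length. */
ptrdiff_t padded_real_last_dim(ptrdiff_t n)
{
    return 2 * complex_last_dim(n);
}

/* Hypothetical helper: offset of real element (i, j, k) in a local
   row-major array of shape local_n0 x M x 2(N/2+1). */
ptrdiff_t padded_real_index(ptrdiff_t i, ptrdiff_t j, ptrdiff_t k,
                            ptrdiff_t M, ptrdiff_t N)
{
    assert(0 <= k && k < N);  /* k must not point into the padding */
    return (i * M + j) * padded_real_last_dim(N) + k;
}
```

For N = 5 the complex data has 3 elements in the last dimension and the real rows are padded to 6 values; for N = 8 the corresponding numbers are 5 and 10.
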
3704 | |
3705 For example, suppose we are performing an out-of-place r2c transform | |
3706 of L x M x N real data [padded to L x M x 2(N/2+1) ], resulting in L x | |
3707 M x (N/2+1) complex data. Similar to the example in *note 2d MPI | |
3708 example::, we might do something like: | |
3709 | |
3710 #include <fftw3-mpi.h> | |
3711 | |
3712 int main(int argc, char **argv) | |
3713 { | |
3714 const ptrdiff_t L = ..., M = ..., N = ...; | |
3715 fftw_plan plan; | |
3716 double *rin; | |
3717 fftw_complex *cout; | |
3718 ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k; | |
3719 | |
3720 MPI_Init(&argc, &argv); | |
3721 fftw_mpi_init(); | |
3722 | |
3723 /* get local data size and allocate */ | |
3724 alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD, | |
3725 &local_n0, &local_0_start); | |
3726 rin = fftw_alloc_real(2 * alloc_local); | |
3727 cout = fftw_alloc_complex(alloc_local); | |
3728 | |
3729 /* create plan for out-of-place r2c DFT */ | |
3730 plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD, | |
3731 FFTW_MEASURE); | |
3732 | |
3733 /* initialize rin to some function my_func(x,y,z) */ | |
3734 for (i = 0; i < local_n0; ++i) | |
3735 for (j = 0; j < M; ++j) | |
3736 for (k = 0; k < N; ++k) | |
3737 rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k); | |
3738 | |
3739 /* compute transforms as many times as desired */ | |
3740 fftw_execute(plan); | |
3741 | |
3742 fftw_destroy_plan(plan); | |
3743 | |
3744 MPI_Finalize(); | |
3745 } | |
3746 | |
3747 Note that we allocated `rin' using `fftw_alloc_real' with an | |
3748 argument of `2 * alloc_local': since `alloc_local' is the number of | |
3749 _complex_ values to allocate, the number of _real_ values is twice as | |
3750 many. The `rin' array is then local_n0 x M x 2(N/2+1) in row-major | |
3751 order, so its `(i,j,k)' element is at the index `(i*M + j) * | |
3752 (2*(N/2+1)) + k' (*note Multi-dimensional Array Format::). | |
3753 | |
3754 As for the complex transforms, improved performance can be obtained | |
3755 by specifying that the output is the transpose of the input or vice | |
3756 versa (*note Transposed distributions::). In our L x M x N r2c | |
3757 example, including `FFTW_MPI_TRANSPOSED_OUT' in the flags means that the | |
3758 input would be a padded L x M x 2(N/2+1) real array distributed over | |
3759 the `L' dimension, while the output would be an M x L x (N/2+1) complex | |
3760 array distributed over the `M' dimension. To perform the inverse c2r | |
3761 transform with the same data distributions, you would use the | |
3762 `FFTW_MPI_TRANSPOSED_IN' flag. | |
3763 | |
3764 | |
3765 File: fftw3.info, Node: Other Multi-dimensional Real-data MPI Transforms, Next: FFTW MPI Transposes, Prev: Multi-dimensional MPI DFTs of Real Data, Up: Distributed-memory FFTW with MPI | |
3766 | |
3767 6.6 Other Multi-dimensional Real-data MPI Transforms | |
3768 ==================================================== | |
3769 | |
3770 FFTW's MPI interface also supports multi-dimensional `r2r' transforms | |
3771 of all kinds supported by the serial interface (e.g. discrete cosine | |
3772 and sine transforms, discrete Hartley transforms, etc.). Only | |
3773 multi-dimensional `r2r' transforms, not one-dimensional transforms, are | |
3774 currently parallelized. | |
3775 | |
3776 These are used much like the multidimensional complex DFTs discussed | |
3777 above, except that the data is real rather than complex, and one needs | |
3778 to pass an r2r transform kind (`fftw_r2r_kind') for each dimension as | |
3779 in the serial FFTW (*note More DFTs of Real Data::). | |
3780 | |
3781 For example, one might perform a two-dimensional L x M transform that | |
3782 is an REDFT10 (DCT-II) in the first dimension and an RODFT10 (DST-II) | |
3783 in the second dimension with code like: | |
3784 | |
3785 const ptrdiff_t L = ..., M = ...; | |
3786 fftw_plan plan; | |
3787 double *data; | |
3788 ptrdiff_t alloc_local, local_n0, local_0_start, i, j; | |
3789 | |
3790 /* get local data size and allocate */ | |
3791 alloc_local = fftw_mpi_local_size_2d(L, M, MPI_COMM_WORLD, | |
3792 &local_n0, &local_0_start); | |
3793 data = fftw_alloc_real(alloc_local); | |
3794 | |
3795 /* create plan for in-place REDFT10 x RODFT10 */ | |
3796 plan = fftw_mpi_plan_r2r_2d(L, M, data, data, MPI_COMM_WORLD, | |
3797 FFTW_REDFT10, FFTW_RODFT10, FFTW_MEASURE); | |
3798 | |
3799 /* initialize data to some function my_function(x,y) */ | |
3800 for (i = 0; i < local_n0; ++i) for (j = 0; j < M; ++j) | |
3801 data[i*M + j] = my_function(local_0_start + i, j); | |
3802 | |
3803 /* compute transforms, in-place, as many times as desired */ | |
3804 fftw_execute(plan); | |
3805 | |
3806 fftw_destroy_plan(plan); | |
3807 | |
3808 Notice that we use the same `local_size' functions as we did for | |
3809 complex data, only now we interpret the sizes in terms of real rather | |
3810 than complex values, and correspondingly use `fftw_alloc_real'. | |
3811 | |
3812 | |
3813 File: fftw3.info, Node: FFTW MPI Transposes, Next: FFTW MPI Wisdom, Prev: Other Multi-dimensional Real-data MPI Transforms, Up: Distributed-memory FFTW with MPI | |
3814 | |
3815 6.7 FFTW MPI Transposes | |
3816 ======================= | |
3817 | |
3818 FFTW's MPI Fourier transforms rely on one or more _global | |
3819 transposition_ steps for their communications. For example, the | |
3820 multidimensional transforms work by transforming along some dimensions, | |
3821 then transposing to make the first dimension local and transforming | |
3822 that, then transposing back. Because global transposition of a | |
3823 block-distributed matrix has many other potential uses besides FFTs, | |
3824 FFTW's transpose routines can be called directly, as documented in this | |
3825 section. | |
3826 | |
3827 * Menu: | |
3828 | |
3829 * Basic distributed-transpose interface:: | |
3830 * Advanced distributed-transpose interface:: | |
3831 * An improved replacement for MPI_Alltoall:: | |
3832 | |
3833 | |
3834 File: fftw3.info, Node: Basic distributed-transpose interface, Next: Advanced distributed-transpose interface, Prev: FFTW MPI Transposes, Up: FFTW MPI Transposes | |
3835 | |
3836 6.7.1 Basic distributed-transpose interface | |
3837 ------------------------------------------- | |
3838 | |
3839 In particular, suppose that we have an `n0' by `n1' array in row-major | |
3840 order, block-distributed across the `n0' dimension. To transpose this | |
3841 into an `n1' by `n0' array block-distributed across the `n1' dimension, | |
3842 we would create a plan by calling the following function: | |
3843 | |
3844 fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1, | |
3845 double *in, double *out, | |
3846 MPI_Comm comm, unsigned flags); | |
3847 | |
3848 The input and output arrays (`in' and `out') can be the same. The | |
3849 transpose is actually executed by calling `fftw_execute' on the plan, | |
3850 as usual. | |
3851 | |
3852 The `flags' are the usual FFTW planner flags, but support two | |
3853 additional flags: `FFTW_MPI_TRANSPOSED_OUT' and/or | |
3854 `FFTW_MPI_TRANSPOSED_IN'. What these flags indicate, for transpose | |
3855 plans, is that the output and/or input, respectively, are _locally_ | |
3856 transposed. That is, on each process input data is normally stored as | |
3857 a `local_n0' by `n1' array in row-major order, but for an | |
3858 `FFTW_MPI_TRANSPOSED_IN' plan the input data is stored as `n1' by | |
3859 `local_n0' in row-major order. Similarly, `FFTW_MPI_TRANSPOSED_OUT' | |
3860 means that the output is `n0' by `local_n1' instead of `local_n1' by | |
3861 `n0'. | |
3862 | |
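The effect of the two flags on the local storage order can be sketched with two index helpers. These are hypothetical illustrations, not FFTW functions; they assume row-major storage as described above, with `i' indexing the `n0' dimension and `j' the `n1' dimension (local to the process where that dimension is distributed).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: offset of local input element (i, j), where i
   is the local index along n0 and j the index along n1.  Without
   FFTW_MPI_TRANSPOSED_IN the block is local_n0 x n1; with it, the
   block is stored as n1 x local_n0. */
ptrdiff_t input_index(ptrdiff_t i, ptrdiff_t j,
                      ptrdiff_t local_n0, ptrdiff_t n1,
                      int transposed_in)
{
    assert(0 <= i && i < local_n0 && 0 <= j && j < n1);
    return transposed_in ? j * local_n0 + i : i * n1 + j;
}

/* Hypothetical helper: offset of local output element with local
   index j along n1 and index i along n0.  Without
   FFTW_MPI_TRANSPOSED_OUT the block is local_n1 x n0; with it, the
   block is stored as n0 x local_n1. */
ptrdiff_t output_index(ptrdiff_t j, ptrdiff_t i,
                       ptrdiff_t local_n1, ptrdiff_t n0,
                       int transposed_out)
{
    assert(0 <= j && j < local_n1 && 0 <= i && i < n0);
    return transposed_out ? i * local_n1 + j : j * n0 + i;
}
```

In both cases the flag only swaps which index is multiplied by the row length; the set of elements held by each process is the same.
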
3863 To determine the local size of the array on each process before and | |
3864 after the transpose, as well as the amount of storage that must be | |
3865 allocated, one should call `fftw_mpi_local_size_2d_transposed', just as | |
3866 for a 2d DFT as described in the previous section: | |
3867 | |
3868 ptrdiff_t fftw_mpi_local_size_2d_transposed | |
3869 (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, | |
3870 ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | |
3871 ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | |
3872 | |
3873 Again, the return value is the local storage to allocate, which in | |
3874 this case is the number of _real_ (`double') values rather than complex | |
3875 numbers as in the previous examples. | |
3876 | |
3877 | |
3878 File: fftw3.info, Node: Advanced distributed-transpose interface, Next: An improved replacement for MPI_Alltoall, Prev: Basic distributed-transpose interface, Up: FFTW MPI Transposes | |
3879 | |
3880 6.7.2 Advanced distributed-transpose interface | |
3881 ---------------------------------------------- | |
3882 | |
3883 The above routines are for a transpose of a matrix of numbers (of type | |
3884 `double'), using FFTW's default block sizes. More generally, one can | |
3885 perform transposes of _tuples_ of numbers, with user-specified block | |
3886 sizes for the input and output: | |
3887 | |
3888 fftw_plan fftw_mpi_plan_many_transpose | |
3889 (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany, | |
3890 ptrdiff_t block0, ptrdiff_t block1, | |
3891 double *in, double *out, MPI_Comm comm, unsigned flags); | |
3892 | |
3893 In this case, one is transposing an `n0' by `n1' matrix of | |
3894 `howmany'-tuples (e.g. `howmany = 2' for complex numbers). The input | |
3895 is distributed along the `n0' dimension with block size `block0', and | |
3896 the `n1' by `n0' output is distributed along the `n1' dimension with | |
3897 block size `block1'. If `FFTW_MPI_DEFAULT_BLOCK' (0) is passed for a | |
3898 block size then FFTW uses its default block size. To get the local | |
3899 size of the data on each process, you should then call | |
3900 `fftw_mpi_local_size_many_transposed'. | |
3901 | |
3902 | |
3903 File: fftw3.info, Node: An improved replacement for MPI_Alltoall, Prev: Advanced distributed-transpose interface, Up: FFTW MPI Transposes | |
3904 | |
3905 6.7.3 An improved replacement for MPI_Alltoall | |
3906 ---------------------------------------------- | |
3907 | |
3908 We close this section by noting that FFTW's MPI transpose routines can | |
3909 be thought of as a generalization of the `MPI_Alltoall' function | |
3910 (albeit only for floating-point types), and in some circumstances can | |
3911 function as an improved replacement. | |
3912 | |
3913 `MPI_Alltoall' is defined by the MPI standard as: | |
3914 | |
3915 int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype, | |
3916 void *recvbuf, int recvcount, MPI_Datatype recvtype, | |
3917 MPI_Comm comm); | |
3918 | |
3919 In particular, for `double*' arrays `in' and `out', consider the | |
3920 call: | |
3921 | |
3922 MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm); | |
3923 | |
3924 This is completely equivalent to: | |
3925 | |
3926 MPI_Comm_size(comm, &P); | |
3927 plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE); | |
3928 fftw_execute(plan); | |
3929 fftw_destroy_plan(plan); | |
3930 | |
3931 That is, computing a P x P transpose on `P' processes, with a block | |
3932 size of 1, is just a standard all-to-all communication. | |
3933 | |
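To see why the two operations coincide, the following serial sketch simulates both data movements on ordinary arrays, with one array row standing in for the buffer of one process. It uses neither MPI nor FFTW: the function names and the sizes `NPROC' and `HOWMANY' are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <string.h>

enum { NPROC = 3, HOWMANY = 2 };   /* illustrative sizes only */

/* Serial stand-in for MPI_Alltoall: "rank" p sends its q-th block of
   HOWMANY doubles to rank q, which stores it as its p-th block. */
void simulated_alltoall(double send[NPROC][NPROC * HOWMANY],
                        double recv[NPROC][NPROC * HOWMANY])
{
    int p, q;
    for (p = 0; p < NPROC; ++p)
        for (q = 0; q < NPROC; ++q)
            memcpy(&recv[q][p * HOWMANY], &send[p][q * HOWMANY],
                   HOWMANY * sizeof(double));
}

/* Serial stand-in for the distributed transpose with block size 1:
   rank r owns row r of the NPROC x NPROC matrix of HOWMANY-tuples
   before the transpose, and row r of the transposed matrix after. */
void simulated_transpose(double in[NPROC][NPROC * HOWMANY],
                         double out[NPROC][NPROC * HOWMANY])
{
    int row, col, t;
    for (row = 0; row < NPROC; ++row)
        for (col = 0; col < NPROC; ++col)
            for (t = 0; t < HOWMANY; ++t)
                out[col][row * HOWMANY + t] = in[row][col * HOWMANY + t];
}

/* Returns 1 if both simulations move the data identically. */
int alltoall_matches_transpose(void)
{
    double in[NPROC][NPROC * HOWMANY];
    double a[NPROC][NPROC * HOWMANY], b[NPROC][NPROC * HOWMANY];
    int p, k;
    for (p = 0; p < NPROC; ++p)
        for (k = 0; k < NPROC * HOWMANY; ++k)
            in[p][k] = 100.0 * p + k;   /* distinct value per element */
    simulated_alltoall(in, a);
    simulated_transpose(in, b);
    /* rank 1's block 0 must have come from rank 0's block 1 */
    assert(a[1][0] == in[0][HOWMANY]);
    return memcmp(a, b, sizeof a) == 0;
}
```

The simulation only demonstrates the data movement; it says nothing about the relative performance of the two routines, which depends on the MPI implementation and hardware as discussed in this section.
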
3934 However, using the FFTW routine instead of `MPI_Alltoall' may have | |
3935 certain advantages. First of all, FFTW's routine can operate in-place | |
3936 (`in == out') whereas `MPI_Alltoall' can only operate out-of-place. | |
3937 | |
3938 Second, even for out-of-place plans, FFTW's routine may be faster, | |
3939 especially if you need to perform the all-to-all communication many | |
3940 times and can afford to use `FFTW_MEASURE' or `FFTW_PATIENT'. It | |
3941 should certainly be no slower, not including the time to create the | |
3942 plan, since one of the possible algorithms that FFTW uses for an | |
3943 out-of-place transpose _is_ simply to call `MPI_Alltoall'. However, | |
3944 FFTW also considers several other possible algorithms that, depending | |
3945 on your MPI implementation and your hardware, may be faster. | |
3946 | |
3947 | |
3948 File: fftw3.info, Node: FFTW MPI Wisdom, Next: Avoiding MPI Deadlocks, Prev: FFTW MPI Transposes, Up: Distributed-memory FFTW with MPI | |
3949 | |
3950 6.8 FFTW MPI Wisdom | |
3951 =================== | |
3952 | |
3953 FFTW's "wisdom" facility (*note Words of Wisdom-Saving Plans::) can be | |
3954 used to save MPI plans as well as to save uniprocessor plans. However, | |
3955 for MPI there are several unavoidable complications. | |
3956 | |
3957 First, the MPI standard does not guarantee that every process can | |
3958 perform file I/O (at least, not using C stdio routines)--in general, we | |
3959 may only assume that process 0 is capable of I/O.(1) So, if we want to | |
3960 export the wisdom from a single process to a file, we must first export | |
3961 the wisdom to a string, then send it to process 0, then write it to a | |
3962 file. | |
3963 | |
3964 Second, in principle we may want to have separate wisdom for every | |
3965 process, since in general the processes may run on different hardware | |
3966 even for a single MPI program. However, in practice FFTW's MPI code is | |
3967 designed for the case of homogeneous hardware (*note Load balancing::), | |
3968 and in this case it is convenient to use the same wisdom for every | |
3969 process. Thus, we need a mechanism to synchronize the wisdom. | |
3970 | |
3971 To address both of these problems, FFTW provides the following two | |
3972 functions: | |
3973 | |
3974 void fftw_mpi_broadcast_wisdom(MPI_Comm comm); | |
3975 void fftw_mpi_gather_wisdom(MPI_Comm comm); | |
3976 | |
3977 Given a communicator `comm', `fftw_mpi_broadcast_wisdom' will | |
3978 broadcast the wisdom from process 0 to all other processes. | |
3979 Conversely, `fftw_mpi_gather_wisdom' will collect wisdom from all | |
3980 processes onto process 0. (If the plans created for the same problem | |
3981 by different processes are not the same, `fftw_mpi_gather_wisdom' will | |
3982 arbitrarily choose one of the plans.) Both of these functions may | |
3983 result in suboptimal plans for different processes if the processes are | |
3984 running on non-identical hardware. Both of these functions are | |
3985 _collective_ calls, which means that they must be executed by all | |
3986 processes in the communicator. | |
3987 | |
3988 So, for example, a typical code snippet to import wisdom from a file | |
3989 and use it on all processes would be: | |
3990 | |
3991 { | |
3992 int rank; | |
3993 | |
3994 fftw_mpi_init(); | |
3995 MPI_Comm_rank(MPI_COMM_WORLD, &rank); | |
3996 if (rank == 0) fftw_import_wisdom_from_filename("mywisdom"); | |
3997 fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD); | |
3998 } | |
3999 | |
4000 (Note that we must call `fftw_mpi_init' before importing any wisdom | |
4001 that might contain MPI plans.) Similarly, a typical code snippet to | |
4002 export wisdom from all processes to a file is: | |
4003 | |
4004 { | |
4005 int rank; | |
4006 | |
4007 fftw_mpi_gather_wisdom(MPI_COMM_WORLD); | |
4008 MPI_Comm_rank(MPI_COMM_WORLD, &rank); | |
4009 if (rank == 0) fftw_export_wisdom_to_filename("mywisdom"); | |
4010 } | |
4011 | |
4012 ---------- Footnotes ---------- | |
4013 | |
4014 (1) In fact, even this assumption is not technically guaranteed by | |
4015 the standard, although it seems to be universal in actual MPI | |
4016 implementations and is widely assumed by MPI-using software. | |
4017 Technically, you need to query the `MPI_IO' attribute of | |
4018 `MPI_COMM_WORLD' with `MPI_Attr_get'. If this attribute is | |
4019 `MPI_PROC_NULL', no I/O is possible. If it is `MPI_ANY_SOURCE', any | |
4020 process can perform I/O. Otherwise, it is the rank of a process that | |
4021 can perform I/O ... but since it is not guaranteed to yield the _same_ | |
4022 rank on all processes, you have to do an `MPI_Allreduce' of some kind | |
4023 if you want all processes to agree about which is going to do I/O. And | |
4024 even then, the standard only guarantees that this process can perform | |
4025 output, but not input. See e.g. `Parallel Programming with MPI' by P. | |
4026 S. Pacheco, section 8.1.3. Needless to say, in our experience | |
4027 virtually no MPI programmers worry about this. | |
4028 | |
4029 | |
4030 File: fftw3.info, Node: Avoiding MPI Deadlocks, Next: FFTW MPI Performance Tips, Prev: FFTW MPI Wisdom, Up: Distributed-memory FFTW with MPI | |
4031 | |
4032 6.9 Avoiding MPI Deadlocks | |
4033 ========================== | |
4034 | |
4035 An MPI program can _deadlock_ if one process is waiting for a message | |
4036 from another process that never gets sent. To avoid deadlocks when | |
4037 using FFTW's MPI routines, it is important to know which functions are | |
4038 _collective_: that is, which functions must _always_ be called in the | |
4039 _same order_ from _every_ process in a given communicator. | |
4040 (`MPI_Barrier' is the canonical example of a collective | |
4041 function in the MPI standard.) | |
4042 | |
4043 The functions in FFTW that are _always_ collective are: every | |
4044 function beginning with `fftw_mpi_plan', as well as | |
4045 `fftw_mpi_broadcast_wisdom' and `fftw_mpi_gather_wisdom'. Also, the | |
4046 following functions from the ordinary FFTW interface are collective | |
4047 when they are applied to a plan created by an `fftw_mpi_plan' function: | |
4048 `fftw_execute', `fftw_destroy_plan', and `fftw_flops'. | |
4049 | |
4050 | |
4051 File: fftw3.info, Node: FFTW MPI Performance Tips, Next: Combining MPI and Threads, Prev: Avoiding MPI Deadlocks, Up: Distributed-memory FFTW with MPI | |
4052 | |
4053 6.10 FFTW MPI Performance Tips | |
4054 ============================== | |
4055 | |
4056 In this section, we collect a few tips on getting the best performance | |
4057 out of FFTW's MPI transforms. | |
4058 | |
4059 First, because of the 1d block distribution, FFTW's parallelization | |
4060 is currently limited by the size of the first dimension. | |
4061 (Multidimensional block distributions may be supported by a future | |
4062 version.) More generally, you should ideally arrange the dimensions so | |
4063 that FFTW can divide them equally among the processes. *Note Load | |
4064 balancing::. | |
4065 | |
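To judge how evenly a given first dimension would be divided, one can tabulate the rows per process. The sketch below ASSUMES that the default block size is ceiling(n0/P), which this section does not state; `block_size' and `local_rows' are hypothetical helpers, and the authoritative values are always those returned by the `local_size' functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper.  ASSUMPTION: the default block size is
   ceiling(n0 / nprocs); this is not stated in the text above. */
ptrdiff_t block_size(ptrdiff_t n0, int nprocs)
{
    assert(n0 > 0 && nprocs > 0);
    return (n0 + nprocs - 1) / nprocs;
}

/* Hypothetical helper: number of rows of the first dimension that
   process `rank' would hold under the assumed block distribution. */
ptrdiff_t local_rows(ptrdiff_t n0, int nprocs, int rank)
{
    ptrdiff_t block = block_size(n0, nprocs);
    ptrdiff_t start = block * rank;
    if (start >= n0)
        return 0;                       /* this process gets no data */
    return (n0 - start < block) ? n0 - start : block;
}
```

Under this assumption, n0 = 100 on 16 processes gives 7 rows to each of the first 14 processes, 2 rows to the next, and none to the last, whereas n0 = 96 gives every process exactly 6 rows.
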
4066 Second, if it is not too inconvenient, you should consider working | |
4067 with transposed output for multidimensional plans, as this saves a | |
4068 considerable amount of communications. *Note Transposed | |
4069 distributions::. | |
4070 | |
4071 Third, the fastest choices are generally either an in-place transform | |
4072 or an out-of-place transform with the `FFTW_DESTROY_INPUT' flag (which | |
4073 allows the input array to be used as scratch space). In-place is | |
4074 especially beneficial if the amount of data per process is large. | |
4075 | |
4076 Fourth, if you have multiple arrays to transform at once, rather than | |
4077 calling FFTW's MPI transforms several times it usually seems to be | |
4078 faster to interleave the data and use the advanced interface. (This | |
4079 groups the communications together instead of requiring separate | |
4080 messages for each transform.) | |
4081 | |
4082 | |
4083 File: fftw3.info, Node: Combining MPI and Threads, Next: FFTW MPI Reference, Prev: FFTW MPI Performance Tips, Up: Distributed-memory FFTW with MPI | |
4084 | |
4085 6.11 Combining MPI and Threads | |
4086 ============================== | |
4087 | |
4088 In certain cases, it may be advantageous to combine MPI | |
4089 (distributed-memory) and threads (shared-memory) parallelization. FFTW | |
4090 supports this, with certain caveats. For example, if you have a | |
4091 cluster of 4-processor shared-memory nodes, you may want to use threads | |
4092 within the nodes and MPI between the nodes, instead of MPI for all | |
4093 parallelization. | |
4094 | |
4095 In particular, it is possible to seamlessly combine the MPI FFTW | |
4096 routines with the multi-threaded FFTW routines (*note Multi-threaded | |
4097 FFTW::). However, some care must be taken in the initialization code, | |
4098 which should look something like this: | |
4099 | |
4100 int threads_ok; | |
4101 | |
4102 int main(int argc, char **argv) | |
4103 { | |
4104 int provided; | |
4105 MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided); | |
4106 threads_ok = provided >= MPI_THREAD_FUNNELED; | |
4107 | |
4108 if (threads_ok) threads_ok = fftw_init_threads(); | |
4109 fftw_mpi_init(); | |
4110 | |
4111 ... | |
4112 if (threads_ok) fftw_plan_with_nthreads(...); | |
4113 ... | |
4114 | |
4115 MPI_Finalize(); | |
4116 } | |
4117 | |
4118 First, note that instead of calling `MPI_Init', you should call | |
4119 `MPI_Init_thread', which is the initialization routine defined by the | |
4120 MPI-2 standard to indicate to MPI that your program will be | |
4121 multithreaded. We pass `MPI_THREAD_FUNNELED', which indicates that we | |
4122 will only call MPI routines from the main thread. (FFTW will launch | |
4123 additional threads internally, but the extra threads will not call MPI | |
4124 code.) (You may also pass `MPI_THREAD_SERIALIZED' or | |
4125 `MPI_THREAD_MULTIPLE', which requests additional multithreading support | |
4126 from the MPI implementation, but this is not required by FFTW.) The | |
4127 `provided' parameter returns the level of thread support that is actually | |
4128 provided by your MPI implementation; this _must_ be at least | |
4129 `MPI_THREAD_FUNNELED' if you want to call the FFTW threads routines, so | |
4130 we define a global variable `threads_ok' to record this. You should | |
4131 only call `fftw_init_threads' or `fftw_plan_with_nthreads' if | |
4132 `threads_ok' is true. For more information on thread safety in MPI, | |
4133 see the MPI and Threads | |
4134 (http://www.mpi-forum.org/docs/mpi-20-html/node162.htm) section of the | |
4135 MPI-2 standard. | |
4136 | |
4137 Second, we must call `fftw_init_threads' _before_ `fftw_mpi_init'. | |
4138 This is critical for technical reasons having to do with how FFTW | |
4139 initializes its list of algorithms. | |
4140 | |
4141 Then, if you call `fftw_plan_with_nthreads(N)', _every_ MPI process | |
4142 will launch (up to) `N' threads to parallelize its transforms. | |
4143 | |
4144 For example, in the hypothetical cluster of 4-processor nodes, you | |
4145 might wish to launch only a single MPI process per node, and then call | |
4146 `fftw_plan_with_nthreads(4)' on each process to use all processors in | |
4147 the nodes. | |
4148 | |
4149 This may or may not be faster than simply using as many MPI processes | |
4150 as you have processors, however. On the one hand, using threads within | |
4151 a node eliminates the need for explicit message passing within the | |
4152 node. On the other hand, FFTW's transpose routines are not | |
4153 multi-threaded, and this means that the communications that do take | |
4154 place will not benefit from parallelization within the node. Moreover, | |
4155 many MPI implementations already have optimizations to exploit shared | |
4156 memory when it is available, so adding the multithreaded FFTW on top of | |
4157 this may be superfluous. | |
4158 | |
4159 | |
4160 File: fftw3.info, Node: FFTW MPI Reference, Next: FFTW MPI Fortran Interface, Prev: Combining MPI and Threads, Up: Distributed-memory FFTW with MPI | |
4161 | |
4162 6.12 FFTW MPI Reference | |
4163 ======================= | |
4164 | |
4165 This chapter provides a complete reference to all FFTW MPI functions, | |
4166 datatypes, and constants. See also *note FFTW Reference:: for | |
4167 information on functions and types in common with the serial interface. | |
4168 | |
4169 * Menu: | |
4170 | |
4171 * MPI Files and Data Types:: | |
4172 * MPI Initialization:: | |
4173 * Using MPI Plans:: | |
4174 * MPI Data Distribution Functions:: | |
4175 * MPI Plan Creation:: | |
4176 * MPI Wisdom Communication:: | |
4177 | |
4178 | |
4179 File: fftw3.info, Node: MPI Files and Data Types, Next: MPI Initialization, Prev: FFTW MPI Reference, Up: FFTW MPI Reference | |
4180 | |
4181 6.12.1 MPI Files and Data Types | |
4182 ------------------------------- | |
4183 | |
4184 All programs using FFTW's MPI support should include its header file: | |
4185 | |
4186 #include <fftw3-mpi.h> | |
4187 | |
4188 Note that this header file includes the serial-FFTW `fftw3.h' header | |
4189 file, and also the `mpi.h' header file for MPI, so you need not include | |
4190 those files separately. | |
4191 | |
4192 You must also link to _both_ the FFTW MPI library and to the serial | |
4193 FFTW library. On Unix, this means adding `-lfftw3_mpi -lfftw3 -lm' at | |
4194 the end of the link command. | |
4195 | |
4196 Different precisions are handled as in the serial interface: *Note | |
4197 Precision::. That is, `fftw_' functions become `fftwf_' (in single | |
4198 precision) etcetera, and the libraries become `-lfftw3f_mpi -lfftw3f | |
4199 -lm' etcetera on Unix. Long-double precision is supported in MPI, but | |
4200 quad precision (`fftwq_') is not due to the lack of MPI support for | |
4201 this type. | |
4202 | |
4203 | |
4204 File: fftw3.info, Node: MPI Initialization, Next: Using MPI Plans, Prev: MPI Files and Data Types, Up: FFTW MPI Reference | |
4205 | |
4206 6.12.2 MPI Initialization | |
4207 ------------------------- | |
4208 | |
4209 Before calling any other FFTW MPI (`fftw_mpi_') function, and before | |
4210 importing any wisdom for MPI problems, you must call: | |
4211 | |
4212 void fftw_mpi_init(void); | |
4213 | |
4214 If FFTW threads support is used, however, `fftw_mpi_init' should be | |
4215 called _after_ `fftw_init_threads' (*note Combining MPI and Threads::). | |
4216 Calling `fftw_mpi_init' additional times (before `fftw_mpi_cleanup') | |
4217 has no effect. | |
4218 | |
4219 If you want to deallocate all persistent data and reset FFTW to the | |
4220 pristine state it was in when you started your program, you can call: | |
4221 | |
4222 void fftw_mpi_cleanup(void); | |
4223 | |
4224 (This calls `fftw_cleanup', so you need not call the serial cleanup | |
4225 routine too, although it is safe to do so.) After calling | |
4226 `fftw_mpi_cleanup', all existing plans become undefined, and you should | |
4227 not attempt to execute or destroy them. You must call `fftw_mpi_init' | |
4228 again after `fftw_mpi_cleanup' if you want to resume using the MPI FFTW | |
4229 routines. | |
4230 | |
4231 | |
4232 File: fftw3.info, Node: Using MPI Plans, Next: MPI Data Distribution Functions, Prev: MPI Initialization, Up: FFTW MPI Reference | |
4233 | |
4234 6.12.3 Using MPI Plans | |
4235 ---------------------- | |
4236 | |
4237 Once an MPI plan is created, you can execute and destroy it using | |
4238 `fftw_execute', `fftw_destroy_plan', and the other functions in the | |
4239 serial interface that operate on generic plans (*note Using Plans::). | |
4240 | |
4241 The `fftw_execute' and `fftw_destroy_plan' functions, applied to MPI | |
4242 plans, are _collective_ calls: they must be called for all processes in | |
4243 the communicator that was used to create the plan. | |
4244 | |
4245 You must _not_ use the serial new-array plan-execution functions | |
4246 `fftw_execute_dft' and so on (*note New-array Execute Functions::) with | |
4247 MPI plans. Such functions are specialized to the problem type, and | |
4248 there are specific new-array execute functions for MPI plans: | |
4249 | |
4250 void fftw_mpi_execute_dft(fftw_plan p, fftw_complex *in, fftw_complex *out); | |
4251 void fftw_mpi_execute_dft_r2c(fftw_plan p, double *in, fftw_complex *out); | |
4252 void fftw_mpi_execute_dft_c2r(fftw_plan p, fftw_complex *in, double *out); | |
4253 void fftw_mpi_execute_r2r(fftw_plan p, double *in, double *out); | |
4254 | |
4255 These functions have the same restrictions as those of the serial | |
4256 new-array execute functions. They are _always_ safe to apply to the | |
4257 _same_ `in' and `out' arrays that were used to create the plan. They | |
4258 can only be applied to new arrays if those arrays have the same types, | |
4259 dimensions, in-placeness, and alignment as the original arrays, where | |
4260 the best way to ensure the same alignment is to use FFTW's | |
4261 `fftw_malloc' and related allocation functions for all arrays (*note | |
4262 Memory Allocation::). Note that distributed transposes (*note FFTW MPI | |
4263 Transposes::) use `fftw_mpi_execute_r2r', since they count as rank-zero | |
4264 r2r plans from FFTW's perspective. | |
4265 | |
4266 | |
4267 File: fftw3.info, Node: MPI Data Distribution Functions, Next: MPI Plan Creation, Prev: Using MPI Plans, Up: FFTW MPI Reference | |
4268 | |
4269 6.12.4 MPI Data Distribution Functions | |
4270 -------------------------------------- | |
4271 | |
4272 As described above (*note MPI Data Distribution::), in order to | |
4273 allocate your arrays, _before_ creating a plan, you must first call one | |
4274 of the following routines to determine the required allocation size and | |
4275 the portion of the array locally stored on a given process. The | |
4276 `MPI_Comm' communicator passed here must be equivalent to the | |
4277 communicator used below for plan creation. | |
4278 | |
4279 The basic interface for multidimensional transforms consists of the | |
4280 functions: | |
4281 | |
4282 ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, | |
4283 ptrdiff_t *local_n0, ptrdiff_t *local_0_start); | |
4284 ptrdiff_t fftw_mpi_local_size_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | |
4285 MPI_Comm comm, | |
4286 ptrdiff_t *local_n0, ptrdiff_t *local_0_start); | |
4287 ptrdiff_t fftw_mpi_local_size(int rnk, const ptrdiff_t *n, MPI_Comm comm, | |
4288 ptrdiff_t *local_n0, ptrdiff_t *local_0_start); | |
4289 | |
4290 ptrdiff_t fftw_mpi_local_size_2d_transposed(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, | |
4291 ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | |
4292 ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | |
4293 ptrdiff_t fftw_mpi_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | |
4294 MPI_Comm comm, | |
4295 ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | |
4296 ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | |
4297 ptrdiff_t fftw_mpi_local_size_transposed(int rnk, const ptrdiff_t *n, MPI_Comm comm, | |
4298 ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | |
4299 ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | |
4300 | |
4301 These functions return the number of elements to allocate (complex | |
4302 numbers for DFT/r2c/c2r plans, real numbers for r2r plans), whereas the | |
4303 `local_n0' and `local_0_start' return the portion (`local_0_start' to | |
4304 `local_0_start + local_n0 - 1') of the first dimension of an n[0] x | |
4305 n[1] x n[2] x ... x n[d-1] array that is stored on the local process. | |
4306 *Note Basic and advanced distribution interfaces::. For | |
4307 `FFTW_MPI_TRANSPOSED_OUT' plans, the `_transposed' variants are useful | |
4308 in order to also return the local portion of the first dimension in the | |
4309 n[1] x n[0] x n[2] x ... x n[d-1] transposed output. *Note Transposed | |
4310 distributions::. The advanced interface for multidimensional | |
4311 transforms is: | |
4312 | |
4313 ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, ptrdiff_t howmany, | |
4314 ptrdiff_t block0, MPI_Comm comm, | |
4315 ptrdiff_t *local_n0, ptrdiff_t *local_0_start); | |
4316 ptrdiff_t fftw_mpi_local_size_many_transposed(int rnk, const ptrdiff_t *n, ptrdiff_t howmany, | |
4317 ptrdiff_t block0, ptrdiff_t block1, MPI_Comm comm, | |
4318 ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | |
4319 ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | |
4320 | |
4321 These differ from the basic interface in only two ways. First, they | |
4322 allow you to specify block sizes `block0' and `block1' (the latter for | |
4323 the transposed output); you can pass `FFTW_MPI_DEFAULT_BLOCK' to use | |
4324 FFTW's default block size as in the basic interface. Second, you can | |
4325 pass a `howmany' parameter, corresponding to the advanced planning | |
4326 interface below: this is for transforms of contiguous `howmany'-tuples | |
4327 of numbers (`howmany = 1' in the basic interface). | |
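
As a rough illustration of what `local_n0' and `local_0_start' mean, the following sketch models a simple block distribution of the first dimension. It assumes a block size of ceil(n0 / nprocs), with trailing processes receiving fewer (possibly zero) rows. The helper `model_block_range' is hypothetical and is not part of FFTW; real programs should always use the values returned by the `local_size' routines rather than recomputing them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model (not an FFTW function): split n0 rows over nprocs
   processes in contiguous blocks of size ceil(n0 / nprocs).  Writes the
   number of rows owned by `rank' and the index of its first row. */
void model_block_range(ptrdiff_t n0, int nprocs, int rank,
                       ptrdiff_t *local_n0, ptrdiff_t *local_0_start)
{
    assert(n0 >= 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    ptrdiff_t block = (n0 + nprocs - 1) / nprocs;   /* ceil(n0 / nprocs) */
    ptrdiff_t start = block * rank;
    if (start > n0)
        start = n0;              /* processes past the end own no rows */
    ptrdiff_t end = start + block;
    if (end > n0)
        end = n0;                /* the last non-empty block may be short */
    *local_0_start = start;
    *local_n0 = end - start;
}
```

Under this assumed rule, n0 = 10 rows on 4 processes gives ranks 0 to 2 three rows each and rank 3 the single remaining row.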
4328 | |
4329 The corresponding basic and advanced routines for one-dimensional | |
4330 transforms (currently only complex DFTs) are: | |
4331 | |
4332 ptrdiff_t fftw_mpi_local_size_1d( | |
4333 ptrdiff_t n0, MPI_Comm comm, int sign, unsigned flags, | |
4334 ptrdiff_t *local_ni, ptrdiff_t *local_i_start, | |
4335 ptrdiff_t *local_no, ptrdiff_t *local_o_start); | |
4336 ptrdiff_t fftw_mpi_local_size_many_1d( | |
4337 ptrdiff_t n0, ptrdiff_t howmany, | |
4338 MPI_Comm comm, int sign, unsigned flags, | |
4339 ptrdiff_t *local_ni, ptrdiff_t *local_i_start, | |
4340 ptrdiff_t *local_no, ptrdiff_t *local_o_start); | |
4341 | |
4342 As above, the return value is the number of elements to allocate | |
4343 (complex numbers, for complex DFTs). The `local_ni' and | |
4344 `local_i_start' arguments return the portion (`local_i_start' to | |
4345 `local_i_start + local_ni - 1') of the 1d array that is stored on this | |
4346 process for the transform _input_, and `local_no' and `local_o_start' | |
4347 are the corresponding quantities for the output. The `sign' | |
4348 (`FFTW_FORWARD' or `FFTW_BACKWARD') and `flags' must match the | |
4349 arguments passed when creating a plan. Although the inputs and outputs | |
4350 have different data distributions in general, it is guaranteed that the | |
4351 _output_ data distribution of an `FFTW_FORWARD' plan will match the | |
4352 _input_ data distribution of an `FFTW_BACKWARD' plan and vice versa; | |
4353 similarly for the `FFTW_MPI_SCRAMBLED_OUT' and `FFTW_MPI_SCRAMBLED_IN' | |
4354 flags. *Note One-dimensional distributions::. | |
4355 | |
4356 | |
4357 File: fftw3.info, Node: MPI Plan Creation, Next: MPI Wisdom Communication, Prev: MPI Data Distribution Functions, Up: FFTW MPI Reference | |
4358 | |
4359 6.12.5 MPI Plan Creation | |
4360 ------------------------ | |
4361 | |
4362 Complex-data MPI DFTs | |
4363 ..................... | |
4364 | |
4365 Plans for complex-data DFTs (*note 2d MPI example::) are created by: | |
4366 | |
4367 fftw_plan fftw_mpi_plan_dft_1d(ptrdiff_t n0, fftw_complex *in, fftw_complex *out, | |
4368 MPI_Comm comm, int sign, unsigned flags); | |
4369 fftw_plan fftw_mpi_plan_dft_2d(ptrdiff_t n0, ptrdiff_t n1, | |
4370 fftw_complex *in, fftw_complex *out, | |
4371 MPI_Comm comm, int sign, unsigned flags); | |
4372 fftw_plan fftw_mpi_plan_dft_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | |
4373 fftw_complex *in, fftw_complex *out, | |
4374 MPI_Comm comm, int sign, unsigned flags); | |
4375 fftw_plan fftw_mpi_plan_dft(int rnk, const ptrdiff_t *n, | |
4376 fftw_complex *in, fftw_complex *out, | |
4377 MPI_Comm comm, int sign, unsigned flags); | |
4378 fftw_plan fftw_mpi_plan_many_dft(int rnk, const ptrdiff_t *n, | |
4379 ptrdiff_t howmany, ptrdiff_t block, ptrdiff_t tblock, | |
4380 fftw_complex *in, fftw_complex *out, | |
4381 MPI_Comm comm, int sign, unsigned flags); | |
4382 | |
4383 These are similar to their serial counterparts (*note Complex DFTs::) | |
4384 in specifying the dimensions, sign, and flags of the transform. The | |
4385 `comm' argument gives an MPI communicator that specifies the set of | |
4386 processes to participate in the transform; plan creation is a | |
4387 collective function that must be called for all processes in the | |
4388 communicator. The `in' and `out' pointers refer only to a portion of | |
4389 the overall transform data (*note MPI Data Distribution::) as specified | |
4390 by the `local_size' functions in the previous section. Unless `flags' | |
4391 contains `FFTW_ESTIMATE', these arrays are overwritten during plan | |
4392 creation as for the serial interface. For multi-dimensional | |
4393 transforms, any dimensions `> 1' are supported; for one-dimensional | |
4394 transforms, only composite (non-prime) `n0' are currently supported | |
4395 (unlike the serial FFTW). Requesting an unsupported transform size | |
4396 will yield a `NULL' plan. (As in the serial interface, highly | |
4397 composite sizes generally yield the best performance.) | |
4398 | |
4399 The advanced-interface `fftw_mpi_plan_many_dft' additionally allows | |
4400 you to specify the block sizes for the first dimension (`block') of the | |
4401 n[0] x n[1] x n[2] x ... x n[d-1] input data and the first dimension | |
4402 (`tblock') of the n[1] x n[0] x n[2] x ... x n[d-1] transposed data | |
4403 (at intermediate steps of the transform, and for the output if | |
4404 `FFTW_MPI_TRANSPOSED_OUT' is specified in `flags'). These must be the same | |
4405 block sizes as were passed to the corresponding `local_size' function; | |
4406 you can pass `FFTW_MPI_DEFAULT_BLOCK' to use FFTW's default block size | |
4407 as in the basic interface. Also, the `howmany' parameter specifies | |
4408 that the transform is of contiguous `howmany'-tuples rather than | |
4409 individual complex numbers; this corresponds to the same parameter in | |
4410 the serial advanced interface (*note Advanced Complex DFTs::) with | |
4411 `stride = howmany' and `dist = 1'. | |
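
The `stride = howmany', `dist = 1' layout can be pictured with a small index calculation. This is only a sketch of the interleaved storage described above: `tuple_offset' is a hypothetical helper, not an FFTW routine, and `j' stands for the flattened position of a tuple within the local array.

```c
#include <assert.h>
#include <stddef.h>

/* Offset, in elements, of component k of the tuple at flattened position j,
   for contiguous howmany-tuples (stride = howmany, dist = 1). */
ptrdiff_t tuple_offset(ptrdiff_t j, ptrdiff_t k, ptrdiff_t howmany)
{
    assert(howmany > 0 && j >= 0 && k >= 0 && k < howmany);
    return j * howmany + k;
}
```

With `howmany = 3', for instance, the three components of tuple 2 sit at offsets 6, 7 and 8, so each of the three interleaved data sets is transformed with stride 3.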
4412 | |
4413 MPI flags | |
4414 ......... | |
4415 | |
4416 The `flags' can be any of those for the serial FFTW (*note Planner | |
4417 Flags::), and in addition may include one or more of the following | |
4418 MPI-specific flags, which improve performance at the cost of changing | |
4419 the output or input data formats. | |
4420 | |
4421 * `FFTW_MPI_SCRAMBLED_OUT', `FFTW_MPI_SCRAMBLED_IN': valid for 1d | |
4422 transforms only, these flags indicate that the output/input of the | |
4423 transform are in an undocumented "scrambled" order. A forward | |
4424 `FFTW_MPI_SCRAMBLED_OUT' transform can be inverted by a backward | |
4425 `FFTW_MPI_SCRAMBLED_IN' (times the usual 1/N normalization). | |
4426 *Note One-dimensional distributions::. | |
4427 | |
4428 * `FFTW_MPI_TRANSPOSED_OUT', `FFTW_MPI_TRANSPOSED_IN': valid for | |
4429 multidimensional (`rnk > 1') transforms only, these flags specify | |
4430 that the output or input of an n[0] x n[1] x n[2] x ... x n[d-1] | |
4431 transform is transposed to n[1] x n[0] x n[2] x ... x n[d-1] . | |
4432 *Note Transposed distributions::. | |
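
To make the transposed format concrete, here is a serial, single-process sketch of the index mapping between an n0 x n1 row-major array and its n1 x n0 transpose. It only illustrates the layout; it does not show how FFTW performs the distributed transposition, and `transpose_2d' is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Copy a row-major n0 x n1 array `in' into a row-major n1 x n0 array `out',
   so that out[j][i] == in[i][j].  The arrays must not overlap. */
void transpose_2d(ptrdiff_t n0, ptrdiff_t n1, const double *in, double *out)
{
    assert(n0 >= 0 && n1 >= 0 && in != out);
    for (ptrdiff_t i = 0; i < n0; ++i)
        for (ptrdiff_t j = 0; j < n1; ++j)
            out[j * n0 + i] = in[i * n1 + j];
}
```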
4433 | |
4434 | |
4435 Real-data MPI DFTs | |
4436 .................. | |
4437 | |
4438 Plans for real-input/output (r2c/c2r) DFTs (*note Multi-dimensional MPI | |
4439 DFTs of Real Data::) are created by: | |
4440 | |
4441 fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1, | |
4442 double *in, fftw_complex *out, | |
4443 MPI_Comm comm, unsigned flags); | |
4447 fftw_plan fftw_mpi_plan_dft_r2c_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | |
4448 double *in, fftw_complex *out, | |
4449 MPI_Comm comm, unsigned flags); | |
4450 fftw_plan fftw_mpi_plan_dft_r2c(int rnk, const ptrdiff_t *n, | |
4451 double *in, fftw_complex *out, | |
4452 MPI_Comm comm, unsigned flags); | |
4453 fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1, | |
4454 fftw_complex *in, double *out, | |
4455 MPI_Comm comm, unsigned flags); | |
4459 fftw_plan fftw_mpi_plan_dft_c2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | |
4460 fftw_complex *in, double *out, | |
4461 MPI_Comm comm, unsigned flags); | |
4462 fftw_plan fftw_mpi_plan_dft_c2r(int rnk, const ptrdiff_t *n, | |
4463 fftw_complex *in, double *out, | |
4464 MPI_Comm comm, unsigned flags); | |
4465 | |
4466 Similar to the serial interface (*note Real-data DFTs::), these | |
4467 transform logically n[0] x n[1] x n[2] x ... x n[d-1] real data | |
4468 to/from n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1) complex data, | |
4469 representing the non-redundant half of the conjugate-symmetry output of | |
4470 a real-input DFT (*note Multi-dimensional Transforms::). However, the | |
4471 real array must be stored within a padded | |
4472 n[0] x n[1] x n[2] x ... x [2 (n[d-1]/2 + 1)] array | |
4473 (much like the in-place serial r2c transforms, | |
4474 but here for | |
4475 out-of-place transforms as well). Currently, only multi-dimensional | |
4476 (`rnk > 1') r2c/c2r transforms are supported (requesting a plan for | |
4477 `rnk = 1' will yield `NULL'). As explained above (*note | |
4478 Multi-dimensional MPI DFTs of Real Data::), the data distribution of | |
4479 both the real and complex arrays is given by the `local_size' function | |
4480 called for the dimensions of the _complex_ array. Similar to the other | |
4481 planning functions, the input and output arrays are overwritten when | |
4482 the plan is created except in `FFTW_ESTIMATE' mode. | |
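
The padding arithmetic can be checked with a few lines of C. This is a minimal sketch under the sizes stated above; the helper names are hypothetical and not part of FFTW, and `i' is the flattened index over all dimensions except the last.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the last dimension of the complex array: n_last/2 + 1
   (integer division, rounding down). */
ptrdiff_t complex_last_dim(ptrdiff_t n_last)
{
    assert(n_last > 0);
    return n_last / 2 + 1;
}

/* Length, in real numbers, of each padded row of the real array. */
ptrdiff_t padded_last_dim(ptrdiff_t n_last)
{
    return 2 * complex_last_dim(n_last);
}

/* Offset of real element (i, j) when rows are padded as above. */
ptrdiff_t padded_offset(ptrdiff_t i, ptrdiff_t j, ptrdiff_t n_last)
{
    assert(i >= 0 && j >= 0 && j < n_last);
    return i * padded_last_dim(n_last) + j;
}
```

For a last dimension of 8, the complex array has 5 entries per row and the real array is padded to 10 reals per row, of which only the first 8 hold data.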
4483 | |
4484 As for the complex DFTs above, there is an advanced interface that | |
4485 allows you to manually specify block sizes and to transform contiguous | |
4486 `howmany'-tuples of real/complex numbers: | |
4487 | |
4488 fftw_plan fftw_mpi_plan_many_dft_r2c | |
4489 (int rnk, const ptrdiff_t *n, ptrdiff_t howmany, | |
4490 ptrdiff_t iblock, ptrdiff_t oblock, | |
4491 double *in, fftw_complex *out, | |
4492 MPI_Comm comm, unsigned flags); | |
4493 fftw_plan fftw_mpi_plan_many_dft_c2r | |
4494 (int rnk, const ptrdiff_t *n, ptrdiff_t howmany, | |
4495 ptrdiff_t iblock, ptrdiff_t oblock, | |
4496 fftw_complex *in, double *out, | |
4497 MPI_Comm comm, unsigned flags); | |
4498 | |
4499 MPI r2r transforms | |
4500 .................. | |
4501 | |
4502 There are corresponding plan-creation routines for r2r transforms | |
4503 (*note More DFTs of Real Data::), currently supporting multidimensional | |
4504 (`rnk > 1') transforms only (`rnk = 1' will yield a `NULL' plan): | |
4505 | |
4506 fftw_plan fftw_mpi_plan_r2r_2d(ptrdiff_t n0, ptrdiff_t n1, | |
4507 double *in, double *out, | |
4508 MPI_Comm comm, | |
4509 fftw_r2r_kind kind0, fftw_r2r_kind kind1, | |
4510 unsigned flags); | |
4511 fftw_plan fftw_mpi_plan_r2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | |
4512 double *in, double *out, | |
4513 MPI_Comm comm, | |
4514 fftw_r2r_kind kind0, fftw_r2r_kind kind1, fftw_r2r_kind kind2, | |
4515 unsigned flags); | |
4516 fftw_plan fftw_mpi_plan_r2r(int rnk, const ptrdiff_t *n, | |
4517 double *in, double *out, | |
4518 MPI_Comm comm, const fftw_r2r_kind *kind, | |
4519 unsigned flags); | |
4520 fftw_plan fftw_mpi_plan_many_r2r(int rnk, const ptrdiff_t *n, | |
4521 ptrdiff_t howmany, ptrdiff_t iblock, ptrdiff_t oblock, | |
4522 double *in, double *out, | |
4523 MPI_Comm comm, const fftw_r2r_kind *kind, | |
4524 unsigned flags); | |
4525 | |
4526 The parameters are much the same as for the complex DFTs above, | |
4527 except that the arrays are of real numbers (and hence the outputs of the | |
4528 `local_size' data-distribution functions should be interpreted as | |
4529 counts of real rather than complex numbers). Also, the `kind' | |
4530 parameters specify the r2r kinds along each dimension as for the serial | |
4531 interface (*note Real-to-Real Transform Kinds::). *Note Other | |
4532 Multi-dimensional Real-data MPI Transforms::. | |
4533 | |
4534 MPI transposition | |
4535 ................. | |
4536 | |
4537 FFTW also provides routines to plan a transpose of a distributed `n0' | |
4538 by `n1' array of real numbers, or an array of `howmany'-tuples of real | |
4539 numbers with specified block sizes (*note FFTW MPI Transposes::): | |
4540 | |
4541 fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1, | |
4542 double *in, double *out, | |
4543 MPI_Comm comm, unsigned flags); | |
4544 fftw_plan fftw_mpi_plan_many_transpose | |
4545 (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany, | |
4546 ptrdiff_t block0, ptrdiff_t block1, | |
4547 double *in, double *out, MPI_Comm comm, unsigned flags); | |
4548 | |
4549 These plans are used with the `fftw_mpi_execute_r2r' new-array | |
4550 execute function (*note Using MPI Plans::), since they count as (rank | |
4551 zero) r2r plans from FFTW's perspective. | |
4552 | |
4553 | |
4554 File: fftw3.info, Node: MPI Wisdom Communication, Prev: MPI Plan Creation, Up: FFTW MPI Reference | |
4555 | |
4556 6.12.6 MPI Wisdom Communication | |
4557 ------------------------------- | |
4558 | |
4559 To facilitate synchronizing wisdom among the different MPI processes, | |
4560 we provide two functions: | |
4561 | |
4562 void fftw_mpi_gather_wisdom(MPI_Comm comm); | |
4563 void fftw_mpi_broadcast_wisdom(MPI_Comm comm); | |
4564 | |
4565 The `fftw_mpi_gather_wisdom' function gathers all wisdom in the | |
4566 given communicator `comm' to the process of rank 0 in the communicator: | |
4567 that process obtains the union of all wisdom on all the processes. As | |
4568 a side effect, some other processes will gain additional wisdom from | |
4569 other processes, but only process 0 will gain the complete union. | |
4570 | |
4571 The `fftw_mpi_broadcast_wisdom' function does the reverse: it exports wisdom | |
4572 from process 0 in `comm' to all other processes in the communicator, | |
4573 replacing any wisdom they currently have. | |
4574 | |
4575 *Note FFTW MPI Wisdom::. | |
4576 | |
4577 | |
4578 File: fftw3.info, Node: FFTW MPI Fortran Interface, Prev: FFTW MPI Reference, Up: Distributed-memory FFTW with MPI | |
4579 | |
4580 6.13 FFTW MPI Fortran Interface | |
4581 =============================== | |
4582 | |
4583 The FFTW MPI interface is callable from modern Fortran compilers | |
4584 supporting the Fortran 2003 `iso_c_binding' standard for calling C | |
4585 functions. As described in *note Calling FFTW from Modern Fortran::, | |
4586 this means that you can directly call FFTW's C interface from Fortran | |
4587 with only minor changes in syntax. There are, however, a few things | |
4588 specific to the MPI interface to keep in mind: | |
4589 | |
4590 * Instead of including `fftw3.f03' as in *note Overview of Fortran | |
4591 interface::, you should `include 'fftw3-mpi.f03'' (after `use, | |
4592 intrinsic :: iso_c_binding' as before). The `fftw3-mpi.f03' file | |
4593 includes `fftw3.f03', so you should _not_ `include' them both | |
4594 yourself. (You will also want to include the MPI header file, | |
4595 usually via `include 'mpif.h'' or similar, although this is | |
4596 not needed by `fftw3-mpi.f03' per se.) (To use the `fftwl_' `long | |
4597 double' extended-precision routines in supporting compilers, you | |
4598 should include `fftw3l-mpi.f03' in _addition_ to `fftw3-mpi.f03'. | |
4599 *Note Extended and quadruple precision in Fortran::.) | |
4600 | |
4601 * Because of the different storage conventions between C and Fortran, | |
4602 you reverse the order of your array dimensions when passing them to | |
4603 FFTW (*note Reversing array dimensions::). This is merely a | |
4604 difference in notation and incurs no performance overhead. | |
4605 However, it means that, whereas in C the _first_ dimension is | |
4606 distributed, in Fortran the _last_ dimension of your array is | |
4607 distributed. | |
4608 | |
4609 * In Fortran, communicators are stored as `integer' types; there is | |
4610 no `MPI_Comm' type, nor is there any way to access a C `MPI_Comm'. | |
4611 Fortunately, this is taken care of for you by the FFTW Fortran | |
4612 interface: whenever the C interface expects an `MPI_Comm' type, | |
4613 you should pass the Fortran communicator as an `integer'.(1) | |
4614 | |
4615 * Because you need to call the `local_size' function to find out how | |
4616 much space to allocate, and this may be _larger_ than the local | |
4617 portion of the array (*note MPI Data Distribution::), you should | |
4618 _always_ allocate your arrays dynamically using FFTW's allocation | |
4619 routines as described in *note Allocating aligned memory in | |
4620 Fortran::. (Coincidentally, this also provides the best | |
4621 performance by guaranteeing proper data alignment.) | |
4622 | |
4623 * Because all sizes in the MPI FFTW interface are declared as | |
4624 `ptrdiff_t' in C, you should use `integer(C_INTPTR_T)' in Fortran | |
4625 (*note FFTW Fortran type reference::). | |
4626 | |
4627 * In Fortran, because of the language semantics, we generally | |
4628 recommend using the new-array execute functions for all plans, | |
4629 even in the common case where you are executing the plan on the | |
4630 same arrays for which the plan was created (*note Plan execution | |
4631 in Fortran::). However, note that in the MPI interface these | |
4632 functions are changed: `fftw_execute_dft' becomes | |
4633 `fftw_mpi_execute_dft', etcetera. *Note Using MPI Plans::. | |
4634 | |
4635 | |
4636 For example, here is a Fortran code snippet to perform a distributed | |
4637 L x M complex DFT in-place. (This assumes you have already | |
4638 initialized MPI with `MPI_init' and have also performed `call | |
4639 fftw_mpi_init'.) | |
4640 | |
4641 use, intrinsic :: iso_c_binding | |
4642 include 'fftw3-mpi.f03' | |
4643 integer(C_INTPTR_T), parameter :: L = ... | |
4644 integer(C_INTPTR_T), parameter :: M = ... | |
4645 type(C_PTR) :: plan, cdata | |
4646 complex(C_DOUBLE_COMPLEX), pointer :: data(:,:) | |
4647 integer(C_INTPTR_T) :: i, j, alloc_local, local_M, local_j_offset | |
4648 | |
4649 ! get local data size and allocate (note dimension reversal) | |
4650 alloc_local = fftw_mpi_local_size_2d(M, L, MPI_COMM_WORLD, & | |
4651 local_M, local_j_offset) | |
4652 cdata = fftw_alloc_complex(alloc_local) | |
4653 call c_f_pointer(cdata, data, [L,local_M]) | |
4654 | |
4655 ! create MPI plan for in-place forward DFT (note dimension reversal) | |
4656 plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, & | |
4657 FFTW_FORWARD, FFTW_MEASURE) | |
4658 | |
4659 ! initialize data to some function my_function(i,j) | |
4660 do j = 1, local_M | |
4661 do i = 1, L | |
4662 data(i, j) = my_function(i, j + local_j_offset) | |
4663 end do | |
4664 end do | |
4665 | |
4666 ! compute transform (as many times as desired) | |
4667 call fftw_mpi_execute_dft(plan, data, data) | |
4668 | |
4669 call fftw_destroy_plan(plan) | |
4670 call fftw_free(cdata) | |
4671 | |
4672 Note that we called `fftw_mpi_local_size_2d' and | |
4673 `fftw_mpi_plan_dft_2d' with the dimensions in reversed order, since an L | |
4674 x M Fortran array is viewed by FFTW in C as an M x L array. This | |
4675 means that the array was distributed over the `M' dimension, the local | |
4676 portion of which is a L x local_M array in Fortran. (You must _not_ | |
4677 use an `allocate' statement to allocate an L x local_M array, however; | |
4678 you must allocate `alloc_local' complex numbers, which may be greater | |
4679 than `L * local_M', in order to reserve space for intermediate steps of | |
4680 the transform.) Finally, we mention that because C's array indices are | |
4681 zero-based, the `local_j_offset' argument can conveniently be | |
4682 interpreted as an offset in the 1-based `j' index (rather than as a | |
4683 starting index as in C). | |
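
The equivalence between the two views can be verified with plain index arithmetic. The sketch below uses hypothetical helper names (not part of FFTW) and shows that 1-based Fortran element (i, j) of an L x M column-major array has the same memory offset as 0-based C element [j-1][i-1] of an M x L row-major array. It also shows that adding `local_j_offset' to a local 1-based `j' gives the global 1-based `j'.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major (Fortran) offset of 1-based element (i, j) in an array
   whose first dimension has length L. */
ptrdiff_t fortran_offset(ptrdiff_t i, ptrdiff_t j, ptrdiff_t L)
{
    assert(L > 0 && i >= 1 && i <= L && j >= 1);
    return (i - 1) + (j - 1) * L;
}

/* Row-major (C) offset of 0-based element [r][c] in an array with
   ncols columns. */
ptrdiff_t c_offset(ptrdiff_t r, ptrdiff_t c, ptrdiff_t ncols)
{
    assert(ncols > 0 && r >= 0 && c >= 0 && c < ncols);
    return r * ncols + c;
}

/* Global 1-based column index for local 1-based column j, where
   local_j_offset is the zero-based starting index returned for C. */
ptrdiff_t global_j(ptrdiff_t j, ptrdiff_t local_j_offset)
{
    assert(j >= 1 && local_j_offset >= 0);
    return j + local_j_offset;
}
```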
4684 | |
4685 If instead you had used the `ior(FFTW_MEASURE, | |
4686 FFTW_MPI_TRANSPOSED_OUT)' flag, the output of the transform would be a | |
4687 transposed M x local_L array, associated with the _same_ `cdata' | |
4688 allocation (since the transform is in-place), and which you could | |
4689 declare with: | |
4690 | |
4691 complex(C_DOUBLE_COMPLEX), pointer :: tdata(:,:) | |
4692 ... | |
4693 call c_f_pointer(cdata, tdata, [M,local_L]) | |
4694 | |
4695 where `local_L' would have been obtained by changing the | |
4696 `fftw_mpi_local_size_2d' call to: | |
4697 | |
4698 alloc_local = fftw_mpi_local_size_2d_transposed(M, L, MPI_COMM_WORLD, & | |
4699 local_M, local_j_offset, local_L, local_i_offset) | |
4700 | |
4701 ---------- Footnotes ---------- | |
4702 | |
4703 (1) Technically, this is because you aren't actually calling the C | |
4704 functions directly. You are calling wrapper functions that translate | |
4705 the communicator with `MPI_Comm_f2c' before calling the ordinary C | |
4706 interface. This is all done transparently, however, since the | |
4707 `fftw3-mpi.f03' interface file renames the wrappers so that they are | |
4708 called in Fortran with the same names as the C interface functions. | |
4709 | |
4710 | |
4711 File: fftw3.info, Node: Calling FFTW from Modern Fortran, Next: Calling FFTW from Legacy Fortran, Prev: Distributed-memory FFTW with MPI, Up: Top | |
4712 | |
4713 7 Calling FFTW from Modern Fortran | |
4714 ********************************** | |
4715 | |
4716 Fortran 2003 standardized ways for Fortran code to call C libraries, | |
4717 and this allows us to support a direct translation of the FFTW C API | |
4718 into Fortran. Compared to the legacy Fortran 77 interface (*note | |
4719 Calling FFTW from Legacy Fortran::), this direct interface offers many | |
4720 advantages, especially compile-time type-checking and aligned memory | |
4721 allocation. As of this writing, support for these C interoperability | |
4722 features seems widespread, having been implemented in nearly all major | |
4723 Fortran compilers (e.g. GNU, Intel, IBM, Oracle/Solaris, Portland | |
4724 Group, NAG). | |
4725 | |
4726 This chapter documents that interface. For the most part, since this | |
4727 interface allows Fortran to call the C interface directly, the usage is | |
4728 identical to C translated to Fortran syntax. However, there are a few | |
4729 subtle points such as memory allocation, wisdom, and data types that | |
4730 deserve closer attention. | |
4731 | |
4732 * Menu: | |
4733 | |
4734 * Overview of Fortran interface:: | |
4735 * Reversing array dimensions:: | |
4736 * FFTW Fortran type reference:: | |
4737 * Plan execution in Fortran:: | |
4738 * Allocating aligned memory in Fortran:: | |
4739 * Accessing the wisdom API from Fortran:: | |
4740 * Defining an FFTW module:: | |
4741 | |
4742 | |
4743 File: fftw3.info, Node: Overview of Fortran interface, Next: Reversing array dimensions, Prev: Calling FFTW from Modern Fortran, Up: Calling FFTW from Modern Fortran | |
4744 | |
4745 7.1 Overview of Fortran interface | |
4746 ================================= | |
4747 | |
4748 FFTW provides a file `fftw3.f03', found in the same directory as | |
4749 `fftw3.h' (the C header file), that defines Fortran 2003 interfaces | |
4750 for all of its C routines except the MPI routines, which are described | |
4751 elsewhere. In any Fortran subroutine where you want to use FFTW | |
4752 functions, you should begin with: | |
4753 | |
4754 use, intrinsic :: iso_c_binding | |
4755 include 'fftw3.f03' | |
4756 | |
4757 This includes the interface definitions and the standard | |
4758 `iso_c_binding' module (which defines the equivalents of C types). You | |
4759 can also put the FFTW functions into a module if you prefer (*note | |
4760 Defining an FFTW module::). | |
4761 | |
4762 At this point, you can now call anything in the FFTW C interface | |
4763 directly, almost exactly as in C other than minor changes in syntax. | |
4764 For example: | |
4765 | |
4766 type(C_PTR) :: plan | |
4767 complex(C_DOUBLE_COMPLEX), dimension(1024,1000) :: in, out | |
4768 plan = fftw_plan_dft_2d(1000,1024, in,out, FFTW_FORWARD,FFTW_ESTIMATE) | |
4769 ... | |
4770 call fftw_execute_dft(plan, in, out) | |
4771 ... | |
4772 call fftw_destroy_plan(plan) | |
4773 | |
4774 A few important things to keep in mind are: | |
4775 | |
4776 * FFTW plans are `type(C_PTR)'. Other C types are mapped in the | |
4777 obvious way via the `iso_c_binding' standard: `int' turns into | |
4778 `integer(C_INT)', `fftw_complex' turns into | |
4779 `complex(C_DOUBLE_COMPLEX)', `double' turns into `real(C_DOUBLE)', | |
4780 and so on. *Note FFTW Fortran type reference::. | |
4781 | |
4782 * Functions in C become functions in Fortran if they have a return | |
4783 value, and subroutines in Fortran otherwise. | |
4784 | |
4785 * The ordering of the Fortran array dimensions must be _reversed_ | |
4786 when they are passed to the FFTW plan creation, thanks to | |
4787 differences in array indexing conventions (*note Multi-dimensional | |
4788 Array Format::). This is _unlike_ the legacy Fortran interface | |
4789 (*note Fortran-interface routines::), which reversed the dimensions | |
4790 for you. *Note Reversing array dimensions::. | |
4791 | |
4792 * Using ordinary Fortran array declarations like this works, but may | |
4793 yield suboptimal performance because the data may not be | |
4794 aligned to exploit SIMD instructions on modern processors (*note | |
4795 SIMD alignment and fftw_malloc::). Better performance will often | |
4796 be obtained by allocating with `fftw_alloc'. *Note Allocating | |
4797 aligned memory in Fortran::. | |
4798 | |
4799 * Similar to the legacy Fortran interface (*note FFTW Execution in | |
4800 Fortran::), we currently recommend _not_ using `fftw_execute' but | |
4801 rather using the more specialized functions like | |
4802 `fftw_execute_dft' (*note New-array Execute Functions::). | |
4803 However, you should execute the plan on the `same arrays' as the | |
4804 ones for which you created the plan, unless you are especially | |
4805 careful. *Note Plan execution in Fortran::. To prevent you from | |
4806 using `fftw_execute' by mistake, the `fftw3.f03' file does not | |
4807 provide an `fftw_execute' interface declaration. | |
4808 | |
4809 * Multiple planner flags are combined with `ior' (equivalent to `|' | |
4810 in C). For example, `FFTW_MEASURE | FFTW_DESTROY_INPUT' becomes | |
4811 `ior(FFTW_MEASURE, FFTW_DESTROY_INPUT)'. (You can also use `+' as | |
4812 long as you don't try to include a given flag more than once.) | |
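
The remark about `+' in the last item follows from how single-bit flags behave, which the C sketch below demonstrates. The `DEMO_' flag values are hypothetical and chosen only for illustration; they are not FFTW's actual constants.

```c
#include <assert.h>

/* Hypothetical single-bit flags, for illustration only (not FFTW values). */
#define DEMO_FLAG_A (1u << 0)
#define DEMO_FLAG_B (1u << 3)

unsigned combine_or(unsigned a, unsigned b)  { return a | b; }
unsigned combine_add(unsigned a, unsigned b) { return a + b; }

/* Checks the two rules: `+' agrees with `|' for distinct bits, but
   repeating a flag is harmless with `|' and wrong with `+'. */
void demo_flag_rules(void)
{
    assert(combine_or(DEMO_FLAG_A, DEMO_FLAG_B) ==
           combine_add(DEMO_FLAG_A, DEMO_FLAG_B));
    assert(combine_or(DEMO_FLAG_A, DEMO_FLAG_A) == DEMO_FLAG_A);
    assert(combine_add(DEMO_FLAG_A, DEMO_FLAG_A) != DEMO_FLAG_A);
}
```

In Fortran, `ior' plays the role of `|' here, which is why it is the safer way to combine flags.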
4813 | |
4814 | |
4815 * Menu: | |
4816 | |
4817 * Extended and quadruple precision in Fortran:: | |
4818 | |
4819 | |
4820 File: fftw3.info, Node: Extended and quadruple precision in Fortran, Prev: Overview of Fortran interface, Up: Overview of Fortran interface | |
4821 | |
4822 7.1.1 Extended and quadruple precision in Fortran | |
4823 ------------------------------------------------- | |
4824 | |
4825 If FFTW is compiled in `long double' (extended) precision (*note | |
4826 Installation and Customization::), you may be able to call the | |
4827 resulting `fftwl_' routines (*note Precision::) from Fortran if your | |
4828 compiler supports the `C_LONG_DOUBLE_COMPLEX' type code. | |
4829 | |
4830 Because some Fortran compilers do not support | |
4831 `C_LONG_DOUBLE_COMPLEX', the `fftwl_' declarations are segregated into | |
4832 a separate interface file `fftw3l.f03', which you should include _in | |
4833 addition_ to `fftw3.f03' (which declares precision-independent `FFTW_' | |
4834 constants): | |
4835 | |
4836 use, intrinsic :: iso_c_binding | |
4837 include 'fftw3.f03' | |
4838 include 'fftw3l.f03' | |
4839 | |
4840 We also support using the nonstandard `__float128' | |
4841 quadruple-precision type provided by recent versions of `gcc' on 32- | |
4842 and 64-bit x86 hardware (*note Installation and Customization::), using | |
4843 the corresponding `real(16)' and `complex(16)' types supported by | |
4844 `gfortran'. The quadruple-precision `fftwq_' functions (*note | |
4845 Precision::) are declared in a `fftw3q.f03' interface file, which | |
4846 should be included in addition to `fftw3.f03', as above. You should | |
4847 also link with `-lfftw3q -lquadmath -lm' as in C. | |
4848 | |
4849 | |
4850 File: fftw3.info, Node: Reversing array dimensions, Next: FFTW Fortran type reference, Prev: Overview of Fortran interface, Up: Calling FFTW from Modern Fortran | |
4851 | |
4852 7.2 Reversing array dimensions | |
4853 ============================== | |
4854 | |
4855 A minor annoyance in calling FFTW from Fortran is that FFTW's array | |
4856 dimensions are defined in the C convention (row-major order), while | |
4857 Fortran's array dimensions are the opposite convention (column-major | |
4858 order). *Note Multi-dimensional Array Format::. This is just a | |
4859 bookkeeping difference, with no effect on performance. The only | |
4860 consequence of this is that, whenever you create an FFTW plan for a | |
4861 multi-dimensional transform, you must always _reverse the ordering of | |
4862 the dimensions_. | |
4863 | |
4864 For example, consider the three-dimensional (L x M x N ) arrays: | |
4865 | |
4866 complex(C_DOUBLE_COMPLEX), dimension(L,M,N) :: in, out | |
4867 | |
4868 To plan a DFT for these arrays using `fftw_plan_dft_3d', you could | |
4869 do: | |
4870 | |
4871 plan = fftw_plan_dft_3d(N,M,L, in,out, FFTW_FORWARD,FFTW_ESTIMATE) | |
4872 | |
4873 That is, from FFTW's perspective this is a N x M x L array. _No | |
4874 data transposition need occur_, as this is _only notation_. Similarly, | |
4875 to use the more generic routine `fftw_plan_dft' with the same arrays, | |
4876 you could do: | |
4877 | |
4878 integer(C_INT), dimension(3) :: n = [N,M,L] | |
4879 plan = fftw_plan_dft(3, n, in,out, FFTW_FORWARD,FFTW_ESTIMATE) | |
4880 | |
4881 Note, by the way, that this is different from the legacy Fortran | |
4882 interface (*note Fortran-interface routines::), which automatically | |
4883 reverses the order of the array dimension for you. Here, you are | |
4884 calling the C interface directly, so there is no "translation" layer. | |
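
To see why reversing the dimensions is only notation, it may help to compare the memory offsets directly. The sketch below is an illustration in C, with helper names of our own choosing (`fortran_offset' and `c_offset' are not FFTW functions): the offset of Fortran element `a(i,j,k)' in a `dimension(L,M,N)' array equals the offset of C element `a[k-1][j-1][i-1]' in an array declared `a[N][M][L]'.

```c
#include <stddef.h>

/* Offset of Fortran element a(i,j,k), 1-based, in an array declared
   dimension(L,M,N): the FIRST index varies fastest (column-major). */
size_t fortran_offset(size_t i, size_t j, size_t k, size_t L, size_t M)
{
    return (i - 1) + L * ((j - 1) + M * (k - 1));
}

/* Offset of C element a[k][j][i], 0-based, in an array declared
   a[N][M][L]: the LAST index varies fastest (row-major). */
size_t c_offset(size_t k, size_t j, size_t i, size_t M, size_t L)
{
    return (k * M + j) * L + i;
}
```

For any valid indices, `fortran_offset(i, j, k, L, M)' and `c_offset(k - 1, j - 1, i - 1, M, L)' are the same number, so the data never has to move.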
4885 | |
4886 An important thing to keep in mind is the implication of this for | |
4887 multidimensional real-to-complex transforms (*note Multi-Dimensional | |
4888 DFTs of Real Data::). In C, a multidimensional real-to-complex DFT | |
4889 chops the last dimension roughly in half (N x M x L real input goes to | |
4890 N x M x L/2+1 complex output). In Fortran, because the array | |
4891 dimension notation is reversed, the _first_ dimension of the complex | |
4892 data is chopped roughly in half. For example consider the `r2c' | |
4893 transform of L x M x N real input in Fortran: | |
4894 | |
4895 type(C_PTR) :: plan | |
4896 real(C_DOUBLE), dimension(L,M,N) :: in | |
4897 complex(C_DOUBLE_COMPLEX), dimension(L/2+1,M,N) :: out | |
4898 plan = fftw_plan_dft_r2c_3d(N,M,L, in,out, FFTW_ESTIMATE) | |
4899 ... | |
4900 call fftw_execute_dft_r2c(plan, in, out) | |
4901 | |
4902 Alternatively, for an in-place r2c transform, as described in the C | |
4903 documentation we must _pad_ the _first_ dimension of the real input | |
4904 with an extra two entries (which are ignored by FFTW) so as to leave | |
4905 enough space for the complex output. The input is _allocated_ as a | |
4906 2[L/2+1] x M x N array, even though only L x M x N of it is actually | |
4907 used. In this example, we will allocate the array as a pointer type, | |
4908 using `fftw_alloc' to ensure aligned memory for maximum performance | |
4909 (*note Allocating aligned memory in Fortran::); this also makes it easy | |
4910 to reference the same memory as both a real array and a complex array. | |
4911 | |
4912 real(C_DOUBLE), pointer :: in(:,:,:) | |
4913 complex(C_DOUBLE_COMPLEX), pointer :: out(:,:,:) | |
4914 type(C_PTR) :: plan, data | |
4915 data = fftw_alloc_complex(int((L/2+1) * M * N, C_SIZE_T)) | |
4916 call c_f_pointer(data, in, [2*(L/2+1),M,N]) | |
4917 call c_f_pointer(data, out, [L/2+1,M,N]) | |
4918 plan = fftw_plan_dft_r2c_3d(N,M,L, in,out, FFTW_ESTIMATE) | |
4919 ... | |
4920 call fftw_execute_dft_r2c(plan, in, out) | |
4921 ... | |
4922 call fftw_destroy_plan(plan) | |
4923 call fftw_free(data) | |
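
The size arithmetic for the in-place case can be sketched as follows. This is a small illustration in C with helper names of our own (they are not part of FFTW); it uses integer division, as the Fortran expression `L/2+1' does.

```c
#include <stddef.h>

/* Length of the padded first (Fortran) dimension of the real array
   for an in-place r2c transform: 2*(L/2+1), using integer division. */
size_t padded_first_dim(size_t L)
{
    return 2 * (L / 2 + 1);
}

/* Number of complex elements in the (L/2+1) x M x N output. */
size_t r2c_complex_count(size_t L, size_t M, size_t N)
{
    return (L / 2 + 1) * M * N;
}

/* Number of real elements that the same allocation holds when viewed
   as a 2(L/2+1) x M x N real array. */
size_t r2c_padded_real_count(size_t L, size_t M, size_t N)
{
    return padded_first_dim(L) * M * N;
}
```

By this arithmetic the padding is two reals per row when L is even (L = 4 gives 6) and one when L is odd (L = 5 also gives 6), and the padded real count is always exactly twice the complex count, which is why a single `fftw_alloc_complex' call can back both array pointers.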
4924 | |
4925 | |
4926 File: fftw3.info, Node: FFTW Fortran type reference, Next: Plan execution in Fortran, Prev: Reversing array dimensions, Up: Calling FFTW from Modern Fortran | |
4927 | |
4928 7.3 FFTW Fortran type reference | |
4929 =============================== | |
4930 | |
4931 The following are the most important type correspondences between the C | |
4932 interface and Fortran: | |
4933 | |
4934 * Plans (`fftw_plan' and variants) are `type(C_PTR)' (i.e. an opaque | |
4935 pointer). | |
4936 | |
4937 * The C floating-point types `double', `float', and `long double' | |
4938 correspond to `real(C_DOUBLE)', `real(C_FLOAT)', and | |
4939 `real(C_LONG_DOUBLE)', respectively. The C complex types | |
4940 `fftw_complex', `fftwf_complex', and `fftwl_complex' correspond in | |
4941 Fortran to `complex(C_DOUBLE_COMPLEX)', | |
4942 `complex(C_FLOAT_COMPLEX)', and `complex(C_LONG_DOUBLE_COMPLEX)', | |
4943 respectively. Just as in C (*note Precision::), the FFTW | |
4944 subroutines and types are prefixed with `fftw_', `fftwf_', and | |
4945 `fftwl_' for the different precisions, and link to different | |
4946 libraries (`-lfftw3', `-lfftw3f', and `-lfftw3l' on Unix), but use | |
4947 the _same_ include file `fftw3.f03' and the _same_ constants (all | |
4948 of which begin with `FFTW_'). The exception is `long double' | |
4949 precision, for which you should _also_ include `fftw3l.f03' (*note | |
4950 Extended and quadruple precision in Fortran::). | |
4951 | |
4952 * The C integer types `int' and `unsigned' (used for planner flags) | |
4953 become `integer(C_INT)'. The C integer type `ptrdiff_t' (e.g. in | |
4954 the *note 64-bit Guru Interface::) becomes `integer(C_INTPTR_T)', | |
4955 and `size_t' (in `fftw_malloc' etc.) becomes `integer(C_SIZE_T)'. | |
4956 | |
4957 * The `fftw_r2r_kind' type (*note Real-to-Real Transform Kinds::) | |
4958 becomes `integer(C_FFTW_R2R_KIND)'. The various constant values | |
4959 of the C enumerated type (`FFTW_R2HC' etc.) become simply integer | |
4960 constants of the same names in Fortran. | |
4961 | |
4962 * Numeric array pointer arguments (e.g. `double *') become | |
4963 `dimension(*), intent(out)' arrays of the same type, or | |
4964 `dimension(*), intent(in)' if they are pointers to constant data | |
4965 (e.g. `const int *'). There are a few exceptions where numeric | |
4966 pointers refer to scalar outputs (e.g. for `fftw_flops'), in which | |
4967 case they are `intent(out)' scalar arguments in Fortran too. For | |
4968 the new-array execute functions (*note New-array Execute | |
4969 Functions::), the input arrays are declared `dimension(*), | |
4970 intent(inout)', since they can be modified in the case of in-place | |
4971 or `FFTW_DESTROY_INPUT' transforms. | |
4972 | |
4973 * Pointer _return_ values (e.g. `double *') become `type(C_PTR)'. | |
4974 (If they are pointers to arrays, as for `fftw_alloc_real', you can | |
4975 convert them back to Fortran array pointers with the standard | |
4976 intrinsic function `c_f_pointer'.) | |
4977 | |
4978 * The `fftw_iodim' type in the guru interface (*note Guru vector and | |
4979 transform sizes::) becomes `type(fftw_iodim)' in Fortran, a | |
4980 derived data type (the Fortran analogue of C's `struct') with | |
4981 three `integer(C_INT)' components: `n', `is', and `os', with the | |
4982 same meanings as in C. The `fftw_iodim64' type in the 64-bit guru | |
4983 interface (*note 64-bit Guru Interface::) is the same, except that | |
4984 its components are of type `integer(C_INTPTR_T)'. | |
4985 | |
4986 * Using the wisdom import/export functions from Fortran is a bit | |
4987 tricky, and is discussed in *note Accessing the wisdom API from | |
4988 Fortran::. In brief, the `FILE *' arguments map to `type(C_PTR)', | |
4989 `const char *' to `character(C_CHAR), dimension(*), intent(in)' | |
4990 (null-terminated!), and the generic read-char/write-char functions | |
4991 map to `type(C_FUNPTR)'. | |
4992 | |
4993 | |
4994 You may be wondering if you need to search-and-replace | |
4995 `real(kind(0.0d0))' (or whatever your favorite Fortran spelling of | |
4996 "double precision" is) with `real(C_DOUBLE)' everywhere in your | |
4997 program, and similarly for `complex' and `integer' types. The answer | |
4998 is no; you can still use your existing types. As long as these types | |
4999 match their C counterparts, things should work without a hitch. The | |
5000 worst that can happen, e.g. in the (unlikely) event of a system where | |
5001 `real(kind(0.0d0))' is different from `real(C_DOUBLE)', is that the | |
5002 compiler will give you a type-mismatch error. That is, if you don't | |
5003 use the `iso_c_binding' kinds you need to accept at least the | |
5004 theoretical possibility of having to change your code in response to | |
5005 compiler errors on some future machine, but you don't need to worry | |
5006 about silently compiling incorrect code that yields runtime errors. | |
5007 | |
5008 | |
5009 File: fftw3.info, Node: Plan execution in Fortran, Next: Allocating aligned memory in Fortran, Prev: FFTW Fortran type reference, Up: Calling FFTW from Modern Fortran | |
5010 | |
5011 7.4 Plan execution in Fortran | |
5012 ============================= | |
5013 | |
5014 In C, in order to use a plan, one normally calls `fftw_execute', which | |
5015 executes the plan to perform the transform on the input/output arrays | |
5016 passed when the plan was created (*note Using Plans::). The | |
5017 corresponding subroutine call in modern Fortran is: | |
5018 call fftw_execute(plan) | |
5019 | |
5020 However, we have had reports that this causes problems with some | |
5021 recent optimizing Fortran compilers. The problem is, because the | |
5022 input/output arrays are not passed as explicit arguments to | |
5023 `fftw_execute', the semantics of Fortran (unlike C) allow the compiler | |
5024 to assume that the input/output arrays are not changed by | |
5025 `fftw_execute'. As a consequence, certain compilers end up | |
5026 repositioning the call to `fftw_execute', assuming incorrectly that it | |
5027 does nothing to the arrays. | |
5028 | |
5029 There are various workarounds to this, but the safest and simplest | |
5030 thing is to not use `fftw_execute' in Fortran. Instead, use the | |
5031 functions described in *note New-array Execute Functions::, which take | |
5032 the input/output arrays as explicit arguments. For example, if the | |
5033 plan is for a complex-data DFT and was created for the arrays `in' and | |
5034 `out', you would do: | |
5035 call fftw_execute_dft(plan, in, out) | |
5036 | |
5037 There are a few things to be careful of, however: | |
5038 | |
5039 * You must use the correct type of execute function, matching the way | |
5040 the plan was created. Complex DFT plans should use | |
5041 `fftw_execute_dft', real-input (r2c) DFT plans should use | |
5042 `fftw_execute_dft_r2c', and real-output (c2r) DFT plans should use | |
5043 `fftw_execute_dft_c2r'. The various r2r plans should use | |
5044 `fftw_execute_r2r'. Fortunately, if you use the wrong one you | |
5045 will get a compile-time type-mismatch error (unlike legacy | |
5046 Fortran). | |
5047 | |
5048 * You should normally pass the same input/output arrays that were | |
5049 used when creating the plan. This is always safe. | |
5050 | |
5051 * _If_ you pass _different_ input/output arrays compared to those | |
5052 used when creating the plan, you must abide by all the | |
5053 restrictions of the new-array execute functions (*note New-array | |
5054 Execute Functions::). The most tricky of these is the requirement | |
5055 that the new arrays have the same alignment as the original | |
5056 arrays; the best (and possibly only) way to guarantee this is to | |
5057 use the `fftw_alloc' functions to allocate your arrays (*note | |
5058 Allocating aligned memory in Fortran::). Alternatively, you can | |
5059 use the `FFTW_UNALIGNED' flag when creating the plan, in which | |
5060 case the plan does not depend on the alignment, but this may | |
5061 sacrifice substantial performance on architectures (like x86) with | |
5062 SIMD instructions (*note SIMD alignment and fftw_malloc::). | |
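
The "same alignment" requirement can be pictured as a comparison of addresses modulo some boundary. The sketch below is only an illustration in C: the helper names are ours, and the 16-byte boundary used in the usage note is an assumption borrowed from the legacy Fortran discussion elsewhere in this manual, not a statement of what FFTW checks internally.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the address is a multiple of `alignment' bytes. */
int is_aligned_addr(uintptr_t addr, size_t alignment)
{
    return addr % alignment == 0;
}

/* Returns 1 if two addresses have the same remainder modulo
   `alignment', i.e. the same alignment in the sense of the text. */
int same_alignment_addr(uintptr_t a, uintptr_t b, size_t alignment)
{
    return a % alignment == b % alignment;
}

/* Convenience wrapper for an actual pointer. */
int is_aligned(const void *p, size_t alignment)
{
    return is_aligned_addr((uintptr_t)p, alignment);
}
```

For example, `is_aligned(p, 16)' reports whether `p' sits on a 16-byte boundary. Two arrays obtained from the `fftw_alloc' functions would be expected to pass such a comparison, whereas ordinary Fortran arrays carry no such guarantee.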
5063 | |
5064 | |
5065 | |
5066 File: fftw3.info, Node: Allocating aligned memory in Fortran, Next: Accessing the wisdom API from Fortran, Prev: Plan execution in Fortran, Up: Calling FFTW from Modern Fortran | |
5067 | |
5068 7.5 Allocating aligned memory in Fortran | |
5069 ======================================== | |
5070 | |
5071 In order to obtain maximum performance in FFTW, you should store your | |
5072 data in arrays that have been specially aligned in memory (*note SIMD | |
5073 alignment and fftw_malloc::). Enforcing alignment also permits you to | |
5074 safely use the new-array execute functions (*note New-array Execute | |
5075 Functions::) to apply a given plan to more than one pair of in/out | |
5076 arrays. Unfortunately, standard Fortran arrays do _not_ provide any | |
5077 alignment guarantees. The _only_ way to allocate aligned memory in | |
5078 standard Fortran is to allocate it with an external C function, like | |
5079 the `fftw_alloc_real' and `fftw_alloc_complex' functions. Fortunately, | |
5080 Fortran 2003 provides a simple way to associate such allocated memory | |
5081 with a standard Fortran array pointer that you can then use normally. | |
5082 | |
5083 We therefore recommend allocating all your input/output arrays using | |
5084 the following technique: | |
5085 | |
5086 1. Declare a `pointer', `arr', to your array of the desired type and | |
5087 dimensions. For example, `real(C_DOUBLE), pointer :: arr(:,:)' for | |
5088 a 2d real array, or `complex(C_DOUBLE_COMPLEX), pointer :: | |
5089 arr(:,:,:)' for a 3d complex array. | |
5090 | |
5091 2. The number of elements to allocate must be an `integer(C_SIZE_T)'. | |
5092 You can either declare a variable of this type, e.g. | |
5093 `integer(C_SIZE_T) :: sz', to store the number of elements to | |
5094 allocate, or you can use the `int(..., C_SIZE_T)' intrinsic | |
5095 function, e.g. set `sz = L * M * N' or use `int(L * M * N, | |
5096 C_SIZE_T)' for an L x M x N array. | |
5097 | |
5098 3. Declare a `type(C_PTR) :: p' to hold the return value from FFTW's | |
5099 allocation routine. Set `p = fftw_alloc_real(sz)' for a real | |
5100 array, or `p = fftw_alloc_complex(sz)' for a complex array. | |
5101 | |
5102 4. Associate your pointer `arr' with the allocated memory `p' using | |
5103 the standard `c_f_pointer' subroutine: `call c_f_pointer(p, arr, | |
5104 [...dimensions...])', where `[...dimensions...]' is an array of | |
5105 the dimensions of the array (in the usual Fortran order). e.g. | |
5106 `call c_f_pointer(p, arr, [L,M,N])' for an L x M x N array. | |
5107 (Alternatively, you can omit the dimensions argument if you | |
5108 specified the shape explicitly when declaring `arr'.) You can now | |
5109 use `arr' as a usual multidimensional array. | |
5110 | |
5111 5. When you are done using the array, deallocate the memory by `call | |
5112 fftw_free(p)' on `p'. | |
5113 | |
5114 | |
5115 For example, here is how we would allocate an L x M 2d real array: | |
5116 | |
5117 real(C_DOUBLE), pointer :: arr(:,:) | |
5118 type(C_PTR) :: p | |
5119 p = fftw_alloc_real(int(L * M, C_SIZE_T)) | |
5120 call c_f_pointer(p, arr, [L,M]) | |
5121 _...use arr and arr(i,j) as usual..._ | |
5122 call fftw_free(p) | |
5123 | |
5124 and here is an L x M x N 3d complex array: | |
5125 | |
5126 complex(C_DOUBLE_COMPLEX), pointer :: arr(:,:,:) | |
5127 type(C_PTR) :: p | |
5128 p = fftw_alloc_complex(int(L * M * N, C_SIZE_T)) | |
5129 call c_f_pointer(p, arr, [L,M,N]) | |
5130 _...use arr and arr(i,j,k) as usual..._ | |
5131 call fftw_free(p) | |
5132 | |
5133 See *note Reversing array dimensions:: for an example allocating a | |
5134 single array and associating both real and complex array pointers with | |
5135 it, for in-place real-to-complex transforms. | |
5136 | |
5137 | |
5138 File: fftw3.info, Node: Accessing the wisdom API from Fortran, Next: Defining an FFTW module, Prev: Allocating aligned memory in Fortran, Up: Calling FFTW from Modern Fortran | |
5139 | |
5140 7.6 Accessing the wisdom API from Fortran | |
5141 ========================================= | |
5142 | |
5143 As explained in *note Words of Wisdom-Saving Plans::, FFTW provides a | |
5144 "wisdom" API for saving plans to disk so that they can be recreated | |
5145 quickly. The C API for exporting (*note Wisdom Export::) and importing | |
5146 (*note Wisdom Import::) wisdom is somewhat tricky to use from Fortran, | |
5147 however, because of differences in file I/O and string types between C | |
5148 and Fortran. | |
5149 | |
5150 * Menu: | |
5151 | |
5152 * Wisdom File Export/Import from Fortran:: | |
5153 * Wisdom String Export/Import from Fortran:: | |
5154 * Wisdom Generic Export/Import from Fortran:: | |
5155 | |
5156 | |
5157 File: fftw3.info, Node: Wisdom File Export/Import from Fortran, Next: Wisdom String Export/Import from Fortran, Prev: Accessing the wisdom API from Fortran, Up: Accessing the wisdom API from Fortran | |
5158 | |
5159 7.6.1 Wisdom File Export/Import from Fortran | |
5160 -------------------------------------------- | |
5161 | |
5162 The easiest way to export and import wisdom is to do so using | |
5163 `fftw_export_wisdom_to_filename' and `fftw_import_wisdom_from_filename'. The | |
5164 only trick is that these require you to pass a C string, which is an | |
5165 array of type `CHARACTER(C_CHAR)' that is terminated by `C_NULL_CHAR'. | |
5166 You can call them like this: | |
5167 | |
5168 integer(C_INT) :: ret | |
5169 ret = fftw_export_wisdom_to_filename(C_CHAR_'my_wisdom.dat' // C_NULL_CHAR) | |
5170 if (ret .eq. 0) stop 'error exporting wisdom to file' | |
5171 ret = fftw_import_wisdom_from_filename(C_CHAR_'my_wisdom.dat' // C_NULL_CHAR) | |
5172 if (ret .eq. 0) stop 'error importing wisdom from file' | |
5173 | |
5174 Note that prepending `C_CHAR_' is needed to specify that the literal | |
5175 string is of kind `C_CHAR', and we null-terminate the string by | |
5176 appending `// C_NULL_CHAR'. These functions return an `integer(C_INT)' | |
5177 (`ret') which is `0' if an error occurred during export/import and | |
5178 nonzero otherwise. | |
5179 | |
5180 It is also possible to use the lower-level routines | |
5181 `fftw_export_wisdom_to_file' and `fftw_import_wisdom_from_file', which | |
5182 accept parameters of the C type `FILE*', expressed in Fortran as | |
5183 `type(C_PTR)'. However, you are then responsible for creating the | |
5184 `FILE*' yourself. You can do this by using `iso_c_binding' to define | |
5185 Fortran interfaces for the C library functions `fopen' and `fclose', | |
5186 which is a bit strange in Fortran but workable. | |
5187 | |
5188 | |
5189 File: fftw3.info, Node: Wisdom String Export/Import from Fortran, Next: Wisdom Generic Export/Import from Fortran, Prev: Wisdom File Export/Import from Fortran, Up: Accessing the wisdom API from Fortran | |
5190 | |
5191 7.6.2 Wisdom String Export/Import from Fortran | |
5192 ---------------------------------------------- | |
5193 | |
5194 Dealing with FFTW's C string export/import is a bit more painful. In | |
5195 particular, the `fftw_export_wisdom_to_string' function requires you to | |
5196 deal with a dynamically allocated C string. To get its length, you | |
5197 must define an interface to the C `strlen' function, and to deallocate | |
5198 it you must define an interface to C `free': | |
5199 | |
5200 use, intrinsic :: iso_c_binding | |
5201 interface | |
5202 integer(C_SIZE_T) function strlen(s) bind(C, name='strlen') | |
5203 import | |
5204 type(C_PTR), value :: s | |
5205 end function strlen | |
5206 subroutine free(p) bind(C, name='free') | |
5207 import | |
5208 type(C_PTR), value :: p | |
5209 end subroutine free | |
5210 end interface | |
5211 | |
5212 Given these definitions, you can then export wisdom to a Fortran | |
5213 character array: | |
5214 | |
5215 character(C_CHAR), pointer :: s(:) | |
5216 integer(C_SIZE_T) :: slen | |
5217 type(C_PTR) :: p | |
5218 p = fftw_export_wisdom_to_string() | |
5219 if (.not. c_associated(p)) stop 'error exporting wisdom' | |
5220 slen = strlen(p) | |
5221 call c_f_pointer(p, s, [slen+1]) | |
5222 ... | |
5223 call free(p) | |
5224 | |
5225 Note that `slen' is the length of the C string, but the length of | |
5226 the array is `slen+1' because it includes the terminating null | |
5227 character. (You can omit the `+1' if you don't want Fortran to know | |
5228 about the null character.) The standard `c_associated' function checks | |
5229 whether `p' is a null pointer, which is returned by | |
5230 `fftw_export_wisdom_to_string' if there was an error. | |
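
The difference between `slen' and `slen+1' mirrors what C itself does with strings. As a minimal sketch in C (the two helper names are ours, for illustration): `strlen' returns a `size_t' count that excludes the terminating null character, while the storage behind the string is one element longer.

```c
#include <stddef.h>
#include <string.h>

/* Number of characters before the terminating null character; this
   corresponds to `slen' in the Fortran example. */
size_t c_string_length(const char *s)
{
    return strlen(s);
}

/* Number of array elements occupied by the string including its
   terminating null character; this corresponds to `slen+1'. */
size_t c_string_storage(const char *s)
{
    return strlen(s) + 1;
}
```

Because the C return type is `size_t', the natural Fortran kind for the result is `integer(C_SIZE_T)', matching the declaration of `slen'.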
5231 | |
5232 To import wisdom from a string, use `fftw_import_wisdom_from_string' | |
5233 as usual; note that the argument of this function must be a | |
5234 `character(C_CHAR)' that is terminated by the `C_NULL_CHAR' character, | |
5235 like the `s' array above. | |
5236 | |
5237 | |
5238 File: fftw3.info, Node: Wisdom Generic Export/Import from Fortran, Prev: Wisdom String Export/Import from Fortran, Up: Accessing the wisdom API from Fortran | |
5239 | |
5240 7.6.3 Wisdom Generic Export/Import from Fortran | |
5241 ----------------------------------------------- | |
5242 | |
5243 The most generic wisdom export/import functions allow you to provide an | |
5244 arbitrary callback function to read/write one character at a time in | |
5245 any way you want. However, your callback function must be written in a | |
5246 special way, using the `bind(C)' attribute to be passed to a C | |
5247 interface. | |
5248 | |
5249 In particular, to call the generic wisdom export function | |
5250 `fftw_export_wisdom', you would write a callback subroutine of the form: | |
5251 | |
5252 subroutine my_write_char(c, p) bind(C) | |
5253 use, intrinsic :: iso_c_binding | |
5254 character(C_CHAR), value :: c | |
5255 type(C_PTR), value :: p | |
5256 _...write c..._ | |
5257 end subroutine my_write_char | |
5258 | |
5259 Given such a subroutine (along with the corresponding interface | |
5260 definition), you could then export wisdom using: | |
5261 | |
5262 call fftw_export_wisdom(c_funloc(my_write_char), p) | |
5263 | |
5264 The standard `c_funloc' intrinsic converts a Fortran `bind(C)' | |
5265 subroutine into a C function pointer. The parameter `p' is a | |
5266 `type(C_PTR)' to any arbitrary data that you want to pass to | |
5267 `my_write_char' (or `C_NULL_PTR' if none). (Note that you can get a C | |
5268 pointer to Fortran data using the intrinsic `c_loc', and convert it | |
5269 back to a Fortran pointer in `my_write_char' using `c_f_pointer'.) | |
5270 | |
5271 Similarly, to use the generic `fftw_import_wisdom', you would define | |
5272 a callback function of the form: | |
5273 | |
5274 integer(C_INT) function my_read_char(p) bind(C) | |
5275 use, intrinsic :: iso_c_binding | |
5276 type(C_PTR), value :: p | |
5277 character :: c | |
5278 _...read a character c..._ | |
5279 my_read_char = ichar(c, C_INT) | |
5280 end function my_read_char | |
5281 | |
5282 .... | |
5283 | |
5284 integer(C_INT) :: ret | |
5285 ret = fftw_import_wisdom(c_funloc(my_read_char), p) | |
5286 if (ret .eq. 0) stop 'error importing wisdom' | |
5287 | |
5288 Your function can return `-1' if the end of the input is reached. | |
5289 Again, `p' is an arbitrary `type(C_PTR)' that is passed through to your | |
5290 function. `fftw_import_wisdom' returns `0' if an error occurred and | |
5291 nonzero otherwise. | |
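
It may help to see the callback shapes from the C side, since that is what the `bind(C)' procedures must match. The following is a sketch in C under stated assumptions: `char_sink', `char_source' and `pump_chars' are hypothetical names invented for this illustration, and `pump_chars' merely stands in for a library routine that drives the callbacks; it is not FFTW code.

```c
#include <stddef.h>

/* Hypothetical sink: a fixed-capacity buffer collecting characters. */
struct char_sink { char *buf; size_t len; size_t cap; };

/* Write callback with the same shape as my_write_char: one character
   plus an opaque data pointer (the `p' of the text). */
void sink_write_char(char c, void *data)
{
    struct char_sink *sink = data;
    if (sink->len + 1 < sink->cap) {      /* keep room for the null */
        sink->buf[sink->len++] = c;
        sink->buf[sink->len] = '\0';
    }
}

/* Hypothetical source: a null-terminated string and a read position. */
struct char_source { const char *text; size_t pos; };

/* Read callback with the same shape as my_read_char: returns the next
   character as an int, or -1 when the end of the input is reached. */
int source_read_char(void *data)
{
    struct char_source *src = data;
    char c = src->text[src->pos];
    if (c == '\0')
        return -1;
    src->pos++;
    return (unsigned char)c;
}

/* Stand-in for a library routine (NOT an FFTW function): copies
   characters from a read callback to a write callback until the read
   callback reports the end of input.  Returns the number copied. */
size_t pump_chars(int (*read_char)(void *), void *rdata,
                  void (*write_char)(char, void *), void *wdata)
{
    size_t n = 0;
    int c;
    while ((c = read_char(rdata)) != -1) {
        write_char((char)c, wdata);
        n++;
    }
    return n;
}
```

The opaque pointer is what lets one callback serve many destinations: all per-call state lives in the structure it points to, just as `p' carries arbitrary data through to `my_write_char' and `my_read_char'.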
5292 | |
5293 | |
5294 File: fftw3.info, Node: Defining an FFTW module, Prev: Accessing the wisdom API from Fortran, Up: Calling FFTW from Modern Fortran | |
5295 | |
5296 7.7 Defining an FFTW module | |
5297 =========================== | |
5298 | |
5299 Rather than using the `include' statement to include the `fftw3.f03' | |
5300 interface file in any subroutine where you want to use FFTW, you might | |
5301 prefer to define an FFTW Fortran module. FFTW does not install itself | |
5302 as a module, primarily because `fftw3.f03' can be shared between | |
5303 different Fortran compilers while modules (in general) cannot. | |
5304 However, it is trivial to define your own FFTW module if you want. | |
5305 Just create a file containing: | |
5306 | |
5307 module FFTW3 | |
5308 use, intrinsic :: iso_c_binding | |
5309 include 'fftw3.f03' | |
5310 end module | |
5311 | |
5312 Compile this file into a module as usual for your compiler (e.g. with | |
5313 `gfortran -c' you will get a file `fftw3.mod'). Now, instead of | |
5314 `include 'fftw3.f03'', whenever you want to use FFTW routines you can | |
5315 just do: | |
5316 | |
5317 use FFTW3 | |
5318 | |
5319 as usual for Fortran modules. (You still need to link to the FFTW | |
5320 library, of course.) | |
5321 | |
5322 | |
5323 File: fftw3.info, Node: Calling FFTW from Legacy Fortran, Next: Upgrading from FFTW version 2, Prev: Calling FFTW from Modern Fortran, Up: Top | |
5324 | |
5325 8 Calling FFTW from Legacy Fortran | |
5326 ********************************** | |
5327 | |
5328 This chapter describes the interface to FFTW callable by Fortran code | |
5329 in older compilers not supporting the Fortran 2003 C interoperability | |
5330 features (*note Calling FFTW from Modern Fortran::). This interface | |
5331 has the major disadvantage that it is not type-checked, so if you | |
5332 mistake the argument types or ordering then your program will not have | |
5333 any compiler errors, and will likely crash at runtime. So, greater | |
5334 care is needed. Also, technically interfacing older Fortran versions | |
5335 to C is nonstandard, but in practice we have found that the techniques | |
5336 used in this chapter have worked with all known Fortran compilers for | |
5337 many years. | |
5338 | |
5339 The legacy Fortran interface differs from the C interface only in the | |
5340 prefix (`dfftw_' instead of `fftw_' in double precision) and a few | |
5341 other minor details. This Fortran interface is included in the FFTW | |
5342 libraries by default, unless a Fortran compiler isn't found on your | |
5343 system or `--disable-fortran' is included in the `configure' flags. We | |
5344 assume here that the reader is already familiar with the usage of FFTW | |
5345 in C, as described elsewhere in this manual. | |
5346 | |
5347 The MPI parallel interface to FFTW is _not_ currently available to | |
5348 legacy Fortran. | |
5349 | |
5350 * Menu: | |
5351 | |
5352 * Fortran-interface routines:: | |
5353 * FFTW Constants in Fortran:: | |
5354 * FFTW Execution in Fortran:: | |
5355 * Fortran Examples:: | |
5356 * Wisdom of Fortran?:: | |
5357 | |
5358 | |
5359 File: fftw3.info, Node: Fortran-interface routines, Next: FFTW Constants in Fortran, Prev: Calling FFTW from Legacy Fortran, Up: Calling FFTW from Legacy Fortran | |
5360 | |
5361 8.1 Fortran-interface routines | |
5362 ============================== | |
5363 | |
5364 Nearly all of the FFTW functions have Fortran-callable equivalents. | |
5365 The name of the legacy Fortran routine is the same as that of the | |
5366 corresponding C routine, but with the `fftw_' prefix replaced by | |
5367 `dfftw_'.(1) The single and long-double precision versions use | |
5368 `sfftw_' and `lfftw_', respectively, instead of `fftwf_' and `fftwl_'; | |
5369 quadruple precision (`real*16') is available on some systems as | |
5370 `fftwq_' (*note Precision::). (Note that `long double' on x86 hardware | |
5371 is usually at most 80-bit extended precision, _not_ quadruple | |
5372 precision.) | |
5373 | |
5374 For the most part, all of the arguments to the functions are the | |
5375 same, with the following exceptions: | |
5376 | |
5377 * `plan' variables (what would be of type `fftw_plan' in C), must be | |
5378 declared as a type that is at least as big as a pointer (address) | |
5379 on your machine. We recommend using `integer*8' everywhere, since | |
5380 this should always be big enough. | |
5381 | |
5382 * Any function that returns a value (e.g. `fftw_plan_dft') is | |
5383 converted into a _subroutine_. The return value is converted into | |
5384 an additional _first_ parameter of this subroutine.(2) | |
5385 | |
5386 * The Fortran routines expect multi-dimensional arrays to be in | |
5387 _column-major_ order, which is the ordinary format of Fortran | |
5388 arrays (*note Multi-dimensional Array Format::). They do this | |
5389 transparently and costlessly simply by reversing the order of the | |
5390 dimensions passed to FFTW, but this has one important consequence | |
5391 for multi-dimensional real-complex transforms, discussed below. | |
5392 | |
5393 * Wisdom import and export is somewhat more tricky because one cannot | |
5394 easily pass files or strings between C and Fortran; see *note | |
5395 Wisdom of Fortran?::. | |
5396 | |
5397 * Legacy Fortran cannot use the `fftw_malloc' dynamic-allocation | |
5398 routine. If you want to exploit the SIMD FFTW (*note SIMD | |
5399 alignment and fftw_malloc::), you'll need to figure out some other | |
5400 way to ensure that your arrays are at least 16-byte aligned. | |
5401 | |
5402 * Since Fortran 77 does not have data structures, the `fftw_iodim' | |
5403 structure from the guru interface (*note Guru vector and transform | |
5404 sizes::) must be split into separate arguments. In particular, any | |
5405 `fftw_iodim' array arguments in the C guru interface become three | |
5406 integer array arguments (`n', `is', and `os') in the Fortran guru | |
5407 interface, all of whose lengths should be equal to the | |
5408 corresponding `rank' argument. | |
5409 | |
5410 * The guru planner interface in Fortran does _not_ do any automatic | |
5411 translation between column-major and row-major; you are responsible | |
5412 for setting the strides etcetera to correspond to your Fortran | |
5413 arrays. However, as a slight bug that we are preserving for | |
5414 backwards compatibility, the `plan_guru_r2r' in Fortran _does_ | |
5415 reverse the order of its `kind' array parameter, so the `kind' | |
5416 array of that routine should be in the reverse of the order of the | |
5417 iodim arrays (see above). | |
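
The splitting of `fftw_iodim' arrays described in the list above can be pictured as follows. This is a sketch in C using a hypothetical stand-in structure (`iodim_like', with the field names `n', `is' and `os' taken from the text) and a helper of our own; it shows only the data rearrangement, not a call into FFTW.

```c
/* Hypothetical stand-in for one guru-interface dimension record; the
   field names follow the text (size, input stride, output stride). */
struct iodim_like { int n; int is; int os; };

/* Split `rank' records into three parallel integer arrays, each of
   length `rank', which is the form the legacy Fortran guru routines
   take in place of an array of structures. */
void split_iodims(int rank, const struct iodim_like *dims,
                  int *n, int *is, int *os)
{
    for (int r = 0; r < rank; ++r) {
        n[r]  = dims[r].n;
        is[r] = dims[r].is;
        os[r] = dims[r].os;
    }
}
```

Each output array has exactly `rank' entries, matching the requirement that the lengths of `n', `is' and `os' equal the corresponding `rank' argument.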
5418 | |
5419 | |
5420 In general, you should take care to use Fortran data types that | |
5421 correspond to (i.e. are the same size as) the C types used by FFTW. In | |
5422 practice, this correspondence is usually straightforward (i.e. | |
5423 `integer' corresponds to `int', `real' corresponds to `float', | |
5424 etcetera). The native Fortran double/single-precision complex type | |
5425 should be compatible with `fftw_complex'/`fftwf_complex'. Such simple | |
5426 correspondences are assumed in the examples below. | |
5427 | |
5428 ---------- Footnotes ---------- | |
5429 | |
5430 (1) Technically, Fortran 77 identifiers are not allowed to have more | |
5431 than 6 characters, nor may they contain underscores. Any compiler that | |
5432 enforces this limitation doesn't deserve to link to FFTW. | |
5433 | |
5434 (2) The reason for this is that some Fortran implementations seem to | |
5435 have trouble with C function return values, and vice versa. | |
5436 | |
5437 | |
5438 File: fftw3.info, Node: FFTW Constants in Fortran, Next: FFTW Execution in Fortran, Prev: Fortran-interface routines, Up: Calling FFTW from Legacy Fortran | |
5439 | |
5440 8.2 FFTW Constants in Fortran | |
5441 ============================= | |
5442 | |
5443 When creating plans in FFTW, a number of constants are used to specify | |
5444 options, such as `FFTW_MEASURE' or `FFTW_ESTIMATE'. The same constants | |
5445 must be used with the wrapper routines, but of course the C header | |
5446 files where the constants are defined can't be incorporated directly | |
5447 into Fortran code. | |
5448 | |
5449 Instead, we have placed Fortran equivalents of the FFTW constant | |
5450 definitions in the file `fftw3.f', which can be found in the same | |
5451 directory as `fftw3.h'. If your Fortran compiler supports a | |
5452 preprocessor of some sort, you should be able to `include' or | |
5453 `#include' this file; otherwise, you can paste it directly into your | |
5454 code. | |
5455 | |
5456 In C, you combine different flags (like `FFTW_PRESERVE_INPUT' and | |
5457 `FFTW_MEASURE') using the ``|'' operator; in Fortran you should just | |
5458 use ``+''. (Take care not to add in the same flag more than once, | |
5459 though. Alternatively, you can use the `ior' intrinsic function | |
5460 standardized in Fortran 95.) | |
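
The warning about adding the same flag twice can be seen in a small C sketch. The flag names and bit positions below are illustrative stand-ins chosen for this example, not the definitions from `fftw3.h' or `fftw3.f'; the only assumption is that each flag occupies a distinct single bit.

```c
/* Illustrative stand-ins for two distinct single-bit planner flags.
   The names and bit positions are assumptions for this sketch; real
   programs should use the constants from fftw3.h or fftw3.f. */
#define DEMO_FLAG_A (1U << 4)
#define DEMO_FLAG_B (1U << 6)

/* Bitwise OR, as in C (or the ior intrinsic in Fortran). */
unsigned demo_combine_or(unsigned flags, unsigned flag)
{
    return flags | flag;
}

/* Plain addition, as suggested for legacy Fortran. */
unsigned demo_combine_add(unsigned flags, unsigned flag)
{
    return flags + flag;
}
```

For two different flags, both functions give the same result. Applying the same flag a second time leaves the OR result unchanged, whereas addition carries into a different bit and silently selects some other combination of options.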
5461 | |
5462 | |
5463 File: fftw3.info, Node: FFTW Execution in Fortran, Next: Fortran Examples, Prev: FFTW Constants in Fortran, Up: Calling FFTW from Legacy Fortran | |
5464 | |
5465 8.3 FFTW Execution in Fortran | |
5466 ============================= | |
5467 | |
5468 In C, in order to use a plan, one normally calls `fftw_execute', which | |
5469 executes the plan to perform the transform on the input/output arrays | |
5470 passed when the plan was created (*note Using Plans::). The | |
5471 corresponding subroutine call in legacy Fortran is: | |
5472 call dfftw_execute(plan) | |
5473 | |
5474 However, we have had reports that this causes problems with some | |
5475 recent optimizing Fortran compilers. The problem is, because the | |
5476 input/output arrays are not passed as explicit arguments to | |
5477 `dfftw_execute', the semantics of Fortran (unlike C) allow the compiler | |
5478 to assume that the input/output arrays are not changed by | |
5479 `dfftw_execute'. As a consequence, certain compilers end up optimizing | |
5480 out or repositioning the call to `dfftw_execute', assuming incorrectly | |
5481 that it does nothing. | |
5482 | |
5483 There are various workarounds to this, but the safest and simplest | |
5484 thing is to not use `dfftw_execute' in Fortran. Instead, use the | |
5485 functions described in *note New-array Execute Functions::, which take | |
5486 the input/output arrays as explicit arguments. For example, if the | |
5487 plan is for a complex-data DFT and was created for the arrays `in' and | |
5488 `out', you would do: | |
5489 call dfftw_execute_dft(plan, in, out) | |
5490 | |
5491 There are a few things to be careful of, however: | |
5492 | |
5493 * You must use the correct type of execute function, matching the way | |
5494 the plan was created. Complex DFT plans should use | |
5495 `dfftw_execute_dft', real-input (r2c) DFT plans should use | |
5496 `dfftw_execute_dft_r2c', and real-output (c2r) DFT plans should | |
5497 use `dfftw_execute_dft_c2r'. The various r2r plans should use | |
5498 `dfftw_execute_r2r'. | |
5499 | |
5500 * You should normally pass the same input/output arrays that were | |
5501 used when creating the plan. This is always safe. | |
5502 | |
5503 * _If_ you pass _different_ input/output arrays compared to those | |
5504 used when creating the plan, you must abide by all the | |
5505 restrictions of the new-array execute functions (*note New-array | |
5506 Execute Functions::). The most difficult of these, in Fortran, is | |
5507 the requirement that the new arrays have the same alignment as the | |
5508 original arrays, because there seems to be no way in legacy | |
5509 Fortran to obtain guaranteed-aligned arrays (analogous to | |
5510 `fftw_malloc' in C). You can, of course, use the `FFTW_UNALIGNED' | |
5511 flag when creating the plan, in which case the plan does not | |
5512 depend on the alignment, but this may sacrifice substantial | |
5513 performance on architectures (like x86) with SIMD instructions | |
5514 (*note SIMD alignment and fftw_malloc::). | |
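
Where the arrays can be inspected from C, the "same alignment" condition can be checked with a sketch like the following. The helper names are hypothetical and not part of FFTW, and the 16-byte boundary is an assumption taken from the alignment figure quoted in this manual; other SIMD instruction sets may call for a different value.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of FFTW.  The 16-byte boundary is an
   assumption taken from the alignment figure quoted in this manual. */
#define DEMO_ALIGNMENT 16u

/* Offset of p from the previous 16-byte boundary (0 means aligned). */
unsigned demo_alignment_offset(const void *p)
{
    return (unsigned)((uintptr_t)p % DEMO_ALIGNMENT);
}

/* Nonzero if both pointers sit at the same offset from a 16-byte boundary. */
int demo_same_alignment(const void *a, const void *b)
{
    return demo_alignment_offset(a) == demo_alignment_offset(b);
}
```

Two arrays for which `demo_same_alignment' returns nonzero satisfy the alignment condition under this assumption; the other restrictions of the new-array execute functions still apply.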
5515 | |
5516 | |
5517 | |
5518 File: fftw3.info, Node: Fortran Examples, Next: Wisdom of Fortran?, Prev: FFTW Execution in Fortran, Up: Calling FFTW from Legacy Fortran | |
5519 | |
5520 8.4 Fortran Examples | |
5521 ==================== | |
5522 | |
5523 In C, you might have something like the following to transform a | |
5524 one-dimensional complex array: | |
5525 | |
5526 fftw_complex in[N], out[N]; | |
5527 fftw_plan plan; | |
5528 | |
5529 plan = fftw_plan_dft_1d(N,in,out,FFTW_FORWARD,FFTW_ESTIMATE); | |
5530 fftw_execute(plan); | |
5531 fftw_destroy_plan(plan); | |
5532 | |
5533 In Fortran, you would use the following to accomplish the same thing: | |
5534 | |
5535 double complex in, out | |
5536 dimension in(N), out(N) | |
5537 integer*8 plan | |
5538 | |
5539 call dfftw_plan_dft_1d(plan,N,in,out,FFTW_FORWARD,FFTW_ESTIMATE) | |
5540 call dfftw_execute_dft(plan, in, out) | |
5541 call dfftw_destroy_plan(plan) | |
5542 | |
5543 Notice how all routines are called as Fortran subroutines, and the | |
5544 plan is returned via the first argument to `dfftw_plan_dft_1d'. Notice | |
5545 also that we changed `fftw_execute' to `dfftw_execute_dft' (*note FFTW | |
5546 Execution in Fortran::). To do the same thing, but using 8 threads in | |
5547 parallel (*note Multi-threaded FFTW::), you would simply prefix these | |
5548 calls with: | |
5549 | |
5550 integer iret | |
5551 call dfftw_init_threads(iret) | |
5552 call dfftw_plan_with_nthreads(8) | |
5553 | |
5554 (You might want to check the value of `iret': if it is zero, it | |
5555 indicates an unlikely error during thread initialization.) | |
5556 | |
5557 To transform a three-dimensional array in-place with C, you might do: | |
5558 | |
5559 fftw_complex arr[L][M][N]; | |
5560 fftw_plan plan; | |
5561 | |
5562 plan = fftw_plan_dft_3d(L,M,N, arr,arr, | |
5563 FFTW_FORWARD, FFTW_ESTIMATE); | |
5564 fftw_execute(plan); | |
5565 fftw_destroy_plan(plan); | |
5566 | |
5567 In Fortran, you would use this instead: | |
5568 | |
5569 double complex arr | |
5570 dimension arr(L,M,N) | |
5571 integer*8 plan | |
5572 | |
5573 call dfftw_plan_dft_3d(plan, L,M,N, arr,arr, | |
5574 & FFTW_FORWARD, FFTW_ESTIMATE) | |
5575 call dfftw_execute_dft(plan, arr, arr) | |
5576 call dfftw_destroy_plan(plan) | |
5577 | |
5578 Note that we pass the array dimensions in the "natural" order in | |
5579 both C and Fortran. | |
5580 | |
5581 To transform a one-dimensional real array in Fortran, you might do: | |
5582 | |
5583 double precision in | |
5584 dimension in(N) | |
5585 double complex out | |
5586 dimension out(N/2 + 1) | |
5587 integer*8 plan | |
5588 | |
5589 call dfftw_plan_dft_r2c_1d(plan,N,in,out,FFTW_ESTIMATE) | |
5590 call dfftw_execute_dft_r2c(plan, in, out) | |
5591 call dfftw_destroy_plan(plan) | |
5592 | |
5593 To transform a two-dimensional real array, out of place, you might | |
5594 use the following: | |
5595 | |
5596 double precision in | |
5597 dimension in(M,N) | |
5598 double complex out | |
5599 dimension out(M/2 + 1, N) | |
5600 integer*8 plan | |
5601 | |
5602 call dfftw_plan_dft_r2c_2d(plan,M,N,in,out,FFTW_ESTIMATE) | |
5603 call dfftw_execute_dft_r2c(plan, in, out) | |
5604 call dfftw_destroy_plan(plan) | |
5605 | |
5606 *Important:* Notice that it is the _first_ dimension of the complex | |
5607 output array that is cut in half in Fortran, rather than the last | |
5608 dimension as in C. This is a consequence of the interface routines | |
5609 reversing the order of the array dimensions passed to FFTW so that the | |
5610 Fortran program can use its ordinary column-major order. | |
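
The size bookkeeping behind this rule can be sketched in C as follows. The function names are hypothetical and not FFTW routines; the sketch only assumes what is stated above, namely that the last dimension in C order is the one cut to n/2 + 1 and that the Fortran wrappers reverse the dimension order.

```c
#include <stddef.h>

/* Hypothetical helper, not an FFTW routine: number of complex outputs of
   a rank-2 real-to-complex transform whose dimensions are given in C
   (row-major) order.  Only the last C dimension is cut to n1/2 + 1. */
size_t demo_r2c_2d_outputs(size_t n0, size_t n1)
{
    return n0 * (n1 / 2 + 1);
}

/* A Fortran array in(M,N) is handed to FFTW with its dimensions reversed,
   i.e. as a C array of shape [N][M], so the halved dimension is M. */
size_t demo_r2c_2d_outputs_fortran(size_t m, size_t n)
{
    return demo_r2c_2d_outputs(n, m);
}

/* Zero-based offset of Fortran element (i,j) in a column-major array with
   leading dimension ld; equal to the row-major offset of C element [j][i]
   in an array whose last dimension is ld. */
size_t demo_column_major_offset(size_t i, size_t j, size_t ld)
{
    return i + ld * j;
}
```

For example, with M = 6 and N = 4 the Fortran output array `out(M/2 + 1, N)' holds 4 x 4 = 16 complex values, the same count as a C output array of shape [4][6/2 + 1].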
5611 | |
5612 | |
5613 File: fftw3.info, Node: Wisdom of Fortran?, Prev: Fortran Examples, Up: Calling FFTW from Legacy Fortran | |
5614 | |
5615 8.5 Wisdom of Fortran? | |
5616 ====================== | |
5617 | |
5618 In this section, we discuss how one can import/export FFTW wisdom | |
5619 (saved plans) to/from a Fortran program; we assume that the reader is | |
5620 already familiar with wisdom, as described in *note Words of | |
5621 Wisdom-Saving Plans::. | |
5622 | |
5623 The basic problem is that it is difficult to (portably) pass files and | |
5624 strings between Fortran and C, so we cannot provide a direct Fortran | |
5625 equivalent to the `fftw_export_wisdom_to_file', etcetera, functions. | |
5626 Fortran interfaces _are_ provided for the functions that do not take | |
5627 file/string arguments, however: `dfftw_import_system_wisdom', | |
5628 `dfftw_import_wisdom', `dfftw_export_wisdom', and `dfftw_forget_wisdom'. | |
5629 | |
5630 So, for example, to import the system-wide wisdom, you would do: | |
5631 | |
5632 integer isuccess | |
5633 call dfftw_import_system_wisdom(isuccess) | |
5634 | |
5635 As usual, the C return value is turned into a first parameter; | |
5636 `isuccess' is non-zero on success and zero on failure (e.g. if there is | |
5637 no system wisdom installed). | |
5638 | |
5639 If you want to import/export wisdom from/to an arbitrary file or | |
5640 elsewhere, you can employ the generic `dfftw_import_wisdom' and | |
5641 `dfftw_export_wisdom' functions, for which you must supply a subroutine | |
5642 to read/write one character at a time. The FFTW package contains an | |
5643 example file `doc/f77_wisdom.f' demonstrating how to implement | |
5644 `import_wisdom_from_file' and `export_wisdom_to_file' subroutines in | |
5645 this way. (These routines cannot be compiled into the FFTW library | |
5646 itself, lest all FFTW-using programs be required to link with the | |
5647 Fortran I/O library.) | |
5648 | |
5649 | |
5650 File: fftw3.info, Node: Upgrading from FFTW version 2, Next: Installation and Customization, Prev: Calling FFTW from Legacy Fortran, Up: Top | |
5651 | |
5652 9 Upgrading from FFTW version 2 | |
5653 ******************************* | |
5654 | |
5655 In this chapter, we outline the process for updating codes designed for | |
5656 the older FFTW 2 interface to work with FFTW 3. The interface for FFTW | |
5657 3 is not backwards-compatible with the interface for FFTW 2 and earlier | |
5658 versions; codes written to use those versions will fail to link with | |
5659 FFTW 3. Nor is it possible to write "compatibility wrappers" to bridge | |
5660 the gap (at least not efficiently), because FFTW 3 has different | |
5661 semantics from previous versions. However, upgrading should be a | |
5662 straightforward process because the data formats are identical and the | |
5663 overall style of planning/execution is essentially the same. | |
5664 | |
5665 Unlike FFTW 2, there are no separate header files for real and | |
5666 complex transforms (or even for different precisions) in FFTW 3; all | |
5667 interfaces are defined in the `<fftw3.h>' header file. | |
5668 | |
5669 Numeric Types | |
5670 ============= | |
5671 | |
5672 The main difference in data types is that `fftw_complex' in FFTW 2 was | |
5673 defined as a `struct' with macros `c_re' and `c_im' for accessing the | |
5674 real/imaginary parts. (This is binary-compatible with FFTW 3 on any | |
5675 machine except perhaps for some older Crays in single precision.) The | |
5676 equivalent macros for FFTW 3 are: | |
5677 | |
5678 #define c_re(c) ((c)[0]) | |
5679 #define c_im(c) ((c)[1]) | |
5680 | |
5681 This does not work if you are using the C99 complex type, however, | |
5682 unless you insert a `double*' typecast into the above macros (*note | |
5683 Complex numbers::). | |
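
One possible shape for such macros is sketched below. The type name `demo_pair' is a hypothetical stand-in for the two-element array form of `fftw_complex', and taking the address of the argument before casting is an assumption of this sketch (it lets one macro accept either representation); it is not text from the FFTW sources.

```c
#include <complex.h>

/* Stand-in for the default FFTW 3 complex type (an array of two doubles);
   the name demo_pair is hypothetical. */
typedef double demo_pair[2];

/* One possible form of the macros with a double* cast inserted.  Taking
   the address first lets the same macro accept either a demo_pair or a
   C99 double complex lvalue; this exact spelling is an assumption of the
   sketch.  The argument must be an lvalue. */
#define c_re(c) (((double *) &(c))[0])
#define c_im(c) (((double *) &(c))[1])

/* Accessors for the two-element array representation. */
double demo_pair_re(demo_pair *z) { return c_re(*z); }
double demo_pair_im(demo_pair *z) { return c_im(*z); }

/* Accessors for the C99 complex representation. */
double demo_c99_re(double complex w) { return c_re(w); }
double demo_c99_im(double complex w) { return c_im(w); }
```

This relies on a C99 complex number having the same memory layout as an array of two reals (real part first), which is the layout the cast assumes.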
5684 | |
5685 Also, FFTW 2 had an `fftw_real' typedef that was an alias for | |
5686 `double' (in double precision). In FFTW 3 you should just use `double' | |
5687 (or whatever precision you are employing). | |
5688 | |
5689 Plans | |
5690 ===== | |
5691 | |
5692 The major difference between FFTW 2 and FFTW 3 is in the | |
5693 planning/execution division of labor. In FFTW 2, plans were found for a | |
5694 given transform size and type, and then could be applied to _any_ | |
5695 arrays and for _any_ multiplicity/stride parameters. In FFTW 3, you | |
5696 specify the particular arrays, stride parameters, etcetera when | |
5697 creating the plan, and the plan is then executed for _those_ arrays | |
5698 (unless the guru interface is used) and _those_ parameters _only_. | |
5699 (FFTW 2 had "specific planner" routines that planned for a particular | |
5700 array and stride, but the plan could still be used for other arrays and | |
5701 strides.) That is, much of the information that was formerly specified | |
5702 at execution time is now specified at planning time. | |
5703 | |
5704 Like FFTW 2's specific planner routines, the FFTW 3 planner | |
5705 overwrites the input/output arrays unless you use `FFTW_ESTIMATE'. | |
5706 | |
5707 FFTW 2 had separate data types `fftw_plan', `fftwnd_plan', | |
5708 `rfftw_plan', and `rfftwnd_plan' for complex and real one- and | |
5709 multi-dimensional transforms, and each type had its own `destroy' | |
5710 function. In FFTW 3, all plans are of type `fftw_plan' and all are | |
5711 destroyed by `fftw_destroy_plan(plan)'. | |
5712 | |
5713 Where you formerly used `fftw_create_plan' and `fftw_one' to plan | |
5714 and compute a single 1d transform, you would now use `fftw_plan_dft_1d' | |
5715 to plan the transform. If you used the generic `fftw' function to | |
5716 execute the transform with multiplicity (`howmany') and stride | |
5717 parameters, you would now use the advanced interface | |
5718 `fftw_plan_many_dft' to specify those parameters. The plans are now | |
5719 executed with `fftw_execute(plan)', which takes all of its parameters | |
5720 (including the input/output arrays) from the plan. | |
5721 | |
5722 In-place transforms no longer interpret their output argument as | |
5723 scratch space, nor is there an `FFTW_IN_PLACE' flag. You simply pass | |
5724 the same pointer for both the input and output arguments. (Previously, | |
5725 the output `ostride' and `odist' parameters were ignored for in-place | |
5726 transforms; now, if they are specified via the advanced interface, they | |
5727 are significant even in the in-place case, although they should | |
5728 normally equal the corresponding input parameters.) | |
5729 | |
5730 The `FFTW_ESTIMATE' and `FFTW_MEASURE' flags have the same meaning | |
5731 as before, although the planning time will differ. You may also | |
5732 consider using `FFTW_PATIENT', which is like `FFTW_MEASURE' except that | |
5733 it takes more time in order to consider a wider variety of algorithms. | |
5734 | |
5735 For multi-dimensional complex DFTs, instead of `fftwnd_create_plan' | |
5736 (or `fftw2d_create_plan' or `fftw3d_create_plan'), followed by | |
5737 `fftwnd_one', you would use `fftw_plan_dft' (or `fftw_plan_dft_2d' or | |
5738 `fftw_plan_dft_3d'), followed by `fftw_execute'. If you used `fftwnd' | |
5739 to specify strides etcetera, you would instead specify these via | |
5740 `fftw_plan_many_dft'. | |
5741 | |
5742 The analogues to `rfftw_create_plan' and `rfftw_one' with | |
5743 `FFTW_REAL_TO_COMPLEX' or `FFTW_COMPLEX_TO_REAL' directions are | |
5744 `fftw_plan_r2r_1d' with kind `FFTW_R2HC' or `FFTW_HC2R', followed by | |
5745 `fftw_execute'. The stride etcetera arguments of `rfftw' are now in | |
5746 `fftw_plan_many_r2r'. | |
5747 | |
5748 Instead of `rfftwnd_create_plan' (or `rfftw2d_create_plan' or | |
5749 `rfftw3d_create_plan') followed by `rfftwnd_one_real_to_complex' or | |
5750 `rfftwnd_one_complex_to_real', you now use `fftw_plan_dft_r2c' (or | |
5751 `fftw_plan_dft_r2c_2d' or `fftw_plan_dft_r2c_3d') or | |
5752 `fftw_plan_dft_c2r' (or `fftw_plan_dft_c2r_2d' or | |
5753 `fftw_plan_dft_c2r_3d'), respectively, followed by `fftw_execute'. As | |
5754 usual, the strides etcetera of `rfftwnd_real_to_complex' or | |
5755 `rfftwnd_complex_to_real' are now specified in the advanced planner | |
5756 routines, `fftw_plan_many_dft_r2c' or `fftw_plan_many_dft_c2r'. | |
5757 | |
5758 Wisdom | |
5759 ====== | |
5760 | |
5761 In FFTW 2, you had to supply the `FFTW_USE_WISDOM' flag in order to use | |
5762 wisdom; in FFTW 3, wisdom is always used. (You could simulate the FFTW | |
5763 2 wisdom-less behavior by calling `fftw_forget_wisdom' after every | |
5764 planner call.) | |
5765 | |
5766 The FFTW 3 wisdom import/export routines are almost the same as | |
5767 before (although the storage format is entirely different). There is | |
5768 one significant difference, however. In FFTW 2, the import routines | |
5769 would never read past the end of the wisdom, so you could store extra | |
5770 data beyond the wisdom in the same file, for example. In FFTW 3, the | |
5771 file-import routine may read up to a few hundred bytes past the end of | |
5772 the wisdom, so you cannot store other data just beyond it.(1) | |
5773 | |
5774 Wisdom has been enhanced by additional humility in FFTW 3: whereas | |
5775 FFTW 2 would re-use wisdom for a given transform size regardless of the | |
5776 stride etc., in FFTW 3 wisdom is only used with the strides etc. for | |
5777 which it was created. Unfortunately, this means FFTW 3 has to create | |
5778 new plans from scratch more often than FFTW 2 (in FFTW 2, planning e.g. | |
5779 one transform of size 1024 also created wisdom for all smaller powers | |
5780 of 2, but this no longer occurs). | |
5781 | |
5782 FFTW 3 also has the new routine `fftw_import_system_wisdom' to | |
5783 import wisdom from a standard system-wide location. | |
5784 | |
5785 Memory allocation | |
5786 ================= | |
5787 | |
5788 In FFTW 3, we recommend allocating your arrays with `fftw_malloc' and | |
5789 deallocating them with `fftw_free'; this is not required, but allows | |
5790 optimal performance when SIMD acceleration is used. (Those two | |
5791 functions actually existed in FFTW 2, and worked the same way, but were | |
5792 not documented.) | |
5793 | |
5794 In FFTW 2, there were `fftw_malloc_hook' and `fftw_free_hook' | |
5795 functions that allowed the user to replace FFTW's memory-allocation | |
5796 routines (e.g. to implement different error-handling, since by default | |
5797 FFTW prints an error message and calls `exit' to abort the program if | |
5798 `malloc' returns `NULL'). These hooks are not supported in FFTW 3; | |
5799 those few users who require this functionality can just directly modify | |
5800 the memory-allocation routines in FFTW (they are defined in | |
5801 `kernel/alloc.c'). | |
5802 | |
5803 Fortran interface | |
5804 ================= | |
5805 | |
5806 In FFTW 2, the subroutine names were obtained by replacing `fftw_' with | |
5807 `fftw_f77'; in FFTW 3, you replace `fftw_' with `dfftw_' (or `sfftw_' | |
5808 or `lfftw_', depending upon the precision). | |
5809 | |
5810 In FFTW 3, we have begun recommending that you always declare the | |
5811 type used to store plans as `integer*8'. (Too many people didn't notice | |
5812 our instruction to switch from `integer' to `integer*8' for 64-bit | |
5813 machines.) | |
5814 | |
5815 In FFTW 3, we provide a `fftw3.f' "header file" to include in your | |
5816 code (and which is officially installed on Unix systems). (In FFTW 2, | |
5817 we supplied a `fftw_f77.i' file, but it was not installed.) | |
5818 | |
5819 Otherwise, the C-Fortran interface relationship is much the same as | |
5820 it was before (e.g. return values become initial parameters, and | |
5821 multi-dimensional arrays are in column-major order). Unlike FFTW 2, we | |
5822 do provide some support for wisdom import/export in Fortran (*note | |
5823 Wisdom of Fortran?::). | |
5824 | |
5825 Threads | |
5826 ======= | |
5827 | |
5828 As in FFTW 2, only the execution routines are thread-safe. All planner | |
5829 routines, etcetera, should be called by only a single thread at a time | |
5830 (*note Thread safety::). _Unlike_ FFTW 2, there is no special | |
5831 `FFTW_THREADSAFE' flag for the planner to allow a given plan to be | |
5832 usable by multiple threads in parallel; this is now the case by default. | |
5833 | |
5834 The multi-threaded version of FFTW 2 required you to pass the number | |
5835 of threads each time you execute the transform. The number of threads | |
5836 is now stored in the plan, and is specified before the planner is | |
5837 called by `fftw_plan_with_nthreads'. The threads initialization | |
5838 routine used to be called `fftw_threads_init' and would return zero on | |
5839 success; the new routine is called `fftw_init_threads' and returns zero | |
5840 on failure. *Note Multi-threaded FFTW::. | |
5841 | |
5842 There is no separate threads header file in FFTW 3; all the function | |
5843 prototypes are in `<fftw3.h>'. However, you still have to link to a | |
5844 separate library (`-lfftw3_threads -lfftw3 -lm' on Unix), as well as to | |
5845 the threading library (e.g. POSIX threads on Unix). | |
5846 | |
5847 ---------- Footnotes ---------- | |
5848 | |
5849 (1) We do our own buffering because GNU libc I/O routines are | |
5850 horribly slow for single-character I/O, apparently for thread-safety | |
5851 reasons (whether you are using threads or not). | |
5852 | |
5853 | |
5854 File: fftw3.info, Node: Installation and Customization, Next: Acknowledgments, Prev: Upgrading from FFTW version 2, Up: Top | |
5855 | |
5856 10 Installation and Customization | |
5857 ********************************* | |
5858 | |
5859 This chapter describes the installation and customization of FFTW, the | |
5860 latest version of which may be downloaded from the FFTW home page | |
5861 (http://www.fftw.org). | |
5862 | |
5863 In principle, FFTW should work on any system with an ANSI C compiler | |
5864 (`gcc' is fine). However, planner time is drastically reduced if FFTW | |
5865 can exploit a hardware cycle counter; FFTW comes with cycle-counter | |
5866 support for all modern general-purpose CPUs, but you may need to add a | |
5867 couple of lines of code if your compiler is not yet supported (*note | |
5868 Cycle Counters::). (On Unix, there will be a warning at the end of the | |
5869 `configure' output if no cycle counter is found.) | |
5870 | |
5871 Installation of FFTW is simplest if you have a Unix or a GNU system, | |
5872 such as GNU/Linux, and we describe this case in the first section below, | |
5873 including the use of special configuration options to e.g. install | |
5874 different precisions or exploit optimizations for particular | |
5875 architectures (e.g. SIMD). Compilation on non-Unix systems is a more | |
5876 manual process, but we outline the procedure in the second section. It | |
5877 is also likely that pre-compiled binaries will be available for popular | |
5878 systems. | |
5879 | |
5880 Finally, we describe how you can customize FFTW for particular needs | |
5881 by generating _codelets_ for fast transforms of sizes not supported | |
5882 efficiently by the standard FFTW distribution. | |
5883 | |
5884 * Menu: | |
5885 | |
5886 * Installation on Unix:: | |
5887 * Installation on non-Unix systems:: | |
5888 * Cycle Counters:: | |
5889 * Generating your own code:: | |
5890 | |
5891 | |
5892 File: fftw3.info, Node: Installation on Unix, Next: Installation on non-Unix systems, Prev: Installation and Customization, Up: Installation and Customization | |
5893 | |
5894 10.1 Installation on Unix | |
5895 ========================= | |
5896 | |
5897 FFTW comes with a `configure' program in the GNU style. Installation | |
5898 can be as simple as: | |
5899 | |
5900 ./configure | |
5901 make | |
5902 make install | |
5903 | |
5904 This will build the uniprocessor complex and real transform libraries | |
5905 along with the test programs. (We recommend that you use GNU `make' if | |
5906 it is available; on some systems it is called `gmake'.) The "`make | |
5907 install'" command installs the FFTW libraries in standard | |
5908 places, and typically requires root privileges (unless you specify a | |
5909 different install directory with the `--prefix' flag to `configure'). | |
5910 You can also type "`make check'" to put the FFTW test programs through | |
5911 their paces. If you have problems during configuration or compilation, | |
5912 you may want to run "`make distclean'" before trying again; this | |
5913 ensures that you don't have any stale files left over from previous | |
5914 compilation attempts. | |
5915 | |
5916 The `configure' script chooses the `gcc' compiler by default, if it | |
5917 is available; you can select some other compiler with: | |
5918 ./configure CC="<the name of your C compiler>" | |
5919 | |
5920 The `configure' script knows good `CFLAGS' (C compiler flags) for a | |
5921 few systems. If your system is not known, the `configure' script will | |
5922 print out a warning. In this case, you should re-configure FFTW with | |
5923 the command | |
5924 ./configure CFLAGS="<write your CFLAGS here>" | |
5925 and then compile as usual. If you do find an optimal set of | |
5926 `CFLAGS' for your system, please let us know what they are (along with | |
5927 the output of `config.guess') so that we can include them in future | |
5928 releases. | |
5929 | |
5930 `configure' supports all the standard flags defined by the GNU | |
5931 Coding Standards; see the `INSTALL' file in FFTW or the GNU web page | |
5932 (http://www.gnu.org/prep/standards/html_node/index.html). Note | |
5933 especially `--help' to list all flags and `--enable-shared' to create | |
5934 shared, rather than static, libraries. `configure' also accepts a few | |
5935 FFTW-specific flags, particularly: | |
5936 | |
5937 * `--enable-float': Produces a single-precision version of FFTW | |
5938 (`float') instead of the default double-precision (`double'). | |
5939 *Note Precision::. | |
5940 | |
5941 * `--enable-long-double': Produces a long-double precision version of | |
5942 FFTW (`long double') instead of the default double-precision | |
5943 (`double'). The `configure' script will halt with an error | |
5944 message if `long double' is the same size as `double' on your | |
5945 machine/compiler. *Note Precision::. | |
5946 | |
5947 * `--enable-quad-precision': Produces a quadruple-precision version | |
5948 of FFTW using the nonstandard `__float128' type provided by `gcc' | |
5949 4.6 or later on x86, x86-64, and Itanium architectures, instead of | |
5950 the default double-precision (`double'). The `configure' script | |
5951 will halt with an error message if the compiler is not `gcc' | |
5952 version 4.6 or later or if `gcc''s `libquadmath' library is not | |
5953 installed. *Note Precision::. | |
5954 | |
5955 * `--enable-threads': Enables compilation and installation of the | |
5956 FFTW threads library (*note Multi-threaded FFTW::), which provides | |
5957 a simple interface to parallel transforms for SMP systems. By | |
5958 default, the threads routines are not compiled. | |
5959 | |
5960 * `--enable-openmp': Like `--enable-threads', but using OpenMP | |
5961 compiler directives in order to induce parallelism rather than | |
5962 spawning its own threads directly, and installing an `fftw3_omp' | |
5963 library rather than an `fftw3_threads' library (*note | |
5964 Multi-threaded FFTW::). You can use both `--enable-openmp' and | |
5965 `--enable-threads' since they compile/install libraries with | |
5966 different names. By default, the OpenMP routines are not compiled. | |
5967 | |
5968 * `--with-combined-threads': By default, if `--enable-threads' is | |
5969 used, the threads support is compiled into a separate library that | |
5970 must be linked in addition to the main FFTW library. This is so | |
5971 that users of the serial library do not need to link the system | |
5972 threads libraries. If `--with-combined-threads' is specified, | |
5973 however, then no separate threads library is created, and threads | |
5974 are included in the main FFTW library. This is mainly useful | |
5975 under Windows, where no system threads library is required and | |
5976 inter-library dependencies are problematic. | |
5977 | |
5978 * `--enable-mpi': Enables compilation and installation of the FFTW | |
5979 MPI library (*note Distributed-memory FFTW with MPI::), which | |
5980 provides parallel transforms for distributed-memory systems with | |
5981 MPI. (By default, the MPI routines are not compiled.) *Note FFTW | |
5982 MPI Installation::. | |
5983 | |
5984 * `--disable-fortran': Disables inclusion of legacy-Fortran wrapper | |
5985 routines (*note Calling FFTW from Legacy Fortran::) in the standard | |
5986 FFTW libraries. These wrapper routines increase the library size | |
5987 by only a negligible amount, so they are included by default as | |
5988 long as the `configure' script finds a Fortran compiler on your | |
5989 system. (To specify a particular Fortran compiler foo, pass | |
5990 `F77='foo to `configure'.) | |
5991 | |
5992 * `--with-g77-wrappers': By default, when Fortran wrappers are | |
5993 included, the wrappers employ the linking conventions of the | |
5994 Fortran compiler detected by the `configure' script. If this | |
5995 compiler is GNU `g77', however, then _two_ versions of the | |
5996 wrappers are included: one with `g77''s idiosyncratic convention | |
5997 of appending two underscores to identifiers, and one with the more | |
5998 common convention of appending only a single underscore. This | |
5999 way, the same FFTW library will work with both `g77' and other | |
6000 Fortran compilers, such as GNU `gfortran'. However, the converse | |
6001 is not true: if you configure with a different compiler, then the | |
6002 `g77'-compatible wrappers are not included. By specifying | |
6003 `--with-g77-wrappers', the `g77'-compatible wrappers are included | |
6004 in addition to wrappers for whatever Fortran compiler `configure' | |
6005 finds. | |
6006 | |
6007 * `--with-slow-timer': Disables the use of hardware cycle counters, | |
6008 and falls back on `gettimeofday' or `clock'. This greatly worsens | |
6009 performance, and should generally not be used (unless you don't | |
6010 have a cycle counter but still really want an optimized plan | |
6011 regardless of the time). *Note Cycle Counters::. | |
6012 | |
6013 * `--enable-sse', `--enable-sse2', `--enable-avx', | |
6014 `--enable-altivec', `--enable-neon': Enable the compilation of | |
6015 SIMD code for SSE (Pentium III+), SSE2 (Pentium IV+), AVX (Sandy | |
6016 Bridge, Interlagos), AltiVec (PowerPC G4+), NEON (some ARM | |
6017 processors). SSE, AltiVec, and NEON only work with | |
6018 `--enable-float' (above). SSE2 works in both single and double | |
6019 precision (and is simply SSE in single precision). The resulting | |
6020 code will _still work_ on earlier CPUs lacking the SIMD extensions | |
6021 (SIMD is automatically disabled, although the FFTW library is | |
6022 still larger). | |
6023 - These options require a compiler supporting SIMD extensions, | |
6024 and compiler support is always a bit flaky: see the FFTW FAQ | |
6025 for a list of compiler versions that have problems compiling | |
6026 FFTW. | |
6027 | |
6028 - With AltiVec and `gcc', you may have to use the | |
6029 `-mabi=altivec' option when compiling any code that links to | |
6030 FFTW, in order to properly align the stack; otherwise, FFTW | |
6031 could crash when it tries to use an AltiVec feature. (This | |
6032 is not necessary on MacOS X.) | |
6033 | |
6034 - With SSE/SSE2 and `gcc', you should use a version of gcc that | |
6035 properly aligns the stack when compiling any code that links | |
6036 to FFTW. By default, `gcc' 2.95 and later versions align the | |
6037 stack as needed, but you should not compile FFTW with the | |
6038 `-Os' option or the `-mpreferred-stack-boundary' option with | |
6039 an argument less than 4. | |
6040 | |
6041 - Because of the large variety of ARM processors and ABIs, FFTW | |
6042 does not attempt to guess the correct `gcc' flags for | |
6043 generating NEON code. In general, you will have to provide | |
6044 them on the command line. This command line is known to have | |
6045 worked at least once: | |
6046 ./configure --with-slow-timer --host=arm-linux-gnueabi \ | |
6047 --enable-single --enable-neon \ | |
6048 "CC=arm-linux-gnueabi-gcc -march=armv7-a -mfloat-abi=softfp" | |
6049 | |
6050 | |
6051 To force `configure' to use a particular C compiler `foo' (instead of | |
6052 the default, usually `gcc'), pass `CC=foo' to the `configure' script; | |
6053 you may also need to set the flags via the variable `CFLAGS' as | |
6054 described above. | |
6055 | |
6056 | |
6057 File: fftw3.info, Node: Installation on non-Unix systems, Next: Cycle Counters, Prev: Installation on Unix, Up: Installation and Customization | |
6058 | |
6059 10.2 Installation on non-Unix systems | |
6060 ===================================== | |
6061 | |
6062 It should be relatively straightforward to compile FFTW even on non-Unix | |
6063 systems lacking the niceties of a `configure' script. Basically, you | |
6064 need to edit the `config.h' header (copy it from `config.h.in') to | |
6065 `#define' the various options and compiler characteristics, and then | |
6066 compile all the `.c' files in the relevant directories. | |
6067 | |
6068 The `config.h' header contains about 100 options to set, each one | |
6069 initially an `#undef', each documented with a comment, and most of them | |
6070 fairly obvious. For most of the options, you should simply `#define' | |
6071 them to `1' if they are applicable, although a few options require a | |
6072 particular value (e.g. `SIZEOF_LONG_LONG' should be defined to the size | |
6073 of the `long long' type, in bytes, or zero if it is not supported). We | |
6074 will likely post some sample `config.h' files for various operating | |
6075 systems and compilers for you to use (at least as a starting point). | |
6076 Please let us know if you have to hand-create a configuration file | |
6077 (and/or a pre-compiled binary) that you want to share. | |
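
As a rough illustration of the editing described above, a hand-written `config.h' might contain entries like the following. This is only a sketch: the value 8 is an assumption about the target compiler, `HAVE_SSE2' is the SIMD option this section uses as its example, and the final `typedef' is a hypothetical sanity check that is not part of FFTW.

```c
/* Hand-written config.h fragment (illustrative values only). */

/* Size of `long long' in bytes, or 0 if the type is unsupported.
   The value 8 is an assumption about the target compiler. */
#define SIZEOF_LONG_LONG 8

/* Options that simply apply are defined to 1. Leave HAVE_SSE2
   undefined if you are not compiling the SIMD directories. */
#define HAVE_SSE2 1

/* Optional compile-time check, not part of FFTW: the array size
   becomes negative, and compilation fails, if the value above
   does not match the compiler. */
typedef char config_check_sizeof_long_long
    [(sizeof(long long) == SIZEOF_LONG_LONG) ? 1 : -1];
```

A check of this kind catches a wrong size at compile time rather than at run time, which may be useful when the header is maintained by hand.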
6078 | |
6079 To create the FFTW library, you will then need to compile all of the | |
6080 `.c' files in the `kernel', `dft', `dft/scalar', `dft/scalar/codelets', | |
6081 `rdft', `rdft/scalar', `rdft/scalar/r2cf', `rdft/scalar/r2cb', | |
6082 `rdft/scalar/r2r', `reodft', and `api' directories. If you are | |
6083 compiling with SIMD support (e.g. you defined `HAVE_SSE2' in | |
6084 `config.h'), then you also need to compile the `.c' files in the | |
6085 `simd-support', `{dft,rdft}/simd', and `{dft,rdft}/simd/*' directories. | |
6086 | |
6087 Once these files are all compiled, link them into a library, or a | |
6088 shared library, or directly into your program. | |
6089 | |
6090 To compile the FFTW test program, additionally compile the code in | |
6091 the `libbench2/' directory, and link it into a library. Then compile | |
6092 the code in the `tests/' directory and link it to the `libbench2' and | |
6093 FFTW libraries. To compile the `fftw-wisdom' (command-line) tool | |
6094 (*note Wisdom Utilities::), compile `tools/fftw-wisdom.c' and link it | |
6095 to the `libbench2' and FFTW libraries. | |
6096 | |
6097 | |
6098 File: fftw3.info, Node: Cycle Counters, Next: Generating your own code, Prev: Installation on non-Unix systems, Up: Installation and Customization | |
6099 | |
6100 10.3 Cycle Counters | |
6101 =================== | |
6102 | |
6103 FFTW's planner actually executes and times different possible FFT | |
6104 algorithms in order to pick the fastest plan for a given n. In order | |
6105 to do this in as short a time as possible, however, the timer must have | |
6106 a very high resolution, and to accomplish this we employ the hardware | |
6107 "cycle counters" that are available on most CPUs. Currently, FFTW | |
6108 supports the cycle counters on x86, PowerPC/POWER, Alpha, UltraSPARC | |
6109 (SPARC v9), IA64, PA-RISC, and MIPS processors. | |
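
The idea of timing several candidates and keeping the fastest can be sketched in a few lines of C. This is not FFTW's planner: the two "algorithms" below are hypothetical stand-ins that merely sum integers, every name is invented for illustration, and the portable `clock' function is used in place of a hardware cycle counter.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Two hypothetical "algorithms" that compute the same sum of 0..n-1. */
static unsigned long sum_forward(unsigned long n) {
    unsigned long s = 0;
    for (unsigned long i = 0; i < n; ++i) s += i;
    return s;
}

static unsigned long sum_backward(unsigned long n) {
    unsigned long s = 0;
    for (unsigned long i = n; i > 0; --i) s += i - 1;
    return s;
}

typedef unsigned long (*candidate_fn)(unsigned long);

/* Run one candidate `reps' times; return elapsed CPU time in seconds. */
static double time_candidate(candidate_fn f, unsigned long n, int reps) {
    volatile unsigned long sink = 0;  /* keeps the calls from being elided */
    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r) sink += f(n);
    clock_t t1 = clock();
    (void)sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* Time every candidate for the given n; return the index of the fastest. */
static size_t pick_fastest(const candidate_fn *cands, size_t count,
                           unsigned long n, int reps) {
    assert(count > 0);
    size_t best = 0;
    double best_time = time_candidate(cands[0], n, reps);
    for (size_t i = 1; i < count; ++i) {
        double t = time_candidate(cands[i], n, reps);
        if (t < best_time) {
            best_time = t;
            best = i;
        }
    }
    return best;
}
```

Calling `pick_fastest' with an array holding `sum_forward' and `sum_backward' returns 0 or 1, depending on which one happened to run faster on that machine for that n.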
6110 | |
6111 Access to the cycle counters, unfortunately, is a compiler and/or | |
6112 operating-system dependent task, often requiring inline assembly | |
6113 language, and it may be that your compiler is not supported. If you are | |
6114 _not_ supported, FFTW will by default fall back on its estimator | |
6115 (effectively using `FFTW_ESTIMATE' for all plans). | |
6116 | |
6117 You can add support by editing the file `kernel/cycle.h'; normally, | |
6118 this will involve adapting one of the examples already present in order | |
6119 to use the inline-assembler syntax for your C compiler, and will only | |
6120 require a couple of lines of code. Anyone adding support for a new | |
6121 system to `cycle.h' is encouraged to email us at <fftw@fftw.org>. | |
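
To give a sense of scale for "a couple of lines", here is a minimal sketch of a counter read for x86 with GCC-style inline assembly, with a coarse portable fallback. It is not copied from `kernel/cycle.h': the names `my_ticks', `my_getticks', and `my_elapsed' are hypothetical, and the preprocessor test is only an assumption about how one might detect a suitable compiler.

```c
#include <time.h>

/* All names here are hypothetical and are not taken from FFTW. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
typedef unsigned long long my_ticks;

/* Read the x86 time-stamp counter with GCC-style inline assembly. */
static my_ticks my_getticks(void) {
    unsigned int lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((my_ticks)hi << 32) | lo;
}
#else
typedef clock_t my_ticks;

/* Portable fallback: much coarser than a hardware cycle counter. */
static my_ticks my_getticks(void) {
    return clock();
}
#endif

/* Difference between two readings, in counter units, as a double. */
static double my_elapsed(my_ticks t1, my_ticks t0) {
    return (double)(t1 - t0);
}
```

A different compiler would typically need only the body of the counter read rewritten in its own inline-assembler syntax; the surrounding interface can stay the same.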
6122 | |
6123 If a cycle counter is not available on your system (e.g. some | |
6124 embedded processor), and you don't want to use estimated plans, as a | |
6125 last resort you can use the `--with-slow-timer' option to `configure' | |
6126 (on Unix) or `#define WITH_SLOW_TIMER' in `config.h' (elsewhere). This | |
6127 will use the much lower-resolution `gettimeofday' function, or even | |
6128 `clock' if the former is unavailable, and planning will be extremely | |
6129 slow. | |
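
The cost of a low-resolution timer can be seen in a small sketch: a short task must be repeated until the measured interval is long enough for the timer to resolve, and that repetition is what makes timing slow. The code below is not FFTW's timing code. `short_task' and `time_with_clock' are hypothetical names, and the doubling strategy is just one simple choice.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for one short transform being timed. */
static void short_task(void) {
    volatile double acc = 0.0;
    for (int i = 0; i < 100; ++i) acc += i * 0.5;
}

/* Double the repetition count until the interval measured by clock()
   reaches min_seconds, then return the average time per call.
   *reps_used reports how many calls the final measurement needed. */
static double time_with_clock(void (*task)(void), double min_seconds,
                              long *reps_used) {
    assert(min_seconds > 0.0);
    long reps = 1;
    for (;;) {
        clock_t t0 = clock();
        for (long r = 0; r < reps; ++r) task();
        clock_t t1 = clock();
        double elapsed = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (elapsed >= min_seconds) {
            *reps_used = reps;
            return elapsed / (double)reps;
        }
        reps *= 2;
    }
}
```

With a cycle counter a single run of the task can often be timed directly, whereas here `reps_used' may come back in the thousands for a task this small.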
6130 | |
6131 | |
6132 File: fftw3.info, Node: Generating your own code, Prev: Cycle Counters, Up: Installation and Customization | |
6133 | |
6134 10.4 Generating your own code | |
6135 ============================= | |
6136 | |
6137 The directory `genfft' contains the programs that were used to generate | |
6138 FFTW's "codelets," which are hard-coded transforms of small sizes. We | |
6139 do not expect casual users to employ the generator, which is a rather | |
6140 sophisticated program that generates directed acyclic graphs of FFT | |
6141 algorithms and performs algebraic simplifications on them. It was | |
6142 written in Objective Caml, a dialect of ML, which is available at | |
6143 `http://caml.inria.fr/ocaml/index.en.html'. | |
6144 | |
6145 If you have Objective Caml installed (along with recent versions of | |
6146 GNU `autoconf', `automake', and `libtool'), then you can change the set | |
6147 of codelets that are generated or play with the generation options. | |
6148 The set of generated codelets is specified by the | |
6149 `{dft,rdft}/{codelets,simd}/*/Makefile.am' files. For example, you can | |
6150 add efficient REDFT codelets of small sizes by modifying | |
6151 `rdft/codelets/r2r/Makefile.am'. After you modify any `Makefile.am' | |
6152 files, you can type `sh bootstrap.sh' in the top-level directory | |
6153 followed by `make' to re-generate the files. | |
6154 | |
6155 We do not provide more details about the code-generation process, | |
6156 since we do not expect that most users will need to generate their own | |
6157 code. However, feel free to contact us at <fftw@fftw.org> if you are | |
6158 interested in the subject. | |
6159 | |
6160 You might find it interesting to learn Caml and/or some modern | |
6161 programming techniques that we used in the generator (including monadic | |
6162 programming), especially if you heard the rumor that Java and | |
6163 object-oriented programming are the latest advancement in the field. | |
6164 The internal operation of the codelet generator is described in the | |
6165 paper, "A Fast Fourier Transform Compiler," by M. Frigo, which is | |
6166 available from the FFTW home page (http://www.fftw.org) and also | |
6167 appeared in the `Proceedings of the 1999 ACM SIGPLAN Conference on | |
6168 Programming Language Design and Implementation (PLDI)'. | |
6169 | |
6170 | |
6171 File: fftw3.info, Node: Acknowledgments, Next: License and Copyright, Prev: Installation and Customization, Up: Top | |
6172 | |
6173 11 Acknowledgments | |
6174 ****************** | |
6175 | |
6176 Matteo Frigo was supported in part by the Special Research Program SFB | |
6177 F011 "AURORA" of the Austrian Science Fund FWF and by MIT Lincoln | |
6178 Laboratory. For previous versions of FFTW, he was supported in part by | |
6179 the Defense Advanced Research Projects Agency (DARPA), under Grants | |
6180 N00014-94-1-0985 and F30602-97-1-0270, and by a Digital Equipment | |
6181 Corporation Fellowship. | |
6182 | |
6183 Steven G. Johnson was supported in part by a Dept. of Defense NDSEG | |
6184 Fellowship, an MIT Karl Taylor Compton Fellowship, and by the Materials | |
6185 Research Science and Engineering Center program of the National Science | |
6186 Foundation under award DMR-9400334. | |
6187 | |
6188 Code for the Cell Broadband Engine was graciously donated to the FFTW | |
6189 project by the IBM Austin Research Lab and included in fftw-3.2. (This | |
6190 code was removed in fftw-3.3.) | |
6191 | |
6192 Code for the MIPS paired-single SIMD support was graciously donated | |
6193 to the FFTW project by CodeSourcery, Inc. | |
6194 | |
6195 We are grateful to Sun Microsystems Inc. for its donation of a | |
6196 cluster of 9 8-processor Ultra HPC 5000 SMPs (24 Gflops peak). These | |
6197 machines served as the primary platform for the development of early | |
6198 versions of FFTW. | |
6199 | |
6200 We thank Intel Corporation for donating a four-processor Pentium Pro | |
6201 machine. We thank the GNU/Linux community for giving us a decent OS to | |
6202 run on that machine. | |
6203 | |
6204 We are thankful to the AMD corporation for donating an AMD Athlon XP | |
6205 1700+ computer to the FFTW project. | |
6206 | |
6207 We thank the Compaq/HP testdrive program and VA Software Corporation | |
6208 (SourceForge.net) for providing remote access to machines that were used | |
6209 to test FFTW. | |
6210 | |
6211 The `genfft' suite of code generators was written using Objective | |
6212 Caml, a dialect of ML. Objective Caml is a small and elegant language | |
6213 developed by Xavier Leroy. The implementation is available from | |
6214 `http://caml.inria.fr/' (http://caml.inria.fr/). In previous releases | |
6215 of FFTW, `genfft' was written in Caml Light, by the same authors. An | |
6216 even earlier implementation of `genfft' was written in Scheme, but Caml | |
6217 is definitely better for this kind of application. | |
6218 | |
6219 FFTW uses many tools from the GNU project, including `automake', | |
6220 `texinfo', and `libtool'. | |
6221 | |
6222 Prof. Charles E. Leiserson of MIT provided continuous support and | |
6223 encouragement. This program would not exist without him. Charles also | |
6224 proposed the name "codelets" for the basic FFT blocks. | |
6225 | |
6226 Prof. John D. Joannopoulos of MIT demonstrated continuing tolerance | |
6227 of Steven's "extra-curricular" computer-science activities, as well as | |
6228 remarkable creativity in working them into his grant proposals. | |
6229 Steven's physics degree would not exist without him. | |
6230 | |
6231 Franz Franchetti wrote SIMD extensions to FFTW 2, which eventually | |
6232 led to the SIMD support in FFTW 3. | |
6233 | |
6234 Stefan Kral wrote most of the K7 code generator distributed with FFTW | |
6235 3.0.x and 3.1.x. | |
6236 | |
6237 Andrew Sterian contributed the Windows timing code in FFTW 2. | |
6238 | |
6239 Didier Miras reported a bug in the test procedure used in FFTW 1.2. | |
6240 We now use a completely different test algorithm by Funda Ergun that | |
6241 does not require a separate FFT program to compare against. | |
6242 | |
6243 Wolfgang Reimer contributed the Pentium cycle counter and a few fixes | |
6244 that help portability. | |
6245 | |
6246 Ming-Chang Liu uncovered a well-hidden bug in the complex transforms | |
6247 of FFTW 2.0 and supplied a patch to correct it. | |
6248 | |
6249 The FFTW FAQ was written in `bfnn' (Bizarre Format With No Name) and | |
6250 formatted using the tools developed by Ian Jackson for the Linux FAQ. | |
6251 | |
6252 _We are especially thankful to all of our users for their continuing | |
6253 support, feedback, and interest during our development of FFTW._ | |
6254 | |
6255 | |
6256 File: fftw3.info, Node: License and Copyright, Next: Concept Index, Prev: Acknowledgments, Up: Top | |
6257 | |
6258 12 License and Copyright | |
6259 ************************ | |
6260 | |
6261 FFTW is Copyright (C) 2003, 2007-11 Matteo Frigo, Copyright (C) 2003, | |
6262 2007-11 Massachusetts Institute of Technology. | |
6263 | |
6264 FFTW is free software; you can redistribute it and/or modify it | |
6265 under the terms of the GNU General Public License as published by the | |
6266 Free Software Foundation; either version 2 of the License, or (at your | |
6267 option) any later version. | |
6268 | |
6269 This program is distributed in the hope that it will be useful, but | |
6270 WITHOUT ANY WARRANTY; without even the implied warranty of | |
6271 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU | |
6272 General Public License for more details. | |
6273 | |
6274 You should have received a copy of the GNU General Public License | |
6275 along with this program; if not, write to the Free Software Foundation, | |
6276 Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. You | |
6277 can also find the GPL on the GNU web site | |
6278 (http://www.gnu.org/licenses/gpl-2.0.html). | |
6279 | |
6280 In addition, we kindly ask you to acknowledge FFTW and its authors in | |
6281 any program or publication in which you use FFTW. (You are not | |
6282 _required_ to do so; it is up to your common sense to decide whether | |
6283 you want to comply with this request or not.) For general | |
6284 publications, we suggest referencing: Matteo Frigo and Steven G. | |
6285 Johnson, "The design and implementation of FFTW3," Proc. IEEE 93 (2), | |
6286 216-231 (2005). | |
6287 | |
6288 Non-free versions of FFTW are available under terms different from | |
6289 those of the General Public License. (For example, they do not require | |
6290 you to accompany any object code using FFTW with the corresponding | |
6291 source code.) For these alternative terms you must purchase a license | |
6292 from MIT's Technology Licensing Office. Users interested in such a | |
6293 license should contact us (<fftw@fftw.org>) for more information. | |
6294 |