Mercurial > hg > sv-dependency-builds
diff src/fftw-3.3.3/doc/fftw3.info-1 @ 10:37bf6b4a2645
Add FFTW3
author | Chris Cannam |
---|---|
date | Wed, 20 Mar 2013 15:35:50 +0000 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src/fftw-3.3.3/doc/fftw3.info-1 Wed Mar 20 15:35:50 2013 +0000 @@ -0,0 +1,6280 @@ +This is fftw3.info, produced by makeinfo version 4.13 from fftw3.texi. + +This manual is for FFTW (version 3.3.3, 25 November 2012). + + Copyright (C) 2003 Matteo Frigo. + + Copyright (C) 2003 Massachusetts Institute of Technology. + + Permission is granted to make and distribute verbatim copies of + this manual provided the copyright notice and this permission + notice are preserved on all copies. + + Permission is granted to copy and distribute modified versions of + this manual under the conditions for verbatim copying, provided + that the entire resulting derived work is distributed under the + terms of a permission notice identical to this one. + + Permission is granted to copy and distribute translations of this + manual into another language, under the above conditions for + modified versions, except that this permission notice may be + stated in a translation approved by the Free Software Foundation. + +INFO-DIR-SECTION Texinfo documentation system +START-INFO-DIR-ENTRY +* fftw3: (fftw3). FFTW User's Manual. +END-INFO-DIR-ENTRY + + +File: fftw3.info, Node: Top, Next: Introduction, Prev: (dir), Up: (dir) + +FFTW User Manual +**************** + +Welcome to FFTW, the Fastest Fourier Transform in the West. FFTW is a +collection of fast C routines to compute the discrete Fourier transform. +This manual documents FFTW version 3.3.3. + +* Menu: + +* Introduction:: +* Tutorial:: +* Other Important Topics:: +* FFTW Reference:: +* Multi-threaded FFTW:: +* Distributed-memory FFTW with MPI:: +* Calling FFTW from Modern Fortran:: +* Calling FFTW from Legacy Fortran:: +* Upgrading from FFTW version 2:: +* Installation and Customization:: +* Acknowledgments:: +* License and Copyright:: +* Concept Index:: +* Library Index:: + + +File: fftw3.info, Node: Introduction, Next: Tutorial, Prev: Top, Up: Top + +1 Introduction +************** + +This manual documents version 3.3.3 of FFTW, the _Fastest Fourier +Transform in the West_. FFTW is a comprehensive collection of fast C +routines for computing the discrete Fourier transform (DFT) and various +special cases thereof. + * FFTW computes the DFT of complex data, real data, even- or + odd-symmetric real data (these symmetric transforms are usually + known as the discrete cosine or sine transform, respectively), and + the discrete Hartley transform (DHT) of real data. + + * The input data can have arbitrary length. FFTW employs O(n + log n) algorithms for all lengths, including prime numbers. + + * FFTW supports arbitrary multi-dimensional data. + + * FFTW supports the SSE, SSE2, AVX, Altivec, and MIPS PS instruction + sets. + + * FFTW includes parallel (multi-threaded) transforms for + shared-memory systems. + + * Starting with version 3.3, FFTW includes distributed-memory + parallel transforms using MPI. + + We assume herein that you are familiar with the properties and uses +of the DFT that are relevant to your application. Otherwise, see e.g. +`The Fast Fourier Transform and Its Applications' by E. O. Brigham +(Prentice-Hall, Englewood Cliffs, NJ, 1988). Our web page +(http://www.fftw.org) also has links to FFT-related information online. + + In order to use FFTW effectively, you need to learn one basic concept +of FFTW's internal structure: FFTW does not use a fixed algorithm for +computing the transform, but instead it adapts the DFT algorithm to +details of the underlying hardware in order to maximize performance. +Hence, the computation of the transform is split into two phases. +First, FFTW's "planner" "learns" the fastest way to compute the +transform on your machine. The planner produces a data structure +called a "plan" that contains this information. Subsequently, the plan +is "executed" to transform the array of input data as dictated by the +plan. The plan can be reused as many times as needed. In typical +high-performance applications, many transforms of the same size are +computed and, consequently, a relatively expensive initialization of +this sort is acceptable. On the other hand, if you need a single +transform of a given size, the one-time cost of the planner becomes +significant. For this case, FFTW provides fast planners based on +heuristics or on previously computed plans. + + FFTW supports transforms of data with arbitrary length, rank, +multiplicity, and a general memory layout. In simple cases, however, +this generality may be unnecessary and confusing. Consequently, we +organized the interface to FFTW into three levels of increasing +generality. + * The "basic interface" computes a single transform of + contiguous data. + + * The "advanced interface" computes transforms of multiple or + strided arrays. + + * The "guru interface" supports the most general data layouts, + multiplicities, and strides. + We expect that most users will be best served by the basic interface, +whereas the guru interface requires careful attention to the +documentation to avoid problems. + + Besides the automatic performance adaptation performed by the +planner, it is also possible for advanced users to customize FFTW +manually. For example, if code space is a concern, we provide a tool +that links only the subset of FFTW needed by your application. +Conversely, you may need to extend FFTW because the standard +distribution is not sufficient for your needs. For example, the +standard FFTW distribution works most efficiently for arrays whose size +can be factored into small primes (2, 3, 5, and 7), and otherwise it +uses a slower general-purpose routine. If you need efficient +transforms of other sizes, you can use FFTW's code generator, which +produces fast C programs ("codelets") for any particular array size you +may care about. For example, if you need transforms of size 513 = 19 x +3^3, you can customize FFTW to support the factor 19 efficiently. + + For more information regarding FFTW, see the paper, "The Design and +Implementation of FFTW3," by M. Frigo and S. G. Johnson, which was an +invited paper in `Proc. IEEE' 93 (2), p. 216 (2005). The code +generator is described in the paper "A fast Fourier transform compiler", by +M. Frigo, in the `Proceedings of the 1999 ACM SIGPLAN Conference on +Programming Language Design and Implementation (PLDI), Atlanta, +Georgia, May 1999'. These papers, along with the latest version of +FFTW, the FAQ, benchmarks, and other links, are available at the FFTW +home page (http://www.fftw.org). + + The current version of FFTW incorporates many good ideas from the +past thirty years of FFT literature. In one way or another, FFTW uses +the Cooley-Tukey algorithm, the prime factor algorithm, Rader's +algorithm for prime sizes, and a split-radix algorithm (with a +"conjugate-pair" variation pointed out to us by Dan Bernstein). FFTW's +code generator also produces new algorithms that we do not completely +understand. The reader is referred to the cited papers for the +appropriate references. + + The rest of this manual is organized as follows. We first discuss +the sequential (single-processor) implementation. We start by +describing the basic interface/features of FFTW in *note Tutorial::. +Next, *note Other Important Topics:: discusses data alignment (*note +SIMD alignment and fftw_malloc::), the storage scheme of +multi-dimensional arrays (*note Multi-dimensional Array Format::), and +FFTW's mechanism for storing plans on disk (*note Words of +Wisdom-Saving Plans::). Next, *note FFTW Reference:: provides +comprehensive documentation of all FFTW's features. Parallel +transforms are discussed in their own chapters: *note Multi-threaded +FFTW:: and *note Distributed-memory FFTW with MPI::. Fortran +programmers can also use FFTW, as described in *note Calling FFTW from +Legacy Fortran:: and *note Calling FFTW from Modern Fortran::. *note +Installation and Customization:: explains how to install FFTW in your +computer system and how to adapt FFTW to your needs. License and +copyright information is given in *note License and Copyright::. +Finally, we thank all the people who helped us in *note +Acknowledgments::. + + +File: fftw3.info, Node: Tutorial, Next: Other Important Topics, Prev: Introduction, Up: Top + +2 Tutorial +********** + +* Menu: + +* Complex One-Dimensional DFTs:: +* Complex Multi-Dimensional DFTs:: +* One-Dimensional DFTs of Real Data:: +* Multi-Dimensional DFTs of Real Data:: +* More DFTs of Real Data:: + + This chapter describes the basic usage of FFTW, i.e., how to compute the +Fourier transform of a single array. This chapter tells the truth, but +not the _whole_ truth. Specifically, FFTW implements additional +routines and flags that are not documented here, although in many cases +we try to indicate where added capabilities exist. For more complete +information, see *note FFTW Reference::. (Note that you need to +compile and install FFTW before you can use it in a program. For the +details of the installation, see *note Installation and +Customization::.) + + We recommend that you read this tutorial in order.(1) At the least, +read the first section (*note Complex One-Dimensional DFTs::) before +reading any of the others, even if your main interest lies in one of +the other transform types. + + Users of FFTW version 2 and earlier may also want to read *note +Upgrading from FFTW version 2::. + + ---------- Footnotes ---------- + + (1) You can read the tutorial in bit-reversed order after computing +your first transform. + + +File: fftw3.info, Node: Complex One-Dimensional DFTs, Next: Complex Multi-Dimensional DFTs, Prev: Tutorial, Up: Tutorial + +2.1 Complex One-Dimensional DFTs +================================ + + Plan: To bother about the best method of accomplishing an + accidental result. [Ambrose Bierce, `The Enlarged Devil's + Dictionary'.] + + The basic usage of FFTW to compute a one-dimensional DFT of size `N' +is simple, and it typically looks something like this code: + + #include <fftw3.h> + ... + { + fftw_complex *in, *out; + fftw_plan p; + ... + in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N); + out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N); + p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE); + ... + fftw_execute(p); /* repeat as needed */ + ... + fftw_destroy_plan(p); + fftw_free(in); fftw_free(out); + } + + You must link this code with the `fftw3' library. On Unix systems, +link with `-lfftw3 -lm'. + + The example code first allocates the input and output arrays. You +can allocate them in any way that you like, but we recommend using +`fftw_malloc', which behaves like `malloc' except that it properly +aligns the array when SIMD instructions (such as SSE and Altivec) are +available (*note SIMD alignment and fftw_malloc::). [Alternatively, we +provide a convenient wrapper function `fftw_alloc_complex(N)' which has +the same effect.] + + The data is an array of type `fftw_complex', which is by default a +`double[2]' composed of the real (`in[i][0]') and imaginary +(`in[i][1]') parts of a complex number. + + The next step is to create a "plan", which is an object that +contains all the data that FFTW needs to compute the FFT. This +function creates the plan: + + fftw_plan fftw_plan_dft_1d(int n, fftw_complex *in, fftw_complex *out, + int sign, unsigned flags); + + The first argument, `n', is the size of the transform you are trying +to compute. The size `n' can be any positive integer, but sizes that +are products of small factors are transformed most efficiently +(although prime sizes still use an O(n log n) algorithm). + + The next two arguments are pointers to the input and output arrays of +the transform. These pointers can be equal, indicating an "in-place" +transform. + + The fourth argument, `sign', can be either `FFTW_FORWARD' (`-1') or +`FFTW_BACKWARD' (`+1'), and indicates the direction of the transform +you are interested in; technically, it is the sign of the exponent in +the transform. + + The `flags' argument is usually either `FFTW_MEASURE' or `FFTW_ESTIMATE'. +`FFTW_MEASURE' instructs FFTW to run and measure the execution time of +several FFTs in order to find the best way to compute the transform of +size `n'. This process takes some time (usually a few seconds), +depending on your machine and on the size of the transform. +`FFTW_ESTIMATE', on the contrary, does not run any computation and just +builds a reasonable plan that is probably sub-optimal. In short, if +your program performs many transforms of the same size and +initialization time is not important, use `FFTW_MEASURE'; otherwise use +the estimate. + + _You must create the plan before initializing the input_, because +`FFTW_MEASURE' overwrites the `in'/`out' arrays. (Technically, +`FFTW_ESTIMATE' does not touch your arrays, but you should always +create plans first just to be sure.) + + Once the plan has been created, you can use it as many times as you +like for transforms on the specified `in'/`out' arrays, computing the +actual transforms via `fftw_execute(plan)': + void fftw_execute(const fftw_plan plan); + + The DFT results are stored in-order in the array `out', with the +zero-frequency (DC) component in `out[0]'. If `in != out', the +transform is "out-of-place" and the input array `in' is not modified. +Otherwise, the input array is overwritten with the transform. + + If you want to transform a _different_ array of the same size, you +can create a new plan with `fftw_plan_dft_1d' and FFTW automatically +reuses the information from the previous plan, if possible. +Alternatively, with the "guru" interface you can apply a given plan to +a different array, if you are careful. *Note FFTW Reference::. + + When you are done with the plan, you deallocate it by calling +`fftw_destroy_plan(plan)': + void fftw_destroy_plan(fftw_plan plan); + If you allocate an array with `fftw_malloc()' you must deallocate it +with `fftw_free()'. Do not use `free()' or, heaven forbid, `delete'. + + FFTW computes an _unnormalized_ DFT. Thus, computing a forward +followed by a backward transform (or vice versa) results in the original +array scaled by `n'. For the definition of the DFT, see *note What +FFTW Really Computes::. + + If you have a C compiler, such as `gcc', that supports the C99 +standard, and you `#include <complex.h>' _before_ `<fftw3.h>', then +`fftw_complex' is the native double-precision complex type and you can +manipulate it with ordinary arithmetic. Otherwise, FFTW defines its +own complex type, which is bit-compatible with the C99 complex type. +*Note Complex numbers::. (The C++ `<complex>' template class may also +be usable via a typecast.) + + To use single or long-double precision versions of FFTW, replace the +`fftw_' prefix by `fftwf_' or `fftwl_' and link with `-lfftw3f' or +`-lfftw3l', but use the _same_ `<fftw3.h>' header file. + + Many more flags exist besides `FFTW_MEASURE' and `FFTW_ESTIMATE'. +For example, use `FFTW_PATIENT' if you're willing to wait even longer +for a possibly even faster plan (*note FFTW Reference::). You can also +save plans for future use, as described by *note Words of Wisdom-Saving +Plans::. + + +File: fftw3.info, Node: Complex Multi-Dimensional DFTs, Next: One-Dimensional DFTs of Real Data, Prev: Complex One-Dimensional DFTs, Up: Tutorial + +2.2 Complex Multi-Dimensional DFTs +================================== + +Multi-dimensional transforms work much the same way as one-dimensional +transforms: you allocate arrays of `fftw_complex' (preferably using +`fftw_malloc'), create an `fftw_plan', execute it as many times as you +want with `fftw_execute(plan)', and clean up with +`fftw_destroy_plan(plan)' (and `fftw_free'). + + FFTW provides two routines for creating plans for 2d and 3d +transforms, and one routine for creating plans of arbitrary +dimensionality. The 2d and 3d routines have the following signature: + fftw_plan fftw_plan_dft_2d(int n0, int n1, + fftw_complex *in, fftw_complex *out, + int sign, unsigned flags); + fftw_plan fftw_plan_dft_3d(int n0, int n1, int n2, + fftw_complex *in, fftw_complex *out, + int sign, unsigned flags); + + These routines create plans for `n0' by `n1' two-dimensional (2d) +transforms and `n0' by `n1' by `n2' 3d transforms, respectively. All +of these transforms operate on contiguous arrays in the C-standard +"row-major" order, so that the last dimension has the fastest-varying +index in the array. This layout is described further in *note +Multi-dimensional Array Format::. + + FFTW can also compute transforms of higher dimensionality. In order +to avoid confusion between the various meanings of the the word +"dimension", we use the term _rank_ to denote the number of independent +indices in an array.(1) For example, we say that a 2d transform has +rank 2, a 3d transform has rank 3, and so on. You can plan transforms +of arbitrary rank by means of the following function: + + fftw_plan fftw_plan_dft(int rank, const int *n, + fftw_complex *in, fftw_complex *out, + int sign, unsigned flags); + + Here, `n' is a pointer to an array `n[rank]' denoting an `n[0]' by +`n[1]' by ... by `n[rank-1]' transform. Thus, for example, the call + fftw_plan_dft_2d(n0, n1, in, out, sign, flags); + is equivalent to the following code fragment: + int n[2]; + n[0] = n0; + n[1] = n1; + fftw_plan_dft(2, n, in, out, sign, flags); + `fftw_plan_dft' is not restricted to 2d and 3d transforms, however, +but it can plan transforms of arbitrary rank. + + You may have noticed that all the planner routines described so far +have overlapping functionality. For example, you can plan a 1d or 2d +transform by using `fftw_plan_dft' with a `rank' of `1' or `2', or even +by calling `fftw_plan_dft_3d' with `n0' and/or `n1' equal to `1' (with +no loss in efficiency). This pattern continues, and FFTW's planning +routines in general form a "partial order," sequences of interfaces +with strictly increasing generality but correspondingly greater +complexity. + + `fftw_plan_dft' is the most general complex-DFT routine that we +describe in this tutorial, but there are also the advanced and guru +interfaces, which allow one to efficiently combine multiple/strided +transforms into a single FFTW plan, transform a subset of a larger +multi-dimensional array, and/or to handle more general complex-number +formats. For more information, see *note FFTW Reference::. + + ---------- Footnotes ---------- + + (1) The term "rank" is commonly used in the APL, FORTRAN, and Common +Lisp traditions, although it is not so common in the C world. + + +File: fftw3.info, Node: One-Dimensional DFTs of Real Data, Next: Multi-Dimensional DFTs of Real Data, Prev: Complex Multi-Dimensional DFTs, Up: Tutorial + +2.3 One-Dimensional DFTs of Real Data +===================================== + +In many practical applications, the input data `in[i]' are purely real +numbers, in which case the DFT output satisfies the "Hermitian" redundancy: +`out[i]' is the conjugate of `out[n-i]'. It is possible to take +advantage of these circumstances in order to achieve roughly a factor +of two improvement in both speed and memory usage. + + In exchange for these speed and space advantages, the user sacrifices +some of the simplicity of FFTW's complex transforms. First of all, the +input and output arrays are of _different sizes and types_: the input +is `n' real numbers, while the output is `n/2+1' complex numbers (the +non-redundant outputs); this also requires slight "padding" of the +input array for in-place transforms. Second, the inverse transform +(complex to real) has the side-effect of _overwriting its input array_, +by default. Neither of these inconveniences should pose a serious +problem for users, but it is important to be aware of them. + + The routines to perform real-data transforms are almost the same as +those for complex transforms: you allocate arrays of `double' and/or +`fftw_complex' (preferably using `fftw_malloc' or +`fftw_alloc_complex'), create an `fftw_plan', execute it as many times +as you want with `fftw_execute(plan)', and clean up with +`fftw_destroy_plan(plan)' (and `fftw_free'). The only differences are +that the input (or output) is of type `double' and there are new +routines to create the plan. In one dimension: + + fftw_plan fftw_plan_dft_r2c_1d(int n, double *in, fftw_complex *out, + unsigned flags); + fftw_plan fftw_plan_dft_c2r_1d(int n, fftw_complex *in, double *out, + unsigned flags); + + for the real input to complex-Hermitian output ("r2c") and +complex-Hermitian input to real output ("c2r") transforms. Unlike the +complex DFT planner, there is no `sign' argument. Instead, r2c DFTs +are always `FFTW_FORWARD' and c2r DFTs are always `FFTW_BACKWARD'. (For +single/long-double precision `fftwf' and `fftwl', `double' should be +replaced by `float' and `long double', respectively.) + + Here, `n' is the "logical" size of the DFT, not necessarily the +physical size of the array. In particular, the real (`double') array +has `n' elements, while the complex (`fftw_complex') array has `n/2+1' +elements (where the division is rounded down). For an in-place +transform, `in' and `out' are aliased to the same array, which must be +big enough to hold both; so, the real array would actually have +`2*(n/2+1)' elements, where the elements beyond the first `n' are +unused padding. (Note that this is very different from the concept of +"zero-padding" a transform to a larger length, which changes the +logical size of the DFT by actually adding new input data.) The kth +element of the complex array is exactly the same as the kth element of +the corresponding complex DFT. All positive `n' are supported; +products of small factors are most efficient, but an O(n log n) +algorithm is used even for prime sizes. + + As noted above, the c2r transform destroys its input array even for +out-of-place transforms. This can be prevented, if necessary, by +including `FFTW_PRESERVE_INPUT' in the `flags', with unfortunately some +sacrifice in performance. This flag is also not currently supported +for multi-dimensional real DFTs (next section). + + Readers familiar with DFTs of real data will recall that the 0th (the +"DC") and `n/2'-th (the "Nyquist" frequency, when `n' is even) elements +of the complex output are purely real. Some implementations therefore +store the Nyquist element where the DC imaginary part would go, in +order to make the input and output arrays the same size. Such packing, +however, does not generalize well to multi-dimensional transforms, and +the space savings are miniscule in any case; FFTW does not support it. + + An alternative interface for one-dimensional r2c and c2r DFTs can be +found in the `r2r' interface (*note The Halfcomplex-format DFT::), with +"halfcomplex"-format output that _is_ the same size (and type) as the +input array. That interface, although it is not very useful for +multi-dimensional transforms, may sometimes yield better performance. + + +File: fftw3.info, Node: Multi-Dimensional DFTs of Real Data, Next: More DFTs of Real Data, Prev: One-Dimensional DFTs of Real Data, Up: Tutorial + +2.4 Multi-Dimensional DFTs of Real Data +======================================= + +Multi-dimensional DFTs of real data use the following planner routines: + + fftw_plan fftw_plan_dft_r2c_2d(int n0, int n1, + double *in, fftw_complex *out, + unsigned flags); + fftw_plan fftw_plan_dft_r2c_3d(int n0, int n1, int n2, + double *in, fftw_complex *out, + unsigned flags); + fftw_plan fftw_plan_dft_r2c(int rank, const int *n, + double *in, fftw_complex *out, + unsigned flags); + + as well as the corresponding `c2r' routines with the input/output +types swapped. These routines work similarly to their complex +analogues, except for the fact that here the complex output array is cut +roughly in half and the real array requires padding for in-place +transforms (as in 1d, above). + + As before, `n' is the logical size of the array, and the +consequences of this on the the format of the complex arrays deserve +careful attention. Suppose that the real data has dimensions n[0] x +n[1] x n[2] x ... x n[d-1] (in row-major order). Then, after an r2c +transform, the output is an n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1) +array of `fftw_complex' values in row-major order, corresponding to +slightly over half of the output of the corresponding complex DFT. +(The division is rounded down.) The ordering of the data is otherwise +exactly the same as in the complex-DFT case. + + For out-of-place transforms, this is the end of the story: the real +data is stored as a row-major array of size n[0] x n[1] x n[2] x ... x +n[d-1] and the complex data is stored as a row-major array of size +n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1) . + + For in-place transforms, however, extra padding of the real-data +array is necessary because the complex array is larger than the real +array, and the two arrays share the same memory locations. Thus, for +in-place transforms, the final dimension of the real-data array must be +padded with extra values to accommodate the size of the complex +data--two values if the last dimension is even and one if it is odd. That +is, the last dimension of the real data must physically contain 2 * +(n[d-1]/2+1) `double' values (exactly enough to hold the complex data). +This physical array size does not, however, change the _logical_ array +size--only n[d-1] values are actually stored in the last dimension, and +n[d-1] is the last dimension passed to the plan-creation routine. + + For example, consider the transform of a two-dimensional real array +of size `n0' by `n1'. The output of the r2c transform is a +two-dimensional complex array of size `n0' by `n1/2+1', where the `y' +dimension has been cut nearly in half because of redundancies in the +output. Because `fftw_complex' is twice the size of `double', the +output array is slightly bigger than the input array. Thus, if we want +to compute the transform in place, we must _pad_ the input array so +that it is of size `n0' by `2*(n1/2+1)'. If `n1' is even, then there +are two padding elements at the end of each row (which need not be +initialized, as they are only used for output). + + These transforms are unnormalized, so an r2c followed by a c2r +transform (or vice versa) will result in the original data scaled by +the number of real data elements--that is, the product of the (logical) +dimensions of the real data. + + (Because the last dimension is treated specially, if it is equal to +`1' the transform is _not_ equivalent to a lower-dimensional r2c/c2r +transform. In that case, the last complex dimension also has size `1' +(`=1/2+1'), and no advantage is gained over the complex transforms.) + + +File: fftw3.info, Node: More DFTs of Real Data, Prev: Multi-Dimensional DFTs of Real Data, Up: Tutorial + +2.5 More DFTs of Real Data +========================== + +* Menu: + +* The Halfcomplex-format DFT:: +* Real even/odd DFTs (cosine/sine transforms):: +* The Discrete Hartley Transform:: + + FFTW supports several other transform types via a unified "r2r" +(real-to-real) interface, so called because it takes a real (`double') +array and outputs a real array of the same size. These r2r transforms +currently fall into three categories: DFTs of real input and +complex-Hermitian output in halfcomplex format, DFTs of real input with +even/odd symmetry (a.k.a. discrete cosine/sine transforms, DCTs/DSTs), +and discrete Hartley transforms (DHTs), all described in more detail by +the following sections. + + The r2r transforms follow the by now familiar interface of creating +an `fftw_plan', executing it with `fftw_execute(plan)', and destroying +it with `fftw_destroy_plan(plan)'. Furthermore, all r2r transforms +share the same planner interface: + + fftw_plan fftw_plan_r2r_1d(int n, double *in, double *out, + fftw_r2r_kind kind, unsigned flags); + fftw_plan fftw_plan_r2r_2d(int n0, int n1, double *in, double *out, + fftw_r2r_kind kind0, fftw_r2r_kind kind1, + unsigned flags); + fftw_plan fftw_plan_r2r_3d(int n0, int n1, int n2, + double *in, double *out, + fftw_r2r_kind kind0, + fftw_r2r_kind kind1, + fftw_r2r_kind kind2, + unsigned flags); + fftw_plan fftw_plan_r2r(int rank, const int *n, double *in, double *out, + const fftw_r2r_kind *kind, unsigned flags); + + Just as for the complex DFT, these plan 1d/2d/3d/multi-dimensional +transforms for contiguous arrays in row-major order, transforming (real) +input to output of the same size, where `n' specifies the _physical_ +dimensions of the arrays. All positive `n' are supported (with the +exception of `n=1' for the `FFTW_REDFT00' kind, noted in the real-even +subsection below); products of small factors are most efficient +(factorizing `n-1' and `n+1' for `FFTW_REDFT00' and `FFTW_RODFT00' +kinds, described below), but an O(n log n) algorithm is used even for +prime sizes. + + Each dimension has a "kind" parameter, of type `fftw_r2r_kind', +specifying the kind of r2r transform to be used for that dimension. (In +the case of `fftw_plan_r2r', this is an array `kind[rank]' where +`kind[i]' is the transform kind for the dimension `n[i]'.) The kind +can be one of a set of predefined constants, defined in the following +subsections. + + In other words, FFTW computes the separable product of the specified +r2r transforms over each dimension, which can be used e.g. for partial +differential equations with mixed boundary conditions. (For some r2r +kinds, notably the halfcomplex DFT and the DHT, such a separable +product is somewhat problematic in more than one dimension, however, as +is described below.) + + In the current version of FFTW, all r2r transforms except for the +halfcomplex type are computed via pre- or post-processing of +halfcomplex transforms, and they are therefore not as fast as they +could be. Since most other general DCT/DST codes employ a similar +algorithm, however, FFTW's implementation should provide at least +competitive performance. + + +File: fftw3.info, Node: The Halfcomplex-format DFT, Next: Real even/odd DFTs (cosine/sine transforms), Prev: More DFTs of Real Data, Up: More DFTs of Real Data + +2.5.1 The Halfcomplex-format DFT +-------------------------------- + +An r2r kind of `FFTW_R2HC' ("r2hc") corresponds to an r2c DFT (*note +One-Dimensional DFTs of Real Data::) but with "halfcomplex" format +output, and may sometimes be faster and/or more convenient than the +latter. The inverse "hc2r" transform is of kind `FFTW_HC2R'. This +consists of the non-redundant half of the complex output for a 1d +real-input DFT of size `n', stored as a sequence of `n' real numbers +(`double') in the format: + + r0, r1, r2, r(n/2), i((n+1)/2-1), ..., i2, i1 + + Here, rk is the real part of the kth output, and ik is the imaginary +part. (Division by 2 is rounded down.) For a halfcomplex array +`hc[n]', the kth component thus has its real part in `hc[k]' and its +imaginary part in `hc[n-k]', with the exception of `k' `==' `0' or +`n/2' (the latter only if `n' is even)--in these two cases, the +imaginary part is zero due to symmetries of the real-input DFT, and is +not stored. Thus, the r2hc transform of `n' real values is a +halfcomplex array of length `n', and vice versa for hc2r. + + Aside from the differing format, the output of +`FFTW_R2HC'/`FFTW_HC2R' is otherwise exactly the same as for the +corresponding 1d r2c/c2r transform (i.e. `FFTW_FORWARD'/`FFTW_BACKWARD' +transforms, respectively). Recall that these transforms are +unnormalized, so r2hc followed by hc2r will result in the original data +multiplied by `n'. Furthermore, like the c2r transform, an +out-of-place hc2r transform will _destroy its input_ array. + + Although these halfcomplex transforms can be used with the +multi-dimensional r2r interface, the interpretation of such a separable +product of transforms along each dimension is problematic. For example, +consider a two-dimensional `n0' by `n1', r2hc by r2hc transform planned +by `fftw_plan_r2r_2d(n0, n1, in, out, FFTW_R2HC, FFTW_R2HC, +FFTW_MEASURE)'. Conceptually, FFTW first transforms the rows (of size +`n1') to produce halfcomplex rows, and then transforms the columns (of +size `n0'). Half of these column transforms, however, are of imaginary +parts, and should therefore be multiplied by i and combined with the +r2hc transforms of the real columns to produce the 2d DFT amplitudes; +FFTW's r2r transform does _not_ perform this combination for you. +Thus, if a multi-dimensional real-input/output DFT is required, we +recommend using the ordinary r2c/c2r interface (*note Multi-Dimensional +DFTs of Real Data::). + + +File: fftw3.info, Node: Real even/odd DFTs (cosine/sine transforms), Next: The Discrete Hartley Transform, Prev: The Halfcomplex-format DFT, Up: More DFTs of Real Data + +2.5.2 Real even/odd DFTs (cosine/sine transforms) +------------------------------------------------- + +The Fourier transform of a real-even function f(-x) = f(x) is +real-even, and i times the Fourier transform of a real-odd function +f(-x) = -f(x) is real-odd. Similar results hold for a discrete Fourier +transform, and thus for these symmetries the need for complex +inputs/outputs is entirely eliminated. Moreover, one gains a factor of +two in speed/space from the fact that the data are real, and an +additional factor of two from the even/odd symmetry: only the +non-redundant (first) half of the array need be stored. The result is +the real-even DFT ("REDFT") and the real-odd DFT ("RODFT"), also known +as the discrete cosine and sine transforms ("DCT" and "DST"), +respectively. + + (In this section, we describe the 1d transforms; multi-dimensional +transforms are just a separable product of these transforms operating +along each dimension.) + + Because of the discrete sampling, one has an additional choice: is +the data even/odd around a sampling point, or around the point halfway +between two samples? The latter corresponds to _shifting_ the samples +by _half_ an interval, and gives rise to several transform variants +denoted by REDFTab and RODFTab: a and b are 0 or 1, and indicate +whether the input (a) and/or output (b) are shifted by half a sample (1 +means it is shifted). These are also known as types I-IV of the DCT +and DST, and all four types are supported by FFTW's r2r interface.(1) + + The r2r kinds for the various REDFT and RODFT types supported by +FFTW, along with the boundary conditions at both ends of the _input_ +array (`n' real numbers `in[j=0..n-1]'), are: + + * `FFTW_REDFT00' (DCT-I): even around j=0 and even around j=n-1. + + * `FFTW_REDFT10' (DCT-II, "the" DCT): even around j=-0.5 and even + around j=n-0.5. + + * `FFTW_REDFT01' (DCT-III, "the" IDCT): even around j=0 and odd + around j=n. + + * `FFTW_REDFT11' (DCT-IV): even around j=-0.5 and odd around j=n-0.5. + + * `FFTW_RODFT00' (DST-I): odd around j=-1 and odd around j=n. + + * `FFTW_RODFT10' (DST-II): odd around j=-0.5 and odd around j=n-0.5. + + * `FFTW_RODFT01' (DST-III): odd around j=-1 and even around j=n-1. + + * `FFTW_RODFT11' (DST-IV): odd around j=-0.5 and even around j=n-0.5. + + + Note that these symmetries apply to the "logical" array being +transformed; *there are no constraints on your physical input data*. +So, for example, if you specify a size-5 REDFT00 (DCT-I) of the data +abcde, it corresponds to the DFT of the logical even array abcdedcb of +size 8. A size-4 REDFT10 (DCT-II) of the data abcd corresponds to the +size-8 logical DFT of the even array abcddcba, shifted by half a sample. + + All of these transforms are invertible. The inverse of R*DFT00 is +R*DFT00; of R*DFT10 is R*DFT01 and vice versa (these are often called +simply "the" DCT and IDCT, respectively); and of R*DFT11 is R*DFT11. +However, the transforms computed by FFTW are unnormalized, exactly like +the corresponding real and complex DFTs, so computing a transform +followed by its inverse yields the original array scaled by N, where N +is the _logical_ DFT size. For REDFT00, N=2(n-1); for RODFT00, +N=2(n+1); otherwise, N=2n. + + Note that the boundary conditions of the transform output array are +given by the input boundary conditions of the inverse transform. Thus, +the above transforms are all inequivalent in terms of input/output +boundary conditions, even neglecting the 0.5 shift difference. + + FFTW is most efficient when N is a product of small factors; note +that this _differs_ from the factorization of the physical size `n' for +REDFT00 and RODFT00! There is another oddity: `n=1' REDFT00 transforms +correspond to N=0, and so are _not defined_ (the planner will return +`NULL'). Otherwise, any positive `n' is supported. + + For the precise mathematical definitions of these transforms as used +by FFTW, see *note What FFTW Really Computes::. (For people accustomed +to the DCT/DST, FFTW's definitions have a coefficient of 2 in front of +the cos/sin functions so that they correspond precisely to an even/odd +DFT of size N. Some authors also include additional multiplicative +factors of sqrt(2) for selected inputs and outputs; this makes the +transform orthogonal, but sacrifices the direct equivalence to a +symmetric DFT.) + +Which type do you need? +....................... + +Since the required flavor of even/odd DFT depends upon your problem, +you are the best judge of this choice, but we can make a few comments +on relative efficiency to help you in your selection. In particular, +R*DFT01 and R*DFT10 tend to be slightly faster than R*DFT11 (especially +for odd sizes), while the R*DFT00 transforms are sometimes +significantly slower (especially for even sizes).(2) + + Thus, if only the boundary conditions on the transform inputs are +specified, we generally recommend R*DFT10 over R*DFT00 and R*DFT01 over +R*DFT11 (unless the half-sample shift or the self-inverse property is +significant for your problem). + + If performance is important to you and you are using only small sizes +(say n<200), e.g. for multi-dimensional transforms, then you might +consider generating hard-coded transforms of those sizes and types that +you are interested in (*note Generating your own code::). + + We are interested in hearing what types of symmetric transforms you +find most useful. + + ---------- Footnotes ---------- + + (1) There are also type V-VIII transforms, which correspond to a +logical DFT of _odd_ size N, independent of whether the physical size +`n' is odd, but we do not support these variants. + + (2) R*DFT00 is sometimes slower in FFTW because we discovered that +the standard algorithm for computing this by a pre/post-processed real +DFT--the algorithm used in FFTPACK, Numerical Recipes, and other +sources for decades now--has serious numerical problems: it already +loses several decimal places of accuracy for 16k sizes. There seem to +be only two alternatives in the literature that do not suffer +similarly: a recursive decomposition into smaller DCTs, which would +require a large set of codelets for efficiency and generality, or +sacrificing a factor of 2 in speed to use a real DFT of twice the size. +We currently employ the latter technique for general n, as well as a +limited form of the former method: a split-radix decomposition when n +is odd (N a multiple of 4). For N containing many factors of 2, the +split-radix method seems to recover most of the speed of the standard +algorithm without the accuracy tradeoff. + + +File: fftw3.info, Node: The Discrete Hartley Transform, Prev: Real even/odd DFTs (cosine/sine transforms), Up: More DFTs of Real Data + +2.5.3 The Discrete Hartley Transform +------------------------------------ + +If you are planning to use the DHT because you've heard that it is +"faster" than the DFT (FFT), *stop here*. The DHT is not faster than +the DFT. That story is an old but enduring misconception that was +debunked in 1987. + + The discrete Hartley transform (DHT) is an invertible linear +transform closely related to the DFT. In the DFT, one multiplies each +input by cos - i * sin (a complex exponential), whereas in the DHT each +input is multiplied by simply cos + sin. Thus, the DHT transforms `n' +real numbers to `n' real numbers, and has the convenient property of +being its own inverse. In FFTW, a DHT (of any positive `n') can be +specified by an r2r kind of `FFTW_DHT'. + + Like the DFT, in FFTW the DHT is unnormalized, so computing a DHT of +size `n' followed by another DHT of the same size will result in the +original array multiplied by `n'. + + The DHT was originally proposed as a more efficient alternative to +the DFT for real data, but it was subsequently shown that a specialized +DFT (such as FFTW's r2hc or r2c transforms) could be just as fast. In +FFTW, the DHT is actually computed by post-processing an r2hc +transform, so there is ordinarily no reason to prefer it from a +performance perspective.(1) However, we have heard rumors that the DHT +might be the most appropriate transform in its own right for certain +applications, and we would be very interested to hear from anyone who +finds it useful. + + If `FFTW_DHT' is specified for multiple dimensions of a +multi-dimensional transform, FFTW computes the separable product of 1d +DHTs along each dimension. Unfortunately, this is not quite the same +thing as a true multi-dimensional DHT; you can compute the latter, if +necessary, with at most `rank-1' post-processing passes [see e.g. H. +Hao and R. N. Bracewell, Proc. IEEE 75, 264-266 (1987)]. + + For the precise mathematical definition of the DHT as used by FFTW, +see *note What FFTW Really Computes::. + + ---------- Footnotes ---------- + + (1) We provide the DHT mainly as a byproduct of some internal +algorithms. FFTW computes a real input/output DFT of _prime_ size by +re-expressing it as a DHT plus post/pre-processing and then using +Rader's prime-DFT algorithm adapted to the DHT. + + +File: fftw3.info, Node: Other Important Topics, Next: FFTW Reference, Prev: Tutorial, Up: Top + +3 Other Important Topics +************************ + +* Menu: + +* SIMD alignment and fftw_malloc:: +* Multi-dimensional Array Format:: +* Words of Wisdom-Saving Plans:: +* Caveats in Using Wisdom:: + + +File: fftw3.info, Node: SIMD alignment and fftw_malloc, Next: Multi-dimensional Array Format, Prev: Other Important Topics, Up: Other Important Topics + +3.1 SIMD alignment and fftw_malloc +================================== + +SIMD, which stands for "Single Instruction Multiple Data," is a set of +special operations supported by some processors to perform a single +operation on several numbers (usually 2 or 4) simultaneously. SIMD +floating-point instructions are available on several popular CPUs: +SSE/SSE2/AVX on recent x86/x86-64 processors, AltiVec (single precision) +on some PowerPCs (Apple G4 and higher), NEON on some ARM models, and +MIPS Paired Single (currently only in FFTW 3.2.x). FFTW can be +compiled to support the SIMD instructions on any of these systems. + + A program linking to an FFTW library compiled with SIMD support can +obtain a nonnegligible speedup for most complex and r2c/c2r transforms. +In order to obtain this speedup, however, the arrays of complex (or +real) data passed to FFTW must be specially aligned in memory +(typically 16-byte aligned), and often this alignment is more stringent +than that provided by the usual `malloc' (etc.) allocation routines. + + In order to guarantee proper alignment for SIMD, therefore, in case +your program is ever linked against a SIMD-using FFTW, we recommend +allocating your transform data with `fftw_malloc' and de-allocating it +with `fftw_free'. These have exactly the same interface and behavior as +`malloc'/`free', except that for a SIMD FFTW they ensure that the +returned pointer has the necessary alignment (by calling `memalign' or +its equivalent on your OS). + + You are not _required_ to use `fftw_malloc'. You can allocate your +data in any way that you like, from `malloc' to `new' (in C++) to a +fixed-size array declaration. If the array happens not to be properly +aligned, FFTW will not use the SIMD extensions. + + Since `fftw_malloc' only ever needs to be used for real and complex +arrays, we provide two convenient wrapper routines `fftw_alloc_real(N)' +and `fftw_alloc_complex(N)' that are equivalent to +`(double*)fftw_malloc(sizeof(double) * N)' and +`(fftw_complex*)fftw_malloc(sizeof(fftw_complex) * N)', respectively +(or their equivalents in other precisions). + + +File: fftw3.info, Node: Multi-dimensional Array Format, Next: Words of Wisdom-Saving Plans, Prev: SIMD alignment and fftw_malloc, Up: Other Important Topics + +3.2 Multi-dimensional Array Format +================================== + +This section describes the format in which multi-dimensional arrays are +stored in FFTW. We felt that a detailed discussion of this topic was +necessary. Since several different formats are common, this topic is +often a source of confusion. + +* Menu: + +* Row-major Format:: +* Column-major Format:: +* Fixed-size Arrays in C:: +* Dynamic Arrays in C:: +* Dynamic Arrays in C-The Wrong Way:: + + +File: fftw3.info, Node: Row-major Format, Next: Column-major Format, Prev: Multi-dimensional Array Format, Up: Multi-dimensional Array Format + +3.2.1 Row-major Format +---------------------- + +The multi-dimensional arrays passed to `fftw_plan_dft' etcetera are +expected to be stored as a single contiguous block in "row-major" order +(sometimes called "C order"). Basically, this means that as you step +through adjacent memory locations, the first dimension's index varies +most slowly and the last dimension's index varies most quickly. + + To be more explicit, let us consider an array of rank d whose +dimensions are n[0] x n[1] x n[2] x ... x n[d-1] . Now, we specify a +location in the array by a sequence of d (zero-based) indices, one for +each dimension: (i[0], i[1], ..., i[d-1]). If the array is stored in +row-major order, then this element is located at the position i[d-1] + +n[d-1] * (i[d-2] + n[d-2] * (... + n[1] * i[0])). + + Note that, for the ordinary complex DFT, each element of the array +must be of type `fftw_complex'; i.e. a (real, imaginary) pair of +(double-precision) numbers. + + In the advanced FFTW interface, the physical dimensions n from which +the indices are computed can be different from (larger than) the +logical dimensions of the transform to be computed, in order to +transform a subset of a larger array. Note also that, in the advanced +interface, the expression above is multiplied by a "stride" to get the +actual array index--this is useful in situations where each element of +the multi-dimensional array is actually a data structure (or another +array), and you just want to transform a single field. In the basic +interface, however, the stride is 1. + + +File: fftw3.info, Node: Column-major Format, Next: Fixed-size Arrays in C, Prev: Row-major Format, Up: Multi-dimensional Array Format + +3.2.2 Column-major Format +------------------------- + +Readers from the Fortran world are used to arrays stored in +"column-major" order (sometimes called "Fortran order"). This is +essentially the exact opposite of row-major order in that, here, the +_first_ dimension's index varies most quickly. + + If you have an array stored in column-major order and wish to +transform it using FFTW, it is quite easy to do. When creating the +plan, simply pass the dimensions of the array to the planner in +_reverse order_. For example, if your array is a rank three `N x M x +L' matrix in column-major order, you should pass the dimensions of the +array as if it were an `L x M x N' matrix (which it is, from the +perspective of FFTW). This is done for you _automatically_ by the FFTW +legacy-Fortran interface (*note Calling FFTW from Legacy Fortran::), +but you must do it manually with the modern Fortran interface (*note +Reversing array dimensions::). + + +File: fftw3.info, Node: Fixed-size Arrays in C, Next: Dynamic Arrays in C, Prev: Column-major Format, Up: Multi-dimensional Array Format + +3.2.3 Fixed-size Arrays in C +---------------------------- + +A multi-dimensional array whose size is declared at compile time in C +is _already_ in row-major order. You don't have to do anything special +to transform it. For example: + + { + fftw_complex data[N0][N1][N2]; + fftw_plan plan; + ... + plan = fftw_plan_dft_3d(N0, N1, N2, &data[0][0][0], &data[0][0][0], + FFTW_FORWARD, FFTW_ESTIMATE); + ... + } + + This will plan a 3d in-place transform of size `N0 x N1 x N2'. +Notice how we took the address of the zero-th element to pass to the +planner (we could also have used a typecast). + + However, we tend to _discourage_ users from declaring their arrays +in this way, for two reasons. First, this allocates the array on the +stack ("automatic" storage), which has a very limited size on most +operating systems (declaring an array with more than a few thousand +elements will often cause a crash). (You can get around this +limitation on many systems by declaring the array as `static' and/or +global, but that has its own drawbacks.) Second, it may not optimally +align the array for use with a SIMD FFTW (*note SIMD alignment and +fftw_malloc::). Instead, we recommend using `fftw_malloc', as +described below. + + +File: fftw3.info, Node: Dynamic Arrays in C, Next: Dynamic Arrays in C-The Wrong Way, Prev: Fixed-size Arrays in C, Up: Multi-dimensional Array Format + +3.2.4 Dynamic Arrays in C +------------------------- + +We recommend allocating most arrays dynamically, with `fftw_malloc'. +This isn't too hard to do, although it is not as straightforward for +multi-dimensional arrays as it is for one-dimensional arrays. + + Creating the array is simple: using a dynamic-allocation routine like +`fftw_malloc', allocate an array big enough to store N `fftw_complex' +values (for a complex DFT), where N is the product of the sizes of the +array dimensions (i.e. the total number of complex values in the +array). For example, here is code to allocate a 5 x 12 x 27 rank-3 +array: + + fftw_complex *an_array; + an_array = (fftw_complex*) fftw_malloc(5*12*27 * sizeof(fftw_complex)); + + Accessing the array elements, however, is more tricky--you can't +simply use multiple applications of the `[]' operator like you could +for fixed-size arrays. Instead, you have to explicitly compute the +offset into the array using the formula given earlier for row-major +arrays. For example, to reference the (i,j,k)-th element of the array +allocated above, you would use the expression `an_array[k + 27 * (j + +12 * i)]'. + + This pain can be alleviated somewhat by defining appropriate macros, +or, in C++, creating a class and overloading the `()' operator. The +recent C99 standard provides a way to reinterpret the dynamic array as +a "variable-length" multi-dimensional array amenable to `[]', but this +feature is not yet widely supported by compilers. + + +File: fftw3.info, Node: Dynamic Arrays in C-The Wrong Way, Prev: Dynamic Arrays in C, Up: Multi-dimensional Array Format + +3.2.5 Dynamic Arrays in C--The Wrong Way +---------------------------------------- + +A different method for allocating multi-dimensional arrays in C is +often suggested that is incompatible with FFTW: _using it will cause +FFTW to die a painful death_. We discuss the technique here, however, +because it is so commonly known and used. This method is to create +arrays of pointers of arrays of pointers of ...etcetera. For example, +the analogue in this method to the example above is: + + int i,j; + fftw_complex ***a_bad_array; /* another way to make a 5x12x27 array */ + + a_bad_array = (fftw_complex ***) malloc(5 * sizeof(fftw_complex **)); + for (i = 0; i < 5; ++i) { + a_bad_array[i] = + (fftw_complex **) malloc(12 * sizeof(fftw_complex *)); + for (j = 0; j < 12; ++j) + a_bad_array[i][j] = + (fftw_complex *) malloc(27 * sizeof(fftw_complex)); + } + + As you can see, this sort of array is inconvenient to allocate (and +deallocate). On the other hand, it has the advantage that the +(i,j,k)-th element can be referenced simply by `a_bad_array[i][j][k]'. + + If you like this technique and want to maximize convenience in +accessing the array, but still want to pass the array to FFTW, you can +use a hybrid method. Allocate the array as one contiguous block, but +also declare an array of arrays of pointers that point to appropriate +places in the block. That sort of trick is beyond the scope of this +documentation; for more information on multi-dimensional arrays in C, +see the `comp.lang.c' FAQ (http://c-faq.com/aryptr/dynmuldimary.html). + + +File: fftw3.info, Node: Words of Wisdom-Saving Plans, Next: Caveats in Using Wisdom, Prev: Multi-dimensional Array Format, Up: Other Important Topics + +3.3 Words of Wisdom--Saving Plans +================================= + +FFTW implements a method for saving plans to disk and restoring them. +In fact, what FFTW does is more general than just saving and loading +plans. The mechanism is called "wisdom". Here, we describe this +feature at a high level. *Note FFTW Reference::, for a less casual but +more complete discussion of how to use wisdom in FFTW. + + Plans created with the `FFTW_MEASURE', `FFTW_PATIENT', or +`FFTW_EXHAUSTIVE' options produce near-optimal FFT performance, but may +require a long time to compute because FFTW must measure the runtime of +many possible plans and select the best one. This setup is designed +for the situations where so many transforms of the same size must be +computed that the start-up time is irrelevant. For short +initialization times, but slower transforms, we have provided +`FFTW_ESTIMATE'. The `wisdom' mechanism is a way to get the best of +both worlds: you compute a good plan once, save it to disk, and later +reload it as many times as necessary. The wisdom mechanism can +actually save and reload many plans at once, not just one. + + Whenever you create a plan, the FFTW planner accumulates wisdom, +which is information sufficient to reconstruct the plan. After +planning, you can save this information to disk by means of the +function: + int fftw_export_wisdom_to_filename(const char *filename); + (This function returns non-zero on success.) + + The next time you run the program, you can restore the wisdom with +`fftw_import_wisdom_from_filename' (which also returns non-zero on +success), and then recreate the plan using the same flags as before. + int fftw_import_wisdom_from_filename(const char *filename); + + Wisdom is automatically used for any size to which it is applicable, +as long as the planner flags are not more "patient" than those with +which the wisdom was created. For example, wisdom created with +`FFTW_MEASURE' can be used if you later plan with `FFTW_ESTIMATE' or +`FFTW_MEASURE', but not with `FFTW_PATIENT'. + + The `wisdom' is cumulative, and is stored in a global, private data +structure managed internally by FFTW. The storage space required is +minimal, proportional to the logarithm of the sizes the wisdom was +generated from. If memory usage is a concern, however, the wisdom can +be forgotten and its associated memory freed by calling: + void fftw_forget_wisdom(void); + + Wisdom can be exported to a file, a string, or any other medium. +For details, see *note Wisdom::. + + +File: fftw3.info, Node: Caveats in Using Wisdom, Prev: Words of Wisdom-Saving Plans, Up: Other Important Topics + +3.4 Caveats in Using Wisdom +=========================== + + For in much wisdom is much grief, and he that increaseth knowledge + increaseth sorrow. [Ecclesiastes 1:18] + + There are pitfalls to using wisdom, in that it can negate FFTW's +ability to adapt to changing hardware and other conditions. For +example, it would be perfectly possible to export wisdom from a program +running on one processor and import it into a program running on +another processor. Doing so, however, would mean that the second +program would use plans optimized for the first processor, instead of +the one it is running on. + + It should be safe to reuse wisdom as long as the hardware and program +binaries remain unchanged. (Actually, the optimal plan may change even +between runs of the same binary on identical hardware, due to +differences in the virtual memory environment, etcetera. Users +seriously interested in performance should worry about this problem, +too.) It is likely that, if the same wisdom is used for two different +program binaries, even running on the same machine, the plans may be +sub-optimal because of differing code alignments. It is therefore wise +to recreate wisdom every time an application is recompiled. The more +the underlying hardware and software changes between the creation of +wisdom and its use, the greater grows the risk of sub-optimal plans. + + Nevertheless, if the choice is between using `FFTW_ESTIMATE' or +using possibly-suboptimal wisdom (created on the same machine, but for a +different binary), the wisdom is likely to be better. For this reason, +we provide a function to import wisdom from a standard system-wide +location (`/etc/fftw/wisdom' on Unix): + + int fftw_import_system_wisdom(void); + + FFTW also provides a standalone program, `fftw-wisdom' (described by +its own `man' page on Unix) with which users can create wisdom, e.g. +for a canonical set of sizes to store in the system wisdom file. *Note +Wisdom Utilities::. + + +File: fftw3.info, Node: FFTW Reference, Next: Multi-threaded FFTW, Prev: Other Important Topics, Up: Top + +4 FFTW Reference +**************** + +This chapter provides a complete reference for all sequential (i.e., +one-processor) FFTW functions. Parallel transforms are described in +later chapters. + +* Menu: + +* Data Types and Files:: +* Using Plans:: +* Basic Interface:: +* Advanced Interface:: +* Guru Interface:: +* New-array Execute Functions:: +* Wisdom:: +* What FFTW Really Computes:: + + +File: fftw3.info, Node: Data Types and Files, Next: Using Plans, Prev: FFTW Reference, Up: FFTW Reference + +4.1 Data Types and Files +======================== + +All programs using FFTW should include its header file: + + #include <fftw3.h> + + You must also link to the FFTW library. On Unix, this means adding +`-lfftw3 -lm' at the _end_ of the link command. + +* Menu: + +* Complex numbers:: +* Precision:: +* Memory Allocation:: + + +File: fftw3.info, Node: Complex numbers, Next: Precision, Prev: Data Types and Files, Up: Data Types and Files + +4.1.1 Complex numbers +--------------------- + +The default FFTW interface uses `double' precision for all +floating-point numbers, and defines a `fftw_complex' type to hold +complex numbers as: + + typedef double fftw_complex[2]; + + Here, the `[0]' element holds the real part and the `[1]' element +holds the imaginary part. + + Alternatively, if you have a C compiler (such as `gcc') that +supports the C99 revision of the ANSI C standard, you can use C's new +native complex type (which is binary-compatible with the typedef above). +In particular, if you `#include <complex.h>' _before_ `<fftw3.h>', then +`fftw_complex' is defined to be the native complex type and you can +manipulate it with ordinary arithmetic (e.g. `x = y * (3+4*I)', where +`x' and `y' are `fftw_complex' and `I' is the standard symbol for the +imaginary unit); + + C++ has its own `complex<T>' template class, defined in the standard +`<complex>' header file. Reportedly, the C++ standards committee has +recently agreed to mandate that the storage format used for this type +be binary-compatible with the C99 type, i.e. an array `T[2]' with +consecutive real `[0]' and imaginary `[1]' parts. (See report +`http://www.open-std.org/jtc1/sc22/WG21/docs/papers/2002/n1388.pdf +WG21/N1388'.) Although not part of the official standard as of this +writing, the proposal stated that: "This solution has been tested with +all current major implementations of the standard library and shown to +be working." To the extent that this is true, if you have a variable +`complex<double> *x', you can pass it directly to FFTW via +`reinterpret_cast<fftw_complex*>(x)'. + + +File: fftw3.info, Node: Precision, Next: Memory Allocation, Prev: Complex numbers, Up: Data Types and Files + +4.1.2 Precision +--------------- + +You can install single and long-double precision versions of FFTW, +which replace `double' with `float' and `long double', respectively +(*note Installation and Customization::). To use these interfaces, you: + + * Link to the single/long-double libraries; on Unix, `-lfftw3f' or + `-lfftw3l' instead of (or in addition to) `-lfftw3'. (You can + link to the different-precision libraries simultaneously.) + + * Include the _same_ `<fftw3.h>' header file. + + * Replace all lowercase instances of `fftw_' with `fftwf_' or + `fftwl_' for single or long-double precision, respectively. + (`fftw_complex' becomes `fftwf_complex', `fftw_execute' becomes + `fftwf_execute', etcetera.) + + * Uppercase names, i.e. names beginning with `FFTW_', remain the + same. + + * Replace `double' with `float' or `long double' for subroutine + parameters. + + + Depending upon your compiler and/or hardware, `long double' may not +be any more precise than `double' (or may not be supported at all, +although it is standard in C99). + + We also support using the nonstandard `__float128' +quadruple-precision type provided by recent versions of `gcc' on 32- +and 64-bit x86 hardware (*note Installation and Customization::). To +use this type, link with `-lfftw3q -lquadmath -lm' (the `libquadmath' +library provided by `gcc' is needed for quadruple-precision +trigonometric functions) and use `fftwq_' identifiers. + + +File: fftw3.info, Node: Memory Allocation, Prev: Precision, Up: Data Types and Files + +4.1.3 Memory Allocation +----------------------- + + void *fftw_malloc(size_t n); + void fftw_free(void *p); + + These are functions that behave identically to `malloc' and `free', +except that they guarantee that the returned pointer obeys any special +alignment restrictions imposed by any algorithm in FFTW (e.g. for SIMD +acceleration). *Note SIMD alignment and fftw_malloc::. + + Data allocated by `fftw_malloc' _must_ be deallocated by `fftw_free' +and not by the ordinary `free'. + + These routines simply call through to your operating system's +`malloc' or, if necessary, its aligned equivalent (e.g. `memalign'), so +you normally need not worry about any significant time or space +overhead. You are _not required_ to use them to allocate your data, +but we strongly recommend it. + + Note: in C++, just as with ordinary `malloc', you must typecast the +output of `fftw_malloc' to whatever pointer type you are allocating. + + We also provide the following two convenience functions to allocate +real and complex arrays with `n' elements, which are equivalent to +`(double *) fftw_malloc(sizeof(double) * n)' and `(fftw_complex *) +fftw_malloc(sizeof(fftw_complex) * n)', respectively: + + double *fftw_alloc_real(size_t n); + fftw_complex *fftw_alloc_complex(size_t n); + + The equivalent functions in other precisions allocate arrays of `n' +elements in that precision. e.g. `fftwf_alloc_real(n)' is equivalent +to `(float *) fftwf_malloc(sizeof(float) * n)'. + + +File: fftw3.info, Node: Using Plans, Next: Basic Interface, Prev: Data Types and Files, Up: FFTW Reference + +4.2 Using Plans +=============== + +Plans for all transform types in FFTW are stored as type `fftw_plan' +(an opaque pointer type), and are created by one of the various +planning routines described in the following sections. An `fftw_plan' +contains all information necessary to compute the transform, including +the pointers to the input and output arrays. + + void fftw_execute(const fftw_plan plan); + + This executes the `plan', to compute the corresponding transform on +the arrays for which it was planned (which must still exist). The plan +is not modified, and `fftw_execute' can be called as many times as +desired. + + To apply a given plan to a different array, you can use the +new-array execute interface. *Note New-array Execute Functions::. + + `fftw_execute' (and equivalents) is the only function in FFTW +guaranteed to be thread-safe; see *note Thread safety::. + + This function: + void fftw_destroy_plan(fftw_plan plan); + deallocates the `plan' and all its associated data. + + FFTW's planner saves some other persistent data, such as the +accumulated wisdom and a list of algorithms available in the current +configuration. If you want to deallocate all of that and reset FFTW to +the pristine state it was in when you started your program, you can +call: + + void fftw_cleanup(void); + + After calling `fftw_cleanup', all existing plans become undefined, +and you should not attempt to execute them nor to destroy them. You can +however create and execute/destroy new plans, in which case FFTW starts +accumulating wisdom information again. + + `fftw_cleanup' does not deallocate your plans, however. To prevent +memory leaks, you must still call `fftw_destroy_plan' before executing +`fftw_cleanup'. + + Occasionally, it may useful to know FFTW's internal "cost" metric +that it uses to compare plans to one another; this cost is proportional +to an execution time of the plan, in undocumented units, if the plan +was created with the `FFTW_MEASURE' or other timing-based options, or +alternatively is a heuristic cost function for `FFTW_ESTIMATE' plans. +(The cost values of measured and estimated plans are not comparable, +being in different units. Also, costs from different FFTW versions or +the same version compiled differently may not be in the same units. +Plans created from wisdom have a cost of 0 since no timing measurement +is performed for them. Finally, certain problems for which only one +top-level algorithm was possible may have required no measurements of +the cost of the whole plan, in which case `fftw_cost' will also return +0.) The cost metric for a given plan is returned by: + + double fftw_cost(const fftw_plan plan); + + The following two routines are provided purely for academic purposes +(that is, for entertainment). + + void fftw_flops(const fftw_plan plan, + double *add, double *mul, double *fma); + + Given a `plan', set `add', `mul', and `fma' to an exact count of the +number of floating-point additions, multiplications, and fused +multiply-add operations involved in the plan's execution. The total +number of floating-point operations (flops) is `add + mul + 2*fma', or +`add + mul + fma' if the hardware supports fused multiply-add +instructions (although the number of FMA operations is only approximate +because of compiler voodoo). (The number of operations should be an +integer, but we use `double' to avoid overflowing `int' for large +transforms; the arguments are of type `double' even for single and +long-double precision versions of FFTW.) + + void fftw_fprint_plan(const fftw_plan plan, FILE *output_file); + void fftw_print_plan(const fftw_plan plan); + + This outputs a "nerd-readable" representation of the `plan' to the +given file or to `stdout', respectively. + + +File: fftw3.info, Node: Basic Interface, Next: Advanced Interface, Prev: Using Plans, Up: FFTW Reference + +4.3 Basic Interface +=================== + +Recall that the FFTW API is divided into three parts(1): the "basic +interface" computes a single transform of contiguous data, the "advanced +interface" computes transforms of multiple or strided arrays, and the +"guru interface" supports the most general data layouts, +multiplicities, and strides. This section describes the the basic +interface, which we expect to satisfy the needs of most users. + +* Menu: + +* Complex DFTs:: +* Planner Flags:: +* Real-data DFTs:: +* Real-data DFT Array Format:: +* Real-to-Real Transforms:: +* Real-to-Real Transform Kinds:: + + ---------- Footnotes ---------- + + (1) Gallia est omnis divisa in partes tres (Julius Caesar). + + +File: fftw3.info, Node: Complex DFTs, Next: Planner Flags, Prev: Basic Interface, Up: Basic Interface + +4.3.1 Complex DFTs +------------------ + + fftw_plan fftw_plan_dft_1d(int n0, + fftw_complex *in, fftw_complex *out, + int sign, unsigned flags); + fftw_plan fftw_plan_dft_2d(int n0, int n1, + fftw_complex *in, fftw_complex *out, + int sign, unsigned flags); + fftw_plan fftw_plan_dft_3d(int n0, int n1, int n2, + fftw_complex *in, fftw_complex *out, + int sign, unsigned flags); + fftw_plan fftw_plan_dft(int rank, const int *n, + fftw_complex *in, fftw_complex *out, + int sign, unsigned flags); + + Plan a complex input/output discrete Fourier transform (DFT) in zero +or more dimensions, returning an `fftw_plan' (*note Using Plans::). + + Once you have created a plan for a certain transform type and +parameters, then creating another plan of the same type and parameters, +but for different arrays, is fast and shares constant data with the +first plan (if it still exists). + + The planner returns `NULL' if the plan cannot be created. In the +standard FFTW distribution, the basic interface is guaranteed to return +a non-`NULL' plan. A plan may be `NULL', however, if you are using a +customized FFTW configuration supporting a restricted set of transforms. + +Arguments +......... + + * `rank' is the rank of the transform (it should be the size of the + array `*n'), and can be any non-negative integer. (*Note Complex + Multi-Dimensional DFTs::, for the definition of "rank".) The + `_1d', `_2d', and `_3d' planners correspond to a `rank' of `1', + `2', and `3', respectively. The rank may be zero, which is + equivalent to a rank-1 transform of size 1, i.e. a copy of one + number from input to output. + + * `n0', `n1', `n2', or `n[0..rank-1]' (as appropriate for each + routine) specify the size of the transform dimensions. They can + be any positive integer. + + - Multi-dimensional arrays are stored in row-major order with + dimensions: `n0' x `n1'; or `n0' x `n1' x `n2'; or `n[0]' x + `n[1]' x ... x `n[rank-1]'. *Note Multi-dimensional Array + Format::. + + - FFTW is best at handling sizes of the form 2^a 3^b 5^c 7^d + 11^e 13^f, where e+f is either 0 or 1, and the other exponents + are arbitrary. Other sizes are computed by means of a slow, + general-purpose algorithm (which nevertheless retains O(n log + n) performance even for prime sizes). It is possible to + customize FFTW for different array sizes; see *note + Installation and Customization::. Transforms whose sizes are + powers of 2 are especially fast. + + * `in' and `out' point to the input and output arrays of the + transform, which may be the same (yielding an in-place transform). These + arrays are overwritten during planning, unless `FFTW_ESTIMATE' is + used in the flags. (The arrays need not be initialized, but they + must be allocated.) + + If `in == out', the transform is "in-place" and the input array is + overwritten. If `in != out', the two arrays must not overlap (but + FFTW does not check for this condition). + + * `sign' is the sign of the exponent in the formula that defines the + Fourier transform. It can be -1 (= `FFTW_FORWARD') or +1 (= + `FFTW_BACKWARD'). + + * `flags' is a bitwise OR (`|') of zero or more planner flags, as + defined in *note Planner Flags::. + + + FFTW computes an unnormalized transform: computing a forward +followed by a backward transform (or vice versa) will result in the +original data multiplied by the size of the transform (the product of +the dimensions). For more information, see *note What FFTW Really +Computes::. + + +File: fftw3.info, Node: Planner Flags, Next: Real-data DFTs, Prev: Complex DFTs, Up: Basic Interface + +4.3.2 Planner Flags +------------------- + +All of the planner routines in FFTW accept an integer `flags' argument, +which is a bitwise OR (`|') of zero or more of the flag constants +defined below. These flags control the rigor (and time) of the +planning process, and can also impose (or lift) restrictions on the +type of transform algorithm that is employed. + + _Important:_ the planner overwrites the input array during planning +unless a saved plan (*note Wisdom::) is available for that problem, so +you should initialize your input data after creating the plan. The +only exceptions to this are the `FFTW_ESTIMATE' and `FFTW_WISDOM_ONLY' +flags, as mentioned below. + + In all cases, if wisdom is available for the given problem that +was created with equal-or-greater planning rigor, then the more +rigorous wisdom is used. For example, in `FFTW_ESTIMATE' mode any +available wisdom is used, whereas in `FFTW_PATIENT' mode only wisdom +created in patient or exhaustive mode can be used. *Note Words of +Wisdom-Saving Plans::. + +Planning-rigor flags +.................... + + * `FFTW_ESTIMATE' specifies that, instead of actual measurements of + different algorithms, a simple heuristic is used to pick a + (probably sub-optimal) plan quickly. With this flag, the + input/output arrays are not overwritten during planning. + + * `FFTW_MEASURE' tells FFTW to find an optimized plan by actually + _computing_ several FFTs and measuring their execution time. + Depending on your machine, this can take some time (often a few + seconds). `FFTW_MEASURE' is the default planning option. + + * `FFTW_PATIENT' is like `FFTW_MEASURE', but considers a wider range + of algorithms and often produces a "more optimal" plan (especially + for large transforms), but at the expense of several times longer + planning time (especially for large transforms). + + * `FFTW_EXHAUSTIVE' is like `FFTW_PATIENT', but considers an even + wider range of algorithms, including many that we think are + unlikely to be fast, to produce the most optimal plan but with a + substantially increased planning time. + + * `FFTW_WISDOM_ONLY' is a special planning mode in which the plan is + only created if wisdom is available for the given problem, and + otherwise a `NULL' plan is returned. This can be combined with + other flags, e.g. `FFTW_WISDOM_ONLY | FFTW_PATIENT' creates a plan + only if wisdom is available that was created in `FFTW_PATIENT' or + `FFTW_EXHAUSTIVE' mode. The `FFTW_WISDOM_ONLY' flag is intended + for users who need to detect whether wisdom is available; for + example, if wisdom is not available one may wish to allocate new + arrays for planning so that user data is not overwritten. + + +Algorithm-restriction flags +........................... + + * `FFTW_DESTROY_INPUT' specifies that an out-of-place transform is + allowed to _overwrite its input_ array with arbitrary data; this + can sometimes allow more efficient algorithms to be employed. + + * `FFTW_PRESERVE_INPUT' specifies that an out-of-place transform must + _not change its input_ array. This is ordinarily the _default_, + except for c2r and hc2r (i.e. complex-to-real) transforms for + which `FFTW_DESTROY_INPUT' is the default. In the latter cases, + passing `FFTW_PRESERVE_INPUT' will attempt to use algorithms that + do not destroy the input, at the expense of worse performance; for + multi-dimensional c2r transforms, however, no input-preserving + algorithms are implemented and the planner will return `NULL' if + one is requested. + + * `FFTW_UNALIGNED' specifies that the algorithm may not impose any + unusual alignment requirements on the input/output arrays (i.e. no + SIMD may be used). This flag is normally _not necessary_, since + the planner automatically detects misaligned arrays. The only use + for this flag is if you want to use the new-array execute + interface to execute a given plan on a different array that may + not be aligned like the original. (Using `fftw_malloc' makes this + flag unnecessary even then.) + + +Limiting planning time +...................... + + extern void fftw_set_timelimit(double seconds); + + This function instructs FFTW to spend at most `seconds' seconds +(approximately) in the planner. If `seconds == FFTW_NO_TIMELIMIT' (the +default value, which is negative), then planning time is unbounded. +Otherwise, FFTW plans with a progressively wider range of algorithms +until the the given time limit is reached or the given range of +algorithms is explored, returning the best available plan. + + For example, specifying `FFTW_PATIENT' first plans in +`FFTW_ESTIMATE' mode, then in `FFTW_MEASURE' mode, then finally (time +permitting) in `FFTW_PATIENT'. If `FFTW_EXHAUSTIVE' is specified +instead, the planner will further progress to `FFTW_EXHAUSTIVE' mode. + + Note that the `seconds' argument specifies only a rough limit; in +practice, the planner may use somewhat more time if the time limit is +reached when the planner is in the middle of an operation that cannot +be interrupted. At the very least, the planner will complete planning +in `FFTW_ESTIMATE' mode (which is thus equivalent to a time limit of 0). + + +File: fftw3.info, Node: Real-data DFTs, Next: Real-data DFT Array Format, Prev: Planner Flags, Up: Basic Interface + +4.3.3 Real-data DFTs +-------------------- + + fftw_plan fftw_plan_dft_r2c_1d(int n0, + double *in, fftw_complex *out, + unsigned flags); + fftw_plan fftw_plan_dft_r2c_2d(int n0, int n1, + double *in, fftw_complex *out, + unsigned flags); + fftw_plan fftw_plan_dft_r2c_3d(int n0, int n1, int n2, + double *in, fftw_complex *out, + unsigned flags); + fftw_plan fftw_plan_dft_r2c(int rank, const int *n, + double *in, fftw_complex *out, + unsigned flags); + + Plan a real-input/complex-output discrete Fourier transform (DFT) in +zero or more dimensions, returning an `fftw_plan' (*note Using Plans::). + + Once you have created a plan for a certain transform type and +parameters, then creating another plan of the same type and parameters, +but for different arrays, is fast and shares constant data with the +first plan (if it still exists). + + The planner returns `NULL' if the plan cannot be created. A +non-`NULL' plan is always returned by the basic interface unless you +are using a customized FFTW configuration supporting a restricted set +of transforms, or if you use the `FFTW_PRESERVE_INPUT' flag with a +multi-dimensional out-of-place c2r transform (see below). + +Arguments +......... + + * `rank' is the rank of the transform (it should be the size of the + array `*n'), and can be any non-negative integer. (*Note Complex + Multi-Dimensional DFTs::, for the definition of "rank".) The + `_1d', `_2d', and `_3d' planners correspond to a `rank' of `1', + `2', and `3', respectively. The rank may be zero, which is + equivalent to a rank-1 transform of size 1, i.e. a copy of one + real number (with zero imaginary part) from input to output. + + * `n0', `n1', `n2', or `n[0..rank-1]', (as appropriate for each + routine) specify the size of the transform dimensions. They can + be any positive integer. This is different in general from the + _physical_ array dimensions, which are described in *note + Real-data DFT Array Format::. + + - FFTW is best at handling sizes of the form 2^a 3^b 5^c 7^d + 11^e 13^f, where e+f is either 0 or 1, and the other exponents + are arbitrary. Other sizes are computed by means of a slow, + general-purpose algorithm (which nevertheless retains O(n log + n) performance even for prime sizes). (It is possible to + customize FFTW for different array sizes; see *note + Installation and Customization::.) Transforms whose sizes + are powers of 2 are especially fast, and it is generally + beneficial for the _last_ dimension of an r2c/c2r transform + to be _even_. + + * `in' and `out' point to the input and output arrays of the + transform, which may be the same (yielding an in-place transform). These + arrays are overwritten during planning, unless `FFTW_ESTIMATE' is + used in the flags. (The arrays need not be initialized, but they + must be allocated.) For an in-place transform, it is important to + remember that the real array will require padding, described in + *note Real-data DFT Array Format::. + + * `flags' is a bitwise OR (`|') of zero or more planner flags, as + defined in *note Planner Flags::. + + + The inverse transforms, taking complex input (storing the +non-redundant half of a logically Hermitian array) to real output, are +given by: + + fftw_plan fftw_plan_dft_c2r_1d(int n0, + fftw_complex *in, double *out, + unsigned flags); + fftw_plan fftw_plan_dft_c2r_2d(int n0, int n1, + fftw_complex *in, double *out, + unsigned flags); + fftw_plan fftw_plan_dft_c2r_3d(int n0, int n1, int n2, + fftw_complex *in, double *out, + unsigned flags); + fftw_plan fftw_plan_dft_c2r(int rank, const int *n, + fftw_complex *in, double *out, + unsigned flags); + + The arguments are the same as for the r2c transforms, except that the +input and output data formats are reversed. + + FFTW computes an unnormalized transform: computing an r2c followed +by a c2r transform (or vice versa) will result in the original data +multiplied by the size of the transform (the product of the logical +dimensions). An r2c transform produces the same output as a +`FFTW_FORWARD' complex DFT of the same input, and a c2r transform is +correspondingly equivalent to `FFTW_BACKWARD'. For more information, +see *note What FFTW Really Computes::. + + +File: fftw3.info, Node: Real-data DFT Array Format, Next: Real-to-Real Transforms, Prev: Real-data DFTs, Up: Basic Interface + +4.3.4 Real-data DFT Array Format +-------------------------------- + +The output of a DFT of real data (r2c) contains symmetries that, in +principle, make half of the outputs redundant (*note What FFTW Really +Computes::). (Similarly for the input of an inverse c2r transform.) In +practice, it is not possible to entirely realize these savings in an +efficient and understandable format that generalizes to +multi-dimensional transforms. Instead, the output of the r2c +transforms is _slightly_ over half of the output of the corresponding +complex transform. We do not "pack" the data in any way, but store it +as an ordinary array of `fftw_complex' values. In fact, this data is +simply a subsection of what would be the array in the corresponding +complex transform. + + Specifically, for a real transform of d (= `rank') dimensions n[0] x +n[1] x n[2] x ... x n[d-1] , the complex data is an n[0] x n[1] x n[2] +x ... x (n[d-1]/2 + 1) array of `fftw_complex' values in row-major +order (with the division rounded down). That is, we only store the +_lower_ half (non-negative frequencies), plus one element, of the last +dimension of the data from the ordinary complex transform. (We could +have instead taken half of any other dimension, but implementation +turns out to be simpler if the last, contiguous, dimension is used.) + + For an out-of-place transform, the real data is simply an array with +physical dimensions n[0] x n[1] x n[2] x ... x n[d-1] in row-major +order. + + For an in-place transform, some complications arise since the +complex data is slightly larger than the real data. In this case, the +final dimension of the real data must be _padded_ with extra values to +accommodate the size of the complex data--two extra if the last +dimension is even and one if it is odd. That is, the last dimension of +the real data must physically contain 2 * (n[d-1]/2+1) `double' values +(exactly enough to hold the complex data). This physical array size +does not, however, change the _logical_ array size--only n[d-1] values +are actually stored in the last dimension, and n[d-1] is the last +dimension passed to the planner. + + +File: fftw3.info, Node: Real-to-Real Transforms, Next: Real-to-Real Transform Kinds, Prev: Real-data DFT Array Format, Up: Basic Interface + +4.3.5 Real-to-Real Transforms +----------------------------- + + fftw_plan fftw_plan_r2r_1d(int n, double *in, double *out, + fftw_r2r_kind kind, unsigned flags); + fftw_plan fftw_plan_r2r_2d(int n0, int n1, double *in, double *out, + fftw_r2r_kind kind0, fftw_r2r_kind kind1, + unsigned flags); + fftw_plan fftw_plan_r2r_3d(int n0, int n1, int n2, + double *in, double *out, + fftw_r2r_kind kind0, + fftw_r2r_kind kind1, + fftw_r2r_kind kind2, + unsigned flags); + fftw_plan fftw_plan_r2r(int rank, const int *n, double *in, double *out, + const fftw_r2r_kind *kind, unsigned flags); + + Plan a real input/output (r2r) transform of various kinds in zero or +more dimensions, returning an `fftw_plan' (*note Using Plans::). + + Once you have created a plan for a certain transform type and +parameters, then creating another plan of the same type and parameters, +but for different arrays, is fast and shares constant data with the +first plan (if it still exists). + + The planner returns `NULL' if the plan cannot be created. A +non-`NULL' plan is always returned by the basic interface unless you +are using a customized FFTW configuration supporting a restricted set +of transforms, or for size-1 `FFTW_REDFT00' kinds (which are not +defined). + +Arguments +......... + + * `rank' is the dimensionality of the transform (it should be the + size of the arrays `*n' and `*kind'), and can be any non-negative + integer. The `_1d', `_2d', and `_3d' planners correspond to a + `rank' of `1', `2', and `3', respectively. A `rank' of zero is + equivalent to a copy of one number from input to output. + + * `n', or `n0'/`n1'/`n2', or `n[rank]', respectively, gives the + (physical) size of the transform dimensions. They can be any + positive integer. + + - Multi-dimensional arrays are stored in row-major order with + dimensions: `n0' x `n1'; or `n0' x `n1' x `n2'; or `n[0]' x + `n[1]' x ... x `n[rank-1]'. *Note Multi-dimensional Array + Format::. + + - FFTW is generally best at handling sizes of the form 2^a 3^b + 5^c 7^d 11^e 13^f, where e+f is either 0 or 1, and the other + exponents are arbitrary. Other sizes are computed by means + of a slow, general-purpose algorithm (which nevertheless + retains O(n log n) performance even for prime sizes). (It + is possible to customize FFTW for different array sizes; see + *note Installation and Customization::.) Transforms whose + sizes are powers of 2 are especially fast. + + - For a `REDFT00' or `RODFT00' transform kind in a dimension of + size n, it is n-1 or n+1, respectively, that should be + factorizable in the above form. + + * `in' and `out' point to the input and output arrays of the + transform, which may be the same (yielding an in-place transform). These + arrays are overwritten during planning, unless `FFTW_ESTIMATE' is + used in the flags. (The arrays need not be initialized, but they + must be allocated.) + + * `kind', or `kind0'/`kind1'/`kind2', or `kind[rank]', is the kind + of r2r transform used for the corresponding dimension. The valid + kind constants are described in *note Real-to-Real Transform + Kinds::. In a multi-dimensional transform, what is computed is + the separable product formed by taking each transform kind along + the corresponding dimension, one dimension after another. + + * `flags' is a bitwise OR (`|') of zero or more planner flags, as + defined in *note Planner Flags::. + + + +File: fftw3.info, Node: Real-to-Real Transform Kinds, Prev: Real-to-Real Transforms, Up: Basic Interface + +4.3.6 Real-to-Real Transform Kinds +---------------------------------- + +FFTW currently supports 11 different r2r transform kinds, specified by +one of the constants below. For the precise definitions of these +transforms, see *note What FFTW Really Computes::. For a more +colloquial introduction to these transform kinds, see *note More DFTs +of Real Data::. + + For dimension of size `n', there is a corresponding "logical" +dimension `N' that determines the normalization (and the optimal +factorization); the formula for `N' is given for each kind below. +Also, with each transform kind is listed its corrsponding inverse +transform. FFTW computes unnormalized transforms: a transform followed +by its inverse will result in the original data multiplied by `N' (or +the product of the `N''s for each dimension, in multi-dimensions). + + * `FFTW_R2HC' computes a real-input DFT with output in "halfcomplex" + format, i.e. real and imaginary parts for a transform of size `n' + stored as: r0, r1, r2, r(n/2), i((n+1)/2-1), ..., i2, i1 (Logical + `N=n', inverse is `FFTW_HC2R'.) + + * `FFTW_HC2R' computes the reverse of `FFTW_R2HC', above. (Logical + `N=n', inverse is `FFTW_R2HC'.) + + * `FFTW_DHT' computes a discrete Hartley transform. (Logical `N=n', + inverse is `FFTW_DHT'.) + + * `FFTW_REDFT00' computes an REDFT00 transform, i.e. a DCT-I. + (Logical `N=2*(n-1)', inverse is `FFTW_REDFT00'.) + + * `FFTW_REDFT10' computes an REDFT10 transform, i.e. a DCT-II + (sometimes called "the" DCT). (Logical `N=2*n', inverse is + `FFTW_REDFT01'.) + + * `FFTW_REDFT01' computes an REDFT01 transform, i.e. a DCT-III + (sometimes called "the" IDCT, being the inverse of DCT-II). + (Logical `N=2*n', inverse is `FFTW_REDFT=10'.) + + * `FFTW_REDFT11' computes an REDFT11 transform, i.e. a DCT-IV. + (Logical `N=2*n', inverse is `FFTW_REDFT11'.) + + * `FFTW_RODFT00' computes an RODFT00 transform, i.e. a DST-I. + (Logical `N=2*(n+1)', inverse is `FFTW_RODFT00'.) + + * `FFTW_RODFT10' computes an RODFT10 transform, i.e. a DST-II. + (Logical `N=2*n', inverse is `FFTW_RODFT01'.) + + * `FFTW_RODFT01' computes an RODFT01 transform, i.e. a DST-III. + (Logical `N=2*n', inverse is `FFTW_RODFT=10'.) + + * `FFTW_RODFT11' computes an RODFT11 transform, i.e. a DST-IV. + (Logical `N=2*n', inverse is `FFTW_RODFT11'.) + + + +File: fftw3.info, Node: Advanced Interface, Next: Guru Interface, Prev: Basic Interface, Up: FFTW Reference + +4.4 Advanced Interface +====================== + +FFTW's "advanced" interface supplements the basic interface with four +new planner routines, providing a new level of flexibility: you can plan +a transform of multiple arrays simultaneously, operate on non-contiguous +(strided) data, and transform a subset of a larger multi-dimensional +array. Other than these additional features, the planner operates in +the same fashion as in the basic interface, and the resulting +`fftw_plan' is used in the same way (*note Using Plans::). + +* Menu: + +* Advanced Complex DFTs:: +* Advanced Real-data DFTs:: +* Advanced Real-to-real Transforms:: + + +File: fftw3.info, Node: Advanced Complex DFTs, Next: Advanced Real-data DFTs, Prev: Advanced Interface, Up: Advanced Interface + +4.4.1 Advanced Complex DFTs +--------------------------- + + fftw_plan fftw_plan_many_dft(int rank, const int *n, int howmany, + fftw_complex *in, const int *inembed, + int istride, int idist, + fftw_complex *out, const int *onembed, + int ostride, int odist, + int sign, unsigned flags); + + This routine plans multiple multidimensional complex DFTs, and it +extends the `fftw_plan_dft' routine (*note Complex DFTs::) to compute +`howmany' transforms, each having rank `rank' and size `n'. In +addition, the transform data need not be contiguous, but it may be laid +out in memory with an arbitrary stride. To account for these +possibilities, `fftw_plan_many_dft' adds the new parameters `howmany', +{`i',`o'}`nembed', {`i',`o'}`stride', and {`i',`o'}`dist'. The FFTW +basic interface (*note Complex DFTs::) provides routines specialized +for ranks 1, 2, and 3, but the advanced interface handles only the +general-rank case. + + `howmany' is the number of transforms to compute. The resulting +plan computes `howmany' transforms, where the input of the `k'-th +transform is at location `in+k*idist' (in C pointer arithmetic), and +its output is at location `out+k*odist'. Plans obtained in this way +can often be faster than calling FFTW multiple times for the individual +transforms. The basic `fftw_plan_dft' interface corresponds to +`howmany=1' (in which case the `dist' parameters are ignored). + + Each of the `howmany' transforms has rank `rank' and size `n', as in +the basic interface. In addition, the advanced interface allows the +input and output arrays of each transform to be row-major subarrays of +larger rank-`rank' arrays, described by `inembed' and `onembed' +parameters, respectively. {`i',`o'}`nembed' must be arrays of length +`rank', and `n' should be elementwise less than or equal to +{`i',`o'}`nembed'. Passing `NULL' for an `nembed' parameter is +equivalent to passing `n' (i.e. same physical and logical dimensions, +as in the basic interface.) + + The `stride' parameters indicate that the `j'-th element of the +input or output arrays is located at `j*istride' or `j*ostride', +respectively. (For a multi-dimensional array, `j' is the ordinary +row-major index.) When combined with the `k'-th transform in a +`howmany' loop, from above, this means that the (`j',`k')-th element is +at `j*stride+k*dist'. (The basic `fftw_plan_dft' interface corresponds +to a stride of 1.) + + For in-place transforms, the input and output `stride' and `dist' +parameters should be the same; otherwise, the planner may return `NULL'. + + Arrays `n', `inembed', and `onembed' are not used after this +function returns. You can safely free or reuse them. + + *Examples*: One transform of one 5 by 6 array contiguous in memory: + int rank = 2; + int n[] = {5, 6}; + int howmany = 1; + int idist = odist = 0; /* unused because howmany = 1 */ + int istride = ostride = 1; /* array is contiguous in memory */ + int *inembed = n, *onembed = n; + + Transform of three 5 by 6 arrays, each contiguous in memory, stored +in memory one after another: + int rank = 2; + int n[] = {5, 6}; + int howmany = 3; + int idist = odist = n[0]*n[1]; /* = 30, the distance in memory + between the first element + of the first array and the + first element of the second array */ + int istride = ostride = 1; /* array is contiguous in memory */ + int *inembed = n, *onembed = n; + + Transform each column of a 2d array with 10 rows and 3 columns: + int rank = 1; /* not 2: we are computing 1d transforms */ + int n[] = {10}; /* 1d transforms of length 10 */ + int howmany = 3; + int idist = odist = 1; + int istride = ostride = 3; /* distance between two elements in + the same column */ + int *inembed = n, *onembed = n; + + +File: fftw3.info, Node: Advanced Real-data DFTs, Next: Advanced Real-to-real Transforms, Prev: Advanced Complex DFTs, Up: Advanced Interface + +4.4.2 Advanced Real-data DFTs +----------------------------- + + fftw_plan fftw_plan_many_dft_r2c(int rank, const int *n, int howmany, + double *in, const int *inembed, + int istride, int idist, + fftw_complex *out, const int *onembed, + int ostride, int odist, + unsigned flags); + fftw_plan fftw_plan_many_dft_c2r(int rank, const int *n, int howmany, + fftw_complex *in, const int *inembed, + int istride, int idist, + double *out, const int *onembed, + int ostride, int odist, + unsigned flags); + + Like `fftw_plan_many_dft', these two functions add `howmany', +`nembed', `stride', and `dist' parameters to the `fftw_plan_dft_r2c' +and `fftw_plan_dft_c2r' functions, but otherwise behave the same as the +basic interface. + + The interpretation of `howmany', `stride', and `dist' are the same +as for `fftw_plan_many_dft', above. Note that the `stride' and `dist' +for the real array are in units of `double', and for the complex array +are in units of `fftw_complex'. + + If an `nembed' parameter is `NULL', it is interpreted as what it +would be in the basic interface, as described in *note Real-data DFT +Array Format::. That is, for the complex array the size is assumed to +be the same as `n', but with the last dimension cut roughly in half. +For the real array, the size is assumed to be `n' if the transform is +out-of-place, or `n' with the last dimension "padded" if the transform +is in-place. + + If an `nembed' parameter is non-`NULL', it is interpreted as the +physical size of the corresponding array, in row-major order, just as +for `fftw_plan_many_dft'. In this case, each dimension of `nembed' +should be `>=' what it would be in the basic interface (e.g. the halved +or padded `n'). + + Arrays `n', `inembed', and `onembed' are not used after this +function returns. You can safely free or reuse them. + + +File: fftw3.info, Node: Advanced Real-to-real Transforms, Prev: Advanced Real-data DFTs, Up: Advanced Interface + +4.4.3 Advanced Real-to-real Transforms +-------------------------------------- + + fftw_plan fftw_plan_many_r2r(int rank, const int *n, int howmany, + double *in, const int *inembed, + int istride, int idist, + double *out, const int *onembed, + int ostride, int odist, + const fftw_r2r_kind *kind, unsigned flags); + + Like `fftw_plan_many_dft', this functions adds `howmany', `nembed', +`stride', and `dist' parameters to the `fftw_plan_r2r' function, but +otherwise behave the same as the basic interface. The interpretation +of those additional parameters are the same as for +`fftw_plan_many_dft'. (Of course, the `stride' and `dist' parameters +are now in units of `double', not `fftw_complex'.) + + Arrays `n', `inembed', `onembed', and `kind' are not used after this +function returns. You can safely free or reuse them. + + +File: fftw3.info, Node: Guru Interface, Next: New-array Execute Functions, Prev: Advanced Interface, Up: FFTW Reference + +4.5 Guru Interface +================== + +The "guru" interface to FFTW is intended to expose as much as possible +of the flexibility in the underlying FFTW architecture. It allows one +to compute multi-dimensional "vectors" (loops) of multi-dimensional +transforms, where each vector/transform dimension has an independent +size and stride. One can also use more general complex-number formats, +e.g. separate real and imaginary arrays. + + For those users who require the flexibility of the guru interface, +it is important that they pay special attention to the documentation +lest they shoot themselves in the foot. + +* Menu: + +* Interleaved and split arrays:: +* Guru vector and transform sizes:: +* Guru Complex DFTs:: +* Guru Real-data DFTs:: +* Guru Real-to-real Transforms:: +* 64-bit Guru Interface:: + + +File: fftw3.info, Node: Interleaved and split arrays, Next: Guru vector and transform sizes, Prev: Guru Interface, Up: Guru Interface + +4.5.1 Interleaved and split arrays +---------------------------------- + +The guru interface supports two representations of complex numbers, +which we call the interleaved and the split format. + + The "interleaved" format is the same one used by the basic and +advanced interfaces, and it is documented in *note Complex numbers::. +In the interleaved format, you provide pointers to the real part of a +complex number, and the imaginary part understood to be stored in the +next memory location. + + The "split" format allows separate pointers to the real and +imaginary parts of a complex array. + + Technically, the interleaved format is redundant, because you can +always express an interleaved array in terms of a split array with +appropriate pointers and strides. On the other hand, the interleaved +format is simpler to use, and it is common in practice. Hence, FFTW +supports it as a special case. + + +File: fftw3.info, Node: Guru vector and transform sizes, Next: Guru Complex DFTs, Prev: Interleaved and split arrays, Up: Guru Interface + +4.5.2 Guru vector and transform sizes +------------------------------------- + +The guru interface introduces one basic new data structure, +`fftw_iodim', that is used to specify sizes and strides for +multi-dimensional transforms and vectors: + + typedef struct { + int n; + int is; + int os; + } fftw_iodim; + + Here, `n' is the size of the dimension, and `is' and `os' are the +strides of that dimension for the input and output arrays. (The stride +is the separation of consecutive elements along this dimension.) + + The meaning of the stride parameter depends on the type of the array +that the stride refers to. _If the array is interleaved complex, +strides are expressed in units of complex numbers (`fftw_complex'). If +the array is split complex or real, strides are expressed in units of +real numbers (`double')._ This convention is consistent with the usual +pointer arithmetic in the C language. An interleaved array is denoted +by a pointer `p' to `fftw_complex', so that `p+1' points to the next +complex number. Split arrays are denoted by pointers to `double', in +which case pointer arithmetic operates in units of `sizeof(double)'. + + The guru planner interfaces all take a (`rank', `dims[rank]') pair +describing the transform size, and a (`howmany_rank', +`howmany_dims[howmany_rank]') pair describing the "vector" size (a +multi-dimensional loop of transforms to perform), where `dims' and +`howmany_dims' are arrays of `fftw_iodim'. + + For example, the `howmany' parameter in the advanced complex-DFT +interface corresponds to `howmany_rank' = 1, `howmany_dims[0].n' = +`howmany', `howmany_dims[0].is' = `idist', and `howmany_dims[0].os' = +`odist'. (To compute a single transform, you can just use +`howmany_rank' = 0.) + + A row-major multidimensional array with dimensions `n[rank]' (*note +Row-major Format::) corresponds to `dims[i].n' = `n[i]' and the +recurrence `dims[i].is' = `n[i+1] * dims[i+1].is' (similarly for `os'). +The stride of the last (`i=rank-1') dimension is the overall stride of +the array. e.g. to be equivalent to the advanced complex-DFT +interface, you would have `dims[rank-1].is' = `istride' and +`dims[rank-1].os' = `ostride'. + + In general, we only guarantee FFTW to return a non-`NULL' plan if +the vector and transform dimensions correspond to a set of distinct +indices, and for in-place transforms the input/output strides should be +the same. + + +File: fftw3.info, Node: Guru Complex DFTs, Next: Guru Real-data DFTs, Prev: Guru vector and transform sizes, Up: Guru Interface + +4.5.3 Guru Complex DFTs +----------------------- + + fftw_plan fftw_plan_guru_dft( + int rank, const fftw_iodim *dims, + int howmany_rank, const fftw_iodim *howmany_dims, + fftw_complex *in, fftw_complex *out, + int sign, unsigned flags); + + fftw_plan fftw_plan_guru_split_dft( + int rank, const fftw_iodim *dims, + int howmany_rank, const fftw_iodim *howmany_dims, + double *ri, double *ii, double *ro, double *io, + unsigned flags); + + These two functions plan a complex-data, multi-dimensional DFT for +the interleaved and split format, respectively. Transform dimensions +are given by (`rank', `dims') over a multi-dimensional vector (loop) of +dimensions (`howmany_rank', `howmany_dims'). `dims' and `howmany_dims' +should point to `fftw_iodim' arrays of length `rank' and +`howmany_rank', respectively. + + `flags' is a bitwise OR (`|') of zero or more planner flags, as +defined in *note Planner Flags::. + + In the `fftw_plan_guru_dft' function, the pointers `in' and `out' +point to the interleaved input and output arrays, respectively. The +sign can be either -1 (= `FFTW_FORWARD') or +1 (= `FFTW_BACKWARD'). If +the pointers are equal, the transform is in-place. + + In the `fftw_plan_guru_split_dft' function, `ri' and `ii' point to +the real and imaginary input arrays, and `ro' and `io' point to the +real and imaginary output arrays. The input and output pointers may be +the same, indicating an in-place transform. For example, for +`fftw_complex' pointers `in' and `out', the corresponding parameters +are: + + ri = (double *) in; + ii = (double *) in + 1; + ro = (double *) out; + io = (double *) out + 1; + + Because `fftw_plan_guru_split_dft' accepts split arrays, strides are +expressed in units of `double'. For a contiguous `fftw_complex' array, +the overall stride of the transform should be 2, the distance between +consecutive real parts or between consecutive imaginary parts; see +*note Guru vector and transform sizes::. Note that the dimension +strides are applied equally to the real and imaginary parts; real and +imaginary arrays with different strides are not supported. + + There is no `sign' parameter in `fftw_plan_guru_split_dft'. This +function always plans for an `FFTW_FORWARD' transform. To plan for an +`FFTW_BACKWARD' transform, you can exploit the identity that the +backwards DFT is equal to the forwards DFT with the real and imaginary +parts swapped. For example, in the case of the `fftw_complex' arrays +above, the `FFTW_BACKWARD' transform is computed by the parameters: + + ri = (double *) in + 1; + ii = (double *) in; + ro = (double *) out + 1; + io = (double *) out; + + +File: fftw3.info, Node: Guru Real-data DFTs, Next: Guru Real-to-real Transforms, Prev: Guru Complex DFTs, Up: Guru Interface + +4.5.4 Guru Real-data DFTs +------------------------- + + fftw_plan fftw_plan_guru_dft_r2c( + int rank, const fftw_iodim *dims, + int howmany_rank, const fftw_iodim *howmany_dims, + double *in, fftw_complex *out, + unsigned flags); + + fftw_plan fftw_plan_guru_split_dft_r2c( + int rank, const fftw_iodim *dims, + int howmany_rank, const fftw_iodim *howmany_dims, + double *in, double *ro, double *io, + unsigned flags); + + fftw_plan fftw_plan_guru_dft_c2r( + int rank, const fftw_iodim *dims, + int howmany_rank, const fftw_iodim *howmany_dims, + fftw_complex *in, double *out, + unsigned flags); + + fftw_plan fftw_plan_guru_split_dft_c2r( + int rank, const fftw_iodim *dims, + int howmany_rank, const fftw_iodim *howmany_dims, + double *ri, double *ii, double *out, + unsigned flags); + + Plan a real-input (r2c) or real-output (c2r), multi-dimensional DFT +with transform dimensions given by (`rank', `dims') over a +multi-dimensional vector (loop) of dimensions (`howmany_rank', +`howmany_dims'). `dims' and `howmany_dims' should point to +`fftw_iodim' arrays of length `rank' and `howmany_rank', respectively. +As for the basic and advanced interfaces, an r2c transform is +`FFTW_FORWARD' and a c2r transform is `FFTW_BACKWARD'. + + The _last_ dimension of `dims' is interpreted specially: that +dimension of the real array has size `dims[rank-1].n', but that +dimension of the complex array has size `dims[rank-1].n/2+1' (division +rounded down). The strides, on the other hand, are taken to be exactly +as specified. It is up to the user to specify the strides +appropriately for the peculiar dimensions of the data, and we do not +guarantee that the planner will succeed (return non-`NULL') for any +dimensions other than those described in *note Real-data DFT Array +Format:: and generalized in *note Advanced Real-data DFTs::. (That is, +for an in-place transform, each individual dimension should be able to +operate in place.) + + `in' and `out' point to the input and output arrays for r2c and c2r +transforms, respectively. For split arrays, `ri' and `ii' point to the +real and imaginary input arrays for a c2r transform, and `ro' and `io' +point to the real and imaginary output arrays for an r2c transform. +`in' and `ro' or `ri' and `out' may be the same, indicating an in-place +transform. (In-place transforms where `in' and `io' or `ii' and `out' +are the same are not currently supported.) + + `flags' is a bitwise OR (`|') of zero or more planner flags, as +defined in *note Planner Flags::. + + In-place transforms of rank greater than 1 are currently only +supported for interleaved arrays. For split arrays, the planner will +return `NULL'. + + +File: fftw3.info, Node: Guru Real-to-real Transforms, Next: 64-bit Guru Interface, Prev: Guru Real-data DFTs, Up: Guru Interface + +4.5.5 Guru Real-to-real Transforms +---------------------------------- + + fftw_plan fftw_plan_guru_r2r(int rank, const fftw_iodim *dims, + int howmany_rank, + const fftw_iodim *howmany_dims, + double *in, double *out, + const fftw_r2r_kind *kind, + unsigned flags); + + Plan a real-to-real (r2r) multi-dimensional `FFTW_FORWARD' transform +with transform dimensions given by (`rank', `dims') over a +multi-dimensional vector (loop) of dimensions (`howmany_rank', +`howmany_dims'). `dims' and `howmany_dims' should point to +`fftw_iodim' arrays of length `rank' and `howmany_rank', respectively. + + The transform kind of each dimension is given by the `kind' +parameter, which should point to an array of length `rank'. Valid +`fftw_r2r_kind' constants are given in *note Real-to-Real Transform +Kinds::. + + `in' and `out' point to the real input and output arrays; they may +be the same, indicating an in-place transform. + + `flags' is a bitwise OR (`|') of zero or more planner flags, as +defined in *note Planner Flags::. + + +File: fftw3.info, Node: 64-bit Guru Interface, Prev: Guru Real-to-real Transforms, Up: Guru Interface + +4.5.6 64-bit Guru Interface +--------------------------- + +When compiled in 64-bit mode on a 64-bit architecture (where addresses +are 64 bits wide), FFTW uses 64-bit quantities internally for all +transform sizes, strides, and so on--you don't have to do anything +special to exploit this. However, in the ordinary FFTW interfaces, you +specify the transform size by an `int' quantity, which is normally only +32 bits wide. This means that, even though FFTW is using 64-bit sizes +internally, you cannot specify a single transform dimension larger than +2^31-1 numbers. + + We expect that few users will require transforms larger than this, +but, for those who do, we provide a 64-bit version of the guru +interface in which all sizes are specified as integers of type +`ptrdiff_t' instead of `int'. (`ptrdiff_t' is a signed integer type +defined by the C standard to be wide enough to represent address +differences, and thus must be at least 64 bits wide on a 64-bit +machine.) We stress that there is _no performance advantage_ to using +this interface--the same internal FFTW code is employed regardless--and +it is only necessary if you want to specify very large transform sizes. + + In particular, the 64-bit guru interface is a set of planner routines +that are exactly the same as the guru planner routines, except that +they are named with `guru64' instead of `guru' and they take arguments +of type `fftw_iodim64' instead of `fftw_iodim'. For example, instead +of `fftw_plan_guru_dft', we have `fftw_plan_guru64_dft'. + + fftw_plan fftw_plan_guru64_dft( + int rank, const fftw_iodim64 *dims, + int howmany_rank, const fftw_iodim64 *howmany_dims, + fftw_complex *in, fftw_complex *out, + int sign, unsigned flags); + + The `fftw_iodim64' type is similar to `fftw_iodim', with the same +interpretation, except that it uses type `ptrdiff_t' instead of type +`int'. + + typedef struct { + ptrdiff_t n; + ptrdiff_t is; + ptrdiff_t os; + } fftw_iodim64; + + Every other `fftw_plan_guru' function also has a `fftw_plan_guru64' +equivalent, but we do not repeat their documentation here since they +are identical to the 32-bit versions except as noted above. + + +File: fftw3.info, Node: New-array Execute Functions, Next: Wisdom, Prev: Guru Interface, Up: FFTW Reference + +4.6 New-array Execute Functions +=============================== + +Normally, one executes a plan for the arrays with which the plan was +created, by calling `fftw_execute(plan)' as described in *note Using +Plans::. However, it is possible for sophisticated users to apply a +given plan to a _different_ array using the "new-array execute" +functions detailed below, provided that the following conditions are +met: + + * The array size, strides, etcetera are the same (since those are + set by the plan). + + * The input and output arrays are the same (in-place) or different + (out-of-place) if the plan was originally created to be in-place or + out-of-place, respectively. + + * For split arrays, the separations between the real and imaginary + parts, `ii-ri' and `io-ro', are the same as they were for the + input and output arrays when the plan was created. (This + condition is automatically satisfied for interleaved arrays.) + + * The "alignment" of the new input/output arrays is the same as that + of the input/output arrays when the plan was created, unless the + plan was created with the `FFTW_UNALIGNED' flag. Here, the + alignment is a platform-dependent quantity (for example, it is the + address modulo 16 if SSE SIMD instructions are used, but the + address modulo 4 for non-SIMD single-precision FFTW on the same + machine). In general, only arrays allocated with `fftw_malloc' + are guaranteed to be equally aligned (*note SIMD alignment and + fftw_malloc::). + + + The alignment issue is especially critical, because if you don't use +`fftw_malloc' then you may have little control over the alignment of +arrays in memory. For example, neither the C++ `new' function nor the +Fortran `allocate' statement provide strong enough guarantees about +data alignment. If you don't use `fftw_malloc', therefore, you +probably have to use `FFTW_UNALIGNED' (which disables most SIMD +support). If possible, it is probably better for you to simply create +multiple plans (creating a new plan is quick once one exists for a +given size), or better yet re-use the same array for your transforms. + + If you are tempted to use the new-array execute interface because you +want to transform a known bunch of arrays of the same size, you should +probably go use the advanced interface instead (*note Advanced +Interface::)). + + The new-array execute functions are: + + void fftw_execute_dft( + const fftw_plan p, + fftw_complex *in, fftw_complex *out); + + void fftw_execute_split_dft( + const fftw_plan p, + double *ri, double *ii, double *ro, double *io); + + void fftw_execute_dft_r2c( + const fftw_plan p, + double *in, fftw_complex *out); + + void fftw_execute_split_dft_r2c( + const fftw_plan p, + double *in, double *ro, double *io); + + void fftw_execute_dft_c2r( + const fftw_plan p, + fftw_complex *in, double *out); + + void fftw_execute_split_dft_c2r( + const fftw_plan p, + double *ri, double *ii, double *out); + + void fftw_execute_r2r( + const fftw_plan p, + double *in, double *out); + + These execute the `plan' to compute the corresponding transform on +the input/output arrays specified by the subsequent arguments. The +input/output array arguments have the same meanings as the ones passed +to the guru planner routines in the preceding sections. The `plan' is +not modified, and these routines can be called as many times as +desired, or intermixed with calls to the ordinary `fftw_execute'. + + The `plan' _must_ have been created for the transform type +corresponding to the execute function, e.g. it must be a complex-DFT +plan for `fftw_execute_dft'. Any of the planner routines for that +transform type, from the basic to the guru interface, could have been +used to create the plan, however. + + +File: fftw3.info, Node: Wisdom, Next: What FFTW Really Computes, Prev: New-array Execute Functions, Up: FFTW Reference + +4.7 Wisdom +========== + +This section documents the FFTW mechanism for saving and restoring +plans from disk. This mechanism is called "wisdom". + +* Menu: + +* Wisdom Export:: +* Wisdom Import:: +* Forgetting Wisdom:: +* Wisdom Utilities:: + + +File: fftw3.info, Node: Wisdom Export, Next: Wisdom Import, Prev: Wisdom, Up: Wisdom + +4.7.1 Wisdom Export +------------------- + + int fftw_export_wisdom_to_filename(const char *filename); + void fftw_export_wisdom_to_file(FILE *output_file); + char *fftw_export_wisdom_to_string(void); + void fftw_export_wisdom(void (*write_char)(char c, void *), void *data); + + These functions allow you to export all currently accumulated wisdom +in a form from which it can be later imported and restored, even during +a separate run of the program. (*Note Words of Wisdom-Saving Plans::.) +The current store of wisdom is not affected by calling any of these +routines. + + `fftw_export_wisdom' exports the wisdom to any output medium, as +specified by the callback function `write_char'. `write_char' is a +`putc'-like function that writes the character `c' to some output; its +second parameter is the `data' pointer passed to `fftw_export_wisdom'. +For convenience, the following three "wrapper" routines are provided: + + `fftw_export_wisdom_to_filename' writes wisdom to a file named +`filename' (which is created or overwritten), returning `1' on success +and `0' on failure. A lower-level function, which requires you to open +and close the file yourself (e.g. if you want to write wisdom to a +portion of a larger file) is `fftw_export_wisdom_to_file'. This writes +the wisdom to the current position in `output_file', which should be +open with write permission; upon exit, the file remains open and is +positioned at the end of the wisdom data. + + `fftw_export_wisdom_to_string' returns a pointer to a +`NULL'-terminated string holding the wisdom data. This string is +dynamically allocated, and it is the responsibility of the caller to +deallocate it with `free' when it is no longer needed. + + All of these routines export the wisdom in the same format, which we +will not document here except to say that it is LISP-like ASCII text +that is insensitive to white space. + + +File: fftw3.info, Node: Wisdom Import, Next: Forgetting Wisdom, Prev: Wisdom Export, Up: Wisdom + +4.7.2 Wisdom Import +------------------- + + int fftw_import_system_wisdom(void); + int fftw_import_wisdom_from_filename(const char *filename); + int fftw_import_wisdom_from_string(const char *input_string); + int fftw_import_wisdom(int (*read_char)(void *), void *data); + + These functions import wisdom into a program from data stored by the +`fftw_export_wisdom' functions above. (*Note Words of Wisdom-Saving +Plans::.) The imported wisdom replaces any wisdom already accumulated +by the running program. + + `fftw_import_wisdom' imports wisdom from any input medium, as +specified by the callback function `read_char'. `read_char' is a +`getc'-like function that returns the next character in the input; its +parameter is the `data' pointer passed to `fftw_import_wisdom'. If the +end of the input data is reached (which should never happen for valid +data), `read_char' should return `EOF' (as defined in `<stdio.h>'). +For convenience, the following three "wrapper" routines are provided: + + `fftw_import_wisdom_from_filename' reads wisdom from a file named +`filename'. A lower-level function, which requires you to open and +close the file yourself (e.g. if you want to read wisdom from a portion +of a larger file) is `fftw_import_wisdom_from_file'. This reads wisdom +from the current position in `input_file' (which should be open with +read permission); upon exit, the file remains open, but the position of +the read pointer is unspecified. + + `fftw_import_wisdom_from_string' reads wisdom from the +`NULL'-terminated string `input_string'. + + `fftw_import_system_wisdom' reads wisdom from an +implementation-defined standard file (`/etc/fftw/wisdom' on Unix and +GNU systems). + + The return value of these import routines is `1' if the wisdom was +read successfully and `0' otherwise. Note that, in all of these +functions, any data in the input stream past the end of the wisdom data +is simply ignored. + + +File: fftw3.info, Node: Forgetting Wisdom, Next: Wisdom Utilities, Prev: Wisdom Import, Up: Wisdom + +4.7.3 Forgetting Wisdom +----------------------- + + void fftw_forget_wisdom(void); + + Calling `fftw_forget_wisdom' causes all accumulated `wisdom' to be +discarded and its associated memory to be freed. (New `wisdom' can +still be gathered subsequently, however.) + + +File: fftw3.info, Node: Wisdom Utilities, Prev: Forgetting Wisdom, Up: Wisdom + +4.7.4 Wisdom Utilities +---------------------- + +FFTW includes two standalone utility programs that deal with wisdom. We +merely summarize them here, since they come with their own `man' pages +for Unix and GNU systems (with HTML versions on our web site). + + The first program is `fftw-wisdom' (or `fftwf-wisdom' in single +precision, etcetera), which can be used to create a wisdom file +containing plans for any of the transform sizes and types supported by +FFTW. It is preferable to create wisdom directly from your executable +(*note Caveats in Using Wisdom::), but this program is useful for +creating global wisdom files for `fftw_import_system_wisdom'. + + The second program is `fftw-wisdom-to-conf', which takes a wisdom +file as input and produces a "configuration routine" as output. The +latter is a C subroutine that you can compile and link into your +program, replacing a routine of the same name in the FFTW library, that +determines which parts of FFTW are callable by your program. +`fftw-wisdom-to-conf' produces a configuration routine that links to +only those parts of FFTW needed by the saved plans in the wisdom, +greatly reducing the size of statically linked executables (which should +only attempt to create plans corresponding to those in the wisdom, +however). + + +File: fftw3.info, Node: What FFTW Really Computes, Prev: Wisdom, Up: FFTW Reference + +4.8 What FFTW Really Computes +============================= + +In this section, we provide precise mathematical definitions for the +transforms that FFTW computes. These transform definitions are fairly +standard, but some authors follow slightly different conventions for the +normalization of the transform (the constant factor in front) and the +sign of the complex exponent. We begin by presenting the +one-dimensional (1d) transform definitions, and then give the +straightforward extension to multi-dimensional transforms. + +* Menu: + +* The 1d Discrete Fourier Transform (DFT):: +* The 1d Real-data DFT:: +* 1d Real-even DFTs (DCTs):: +* 1d Real-odd DFTs (DSTs):: +* 1d Discrete Hartley Transforms (DHTs):: +* Multi-dimensional Transforms:: + + +File: fftw3.info, Node: The 1d Discrete Fourier Transform (DFT), Next: The 1d Real-data DFT, Prev: What FFTW Really Computes, Up: What FFTW Really Computes + +4.8.1 The 1d Discrete Fourier Transform (DFT) +--------------------------------------------- + +The forward (`FFTW_FORWARD') discrete Fourier transform (DFT) of a 1d +complex array X of size n computes an array Y, where: Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(-2 pi j k sqrt(-1)/n) . + The backward (`FFTW_BACKWARD') DFT computes: Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(2 pi j k sqrt(-1)/n) . + FFTW computes an unnormalized transform, in that there is no +coefficient in front of the summation in the DFT. In other words, +applying the forward and then the backward transform will multiply the +input by n. + + From above, an `FFTW_FORWARD' transform corresponds to a sign of -1 +in the exponent of the DFT. Note also that we use the standard +"in-order" output ordering--the k-th output corresponds to the +frequency k/n (or k/T, where T is your total sampling period). For +those who like to think in terms of positive and negative frequencies, +this means that the positive frequencies are stored in the first half +of the output and the negative frequencies are stored in backwards +order in the second half of the output. (The frequency -k/n is the +same as the frequency (n-k)/n.) + + +File: fftw3.info, Node: The 1d Real-data DFT, Next: 1d Real-even DFTs (DCTs), Prev: The 1d Discrete Fourier Transform (DFT), Up: What FFTW Really Computes + +4.8.2 The 1d Real-data DFT +-------------------------- + +The real-input (r2c) DFT in FFTW computes the _forward_ transform Y of +the size `n' real array X, exactly as defined above, i.e. Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(-2 pi j k sqrt(-1)/n) . + This output array Y can easily be shown to possess the "Hermitian" +symmetry Y[k] = Y[n-k]*, where we take Y to be periodic so that Y[n] = +Y[0]. + + As a result of this symmetry, half of the output Y is redundant +(being the complex conjugate of the other half), and so the 1d r2c +transforms only output elements 0...n/2 of Y (n/2+1 complex numbers), +where the division by 2 is rounded down. + + Moreover, the Hermitian symmetry implies that Y[0] and, if n is +even, the Y[n/2] element, are purely real. So, for the `R2HC' r2r +transform, these elements are not stored in the halfcomplex output +format. + + The c2r and `H2RC' r2r transforms compute the backward DFT of the +_complex_ array X with Hermitian symmetry, stored in the r2c/`R2HC' +output formats, respectively, where the backward transform is defined +exactly as for the complex case: Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(2 pi j k sqrt(-1)/n) . + The outputs `Y' of this transform can easily be seen to be purely +real, and are stored as an array of real numbers. + + Like FFTW's complex DFT, these transforms are unnormalized. In other +words, applying the real-to-complex (forward) and then the +complex-to-real (backward) transform will multiply the input by n. + + +File: fftw3.info, Node: 1d Real-even DFTs (DCTs), Next: 1d Real-odd DFTs (DSTs), Prev: The 1d Real-data DFT, Up: What FFTW Really Computes + +4.8.3 1d Real-even DFTs (DCTs) +------------------------------ + +The Real-even symmetry DFTs in FFTW are exactly equivalent to the +unnormalized forward (and backward) DFTs as defined above, where the +input array X of length N is purely real and is also "even" symmetry. +In this case, the output array is likewise real and even symmetry. + + For the case of `REDFT00', this even symmetry means that X[j] = +X[N-j], where we take X to be periodic so that X[N] = X[0]. Because of +this redundancy, only the first n real numbers are actually stored, +where N = 2(n-1). + + The proper definition of even symmetry for `REDFT10', `REDFT01', and +`REDFT11' transforms is somewhat more intricate because of the shifts +by 1/2 of the input and/or output, although the corresponding boundary +conditions are given in *note Real even/odd DFTs (cosine/sine +transforms)::. Because of the even symmetry, however, the sine terms +in the DFT all cancel and the remaining cosine terms are written +explicitly below. This formulation often leads people to call such a +transform a "discrete cosine transform" (DCT), although it is really +just a special case of the DFT. + + In each of the definitions below, we transform a real array X of +length n to a real array Y of length n: + +REDFT00 (DCT-I) +............... + +An `REDFT00' transform (type-I DCT) in FFTW is defined by: Y[k] = X[0] ++ (-1)^k X[n-1] + 2 (sum for j = 1 to n-2 of X[j] cos(pi jk /(n-1))). +Note that this transform is not defined for n=1. For n=2, the +summation term above is dropped as you might expect. + +REDFT10 (DCT-II) +................ + +An `REDFT10' transform (type-II DCT, sometimes called "the" DCT) in +FFTW is defined by: Y[k] = 2 (sum for j = 0 to n-1 of X[j] cos(pi +(j+1/2) k / n)). + +REDFT01 (DCT-III) +................. + +An `REDFT01' transform (type-III DCT) in FFTW is defined by: Y[k] = +X[0] + 2 (sum for j = 1 to n-1 of X[j] cos(pi j (k+1/2) / n)). In the +case of n=1, this reduces to Y[0] = X[0]. Up to a scale factor (see +below), this is the inverse of `REDFT10' ("the" DCT), and so the +`REDFT01' (DCT-III) is sometimes called the "IDCT". + +REDFT11 (DCT-IV) +................ + +An `REDFT11' transform (type-IV DCT) in FFTW is defined by: Y[k] = 2 +(sum for j = 0 to n-1 of X[j] cos(pi (j+1/2) (k+1/2) / n)). + +Inverses and Normalization +.......................... + +These definitions correspond directly to the unnormalized DFTs used +elsewhere in FFTW (hence the factors of 2 in front of the summations). +The unnormalized inverse of `REDFT00' is `REDFT00', of `REDFT10' is +`REDFT01' and vice versa, and of `REDFT11' is `REDFT11'. Each +unnormalized inverse results in the original array multiplied by N, +where N is the _logical_ DFT size. For `REDFT00', N=2(n-1) (note that +n=1 is not defined); otherwise, N=2n. + + In defining the discrete cosine transform, some authors also include +additional factors of sqrt(2) (or its inverse) multiplying selected +inputs and/or outputs. This is a mostly cosmetic change that makes the +transform orthogonal, but sacrifices the direct equivalence to a +symmetric DFT. + + +File: fftw3.info, Node: 1d Real-odd DFTs (DSTs), Next: 1d Discrete Hartley Transforms (DHTs), Prev: 1d Real-even DFTs (DCTs), Up: What FFTW Really Computes + +4.8.4 1d Real-odd DFTs (DSTs) +----------------------------- + +The Real-odd symmetry DFTs in FFTW are exactly equivalent to the +unnormalized forward (and backward) DFTs as defined above, where the +input array X of length N is purely real and is also "odd" symmetry. In +this case, the output is odd symmetry and purely imaginary. + + For the case of `RODFT00', this odd symmetry means that X[j] = +-X[N-j], where we take X to be periodic so that X[N] = X[0]. Because +of this redundancy, only the first n real numbers starting at j=1 are +actually stored (the j=0 element is zero), where N = 2(n+1). + + The proper definition of odd symmetry for `RODFT10', `RODFT01', and +`RODFT11' transforms is somewhat more intricate because of the shifts +by 1/2 of the input and/or output, although the corresponding boundary +conditions are given in *note Real even/odd DFTs (cosine/sine +transforms)::. Because of the odd symmetry, however, the cosine terms +in the DFT all cancel and the remaining sine terms are written +explicitly below. This formulation often leads people to call such a +transform a "discrete sine transform" (DST), although it is really just +a special case of the DFT. + + In each of the definitions below, we transform a real array X of +length n to a real array Y of length n: + +RODFT00 (DST-I) +............... + +An `RODFT00' transform (type-I DST) in FFTW is defined by: Y[k] = 2 +(sum for j = 0 to n-1 of X[j] sin(pi (j+1)(k+1) / (n+1))). + +RODFT10 (DST-II) +................ + +An `RODFT10' transform (type-II DST) in FFTW is defined by: Y[k] = 2 +(sum for j = 0 to n-1 of X[j] sin(pi (j+1/2) (k+1) / n)). + +RODFT01 (DST-III) +................. + +An `RODFT01' transform (type-III DST) in FFTW is defined by: Y[k] = +(-1)^k X[n-1] + 2 (sum for j = 0 to n-2 of X[j] sin(pi (j+1) (k+1/2) / +n)). In the case of n=1, this reduces to Y[0] = X[0]. + +RODFT11 (DST-IV) +................ + +An `RODFT11' transform (type-IV DST) in FFTW is defined by: Y[k] = 2 +(sum for j = 0 to n-1 of X[j] sin(pi (j+1/2) (k+1/2) / n)). + +Inverses and Normalization +.......................... + +These definitions correspond directly to the unnormalized DFTs used +elsewhere in FFTW (hence the factors of 2 in front of the summations). +The unnormalized inverse of `RODFT00' is `RODFT00', of `RODFT10' is +`RODFT01' and vice versa, and of `RODFT11' is `RODFT11'. Each +unnormalized inverse results in the original array multiplied by N, +where N is the _logical_ DFT size. For `RODFT00', N=2(n+1); otherwise, +N=2n. + + In defining the discrete sine transform, some authors also include +additional factors of sqrt(2) (or its inverse) multiplying selected +inputs and/or outputs. This is a mostly cosmetic change that makes the +transform orthogonal, but sacrifices the direct equivalence to an +antisymmetric DFT. + + +File: fftw3.info, Node: 1d Discrete Hartley Transforms (DHTs), Next: Multi-dimensional Transforms, Prev: 1d Real-odd DFTs (DSTs), Up: What FFTW Really Computes + +4.8.5 1d Discrete Hartley Transforms (DHTs) +------------------------------------------- + +The discrete Hartley transform (DHT) of a 1d real array X of size n +computes a real array Y of the same size, where: Y[k] = sum for j = 0 to (n - 1) of X[j] * [cos(2 pi j k / n) + sin(2 pi j k / n)]. + FFTW computes an unnormalized transform, in that there is no +coefficient in front of the summation in the DHT. In other words, +applying the transform twice (the DHT is its own inverse) will multiply +the input by n. + + +File: fftw3.info, Node: Multi-dimensional Transforms, Prev: 1d Discrete Hartley Transforms (DHTs), Up: What FFTW Really Computes + +4.8.6 Multi-dimensional Transforms +---------------------------------- + +The multi-dimensional transforms of FFTW, in general, compute simply the +separable product of the given 1d transform along each dimension of the +array. Since each of these transforms is unnormalized, computing the +forward followed by the backward/inverse multi-dimensional transform +will result in the original array scaled by the product of the +normalization factors for each dimension (e.g. the product of the +dimension sizes, for a multi-dimensional DFT). + + The definition of FFTW's multi-dimensional DFT of real data (r2c) +deserves special attention. In this case, we logically compute the full +multi-dimensional DFT of the input data; since the input data are purely +real, the output data have the Hermitian symmetry and therefore only one +non-redundant half need be stored. More specifically, for an n[0] x +n[1] x n[2] x ... x n[d-1] multi-dimensional real-input DFT, the full +(logical) complex output array Y[k[0], k[1], ..., k[d-1]] has the +symmetry: Y[k[0], k[1], ..., k[d-1]] = Y[n[0] - k[0], n[1] - k[1], ..., +n[d-1] - k[d-1]]* (where each dimension is periodic). Because of this +symmetry, we only store the k[d-1] = 0...n[d-1]/2 elements of the +_last_ dimension (division by 2 is rounded down). (We could instead +have cut any other dimension in half, but the last dimension proved +computationally convenient.) This results in the peculiar array format +described in more detail by *note Real-data DFT Array Format::. + + The multi-dimensional c2r transform is simply the unnormalized +inverse of the r2c transform. i.e. it is the same as FFTW's complex +backward multi-dimensional DFT, operating on a Hermitian input array in +the peculiar format mentioned above and outputting a real array (since +the DFT output is purely real). + + We should remind the user that the separable product of 1d transforms +along each dimension, as computed by FFTW, is not always the same thing +as the usual multi-dimensional transform. A multi-dimensional `R2HC' +(or `HC2R') transform is not identical to the multi-dimensional DFT, +requiring some post-processing to combine the requisite real and +imaginary parts, as was described in *note The Halfcomplex-format +DFT::. Likewise, FFTW's multidimensional `FFTW_DHT' r2r transform is +not the same thing as the logical multi-dimensional discrete Hartley +transform defined in the literature, as discussed in *note The Discrete +Hartley Transform::. + + +File: fftw3.info, Node: Multi-threaded FFTW, Next: Distributed-memory FFTW with MPI, Prev: FFTW Reference, Up: Top + +5 Multi-threaded FFTW +********************* + +In this chapter we document the parallel FFTW routines for +shared-memory parallel hardware. These routines, which support +parallel one- and multi-dimensional transforms of both real and complex +data, are the easiest way to take advantage of multiple processors with +FFTW. They work just like the corresponding uniprocessor transform +routines, except that you have an extra initialization routine to call, +and there is a routine to set the number of threads to employ. Any +program that uses the uniprocessor FFTW can therefore be trivially +modified to use the multi-threaded FFTW. + + A shared-memory machine is one in which all CPUs can directly access +the same main memory, and such machines are now common due to the +ubiquity of multi-core CPUs. FFTW's multi-threading support allows you +to utilize these additional CPUs transparently from a single program. +However, this does not necessarily translate into performance +gains--when multiple threads/CPUs are employed, there is an overhead +required for synchronization that may outweigh the computatational +parallelism. Therefore, you can only benefit from threads if your +problem is sufficiently large. + +* Menu: + +* Installation and Supported Hardware/Software:: +* Usage of Multi-threaded FFTW:: +* How Many Threads to Use?:: +* Thread safety:: + + +File: fftw3.info, Node: Installation and Supported Hardware/Software, Next: Usage of Multi-threaded FFTW, Prev: Multi-threaded FFTW, Up: Multi-threaded FFTW + +5.1 Installation and Supported Hardware/Software +================================================ + +All of the FFTW threads code is located in the `threads' subdirectory +of the FFTW package. On Unix systems, the FFTW threads libraries and +header files can be automatically configured, compiled, and installed +along with the uniprocessor FFTW libraries simply by including +`--enable-threads' in the flags to the `configure' script (*note +Installation on Unix::), or `--enable-openmp' to use OpenMP +(http://www.openmp.org) threads. + + The threads routines require your operating system to have some sort +of shared-memory threads support. Specifically, the FFTW threads +package works with POSIX threads (available on most Unix variants, from +GNU/Linux to MacOS X) and Win32 threads. OpenMP threads, which are +supported in many common compilers (e.g. gcc) are also supported, and +may give better performance on some systems. (OpenMP threads are also +useful if you are employing OpenMP in your own code, in order to +minimize conflicts between threading models.) If you have a +shared-memory machine that uses a different threads API, it should be a +simple matter of programming to include support for it; see the file +`threads/threads.c' for more detail. + + You can compile FFTW with _both_ `--enable-threads' and +`--enable-openmp' at the same time, since they install libraries with +different names (`fftw3_threads' and `fftw3_omp', as described below). +However, your programs may only link to _one_ of these two libraries at +a time. + + Ideally, of course, you should also have multiple processors in +order to get any benefit from the threaded transforms. + + +File: fftw3.info, Node: Usage of Multi-threaded FFTW, Next: How Many Threads to Use?, Prev: Installation and Supported Hardware/Software, Up: Multi-threaded FFTW + +5.2 Usage of Multi-threaded FFTW +================================ + +Here, it is assumed that the reader is already familiar with the usage +of the uniprocessor FFTW routines, described elsewhere in this manual. +We only describe what one has to change in order to use the +multi-threaded routines. + + First, programs using the parallel complex transforms should be +linked with `-lfftw3_threads -lfftw3 -lm' on Unix, or `-lfftw3_omp +-lfftw3 -lm' if you compiled with OpenMP. You will also need to link +with whatever library is responsible for threads on your system (e.g. +`-lpthread' on GNU/Linux) or include whatever compiler flag enables +OpenMP (e.g. `-fopenmp' with gcc). + + Second, before calling _any_ FFTW routines, you should call the +function: + + int fftw_init_threads(void); + + This function, which need only be called once, performs any one-time +initialization required to use threads on your system. It returns zero +if there was some error (which should not happen under normal +circumstances) and a non-zero value otherwise. + + Third, before creating a plan that you want to parallelize, you +should call: + + void fftw_plan_with_nthreads(int nthreads); + + The `nthreads' argument indicates the number of threads you want +FFTW to use (or actually, the maximum number). All plans subsequently +created with any planner routine will use that many threads. You can +call `fftw_plan_with_nthreads', create some plans, call +`fftw_plan_with_nthreads' again with a different argument, and create +some more plans for a new number of threads. Plans already created +before a call to `fftw_plan_with_nthreads' are unaffected. If you pass +an `nthreads' argument of `1' (the default), threads are disabled for +subsequent plans. + + With OpenMP, to configure FFTW to use all of the currently running +OpenMP threads (set by `omp_set_num_threads(nthreads)' or by the +`OMP_NUM_THREADS' environment variable), you can do: +`fftw_plan_with_nthreads(omp_get_max_threads())'. (The `omp_' OpenMP +functions are declared via `#include <omp.h>'.) + + Given a plan, you then execute it as usual with +`fftw_execute(plan)', and the execution will use the number of threads +specified when the plan was created. When done, you destroy it as +usual with `fftw_destroy_plan'. As described in *note Thread safety::, +plan _execution_ is thread-safe, but plan creation and destruction are +_not_: you should create/destroy plans only from a single thread, but +can safely execute multiple plans in parallel. + + There is one additional routine: if you want to get rid of all memory +and other resources allocated internally by FFTW, you can call: + + void fftw_cleanup_threads(void); + + which is much like the `fftw_cleanup()' function except that it also +gets rid of threads-related data. You must _not_ execute any +previously created plans after calling this function. + + We should also mention one other restriction: if you save wisdom +from a program using the multi-threaded FFTW, that wisdom _cannot be +used_ by a program using only the single-threaded FFTW (i.e. not calling +`fftw_init_threads'). *Note Words of Wisdom-Saving Plans::. + + +File: fftw3.info, Node: How Many Threads to Use?, Next: Thread safety, Prev: Usage of Multi-threaded FFTW, Up: Multi-threaded FFTW + +5.3 How Many Threads to Use? +============================ + +There is a fair amount of overhead involved in synchronizing threads, +so the optimal number of threads to use depends upon the size of the +transform as well as on the number of processors you have. + + As a general rule, you don't want to use more threads than you have +processors. (Using more threads will work, but there will be extra +overhead with no benefit.) In fact, if the problem size is too small, +you may want to use fewer threads than you have processors. + + You will have to experiment with your system to see what level of +parallelization is best for your problem size. Typically, the problem +will have to involve at least a few thousand data points before threads +become beneficial. If you plan with `FFTW_PATIENT', it will +automatically disable threads for sizes that don't benefit from +parallelization. + + +File: fftw3.info, Node: Thread safety, Prev: How Many Threads to Use?, Up: Multi-threaded FFTW + +5.4 Thread safety +================= + +Users writing multi-threaded programs (including OpenMP) must concern +themselves with the "thread safety" of the libraries they use--that is, +whether it is safe to call routines in parallel from multiple threads. +FFTW can be used in such an environment, but some care must be taken +because the planner routines share data (e.g. wisdom and trigonometric +tables) between calls and plans. + + The upshot is that the only thread-safe (re-entrant) routine in FFTW +is `fftw_execute' (and the new-array variants thereof). All other +routines (e.g. the planner) should only be called from one thread at a +time. So, for example, you can wrap a semaphore lock around any calls +to the planner; even more simply, you can just create all of your plans +from one thread. We do not think this should be an important +restriction (FFTW is designed for the situation where the only +performance-sensitive code is the actual execution of the transform), +and the benefits of shared data between plans are great. + + Note also that, since the plan is not modified by `fftw_execute', it +is safe to execute the _same plan_ in parallel by multiple threads. +However, since a given plan operates by default on a fixed array, you +need to use one of the new-array execute functions (*note New-array +Execute Functions::) so that different threads compute the transform of +different data. + + (Users should note that these comments only apply to programs using +shared-memory threads or OpenMP. Parallelism using MPI or forked +processes involves a separate address-space and global variables for +each process, and is not susceptible to problems of this sort.) + + If you are configured FFTW with the `--enable-debug' or +`--enable-debug-malloc' flags (*note Installation on Unix::), then +`fftw_execute' is not thread-safe. These flags are not documented +because they are intended only for developing and debugging FFTW, but +if you must use `--enable-debug' then you should also specifically pass +`--disable-debug-malloc' for `fftw_execute' to be thread-safe. + + +File: fftw3.info, Node: Distributed-memory FFTW with MPI, Next: Calling FFTW from Modern Fortran, Prev: Multi-threaded FFTW, Up: Top + +6 Distributed-memory FFTW with MPI +********************************** + +In this chapter we document the parallel FFTW routines for parallel +systems supporting the MPI message-passing interface. Unlike the +shared-memory threads described in the previous chapter, MPI allows you +to use _distributed-memory_ parallelism, where each CPU has its own +separate memory, and which can scale up to clusters of many thousands +of processors. This capability comes at a price, however: each process +only stores a _portion_ of the data to be transformed, which means that +the data structures and programming-interface are quite different from +the serial or threads versions of FFTW. + + Distributed-memory parallelism is especially useful when you are +transforming arrays so large that they do not fit into the memory of a +single processor. The storage per-process required by FFTW's MPI +routines is proportional to the total array size divided by the number +of processes. Conversely, distributed-memory parallelism can easily +pose an unacceptably high communications overhead for small problems; +the threshold problem size for which parallelism becomes advantageous +will depend on the precise problem you are interested in, your +hardware, and your MPI implementation. + + A note on terminology: in MPI, you divide the data among a set of +"processes" which each run in their own memory address space. +Generally, each process runs on a different physical processor, but +this is not required. A set of processes in MPI is described by an +opaque data structure called a "communicator," the most common of which +is the predefined communicator `MPI_COMM_WORLD' which refers to _all_ +processes. For more information on these and other concepts common to +all MPI programs, we refer the reader to the documentation at the MPI +home page (http://www.mcs.anl.gov/research/projects/mpi/). + + We assume in this chapter that the reader is familiar with the usage +of the serial (uniprocessor) FFTW, and focus only on the concepts new +to the MPI interface. + +* Menu: + +* FFTW MPI Installation:: +* Linking and Initializing MPI FFTW:: +* 2d MPI example:: +* MPI Data Distribution:: +* Multi-dimensional MPI DFTs of Real Data:: +* Other Multi-dimensional Real-data MPI Transforms:: +* FFTW MPI Transposes:: +* FFTW MPI Wisdom:: +* Avoiding MPI Deadlocks:: +* FFTW MPI Performance Tips:: +* Combining MPI and Threads:: +* FFTW MPI Reference:: +* FFTW MPI Fortran Interface:: + + +File: fftw3.info, Node: FFTW MPI Installation, Next: Linking and Initializing MPI FFTW, Prev: Distributed-memory FFTW with MPI, Up: Distributed-memory FFTW with MPI + +6.1 FFTW MPI Installation +========================= + +All of the FFTW MPI code is located in the `mpi' subdirectory of the +FFTW package. On Unix systems, the FFTW MPI libraries and header files +are automatically configured, compiled, and installed along with the +uniprocessor FFTW libraries simply by including `--enable-mpi' in the +flags to the `configure' script (*note Installation on Unix::). + + Any implementation of the MPI standard, version 1 or later, should +work with FFTW. The `configure' script will attempt to automatically +detect how to compile and link code using your MPI implementation. In +some cases, especially if you have multiple different MPI +implementations installed or have an unusual MPI software package, you +may need to provide this information explicitly. + + Most commonly, one compiles MPI code by invoking a special compiler +command, typically `mpicc' for C code. The `configure' script knows +the most common names for this command, but you can specify the MPI +compilation command explicitly by setting the `MPICC' variable, as in +`./configure MPICC=mpicc ...'. + + If, instead of a special compiler command, you need to link a certain +library, you can specify the link command via the `MPILIBS' variable, +as in `./configure MPILIBS=-lmpi ...'. Note that if your MPI library +is installed in a non-standard location (one the compiler does not know +about by default), you may also have to specify the location of the +library and header files via `LDFLAGS' and `CPPFLAGS' variables, +respectively, as in `./configure LDFLAGS=-L/path/to/mpi/libs +CPPFLAGS=-I/path/to/mpi/include ...'. + + +File: fftw3.info, Node: Linking and Initializing MPI FFTW, Next: 2d MPI example, Prev: FFTW MPI Installation, Up: Distributed-memory FFTW with MPI + +6.2 Linking and Initializing MPI FFTW +===================================== + +Programs using the MPI FFTW routines should be linked with `-lfftw3_mpi +-lfftw3 -lm' on Unix in double precision, `-lfftw3f_mpi -lfftw3f -lm' +in single precision, and so on (*note Precision::). You will also need +to link with whatever library is responsible for MPI on your system; in +most MPI implementations, there is a special compiler alias named +`mpicc' to compile and link MPI code. + + Before calling any FFTW routines except possibly `fftw_init_threads' +(*note Combining MPI and Threads::), but after calling `MPI_Init', you +should call the function: + + void fftw_mpi_init(void); + + If, at the end of your program, you want to get rid of all memory and +other resources allocated internally by FFTW, for both the serial and +MPI routines, you can call: + + void fftw_mpi_cleanup(void); + + which is much like the `fftw_cleanup()' function except that it also +gets rid of FFTW's MPI-related data. You must _not_ execute any +previously created plans after calling this function. + + +File: fftw3.info, Node: 2d MPI example, Next: MPI Data Distribution, Prev: Linking and Initializing MPI FFTW, Up: Distributed-memory FFTW with MPI + +6.3 2d MPI example +================== + +Before we document the FFTW MPI interface in detail, we begin with a +simple example outlining how one would perform a two-dimensional `N0' +by `N1' complex DFT. + + #include <fftw3-mpi.h> + + int main(int argc, char **argv) + { + const ptrdiff_t N0 = ..., N1 = ...; + fftw_plan plan; + fftw_complex *data; + ptrdiff_t alloc_local, local_n0, local_0_start, i, j; + + MPI_Init(&argc, &argv); + fftw_mpi_init(); + + /* get local data size and allocate */ + alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD, + &local_n0, &local_0_start); + data = fftw_alloc_complex(alloc_local); + + /* create plan for in-place forward DFT */ + plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD, + FFTW_FORWARD, FFTW_ESTIMATE); + + /* initialize data to some function my_function(x,y) */ + for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j) + data[i*N1 + j] = my_function(local_0_start + i, j); + + /* compute transforms, in-place, as many times as desired */ + fftw_execute(plan); + + fftw_destroy_plan(plan); + + MPI_Finalize(); + } + + As can be seen above, the MPI interface follows the same basic style +of allocate/plan/execute/destroy as the serial FFTW routines. All of +the MPI-specific routines are prefixed with `fftw_mpi_' instead of +`fftw_'. There are a few important differences, however: + + First, we must call `fftw_mpi_init()' after calling `MPI_Init' +(required in all MPI programs) and before calling any other `fftw_mpi_' +routine. + + Second, when we create the plan with `fftw_mpi_plan_dft_2d', +analogous to `fftw_plan_dft_2d', we pass an additional argument: the +communicator, indicating which processes will participate in the +transform (here `MPI_COMM_WORLD', indicating all processes). Whenever +you create, execute, or destroy a plan for an MPI transform, you must +call the corresponding FFTW routine on _all_ processes in the +communicator for that transform. (That is, these are _collective_ +calls.) Note that the plan for the MPI transform uses the standard +`fftw_execute' and `fftw_destroy' routines (on the other hand, there +are MPI-specific new-array execute functions documented below). + + Third, all of the FFTW MPI routines take `ptrdiff_t' arguments +instead of `int' as for the serial FFTW. `ptrdiff_t' is a standard C +integer type which is (at least) 32 bits wide on a 32-bit machine and +64 bits wide on a 64-bit machine. This is to make it easy to specify +very large parallel transforms on a 64-bit machine. (You can specify +64-bit transform sizes in the serial FFTW, too, but only by using the +`guru64' planner interface. *Note 64-bit Guru Interface::.) + + Fourth, and most importantly, you don't allocate the entire +two-dimensional array on each process. Instead, you call +`fftw_mpi_local_size_2d' to find out what _portion_ of the array +resides on each processor, and how much space to allocate. Here, the +portion of the array on each process is a `local_n0' by `N1' slice of +the total array, starting at index `local_0_start'. The total number +of `fftw_complex' numbers to allocate is given by the `alloc_local' +return value, which _may_ be greater than `local_n0 * N1' (in case some +intermediate calculations require additional storage). The data +distribution in FFTW's MPI interface is described in more detail by the +next section. + + Given the portion of the array that resides on the local process, it +is straightforward to initialize the data (here to a function +`myfunction') and otherwise manipulate it. Of course, at the end of +the program you may want to output the data somehow, but synchronizing +this output is up to you and is beyond the scope of this manual. (One +good way to output a large multi-dimensional distributed array in MPI +to a portable binary file is to use the free HDF5 library; see the HDF +home page (http://www.hdfgroup.org/).) + + +File: fftw3.info, Node: MPI Data Distribution, Next: Multi-dimensional MPI DFTs of Real Data, Prev: 2d MPI example, Up: Distributed-memory FFTW with MPI + +6.4 MPI Data Distribution +========================= + +The most important concept to understand in using FFTW's MPI interface +is the data distribution. With a serial or multithreaded FFT, all of +the inputs and outputs are stored as a single contiguous chunk of +memory. With a distributed-memory FFT, the inputs and outputs are +broken into disjoint blocks, one per process. + + In particular, FFTW uses a _1d block distribution_ of the data, +distributed along the _first dimension_. For example, if you want to +perform a 100 x 200 complex DFT, distributed over 4 processes, each +process will get a 25 x 200 slice of the data. That is, process 0 +will get rows 0 through 24, process 1 will get rows 25 through 49, +process 2 will get rows 50 through 74, and process 3 will get rows 75 +through 99. If you take the same array but distribute it over 3 +processes, then it is not evenly divisible so the different processes +will have unequal chunks. FFTW's default choice in this case is to +assign 34 rows to processes 0 and 1, and 32 rows to process 2. + + FFTW provides several `fftw_mpi_local_size' routines that you can +call to find out what portion of an array is stored on the current +process. In most cases, you should use the default block sizes picked +by FFTW, but it is also possible to specify your own block size. For +example, with a 100 x 200 array on three processes, you can tell FFTW +to use a block size of 40, which would assign 40 rows to processes 0 +and 1, and 20 rows to process 2. FFTW's default is to divide the data +equally among the processes if possible, and as best it can otherwise. +The rows are always assigned in "rank order," i.e. process 0 gets the +first block of rows, then process 1, and so on. (You can change this +by using `MPI_Comm_split' to create a new communicator with re-ordered +processes.) However, you should always call the `fftw_mpi_local_size' +routines, if possible, rather than trying to predict FFTW's +distribution choices. + + In particular, it is critical that you allocate the storage size that +is returned by `fftw_mpi_local_size', which is _not_ necessarily the +size of the local slice of the array. The reason is that intermediate +steps of FFTW's algorithms involve transposing the array and +redistributing the data, so at these intermediate steps FFTW may +require more local storage space (albeit always proportional to the +total size divided by the number of processes). The +`fftw_mpi_local_size' functions know how much storage is required for +these intermediate steps and tell you the correct amount to allocate. + +* Menu: + +* Basic and advanced distribution interfaces:: +* Load balancing:: +* Transposed distributions:: +* One-dimensional distributions:: + + +File: fftw3.info, Node: Basic and advanced distribution interfaces, Next: Load balancing, Prev: MPI Data Distribution, Up: MPI Data Distribution + +6.4.1 Basic and advanced distribution interfaces +------------------------------------------------ + +As with the planner interface, the `fftw_mpi_local_size' distribution +interface is broken into basic and advanced (`_many') interfaces, where +the latter allows you to specify the block size manually and also to +request block sizes when computing multiple transforms simultaneously. +These functions are documented more exhaustively by the FFTW MPI +Reference, but we summarize the basic ideas here using a couple of +two-dimensional examples. + + For the 100 x 200 complex-DFT example, above, we would find the +distribution by calling the following function in the basic interface: + + ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, + ptrdiff_t *local_n0, ptrdiff_t *local_0_start); + + Given the total size of the data to be transformed (here, `n0 = 100' +and `n1 = 200') and an MPI communicator (`comm'), this function +provides three numbers. + + First, it describes the shape of the local data: the current process +should store a `local_n0' by `n1' slice of the overall dataset, in +row-major order (`n1' dimension contiguous), starting at index +`local_0_start'. That is, if the total dataset is viewed as a `n0' by +`n1' matrix, the current process should store the rows `local_0_start' +to `local_0_start+local_n0-1'. Obviously, if you are running with only +a single MPI process, that process will store the entire array: +`local_0_start' will be zero and `local_n0' will be `n0'. *Note +Row-major Format::. + + Second, the return value is the total number of data elements (e.g., +complex numbers for a complex DFT) that should be allocated for the +input and output arrays on the current process (ideally with +`fftw_malloc' or an `fftw_alloc' function, to ensure optimal +alignment). It might seem that this should always be equal to +`local_n0 * n1', but this is _not_ the case. FFTW's distributed FFT +algorithms require data redistributions at intermediate stages of the +transform, and in some circumstances this may require slightly larger +local storage. This is discussed in more detail below, under *note +Load balancing::. + + The advanced-interface `local_size' function for multidimensional +transforms returns the same three things (`local_n0', `local_0_start', +and the total number of elements to allocate), but takes more inputs: + + ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, + ptrdiff_t howmany, + ptrdiff_t block0, + MPI_Comm comm, + ptrdiff_t *local_n0, + ptrdiff_t *local_0_start); + + The two-dimensional case above corresponds to `rnk = 2' and an array +`n' of length 2 with `n[0] = n0' and `n[1] = n1'. This routine is for +any `rnk > 1'; one-dimensional transforms have their own interface +because they work slightly differently, as discussed below. + + First, the advanced interface allows you to perform multiple +transforms at once, of interleaved data, as specified by the `howmany' +parameter. (`hoamany' is 1 for a single transform.) + + Second, here you can specify your desired block size in the `n0' +dimension, `block0'. To use FFTW's default block size, pass +`FFTW_MPI_DEFAULT_BLOCK' (0) for `block0'. Otherwise, on `P' +processes, FFTW will return `local_n0' equal to `block0' on the first +`P / block0' processes (rounded down), return `local_n0' equal to `n0 - +block0 * (P / block0)' on the next process, and `local_n0' equal to +zero on any remaining processes. In general, we recommend using the +default block size (which corresponds to `n0 / P', rounded up). + + For example, suppose you have `P = 4' processes and `n0 = 21'. The +default will be a block size of `6', which will give `local_n0 = 6' on +the first three processes and `local_n0 = 3' on the last process. +Instead, however, you could specify `block0 = 5' if you wanted, which +would give `local_n0 = 5' on processes 0 to 2, `local_n0 = 6' on +process 3. (This choice, while it may look superficially more +"balanced," has the same critical path as FFTW's default but requires +more communications.) + + +File: fftw3.info, Node: Load balancing, Next: Transposed distributions, Prev: Basic and advanced distribution interfaces, Up: MPI Data Distribution + +6.4.2 Load balancing +-------------------- + +Ideally, when you parallelize a transform over some P processes, each +process should end up with work that takes equal time. Otherwise, all +of the processes end up waiting on whichever process is slowest. This +goal is known as "load balancing." In this section, we describe the +circumstances under which FFTW is able to load-balance well, and in +particular how you should choose your transform size in order to load +balance. + + Load balancing is especially difficult when you are parallelizing +over heterogeneous machines; for example, if one of your processors is a +old 486 and another is a Pentium IV, obviously you should give the +Pentium more work to do than the 486 since the latter is much slower. +FFTW does not deal with this problem, however--it assumes that your +processes run on hardware of comparable speed, and that the goal is +therefore to divide the problem as equally as possible. + + For a multi-dimensional complex DFT, FFTW can divide the problem +equally among the processes if: (i) the _first_ dimension `n0' is +divisible by P; and (ii), the _product_ of the subsequent dimensions is +divisible by P. (For the advanced interface, where you can specify +multiple simultaneous transforms via some "vector" length `howmany', a +factor of `howmany' is included in the product of the subsequent +dimensions.) + + For a one-dimensional complex DFT, the length `N' of the data should +be divisible by P _squared_ to be able to divide the problem equally +among the processes. + + +File: fftw3.info, Node: Transposed distributions, Next: One-dimensional distributions, Prev: Load balancing, Up: MPI Data Distribution + +6.4.3 Transposed distributions +------------------------------ + +Internally, FFTW's MPI transform algorithms work by first computing +transforms of the data local to each process, then by globally +_transposing_ the data in some fashion to redistribute the data among +the processes, transforming the new data local to each process, and +transposing back. For example, a two-dimensional `n0' by `n1' array, +distributed across the `n0' dimension, is transformd by: (i) +transforming the `n1' dimension, which are local to each process; (ii) +transposing to an `n1' by `n0' array, distributed across the `n1' +dimension; (iii) transforming the `n0' dimension, which is now local to +each process; (iv) transposing back. + + However, in many applications it is acceptable to compute a +multidimensional DFT whose results are produced in transposed order +(e.g., `n1' by `n0' in two dimensions). This provides a significant +performance advantage, because it means that the final transposition +step can be omitted. FFTW supports this optimization, which you +specify by passing the flag `FFTW_MPI_TRANSPOSED_OUT' to the planner +routines. To compute the inverse transform of transposed output, you +specify `FFTW_MPI_TRANSPOSED_IN' to tell it that the input is +transposed. In this section, we explain how to interpret the output +format of such a transform. + + Suppose you have are transforming multi-dimensional data with (at +least two) dimensions n[0] x n[1] x n[2] x ... x n[d-1] . As always, +it is distributed along the first dimension n[0] . Now, if we compute +its DFT with the `FFTW_MPI_TRANSPOSED_OUT' flag, the resulting output +data are stored with the first _two_ dimensions transposed: n[1] x n[0] +x n[2] x ... x n[d-1] , distributed along the n[1] dimension. +Conversely, if we take the n[1] x n[0] x n[2] x ... x n[d-1] data and +transform it with the `FFTW_MPI_TRANSPOSED_IN' flag, then the format +goes back to the original n[0] x n[1] x n[2] x ... x n[d-1] array. + + There are two ways to find the portion of the transposed array that +resides on the current process. First, you can simply call the +appropriate `local_size' function, passing n[1] x n[0] x n[2] x ... x +n[d-1] (the transposed dimensions). This would mean calling the +`local_size' function twice, once for the transposed and once for the +non-transposed dimensions. Alternatively, you can call one of the +`local_size_transposed' functions, which returns both the +non-transposed and transposed data distribution from a single call. +For example, for a 3d transform with transposed output (or input), you +might call: + + ptrdiff_t fftw_mpi_local_size_3d_transposed( + ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm, + ptrdiff_t *local_n0, ptrdiff_t *local_0_start, + ptrdiff_t *local_n1, ptrdiff_t *local_1_start); + + Here, `local_n0' and `local_0_start' give the size and starting +index of the `n0' dimension for the _non_-transposed data, as in the +previous sections. For _transposed_ data (e.g. the output for +`FFTW_MPI_TRANSPOSED_OUT'), `local_n1' and `local_1_start' give the +size and starting index of the `n1' dimension, which is the first +dimension of the transposed data (`n1' by `n0' by `n2'). + + (Note that `FFTW_MPI_TRANSPOSED_IN' is completely equivalent to +performing `FFTW_MPI_TRANSPOSED_OUT' and passing the first two +dimensions to the planner in reverse order, or vice versa. If you pass +_both_ the `FFTW_MPI_TRANSPOSED_IN' and `FFTW_MPI_TRANSPOSED_OUT' +flags, it is equivalent to swapping the first two dimensions passed to +the planner and passing _neither_ flag.) + + +File: fftw3.info, Node: One-dimensional distributions, Prev: Transposed distributions, Up: MPI Data Distribution + +6.4.4 One-dimensional distributions +----------------------------------- + +For one-dimensional distributed DFTs using FFTW, matters are slightly +more complicated because the data distribution is more closely tied to +how the algorithm works. In particular, you can no longer pass an +arbitrary block size and must accept FFTW's default; also, the block +sizes may be different for input and output. Also, the data +distribution depends on the flags and transform direction, in order for +forward and backward transforms to work correctly. + + ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm, + int sign, unsigned flags, + ptrdiff_t *local_ni, ptrdiff_t *local_i_start, + ptrdiff_t *local_no, ptrdiff_t *local_o_start); + + This function computes the data distribution for a 1d transform of +size `n0' with the given transform `sign' and `flags'. Both input and +output data use block distributions. The input on the current process +will consist of `local_ni' numbers starting at index `local_i_start'; +e.g. if only a single process is used, then `local_ni' will be `n0' and +`local_i_start' will be `0'. Similarly for the output, with `local_no' +numbers starting at index `local_o_start'. The return value of +`fftw_mpi_local_size_1d' will be the total number of elements to +allocate on the current process (which might be slightly larger than +the local size due to intermediate steps in the algorithm). + + As mentioned above (*note Load balancing::), the data will be divided +equally among the processes if `n0' is divisible by the _square_ of the +number of processes. In this case, `local_ni' will equal `local_no'. +Otherwise, they may be different. + + For some applications, such as convolutions, the order of the output +data is irrelevant. In this case, performance can be improved by +specifying that the output data be stored in an FFTW-defined +"scrambled" format. (In particular, this is the analogue of transposed +output in the multidimensional case: scrambled output saves a +communications step.) If you pass `FFTW_MPI_SCRAMBLED_OUT' in the +flags, then the output is stored in this (undocumented) scrambled +order. Conversely, to perform the inverse transform of data in +scrambled order, pass the `FFTW_MPI_SCRAMBLED_IN' flag. + + In MPI FFTW, only composite sizes `n0' can be parallelized; we have +not yet implemented a parallel algorithm for large prime sizes. + + +File: fftw3.info, Node: Multi-dimensional MPI DFTs of Real Data, Next: Other Multi-dimensional Real-data MPI Transforms, Prev: MPI Data Distribution, Up: Distributed-memory FFTW with MPI + +6.5 Multi-dimensional MPI DFTs of Real Data +=========================================== + +FFTW's MPI interface also supports multi-dimensional DFTs of real data, +similar to the serial r2c and c2r interfaces. (Parallel +one-dimensional real-data DFTs are not currently supported; you must +use a complex transform and set the imaginary parts of the inputs to +zero.) + + The key points to understand for r2c and c2r MPI transforms (compared +to the MPI complex DFTs or the serial r2c/c2r transforms), are: + + * Just as for serial transforms, r2c/c2r DFTs transform n[0] x n[1] + x n[2] x ... x n[d-1] real data to/from n[0] x n[1] x n[2] x ... + x (n[d-1]/2 + 1) complex data: the last dimension of the complex + data is cut in half (rounded down), plus one. As for the serial + transforms, the sizes you pass to the `plan_dft_r2c' and + `plan_dft_c2r' are the n[0] x n[1] x n[2] x ... x n[d-1] + dimensions of the real data. + + * Although the real data is _conceptually_ n[0] x n[1] x n[2] x ... + x n[d-1] , it is _physically_ stored as an n[0] x n[1] x n[2] x + ... x [2 (n[d-1]/2 + 1)] array, where the last dimension has been + _padded_ to make it the same size as the complex output. This is + much like the in-place serial r2c/c2r interface (*note + Multi-Dimensional DFTs of Real Data::), except that in MPI the + padding is required even for out-of-place data. The extra padding + numbers are ignored by FFTW (they are _not_ like zero-padding the + transform to a larger size); they are only used to determine the + data layout. + + * The data distribution in MPI for _both_ the real and complex data + is determined by the shape of the _complex_ data. That is, you + call the appropriate `local size' function for the n[0] x n[1] x + n[2] x ... x (n[d-1]/2 + 1) + + complex data, and then use the _same_ distribution for the real + data except that the last complex dimension is replaced by a + (padded) real dimension of twice the length. + + + For example suppose we are performing an out-of-place r2c transform +of L x M x N real data [padded to L x M x 2(N/2+1) ], resulting in L x +M x N/2+1 complex data. Similar to the example in *note 2d MPI +example::, we might do something like: + + #include <fftw3-mpi.h> + + int main(int argc, char **argv) + { + const ptrdiff_t L = ..., M = ..., N = ...; + fftw_plan plan; + double *rin; + fftw_complex *cout; + ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k; + + MPI_Init(&argc, &argv); + fftw_mpi_init(); + + /* get local data size and allocate */ + alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD, + &local_n0, &local_0_start); + rin = fftw_alloc_real(2 * alloc_local); + cout = fftw_alloc_complex(alloc_local); + + /* create plan for out-of-place r2c DFT */ + plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD, + FFTW_MEASURE); + + /* initialize rin to some function my_func(x,y,z) */ + for (i = 0; i < local_n0; ++i) + for (j = 0; j < M; ++j) + for (k = 0; k < N; ++k) + rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k); + + /* compute transforms as many times as desired */ + fftw_execute(plan); + + fftw_destroy_plan(plan); + + MPI_Finalize(); + } + + Note that we allocated `rin' using `fftw_alloc_real' with an +argument of `2 * alloc_local': since `alloc_local' is the number of +_complex_ values to allocate, the number of _real_ values is twice as +many. The `rin' array is then local_n0 x M x 2(N/2+1) in row-major +order, so its `(i,j,k)' element is at the index `(i*M + j) * +(2*(N/2+1)) + k' (*note Multi-dimensional Array Format::). + + As for the complex transforms, improved performance can be obtained +by specifying that the output is the transpose of the input or vice +versa (*note Transposed distributions::). In our L x M x N r2c +example, including `FFTW_TRANSPOSED_OUT' in the flags means that the +input would be a padded L x M x 2(N/2+1) real array distributed over +the `L' dimension, while the output would be a M x L x N/2+1 complex +array distributed over the `M' dimension. To perform the inverse c2r +transform with the same data distributions, you would use the +`FFTW_TRANSPOSED_IN' flag. + + +File: fftw3.info, Node: Other Multi-dimensional Real-data MPI Transforms, Next: FFTW MPI Transposes, Prev: Multi-dimensional MPI DFTs of Real Data, Up: Distributed-memory FFTW with MPI + +6.6 Other multi-dimensional Real-Data MPI Transforms +==================================================== + +FFTW's MPI interface also supports multi-dimensional `r2r' transforms +of all kinds supported by the serial interface (e.g. discrete cosine +and sine transforms, discrete Hartley transforms, etc.). Only +multi-dimensional `r2r' transforms, not one-dimensional transforms, are +currently parallelized. + + These are used much like the multidimensional complex DFTs discussed +above, except that the data is real rather than complex, and one needs +to pass an r2r transform kind (`fftw_r2r_kind') for each dimension as +in the serial FFTW (*note More DFTs of Real Data::). + + For example, one might perform a two-dimensional L x M that is an +REDFT10 (DCT-II) in the first dimension and an RODFT10 (DST-II) in the +second dimension with code like: + + const ptrdiff_t L = ..., M = ...; + fftw_plan plan; + double *data; + ptrdiff_t alloc_local, local_n0, local_0_start, i, j; + + /* get local data size and allocate */ + alloc_local = fftw_mpi_local_size_2d(L, M, MPI_COMM_WORLD, + &local_n0, &local_0_start); + data = fftw_alloc_real(alloc_local); + + /* create plan for in-place REDFT10 x RODFT10 */ + plan = fftw_mpi_plan_r2r_2d(L, M, data, data, MPI_COMM_WORLD, + FFTW_REDFT10, FFTW_RODFT10, FFTW_MEASURE); + + /* initialize data to some function my_function(x,y) */ + for (i = 0; i < local_n0; ++i) for (j = 0; j < M; ++j) + data[i*M + j] = my_function(local_0_start + i, j); + + /* compute transforms, in-place, as many times as desired */ + fftw_execute(plan); + + fftw_destroy_plan(plan); + + Notice that we use the same `local_size' functions as we did for +complex data, only now we interpret the sizes in terms of real rather +than complex values, and correspondingly use `fftw_alloc_real'. + + +File: fftw3.info, Node: FFTW MPI Transposes, Next: FFTW MPI Wisdom, Prev: Other Multi-dimensional Real-data MPI Transforms, Up: Distributed-memory FFTW with MPI + +6.7 FFTW MPI Transposes +======================= + +The FFTW's MPI Fourier transforms rely on one or more _global +transposition_ step for their communications. For example, the +multidimensional transforms work by transforming along some dimensions, +then transposing to make the first dimension local and transforming +that, then transposing back. Because global transposition of a +block-distributed matrix has many other potential uses besides FFTs, +FFTW's transpose routines can be called directly, as documented in this +section. + +* Menu: + +* Basic distributed-transpose interface:: +* Advanced distributed-transpose interface:: +* An improved replacement for MPI_Alltoall:: + + +File: fftw3.info, Node: Basic distributed-transpose interface, Next: Advanced distributed-transpose interface, Prev: FFTW MPI Transposes, Up: FFTW MPI Transposes + +6.7.1 Basic distributed-transpose interface +------------------------------------------- + +In particular, suppose that we have an `n0' by `n1' array in row-major +order, block-distributed across the `n0' dimension. To transpose this +into an `n1' by `n0' array block-distributed across the `n1' dimension, +we would create a plan by calling the following function: + + fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1, + double *in, double *out, + MPI_Comm comm, unsigned flags); + + The input and output arrays (`in' and `out') can be the same. The +transpose is actually executed by calling `fftw_execute' on the plan, +as usual. + + The `flags' are the usual FFTW planner flags, but support two +additional flags: `FFTW_MPI_TRANSPOSED_OUT' and/or +`FFTW_MPI_TRANSPOSED_IN'. What these flags indicate, for transpose +plans, is that the output and/or input, respectively, are _locally_ +transposed. That is, on each process input data is normally stored as +a `local_n0' by `n1' array in row-major order, but for an +`FFTW_MPI_TRANSPOSED_IN' plan the input data is stored as `n1' by +`local_n0' in row-major order. Similarly, `FFTW_MPI_TRANSPOSED_OUT' +means that the output is `n0' by `local_n1' instead of `local_n1' by +`n0'. + + To determine the local size of the array on each process before and +after the transpose, as well as the amount of storage that must be +allocated, one should call `fftw_mpi_local_size_2d_transposed', just as +for a 2d DFT as described in the previous section: + + ptrdiff_t fftw_mpi_local_size_2d_transposed + (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, + ptrdiff_t *local_n0, ptrdiff_t *local_0_start, + ptrdiff_t *local_n1, ptrdiff_t *local_1_start); + + Again, the return value is the local storage to allocate, which in +this case is the number of _real_ (`double') values rather than complex +numbers as in the previous examples. + + +File: fftw3.info, Node: Advanced distributed-transpose interface, Next: An improved replacement for MPI_Alltoall, Prev: Basic distributed-transpose interface, Up: FFTW MPI Transposes + +6.7.2 Advanced distributed-transpose interface +---------------------------------------------- + +The above routines are for a transpose of a matrix of numbers (of type +`double'), using FFTW's default block sizes. More generally, one can +perform transposes of _tuples_ of numbers, with user-specified block +sizes for the input and output: + + fftw_plan fftw_mpi_plan_many_transpose + (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany, + ptrdiff_t block0, ptrdiff_t block1, + double *in, double *out, MPI_Comm comm, unsigned flags); + + In this case, one is transposing an `n0' by `n1' matrix of +`howmany'-tuples (e.g. `howmany = 2' for complex numbers). The input +is distributed along the `n0' dimension with block size `block0', and +the `n1' by `n0' output is distributed along the `n1' dimension with +block size `block1'. If `FFTW_MPI_DEFAULT_BLOCK' (0) is passed for a +block size then FFTW uses its default block size. To get the local +size of the data on each process, you should then call +`fftw_mpi_local_size_many_transposed'. + + +File: fftw3.info, Node: An improved replacement for MPI_Alltoall, Prev: Advanced distributed-transpose interface, Up: FFTW MPI Transposes + +6.7.3 An improved replacement for MPI_Alltoall +---------------------------------------------- + +We close this section by noting that FFTW's MPI transpose routines can +be thought of as a generalization for the `MPI_Alltoall' function +(albeit only for floating-point types), and in some circumstances can +function as an improved replacement. + + `MPI_Alltoall' is defined by the MPI standard as: + + int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype, + void *recvbuf, int recvcnt, MPI_Datatype recvtype, + MPI_Comm comm); + + In particular, for `double*' arrays `in' and `out', consider the +call: + + MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany MPI_DOUBLE, comm); + + This is completely equivalent to: + + MPI_Comm_size(comm, &P); + plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE); + fftw_execute(plan); + fftw_destroy_plan(plan); + + That is, computing a P x P transpose on `P' processes, with a block +size of 1, is just a standard all-to-all communication. + + However, using the FFTW routine instead of `MPI_Alltoall' may have +certain advantages. First of all, FFTW's routine can operate in-place +(`in == out') whereas `MPI_Alltoall' can only operate out-of-place. + + Second, even for out-of-place plans, FFTW's routine may be faster, +especially if you need to perform the all-to-all communication many +times and can afford to use `FFTW_MEASURE' or `FFTW_PATIENT'. It +should certainly be no slower, not including the time to create the +plan, since one of the possible algorithms that FFTW uses for an +out-of-place transpose _is_ simply to call `MPI_Alltoall'. However, +FFTW also considers several other possible algorithms that, depending +on your MPI implementation and your hardware, may be faster. + + +File: fftw3.info, Node: FFTW MPI Wisdom, Next: Avoiding MPI Deadlocks, Prev: FFTW MPI Transposes, Up: Distributed-memory FFTW with MPI + +6.8 FFTW MPI Wisdom +=================== + +FFTW's "wisdom" facility (*note Words of Wisdom-Saving Plans::) can be +used to save MPI plans as well as to save uniprocessor plans. However, +for MPI there are several unavoidable complications. + + First, the MPI standard does not guarantee that every process can +perform file I/O (at least, not using C stdio routines)--in general, we +may only assume that process 0 is capable of I/O.(1) So, if we want to +export the wisdom from a single process to a file, we must first export +the wisdom to a string, then send it to process 0, then write it to a +file. + + Second, in principle we may want to have separate wisdom for every +process, since in general the processes may run on different hardware +even for a single MPI program. However, in practice FFTW's MPI code is +designed for the case of homogeneous hardware (*note Load balancing::), +and in this case it is convenient to use the same wisdom for every +process. Thus, we need a mechanism to synchronize the wisdom. + + To address both of these problems, FFTW provides the following two +functions: + + void fftw_mpi_broadcast_wisdom(MPI_Comm comm); + void fftw_mpi_gather_wisdom(MPI_Comm comm); + + Given a communicator `comm', `fftw_mpi_broadcast_wisdom' will +broadcast the wisdom from process 0 to all other processes. +Conversely, `fftw_mpi_gather_wisdom' will collect wisdom from all +processes onto process 0. (If the plans created for the same problem +by different processes are not the same, `fftw_mpi_gather_wisdom' will +arbitrarily choose one of the plans.) Both of these functions may +result in suboptimal plans for different processes if the processes are +running on non-identical hardware. Both of these functions are +_collective_ calls, which means that they must be executed by all +processes in the communicator. + + So, for example, a typical code snippet to import wisdom from a file +and use it on all processes would be: + + { + int rank; + + fftw_mpi_init(); + MPI_Comm_rank(MPI_COMM_WORLD, &rank); + if (rank == 0) fftw_import_wisdom_from_filename("mywisdom"); + fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD); + } + + (Note that we must call `fftw_mpi_init' before importing any wisdom +that might contain MPI plans.) Similarly, a typical code snippet to +export wisdom from all processes to a file is: + + { + int rank; + + fftw_mpi_gather_wisdom(MPI_COMM_WORLD); + MPI_Comm_rank(MPI_COMM_WORLD, &rank); + if (rank == 0) fftw_export_wisdom_to_filename("mywisdom"); + } + + ---------- Footnotes ---------- + + (1) In fact, even this assumption is not technically guaranteed by +the standard, although it seems to be universal in actual MPI +implementations and is widely assumed by MPI-using software. +Technically, you need to query the `MPI_IO' attribute of +`MPI_COMM_WORLD' with `MPI_Attr_get'. If this attribute is +`MPI_PROC_NULL', no I/O is possible. If it is `MPI_ANY_SOURCE', any +process can perform I/O. Otherwise, it is the rank of a process that +can perform I/O ... but since it is not guaranteed to yield the _same_ +rank on all processes, you have to do an `MPI_Allreduce' of some kind +if you want all processes to agree about which is going to do I/O. And +even then, the standard only guarantees that this process can perform +output, but not input. See e.g. `Parallel Programming with MPI' by P. +S. Pacheco, section 8.1.3. Needless to say, in our experience +virtually no MPI programmers worry about this. + + +File: fftw3.info, Node: Avoiding MPI Deadlocks, Next: FFTW MPI Performance Tips, Prev: FFTW MPI Wisdom, Up: Distributed-memory FFTW with MPI + +6.9 Avoiding MPI Deadlocks +========================== + +An MPI program can _deadlock_ if one process is waiting for a message +from another process that never gets sent. To avoid deadlocks when +using FFTW's MPI routines, it is important to know which functions are +_collective_: that is, which functions must _always_ be called in the +_same order_ from _every_ process in a given communicator. (For +example, `MPI_Barrier' is the canonical example of a collective +function in the MPI standard.) + + The functions in FFTW that are _always_ collective are: every +function beginning with `fftw_mpi_plan', as well as +`fftw_mpi_broadcast_wisdom' and `fftw_mpi_gather_wisdom'. Also, the +following functions from the ordinary FFTW interface are collective +when they are applied to a plan created by an `fftw_mpi_plan' function: +`fftw_execute', `fftw_destroy_plan', and `fftw_flops'. + + +File: fftw3.info, Node: FFTW MPI Performance Tips, Next: Combining MPI and Threads, Prev: Avoiding MPI Deadlocks, Up: Distributed-memory FFTW with MPI + +6.10 FFTW MPI Performance Tips +============================== + +In this section, we collect a few tips on getting the best performance +out of FFTW's MPI transforms. + + First, because of the 1d block distribution, FFTW's parallelization +is currently limited by the size of the first dimension. +(Multidimensional block distributions may be supported by a future +version.) More generally, you should ideally arrange the dimensions so +that FFTW can divide them equally among the processes. *Note Load +balancing::. + + Second, if it is not too inconvenient, you should consider working +with transposed output for multidimensional plans, as this saves a +considerable amount of communications. *Note Transposed +distributions::. + + Third, the fastest choices are generally either an in-place transform +or an out-of-place transform with the `FFTW_DESTROY_INPUT' flag (which +allows the input array to be used as scratch space). In-place is +especially beneficial if the amount of data per process is large. + + Fourth, if you have multiple arrays to transform at once, rather than +calling FFTW's MPI transforms several times it usually seems to be +faster to interleave the data and use the advanced interface. (This +groups the communications together instead of requiring separate +messages for each transform.) + + +File: fftw3.info, Node: Combining MPI and Threads, Next: FFTW MPI Reference, Prev: FFTW MPI Performance Tips, Up: Distributed-memory FFTW with MPI + +6.11 Combining MPI and Threads +============================== + +In certain cases, it may be advantageous to combine MPI +(distributed-memory) and threads (shared-memory) parallelization. FFTW +supports this, with certain caveats. For example, if you have a +cluster of 4-processor shared-memory nodes, you may want to use threads +within the nodes and MPI between the nodes, instead of MPI for all +parallelization. + + In particular, it is possible to seamlessly combine the MPI FFTW +routines with the multi-threaded FFTW routines (*note Multi-threaded +FFTW::). However, some care must be taken in the initialization code, +which should look something like this: + + int threads_ok; + + int main(int argc, char **argv) + { + int provided; + MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided); + threads_ok = provided >= MPI_THREAD_FUNNELED; + + if (threads_ok) threads_ok = fftw_init_threads(); + fftw_mpi_init(); + + ... + if (threads_ok) fftw_plan_with_nthreads(...); + ... + + MPI_Finalize(); + } + + First, note that instead of calling `MPI_Init', you should call +`MPI_Init_threads', which is the initialization routine defined by the +MPI-2 standard to indicate to MPI that your program will be +multithreaded. We pass `MPI_THREAD_FUNNELED', which indicates that we +will only call MPI routines from the main thread. (FFTW will launch +additional threads internally, but the extra threads will not call MPI +code.) (You may also pass `MPI_THREAD_SERIALIZED' or +`MPI_THREAD_MULTIPLE', which requests additional multithreading support +from the MPI implementation, but this is not required by FFTW.) The +`provided' parameter returns what level of threads support is actually +supported by your MPI implementation; this _must_ be at least +`MPI_THREAD_FUNNELED' if you want to call the FFTW threads routines, so +we define a global variable `threads_ok' to record this. You should +only call `fftw_init_threads' or `fftw_plan_with_nthreads' if +`threads_ok' is true. For more information on thread safety in MPI, +see the MPI and Threads +(http://www.mpi-forum.org/docs/mpi-20-html/node162.htm) section of the +MPI-2 standard. + + Second, we must call `fftw_init_threads' _before_ `fftw_mpi_init'. +This is critical for technical reasons having to do with how FFTW +initializes its list of algorithms. + + Then, if you call `fftw_plan_with_nthreads(N)', _every_ MPI process +will launch (up to) `N' threads to parallelize its transforms. + + For example, in the hypothetical cluster of 4-processor nodes, you +might wish to launch only a single MPI process per node, and then call +`fftw_plan_with_nthreads(4)' on each process to use all processors in +the nodes. + + This may or may not be faster than simply using as many MPI processes +as you have processors, however. On the one hand, using threads within +a node eliminates the need for explicit message passing within the +node. On the other hand, FFTW's transpose routines are not +multi-threaded, and this means that the communications that do take +place will not benefit from parallelization within the node. Moreover, +many MPI implementations already have optimizations to exploit shared +memory when it is available, so adding the multithreaded FFTW on top of +this may be superfluous. + + +File: fftw3.info, Node: FFTW MPI Reference, Next: FFTW MPI Fortran Interface, Prev: Combining MPI and Threads, Up: Distributed-memory FFTW with MPI + +6.12 FFTW MPI Reference +======================= + +This chapter provides a complete reference to all FFTW MPI functions, +datatypes, and constants. See also *note FFTW Reference:: for +information on functions and types in common with the serial interface. + +* Menu: + +* MPI Files and Data Types:: +* MPI Initialization:: +* Using MPI Plans:: +* MPI Data Distribution Functions:: +* MPI Plan Creation:: +* MPI Wisdom Communication:: + + +File: fftw3.info, Node: MPI Files and Data Types, Next: MPI Initialization, Prev: FFTW MPI Reference, Up: FFTW MPI Reference + +6.12.1 MPI Files and Data Types +------------------------------- + +All programs using FFTW's MPI support should include its header file: + + #include <fftw3-mpi.h> + + Note that this header file includes the serial-FFTW `fftw3.h' header +file, and also the `mpi.h' header file for MPI, so you need not include +those files separately. + + You must also link to _both_ the FFTW MPI library and to the serial +FFTW library. On Unix, this means adding `-lfftw3_mpi -lfftw3 -lm' at +the end of the link command. + + Different precisions are handled as in the serial interface: *Note +Precision::. That is, `fftw_' functions become `fftwf_' (in single +precision) etcetera, and the libraries become `-lfftw3f_mpi -lfftw3f +-lm' etcetera on Unix. Long-double precision is supported in MPI, but +quad precision (`fftwq_') is not due to the lack of MPI support for +this type. + + +File: fftw3.info, Node: MPI Initialization, Next: Using MPI Plans, Prev: MPI Files and Data Types, Up: FFTW MPI Reference + +6.12.2 MPI Initialization +------------------------- + +Before calling any other FFTW MPI (`fftw_mpi_') function, and before +importing any wisdom for MPI problems, you must call: + + void fftw_mpi_init(void); + + If FFTW threads support is used, however, `fftw_mpi_init' should be +called _after_ `fftw_init_threads' (*note Combining MPI and Threads::). +Calling `fftw_mpi_init' additional times (before `fftw_mpi_cleanup') +has no effect. + + If you want to deallocate all persistent data and reset FFTW to the +pristine state it was in when you started your program, you can call: + + void fftw_mpi_cleanup(void); + + (This calls `fftw_cleanup', so you need not call the serial cleanup +routine too, although it is safe to do so.) After calling +`fftw_mpi_cleanup', all existing plans become undefined, and you should +not attempt to execute or destroy them. You must call `fftw_mpi_init' +again after `fftw_mpi_cleanup' if you want to resume using the MPI FFTW +routines. + + +File: fftw3.info, Node: Using MPI Plans, Next: MPI Data Distribution Functions, Prev: MPI Initialization, Up: FFTW MPI Reference + +6.12.3 Using MPI Plans +---------------------- + +Once an MPI plan is created, you can execute and destroy it using +`fftw_execute', `fftw_destroy_plan', and the other functions in the +serial interface that operate on generic plans (*note Using Plans::). + + The `fftw_execute' and `fftw_destroy_plan' functions, applied to MPI +plans, are _collective_ calls: they must be called for all processes in +the communicator that was used to create the plan. + + You must _not_ use the serial new-array plan-execution functions +`fftw_execute_dft' and so on (*note New-array Execute Functions::) with +MPI plans. Such functions are specialized to the problem type, and +there are specific new-array execute functions for MPI plans: + + void fftw_mpi_execute_dft(fftw_plan p, fftw_complex *in, fftw_complex *out); + void fftw_mpi_execute_dft_r2c(fftw_plan p, double *in, fftw_complex *out); + void fftw_mpi_execute_dft_c2r(fftw_plan p, fftw_complex *in, double *out); + void fftw_mpi_execute_r2r(fftw_plan p, double *in, double *out); + + These functions have the same restrictions as those of the serial +new-array execute functions. They are _always_ safe to apply to the +_same_ `in' and `out' arrays that were used to create the plan. They +can only be applied to new arrarys if those arrays have the same types, +dimensions, in-placeness, and alignment as the original arrays, where +the best way to ensure the same alignment is to use FFTW's +`fftw_malloc' and related allocation functions for all arrays (*note +Memory Allocation::). Note that distributed transposes (*note FFTW MPI +Transposes::) use `fftw_mpi_execute_r2r', since they count as rank-zero +r2r plans from FFTW's perspective. + + +File: fftw3.info, Node: MPI Data Distribution Functions, Next: MPI Plan Creation, Prev: Using MPI Plans, Up: FFTW MPI Reference + +6.12.4 MPI Data Distribution Functions +-------------------------------------- + +As described above (*note MPI Data Distribution::), in order to +allocate your arrays, _before_ creating a plan, you must first call one +of the following routines to determine the required allocation size and +the portion of the array locally stored on a given process. The +`MPI_Comm' communicator passed here must be equivalent to the +communicator used below for plan creation. + + The basic interface for multidimensional transforms consists of the +functions: + + ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, + ptrdiff_t *local_n0, ptrdiff_t *local_0_start); + ptrdiff_t fftw_mpi_local_size_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, + MPI_Comm comm, + ptrdiff_t *local_n0, ptrdiff_t *local_0_start); + ptrdiff_t fftw_mpi_local_size(int rnk, const ptrdiff_t *n, MPI_Comm comm, + ptrdiff_t *local_n0, ptrdiff_t *local_0_start); + + ptrdiff_t fftw_mpi_local_size_2d_transposed(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, + ptrdiff_t *local_n0, ptrdiff_t *local_0_start, + ptrdiff_t *local_n1, ptrdiff_t *local_1_start); + ptrdiff_t fftw_mpi_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, + MPI_Comm comm, + ptrdiff_t *local_n0, ptrdiff_t *local_0_start, + ptrdiff_t *local_n1, ptrdiff_t *local_1_start); + ptrdiff_t fftw_mpi_local_size_transposed(int rnk, const ptrdiff_t *n, MPI_Comm comm, + ptrdiff_t *local_n0, ptrdiff_t *local_0_start, + ptrdiff_t *local_n1, ptrdiff_t *local_1_start); + + These functions return the number of elements to allocate (complex +numbers for DFT/r2c/c2r plans, real numbers for r2r plans), whereas the +`local_n0' and `local_0_start' return the portion (`local_0_start' to +`local_0_start + local_n0 - 1') of the first dimension of an n[0] x +n[1] x n[2] x ... x n[d-1] array that is stored on the local process. +*Note Basic and advanced distribution interfaces::. For +`FFTW_MPI_TRANSPOSED_OUT' plans, the `_transposed' variants are useful +in order to also return the local portion of the first dimension in the +n[1] x n[0] x n[2] x ... x n[d-1] transposed output. *Note Transposed +distributions::. The advanced interface for multidimensional +transforms is: + + ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, ptrdiff_t howmany, + ptrdiff_t block0, MPI_Comm comm, + ptrdiff_t *local_n0, ptrdiff_t *local_0_start); + ptrdiff_t fftw_mpi_local_size_many_transposed(int rnk, const ptrdiff_t *n, ptrdiff_t howmany, + ptrdiff_t block0, ptrdiff_t block1, MPI_Comm comm, + ptrdiff_t *local_n0, ptrdiff_t *local_0_start, + ptrdiff_t *local_n1, ptrdiff_t *local_1_start); + + These differ from the basic interface in only two ways. First, they +allow you to specify block sizes `block0' and `block1' (the latter for +the transposed output); you can pass `FFTW_MPI_DEFAULT_BLOCK' to use +FFTW's default block size as in the basic interface. Second, you can +pass a `howmany' parameter, corresponding to the advanced planning +interface below: this is for transforms of contiguous `howmany'-tuples +of numbers (`howmany = 1' in the basic interface). + + The corresponding basic and advanced routines for one-dimensional +transforms (currently only complex DFTs) are: + + ptrdiff_t fftw_mpi_local_size_1d( + ptrdiff_t n0, MPI_Comm comm, int sign, unsigned flags, + ptrdiff_t *local_ni, ptrdiff_t *local_i_start, + ptrdiff_t *local_no, ptrdiff_t *local_o_start); + ptrdiff_t fftw_mpi_local_size_many_1d( + ptrdiff_t n0, ptrdiff_t howmany, + MPI_Comm comm, int sign, unsigned flags, + ptrdiff_t *local_ni, ptrdiff_t *local_i_start, + ptrdiff_t *local_no, ptrdiff_t *local_o_start); + + As above, the return value is the number of elements to allocate +(complex numbers, for complex DFTs). The `local_ni' and +`local_i_start' arguments return the portion (`local_i_start' to +`local_i_start + local_ni - 1') of the 1d array that is stored on this +process for the transform _input_, and `local_no' and `local_o_start' +are the corresponding quantities for the input. The `sign' +(`FFTW_FORWARD' or `FFTW_BACKWARD') and `flags' must match the +arguments passed when creating a plan. Although the inputs and outputs +have different data distributions in general, it is guaranteed that the +_output_ data distribution of an `FFTW_FORWARD' plan will match the +_input_ data distribution of an `FFTW_BACKWARD' plan and vice versa; +similarly for the `FFTW_MPI_SCRAMBLED_OUT' and `FFTW_MPI_SCRAMBLED_IN' +flags. *Note One-dimensional distributions::. + + +File: fftw3.info, Node: MPI Plan Creation, Next: MPI Wisdom Communication, Prev: MPI Data Distribution Functions, Up: FFTW MPI Reference + +6.12.5 MPI Plan Creation +------------------------ + +Complex-data MPI DFTs +..................... + +Plans for complex-data DFTs (*note 2d MPI example::) are created by: + + fftw_plan fftw_mpi_plan_dft_1d(ptrdiff_t n0, fftw_complex *in, fftw_complex *out, + MPI_Comm comm, int sign, unsigned flags); + fftw_plan fftw_mpi_plan_dft_2d(ptrdiff_t n0, ptrdiff_t n1, + fftw_complex *in, fftw_complex *out, + MPI_Comm comm, int sign, unsigned flags); + fftw_plan fftw_mpi_plan_dft_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, + fftw_complex *in, fftw_complex *out, + MPI_Comm comm, int sign, unsigned flags); + fftw_plan fftw_mpi_plan_dft(int rnk, const ptrdiff_t *n, + fftw_complex *in, fftw_complex *out, + MPI_Comm comm, int sign, unsigned flags); + fftw_plan fftw_mpi_plan_many_dft(int rnk, const ptrdiff_t *n, + ptrdiff_t howmany, ptrdiff_t block, ptrdiff_t tblock, + fftw_complex *in, fftw_complex *out, + MPI_Comm comm, int sign, unsigned flags); + + These are similar to their serial counterparts (*note Complex DFTs::) +in specifying the dimensions, sign, and flags of the transform. The +`comm' argument gives an MPI communicator that specifies the set of +processes to participate in the transform; plan creation is a +collective function that must be called for all processes in the +communicator. The `in' and `out' pointers refer only to a portion of +the overall transform data (*note MPI Data Distribution::) as specified +by the `local_size' functions in the previous section. Unless `flags' +contains `FFTW_ESTIMATE', these arrays are overwritten during plan +creation as for the serial interface. For multi-dimensional +transforms, any dimensions `> 1' are supported; for one-dimensional +transforms, only composite (non-prime) `n0' are currently supported +(unlike the serial FFTW). Requesting an unsupported transform size +will yield a `NULL' plan. (As in the serial interface, highly +composite sizes generally yield the best performance.) + + The advanced-interface `fftw_mpi_plan_many_dft' additionally allows +you to specify the block sizes for the first dimension (`block') of the +n[0] x n[1] x n[2] x ... x n[d-1] input data and the first dimension +(`tblock') of the n[1] x n[0] x n[2] x ... x n[d-1] transposed data +(at intermediate steps of the transform, and for the output if +`FFTW_TRANSPOSED_OUT' is specified in `flags'). These must be the same +block sizes as were passed to the corresponding `local_size' function; +you can pass `FFTW_MPI_DEFAULT_BLOCK' to use FFTW's default block size +as in the basic interface. Also, the `howmany' parameter specifies +that the transform is of contiguous `howmany'-tuples rather than +individual complex numbers; this corresponds to the same parameter in +the serial advanced interface (*note Advanced Complex DFTs::) with +`stride = howmany' and `dist = 1'. + +MPI flags +......... + +The `flags' can be any of those for the serial FFTW (*note Planner +Flags::), and in addition may include one or more of the following +MPI-specific flags, which improve performance at the cost of changing +the output or input data formats. + + * `FFTW_MPI_SCRAMBLED_OUT', `FFTW_MPI_SCRAMBLED_IN': valid for 1d + transforms only, these flags indicate that the output/input of the + transform are in an undocumented "scrambled" order. A forward + `FFTW_MPI_SCRAMBLED_OUT' transform can be inverted by a backward + `FFTW_MPI_SCRAMBLED_IN' (times the usual 1/N normalization). + *Note One-dimensional distributions::. + + * `FFTW_MPI_TRANSPOSED_OUT', `FFTW_MPI_TRANSPOSED_IN': valid for + multidimensional (`rnk > 1') transforms only, these flags specify + that the output or input of an n[0] x n[1] x n[2] x ... x n[d-1] + transform is transposed to n[1] x n[0] x n[2] x ... x n[d-1] . + *Note Transposed distributions::. + + +Real-data MPI DFTs +.................. + +Plans for real-input/output (r2c/c2r) DFTs (*note Multi-dimensional MPI +DFTs of Real Data::) are created by: + + fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1, + double *in, fftw_complex *out, + MPI_Comm comm, unsigned flags); + fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1, + double *in, fftw_complex *out, + MPI_Comm comm, unsigned flags); + fftw_plan fftw_mpi_plan_dft_r2c_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, + double *in, fftw_complex *out, + MPI_Comm comm, unsigned flags); + fftw_plan fftw_mpi_plan_dft_r2c(int rnk, const ptrdiff_t *n, + double *in, fftw_complex *out, + MPI_Comm comm, unsigned flags); + fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1, + fftw_complex *in, double *out, + MPI_Comm comm, unsigned flags); + fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1, + fftw_complex *in, double *out, + MPI_Comm comm, unsigned flags); + fftw_plan fftw_mpi_plan_dft_c2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, + fftw_complex *in, double *out, + MPI_Comm comm, unsigned flags); + fftw_plan fftw_mpi_plan_dft_c2r(int rnk, const ptrdiff_t *n, + fftw_complex *in, double *out, + MPI_Comm comm, unsigned flags); + + Similar to the serial interface (*note Real-data DFTs::), these +transform logically n[0] x n[1] x n[2] x ... x n[d-1] real data +to/from n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1) complex data, +representing the non-redundant half of the conjugate-symmetry output of +a real-input DFT (*note Multi-dimensional Transforms::). However, the +real array must be stored within a padded n[0] x n[1] x n[2] x ... x [2 +(n[d-1]/2 + 1)] + + array (much like the in-place serial r2c transforms, but here for +out-of-place transforms as well). Currently, only multi-dimensional +(`rnk > 1') r2c/c2r transforms are supported (requesting a plan for +`rnk = 1' will yield `NULL'). As explained above (*note +Multi-dimensional MPI DFTs of Real Data::), the data distribution of +both the real and complex arrays is given by the `local_size' function +called for the dimensions of the _complex_ array. Similar to the other +planning functions, the input and output arrays are overwritten when +the plan is created except in `FFTW_ESTIMATE' mode. + + As for the complex DFTs above, there is an advance interface that +allows you to manually specify block sizes and to transform contiguous +`howmany'-tuples of real/complex numbers: + + fftw_plan fftw_mpi_plan_many_dft_r2c + (int rnk, const ptrdiff_t *n, ptrdiff_t howmany, + ptrdiff_t iblock, ptrdiff_t oblock, + double *in, fftw_complex *out, + MPI_Comm comm, unsigned flags); + fftw_plan fftw_mpi_plan_many_dft_c2r + (int rnk, const ptrdiff_t *n, ptrdiff_t howmany, + ptrdiff_t iblock, ptrdiff_t oblock, + fftw_complex *in, double *out, + MPI_Comm comm, unsigned flags); + +MPI r2r transforms +.................. + +There are corresponding plan-creation routines for r2r transforms +(*note More DFTs of Real Data::), currently supporting multidimensional +(`rnk > 1') transforms only (`rnk = 1' will yield a `NULL' plan): + + fftw_plan fftw_mpi_plan_r2r_2d(ptrdiff_t n0, ptrdiff_t n1, + double *in, double *out, + MPI_Comm comm, + fftw_r2r_kind kind0, fftw_r2r_kind kind1, + unsigned flags); + fftw_plan fftw_mpi_plan_r2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, + double *in, double *out, + MPI_Comm comm, + fftw_r2r_kind kind0, fftw_r2r_kind kind1, fftw_r2r_kind kind2, + unsigned flags); + fftw_plan fftw_mpi_plan_r2r(int rnk, const ptrdiff_t *n, + double *in, double *out, + MPI_Comm comm, const fftw_r2r_kind *kind, + unsigned flags); + fftw_plan fftw_mpi_plan_many_r2r(int rnk, const ptrdiff_t *n, + ptrdiff_t iblock, ptrdiff_t oblock, + double *in, double *out, + MPI_Comm comm, const fftw_r2r_kind *kind, + unsigned flags); + + The parameters are much the same as for the complex DFTs above, +except that the arrays are of real numbers (and hence the outputs of the +`local_size' data-distribution functions should be interpreted as +counts of real rather than complex numbers). Also, the `kind' +parameters specify the r2r kinds along each dimension as for the serial +interface (*note Real-to-Real Transform Kinds::). *Note Other +Multi-dimensional Real-data MPI Transforms::. + +MPI transposition +................. + +FFTW also provides routines to plan a transpose of a distributed `n0' +by `n1' array of real numbers, or an array of `howmany'-tuples of real +numbers with specified block sizes (*note FFTW MPI Transposes::): + + fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1, + double *in, double *out, + MPI_Comm comm, unsigned flags); + fftw_plan fftw_mpi_plan_many_transpose + (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany, + ptrdiff_t block0, ptrdiff_t block1, + double *in, double *out, MPI_Comm comm, unsigned flags); + + These plans are used with the `fftw_mpi_execute_r2r' new-array +execute function (*note Using MPI Plans::), since they count as (rank +zero) r2r plans from FFTW's perspective. + + +File: fftw3.info, Node: MPI Wisdom Communication, Prev: MPI Plan Creation, Up: FFTW MPI Reference + +6.12.6 MPI Wisdom Communication +------------------------------- + +To facilitate synchronizing wisdom among the different MPI processes, +we provide two functions: + + void fftw_mpi_gather_wisdom(MPI_Comm comm); + void fftw_mpi_broadcast_wisdom(MPI_Comm comm); + + The `fftw_mpi_gather_wisdom' function gathers all wisdom in the +given communicator `comm' to the process of rank 0 in the communicator: +that process obtains the union of all wisdom on all the processes. As +a side effect, some other processes will gain additional wisdom from +other processes, but only process 0 will gain the complete union. + + The `fftw_mpi_broadcast_wisdom' does the reverse: it exports wisdom +from process 0 in `comm' to all other processes in the communicator, +replacing any wisdom they currently have. + + *Note FFTW MPI Wisdom::. + + +File: fftw3.info, Node: FFTW MPI Fortran Interface, Prev: FFTW MPI Reference, Up: Distributed-memory FFTW with MPI + +6.13 FFTW MPI Fortran Interface +=============================== + +The FFTW MPI interface is callable from modern Fortran compilers +supporting the Fortran 2003 `iso_c_binding' standard for calling C +functions. As described in *note Calling FFTW from Modern Fortran::, +this means that you can directly call FFTW's C interface from Fortran +with only minor changes in syntax. There are, however, a few things +specific to the MPI interface to keep in mind: + + * Instead of including `fftw3.f03' as in *note Overview of Fortran + interface::, you should `include 'fftw3-mpi.f03'' (after `use, + intrinsic :: iso_c_binding' as before). The `fftw3-mpi.f03' file + includes `fftw3.f03', so you should _not_ `include' them both + yourself. (You will also want to include the MPI header file, + usually via `include 'mpif.h'' or similar, although though this is + not needed by `fftw3-mpi.f03' per se.) (To use the `fftwl_' `long + double' extended-precision routines in supporting compilers, you + should include `fftw3f-mpi.f03' in _addition_ to `fftw3-mpi.f03'. + *Note Extended and quadruple precision in Fortran::.) + + * Because of the different storage conventions between C and Fortran, + you reverse the order of your array dimensions when passing them to + FFTW (*note Reversing array dimensions::). This is merely a + difference in notation and incurs no performance overhead. + However, it means that, whereas in C the _first_ dimension is + distributed, in Fortran the _last_ dimension of your array is + distributed. + + * In Fortran, communicators are stored as `integer' types; there is + no `MPI_Comm' type, nor is there any way to access a C `MPI_Comm'. + Fortunately, this is taken care of for you by the FFTW Fortran + interface: whenever the C interface expects an `MPI_Comm' type, + you should pass the Fortran communicator as an `integer'.(1) + + * Because you need to call the `local_size' function to find out how + much space to allocate, and this may be _larger_ than the local + portion of the array (*note MPI Data Distribution::), you should + _always_ allocate your arrays dynamically using FFTW's allocation + routines as described in *note Allocating aligned memory in + Fortran::. (Coincidentally, this also provides the best + performance by guaranteeding proper data alignment.) + + * Because all sizes in the MPI FFTW interface are declared as + `ptrdiff_t' in C, you should use `integer(C_INTPTR_T)' in Fortran + (*note FFTW Fortran type reference::). + + * In Fortran, because of the language semantics, we generally + recommend using the new-array execute functions for all plans, + even in the common case where you are executing the plan on the + same arrays for which the plan was created (*note Plan execution + in Fortran::). However, note that in the MPI interface these + functions are changed: `fftw_execute_dft' becomes + `fftw_mpi_execute_dft', etcetera. *Note Using MPI Plans::. + + + For example, here is a Fortran code snippet to perform a distributed +L x M complex DFT in-place. (This assumes you have already +initialized MPI with `MPI_init' and have also performed `call +fftw_mpi_init'.) + + use, intrinsic :: iso_c_binding + include 'fftw3-mpi.f03' + integer(C_INTPTR_T), parameter :: L = ... + integer(C_INTPTR_T), parameter :: M = ... + type(C_PTR) :: plan, cdata + complex(C_DOUBLE_COMPLEX), pointer :: data(:,:) + integer(C_INTPTR_T) :: i, j, alloc_local, local_M, local_j_offset + + ! get local data size and allocate (note dimension reversal) + alloc_local = fftw_mpi_local_size_2d(M, L, MPI_COMM_WORLD, & + local_M, local_j_offset) + cdata = fftw_alloc_complex(alloc_local) + call c_f_pointer(cdata, data, [L,local_M]) + + ! create MPI plan for in-place forward DFT (note dimension reversal) + plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, & + FFTW_FORWARD, FFTW_MEASURE) + + ! initialize data to some function my_function(i,j) + do j = 1, local_M + do i = 1, L + data(i, j) = my_function(i, j + local_j_offset) + end do + end do + + ! compute transform (as many times as desired) + call fftw_mpi_execute_dft(plan, data, data) + + call fftw_destroy_plan(plan) + call fftw_free(cdata) + + Note that when we called `fftw_mpi_local_size_2d' and +`fftw_mpi_plan_dft_2d' with the dimensions in reversed order, since a L +x M Fortran array is viewed by FFTW in C as a M x L array. This +means that the array was distributed over the `M' dimension, the local +portion of which is a L x local_M array in Fortran. (You must _not_ +use an `allocate' statement to allocate an L x local_M array, however; +you must allocate `alloc_local' complex numbers, which may be greater +than `L * local_M', in order to reserve space for intermediate steps of +the transform.) Finally, we mention that because C's array indices are +zero-based, the `local_j_offset' argument can conveniently be +interpreted as an offset in the 1-based `j' index (rather than as a +starting index as in C). + + If instead you had used the `ior(FFTW_MEASURE, +FFTW_MPI_TRANSPOSED_OUT)' flag, the output of the transform would be a +transposed M x local_L array, associated with the _same_ `cdata' +allocation (since the transform is in-place), and which you could +declare with: + + complex(C_DOUBLE_COMPLEX), pointer :: tdata(:,:) + ... + call c_f_pointer(cdata, tdata, [M,local_L]) + + where `local_L' would have been obtained by changing the +`fftw_mpi_local_size_2d' call to: + + alloc_local = fftw_mpi_local_size_2d_transposed(M, L, MPI_COMM_WORLD, & + local_M, local_j_offset, local_L, local_i_offset) + + ---------- Footnotes ---------- + + (1) Technically, this is because you aren't actually calling the C +functions directly. You are calling wrapper functions that translate +the communicator with `MPI_Comm_f2c' before calling the ordinary C +interface. This is all done transparently, however, since the +`fftw3-mpi.f03' interface file renames the wrappers so that they are +called in Fortran with the same names as the C interface functions. + + +File: fftw3.info, Node: Calling FFTW from Modern Fortran, Next: Calling FFTW from Legacy Fortran, Prev: Distributed-memory FFTW with MPI, Up: Top + +7 Calling FFTW from Modern Fortran +********************************** + +Fortran 2003 standardized ways for Fortran code to call C libraries, +and this allows us to support a direct translation of the FFTW C API +into Fortran. Compared to the legacy Fortran 77 interface (*note +Calling FFTW from Legacy Fortran::), this direct interface offers many +advantages, especially compile-time type-checking and aligned memory +allocation. As of this writing, support for these C interoperability +features seems widespread, having been implemented in nearly all major +Fortran compilers (e.g. GNU, Intel, IBM, Oracle/Solaris, Portland +Group, NAG). + + This chapter documents that interface. For the most part, since this +interface allows Fortran to call the C interface directly, the usage is +identical to C translated to Fortran syntax. However, there are a few +subtle points such as memory allocation, wisdom, and data types that +deserve closer attention. + +* Menu: + +* Overview of Fortran interface:: +* Reversing array dimensions:: +* FFTW Fortran type reference:: +* Plan execution in Fortran:: +* Allocating aligned memory in Fortran:: +* Accessing the wisdom API from Fortran:: +* Defining an FFTW module:: + + +File: fftw3.info, Node: Overview of Fortran interface, Next: Reversing array dimensions, Prev: Calling FFTW from Modern Fortran, Up: Calling FFTW from Modern Fortran + +7.1 Overview of Fortran interface +================================= + +FFTW provides a file `fftw3.f03' that defines Fortran 2003 interfaces +for all of its C routines, except for the MPI routines described +elsewhere, which can be found in the same directory as `fftw3.h' (the C +header file). In any Fortran subroutine where you want to use FFTW +functions, you should begin with: + + use, intrinsic :: iso_c_binding + include 'fftw3.f03' + + This includes the interface definitions and the standard +`iso_c_binding' module (which defines the equivalents of C types). You +can also put the FFTW functions into a module if you prefer (*note +Defining an FFTW module::). + + At this point, you can now call anything in the FFTW C interface +directly, almost exactly as in C other than minor changes in syntax. +For example: + + type(C_PTR) :: plan + complex(C_DOUBLE_COMPLEX), dimension(1024,1000) :: in, out + plan = fftw_plan_dft_2d(1000,1024, in,out, FFTW_FORWARD,FFTW_ESTIMATE) + ... + call fftw_execute_dft(plan, in, out) + ... + call fftw_destroy_plan(plan) + + A few important things to keep in mind are: + + * FFTW plans are `type(C_PTR)'. Other C types are mapped in the + obvious way via the `iso_c_binding' standard: `int' turns into + `integer(C_INT)', `fftw_complex' turns into + `complex(C_DOUBLE_COMPLEX)', `double' turns into `real(C_DOUBLE)', + and so on. *Note FFTW Fortran type reference::. + + * Functions in C become functions in Fortran if they have a return + value, and subroutines in Fortran otherwise. + + * The ordering of the Fortran array dimensions must be _reversed_ + when they are passed to the FFTW plan creation, thanks to + differences in array indexing conventions (*note Multi-dimensional + Array Format::). This is _unlike_ the legacy Fortran interface + (*note Fortran-interface routines::), which reversed the dimensions + for you. *Note Reversing array dimensions::. + + * Using ordinary Fortran array declarations like this works, but may + yield suboptimal performance because the data may not be not + aligned to exploit SIMD instructions on modern proessors (*note + SIMD alignment and fftw_malloc::). Better performance will often + be obtained by allocating with `fftw_alloc'. *Note Allocating + aligned memory in Fortran::. + + * Similar to the legacy Fortran interface (*note FFTW Execution in + Fortran::), we currently recommend _not_ using `fftw_execute' but + rather using the more specialized functions like + `fftw_execute_dft' (*note New-array Execute Functions::). + However, you should execute the plan on the `same arrays' as the + ones for which you created the plan, unless you are especially + careful. *Note Plan execution in Fortran::. To prevent you from + using `fftw_execute' by mistake, the `fftw3.f03' file does not + provide an `fftw_execute' interface declaration. + + * Multiple planner flags are combined with `ior' (equivalent to `|' + in C). e.g. `FFTW_MEASURE | FFTW_DESTROY_INPUT' becomes + `ior(FFTW_MEASURE, FFTW_DESTROY_INPUT)'. (You can also use `+' as + long as you don't try to include a given flag more than once.) + + +* Menu: + +* Extended and quadruple precision in Fortran:: + + +File: fftw3.info, Node: Extended and quadruple precision in Fortran, Prev: Overview of Fortran interface, Up: Overview of Fortran interface + +7.1.1 Extended and quadruple precision in Fortran +------------------------------------------------- + +If FFTW is compiled in `long double' (extended) precision (*note +Installation and Customization::), you may be able to call the +resulting `fftwl_' routines (*note Precision::) from Fortran if your +compiler supports the `C_LONG_DOUBLE_COMPLEX' type code. + + Because some Fortran compilers do not support +`C_LONG_DOUBLE_COMPLEX', the `fftwl_' declarations are segregated into +a separate interface file `fftw3l.f03', which you should include _in +addition_ to `fftw3.f03' (which declares precision-independent `FFTW_' +constants): + + use, intrinsic :: iso_c_binding + include 'fftw3.f03' + include 'fftw3l.f03' + + We also support using the nonstandard `__float128' +quadruple-precision type provided by recent versions of `gcc' on 32- +and 64-bit x86 hardware (*note Installation and Customization::), using +the corresponding `real(16)' and `complex(16)' types supported by +`gfortran'. The quadruple-precision `fftwq_' functions (*note +Precision::) are declared in a `fftw3q.f03' interface file, which +should be included in addition to `fftw3l.f03', as above. You should +also link with `-lfftw3q -lquadmath -lm' as in C. + + +File: fftw3.info, Node: Reversing array dimensions, Next: FFTW Fortran type reference, Prev: Overview of Fortran interface, Up: Calling FFTW from Modern Fortran + +7.2 Reversing array dimensions +============================== + +A minor annoyance in calling FFTW from Fortran is that FFTW's array +dimensions are defined in the C convention (row-major order), while +Fortran's array dimensions are the opposite convention (column-major +order). *Note Multi-dimensional Array Format::. This is just a +bookkeeping difference, with no effect on performance. The only +consequence of this is that, whenever you create an FFTW plan for a +multi-dimensional transform, you must always _reverse the ordering of +the dimensions_. + + For example, consider the three-dimensional (L x M x N ) arrays: + + complex(C_DOUBLE_COMPLEX), dimension(L,M,N) :: in, out + + To plan a DFT for these arrays using `fftw_plan_dft_3d', you could +do: + + plan = fftw_plan_dft_3d(N,M,L, in,out, FFTW_FORWARD,FFTW_ESTIMATE) + + That is, from FFTW's perspective this is a N x M x L array. _No +data transposition need occur_, as this is _only notation_. Similarly, +to use the more generic routine `fftw_plan_dft' with the same arrays, +you could do: + + integer(C_INT), dimension(3) :: n = [N,M,L] + plan = fftw_plan_dft_3d(3, n, in,out, FFTW_FORWARD,FFTW_ESTIMATE) + + Note, by the way, that this is different from the legacy Fortran +interface (*note Fortran-interface routines::), which automatically +reverses the order of the array dimension for you. Here, you are +calling the C interface directly, so there is no "translation" layer. + + An important thing to keep in mind is the implication of this for +multidimensional real-to-complex transforms (*note Multi-Dimensional +DFTs of Real Data::). In C, a multidimensional real-to-complex DFT +chops the last dimension roughly in half (N x M x L real input goes to +N x M x L/2+1 complex output). In Fortran, because the array +dimension notation is reversed, the _first_ dimension of the complex +data is chopped roughly in half. For example consider the `r2c' +transform of L x M x N real input in Fortran: + + type(C_PTR) :: plan + real(C_DOUBLE), dimension(L,M,N) :: in + complex(C_DOUBLE_COMPLEX), dimension(L/2+1,M,N) :: out + plan = fftw_plan_dft_r2c_3d(N,M,L, in,out, FFTW_ESTIMATE) + ... + call fftw_execute_dft_r2c(plan, in, out) + + Alternatively, for an in-place r2c transform, as described in the C +documentation we must _pad_ the _first_ dimension of the real input +with an extra two entries (which are ignored by FFTW) so as to leave +enough space for the complex output. The input is _allocated_ as a +2[L/2+1] x M x N array, even though only L x M x N of it is actually +used. In this example, we will allocate the array as a pointer type, +using `fftw_alloc' to ensure aligned memory for maximum performance +(*note Allocating aligned memory in Fortran::); this also makes it easy +to reference the same memory as both a real array and a complex array. + + real(C_DOUBLE), pointer :: in(:,:,:) + complex(C_DOUBLE_COMPLEX), pointer :: out(:,:,:) + type(C_PTR) :: plan, data + data = fftw_alloc_complex(int((L/2+1) * M * N, C_SIZE_T)) + call c_f_pointer(data, in, [2*(L/2+1),M,N]) + call c_f_pointer(data, out, [L/2+1,M,N]) + plan = fftw_plan_dft_r2c_3d(N,M,L, in,out, FFTW_ESTIMATE) + ... + call fftw_execute_dft_r2c(plan, in, out) + ... + call fftw_destroy_plan(plan) + call fftw_free(data) + + +File: fftw3.info, Node: FFTW Fortran type reference, Next: Plan execution in Fortran, Prev: Reversing array dimensions, Up: Calling FFTW from Modern Fortran + +7.3 FFTW Fortran type reference +=============================== + +The following are the most important type correspondences between the C +interface and Fortran: + + * Plans (`fftw_plan' and variants) are `type(C_PTR)' (i.e. an opaque + pointer). + + * The C floating-point types `double', `float', and `long double' + correspond to `real(C_DOUBLE)', `real(C_FLOAT)', and + `real(C_LONG_DOUBLE)', respectively. The C complex types + `fftw_complex', `fftwf_complex', and `fftwl_complex' correspond in + Fortran to `complex(C_DOUBLE_COMPLEX)', + `complex(C_FLOAT_COMPLEX)', and `complex(C_LONG_DOUBLE_COMPLEX)', + respectively. Just as in C (*note Precision::), the FFTW + subroutines and types are prefixed with `fftw_', `fftwf_', and + `fftwl_' for the different precisions, and link to different + libraries (`-lfftw3', `-lfftw3f', and `-lfftw3l' on Unix), but use + the _same_ include file `fftw3.f03' and the _same_ constants (all + of which begin with `FFTW_'). The exception is `long double' + precision, for which you should _also_ include `fftw3l.f03' (*note + Extended and quadruple precision in Fortran::). + + * The C integer types `int' and `unsigned' (used for planner flags) + become `integer(C_INT)'. The C integer type `ptrdiff_t' (e.g. in + the *note 64-bit Guru Interface::) becomes `integer(C_INTPTR_T)', + and `size_t' (in `fftw_malloc' etc.) becomes `integer(C_SIZE_T)'. + + * The `fftw_r2r_kind' type (*note Real-to-Real Transform Kinds::) + becomes `integer(C_FFTW_R2R_KIND)'. The various constant values + of the C enumerated type (`FFTW_R2HC' etc.) become simply integer + constants of the same names in Fortran. + + * Numeric array pointer arguments (e.g. `double *') become + `dimension(*), intent(out)' arrays of the same type, or + `dimension(*), intent(in)' if they are pointers to constant data + (e.g. `const int *'). There are a few exceptions where numeric + pointers refer to scalar outputs (e.g. for `fftw_flops'), in which + case they are `intent(out)' scalar arguments in Fortran too. For + the new-array execute functions (*note New-array Execute + Functions::), the input arrays are declared `dimension(*), + intent(inout)', since they can be modified in the case of in-place + or `FFTW_DESTROY_INPUT' transforms. + + * Pointer _return_ values (e.g `double *') become `type(C_PTR)'. + (If they are pointers to arrays, as for `fftw_alloc_real', you can + convert them back to Fortran array pointers with the standard + intrinsic function `c_f_pointer'.) + + * The `fftw_iodim' type in the guru interface (*note Guru vector and + transform sizes::) becomes `type(fftw_iodim)' in Fortran, a + derived data type (the Fortran analogue of C's `struct') with + three `integer(C_INT)' components: `n', `is', and `os', with the + same meanings as in C. The `fftw_iodim64' type in the 64-bit guru + interface (*note 64-bit Guru Interface::) is the same, except that + its components are of type `integer(C_INTPTR_T)'. + + * Using the wisdom import/export functions from Fortran is a bit + tricky, and is discussed in *note Accessing the wisdom API from + Fortran::. In brief, the `FILE *' arguments map to `type(C_PTR)', + `const char *' to `character(C_CHAR), dimension(*), intent(in)' + (null-terminated!), and the generic read-char/write-char functions + map to `type(C_FUNPTR)'. + + + You may be wondering if you need to search-and-replace +`real(kind(0.0d0))' (or whatever your favorite Fortran spelling of +"double precision" is) with `real(C_DOUBLE)' everywhere in your +program, and similarly for `complex' and `integer' types. The answer +is no; you can still use your existing types. As long as these types +match their C counterparts, things should work without a hitch. The +worst that can happen, e.g. in the (unlikely) event of a system where +`real(kind(0.0d0))' is different from `real(C_DOUBLE)', is that the +compiler will give you a type-mismatch error. That is, if you don't +use the `iso_c_binding' kinds you need to accept at least the +theoretical possibility of having to change your code in response to +compiler errors on some future machine, but you don't need to worry +about silently compiling incorrect code that yields runtime errors. + + +File: fftw3.info, Node: Plan execution in Fortran, Next: Allocating aligned memory in Fortran, Prev: FFTW Fortran type reference, Up: Calling FFTW from Modern Fortran + +7.4 Plan execution in Fortran +============================= + +In C, in order to use a plan, one normally calls `fftw_execute', which +executes the plan to perform the transform on the input/output arrays +passed when the plan was created (*note Using Plans::). The +corresponding subroutine call in modern Fortran is: + call fftw_execute(plan) + + However, we have had reports that this causes problems with some +recent optimizing Fortran compilers. The problem is, because the +input/output arrays are not passed as explicit arguments to +`fftw_execute', the semantics of Fortran (unlike C) allow the compiler +to assume that the input/output arrays are not changed by +`fftw_execute'. As a consequence, certain compilers end up +repositioning the call to `fftw_execute', assuming incorrectly that it +does nothing to the arrays. + + There are various workarounds to this, but the safest and simplest +thing is to not use `fftw_execute' in Fortran. Instead, use the +functions described in *note New-array Execute Functions::, which take +the input/output arrays as explicit arguments. For example, if the +plan is for a complex-data DFT and was created for the arrays `in' and +`out', you would do: + call fftw_execute_dft(plan, in, out) + + There are a few things to be careful of, however: + + * You must use the correct type of execute function, matching the way + the plan was created. Complex DFT plans should use + `fftw_execute_dft', Real-input (r2c) DFT plans should use use + `fftw_execute_dft_r2c', and real-output (c2r) DFT plans should use + `fftw_execute_dft_c2r'. The various r2r plans should use + `fftw_execute_r2r'. Fortunately, if you use the wrong one you + will get a compile-time type-mismatch error (unlike legacy + Fortran). + + * You should normally pass the same input/output arrays that were + used when creating the plan. This is always safe. + + * _If_ you pass _different_ input/output arrays compared to those + used when creating the plan, you must abide by all the + restrictions of the new-array execute functions (*note New-array + Execute Functions::). The most tricky of these is the requirement + that the new arrays have the same alignment as the original + arrays; the best (and possibly only) way to guarantee this is to + use the `fftw_alloc' functions to allocate your arrays (*note + Allocating aligned memory in Fortran::). Alternatively, you can + use the `FFTW_UNALIGNED' flag when creating the plan, in which + case the plan does not depend on the alignment, but this may + sacrifice substantial performance on architectures (like x86) with + SIMD instructions (*note SIMD alignment and fftw_malloc::). + + + +File: fftw3.info, Node: Allocating aligned memory in Fortran, Next: Accessing the wisdom API from Fortran, Prev: Plan execution in Fortran, Up: Calling FFTW from Modern Fortran + +7.5 Allocating aligned memory in Fortran +======================================== + +In order to obtain maximum performance in FFTW, you should store your +data in arrays that have been specially aligned in memory (*note SIMD +alignment and fftw_malloc::). Enforcing alignment also permits you to +safely use the new-array execute functions (*note New-array Execute +Functions::) to apply a given plan to more than one pair of in/out +arrays. Unfortunately, standard Fortran arrays do _not_ provide any +alignment guarantees. The _only_ way to allocate aligned memory in +standard Fortran is to allocate it with an external C function, like +the `fftw_alloc_real' and `fftw_alloc_complex' functions. Fortunately, +Fortran 2003 provides a simple way to associate such allocated memory +with a standard Fortran array pointer that you can then use normally. + + We therefore recommend allocating all your input/output arrays using +the following technique: + + 1. Declare a `pointer', `arr', to your array of the desired type and + dimensions. For example, `real(C_DOUBLE), pointer :: a(:,:)' for + a 2d real array, or `complex(C_DOUBLE_COMPLEX), pointer :: + a(:,:,:)' for a 3d complex array. + + 2. The number of elements to allocate must be an `integer(C_SIZE_T)'. + You can either declare a variable of this type, e.g. + `integer(C_SIZE_T) :: sz', to store the number of elements to + allocate, or you can use the `int(..., C_SIZE_T)' intrinsic + function. e.g. set `sz = L * M * N' or use `int(L * M * N, + C_SIZE_T)' for an L x M x N array. + + 3. Declare a `type(C_PTR) :: p' to hold the return value from FFTW's + allocation routine. Set `p = fftw_alloc_real(sz)' for a real + array, or `p = fftw_alloc_complex(sz)' for a complex array. + + 4. Associate your pointer `arr' with the allocated memory `p' using + the standard `c_f_pointer' subroutine: `call c_f_pointer(p, arr, + [...dimensions...])', where `[...dimensions...])' are an array of + the dimensions of the array (in the usual Fortran order). e.g. + `call c_f_pointer(p, arr, [L,M,N])' for an L x M x N array. + (Alternatively, you can omit the dimensions argument if you + specified the shape explicitly when declaring `arr'.) You can now + use `arr' as a usual multidimensional array. + + 5. When you are done using the array, deallocate the memory by `call + fftw_free(p)' on `p'. + + + For example, here is how we would allocate an L x M 2d real array: + + real(C_DOUBLE), pointer :: arr(:,:) + type(C_PTR) :: p + p = fftw_alloc_real(int(L * M, C_SIZE_T)) + call c_f_pointer(p, arr, [L,M]) + _...use arr and arr(i,j) as usual..._ + call fftw_free(p) + + and here is an L x M x N 3d complex array: + + complex(C_DOUBLE_COMPLEX), pointer :: arr(:,:,:) + type(C_PTR) :: p + p = fftw_alloc_complex(int(L * M * N, C_SIZE_T)) + call c_f_pointer(p, arr, [L,M,N]) + _...use arr and arr(i,j,k) as usual..._ + call fftw_free(p) + + See *note Reversing array dimensions:: for an example allocating a +single array and associating both real and complex array pointers with +it, for in-place real-to-complex transforms. + + +File: fftw3.info, Node: Accessing the wisdom API from Fortran, Next: Defining an FFTW module, Prev: Allocating aligned memory in Fortran, Up: Calling FFTW from Modern Fortran + +7.6 Accessing the wisdom API from Fortran +========================================= + +As explained in *note Words of Wisdom-Saving Plans::, FFTW provides a +"wisdom" API for saving plans to disk so that they can be recreated +quickly. The C API for exporting (*note Wisdom Export::) and importing +(*note Wisdom Import::) wisdom is somewhat tricky to use from Fortran, +however, because of differences in file I/O and string types between C +and Fortran. + +* Menu: + +* Wisdom File Export/Import from Fortran:: +* Wisdom String Export/Import from Fortran:: +* Wisdom Generic Export/Import from Fortran:: + + +File: fftw3.info, Node: Wisdom File Export/Import from Fortran, Next: Wisdom String Export/Import from Fortran, Prev: Accessing the wisdom API from Fortran, Up: Accessing the wisdom API from Fortran + +7.6.1 Wisdom File Export/Import from Fortran +-------------------------------------------- + +The easiest way to export and import wisdom is to do so using +`fftw_export_wisdom_to_filename' and `fftw_wisdom_from_filename'. The +only trick is that these require you to pass a C string, which is an +array of type `CHARACTER(C_CHAR)' that is terminated by `C_NULL_CHAR'. +You can call them like this: + + integer(C_INT) :: ret + ret = fftw_export_wisdom_to_filename(C_CHAR_'my_wisdom.dat' // C_NULL_CHAR) + if (ret .eq. 0) stop 'error exporting wisdom to file' + ret = fftw_import_wisdom_from_filename(C_CHAR_'my_wisdom.dat' // C_NULL_CHAR) + if (ret .eq. 0) stop 'error importing wisdom from file' + + Note that prepending `C_CHAR_' is needed to specify that the literal +string is of kind `C_CHAR', and we null-terminate the string by +appending `// C_NULL_CHAR'. These functions return an `integer(C_INT)' +(`ret') which is `0' if an error occurred during export/import and +nonzero otherwise. + + It is also possible to use the lower-level routines +`fftw_export_wisdom_to_file' and `fftw_import_wisdom_from_file', which +accept parameters of the C type `FILE*', expressed in Fortran as +`type(C_PTR)'. However, you are then responsible for creating the +`FILE*' yourself. You can do this by using `iso_c_binding' to define +Fortran intefaces for the C library functions `fopen' and `fclose', +which is a bit strange in Fortran but workable. + + +File: fftw3.info, Node: Wisdom String Export/Import from Fortran, Next: Wisdom Generic Export/Import from Fortran, Prev: Wisdom File Export/Import from Fortran, Up: Accessing the wisdom API from Fortran + +7.6.2 Wisdom String Export/Import from Fortran +---------------------------------------------- + +Dealing with FFTW's C string export/import is a bit more painful. In +particular, the `fftw_export_wisdom_to_string' function requires you to +deal with a dynamically allocated C string. To get its length, you +must define an interface to the C `strlen' function, and to deallocate +it you must define an interface to C `free': + + use, intrinsic :: iso_c_binding + interface + integer(C_INT) function strlen(s) bind(C, name='strlen') + import + type(C_PTR), value :: s + end function strlen + subroutine free(p) bind(C, name='free') + import + type(C_PTR), value :: p + end subroutine free + end interface + + Given these definitions, you can then export wisdom to a Fortran +character array: + + character(C_CHAR), pointer :: s(:) + integer(C_SIZE_T) :: slen + type(C_PTR) :: p + p = fftw_export_wisdom_to_string() + if (.not. c_associated(p)) stop 'error exporting wisdom' + slen = strlen(p) + call c_f_pointer(p, s, [slen+1]) + ... + call free(p) + + Note that `slen' is the length of the C string, but the length of +the array is `slen+1' because it includes the terminating null +character. (You can omit the `+1' if you don't want Fortran to know +about the null character.) The standard `c_associated' function checks +whether `p' is a null pointer, which is returned by +`fftw_export_wisdom_to_string' if there was an error. + + To import wisdom from a string, use `fftw_import_wisdom_from_string' +as usual; note that the argument of this function must be a +`character(C_CHAR)' that is terminated by the `C_NULL_CHAR' character, +like the `s' array above. + + +File: fftw3.info, Node: Wisdom Generic Export/Import from Fortran, Prev: Wisdom String Export/Import from Fortran, Up: Accessing the wisdom API from Fortran + +7.6.3 Wisdom Generic Export/Import from Fortran +----------------------------------------------- + +The most generic wisdom export/import functions allow you to provide an +arbitrary callback function to read/write one character at a time in +any way you want. However, your callback function must be written in a +special way, using the `bind(C)' attribute to be passed to a C +interface. + + In particular, to call the generic wisdom export function +`fftw_export_wisdom', you would write a callback subroutine of the form: + + subroutine my_write_char(c, p) bind(C) + use, intrinsic :: iso_c_binding + character(C_CHAR), value :: c + type(C_PTR), value :: p + _...write c..._ + end subroutine my_write_char + + Given such a subroutine (along with the corresponding interface +definition), you could then export wisdom using: + + call fftw_export_wisdom(c_funloc(my_write_char), p) + + The standard `c_funloc' intrinsic converts a Fortran `bind(C)' +subroutine into a C function pointer. The parameter `p' is a +`type(C_PTR)' to any arbitrary data that you want to pass to +`my_write_char' (or `C_NULL_PTR' if none). (Note that you can get a C +pointer to Fortran data using the intrinsic `c_loc', and convert it +back to a Fortran pointer in `my_write_char' using `c_f_pointer'.) + + Similarly, to use the generic `fftw_import_wisdom', you would define +a callback function of the form: + + integer(C_INT) function my_read_char(p) bind(C) + use, intrinsic :: iso_c_binding + type(C_PTR), value :: p + character :: c + _...read a character c..._ + my_read_char = ichar(c, C_INT) + end function my_read_char + + .... + + integer(C_INT) :: ret + ret = fftw_import_wisdom(c_funloc(my_read_char), p) + if (ret .eq. 0) stop 'error importing wisdom' + + Your function can return `-1' if the end of the input is reached. +Again, `p' is an arbitrary `type(C_PTR' that is passed through to your +function. `fftw_import_wisdom' returns `0' if an error occurred and +nonzero otherwise. + + +File: fftw3.info, Node: Defining an FFTW module, Prev: Accessing the wisdom API from Fortran, Up: Calling FFTW from Modern Fortran + +7.7 Defining an FFTW module +=========================== + +Rather than using the `include' statement to include the `fftw3.f03' +interface file in any subroutine where you want to use FFTW, you might +prefer to define an FFTW Fortran module. FFTW does not install itself +as a module, primarily because `fftw3.f03' can be shared between +different Fortran compilers while modules (in general) cannot. +However, it is trivial to define your own FFTW module if you want. +Just create a file containing: + + module FFTW3 + use, intrinsic :: iso_c_binding + include 'fftw3.f03' + end module + + Compile this file into a module as usual for your compiler (e.g. with +`gfortran -c' you will get a file `fftw3.mod'). Now, instead of +`include 'fftw3.f03'', whenever you want to use FFTW routines you can +just do: + + use FFTW3 + + as usual for Fortran modules. (You still need to link to the FFTW +library, of course.) + + +File: fftw3.info, Node: Calling FFTW from Legacy Fortran, Next: Upgrading from FFTW version 2, Prev: Calling FFTW from Modern Fortran, Up: Top + +8 Calling FFTW from Legacy Fortran +********************************** + +This chapter describes the interface to FFTW callable by Fortran code +in older compilers not supporting the Fortran 2003 C interoperability +features (*note Calling FFTW from Modern Fortran::). This interface +has the major disadvantage that it is not type-checked, so if you +mistake the argument types or ordering then your program will not have +any compiler errors, and will likely crash at runtime. So, greater +care is needed. Also, technically interfacing older Fortran versions +to C is nonstandard, but in practice we have found that the techniques +used in this chapter have worked with all known Fortran compilers for +many years. + + The legacy Fortran interface differs from the C interface only in the +prefix (`dfftw_' instead of `fftw_' in double precision) and a few +other minor details. This Fortran interface is included in the FFTW +libraries by default, unless a Fortran compiler isn't found on your +system or `--disable-fortran' is included in the `configure' flags. We +assume here that the reader is already familiar with the usage of FFTW +in C, as described elsewhere in this manual. + + The MPI parallel interface to FFTW is _not_ currently available to +legacy Fortran. + +* Menu: + +* Fortran-interface routines:: +* FFTW Constants in Fortran:: +* FFTW Execution in Fortran:: +* Fortran Examples:: +* Wisdom of Fortran?:: + + +File: fftw3.info, Node: Fortran-interface routines, Next: FFTW Constants in Fortran, Prev: Calling FFTW from Legacy Fortran, Up: Calling FFTW from Legacy Fortran + +8.1 Fortran-interface routines +============================== + +Nearly all of the FFTW functions have Fortran-callable equivalents. +The name of the legacy Fortran routine is the same as that of the +corresponding C routine, but with the `fftw_' prefix replaced by +`dfftw_'.(1) The single and long-double precision versions use +`sfftw_' and `lfftw_', respectively, instead of `fftwf_' and `fftwl_'; +quadruple precision (`real*16') is available on some systems as +`fftwq_' (*note Precision::). (Note that `long double' on x86 hardware +is usually at most 80-bit extended precision, _not_ quadruple +precision.) + + For the most part, all of the arguments to the functions are the +same, with the following exceptions: + + * `plan' variables (what would be of type `fftw_plan' in C), must be + declared as a type that is at least as big as a pointer (address) + on your machine. We recommend using `integer*8' everywhere, since + this should always be big enough. + + * Any function that returns a value (e.g. `fftw_plan_dft') is + converted into a _subroutine_. The return value is converted into + an additional _first_ parameter of this subroutine.(2) + + * The Fortran routines expect multi-dimensional arrays to be in + _column-major_ order, which is the ordinary format of Fortran + arrays (*note Multi-dimensional Array Format::). They do this + transparently and costlessly simply by reversing the order of the + dimensions passed to FFTW, but this has one important consequence + for multi-dimensional real-complex transforms, discussed below. + + * Wisdom import and export is somewhat more tricky because one cannot + easily pass files or strings between C and Fortran; see *note + Wisdom of Fortran?::. + + * Legacy Fortran cannot use the `fftw_malloc' dynamic-allocation + routine. If you want to exploit the SIMD FFTW (*note SIMD + alignment and fftw_malloc::), you'll need to figure out some other + way to ensure that your arrays are at least 16-byte aligned. + + * Since Fortran 77 does not have data structures, the `fftw_iodim' + structure from the guru interface (*note Guru vector and transform + sizes::) must be split into separate arguments. In particular, any + `fftw_iodim' array arguments in the C guru interface become three + integer array arguments (`n', `is', and `os') in the Fortran guru + interface, all of whose lengths should be equal to the + corresponding `rank' argument. + + * The guru planner interface in Fortran does _not_ do any automatic + translation between column-major and row-major; you are responsible + for setting the strides etcetera to correspond to your Fortran + arrays. However, as a slight bug that we are preserving for + backwards compatibility, the `plan_guru_r2r' in Fortran _does_ + reverse the order of its `kind' array parameter, so the `kind' + array of that routine should be in the reverse of the order of the + iodim arrays (see above). + + + In general, you should take care to use Fortran data types that +correspond to (i.e. are the same size as) the C types used by FFTW. In +practice, this correspondence is usually straightforward (i.e. +`integer' corresponds to `int', `real' corresponds to `float', +etcetera). The native Fortran double/single-precision complex type +should be compatible with `fftw_complex'/`fftwf_complex'. Such simple +correspondences are assumed in the examples below. + + ---------- Footnotes ---------- + + (1) Technically, Fortran 77 identifiers are not allowed to have more +than 6 characters, nor may they contain underscores. Any compiler that +enforces this limitation doesn't deserve to link to FFTW. + + (2) The reason for this is that some Fortran implementations seem to +have trouble with C function return values, and vice versa. + + +File: fftw3.info, Node: FFTW Constants in Fortran, Next: FFTW Execution in Fortran, Prev: Fortran-interface routines, Up: Calling FFTW from Legacy Fortran + +8.2 FFTW Constants in Fortran +============================= + +When creating plans in FFTW, a number of constants are used to specify +options, such as `FFTW_MEASURE' or `FFTW_ESTIMATE'. The same constants +must be used with the wrapper routines, but of course the C header +files where the constants are defined can't be incorporated directly +into Fortran code. + + Instead, we have placed Fortran equivalents of the FFTW constant +definitions in the file `fftw3.f', which can be found in the same +directory as `fftw3.h'. If your Fortran compiler supports a +preprocessor of some sort, you should be able to `include' or +`#include' this file; otherwise, you can paste it directly into your +code. + + In C, you combine different flags (like `FFTW_PRESERVE_INPUT' and +`FFTW_MEASURE') using the ``|'' operator; in Fortran you should just +use ``+''. (Take care not to add in the same flag more than once, +though. Alternatively, you can use the `ior' intrinsic function +standardized in Fortran 95.) + + +File: fftw3.info, Node: FFTW Execution in Fortran, Next: Fortran Examples, Prev: FFTW Constants in Fortran, Up: Calling FFTW from Legacy Fortran + +8.3 FFTW Execution in Fortran +============================= + +In C, in order to use a plan, one normally calls `fftw_execute', which +executes the plan to perform the transform on the input/output arrays +passed when the plan was created (*note Using Plans::). The +corresponding subroutine call in legacy Fortran is: + call dfftw_execute(plan) + + However, we have had reports that this causes problems with some +recent optimizing Fortran compilers. The problem is, because the +input/output arrays are not passed as explicit arguments to +`dfftw_execute', the semantics of Fortran (unlike C) allow the compiler +to assume that the input/output arrays are not changed by +`dfftw_execute'. As a consequence, certain compilers end up optimizing +out or repositioning the call to `dfftw_execute', assuming incorrectly +that it does nothing. + + There are various workarounds to this, but the safest and simplest +thing is to not use `dfftw_execute' in Fortran. Instead, use the +functions described in *note New-array Execute Functions::, which take +the input/output arrays as explicit arguments. For example, if the +plan is for a complex-data DFT and was created for the arrays `in' and +`out', you would do: + call dfftw_execute_dft(plan, in, out) + + There are a few things to be careful of, however: + + * You must use the correct type of execute function, matching the way + the plan was created. Complex DFT plans should use + `dfftw_execute_dft', Real-input (r2c) DFT plans should use use + `dfftw_execute_dft_r2c', and real-output (c2r) DFT plans should + use `dfftw_execute_dft_c2r'. The various r2r plans should use + `dfftw_execute_r2r'. + + * You should normally pass the same input/output arrays that were + used when creating the plan. This is always safe. + + * _If_ you pass _different_ input/output arrays compared to those + used when creating the plan, you must abide by all the + restrictions of the new-array execute functions (*note New-array + Execute Functions::). The most difficult of these, in Fortran, is + the requirement that the new arrays have the same alignment as the + original arrays, because there seems to be no way in legacy + Fortran to obtain guaranteed-aligned arrays (analogous to + `fftw_malloc' in C). You can, of course, use the `FFTW_UNALIGNED' + flag when creating the plan, in which case the plan does not + depend on the alignment, but this may sacrifice substantial + performance on architectures (like x86) with SIMD instructions + (*note SIMD alignment and fftw_malloc::). + + + +File: fftw3.info, Node: Fortran Examples, Next: Wisdom of Fortran?, Prev: FFTW Execution in Fortran, Up: Calling FFTW from Legacy Fortran + +8.4 Fortran Examples +==================== + +In C, you might have something like the following to transform a +one-dimensional complex array: + + fftw_complex in[N], out[N]; + fftw_plan plan; + + plan = fftw_plan_dft_1d(N,in,out,FFTW_FORWARD,FFTW_ESTIMATE); + fftw_execute(plan); + fftw_destroy_plan(plan); + + In Fortran, you would use the following to accomplish the same thing: + + double complex in, out + dimension in(N), out(N) + integer*8 plan + + call dfftw_plan_dft_1d(plan,N,in,out,FFTW_FORWARD,FFTW_ESTIMATE) + call dfftw_execute_dft(plan, in, out) + call dfftw_destroy_plan(plan) + + Notice how all routines are called as Fortran subroutines, and the +plan is returned via the first argument to `dfftw_plan_dft_1d'. Notice +also that we changed `fftw_execute' to `dfftw_execute_dft' (*note FFTW +Execution in Fortran::). To do the same thing, but using 8 threads in +parallel (*note Multi-threaded FFTW::), you would simply prefix these +calls with: + + integer iret + call dfftw_init_threads(iret) + call dfftw_plan_with_nthreads(8) + + (You might want to check the value of `iret': if it is zero, it +indicates an unlikely error during thread initialization.) + + To transform a three-dimensional array in-place with C, you might do: + + fftw_complex arr[L][M][N]; + fftw_plan plan; + + plan = fftw_plan_dft_3d(L,M,N, arr,arr, + FFTW_FORWARD, FFTW_ESTIMATE); + fftw_execute(plan); + fftw_destroy_plan(plan); + + In Fortran, you would use this instead: + + double complex arr + dimension arr(L,M,N) + integer*8 plan + + call dfftw_plan_dft_3d(plan, L,M,N, arr,arr, + & FFTW_FORWARD, FFTW_ESTIMATE) + call dfftw_execute_dft(plan, arr, arr) + call dfftw_destroy_plan(plan) + + Note that we pass the array dimensions in the "natural" order in +both C and Fortran. + + To transform a one-dimensional real array in Fortran, you might do: + + double precision in + dimension in(N) + double complex out + dimension out(N/2 + 1) + integer*8 plan + + call dfftw_plan_dft_r2c_1d(plan,N,in,out,FFTW_ESTIMATE) + call dfftw_execute_dft_r2c(plan, in, out) + call dfftw_destroy_plan(plan) + + To transform a two-dimensional real array, out of place, you might +use the following: + + double precision in + dimension in(M,N) + double complex out + dimension out(M/2 + 1, N) + integer*8 plan + + call dfftw_plan_dft_r2c_2d(plan,M,N,in,out,FFTW_ESTIMATE) + call dfftw_execute_dft_r2c(plan, in, out) + call dfftw_destroy_plan(plan) + + *Important:* Notice that it is the _first_ dimension of the complex +output array that is cut in half in Fortran, rather than the last +dimension as in C. This is a consequence of the interface routines +reversing the order of the array dimensions passed to FFTW so that the +Fortran program can use its ordinary column-major order. + + +File: fftw3.info, Node: Wisdom of Fortran?, Prev: Fortran Examples, Up: Calling FFTW from Legacy Fortran + +8.5 Wisdom of Fortran? +====================== + +In this section, we discuss how one can import/export FFTW wisdom +(saved plans) to/from a Fortran program; we assume that the reader is +already familiar with wisdom, as described in *note Words of +Wisdom-Saving Plans::. + + The basic problem is that is difficult to (portably) pass files and +strings between Fortran and C, so we cannot provide a direct Fortran +equivalent to the `fftw_export_wisdom_to_file', etcetera, functions. +Fortran interfaces _are_ provided for the functions that do not take +file/string arguments, however: `dfftw_import_system_wisdom', +`dfftw_import_wisdom', `dfftw_export_wisdom', and `dfftw_forget_wisdom'. + + So, for example, to import the system-wide wisdom, you would do: + + integer isuccess + call dfftw_import_system_wisdom(isuccess) + + As usual, the C return value is turned into a first parameter; +`isuccess' is non-zero on success and zero on failure (e.g. if there is +no system wisdom installed). + + If you want to import/export wisdom from/to an arbitrary file or +elsewhere, you can employ the generic `dfftw_import_wisdom' and +`dfftw_export_wisdom' functions, for which you must supply a subroutine +to read/write one character at a time. The FFTW package contains an +example file `doc/f77_wisdom.f' demonstrating how to implement +`import_wisdom_from_file' and `export_wisdom_to_file' subroutines in +this way. (These routines cannot be compiled into the FFTW library +itself, lest all FFTW-using programs be required to link with the +Fortran I/O library.) + + +File: fftw3.info, Node: Upgrading from FFTW version 2, Next: Installation and Customization, Prev: Calling FFTW from Legacy Fortran, Up: Top + +9 Upgrading from FFTW version 2 +******************************* + +In this chapter, we outline the process for updating codes designed for +the older FFTW 2 interface to work with FFTW 3. The interface for FFTW +3 is not backwards-compatible with the interface for FFTW 2 and earlier +versions; codes written to use those versions will fail to link with +FFTW 3. Nor is it possible to write "compatibility wrappers" to bridge +the gap (at least not efficiently), because FFTW 3 has different +semantics from previous versions. However, upgrading should be a +straightforward process because the data formats are identical and the +overall style of planning/execution is essentially the same. + + Unlike FFTW 2, there are no separate header files for real and +complex transforms (or even for different precisions) in FFTW 3; all +interfaces are defined in the `<fftw3.h>' header file. + +Numeric Types +============= + +The main difference in data types is that `fftw_complex' in FFTW 2 was +defined as a `struct' with macros `c_re' and `c_im' for accessing the +real/imaginary parts. (This is binary-compatible with FFTW 3 on any +machine except perhaps for some older Crays in single precision.) The +equivalent macros for FFTW 3 are: + + #define c_re(c) ((c)[0]) + #define c_im(c) ((c)[1]) + + This does not work if you are using the C99 complex type, however, +unless you insert a `double*' typecast into the above macros (*note +Complex numbers::). + + Also, FFTW 2 had an `fftw_real' typedef that was an alias for +`double' (in double precision). In FFTW 3 you should just use `double' +(or whatever precision you are employing). + +Plans +===== + +The major difference between FFTW 2 and FFTW 3 is in the +planning/execution division of labor. In FFTW 2, plans were found for a +given transform size and type, and then could be applied to _any_ +arrays and for _any_ multiplicity/stride parameters. In FFTW 3, you +specify the particular arrays, stride parameters, etcetera when +creating the plan, and the plan is then executed for _those_ arrays +(unless the guru interface is used) and _those_ parameters _only_. +(FFTW 2 had "specific planner" routines that planned for a particular +array and stride, but the plan could still be used for other arrays and +strides.) That is, much of the information that was formerly specified +at execution time is now specified at planning time. + + Like FFTW 2's specific planner routines, the FFTW 3 planner +overwrites the input/output arrays unless you use `FFTW_ESTIMATE'. + + FFTW 2 had separate data types `fftw_plan', `fftwnd_plan', +`rfftw_plan', and `rfftwnd_plan' for complex and real one- and +multi-dimensional transforms, and each type had its own `destroy' +function. In FFTW 3, all plans are of type `fftw_plan' and all are +destroyed by `fftw_destroy_plan(plan)'. + + Where you formerly used `fftw_create_plan' and `fftw_one' to plan +and compute a single 1d transform, you would now use `fftw_plan_dft_1d' +to plan the transform. If you used the generic `fftw' function to +execute the transform with multiplicity (`howmany') and stride +parameters, you would now use the advanced interface +`fftw_plan_many_dft' to specify those parameters. The plans are now +executed with `fftw_execute(plan)', which takes all of its parameters +(including the input/output arrays) from the plan. + + In-place transforms no longer interpret their output argument as +scratch space, nor is there an `FFTW_IN_PLACE' flag. You simply pass +the same pointer for both the input and output arguments. (Previously, +the output `ostride' and `odist' parameters were ignored for in-place +transforms; now, if they are specified via the advanced interface, they +are significant even in the in-place case, although they should +normally equal the corresponding input parameters.) + + The `FFTW_ESTIMATE' and `FFTW_MEASURE' flags have the same meaning +as before, although the planning time will differ. You may also +consider using `FFTW_PATIENT', which is like `FFTW_MEASURE' except that +it takes more time in order to consider a wider variety of algorithms. + + For multi-dimensional complex DFTs, instead of `fftwnd_create_plan' +(or `fftw2d_create_plan' or `fftw3d_create_plan'), followed by +`fftwnd_one', you would use `fftw_plan_dft' (or `fftw_plan_dft_2d' or +`fftw_plan_dft_3d'). followed by `fftw_execute'. If you used `fftwnd' +to to specify strides etcetera, you would instead specify these via +`fftw_plan_many_dft'. + + The analogues to `rfftw_create_plan' and `rfftw_one' with +`FFTW_REAL_TO_COMPLEX' or `FFTW_COMPLEX_TO_REAL' directions are +`fftw_plan_r2r_1d' with kind `FFTW_R2HC' or `FFTW_HC2R', followed by +`fftw_execute'. The stride etcetera arguments of `rfftw' are now in +`fftw_plan_many_r2r'. + + Instead of `rfftwnd_create_plan' (or `rfftw2d_create_plan' or +`rfftw3d_create_plan') followed by `rfftwnd_one_real_to_complex' or +`rfftwnd_one_complex_to_real', you now use `fftw_plan_dft_r2c' (or +`fftw_plan_dft_r2c_2d' or `fftw_plan_dft_r2c_3d') or +`fftw_plan_dft_c2r' (or `fftw_plan_dft_c2r_2d' or +`fftw_plan_dft_c2r_3d'), respectively, followed by `fftw_execute'. As +usual, the strides etcetera of `rfftwnd_real_to_complex' or +`rfftwnd_complex_to_real' are no specified in the advanced planner +routines, `fftw_plan_many_dft_r2c' or `fftw_plan_many_dft_c2r'. + +Wisdom +====== + +In FFTW 2, you had to supply the `FFTW_USE_WISDOM' flag in order to use +wisdom; in FFTW 3, wisdom is always used. (You could simulate the FFTW +2 wisdom-less behavior by calling `fftw_forget_wisdom' after every +planner call.) + + The FFTW 3 wisdom import/export routines are almost the same as +before (although the storage format is entirely different). There is +one significant difference, however. In FFTW 2, the import routines +would never read past the end of the wisdom, so you could store extra +data beyond the wisdom in the same file, for example. In FFTW 3, the +file-import routine may read up to a few hundred bytes past the end of +the wisdom, so you cannot store other data just beyond it.(1) + + Wisdom has been enhanced by additional humility in FFTW 3: whereas +FFTW 2 would re-use wisdom for a given transform size regardless of the +stride etc., in FFTW 3 wisdom is only used with the strides etc. for +which it was created. Unfortunately, this means FFTW 3 has to create +new plans from scratch more often than FFTW 2 (in FFTW 2, planning e.g. +one transform of size 1024 also created wisdom for all smaller powers +of 2, but this no longer occurs). + + FFTW 3 also has the new routine `fftw_import_system_wisdom' to +import wisdom from a standard system-wide location. + +Memory allocation +================= + +In FFTW 3, we recommend allocating your arrays with `fftw_malloc' and +deallocating them with `fftw_free'; this is not required, but allows +optimal performance when SIMD acceleration is used. (Those two +functions actually existed in FFTW 2, and worked the same way, but were +not documented.) + + In FFTW 2, there were `fftw_malloc_hook' and `fftw_free_hook' +functions that allowed the user to replace FFTW's memory-allocation +routines (e.g. to implement different error-handling, since by default +FFTW prints an error message and calls `exit' to abort the program if +`malloc' returns `NULL'). These hooks are not supported in FFTW 3; +those few users who require this functionality can just directly modify +the memory-allocation routines in FFTW (they are defined in +`kernel/alloc.c'). + +Fortran interface +================= + +In FFTW 2, the subroutine names were obtained by replacing `fftw_' with +`fftw_f77'; in FFTW 3, you replace `fftw_' with `dfftw_' (or `sfftw_' +or `lfftw_', depending upon the precision). + + In FFTW 3, we have begun recommending that you always declare the +type used to store plans as `integer*8'. (Too many people didn't notice +our instruction to switch from `integer' to `integer*8' for 64-bit +machines.) + + In FFTW 3, we provide a `fftw3.f' "header file" to include in your +code (and which is officially installed on Unix systems). (In FFTW 2, +we supplied a `fftw_f77.i' file, but it was not installed.) + + Otherwise, the C-Fortran interface relationship is much the same as +it was before (e.g. return values become initial parameters, and +multi-dimensional arrays are in column-major order). Unlike FFTW 2, we +do provide some support for wisdom import/export in Fortran (*note +Wisdom of Fortran?::). + +Threads +======= + +Like FFTW 2, only the execution routines are thread-safe. All planner +routines, etcetera, should be called by only a single thread at a time +(*note Thread safety::). _Unlike_ FFTW 2, there is no special +`FFTW_THREADSAFE' flag for the planner to allow a given plan to be +usable by multiple threads in parallel; this is now the case by default. + + The multi-threaded version of FFTW 2 required you to pass the number +of threads each time you execute the transform. The number of threads +is now stored in the plan, and is specified before the planner is +called by `fftw_plan_with_nthreads'. The threads initialization +routine used to be called `fftw_threads_init' and would return zero on +success; the new routine is called `fftw_init_threads' and returns zero +on failure. *Note Multi-threaded FFTW::. + + There is no separate threads header file in FFTW 3; all the function +prototypes are in `<fftw3.h>'. However, you still have to link to a +separate library (`-lfftw3_threads -lfftw3 -lm' on Unix), as well as to +the threading library (e.g. POSIX threads on Unix). + + ---------- Footnotes ---------- + + (1) We do our own buffering because GNU libc I/O routines are +horribly slow for single-character I/O, apparently for thread-safety +reasons (whether you are using threads or not). + + +File: fftw3.info, Node: Installation and Customization, Next: Acknowledgments, Prev: Upgrading from FFTW version 2, Up: Top + +10 Installation and Customization +********************************* + +This chapter describes the installation and customization of FFTW, the +latest version of which may be downloaded from the FFTW home page +(http://www.fftw.org). + + In principle, FFTW should work on any system with an ANSI C compiler +(`gcc' is fine). However, planner time is drastically reduced if FFTW +can exploit a hardware cycle counter; FFTW comes with cycle-counter +support for all modern general-purpose CPUs, but you may need to add a +couple of lines of code if your compiler is not yet supported (*note +Cycle Counters::). (On Unix, there will be a warning at the end of the +`configure' output if no cycle counter is found.) + + Installation of FFTW is simplest if you have a Unix or a GNU system, +such as GNU/Linux, and we describe this case in the first section below, +including the use of special configuration options to e.g. install +different precisions or exploit optimizations for particular +architectures (e.g. SIMD). Compilation on non-Unix systems is a more +manual process, but we outline the procedure in the second section. It +is also likely that pre-compiled binaries will be available for popular +systems. + + Finally, we describe how you can customize FFTW for particular needs +by generating _codelets_ for fast transforms of sizes not supported +efficiently by the standard FFTW distribution. + +* Menu: + +* Installation on Unix:: +* Installation on non-Unix systems:: +* Cycle Counters:: +* Generating your own code:: + + +File: fftw3.info, Node: Installation on Unix, Next: Installation on non-Unix systems, Prev: Installation and Customization, Up: Installation and Customization + +10.1 Installation on Unix +========================= + +FFTW comes with a `configure' program in the GNU style. Installation +can be as simple as: + + ./configure + make + make install + + This will build the uniprocessor complex and real transform libraries +along with the test programs. (We recommend that you use GNU `make' if +it is available; on some systems it is called `gmake'.) The "`make +install'" command installs the fftw and rfftw libraries in standard +places, and typically requires root privileges (unless you specify a +different install directory with the `--prefix' flag to `configure'). +You can also type "`make check'" to put the FFTW test programs through +their paces. If you have problems during configuration or compilation, +you may want to run "`make distclean'" before trying again; this +ensures that you don't have any stale files left over from previous +compilation attempts. + + The `configure' script chooses the `gcc' compiler by default, if it +is available; you can select some other compiler with: + ./configure CC="<the name of your C compiler>" + + The `configure' script knows good `CFLAGS' (C compiler flags) for a +few systems. If your system is not known, the `configure' script will +print out a warning. In this case, you should re-configure FFTW with +the command + ./configure CFLAGS="<write your CFLAGS here>" + and then compile as usual. If you do find an optimal set of +`CFLAGS' for your system, please let us know what they are (along with +the output of `config.guess') so that we can include them in future +releases. + + `configure' supports all the standard flags defined by the GNU +Coding Standards; see the `INSTALL' file in FFTW or the GNU web page +(http://www.gnu.org/prep/standards/html_node/index.html). Note +especially `--help' to list all flags and `--enable-shared' to create +shared, rather than static, libraries. `configure' also accepts a few +FFTW-specific flags, particularly: + + * `--enable-float': Produces a single-precision version of FFTW + (`float') instead of the default double-precision (`double'). + *Note Precision::. + + * `--enable-long-double': Produces a long-double precision version of + FFTW (`long double') instead of the default double-precision + (`double'). The `configure' script will halt with an error + message if `long double' is the same size as `double' on your + machine/compiler. *Note Precision::. + + * `--enable-quad-precision': Produces a quadruple-precision version + of FFTW using the nonstandard `__float128' type provided by `gcc' + 4.6 or later on x86, x86-64, and Itanium architectures, instead of + the default double-precision (`double'). The `configure' script + will halt with an error message if the compiler is not `gcc' + version 4.6 or later or if `gcc''s `libquadmath' library is not + installed. *Note Precision::. + + * `--enable-threads': Enables compilation and installation of the + FFTW threads library (*note Multi-threaded FFTW::), which provides + a simple interface to parallel transforms for SMP systems. By + default, the threads routines are not compiled. + + * `--enable-openmp': Like `--enable-threads', but using OpenMP + compiler directives in order to induce parallelism rather than + spawning its own threads directly, and installing an `fftw3_omp' + library rather than an `fftw3_threads' library (*note + Multi-threaded FFTW::). You can use both `--enable-openmp' and + `--enable-threads' since they compile/install libraries with + different names. By default, the OpenMP routines are not compiled. + + * `--with-combined-threads': By default, if `--enable-threads' is + used, the threads support is compiled into a separate library that + must be linked in addition to the main FFTW library. This is so + that users of the serial library do not need to link the system + threads libraries. If `--with-combined-threads' is specified, + however, then no separate threads library is created, and threads + are included in the main FFTW library. This is mainly useful + under Windows, where no system threads library is required and + inter-library dependencies are problematic. + + * `--enable-mpi': Enables compilation and installation of the FFTW + MPI library (*note Distributed-memory FFTW with MPI::), which + provides parallel transforms for distributed-memory systems with + MPI. (By default, the MPI routines are not compiled.) *Note FFTW + MPI Installation::. + + * `--disable-fortran': Disables inclusion of legacy-Fortran wrapper + routines (*note Calling FFTW from Legacy Fortran::) in the standard + FFTW libraries. These wrapper routines increase the library size + by only a negligible amount, so they are included by default as + long as the `configure' script finds a Fortran compiler on your + system. (To specify a particular Fortran compiler foo, pass + `F77='foo to `configure'.) + + * `--with-g77-wrappers': By default, when Fortran wrappers are + included, the wrappers employ the linking conventions of the + Fortran compiler detected by the `configure' script. If this + compiler is GNU `g77', however, then _two_ versions of the + wrappers are included: one with `g77''s idiosyncratic convention + of appending two underscores to identifiers, and one with the more + common convention of appending only a single underscore. This + way, the same FFTW library will work with both `g77' and other + Fortran compilers, such as GNU `gfortran'. However, the converse + is not true: if you configure with a different compiler, then the + `g77'-compatible wrappers are not included. By specifying + `--with-g77-wrappers', the `g77'-compatible wrappers are included + in addition to wrappers for whatever Fortran compiler `configure' + finds. + + * `--with-slow-timer': Disables the use of hardware cycle counters, + and falls back on `gettimeofday' or `clock'. This greatly worsens + performance, and should generally not be used (unless you don't + have a cycle counter but still really want an optimized plan + regardless of the time). *Note Cycle Counters::. + + * `--enable-sse', `--enable-sse2', `--enable-avx', + `--enable-altivec', `--enable-neon': Enable the compilation of + SIMD code for SSE (Pentium III+), SSE2 (Pentium IV+), AVX (Sandy + Bridge, Interlagos), AltiVec (PowerPC G4+), NEON (some ARM + processors). SSE, AltiVec, and NEON only work with + `--enable-float' (above). SSE2 works in both single and double + precision (and is simply SSE in single precision). The resulting + code will _still work_ on earlier CPUs lacking the SIMD extensions + (SIMD is automatically disabled, although the FFTW library is + still larger). + - These options require a compiler supporting SIMD extensions, + and compiler support is always a bit flaky: see the FFTW FAQ + for a list of compiler versions that have problems compiling + FFTW. + + - With AltiVec and `gcc', you may have to use the + `-mabi=altivec' option when compiling any code that links to + FFTW, in order to properly align the stack; otherwise, FFTW + could crash when it tries to use an AltiVec feature. (This + is not necessary on MacOS X.) + + - With SSE/SSE2 and `gcc', you should use a version of gcc that + properly aligns the stack when compiling any code that links + to FFTW. By default, `gcc' 2.95 and later versions align the + stack as needed, but you should not compile FFTW with the + `-Os' option or the `-mpreferred-stack-boundary' option with + an argument less than 4. + + - Because of the large variety of ARM processors and ABIs, FFTW + does not attempt to guess the correct `gcc' flags for + generating NEON code. In general, you will have to provide + them on the command line. This command line is known to have + worked at least once: + ./configure --with-slow-timer --host=arm-linux-gnueabi \ + --enable-single --enable-neon \ + "CC=arm-linux-gnueabi-gcc -march=armv7-a -mfloat-abi=softfp" + + + To force `configure' to use a particular C compiler foo (instead of +the default, usually `gcc'), pass `CC='foo to the `configure' script; +you may also need to set the flags via the variable `CFLAGS' as +described above. + + +File: fftw3.info, Node: Installation on non-Unix systems, Next: Cycle Counters, Prev: Installation on Unix, Up: Installation and Customization + +10.2 Installation on non-Unix systems +===================================== + +It should be relatively straightforward to compile FFTW even on non-Unix +systems lacking the niceties of a `configure' script. Basically, you +need to edit the `config.h' header (copy it from `config.h.in') to +`#define' the various options and compiler characteristics, and then +compile all the `.c' files in the relevant directories. + + The `config.h' header contains about 100 options to set, each one +initially an `#undef', each documented with a comment, and most of them +fairly obvious. For most of the options, you should simply `#define' +them to `1' if they are applicable, although a few options require a +particular value (e.g. `SIZEOF_LONG_LONG' should be defined to the size +of the `long long' type, in bytes, or zero if it is not supported). We +will likely post some sample `config.h' files for various operating +systems and compilers for you to use (at least as a starting point). +Please let us know if you have to hand-create a configuration file +(and/or a pre-compiled binary) that you want to share. + + To create the FFTW library, you will then need to compile all of the +`.c' files in the `kernel', `dft', `dft/scalar', `dft/scalar/codelets', +`rdft', `rdft/scalar', `rdft/scalar/r2cf', `rdft/scalar/r2cb', +`rdft/scalar/r2r', `reodft', and `api' directories. If you are +compiling with SIMD support (e.g. you defined `HAVE_SSE2' in +`config.h'), then you also need to compile the `.c' files in the +`simd-support', `{dft,rdft}/simd', `{dft,rdft}/simd/*' directories. + + Once these files are all compiled, link them into a library, or a +shared library, or directly into your program. + + To compile the FFTW test program, additionally compile the code in +the `libbench2/' directory, and link it into a library. Then compile +the code in the `tests/' directory and link it to the `libbench2' and +FFTW libraries. To compile the `fftw-wisdom' (command-line) tool +(*note Wisdom Utilities::), compile `tools/fftw-wisdom.c' and link it +to the `libbench2' and FFTW libraries + + +File: fftw3.info, Node: Cycle Counters, Next: Generating your own code, Prev: Installation on non-Unix systems, Up: Installation and Customization + +10.3 Cycle Counters +=================== + +FFTW's planner actually executes and times different possible FFT +algorithms in order to pick the fastest plan for a given n. In order +to do this in as short a time as possible, however, the timer must have +a very high resolution, and to accomplish this we employ the hardware +"cycle counters" that are available on most CPUs. Currently, FFTW +supports the cycle counters on x86, PowerPC/POWER, Alpha, UltraSPARC +(SPARC v9), IA64, PA-RISC, and MIPS processors. + + Access to the cycle counters, unfortunately, is a compiler and/or +operating-system dependent task, often requiring inline assembly +language, and it may be that your compiler is not supported. If you are +_not_ supported, FFTW will by default fall back on its estimator +(effectively using `FFTW_ESTIMATE' for all plans). + + You can add support by editing the file `kernel/cycle.h'; normally, +this will involve adapting one of the examples already present in order +to use the inline-assembler syntax for your C compiler, and will only +require a couple of lines of code. Anyone adding support for a new +system to `cycle.h' is encouraged to email us at <fftw@fftw.org>. + + If a cycle counter is not available on your system (e.g. some +embedded processor), and you don't want to use estimated plans, as a +last resort you can use the `--with-slow-timer' option to `configure' +(on Unix) or `#define WITH_SLOW_TIMER' in `config.h' (elsewhere). This +will use the much lower-resolution `gettimeofday' function, or even +`clock' if the former is unavailable, and planning will be extremely +slow. + + +File: fftw3.info, Node: Generating your own code, Prev: Cycle Counters, Up: Installation and Customization + +10.4 Generating your own code +============================= + +The directory `genfft' contains the programs that were used to generate +FFTW's "codelets," which are hard-coded transforms of small sizes. We +do not expect casual users to employ the generator, which is a rather +sophisticated program that generates directed acyclic graphs of FFT +algorithms and performs algebraic simplifications on them. It was +written in Objective Caml, a dialect of ML, which is available at +`http://caml.inria.fr/ocaml/index.en.html'. + + If you have Objective Caml installed (along with recent versions of +GNU `autoconf', `automake', and `libtool'), then you can change the set +of codelets that are generated or play with the generation options. +The set of generated codelets is specified by the +`{dft,rdft}/{codelets,simd}/*/Makefile.am' files. For example, you can +add efficient REDFT codelets of small sizes by modifying +`rdft/codelets/r2r/Makefile.am'. After you modify any `Makefile.am' +files, you can type `sh bootstrap.sh' in the top-level directory +followed by `make' to re-generate the files. + + We do not provide more details about the code-generation process, +since we do not expect that most users will need to generate their own +code. However, feel free to contact us at <fftw@fftw.org> if you are +interested in the subject. + + You might find it interesting to learn Caml and/or some modern +programming techniques that we used in the generator (including monadic +programming), especially if you heard the rumor that Java and +object-oriented programming are the latest advancement in the field. +The internal operation of the codelet generator is described in the +paper, "A Fast Fourier Transform Compiler," by M. Frigo, which is +available from the FFTW home page (http://www.fftw.org) and also +appeared in the `Proceedings of the 1999 ACM SIGPLAN Conference on +Programming Language Design and Implementation (PLDI)'. + + +File: fftw3.info, Node: Acknowledgments, Next: License and Copyright, Prev: Installation and Customization, Up: Top + +11 Acknowledgments +****************** + +Matteo Frigo was supported in part by the Special Research Program SFB +F011 "AURORA" of the Austrian Science Fund FWF and by MIT Lincoln +Laboratory. For previous versions of FFTW, he was supported in part by +the Defense Advanced Research Projects Agency (DARPA), under Grants +N00014-94-1-0985 and F30602-97-1-0270, and by a Digital Equipment +Corporation Fellowship. + + Steven G. Johnson was supported in part by a Dept. of Defense NDSEG +Fellowship, an MIT Karl Taylor Compton Fellowship, and by the Materials +Research Science and Engineering Center program of the National Science +Foundation under award DMR-9400334. + + Code for the Cell Broadband Engine was graciously donated to the FFTW +project by the IBM Austin Research Lab and included in fftw-3.2. (This +code was removed in fftw-3.3.) + + Code for the MIPS paired-single SIMD support was graciously donated +to the FFTW project by CodeSourcery, Inc. + + We are grateful to Sun Microsystems Inc. for its donation of a +cluster of 9 8-processor Ultra HPC 5000 SMPs (24 Gflops peak). These +machines served as the primary platform for the development of early +versions of FFTW. + + We thank Intel Corporation for donating a four-processor Pentium Pro +machine. We thank the GNU/Linux community for giving us a decent OS to +run on that machine. + + We are thankful to the AMD corporation for donating an AMD Athlon XP +1700+ computer to the FFTW project. + + We thank the Compaq/HP testdrive program and VA Software Corporation +(SourceForge.net) for providing remote access to machines that were used +to test FFTW. + + The `genfft' suite of code generators was written using Objective +Caml, a dialect of ML. Objective Caml is a small and elegant language +developed by Xavier Leroy. The implementation is available from +`http://caml.inria.fr/' (http://caml.inria.fr/). In previous releases +of FFTW, `genfft' was written in Caml Light, by the same authors. An +even earlier implementation of `genfft' was written in Scheme, but Caml +is definitely better for this kind of application. + + FFTW uses many tools from the GNU project, including `automake', +`texinfo', and `libtool'. + + Prof. Charles E. Leiserson of MIT provided continuous support and +encouragement. This program would not exist without him. Charles also +proposed the name "codelets" for the basic FFT blocks. + + Prof. John D. Joannopoulos of MIT demonstrated continuing tolerance +of Steven's "extra-curricular" computer-science activities, as well as +remarkable creativity in working them into his grant proposals. +Steven's physics degree would not exist without him. + + Franz Franchetti wrote SIMD extensions to FFTW 2, which eventually +led to the SIMD support in FFTW 3. + + Stefan Kral wrote most of the K7 code generator distributed with FFTW +3.0.x and 3.1.x. + + Andrew Sterian contributed the Windows timing code in FFTW 2. + + Didier Miras reported a bug in the test procedure used in FFTW 1.2. +We now use a completely different test algorithm by Funda Ergun that +does not require a separate FFT program to compare against. + + Wolfgang Reimer contributed the Pentium cycle counter and a few fixes +that help portability. + + Ming-Chang Liu uncovered a well-hidden bug in the complex transforms +of FFTW 2.0 and supplied a patch to correct it. + + The FFTW FAQ was written in `bfnn' (Bizarre Format With No Name) and +formatted using the tools developed by Ian Jackson for the Linux FAQ. + + _We are especially thankful to all of our users for their continuing +support, feedback, and interest during our development of FFTW._ + + +File: fftw3.info, Node: License and Copyright, Next: Concept Index, Prev: Acknowledgments, Up: Top + +12 License and Copyright +************************ + +FFTW is Copyright (C) 2003, 2007-11 Matteo Frigo, Copyright (C) 2003, +2007-11 Massachusetts Institute of Technology. + + FFTW is free software; you can redistribute it and/or modify it +under the terms of the GNU General Public License as published by the +Free Software Foundation; either version 2 of the License, or (at your +option) any later version. + + This program is distributed in the hope that it will be useful, but +WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +General Public License for more details. + + You should have received a copy of the GNU General Public License +along with this program; if not, write to the Free Software Foundation, +Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA You +can also find the GPL on the GNU web site +(http://www.gnu.org/licenses/gpl-2.0.html). + + In addition, we kindly ask you to acknowledge FFTW and its authors in +any program or publication in which you use FFTW. (You are not +_required_ to do so; it is up to your common sense to decide whether +you want to comply with this request or not.) For general +publications, we suggest referencing: Matteo Frigo and Steven G. +Johnson, "The design and implementation of FFTW3," Proc. IEEE 93 (2), +216-231 (2005). + + Non-free versions of FFTW are available under terms different from +those of the General Public License. (e.g. they do not require you to +accompany any object code using FFTW with the corresponding source +code.) For these alternative terms you must purchase a license from +MIT's Technology Licensing Office. Users interested in such a license +should contact us (<fftw@fftw.org>) for more information. +