This is fftw3.info, produced by makeinfo version 4.13 from fftw3.texi.

This manual is for FFTW (version 3.3.4, 20 September 2013).

   Copyright (C) 2003 Matteo Frigo.

   Copyright (C) 2003 Massachusetts Institute of Technology.

     Permission is granted to make and distribute verbatim copies of
     this manual provided the copyright notice and this permission
     notice are preserved on all copies.

     Permission is granted to copy and distribute modified versions of
     this manual under the conditions for verbatim copying, provided
     that the entire resulting derived work is distributed under the
     terms of a permission notice identical to this one.

     Permission is granted to copy and distribute translations of this
     manual into another language, under the above conditions for
     modified versions, except that this permission notice may be
     stated in a translation approved by the Free Software Foundation.

INFO-DIR-SECTION Development
START-INFO-DIR-ENTRY
* fftw3: (fftw3). FFTW User's Manual.
END-INFO-DIR-ENTRY


File: fftw3.info, Node: Top, Next: Introduction, Prev: (dir), Up: (dir)

FFTW User Manual
****************

Welcome to FFTW, the Fastest Fourier Transform in the West. FFTW is a
collection of fast C routines to compute the discrete Fourier transform.
This manual documents FFTW version 3.3.4.

* Menu:

* Introduction::
* Tutorial::
* Other Important Topics::
* FFTW Reference::
* Multi-threaded FFTW::
* Distributed-memory FFTW with MPI::
* Calling FFTW from Modern Fortran::
* Calling FFTW from Legacy Fortran::
* Upgrading from FFTW version 2::
* Installation and Customization::
* Acknowledgments::
* License and Copyright::
* Concept Index::
* Library Index::


File: fftw3.info, Node: Introduction, Next: Tutorial, Prev: Top, Up: Top

1 Introduction
**************

This manual documents version 3.3.4 of FFTW, the _Fastest Fourier
Transform in the West_. FFTW is a comprehensive collection of fast C
routines for computing the discrete Fourier transform (DFT) and various
special cases thereof.
   * FFTW computes the DFT of complex data, real data, even- or
     odd-symmetric real data (these symmetric transforms are usually
     known as the discrete cosine or sine transform, respectively), and
     the discrete Hartley transform (DHT) of real data.

   * The input data can have arbitrary length. FFTW employs O(n
     log n) algorithms for all lengths, including prime numbers.

   * FFTW supports arbitrary multi-dimensional data.

   * FFTW supports the SSE, SSE2, AVX, Altivec, and MIPS PS instruction
     sets.

   * FFTW includes parallel (multi-threaded) transforms for
     shared-memory systems.

   * Starting with version 3.3, FFTW includes distributed-memory
     parallel transforms using MPI.

   We assume herein that you are familiar with the properties and uses
of the DFT that are relevant to your application. Otherwise, see e.g.
`The Fast Fourier Transform and Its Applications' by E. O. Brigham
(Prentice-Hall, Englewood Cliffs, NJ, 1988). Our web page
(http://www.fftw.org) also has links to FFT-related information online.

   In order to use FFTW effectively, you need to learn one basic concept
of FFTW's internal structure: FFTW does not use a fixed algorithm for
computing the transform, but instead it adapts the DFT algorithm to
details of the underlying hardware in order to maximize performance.
Hence, the computation of the transform is split into two phases.
First, FFTW's "planner" "learns" the fastest way to compute the
transform on your machine. The planner produces a data structure
called a "plan" that contains this information. Subsequently, the plan
is "executed" to transform the array of input data as dictated by the
plan. The plan can be reused as many times as needed. In typical
high-performance applications, many transforms of the same size are
computed and, consequently, a relatively expensive initialization of
this sort is acceptable. On the other hand, if you need a single
transform of a given size, the one-time cost of the planner becomes
significant. For this case, FFTW provides fast planners based on
heuristics or on previously computed plans.

   FFTW supports transforms of data with arbitrary length, rank,
multiplicity, and a general memory layout. In simple cases, however,
this generality may be unnecessary and confusing. Consequently, we
organized the interface to FFTW into three levels of increasing
generality.
   * The "basic interface" computes a single transform of
     contiguous data.

   * The "advanced interface" computes transforms of multiple or
     strided arrays.

   * The "guru interface" supports the most general data layouts,
     multiplicities, and strides.
   We expect that most users will be best served by the basic interface,
whereas the guru interface requires careful attention to the
documentation to avoid problems.

   Besides the automatic performance adaptation performed by the
planner, it is also possible for advanced users to customize FFTW
manually. For example, if code space is a concern, we provide a tool
that links only the subset of FFTW needed by your application.
Conversely, you may need to extend FFTW because the standard
distribution is not sufficient for your needs. For example, the
standard FFTW distribution works most efficiently for arrays whose size
can be factored into small primes (2, 3, 5, and 7), and otherwise it
uses a slower general-purpose routine. If you need efficient
transforms of other sizes, you can use FFTW's code generator, which
produces fast C programs ("codelets") for any particular array size you
may care about. For example, if you need transforms of size 513 = 19 x
3^3, you can customize FFTW to support the factor 19 efficiently.

   For more information regarding FFTW, see the paper, "The Design and
Implementation of FFTW3," by M. Frigo and S. G. Johnson, which was an
invited paper in `Proc. IEEE' 93 (2), p. 216 (2005). The code
generator is described in the paper "A fast Fourier transform compiler", by
M. Frigo, in the `Proceedings of the 1999 ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI), Atlanta,
Georgia, May 1999'. These papers, along with the latest version of
FFTW, the FAQ, benchmarks, and other links, are available at the FFTW
home page (http://www.fftw.org).

   The current version of FFTW incorporates many good ideas from the
past thirty years of FFT literature. In one way or another, FFTW uses
the Cooley-Tukey algorithm, the prime factor algorithm, Rader's
algorithm for prime sizes, and a split-radix algorithm (with a
"conjugate-pair" variation pointed out to us by Dan Bernstein). FFTW's
code generator also produces new algorithms that we do not completely
understand. The reader is referred to the cited papers for the
appropriate references.

   The rest of this manual is organized as follows. We first discuss
the sequential (single-processor) implementation. We start by
describing the basic interface/features of FFTW in *note Tutorial::.
Next, *note Other Important Topics:: discusses data alignment (*note
SIMD alignment and fftw_malloc::), the storage scheme of
multi-dimensional arrays (*note Multi-dimensional Array Format::), and
FFTW's mechanism for storing plans on disk (*note Words of
Wisdom-Saving Plans::). Next, *note FFTW Reference:: provides
comprehensive documentation of all FFTW's features. Parallel
transforms are discussed in their own chapters: *note Multi-threaded
FFTW:: and *note Distributed-memory FFTW with MPI::. Fortran
programmers can also use FFTW, as described in *note Calling FFTW from
Legacy Fortran:: and *note Calling FFTW from Modern Fortran::. *note
Installation and Customization:: explains how to install FFTW in your
computer system and how to adapt FFTW to your needs. License and
copyright information is given in *note License and Copyright::.
Finally, we thank all the people who helped us in *note
Acknowledgments::.


File: fftw3.info, Node: Tutorial, Next: Other Important Topics, Prev: Introduction, Up: Top

2 Tutorial
**********

* Menu:

* Complex One-Dimensional DFTs::
* Complex Multi-Dimensional DFTs::
* One-Dimensional DFTs of Real Data::
* Multi-Dimensional DFTs of Real Data::
* More DFTs of Real Data::

This chapter describes the basic usage of FFTW, i.e., how to compute the
Fourier transform of a single array. This chapter tells the truth, but
not the _whole_ truth. Specifically, FFTW implements additional
routines and flags that are not documented here, although in many cases
we try to indicate where added capabilities exist. For more complete
information, see *note FFTW Reference::. (Note that you need to
compile and install FFTW before you can use it in a program. For the
details of the installation, see *note Installation and
Customization::.)

   We recommend that you read this tutorial in order.(1) At the least,
read the first section (*note Complex One-Dimensional DFTs::) before
reading any of the others, even if your main interest lies in one of
the other transform types.

   Users of FFTW version 2 and earlier may also want to read *note
Upgrading from FFTW version 2::.

   ---------- Footnotes ----------

   (1) You can read the tutorial in bit-reversed order after computing
your first transform.


File: fftw3.info, Node: Complex One-Dimensional DFTs, Next: Complex Multi-Dimensional DFTs, Prev: Tutorial, Up: Tutorial

2.1 Complex One-Dimensional DFTs
================================

     Plan: To bother about the best method of accomplishing an
     accidental result. [Ambrose Bierce, `The Enlarged Devil's
     Dictionary'.]

   The basic usage of FFTW to compute a one-dimensional DFT of size `N'
is simple, and it typically looks something like this code:

     #include <fftw3.h>
     ...
     {
         fftw_complex *in, *out;
         fftw_plan p;
         ...
         in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
         out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
         p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
         ...
         fftw_execute(p); /* repeat as needed */
         ...
         fftw_destroy_plan(p);
         fftw_free(in); fftw_free(out);
     }

   You must link this code with the `fftw3' library. On Unix systems,
link with `-lfftw3 -lm'.

   The example code first allocates the input and output arrays. You
can allocate them in any way that you like, but we recommend using
`fftw_malloc', which behaves like `malloc' except that it properly
aligns the array when SIMD instructions (such as SSE and Altivec) are
available (*note SIMD alignment and fftw_malloc::). [Alternatively, we
provide a convenient wrapper function `fftw_alloc_complex(N)' which has
the same effect.]

   The data is an array of type `fftw_complex', which is by default a
`double[2]' composed of the real (`in[i][0]') and imaginary
(`in[i][1]') parts of a complex number.

   The next step is to create a "plan", which is an object that
contains all the data that FFTW needs to compute the FFT. This
function creates the plan:

     fftw_plan fftw_plan_dft_1d(int n, fftw_complex *in, fftw_complex *out,
                                int sign, unsigned flags);

   The first argument, `n', is the size of the transform you are trying
to compute. The size `n' can be any positive integer, but sizes that
are products of small factors are transformed most efficiently
(although prime sizes still use an O(n log n) algorithm).

   The next two arguments are pointers to the input and output arrays of
the transform. These pointers can be equal, indicating an "in-place"
transform.

   The fourth argument, `sign', can be either `FFTW_FORWARD' (`-1') or
`FFTW_BACKWARD' (`+1'), and indicates the direction of the transform
you are interested in; technically, it is the sign of the exponent in
the transform.

   The `flags' argument is usually either `FFTW_MEASURE' or `FFTW_ESTIMATE'.
`FFTW_MEASURE' instructs FFTW to run and measure the execution time of
several FFTs in order to find the best way to compute the transform of
size `n'. This process takes some time (usually a few seconds),
depending on your machine and on the size of the transform.
`FFTW_ESTIMATE', on the contrary, does not run any computation and just
builds a reasonable plan that is probably sub-optimal. In short, if
your program performs many transforms of the same size and
initialization time is not important, use `FFTW_MEASURE'; otherwise use
the estimate.

   _You must create the plan before initializing the input_, because
`FFTW_MEASURE' overwrites the `in'/`out' arrays.
(Technically, `FFTW_ESTIMATE' does not touch your arrays, but you should
always create plans first just to be sure.)

   Once the plan has been created, you can use it as many times as you
like for transforms on the specified `in'/`out' arrays, computing the
actual transforms via `fftw_execute(plan)':
     void fftw_execute(const fftw_plan plan);

   The DFT results are stored in-order in the array `out', with the
zero-frequency (DC) component in `out[0]'. If `in != out', the
transform is "out-of-place" and the input array `in' is not modified.
Otherwise, the input array is overwritten with the transform.

   If you want to transform a _different_ array of the same size, you
can create a new plan with `fftw_plan_dft_1d' and FFTW automatically
reuses the information from the previous plan, if possible.
Alternatively, with the "guru" interface you can apply a given plan to
a different array, if you are careful. *Note FFTW Reference::.

   When you are done with the plan, you deallocate it by calling
`fftw_destroy_plan(plan)':
     void fftw_destroy_plan(fftw_plan plan);
   If you allocate an array with `fftw_malloc()' you must deallocate it
with `fftw_free()'. Do not use `free()' or, heaven forbid, `delete'.

   FFTW computes an _unnormalized_ DFT. Thus, computing a forward
followed by a backward transform (or vice versa) results in the original
array scaled by `n'. For the definition of the DFT, see *note What
FFTW Really Computes::.

   If you have a C compiler, such as `gcc', that supports the C99
standard, and you `#include <complex.h>' _before_ `<fftw3.h>', then
`fftw_complex' is the native double-precision complex type and you can
manipulate it with ordinary arithmetic.
Otherwise, FFTW defines its
own complex type, which is bit-compatible with the C99 complex type.
*Note Complex numbers::. (The C++ `<complex>' template class may also
be usable via a typecast.)

   To use single or long-double precision versions of FFTW, replace the
`fftw_' prefix by `fftwf_' or `fftwl_' and link with `-lfftw3f' or
`-lfftw3l', but use the _same_ `<fftw3.h>' header file.

   Many more flags exist besides `FFTW_MEASURE' and `FFTW_ESTIMATE'.
For example, use `FFTW_PATIENT' if you're willing to wait even longer
for a possibly even faster plan (*note FFTW Reference::). You can also
save plans for future use, as described by *note Words of Wisdom-Saving
Plans::.


File: fftw3.info, Node: Complex Multi-Dimensional DFTs, Next: One-Dimensional DFTs of Real Data, Prev: Complex One-Dimensional DFTs, Up: Tutorial

2.2 Complex Multi-Dimensional DFTs
==================================

Multi-dimensional transforms work much the same way as one-dimensional
transforms: you allocate arrays of `fftw_complex' (preferably using
`fftw_malloc'), create an `fftw_plan', execute it as many times as you
want with `fftw_execute(plan)', and clean up with
`fftw_destroy_plan(plan)' (and `fftw_free').

   FFTW provides two routines for creating plans for 2d and 3d
transforms, and one routine for creating plans of arbitrary
dimensionality.
The 2d and 3d routines have the following signature:
     fftw_plan fftw_plan_dft_2d(int n0, int n1,
                                fftw_complex *in, fftw_complex *out,
                                int sign, unsigned flags);
     fftw_plan fftw_plan_dft_3d(int n0, int n1, int n2,
                                fftw_complex *in, fftw_complex *out,
                                int sign, unsigned flags);

   These routines create plans for `n0' by `n1' two-dimensional (2d)
transforms and `n0' by `n1' by `n2' 3d transforms, respectively. All
of these transforms operate on contiguous arrays in the C-standard
"row-major" order, so that the last dimension has the fastest-varying
index in the array. This layout is described further in *note
Multi-dimensional Array Format::.

   FFTW can also compute transforms of higher dimensionality. In order
to avoid confusion between the various meanings of the word
"dimension", we use the term _rank_ to denote the number of independent
indices in an array.(1) For example, we say that a 2d transform has
rank 2, a 3d transform has rank 3, and so on. You can plan transforms
of arbitrary rank by means of the following function:

     fftw_plan fftw_plan_dft(int rank, const int *n,
                             fftw_complex *in, fftw_complex *out,
                             int sign, unsigned flags);

   Here, `n' is a pointer to an array `n[rank]' denoting an `n[0]' by
`n[1]' by ... by `n[rank-1]' transform. Thus, for example, the call
     fftw_plan_dft_2d(n0, n1, in, out, sign, flags);
   is equivalent to the following code fragment:
     int n[2];
     n[0] = n0;
     n[1] = n1;
     fftw_plan_dft(2, n, in, out, sign, flags);
   `fftw_plan_dft' is not restricted to 2d and 3d transforms, however,
but it can plan transforms of arbitrary rank.
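
   The row-major convention can be made concrete with a small sketch.
The helper `row_major_offset' below is hypothetical (it is not part of
FFTW); it computes where the element with indices `idx[0]', ...,
`idx[rank-1]' sits in a contiguous array whose dimensions are `n[0]' by
... by `n[rank-1]', with the last index varying fastest.

```c
#include <stddef.h>

/* Hypothetical helper (not an FFTW routine): flat offset of element
   idx[0..rank-1] in a contiguous row-major array with dimensions
   n[0..rank-1].  The last index varies fastest. */
size_t row_major_offset(int rank, const int *n, const int *idx)
{
    size_t off = 0;
    for (int d = 0; d < rank; ++d)
        off = off * (size_t) n[d] + (size_t) idx[d];
    return off;
}
```

   For a 2d array of size `n0' by `n1', this reduces to `i*n1 + j' for
element `(i, j)', which is how you would address the `in' and `out'
arrays given to `fftw_plan_dft_2d'.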

   You may have noticed that all the planner routines described so far
have overlapping functionality. For example, you can plan a 1d or 2d
transform by using `fftw_plan_dft' with a `rank' of `1' or `2', or even
by calling `fftw_plan_dft_3d' with `n0' and/or `n1' equal to `1' (with
no loss in efficiency). This pattern continues, and FFTW's planning
routines in general form a "partial order," sequences of interfaces
with strictly increasing generality but correspondingly greater
complexity.

   `fftw_plan_dft' is the most general complex-DFT routine that we
describe in this tutorial, but there are also the advanced and guru
interfaces, which allow one to efficiently combine multiple/strided
transforms into a single FFTW plan, transform a subset of a larger
multi-dimensional array, and/or to handle more general complex-number
formats. For more information, see *note FFTW Reference::.

   ---------- Footnotes ----------

   (1) The term "rank" is commonly used in the APL, FORTRAN, and Common
Lisp traditions, although it is not so common in the C world.


File: fftw3.info, Node: One-Dimensional DFTs of Real Data, Next: Multi-Dimensional DFTs of Real Data, Prev: Complex Multi-Dimensional DFTs, Up: Tutorial

2.3 One-Dimensional DFTs of Real Data
=====================================

In many practical applications, the input data `in[i]' are purely real
numbers, in which case the DFT output satisfies the "Hermitian" redundancy:
`out[i]' is the conjugate of `out[n-i]'. It is possible to take
advantage of these circumstances in order to achieve roughly a factor
of two improvement in both speed and memory usage.

   In exchange for these speed and space advantages, the user sacrifices
some of the simplicity of FFTW's complex transforms. First of all, the
input and output arrays are of _different sizes and types_: the input
is `n' real numbers, while the output is `n/2+1' complex numbers (the
non-redundant outputs); this also requires slight "padding" of the
input array for in-place transforms. Second, the inverse transform
(complex to real) has the side-effect of _overwriting its input array_,
by default. Neither of these inconveniences should pose a serious
problem for users, but it is important to be aware of them.

   The routines to perform real-data transforms are almost the same as
those for complex transforms: you allocate arrays of `double' and/or
`fftw_complex' (preferably using `fftw_malloc' or
`fftw_alloc_complex'), create an `fftw_plan', execute it as many times
as you want with `fftw_execute(plan)', and clean up with
`fftw_destroy_plan(plan)' (and `fftw_free'). The only differences are
that the input (or output) is of type `double' and there are new
routines to create the plan. In one dimension:

     fftw_plan fftw_plan_dft_r2c_1d(int n, double *in, fftw_complex *out,
                                    unsigned flags);
     fftw_plan fftw_plan_dft_c2r_1d(int n, fftw_complex *in, double *out,
                                    unsigned flags);

   for the real input to complex-Hermitian output ("r2c") and
complex-Hermitian input to real output ("c2r") transforms. Unlike the
complex DFT planner, there is no `sign' argument. Instead, r2c DFTs
are always `FFTW_FORWARD' and c2r DFTs are always `FFTW_BACKWARD'. (For
single/long-double precision `fftwf' and `fftwl', `double' should be
replaced by `float' and `long double', respectively.)
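
   The array sizes implied by these signatures can be written down as
two small helpers. This is a sketch: the names `r2c_complex_len' and
`r2c_inplace_real_len' are hypothetical, not FFTW routines, and they
only encode the `n/2+1' rule (integer division, rounded down) and the
`2*(n/2+1)' size of a shared in-place buffer.

```c
#include <stddef.h>

/* Hypothetical helper: number of complex outputs of a 1d r2c
   transform of logical size n (integer division rounds down). */
size_t r2c_complex_len(int n)
{
    return (size_t) (n / 2 + 1);
}

/* Hypothetical helper: number of doubles a shared buffer needs for an
   in-place 1d r2c transform, i.e. room for n/2+1 complex values. */
size_t r2c_inplace_real_len(int n)
{
    return 2 * r2c_complex_len(n);
}
```

   For example, `n = 8' gives 5 complex outputs and a 10-element real
buffer for in-place use, while `n = 7' gives 4 complex outputs and an
8-element real buffer.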

   Here, `n' is the "logical" size of the DFT, not necessarily the
physical size of the array. In particular, the real (`double') array
has `n' elements, while the complex (`fftw_complex') array has `n/2+1'
elements (where the division is rounded down). For an in-place
transform, `in' and `out' are aliased to the same array, which must be
big enough to hold both; so, the real array would actually have
`2*(n/2+1)' elements, where the elements beyond the first `n' are
unused padding. (Note that this is very different from the concept of
"zero-padding" a transform to a larger length, which changes the
logical size of the DFT by actually adding new input data.) The kth
element of the complex array is exactly the same as the kth element of
the corresponding complex DFT. All positive `n' are supported;
products of small factors are most efficient, but an O(n log n)
algorithm is used even for prime sizes.

   As noted above, the c2r transform destroys its input array even for
out-of-place transforms. This can be prevented, if necessary, by
including `FFTW_PRESERVE_INPUT' in the `flags', with unfortunately some
sacrifice in performance. This flag is also not currently supported
for multi-dimensional real DFTs (next section).

   Readers familiar with DFTs of real data will recall that the 0th (the
"DC") and `n/2'-th (the "Nyquist" frequency, when `n' is even) elements
of the complex output are purely real. Some implementations therefore
store the Nyquist element where the DC imaginary part would go, in
order to make the input and output arrays the same size.
Such packing,
however, does not generalize well to multi-dimensional transforms, and
the space savings are minuscule in any case; FFTW does not support it.

   An alternative interface for one-dimensional r2c and c2r DFTs can be
found in the `r2r' interface (*note The Halfcomplex-format DFT::), with
"halfcomplex"-format output that _is_ the same size (and type) as the
input array. That interface, although it is not very useful for
multi-dimensional transforms, may sometimes yield better performance.


File: fftw3.info, Node: Multi-Dimensional DFTs of Real Data, Next: More DFTs of Real Data, Prev: One-Dimensional DFTs of Real Data, Up: Tutorial

2.4 Multi-Dimensional DFTs of Real Data
=======================================

Multi-dimensional DFTs of real data use the following planner routines:

     fftw_plan fftw_plan_dft_r2c_2d(int n0, int n1,
                                    double *in, fftw_complex *out,
                                    unsigned flags);
     fftw_plan fftw_plan_dft_r2c_3d(int n0, int n1, int n2,
                                    double *in, fftw_complex *out,
                                    unsigned flags);
     fftw_plan fftw_plan_dft_r2c(int rank, const int *n,
                                 double *in, fftw_complex *out,
                                 unsigned flags);

   as well as the corresponding `c2r' routines with the input/output
types swapped. These routines work similarly to their complex
analogues, except for the fact that here the complex output array is cut
roughly in half and the real array requires padding for in-place
transforms (as in 1d, above).

   As before, `n' is the logical size of the array, and the
consequences of this on the format of the complex arrays deserve
careful attention. Suppose that the real data has dimensions n[0] x
n[1] x n[2] x ...
x n[d-1] (in row-major order). Then, after an r2c
transform, the output is an n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1)
array of `fftw_complex' values in row-major order, corresponding to
slightly over half of the output of the corresponding complex DFT.
(The division is rounded down.) The ordering of the data is otherwise
exactly the same as in the complex-DFT case.

   For out-of-place transforms, this is the end of the story: the real
data is stored as a row-major array of size n[0] x n[1] x n[2] x ... x
n[d-1] and the complex data is stored as a row-major array of size
n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1).

   For in-place transforms, however, extra padding of the real-data
array is necessary because the complex array is larger than the real
array, and the two arrays share the same memory locations. Thus, for
in-place transforms, the final dimension of the real-data array must be
padded with extra values to accommodate the size of the complex
data--two values if the last dimension is even and one if it is odd. That
is, the last dimension of the real data must physically contain 2 *
(n[d-1]/2+1) `double' values (exactly enough to hold the complex data).
This physical array size does not, however, change the _logical_ array
size--only n[d-1] values are actually stored in the last dimension, and
n[d-1] is the last dimension passed to the plan-creation routine.

   For example, consider the transform of a two-dimensional real array
of size `n0' by `n1'. The output of the r2c transform is a
two-dimensional complex array of size `n0' by `n1/2+1', where the `y'
dimension has been cut nearly in half because of redundancies in the
output.
Because `fftw_complex' is twice the size of `double', the
output array is slightly bigger than the input array. Thus, if we want
to compute the transform in place, we must _pad_ the input array so
that it is of size `n0' by `2*(n1/2+1)'. If `n1' is even, then there
are two padding elements at the end of each row (which need not be
initialized, as they are only used for output).

   These transforms are unnormalized, so an r2c followed by a c2r
transform (or vice versa) will result in the original data scaled by
the number of real data elements--that is, the product of the (logical)
dimensions of the real data.

   (Because the last dimension is treated specially, if it is equal to
`1' the transform is _not_ equivalent to a lower-dimensional r2c/c2r
transform. In that case, the last complex dimension also has size `1'
(`=1/2+1'), and no advantage is gained over the complex transforms.)


File: fftw3.info, Node: More DFTs of Real Data, Prev: Multi-Dimensional DFTs of Real Data, Up: Tutorial

2.5 More DFTs of Real Data
==========================

* Menu:

* The Halfcomplex-format DFT::
* Real even/odd DFTs (cosine/sine transforms)::
* The Discrete Hartley Transform::

FFTW supports several other transform types via a unified "r2r"
(real-to-real) interface, so called because it takes a real (`double')
array and outputs a real array of the same size. These r2r transforms
currently fall into three categories: DFTs of real input and
complex-Hermitian output in halfcomplex format, DFTs of real input with
even/odd symmetry (a.k.a.
discrete cosine/sine transforms, DCTs/DSTs),
and discrete Hartley transforms (DHTs), all described in more detail by
the following sections.

The r2r transforms follow the by now familiar interface of creating
an `fftw_plan', executing it with `fftw_execute(plan)', and destroying
it with `fftw_destroy_plan(plan)'. Furthermore, all r2r transforms
share the same planner interface:

     fftw_plan fftw_plan_r2r_1d(int n, double *in, double *out,
                                fftw_r2r_kind kind, unsigned flags);
     fftw_plan fftw_plan_r2r_2d(int n0, int n1, double *in, double *out,
                                fftw_r2r_kind kind0, fftw_r2r_kind kind1,
                                unsigned flags);
     fftw_plan fftw_plan_r2r_3d(int n0, int n1, int n2,
                                double *in, double *out,
                                fftw_r2r_kind kind0,
                                fftw_r2r_kind kind1,
                                fftw_r2r_kind kind2,
                                unsigned flags);
     fftw_plan fftw_plan_r2r(int rank, const int *n, double *in, double *out,
                             const fftw_r2r_kind *kind, unsigned flags);

Just as for the complex DFT, these plan 1d/2d/3d/multi-dimensional
transforms for contiguous arrays in row-major order, transforming (real)
input to output of the same size, where `n' specifies the _physical_
dimensions of the arrays. All positive `n' are supported (with the
exception of `n=1' for the `FFTW_REDFT00' kind, noted in the real-even
subsection below); products of small factors are most efficient
(factorizing `n-1' and `n+1' for `FFTW_REDFT00' and `FFTW_RODFT00'
kinds, described below), but an O(n log n) algorithm is used even for
prime sizes.

Each dimension has a "kind" parameter, of type `fftw_r2r_kind',
specifying the kind of r2r transform to be used for that dimension.
(In the case of `fftw_plan_r2r', this is an array `kind[rank]' where
`kind[i]' is the transform kind for the dimension `n[i]'.) The kind
can be one of a set of predefined constants, defined in the following
subsections.

In other words, FFTW computes the separable product of the specified
r2r transforms over each dimension, which can be used e.g. for partial
differential equations with mixed boundary conditions. (For some r2r
kinds, notably the halfcomplex DFT and the DHT, such a separable
product is somewhat problematic in more than one dimension, however, as
is described below.)

In the current version of FFTW, all r2r transforms except for the
halfcomplex type are computed via pre- or post-processing of
halfcomplex transforms, and they are therefore not as fast as they
could be. Since most other general DCT/DST codes employ a similar
algorithm, however, FFTW's implementation should provide at least
competitive performance.

File: fftw3.info, Node: The Halfcomplex-format DFT, Next: Real even/odd DFTs (cosine/sine transforms), Prev: More DFTs of Real Data, Up: More DFTs of Real Data

2.5.1 The Halfcomplex-format DFT
--------------------------------

An r2r kind of `FFTW_R2HC' ("r2hc") corresponds to an r2c DFT (*note
One-Dimensional DFTs of Real Data::) but with "halfcomplex" format
output, and may sometimes be faster and/or more convenient than the
latter. The inverse "hc2r" transform is of kind `FFTW_HC2R'.
The halfcomplex format
consists of the non-redundant half of the complex output for a 1d
real-input DFT of size `n', stored as a sequence of `n' real numbers
(`double') in the format:

     r0, r1, r2, ..., r(n/2), i((n+1)/2-1), ..., i2, i1

Here, rk is the real part of the kth output, and ik is the imaginary
part. (Division by 2 is rounded down.) For a halfcomplex array
`hc[n]', the kth component thus has its real part in `hc[k]' and its
imaginary part in `hc[n-k]', with the exception of `k == 0' or
`k == n/2' (the latter only if `n' is even)--in these two cases, the
imaginary part is zero due to symmetries of the real-input DFT, and is
not stored. Thus, the r2hc transform of `n' real values is a
halfcomplex array of length `n', and vice versa for hc2r.

Aside from the differing format, the output of
`FFTW_R2HC'/`FFTW_HC2R' is otherwise exactly the same as for the
corresponding 1d r2c/c2r transform (i.e. `FFTW_FORWARD'/`FFTW_BACKWARD'
transforms, respectively). Recall that these transforms are
unnormalized, so r2hc followed by hc2r will result in the original data
multiplied by `n'. Furthermore, like the c2r transform, an
out-of-place hc2r transform will _destroy its input_ array.

Although these halfcomplex transforms can be used with the
multi-dimensional r2r interface, the interpretation of such a separable
product of transforms along each dimension is problematic. For example,
consider a two-dimensional `n0' by `n1', r2hc by r2hc transform planned
by `fftw_plan_r2r_2d(n0, n1, in, out, FFTW_R2HC, FFTW_R2HC,
FFTW_MEASURE)'. Conceptually, FFTW first transforms the rows (of size
`n1') to produce halfcomplex rows, and then transforms the columns (of
size `n0').
Half of these column transforms, however, are of imaginary
parts, and should therefore be multiplied by i and combined with the
r2hc transforms of the real columns to produce the 2d DFT amplitudes;
FFTW's r2r transform does _not_ perform this combination for you.
Thus, if a multi-dimensional real-input/output DFT is required, we
recommend using the ordinary r2c/c2r interface (*note Multi-Dimensional
DFTs of Real Data::).

File: fftw3.info, Node: Real even/odd DFTs (cosine/sine transforms), Next: The Discrete Hartley Transform, Prev: The Halfcomplex-format DFT, Up: More DFTs of Real Data

2.5.2 Real even/odd DFTs (cosine/sine transforms)
-------------------------------------------------

The Fourier transform of a real-even function f(-x) = f(x) is
real-even, and i times the Fourier transform of a real-odd function
f(-x) = -f(x) is real-odd. Similar results hold for a discrete Fourier
transform, and thus for these symmetries the need for complex
inputs/outputs is entirely eliminated. Moreover, one gains a factor of
two in speed/space from the fact that the data are real, and an
additional factor of two from the even/odd symmetry: only the
non-redundant (first) half of the array need be stored. The result is
the real-even DFT ("REDFT") and the real-odd DFT ("RODFT"), also known
as the discrete cosine and sine transforms ("DCT" and "DST"),
respectively.

(In this section, we describe the 1d transforms; multi-dimensional
transforms are just a separable product of these transforms operating
along each dimension.)

Because of the discrete sampling, one has an additional choice: is
the data even/odd around a sampling point, or around the point halfway
between two samples? The latter corresponds to _shifting_ the samples
by _half_ an interval, and gives rise to several transform variants
denoted by REDFTab and RODFTab: a and b are 0 or 1, and indicate
whether the input (a) and/or output (b) are shifted by half a sample (1
means it is shifted). These are also known as types I-IV of the DCT
and DST, and all four types are supported by FFTW's r2r interface.(1)

The r2r kinds for the various REDFT and RODFT types supported by
FFTW, along with the boundary conditions at both ends of the _input_
array (`n' real numbers `in[j=0..n-1]'), are:

   * `FFTW_REDFT00' (DCT-I): even around j=0 and even around j=n-1.

   * `FFTW_REDFT10' (DCT-II, "the" DCT): even around j=-0.5 and even
     around j=n-0.5.

   * `FFTW_REDFT01' (DCT-III, "the" IDCT): even around j=0 and odd
     around j=n.

   * `FFTW_REDFT11' (DCT-IV): even around j=-0.5 and odd around j=n-0.5.

   * `FFTW_RODFT00' (DST-I): odd around j=-1 and odd around j=n.

   * `FFTW_RODFT10' (DST-II): odd around j=-0.5 and odd around j=n-0.5.

   * `FFTW_RODFT01' (DST-III): odd around j=-1 and even around j=n-1.

   * `FFTW_RODFT11' (DST-IV): odd around j=-0.5 and even around j=n-0.5.


Note that these symmetries apply to the "logical" array being
transformed; *there are no constraints on your physical input data*.
So, for example, if you specify a size-5 REDFT00 (DCT-I) of the data
abcde, it corresponds to the DFT of the logical even array abcdedcb of
size 8.
A size-4 REDFT10 (DCT-II) of the data abcd corresponds to the
size-8 logical DFT of the even array abcddcba, shifted by half a sample.

All of these transforms are invertible. The inverse of R*DFT00 is
R*DFT00; of R*DFT10 is R*DFT01 and vice versa (these are often called
simply "the" DCT and IDCT, respectively); and of R*DFT11 is R*DFT11.
However, the transforms computed by FFTW are unnormalized, exactly like
the corresponding real and complex DFTs, so computing a transform
followed by its inverse yields the original array scaled by N, where N
is the _logical_ DFT size. For REDFT00, N=2(n-1); for RODFT00,
N=2(n+1); otherwise, N=2n.

Note that the boundary conditions of the transform output array are
given by the input boundary conditions of the inverse transform. Thus,
the above transforms are all inequivalent in terms of input/output
boundary conditions, even neglecting the 0.5 shift difference.

FFTW is most efficient when N is a product of small factors; note
that this _differs_ from the factorization of the physical size `n' for
REDFT00 and RODFT00! There is another oddity: `n=1' REDFT00 transforms
correspond to N=0, and so are _not defined_ (the planner will return
`NULL'). Otherwise, any positive `n' is supported.

For the precise mathematical definitions of these transforms as used
by FFTW, see *note What FFTW Really Computes::. (For people accustomed
to the DCT/DST, FFTW's definitions have a coefficient of 2 in front of
the cos/sin functions so that they correspond precisely to an even/odd
DFT of size N.
Some authors also include additional multiplicative
factors of sqrt(2) for selected inputs and outputs; this makes the
transform orthogonal, but sacrifices the direct equivalence to a
symmetric DFT.)

Which type do you need?
.......................

Since the required flavor of even/odd DFT depends upon your problem,
you are the best judge of this choice, but we can make a few comments
on relative efficiency to help you in your selection. In particular,
R*DFT01 and R*DFT10 tend to be slightly faster than R*DFT11 (especially
for odd sizes), while the R*DFT00 transforms are sometimes
significantly slower (especially for even sizes).(2)

Thus, if only the boundary conditions on the transform inputs are
specified, we generally recommend R*DFT10 over R*DFT00 and R*DFT01 over
R*DFT11 (unless the half-sample shift or the self-inverse property is
significant for your problem).

If performance is important to you and you are using only small sizes
(say n<200), e.g. for multi-dimensional transforms, then you might
consider generating hard-coded transforms of those sizes and types that
you are interested in (*note Generating your own code::).

We are interested in hearing what types of symmetric transforms you
find most useful.

---------- Footnotes ----------

(1) There are also type V-VIII transforms, which correspond to a
logical DFT of _odd_ size N, independent of whether the physical size
`n' is odd, but we do not support these variants.

(2) R*DFT00 is sometimes slower in FFTW because we discovered that
the standard algorithm for computing this by a pre/post-processed real
DFT--the algorithm used in FFTPACK, Numerical Recipes, and other
sources for decades now--has serious numerical problems: it already
loses several decimal places of accuracy for 16k sizes. There seem to
be only two alternatives in the literature that do not suffer
similarly: a recursive decomposition into smaller DCTs, which would
require a large set of codelets for efficiency and generality, or
sacrificing a factor of 2 in speed to use a real DFT of twice the size.
We currently employ the latter technique for general n, as well as a
limited form of the former method: a split-radix decomposition when n
is odd (N a multiple of 4). For N containing many factors of 2, the
split-radix method seems to recover most of the speed of the standard
algorithm without the accuracy tradeoff.

File: fftw3.info, Node: The Discrete Hartley Transform, Prev: Real even/odd DFTs (cosine/sine transforms), Up: More DFTs of Real Data

2.5.3 The Discrete Hartley Transform
------------------------------------

If you are planning to use the DHT because you've heard that it is
"faster" than the DFT (FFT), *stop here*. The DHT is not faster than
the DFT. That story is an old but enduring misconception that was
debunked in 1987.

The discrete Hartley transform (DHT) is an invertible linear
transform closely related to the DFT. In the DFT, one multiplies each
input by cos - i * sin (a complex exponential), whereas in the DHT each
input is multiplied by simply cos + sin.
Thus, the DHT transforms `n'
real numbers to `n' real numbers, and has the convenient property of
being its own inverse. In FFTW, a DHT (of any positive `n') can be
specified by an r2r kind of `FFTW_DHT'.

Like the DFT, in FFTW the DHT is unnormalized, so computing a DHT of
size `n' followed by another DHT of the same size will result in the
original array multiplied by `n'.

The DHT was originally proposed as a more efficient alternative to
the DFT for real data, but it was subsequently shown that a specialized
DFT (such as FFTW's r2hc or r2c transforms) could be just as fast. In
FFTW, the DHT is actually computed by post-processing an r2hc
transform, so there is ordinarily no reason to prefer it from a
performance perspective.(1) However, we have heard rumors that the DHT
might be the most appropriate transform in its own right for certain
applications, and we would be very interested to hear from anyone who
finds it useful.

If `FFTW_DHT' is specified for multiple dimensions of a
multi-dimensional transform, FFTW computes the separable product of 1d
DHTs along each dimension. Unfortunately, this is not quite the same
thing as a true multi-dimensional DHT; you can compute the latter, if
necessary, with at most `rank-1' post-processing passes [see e.g. H.
Hao and R. N. Bracewell, Proc. IEEE 75, 264-266 (1987)].

For the precise mathematical definition of the DHT as used by FFTW,
see *note What FFTW Really Computes::.

---------- Footnotes ----------

(1) We provide the DHT mainly as a byproduct of some internal
algorithms.
FFTW computes a real input/output DFT of _prime_ size by
re-expressing it as a DHT plus post/pre-processing and then using
Rader's prime-DFT algorithm adapted to the DHT.

File: fftw3.info, Node: Other Important Topics, Next: FFTW Reference, Prev: Tutorial, Up: Top

3 Other Important Topics
************************

* Menu:

* SIMD alignment and fftw_malloc::
* Multi-dimensional Array Format::
* Words of Wisdom-Saving Plans::
* Caveats in Using Wisdom::

File: fftw3.info, Node: SIMD alignment and fftw_malloc, Next: Multi-dimensional Array Format, Prev: Other Important Topics, Up: Other Important Topics

3.1 SIMD alignment and fftw_malloc
==================================

SIMD, which stands for "Single Instruction Multiple Data," is a set of
special operations supported by some processors to perform a single
operation on several numbers (usually 2 or 4) simultaneously. SIMD
floating-point instructions are available on several popular CPUs:
SSE/SSE2/AVX on recent x86/x86-64 processors, AltiVec (single precision)
on some PowerPCs (Apple G4 and higher), NEON on some ARM models, and
MIPS Paired Single (currently only in FFTW 3.2.x). FFTW can be
compiled to support the SIMD instructions on any of these systems.

A program linking to an FFTW library compiled with SIMD support can
obtain a nonnegligible speedup for most complex and r2c/c2r transforms.
In order to obtain this speedup, however, the arrays of complex (or
real) data passed to FFTW must be specially aligned in memory
(typically 16-byte aligned), and often this alignment is more stringent
than that provided by the usual `malloc' (etc.)
allocation routines.

In order to guarantee proper alignment for SIMD, therefore, in case
your program is ever linked against a SIMD-using FFTW, we recommend
allocating your transform data with `fftw_malloc' and de-allocating it
with `fftw_free'. These have exactly the same interface and behavior as
`malloc'/`free', except that for a SIMD FFTW they ensure that the
returned pointer has the necessary alignment (by calling `memalign' or
its equivalent on your OS).

You are not _required_ to use `fftw_malloc'. You can allocate your
data in any way that you like, from `malloc' to `new' (in C++) to a
fixed-size array declaration. If the array happens not to be properly
aligned, FFTW will not use the SIMD extensions.

Since `fftw_malloc' only ever needs to be used for real and complex
arrays, we provide two convenient wrapper routines `fftw_alloc_real(N)'
and `fftw_alloc_complex(N)' that are equivalent to
`(double*)fftw_malloc(sizeof(double) * N)' and
`(fftw_complex*)fftw_malloc(sizeof(fftw_complex) * N)', respectively
(or their equivalents in other precisions).

File: fftw3.info, Node: Multi-dimensional Array Format, Next: Words of Wisdom-Saving Plans, Prev: SIMD alignment and fftw_malloc, Up: Other Important Topics

3.2 Multi-dimensional Array Format
==================================

This section describes the format in which multi-dimensional arrays are
stored in FFTW. We felt that a detailed discussion of this topic was
necessary. Since several different formats are common, this topic is
often a source of confusion.

* Menu:

* Row-major Format::
* Column-major Format::
* Fixed-size Arrays in C::
* Dynamic Arrays in C::
* Dynamic Arrays in C-The Wrong Way::

File: fftw3.info, Node: Row-major Format, Next: Column-major Format, Prev: Multi-dimensional Array Format, Up: Multi-dimensional Array Format

3.2.1 Row-major Format
----------------------

The multi-dimensional arrays passed to `fftw_plan_dft' etcetera are
expected to be stored as a single contiguous block in "row-major" order
(sometimes called "C order"). Basically, this means that as you step
through adjacent memory locations, the first dimension's index varies
most slowly and the last dimension's index varies most quickly.

To be more explicit, let us consider an array of rank d whose
dimensions are n[0] x n[1] x n[2] x ... x n[d-1]. Now, we specify a
location in the array by a sequence of d (zero-based) indices, one for
each dimension: (i[0], i[1], ..., i[d-1]). If the array is stored in
row-major order, then this element is located at the position i[d-1] +
n[d-1] * (i[d-2] + n[d-2] * (... + n[1] * i[0])).

Note that, for the ordinary complex DFT, each element of the array
must be of type `fftw_complex'; i.e. a (real, imaginary) pair of
(double-precision) numbers.

In the advanced FFTW interface, the physical dimensions n from which
the indices are computed can be different from (larger than) the
logical dimensions of the transform to be computed, in order to
transform a subset of a larger array.
Note also that, in the advanced
interface, the expression above is multiplied by a "stride" to get the
actual array index--this is useful in situations where each element of
the multi-dimensional array is actually a data structure (or another
array), and you just want to transform a single field. In the basic
interface, however, the stride is 1.

File: fftw3.info, Node: Column-major Format, Next: Fixed-size Arrays in C, Prev: Row-major Format, Up: Multi-dimensional Array Format

3.2.2 Column-major Format
-------------------------

Readers from the Fortran world are used to arrays stored in
"column-major" order (sometimes called "Fortran order"). This is
essentially the exact opposite of row-major order in that, here, the
_first_ dimension's index varies most quickly.

If you have an array stored in column-major order and wish to
transform it using FFTW, it is quite easy to do. When creating the
plan, simply pass the dimensions of the array to the planner in
_reverse order_. For example, if your array is a rank three `N x M x
L' matrix in column-major order, you should pass the dimensions of the
array as if it were an `L x M x N' matrix (which it is, from the
perspective of FFTW). This is done for you _automatically_ by the FFTW
legacy-Fortran interface (*note Calling FFTW from Legacy Fortran::),
but you must do it manually with the modern Fortran interface (*note
Reversing array dimensions::).
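
As a minimal sketch of why reversing the dimensions works (plain C, no
FFTW calls; the helper names `colmajor_index', `rowmajor_index', and
`layouts_agree' are hypothetical and exist only for this illustration),
the following checks that element (i,j,k) of an `N x M x L'
column-major array sits at the same offset as element (k,j,i) of an
`L x M x N' row-major array:

```c
#include <stddef.h>

/* Offset of (i, j, k) in an N x M x L column-major array:
   the first index varies fastest. (Hypothetical helper.) */
static size_t colmajor_index(size_t i, size_t j, size_t k,
                             size_t N, size_t M)
{
    return i + N * (j + M * k);
}

/* Offset of (a, b, c) in an n0 x n1 x n2 row-major array:
   the last index varies fastest. (Hypothetical helper.) */
static size_t rowmajor_index(size_t a, size_t b, size_t c,
                             size_t n1, size_t n2)
{
    return c + n2 * (b + n1 * a);
}

/* Returns 1 if every (i, j, k) of the N x M x L column-major array
   has the same offset as (k, j, i) of the L x M x N row-major array,
   and 0 otherwise. */
static int layouts_agree(size_t N, size_t M, size_t L)
{
    size_t i, j, k;
    for (i = 0; i < N; ++i)
        for (j = 0; j < M; ++j)
            for (k = 0; k < L; ++k)
                if (colmajor_index(i, j, k, N, M) !=
                    rowmajor_index(k, j, i, M, N))
                    return 0;
    return 1;
}
```

Because the two offsets coincide, a column-major `N x M x L' block can
be handed to a 3d planner with the sizes given as `L', `M', `N', and no
data needs to be moved.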

File: fftw3.info, Node: Fixed-size Arrays in C, Next: Dynamic Arrays in C, Prev: Column-major Format, Up: Multi-dimensional Array Format

3.2.3 Fixed-size Arrays in C
----------------------------

A multi-dimensional array whose size is declared at compile time in C
is _already_ in row-major order. You don't have to do anything special
to transform it. For example:

     {
          fftw_complex data[N0][N1][N2];
          fftw_plan plan;
          ...
          plan = fftw_plan_dft_3d(N0, N1, N2, &data[0][0][0], &data[0][0][0],
                                  FFTW_FORWARD, FFTW_ESTIMATE);
          ...
     }

This will plan a 3d in-place transform of size `N0 x N1 x N2'.
Notice how we took the address of the zero-th element to pass to the
planner (we could also have used a typecast).

However, we tend to _discourage_ users from declaring their arrays
in this way, for two reasons. First, this allocates the array on the
stack ("automatic" storage), which has a very limited size on most
operating systems (declaring an array with more than a few thousand
elements will often cause a crash). (You can get around this
limitation on many systems by declaring the array as `static' and/or
global, but that has its own drawbacks.) Second, it may not optimally
align the array for use with a SIMD FFTW (*note SIMD alignment and
fftw_malloc::). Instead, we recommend using `fftw_malloc', as
described below.

File: fftw3.info, Node: Dynamic Arrays in C, Next: Dynamic Arrays in C-The Wrong Way, Prev: Fixed-size Arrays in C, Up: Multi-dimensional Array Format

3.2.4 Dynamic Arrays in C
-------------------------

We recommend allocating most arrays dynamically, with `fftw_malloc'.
This isn't too hard to do, although it is not as straightforward for
multi-dimensional arrays as it is for one-dimensional arrays.

Creating the array is simple: using a dynamic-allocation routine like
`fftw_malloc', allocate an array big enough to store N `fftw_complex'
values (for a complex DFT), where N is the product of the sizes of the
array dimensions (i.e. the total number of complex values in the
array). For example, here is code to allocate a 5 x 12 x 27 rank-3
array:

     fftw_complex *an_array;
     an_array = (fftw_complex*) fftw_malloc(5*12*27 * sizeof(fftw_complex));

Accessing the array elements, however, is more tricky--you can't
simply use multiple applications of the `[]' operator like you could
for fixed-size arrays. Instead, you have to explicitly compute the
offset into the array using the formula given earlier for row-major
arrays. For example, to reference the (i,j,k)-th element of the array
allocated above, you would use the expression `an_array[k + 27 * (j +
12 * i)]'.

This pain can be alleviated somewhat by defining appropriate macros,
or, in C++, creating a class and overloading the `()' operator. The
C99 standard provides a way to reinterpret the dynamic array as
a "variable-length" multi-dimensional array amenable to `[]', but this
feature is not supported by all compilers.
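
The two conveniences mentioned above can be sketched as follows. This
is only an illustration under stated assumptions: `complex_t' is a
local stand-in with the same layout as `fftw_complex', plain `malloc'
stands in for `fftw_malloc' so that the fragment compiles without FFTW,
and the names `IDX3' and `views_agree' are hypothetical. It needs a
compiler with C99 variable-length array support.

```c
#include <stdlib.h>

/* Stand-in for the complex type; real code would include the FFTW
   header and allocate with fftw_malloc/fftw_free instead. */
typedef double complex_t[2];

/* Row-major offset of (i, j, k) in an n0 x n1 x n2 array.
   (Hypothetical macro; n0 does not enter the formula.) */
#define IDX3(i, j, k, n1, n2) ((k) + (n2) * ((j) + (n1) * (i)))

/* Writes one element through the macro and reads it back through a
   C99 pointer-to-variable-length-array view of the same block.
   Returns 1 if both views address the same element, 0 if not,
   and -1 if the allocation fails. */
static int views_agree(int n0, int n1, int n2, int i, int j, int k)
{
    complex_t *flat = malloc(sizeof(complex_t) * (size_t)n0 * n1 * n2);
    complex_t (*cube)[n1][n2];
    int ok;

    if (!flat)
        return -1;
    cube = (complex_t (*)[n1][n2]) flat;    /* reinterpret, no copy */

    flat[IDX3(i, j, k, n1, n2)][0] = 42.0;  /* real part */
    flat[IDX3(i, j, k, n1, n2)][1] = -1.0;  /* imaginary part */
    ok = (cube[i][j][k][0] == 42.0 && cube[i][j][k][1] == -1.0);

    free(flat);
    return ok;
}
```

With such a view, `cube[i][j][k]' reads like a fixed-size array access
while the data stays in the single contiguous block that the planner
expects.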

File: fftw3.info, Node: Dynamic Arrays in C-The Wrong Way, Prev: Dynamic Arrays in C, Up: Multi-dimensional Array Format

3.2.5 Dynamic Arrays in C--The Wrong Way
----------------------------------------

A different method for allocating multi-dimensional arrays in C is
often suggested that is incompatible with FFTW: _using it will cause
FFTW to die a painful death_. We discuss the technique here, however,
because it is so commonly known and used. This method is to create
arrays of pointers of arrays of pointers of ...etcetera. For example,
the analogue in this method to the example above is:

     int i,j;
     fftw_complex ***a_bad_array;  /* another way to make a 5x12x27 array */

     a_bad_array = (fftw_complex ***) malloc(5 * sizeof(fftw_complex **));
     for (i = 0; i < 5; ++i) {
          a_bad_array[i] =
             (fftw_complex **) malloc(12 * sizeof(fftw_complex *));
          for (j = 0; j < 12; ++j)
               a_bad_array[i][j] =
                     (fftw_complex *) malloc(27 * sizeof(fftw_complex));
     }

As you can see, this sort of array is inconvenient to allocate (and
deallocate). On the other hand, it has the advantage that the
(i,j,k)-th element can be referenced simply by `a_bad_array[i][j][k]'.

If you like this technique and want to maximize convenience in
accessing the array, but still want to pass the array to FFTW, you can
use a hybrid method. Allocate the array as one contiguous block, but
also declare an array of arrays of pointers that point to appropriate
places in the block. That sort of trick is beyond the scope of this
documentation; for more information on multi-dimensional arrays in C,
see the `comp.lang.c' FAQ (http://c-faq.com/aryptr/dynmuldimary.html).
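
For readers who want a starting point anyway, one possible shape of the
hybrid method is sketched below. It is an assumption-laden
illustration, not a recommended FFTW idiom: `complex_t' is a local
stand-in with the same layout as `fftw_complex', `malloc' stands in for
`fftw_malloc', and `struct cube', `cube_alloc', `cube_free', and
`cube_check' are hypothetical names.

```c
#include <stdlib.h>

typedef double complex_t[2];  /* stand-in for the FFTW complex type */

/* A contiguous n0 x n1 x n2 block plus pointer tables into it.
   (Hypothetical struct, for illustration only.) */
struct cube {
    complex_t *data;      /* contiguous row-major block: this is what
                             would be passed to the planner */
    complex_t **rows;     /* n0*n1 pointers, one per (i, j) row */
    complex_t ***planes;  /* n0 pointers into rows */
};

/* Returns 1 on success, 0 if any allocation fails. */
static int cube_alloc(struct cube *c, size_t n0, size_t n1, size_t n2)
{
    size_t i, j;

    c->data = malloc(sizeof(complex_t) * n0 * n1 * n2);
    c->rows = malloc(sizeof(complex_t *) * n0 * n1);
    c->planes = malloc(sizeof(complex_t **) * n0);
    if (!c->data || !c->rows || !c->planes) {
        free(c->planes);
        free(c->rows);
        free(c->data);
        return 0;
    }
    for (i = 0; i < n0; ++i) {
        c->planes[i] = c->rows + i * n1;
        for (j = 0; j < n1; ++j)
            c->rows[i * n1 + j] = c->data + (i * n1 + j) * n2;
    }
    return 1;
}

static void cube_free(struct cube *c)
{
    free(c->planes);
    free(c->rows);
    free(c->data);
}

/* Returns 1 if planes[i][j][k] aliases data[k + n2*(j + n1*i)],
   0 if not, and -1 if allocation fails. */
static int cube_check(size_t n0, size_t n1, size_t n2,
                      size_t i, size_t j, size_t k)
{
    struct cube c;
    int ok;

    if (!cube_alloc(&c, n0, n1, n2))
        return -1;
    ok = (&c.planes[i][j][k] == &c.data[k + n2 * (j + n1 * i)]);
    cube_free(&c);
    return ok;
}
```

Only three allocations are made regardless of the array size, and the
`data' member remains a single row-major block, which is the property
the "wrong way" above lacks.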

File: fftw3.info, Node: Words of Wisdom-Saving Plans, Next: Caveats in Using Wisdom, Prev: Multi-dimensional Array Format, Up: Other Important Topics

3.3 Words of Wisdom--Saving Plans
=================================

FFTW implements a method for saving plans to disk and restoring them.
In fact, what FFTW does is more general than just saving and loading
plans. The mechanism is called "wisdom". Here, we describe this
feature at a high level. *Note FFTW Reference::, for a less casual but
more complete discussion of how to use wisdom in FFTW.

Plans created with the `FFTW_MEASURE', `FFTW_PATIENT', or
`FFTW_EXHAUSTIVE' options produce near-optimal FFT performance, but may
require a long time to compute because FFTW must measure the runtime of
many possible plans and select the best one. This setup is designed
for the situations where so many transforms of the same size must be
computed that the start-up time is irrelevant. For short
initialization times, but slower transforms, we have provided
`FFTW_ESTIMATE'. The `wisdom' mechanism is a way to get the best of
both worlds: you compute a good plan once, save it to disk, and later
reload it as many times as necessary. The wisdom mechanism can
actually save and reload many plans at once, not just one.

Whenever you create a plan, the FFTW planner accumulates wisdom,
which is information sufficient to reconstruct the plan. After
planning, you can save this information to disk by means of the
function:
     int fftw_export_wisdom_to_filename(const char *filename);
(This function returns non-zero on success.)

The next time you run the program, you can restore the wisdom with
`fftw_import_wisdom_from_filename' (which also returns non-zero on
success), and then recreate the plan using the same flags as before.
     int fftw_import_wisdom_from_filename(const char *filename);

Wisdom is automatically used for any size to which it is applicable,
as long as the planner flags are not more "patient" than those with
which the wisdom was created. For example, wisdom created with
`FFTW_MEASURE' can be used if you later plan with `FFTW_ESTIMATE' or
`FFTW_MEASURE', but not with `FFTW_PATIENT'.

The `wisdom' is cumulative, and is stored in a global, private data
structure managed internally by FFTW. The storage space required is
minimal, proportional to the logarithm of the sizes the wisdom was
generated from. If memory usage is a concern, however, the wisdom can
be forgotten and its associated memory freed by calling:
     void fftw_forget_wisdom(void);

Wisdom can be exported to a file, a string, or any other medium.
For details, see *note Wisdom::.

File: fftw3.info, Node: Caveats in Using Wisdom, Prev: Words of Wisdom-Saving Plans, Up: Other Important Topics

3.4 Caveats in Using Wisdom
===========================

     For in much wisdom is much grief, and he that increaseth knowledge
     increaseth sorrow.  [Ecclesiastes 1:18]

There are pitfalls to using wisdom, in that it can negate FFTW's
ability to adapt to changing hardware and other conditions. For
example, it would be perfectly possible to export wisdom from a program
running on one processor and import it into a program running on
another processor.
Doing so, however, would mean that the second Chris@19: program would use plans optimized for the first processor, instead of Chris@19: the one it is running on. Chris@19: Chris@19: It should be safe to reuse wisdom as long as the hardware and program Chris@19: binaries remain unchanged. (Actually, the optimal plan may change even Chris@19: between runs of the same binary on identical hardware, due to Chris@19: differences in the virtual memory environment, etcetera. Users Chris@19: seriously interested in performance should worry about this problem, Chris@19: too.) It is likely that, if the same wisdom is used for two different Chris@19: program binaries, even running on the same machine, the plans may be Chris@19: sub-optimal because of differing code alignments. It is therefore wise Chris@19: to recreate wisdom every time an application is recompiled. The more Chris@19: the underlying hardware and software changes between the creation of Chris@19: wisdom and its use, the greater grows the risk of sub-optimal plans. Chris@19: Chris@19: Nevertheless, if the choice is between using `FFTW_ESTIMATE' or Chris@19: using possibly-suboptimal wisdom (created on the same machine, but for a Chris@19: different binary), the wisdom is likely to be better. For this reason, Chris@19: we provide a function to import wisdom from a standard system-wide Chris@19: location (`/etc/fftw/wisdom' on Unix): Chris@19: Chris@19: int fftw_import_system_wisdom(void); Chris@19: Chris@19: FFTW also provides a standalone program, `fftw-wisdom' (described by Chris@19: its own `man' page on Unix) with which users can create wisdom, e.g. Chris@19: for a canonical set of sizes to store in the system wisdom file. *Note Chris@19: Wisdom Utilities::. 
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: FFTW Reference, Next: Multi-threaded FFTW, Prev: Other Important Topics, Up: Top Chris@19: Chris@19: 4 FFTW Reference Chris@19: **************** Chris@19: Chris@19: This chapter provides a complete reference for all sequential (i.e., Chris@19: one-processor) FFTW functions. Parallel transforms are described in Chris@19: later chapters. Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * Data Types and Files:: Chris@19: * Using Plans:: Chris@19: * Basic Interface:: Chris@19: * Advanced Interface:: Chris@19: * Guru Interface:: Chris@19: * New-array Execute Functions:: Chris@19: * Wisdom:: Chris@19: * What FFTW Really Computes:: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Data Types and Files, Next: Using Plans, Prev: FFTW Reference, Up: FFTW Reference Chris@19: Chris@19: 4.1 Data Types and Files Chris@19: ======================== Chris@19: Chris@19: All programs using FFTW should include its header file: Chris@19: Chris@19: #include <fftw3.h> Chris@19: Chris@19: You must also link to the FFTW library. On Unix, this means adding Chris@19: `-lfftw3 -lm' at the _end_ of the link command. Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * Complex numbers:: Chris@19: * Precision:: Chris@19: * Memory Allocation:: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Complex numbers, Next: Precision, Prev: Data Types and Files, Up: Data Types and Files Chris@19: Chris@19: 4.1.1 Complex numbers Chris@19: --------------------- Chris@19: Chris@19: The default FFTW interface uses `double' precision for all Chris@19: floating-point numbers, and defines a `fftw_complex' type to hold Chris@19: complex numbers as: Chris@19: Chris@19: typedef double fftw_complex[2]; Chris@19: Chris@19: Here, the `[0]' element holds the real part and the `[1]' element Chris@19: holds the imaginary part.
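As a minimal sketch of what the `[0]'/`[1]' layout means in practice, the following multiplies two complex numbers stored in that format. The names `demo_complex' and `demo_complex_mul' are hypothetical and are not part of FFTW; `demo_complex' is only a local stand-in that mirrors the typedef above so that the snippet compiles without the FFTW header.

```c
/* Hypothetical stand-in mirroring FFTW's default typedef, so that this
   sketch compiles without the FFTW header.  In a real program, include
   the header and use fftw_complex instead. */
typedef double demo_complex[2];

/* Hypothetical helper: out = a * b, reading element [0] as the real
   part and element [1] as the imaginary part.  The temporaries make it
   safe for out to alias a or b. */
static void demo_complex_mul(const demo_complex a, const demo_complex b,
                             demo_complex out)
{
    double re = a[0] * b[0] - a[1] * b[1];
    double im = a[0] * b[1] + a[1] * b[0];
    out[0] = re;
    out[1] = im;
}
```

With `y = {1.0, 2.0}' and `k = {3.0, 4.0}', the helper stores `{-5.0, 10.0}', i.e. (1+2i)(3+4i) = -5+10i.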
Chris@19: Chris@19: Alternatively, if you have a C compiler (such as `gcc') that Chris@19: supports the C99 revision of the ANSI C standard, you can use C's new Chris@19: native complex type (which is binary-compatible with the typedef above). Chris@19: In particular, if you `#include <complex.h>' _before_ `<fftw3.h>', then Chris@19: `fftw_complex' is defined to be the native complex type and you can Chris@19: manipulate it with ordinary arithmetic (e.g. `x = y * (3+4*I)', where Chris@19: `x' and `y' are `fftw_complex' and `I' is the standard symbol for the Chris@19: imaginary unit). Chris@19: Chris@19: C++ has its own `complex<T>' template class, defined in the standard Chris@19: `<complex>' header file. Reportedly, the C++ standards committee has Chris@19: recently agreed to mandate that the storage format used for this type Chris@19: be binary-compatible with the C99 type, i.e. an array `T[2]' with Chris@19: consecutive real `[0]' and imaginary `[1]' parts. (See report Chris@19: `http://www.open-std.org/jtc1/sc22/WG21/docs/papers/2002/n1388.pdf Chris@19: WG21/N1388'.) Although not part of the official standard as of this Chris@19: writing, the proposal stated that: "This solution has been tested with Chris@19: all current major implementations of the standard library and shown to Chris@19: be working." To the extent that this is true, if you have a variable Chris@19: `complex<double> *x', you can pass it directly to FFTW via Chris@19: `reinterpret_cast<fftw_complex*>(x)'. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Precision, Next: Memory Allocation, Prev: Complex numbers, Up: Data Types and Files Chris@19: Chris@19: 4.1.2 Precision Chris@19: --------------- Chris@19: Chris@19: You can install single and long-double precision versions of FFTW, Chris@19: which replace `double' with `float' and `long double', respectively Chris@19: (*note Installation and Customization::).
To use these interfaces, you: Chris@19: Chris@19: * Link to the single/long-double libraries; on Unix, `-lfftw3f' or Chris@19: `-lfftw3l' instead of (or in addition to) `-lfftw3'. (You can Chris@19: link to the different-precision libraries simultaneously.) Chris@19: Chris@19: * Include the _same_ `<fftw3.h>' header file. Chris@19: Chris@19: * Replace all lowercase instances of `fftw_' with `fftwf_' or Chris@19: `fftwl_' for single or long-double precision, respectively. Chris@19: (`fftw_complex' becomes `fftwf_complex', `fftw_execute' becomes Chris@19: `fftwf_execute', etcetera.) Chris@19: Chris@19: * Uppercase names, i.e. names beginning with `FFTW_', remain the Chris@19: same. Chris@19: Chris@19: * Replace `double' with `float' or `long double' for subroutine Chris@19: parameters. Chris@19: Chris@19: Chris@19: Depending upon your compiler and/or hardware, `long double' may not Chris@19: be any more precise than `double' (or may not be supported at all, Chris@19: although it is standard in C99). Chris@19: Chris@19: We also support using the nonstandard `__float128' Chris@19: quadruple-precision type provided by recent versions of `gcc' on 32- Chris@19: and 64-bit x86 hardware (*note Installation and Customization::). To Chris@19: use this type, link with `-lfftw3q -lquadmath -lm' (the `libquadmath' Chris@19: library provided by `gcc' is needed for quadruple-precision Chris@19: trigonometric functions) and use `fftwq_' identifiers. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Memory Allocation, Prev: Precision, Up: Data Types and Files Chris@19: Chris@19: 4.1.3 Memory Allocation Chris@19: ----------------------- Chris@19: Chris@19: void *fftw_malloc(size_t n); Chris@19: void fftw_free(void *p); Chris@19: Chris@19: These are functions that behave identically to `malloc' and `free', Chris@19: except that they guarantee that the returned pointer obeys any special Chris@19: alignment restrictions imposed by any algorithm in FFTW (e.g.
for SIMD Chris@19: acceleration). *Note SIMD alignment and fftw_malloc::. Chris@19: Chris@19: Data allocated by `fftw_malloc' _must_ be deallocated by `fftw_free' Chris@19: and not by the ordinary `free'. Chris@19: Chris@19: These routines simply call through to your operating system's Chris@19: `malloc' or, if necessary, its aligned equivalent (e.g. `memalign'), so Chris@19: you normally need not worry about any significant time or space Chris@19: overhead. You are _not required_ to use them to allocate your data, Chris@19: but we strongly recommend it. Chris@19: Chris@19: Note: in C++, just as with ordinary `malloc', you must typecast the Chris@19: output of `fftw_malloc' to whatever pointer type you are allocating. Chris@19: Chris@19: We also provide the following two convenience functions to allocate Chris@19: real and complex arrays with `n' elements, which are equivalent to Chris@19: `(double *) fftw_malloc(sizeof(double) * n)' and `(fftw_complex *) Chris@19: fftw_malloc(sizeof(fftw_complex) * n)', respectively: Chris@19: Chris@19: double *fftw_alloc_real(size_t n); Chris@19: fftw_complex *fftw_alloc_complex(size_t n); Chris@19: Chris@19: The equivalent functions in other precisions allocate arrays of `n' Chris@19: elements in that precision. e.g. `fftwf_alloc_real(n)' is equivalent Chris@19: to `(float *) fftwf_malloc(sizeof(float) * n)'. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Using Plans, Next: Basic Interface, Prev: Data Types and Files, Up: FFTW Reference Chris@19: Chris@19: 4.2 Using Plans Chris@19: =============== Chris@19: Chris@19: Plans for all transform types in FFTW are stored as type `fftw_plan' Chris@19: (an opaque pointer type), and are created by one of the various Chris@19: planning routines described in the following sections. An `fftw_plan' Chris@19: contains all information necessary to compute the transform, including Chris@19: the pointers to the input and output arrays. 
Chris@19: Chris@19: void fftw_execute(const fftw_plan plan); Chris@19: Chris@19: This executes the `plan', to compute the corresponding transform on Chris@19: the arrays for which it was planned (which must still exist). The plan Chris@19: is not modified, and `fftw_execute' can be called as many times as Chris@19: desired. Chris@19: Chris@19: To apply a given plan to a different array, you can use the Chris@19: new-array execute interface. *Note New-array Execute Functions::. Chris@19: Chris@19: `fftw_execute' (and equivalents) is the only function in FFTW Chris@19: guaranteed to be thread-safe; see *note Thread safety::. Chris@19: Chris@19: This function: Chris@19: void fftw_destroy_plan(fftw_plan plan); Chris@19: deallocates the `plan' and all its associated data. Chris@19: Chris@19: FFTW's planner saves some other persistent data, such as the Chris@19: accumulated wisdom and a list of algorithms available in the current Chris@19: configuration. If you want to deallocate all of that and reset FFTW to Chris@19: the pristine state it was in when you started your program, you can Chris@19: call: Chris@19: Chris@19: void fftw_cleanup(void); Chris@19: Chris@19: After calling `fftw_cleanup', all existing plans become undefined, Chris@19: and you should not attempt to execute them nor to destroy them. You can Chris@19: however create and execute/destroy new plans, in which case FFTW starts Chris@19: accumulating wisdom information again. Chris@19: Chris@19: `fftw_cleanup' does not deallocate your plans, however. To prevent Chris@19: memory leaks, you must still call `fftw_destroy_plan' before executing Chris@19: `fftw_cleanup'. 
Chris@19: Chris@19: Occasionally, it may be useful to know FFTW's internal "cost" metric Chris@19: that it uses to compare plans to one another; this cost is proportional Chris@19: to the execution time of the plan, in undocumented units, if the plan Chris@19: was created with the `FFTW_MEASURE' or other timing-based options, or Chris@19: alternatively is a heuristic cost function for `FFTW_ESTIMATE' plans. Chris@19: (The cost values of measured and estimated plans are not comparable, Chris@19: being in different units. Also, costs from different FFTW versions or Chris@19: the same version compiled differently may not be in the same units. Chris@19: Plans created from wisdom have a cost of 0 since no timing measurement Chris@19: is performed for them. Finally, certain problems for which only one Chris@19: top-level algorithm was possible may have required no measurements of Chris@19: the cost of the whole plan, in which case `fftw_cost' will also return Chris@19: 0.) The cost metric for a given plan is returned by: Chris@19: Chris@19: double fftw_cost(const fftw_plan plan); Chris@19: Chris@19: The following two routines are provided purely for academic purposes Chris@19: (that is, for entertainment). Chris@19: Chris@19: void fftw_flops(const fftw_plan plan, Chris@19: double *add, double *mul, double *fma); Chris@19: Chris@19: Given a `plan', set `add', `mul', and `fma' to an exact count of the Chris@19: number of floating-point additions, multiplications, and fused Chris@19: multiply-add operations involved in the plan's execution. The total Chris@19: number of floating-point operations (flops) is `add + mul + 2*fma', or Chris@19: `add + mul + fma' if the hardware supports fused multiply-add Chris@19: instructions (although the number of FMA operations is only approximate Chris@19: because of compiler voodoo).
(The number of operations should be an Chris@19: integer, but we use `double' to avoid overflowing `int' for large Chris@19: transforms; the arguments are of type `double' even for single and Chris@19: long-double precision versions of FFTW.) Chris@19: Chris@19: void fftw_fprint_plan(const fftw_plan plan, FILE *output_file); Chris@19: void fftw_print_plan(const fftw_plan plan); Chris@19: char *fftw_sprint_plan(const fftw_plan plan); Chris@19: Chris@19: This outputs a "nerd-readable" representation of the `plan' to the Chris@19: given file, to `stdout', or to a newly allocated NUL-terminated string Chris@19: (which the caller is responsible for deallocating with `free'), Chris@19: respectively. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Basic Interface, Next: Advanced Interface, Prev: Using Plans, Up: FFTW Reference Chris@19: Chris@19: 4.3 Basic Interface Chris@19: =================== Chris@19: Chris@19: Recall that the FFTW API is divided into three parts(1): the "basic Chris@19: interface" computes a single transform of contiguous data, the "advanced Chris@19: interface" computes transforms of multiple or strided arrays, and the Chris@19: "guru interface" supports the most general data layouts, Chris@19: multiplicities, and strides. This section describes the basic Chris@19: interface, which we expect to satisfy the needs of most users. Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * Complex DFTs:: Chris@19: * Planner Flags:: Chris@19: * Real-data DFTs:: Chris@19: * Real-data DFT Array Format:: Chris@19: * Real-to-Real Transforms:: Chris@19: * Real-to-Real Transform Kinds:: Chris@19: Chris@19: ---------- Footnotes ---------- Chris@19: Chris@19: (1) Gallia est omnis divisa in partes tres (Julius Caesar).
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Complex DFTs, Next: Planner Flags, Prev: Basic Interface, Up: Basic Interface Chris@19: Chris@19: 4.3.1 Complex DFTs Chris@19: ------------------ Chris@19: Chris@19: fftw_plan fftw_plan_dft_1d(int n0, Chris@19: fftw_complex *in, fftw_complex *out, Chris@19: int sign, unsigned flags); Chris@19: fftw_plan fftw_plan_dft_2d(int n0, int n1, Chris@19: fftw_complex *in, fftw_complex *out, Chris@19: int sign, unsigned flags); Chris@19: fftw_plan fftw_plan_dft_3d(int n0, int n1, int n2, Chris@19: fftw_complex *in, fftw_complex *out, Chris@19: int sign, unsigned flags); Chris@19: fftw_plan fftw_plan_dft(int rank, const int *n, Chris@19: fftw_complex *in, fftw_complex *out, Chris@19: int sign, unsigned flags); Chris@19: Chris@19: Plan a complex input/output discrete Fourier transform (DFT) in zero Chris@19: or more dimensions, returning an `fftw_plan' (*note Using Plans::). Chris@19: Chris@19: Once you have created a plan for a certain transform type and Chris@19: parameters, then creating another plan of the same type and parameters, Chris@19: but for different arrays, is fast and shares constant data with the Chris@19: first plan (if it still exists). Chris@19: Chris@19: The planner returns `NULL' if the plan cannot be created. In the Chris@19: standard FFTW distribution, the basic interface is guaranteed to return Chris@19: a non-`NULL' plan. A plan may be `NULL', however, if you are using a Chris@19: customized FFTW configuration supporting a restricted set of transforms. Chris@19: Chris@19: Arguments Chris@19: ......... Chris@19: Chris@19: * `rank' is the rank of the transform (it should be the size of the Chris@19: array `*n'), and can be any non-negative integer. (*Note Complex Chris@19: Multi-Dimensional DFTs::, for the definition of "rank".) The Chris@19: `_1d', `_2d', and `_3d' planners correspond to a `rank' of `1', Chris@19: `2', and `3', respectively. 
The rank may be zero, which is Chris@19: equivalent to a rank-1 transform of size 1, i.e. a copy of one Chris@19: number from input to output. Chris@19: Chris@19: * `n0', `n1', `n2', or `n[0..rank-1]' (as appropriate for each Chris@19: routine) specify the size of the transform dimensions. They can Chris@19: be any positive integer. Chris@19: Chris@19: - Multi-dimensional arrays are stored in row-major order with Chris@19: dimensions: `n0' x `n1'; or `n0' x `n1' x `n2'; or `n[0]' x Chris@19: `n[1]' x ... x `n[rank-1]'. *Note Multi-dimensional Array Chris@19: Format::. Chris@19: Chris@19: - FFTW is best at handling sizes of the form 2^a 3^b 5^c 7^d Chris@19: 11^e 13^f, where e+f is either 0 or 1, and the other exponents Chris@19: are arbitrary. Other sizes are computed by means of a slow, Chris@19: general-purpose algorithm (which nevertheless retains O(n log Chris@19: n) performance even for prime sizes). It is possible to Chris@19: customize FFTW for different array sizes; see *note Chris@19: Installation and Customization::. Transforms whose sizes are Chris@19: powers of 2 are especially fast. Chris@19: Chris@19: * `in' and `out' point to the input and output arrays of the Chris@19: transform, which may be the same (yielding an in-place transform). These Chris@19: arrays are overwritten during planning, unless `FFTW_ESTIMATE' is Chris@19: used in the flags. (The arrays need not be initialized, but they Chris@19: must be allocated.) Chris@19: Chris@19: If `in == out', the transform is "in-place" and the input array is Chris@19: overwritten. If `in != out', the two arrays must not overlap (but Chris@19: FFTW does not check for this condition). Chris@19: Chris@19: * `sign' is the sign of the exponent in the formula that defines the Chris@19: Fourier transform. It can be -1 (= `FFTW_FORWARD') or +1 (= Chris@19: `FFTW_BACKWARD'). Chris@19: Chris@19: * `flags' is a bitwise OR (`|') of zero or more planner flags, as Chris@19: defined in *note Planner Flags::. 
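The size rule of thumb in the argument list above (sizes of the form 2^a 3^b 5^c 7^d 11^e 13^f with e+f equal to 0 or 1) can be checked mechanically. The sketch below is one way to do so; `demo_is_fast_size' is a hypothetical helper name, not an FFTW function, and a size that fails the check is still a valid transform size that merely falls back to the slower general-purpose algorithm.

```c
/* Hypothetical helper, not part of FFTW: returns 1 if n has the form
   2^a 3^b 5^c 7^d 11^e 13^f with e + f equal to 0 or 1, and returns 0
   otherwise (including for n <= 0). */
static int demo_is_fast_size(int n)
{
    static const int small[] = {2, 3, 5, 7};
    int i;
    int big = 0;  /* combined number of factors of 11 and 13 */

    if (n <= 0)
        return 0;

    /* Strip every factor of 2, 3, 5, and 7; these exponents are
       unrestricted. */
    for (i = 0; i < 4; ++i)
        while (n % small[i] == 0)
            n /= small[i];

    /* Count the factors of 11 and 13; at most one in total is allowed. */
    while (n % 11 == 0) { n /= 11; ++big; }
    while (n % 13 == 0) { n /= 13; ++big; }

    /* Anything left over is a prime factor outside the list. */
    return n == 1 && big <= 1;
}
```

For instance, 1000 (= 2^3 5^3) and 22 (= 2 x 11) pass the check, while 143 (= 11 x 13) and 17 do not.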
Chris@19: Chris@19: Chris@19: FFTW computes an unnormalized transform: computing a forward Chris@19: followed by a backward transform (or vice versa) will result in the Chris@19: original data multiplied by the size of the transform (the product of Chris@19: the dimensions). For more information, see *note What FFTW Really Chris@19: Computes::. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Planner Flags, Next: Real-data DFTs, Prev: Complex DFTs, Up: Basic Interface Chris@19: Chris@19: 4.3.2 Planner Flags Chris@19: ------------------- Chris@19: Chris@19: All of the planner routines in FFTW accept an integer `flags' argument, Chris@19: which is a bitwise OR (`|') of zero or more of the flag constants Chris@19: defined below. These flags control the rigor (and time) of the Chris@19: planning process, and can also impose (or lift) restrictions on the Chris@19: type of transform algorithm that is employed. Chris@19: Chris@19: _Important:_ the planner overwrites the input array during planning Chris@19: unless a saved plan (*note Wisdom::) is available for that problem, so Chris@19: you should initialize your input data after creating the plan. The Chris@19: only exceptions to this are the `FFTW_ESTIMATE' and `FFTW_WISDOM_ONLY' Chris@19: flags, as mentioned below. Chris@19: Chris@19: In all cases, if wisdom is available for the given problem that Chris@19: was created with equal-or-greater planning rigor, then the more Chris@19: rigorous wisdom is used. For example, in `FFTW_ESTIMATE' mode any Chris@19: available wisdom is used, whereas in `FFTW_PATIENT' mode only wisdom Chris@19: created in patient or exhaustive mode can be used. *Note Words of Chris@19: Wisdom-Saving Plans::. Chris@19: Chris@19: Planning-rigor flags Chris@19: .................... Chris@19: Chris@19: * `FFTW_ESTIMATE' specifies that, instead of actual measurements of Chris@19: different algorithms, a simple heuristic is used to pick a Chris@19: (probably sub-optimal) plan quickly. 
With this flag, the Chris@19: input/output arrays are not overwritten during planning. Chris@19: Chris@19: * `FFTW_MEASURE' tells FFTW to find an optimized plan by actually Chris@19: _computing_ several FFTs and measuring their execution time. Chris@19: Depending on your machine, this can take some time (often a few Chris@19: seconds). `FFTW_MEASURE' is the default planning option. Chris@19: Chris@19: * `FFTW_PATIENT' is like `FFTW_MEASURE', but considers a wider range Chris@19: of algorithms and often produces a "more optimal" plan (especially Chris@19: for large transforms), but at the expense of several times longer Chris@19: planning time (especially for large transforms). Chris@19: Chris@19: * `FFTW_EXHAUSTIVE' is like `FFTW_PATIENT', but considers an even Chris@19: wider range of algorithms, including many that we think are Chris@19: unlikely to be fast, to produce the most optimal plan but with a Chris@19: substantially increased planning time. Chris@19: Chris@19: * `FFTW_WISDOM_ONLY' is a special planning mode in which the plan is Chris@19: only created if wisdom is available for the given problem, and Chris@19: otherwise a `NULL' plan is returned. This can be combined with Chris@19: other flags, e.g. `FFTW_WISDOM_ONLY | FFTW_PATIENT' creates a plan Chris@19: only if wisdom is available that was created in `FFTW_PATIENT' or Chris@19: `FFTW_EXHAUSTIVE' mode. The `FFTW_WISDOM_ONLY' flag is intended Chris@19: for users who need to detect whether wisdom is available; for Chris@19: example, if wisdom is not available one may wish to allocate new Chris@19: arrays for planning so that user data is not overwritten. Chris@19: Chris@19: Chris@19: Algorithm-restriction flags Chris@19: ........................... Chris@19: Chris@19: * `FFTW_DESTROY_INPUT' specifies that an out-of-place transform is Chris@19: allowed to _overwrite its input_ array with arbitrary data; this Chris@19: can sometimes allow more efficient algorithms to be employed. 
Chris@19: Chris@19: * `FFTW_PRESERVE_INPUT' specifies that an out-of-place transform must Chris@19: _not change its input_ array. This is ordinarily the _default_, Chris@19: except for c2r and hc2r (i.e. complex-to-real) transforms for Chris@19: which `FFTW_DESTROY_INPUT' is the default. In the latter cases, Chris@19: passing `FFTW_PRESERVE_INPUT' will attempt to use algorithms that Chris@19: do not destroy the input, at the expense of worse performance; for Chris@19: multi-dimensional c2r transforms, however, no input-preserving Chris@19: algorithms are implemented and the planner will return `NULL' if Chris@19: one is requested. Chris@19: Chris@19: * `FFTW_UNALIGNED' specifies that the algorithm may not impose any Chris@19: unusual alignment requirements on the input/output arrays (i.e. no Chris@19: SIMD may be used). This flag is normally _not necessary_, since Chris@19: the planner automatically detects misaligned arrays. The only use Chris@19: for this flag is if you want to use the new-array execute Chris@19: interface to execute a given plan on a different array that may Chris@19: not be aligned like the original. (Using `fftw_malloc' makes this Chris@19: flag unnecessary even then. You can also use `fftw_alignment_of' Chris@19: to detect whether two arrays are equivalently aligned.) Chris@19: Chris@19: Chris@19: Limiting planning time Chris@19: ...................... Chris@19: Chris@19: extern void fftw_set_timelimit(double seconds); Chris@19: Chris@19: This function instructs FFTW to spend at most `seconds' seconds Chris@19: (approximately) in the planner. If `seconds == FFTW_NO_TIMELIMIT' (the Chris@19: default value, which is negative), then planning time is unbounded. Chris@19: Otherwise, FFTW plans with a progressively wider range of algorithms Chris@19: until the given time limit is reached or the given range of Chris@19: algorithms is explored, returning the best available plan.
Chris@19: Chris@19: For example, specifying `FFTW_PATIENT' first plans in Chris@19: `FFTW_ESTIMATE' mode, then in `FFTW_MEASURE' mode, then finally (time Chris@19: permitting) in `FFTW_PATIENT'. If `FFTW_EXHAUSTIVE' is specified Chris@19: instead, the planner will further progress to `FFTW_EXHAUSTIVE' mode. Chris@19: Chris@19: Note that the `seconds' argument specifies only a rough limit; in Chris@19: practice, the planner may use somewhat more time if the time limit is Chris@19: reached when the planner is in the middle of an operation that cannot Chris@19: be interrupted. At the very least, the planner will complete planning Chris@19: in `FFTW_ESTIMATE' mode (which is thus equivalent to a time limit of 0). Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Real-data DFTs, Next: Real-data DFT Array Format, Prev: Planner Flags, Up: Basic Interface Chris@19: Chris@19: 4.3.3 Real-data DFTs Chris@19: -------------------- Chris@19: Chris@19: fftw_plan fftw_plan_dft_r2c_1d(int n0, Chris@19: double *in, fftw_complex *out, Chris@19: unsigned flags); Chris@19: fftw_plan fftw_plan_dft_r2c_2d(int n0, int n1, Chris@19: double *in, fftw_complex *out, Chris@19: unsigned flags); Chris@19: fftw_plan fftw_plan_dft_r2c_3d(int n0, int n1, int n2, Chris@19: double *in, fftw_complex *out, Chris@19: unsigned flags); Chris@19: fftw_plan fftw_plan_dft_r2c(int rank, const int *n, Chris@19: double *in, fftw_complex *out, Chris@19: unsigned flags); Chris@19: Chris@19: Plan a real-input/complex-output discrete Fourier transform (DFT) in Chris@19: zero or more dimensions, returning an `fftw_plan' (*note Using Plans::). Chris@19: Chris@19: Once you have created a plan for a certain transform type and Chris@19: parameters, then creating another plan of the same type and parameters, Chris@19: but for different arrays, is fast and shares constant data with the Chris@19: first plan (if it still exists). Chris@19: Chris@19: The planner returns `NULL' if the plan cannot be created. 
A Chris@19: non-`NULL' plan is always returned by the basic interface unless you Chris@19: are using a customized FFTW configuration supporting a restricted set Chris@19: of transforms, or if you use the `FFTW_PRESERVE_INPUT' flag with a Chris@19: multi-dimensional out-of-place c2r transform (see below). Chris@19: Chris@19: Arguments Chris@19: ......... Chris@19: Chris@19: * `rank' is the rank of the transform (it should be the size of the Chris@19: array `*n'), and can be any non-negative integer. (*Note Complex Chris@19: Multi-Dimensional DFTs::, for the definition of "rank".) The Chris@19: `_1d', `_2d', and `_3d' planners correspond to a `rank' of `1', Chris@19: `2', and `3', respectively. The rank may be zero, which is Chris@19: equivalent to a rank-1 transform of size 1, i.e. a copy of one Chris@19: real number (with zero imaginary part) from input to output. Chris@19: Chris@19: * `n0', `n1', `n2', or `n[0..rank-1]', (as appropriate for each Chris@19: routine) specify the size of the transform dimensions. They can Chris@19: be any positive integer. This is different in general from the Chris@19: _physical_ array dimensions, which are described in *note Chris@19: Real-data DFT Array Format::. Chris@19: Chris@19: - FFTW is best at handling sizes of the form 2^a 3^b 5^c 7^d Chris@19: 11^e 13^f, where e+f is either 0 or 1, and the other exponents Chris@19: are arbitrary. Other sizes are computed by means of a slow, Chris@19: general-purpose algorithm (which nevertheless retains O(n log Chris@19: n) performance even for prime sizes). (It is possible to Chris@19: customize FFTW for different array sizes; see *note Chris@19: Installation and Customization::.) Transforms whose sizes Chris@19: are powers of 2 are especially fast, and it is generally Chris@19: beneficial for the _last_ dimension of an r2c/c2r transform Chris@19: to be _even_. 
Chris@19: Chris@19: * `in' and `out' point to the input and output arrays of the Chris@19: transform, which may be the same (yielding an in-place transform). These Chris@19: arrays are overwritten during planning, unless `FFTW_ESTIMATE' is Chris@19: used in the flags. (The arrays need not be initialized, but they Chris@19: must be allocated.) For an in-place transform, it is important to Chris@19: remember that the real array will require padding, described in Chris@19: *note Real-data DFT Array Format::. Chris@19: Chris@19: * `flags' is a bitwise OR (`|') of zero or more planner flags, as Chris@19: defined in *note Planner Flags::. Chris@19: Chris@19: Chris@19: The inverse transforms, taking complex input (storing the Chris@19: non-redundant half of a logically Hermitian array) to real output, are Chris@19: given by: Chris@19: Chris@19: fftw_plan fftw_plan_dft_c2r_1d(int n0, Chris@19: fftw_complex *in, double *out, Chris@19: unsigned flags); Chris@19: fftw_plan fftw_plan_dft_c2r_2d(int n0, int n1, Chris@19: fftw_complex *in, double *out, Chris@19: unsigned flags); Chris@19: fftw_plan fftw_plan_dft_c2r_3d(int n0, int n1, int n2, Chris@19: fftw_complex *in, double *out, Chris@19: unsigned flags); Chris@19: fftw_plan fftw_plan_dft_c2r(int rank, const int *n, Chris@19: fftw_complex *in, double *out, Chris@19: unsigned flags); Chris@19: Chris@19: The arguments are the same as for the r2c transforms, except that the Chris@19: input and output data formats are reversed. Chris@19: Chris@19: FFTW computes an unnormalized transform: computing an r2c followed Chris@19: by a c2r transform (or vice versa) will result in the original data Chris@19: multiplied by the size of the transform (the product of the logical Chris@19: dimensions). An r2c transform produces the same output as a Chris@19: `FFTW_FORWARD' complex DFT of the same input, and a c2r transform is Chris@19: correspondingly equivalent to `FFTW_BACKWARD'. 
For more information, Chris@19: see *note What FFTW Really Computes::. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Real-data DFT Array Format, Next: Real-to-Real Transforms, Prev: Real-data DFTs, Up: Basic Interface Chris@19: Chris@19: 4.3.4 Real-data DFT Array Format Chris@19: -------------------------------- Chris@19: Chris@19: The output of a DFT of real data (r2c) contains symmetries that, in Chris@19: principle, make half of the outputs redundant (*note What FFTW Really Chris@19: Computes::). (Similarly for the input of an inverse c2r transform.) In Chris@19: practice, it is not possible to entirely realize these savings in an Chris@19: efficient and understandable format that generalizes to Chris@19: multi-dimensional transforms. Instead, the output of the r2c Chris@19: transforms is _slightly_ over half of the output of the corresponding Chris@19: complex transform. We do not "pack" the data in any way, but store it Chris@19: as an ordinary array of `fftw_complex' values. In fact, this data is Chris@19: simply a subsection of what would be the array in the corresponding Chris@19: complex transform. Chris@19: Chris@19: Specifically, for a real transform of d (= `rank') dimensions n[0] x Chris@19: n[1] x n[2] x ... x n[d-1] , the complex data is an n[0] x n[1] x n[2] Chris@19: x ... x (n[d-1]/2 + 1) array of `fftw_complex' values in row-major Chris@19: order (with the division rounded down). That is, we only store the Chris@19: _lower_ half (non-negative frequencies), plus one element, of the last Chris@19: dimension of the data from the ordinary complex transform. (We could Chris@19: have instead taken half of any other dimension, but implementation Chris@19: turns out to be simpler if the last, contiguous, dimension is used.) Chris@19: Chris@19: For an out-of-place transform, the real data is simply an array with Chris@19: physical dimensions n[0] x n[1] x n[2] x ... x n[d-1] in row-major Chris@19: order. 
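The array sizes described above for the out-of-place case reduce to a short calculation. The following is a minimal sketch; the helper names `demo_r2c_complex_count' and `demo_r2c_real_count' are hypothetical and are not FFTW functions. It assumes every `n[i]' is positive and that the products fit in `size_t'.

```c
#include <stddef.h>

/* Hypothetical helper, not part of FFTW: number of fftw_complex
   elements on the complex side of an r2c/c2r transform with logical
   dimensions n[0] x ... x n[rank-1].  Only the last dimension is cut
   to n[rank-1]/2 + 1, with the division rounded down. */
static size_t demo_r2c_complex_count(int rank, const int *n)
{
    size_t count = 1;
    int i;
    for (i = 0; i < rank; ++i) {
        size_t dim = (size_t) n[i];
        if (i == rank - 1)
            dim = dim / 2 + 1;
        count *= dim;
    }
    return count;
}

/* Hypothetical helper, not part of FFTW: number of double elements in
   the real array of an out-of-place transform, which is simply the
   product of the logical dimensions. */
static size_t demo_r2c_real_count(int rank, const int *n)
{
    size_t count = 1;
    int i;
    for (i = 0; i < rank; ++i)
        count *= (size_t) n[i];
    return count;
}
```

For a 4 x 6 transform this gives 24 real values and 4 x (6/2 + 1) = 16 complex values. These counts are the element counts one would pass to `fftw_alloc_real' and `fftw_alloc_complex', respectively.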
Chris@19: Chris@19: For an in-place transform, some complications arise since the Chris@19: complex data is slightly larger than the real data. In this case, the Chris@19: final dimension of the real data must be _padded_ with extra values to Chris@19: accommodate the size of the complex data--two extra if the last Chris@19: dimension is even and one if it is odd. That is, the last dimension of Chris@19: the real data must physically contain 2 * (n[d-1]/2+1) `double' values Chris@19: (exactly enough to hold the complex data). This physical array size Chris@19: does not, however, change the _logical_ array size--only n[d-1] values Chris@19: are actually stored in the last dimension, and n[d-1] is the last Chris@19: dimension passed to the planner. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Real-to-Real Transforms, Next: Real-to-Real Transform Kinds, Prev: Real-data DFT Array Format, Up: Basic Interface Chris@19: Chris@19: 4.3.5 Real-to-Real Transforms Chris@19: ----------------------------- Chris@19: Chris@19: fftw_plan fftw_plan_r2r_1d(int n, double *in, double *out, Chris@19: fftw_r2r_kind kind, unsigned flags); Chris@19: fftw_plan fftw_plan_r2r_2d(int n0, int n1, double *in, double *out, Chris@19: fftw_r2r_kind kind0, fftw_r2r_kind kind1, Chris@19: unsigned flags); Chris@19: fftw_plan fftw_plan_r2r_3d(int n0, int n1, int n2, Chris@19: double *in, double *out, Chris@19: fftw_r2r_kind kind0, Chris@19: fftw_r2r_kind kind1, Chris@19: fftw_r2r_kind kind2, Chris@19: unsigned flags); Chris@19: fftw_plan fftw_plan_r2r(int rank, const int *n, double *in, double *out, Chris@19: const fftw_r2r_kind *kind, unsigned flags); Chris@19: Chris@19: Plan a real input/output (r2r) transform of various kinds in zero or Chris@19: more dimensions, returning an `fftw_plan' (*note Using Plans::). 
Chris@19: Chris@19: Once you have created a plan for a certain transform type and Chris@19: parameters, then creating another plan of the same type and parameters, Chris@19: but for different arrays, is fast and shares constant data with the Chris@19: first plan (if it still exists). Chris@19: Chris@19: The planner returns `NULL' if the plan cannot be created. A Chris@19: non-`NULL' plan is always returned by the basic interface unless you Chris@19: are using a customized FFTW configuration supporting a restricted set Chris@19: of transforms, or for size-1 `FFTW_REDFT00' kinds (which are not Chris@19: defined). Chris@19: Chris@19: Arguments Chris@19: ......... Chris@19: Chris@19: * `rank' is the dimensionality of the transform (it should be the Chris@19: size of the arrays `*n' and `*kind'), and can be any non-negative Chris@19: integer. The `_1d', `_2d', and `_3d' planners correspond to a Chris@19: `rank' of `1', `2', and `3', respectively. A `rank' of zero is Chris@19: equivalent to a copy of one number from input to output. Chris@19: Chris@19: * `n', or `n0'/`n1'/`n2', or `n[rank]', respectively, gives the Chris@19: (physical) size of the transform dimensions. They can be any Chris@19: positive integer. Chris@19: Chris@19: - Multi-dimensional arrays are stored in row-major order with Chris@19: dimensions: `n0' x `n1'; or `n0' x `n1' x `n2'; or `n[0]' x Chris@19: `n[1]' x ... x `n[rank-1]'. *Note Multi-dimensional Array Chris@19: Format::. Chris@19: Chris@19: - FFTW is generally best at handling sizes of the form 2^a 3^b Chris@19: 5^c 7^d 11^e 13^f, where e+f is either 0 or 1, and the other Chris@19: exponents are arbitrary. Other sizes are computed by means Chris@19: of a slow, general-purpose algorithm (which nevertheless Chris@19: retains O(n log n) performance even for prime sizes). (It Chris@19: is possible to customize FFTW for different array sizes; see Chris@19: *note Installation and Customization::.) 
Transforms whose Chris@19: sizes are powers of 2 are especially fast. Chris@19: Chris@19: - For a `REDFT00' or `RODFT00' transform kind in a dimension of Chris@19: size n, it is n-1 or n+1, respectively, that should be Chris@19: factorizable in the above form. Chris@19: Chris@19: * `in' and `out' point to the input and output arrays of the Chris@19: transform, which may be the same (yielding an in-place transform). These Chris@19: arrays are overwritten during planning, unless `FFTW_ESTIMATE' is Chris@19: used in the flags. (The arrays need not be initialized, but they Chris@19: must be allocated.) Chris@19: Chris@19: * `kind', or `kind0'/`kind1'/`kind2', or `kind[rank]', is the kind Chris@19: of r2r transform used for the corresponding dimension. The valid Chris@19: kind constants are described in *note Real-to-Real Transform Chris@19: Kinds::. In a multi-dimensional transform, what is computed is Chris@19: the separable product formed by taking each transform kind along Chris@19: the corresponding dimension, one dimension after another. Chris@19: Chris@19: * `flags' is a bitwise OR (`|') of zero or more planner flags, as Chris@19: defined in *note Planner Flags::. Chris@19: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Real-to-Real Transform Kinds, Prev: Real-to-Real Transforms, Up: Basic Interface Chris@19: Chris@19: 4.3.6 Real-to-Real Transform Kinds Chris@19: ---------------------------------- Chris@19: Chris@19: FFTW currently supports 11 different r2r transform kinds, specified by Chris@19: one of the constants below. For the precise definitions of these Chris@19: transforms, see *note What FFTW Really Computes::. For a more Chris@19: colloquial introduction to these transform kinds, see *note More DFTs Chris@19: of Real Data::. 
Chris@19: Chris@19: For a dimension of size `n', there is a corresponding "logical" Chris@19: dimension `N' that determines the normalization (and the optimal Chris@19: factorization); the formula for `N' is given for each kind below. Chris@19: Also, with each transform kind is listed its corresponding inverse Chris@19: transform. FFTW computes unnormalized transforms: a transform followed Chris@19: by its inverse will result in the original data multiplied by `N' (or Chris@19: the product of the `N''s for each dimension, in multi-dimensions). Chris@19: Chris@19: * `FFTW_R2HC' computes a real-input DFT with output in "halfcomplex" Chris@19: format, i.e. real and imaginary parts for a transform of size `n' Chris@19: stored as: r0, r1, r2, ..., r(n/2), i((n+1)/2-1), ..., i2, i1 (Logical Chris@19: `N=n', inverse is `FFTW_HC2R'.) Chris@19: Chris@19: * `FFTW_HC2R' computes the reverse of `FFTW_R2HC', above. (Logical Chris@19: `N=n', inverse is `FFTW_R2HC'.) Chris@19: Chris@19: * `FFTW_DHT' computes a discrete Hartley transform. (Logical `N=n', Chris@19: inverse is `FFTW_DHT'.) Chris@19: Chris@19: * `FFTW_REDFT00' computes an REDFT00 transform, i.e. a DCT-I. Chris@19: (Logical `N=2*(n-1)', inverse is `FFTW_REDFT00'.) Chris@19: Chris@19: * `FFTW_REDFT10' computes an REDFT10 transform, i.e. a DCT-II Chris@19: (sometimes called "the" DCT). (Logical `N=2*n', inverse is Chris@19: `FFTW_REDFT01'.) Chris@19: Chris@19: * `FFTW_REDFT01' computes an REDFT01 transform, i.e. a DCT-III Chris@19: (sometimes called "the" IDCT, being the inverse of DCT-II). Chris@19: (Logical `N=2*n', inverse is `FFTW_REDFT10'.) Chris@19: Chris@19: * `FFTW_REDFT11' computes an REDFT11 transform, i.e. a DCT-IV. Chris@19: (Logical `N=2*n', inverse is `FFTW_REDFT11'.) Chris@19: Chris@19: * `FFTW_RODFT00' computes an RODFT00 transform, i.e. a DST-I. Chris@19: (Logical `N=2*(n+1)', inverse is `FFTW_RODFT00'.) Chris@19: Chris@19: * `FFTW_RODFT10' computes an RODFT10 transform, i.e. a DST-II.
Chris@19: (Logical `N=2*n', inverse is `FFTW_RODFT01'.) Chris@19: Chris@19: * `FFTW_RODFT01' computes an RODFT01 transform, i.e. a DST-III. Chris@19: (Logical `N=2*n', inverse is `FFTW_RODFT10'.) Chris@19: Chris@19: * `FFTW_RODFT11' computes an RODFT11 transform, i.e. a DST-IV. Chris@19: (Logical `N=2*n', inverse is `FFTW_RODFT11'.) Chris@19: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Advanced Interface, Next: Guru Interface, Prev: Basic Interface, Up: FFTW Reference Chris@19: Chris@19: 4.4 Advanced Interface Chris@19: ====================== Chris@19: Chris@19: FFTW's "advanced" interface supplements the basic interface with four Chris@19: new planner routines, providing a new level of flexibility: you can plan Chris@19: a transform of multiple arrays simultaneously, operate on non-contiguous Chris@19: (strided) data, and transform a subset of a larger multi-dimensional Chris@19: array. Other than these additional features, the planner operates in Chris@19: the same fashion as in the basic interface, and the resulting Chris@19: `fftw_plan' is used in the same way (*note Using Plans::).
Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * Advanced Complex DFTs:: Chris@19: * Advanced Real-data DFTs:: Chris@19: * Advanced Real-to-real Transforms:: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Advanced Complex DFTs, Next: Advanced Real-data DFTs, Prev: Advanced Interface, Up: Advanced Interface Chris@19: Chris@19: 4.4.1 Advanced Complex DFTs Chris@19: --------------------------- Chris@19: Chris@19: fftw_plan fftw_plan_many_dft(int rank, const int *n, int howmany, Chris@19: fftw_complex *in, const int *inembed, Chris@19: int istride, int idist, Chris@19: fftw_complex *out, const int *onembed, Chris@19: int ostride, int odist, Chris@19: int sign, unsigned flags); Chris@19: Chris@19: This routine plans multiple multidimensional complex DFTs, and it Chris@19: extends the `fftw_plan_dft' routine (*note Complex DFTs::) to compute Chris@19: `howmany' transforms, each having rank `rank' and size `n'. In Chris@19: addition, the transform data need not be contiguous, but it may be laid Chris@19: out in memory with an arbitrary stride. To account for these Chris@19: possibilities, `fftw_plan_many_dft' adds the new parameters `howmany', Chris@19: {`i',`o'}`nembed', {`i',`o'}`stride', and {`i',`o'}`dist'. The FFTW Chris@19: basic interface (*note Complex DFTs::) provides routines specialized Chris@19: for ranks 1, 2, and 3, but the advanced interface handles only the Chris@19: general-rank case. Chris@19: Chris@19: `howmany' is the number of transforms to compute. The resulting Chris@19: plan computes `howmany' transforms, where the input of the `k'-th Chris@19: transform is at location `in+k*idist' (in C pointer arithmetic), and Chris@19: its output is at location `out+k*odist'. Plans obtained in this way Chris@19: can often be faster than calling FFTW multiple times for the individual Chris@19: transforms. The basic `fftw_plan_dft' interface corresponds to Chris@19: `howmany=1' (in which case the `dist' parameters are ignored). 
Chris@19: Chris@19: Each of the `howmany' transforms has rank `rank' and size `n', as in Chris@19: the basic interface. In addition, the advanced interface allows the Chris@19: input and output arrays of each transform to be row-major subarrays of Chris@19: larger rank-`rank' arrays, described by `inembed' and `onembed' Chris@19: parameters, respectively. {`i',`o'}`nembed' must be arrays of length Chris@19: `rank', and `n' should be elementwise less than or equal to Chris@19: {`i',`o'}`nembed'. Passing `NULL' for an `nembed' parameter is Chris@19: equivalent to passing `n' (i.e. same physical and logical dimensions, Chris@19: as in the basic interface.) Chris@19: Chris@19: The `stride' parameters indicate that the `j'-th element of the Chris@19: input or output arrays is located at `j*istride' or `j*ostride', Chris@19: respectively. (For a multi-dimensional array, `j' is the ordinary Chris@19: row-major index.) When combined with the `k'-th transform in a Chris@19: `howmany' loop, from above, this means that the (`j',`k')-th element is Chris@19: at `j*stride+k*dist'. (The basic `fftw_plan_dft' interface corresponds Chris@19: to a stride of 1.) Chris@19: Chris@19: For in-place transforms, the input and output `stride' and `dist' Chris@19: parameters should be the same; otherwise, the planner may return `NULL'. Chris@19: Chris@19: Arrays `n', `inembed', and `onembed' are not used after this Chris@19: function returns. You can safely free or reuse them. 
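The addressing rule described above (`j*stride + k*dist') can be sketched as a small helper. This is only an illustration of the arithmetic, not FFTW code: the name `many_offset' is hypothetical, and the restriction to a positive stride and a non-negative distance is an assumption of this sketch, not a documented FFTW requirement.

```c
#include <assert.h>
#include <stddef.h>

/* Offset, in units of the array's element type, of the j-th element
   (ordinary row-major index j) of the k-th transform. */
size_t many_offset(size_t j, size_t k, int stride, int dist)
{
    assert(stride > 0 && dist >= 0);  /* assumption of this sketch */
    return j * (size_t)stride + k * (size_t)dist;
}
```

With `stride = 3' and `dist = 1', element `j' of transform `k' sits at offset `3*j + k', which is row `j', column `k' of a row-major array with 3 columns; each transform therefore walks down one column.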
Chris@19: Chris@19: *Examples*: One transform of one 5 by 6 array contiguous in memory: Chris@19: int rank = 2; Chris@19: int n[] = {5, 6}; Chris@19: int howmany = 1; Chris@19: int idist = 0, odist = 0; /* unused because howmany = 1 */ Chris@19: int istride = 1, ostride = 1; /* array is contiguous in memory */ Chris@19: int *inembed = n, *onembed = n; Chris@19: Chris@19: Transform of three 5 by 6 arrays, each contiguous in memory, stored Chris@19: in memory one after another: Chris@19: int rank = 2; Chris@19: int n[] = {5, 6}; Chris@19: int howmany = 3; Chris@19: int idist = n[0]*n[1], odist = n[0]*n[1]; /* = 30, the distance in memory Chris@19: between the first element Chris@19: of the first array and the Chris@19: first element of the second array */ Chris@19: int istride = 1, ostride = 1; /* array is contiguous in memory */ Chris@19: int *inembed = n, *onembed = n; Chris@19: Chris@19: Transform each column of a 2d array with 10 rows and 3 columns: Chris@19: int rank = 1; /* not 2: we are computing 1d transforms */ Chris@19: int n[] = {10}; /* 1d transforms of length 10 */ Chris@19: int howmany = 3; Chris@19: int idist = 1, odist = 1; /* consecutive columns start one element apart */ Chris@19: int istride = 3, ostride = 3; /* distance between two elements in Chris@19: the same column */ Chris@19: int *inembed = n, *onembed = n; Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Advanced Real-data DFTs, Next: Advanced Real-to-real Transforms, Prev: Advanced Complex DFTs, Up: Advanced Interface Chris@19: Chris@19: 4.4.2 Advanced Real-data DFTs Chris@19: ----------------------------- Chris@19: Chris@19: fftw_plan fftw_plan_many_dft_r2c(int rank, const int *n, int howmany, Chris@19: double *in, const int *inembed, Chris@19: int istride, int idist, Chris@19: fftw_complex *out, const int *onembed, Chris@19: int ostride, int odist, Chris@19: unsigned flags); Chris@19: fftw_plan fftw_plan_many_dft_c2r(int rank, const int *n, int howmany, Chris@19: fftw_complex *in, const int *inembed, Chris@19: int istride, int idist, Chris@19: double *out,
const int *onembed, Chris@19: int ostride, int odist, Chris@19: unsigned flags); Chris@19: Chris@19: Like `fftw_plan_many_dft', these two functions add `howmany', Chris@19: `nembed', `stride', and `dist' parameters to the `fftw_plan_dft_r2c' Chris@19: and `fftw_plan_dft_c2r' functions, but otherwise behave the same as the Chris@19: basic interface. Chris@19: Chris@19: The interpretation of `howmany', `stride', and `dist' is the same Chris@19: as for `fftw_plan_many_dft', above. Note that the `stride' and `dist' Chris@19: for the real array are in units of `double', and for the complex array Chris@19: are in units of `fftw_complex'. Chris@19: Chris@19: If an `nembed' parameter is `NULL', it is interpreted as what it Chris@19: would be in the basic interface, as described in *note Real-data DFT Chris@19: Array Format::. That is, for the complex array the size is assumed to Chris@19: be the same as `n', but with the last dimension cut roughly in half. Chris@19: For the real array, the size is assumed to be `n' if the transform is Chris@19: out-of-place, or `n' with the last dimension "padded" if the transform Chris@19: is in-place. Chris@19: Chris@19: If an `nembed' parameter is non-`NULL', it is interpreted as the Chris@19: physical size of the corresponding array, in row-major order, just as Chris@19: for `fftw_plan_many_dft'. In this case, each dimension of `nembed' Chris@19: should be `>=' what it would be in the basic interface (e.g. the halved Chris@19: or padded `n'). Chris@19: Chris@19: Arrays `n', `inembed', and `onembed' are not used after this Chris@19: function returns. You can safely free or reuse them.
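The physical sizes assumed for a `NULL' `nembed' can be written out explicitly. The sketch below follows the rules in *note Real-data DFT Array Format:: (last complex dimension `n/2+1', padded real last dimension `2*(n/2+1)' for in-place transforms). The function names are hypothetical helpers for illustration and are not provided by FFTW.

```c
#include <assert.h>

/* Physical dimensions assumed for the complex array when its
   nembed argument is NULL: n with the last dimension cut to
   n[rank-1]/2 + 1. */
void default_complex_nembed(int rank, const int *n, int *nembed)
{
    assert(rank >= 1);
    for (int i = 0; i < rank; ++i)
        nembed[i] = n[i];
    nembed[rank - 1] = n[rank - 1] / 2 + 1;
}

/* Physical dimensions assumed for the real array when its nembed
   argument is NULL: n itself for an out-of-place transform, or n
   with the last dimension padded to 2*(n[rank-1]/2 + 1) doubles
   for an in-place transform. */
void default_real_nembed(int rank, const int *n, int in_place, int *nembed)
{
    assert(rank >= 1);
    for (int i = 0; i < rank; ++i)
        nembed[i] = n[i];
    if (in_place)
        nembed[rank - 1] = 2 * (n[rank - 1] / 2 + 1);
}
```

A caller that passes its own `nembed' arrays could compare them against these values to check the `>=' condition stated above.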
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Advanced Real-to-real Transforms, Prev: Advanced Real-data DFTs, Up: Advanced Interface Chris@19: Chris@19: 4.4.3 Advanced Real-to-real Transforms Chris@19: -------------------------------------- Chris@19: Chris@19: fftw_plan fftw_plan_many_r2r(int rank, const int *n, int howmany, Chris@19: double *in, const int *inembed, Chris@19: int istride, int idist, Chris@19: double *out, const int *onembed, Chris@19: int ostride, int odist, Chris@19: const fftw_r2r_kind *kind, unsigned flags); Chris@19: Chris@19: Like `fftw_plan_many_dft', this function adds `howmany', `nembed', Chris@19: `stride', and `dist' parameters to the `fftw_plan_r2r' function, but Chris@19: otherwise behaves the same as the basic interface. The interpretation Chris@19: of those additional parameters is the same as for Chris@19: `fftw_plan_many_dft'. (Of course, the `stride' and `dist' parameters Chris@19: are now in units of `double', not `fftw_complex'.) Chris@19: Chris@19: Arrays `n', `inembed', `onembed', and `kind' are not used after this Chris@19: function returns. You can safely free or reuse them. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Guru Interface, Next: New-array Execute Functions, Prev: Advanced Interface, Up: FFTW Reference Chris@19: Chris@19: 4.5 Guru Interface Chris@19: ================== Chris@19: Chris@19: The "guru" interface to FFTW is intended to expose as much as possible Chris@19: of the flexibility in the underlying FFTW architecture. It allows one Chris@19: to compute multi-dimensional "vectors" (loops) of multi-dimensional Chris@19: transforms, where each vector/transform dimension has an independent Chris@19: size and stride. One can also use more general complex-number formats, Chris@19: e.g. separate real and imaginary arrays.
Chris@19: Chris@19: For those users who require the flexibility of the guru interface, Chris@19: it is important that they pay special attention to the documentation Chris@19: lest they shoot themselves in the foot. Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * Interleaved and split arrays:: Chris@19: * Guru vector and transform sizes:: Chris@19: * Guru Complex DFTs:: Chris@19: * Guru Real-data DFTs:: Chris@19: * Guru Real-to-real Transforms:: Chris@19: * 64-bit Guru Interface:: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Interleaved and split arrays, Next: Guru vector and transform sizes, Prev: Guru Interface, Up: Guru Interface Chris@19: Chris@19: 4.5.1 Interleaved and split arrays Chris@19: ---------------------------------- Chris@19: Chris@19: The guru interface supports two representations of complex numbers, Chris@19: which we call the interleaved and the split format. Chris@19: Chris@19: The "interleaved" format is the same one used by the basic and Chris@19: advanced interfaces, and it is documented in *note Complex numbers::. Chris@19: In the interleaved format, you provide pointers to the real part of a Chris@19: complex number, and the imaginary part is understood to be stored in the Chris@19: next memory location. Chris@19: Chris@19: The "split" format allows separate pointers to the real and Chris@19: imaginary parts of a complex array. Chris@19: Chris@19: Technically, the interleaved format is redundant, because you can Chris@19: always express an interleaved array in terms of a split array with Chris@19: appropriate pointers and strides. On the other hand, the interleaved Chris@19: format is simpler to use, and it is common in practice. Hence, FFTW Chris@19: supports it as a special case.
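The remark that an interleaved array can always be expressed as a split array can be made concrete with a short sketch. It assumes the default representation of `fftw_complex' as two consecutive `double' values (real part first); the types `cplx_sketch' and `split_view' and the function `split_view_of' are hypothetical stand-ins so that the example compiles without `fftw3.h'.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for fftw_complex: real part, then imaginary part. */
typedef double cplx_sketch[2];

/* A split-format view: separate pointers to the real and imaginary
   parts, plus a stride counted in doubles. */
typedef struct {
    double *re;
    double *im;
    ptrdiff_t stride;
} split_view;

/* View a contiguous interleaved array as split data without copying. */
split_view split_view_of(cplx_sketch *a)
{
    split_view v;
    assert(a != NULL);
    v.re = (double *)a;      /* real part of element 0 */
    v.im = (double *)a + 1;  /* imaginary part is the next double */
    v.stride = 2;            /* consecutive real parts are 2 doubles apart */
    return v;
}
```

Element `j' of the interleaved array is then reached as `v.re[j * v.stride]' and `v.im[j * v.stride]'.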
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Guru vector and transform sizes, Next: Guru Complex DFTs, Prev: Interleaved and split arrays, Up: Guru Interface Chris@19: Chris@19: 4.5.2 Guru vector and transform sizes Chris@19: ------------------------------------- Chris@19: Chris@19: The guru interface introduces one basic new data structure, Chris@19: `fftw_iodim', that is used to specify sizes and strides for Chris@19: multi-dimensional transforms and vectors: Chris@19: Chris@19: typedef struct { Chris@19: int n; Chris@19: int is; Chris@19: int os; Chris@19: } fftw_iodim; Chris@19: Chris@19: Here, `n' is the size of the dimension, and `is' and `os' are the Chris@19: strides of that dimension for the input and output arrays. (The stride Chris@19: is the separation of consecutive elements along this dimension.) Chris@19: Chris@19: The meaning of the stride parameter depends on the type of the array Chris@19: that the stride refers to. _If the array is interleaved complex, Chris@19: strides are expressed in units of complex numbers (`fftw_complex'). If Chris@19: the array is split complex or real, strides are expressed in units of Chris@19: real numbers (`double')._ This convention is consistent with the usual Chris@19: pointer arithmetic in the C language. An interleaved array is denoted Chris@19: by a pointer `p' to `fftw_complex', so that `p+1' points to the next Chris@19: complex number. Split arrays are denoted by pointers to `double', in Chris@19: which case pointer arithmetic operates in units of `sizeof(double)'. Chris@19: Chris@19: The guru planner interfaces all take a (`rank', `dims[rank]') pair Chris@19: describing the transform size, and a (`howmany_rank', Chris@19: `howmany_dims[howmany_rank]') pair describing the "vector" size (a Chris@19: multi-dimensional loop of transforms to perform), where `dims' and Chris@19: `howmany_dims' are arrays of `fftw_iodim'. 
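To make the (`n', `is', `os') triples concrete, here is a minimal sketch that fills them in for a row-major array, using the usual row-major rule that each dimension's stride is the next dimension's size times that dimension's stride. The struct `iodim_sketch' is a local mirror of `fftw_iodim' so that the example compiles without `fftw3.h', and `fill_row_major_dims' is a hypothetical helper, not an FFTW routine.

```c
#include <assert.h>

/* Local mirror of fftw_iodim: size, input stride, output stride. */
typedef struct {
    int n;
    int is;
    int os;
} iodim_sketch;

/* Describe a row-major array of dimensions n[0] x ... x n[rank-1]
   whose last (contiguous) dimension has the given input and output
   strides. Strides are in units of the array's element type. */
void fill_row_major_dims(int rank, const int *n, int istride, int ostride,
                         iodim_sketch *dims)
{
    assert(rank >= 1);
    dims[rank - 1].n = n[rank - 1];
    dims[rank - 1].is = istride;
    dims[rank - 1].os = ostride;
    for (int i = rank - 2; i >= 0; --i) {
        dims[i].n = n[i];
        dims[i].is = n[i + 1] * dims[i + 1].is;
        dims[i].os = n[i + 1] * dims[i + 1].os;
    }
}
```

For a contiguous 5 x 6 x 7 array this yields input strides 42, 7, and 1 for the three dimensions.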
Chris@19: Chris@19: For example, the `howmany' parameter in the advanced complex-DFT Chris@19: interface corresponds to `howmany_rank' = 1, `howmany_dims[0].n' = Chris@19: `howmany', `howmany_dims[0].is' = `idist', and `howmany_dims[0].os' = Chris@19: `odist'. (To compute a single transform, you can just use Chris@19: `howmany_rank' = 0.) Chris@19: Chris@19: A row-major multidimensional array with dimensions `n[rank]' (*note Chris@19: Row-major Format::) corresponds to `dims[i].n' = `n[i]' and the Chris@19: recurrence `dims[i].is' = `n[i+1] * dims[i+1].is' (similarly for `os'). Chris@19: The stride of the last (`i=rank-1') dimension is the overall stride of Chris@19: the array. e.g. to be equivalent to the advanced complex-DFT Chris@19: interface, you would have `dims[rank-1].is' = `istride' and Chris@19: `dims[rank-1].os' = `ostride'. Chris@19: Chris@19: In general, we only guarantee FFTW to return a non-`NULL' plan if Chris@19: the vector and transform dimensions correspond to a set of distinct Chris@19: indices, and for in-place transforms the input/output strides should be Chris@19: the same. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Guru Complex DFTs, Next: Guru Real-data DFTs, Prev: Guru vector and transform sizes, Up: Guru Interface Chris@19: Chris@19: 4.5.3 Guru Complex DFTs Chris@19: ----------------------- Chris@19: Chris@19: fftw_plan fftw_plan_guru_dft( Chris@19: int rank, const fftw_iodim *dims, Chris@19: int howmany_rank, const fftw_iodim *howmany_dims, Chris@19: fftw_complex *in, fftw_complex *out, Chris@19: int sign, unsigned flags); Chris@19: Chris@19: fftw_plan fftw_plan_guru_split_dft( Chris@19: int rank, const fftw_iodim *dims, Chris@19: int howmany_rank, const fftw_iodim *howmany_dims, Chris@19: double *ri, double *ii, double *ro, double *io, Chris@19: unsigned flags); Chris@19: Chris@19: These two functions plan a complex-data, multi-dimensional DFT for Chris@19: the interleaved and split format, respectively. 
Transform dimensions Chris@19: are given by (`rank', `dims') over a multi-dimensional vector (loop) of Chris@19: dimensions (`howmany_rank', `howmany_dims'). `dims' and `howmany_dims' Chris@19: should point to `fftw_iodim' arrays of length `rank' and Chris@19: `howmany_rank', respectively. Chris@19: Chris@19: `flags' is a bitwise OR (`|') of zero or more planner flags, as Chris@19: defined in *note Planner Flags::. Chris@19: Chris@19: In the `fftw_plan_guru_dft' function, the pointers `in' and `out' Chris@19: point to the interleaved input and output arrays, respectively. The Chris@19: sign can be either -1 (= `FFTW_FORWARD') or +1 (= `FFTW_BACKWARD'). If Chris@19: the pointers are equal, the transform is in-place. Chris@19: Chris@19: In the `fftw_plan_guru_split_dft' function, `ri' and `ii' point to Chris@19: the real and imaginary input arrays, and `ro' and `io' point to the Chris@19: real and imaginary output arrays. The input and output pointers may be Chris@19: the same, indicating an in-place transform. For example, for Chris@19: `fftw_complex' pointers `in' and `out', the corresponding parameters Chris@19: are: Chris@19: Chris@19: ri = (double *) in; Chris@19: ii = (double *) in + 1; Chris@19: ro = (double *) out; Chris@19: io = (double *) out + 1; Chris@19: Chris@19: Because `fftw_plan_guru_split_dft' accepts split arrays, strides are Chris@19: expressed in units of `double'. For a contiguous `fftw_complex' array, Chris@19: the overall stride of the transform should be 2, the distance between Chris@19: consecutive real parts or between consecutive imaginary parts; see Chris@19: *note Guru vector and transform sizes::. Note that the dimension Chris@19: strides are applied equally to the real and imaginary parts; real and Chris@19: imaginary arrays with different strides are not supported. Chris@19: Chris@19: There is no `sign' parameter in `fftw_plan_guru_split_dft'. This Chris@19: function always plans for an `FFTW_FORWARD' transform. 
To plan for an Chris@19: `FFTW_BACKWARD' transform, you can exploit the identity that the Chris@19: backwards DFT is equal to the forwards DFT with the real and imaginary Chris@19: parts swapped. For example, in the case of the `fftw_complex' arrays Chris@19: above, the `FFTW_BACKWARD' transform is computed by the parameters: Chris@19: Chris@19: ri = (double *) in + 1; Chris@19: ii = (double *) in; Chris@19: ro = (double *) out + 1; Chris@19: io = (double *) out; Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Guru Real-data DFTs, Next: Guru Real-to-real Transforms, Prev: Guru Complex DFTs, Up: Guru Interface Chris@19: Chris@19: 4.5.4 Guru Real-data DFTs Chris@19: ------------------------- Chris@19: Chris@19: fftw_plan fftw_plan_guru_dft_r2c( Chris@19: int rank, const fftw_iodim *dims, Chris@19: int howmany_rank, const fftw_iodim *howmany_dims, Chris@19: double *in, fftw_complex *out, Chris@19: unsigned flags); Chris@19: Chris@19: fftw_plan fftw_plan_guru_split_dft_r2c( Chris@19: int rank, const fftw_iodim *dims, Chris@19: int howmany_rank, const fftw_iodim *howmany_dims, Chris@19: double *in, double *ro, double *io, Chris@19: unsigned flags); Chris@19: Chris@19: fftw_plan fftw_plan_guru_dft_c2r( Chris@19: int rank, const fftw_iodim *dims, Chris@19: int howmany_rank, const fftw_iodim *howmany_dims, Chris@19: fftw_complex *in, double *out, Chris@19: unsigned flags); Chris@19: Chris@19: fftw_plan fftw_plan_guru_split_dft_c2r( Chris@19: int rank, const fftw_iodim *dims, Chris@19: int howmany_rank, const fftw_iodim *howmany_dims, Chris@19: double *ri, double *ii, double *out, Chris@19: unsigned flags); Chris@19: Chris@19: Plan a real-input (r2c) or real-output (c2r), multi-dimensional DFT Chris@19: with transform dimensions given by (`rank', `dims') over a Chris@19: multi-dimensional vector (loop) of dimensions (`howmany_rank', Chris@19: `howmany_dims'). 
`dims' and `howmany_dims' should point to Chris@19: `fftw_iodim' arrays of length `rank' and `howmany_rank', respectively. Chris@19: As for the basic and advanced interfaces, an r2c transform is Chris@19: `FFTW_FORWARD' and a c2r transform is `FFTW_BACKWARD'. Chris@19: Chris@19: The _last_ dimension of `dims' is interpreted specially: that Chris@19: dimension of the real array has size `dims[rank-1].n', but that Chris@19: dimension of the complex array has size `dims[rank-1].n/2+1' (division Chris@19: rounded down). The strides, on the other hand, are taken to be exactly Chris@19: as specified. It is up to the user to specify the strides Chris@19: appropriately for the peculiar dimensions of the data, and we do not Chris@19: guarantee that the planner will succeed (return non-`NULL') for any Chris@19: dimensions other than those described in *note Real-data DFT Array Chris@19: Format:: and generalized in *note Advanced Real-data DFTs::. (That is, Chris@19: for an in-place transform, each individual dimension should be able to Chris@19: operate in place.) Chris@19: Chris@19: `in' and `out' point to the input and output arrays for r2c and c2r Chris@19: transforms, respectively. For split arrays, `ri' and `ii' point to the Chris@19: real and imaginary input arrays for a c2r transform, and `ro' and `io' Chris@19: point to the real and imaginary output arrays for an r2c transform. Chris@19: `in' and `ro' or `ri' and `out' may be the same, indicating an in-place Chris@19: transform. (In-place transforms where `in' and `io' or `ii' and `out' Chris@19: are the same are not currently supported.) Chris@19: Chris@19: `flags' is a bitwise OR (`|') of zero or more planner flags, as Chris@19: defined in *note Planner Flags::. Chris@19: Chris@19: In-place transforms of rank greater than 1 are currently only Chris@19: supported for interleaved arrays. For split arrays, the planner will Chris@19: return `NULL'. 
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Guru Real-to-real Transforms, Next: 64-bit Guru Interface, Prev: Guru Real-data DFTs, Up: Guru Interface Chris@19: Chris@19: 4.5.5 Guru Real-to-real Transforms Chris@19: ---------------------------------- Chris@19: Chris@19: fftw_plan fftw_plan_guru_r2r(int rank, const fftw_iodim *dims, Chris@19: int howmany_rank, Chris@19: const fftw_iodim *howmany_dims, Chris@19: double *in, double *out, Chris@19: const fftw_r2r_kind *kind, Chris@19: unsigned flags); Chris@19: Chris@19: Plan a real-to-real (r2r) multi-dimensional `FFTW_FORWARD' transform Chris@19: with transform dimensions given by (`rank', `dims') over a Chris@19: multi-dimensional vector (loop) of dimensions (`howmany_rank', Chris@19: `howmany_dims'). `dims' and `howmany_dims' should point to Chris@19: `fftw_iodim' arrays of length `rank' and `howmany_rank', respectively. Chris@19: Chris@19: The transform kind of each dimension is given by the `kind' Chris@19: parameter, which should point to an array of length `rank'. Valid Chris@19: `fftw_r2r_kind' constants are given in *note Real-to-Real Transform Chris@19: Kinds::. Chris@19: Chris@19: `in' and `out' point to the real input and output arrays; they may Chris@19: be the same, indicating an in-place transform. Chris@19: Chris@19: `flags' is a bitwise OR (`|') of zero or more planner flags, as Chris@19: defined in *note Planner Flags::. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: 64-bit Guru Interface, Prev: Guru Real-to-real Transforms, Up: Guru Interface Chris@19: Chris@19: 4.5.6 64-bit Guru Interface Chris@19: --------------------------- Chris@19: Chris@19: When compiled in 64-bit mode on a 64-bit architecture (where addresses Chris@19: are 64 bits wide), FFTW uses 64-bit quantities internally for all Chris@19: transform sizes, strides, and so on--you don't have to do anything Chris@19: special to exploit this. 
However, in the ordinary FFTW interfaces, you Chris@19: specify the transform size by an `int' quantity, which is normally only Chris@19: 32 bits wide. This means that, even though FFTW is using 64-bit sizes Chris@19: internally, you cannot specify a single transform dimension larger than Chris@19: 2^31-1 numbers. Chris@19: Chris@19: We expect that few users will require transforms larger than this, Chris@19: but, for those who do, we provide a 64-bit version of the guru Chris@19: interface in which all sizes are specified as integers of type Chris@19: `ptrdiff_t' instead of `int'. (`ptrdiff_t' is a signed integer type Chris@19: defined by the C standard to be wide enough to represent address Chris@19: differences, and thus must be at least 64 bits wide on a 64-bit Chris@19: machine.) We stress that there is _no performance advantage_ to using Chris@19: this interface--the same internal FFTW code is employed regardless--and Chris@19: it is only necessary if you want to specify very large transform sizes. Chris@19: Chris@19: In particular, the 64-bit guru interface is a set of planner routines Chris@19: that are exactly the same as the guru planner routines, except that Chris@19: they are named with `guru64' instead of `guru' and they take arguments Chris@19: of type `fftw_iodim64' instead of `fftw_iodim'. For example, instead Chris@19: of `fftw_plan_guru_dft', we have `fftw_plan_guru64_dft'. Chris@19: Chris@19: fftw_plan fftw_plan_guru64_dft( Chris@19: int rank, const fftw_iodim64 *dims, Chris@19: int howmany_rank, const fftw_iodim64 *howmany_dims, Chris@19: fftw_complex *in, fftw_complex *out, Chris@19: int sign, unsigned flags); Chris@19: Chris@19: The `fftw_iodim64' type is similar to `fftw_iodim', with the same Chris@19: interpretation, except that it uses type `ptrdiff_t' instead of type Chris@19: `int'. 
Chris@19: Chris@19: typedef struct { Chris@19: ptrdiff_t n; Chris@19: ptrdiff_t is; Chris@19: ptrdiff_t os; Chris@19: } fftw_iodim64; Chris@19: Chris@19: Every other `fftw_plan_guru' function also has a `fftw_plan_guru64' Chris@19: equivalent, but we do not repeat their documentation here since they Chris@19: are identical to the 32-bit versions except as noted above. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: New-array Execute Functions, Next: Wisdom, Prev: Guru Interface, Up: FFTW Reference Chris@19: Chris@19: 4.6 New-array Execute Functions Chris@19: =============================== Chris@19: Chris@19: Normally, one executes a plan for the arrays with which the plan was Chris@19: created, by calling `fftw_execute(plan)' as described in *note Using Chris@19: Plans::. However, it is possible for sophisticated users to apply a Chris@19: given plan to a _different_ array using the "new-array execute" Chris@19: functions detailed below, provided that the following conditions are Chris@19: met: Chris@19: Chris@19: * The array size, strides, etcetera are the same (since those are Chris@19: set by the plan). Chris@19: Chris@19: * The input and output arrays are the same (in-place) or different Chris@19: (out-of-place) if the plan was originally created to be in-place or Chris@19: out-of-place, respectively. Chris@19: Chris@19: * For split arrays, the separations between the real and imaginary Chris@19: parts, `ii-ri' and `io-ro', are the same as they were for the Chris@19: input and output arrays when the plan was created. (This Chris@19: condition is automatically satisfied for interleaved arrays.) Chris@19: Chris@19: * The "alignment" of the new input/output arrays is the same as that Chris@19: of the input/output arrays when the plan was created, unless the Chris@19: plan was created with the `FFTW_UNALIGNED' flag. 
Here, the Chris@19: alignment is a platform-dependent quantity (for example, it is the Chris@19: address modulo 16 if SSE SIMD instructions are used, but the Chris@19: address modulo 4 for non-SIMD single-precision FFTW on the same Chris@19: machine). In general, only arrays allocated with `fftw_malloc' Chris@19: are guaranteed to be equally aligned (*note SIMD alignment and Chris@19: fftw_malloc::). Chris@19: Chris@19: Chris@19: The alignment issue is especially critical, because if you don't use Chris@19: `fftw_malloc' then you may have little control over the alignment of Chris@19: arrays in memory. For example, neither the C++ `new' operator nor the Chris@19: Fortran `allocate' statement provides strong enough guarantees about Chris@19: data alignment. If you don't use `fftw_malloc', therefore, you Chris@19: probably have to use `FFTW_UNALIGNED' (which disables most SIMD Chris@19: support). If possible, it is probably better for you to simply create Chris@19: multiple plans (creating a new plan is quick once one exists for a Chris@19: given size), or better yet re-use the same array for your transforms. Chris@19: Chris@19: For rare circumstances in which you cannot control the alignment of Chris@19: allocated memory, but wish to determine whether a given array is aligned Chris@19: like the original array for which a plan was created, you can use the Chris@19: `fftw_alignment_of' function: Chris@19: int fftw_alignment_of(double *p); Chris@19: Two arrays have equivalent alignment (for the purposes of applying a Chris@19: plan) if and only if `fftw_alignment_of' returns the same value for the Chris@19: corresponding pointers to their data (typecast to `double*' if Chris@19: necessary). Chris@19: Chris@19: If you are tempted to use the new-array execute interface because you Chris@19: want to transform a known bunch of arrays of the same size, you should Chris@19: probably use the advanced interface instead (*note Advanced Chris@19: Interface::).
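As a rough illustration of the alignment condition discussed above, the following sketch compares two addresses modulo 16, the figure the text quotes for SSE. It is not FFTW's implementation: `addr_mod' and `same_alignment' are hypothetical helpers, and the 16-byte boundary is an assumption taken from the SSE example. Real code should compare the results of `fftw_alignment_of' for both pointers, since the actual quantity is platform-dependent.

```c
#include <stdint.h>

/* Hypothetical helper: the address of p modulo `boundary` bytes. */
static unsigned addr_mod(const void *p, unsigned boundary)
{
    return (unsigned)((uintptr_t)p % boundary);
}

/* Hypothetical stand-in for comparing fftw_alignment_of(a) with
   fftw_alignment_of(b), assuming the 16-byte SSE case from the text. */
static int same_alignment(const void *a, const void *b)
{
    return addr_mod(a, 16) == addr_mod(b, 16);
}
```

Under this assumption, two pointers that differ by a multiple of 16 bytes compare as equally aligned, while a pointer offset by 8 bytes (one `double') from a 16-byte boundary does not.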
Chris@19: Chris@19: The new-array execute functions are: Chris@19: Chris@19: void fftw_execute_dft( Chris@19: const fftw_plan p, Chris@19: fftw_complex *in, fftw_complex *out); Chris@19: Chris@19: void fftw_execute_split_dft( Chris@19: const fftw_plan p, Chris@19: double *ri, double *ii, double *ro, double *io); Chris@19: Chris@19: void fftw_execute_dft_r2c( Chris@19: const fftw_plan p, Chris@19: double *in, fftw_complex *out); Chris@19: Chris@19: void fftw_execute_split_dft_r2c( Chris@19: const fftw_plan p, Chris@19: double *in, double *ro, double *io); Chris@19: Chris@19: void fftw_execute_dft_c2r( Chris@19: const fftw_plan p, Chris@19: fftw_complex *in, double *out); Chris@19: Chris@19: void fftw_execute_split_dft_c2r( Chris@19: const fftw_plan p, Chris@19: double *ri, double *ii, double *out); Chris@19: Chris@19: void fftw_execute_r2r( Chris@19: const fftw_plan p, Chris@19: double *in, double *out); Chris@19: Chris@19: These execute the `plan' to compute the corresponding transform on Chris@19: the input/output arrays specified by the subsequent arguments. The Chris@19: input/output array arguments have the same meanings as the ones passed Chris@19: to the guru planner routines in the preceding sections. The `plan' is Chris@19: not modified, and these routines can be called as many times as Chris@19: desired, or intermixed with calls to the ordinary `fftw_execute'. Chris@19: Chris@19: The `plan' _must_ have been created for the transform type Chris@19: corresponding to the execute function, e.g. it must be a complex-DFT Chris@19: plan for `fftw_execute_dft'. Any of the planner routines for that Chris@19: transform type, from the basic to the guru interface, could have been Chris@19: used to create the plan, however. 
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Wisdom, Next: What FFTW Really Computes, Prev: New-array Execute Functions, Up: FFTW Reference Chris@19: Chris@19: 4.7 Wisdom Chris@19: ========== Chris@19: Chris@19: This section documents the FFTW mechanism for saving and restoring Chris@19: plans from disk. This mechanism is called "wisdom". Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * Wisdom Export:: Chris@19: * Wisdom Import:: Chris@19: * Forgetting Wisdom:: Chris@19: * Wisdom Utilities:: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Wisdom Export, Next: Wisdom Import, Prev: Wisdom, Up: Wisdom Chris@19: Chris@19: 4.7.1 Wisdom Export Chris@19: ------------------- Chris@19: Chris@19: int fftw_export_wisdom_to_filename(const char *filename); Chris@19: void fftw_export_wisdom_to_file(FILE *output_file); Chris@19: char *fftw_export_wisdom_to_string(void); Chris@19: void fftw_export_wisdom(void (*write_char)(char c, void *), void *data); Chris@19: Chris@19: These functions allow you to export all currently accumulated wisdom Chris@19: in a form from which it can be later imported and restored, even during Chris@19: a separate run of the program. (*Note Words of Wisdom-Saving Plans::.) Chris@19: The current store of wisdom is not affected by calling any of these Chris@19: routines. Chris@19: Chris@19: `fftw_export_wisdom' exports the wisdom to any output medium, as Chris@19: specified by the callback function `write_char'. `write_char' is a Chris@19: `putc'-like function that writes the character `c' to some output; its Chris@19: second parameter is the `data' pointer passed to `fftw_export_wisdom'. Chris@19: For convenience, the following three "wrapper" routines are provided: Chris@19: Chris@19: `fftw_export_wisdom_to_filename' writes wisdom to a file named Chris@19: `filename' (which is created or overwritten), returning `1' on success Chris@19: and `0' on failure. 
A lower-level function, which requires you to open Chris@19: and close the file yourself (e.g. if you want to write wisdom to a Chris@19: portion of a larger file) is `fftw_export_wisdom_to_file'. This writes Chris@19: the wisdom to the current position in `output_file', which should be Chris@19: open with write permission; upon exit, the file remains open and is Chris@19: positioned at the end of the wisdom data. Chris@19: Chris@19: `fftw_export_wisdom_to_string' returns a pointer to a Chris@19: `NULL'-terminated string holding the wisdom data. This string is Chris@19: dynamically allocated, and it is the responsibility of the caller to Chris@19: deallocate it with `free' when it is no longer needed. Chris@19: Chris@19: All of these routines export the wisdom in the same format, which we Chris@19: will not document here except to say that it is LISP-like ASCII text Chris@19: that is insensitive to white space. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Wisdom Import, Next: Forgetting Wisdom, Prev: Wisdom Export, Up: Wisdom Chris@19: Chris@19: 4.7.2 Wisdom Import Chris@19: ------------------- Chris@19: Chris@19: int fftw_import_system_wisdom(void); Chris@19: int fftw_import_wisdom_from_filename(const char *filename); Chris@19: int fftw_import_wisdom_from_string(const char *input_string); Chris@19: int fftw_import_wisdom(int (*read_char)(void *), void *data); Chris@19: Chris@19: These functions import wisdom into a program from data stored by the Chris@19: `fftw_export_wisdom' functions above. (*Note Words of Wisdom-Saving Chris@19: Plans::.) The imported wisdom replaces any wisdom already accumulated Chris@19: by the running program. Chris@19: Chris@19: `fftw_import_wisdom' imports wisdom from any input medium, as Chris@19: specified by the callback function `read_char'. `read_char' is a Chris@19: `getc'-like function that returns the next character in the input; its Chris@19: parameter is the `data' pointer passed to `fftw_import_wisdom'. 
If the Chris@19: end of the input data is reached (which should never happen for valid Chris@19: data), `read_char' should return `EOF' (as defined in `<stdio.h>'). Chris@19: For convenience, the following "wrapper" routines are provided: Chris@19: Chris@19: `fftw_import_wisdom_from_filename' reads wisdom from a file named Chris@19: `filename'. A lower-level function, which requires you to open and Chris@19: close the file yourself (e.g. if you want to read wisdom from a portion Chris@19: of a larger file) is `fftw_import_wisdom_from_file'. This reads wisdom Chris@19: from the current position in `input_file' (which should be open with Chris@19: read permission); upon exit, the file remains open, but the position of Chris@19: the read pointer is unspecified. Chris@19: Chris@19: `fftw_import_wisdom_from_string' reads wisdom from the Chris@19: `NULL'-terminated string `input_string'. Chris@19: Chris@19: `fftw_import_system_wisdom' reads wisdom from an Chris@19: implementation-defined standard file (`/etc/fftw/wisdom' on Unix and Chris@19: GNU systems). Chris@19: Chris@19: The return value of these import routines is `1' if the wisdom was Chris@19: read successfully and `0' otherwise. Note that, in all of these Chris@19: functions, any data in the input stream past the end of the wisdom data Chris@19: is simply ignored. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Forgetting Wisdom, Next: Wisdom Utilities, Prev: Wisdom Import, Up: Wisdom Chris@19: Chris@19: 4.7.3 Forgetting Wisdom Chris@19: ----------------------- Chris@19: Chris@19: void fftw_forget_wisdom(void); Chris@19: Chris@19: Calling `fftw_forget_wisdom' causes all accumulated `wisdom' to be Chris@19: discarded and its associated memory to be freed. (New `wisdom' can Chris@19: still be gathered subsequently, however.)
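The callback-based routines `fftw_export_wisdom' and `fftw_import_wisdom' from the Wisdom Export and Wisdom Import sections take a `putc'-like and a `getc'-like function, respectively. The sketch below shows one possible pair of callbacks that work on an in-memory buffer. The types `wisdom_sink' and `wisdom_source' and the function names are hypothetical; only the callback signatures are taken from the prototypes above. The block does not call FFTW itself, so that it compiles on its own.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer, passed as the `data` argument. */
typedef struct {
    char *buf;
    size_t len, cap;
} wisdom_sink;

/* putc-like callback matching void (*write_char)(char c, void *). */
static void sink_write_char(char c, void *data)
{
    wisdom_sink *s = (wisdom_sink *)data;
    if (s->len == s->cap) {
        size_t ncap = s->cap ? 2 * s->cap : 64;
        char *nbuf = realloc(s->buf, ncap);
        if (!nbuf)
            return; /* out of memory: this sketch drops the character */
        s->buf = nbuf;
        s->cap = ncap;
    }
    s->buf[s->len++] = c;
}

/* Hypothetical read cursor over a block of memory. */
typedef struct {
    const char *buf;
    size_t pos, len;
} wisdom_source;

/* getc-like callback matching int (*read_char)(void *); returns EOF
   (from <stdio.h>) once the buffer is exhausted. */
static int source_read_char(void *data)
{
    wisdom_source *s = (wisdom_source *)data;
    if (s->pos >= s->len)
        return EOF;
    return (unsigned char)s->buf[s->pos++];
}
```

With FFTW linked in, these would be passed as `fftw_export_wisdom(sink_write_char, &sink)' and `fftw_import_wisdom(source_read_char, &source)'. The caller owns `sink.buf' and should release it with `free'.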
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Wisdom Utilities, Prev: Forgetting Wisdom, Up: Wisdom Chris@19: Chris@19: 4.7.4 Wisdom Utilities Chris@19: ---------------------- Chris@19: Chris@19: FFTW includes two standalone utility programs that deal with wisdom. We Chris@19: merely summarize them here, since they come with their own `man' pages Chris@19: for Unix and GNU systems (with HTML versions on our web site). Chris@19: Chris@19: The first program is `fftw-wisdom' (or `fftwf-wisdom' in single Chris@19: precision, etcetera), which can be used to create a wisdom file Chris@19: containing plans for any of the transform sizes and types supported by Chris@19: FFTW. It is preferable to create wisdom directly from your executable Chris@19: (*note Caveats in Using Wisdom::), but this program is useful for Chris@19: creating global wisdom files for `fftw_import_system_wisdom'. Chris@19: Chris@19: The second program is `fftw-wisdom-to-conf', which takes a wisdom Chris@19: file as input and produces a "configuration routine" as output. The Chris@19: latter is a C subroutine that you can compile and link into your Chris@19: program, replacing a routine of the same name in the FFTW library, that Chris@19: determines which parts of FFTW are callable by your program. Chris@19: `fftw-wisdom-to-conf' produces a configuration routine that links to Chris@19: only those parts of FFTW needed by the saved plans in the wisdom, Chris@19: greatly reducing the size of statically linked executables (which should Chris@19: only attempt to create plans corresponding to those in the wisdom, Chris@19: however). Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: What FFTW Really Computes, Prev: Wisdom, Up: FFTW Reference Chris@19: Chris@19: 4.8 What FFTW Really Computes Chris@19: ============================= Chris@19: Chris@19: In this section, we provide precise mathematical definitions for the Chris@19: transforms that FFTW computes. 
These transform definitions are fairly Chris@19: standard, but some authors follow slightly different conventions for the Chris@19: normalization of the transform (the constant factor in front) and the Chris@19: sign of the complex exponent. We begin by presenting the Chris@19: one-dimensional (1d) transform definitions, and then give the Chris@19: straightforward extension to multi-dimensional transforms. Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * The 1d Discrete Fourier Transform (DFT):: Chris@19: * The 1d Real-data DFT:: Chris@19: * 1d Real-even DFTs (DCTs):: Chris@19: * 1d Real-odd DFTs (DSTs):: Chris@19: * 1d Discrete Hartley Transforms (DHTs):: Chris@19: * Multi-dimensional Transforms:: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: The 1d Discrete Fourier Transform (DFT), Next: The 1d Real-data DFT, Prev: What FFTW Really Computes, Up: What FFTW Really Computes Chris@19: Chris@19: 4.8.1 The 1d Discrete Fourier Transform (DFT) Chris@19: --------------------------------------------- Chris@19: Chris@19: The forward (`FFTW_FORWARD') discrete Fourier transform (DFT) of a 1d Chris@19: complex array X of size n computes an array Y, where: Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(-2 pi j k sqrt(-1)/n) . Chris@19: The backward (`FFTW_BACKWARD') DFT computes: Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(2 pi j k sqrt(-1)/n) . Chris@19: FFTW computes an unnormalized transform, in that there is no Chris@19: coefficient in front of the summation in the DFT. In other words, Chris@19: applying the forward and then the backward transform will multiply the Chris@19: input by n. Chris@19: Chris@19: From above, an `FFTW_FORWARD' transform corresponds to a sign of -1 Chris@19: in the exponent of the DFT. Note also that we use the standard Chris@19: "in-order" output ordering--the k-th output corresponds to the Chris@19: frequency k/n (or k/T, where T is your total sampling period). 
For Chris@19: those who like to think in terms of positive and negative frequencies, Chris@19: this means that the positive frequencies are stored in the first half Chris@19: of the output and the negative frequencies are stored in backwards Chris@19: order in the second half of the output. (The frequency -k/n is the Chris@19: same as the frequency (n-k)/n.) Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: The 1d Real-data DFT, Next: 1d Real-even DFTs (DCTs), Prev: The 1d Discrete Fourier Transform (DFT), Up: What FFTW Really Computes Chris@19: Chris@19: 4.8.2 The 1d Real-data DFT Chris@19: -------------------------- Chris@19: Chris@19: The real-input (r2c) DFT in FFTW computes the _forward_ transform Y of Chris@19: the real array X of size `n', exactly as defined above, i.e. Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(-2 pi j k sqrt(-1)/n). Chris@19: This output array Y can easily be shown to possess the "Hermitian" Chris@19: symmetry Y[k] = Y[n-k]*, where we take Y to be periodic so that Y[n] = Chris@19: Y[0]. Chris@19: Chris@19: As a result of this symmetry, half of the output Y is redundant Chris@19: (being the complex conjugate of the other half), and so the 1d r2c Chris@19: transforms only output elements 0...n/2 of Y (n/2+1 complex numbers), Chris@19: where the division by 2 is rounded down. Chris@19: Chris@19: Moreover, the Hermitian symmetry implies that Y[0] and, if n is Chris@19: even, the Y[n/2] element, are purely real. So, for the `R2HC' r2r Chris@19: transform, the halfcomplex output format does not store the Chris@19: imaginary parts of these elements. Chris@19: Chris@19: The c2r and `HC2R' r2r transforms compute the backward DFT of the Chris@19: _complex_ array X with Hermitian symmetry, stored in the r2c/`R2HC' Chris@19: output formats, respectively, where the backward transform is defined Chris@19: exactly as for the complex case: Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(2 pi j k sqrt(-1)/n).
Chris@19: The outputs `Y' of this transform can easily be seen to be purely Chris@19: real, and are stored as an array of real numbers. Chris@19: Chris@19: Like FFTW's complex DFT, these transforms are unnormalized. In other Chris@19: words, applying the real-to-complex (forward) and then the Chris@19: complex-to-real (backward) transform will multiply the input by n. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: 1d Real-even DFTs (DCTs), Next: 1d Real-odd DFTs (DSTs), Prev: The 1d Real-data DFT, Up: What FFTW Really Computes Chris@19: Chris@19: 4.8.3 1d Real-even DFTs (DCTs) Chris@19: ------------------------------ Chris@19: Chris@19: The Real-even symmetry DFTs in FFTW are exactly equivalent to the Chris@19: unnormalized forward (and backward) DFTs as defined above, where the Chris@19: input array X of length N is purely real and is also "even" symmetry. Chris@19: In this case, the output array is likewise real and even symmetry. Chris@19: Chris@19: For the case of `REDFT00', this even symmetry means that X[j] = Chris@19: X[N-j], where we take X to be periodic so that X[N] = X[0]. Because of Chris@19: this redundancy, only the first n real numbers are actually stored, Chris@19: where N = 2(n-1). Chris@19: Chris@19: The proper definition of even symmetry for `REDFT10', `REDFT01', and Chris@19: `REDFT11' transforms is somewhat more intricate because of the shifts Chris@19: by 1/2 of the input and/or output, although the corresponding boundary Chris@19: conditions are given in *note Real even/odd DFTs (cosine/sine Chris@19: transforms)::. Because of the even symmetry, however, the sine terms Chris@19: in the DFT all cancel and the remaining cosine terms are written Chris@19: explicitly below. This formulation often leads people to call such a Chris@19: transform a "discrete cosine transform" (DCT), although it is really Chris@19: just a special case of the DFT. 
Chris@19: Chris@19: In each of the definitions below, we transform a real array X of Chris@19: length n to a real array Y of length n: Chris@19: Chris@19: REDFT00 (DCT-I) Chris@19: ............... Chris@19: Chris@19: An `REDFT00' transform (type-I DCT) in FFTW is defined by: Y[k] = X[0] Chris@19: + (-1)^k X[n-1] + 2 (sum for j = 1 to n-2 of X[j] cos(pi jk /(n-1))). Chris@19: Note that this transform is not defined for n=1. For n=2, the Chris@19: summation term above is dropped as you might expect. Chris@19: Chris@19: REDFT10 (DCT-II) Chris@19: ................ Chris@19: Chris@19: An `REDFT10' transform (type-II DCT, sometimes called "the" DCT) in Chris@19: FFTW is defined by: Y[k] = 2 (sum for j = 0 to n-1 of X[j] cos(pi Chris@19: (j+1/2) k / n)). Chris@19: Chris@19: REDFT01 (DCT-III) Chris@19: ................. Chris@19: Chris@19: An `REDFT01' transform (type-III DCT) in FFTW is defined by: Y[k] = Chris@19: X[0] + 2 (sum for j = 1 to n-1 of X[j] cos(pi j (k+1/2) / n)). In the Chris@19: case of n=1, this reduces to Y[0] = X[0]. Up to a scale factor (see Chris@19: below), this is the inverse of `REDFT10' ("the" DCT), and so the Chris@19: `REDFT01' (DCT-III) is sometimes called the "IDCT". Chris@19: Chris@19: REDFT11 (DCT-IV) Chris@19: ................ Chris@19: Chris@19: An `REDFT11' transform (type-IV DCT) in FFTW is defined by: Y[k] = 2 Chris@19: (sum for j = 0 to n-1 of X[j] cos(pi (j+1/2) (k+1/2) / n)). Chris@19: Chris@19: Inverses and Normalization Chris@19: .......................... Chris@19: Chris@19: These definitions correspond directly to the unnormalized DFTs used Chris@19: elsewhere in FFTW (hence the factors of 2 in front of the summations). Chris@19: The unnormalized inverse of `REDFT00' is `REDFT00', of `REDFT10' is Chris@19: `REDFT01' and vice versa, and of `REDFT11' is `REDFT11'. Each Chris@19: unnormalized inverse results in the original array multiplied by N, Chris@19: where N is the _logical_ DFT size. 
For `REDFT00', N=2(n-1) (note that Chris@19: n=1 is not defined); otherwise, N=2n. Chris@19: Chris@19: In defining the discrete cosine transform, some authors also include Chris@19: additional factors of sqrt(2) (or its inverse) multiplying selected Chris@19: inputs and/or outputs. This is a mostly cosmetic change that makes the Chris@19: transform orthogonal, but sacrifices the direct equivalence to a Chris@19: symmetric DFT. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: 1d Real-odd DFTs (DSTs), Next: 1d Discrete Hartley Transforms (DHTs), Prev: 1d Real-even DFTs (DCTs), Up: What FFTW Really Computes Chris@19: Chris@19: 4.8.4 1d Real-odd DFTs (DSTs) Chris@19: ----------------------------- Chris@19: Chris@19: The Real-odd symmetry DFTs in FFTW are exactly equivalent to the Chris@19: unnormalized forward (and backward) DFTs as defined above, where the Chris@19: input array X of length N is purely real and is also "odd" symmetry. In Chris@19: this case, the output is odd symmetry and purely imaginary. Chris@19: Chris@19: For the case of `RODFT00', this odd symmetry means that X[j] = Chris@19: -X[N-j], where we take X to be periodic so that X[N] = X[0]. Because Chris@19: of this redundancy, only the first n real numbers starting at j=1 are Chris@19: actually stored (the j=0 element is zero), where N = 2(n+1). Chris@19: Chris@19: The proper definition of odd symmetry for `RODFT10', `RODFT01', and Chris@19: `RODFT11' transforms is somewhat more intricate because of the shifts Chris@19: by 1/2 of the input and/or output, although the corresponding boundary Chris@19: conditions are given in *note Real even/odd DFTs (cosine/sine Chris@19: transforms)::. Because of the odd symmetry, however, the cosine terms Chris@19: in the DFT all cancel and the remaining sine terms are written Chris@19: explicitly below. 
This formulation often leads people to call such a Chris@19: transform a "discrete sine transform" (DST), although it is really just Chris@19: a special case of the DFT. Chris@19: Chris@19: In each of the definitions below, we transform a real array X of Chris@19: length n to a real array Y of length n: Chris@19: Chris@19: RODFT00 (DST-I) Chris@19: ............... Chris@19: Chris@19: An `RODFT00' transform (type-I DST) in FFTW is defined by: Y[k] = 2 Chris@19: (sum for j = 0 to n-1 of X[j] sin(pi (j+1)(k+1) / (n+1))). Chris@19: Chris@19: RODFT10 (DST-II) Chris@19: ................ Chris@19: Chris@19: An `RODFT10' transform (type-II DST) in FFTW is defined by: Y[k] = 2 Chris@19: (sum for j = 0 to n-1 of X[j] sin(pi (j+1/2) (k+1) / n)). Chris@19: Chris@19: RODFT01 (DST-III) Chris@19: ................. Chris@19: Chris@19: An `RODFT01' transform (type-III DST) in FFTW is defined by: Y[k] = Chris@19: (-1)^k X[n-1] + 2 (sum for j = 0 to n-2 of X[j] sin(pi (j+1) (k+1/2) / Chris@19: n)). In the case of n=1, this reduces to Y[0] = X[0]. Chris@19: Chris@19: RODFT11 (DST-IV) Chris@19: ................ Chris@19: Chris@19: An `RODFT11' transform (type-IV DST) in FFTW is defined by: Y[k] = 2 Chris@19: (sum for j = 0 to n-1 of X[j] sin(pi (j+1/2) (k+1/2) / n)). Chris@19: Chris@19: Inverses and Normalization Chris@19: .......................... Chris@19: Chris@19: These definitions correspond directly to the unnormalized DFTs used Chris@19: elsewhere in FFTW (hence the factors of 2 in front of the summations). Chris@19: The unnormalized inverse of `RODFT00' is `RODFT00', of `RODFT10' is Chris@19: `RODFT01' and vice versa, and of `RODFT11' is `RODFT11'. Each Chris@19: unnormalized inverse results in the original array multiplied by N, Chris@19: where N is the _logical_ DFT size. For `RODFT00', N=2(n+1); otherwise, Chris@19: N=2n. 
Chris@19: Chris@19: In defining the discrete sine transform, some authors also include Chris@19: additional factors of sqrt(2) (or its inverse) multiplying selected Chris@19: inputs and/or outputs. This is a mostly cosmetic change that makes the Chris@19: transform orthogonal, but sacrifices the direct equivalence to an Chris@19: antisymmetric DFT. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: 1d Discrete Hartley Transforms (DHTs), Next: Multi-dimensional Transforms, Prev: 1d Real-odd DFTs (DSTs), Up: What FFTW Really Computes Chris@19: Chris@19: 4.8.5 1d Discrete Hartley Transforms (DHTs) Chris@19: ------------------------------------------- Chris@19: Chris@19: The discrete Hartley transform (DHT) of a 1d real array X of size n Chris@19: computes a real array Y of the same size, where: Y[k] = sum for j = 0 to (n - 1) of X[j] * [cos(2 pi j k / n) + sin(2 pi j k / n)]. Chris@19: FFTW computes an unnormalized transform, in that there is no Chris@19: coefficient in front of the summation in the DHT. In other words, Chris@19: applying the transform twice (the DHT is its own inverse) will multiply Chris@19: the input by n. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Multi-dimensional Transforms, Prev: 1d Discrete Hartley Transforms (DHTs), Up: What FFTW Really Computes Chris@19: Chris@19: 4.8.6 Multi-dimensional Transforms Chris@19: ---------------------------------- Chris@19: Chris@19: The multi-dimensional transforms of FFTW, in general, compute simply the Chris@19: separable product of the given 1d transform along each dimension of the Chris@19: array. Since each of these transforms is unnormalized, computing the Chris@19: forward followed by the backward/inverse multi-dimensional transform Chris@19: will result in the original array scaled by the product of the Chris@19: normalization factors for each dimension (e.g. the product of the Chris@19: dimension sizes, for a multi-dimensional DFT). 
Chris@19: Chris@19: The definition of FFTW's multi-dimensional DFT of real data (r2c) Chris@19: deserves special attention. In this case, we logically compute the full Chris@19: multi-dimensional DFT of the input data; since the input data are purely Chris@19: real, the output data have the Hermitian symmetry and therefore only one Chris@19: non-redundant half need be stored. More specifically, for an n[0] x Chris@19: n[1] x n[2] x ... x n[d-1] multi-dimensional real-input DFT, the full Chris@19: (logical) complex output array Y[k[0], k[1], ..., k[d-1]] has the Chris@19: symmetry: Y[k[0], k[1], ..., k[d-1]] = Y[n[0] - k[0], n[1] - k[1], ..., Chris@19: n[d-1] - k[d-1]]* (where each dimension is periodic). Because of this Chris@19: symmetry, we only store the k[d-1] = 0...n[d-1]/2 elements of the Chris@19: _last_ dimension (division by 2 is rounded down). (We could instead Chris@19: have cut any other dimension in half, but the last dimension proved Chris@19: computationally convenient.) This results in the peculiar array format Chris@19: described in more detail by *note Real-data DFT Array Format::. Chris@19: Chris@19: The multi-dimensional c2r transform is simply the unnormalized Chris@19: inverse of the r2c transform. i.e. it is the same as FFTW's complex Chris@19: backward multi-dimensional DFT, operating on a Hermitian input array in Chris@19: the peculiar format mentioned above and outputting a real array (since Chris@19: the DFT output is purely real). Chris@19: Chris@19: We should remind the user that the separable product of 1d transforms Chris@19: along each dimension, as computed by FFTW, is not always the same thing Chris@19: as the usual multi-dimensional transform. A multi-dimensional `R2HC' Chris@19: (or `HC2R') transform is not identical to the multi-dimensional DFT, Chris@19: requiring some post-processing to combine the requisite real and Chris@19: imaginary parts, as was described in *note The Halfcomplex-format Chris@19: DFT::. 
Likewise, FFTW's multi-dimensional `FFTW_DHT' r2r transform is Chris@19: not the same thing as the logical multi-dimensional discrete Hartley Chris@19: transform defined in the literature, as discussed in *note The Discrete Chris@19: Hartley Transform::. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Multi-threaded FFTW, Next: Distributed-memory FFTW with MPI, Prev: FFTW Reference, Up: Top Chris@19: Chris@19: 5 Multi-threaded FFTW Chris@19: ********************* Chris@19: Chris@19: In this chapter we document the parallel FFTW routines for Chris@19: shared-memory parallel hardware. These routines, which support Chris@19: parallel one- and multi-dimensional transforms of both real and complex Chris@19: data, are the easiest way to take advantage of multiple processors with Chris@19: FFTW. They work just like the corresponding uniprocessor transform Chris@19: routines, except that you have an extra initialization routine to call, Chris@19: and there is a routine to set the number of threads to employ. Any Chris@19: program that uses the uniprocessor FFTW can therefore be trivially Chris@19: modified to use the multi-threaded FFTW. Chris@19: Chris@19: A shared-memory machine is one in which all CPUs can directly access Chris@19: the same main memory, and such machines are now common due to the Chris@19: ubiquity of multi-core CPUs. FFTW's multi-threading support allows you Chris@19: to utilize these additional CPUs transparently from a single program. Chris@19: However, this does not necessarily translate into performance Chris@19: gains--when multiple threads/CPUs are employed, there is an overhead Chris@19: required for synchronization that may outweigh the computational Chris@19: parallelism. Therefore, you can only benefit from threads if your Chris@19: problem is sufficiently large.
Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * Installation and Supported Hardware/Software:: Chris@19: * Usage of Multi-threaded FFTW:: Chris@19: * How Many Threads to Use?:: Chris@19: * Thread safety:: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Installation and Supported Hardware/Software, Next: Usage of Multi-threaded FFTW, Prev: Multi-threaded FFTW, Up: Multi-threaded FFTW Chris@19: Chris@19: 5.1 Installation and Supported Hardware/Software Chris@19: ================================================ Chris@19: Chris@19: All of the FFTW threads code is located in the `threads' subdirectory Chris@19: of the FFTW package. On Unix systems, the FFTW threads libraries and Chris@19: header files can be automatically configured, compiled, and installed Chris@19: along with the uniprocessor FFTW libraries simply by including Chris@19: `--enable-threads' in the flags to the `configure' script (*note Chris@19: Installation on Unix::), or `--enable-openmp' to use OpenMP Chris@19: (http://www.openmp.org) threads. Chris@19: Chris@19: The threads routines require your operating system to have some sort Chris@19: of shared-memory threads support. Specifically, the FFTW threads Chris@19: package works with POSIX threads (available on most Unix variants, from Chris@19: GNU/Linux to MacOS X) and Win32 threads. OpenMP threads, which are Chris@19: supported in many common compilers (e.g. gcc) are also supported, and Chris@19: may give better performance on some systems. (OpenMP threads are also Chris@19: useful if you are employing OpenMP in your own code, in order to Chris@19: minimize conflicts between threading models.) If you have a Chris@19: shared-memory machine that uses a different threads API, it should be a Chris@19: simple matter of programming to include support for it; see the file Chris@19: `threads/threads.c' for more detail. 
Chris@19: Chris@19: You can compile FFTW with _both_ `--enable-threads' and Chris@19: `--enable-openmp' at the same time, since they install libraries with Chris@19: different names (`fftw3_threads' and `fftw3_omp', as described below). Chris@19: However, your programs may only link to _one_ of these two libraries at Chris@19: a time. Chris@19: Chris@19: Ideally, of course, you should also have multiple processors in Chris@19: order to get any benefit from the threaded transforms. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Usage of Multi-threaded FFTW, Next: How Many Threads to Use?, Prev: Installation and Supported Hardware/Software, Up: Multi-threaded FFTW Chris@19: Chris@19: 5.2 Usage of Multi-threaded FFTW Chris@19: ================================ Chris@19: Chris@19: Here, it is assumed that the reader is already familiar with the usage Chris@19: of the uniprocessor FFTW routines, described elsewhere in this manual. Chris@19: We only describe what one has to change in order to use the Chris@19: multi-threaded routines. Chris@19: Chris@19: First, programs using the parallel complex transforms should be Chris@19: linked with `-lfftw3_threads -lfftw3 -lm' on Unix, or `-lfftw3_omp Chris@19: -lfftw3 -lm' if you compiled with OpenMP. You will also need to link Chris@19: with whatever library is responsible for threads on your system (e.g. Chris@19: `-lpthread' on GNU/Linux) or include whatever compiler flag enables Chris@19: OpenMP (e.g. `-fopenmp' with gcc). Chris@19: Chris@19: Second, before calling _any_ FFTW routines, you should call the Chris@19: function: Chris@19: Chris@19: int fftw_init_threads(void); Chris@19: Chris@19: This function, which need only be called once, performs any one-time Chris@19: initialization required to use threads on your system. It returns zero Chris@19: if there was some error (which should not happen under normal Chris@19: circumstances) and a non-zero value otherwise. 
Chris@19: Chris@19: Third, before creating a plan that you want to parallelize, you Chris@19: should call: Chris@19: Chris@19: void fftw_plan_with_nthreads(int nthreads); Chris@19: Chris@19: The `nthreads' argument indicates the number of threads you want Chris@19: FFTW to use (or actually, the maximum number). All plans subsequently Chris@19: created with any planner routine will use that many threads. You can Chris@19: call `fftw_plan_with_nthreads', create some plans, call Chris@19: `fftw_plan_with_nthreads' again with a different argument, and create Chris@19: some more plans for a new number of threads. Plans already created Chris@19: before a call to `fftw_plan_with_nthreads' are unaffected. If you pass Chris@19: an `nthreads' argument of `1' (the default), threads are disabled for Chris@19: subsequent plans. Chris@19: Chris@19: With OpenMP, to configure FFTW to use all of the currently running Chris@19: OpenMP threads (set by `omp_set_num_threads(nthreads)' or by the Chris@19: `OMP_NUM_THREADS' environment variable), you can do: Chris@19: `fftw_plan_with_nthreads(omp_get_max_threads())'. (The `omp_' OpenMP Chris@19: functions are declared via `#include <omp.h>'.) Chris@19: Chris@19: Given a plan, you then execute it as usual with Chris@19: `fftw_execute(plan)', and the execution will use the number of threads Chris@19: specified when the plan was created. When done, you destroy it as Chris@19: usual with `fftw_destroy_plan'. As described in *note Thread safety::, Chris@19: plan _execution_ is thread-safe, but plan creation and destruction are Chris@19: _not_: you should create/destroy plans only from a single thread, but Chris@19: can safely execute multiple plans in parallel.
Chris@19: Chris@19: There is one additional routine: if you want to get rid of all memory Chris@19: and other resources allocated internally by FFTW, you can call: Chris@19: Chris@19: void fftw_cleanup_threads(void); Chris@19: Chris@19: which is much like the `fftw_cleanup()' function except that it also Chris@19: gets rid of threads-related data. You must _not_ execute any Chris@19: previously created plans after calling this function. Chris@19: Chris@19: We should also mention one other restriction: if you save wisdom Chris@19: from a program using the multi-threaded FFTW, that wisdom _cannot be Chris@19: used_ by a program using only the single-threaded FFTW (i.e. not calling Chris@19: `fftw_init_threads'). *Note Words of Wisdom-Saving Plans::. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: How Many Threads to Use?, Next: Thread safety, Prev: Usage of Multi-threaded FFTW, Up: Multi-threaded FFTW Chris@19: Chris@19: 5.3 How Many Threads to Use? Chris@19: ============================ Chris@19: Chris@19: There is a fair amount of overhead involved in synchronizing threads, Chris@19: so the optimal number of threads to use depends upon the size of the Chris@19: transform as well as on the number of processors you have. Chris@19: Chris@19: As a general rule, you don't want to use more threads than you have Chris@19: processors. (Using more threads will work, but there will be extra Chris@19: overhead with no benefit.) In fact, if the problem size is too small, Chris@19: you may want to use fewer threads than you have processors. Chris@19: Chris@19: You will have to experiment with your system to see what level of Chris@19: parallelization is best for your problem size. Typically, the problem Chris@19: will have to involve at least a few thousand data points before threads Chris@19: become beneficial. If you plan with `FFTW_PATIENT', it will Chris@19: automatically disable threads for sizes that don't benefit from Chris@19: parallelization. 
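As a starting point for the experiments recommended above, the rules of thumb in this section can be written down as a small helper. This is only a sketch under stated assumptions: `suggest_nthreads' is a hypothetical name that is not part of FFTW, and the cutoff of 2048 points per thread is an arbitrary stand-in for "a few thousand data points" that you should tune by timing on your own system.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff (an assumption, not an FFTW constant): the minimum
   number of data points we want each thread to have before adding it. */
#define MIN_POINTS_PER_THREAD 2048

/* Suggest an initial value for fftw_plan_with_nthreads():
   - never more threads than processors;
   - fewer threads (down to 1) when the transform is small. */
int suggest_nthreads(size_t n_points, int n_processors)
{
    size_t by_size;

    assert(n_processors >= 1);

    by_size = n_points / MIN_POINTS_PER_THREAD;
    if (by_size < 1)
        return 1;                     /* too small: stay single-threaded */
    if (by_size > (size_t) n_processors)
        return n_processors;          /* cap at the processor count */
    return (int) by_size;
}
```

The returned value is only an initial guess for `fftw_plan_with_nthreads'; timing a few nearby thread counts, or planning with `FFTW_PATIENT', remains the reliable way to choose.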
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Thread safety, Prev: How Many Threads to Use?, Up: Multi-threaded FFTW Chris@19: Chris@19: 5.4 Thread safety Chris@19: ================= Chris@19: Chris@19: Users writing multi-threaded programs (including OpenMP) must concern Chris@19: themselves with the "thread safety" of the libraries they use--that is, Chris@19: whether it is safe to call routines in parallel from multiple threads. Chris@19: FFTW can be used in such an environment, but some care must be taken Chris@19: because the planner routines share data (e.g. wisdom and trigonometric Chris@19: tables) between calls and plans. Chris@19: Chris@19: The upshot is that the only thread-safe (re-entrant) routine in FFTW Chris@19: is `fftw_execute' (and the new-array variants thereof). All other Chris@19: routines (e.g. the planner) should only be called from one thread at a Chris@19: time. So, for example, you can wrap a semaphore lock around any calls Chris@19: to the planner; even more simply, you can just create all of your plans Chris@19: from one thread. We do not think this should be an important Chris@19: restriction (FFTW is designed for the situation where the only Chris@19: performance-sensitive code is the actual execution of the transform), Chris@19: and the benefits of shared data between plans are great. Chris@19: Chris@19: Note also that, since the plan is not modified by `fftw_execute', it Chris@19: is safe to execute the _same plan_ in parallel by multiple threads. Chris@19: However, since a given plan operates by default on a fixed array, you Chris@19: need to use one of the new-array execute functions (*note New-array Chris@19: Execute Functions::) so that different threads compute the transform of Chris@19: different data. Chris@19: Chris@19: (Users should note that these comments only apply to programs using Chris@19: shared-memory threads or OpenMP. 
Parallelism using MPI or forked Chris@19: processes involves a separate address-space and global variables for Chris@19: each process, and is not susceptible to problems of this sort.) Chris@19: Chris@19: If you configured FFTW with the `--enable-debug' or Chris@19: `--enable-debug-malloc' flags (*note Installation on Unix::), then Chris@19: `fftw_execute' is not thread-safe. These flags are not documented Chris@19: because they are intended only for developing and debugging FFTW, but Chris@19: if you must use `--enable-debug' then you should also specifically pass Chris@19: `--disable-debug-malloc' for `fftw_execute' to be thread-safe. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Distributed-memory FFTW with MPI, Next: Calling FFTW from Modern Fortran, Prev: Multi-threaded FFTW, Up: Top Chris@19: Chris@19: 6 Distributed-memory FFTW with MPI Chris@19: ********************************** Chris@19: Chris@19: In this chapter we document the parallel FFTW routines for parallel Chris@19: systems supporting the MPI message-passing interface. Unlike the Chris@19: shared-memory threads described in the previous chapter, MPI allows you Chris@19: to use _distributed-memory_ parallelism, where each CPU has its own Chris@19: separate memory, and which can scale up to clusters of many thousands Chris@19: of processors. This capability comes at a price, however: each process Chris@19: only stores a _portion_ of the data to be transformed, which means that Chris@19: the data structures and programming-interface are quite different from Chris@19: the serial or threads versions of FFTW. Chris@19: Chris@19: Distributed-memory parallelism is especially useful when you are Chris@19: transforming arrays so large that they do not fit into the memory of a Chris@19: single processor. The storage per-process required by FFTW's MPI Chris@19: routines is proportional to the total array size divided by the number Chris@19: of processes.
Conversely, distributed-memory parallelism can easily Chris@19: pose an unacceptably high communications overhead for small problems; Chris@19: the threshold problem size for which parallelism becomes advantageous Chris@19: will depend on the precise problem you are interested in, your Chris@19: hardware, and your MPI implementation. Chris@19: Chris@19: A note on terminology: in MPI, you divide the data among a set of Chris@19: "processes" which each run in their own memory address space. Chris@19: Generally, each process runs on a different physical processor, but Chris@19: this is not required. A set of processes in MPI is described by an Chris@19: opaque data structure called a "communicator," the most common of which Chris@19: is the predefined communicator `MPI_COMM_WORLD' which refers to _all_ Chris@19: processes. For more information on these and other concepts common to Chris@19: all MPI programs, we refer the reader to the documentation at the MPI Chris@19: home page (http://www.mcs.anl.gov/research/projects/mpi/). Chris@19: Chris@19: We assume in this chapter that the reader is familiar with the usage Chris@19: of the serial (uniprocessor) FFTW, and focus only on the concepts new Chris@19: to the MPI interface. 
Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * FFTW MPI Installation:: Chris@19: * Linking and Initializing MPI FFTW:: Chris@19: * 2d MPI example:: Chris@19: * MPI Data Distribution:: Chris@19: * Multi-dimensional MPI DFTs of Real Data:: Chris@19: * Other Multi-dimensional Real-data MPI Transforms:: Chris@19: * FFTW MPI Transposes:: Chris@19: * FFTW MPI Wisdom:: Chris@19: * Avoiding MPI Deadlocks:: Chris@19: * FFTW MPI Performance Tips:: Chris@19: * Combining MPI and Threads:: Chris@19: * FFTW MPI Reference:: Chris@19: * FFTW MPI Fortran Interface:: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: FFTW MPI Installation, Next: Linking and Initializing MPI FFTW, Prev: Distributed-memory FFTW with MPI, Up: Distributed-memory FFTW with MPI Chris@19: Chris@19: 6.1 FFTW MPI Installation Chris@19: ========================= Chris@19: Chris@19: All of the FFTW MPI code is located in the `mpi' subdirectory of the Chris@19: FFTW package. On Unix systems, the FFTW MPI libraries and header files Chris@19: are automatically configured, compiled, and installed along with the Chris@19: uniprocessor FFTW libraries simply by including `--enable-mpi' in the Chris@19: flags to the `configure' script (*note Installation on Unix::). Chris@19: Chris@19: Any implementation of the MPI standard, version 1 or later, should Chris@19: work with FFTW. The `configure' script will attempt to automatically Chris@19: detect how to compile and link code using your MPI implementation. In Chris@19: some cases, especially if you have multiple different MPI Chris@19: implementations installed or have an unusual MPI software package, you Chris@19: may need to provide this information explicitly. Chris@19: Chris@19: Most commonly, one compiles MPI code by invoking a special compiler Chris@19: command, typically `mpicc' for C code. 
The `configure' script knows Chris@19: the most common names for this command, but you can specify the MPI Chris@19: compilation command explicitly by setting the `MPICC' variable, as in Chris@19: `./configure MPICC=mpicc ...'. Chris@19: Chris@19: If, instead of a special compiler command, you need to link a certain Chris@19: library, you can specify the link command via the `MPILIBS' variable, Chris@19: as in `./configure MPILIBS=-lmpi ...'. Note that if your MPI library Chris@19: is installed in a non-standard location (one the compiler does not know Chris@19: about by default), you may also have to specify the location of the Chris@19: library and header files via `LDFLAGS' and `CPPFLAGS' variables, Chris@19: respectively, as in `./configure LDFLAGS=-L/path/to/mpi/libs Chris@19: CPPFLAGS=-I/path/to/mpi/include ...'. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Linking and Initializing MPI FFTW, Next: 2d MPI example, Prev: FFTW MPI Installation, Up: Distributed-memory FFTW with MPI Chris@19: Chris@19: 6.2 Linking and Initializing MPI FFTW Chris@19: ===================================== Chris@19: Chris@19: Programs using the MPI FFTW routines should be linked with `-lfftw3_mpi Chris@19: -lfftw3 -lm' on Unix in double precision, `-lfftw3f_mpi -lfftw3f -lm' Chris@19: in single precision, and so on (*note Precision::). You will also need Chris@19: to link with whatever library is responsible for MPI on your system; in Chris@19: most MPI implementations, there is a special compiler alias named Chris@19: `mpicc' to compile and link MPI code. 
Chris@19: Chris@19: Before calling any FFTW routines except possibly `fftw_init_threads' Chris@19: (*note Combining MPI and Threads::), but after calling `MPI_Init', you Chris@19: should call the function: Chris@19: Chris@19: void fftw_mpi_init(void); Chris@19: Chris@19: If, at the end of your program, you want to get rid of all memory and Chris@19: other resources allocated internally by FFTW, for both the serial and Chris@19: MPI routines, you can call: Chris@19: Chris@19: void fftw_mpi_cleanup(void); Chris@19: Chris@19: which is much like the `fftw_cleanup()' function except that it also Chris@19: gets rid of FFTW's MPI-related data. You must _not_ execute any Chris@19: previously created plans after calling this function. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: 2d MPI example, Next: MPI Data Distribution, Prev: Linking and Initializing MPI FFTW, Up: Distributed-memory FFTW with MPI Chris@19: Chris@19: 6.3 2d MPI example Chris@19: ================== Chris@19: Chris@19: Before we document the FFTW MPI interface in detail, we begin with a Chris@19: simple example outlining how one would perform a two-dimensional `N0' Chris@19: by `N1' complex DFT. 
Chris@19: Chris@19: #include <fftw3-mpi.h> Chris@19: Chris@19: int main(int argc, char **argv) Chris@19: { Chris@19: const ptrdiff_t N0 = ..., N1 = ...; Chris@19: fftw_plan plan; Chris@19: fftw_complex *data; Chris@19: ptrdiff_t alloc_local, local_n0, local_0_start, i, j; Chris@19: Chris@19: MPI_Init(&argc, &argv); Chris@19: fftw_mpi_init(); Chris@19: Chris@19: /* get local data size and allocate */ Chris@19: alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD, Chris@19: &local_n0, &local_0_start); Chris@19: data = fftw_alloc_complex(alloc_local); Chris@19: Chris@19: /* create plan for in-place forward DFT */ Chris@19: plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD, Chris@19: FFTW_FORWARD, FFTW_ESTIMATE); Chris@19: Chris@19: /* initialize data to some function my_function(x,y) */ Chris@19: for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j) Chris@19: data[i*N1 + j] = my_function(local_0_start + i, j); Chris@19: Chris@19: /* compute transforms, in-place, as many times as desired */ Chris@19: fftw_execute(plan); Chris@19: Chris@19: fftw_destroy_plan(plan); Chris@19: Chris@19: MPI_Finalize(); Chris@19: } Chris@19: Chris@19: As can be seen above, the MPI interface follows the same basic style Chris@19: of allocate/plan/execute/destroy as the serial FFTW routines. All of Chris@19: the MPI-specific routines are prefixed with `fftw_mpi_' instead of Chris@19: `fftw_'. There are a few important differences, however: Chris@19: Chris@19: First, we must call `fftw_mpi_init()' after calling `MPI_Init' Chris@19: (required in all MPI programs) and before calling any other `fftw_mpi_' Chris@19: routine. Chris@19: Chris@19: Second, when we create the plan with `fftw_mpi_plan_dft_2d', Chris@19: analogous to `fftw_plan_dft_2d', we pass an additional argument: the Chris@19: communicator, indicating which processes will participate in the Chris@19: transform (here `MPI_COMM_WORLD', indicating all processes).
Whenever Chris@19: you create, execute, or destroy a plan for an MPI transform, you must Chris@19: call the corresponding FFTW routine on _all_ processes in the Chris@19: communicator for that transform. (That is, these are _collective_ Chris@19: calls.) Note that the plan for the MPI transform uses the standard Chris@19: `fftw_execute' and `fftw_destroy_plan' routines (on the other hand, there Chris@19: are MPI-specific new-array execute functions documented below). Chris@19: Chris@19: Third, all of the FFTW MPI routines take `ptrdiff_t' arguments Chris@19: instead of `int' as for the serial FFTW. `ptrdiff_t' is a standard C Chris@19: integer type which is (at least) 32 bits wide on a 32-bit machine and Chris@19: 64 bits wide on a 64-bit machine. This is to make it easy to specify Chris@19: very large parallel transforms on a 64-bit machine. (You can specify Chris@19: 64-bit transform sizes in the serial FFTW, too, but only by using the Chris@19: `guru64' planner interface. *Note 64-bit Guru Interface::.) Chris@19: Chris@19: Fourth, and most importantly, you don't allocate the entire Chris@19: two-dimensional array on each process. Instead, you call Chris@19: `fftw_mpi_local_size_2d' to find out what _portion_ of the array Chris@19: resides on each processor, and how much space to allocate. Here, the Chris@19: portion of the array on each process is a `local_n0' by `N1' slice of Chris@19: the total array, starting at index `local_0_start'. The total number Chris@19: of `fftw_complex' numbers to allocate is given by the `alloc_local' Chris@19: return value, which _may_ be greater than `local_n0 * N1' (in case some Chris@19: intermediate calculations require additional storage). The data Chris@19: distribution in FFTW's MPI interface is described in more detail by the Chris@19: next section.
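The index arithmetic behind the local `local_n0' by `N1' slice can be isolated in a few plain C helpers. This is a minimal sketch that needs neither MPI nor FFTW; the function names `local_offset', `global_row', and `owns_row' are hypothetical and chosen only for illustration, while `local_n0' and `local_0_start' stand for the values returned by `fftw_mpi_local_size_2d'.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of local element (i, j) inside the local slice, which is stored
   in row-major order with n1 entries per row. */
ptrdiff_t local_offset(ptrdiff_t i, ptrdiff_t j,
                       ptrdiff_t local_n0, ptrdiff_t n1)
{
    assert(0 <= i && i < local_n0);
    assert(0 <= j && j < n1);
    return i * n1 + j;
}

/* Global row index corresponding to local row i on this process. */
ptrdiff_t global_row(ptrdiff_t i, ptrdiff_t local_0_start)
{
    return local_0_start + i;
}

/* Nonzero if global row r is stored on this process. */
int owns_row(ptrdiff_t r, ptrdiff_t local_n0, ptrdiff_t local_0_start)
{
    return r >= local_0_start && r < local_0_start + local_n0;
}
```

For instance, on a process holding rows 50 through 74 of a 100 by 200 array (`local_n0 = 25', `local_0_start = 50'), local element (2, 3) lives at offset 403 and corresponds to global row 52.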
Chris@19: Chris@19: Given the portion of the array that resides on the local process, it Chris@19: is straightforward to initialize the data (here to a function Chris@19: `my_function') and otherwise manipulate it. Of course, at the end of Chris@19: the program you may want to output the data somehow, but synchronizing Chris@19: this output is up to you and is beyond the scope of this manual. (One Chris@19: good way to output a large multi-dimensional distributed array in MPI Chris@19: to a portable binary file is to use the free HDF5 library; see the HDF Chris@19: home page (http://www.hdfgroup.org/).) Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: MPI Data Distribution, Next: Multi-dimensional MPI DFTs of Real Data, Prev: 2d MPI example, Up: Distributed-memory FFTW with MPI Chris@19: Chris@19: 6.4 MPI Data Distribution Chris@19: ========================= Chris@19: Chris@19: The most important concept to understand in using FFTW's MPI interface Chris@19: is the data distribution. With a serial or multithreaded FFT, all of Chris@19: the inputs and outputs are stored as a single contiguous chunk of Chris@19: memory. With a distributed-memory FFT, the inputs and outputs are Chris@19: broken into disjoint blocks, one per process. Chris@19: Chris@19: In particular, FFTW uses a _1d block distribution_ of the data, Chris@19: distributed along the _first dimension_. For example, if you want to Chris@19: perform a 100 x 200 complex DFT, distributed over 4 processes, each Chris@19: process will get a 25 x 200 slice of the data. That is, process 0 Chris@19: will get rows 0 through 24, process 1 will get rows 25 through 49, Chris@19: process 2 will get rows 50 through 74, and process 3 will get rows 75 Chris@19: through 99. If you take the same array but distribute it over 3 Chris@19: processes, then it is not evenly divisible so the different processes Chris@19: will have unequal chunks.
FFTW's default choice in this case is to Chris@19: assign 34 rows to processes 0 and 1, and 32 rows to process 2. Chris@19: Chris@19: FFTW provides several `fftw_mpi_local_size' routines that you can Chris@19: call to find out what portion of an array is stored on the current Chris@19: process. In most cases, you should use the default block sizes picked Chris@19: by FFTW, but it is also possible to specify your own block size. For Chris@19: example, with a 100 x 200 array on three processes, you can tell FFTW Chris@19: to use a block size of 40, which would assign 40 rows to processes 0 Chris@19: and 1, and 20 rows to process 2. FFTW's default is to divide the data Chris@19: equally among the processes if possible, and as best it can otherwise. Chris@19: The rows are always assigned in "rank order," i.e. process 0 gets the Chris@19: first block of rows, then process 1, and so on. (You can change this Chris@19: by using `MPI_Comm_split' to create a new communicator with re-ordered Chris@19: processes.) However, you should always call the `fftw_mpi_local_size' Chris@19: routines, if possible, rather than trying to predict FFTW's Chris@19: distribution choices. Chris@19: Chris@19: In particular, it is critical that you allocate the storage size that Chris@19: is returned by `fftw_mpi_local_size', which is _not_ necessarily the Chris@19: size of the local slice of the array. The reason is that intermediate Chris@19: steps of FFTW's algorithms involve transposing the array and Chris@19: redistributing the data, so at these intermediate steps FFTW may Chris@19: require more local storage space (albeit always proportional to the Chris@19: total size divided by the number of processes). The Chris@19: `fftw_mpi_local_size' functions know how much storage is required for Chris@19: these intermediate steps and tell you the correct amount to allocate. 
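The row counts quoted in this section follow from simple block arithmetic, which the sketch below reproduces in plain C. It is a model of the documented behavior for illustration only, not FFTW's implementation: the names `default_block', `block_slice', and `slice' are hypothetical, the default block is taken to be the row count divided by the number of processes, rounded up, and the model assumes the blocks cover the whole first dimension. In a real program you should still call the `fftw_mpi_local_size' routines, which also report the storage to allocate.

```c
#include <assert.h>
#include <stddef.h>

/* Rows held by one process in a 1d block distribution (illustrative type). */
typedef struct {
    ptrdiff_t local_n0;       /* number of rows on this process */
    ptrdiff_t local_0_start;  /* index of its first row */
} slice;

/* Assumed default block size: n0 / nproc, rounded up. */
ptrdiff_t default_block(ptrdiff_t n0, int nproc)
{
    assert(n0 >= 1 && nproc >= 1);
    return (n0 + nproc - 1) / nproc;
}

/* Slice of an n0-row array owned by process `rank' for a given block size.
   Blocks are handed out in rank order; once the rows run out, the
   remaining processes get zero rows. */
slice block_slice(ptrdiff_t n0, ptrdiff_t block, int rank)
{
    slice s;
    ptrdiff_t start;

    assert(n0 >= 1 && block >= 1 && rank >= 0);

    start = (ptrdiff_t) rank * block;
    if (start >= n0) {
        s.local_n0 = 0;
        s.local_0_start = n0;   /* placeholder; this process holds no rows */
    } else {
        s.local_n0 = (n0 - start < block) ? n0 - start : block;
        s.local_0_start = start;
    }
    return s;
}
```

With 100 rows this gives 25 rows per process on 4 processes, 34, 34, and 32 rows on 3 processes with the default block, and 40, 40, and 20 rows with a block size of 40, matching the examples above.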
Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * Basic and advanced distribution interfaces:: Chris@19: * Load balancing:: Chris@19: * Transposed distributions:: Chris@19: * One-dimensional distributions:: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Basic and advanced distribution interfaces, Next: Load balancing, Prev: MPI Data Distribution, Up: MPI Data Distribution Chris@19: Chris@19: 6.4.1 Basic and advanced distribution interfaces Chris@19: ------------------------------------------------ Chris@19: Chris@19: As with the planner interface, the `fftw_mpi_local_size' distribution Chris@19: interface is broken into basic and advanced (`_many') interfaces, where Chris@19: the latter allows you to specify the block size manually and also to Chris@19: request block sizes when computing multiple transforms simultaneously. Chris@19: These functions are documented more exhaustively by the FFTW MPI Chris@19: Reference, but we summarize the basic ideas here using a couple of Chris@19: two-dimensional examples. Chris@19: Chris@19: For the 100 x 200 complex-DFT example, above, we would find the Chris@19: distribution by calling the following function in the basic interface: Chris@19: Chris@19: ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, Chris@19: ptrdiff_t *local_n0, ptrdiff_t *local_0_start); Chris@19: Chris@19: Given the total size of the data to be transformed (here, `n0 = 100' Chris@19: and `n1 = 200') and an MPI communicator (`comm'), this function Chris@19: provides three numbers. Chris@19: Chris@19: First, it describes the shape of the local data: the current process Chris@19: should store a `local_n0' by `n1' slice of the overall dataset, in Chris@19: row-major order (`n1' dimension contiguous), starting at index Chris@19: `local_0_start'. That is, if the total dataset is viewed as an `n0' by Chris@19: `n1' matrix, the current process should store the rows `local_0_start' Chris@19: to `local_0_start+local_n0-1'.
Obviously, if you are running with only Chris@19: a single MPI process, that process will store the entire array: Chris@19: `local_0_start' will be zero and `local_n0' will be `n0'. *Note Chris@19: Row-major Format::. Chris@19: Chris@19: Second, the return value is the total number of data elements (e.g., Chris@19: complex numbers for a complex DFT) that should be allocated for the Chris@19: input and output arrays on the current process (ideally with Chris@19: `fftw_malloc' or an `fftw_alloc' function, to ensure optimal Chris@19: alignment). It might seem that this should always be equal to Chris@19: `local_n0 * n1', but this is _not_ the case. FFTW's distributed FFT Chris@19: algorithms require data redistributions at intermediate stages of the Chris@19: transform, and in some circumstances this may require slightly larger Chris@19: local storage. This is discussed in more detail below, under *note Chris@19: Load balancing::. Chris@19: Chris@19: The advanced-interface `local_size' function for multidimensional Chris@19: transforms returns the same three things (`local_n0', `local_0_start', Chris@19: and the total number of elements to allocate), but takes more inputs: Chris@19: Chris@19: ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, Chris@19: ptrdiff_t howmany, Chris@19: ptrdiff_t block0, Chris@19: MPI_Comm comm, Chris@19: ptrdiff_t *local_n0, Chris@19: ptrdiff_t *local_0_start); Chris@19: Chris@19: The two-dimensional case above corresponds to `rnk = 2' and an array Chris@19: `n' of length 2 with `n[0] = n0' and `n[1] = n1'. This routine is for Chris@19: any `rnk > 1'; one-dimensional transforms have their own interface Chris@19: because they work slightly differently, as discussed below. Chris@19: Chris@19: First, the advanced interface allows you to perform multiple Chris@19: transforms at once, of interleaved data, as specified by the `howmany' Chris@19: parameter. (`howmany' is 1 for a single transform.)
Chris@19: Chris@19: Second, here you can specify your desired block size in the `n0' Chris@19: dimension, `block0'. To use FFTW's default block size, pass Chris@19: `FFTW_MPI_DEFAULT_BLOCK' (0) for `block0'. Otherwise, on `P' Chris@19: processes, FFTW will return `local_n0' equal to `block0' on the first Chris@19: `n0 / block0' processes (rounded down), return `local_n0' equal to `n0 - Chris@19: block0 * (n0 / block0)' on the next process, and `local_n0' equal to Chris@19: zero on any remaining processes. In general, we recommend using the Chris@19: default block size (which corresponds to `n0 / P', rounded up). Chris@19: Chris@19: For example, suppose you have `P = 4' processes and `n0 = 21'. The Chris@19: default will be a block size of `6', which will give `local_n0 = 6' on Chris@19: the first three processes and `local_n0 = 3' on the last process. Chris@19: Instead, however, you could specify `block0 = 5' if you wanted, which Chris@19: would give `local_n0 = 5' on processes 0 to 2, `local_n0 = 6' on Chris@19: process 3. (This choice, while it may look superficially more Chris@19: "balanced," has the same critical path as FFTW's default but requires Chris@19: more communications.) Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Load balancing, Next: Transposed distributions, Prev: Basic and advanced distribution interfaces, Up: MPI Data Distribution Chris@19: Chris@19: 6.4.2 Load balancing Chris@19: -------------------- Chris@19: Chris@19: Ideally, when you parallelize a transform over some P processes, each Chris@19: process should end up with work that takes equal time. Otherwise, all Chris@19: of the processes end up waiting on whichever process is slowest. This Chris@19: goal is known as "load balancing." In this section, we describe the Chris@19: circumstances under which FFTW is able to load-balance well, and in Chris@19: particular how you should choose your transform size in order to load Chris@19: balance.
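The divisibility conditions stated in the remainder of this section can be collected into a small checker that you might run before settling on a transform size. This is a sketch under stated assumptions: `balanced_multi_d' and `balanced_1d' are hypothetical helper names, not FFTW routines, and they test only the divisibility rules given in this section, assuming processes of comparable speed.

```c
#include <assert.h>
#include <stddef.h>

/* Multi-dimensional complex DFT of size n[0] x ... x n[rnk-1] on P
   processes: returns nonzero if (i) n[0] is divisible by P and (ii) the
   product of the remaining dimensions, times howmany, is divisible by P.
   The product is reduced modulo P as we go, to avoid overflow. */
int balanced_multi_d(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
                     ptrdiff_t P)
{
    ptrdiff_t rest;
    int d;

    assert(rnk > 1 && howmany >= 1 && P >= 1);

    rest = howmany % P;
    for (d = 1; d < rnk; ++d)
        rest = (rest * (n[d] % P)) % P;

    return n[0] % P == 0 && rest == 0;
}

/* One-dimensional complex DFT of length N on P processes: the length
   should be divisible by P squared. */
int balanced_1d(ptrdiff_t N, ptrdiff_t P)
{
    assert(N >= 1 && P >= 1);
    return N % (P * P) == 0;
}
```

For the 100 x 200 example, the check succeeds on 4 processes and fails on 3, because 100 is not divisible by 3.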
Chris@19: Chris@19: Load balancing is especially difficult when you are parallelizing Chris@19: over heterogeneous machines; for example, if one of your processors is an Chris@19: old 486 and another is a Pentium IV, obviously you should give the Chris@19: Pentium more work to do than the 486 since the latter is much slower. Chris@19: FFTW does not deal with this problem, however--it assumes that your Chris@19: processes run on hardware of comparable speed, and that the goal is Chris@19: therefore to divide the problem as equally as possible. Chris@19: Chris@19: For a multi-dimensional complex DFT, FFTW can divide the problem Chris@19: equally among the processes if: (i) the _first_ dimension `n0' is Chris@19: divisible by P; and (ii), the _product_ of the subsequent dimensions is Chris@19: divisible by P. (For the advanced interface, where you can specify Chris@19: multiple simultaneous transforms via some "vector" length `howmany', a Chris@19: factor of `howmany' is included in the product of the subsequent Chris@19: dimensions.) Chris@19: Chris@19: For a one-dimensional complex DFT, the length `N' of the data should Chris@19: be divisible by P _squared_ to be able to divide the problem equally Chris@19: among the processes. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Transposed distributions, Next: One-dimensional distributions, Prev: Load balancing, Up: MPI Data Distribution Chris@19: Chris@19: 6.4.3 Transposed distributions Chris@19: ------------------------------ Chris@19: Chris@19: Internally, FFTW's MPI transform algorithms work by first computing Chris@19: transforms of the data local to each process, then by globally Chris@19: _transposing_ the data in some fashion to redistribute the data among Chris@19: the processes, transforming the new data local to each process, and Chris@19: transposing back.
For example, a two-dimensional `n0' by `n1' array, Chris@19: distributed across the `n0' dimension, is transformed by: (i) Chris@19: transforming the `n1' dimension, which is local to each process; (ii) Chris@19: transposing to an `n1' by `n0' array, distributed across the `n1' Chris@19: dimension; (iii) transforming the `n0' dimension, which is now local to Chris@19: each process; (iv) transposing back. Chris@19: Chris@19: However, in many applications it is acceptable to compute a Chris@19: multidimensional DFT whose results are produced in transposed order Chris@19: (e.g., `n1' by `n0' in two dimensions). This provides a significant Chris@19: performance advantage, because it means that the final transposition Chris@19: step can be omitted. FFTW supports this optimization, which you Chris@19: specify by passing the flag `FFTW_MPI_TRANSPOSED_OUT' to the planner Chris@19: routines. To compute the inverse transform of transposed output, you Chris@19: specify `FFTW_MPI_TRANSPOSED_IN' to tell it that the input is Chris@19: transposed. In this section, we explain how to interpret the output Chris@19: format of such a transform. Chris@19: Chris@19: Suppose you are transforming multi-dimensional data with (at Chris@19: least two) dimensions n[0] x n[1] x n[2] x ... x n[d-1] . As always, Chris@19: it is distributed along the first dimension n[0] . Now, if we compute Chris@19: its DFT with the `FFTW_MPI_TRANSPOSED_OUT' flag, the resulting output Chris@19: data are stored with the first _two_ dimensions transposed: n[1] x n[0] Chris@19: x n[2] x ... x n[d-1] , distributed along the n[1] dimension. Chris@19: Conversely, if we take the n[1] x n[0] x n[2] x ... x n[d-1] data and Chris@19: transform it with the `FFTW_MPI_TRANSPOSED_IN' flag, then the format Chris@19: goes back to the original n[0] x n[1] x n[2] x ... x n[d-1] array. Chris@19: Chris@19: There are two ways to find the portion of the transposed array that Chris@19: resides on the current process.
First, you can simply call the Chris@19: appropriate `local_size' function, passing n[1] x n[0] x n[2] x ... x Chris@19: n[d-1] (the transposed dimensions). This would mean calling the Chris@19: `local_size' function twice, once for the transposed and once for the Chris@19: non-transposed dimensions. Alternatively, you can call one of the Chris@19: `local_size_transposed' functions, which returns both the Chris@19: non-transposed and transposed data distribution from a single call. Chris@19: For example, for a 3d transform with transposed output (or input), you Chris@19: might call: Chris@19: Chris@19: ptrdiff_t fftw_mpi_local_size_3d_transposed( Chris@19: ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm, Chris@19: ptrdiff_t *local_n0, ptrdiff_t *local_0_start, Chris@19: ptrdiff_t *local_n1, ptrdiff_t *local_1_start); Chris@19: Chris@19: Here, `local_n0' and `local_0_start' give the size and starting Chris@19: index of the `n0' dimension for the _non_-transposed data, as in the Chris@19: previous sections. For _transposed_ data (e.g. the output for Chris@19: `FFTW_MPI_TRANSPOSED_OUT'), `local_n1' and `local_1_start' give the Chris@19: size and starting index of the `n1' dimension, which is the first Chris@19: dimension of the transposed data (`n1' by `n0' by `n2'). Chris@19: Chris@19: (Note that `FFTW_MPI_TRANSPOSED_IN' is completely equivalent to Chris@19: performing `FFTW_MPI_TRANSPOSED_OUT' and passing the first two Chris@19: dimensions to the planner in reverse order, or vice versa. If you pass Chris@19: _both_ the `FFTW_MPI_TRANSPOSED_IN' and `FFTW_MPI_TRANSPOSED_OUT' Chris@19: flags, it is equivalent to swapping the first two dimensions passed to Chris@19: the planner and passing _neither_ flag.) 
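The indexing implied by the transposed layout can be sketched in plain C. This is a minimal illustration, not part of FFTW: the helper name `transposed_local_index` is hypothetical, and it assumes the transposed `n1' by `n0' by `n2' data is stored in row-major order on each process, with `local_n1' and `local_1_start' taken from a `local_size_transposed' call as described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an FFTW function): offset, in elements, of the
   global element (i0, i1, i2) inside this process's block of the
   transposed n1 x n0 x n2 array, assuming row-major local storage.
   Returns -1 if row i1 of the transposed array lives on another process. */
static ptrdiff_t transposed_local_index(ptrdiff_t i0, ptrdiff_t i1, ptrdiff_t i2,
                                        ptrdiff_t n0, ptrdiff_t n2,
                                        ptrdiff_t local_n1,
                                        ptrdiff_t local_1_start)
{
    if (i1 < local_1_start || i1 >= local_1_start + local_n1)
        return -1;                       /* not stored on this process */
    assert(i0 >= 0 && i0 < n0);          /* middle index is a full n0 range */
    assert(i2 >= 0 && i2 < n2);          /* last index is a full n2 range */
    /* Local first index is i1 relative to the start of the local block. */
    return ((i1 - local_1_start) * n0 + i0) * n2 + i2;
}
```

For complex data the returned offset would count `fftw_complex' elements; the point of the sketch is only that `n1' becomes the distributed (slowest-varying) dimension while `n0' becomes a full local dimension.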
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: One-dimensional distributions, Prev: Transposed distributions, Up: MPI Data Distribution Chris@19: Chris@19: 6.4.4 One-dimensional distributions Chris@19: ----------------------------------- Chris@19: Chris@19: For one-dimensional distributed DFTs using FFTW, matters are slightly Chris@19: more complicated because the data distribution is more closely tied to Chris@19: how the algorithm works. In particular, you can no longer pass an Chris@19: arbitrary block size and must accept FFTW's default; also, the block Chris@19: sizes may be different for input and output. Also, the data Chris@19: distribution depends on the flags and transform direction, in order for Chris@19: forward and backward transforms to work correctly. Chris@19: Chris@19: ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm, Chris@19: int sign, unsigned flags, Chris@19: ptrdiff_t *local_ni, ptrdiff_t *local_i_start, Chris@19: ptrdiff_t *local_no, ptrdiff_t *local_o_start); Chris@19: Chris@19: This function computes the data distribution for a 1d transform of Chris@19: size `n0' with the given transform `sign' and `flags'. Both input and Chris@19: output data use block distributions. The input on the current process Chris@19: will consist of `local_ni' numbers starting at index `local_i_start'; Chris@19: e.g. if only a single process is used, then `local_ni' will be `n0' and Chris@19: `local_i_start' will be `0'. Similarly for the output, with `local_no' Chris@19: numbers starting at index `local_o_start'. The return value of Chris@19: `fftw_mpi_local_size_1d' will be the total number of elements to Chris@19: allocate on the current process (which might be slightly larger than Chris@19: the local size due to intermediate steps in the algorithm). 
Chris@19: Chris@19: As mentioned above (*note Load balancing::), the data will be divided Chris@19: equally among the processes if `n0' is divisible by the _square_ of the Chris@19: number of processes. In this case, `local_ni' will equal `local_no'. Chris@19: Otherwise, they may be different. Chris@19: Chris@19: For some applications, such as convolutions, the order of the output Chris@19: data is irrelevant. In this case, performance can be improved by Chris@19: specifying that the output data be stored in an FFTW-defined Chris@19: "scrambled" format. (In particular, this is the analogue of transposed Chris@19: output in the multidimensional case: scrambled output saves a Chris@19: communications step.) If you pass `FFTW_MPI_SCRAMBLED_OUT' in the Chris@19: flags, then the output is stored in this (undocumented) scrambled Chris@19: order. Conversely, to perform the inverse transform of data in Chris@19: scrambled order, pass the `FFTW_MPI_SCRAMBLED_IN' flag. Chris@19: Chris@19: In MPI FFTW, only composite sizes `n0' can be parallelized; we have Chris@19: not yet implemented a parallel algorithm for large prime sizes. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Multi-dimensional MPI DFTs of Real Data, Next: Other Multi-dimensional Real-data MPI Transforms, Prev: MPI Data Distribution, Up: Distributed-memory FFTW with MPI Chris@19: Chris@19: 6.5 Multi-dimensional MPI DFTs of Real Data Chris@19: =========================================== Chris@19: Chris@19: FFTW's MPI interface also supports multi-dimensional DFTs of real data, Chris@19: similar to the serial r2c and c2r interfaces. (Parallel Chris@19: one-dimensional real-data DFTs are not currently supported; you must Chris@19: use a complex transform and set the imaginary parts of the inputs to Chris@19: zero.) 
Chris@19: Chris@19: The key points to understand for r2c and c2r MPI transforms (compared Chris@19: to the MPI complex DFTs or the serial r2c/c2r transforms), are: Chris@19: Chris@19: * Just as for serial transforms, r2c/c2r DFTs transform n[0] x n[1] Chris@19: x n[2] x ... x n[d-1] real data to/from n[0] x n[1] x n[2] x ... Chris@19: x (n[d-1]/2 + 1) complex data: the last dimension of the complex Chris@19: data is cut in half (rounded down), plus one. As for the serial Chris@19: transforms, the sizes you pass to the `plan_dft_r2c' and Chris@19: `plan_dft_c2r' are the n[0] x n[1] x n[2] x ... x n[d-1] Chris@19: dimensions of the real data. Chris@19: Chris@19: * Although the real data is _conceptually_ n[0] x n[1] x n[2] x ... Chris@19: x n[d-1] , it is _physically_ stored as an n[0] x n[1] x n[2] x Chris@19: ... x [2 (n[d-1]/2 + 1)] array, where the last dimension has been Chris@19: _padded_ to make it the same size as the complex output. This is Chris@19: much like the in-place serial r2c/c2r interface (*note Chris@19: Multi-Dimensional DFTs of Real Data::), except that in MPI the Chris@19: padding is required even for out-of-place data. The extra padding Chris@19: numbers are ignored by FFTW (they are _not_ like zero-padding the Chris@19: transform to a larger size); they are only used to determine the Chris@19: data layout. Chris@19: Chris@19: * The data distribution in MPI for _both_ the real and complex data Chris@19: is determined by the shape of the _complex_ data. That is, you Chris@19: call the appropriate `local size' function for the n[0] x n[1] x Chris@19: n[2] x ... x (n[d-1]/2 + 1) Chris@19: Chris@19: complex data, and then use the _same_ distribution for the real Chris@19: data except that the last complex dimension is replaced by a Chris@19: (padded) real dimension of twice the length. 
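The size arithmetic in the points above can be written out as a short C sketch. The helper names are hypothetical (they are not FFTW functions), and the index formula assumes the padded real array is stored in row-major order on each process.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers illustrating the r2c/c2r size rules. */

/* Last dimension of the complex data for a real last dimension n:
   n/2 (rounded down) plus one. */
static ptrdiff_t r2c_complex_last(ptrdiff_t n)
{
    assert(n > 0);
    return n / 2 + 1;
}

/* Last dimension of the padded real array: twice the complex length. */
static ptrdiff_t r2c_padded_last(ptrdiff_t n)
{
    return 2 * r2c_complex_last(n);
}

/* Offset of real element (i, j, k) in a local_n0 x M x padded(N) array
   stored in row-major order; k indexes only the N meaningful values,
   so the padding entries at the end of each row are never addressed. */
static ptrdiff_t r2c_real_index(ptrdiff_t i, ptrdiff_t j, ptrdiff_t k,
                                ptrdiff_t M, ptrdiff_t N)
{
    assert(k >= 0 && k < N);
    return (i * M + j) * r2c_padded_last(N) + k;
}
```

Note that for even `N' the padding is two reals per row and for odd `N' it is one, which is why the row stride must be computed from `2*(N/2+1)' rather than from `N'.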
Chris@19: Chris@19: Chris@19: For example, suppose we are performing an out-of-place r2c transform Chris@19: of L x M x N real data [padded to L x M x 2(N/2+1) ], resulting in L x Chris@19: M x (N/2+1) complex data. Similar to the example in *note 2d MPI Chris@19: example::, we might do something like: Chris@19: Chris@19: #include <fftw3-mpi.h> Chris@19: Chris@19: int main(int argc, char **argv) Chris@19: { Chris@19: const ptrdiff_t L = ..., M = ..., N = ...; Chris@19: fftw_plan plan; Chris@19: double *rin; Chris@19: fftw_complex *cout; Chris@19: ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k; Chris@19: Chris@19: MPI_Init(&argc, &argv); Chris@19: fftw_mpi_init(); Chris@19: Chris@19: /* get local data size and allocate */ Chris@19: alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD, Chris@19: &local_n0, &local_0_start); Chris@19: rin = fftw_alloc_real(2 * alloc_local); Chris@19: cout = fftw_alloc_complex(alloc_local); Chris@19: Chris@19: /* create plan for out-of-place r2c DFT */ Chris@19: plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD, Chris@19: FFTW_MEASURE); Chris@19: Chris@19: /* initialize rin to some function my_func(x,y,z) */ Chris@19: for (i = 0; i < local_n0; ++i) Chris@19: for (j = 0; j < M; ++j) Chris@19: for (k = 0; k < N; ++k) Chris@19: rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k); Chris@19: Chris@19: /* compute transforms as many times as desired */ Chris@19: fftw_execute(plan); Chris@19: Chris@19: fftw_destroy_plan(plan); Chris@19: Chris@19: MPI_Finalize(); Chris@19: } Chris@19: Chris@19: Note that we allocated `rin' using `fftw_alloc_real' with an Chris@19: argument of `2 * alloc_local': since `alloc_local' is the number of Chris@19: _complex_ values to allocate, the number of _real_ values is twice as Chris@19: many.
The `rin' array is then local_n0 x M x 2(N/2+1) in row-major Chris@19: order, so its `(i,j,k)' element is at the index `(i*M + j) * Chris@19: (2*(N/2+1)) + k' (*note Multi-dimensional Array Format::). Chris@19: Chris@19: As for the complex transforms, improved performance can be obtained Chris@19: by specifying that the output is the transpose of the input or vice Chris@19: versa (*note Transposed distributions::). In our L x M x N r2c Chris@19: example, including `FFTW_MPI_TRANSPOSED_OUT' in the flags means that the Chris@19: input would be a padded L x M x 2(N/2+1) real array distributed over Chris@19: the `L' dimension, while the output would be an M x L x (N/2+1) complex Chris@19: array distributed over the `M' dimension. To perform the inverse c2r Chris@19: transform with the same data distributions, you would use the Chris@19: `FFTW_MPI_TRANSPOSED_IN' flag. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Other Multi-dimensional Real-data MPI Transforms, Next: FFTW MPI Transposes, Prev: Multi-dimensional MPI DFTs of Real Data, Up: Distributed-memory FFTW with MPI Chris@19: Chris@19: 6.6 Other multi-dimensional Real-Data MPI Transforms Chris@19: ==================================================== Chris@19: Chris@19: FFTW's MPI interface also supports multi-dimensional `r2r' transforms Chris@19: of all kinds supported by the serial interface (e.g. discrete cosine Chris@19: and sine transforms, discrete Hartley transforms, etc.). Only Chris@19: multi-dimensional `r2r' transforms, not one-dimensional transforms, are Chris@19: currently parallelized. Chris@19: Chris@19: These are used much like the multidimensional complex DFTs discussed Chris@19: above, except that the data is real rather than complex, and one needs Chris@19: to pass an r2r transform kind (`fftw_r2r_kind') for each dimension as Chris@19: in the serial FFTW (*note More DFTs of Real Data::).
Chris@19: Chris@19: For example, one might perform a two-dimensional L x M transform that is an Chris@19: REDFT10 (DCT-II) in the first dimension and an RODFT10 (DST-II) in the Chris@19: second dimension with code like: Chris@19: Chris@19: const ptrdiff_t L = ..., M = ...; Chris@19: fftw_plan plan; Chris@19: double *data; Chris@19: ptrdiff_t alloc_local, local_n0, local_0_start, i, j; Chris@19: Chris@19: /* get local data size and allocate */ Chris@19: alloc_local = fftw_mpi_local_size_2d(L, M, MPI_COMM_WORLD, Chris@19: &local_n0, &local_0_start); Chris@19: data = fftw_alloc_real(alloc_local); Chris@19: Chris@19: /* create plan for in-place REDFT10 x RODFT10 */ Chris@19: plan = fftw_mpi_plan_r2r_2d(L, M, data, data, MPI_COMM_WORLD, Chris@19: FFTW_REDFT10, FFTW_RODFT10, FFTW_MEASURE); Chris@19: Chris@19: /* initialize data to some function my_function(x,y) */ Chris@19: for (i = 0; i < local_n0; ++i) for (j = 0; j < M; ++j) Chris@19: data[i*M + j] = my_function(local_0_start + i, j); Chris@19: Chris@19: /* compute transforms, in-place, as many times as desired */ Chris@19: fftw_execute(plan); Chris@19: Chris@19: fftw_destroy_plan(plan); Chris@19: Chris@19: Notice that we use the same `local_size' functions as we did for Chris@19: complex data, only now we interpret the sizes in terms of real rather Chris@19: than complex values, and correspondingly use `fftw_alloc_real'. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: FFTW MPI Transposes, Next: FFTW MPI Wisdom, Prev: Other Multi-dimensional Real-data MPI Transforms, Up: Distributed-memory FFTW with MPI Chris@19: Chris@19: 6.7 FFTW MPI Transposes Chris@19: ======================= Chris@19: Chris@19: FFTW's MPI Fourier transforms rely on one or more _global Chris@19: transposition_ steps for their communications.
For example, the Chris@19: multidimensional transforms work by transforming along some dimensions, Chris@19: then transposing to make the first dimension local and transforming Chris@19: that, then transposing back. Because global transposition of a Chris@19: block-distributed matrix has many other potential uses besides FFTs, Chris@19: FFTW's transpose routines can be called directly, as documented in this Chris@19: section. Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * Basic distributed-transpose interface:: Chris@19: * Advanced distributed-transpose interface:: Chris@19: * An improved replacement for MPI_Alltoall:: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Basic distributed-transpose interface, Next: Advanced distributed-transpose interface, Prev: FFTW MPI Transposes, Up: FFTW MPI Transposes Chris@19: Chris@19: 6.7.1 Basic distributed-transpose interface Chris@19: ------------------------------------------- Chris@19: Chris@19: In particular, suppose that we have an `n0' by `n1' array in row-major Chris@19: order, block-distributed across the `n0' dimension. To transpose this Chris@19: into an `n1' by `n0' array block-distributed across the `n1' dimension, Chris@19: we would create a plan by calling the following function: Chris@19: Chris@19: fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1, Chris@19: double *in, double *out, Chris@19: MPI_Comm comm, unsigned flags); Chris@19: Chris@19: The input and output arrays (`in' and `out') can be the same. The Chris@19: transpose is actually executed by calling `fftw_execute' on the plan, Chris@19: as usual. Chris@19: Chris@19: The `flags' are the usual FFTW planner flags, but support two Chris@19: additional flags: `FFTW_MPI_TRANSPOSED_OUT' and/or Chris@19: `FFTW_MPI_TRANSPOSED_IN'. What these flags indicate, for transpose Chris@19: plans, is that the output and/or input, respectively, are _locally_ Chris@19: transposed. 
That is, on each process input data is normally stored as Chris@19: a `local_n0' by `n1' array in row-major order, but for an Chris@19: `FFTW_MPI_TRANSPOSED_IN' plan the input data is stored as `n1' by Chris@19: `local_n0' in row-major order. Similarly, `FFTW_MPI_TRANSPOSED_OUT' Chris@19: means that the output is `n0' by `local_n1' instead of `local_n1' by Chris@19: `n0'. Chris@19: Chris@19: To determine the local size of the array on each process before and Chris@19: after the transpose, as well as the amount of storage that must be Chris@19: allocated, one should call `fftw_mpi_local_size_2d_transposed', just as Chris@19: for a 2d DFT as described in the previous section: Chris@19: Chris@19: ptrdiff_t fftw_mpi_local_size_2d_transposed Chris@19: (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, Chris@19: ptrdiff_t *local_n0, ptrdiff_t *local_0_start, Chris@19: ptrdiff_t *local_n1, ptrdiff_t *local_1_start); Chris@19: Chris@19: Again, the return value is the local storage to allocate, which in Chris@19: this case is the number of _real_ (`double') values rather than complex Chris@19: numbers as in the previous examples. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Advanced distributed-transpose interface, Next: An improved replacement for MPI_Alltoall, Prev: Basic distributed-transpose interface, Up: FFTW MPI Transposes Chris@19: Chris@19: 6.7.2 Advanced distributed-transpose interface Chris@19: ---------------------------------------------- Chris@19: Chris@19: The above routines are for a transpose of a matrix of numbers (of type Chris@19: `double'), using FFTW's default block sizes. 
More generally, one can Chris@19: perform transposes of _tuples_ of numbers, with user-specified block Chris@19: sizes for the input and output: Chris@19: Chris@19: fftw_plan fftw_mpi_plan_many_transpose Chris@19: (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany, Chris@19: ptrdiff_t block0, ptrdiff_t block1, Chris@19: double *in, double *out, MPI_Comm comm, unsigned flags); Chris@19: Chris@19: In this case, one is transposing an `n0' by `n1' matrix of Chris@19: `howmany'-tuples (e.g. `howmany = 2' for complex numbers). The input Chris@19: is distributed along the `n0' dimension with block size `block0', and Chris@19: the `n1' by `n0' output is distributed along the `n1' dimension with Chris@19: block size `block1'. If `FFTW_MPI_DEFAULT_BLOCK' (0) is passed for a Chris@19: block size then FFTW uses its default block size. To get the local Chris@19: size of the data on each process, you should then call Chris@19: `fftw_mpi_local_size_many_transposed'. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: An improved replacement for MPI_Alltoall, Prev: Advanced distributed-transpose interface, Up: FFTW MPI Transposes Chris@19: Chris@19: 6.7.3 An improved replacement for MPI_Alltoall Chris@19: ---------------------------------------------- Chris@19: Chris@19: We close this section by noting that FFTW's MPI transpose routines can Chris@19: be thought of as a generalization of the `MPI_Alltoall' function Chris@19: (albeit only for floating-point types), and in some circumstances can Chris@19: function as an improved replacement.
Chris@19: Chris@19: `MPI_Alltoall' is defined by the MPI standard as: Chris@19: Chris@19: int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype, Chris@19: void *recvbuf, int recvcnt, MPI_Datatype recvtype, Chris@19: MPI_Comm comm); Chris@19: Chris@19: In particular, for `double*' arrays `in' and `out', consider the Chris@19: call: Chris@19: Chris@19: MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm); Chris@19: Chris@19: This is completely equivalent to: Chris@19: Chris@19: MPI_Comm_size(comm, &P); Chris@19: plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE); Chris@19: fftw_execute(plan); Chris@19: fftw_destroy_plan(plan); Chris@19: Chris@19: That is, computing a P x P transpose on `P' processes, with a block Chris@19: size of 1, is just a standard all-to-all communication. Chris@19: Chris@19: However, using the FFTW routine instead of `MPI_Alltoall' may have Chris@19: certain advantages. First of all, FFTW's routine can operate in-place Chris@19: (`in == out') whereas `MPI_Alltoall' can only operate out-of-place. Chris@19: Chris@19: Second, even for out-of-place plans, FFTW's routine may be faster, Chris@19: especially if you need to perform the all-to-all communication many Chris@19: times and can afford to use `FFTW_MEASURE' or `FFTW_PATIENT'. It Chris@19: should certainly be no slower, not including the time to create the Chris@19: plan, since one of the possible algorithms that FFTW uses for an Chris@19: out-of-place transpose _is_ simply to call `MPI_Alltoall'. However, Chris@19: FFTW also considers several other possible algorithms that, depending Chris@19: on your MPI implementation and your hardware, may be faster.
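To make the equivalence concrete, here is a serial model in plain C of what an all-to-all exchange does to the data. It is only an illustration under stated assumptions: it uses neither MPI nor FFTW, the function name `alltoall_as_transpose` is hypothetical, and the buffers of all `P' processes are modeled as consecutive rows of one array.

```c
#include <assert.h>
#include <stddef.h>

/* Serial model of an all-to-all exchange (hypothetical, no MPI involved).
   in_all holds the send buffers of all P processes back to back: row p is
   process p's buffer, made of P blocks of `howmany` doubles, where block q
   is destined for process q.  out_all receives the same layout for the
   receive buffers.  The exchange is exactly a P x P transpose of
   howmany-tuples with one row (block size 1) per process. */
static void alltoall_as_transpose(const double *in_all, double *out_all,
                                  int P, int howmany)
{
    assert(P > 0 && howmany > 0);
    assert(in_all != out_all);           /* this simple model is out-of-place */
    for (int p = 0; p < P; ++p)          /* sending process   */
        for (int q = 0; q < P; ++q)      /* receiving process */
            for (int t = 0; t < howmany; ++t)
                out_all[((size_t)q * P + p) * howmany + t] =
                    in_all[((size_t)p * P + q) * howmany + t];
}
```

With `P = 2' and `howmany = 2', the input rows `{0,1,2,3}' and `{4,5,6,7}' become `{0,1,4,5}' and `{2,3,6,7}': each process ends up holding block number "its own rank" from every sender.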
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: FFTW MPI Wisdom, Next: Avoiding MPI Deadlocks, Prev: FFTW MPI Transposes, Up: Distributed-memory FFTW with MPI Chris@19: Chris@19: 6.8 FFTW MPI Wisdom Chris@19: =================== Chris@19: Chris@19: FFTW's "wisdom" facility (*note Words of Wisdom-Saving Plans::) can be Chris@19: used to save MPI plans as well as to save uniprocessor plans. However, Chris@19: for MPI there are several unavoidable complications. Chris@19: Chris@19: First, the MPI standard does not guarantee that every process can Chris@19: perform file I/O (at least, not using C stdio routines)--in general, we Chris@19: may only assume that process 0 is capable of I/O.(1) So, if we want to Chris@19: export the wisdom from a single process to a file, we must first export Chris@19: the wisdom to a string, then send it to process 0, then write it to a Chris@19: file. Chris@19: Chris@19: Second, in principle we may want to have separate wisdom for every Chris@19: process, since in general the processes may run on different hardware Chris@19: even for a single MPI program. However, in practice FFTW's MPI code is Chris@19: designed for the case of homogeneous hardware (*note Load balancing::), Chris@19: and in this case it is convenient to use the same wisdom for every Chris@19: process. Thus, we need a mechanism to synchronize the wisdom. Chris@19: Chris@19: To address both of these problems, FFTW provides the following two Chris@19: functions: Chris@19: Chris@19: void fftw_mpi_broadcast_wisdom(MPI_Comm comm); Chris@19: void fftw_mpi_gather_wisdom(MPI_Comm comm); Chris@19: Chris@19: Given a communicator `comm', `fftw_mpi_broadcast_wisdom' will Chris@19: broadcast the wisdom from process 0 to all other processes. Chris@19: Conversely, `fftw_mpi_gather_wisdom' will collect wisdom from all Chris@19: processes onto process 0. 
(If the plans created for the same problem Chris@19: by different processes are not the same, `fftw_mpi_gather_wisdom' will Chris@19: arbitrarily choose one of the plans.) Both of these functions may Chris@19: result in suboptimal plans for different processes if the processes are Chris@19: running on non-identical hardware. Both of these functions are Chris@19: _collective_ calls, which means that they must be executed by all Chris@19: processes in the communicator. Chris@19: Chris@19: So, for example, a typical code snippet to import wisdom from a file Chris@19: and use it on all processes would be: Chris@19: Chris@19: { Chris@19: int rank; Chris@19: Chris@19: fftw_mpi_init(); Chris@19: MPI_Comm_rank(MPI_COMM_WORLD, &rank); Chris@19: if (rank == 0) fftw_import_wisdom_from_filename("mywisdom"); Chris@19: fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD); Chris@19: } Chris@19: Chris@19: (Note that we must call `fftw_mpi_init' before importing any wisdom Chris@19: that might contain MPI plans.) Similarly, a typical code snippet to Chris@19: export wisdom from all processes to a file is: Chris@19: Chris@19: { Chris@19: int rank; Chris@19: Chris@19: fftw_mpi_gather_wisdom(MPI_COMM_WORLD); Chris@19: MPI_Comm_rank(MPI_COMM_WORLD, &rank); Chris@19: if (rank == 0) fftw_export_wisdom_to_filename("mywisdom"); Chris@19: } Chris@19: Chris@19: ---------- Footnotes ---------- Chris@19: Chris@19: (1) In fact, even this assumption is not technically guaranteed by Chris@19: the standard, although it seems to be universal in actual MPI Chris@19: implementations and is widely assumed by MPI-using software. Chris@19: Technically, you need to query the `MPI_IO' attribute of Chris@19: `MPI_COMM_WORLD' with `MPI_Attr_get'. If this attribute is Chris@19: `MPI_PROC_NULL', no I/O is possible. If it is `MPI_ANY_SOURCE', any Chris@19: process can perform I/O. Otherwise, it is the rank of a process that Chris@19: can perform I/O ... 
but since it is not guaranteed to yield the _same_ Chris@19: rank on all processes, you have to do an `MPI_Allreduce' of some kind Chris@19: if you want all processes to agree about which is going to do I/O. And Chris@19: even then, the standard only guarantees that this process can perform Chris@19: output, but not input. See e.g. `Parallel Programming with MPI' by P. Chris@19: S. Pacheco, section 8.1.3. Needless to say, in our experience Chris@19: virtually no MPI programmers worry about this. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Avoiding MPI Deadlocks, Next: FFTW MPI Performance Tips, Prev: FFTW MPI Wisdom, Up: Distributed-memory FFTW with MPI Chris@19: Chris@19: 6.9 Avoiding MPI Deadlocks Chris@19: ========================== Chris@19: Chris@19: An MPI program can _deadlock_ if one process is waiting for a message Chris@19: from another process that never gets sent. To avoid deadlocks when Chris@19: using FFTW's MPI routines, it is important to know which functions are Chris@19: _collective_: that is, which functions must _always_ be called in the Chris@19: _same order_ from _every_ process in a given communicator. (For Chris@19: example, `MPI_Barrier' is the canonical example of a collective Chris@19: function in the MPI standard.) Chris@19: Chris@19: The functions in FFTW that are _always_ collective are: every Chris@19: function beginning with `fftw_mpi_plan', as well as Chris@19: `fftw_mpi_broadcast_wisdom' and `fftw_mpi_gather_wisdom'. Also, the Chris@19: following functions from the ordinary FFTW interface are collective Chris@19: when they are applied to a plan created by an `fftw_mpi_plan' function: Chris@19: `fftw_execute', `fftw_destroy_plan', and `fftw_flops'. 
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: FFTW MPI Performance Tips, Next: Combining MPI and Threads, Prev: Avoiding MPI Deadlocks, Up: Distributed-memory FFTW with MPI Chris@19: Chris@19: 6.10 FFTW MPI Performance Tips Chris@19: ============================== Chris@19: Chris@19: In this section, we collect a few tips on getting the best performance Chris@19: out of FFTW's MPI transforms. Chris@19: Chris@19: First, because of the 1d block distribution, FFTW's parallelization Chris@19: is currently limited by the size of the first dimension. Chris@19: (Multidimensional block distributions may be supported by a future Chris@19: version.) More generally, you should ideally arrange the dimensions so Chris@19: that FFTW can divide them equally among the processes. *Note Load Chris@19: balancing::. Chris@19: Chris@19: Second, if it is not too inconvenient, you should consider working Chris@19: with transposed output for multidimensional plans, as this saves a Chris@19: considerable amount of communications. *Note Transposed Chris@19: distributions::. Chris@19: Chris@19: Third, the fastest choices are generally either an in-place transform Chris@19: or an out-of-place transform with the `FFTW_DESTROY_INPUT' flag (which Chris@19: allows the input array to be used as scratch space). In-place is Chris@19: especially beneficial if the amount of data per process is large. Chris@19: Chris@19: Fourth, if you have multiple arrays to transform at once, rather than Chris@19: calling FFTW's MPI transforms several times it usually seems to be Chris@19: faster to interleave the data and use the advanced interface. (This Chris@19: groups the communications together instead of requiring separate Chris@19: messages for each transform.) 
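The interleaving mentioned in the fourth tip can be sketched as follows. This is a minimal illustration, not FFTW code: the helper name `interleave` is hypothetical, and it assumes the advanced interface expects the `howmany' values belonging to one grid point to be stored contiguously (as `howmany'-tuples).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: pack `howmany` separate arrays of n elements each
   into one array of howmany-tuples, so that element i of array t ends up
   at packed[i * howmany + t]. */
static void interleave(const double *const *arrays, ptrdiff_t howmany,
                       ptrdiff_t n, double *packed)
{
    assert(howmany > 0 && n >= 0);
    for (ptrdiff_t i = 0; i < n; ++i)
        for (ptrdiff_t t = 0; t < howmany; ++t)
            packed[i * howmany + t] = arrays[t][i];
}
```

Here `n' would be the number of local elements per array on the current process; the packed buffer would then be handed to a single plan with vector length `howmany' instead of creating one plan per array.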
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Combining MPI and Threads, Next: FFTW MPI Reference, Prev: FFTW MPI Performance Tips, Up: Distributed-memory FFTW with MPI Chris@19: Chris@19: 6.11 Combining MPI and Threads Chris@19: ============================== Chris@19: Chris@19: In certain cases, it may be advantageous to combine MPI Chris@19: (distributed-memory) and threads (shared-memory) parallelization. FFTW Chris@19: supports this, with certain caveats. For example, if you have a Chris@19: cluster of 4-processor shared-memory nodes, you may want to use threads Chris@19: within the nodes and MPI between the nodes, instead of MPI for all Chris@19: parallelization. Chris@19: Chris@19: In particular, it is possible to seamlessly combine the MPI FFTW Chris@19: routines with the multi-threaded FFTW routines (*note Multi-threaded Chris@19: FFTW::). However, some care must be taken in the initialization code, Chris@19: which should look something like this: Chris@19: Chris@19: int threads_ok; Chris@19: Chris@19: int main(int argc, char **argv) Chris@19: { Chris@19: int provided; Chris@19: MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided); Chris@19: threads_ok = provided >= MPI_THREAD_FUNNELED; Chris@19: Chris@19: if (threads_ok) threads_ok = fftw_init_threads(); Chris@19: fftw_mpi_init(); Chris@19: Chris@19: ... Chris@19: if (threads_ok) fftw_plan_with_nthreads(...); Chris@19: ... Chris@19: Chris@19: MPI_Finalize(); Chris@19: } Chris@19: Chris@19: First, note that instead of calling `MPI_Init', you should call Chris@19: `MPI_Init_thread', which is the initialization routine defined by the Chris@19: MPI-2 standard to indicate to MPI that your program will be Chris@19: multithreaded. We pass `MPI_THREAD_FUNNELED', which indicates that we Chris@19: will only call MPI routines from the main thread. (FFTW will launch Chris@19: additional threads internally, but the extra threads will not call MPI Chris@19: code.)
(You may also pass `MPI_THREAD_SERIALIZED' or Chris@19: `MPI_THREAD_MULTIPLE', which requests additional multithreading support Chris@19: from the MPI implementation, but this is not required by FFTW.) The Chris@19: `provided' parameter returns what level of threads support is actually Chris@19: supported by your MPI implementation; this _must_ be at least Chris@19: `MPI_THREAD_FUNNELED' if you want to call the FFTW threads routines, so Chris@19: we define a global variable `threads_ok' to record this. You should Chris@19: only call `fftw_init_threads' or `fftw_plan_with_nthreads' if Chris@19: `threads_ok' is true. For more information on thread safety in MPI, Chris@19: see the MPI and Threads Chris@19: (http://www.mpi-forum.org/docs/mpi-20-html/node162.htm) section of the Chris@19: MPI-2 standard. Chris@19: Chris@19: Second, we must call `fftw_init_threads' _before_ `fftw_mpi_init'. Chris@19: This is critical for technical reasons having to do with how FFTW Chris@19: initializes its list of algorithms. Chris@19: Chris@19: Then, if you call `fftw_plan_with_nthreads(N)', _every_ MPI process Chris@19: will launch (up to) `N' threads to parallelize its transforms. Chris@19: Chris@19: For example, in the hypothetical cluster of 4-processor nodes, you Chris@19: might wish to launch only a single MPI process per node, and then call Chris@19: `fftw_plan_with_nthreads(4)' on each process to use all processors in Chris@19: the nodes. Chris@19: Chris@19: This may or may not be faster than simply using as many MPI processes Chris@19: as you have processors, however. On the one hand, using threads within Chris@19: a node eliminates the need for explicit message passing within the Chris@19: node. On the other hand, FFTW's transpose routines are not Chris@19: multi-threaded, and this means that the communications that do take Chris@19: place will not benefit from parallelization within the node. 
Moreover, Chris@19: many MPI implementations already have optimizations to exploit shared Chris@19: memory when it is available, so adding the multithreaded FFTW on top of Chris@19: this may be superfluous. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: FFTW MPI Reference, Next: FFTW MPI Fortran Interface, Prev: Combining MPI and Threads, Up: Distributed-memory FFTW with MPI Chris@19: Chris@19: 6.12 FFTW MPI Reference Chris@19: ======================= Chris@19: Chris@19: This section provides a complete reference to all FFTW MPI functions, Chris@19: datatypes, and constants. See also *note FFTW Reference:: for Chris@19: information on functions and types in common with the serial interface. Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * MPI Files and Data Types:: Chris@19: * MPI Initialization:: Chris@19: * Using MPI Plans:: Chris@19: * MPI Data Distribution Functions:: Chris@19: * MPI Plan Creation:: Chris@19: * MPI Wisdom Communication:: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: MPI Files and Data Types, Next: MPI Initialization, Prev: FFTW MPI Reference, Up: FFTW MPI Reference Chris@19: Chris@19: 6.12.1 MPI Files and Data Types Chris@19: ------------------------------- Chris@19: Chris@19: All programs using FFTW's MPI support should include its header file: Chris@19: Chris@19: #include <fftw3-mpi.h> Chris@19: Chris@19: Note that this header file includes the serial-FFTW `fftw3.h' header Chris@19: file, and also the `mpi.h' header file for MPI, so you need not include Chris@19: those files separately. Chris@19: Chris@19: You must also link to _both_ the FFTW MPI library and to the serial Chris@19: FFTW library. On Unix, this means adding `-lfftw3_mpi -lfftw3 -lm' at Chris@19: the end of the link command. Chris@19: Chris@19: Different precisions are handled as in the serial interface: *Note Chris@19: Precision::.
That is, `fftw_' functions become `fftwf_' (in single Chris@19: precision) etcetera, and the libraries become `-lfftw3f_mpi -lfftw3f Chris@19: -lm' etcetera on Unix. Long-double precision is supported in MPI, but Chris@19: quad precision (`fftwq_') is not due to the lack of MPI support for Chris@19: this type. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: MPI Initialization, Next: Using MPI Plans, Prev: MPI Files and Data Types, Up: FFTW MPI Reference Chris@19: Chris@19: 6.12.2 MPI Initialization Chris@19: ------------------------- Chris@19: Chris@19: Before calling any other FFTW MPI (`fftw_mpi_') function, and before Chris@19: importing any wisdom for MPI problems, you must call: Chris@19: Chris@19: void fftw_mpi_init(void); Chris@19: Chris@19: If FFTW threads support is used, however, `fftw_mpi_init' should be Chris@19: called _after_ `fftw_init_threads' (*note Combining MPI and Threads::). Chris@19: Calling `fftw_mpi_init' additional times (before `fftw_mpi_cleanup') Chris@19: has no effect. Chris@19: Chris@19: If you want to deallocate all persistent data and reset FFTW to the Chris@19: pristine state it was in when you started your program, you can call: Chris@19: Chris@19: void fftw_mpi_cleanup(void); Chris@19: Chris@19: (This calls `fftw_cleanup', so you need not call the serial cleanup Chris@19: routine too, although it is safe to do so.) After calling Chris@19: `fftw_mpi_cleanup', all existing plans become undefined, and you should Chris@19: not attempt to execute or destroy them. You must call `fftw_mpi_init' Chris@19: again after `fftw_mpi_cleanup' if you want to resume using the MPI FFTW Chris@19: routines. 
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Using MPI Plans, Next: MPI Data Distribution Functions, Prev: MPI Initialization, Up: FFTW MPI Reference Chris@19: Chris@19: 6.12.3 Using MPI Plans Chris@19: ---------------------- Chris@19: Chris@19: Once an MPI plan is created, you can execute and destroy it using Chris@19: `fftw_execute', `fftw_destroy_plan', and the other functions in the Chris@19: serial interface that operate on generic plans (*note Using Plans::). Chris@19: Chris@19: The `fftw_execute' and `fftw_destroy_plan' functions, applied to MPI Chris@19: plans, are _collective_ calls: they must be called for all processes in Chris@19: the communicator that was used to create the plan. Chris@19: Chris@19: You must _not_ use the serial new-array plan-execution functions Chris@19: `fftw_execute_dft' and so on (*note New-array Execute Functions::) with Chris@19: MPI plans. Such functions are specialized to the problem type, and Chris@19: there are specific new-array execute functions for MPI plans: Chris@19: Chris@19: void fftw_mpi_execute_dft(fftw_plan p, fftw_complex *in, fftw_complex *out); Chris@19: void fftw_mpi_execute_dft_r2c(fftw_plan p, double *in, fftw_complex *out); Chris@19: void fftw_mpi_execute_dft_c2r(fftw_plan p, fftw_complex *in, double *out); Chris@19: void fftw_mpi_execute_r2r(fftw_plan p, double *in, double *out); Chris@19: Chris@19: These functions have the same restrictions as those of the serial Chris@19: new-array execute functions. They are _always_ safe to apply to the Chris@19: _same_ `in' and `out' arrays that were used to create the plan. They Chris@19: can only be applied to new arrays if those arrays have the same types, Chris@19: dimensions, in-placeness, and alignment as the original arrays, where Chris@19: the best way to ensure the same alignment is to use FFTW's Chris@19: `fftw_malloc' and related allocation functions for all arrays (*note Chris@19: Memory Allocation::).
Note that distributed transposes (*note FFTW MPI Chris@19: Transposes::) use `fftw_mpi_execute_r2r', since they count as rank-zero Chris@19: r2r plans from FFTW's perspective. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: MPI Data Distribution Functions, Next: MPI Plan Creation, Prev: Using MPI Plans, Up: FFTW MPI Reference Chris@19: Chris@19: 6.12.4 MPI Data Distribution Functions Chris@19: -------------------------------------- Chris@19: Chris@19: As described above (*note MPI Data Distribution::), in order to Chris@19: allocate your arrays, _before_ creating a plan, you must first call one Chris@19: of the following routines to determine the required allocation size and Chris@19: the portion of the array locally stored on a given process. The Chris@19: `MPI_Comm' communicator passed here must be equivalent to the Chris@19: communicator used below for plan creation. Chris@19: Chris@19: The basic interface for multidimensional transforms consists of the Chris@19: functions: Chris@19: Chris@19: ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, Chris@19: ptrdiff_t *local_n0, ptrdiff_t *local_0_start); Chris@19: ptrdiff_t fftw_mpi_local_size_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, Chris@19: MPI_Comm comm, Chris@19: ptrdiff_t *local_n0, ptrdiff_t *local_0_start); Chris@19: ptrdiff_t fftw_mpi_local_size(int rnk, const ptrdiff_t *n, MPI_Comm comm, Chris@19: ptrdiff_t *local_n0, ptrdiff_t *local_0_start); Chris@19: Chris@19: ptrdiff_t fftw_mpi_local_size_2d_transposed(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, Chris@19: ptrdiff_t *local_n0, ptrdiff_t *local_0_start, Chris@19: ptrdiff_t *local_n1, ptrdiff_t *local_1_start); Chris@19: ptrdiff_t fftw_mpi_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, Chris@19: MPI_Comm comm, Chris@19: ptrdiff_t *local_n0, ptrdiff_t *local_0_start, Chris@19: ptrdiff_t *local_n1, ptrdiff_t *local_1_start); Chris@19: ptrdiff_t fftw_mpi_local_size_transposed(int rnk, const ptrdiff_t 
*n, MPI_Comm comm, Chris@19: ptrdiff_t *local_n0, ptrdiff_t *local_0_start, Chris@19: ptrdiff_t *local_n1, ptrdiff_t *local_1_start); Chris@19: Chris@19: These functions return the number of elements to allocate (complex Chris@19: numbers for DFT/r2c/c2r plans, real numbers for r2r plans), whereas the Chris@19: `local_n0' and `local_0_start' return the portion (`local_0_start' to Chris@19: `local_0_start + local_n0 - 1') of the first dimension of an n[0] x Chris@19: n[1] x n[2] x ... x n[d-1] array that is stored on the local process. Chris@19: *Note Basic and advanced distribution interfaces::. For Chris@19: `FFTW_MPI_TRANSPOSED_OUT' plans, the `_transposed' variants are useful Chris@19: in order to also return the local portion of the first dimension in the Chris@19: n[1] x n[0] x n[2] x ... x n[d-1] transposed output. *Note Transposed Chris@19: distributions::. The advanced interface for multidimensional Chris@19: transforms is: Chris@19: Chris@19: ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, ptrdiff_t howmany, Chris@19: ptrdiff_t block0, MPI_Comm comm, Chris@19: ptrdiff_t *local_n0, ptrdiff_t *local_0_start); Chris@19: ptrdiff_t fftw_mpi_local_size_many_transposed(int rnk, const ptrdiff_t *n, ptrdiff_t howmany, Chris@19: ptrdiff_t block0, ptrdiff_t block1, MPI_Comm comm, Chris@19: ptrdiff_t *local_n0, ptrdiff_t *local_0_start, Chris@19: ptrdiff_t *local_n1, ptrdiff_t *local_1_start); Chris@19: Chris@19: These differ from the basic interface in only two ways. First, they Chris@19: allow you to specify block sizes `block0' and `block1' (the latter for Chris@19: the transposed output); you can pass `FFTW_MPI_DEFAULT_BLOCK' to use Chris@19: FFTW's default block size as in the basic interface. Second, you can Chris@19: pass a `howmany' parameter, corresponding to the advanced planning Chris@19: interface below: this is for transforms of contiguous `howmany'-tuples Chris@19: of numbers (`howmany = 1' in the basic interface). 
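To make the meaning of `local_n0' and `local_0_start' concrete, here is a minimal sketch of a one-dimensional block distribution in plain C. The helper names `block_size', `block_start', and `block_count' are hypothetical and are not part of FFTW, and the block size of n0/P rounded up is an assumption made for illustration; in a real program you should always take these values from the `local_size' functions above, whose return value may also exceed the size of the local portion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, NOT FFTW functions: they model a block
   distribution of the first dimension (n0 rows) over nprocs processes.
   Assumption: each block holds ceil(n0 / nprocs) rows. */

/* Number of rows per block, rounded up. */
ptrdiff_t block_size(ptrdiff_t n0, int nprocs)
{
    assert(n0 > 0 && nprocs > 0);
    return (n0 + nprocs - 1) / nprocs;
}

/* First row owned by `rank' (the analogue of local_0_start).
   Ranks beyond the end of the data are clamped to n0. */
ptrdiff_t block_start(ptrdiff_t n0, int nprocs, int rank)
{
    assert(rank >= 0 && rank < nprocs);
    ptrdiff_t start = (ptrdiff_t)rank * block_size(n0, nprocs);
    return start < n0 ? start : n0;
}

/* Number of rows owned by `rank' (the analogue of local_n0);
   this is zero for ranks that receive no data. */
ptrdiff_t block_count(ptrdiff_t n0, int nprocs, int rank)
{
    ptrdiff_t start = block_start(n0, nprocs, rank);
    ptrdiff_t end = start + block_size(n0, nprocs);
    if (end > n0)
        end = n0;
    return end - start;
}
```

Under these assumptions, 10 rows over 4 processes give blocks of 3, 3, 3, and 1 rows, so the last process owns only row 9. With 4 rows over 8 processes, the last four processes own no rows at all, which is why code that loops over the local portion should not assume `local_n0' is positive.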
Chris@19: Chris@19: The corresponding basic and advanced routines for one-dimensional Chris@19: transforms (currently only complex DFTs) are: Chris@19: Chris@19: ptrdiff_t fftw_mpi_local_size_1d( Chris@19: ptrdiff_t n0, MPI_Comm comm, int sign, unsigned flags, Chris@19: ptrdiff_t *local_ni, ptrdiff_t *local_i_start, Chris@19: ptrdiff_t *local_no, ptrdiff_t *local_o_start); Chris@19: ptrdiff_t fftw_mpi_local_size_many_1d( Chris@19: ptrdiff_t n0, ptrdiff_t howmany, Chris@19: MPI_Comm comm, int sign, unsigned flags, Chris@19: ptrdiff_t *local_ni, ptrdiff_t *local_i_start, Chris@19: ptrdiff_t *local_no, ptrdiff_t *local_o_start); Chris@19: Chris@19: As above, the return value is the number of elements to allocate Chris@19: (complex numbers, for complex DFTs). The `local_ni' and Chris@19: `local_i_start' arguments return the portion (`local_i_start' to Chris@19: `local_i_start + local_ni - 1') of the 1d array that is stored on this Chris@19: process for the transform _input_, and `local_no' and `local_o_start' Chris@19: are the corresponding quantities for the _output_. The `sign' Chris@19: (`FFTW_FORWARD' or `FFTW_BACKWARD') and `flags' must match the Chris@19: arguments passed when creating a plan. Although the inputs and outputs Chris@19: have different data distributions in general, it is guaranteed that the Chris@19: _output_ data distribution of an `FFTW_FORWARD' plan will match the Chris@19: _input_ data distribution of an `FFTW_BACKWARD' plan and vice versa; Chris@19: similarly for the `FFTW_MPI_SCRAMBLED_OUT' and `FFTW_MPI_SCRAMBLED_IN' Chris@19: flags. *Note One-dimensional distributions::. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: MPI Plan Creation, Next: MPI Wisdom Communication, Prev: MPI Data Distribution Functions, Up: FFTW MPI Reference Chris@19: Chris@19: 6.12.5 MPI Plan Creation Chris@19: ------------------------ Chris@19: Chris@19: Complex-data MPI DFTs Chris@19: .....................
Chris@19: Chris@19: Plans for complex-data DFTs (*note 2d MPI example::) are created by: Chris@19: Chris@19: fftw_plan fftw_mpi_plan_dft_1d(ptrdiff_t n0, fftw_complex *in, fftw_complex *out, Chris@19: MPI_Comm comm, int sign, unsigned flags); Chris@19: fftw_plan fftw_mpi_plan_dft_2d(ptrdiff_t n0, ptrdiff_t n1, Chris@19: fftw_complex *in, fftw_complex *out, Chris@19: MPI_Comm comm, int sign, unsigned flags); Chris@19: fftw_plan fftw_mpi_plan_dft_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, Chris@19: fftw_complex *in, fftw_complex *out, Chris@19: MPI_Comm comm, int sign, unsigned flags); Chris@19: fftw_plan fftw_mpi_plan_dft(int rnk, const ptrdiff_t *n, Chris@19: fftw_complex *in, fftw_complex *out, Chris@19: MPI_Comm comm, int sign, unsigned flags); Chris@19: fftw_plan fftw_mpi_plan_many_dft(int rnk, const ptrdiff_t *n, Chris@19: ptrdiff_t howmany, ptrdiff_t block, ptrdiff_t tblock, Chris@19: fftw_complex *in, fftw_complex *out, Chris@19: MPI_Comm comm, int sign, unsigned flags); Chris@19: Chris@19: These are similar to their serial counterparts (*note Complex DFTs::) Chris@19: in specifying the dimensions, sign, and flags of the transform. The Chris@19: `comm' argument gives an MPI communicator that specifies the set of Chris@19: processes to participate in the transform; plan creation is a Chris@19: collective function that must be called for all processes in the Chris@19: communicator. The `in' and `out' pointers refer only to a portion of Chris@19: the overall transform data (*note MPI Data Distribution::) as specified Chris@19: by the `local_size' functions in the previous section. Unless `flags' Chris@19: contains `FFTW_ESTIMATE', these arrays are overwritten during plan Chris@19: creation as for the serial interface. For multi-dimensional Chris@19: transforms, any dimensions `> 1' are supported; for one-dimensional Chris@19: transforms, only composite (non-prime) `n0' are currently supported Chris@19: (unlike the serial FFTW). 
Requesting an unsupported transform size Chris@19: will yield a `NULL' plan. (As in the serial interface, highly Chris@19: composite sizes generally yield the best performance.) Chris@19: Chris@19: The advanced-interface `fftw_mpi_plan_many_dft' additionally allows Chris@19: you to specify the block sizes for the first dimension (`block') of the Chris@19: n[0] x n[1] x n[2] x ... x n[d-1] input data and the first dimension Chris@19: (`tblock') of the n[1] x n[0] x n[2] x ... x n[d-1] transposed data Chris@19: (at intermediate steps of the transform, and for the output if Chris@19: `FFTW_MPI_TRANSPOSED_OUT' is specified in `flags'). These must be the same Chris@19: block sizes as were passed to the corresponding `local_size' function; Chris@19: you can pass `FFTW_MPI_DEFAULT_BLOCK' to use FFTW's default block size Chris@19: as in the basic interface. Also, the `howmany' parameter specifies Chris@19: that the transform is of contiguous `howmany'-tuples rather than Chris@19: individual complex numbers; this corresponds to the same parameter in Chris@19: the serial advanced interface (*note Advanced Complex DFTs::) with Chris@19: `stride = howmany' and `dist = 1'. Chris@19: Chris@19: MPI flags Chris@19: ......... Chris@19: Chris@19: The `flags' can be any of those for the serial FFTW (*note Planner Chris@19: Flags::), and in addition may include one or more of the following Chris@19: MPI-specific flags, which improve performance at the cost of changing Chris@19: the output or input data formats. Chris@19: Chris@19: * `FFTW_MPI_SCRAMBLED_OUT', `FFTW_MPI_SCRAMBLED_IN': valid for 1d Chris@19: transforms only, these flags indicate that the output/input of the Chris@19: transform are in an undocumented "scrambled" order. A forward Chris@19: `FFTW_MPI_SCRAMBLED_OUT' transform can be inverted by a backward Chris@19: `FFTW_MPI_SCRAMBLED_IN' (times the usual 1/N normalization). Chris@19: *Note One-dimensional distributions::.
Chris@19: Chris@19: * `FFTW_MPI_TRANSPOSED_OUT', `FFTW_MPI_TRANSPOSED_IN': valid for Chris@19: multidimensional (`rnk > 1') transforms only, these flags specify Chris@19: that the output or input of an n[0] x n[1] x n[2] x ... x n[d-1] Chris@19: transform is transposed to n[1] x n[0] x n[2] x ... x n[d-1] . Chris@19: *Note Transposed distributions::. Chris@19: Chris@19: Chris@19: Real-data MPI DFTs Chris@19: .................. Chris@19: Chris@19: Plans for real-input/output (r2c/c2r) DFTs (*note Multi-dimensional MPI Chris@19: DFTs of Real Data::) are created by: Chris@19: Chris@19: fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1, Chris@19: double *in, fftw_complex *out, Chris@19: MPI_Comm comm, unsigned flags); Chris@19: fftw_plan fftw_mpi_plan_dft_r2c_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, Chris@19: double *in, fftw_complex *out, Chris@19: MPI_Comm comm, unsigned flags); Chris@19: fftw_plan fftw_mpi_plan_dft_r2c(int rnk, const ptrdiff_t *n, Chris@19: double *in, fftw_complex *out, Chris@19: MPI_Comm comm, unsigned flags); Chris@19: fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1, Chris@19: fftw_complex *in, double *out, Chris@19: MPI_Comm comm, unsigned flags); Chris@19: fftw_plan fftw_mpi_plan_dft_c2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, Chris@19: fftw_complex *in, double *out, Chris@19: MPI_Comm comm, unsigned flags); Chris@19: fftw_plan fftw_mpi_plan_dft_c2r(int rnk, const ptrdiff_t *n, Chris@19: fftw_complex *in, double *out, Chris@19: MPI_Comm comm, unsigned flags); Chris@19: Chris@19: Similar to the serial interface (*note Real-data DFTs::), these Chris@19: transform logically n[0] x n[1] x n[2] x ...
x n[d-1] real data Chris@19: to/from n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1) complex data, Chris@19: representing the non-redundant half of the conjugate-symmetry output of Chris@19: a real-input DFT (*note Multi-dimensional Transforms::). However, the Chris@19: real array must be stored within a padded n[0] x n[1] x n[2] x ... x [2 Chris@19: (n[d-1]/2 + 1)] Chris@19: Chris@19: array (much like the in-place serial r2c transforms, but here for Chris@19: out-of-place transforms as well). Currently, only multi-dimensional Chris@19: (`rnk > 1') r2c/c2r transforms are supported (requesting a plan for Chris@19: `rnk = 1' will yield `NULL'). As explained above (*note Chris@19: Multi-dimensional MPI DFTs of Real Data::), the data distribution of Chris@19: both the real and complex arrays is given by the `local_size' function Chris@19: called for the dimensions of the _complex_ array. Similar to the other Chris@19: planning functions, the input and output arrays are overwritten when Chris@19: the plan is created except in `FFTW_ESTIMATE' mode. Chris@19: Chris@19: As for the complex DFTs above, there is an advanced interface that Chris@19: allows you to manually specify block sizes and to transform contiguous Chris@19: `howmany'-tuples of real/complex numbers: Chris@19: Chris@19: fftw_plan fftw_mpi_plan_many_dft_r2c Chris@19: (int rnk, const ptrdiff_t *n, ptrdiff_t howmany, Chris@19: ptrdiff_t iblock, ptrdiff_t oblock, Chris@19: double *in, fftw_complex *out, Chris@19: MPI_Comm comm, unsigned flags); Chris@19: fftw_plan fftw_mpi_plan_many_dft_c2r Chris@19: (int rnk, const ptrdiff_t *n, ptrdiff_t howmany, Chris@19: ptrdiff_t iblock, ptrdiff_t oblock, Chris@19: fftw_complex *in, double *out, Chris@19: MPI_Comm comm, unsigned flags); Chris@19: Chris@19: MPI r2r transforms Chris@19: ..................
Chris@19: Chris@19: There are corresponding plan-creation routines for r2r transforms Chris@19: (*note More DFTs of Real Data::), currently supporting multidimensional Chris@19: (`rnk > 1') transforms only (`rnk = 1' will yield a `NULL' plan): Chris@19: Chris@19: fftw_plan fftw_mpi_plan_r2r_2d(ptrdiff_t n0, ptrdiff_t n1, Chris@19: double *in, double *out, Chris@19: MPI_Comm comm, Chris@19: fftw_r2r_kind kind0, fftw_r2r_kind kind1, Chris@19: unsigned flags); Chris@19: fftw_plan fftw_mpi_plan_r2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, Chris@19: double *in, double *out, Chris@19: MPI_Comm comm, Chris@19: fftw_r2r_kind kind0, fftw_r2r_kind kind1, fftw_r2r_kind kind2, Chris@19: unsigned flags); Chris@19: fftw_plan fftw_mpi_plan_r2r(int rnk, const ptrdiff_t *n, Chris@19: double *in, double *out, Chris@19: MPI_Comm comm, const fftw_r2r_kind *kind, Chris@19: unsigned flags); Chris@19: fftw_plan fftw_mpi_plan_many_r2r(int rnk, const ptrdiff_t *n, Chris@19: ptrdiff_t iblock, ptrdiff_t oblock, Chris@19: double *in, double *out, Chris@19: MPI_Comm comm, const fftw_r2r_kind *kind, Chris@19: unsigned flags); Chris@19: Chris@19: The parameters are much the same as for the complex DFTs above, Chris@19: except that the arrays are of real numbers (and hence the outputs of the Chris@19: `local_size' data-distribution functions should be interpreted as Chris@19: counts of real rather than complex numbers). Also, the `kind' Chris@19: parameters specify the r2r kinds along each dimension as for the serial Chris@19: interface (*note Real-to-Real Transform Kinds::). *Note Other Chris@19: Multi-dimensional Real-data MPI Transforms::. Chris@19: Chris@19: MPI transposition Chris@19: ................. 
Chris@19: Chris@19: FFTW also provides routines to plan a transpose of a distributed `n0' Chris@19: by `n1' array of real numbers, or an array of `howmany'-tuples of real Chris@19: numbers with specified block sizes (*note FFTW MPI Transposes::): Chris@19: Chris@19: fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1, Chris@19: double *in, double *out, Chris@19: MPI_Comm comm, unsigned flags); Chris@19: fftw_plan fftw_mpi_plan_many_transpose Chris@19: (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany, Chris@19: ptrdiff_t block0, ptrdiff_t block1, Chris@19: double *in, double *out, MPI_Comm comm, unsigned flags); Chris@19: Chris@19: These plans are used with the `fftw_mpi_execute_r2r' new-array Chris@19: execute function (*note Using MPI Plans::), since they count as (rank Chris@19: zero) r2r plans from FFTW's perspective. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: MPI Wisdom Communication, Prev: MPI Plan Creation, Up: FFTW MPI Reference Chris@19: Chris@19: 6.12.6 MPI Wisdom Communication Chris@19: ------------------------------- Chris@19: Chris@19: To facilitate synchronizing wisdom among the different MPI processes, Chris@19: we provide two functions: Chris@19: Chris@19: void fftw_mpi_gather_wisdom(MPI_Comm comm); Chris@19: void fftw_mpi_broadcast_wisdom(MPI_Comm comm); Chris@19: Chris@19: The `fftw_mpi_gather_wisdom' function gathers all wisdom in the Chris@19: given communicator `comm' to the process of rank 0 in the communicator: Chris@19: that process obtains the union of all wisdom on all the processes. As Chris@19: a side effect, some other processes will gain additional wisdom from Chris@19: other processes, but only process 0 will gain the complete union. Chris@19: Chris@19: The `fftw_mpi_broadcast_wisdom' does the reverse: it exports wisdom Chris@19: from process 0 in `comm' to all other processes in the communicator, Chris@19: replacing any wisdom they currently have. Chris@19: Chris@19: *Note FFTW MPI Wisdom::. 
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: FFTW MPI Fortran Interface, Prev: FFTW MPI Reference, Up: Distributed-memory FFTW with MPI Chris@19: Chris@19: 6.13 FFTW MPI Fortran Interface Chris@19: =============================== Chris@19: Chris@19: The FFTW MPI interface is callable from modern Fortran compilers Chris@19: supporting the Fortran 2003 `iso_c_binding' standard for calling C Chris@19: functions. As described in *note Calling FFTW from Modern Fortran::, Chris@19: this means that you can directly call FFTW's C interface from Fortran Chris@19: with only minor changes in syntax. There are, however, a few things Chris@19: specific to the MPI interface to keep in mind: Chris@19: Chris@19: * Instead of including `fftw3.f03' as in *note Overview of Fortran Chris@19: interface::, you should `include 'fftw3-mpi.f03'' (after `use, Chris@19: intrinsic :: iso_c_binding' as before). The `fftw3-mpi.f03' file Chris@19: includes `fftw3.f03', so you should _not_ `include' them both Chris@19: yourself. (You will also want to include the MPI header file, Chris@19: usually via `include 'mpif.h'' or similar, although this is Chris@19: not needed by `fftw3-mpi.f03' per se.) (To use the `fftwl_' `long Chris@19: double' extended-precision routines in supporting compilers, you Chris@19: should include `fftw3l-mpi.f03' in _addition_ to `fftw3-mpi.f03'. Chris@19: *Note Extended and quadruple precision in Fortran::.) Chris@19: Chris@19: * Because of the different storage conventions between C and Fortran, Chris@19: you reverse the order of your array dimensions when passing them to Chris@19: FFTW (*note Reversing array dimensions::). This is merely a Chris@19: difference in notation and incurs no performance overhead. Chris@19: However, it means that, whereas in C the _first_ dimension is Chris@19: distributed, in Fortran the _last_ dimension of your array is Chris@19: distributed.
Chris@19: Chris@19: * In Fortran, communicators are stored as `integer' types; there is Chris@19: no `MPI_Comm' type, nor is there any way to access a C `MPI_Comm'. Chris@19: Fortunately, this is taken care of for you by the FFTW Fortran Chris@19: interface: whenever the C interface expects an `MPI_Comm' type, Chris@19: you should pass the Fortran communicator as an `integer'.(1) Chris@19: Chris@19: * Because you need to call the `local_size' function to find out how Chris@19: much space to allocate, and this may be _larger_ than the local Chris@19: portion of the array (*note MPI Data Distribution::), you should Chris@19: _always_ allocate your arrays dynamically using FFTW's allocation Chris@19: routines as described in *note Allocating aligned memory in Chris@19: Fortran::. (Coincidentally, this also provides the best Chris@19: performance by guaranteeing proper data alignment.) Chris@19: Chris@19: * Because all sizes in the MPI FFTW interface are declared as Chris@19: `ptrdiff_t' in C, you should use `integer(C_INTPTR_T)' in Fortran Chris@19: (*note FFTW Fortran type reference::). Chris@19: Chris@19: * In Fortran, because of the language semantics, we generally Chris@19: recommend using the new-array execute functions for all plans, Chris@19: even in the common case where you are executing the plan on the Chris@19: same arrays for which the plan was created (*note Plan execution Chris@19: in Fortran::). However, note that in the MPI interface these Chris@19: functions are changed: `fftw_execute_dft' becomes Chris@19: `fftw_mpi_execute_dft', etcetera. *Note Using MPI Plans::. Chris@19: Chris@19: Chris@19: For example, here is a Fortran code snippet to perform a distributed Chris@19: L x M complex DFT in-place. (This assumes you have already Chris@19: initialized MPI with `MPI_init' and have also performed `call Chris@19: fftw_mpi_init'.)
Chris@19: Chris@19: use, intrinsic :: iso_c_binding Chris@19: include 'fftw3-mpi.f03' Chris@19: integer(C_INTPTR_T), parameter :: L = ... Chris@19: integer(C_INTPTR_T), parameter :: M = ... Chris@19: type(C_PTR) :: plan, cdata Chris@19: complex(C_DOUBLE_COMPLEX), pointer :: data(:,:) Chris@19: integer(C_INTPTR_T) :: i, j, alloc_local, local_M, local_j_offset Chris@19: Chris@19: ! get local data size and allocate (note dimension reversal) Chris@19: alloc_local = fftw_mpi_local_size_2d(M, L, MPI_COMM_WORLD, & Chris@19: local_M, local_j_offset) Chris@19: cdata = fftw_alloc_complex(alloc_local) Chris@19: call c_f_pointer(cdata, data, [L,local_M]) Chris@19: Chris@19: ! create MPI plan for in-place forward DFT (note dimension reversal) Chris@19: plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, & Chris@19: FFTW_FORWARD, FFTW_MEASURE) Chris@19: Chris@19: ! initialize data to some function my_function(i,j) Chris@19: do j = 1, local_M Chris@19: do i = 1, L Chris@19: data(i, j) = my_function(i, j + local_j_offset) Chris@19: end do Chris@19: end do Chris@19: Chris@19: ! compute transform (as many times as desired) Chris@19: call fftw_mpi_execute_dft(plan, data, data) Chris@19: Chris@19: call fftw_destroy_plan(plan) Chris@19: call fftw_free(cdata) Chris@19: Chris@19: Note that we called `fftw_mpi_local_size_2d' and Chris@19: `fftw_mpi_plan_dft_2d' with the dimensions in reversed order, since an L Chris@19: x M Fortran array is viewed by FFTW in C as an M x L array. This Chris@19: means that the array was distributed over the `M' dimension, the local Chris@19: portion of which is an L x local_M array in Fortran. (You must _not_ Chris@19: use an `allocate' statement to allocate an L x local_M array, however; Chris@19: you must allocate `alloc_local' complex numbers, which may be greater Chris@19: than `L * local_M', in order to reserve space for intermediate steps of Chris@19: the transform.)
Finally, we mention that because C's array indices are Chris@19: zero-based, the `local_j_offset' argument can conveniently be Chris@19: interpreted as an offset in the 1-based `j' index (rather than as a Chris@19: starting index as in C). Chris@19: Chris@19: If instead you had used the `ior(FFTW_MEASURE, Chris@19: FFTW_MPI_TRANSPOSED_OUT)' flag, the output of the transform would be a Chris@19: transposed M x local_L array, associated with the _same_ `cdata' Chris@19: allocation (since the transform is in-place), and which you could Chris@19: declare with: Chris@19: Chris@19: complex(C_DOUBLE_COMPLEX), pointer :: tdata(:,:) Chris@19: ... Chris@19: call c_f_pointer(cdata, tdata, [M,local_L]) Chris@19: Chris@19: where `local_L' would have been obtained by changing the Chris@19: `fftw_mpi_local_size_2d' call to: Chris@19: Chris@19: alloc_local = fftw_mpi_local_size_2d_transposed(M, L, MPI_COMM_WORLD, & Chris@19: local_M, local_j_offset, local_L, local_i_offset) Chris@19: Chris@19: ---------- Footnotes ---------- Chris@19: Chris@19: (1) Technically, this is because you aren't actually calling the C Chris@19: functions directly. You are calling wrapper functions that translate Chris@19: the communicator with `MPI_Comm_f2c' before calling the ordinary C Chris@19: interface. This is all done transparently, however, since the Chris@19: `fftw3-mpi.f03' interface file renames the wrappers so that they are Chris@19: called in Fortran with the same names as the C interface functions. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Calling FFTW from Modern Fortran, Next: Calling FFTW from Legacy Fortran, Prev: Distributed-memory FFTW with MPI, Up: Top Chris@19: Chris@19: 7 Calling FFTW from Modern Fortran Chris@19: ********************************** Chris@19: Chris@19: Fortran 2003 standardized ways for Fortran code to call C libraries, Chris@19: and this allows us to support a direct translation of the FFTW C API Chris@19: into Fortran. 
Compared to the legacy Fortran 77 interface (*note Chris@19: Calling FFTW from Legacy Fortran::), this direct interface offers many Chris@19: advantages, especially compile-time type-checking and aligned memory Chris@19: allocation. As of this writing, support for these C interoperability Chris@19: features seems widespread, having been implemented in nearly all major Chris@19: Fortran compilers (e.g. GNU, Intel, IBM, Oracle/Solaris, Portland Chris@19: Group, NAG). Chris@19: Chris@19: This chapter documents that interface. For the most part, since this Chris@19: interface allows Fortran to call the C interface directly, the usage is Chris@19: identical to C translated to Fortran syntax. However, there are a few Chris@19: subtle points such as memory allocation, wisdom, and data types that Chris@19: deserve closer attention. Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * Overview of Fortran interface:: Chris@19: * Reversing array dimensions:: Chris@19: * FFTW Fortran type reference:: Chris@19: * Plan execution in Fortran:: Chris@19: * Allocating aligned memory in Fortran:: Chris@19: * Accessing the wisdom API from Fortran:: Chris@19: * Defining an FFTW module:: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Overview of Fortran interface, Next: Reversing array dimensions, Prev: Calling FFTW from Modern Fortran, Up: Calling FFTW from Modern Fortran Chris@19: Chris@19: 7.1 Overview of Fortran interface Chris@19: ================================= Chris@19: Chris@19: FFTW provides a file `fftw3.f03' that defines Fortran 2003 interfaces Chris@19: for all of its C routines, except for the MPI routines described Chris@19: elsewhere, which can be found in the same directory as `fftw3.h' (the C Chris@19: header file). 
In any Fortran subroutine where you want to use FFTW Chris@19: functions, you should begin with: Chris@19: Chris@19: use, intrinsic :: iso_c_binding Chris@19: include 'fftw3.f03' Chris@19: Chris@19: This includes the interface definitions and the standard Chris@19: `iso_c_binding' module (which defines the equivalents of C types). You Chris@19: can also put the FFTW functions into a module if you prefer (*note Chris@19: Defining an FFTW module::). Chris@19: Chris@19: At this point, you can now call anything in the FFTW C interface Chris@19: directly, almost exactly as in C other than minor changes in syntax. Chris@19: For example: Chris@19: Chris@19: type(C_PTR) :: plan Chris@19: complex(C_DOUBLE_COMPLEX), dimension(1024,1000) :: in, out Chris@19: plan = fftw_plan_dft_2d(1000,1024, in,out, FFTW_FORWARD,FFTW_ESTIMATE) Chris@19: ... Chris@19: call fftw_execute_dft(plan, in, out) Chris@19: ... Chris@19: call fftw_destroy_plan(plan) Chris@19: Chris@19: A few important things to keep in mind are: Chris@19: Chris@19: * FFTW plans are `type(C_PTR)'. Other C types are mapped in the Chris@19: obvious way via the `iso_c_binding' standard: `int' turns into Chris@19: `integer(C_INT)', `fftw_complex' turns into Chris@19: `complex(C_DOUBLE_COMPLEX)', `double' turns into `real(C_DOUBLE)', Chris@19: and so on. *Note FFTW Fortran type reference::. Chris@19: Chris@19: * Functions in C become functions in Fortran if they have a return Chris@19: value, and subroutines in Fortran otherwise. Chris@19: Chris@19: * The ordering of the Fortran array dimensions must be _reversed_ Chris@19: when they are passed to the FFTW plan creation, thanks to Chris@19: differences in array indexing conventions (*note Multi-dimensional Chris@19: Array Format::). This is _unlike_ the legacy Fortran interface Chris@19: (*note Fortran-interface routines::), which reversed the dimensions Chris@19: for you. *Note Reversing array dimensions::. 
Chris@19: Chris@19: * Using ordinary Fortran array declarations like this works, but may Chris@19: yield suboptimal performance because the data may not be Chris@19: aligned to exploit SIMD instructions on modern processors (*note Chris@19: SIMD alignment and fftw_malloc::). Better performance will often Chris@19: be obtained by allocating with `fftw_alloc'. *Note Allocating Chris@19: aligned memory in Fortran::. Chris@19: Chris@19: * Similar to the legacy Fortran interface (*note FFTW Execution in Chris@19: Fortran::), we currently recommend _not_ using `fftw_execute' but Chris@19: rather using the more specialized functions like Chris@19: `fftw_execute_dft' (*note New-array Execute Functions::). Chris@19: However, you should execute the plan on the `same arrays' as the Chris@19: ones for which you created the plan, unless you are especially Chris@19: careful. *Note Plan execution in Fortran::. To prevent you from Chris@19: using `fftw_execute' by mistake, the `fftw3.f03' file does not Chris@19: provide an `fftw_execute' interface declaration. Chris@19: Chris@19: * Multiple planner flags are combined with `ior' (equivalent to `|' Chris@19: in C). For example, `FFTW_MEASURE | FFTW_DESTROY_INPUT' becomes Chris@19: `ior(FFTW_MEASURE, FFTW_DESTROY_INPUT)'. (You can also use `+' as Chris@19: long as you don't try to include a given flag more than once.)
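
The last point above, combining flags with `ior' rather than `+', can be illustrated with a minimal sketch. It does not link against FFTW: the two constants are local stand-ins with hypothetical names, and their values are an assumption made for illustration only (in real code the `FFTW_' constants come from `fftw3.f03').

```fortran
program combine_flags
  use, intrinsic :: iso_c_binding
  implicit none
  ! Stand-in constants with hypothetical names and assumed values;
  ! real code takes FFTW_ESTIMATE and FFTW_DESTROY_INPUT from fftw3.f03.
  integer(C_INT), parameter :: flag_estimate = 64
  integer(C_INT), parameter :: flag_destroy_input = 1
  integer(C_INT) :: flags_or, flags_add, twice_or, twice_add

  ! With distinct single-bit flags, ior and + give the same result.
  flags_or  = ior(flag_estimate, flag_destroy_input)
  flags_add = flag_estimate + flag_destroy_input

  ! Including a flag a second time is harmless with ior but not with +.
  twice_or  = ior(flags_or, flag_destroy_input)
  twice_add = flags_add + flag_destroy_input

  print *, 'distinct flags, ior == + : ', flags_or == flags_add
  print *, 'repeated flag, ior stable: ', twice_or == flags_or
  print *, 'repeated flag, +   stable: ', twice_add == flags_or
end program combine_flags
```

Because `ior' is idempotent, it is the safer choice when flag sets are assembled in several places.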
Chris@19: Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * Extended and quadruple precision in Fortran:: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Extended and quadruple precision in Fortran, Prev: Overview of Fortran interface, Up: Overview of Fortran interface Chris@19: Chris@19: 7.1.1 Extended and quadruple precision in Fortran Chris@19: ------------------------------------------------- Chris@19: Chris@19: If FFTW is compiled in `long double' (extended) precision (*note Chris@19: Installation and Customization::), you may be able to call the Chris@19: resulting `fftwl_' routines (*note Precision::) from Fortran if your Chris@19: compiler supports the `C_LONG_DOUBLE_COMPLEX' type code. Chris@19: Chris@19: Because some Fortran compilers do not support Chris@19: `C_LONG_DOUBLE_COMPLEX', the `fftwl_' declarations are segregated into Chris@19: a separate interface file `fftw3l.f03', which you should include _in Chris@19: addition_ to `fftw3.f03' (which declares precision-independent `FFTW_' Chris@19: constants): Chris@19: Chris@19: use, intrinsic :: iso_c_binding Chris@19: include 'fftw3.f03' Chris@19: include 'fftw3l.f03' Chris@19: Chris@19: We also support using the nonstandard `__float128' Chris@19: quadruple-precision type provided by recent versions of `gcc' on 32- Chris@19: and 64-bit x86 hardware (*note Installation and Customization::), using Chris@19: the corresponding `real(16)' and `complex(16)' types supported by Chris@19: `gfortran'. The quadruple-precision `fftwq_' functions (*note Chris@19: Precision::) are declared in a `fftw3q.f03' interface file, which Chris@19: should be included in addition to `fftw3.f03', as above. You should Chris@19: also link with `-lfftw3q -lquadmath -lm' as in C.
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Reversing array dimensions, Next: FFTW Fortran type reference, Prev: Overview of Fortran interface, Up: Calling FFTW from Modern Fortran Chris@19: Chris@19: 7.2 Reversing array dimensions Chris@19: ============================== Chris@19: Chris@19: A minor annoyance in calling FFTW from Fortran is that FFTW's array Chris@19: dimensions are defined in the C convention (row-major order), while Chris@19: Fortran's array dimensions are the opposite convention (column-major Chris@19: order). *Note Multi-dimensional Array Format::. This is just a Chris@19: bookkeeping difference, with no effect on performance. The only Chris@19: consequence of this is that, whenever you create an FFTW plan for a Chris@19: multi-dimensional transform, you must always _reverse the ordering of Chris@19: the dimensions_. Chris@19: Chris@19: For example, consider the three-dimensional (L x M x N ) arrays: Chris@19: Chris@19: complex(C_DOUBLE_COMPLEX), dimension(L,M,N) :: in, out Chris@19: Chris@19: To plan a DFT for these arrays using `fftw_plan_dft_3d', you could Chris@19: do: Chris@19: Chris@19: plan = fftw_plan_dft_3d(N,M,L, in,out, FFTW_FORWARD,FFTW_ESTIMATE) Chris@19: Chris@19: That is, from FFTW's perspective this is an N x M x L array. _No Chris@19: data transposition need occur_, as this is _only notation_. Similarly, Chris@19: to use the more generic routine `fftw_plan_dft' with the same arrays, Chris@19: you could do: Chris@19: Chris@19: integer(C_INT), dimension(3) :: n = [N,M,L] Chris@19: plan = fftw_plan_dft(3, n, in,out, FFTW_FORWARD,FFTW_ESTIMATE) Chris@19: Chris@19: Note, by the way, that this is different from the legacy Fortran Chris@19: interface (*note Fortran-interface routines::), which automatically Chris@19: reverses the order of the array dimensions for you. Here, you are Chris@19: calling the C interface directly, so there is no "translation" layer.
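
The claim that reversing the dimensions is "only notation" can be checked with plain index arithmetic. The following is a minimal sketch, independent of FFTW, with small sizes chosen arbitrarily for illustration: it stores in each element of a Fortran `dimension(L,M,N)' array its 0-based position in memory, then compares that position with the row-major offset a C program would compute for an N x M x L array.

```fortran
program reversed_dims
  implicit none
  ! Illustrative sizes only; any positive values would do.
  integer, parameter :: L = 4, M = 3, N = 2
  integer :: a(L,M,N)
  integer :: i, j, k, c_offset, mismatches

  ! Fill a in memory order, so a(i,j,k) holds its own 0-based position.
  a = reshape([(i, i = 0, L*M*N - 1)], [L, M, N])

  mismatches = 0
  do k = 1, N
    do j = 1, M
      do i = 1, L
        ! Row-major offset of element [k-1][j-1][i-1] in a C array
        ! declared with shape N x M x L.
        c_offset = ((k - 1)*M + (j - 1))*L + (i - 1)
        if (a(i,j,k) /= c_offset) mismatches = mismatches + 1
      end do
    end do
  end do

  print *, 'elements whose C and Fortran positions differ:', mismatches
end program reversed_dims
```

Every element lands at the same address under both descriptions, which is why the plan only needs the sizes in reversed order and no data is moved.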
Chris@19: Chris@19: An important thing to keep in mind is the implication of this for Chris@19: multidimensional real-to-complex transforms (*note Multi-Dimensional Chris@19: DFTs of Real Data::). In C, a multidimensional real-to-complex DFT Chris@19: chops the last dimension roughly in half (N x M x L real input goes to Chris@19: N x M x L/2+1 complex output). In Fortran, because the array Chris@19: dimension notation is reversed, the _first_ dimension of the complex Chris@19: data is chopped roughly in half. For example consider the `r2c' Chris@19: transform of L x M x N real input in Fortran: Chris@19: Chris@19: type(C_PTR) :: plan Chris@19: real(C_DOUBLE), dimension(L,M,N) :: in Chris@19: complex(C_DOUBLE_COMPLEX), dimension(L/2+1,M,N) :: out Chris@19: plan = fftw_plan_dft_r2c_3d(N,M,L, in,out, FFTW_ESTIMATE) Chris@19: ... Chris@19: call fftw_execute_dft_r2c(plan, in, out) Chris@19: Chris@19: Alternatively, for an in-place r2c transform, as described in the C Chris@19: documentation we must _pad_ the _first_ dimension of the real input Chris@19: with an extra two entries (which are ignored by FFTW) so as to leave Chris@19: enough space for the complex output. The input is _allocated_ as a Chris@19: 2[L/2+1] x M x N array, even though only L x M x N of it is actually Chris@19: used. In this example, we will allocate the array as a pointer type, Chris@19: using `fftw_alloc' to ensure aligned memory for maximum performance Chris@19: (*note Allocating aligned memory in Fortran::); this also makes it easy Chris@19: to reference the same memory as both a real array and a complex array. 
Chris@19: Chris@19: real(C_DOUBLE), pointer :: in(:,:,:) Chris@19: complex(C_DOUBLE_COMPLEX), pointer :: out(:,:,:) Chris@19: type(C_PTR) :: plan, data Chris@19: data = fftw_alloc_complex(int((L/2+1) * M * N, C_SIZE_T)) Chris@19: call c_f_pointer(data, in, [2*(L/2+1),M,N]) Chris@19: call c_f_pointer(data, out, [L/2+1,M,N]) Chris@19: plan = fftw_plan_dft_r2c_3d(N,M,L, in,out, FFTW_ESTIMATE) Chris@19: ... Chris@19: call fftw_execute_dft_r2c(plan, in, out) Chris@19: ... Chris@19: call fftw_destroy_plan(plan) Chris@19: call fftw_free(data) Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: FFTW Fortran type reference, Next: Plan execution in Fortran, Prev: Reversing array dimensions, Up: Calling FFTW from Modern Fortran Chris@19: Chris@19: 7.3 FFTW Fortran type reference Chris@19: =============================== Chris@19: Chris@19: The following are the most important type correspondences between the C Chris@19: interface and Fortran: Chris@19: Chris@19: * Plans (`fftw_plan' and variants) are `type(C_PTR)' (i.e. an opaque Chris@19: pointer). Chris@19: Chris@19: * The C floating-point types `double', `float', and `long double' Chris@19: correspond to `real(C_DOUBLE)', `real(C_FLOAT)', and Chris@19: `real(C_LONG_DOUBLE)', respectively. The C complex types Chris@19: `fftw_complex', `fftwf_complex', and `fftwl_complex' correspond in Chris@19: Fortran to `complex(C_DOUBLE_COMPLEX)', Chris@19: `complex(C_FLOAT_COMPLEX)', and `complex(C_LONG_DOUBLE_COMPLEX)', Chris@19: respectively. Just as in C (*note Precision::), the FFTW Chris@19: subroutines and types are prefixed with `fftw_', `fftwf_', and Chris@19: `fftwl_' for the different precisions, and link to different Chris@19: libraries (`-lfftw3', `-lfftw3f', and `-lfftw3l' on Unix), but use Chris@19: the _same_ include file `fftw3.f03' and the _same_ constants (all Chris@19: of which begin with `FFTW_'). 
The exception is `long double' Chris@19: precision, for which you should _also_ include `fftw3l.f03' (*note Chris@19: Extended and quadruple precision in Fortran::). Chris@19: Chris@19: * The C integer types `int' and `unsigned' (used for planner flags) Chris@19: become `integer(C_INT)'. The C integer type `ptrdiff_t' (e.g. in Chris@19: the *note 64-bit Guru Interface::) becomes `integer(C_INTPTR_T)', Chris@19: and `size_t' (in `fftw_malloc' etc.) becomes `integer(C_SIZE_T)'. Chris@19: Chris@19: * The `fftw_r2r_kind' type (*note Real-to-Real Transform Kinds::) Chris@19: becomes `integer(C_FFTW_R2R_KIND)'. The various constant values Chris@19: of the C enumerated type (`FFTW_R2HC' etc.) become simply integer Chris@19: constants of the same names in Fortran. Chris@19: Chris@19: * Numeric array pointer arguments (e.g. `double *') become Chris@19: `dimension(*), intent(out)' arrays of the same type, or Chris@19: `dimension(*), intent(in)' if they are pointers to constant data Chris@19: (e.g. `const int *'). There are a few exceptions where numeric Chris@19: pointers refer to scalar outputs (e.g. for `fftw_flops'), in which Chris@19: case they are `intent(out)' scalar arguments in Fortran too. For Chris@19: the new-array execute functions (*note New-array Execute Chris@19: Functions::), the input arrays are declared `dimension(*), Chris@19: intent(inout)', since they can be modified in the case of in-place Chris@19: or `FFTW_DESTROY_INPUT' transforms. Chris@19: Chris@19: * Pointer _return_ values (e.g. `double *') become `type(C_PTR)'. Chris@19: (If they are pointers to arrays, as for `fftw_alloc_real', you can Chris@19: convert them back to Fortran array pointers with the standard Chris@19: intrinsic function `c_f_pointer'.)
Chris@19: Chris@19: * The `fftw_iodim' type in the guru interface (*note Guru vector and Chris@19: transform sizes::) becomes `type(fftw_iodim)' in Fortran, a Chris@19: derived data type (the Fortran analogue of C's `struct') with Chris@19: three `integer(C_INT)' components: `n', `is', and `os', with the Chris@19: same meanings as in C. The `fftw_iodim64' type in the 64-bit guru Chris@19: interface (*note 64-bit Guru Interface::) is the same, except that Chris@19: its components are of type `integer(C_INTPTR_T)'. Chris@19: Chris@19: * Using the wisdom import/export functions from Fortran is a bit Chris@19: tricky, and is discussed in *note Accessing the wisdom API from Chris@19: Fortran::. In brief, the `FILE *' arguments map to `type(C_PTR)', Chris@19: `const char *' to `character(C_CHAR), dimension(*), intent(in)' Chris@19: (null-terminated!), and the generic read-char/write-char functions Chris@19: map to `type(C_FUNPTR)'. Chris@19: Chris@19: Chris@19: You may be wondering if you need to search-and-replace Chris@19: `real(kind(0.0d0))' (or whatever your favorite Fortran spelling of Chris@19: "double precision" is) with `real(C_DOUBLE)' everywhere in your Chris@19: program, and similarly for `complex' and `integer' types. The answer Chris@19: is no; you can still use your existing types. As long as these types Chris@19: match their C counterparts, things should work without a hitch. The Chris@19: worst that can happen, e.g. in the (unlikely) event of a system where Chris@19: `real(kind(0.0d0))' is different from `real(C_DOUBLE)', is that the Chris@19: compiler will give you a type-mismatch error. That is, if you don't Chris@19: use the `iso_c_binding' kinds you need to accept at least the Chris@19: theoretical possibility of having to change your code in response to Chris@19: compiler errors on some future machine, but you don't need to worry Chris@19: about silently compiling incorrect code that yields runtime errors. 
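
If you want to know in advance whether your existing declarations match the `iso_c_binding' kinds on a given compiler, a short check like the following sketch can be compiled and run. It uses only standard intrinsics; the particular native spellings compared here are just examples, and the results depend on the compiler and platform.

```fortran
program check_kinds
  use, intrinsic :: iso_c_binding
  implicit none
  ! Each line compares the kind of a native literal with the
  ! corresponding C-interoperable kind and prints T or F.
  print *, 'double precision == C_DOUBLE         : ', &
       kind(0.0d0) == C_DOUBLE
  print *, 'default real     == C_FLOAT          : ', &
       kind(0.0) == C_FLOAT
  print *, 'default integer  == C_INT            : ', &
       kind(0) == C_INT
  print *, 'double complex   == C_DOUBLE_COMPLEX : ', &
       kind((0.0d0, 0.0d0)) == C_DOUBLE_COMPLEX
end program check_kinds
```

A line printing `F' indicates a declaration that the compiler would reject with a type-mismatch error when passed to the FFTW interfaces.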
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Plan execution in Fortran, Next: Allocating aligned memory in Fortran, Prev: FFTW Fortran type reference, Up: Calling FFTW from Modern Fortran Chris@19: Chris@19: 7.4 Plan execution in Fortran Chris@19: ============================= Chris@19: Chris@19: In C, in order to use a plan, one normally calls `fftw_execute', which Chris@19: executes the plan to perform the transform on the input/output arrays Chris@19: passed when the plan was created (*note Using Plans::). The Chris@19: corresponding subroutine call in modern Fortran is: Chris@19: call fftw_execute(plan) Chris@19: Chris@19: However, we have had reports that this causes problems with some Chris@19: recent optimizing Fortran compilers. The problem is, because the Chris@19: input/output arrays are not passed as explicit arguments to Chris@19: `fftw_execute', the semantics of Fortran (unlike C) allow the compiler Chris@19: to assume that the input/output arrays are not changed by Chris@19: `fftw_execute'. As a consequence, certain compilers end up Chris@19: repositioning the call to `fftw_execute', assuming incorrectly that it Chris@19: does nothing to the arrays. Chris@19: Chris@19: There are various workarounds to this, but the safest and simplest Chris@19: thing is to not use `fftw_execute' in Fortran. Instead, use the Chris@19: functions described in *note New-array Execute Functions::, which take Chris@19: the input/output arrays as explicit arguments. For example, if the Chris@19: plan is for a complex-data DFT and was created for the arrays `in' and Chris@19: `out', you would do: Chris@19: call fftw_execute_dft(plan, in, out) Chris@19: Chris@19: There are a few things to be careful of, however: Chris@19: Chris@19: * You must use the correct type of execute function, matching the way Chris@19: the plan was created. 
Complex DFT plans should use Chris@19: `fftw_execute_dft', real-input (r2c) DFT plans should use Chris@19: `fftw_execute_dft_r2c', and real-output (c2r) DFT plans should use Chris@19: `fftw_execute_dft_c2r'. The various r2r plans should use Chris@19: `fftw_execute_r2r'. Fortunately, if you use the wrong one you Chris@19: will get a compile-time type-mismatch error (unlike legacy Chris@19: Fortran). Chris@19: Chris@19: * You should normally pass the same input/output arrays that were Chris@19: used when creating the plan. This is always safe. Chris@19: Chris@19: * _If_ you pass _different_ input/output arrays compared to those Chris@19: used when creating the plan, you must abide by all the Chris@19: restrictions of the new-array execute functions (*note New-array Chris@19: Execute Functions::). The most tricky of these is the requirement Chris@19: that the new arrays have the same alignment as the original Chris@19: arrays; the best (and possibly only) way to guarantee this is to Chris@19: use the `fftw_alloc' functions to allocate your arrays (*note Chris@19: Allocating aligned memory in Fortran::). Alternatively, you can Chris@19: use the `FFTW_UNALIGNED' flag when creating the plan, in which Chris@19: case the plan does not depend on the alignment, but this may Chris@19: sacrifice substantial performance on architectures (like x86) with Chris@19: SIMD instructions (*note SIMD alignment and fftw_malloc::). Chris@19: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Allocating aligned memory in Fortran, Next: Accessing the wisdom API from Fortran, Prev: Plan execution in Fortran, Up: Calling FFTW from Modern Fortran Chris@19: Chris@19: 7.5 Allocating aligned memory in Fortran Chris@19: ======================================== Chris@19: Chris@19: In order to obtain maximum performance in FFTW, you should store your Chris@19: data in arrays that have been specially aligned in memory (*note SIMD Chris@19: alignment and fftw_malloc::).
Enforcing alignment also permits you to Chris@19: safely use the new-array execute functions (*note New-array Execute Chris@19: Functions::) to apply a given plan to more than one pair of in/out Chris@19: arrays. Unfortunately, standard Fortran arrays do _not_ provide any Chris@19: alignment guarantees. The _only_ way to allocate aligned memory in Chris@19: standard Fortran is to allocate it with an external C function, like Chris@19: the `fftw_alloc_real' and `fftw_alloc_complex' functions. Fortunately, Chris@19: Fortran 2003 provides a simple way to associate such allocated memory Chris@19: with a standard Fortran array pointer that you can then use normally. Chris@19: Chris@19: We therefore recommend allocating all your input/output arrays using Chris@19: the following technique: Chris@19: Chris@19: 1. Declare a `pointer', `arr', to your array of the desired type and Chris@19: dimensions. For example, `real(C_DOUBLE), pointer :: arr(:,:)' for Chris@19: a 2d real array, or `complex(C_DOUBLE_COMPLEX), pointer :: Chris@19: arr(:,:,:)' for a 3d complex array. Chris@19: Chris@19: 2. The number of elements to allocate must be an `integer(C_SIZE_T)'. Chris@19: You can either declare a variable of this type, e.g. Chris@19: `integer(C_SIZE_T) :: sz', to store the number of elements to Chris@19: allocate, or you can use the `int(..., C_SIZE_T)' intrinsic Chris@19: function. For example, set `sz = L * M * N' or use `int(L * M * N, Chris@19: C_SIZE_T)' for an L x M x N array. Chris@19: Chris@19: 3. Declare a `type(C_PTR) :: p' to hold the return value from FFTW's Chris@19: allocation routine. Set `p = fftw_alloc_real(sz)' for a real Chris@19: array, or `p = fftw_alloc_complex(sz)' for a complex array. Chris@19: Chris@19: 4.
Associate your pointer `arr' with the allocated memory `p' using Chris@19: the standard `c_f_pointer' subroutine: `call c_f_pointer(p, arr, Chris@19: [...dimensions...])', where `[...dimensions...]' is an array of Chris@19: the dimensions of the array (in the usual Fortran order). For example, Chris@19: `call c_f_pointer(p, arr, [L,M,N])' for an L x M x N array. Chris@19: (Alternatively, you can omit the dimensions argument if you Chris@19: specified the shape explicitly when declaring `arr'.) You can now Chris@19: use `arr' as a usual multidimensional array. Chris@19: Chris@19: 5. When you are done using the array, deallocate the memory by `call Chris@19: fftw_free(p)' on `p'. Chris@19: Chris@19: Chris@19: For example, here is how we would allocate an L x M 2d real array: Chris@19: Chris@19: real(C_DOUBLE), pointer :: arr(:,:) Chris@19: type(C_PTR) :: p Chris@19: p = fftw_alloc_real(int(L * M, C_SIZE_T)) Chris@19: call c_f_pointer(p, arr, [L,M]) Chris@19: _...use arr and arr(i,j) as usual..._ Chris@19: call fftw_free(p) Chris@19: Chris@19: and here is an L x M x N 3d complex array: Chris@19: Chris@19: complex(C_DOUBLE_COMPLEX), pointer :: arr(:,:,:) Chris@19: type(C_PTR) :: p Chris@19: p = fftw_alloc_complex(int(L * M * N, C_SIZE_T)) Chris@19: call c_f_pointer(p, arr, [L,M,N]) Chris@19: _...use arr and arr(i,j,k) as usual..._ Chris@19: call fftw_free(p) Chris@19: Chris@19: See *note Reversing array dimensions:: for an example allocating a Chris@19: single array and associating both real and complex array pointers with Chris@19: it, for in-place real-to-complex transforms.
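
To experiment with the allocate/associate/free pattern without linking to FFTW, the following sketch substitutes the C library's `malloc' and `free' for `fftw_alloc_real' and `fftw_free'. That substitution is an assumption made purely so the program is self-contained: `malloc' does not promise the SIMD alignment that the FFTW allocators provide. The local names `c_malloc' and `c_free' and the sizes are hypothetical.

```fortran
program c_pointer_pattern
  use, intrinsic :: iso_c_binding
  implicit none
  interface
    ! Stand-ins for fftw_alloc_real / fftw_free (no alignment guarantee).
    type(C_PTR) function c_malloc(nbytes) bind(C, name='malloc')
      import :: C_PTR, C_SIZE_T
      integer(C_SIZE_T), value :: nbytes
    end function c_malloc
    subroutine c_free(p) bind(C, name='free')
      import :: C_PTR
      type(C_PTR), value :: p
    end subroutine c_free
  end interface
  integer, parameter :: L = 4, M = 3          ! illustrative sizes
  real(C_DOUBLE), pointer :: arr(:,:)
  type(C_PTR) :: p
  integer(C_SIZE_T) :: sz

  ! Convert each factor before multiplying, so the product is formed
  ! in C_SIZE_T arithmetic rather than in default-integer arithmetic.
  sz = int(L, C_SIZE_T) * int(M, C_SIZE_T)

  p = c_malloc(sz * c_sizeof(1.0_C_DOUBLE))
  if (.not. c_associated(p)) stop 'allocation failed'

  ! Associate the Fortran array pointer with the C memory.
  call c_f_pointer(p, arr, [L, M])
  arr = 0.0_C_DOUBLE
  arr(2,3) = 1.5_C_DOUBLE
  print *, 'shape:', shape(arr), ' sum:', sum(arr)

  ! Drop the Fortran association before releasing the C memory.
  nullify(arr)
  call c_free(p)
end program c_pointer_pattern
```

Converting each dimension to `C_SIZE_T' before multiplying is a small precaution for large arrays, where a product formed in default-integer arithmetic could overflow.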
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Accessing the wisdom API from Fortran, Next: Defining an FFTW module, Prev: Allocating aligned memory in Fortran, Up: Calling FFTW from Modern Fortran Chris@19: Chris@19: 7.6 Accessing the wisdom API from Fortran Chris@19: ========================================= Chris@19: Chris@19: As explained in *note Words of Wisdom-Saving Plans::, FFTW provides a Chris@19: "wisdom" API for saving plans to disk so that they can be recreated Chris@19: quickly. The C API for exporting (*note Wisdom Export::) and importing Chris@19: (*note Wisdom Import::) wisdom is somewhat tricky to use from Fortran, Chris@19: however, because of differences in file I/O and string types between C Chris@19: and Fortran. Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * Wisdom File Export/Import from Fortran:: Chris@19: * Wisdom String Export/Import from Fortran:: Chris@19: * Wisdom Generic Export/Import from Fortran:: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Wisdom File Export/Import from Fortran, Next: Wisdom String Export/Import from Fortran, Prev: Accessing the wisdom API from Fortran, Up: Accessing the wisdom API from Fortran Chris@19: Chris@19: 7.6.1 Wisdom File Export/Import from Fortran Chris@19: -------------------------------------------- Chris@19: Chris@19: The easiest way to export and import wisdom is to do so using Chris@19: `fftw_export_wisdom_to_filename' and `fftw_import_wisdom_from_filename'. The Chris@19: only trick is that these require you to pass a C string, which is an Chris@19: array of type `CHARACTER(C_CHAR)' that is terminated by `C_NULL_CHAR'. Chris@19: You can call them like this: Chris@19: Chris@19: integer(C_INT) :: ret Chris@19: ret = fftw_export_wisdom_to_filename(C_CHAR_'my_wisdom.dat' // C_NULL_CHAR) Chris@19: if (ret .eq. 0) stop 'error exporting wisdom to file' Chris@19: ret = fftw_import_wisdom_from_filename(C_CHAR_'my_wisdom.dat' // C_NULL_CHAR) Chris@19: if (ret .eq.
0) stop 'error importing wisdom from file' Chris@19: Chris@19: Note that prepending `C_CHAR_' is needed to specify that the literal Chris@19: string is of kind `C_CHAR', and we null-terminate the string by Chris@19: appending `// C_NULL_CHAR'. These functions return an `integer(C_INT)' Chris@19: (`ret') which is `0' if an error occurred during export/import and Chris@19: nonzero otherwise. Chris@19: Chris@19: It is also possible to use the lower-level routines Chris@19: `fftw_export_wisdom_to_file' and `fftw_import_wisdom_from_file', which Chris@19: accept parameters of the C type `FILE*', expressed in Fortran as Chris@19: `type(C_PTR)'. However, you are then responsible for creating the Chris@19: `FILE*' yourself. You can do this by using `iso_c_binding' to define Chris@19: Fortran interfaces for the C library functions `fopen' and `fclose', Chris@19: which is a bit strange in Fortran but workable. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Wisdom String Export/Import from Fortran, Next: Wisdom Generic Export/Import from Fortran, Prev: Wisdom File Export/Import from Fortran, Up: Accessing the wisdom API from Fortran Chris@19: Chris@19: 7.6.2 Wisdom String Export/Import from Fortran Chris@19: ---------------------------------------------- Chris@19: Chris@19: Dealing with FFTW's C string export/import is a bit more painful. In Chris@19: particular, the `fftw_export_wisdom_to_string' function requires you to Chris@19: deal with a dynamically allocated C string.
To get its length, you Chris@19: must define an interface to the C `strlen' function, and to deallocate Chris@19: it you must define an interface to C `free': Chris@19: Chris@19: use, intrinsic :: iso_c_binding Chris@19: interface Chris@19: integer(C_SIZE_T) function strlen(s) bind(C, name='strlen') Chris@19: import Chris@19: type(C_PTR), value :: s Chris@19: end function strlen Chris@19: subroutine free(p) bind(C, name='free') Chris@19: import Chris@19: type(C_PTR), value :: p Chris@19: end subroutine free Chris@19: end interface Chris@19: Chris@19: Given these definitions, you can then export wisdom to a Fortran Chris@19: character array: Chris@19: Chris@19: character(C_CHAR), pointer :: s(:) Chris@19: integer(C_SIZE_T) :: slen Chris@19: type(C_PTR) :: p Chris@19: p = fftw_export_wisdom_to_string() Chris@19: if (.not. c_associated(p)) stop 'error exporting wisdom' Chris@19: slen = strlen(p) Chris@19: call c_f_pointer(p, s, [slen+1]) Chris@19: ... Chris@19: call free(p) Chris@19: Chris@19: Note that `slen' is the length of the C string, but the length of Chris@19: the array is `slen+1' because it includes the terminating null Chris@19: character. (You can omit the `+1' if you don't want Fortran to know Chris@19: about the null character.) The standard `c_associated' function checks Chris@19: whether `p' is a null pointer, which is returned by Chris@19: `fftw_export_wisdom_to_string' if there was an error. Chris@19: Chris@19: To import wisdom from a string, use `fftw_import_wisdom_from_string' Chris@19: as usual; note that the argument of this function must be a Chris@19: `character(C_CHAR)' that is terminated by the `C_NULL_CHAR' character, Chris@19: like the `s' array above.
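
When the string you want to pass is held in an ordinary blank-padded Fortran variable rather than a literal, the null terminator has to be added at run time. The sketch below shows one way to do that using only standard Fortran; the module name `c_strings' and the helpers `to_c_string' and `c_length' are hypothetical, not part of FFTW or of `iso_c_binding'.

```fortran
module c_strings
  use, intrinsic :: iso_c_binding
  implicit none
contains
  ! Return s without trailing blanks and with a terminating null.
  function to_c_string(s) result(cs)
    character(kind=C_CHAR, len=*), intent(in) :: s
    character(kind=C_CHAR, len=:), allocatable :: cs
    cs = trim(s) // C_NULL_CHAR
  end function to_c_string

  ! Number of characters before the first null (like C strlen);
  ! returns len(cs) if no null character is present.
  integer function c_length(cs)
    character(kind=C_CHAR, len=*), intent(in) :: cs
    c_length = index(cs, C_NULL_CHAR) - 1
    if (c_length < 0) c_length = len(cs)
  end function c_length
end module c_strings

program c_string_demo
  use c_strings
  implicit none
  character(kind=C_CHAR, len=32) :: fname
  character(kind=C_CHAR, len=:), allocatable :: cname

  fname = C_CHAR_'my_wisdom.dat'      ! blank-padded to 32 characters
  cname = to_c_string(fname)

  ! cname is one character longer than the visible text, because it
  ! carries the terminating null expected by the C routines.
  print *, 'array length:', len(cname), ' C string length:', c_length(cname)
end program c_string_demo
```

A string prepared this way has the same form as `C_CHAR_'my_wisdom.dat' // C_NULL_CHAR' in the examples of this section.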
Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Wisdom Generic Export/Import from Fortran, Prev: Wisdom String Export/Import from Fortran, Up: Accessing the wisdom API from Fortran Chris@19: Chris@19: 7.6.3 Wisdom Generic Export/Import from Fortran Chris@19: ----------------------------------------------- Chris@19: Chris@19: The most generic wisdom export/import functions allow you to provide an Chris@19: arbitrary callback function to read/write one character at a time in Chris@19: any way you want. However, your callback function must be written in a Chris@19: special way, using the `bind(C)' attribute to be passed to a C Chris@19: interface. Chris@19: Chris@19: In particular, to call the generic wisdom export function Chris@19: `fftw_export_wisdom', you would write a callback subroutine of the form: Chris@19: Chris@19: subroutine my_write_char(c, p) bind(C) Chris@19: use, intrinsic :: iso_c_binding Chris@19: character(C_CHAR), value :: c Chris@19: type(C_PTR), value :: p Chris@19: _...write c..._ Chris@19: end subroutine my_write_char Chris@19: Chris@19: Given such a subroutine (along with the corresponding interface Chris@19: definition), you could then export wisdom using: Chris@19: Chris@19: call fftw_export_wisdom(c_funloc(my_write_char), p) Chris@19: Chris@19: The standard `c_funloc' intrinsic converts a Fortran `bind(C)' Chris@19: subroutine into a C function pointer. The parameter `p' is a Chris@19: `type(C_PTR)' to any arbitrary data that you want to pass to Chris@19: `my_write_char' (or `C_NULL_PTR' if none). (Note that you can get a C Chris@19: pointer to Fortran data using the intrinsic `c_loc', and convert it Chris@19: back to a Fortran pointer in `my_write_char' using `c_f_pointer'.) 
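
The way user data travels through the `p' argument can be tried out without FFTW. In this sketch the callback appends each character to a buffer reached through `p'; the derived type `char_buffer' and the module name are hypothetical, and the loop in the main program merely stands in for FFTW invoking the callback once per character.

```fortran
module wisdom_sink
  use, intrinsic :: iso_c_binding
  implicit none
  ! Hypothetical user-data type: a fixed buffer plus a fill count.
  type :: char_buffer
    character(kind=C_CHAR) :: text(256)
    integer :: used = 0
  end type char_buffer
contains
  ! Callback with the signature described above: append c to the
  ! buffer that p points to, silently dropping overflow.
  subroutine my_write_char(c, p) bind(C)
    character(kind=C_CHAR), value :: c
    type(C_PTR), value :: p
    type(char_buffer), pointer :: buf
    call c_f_pointer(p, buf)
    if (buf%used < size(buf%text)) then
      buf%used = buf%used + 1
      buf%text(buf%used) = c
    end if
  end subroutine my_write_char
end module wisdom_sink

program callback_demo
  use wisdom_sink
  implicit none
  type(char_buffer), target :: buf
  type(C_PTR) :: p
  type(C_FUNPTR) :: fp
  character(kind=C_CHAR, len=*), parameter :: msg = C_CHAR_'abc'
  integer :: i

  p  = c_loc(buf)                 ! C pointer to the Fortran buffer
  fp = c_funloc(my_write_char)    ! what would be handed to the C side

  ! Stand-in for the C library calling the callback per character.
  do i = 1, len(msg)
    call my_write_char(msg(i:i), p)
  end do

  print *, 'function pointer set: ', c_associated(fp)
  print *, 'characters stored:    ', buf%used
  print *, 'text:                 ', buf%text(1:buf%used)
end program callback_demo
```

In real use, `fp' and `p' would be the two arguments of `fftw_export_wisdom' in place of the loop.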
Chris@19: Chris@19: Similarly, to use the generic `fftw_import_wisdom', you would define Chris@19: a callback function of the form: Chris@19: Chris@19: integer(C_INT) function my_read_char(p) bind(C) Chris@19: use, intrinsic :: iso_c_binding Chris@19: type(C_PTR), value :: p Chris@19: character :: c Chris@19: _...read a character c..._ Chris@19: my_read_char = ichar(c, C_INT) Chris@19: end function my_read_char Chris@19: Chris@19: .... Chris@19: Chris@19: integer(C_INT) :: ret Chris@19: ret = fftw_import_wisdom(c_funloc(my_read_char), p) Chris@19: if (ret .eq. 0) stop 'error importing wisdom' Chris@19: Chris@19: Your function can return `-1' if the end of the input is reached. Chris@19: Again, `p' is an arbitrary `type(C_PTR)' that is passed through to your Chris@19: function. `fftw_import_wisdom' returns `0' if an error occurred and Chris@19: nonzero otherwise. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Defining an FFTW module, Prev: Accessing the wisdom API from Fortran, Up: Calling FFTW from Modern Fortran Chris@19: Chris@19: 7.7 Defining an FFTW module Chris@19: =========================== Chris@19: Chris@19: Rather than using the `include' statement to include the `fftw3.f03' Chris@19: interface file in any subroutine where you want to use FFTW, you might Chris@19: prefer to define an FFTW Fortran module. FFTW does not install itself Chris@19: as a module, primarily because `fftw3.f03' can be shared between Chris@19: different Fortran compilers while modules (in general) cannot. Chris@19: However, it is trivial to define your own FFTW module if you want. Chris@19: Just create a file containing: Chris@19: Chris@19: module FFTW3 Chris@19: use, intrinsic :: iso_c_binding Chris@19: include 'fftw3.f03' Chris@19: end module Chris@19: Chris@19: Compile this file into a module as usual for your compiler (e.g. with Chris@19: `gfortran -c' you will get a file `fftw3.mod').
Now, instead of Chris@19: `include 'fftw3.f03'', whenever you want to use FFTW routines you can Chris@19: just do: Chris@19: Chris@19: use FFTW3 Chris@19: Chris@19: as usual for Fortran modules. (You still need to link to the FFTW Chris@19: library, of course.) Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Calling FFTW from Legacy Fortran, Next: Upgrading from FFTW version 2, Prev: Calling FFTW from Modern Fortran, Up: Top Chris@19: Chris@19: 8 Calling FFTW from Legacy Fortran Chris@19: ********************************** Chris@19: Chris@19: This chapter describes the interface to FFTW callable by Fortran code Chris@19: in older compilers not supporting the Fortran 2003 C interoperability Chris@19: features (*note Calling FFTW from Modern Fortran::). This interface Chris@19: has the major disadvantage that it is not type-checked, so if you Chris@19: mistake the argument types or ordering then your program will not have Chris@19: any compiler errors, and will likely crash at runtime. So, greater Chris@19: care is needed. Also, technically interfacing older Fortran versions Chris@19: to C is nonstandard, but in practice we have found that the techniques Chris@19: used in this chapter have worked with all known Fortran compilers for Chris@19: many years. Chris@19: Chris@19: The legacy Fortran interface differs from the C interface only in the Chris@19: prefix (`dfftw_' instead of `fftw_' in double precision) and a few Chris@19: other minor details. This Fortran interface is included in the FFTW Chris@19: libraries by default, unless a Fortran compiler isn't found on your Chris@19: system or `--disable-fortran' is included in the `configure' flags. We Chris@19: assume here that the reader is already familiar with the usage of FFTW Chris@19: in C, as described elsewhere in this manual. Chris@19: Chris@19: The MPI parallel interface to FFTW is _not_ currently available to Chris@19: legacy Fortran. 
Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * Fortran-interface routines:: Chris@19: * FFTW Constants in Fortran:: Chris@19: * FFTW Execution in Fortran:: Chris@19: * Fortran Examples:: Chris@19: * Wisdom of Fortran?:: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Fortran-interface routines, Next: FFTW Constants in Fortran, Prev: Calling FFTW from Legacy Fortran, Up: Calling FFTW from Legacy Fortran Chris@19: Chris@19: 8.1 Fortran-interface routines Chris@19: ============================== Chris@19: Chris@19: Nearly all of the FFTW functions have Fortran-callable equivalents. Chris@19: The name of the legacy Fortran routine is the same as that of the Chris@19: corresponding C routine, but with the `fftw_' prefix replaced by Chris@19: `dfftw_'.(1) The single and long-double precision versions use Chris@19: `sfftw_' and `lfftw_', respectively, instead of `fftwf_' and `fftwl_'; Chris@19: quadruple precision (`real*16') is available on some systems as Chris@19: `fftwq_' (*note Precision::). (Note that `long double' on x86 hardware Chris@19: is usually at most 80-bit extended precision, _not_ quadruple Chris@19: precision.) Chris@19: Chris@19: For the most part, all of the arguments to the functions are the Chris@19: same, with the following exceptions: Chris@19: Chris@19: * `plan' variables (what would be of type `fftw_plan' in C), must be Chris@19: declared as a type that is at least as big as a pointer (address) Chris@19: on your machine. We recommend using `integer*8' everywhere, since Chris@19: this should always be big enough. Chris@19: Chris@19: * Any function that returns a value (e.g. `fftw_plan_dft') is Chris@19: converted into a _subroutine_. 
The return value is converted into Chris@19: an additional _first_ parameter of this subroutine.(2) Chris@19: Chris@19: * The Fortran routines expect multi-dimensional arrays to be in Chris@19: _column-major_ order, which is the ordinary format of Fortran Chris@19: arrays (*note Multi-dimensional Array Format::). They do this Chris@19: transparently and costlessly simply by reversing the order of the Chris@19: dimensions passed to FFTW, but this has one important consequence Chris@19: for multi-dimensional real-complex transforms, discussed below. Chris@19: Chris@19: * Wisdom import and export is somewhat more tricky because one cannot Chris@19: easily pass files or strings between C and Fortran; see *note Chris@19: Wisdom of Fortran?::. Chris@19: Chris@19: * Legacy Fortran cannot use the `fftw_malloc' dynamic-allocation Chris@19: routine. If you want to exploit the SIMD FFTW (*note SIMD Chris@19: alignment and fftw_malloc::), you'll need to figure out some other Chris@19: way to ensure that your arrays are at least 16-byte aligned. Chris@19: Chris@19: * Since Fortran 77 does not have data structures, the `fftw_iodim' Chris@19: structure from the guru interface (*note Guru vector and transform Chris@19: sizes::) must be split into separate arguments. In particular, any Chris@19: `fftw_iodim' array arguments in the C guru interface become three Chris@19: integer array arguments (`n', `is', and `os') in the Fortran guru Chris@19: interface, all of whose lengths should be equal to the Chris@19: corresponding `rank' argument. Chris@19: Chris@19: * The guru planner interface in Fortran does _not_ do any automatic Chris@19: translation between column-major and row-major; you are responsible Chris@19: for setting the strides etcetera to correspond to your Fortran Chris@19: arrays. 
However, as a slight bug that we are preserving for Chris@19: backwards compatibility, the `plan_guru_r2r' in Fortran _does_ Chris@19: reverse the order of its `kind' array parameter, so the `kind' Chris@19: array of that routine should be in the reverse of the order of the Chris@19: iodim arrays (see above). Chris@19: Chris@19: Chris@19: In general, you should take care to use Fortran data types that Chris@19: correspond to (i.e. are the same size as) the C types used by FFTW. In Chris@19: practice, this correspondence is usually straightforward (i.e. Chris@19: `integer' corresponds to `int', `real' corresponds to `float', Chris@19: etcetera). The native Fortran double/single-precision complex type Chris@19: should be compatible with `fftw_complex'/`fftwf_complex'. Such simple Chris@19: correspondences are assumed in the examples below. Chris@19: Chris@19: ---------- Footnotes ---------- Chris@19: Chris@19: (1) Technically, Fortran 77 identifiers are not allowed to have more Chris@19: than 6 characters, nor may they contain underscores. Any compiler that Chris@19: enforces this limitation doesn't deserve to link to FFTW. Chris@19: Chris@19: (2) The reason for this is that some Fortran implementations seem to Chris@19: have trouble with C function return values, and vice versa. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: FFTW Constants in Fortran, Next: FFTW Execution in Fortran, Prev: Fortran-interface routines, Up: Calling FFTW from Legacy Fortran Chris@19: Chris@19: 8.2 FFTW Constants in Fortran Chris@19: ============================= Chris@19: Chris@19: When creating plans in FFTW, a number of constants are used to specify Chris@19: options, such as `FFTW_MEASURE' or `FFTW_ESTIMATE'. The same constants Chris@19: must be used with the wrapper routines, but of course the C header Chris@19: files where the constants are defined can't be incorporated directly Chris@19: into Fortran code. 
Chris@19: Chris@19: Instead, we have placed Fortran equivalents of the FFTW constant Chris@19: definitions in the file `fftw3.f', which can be found in the same Chris@19: directory as `fftw3.h'. If your Fortran compiler supports a Chris@19: preprocessor of some sort, you should be able to `include' or Chris@19: `#include' this file; otherwise, you can paste it directly into your Chris@19: code. Chris@19: Chris@19: In C, you combine different flags (like `FFTW_PRESERVE_INPUT' and Chris@19: `FFTW_MEASURE') using the ``|'' operator; in Fortran you should just Chris@19: use ``+''. (Take care not to add in the same flag more than once, Chris@19: though. Alternatively, you can use the `ior' intrinsic function Chris@19: standardized in Fortran 95.) Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: FFTW Execution in Fortran, Next: Fortran Examples, Prev: FFTW Constants in Fortran, Up: Calling FFTW from Legacy Fortran Chris@19: Chris@19: 8.3 FFTW Execution in Fortran Chris@19: ============================= Chris@19: Chris@19: In C, in order to use a plan, one normally calls `fftw_execute', which Chris@19: executes the plan to perform the transform on the input/output arrays Chris@19: passed when the plan was created (*note Using Plans::). The Chris@19: corresponding subroutine call in legacy Fortran is: Chris@19: call dfftw_execute(plan) Chris@19: Chris@19: However, we have had reports that this causes problems with some Chris@19: recent optimizing Fortran compilers. The problem is, because the Chris@19: input/output arrays are not passed as explicit arguments to Chris@19: `dfftw_execute', the semantics of Fortran (unlike C) allow the compiler Chris@19: to assume that the input/output arrays are not changed by Chris@19: `dfftw_execute'. As a consequence, certain compilers end up optimizing Chris@19: out or repositioning the call to `dfftw_execute', assuming incorrectly Chris@19: that it does nothing. 
Chris@19: Chris@19: There are various workarounds to this, but the safest and simplest Chris@19: thing is to not use `dfftw_execute' in Fortran. Instead, use the Chris@19: functions described in *note New-array Execute Functions::, which take Chris@19: the input/output arrays as explicit arguments. For example, if the Chris@19: plan is for a complex-data DFT and was created for the arrays `in' and Chris@19: `out', you would do: Chris@19: call dfftw_execute_dft(plan, in, out) Chris@19: Chris@19: There are a few things to be careful of, however: Chris@19: Chris@19: * You must use the correct type of execute function, matching the way Chris@19: the plan was created. Complex DFT plans should use Chris@19: `dfftw_execute_dft', real-input (r2c) DFT plans should use Chris@19: `dfftw_execute_dft_r2c', and real-output (c2r) DFT plans should Chris@19: use `dfftw_execute_dft_c2r'. The various r2r plans should use Chris@19: `dfftw_execute_r2r'. Chris@19: Chris@19: * You should normally pass the same input/output arrays that were Chris@19: used when creating the plan. This is always safe. Chris@19: Chris@19: * _If_ you pass _different_ input/output arrays compared to those Chris@19: used when creating the plan, you must abide by all the Chris@19: restrictions of the new-array execute functions (*note New-array Chris@19: Execute Functions::). The most difficult of these, in Fortran, is Chris@19: the requirement that the new arrays have the same alignment as the Chris@19: original arrays, because there seems to be no way in legacy Chris@19: Fortran to obtain guaranteed-aligned arrays (analogous to Chris@19: `fftw_malloc' in C). You can, of course, use the `FFTW_UNALIGNED' Chris@19: flag when creating the plan, in which case the plan does not Chris@19: depend on the alignment, but this may sacrifice substantial Chris@19: performance on architectures (like x86) with SIMD instructions Chris@19: (*note SIMD alignment and fftw_malloc::). 
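The alignment restriction in the last item can be checked from the calling side before a plan is re-used on different arrays. The following is a minimal sketch in C, not part of FFTW: `alignment_mod16' and `same_alignment' are hypothetical helper names, and the 16-byte boundary is an assumption taken from the alignment figure quoted earlier in this chapter (a given build may require more).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an FFTW routine): offset of a pointer from
   the previous 16-byte boundary.  The value 16 is an assumption. */
static unsigned alignment_mod16(const void *p)
{
    return (unsigned)((uintptr_t)p % 16u);
}

/* Returns 1 if `candidate' has the same offset from a 16-byte boundary
   as the array that was used when the plan was created, 0 otherwise. */
static int same_alignment(const void *planned, const void *candidate)
{
    assert(planned != NULL && candidate != NULL);
    return alignment_mod16(planned) == alignment_mod16(candidate);
}
```

If `same_alignment' reports a mismatch, the options described above still apply: pass the original arrays, or create the plan with `FFTW_UNALIGNED'.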
Chris@19: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Fortran Examples, Next: Wisdom of Fortran?, Prev: FFTW Execution in Fortran, Up: Calling FFTW from Legacy Fortran Chris@19: Chris@19: 8.4 Fortran Examples Chris@19: ==================== Chris@19: Chris@19: In C, you might have something like the following to transform a Chris@19: one-dimensional complex array: Chris@19: Chris@19: fftw_complex in[N], out[N]; Chris@19: fftw_plan plan; Chris@19: Chris@19: plan = fftw_plan_dft_1d(N,in,out,FFTW_FORWARD,FFTW_ESTIMATE); Chris@19: fftw_execute(plan); Chris@19: fftw_destroy_plan(plan); Chris@19: Chris@19: In Fortran, you would use the following to accomplish the same thing: Chris@19: Chris@19: double complex in, out Chris@19: dimension in(N), out(N) Chris@19: integer*8 plan Chris@19: Chris@19: call dfftw_plan_dft_1d(plan,N,in,out,FFTW_FORWARD,FFTW_ESTIMATE) Chris@19: call dfftw_execute_dft(plan, in, out) Chris@19: call dfftw_destroy_plan(plan) Chris@19: Chris@19: Notice how all routines are called as Fortran subroutines, and the Chris@19: plan is returned via the first argument to `dfftw_plan_dft_1d'. Notice Chris@19: also that we changed `fftw_execute' to `dfftw_execute_dft' (*note FFTW Chris@19: Execution in Fortran::). To do the same thing, but using 8 threads in Chris@19: parallel (*note Multi-threaded FFTW::), you would simply prefix these Chris@19: calls with: Chris@19: Chris@19: integer iret Chris@19: call dfftw_init_threads(iret) Chris@19: call dfftw_plan_with_nthreads(8) Chris@19: Chris@19: (You might want to check the value of `iret': if it is zero, it Chris@19: indicates an unlikely error during thread initialization.) 
Chris@19: Chris@19: To transform a three-dimensional array in-place with C, you might do: Chris@19: Chris@19: fftw_complex arr[L][M][N]; Chris@19: fftw_plan plan; Chris@19: Chris@19: plan = fftw_plan_dft_3d(L,M,N, arr,arr, Chris@19: FFTW_FORWARD, FFTW_ESTIMATE); Chris@19: fftw_execute(plan); Chris@19: fftw_destroy_plan(plan); Chris@19: Chris@19: In Fortran, you would use this instead: Chris@19: Chris@19: double complex arr Chris@19: dimension arr(L,M,N) Chris@19: integer*8 plan Chris@19: Chris@19: call dfftw_plan_dft_3d(plan, L,M,N, arr,arr, Chris@19: & FFTW_FORWARD, FFTW_ESTIMATE) Chris@19: call dfftw_execute_dft(plan, arr, arr) Chris@19: call dfftw_destroy_plan(plan) Chris@19: Chris@19: Note that we pass the array dimensions in the "natural" order in Chris@19: both C and Fortran. Chris@19: Chris@19: To transform a one-dimensional real array in Fortran, you might do: Chris@19: Chris@19: double precision in Chris@19: dimension in(N) Chris@19: double complex out Chris@19: dimension out(N/2 + 1) Chris@19: integer*8 plan Chris@19: Chris@19: call dfftw_plan_dft_r2c_1d(plan,N,in,out,FFTW_ESTIMATE) Chris@19: call dfftw_execute_dft_r2c(plan, in, out) Chris@19: call dfftw_destroy_plan(plan) Chris@19: Chris@19: To transform a two-dimensional real array, out of place, you might Chris@19: use the following: Chris@19: Chris@19: double precision in Chris@19: dimension in(M,N) Chris@19: double complex out Chris@19: dimension out(M/2 + 1, N) Chris@19: integer*8 plan Chris@19: Chris@19: call dfftw_plan_dft_r2c_2d(plan,M,N,in,out,FFTW_ESTIMATE) Chris@19: call dfftw_execute_dft_r2c(plan, in, out) Chris@19: call dfftw_destroy_plan(plan) Chris@19: Chris@19: *Important:* Notice that it is the _first_ dimension of the complex Chris@19: output array that is cut in half in Fortran, rather than the last Chris@19: dimension as in C. 
This is a consequence of the interface routines Chris@19: reversing the order of the array dimensions passed to FFTW so that the Chris@19: Fortran program can use its ordinary column-major order. Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Wisdom of Fortran?, Prev: Fortran Examples, Up: Calling FFTW from Legacy Fortran Chris@19: Chris@19: 8.5 Wisdom of Fortran? Chris@19: ====================== Chris@19: Chris@19: In this section, we discuss how one can import/export FFTW wisdom Chris@19: (saved plans) to/from a Fortran program; we assume that the reader is Chris@19: already familiar with wisdom, as described in *note Words of Chris@19: Wisdom-Saving Plans::. Chris@19: Chris@19: The basic problem is that it is difficult to (portably) pass files and Chris@19: strings between Fortran and C, so we cannot provide a direct Fortran Chris@19: equivalent to the `fftw_export_wisdom_to_file', etcetera, functions. Chris@19: Fortran interfaces _are_ provided for the functions that do not take Chris@19: file/string arguments, however: `dfftw_import_system_wisdom', Chris@19: `dfftw_import_wisdom', `dfftw_export_wisdom', and `dfftw_forget_wisdom'. Chris@19: Chris@19: So, for example, to import the system-wide wisdom, you would do: Chris@19: Chris@19: integer isuccess Chris@19: call dfftw_import_system_wisdom(isuccess) Chris@19: Chris@19: As usual, the C return value is turned into a first parameter; Chris@19: `isuccess' is non-zero on success and zero on failure (e.g. if there is Chris@19: no system wisdom installed). Chris@19: Chris@19: If you want to import/export wisdom from/to an arbitrary file or Chris@19: elsewhere, you can employ the generic `dfftw_import_wisdom' and Chris@19: `dfftw_export_wisdom' functions, for which you must supply a subroutine Chris@19: to read/write one character at a time. 
The FFTW package contains an Chris@19: example file `doc/f77_wisdom.f' demonstrating how to implement Chris@19: `import_wisdom_from_file' and `export_wisdom_to_file' subroutines in Chris@19: this way. (These routines cannot be compiled into the FFTW library Chris@19: itself, lest all FFTW-using programs be required to link with the Chris@19: Fortran I/O library.) Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Upgrading from FFTW version 2, Next: Installation and Customization, Prev: Calling FFTW from Legacy Fortran, Up: Top Chris@19: Chris@19: 9 Upgrading from FFTW version 2 Chris@19: ******************************* Chris@19: Chris@19: In this chapter, we outline the process for updating codes designed for Chris@19: the older FFTW 2 interface to work with FFTW 3. The interface for FFTW Chris@19: 3 is not backwards-compatible with the interface for FFTW 2 and earlier Chris@19: versions; codes written to use those versions will fail to link with Chris@19: FFTW 3. Nor is it possible to write "compatibility wrappers" to bridge Chris@19: the gap (at least not efficiently), because FFTW 3 has different Chris@19: semantics from previous versions. However, upgrading should be a Chris@19: straightforward process because the data formats are identical and the Chris@19: overall style of planning/execution is essentially the same. Chris@19: Chris@19: Unlike FFTW 2, there are no separate header files for real and Chris@19: complex transforms (or even for different precisions) in FFTW 3; all Chris@19: interfaces are defined in the `<fftw3.h>' header file. Chris@19: Chris@19: Numeric Types Chris@19: ============= Chris@19: Chris@19: The main difference in data types is that `fftw_complex' in FFTW 2 was Chris@19: defined as a `struct' with macros `c_re' and `c_im' for accessing the Chris@19: real/imaginary parts. (This is binary-compatible with FFTW 3 on any Chris@19: machine except perhaps for some older Crays in single precision.) 
The Chris@19: equivalent macros for FFTW 3 are: Chris@19: Chris@19: #define c_re(c) ((c)[0]) Chris@19: #define c_im(c) ((c)[1]) Chris@19: Chris@19: This does not work if you are using the C99 complex type, however, Chris@19: unless you insert a `double*' typecast into the above macros (*note Chris@19: Complex numbers::). Chris@19: Chris@19: Also, FFTW 2 had an `fftw_real' typedef that was an alias for Chris@19: `double' (in double precision). In FFTW 3 you should just use `double' Chris@19: (or whatever precision you are employing). Chris@19: Chris@19: Plans Chris@19: ===== Chris@19: Chris@19: The major difference between FFTW 2 and FFTW 3 is in the Chris@19: planning/execution division of labor. In FFTW 2, plans were found for a Chris@19: given transform size and type, and then could be applied to _any_ Chris@19: arrays and for _any_ multiplicity/stride parameters. In FFTW 3, you Chris@19: specify the particular arrays, stride parameters, etcetera when Chris@19: creating the plan, and the plan is then executed for _those_ arrays Chris@19: (unless the guru interface is used) and _those_ parameters _only_. Chris@19: (FFTW 2 had "specific planner" routines that planned for a particular Chris@19: array and stride, but the plan could still be used for other arrays and Chris@19: strides.) That is, much of the information that was formerly specified Chris@19: at execution time is now specified at planning time. Chris@19: Chris@19: Like FFTW 2's specific planner routines, the FFTW 3 planner Chris@19: overwrites the input/output arrays unless you use `FFTW_ESTIMATE'. Chris@19: Chris@19: FFTW 2 had separate data types `fftw_plan', `fftwnd_plan', Chris@19: `rfftw_plan', and `rfftwnd_plan' for complex and real one- and Chris@19: multi-dimensional transforms, and each type had its own `destroy' Chris@19: function. In FFTW 3, all plans are of type `fftw_plan' and all are Chris@19: destroyed by `fftw_destroy_plan(plan)'. 
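The compatibility macros from the Numeric Types discussion above can be tried without FFTW installed. This sketch assumes the default FFTW 3 layout, in which a complex number is an array of two doubles; `demo_complex', `c99_re', `c99_im', `norm2_array', and `norm2_c99' are illustrative names that do not appear in FFTW, and the `&' in the C99 variants is one possible reading of the `double*' typecast mentioned there.

```c
#include <assert.h>
#include <complex.h>

/* Local stand-in for the default fftw_complex (two doubles), declared
   here so that the sketch compiles without <fftw3.h>. */
typedef double demo_complex[2];

/* FFTW 2-style accessors for the array representation, as in the text. */
#define c_re(c) ((c)[0])
#define c_im(c) ((c)[1])

/* Assumed variants for C99 complex lvalues: take the address, then
   view the value as two consecutive doubles (real, imaginary). */
#define c99_re(c) (((double *)&(c))[0])
#define c99_im(c) (((double *)&(c))[1])

/* Squared magnitude through the array accessors. */
static double norm2_array(demo_complex z)
{
    return c_re(z) * c_re(z) + c_im(z) * c_im(z);
}

/* Squared magnitude through the typecast accessors. */
static double norm2_c99(double complex z)
{
    return c99_re(z) * c99_re(z) + c99_im(z) * c99_im(z);
}
```

Old FFTW 2 code that reads or assigns through `c_re(x)' and `c_im(x)' can keep doing so once macros of this kind are defined for the representation in use.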
Chris@19: Chris@19: Where you formerly used `fftw_create_plan' and `fftw_one' to plan Chris@19: and compute a single 1d transform, you would now use `fftw_plan_dft_1d' Chris@19: to plan the transform. If you used the generic `fftw' function to Chris@19: execute the transform with multiplicity (`howmany') and stride Chris@19: parameters, you would now use the advanced interface Chris@19: `fftw_plan_many_dft' to specify those parameters. The plans are now Chris@19: executed with `fftw_execute(plan)', which takes all of its parameters Chris@19: (including the input/output arrays) from the plan. Chris@19: Chris@19: In-place transforms no longer interpret their output argument as Chris@19: scratch space, nor is there an `FFTW_IN_PLACE' flag. You simply pass Chris@19: the same pointer for both the input and output arguments. (Previously, Chris@19: the output `ostride' and `odist' parameters were ignored for in-place Chris@19: transforms; now, if they are specified via the advanced interface, they Chris@19: are significant even in the in-place case, although they should Chris@19: normally equal the corresponding input parameters.) Chris@19: Chris@19: The `FFTW_ESTIMATE' and `FFTW_MEASURE' flags have the same meaning Chris@19: as before, although the planning time will differ. You may also Chris@19: consider using `FFTW_PATIENT', which is like `FFTW_MEASURE' except that Chris@19: it takes more time in order to consider a wider variety of algorithms. Chris@19: Chris@19: For multi-dimensional complex DFTs, instead of `fftwnd_create_plan' Chris@19: (or `fftw2d_create_plan' or `fftw3d_create_plan'), followed by Chris@19: `fftwnd_one', you would use `fftw_plan_dft' (or `fftw_plan_dft_2d' or Chris@19: `fftw_plan_dft_3d'), followed by `fftw_execute'. If you used `fftwnd' Chris@19: to specify strides etcetera, you would instead specify these via Chris@19: `fftw_plan_many_dft'. 
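When porting `howmany'/stride code to the advanced interface, it can help to write down which array element belongs to which transform. The sketch below assumes the usual 1d layout in which element j of transform k sits at index j*stride + k*dist, with both counted in elements; `many_index' and `gather_transform' are hypothetical helpers, and plain `double' data is used so that the code compiles without FFTW.

```c
#include <assert.h>
#include <stddef.h>

/* Index of element j of transform k in a 1d batch with the given
   stride and dist (both in units of elements, not bytes). */
static size_t many_index(size_t j, size_t k, size_t stride, size_t dist)
{
    return j * stride + k * dist;
}

/* Copies the n elements of transform k out of a strided batch into a
   contiguous buffer, e.g. to inspect one transform's input. */
static void gather_transform(const double *batch, double *out,
                             size_t n, size_t k,
                             size_t stride, size_t dist)
{
    assert(batch != NULL && out != NULL);
    for (size_t j = 0; j < n; ++j)
        out[j] = batch[many_index(j, k, stride, dist)];
}
```

For a batch stored one transform after another, stride is 1 and dist is the transform length; for interleaved transforms, stride is the number of transforms and dist is 1.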
Chris@19: Chris@19: The analogues to `rfftw_create_plan' and `rfftw_one' with Chris@19: `FFTW_REAL_TO_COMPLEX' or `FFTW_COMPLEX_TO_REAL' directions are Chris@19: `fftw_plan_r2r_1d' with kind `FFTW_R2HC' or `FFTW_HC2R', followed by Chris@19: `fftw_execute'. The stride etcetera arguments of `rfftw' are now in Chris@19: `fftw_plan_many_r2r'. Chris@19: Chris@19: Instead of `rfftwnd_create_plan' (or `rfftw2d_create_plan' or Chris@19: `rfftw3d_create_plan') followed by `rfftwnd_one_real_to_complex' or Chris@19: `rfftwnd_one_complex_to_real', you now use `fftw_plan_dft_r2c' (or Chris@19: `fftw_plan_dft_r2c_2d' or `fftw_plan_dft_r2c_3d') or Chris@19: `fftw_plan_dft_c2r' (or `fftw_plan_dft_c2r_2d' or Chris@19: `fftw_plan_dft_c2r_3d'), respectively, followed by `fftw_execute'. As Chris@19: usual, the strides etcetera of `rfftwnd_real_to_complex' or Chris@19: `rfftwnd_complex_to_real' are now specified in the advanced planner Chris@19: routines, `fftw_plan_many_dft_r2c' or `fftw_plan_many_dft_c2r'. Chris@19: Chris@19: Wisdom Chris@19: ====== Chris@19: Chris@19: In FFTW 2, you had to supply the `FFTW_USE_WISDOM' flag in order to use Chris@19: wisdom; in FFTW 3, wisdom is always used. (You could simulate the FFTW Chris@19: 2 wisdom-less behavior by calling `fftw_forget_wisdom' after every Chris@19: planner call.) Chris@19: Chris@19: The FFTW 3 wisdom import/export routines are almost the same as Chris@19: before (although the storage format is entirely different). There is Chris@19: one significant difference, however. In FFTW 2, the import routines Chris@19: would never read past the end of the wisdom, so you could store extra Chris@19: data beyond the wisdom in the same file, for example. 
In FFTW 3, the Chris@19: file-import routine may read up to a few hundred bytes past the end of Chris@19: the wisdom, so you cannot store other data just beyond it.(1) Chris@19: Chris@19: Wisdom has been enhanced by additional humility in FFTW 3: whereas Chris@19: FFTW 2 would re-use wisdom for a given transform size regardless of the Chris@19: stride etc., in FFTW 3 wisdom is only used with the strides etc. for Chris@19: which it was created. Unfortunately, this means FFTW 3 has to create Chris@19: new plans from scratch more often than FFTW 2 (in FFTW 2, planning e.g. Chris@19: one transform of size 1024 also created wisdom for all smaller powers Chris@19: of 2, but this no longer occurs). Chris@19: Chris@19: FFTW 3 also has the new routine `fftw_import_system_wisdom' to Chris@19: import wisdom from a standard system-wide location. Chris@19: Chris@19: Memory allocation Chris@19: ================= Chris@19: Chris@19: In FFTW 3, we recommend allocating your arrays with `fftw_malloc' and Chris@19: deallocating them with `fftw_free'; this is not required, but allows Chris@19: optimal performance when SIMD acceleration is used. (Those two Chris@19: functions actually existed in FFTW 2, and worked the same way, but were Chris@19: not documented.) Chris@19: Chris@19: In FFTW 2, there were `fftw_malloc_hook' and `fftw_free_hook' Chris@19: functions that allowed the user to replace FFTW's memory-allocation Chris@19: routines (e.g. to implement different error-handling, since by default Chris@19: FFTW prints an error message and calls `exit' to abort the program if Chris@19: `malloc' returns `NULL'). These hooks are not supported in FFTW 3; Chris@19: those few users who require this functionality can just directly modify Chris@19: the memory-allocation routines in FFTW (they are defined in Chris@19: `kernel/alloc.c'). 
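Since the allocation hooks are gone, custom error handling for your own arrays has to live in your own code. The following sketch shows one possible shape for such a wrapper; it is an illustration under stated assumptions, not FFTW code. C11 `aligned_alloc' stands in for `fftw_malloc' so that the example is self-contained, `demo_alloc' is a hypothetical name, and the 16-byte alignment is assumed. It does not change how FFTW allocates memory internally.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical user-side wrapper: returns 16-byte-aligned memory, or
   NULL (after a message on stderr) instead of terminating the program.
   aligned_alloc is a stand-in for fftw_malloc in this sketch. */
static void *demo_alloc(size_t nbytes)
{
    /* C11 aligned_alloc expects a size that is a multiple of the
       alignment, so round the request up to the next multiple of 16. */
    size_t rounded = (nbytes + 15u) & ~(size_t)15u;
    void *p;

    if (rounded < nbytes || rounded == 0)
        return NULL;                 /* overflow or zero-size request */

    p = aligned_alloc(16, rounded);
    if (p == NULL) {
        fprintf(stderr, "demo_alloc: cannot allocate %zu bytes\n", nbytes);
        return NULL;
    }
    assert(((uintptr_t)p & 15u) == 0);
    return p;
}
```

Memory obtained from this stand-in is released with `free'; arrays obtained from `fftw_malloc' must instead be released with `fftw_free'.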
Chris@19: Chris@19: Fortran interface Chris@19: ================= Chris@19: Chris@19: In FFTW 2, the subroutine names were obtained by replacing `fftw_' with Chris@19: `fftw_f77'; in FFTW 3, you replace `fftw_' with `dfftw_' (or `sfftw_' Chris@19: or `lfftw_', depending upon the precision). Chris@19: Chris@19: In FFTW 3, we have begun recommending that you always declare the Chris@19: type used to store plans as `integer*8'. (Too many people didn't notice Chris@19: our instruction to switch from `integer' to `integer*8' for 64-bit Chris@19: machines.) Chris@19: Chris@19: In FFTW 3, we provide a `fftw3.f' "header file" to include in your Chris@19: code (and which is officially installed on Unix systems). (In FFTW 2, Chris@19: we supplied a `fftw_f77.i' file, but it was not installed.) Chris@19: Chris@19: Otherwise, the C-Fortran interface relationship is much the same as Chris@19: it was before (e.g. return values become initial parameters, and Chris@19: multi-dimensional arrays are in column-major order). Unlike FFTW 2, we Chris@19: do provide some support for wisdom import/export in Fortran (*note Chris@19: Wisdom of Fortran?::). Chris@19: Chris@19: Threads Chris@19: ======= Chris@19: Chris@19: Like FFTW 2, only the execution routines are thread-safe. All planner Chris@19: routines, etcetera, should be called by only a single thread at a time Chris@19: (*note Thread safety::). _Unlike_ FFTW 2, there is no special Chris@19: `FFTW_THREADSAFE' flag for the planner to allow a given plan to be Chris@19: usable by multiple threads in parallel; this is now the case by default. Chris@19: Chris@19: The multi-threaded version of FFTW 2 required you to pass the number Chris@19: of threads each time you execute the transform. The number of threads Chris@19: is now stored in the plan, and is specified before the planner is Chris@19: called by `fftw_plan_with_nthreads'. 
The threads initialization Chris@19: routine used to be called `fftw_threads_init' and would return zero on Chris@19: success; the new routine is called `fftw_init_threads' and returns zero Chris@19: on failure. *Note Multi-threaded FFTW::. Chris@19: Chris@19: There is no separate threads header file in FFTW 3; all the function Chris@19: prototypes are in `<fftw3.h>'. However, you still have to link to a Chris@19: separate library (`-lfftw3_threads -lfftw3 -lm' on Unix), as well as to Chris@19: the threading library (e.g. POSIX threads on Unix). Chris@19: Chris@19: ---------- Footnotes ---------- Chris@19: Chris@19: (1) We do our own buffering because GNU libc I/O routines are Chris@19: horribly slow for single-character I/O, apparently for thread-safety Chris@19: reasons (whether you are using threads or not). Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Installation and Customization, Next: Acknowledgments, Prev: Upgrading from FFTW version 2, Up: Top Chris@19: Chris@19: 10 Installation and Customization Chris@19: ********************************* Chris@19: Chris@19: This chapter describes the installation and customization of FFTW, the Chris@19: latest version of which may be downloaded from the FFTW home page Chris@19: (http://www.fftw.org). Chris@19: Chris@19: In principle, FFTW should work on any system with an ANSI C compiler Chris@19: (`gcc' is fine). However, planner time is drastically reduced if FFTW Chris@19: can exploit a hardware cycle counter; FFTW comes with cycle-counter Chris@19: support for all modern general-purpose CPUs, but you may need to add a Chris@19: couple of lines of code if your compiler is not yet supported (*note Chris@19: Cycle Counters::). (On Unix, there will be a warning at the end of the Chris@19: `configure' output if no cycle counter is found.) 
Chris@19: Chris@19: Installation of FFTW is simplest if you have a Unix or a GNU system, Chris@19: such as GNU/Linux, and we describe this case in the first section below, Chris@19: including the use of special configuration options to e.g. install Chris@19: different precisions or exploit optimizations for particular Chris@19: architectures (e.g. SIMD). Compilation on non-Unix systems is a more Chris@19: manual process, but we outline the procedure in the second section. It Chris@19: is also likely that pre-compiled binaries will be available for popular Chris@19: systems. Chris@19: Chris@19: Finally, we describe how you can customize FFTW for particular needs Chris@19: by generating _codelets_ for fast transforms of sizes not supported Chris@19: efficiently by the standard FFTW distribution. Chris@19: Chris@19: * Menu: Chris@19: Chris@19: * Installation on Unix:: Chris@19: * Installation on non-Unix systems:: Chris@19: * Cycle Counters:: Chris@19: * Generating your own code:: Chris@19: Chris@19:  Chris@19: File: fftw3.info, Node: Installation on Unix, Next: Installation on non-Unix systems, Prev: Installation and Customization, Up: Installation and Customization Chris@19: Chris@19: 10.1 Installation on Unix Chris@19: ========================= Chris@19: Chris@19: FFTW comes with a `configure' program in the GNU style. Installation Chris@19: can be as simple as: Chris@19: Chris@19: ./configure Chris@19: make Chris@19: make install Chris@19: Chris@19: This will build the uniprocessor complex and real transform libraries Chris@19: along with the test programs. (We recommend that you use GNU `make' if Chris@19: it is available; on some systems it is called `gmake'.) The "`make Chris@19: install'" command installs the fftw and rfftw libraries in standard Chris@19: places, and typically requires root privileges (unless you specify a Chris@19: different install directory with the `--prefix' flag to `configure'). 
Chris@19: You can also type "`make check'" to put the FFTW test programs through Chris@19: their paces. If you have problems during configuration or compilation, Chris@19: you may want to run "`make distclean'" before trying again; this Chris@19: ensures that you don't have any stale files left over from previous Chris@19: compilation attempts. Chris@19: Chris@19: The `configure' script chooses the `gcc' compiler by default, if it Chris@19: is available; you can select some other compiler with: Chris@19: ./configure CC="<the name of your C compiler>" Chris@19: Chris@19: The `configure' script knows good `CFLAGS' (C compiler flags) for a Chris@19: few systems. If your system is not known, the `configure' script will Chris@19: print out a warning. In this case, you should re-configure FFTW with Chris@19: the command Chris@19: ./configure CFLAGS="<write your CFLAGS here>" Chris@19: and then compile as usual. If you do find an optimal set of Chris@19: `CFLAGS' for your system, please let us know what they are (along with Chris@19: the output of `config.guess') so that we can include them in future Chris@19: releases. Chris@19: Chris@19: `configure' supports all the standard flags defined by the GNU Chris@19: Coding Standards; see the `INSTALL' file in FFTW or the GNU web page Chris@19: (http://www.gnu.org/prep/standards/html_node/index.html). Note Chris@19: especially `--help' to list all flags and `--enable-shared' to create Chris@19: shared, rather than static, libraries. `configure' also accepts a few Chris@19: FFTW-specific flags, particularly: Chris@19: Chris@19: * `--enable-float': Produces a single-precision version of FFTW Chris@19: (`float') instead of the default double-precision (`double'). Chris@19: *Note Precision::. Chris@19: Chris@19: * `--enable-long-double': Produces a long-double precision version of Chris@19: FFTW (`long double') instead of the default double-precision Chris@19: (`double'). 
The `configure' script will halt with an error Chris@19: message if `long double' is the same size as `double' on your Chris@19: machine/compiler. *Note Precision::. Chris@19: Chris@19: * `--enable-quad-precision': Produces a quadruple-precision version Chris@19: of FFTW using the nonstandard `__float128' type provided by `gcc' Chris@19: 4.6 or later on x86, x86-64, and Itanium architectures, instead of Chris@19: the default double-precision (`double'). The `configure' script Chris@19: will halt with an error message if the compiler is not `gcc' Chris@19: version 4.6 or later or if `gcc''s `libquadmath' library is not Chris@19: installed. *Note Precision::. Chris@19: Chris@19: * `--enable-threads': Enables compilation and installation of the Chris@19: FFTW threads library (*note Multi-threaded FFTW::), which provides Chris@19: a simple interface to parallel transforms for SMP systems. By Chris@19: default, the threads routines are not compiled. Chris@19: Chris@19: * `--enable-openmp': Like `--enable-threads', but using OpenMP Chris@19: compiler directives in order to induce parallelism rather than Chris@19: spawning its own threads directly, and installing an `fftw3_omp' Chris@19: library rather than an `fftw3_threads' library (*note Chris@19: Multi-threaded FFTW::). You can use both `--enable-openmp' and Chris@19: `--enable-threads' since they compile/install libraries with Chris@19: different names. By default, the OpenMP routines are not compiled. Chris@19: Chris@19: * `--with-combined-threads': By default, if `--enable-threads' is Chris@19: used, the threads support is compiled into a separate library that Chris@19: must be linked in addition to the main FFTW library. This is so Chris@19: that users of the serial library do not need to link the system Chris@19: threads libraries. If `--with-combined-threads' is specified, Chris@19: however, then no separate threads library is created, and threads Chris@19: are included in the main FFTW library. 
This is mainly useful Chris@19: under Windows, where no system threads library is required and Chris@19: inter-library dependencies are problematic. Chris@19: Chris@19: * `--enable-mpi': Enables compilation and installation of the FFTW Chris@19: MPI library (*note Distributed-memory FFTW with MPI::), which Chris@19: provides parallel transforms for distributed-memory systems with Chris@19: MPI. (By default, the MPI routines are not compiled.) *Note FFTW Chris@19: MPI Installation::. Chris@19: Chris@19: * `--disable-fortran': Disables inclusion of legacy-Fortran wrapper Chris@19: routines (*note Calling FFTW from Legacy Fortran::) in the standard Chris@19: FFTW libraries. These wrapper routines increase the library size Chris@19: by only a negligible amount, so they are included by default as Chris@19: long as the `configure' script finds a Fortran compiler on your Chris@19: system. (To specify a particular Fortran compiler foo, pass Chris@19: `F77='foo to `configure'.) Chris@19: Chris@19: * `--with-g77-wrappers': By default, when Fortran wrappers are Chris@19: included, the wrappers employ the linking conventions of the Chris@19: Fortran compiler detected by the `configure' script. If this Chris@19: compiler is GNU `g77', however, then _two_ versions of the Chris@19: wrappers are included: one with `g77''s idiosyncratic convention Chris@19: of appending two underscores to identifiers, and one with the more Chris@19: common convention of appending only a single underscore. This Chris@19: way, the same FFTW library will work with both `g77' and other Chris@19: Fortran compilers, such as GNU `gfortran'. However, the converse Chris@19: is not true: if you configure with a different compiler, then the Chris@19: `g77'-compatible wrappers are not included. By specifying Chris@19: `--with-g77-wrappers', the `g77'-compatible wrappers are included Chris@19: in addition to wrappers for whatever Fortran compiler `configure' Chris@19: finds. 
Chris@19: Chris@19: * `--with-slow-timer': Disables the use of hardware cycle counters, Chris@19: and falls back on `gettimeofday' or `clock'. This greatly worsens Chris@19: performance, and should generally not be used (unless you don't Chris@19: have a cycle counter but still really want an optimized plan Chris@19: regardless of the time). *Note Cycle Counters::. Chris@19: Chris@19: * `--enable-sse', `--enable-sse2', `--enable-avx', Chris@19: `--enable-altivec', `--enable-neon': Enable the compilation of Chris@19: SIMD code for SSE (Pentium III+), SSE2 (Pentium IV+), AVX (Sandy Chris@19: Bridge, Interlagos), AltiVec (PowerPC G4+), NEON (some ARM Chris@19: processors). SSE, AltiVec, and NEON only work with Chris@19: `--enable-float' (above). SSE2 works in both single and double Chris@19: precision (and is simply SSE in single precision). The resulting Chris@19: code will _still work_ on earlier CPUs lacking the SIMD extensions Chris@19: (SIMD is automatically disabled, although the FFTW library is Chris@19: still larger). Chris@19: - These options require a compiler supporting SIMD extensions, Chris@19: and compiler support is always a bit flaky: see the FFTW FAQ Chris@19: for a list of compiler versions that have problems compiling Chris@19: FFTW. Chris@19: Chris@19: - With AltiVec and `gcc', you may have to use the Chris@19: `-mabi=altivec' option when compiling any code that links to Chris@19: FFTW, in order to properly align the stack; otherwise, FFTW Chris@19: could crash when it tries to use an AltiVec feature. (This Chris@19: is not necessary on MacOS X.) Chris@19: Chris@19: - With SSE/SSE2 and `gcc', you should use a version of gcc that Chris@19: properly aligns the stack when compiling any code that links Chris@19: to FFTW. By default, `gcc' 2.95 and later versions align the Chris@19: stack as needed, but you should not compile FFTW with the Chris@19: `-Os' option or the `-mpreferred-stack-boundary' option with Chris@19: an argument less than 4. 

        - Because of the large variety of ARM processors and ABIs, FFTW
          does not attempt to guess the correct `gcc' flags for
          generating NEON code. In general, you will have to provide
          them on the command line. This command line is known to have
          worked at least once:
               ./configure --with-slow-timer --host=arm-linux-gnueabi \
                 --enable-single --enable-neon \
                 "CC=arm-linux-gnueabi-gcc -march=armv7-a -mfloat-abi=softfp"


   To force `configure' to use a particular C compiler foo (instead of
the default, usually `gcc'), pass `CC='foo to the `configure' script;
you may also need to set the flags via the variable `CFLAGS' as
described above.


File: fftw3.info, Node: Installation on non-Unix systems, Next: Cycle Counters, Prev: Installation on Unix, Up: Installation and Customization

10.2 Installation on non-Unix systems
=====================================

It should be relatively straightforward to compile FFTW even on non-Unix
systems lacking the niceties of a `configure' script. Basically, you
need to edit the `config.h' header (copy it from `config.h.in') to
`#define' the various options and compiler characteristics, and then
compile all the `.c' files in the relevant directories.

   The `config.h' header contains about 100 options to set, each one
initially an `#undef', each documented with a comment, and most of them
fairly obvious. For most of the options, you should simply `#define'
them to `1' if they are applicable, although a few options require a
particular value (e.g. `SIZEOF_LONG_LONG' should be defined to the size
of the `long long' type, in bytes, or zero if it is not supported).
We
will likely post some sample `config.h' files for various operating
systems and compilers for you to use (at least as a starting point).
Please let us know if you have to hand-create a configuration file
(and/or a pre-compiled binary) that you want to share.

   To create the FFTW library, you will then need to compile all of the
`.c' files in the `kernel', `dft', `dft/scalar', `dft/scalar/codelets',
`rdft', `rdft/scalar', `rdft/scalar/r2cf', `rdft/scalar/r2cb',
`rdft/scalar/r2r', `reodft', and `api' directories. If you are
compiling with SIMD support (e.g. you defined `HAVE_SSE2' in
`config.h'), then you also need to compile the `.c' files in the
`simd-support', `{dft,rdft}/simd', and `{dft,rdft}/simd/*' directories.

   Once these files are all compiled, link them into a library, or a
shared library, or directly into your program.

   To compile the FFTW test program, additionally compile the code in
the `libbench2/' directory, and link it into a library. Then compile
the code in the `tests/' directory and link it to the `libbench2' and
FFTW libraries. To compile the `fftw-wisdom' (command-line) tool
(*note Wisdom Utilities::), compile `tools/fftw-wisdom.c' and link it
to the `libbench2' and FFTW libraries.


File: fftw3.info, Node: Cycle Counters, Next: Generating your own code, Prev: Installation on non-Unix systems, Up: Installation and Customization

10.3 Cycle Counters
===================

FFTW's planner actually executes and times different possible FFT
algorithms in order to pick the fastest plan for a given n.
In order
to do this in as short a time as possible, however, the timer must have
a very high resolution, and to accomplish this we employ the hardware
"cycle counters" that are available on most CPUs. Currently, FFTW
supports the cycle counters on x86, PowerPC/POWER, Alpha, UltraSPARC
(SPARC v9), IA64, PA-RISC, and MIPS processors.

   Access to the cycle counters, unfortunately, is a compiler and/or
operating-system dependent task, often requiring inline assembly
language, and it may be that your compiler is not supported. If you are
_not_ supported, FFTW will by default fall back on its estimator
(effectively using `FFTW_ESTIMATE' for all plans).

   You can add support by editing the file `kernel/cycle.h'; normally,
this will involve adapting one of the examples already present in order
to use the inline-assembler syntax for your C compiler, and will only
require a couple of lines of code. Anyone adding support for a new
system to `cycle.h' is encouraged to email us.

   If a cycle counter is not available on your system (e.g. some
embedded processor), and you don't want to use estimated plans, as a
last resort you can use the `--with-slow-timer' option to `configure'
(on Unix) or `#define WITH_SLOW_TIMER' in `config.h' (elsewhere). This
will use the much lower-resolution `gettimeofday' function, or even
`clock' if the former is unavailable, and planning will be extremely
slow.
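
   To see why a low-resolution timer makes planning slow, consider the
following minimal sketch. It is not FFTW's timing code; the names
`wall_seconds', `time_per_call', and `sample_work' are hypothetical and
exist only for this illustration. With a timer that resolves
microseconds at best, a single short call cannot be timed reliably, so
the sketch repeats the call until the whole batch lasts at least
`min_seconds' and then divides by the repetition count.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds from gettimeofday (POSIX); the resolution
   is one microsecond at best. */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1e-6 * (double)tv.tv_usec;
}

/* Hypothetical helper: estimate the duration of one call to work(arg).
   The repetition count doubles until a whole batch takes at least
   min_seconds, so the coarse timer resolution is amortized. */
static double time_per_call(void (*work)(void *), void *arg,
                            double min_seconds)
{
    unsigned long reps = 1;

    assert(work != NULL && min_seconds > 0.0);
    for (;;) {
        unsigned long i;
        double t0, elapsed;

        t0 = wall_seconds();
        for (i = 0; i < reps; ++i)
            work(arg);
        elapsed = wall_seconds() - t0;

        /* Stop once the batch is long enough (or reps gets very large). */
        if (elapsed >= min_seconds || reps >= (1UL << 30))
            return elapsed / (double)reps;
        reps *= 2;
    }
}

/* Hypothetical stand-in workload: bumps a counter through a volatile
   pointer so the compiler cannot remove the call. */
static void sample_work(void *arg)
{
    volatile unsigned long *counter = (volatile unsigned long *)arg;
    *counter += 1;
}
```

   For example, `time_per_call(sample_work, &n, 0.01)' spends at least
10 milliseconds on a single measurement. A planner that must take many
such measurements pays that cost each time, which is the overhead a
hardware cycle counter avoids.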


File: fftw3.info, Node: Generating your own code, Prev: Cycle Counters, Up: Installation and Customization

10.4 Generating your own code
=============================

The directory `genfft' contains the programs that were used to generate
FFTW's "codelets," which are hard-coded transforms of small sizes. We
do not expect casual users to employ the generator, which is a rather
sophisticated program that generates directed acyclic graphs of FFT
algorithms and performs algebraic simplifications on them. It was
written in Objective Caml, a dialect of ML, which is available at
`http://caml.inria.fr/ocaml/index.en.html'.

   If you have Objective Caml installed (along with recent versions of
GNU `autoconf', `automake', and `libtool'), then you can change the set
of codelets that are generated or play with the generation options.
The set of generated codelets is specified by the
`{dft,rdft}/{codelets,simd}/*/Makefile.am' files. For example, you can
add efficient REDFT codelets of small sizes by modifying
`rdft/codelets/r2r/Makefile.am'. After you modify any `Makefile.am'
files, you can type `sh bootstrap.sh' in the top-level directory
followed by `make' to re-generate the files.

   We do not provide more details about the code-generation process,
since we do not expect that most users will need to generate their own
code. However, feel free to contact us if you are interested in the
subject.
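
   To give a rough idea of what a "hard-coded transform of a small
size" looks like, here is a hand-written sketch of a fully unrolled
size-4 forward DFT on split real/imaginary arrays. It was not produced
by `genfft' and does not follow FFTW's internal codelet interface; the
name `dft4_sketch' and its argument layout are assumptions made for
this illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Unrolled forward DFT of size 4:
     X[k] = sum over j of x[j] * exp(-2*pi*i*j*k/4),  k = 0..3,
   with inputs in (ri, ii) and outputs in (ro, io).
   Illustrative only; not generated by genfft. */
static void dft4_sketch(const double *ri, const double *ii,
                        double *ro, double *io)
{
    double ar, ai, br, bi, cr, ci, dr, di;

    assert(ri != NULL && ii != NULL && ro != NULL && io != NULL);

    /* Size-2 butterflies on the even- and odd-indexed inputs. */
    ar = ri[0] + ri[2];  ai = ii[0] + ii[2];   /* a = x0 + x2 */
    br = ri[0] - ri[2];  bi = ii[0] - ii[2];   /* b = x0 - x2 */
    cr = ri[1] + ri[3];  ci = ii[1] + ii[3];   /* c = x1 + x3 */
    dr = ri[1] - ri[3];  di = ii[1] - ii[3];   /* d = x1 - x3 */

    /* Combine.  Multiplying d by -i gives (di, -dr), so the size-4
       transform needs only additions and subtractions. */
    ro[0] = ar + cr;  io[0] = ai + ci;         /* X0 = a + c   */
    ro[1] = br + di;  io[1] = bi - dr;         /* X1 = b - i*d */
    ro[2] = ar - cr;  io[2] = ai - ci;         /* X2 = a - c   */
    ro[3] = br - di;  io[3] = bi + dr;         /* X3 = b + i*d */
}
```

   Because the size is fixed, there are no loops and no twiddle-factor
multiplications left in the code. Applied to the real input 1, 2, 3, 4
the sketch yields 10, -2+2i, -2, -2-2i.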

   You might find it interesting to learn Caml and/or some modern
programming techniques that we used in the generator (including monadic
programming), especially if you heard the rumor that Java and
object-oriented programming are the latest advancement in the field.
The internal operation of the codelet generator is described in the
paper, "A Fast Fourier Transform Compiler," by M. Frigo, which is
available from the FFTW home page (http://www.fftw.org) and also
appeared in the `Proceedings of the 1999 ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI)'.


File: fftw3.info, Node: Acknowledgments, Next: License and Copyright, Prev: Installation and Customization, Up: Top

11 Acknowledgments
******************

Matteo Frigo was supported in part by the Special Research Program SFB
F011 "AURORA" of the Austrian Science Fund FWF and by MIT Lincoln
Laboratory. For previous versions of FFTW, he was supported in part by
the Defense Advanced Research Projects Agency (DARPA), under Grants
N00014-94-1-0985 and F30602-97-1-0270, and by a Digital Equipment
Corporation Fellowship.

   Steven G. Johnson was supported in part by a Dept. of Defense NDSEG
Fellowship, an MIT Karl Taylor Compton Fellowship, and by the Materials
Research Science and Engineering Center program of the National Science
Foundation under award DMR-9400334.

   Code for the Cell Broadband Engine was graciously donated to the FFTW
project by the IBM Austin Research Lab and included in fftw-3.2. (This
code was removed in fftw-3.3.)

   Code for the MIPS paired-single SIMD support was graciously donated
to the FFTW project by CodeSourcery, Inc.

   We are grateful to Sun Microsystems Inc. for its donation of a
cluster of 9 8-processor Ultra HPC 5000 SMPs (24 Gflops peak). These
machines served as the primary platform for the development of early
versions of FFTW.

   We thank Intel Corporation for donating a four-processor Pentium Pro
machine. We thank the GNU/Linux community for giving us a decent OS to
run on that machine.

   We are thankful to the AMD corporation for donating an AMD Athlon XP
1700+ computer to the FFTW project.

   We thank the Compaq/HP testdrive program and VA Software Corporation
(SourceForge.net) for providing remote access to machines that were used
to test FFTW.

   The `genfft' suite of code generators was written using Objective
Caml, a dialect of ML. Objective Caml is a small and elegant language
developed by Xavier Leroy. The implementation is available from
`http://caml.inria.fr/' (http://caml.inria.fr/). In previous releases
of FFTW, `genfft' was written in Caml Light, by the same authors. An
even earlier implementation of `genfft' was written in Scheme, but Caml
is definitely better for this kind of application.

   FFTW uses many tools from the GNU project, including `automake',
`texinfo', and `libtool'.

   Prof. Charles E. Leiserson of MIT provided continuous support and
encouragement. This program would not exist without him. Charles also
proposed the name "codelets" for the basic FFT blocks.

   Prof. John D. Joannopoulos of MIT demonstrated continuing tolerance
of Steven's "extra-curricular" computer-science activities, as well as
remarkable creativity in working them into his grant proposals.
Steven's physics degree would not exist without him.

   Franz Franchetti wrote SIMD extensions to FFTW 2, which eventually
led to the SIMD support in FFTW 3.

   Stefan Kral wrote most of the K7 code generator distributed with FFTW
3.0.x and 3.1.x.

   Andrew Sterian contributed the Windows timing code in FFTW 2.

   Didier Miras reported a bug in the test procedure used in FFTW 1.2.
We now use a completely different test algorithm by Funda Ergun that
does not require a separate FFT program to compare against.

   Wolfgang Reimer contributed the Pentium cycle counter and a few fixes
that help portability.

   Ming-Chang Liu uncovered a well-hidden bug in the complex transforms
of FFTW 2.0 and supplied a patch to correct it.

   The FFTW FAQ was written in `bfnn' (Bizarre Format With No Name) and
formatted using the tools developed by Ian Jackson for the Linux FAQ.

   _We are especially thankful to all of our users for their continuing
support, feedback, and interest during our development of FFTW._


File: fftw3.info, Node: License and Copyright, Next: Concept Index, Prev: Acknowledgments, Up: Top

12 License and Copyright
************************

FFTW is Copyright (C) 2003, 2007-11 Matteo Frigo, Copyright (C) 2003,
2007-11 Massachusetts Institute of Technology.

   FFTW is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.

   This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

   You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software Foundation,
Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. You
can also find the GPL on the GNU web site
(http://www.gnu.org/licenses/gpl-2.0.html).

   In addition, we kindly ask you to acknowledge FFTW and its authors in
any program or publication in which you use FFTW. (You are not
_required_ to do so; it is up to your common sense to decide whether
you want to comply with this request or not.) For general
publications, we suggest referencing: Matteo Frigo and Steven G.
Johnson, "The design and implementation of FFTW3," Proc. IEEE 93 (2),
216-231 (2005).

   Non-free versions of FFTW are available under terms different from
those of the General Public License (e.g., they do not require you to
accompany any object code using FFTW with the corresponding source
code). For these alternative terms you must purchase a license from
MIT's Technology Licensing Office. Users interested in such a license
should contact us for more information.