diff src/fftw-3.3.3/doc/fftw3.info-1 @ 10:37bf6b4a2645

Add FFTW3
author Chris Cannam
date Wed, 20 Mar 2013 15:35:50 +0000
parents
children
line wrap: on
line diff
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/src/fftw-3.3.3/doc/fftw3.info-1	Wed Mar 20 15:35:50 2013 +0000
@@ -0,0 +1,6280 @@
+This is fftw3.info, produced by makeinfo version 4.13 from fftw3.texi.
+
+This manual is for FFTW (version 3.3.3, 25 November 2012).
+
+   Copyright (C) 2003 Matteo Frigo.
+
+   Copyright (C) 2003 Massachusetts Institute of Technology.
+
+     Permission is granted to make and distribute verbatim copies of
+     this manual provided the copyright notice and this permission
+     notice are preserved on all copies.
+
+     Permission is granted to copy and distribute modified versions of
+     this manual under the conditions for verbatim copying, provided
+     that the entire resulting derived work is distributed under the
+     terms of a permission notice identical to this one.
+
+     Permission is granted to copy and distribute translations of this
+     manual into another language, under the above conditions for
+     modified versions, except that this permission notice may be
+     stated in a translation approved by the Free Software Foundation.
+
+INFO-DIR-SECTION Texinfo documentation system
+START-INFO-DIR-ENTRY
+* fftw3: (fftw3).	FFTW User's Manual.
+END-INFO-DIR-ENTRY
+
+
+File: fftw3.info,  Node: Top,  Next: Introduction,  Prev: (dir),  Up: (dir)
+
+FFTW User Manual
+****************
+
+Welcome to FFTW, the Fastest Fourier Transform in the West.  FFTW is a
+collection of fast C routines to compute the discrete Fourier transform.
+This manual documents FFTW version 3.3.3.
+
+* Menu:
+
+* Introduction::
+* Tutorial::
+* Other Important Topics::
+* FFTW Reference::
+* Multi-threaded FFTW::
+* Distributed-memory FFTW with MPI::
+* Calling FFTW from Modern Fortran::
+* Calling FFTW from Legacy Fortran::
+* Upgrading from FFTW version 2::
+* Installation and Customization::
+* Acknowledgments::
+* License and Copyright::
+* Concept Index::
+* Library Index::
+
+
+File: fftw3.info,  Node: Introduction,  Next: Tutorial,  Prev: Top,  Up: Top
+
+1 Introduction
+**************
+
+This manual documents version 3.3.3 of FFTW, the _Fastest Fourier
+Transform in the West_.  FFTW is a comprehensive collection of fast C
+routines for computing the discrete Fourier transform (DFT) and various
+special cases thereof.  
+   * FFTW computes the DFT of complex data, real data, even-   or
+     odd-symmetric real data (these symmetric transforms are usually
+     known as the discrete cosine or sine transform, respectively), and
+     the   discrete Hartley transform (DHT) of real data.
+
+   * The input data can have arbitrary length.         FFTW employs O(n
+     log n)  algorithms for all lengths, including        prime numbers.
+
+   * FFTW supports arbitrary multi-dimensional data.
+
+   * FFTW supports the SSE, SSE2, AVX, Altivec, and MIPS PS instruction
+           sets.
+
+   * FFTW includes parallel (multi-threaded) transforms        for
+     shared-memory systems.
+
+   * Starting with version 3.3, FFTW includes distributed-memory
+     parallel        transforms using MPI.
+
+   We assume herein that you are familiar with the properties and uses
+of the DFT that are relevant to your application.  Otherwise, see e.g.
+`The Fast Fourier Transform and Its Applications' by E. O. Brigham
+(Prentice-Hall, Englewood Cliffs, NJ, 1988).  Our web page
+(http://www.fftw.org) also has links to FFT-related information online.  
+
+   In order to use FFTW effectively, you need to learn one basic concept
+of FFTW's internal structure: FFTW does not use a fixed algorithm for
+computing the transform, but instead it adapts the DFT algorithm to
+details of the underlying hardware in order to maximize performance.
+Hence, the computation of the transform is split into two phases.
+First, FFTW's "planner" "learns" the fastest way to compute the
+transform on your machine.  The planner produces a data structure
+called a "plan" that contains this information.  Subsequently, the plan
+is "executed" to transform the array of input data as dictated by the
+plan.  The plan can be reused as many times as needed.  In typical
+high-performance applications, many transforms of the same size are
+computed and, consequently, a relatively expensive initialization of
+this sort is acceptable.  On the other hand, if you need a single
+transform of a given size, the one-time cost of the planner becomes
+significant.  For this case, FFTW provides fast planners based on
+heuristics or on previously computed plans.
+
+   FFTW supports transforms of data with arbitrary length, rank,
+multiplicity, and a general memory layout.  In simple cases, however,
+this generality may be unnecessary and confusing.  Consequently, we
+organized the interface to FFTW into three levels of increasing
+generality.
+   * The "basic interface" computes a single       transform of
+     contiguous data.
+
+   * The "advanced interface" computes transforms       of multiple or
+     strided arrays.
+
+   * The "guru interface" supports the most general data       layouts,
+     multiplicities, and strides.
+   We expect that most users will be best served by the basic interface,
+whereas the guru interface requires careful attention to the
+documentation to avoid problems.  
+
+   Besides the automatic performance adaptation performed by the
+planner, it is also possible for advanced users to customize FFTW
+manually.  For example, if code space is a concern, we provide a tool
+that links only the subset of FFTW needed by your application.
+Conversely, you may need to extend FFTW because the standard
+distribution is not sufficient for your needs.  For example, the
+standard FFTW distribution works most efficiently for arrays whose size
+can be factored into small primes (2, 3, 5, and 7), and otherwise it
+uses a slower general-purpose routine.  If you need efficient
+transforms of other sizes, you can use FFTW's code generator, which
+produces fast C programs ("codelets") for any particular array size you
+may care about.  For example, if you need transforms of size 513 = 19 x
+3^3, you can customize FFTW to support the factor 19 efficiently.
+
+   For more information regarding FFTW, see the paper, "The Design and
+Implementation of FFTW3," by M. Frigo and S. G. Johnson, which was an
+invited paper in `Proc. IEEE' 93 (2), p. 216 (2005).  The code
+generator is described in the paper "A fast Fourier transform compiler", by
+M. Frigo, in the `Proceedings of the 1999 ACM SIGPLAN Conference on
+Programming Language Design and Implementation (PLDI), Atlanta,
+Georgia, May 1999'.  These papers, along with the latest version of
+FFTW, the FAQ, benchmarks, and other links, are available at the FFTW
+home page (http://www.fftw.org).
+
+   The current version of FFTW incorporates many good ideas from the
+past thirty years of FFT literature.  In one way or another, FFTW uses
+the Cooley-Tukey algorithm, the prime factor algorithm, Rader's
+algorithm for prime sizes, and a split-radix algorithm (with a
+"conjugate-pair" variation pointed out to us by Dan Bernstein).  FFTW's
+code generator also produces new algorithms that we do not completely
+understand.  The reader is referred to the cited papers for the
+appropriate references.
+
+   The rest of this manual is organized as follows.  We first discuss
+the sequential (single-processor) implementation.  We start by
+describing the basic interface/features of FFTW in *note Tutorial::.
+Next, *note Other Important Topics:: discusses data alignment (*note
+SIMD alignment and fftw_malloc::), the storage scheme of
+multi-dimensional arrays (*note Multi-dimensional Array Format::), and
+FFTW's mechanism for storing plans on disk (*note Words of
+Wisdom-Saving Plans::).  Next, *note FFTW Reference:: provides
+comprehensive documentation of all FFTW's features.  Parallel
+transforms are discussed in their own chapters: *note Multi-threaded
+FFTW:: and *note Distributed-memory FFTW with MPI::.  Fortran
+programmers can also use FFTW, as described in *note Calling FFTW from
+Legacy Fortran:: and *note Calling FFTW from Modern Fortran::.  *note
+Installation and Customization:: explains how to install FFTW in your
+computer system and how to adapt FFTW to your needs.  License and
+copyright information is given in *note License and Copyright::.
+Finally, we thank all the people who helped us in *note
+Acknowledgments::.
+
+
+File: fftw3.info,  Node: Tutorial,  Next: Other Important Topics,  Prev: Introduction,  Up: Top
+
+2 Tutorial
+**********
+
+* Menu:
+
+* Complex One-Dimensional DFTs::
+* Complex Multi-Dimensional DFTs::
+* One-Dimensional DFTs of Real Data::
+* Multi-Dimensional DFTs of Real Data::
+* More DFTs of Real Data::
+
+   This chapter describes the basic usage of FFTW, i.e., how to compute the
+Fourier transform of a single array.  This chapter tells the truth, but
+not the _whole_ truth. Specifically, FFTW implements additional
+routines and flags that are not documented here, although in many cases
+we try to indicate where added capabilities exist.  For more complete
+information, see *note FFTW Reference::.  (Note that you need to
+compile and install FFTW before you can use it in a program.  For the
+details of the installation, see *note Installation and
+Customization::.)
+
+   We recommend that you read this tutorial in order.(1)  At the least,
+read the first section (*note Complex One-Dimensional DFTs::) before
+reading any of the others, even if your main interest lies in one of
+the other transform types.
+
+   Users of FFTW version 2 and earlier may also want to read *note
+Upgrading from FFTW version 2::.
+
+   ---------- Footnotes ----------
+
+   (1) You can read the tutorial in bit-reversed order after computing
+your first transform.
+
+
+File: fftw3.info,  Node: Complex One-Dimensional DFTs,  Next: Complex Multi-Dimensional DFTs,  Prev: Tutorial,  Up: Tutorial
+
+2.1 Complex One-Dimensional DFTs
+================================
+
+     Plan: To bother about the best method of accomplishing an
+     accidental result.  [Ambrose Bierce, `The Enlarged Devil's
+     Dictionary'.]  
+
+   The basic usage of FFTW to compute a one-dimensional DFT of size `N'
+is simple, and it typically looks something like this code:
+
+     #include <fftw3.h>
+     ...
+     {
+         fftw_complex *in, *out;
+         fftw_plan p;
+         ...
+         in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
+         out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
+         p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
+         ...
+         fftw_execute(p); /* repeat as needed */
+         ...
+         fftw_destroy_plan(p);
+         fftw_free(in); fftw_free(out);
+     }
+
+   You must link this code with the `fftw3' library.  On Unix systems,
+link with `-lfftw3 -lm'.
+
+   The example code first allocates the input and output arrays.  You
+can allocate them in any way that you like, but we recommend using
+`fftw_malloc', which behaves like `malloc' except that it properly
+aligns the array when SIMD instructions (such as SSE and Altivec) are
+available (*note SIMD alignment and fftw_malloc::). [Alternatively, we
+provide a convenient wrapper function `fftw_alloc_complex(N)' which has
+the same effect.]  
+
+   The data is an array of type `fftw_complex', which is by default a
+`double[2]' composed of the real (`in[i][0]') and imaginary
+(`in[i][1]') parts of a complex number.  
+
+   The next step is to create a "plan", which is an object that
+contains all the data that FFTW needs to compute the FFT.  This
+function creates the plan:
+
+     fftw_plan fftw_plan_dft_1d(int n, fftw_complex *in, fftw_complex *out,
+                                int sign, unsigned flags);
+   
+   The first argument, `n', is the size of the transform you are trying
+to compute.  The size `n' can be any positive integer, but sizes that
+are products of small factors are transformed most efficiently
+(although prime sizes still use an O(n log n)  algorithm).
+
+   The next two arguments are pointers to the input and output arrays of
+the transform.  These pointers can be equal, indicating an "in-place"
+transform.  
+
+   The fourth argument, `sign', can be either `FFTW_FORWARD' (`-1') or
+`FFTW_BACKWARD' (`+1'), and indicates the direction of the transform
+you are interested in; technically, it is the sign of the exponent in
+the transform.
+
+   The `flags' argument is usually either `FFTW_MEASURE' or `FFTW_ESTIMATE'.
+`FFTW_MEASURE' instructs FFTW to run and measure the execution time of
+several FFTs in order to find the best way to compute the transform of
+size `n'.  This process takes some time (usually a few seconds),
+depending on your machine and on the size of the transform.
+`FFTW_ESTIMATE', on the contrary, does not run any computation and just
+builds a reasonable plan that is probably sub-optimal.  In short, if
+your program performs many transforms of the same size and
+initialization time is not important, use `FFTW_MEASURE'; otherwise use
+the estimate.
+
+   _You must create the plan before initializing the input_, because
+`FFTW_MEASURE' overwrites the `in'/`out' arrays.  (Technically,
+`FFTW_ESTIMATE' does not touch your arrays, but you should always
+create plans first just to be sure.)
+
+   Once the plan has been created, you can use it as many times as you
+like for transforms on the specified `in'/`out' arrays, computing the
+actual transforms via `fftw_execute(plan)':
+     void fftw_execute(const fftw_plan plan);
+   
+   The DFT results are stored in-order in the array `out', with the
+zero-frequency (DC) component in `out[0]'.  If `in != out', the
+transform is "out-of-place" and the input array `in' is not modified.
+Otherwise, the input array is overwritten with the transform.
+
+   If you want to transform a _different_ array of the same size, you
+can create a new plan with `fftw_plan_dft_1d' and FFTW automatically
+reuses the information from the previous plan, if possible.
+Alternatively, with the "guru" interface you can apply a given plan to
+a different array, if you are careful.  *Note FFTW Reference::.
+
+   When you are done with the plan, you deallocate it by calling
+`fftw_destroy_plan(plan)':
+     void fftw_destroy_plan(fftw_plan plan);
+   If you allocate an array with `fftw_malloc()' you must deallocate it
+with `fftw_free()'.  Do not use `free()' or, heaven forbid, `delete'.  
+
+   FFTW computes an _unnormalized_ DFT.  Thus, computing a forward
+followed by a backward transform (or vice versa) results in the original
+array scaled by `n'.  For the definition of the DFT, see *note What
+FFTW Really Computes::.  
+
+   If you have a C compiler, such as `gcc', that supports the C99
+standard, and you `#include <complex.h>' _before_ `<fftw3.h>', then
+`fftw_complex' is the native double-precision complex type and you can
+manipulate it with ordinary arithmetic.  Otherwise, FFTW defines its
+own complex type, which is bit-compatible with the C99 complex type.
+*Note Complex numbers::.  (The C++ `<complex>' template class may also
+be usable via a typecast.)  
+
+   To use single or long-double precision versions of FFTW, replace the
+`fftw_' prefix by `fftwf_' or `fftwl_' and link with `-lfftw3f' or
+`-lfftw3l', but use the _same_ `<fftw3.h>' header file.  
+
+   Many more flags exist besides `FFTW_MEASURE' and `FFTW_ESTIMATE'.
+For example, use `FFTW_PATIENT' if you're willing to wait even longer
+for a possibly even faster plan (*note FFTW Reference::).  You can also
+save plans for future use, as described by *note Words of Wisdom-Saving
+Plans::.
+
+
+File: fftw3.info,  Node: Complex Multi-Dimensional DFTs,  Next: One-Dimensional DFTs of Real Data,  Prev: Complex One-Dimensional DFTs,  Up: Tutorial
+
+2.2 Complex Multi-Dimensional DFTs
+==================================
+
+Multi-dimensional transforms work much the same way as one-dimensional
+transforms: you allocate arrays of `fftw_complex' (preferably using
+`fftw_malloc'), create an `fftw_plan', execute it as many times as you
+want with `fftw_execute(plan)', and clean up with
+`fftw_destroy_plan(plan)' (and `fftw_free').
+
+   FFTW provides two routines for creating plans for 2d and 3d
+transforms, and one routine for creating plans of arbitrary
+dimensionality.  The 2d and 3d routines have the following signature:
+     fftw_plan fftw_plan_dft_2d(int n0, int n1,
+                                fftw_complex *in, fftw_complex *out,
+                                int sign, unsigned flags);
+     fftw_plan fftw_plan_dft_3d(int n0, int n1, int n2,
+                                fftw_complex *in, fftw_complex *out,
+                                int sign, unsigned flags);
+   
+   These routines create plans for `n0' by `n1' two-dimensional (2d)
+transforms and `n0' by `n1' by `n2' 3d transforms, respectively.  All
+of these transforms operate on contiguous arrays in the C-standard
+"row-major" order, so that the last dimension has the fastest-varying
+index in the array.  This layout is described further in *note
+Multi-dimensional Array Format::.
+
+   FFTW can also compute transforms of higher dimensionality.  In order
+to avoid confusion between the various meanings of the the word
+"dimension", we use the term _rank_ to denote the number of independent
+indices in an array.(1)  For example, we say that a 2d transform has
+rank 2, a 3d transform has rank 3, and so on.  You can plan transforms
+of arbitrary rank by means of the following function:
+
+     fftw_plan fftw_plan_dft(int rank, const int *n,
+                             fftw_complex *in, fftw_complex *out,
+                             int sign, unsigned flags);
+   
+   Here, `n' is a pointer to an array `n[rank]' denoting an `n[0]' by
+`n[1]' by ... by `n[rank-1]' transform.  Thus, for example, the call
+     fftw_plan_dft_2d(n0, n1, in, out, sign, flags);
+   is equivalent to the following code fragment:
+     int n[2];
+     n[0] = n0;
+     n[1] = n1;
+     fftw_plan_dft(2, n, in, out, sign, flags);
+   `fftw_plan_dft' is not restricted to 2d and 3d transforms, however,
+but it can plan transforms of arbitrary rank.
+
+   You may have noticed that all the planner routines described so far
+have overlapping functionality.  For example, you can plan a 1d or 2d
+transform by using `fftw_plan_dft' with a `rank' of `1' or `2', or even
+by calling `fftw_plan_dft_3d' with `n0' and/or `n1' equal to `1' (with
+no loss in efficiency).  This pattern continues, and FFTW's planning
+routines in general form a "partial order," sequences of interfaces
+with strictly increasing generality but correspondingly greater
+complexity.
+
+   `fftw_plan_dft' is the most general complex-DFT routine that we
+describe in this tutorial, but there are also the advanced and guru
+interfaces, which allow one to efficiently combine multiple/strided
+transforms into a single FFTW plan, transform a subset of a larger
+multi-dimensional array, and/or to handle more general complex-number
+formats.  For more information, see *note FFTW Reference::.
+
+   ---------- Footnotes ----------
+
+   (1) The term "rank" is commonly used in the APL, FORTRAN, and Common
+Lisp traditions, although it is not so common in the C world.
+
+
+File: fftw3.info,  Node: One-Dimensional DFTs of Real Data,  Next: Multi-Dimensional DFTs of Real Data,  Prev: Complex Multi-Dimensional DFTs,  Up: Tutorial
+
+2.3 One-Dimensional DFTs of Real Data
+=====================================
+
+In many practical applications, the input data `in[i]' are purely real
+numbers, in which case the DFT output satisfies the "Hermitian" redundancy:
+`out[i]' is the conjugate of `out[n-i]'.  It is possible to take
+advantage of these circumstances in order to achieve roughly a factor
+of two improvement in both speed and memory usage.
+
+   In exchange for these speed and space advantages, the user sacrifices
+some of the simplicity of FFTW's complex transforms. First of all, the
+input and output arrays are of _different sizes and types_: the input
+is `n' real numbers, while the output is `n/2+1' complex numbers (the
+non-redundant outputs); this also requires slight "padding" of the
+input array for in-place transforms.  Second, the inverse transform
+(complex to real) has the side-effect of _overwriting its input array_,
+by default.  Neither of these inconveniences should pose a serious
+problem for users, but it is important to be aware of them.
+
+   The routines to perform real-data transforms are almost the same as
+those for complex transforms: you allocate arrays of `double' and/or
+`fftw_complex' (preferably using `fftw_malloc' or
+`fftw_alloc_complex'), create an `fftw_plan', execute it as many times
+as you want with `fftw_execute(plan)', and clean up with
+`fftw_destroy_plan(plan)' (and `fftw_free').  The only differences are
+that the input (or output) is of type `double' and there are new
+routines to create the plan.  In one dimension:
+
+     fftw_plan fftw_plan_dft_r2c_1d(int n, double *in, fftw_complex *out,
+                                    unsigned flags);
+     fftw_plan fftw_plan_dft_c2r_1d(int n, fftw_complex *in, double *out,
+                                    unsigned flags);
+   
+   for the real input to complex-Hermitian output ("r2c") and
+complex-Hermitian input to real output ("c2r") transforms.  Unlike the
+complex DFT planner, there is no `sign' argument.  Instead, r2c DFTs
+are always `FFTW_FORWARD' and c2r DFTs are always `FFTW_BACKWARD'.  (For
+single/long-double precision `fftwf' and `fftwl', `double' should be
+replaced by `float' and `long double', respectively.)  
+
+   Here, `n' is the "logical" size of the DFT, not necessarily the
+physical size of the array.  In particular, the real (`double') array
+has `n' elements, while the complex (`fftw_complex') array has `n/2+1'
+elements (where the division is rounded down).  For an in-place
+transform, `in' and `out' are aliased to the same array, which must be
+big enough to hold both; so, the real array would actually have
+`2*(n/2+1)' elements, where the elements beyond the first `n' are
+unused padding.  (Note that this is very different from the concept of
+"zero-padding" a transform to a larger length, which changes the
+logical size of the DFT by actually adding new input data.)  The kth
+element of the complex array is exactly the same as the kth element of
+the corresponding complex DFT.  All positive `n' are supported;
+products of small factors are most efficient, but an O(n log n)
+algorithm is used even for prime sizes.
+
+   As noted above, the c2r transform destroys its input array even for
+out-of-place transforms.  This can be prevented, if necessary, by
+including `FFTW_PRESERVE_INPUT' in the `flags', with unfortunately some
+sacrifice in performance.  This flag is also not currently supported
+for multi-dimensional real DFTs (next section).
+
+   Readers familiar with DFTs of real data will recall that the 0th (the
+"DC") and `n/2'-th (the "Nyquist" frequency, when `n' is even) elements
+of the complex output are purely real.  Some implementations therefore
+store the Nyquist element where the DC imaginary part would go, in
+order to make the input and output arrays the same size.  Such packing,
+however, does not generalize well to multi-dimensional transforms, and
+the space savings are miniscule in any case; FFTW does not support it.
+
+   An alternative interface for one-dimensional r2c and c2r DFTs can be
+found in the `r2r' interface (*note The Halfcomplex-format DFT::), with
+"halfcomplex"-format output that _is_ the same size (and type) as the
+input array.  That interface, although it is not very useful for
+multi-dimensional transforms, may sometimes yield better performance.
+
+
+File: fftw3.info,  Node: Multi-Dimensional DFTs of Real Data,  Next: More DFTs of Real Data,  Prev: One-Dimensional DFTs of Real Data,  Up: Tutorial
+
+2.4 Multi-Dimensional DFTs of Real Data
+=======================================
+
+Multi-dimensional DFTs of real data use the following planner routines:
+
+     fftw_plan fftw_plan_dft_r2c_2d(int n0, int n1,
+                                    double *in, fftw_complex *out,
+                                    unsigned flags);
+     fftw_plan fftw_plan_dft_r2c_3d(int n0, int n1, int n2,
+                                    double *in, fftw_complex *out,
+                                    unsigned flags);
+     fftw_plan fftw_plan_dft_r2c(int rank, const int *n,
+                                 double *in, fftw_complex *out,
+                                 unsigned flags);
+   
+   as well as the corresponding `c2r' routines with the input/output
+types swapped.  These routines work similarly to their complex
+analogues, except for the fact that here the complex output array is cut
+roughly in half and the real array requires padding for in-place
+transforms (as in 1d, above).
+
+   As before, `n' is the logical size of the array, and the
+consequences of this on the the format of the complex arrays deserve
+careful attention.  Suppose that the real data has dimensions n[0] x
+n[1] x n[2] x ... x n[d-1]  (in row-major order).  Then, after an r2c
+transform, the output is an n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1)
+array of `fftw_complex' values in row-major order, corresponding to
+slightly over half of the output of the corresponding complex DFT.
+(The division is rounded down.)  The ordering of the data is otherwise
+exactly the same as in the complex-DFT case.
+
+   For out-of-place transforms, this is the end of the story: the real
+data is stored as a row-major array of size n[0] x n[1] x n[2] x ... x
+n[d-1]  and the complex data is stored as a row-major array of size
+n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1) .
+
+   For in-place transforms, however, extra padding of the real-data
+array is necessary because the complex array is larger than the real
+array, and the two arrays share the same memory locations.  Thus, for
+in-place transforms, the final dimension of the real-data array must be
+padded with extra values to accommodate the size of the complex
+data--two values if the last dimension is even and one if it is odd.  That
+is, the last dimension of the real data must physically contain 2 *
+(n[d-1]/2+1) `double' values (exactly enough to hold the complex data).
+This physical array size does not, however, change the _logical_ array
+size--only n[d-1] values are actually stored in the last dimension, and
+n[d-1] is the last dimension passed to the plan-creation routine.
+
+   For example, consider the transform of a two-dimensional real array
+of size `n0' by `n1'.  The output of the r2c transform is a
+two-dimensional complex array of size `n0' by `n1/2+1', where the `y'
+dimension has been cut nearly in half because of redundancies in the
+output.  Because `fftw_complex' is twice the size of `double', the
+output array is slightly bigger than the input array.  Thus, if we want
+to compute the transform in place, we must _pad_ the input array so
+that it is of size `n0' by `2*(n1/2+1)'.  If `n1' is even, then there
+are two padding elements at the end of each row (which need not be
+initialized, as they are only used for output).
+
+   These transforms are unnormalized, so an r2c followed by a c2r
+transform (or vice versa) will result in the original data scaled by
+the number of real data elements--that is, the product of the (logical)
+dimensions of the real data.  
+
+   (Because the last dimension is treated specially, if it is equal to
+`1' the transform is _not_ equivalent to a lower-dimensional r2c/c2r
+transform.  In that case, the last complex dimension also has size `1'
+(`=1/2+1'), and no advantage is gained over the complex transforms.)
+
+
+File: fftw3.info,  Node: More DFTs of Real Data,  Prev: Multi-Dimensional DFTs of Real Data,  Up: Tutorial
+
+2.5 More DFTs of Real Data
+==========================
+
+* Menu:
+
+* The Halfcomplex-format DFT::
+* Real even/odd DFTs (cosine/sine transforms)::
+* The Discrete Hartley Transform::
+
+   FFTW supports several other transform types via a unified "r2r"
+(real-to-real) interface, so called because it takes a real (`double')
+array and outputs a real array of the same size.  These r2r transforms
+currently fall into three categories: DFTs of real input and
+complex-Hermitian output in halfcomplex format, DFTs of real input with
+even/odd symmetry (a.k.a. discrete cosine/sine transforms, DCTs/DSTs),
+and discrete Hartley transforms (DHTs), all described in more detail by
+the following sections.
+
+   The r2r transforms follow the by now familiar interface of creating
+an `fftw_plan', executing it with `fftw_execute(plan)', and destroying
+it with `fftw_destroy_plan(plan)'.  Furthermore, all r2r transforms
+share the same planner interface:
+
+     fftw_plan fftw_plan_r2r_1d(int n, double *in, double *out,
+                                fftw_r2r_kind kind, unsigned flags);
+     fftw_plan fftw_plan_r2r_2d(int n0, int n1, double *in, double *out,
+                                fftw_r2r_kind kind0, fftw_r2r_kind kind1,
+                                unsigned flags);
+     fftw_plan fftw_plan_r2r_3d(int n0, int n1, int n2,
+                                double *in, double *out,
+                                fftw_r2r_kind kind0,
+                                fftw_r2r_kind kind1,
+                                fftw_r2r_kind kind2,
+                                unsigned flags);
+     fftw_plan fftw_plan_r2r(int rank, const int *n, double *in, double *out,
+                             const fftw_r2r_kind *kind, unsigned flags);
+   
+   Just as for the complex DFT, these plan 1d/2d/3d/multi-dimensional
+transforms for contiguous arrays in row-major order, transforming (real)
+input to output of the same size, where `n' specifies the _physical_
+dimensions of the arrays.  All positive `n' are supported (with the
+exception of `n=1' for the `FFTW_REDFT00' kind, noted in the real-even
+subsection below); products of small factors are most efficient
+(factorizing `n-1' and `n+1' for `FFTW_REDFT00' and `FFTW_RODFT00'
+kinds, described below), but an O(n log n)  algorithm is used even for
+prime sizes.
+
+   Each dimension has a "kind" parameter, of type `fftw_r2r_kind',
+specifying the kind of r2r transform to be used for that dimension.  (In
+the case of `fftw_plan_r2r', this is an array `kind[rank]' where
+`kind[i]' is the transform kind for the dimension `n[i]'.)  The kind
+can be one of a set of predefined constants, defined in the following
+subsections.
+
+   In other words, FFTW computes the separable product of the specified
+r2r transforms over each dimension, which can be used e.g. for partial
+differential equations with mixed boundary conditions.  (For some r2r
+kinds, notably the halfcomplex DFT and the DHT, such a separable
+product is somewhat problematic in more than one dimension, however, as
+is described below.)
+
+   In the current version of FFTW, all r2r transforms except for the
+halfcomplex type are computed via pre- or post-processing of
+halfcomplex transforms, and they are therefore not as fast as they
+could be.  Since most other general DCT/DST codes employ a similar
+algorithm, however, FFTW's implementation should provide at least
+competitive performance.
+
+
+File: fftw3.info,  Node: The Halfcomplex-format DFT,  Next: Real even/odd DFTs (cosine/sine transforms),  Prev: More DFTs of Real Data,  Up: More DFTs of Real Data
+
+2.5.1 The Halfcomplex-format DFT
+--------------------------------
+
+An r2r kind of `FFTW_R2HC' ("r2hc") corresponds to an r2c DFT (*note
+One-Dimensional DFTs of Real Data::) but with "halfcomplex" format
+output, and may sometimes be faster and/or more convenient than the
+latter.  The inverse "hc2r" transform is of kind `FFTW_HC2R'.  This
+consists of the non-redundant half of the complex output for a 1d
+real-input DFT of size `n', stored as a sequence of `n' real numbers
+(`double') in the format:
+
+   r0, r1, r2, r(n/2), i((n+1)/2-1), ..., i2, i1
+
+   Here, rk is the real part of the kth output, and ik is the imaginary
+part.  (Division by 2 is rounded down.) For a halfcomplex array
+`hc[n]', the kth component thus has its real part in `hc[k]' and its
+imaginary part in `hc[n-k]', with the exception of `k' `==' `0' or
+`n/2' (the latter only if `n' is even)--in these two cases, the
+imaginary part is zero due to symmetries of the real-input DFT, and is
+not stored.  Thus, the r2hc transform of `n' real values is a
+halfcomplex array of length `n', and vice versa for hc2r.  
+
+   Aside from the differing format, the output of
+`FFTW_R2HC'/`FFTW_HC2R' is otherwise exactly the same as for the
+corresponding 1d r2c/c2r transform (i.e. `FFTW_FORWARD'/`FFTW_BACKWARD'
+transforms, respectively).  Recall that these transforms are
+unnormalized, so r2hc followed by hc2r will result in the original data
+multiplied by `n'.  Furthermore, like the c2r transform, an
+out-of-place hc2r transform will _destroy its input_ array.
+
+   Although these halfcomplex transforms can be used with the
+multi-dimensional r2r interface, the interpretation of such a separable
+product of transforms along each dimension is problematic.  For example,
+consider a two-dimensional `n0' by `n1', r2hc by r2hc transform planned
+by `fftw_plan_r2r_2d(n0, n1, in, out, FFTW_R2HC, FFTW_R2HC,
+FFTW_MEASURE)'.  Conceptually, FFTW first transforms the rows (of size
+`n1') to produce halfcomplex rows, and then transforms the columns (of
+size `n0').  Half of these column transforms, however, are of imaginary
+parts, and should therefore be multiplied by i and combined with the
+r2hc transforms of the real columns to produce the 2d DFT amplitudes;
+FFTW's r2r transform does _not_ perform this combination for you.
+Thus, if a multi-dimensional real-input/output DFT is required, we
+recommend using the ordinary r2c/c2r interface (*note Multi-Dimensional
+DFTs of Real Data::).
+
+
+File: fftw3.info,  Node: Real even/odd DFTs (cosine/sine transforms),  Next: The Discrete Hartley Transform,  Prev: The Halfcomplex-format DFT,  Up: More DFTs of Real Data
+
+2.5.2 Real even/odd DFTs (cosine/sine transforms)
+-------------------------------------------------
+
+The Fourier transform of a real-even function f(-x) = f(x) is
+real-even, and i times the Fourier transform of a real-odd function
+f(-x) = -f(x) is real-odd.  Similar results hold for a discrete Fourier
+transform, and thus for these symmetries the need for complex
+inputs/outputs is entirely eliminated.  Moreover, one gains a factor of
+two in speed/space from the fact that the data are real, and an
+additional factor of two from the even/odd symmetry: only the
+non-redundant (first) half of the array need be stored.  The result is
+the real-even DFT ("REDFT") and the real-odd DFT ("RODFT"), also known
+as the discrete cosine and sine transforms ("DCT" and "DST"),
+respectively.  
+
+   (In this section, we describe the 1d transforms; multi-dimensional
+transforms are just a separable product of these transforms operating
+along each dimension.)
+
+   Because of the discrete sampling, one has an additional choice: is
+the data even/odd around a sampling point, or around the point halfway
+between two samples?  The latter corresponds to _shifting_ the samples
+by _half_ an interval, and gives rise to several transform variants
+denoted by REDFTab and RODFTab: a and b are 0 or 1, and indicate
+whether the input (a) and/or output (b) are shifted by half a sample (1
+means it is shifted).  These are also known as types I-IV of the DCT
+and DST, and all four types are supported by FFTW's r2r interface.(1)
+
+   The r2r kinds for the various REDFT and RODFT types supported by
+FFTW, along with the boundary conditions at both ends of the _input_
+array (`n' real numbers `in[j=0..n-1]'), are:
+
+   * `FFTW_REDFT00' (DCT-I): even around j=0 and even around j=n-1.  
+
+   * `FFTW_REDFT10' (DCT-II, "the" DCT): even around j=-0.5 and even
+     around j=n-0.5.  
+
+   * `FFTW_REDFT01' (DCT-III, "the" IDCT): even around j=0 and odd
+     around j=n.  
+
+   * `FFTW_REDFT11' (DCT-IV): even around j=-0.5 and odd around j=n-0.5.  
+
+   * `FFTW_RODFT00' (DST-I): odd around j=-1 and odd around j=n.  
+
+   * `FFTW_RODFT10' (DST-II): odd around j=-0.5 and odd around j=n-0.5.  
+
+   * `FFTW_RODFT01' (DST-III): odd around j=-1 and even around j=n-1.  
+
+   * `FFTW_RODFT11' (DST-IV): odd around j=-0.5 and even around j=n-0.5.  
+
+
+   Note that these symmetries apply to the "logical" array being
+transformed; *there are no constraints on your physical input data*.
+So, for example, if you specify a size-5 REDFT00 (DCT-I) of the data
+abcde, it corresponds to the DFT of the logical even array abcdedcb of
+size 8.  A size-4 REDFT10 (DCT-II) of the data abcd corresponds to the
+size-8 logical DFT of the even array abcddcba, shifted by half a sample.
+
+   All of these transforms are invertible.  The inverse of R*DFT00 is
+R*DFT00; of R*DFT10 is R*DFT01 and vice versa (these are often called
+simply "the" DCT and IDCT, respectively); and of R*DFT11 is R*DFT11.
+However, the transforms computed by FFTW are unnormalized, exactly like
+the corresponding real and complex DFTs, so computing a transform
+followed by its inverse yields the original array scaled by N, where N
+is the _logical_ DFT size.  For REDFT00, N=2(n-1); for RODFT00,
+N=2(n+1); otherwise, N=2n.  
+
+   Note that the boundary conditions of the transform output array are
+given by the input boundary conditions of the inverse transform.  Thus,
+the above transforms are all inequivalent in terms of input/output
+boundary conditions, even neglecting the 0.5 shift difference.
+
+   FFTW is most efficient when N is a product of small factors; note
+that this _differs_ from the factorization of the physical size `n' for
+REDFT00 and RODFT00!  There is another oddity: `n=1' REDFT00 transforms
+correspond to N=0, and so are _not defined_ (the planner will return
+`NULL').  Otherwise, any positive `n' is supported.
+
+   For the precise mathematical definitions of these transforms as used
+by FFTW, see *note What FFTW Really Computes::.  (For people accustomed
+to the DCT/DST, FFTW's definitions have a coefficient of 2 in front of
+the cos/sin functions so that they correspond precisely to an even/odd
+DFT of size N.  Some authors also include additional multiplicative
+factors of sqrt(2) for selected inputs and outputs; this makes the
+transform orthogonal, but sacrifices the direct equivalence to a
+symmetric DFT.)
+
+Which type do you need?
+.......................
+
+Since the required flavor of even/odd DFT depends upon your problem,
+you are the best judge of this choice, but we can make a few comments
+on relative efficiency to help you in your selection.  In particular,
+R*DFT01 and R*DFT10 tend to be slightly faster than R*DFT11 (especially
+for odd sizes), while the R*DFT00 transforms are sometimes
+significantly slower (especially for even sizes).(2)
+
+   Thus, if only the boundary conditions on the transform inputs are
+specified, we generally recommend R*DFT10 over R*DFT00 and R*DFT01 over
+R*DFT11 (unless the half-sample shift or the self-inverse property is
+significant for your problem).
+
+   If performance is important to you and you are using only small sizes
+(say n<200), e.g. for multi-dimensional transforms, then you might
+consider generating hard-coded transforms of those sizes and types that
+you are interested in (*note Generating your own code::).
+
+   We are interested in hearing what types of symmetric transforms you
+find most useful.
+
+   ---------- Footnotes ----------
+
+   (1) There are also type V-VIII transforms, which correspond to a
+logical DFT of _odd_ size N, independent of whether the physical size
+`n' is odd, but we do not support these variants.
+
+   (2) R*DFT00 is sometimes slower in FFTW because we discovered that
+the standard algorithm for computing this by a pre/post-processed real
+DFT--the algorithm used in FFTPACK, Numerical Recipes, and other
+sources for decades now--has serious numerical problems: it already
+loses several decimal places of accuracy for 16k sizes.  There seem to
+be only two alternatives in the literature that do not suffer
+similarly: a recursive decomposition into smaller DCTs, which would
+require a large set of codelets for efficiency and generality, or
+sacrificing a factor of 2 in speed to use a real DFT of twice the size.
+We currently employ the latter technique for general n, as well as a
+limited form of the former method: a split-radix decomposition when n
+is odd (N a multiple of 4).  For N containing many factors of 2, the
+split-radix method seems to recover most of the speed of the standard
+algorithm without the accuracy tradeoff.
+
+
+File: fftw3.info,  Node: The Discrete Hartley Transform,  Prev: Real even/odd DFTs (cosine/sine transforms),  Up: More DFTs of Real Data
+
+2.5.3 The Discrete Hartley Transform
+------------------------------------
+
+If you are planning to use the DHT because you've heard that it is
+"faster" than the DFT (FFT), *stop here*.  The DHT is not faster than
+the DFT.  That story is an old but enduring misconception that was
+debunked in 1987.
+
+   The discrete Hartley transform (DHT) is an invertible linear
+transform closely related to the DFT.  In the DFT, one multiplies each
+input by cos - i * sin (a complex exponential), whereas in the DHT each
+input is multiplied by simply cos + sin.  Thus, the DHT transforms `n'
+real numbers to `n' real numbers, and has the convenient property of
+being its own inverse.  In FFTW, a DHT (of any positive `n') can be
+specified by an r2r kind of `FFTW_DHT'.  
+
+   Like the DFT, in FFTW the DHT is unnormalized, so computing a DHT of
+size `n' followed by another DHT of the same size will result in the
+original array multiplied by `n'.  
+
+   The DHT was originally proposed as a more efficient alternative to
+the DFT for real data, but it was subsequently shown that a specialized
+DFT (such as FFTW's r2hc or r2c transforms) could be just as fast.  In
+FFTW, the DHT is actually computed by post-processing an r2hc
+transform, so there is ordinarily no reason to prefer it from a
+performance perspective.(1) However, we have heard rumors that the DHT
+might be the most appropriate transform in its own right for certain
+applications, and we would be very interested to hear from anyone who
+finds it useful.
+
+   If `FFTW_DHT' is specified for multiple dimensions of a
+multi-dimensional transform, FFTW computes the separable product of 1d
+DHTs along each dimension.  Unfortunately, this is not quite the same
+thing as a true multi-dimensional DHT; you can compute the latter, if
+necessary, with at most `rank-1' post-processing passes [see e.g. H.
+Hao and R. N. Bracewell, Proc. IEEE 75, 264-266 (1987)].
+
+   For the precise mathematical definition of the DHT as used by FFTW,
+see *note What FFTW Really Computes::.
+
+   ---------- Footnotes ----------
+
+   (1) We provide the DHT mainly as a byproduct of some internal
+algorithms. FFTW computes a real input/output DFT of _prime_ size by
+re-expressing it as a DHT plus post/pre-processing and then using
+Rader's prime-DFT algorithm adapted to the DHT.
+
+
+File: fftw3.info,  Node: Other Important Topics,  Next: FFTW Reference,  Prev: Tutorial,  Up: Top
+
+3 Other Important Topics
+************************
+
+* Menu:
+
+* SIMD alignment and fftw_malloc::
+* Multi-dimensional Array Format::
+* Words of Wisdom-Saving Plans::
+* Caveats in Using Wisdom::
+
+
+File: fftw3.info,  Node: SIMD alignment and fftw_malloc,  Next: Multi-dimensional Array Format,  Prev: Other Important Topics,  Up: Other Important Topics
+
+3.1 SIMD alignment and fftw_malloc
+==================================
+
+SIMD, which stands for "Single Instruction Multiple Data," is a set of
+special operations supported by some processors to perform a single
+operation on several numbers (usually 2 or 4) simultaneously.  SIMD
+floating-point instructions are available on several popular CPUs:
+SSE/SSE2/AVX on recent x86/x86-64 processors, AltiVec (single precision)
+on some PowerPCs (Apple G4 and higher), NEON on some ARM models, and
+MIPS Paired Single (currently only in FFTW 3.2.x).  FFTW can be
+compiled to support the SIMD instructions on any of these systems.  
+
+   A program linking to an FFTW library compiled with SIMD support can
+obtain a nonnegligible speedup for most complex and r2c/c2r transforms.
+In order to obtain this speedup, however, the arrays of complex (or
+real) data passed to FFTW must be specially aligned in memory
+(typically 16-byte aligned), and often this alignment is more stringent
+than that provided by the usual `malloc' (etc.)  allocation routines.
+
+   In order to guarantee proper alignment for SIMD, therefore, in case
+your program is ever linked against a SIMD-using FFTW, we recommend
+allocating your transform data with `fftw_malloc' and de-allocating it
+with `fftw_free'.  These have exactly the same interface and behavior as
+`malloc'/`free', except that for a SIMD FFTW they ensure that the
+returned pointer has the necessary alignment (by calling `memalign' or
+its equivalent on your OS).
+
+   You are not _required_ to use `fftw_malloc'.  You can allocate your
+data in any way that you like, from `malloc' to `new' (in C++) to a
+fixed-size array declaration.  If the array happens not to be properly
+aligned, FFTW will not use the SIMD extensions.  
+
+   Since `fftw_malloc' only ever needs to be used for real and complex
+arrays, we provide two convenient wrapper routines `fftw_alloc_real(N)'
+and `fftw_alloc_complex(N)' that are equivalent to
+`(double*)fftw_malloc(sizeof(double) * N)' and
+`(fftw_complex*)fftw_malloc(sizeof(fftw_complex) * N)', respectively
+(or their equivalents in other precisions).
+
+
+File: fftw3.info,  Node: Multi-dimensional Array Format,  Next: Words of Wisdom-Saving Plans,  Prev: SIMD alignment and fftw_malloc,  Up: Other Important Topics
+
+3.2 Multi-dimensional Array Format
+==================================
+
+This section describes the format in which multi-dimensional arrays are
+stored in FFTW.  We felt that a detailed discussion of this topic was
+necessary.  Since several different formats are common, this topic is
+often a source of confusion.
+
+* Menu:
+
+* Row-major Format::
+* Column-major Format::
+* Fixed-size Arrays in C::
+* Dynamic Arrays in C::
+* Dynamic Arrays in C-The Wrong Way::
+
+
+File: fftw3.info,  Node: Row-major Format,  Next: Column-major Format,  Prev: Multi-dimensional Array Format,  Up: Multi-dimensional Array Format
+
+3.2.1 Row-major Format
+----------------------
+
+The multi-dimensional arrays passed to `fftw_plan_dft' etcetera are
+expected to be stored as a single contiguous block in "row-major" order
+(sometimes called "C order").  Basically, this means that as you step
+through adjacent memory locations, the first dimension's index varies
+most slowly and the last dimension's index varies most quickly.
+
+   To be more explicit, let us consider an array of rank d whose
+dimensions are n[0] x n[1] x n[2] x ... x n[d-1] . Now, we specify a
+location in the array by a sequence of d (zero-based) indices, one for
+each dimension: (i[0], i[1], ..., i[d-1]).  If the array is stored in
+row-major order, then this element is located at the position i[d-1] +
+n[d-1] * (i[d-2] + n[d-2] * (... + n[1] * i[0])).
+
+   Note that, for the ordinary complex DFT, each element of the array
+must be of type `fftw_complex'; i.e. a (real, imaginary) pair of
+(double-precision) numbers.
+
+   In the advanced FFTW interface, the physical dimensions n from which
+the indices are computed can be different from (larger than) the
+logical dimensions of the transform to be computed, in order to
+transform a subset of a larger array.  Note also that, in the advanced
+interface, the expression above is multiplied by a "stride" to get the
+actual array index--this is useful in situations where each element of
+the multi-dimensional array is actually a data structure (or another
+array), and you just want to transform a single field. In the basic
+interface, however, the stride is 1.  
+
+
+File: fftw3.info,  Node: Column-major Format,  Next: Fixed-size Arrays in C,  Prev: Row-major Format,  Up: Multi-dimensional Array Format
+
+3.2.2 Column-major Format
+-------------------------
+
+Readers from the Fortran world are used to arrays stored in
+"column-major" order (sometimes called "Fortran order").  This is
+essentially the exact opposite of row-major order in that, here, the
+_first_ dimension's index varies most quickly.
+
+   If you have an array stored in column-major order and wish to
+transform it using FFTW, it is quite easy to do.  When creating the
+plan, simply pass the dimensions of the array to the planner in
+_reverse order_.  For example, if your array is a rank three `N x M x
+L' matrix in column-major order, you should pass the dimensions of the
+array as if it were an `L x M x N' matrix (which it is, from the
+perspective of FFTW).  This is done for you _automatically_ by the FFTW
+legacy-Fortran interface (*note Calling FFTW from Legacy Fortran::),
+but you must do it manually with the modern Fortran interface (*note
+Reversing array dimensions::).  
+
+
+File: fftw3.info,  Node: Fixed-size Arrays in C,  Next: Dynamic Arrays in C,  Prev: Column-major Format,  Up: Multi-dimensional Array Format
+
+3.2.3 Fixed-size Arrays in C
+----------------------------
+
+A multi-dimensional array whose size is declared at compile time in C
+is _already_ in row-major order.  You don't have to do anything special
+to transform it.  For example:
+
+     {
+          fftw_complex data[N0][N1][N2];
+          fftw_plan plan;
+          ...
+          plan = fftw_plan_dft_3d(N0, N1, N2, &data[0][0][0], &data[0][0][0],
+                                  FFTW_FORWARD, FFTW_ESTIMATE);
+          ...
+     }
+
+   This will plan a 3d in-place transform of size `N0 x N1 x N2'.
+Notice how we took the address of the zero-th element to pass to the
+planner (we could also have used a typecast).
+
+   However, we tend to _discourage_ users from declaring their arrays
+in this way, for two reasons.  First, this allocates the array on the
+stack ("automatic" storage), which has a very limited size on most
+operating systems (declaring an array with more than a few thousand
+elements will often cause a crash).  (You can get around this
+limitation on many systems by declaring the array as `static' and/or
+global, but that has its own drawbacks.)  Second, it may not optimally
+align the array for use with a SIMD FFTW (*note SIMD alignment and
+fftw_malloc::).  Instead, we recommend using `fftw_malloc', as
+described below.
+
+
+File: fftw3.info,  Node: Dynamic Arrays in C,  Next: Dynamic Arrays in C-The Wrong Way,  Prev: Fixed-size Arrays in C,  Up: Multi-dimensional Array Format
+
+3.2.4 Dynamic Arrays in C
+-------------------------
+
+We recommend allocating most arrays dynamically, with `fftw_malloc'.
+This isn't too hard to do, although it is not as straightforward for
+multi-dimensional arrays as it is for one-dimensional arrays.
+
+   Creating the array is simple: using a dynamic-allocation routine like
+`fftw_malloc', allocate an array big enough to store N `fftw_complex'
+values (for a complex DFT), where N is the product of the sizes of the
+array dimensions (i.e. the total number of complex values in the
+array).  For example, here is code to allocate a 5 x 12 x 27  rank-3
+array: 
+
+     fftw_complex *an_array;
+     an_array = (fftw_complex*) fftw_malloc(5*12*27 * sizeof(fftw_complex));
+
+   Accessing the array elements, however, is more tricky--you can't
+simply use multiple applications of the `[]' operator like you could
+for fixed-size arrays.  Instead, you have to explicitly compute the
+offset into the array using the formula given earlier for row-major
+arrays.  For example, to reference the (i,j,k)-th element of the array
+allocated above, you would use the expression `an_array[k + 27 * (j +
+12 * i)]'.
+
+   This pain can be alleviated somewhat by defining appropriate macros,
+or, in C++, creating a class and overloading the `()' operator.  The
+recent C99 standard provides a way to reinterpret the dynamic array as
+a "variable-length" multi-dimensional array amenable to `[]', but this
+feature is not yet widely supported by compilers.  
+
+
+File: fftw3.info,  Node: Dynamic Arrays in C-The Wrong Way,  Prev: Dynamic Arrays in C,  Up: Multi-dimensional Array Format
+
+3.2.5 Dynamic Arrays in C--The Wrong Way
+----------------------------------------
+
+A different method for allocating multi-dimensional arrays in C is
+often suggested that is incompatible with FFTW: _using it will cause
+FFTW to die a painful death_.  We discuss the technique here, however,
+because it is so commonly known and used.  This method is to create
+arrays of pointers of arrays of pointers of ...etcetera.  For example,
+the analogue in this method to the example above is:
+
+     int i,j;
+     fftw_complex ***a_bad_array;  /* another way to make a 5x12x27 array */
+
+     a_bad_array = (fftw_complex ***) malloc(5 * sizeof(fftw_complex **));
+     for (i = 0; i < 5; ++i) {
+          a_bad_array[i] =
+             (fftw_complex **) malloc(12 * sizeof(fftw_complex *));
+          for (j = 0; j < 12; ++j)
+               a_bad_array[i][j] =
+                     (fftw_complex *) malloc(27 * sizeof(fftw_complex));
+     }
+
+   As you can see, this sort of array is inconvenient to allocate (and
+deallocate).  On the other hand, it has the advantage that the
+(i,j,k)-th element can be referenced simply by `a_bad_array[i][j][k]'.
+
+   If you like this technique and want to maximize convenience in
+accessing the array, but still want to pass the array to FFTW, you can
+use a hybrid method.  Allocate the array as one contiguous block, but
+also declare an array of arrays of pointers that point to appropriate
+places in the block.  That sort of trick is beyond the scope of this
+documentation; for more information on multi-dimensional arrays in C,
+see the `comp.lang.c' FAQ (http://c-faq.com/aryptr/dynmuldimary.html).
+
+
+File: fftw3.info,  Node: Words of Wisdom-Saving Plans,  Next: Caveats in Using Wisdom,  Prev: Multi-dimensional Array Format,  Up: Other Important Topics
+
+3.3 Words of Wisdom--Saving Plans
+=================================
+
+FFTW implements a method for saving plans to disk and restoring them.
+In fact, what FFTW does is more general than just saving and loading
+plans.  The mechanism is called "wisdom".  Here, we describe this
+feature at a high level. *Note FFTW Reference::, for a less casual but
+more complete discussion of how to use wisdom in FFTW.
+
+   Plans created with the `FFTW_MEASURE', `FFTW_PATIENT', or
+`FFTW_EXHAUSTIVE' options produce near-optimal FFT performance, but may
+require a long time to compute because FFTW must measure the runtime of
+many possible plans and select the best one.  This setup is designed
+for the situations where so many transforms of the same size must be
+computed that the start-up time is irrelevant.  For short
+initialization times, but slower transforms, we have provided
+`FFTW_ESTIMATE'.  The `wisdom' mechanism is a way to get the best of
+both worlds: you compute a good plan once, save it to disk, and later
+reload it as many times as necessary.  The wisdom mechanism can
+actually save and reload many plans at once, not just one.  
+
+   Whenever you create a plan, the FFTW planner accumulates wisdom,
+which is information sufficient to reconstruct the plan.  After
+planning, you can save this information to disk by means of the
+function:
+     int fftw_export_wisdom_to_filename(const char *filename);
+   (This function returns non-zero on success.)
+
+   The next time you run the program, you can restore the wisdom with
+`fftw_import_wisdom_from_filename' (which also returns non-zero on
+success), and then recreate the plan using the same flags as before.
+     int fftw_import_wisdom_from_filename(const char *filename);
+   
+   Wisdom is automatically used for any size to which it is applicable,
+as long as the planner flags are not more "patient" than those with
+which the wisdom was created.  For example, wisdom created with
+`FFTW_MEASURE' can be used if you later plan with `FFTW_ESTIMATE' or
+`FFTW_MEASURE', but not with `FFTW_PATIENT'.
+
+   The `wisdom' is cumulative, and is stored in a global, private data
+structure managed internally by FFTW.  The storage space required is
+minimal, proportional to the logarithm of the sizes the wisdom was
+generated from.  If memory usage is a concern, however, the wisdom can
+be forgotten and its associated memory freed by calling:
+     void fftw_forget_wisdom(void);
+   
+   Wisdom can be exported to a file, a string, or any other medium.
+For details, see *note Wisdom::.
+
+
+File: fftw3.info,  Node: Caveats in Using Wisdom,  Prev: Words of Wisdom-Saving Plans,  Up: Other Important Topics
+
+3.4 Caveats in Using Wisdom
+===========================
+
+     For in much wisdom is much grief, and he that increaseth knowledge
+     increaseth sorrow.  [Ecclesiastes 1:18] 
+
+   There are pitfalls to using wisdom, in that it can negate FFTW's
+ability to adapt to changing hardware and other conditions. For
+example, it would be perfectly possible to export wisdom from a program
+running on one processor and import it into a program running on
+another processor.  Doing so, however, would mean that the second
+program would use plans optimized for the first processor, instead of
+the one it is running on.
+
+   It should be safe to reuse wisdom as long as the hardware and program
+binaries remain unchanged. (Actually, the optimal plan may change even
+between runs of the same binary on identical hardware, due to
+differences in the virtual memory environment, etcetera.  Users
+seriously interested in performance should worry about this problem,
+too.)  It is likely that, if the same wisdom is used for two different
+program binaries, even running on the same machine, the plans may be
+sub-optimal because of differing code alignments.  It is therefore wise
+to recreate wisdom every time an application is recompiled.  The more
+the underlying hardware and software changes between the creation of
+wisdom and its use, the greater grows the risk of sub-optimal plans.
+
+   Nevertheless, if the choice is between using `FFTW_ESTIMATE' or
+using possibly-suboptimal wisdom (created on the same machine, but for a
+different binary), the wisdom is likely to be better.  For this reason,
+we provide a function to import wisdom from a standard system-wide
+location (`/etc/fftw/wisdom' on Unix): 
+
+     int fftw_import_system_wisdom(void);
+   
+   FFTW also provides a standalone program, `fftw-wisdom' (described by
+its own `man' page on Unix) with which users can create wisdom, e.g.
+for a canonical set of sizes to store in the system wisdom file.  *Note
+Wisdom Utilities::.  
+
+
+File: fftw3.info,  Node: FFTW Reference,  Next: Multi-threaded FFTW,  Prev: Other Important Topics,  Up: Top
+
+4 FFTW Reference
+****************
+
+This chapter provides a complete reference for all sequential (i.e.,
+one-processor) FFTW functions.  Parallel transforms are described in
+later chapters.
+
+* Menu:
+
+* Data Types and Files::
+* Using Plans::
+* Basic Interface::
+* Advanced Interface::
+* Guru Interface::
+* New-array Execute Functions::
+* Wisdom::
+* What FFTW Really Computes::
+
+
+File: fftw3.info,  Node: Data Types and Files,  Next: Using Plans,  Prev: FFTW Reference,  Up: FFTW Reference
+
+4.1 Data Types and Files
+========================
+
+All programs using FFTW should include its header file:
+
+     #include <fftw3.h>
+
+   You must also link to the FFTW library.  On Unix, this means adding
+`-lfftw3 -lm' at the _end_ of the link command.
+
+* Menu:
+
+* Complex numbers::
+* Precision::
+* Memory Allocation::
+
+
+File: fftw3.info,  Node: Complex numbers,  Next: Precision,  Prev: Data Types and Files,  Up: Data Types and Files
+
+4.1.1 Complex numbers
+---------------------
+
+The default FFTW interface uses `double' precision for all
+floating-point numbers, and defines a `fftw_complex' type to hold
+complex numbers as:
+
+     typedef double fftw_complex[2];
+   
+   Here, the `[0]' element holds the real part and the `[1]' element
+holds the imaginary part.
+
+   Alternatively, if you have a C compiler (such as `gcc') that
+supports the C99 revision of the ANSI C standard, you can use C's new
+native complex type (which is binary-compatible with the typedef above).
+In particular, if you `#include <complex.h>' _before_ `<fftw3.h>', then
+`fftw_complex' is defined to be the native complex type and you can
+manipulate it with ordinary arithmetic (e.g. `x = y * (3+4*I)', where
+`x' and `y' are `fftw_complex' and `I' is the standard symbol for the
+imaginary unit); 
+
+   C++ has its own `complex<T>' template class, defined in the standard
+`<complex>' header file.  Reportedly, the C++ standards committee has
+recently agreed to mandate that the storage format used for this type
+be binary-compatible with the C99 type, i.e. an array `T[2]' with
+consecutive real `[0]' and imaginary `[1]' parts.  (See report
+`http://www.open-std.org/jtc1/sc22/WG21/docs/papers/2002/n1388.pdf
+WG21/N1388'.)  Although not part of the official standard as of this
+writing, the proposal stated that: "This solution has been tested with
+all current major implementations of the standard library and shown to
+be working."  To the extent that this is true, if you have a variable
+`complex<double> *x', you can pass it directly to FFTW via
+`reinterpret_cast<fftw_complex*>(x)'.  
+
+
+File: fftw3.info,  Node: Precision,  Next: Memory Allocation,  Prev: Complex numbers,  Up: Data Types and Files
+
+4.1.2 Precision
+---------------
+
+You can install single and long-double precision versions of FFTW,
+which replace `double' with `float' and `long double', respectively
+(*note Installation and Customization::).  To use these interfaces, you:
+
+   * Link to the single/long-double libraries; on Unix, `-lfftw3f' or
+     `-lfftw3l' instead of (or in addition to) `-lfftw3'.  (You can
+     link to the different-precision libraries simultaneously.)
+
+   * Include the _same_ `<fftw3.h>' header file.
+
+   * Replace all lowercase instances of `fftw_' with `fftwf_' or
+     `fftwl_' for single or long-double precision, respectively.
+     (`fftw_complex' becomes `fftwf_complex', `fftw_execute' becomes
+     `fftwf_execute', etcetera.)
+
+   * Uppercase names, i.e. names beginning with `FFTW_', remain the
+     same.
+
+   * Replace `double' with `float' or `long double' for subroutine
+     parameters.
+
+
+   Depending upon your compiler and/or hardware, `long double' may not
+be any more precise than `double' (or may not be supported at all,
+although it is standard in C99).  
+
+   We also support using the nonstandard `__float128'
+quadruple-precision type provided by recent versions of `gcc' on 32-
+and 64-bit x86 hardware (*note Installation and Customization::).  To
+use this type, link with `-lfftw3q -lquadmath -lm' (the `libquadmath'
+library provided by `gcc' is needed for quadruple-precision
+trigonometric functions) and use `fftwq_' identifiers.
+
+
+File: fftw3.info,  Node: Memory Allocation,  Prev: Precision,  Up: Data Types and Files
+
+4.1.3 Memory Allocation
+-----------------------
+
+     void *fftw_malloc(size_t n);
+     void fftw_free(void *p);
+
+   These are functions that behave identically to `malloc' and `free',
+except that they guarantee that the returned pointer obeys any special
+alignment restrictions imposed by any algorithm in FFTW (e.g. for SIMD
+acceleration).  *Note SIMD alignment and fftw_malloc::.  
+
+   Data allocated by `fftw_malloc' _must_ be deallocated by `fftw_free'
+and not by the ordinary `free'.
+
+   These routines simply call through to your operating system's
+`malloc' or, if necessary, its aligned equivalent (e.g. `memalign'), so
+you normally need not worry about any significant time or space
+overhead.  You are _not required_ to use them to allocate your data,
+but we strongly recommend it.
+
+   Note: in C++, just as with ordinary `malloc', you must typecast the
+output of `fftw_malloc' to whatever pointer type you are allocating.  
+
+   We also provide the following two convenience functions to allocate
+real and complex arrays with `n' elements, which are equivalent to
+`(double *) fftw_malloc(sizeof(double) * n)' and `(fftw_complex *)
+fftw_malloc(sizeof(fftw_complex) * n)', respectively:
+
+     double *fftw_alloc_real(size_t n);
+     fftw_complex *fftw_alloc_complex(size_t n);
+   
+   The equivalent functions in other precisions allocate arrays of `n'
+elements in that precision.  e.g. `fftwf_alloc_real(n)' is equivalent
+to `(float *) fftwf_malloc(sizeof(float) * n)'.  
+
+
+File: fftw3.info,  Node: Using Plans,  Next: Basic Interface,  Prev: Data Types and Files,  Up: FFTW Reference
+
+4.2 Using Plans
+===============
+
+Plans for all transform types in FFTW are stored as type `fftw_plan'
+(an opaque pointer type), and are created by one of the various
+planning routines described in the following sections.  An `fftw_plan'
+contains all information necessary to compute the transform, including
+the pointers to the input and output arrays.
+
+     void fftw_execute(const fftw_plan plan);
+   
+   This executes the `plan', to compute the corresponding transform on
+the arrays for which it was planned (which must still exist).  The plan
+is not modified, and `fftw_execute' can be called as many times as
+desired.
+
+   To apply a given plan to a different array, you can use the
+new-array execute interface.  *Note New-array Execute Functions::.
+
+   `fftw_execute' (and equivalents) is the only function in FFTW
+guaranteed to be thread-safe; see *note Thread safety::.
+
+   This function:
+     void fftw_destroy_plan(fftw_plan plan);
+   deallocates the `plan' and all its associated data.
+
+   FFTW's planner saves some other persistent data, such as the
+accumulated wisdom and a list of algorithms available in the current
+configuration.  If you want to deallocate all of that and reset FFTW to
+the pristine state it was in when you started your program, you can
+call:
+
+     void fftw_cleanup(void);
+   
+   After calling `fftw_cleanup', all existing plans become undefined,
+and you should not attempt to execute them nor to destroy them.  You can
+however create and execute/destroy new plans, in which case FFTW starts
+accumulating wisdom information again.
+
+   `fftw_cleanup' does not deallocate your plans, however.  To prevent
+memory leaks, you must still call `fftw_destroy_plan' before executing
+`fftw_cleanup'.
+
+   Occasionally, it may useful to know FFTW's internal "cost" metric
+that it uses to compare plans to one another; this cost is proportional
+to an execution time of the plan, in undocumented units, if the plan
+was created with the `FFTW_MEASURE' or other timing-based options, or
+alternatively is a heuristic cost function for `FFTW_ESTIMATE' plans.
+(The cost values of measured and estimated plans are not comparable,
+being in different units.  Also, costs from different FFTW versions or
+the same version compiled differently may not be in the same units.
+Plans created from wisdom have a cost of 0 since no timing measurement
+is performed for them.  Finally, certain problems for which only one
+top-level algorithm was possible may have required no measurements of
+the cost of the whole plan, in which case `fftw_cost' will also return
+0.)  The cost metric for a given plan is returned by:
+
+     double fftw_cost(const fftw_plan plan);
+   
+   The following two routines are provided purely for academic purposes
+(that is, for entertainment).
+
+     void fftw_flops(const fftw_plan plan,
+                     double *add, double *mul, double *fma);
+   
+   Given a `plan', set `add', `mul', and `fma' to an exact count of the
+number of floating-point additions, multiplications, and fused
+multiply-add operations involved in the plan's execution.  The total
+number of floating-point operations (flops) is `add + mul + 2*fma', or
+`add + mul + fma' if the hardware supports fused multiply-add
+instructions (although the number of FMA operations is only approximate
+because of compiler voodoo).  (The number of operations should be an
+integer, but we use `double' to avoid overflowing `int' for large
+transforms; the arguments are of type `double' even for single and
+long-double precision versions of FFTW.)
+
+     void fftw_fprint_plan(const fftw_plan plan, FILE *output_file);
+     void fftw_print_plan(const fftw_plan plan);
+   
+   This outputs a "nerd-readable" representation of the `plan' to the
+given file or to `stdout', respectively.
+
+
+File: fftw3.info,  Node: Basic Interface,  Next: Advanced Interface,  Prev: Using Plans,  Up: FFTW Reference
+
+4.3 Basic Interface
+===================
+
+Recall that the FFTW API is divided into three parts(1): the "basic
+interface" computes a single transform of contiguous data, the "advanced
+interface" computes transforms of multiple or strided arrays, and the
+"guru interface" supports the most general data layouts,
+multiplicities, and strides.  This section describes the the basic
+interface, which we expect to satisfy the needs of most users.
+
+* Menu:
+
+* Complex DFTs::
+* Planner Flags::
+* Real-data DFTs::
+* Real-data DFT Array Format::
+* Real-to-Real Transforms::
+* Real-to-Real Transform Kinds::
+
+   ---------- Footnotes ----------
+
+   (1) Gallia est omnis divisa in partes tres (Julius Caesar).
+
+
+File: fftw3.info,  Node: Complex DFTs,  Next: Planner Flags,  Prev: Basic Interface,  Up: Basic Interface
+
+4.3.1 Complex DFTs
+------------------
+
+     fftw_plan fftw_plan_dft_1d(int n0,
+                                fftw_complex *in, fftw_complex *out,
+                                int sign, unsigned flags);
+     fftw_plan fftw_plan_dft_2d(int n0, int n1,
+                                fftw_complex *in, fftw_complex *out,
+                                int sign, unsigned flags);
+     fftw_plan fftw_plan_dft_3d(int n0, int n1, int n2,
+                                fftw_complex *in, fftw_complex *out,
+                                int sign, unsigned flags);
+     fftw_plan fftw_plan_dft(int rank, const int *n,
+                             fftw_complex *in, fftw_complex *out,
+                             int sign, unsigned flags);
+
+   Plan a complex input/output discrete Fourier transform (DFT) in zero
+or more dimensions, returning an `fftw_plan' (*note Using Plans::).
+
+   Once you have created a plan for a certain transform type and
+parameters, then creating another plan of the same type and parameters,
+but for different arrays, is fast and shares constant data with the
+first plan (if it still exists).
+
+   The planner returns `NULL' if the plan cannot be created.  In the
+standard FFTW distribution, the basic interface is guaranteed to return
+a non-`NULL' plan.  A plan may be `NULL', however, if you are using a
+customized FFTW configuration supporting a restricted set of transforms.
+
+Arguments
+.........
+
+   * `rank' is the rank of the transform (it should be the size of the
+     array `*n'), and can be any non-negative integer.  (*Note Complex
+     Multi-Dimensional DFTs::, for the definition of "rank".)  The
+     `_1d', `_2d', and `_3d' planners correspond to a `rank' of `1',
+     `2', and `3', respectively.  The rank may be zero, which is
+     equivalent to a rank-1 transform of size 1, i.e. a copy of one
+     number from input to output.
+
+   * `n0', `n1', `n2', or `n[0..rank-1]' (as appropriate for each
+     routine) specify the size of the transform dimensions.  They can
+     be any positive integer.
+
+        - Multi-dimensional arrays are stored in row-major order with
+          dimensions: `n0' x `n1'; or `n0' x `n1' x `n2'; or `n[0]' x
+          `n[1]' x ... x `n[rank-1]'.  *Note Multi-dimensional Array
+          Format::.
+
+        - FFTW is best at handling sizes of the form 2^a 3^b 5^c 7^d
+          11^e 13^f, where e+f is either 0 or 1, and the other exponents
+          are arbitrary.  Other sizes are computed by means of a slow,
+          general-purpose algorithm (which nevertheless retains O(n log
+          n)  performance even for prime sizes).  It is possible to
+          customize FFTW for different array sizes; see *note
+          Installation and Customization::.  Transforms whose sizes are
+          powers of 2 are especially fast.
+
+   * `in' and `out' point to the input and output arrays of the
+     transform, which may be the same (yielding an in-place transform).  These
+     arrays are overwritten during planning, unless `FFTW_ESTIMATE' is
+     used in the flags.  (The arrays need not be initialized, but they
+     must be allocated.)
+
+     If `in == out', the transform is "in-place" and the input array is
+     overwritten. If `in != out', the two arrays must not overlap (but
+     FFTW does not check for this condition).
+
+   * `sign' is the sign of the exponent in the formula that defines the
+     Fourier transform.  It can be -1 (= `FFTW_FORWARD') or +1 (=
+     `FFTW_BACKWARD').
+
+   * `flags' is a bitwise OR (`|') of zero or more planner flags, as
+     defined in *note Planner Flags::.
+
+
+   FFTW computes an unnormalized transform: computing a forward
+followed by a backward transform (or vice versa) will result in the
+original data multiplied by the size of the transform (the product of
+the dimensions).  For more information, see *note What FFTW Really
+Computes::.
+
+
+File: fftw3.info,  Node: Planner Flags,  Next: Real-data DFTs,  Prev: Complex DFTs,  Up: Basic Interface
+
+4.3.2 Planner Flags
+-------------------
+
+All of the planner routines in FFTW accept an integer `flags' argument,
+which is a bitwise OR (`|') of zero or more of the flag constants
+defined below.  These flags control the rigor (and time) of the
+planning process, and can also impose (or lift) restrictions on the
+type of transform algorithm that is employed.
+
+   _Important:_ the planner overwrites the input array during planning
+unless a saved plan (*note Wisdom::) is available for that problem, so
+you should initialize your input data after creating the plan.  The
+only exceptions to this are the `FFTW_ESTIMATE' and `FFTW_WISDOM_ONLY'
+flags, as mentioned below.
+
+   In all  cases, if  wisdom is  available for the  given problem  that
+was created  with equal-or-greater  planning rigor,  then the  more
+rigorous wisdom is used.  For example, in `FFTW_ESTIMATE' mode any
+available wisdom is used, whereas  in `FFTW_PATIENT' mode only wisdom
+created in patient or exhaustive mode can be used.  *Note Words of
+Wisdom-Saving Plans::.
+
+Planning-rigor flags
+....................
+
+   * `FFTW_ESTIMATE' specifies that, instead of actual measurements of
+     different algorithms, a simple heuristic is used to pick a
+     (probably sub-optimal) plan quickly.  With this flag, the
+     input/output arrays are not overwritten during planning.
+
+   * `FFTW_MEASURE' tells FFTW to find an optimized plan by actually
+     _computing_ several FFTs and measuring their execution time.
+     Depending on your machine, this can take some time (often a few
+     seconds).  `FFTW_MEASURE' is the default planning option.
+
+   * `FFTW_PATIENT' is like `FFTW_MEASURE', but considers a wider range
+     of algorithms and often produces a "more optimal" plan (especially
+     for large transforms), but at the expense of several times longer
+     planning time (especially for large transforms).
+
+   * `FFTW_EXHAUSTIVE' is like `FFTW_PATIENT', but considers an even
+     wider range of algorithms, including many that we think are
+     unlikely to be fast, to produce the most optimal plan but with a
+     substantially increased planning time.
+
+   * `FFTW_WISDOM_ONLY' is a special planning mode in which the plan is
+     only created if wisdom is available for the given problem, and
+     otherwise a `NULL' plan is returned.  This can be combined with
+     other flags, e.g. `FFTW_WISDOM_ONLY | FFTW_PATIENT' creates a plan
+     only if wisdom is available that was created in `FFTW_PATIENT' or
+     `FFTW_EXHAUSTIVE' mode.  The `FFTW_WISDOM_ONLY' flag is intended
+     for users who need to detect whether wisdom is available; for
+     example, if wisdom is not available one may wish to allocate new
+     arrays for planning so that user data is not overwritten.
+
+
+Algorithm-restriction flags
+...........................
+
+   * `FFTW_DESTROY_INPUT' specifies that an out-of-place transform is
+     allowed to _overwrite its input_ array with arbitrary data; this
+     can sometimes allow more efficient algorithms to be employed.  
+
+   * `FFTW_PRESERVE_INPUT' specifies that an out-of-place transform must
+     _not change its input_ array.  This is ordinarily the _default_,
+     except for c2r and hc2r (i.e. complex-to-real) transforms for
+     which `FFTW_DESTROY_INPUT' is the default.  In the latter cases,
+     passing `FFTW_PRESERVE_INPUT' will attempt to use algorithms that
+     do not destroy the input, at the expense of worse performance; for
+     multi-dimensional c2r transforms, however, no input-preserving
+     algorithms are implemented and the planner will return `NULL' if
+     one is requested.  
+
+   * `FFTW_UNALIGNED' specifies that the algorithm may not impose any
+     unusual alignment requirements on the input/output arrays (i.e. no
+     SIMD may be used).  This flag is normally _not necessary_, since
+     the planner automatically detects misaligned arrays.  The only use
+     for this flag is if you want to use the new-array execute
+     interface to execute a given plan on a different array that may
+     not be aligned like the original.  (Using `fftw_malloc' makes this
+     flag unnecessary even then.)
+
+
+Limiting planning time
+......................
+
+     extern void fftw_set_timelimit(double seconds);
+
+   This function instructs FFTW to spend at most `seconds' seconds
+(approximately) in the planner.  If `seconds == FFTW_NO_TIMELIMIT' (the
+default value, which is negative), then planning time is unbounded.
+Otherwise, FFTW plans with a progressively wider range of algorithms
+until the the given time limit is reached or the given range of
+algorithms is explored, returning the best available plan.  
+
+   For example, specifying `FFTW_PATIENT' first plans in
+`FFTW_ESTIMATE' mode, then in `FFTW_MEASURE' mode, then finally (time
+permitting) in `FFTW_PATIENT'.  If `FFTW_EXHAUSTIVE' is specified
+instead, the planner will further progress to `FFTW_EXHAUSTIVE' mode.
+
+   Note that the `seconds' argument specifies only a rough limit; in
+practice, the planner may use somewhat more time if the time limit is
+reached when the planner is in the middle of an operation that cannot
+be interrupted.  At the very least, the planner will complete planning
+in `FFTW_ESTIMATE' mode (which is thus equivalent to a time limit of 0).
+
+
+File: fftw3.info,  Node: Real-data DFTs,  Next: Real-data DFT Array Format,  Prev: Planner Flags,  Up: Basic Interface
+
+4.3.3 Real-data DFTs
+--------------------
+
+     fftw_plan fftw_plan_dft_r2c_1d(int n0,
+                                    double *in, fftw_complex *out,
+                                    unsigned flags);
+     fftw_plan fftw_plan_dft_r2c_2d(int n0, int n1,
+                                    double *in, fftw_complex *out,
+                                    unsigned flags);
+     fftw_plan fftw_plan_dft_r2c_3d(int n0, int n1, int n2,
+                                    double *in, fftw_complex *out,
+                                    unsigned flags);
+     fftw_plan fftw_plan_dft_r2c(int rank, const int *n,
+                                 double *in, fftw_complex *out,
+                                 unsigned flags);
+
+   Plan a real-input/complex-output discrete Fourier transform (DFT) in
+zero or more dimensions, returning an `fftw_plan' (*note Using Plans::).
+
+   Once you have created a plan for a certain transform type and
+parameters, then creating another plan of the same type and parameters,
+but for different arrays, is fast and shares constant data with the
+first plan (if it still exists).
+
+   The planner returns `NULL' if the plan cannot be created.  A
+non-`NULL' plan is always returned by the basic interface unless you
+are using a customized FFTW configuration supporting a restricted set
+of transforms, or if you use the `FFTW_PRESERVE_INPUT' flag with a
+multi-dimensional out-of-place c2r transform (see below).
+
+Arguments
+.........
+
+   * `rank' is the rank of the transform (it should be the size of the
+     array `*n'), and can be any non-negative integer.  (*Note Complex
+     Multi-Dimensional DFTs::, for the definition of "rank".)  The
+     `_1d', `_2d', and `_3d' planners correspond to a `rank' of `1',
+     `2', and `3', respectively.  The rank may be zero, which is
+     equivalent to a rank-1 transform of size 1, i.e. a copy of one
+     real number (with zero imaginary part) from input to output.
+
+   * `n0', `n1', `n2', or `n[0..rank-1]', (as appropriate for each
+     routine) specify the size of the transform dimensions.  They can
+     be any positive integer.  This is different in general from the
+     _physical_ array dimensions, which are described in *note
+     Real-data DFT Array Format::.
+
+        - FFTW is best at handling sizes of the form 2^a 3^b 5^c 7^d
+          11^e 13^f, where e+f is either 0 or 1, and the other exponents
+          are arbitrary.  Other sizes are computed by means of a slow,
+          general-purpose algorithm (which nevertheless retains O(n log
+          n)  performance even for prime sizes).  (It is possible to
+          customize FFTW for different array sizes; see *note
+          Installation and Customization::.)  Transforms whose sizes
+          are powers of 2 are especially fast, and it is generally
+          beneficial for the _last_ dimension of an r2c/c2r transform
+          to be _even_.
+
+   * `in' and `out' point to the input and output arrays of the
+     transform, which may be the same (yielding an in-place transform).  These
+     arrays are overwritten during planning, unless `FFTW_ESTIMATE' is
+     used in the flags.  (The arrays need not be initialized, but they
+     must be allocated.)  For an in-place transform, it is important to
+     remember that the real array will require padding, described in
+     *note Real-data DFT Array Format::.  
+
+   * `flags' is a bitwise OR (`|') of zero or more planner flags, as
+     defined in *note Planner Flags::.
+
+
+   The inverse transforms, taking complex input (storing the
+non-redundant half of a logically Hermitian array) to real output, are
+given by:
+
+     fftw_plan fftw_plan_dft_c2r_1d(int n0,
+                                    fftw_complex *in, double *out,
+                                    unsigned flags);
+     fftw_plan fftw_plan_dft_c2r_2d(int n0, int n1,
+                                    fftw_complex *in, double *out,
+                                    unsigned flags);
+     fftw_plan fftw_plan_dft_c2r_3d(int n0, int n1, int n2,
+                                    fftw_complex *in, double *out,
+                                    unsigned flags);
+     fftw_plan fftw_plan_dft_c2r(int rank, const int *n,
+                                 fftw_complex *in, double *out,
+                                 unsigned flags);
+   
+   The arguments are the same as for the r2c transforms, except that the
+input and output data formats are reversed.
+
+   FFTW computes an unnormalized transform: computing an r2c followed
+by a c2r transform (or vice versa) will result in the original data
+multiplied by the size of the transform (the product of the logical
+dimensions).  An r2c transform produces the same output as a
+`FFTW_FORWARD' complex DFT of the same input, and a c2r transform is
+correspondingly equivalent to `FFTW_BACKWARD'.  For more information,
+see *note What FFTW Really Computes::.
+
+
+File: fftw3.info,  Node: Real-data DFT Array Format,  Next: Real-to-Real Transforms,  Prev: Real-data DFTs,  Up: Basic Interface
+
+4.3.4 Real-data DFT Array Format
+--------------------------------
+
+The output of a DFT of real data (r2c) contains symmetries that, in
+principle, make half of the outputs redundant (*note What FFTW Really
+Computes::).  (Similarly for the input of an inverse c2r transform.)  In
+practice, it is not possible to entirely realize these savings in an
+efficient and understandable format that generalizes to
+multi-dimensional transforms.  Instead, the output of the r2c
+transforms is _slightly_ over half of the output of the corresponding
+complex transform.  We do not "pack" the data in any way, but store it
+as an ordinary array of `fftw_complex' values.  In fact, this data is
+simply a subsection of what would be the array in the corresponding
+complex transform.
+
+   Specifically, for a real transform of d (= `rank') dimensions n[0] x
+n[1] x n[2] x ... x n[d-1] , the complex data is an n[0] x n[1] x n[2]
+x ... x (n[d-1]/2 + 1)  array of `fftw_complex' values in row-major
+order (with the division rounded down).  That is, we only store the
+_lower_ half (non-negative frequencies), plus one element, of the last
+dimension of the data from the ordinary complex transform.  (We could
+have instead taken half of any other dimension, but implementation
+turns out to be simpler if the last, contiguous, dimension is used.)
+
+   For an out-of-place transform, the real data is simply an array with
+physical dimensions n[0] x n[1] x n[2] x ... x n[d-1]  in row-major
+order.
+
+   For an in-place transform, some complications arise since the
+complex data is slightly larger than the real data.  In this case, the
+final dimension of the real data must be _padded_ with extra values to
+accommodate the size of the complex data--two extra if the last
+dimension is even and one if it is odd.  That is, the last dimension of
+the real data must physically contain 2 * (n[d-1]/2+1) `double' values
+(exactly enough to hold the complex data).  This physical array size
+does not, however, change the _logical_ array size--only n[d-1] values
+are actually stored in the last dimension, and n[d-1] is the last
+dimension passed to the planner.
+
+
+File: fftw3.info,  Node: Real-to-Real Transforms,  Next: Real-to-Real Transform Kinds,  Prev: Real-data DFT Array Format,  Up: Basic Interface
+
+4.3.5 Real-to-Real Transforms
+-----------------------------
+
+     fftw_plan fftw_plan_r2r_1d(int n, double *in, double *out,
+                                fftw_r2r_kind kind, unsigned flags);
+     fftw_plan fftw_plan_r2r_2d(int n0, int n1, double *in, double *out,
+                                fftw_r2r_kind kind0, fftw_r2r_kind kind1,
+                                unsigned flags);
+     fftw_plan fftw_plan_r2r_3d(int n0, int n1, int n2,
+                                double *in, double *out,
+                                fftw_r2r_kind kind0,
+                                fftw_r2r_kind kind1,
+                                fftw_r2r_kind kind2,
+                                unsigned flags);
+     fftw_plan fftw_plan_r2r(int rank, const int *n, double *in, double *out,
+                             const fftw_r2r_kind *kind, unsigned flags);
+
+   Plan a real input/output (r2r) transform of various kinds in zero or
+more dimensions, returning an `fftw_plan' (*note Using Plans::).
+
+   Once you have created a plan for a certain transform type and
+parameters, then creating another plan of the same type and parameters,
+but for different arrays, is fast and shares constant data with the
+first plan (if it still exists).
+
+   The planner returns `NULL' if the plan cannot be created.  A
+non-`NULL' plan is always returned by the basic interface unless you
+are using a customized FFTW configuration supporting a restricted set
+of transforms, or for size-1 `FFTW_REDFT00' kinds (which are not
+defined).  
+
+Arguments
+.........
+
+   * `rank' is the dimensionality of the transform (it should be the
+     size of the arrays `*n' and `*kind'), and can be any non-negative
+     integer.  The `_1d', `_2d', and `_3d' planners correspond to a
+     `rank' of `1', `2', and `3', respectively.  A `rank' of zero is
+     equivalent to a copy of one number from input to output.
+
+   * `n', or `n0'/`n1'/`n2', or `n[rank]', respectively, gives the
+     (physical) size of the transform dimensions.  They can be any
+     positive integer.
+
+        - Multi-dimensional arrays are stored in row-major order with
+          dimensions: `n0' x `n1'; or `n0' x `n1' x `n2'; or `n[0]' x
+          `n[1]' x ... x `n[rank-1]'.  *Note Multi-dimensional Array
+          Format::.
+
+        - FFTW is generally best at handling sizes of the form 2^a 3^b
+          5^c 7^d 11^e 13^f, where e+f is either 0 or 1, and the other
+          exponents are arbitrary.  Other sizes are computed by means
+          of a slow, general-purpose algorithm (which nevertheless
+          retains O(n log n)  performance even for prime sizes).  (It
+          is possible to customize FFTW for different array sizes; see
+          *note Installation and Customization::.)  Transforms whose
+          sizes are powers of 2 are especially fast.
+
+        - For a `REDFT00' or `RODFT00' transform kind in a dimension of
+          size n, it is n-1 or n+1, respectively, that should be
+          factorizable in the above form.
+
+   * `in' and `out' point to the input and output arrays of the
+     transform, which may be the same (yielding an in-place transform).  These
+     arrays are overwritten during planning, unless `FFTW_ESTIMATE' is
+     used in the flags.  (The arrays need not be initialized, but they
+     must be allocated.)
+
+   * `kind', or `kind0'/`kind1'/`kind2', or `kind[rank]', is the kind
+     of r2r transform used for the corresponding dimension.  The valid
+     kind constants are described in *note Real-to-Real Transform
+     Kinds::.  In a multi-dimensional transform, what is computed is
+     the separable product formed by taking each transform kind along
+     the corresponding dimension, one dimension after another.
+
+   * `flags' is a bitwise OR (`|') of zero or more planner flags, as
+     defined in *note Planner Flags::.
+
+
+
+File: fftw3.info,  Node: Real-to-Real Transform Kinds,  Prev: Real-to-Real Transforms,  Up: Basic Interface
+
+4.3.6 Real-to-Real Transform Kinds
+----------------------------------
+
+FFTW currently supports 11 different r2r transform kinds, specified by
+one of the constants below.  For the precise definitions of these
+transforms, see *note What FFTW Really Computes::.  For a more
+colloquial introduction to these transform kinds, see *note More DFTs
+of Real Data::.
+
+   For dimension of size `n', there is a corresponding "logical"
+dimension `N' that determines the normalization (and the optimal
+factorization); the formula for `N' is given for each kind below.
+Also, with each transform kind is listed its corrsponding inverse
+transform.  FFTW computes unnormalized transforms: a transform followed
+by its inverse will result in the original data multiplied by `N' (or
+the product of the `N''s for each dimension, in multi-dimensions).  
+
+   * `FFTW_R2HC' computes a real-input DFT with output in "halfcomplex"
+     format, i.e. real and imaginary parts for a transform of size `n'
+     stored as: r0, r1, r2, r(n/2), i((n+1)/2-1), ..., i2, i1 (Logical
+     `N=n', inverse is `FFTW_HC2R'.)
+
+   * `FFTW_HC2R' computes the reverse of `FFTW_R2HC', above.  (Logical
+     `N=n', inverse is `FFTW_R2HC'.)
+
+   * `FFTW_DHT' computes a discrete Hartley transform.  (Logical `N=n',
+     inverse is `FFTW_DHT'.)  
+
+   * `FFTW_REDFT00' computes an REDFT00 transform, i.e. a DCT-I.
+     (Logical `N=2*(n-1)', inverse is `FFTW_REDFT00'.)  
+
+   * `FFTW_REDFT10' computes an REDFT10 transform, i.e. a DCT-II
+     (sometimes called "the" DCT).  (Logical `N=2*n', inverse is
+     `FFTW_REDFT01'.)
+
+   * `FFTW_REDFT01' computes an REDFT01 transform, i.e. a DCT-III
+     (sometimes called "the" IDCT, being the inverse of DCT-II).
+     (Logical `N=2*n', inverse is `FFTW_REDFT=10'.)  
+
+   * `FFTW_REDFT11' computes an REDFT11 transform, i.e. a DCT-IV.
+     (Logical `N=2*n', inverse is `FFTW_REDFT11'.)
+
+   * `FFTW_RODFT00' computes an RODFT00 transform, i.e. a DST-I.
+     (Logical `N=2*(n+1)', inverse is `FFTW_RODFT00'.)  
+
+   * `FFTW_RODFT10' computes an RODFT10 transform, i.e. a DST-II.
+     (Logical `N=2*n', inverse is `FFTW_RODFT01'.)
+
+   * `FFTW_RODFT01' computes an RODFT01 transform, i.e. a DST-III.
+     (Logical `N=2*n', inverse is `FFTW_RODFT=10'.)
+
+   * `FFTW_RODFT11' computes an RODFT11 transform, i.e. a DST-IV.
+     (Logical `N=2*n', inverse is `FFTW_RODFT11'.)
+
+
+
+File: fftw3.info,  Node: Advanced Interface,  Next: Guru Interface,  Prev: Basic Interface,  Up: FFTW Reference
+
+4.4 Advanced Interface
+======================
+
+FFTW's "advanced" interface supplements the basic interface with four
+new planner routines, providing a new level of flexibility: you can plan
+a transform of multiple arrays simultaneously, operate on non-contiguous
+(strided) data, and transform a subset of a larger multi-dimensional
+array.  Other than these additional features, the planner operates in
+the same fashion as in the basic interface, and the resulting
+`fftw_plan' is used in the same way (*note Using Plans::).
+
+* Menu:
+
+* Advanced Complex DFTs::
+* Advanced Real-data DFTs::
+* Advanced Real-to-real Transforms::
+
+
+File: fftw3.info,  Node: Advanced Complex DFTs,  Next: Advanced Real-data DFTs,  Prev: Advanced Interface,  Up: Advanced Interface
+
+4.4.1 Advanced Complex DFTs
+---------------------------
+
+     fftw_plan fftw_plan_many_dft(int rank, const int *n, int howmany,
+                                  fftw_complex *in, const int *inembed,
+                                  int istride, int idist,
+                                  fftw_complex *out, const int *onembed,
+                                  int ostride, int odist,
+                                  int sign, unsigned flags);
+
+   This routine plans multiple multidimensional complex DFTs, and it
+extends the `fftw_plan_dft' routine (*note Complex DFTs::) to compute
+`howmany' transforms, each having rank `rank' and size `n'.  In
+addition, the transform data need not be contiguous, but it may be laid
+out in memory with an arbitrary stride.  To account for these
+possibilities, `fftw_plan_many_dft' adds the new parameters `howmany',
+{`i',`o'}`nembed', {`i',`o'}`stride', and {`i',`o'}`dist'.  The FFTW
+basic interface (*note Complex DFTs::) provides routines specialized
+for ranks 1, 2, and 3, but the advanced interface handles only the
+general-rank case.
+
+   `howmany' is the number of transforms to compute.  The resulting
+plan computes `howmany' transforms, where the input of the `k'-th
+transform is at location `in+k*idist' (in C pointer arithmetic), and
+its output is at location `out+k*odist'.  Plans obtained in this way
+can often be faster than calling FFTW multiple times for the individual
+transforms.  The basic `fftw_plan_dft' interface corresponds to
+`howmany=1' (in which case the `dist' parameters are ignored).  
+
+   Each of the `howmany' transforms has rank `rank' and size `n', as in
+the basic interface.  In addition, the advanced interface allows the
+input and output arrays of each transform to be row-major subarrays of
+larger rank-`rank' arrays, described by `inembed' and `onembed'
+parameters, respectively.  {`i',`o'}`nembed' must be arrays of length
+`rank', and `n' should be elementwise less than or equal to
+{`i',`o'}`nembed'.  Passing `NULL' for an `nembed' parameter is
+equivalent to passing `n' (i.e. same physical and logical dimensions,
+as in the basic interface.)
+
+   The `stride' parameters indicate that the `j'-th element of the
+input or output arrays is located at `j*istride' or `j*ostride',
+respectively.  (For a multi-dimensional array, `j' is the ordinary
+row-major index.)  When combined with the `k'-th transform in a
+`howmany' loop, from above, this means that the (`j',`k')-th element is
+at `j*stride+k*dist'.  (The basic `fftw_plan_dft' interface corresponds
+to a stride of 1.)  
+
+   For in-place transforms, the input and output `stride' and `dist'
+parameters should be the same; otherwise, the planner may return `NULL'.
+
+   Arrays `n', `inembed', and `onembed' are not used after this
+function returns.  You can safely free or reuse them.
+
+   *Examples*: One transform of one 5 by 6 array contiguous in memory:
+        int rank = 2;
+        int n[] = {5, 6};
+        int howmany = 1;
+        int idist = odist = 0; /* unused because howmany = 1 */
+        int istride = ostride = 1; /* array is contiguous in memory */
+        int *inembed = n, *onembed = n;
+
+   Transform of three 5 by 6 arrays, each contiguous in memory, stored
+in memory one after another:
+        int rank = 2;
+        int n[] = {5, 6};
+        int howmany = 3;
+        int idist = odist = n[0]*n[1]; /* = 30, the distance in memory
+                                          between the first element
+                                          of the first array and the
+                                          first element of the second array */
+        int istride = ostride = 1; /* array is contiguous in memory */
+        int *inembed = n, *onembed = n;
+
+   Transform each column of a 2d array with 10 rows and 3 columns:
+        int rank = 1; /* not 2: we are computing 1d transforms */
+        int n[] = {10}; /* 1d transforms of length 10 */
+        int howmany = 3;
+        int idist = odist = 1;
+        int istride = ostride = 3; /* distance between two elements in
+                                      the same column */
+        int *inembed = n, *onembed = n;
+
+
+File: fftw3.info,  Node: Advanced Real-data DFTs,  Next: Advanced Real-to-real Transforms,  Prev: Advanced Complex DFTs,  Up: Advanced Interface
+
+4.4.2 Advanced Real-data DFTs
+-----------------------------
+
+     fftw_plan fftw_plan_many_dft_r2c(int rank, const int *n, int howmany,
+                                      double *in, const int *inembed,
+                                      int istride, int idist,
+                                      fftw_complex *out, const int *onembed,
+                                      int ostride, int odist,
+                                      unsigned flags);
+     fftw_plan fftw_plan_many_dft_c2r(int rank, const int *n, int howmany,
+                                      fftw_complex *in, const int *inembed,
+                                      int istride, int idist,
+                                      double *out, const int *onembed,
+                                      int ostride, int odist,
+                                      unsigned flags);
+
+   Like `fftw_plan_many_dft', these two functions add `howmany',
+`nembed', `stride', and `dist' parameters to the `fftw_plan_dft_r2c'
+and `fftw_plan_dft_c2r' functions, but otherwise behave the same as the
+basic interface.
+
+   The interpretation of `howmany', `stride', and `dist' are the same
+as for `fftw_plan_many_dft', above.  Note that the `stride' and `dist'
+for the real array are in units of `double', and for the complex array
+are in units of `fftw_complex'.
+
+   If an `nembed' parameter is `NULL', it is interpreted as what it
+would be in the basic interface, as described in *note Real-data DFT
+Array Format::.  That is, for the complex array the size is assumed to
+be the same as `n', but with the last dimension cut roughly in half.
+For the real array, the size is assumed to be `n' if the transform is
+out-of-place, or `n' with the last dimension "padded" if the transform
+is in-place.
+
+   If an `nembed' parameter is non-`NULL', it is interpreted as the
+physical size of the corresponding array, in row-major order, just as
+for `fftw_plan_many_dft'.  In this case, each dimension of `nembed'
+should be `>=' what it would be in the basic interface (e.g. the halved
+or padded `n').
+
+   Arrays `n', `inembed', and `onembed' are not used after this
+function returns.  You can safely free or reuse them.
+
+
+File: fftw3.info,  Node: Advanced Real-to-real Transforms,  Prev: Advanced Real-data DFTs,  Up: Advanced Interface
+
+4.4.3 Advanced Real-to-real Transforms
+--------------------------------------
+
+     fftw_plan fftw_plan_many_r2r(int rank, const int *n, int howmany,
+                                  double *in, const int *inembed,
+                                  int istride, int idist,
+                                  double *out, const int *onembed,
+                                  int ostride, int odist,
+                                  const fftw_r2r_kind *kind, unsigned flags);
+
+   Like `fftw_plan_many_dft', this functions adds `howmany', `nembed',
+`stride', and `dist' parameters to the `fftw_plan_r2r' function, but
+otherwise behave the same as the basic interface.  The interpretation
+of those additional parameters are the same as for
+`fftw_plan_many_dft'.  (Of course, the `stride' and `dist' parameters
+are now in units of `double', not `fftw_complex'.)
+
+   Arrays `n', `inembed', `onembed', and `kind' are not used after this
+function returns.  You can safely free or reuse them.
+
+
+File: fftw3.info,  Node: Guru Interface,  Next: New-array Execute Functions,  Prev: Advanced Interface,  Up: FFTW Reference
+
+4.5 Guru Interface
+==================
+
+The "guru" interface to FFTW is intended to expose as much as possible
+of the flexibility in the underlying FFTW architecture.  It allows one
+to compute multi-dimensional "vectors" (loops) of multi-dimensional
+transforms, where each vector/transform dimension has an independent
+size and stride.  One can also use more general complex-number formats,
+e.g. separate real and imaginary arrays.
+
+   For those users who require the flexibility of the guru interface,
+it is important that they pay special attention to the documentation
+lest they shoot themselves in the foot.
+
+* Menu:
+
+* Interleaved and split arrays::
+* Guru vector and transform sizes::
+* Guru Complex DFTs::
+* Guru Real-data DFTs::
+* Guru Real-to-real Transforms::
+* 64-bit Guru Interface::
+
+
+File: fftw3.info,  Node: Interleaved and split arrays,  Next: Guru vector and transform sizes,  Prev: Guru Interface,  Up: Guru Interface
+
+4.5.1 Interleaved and split arrays
+----------------------------------
+
+The guru interface supports two representations of complex numbers,
+which we call the interleaved and the split format.
+
+   The "interleaved" format is the same one used by the basic and
+advanced interfaces, and it is documented in *note Complex numbers::.
+In the interleaved format, you provide pointers to the real part of a
+complex number, and the imaginary part understood to be stored in the
+next memory location.  
+
+   The "split" format allows separate pointers to the real and
+imaginary parts of a complex array.  
+
+   Technically, the interleaved format is redundant, because you can
+always express an interleaved array in terms of a split array with
+appropriate pointers and strides.  On the other hand, the interleaved
+format is simpler to use, and it is common in practice.  Hence, FFTW
+supports it as a special case.
+
+
+File: fftw3.info,  Node: Guru vector and transform sizes,  Next: Guru Complex DFTs,  Prev: Interleaved and split arrays,  Up: Guru Interface
+
+4.5.2 Guru vector and transform sizes
+-------------------------------------
+
+The guru interface introduces one basic new data structure,
+`fftw_iodim', that is used to specify sizes and strides for
+multi-dimensional transforms and vectors:
+
+     typedef struct {
+          int n;
+          int is;
+          int os;
+     } fftw_iodim;
+   
+   Here, `n' is the size of the dimension, and `is' and `os' are the
+strides of that dimension for the input and output arrays.  (The stride
+is the separation of consecutive elements along this dimension.)
+
+   The meaning of the stride parameter depends on the type of the array
+that the stride refers to.  _If the array is interleaved complex,
+strides are expressed in units of complex numbers (`fftw_complex').  If
+the array is split complex or real, strides are expressed in units of
+real numbers (`double')._  This convention is consistent with the usual
+pointer arithmetic in the C language.  An interleaved array is denoted
+by a pointer `p' to `fftw_complex', so that `p+1' points to the next
+complex number.  Split arrays are denoted by pointers to `double', in
+which case pointer arithmetic operates in units of `sizeof(double)'.  
+
+   The guru planner interfaces all take a (`rank', `dims[rank]') pair
+describing the transform size, and a (`howmany_rank',
+`howmany_dims[howmany_rank]') pair describing the "vector" size (a
+multi-dimensional loop of transforms to perform), where `dims' and
+`howmany_dims' are arrays of `fftw_iodim'.
+
+   For example, the `howmany' parameter in the advanced complex-DFT
+interface corresponds to `howmany_rank' = 1, `howmany_dims[0].n' =
+`howmany', `howmany_dims[0].is' = `idist', and `howmany_dims[0].os' =
+`odist'.  (To compute a single transform, you can just use
+`howmany_rank' = 0.)
+
+   A row-major multidimensional array with dimensions `n[rank]' (*note
+Row-major Format::) corresponds to `dims[i].n' = `n[i]' and the
+recurrence `dims[i].is' = `n[i+1] * dims[i+1].is' (similarly for `os').
+The stride of the last (`i=rank-1') dimension is the overall stride of
+the array.  e.g. to be equivalent to the advanced complex-DFT
+interface, you would have `dims[rank-1].is' = `istride' and
+`dims[rank-1].os' = `ostride'.  
+
+   In general, we only guarantee FFTW to return a non-`NULL' plan if
+the vector and transform dimensions correspond to a set of distinct
+indices, and for in-place transforms the input/output strides should be
+the same.
+
+
+File: fftw3.info,  Node: Guru Complex DFTs,  Next: Guru Real-data DFTs,  Prev: Guru vector and transform sizes,  Up: Guru Interface
+
+4.5.3 Guru Complex DFTs
+-----------------------
+
+     fftw_plan fftw_plan_guru_dft(
+          int rank, const fftw_iodim *dims,
+          int howmany_rank, const fftw_iodim *howmany_dims,
+          fftw_complex *in, fftw_complex *out,
+          int sign, unsigned flags);
+
+     fftw_plan fftw_plan_guru_split_dft(
+          int rank, const fftw_iodim *dims,
+          int howmany_rank, const fftw_iodim *howmany_dims,
+          double *ri, double *ii, double *ro, double *io,
+          unsigned flags);
+
+   These two functions plan a complex-data, multi-dimensional DFT for
+the interleaved and split format, respectively.  Transform dimensions
+are given by (`rank', `dims') over a multi-dimensional vector (loop) of
+dimensions (`howmany_rank', `howmany_dims').  `dims' and `howmany_dims'
+should point to `fftw_iodim' arrays of length `rank' and
+`howmany_rank', respectively.
+
+   `flags' is a bitwise OR (`|') of zero or more planner flags, as
+defined in *note Planner Flags::.
+
+   In the `fftw_plan_guru_dft' function, the pointers `in' and `out'
+point to the interleaved input and output arrays, respectively.  The
+sign can be either -1 (= `FFTW_FORWARD') or +1 (= `FFTW_BACKWARD').  If
+the pointers are equal, the transform is in-place.
+
+   In the `fftw_plan_guru_split_dft' function, `ri' and `ii' point to
+the real and imaginary input arrays, and `ro' and `io' point to the
+real and imaginary output arrays.  The input and output pointers may be
+the same, indicating an in-place transform.  For example, for
+`fftw_complex' pointers `in' and `out', the corresponding parameters
+are:
+
+     ri = (double *) in;
+     ii = (double *) in + 1;
+     ro = (double *) out;
+     io = (double *) out + 1;
+
+   Because `fftw_plan_guru_split_dft' accepts split arrays, strides are
+expressed in units of `double'.  For a contiguous `fftw_complex' array,
+the overall stride of the transform should be 2, the distance between
+consecutive real parts or between consecutive imaginary parts; see
+*note Guru vector and transform sizes::.  Note that the dimension
+strides are applied equally to the real and imaginary parts; real and
+imaginary arrays with different strides are not supported.
+
+   There is no `sign' parameter in `fftw_plan_guru_split_dft'.  This
+function always plans for an `FFTW_FORWARD' transform.  To plan for an
+`FFTW_BACKWARD' transform, you can exploit the identity that the
+backwards DFT is equal to the forwards DFT with the real and imaginary
+parts swapped.  For example, in the case of the `fftw_complex' arrays
+above, the `FFTW_BACKWARD' transform is computed by the parameters:
+
+     ri = (double *) in + 1;
+     ii = (double *) in;
+     ro = (double *) out + 1;
+     io = (double *) out;
+
+
+File: fftw3.info,  Node: Guru Real-data DFTs,  Next: Guru Real-to-real Transforms,  Prev: Guru Complex DFTs,  Up: Guru Interface
+
+4.5.4 Guru Real-data DFTs
+-------------------------
+
+     fftw_plan fftw_plan_guru_dft_r2c(
+          int rank, const fftw_iodim *dims,
+          int howmany_rank, const fftw_iodim *howmany_dims,
+          double *in, fftw_complex *out,
+          unsigned flags);
+
+     fftw_plan fftw_plan_guru_split_dft_r2c(
+          int rank, const fftw_iodim *dims,
+          int howmany_rank, const fftw_iodim *howmany_dims,
+          double *in, double *ro, double *io,
+          unsigned flags);
+
+     fftw_plan fftw_plan_guru_dft_c2r(
+          int rank, const fftw_iodim *dims,
+          int howmany_rank, const fftw_iodim *howmany_dims,
+          fftw_complex *in, double *out,
+          unsigned flags);
+
+     fftw_plan fftw_plan_guru_split_dft_c2r(
+          int rank, const fftw_iodim *dims,
+          int howmany_rank, const fftw_iodim *howmany_dims,
+          double *ri, double *ii, double *out,
+          unsigned flags);
+
+   Plan a real-input (r2c) or real-output (c2r), multi-dimensional DFT
+with transform dimensions given by (`rank', `dims') over a
+multi-dimensional vector (loop) of dimensions (`howmany_rank',
+`howmany_dims').  `dims' and `howmany_dims' should point to
+`fftw_iodim' arrays of length `rank' and `howmany_rank', respectively.
+As for the basic and advanced interfaces, an r2c transform is
+`FFTW_FORWARD' and a c2r transform is `FFTW_BACKWARD'.
+
+   The _last_ dimension of `dims' is interpreted specially: that
+dimension of the real array has size `dims[rank-1].n', but that
+dimension of the complex array has size `dims[rank-1].n/2+1' (division
+rounded down).  The strides, on the other hand, are taken to be exactly
+as specified.  It is up to the user to specify the strides
+appropriately for the peculiar dimensions of the data, and we do not
+guarantee that the planner will succeed (return non-`NULL') for any
+dimensions other than those described in *note Real-data DFT Array
+Format:: and generalized in *note Advanced Real-data DFTs::.  (That is,
+for an in-place transform, each individual dimension should be able to
+operate in place.)  
+
+   `in' and `out' point to the input and output arrays for r2c and c2r
+transforms, respectively.  For split arrays, `ri' and `ii' point to the
+real and imaginary input arrays for a c2r transform, and `ro' and `io'
+point to the real and imaginary output arrays for an r2c transform.
+`in' and `ro' or `ri' and `out' may be the same, indicating an in-place
+transform.   (In-place transforms where `in' and `io' or `ii' and `out'
+are the same are not currently supported.)
+
+   `flags' is a bitwise OR (`|') of zero or more planner flags, as
+defined in *note Planner Flags::.
+
+   In-place transforms of rank greater than 1 are currently only
+supported for interleaved arrays.  For split arrays, the planner will
+return `NULL'.  
+
+
+File: fftw3.info,  Node: Guru Real-to-real Transforms,  Next: 64-bit Guru Interface,  Prev: Guru Real-data DFTs,  Up: Guru Interface
+
+4.5.5 Guru Real-to-real Transforms
+----------------------------------
+
+     fftw_plan fftw_plan_guru_r2r(int rank, const fftw_iodim *dims,
+                                  int howmany_rank,
+                                  const fftw_iodim *howmany_dims,
+                                  double *in, double *out,
+                                  const fftw_r2r_kind *kind,
+                                  unsigned flags);
+
+   Plan a real-to-real (r2r) multi-dimensional `FFTW_FORWARD' transform
+with transform dimensions given by (`rank', `dims') over a
+multi-dimensional vector (loop) of dimensions (`howmany_rank',
+`howmany_dims').  `dims' and `howmany_dims' should point to
+`fftw_iodim' arrays of length `rank' and `howmany_rank', respectively.
+
+   The transform kind of each dimension is given by the `kind'
+parameter, which should point to an array of length `rank'.  Valid
+`fftw_r2r_kind' constants are given in *note Real-to-Real Transform
+Kinds::.
+
+   `in' and `out' point to the real input and output arrays; they may
+be the same, indicating an in-place transform.
+
+   `flags' is a bitwise OR (`|') of zero or more planner flags, as
+defined in *note Planner Flags::.
+
+
+File: fftw3.info,  Node: 64-bit Guru Interface,  Prev: Guru Real-to-real Transforms,  Up: Guru Interface
+
+4.5.6 64-bit Guru Interface
+---------------------------
+
+When compiled in 64-bit mode on a 64-bit architecture (where addresses
+are 64 bits wide), FFTW uses 64-bit quantities internally for all
+transform sizes, strides, and so on--you don't have to do anything
+special to exploit this.  However, in the ordinary FFTW interfaces, you
+specify the transform size by an `int' quantity, which is normally only
+32 bits wide.  This means that, even though FFTW is using 64-bit sizes
+internally, you cannot specify a single transform dimension larger than
+2^31-1 numbers.
+
+   We expect that few users will require transforms larger than this,
+but, for those who do, we provide a 64-bit version of the guru
+interface in which all sizes are specified as integers of type
+`ptrdiff_t' instead of `int'.  (`ptrdiff_t' is a signed integer type
+defined by the C standard to be wide enough to represent address
+differences, and thus must be at least 64 bits wide on a 64-bit
+machine.)  We stress that there is _no performance advantage_ to using
+this interface--the same internal FFTW code is employed regardless--and
+it is only necessary if you want to specify very large transform sizes.  
+
+   In particular, the 64-bit guru interface is a set of planner routines
+that are exactly the same as the guru planner routines, except that
+they are named with `guru64' instead of `guru' and they take arguments
+of type `fftw_iodim64' instead of `fftw_iodim'.  For example, instead
+of `fftw_plan_guru_dft', we have `fftw_plan_guru64_dft'.
+
+     fftw_plan fftw_plan_guru64_dft(
+          int rank, const fftw_iodim64 *dims,
+          int howmany_rank, const fftw_iodim64 *howmany_dims,
+          fftw_complex *in, fftw_complex *out,
+          int sign, unsigned flags);
+   
+   The `fftw_iodim64' type is similar to `fftw_iodim', with the same
+interpretation, except that it uses type `ptrdiff_t' instead of type
+`int'.
+
+     typedef struct {
+          ptrdiff_t n;
+          ptrdiff_t is;
+          ptrdiff_t os;
+     } fftw_iodim64;
+   
+   Every other `fftw_plan_guru' function also has a `fftw_plan_guru64'
+equivalent, but we do not repeat their documentation here since they
+are identical to the 32-bit versions except as noted above.
+
+
+File: fftw3.info,  Node: New-array Execute Functions,  Next: Wisdom,  Prev: Guru Interface,  Up: FFTW Reference
+
+4.6 New-array Execute Functions
+===============================
+
+Normally, one executes a plan for the arrays with which the plan was
+created, by calling `fftw_execute(plan)' as described in *note Using
+Plans::.  However, it is possible for sophisticated users to apply a
+given plan to a _different_ array using the "new-array execute"
+functions detailed below, provided that the following conditions are
+met:
+
+   * The array size, strides, etcetera are the same (since those are
+     set by the plan).
+
+   * The input and output arrays are the same (in-place) or different
+     (out-of-place) if the plan was originally created to be in-place or
+     out-of-place, respectively.
+
+   * For split arrays, the separations between the real and imaginary
+     parts, `ii-ri' and `io-ro', are the same as they were for the
+     input and output arrays when the plan was created.  (This
+     condition is automatically satisfied for interleaved arrays.)
+
+   * The "alignment" of the new input/output arrays is the same as that
+     of the input/output arrays when the plan was created, unless the
+     plan was created with the `FFTW_UNALIGNED' flag.  Here, the
+     alignment is a platform-dependent quantity (for example, it is the
+     address modulo 16 if SSE SIMD instructions are used, but the
+     address modulo 4 for non-SIMD single-precision FFTW on the same
+     machine).  In general, only arrays allocated with `fftw_malloc'
+     are guaranteed to be equally aligned (*note SIMD alignment and
+     fftw_malloc::).
+
+
+   The alignment issue is especially critical, because if you don't use
+`fftw_malloc' then you may have little control over the alignment of
+arrays in memory.  For example, neither the C++ `new' function nor the
+Fortran `allocate' statement provide strong enough guarantees about
+data alignment.  If you don't use `fftw_malloc', therefore, you
+probably have to use `FFTW_UNALIGNED' (which disables most SIMD
+support).  If possible, it is probably better for you to simply create
+multiple plans (creating a new plan is quick once one exists for a
+given size), or better yet re-use the same array for your transforms.
+
+   If you are tempted to use the new-array execute interface because you
+want to transform a known bunch of arrays of the same size, you should
+probably go use the advanced interface instead (*note Advanced
+Interface::)).
+
+   The new-array execute functions are:
+
+     void fftw_execute_dft(
+          const fftw_plan p,
+          fftw_complex *in, fftw_complex *out);
+
+     void fftw_execute_split_dft(
+          const fftw_plan p,
+          double *ri, double *ii, double *ro, double *io);
+
+     void fftw_execute_dft_r2c(
+          const fftw_plan p,
+          double *in, fftw_complex *out);
+
+     void fftw_execute_split_dft_r2c(
+          const fftw_plan p,
+          double *in, double *ro, double *io);
+
+     void fftw_execute_dft_c2r(
+          const fftw_plan p,
+          fftw_complex *in, double *out);
+
+     void fftw_execute_split_dft_c2r(
+          const fftw_plan p,
+          double *ri, double *ii, double *out);
+
+     void fftw_execute_r2r(
+          const fftw_plan p,
+          double *in, double *out);
+   
+   These execute the `plan' to compute the corresponding transform on
+the input/output arrays specified by the subsequent arguments.  The
+input/output array arguments have the same meanings as the ones passed
+to the guru planner routines in the preceding sections.  The `plan' is
+not modified, and these routines can be called as many times as
+desired, or intermixed with calls to the ordinary `fftw_execute'.
+
+   The `plan' _must_ have been created for the transform type
+corresponding to the execute function, e.g. it must be a complex-DFT
+plan for `fftw_execute_dft'.  Any of the planner routines for that
+transform type, from the basic to the guru interface, could have been
+used to create the plan, however.
+
+
+File: fftw3.info,  Node: Wisdom,  Next: What FFTW Really Computes,  Prev: New-array Execute Functions,  Up: FFTW Reference
+
+4.7 Wisdom
+==========
+
+This section documents the FFTW mechanism for saving and restoring
+plans from disk.  This mechanism is called "wisdom".
+
+* Menu:
+
+* Wisdom Export::
+* Wisdom Import::
+* Forgetting Wisdom::
+* Wisdom Utilities::
+
+
+File: fftw3.info,  Node: Wisdom Export,  Next: Wisdom Import,  Prev: Wisdom,  Up: Wisdom
+
+4.7.1 Wisdom Export
+-------------------
+
+     int fftw_export_wisdom_to_filename(const char *filename);
+     void fftw_export_wisdom_to_file(FILE *output_file);
+     char *fftw_export_wisdom_to_string(void);
+     void fftw_export_wisdom(void (*write_char)(char c, void *), void *data);
+
+   These functions allow you to export all currently accumulated wisdom
+in a form from which it can be later imported and restored, even during
+a separate run of the program. (*Note Words of Wisdom-Saving Plans::.)
+The current store of wisdom is not affected by calling any of these
+routines.
+
+   `fftw_export_wisdom' exports the wisdom to any output medium, as
+specified by the callback function `write_char'. `write_char' is a
+`putc'-like function that writes the character `c' to some output; its
+second parameter is the `data' pointer passed to `fftw_export_wisdom'.
+For convenience, the following three "wrapper" routines are provided:
+
+   `fftw_export_wisdom_to_filename' writes wisdom to a file named
+`filename' (which is created or overwritten), returning `1' on success
+and `0' on failure.  A lower-level function, which requires you to open
+and close the file yourself (e.g. if you want to write wisdom to a
+portion of a larger file) is `fftw_export_wisdom_to_file'.  This writes
+the wisdom to the current position in `output_file', which should be
+open with write permission; upon exit, the file remains open and is
+positioned at the end of the wisdom data.
+
+   `fftw_export_wisdom_to_string' returns a pointer to a
+`NULL'-terminated string holding the wisdom data. This string is
+dynamically allocated, and it is the responsibility of the caller to
+deallocate it with `free' when it is no longer needed.
+
+   All of these routines export the wisdom in the same format, which we
+will not document here except to say that it is LISP-like ASCII text
+that is insensitive to white space.
+
+
+File: fftw3.info,  Node: Wisdom Import,  Next: Forgetting Wisdom,  Prev: Wisdom Export,  Up: Wisdom
+
+4.7.2 Wisdom Import
+-------------------
+
+     int fftw_import_system_wisdom(void);
+     int fftw_import_wisdom_from_filename(const char *filename);
+     int fftw_import_wisdom_from_string(const char *input_string);
+     int fftw_import_wisdom(int (*read_char)(void *), void *data);
+
+   These functions import wisdom into a program from data stored by the
+`fftw_export_wisdom' functions above. (*Note Words of Wisdom-Saving
+Plans::.)  The imported wisdom replaces any wisdom already accumulated
+by the running program.
+
+   `fftw_import_wisdom' imports wisdom from any input medium, as
+specified by the callback function `read_char'. `read_char' is a
+`getc'-like function that returns the next character in the input; its
+parameter is the `data' pointer passed to `fftw_import_wisdom'. If the
+end of the input data is reached (which should never happen for valid
+data), `read_char' should return `EOF' (as defined in `<stdio.h>').
+For convenience, the following three "wrapper" routines are provided:
+
+   `fftw_import_wisdom_from_filename' reads wisdom from a file named
+`filename'.  A lower-level function, which requires you to open and
+close the file yourself (e.g. if you want to read wisdom from a portion
+of a larger file) is `fftw_import_wisdom_from_file'. This reads wisdom
+from the current position in `input_file' (which should be open with
+read permission); upon exit, the file remains open, but the position of
+the read pointer is unspecified.
+
+   `fftw_import_wisdom_from_string' reads wisdom from the
+`NULL'-terminated string `input_string'.
+
+   `fftw_import_system_wisdom' reads wisdom from an
+implementation-defined standard file (`/etc/fftw/wisdom' on Unix and
+GNU systems).  
+
+   The return value of these import routines is `1' if the wisdom was
+read successfully and `0' otherwise. Note that, in all of these
+functions, any data in the input stream past the end of the wisdom data
+is simply ignored.
+
+
+File: fftw3.info,  Node: Forgetting Wisdom,  Next: Wisdom Utilities,  Prev: Wisdom Import,  Up: Wisdom
+
+4.7.3 Forgetting Wisdom
+-----------------------
+
+     void fftw_forget_wisdom(void);
+
+   Calling `fftw_forget_wisdom' causes all accumulated `wisdom' to be
+discarded and its associated memory to be freed. (New `wisdom' can
+still be gathered subsequently, however.)
+
+
+File: fftw3.info,  Node: Wisdom Utilities,  Prev: Forgetting Wisdom,  Up: Wisdom
+
+4.7.4 Wisdom Utilities
+----------------------
+
+FFTW includes two standalone utility programs that deal with wisdom.  We
+merely summarize them here, since they come with their own `man' pages
+for Unix and GNU systems (with HTML versions on our web site).
+
+   The first program is `fftw-wisdom' (or `fftwf-wisdom' in single
+precision, etcetera), which can be used to create a wisdom file
+containing plans for any of the transform sizes and types supported by
+FFTW.  It is preferable to create wisdom directly from your executable
+(*note Caveats in Using Wisdom::), but this program is useful for
+creating global wisdom files for `fftw_import_system_wisdom'.  
+
+   The second program is `fftw-wisdom-to-conf', which takes a wisdom
+file as input and produces a "configuration routine" as output.  The
+latter is a C subroutine that you can compile and link into your
+program, replacing a routine of the same name in the FFTW library, that
+determines which parts of FFTW are callable by your program.
+`fftw-wisdom-to-conf' produces a configuration routine that links to
+only those parts of FFTW needed by the saved plans in the wisdom,
+greatly reducing the size of statically linked executables (which should
+only attempt to create plans corresponding to those in the wisdom,
+however).  
+
+
+File: fftw3.info,  Node: What FFTW Really Computes,  Prev: Wisdom,  Up: FFTW Reference
+
+4.8 What FFTW Really Computes
+=============================
+
+In this section, we provide precise mathematical definitions for the
+transforms that FFTW computes.  These transform definitions are fairly
+standard, but some authors follow slightly different conventions for the
+normalization of the transform (the constant factor in front) and the
+sign of the complex exponent.  We begin by presenting the
+one-dimensional (1d) transform definitions, and then give the
+straightforward extension to multi-dimensional transforms.
+
+* Menu:
+
+* The 1d Discrete Fourier Transform (DFT)::
+* The 1d Real-data DFT::
+* 1d Real-even DFTs (DCTs)::
+* 1d Real-odd DFTs (DSTs)::
+* 1d Discrete Hartley Transforms (DHTs)::
+* Multi-dimensional Transforms::
+
+
+File: fftw3.info,  Node: The 1d Discrete Fourier Transform (DFT),  Next: The 1d Real-data DFT,  Prev: What FFTW Really Computes,  Up: What FFTW Really Computes
+
+4.8.1 The 1d Discrete Fourier Transform (DFT)
+---------------------------------------------
+
+The forward (`FFTW_FORWARD') discrete Fourier transform (DFT) of a 1d
+complex array X of size n computes an array Y, where:  Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(-2 pi j k sqrt(-1)/n) .
+   The backward (`FFTW_BACKWARD') DFT computes:  Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(2 pi j k sqrt(-1)/n) .
+   FFTW computes an unnormalized transform, in that there is no
+coefficient in front of the summation in the DFT.  In other words,
+applying the forward and then the backward transform will multiply the
+input by n.
+
+   From above, an `FFTW_FORWARD' transform corresponds to a sign of -1
+in the exponent of the DFT.  Note also that we use the standard
+"in-order" output ordering--the k-th output corresponds to the
+frequency k/n (or k/T, where T is your total sampling period).  For
+those who like to think in terms of positive and negative frequencies,
+this means that the positive frequencies are stored in the first half
+of the output and the negative frequencies are stored in backwards
+order in the second half of the output.  (The frequency -k/n is the
+same as the frequency (n-k)/n.)
+
+
+File: fftw3.info,  Node: The 1d Real-data DFT,  Next: 1d Real-even DFTs (DCTs),  Prev: The 1d Discrete Fourier Transform (DFT),  Up: What FFTW Really Computes
+
+4.8.2 The 1d Real-data DFT
+--------------------------
+
+The real-input (r2c) DFT in FFTW computes the _forward_ transform Y of
+the size `n' real array X, exactly as defined above, i.e.   Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(-2 pi j k sqrt(-1)/n) .
+   This output array Y can easily be shown to possess the "Hermitian"
+symmetry Y[k] = Y[n-k]*, where we take Y to be periodic so that Y[n] =
+Y[0].
+
+   As a result of this symmetry, half of the output Y is redundant
+(being the complex conjugate of the other half), and so the 1d r2c
+transforms only output elements 0...n/2 of Y (n/2+1 complex numbers),
+where the division by 2 is rounded down.
+
+   Moreover, the Hermitian symmetry implies that Y[0] and, if n is
+even, the Y[n/2] element, are purely real.  So, for the `R2HC' r2r
+transform, these elements are not stored in the halfcomplex output
+format.  
+
+   The c2r and `H2RC' r2r transforms compute the backward DFT of the
+_complex_ array X with Hermitian symmetry, stored in the r2c/`R2HC'
+output formats, respectively, where the backward transform is defined
+exactly as for the complex case:  Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(2 pi j k sqrt(-1)/n) .
+   The outputs `Y' of this transform can easily be seen to be purely
+real, and are stored as an array of real numbers.
+
+   Like FFTW's complex DFT, these transforms are unnormalized.  In other
+words, applying the real-to-complex (forward) and then the
+complex-to-real (backward) transform will multiply the input by n.
+
+
+File: fftw3.info,  Node: 1d Real-even DFTs (DCTs),  Next: 1d Real-odd DFTs (DSTs),  Prev: The 1d Real-data DFT,  Up: What FFTW Really Computes
+
+4.8.3 1d Real-even DFTs (DCTs)
+------------------------------
+
+The Real-even symmetry DFTs in FFTW are exactly equivalent to the
+unnormalized forward (and backward) DFTs as defined above, where the
+input array X of length N is purely real and is also "even" symmetry.
+In this case, the output array is likewise real and even symmetry.  
+
+   For the case of `REDFT00', this even symmetry means that X[j] =
+X[N-j], where we take X to be periodic so that X[N] = X[0].  Because of
+this redundancy, only the first n real numbers are actually stored,
+where N = 2(n-1).
+
+   The proper definition of even symmetry for `REDFT10', `REDFT01', and
+`REDFT11' transforms is somewhat more intricate because of the shifts
+by 1/2 of the input and/or output, although the corresponding boundary
+conditions are given in *note Real even/odd DFTs (cosine/sine
+transforms)::.  Because of the even symmetry, however, the sine terms
+in the DFT all cancel and the remaining cosine terms are written
+explicitly below.  This formulation often leads people to call such a
+transform a "discrete cosine transform" (DCT), although it is really
+just a special case of the DFT.  
+
+   In each of the definitions below, we transform a real array X of
+length n to a real array Y of length n:
+
+REDFT00 (DCT-I)
+...............
+
+An `REDFT00' transform (type-I DCT) in FFTW is defined by: Y[k] = X[0]
++ (-1)^k X[n-1] + 2 (sum for j = 1 to n-2 of X[j] cos(pi jk /(n-1))).
+Note that this transform is not defined for n=1.  For n=2, the
+summation term above is dropped as you might expect.
+
+REDFT10 (DCT-II)
+................
+
+An `REDFT10' transform (type-II DCT, sometimes called "the" DCT) in
+FFTW is defined by: Y[k] = 2 (sum for j = 0 to n-1 of X[j] cos(pi
+(j+1/2) k / n)).
+
+REDFT01 (DCT-III)
+.................
+
+An `REDFT01' transform (type-III DCT) in FFTW is defined by: Y[k] =
+X[0] + 2 (sum for j = 1 to n-1 of X[j] cos(pi j (k+1/2) / n)).  In the
+case of n=1, this reduces to Y[0] = X[0].  Up to a scale factor (see
+below), this is the inverse of `REDFT10' ("the" DCT), and so the
+`REDFT01' (DCT-III) is sometimes called the "IDCT".  
+
+REDFT11 (DCT-IV)
+................
+
+An `REDFT11' transform (type-IV DCT) in FFTW is defined by: Y[k] = 2
+(sum for j = 0 to n-1 of X[j] cos(pi (j+1/2) (k+1/2) / n)).
+
+Inverses and Normalization
+..........................
+
+These definitions correspond directly to the unnormalized DFTs used
+elsewhere in FFTW (hence the factors of 2 in front of the summations).
+The unnormalized inverse of `REDFT00' is `REDFT00', of `REDFT10' is
+`REDFT01' and vice versa, and of `REDFT11' is `REDFT11'.  Each
+unnormalized inverse results in the original array multiplied by N,
+where N is the _logical_ DFT size.  For `REDFT00', N=2(n-1) (note that
+n=1 is not defined); otherwise, N=2n.  
+
+   In defining the discrete cosine transform, some authors also include
+additional factors of sqrt(2) (or its inverse) multiplying selected
+inputs and/or outputs.  This is a mostly cosmetic change that makes the
+transform orthogonal, but sacrifices the direct equivalence to a
+symmetric DFT.
+
+
+File: fftw3.info,  Node: 1d Real-odd DFTs (DSTs),  Next: 1d Discrete Hartley Transforms (DHTs),  Prev: 1d Real-even DFTs (DCTs),  Up: What FFTW Really Computes
+
+4.8.4 1d Real-odd DFTs (DSTs)
+-----------------------------
+
+The Real-odd symmetry DFTs in FFTW are exactly equivalent to the
+unnormalized forward (and backward) DFTs as defined above, where the
+input array X of length N is purely real and is also "odd" symmetry.  In
+this case, the output is odd symmetry and purely imaginary.  
+
+   For the case of `RODFT00', this odd symmetry means that X[j] =
+-X[N-j], where we take X to be periodic so that X[N] = X[0].  Because
+of this redundancy, only the first n real numbers starting at j=1 are
+actually stored (the j=0 element is zero), where N = 2(n+1).
+
+   The proper definition of odd symmetry for `RODFT10', `RODFT01', and
+`RODFT11' transforms is somewhat more intricate because of the shifts
+by 1/2 of the input and/or output, although the corresponding boundary
+conditions are given in *note Real even/odd DFTs (cosine/sine
+transforms)::.  Because of the odd symmetry, however, the cosine terms
+in the DFT all cancel and the remaining sine terms are written
+explicitly below.  This formulation often leads people to call such a
+transform a "discrete sine transform" (DST), although it is really just
+a special case of the DFT.  
+
+   In each of the definitions below, we transform a real array X of
+length n to a real array Y of length n:
+
+RODFT00 (DST-I)
+...............
+
+An `RODFT00' transform (type-I DST) in FFTW is defined by: Y[k] = 2
+(sum for j = 0 to n-1 of X[j] sin(pi (j+1)(k+1) / (n+1))).
+
+RODFT10 (DST-II)
+................
+
+An `RODFT10' transform (type-II DST) in FFTW is defined by: Y[k] = 2
+(sum for j = 0 to n-1 of X[j] sin(pi (j+1/2) (k+1) / n)).
+
+RODFT01 (DST-III)
+.................
+
+An `RODFT01' transform (type-III DST) in FFTW is defined by: Y[k] =
+(-1)^k X[n-1] + 2 (sum for j = 0 to n-2 of X[j] sin(pi (j+1) (k+1/2) /
+n)).  In the case of n=1, this reduces to Y[0] = X[0].
+
+RODFT11 (DST-IV)
+................
+
+An `RODFT11' transform (type-IV DST) in FFTW is defined by: Y[k] = 2
+(sum for j = 0 to n-1 of X[j] sin(pi (j+1/2) (k+1/2) / n)).
+
+Inverses and Normalization
+..........................
+
+These definitions correspond directly to the unnormalized DFTs used
+elsewhere in FFTW (hence the factors of 2 in front of the summations).
+The unnormalized inverse of `RODFT00' is `RODFT00', of `RODFT10' is
+`RODFT01' and vice versa, and of `RODFT11' is `RODFT11'.  Each
+unnormalized inverse results in the original array multiplied by N,
+where N is the _logical_ DFT size.  For `RODFT00', N=2(n+1); otherwise,
+N=2n.  
+
+   In defining the discrete sine transform, some authors also include
+additional factors of sqrt(2) (or its inverse) multiplying selected
+inputs and/or outputs.  This is a mostly cosmetic change that makes the
+transform orthogonal, but sacrifices the direct equivalence to an
+antisymmetric DFT.
+
+
+File: fftw3.info,  Node: 1d Discrete Hartley Transforms (DHTs),  Next: Multi-dimensional Transforms,  Prev: 1d Real-odd DFTs (DSTs),  Up: What FFTW Really Computes
+
+4.8.5 1d Discrete Hartley Transforms (DHTs)
+-------------------------------------------
+
+The discrete Hartley transform (DHT) of a 1d real array X of size n
+computes a real array Y of the same size, where: Y[k] = sum for j = 0 to (n - 1) of X[j] * [cos(2 pi j k / n) + sin(2 pi j k / n)].
+   FFTW computes an unnormalized transform, in that there is no
+coefficient in front of the summation in the DHT.  In other words,
+applying the transform twice (the DHT is its own inverse) will multiply
+the input by n.
+
+
+File: fftw3.info,  Node: Multi-dimensional Transforms,  Prev: 1d Discrete Hartley Transforms (DHTs),  Up: What FFTW Really Computes
+
+4.8.6 Multi-dimensional Transforms
+----------------------------------
+
+The multi-dimensional transforms of FFTW, in general, compute simply the
+separable product of the given 1d transform along each dimension of the
+array.  Since each of these transforms is unnormalized, computing the
+forward followed by the backward/inverse multi-dimensional transform
+will result in the original array scaled by the product of the
+normalization factors for each dimension (e.g. the product of the
+dimension sizes, for a multi-dimensional DFT).
+
+   The definition of FFTW's multi-dimensional DFT of real data (r2c)
+deserves special attention.  In this case, we logically compute the full
+multi-dimensional DFT of the input data; since the input data are purely
+real, the output data have the Hermitian symmetry and therefore only one
+non-redundant half need be stored.  More specifically, for an n[0] x
+n[1] x n[2] x ... x n[d-1]  multi-dimensional real-input DFT, the full
+(logical) complex output array Y[k[0], k[1], ..., k[d-1]] has the
+symmetry: Y[k[0], k[1], ..., k[d-1]] = Y[n[0] - k[0], n[1] - k[1], ...,
+n[d-1] - k[d-1]]* (where each dimension is periodic).  Because of this
+symmetry, we only store the k[d-1] = 0...n[d-1]/2 elements of the
+_last_ dimension (division by 2 is rounded down).  (We could instead
+have cut any other dimension in half, but the last dimension proved
+computationally convenient.)  This results in the peculiar array format
+described in more detail by *note Real-data DFT Array Format::.
+
+   The multi-dimensional c2r transform is simply the unnormalized
+inverse of the r2c transform.  i.e. it is the same as FFTW's complex
+backward multi-dimensional DFT, operating on a Hermitian input array in
+the peculiar format mentioned above and outputting a real array (since
+the DFT output is purely real).
+
+   We should remind the user that the separable product of 1d transforms
+along each dimension, as computed by FFTW, is not always the same thing
+as the usual multi-dimensional transform.  A multi-dimensional `R2HC'
+(or `HC2R') transform is not identical to the multi-dimensional DFT,
+requiring some post-processing to combine the requisite real and
+imaginary parts, as was described in *note The Halfcomplex-format
+DFT::.  Likewise, FFTW's multidimensional `FFTW_DHT' r2r transform is
+not the same thing as the logical multi-dimensional discrete Hartley
+transform defined in the literature, as discussed in *note The Discrete
+Hartley Transform::.
+
+
+File: fftw3.info,  Node: Multi-threaded FFTW,  Next: Distributed-memory FFTW with MPI,  Prev: FFTW Reference,  Up: Top
+
+5 Multi-threaded FFTW
+*********************
+
+In this chapter we document the parallel FFTW routines for
+shared-memory parallel hardware.  These routines, which support
+parallel one- and multi-dimensional transforms of both real and complex
+data, are the easiest way to take advantage of multiple processors with
+FFTW.  They work just like the corresponding uniprocessor transform
+routines, except that you have an extra initialization routine to call,
+and there is a routine to set the number of threads to employ.  Any
+program that uses the uniprocessor FFTW can therefore be trivially
+modified to use the multi-threaded FFTW.
+
+   A shared-memory machine is one in which all CPUs can directly access
+the same main memory, and such machines are now common due to the
+ubiquity of multi-core CPUs.  FFTW's multi-threading support allows you
+to utilize these additional CPUs transparently from a single program.
+However, this does not necessarily translate into performance
+gains--when multiple threads/CPUs are employed, there is an overhead
+required for synchronization that may outweigh the computatational
+parallelism.  Therefore, you can only benefit from threads if your
+problem is sufficiently large.  
+
+* Menu:
+
+* Installation and Supported Hardware/Software::
+* Usage of Multi-threaded FFTW::
+* How Many Threads to Use?::
+* Thread safety::
+
+
+File: fftw3.info,  Node: Installation and Supported Hardware/Software,  Next: Usage of Multi-threaded FFTW,  Prev: Multi-threaded FFTW,  Up: Multi-threaded FFTW
+
+5.1 Installation and Supported Hardware/Software
+================================================
+
+All of the FFTW threads code is located in the `threads' subdirectory
+of the FFTW package.  On Unix systems, the FFTW threads libraries and
+header files can be automatically configured, compiled, and installed
+along with the uniprocessor FFTW libraries simply by including
+`--enable-threads' in the flags to the `configure' script (*note
+Installation on Unix::), or `--enable-openmp' to use OpenMP
+(http://www.openmp.org) threads.  
+
+   The threads routines require your operating system to have some sort
+of shared-memory threads support.  Specifically, the FFTW threads
+package works with POSIX threads (available on most Unix variants, from
+GNU/Linux to MacOS X) and Win32 threads.  OpenMP threads, which are
+supported in many common compilers (e.g. gcc) are also supported, and
+may give better performance on some systems.  (OpenMP threads are also
+useful if you are employing OpenMP in your own code, in order to
+minimize conflicts between threading models.)  If you have a
+shared-memory machine that uses a different threads API, it should be a
+simple matter of programming to include support for it; see the file
+`threads/threads.c' for more detail.
+
+   You can compile FFTW with _both_ `--enable-threads' and
+`--enable-openmp' at the same time, since they install libraries with
+different names (`fftw3_threads' and `fftw3_omp', as described below).
+However, your programs may only link to _one_ of these two libraries at
+a time.
+
+   Ideally, of course, you should also have multiple processors in
+order to get any benefit from the threaded transforms.
+
+
+File: fftw3.info,  Node: Usage of Multi-threaded FFTW,  Next: How Many Threads to Use?,  Prev: Installation and Supported Hardware/Software,  Up: Multi-threaded FFTW
+
+5.2 Usage of Multi-threaded FFTW
+================================
+
+Here, it is assumed that the reader is already familiar with the usage
+of the uniprocessor FFTW routines, described elsewhere in this manual.
+We only describe what one has to change in order to use the
+multi-threaded routines.
+
+   First, programs using the parallel complex transforms should be
+linked with `-lfftw3_threads -lfftw3 -lm' on Unix, or `-lfftw3_omp
+-lfftw3 -lm' if you compiled with OpenMP. You will also need to link
+with whatever library is responsible for threads on your system (e.g.
+`-lpthread' on GNU/Linux) or include whatever compiler flag enables
+OpenMP (e.g. `-fopenmp' with gcc).  
+
+   Second, before calling _any_ FFTW routines, you should call the
+function:
+
+     int fftw_init_threads(void);
+   
+   This function, which need only be called once, performs any one-time
+initialization required to use threads on your system.  It returns zero
+if there was some error (which should not happen under normal
+circumstances) and a non-zero value otherwise.
+
+   Third, before creating a plan that you want to parallelize, you
+should call:
+
+     void fftw_plan_with_nthreads(int nthreads);
+   
+   The `nthreads' argument indicates the number of threads you want
+FFTW to use (or actually, the maximum number).  All plans subsequently
+created with any planner routine will use that many threads.  You can
+call `fftw_plan_with_nthreads', create some plans, call
+`fftw_plan_with_nthreads' again with a different argument, and create
+some more plans for a new number of threads.  Plans already created
+before a call to `fftw_plan_with_nthreads' are unaffected.  If you pass
+an `nthreads' argument of `1' (the default), threads are disabled for
+subsequent plans.
+
+   With OpenMP, to configure FFTW to use all of the currently running
+OpenMP threads (set by `omp_set_num_threads(nthreads)' or by the
+`OMP_NUM_THREADS' environment variable), you can do:
+`fftw_plan_with_nthreads(omp_get_max_threads())'. (The `omp_' OpenMP
+functions are declared via `#include <omp.h>'.)
+
+   Given a plan, you then execute it as usual with
+`fftw_execute(plan)', and the execution will use the number of threads
+specified when the plan was created.  When done, you destroy it as
+usual with `fftw_destroy_plan'.  As described in *note Thread safety::,
+plan _execution_ is thread-safe, but plan creation and destruction are
+_not_: you should create/destroy plans only from a single thread, but
+can safely execute multiple plans in parallel.
+
+   There is one additional routine: if you want to get rid of all memory
+and other resources allocated internally by FFTW, you can call:
+
+     void fftw_cleanup_threads(void);
+   
+   which is much like the `fftw_cleanup()' function except that it also
+gets rid of threads-related data.  You must _not_ execute any
+previously created plans after calling this function.
+
+   We should also mention one other restriction: if you save wisdom
+from a program using the multi-threaded FFTW, that wisdom _cannot be
+used_ by a program using only the single-threaded FFTW (i.e. not calling
+`fftw_init_threads').  *Note Words of Wisdom-Saving Plans::.
+
+
+File: fftw3.info,  Node: How Many Threads to Use?,  Next: Thread safety,  Prev: Usage of Multi-threaded FFTW,  Up: Multi-threaded FFTW
+
+5.3 How Many Threads to Use?
+============================
+
+There is a fair amount of overhead involved in synchronizing threads,
+so the optimal number of threads to use depends upon the size of the
+transform as well as on the number of processors you have.
+
+   As a general rule, you don't want to use more threads than you have
+processors.  (Using more threads will work, but there will be extra
+overhead with no benefit.)  In fact, if the problem size is too small,
+you may want to use fewer threads than you have processors.
+
+   You will have to experiment with your system to see what level of
+parallelization is best for your problem size.  Typically, the problem
+will have to involve at least a few thousand data points before threads
+become beneficial.  If you plan with `FFTW_PATIENT', it will
+automatically disable threads for sizes that don't benefit from
+parallelization.  
+
+
+File: fftw3.info,  Node: Thread safety,  Prev: How Many Threads to Use?,  Up: Multi-threaded FFTW
+
+5.4 Thread safety
+=================
+
+Users writing multi-threaded programs (including OpenMP) must concern
+themselves with the "thread safety" of the libraries they use--that is,
+whether it is safe to call routines in parallel from multiple threads.
+FFTW can be used in such an environment, but some care must be taken
+because the planner routines share data (e.g. wisdom and trigonometric
+tables) between calls and plans.
+
+   The upshot is that the only thread-safe (re-entrant) routine in FFTW
+is `fftw_execute' (and the new-array variants thereof).  All other
+routines (e.g. the planner) should only be called from one thread at a
+time.  So, for example, you can wrap a semaphore lock around any calls
+to the planner; even more simply, you can just create all of your plans
+from one thread.  We do not think this should be an important
+restriction (FFTW is designed for the situation where the only
+performance-sensitive code is the actual execution of the transform),
+and the benefits of shared data between plans are great.
+
+   Note also that, since the plan is not modified by `fftw_execute', it
+is safe to execute the _same plan_ in parallel by multiple threads.
+However, since a given plan operates by default on a fixed array, you
+need to use one of the new-array execute functions (*note New-array
+Execute Functions::) so that different threads compute the transform of
+different data.
+
+   (Users should note that these comments only apply to programs using
+shared-memory threads or OpenMP.  Parallelism using MPI or forked
+processes involves a separate address-space and global variables for
+each process, and is not susceptible to problems of this sort.)
+
+   If you are configured FFTW with the `--enable-debug' or
+`--enable-debug-malloc' flags (*note Installation on Unix::), then
+`fftw_execute' is not thread-safe.  These flags are not documented
+because they are intended only for developing and debugging FFTW, but
+if you must use `--enable-debug' then you should also specifically pass
+`--disable-debug-malloc' for `fftw_execute' to be thread-safe.
+
+
+File: fftw3.info,  Node: Distributed-memory FFTW with MPI,  Next: Calling FFTW from Modern Fortran,  Prev: Multi-threaded FFTW,  Up: Top
+
+6 Distributed-memory FFTW with MPI
+**********************************
+
+In this chapter we document the parallel FFTW routines for parallel
+systems supporting the MPI message-passing interface.  Unlike the
+shared-memory threads described in the previous chapter, MPI allows you
+to use _distributed-memory_ parallelism, where each CPU has its own
+separate memory, and which can scale up to clusters of many thousands
+of processors.  This capability comes at a price, however: each process
+only stores a _portion_ of the data to be transformed, which means that
+the data structures and programming-interface are quite different from
+the serial or threads versions of FFTW.  
+
+   Distributed-memory parallelism is especially useful when you are
+transforming arrays so large that they do not fit into the memory of a
+single processor.  The storage per-process required by FFTW's MPI
+routines is proportional to the total array size divided by the number
+of processes.  Conversely, distributed-memory parallelism can easily
+pose an unacceptably high communications overhead for small problems;
+the threshold problem size for which parallelism becomes advantageous
+will depend on the precise problem you are interested in, your
+hardware, and your MPI implementation.
+
+   A note on terminology: in MPI, you divide the data among a set of
+"processes" which each run in their own memory address space.
+Generally, each process runs on a different physical processor, but
+this is not required.  A set of processes in MPI is described by an
+opaque data structure called a "communicator," the most common of which
+is the predefined communicator `MPI_COMM_WORLD' which refers to _all_
+processes.  For more information on these and other concepts common to
+all MPI programs, we refer the reader to the documentation at the MPI
+home page (http://www.mcs.anl.gov/research/projects/mpi/).  
+
+   We assume in this chapter that the reader is familiar with the usage
+of the serial (uniprocessor) FFTW, and focus only on the concepts new
+to the MPI interface.
+
+* Menu:
+
+* FFTW MPI Installation::
+* Linking and Initializing MPI FFTW::
+* 2d MPI example::
+* MPI Data Distribution::
+* Multi-dimensional MPI DFTs of Real Data::
+* Other Multi-dimensional Real-data MPI Transforms::
+* FFTW MPI Transposes::
+* FFTW MPI Wisdom::
+* Avoiding MPI Deadlocks::
+* FFTW MPI Performance Tips::
+* Combining MPI and Threads::
+* FFTW MPI Reference::
+* FFTW MPI Fortran Interface::
+
+
+File: fftw3.info,  Node: FFTW MPI Installation,  Next: Linking and Initializing MPI FFTW,  Prev: Distributed-memory FFTW with MPI,  Up: Distributed-memory FFTW with MPI
+
+6.1 FFTW MPI Installation
+=========================
+
+All of the FFTW MPI code is located in the `mpi' subdirectory of the
+FFTW package.  On Unix systems, the FFTW MPI libraries and header files
+are automatically configured, compiled, and installed along with the
+uniprocessor FFTW libraries simply by including `--enable-mpi' in the
+flags to the `configure' script (*note Installation on Unix::).  
+
+   Any implementation of the MPI standard, version 1 or later, should
+work with FFTW.  The `configure' script will attempt to automatically
+detect how to compile and link code using your MPI implementation.  In
+some cases, especially if you have multiple different MPI
+implementations installed or have an unusual MPI software package, you
+may need to provide this information explicitly.
+
+   Most commonly, one compiles MPI code by invoking a special compiler
+command, typically `mpicc' for C code.  The `configure' script knows
+the most common names for this command, but you can specify the MPI
+compilation command explicitly by setting the `MPICC' variable, as in
+`./configure MPICC=mpicc ...'.  
+
+   If, instead of a special compiler command, you need to link a certain
+library, you can specify the link command via the `MPILIBS' variable,
+as in `./configure MPILIBS=-lmpi ...'.  Note that if your MPI library
+is installed in a non-standard location (one the compiler does not know
+about by default), you may also have to specify the location of the
+library and header files via `LDFLAGS' and `CPPFLAGS' variables,
+respectively, as in `./configure LDFLAGS=-L/path/to/mpi/libs
+CPPFLAGS=-I/path/to/mpi/include ...'.
+
+
+File: fftw3.info,  Node: Linking and Initializing MPI FFTW,  Next: 2d MPI example,  Prev: FFTW MPI Installation,  Up: Distributed-memory FFTW with MPI
+
+6.2 Linking and Initializing MPI FFTW
+=====================================
+
+Programs using the MPI FFTW routines should be linked with `-lfftw3_mpi
+-lfftw3 -lm' on Unix in double precision, `-lfftw3f_mpi -lfftw3f -lm'
+in single precision, and so on (*note Precision::). You will also need
+to link with whatever library is responsible for MPI on your system; in
+most MPI implementations, there is a special compiler alias named
+`mpicc' to compile and link MPI code.  
+
+   Before calling any FFTW routines except possibly `fftw_init_threads'
+(*note Combining MPI and Threads::), but after calling `MPI_Init', you
+should call the function:
+
+     void fftw_mpi_init(void);
+   
+   If, at the end of your program, you want to get rid of all memory and
+other resources allocated internally by FFTW, for both the serial and
+MPI routines, you can call:
+
+     void fftw_mpi_cleanup(void);
+   
+   which is much like the `fftw_cleanup()' function except that it also
+gets rid of FFTW's MPI-related data.  You must _not_ execute any
+previously created plans after calling this function.
+
+
+File: fftw3.info,  Node: 2d MPI example,  Next: MPI Data Distribution,  Prev: Linking and Initializing MPI FFTW,  Up: Distributed-memory FFTW with MPI
+
+6.3 2d MPI example
+==================
+
+Before we document the FFTW MPI interface in detail, we begin with a
+simple example outlining how one would perform a two-dimensional `N0'
+by `N1' complex DFT.
+
+     #include <fftw3-mpi.h>
+
+     int main(int argc, char **argv)
+     {
+         const ptrdiff_t N0 = ..., N1 = ...;
+         fftw_plan plan;
+         fftw_complex *data;
+         ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
+
+         MPI_Init(&argc, &argv);
+         fftw_mpi_init();
+
+         /* get local data size and allocate */
+         alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
+                                              &local_n0, &local_0_start);
+         data = fftw_alloc_complex(alloc_local);
+
+         /* create plan for in-place forward DFT */
+         plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
+                                     FFTW_FORWARD, FFTW_ESTIMATE);
+
+         /* initialize data to some function my_function(x,y) */
+         for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j)
+            data[i*N1 + j] = my_function(local_0_start + i, j);
+
+         /* compute transforms, in-place, as many times as desired */
+         fftw_execute(plan);
+
+         fftw_destroy_plan(plan);
+
+         MPI_Finalize();
+     }
+
+   As can be seen above, the MPI interface follows the same basic style
+of allocate/plan/execute/destroy as the serial FFTW routines.  All of
+the MPI-specific routines are prefixed with `fftw_mpi_' instead of
+`fftw_'.  There are a few important differences, however:
+
+   First, we must call `fftw_mpi_init()' after calling `MPI_Init'
+(required in all MPI programs) and before calling any other `fftw_mpi_'
+routine.  
+
+   Second, when we create the plan with `fftw_mpi_plan_dft_2d',
+analogous to `fftw_plan_dft_2d', we pass an additional argument: the
+communicator, indicating which processes will participate in the
+transform (here `MPI_COMM_WORLD', indicating all processes).  Whenever
+you create, execute, or destroy a plan for an MPI transform, you must
+call the corresponding FFTW routine on _all_ processes in the
+communicator for that transform.  (That is, these are _collective_
+calls.)  Note that the plan for the MPI transform uses the standard
+`fftw_execute' and `fftw_destroy' routines (on the other hand, there
+are MPI-specific new-array execute functions documented below).  
+
+   Third, all of the FFTW MPI routines take `ptrdiff_t' arguments
+instead of `int' as for the serial FFTW.  `ptrdiff_t' is a standard C
+integer type which is (at least) 32 bits wide on a 32-bit machine and
+64 bits wide on a 64-bit machine.  This is to make it easy to specify
+very large parallel transforms on a 64-bit machine.  (You can specify
+64-bit transform sizes in the serial FFTW, too, but only by using the
+`guru64' planner interface.  *Note 64-bit Guru Interface::.)  
+
+   Fourth, and most importantly, you don't allocate the entire
+two-dimensional array on each process.  Instead, you call
+`fftw_mpi_local_size_2d' to find out what _portion_ of the array
+resides on each processor, and how much space to allocate.  Here, the
+portion of the array on each process is a `local_n0' by `N1' slice of
+the total array, starting at index `local_0_start'.  The total number
+of `fftw_complex' numbers to allocate is given by the `alloc_local'
+return value, which _may_ be greater than `local_n0 * N1' (in case some
+intermediate calculations require additional storage).  The data
+distribution in FFTW's MPI interface is described in more detail by the
+next section.  
+
+   Given the portion of the array that resides on the local process, it
+is straightforward to initialize the data (here to a function
+`myfunction') and otherwise manipulate it.  Of course, at the end of
+the program you may want to output the data somehow, but synchronizing
+this output is up to you and is beyond the scope of this manual.  (One
+good way to output a large multi-dimensional distributed array in MPI
+to a portable binary file is to use the free HDF5 library; see the HDF
+home page (http://www.hdfgroup.org/).)  
+
+
+File: fftw3.info,  Node: MPI Data Distribution,  Next: Multi-dimensional MPI DFTs of Real Data,  Prev: 2d MPI example,  Up: Distributed-memory FFTW with MPI
+
+6.4 MPI Data Distribution
+=========================
+
+The most important concept to understand in using FFTW's MPI interface
+is the data distribution.  With a serial or multithreaded FFT, all of
+the inputs and outputs are stored as a single contiguous chunk of
+memory.  With a distributed-memory FFT, the inputs and outputs are
+broken into disjoint blocks, one per process.
+
+   In particular, FFTW uses a _1d block distribution_ of the data,
+distributed along the _first dimension_.  For example, if you want to
+perform a 100 x 200  complex DFT, distributed over 4 processes, each
+process will get a 25 x 200  slice of the data.  That is, process 0
+will get rows 0 through 24, process 1 will get rows 25 through 49,
+process 2 will get rows 50 through 74, and process 3 will get rows 75
+through 99.  If you take the same array but distribute it over 3
+processes, then it is not evenly divisible so the different processes
+will have unequal chunks.  FFTW's default choice in this case is to
+assign 34 rows to processes 0 and 1, and 32 rows to process 2.  
+
+   FFTW provides several `fftw_mpi_local_size' routines that you can
+call to find out what portion of an array is stored on the current
+process.  In most cases, you should use the default block sizes picked
+by FFTW, but it is also possible to specify your own block size.  For
+example, with a 100 x 200  array on three processes, you can tell FFTW
+to use a block size of 40, which would assign 40 rows to processes 0
+and 1, and 20 rows to process 2.  FFTW's default is to divide the data
+equally among the processes if possible, and as best it can otherwise.
+The rows are always assigned in "rank order," i.e. process 0 gets the
+first block of rows, then process 1, and so on.  (You can change this
+by using `MPI_Comm_split' to create a new communicator with re-ordered
+processes.)  However, you should always call the `fftw_mpi_local_size'
+routines, if possible, rather than trying to predict FFTW's
+distribution choices.
+
+   In particular, it is critical that you allocate the storage size that
+is returned by `fftw_mpi_local_size', which is _not_ necessarily the
+size of the local slice of the array.  The reason is that intermediate
+steps of FFTW's algorithms involve transposing the array and
+redistributing the data, so at these intermediate steps FFTW may
+require more local storage space (albeit always proportional to the
+total size divided by the number of processes).  The
+`fftw_mpi_local_size' functions know how much storage is required for
+these intermediate steps and tell you the correct amount to allocate.
+
+* Menu:
+
+* Basic and advanced distribution interfaces::
+* Load balancing::
+* Transposed distributions::
+* One-dimensional distributions::
+
+
+File: fftw3.info,  Node: Basic and advanced distribution interfaces,  Next: Load balancing,  Prev: MPI Data Distribution,  Up: MPI Data Distribution
+
+6.4.1 Basic and advanced distribution interfaces
+------------------------------------------------
+
+As with the planner interface, the `fftw_mpi_local_size' distribution
+interface is broken into basic and advanced (`_many') interfaces, where
+the latter allows you to specify the block size manually and also to
+request block sizes when computing multiple transforms simultaneously.
+These functions are documented more exhaustively by the FFTW MPI
+Reference, but we summarize the basic ideas here using a couple of
+two-dimensional examples.
+
+   For the 100 x 200  complex-DFT example, above, we would find the
+distribution by calling the following function in the basic interface:
+
+     ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
+                                      ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
+   
+   Given the total size of the data to be transformed (here, `n0 = 100'
+and `n1 = 200') and an MPI communicator (`comm'), this function
+provides three numbers.
+
+   First, it describes the shape of the local data: the current process
+should store a `local_n0' by `n1' slice of the overall dataset, in
+row-major order (`n1' dimension contiguous), starting at index
+`local_0_start'.  That is, if the total dataset is viewed as a `n0' by
+`n1' matrix, the current process should store the rows `local_0_start'
+to `local_0_start+local_n0-1'.  Obviously, if you are running with only
+a single MPI process, that process will store the entire array:
+`local_0_start' will be zero and `local_n0' will be `n0'.  *Note
+Row-major Format::.  
+
+   Second, the return value is the total number of data elements (e.g.,
+complex numbers for a complex DFT) that should be allocated for the
+input and output arrays on the current process (ideally with
+`fftw_malloc' or an `fftw_alloc' function, to ensure optimal
+alignment).  It might seem that this should always be equal to
+`local_n0 * n1', but this is _not_ the case.  FFTW's distributed FFT
+algorithms require data redistributions at intermediate stages of the
+transform, and in some circumstances this may require slightly larger
+local storage.  This is discussed in more detail below, under *note
+Load balancing::.  
+
+   The advanced-interface `local_size' function for multidimensional
+transforms returns the same three things (`local_n0', `local_0_start',
+and the total number of elements to allocate), but takes more inputs:
+
+     ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n,
+                                        ptrdiff_t howmany,
+                                        ptrdiff_t block0,
+                                        MPI_Comm comm,
+                                        ptrdiff_t *local_n0,
+                                        ptrdiff_t *local_0_start);
+   
+   The two-dimensional case above corresponds to `rnk = 2' and an array
+`n' of length 2 with `n[0] = n0' and `n[1] = n1'.  This routine is for
+any `rnk > 1'; one-dimensional transforms have their own interface
+because they work slightly differently, as discussed below.
+
+   First, the advanced interface allows you to perform multiple
+transforms at once, of interleaved data, as specified by the `howmany'
+parameter.  (`hoamany' is 1 for a single transform.)
+
+   Second, here you can specify your desired block size in the `n0'
+dimension, `block0'.  To use FFTW's default block size, pass
+`FFTW_MPI_DEFAULT_BLOCK' (0) for `block0'.  Otherwise, on `P'
+processes, FFTW will return `local_n0' equal to `block0' on the first
+`P / block0' processes (rounded down), return `local_n0' equal to `n0 -
+block0 * (P / block0)' on the next process, and `local_n0' equal to
+zero on any remaining processes.  In general, we recommend using the
+default block size (which corresponds to `n0 / P', rounded up).  
+
+   For example, suppose you have `P = 4' processes and `n0 = 21'.  The
+default will be a block size of `6', which will give `local_n0 = 6' on
+the first three processes and `local_n0 = 3' on the last process.
+Instead, however, you could specify `block0 = 5' if you wanted, which
+would give `local_n0 = 5' on processes 0 to 2, `local_n0 = 6' on
+process 3.  (This choice, while it may look superficially more
+"balanced," has the same critical path as FFTW's default but requires
+more communications.)
+
+
+File: fftw3.info,  Node: Load balancing,  Next: Transposed distributions,  Prev: Basic and advanced distribution interfaces,  Up: MPI Data Distribution
+
+6.4.2 Load balancing
+--------------------
+
+Ideally, when you parallelize a transform over some P processes, each
+process should end up with work that takes equal time.  Otherwise, all
+of the processes end up waiting on whichever process is slowest.  This
+goal is known as "load balancing."  In this section, we describe the
+circumstances under which FFTW is able to load-balance well, and in
+particular how you should choose your transform size in order to load
+balance.
+
+   Load balancing is especially difficult when you are parallelizing
+over heterogeneous machines; for example, if one of your processors is a
+old 486 and another is a Pentium IV, obviously you should give the
+Pentium more work to do than the 486 since the latter is much slower.
+FFTW does not deal with this problem, however--it assumes that your
+processes run on hardware of comparable speed, and that the goal is
+therefore to divide the problem as equally as possible.
+
+   For a multi-dimensional complex DFT, FFTW can divide the problem
+equally among the processes if: (i) the _first_ dimension `n0' is
+divisible by P; and (ii), the _product_ of the subsequent dimensions is
+divisible by P.  (For the advanced interface, where you can specify
+multiple simultaneous transforms via some "vector" length `howmany', a
+factor of `howmany' is included in the product of the subsequent
+dimensions.)
+
+   For a one-dimensional complex DFT, the length `N' of the data should
+be divisible by P _squared_ to be able to divide the problem equally
+among the processes.
+
+
+File: fftw3.info,  Node: Transposed distributions,  Next: One-dimensional distributions,  Prev: Load balancing,  Up: MPI Data Distribution
+
+6.4.3 Transposed distributions
+------------------------------
+
+Internally, FFTW's MPI transform algorithms work by first computing
+transforms of the data local to each process, then by globally
+_transposing_ the data in some fashion to redistribute the data among
+the processes, transforming the new data local to each process, and
+transposing back.  For example, a two-dimensional `n0' by `n1' array,
+distributed across the `n0' dimension, is transformd by: (i)
+transforming the `n1' dimension, which are local to each process; (ii)
+transposing to an `n1' by `n0' array, distributed across the `n1'
+dimension; (iii) transforming the `n0' dimension, which is now local to
+each process; (iv) transposing back.  
+
+   However, in many applications it is acceptable to compute a
+multidimensional DFT whose results are produced in transposed order
+(e.g., `n1' by `n0' in two dimensions).  This provides a significant
+performance advantage, because it means that the final transposition
+step can be omitted.  FFTW supports this optimization, which you
+specify by passing the flag `FFTW_MPI_TRANSPOSED_OUT' to the planner
+routines.  To compute the inverse transform of transposed output, you
+specify `FFTW_MPI_TRANSPOSED_IN' to tell it that the input is
+transposed.  In this section, we explain how to interpret the output
+format of such a transform.  
+
+   Suppose you have are transforming multi-dimensional data with (at
+least two) dimensions n[0] x n[1] x n[2] x ... x n[d-1] .  As always,
+it is distributed along the first dimension n[0] .  Now, if we compute
+its DFT with the `FFTW_MPI_TRANSPOSED_OUT' flag, the resulting output
+data are stored with the first _two_ dimensions transposed: n[1] x n[0]
+x n[2] x ... x n[d-1] , distributed along the n[1]  dimension.
+Conversely, if we take the n[1] x n[0] x n[2] x ... x n[d-1]  data and
+transform it with the `FFTW_MPI_TRANSPOSED_IN' flag, then the format
+goes back to the original n[0] x n[1] x n[2] x ... x n[d-1]  array.
+
+   There are two ways to find the portion of the transposed array that
+resides on the current process.  First, you can simply call the
+appropriate `local_size' function, passing n[1] x n[0] x n[2] x ... x
+n[d-1]  (the transposed dimensions).  This would mean calling the
+`local_size' function twice, once for the transposed and once for the
+non-transposed dimensions.  Alternatively, you can call one of the
+`local_size_transposed' functions, which returns both the
+non-transposed and transposed data distribution from a single call.
+For example, for a 3d transform with transposed output (or input), you
+might call:
+
+     ptrdiff_t fftw_mpi_local_size_3d_transposed(
+                     ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm,
+                     ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
+                     ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
+   
+   Here, `local_n0' and `local_0_start' give the size and starting
+index of the `n0' dimension for the _non_-transposed data, as in the
+previous sections.  For _transposed_ data (e.g. the output for
+`FFTW_MPI_TRANSPOSED_OUT'), `local_n1' and `local_1_start' give the
+size and starting index of the `n1' dimension, which is the first
+dimension of the transposed data (`n1' by `n0' by `n2').
+
+   (Note that `FFTW_MPI_TRANSPOSED_IN' is completely equivalent to
+performing `FFTW_MPI_TRANSPOSED_OUT' and passing the first two
+dimensions to the planner in reverse order, or vice versa.  If you pass
+_both_ the `FFTW_MPI_TRANSPOSED_IN' and `FFTW_MPI_TRANSPOSED_OUT'
+flags, it is equivalent to swapping the first two dimensions passed to
+the planner and passing _neither_ flag.)
+
+
+File: fftw3.info,  Node: One-dimensional distributions,  Prev: Transposed distributions,  Up: MPI Data Distribution
+
+6.4.4 One-dimensional distributions
+-----------------------------------
+
+For one-dimensional distributed DFTs using FFTW, matters are slightly
+more complicated because the data distribution is more closely tied to
+how the algorithm works.  In particular, you can no longer pass an
+arbitrary block size and must accept FFTW's default; also, the block
+sizes may be different for input and output.  Also, the data
+distribution depends on the flags and transform direction, in order for
+forward and backward transforms to work correctly.
+
+     ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm,
+                     int sign, unsigned flags,
+                     ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
+                     ptrdiff_t *local_no, ptrdiff_t *local_o_start);
+   
+   This function computes the data distribution for a 1d transform of
+size `n0' with the given transform `sign' and `flags'.  Both input and
+output data use block distributions.  The input on the current process
+will consist of `local_ni' numbers starting at index `local_i_start';
+e.g. if only a single process is used, then `local_ni' will be `n0' and
+`local_i_start' will be `0'.  Similarly for the output, with `local_no'
+numbers starting at index `local_o_start'.  The return value of
+`fftw_mpi_local_size_1d' will be the total number of elements to
+allocate on the current process (which might be slightly larger than
+the local size due to intermediate steps in the algorithm).
+
+   As mentioned above (*note Load balancing::), the data will be divided
+equally among the processes if `n0' is divisible by the _square_ of the
+number of processes.  In this case, `local_ni' will equal `local_no'.
+Otherwise, they may be different.
+
+   For some applications, such as convolutions, the order of the output
+data is irrelevant.  In this case, performance can be improved by
+specifying that the output data be stored in an FFTW-defined
+"scrambled" format.  (In particular, this is the analogue of transposed
+output in the multidimensional case: scrambled output saves a
+communications step.)  If you pass `FFTW_MPI_SCRAMBLED_OUT' in the
+flags, then the output is stored in this (undocumented) scrambled
+order.  Conversely, to perform the inverse transform of data in
+scrambled order, pass the `FFTW_MPI_SCRAMBLED_IN' flag.  
+
+   In MPI FFTW, only composite sizes `n0' can be parallelized; we have
+not yet implemented a parallel algorithm for large prime sizes.
+
+
+File: fftw3.info,  Node: Multi-dimensional MPI DFTs of Real Data,  Next: Other Multi-dimensional Real-data MPI Transforms,  Prev: MPI Data Distribution,  Up: Distributed-memory FFTW with MPI
+
+6.5 Multi-dimensional MPI DFTs of Real Data
+===========================================
+
+FFTW's MPI interface also supports multi-dimensional DFTs of real data,
+similar to the serial r2c and c2r interfaces.  (Parallel
+one-dimensional real-data DFTs are not currently supported; you must
+use a complex transform and set the imaginary parts of the inputs to
+zero.)
+
+   The key points to understand for r2c and c2r MPI transforms (compared
+to the MPI complex DFTs or the serial r2c/c2r transforms), are:
+
+   * Just as for serial transforms, r2c/c2r DFTs transform n[0] x n[1]
+     x n[2] x ... x n[d-1]  real data to/from n[0] x n[1] x n[2] x ...
+     x (n[d-1]/2 + 1)  complex data: the last dimension of the complex
+     data is cut in half (rounded down), plus one.  As for the serial
+     transforms, the sizes you pass to the `plan_dft_r2c' and
+     `plan_dft_c2r' are the n[0] x n[1] x n[2] x ... x n[d-1]
+     dimensions of the real data.
+
+   * Although the real data is _conceptually_ n[0] x n[1] x n[2] x ...
+     x n[d-1] , it is _physically_ stored as an n[0] x n[1] x n[2] x
+     ... x [2 (n[d-1]/2 + 1)]  array, where the last dimension has been
+     _padded_ to make it the same size as the complex output.  This is
+     much like the in-place serial r2c/c2r interface (*note
+     Multi-Dimensional DFTs of Real Data::), except that in MPI the
+     padding is required even for out-of-place data.  The extra padding
+     numbers are ignored by FFTW (they are _not_ like zero-padding the
+     transform to a larger size); they are only used to determine the
+     data layout.
+
+   * The data distribution in MPI for _both_ the real and complex data
+     is determined by the shape of the _complex_ data.  That is, you
+     call the appropriate `local size' function for the n[0] x n[1] x
+     n[2] x ... x (n[d-1]/2 + 1)
+
+     complex data, and then use the _same_ distribution for the real
+     data except that the last complex dimension is replaced by a
+     (padded) real dimension of twice the length.
+
+
+   For example suppose we are performing an out-of-place r2c transform
+of L x M x N  real data [padded to L x M x 2(N/2+1) ], resulting in L x
+M x N/2+1  complex data.  Similar to the example in *note 2d MPI
+example::, we might do something like:
+
+     #include <fftw3-mpi.h>
+
+     int main(int argc, char **argv)
+     {
+         const ptrdiff_t L = ..., M = ..., N = ...;
+         fftw_plan plan;
+         double *rin;
+         fftw_complex *cout;
+         ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k;
+
+         MPI_Init(&argc, &argv);
+         fftw_mpi_init();
+
+         /* get local data size and allocate */
+         alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD,
+                                              &local_n0, &local_0_start);
+         rin = fftw_alloc_real(2 * alloc_local);
+         cout = fftw_alloc_complex(alloc_local);
+
+         /* create plan for out-of-place r2c DFT */
+         plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
+                                         FFTW_MEASURE);
+
+         /* initialize rin to some function my_func(x,y,z) */
+         for (i = 0; i < local_n0; ++i)
+            for (j = 0; j < M; ++j)
+              for (k = 0; k < N; ++k)
+            rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k);
+
+         /* compute transforms as many times as desired */
+         fftw_execute(plan);
+
+         fftw_destroy_plan(plan);
+
+         MPI_Finalize();
+     }
+
+   Note that we allocated `rin' using `fftw_alloc_real' with an
+argument of `2 * alloc_local': since `alloc_local' is the number of
+_complex_ values to allocate, the number of _real_ values is twice as
+many.  The `rin' array is then local_n0 x M x 2(N/2+1)  in row-major
+order, so its `(i,j,k)' element is at the index `(i*M + j) *
+(2*(N/2+1)) + k' (*note Multi-dimensional Array Format::).
+
+   As for the complex transforms, improved performance can be obtained
+by specifying that the output is the transpose of the input or vice
+versa (*note Transposed distributions::).  In our L x M x N  r2c
+example, including `FFTW_TRANSPOSED_OUT' in the flags means that the
+input would be a padded L x M x 2(N/2+1)  real array distributed over
+the `L' dimension, while the output would be a M x L x N/2+1  complex
+array distributed over the `M' dimension.  To perform the inverse c2r
+transform with the same data distributions, you would use the
+`FFTW_TRANSPOSED_IN' flag.
+
+
+File: fftw3.info,  Node: Other Multi-dimensional Real-data MPI Transforms,  Next: FFTW MPI Transposes,  Prev: Multi-dimensional MPI DFTs of Real Data,  Up: Distributed-memory FFTW with MPI
+
+6.6 Other multi-dimensional Real-Data MPI Transforms
+====================================================
+
+FFTW's MPI interface also supports multi-dimensional `r2r' transforms
+of all kinds supported by the serial interface (e.g. discrete cosine
+and sine transforms, discrete Hartley transforms, etc.).  Only
+multi-dimensional `r2r' transforms, not one-dimensional transforms, are
+currently parallelized.
+
+   These are used much like the multidimensional complex DFTs discussed
+above, except that the data is real rather than complex, and one needs
+to pass an r2r transform kind (`fftw_r2r_kind') for each dimension as
+in the serial FFTW (*note More DFTs of Real Data::).
+
+   For example, one might perform a two-dimensional L x M  that is an
+REDFT10 (DCT-II) in the first dimension and an RODFT10 (DST-II) in the
+second dimension with code like:
+
+         const ptrdiff_t L = ..., M = ...;
+         fftw_plan plan;
+         double *data;
+         ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
+
+         /* get local data size and allocate */
+         alloc_local = fftw_mpi_local_size_2d(L, M, MPI_COMM_WORLD,
+                                              &local_n0, &local_0_start);
+         data = fftw_alloc_real(alloc_local);
+
+         /* create plan for in-place REDFT10 x RODFT10 */
+         plan = fftw_mpi_plan_r2r_2d(L, M, data, data, MPI_COMM_WORLD,
+                                     FFTW_REDFT10, FFTW_RODFT10, FFTW_MEASURE);
+
+         /* initialize data to some function my_function(x,y) */
+         for (i = 0; i < local_n0; ++i) for (j = 0; j < M; ++j)
+            data[i*M + j] = my_function(local_0_start + i, j);
+
+         /* compute transforms, in-place, as many times as desired */
+         fftw_execute(plan);
+
+         fftw_destroy_plan(plan);
+
+   Notice that we use the same `local_size' functions as we did for
+complex data, only now we interpret the sizes in terms of real rather
+than complex values, and correspondingly use `fftw_alloc_real'.
+
+
+File: fftw3.info,  Node: FFTW MPI Transposes,  Next: FFTW MPI Wisdom,  Prev: Other Multi-dimensional Real-data MPI Transforms,  Up: Distributed-memory FFTW with MPI
+
+6.7 FFTW MPI Transposes
+=======================
+
+The FFTW's MPI Fourier transforms rely on one or more _global
+transposition_ step for their communications.  For example, the
+multidimensional transforms work by transforming along some dimensions,
+then transposing to make the first dimension local and transforming
+that, then transposing back.  Because global transposition of a
+block-distributed matrix has many other potential uses besides FFTs,
+FFTW's transpose routines can be called directly, as documented in this
+section.
+
+* Menu:
+
+* Basic distributed-transpose interface::
+* Advanced distributed-transpose interface::
+* An improved replacement for MPI_Alltoall::
+
+
+File: fftw3.info,  Node: Basic distributed-transpose interface,  Next: Advanced distributed-transpose interface,  Prev: FFTW MPI Transposes,  Up: FFTW MPI Transposes
+
+6.7.1 Basic distributed-transpose interface
+-------------------------------------------
+
+In particular, suppose that we have an `n0' by `n1' array in row-major
+order, block-distributed across the `n0' dimension.  To transpose this
+into an `n1' by `n0' array block-distributed across the `n1' dimension,
+we would create a plan by calling the following function:
+
+     fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
+                                       double *in, double *out,
+                                       MPI_Comm comm, unsigned flags);
+   
+   The input and output arrays (`in' and `out') can be the same.  The
+transpose is actually executed by calling `fftw_execute' on the plan,
+as usual.  
+
+   The `flags' are the usual FFTW planner flags, but support two
+additional flags: `FFTW_MPI_TRANSPOSED_OUT' and/or
+`FFTW_MPI_TRANSPOSED_IN'.  What these flags indicate, for transpose
+plans, is that the output and/or input, respectively, are _locally_
+transposed.  That is, on each process input data is normally stored as
+a `local_n0' by `n1' array in row-major order, but for an
+`FFTW_MPI_TRANSPOSED_IN' plan the input data is stored as `n1' by
+`local_n0' in row-major order.  Similarly, `FFTW_MPI_TRANSPOSED_OUT'
+means that the output is `n0' by `local_n1' instead of `local_n1' by
+`n0'.  
+
+   To determine the local size of the array on each process before and
+after the transpose, as well as the amount of storage that must be
+allocated, one should call `fftw_mpi_local_size_2d_transposed', just as
+for a 2d DFT as described in the previous section: 
+
+     ptrdiff_t fftw_mpi_local_size_2d_transposed
+                     (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
+                      ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
+                      ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
+   
+   Again, the return value is the local storage to allocate, which in
+this case is the number of _real_ (`double') values rather than complex
+numbers as in the previous examples.
+
+
+File: fftw3.info,  Node: Advanced distributed-transpose interface,  Next: An improved replacement for MPI_Alltoall,  Prev: Basic distributed-transpose interface,  Up: FFTW MPI Transposes
+
+6.7.2 Advanced distributed-transpose interface
+----------------------------------------------
+
+The above routines are for a transpose of a matrix of numbers (of type
+`double'), using FFTW's default block sizes.  More generally, one can
+perform transposes of _tuples_ of numbers, with user-specified block
+sizes for the input and output:
+
+     fftw_plan fftw_mpi_plan_many_transpose
+                     (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
+                      ptrdiff_t block0, ptrdiff_t block1,
+                      double *in, double *out, MPI_Comm comm, unsigned flags);
+   
+   In this case, one is transposing an `n0' by `n1' matrix of
+`howmany'-tuples (e.g. `howmany = 2' for complex numbers).  The input
+is distributed along the `n0' dimension with block size `block0', and
+the `n1' by `n0' output is distributed along the `n1' dimension with
+block size `block1'.  If `FFTW_MPI_DEFAULT_BLOCK' (0) is passed for a
+block size then FFTW uses its default block size.  To get the local
+size of the data on each process, you should then call
+`fftw_mpi_local_size_many_transposed'.  
+
+
+File: fftw3.info,  Node: An improved replacement for MPI_Alltoall,  Prev: Advanced distributed-transpose interface,  Up: FFTW MPI Transposes
+
+6.7.3 An improved replacement for MPI_Alltoall
+----------------------------------------------
+
+We close this section by noting that FFTW's MPI transpose routines can
+be thought of as a generalization for the `MPI_Alltoall' function
+(albeit only for floating-point types), and in some circumstances can
+function as an improved replacement.  
+
+   `MPI_Alltoall' is defined by the MPI standard as:
+
+     int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
+                      void *recvbuf, int recvcnt, MPI_Datatype recvtype,
+                      MPI_Comm comm);
+
+   In particular, for `double*' arrays `in' and `out', consider the
+call:
+
+     MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany MPI_DOUBLE, comm);
+
+   This is completely equivalent to:
+
+     MPI_Comm_size(comm, &P);
+     plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE);
+     fftw_execute(plan);
+     fftw_destroy_plan(plan);
+
+   That is, computing a P x P  transpose on `P' processes, with a block
+size of 1, is just a standard all-to-all communication.
+
+   However, using the FFTW routine instead of `MPI_Alltoall' may have
+certain advantages.  First of all, FFTW's routine can operate in-place
+(`in == out') whereas `MPI_Alltoall' can only operate out-of-place.  
+
+   Second, even for out-of-place plans, FFTW's routine may be faster,
+especially if you need to perform the all-to-all communication many
+times and can afford to use `FFTW_MEASURE' or `FFTW_PATIENT'.  It
+should certainly be no slower, not including the time to create the
+plan, since one of the possible algorithms that FFTW uses for an
+out-of-place transpose _is_ simply to call `MPI_Alltoall'.  However,
+FFTW also considers several other possible algorithms that, depending
+on your MPI implementation and your hardware, may be faster.  
+
+
+File: fftw3.info,  Node: FFTW MPI Wisdom,  Next: Avoiding MPI Deadlocks,  Prev: FFTW MPI Transposes,  Up: Distributed-memory FFTW with MPI
+
+6.8 FFTW MPI Wisdom
+===================
+
+FFTW's "wisdom" facility (*note Words of Wisdom-Saving Plans::) can be
+used to save MPI plans as well as to save uniprocessor plans.  However,
+for MPI there are several unavoidable complications.
+
+   First, the MPI standard does not guarantee that every process can
+perform file I/O (at least, not using C stdio routines)--in general, we
+may only assume that process 0 is capable of I/O.(1) So, if we want to
+export the wisdom from a single process to a file, we must first export
+the wisdom to a string, then send it to process 0, then write it to a
+file.
+
+   Second, in principle we may want to have separate wisdom for every
+process, since in general the processes may run on different hardware
+even for a single MPI program.  However, in practice FFTW's MPI code is
+designed for the case of homogeneous hardware (*note Load balancing::),
+and in this case it is convenient to use the same wisdom for every
+process.  Thus, we need a mechanism to synchronize the wisdom.
+
+   To address both of these problems, FFTW provides the following two
+functions:
+
+     void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
+     void fftw_mpi_gather_wisdom(MPI_Comm comm);
+   
+   Given a communicator `comm', `fftw_mpi_broadcast_wisdom' will
+broadcast the wisdom from process 0 to all other processes.
+Conversely, `fftw_mpi_gather_wisdom' will collect wisdom from all
+processes onto process 0.  (If the plans created for the same problem
+by different processes are not the same, `fftw_mpi_gather_wisdom' will
+arbitrarily choose one of the plans.)  Both of these functions may
+result in suboptimal plans for different processes if the processes are
+running on non-identical hardware.  Both of these functions are
+_collective_ calls, which means that they must be executed by all
+processes in the communicator.  
+
+   So, for example, a typical code snippet to import wisdom from a file
+and use it on all processes would be:
+
+     {
+         int rank;
+
+         fftw_mpi_init();
+         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+         if (rank == 0) fftw_import_wisdom_from_filename("mywisdom");
+         fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD);
+     }
+
+   (Note that we must call `fftw_mpi_init' before importing any wisdom
+that might contain MPI plans.)  Similarly, a typical code snippet to
+export wisdom from all processes to a file is: 
+
+     {
+         int rank;
+
+         fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
+         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+         if (rank == 0) fftw_export_wisdom_to_filename("mywisdom");
+     }
+
+   ---------- Footnotes ----------
+
+   (1) In fact, even this assumption is not technically guaranteed by
+the standard, although it seems to be universal in actual MPI
+implementations and is widely assumed by MPI-using software.
+Technically, you need to query the `MPI_IO' attribute of
+`MPI_COMM_WORLD' with `MPI_Attr_get'.  If this attribute is
+`MPI_PROC_NULL', no I/O is possible.  If it is `MPI_ANY_SOURCE', any
+process can perform I/O.  Otherwise, it is the rank of a process that
+can perform I/O ... but since it is not guaranteed to yield the _same_
+rank on all processes, you have to do an `MPI_Allreduce' of some kind
+if you want all processes to agree about which is going to do I/O.  And
+even then, the standard only guarantees that this process can perform
+output, but not input. See e.g. `Parallel Programming with MPI' by P.
+S. Pacheco, section 8.1.3.  Needless to say, in our experience
+virtually no MPI programmers worry about this.
+
+
+File: fftw3.info,  Node: Avoiding MPI Deadlocks,  Next: FFTW MPI Performance Tips,  Prev: FFTW MPI Wisdom,  Up: Distributed-memory FFTW with MPI
+
+6.9 Avoiding MPI Deadlocks
+==========================
+
+An MPI program can _deadlock_ if one process is waiting for a message
+from another process that never gets sent.  To avoid deadlocks when
+using FFTW's MPI routines, it is important to know which functions are
+_collective_: that is, which functions must _always_ be called in the
+_same order_ from _every_ process in a given communicator.  (For
+example, `MPI_Barrier' is the canonical example of a collective
+function in the MPI standard.)  
+
+   The functions in FFTW that are _always_ collective are: every
+function beginning with `fftw_mpi_plan', as well as
+`fftw_mpi_broadcast_wisdom' and `fftw_mpi_gather_wisdom'.  Also, the
+following functions from the ordinary FFTW interface are collective
+when they are applied to a plan created by an `fftw_mpi_plan' function:
+`fftw_execute', `fftw_destroy_plan', and `fftw_flops'.  
+
+
+File: fftw3.info,  Node: FFTW MPI Performance Tips,  Next: Combining MPI and Threads,  Prev: Avoiding MPI Deadlocks,  Up: Distributed-memory FFTW with MPI
+
+6.10 FFTW MPI Performance Tips
+==============================
+
+In this section, we collect a few tips on getting the best performance
+out of FFTW's MPI transforms.
+
+   First, because of the 1d block distribution, FFTW's parallelization
+is currently limited by the size of the first dimension.
+(Multidimensional block distributions may be supported by a future
+version.) More generally, you should ideally arrange the dimensions so
+that FFTW can divide them equally among the processes. *Note Load
+balancing::.  
+
+   Second, if it is not too inconvenient, you should consider working
+with transposed output for multidimensional plans, as this saves a
+considerable amount of communications.  *Note Transposed
+distributions::.  
+
+   Third, the fastest choices are generally either an in-place transform
+or an out-of-place transform with the `FFTW_DESTROY_INPUT' flag (which
+allows the input array to be used as scratch space).  In-place is
+especially beneficial if the amount of data per process is large.  
+
+   Fourth, if you have multiple arrays to transform at once, rather than
+calling FFTW's MPI transforms several times it usually seems to be
+faster to interleave the data and use the advanced interface.  (This
+groups the communications together instead of requiring separate
+messages for each transform.)
+
+
+File: fftw3.info,  Node: Combining MPI and Threads,  Next: FFTW MPI Reference,  Prev: FFTW MPI Performance Tips,  Up: Distributed-memory FFTW with MPI
+
+6.11 Combining MPI and Threads
+==============================
+
+In certain cases, it may be advantageous to combine MPI
+(distributed-memory) and threads (shared-memory) parallelization.  FFTW
+supports this, with certain caveats.  For example, if you have a
+cluster of 4-processor shared-memory nodes, you may want to use threads
+within the nodes and MPI between the nodes, instead of MPI for all
+parallelization.
+
+   In particular, it is possible to seamlessly combine the MPI FFTW
+routines with the multi-threaded FFTW routines (*note Multi-threaded
+FFTW::). However, some care must be taken in the initialization code,
+which should look something like this:
+
+     int threads_ok;
+
+     int main(int argc, char **argv)
+     {
+         int provided;
+         MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
+         threads_ok = provided >= MPI_THREAD_FUNNELED;
+
+         if (threads_ok) threads_ok = fftw_init_threads();
+         fftw_mpi_init();
+
+         ...
+         if (threads_ok) fftw_plan_with_nthreads(...);
+         ...
+
+         MPI_Finalize();
+     }
+   
+   First, note that instead of calling `MPI_Init', you should call
+`MPI_Init_threads', which is the initialization routine defined by the
+MPI-2 standard to indicate to MPI that your program will be
+multithreaded.  We pass `MPI_THREAD_FUNNELED', which indicates that we
+will only call MPI routines from the main thread.  (FFTW will launch
+additional threads internally, but the extra threads will not call MPI
+code.)  (You may also pass `MPI_THREAD_SERIALIZED' or
+`MPI_THREAD_MULTIPLE', which requests additional multithreading support
+from the MPI implementation, but this is not required by FFTW.)  The
+`provided' parameter returns what level of threads support is actually
+supported by your MPI implementation; this _must_ be at least
+`MPI_THREAD_FUNNELED' if you want to call the FFTW threads routines, so
+we define a global variable `threads_ok' to record this.  You should
+only call `fftw_init_threads' or `fftw_plan_with_nthreads' if
+`threads_ok' is true.  For more information on thread safety in MPI,
+see the MPI and Threads
+(http://www.mpi-forum.org/docs/mpi-20-html/node162.htm) section of the
+MPI-2 standard.  
+
+   Second, we must call `fftw_init_threads' _before_ `fftw_mpi_init'.
+This is critical for technical reasons having to do with how FFTW
+initializes its list of algorithms.
+
+   Then, if you call `fftw_plan_with_nthreads(N)', _every_ MPI process
+will launch (up to) `N' threads to parallelize its transforms.
+
+   For example, in the hypothetical cluster of 4-processor nodes, you
+might wish to launch only a single MPI process per node, and then call
+`fftw_plan_with_nthreads(4)' on each process to use all processors in
+the nodes.
+
+   This may or may not be faster than simply using as many MPI processes
+as you have processors, however.  On the one hand, using threads within
+a node eliminates the need for explicit message passing within the
+node.  On the other hand, FFTW's transpose routines are not
+multi-threaded, and this means that the communications that do take
+place will not benefit from parallelization within the node.  Moreover,
+many MPI implementations already have optimizations to exploit shared
+memory when it is available, so adding the multithreaded FFTW on top of
+this may be superfluous.  
+
+
+File: fftw3.info,  Node: FFTW MPI Reference,  Next: FFTW MPI Fortran Interface,  Prev: Combining MPI and Threads,  Up: Distributed-memory FFTW with MPI
+
+6.12 FFTW MPI Reference
+=======================
+
+This chapter provides a complete reference to all FFTW MPI functions,
+datatypes, and constants.  See also *note FFTW Reference:: for
+information on functions and types in common with the serial interface.
+
+* Menu:
+
+* MPI Files and Data Types::
+* MPI Initialization::
+* Using MPI Plans::
+* MPI Data Distribution Functions::
+* MPI Plan Creation::
+* MPI Wisdom Communication::
+
+
+File: fftw3.info,  Node: MPI Files and Data Types,  Next: MPI Initialization,  Prev: FFTW MPI Reference,  Up: FFTW MPI Reference
+
+6.12.1 MPI Files and Data Types
+-------------------------------
+
+All programs using FFTW's MPI support should include its header file:
+
+     #include <fftw3-mpi.h>
+
+   Note that this header file includes the serial-FFTW `fftw3.h' header
+file, and also the `mpi.h' header file for MPI, so you need not include
+those files separately.
+
+   You must also link to _both_ the FFTW MPI library and to the serial
+FFTW library.  On Unix, this means adding `-lfftw3_mpi -lfftw3 -lm' at
+the end of the link command.
+
+   Different precisions are handled as in the serial interface: *Note
+Precision::.  That is, `fftw_' functions become `fftwf_' (in single
+precision) etcetera, and the libraries become `-lfftw3f_mpi -lfftw3f
+-lm' etcetera on Unix.  Long-double precision is supported in MPI, but
+quad precision (`fftwq_') is not due to the lack of MPI support for
+this type.
+
+
+File: fftw3.info,  Node: MPI Initialization,  Next: Using MPI Plans,  Prev: MPI Files and Data Types,  Up: FFTW MPI Reference
+
+6.12.2 MPI Initialization
+-------------------------
+
+Before calling any other FFTW MPI (`fftw_mpi_') function, and before
+importing any wisdom for MPI problems, you must call:
+
+     void fftw_mpi_init(void);
+
+   If FFTW threads support is used, however, `fftw_mpi_init' should be
+called _after_ `fftw_init_threads' (*note Combining MPI and Threads::).
+Calling `fftw_mpi_init' additional times (before `fftw_mpi_cleanup')
+has no effect.
+
+   If you want to deallocate all persistent data and reset FFTW to the
+pristine state it was in when you started your program, you can call:
+
+     void fftw_mpi_cleanup(void);
+
+   (This calls `fftw_cleanup', so you need not call the serial cleanup
+routine too, although it is safe to do so.)  After calling
+`fftw_mpi_cleanup', all existing plans become undefined, and you should
+not attempt to execute or destroy them.  You must call `fftw_mpi_init'
+again after `fftw_mpi_cleanup' if you want to resume using the MPI FFTW
+routines.
+
+
+File: fftw3.info,  Node: Using MPI Plans,  Next: MPI Data Distribution Functions,  Prev: MPI Initialization,  Up: FFTW MPI Reference
+
+6.12.3 Using MPI Plans
+----------------------
+
+Once an MPI plan is created, you can execute and destroy it using
+`fftw_execute', `fftw_destroy_plan', and the other functions in the
+serial interface that operate on generic plans (*note Using Plans::).
+
+   The `fftw_execute' and `fftw_destroy_plan' functions, applied to MPI
+plans, are _collective_ calls: they must be called for all processes in
+the communicator that was used to create the plan.
+
+   You must _not_ use the serial new-array plan-execution functions
+`fftw_execute_dft' and so on (*note New-array Execute Functions::) with
+MPI plans.  Such functions are specialized to the problem type, and
+there are specific new-array execute functions for MPI plans:
+
+     void fftw_mpi_execute_dft(fftw_plan p, fftw_complex *in, fftw_complex *out);
+     void fftw_mpi_execute_dft_r2c(fftw_plan p, double *in, fftw_complex *out);
+     void fftw_mpi_execute_dft_c2r(fftw_plan p, fftw_complex *in, double *out);
+     void fftw_mpi_execute_r2r(fftw_plan p, double *in, double *out);
+
+   These functions have the same restrictions as those of the serial
+new-array execute functions.  They are _always_ safe to apply to the
+_same_ `in' and `out' arrays that were used to create the plan.  They
+can only be applied to new arrarys if those arrays have the same types,
+dimensions, in-placeness, and alignment as the original arrays, where
+the best way to ensure the same alignment is to use FFTW's
+`fftw_malloc' and related allocation functions for all arrays (*note
+Memory Allocation::).  Note that distributed transposes (*note FFTW MPI
+Transposes::) use `fftw_mpi_execute_r2r', since they count as rank-zero
+r2r plans from FFTW's perspective.
+
+
+File: fftw3.info,  Node: MPI Data Distribution Functions,  Next: MPI Plan Creation,  Prev: Using MPI Plans,  Up: FFTW MPI Reference
+
+6.12.4 MPI Data Distribution Functions
+--------------------------------------
+
+As described above (*note MPI Data Distribution::), in order to
+allocate your arrays, _before_ creating a plan, you must first call one
+of the following routines to determine the required allocation size and
+the portion of the array locally stored on a given process.  The
+`MPI_Comm' communicator passed here must be equivalent to the
+communicator used below for plan creation.
+
+   The basic interface for multidimensional transforms consists of the
+functions:
+
+     ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
+                                      ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
+     ptrdiff_t fftw_mpi_local_size_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
+                                      MPI_Comm comm,
+                                      ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
+     ptrdiff_t fftw_mpi_local_size(int rnk, const ptrdiff_t *n, MPI_Comm comm,
+                                   ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
+
+     ptrdiff_t fftw_mpi_local_size_2d_transposed(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
+                                                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
+                                                 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
+     ptrdiff_t fftw_mpi_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
+                                                 MPI_Comm comm,
+                                                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
+                                                 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
+     ptrdiff_t fftw_mpi_local_size_transposed(int rnk, const ptrdiff_t *n, MPI_Comm comm,
+                                              ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
+                                              ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
+
+   These functions return the number of elements to allocate (complex
+numbers for DFT/r2c/c2r plans, real numbers for r2r plans), whereas the
+`local_n0' and `local_0_start' return the portion (`local_0_start' to
+`local_0_start + local_n0 - 1') of the first dimension of an n[0] x
+n[1] x n[2] x ... x n[d-1]  array that is stored on the local process.
+*Note Basic and advanced distribution interfaces::.  For
+`FFTW_MPI_TRANSPOSED_OUT' plans, the `_transposed' variants are useful
+in order to also return the local portion of the first dimension in the
+n[1] x n[0] x n[2] x ... x n[d-1]  transposed output.  *Note Transposed
+distributions::.  The advanced interface for multidimensional
+transforms is:
+
+     ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
+                                        ptrdiff_t block0, MPI_Comm comm,
+                                        ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
+     ptrdiff_t fftw_mpi_local_size_many_transposed(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
+                                                   ptrdiff_t block0, ptrdiff_t block1, MPI_Comm comm,
+                                                   ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
+                                                   ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
+
+   These differ from the basic interface in only two ways.  First, they
+allow you to specify block sizes `block0' and `block1' (the latter for
+the transposed output); you can pass `FFTW_MPI_DEFAULT_BLOCK' to use
+FFTW's default block size as in the basic interface.  Second, you can
+pass a `howmany' parameter, corresponding to the advanced planning
+interface below: this is for transforms of contiguous `howmany'-tuples
+of numbers (`howmany = 1' in the basic interface).
+
+   The corresponding basic and advanced routines for one-dimensional
+transforms (currently only complex DFTs) are:
+
+     ptrdiff_t fftw_mpi_local_size_1d(
+                  ptrdiff_t n0, MPI_Comm comm, int sign, unsigned flags,
+                  ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
+                  ptrdiff_t *local_no, ptrdiff_t *local_o_start);
+     ptrdiff_t fftw_mpi_local_size_many_1d(
+                  ptrdiff_t n0, ptrdiff_t howmany,
+                  MPI_Comm comm, int sign, unsigned flags,
+                  ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
+                  ptrdiff_t *local_no, ptrdiff_t *local_o_start);
+
+   As above, the return value is the number of elements to allocate
+(complex numbers, for complex DFTs).  The `local_ni' and
+`local_i_start' arguments return the portion (`local_i_start' to
+`local_i_start + local_ni - 1') of the 1d array that is stored on this
+process for the transform _input_, and `local_no' and `local_o_start'
+are the corresponding quantities for the input.  The `sign'
+(`FFTW_FORWARD' or `FFTW_BACKWARD') and `flags' must match the
+arguments passed when creating a plan.  Although the inputs and outputs
+have different data distributions in general, it is guaranteed that the
+_output_ data distribution of an `FFTW_FORWARD' plan will match the
+_input_ data distribution of an `FFTW_BACKWARD' plan and vice versa;
+similarly for the `FFTW_MPI_SCRAMBLED_OUT' and `FFTW_MPI_SCRAMBLED_IN'
+flags.  *Note One-dimensional distributions::.
+
+
+File: fftw3.info,  Node: MPI Plan Creation,  Next: MPI Wisdom Communication,  Prev: MPI Data Distribution Functions,  Up: FFTW MPI Reference
+
+6.12.5 MPI Plan Creation
+------------------------
+
+Complex-data MPI DFTs
+.....................
+
+Plans for complex-data DFTs (*note 2d MPI example::) are created by:
+
+     fftw_plan fftw_mpi_plan_dft_1d(ptrdiff_t n0, fftw_complex *in, fftw_complex *out,
+                                    MPI_Comm comm, int sign, unsigned flags);
+     fftw_plan fftw_mpi_plan_dft_2d(ptrdiff_t n0, ptrdiff_t n1,
+                                    fftw_complex *in, fftw_complex *out,
+                                    MPI_Comm comm, int sign, unsigned flags);
+     fftw_plan fftw_mpi_plan_dft_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
+                                    fftw_complex *in, fftw_complex *out,
+                                    MPI_Comm comm, int sign, unsigned flags);
+     fftw_plan fftw_mpi_plan_dft(int rnk, const ptrdiff_t *n,
+                                 fftw_complex *in, fftw_complex *out,
+                                 MPI_Comm comm, int sign, unsigned flags);
+     fftw_plan fftw_mpi_plan_many_dft(int rnk, const ptrdiff_t *n,
+                                      ptrdiff_t howmany, ptrdiff_t block, ptrdiff_t tblock,
+                                      fftw_complex *in, fftw_complex *out,
+                                      MPI_Comm comm, int sign, unsigned flags);
+
+   These are similar to their serial counterparts (*note Complex DFTs::)
+in specifying the dimensions, sign, and flags of the transform.  The
+`comm' argument gives an MPI communicator that specifies the set of
+processes to participate in the transform; plan creation is a
+collective function that must be called for all processes in the
+communicator.  The `in' and `out' pointers refer only to a portion of
+the overall transform data (*note MPI Data Distribution::) as specified
+by the `local_size' functions in the previous section.  Unless `flags'
+contains `FFTW_ESTIMATE', these arrays are overwritten during plan
+creation as for the serial interface.  For multi-dimensional
+transforms, any dimensions `> 1' are supported; for one-dimensional
+transforms, only composite (non-prime) `n0' are currently supported
+(unlike the serial FFTW).  Requesting an unsupported transform size
+will yield a `NULL' plan.  (As in the serial interface, highly
+composite sizes generally yield the best performance.)
+
+   The advanced-interface `fftw_mpi_plan_many_dft' additionally allows
+you to specify the block sizes for the first dimension (`block') of the
+n[0] x n[1] x n[2] x ... x n[d-1]  input data and the first dimension
+(`tblock') of the n[1] x n[0] x n[2] x ... x n[d-1]  transposed data
+(at intermediate steps of the transform, and for the output if
+`FFTW_TRANSPOSED_OUT' is specified in `flags').  These must be the same
+block sizes as were passed to the corresponding `local_size' function;
+you can pass `FFTW_MPI_DEFAULT_BLOCK' to use FFTW's default block size
+as in the basic interface.  Also, the `howmany' parameter specifies
+that the transform is of contiguous `howmany'-tuples rather than
+individual complex numbers; this corresponds to the same parameter in
+the serial advanced interface (*note Advanced Complex DFTs::) with
+`stride = howmany' and `dist = 1'.
+
+MPI flags
+.........
+
+The `flags' can be any of those for the serial FFTW (*note Planner
+Flags::), and in addition may include one or more of the following
+MPI-specific flags, which improve performance at the cost of changing
+the output or input data formats.
+
+   * `FFTW_MPI_SCRAMBLED_OUT', `FFTW_MPI_SCRAMBLED_IN': valid for 1d
+     transforms only, these flags indicate that the output/input of the
+     transform are in an undocumented "scrambled" order.  A forward
+     `FFTW_MPI_SCRAMBLED_OUT' transform can be inverted by a backward
+     `FFTW_MPI_SCRAMBLED_IN' (times the usual 1/N normalization).
+     *Note One-dimensional distributions::.
+
+   * `FFTW_MPI_TRANSPOSED_OUT', `FFTW_MPI_TRANSPOSED_IN': valid for
+     multidimensional (`rnk > 1') transforms only, these flags specify
+     that the output or input of an n[0] x n[1] x n[2] x ... x n[d-1]
+     transform is transposed to n[1] x n[0] x n[2] x ... x n[d-1] .
+     *Note Transposed distributions::.
+
+
+Real-data MPI DFTs
+..................
+
+Plans for real-input/output (r2c/c2r) DFTs (*note Multi-dimensional MPI
+DFTs of Real Data::) are created by:
+
+     fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1,
+                                        double *in, fftw_complex *out,
+                                        MPI_Comm comm, unsigned flags);
+     fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1,
+                                        double *in, fftw_complex *out,
+                                        MPI_Comm comm, unsigned flags);
+     fftw_plan fftw_mpi_plan_dft_r2c_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
+                                        double *in, fftw_complex *out,
+                                        MPI_Comm comm, unsigned flags);
+     fftw_plan fftw_mpi_plan_dft_r2c(int rnk, const ptrdiff_t *n,
+                                     double *in, fftw_complex *out,
+                                     MPI_Comm comm, unsigned flags);
+     fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1,
+                                        fftw_complex *in, double *out,
+                                        MPI_Comm comm, unsigned flags);
+     fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1,
+                                        fftw_complex *in, double *out,
+                                        MPI_Comm comm, unsigned flags);
+     fftw_plan fftw_mpi_plan_dft_c2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
+                                        fftw_complex *in, double *out,
+                                        MPI_Comm comm, unsigned flags);
+     fftw_plan fftw_mpi_plan_dft_c2r(int rnk, const ptrdiff_t *n,
+                                     fftw_complex *in, double *out,
+                                     MPI_Comm comm, unsigned flags);
+
+   Similar to the serial interface (*note Real-data DFTs::), these
+transform logically n[0] x n[1] x n[2] x ... x n[d-1]  real data
+to/from n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1)  complex data,
+representing the non-redundant half of the conjugate-symmetry output of
+a real-input DFT (*note Multi-dimensional Transforms::).  However, the
+real array must be stored within a padded n[0] x n[1] x n[2] x ... x [2
+(n[d-1]/2 + 1)]
+
+   array (much like the in-place serial r2c transforms, but here for
+out-of-place transforms as well). Currently, only multi-dimensional
+(`rnk > 1') r2c/c2r transforms are supported (requesting a plan for
+`rnk = 1' will yield `NULL').  As explained above (*note
+Multi-dimensional MPI DFTs of Real Data::), the data distribution of
+both the real and complex arrays is given by the `local_size' function
+called for the dimensions of the _complex_ array.  Similar to the other
+planning functions, the input and output arrays are overwritten when
+the plan is created except in `FFTW_ESTIMATE' mode.
+
+   As for the complex DFTs above, there is an advance interface that
+allows you to manually specify block sizes and to transform contiguous
+`howmany'-tuples of real/complex numbers:
+
+     fftw_plan fftw_mpi_plan_many_dft_r2c
+                   (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
+                    ptrdiff_t iblock, ptrdiff_t oblock,
+                    double *in, fftw_complex *out,
+                    MPI_Comm comm, unsigned flags);
+     fftw_plan fftw_mpi_plan_many_dft_c2r
+                   (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
+                    ptrdiff_t iblock, ptrdiff_t oblock,
+                    fftw_complex *in, double *out,
+                    MPI_Comm comm, unsigned flags);
+
+MPI r2r transforms
+..................
+
+There are corresponding plan-creation routines for r2r transforms
+(*note More DFTs of Real Data::), currently supporting multidimensional
+(`rnk > 1') transforms only (`rnk = 1' will yield a `NULL' plan):
+
+     fftw_plan fftw_mpi_plan_r2r_2d(ptrdiff_t n0, ptrdiff_t n1,
+                                    double *in, double *out,
+                                    MPI_Comm comm,
+                                    fftw_r2r_kind kind0, fftw_r2r_kind kind1,
+                                    unsigned flags);
+     fftw_plan fftw_mpi_plan_r2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
+                                    double *in, double *out,
+                                    MPI_Comm comm,
+                                    fftw_r2r_kind kind0, fftw_r2r_kind kind1, fftw_r2r_kind kind2,
+                                    unsigned flags);
+     fftw_plan fftw_mpi_plan_r2r(int rnk, const ptrdiff_t *n,
+                                 double *in, double *out,
+                                 MPI_Comm comm, const fftw_r2r_kind *kind,
+                                 unsigned flags);
+     fftw_plan fftw_mpi_plan_many_r2r(int rnk, const ptrdiff_t *n,
+                                      ptrdiff_t iblock, ptrdiff_t oblock,
+                                      double *in, double *out,
+                                      MPI_Comm comm, const fftw_r2r_kind *kind,
+                                      unsigned flags);
+
+   The parameters are much the same as for the complex DFTs above,
+except that the arrays are of real numbers (and hence the outputs of the
+`local_size' data-distribution functions should be interpreted as
+counts of real rather than complex numbers).  Also, the `kind'
+parameters specify the r2r kinds along each dimension as for the serial
+interface (*note Real-to-Real Transform Kinds::).  *Note Other
+Multi-dimensional Real-data MPI Transforms::.
+
+MPI transposition
+.................
+
+FFTW also provides routines to plan a transpose of a distributed `n0'
+by `n1' array of real numbers, or an array of `howmany'-tuples of real
+numbers with specified block sizes (*note FFTW MPI Transposes::):
+
+     fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
+                                       double *in, double *out,
+                                       MPI_Comm comm, unsigned flags);
+     fftw_plan fftw_mpi_plan_many_transpose
+                     (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
+                      ptrdiff_t block0, ptrdiff_t block1,
+                      double *in, double *out, MPI_Comm comm, unsigned flags);
+
+   These plans are used with the `fftw_mpi_execute_r2r' new-array
+execute function (*note Using MPI Plans::), since they count as (rank
+zero) r2r plans from FFTW's perspective.
+
+
+File: fftw3.info,  Node: MPI Wisdom Communication,  Prev: MPI Plan Creation,  Up: FFTW MPI Reference
+
+6.12.6 MPI Wisdom Communication
+-------------------------------
+
+To facilitate synchronizing wisdom among the different MPI processes,
+we provide two functions:
+
+     void fftw_mpi_gather_wisdom(MPI_Comm comm);
+     void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
+
+   The `fftw_mpi_gather_wisdom' function gathers all wisdom in the
+given communicator `comm' to the process of rank 0 in the communicator:
+that process obtains the union of all wisdom on all the processes.  As
+a side effect, some other processes will gain additional wisdom from
+other processes, but only process 0 will gain the complete union.
+
+   The `fftw_mpi_broadcast_wisdom' does the reverse: it exports wisdom
+from process 0 in `comm' to all other processes in the communicator,
+replacing any wisdom they currently have.
+
+   *Note FFTW MPI Wisdom::.
+
+
+File: fftw3.info,  Node: FFTW MPI Fortran Interface,  Prev: FFTW MPI Reference,  Up: Distributed-memory FFTW with MPI
+
+6.13 FFTW MPI Fortran Interface
+===============================
+
+The FFTW MPI interface is callable from modern Fortran compilers
+supporting the Fortran 2003 `iso_c_binding' standard for calling C
+functions.  As described in *note Calling FFTW from Modern Fortran::,
+this means that you can directly call FFTW's C interface from Fortran
+with only minor changes in syntax.  There are, however, a few things
+specific to the MPI interface to keep in mind:
+
+   * Instead of including `fftw3.f03' as in *note Overview of Fortran
+     interface::, you should `include 'fftw3-mpi.f03'' (after `use,
+     intrinsic :: iso_c_binding' as before).  The `fftw3-mpi.f03' file
+     includes `fftw3.f03', so you should _not_ `include' them both
+     yourself.  (You will also want to include the MPI header file,
+     usually via `include 'mpif.h'' or similar, although though this is
+     not needed by `fftw3-mpi.f03' per se.)  (To use the `fftwl_' `long
+     double' extended-precision routines in supporting compilers, you
+     should include `fftw3f-mpi.f03' in _addition_ to `fftw3-mpi.f03'.
+     *Note Extended and quadruple precision in Fortran::.)
+
+   * Because of the different storage conventions between C and Fortran,
+     you reverse the order of your array dimensions when passing them to
+     FFTW (*note Reversing array dimensions::).  This is merely a
+     difference in notation and incurs no performance overhead.
+     However, it means that, whereas in C the _first_ dimension is
+     distributed, in Fortran the _last_ dimension of your array is
+     distributed.
+
+   * In Fortran, communicators are stored as `integer' types; there is
+     no `MPI_Comm' type, nor is there any way to access a C `MPI_Comm'.
+     Fortunately, this is taken care of for you by the FFTW Fortran
+     interface: whenever the C interface expects an `MPI_Comm' type,
+     you should pass the Fortran communicator as an `integer'.(1)
+
+   * Because you need to call the `local_size' function to find out how
+     much space to allocate, and this may be _larger_ than the local
+     portion of the array (*note MPI Data Distribution::), you should
+     _always_ allocate your arrays dynamically using FFTW's allocation
+     routines as described in *note Allocating aligned memory in
+     Fortran::.  (Coincidentally, this also provides the best
+     performance by guaranteeding proper data alignment.)
+
+   * Because all sizes in the MPI FFTW interface are declared as
+     `ptrdiff_t' in C, you should use `integer(C_INTPTR_T)' in Fortran
+     (*note FFTW Fortran type reference::).
+
+   * In Fortran, because of the language semantics, we generally
+     recommend using the new-array execute functions for all plans,
+     even in the common case where you are executing the plan on the
+     same arrays for which the plan was created (*note Plan execution
+     in Fortran::).  However, note that in the MPI interface these
+     functions are changed: `fftw_execute_dft' becomes
+     `fftw_mpi_execute_dft', etcetera. *Note Using MPI Plans::.
+
+
+   For example, here is a Fortran code snippet to perform a distributed
+L x M  complex DFT in-place.  (This assumes you have already
+initialized MPI with `MPI_init' and have also performed `call
+fftw_mpi_init'.)
+
+       use, intrinsic :: iso_c_binding
+       include 'fftw3-mpi.f03'
+       integer(C_INTPTR_T), parameter :: L = ...
+       integer(C_INTPTR_T), parameter :: M = ...
+       type(C_PTR) :: plan, cdata
+       complex(C_DOUBLE_COMPLEX), pointer :: data(:,:)
+       integer(C_INTPTR_T) :: i, j, alloc_local, local_M, local_j_offset
+
+     !   get local data size and allocate (note dimension reversal)
+       alloc_local = fftw_mpi_local_size_2d(M, L, MPI_COMM_WORLD, &
+                                            local_M, local_j_offset)
+       cdata = fftw_alloc_complex(alloc_local)
+       call c_f_pointer(cdata, data, [L,local_M])
+
+     !   create MPI plan for in-place forward DFT (note dimension reversal)
+       plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, &
+                                   FFTW_FORWARD, FFTW_MEASURE)
+
+     ! initialize data to some function my_function(i,j)
+       do j = 1, local_M
+         do i = 1, L
+           data(i, j) = my_function(i, j + local_j_offset)
+         end do
+       end do
+
+     ! compute transform (as many times as desired)
+       call fftw_mpi_execute_dft(plan, data, data)
+
+       call fftw_destroy_plan(plan)
+       call fftw_free(cdata)
+
+   Note that when we called `fftw_mpi_local_size_2d' and
+`fftw_mpi_plan_dft_2d' with the dimensions in reversed order, since a L
+x M  Fortran array is viewed by FFTW in C as a M x L  array.  This
+means that the array was distributed over the `M' dimension, the local
+portion of which is a L x local_M  array in Fortran.  (You must _not_
+use an `allocate' statement to allocate an L x local_M  array, however;
+you must allocate `alloc_local' complex numbers, which may be greater
+than `L * local_M', in order to reserve space for intermediate steps of
+the transform.)  Finally, we mention that because C's array indices are
+zero-based, the `local_j_offset' argument can conveniently be
+interpreted as an offset in the 1-based `j' index (rather than as a
+starting index as in C).
+
+   If instead you had used the `ior(FFTW_MEASURE,
+FFTW_MPI_TRANSPOSED_OUT)' flag, the output of the transform would be a
+transposed M x local_L  array, associated with the _same_ `cdata'
+allocation (since the transform is in-place), and which you could
+declare with:
+
+       complex(C_DOUBLE_COMPLEX), pointer :: tdata(:,:)
+       ...
+       call c_f_pointer(cdata, tdata, [M,local_L])
+
+   where `local_L' would have been obtained by changing the
+`fftw_mpi_local_size_2d' call to:
+
+       alloc_local = fftw_mpi_local_size_2d_transposed(M, L, MPI_COMM_WORLD, &
+                                local_M, local_j_offset, local_L, local_i_offset)
+
+   ---------- Footnotes ----------
+
+   (1) Technically, this is because you aren't actually calling the C
+functions directly. You are calling wrapper functions that translate
+the communicator with `MPI_Comm_f2c' before calling the ordinary C
+interface.  This is all done transparently, however, since the
+`fftw3-mpi.f03' interface file renames the wrappers so that they are
+called in Fortran with the same names as the C interface functions.
+
+
+File: fftw3.info,  Node: Calling FFTW from Modern Fortran,  Next: Calling FFTW from Legacy Fortran,  Prev: Distributed-memory FFTW with MPI,  Up: Top
+
+7 Calling FFTW from Modern Fortran
+**********************************
+
+Fortran 2003 standardized ways for Fortran code to call C libraries,
+and this allows us to support a direct translation of the FFTW C API
+into Fortran.  Compared to the legacy Fortran 77 interface (*note
+Calling FFTW from Legacy Fortran::), this direct interface offers many
+advantages, especially compile-time type-checking and aligned memory
+allocation.  As of this writing, support for these C interoperability
+features seems widespread, having been implemented in nearly all major
+Fortran compilers (e.g. GNU, Intel, IBM, Oracle/Solaris, Portland
+Group, NAG).  
+
+   This chapter documents that interface.  For the most part, since this
+interface allows Fortran to call the C interface directly, the usage is
+identical to C translated to Fortran syntax.  However, there are a few
+subtle points such as memory allocation, wisdom, and data types that
+deserve closer attention.
+
+* Menu:
+
+* Overview of Fortran interface::
+* Reversing array dimensions::
+* FFTW Fortran type reference::
+* Plan execution in Fortran::
+* Allocating aligned memory in Fortran::
+* Accessing the wisdom API from Fortran::
+* Defining an FFTW module::
+
+
+File: fftw3.info,  Node: Overview of Fortran interface,  Next: Reversing array dimensions,  Prev: Calling FFTW from Modern Fortran,  Up: Calling FFTW from Modern Fortran
+
+7.1 Overview of Fortran interface
+=================================
+
+FFTW provides a file `fftw3.f03' that defines Fortran 2003 interfaces
+for all of its C routines, except for the MPI routines described
+elsewhere, which can be found in the same directory as `fftw3.h' (the C
+header file).  In any Fortran subroutine where you want to use FFTW
+functions, you should begin with:
+
+       use, intrinsic :: iso_c_binding
+       include 'fftw3.f03'
+
+   This includes the interface definitions and the standard
+`iso_c_binding' module (which defines the equivalents of C types).  You
+can also put the FFTW functions into a module if you prefer (*note
+Defining an FFTW module::).
+
+   At this point, you can now call anything in the FFTW C interface
+directly, almost exactly as in C other than minor changes in syntax.
+For example:
+
+       type(C_PTR) :: plan
+       complex(C_DOUBLE_COMPLEX), dimension(1024,1000) :: in, out
+       plan = fftw_plan_dft_2d(1000,1024, in,out, FFTW_FORWARD,FFTW_ESTIMATE)
+       ...
+       call fftw_execute_dft(plan, in, out)
+       ...
+       call fftw_destroy_plan(plan)
+
+   A few important things to keep in mind are:
+
+   * FFTW plans are `type(C_PTR)'.  Other C types are mapped in the
+     obvious way via the `iso_c_binding' standard: `int' turns into
+     `integer(C_INT)', `fftw_complex' turns into
+     `complex(C_DOUBLE_COMPLEX)', `double' turns into `real(C_DOUBLE)',
+     and so on. *Note FFTW Fortran type reference::.
+
+   * Functions in C become functions in Fortran if they have a return
+     value, and subroutines in Fortran otherwise.
+
+   * The ordering of the Fortran array dimensions must be _reversed_
+     when they are passed to the FFTW plan creation, thanks to
+     differences in array indexing conventions (*note Multi-dimensional
+     Array Format::).  This is _unlike_ the legacy Fortran interface
+     (*note Fortran-interface routines::), which reversed the dimensions
+     for you.  *Note Reversing array dimensions::.
+
+   * Using ordinary Fortran array declarations like this works, but may
+     yield suboptimal performance because the data may not be not
+     aligned to exploit SIMD instructions on modern proessors (*note
+     SIMD alignment and fftw_malloc::). Better performance will often
+     be obtained by allocating with `fftw_alloc'. *Note Allocating
+     aligned memory in Fortran::.
+
+   * Similar to the legacy Fortran interface (*note FFTW Execution in
+     Fortran::), we currently recommend _not_ using `fftw_execute' but
+     rather using the more specialized functions like
+     `fftw_execute_dft' (*note New-array Execute Functions::).
+     However, you should execute the plan on the `same arrays' as the
+     ones for which you created the plan, unless you are especially
+     careful.  *Note Plan execution in Fortran::.  To prevent you from
+     using `fftw_execute' by mistake, the `fftw3.f03' file does not
+     provide an `fftw_execute' interface declaration.
+
+   * Multiple planner flags are combined with `ior' (equivalent to `|'
+     in C).  e.g. `FFTW_MEASURE | FFTW_DESTROY_INPUT' becomes
+     `ior(FFTW_MEASURE, FFTW_DESTROY_INPUT)'.  (You can also use `+' as
+     long as you don't try to include a given flag more than once.)
+
+
+* Menu:
+
+* Extended and quadruple precision in Fortran::
+
+
+File: fftw3.info,  Node: Extended and quadruple precision in Fortran,  Prev: Overview of Fortran interface,  Up: Overview of Fortran interface
+
+7.1.1 Extended and quadruple precision in Fortran
+-------------------------------------------------
+
+If FFTW is compiled in `long double' (extended) precision (*note
+Installation and Customization::), you may be able to call the
+resulting `fftwl_' routines (*note Precision::) from Fortran if your
+compiler supports the `C_LONG_DOUBLE_COMPLEX' type code.
+
+   Because some Fortran compilers do not support
+`C_LONG_DOUBLE_COMPLEX', the `fftwl_' declarations are segregated into
+a separate interface file `fftw3l.f03', which you should include _in
+addition_ to `fftw3.f03' (which declares precision-independent `FFTW_'
+constants):
+
+       use, intrinsic :: iso_c_binding
+       include 'fftw3.f03'
+       include 'fftw3l.f03'
+
+   We also support using the nonstandard `__float128'
+quadruple-precision type provided by recent versions of `gcc' on 32-
+and 64-bit x86 hardware (*note Installation and Customization::), using
+the corresponding `real(16)' and `complex(16)' types supported by
+`gfortran'.  The quadruple-precision `fftwq_' functions (*note
+Precision::) are declared in a `fftw3q.f03' interface file, which
+should be included in addition to `fftw3l.f03', as above.  You should
+also link with `-lfftw3q -lquadmath -lm' as in C.
+
+
+File: fftw3.info,  Node: Reversing array dimensions,  Next: FFTW Fortran type reference,  Prev: Overview of Fortran interface,  Up: Calling FFTW from Modern Fortran
+
+7.2 Reversing array dimensions
+==============================
+
+A minor annoyance in calling FFTW from Fortran is that FFTW's array
+dimensions are defined in the C convention (row-major order), while
+Fortran's array dimensions are the opposite convention (column-major
+order). *Note Multi-dimensional Array Format::.  This is just a
+bookkeeping difference, with no effect on performance.  The only
+consequence of this is that, whenever you create an FFTW plan for a
+multi-dimensional transform, you must always _reverse the ordering of
+the dimensions_.
+
+   For example, consider the three-dimensional (L x M x N ) arrays:
+
+       complex(C_DOUBLE_COMPLEX), dimension(L,M,N) :: in, out
+
+   To plan a DFT for these arrays using `fftw_plan_dft_3d', you could
+do:
+
+       plan = fftw_plan_dft_3d(N,M,L, in,out, FFTW_FORWARD,FFTW_ESTIMATE)
+
+   That is, from FFTW's perspective this is a N x M x L  array.  _No
+data transposition need occur_, as this is _only notation_.  Similarly,
+to use the more generic routine `fftw_plan_dft' with the same arrays,
+you could do:
+
+       integer(C_INT), dimension(3) :: n = [N,M,L]
+       plan = fftw_plan_dft_3d(3, n, in,out, FFTW_FORWARD,FFTW_ESTIMATE)
+
+   Note, by the way, that this is different from the legacy Fortran
+interface (*note Fortran-interface routines::), which automatically
+reverses the order of the array dimension for you.  Here, you are
+calling the C interface directly, so there is no "translation" layer.
+
+   An important thing to keep in mind is the implication of this for
+multidimensional real-to-complex transforms (*note Multi-Dimensional
+DFTs of Real Data::).  In C, a multidimensional real-to-complex DFT
+chops the last dimension roughly in half (N x M x L  real input goes to
+N x M x L/2+1  complex output).  In Fortran, because the array
+dimension notation is reversed, the _first_ dimension of the complex
+data is chopped roughly in half.  For example consider the `r2c'
+transform of L x M x N  real input in Fortran:
+
+       type(C_PTR) :: plan
+       real(C_DOUBLE), dimension(L,M,N) :: in
+       complex(C_DOUBLE_COMPLEX), dimension(L/2+1,M,N) :: out
+       plan = fftw_plan_dft_r2c_3d(N,M,L, in,out, FFTW_ESTIMATE)
+       ...
+       call fftw_execute_dft_r2c(plan, in, out)
+
+   Alternatively, for an in-place r2c transform, as described in the C
+documentation we must _pad_ the _first_ dimension of the real input
+with an extra two entries (which are ignored by FFTW) so as to leave
+enough space for the complex output. The input is _allocated_ as a
+2[L/2+1] x M x N  array, even though only L x M x N  of it is actually
+used.  In this example, we will allocate the array as a pointer type,
+using `fftw_alloc' to ensure aligned memory for maximum performance
+(*note Allocating aligned memory in Fortran::); this also makes it easy
+to reference the same memory as both a real array and a complex array.
+
+       real(C_DOUBLE), pointer :: in(:,:,:)
+       complex(C_DOUBLE_COMPLEX), pointer :: out(:,:,:)
+       type(C_PTR) :: plan, data
+       data = fftw_alloc_complex(int((L/2+1) * M * N, C_SIZE_T))
+       call c_f_pointer(data, in, [2*(L/2+1),M,N])
+       call c_f_pointer(data, out, [L/2+1,M,N])
+       plan = fftw_plan_dft_r2c_3d(N,M,L, in,out, FFTW_ESTIMATE)
+       ...
+       call fftw_execute_dft_r2c(plan, in, out)
+       ...
+       call fftw_destroy_plan(plan)
+       call fftw_free(data)
+
+
+File: fftw3.info,  Node: FFTW Fortran type reference,  Next: Plan execution in Fortran,  Prev: Reversing array dimensions,  Up: Calling FFTW from Modern Fortran
+
+7.3 FFTW Fortran type reference
+===============================
+
+The following are the most important type correspondences between the C
+interface and Fortran:
+
+   * Plans (`fftw_plan' and variants) are `type(C_PTR)' (i.e. an opaque
+     pointer).
+
+   * The C floating-point types `double', `float', and `long double'
+     correspond to `real(C_DOUBLE)', `real(C_FLOAT)', and
+     `real(C_LONG_DOUBLE)', respectively.  The C complex types
+     `fftw_complex', `fftwf_complex', and `fftwl_complex' correspond in
+     Fortran to `complex(C_DOUBLE_COMPLEX)',
+     `complex(C_FLOAT_COMPLEX)', and `complex(C_LONG_DOUBLE_COMPLEX)',
+     respectively.  Just as in C (*note Precision::), the FFTW
+     subroutines and types are prefixed with `fftw_', `fftwf_', and
+     `fftwl_' for the different precisions, and link to different
+     libraries (`-lfftw3', `-lfftw3f', and `-lfftw3l' on Unix), but use
+     the _same_ include file `fftw3.f03' and the _same_ constants (all
+     of which begin with `FFTW_').  The exception is `long double'
+     precision, for which you should _also_ include `fftw3l.f03' (*note
+     Extended and quadruple precision in Fortran::).
+
+   * The C integer types `int' and `unsigned' (used for planner flags)
+     become `integer(C_INT)'.  The C integer type `ptrdiff_t' (e.g. in
+     the *note 64-bit Guru Interface::) becomes `integer(C_INTPTR_T)',
+     and `size_t' (in `fftw_malloc' etc.) becomes `integer(C_SIZE_T)'.
+
+   * The `fftw_r2r_kind' type (*note Real-to-Real Transform Kinds::)
+     becomes `integer(C_FFTW_R2R_KIND)'.  The various constant values
+     of the C enumerated type (`FFTW_R2HC' etc.) become simply integer
+     constants of the same names in Fortran.
+
+   * Numeric array pointer arguments (e.g. `double *') become
+     `dimension(*), intent(out)' arrays of the same type, or
+     `dimension(*), intent(in)' if they are pointers to constant data
+     (e.g. `const int *').  There are a few exceptions where numeric
+     pointers refer to scalar outputs (e.g. for `fftw_flops'), in which
+     case they are `intent(out)' scalar arguments in Fortran too.  For
+     the new-array execute functions (*note New-array Execute
+     Functions::), the input arrays are declared `dimension(*),
+     intent(inout)', since they can be modified in the case of in-place
+     or `FFTW_DESTROY_INPUT' transforms.
+
+   * Pointer _return_ values (e.g `double *') become `type(C_PTR)'.
+     (If they are pointers to arrays, as for `fftw_alloc_real', you can
+     convert them back to Fortran array pointers with the standard
+     intrinsic function `c_f_pointer'.)
+
+   * The `fftw_iodim' type in the guru interface (*note Guru vector and
+     transform sizes::) becomes `type(fftw_iodim)' in Fortran, a
+     derived data type (the Fortran analogue of C's `struct') with
+     three `integer(C_INT)' components: `n', `is', and `os', with the
+     same meanings as in C.  The `fftw_iodim64' type in the 64-bit guru
+     interface (*note 64-bit Guru Interface::) is the same, except that
+     its components are of type `integer(C_INTPTR_T)'.
+
+   * Using the wisdom import/export functions from Fortran is a bit
+     tricky, and is discussed in *note Accessing the wisdom API from
+     Fortran::.  In brief, the `FILE *' arguments map to `type(C_PTR)',
+     `const char *' to `character(C_CHAR), dimension(*), intent(in)'
+     (null-terminated!), and the generic read-char/write-char functions
+     map to `type(C_FUNPTR)'.
+
+
+   You may be wondering if you need to search-and-replace
+`real(kind(0.0d0))' (or whatever your favorite Fortran spelling of
+"double precision" is) with `real(C_DOUBLE)' everywhere in your
+program, and similarly for `complex' and `integer' types.  The answer
+is no; you can still use your existing types.  As long as these types
+match their C counterparts, things should work without a hitch.  The
+worst that can happen, e.g. in the (unlikely) event of a system where
+`real(kind(0.0d0))' is different from `real(C_DOUBLE)', is that the
+compiler will give you a type-mismatch error.  That is, if you don't
+use the `iso_c_binding' kinds you need to accept at least the
+theoretical possibility of having to change your code in response to
+compiler errors on some future machine, but you don't need to worry
+about silently compiling incorrect code that yields runtime errors.
+
+
+File: fftw3.info,  Node: Plan execution in Fortran,  Next: Allocating aligned memory in Fortran,  Prev: FFTW Fortran type reference,  Up: Calling FFTW from Modern Fortran
+
+7.4 Plan execution in Fortran
+=============================
+
+In C, in order to use a plan, one normally calls `fftw_execute', which
+executes the plan to perform the transform on the input/output arrays
+passed when the plan was created (*note Using Plans::).  The
+corresponding subroutine call in modern Fortran is:
+      call fftw_execute(plan)
+   
+   However, we have had reports that this causes problems with some
+recent optimizing Fortran compilers.  The problem is, because the
+input/output arrays are not passed as explicit arguments to
+`fftw_execute', the semantics of Fortran (unlike C) allow the compiler
+to assume that the input/output arrays are not changed by
+`fftw_execute'.  As a consequence, certain compilers end up
+repositioning the call to `fftw_execute', assuming incorrectly that it
+does nothing to the arrays.
+
+   There are various workarounds to this, but the safest and simplest
+thing is to not use `fftw_execute' in Fortran.  Instead, use the
+functions described in *note New-array Execute Functions::, which take
+the input/output arrays as explicit arguments.  For example, if the
+plan is for a complex-data DFT and was created for the arrays `in' and
+`out', you would do:
+      call fftw_execute_dft(plan, in, out)
+   
+   There are a few things to be careful of, however:
+
+   * You must use the correct type of execute function, matching the way
+     the plan was created.  Complex DFT plans should use
+     `fftw_execute_dft', Real-input (r2c) DFT plans should use use
+     `fftw_execute_dft_r2c', and real-output (c2r) DFT plans should use
+     `fftw_execute_dft_c2r'.  The various r2r plans should use
+     `fftw_execute_r2r'.  Fortunately, if you use the wrong one you
+     will get a compile-time type-mismatch error (unlike legacy
+     Fortran).
+
+   * You should normally pass the same input/output arrays that were
+     used when creating the plan.  This is always safe.
+
+   * _If_ you pass _different_ input/output arrays compared to those
+     used when creating the plan, you must abide by all the
+     restrictions of the new-array execute functions (*note New-array
+     Execute Functions::).  The most tricky of these is the requirement
+     that the new arrays have the same alignment as the original
+     arrays; the best (and possibly only) way to guarantee this is to
+     use the `fftw_alloc' functions to allocate your arrays (*note
+     Allocating aligned memory in Fortran::). Alternatively, you can
+     use the `FFTW_UNALIGNED' flag when creating the plan, in which
+     case the plan does not depend on the alignment, but this may
+     sacrifice substantial performance on architectures (like x86) with
+     SIMD instructions (*note SIMD alignment and fftw_malloc::).  
+
+
+
+File: fftw3.info,  Node: Allocating aligned memory in Fortran,  Next: Accessing the wisdom API from Fortran,  Prev: Plan execution in Fortran,  Up: Calling FFTW from Modern Fortran
+
+7.5 Allocating aligned memory in Fortran
+========================================
+
+In order to obtain maximum performance in FFTW, you should store your
+data in arrays that have been specially aligned in memory (*note SIMD
+alignment and fftw_malloc::).  Enforcing alignment also permits you to
+safely use the new-array execute functions (*note New-array Execute
+Functions::) to apply a given plan to more than one pair of in/out
+arrays.  Unfortunately, standard Fortran arrays do _not_ provide any
+alignment guarantees.  The _only_ way to allocate aligned memory in
+standard Fortran is to allocate it with an external C function, like
+the `fftw_alloc_real' and `fftw_alloc_complex' functions.  Fortunately,
+Fortran 2003 provides a simple way to associate such allocated memory
+with a standard Fortran array pointer that you can then use normally.
+
+   We therefore recommend allocating all your input/output arrays using
+the following technique:
+
+  1. Declare a `pointer', `arr', to your array of the desired type and
+     dimensions.  For example, `real(C_DOUBLE), pointer :: a(:,:)' for
+     a 2d real array, or `complex(C_DOUBLE_COMPLEX), pointer ::
+     a(:,:,:)' for a 3d complex array.
+
+  2. The number of elements to allocate must be an `integer(C_SIZE_T)'.
+     You can either declare a variable of this type, e.g.
+     `integer(C_SIZE_T) :: sz', to store the number of elements to
+     allocate, or you can use the `int(..., C_SIZE_T)' intrinsic
+     function. e.g. set `sz = L * M * N' or use `int(L * M * N,
+     C_SIZE_T)' for an L x M x N  array.
+
+  3. Declare a `type(C_PTR) :: p' to hold the return value from FFTW's
+     allocation routine.  Set `p = fftw_alloc_real(sz)' for a real
+     array, or `p = fftw_alloc_complex(sz)' for a complex array.
+
+  4. Associate your pointer `arr' with the allocated memory `p' using
+     the standard `c_f_pointer' subroutine: `call c_f_pointer(p, arr,
+     [...dimensions...])', where `[...dimensions...])' are an array of
+     the dimensions of the array (in the usual Fortran order). e.g.
+     `call c_f_pointer(p, arr, [L,M,N])' for an L x M x N  array.
+     (Alternatively, you can omit the dimensions argument if you
+     specified the shape explicitly when declaring `arr'.)  You can now
+     use `arr' as a usual multidimensional array.
+
+  5. When you are done using the array, deallocate the memory by `call
+     fftw_free(p)' on `p'.
+
+
+   For example, here is how we would allocate an L x M  2d real array:
+
+       real(C_DOUBLE), pointer :: arr(:,:)
+       type(C_PTR) :: p
+       p = fftw_alloc_real(int(L * M, C_SIZE_T))
+       call c_f_pointer(p, arr, [L,M])
+       _...use arr and arr(i,j) as usual..._
+       call fftw_free(p)
+
+   and here is an L x M x N  3d complex array:
+
+       complex(C_DOUBLE_COMPLEX), pointer :: arr(:,:,:)
+       type(C_PTR) :: p
+       p = fftw_alloc_complex(int(L * M * N, C_SIZE_T))
+       call c_f_pointer(p, arr, [L,M,N])
+       _...use arr and arr(i,j,k) as usual..._
+       call fftw_free(p)
+
+   See *note Reversing array dimensions:: for an example allocating a
+single array and associating both real and complex array pointers with
+it, for in-place real-to-complex transforms.
+
+
+File: fftw3.info,  Node: Accessing the wisdom API from Fortran,  Next: Defining an FFTW module,  Prev: Allocating aligned memory in Fortran,  Up: Calling FFTW from Modern Fortran
+
+7.6 Accessing the wisdom API from Fortran
+=========================================
+
+As explained in *note Words of Wisdom-Saving Plans::, FFTW provides a
+"wisdom" API for saving plans to disk so that they can be recreated
+quickly.  The C API for exporting (*note Wisdom Export::) and importing
+(*note Wisdom Import::) wisdom is somewhat tricky to use from Fortran,
+however, because of differences in file I/O and string types between C
+and Fortran.
+
+* Menu:
+
+* Wisdom File Export/Import from Fortran::
+* Wisdom String Export/Import from Fortran::
+* Wisdom Generic Export/Import from Fortran::
+
+
+File: fftw3.info,  Node: Wisdom File Export/Import from Fortran,  Next: Wisdom String Export/Import from Fortran,  Prev: Accessing the wisdom API from Fortran,  Up: Accessing the wisdom API from Fortran
+
+7.6.1 Wisdom File Export/Import from Fortran
+--------------------------------------------
+
+The easiest way to export and import wisdom is to do so using
+`fftw_export_wisdom_to_filename' and `fftw_wisdom_from_filename'.  The
+only trick is that these require you to pass a C string, which is an
+array of type `CHARACTER(C_CHAR)' that is terminated by `C_NULL_CHAR'.
+You can call them like this:
+
+       integer(C_INT) :: ret
+       ret = fftw_export_wisdom_to_filename(C_CHAR_'my_wisdom.dat' // C_NULL_CHAR)
+       if (ret .eq. 0) stop 'error exporting wisdom to file'
+       ret = fftw_import_wisdom_from_filename(C_CHAR_'my_wisdom.dat' // C_NULL_CHAR)
+       if (ret .eq. 0) stop 'error importing wisdom from file'
+
+   Note that prepending `C_CHAR_' is needed to specify that the literal
+string is of kind `C_CHAR', and we null-terminate the string by
+appending `// C_NULL_CHAR'.  These functions return an `integer(C_INT)'
+(`ret') which is `0' if an error occurred during export/import and
+nonzero otherwise.
+
+   It is also possible to use the lower-level routines
+`fftw_export_wisdom_to_file' and `fftw_import_wisdom_from_file', which
+accept parameters of the C type `FILE*', expressed in Fortran as
+`type(C_PTR)'.  However, you are then responsible for creating the
+`FILE*' yourself.  You can do this by using `iso_c_binding' to define
+Fortran intefaces for the C library functions `fopen' and `fclose',
+which is a bit strange in Fortran but workable.
+
+
+File: fftw3.info,  Node: Wisdom String Export/Import from Fortran,  Next: Wisdom Generic Export/Import from Fortran,  Prev: Wisdom File Export/Import from Fortran,  Up: Accessing the wisdom API from Fortran
+
+7.6.2 Wisdom String Export/Import from Fortran
+----------------------------------------------
+
+Dealing with FFTW's C string export/import is a bit more painful.  In
+particular, the `fftw_export_wisdom_to_string' function requires you to
+deal with a dynamically allocated C string.  To get its length, you
+must define an interface to the C `strlen' function, and to deallocate
+it you must define an interface to C `free':
+
+       use, intrinsic :: iso_c_binding
+       interface
+         integer(C_INT) function strlen(s) bind(C, name='strlen')
+           import
+           type(C_PTR), value :: s
+         end function strlen
+         subroutine free(p) bind(C, name='free')
+           import
+           type(C_PTR), value :: p
+         end subroutine free
+       end interface
+
+   Given these definitions, you can then export wisdom to a Fortran
+character array:
+
+       character(C_CHAR), pointer :: s(:)
+       integer(C_SIZE_T) :: slen
+       type(C_PTR) :: p
+       p = fftw_export_wisdom_to_string()
+       if (.not. c_associated(p)) stop 'error exporting wisdom'
+       slen = strlen(p)
+       call c_f_pointer(p, s, [slen+1])
+       ...
+       call free(p)
+   
+   Note that `slen' is the length of the C string, but the length of
+the array is `slen+1' because it includes the terminating null
+character.  (You can omit the `+1' if you don't want Fortran to know
+about the null character.) The standard `c_associated' function checks
+whether `p' is a null pointer, which is returned by
+`fftw_export_wisdom_to_string' if there was an error.
+
+   To import wisdom from a string, use `fftw_import_wisdom_from_string'
+as usual; note that the argument of this function must be a
+`character(C_CHAR)' that is terminated by the `C_NULL_CHAR' character,
+like the `s' array above.
+
+
+File: fftw3.info,  Node: Wisdom Generic Export/Import from Fortran,  Prev: Wisdom String Export/Import from Fortran,  Up: Accessing the wisdom API from Fortran
+
+7.6.3 Wisdom Generic Export/Import from Fortran
+-----------------------------------------------
+
+The most generic wisdom export/import functions allow you to provide an
+arbitrary callback function to read/write one character at a time in
+any way you want.  However, your callback function must be written in a
+special way, using the `bind(C)' attribute to be passed to a C
+interface.
+
+   In particular, to call the generic wisdom export function
+`fftw_export_wisdom', you would write a callback subroutine of the form:
+
+       subroutine my_write_char(c, p) bind(C)
+         use, intrinsic :: iso_c_binding
+         character(C_CHAR), value :: c
+         type(C_PTR), value :: p
+         _...write c..._
+       end subroutine my_write_char
+
+   Given such a subroutine (along with the corresponding interface
+definition), you could then export wisdom using:
+
+       call fftw_export_wisdom(c_funloc(my_write_char), p)
+
+   The standard `c_funloc' intrinsic converts a Fortran `bind(C)'
+subroutine into a C function pointer.  The parameter `p' is a
+`type(C_PTR)' to any arbitrary data that you want to pass to
+`my_write_char' (or `C_NULL_PTR' if none).  (Note that you can get a C
+pointer to Fortran data using the intrinsic `c_loc', and convert it
+back to a Fortran pointer in `my_write_char' using `c_f_pointer'.)
+
+   Similarly, to use the generic `fftw_import_wisdom', you would define
+a callback function of the form:
+
+       integer(C_INT) function my_read_char(p) bind(C)
+         use, intrinsic :: iso_c_binding
+         type(C_PTR), value :: p
+         character :: c
+         _...read a character c..._
+         my_read_char = ichar(c, C_INT)
+       end function my_read_char
+
+       ....
+
+       integer(C_INT) :: ret
+       ret = fftw_import_wisdom(c_funloc(my_read_char), p)
+       if (ret .eq. 0) stop 'error importing wisdom'
+
+   Your function can return `-1' if the end of the input is reached.
+Again, `p' is an arbitrary `type(C_PTR' that is passed through to your
+function.  `fftw_import_wisdom' returns `0' if an error occurred and
+nonzero otherwise.
+
+
+File: fftw3.info,  Node: Defining an FFTW module,  Prev: Accessing the wisdom API from Fortran,  Up: Calling FFTW from Modern Fortran
+
+7.7 Defining an FFTW module
+===========================
+
+Rather than using the `include' statement to include the `fftw3.f03'
+interface file in any subroutine where you want to use FFTW, you might
+prefer to define an FFTW Fortran module.  FFTW does not install itself
+as a module, primarily because `fftw3.f03' can be shared between
+different Fortran compilers while modules (in general) cannot.
+However, it is trivial to define your own FFTW module if you want.
+Just create a file containing:
+
+       module FFTW3
+         use, intrinsic :: iso_c_binding
+         include 'fftw3.f03'
+       end module
+
+   Compile this file into a module as usual for your compiler (e.g. with
+`gfortran -c' you will get a file `fftw3.mod').  Now, instead of
+`include 'fftw3.f03'', whenever you want to use FFTW routines you can
+just do:
+
+       use FFTW3
+
+   as usual for Fortran modules.  (You still need to link to the FFTW
+library, of course.)
+
+
+File: fftw3.info,  Node: Calling FFTW from Legacy Fortran,  Next: Upgrading from FFTW version 2,  Prev: Calling FFTW from Modern Fortran,  Up: Top
+
+8 Calling FFTW from Legacy Fortran
+**********************************
+
+This chapter describes the interface to FFTW callable by Fortran code
+in older compilers not supporting the Fortran 2003 C interoperability
+features (*note Calling FFTW from Modern Fortran::).  This interface
+has the major disadvantage that it is not type-checked, so if you
+mistake the argument types or ordering then your program will not have
+any compiler errors, and will likely crash at runtime.  So, greater
+care is needed.  Also, technically interfacing older Fortran versions
+to C is nonstandard, but in practice we have found that the techniques
+used in this chapter have worked with all known Fortran compilers for
+many years.
+
+   The legacy Fortran interface differs from the C interface only in the
+prefix (`dfftw_' instead of `fftw_' in double precision) and a few
+other minor details.  This Fortran interface is included in the FFTW
+libraries by default, unless a Fortran compiler isn't found on your
+system or `--disable-fortran' is included in the `configure' flags.  We
+assume here that the reader is already familiar with the usage of FFTW
+in C, as described elsewhere in this manual.
+
+   The MPI parallel interface to FFTW is _not_ currently available to
+legacy Fortran.
+
+* Menu:
+
+* Fortran-interface routines::
+* FFTW Constants in Fortran::
+* FFTW Execution in Fortran::
+* Fortran Examples::
+* Wisdom of Fortran?::
+
+
+File: fftw3.info,  Node: Fortran-interface routines,  Next: FFTW Constants in Fortran,  Prev: Calling FFTW from Legacy Fortran,  Up: Calling FFTW from Legacy Fortran
+
+8.1 Fortran-interface routines
+==============================
+
+Nearly all of the FFTW functions have Fortran-callable equivalents.
+The name of the legacy Fortran routine is the same as that of the
+corresponding C routine, but with the `fftw_' prefix replaced by
+`dfftw_'.(1)  The single and long-double precision versions use
+`sfftw_' and `lfftw_', respectively, instead of `fftwf_' and `fftwl_';
+quadruple precision (`real*16') is available on some systems as
+`fftwq_' (*note Precision::).  (Note that `long double' on x86 hardware
+is usually at most 80-bit extended precision, _not_ quadruple
+precision.)
+
+   For the most part, all of the arguments to the functions are the
+same, with the following exceptions:
+
+   * `plan' variables (what would be of type `fftw_plan' in C), must be
+     declared as a type that is at least as big as a pointer (address)
+     on your machine.  We recommend using `integer*8' everywhere, since
+     this should always be big enough.  
+
+   * Any function that returns a value (e.g. `fftw_plan_dft') is
+     converted into a _subroutine_.  The return value is converted into
+     an additional _first_ parameter of this subroutine.(2)
+
+   * The Fortran routines expect multi-dimensional arrays to be in
+     _column-major_ order, which is the ordinary format of Fortran
+     arrays (*note Multi-dimensional Array Format::).  They do this
+     transparently and costlessly simply by reversing the order of the
+     dimensions passed to FFTW, but this has one important consequence
+     for multi-dimensional real-complex transforms, discussed below.
+
+   * Wisdom import and export is somewhat more tricky because one cannot
+     easily pass files or strings between C and Fortran; see *note
+     Wisdom of Fortran?::.
+
+   * Legacy Fortran cannot use the `fftw_malloc' dynamic-allocation
+     routine.  If you want to exploit the SIMD FFTW (*note SIMD
+     alignment and fftw_malloc::), you'll need to figure out some other
+     way to ensure that your arrays are at least 16-byte aligned.
+
+   * Since Fortran 77 does not have data structures, the `fftw_iodim'
+     structure from the guru interface (*note Guru vector and transform
+     sizes::) must be split into separate arguments.  In particular, any
+     `fftw_iodim' array arguments in the C guru interface become three
+     integer array arguments (`n', `is', and `os') in the Fortran guru
+     interface, all of whose lengths should be equal to the
+     corresponding `rank' argument.
+
+   * The guru planner interface in Fortran does _not_ do any automatic
+     translation between column-major and row-major; you are responsible
+     for setting the strides etcetera to correspond to your Fortran
+     arrays.  However, as a slight bug that we are preserving for
+     backwards compatibility, the `plan_guru_r2r' in Fortran _does_
+     reverse the order of its `kind' array parameter, so the `kind'
+     array of that routine should be in the reverse of the order of the
+     iodim arrays (see above).
+
+
+   In general, you should take care to use Fortran data types that
+correspond to (i.e. are the same size as) the C types used by FFTW.  In
+practice, this correspondence is usually straightforward (i.e.
+`integer' corresponds to `int', `real' corresponds to `float',
+etcetera).  The native Fortran double/single-precision complex type
+should be compatible with `fftw_complex'/`fftwf_complex'.  Such simple
+correspondences are assumed in the examples below.  
+
+   ---------- Footnotes ----------
+
+   (1) Technically, Fortran 77 identifiers are not allowed to have more
+than 6 characters, nor may they contain underscores.  Any compiler that
+enforces this limitation doesn't deserve to link to FFTW.
+
+   (2) The reason for this is that some Fortran implementations seem to
+have trouble with C function return values, and vice versa.
+
+
+File: fftw3.info,  Node: FFTW Constants in Fortran,  Next: FFTW Execution in Fortran,  Prev: Fortran-interface routines,  Up: Calling FFTW from Legacy Fortran
+
+8.2 FFTW Constants in Fortran
+=============================
+
+When creating plans in FFTW, a number of constants are used to specify
+options, such as `FFTW_MEASURE' or `FFTW_ESTIMATE'.  The same constants
+must be used with the wrapper routines, but of course the C header
+files where the constants are defined can't be incorporated directly
+into Fortran code.
+
+   Instead, we have placed Fortran equivalents of the FFTW constant
+definitions in the file `fftw3.f', which can be found in the same
+directory as `fftw3.h'.  If your Fortran compiler supports a
+preprocessor of some sort, you should be able to `include' or
+`#include' this file; otherwise, you can paste it directly into your
+code.
+
+   In C, you combine different flags (like `FFTW_PRESERVE_INPUT' and
+`FFTW_MEASURE') using the ``|'' operator; in Fortran you should just
+use ``+''.  (Take care not to add in the same flag more than once,
+though.  Alternatively, you can use the `ior' intrinsic function
+standardized in Fortran 95.)
+
+
+File: fftw3.info,  Node: FFTW Execution in Fortran,  Next: Fortran Examples,  Prev: FFTW Constants in Fortran,  Up: Calling FFTW from Legacy Fortran
+
+8.3 FFTW Execution in Fortran
+=============================
+
+In C, in order to use a plan, one normally calls `fftw_execute', which
+executes the plan to perform the transform on the input/output arrays
+passed when the plan was created (*note Using Plans::).  The
+corresponding subroutine call in legacy Fortran is:
+             call dfftw_execute(plan)
+   
+   However, we have had reports that this causes problems with some
+recent optimizing Fortran compilers.  The problem is, because the
+input/output arrays are not passed as explicit arguments to
+`dfftw_execute', the semantics of Fortran (unlike C) allow the compiler
+to assume that the input/output arrays are not changed by
+`dfftw_execute'.  As a consequence, certain compilers end up optimizing
+out or repositioning the call to `dfftw_execute', assuming incorrectly
+that it does nothing.
+
+   There are various workarounds to this, but the safest and simplest
+thing is to not use `dfftw_execute' in Fortran.  Instead, use the
+functions described in *note New-array Execute Functions::, which take
+the input/output arrays as explicit arguments.  For example, if the
+plan is for a complex-data DFT and was created for the arrays `in' and
+`out', you would do:
+             call dfftw_execute_dft(plan, in, out)
+   
+   There are a few things to be careful of, however:
+
+   * You must use the correct type of execute function, matching the way
+     the plan was created.  Complex DFT plans should use
+     `dfftw_execute_dft', Real-input (r2c) DFT plans should use use
+     `dfftw_execute_dft_r2c', and real-output (c2r) DFT plans should
+     use `dfftw_execute_dft_c2r'.  The various r2r plans should use
+     `dfftw_execute_r2r'.
+
+   * You should normally pass the same input/output arrays that were
+     used when creating the plan.  This is always safe.
+
+   * _If_ you pass _different_ input/output arrays compared to those
+     used when creating the plan, you must abide by all the
+     restrictions of the new-array execute functions (*note New-array
+     Execute Functions::).  The most difficult of these, in Fortran, is
+     the requirement that the new arrays have the same alignment as the
+     original arrays, because there seems to be no way in legacy
+     Fortran to obtain guaranteed-aligned arrays (analogous to
+     `fftw_malloc' in C).  You can, of course, use the `FFTW_UNALIGNED'
+     flag when creating the plan, in which case the plan does not
+     depend on the alignment, but this may sacrifice substantial
+     performance on architectures (like x86) with SIMD instructions
+     (*note SIMD alignment and fftw_malloc::).  
+
+
+
+File: fftw3.info,  Node: Fortran Examples,  Next: Wisdom of Fortran?,  Prev: FFTW Execution in Fortran,  Up: Calling FFTW from Legacy Fortran
+
+8.4 Fortran Examples
+====================
+
+In C, you might have something like the following to transform a
+one-dimensional complex array:
+
+             fftw_complex in[N], out[N];
+             fftw_plan plan;
+
+             plan = fftw_plan_dft_1d(N,in,out,FFTW_FORWARD,FFTW_ESTIMATE);
+             fftw_execute(plan);
+             fftw_destroy_plan(plan);
+
+   In Fortran, you would use the following to accomplish the same thing:
+
+             double complex in, out
+             dimension in(N), out(N)
+             integer*8 plan
+
+             call dfftw_plan_dft_1d(plan,N,in,out,FFTW_FORWARD,FFTW_ESTIMATE)
+             call dfftw_execute_dft(plan, in, out)
+             call dfftw_destroy_plan(plan)
+   
+   Notice how all routines are called as Fortran subroutines, and the
+plan is returned via the first argument to `dfftw_plan_dft_1d'.  Notice
+also that we changed `fftw_execute' to `dfftw_execute_dft' (*note FFTW
+Execution in Fortran::).  To do the same thing, but using 8 threads in
+parallel (*note Multi-threaded FFTW::), you would simply prefix these
+calls with:
+
+             integer iret
+             call dfftw_init_threads(iret)
+             call dfftw_plan_with_nthreads(8)
+   
+   (You might want to check the value of `iret': if it is zero, it
+indicates an unlikely error during thread initialization.)
+
+   To transform a three-dimensional array in-place with C, you might do:
+
+             fftw_complex arr[L][M][N];
+             fftw_plan plan;
+
+             plan = fftw_plan_dft_3d(L,M,N, arr,arr,
+                                     FFTW_FORWARD, FFTW_ESTIMATE);
+             fftw_execute(plan);
+             fftw_destroy_plan(plan);
+
+   In Fortran, you would use this instead:
+
+             double complex arr
+             dimension arr(L,M,N)
+             integer*8 plan
+
+             call dfftw_plan_dft_3d(plan, L,M,N, arr,arr,
+            &                       FFTW_FORWARD, FFTW_ESTIMATE)
+             call dfftw_execute_dft(plan, arr, arr)
+             call dfftw_destroy_plan(plan)
+   
+   Note that we pass the array dimensions in the "natural" order in
+both C and Fortran.
+
+   To transform a one-dimensional real array in Fortran, you might do:
+
+             double precision in
+             dimension in(N)
+             double complex out
+             dimension out(N/2 + 1)
+             integer*8 plan
+
+             call dfftw_plan_dft_r2c_1d(plan,N,in,out,FFTW_ESTIMATE)
+             call dfftw_execute_dft_r2c(plan, in, out)
+             call dfftw_destroy_plan(plan)
+   
+   To transform a two-dimensional real array, out of place, you might
+use the following:
+
+             double precision in
+             dimension in(M,N)
+             double complex out
+             dimension out(M/2 + 1, N)
+             integer*8 plan
+
+             call dfftw_plan_dft_r2c_2d(plan,M,N,in,out,FFTW_ESTIMATE)
+             call dfftw_execute_dft_r2c(plan, in, out)
+             call dfftw_destroy_plan(plan)
+   
+   *Important:* Notice that it is the _first_ dimension of the complex
+output array that is cut in half in Fortran, rather than the last
+dimension as in C.  This is a consequence of the interface routines
+reversing the order of the array dimensions passed to FFTW so that the
+Fortran program can use its ordinary column-major order.  
+
+
+File: fftw3.info,  Node: Wisdom of Fortran?,  Prev: Fortran Examples,  Up: Calling FFTW from Legacy Fortran
+
+8.5 Wisdom of Fortran?
+======================
+
+In this section, we discuss how one can import/export FFTW wisdom
+(saved plans) to/from a Fortran program; we assume that the reader is
+already familiar with wisdom, as described in *note Words of
+Wisdom-Saving Plans::.
+
+   The basic problem is that is difficult to (portably) pass files and
+strings between Fortran and C, so we cannot provide a direct Fortran
+equivalent to the `fftw_export_wisdom_to_file', etcetera, functions.
+Fortran interfaces _are_ provided for the functions that do not take
+file/string arguments, however: `dfftw_import_system_wisdom',
+`dfftw_import_wisdom', `dfftw_export_wisdom', and `dfftw_forget_wisdom'.  
+
+   So, for example, to import the system-wide wisdom, you would do:
+
+             integer isuccess
+             call dfftw_import_system_wisdom(isuccess)
+
+   As usual, the C return value is turned into a first parameter;
+`isuccess' is non-zero on success and zero on failure (e.g. if there is
+no system wisdom installed).
+
+   If you want to import/export wisdom from/to an arbitrary file or
+elsewhere, you can employ the generic `dfftw_import_wisdom' and
+`dfftw_export_wisdom' functions, for which you must supply a subroutine
+to read/write one character at a time.  The FFTW package contains an
+example file `doc/f77_wisdom.f' demonstrating how to implement
+`import_wisdom_from_file' and `export_wisdom_to_file' subroutines in
+this way.  (These routines cannot be compiled into the FFTW library
+itself, lest all FFTW-using programs be required to link with the
+Fortran I/O library.)
+
+
+File: fftw3.info,  Node: Upgrading from FFTW version 2,  Next: Installation and Customization,  Prev: Calling FFTW from Legacy Fortran,  Up: Top
+
+9 Upgrading from FFTW version 2
+*******************************
+
+In this chapter, we outline the process for updating codes designed for
+the older FFTW 2 interface to work with FFTW 3.  The interface for FFTW
+3 is not backwards-compatible with the interface for FFTW 2 and earlier
+versions; codes written to use those versions will fail to link with
+FFTW 3.  Nor is it possible to write "compatibility wrappers" to bridge
+the gap (at least not efficiently), because FFTW 3 has different
+semantics from previous versions.  However, upgrading should be a
+straightforward process because the data formats are identical and the
+overall style of planning/execution is essentially the same.
+
+   Unlike FFTW 2, there are no separate header files for real and
+complex transforms (or even for different precisions) in FFTW 3; all
+interfaces are defined in the `<fftw3.h>' header file.
+
+Numeric Types
+=============
+
+The main difference in data types is that `fftw_complex' in FFTW 2 was
+defined as a `struct' with macros `c_re' and `c_im' for accessing the
+real/imaginary parts.  (This is binary-compatible with FFTW 3 on any
+machine except perhaps for some older Crays in single precision.)  The
+equivalent macros for FFTW 3 are:
+
+     #define c_re(c) ((c)[0])
+     #define c_im(c) ((c)[1])
+
+   This does not work if you are using the C99 complex type, however,
+unless you insert a `double*' typecast into the above macros (*note
+Complex numbers::).
+
+   Also, FFTW 2 had an `fftw_real' typedef that was an alias for
+`double' (in double precision).  In FFTW 3 you should just use `double'
+(or whatever precision you are employing).
+
+Plans
+=====
+
+The major difference between FFTW 2 and FFTW 3 is in the
+planning/execution division of labor.  In FFTW 2, plans were found for a
+given transform size and type, and then could be applied to _any_
+arrays and for _any_ multiplicity/stride parameters.  In FFTW 3, you
+specify the particular arrays, stride parameters, etcetera when
+creating the plan, and the plan is then executed for _those_ arrays
+(unless the guru interface is used) and _those_ parameters _only_.
+(FFTW 2 had "specific planner" routines that planned for a particular
+array and stride, but the plan could still be used for other arrays and
+strides.)  That is, much of the information that was formerly specified
+at execution time is now specified at planning time.
+
+   Like FFTW 2's specific planner routines, the FFTW 3 planner
+overwrites the input/output arrays unless you use `FFTW_ESTIMATE'.
+
+   FFTW 2 had separate data types `fftw_plan', `fftwnd_plan',
+`rfftw_plan', and `rfftwnd_plan' for complex and real one- and
+multi-dimensional transforms, and each type had its own `destroy'
+function.  In FFTW 3, all plans are of type `fftw_plan' and all are
+destroyed by `fftw_destroy_plan(plan)'.
+
+   Where you formerly used `fftw_create_plan' and `fftw_one' to plan
+and compute a single 1d transform, you would now use `fftw_plan_dft_1d'
+to plan the transform.  If you used the generic `fftw' function to
+execute the transform with multiplicity (`howmany') and stride
+parameters, you would now use the advanced interface
+`fftw_plan_many_dft' to specify those parameters.  The plans are now
+executed with `fftw_execute(plan)', which takes all of its parameters
+(including the input/output arrays) from the plan.
+
+   In-place transforms no longer interpret their output argument as
+scratch space, nor is there an `FFTW_IN_PLACE' flag.  You simply pass
+the same pointer for both the input and output arguments.  (Previously,
+the output `ostride' and `odist' parameters were ignored for in-place
+transforms; now, if they are specified via the advanced interface, they
+are significant even in the in-place case, although they should
+normally equal the corresponding input parameters.)
+
+   The `FFTW_ESTIMATE' and `FFTW_MEASURE' flags have the same meaning
+as before, although the planning time will differ.  You may also
+consider using `FFTW_PATIENT', which is like `FFTW_MEASURE' except that
+it takes more time in order to consider a wider variety of algorithms.
+
+   For multi-dimensional complex DFTs, instead of `fftwnd_create_plan'
+(or `fftw2d_create_plan' or `fftw3d_create_plan'), followed by
+`fftwnd_one', you would use `fftw_plan_dft' (or `fftw_plan_dft_2d' or
+`fftw_plan_dft_3d').  followed by `fftw_execute'.  If you used `fftwnd'
+to to specify strides etcetera, you would instead specify these via
+`fftw_plan_many_dft'.
+
+   The analogues to `rfftw_create_plan' and `rfftw_one' with
+`FFTW_REAL_TO_COMPLEX' or `FFTW_COMPLEX_TO_REAL' directions are
+`fftw_plan_r2r_1d' with kind `FFTW_R2HC' or `FFTW_HC2R', followed by
+`fftw_execute'.  The stride etcetera arguments of `rfftw' are now in
+`fftw_plan_many_r2r'.
+
+   Instead of `rfftwnd_create_plan' (or `rfftw2d_create_plan' or
+`rfftw3d_create_plan') followed by `rfftwnd_one_real_to_complex' or
+`rfftwnd_one_complex_to_real', you now use `fftw_plan_dft_r2c' (or
+`fftw_plan_dft_r2c_2d' or `fftw_plan_dft_r2c_3d') or
+`fftw_plan_dft_c2r' (or `fftw_plan_dft_c2r_2d' or
+`fftw_plan_dft_c2r_3d'), respectively, followed by `fftw_execute'.  As
+usual, the strides etcetera of `rfftwnd_real_to_complex' or
+`rfftwnd_complex_to_real' are no specified in the advanced planner
+routines, `fftw_plan_many_dft_r2c' or `fftw_plan_many_dft_c2r'.
+
+Wisdom
+======
+
+In FFTW 2, you had to supply the `FFTW_USE_WISDOM' flag in order to use
+wisdom; in FFTW 3, wisdom is always used.  (You could simulate the FFTW
+2 wisdom-less behavior by calling `fftw_forget_wisdom' after every
+planner call.)
+
+   The FFTW 3 wisdom import/export routines are almost the same as
+before (although the storage format is entirely different).  There is
+one significant difference, however.  In FFTW 2, the import routines
+would never read past the end of the wisdom, so you could store extra
+data beyond the wisdom in the same file, for example.  In FFTW 3, the
+file-import routine may read up to a few hundred bytes past the end of
+the wisdom, so you cannot store other data just beyond it.(1)
+
+   Wisdom has been enhanced by additional humility in FFTW 3: whereas
+FFTW 2 would re-use wisdom for a given transform size regardless of the
+stride etc., in FFTW 3 wisdom is only used with the strides etc. for
+which it was created.  Unfortunately, this means FFTW 3 has to create
+new plans from scratch more often than FFTW 2 (in FFTW 2, planning e.g.
+one transform of size 1024 also created wisdom for all smaller powers
+of 2, but this no longer occurs).
+
+   FFTW 3 also has the new routine `fftw_import_system_wisdom' to
+import wisdom from a standard system-wide location.
+
+Memory allocation
+=================
+
+In FFTW 3, we recommend allocating your arrays with `fftw_malloc' and
+deallocating them with `fftw_free'; this is not required, but allows
+optimal performance when SIMD acceleration is used.  (Those two
+functions actually existed in FFTW 2, and worked the same way, but were
+not documented.)
+
+   In FFTW 2, there were `fftw_malloc_hook' and `fftw_free_hook'
+functions that allowed the user to replace FFTW's memory-allocation
+routines (e.g. to implement different error-handling, since by default
+FFTW prints an error message and calls `exit' to abort the program if
+`malloc' returns `NULL').  These hooks are not supported in FFTW 3;
+those few users who require this functionality can just directly modify
+the memory-allocation routines in FFTW (they are defined in
+`kernel/alloc.c').
+
+Fortran interface
+=================
+
+In FFTW 2, the subroutine names were obtained by replacing `fftw_' with
+`fftw_f77'; in FFTW 3, you replace `fftw_' with `dfftw_' (or `sfftw_'
+or `lfftw_', depending upon the precision).
+
+   In FFTW 3, we have begun recommending that you always declare the
+type used to store plans as `integer*8'.  (Too many people didn't notice
+our instruction to switch from `integer' to `integer*8' for 64-bit
+machines.)
+
+   In FFTW 3, we provide a `fftw3.f' "header file" to include in your
+code (and which is officially installed on Unix systems).  (In FFTW 2,
+we supplied a `fftw_f77.i' file, but it was not installed.)
+
+   Otherwise, the C-Fortran interface relationship is much the same as
+it was before (e.g. return values become initial parameters, and
+multi-dimensional arrays are in column-major order).  Unlike FFTW 2, we
+do provide some support for wisdom import/export in Fortran (*note
+Wisdom of Fortran?::).
+
+Threads
+=======
+
+Like FFTW 2, only the execution routines are thread-safe.  All planner
+routines, etcetera, should be called by only a single thread at a time
+(*note Thread safety::).  _Unlike_ FFTW 2, there is no special
+`FFTW_THREADSAFE' flag for the planner to allow a given plan to be
+usable by multiple threads in parallel; this is now the case by default.
+
+   The multi-threaded version of FFTW 2 required you to pass the number
+of threads each time you execute the transform.  The number of threads
+is now stored in the plan, and is specified before the planner is
+called by `fftw_plan_with_nthreads'.  The threads initialization
+routine used to be called `fftw_threads_init' and would return zero on
+success; the new routine is called `fftw_init_threads' and returns zero
+on failure.  *Note Multi-threaded FFTW::.
+
+   There is no separate threads header file in FFTW 3; all the function
+prototypes are in `<fftw3.h>'.  However, you still have to link to a
+separate library (`-lfftw3_threads -lfftw3 -lm' on Unix), as well as to
+the threading library (e.g. POSIX threads on Unix).
+
+   ---------- Footnotes ----------
+
+   (1) We do our own buffering because GNU libc I/O routines are
+horribly slow for single-character I/O, apparently for thread-safety
+reasons (whether you are using threads or not).
+
+
+File: fftw3.info,  Node: Installation and Customization,  Next: Acknowledgments,  Prev: Upgrading from FFTW version 2,  Up: Top
+
+10 Installation and Customization
+*********************************
+
+This chapter describes the installation and customization of FFTW, the
+latest version of which may be downloaded from the FFTW home page
+(http://www.fftw.org).
+
+   In principle, FFTW should work on any system with an ANSI C compiler
+(`gcc' is fine).  However, planner time is drastically reduced if FFTW
+can exploit a hardware cycle counter; FFTW comes with cycle-counter
+support for all modern general-purpose CPUs, but you may need to add a
+couple of lines of code if your compiler is not yet supported (*note
+Cycle Counters::).  (On Unix, there will be a warning at the end of the
+`configure' output if no cycle counter is found.)  
+
+   Installation of FFTW is simplest if you have a Unix or a GNU system,
+such as GNU/Linux, and we describe this case in the first section below,
+including the use of special configuration options to e.g. install
+different precisions or exploit optimizations for particular
+architectures (e.g. SIMD).  Compilation on non-Unix systems is a more
+manual process, but we outline the procedure in the second section.  It
+is also likely that pre-compiled binaries will be available for popular
+systems.
+
+   Finally, we describe how you can customize FFTW for particular needs
+by generating _codelets_ for fast transforms of sizes not supported
+efficiently by the standard FFTW distribution.  
+
+* Menu:
+
+* Installation on Unix::
+* Installation on non-Unix systems::
+* Cycle Counters::
+* Generating your own code::
+
+
+File: fftw3.info,  Node: Installation on Unix,  Next: Installation on non-Unix systems,  Prev: Installation and Customization,  Up: Installation and Customization
+
+10.1 Installation on Unix
+=========================
+
+FFTW comes with a `configure' program in the GNU style.  Installation
+can be as simple as: 
+
+     ./configure
+     make
+     make install
+
+   This will build the uniprocessor complex and real transform libraries
+along with the test programs.  (We recommend that you use GNU `make' if
+it is available; on some systems it is called `gmake'.)  The "`make
+install'" command installs the fftw and rfftw libraries in standard
+places, and typically requires root privileges (unless you specify a
+different install directory with the `--prefix' flag to `configure').
+You can also type "`make check'" to put the FFTW test programs through
+their paces.  If you have problems during configuration or compilation,
+you may want to run "`make distclean'" before trying again; this
+ensures that you don't have any stale files left over from previous
+compilation attempts.
+
+   The `configure' script chooses the `gcc' compiler by default, if it
+is available; you can select some other compiler with:
+     ./configure CC="<the name of your C compiler>"
+
+   The `configure' script knows good `CFLAGS' (C compiler flags) for a
+few systems.  If your system is not known, the `configure' script will
+print out a warning.  In this case, you should re-configure FFTW with
+the command
+     ./configure CFLAGS="<write your CFLAGS here>"
+   and then compile as usual.  If you do find an optimal set of
+`CFLAGS' for your system, please let us know what they are (along with
+the output of `config.guess') so that we can include them in future
+releases.
+
+   `configure' supports all the standard flags defined by the GNU
+Coding Standards; see the `INSTALL' file in FFTW or the GNU web page
+(http://www.gnu.org/prep/standards/html_node/index.html).  Note
+especially `--help' to list all flags and `--enable-shared' to create
+shared, rather than static, libraries.  `configure' also accepts a few
+FFTW-specific flags, particularly:
+
+   * `--enable-float': Produces a single-precision version of FFTW
+     (`float') instead of the default double-precision (`double').
+     *Note Precision::.
+
+   * `--enable-long-double': Produces a long-double precision version of
+     FFTW (`long double') instead of the default double-precision
+     (`double').  The `configure' script will halt with an error
+     message if `long double' is the same size as `double' on your
+     machine/compiler.  *Note Precision::.
+
+   * `--enable-quad-precision': Produces a quadruple-precision version
+     of FFTW using the nonstandard `__float128' type provided by `gcc'
+     4.6 or later on x86, x86-64, and Itanium architectures, instead of
+     the default double-precision (`double').  The `configure' script
+     will halt with an error message if the compiler is not `gcc'
+     version 4.6 or later or if `gcc''s `libquadmath' library is not
+     installed.  *Note Precision::.
+
+   * `--enable-threads': Enables compilation and installation of the
+     FFTW threads library (*note Multi-threaded FFTW::), which provides
+     a simple interface to parallel transforms for SMP systems.  By
+     default, the threads routines are not compiled.
+
+   * `--enable-openmp': Like `--enable-threads', but using OpenMP
+     compiler directives in order to induce parallelism rather than
+     spawning its own threads directly, and installing an `fftw3_omp'
+     library rather than an `fftw3_threads' library (*note
+     Multi-threaded FFTW::).  You can use both `--enable-openmp' and
+     `--enable-threads' since they compile/install libraries with
+     different names.  By default, the OpenMP routines are not compiled.
+
+   * `--with-combined-threads': By default, if `--enable-threads' is
+     used, the threads support is compiled into a separate library that
+     must be linked in addition to the main FFTW library.  This is so
+     that users of the serial library do not need to link the system
+     threads libraries.  If `--with-combined-threads' is specified,
+     however, then no separate threads library is created, and threads
+     are included in the main FFTW library.  This is mainly useful
+     under Windows, where no system threads library is required and
+     inter-library dependencies are problematic.
+
+   * `--enable-mpi': Enables compilation and installation of the FFTW
+     MPI library (*note Distributed-memory FFTW with MPI::), which
+     provides parallel transforms for distributed-memory systems with
+     MPI.  (By default, the MPI routines are not compiled.)  *Note FFTW
+     MPI Installation::.
+
+   * `--disable-fortran': Disables inclusion of legacy-Fortran wrapper
+     routines (*note Calling FFTW from Legacy Fortran::) in the standard
+     FFTW libraries.  These wrapper routines increase the library size
+     by only a negligible amount, so they are included by default as
+     long as the `configure' script finds a Fortran compiler on your
+     system.  (To specify a particular Fortran compiler foo, pass
+     `F77='foo to `configure'.)
+
+   * `--with-g77-wrappers': By default, when Fortran wrappers are
+     included, the wrappers employ the linking conventions of the
+     Fortran compiler detected by the `configure' script.  If this
+     compiler is GNU `g77', however, then _two_ versions of the
+     wrappers are included: one with `g77''s idiosyncratic convention
+     of appending two underscores to identifiers, and one with the more
+     common convention of appending only a single underscore.  This
+     way, the same FFTW library will work with both `g77' and other
+     Fortran compilers, such as GNU `gfortran'.  However, the converse
+     is not true: if you configure with a different compiler, then the
+     `g77'-compatible wrappers are not included.  By specifying
+     `--with-g77-wrappers', the `g77'-compatible wrappers are included
+     in addition to wrappers for whatever Fortran compiler `configure'
+     finds.  
+
+   * `--with-slow-timer': Disables the use of hardware cycle counters,
+     and falls back on `gettimeofday' or `clock'.  This greatly worsens
+     performance, and should generally not be used (unless you don't
+     have a cycle counter but still really want an optimized plan
+     regardless of the time).  *Note Cycle Counters::.
+
+   * `--enable-sse', `--enable-sse2', `--enable-avx',
+     `--enable-altivec', `--enable-neon': Enable the compilation of
+     SIMD code for SSE (Pentium III+), SSE2 (Pentium IV+), AVX (Sandy
+     Bridge, Interlagos), AltiVec (PowerPC G4+), NEON (some ARM
+     processors).  SSE, AltiVec, and NEON only work with
+     `--enable-float' (above).  SSE2 works in both single and double
+     precision (and is simply SSE in single precision).  The resulting
+     code will _still work_ on earlier CPUs lacking the SIMD extensions
+     (SIMD is automatically disabled, although the FFTW library is
+     still larger).
+        - These options require a compiler supporting SIMD extensions,
+          and compiler support is always a bit flaky: see the FFTW FAQ
+          for a list of compiler versions that have problems compiling
+          FFTW.
+
+        - With AltiVec and `gcc', you may have to use the
+          `-mabi=altivec' option when compiling any code that links to
+          FFTW, in order to properly align the stack; otherwise, FFTW
+          could crash when it tries to use an AltiVec feature.  (This
+          is not necessary on MacOS X.)
+
+        - With SSE/SSE2 and `gcc', you should use a version of gcc that
+          properly aligns the stack when compiling any code that links
+          to FFTW.  By default, `gcc' 2.95 and later versions align the
+          stack as needed, but you should not compile FFTW with the
+          `-Os' option or the `-mpreferred-stack-boundary' option with
+          an argument less than 4.
+
+        - Because of the large variety of ARM processors and ABIs, FFTW
+          does not attempt to guess the correct `gcc' flags for
+          generating NEON code.  In general, you will have to provide
+          them on the command line.  This command line is known to have
+          worked at least once:
+               ./configure --with-slow-timer --host=arm-linux-gnueabi \
+                 --enable-single --enable-neon \
+                 "CC=arm-linux-gnueabi-gcc -march=armv7-a -mfloat-abi=softfp"
+
+
+   To force `configure' to use a particular C compiler foo (instead of
+the default, usually `gcc'), pass `CC='foo to the `configure' script;
+you may also need to set the flags via the variable `CFLAGS' as
+described above.  
+
+
+File: fftw3.info,  Node: Installation on non-Unix systems,  Next: Cycle Counters,  Prev: Installation on Unix,  Up: Installation and Customization
+
+10.2 Installation on non-Unix systems
+=====================================
+
+It should be relatively straightforward to compile FFTW even on non-Unix
+systems lacking the niceties of a `configure' script.  Basically, you
+need to edit the `config.h' header (copy it from `config.h.in') to
+`#define' the various options and compiler characteristics, and then
+compile all the `.c' files in the relevant directories.
+
+   The `config.h' header contains about 100 options to set, each one
+initially an `#undef', each documented with a comment, and most of them
+fairly obvious.  For most of the options, you should simply `#define'
+them to `1' if they are applicable, although a few options require a
+particular value (e.g. `SIZEOF_LONG_LONG' should be defined to the size
+of the `long long' type, in bytes, or zero if it is not supported).  We
+will likely post some sample `config.h' files for various operating
+systems and compilers for you to use (at least as a starting point).
+Please let us know if you have to hand-create a configuration file
+(and/or a pre-compiled binary) that you want to share.
+
+   To create the FFTW library, you will then need to compile all of the
+`.c' files in the `kernel', `dft', `dft/scalar', `dft/scalar/codelets',
+`rdft', `rdft/scalar', `rdft/scalar/r2cf', `rdft/scalar/r2cb',
+`rdft/scalar/r2r', `reodft', and `api' directories.  If you are
+compiling with SIMD support (e.g. you defined `HAVE_SSE2' in
+`config.h'), then you also need to compile the `.c' files in the
+`simd-support', `{dft,rdft}/simd', `{dft,rdft}/simd/*' directories.
+
+   Once these files are all compiled, link them into a library, or a
+shared library, or directly into your program.
+
+   To compile the FFTW test program, additionally compile the code in
+the `libbench2/' directory, and link it into a library.  Then compile
+the code in the `tests/' directory and link it to the `libbench2' and
+FFTW libraries.  To compile the `fftw-wisdom' (command-line) tool
+(*note Wisdom Utilities::), compile `tools/fftw-wisdom.c' and link it
+to the `libbench2' and FFTW libraries
+
+
+File: fftw3.info,  Node: Cycle Counters,  Next: Generating your own code,  Prev: Installation on non-Unix systems,  Up: Installation and Customization
+
+10.3 Cycle Counters
+===================
+
+FFTW's planner actually executes and times different possible FFT
+algorithms in order to pick the fastest plan for a given n.  In order
+to do this in as short a time as possible, however, the timer must have
+a very high resolution, and to accomplish this we employ the hardware
+"cycle counters" that are available on most CPUs.  Currently, FFTW
+supports the cycle counters on x86, PowerPC/POWER, Alpha, UltraSPARC
+(SPARC v9), IA64, PA-RISC, and MIPS processors.
+
+   Access to the cycle counters, unfortunately, is a compiler and/or
+operating-system dependent task, often requiring inline assembly
+language, and it may be that your compiler is not supported.  If you are
+_not_ supported, FFTW will by default fall back on its estimator
+(effectively using `FFTW_ESTIMATE' for all plans).  
+
+   You can add support by editing the file `kernel/cycle.h'; normally,
+this will involve adapting one of the examples already present in order
+to use the inline-assembler syntax for your C compiler, and will only
+require a couple of lines of code.  Anyone adding support for a new
+system to `cycle.h' is encouraged to email us at <fftw@fftw.org>.
+
+   If a cycle counter is not available on your system (e.g. some
+embedded processor), and you don't want to use estimated plans, as a
+last resort you can use the `--with-slow-timer' option to `configure'
+(on Unix) or `#define WITH_SLOW_TIMER' in `config.h' (elsewhere).  This
+will use the much lower-resolution `gettimeofday' function, or even
+`clock' if the former is unavailable, and planning will be extremely
+slow.
+
+
+File: fftw3.info,  Node: Generating your own code,  Prev: Cycle Counters,  Up: Installation and Customization
+
+10.4 Generating your own code
+=============================
+
+The directory `genfft' contains the programs that were used to generate
+FFTW's "codelets," which are hard-coded transforms of small sizes.  We
+do not expect casual users to employ the generator, which is a rather
+sophisticated program that generates directed acyclic graphs of FFT
+algorithms and performs algebraic simplifications on them.  It was
+written in Objective Caml, a dialect of ML, which is available at
+`http://caml.inria.fr/ocaml/index.en.html'.  
+
+   If you have Objective Caml installed (along with recent versions of
+GNU `autoconf', `automake', and `libtool'), then you can change the set
+of codelets that are generated or play with the generation options.
+The set of generated codelets is specified by the
+`{dft,rdft}/{codelets,simd}/*/Makefile.am' files.  For example, you can
+add efficient REDFT codelets of small sizes by modifying
+`rdft/codelets/r2r/Makefile.am'.  After you modify any `Makefile.am'
+files, you can type `sh bootstrap.sh' in the top-level directory
+followed by `make' to re-generate the files.
+
+   We do not provide more details about the code-generation process,
+since we do not expect that most users will need to generate their own
+code.  However, feel free to contact us at <fftw@fftw.org> if you are
+interested in the subject.
+
+   You might find it interesting to learn Caml and/or some modern
+programming techniques that we used in the generator (including monadic
+programming), especially if you heard the rumor that Java and
+object-oriented programming are the latest advancement in the field.
+The internal operation of the codelet generator is described in the
+paper, "A Fast Fourier Transform Compiler," by M. Frigo, which is
+available from the FFTW home page (http://www.fftw.org) and also
+appeared in the `Proceedings of the 1999 ACM SIGPLAN Conference on
+Programming Language Design and Implementation (PLDI)'.
+
+
+File: fftw3.info,  Node: Acknowledgments,  Next: License and Copyright,  Prev: Installation and Customization,  Up: Top
+
+11 Acknowledgments
+******************
+
+Matteo Frigo was supported in part by the Special Research Program SFB
+F011 "AURORA" of the Austrian Science Fund FWF and by MIT Lincoln
+Laboratory.  For previous versions of FFTW, he was supported in part by
+the Defense Advanced Research Projects Agency (DARPA), under Grants
+N00014-94-1-0985 and F30602-97-1-0270, and by a Digital Equipment
+Corporation Fellowship.
+
+   Steven G. Johnson was supported in part by a Dept. of Defense NDSEG
+Fellowship, an MIT Karl Taylor Compton Fellowship, and by the Materials
+Research Science and Engineering Center program of the National Science
+Foundation under award DMR-9400334.
+
+   Code for the Cell Broadband Engine was graciously donated to the FFTW
+project by the IBM Austin Research Lab and included in fftw-3.2.  (This
+code was removed in fftw-3.3.)
+
+   Code for the MIPS paired-single SIMD support was graciously donated
+to the FFTW project by CodeSourcery, Inc.
+
+   We are grateful to Sun Microsystems Inc. for its donation of a
+cluster of 9 8-processor Ultra HPC 5000 SMPs (24 Gflops peak). These
+machines served as the primary platform for the development of early
+versions of FFTW.
+
+   We thank Intel Corporation for donating a four-processor Pentium Pro
+machine.  We thank the GNU/Linux community for giving us a decent OS to
+run on that machine.
+
+   We are thankful to the AMD corporation for donating an AMD Athlon XP
+1700+ computer to the FFTW project.
+
+   We thank the Compaq/HP testdrive program and VA Software Corporation
+(SourceForge.net) for providing remote access to machines that were used
+to test FFTW.
+
+   The `genfft' suite of code generators was written using Objective
+Caml, a dialect of ML.  Objective Caml is a small and elegant language
+developed by Xavier Leroy.  The implementation is available from
+`http://caml.inria.fr/' (http://caml.inria.fr/).  In previous releases
+of FFTW, `genfft' was written in Caml Light, by the same authors.  An
+even earlier implementation of `genfft' was written in Scheme, but Caml
+is definitely better for this kind of application.  
+
+   FFTW uses many tools from the GNU project, including `automake',
+`texinfo', and `libtool'.
+
+   Prof. Charles E. Leiserson of MIT provided continuous support and
+encouragement.  This program would not exist without him.  Charles also
+proposed the name "codelets" for the basic FFT blocks.  
+
+   Prof. John D. Joannopoulos of MIT demonstrated continuing tolerance
+of Steven's "extra-curricular" computer-science activities, as well as
+remarkable creativity in working them into his grant proposals.
+Steven's physics degree would not exist without him.
+
+   Franz Franchetti wrote SIMD extensions to FFTW 2, which eventually
+led to the SIMD support in FFTW 3.
+
+   Stefan Kral wrote most of the K7 code generator distributed with FFTW
+3.0.x and 3.1.x.
+
+   Andrew Sterian contributed the Windows timing code in FFTW 2.
+
+   Didier Miras reported a bug in the test procedure used in FFTW 1.2.
+We now use a completely different test algorithm by Funda Ergun that
+does not require a separate FFT program to compare against.
+
+   Wolfgang Reimer contributed the Pentium cycle counter and a few fixes
+that help portability.
+
+   Ming-Chang Liu uncovered a well-hidden bug in the complex transforms
+of FFTW 2.0 and supplied a patch to correct it.
+
+   The FFTW FAQ was written in `bfnn' (Bizarre Format With No Name) and
+formatted using the tools developed by Ian Jackson for the Linux FAQ.
+
+   _We are especially thankful to all of our users for their continuing
+support, feedback, and interest during our development of FFTW._
+
+
+File: fftw3.info,  Node: License and Copyright,  Next: Concept Index,  Prev: Acknowledgments,  Up: Top
+
+12 License and Copyright
+************************
+
+FFTW is Copyright (C) 2003, 2007-11 Matteo Frigo, Copyright (C) 2003,
+2007-11 Massachusetts Institute of Technology.
+
+   FFTW is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 2 of the License, or (at your
+option) any later version.
+
+   This program is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software Foundation,
+Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA You
+can also find the GPL on the GNU web site
+(http://www.gnu.org/licenses/gpl-2.0.html).
+
+   In addition, we kindly ask you to acknowledge FFTW and its authors in
+any program or publication in which you use FFTW.  (You are not
+_required_ to do so; it is up to your common sense to decide whether
+you want to comply with this request or not.)  For general
+publications, we suggest referencing: Matteo Frigo and Steven G.
+Johnson, "The design and implementation of FFTW3," Proc. IEEE 93 (2),
+216-231 (2005).
+
+   Non-free versions of FFTW are available under terms different from
+those of the General Public License. (e.g. they do not require you to
+accompany any object code using FFTW with the corresponding source
+code.)  For these alternative terms you must purchase a license from
+MIT's Technology Licensing Office.  Users interested in such a license
+should contact us (<fftw@fftw.org>) for more information.
+