@node Multi-threaded FFTW, Distributed-memory FFTW with MPI, FFTW Reference, Top
@chapter Multi-threaded FFTW

@cindex parallel transform
In this chapter we document the parallel FFTW routines for
shared-memory parallel hardware.  These routines, which support
parallel one- and multi-dimensional transforms of both real and
complex data, are the easiest way to take advantage of multiple
processors with FFTW.  They work just like the corresponding
uniprocessor transform routines, except that you have an extra
initialization routine to call, and there is a routine to set the
number of threads to employ.  Any program that uses the uniprocessor
FFTW can therefore be trivially modified to use the multi-threaded
FFTW.

A shared-memory machine is one in which all CPUs can directly access
the same main memory, and such machines are now common due to the
ubiquity of multi-core CPUs.  FFTW's multi-threading support allows
you to utilize these additional CPUs transparently from a single
program.  However, this does not necessarily translate into
performance gains---when multiple threads/CPUs are employed, there is
an overhead required for synchronization that may outweigh the
computational parallelism.  Therefore, you can only benefit from
threads if your problem is sufficiently large.
@cindex shared-memory
@cindex threads

@menu
* Installation and Supported Hardware/Software::
* Usage of Multi-threaded FFTW::
* How Many Threads to Use?::
* Thread safety::
@end menu

@c ------------------------------------------------------------
@node Installation and Supported Hardware/Software, Usage of Multi-threaded FFTW, Multi-threaded FFTW, Multi-threaded FFTW
@section Installation and Supported Hardware/Software

All of the FFTW threads code is located in the @code{threads}
subdirectory of the FFTW package.  On Unix systems, the FFTW threads
libraries and header files can be automatically configured, compiled,
and installed along with the uniprocessor FFTW libraries simply by
including @code{--enable-threads} in the flags to the @code{configure}
script (@pxref{Installation on Unix}), or @code{--enable-openmp} to use
@uref{http://www.openmp.org,OpenMP} threads.
@fpindex configure

@cindex portability
@cindex OpenMP
The threads routines require your operating system to have some sort
of shared-memory threads support.  Specifically, the FFTW threads
package works with POSIX threads (available on most Unix variants,
from GNU/Linux to MacOS X) and Win32 threads.  OpenMP threads, which
are supported in many common compilers (e.g.@: gcc), are also supported,
and may give better performance on some systems.  (OpenMP threads are
also useful if you are employing OpenMP in your own code, in order to
minimize conflicts between threading models.)
If you have a
shared-memory machine that uses a different threads API, it should be
a simple matter of programming to include support for it; see the file
@code{threads/threads.c} for more detail.

You can compile FFTW with @emph{both} @code{--enable-threads} and
@code{--enable-openmp} at the same time, since they install libraries
with different names (@samp{fftw3_threads} and @samp{fftw3_omp}, as
described below).  However, your programs may only link to @emph{one}
of these two libraries at a time.

Ideally, of course, you should also have multiple processors in order to
get any benefit from the threaded transforms.

@c ------------------------------------------------------------
@node Usage of Multi-threaded FFTW, How Many Threads to Use?, Installation and Supported Hardware/Software, Multi-threaded FFTW
@section Usage of Multi-threaded FFTW

Here, it is assumed that the reader is already familiar with the usage
of the uniprocessor FFTW routines, described elsewhere in this manual.
We only describe what one has to change in order to use the
multi-threaded routines.

@cindex OpenMP
First, programs using the parallel complex transforms should be linked
with @code{-lfftw3_threads -lfftw3 -lm} on Unix, or @code{-lfftw3_omp
-lfftw3 -lm} if you compiled with OpenMP.  You will also need to link
with whatever library is responsible for threads on your system
(e.g.@: @code{-lpthread} on GNU/Linux) or include whatever compiler flag
enables OpenMP (e.g.@: @code{-fopenmp} with gcc).
@cindex linking on Unix

Second, before calling @emph{any} FFTW routines, you should call the
function:

@example
int fftw_init_threads(void);
@end example
@findex fftw_init_threads

This function, which need only be called once, performs any one-time
initialization required to use threads on your system.  It returns zero
if there was some error (which should not happen under normal
circumstances) and a non-zero value otherwise.

Third, before creating a plan that you want to parallelize, you should
call:

@example
void fftw_plan_with_nthreads(int nthreads);
@end example
@findex fftw_plan_with_nthreads

The @code{nthreads} argument indicates the number of threads you want
FFTW to use (or actually, the maximum number).  All plans subsequently
created with any planner routine will use that many threads.  You can
call @code{fftw_plan_with_nthreads}, create some plans, call
@code{fftw_plan_with_nthreads} again with a different argument, and
create some more plans for a new number of threads.  Plans already created
before a call to @code{fftw_plan_with_nthreads} are unaffected.  If you
pass an @code{nthreads} argument of @code{1} (the default), threads are
disabled for subsequent plans.

@cindex OpenMP
With OpenMP, to configure FFTW to use all of the currently running
OpenMP threads (set by @code{omp_set_num_threads(nthreads)} or by the
@code{OMP_NUM_THREADS} environment variable), you can do:
@code{fftw_plan_with_nthreads(omp_get_max_threads())}.  (The @samp{omp_}
OpenMP functions are declared via @code{#include <omp.h>}.)
@cindex thread safety
Given a plan, you then execute it as usual with
@code{fftw_execute(plan)}, and the execution will use the number of
threads specified when the plan was created.  When done, you destroy
it as usual with @code{fftw_destroy_plan}.  As described in
@ref{Thread safety}, plan @emph{execution} is thread-safe, but plan
creation and destruction are @emph{not}: you should create/destroy
plans only from a single thread, but can safely execute multiple plans
in parallel.

There is one additional routine: if you want to get rid of all memory
and other resources allocated internally by FFTW, you can call:

@example
void fftw_cleanup_threads(void);
@end example
@findex fftw_cleanup_threads

which is much like the @code{fftw_cleanup()} function except that it
also gets rid of threads-related data.  You must @emph{not} execute any
previously created plans after calling this function.

We should also mention one other restriction: if you save wisdom from a
program using the multi-threaded FFTW, that wisdom @emph{cannot be used}
by a program using only the single-threaded FFTW (i.e.@: not calling
@code{fftw_init_threads}).  @xref{Words of Wisdom-Saving Plans}.

@c ------------------------------------------------------------
@node How Many Threads to Use?, Thread safety, Usage of Multi-threaded FFTW, Multi-threaded FFTW
@section How Many Threads to Use?

@cindex number of threads
There is a fair amount of overhead involved in synchronizing threads,
so the optimal number of threads to use depends upon the size of the
transform as well as on the number of processors you have.
As a general rule, you don't want to use more threads than you have
processors.  (Using more threads will work, but there will be extra
overhead with no benefit.)  In fact, if the problem size is too small,
you may want to use fewer threads than you have processors.

You will have to experiment with your system to see what level of
parallelization is best for your problem size.  Typically, the problem
will have to involve at least a few thousand data points before threads
become beneficial.  If you plan with @code{FFTW_PATIENT}, it will
automatically disable threads for sizes that don't benefit from
parallelization.
@ctindex FFTW_PATIENT

@c ------------------------------------------------------------
@node Thread safety, , How Many Threads to Use?, Multi-threaded FFTW
@section Thread safety

@cindex threads
@cindex OpenMP
@cindex thread safety
Users writing multi-threaded programs (including OpenMP) must concern
themselves with the @dfn{thread safety} of the libraries they
use---that is, whether it is safe to call routines in parallel from
multiple threads.  FFTW can be used in such an environment, but some
care must be taken because the planner routines share data
(e.g.@: wisdom and trigonometric tables) between calls and plans.

The upshot is that the only thread-safe routine in FFTW is
@code{fftw_execute} (and the new-array variants thereof).  All other
routines (e.g.@: the planner) should only be called from one thread at
a time.  So, for example, you can wrap a semaphore lock around any
calls to the planner; even more simply, you can just create all of
your plans from one thread.
We do not think this should be an important restriction
(FFTW is designed for the situation where the only performance-sensitive
code is the actual execution of the transform), and the benefits of
shared data between plans are great.

Note also that, since the plan is not modified by @code{fftw_execute},
it is safe to execute the @emph{same plan} in parallel by multiple
threads.  However, since a given plan operates by default on a fixed
array, you need to use one of the new-array execute functions
(@pxref{New-array Execute Functions}) so that different threads compute
the transform of different data.

(Users should note that these comments only apply to programs using
shared-memory threads or OpenMP.  Parallelism using MPI or forked processes
involves a separate address-space and global variables for each process,
and is not susceptible to problems of this sort.)

The FFTW planner is intended to be called from a single thread.  If you
really must call it from multiple threads, you are expected to grab
whatever lock makes sense for your application, with the understanding
that you may be holding that lock for a long time, which is undesirable.

Neither strategy works, however, in the following situation.  The
``application'' is structured as a set of ``plugins'' which are unaware
of each other, and for whatever reason the ``plugins'' cannot coordinate
on grabbing the lock.  (This is not a technical problem, but an
organizational one.  The ``plugins'' are written by independent agents,
and from the perspective of each plugin's author, each plugin is using
FFTW correctly from a single thread.)
To cope with this situation,
starting from FFTW-3.3.5, FFTW supports an API to make the planner
thread-safe:

@example
void fftw_make_planner_thread_safe(void);
@end example
@findex fftw_make_planner_thread_safe

This call operates by brute force: it just installs a hook that wraps a
lock (chosen by us) around all planner calls.  So there is no magic and
you get the worst of all worlds.  The planner is still single-threaded,
but you cannot choose which lock to use.  The planner still holds the
lock for a long time, but you cannot impose a timeout on lock
acquisition.  As of FFTW-3.3.5 and FFTW-3.3.6, this call does not work
when using OpenMP as threading substrate.  (Suggestions on what to do
about this bug are welcome.)  @emph{Do not use
@code{fftw_make_planner_thread_safe} unless there is no other choice,}
such as in the application/plugin situation.