Chris@366: Speed:
Chris@366:     * If you want to use multiple cores, compile with -openmp or -fopenmp (see your compiler docs).
Chris@366:       Larger FFTs benefit more than smaller ones. Multithreading generally uses more total CPU time,
Chris@366:       but less wall-clock time.
Chris@366: 
Chris@366:     * Experiment with compiler flags.
Chris@366:         Special thanks to Oscar Lesta, who suggested some compiler flags
Chris@366:         for gcc that make a big difference. They shave 10-15% off
Chris@366:         execution time on some systems.  Try some combination of:
Chris@366:                 -march=pentiumpro
Chris@366:                 -ffast-math
Chris@366:                 -fomit-frame-pointer
Chris@366: 
Chris@366:     * If the input data has no imaginary component, use the kiss_fftr code under tools/.
Chris@366:       Real FFTs are roughly twice as fast as complex FFTs of the same length.
Chris@366: 
Chris@366:     * If you can rearrange your code to do 4 FFTs in parallel and you are on a recent Intel or AMD machine,
Chris@366:       you might want to experiment with the USE_SIMD code.  See README.simd.
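
The multi-core tip above can be illustrated with a minimal sketch. It does not
use the KISS FFT API: frame_energy() and process_frames() are hypothetical
names, and frame_energy() is only a stand-in for one independent unit of work
(for example, one FFT frame). The point is that the pragma takes effect only
when the compiler is run with its OpenMP flag (such as -fopenmp for gcc), and
the results should be identical either way.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one independent unit of work.
 * It sums the squares of one frame of 16-bit samples. */
long frame_energy(const short *frame, size_t len)
{
    long acc = 0;
    size_t i;
    for (i = 0; i < len; ++i)
        acc += (long)frame[i] * (long)frame[i];
    return acc;
}

/* Process nframes independent frames of len samples each.
 * When OpenMP is enabled, the loop iterations may be spread over
 * several threads. Otherwise the pragma is skipped by the guard and
 * the loop runs serially. Each iteration writes only energy[f], so
 * no locking is needed. */
void process_frames(const short *data, size_t nframes, size_t len, long *energy)
{
    long f;
    assert(data != NULL && energy != NULL);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (f = 0; f < (long)nframes; ++f)
        energy[f] = frame_energy(data + (size_t)f * len, len);
}
```

With only a handful of short frames the threading overhead is likely to
outweigh any gain, which matches the remark that larger FFTs benefit more.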
Chris@366: 
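One way to see why real-input transforms can be cheaper: the spectrum of a
real signal is conjugate symmetric, so about half of the output bins are
redundant. The sketch below checks this for a naive 4-point DFT using exact
integer arithmetic. It is not the kiss_fftr implementation; cpx_i, mul_by_w4(),
dft4() and is_conjugate_symmetric4() are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical integer complex type for this sketch (not kiss_fft_cpx). */
typedef struct { int r; int i; } cpx_i;

/* Multiply x by (-i)^p, which is the exact 4-point DFT twiddle
 * exp(-2*pi*i*p/4). No floating point is needed for N = 4. */
cpx_i mul_by_w4(cpx_i x, unsigned p)
{
    cpx_i y;
    switch (p & 3u) {
    case 0:  y.r =  x.r; y.i =  x.i; break;   /* times  1 */
    case 1:  y.r =  x.i; y.i = -x.r; break;   /* times -i */
    case 2:  y.r = -x.r; y.i = -x.i; break;   /* times -1 */
    default: y.r = -x.i; y.i =  x.r; break;   /* times  i */
    }
    return y;
}

/* Naive 4-point DFT: out[k] = sum over n of in[n] * exp(-2*pi*i*n*k/4). */
void dft4(const cpx_i *in, cpx_i *out)
{
    unsigned k, n;
    assert(in != NULL && out != NULL);
    for (k = 0; k < 4; ++k) {
        out[k].r = 0;
        out[k].i = 0;
        for (n = 0; n < 4; ++n) {
            cpx_i t = mul_by_w4(in[n], n * k);
            out[k].r += t.r;
            out[k].i += t.i;
        }
    }
}

/* Returns 1 if x[4-k] equals the complex conjugate of x[k] for k = 1..3. */
int is_conjugate_symmetric4(const cpx_i *x)
{
    unsigned k;
    for (k = 1; k < 4; ++k) {
        if (x[4 - k].r != x[k].r || x[4 - k].i != -x[k].i)
            return 0;
    }
    return 1;
}
```

For the real input 1, 2, 3, 4 the bins are 10, -2+2i, -2, -2-2i, so bin 3
carries no information beyond bin 1. A real-input transform only needs to
produce the first N/2+1 bins.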
Chris@366: 
Chris@366: Reducing code size:
Chris@366:     * Remove some of the butterflies. There are currently butterflies optimized for radices
Chris@366:         2, 3, 4, and 5.  You can still use FFT sizes that contain other factors; they just
Chris@366:         won't be quite as fast.  You can decide for yourself whether to keep radix 2 or 4.
Chris@366:         If you do some work in this area, let me know what you find.
Chris@366: 
Chris@366:     * For platforms where ROM/code space is more plentiful than RAM,
Chris@366:      consider creating a hardcoded kiss_fft_state. In other words, decide which 
Chris@366:      FFT size(s) you want and make a structure with the correct factors and twiddles.
Chris@366: 
Chris@366:     * Frank van der Hulst offered numerous suggestions for smaller code size and correct operation
Chris@366:       on embedded targets.  "I'm happy to help anyone who is trying to implement KISSFFT on a micro"
Chris@366: 
Chris@366:       Some of these were rolled into the mainline code base:
Chris@366:         - using long casts to promote intermediate results of short*short multiplication
Chris@366:         - delaying allocation of buffers that are sometimes unused
Chris@366:       In some cases, it may be desirable to limit capability in order to better suit the target:
Chris@366:         - predefining the twiddle tables for the desired FFT size
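
The point about promoting short*short products matters on targets where int is
only 16 bits wide, because the product of two 16-bit samples needs up to 32
bits. The following is a minimal sketch of a Q15 fixed-point multiply with the
operands widened first. q15_mul() is a hypothetical helper written for this
note, not a function from the KISS FFT sources.

```c
#include <assert.h>
#include <stdint.h>

/* Q15 fixed-point multiply of two 16-bit samples.
 * Both operands are cast to 32 bits BEFORE the multiplication, so the
 * intermediate product cannot overflow even when int is 16 bits wide.
 * Assumption: right-shifting a negative value is an arithmetic shift,
 * which is implementation-defined in C but typical of common compilers. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t wide;
    /* -1.0 * -1.0 would give +1.0, which Q15 cannot represent. */
    assert(!(a == INT16_MIN && b == INT16_MIN));
    wide = (int32_t)a * (int32_t)b;   /* full-width product */
    wide += (int32_t)1 << 14;         /* add half an LSB to round to nearest */
    return (int16_t)(wide >> 15);     /* scale back to Q15 */
}
```

For example, q15_mul(16384, 16384) multiplies 0.5 by 0.5 and returns 8192,
which is 0.25 in Q15. Without the casts, a 16-bit int would overflow on
the intermediate value 16384 * 16384.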