Speed:
    * If you want to use multiple cores, then compile with -openmp or -fopenmp (see your compiler docs).
    Larger FFTs will benefit more than smaller FFTs. This generally uses more CPU time, but
    less wall time.

    * Experiment with compiler flags.
        Special thanks to Oscar Lesta. He suggested some compiler flags
        for gcc that make a big difference. They shave 10-15% off
        execution time on some systems. Try some combination of:
                -march=pentiumpro
                -ffast-math
                -fomit-frame-pointer

    * If the input data has no imaginary component, use the kiss_fftr code under tools/.
    Real FFTs are roughly twice as fast as complex FFTs.

    * If you can rearrange your code to do 4 FFTs in parallel and you are on a recent Intel or AMD machine,
    then you might want to experiment with the USE_SIMD code. See README.simd


Reducing code size:
    * Remove some of the butterflies. There are currently butterflies optimized for radices
        2, 3, 4 and 5. You can still use FFT sizes that contain other factors;
        they just won't be quite as fast. You can decide for yourself
        whether to keep radix 2 or 4. If you do some work in this area, let me
        know what you find.

    * For platforms where ROM/code space is more plentiful than RAM,
    consider creating a hardcoded kiss_fft_state. In other words, decide which
    FFT size(s) you want and make a structure with the correct factors and twiddles.

    * Frank van der Hulst offered numerous suggestions for smaller code size and correct operation
    on embedded targets. "I'm happy to help anyone who is trying to implement KISSFFT on a micro"

    Some of these were rolled into the mainline code base:
        - using long casts to promote intermediate results of short*short multiplication
        - delaying allocation of buffers that are sometimes unused.
    In some cases, it may be desirable to limit capability in order to better suit the target:
        - predefining the twiddle tables for the desired FFT size.
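
The long-cast item above matters on targets where int is only 16 bits wide: a short*short
product is then computed in 16 bits and can overflow before it is scaled back down. The
following is a minimal sketch of the idea, not the code used in KISS FFT. The helper name
q15_mul and the Q15 (1.15 fixed-point) format are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of KISS FFT): multiply two Q15 fixed-point
 * values and return a rounded Q15 result.
 *
 * Casting one operand to long forces the multiplication to be carried out
 * in a type that is at least 32 bits wide, even when int is 16 bits. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    long product;

    /* -1.0 * -1.0 = +1.0 is not representable in Q15; callers must avoid it
     * (or saturate, which this sketch does not do). */
    assert(!(a == INT16_MIN && b == INT16_MIN));

    product = (long)a * b;   /* promoted: 32-bit intermediate result */
    product += 1L << 14;     /* add half an LSB so the shift rounds */

    /* Right-shifting a negative value is implementation-defined in C;
     * this sketch assumes the usual arithmetic shift. */
    return (int16_t)(product >> 15);
}
```

For example, q15_mul(16384, 16384) multiplies 0.5 by 0.5 and yields 8192, which is 0.25 in Q15.
Without the cast, the intermediate value 268435456 would not fit in a 16-bit int.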