Speed:
    * If you want to use multiple cores, compile with -openmp or -fopenmp (see your compiler docs).
      Larger FFTs benefit more than smaller ones. This generally uses more CPU time, but
      less wall time.

    * Experiment with compiler flags.
      Special thanks to Oscar Lesta. He suggested some compiler flags
      for gcc that make a big difference. They shave 10-15% off
      execution time on some systems. Try some combination of:
            -march=pentiumpro
            -ffast-math
            -fomit-frame-pointer

    * If the input data has no imaginary component, use the kiss_fftr code under tools/.
      Real FFTs are roughly twice as fast as complex FFTs.

    * If you can rearrange your code to do 4 FFTs in parallel and you are on a recent Intel or AMD machine,
      then you might want to experiment with the USE_SIMD code. See README.simd


Reducing code size:
    * Remove some of the butterflies. There are currently butterflies optimized for radices
      2, 3, 4 and 5. You can still use FFT sizes that contain other factors;
      they just won't be quite as fast. You can decide for yourself
      whether to keep radix 2 or 4. If you do some work in this area, let me
      know what you find.

    * For platforms where ROM/code space is more plentiful than RAM,
      consider creating a hardcoded kiss_fft_state. In other words, decide which
      FFT size(s) you want and make a structure with the correct factors and twiddles.

    * Frank van der Hulst offered numerous suggestions for smaller code size and correct operation
      on embedded targets. "I'm happy to help anyone who is trying to implement KISSFFT on a micro"

      Some of these were rolled into the mainline code base:
        - using long casts to promote intermediate results of short*short multiplication
        - delaying allocation of buffers that are sometimes unused.
      In some cases, it may be desirable to limit capability in order to better suit the target:
        - predefining the twiddle tables for the desired FFT size.
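
As a rough illustration of predefining twiddle tables, the sketch below writes out
the twiddles for one fixed size (8 points) as constants, so that no sin/cos calls or
heap allocation are needed at run time. Everything in it is an assumption made for
the example: the names cpx, NFFT, twiddles8, twiddle8 and dft8_bin are hypothetical
and are not part of kiss_fft, the samples are assumed to be float, and the forward
sign convention exp(-2*pi*j*k/N) is assumed. A real hardcoded kiss_fft_state must
follow the structure layout in the library's own headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex type for this sketch; not kiss_fft's own type. */
typedef struct { float r, i; } cpx;

#define NFFT 8  /* the single FFT size this build supports (assumed) */

/* Forward twiddles exp(-2*pi*j*k/NFFT) for k = 0..NFFT-1, written out as
 * constants so the table can be placed in ROM instead of computed into RAM. */
static const cpx twiddles8[NFFT] = {
    {  1.0f,         0.0f        },
    {  0.70710678f, -0.70710678f },
    {  0.0f,        -1.0f        },
    { -0.70710678f, -0.70710678f },
    { -1.0f,         0.0f        },
    { -0.70710678f,  0.70710678f },
    {  0.0f,         1.0f        },
    {  0.70710678f,  0.70710678f },
};

/* Look up twiddle k from the constant table. */
static cpx twiddle8(size_t k)
{
    assert(k < NFFT);
    return twiddles8[k];
}

/* Compute one output bin of an 8-point DFT of real input by direct
 * summation. This is only here to show the table in use; it is not a
 * fast transform and not how kiss_fft's butterflies work. */
static cpx dft8_bin(const float x[NFFT], size_t bin)
{
    cpx acc = { 0.0f, 0.0f };
    for (size_t n = 0; n < NFFT; ++n) {
        cpx w = twiddle8((n * bin) % NFFT);
        acc.r += x[n] * w.r;
        acc.i += x[n] * w.i;
    }
    return acc;
}
```

An inverse transform would need the conjugate of each entry, and a fixed-point build
would need the constants scaled to the integer sample format in use. The trade-off is
the one described above: the table costs ROM, but saves RAM and start-up time.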