diff ext/kissfft/TIPS @ 409:1f1999b0f577
Bring in kissfft into this repo (formerly a subrepo, but the remote is not responding)
author | Chris Cannam <c.cannam@qmul.ac.uk> |
---|---|
date | Tue, 21 Jul 2015 07:34:15 +0100 |
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/ext/kissfft/TIPS	Tue Jul 21 07:34:15 2015 +0100
@@ -0,0 +1,39 @@
+Speed:
+    * If you want to use multiple cores, then compile with -openmp or -fopenmp (see your compiler docs).
+    Realize that larger FFTs will reap more benefit than smaller FFTs. This generally uses more CPU time, but
+    less wall time.
+
+    * experiment with compiler flags
+     Special thanks to Oscar Lesta. He suggested some compiler flags
+     for gcc that make a big difference. They shave 10-15% off
+     execution time on some systems. Try some combination of:
+        -march=pentiumpro
+        -ffast-math
+        -fomit-frame-pointer
+
+    * If the input data has no imaginary component, use the kiss_fftr code under tools/.
+    Real ffts are roughly twice as fast as complex.
+
+    * If you can rearrange your code to do 4 FFTs in parallel and you are on a recent Intel or AMD machine,
+    then you might want to experiment with the USE_SIMD code. See README.simd
+
+
+Reducing code size:
+    * remove some of the butterflies. There are currently butterflies optimized for radices
+        2,3,4,5. It is worth mentioning that you can still use FFT sizes that contain
+        other factors, they just won't be quite as fast. You can decide for yourself
+        whether to keep radix 2 or 4. If you do some work in this area, let me
+        know what you find.
+
+    * For platforms where ROM/code space is more plentiful than RAM,
+     consider creating a hardcoded kiss_fft_state. In other words, decide which
+     FFT size(s) you want and make a structure with the correct factors and twiddles.
+
+    * Frank van der Hulst offered numerous suggestions for smaller code size and correct operation
+    on embedded targets. "I'm happy to help anyone who is trying to implement KISSFFT on a micro"
+
+    Some of these were rolled into the mainline code base:
+        - using long casts to promote intermediate results of short*short multiplication
+        - delaying allocation of buffers that are sometimes unused.
+    In some cases, it may be desirable to limit capability in order to better suit the target:
+        - predefining the twiddle tables for the desired fft size.
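
The "long casts" item in the TIPS text can be illustrated with a fixed-point multiply. This is a minimal sketch and not kissfft's own macro: `q15_mul` is a hypothetical name, and it assumes 16-bit samples in Q15 format (values scaled by 2^15). On a target where `int` is 16 bits, multiplying two shorts directly would overflow before the result is scaled back down, so one operand is cast to `long` first.

```c
/* Hypothetical helper, not part of kissfft: multiply two Q15 fixed-point
 * values (16-bit, scaled by 2^15) and return a rounded Q15 result. */
static short q15_mul(short a, short b)
{
    /* Cast one operand to long BEFORE multiplying. long is at least
     * 32 bits, so the product of two 16-bit values cannot overflow,
     * even on targets where int is only 16 bits. */
    long prod = (long)a * b;

    /* Add half an LSB for rounding, then shift back down to Q15.
     * Assumption: >> on a negative long is an arithmetic shift, which
     * is implementation-defined in C but true of common compilers. */
    return (short)((prod + (1L << 14)) >> 15);
}
```

For example, `q15_mul(16384, 16384)` returns 8192, which is 0.5 * 0.5 = 0.25 in Q15. Without the cast, the same expression on a 16-bit `int` target would be evaluated in 16 bits and the intermediate product 2^28 would not fit.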
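
Predefining the twiddle table for one fixed FFT size, as the last item suggests, might look like the following for a size of 8. This is a sketch under stated assumptions: `cpx`, `TWIDDLES_8`, `FIXED_NFFT` and `dft8` are hypothetical names rather than kissfft identifiers, and a naive O(N^2) DFT stands in for the FFT butterflies to keep the example short. The point is only that the table is a `const` array that a linker can place in ROM, so nothing has to be computed or allocated at startup.

```c
#include <stddef.h>

#define FIXED_NFFT 8            /* the one transform size this build supports */
#define K_SQRT1_2  0.70710678f  /* sqrt(2)/2 */

/* Hypothetical complex type for this sketch (kissfft defines its own). */
typedef struct { float r, i; } cpx;

/* Forward twiddles exp(-2*pi*i*k/8) for k = 0..7, written out by hand so
 * they can live in read-only memory instead of being built at run time. */
static const cpx TWIDDLES_8[FIXED_NFFT] = {
    { 1.0f,       0.0f     }, {  K_SQRT1_2, -K_SQRT1_2 },
    { 0.0f,      -1.0f     }, { -K_SQRT1_2, -K_SQRT1_2 },
    {-1.0f,       0.0f     }, { -K_SQRT1_2,  K_SQRT1_2 },
    { 0.0f,       1.0f     }, {  K_SQRT1_2,  K_SQRT1_2 }
};

/* Naive 8-point DFT that only reads the table. It is a stand-in for the
 * FFT butterflies, which would index the same precomputed values. */
static void dft8(const cpx *in, cpx *out)
{
    size_t k, n;
    for (k = 0; k < FIXED_NFFT; ++k) {
        float sr = 0.0f, si = 0.0f;
        for (n = 0; n < FIXED_NFFT; ++n) {
            const cpx w = TWIDDLES_8[(k * n) % FIXED_NFFT];
            /* complex multiply-accumulate: sum += in[n] * w */
            sr += in[n].r * w.r - in[n].i * w.i;
            si += in[n].r * w.i + in[n].i * w.r;
        }
        out[k].r = sr;
        out[k].i = si;
    }
}
```

The trade-off is the one the TIPS text names: the build is limited to the sizes whose tables are compiled in, in exchange for less RAM use and no setup code.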