If you are reading this, it means you think you may be interested in using the SIMD extensions in kissfft
to do 4 *separate* FFTs at once.

Beware! Beyond here there be dragons!

This API is not easy to use, is not well documented, and breaks the KISS principle.


Still reading? Okay, you may get rewarded for your patience with a considerable speedup
(2-3x) on Intel x86 machines with SSE if you are willing to jump through some hoops.

The basic idea is to use the packed 4-float __m128 data type as a scalar element.
This means that the format is pretty convoluted. It performs 4 FFTs per fft call on signals A,B,C,D.

For complex data, the data is interlaced as follows:
rA0,rB0,rC0,rD0,      iA0,iB0,iC0,iD0,   rA1,rB1,rC1,rD1, iA1,iB1,iC1,iD1 ...
where "rA0" is the real part of the zeroth sample for signal A

Real-only data is laid out:
rA0,rB0,rC0,rD0,     rA1,rB1,rC1,rD1,      ...

Compile with gcc flags something like
-O3 -mpreferred-stack-boundary=4  -DUSE_SIMD=1 -msse

Be aware of SIMD alignment.  This is the most likely cause of segfaults.
The code within kissfft uses scratch variables on the stack.
With SIMD, these must have addresses on 16-byte boundaries.
Search on "SIMD alignment" for more info.



Robin at Divide Concept was kind enough to share his code for formatting to/from the SIMD kissfft.
I have not run it -- use it at your own risk.  It appears to do 4xN and Nx4 transpositions
(out of place).

void SSETools::pack128(float* target, float* source, unsigned long size128)
{
   __m128* pDest = (__m128*)target;
   __m128* pDestEnd = pDest+size128;
   float* source0=source;
   float* source1=source0+size128;
   float* source2=source1+size128;
   float* source3=source2+size128;

   while(pDest