annotate constant-q-cpp/src/ext/kissfft/README.simd @ 372:af71cbdab621 tip

Update bqvec code
author Chris Cannam
date Tue, 19 Nov 2019 10:13:32 +0000
parents 5d0a2ebb4d17
If you are reading this, you may be interested in using the SIMD extensions in kissfft
to do 4 *separate* FFTs at once.

Beware! Beyond here there be dragons!

This API is not easy to use, is not well documented, and breaks the KISS principle.

Still reading? Okay, you may be rewarded for your patience with a considerable speedup
(2-3x) on Intel x86 machines with SSE if you are willing to jump through some hoops.

The basic idea is to use the packed 4-float __m128 data type as the scalar element.
This makes the data format pretty convoluted: each fft call performs 4 FFTs, on signals A, B, C and D.

For complex data, the samples are interleaved as follows:
rA0,rB0,rC0,rD0, iA0,iB0,iC0,iD0, rA1,rB1,rC1,rD1, iA1,iB1,iC1,iD1 ...
where "rA0" is the real part of the zeroth sample of signal A and "iA0" is its imaginary part.

Real-only data is laid out:
rA0,rB0,rC0,rD0, rA1,rB1,rC1,rD1, ...
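The two layouts can be sketched as plain index arithmetic. This is only an illustration: the helper names (complexRealIndex, complexImagIndex, realIndex, interleaveReal) are hypothetical and not part of kissfft, and std::vector is used for brevity even though it does not promise the 16-byte alignment that the SIMD build needs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers, not part of kissfft.
// `lane` is 0..3 for signals A..D, `sample` is the sample number.

// Complex layout: each sample takes 8 floats (4 real parts, then 4 imaginary parts).
inline std::size_t complexRealIndex(std::size_t sample, std::size_t lane) {
    return sample * 8 + lane;
}
inline std::size_t complexImagIndex(std::size_t sample, std::size_t lane) {
    return sample * 8 + 4 + lane;
}

// Real-only layout: each sample takes 4 floats, one per signal.
inline std::size_t realIndex(std::size_t sample, std::size_t lane) {
    return sample * 4 + lane;
}

// Interleave four real signals of equal length into the real-only layout.
inline std::vector<float> interleaveReal(const std::vector<float>& a,
                                         const std::vector<float>& b,
                                         const std::vector<float>& c,
                                         const std::vector<float>& d) {
    const std::size_t n = a.size();
    assert(b.size() == n && c.size() == n && d.size() == n);
    std::vector<float> out(4 * n);
    for (std::size_t i = 0; i < n; ++i) {
        out[realIndex(i, 0)] = a[i];
        out[realIndex(i, 1)] = b[i];
        out[realIndex(i, 2)] = c[i];
        out[realIndex(i, 3)] = d[i];
    }
    return out;
}
```

For example, complexRealIndex(1, 1) is the position of rB1 and complexImagIndex(1, 3) is the position of iD1 in the complex listing above.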
Compile with gcc flags something like
-O3 -mpreferred-stack-boundary=4 -DUSE_SIMD=1 -msse
(-mpreferred-stack-boundary=4 asks gcc to keep the stack aligned to 2^4 = 16 bytes.)

Be aware of SIMD alignment. This is the most likely cause of segfaults.
The code within kissfft uses scratch variables on the stack.
With SIMD, these must have addresses on 16-byte boundaries.
Search on "SIMD alignment" for more info.
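A small sketch of how one might check and request 16-byte alignment in your own buffers, assuming a C++11 or later compiler. The names isAligned16, Quad and stackBufferIsAligned are hypothetical and not part of kissfft; the sketch uses alignas rather than any SSE header so that it stays portable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if p sits on a 16-byte boundary, which is what an aligned
// __m128 load or store expects.
inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical stand-in for one packed element: four floats forced
// onto a 16-byte boundary with alignas.
struct alignas(16) Quad {
    float lane[4];
};
static_assert(alignof(Quad) == 16, "Quad must be 16-byte aligned");
static_assert(sizeof(Quad) == 16, "Quad must hold exactly four floats");

// A stack buffer declared with alignas(16) can be checked the same way.
inline bool stackBufferIsAligned() {
    alignas(16) float scratch[8] = {};
    return isAligned16(scratch);
}
```

For heap buffers, compilers that ship the SSE headers usually also provide _mm_malloc(bytes, 16) and _mm_free. Calling isAligned16 on the pointers you hand to the FFT is a cheap first check when chasing a segfault.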
Robin at Divide Concept was kind enough to share his code for formatting to/from the SIMD kissfft.
I have not run it -- use it at your own risk. pack128 appears to transpose four contiguous
signals of N floats each (4xN) into N packed __m128 values (Nx4), and unpack128 appears to do
the reverse. Both work out of place. The SSETools declaration is not included here.

#include <xmmintrin.h>  // __m128, _mm_set_ps

// source holds 4 signals of size128 floats each, stored one after another.
// target receives size128 packed __m128 values and must be 16-byte aligned.
void SSETools::pack128(float* target, float* source, unsigned long size128)
{
    __m128* pDest = (__m128*)target;
    __m128* pDestEnd = pDest+size128;
    float* source0=source;
    float* source1=source0+size128;
    float* source2=source1+size128;
    float* source3=source2+size128;

    while(pDest<pDestEnd)
    {
        // _mm_set_ps takes its arguments from the highest lane down,
        // so *source0 ends up as the first float in memory.
        *pDest=_mm_set_ps(*source3,*source2,*source1,*source0);
        source0++;
        source1++;
        source2++;
        source3++;
        pDest++;
    }
}

// Inverse of pack128: source holds size128 packed groups of 4 floats,
// target receives the 4 signals stored one after another.
void SSETools::unpack128(float* target, float* source, unsigned long size128)
{
    float* pSrc = source;
    float* pSrcEnd = pSrc+size128*4;
    float* target0=target;
    float* target1=target0+size128;
    float* target2=target1+size128;
    float* target3=target2+size128;

    while(pSrc<pSrcEnd)
    {
        // Scatter one packed group of 4 floats back to the 4 signals.
        *target0=pSrc[0];
        *target1=pSrc[1];
        *target2=pSrc[2];
        *target3=pSrc[3];
        target0++;
        target1++;
        target2++;
        target3++;
        pSrc+=4;
    }
}
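
Since the code above is untested here, a scalar reference for the same two transpositions may help when checking it. This is a sketch under the assumption that the routines behave as described, with four contiguous signals in and sample-interleaved groups of 4 out. packReference and unpackReference are hypothetical names, and no SSE is used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for the 4xN -> Nx4 transposition.
// source holds 4 signals of n floats each, stored one after another.
inline std::vector<float> packReference(const std::vector<float>& source,
                                        std::size_t n) {
    assert(source.size() == 4 * n);
    std::vector<float> packed(4 * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t lane = 0; lane < 4; ++lane)
            packed[4 * i + lane] = source[lane * n + i];
    return packed;
}

// Scalar reference for the inverse Nx4 -> 4xN transposition.
inline std::vector<float> unpackReference(const std::vector<float>& packed,
                                          std::size_t n) {
    assert(packed.size() == 4 * n);
    std::vector<float> unpacked(4 * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t lane = 0; lane < 4; ++lane)
            unpacked[lane * n + i] = packed[4 * i + lane];
    return unpacked;
}
```

Packing and then unpacking should give back the original buffer. Comparing the output of pack128 against packReference on a small input is one way to confirm the lane order before trusting the SIMD path.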