comparison ext/kissfft/README.simd @ 184:76ec2365b250
Bring in kissfft into this repo (formerly a subrepo, but the remote is not responding)
| author | Chris Cannam |
|---|---|
| date | Tue, 21 Jul 2015 07:34:15 +0100 |
comparison
| 183:7ab3539e92e3 | 184:76ec2365b250 |
|---|---|
| 1 If you are reading this, it means you think you may be interested in using the SIMD extensions in kissfft | |
| 2 to do 4 *separate* FFTs at once. | |
| 3 | |
| 4 Beware! Beyond here there be dragons! | |
| 5 | |
| 6 This API is not easy to use, is not well documented, and breaks the KISS principle. | |
| 7 | |
| 8 | |
| 9 Still reading? Okay, you may get rewarded for your patience with a considerable speedup | |
| 10 (2-3x) on Intel x86 machines with SSE if you are willing to jump through some hoops. | |
| 11 | |
| 12 The basic idea is to use the packed 4-float __m128 data type as a scalar element. | |
| 13 This makes the data format fairly convoluted: each fft call performs 4 FFTs, one each on signals A, B, C and D. | |
| 14 | |
| 15 Complex data is interlaced as follows: | |
| 16 rA0,rB0,rC0,rD0, iA0,iB0,iC0,iD0, rA1,rB1,rC1,rD1, iA1,iB1,iC1,iD1 ... | |
| 17 where "rA0" is the real part of the zeroth sample of signal A and "iA0" is its imaginary part. | |
| 18 | |
| 19 Real-only data is laid out: | |
| 20 rA0,rB0,rC0,rD0, rA1,rB1,rC1,rD1, ... | |
| 21 | |
| 22 Compile with gcc flags along the lines of | |
| 23 -O3 -mpreferred-stack-boundary=4 -DUSE_SIMD=1 -msse | |
| 24 | |
| 25 Be aware of SIMD alignment: misaligned data is the most likely cause of segfaults. | |
| 26 The code within kissfft uses scratch variables on the stack. | |
| 27 With SIMD, these must have addresses on 16-byte boundaries. | |
| 28 Search on "SIMD alignment" for more info. | |
| 29 | |
| 30 | |
| 31 | |
| 32 Robin at Divide Concept was kind enough to share his code for converting data to and from the SIMD kissfft layout. | |
| 33 I have not run it -- use it at your own risk. It appears to do 4xN and Nx4 transpositions | |
| 34 (out of place). | |
| 35 | |
| 36 void SSETools::pack128(float* target, float* source, unsigned long size128) | |
| 37 { | |
| 38 __m128* pDest = (__m128*)target; | |
| 39 __m128* pDestEnd = pDest+size128; | |
| 40 float* source0=source; | |
| 41 float* source1=source0+size128; | |
| 42 float* source2=source1+size128; | |
| 43 float* source3=source2+size128; | |
| 44 | |
| 45 while(pDest<pDestEnd) | |
| 46 { | |
| 47 *pDest=_mm_set_ps(*source3,*source2,*source1,*source0); /* arguments run from lane 3 down to lane 0 */ | |
| 48 source0++; | |
| 49 source1++; | |
| 50 source2++; | |
| 51 source3++; | |
| 52 pDest++; | |
| 53 } | |
| 54 } | |
| 55 | |
| 56 void SSETools::unpack128(float* target, float* source, unsigned long size128) | |
| 57 { | |
| 58 | |
| 59 float* pSrc = source; | |
| 60 float* pSrcEnd = pSrc+size128*4; | |
| 61 float* target0=target; | |
| 62 float* target1=target0+size128; | |
| 63 float* target2=target1+size128; | |
| 64 float* target3=target2+size128; | |
| 65 | |
| 66 while(pSrc<pSrcEnd) | |
| 67 { | |
| 68 *target0=pSrc[0]; | |
| 69 *target1=pSrc[1]; | |
| 70 *target2=pSrc[2]; | |
| 71 *target3=pSrc[3]; | |
| 72 target0++; | |
| 73 target1++; | |
| 74 target2++; | |
| 75 target3++; | |
| 76 pSrc+=4; | |
| 77 } | |
| 78 } |
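
The interlaced complex layout described in the README can be sketched in plain C, without intrinsics. This is only an illustration under stated assumptions: the helper names `simd_index` and `pack4_complex` are hypothetical and not part of kissfft, the four signals are assumed to have equal length and to be stored as separate real and imaginary arrays, and C11 `aligned_alloc` is used to obtain the 16-byte alignment the README asks for.

```c
#include <stddef.h>
#include <stdlib.h>

enum { LANES = 4 };  /* signals A, B, C, D share one 4-float element */

/* Hypothetical helper: offset of one float in the interlaced complex layout.
 * sample = sample index, part = 0 for real / 1 for imaginary, lane = 0..3. */
static size_t simd_index(size_t sample, int part, int lane)
{
    return (sample * 2 + (size_t)part) * LANES + (size_t)lane;
}

/* Hypothetical helper: interlace four complex signals of n samples each.
 * re[lane] and im[lane] point to that signal's real and imaginary parts.
 * Returns a 16-byte aligned buffer of 8*n floats (release it with free()),
 * or NULL if n is zero or the allocation fails. */
static float *pack4_complex(const float *const re[LANES],
                            const float *const im[LANES], size_t n)
{
    float *buf;
    size_t s;
    int lane;

    if (n == 0)
        return NULL;

    /* 8*n floats is 32*n bytes, always a multiple of 16 as aligned_alloc requires */
    buf = aligned_alloc(16, n * 2 * LANES * sizeof(float));
    if (buf == NULL)
        return NULL;

    for (s = 0; s < n; ++s) {
        for (lane = 0; lane < LANES; ++lane) {
            buf[simd_index(s, 0, lane)] = re[lane][s];  /* rA, rB, rC, rD */
            buf[simd_index(s, 1, lane)] = im[lane][s];  /* iA, iB, iC, iD */
        }
    }
    return buf;
}
```

For n = 2 the buffer holds rA0,rB0,rC0,rD0, iA0,iB0,iC0,iD0, rA1,rB1,rC1,rD1, iA1,iB1,iC1,iD1 in that order. The real-only layout would follow the same pattern with the factor of 2 and the `part` term dropped, giving `sample * LANES + lane`.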
