c@409
|
1 If you are reading this, it means you think you may be interested in using the SIMD extensions in kissfft
|
c@409
|
2 to do 4 *separate* FFTs at once.
|
c@409
|
3
|
c@409
|
4 Beware! Beyond here there be dragons!
|
c@409
|
5
|
c@409
|
6 This API is not easy to use, is not well documented, and breaks the KISS principle.
|
c@409
|
7
|
c@409
|
8
|
c@409
|
9 Still reading? Okay, you may get rewarded for your patience with a considerable speedup
|
c@409
|
10 (2-3x) on intel x86 machines with SSE if you are willing to jump through some hoops.
|
c@409
|
11
|
c@409
|
12 The basic idea is to use the packed 4 float __m128 data type as a scalar element.
|
c@409
|
13 This means that the format is pretty convoluted. It performs 4 FFTs per fft call on signals A,B,C,D.
|
c@409
|
14
|
c@409
|
15 For complex data, the data is interlaced as follows:
|
c@409
|
16 rA0,rB0,rC0,rD0, iA0,iB0,iC0,iD0, rA1,rB1,rC1,rD1, iA1,iB1,iC1,iD1 ...
|
c@409
|
17 where "rA0" is the real part of the zeroth sample for signal A
|
c@409
|
18
|
c@409
|
19 Real-only data is laid out:
|
c@409
|
20 rA0,rB0,rC0,rD0, rA1,rB1,rC1,rD1, ...
|
c@409
|
21
|
c@409
|
22 Compile with gcc flags something like
|
c@409
|
23 -O3 -mpreferred-stack-boundary=4 -DUSE_SIMD=1 -msse
|
c@409
|
24
|
c@409
|
25 Be aware of SIMD alignment. This is the most likely cause of segfaults.
|
c@409
|
26 The code within kissfft uses scratch variables on the stack.
|
c@409
|
27 With SIMD, these must have addresses on 16 byte boundaries.
|
c@409
|
28 Search on "SIMD alignment" for more info.
|
c@409
|
29
|
c@409
|
30
|
c@409
|
31
|
c@409
|
32 Robin at Divide Concept was kind enough to share his code for formatting to/from the SIMD kissfft.
|
c@409
|
33 I have not run it -- use it at your own risk. It appears to do 4xN and Nx4 transpositions
|
c@409
|
34 (out of place).
|
c@409
|
35
|
c@409
|
36 void SSETools::pack128(float* target, float* source, unsigned long size128)
|
c@409
|
37 {
|
c@409
|
38 __m128* pDest = (__m128*)target;
|
c@409
|
39 __m128* pDestEnd = pDest+size128;
|
c@409
|
40 float* source0=source;
|
c@409
|
41 float* source1=source0+size128;
|
c@409
|
42 float* source2=source1+size128;
|
c@409
|
43 float* source3=source2+size128;
|
c@409
|
44
|
c@409
|
45 while(pDest<pDestEnd)
|
c@409
|
46 {
|
c@409
|
47 *pDest=_mm_set_ps(*source3,*source2,*source1,*source0);
|
c@409
|
48 source0++;
|
c@409
|
49 source1++;
|
c@409
|
50 source2++;
|
c@409
|
51 source3++;
|
c@409
|
52 pDest++;
|
c@409
|
53 }
|
c@409
|
54 }
|
c@409
|
55
|
c@409
|
56 void SSETools::unpack128(float* target, float* source, unsigned long size128)
|
c@409
|
57 {
|
c@409
|
58
|
c@409
|
59 float* pSrc = source;
|
c@409
|
60 float* pSrcEnd = pSrc+size128*4;
|
c@409
|
61 float* target0=target;
|
c@409
|
62 float* target1=target0+size128;
|
c@409
|
63 float* target2=target1+size128;
|
c@409
|
64 float* target3=target2+size128;
|
c@409
|
65
|
c@409
|
66 while(pSrc<pSrcEnd)
|
c@409
|
67 {
|
c@409
|
68 *target0=pSrc[0];
|
c@409
|
69 *target1=pSrc[1];
|
c@409
|
70 *target2=pSrc[2];
|
c@409
|
71 *target3=pSrc[3];
|
c@409
|
72 target0++;
|
c@409
|
73 target1++;
|
c@409
|
74 target2++;
|
c@409
|
75 target3++;
|
c@409
|
76 pSrc+=4;
|
c@409
|
77 }
|
c@409
|
78 }
|