qm-dsp: ext/kissfft/README.simd annotate

annotate ext/kissfft/README.simd @ 515:08bcc06c38ec tip master

Remove fast-math

author	Chris Cannam <cannam@all-day-breakfast.com>
date	Tue, 28 Jan 2020 15:27:37 +0000
parents	1f1999b0f577
children

rev	line source
c@409	1 If you are reading this, it means you think you may be interested in using the SIMD extensions in kissfft
c@409	2 to do 4 separate FFTs at once.
c@409	3
c@409	4 Beware! Beyond here there be dragons!
c@409	5
c@409	6 This API is not easy to use, is not well documented, and breaks the KISS principle.
c@409	7
c@409	8
c@409	9 Still reading? Okay, you may get rewarded for your patience with a considerable speedup
c@409	10 (2-3x) on intel x86 machines with SSE if you are willing to jump through some hoops.
c@409	11
c@409	12 The basic idea is to use the packed 4 float __m128 data type as a scalar element.
c@409	13 This means that the format is pretty convoluted. It performs 4 FFTs per fft call on signals A,B,C,D.
c@409	14
c@409	15 For complex data, the data is interlaced as follows:
c@409	16 rA0,rB0,rC0,rD0, iA0,iB0,iC0,iD0, rA1,rB1,rC1,rD1, iA1,iB1,iC1,iD1 ...
c@409	17 where "rA0" is the real part of the zeroth sample for signal A
c@409	18
c@409	19 Real-only data is laid out:
c@409	20 rA0,rB0,rC0,rD0, rA1,rB1,rC1,rD1, ...
c@409	21
c@409	22 Compile with gcc flags something like
c@409	23 -O3 -mpreferred-stack-boundary=4 -DUSE_SIMD=1 -msse
c@409	24
c@409	25 Be aware of SIMD alignment. This is the most likely cause of segfaults.
c@409	26 The code within kissfft uses scratch variables on the stack.
c@409	27 With SIMD, these must have addresses on 16 byte boundaries.
c@409	28 Search on "SIMD alignment" for more info.
c@409	29
c@409	30
c@409	31
c@409	32 Robin at Divide Concept was kind enough to share his code for formatting to/from the SIMD kissfft.
c@409	33 I have not run it -- use it at your own risk. It appears to do 4xN and Nx4 transpositions
c@409	34 (out of place).
c@409	35
c@409	36 void SSETools::pack128(float* target, float* source, unsigned long size128)
c@409	37 {
c@409	38 __m128* pDest = (__m128*)target;
c@409	39 __m128* pDestEnd = pDest+size128;
c@409	40 float* source0=source;
c@409	41 float* source1=source0+size128;
c@409	42 float* source2=source1+size128;
c@409	43 float* source3=source2+size128;
c@409	44
c@409	45 while(pDest<pDestEnd)
c@409	46 {
c@409	47 pDest=_mm_set_ps(source3,source2,source1,*source0);
c@409	48 source0++;
c@409	49 source1++;
c@409	50 source2++;
c@409	51 source3++;
c@409	52 pDest++;
c@409	53 }
c@409	54 }
c@409	55
c@409	56 void SSETools::unpack128(float* target, float* source, unsigned long size128)
c@409	57 {
c@409	58
c@409	59 float* pSrc = source;
c@409	60 float* pSrcEnd = pSrc+size128*4;
c@409	61 float* target0=target;
c@409	62 float* target1=target0+size128;
c@409	63 float* target2=target1+size128;
c@409	64 float* target3=target2+size128;
c@409	65
c@409	66 while(pSrc<pSrcEnd)
c@409	67 {
c@409	68 *target0=pSrc[0];
c@409	69 *target1=pSrc[1];
c@409	70 *target2=pSrc[2];
c@409	71 *target3=pSrc[3];
c@409	72 target0++;
c@409	73 target1++;
c@409	74 target2++;
c@409	75 target3++;
c@409	76 pSrc+=4;
c@409	77 }
c@409	78 }

Mercurial > hg > qm-dsp

annotate ext/kissfft/README.simd @ 515:08bcc06c38ec tip master