FFT implementation using C6678 intrinsics

Arunmoezhi Ramachandran

Other Parts Discussed in Thread: BIOSLINUXMCSDK

I'm trying to use the complex multiply intrinsic to see if it will improve the performance of a simple FFT.

Here is the base code without any intrinsics:

void base_fft(float* data, int nn, int isign)
{
	/* data holds nn complex points as interleaved (re, im) floats */
	int n, mmax, m, j, istep, i;
	float wr, wi, tempr, tempi;

	n = nn << 1;                  /* number of floats in data */
	fSintable = (float*)&SINTAB;  /* SINTAB defined in std_rtaf.h as volatile int* */
	mmax = 2;
	while (n > mmax)              /* one pass per FFT stage */
	{
		istep = 2*mmax;
		for (m = 0; m < mmax; m += 2)
		{
			/* twiddle factor for this group, read from the sine table */
			wi = fSintable[((m)/2 * nn/mmax)];
			wr = isign * fSintable[nn/4 + ((m)/2 * nn/mmax)];
			for (i = m; i < n; i += istep)
			{
				/* butterfly on the complex pair at i and j */
				j = i + mmax;
				tempr = wr*data[j] - wi*data[j+1];
				tempi = wr*data[j+1] + wi*data[j];
				data[j] = data[i] - tempr;
				data[j+1] = data[i+1] - tempi;
				data[i] += tempr;
				data[i+1] += tempi;
			}
		}
		mmax = istep;
	}
}

For a 1024-point FFT this takes 221 us.

Here is the code using intrinsics:

void intrin_fft(float* data, int nn, int isign)
{
	int n, mmax, m, j, istep, i;
	float wri[2];                 /* twiddle stored as {wr, wi} */
	double wric, dric, tempri;

	n = nn << 1;
	fSintable = (float*)&SINTAB;  /* SINTAB defined in std_rtaf.h as volatile int* */
	mmax = 2;
	while (n > mmax)
	{
		istep = 2*mmax;
		for (m = 0; m < mmax; m += 2)
		{
			wri[0] = isign * fSintable[nn/4 + ((m)/2 * nn/mmax)];
			wri[1] = fSintable[((m)/2 * nn/mmax)];
			/* _amemd8 is an aligned access: wri and data must be 8-byte aligned */
			wric = _amemd8((void*)&wri[0]);
			for (i = m; i < n; i += istep)
			{
				j = i + mmax;
				dric = _amemd8((void*)&data[j]);
				tempri = _complex_mpysp(wric, dric);
				/* -_hif(tempri) takes the place of tempr, _lof(tempri) of tempi */
				data[j] = data[i] - (-_hif(tempri));
				data[j+1] = data[i+1] - (_lof(tempri));
				data[i] += -_hif(tempri);
				data[i+1] += _lof(tempri);
			}
		}
		mmax = istep;
	}
}

This takes 223 us.
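
To check the packing and sign handling of the intrinsic version on a host PC, the complex multiply can be modeled in portable C. This is only a sketch under stated assumptions: `f2_model`, `load_pair`, `complex_mpysp_model` and `base_product` are hypothetical stand-ins, not TI types or APIs. The model assumes a little-endian load (the lower address lands in the low half) and that the multiply treats the high half as the real part, which is what the `-_hif(tempri)` fix-up in the code above implies.

```c
/* Hypothetical host-side stand-in for a packed float pair (not a TI type). */
typedef struct { float hi; float lo; } f2_model;

/* Assumed little-endian 8-byte load: p[0] goes to the low half,
 * p[1] goes to the high half. */
f2_model load_pair(const float *p)
{
    f2_model v;
    v.lo = p[0];
    v.hi = p[1];
    return v;
}

/* Assumed model of the complex multiply:
 * high half = real part, low half = imaginary part. */
f2_model complex_mpysp_model(f2_model a, f2_model b)
{
    f2_model r;
    r.hi = a.hi * b.hi - a.lo * b.lo;
    r.lo = a.hi * b.lo + a.lo * b.hi;
    return r;
}

/* Twiddle product exactly as written in base_fft, with d[0] = data[j]
 * and d[1] = data[j+1]. */
void base_product(float wr, float wi, const float *d, float *tempr, float *tempi)
{
    *tempr = wr * d[0] - wi * d[1];
    *tempi = wr * d[1] + wi * d[0];
}
```

Under these assumptions, with `wri = {wr, wi}` and `d = {data[j], data[j+1]}`, the result `t = complex_mpysp_model(load_pair(wri), load_pair(d))` satisfies `-t.hi == tempr` and `t.lo == tempi`, so the two versions compute the same butterfly and differ only in which instructions carry the multiply.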

I would like to know where I'm going wrong. I expected the intrinsics to give a boost.

Thanks,

Arun

over 11 years ago

ran35366 over 11 years ago

TI__Genius 12805 points

Without getting into the specific code, here is what you should do:

1. Tell CCS to keep the assembly code that the compiler generates.

2. Look at the assembly code and see how many cycles the loop takes.

3. The assembly will tell you why the compiler did not find a faster solution.

4. Possible reasons include register pressure, or the compiler already using better instructions on its own.

I have attached a short presentation about optimization that we give as part of the KeyStone workshop. Slide 24 shows how to set the option that keeps the assembly, the last item on slide 51 covers register pressure, and the last slide lists more resources about optimization.
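
One thing the software pipeline information in the generated assembly can show is a loop-carried dependency bound, which can arise when the compiler cannot prove that the stores to `data[j]` never overlap later loads of `data[i]`. Whether that is the limiting factor here is not confirmed in this thread. As a hedged sketch of how the inner loop could be restated so the compiler has that information, the hypothetical helper `butterfly_pass` below (not part of the original code) uses C99 `restrict` and loads all operands into locals before storing:

```c
/* One pass of butterflies that share a single twiddle factor (wr, wi).
 * lo points at data[m], hi at data[m + mmax], and stride is istep.
 * restrict promises that lo and hi never touch the same floats, which
 * holds for the access pattern of base_fft because hi = lo + mmax and
 * stride = 2 * mmax, with mmax >= 2. */
void butterfly_pass(float *restrict lo, float *restrict hi,
                    float wr, float wi, int count, int stride)
{
    int k;
    for (k = 0; k < count; k++) {
        /* Read every operand before any store is issued. */
        float ar = lo[0], ai = lo[1];
        float br = hi[0], bi = hi[1];
        float tempr = wr * br - wi * bi;
        float tempi = wr * bi + wi * br;

        hi[0] = ar - tempr;
        hi[1] = ai - tempi;
        lo[0] = ar + tempr;
        lo[1] = ai + tempi;

        lo += stride;
        hi += stride;
    }
}
```

In `base_fft` the inner `i` loop would then become `butterfly_pass(&data[m], &data[m + mmax], wr, wi, n / istep, istep);`, since that loop runs `n / istep` times for every `m`. The arithmetic is unchanged, so any difference would show up only in the schedule reported in the kept assembly.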

Regards

Ran

Ran6136.KeyStone Optimization.pptx

one and zero over 11 years ago in reply to ran35366

TI__Mastermind 18256 points

Hi,

You might want to take a look at DSPLIB, which is included in the MCSDK:

http://www.ti.com/tool/bioslinuxmcsdk

The TI C6000 DSPLIB is an optimized DSP function library for C programmers. It includes many C-callable, optimized, general-purpose signal-processing routines. These routines are typically used in computationally intensive real-time applications where optimal execution speed is critical. By using these routines, you can achieve execution speeds considerably faster than equivalent code written in standard ANSI C. It contains algorithms in the following domains:

Adaptive filtering
Correlation
Fast Fourier Transform
Filtering and convolution
Math
Matrix

The Fast Fourier Transform routines are:

DSPF_dp_fftDPXDP (double precision floating point)
DSPF_dp_ifftDPXDP
DSPF_sp_bitrev_cplx
DSPF_sp_fftSPXSP (single precision floating point)
DSPF_sp_ifftSPXSP
DSP_fft16x16
DSP_fft16x16r
DSP_fft16x16_imre
DSP_fft16x32
DSP_fft32x32
DSP_fft32x32s
DSP_ifft16x16
DSP_ifft16x16_imre
DSP_ifft16x32
DSP_ifft32x32

Kind regards,

one and zero

Arunmoezhi Ramachandran over 11 years ago in reply to ran35366

Prodigy 50 points

Data from the .asm files:

For the version with intrinsics, the core loop does:
1 CMPYSP, 1 DADDSP, 2 FADDSP, 2 FSUBSP, 4 ADD and 9 MV
For the version without intrinsics, the core loop does:
4 MPYSP, 3 FADDSP, 3 FSUBSP, 5 ADD and 7 MV

The version with intrinsics uses 19 ops while the no-intrinsics version uses 22 ops, but both take about the same time.
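
To put the op counts next to the measured times, it helps to work out how long one pass through the core loop takes on average. The sketch below walks the same loop structure as `base_fft` without touching any data and divides the total time by the number of butterflies. `butterfly_count` and `ns_per_butterfly` are hypothetical helper names, and the average ignores call overhead and the outer-loop work, so treat it as a rough estimate only.

```c
/* Count the inner-loop (butterfly) iterations of base_fft for an
 * nn-point transform by replaying its three nested loops. */
long butterfly_count(int nn)
{
    long count = 0;
    int n = nn << 1;
    int mmax = 2;
    int m, i, istep;

    while (n > mmax) {
        istep = 2 * mmax;
        for (m = 0; m < mmax; m += 2)
            for (i = m; i < n; i += istep)
                count++;
        mmax = istep;
    }
    return count;
}

/* Average time per butterfly in nanoseconds, given the total
 * transform time in microseconds. */
double ns_per_butterfly(double total_us, int nn)
{
    return total_us * 1000.0 / (double)butterfly_count(nn);
}
```

For `nn = 1024` the count is 5120 butterflies (512 per stage over 10 stages), so the 221 us and 223 us reported earlier in the thread both work out to roughly 43 ns per butterfly. If the core runs at about 1 GHz (an assumption, since the thread does not state the clock), that is on the order of 43 cycles per iteration, far more than 19 or 22 ops would need if they were spread over the eight C66x functional units. That would suggest the loop time is dominated by something other than the arithmetic op count, which fits with both versions timing the same.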
