Problems optimizing an FFT based function

Remco Poelstra

Hi,

I've to do an FFT of a buffer which contains mostly zero's. In order to save computation time I tried implementing a decomposed FFT which would ignore all zero's. It based on the paper 'Efficient computation of the DFT with Only a Subset of Input or Output Points' by Sorensen en Burrus.

I implemented their algorithm, but instead of less computations with less input points, I need more computations!

I followed all recommendations in the 'TMS320C674x DSP Cache User's Guide' and 'Introduction to TMS320C6000 DSP Optimization'.

Analysis shows that all code and data (even output buffers) reside in L1 (after the first run of course).

My first question is:

Why does DSPF_sp_fftSPxSP (from DSPLIB 3_1_1_1) require about twice more cycles than the formula cycle count? I expect zero stalls with everything in L1.

My second question is: Is there a way that I can optimize my code further to make it more efficient? The calculations that my function does is actually quite limited, but seems to cost almost more time than calling the FFT functions.

Condensed code:

void litDFT_complex_vectors(complex_t * restrict x, complex_t * restrict X) {

int J,k;

_nassert((int) x % 8 == 0);

_nassert((int) X % 8 == 0);

for (J=1;J<Q;J++) {

x_k1[0]=x[0];

complex_t w=W_td[J];

complex_t v=w;
for (k=1;k<P;k++) {

x_k1[k].real=w.real*x[k].real+w.imag*x[k].imag;

x_k1[k].imag=w.real*x[k].imag-w.imag*x[k].real;

//Recursively update trig functions

w=cmul(w,v);

}

DSPF_sp_fftSPxSP(P,(float *)x_k1,(float *)W,(float *)X_k1,brev,2,0,P);
for (k=0;k<P;k++) {

X[Q*k+J]=X_k1[k];

}

DSPF_sp_fftSPxSP(P,(float *)x,(float *)W,(float *)X_k1,brev,2,0,P);

for (k=0;k<P;k++) {

X[Q*k]=X_k1[k];

}

P and Q are constant defines. cmul is an inline function that multiplies two complex numbers (much like the two lines above that call, just the sign differs).

Some measurement results:

4096 point FFT using DSP_sp_fftSPxSP 126452 cycles

128 point FFT using DSP_sp_fftSPxSP 2579 cycles

4096 fft with only 128 non-zero values using litDFT_complex_vectors 116419 (P=128,Q=32)

4096 fft with only 32 non-zero values using litDFT_complex_vectors 202691 (P=32,Q=128)

Any insights would be highly appreciated.

Kind regards,

Remco Poelstra

over 12 years ago

Processors

Processors forum

Problems optimizing an FFT based function