Real FFT – Split part taking incredibly long

Robert Wolfe

Hello,

I’ve implemented the Real FFT method described here:

http://processors.wiki.ti.com/index.php/Efficient_FFT_Computation_of_Real_Input

and the results match the standard size Complex FFT, and are as expected. However, the timing used is way out of line for the Real FFT approach, and specifically for the FFT_Split operation. For instance, with a test of 128 samples, the 64 point FFT is 2 us, while the FFT_Split is 28 us!

Here’s my function calls:

DSPF_sp_fftSPxSP( ( FFT_SIZE >> 1 ), buffer_fft_in, twiddle, buffer_fft_tmp, bit_rev, fft_radix, FFT_OFFSET, ( FFT_SIZE >> 1 ) );

FFT_Split( ( FFT_SIZE >> 1 ), buffer_fft_tmp, a, b, buffer_fft_out );

and related memory locations are:

11812c00 _a

11812e00 _b

11813c30 _bit_rev

11813820 _buffer_fft_in

11813410 _buffer_fft_out

11813000 _buffer_fft_tmp

They seem properly aligned, per my use of the DATA_ALIGN pragma.

Can someone help provide some things I should start looking at, to explain the time of the split?

Thanks,

Robert

over 13 years ago

0 Robert Wolfe over 13 years ago

Mastermind 7860 points

As an update to this, I removed –g from the compiler options, and added –o3, and the timing is now:

FFT: 2us

FFT_Split: 6.5 us

But would it still be expected that the split takes more than 3 times the FFT itself?

Robert

0 Brad Griffis over 13 years ago in reply to Robert Wolfe

TI__Guru*** 125430 points

Can you share your FFT_Split code? That will make it easier for me to make a suggestion.

0 Robert Wolfe over 13 years ago in reply to Brad Griffis

Mastermind 7860 points

Hi Brad,

I’ve included the FFT_Split() code below, which is essentially the same as the example C code at the Real FFT Wiki page, tidied up a bit.

Since my last post, though, things seem to have fallen in line. For a 512 point FFT, I now see:

FFT: 8.1 us

FFT_Split: 5.2 us

which combined, gives a 35% improvement over the standard Complex FFT approach.

Now the hog is the FFT magnitude/power spectrum routine ;), which probably suffers from the sqrt() call. For that I’ve pulled in the C67 fast library sqrtsp() version. But with the 512 point FFT, only doing ½ (given symmetry), the call takes 50.6 us. If that becomes a problem later, we could always remove the sqrt (squared valued results will still be proportionate to each other, etc). But if you see anything other than that which could be optimized, in my fft_power() routine included below, please let me know.

Thanks,

Robert

----------------------------------------- FFT Split -----------------------------------------------------------------

bool_t FFT_Split( uint16_t n, float *restrict pIn, float *pATable, float *pBTable, float *pOut )

{

_nassert ( (int)pIn % 8 == 0 );

_nassert ( (int)pOut % 8 == 0 );

_nassert ( (int)pATable % 8 == 0 );

_nassert ( (int)pBTable % 8 == 0 );

uint16_t n_x_2, i_x_2, n_x_4, i;

float Tr, Ti;

n_x_2 = ( 2 * n );

n_x_4 = ( 4 * n );

pIn[ n_x_2 ] = pIn[0];

pIn[ n_x_2 + 1 ] = pIn[1];

#pragma UNROLL(2)

for ( i = 0; i < n; i++ )

{

i_x_2 = ( 2 * i );

Tr = (

( pIn[ i_x_2 ] * pATable[ i_x_2 ] ) -

( pIn[ i_x_2 + 1 ] * pATable[ i_x_2 + 1 ] ) +

( pIn[ n_x_2 - i_x_2 ] * pBTable[ i_x_2 ] ) +

( pIn[ n_x_2 - i_x_2 + 1 ] * pBTable[ i_x_2 + 1 ] )

);

Ti = (

( pIn[ i_x_2 + 1 ] * pATable[ i_x_2 ] ) +

( pIn[ i_x_2 ] * pATable[ i_x_2 + 1 ] ) +

( pIn[ n_x_2 - i_x_2 ] * pBTable[ i_x_2 + 1 ] ) -

( pIn[ n_x_2 - i_x_2 + 1 ] * pBTable[ i_x_2 ] )

);

pOut[ i_x_2 ] = Tr;

pOut[ i_x_2 + 1 ] = Ti;

// use complex conjugate symmetry properties to get the rest

//----------------------------------------------------------

pOut[ n_x_4 - i_x_2 ] = Tr;

pOut[ n_x_4 - i_x_2 + 1 ] = -Ti;

}

pOut[ n_x_2 ] = ( pIn[0] - pIn[1] );

pOut[ n_x_2 + 1 ] = 0;

return FALSE;

}

----------------------------------------- FFT magnitude ---------------------------------------------------------------

bool_t fft_power( float *fft_in, uint16_t n )

{

uint16_t i, j;

float tmp_f2, tmp_f1, tmp_f0;

_nassert ( (int)fft_in % 8 == 0 );

fft_in[0] = fabs( fft_in[0] );

#pragma UNROLL(2)

for ( i = 1; i <= n; i++ )

{

j = ( 2 * i );

tmp_f0 = fft_in[ j ];

tmp_f1 = fft_in[ j + 1 ];

tmp_f2 = ( ( tmp_f0 * tmp_f0 ) + ( tmp_f1 * tmp_f1 ) );

fft_in[ i ] = sqrtsp( tmp_f2 );

}

return FALSE;

}

0 Brad Griffis over 13 years ago in reply to Robert Wolfe

TI__Guru*** 125430 points

Do you see any further gain by making the other pointers "restrict" too? (I'm assuming they meet the requirements for "restrict", i.e. they are non-overlapped in what they point to.)

For the power number I would definitely kill the square root. For example, here are a couple typical usages of the power spectrum:

Finding peak power. In this case you will still end up with the same result regardless of whether you have taken the square root of everything.
Power thresholding, i.e. looking for where power is greater than some predefined threshold. In this case you just square your threshold number so that everything is apples to apples, i.e. you make everything a comparison of "power squared". If the threshold is known at build time you can do the squaring ahead of time, i.e. no CPU cycles!

0 Robert Wolfe over 13 years ago in reply to Brad Griffis

Mastermind 7860 points

Brad,

There’s wasn’t much of a savings with the additional “restricts”, maybe 0.1 us.

10-4 on removing the sqrt() for power! I was half doing it as a geewhiz, to easily compare the spectrum magnitude initially, against the known input signal.

Thanks,

Robert

Processors

Processors forum

Real FFT – Split part taking incredibly long