This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

Real FFT – Split part taking incredibly long

 

Hello,

 

I’ve implemented the Real FFT method described here:

 

http://processors.wiki.ti.com/index.php/Efficient_FFT_Computation_of_Real_Input

 

and the results match the standard size Complex FFT, and are as expected.  However, the timing used is way out of line for the Real FFT approach, and specifically for the FFT_Split operation.  For instance, with a test of 128 samples, the 64 point FFT is 2 us, while the FFT_Split is 28 us!  

 

Here’s my function calls:

 

        DSPF_sp_fftSPxSP( ( FFT_SIZE >> 1 ), buffer_fft_in, twiddle, buffer_fft_tmp, bit_rev, fft_radix, FFT_OFFSET, ( FFT_SIZE >> 1 ) );

 

        FFT_Split( ( FFT_SIZE >> 1 ), buffer_fft_tmp, a, b, buffer_fft_out );

 

and related memory locations are:

 

11812c00   _a

11812e00   _b

11813c30   _bit_rev

11813820   _buffer_fft_in

11813410   _buffer_fft_out

11813000   _buffer_fft_tmp

 

They seem properly aligned, per my use of the DATA_ALIGN pragma.

 

Can someone help provide some things I should start looking at, to explain the time of the split?

 

Thanks,

Robert

 

 

  • As an update to this, I removed –g from the compiler options, and added –o3, and the timing is now:

     

                FFT:                 2us

                FFT_Split:         6.5 us

     

    But would it still be expected that the split takes more than 3 times the FFT itself?

     

    Robert

  • Can you share your FFT_Split code?  That will make it easier for me to make a suggestion. 

  • Hi Brad,

     

    I’ve included the FFT_Split() code below, which is essentially the same as the example C code at the Real FFT Wiki page, tidied up a bit.  

     

    Since my last post, though, things seem to have fallen in line.  For a 512 point FFT, I now see:

     

                FFT:                 8.1 us

                FFT_Split:         5.2 us

     

    which combined, gives a 35% improvement over the standard Complex FFT approach.

     

    Now the hog is the FFT magnitude/power spectrum routine ;), which probably suffers from the sqrt() call.  For that I’ve pulled in the C67 fast library sqrtsp() version.  But with the 512 point FFT, only doing ½ (given symmetry), the call takes 50.6 us.  If that becomes a problem later, we could always remove the sqrt (squared valued results will still be proportionate to each other, etc).  But if you see anything other than that which could be optimized, in my fft_power() routine included below, please let me know.

     

    Thanks,

    Robert

     

    ----------------------------------------- FFT Split -----------------------------------------------------------------

     

    bool_t FFT_Split( uint16_t n, float *restrict pIn, float *pATable, float *pBTable, float *pOut )

    {

        _nassert ( (int)pIn % 8 == 0 );

     

        _nassert ( (int)pOut % 8 == 0 );

     

        _nassert ( (int)pATable % 8 == 0 );

     

        _nassert ( (int)pBTable % 8 == 0 );

     

        uint16_t   n_x_2, i_x_2, n_x_4,  i;

     

        float Tr, Ti;

     

        n_x_2 = ( 2 * n );

     

        n_x_4 = ( 4 * n );

     

        pIn[ n_x_2 ] = pIn[0];

     

        pIn[ n_x_2 + 1 ] = pIn[1];

     

    #pragma UNROLL(2)

        for ( i = 0; i < n; i++ )

        {

            i_x_2 = ( 2 * i );

     

            Tr = (

                        ( pIn[ i_x_2 ] * pATable[ i_x_2 ] ) -

     

                        ( pIn[ i_x_2 + 1 ] * pATable[ i_x_2 + 1 ] ) +

     

                        ( pIn[ n_x_2 - i_x_2 ] * pBTable[ i_x_2 ] ) +

     

                        ( pIn[ n_x_2 - i_x_2 + 1 ] * pBTable[ i_x_2 + 1 ] )

                );

     

            Ti = (

                        ( pIn[ i_x_2 + 1 ] * pATable[ i_x_2 ] ) +

     

                        ( pIn[ i_x_2 ] * pATable[ i_x_2 + 1 ] ) +

     

                        ( pIn[ n_x_2 - i_x_2 ] * pBTable[ i_x_2 + 1 ] ) -

     

                        ( pIn[ n_x_2 - i_x_2 + 1 ] * pBTable[ i_x_2 ] )

                );

     

            pOut[ i_x_2 ] = Tr;

     

            pOut[ i_x_2 + 1 ] = Ti;

     

            // use complex conjugate symmetry properties to get the rest

            //----------------------------------------------------------

     

            pOut[ n_x_4 - i_x_2 ] = Tr;

     

            pOut[ n_x_4 - i_x_2 + 1 ] = -Ti;

     

        }

     

        pOut[ n_x_2 ] = ( pIn[0] - pIn[1] );

     

        pOut[ n_x_2 + 1 ] = 0;

     

        return FALSE;

    }

     

     

    ----------------------------------------- FFT magnitude ---------------------------------------------------------------

     

    bool_t fft_power( float *fft_in, uint16_t n )

    {

        uint16_t  i, j;

     

        float tmp_f2, tmp_f1, tmp_f0;

     

        _nassert ( (int)fft_in % 8 == 0 );

     

        fft_in[0] = fabs( fft_in[0] );

     

    #pragma UNROLL(2)

        for ( i = 1; i <= n; i++ )

        {

            j = ( 2 * i );

     

            tmp_f0 = fft_in[ j ];

            tmp_f1 = fft_in[ j + 1 ];

     

            tmp_f2 = ( ( tmp_f0 * tmp_f0 ) + ( tmp_f1 * tmp_f1 ) );

     

            fft_in[ i ] = sqrtsp( tmp_f2 );

        }

     

        return FALSE;

    }

  • Do you see any further gain by making the other pointers "restrict" too?  (I'm assuming they meet the requirements for "restrict", i.e. they are non-overlapped in what they point to.)

    For the power number I would definitely kill the square root.  For example, here are a couple typical usages of the power spectrum:

    1. Finding peak power.  In this case you will still end up with the same result regardless of whether you have taken the square root of everything.
    2. Power thresholding, i.e. looking for where power is greater than some predefined threshold.  In this case you just square your threshold number so that everything is apples to apples, i.e. you make everything a comparison of "power squared".  If the threshold is known at build time you can do the squaring ahead of time, i.e. no CPU cycles!
  • Brad,

     

    There’s wasn’t much of a savings with the additional “restricts”, maybe 0.1 us.  

     

    10-4 on removing the sqrt() for power!  I was half doing it as a geewhiz, to easily compare the spectrum magnitude initially, against the known input signal.  

     

    Thanks,

    Robert