
C55 8K FFT

Team,

A customer of mine is looking at computing an efficient 8K FFT, ideally using the HWA on the C55xx. Yet the HWA seems limited to 1K FFTs.

Can we split the 8K FFT into smaller FFTs that can be processed by the HWA?

The customer's reference platform is an ARM11 @ 350 MHz. Do we have benchmarks for FFTs from 512 points to 8K, C55 vs. ARM11 (with and without the HWA)?

Thanks

  • Hi,

    Yes - the max for the HWAFFT is 1024 points, but the DSP core can be used to compute the extra Radix-2 stages.

    For example, perform bit reverse, compute 8 1k FFTs with the HWAFFT, then perform the final 3 Radix-2 stages entirely on the CPU (4 2k FFTs, then 2 4k FFTs, then 1 8k FFT).

    Refer to Section 8, "Computation of Large (Greater Than 1024-Point) FFTs", in http://www.ti.com/lit/an/sprabb6a/sprabb6a.pdf

    Performance improvements are possible if Radix-4 is used instead of Radix-2 for the final stages on the CPU.

    If the data is real only, then a 1024-pt complex FFT can be used to perform a 2048-pt real FFT.

    Keep in mind that the C55xx is a 16-bit fixed-point architecture, so if the FFT inputs use the full dynamic range, then the FFT results must be scaled. The HWAFFT supports a scale-by-two after every butterfly stage, resulting in a precision loss of 1/2 bit per stage (5 error bits for a 1024-pt FFT).
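A minimal sketch of that scale-by-two semantics (my helper, not the TI API): add, then halve, so two full-scale Q15 operands can never overflow the 16-bit result. The cost is the half bit of precision per stage mentioned above.

```c
#include <stdint.h>

/* Q15 add with the result scaled by 1/2, as the HWAFFT does after each
 * butterfly stage. The sum is exact in 32 bits, then halved back to 16. */
static int16_t q15_add_half(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a + (int32_t)b) >> 1);
}
```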

    Refer to Alan V. Oppenheim and Ronald W. Schafer. 2009. Discrete-Time Signal Processing (3rd ed.). Prentice Hall Press, Upper Saddle River, NJ, USA.

    Hope this helps,
    Mark

  • Hi,

    I have asked a similar question but it has not been answered yet. (http://e2e.ti.com/support/dsp/c5000/f/109/t/269223.aspx)

    We plan to use a 16384-point FFT and hope it can be performed with the support of the HW accelerator.

    But we have no idea about performance. 

    Do you have an estimate (in cycles) for a 16384-point FFT on the C5535?

    Best Regards.

  • Mark,

    The link doesn't seem to work for me.

    It sure helps, but can you guide me to benchmarks (even estimated at this stage)?

  • Try this link (the version changed from a to b): http://www.ti.com/litv/pdf/sprabb6b

    This document is also a chapter in C5535 TRM: http://www.ti.com/lit/ug/spruh87c/spruh87c.pdf

    Each hwafft_1024 call takes 5,244 cycles.

    I took some benchmarks with some unoptimized code I wrote some time ago. The CPLX_Mul, CPLX_Add, and CPLX_Subtract routines are blowing up the cycle counts because they are written as C function calls instead of optimized assembly.

    In the case of the 8k FFT, the 8 calls to hwafft_1024 account for just ~2% of the cycles consumed.

    ----

    Profile data for executing a 4096-pt FFT using the HWAFFT for the 1024-pt FFT computation and the CPU to complete the final 2 Radix-2 stages (Radix2FFT_2048 & Radix2FFT_4096).
    Built with -o2 & -ms (Optimization Level 2 (Function), Speed most critical).

    FFT_4096                        Cycles           @100 MHz   @120 MHz
      4096-pt Bit Reverse             8,221 cycles   0.082 ms   0.069 ms
      Split into 4x1024 arrays       35,839 cycles   0.358 ms   0.299 ms
      hwafft_1024                     5,242 cycles   0.052 ms   0.044 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms
      Radix2FFT_2048                147,541 cycles   1.475 ms   1.230 ms
      Radix2FFT_2048                147,541 cycles   1.475 ms   1.230 ms
      Radix2FFT_4096                294,996 cycles   2.950 ms   2.458 ms

      TOTAL                         655,112 cycles   6.551 ms   5.459 ms
       
    Further breakdown of cycles:

    Radix2FFT_2048
      CPLX_Mul*       52 cycles x 1024 iterations =  53,248 cycles   0.532 ms   0.444 ms
      CPLX_Add*       43 cycles x 1024 iterations =  44,032 cycles   0.440 ms   0.367 ms
      CPLX_Subtract*  46 cycles x 1024 iterations =  47,104 cycles   0.471 ms   0.393 ms

      TOTAL                          144,384 cycles + overhead       1.444 ms   1.203 ms

    Radix2FFT_4096
      CPLX_Mul*       52 cycles x 2048 iterations = 106,496 cycles   1.065 ms   0.887 ms
      CPLX_Add*       43 cycles x 2048 iterations =  88,064 cycles   0.881 ms   0.734 ms
      CPLX_Subtract*  46 cycles x 2048 iterations =  94,208 cycles   0.942 ms   0.785 ms

      TOTAL                          288,768 cycles + overhead       2.888 ms   2.406 ms

    *The CPLX_Mul, CPLX_Add, & CPLX_Subtract routines can be further optimized with intrinsics and hand-coded assembly.
    *Using a Radix-4 approach will also reduce cycles considerably.
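The 52-cycle CPLX_Mul figure is dominated by call overhead and pack/unpack work. As a portable illustration (same packed [16-bit Re | 16-bit Im] Q15 format as the source posted later in this thread; not TI-optimized assembly, and the name is mine), an inlined version of the same math:

```c
#include <stdint.h>

/* Q15 complex multiply on packed [Re|Im] operands: products are formed in
 * 32 bits, shifted >>15 back to Q15, and repacked. Marking it static inline
 * removes the call/return overhead; TI intrinsics or hand assembly on the
 * C55x dual-MAC would do better still. */
static inline int32_t cplx_mul_q15(int32_t op1, int32_t op2)
{
    int16_t a = (int16_t)(op1 >> 16), b = (int16_t)(op1 & 0xFFFF);
    int16_t c = (int16_t)(op2 >> 16), d = (int16_t)(op2 & 0xFFFF);
    int16_t re = (int16_t)(((int32_t)a * c - (int32_t)b * d) >> 15);
    int16_t im = (int16_t)(((int32_t)a * d + (int32_t)b * c) >> 15);
    return ((int32_t)re << 16) | ((int32_t)im & 0xFFFF);
}
```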

    ----

    Extrapolating these results to 8k FFT:


    Extrapolated profile data for executing an 8192-pt FFT using the HWAFFT for the 1024-pt FFT computation and the CPU to complete the final Radix-2 stages.
    Extrapolated from the 4096-pt profile data.

    FFT_8192                        Cycles           @100 MHz   @120 MHz   @150 MHz
      8192-pt Bit Reverse            16,442 cycles   0.164 ms   0.137 ms   0.110 ms
      Split into 8x1024 arrays       71,678 cycles   0.717 ms   0.597 ms   0.478 ms
      hwafft_1024                     5,242 cycles   0.052 ms   0.044 ms   0.035 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms   0.035 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms   0.035 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms   0.035 ms
      hwafft_1024                     5,242 cycles   0.052 ms   0.044 ms   0.035 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms   0.035 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms   0.035 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms   0.035 ms
      Radix2FFT_2048                147,541 cycles   1.475 ms   1.230 ms   0.984 ms
      Radix2FFT_2048                147,541 cycles   1.475 ms   1.230 ms   0.984 ms
      Radix2FFT_2048                147,541 cycles   1.475 ms   1.230 ms   0.984 ms
      Radix2FFT_2048                147,541 cycles   1.475 ms   1.230 ms   0.984 ms
      Radix2FFT_4096                294,996 cycles   2.950 ms   2.458 ms   1.967 ms
      Radix2FFT_4096                294,996 cycles   2.950 ms   2.458 ms   1.967 ms
      Radix2FFT_8192                589,992 cycles   5.900 ms   4.917 ms   3.933 ms

      TOTAL                       1,900,216 cycles  19.002 ms  15.835 ms  12.668 ms
    Further breakdown of cycles:

    Radix2FFT_2048
      CPLX_Mul*       52 cycles x 1024 iterations =  53,248 cycles
      CPLX_Add*       43 cycles x 1024 iterations =  44,032 cycles
      CPLX_Subtract*  46 cycles x 1024 iterations =  47,104 cycles

      TOTAL                          144,384 cycles + overhead

    Radix2FFT_4096
      CPLX_Mul*       52 cycles x 2048 iterations = 106,496 cycles
      CPLX_Add*       43 cycles x 2048 iterations =  88,064 cycles
      CPLX_Subtract*  46 cycles x 2048 iterations =  94,208 cycles

      TOTAL                          288,768 cycles + overhead

    *The CPLX_Mul, CPLX_Add, & CPLX_Subtract routines can be further optimized with intrinsics and hand-coded assembly.
    *Using a Radix-4 approach will also reduce cycles considerably.

    ----
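As a sanity check on the extrapolation, the per-item costs profiled above can be folded into a back-of-the-envelope cycle model (my assumptions and rounding, not official TI numbers):

```c
/* Rough cycle model for an N-point FFT (N a power of two >= 2048) built
 * from 1024-pt HWAFFT calls plus CPU radix-2 combine stages, using the
 * per-item costs from the profile data above. */
static long fft_cycles_estimate(long n)
{
    const long HWAFFT_1024_CYC = 5244;  /* per hwafft_1024 call (measured)   */
    const long BITREV_PER_PT   = 2;     /* ~8,221 cycles / 4096 pts          */
    const long SPLIT_PER_PT    = 9;     /* ~35,839 cycles / 4096 pts         */
    const long CYC_PER_BFLY    = 144;   /* ~CPLX_Mul + CPLX_Add + CPLX_Sub   */

    long cycles = (BITREV_PER_PT + SPLIT_PER_PT) * n
                + (n / 1024) * HWAFFT_1024_CYC;
    for (long s = 2048; s <= n; s *= 2) /* each combine level: n/2 bflys     */
        cycles += (n / 2) * CYC_PER_BFLY;
    return cycles;
}
```

The model lands within ~1% of both totals above (~656k cycles for 4096 points, ~1.90M for 8192) and would predict roughly 5M cycles for a 16384-pt FFT with this unoptimized code.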

    Hope this helps,
    Mark 

  • Attaching source for hwafft_4096 using HWAFFT and Radix-2 on CPU.

    These routines require optimization, but are functionally correct.

    // MM To do:
    // Place memory in cmd file
    // test program - graph output, comment out check refvec
    // create refvec for 5khz input
    // create testvecs and refvecs for all frequencies
    
    
    
    
    /*****************************************************************************/
    /*  File name   : test.c                                                     */
    /*  Description : 4096-pt FFT implementation: 4 1024-pt FFTs on the HWAFFT   */
    /*                plus an intermediate Radix-2 stage (2x1024 -> 1x2048) and  */
    /*                a final Radix-2 stage (2x2048 -> 1x4096)                   */
    /*****************************************************************************/
    
    #include "data_types.h"
    #include "hwafft.h"
    #include "test_cfg.h"
    
    // Define subtract_flag of CPLX_Add() Function
    #define ADD 		0
    #define SUBTRACT 	1
    
    // Static Memory Allocations:
    #pragma DATA_SECTION(Buffer_A, "Buffer_A");
    Int32 Buffer_A[DATA_LEN_4096] = 
    {
    	#include "invec_fft_4096pts_5khz.dat"
    };
    
    #pragma DATA_SECTION(Buffer_B, "Buffer_B");
    Int32 Buffer_B[TEST_DATA_LEN];
    
    // Function Prototypes:
    Int32 CPLX_Mul(Int32 op1, Int32 op2);
    Int32 CPLX_Add(Int32 op1, Int32 op2, Uint16 subtract_flag);
    Uint16 Radix2FFT(Int32 *In_Even, Int32 *In_Odd, Int32 *Out, Int32 *Twiddle, Uint16 Out_Len, Uint16 Twid_Len);
    static Uint16 chk_out(Int32 *data, Int32 *ref_data, Uint16 data_len);
    
    int main(void)
    {
        Int32 *data;
        Int32 *data_br;
        Int32 *data_1024_0, *data_1024_1, *data_1024_2, *data_1024_3;
    	Int32 *scratch_1024_0, *scratch_1024_1, *scratch_1024_2, *scratch_1024_3;
    	Int32 *data_2048_0, *data_2048_1;
    	Int32 *data_4096;
       	Int32 *twiddle;
       	Int32 *ref_data;
       	Uint16 fft_flag;
        Uint16 scale_flag;
        Uint16 out_sel0, out_sel1, out_sel2, out_sel3;
        Uint16 err = 1;
       	Uint16 k;
       		
       	// Assign pointers to static memory allocations
        data = Buffer_A;
    	data_br = Buffer_B;
    	data_1024_0 = &Buffer_A[0*TEST_DATA_LEN/4];
    	data_1024_1 = &Buffer_A[1*TEST_DATA_LEN/4];
    	data_1024_2 = &Buffer_A[2*TEST_DATA_LEN/4];
    	data_1024_3 = &Buffer_A[3*TEST_DATA_LEN/4];
    	scratch_1024_0 = &Buffer_B[0*TEST_DATA_LEN/4];
    	scratch_1024_1 = &Buffer_B[1*TEST_DATA_LEN/4];
    	scratch_1024_2 = &Buffer_B[2*TEST_DATA_LEN/4];
    	scratch_1024_3 = &Buffer_B[3*TEST_DATA_LEN/4];
    	data_2048_0 = &Buffer_A[0*TEST_DATA_LEN/2];	//COULD BE BUFFER_A OR BUFFER_B... ASSIGN LATER?
    	data_2048_1 = &Buffer_A[1*TEST_DATA_LEN/2];	//COULD BE BUFFER_A OR BUFFER_B... ASSIGN LATER?
    	data_4096 = Buffer_B;						//DEPENDS ON WHERE data_2048_0/1 ARE (ASSUMING OUTSEL = OUT_SEL_DATA)
       	twiddle = (Int32*)twiddle_buf;
        ref_data = (Int32*)refvec_fft_4096pts;	
    
    	// HWAFFT flags:
        fft_flag = FFT_FLAG;		// HWAFFT to perform FFT (not IFFT)
        scale_flag = SCALE_FLAG;	// HWAFFT to scale by 2 after each butterfly stage
          
    	// Bit-reverse input data for in-place DIT FFT calculation
        hwafft_br(data, data_br, DATA_LEN_4096);  
            
        // Split 4096-element data into 4 1024-element arrays (already bit-reversed)
        for(k=0; k<DATA_LEN_4096/4; k++)
        {
        	data_1024_0[k] = data_br[k+0*DATA_LEN_4096/4];		
        	data_1024_1[k] = data_br[k+1*DATA_LEN_4096/4];
    		data_1024_2[k] = data_br[k+2*DATA_LEN_4096/4];		
        	data_1024_3[k] = data_br[k+3*DATA_LEN_4096/4];
        }    	
          
        // 1024-pt FFT data_1024_0 on the FFT Hardware Accelerator
        out_sel0 = hwafft_1024pts(data_1024_0, scratch_1024_0, fft_flag, scale_flag); 
            
        // 1024-pt FFT data_1024_1 on the FFT Hardware Accelerator
        out_sel1 = hwafft_1024pts(data_1024_1, scratch_1024_1, fft_flag, scale_flag);
    	
    	// 1024-pt FFT data_1024_2 on the FFT Hardware Accelerator
        out_sel2 = hwafft_1024pts(data_1024_2, scratch_1024_2, fft_flag, scale_flag); 
        
        // 1024-pt FFT data_1024_3 on the FFT Hardware Accelerator
        out_sel3 = hwafft_1024pts(data_1024_3, scratch_1024_3, fft_flag, scale_flag);
    	
    /*	if((out_sel0 == out_sel1) && (out_sel1 == out_sel2) && (out_sel2 == out_sel3))
    	{
    		if(out_sel0 == OUT_SEL_SCRATCH)
    		{
    			//data_1024_0 = scratch_1024_0;
    			//data_1024_1 = scratch_1024_1;
    			//data_1024_2 = scratch_1024_2;
    			//data_1024_3 = scratch_1024_3;
    			
    		}
    		
    	}
    	else 
    	{
    		//uh oh - good data in both data and scratch...
    		while(1);
    	}*/
    	
    	Radix2FFT(&scratch_1024_0[0], &scratch_1024_1[0], &data_2048_0[0], &twiddle[0], 2048, 2048);
    	
    	Radix2FFT(&scratch_1024_2[0], &scratch_1024_3[0], &data_2048_1[0], &twiddle[0], 2048, 2048);
    	
    	Radix2FFT(&data_2048_0[0], &data_2048_1[0], &data_4096[0], &twiddle[0], 4096, 2048);
     
     	err = chk_out(data_4096, ref_data, DATA_LEN_4096); // check results for errors
     	// Set Breakpoint here to check err and graph results
     	// PASS: err = 0
     	// FAIL: err = 1
        return err;	
    }
    
    /************************************************************************************************* 
     * cplx_prod = CPLX_Mul(op1, op2):
     * Computes complex multiplication
     * Re{cplx_prod} = Re{op1}*Re{op2} - Im{op1}*Im{op2}
     * Im(cplx_prod) = Re{op1}*Im{op2} + Im{op1}*Re{op2}
     * cplx_prod = Re{cplx_prod},Im(cplx_prod) stored in one double word [16-bit Re, 16-bit Im]
     *************************************************************************************************/
    Int32 CPLX_Mul(Int32 op1, Int32 op2)
    {
    	Int16 op1_r, op1_i, op2_r, op2_i;
    	Int32 op1_op2_r, op1_op2_i;
    	Int16 op1_op2_r16, op1_op2_i16;
    	Int32 cplx_prod;
    
    	// Mask/shift data to extract Re and Im data = [Re,Im]
    	op1_r = op1 >> 16;
    	op1_i = op1 & 0x0000FFFF;
    	op2_r = op2 >> 16;
    	op2_i = op2 & 0x0000FFFF;
    	
    	// Multiply into 32-bits then calculate Re-Re and Im+Im
    	op1_op2_r	= (Int32)op1_r*op2_r - (Int32)op1_i*op2_i;
    	op1_op2_i	= (Int32)op1_r*op2_i + (Int32)op1_i*op2_r;
    	
    	// Repack Re and Im results into one double word
    	op1_op2_r16 = (Int16)(op1_op2_r >> 15);
    	op1_op2_i16 = (Int16)(op1_op2_i >> 15);
    
    	cplx_prod = (((Int32)op1_op2_r16 & 0x0000FFFF)<< 16);
    	cplx_prod = cplx_prod | ((Int32)op1_op2_i16 & 0x0000FFFF);
    	
    	return cplx_prod;
    }
    
    
    /*************************************************************************************************
     * cplx_sum = CPLX_Add(op1, op2, subtract_flag):
     * Computes complex addition/subtraction with divide by 2 scaling afterwards 
     * Re{cplx_sum} = Re{op1} +- Re{op2}
     * Im(cplx_sum) = Im{op1} +- Im{op2}
     * cplx_sum = Re{cplx_sum},Im(cplx_sum) stored in one double word [16-bit Re, 16-bit Im]
     * subtract_flag = 0: ADD, subtract_flag = 1: SUBTRACT
     *************************************************************************************************/
    Int32 CPLX_Add(Int32 op1, Int32 op2, Uint16 subtract_flag)	
    {
    	Int16 op1_r, op1_i, op2_r, op2_i;
    	Int32 op1_op2_r32, op1_op2_i32;
    	Int16 op1_op2_r16, op1_op2_i16;
    	Int32 cplx_sum;
    	
    	// Mask/shift data to extract Re and Im data = [Re,Im]
    	op1_r = op1 >> 16;
    	op1_i = op1 & 0x0000FFFF;
    	op2_r = op2 >> 16;
    	op2_i = op2 & 0x0000FFFF;
    
    	// If subtract_flag, Subtract op2 from op1 by negating op2 and adding it to op1
    	if(subtract_flag == 1)
    	{
    		op2_r = -1*op2_r;
    		op2_i = -1*op2_i;		
    	}
    	
    	// Perform addition and scale the sum by 1/2
    	op1_op2_r32 = ((Int32)op1_r + op2_r)>>1;		
    	op1_op2_i32 = ((Int32)op1_i + op2_i)>>1;		
    	
    	// Repack Re and Im results into one double word
    	op1_op2_r16 = (Int16)(op1_op2_r32 & 0x0000FFFF);
    	op1_op2_i16 = (Int16)(op1_op2_i32 & 0x0000FFFF);
    	
    	cplx_sum = (((Int32)op1_op2_r16 & 0x0000FFFF)<< 16);
    	cplx_sum = cplx_sum | ((Int32)op1_op2_i16 & 0x0000FFFF);
    	
    	return cplx_sum;
    }
    
    
    /*************************************************************************************************
     * err = chk_out(*data, *ref_data, data_len):
     * Compares output data with reference data 
     * Returns an error (err = 1) if any of the data and ref_data arrays differ by more than the Error_Threshold = 0.0001 
     * err = 1 if |data[idx] - ref_data[idx]| > Error_Threshold = 0.0001, for any array index, idx
     ************************************************************************************************/
    static Uint16 chk_out(Int32 *data, Int32 *ref_data, Uint16 data_len) 
    {
        // Error_Threshold = 0.0001, Convert 0.0001 to S16Q15 Fixed Point (0.0001 * 32768 = 3.2768 ~= 3)
        Int16 Error_Threshold = 3; 	
        
        Int16 data_r, data_i, ref_data_r, ref_data_i;
        Int16 err_r, err_i; 
        Uint16 idx;
        Uint16 err; 
      
        idx = 0;
        err = 0;
    
        while ((idx < data_len) && err == 0)
        {
        	// Mask/shift data to extract Re and Im data = [Re,Im]
        	data_r = data[idx] >> 16;
    		data_i = data[idx] & 0x0000FFFF;
    		ref_data_r = ref_data[idx] >> 16;
    		ref_data_i = ref_data[idx] & 0x0000FFFF;
    		
            // Calculate error 
            err_r = data_r - ref_data_r;
            err_i = data_i - ref_data_i;
            
            // Return an error if |error| > (Error_Threshold = 0.0001)
            if (abs(err_r) > Error_Threshold || abs(err_i) > Error_Threshold)
            {            
                err = 1;
            }
            idx++;
        }
        return err;
    }
    
    
    
     //  lengths:  2048  2048  4096  2048  4096  2048   twidStepSize = 1
     //  lengths:  1024  1024  2048  2048  2048  2048   twidStepSize = 2
     //  lengths:  1024  1024  2048  2048  2048  1024   twidStepSize = 1
    Uint16 Radix2FFT(Int32 *In_Even, Int32 *In_Odd, Int32 *Out, Int32 *Twiddle, Uint16 Out_Len, Uint16 Twid_Len)
    {
    	Int32 twiddle_data_odd;
    	Uint16 twidStepSize = Twid_Len / (Out_Len / 2);	//Check this with twiddle table - should be 1 for 4096, 2 for 2048
    	Uint16 k = 0;
    
    	for(k=0; k<Out_Len/2; k++)
    	{
    		// X(k) 	= 1/2*(X_even[k] + Twiddle[k]*X_odd(k))
    		// X(k+N/2) = 1/2*(X_even[k] - Twiddle[k]*X_odd(k))
    		
    		// Twiddle[k]*X_odd(k):
    		twiddle_data_odd = CPLX_Mul(Twiddle[k * twidStepSize], In_Odd[k]); 
    		
    		// X(k):
    		Out[k] = CPLX_Add(In_Even[k], twiddle_data_odd, ADD);
    		
    		// X(k+N/2):
    		Out[k+Out_Len/2] = CPLX_Add(In_Even[k], twiddle_data_odd, SUBTRACT);
    	}
    	return 0; //make void or return something useful...
    } 
    

    http://e2e.ti.com/cfs-file.ashx/__key/communityserver-discussions-components-files/109/5187.test.cmd

    1460.twid4096.asm

    Hope this helps,
    Mark 

  • Great, thanks.

    2 Mcycles... is it worth using the HWA at all? What would be your estimate for running the 8K FFT straight through DSPLIB?

    Cheers

  • It's worth using the HWAFFT.

    Using DSPLIB, a 1024-pt CFFT requires 25,954 cycles according to the CFFT - SCALE table in http://www.ti.com/lit/ug/spru422j/spru422j.pdf

    Whereas the HWAFFT requires 5,242 cycles to perform the same 1024-pt complex FFT.

    Either way, the larger FFT stages are computed on the CPU (inefficiently, with my unoptimized code) and require the same cycle counts for the 4x 2k, 2x 4k, and 1x 8k Radix-2 combine stages (the majority of the cycles).

    Replacing the HWAFFT with DSPLIB for the 1024-pt FFTs (8 x (25,954 - 5,242) cycles) would add 165,696 cycles to the 8k FFT.

    The cycles for the larger FFT stages executed on the CPU can be greatly reduced with optimization. I cannot estimate that cycle count, but it shouldn't take 40-50 cycles to compute a complex multiplication (4 multiplies & 3 additions) as it does in my code...

    Hope this helps,
    Mark 

  • "This document is also a chapter in C5535 TRM: www.ti.com/.../spruh87c.pdf"

    This link is down.

    Your table of FFT cycle counts and durations is impressive - is there any formal document in PDF format that I can keep as a reference? Thanks a lot.
  • Hi,

    We have revised the C5535 TRM a few times since rev C, so the link is now on rev H: http://www.ti.com/lit/ug/spruh87h/spruh87h.pdf

    You can find the same chapter as an independent appnote here: http://www.ti.com/lit/an/sprabb6b/sprabb6b.pdf

    Hope this helps,
    Mark

  • I have 4096 real points for which I need to compute the FFT. Will this code work? As I am new to DSP, I am facing some issues/doubts while building the project:

    1. I am confused between invec and refvec. What is the difference? Where am I supposed to enter my 4096 points?
    2. twiddle = (Int32*)twiddle_buf; -- am I supposed to use "1460.twid4096.asm"?

    Could you please share the entire project with some test values, which I can refer to and modify according to my needs.

    Thanks