
C55 8K FFT

Team,

A customer of mine is looking at computing an efficient 8K FFT, ideally using the HWA on the C55xx. Yet the HWA seems limited to 1K FFTs.

Can we split the 8K FFT into smaller FFTs that can be processed by the HWA?

The customer's reference platform is an ARM11 @ 350 MHz. Do we have benchmarks for FFTs from 512 points to 8K, C55 vs. ARM11 (with and without the HWA)?

Thanks

  • Hi,

    Yes - the max for the HWAFFT is 1024 points, but the DSP core can be used to compute the extra Radix-2 stages.

    For example, perform bit reverse, compute 8 1k FFTs with the HWAFFT, then perform the final 3 Radix-2 stages entirely on the CPU (4 2k FFTs, then 2 4k FFTs, then 1 8k FFT).

    Refer to Section 8, "Computation of Large (Greater Than 1024-Point) FFTs", in http://www.ti.com/lit/an/sprabb6a/sprabb6a.pdf

    Performance improvements are possible if Radix-4 is used instead of Radix-2 for the final stages on the CPU.

    If the data is real only, then a 1024-pt complex FFT can be used to perform a 2048-pt real FFT.

    Keep in mind that the C55xx is a 16-bit fixed-point architecture, so if the FFT inputs use the full dynamic range, then the FFT results must be scaled. The HWAFFT supports a scale-by-two after every butterfly stage, resulting in a precision loss of 1/2 bit per stage (5 error bits for a 1024-pt FFT).
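A minimal sketch of that scale-by-two semantics (my helper, not the TI API): add, then halve, so two full-scale Q15 operands can never overflow the 16-bit result. The cost is the half bit of precision per stage mentioned above.

```c
#include <stdint.h>

/* Q15 add with the result scaled by 1/2, as the HWAFFT does after each
 * butterfly stage. The sum is exact in 32 bits, then halved back to 16. */
static int16_t q15_add_half(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a + (int32_t)b) >> 1);
}
```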

    Refer to Alan V. Oppenheim and Ronald W. Schafer. 2009. Discrete-Time Signal Processing (3rd ed.). Prentice Hall Press, Upper Saddle River, NJ, USA.

    Hope this helps,
    Mark

  • Hi,

    I have asked a similar question but it has not been answered yet. (http://e2e.ti.com/support/dsp/c5000/f/109/t/269223.aspx)

    We plan to use a 16384-point FFT and hope it can be performed with the support of the HW accelerator.

    But we have no idea about performance. 

    Do you have an estimate (in cycles) for a 16384-point FFT on the C5535?

    Best Regards.

  • Mark,

    The link doesn't seem to work for me.

    It sure helps, but can you guide me to benchmarks (even estimated at this stage)?

  • Try this link (the version changed from a to b): http://www.ti.com/litv/pdf/sprabb6b

    This document is also a chapter in C5535 TRM: http://www.ti.com/lit/ug/spruh87c/spruh87c.pdf

    Each hwafft_1024 call takes 5,244 cycles.

    I took some benchmarks with some unoptimized code I wrote some time ago. The CPLX_Mul, CPLX_Add, and CPLX_Subtract routines are blowing up the cycle counts because they are written as C function calls instead of optimized assembly.

    In the case of the 8k FFT, the 8 calls to hwafft_1024 account for just ~2% of the cycles consumed.

    ----

    Profile data for executing a 4096-pt FFT using the HWAFFT for the 1024-pt FFT computation and the CPU to complete the final 2 Radix-2 stages (Radix2FFT_2048 & Radix2FFT_4096).
    Built with -o2 & -ms (Optimization Level 2 (Function), Speed most critical).

    FFT_4096                        Cycles           @100 MHz   @120 MHz
      4096-pt Bit Reverse             8,221 cycles   0.082 ms   0.069 ms
      Split into 4x1024 arrays       35,839 cycles   0.358 ms   0.299 ms
      hwafft_1024                     5,242 cycles   0.052 ms   0.044 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms
      Radix2FFT_2048                147,541 cycles   1.475 ms   1.230 ms
      Radix2FFT_2048                147,541 cycles   1.475 ms   1.230 ms
      Radix2FFT_4096                294,996 cycles   2.950 ms   2.458 ms

      TOTAL                         655,112 cycles   6.551 ms   5.459 ms
       
    Further breakdown of cycles:

    Radix2FFT_2048
      CPLX_Mul*       52 cycles x 1024 iterations =  53,248 cycles   0.532 ms   0.444 ms
      CPLX_Add*       43 cycles x 1024 iterations =  44,032 cycles   0.440 ms   0.367 ms
      CPLX_Subtract*  46 cycles x 1024 iterations =  47,104 cycles   0.471 ms   0.393 ms

      TOTAL                          144,384 cycles + overhead       1.444 ms   1.203 ms

    Radix2FFT_4096
      CPLX_Mul*       52 cycles x 2048 iterations = 106,496 cycles   1.065 ms   0.887 ms
      CPLX_Add*       43 cycles x 2048 iterations =  88,064 cycles   0.881 ms   0.734 ms
      CPLX_Subtract*  46 cycles x 2048 iterations =  94,208 cycles   0.942 ms   0.785 ms

      TOTAL                          288,768 cycles + overhead       2.888 ms   2.406 ms

    *The CPLX_Mul, CPLX_Add, & CPLX_Subtract routines can be further optimized with intrinsics and hand-coded assembly.
    *Using a Radix-4 approach will also reduce cycles considerably.
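The 52-cycle CPLX_Mul figure is dominated by call overhead and pack/unpack work. As a portable illustration (same packed [16-bit Re | 16-bit Im] Q15 format as the source posted later in this thread; not TI-optimized assembly, and the name is mine), an inlined version of the same math:

```c
#include <stdint.h>

/* Q15 complex multiply on packed [Re|Im] operands: products are formed in
 * 32 bits, shifted >>15 back to Q15, and repacked. Marking it static inline
 * removes the call/return overhead; TI intrinsics or hand assembly on the
 * C55x dual-MAC would do better still. */
static inline int32_t cplx_mul_q15(int32_t op1, int32_t op2)
{
    int16_t a = (int16_t)(op1 >> 16), b = (int16_t)(op1 & 0xFFFF);
    int16_t c = (int16_t)(op2 >> 16), d = (int16_t)(op2 & 0xFFFF);
    int16_t re = (int16_t)(((int32_t)a * c - (int32_t)b * d) >> 15);
    int16_t im = (int16_t)(((int32_t)a * d + (int32_t)b * c) >> 15);
    return ((int32_t)re << 16) | ((int32_t)im & 0xFFFF);
}
```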

    ----

    Extrapolating these results to 8k FFT:


    Extrapolated profile data for executing an 8192-pt FFT using the HWAFFT for the 1024-pt FFT computation and the CPU to complete the final Radix-2 stages.
    Extrapolated from the 4096-pt profile data.

    FFT_8192                        Cycles           @100 MHz   @120 MHz   @150 MHz
      8192-pt Bit Reverse            16,442 cycles   0.164 ms   0.137 ms   0.110 ms
      Split into 8x1024 arrays       71,678 cycles   0.717 ms   0.597 ms   0.478 ms
      hwafft_1024                     5,242 cycles   0.052 ms   0.044 ms   0.035 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms   0.035 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms   0.035 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms   0.035 ms
      hwafft_1024                     5,242 cycles   0.052 ms   0.044 ms   0.035 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms   0.035 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms   0.035 ms
      hwafft_1024                     5,244 cycles   0.052 ms   0.044 ms   0.035 ms
      Radix2FFT_2048                147,541 cycles   1.475 ms   1.230 ms   0.984 ms
      Radix2FFT_2048                147,541 cycles   1.475 ms   1.230 ms   0.984 ms
      Radix2FFT_2048                147,541 cycles   1.475 ms   1.230 ms   0.984 ms
      Radix2FFT_2048                147,541 cycles   1.475 ms   1.230 ms   0.984 ms
      Radix2FFT_4096                294,996 cycles   2.950 ms   2.458 ms   1.967 ms
      Radix2FFT_4096                294,996 cycles   2.950 ms   2.458 ms   1.967 ms
      Radix2FFT_8192                589,992 cycles   5.900 ms   4.917 ms   3.933 ms

      TOTAL                       1,900,216 cycles  19.002 ms  15.835 ms  12.668 ms
    Further breakdown of cycles:

    Radix2FFT_2048
      CPLX_Mul*       52 cycles x 1024 iterations =  53,248 cycles
      CPLX_Add*       43 cycles x 1024 iterations =  44,032 cycles
      CPLX_Subtract*  46 cycles x 1024 iterations =  47,104 cycles

      TOTAL                          144,384 cycles + overhead

    Radix2FFT_4096
      CPLX_Mul*       52 cycles x 2048 iterations = 106,496 cycles
      CPLX_Add*       43 cycles x 2048 iterations =  88,064 cycles
      CPLX_Subtract*  46 cycles x 2048 iterations =  94,208 cycles

      TOTAL                          288,768 cycles + overhead

    *The CPLX_Mul, CPLX_Add, & CPLX_Subtract routines can be further optimized with intrinsics and hand-coded assembly.
    *Using a Radix-4 approach will also reduce cycles considerably.

    ----
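As a sanity check on the extrapolation, the per-item costs profiled above can be folded into a back-of-the-envelope cycle model (my assumptions and rounding, not official TI numbers):

```c
/* Rough cycle model for an N-point FFT (N a power of two >= 2048) built
 * from 1024-pt HWAFFT calls plus CPU radix-2 combine stages, using the
 * per-item costs from the profile data above. */
static long fft_cycles_estimate(long n)
{
    const long HWAFFT_1024_CYC = 5244;  /* per hwafft_1024 call (measured)   */
    const long BITREV_PER_PT   = 2;     /* ~8,221 cycles / 4096 pts          */
    const long SPLIT_PER_PT    = 9;     /* ~35,839 cycles / 4096 pts         */
    const long CYC_PER_BFLY    = 144;   /* ~CPLX_Mul + CPLX_Add + CPLX_Sub   */

    long cycles = (BITREV_PER_PT + SPLIT_PER_PT) * n
                + (n / 1024) * HWAFFT_1024_CYC;
    for (long s = 2048; s <= n; s *= 2) /* each combine level: n/2 bflys     */
        cycles += (n / 2) * CYC_PER_BFLY;
    return cycles;
}
```

The model lands within ~1% of both totals above (~656k cycles for 4096 points, ~1.90M for 8192) and would predict roughly 5M cycles for a 16384-pt FFT with this unoptimized code.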

    Hope this helps,
    Mark 

  • Attaching source for hwafft_4096 using HWAFFT and Radix-2 on CPU.

    These routines require optimization, but are functionally correct.

    // MM To do:
    // Place memory in cmd file
    // test program - graph output, comment out check refvec
    // create refvec for 5khz input
    // create testvecs and refvecs for all frequencies
    
    
    
    
    /*****************************************************************************/
    /*  File name   : test.c                                                     */
    /*  Description : 4096-pt FFT implementation: 4 1024-pt FFTs on the HWAFFT   */
    /*                plus an intermediate Radix-2 stage (2x1024 -> 1x2048) and  */
    /*                a final Radix-2 stage (2x2048 -> 1x4096)                   */
    /*****************************************************************************/
    
    #include "data_types.h"
    #include "hwafft.h"
    #include "test_cfg.h"
    
    // Define subtract_flag of CPLX_Add() Function
    #define ADD 		0
    #define SUBTRACT 	1
    
    // Static Memory Allocations:
    #pragma DATA_SECTION(Buffer_A, "Buffer_A");
    Int32 Buffer_A[DATA_LEN_4096] = 
    {
    	#include "invec_fft_4096pts_5khz.dat"
    };
    
    #pragma DATA_SECTION(Buffer_B, "Buffer_B");
    Int32 Buffer_B[TEST_DATA_LEN];
    
    // Function Prototypes:
    Int32 CPLX_Mul(Int32 op1, Int32 op2);
    Int32 CPLX_Add(Int32 op1, Int32 op2, Uint16 subtract_flag);
    Uint16 Radix2FFT(Int32 *In_Even, Int32 *In_Odd, Int32 *Out, Int32 *Twiddle, Uint16 Out_Len, Uint16 Twid_Len);
    static Uint16 chk_out(Int32 *data, Int32 *ref_data, Uint16 data_len);
    
    int main(void)
    {
        Int32 *data;
        Int32 *data_br;
        Int32 *data_1024_0, *data_1024_1, *data_1024_2, *data_1024_3;
    	Int32 *scratch_1024_0, *scratch_1024_1, *scratch_1024_2, *scratch_1024_3;
    	Int32 *data_2048_0, *data_2048_1;
    	Int32 *data_4096;
       	Int32 *twiddle;
       	Int32 *ref_data;
       	Uint16 fft_flag;
        Uint16 scale_flag;
        Uint16 out_sel0, out_sel1, out_sel2, out_sel3;
        Uint16 err = 1;
       	Uint16 k;
       		
       	// Assign pointers to static memory allocations
        data = Buffer_A;
    	data_br = Buffer_B;
    	data_1024_0 = &Buffer_A[0*TEST_DATA_LEN/4];
    	data_1024_1 = &Buffer_A[1*TEST_DATA_LEN/4];
    	data_1024_2 = &Buffer_A[2*TEST_DATA_LEN/4];
    	data_1024_3 = &Buffer_A[3*TEST_DATA_LEN/4];
    	scratch_1024_0 = &Buffer_B[0*TEST_DATA_LEN/4];
    	scratch_1024_1 = &Buffer_B[1*TEST_DATA_LEN/4];
    	scratch_1024_2 = &Buffer_B[2*TEST_DATA_LEN/4];
    	scratch_1024_3 = &Buffer_B[3*TEST_DATA_LEN/4];
    	data_2048_0 = &Buffer_A[0*TEST_DATA_LEN/2];	//COULD BE BUFFER_A OR BUFFER_B... ASSIGN LATER?
    	data_2048_1 = &Buffer_A[1*TEST_DATA_LEN/2];	//COULD BE BUFFER_A OR BUFFER_B... ASSIGN LATER?
    	data_4096 = Buffer_B;						//DEPENDS ON WHERE data_2048_0/1 ARE (ASSUMING OUTSEL = OUT_SEL_DATA)
       	twiddle = (Int32*)twiddle_buf;
        ref_data = (Int32*)refvec_fft_4096pts;	
    
    	// HWAFFT flags:
        fft_flag = FFT_FLAG;		// HWAFFT to perform FFT (not IFFT)
        scale_flag = SCALE_FLAG;	// HWAFFT to scale by 2 after each butterfly stage
          
    	// Bit-reverse input data for in-place DIT FFT calculation
        hwafft_br(data, data_br, DATA_LEN_4096);  
            
        // Split 4096-element data into 4 1024-element arrays (already bit-reversed)
        for(k=0; k<DATA_LEN_4096/4; k++)
        {
        	data_1024_0[k] = data_br[k+0*DATA_LEN_4096/4];		
        	data_1024_1[k] = data_br[k+1*DATA_LEN_4096/4];
    		data_1024_2[k] = data_br[k+2*DATA_LEN_4096/4];		
        	data_1024_3[k] = data_br[k+3*DATA_LEN_4096/4];
        }    	
          
        // 1024-pt FFT data_1024_0 on the FFT Hardware Accelerator
        out_sel0 = hwafft_1024pts(data_1024_0, scratch_1024_0, fft_flag, scale_flag); 
            
        // 1024-pt FFT data_1024_1 on the FFT Hardware Accelerator
        out_sel1 = hwafft_1024pts(data_1024_1, scratch_1024_1, fft_flag, scale_flag);
    	
    	// 1024-pt FFT data_1024_2 on the FFT Hardware Accelerator
        out_sel2 = hwafft_1024pts(data_1024_2, scratch_1024_2, fft_flag, scale_flag); 
        
        // 1024-pt FFT data_1024_3 on the FFT Hardware Accelerator
        out_sel3 = hwafft_1024pts(data_1024_3, scratch_1024_3, fft_flag, scale_flag);
    	
    /*	if((out_sel0 == out_sel1) && (out_sel1 == out_sel2) && (out_sel2 == out_sel3))
    	{
    		if(out_sel0 == OUT_SEL_SCRATCH)
    		{
    			//data_1024_0 = scratch_1024_0;
    			//data_1024_1 = scratch_1024_1;
    			//data_1024_2 = scratch_1024_2;
    			//data_1024_3 = scratch_1024_3;
    			
    		}
    		
    	}
    	else 
    	{
    		//uh oh - good data in both data and scratch...
    		while(1);
    	}*/
    	
    	Radix2FFT(&scratch_1024_0[0], &scratch_1024_1[0], &data_2048_0[0], &twiddle[0], 2048, 2048);
    	
    	Radix2FFT(&scratch_1024_2[0], &scratch_1024_3[0], &data_2048_1[0], &twiddle[0], 2048, 2048);
    	
    	Radix2FFT(&data_2048_0[0], &data_2048_1[0], &data_4096[0], &twiddle[0], 4096, 2048);
     
     	err = chk_out(data_4096, ref_data, DATA_LEN_4096); // check results for errors
     	// Set Breakpoint here to check err and graph results
     	// PASS: err = 0
     	// FAIL: err = 1
        return err;	
    }
    
    /************************************************************************************************* 
     * cplx_prod = CPLX_Mul(op1, op2):
     * Computes complex multiplication
     * Re{cplx_prod} = Re{op1}*Re{op2} - Im{op1}*Im{op2}
     * Im(cplx_prod) = Re{op1}*Im{op2} + Im{op1}*Re{op2}
     * cplx_prod = Re{cplx_prod},Im(cplx_prod) stored in one double word [16-bit Re, 16-bit Im]
     *************************************************************************************************/
    Int32 CPLX_Mul(Int32 op1, Int32 op2)
    {
    	Int16 op1_r, op1_i, op2_r, op2_i;
    	Int32 op1_op2_r, op1_op2_i;
    	Int16 op1_op2_r16, op1_op2_i16;
    	Int32 cplx_prod;
    
    	// Mask/shift data to extract Re and Im data = [Re,Im]
    	op1_r = op1 >> 16;
    	op1_i = op1 & 0x0000FFFF;
    	op2_r = op2 >> 16;
    	op2_i = op2 & 0x0000FFFF;
    	
    	// Multiply into 32-bits then calculate Re-Re and Im+Im
    	op1_op2_r	= (Int32)op1_r*op2_r - (Int32)op1_i*op2_i;
    	op1_op2_i	= (Int32)op1_r*op2_i + (Int32)op1_i*op2_r;
    	
    	// Repack Re and Im results into one double word
    	op1_op2_r16 = (Int16)(op1_op2_r >> 15);
    	op1_op2_i16 = (Int16)(op1_op2_i >> 15);
    
    	cplx_prod = (((Int32)op1_op2_r16 & 0x0000FFFF)<< 16);
    	cplx_prod = cplx_prod | ((Int32)op1_op2_i16 & 0x0000FFFF);
    	
    	return cplx_prod;
    }
    
    
    /*************************************************************************************************
     * cplx_sum = CPLX_Add(op1, op2, subtract_flag):
     * Computes complex addition/subtraction with divide by 2 scaling afterwards 
     * Re{cplx_sum} = Re{op1} +- Re{op2}
     * Im(cplx_sum) = Im{op1} +- Im{op2}
     * cplx_sum = Re{cplx_sum},Im(cplx_sum) stored in one double word [16-bit Re, 16-bit Im]
     * subtract_flag = 0: ADD, subtract_flag = 1: SUBTRACT
     *************************************************************************************************/
    Int32 CPLX_Add(Int32 op1, Int32 op2, Uint16 subtract_flag)	
    {
    	Int16 op1_r, op1_i, op2_r, op2_i;
    	Int32 op1_op2_r32, op1_op2_i32;
    	Int16 op1_op2_r16, op1_op2_i16;
    	Int32 cplx_sum;
    	
    	// Mask/shift data to extract Re and Im data = [Re,Im]
    	op1_r = op1 >> 16;
    	op1_i = op1 & 0x0000FFFF;
    	op2_r = op2 >> 16;
    	op2_i = op2 & 0x0000FFFF;
    
    	// If subtract_flag, Subtract op2 from op1 by negating op2 and adding it to op1
    	if(subtract_flag == 1)
    	{
    		op2_r = -1*op2_r;
    		op2_i = -1*op2_i;		
    	}
    	
    	// Perform addition and scale the sum by 1/2
    	op1_op2_r32 = ((Int32)op1_r + op2_r)>>1;		
    	op1_op2_i32 = ((Int32)op1_i + op2_i)>>1;		
    	
    	// Repack Re and Im results into one double word
    	op1_op2_r16 = (Int16)(op1_op2_r32 & 0x0000FFFF);
    	op1_op2_i16 = (Int16)(op1_op2_i32 & 0x0000FFFF);
    	
    	cplx_sum = (((Int32)op1_op2_r16 & 0x0000FFFF)<< 16);
    	cplx_sum = cplx_sum | ((Int32)op1_op2_i16 & 0x0000FFFF);
    	
    	return cplx_sum;
    }
    
    
    /*************************************************************************************************
     * err = chk_out(*data, *ref_data, data_len):
     * Compares output data with reference data 
     * Returns an error (err = 1) if any of the data and ref_data arrays differ by more than the Error_Threshold = 0.0001 
     * err = 1 if |data[idx] - ref_data[idx]| > Error_Threshold = 0.0001, for any array index, idx
     ************************************************************************************************/
    static Uint16 chk_out(Int32 *data, Int32 *ref_data, Uint16 data_len) 
    {
        // Error_Threshold = 0.0001, Convert 0.0001 to S16Q15 Fixed Point (0.0001 * 32768 = 3.2768 ~= 3)
        Int16 Error_Threshold = 3; 	
        
        Int16 data_r, data_i, ref_data_r, ref_data_i;
        Int16 err_r, err_i; 
        Uint16 idx;
        Uint16 err; 
      
        idx = 0;
        err = 0;
    
        while ((idx < data_len) && err == 0)
        {
        	// Mask/shift data to extract Re and Im data = [Re,Im]
        	data_r = data[idx] >> 16;
    		data_i = data[idx] & 0x0000FFFF;
    		ref_data_r = ref_data[idx] >> 16;
    		ref_data_i = ref_data[idx] & 0x0000FFFF;
    		
            // Calculate error 
            err_r = data_r - ref_data_r;
            err_i = data_i - ref_data_i;
            
            // Return an error if |error| > (Error_Threshold = 0.0001)
            if (abs(err_r) > Error_Threshold || abs(err_i) > Error_Threshold)
            {            
                err = 1;
            }
            idx++;
        }
        return err;
    }
    
    
    
     //  lengths:  2048  2048  4096  2048  4096  2048   twidStepSize = 1
     //  lengths:  1024  1024  2048  2048  2048  2048   twidStepSize = 2
     //  lengths:  1024  1024  2048  2048  2048  1024   twidStepSize = 1
    Uint16 Radix2FFT(Int32 *In_Even, Int32 *In_Odd, Int32 *Out, Int32 *Twiddle, Uint16 Out_Len, Uint16 Twid_Len)
    {
    	Int32 twiddle_data_odd;
    	Uint16 twidStepSize = Twid_Len / (Out_Len / 2);	//Check this with twiddle table - should be 1 for 4096, 2 for 2048
    	Uint16 k = 0;
    
    	for(k=0; k<Out_Len/2; k++)
    	{
    		// X(k) 	= 1/2*(X_even[k] + Twiddle[k]*X_odd(k))
    		// X(k+N/2) = 1/2*(X_even[k] - Twiddle[k]*X_odd(k))
    		
    		// Twiddle[k]*X_odd(k):
    		twiddle_data_odd = CPLX_Mul(Twiddle[k * twidStepSize], In_Odd[k]); 
    		
    		// X(k):
    		Out[k] = CPLX_Add(In_Even[k], twiddle_data_odd, ADD);
    		
    		// X(k+N/2):
    		Out[k+Out_Len/2] = CPLX_Add(In_Even[k], twiddle_data_odd, SUBTRACT);
    	}
    	return 0; //make void or return something useful...
    } 
    

    http://e2e.ti.com/cfs-file.ashx/__key/communityserver-discussions-components-files/109/5187.test.cmd

    1460.twid4096.asm

    Hope this helps,
    Mark 

  • Great, thanks.

    2 Mcycles... is it worth using the HWA at all? What would be your estimate for running the 8K FFT straight through DSPLIB?

    Cheers

  • It's worth using the HWAFFT.

    Using DSPLIB, a 1024-pt CFFT requires 25,954 cycles according to the CFFT - SCALE table in http://www.ti.com/lit/ug/spru422j/spru422j.pdf

    Whereas the HWAFFT requires 5,242 cycles to perform the same 1024-pt complex FFT.

    Either way, the larger FFT stages are computed on the CPU (inefficiently, with my unoptimized code) and require the same cycle counts for the 4x 2k, 2x 4k, and 1x 8k Radix-2 combine stages (the majority of the cycles).

    Replacing the HWAFFT with DSPLIB for the 1024-pt FFTs (8 x (25,954 - 5,242) cycles) would add 165,696 cycles to the 8k FFT.

    The cycles for the larger FFT stages executed on the CPU can be greatly reduced with optimization. I cannot estimate that cycle count, but it shouldn't take 40-50 cycles to compute a complex multiplication (4 multiplies & 3 additions) as it does in my code...

    Hope this helps,
    Mark 

  • "This document is also a chapter in C5535 TRM: www.ti.com/.../spruh87c.pdf"

    This link is down.

    Your table of FFT cycle counts and durations is impressive - is there any formal document in PDF format that I can keep as a reference? Thanks a lot.
  • Hi,

    We have revised the C5535 TRM a few times since rev C, so the link is now on rev H: http://www.ti.com/lit/ug/spruh87h/spruh87h.pdf

    You can find the same chapter as an independent appnote here: http://www.ti.com/lit/an/sprabb6b/sprabb6b.pdf

    Hope this helps,
    Mark

  • I have 4096 real points for which I need to compute the FFT. Will this code work? As I am new to DSP, I am facing some issues/doubts while building the project:

    1. I am confused between invec and refvec. What is the difference? Where am I supposed to enter my 4096 points?
    2. twiddle = (Int32*)twiddle_buf; -- am I supposed to use "1460.twid4096.asm"?

    Could you please share the entire project with some test values, which I can refer to and modify according to my needs.

    Thanks