TMS320C5535: Using the 1024FFT library function to calculate the 2048-point FFT result is wrong

Nancy Wang

Part Number: TMS320C5535

Hi,

Customer calculates the 2048-point FFT according to the Greater Than 1024-point FFT code provided in the manual, the result of the FFT combination of odd-even sequences is wrong.

He uses the following CPLX_Mul code, the multiplication and shifting of two int16 numbers will cause the sign bit to be shifted to the right.

Int32 CPLX_Mul(Int32 op1, Int32 op2)
{
	Int16 op1_r, op1_i, op2_r, op2_i;
	Int32 op1_op2_r, op1_op2_i;
	Int16 op1_op2_r16, op1_op2_i16;
	Int32 cplx_prod;

	// Mask data for real and imag data = (real:imag)
	op1_r = op1 >> 16;
	op1_i = op1 & 0x0000FFFF;
	op2_r = op2 >> 16;
	op2_i = op2 & 0x0000FFFF;

	op1_op2_i	= (Int32)op1_r*op2_i + (Int32)op1_i*op2_r;
	op1_op2_r	= (Int32)op1_r*op2_r - (Int32)op1_i*op2_i;
	
	op1_op2_i16 = (Int16)(op1_op2_i >> 15);
	op1_op2_r16 = (Int16)(op1_op2_r >> 15);

	cplx_prod = (((Int32)op1_op2_r16 & 0x0000FFFF)<< 16);
	cplx_prod = cplx_prod | ((Int32)op1_op2_i16 & 0x0000FFFF);

	return cplx_prod;
}
Int32 CPLX_Add(Int32 op1, Int32 op2 , Uint16 scale_flag)
{
    Int16 op1_r, op1_i, op2_r, op2_i;
    Int32 op1_op2_r, op1_op2_i;
    Int16 op1_op2_r16, op1_op2_i16;
    Int32 cplx_prod;

    // Mask data for real and imag data = (real:imag)
    op1_r = op1 >> 16;
    op1_i = op1 & 0x0000FFFF;
    op2_r = op2 >> 16;
    op2_i = op2 & 0x0000FFFF;

    // Yr = 1/2 * (op1_r + op2_r), Yi = 1/2 *(op1_i + op2_i)
    op1_op2_i   = (Int32)op1_i +(Int32)op2_i;
    op1_op2_r   = (Int32)op1_r +(Int32)op2_r;

    op1_op2_i16 = (Int16)(op1_op2_i >> 1);
    op1_op2_r16 = (Int16)(op1_op2_r >> 1);


    cplx_prod = (((Int32)op1_op2_r & 0x0000FFFF)<< 16);
    cplx_prod = cplx_prod | ((Int32)op1_op2_i & 0x0000FFFF);

    return cplx_prod;
}

Int32 CPLX_Subtract(Int32 op1, Int32 op2 , Uint16 scale_flag)
{
    Int16 op1_r, op1_i, op2_r, op2_i;
    Int32 op1_op2_r, op1_op2_i;
    Int16 op1_op2_r16, op1_op2_i16;
    Int32 cplx_prod;

    // Mask data for real and imag data = (real:imag)
    op1_r = op1 >> 16;
    op1_i = op1 & 0x0000FFFF;
    op2_r = op2 >> 16;
    op2_i = op2 & 0x0000FFFF;

    // Yr = 1/2 * (op1_r - op2_r), Yi = 1/2 *(op1_i - op2_i)
    op1_op2_i   = (Int32)op1_i - (Int32)op2_i;
    op1_op2_r   = (Int32)op1_r - (Int32)op2_r;

    op1_op2_i16 = (Int16)(op1_op2_i >> 1);
    op1_op2_r16 = (Int16)(op1_op2_r >> 1);

    cplx_prod = (((Int32)op1_op2_r & 0x0000FFFF)<< 16);
    cplx_prod = cplx_prod | ((Int32)op1_op2_i & 0x0000FFFF);

    return cplx_prod;
}

Where can I find source code of

CPLX_Mul(twiddle[k], data_odd[k]).

CPLX_Add(data_even[k], twiddle_times_data_odd, SCALE_FLAG);

CPLX_Subtract(data_even[k], twiddle_times_data_odd, SCALE_FLAG)

www.ti.com.cn/.../spruh87h.pdf

over 2 years ago

0 Frank Livingston over 2 years ago

TI__Mastermind 29338 points

Hi Nancy,

Support for C5000 is limited, please see: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/818771/faq-support-guidance-for-c5000-digital-signal-processors

Nancy Wang said:
Customer calculates the 2048-point FFT according to the Greater Than 1024-point FFT code provided in the manual, the result of the FFT combination of odd-even sequences is wrong.

I didn't review the code in the UG, but the code should be a final radix-2 stage for combining the results of the two 1024-pt FFTs into a 2048-pt FFT result.

Nancy Wang said:
He uses the following CPLX_Mul code, the multiplication and shifting of two int16 numbers will cause the sign bit to be shifted to the right.

The complex multiply code looks correct. The I,Q parts of the complex number are 16-bit signed Q15 values packed into a 32-bit value: S16Q15 | S16Q15. When multiplying two 16-bit Q15 values: S16Q15 * S16Q15 = S32Q30. Truncating the LS 15 bits and storing into a 16 bit output gives S16Q15.

The complex add/subtract code looks correct. I notice he always shifts right by 1 (divide by 2). This should be OK and is done to avoid overflow at the output of the butterfly. In the code in the UG, I notice there is a SCALING flag which is set to 0, so I'm guessing the final stage wasn't scaled by 2 in the UG code. However, this should only affect the magnitude of the I,Q results.

Nancy Wang said:
Where can I find source code of

That code is no longer available.

Regards,
Frank

0 Nancy Wang over 2 years ago in reply to Frank Livingston

TI__Expert 8210 points

Hi,

Is there any source code for this method to implement FFT with 8192 points or more?

0 Frank Livingston over 2 years ago in reply to Nancy Wang

TI__Mastermind 29338 points

Hi Nancy,

No, there isn't any code available for this.

4096- and 8192-point FFTs would be an extension of the ideas described in the TRM, Section 2.8.4 2048-point FFT Source Code.

For example, this is a rough outline of what I believe would be required for a 4096-point FFT.

Offline: Generate twiddle table with 4096/2 twiddles

Run-time:

Bit-reverse 4096-pt complex input vector
Split bit-reversed input vector into 4 1024-pt input vectors
Processes 4 1024-pt input vectors by separate calls to HWAFFT FFT function
Generate 2 2048-pt FFTs as outlined in TRM:
- First 2048-pt FFT uses first 2 1024-pt FFT outputs
- Second 2048-pt FFT uses second 2 1024-pt FFT outputs
Generate 4096-pt FFT ouput using 2 2048-pt FFT outputs.

The HWAFFT uses DIT, so the DIT float graph should be used for:

2048 pt: 1 final radix-2 stage
4096 pt: 2 final radix-2 stages
8192 pt: 3 final radix-2 stages

To start, the code can be written in C. The C55x Rev 3 DSPLIB contains code optimized assembly language for up to 1024-point complex FFTs, so this could also be consulted in case optimized assembly is required.

Regards,
Frank

0 Yuankai Zhang over 2 years ago

Prodigy 50 points

Hello, Frank.

I am the initiator of this question. Thank you for your help on my previous question, and thank Nancy Wang for her follow-up on this question.

At present, I have realized the calculation of 4096-point FFT with your method, but in order to obtain higher resolution, I still want to complete the calculation of 8192 points FFT, but now I meet difficulties.That is, memory resources have become significantly insufficient.

The memory usage is as follows:

• 3 twiddles, 1024-pt,2048-pt,4096-pt , to save resources, I cut each set of arrays in half, the other half is achieved by copying the first half;

• The 8192-point complex input vector after bit-reverse, data_br;

• 8 1024-pt output vectors by separate calls to HWAFFT FFT function;

• 4 2048-pt output vectors by complex multiplication of twiddle_1024;

• 2 4096-pt output vectors by complex multiplication of twiddles_2048;

• 1 8192-pt output vectors by complex multiplication of twiddles_4096, which occupies the same memory space as data_br

I wonder if there is any way to optimize the space, I probably need 30KB or more (which array can be reused?).

In addition, can HWAFFT implement 8192 FFT calculations by modifying the code optimized assembly language ? How do you modify it ?

Thank you
Wei Jiang

0 Frank Livingston over 2 years ago in reply to Yuankai Zhang

TI__Mastermind 29338 points

Hi Wei,

Yuankai Zhang said:
3 twiddles, 1024-pt,2048-pt,4096-pt , to save resources

The twiddles are e^(-j*2*pi/N). For an N-point FFT, a single size N/2 twiddle table is required. The twiddles for smaller size FFTs are contained
in this twiddle table. For example, the twiddles for 4096- and 2048-point FFTs are contained in the twiddles for the 8192-point (4096 twiddles) FFT.
Hence, a single 8192/2=4096 size twiddle table should be enough to compute all FFT stages after the HWAFFT.

Yuankai Zhang said:
I wonder if there is any way to optimize the space, I probably need 30KB or more (which array can be reused?).

It seems like you could use two size 8192 vectors:

One vector for bit-reversed input data.
Single vector for all "scratch" buffers supplied to 8x calls to HWAFFT routine.

Segment these vectors into 8 pieces for 8x calls to HWAFFT routine:

Vector 0: 0...1023
Vector 1: 1024...2048
...
Vector 7: 7168...8191.

The HWAFFT routine returns a flag indicating which buffer contains the output.

After this, the final three FFT stages (assuming radix-2 stages) can be computed in-place. You just need to write an FFT stage routine to compute results in place. The size 8192-vector selected for these stages is determined by the flag returned by the HWAFFT routine.

So you need:

Twiddle table: (8192/2 twiddles * 2 entries per twiddle * 2 bytes / entry) = 16384 bytes.
Data vector: (2 vectors * 8192 elements / vector * 2 entries / element * 2 bytes / entry) = 65536 bytes.

Yuankai Zhang said:
can HWAFFT implement 8192 FFT calculations by modifying the code optimized assembly language ? How do you modify it ?

HWAFFT is a tightly-coupled FFT accelerator. The maximum size supported by the HWAFFT hardware is 1024. The HWAFFT assembly routines are located in ROM. The assembly I was referring to above is for computing the final FFT stages (after HWAFFT) on the C55x CPU core.

Let me know if you have any questions or concerns.

Regards,
Frank

0 Frank Livingston over 2 years ago in reply to Frank Livingston

TI__Mastermind 29338 points

You might be able to save some memory by only using a single scratch buffer of size 1024. After a call to the HWAFFT routine, copy the scratch buffer back to the data buffer if the output is in the scratch buffer. This would cost more cycles, but it would save 65536 / 2 * 7/8 bytes.

0 Yuankai Zhang over 2 years ago in reply to Frank Livingston

Prodigy 50 points

I don't know what you mean. I use 8 * scratch(each one size is 1024) buffers. But each scratch needs to occupy memory. How to save memory

#pragma DATA_SECTION(scratch111_buf, "scratch111_buf");
Int32 scratch111_buf[DATA_LEN_1024];
#pragma DATA_SECTION(scratch112_buf, "scratch112_buf");
Int32 scratch112_buf[DATA_LEN_1024];
#pragma DATA_SECTION(scratch121_buf, "scratch121_buf");
Int32 scratch121_buf[DATA_LEN_1024];
#pragma DATA_SECTION(scratch122_buf, "scratch122_buf");
Int32 scratch122_buf[DATA_LEN_1024];
#pragma DATA_SECTION(scratch211_buf, "scratch211_buf");
Int32 scratch211_buf[DATA_LEN_1024];
#pragma DATA_SECTION(scratch212_buf, "scratch212_buf");
Int32 scratch212_buf[DATA_LEN_1024];
#pragma DATA_SECTION(scratch221_buf, "scratch221_buf");
Int32 scratch221_buf[DATA_LEN_1024];
#pragma DATA_SECTION(scratch222_buf, "scratch222_buf");
Int32 scratch222_buf[DATA_LEN_1024];

#pragma DATA_SECTION(data_br_buf, "data_br_buf"); 
//#pragma DATA_ALIGN(data_br_buf, 8192);
Int32 data_br_buf[DATA_LEN_8192];

#pragma DATA_SECTION(data111_buf, "data111_buf");
Int32 data111_buf[DATA_LEN_1024];
#pragma DATA_SECTION(data112_buf, "data112_buf");
Int32 data112_buf[DATA_LEN_1024];
#pragma DATA_SECTION(data121_buf, "data121_buf");
Int32 data121_buf[DATA_LEN_1024];
#pragma DATA_SECTION(data122_buf, "data122_buf");
Int32 data122_buf[DATA_LEN_1024];
#pragma DATA_SECTION(data211_buf, "data211_buf");
Int32 data211_buf[DATA_LEN_1024];
#pragma DATA_SECTION(data212_buf, "data212_buf");
Int32 data212_buf[DATA_LEN_1024];
#pragma DATA_SECTION(data221_buf, "data221_buf");
Int32 data221_buf[DATA_LEN_1024];
#pragma DATA_SECTION(data222_buf, "data222_buf");
Int32 data222_buf[DATA_LEN_1024];

Int32 *data_br;
Int32 *data11 , *data12 , *data;
Int32 *data111 , *data112 , *data121 , *data122 , *data211 , *data212 , *data221 , *data222;
Int32 *scratch111 , *scratch112 , *scratch121 , *scratch122 , *scratch211 , *scratch212 , *scratch221 , *scratch222;

    data_br = data_br_buf;
    data111 = data111_buf;
    data112 = data112_buf;
    data121 = data121_buf;
    data122 = data122_buf;
    data211 = data211_buf;
    data212 = data212_buf;
    data221 = data221_buf;
    data222 = data222_buf;

    scratch111 = scratch111_buf;
    scratch112 = scratch112_buf;
    scratch121 = scratch121_buf;
    scratch122 = scratch122_buf;
    scratch211 = scratch211_buf;
    scratch212 = scratch212_buf;
    scratch221 = scratch221_buf;
    scratch222 = scratch222_buf;
    
    fft_flag = FFT_FLAG;
    scale_flag = SCALE_FLAG;

	hwafft_br(In, data_br, DATA_LEN_8192);  // bit-reverse input data
	data = data_br;
	//data = In;

	// Split data into even-indexed data & odd-indexed data
	// data_br is already bit-reversed, so even-indexed data = first half & odd-indexed data = second half

	for(k=0; k<1024; k++)
	{
        data111[k] = data[k];//偶数索引序列
        data112[k] = data[k+DATA_LEN_1024];//奇数索引序列
        data121[k] = data[k+DATA_LEN_1024*2];//偶数索引序列
        data122[k] = data[k+DATA_LEN_1024*3];//奇数索引序列
        data211[k] = data[k+DATA_LEN_1024*4];
        data212[k] = data[k+DATA_LEN_1024*5];
        data221[k] = data[k+DATA_LEN_1024*6];
        data222[k] = data[k+DATA_LEN_1024*7];
	}

    out_sel = hwafft_1024pts(data111, scratch111, fft_flag, scale_flag);// perform FFT
    if(out_sel == OUT_SEL_SCRATCH)
        data111 = scratch111;

    out_sel = hwafft_1024pts(data112, scratch112, fft_flag, scale_flag);
    if(out_sel == OUT_SEL_SCRATCH)
        data112 = scratch112;

    out_sel = hwafft_1024pts(data121, scratch121, fft_flag, scale_flag);// perform FFT
    if(out_sel == OUT_SEL_SCRATCH)
        data121 = scratch121;

    out_sel = hwafft_1024pts(data122, scratch122, fft_flag, scale_flag);
    if(out_sel == OUT_SEL_SCRATCH)
        data122 = scratch122;

    out_sel = hwafft_1024pts(data211, scratch211, fft_flag, scale_flag);// perform FFT
    if(out_sel == OUT_SEL_SCRATCH)
        data211 = scratch211;

    out_sel = hwafft_1024pts(data212, scratch212, fft_flag, scale_flag);
    if(out_sel == OUT_SEL_SCRATCH)
        data212 = scratch212;

    out_sel = hwafft_1024pts(data221, scratch221, fft_flag, scale_flag);// perform FFT
    if(out_sel == OUT_SEL_SCRATCH)
        data221 = scratch221;

    out_sel = hwafft_1024pts(data222, scratch222, fft_flag, scale_flag);
    if(out_sel == OUT_SEL_SCRATCH)
        data222 = scratch222;
        
    for(k=0; k<DATA_LEN_1024; k++) // Computes 2048-point FFT
    {
                // X(k) = X_even[k] + Twiddle[k]*X_odd(k)
                // X(k+N/2) = X_even[k] - Twiddle[k]*X_odd(k)
                // Twiddle[k]*X_odd(k):
        twiddle_times_data_odd1 = CPLX_Mul(twiddle[k], data112[k]);
        // X(k):
        data11[k] = CPLX_Add(data111[k], twiddle_times_data_odd1, SCALE_FLAG); // Add then scale by 2
        // X(k+N/2):
        data11[k+DATA_LEN_1024] = CPLX_Subtract(data111[k], twiddle_times_data_odd1, SCALE_FLAG);
        //Sub then scale

        twiddle_times_data_odd2 = CPLX_Mul(twiddle[k], data122[k]);
        // X(k):
        data12[k] = CPLX_Add(data121[k], twiddle_times_data_odd2, SCALE_FLAG); // Add then scale by 2
        // X(k+N/2):
        data12[k+DATA_LEN_1024] = CPLX_Subtract(data121[k], twiddle_times_data_odd2, SCALE_FLAG);
        //Sub then scale
          
        twiddle_times_data_odd1 = CPLX_Mul(twiddle[k], data212[k]);
        // X(k):
        data21[k] = CPLX_Add(data211[k], twiddle_times_data_odd1, SCALE_FLAG); // Add then scale by 2
        // X(k+N/2):
        data21[k+DATA_LEN_1024] = CPLX_Subtract(data211[k], twiddle_times_data_odd1, SCALE_FLAG);
        //Sub then scale

        twiddle_times_data_odd2 = CPLX_Mul(twiddle[k], data222[k]);
        // X(k):
        data22[k] = CPLX_Add(data221[k], twiddle_times_data_odd2, SCALE_FLAG); // Add then scale by 2
        // X(k+N/2):
        data22[k+DATA_LEN_1024] = CPLX_Subtract(data221[k], twiddle_times_data_odd2, SCALE_FLAG);
        //Sub then scale       
    }

    for(k=0; k<DATA_LEN_2048; k++) // Computes 4096-point FFT
    {
        twiddle_times_data_odd = CPLX_Mul(twiddle_2048[k], data12[k]);
        // X(k):
        data1[k] = CPLX_Add(data11[k], twiddle_times_data_odd, SCALE_FLAG); // Add then scale by 2
        // X(k+N/2):
        data1[k+DATA_LEN_2048] = CPLX_Subtract(data11[k], twiddle_times_data_odd, SCALE_FLAG);
        //Sub then scale

        twiddle_times_data_odd = CPLX_Mul(twiddle_2048[k], data22[k]);
        // X(k):
        data2[k] = CPLX_Add(data21[k], twiddle_times_data_odd, SCALE_FLAG); // Add then scale by 2
        // X(k+N/2):
        data2[k+DATA_LEN_2048] = CPLX_Subtract(data21[k], twiddle_times_data_odd, SCALE_FLAG);
        //Sub then scale
    }
    
    for(k=0; k<DATA_LEN_4096; k++) // Computes 8192-point FFT
    {
        twiddle_times_data_odd = CPLX_Mul(twiddle_4096[k], data2[k]);
        // X(k):
        data[k] = CPLX_Add(data1[k], twiddle_times_data_odd, SCALE_FLAG); // Add then scale by 2
        // X(k+N/2):
        data[k+DATA_LEN_4096] = CPLX_Subtract(data1[k], twiddle_times_data_odd, SCALE_FLAG);
        //Sub then scale
    }