TMS320C6748: Performance data for FFT

Part Number: TMS320C6748

Hi

I'm referring to the following documents:

       - http://www.ti.com/lit/an/spracn4/spracn4.pdf

       - http://www.ti.com/lit/an/sprac13/sprac13.pdf

However, I'd like performance data that more closely matches my use case:

  - FFT

  - 4096 points

  - 16-bit unsigned data

If possible, I'd like to see results for both cases: data placed in cache/internal RAM, and data placed in external DRAM.

I'd also like to know whether the data is rounded during processing when intermediate results exceed 16 bits.

Thanks and Best regards,

HaTa.

  • HaTa,

    DSPLIB for C64x+, which we provide along with Processor SDK RTOS for this device, includes a 16-bit FFT function that supports up to 64K points. It comes with a test case in which you can change the value of N to 4096 to get a performance estimate.

    I ran the test case. The code and data fit in L2 memory, so external DRAM was not required: the C6748 has 256 KB of L2, which is sufficient for the code and data of this function.
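
As a rough sanity check on the L2 claim, the data footprint can be estimated with a few lines of arithmetic. This is only a sketch under stated assumptions: the data is complex with interleaved 16-bit real and imaginary parts, there is one input and one output buffer, and the twiddle table is assumed to be about the size of one data buffer (this is not confirmed in the thread). Code size and any part of L2 configured as cache are ignored.

```python
def complex_buffer_bytes(n_points, bytes_per_component=2):
    """Bytes for one buffer of n_points complex samples (real + imag interleaved)."""
    return n_points * 2 * bytes_per_component

N = 4096
L2_BYTES = 256 * 1024  # 256 KB of L2, as stated in the reply

input_bytes = complex_buffer_bytes(N)
output_bytes = complex_buffer_bytes(N)
# Assumption: twiddle table is roughly the size of one data buffer.
twiddle_bytes = complex_buffer_bytes(N)

total_bytes = input_bytes + output_bytes + twiddle_bytes
print(f"per buffer: {input_bytes} bytes")
print(f"total data: {total_bytes} bytes "
      f"({100.0 * total_bytes / L2_BYTES:.2f}% of L2)")
```

Under these assumptions each buffer is 16 KB and the data totals 48 KB, well under the 256 KB of L2.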

    The performance numbers reported were as follows:

    DSP_fft16x16 Iter#: 1 Result Successful (y_i) Result Successful (y_sa) Radix = 4 N = 16 natC: 457 intC: 158 SA: 154
    DSP_fft16x16 Iter#: 2 Result Successful (y_i) Result Successful (y_sa) Radix = 2 N = 32 natC: 708 intC: 186 SA: 199
    DSP_fft16x16 Iter#: 3 Result Successful (y_i) Result Successful (y_sa) Radix = 4 N = 64 natC: 1302 intC: 316 SA: 303
    DSP_fft16x16 Iter#: 4 Result Successful (y_i) Result Successful (y_sa) Radix = 2 N = 128 natC: 3066 intC: 731 SA: 693
    DSP_fft16x16 Iter#: 5 Result Successful (y_i) Result Successful (y_sa) Radix = 4 N = 256 natC: 6093 intC: 1518 SA: 1342
    DSP_fft16x16 Iter#: 6 Result Successful (y_i) Result Successful (y_sa) Radix = 2 N = 512 natC: 14471 intC: 3575 SA: 3110
    DSP_fft16x16 Iter#: 7 Result Successful (y_i) Result Successful (y_sa) Radix = 4 N = 1024 natC: 29128 intC: 7051 SA: 6088
    DSP_fft16x16 Iter#: 8 Result Successful (y_i) Result Successful (y_sa) Radix = 2 N = 2048 natC: 68046 intC: 16357 SA: 13944
    DSP_fft16x16 Iter#: 9 Result Successful (y_i) Result Successful (y_sa) Radix = 4 N = 4096 natC: 137439 intC: 32443 SA: 27387

    The last line above (N = 4096) is the relevant result. It indicates that for a 4096-point radix-4 FFT, the natural C code takes 137439 cycles, the intrinsic C implementation takes 32443 cycles, and the optimized serial assembly takes 27387 cycles.
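
To turn these cycle counts into execution time, divide by the CPU clock. The sketch below parses the N = 4096 log line and prints the time and the speedup relative to natural C for each variant. The 456 MHz clock is an assumption (it is not stated in the thread), so substitute the actual clock of your board. The helper names `parse_cycles` and `cycles_to_us` are made up for this illustration.

```python
import re

# The N = 4096 line from the test output above.
LOG_LINE = ("DSP_fft16x16 Iter#: 9 Result Successful (y_i) Result Successful (y_sa) "
            "Radix = 4 N = 4096 natC: 137439 intC: 32443 SA: 27387")

CPU_HZ = 456_000_000  # assumption: 456 MHz CPU clock; adjust for your device

def parse_cycles(line):
    """Extract N and the three cycle counts from one test-log line."""
    m = re.search(r"N = (\d+) natC: (\d+) intC: (\d+) SA: (\d+)", line)
    if m is None:
        raise ValueError("unrecognized log line")
    n, nat_c, int_c, sa = map(int, m.groups())
    return {"N": n, "natC": nat_c, "intC": int_c, "SA": sa}

def cycles_to_us(cycles, cpu_hz=CPU_HZ):
    """Convert a cycle count to microseconds at the given clock."""
    return cycles / cpu_hz * 1e6

result = parse_cycles(LOG_LINE)
for name in ("natC", "intC", "SA"):
    cycles = result[name]
    print(f"{name}: {cycles} cycles, {cycles_to_us(cycles):.2f} us, "
          f"{result['natC'] / cycles:.2f}x vs natC")
```

At the assumed clock, the assembly version comes out at roughly 60 us per 4096-point FFT, about 5x faster than natural C. These figures apply to the L2-resident case measured above, not to data in external DRAM.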

    Hope this helps

    Regards,

    Rahul