FFT benchmark on C6678

I'm trying to gather some FFT benchmarks for the C6678.

This document http://www.ti.com/lit/pdf/sprt578 has data for the C6670.

All the numbers there are for a radix-4 FFT.
I'm looking for numbers for a 1024-pt radix-2 single-precision FFT.

Also, this document says that there are 3 FFT coprocessors embedded in the C6670. Is this true for the C6678 as well?

The document ti/dsplib_c66x_3_2_0_1/docs/DSPLib_c66xTest_Report.html has data for DSPF_sp_fftSPxSP_66 for a 256-pt FFT, in cycles. Where can I find the results for a 1024-pt FFT?


Thanks,

Arun

  • Hi,

    The document sprt578 describes the FFT coprocessor performance on C6670.  C6678 does not have any FFT coprocessors.

    You can benchmark FFT performance on C6678 using the FFT kernels provided in TI DSPLIB.  The DSPLIB is part of the TI Multicore Software Development Kit (MCSDK).

    Xiaohui

  • Just to be clear, the DSPLib numbers do not make use of the FFT coprocessor; they use the CPU, so performance is the same on C6678 and C6670.  1024pt fft for single precision floating point is around 6K cycles.

    Regards,

    Travis

  • tscheck said:
    1024pt fft for single precision floating point is around 6K cycles

    My code takes 160 us. The DSPLIB code takes 4.8 us (6000 cycles / 1250 MHz). That is a huge difference of roughly 33x. Should I turn on any specific optimization flags?
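
For reference, the arithmetic in this post can be sketched as follows. This is only a back-of-the-envelope check: the 1250 MHz clock is taken from the "6000/1250" division above, and the helper name `cycles_to_us` is hypothetical.

```python
def cycles_to_us(cycles, clock_mhz):
    """Convert a cycle count to microseconds for a clock given in MHz."""
    return cycles / clock_mhz

# ~6K cycles quoted for the DSPLIB kernel, at an assumed 1250 MHz core clock
dsplib_us = cycles_to_us(6000, 1250)

# Time reported by the poster for their own 1024-pt FFT
user_us = 160.0

# Ratio between the two, and the user's run expressed in cycles
slowdown = user_us / dsplib_us
user_cycles = user_us * 1250

print(f"DSPLIB estimate: {dsplib_us:.1f} us")
print(f"Slowdown factor: {slowdown:.1f}x")
print(f"User code: {user_cycles:.0f} cycles")
```

Expressing the 160 us run in cycles makes it directly comparable with the cycle counts reported in the DSPLIB test report.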

  • Hi,

    There are various techniques for optimizing DSP code.  Turning on optimization flags is one option.  Application notes listed under C6678, such as "Optimizing Loops on the C66x DSP", provide details on techniques for optimizing TI DSP code.

    Also all the kernels from TI DSPLIB come with the source code.  They show exactly how each kernel is implemented to achieve better performance.

    Xiaohui

  • Travis-

    At 1.2 GHz, 6000 cycles would be about 5 usec.  The C6670 product brief (http://www.ti.com/lit/ml/sprt578b/sprt578b.pdf) indicates 14.6 usec for a 2048 pt FFT, so we might say 6-7 usec for a 1024 pt FFT.  But this uses the C6670 FFT co-processors -- is one C6678 core really faster?

    Please confirm the 6000 cycle figure -- is there some doc for this?  Thanks.

    -Jeff
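
The 6-7 usec estimate above is consistent with scaling the 2048-pt figure by the usual N*log2(N) operation-count model. The sketch below assumes that model applies, which may not hold exactly for a hardware coprocessor; `fft_work` is a hypothetical helper.

```python
import math

def fft_work(n):
    """Relative operation count of an n-point FFT under the n*log2(n) model."""
    return n * math.log2(n)

# 2048-pt time quoted from the C6670 product brief
t_2048_us = 14.6

# Scale down to 1024 points, assuming time is proportional to n*log2(n)
t_1024_est_us = t_2048_us * fft_work(1024) / fft_work(2048)

# 6000 CPU cycles at 1.2 GHz (1200 MHz), for comparison
t_cpu_us = 6000 / 1200.0

print(f"Scaled coprocessor estimate: {t_1024_est_us:.2f} us")
print(f"CPU estimate at 1.2 GHz:     {t_cpu_us:.2f} us")
```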

  • Jeff, 

    I went back and found some independent internal emails with the following measurements on our EVM:

    1024pt FFT Single Precision Complex floating point - 6863 cycles

    Another one indicates 6632 cycles.

    The one I referenced earlier indicates 6100 cycles, but I'm not 100% sure that was measured on an EVM like the above numbers.  That should provide you a good estimate.  The DSPLib kernels are highly optimized.

    Regards,

    Travis
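
As a quick sanity check on the three figures quoted above, their mean and spread can be computed as below. This only summarizes the numbers from the post; it says nothing about how they were measured.

```python
from statistics import mean

# Cycle counts quoted in the post for a 1024-pt single-precision complex FFT
measurements = [6863, 6632, 6100]

avg_cycles = mean(measurements)
# Spread between the largest and smallest value, relative to the mean
spread_pct = 100.0 * (max(measurements) - min(measurements)) / avg_cycles

print(f"Mean: {avg_cycles:.0f} cycles, spread: {spread_pct:.1f}%")
```

A round figure near 6500 cycles sits close to this mean.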

  • Travis-

    Ok thanks, we'll go with 6500 cycles.

    One other question -- we don't see any dsplib function that optimizes a 1D FFT using all 8 cores, for example decomposition into smaller FFT sizes followed by recombination. Is there anything we're missing?

    Thanks.

    -Jeff

  • DSPLib functions are all single-core kernels.  We have some code examples that use multiple cores to perform larger FFTs, from 8K to 1024K points.  See:  http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/303599.aspx

    Regards,

    Travis
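
The "decomposition into smaller FFT sizes followed by recombination" asked about above can be illustrated with a generic Cooley-Tukey split of N = n1 * n2. This is a minimal host-side sketch, not the TI multicore example linked above. The names `dft` and `fft_decomposed` are hypothetical, and the naive `dft` merely stands in for an optimized single-core kernel.

```python
import cmath

def dft(x):
    """Naive O(N^2) DFT, a stand-in for an optimized single-core FFT kernel."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft_decomposed(x, n1, n2):
    """Length n1*n2 DFT built from n1 DFTs of length n2 and n2 DFTs of length n1."""
    n = n1 * n2
    assert len(x) == n
    # Step 1: n1 independent length-n2 DFTs over the decimated inputs
    # x[r], x[r + n1], x[r + 2*n1], ... (each could run on a separate core).
    inner = [dft(x[r::n1]) for r in range(n1)]
    # Step 2: multiply by the twiddle factors exp(-2*pi*j*r*k2/n).
    for r in range(n1):
        for k2 in range(n2):
            inner[r][k2] *= cmath.exp(-2j * cmath.pi * r * k2 / n)
    # Step 3: n2 independent length-n1 DFTs across r (the recombination).
    out = [0j] * n
    for k2 in range(n2):
        col = dft([inner[r][k2] for r in range(n1)])
        for k1 in range(n1):
            out[k2 + n2 * k1] = col[k1]
    return out

# Compare against the direct DFT on a small 32-point test signal
x = [complex(i % 5, -(i % 3)) for i in range(32)]
ref = dft(x)
got = fft_decomposed(x, 4, 8)
max_err = max(abs(a - b) for a, b in zip(ref, got))
print("max error vs direct DFT:", max_err)
```

With n1 = 8, step 1 produces eight independent sub-FFTs, which is one way the work could be spread over eight cores. Data movement between steps would likely dominate on real hardware and is not modeled here.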