FFT benchmark on c6678

Arunmoezhi Ramachandran

I'm trying to gather some benchmarks on FFT in c6678.

This document http://www.ti.com/lit/pdf/sprt578 has data for c6670.

All the numbers here are for radix4 fft.
I'm looking for numbers for 1024 pt radix 2 single precision fft.

Also, this document says that there are 3 FFT coprocessors embedded in the C6670. Is this true for the C6678 as well?

This document ti/dsplib_c66x_3_2_0_1/docs/DSPLib_c66xTest_Report.html has data for DSPF_sp_fftSPxSP_66 for 256pt fft in cycles. Where can I find the results for 1024 pt fft?

Thanks,

Arun

over 11 years ago

0 Xiaohui Li over 11 years ago

TI__Intellectual 1870 points

Hi,

The document sprt578 describes the FFT coprocessor performance on C6670. C6678 does not have any FFT coprocessors.

You can benchmark FFT performance on C6678 using the FFT kernels provided in TI DSPLIB. The DSPLIB is part of the TI Multicore Software Development Kit (MCSDK).

Xiaohui

0 tscheck over 11 years ago

TI__Mastermind 23525 points

Just to be clear, the DSPLib numbers do not make use of the FFT coprocessor; they use the CPU, so performance is the same on the C6678 and the C6670. 1024pt fft for single precision floating point is around 6K cycles.

Regards,

Travis

0 Arunmoezhi Ramachandran over 11 years ago in reply to tscheck

Prodigy 50 points

tscheck said:
1024pt fft for single precision floating point is around 6K cycles

My code takes 160 us. The DSPLIB code takes 4.8 us (6000 cycles at 1250 MHz). That is a huge difference of about 33x. Should I turn on any specific optimization flags?
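The arithmetic behind this comparison can be checked with a short sketch. The function names are hypothetical, and the 1250 MHz clock is the one implied by the division above, not a measured value.

```c
/* Convert a cycle count to microseconds.
 * clock_mhz is the core clock in MHz, i.e. cycles per microsecond. */
double cycles_to_us(double cycles, double clock_mhz)
{
    return cycles / clock_mhz;
}

/* How many times slower a measured time is than a reference time. */
double slowdown(double measured_us, double reference_us)
{
    return measured_us / reference_us;
}
```

Calling `cycles_to_us(6000.0, 1250.0)` gives 4.8 us, and `slowdown(160.0, 4.8)` comes out a little over 33.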

0 Xiaohui Li over 11 years ago in reply to Arunmoezhi Ramachandran

TI__Intellectual 1870 points

Hi,

There are various techniques for optimizing DSP code. Turning on optimization flags is one of the options. App notes for the C6678, such as Optimizing Loops on the C66x DSP, give details on techniques for optimizing TI DSP code.

Also all the kernels from TI DSPLIB come with the source code. They show exactly how each kernel is implemented to achieve better performance.

Xiaohui

0 Jeff Brower73 over 11 years ago in reply to tscheck

Genius 3420 points

Travis-

At 1.2 GHz, 6000 cycles would be about 5 usec. The C6670 product brief (http://www.ti.com/lit/ml/sprt578b/sprt578b.pdf) indicates 14.6 usec for a 2048 pt FFT, so we might say 6-7 usec for a 1024 pt FFT. But that figure uses the C6670 FFT co-processors -- is one C6678 core really faster?

Please confirm the 6000 cycle figure -- is there some doc for this? Thanks.
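One way to arrive at the 6-7 usec estimate is to scale the 2048 pt figure by N*log2(N). This is a minimal sketch that assumes FFT cost grows as N*log2(N), which is a common rule of thumb and not something stated in the product brief. The function names are hypothetical.

```c
/* Integer log2 for power-of-two sizes (1024 -> 10, 2048 -> 11). */
int ilog2_int(int n)
{
    int bits = 0;
    while (n > 1) {
        n >>= 1;
        bits++;
    }
    return bits;
}

/* Scale a known FFT time to another size, assuming cost ~ N * log2(N).
 * This is an approximation, not a measured relationship. */
double scale_fft_time_us(double known_us, int known_n, int target_n)
{
    double known_work  = (double)known_n  * ilog2_int(known_n);
    double target_work = (double)target_n * ilog2_int(target_n);
    return known_us * target_work / known_work;
}
```

With the numbers above, `scale_fft_time_us(14.6, 2048, 1024)` lands between 6.6 and 6.7 usec, inside the 6-7 usec range.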

-Jeff

0 tscheck over 11 years ago in reply to Jeff Brower73

TI__Mastermind 23525 points

Jeff,

I went back and found some independent internal emails with the following measurements on our EVM:

1024pt FFT Single Precision Complex floating point - 6863 cycles

Another one indicates 6632 cycles.

The one I referenced earlier indicates 6100 cycles, but I'm not 100% sure that was measured on an EVM like the above numbers. That should provide you a good estimate. The DSPLib kernels are highly optimized.

Regards,

Travis

0 Jeff Brower73 over 11 years ago in reply to tscheck

Genius 3420 points

Travis-

Ok thanks, we'll go with 6500 cycles.
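For reference, the 6500 figure is close to the plain average of the three measurements quoted in this thread (6863, 6632 and 6100 cycles). A small sketch, with a hypothetical helper name:

```c
/* Arithmetic mean of a list of cycle counts. */
double mean_cycles(const double *samples, int count)
{
    double sum = 0.0;
    for (int i = 0; i < count; i++) {
        sum += samples[i];
    }
    return sum / count;
}
```

Feeding it the three values gives a mean just under 6532 cycles.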

One other question -- we don't see any dsplib function that optimizes a 1D FFT using all 8 cores, for example decomposition into smaller FFT sizes followed by recombination. Is there anything we're missing?
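To make the question concrete, here is a minimal sketch of the kind of decomposition meant: a radix-2 decimation-in-time split into two half-size transforms that could run on separate cores, followed by a recombination step. It is an illustration only and not a DSPLIB API. The type `cplx`, the function names, and the use of a naive DFT for the sub-transforms are all assumptions; a real implementation would call an optimized kernel for each half. The twiddle table `w` is assumed to hold w[k] = exp(-2*pi*i*k/n) for k = 0..n-1.

```c
typedef struct { float re, im; } cplx;

static cplx c_add(cplx a, cplx b) { cplx r = { a.re + b.re, a.im + b.im }; return r; }
static cplx c_sub(cplx a, cplx b) { cplx r = { a.re - b.re, a.im - b.im }; return r; }
static cplx c_mul(cplx a, cplx b)
{
    cplx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* Naive n-point DFT of x[0], x[stride], x[2*stride], ...
 * Twiddles come from a table of size n_tab (n must divide n_tab). */
void dft_strided(const cplx *x, int stride, int n,
                 const cplx *w, int n_tab, cplx *out)
{
    int step = n_tab / n;               /* table stride for an n-point DFT */
    for (int k = 0; k < n; k++) {
        cplx acc = { 0.0f, 0.0f };
        for (int j = 0; j < n; j++) {
            acc = c_add(acc, c_mul(x[j * stride], w[((k * j) % n) * step]));
        }
        out[k] = acc;
    }
}

/* n-point DFT built from two n/2-point DFTs plus a recombination pass.
 * even and odd are scratch buffers of n/2 elements each. */
void fft_split2(const cplx *x, int n, const cplx *w,
                cplx *even, cplx *odd, cplx *out)
{
    int half = n / 2;
    dft_strided(x,     2, half, w, n, even);  /* could run on one core  */
    dft_strided(x + 1, 2, half, w, n, odd);   /* could run on another   */
    for (int k = 0; k < half; k++) {          /* recombination butterflies */
        cplx t = c_mul(w[k], odd[k]);
        out[k]        = c_add(even[k], t);
        out[k + half] = c_sub(even[k], t);
    }
}
```

The same split can be applied again to each half to spread the work over 4 or 8 cores, at the cost of one more recombination pass per level.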

Thanks.

-Jeff

0 tscheck over 11 years ago in reply to Jeff Brower73

TI__Mastermind 23525 points

DSPLib functions are all single-core kernels. We have some code examples that use multiple cores to perform larger FFTs, from 8K to 1024K points. See: http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/303599.aspx

Regards,

Travis
