Good day experts,
I am having trouble achieving the FFT benchmarks specified for the C66x DSPLIB.
My setup is as follows:
- TMDSEVM6657LS EVM, using only core 0, clocked at 1000 MHz
- C66x DSPLIB v3.4.0.0
- CCS v5.5
- Codegen tools v7.6.0 (-o3 optimization)
- No operating system
I am benchmarking the DSP_fft16x32() routine with all code and data in L2 internal RAM, and with L1P and L1D configured as cache. I execute the routine 100000 times in a for-loop, using the MUST_ITERATE pragma to ensure that it's not optimized out, and measure the total duration with a timer.
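A minimal sketch of a timing loop of this kind, written in portable C so it can be tried on a host machine. It is not the exact code used on the EVM: `kernel_stub` is a hypothetical stand-in for the DSP_fft16x32() call, `time_per_call` is an illustrative helper name, and the standard `clock()` function stands in for the hardware timer. The volatile `sink` variable is one common way to keep the compiler from removing the loop body.

```c
#include <time.h>

/* Written to by the stub so the compiler cannot discard the calls. */
static volatile unsigned long sink;

/* Hypothetical stand-in for the routine under test. On the target this
   would be the DSP_fft16x32() call with its real arguments. */
static void kernel_stub(void)
{
    sink += 1;
}

/* Run fn() `iterations` times and return the average time per call in
   seconds. The result includes loop and call overhead. */
static double time_per_call(void (*fn)(void), unsigned long iterations)
{
    unsigned long i;
    clock_t start = clock();

    for (i = 0; i < iterations; ++i) {
        fn();
    }

    clock_t stop = clock();
    return (double)(stop - start) / (double)CLOCKS_PER_SEC
           / (double)iterations;
}
```

Calling `time_per_call(kernel_stub, 100000UL)` gives the average time per call. Timing an empty stub the same way and subtracting it is one way to estimate how much of the figure is loop overhead rather than the routine itself.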
According to the benchmarks specified for the DSPLIB:
- 128 point FFT takes 813 cycles, i.e. 0.813 us at 1000 MHz CPU clock
- 256 point FFT takes 1469 cycles, i.e. 1.469 us at 1000 MHz CPU clock
However, I measure the following:
- 128 point FFT (100000 iterations) takes 116 ms, i.e. 1.16 us per FFT
- 256 point FFT (100000 iterations) takes 213 ms, i.e. 2.13 us per FFT
I understand that my setup involves some overhead that is not present in the benchmarks supplied with the DSPLIB, but I am more than 40% slower in both cases. I ran similar benchmarks for the C64x+ DSPLIB on the C6748, and the measurements were within 5% of the specified benchmarks.
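The arithmetic behind these figures can be checked with a few lines of C. This is only a sketch of the unit conversions; the helper names `cycles_to_us`, `per_call_us` and `slowdown_percent` are made up for illustration and are not part of the DSPLIB.

```c
/* Convert a cycle count to microseconds at a CPU clock given in MHz.
   At 1000 MHz, 813 cycles correspond to 0.813 us. */
static double cycles_to_us(double cycles, double clock_mhz)
{
    return cycles / clock_mhz;
}

/* Average time per call in microseconds, given the total duration in
   milliseconds of `iterations` back-to-back calls. */
static double per_call_us(double total_ms, unsigned long iterations)
{
    return total_ms * 1000.0 / (double)iterations;
}

/* How much slower the measurement is than the reference, in percent. */
static double slowdown_percent(double measured_us, double reference_us)
{
    return (measured_us / reference_us - 1.0) * 100.0;
}
```

With the numbers above, `slowdown_percent(per_call_us(116.0, 100000UL), cycles_to_us(813.0, 1000.0))` comes to roughly 43% for the 128 point FFT, and the same calculation with 213 ms and 1469 cycles gives roughly 45% for the 256 point FFT.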
Are there perhaps some configuration settings I am missing here? Can someone please provide advice?
Thanks in advance.