Hi, I'm using the TMS320C6748 LCDK, which has a C6748 DSP on it.
I need a very fast FFT, so rather than using my own implementation, I thought of using the well-optimized FFT in the C674x DSPLIB. The benchmarks section of spru657.pdf (the manual for the TMS320C6000 DSP Library) says that a call to DSPF_sp_fftSPxSP(...) for a 64-point FFT takes only 598 clock cycles:
Benchmarks
cycles = 3 * ceil(log4(N)-1) * N + 21 * ceil(log4(N)-1) + 2*N + 44
e.g., N = 1024, cycles = 14464
e.g., N = 512, cycles = 7296
e.g., N = 256, cycles = 2923
e.g., N = 128, cycles = 1515
e.g., N = 64, cycles = 598
Code size (in bytes): 1440
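As a sanity check on the quoted formula, here is a minimal C sketch that evaluates it on a host PC. The helper names ceil_log4 and fft_cycles_estimate are hypothetical (they are not part of DSPLIB), and the sketch assumes N >= 4. It only reproduces the manual's arithmetic; it does not measure anything on the DSP.

```c
#include <assert.h>

/* Smallest integer s such that 4^s >= n, i.e. ceil(log4(n)) for n >= 1. */
unsigned ceil_log4(unsigned long n)
{
    unsigned s = 0;
    unsigned long p = 1;
    while (p < n) {
        p *= 4;
        ++s;
    }
    return s;
}

/*
 * Cycle estimate from the manual's benchmark formula:
 *   cycles = 3*S*N + 21*S + 2*N + 44,  with S = ceil(log4(N) - 1).
 * Since 1 is an integer, ceil(log4(N) - 1) equals ceil(log4(N)) - 1.
 * Assumption: n >= 4, so that S is not negative.
 */
unsigned long fft_cycles_estimate(unsigned long n)
{
    assert(n >= 4);
    unsigned long s = ceil_log4(n) - 1;
    return 3 * s * n + 21 * s + 2 * n + 44;
}
```

Calling fft_cycles_estimate(64) gives 598 and fft_cycles_estimate(1024) gives 14464, matching the figures listed above.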
But when I call DSPF_sp_fftSPxSP(...) in my example, the CCS Clock tool measures a whopping 23,926 CPU cycles for a 64-point FFT, roughly 40 times the documented figure. Why is there such a disparity? Am I missing something?
Can anyone kindly explain how I can get a 64-point FFT that uses the fewest clock cycles possible, or at least comes close to the 598 clock cycles given in the manual?
Thank you.
Vikram