DSP lib FFT/IFFT benchmarks with C674x simulator

laurent poyart

Hi,

I'm using the Device Cycle Accurate simulator (little endian) for the C6747 DSP.

I use the profiler to benchmark the fixed-point FFT/IFFT routines of the C64x+ DSPLIB.

With a 64-point FFT, I get 182 cycles for DSP_fft16x16 and 158 cycles for DSP_ifft16x16.

But when I look at spruec5.pdf (TMS320C64x+ DSP Big-Endian DSP Library Programmer’s Reference), the benchmark formula for the FFT/IFFT is (6 * nx/8 + 19) * ceil[log4(nx) - 1] + 8*nx/8 + 30 cycles. For a 64-point FFT, I should get 224 cycles. In sprueb8b.pdf (TMS320C64x+ DSP Little-Endian DSP Library Programmer’s Reference), the benchmarks are given in a table, with 242 cycles for DSP_fft16x16 (SA assembly implementation).
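For reference, the formula quoted above can be evaluated with a few lines of C. This is only a sketch: the function names are made up for illustration, and it assumes the formula is transcribed exactly as it appears in the post. Evaluated that way it gives 228 cycles for nx = 64 rather than the 224 mentioned, so one of the constants may read differently in the PDF, which remains the authority.

```c
/* Smallest integer s such that 4^s >= nx, i.e. ceil(log4(nx)) for nx >= 1. */
int ceil_log4(unsigned nx)
{
    int s = 0;
    unsigned long p = 1;

    while (p < nx) {
        p *= 4;
        s++;
    }
    return s;
}

/* Cycle estimate using the formula as quoted in the post (an assumption):
 *   (6*nx/8 + 19) * ceil[log4(nx) - 1] + 8*nx/8 + 30
 * Note that ceil(x - 1) == ceil(x) - 1, so the stage count is ceil_log4 - 1. */
unsigned quoted_fft_cycles(unsigned nx)
{
    unsigned stages = (unsigned)(ceil_log4(nx) - 1);

    return (6u * nx / 8u + 19u) * stages + 8u * nx / 8u + 30u;
}
```

Calling quoted_fft_cycles(64) makes it easy to compare the documented estimate against the profiler and unit-test figures for the same transform size.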

I link my code with dsplib.a64P (using DSP_fft16x16_sa.sa). My data is 8-byte aligned, and the code is in internal memory (no L2 cache).

I'm surprised to get a better result in the profiler. What could be wrong? Do the results and the formula depend on the target architecture (parallel execution capabilities) and on the compilation options?

Regards.

Laurent.

over 12 years ago

0 Yimin Zhang over 12 years ago

TI__Intellectual 1690 points

Hi Laurent,

Do you know what is the DSPLIB version number you are using?

Regards,

Yimin

0 laurent poyart over 12 years ago in reply to Yimin Zhang

Prodigy 180 points

Hi Yimin,

The DSPLIB version is 3_1_1_1. I didn't recompile the library; I use the one provided in the downloaded package as is.

I use Code Composer Studio v4.1.3 with TI cgtools v7.3.1.

Regards.

Laurent.

DSPLIB 3.1.1.1 Release Notes

October 10, 2012

0 Yimin Zhang over 12 years ago in reply to laurent poyart

TI__Intellectual 1690 points

Hi Laurent,

The documents spruec5.pdf and sprueb8b.pdf do not apply to DSPLIB release 3.1.1.1. They apply to versions earlier than 2.1. You can find the appropriate cycle formulas in the test report under the ...\dsplib_c64Px_3_1_1_1\docs\ directory. For the FFT functions, we did not list a formula. Here is the output I got from the DSP_fft16x16 unit test (little-endian ELF).

DSP_fft16x16   Iter#: 1   Result Successful (y_i) Result Successful (y_sa)    Radix = 4   N = 16   natC: 295   intC: 115   SA: 100
DSP_fft16x16   Iter#: 2   Result Successful (y_i) Result Successful (y_sa)    Radix = 2   N = 32   natC: 632   intC: 168   SA: 158
DSP_fft16x16   Iter#: 3   Result Successful (y_i) Result Successful (y_sa)    Radix = 4   N = 64   natC: 1130   intC: 262   SA: 238
DSP_fft16x16   Iter#: 4   Result Successful (y_i) Result Successful (y_sa)    Radix = 2   N = 128   natC: 2749   intC: 529   SA: 514
DSP_fft16x16   Iter#: 5   Result Successful (y_i) Result Successful (y_sa)    Radix = 4   N = 256   natC: 5355   intC: 1011   SA: 930
Memory: 928 bytes
Cycles: 529 (N=128) 1011 (N=256)

For 64 points, the linear assembly version takes 238 cycles, so your number is a little low. I would suggest that you run the unit test first. You can build the project in the ...\dsplib_c64Px_3_1_1_1\packages\ti\dsplib\src\DSP_fft16x16\c64P\DSP_fft16x16_64P_LE_ELF directory. I use CCS version 5.2.1 and CG tools 7.2.4. The CCS version should not matter, but the CG tools version does. The release 3.1.1.1 object libraries were built with the 7.2.4 tools. From personal experience, I would expect similar results from the 7.3.1 tools.

Generally speaking, source code may change between releases, and each release usually uses different CG tools, so we expect cycle counts to change from release to release. You should always use the cycle formula for your specific release as a guideline. Starting with DSPLIB 3.0.0, we provide the formula in the test report.

regards,

Yimin

0 laurent poyart over 12 years ago in reply to Yimin Zhang

Prodigy 180 points

Hi Yimin,

Using a C6748 EVM, I get the cycle counts you provided above. I have not yet run the unit test on the simulator to see whether I get the same results. I will let you know as soon as I have run it.

Why is the library provided with several implementations (natural C, intrinsic, assembly)? Wouldn't you always recommend the assembly version, since it gives the best performance?

Regards.

Laurent

0 Yimin Zhang over 12 years ago in reply to laurent poyart

TI__Intellectual 1690 points

Hi Laurent,

The natural C implementation is our reference in the unit test. The other implementations are tested against the natural C result. It also helps users understand the algorithm, because the optimized implementations are not as readable. Ideally, the intrinsic C implementation is our only optimized implementation. Only when we cannot achieve an optimal result with intrinsics do we use a linear assembly or assembly implementation. Because the compiler also evolves, the performance of the intrinsic C implementation can sometimes improve from one release to the next. If it performs better than the assembly consistently over a few releases, we remove the assembly implementation. That's why you see three implementations for some kernels.
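As an illustration of testing an optimized kernel against the natural C reference, the comparison step could look like the sketch below. The function name and the tolerance parameter are hypothetical and are not taken from the DSPLIB test sources, which may compare the output buffers differently.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns the index of the first element where the
 * optimized output differs from the reference by more than tol, or -1 if
 * all n elements agree within the tolerance. */
long first_mismatch(const short *ref, const short *opt, size_t n, int tol)
{
    size_t i;

    for (i = 0; i < n; i++) {
        int diff = (int)ref[i] - (int)opt[i];

        if (abs(diff) > tol)
            return (long)i;
    }
    return -1;
}
```

With tol set to 0 this is an exact, bit-for-bit check; a small nonzero tolerance would only make sense if the implementations were allowed to round differently.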

regards,

Yimin

0 laurent poyart over 12 years ago in reply to Yimin Zhang

Prodigy 180 points

Hi Yimin,

Thank you for the clarification and for your help.

I've run some new tests using the simulator:

Running the FFT unit test project of the DSPLIB, I get the same values as in your previous post (time measurements are based on the clock function).
In my project, where I link with dsplib.a64P:

Using the TSCL/TSCH registers, I also get the same values as in your post.
In the profiler output, I don't get the same values (even if I check the option "profile TI librairies"). The values are too low.

Finally, in my project, if I include the DSPLIB source files (natural C and assembly) rather than linking with the DSPLIB library, I get the right values. I think this is because my project uses the --systemdebug=skeletal option for program analysis (needed for profiling), whereas the DSPLIB CCS project doesn't use this option, so the DSPLIB functions are not profiled correctly. Do you agree with this analysis?

Regards.

Laurent

0 Yimin Zhang over 12 years ago in reply to laurent poyart

TI__Intellectual 1690 points

Hi Laurent,

I am not very familiar with the "--systemdebug=skeletal" option you used. Typically, we use the clock function or the timer registers to get cycle counts, which is quite reliable.
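A minimal sketch of the timer-register approach follows. It assumes the TI C6000 compiler, where c6x.h declares the TSCL/TSCH control registers and _TMS320C6X is predefined; on any other compiler it falls back to clock() so that the code still builds. The helper names and dummy_kernel are hypothetical, with dummy_kernel standing in for the real FFT call.

```c
#include <stdint.h>
#include <time.h>

#ifdef _TMS320C6X
#include <c6x.h>   /* declares the TSCL/TSCH time-stamp counter registers */
#endif

/* Start the free-running counter; on C64x+/C674x any write to TSCL starts it. */
void cycle_counter_start(void)
{
#ifdef _TMS320C6X
    TSCL = 0;
#endif
}

/* Read the 64-bit counter. Reading TSCL latches the upper half into TSCH. */
uint64_t cycle_counter_read(void)
{
#ifdef _TMS320C6X
    uint32_t lo = TSCL;
    uint32_t hi = TSCH;
    return ((uint64_t)hi << 32) | lo;
#else
    return (uint64_t)clock();   /* host fallback, not cycle accurate */
#endif
}

/* Hypothetical stand-in for the kernel under test: counts its own calls. */
void dummy_kernel(void *arg)
{
    int *calls = (int *)arg;
    (*calls)++;
}

/* Counter ticks spent in fn(arg), minus the cost of two back-to-back reads. */
uint64_t measure_cycles(void (*fn)(void *), void *arg)
{
    uint64_t t0, t1, overhead, elapsed;

    cycle_counter_start();
    t0 = cycle_counter_read();
    t1 = cycle_counter_read();
    overhead = t1 - t0;

    t0 = cycle_counter_read();
    fn(arg);
    t1 = cycle_counter_read();
    elapsed = t1 - t0;

    return (elapsed > overhead) ? (elapsed - overhead) : 0;
}
```

Subtracting the read overhead keeps small kernels, such as a 64-point FFT, from being dominated by the cost of the measurement itself.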

regards,

Yimin
