DSP lib FFT/IFFT benchmarks with C674x simulator

Hi,

I'm using the Device Cycle Accurate simulator (little endian) for the C6747 DSP.

I use the profiler to benchmark the fixed-point FFT/IFFT functions of the C64x+ DSPLIB.

With a 64-point FFT, I get 182 cycles for DSP_fft16x16 and 158 cycles for DSP_ifft16x16.

But in spruec5.pdf (TMS320C64x+ DSP Big-Endian DSP Library Programmer’s Reference), the benchmark formula for the FFT/IFFT is (6 * nx/8 + 19) * ceil[log4(nx) - 1] + 8*nx/8 + 30 cycles. For a 64-point FFT, I should get 224 cycles. In sprueb8b.pdf (TMS320C64x+ DSP Little-Endian DSP Library Programmer’s Reference), the benchmarks are given in a table, with 242 cycles for DSP_fft16x16 (SA assembly implementation).
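
For reference, the formula quoted above can be evaluated mechanically. The sketch below is only an illustration: the function names are hypothetical, and the formula is taken exactly as written in this post, not checked against spruec5.pdf. Evaluated this way, nx = 64 gives (48 + 19) * 2 + 64 + 30 = 228 cycles, slightly different from the 224 quoted, so either the formula or the figure may have been transcribed with a small error.

```c
/* Smallest k such that 4^k >= nx, i.e. ceil(log4(nx)) for nx >= 1. */
static int ceil_log4(unsigned nx)
{
    int k = 0;
    unsigned long p = 1;
    while (p < nx) {
        p *= 4;
        k++;
    }
    return k;
}

/*
 * Cycle estimate from the formula as quoted in this post:
 *   (6 * nx/8 + 19) * ceil[log4(nx) - 1] + 8 * nx/8 + 30
 * Note that ceil(x - 1) == ceil(x) - 1, so integer arithmetic is enough.
 * Hypothetical helper, not part of DSPLIB.
 */
static long fft16x16_formula_cycles(unsigned nx)
{
    long stages = ceil_log4(nx) - 1;
    return (6L * nx / 8 + 19) * stages + 8L * nx / 8 + 30;
}
```

Calling fft16x16_formula_cycles(64) returns 228 under these assumptions.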

I link my code with dsplib.a64P (which uses DSP_fft16x16_sa.sa). My data is 8-byte aligned and the code is in internal memory (no L2 cache).

I'm surprised to get a better result in the profiler. What could be wrong? Do the results/formula depend on the target architecture (parallel execution capabilities) and on the compilation options?

Regards.

Laurent.

  • Hi Laurent,

    Do you know which DSPLIB version you are using?

    Regards,

    Yimin

  • Hi Yimin,

    The DSPLIB version is 3_1_1_1. I didn't recompile the library; I use the library provided in the downloaded package directly.

    I use Code Composer Studio v4.1.3 with TI CG tools v7.3.1.

    Regards.

    Laurent.

    DSPLIB 3.1.1.1 Release Notes

    October 10, 2012

  • Hi Laurent,

    The documents spruec5.pdf and sprueb8b.pdf do not apply to DSPLIB release 3.1.1.1; they apply to versions earlier than 2.1. You can find the appropriate cycle formula in the test report under the ...\dsplib_c64Px_3_1_1_1\docs\ directory. For the FFT functions, we did not list a formula. Here is the output I got from the DSP_fft16x16 unit test (little-endian ELF).

    DSP_fft16x16    Iter#: 1    Result Successful (y_i) Result Successful (y_sa)     Radix = 4    N = 16    natC: 295    intC: 115    SA: 100
    DSP_fft16x16    Iter#: 2    Result Successful (y_i) Result Successful (y_sa)     Radix = 2    N = 32    natC: 632    intC: 168    SA: 158
    DSP_fft16x16    Iter#: 3    Result Successful (y_i) Result Successful (y_sa)     Radix = 4    N = 64    natC: 1130    intC: 262    SA: 238
    DSP_fft16x16    Iter#: 4    Result Successful (y_i) Result Successful (y_sa)     Radix = 2    N = 128    natC: 2749    intC: 529    SA: 514
    DSP_fft16x16    Iter#: 5    Result Successful (y_i) Result Successful (y_sa)     Radix = 4    N = 256    natC: 5355    intC: 1011    SA: 930
    Memory:  928 bytes
    Cycles:  529 (N=128) 1011 (N=256)

    For 64 points, the linear assembly version takes 238 cycles, so your number is a little low. I suggest running the unit test first. You can compile the project under the ...\dsplib_c64Px_3_1_1_1\packages\ti\dsplib\src\DSP_fft16x16\c64P\DSP_fft16x16_64P_LE_ELF directory. I use CCS version 5.2.1 and CG tools 7.2.4. The CCS version should not matter, but the CG tools version does. The release 3.1.1.1 object libraries were created with the 7.2.4 tools. From personal experience, I would expect similar results from the 7.3.1 tools.

    Generally speaking, source code may change between releases, and each release usually uses different CG tools, so we expect cycle counts to change from release to release. You should always use the cycle formula for your specific release as a guideline. Starting with DSPLIB 3.0.0, we provide the formula in the test report.

    regards,

    Yimin

  • Hi Yimin,

    Using a C6748 EVM, I get the cycle counts you provided above. I have not yet run the unit test on the simulator to see whether I get the same results. I will let you know as soon as I have run it.

    Why is the library provided with several implementations (natural C, intrinsic, assembly)? Wouldn't you always recommend the assembly version, since it gives the best performance?

    Regards.

    Laurent

  • Hi Laurent,

    The natural C implementation is our reference in the unit test; the other implementations are tested against the natural C result. It also helps users understand the algorithm, since the optimized implementations are not as readable. Ideally, the intrinsic C implementation is our only optimized implementation. We use a linear assembly or assembly implementation only when we cannot achieve optimal results with intrinsics. Because the compiler also evolves, the performance of the intrinsic C implementation can sometimes improve over releases. If it consistently performs better than the assembly over a few releases, we remove the assembly implementation. That's why you see three implementations for some kernels.
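
    As a rough illustration of testing an optimized kernel against a reference result, the sketch below compares two 16-bit output buffers element by element. It is not the DSPLIB unit test code; the function name and the bit-exact comparison are assumptions made for the example.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /*
     * Compare the output of an optimized kernel with the reference output.
     * Returns -1 when all n samples are identical, otherwise the index of
     * the first sample that differs. Hypothetical helper, not part of DSPLIB.
     */
    static long first_mismatch(const int16_t *ref, const int16_t *opt, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i++) {
            if (ref[i] != opt[i]) {
                return (long)i;
            }
        }
        return -1;
    }
    ```

    A test harness would run the natural C version and the intrinsic or assembly version on the same input, then report success only if first_mismatch returns -1.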

    regards,

    Yimin

  • Hi Yimin,

    Thank you for the clarifications and for your help.

    I've run some new tests using the simulator:

    • Running the DSPLIB FFT unit test project, I get the same values as in your previous post (time measurements are based on the clock function).
    • In my project, where I link with dsplib.a64P:
        • Using the TSCL/TSCH registers, I also get the same values as in your post.
        • In the profiler output, I don't get the same values (even if I check the option "profile TI libraries"). The values are too low.
    • Finally, in my project, if I include the DSPLIB source files (natural C and assembly) rather than linking with the DSPLIB library, I get the right values. I think this is because my project uses the --symdebug:skeletal option for program analysis (needed for profiling), whereas the DSPLIB CCS project doesn't use this option, so the DSPLIB functions are not profiled correctly. Do you agree with this analysis?

    Regards.

    Laurent

  • Hi Laurent,

    I am not very familiar with the "--symdebug:skeletal" option you used. We typically use the clock function or the timer registers to get cycle counts, which is quite reliable.
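
    A minimal sketch of this kind of measurement is shown below. It assumes the TI C6000 compiler, where c6x.h declares the TSCL/TSCH time stamp counter registers, and falls back to the standard clock function on any other compiler (in which case the unit is clock ticks, not CPU cycles). The names counter_start, counter_read, measure_cycles and empty_kernel are hypothetical, and the overhead correction is a simple estimate, not the method used by the DSPLIB unit tests.

    ```c
    #include <stdint.h>

    #if defined(_TMS320C6X)
    #include <c6x.h>  /* TI header declaring the TSCL/TSCH control registers */

    /* Any write to TSCL starts the free-running 64-bit time stamp counter. */
    static void counter_start(void) { TSCL = 0; }

    /* Reading TSCL snapshots the upper half into TSCH, so read TSCL first. */
    static uint64_t counter_read(void)
    {
        uint32_t lo = TSCL;
        uint32_t hi = TSCH;
        return ((uint64_t)hi << 32) | lo;
    }
    #else
    #include <time.h>

    /* Host fallback (assumption for illustration): clock() ticks. */
    static void counter_start(void) { }
    static uint64_t counter_read(void) { return (uint64_t)clock(); }
    #endif

    typedef void (*kernel_fn)(void *ctx);

    /* Placeholder kernel; replace with a wrapper that calls the function under test. */
    static void empty_kernel(void *ctx) { (void)ctx; }

    /* Returns the elapsed count for one call of fn, minus the read overhead. */
    static uint64_t measure_cycles(kernel_fn fn, void *ctx)
    {
        uint64_t t0, t1, t2, overhead, elapsed;

        counter_start();
        t0 = counter_read();
        t1 = counter_read();   /* back-to-back reads estimate the overhead */
        fn(ctx);
        t2 = counter_read();

        overhead = t1 - t0;
        elapsed = t2 - t1;
        return (elapsed > overhead) ? (elapsed - overhead) : 0;
    }
    ```

    To time a DSPLIB kernel, the wrapper passed as fn would call it with buffers supplied through ctx; the first call may include cache or memory warm-up effects, so repeated measurements are usually more representative.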

    regards,

    Yimin