AM2732: DSPLIB FFT clock cycles count

Part Number: AM2732

Hi,

When I use DSP_fft16x16 from the provided TI DSPLIB and measure the clock cycles it takes to complete the FFT, the result is always higher than the cycle count given in the Texas Instruments document "Test Results DSPLIB 3.4.0.0 C66x" (it comes with the MCU SDK installation and is located here: mcu_plus_sdk_am273x_09_02_00_60/source/dsplib_c66x_3_4_0_0/docs/DSPLIB_C66x_TestReport.html).

The size of my FFT is N=256. According to your document it should take 743 cycles, but my measurement says 1149.

 
    CycleCounterP_reset();
    uint32_t cpu_cycles_start = CycleCounterP_getCount32();
    DSP_fft16x16(w_16x16, N, x_16x16, y_16x16);
    uint32_t cpu_cycles_end = CycleCounterP_getCount32();
    uint32_t cpuCycles = cpu_cycles_end - cpu_cycles_start;
    DebugP_log("Cycles: %d!\r\n",cpuCycles);

 

Am I doing something wrong?

 

Thank you in advance,

Konstantinos

  • Hi Konstantinos,
    Please expect a response soon.

    Thanks,
    Shreyansh

  • Hi Konstantinos,

    Are you building the code with release profile or debug profile?

    Are you linking to the prebuilt library in DSPLIB?

    Where are your code and data sections located in the memory (MSS_L2, DSS_L2 or DSS_L3)?

    Best regards,

    Ming

  • Hi,

    I am building with the debug profile. I tried building with release profile but no difference in the cycle count was observed.

    I am linking to the prebuilt library in DSPLIB. Otherwise my code wouldn't build. In the C6000 Linker File Search Path I have added the dsplib.lib file.

    If my linking process is wrong please let me know.

    And also added the path in the C6000 compiler Include Options.

    Then after linking the dsplib.lib file I include the library inside my code like so: 

    #include <ti/dsplib/dsplib.h>
    Regarding the memory everything is inside DSS_L2 memory. Both the code and the data.
    Thank you.
    K.
     
  • Hi Konstantinos,

    I did the same test (loop 10x) with the release mode. Here are my results:

    [C66xx_DSP] All tests have passed!!
    Cycles: 1129!
    Cycles: 992!
    Cycles: 992!
    Cycles: 992!
    Cycles: 992!
    Cycles: 992!
    Cycles: 992!
    Cycles: 992!
    Cycles: 992!
    Cycles: 992!

    My guess is that the cache is cold for the first run. With a warm cache, the result is more accurate. I know it is still not 743 cycles. I do not know what platform the DSPLIB team ran their benchmark on, but on AM273x this is the number we can get: 992 cycles.

    Best regards,

    Ming
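
The cold-versus-warm effect described above can be isolated by timing the same call several times and reporting the first pass separately from the later ones. The following is a minimal, host-compilable sketch, not TI code: `bench_run`, `bench_result_t`, `cycle_fn_t`, and `work_fn_t` are hypothetical names introduced for illustration. On the target, the counter argument would presumably be `CycleCounterP_getCount32` and the workload a small wrapper around `DSP_fft16x16`. The `sim_*` stand-ins simply replay the cycle counts from the log above so the helper can be exercised off-target; they are not new measurements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper types, not part of the MCU+ SDK or DSPLIB. */
typedef uint32_t (*cycle_fn_t)(void);  /* on target: presumably CycleCounterP_getCount32 */
typedef void (*work_fn_t)(void *ctx);  /* on target: a wrapper that calls DSP_fft16x16 */

typedef struct {
    uint32_t first;     /* pass 0: caches are likely cold */
    uint32_t warm_min;  /* smallest count among passes 1..runs-1
                           (stays UINT32_MAX if runs < 2) */
} bench_result_t;

/* Time `work` `runs` times and keep the first pass apart from the rest. */
static bench_result_t bench_run(cycle_fn_t now, work_fn_t work,
                                void *ctx, size_t runs)
{
    assert(now != NULL && work != NULL);
    bench_result_t r = { 0u, UINT32_MAX };
    for (size_t i = 0; i < runs; i++) {
        uint32_t start = now();
        work(ctx);
        /* Unsigned subtraction stays correct across one counter wrap. */
        uint32_t elapsed = now() - start;
        if (i == 0) {
            r.first = elapsed;
        } else if (elapsed < r.warm_min) {
            r.warm_min = elapsed;
        }
    }
    return r;
}

/* Host-only stand-ins (assumptions for illustration only). */
static uint32_t sim_clock;
static uint32_t sim_now(void) { return sim_clock; }

/* Advances the simulated clock: 1129 on the first call, 992 afterwards,
   mirroring the figures reported in the thread. */
static void sim_work(void *ctx)
{
    unsigned *calls = (unsigned *)ctx;
    sim_clock += (*calls == 0u) ? 1129u : 992u;
    (*calls)++;
}
```

Calling `bench_run(sim_now, sim_work, &calls, 10)` with `calls` initialised to 0 fills `first` with the cold-pass count and `warm_min` with the steady-state count. Reporting the minimum of the warm passes, rather than an average, keeps a stray interrupt from inflating the figure.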

  • Hi Ming,

    Thank you very much. Is the way I am measuring the cycles accurate or is there a calling overhead in CycleCounterP_getCount32() that I need to compensate for? Or is that calling overhead compensated when I am doing after_cycles - before_cycles?

    I hope what I am saying is clear.

    Thanks again.

    Best regards,

    K.

  • Hi Konstantinos,

    Your measurement method is correct. The CycleCounterP_getCount32 overhead is about 13 cycles. This is what I got after removing the measurement overhead:

    [C66xx_DSP] All tests have passed!!
    Cycles: 1120!
    Cycles: 979!
    Cycles: 979!
    Cycles: 979!
    Cycles: 979!
    Cycles: 979!
    Cycles: 979!
    Cycles: 979!
    Cycles: 979!
    Cycles: 979!

    Best regards,

    Ming

  • Would we see any better performance if we moved the data into L1 cache/SRAM? I opened a related thread about that ( AM2732: Configure L1D Cache as SRAM ), where you said performance would get worse, and it did indeed get worse when we tried placing the FFT vectors in L1D SRAM. What I am asking is whether there is any way to get the performance promised in the DSPLIB benchmarks.

    Also, since you compute the total as after_cycles - before_cycles, isn't the calling overhead of CycleCounterP_getCount32() automatically compensated by that subtraction?

    Thanks again.

    K.

  • Hi Konstantinos,

    First of all, 979 is the best number I got on the AM273x EVM. I do not know of a way to improve it further.

    Secondly, the 979 figure already accounts for the CycleCounterP_getCount32 overhead, which is about 13 cycles.

    Best regards,

    Ming

  • Hi Ming,

    I don't think you understood my question regarding the calling overhead of CycleCounterP_getCount32().

    Here is my code once again:

    cpu_cycles_start = CycleCounterP_getCount32();

    DSP_fft16x16(w_16x16, N, x_16x16, y_16x16);

    cpu_cycles_end = CycleCounterP_getCount32();

    cpuCycles = cpu_cycles_end - cpu_cycles_start;

    Both variables cpu_cycles_start & cpu_cycles_end include the 13 cycles calling overhead that CycleCounterP_getCount32() has.
    So at the end when I am doing:

    cpuCycles = cpu_cycles_end - cpu_cycles_start;
    it is basically like doing this:
    cpuCycles = (TRUE_cpu_cycles_end + 13) - (TRUE_cpu_cycles_start + 13);
    So basically what I am asking is if my thinking is correct and if the +13 at each measurement cancels out eventually and I don't need to manually do it in the end.
    Best regards,
    K.
  • Hi Konstantinos,

    CycleCounterP_getCount32() takes about 13 cycles to return the timestamp. The start timestamp is therefore accurate, but the end timestamp is taken 13 cycles after the FFT actually finishes. The measured value is cpuCycles = (TRUE_cpu_cycles_end + 13) - TRUE_cpu_cycles_start, which is why the 13-cycle overhead has to be subtracted.

    Best regards,

    Ming
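
One way to settle how much of a measurement belongs to the counter call itself, rather than reasoning about it, is to time an empty region: two back-to-back reads with nothing between them. Whatever that returns is included in every start/end pair and can be subtracted from later results. A minimal sketch follows. `measure_overhead`, `net_cycles`, and `counter_fn_t` are hypothetical helper names, and the `fake_*` counter exists only so the sketch can be compiled and checked on a host; it assumes a fixed cost of 13 per read, taken from the figure quoted in this thread. On the AM273x the counter would presumably be `CycleCounterP_getCount32`, and the calibrated value may differ from 13 depending on build options and code placement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter type; on target: presumably CycleCounterP_getCount32. */
typedef uint32_t (*counter_fn_t)(void);

/* Smallest difference between two back-to-back reads over `tries` attempts.
   Returns UINT32_MAX if `tries` is 0. */
static uint32_t measure_overhead(counter_fn_t read_counter, unsigned tries)
{
    assert(read_counter != NULL);
    uint32_t best = UINT32_MAX;
    for (unsigned i = 0; i < tries; i++) {
        uint32_t start = read_counter();
        uint32_t end = read_counter();
        uint32_t delta = end - start;  /* cost of one start/end pair around nothing */
        if (delta < best) {
            best = delta;
        }
    }
    return best;
}

/* Remove the calibrated overhead from a raw measurement, clamping at zero. */
static uint32_t net_cycles(uint32_t raw, uint32_t overhead)
{
    return (raw > overhead) ? (raw - overhead) : 0u;
}

/* Host-only simulated counter (assumption): every read costs 13 cycles. */
#define FAKE_READ_COST 13u
static uint32_t fake_count;
static uint32_t fake_read(void)
{
    fake_count += FAKE_READ_COST;
    return fake_count;
}
```

With the simulated counter, `measure_overhead(fake_read, 8)` yields the per-pair cost, and `net_cycles(992u, overhead)` reproduces the subtraction used to go from 992 to 979. Taking the minimum over several attempts filters out occasional cache misses or interrupts during calibration.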