FIR performance: Benchmark results are strange

Hi

I have a TMS320DM8148 on a TMDXEVM8148BTA evaluation board and use a LAN560 JTAG emulator to connect to the processors via the 20-pin JTAG port. I'm currently trying to set up a benchmark framework to measure the computational power of the C674x DSP.

Currently I'm comparing the results of my framework to the results from the profiler. The two agree with each other, but some of the results contradict everything else I have gathered:

I use several different FIR algorithms from the C64x+ DSPLIB (v3.0.0.8), namely gen, r4 and r8, to filter a random array of 1048576 shorts with a random filter of 128 short coefficients. The results from the profiler:

       CPU Cycles    CPU.NOP      CPU.access.data.read   CPU.access.data.write
gen    31'981'603    1'835'013    33'554'433             262'144
r4     39'321'629    1'572'871    33'554'432             262'144
r8     39'583'777    1'572'870    20'971'520             262'144
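
For reference, the operation that all three variants perform can be sketched in plain C as below. This is only a minimal sketch under stated assumptions: `fir_ref` is a hypothetical name and not the DSPLIB source, and the Q15 scaling (`>> 15`), the 64-bit accumulator and the missing saturation are choices made for illustration.

```c
#include <assert.h>

/* Plain C reference FIR (hypothetical helper, NOT the DSPLIB code).
 * x: nr + nh - 1 input samples
 * h: nh filter coefficients
 * r: nr output samples
 * Assumption: Q15 data, so the sum is scaled back with >> 15.
 * No saturation is applied. */
void fir_ref(const short *x, const short *h, short *r, int nh, int nr)
{
    assert(nh > 0 && nr > 0);
    for (int j = 0; j < nr; j++) {
        long long sum = 0;                     /* wide accumulator avoids overflow */
        for (int i = 0; i < nh; i++)
            sum += (long long)x[i + j] * h[i]; /* one multiply-accumulate */
        r[j] = (short)(sum >> 15);             /* back to 16 bit (Q15 assumed) */
    }
}
```

The inner statement runs nr * nh times. Assuming 1048576 outputs and 128 taps, that is 134'217'728 multiplications per filter run.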

The results of my framework match the profiler's results. What puzzles me is: how can r8 be the slowest algorithm? It is supposed to be the most optimized, has the most restrictions, the lowest NOP count and the fewest data reads, yet it needs the most processor cycles.

The performance of r8 is really bad:

       Multiplications per Clock
gen    4.197
r4     3.413
r8     3.371
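
The figures above can be approximated with a small helper like the following. It is a sketch that assumes the metric is (outputs * taps) / CPU cycles, which reproduces the gen and r4 values from the cycle counts in the profiler table. `mults_per_cycle` is a hypothetical name.

```c
#include <assert.h>

/* Multiplications per CPU cycle for a FIR with nr outputs and nh taps.
 * Assumption: every output needs exactly nh multiplications. */
double mults_per_cycle(long long nr, long long nh, long long cycles)
{
    assert(cycles > 0);
    return (double)(nr * nh) / (double)cycles;
}
```

For example, `mults_per_cycle(1048576, 128, 31981603)` gives roughly 4.197 for gen.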

 

 

Compiler and linker options: -mv6740 -g -O3 --program_level_compile --abi=coffabi --call_assumptions=2 -z --reread_libs --rom_model

IPC:           1.23.3.31
SYS/BIOS:      6.32.3.43
XDCtools:      3.22.2.27
CCS:           5.1.0.07001
C64x+ DSPLIB:  3.0.0.8

  • Christian,

    This issue does not appear to be BIOS related, so I am moving this thread to the DM81x device forum to see if it can get a faster response there.

    Dave

  • Dave,

    Thank you for this; it got into the BIOS forum by accident.

    As for the performance: I found that the r8 code is not optimized for the C64x+ processor and uses only C64x instructions, whereas r4 is optimized for the C64x+.

  • I have now managed to write an FIR filter implementation optimized for my processor (I called it a8):

           CPU Cycles    CPU.NOP      CPU.access.data.read   CPU.access.data.write
    gen    31'981'603    1'835'013    33'554'433             262'144
    a8     27'000'859    262'194      29'360'128             262'144

    The read access count of a8 still can't compete with r8, but a8 is much faster overall. My guess is that it could get even faster if I manage to reduce the read accesses to the level of r8.
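
    To put the read counts side by side, one can relate them to the amount of work done. The helper below is a sketch with a hypothetical name (`mults_per_read`). It assumes 1048576 * 128 multiplications per run and that CPU.access.data.read counts individual read accesses.

    ```c
    #include <assert.h>

    /* Multiplications performed per data read access.
     * Assumption: nr outputs with nh multiplications each. */
    double mults_per_read(long long nr, long long nh, long long reads)
    {
        assert(reads > 0);
        return (double)(nr * nh) / (double)reads;
    }
    ```

    With the profiler numbers from this thread, r8 (20'971'520 reads) comes out at 6.4 multiplications per read and a8 (29'360'128 reads) at about 4.57.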

           Multiplications per Clock
    gen    4.197
    a8     4.971

    But I'm closing this, since no one seems to be interested...