Hi
I have a TMS320DM8148 on a TMDXEVM8148BTA evaluation board and am using a LAN560 JTAG emulator to connect to the processors via the 20-pin JTAG port. I'm currently trying to set up a benchmark framework to measure the computational power of the C674x DSP.
I'm currently comparing the results of my framework with the results from the profiler. Most of the results are consistent, but some contradict everything else I have gathered:
I use several FIR implementations from the C64x+ DSPLIB (v3.0.0.8), namely gen, r4 and r8, to filter an array of 1048576 random short samples with a filter of 128 random short coefficients. The results from the profiler:
Algorithm | CPU Cycles | CPU.NOP | CPU.access.data.read | CPU.access.data.write |
gen | 31'981'603 | 1'835'013 | 33'554'433 | 262'144 |
r4 | 39'321'629 | 1'572'871 | 33'554'432 | 262'144 |
r8 | 39'583'777 | 1'572'870 | 20'971'520 | 262'144 |
The results of my framework match the profiler's results. What puzzles me is: how can r8 be the slowest algorithm? It is supposed to be the most optimized, has the most restrictions, the lowest NOP count and the fewest data reads, yet it needs the most processor cycles.
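For reference, here is a minimal plain-C sketch of the computation I understand all three kernels to perform. It is not the DSPLIB source: the name fir_ref is hypothetical, and the x[i + j] * h[j] indexing, the Q15 scaling (shift by 15) and the requirement that x holds nr + nh - 1 samples are my assumptions about the DSPLIB convention.

```c
#include <stddef.h>

/* Hypothetical reference FIR in portable C (not taken from DSPLIB).
 * x : input samples, must hold nr + nh - 1 elements
 * h : filter coefficients, nh elements, assumed to be Q15
 * r : output samples, nr elements
 * Assumes the coefficients are scaled so that each sum fits in 32 bits. */
void fir_ref(const short *x, const short *h, short *r, int nh, int nr)
{
    for (int i = 0; i < nr; i++) {
        int sum = 0;
        /* One output sample costs nh multiply-accumulates. */
        for (int j = 0; j < nh; j++) {
            sum += x[i + j] * h[j];
        }
        /* Scale the Q30 product sum back to Q15. */
        r[i] = (short)(sum >> 15);
    }
}
```

With nr = 1048576 and nh = 128 the inner statement runs nr * nh times, which is the amount of work each of gen, r4 and r8 has to get through in this benchmark.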
The performance of r8 is really bad:
Algorithm | Multiplications per Clock |
gen | 4.197 |
r4 | 3.413 |
r8 | 3.371 |
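A small sketch of how such a figure can be derived from the cycle counts, assuming the number of multiplications is simply (number of outputs) * (number of taps) = 1048576 * 128. The helper name mults_per_cycle is hypothetical and not part of any library.

```c
/* Hypothetical helper: multiplications per CPU cycle for an FIR run.
 * nr     : number of output samples
 * nh     : number of filter taps
 * cycles : measured CPU cycles for the whole run
 * Assumes exactly nr * nh multiplications are performed. */
double mults_per_cycle(long long nr, long long nh, long long cycles)
{
    /* Multiply in 64-bit integers first so the product cannot overflow. */
    long long mults = nr * nh;
    return (double)mults / (double)cycles;
}
```

Under this assumption the cycle counts for gen and r4 reproduce the values in the table (about 4.197 and 3.413). For r8 the same formula gives roughly 3.39 rather than 3.371, so that entry may have been derived differently or may contain a typo.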
Build options: -mv6740 -g -O3 --program_level_compile --abi=coffabi --call_assumptions=2 -z --reread_libs --rom_model
IPC: 1.23.3.31
SYS/BIOS: 6.32.3.43
XDCtools: 3.22.2.27
CCS: 5.1.0.07001
C64x+ DSPLIB: 3.0.0.8