FIR performance: Benchmark results are strange

christian bach

I have a TMS320DM8148 on a TMDXEVM8148BTA evaluation board and am using a LAN560 JTAG emulator to connect to the processor via the 20-pin JTAG port. I'm currently trying to set up a benchmark framework to measure the computing power of the C674x DSP.

Currently I'm comparing the results of my framework to the results from the profiler. Although the two agree with each other, some of the results contradict the rest of the gathered data:

I use several different FIR routines from the C64x+ DSPLIB (v3.0.0.8), namely the gen, r4 and r8 variants, to filter an array of 1048576 random short values with a filter of 128 random short coefficients. The results from the profiler:

	CPU Cycles	CPU.NOP	CPU.access.data.read	CPU.access.data.write
gen	31'981'603	1'835'013	33'554'433	262'144
r4	39'321'629	1'572'871	33'554'432	262'144
r8	39'583'777	1'572'870	20'971'520	262'144
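For reference, the operation being benchmarked can be sketched in plain C as follows. This is only an illustration of a generic 16-bit FIR, not the DSPLIB source: the name `fir_ref` is hypothetical, and the sketch assumes Q15 scaling of the result and an input array holding `nr + nh - 1` samples.

```c
/* Plain-C reference FIR on 16-bit data (illustrative sketch, not DSPLIB code).
 * Assumptions: Q15 result scaling (>> 15), x holds nr + nh - 1 samples,
 * and >> on a negative int is an arithmetic shift (true on common compilers). */
void fir_ref(const short *x, const short *h, short *r, int nh, int nr)
{
    for (int j = 0; j < nr; j++) {
        int sum = 0;                    /* 32-bit accumulator */
        for (int i = 0; i < nh; i++)
            sum += x[i + j] * h[i];     /* one multiply-accumulate per tap */
        r[j] = (short)(sum >> 15);      /* scale back to 16 bits */
    }
}
```

With nr = 1048576 outputs and nh = 128 taps, the inner statement runs 1048576 × 128 times, which is the multiplication count the cycle numbers are measured against.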

The results of my framework match the profiler's results. What puzzles me is: how can r8 be the slowest algorithm? It is supposed to be the most optimized, has the most restrictions, the lowest NOP count and the fewest data reads, yet it needs the most processor cycles.

The performance of r8 is really bad:

	Multiplications per Clock
gen	4.197
r4	3.413
r8	3.391
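The figures in this table can be derived from the cycle counts. A minimal sketch, assuming one multiplication per tap per output sample (1048576 outputs × 128 taps); the function name `macs_per_cycle` is hypothetical:

```c
/* Multiplications per CPU cycle for an FIR run that produces nr outputs
 * with nh taps each. Assumes exactly nr * nh multiplications in total. */
double macs_per_cycle(long long nr, long long nh, long long cycles)
{
    return (double)(nr * nh) / (double)cycles;
}
```

For example, `macs_per_cycle(1048576, 128, 31981603)` gives the value for gen from the profiler's cycle count.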

Compiler options: -mv6740 -g -O3 --program_level_compile --abi=coffabi --call_assumptions=2 -z --reread_libs --rom_model

IPC: 1.23.3.31

SYS/BIOS: 6.32.3.43

XDCtools: 3.22.2.27

CCS: 5.1.0.07001

C64x+ DSPLIB: 3.0.0.8

over 14 years ago

0 David Friedland over 14 years ago

TI__Mastermind 18320 points

Christian,

This issue does not appear to be BIOS related, so I am moving this thread to the DM81x device forum to see if it can get a faster response there.

Dave

0 christian bach over 14 years ago in reply to David Friedland

Prodigy 200 points

Dave,

Thank you for this; it got into the BIOS forum by accident...

As for the performance: I found that the r8 code is not optimized for the C64x+ processor and uses only C64x instructions, whereas r4 is optimized for the C64x+.

0 christian bach over 14 years ago in reply to christian bach

Prodigy 200 points

I have now managed to write an implementation of an FIR filter optimized for my processor (I called it a8):

	CPU Cycles	CPU.NOP	CPU.access.data.read	CPU.access.data.write
gen	31'981'603	1'835'013	33'554'433	262'144
a8	27'000'859	262'194	29'360'128	262'144

The number of read accesses still can't compete with r8, but a8 is nevertheless much faster. My guess is that it could get even faster if I manage to reduce the read accesses to the level of r8.
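One common way to reduce read accesses in an FIR is to compute several outputs per pass, so that each loaded coefficient and input sample is reused. The following plain-C sketch illustrates the idea only; it is not the a8 or r8 code, the name `fir_block4` is hypothetical, and it assumes `nr` is a multiple of 4, Q15 result scaling and an input of `nr + nh - 1` samples.

```c
/* Register-blocked FIR sketch: 4 outputs per outer iteration.
 * Each tap needs one coefficient load and one new input load, which then
 * feed 4 multiply-accumulates (versus 2 loads per multiply in a naive loop).
 * Assumptions: nr % 4 == 0, Q15 scaling, x holds nr + nh - 1 samples. */
void fir_block4(const short *x, const short *h, short *r, int nh, int nr)
{
    for (int j = 0; j < nr; j += 4) {
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        /* Sliding window of input samples held in locals */
        int x0 = x[j], x1 = x[j + 1], x2 = x[j + 2];
        for (int i = 0; i < nh; i++) {
            int c  = h[i];           /* one coefficient load per tap */
            int x3 = x[j + i + 3];   /* one new input load per tap */
            s0 += x0 * c;
            s1 += x1 * c;
            s2 += x2 * c;
            s3 += x3 * c;
            x0 = x1; x1 = x2; x2 = x3;   /* advance the window */
        }
        r[j]     = (short)(s0 >> 15);
        r[j + 1] = (short)(s1 >> 15);
        r[j + 2] = (short)(s2 >> 15);
        r[j + 3] = (short)(s3 >> 15);
    }
}
```

Whether the compiler keeps the window in registers and software-pipelines the inner loop on the C674x is not guaranteed; the profiler's read counter would have to confirm any gain.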

	Multiplications per Clock
gen	4.197
a8	4.971

But I'm closing this, since no one seems to be interested...
