Hello,
We are working on a dra7xx-evm (OMAP5777) board with the following setup:
1. ipc_3_23_00_01
2. bios_6_37_03_30
3. xdctools_3_25_06_96
4. CCS5.5
5. dsplib_c66x_3_4_0_0
6. mathlib_c66x_3_1_0_0
We are running Linux on the ARM core and SYS/BIOS on the DSP core.
We tried to profile a simple vector addition call (DSPF_sp_vecadd) on the DRA7xx DSP1 core running at 600 MHz with SYS/BIOS v6.37.
The vector length is 40000 float values.
According to the TI-provided benchmark, this call takes (3/4 * N + 24) cycles, which for N = 40000 is (3/4 * 40000 + 24) = 30024 cycles.
Since DSP1 runs at 600 MHz (0.600 cycles per ns), this corresponds to 30024 / 0.600 = 50040 ns = 50.040 microseconds.
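For reference, the arithmetic above can be written as a small C sketch. The function names are made up for illustration and are not part of DSPLIB; the formula is simply the benchmark figure quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Benchmark formula quoted above for DSPF_sp_vecadd: 3/4 * N + 24 cycles. */
uint64_t vecadd_benchmark_cycles(uint64_t n)
{
    return (3u * n) / 4u + 24u;
}

/* Convert a cycle count to nanoseconds for a core clock given in MHz. */
double cycles_to_ns(uint64_t cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return (double)cycles * 1000.0 / clock_mhz;
}

/* Ratio of measured time to expected time (both in the same unit). */
double slowdown_factor(double measured, double expected)
{
    assert(expected > 0.0);
    return measured / expected;
}
```

With N = 40000 and a 600 MHz clock this gives 30024 cycles and 50040 ns, and comparing a measured 541.3 microseconds against 50.04 microseconds gives a factor of roughly 10.8.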
But when we profile the same call in our code, it takes 541.3 microseconds! That is more than 10 times the benchmark figure. We compiled our code with the -O3 optimization flag. The mathlib and dsplib libraries are also compiled with -O3.
Please note: we used clock and timestamp calls to profile the code. Beforehand, we verified the accuracy of the timestamp calls by profiling them against a Task_sleep of 1 sec, so we are confident in the profiled figures.
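The measurement pattern is roughly the following, shown here as a host-side C sketch rather than our actual DSP code. It is only an illustration: now_ns() uses the standard C11 timespec_get in place of the SYS/BIOS timestamp call, vecadd_ref() is a plain C stand-in for DSPF_sp_vecadd, and all three function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Host-side stand-in for the DSP timestamp call (assumption, not a TI API). */
int64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (int64_t)ts.tv_sec * 1000000000LL + (int64_t)ts.tv_nsec;
}

/* Plain C reference addition; a stand-in for DSPF_sp_vecadd. */
void vecadd_ref(const float *x, const float *y, float *r, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        r[i] = x[i] + y[i];
    }
}

/* Time one call, subtracting the cost of two back-to-back timestamp reads. */
int64_t time_vecadd_ns(const float *x, const float *y, float *r, size_t n)
{
    /* Estimate the overhead of the timestamp call itself. */
    int64_t t0 = now_ns();
    int64_t t1 = now_ns();
    int64_t overhead = t1 - t0;

    /* Bracket the call under test with two timestamp reads. */
    int64_t start = now_ns();
    vecadd_ref(x, y, r, n);
    int64_t stop = now_ns();

    /* Clamp at zero in case the overhead estimate exceeds the interval. */
    int64_t elapsed = (stop - start) - overhead;
    return elapsed > 0 ? elapsed : 0;
}
```

TIME_UTC is a wall-clock source, not a cycle counter, so this sketch only shows the shape of the measurement; it says nothing about the numbers on the DSP.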
Please let us know how we can improve the DSPLIB throughput. The addition call above is one piece of the algorithm our application uses. This algorithm has multiple addition, multiplication, FFT, sqrt and various other vector operations. All of these calls show degraded performance, nowhere near the benchmark figures.
Please shed some light on this.
Thanks,
Naveen Shetti