Can someone confirm the benchmarks published for the C674x DSPF_sp_mat_mul (DSPLib 3.1.0.0)?
| DSPF_sp_mat_mul_674LE_LE_ELF_c674-LE-ELF-CGT:7.2.4 | 1 | (C674xCPUCycleAccurateSim-LE) | 4.2.3.00004 | Passed | " 1/2*r1*c2*c1 + 12/2*r1*c2 + 9/2*r1 + 42" | " 864 bytes" |
For r1 = 32, c1 = 64, c2 = 1, I'm getting about 2658 cycles, compared to the 1402 that the formula in the docs gives. The code and data are in L1. I'm linking against dsplib.ae674 from the distribution; when I compile the library myself with optimizations on (cl6x 7.4.1), I get slightly worse results. I'm using TSC to count cycles.
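For anyone checking the arithmetic, here is a minimal sketch that evaluates the cycle formula from the test-report row above for a given matrix shape. The function name is made up for illustration and is not part of DSPLib; the formula itself is copied from the quoted row.

```c
/* Cycle estimate from the quoted DSPLib test-report row:
 *   1/2*r1*c2*c1 + 12/2*r1*c2 + 9/2*r1 + 42
 * The function name is hypothetical (not a DSPLib symbol).
 * The fractional terms are summed in doubled form and halved once,
 * so the result is exact whenever r1 is even. */
long dspf_sp_mat_mul_predicted_cycles(long r1, long c1, long c2)
{
    long twice = r1 * c2 * c1   /* 2 * (1/2  * r1*c2*c1) */
               + 12 * r1 * c2   /* 2 * (12/2 * r1*c2)    */
               + 9 * r1;        /* 2 * (9/2  * r1)       */
    return twice / 2 + 42;
}
```

With r1 = 32, c1 = 64, c2 = 1 this gives 1024 + 192 + 144 + 42 = 1402, so the measured 2658 cycles is roughly 1.9 times the documented figure.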
The docs don't mention anything about bank alignment of the buffers, and it's hard to work out the best alignment once the loop has been software-pipelined. (It would be great if the compiler made suggestions in cases where the code doesn't assert any bank alignment.) I tried a few combinations of bank alignments, to no avail.
I wrote specialized code for my particular case (double-word-aligned buffers, one loop removed, inner loop unrolled by 4, the matrix padded with zeros where necessary, interrupt threshold -1) and got the count down to 1700 cycles, but I'd still like to know whether the docs are wrong.
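A minimal portable sketch of this kind of specialization for the c2 = 1 case (matrix times vector), not the code that was measured. The function name, the row-major layout, and the `c1_padded` parameter are assumptions for illustration; TI-specific alignment assertions, pragmas, and the interrupt-threshold setting are left out so the sketch compiles with any C99 compiler.

```c
#include <stddef.h>

/* Hypothetical helper: y = A * x, with A stored row-major as
 * r1 rows of c1_padded floats. Assumes c1_padded is a multiple of 4
 * and that the padding entries of A (and of x) are zero, so they add
 * nothing to the dot products. With c2 = 1 the loop over output
 * columns disappears, leaving one loop over rows and one over k. */
void mat_vec_mul_unroll4(const float *a, const float *x, float *y,
                         int r1, int c1_padded)
{
    for (int i = 0; i < r1; i++) {
        const float *row = a + (size_t)i * (size_t)c1_padded;

        /* Four independent accumulators, one per unrolled lane. */
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

        for (int k = 0; k < c1_padded; k += 4) {
            s0 += row[k]     * x[k];
            s1 += row[k + 1] * x[k + 1];
            s2 += row[k + 2] * x[k + 2];
            s3 += row[k + 3] * x[k + 3];
        }
        y[i] = (s0 + s1) + (s2 + s3);
    }
}
```

Splitting the sum across four accumulators changes the order of the floating-point additions, so results can differ in the last bits from a straight left-to-right sum.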
Thanks