C674x DSPF_sp_mat_mul benchmark

Can someone confirm the benchmarks published for the C674x DSPF_sp_mat_mul (DSPLib 3.1.0.0)?

DSPF_sp_mat_mul_674LE_LE_ELF_c674-LE-ELF-CGT:7.2.4 1 (C674xCPUCycleAccurateSim-LE) 4.2.3.00004 Passed "1/2*r1*c2*c1 + 12/2*r1*c2 + 9/2*r1 + 42" "864 bytes"

For r1 = 32, c1 = 64, c2 = 1, I'm measuring about 2658 cycles, compared to the 1402 suggested by the docs. The code and data are both in L1. I'm linking against dsplib.ae674 from the distribution; when I compile the library myself with optimizations on (cl6x 7.4.1), I get slightly worse results. I'm using the TSC to count cycles.
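As a sanity check on the arithmetic, the published formula can be evaluated for these dimensions. The following is a minimal sketch in plain C; the function name `predicted_cycles` is made up for illustration and is not part of DSPLib.

```c
#include <assert.h>

/* Hypothetical helper (not part of DSPLib): evaluates the published
 * cycle formula  1/2*r1*c2*c1 + 12/2*r1*c2 + 9/2*r1 + 42.
 * The three halved terms are summed before dividing by 2 so that
 * integer division does not truncate each term separately. */
long predicted_cycles(long r1, long c1, long c2)
{
    assert(r1 > 0 && c1 > 0 && c2 > 0);
    return (r1 * c2 * c1 + 12 * r1 * c2 + 9 * r1) / 2 + 42;
}
```

For r1 = 32, c1 = 64, c2 = 1 this gives 1024 + 192 + 144 + 42 = 1402, so the documented figure does follow from the formula, and the measured 2658 is roughly 1.9 times that.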

The docs don't mention anything about bank alignment of the buffers, and it's hard to work out the best alignment once the loop has been software-pipelined (it would be great if the compiler made suggestions in cases where the code doesn't assert any bank alignment). I tried a few combinations of bank alignments, to no avail.

I wrote specialized code for my particular case (double-word-aligned buffers, one loop removed, inner loop unrolled by 4, matrix padded with zeros when necessary, interrupt threshold -1) and got the count down to 1700 cycles, but I'd still like to know if the docs are wrong.
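The kind of specialization described above can be sketched in portable C as follows. This is only an illustration, not the code that was benchmarked: the function name `mat_vec_mul_unroll4`, the row-major layout, and the requirement that c1 be a multiple of 4 are assumptions. The alignment assertions and the interrupt-threshold setting are compiler-specific and are left out.

```c
#include <assert.h>

/* Illustrative sketch: y = x1 * x2 for the c2 == 1 case.
 * x1 is r1 x c1 in row-major order, x2 has c1 elements, y has r1 elements.
 * Assumes c1 is a multiple of 4 (zero-pad the rows and x2 otherwise). */
void mat_vec_mul_unroll4(const float *x1, int r1, int c1,
                         const float *x2, float *y)
{
    int i, k;

    assert(r1 > 0 && c1 > 0 && c1 % 4 == 0);

    for (i = 0; i < r1; i++) {
        const float *row = x1 + i * c1;
        /* Four independent partial sums, one per unrolled lane. */
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

        for (k = 0; k < c1; k += 4) {
            s0 += row[k]     * x2[k];
            s1 += row[k + 1] * x2[k + 1];
            s2 += row[k + 2] * x2[k + 2];
            s3 += row[k + 3] * x2[k + 3];
        }
        y[i] = (s0 + s1) + (s2 + s3);
    }
}
```

With c2 fixed at 1 the column loop disappears, which is the "removed one loop" part. Because the partial sums are accumulated in a different order than in a straight loop, the floating-point results may differ in the last bits from the library routine.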

Thanks