Hello,
I'm working with CCS v6.1.3 on a TMS320C6678 EVM (C6678 multicore DSP), and I'm trying to optimize the performance of a matrix multiplication, specifically a 24 x 660 matrix times a 660 x 2 matrix. To test this, I modified the code included in DSPLIB for DSPF_sup_mat_mul_66_LE_ELF to use those sizes. The calculated cycle count for this operation is 12378, but the measured count is 22369, nearly double (about 1.8x). Here are some additional details about my setup:
- The project is built in Release mode with the highest compiler optimization level (-O3)
- The timer functions for the clock cycles are the same as those included in the DSPLIB file
- The L1D cache and L1P cache are both enabled and set to 32 KB
- Compiler version TI v8.1.0
- Using SYS/BIOS 6.45.1.29, DSPLIB 3.4.0.0, MATHLIB 3.1.2.1
- All memory is placed in L2SRAM
- Running on Windows 7 64-bit, Service Pack 1
So here's my question: What's the best way to bring the measured number of clock cycles closer to the calculated number of clock cycles for this particular case?
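For reference, here is a minimal sketch of the kind of measurement described above. It is not the DSPLIB test code: `ref_mat_mul`, `read_cycles`, `start_counter`, and `measure_mat_mul` are hypothetical names, the multiply is a plain reference loop rather than the optimized kernel, and the inputs are filled with 1.0 purely for illustration. It assumes the TI compiler defines `_TMS320C6X` and that `c6x.h` exposes the `TSCL` time-stamp counter; on any other host it falls back to `clock()` so the structure can still be exercised. The points it illustrates are subtracting the counter read overhead and doing one warm-up call, so that the timed call does not include cold L1 cache misses.

```c
#include <stdio.h>
#include <time.h>

#ifdef _TMS320C6X
#include <c6x.h>   /* assumed to declare the TSCL time-stamp counter */
static void start_counter(void) { TSCL = 0; }   /* any write starts the counter */
static unsigned long long read_cycles(void) { return TSCL; }
#else
/* Host fallback: clock() ticks stand in for CPU cycles. */
static void start_counter(void) { }
static unsigned long long read_cycles(void) { return (unsigned long long)clock(); }
#endif

#define R1 24    /* rows of the first matrix */
#define C1 660   /* columns of the first matrix = rows of the second */
#define C2 2     /* columns of the second matrix */

static float x1[R1 * C1], x2[C1 * C2], y[R1 * C2];

/* Plain reference multiply (hypothetical helper): out = a (r1 x c1) * b (c1 x c2),
 * all matrices stored row-major. */
static void ref_mat_mul(const float *a, int r1, int c1,
                        const float *b, int c2, float *out)
{
    int i, j, k;
    for (i = 0; i < r1; i++) {
        for (j = 0; j < c2; j++) {
            float sum = 0.0f;
            for (k = 0; k < c1; k++)
                sum += a[i * c1 + k] * b[k * c2 + j];
            out[i * c2 + j] = sum;
        }
    }
}

/* Returns the counter delta of one warm-cache multiply, minus the read overhead. */
static unsigned long long measure_mat_mul(void)
{
    unsigned long long t0, t1, overhead, elapsed;
    int i;

    for (i = 0; i < R1 * C1; i++) x1[i] = 1.0f;
    for (i = 0; i < C1 * C2; i++) x2[i] = 1.0f;

    start_counter();

    /* Cost of two back-to-back counter reads, subtracted from the result below. */
    t0 = read_cycles();
    t1 = read_cycles();
    overhead = t1 - t0;

    /* Warm-up call: brings code and data into L1 before the timed call. */
    ref_mat_mul(x1, R1, C1, x2, C2, y);

    t0 = read_cycles();
    ref_mat_mul(x1, R1, C1, x2, C2, y);
    t1 = read_cycles();

    elapsed = t1 - t0;
    return (elapsed > overhead) ? (elapsed - overhead) : 0;
}
```

Calling `measure_mat_mul()` from `main` and printing the result with `printf("%llu\n", ...)` gives the warm-cache figure. Timing the first call separately would show how much of the gap between calculated and measured cycles comes from cold caches rather than from the kernel itself.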