Part Number: TMS320C6678
Hi,
I'm trying to multiply two matrices of order 2k x 2k using the sp matrix multiply routine.
Both matrices are in DDR, and both L1 and L2 are configured fully as cache.
Single-core performance is too slow to meet my real-time requirement.
I want to parallelise the code across multiple cores using OpenMP.
I understand the single-core performance hit is due to non-aligned reads, which are an inherent problem of matrix multiply.
Is there a way to achieve better performance? I'm looking for performance close to the benchmark given for the DSPLIB matrix multiply routine.
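
For reference, a minimal sketch of the kind of multi-core split I have in mind. This is plain C, not the DSPLIB routine: the function name matmul_blocked and the tile size BLOCK are hypothetical, and the inner loops are only a stand-in for an optimised kernel. Each thread takes whole row blocks of the output, so no two cores write the same element, and the tiling is meant to keep the working set in cache rather than streaming from DDR.

```c
#include <stddef.h>

/* Hypothetical tile size; would need tuning so the tiles fit in cache. */
#define BLOCK 64

static int min_int(int a, int b) { return a < b ? a : b; }

/* C = A * B for n x n row-major single-precision matrices.
 * Row blocks of C are distributed across threads by OpenMP; without
 * OpenMP enabled the pragma is ignored and the code runs on one core. */
void matmul_blocked(const float *A, const float *B, float *C, int n)
{
    int ii;
    #pragma omp parallel for schedule(static)
    for (ii = 0; ii < n; ii += BLOCK) {
        int i_end = min_int(ii + BLOCK, n);
        int i, j, k, kk, jj;

        /* Clear the row block of C owned by this iteration. */
        for (i = ii; i < i_end; i++)
            for (j = 0; j < n; j++)
                C[(size_t)i * n + j] = 0.0f;

        for (kk = 0; kk < n; kk += BLOCK) {
            int k_end = min_int(kk + BLOCK, n);
            for (jj = 0; jj < n; jj += BLOCK) {
                int j_end = min_int(jj + BLOCK, n);
                /* Stand-in inner kernel: unit-stride access to B and C. */
                for (i = ii; i < i_end; i++) {
                    for (k = kk; k < k_end; k++) {
                        float a = A[(size_t)i * n + k];
                        for (j = jj; j < j_end; j++)
                            C[(size_t)i * n + j] += a * B[(size_t)k * n + j];
                    }
                }
            }
        }
    }
}
```

It would be called as matmul_blocked(A, B, C, 2048) with all three buffers in DDR. Whether this partitioning is the right one for the C6678, or whether the tiles should instead be copied into L2 SRAM first, is part of what I am asking.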