Hi, I have a performance problem with floating-point complex multiplication (I'm working with the 6670 DSP).
Using your cycle-approximate simulator, I found the following:
The "CPU cycles" count is very low thanks to the parallelism (ii = 4 for each loop iteration).
"Total cycles" is 10 times higher than "CPU cycles" because of "CPU.stall.mem.L1D".
Here is the code:
pS and pH contain __float2_t elements; r0 to r3 are __float2_t.
{ loop 1200 times; the pH matrix changes on every iteration }
_amemd8(&pS[0]) = _complex_mpysp(_amemd8(&pH[0][0][0]), r0);
_amemd8(&pS[1]) = _complex_mpysp(_amemd8(&pH[0][1][1]), r1);
_amemd8(&pS[2]) = _complex_mpysp(_amemd8(&pH[0][2][2]), r2);
_amemd8(&pS[3]) = _complex_mpysp(_amemd8(&pH[0][3][3]), r3);
{ end loop }
The whole problem seems to be related to the pH matrix. How can I optimise the "total cycles"?
It doesn't look like a cache bank issue, because the accessed pH elements are spaced 40 bytes apart.
Of course, maximum caching is enabled. pS and pH are mapped in L2 memory, data_aligned(8).
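
The 40-byte spacing mentioned above can be reproduced with a small host-side sketch. It is only an illustration under stated assumptions: the inner dimensions of pH are taken to be 4x4 (the post does not give them), each element is 8 bytes (the size of a __float2_t), and the 64-byte L1D line size should be checked against the device documentation. The names diag_offset and diag_line are hypothetical helpers, not part of any TI API.

```c
#include <stddef.h>

/* Size in bytes of one __float2_t element (two packed 32-bit floats). */
enum { ELEM_BYTES = 8 };

/* ASSUMPTION: each pH[n] is a 4x4 matrix; the post does not state this. */
enum { N = 4 };

/* ASSUMPTION: 64-byte L1D cache line; verify for the target device. */
enum { L1D_LINE_BYTES = 64 };

/* Byte offset of pH[n][k][k] from pH[n][0][0] in row-major layout. */
size_t diag_offset(size_t k)
{
    return (k * N + k) * ELEM_BYTES;
}

/* Cache line index of pH[n][k][k], counted from the start of the matrix
   (this ignores where the matrix itself starts within a line). */
size_t diag_line(size_t k)
{
    return diag_offset(k) / L1D_LINE_BYTES;
}
```

Under these assumptions the four accessed elements sit at byte offsets 0, 40, 80 and 120 from the start of each matrix, which matches the 40-byte spacing, and diag_line shows which line of the matrix each one falls in.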