Hi,
I've written some linear assembly code which calculates Hamming distances.
It loops over the same image line again and again and, every 3 cycles, issues 4 LDDW (cached in L1D) as well as 2 STDW (to L2SRAM):
LDDW *pusSrc1Ptr++[1], usLine1_32:usLine1_10
LDDW *pusSrc2Ptr++[1], usLine2_32:usLine2_10
XOR usLine1_10, usLine2_10, ucLineXor_10
XOR usLine1_32, usLine2_32, ucLineXor_32
BITC4 ucLineXor_10, ucBitCnt_10
BITC4 ucLineXor_32, ucBitCnt_32
ADD4 ucBitCnt_10, ucBitCnt_32, ucBitCnt_3210
DOTPU4 ucBitCnt_3210, dotpMask, usHamDist1
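For reference, here is a plain C sketch of what the sequence above computes for one pair of 64-bit loads. It is only meant to pin down the arithmetic, not the scheduling or the memory behaviour. The helper names (bitc4, add4, dotpu4, hamming64) are made up for illustration, and it assumes dotpMask is 0x01010101, so that DOTPU4 simply sums the four byte counts.

```c
#include <stdint.h>

/* Per-byte population count of a 32-bit word (models BITC4). */
static uint32_t bitc4(uint32_t x)
{
    uint32_t out = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t byte = (x >> (8 * i)) & 0xFFu;
        uint32_t cnt = 0;
        while (byte) {            /* count set bits in this byte */
            cnt += byte & 1u;
            byte >>= 1;
        }
        out |= cnt << (8 * i);
    }
    return out;
}

/* Byte-wise add with no carry between bytes (models ADD4). */
static uint32_t add4(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t s = ((a >> (8 * i)) & 0xFFu) + ((b >> (8 * i)) & 0xFFu);
        out |= (s & 0xFFu) << (8 * i);
    }
    return out;
}

/* Unsigned dot product of four packed bytes (models DOTPU4). */
static uint32_t dotpu4(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += ((a >> (8 * i)) & 0xFFu) * ((b >> (8 * i)) & 0xFFu);
    return sum;
}

/* Hamming distance between two 64-bit words, following the same
 * XOR / BITC4 / ADD4 / DOTPU4 steps as the linear assembly.
 * Assumption: dotpMask == 0x01010101. */
uint32_t hamming64(uint64_t line1, uint64_t line2)
{
    uint64_t x = line1 ^ line2;
    uint32_t xor_10 = (uint32_t)x;           /* low word  (usLine*_10) */
    uint32_t xor_32 = (uint32_t)(x >> 32);   /* high word (usLine*_32) */
    uint32_t cnt_3210 = add4(bitc4(xor_10), bitc4(xor_32));
    return dotpu4(cnt_3210, 0x01010101u);
}
```

Each byte of the ADD4 result is at most 16, so nothing overflows, and the final sum lies between 0 and 64.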
However, instead of ~1.5 cycles per pixel, the function takes ~2.3 cpp on our EVM6678, which is quite unfortunate as it's one of our most time-consuming functions.
Using the cycle approximate simulator, I get the following metrics:
61800 Cycles Total
53200 Cycles CPU
8600 CPU.stall.summary
8500 CPU.stall.L1D
6341 mem bank conflicts
17050 L2SRAM.data.write
I would be really grateful for hints as to what causes those L1D stalls, and for suggestions on how to avoid them.
I already had a look at the .mptr directive, but from what I've understood it's only useful for loads/stores <= 1 word, right?
Thank you in advance, Clemens