High CPU.stall.L1D value when loading with LDDW/STDW from L2SRAM with L1D-Cache enabled.

Hi,

I've written some linear assembly code that calculates Hamming distances.
It loops over the same image line again and again and, every 3 cycles, issues 4 LDDW (cached in L1D) as well as 2 STDW (to L2SRAM):


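        LDDW    *pusSrc1Ptr++[1], usLine1_32:usLine1_10     ; load 8 bytes of line 1
        LDDW    *pusSrc2Ptr++[1], usLine2_32:usLine2_10     ; load 8 bytes of line 2
        XOR     usLine1_10, usLine2_10, ucLineXor_10        ; differing bits, lower word
        XOR     usLine1_32, usLine2_32, ucLineXor_32        ; differing bits, upper word
        BITC4   ucLineXor_10, ucBitCnt_10                   ; per-byte bit count, lower word
        BITC4   ucLineXor_32, ucBitCnt_32                   ; per-byte bit count, upper word
        ADD4    ucBitCnt_10, ucBitCnt_32, ucBitCnt_3210     ; add the packed byte counts
        DOTPU4  ucBitCnt_3210, dotpMask, usHamDist1         ; weight the four byte counts by dotpMask and sum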
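
For readers without a C66x toolchain at hand, the data flow of the snippet above can be sketched in portable C. This is only an illustration, not the original code: the helper names `bitc4`, `add4`, `dotpu4` and `hamming64` are hypothetical stand-ins for the instructions, and `DOTP_MASK` is assumed to be 0x01010101 (the post does not show the value of `dotpMask`).

```c
#include <stdint.h>

/* Assumption: dotpMask holds 0x01010101 so that DOTPU4 sums the four bytes. */
#define DOTP_MASK 0x01010101u

/* Emulates BITC4: population count of each byte, packed into one word. */
uint32_t bitc4(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    return (x + (x >> 4)) & 0x0F0F0F0Fu;
}

/* Emulates ADD4 for this use case: each byte count is at most 8, so the
 * byte sums are at most 16 and a plain add cannot carry across bytes. */
uint32_t add4(uint32_t a, uint32_t b)
{
    return a + b;
}

/* Emulates DOTPU4: sum of the four unsigned byte-by-byte products. */
uint32_t dotpu4(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        sum += ((a >> (8 * i)) & 0xFFu) * ((b >> (8 * i)) & 0xFFu);
    }
    return sum;
}

/* Hamming distance of two 64-bit blocks, following the same steps as the
 * linear assembly: XOR, BITC4, ADD4, DOTPU4. */
uint32_t hamming64(uint64_t line1, uint64_t line2)
{
    uint32_t xor_10 = (uint32_t)(line1 ^ line2);          /* lower word */
    uint32_t xor_32 = (uint32_t)((line1 ^ line2) >> 32);  /* upper word */
    uint32_t cnt_3210 = add4(bitc4(xor_10), bitc4(xor_32));
    return dotpu4(cnt_3210, DOTP_MASK);
}
```

Such a reference version is mainly useful for checking the results of the optimized kernel; it says nothing about the stall behavior discussed here.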


However, instead of ~1.5 cycles per pixel, the function takes ~2.3 cycles per pixel on our EVM6678, which is quite unfortunate, as it's one of our most time-consuming functions.

Using the cycle-approximate simulator, I get the following metrics:
61800   Cycles Total
53200   Cycles CPU
8600    CPU.stall.summary
8500    CPU.stall.L1D
6341    mem bank conflicts
17050   L2SRAM.data.write

I would be really grateful for hints on what causes those L1D stalls, and for suggestions on how to avoid them.
I already had a look at the .mptr directive, but from what I've understood it's only useful for loads/stores <= 1 word, right?

Thank you in advance, Clemens

  • Clemens,

    A search of TI documentation shows that the C Compiler User's Guide has a generic section on Avoiding Memory Bank Conflicts. I am not sure, but there could be an article or two addressing this in the TI Wiki Pages.

    Depending on which DSP, there is additional information about memory banks in other documents, such as the CPU & Instruction Set Reference Guide, the Megamodule Reference Guide, and the CorePac Reference Guide.

    Regards,
    RandyP

  • Thanks for the pointer; the CorePac User's Guide provided the information I was looking for.
    It seems that in order to avoid those stalls, I'll have to reorganize my algorithm :/

    Thanks, Clemens
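
To make the bank-conflict idea concrete, here is a rough C sketch that checks whether two double-word accesses would touch the same bank. It assumes the L1D is organized as 8 banks of 32 bits each, which should be verified against the CorePac User's Guide for the device in use; the function names are hypothetical, and the model deliberately ignores any exceptions the hardware makes (for example, accesses that hit the same word).

```c
#include <stdint.h>

/* Assumed L1D organization: 8 banks, each 4 bytes wide. */
enum { L1D_NUM_BANKS = 8, L1D_BANK_BYTES = 4 };

/* Bank index that a given byte address falls into. */
unsigned bank_of(uintptr_t addr)
{
    return (unsigned)((addr / L1D_BANK_BYTES) % L1D_NUM_BANKS);
}

/* Bit mask of the two consecutive banks touched by an 8-byte access. */
unsigned lddw_bank_mask(uintptr_t addr)
{
    unsigned first = bank_of(addr);
    unsigned second = (first + 1u) % L1D_NUM_BANKS;
    return (1u << first) | (1u << second);
}

/* Returns 1 if two 8-byte accesses share at least one bank, else 0. */
int lddw_banks_overlap(uintptr_t a, uintptr_t b)
{
    return (lddw_bank_mask(a) & lddw_bank_mask(b)) != 0;
}
```

Under this simplified model, two buffers whose start addresses differ by a multiple of 32 bytes map to the same banks when walked in lockstep, while an offset of 8 bytes moves the second access to the next pair of banks.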

  • Clemens,

    Depending on your algorithm, it may be as easy as using the DATA_MEM_BANK pragma to direct the placement of the input arrays.

    Regards,
    RandyP
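
A minimal sketch of what such a placement could look like in C, assuming the TI C6000 compiler. The array names, the `LINE_PIXELS` size and the chosen bank numbers (0 and 2) are hypothetical; the bank values allowed for a given device family are listed in the compiler user's guide and should be checked there.

```c
#include <stdint.h>

#define LINE_PIXELS 1024  /* hypothetical line length */

/* The pragma is TI-specific, so it is guarded to keep the file portable.
 * It has to appear before the definition of the symbol it refers to. */
#ifdef __TI_COMPILER_VERSION__
#pragma DATA_MEM_BANK(line1, 0)   /* start line1 on bank 0 (assumed value) */
#pragma DATA_MEM_BANK(line2, 2)   /* start line2 on bank 2 (assumed value) */
#endif
uint16_t line1[LINE_PIXELS];
uint16_t line2[LINE_PIXELS];
```

The intent is that two pointers walking `line1` and `line2` in lockstep with double-word loads start on different banks. Whether this removes the stalls depends on the full access pattern, including the stores, so it is worth re-running the simulator afterwards.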