How to avoid memory bank stalls on streaming data

I have a subtle question (and solution?) about how to avoid memory bank stalls.

Further below is the TI C code for performing a dot product (taken from the TI C64+ DSP Library).  The streams of 16-bit data are fetched as double words that are double-word aligned in memory (for speed), which makes perfect sense. The optimizing assembler fetches these in parallel, like this:

         LDDW    .D1T1   *A6++,A5:A4
 ||      LDDW    .D2T2   *B7++,B5:B4

As the loop index i is incremented, the memory bank being addressed changes, but if the two arrays are aligned the same way, both pointers always point to the same bank as each other. Every pair of parallel loads therefore results in a memory bank stall. (Such a stall stops the entire execution pipeline for one instruction cycle.) The optimizing assembler "estimates" a 25% chance of memory bank conflict, whereas I see a 100% chance -- a certainty. The solution, it seems to me, is to offset the addresses so they never simultaneously point to the same memory bank.

For example, load the data into array n beforehand, starting at n[4] instead of n[0]. (Note: the offset of "4" will depend on the specific C6000 processor and its memory bank design.) See my comments in the code below. Because this approach avoids memory bank stalls, I believe it will be faster than the TI example code. (Note: I cannot find this solution discussed in the manuals, yet it would seem to be a common issue. The C6000 Programmer's Guide, Section 5.12, gives a brief generic discussion of memory banks and stalls, then refers readers to the "TMS320C6000 Peripherals Reference Guide" for details on any specific processor's memory banks, but there is no such information there.)

Question:  Do I understand this correctly?  Is this a speed improvement?

#pragma CODE_SECTION(DSP_dotprod, ".text:intrinsic");
int DSP_dotprod ( short * restrict m, short * restrict n, int count )
{
    int i;
    int sum1 = 0;
    int sum2 = 0;

    /* The kernel assumes that the data pointers are double word aligned */
    _nassert((int)m % 8 == 0);
    _nassert((int)n  % 8 == 0);
    // Align the two arrays on, say, a quadruple (or higher) word boundary. This will guarantee that the two pointers,
    // m[0] and n[0], point to the same memory bank, also guaranteeing that the two pointers,
    // m[0] and n[4], do not point to the same memory bank. Then store array n beforehand, starting at n[4].

    /* The kernel assumes that the input count is multiple of 4 */
    for (i = 0; i < count; i+=4) {
        sum1 += _dotp2(_lo(_amemd8_const(&m[i])),  _lo(_amemd8_const(&n[i])));  // Should be n[i+4] instead of n[i];
        sum2 += _dotp2(_hi(_amemd8_const(&m[i])),  _hi(_amemd8_const(&n[i])));  // Should be n[i+4] instead of n[i];

    }

    return (sum1+sum2);
}
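
To make the proposed indexing change concrete, here is a minimal portable sketch of the same kernel in plain C, without the TI intrinsics. It is not the DSP Library code: the name dotprod_offset and the constant N_OFFSET are hypothetical, and the value 4 is only the example offset from the post, which depends on the device's bank layout. It only shows that reading n from n[i+4] gives the usual dot product when the data was stored starting at n[4]. Whether it removes the stalls has to be checked on the target.

```c
#include <assert.h>

/* Hypothetical padding, in 16-bit elements, before the real data in n.
 * 4 shorts = 8 bytes; the right value depends on the device's bank layout. */
enum { N_OFFSET = 4 };

/* Portable stand-in for the intrinsic kernel (plain C, no TI intrinsics).
 * n_padded must hold count + N_OFFSET elements, with the real data
 * stored from n_padded[N_OFFSET] onward. */
int dotprod_offset(const short *m, const short *n_padded, int count)
{
    int i;
    int sum1 = 0;
    int sum2 = 0;

    /* Same assumption as the original kernel: count is a multiple of 4. */
    assert(count % 4 == 0);

    for (i = 0; i < count; i += 4) {
        const short *n = &n_padded[i + N_OFFSET];
        sum1 += m[i]     * n[0] + m[i + 1] * n[1];  /* first two elements of the group */
        sum2 += m[i + 2] * n[2] + m[i + 3] * n[3];  /* last two elements of the group */
    }
    return sum1 + sum2;
}
```

The caller has to allocate n with N_OFFSET extra elements and fill it starting at index N_OFFSET. The first N_OFFSET elements are never read.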

  • I found a helpful discussion of this issue (memory bank stalls) in the TMS320C6000 Programmer's Guide -- spru198i.pdf, section 5.2.4 on "The .mptr Directive"

    The solution discussed there utilizes the .mptr directive in combination with Linear Assembly code (rather than the C-code used in my above example). Also, the generated assembly code is not SPLOOP'd.  I'm trying to wrap my mind around their example (as the description of the .mptr directive there is cryptic). 

    ||     LDH .D2T2 *B5++(4),B8     ; These two loads are done simultaneously, and later multiplied together.
    ||     LDH .D1T1 *-A4(2),A0

    ....

    ||     LDH .D2T2 *-B5(2),B7        ; Likewise, these two loads are done simultaneously, and later multiplied together.
    ||     LDH .D1T1 *A4++(4),A0

    The asymmetry in the load addresses is obvious.  But it is not yet obvious (to me) how the first and last loads would get paired up, and the 2nd and 3rd loads would get paired up, for multiplication. It seems to me that my code comments above (about offsetting array n) would apply.

    ======

    On a different note, I have seen yet another approach, where double-word loads are explicitly done in a non-aligned manner -- which in my view is another way of accepting an automatic memory stall during each load.

  • UPDATE:  I finally found a document detailing the Memory Bank structure of my target device -- (In the TMS320C674x Megamodule Reference Guide - sprufk5.pdf - section 3.5.1 - L1D Memory Banking.)

    My target device has eight memory banks of 32 bits each.  For example, if you step upward by 8 words (32 bytes) you end up in the same memory bank you started at.
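
    The bank arithmetic can be sketched in a few lines of C. This is a simplified model that assumes only what is stated above (eight banks, each 32 bits wide, interleaved word by word). The function names are hypothetical, and the real L1D rules for when two accesses actually stall may have exceptions this model ignores.

```c
#include <assert.h>
#include <stdint.h>

enum { BANK_WIDTH_BYTES = 4, NUM_BANKS = 8 };  /* assumed: 8 banks x 32 bits */

/* Bank that holds the 32-bit word at byte address addr. */
unsigned bank_of(uint32_t addr)
{
    return (addr / BANK_WIDTH_BYTES) % NUM_BANKS;
}

/* Bitmask of the two banks touched by an aligned double-word (8-byte) load. */
unsigned lddw_bank_mask(uint32_t addr)
{
    assert(addr % 8 == 0);  /* LDDW needs double-word alignment */
    return (1u << bank_of(addr)) | (1u << bank_of(addr + 4));
}

/* Nonzero if two parallel double-word loads touch a common bank. */
int lddw_conflict(uint32_t addr_a, uint32_t addr_b)
{
    return (lddw_bank_mask(addr_a) & lddw_bank_mask(addr_b)) != 0;
}
```

    Under this model, two arrays whose start addresses differ by a multiple of 32 bytes collide on every parallel LDDW, while shifting one of them by 8 bytes (four shorts) puts the two loads in disjoint bank pairs on every iteration, since both pointers advance by the same amount.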

    I want to align two arrays on double word boundaries (for speedy access), which is easy to do using the DATA_ALIGN pragma.

    #pragma DATA_ALIGN   (  a,     8 );
    #pragma DATA_ALIGN   (  b,     8 );

    GOAL:  In addition, I want the two arrays to begin in different memory banks, so when indexing through the arrays simultaneously there will never be memory bank stalls.

    for(i=0; i<length; i++)        sum += a[i] * b[i];       //   I want this to produce no memory bank stalls

    QUESTION: How do I tell the compiler/linker to begin these two arrays in different memory banks?  And how do I inform the compiler to take advantage of that when optimizing the code?

    UPDATE:  I just found the answer:  Use the DATA_MEM_BANK Pragma.  Cool!  ( However, there are eight memory banks, but only four allowable values for choosing the memory bank to use. )

    ANOTHER UPDATE:  I just discovered the compiler will not accept BOTH pragmas at the same time (DATA_ALIGN and DATA_MEM_BANK).  I presume the DATA_MEM_BANK pragma is a superset of (includes the function of) the DATA_ALIGN pragma.   ???

    YET ANOTHER UPDATE:  I need the DATA_ALIGN pragma for aligning the data for optimal cache memory usage.  I'm not sure I can do without it.  Am I caught in a Catch-22, where I can't use both pragmas, but I need both pragmas? 

     

  • Update:  (I'm figuring this stuff out as I go along.  ....)

    The DATA_MEM_BANK pragma and the DATA_ALIGN pragma cannot be used simultaneously, because they contradict each other.  It's not possible to have both at the same time. Moreover, the DATA_MEM_BANK pragma should be sufficient to get full speed access to the cache memory -- in other words, the DATA_ALIGN pragma is not necessary when the other pragma is being used.
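
    For reference, a minimal sketch of how the two arrays might be declared with DATA_MEM_BANK alone. The bank numbers 0 and 2 are an assumption (chosen from the four allowed values so that the two arrays start two banks apart), the array size and the function name dot16 are hypothetical, and the guard macro keeps the pragmas away from non-TI compilers. I have not verified the resulting placement on hardware. Check the linker map file for the actual addresses.

```c
#include <assert.h>

#ifdef __TI_COMPILER_VERSION__       /* predefined by TI compilers */
#pragma DATA_MEM_BANK(a, 0)          /* assumed choice: a starts in bank 0 */
#pragma DATA_MEM_BANK(b, 2)          /* assumed choice: b starts in bank 2 */
#endif
short a[256];                        /* hypothetical size */
short b[256];

/* The loop from the earlier post, wrapped in a function. */
int dot16(const short *x, const short *y, int length)
{
    int i;
    int sum = 0;

    assert(length >= 0);
    for (i = 0; i < length; i++)
        sum += x[i] * y[i];
    return sum;
}
```

    Calling dot16(a, b, 256) then walks both arrays in step. Whether the optimizer actually pairs the loads so that they never share a bank still has to be confirmed from the generated assembly.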