I have a subtle question (and solution?) about how to avoid memory bank stalls.
Further below is the TI C code for performing a dot product (taken from the TI C64x+ DSP Library). The streams of 16-bit data are fetched as double words that are double-word aligned in memory (for speed), which makes perfect sense. The optimizing assembler fetches these in parallel, like this:
LDDW .D1T1 *A6++,A5:A4
|| LDDW .D2T2 *B7++,B5:B4
As the loop index i is incremented, the two pointers advance in lockstep, so if the arrays are aligned the same way they always point to the same memory bank as each other. (That is, the bank pointed to changes from iteration to iteration, but both pointers always share it.) Every pair of parallel loads therefore results in a memory bank stall. (Such a stall stops the entire execution pipeline for one instruction cycle.) The optimizing assembler "estimates" a 25% chance of a memory bank conflict, whereas I see a 100% chance, a certainty. The solution, it seems to me, is to offset the addresses so that they never point to the same memory bank at the same time.
For example, store the data into array n beforehand, starting at n[4] instead of n[0]. (Note: the offset of "4" will depend on the specific C6000 processor and its memory bank design.) See my comments (the // lines) in the code below. Because this approach avoids memory bank stalls, I believe it will be faster than the TI example code. (Note: I cannot find this solution discussed in the manuals, yet it would seem to be a common issue. The C6000 Programmer's Guide, Section 5.12, gives a brief generic discussion of memory banks and stalls, then refers readers to the "TMS320C6000 Peripherals Reference Guide" for details on any specific processor's memory banks, but there is no such information there.)
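To make the bank arithmetic concrete, here is a minimal sketch in portable C (it does not use any TI intrinsics). The geometry of 8 banks, each 4 bytes wide, is an assumption for illustration only, not a figure taken from a manual, and bank_of and lddw_banks_overlap are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED geometry (hypothetical; check the data sheet for the actual device):
 * 8 banks, each 4 bytes wide, interleaved on consecutive word addresses. */
#define BANK_WIDTH_BYTES 4u
#define NUM_BANKS        8u

/* Index of the bank that holds the byte at addr. */
static unsigned bank_of(uintptr_t addr)
{
    return (unsigned)((addr / BANK_WIDTH_BYTES) % NUM_BANKS);
}

/* Returns nonzero if two double-word (8-byte) loads at a and b touch a
 * common bank. Both addresses must be 8-byte aligned, so each load covers
 * bank_of(addr) and the bank that follows it. */
static int lddw_banks_overlap(uintptr_t a, uintptr_t b)
{
    assert(a % 8 == 0 && b % 8 == 0);

    unsigned a0 = bank_of(a), a1 = bank_of(a + BANK_WIDTH_BYTES);
    unsigned b0 = bank_of(b), b1 = bank_of(b + BANK_WIDTH_BYTES);

    return a0 == b0 || a0 == b1 || a1 == b0 || a1 == b1;
}
```

Under these assumed numbers, loads from 0x1000 and 0x2000 overlap, and they keep overlapping as both addresses advance by 8 bytes per load. Loads from 0x1000 and 0x2008 (that is, n[4] for 16-bit data) stay two banks apart and never overlap.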
Question: Do I understand this correctly? Is this a speed improvement?
#pragma CODE_SECTION(DSP_dotprod, ".text:intrinsic");
int DSP_dotprod ( short * restrict m, short * restrict n, int count )
{
    int i;
    int sum1 = 0;
    int sum2 = 0;

    /* The kernel assumes that the data pointers are double word aligned */
    _nassert((int)m % 8 == 0);
    _nassert((int)n % 8 == 0);

    // Align the two arrays on, say, a quadruple-word (or higher) boundary. This will guarantee that the two pointers,
    // m[0] and n[0], point to the same memory bank, which in turn guarantees that the two pointers,
    // m[0] and n[4], do not point to the same memory bank. Then store array n beforehand, starting at n[4].

    /* The kernel assumes that the input count is multiple of 4 */
    for (i = 0; i < count; i += 4) {
        sum1 += _dotp2(_lo(_amemd8_const(&m[i])), _lo(_amemd8_const(&n[i]))); // should be n[i+4] instead of n[i]
        sum2 += _dotp2(_hi(_amemd8_const(&m[i])), _hi(_amemd8_const(&n[i]))); // should be n[i+4] instead of n[i]
    }

    return (sum1 + sum2);
}
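For clarity, here is a portable sketch of the buffer layout that my proposed change implies. It is a plain C reference, not the TI kernel: the caller has to allocate count + 4 shorts for n, leave the first 4 unused, and store the real data from n[4] onward. N_OFFSET and dotprod_skewed are hypothetical names, and the value 4 is the assumed, device-dependent offset discussed above.

```c
/* ASSUMED skew in 16-bit elements (8 bytes); the right value depends on
 * the memory bank design of the specific device. */
#define N_OFFSET 4

/* Reference dot product where n_buf holds its data starting at
 * n_buf[N_OFFSET]. n_buf must contain at least count + N_OFFSET elements. */
static int dotprod_skewed(const short *m, const short *n_buf, int count)
{
    int sum = 0;
    int i;

    for (i = 0; i < count; i++) {
        sum += m[i] * n_buf[i + N_OFFSET];
    }
    return sum;
}
```

The arithmetic result is the same as for the unskewed arrays. The only difference is where the n data sits in memory relative to m.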