Part Number: AWR1642
Tool/software: TI C/C++ Compiler
This is related to code compiled for the C674x DSP on the AWR1642 part. The compiler version used is 8.2.2. Relevant compiler options: -O3 -g -ms0
Hi,
I am trying to optimize a function that is somewhat like memmove (it allows overlapped moves) on 32-bit data. It is inspired by DSPLIB's block-move function, which optimizes the move by doing aligned double-word loads two at a time and achieves 0.5 cycles per 32-bit move. However, DSPLIB's code works only when the source address is higher than the destination address, whereas I need the other direction to be optimal as well. See the code below:
void MmwDemo_blockMove_final
(
    uint32_t * restrict source,
    uint32_t * restrict destination,
    int blockSize
)
{
    int i;

    _nassert(((int)source % 8) == 0);
    _nassert(((int)destination % 8) == 0);
    _nassert((blockSize % 4) == 0);
    _nassert(blockSize >= 4);

    if (source > destination)
    {
#pragma MUST_ITERATE(4,,4)
#pragma UNROLL(4)
        for (i = 0; i < blockSize; i++)
        {
            destination[i] = source[i];
        }
    }
    else
    {
        uint32_t *srcEnd = source + blockSize - 1;
        uint32_t *dstEnd = destination + blockSize - 1;

        *dstEnd-- = *srcEnd--;

        _nassert(((int)srcEnd % 8) == 0);
        _nassert(((int)dstEnd % 8) == 0);
#pragma MUST_ITERATE(,,4)
#pragma UNROLL(4)
        for (i = 0; i < (blockSize - 4); i++)
        {
            *dstEnd-- = *srcEnd--;
        }
        for (i = 0; i < 3; i++)
        {
            *dstEnd-- = *srcEnd--;
        }
    }
}
In the above code, the first three _nassert statements express the assumptions that source and destination are 64-bit aligned and that blockSize is a multiple of 4. For the source > destination case, the forward loop generates optimal code (two aligned double-word loads followed by the stores), like the DSPLIB code [the UNROLL of 4 is required, otherwise the compiler doesn't do it], achieving 0.5 cycles per 32-bit move.

In the else case, we have to start from the tail and progress backwards (decreasing addresses). The tail will be unaligned, because the source/destination arguments are aligned and blockSize is a multiple of 4 (and hence also of 2). So for this case to be optimal, it is split into 3 stages: first a singleton 32-bit copy, which should then align the addresses for the loop that follows, and after the loop, 3 more singleton 32-bit copies.

Although the compiler should not need additional hints, I tried putting the _nasserts on srcEnd and dstEnd right after the singleton copy, but this is not useful: the generated code for the loop in the else branch uses unaligned loads and stores and is not optimal (1 cycle per 32-bit move). Thinking that the _nasserts may not be context aware (i.e., may not track the fact that the addresses were decremented by the singleton copy), I tried a version that introduces two additional local variables, as below, but this still generates suboptimal code.
void MmwDemo_blockMove_final
(
    uint32_t * restrict source,
    uint32_t * restrict destination,
    int blockSize
)
{
    int i;

    _nassert(((int)source % 8) == 0);
    _nassert(((int)destination % 8) == 0);
    _nassert((blockSize % 4) == 0);
    _nassert(blockSize >= 4);

    if (source > destination)
    {
#pragma MUST_ITERATE(4,,4)
#pragma UNROLL(4)
        for (i = 0; i < blockSize; i++)
        {
            destination[i] = source[i];
        }
    }
    else
    {
        uint32_t *srcEnd1 = source + blockSize - 1;
        uint32_t *dstEnd1 = destination + blockSize - 1;
        uint32_t *srcEnd, *dstEnd;

        *dstEnd1-- = *srcEnd1--;

        srcEnd = srcEnd1;
        dstEnd = dstEnd1;
        _nassert(((int)srcEnd % 8) == 0);
        _nassert(((int)dstEnd % 8) == 0);
#pragma MUST_ITERATE(,,4)
#pragma UNROLL(4)
        for (i = 0; i < (blockSize - 4); i++)
        {
            *dstEnd-- = *srcEnd--;
        }
        for (i = 0; i < 3; i++)
        {
            *dstEnd-- = *srcEnd--;
        }
    }
}
If I remove the initial singleton copy (the *dstEnd-- = *srcEnd--; line before the loop in the 1st code I quoted), with or without the _nasserts in the else branch, the compiler generates optimal code for the loop in the else case. The C code is not correct in that case, as it no longer achieves the functionality I desire; I did this experiment just to see what happens. But the result seems surprising: if anything, the initial values of srcEnd and dstEnd (= source/destination + blockSize - 1) will not be aligned, so the compiler's behavior is not making sense to me. (Note that I have not checked the else part for correctness yet either; for now I am just trying to see whether the code can be generated optimally as written.) I think I can work around this situation by using the _amem8 intrinsics, similar to how DSPLIB has done things, but it is much cleaner to write the code as in the first loop of the source > destination case, which generates essentially the same code as DSPLIB, if it can be made to work for the loop in the else case. The DSPLIB code is below for reference (it operates on shorts, but the outcome is essentially the same as the first loop in the source > destination case):
void DSP_blk_move
(
    short * restrict x,
    short * restrict r,
    int nx
)
{
    int i;
    long long x3210, x7654;

    nx = nx >> 3;

#pragma MUST_ITERATE(1,,1);
    for (i = 0; i < nx; i++) {
        x3210 = _amem8_const(&x[i*8+0]);
        x7654 = _amem8_const(&x[i*8+4]);
        _amem8(&r[i*8+0]) = x3210;
        _amem8(&r[i*8+4]) = x7654;
    }
}