Suboptimal code generation from 7.0 series

Orjan Friberg

Expert 1385 points

Hi,

I have encountered two different loops where 7.0.4 gives very different results compared to either 6.1.18 or 6.0.28.

Complete examples are attached, compiled with cl6x -mv64+ -c -O3 -mw -k -on=2

1. The first example is a plain MAC loop, just

tmp += a[i] * b[i]

I get:

ii = 1 with no unroll for 7.0.4 or 6.1.18

ii = 1 with 2x unroll for 6.0.28, or for 7.0.4/6.1.18 with --vectorize=off specified

Are there less disruptive compiler options that would allow the 7.0 compiler to make the same optimization? Or is the example code missing anything?
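For reference, a source-level sketch of what a 2x unroll of this loop amounts to. This is not the code any of the compilers emits; the function name `mac_unroll2`, the two separate accumulators, and the even trip count requirement are assumptions made for illustration only.

```c
#include <assert.h>

/* Hypothetical illustration: the MAC loop unrolled 2x by hand.
   Each iteration handles two elements, using one accumulator per lane. */
int mac_unroll2(const int *a, const int *b, int n)
{
    int i;
    int sum0 = 0;
    int sum1 = 0;

    /* Assumption: even trip count, as is the case for SZ = 1000. */
    assert(n >= 0 && n % 2 == 0);

    for (i = 0; i < n; i += 2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    return sum0 + sum1;
}
```

With ii = 1, the unrolled form retires two elements per cycle where the non-unrolled form retires one, which is why the two results listed above differ in throughput despite the identical ii.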

2. A manually unrolled loop, where a double indirection in the original loop causes pointer aliasing issues. It is broken up through the use of intermediate pointers which are declared as restrict.

Original loop:

a[index[i]] += K * b[i];

Broken up:

int *restrict tmp1 = &a[index[i]];
int *restrict tmp2 = &a[index[i+1]];
int *restrict tmp3 = &a[index[i+2]];
int *restrict tmp4 = &a[index[i+3]];

*tmp1 += K * b[i];
*tmp2 += K * b[i+1];
*tmp3 += K * b[i+2];
*tmp4 += K * b[i+3];

For the unrolled loop I get:

ii = 9 for 6.0.28 and 6.1.18

ii = 28 with 7.0.4 (no change with --vectorize=off)

The original loop has ii = 7 due to the loop carried dependency bound, and the ii = 28 (7 * 4x unroll) is also dependency bound.

Have there been changes regarding scoping or usage of the restrict keyword in the 7.0.x series?

It isn't surprising that there are some esoteric cases where you want to tweak the compiler arguments to something non-standard (like adding --vectorize=off), but I'm surprised that it's needed for such straightforward code (especially the multiply-and-accumulate loop).

Thanks,

Orjan

8311.vanilla_mac.txt

#define WORD_ALIGNED(x) _nassert(((int)(x) & 0x3) == 0)
#define DWORD_ALIGNED(x) _nassert(((int)(x) & 0x7) == 0)
#define SZ 1000
int a[SZ];
int b[SZ];

int loop(int *restrict a, int *restrict b)
{
     int i;
     int tmp = 0;

     DWORD_ALIGNED(a);
     DWORD_ALIGNED(b);
#pragma MUST_ITERATE(SZ, SZ, )
     for (i = 0; i < SZ; i++) {
         tmp += a[i] * b[i];
     }
     return tmp;
}

6177.manual_unroll.txt

#define SZ 1000
int a[SZ];
int b[SZ];
int index[SZ];

void loop(int *restrict a, int *restrict b, int K)
{
    int i;

    /* Original code: ii=7 due to loop carried dependency bound
       (double indirection causes pointer aliasing issues). */
    //#pragma MUST_ITERATE(SZ, SZ, )
    //for (i = 0; i < SZ; i++) {
    //     a[index[i]] += K * b[i];
    //}

    /* Unrolled loop to let the compiler know that a[index[i]]
       cannot point to the same address as a[index[i+1]] etc. */
#pragma MUST_ITERATE(SZ/4, SZ/4, )
    for (i = 0; i < SZ; i+=4) {
        int *restrict tmp1 = &a[index[i]];
        int *restrict tmp2 = &a[index[i+1]];
        int *restrict tmp3 = &a[index[i+2]];
        int *restrict tmp4 = &a[index[i+3]];
                
        *tmp1 += K * b[i];
        *tmp2 += K * b[i+1];
        *tmp3 += K * b[i+2];
        *tmp4 += K * b[i+3];
    }
}

over 15 years ago

pf over 15 years ago

TI__Expert 4930 points

Note that your second example, by declaring the restricted pointers within the loop body, is asserting that they are restricted for a single iteration only. In other words, their scope is only a single iteration. That may allow less software pipelining because separate iterations can't be safely mixed. Try defining the variables outside the loop and only assigning to them within it; when I do that, I get ii=8 instead of ii=28.
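A minimal sketch of that suggestion: the restrict-qualified pointers are declared once at function scope and only assigned inside the loop. The function name `scatter_mac`, the `n` parameter, and passing `index` as a parameter are assumptions for illustration (the attachment uses a global `index` and a fixed `SZ`). Declaring the pointers at this scope asserts that they never alias anywhere in the function, so the sketch is only valid if every entry of `index` is distinct.

```c
#include <assert.h>

/* Sketch only. Assumption: all entries of index[] are distinct and the
   arrays do not overlap, so the restrict-qualified pointers below never
   refer to the same element of a[] at any point in the function. */
void scatter_mac(int *a, const int *b, const int *index, int K, int n)
{
    int i;
    /* Declared outside the loop: the no-alias promise now spans all
       iterations, not just a single one. */
    int *restrict tmp1;
    int *restrict tmp2;
    int *restrict tmp3;
    int *restrict tmp4;

    /* Assumption: trip count is a multiple of the 4x unroll factor. */
    assert(n >= 0 && n % 4 == 0);

    for (i = 0; i < n; i += 4) {
        tmp1 = &a[index[i]];
        tmp2 = &a[index[i + 1]];
        tmp3 = &a[index[i + 2]];
        tmp4 = &a[index[i + 3]];

        *tmp1 += K * b[i];
        *tmp2 += K * b[i + 1];
        *tmp3 += K * b[i + 2];
        *tmp4 += K * b[i + 3];
    }
}
```

If `index` can contain repeated values, this form promises more than the data guarantees and the result is undefined; the per-iteration declarations in the original attachment make the weaker promise.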

Orjan Friberg over 15 years ago in reply to pf

Expert 1385 points

That's a valid point, thank you. I would have expected the effect of this to be that there was a loop carried dependency *between* iterations (for something like ii = 2 + 2 + 2 + 7 = 13), but then again I'm perfectly happy with this solution.

With 6.1.18/7.0.4 I also get ii = 8 (which I recall having seen previously for this loop).

Weird thing is, if I add -ms0 to the cl6x command line (which should be the default anyway) I get

ii = 8 Did not find schedule

ii = 9 Schedule found with 3 iterations in parallel

Any thoughts on the vanilla MAC loop?

Thanks,

Orjan

pf over 15 years ago in reply to Orjan Friberg

TI__Expert 4930 points

[Brief pause while I went on vacation for a week. - pf]

Orjan Friberg said:

I would have expected the effect of this to be that there was a loop carried dependency *between* iterations

That's pretty much what happens, but the vagaries of software pipelining mean the ii doesn't add up quite like you suggested.

Orjan Friberg said:

Weird thing is, if I add -ms0 to the cl6x command line (which should be the default anyway)

Why should it be the default? Historically, compilation for C6x has leaned heavily toward performance and not so much toward code size. The -ms0 option starts leaning toward code size; in particular, it tends to inhibit loop unrolling. In this case, it affects unrolling only internally where it isn't visible, but it does visibly affect instruction selection and SPLOOPing, which in turn affect the schedule and resulting ii. With -ms0, it seems the code size is reduced by half on this example.

Orjan Friberg said:

Any thoughts on the vanilla MAC loop?

An internal measure of "goodness" thinks that the original loop and the 2X-unrolled loop are equally good and chooses the original. I suspect the primary reason is the 32-bit multiplication. If a and b are pointers to short, the compiler will choose to unroll 4X for an ii=1 loop that uses DOTP2 instructions. C6x has 16-bit multipliers and can combine those in several efficient ways, but 32-bit multiplies required a three-instruction sequence up until the C6400+ and we may be overestimating the expense.
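The 16-bit variant described above might look like the following sketch. The function name `mac16` and the `n` parameter are assumptions; whether a given compiler version actually unrolls 4X and selects DOTP2 for it is as stated in this reply and has not been verified here.

```c
#include <assert.h>

/* Sketch only: the same MAC loop with 16-bit inputs.
   Each short operand is promoted to int before the multiply,
   so the accumulation is still done in 32 bits. */
int mac16(const short *a, const short *b, int n)
{
    int i;
    int tmp = 0;

    assert(n >= 0);

    for (i = 0; i < n; i++) {
        tmp += a[i] * b[i];
    }
    return tmp;
}
```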

By the way, you don't need the "restrict"s or the MUST_ITERATE; the for-loop is simple enough to interpret as it stands, and there are no memory writes in the function that would cause dependences that need "restrict" to help disambiguate.
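Following that remark, the attached function reduces to the sketch below. The name `loop_plain` is hypothetical, and the `const` qualifiers and the pointer check are additions for illustration rather than part of the attachment.

```c
#include <assert.h>

#define SZ 1000

/* Sketch only: the vanilla MAC loop with the restrict qualifiers and
   the MUST_ITERATE pragma removed. The function only reads through
   a and b, and the trip count is the compile-time constant SZ. */
int loop_plain(const int *a, const int *b)
{
    int i;
    int tmp = 0;

    assert(a && b);  /* sanity check only; unrelated to optimization */

    for (i = 0; i < SZ; i++) {
        tmp += a[i] * b[i];
    }
    return tmp;
}
```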

Orjan Friberg over 15 years ago in reply to pf

Expert 1385 points

Orjan Friberg said:

Weird thing is, if I add -ms0 to the cl6x command line (which should be the default anyway)


Muddled wording on my part: cl6x --help says that 0 is the default for -ms, which led me to believe that adding -ms0 was the same as not specifying -ms at all. I realize now it probably meant that "-ms" is the same as "-ms0".

Orjan Friberg said:

Any thoughts on the vanilla MAC loop?


Ok, thanks for the insight. Is there any reason to assume I need to watch out when using 32-bit calculations in general (inlined IQMath, for example)?

pf said:

By the way, you don't need the "restrict"s or the MUST_ITERATE; the for-loop is simple enough to interpret as it stands, and there are no memory writes in the function that would cause dependences that need "restrict" to help disambiguate.

Yes, agreed.

Thanks,

Orjan
