Suboptimal code generation from 7.0 series

Hi,

I have encountered two different loops where 7.0.4 gives very different results compared to either 6.1.18 or 6.0.28.

Complete examples are attached, compiled with cl6x -mv64+ -c -O3 -mw -k -on=2

1. The first example is a plain MAC loop, just

  tmp += a[i] * b[i]

I get:

  ii = 1 with no unroll for 7.0.4 or 6.1.18

  ii = 1 with 2x unroll for 6.0.28, or for 7.0.4/6.1.18 with --vectorize=off specified

 

Are there less disruptive compiler options that would allow the 7.0 compiler to make the same optimization? Or is the example code missing anything?

 

2. A manually unrolled loop, where a double indirection in the original loop causes pointer aliasing issues.  It is broken up through the use of intermediate pointers which are declared as restrict.

Original loop:

  a[index[i]] += K * b[i]; 

Broken up:

  int *restrict tmp1 = &a[index[i]];
  int *restrict tmp2 = &a[index[i+1]];
  int *restrict tmp3 = &a[index[i+2]];
  int *restrict tmp4 = &a[index[i+3]];

  *tmp1 += K * b[i];
  *tmp2 += K * b[i+1];
  *tmp3 += K * b[i+2];
  *tmp4 += K * b[i+3];

For the unrolled loop I get:

  ii = 9 for 6.0.28 and 6.1.18

  ii = 28 with 7.0.4 (no change with --vectorize=off)

The original loop has ii = 7 due to the loop carried dependency bound, and the ii = 28 (7 * 4x unroll) is also dependency bound.

Have there been changes regarding scoping or usage of the restrict keyword in the 7.0.x series?

 

It isn't surprising that some esoteric cases need non-standard compiler arguments (like adding --vectorize=off), but I'm surprised that it's needed for code as straightforward as this (especially the multiply-and-accumulate loop).

Thanks,

Orjan

#define WORD_ALIGNED(x) _nassert(((int)(x) & 0x3) == 0)
#define DWORD_ALIGNED(x) _nassert(((int)(x) & 0x7) == 0)
#define SZ 1000
int a[SZ];
int b[SZ];

int loop(int *restrict a, int *restrict b)
{
     int i;
     int tmp = 0;

     DWORD_ALIGNED(a);
     DWORD_ALIGNED(b);
#pragma MUST_ITERATE(SZ, SZ, )
     for (i = 0; i < SZ; i++) {
         tmp += a[i] * b[i];
     }
     return tmp;
}

#define SZ 1000
int a[SZ];
int b[SZ];
int index[SZ];

void loop(int *restrict a, int *restrict b, int K)
{
    int i;

    /* Original code: ii=7 due to loop carried dependency bound
       (double indirection causes pointer aliasing issues). */
    //#pragma MUST_ITERATE(SZ, SZ, )
    //for (i = 0; i < SZ; i++) {
    //     a[index[i]] += K * b[i];
    //}

    /* Unrolled loop to let the compiler know that a[index[i]]
       cannot point to the same address as a[index[i+1]] etc. */
#pragma MUST_ITERATE(SZ/4, SZ/4, )
    for (i = 0; i < SZ; i+=4) {
        int *restrict tmp1 = &a[index[i]];
        int *restrict tmp2 = &a[index[i+1]];
        int *restrict tmp3 = &a[index[i+2]];
        int *restrict tmp4 = &a[index[i+3]];
                
        *tmp1 += K * b[i];
        *tmp2 += K * b[i+1];
        *tmp3 += K * b[i+2];
        *tmp4 += K * b[i+3];
    }
}

 

  • Note that your second example, by declaring the restricted pointers within the loop body, is asserting that they are restricted for a single iteration only.  In other words, their scope is only a single iteration.  That may allow less software pipelining because separate iterations can't be safely mixed.  Try defining the variables outside the loop and only assigning to them within it;  when I do that, I get ii=8 instead of ii=28.
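A minimal sketch of that suggestion, for illustration only: the four restrict-qualified pointers are declared once at function scope and merely assigned inside the loop. The function name `scatter_mac4`, the `idx` parameter (standing in for the global `index` array) and the `n` parameter are hypothetical. The sketch assumes that entries of `idx` which land in different unroll slots never refer to the same element of `a`, since that is what the restrict qualifiers promise. `a` itself is left unqualified here because the hoisted pointers are derived from it.

```c
#include <assert.h>

/* Hypothetical helper modelled on the second attachment.
   Assumption: idx entries used in different unroll slots never
   select the same element of a. */
void scatter_mac4(int *a, const int *restrict b,
                  const int *idx, int K, int n)
{
    /* Declared once at function scope, so the no-alias promise
       covers every iteration rather than a single loop body. */
    int *restrict tmp1;
    int *restrict tmp2;
    int *restrict tmp3;
    int *restrict tmp4;
    int i;

    assert(n % 4 == 0);  /* assumption: trip count is a multiple of 4 */

    for (i = 0; i < n; i += 4) {
        /* Only assigned here, not declared. */
        tmp1 = &a[idx[i]];
        tmp2 = &a[idx[i + 1]];
        tmp3 = &a[idx[i + 2]];
        tmp4 = &a[idx[i + 3]];

        *tmp1 += K * b[i];
        *tmp2 += K * b[i + 1];
        *tmp3 += K * b[i + 2];
        *tmp4 += K * b[i + 3];
    }
}
```

The MUST_ITERATE pragma from the attachment can be kept in front of the loop when building with cl6x; it is left out here so the sketch compiles with any C99 compiler.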

  • That's a valid point, thank you.  I would have expected the effect of this to be that there was a loop carried dependency *between* iterations (for something like ii = 2 + 2 + 2 + 7 = 13), but then again I'm perfectly happy with this solution.

    With 6.1.18/7.0.4 I also get ii = 8 (which I recall having seen previously for this loop).

     

    Weird thing is, if I add -ms0 to the cl6x command line (which should be the default anyway) I get

      ii = 8  Did not find schedule

      ii = 9  Schedule found with 3 iterations in parallel

     

    Any thoughts on the vanilla MAC loop?

     

    Thanks,

    Orjan

  • [Brief pause while I went on vacation for a week.  - pf]

    Orjan Friberg said:

    I would have expected the effect of this to be that there was a loop carried dependency *between* iterations

    That's pretty much what happens, but the vagaries of software pipelining mean the ii doesn't add up quite like you suggested.

    Orjan Friberg said:

    Weird thing is, if I add -ms0 to the cl6x command line (which should be the default anyway)

    Why should it be the default?  Historically, compilation for C6x has leaned heavily toward performance and not so much toward code size.  The -ms0 option starts leaning toward code size;  in particular, it tends to inhibit loop unrolling.  In this case, it affects unrolling only internally where it isn't visible, but it does visibly affect instruction selection and SPLOOPing, which in turn affect the schedule and resulting ii.  With -ms0, it seems the code size is reduced by half on this example.

    Orjan Friberg said:

    Any thoughts on the vanilla MAC loop?

    An internal measure of "goodness" thinks that the original loop and the 2X-unrolled loop are equally good and chooses the original.  I suspect the primary reason is the 32-bit multiplication.  If a and b are pointers to short, the compiler will choose to unroll 4X for an ii=1 loop that uses DOTP2 instructions.  C6x has 16-bit multipliers and can combine those in several efficient ways, but 32-bit multiplies required a three-instruction sequence up until the C6400+ and we may be overestimating the expense.

    By the way, you don't need the "restrict"s or the MUST_ITERATE;  the for-loop is simple enough to interpret as it stands, and there are no memory writes in the function that would cause dependences that need "restrict" to help disambiguate.
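For reference, a sketch of the 16-bit variant described above, written without restrict or MUST_ITERATE. The function name `mac16` and the `n` parameter are hypothetical, and whether a given cl6x release actually unrolls this 4X and selects DOTP2 should be checked in the -k assembly output rather than taken from this sketch.

```c
#include <assert.h>

/* Hypothetical 16-bit version of the first attachment's loop.
   There are no memory writes in the function, so no restrict
   qualifiers are needed to disambiguate a and b. */
int mac16(const short *a, const short *b, int n)
{
    int i;
    int tmp = 0;

    assert(n >= 0);  /* assumption: caller passes a valid length */

    for (i = 0; i < n; i++) {
        /* 16-bit x 16-bit products accumulated in a 32-bit int */
        tmp += a[i] * b[i];
    }
    return tmp;
}
```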

  • pf said:

    Weird thing is, if I add -ms0 to the cl6x command line (which should be the default anyway)

    Why should it be the default?  Historically, compilation for C6x has leaned heavily toward performance and not so much toward code size.  The -ms0 option starts leaning toward code size;  in particular, it tends to inhibit loop unrolling.  In this case, it affects unrolling only internally where it isn't visible, but it does visibly affect instruction selection and SPLOOPing, which in turn affect the schedule and resulting ii.  With -ms0, it seems the code size is reduced by half on this example.

    Muddled wording on my part: cl6x --help says that 0 is the default for -ms, which led me to believe that adding -ms0 was the same as not specifying -ms at all.  I realize now it probably meant that "-ms" is the same as "-ms0".

    pf said:

    Any thoughts on the vanilla MAC loop?

    An internal measure of "goodness" thinks that the original loop and the 2X-unrolled loop are equally good and chooses the original.  I suspect the primary reason is the 32-bit multiplication.  If a and b are pointers to short, the compiler will choose to unroll 4X for an ii=1 loop that uses DOTP2 instructions.  C6x has 16-bit multipliers and can combine those in several efficient ways, but 32-bit multiplies required a three-instruction sequence up until the C6400+ and we may be overestimating the expense.

    Ok, thanks for the insight.  Is there any reason to assume I need to watch out when using 32-bit calculations in general (inlined IQMath, for example)?

    pf said:

    By the way, you don't need the "restrict"s or the MUST_ITERATE;  the for-loop is simple enough to interpret as it stands, and there are no memory writes in the function that would cause dependences that need "restrict" to help disambiguate.

    Yes, agreed.

    Thanks,

    Orjan