Hi,
I have encountered two different loops where 7.0.4 gives very different results compared to either 6.1.18 or 6.0.28.
Complete examples are attached, compiled with cl6x -mv64+ -c -O3 -mw -k -on=2
1. The first example is a plain MAC loop, just
tmp += a[i] * b[i]
I get:
ii = 1 with no unroll for 7.0.4 and 6.1.18
ii = 1 with 2x unroll for 6.0.28, and for 7.0.4/6.1.18 when --vectorize=off is specified
Are there less disruptive compiler options that would allow the 7.0 compiler to make the same optimization? Or is the example code missing anything?
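For comparison, the 2x unroll can also be written out by hand instead of relying on a compiler option. The following is only a sketch: the names mac_reference and mac_unrolled2 are made up for illustration, it assumes an even trip count, and I have not checked what schedule any cl6x version produces for it.

```c
#include <stddef.h>

/* Reference: the plain MAC loop from example 1. */
int mac_reference(const int *a, const int *b, size_t n)
{
    size_t i;
    int tmp = 0;
    for (i = 0; i < n; i++) {
        tmp += a[i] * b[i];
    }
    return tmp;
}

/* Hypothetical manual 2x unroll with two independent accumulators.
   Assumes n is even; the two partial sums are combined at the end. */
int mac_unrolled2(const int *restrict a, const int *restrict b, size_t n)
{
    size_t i;
    int even = 0;
    int odd = 0;
    for (i = 0; i < n; i += 2) {
        even += a[i] * b[i];
        odd += a[i + 1] * b[i + 1];
    }
    return even + odd;
}
```

Both functions return the same sum as long as the intermediate results do not overflow, so the hand-unrolled form could serve as a fallback if no suitable option exists.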
2. A manually unrolled loop. A double indirection in the original loop causes pointer aliasing issues, so the loop is broken up through the use of intermediate pointers which are declared as restrict.
Original loop:
a[index[i]] += K * b[i];
Broken up:
int *restrict tmp1 = &a[index[i]];
int *restrict tmp2 = &a[index[i+1]];
int *restrict tmp3 = &a[index[i+2]];
int *restrict tmp4 = &a[index[i+3]];
*tmp1 += K * b[i];
*tmp2 += K * b[i+1];
*tmp3 += K * b[i+2];
*tmp4 += K * b[i+3];
For the unrolled loop I get:
ii = 9 for 6.0.28 and 6.1.18
ii = 28 with 7.0.4 (no change with --vectorize=off)
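The original loop has ii = 7 due to the loop-carried dependency bound, and the ii = 28 (7 cycles * 4x unroll) is also dependency bound, which suggests that 7.0.4 serializes all four updates instead of treating them as independent.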
Have there been changes regarding scoping or usage of the restrict keyword in the 7.0.x series?
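Independent of how restrict is handled, the same independence can be expressed without block-scope restrict pointers by doing all four loads before any of the four stores. This is only a sketch under assumptions: the names scatter_reference and scatter_grouped4 are made up, the trip count must be a multiple of 4, and the four indices within each group must be distinct (the same assumption the restrict version already makes). Whether this changes the ii on 7.0.4 is untested.

```c
#include <stddef.h>

/* Reference: the original double-indirection loop from example 2. */
void scatter_reference(int *a, const int *b, const int *idx, int K, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        a[idx[i]] += K * b[i];
    }
}

/* Hypothetical restrict-free grouping: read and update four elements
   in locals, then write them back. Assumes n is a multiple of 4 and
   that idx[i] .. idx[i+3] are pairwise distinct within each group. */
void scatter_grouped4(int *restrict a, const int *restrict b,
                      const int *restrict idx, int K, size_t n)
{
    size_t i;
    for (i = 0; i < n; i += 4) {
        int i0 = idx[i];
        int i1 = idx[i + 1];
        int i2 = idx[i + 2];
        int i3 = idx[i + 3];
        /* All loads and multiplies happen before any store. */
        int v0 = a[i0] + K * b[i];
        int v1 = a[i1] + K * b[i + 1];
        int v2 = a[i2] + K * b[i + 2];
        int v3 = a[i3] + K * b[i + 3];
        a[i0] = v0;
        a[i1] = v1;
        a[i2] = v2;
        a[i3] = v3;
    }
}
```

If two indices in a group were equal, the grouped version would drop one of the updates, so it is only equivalent to the original loop under the distinctness assumption above.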
I'm not surprised that some esoteric cases need non-standard compiler arguments (such as --vectorize=off), but I am surprised that this is needed for code as straightforward as this, especially the multiply-and-accumulate loop.
Thanks,
Orjan
/* Attached example 1: plain MAC loop */
#define WORD_ALIGNED(x) _nassert(((int)(x) & 0x3) == 0)
#define DWORD_ALIGNED(x) _nassert(((int)(x) & 0x7) == 0)
#define SZ 1000
int a[SZ];
int b[SZ];
int loop(int *restrict a, int *restrict b)
{
    int i;
    int tmp = 0;
    DWORD_ALIGNED(a);
    DWORD_ALIGNED(b);
    #pragma MUST_ITERATE(SZ, SZ, )
    for (i = 0; i < SZ; i++) {
        tmp += a[i] * b[i];
    }
    return tmp;
}

/* Attached example 2: manually unrolled double-indirection loop */
#define SZ 1000
int a[SZ];
int b[SZ];
int index[SZ];
void loop(int *restrict a, int *restrict b, int K)
{
    int i;
    /* Original code: ii=7 due to loop carried dependency bound
       (double indirection causes pointer aliasing issues). */
    //#pragma MUST_ITERATE(SZ, SZ, )
    //for (i = 0; i < SZ; i++) {
    //    a[index[i]] += K * b[i];
    //}
    /* Unrolled loop to let the compiler know that a[index[i]]
       cannot point to the same address as a[index[i+1]] etc. */
    #pragma MUST_ITERATE(SZ/4, SZ/4, )
    for (i = 0; i < SZ; i+=4) {
        int *restrict tmp1 = &a[index[i]];
        int *restrict tmp2 = &a[index[i+1]];
        int *restrict tmp3 = &a[index[i+2]];
        int *restrict tmp4 = &a[index[i+3]];
        *tmp1 += K * b[i];
        *tmp2 += K * b[i+1];
        *tmp3 += K * b[i+2];
        *tmp4 += K * b[i+3];
    }
}