Compiler/PROCESSOR-SDK-DRA8X: Auto-vectorization and performance advisor with the C7x compiler

Florian Tramnitzke

Part Number: PROCESSOR-SDK-DRA8X

Tool/software: TI C/C++ Compiler

Hello Support,

I have a question about the auto-vectorization feature of the C7x compiler. The C7000 Compiler Guide states:

4.13.12 Vectorization (SIMD)
The compiler may convert a loop such that it uses instructions that operate on more than one piece of
data at a time, increasing the performance dramatically.

Is there any way to tell whether the compiler actually vectorized a loop, or what it needs in order to do so?

Moreover, does the performance advisor already work with ti-cgt-c7000_1.2.0.STS? I wrote a simple vector multiply-add program, and when I turn on the performance advisor there is no additional output. When I compile the same program with the C66x compiler and the performance advisor enabled, it does tell me where I can optimize.

Thank you and kind regards,

Florian

over 6 years ago

0 Rishabh Garg over 6 years ago

TI__Guru 55685 points

Hi,

Can you please share the code you have written and the compiler feedback, i.e. the asm file generated by the compiler?

Regards,

Rishabh

0 Florian Tramnitzke over 6 years ago in reply to Rishabh Garg

Intellectual 335 points

    T* a = new T[numElements];
    T* b = new T[numElements];
    T* c = new T[numElements];

    fillArray(a);
    fillArray(b);
    fillArray(c);

    for (size_t i = 0; i < numElements; i++)
        a[i] = a[i] + b[i] * c[i];

    delete[] a;
    delete[] b;
    delete[] c;

This is the main part of the function that I measure. The function is called with double, float, int32 and int8 as the type T. fillArray simply fills the array with random values. numElements is a constexpr set to 1000.

6371.BenchmarkMAC.asm

0 Rishabh Garg over 6 years ago in reply to Florian Tramnitzke

TI__Guru 55685 points

Hi,

There are multiple loops in the shared asm file.

Can you please share the exact code and the corresponding compiler feedback so that I can analyze it properly?

Regards,

Rishabh

0 Florian Tramnitzke over 6 years ago in reply to Rishabh Garg

Intellectual 335 points

Hi Rishabh,

I made some more modifications to the code and loops and tested them against a hand-written SIMD version. I compiled with the C7000-cgt-1.2.0.STS compiler using -O3 --opt_for_speed=4. I have attached the file. The results I got were the following:

Simple loop
double MAC average time: 16.914 us
float MAC average time: 16.911 us
int32_t MAC average time: 16.895 us
int8_t MAC average time: 16.939 us

Intrinsics
double MAC average time: 4.369 us
float MAC average time: 2.193 us
int32_t MAC average time: 2.065 us
int8_t MAC average time: 0.528 us

The timings in us were measured by reading the elapsed cycles from the TSC register and converting them to time at a clock speed of 1 GHz (1 cycle = 1 ns). Even if the time calculation is wrong, the reduced cycle counts for the intrinsics version show me that it is far faster than the simple loop. That suggests to me that auto-vectorization is not possible for the C7x, or is it?

I also tried different variants, e.g. a simple vector add "a = b + c" so that the a vector is not used as both input and output, but that was not vectorized either.
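For readers who want to check the arithmetic, here is a minimal sketch of the cycle-to-time conversion and the resulting speedup. The helper names cyclesToMicroseconds and speedup are hypothetical and not part of the attached code, and reading the TSC register itself is not shown because it is target specific. The sketch assumes the 1 GHz clock mentioned in the post.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: convert an elapsed cycle count into microseconds
// for a given clock frequency in Hz (1.0e9 for the 1 GHz assumed here).
inline double cyclesToMicroseconds(std::uint64_t cycles, double clockHz)
{
    assert(clockHz > 0.0);
    return static_cast<double>(cycles) / clockHz * 1.0e6;
}

// Hypothetical helper: ratio of a baseline time to an optimized time.
inline double speedup(double baselineUs, double optimizedUs)
{
    assert(optimizedUs > 0.0);
    return baselineUs / optimizedUs;
}
```

At 1 GHz one cycle is one nanosecond, so 16.914 us corresponds to roughly 16914 cycles (a value back-computed from the reported time, not taken from the attachment). With the numbers above, speedup(16.939, 0.528) comes out at about 32 for the int8_t case.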

Kind regards,

Florian

8306.code.zip

0 Todd Hahn over 6 years ago in reply to Florian Tramnitzke

TI__Expert 3455 points

This may be unexpected. I'm having the appropriate compiler engineer look into why the compiler is not vectorizing the macArraysSimple() functions.

The results of vectorization can be seen in a couple of ways. First, the "SOFTWARE PIPELINE INFORMATION" comment block in the assembly file shows that the loop has been unrolled by some amount, say 32x. This may or may not indicate vectorization has occurred, but is often associated with vectorization. (Sometimes the compiler will unroll for reasons other than to perform vectorization.) Second, the software pipelined loop is using instructions with a "V" such as VMPYWW and VADDW. The 'V' in the instruction mnemonics often (but not always) indicates that the compiler has vectorized a code sequence (using vector/SIMD instructions).

;* Loop Unroll Multiple : 32x

0 Todd Hahn over 6 years ago

TI__Expert 3455 points

Florian,

After discussing this issue with the appropriate compiler engineer, I have some explanation and a potential solution.

The short story is that the compiler is being conservative in this case because of the unsigned iteration counter and iteration limit. Try making both the iteration counter and iteration limit a signed value, say for example "int" (rather than unsigned -- size_t).

The compiler's conservatism stems from the way the C89/C99 language standard defines potential overflow in signed versus unsigned arithmetic. For unsigned overflow, the wrap-around behavior (going from the maximum value to zero) is defined by the C standard, so the compiler must be conservative about certain optimizations on loops with unsigned iteration counters that may wrap around. For signed overflow, the C standard says the behavior is undefined, so the compiler can draw more conclusions about the behavior of the iteration counter.
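As a rough illustration of the suggested change, the sketch below shows the same multiply-add loop with an unsigned and with a signed iteration counter. The function names macUnsignedCounter and macSignedCounter are hypothetical and are not taken from the attached code. Both variants compute the same result; whether a particular compiler version vectorizes either one is not guaranteed and should be checked in the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Original form: unsigned counter and limit (size_t). Unsigned wrap-around
// is well defined, so the compiler may have to stay conservative here.
template <typename T>
void macUnsignedCounter(T* a, const T* b, const T* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; i++)
        a[i] = a[i] + b[i] * c[i];
}

// Suggested form: signed counter and limit (int32_t). Signed overflow is
// undefined behavior, so the compiler may assume the counter never wraps.
template <typename T>
void macSignedCounter(T* a, const T* b, const T* c, std::int32_t n)
{
    assert(n >= 0);  // a negative limit would simply skip the loop
    for (std::int32_t i = 0; i < n; i++)
        a[i] = a[i] + b[i] * c[i];
}
```

With numElements declared as a signed constexpr (for example int32_t numElements = 1000), the call site only needs to pass the signed limit. The loop body itself is unchanged.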

I hope this information helps. We will be putting this information into a C7000 optimization guide that we plan to release in the middle of next year.

-Todd

0 Florian Tramnitzke over 6 years ago in reply to Todd Hahn

Intellectual 335 points

Hi Todd,

Thank you for that hint. Using int32 instead of size_t loop counters solved the problem, and the compiler was able to auto-vectorize the code!

Kind regards,

Florian

0 Rishabh Garg over 6 years ago in reply to Florian Tramnitzke

TI__Guru 55685 points

Hi,

Thanks for the confirmation. I am closing this thread.

Regards,

Rishabh
