Compiler/PROCESSOR-SDK-DRA8X: Auto-vectorization and performance advisor with the C7x compiler

Part Number: PROCESSOR-SDK-DRA8X

Tool/software: TI C/C++ Compiler

Hello Support,

I was wondering about the auto-vectorization feature of the C7x compiler. The C7000 Compiler Guide states:

4.13.12 Vectorization (SIMD)
The compiler may convert a loop such that it uses instructions that operate on more than one piece of
data at a time, increasing the performance dramatically.

Is there any indication whether the compiler actually did that or what it needs to achieve this?

Moreover, does the performance advisor already work with ti-cgt-c7000_1.2.0.STS? I wrote a simple vector multiply-add program, and when I turn on the performance advisor there is no additional output. If I compile the same program with the C66x compiler and the performance advisor enabled, it does tell me where I can make some optimizations.

Thank you and kind regards,

Florian

  • Hi,

    Can you please share the code that you have written and the compiler feedback, i.e. the asm file generated by the compiler?

    Regards,

    Rishabh

  •     T* a = new T[numElements];
        T* b = new T[numElements];
        T* c = new T[numElements];
    
        fillArray(a);
        fillArray(b);
        fillArray(c);
    
        for (size_t i = 0; i < numElements; i++)
            a[i] = a[i] + b[i] * c[i];
    
        delete[] a;
        delete[] b;
        delete[] c;
    

    This is the main part of the function that I measure. The function is called with double, float, int32 and int8 as the type T. fillArray simply fills the array with random values. numElements is a constexpr set to 1000.

    6371.BenchmarkMAC.asm

  • Hi,

    There are multiple loops in the shared asm file.

    Can you please share the exact code and the corresponding compiler feedback so that I can analyze it properly?

    Regards,

    Rishabh

  • Hi Rishabh,

    I made some more modifications to the code and loops and tested them against a SIMD version written by hand. I compiled with the C7000-cgt-1.2.0.STS compiler and with -O3 --opt_for_speed=4. I attached the file. The results I got were the following:

    Simple loop
    double MAC average time: 16.914 us
    float MAC average time: 16.911 us
    int32_t MAC average time: 16.895 us
    int8_t MAC average time: 16.939 us

    Intrinsics
    double MAC average time: 4.369 us
    float MAC average time: 2.193 us
    int32_t MAC average time: 2.065 us
    int8_t MAC average time: 0.528 us

    The timings in us were measured by reading the elapsed cycles from the TSC register and converting them to time at the clock speed of 1 GHz. Even if the calculation of the times is wrong, the reduced cycle counts for the intrinsics version show me that it is far faster than the simple loop. That tells me that auto-vectorization does not seem to be possible for the C7x, or is it?
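    For reference, the cycle-to-time conversion described above can be sketched as follows. This is only an illustration: the helper names cyclesToMicroseconds and speedup are hypothetical and not part of the attached code, and the 1 GHz default is simply the clock speed quoted in this post. With these numbers, the int8_t case (16.939 us against 0.528 us) works out to a ratio of roughly 32.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: convert an elapsed cycle count into microseconds.
// clockHz is the rate at which the cycle counter ticks; 1 GHz is the value
// quoted in the thread, so it is used as the default here (an assumption).
inline double cyclesToMicroseconds(std::uint64_t cycles, double clockHz = 1.0e9) {
    assert(clockHz > 0.0);
    const double seconds = static_cast<double>(cycles) / clockHz;
    return seconds * 1.0e6;
}

// Hypothetical helper: ratio of two measurements, for example the simple
// loop time divided by the intrinsics time.
inline double speedup(double baselineUs, double optimizedUs) {
    assert(optimizedUs > 0.0);
    return baselineUs / optimizedUs;
}
```

    Because the ratio of two times equals the ratio of the two cycle counts, the comparison between the simple loop and the intrinsics version does not depend on the assumed clock speed.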

    I also played around with different variants, e.g. a simple vector add of "a = b + c", so that the a vector is not used as both input and output, but that was not vectorized either.

    Kind regards,

    Florian

    8306.code.zip

  • This may be unexpected. I'm having the appropriate compiler engineer look into why the compiler is not vectorizing the macArraysSimple() functions.

    The results of vectorization can be seen in a couple of ways. First, the "SOFTWARE PIPELINE INFORMATION" comment block in the assembly file shows that the loop has been unrolled by some amount, say 32x. This may or may not indicate vectorization has occurred, but is often associated with vectorization. (Sometimes the compiler will unroll for reasons other than to perform vectorization.) Second, the software pipelined loop is using instructions with a "V" such as VMPYWW and VADDW. The 'V' in the instruction mnemonics often (but not always) indicates that the compiler has vectorized a code sequence (using vector/SIMD instructions).

    ;*     Loop Unroll Multiple             : 32x
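    If you want to check many generated assembly files for this comment, a small host-side helper along the following lines could be used. This is a sketch only: the function name findLoopUnrollMultiple is made up for illustration, and the parsing assumes the comment looks exactly like the line quoted above. As noted, an unroll factor is a hint and not proof that vectorization occurred.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: scan assembly text for a comment of the form
//   ;*     Loop Unroll Multiple             : 32x
// and return the first unroll factor found, or 0 if no such line exists.
inline int findLoopUnrollMultiple(const std::string& asmText) {
    const std::string key = "Loop Unroll Multiple";
    std::istringstream in(asmText);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t k = line.find(key);
        if (k == std::string::npos) continue;
        const std::size_t colon = line.find(':', k + key.size());
        if (colon == std::string::npos) continue;
        std::size_t p = colon + 1;
        while (p < line.size() && line[p] == ' ') ++p;  // skip padding
        int value = 0;
        bool sawDigit = false;
        while (p < line.size() &&
               std::isdigit(static_cast<unsigned char>(line[p]))) {
            value = value * 10 + (line[p] - '0');
            sawDigit = true;
            ++p;
        }
        if (sawDigit) return value;
    }
    return 0;
}
```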

  • Florian,

    After discussing this issue with the appropriate compiler engineer, I have some explanation and a potential solution.

    The short story is that the compiler is being conservative in this case because of the unsigned iteration counter and iteration limit. Try making both the iteration counter and iteration limit a signed value, say for example "int" (rather than unsigned -- size_t).

    The compiler's conservatism stems from the way the C89/C99 language standards define the behavior of potential overflow in signed versus unsigned arithmetic. For unsigned overflow, the wrap-around behavior (going from max-positive to zero) is defined by the C standard, and therefore the compiler must be conservative about certain optimizations it makes to loops with unsigned iteration counters that may wrap around. For signed overflow, the C standard says the behavior is undefined, so the compiler can draw more conclusions about the behavior of the iteration counter.
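    As a minimal sketch of the suggested change, the two loop forms are shown side by side below. The function names macArraysUnsigned and macArraysSigned and their signatures are hypothetical and are not taken from the attached code; only the loop body matches the one posted earlier in this thread. Both functions compute the same result. The difference is only in what the compiler is allowed to assume about the counter.

```cpp
#include <cstddef>
#include <cstdint>

// Form used in the original post: unsigned (size_t) counter and limit.
// Unsigned wrap-around is well defined, so the compiler may have to be
// conservative when reasoning about the trip count.
template <typename T>
void macArraysUnsigned(T* a, const T* b, const T* c, std::size_t numElements) {
    for (std::size_t i = 0; i < numElements; i++)
        a[i] = a[i] + b[i] * c[i];
}

// Suggested form: signed counter and limit. Signed overflow is undefined
// behavior, so the compiler may assume the counter never wraps.
template <typename T>
void macArraysSigned(T* a, const T* b, const T* c, std::int32_t numElements) {
    for (std::int32_t i = 0; i < numElements; i++)
        a[i] = a[i] + b[i] * c[i];
}
```

    Whether the signed form is actually vectorized still has to be confirmed in the generated assembly for the compiler version and options in use.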

    I hope this information helps. We will be putting this information into a C7000 optimization guide that we plan to release in the middle of next year.

    -Todd

  • Hi Todd,

    Thank you for that hint. Using int32 instead of size_t for the loop counters solved the problem, and the compiler was able to auto-vectorize the code!

    Kind regards,

    Florian

  • Hi,

    Thanks for the confirmation. I am closing this thread.

    Regards,

    Rishabh