TMS320F28388D: Optimization Behavior with -O4 in TI C2000 Compiler

Part Number: TMS320F28388D

I am using the TI C2000 compiler with the following optimization flags:

/opt/ti/ti-cgt-c2000_22.6.1.LTS/bin/cl2000 \
--issue_remarks --gen_opt_info=2 -v28 -ml -O4 -op=3 \
--c_src_interlist --auto_inline --verbose_diagnostics \
--advice:performance=all --opt_for_speed=5 \
--preproc_with_compile --keep_asm \
-I/opt/ti/ti-cgt-c2000_22.6.1.LTS/include \
main.cpp

However, I have noticed a few inefficiencies in the code generated at the `-O4` optimization level, and I would like to ask whether there are any recommendations for addressing them.

Static Table Optimization

In the first example, where I have a static table[] in the function foo(), I expected the compiler to optimize this table away and remove unnecessary memory accesses. However, the table is still being accessed directly, even though the value of a is within the bounds of the table. In comparison, GCC at optimization level -O1 would handle this more efficiently. Is there any way to ensure that the table is properly optimized away?

/*
    MOVZ      AR6,AL                ; [CPU_ALU] |4|
    MOVL      XAR4,#_table$1        ; [CPU_ARAU] |6|
    SETC      SXM                   ; [CPU_ALU]
    MOVL      ACC,XAR4              ; [CPU_ALU] |6|
    ADD       ACC,AR6               ; [CPU_ALU] |6|
    MOVL      XAR4,ACC              ; [CPU_ALU] |6|
    MOV       AL,*+XAR4[0]          ; [CPU_ALU] |6|
    LRETR     ; [CPU_ALU]
*/
int foo(char a) {
    static const int table[] = { 1,2,3,4,5 };
    return table[a];
}
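
For this particular table, entry i is simply i + 1, so a table-free version can be written by hand. This is only a minimal sketch: `foo_closed_form` is a hypothetical name that is not in my original code, and it assumes the caller guarantees 0 <= a <= 4 (the assert only documents that assumption).

```cpp
#include <cassert>

// Original table-based version from the example above.
int foo(char a) {
    static const int table[] = { 1, 2, 3, 4, 5 };
    return table[a];
}

// Hypothetical table-free equivalent: table[i] == i + 1 for i in 0..4.
// Assumes the caller guarantees 0 <= a <= 4.
int foo_closed_form(char a) {
    assert(a >= 0 && a <= 4);
    return a + 1;
}
```

Both functions return the same value for every in-range index, so the table lookup is redundant for this data. That does not hold for arbitrary table contents.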

Auto inline

In the second example, the read() function is simple and should ideally be inlined, especially given the --auto_inline flag. However, the compiler does not seem to inline this function. GCC at -O1 inlines it automatically. Is there a reason why this function is not inlined in the C2000 compiler even with -O4, and are there additional flags that can ensure this?

I also observed that the compiler generates a call to memcpy(), which seems unnecessary and costs performance. The code only moves a few values around, which simpler instructions could handle, so I was surprised to see the memcpy call. How can I avoid this, or is there a setting that optimizes this pattern better?

/*
ADDB      SP,#4                 ; [CPU_ARAU]
MOV       PL,#65012             ; [CPU_ALU] |2|
MOVZ      AR4,SP                ; [CPU_ALU] |8|
MOV       PH,#16180             ; [CPU_ALU] |2|
MOVL      ACC,XAR6              ; [CPU_ALU] |7|
MOVL      *-SP[4],P             ; [CPU_ALU] |2|
SUBB      XAR4,#4               ; [CPU_ARAU] |8|
MOV       PL,#65012             ; [CPU_ALU] |2|
MOV       PH,#16180             ; [CPU_ALU] |2|
MOVZ      AR5,AR4               ; [CPU_ALU] |8|
MOVL      *-SP[2],P             ; [CPU_ALU] |2|
B         $C$L1,EQ              ; [CPU_ALU] |8|
MOVL      XAR4,ACC              ; [CPU_ALU] |8|
MOVB      ACC,#4                ; [CPU_ALU] |8|
LCR       #_memcpy              ; [CPU_ALU] |8| << Whoooh!
SUBB      SP,#4                 ; [CPU_ARAU]
LRETR     ; [CPU_ALU]
*/
struct float2 {
    float2(float x, float y) : x(x), y(y) {}
    float x, y;
};

// Should be auto inlined (called once)
float2 read() { return float2(0.707f, 0.707f); }

float test() {
    float2 v = read();
    return v.x + v.y;
}

I would appreciate any insights or suggestions on improving the optimization for these cases with the TI C2000 compiler.

  • In the first example, where I have a static table[] in the function foo(), I expected the compiler to optimize this table away and remove unnecessary memory accesses. However, the table is still being accessed directly, even though the value of a is within the bounds of the table. In comparison, GCC at optimization level -O1 would handle this more efficiently.

    I am unable to reproduce this result with the GCC compiler. Please view this attempt with Compiler Explorer.

    In the second example, the read() function is simple and should ideally be inlined
    are there additional flags that can ensure this?

    Try ...

    __attribute__((always_inline))
    float2 read() { return float2(0.707f, 0.707f); }

    Please be cautious when using always_inline.  For details, please search for it in the C28x compiler manual.
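
    Put together with the code from the question, the suggestion might look like the sketch below. The `static inline` qualifiers are an addition assumed here for illustration (they keep the definition local to the file); they are not part of the original code, and the result has not been checked against cl2000 output.

```cpp
struct float2 {
    float2(float x, float y) : x(x), y(y) {}
    float x, y;
};

// GCC-style attribute asking the compiler to inline this function at
// every call site. `static inline` is an assumption added for this sketch.
__attribute__((always_inline))
static inline float2 read() { return float2(0.707f, 0.707f); }

float test() {
    float2 v = read();   // expected to be expanded in place, with no call
    return v.x + v.y;
}
```

    Whether the call and the memcpy disappear should be confirmed in the generated assembly (for example with --keep_asm, as already used in the build command).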

    Thanks and regards,

    -George

  • For the first example I have to agree; I somehow got a wrong result in GCC. With the `static` keyword it gets fully optimized:

    foo(char):
     movsx rdi, dil
     mov eax, DWORD PTR foo(char)::table[0+rdi*4]
     ret
    And at `-O3` it looks like this:
    foo(char):
     movdqa xmm0, XMMWORD PTR .LC0[rip]
     movsx  rdi, dil
     mov    DWORD PTR [rsp-24], 5
     movaps XMMWORD PTR [rsp-40], xmm0
     mov    eax, DWORD PTR [rsp-40+rdi*4]
     ret
    Regarding the inlining, I am not ready to put `__attribute__((always_inline))` in front of every function, but on the other hand, I am not ready to waste CPU cycles on memcpy and LCR either.
    Did I miss something?
  • Regarding the inlining, I am not ready to put `__attribute__((always_inline))` in front of every function, but on the other hand, I am not ready to waste CPU cycles on memcpy and LCR either.
    Did I miss something?

    I think you have a good understanding of the always_inline attribute.  Only use it in a few places where performance testing has shown it is clearly needed.
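
    For the memcpy concern specifically, one pattern that does not depend on inlining is to avoid returning the struct by value. The sketch below is only an illustration: `read_into` and `test_no_struct` are hypothetical names, and whether cl2000 generates code without memcpy for this form has not been verified in this thread.

```cpp
// Hypothetical variant of read(): writes the two values through
// references instead of returning a float2 object by value.
void read_into(float& x, float& y) {
    x = 0.707f;
    y = 0.707f;
}

float test_no_struct() {
    float x, y;
    read_into(x, y);     // no temporary struct needs to be copied back
    return x + y;
}
```

    As with always_inline, inspect the generated assembly before adopting this in performance-critical code.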

    Thanks and regards,

    -George

  • Yeah, but why does cl2000 not auto-inline functions that are used once? This is the default behavior in most compilers. Is there a technical reason?

  • why does cl2000 not auto-inline functions that are used once?

    Based on what I see here, I cannot answer.  Do you have a case where all of these conditions occur in a single source file?

    • A function is called one time
    • That function is implemented in the same file
    • The function does not get inlined

    If so, for that file, please follow the directions in the article How to Submit a Compiler Test Case.

    Thanks and regards,

    -George