TMS320F28388D: Optimization Behavior with -O4 in TI C2000 Compiler

Part Number: TMS320F28388D

I am using the TI C2000 compiler with the following optimization flags:

/opt/ti/ti-cgt-c2000_22.6.1.LTS/bin/cl2000 \
--issue_remarks --gen_opt_info=2 -v28 -ml -O4 -op=3 \
--c_src_interlist --auto_inline --verbose_diagnostics \
--advice:performance=all --opt_for_speed=5 \
--preproc_with_compile --keep_asm \
-I/opt/ti/ti-cgt-c2000_22.6.1.LTS/include \
main.cpp

However, I have noticed a few inefficiencies in the code optimization behavior at the `-O4` optimization level, and I wanted to ask if there are any recommendations or insights to address these.

Static Table Optimization

In the first example, where I have a static table[] in the function foo(), I expected the compiler to optimize this table away and remove unnecessary memory accesses. However, the table is still being accessed directly, even though the value of a is within the bounds of the table. In comparison, GCC at optimization level -O1 would handle this more efficiently. Is there any way to ensure that the table is properly optimized away?

/*
MOVZ AR6,AL ; [CPU_ALU] |4|
MOVL XAR4,#_table$1 ; [CPU_ARAU] |6|
SETC SXM ; [CPU_ALU]
MOVL ACC,XAR4 ; [CPU_ALU] |6|
ADD ACC,AR6 ; [CPU_ALU] |6|
MOVL XAR4,ACC ; [CPU_ALU] |6|
MOV AL,*+XAR4[0] ; [CPU_ALU] |6|
LRETR ; [CPU_ALU]
*/
int foo(char a) {
static const int table[] = { 1,2,3,4,5 };
return table[a];
}
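
As a hand-written alternative, and only because this particular table holds the consecutive values 1 to 5, the lookup can be replaced by arithmetic. The following is a minimal sketch, not something the compiler is claimed to produce; `foo_table` and `foo_arith` are hypothetical names introduced for illustration, and neither function checks that the index lies in 0..4, exactly like the original `foo()`.

```cpp
// Original form: indexed load from a constant table.
int foo_table(char a) {
    static const int table[] = { 1, 2, 3, 4, 5 };
    return table[a];
}

// Hand-optimized form (assumption: valid only because table[i] == i + 1
// for every in-range index i in 0..4).
int foo_arith(char a) {
    return a + 1;
}
```

For an arbitrary table and an index known only at run time, some form of load remains necessary, so this rewrite applies only when the table contents follow a simple formula.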

Auto inline

In the second example, the read() function is simple and should ideally be inlined, especially given the --auto_inline flag. However, the compiler does not seem to inline this function. GCC at -O1 inlines it automatically. Is there a reason why this function is not inlined in the C2000 compiler even with -O4, and are there additional flags that can ensure this?

I also observed that the compiler generates an unnecessary call to memcpy(), which hurts performance. The code only moves a few values around, which could be done with simple move instructions, so I was surprised to see the memcpy call. How can I avoid it, or is there a setting that optimizes this pattern better?

/*
ADDB SP,#4 ; [CPU_ARAU]
MOV PL,#65012 ; [CPU_ALU] |2|
MOVZ AR4,SP ; [CPU_ALU] |8|
MOV PH,#16180 ; [CPU_ALU] |2|
MOVL ACC,XAR6 ; [CPU_ALU] |7|
MOVL *-SP[4],P ; [CPU_ALU] |2|
SUBB XAR4,#4 ; [CPU_ARAU] |8|
MOV PL,#65012 ; [CPU_ALU] |2|
MOV PH,#16180 ; [CPU_ALU] |2|
MOVZ AR5,AR4 ; [CPU_ALU] |8|
MOVL *-SP[2],P ; [CPU_ALU] |2|
B $C$L1,EQ ; [CPU_ALU] |8|
MOVL XAR4,ACC ; [CPU_ALU] |8|
MOVB ACC,#4 ; [CPU_ALU] |8|
LCR #_memcpy ; [CPU_ALU] |8| << Whoooh!
SUBB SP,#4 ; [CPU_ARAU]
LRETR ; [CPU_ALU]
*/
struct float2 {
float2(float x, float y) : x(x), y(y) {}

I would appreciate any insights or suggestions on improving the optimization for these cases with the TI C2000 compiler.

  • In the first example, where I have a static table[] in the function foo(), I expected the compiler to optimize this table away and remove unnecessary memory accesses. However, the table is still being accessed directly, even though the value of a is within the bounds of the table. In comparison, GCC at optimization level -O1 would handle this more efficiently.

    I am unable to reproduce this result with the GCC compiler. Please view this attempt with Compiler Explorer.

    In the second example, the read() function is simple and should ideally be inlined
    are there additional flags that can ensure this?

    Try ...

    __attribute__((always_inline))
    float2 read() { return float2(0.707f, 0.707f); }

    Please be cautious when using always_inline.  For details, please search for it in the C28x compiler manual.
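
To keep forced inlining limited to a few chosen functions, the attribute can be wrapped in a macro. This is only a sketch under stated assumptions: `FORCE_INLINE` and `sum_components` are hypothetical names, and the `x`/`y` members of `float2` are assumed because the struct definition in the original post is truncated. `__TI_COMPILER_VERSION__` is the macro TI compilers predefine; the fallback branch simply uses plain `inline`.

```cpp
// Hypothetical helper macro: force inlining only where profiling justifies it.
#if defined(__TI_COMPILER_VERSION__) || defined(__GNUC__)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline
#endif

// Assumed layout: the original struct definition is truncated in the post.
struct float2 {
    float2(float x, float y) : x(x), y(y) {}
    float x;
    float y;
};

// The function from the suggestion above, marked through the macro.
FORCE_INLINE float2 read() { return float2(0.707f, 0.707f); }

// Hypothetical caller, used only to exercise read().
float sum_components() {
    float2 v = read();
    return v.x + v.y;
}
```

Whether this also removes the memcpy call on cl2000 has not been verified here; checking the `--keep_asm` output is the way to confirm.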

    Thanks and regards,

    -George

  • For the first example, I have to agree: I somehow got a wrong result with GCC. With the `static` keyword, it gets fully optimized:

    foo(char):
    movsx rdi, dil
    mov eax, DWORD PTR foo(char)::table[0+rdi*4]
    ret
    And at `-O3` it looks like this:
    foo(char):
    movdqa xmm0, XMMWORD PTR .LC0[rip]
    movsx rdi, dil
    mov DWORD PTR [rsp-24], 5
    movaps XMMWORD PTR [rsp-40], xmm0
    mov eax, DWORD PTR [rsp-40+rdi*4]
    ret
    Regarding inlining, I am not ready to put `__attribute__((always_inline))` in front of every function, but on the other hand, I am not ready to waste CPU cycles on memcpy and LCR either.
    Did I miss something?
  • Regarding inlining, I am not ready to put `__attribute__((always_inline))` in front of every function, but on the other hand, I am not ready to waste CPU cycles on memcpy and LCR either.
    Did I miss something?

    I think you have a good understanding of the always_inline attribute.  Only use it in a few places where performance testing has shown it is clearly needed.

    Thanks and regards,

    -George

  • Yeah, but why does cl2000 not auto-inline functions that are used only once? This is the default behavior in most compilers. Is there a technical reason?

  • Why does cl2000 not auto-inline functions that are used only once?

    Based on what I see here, I cannot answer.  Do you have a case where all of these conditions occur in a single source file?

    • A function is called one time
    • That function is implemented in the same file
    • The function does not get inlined

    If so, for that file, please follow the directions in the article How to Submit a Compiler Test Case.
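
A single-file test case matching the three conditions above might be structured as follows. This is a hypothetical sketch, not code from the thread: `scale` and `entry` are invented names. Compiling it with `--keep_asm`, as in the original command line, and looking for an `LCR` call to the function in the generated assembly would show whether it was inlined.

```cpp
// Hypothetical reproducer: scale() is implemented in this file and is
// called exactly one time, so it meets the first two conditions.
static int scale(int v) {
    return v * 3 + 1;
}

// The only call site of scale(). If the generated assembly still contains
// a call to scale here, the third condition is met as well.
int entry(int v) {
    return scale(v);
}
```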

    Thanks and regards,

    -George