TMS320F28388D: Optimization Behavior with -O4 in TI C2000 Compiler

Yves Chevallier

Part Number: TMS320F28388D

I am using the TI C2000 compiler with the following optimization flags:

/opt/ti/ti-cgt-c2000_22.6.1.LTS/bin/cl2000 \
--issue_remarks --gen_opt_info=2 -v28 -ml -O4 -op=3 \
--c_src_interlist --auto_inline --verbose_diagnostics \
--advice:performance=all --opt_for_speed=5 \
--preproc_with_compile --keep_asm \
-I/opt/ti/ti-cgt-c2000_22.6.1.LTS/include \
main.cpp

However, I have noticed a few inefficiencies in the code generated at the `-O4` optimization level, and I would like to ask whether there are any recommendations or insights for addressing them.

Static Table Optimization

In the first example, where I have a static table[] in the function foo(), I expected the compiler to optimize this table away and remove unnecessary memory accesses. However, the table is still being accessed directly, even though the value of a is within the bounds of the table. In comparison, GCC at optimization level -O1 would handle this more efficiently. Is there any way to ensure that the table is properly optimized away?

/*
    MOVZ      AR6,AL                ; [CPU_ALU] |4|
    MOVL      XAR4,#_table$1        ; [CPU_ARAU] |6|
    SETC      SXM                   ; [CPU_ALU]
    MOVL      ACC,XAR4              ; [CPU_ALU] |6|
    ADD       ACC,AR6               ; [CPU_ALU] |6|
    MOVL      XAR4,ACC              ; [CPU_ALU] |6|
    MOV       AL,*+XAR4[0]          ; [CPU_ALU] |6|
    LRETR     ; [CPU_ALU]
*/
int foo(char a) {
    static const int table[] = { 1,2,3,4,5 };
    return table[a];
}
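For comparison, here is a minimal sketch of a table-free variant. It relies on the observation that, for this particular table, the lookup equals `a + 1` whenever `0 <= a <= 4`. The name `foo_arith` is hypothetical and not part of the original code, and whether cl2000 generates shorter code for it is an assumption that would need to be checked in the `--keep_asm` output.

```cpp
// Table-based version, as shown above.
int foo(char a) {
    static const int table[] = { 1, 2, 3, 4, 5 };
    return table[a];
}

// Hypothetical table-free variant (foo_arith is an illustrative name).
// Assumption: the table stays { 1, 2, 3, 4, 5 } and 0 <= a <= 4,
// so the lookup can be replaced by plain arithmetic.
inline int foo_arith(char a) {
    return static_cast<int>(a) + 1;
}
```

This only applies when the table contents follow a simple formula; for arbitrary constants the memory access remains necessary because `a` is not known at compile time.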

Auto inline

In the second example, the read() function is simple and should ideally be inlined, especially given the --auto_inline flag. However, the compiler does not seem to inline this function. GCC at -O1 inlines it automatically. Is there a reason why this function is not inlined in the C2000 compiler even with -O4, and are there additional flags that can ensure this?

I also observed that the compiler generates an unnecessary call to memcpy(), which is not ideal for performance. The code essentially moves values around, which could be done with simpler instructions, so I was surprised to see the memcpy call. How can I avoid this, or is there a setting that optimizes this pattern better?

/*
ADDB      SP,#4                 ; [CPU_ARAU]
MOV       PL,#65012             ; [CPU_ALU] |2|
MOVZ      AR4,SP                ; [CPU_ALU] |8|
MOV       PH,#16180             ; [CPU_ALU] |2|
MOVL      ACC,XAR6              ; [CPU_ALU] |7|
MOVL      *-SP[4],P             ; [CPU_ALU] |2|
SUBB      XAR4,#4               ; [CPU_ARAU] |8|
MOV       PL,#65012             ; [CPU_ALU] |2|
MOV       PH,#16180             ; [CPU_ALU] |2|
MOVZ      AR5,AR4               ; [CPU_ALU] |8|
MOVL      *-SP[2],P             ; [CPU_ALU] |2|
B         $C$L1,EQ              ; [CPU_ALU] |8|
MOVL      XAR4,ACC              ; [CPU_ALU] |8|
MOVB      ACC,#4                ; [CPU_ALU] |8|
LCR       #_memcpy              ; [CPU_ALU] |8| << Whoooh!
SUBB      SP,#4                 ; [CPU_ARAU]
LRETR     ; [CPU_ALU]
*/
struct float2 {
    float2(float x, float y) : x(x), y(y) {}
    float x, y;
};

// Should be auto inlined (called once)
float2 read() { return float2(0.707f, 0.707f); }

float test() {
    float2 v = read();
    return v.x + v.y;
}

I would appreciate any insights or suggestions on improving the optimization for these cases with the TI C2000 compiler.

10 days ago

0 George Mock 10 days ago

TI__Guru**** 244930 points

Yves Chevallier said:
In the first example, where I have a static table[] in the function foo(), I expected the compiler to optimize this table away and remove unnecessary memory accesses. However, the table is still being accessed directly, even though the value of a is within the bounds of the table. In comparison, GCC at optimization level -O1 would handle this more efficiently.

I am unable to reproduce this result with the GCC compiler. Please view this attempt with Compiler Explorer.

Yves Chevallier said:
In the second example, the read() function is simple and should ideally be inlined

Yves Chevallier said:
are there additional flags that can ensure this?

Try ...

__attribute__((always_inline))
float2 read() { return float2(0.707f, 0.707f); }

Please be cautious when using always_inline. For details, please search for it in the C28x compiler manual.
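As a minimal sketch, the complete second example with the attribute applied only to `read()` could look like the following. The `inline` keyword is an addition made here as an assumption: GCC may warn when `always_inline` is used on a function that is not declared `inline`, and whether cl2000 needs it has not been verified.

```cpp
struct float2 {
    float2(float x, float y) : x(x), y(y) {}
    float x, y;
};

// Attribute spelling taken from the suggestion above.
// Assumption: `inline` is added so that GCC accepts the attribute
// without a warning; check the cl2000 manual for its requirements.
__attribute__((always_inline))
inline float2 read() { return float2(0.707f, 0.707f); }

float test() {
    float2 v = read();   // the only call site of read()
    return v.x + v.y;
}
```

Inspecting the assembly kept by `--keep_asm` for `test()` should show whether the call to `read()` and the `memcpy` call are still present.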

Thanks and regards,

-George

0 Yves Chevallier 10 days ago in reply to George Mock

Prodigy 30 points

For the first example, I have to agree: I somehow got a wrong result with GCC. With the `static` keyword, it gets fully optimized:

foo(char):
 movsx rdi, dil
 mov eax, DWORD PTR foo(char)::table[0+rdi*4]
 ret

and at `-O3` it looks like this:

foo(char):
 movdqa xmm0, XMMWORD PTR .LC0[rip]
 movsx  rdi, dil
 mov    DWORD PTR [rsp-24], 5
 movaps XMMWORD PTR [rsp-40], xmm0
 mov    eax, DWORD PTR [rsp-40+rdi*4]
 ret

Regarding inlining, I am not ready to put `__attribute__((always_inline))` in front of every function, but on the other hand, I am not ready to waste CPU cycles on memcpy and LCR either.

Did I miss something?

0 George Mock 7 days ago in reply to Yves Chevallier

TI__Guru**** 244930 points

Yves Chevallier said:
Regarding inlining, I am not ready to put `__attribute__((always_inline))` in front of every function, but on the other hand, I am not ready to waste CPU cycles on memcpy and LCR either.

Did I miss something?

I think you have a good understanding of the always_inline attribute. Only use it in a few places where performance testing has shown it is clearly needed.

Thanks and regards,

-George

0 Yves Chevallier 5 days ago in reply to George Mock

Prodigy 30 points

Yeah, but why does cl2000 not auto-inline functions that are used only once? This is the default behavior in most compilers. Is there a technical reason?

0 George Mock 5 days ago in reply to Yves Chevallier

TI__Guru**** 244930 points

Yves Chevallier said:
why does cl2000 not auto-inline functions that are used only once?

Based on what I see here, I cannot answer. Do you have a case where all of these conditions occur in a single source file?

A function is called one time
That function is implemented in the same file
The function does not get inlined

If so, for that file, please follow the directions in the article How to Submit a Compiler Test Case.

Thanks and regards,

-George
