TMS320F28375D: CLA slow compared to FPU

Antenna_Head

Part Number: TMS320F28375D

Team-

Posting this on behalf of a customer who is starting to ramp up on the '28375 and will cc: them on this thread. Please contact me internally if you would like further details.

From Customer:

I am testing the timing of some code and expected the Control Law Accelerator (CLA) to be significantly faster than the same code run using the Floating Point Unit (FPU). However, when I tested it, the FPU version was about 5 microseconds faster than the CLA version. Based on the documentation, I was led to believe that code run from the CLA would be much faster.

Also, is it possible to read/write the EPWM registers (e.g., Epwm1Regs.CMPA) from the CLA?

Please direct us to the appropriate area of the documentation, and if there are any relevant training modules, please let me know.

Thanks!

Best, Steve

over 4 years ago

+1 Ashwini Athalye over 4 years ago

TI__Expert 7695 points

Hi Steve,

1. The CLA is a floating-point engine, so if the code involves a lot of floating-point math, the CLA should typically do just as well as the C28+FPU. However, the CLA does not do as well as the C28 with integer or trigonometric operations, because the C28 has better integer support and a TMU engine that performs trigonometric math. The customer may want to look at the code being run on the CLA and determine whether any of these factors is causing the difference in execution performance.

We have a CLA software development guide that has useful information. For example, here is a list of workshops:

https://software-dl.ti.com/C2000/docs/cla_software_dev_guide/getting_started.html#workshops

2. Yes, for the F2837x device, the CLA does have access to EPWM. Here is some good information on how to find out which peripherals CLA can access:

https://software-dl.ti.com/C2000/docs/cla_software_dev_guide/faq.html#which-peripheral-registers-can-the-cla-access-directly

Thanks,

Ashwini

0 Antenna_Head over 4 years ago in reply to Ashwini Athalye

TI__Genius 12740 points

Thanks much!

0 Casey Au over 4 years ago in reply to Ashwini Athalye

Prodigy 40 points

Hi!

Thank you for your answer above.

I was testing my code and found some other cases where it was slower. I use arrays of floats a lot, and noticed the CLA runs slower than, or at the same speed as, the FPU when:

1) I use variables as indexes.

Ex:

Uint16 i,x;

i = 0;

x = array1[i] + array2[i];

array1[] and array2[] are float32 globals and initialized in my main.c

2) Setting a value of an array location

Ex:

Uint16 i;

i = 0;

result[i]= 23.3 + 90.9;

result[] is also a float32 global that is initialized in my main.c

I was wondering if I need to change something for the CLA to run faster? I know if I hardcode the array positions the CLA is faster, but my code needs to use a variable index.

0 Ashwini Athalye over 4 years ago in reply to Casey Au

TI__Expert 7695 points

Hi Casey,

I am reaching out to the experts and will let you know any suggestions once I hear back from the team.

Thanks,

Ashwini

0 Casey Au over 4 years ago in reply to Ashwini Athalye

Prodigy 40 points

Hi!

Found one more oddity(?)

When I compare floats ("<", "<=", ">", ">="), whether from an array or a constant, the CLA is slower. (When I compare using "==" or "!=", they are about the same speed.)

Ex:

//time: FPU - 40ns; CLA - 60ns

Uint16 x;

if(1.2 < 2.3){

x = 5;

}

Ex2:

//array was initialized as a global in main as array[3] = {1.2, 3.4, 5.6}

//time: FPU - 0.08μs; CLA - 0.1μs

Uint16 x;

if(array[0] < 2.3){

x = 5;

}

Thanks!

0 Ashwini Athalye over 4 years ago in reply to Casey Au

TI__Expert 7695 points

Hi Casey,

Thanks, I will share this with the team as well.

Can you let me know the compiler version and optimization level being used?

Regards,

Ashwini

0 Casey Au over 4 years ago in reply to Ashwini Athalye

Prodigy 40 points

Hi!

Compiler: TI v20.2.1.LTS

Optimization Level: off

Speed vs Size trade-offs: 2

Floating point mode: strict

0 MatthewPate over 4 years ago in reply to Casey Au

TI__Guru* 80270 points

Casey,

Just wanted to let you know that Ashwini is out of the office for a few days. I'll make sure some others have eyes on this post, but worst case you can expect a reply by next Thursday the 5th, when she is back in the office.

Best,

Matthew

0 Antenna_Head over 4 years ago in reply to Casey Au

TI__Genius 12740 points

Casey-

When you do a compare such as < or >, the processor has to compute the difference of the arguments and evaluate the result. Checking for equality (== or !=) can be done faster with a bitwise compare. This might explain the difference.
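
To make the reasoning above concrete, here is a minimal host-side sketch of the two kinds of comparison. It assumes a 32-bit IEEE-754 float and non-NaN inputs, and the function names are hypothetical. It only illustrates the idea; it does not show how the CLA or FPU compare instructions are actually implemented.

```c
#include <stdint.h>
#include <string.h>

/* Copy the raw IEEE-754 bit pattern of a 32-bit float into an integer. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Ordered compare (a < b): subtract, then inspect the sign of the result.
 * Assumption: neither input is NaN. */
static int less_than_by_difference(float a, float b)
{
    uint32_t d = float_bits(a - b);
    int is_negative = (int)(d >> 31);             /* sign bit of the difference */
    int is_nonzero  = (d & 0x7FFFFFFFu) != 0u;    /* excludes +0.0 and -0.0 */
    return is_negative && is_nonzero;
}

/* Equality: compare the two bit patterns directly, no arithmetic needed.
 * Caveat: treats +0.0 and -0.0 as different, and a NaN as equal to itself. */
static int equal_by_bits(float a, float b)
{
    return float_bits(a) == float_bits(b);
}
```

The ordered compare needs a subtraction plus a sign test, while the equality test is a single integer compare, which is the asymmetry being suggested.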

0 Ashwini Athalye over 4 years ago in reply to Antenna_Head

TI__Expert 7695 points

Hi Casey,

Sorry for the late reply. I tried the two examples you posted (reposted below for reference) on the CLA and C28 and did not see a significant difference in cycle count between the code generated for the CLA and for the C28. Can you post the assembly generated for the CLA and let me know how you are benchmarking the CLA?

Ex:

//time: FPU - 40ns; CLA - 60ns

Uint16 x;

if(1.2 < 2.3){

x = 5;

}

Ex2:

//array was initialized as a global in main as array[3] = {1.2, 3.4, 5.6}

//time: FPU - 0.08μs; CLA - 0.1μs

Uint16 x;

if(array[0] < 2.3){

x = 5;

}

Thanks,

Ashwini

0 Casey Au over 4 years ago in reply to Ashwini Athalye

Prodigy 40 points

Hi!

To time the CLA and FPU/C28 code, I toggled GPIO breakout pins and measured them with a logic analyzer. Running the two at the same time or separately made no noticeable difference in the timing.

//in gpio.c
void InitGpio(void)
{
    GpioCtrlRegs.GPBCSEL2.all = 0x00010000; //GPIODAT/SET/CLEAR/TOGGLE reg. master
    GpioCtrlRegs.GPBDIR.bit.GPIO39 = 1;     //GPIO39 is an output
    GpioDataRegs.GPBSET.bit.GPIO39 = 1;
    GpioCtrlRegs.GPBDIR.bit.GPIO44 = 1;     //GPIO44 is an output
    GpioDataRegs.GPBCLEAR.bit.GPIO44 = 1;   //start low (writing 0 to GPBSET has no effect)
}

//in ClaTasks_C.cla
//have a ADC trigger this task once in a while
interrupt void ClaTask1 (void)
{
    GpioDataRegs.GPBSET.bit.GPIO44 = 1;
    Uint16 x;
    x = 0;
    if(1.2 < 2.3){
        x = 5;
    }
    GpioDataRegs.GPBCLEAR.bit.GPIO44 = 1;
}

//in test.c
//gets called in main function loop
void sampleFunction (void)
{
    GpioDataRegs.GPBSET.bit.GPIO39 = 1;
    Uint16 x;
    x = 0;
    if(1.2 < 2.3){
        x = 5;
    }
    GpioDataRegs.GPBCLEAR.bit.GPIO39 = 1;
}
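
Since the replies in this thread discuss cycle counts while the logic analyzer reports time, a small conversion helper makes the two comparable. This is a sketch only: the function name is hypothetical, and the 200 MHz SYSCLK used in the note below is an assumption (the maximum rating for this device family), since the thread does not state the actual clock configuration.

```c
#include <stdint.h>

/* Convert a measured duration in nanoseconds to clock cycles,
 * rounded to the nearest cycle. sysclk_hz must match the real
 * clock setup; the thread does not state it. */
static uint32_t ns_to_cycles(uint32_t duration_ns, uint32_t sysclk_hz)
{
    uint64_t scaled = (uint64_t)duration_ns * sysclk_hz;
    return (uint32_t)((scaled + 500000000ULL) / 1000000000ULL);
}
```

Under that 200 MHz assumption, the 40 ns and 60 ns readings correspond to 8 and 12 cycles, so the reported gap is about 4 cycles, including whatever the GPIO writes themselves cost.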

Here is the assembly code for the section:

22 GpioDataRegs.GPBSET.bit.GPIO44 = 1;
Cla1Task1(), c:
0000a60a: 78410000 MMOVIZ MR1, #0x0
0000a60c: 75807F0A MMOVZ16 MR0, @0x7f0a
0000a60e: 78811000 MMOVXI MR1, #0x1000
0000a610: 7C800004 MOR32 MR0, MR1, MR0
0000a612: 75C07F0A MMOV16 @0x7f0a, MR0
24 x = 0;
0000a614: 78400000 MMOVIZ MR0, #0x0
0000a616: 75C08000 MMOV16 @0x8000, MR0
26 x = 5;
0000a618: 78400000 MMOVIZ MR0, #0x0
0000a61a: 78800005 MMOVXI MR0, #0x5
0000a61c: 75C08000 MMOV16 @0x8000, MR0
28 GpioDataRegs.GPBCLEAR.bit.GPIO44 = 1;
0000a61e: 78410000 MMOVIZ MR1, #0x0
0000a620: 78811000 MMOVXI MR1, #0x1000
0000a622: 75807F0C MMOVZ16 MR0, @0x7f0c
0000a624: 7C800004 MOR32 MR0, MR1, MR0
0000a626: 75C07F0C MMOV16 @0x7f0c, MR0
29 }
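
In the listing above, each bit-field write to GPBSET or GPBCLEAR is compiled as build-mask, read, OR, write (five CLA instructions), so the timing markers cost more than the code being timed. Because the SET and CLEAR registers act only on bits written as 1, writing the mask to the whole register is a common way to avoid the read-modify-write. The sketch below uses mock variables so it builds on a host. On the target the equivalent would presumably be GpioDataRegs.GPBSET.all = GPIO44_MASK, and the resulting instruction count has not been checked here.

```c
#include <stdint.h>

/* GPIO44 is bit 12 of port B (GPIO32..63), matching #0x1000 in the listing. */
#define GPIO44_MASK (1UL << 12)

/* Host-side stand-ins for the port B SET and CLEAR registers (hypothetical). */
static volatile uint32_t mock_gpbset;
static volatile uint32_t mock_gpbclear;

/* Mark the start of the timed region: one plain store, no read or OR. */
static void mark_start(void)
{
    mock_gpbset = GPIO44_MASK;
}

/* Mark the end of the timed region the same way. */
static void mark_end(void)
{
    mock_gpbclear = GPIO44_MASK;
}
```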

Also, sorry if you're still testing but did you have a chance to try out my other 2 findings mentioned above:

1) I use variables as indexes.

Ex:

Uint16 i,x;

i = 0;

x = array1[i] + array2[i];

array1[] and array2[] are float32 globals and initialized in my main.c

2) Setting a value of an array location

Ex:

Uint16 i;

i = 0;

result[i]= 23.3 + 90.9;

result[] is also a float32 global that is initialized in my main.c

The part you answered in your most recent response is only my point #3.

0 Ashwini Athalye over 4 years ago in reply to Casey Au

TI__Expert 7695 points

Hi Casey,

Thanks for posting the assembly. I will take a look and get back to you tomorrow. Upon a brief look it appears that the GPIO write at the end of the function could be adding to the overhead.

I would recommend using the method described at the link below, where the EPWM is used to measure the execution time in clock cycles instead. By looking at the assembly, the cycle count can be adjusted to account for the EPWM read overhead to get exact cycles. That will be easier to compare with similar profiling on the C28 side using the same EPWM or CPU Timer 0.

https://software-dl.ti.com/C2000/docs/cla_software_dev_guide/faq.html#how-can-i-measure-the-duration-of-a-task
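
The arithmetic behind that approach can be sketched as follows. This is a host-side illustration, not the code from the linked FAQ: it assumes a free-running 16-bit up-counter clocked at SYSCLK that is sampled before and after the code under test, and read_overhead is a hypothetical value you would derive from the disassembly.

```c
#include <stdint.h>

/* Elapsed counts between two samples of a free-running 16-bit up-counter,
 * minus a fixed overhead for the counter reads themselves. */
static uint16_t elapsed_cycles(uint16_t start, uint16_t end, uint16_t read_overhead)
{
    /* Modulo-2^16 subtraction stays correct across a single counter wrap. */
    uint16_t raw = (uint16_t)(end - start);

    /* Clamp at zero so a too-large overhead estimate cannot underflow. */
    return (raw > read_overhead) ? (uint16_t)(raw - read_overhead) : 0u;
}
```

Running the same measurement on both the CLA and the C28 against the same counter gives numbers that can be compared directly in cycles.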

Thanks,
Ashwini

0 Ashwini Athalye over 4 years ago in reply to Ashwini Athalye

TI__Expert 7695 points

Hi Casey,

I took a look at the assembly code for one of the code blocks (below) on the C28 and CLA. The difference in behavior comes from the instruction set and architecture:

1. The CLA does not support instructions for moving immediate values to memory; they must be moved to a register, and the register then moved to memory.

2. The CLA is a floating-point accelerator, so it's not optimized for integers. If you change the datatypes to float, you are likely to see better performance.

I have copied here C code and assembly snippets for one piece of code that I looked at.

You can conduct a similar exercise with the other pieces of code to understand the difference in execution on CLA and C28.

1. Instruction set for CLA can be found in the device TRM

2. Instruction set for C28 can be found in these two documents:

https://www.ti.com/lit/spru430

https://www.ti.com/lit//spruhs1

Also, turning optimization to at least O2 is recommended for good performance both on CLA and C28.

C CODE:

Uint16 x =0;

    if(1.2 < 2.3){

       x = 5;

    }

C28 assembly: 3 cycles

;----------------------------------------------------------------------
; 64 | Uint16 x = 0;
;----------------------------------------------------------------------
        MOV       *-SP[5],#0            ; 2 cycles
;----------------------------------------------------------------------
; 66 | if(1.2 < 2.3){
;----------------------------------------------------------------------
;----------------------------------------------------------------------
; 68 | x = 5;
;----------------------------------------------------------------------
        MOVB      *-SP[5],#5,UNC        ; 1 cycle

CLA assembly: 5 cycles

;----------------------------------------------------------------------
; 69 | Uint16 x =0;
;----------------------------------------------------------------------
        MMOVIZ    MR0,#0                ; 1 cycle
        MMOV16    @__cla_aci_ctrlLoop_sp+2,MR0 ; 1 cycle
;----------------------------------------------------------------------
; 71 | if(1.2 < 2.3){
;----------------------------------------------------------------------
;----------------------------------------------------------------------
; 73 | x = 5;
;----------------------------------------------------------------------
        MMOVIZ    MR0,#0                ; 1 cycle
        MMOVXI    MR0,#5                ; 1 cycle
        MMOV16    @__cla_aci_ctrlLoop_sp+2,MR0 ; 1 cycle
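
As a sketch of the float-datatype suggestion in point 2, the same test can be written with a float variable. One plausible reason this helps, offered as an assumption rather than something verified in this thread: MMOVIZ loads the upper 16 bits of a register and zeroes the lower 16, and 5.0f has the bit pattern 0x40A00000, so the constant may be loadable with one instruction instead of the MMOVIZ/MMOVXI pair used for the integer 5. The function names are hypothetical, and the block is written to build on a host.

```c
#include <stdint.h>
#include <string.h>

/* Same test as the C code above, but with a float result instead of Uint16. */
static float compute_flag(void)
{
    float x = 0.0f;
    if (1.2f < 2.3f) {
        x = 5.0f;
    }
    return x;
}

/* Raw IEEE-754 bit pattern of a 32-bit float, used to inspect the constant. */
static uint32_t raw_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

Checking raw_bits(5.0f) shows that the lower 16 bits are zero, which is the property the assumption relies on. Comparing the generated CLA assembly for both versions is the only way to confirm the actual cycle counts.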

Thanks,
Ashwini
