TMS320F28375D: CLA slow compared to FPU

Part Number: TMS320F28375D

Team-

Posting this on behalf of a customer who is starting to ramp up on the '28375 and will cc: them on this thread.  Please contact me internally if you would like further details.

From Customer:


I am testing the timing of some code and was expecting the Control Law Accelerator (CLA) to be significantly faster than code run using the Floating Point Unit (FPU). However, when I tested it, the FPU was about 5 microseconds faster than the same code run from the CLA. Based on the documentation, I was led to believe the code run from the CLA would be much faster?


Also,


Is it possible to read/write to the EPWM registers (e.g., Epwm1Regs.CMPA) from the CLA?


Please direct us to the appropriate area of documentation, and if there are any relevant training modules, please let me know.

Thanks!

Best, Steve

  • Hi Steve,

    1. The CLA is a floating-point engine, so if the code involves a lot of floating-point math then yes, the CLA should do just as well as the C28+FPU. However, the CLA does not do as well as the C28 with integer or trigonometric operations, because the C28 has better integer support and a TMU engine that performs trigonometric math. The customer may want to look at the code being run on the CLA and determine whether any of these factors are causing the difference in execution performance.

    We have a CLA software development guide that has useful information. For example, here is a list of workshops:

    https://software-dl.ti.com/C2000/docs/cla_software_dev_guide/getting_started.html#workshops

    2. Yes, for the F2837x device, the CLA does have access to EPWM. Here is some good information on how to find out which peripherals CLA can access:

    https://software-dl.ti.com/C2000/docs/cla_software_dev_guide/faq.html#which-peripheral-registers-can-the-cla-access-directly

    Thanks,

    Ashwini
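As an illustration of point 2, here is a minimal host-side sketch of a task that computes a duty cycle in floating point and writes it to a compare register. The struct `mock_epwm_regs`, the variable `MockEPwm1Regs`, and the helpers `duty_to_cmpa` and `mock_cla_task` are all hypothetical stand-ins so the logic can be compiled on a PC; on the device the register definitions would come from the TI header files, and the access rules are the ones described in the FAQ linked above.

```c
#include <stdint.h>

/* Hypothetical stand-in for an EPWM register block (not the TI header). */
struct mock_epwm_regs {
    uint16_t TBPRD;  /* time-base period */
    uint16_t CMPA;   /* compare A value  */
};

/* Assumed period of 1000 counts, chosen only for this example. */
struct mock_epwm_regs MockEPwm1Regs = { 1000u, 0u };

/* Hypothetical helper: scale a duty ratio in [0, 1] to a compare value. */
uint16_t duty_to_cmpa(float duty, uint16_t period)
{
    if (duty < 0.0f) {
        duty = 0.0f;
    }
    if (duty > 1.0f) {
        duty = 1.0f;
    }
    return (uint16_t)(duty * (float)period);
}

/* Stand-in for the body of a CLA task that updates the compare register. */
void mock_cla_task(float duty)
{
    MockEPwm1Regs.CMPA = duty_to_cmpa(duty, MockEPwm1Regs.TBPRD);
}
```

The math is kept in float until the final conversion, which matches the advice above that the CLA is strongest on floating-point work.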

  • Hi!

    Thank you for your answer above.

    I was testing my code and found some other reasons why it was slower. I use arrays of floats a lot and noticed that the CLA runs slower than, or at the same speed as, the FPU when:

    1) I use variables as indexes.

    Ex:

    Uint16 i,x;

    i = 0;

    x = array1[i] + array2[i];

    array1[] and array2[] are float32 globals and initialized in my main.c

    2) Setting a value of an array location

    Ex:

    Uint16 i;

    i = 0;

    result[i]= 23.3 + 90.9;

    result[] is also a float32 global that is initialized in my main.c

    I was wondering if I need to change something for the CLA to run faster? I know if I hardcode the array positions the CLA is faster, but my code needs to use a variable index.

  • Hi Casey,

    I am reaching out to the experts and will let you know any suggestions once I hear back from the team.

    Thanks,

    Ashwini

  • Hi!

    Found one more oddity(?)

    3)

    When I compare floats ("<", "<=", ">", ">="), whether from an array or a constant, the CLA is slower. (When I compare using "==" or "!=" they are about the same speed.)

    Ex:

    //time: FPU - 40ns; CLA - 60ns

    Uint16 x;

    if(1.2 < 2.3){

       x = 5;

    }

    Ex2:

    //array was initialized as a global in main as array[3] = {1.2, 3.4, 5.6}

    //time: FPU - 0.08μs; CLA - 0.1μs

    Uint16 x;

    if(array[0] < 2.3){

       x = 5;

    }

    Thanks!
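To compare these timings with instruction counts, it can help to express them in clock cycles. The sketch below assumes a 200 MHz system clock for both the C28 and the CLA; the thread does not state the actual clock setting, so `ASSUMED_SYSCLK_MHZ` and the helper name `ns_to_cycles` are illustrative only.

```c
#include <stdint.h>

/* Assumed system clock in MHz; the actual setting is not given in the thread. */
#define ASSUMED_SYSCLK_MHZ 200u

/* Convert a duration in nanoseconds to whole clock cycles (truncating). */
uint32_t ns_to_cycles(uint32_t ns, uint32_t clk_mhz)
{
    return (ns * clk_mhz) / 1000u;
}
```

Under that assumption, the 40 ns vs. 60 ns measurement corresponds to 8 vs. 12 cycles, and the 0.08 μs vs. 0.1 μs measurement to 16 vs. 20 cycles, so both gaps are 4 cycles.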

  • Hi Casey,

    Thanks, I will share this with the team as well.

    Can you let me know the compiler version and optimization level being used?

    Regards,

    Ashwini

  • Hi!

    Compiler: TI v20.2.1.LTS

    Optimization Level: off

    Speed vs Size trade-offs: 2

    Floating point mode: strict

  • Casey,

    Just wanted to let you know that Ashwini is out of the office for a few days. I'll make sure some others have eyes on this post, but worst case you can expect a reply by next Thursday the 5th, when she is back in the office.

    Best,

    Matthew

  • Casey-

    When you do a compare such as < or >, the processor has to compute the difference of the arguments and evaluate the result. Checking for equality, == or !=, can be done faster with a bitwise compare. This might explain the difference.
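The two approaches described in this reply can be sketched in portable C as follows. This is only an illustration of the idea, assuming IEEE-754 single-precision floats; the function names are hypothetical, and nothing here is confirmed to be what the C28 or CLA hardware actually does.

```c
#include <stdint.h>
#include <string.h>

/* Equality by comparing raw bit patterns (assumes IEEE-754 binary32). */
int float_bits_equal(float a, float b)
{
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);  /* copy the bits without converting */
    memcpy(&ub, &b, sizeof ub);
    return ua == ub;
}

/* Ordering by computing the difference and checking its sign. */
int float_less_than(float a, float b)
{
    return (a - b) < 0.0f;
}
```

A bitwise compare is not a drop-in replacement for `==`: for example, `0.0f` and `-0.0f` compare equal arithmetically but have different bit patterns.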

  • Hi Casey,

    Sorry for the late reply. I tried the 2 examples you posted (reposted below for reference) on the CLA and the C28 and did not see a significant difference in the cycle count of the generated code. Can you post the assembly generated for the CLA and let me know how you are benchmarking the CLA?

    Ex:

    //time: FPU - 40ns; CLA - 60ns

    Uint16 x;

    if(1.2 < 2.3){

       x = 5;

    }

    Ex2:

    //array was initialized as a global in main as array[3] = {1.2, 3.4, 5.6}

    //time: FPU - 0.08μs; CLA - 0.1μs

    Uint16 x;

    if(array[0] < 2.3){

       x = 5;

    }

    Thanks,

    Ashwini

  • Hi!

    To time the CLA and FPU/C28 code, I used GPIO breakout pins and a logic analyzer. Running the two at the same time or separately made no noticeable difference in the timing.

    //in gpio.c
    void InitGpio(void)
    {
        GpioCtrlRegs.GPBCSEL2.all = 0x00010000; //GPIODAT/SET/CLEAR/TOGGLE reg. master
        GpioCtrlRegs.GPBDIR.bit.GPIO39 = 1;
        GpioDataRegs.GPBSET.bit.GPIO39 = 1;
        GpioCtrlRegs.GPBDIR.bit.GPIO44 = 1;
        GpioDataRegs.GPBSET.bit.GPIO44 = 0; //note: writing 0 to a SET register has no effect
    }
    
    //in ClaTasks_C.cla
    //have a ADC trigger this task once in a while
    interrupt void Cla1Task1 (void)
    {
        GpioDataRegs.GPBSET.bit.GPIO44 = 1;
        Uint16 x;
        x = 0;
        if(1.2 < 2.3){
            x = 5;
        }
        GpioDataRegs.GPBCLEAR.bit.GPIO44 = 1;
    }
    
    //in test.c
    //gets called in main function loop
    void sampleFunction (void)
    {
        GpioDataRegs.GPBSET.bit.GPIO39 = 1;
        Uint16 x;
        x = 0;
        if(1.2 < 2.3){
            x = 5;
        }
        GpioDataRegs.GPBCLEAR.bit.GPIO39 = 1;
    }

    Here is the assembly code for the section:

    22 GpioDataRegs.GPBSET.bit.GPIO44 = 1;
    Cla1Task1(), c:
    0000a60a: 78410000 MMOVIZ MR1, #0x0
    0000a60c: 75807F0A MMOVZ16 MR0, @0x7f0a
    0000a60e: 78811000 MMOVXI MR1, #0x1000
    0000a610: 7C800004 MOR32 MR0, MR1, MR0
    0000a612: 75C07F0A MMOV16 @0x7f0a, MR0
    24 x = 0;
    0000a614: 78400000 MMOVIZ MR0, #0x0
    0000a616: 75C08000 MMOV16 @0x8000, MR0
    26 x = 5;
    0000a618: 78400000 MMOVIZ MR0, #0x0
    0000a61a: 78800005 MMOVXI MR0, #0x5
    0000a61c: 75C08000 MMOV16 @0x8000, MR0
    28 GpioDataRegs.GPBCLEAR.bit.GPIO44 = 1;
    0000a61e: 78410000 MMOVIZ MR1, #0x0
    0000a620: 78811000 MMOVXI MR1, #0x1000
    0000a622: 75807F0C MMOVZ16 MR0, @0x7f0c
    0000a624: 7C800004 MOR32 MR0, MR1, MR0
    0000a626: 75C07F0C MMOV16 @0x7f0c, MR0
    29 }

    Also, sorry if you're still testing but did you have a chance to try out my other 2 findings mentioned above:

    1) I use variables as indexes.

    Ex:

    Uint16 i,x;
    
    i = 0;
    
    x = array1[i] + array2[i];

    array1[] and array2[] are float32 globals and initialized in my main.c

    2) Setting a value of an array location

    Ex:

    Uint16 i;
    
    i = 0;
    
    result[i]= 23.3 + 90.9;

    result[] is also a float32 global that is initialized in my main.c

    The part you answered in your most recent response is only my point #3.

  • Hi Casey,

    Thanks for posting the assembly. I will take a look and get back to you tomorrow. Upon a brief look it appears that the GPIO write at the end of the function could be adding to the overhead.

    I would recommend using the method mentioned in the link below where the EPWM is used to measure the execution in clock cycles instead. By looking at the assembly the clock cycles can be adjusted to account for EPWM read overhead to get exact cycles. That will be easier to compare with similar profiling on C28 side using same EPWM or CPU timer0.

    https://software-dl.ti.com/C2000/docs/cla_software_dev_guide/faq.html#how-can-i-measure-the-duration-of-a-task

    Thanks,
    Ashwini
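The counter-based measurement described above boils down to reading a free-running counter before and after the code under test and subtracting the read overhead. The sketch below shows only that arithmetic, assuming a 16-bit up-counter that wraps at 0xFFFF; the function name `elapsed_counts` and the overhead value are hypothetical, and the result is in counter ticks, which may not equal CPU cycles if the counter clock is divided down.

```c
#include <stdint.h>

/*
 * Elapsed ticks between two reads of a free-running 16-bit up-counter,
 * minus a fixed measurement overhead. The unsigned 16-bit subtraction
 * gives the right answer across a single counter wrap.
 */
uint16_t elapsed_counts(uint16_t start, uint16_t end, uint16_t overhead)
{
    uint16_t raw = (uint16_t)(end - start);
    return (raw > overhead) ? (uint16_t)(raw - overhead) : 0u;
}
```

The overhead term would be determined from the generated assembly, as suggested above, and the code under test must finish within one counter period for the wrap handling to hold.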

  • Hi Casey,

    I took a look at the assembly code for one of the code blocks (below) on the C28 and the CLA. The difference in behavior comes from the instruction set and architecture:

    1. The CLA does not support instructions for moving immediate values to memory; they must be moved to a register, and the register is then moved to memory.

    2. The CLA is a floating-point accelerator, so it's not optimized for integers. If you change the datatypes to float, you are likely to see better performance.
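As a sketch of point 2, the test snippet from this thread could be rewritten with float types as shown below. The function name `select_value` is hypothetical, and the operands are passed as parameters so that the comparison is not between two compile-time constants. No cycle counts are claimed for this version; it would need to be profiled on the device.

```c
/* Float variant of the thread's test snippet (hypothetical rewrite). */
float select_value(float a, float b)
{
    float x = 0.0f;   /* float instead of Uint16 */

    if (a < b) {
        x = 5.0f;     /* 'f' suffix keeps the constant single precision */
    }
    return x;
}
```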

    I have copied here C code and assembly snippets for one piece of code that I looked at.

    You can conduct a similar exercise with the other pieces of code to understand the difference in execution on CLA and C28.

    1. Instruction set for CLA can be found in the device TRM

    2. Instruction set for C28 can be found in these two documents:

    https://www.ti.com/lit/spru430

    https://www.ti.com/lit/spruhs1

    Also, turning optimization to at least O2 is recommended for good performance both on CLA and C28.

    C CODE:

    Uint16 x =0;

        if(1.2 < 2.3){

           x = 5;

        }

    C28 assembly: 3 cycles

    ;----------------------------------------------------------------------
    ;  64 | Uint16 x = 0;                                                          
    ;----------------------------------------------------------------------
            MOV       *-SP[5],#0            ; 2 cycles
    ;----------------------------------------------------------------------
    ;  66 | if(1.2 < 2.3){                                                         
    ;----------------------------------------------------------------------
    ;----------------------------------------------------------------------
    ;  68 | x = 5;                                                                 
    ;----------------------------------------------------------------------
            MOVB      *-SP[5],#5,UNC        ; 1 cycle

    CLA assembly: 5 cycles

    ;----------------------------------------------------------------------
    ;  69 | Uint16 x =0;                                                           
    ;----------------------------------------------------------------------
            MMOVIZ    MR0,#0                ; 1 cycle
            MMOV16    @__cla_aci_ctrlLoop_sp+2,MR0 ; 1 cycle
    ;----------------------------------------------------------------------
    ;  71 | if(1.2 < 2.3){                                                         
    ;----------------------------------------------------------------------
    ;----------------------------------------------------------------------
    ;  73 | x = 5;                                                                 
    ;----------------------------------------------------------------------
            MMOVIZ    MR0,#0                ; 1 cycle
            MMOVXI    MR0,#5                ; 1 cycle
            MMOV16    @__cla_aci_ctrlLoop_sp+2,MR0 ; 1 cycle

    Thanks,
    Ashwini