TMS320F28069: calculation time CPU vs CLA

Part Number: TMS320F28069
Other Parts Discussed in Thread: C2000WARE

Hello TI,

 

I have a problem understanding the CLA vs. CPU calculation time.

We use a TMS320F28069. For testing and debugging I compare absolutely (!) identical code, first on the main CPU, then in a CLA task. I expected the same calculation time.

But my measurements show a much shorter calculation time for the CLA.

The difference is 3 µs (CLA) vs. 24 µs (CPU)!

To measure the time I toggle a GPIO.

The CLA task is also measured with a GPIO: it is started by a software trigger (IACK), and the measurement is stopped in the CLA1_INT2_ISR routine. The code snippet is below.

All needed variables are in the CLA1 data RAM area. The calculation is correct and works (on the CLA and on the CPU).

 

void main( void )
{
    …
    /* init, etc. */
    …

    EALLOW;
    GpioDataRegs.GPASET.bit.GPIO13 = 1;   // Start time measurement
    EDIS;
    __asm(" IACK #0x0002");               // Start CLA Task 2
}

// INT11.2
__interrupt void CLA1_INT2_ISR( void )   // CLA
{
       EALLOW;
       GpioDataRegs.GPACLEAR.bit.GPIO13 = 1; // Stop measure time
       EDIS;
       PieCtrlRegs.PIEACK.all = PIEACK_GROUP11;
}
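
To compare the two measurements in instruction terms, the GPIO pulse widths can be converted to clock cycles. This is a minimal sketch, assuming the device runs at 90 MHz (the F28069 maximum; the thread does not state the actual clock). The helper name `us_to_cycles` is hypothetical.

```c
/* Convert a measured pulse width in microseconds to clock cycles.
 * sysclk_mhz is an assumption: 90 MHz is the F28069 maximum, but the
 * clock actually used in this thread is not stated. */
double us_to_cycles(double pulse_us, double sysclk_mhz)
{
    return pulse_us * sysclk_mhz;   /* microseconds * cycles per microsecond */
}
```

Under that assumption, 3 µs corresponds to 270 cycles and 24 µs to 2160 cycles, or roughly 45 versus 360 cycles for each of the six formula evaluations. The measured pulse also contains the trigger and interrupt latency, so these figures are upper bounds on the pure calculation time.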

 

Most of the calculation time is spent in this code. It contains some divisions. But why is it so much faster on the CLA?

delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;

delta_Z_o_n1= delta_Z_o;

This formula is evaluated 6 times (each time with different variables).

Compiler optimization has no effect on the speed.

Do you have any idea how this can happen?

 

Thank you.

Best Regards

Markus

 

 

 

 

  • Markus,

    You mention the calculation is done 6 times.  Is that done all during one task, or does the task need to be triggered 6 times?

    -Lori

  • Something else to check:

    Is the compiler option for floating point enabled for the C28x (--float_support=fpu32)?

    Are you using the fast RTS library from C2000Ware (C:\ti\c2000\C2000Ware<version>\libraries\math\FPUfastRTS)? You can check the generated .map file to see whether the C28x division comes from this library.

  • Hi Lori,

    That is all done in one task.

    Analogous:

    my_calc_task
    {
    delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;
    delta_Z_o_n1= delta_Z_o;
    
    delta_Z_o1 = ( ( Ch ) * ( P_Z_o1 - ( delta_Z_o_n11 / Rjc1 ) ) * deltat ) + delta_Z_o_n11;
    delta_Z_o_n11= delta_Z_o1;
    
    delta_Z_o2 = ( ( Ch ) * ( P_Z_o2 - ( delta_Z_o_n12 / Rjc2 ) ) * deltat ) + delta_Z_o_n12 ;
    delta_Z_o_n12 = delta_Z_o2 ;
    
    delta_Z_o3 = ( ( Ch ) * ( P_Z_o3 - ( delta_Z_o_n13 / Rjc3 ) ) * deltat ) + delta_Z_o_n13 ;
    delta_Z_o_n13 = delta_Z_o3 ;
    
    delta_Z_o4 = ( ( Ch ) * ( P_Z_o4 - ( delta_Z_o_n14 / Rjc4 ) ) * deltat ) + delta_Z_o_n14 ;
    delta_Z_o_n14 = delta_Z_o4 ;
    
    delta_Z_o5 = ( ( Ch ) * ( P_Z_o5 - ( delta_Z_o_n15 / Rjc5 ) ) * deltat ) + delta_Z_o_n15 ;
    delta_Z_o_n15 = delta_Z_o5 ;
    
    }

    Markus
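
The six updates share one form, so the task body can be sketched as a loop over per-channel state. This is only an illustration: the struct `ThermalChannel`, its field names, and both function names are hypothetical, and the thread does not say whether a loop like this compiles as efficiently as the unrolled version, on the CLA in particular.

```c
#include <stddef.h>

/* Hypothetical per-channel state; fields mirror the thread's variables. */
typedef struct {
    float P;         /* P_Z_o, P_Z_o1, ...             */
    float Rjc;       /* Rjc, Rjc1, ...                  */
    float delta_n1;  /* delta_Z_o_n1, delta_Z_o_n11, ... */
} ThermalChannel;

/* One update: delta = Ch * (P - delta_n1 / Rjc) * deltat + delta_n1,
 * then the new value is stored as the previous value for the next call. */
float thermal_step(ThermalChannel *ch, float Ch, float deltat)
{
    float delta = (Ch * (ch->P - (ch->delta_n1 / ch->Rjc)) * deltat)
                  + ch->delta_n1;
    ch->delta_n1 = delta;
    return delta;
}

/* Apply the same update to all channels (six in the thread). */
void thermal_step_all(ThermalChannel *chs, size_t n, float Ch, float deltat)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        (void)thermal_step(&chs[i], Ch, deltat);
    }
}
```

Each call still performs one floating-point division, so the cost of the division routine remains the dominant term on the C28x.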

  • Hi Lori,

    Yes, the division uses the RTS library. It comes from:

    C:/ti/ccsv8/tools/compiler/ti-cgt-c2000_18.1.4.LTS/lib/rts2800_fpu32.lib

    But is my understanding of the processor right?

    The (same) code should take the same time on the CLA as on the CPU, correct?

    Thanks.

    Markus

  • Hi,

    I had a look at the disassembly.

    CPU
       
    1090    		delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;
    3dd465:   761F0231    MOVW         DP, #0x231
    3dd467:   FF69        SPM          #0
    3dd468:   E2AF0004    MOV32        R0H, @0x4, UNCF
    3dd46a:   761F0232    MOVW         DP, #0x232
    3dd46c:   E2AF0116    MOV32        R1H, @0x16, UNCF
    3dd46e:   767E0EB9    LCR          FS$$DIV
    3dd470:   761F0231    MOVW         DP, #0x231
    3dd472:   E2AF0100    MOV32        R1H, @0x0, UNCF
    3dd474:   761F0232    MOVW         DP, #0x232
    3dd476:   E3204814    SUBF32       R1H, R1H, R0H
                       || MOV32        R0H, @0x14
    3dd478:   761F0231    MOVW         DP, #0x231
    3dd47a:   E3004116    MPYF32       R0H, R1H, R0H
                       || MOV32        R1H, @0x16
    3dd47c:   761F0231    MOVW         DP, #0x231
    3dd47e:   E7000008    MPYF32       R0H, R1H, R0H
    3dd480:   E2AF0104    MOV32        R1H, @0x4, UNCF
    3dd482:   E7100040    ADDF32       R0H, R0H, R1H
    3dd484:   761F0231    MOVW         DP, #0x231
    3dd486:   E2030006    MOV32        @0x6, R0H
    1091    		delta_Z_o_n1= delta_Z_o;
    3dd488:   0606        MOVL         ACC, @0x6
    3dd489:   761F0231    MOVW         DP, #0x231
    3dd48b:   1E04        MOVL         @0x4, ACC

    CLA
         
    1187      	delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;
    00009cf0:   73C08C96    MMOV32     MR0, @0x8c96, UNCF
    00009cf2:   7F000003    MEINVF32   MR3, MR0
    00009cf4:   7C000032    MMPYF32    MR2, MR0, MR3
    00009cf6:   780A4000    MSUBF32    MR2, #0x4000, MR2
    00009cf8:   7C00003B    MMPYF32    MR3, MR2, MR3
    00009cfa:   7C00000E    MMPYF32    MR2, MR3, MR0
    00009cfc:   780A4000    MSUBF32    MR2, #0x4000, MR2
    00009cfe:   0ED08C44    MMPYF32    MR3, MR2, MR3 || MMOV32    MR1, @0x8c44
    00009d00:   07808C40    MMPYF32    MR2, MR3, MR1 || MMOV32    MR0, @0x8c40
    00009d02:   28408C94    MSUBF32    MR1, MR0, MR2 || MMOV32    MR0, @0x8c94
    00009d04:   01108C56    MMPYF32    MR0, MR1, MR0 || MMOV32    MR1, @0x8c56
    00009d06:   01108C44    MMPYF32    MR0, MR1, MR0 || MMOV32    MR1, @0x8c44
    00009d08:   7C200004    MADDF32    MR0, MR1, MR0
    00009d0a:   74C08C46    MMOV32     @0x8c46, MR0
    1188      	delta_Z_o_n1= delta_Z_o;
    00009d0c:   74C08C44    MMOV32     @0x8c44, MR0

    The code on the CLA is more efficient. But why?

    Hope it helps.

    Thanks Markus
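
One reading of the two listings: the CLA code expands the division inline as a reciprocal estimate (`MEINVF32`) followed by two Newton-Raphson refinements of the form y = y * (2 - x * y), where `#0x4000` is the upper half of the constant 2.0f. The C28x code calls the library routine `FS$$DIV` instead. The sketch below reproduces that inline sequence in portable C. `seed_reciprocal` is a hypothetical stand-in that emulates a roughly 8-bit estimate by truncating an exact reciprocal; it is not how the hardware computes its estimate.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical emulation of a coarse hardware reciprocal estimate:
 * take the exact reciprocal and keep only the top 8 mantissa bits. */
float seed_reciprocal(float x)
{
    float y = 1.0f / x;
    uint32_t bits;
    memcpy(&bits, &y, sizeof bits);
    bits &= 0xFFFF8000u;            /* sign, exponent, 8 mantissa bits */
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* num / den via estimate plus two Newton-Raphson steps, mirroring the
 * MEINVF32 / MMPYF32 / MSUBF32 sequence in the CLA listing. */
float nr_divide(float num, float den)
{
    float y = seed_reciprocal(den);
    y = y * (2.0f - den * y);       /* first refinement  */
    y = y * (2.0f - den * y);       /* second refinement */
    return num * y;
}
```

Each refinement roughly doubles the number of correct bits, so two steps are enough to take an 8-bit estimate close to single precision. The result can differ from a correctly rounded division in the last bits.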

  • The CPU calls a division routine, which is quite slow:

    LCR          FS$$DIV

    Perhaps you didn't set C2000 Compiler->Optimization->Floating Point mode (--fp_mode) to relaxed. The "strict" setting leads to the FS$$DIV call.

    Edward

  • Hi Edward,

    Thanks, you are right. I changed it to relaxed, but the call to LCR    FS$$DIV is still present. See the disassembly below.

    The settings for Optimization:

    Disassembly:

    1090  delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;
    3dcd66:   8B16        MOVL         XAR1, @0x16
    3dcd67:   FF69        SPM          #0
    3dcd68:   8214        MOVL         XAR3, @0x14
    3dcd69:   761F0231    MOVW         DP, #0x231
    3dcd6b:   BDA10F16    MOV32        R1H, @XAR1
    3dcd6d:   E2AF000C    MOV32        R0H, @0xc, UNCF
    3dcd6f:   767E136A    LCR          FS$$DIV
    3dcd71:   BDA30F16    MOV32        R1H, @XAR3
    3dcd73:   7700        NOP          
    3dcd74:   7700        NOP          
    3dcd75:   E7200038    SUBF32       R0H, R7H, R0H
    3dcd77:   761F0231    MOVW         DP, #0x231
    3dcd79:   E7000009    MPYF32       R1H, R1H, R0H
    3dcd7b:   E2AF004A    MOV32        R0H, *-SP[10], UNCF
    3dcd7d:   E7000040    MPYF32       R0H, R0H, R1H
    3dcd7f:   E2AF010C    MOV32        R1H, @0xc, UNCF
    3dcd81:   E7100040    ADDF32       R0H, R0H, R1H
    3dcd83:   7700        NOP          
    3dcd84:   7700        NOP          
    3dcd85:   7700        NOP          
    3dcd86:   BFA20F12    MOV32        @XAR2, R0H
    3dcd88:   761F0231    MOVW         DP, #0x231


    But it makes no difference in speed.


    Any Ideas?

    Thanks.
    Markus

  • Sorry, I thought the EINVF32 instruction would be emitted directly (like DIVF32 when a TMU is present), but the division is still a call to FS$$DIV, which should then use EINVF32 for a fast divide. The strict / relaxed mode is, I guess, not relevant for you, given the lack of a TMU.

    You were asked whether you are using the *fast RTS library*. rts2800_fpu32.lib is not fast. You certainly need the fast supplement lib for a fast divide. Once you add it, make sure it has the highest priority: use the up/down arrows to move it to the top of the C2000 Linker->File Search Path->Include library file ... list box.

    To make sure the fast lib is working, try single-stepping in the debugger into the FS$$DIV routine; there you should see the EINVF32 instruction being used.

    Edward

  • Hi,

    The big question still remains:

    The (same) code should take the same time on the CLA as on the CPU, is this right?

    Thank you.

    Markus

  • As a general statement, it depends on the code. For pure 32-bit floating-point math operations I would expect them to be similar.

    If, for example, the code had fixed-point operations or a lot of branches, or the C28x could leverage a repeat block, then the CLA would not do as well as the C28x. This is because of the instruction set and addressing modes available to the CLA.

    Because in your case the C28x is taking much longer, it seems something is not an apples-to-apples comparison.

    Another question on the C28x code: is it running from flash or RAM?

    Also, to check that your code is pulling in the fast RTS library, take a look at the generated .map file and see which object libraries are being used.

    Regards

    Lori

  • Hi Edward,

    Hi Lori,

    Using rts2800_fpu32_fast_supplement.lib has the most positive effect.

    Before, we used rts2800_fpu32.lib.

    Also a very good tip: running from flash vs. RAM.

    6.2 µs (RAM) vs. 8.4 µs (flash)

    6.7 µs (w/o compiler optimization; RAM) vs. 9.2 µs (w/o compiler optimization; flash)

    Note: all earlier CPU measurements in the previous posts were made running from flash (the CLA program always runs from RAM).

    For the sake of completeness, the disassembly from flash w/o compiler optimization:
    
    1084    		delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;
    3dd46a:   761F0231    MOVW         DP, #0x231
    3dd46c:   E2AF0004    MOV32        R0H, @0x4, UNCF
    3dd46e:   761F0232    MOVW         DP, #0x232
    3dd470:   E2AF0116    MOV32        R1H, @0x16, UNCF
    3dd472:   767E137D    LCR          $C:/ti/controlSUITE/libs/math/FPUfastRTS/V100/source/div_f32.asm:52:71$
    3dd474:   761F0231    MOVW         DP, #0x231
    3dd476:   E2AF0100    MOV32        R1H, @0x0, UNCF
    3dd478:   761F0232    MOVW         DP, #0x232
    3dd47a:   E3204814    SUBF32       R1H, R1H, R0H || MOV32        R0H, @0x14
    3dd47c:   761F0231    MOVW         DP, #0x231
    3dd47e:   E3004116    MPYF32       R0H, R1H, R0H || MOV32        R1H, @0x16
    3dd480:   761F0231    MOVW         DP, #0x231
    3dd482:   E7000008    MPYF32       R0H, R1H, R0H
    3dd484:   E2AF0104    MOV32        R1H, @0x4, UNCF
    3dd486:   E7100040    ADDF32       R0H, R0H, R1H
    3dd488:   761F0231    MOVW         DP, #0x231
    3dd48a:   E2030006    MOV32        @0x6, R0H
    1085    		delta_Z_o_n1= delta_Z_o;
    3dd48c:   0606        MOVL         ACC, @0x6
    3dd48d:   761F0231    MOVW         DP, #0x231
    3dd48f:   1E04        MOVL         @0x4, ACC

    I think the problem is resolved.

    Thanks.

    Markus