TMS320F28069: calculation time CPU vs CLA

MarkusL

Part Number: TMS320F28069
Other Parts Discussed in Thread: C2000WARE

Hello TI,

I have a problem understanding the CLA vs. CPU calculation time.

We use a TMS320F28069. For testing and debugging I compare absolutely (!) identical code, first on the main CPU, then in a CLA task. I expected the same calculation time.

But my measurements show a much shorter calculation time for the CLA.

The difference is 3µs (CLA) vs. 24µs (CPU)!

To measure the time I toggle a GPIO.

The CLA task is also measured with a GPIO. It is started by a software trigger (IACK) and stopped by the CLA1_INT2_ISR routine. The code snippet is below.

All required variables are in the CLA1DataRam area. The calculation is correct and works on both the CLA and the CPU.

void main( void )
{
    // … init, etc. …

    EALLOW;
    GpioDataRegs.GPASET.bit.GPIO13 = 1;   // Start time measurement
    EDIS;
    __asm(" IACK #0x0002");               // Start CLA Task 2
}

// INT11.2
__interrupt void CLA1_INT2_ISR( void )   // CLA
{
       EALLOW;
       GpioDataRegs.GPACLEAR.bit.GPIO13 = 1; // Stop measure time
       EDIS;
       PieCtrlRegs.PIEACK.all = PIEACK_GROUP11;
}

Most of the calculation time is spent in the code below. It contains some divisions. But why is it so much faster on the CLA?

delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;

delta_Z_o_n1 = delta_Z_o;

This formula is evaluated 6 times (each time with different variables).

Compiler optimization has no effect on the speed.

Do you have any idea why this happens?

Thank you.

Best Regards

Markus

over 5 years ago

0 Lori Heustess over 5 years ago

TI__Guru* 91195 points

Markus,

You mention the calculation is done 6 times. Is that done all during one task, or does the task need to be triggered 6 times?

-Lori

0 Lori Heustess over 5 years ago in reply to Lori Heustess

TI__Guru* 91195 points

Something else to check:

Is the compiler option for floating point enabled for the C28x (--float_support=fpu32)?

Are you using the fast RTS library from C2000Ware (C:\ti\c2000\C2000Ware<version>\libraries\math\FPUfastRTS)? You can check the generated .map file to see whether the C28x division comes from this library.

0 MarkusL over 5 years ago in reply to Lori Heustess

Prodigy 100 points

Hi Lori,

That's all done within one task.

Analogous:

my_calc_task
{
delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;
delta_Z_o_n1= delta_Z_o;

delta_Z_o1 = ( ( Ch ) * ( P_Z_o1 - ( delta_Z_o_n11 / Rjc1 ) ) * deltat ) + delta_Z_o_n11;
delta_Z_o_n11= delta_Z_o1;

delta_Z_o2 = ( ( Ch ) * ( P_Z_o2 - ( delta_Z_o_n12 / Rjc2 ) ) * deltat ) + delta_Z_o_n12 ;
delta_Z_o_n12 = delta_Z_o2 ;

delta_Z_o3 = ( ( Ch ) * ( P_Z_o3 - ( delta_Z_o_n13 / Rjc3 ) ) * deltat ) + delta_Z_o_n13 ;
delta_Z_o_n13 = delta_Z_o3 ;

delta_Z_o4 = ( ( Ch ) * ( P_Z_o4 - ( delta_Z_o_n14 / Rjc4 ) ) * deltat ) + delta_Z_o_n14 ;
delta_Z_o_n14 = delta_Z_o4 ;

delta_Z_o5 = ( ( Ch ) * ( P_Z_o5 - ( delta_Z_o_n15 / Rjc5 ) ) * deltat ) + delta_Z_o_n15 ;
delta_Z_o_n15 = delta_Z_o5 ;

}
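The six updates above all have the same shape, so they can be written as one step function applied per channel. The following is only a host-side sketch of that idea: the struct `ThermalChannel`, its field names, and the functions `thermal_step` and `thermal_update_all` are hypothetical and do not appear in the project code.

```c
#include <stddef.h>

/* Hypothetical per-channel data; the field names only mirror the
   variables used in the post (P_Z_o, Rjc, delta_Z_o / delta_Z_o_n1). */
typedef struct {
    float P;      /* input term, e.g. P_Z_o                               */
    float Rjc;    /* divisor, e.g. Rjc                                    */
    float delta;  /* state: delta_Z_o after a step, delta_Z_o_n1 before it */
} ThermalChannel;

/* One update: delta = Ch * (P - delta_prev / Rjc) * deltat + delta_prev */
static float thermal_step(float Ch, float P, float Rjc,
                          float deltat, float delta_prev)
{
    return (Ch * (P - (delta_prev / Rjc)) * deltat) + delta_prev;
}

/* Apply the same update to n channels; each channel costs one division. */
static void thermal_update_all(ThermalChannel *ch, size_t n,
                               float Ch, float deltat)
{
    for (size_t i = 0; i < n; ++i) {
        ch[i].delta = thermal_step(Ch, ch[i].P, ch[i].Rjc,
                                   deltat, ch[i].delta);
    }
}
```

With six channels this performs six divisions per task run, which is why the cost of a single floating-point divide dominates the measurement.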

Markus

0 MarkusL over 5 years ago in reply to Lori Heustess

Prodigy 100 points

Hi Lori,

Yes, the division uses the RTS library. It comes from:

C:/ti/ccsv8/tools/compiler/ti-cgt-c2000_18.1.4.LTS/lib/rts2800_fpu32.lib

But is my understanding of the processor right?

The same code should run as quickly on the CPU as it does on the CLA, correct?

Thanks.

Markus

0 MarkusL over 5 years ago

Prodigy 100 points

Hi,

I had a look at the disassembly.

CPU
   
1090    		delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;
3dd465:   761F0231    MOVW         DP, #0x231
3dd467:   FF69        SPM          #0
3dd468:   E2AF0004    MOV32        R0H, @0x4, UNCF
3dd46a:   761F0232    MOVW         DP, #0x232
3dd46c:   E2AF0116    MOV32        R1H, @0x16, UNCF
3dd46e:   767E0EB9    LCR          FS$$DIV
3dd470:   761F0231    MOVW         DP, #0x231
3dd472:   E2AF0100    MOV32        R1H, @0x0, UNCF
3dd474:   761F0232    MOVW         DP, #0x232
3dd476:   E3204814    SUBF32       R1H, R1H, R0H
                   || MOV32        R0H, @0x14
3dd478:   761F0231    MOVW         DP, #0x231
3dd47a:   E3004116    MPYF32       R0H, R1H, R0H
                   || MOV32        R1H, @0x16
3dd47c:   761F0231    MOVW         DP, #0x231
3dd47e:   E7000008    MPYF32       R0H, R1H, R0H
3dd480:   E2AF0104    MOV32        R1H, @0x4, UNCF
3dd482:   E7100040    ADDF32       R0H, R0H, R1H
3dd484:   761F0231    MOVW         DP, #0x231
3dd486:   E2030006    MOV32        @0x6, R0H
1091    		delta_Z_o_n1= delta_Z_o;
3dd488:   0606        MOVL         ACC, @0x6
3dd489:   761F0231    MOVW         DP, #0x231
3dd48b:   1E04        MOVL         @0x4, ACC

CLA
     
1187      	delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;
00009cf0:   73C08C96    MMOV32     MR0, @0x8c96, UNCF
00009cf2:   7F000003    MEINVF32   MR3, MR0
00009cf4:   7C000032    MMPYF32    MR2, MR0, MR3
00009cf6:   780A4000    MSUBF32    MR2, #0x4000, MR2
00009cf8:   7C00003B    MMPYF32    MR3, MR2, MR3
00009cfa:   7C00000E    MMPYF32    MR2, MR3, MR0
00009cfc:   780A4000    MSUBF32    MR2, #0x4000, MR2
00009cfe:   0ED08C44    MMPYF32    MR3, MR2, MR3 || MMOV32    MR1, @0x8c44
00009d00:   07808C40    MMPYF32    MR2, MR3, MR1 || MMOV32    MR0, @0x8c40
00009d02:   28408C94    MSUBF32    MR1, MR0, MR2 || MMOV32    MR0, @0x8c94
00009d04:   01108C56    MMPYF32    MR0, MR1, MR0 || MMOV32    MR1, @0x8c56
00009d06:   01108C44    MMPYF32    MR0, MR1, MR0 || MMOV32    MR1, @0x8c44
00009d08:   7C200004    MADDF32    MR0, MR1, MR0
00009d0a:   74C08C46    MMOV32     @0x8c46, MR0
1188      	delta_Z_o_n1= delta_Z_o;
00009d0c:   74C08C44    MMOV32     @0x8c44, MR0

The code on the CLA is more efficient. But why?

Hope it helps.

Thanks Markus

0 EK over 5 years ago in reply to MarkusL

Expert 2520 points

The CPU calls a division routine, which is quite slow:

LCR FS$$DIV

Perhaps you didn't set C2000 Compiler->Optimization->Floating Point mode (--fp_mode) to relaxed. The "strict" setting leads to the FS$$DIV call.

Edward

0 MarkusL over 5 years ago in reply to EK

Prodigy 100 points

Hi Edward,

Thanks. You are right. I changed it to relaxed, but the LCR FS$$DIV call is still present. See the disassembly below.

The settings for Optimization:

Disassembly:

1090  delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;
3dcd66:   8B16        MOVL         XAR1, @0x16
3dcd67:   FF69        SPM          #0
3dcd68:   8214        MOVL         XAR3, @0x14
3dcd69:   761F0231    MOVW         DP, #0x231
3dcd6b:   BDA10F16    MOV32        R1H, @XAR1
3dcd6d:   E2AF000C    MOV32        R0H, @0xc, UNCF
3dcd6f:   767E136A    LCR          FS$$DIV
3dcd71:   BDA30F16    MOV32        R1H, @XAR3
3dcd73:   7700        NOP          
3dcd74:   7700        NOP          
3dcd75:   E7200038    SUBF32       R0H, R7H, R0H
3dcd77:   761F0231    MOVW         DP, #0x231
3dcd79:   E7000009    MPYF32       R1H, R1H, R0H
3dcd7b:   E2AF004A    MOV32        R0H, *-SP[10], UNCF
3dcd7d:   E7000040    MPYF32       R0H, R0H, R1H
3dcd7f:   E2AF010C    MOV32        R1H, @0xc, UNCF
3dcd81:   E7100040    ADDF32       R0H, R0H, R1H
3dcd83:   7700        NOP          
3dcd84:   7700        NOP          
3dcd85:   7700        NOP          
3dcd86:   BFA20F12    MOV32        @XAR2, R0H
3dcd88:   761F0231    MOVW         DP, #0x231


But it makes no difference in speed.


Any Ideas?

Thanks.
Markus

0 EK over 5 years ago in reply to MarkusL

Expert 2520 points

Sorry, I forgot that the EINVF32 instruction is not emitted inline (unlike DIVF32 when a TMU is present). The division is still a call to FS$$DIV, which should then use EINVF32 for a fast divide. I guess the strict/relaxed mode is not relevant for you, given the lack of a TMU.

Lori asked whether you are using the *fast RTS library*. rts2800_fpu32.lib is not the fast one. You need the fast supplement library for a fast divide. Once you add it, make sure it has the highest priority: use the up/down arrows to move it to the top of the C2000 Linker->File Search Path->Include library file ... list box.

To make sure fast lib is working, try single stepping in debugger into FS$$DIV routine, there you should see EINVF32 instruction used.
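For reference, the CLA listing earlier in the thread shows what such a fast divide looks like when it is inlined. It takes a reciprocal estimate (MEINVF32), refines it with two Newton-Raphson steps of the form y = y * (2 - x * y), and then multiplies by the numerator. The sketch below reproduces that sequence in portable C so it can be tried on a PC. It is an illustration only: `recip_estimate` and `fast_div` are hypothetical names, and the estimate is modelled as 1/x truncated to 8 mantissa bits, which is an assumption about the hardware estimate's accuracy rather than a bit-exact emulation.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the hardware reciprocal estimate (EINVF32 / MEINVF32).
   Assumption: modelled as 1/x with the mantissa truncated to 8 bits. */
static float recip_estimate(float x)
{
    float y = 1.0f / x;
    uint32_t bits;
    memcpy(&bits, &y, sizeof bits);
    bits &= 0xFFFF8000u;          /* keep sign, exponent, top 8 mantissa bits */
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* num / den via estimate plus two Newton-Raphson refinements,
   following the instruction order of the CLA disassembly. */
static float fast_div(float num, float den)
{
    float y = recip_estimate(den);
    y = y * (2.0f - den * y);     /* MMPYF32, MSUBF32 #2.0, MMPYF32 */
    y = y * (2.0f - den * y);     /* second refinement step         */
    return num * y;               /* multiply by the numerator      */
}
```

Each Newton-Raphson step roughly doubles the number of correct bits, so two steps are enough to take an estimate of about 8 bits to single-precision accuracy. The whole sequence is a handful of multiplies and subtracts, with no function call.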

Edward

0 MarkusL over 5 years ago

Prodigy 100 points

Hi,

The big question still remains:

The same code should run as quickly on the CPU as it does on the CLA, is that right?

Thank you.

Markus

0 Lori Heustess over 5 years ago in reply to MarkusL

TI__Guru* 91195 points

As a general statement it depends on the code. For pure 32-bit floating point math operations I would expect them to be similar.

If, for example, the code had fixed-point operations or a lot of branches, or if the C28x could leverage a repeat block, then the CLA would not do as well as the C28x. This is because of the instruction set and addressing modes available to the CLA.

Because the C28x is taking much longer in your case, it seems the comparison is not apples to apples.

Another question on the C28x code, is it running from flash or RAM?

Also to check that your code is pulling in the fast RTS library, take a look at the generated .map file and see which object libraries are being used.

Regards

Lori

0 MarkusL over 5 years ago in reply to EK

Prodigy 100 points

Hi Edward,

Hi Lori,

Using the rts2800_fpu32_fast_supplement.lib had the most positive effect.

Before that, we used the rts2800_fpu32.lib.

Also a very good tip: Running from Flash vs. RAM.

6.2 µs (RAM) vs. 8.4 µs (flash)

6.7 µs (RAM, without compiler optimization) vs. 9.2 µs (flash, without compiler optimization)

Note: all earlier CPU measurements in the previous posts were made with the code running from flash (the CLA program always runs from RAM).
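To put the measured times into perspective, they can be converted to CPU cycles. The thread does not state the system clock, so the sketch below assumes 90 MHz, the maximum for the F28069. `SYSCLK_MHZ`, `us_to_cycles` and `speedup` are hypothetical names used only for this illustration.

```c
/* Assumed system clock in MHz. The thread does not state it;
   90 MHz is the F28069 maximum. */
#define SYSCLK_MHZ 90.0

/* Convert a measured duration in microseconds to clock cycles. */
static double us_to_cycles(double us)
{
    return us * SYSCLK_MHZ;
}

/* Ratio of two measured durations (how many times faster "after" is). */
static double speedup(double before_us, double after_us)
{
    return before_us / after_us;
}
```

Under that assumption, the original 24 µs on the CPU corresponds to 2160 cycles, 8.4 µs (flash, fast library) to 756 cycles, 6.2 µs (RAM, fast library) to 558 cycles, and the 3 µs CLA measurement to 270 cycles. Switching to the fast library while still running from flash is a speedup of about 2.9x.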

For the sake of completeness, here is the disassembly when running from flash without compiler optimization:

1084    		delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;
3dd46a:   761F0231    MOVW         DP, #0x231
3dd46c:   E2AF0004    MOV32        R0H, @0x4, UNCF
3dd46e:   761F0232    MOVW         DP, #0x232
3dd470:   E2AF0116    MOV32        R1H, @0x16, UNCF
3dd472:   767E137D    LCR          $C:/ti/controlSUITE/libs/math/FPUfastRTS/V100/source/div_f32.asm:52:71$
3dd474:   761F0231    MOVW         DP, #0x231
3dd476:   E2AF0100    MOV32        R1H, @0x0, UNCF
3dd478:   761F0232    MOVW         DP, #0x232
3dd47a:   E3204814    SUBF32       R1H, R1H, R0H || MOV32        R0H, @0x14
3dd47c:   761F0231    MOVW         DP, #0x231
3dd47e:   E3004116    MPYF32       R0H, R1H, R0H || MOV32        R1H, @0x16
3dd480:   761F0231    MOVW         DP, #0x231
3dd482:   E7000008    MPYF32       R0H, R1H, R0H
3dd484:   E2AF0104    MOV32        R1H, @0x4, UNCF
3dd486:   E7100040    ADDF32       R0H, R0H, R1H
3dd488:   761F0231    MOVW         DP, #0x231
3dd48a:   E2030006    MOV32        @0x6, R0H
1085    		delta_Z_o_n1= delta_Z_o;
3dd48c:   0606        MOVL         ACC, @0x6
3dd48d:   761F0231    MOVW         DP, #0x231
3dd48f:   1E04        MOVL         @0x4, ACC

I think the problem is resolved.

Thanks.

Markus
