Hello TI,
I have a problem understanding the CLA vs. CPU calculation time.
We use a TMS320F28069. For testing and debugging I compare absolutely (!) identical code: first on the main CPU, then in a CLA task. I expected the same calculation time.
But my measurements show a much faster calculation time for the CLA.
The difference is 3 µs (CLA) vs. 24 µs (CPU)!
To measure the time I toggle a GPIO.
The CLA task is also measured with a GPIO: started by a software trigger (IACK), stopped by the CLA1_INT2_ISR routine. The code snippet is below.
All needed variables are in the CLA1DataRam area. The calculation is correct and works (on the CLA and on the CPU).
void main( void )
{
    … init, etc …
    EALLOW;
    GpioDataRegs.GPASET.bit.GPIO13 = 1;    // Start measure time
    EDIS;
    __asm(" IACK #0x0002");                // start CLA-Task2
}

// INT11.2
__interrupt void CLA1_INT2_ISR( void )     // CLA
{
    EALLOW;
    GpioDataRegs.GPACLEAR.bit.GPIO13 = 1;  // Stop measure time
    EDIS;
    PieCtrlRegs.PIEACK.all = PIEACK_GROUP11;
}
Most of the calculation time is spent in this code. It contains some divisions. But why is it so much faster on the CLA?
delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;
delta_Z_o_n1 = delta_Z_o;
This formula is evaluated 6 times (each time with different variables).
Compiler optimization has no effect on the speed.
Do you have any idea how this can happen?
Thank you.
Best Regards
Markus
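For scale, the measured times can be turned into cycle counts. The sketch below is only a rough aid and rests on an assumption: a 90 MHz SYSCLKOUT, which is the rated maximum of the F28069 but is not stated anywhere in this thread. The macro and function names are made up for illustration.

```c
#include <assert.h>

/* ASSUMPTION: 90 MHz system clock. The thread does not state the real clock. */
#define ASSUMED_SYSCLK_MHZ 90.0

/* Convert a measured time in microseconds to clock cycles at the assumed clock. */
static double us_to_cycles(double time_us)
{
    assert(time_us >= 0.0);
    return time_us * ASSUMED_SYSCLK_MHZ;
}

/* Average cycles per formula when the task evaluates it 'count' times. */
static double cycles_per_formula(double time_us, int count)
{
    assert(count > 0);
    return us_to_cycles(time_us) / (double)count;
}
```

Under that assumption, 3 µs is about 270 cycles and 24 µs about 2160 cycles, so roughly 45 vs. 360 cycles per formula for the 6 evaluations. The figures include the GPIO and trigger/interrupt overhead of the measurement, so they are upper bounds rather than exact costs.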
Markus,
You mention the calculation is done 6 times. Is that all done within one task, or does the task need to be triggered 6 times?
-Lori
Something else to check:
Is the compiler option for floating point enabled for the C28x (--float_support=fpu32)?
Are you using the fast RTS library from C2000Ware (C:\ti\c2000\C2000Ware<version>\libraries\math\FPUfastRTS)? You can check the generated .map file to see whether the C28x division comes from this library.
Hi Lori,
That's all done within one task.
Roughly like this:
my_calc_task
{
    delta_Z_o  = ( ( Ch ) * ( P_Z_o  - ( delta_Z_o_n1  / Rjc  ) ) * deltat ) + delta_Z_o_n1;
    delta_Z_o_n1  = delta_Z_o;

    delta_Z_o1 = ( ( Ch ) * ( P_Z_o1 - ( delta_Z_o_n11 / Rjc1 ) ) * deltat ) + delta_Z_o_n11;
    delta_Z_o_n11 = delta_Z_o1;

    delta_Z_o2 = ( ( Ch ) * ( P_Z_o2 - ( delta_Z_o_n12 / Rjc2 ) ) * deltat ) + delta_Z_o_n12;
    delta_Z_o_n12 = delta_Z_o2;

    delta_Z_o3 = ( ( Ch ) * ( P_Z_o3 - ( delta_Z_o_n13 / Rjc3 ) ) * deltat ) + delta_Z_o_n13;
    delta_Z_o_n13 = delta_Z_o3;

    delta_Z_o4 = ( ( Ch ) * ( P_Z_o4 - ( delta_Z_o_n14 / Rjc4 ) ) * deltat ) + delta_Z_o_n14;
    delta_Z_o_n14 = delta_Z_o4;

    delta_Z_o5 = ( ( Ch ) * ( P_Z_o5 - ( delta_Z_o_n15 / Rjc5 ) ) * deltat ) + delta_Z_o_n15;
    delta_Z_o_n15 = delta_Z_o5;
}
Markus
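The six updates in the task above all have the same shape, so they can be written once and applied to each channel. This is a host-side sketch only, not the code from the post: the struct ThermalNode and both function names are hypothetical, and grouping the variables into a struct changes the memory layout compared with the separate variables used in the original.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical grouping of the per-channel variables from the post. */
typedef struct {
    float p;     /* power input, P_Z_o.. in the post                */
    float rjc;   /* Rjc.. in the post; used as a divisor            */
    float prev;  /* previous result, delta_Z_o_n1.. in the post     */
    float out;   /* current result, delta_Z_o.. in the post         */
} ThermalNode;

/* One update: out = Ch * (P - prev / Rjc) * deltat + prev, then prev = out. */
static void thermal_step(ThermalNode *n, float ch, float deltat)
{
    assert(n->rjc != 0.0f);
    n->out  = (ch * (n->p - (n->prev / n->rjc)) * deltat) + n->prev;
    n->prev = n->out;
}

/* Apply the same update to every channel, as the task does six times. */
static void thermal_step_all(ThermalNode *nodes, size_t count,
                             float ch, float deltat)
{
    for (size_t i = 0; i < count; ++i) {
        thermal_step(&nodes[i], ch, deltat);
    }
}
```

Each call contains exactly one floating-point division, which is the operation the rest of this thread turns out to be about.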
Hi Lori,
Yes, the division is using the RTS library. It comes from:
C:/ti/ccsv8/tools/compiler/ti-cgt-c2000_18.1.4.LTS/lib/rts2800_fpu32.lib
But is my understanding of the processor right?
The (same) code on the CLA should run about as quickly as on the CPU, correct?
Thanks.
Markus
Hi,
I had a look at the disassembly.
CPU
1090 delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;
3dd465: 761F0231 MOVW DP, #0x231
3dd467: FF69 SPM #0
3dd468: E2AF0004 MOV32 R0H, @0x4, UNCF
3dd46a: 761F0232 MOVW DP, #0x232
3dd46c: E2AF0116 MOV32 R1H, @0x16, UNCF
3dd46e: 767E0EB9 LCR FS$$DIV
3dd470: 761F0231 MOVW DP, #0x231
3dd472: E2AF0100 MOV32 R1H, @0x0, UNCF
3dd474: 761F0232 MOVW DP, #0x232
3dd476: E3204814 SUBF32 R1H, R1H, R0H || MOV32 R0H, @0x14
3dd478: 761F0231 MOVW DP, #0x231
3dd47a: E3004116 MPYF32 R0H, R1H, R0H || MOV32 R1H, @0x16
3dd47c: 761F0231 MOVW DP, #0x231
3dd47e: E7000008 MPYF32 R0H, R1H, R0H
3dd480: E2AF0104 MOV32 R1H, @0x4, UNCF
3dd482: E7100040 ADDF32 R0H, R0H, R1H
3dd484: 761F0231 MOVW DP, #0x231
3dd486: E2030006 MOV32 @0x6, R0H
1091 delta_Z_o_n1= delta_Z_o;
3dd488: 0606 MOVL ACC, @0x6
3dd489: 761F0231 MOVW DP, #0x231
3dd48b: 1E04 MOVL @0x4, ACC
CLA
1187 delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;
00009cf0: 73C08C96 MMOV32 MR0, @0x8c96, UNCF
00009cf2: 7F000003 MEINVF32 MR3, MR0
00009cf4: 7C000032 MMPYF32 MR2, MR0, MR3
00009cf6: 780A4000 MSUBF32 MR2, #0x4000, MR2
00009cf8: 7C00003B MMPYF32 MR3, MR2, MR3
00009cfa: 7C00000E MMPYF32 MR2, MR3, MR0
00009cfc: 780A4000 MSUBF32 MR2, #0x4000, MR2
00009cfe: 0ED08C44 MMPYF32 MR3, MR2, MR3 || MMOV32 MR1, @0x8c44
00009d00: 07808C40 MMPYF32 MR2, MR3, MR1 || MMOV32 MR0, @0x8c40
00009d02: 28408C94 MSUBF32 MR1, MR0, MR2 || MMOV32 MR0, @0x8c94
00009d04: 01108C56 MMPYF32 MR0, MR1, MR0 || MMOV32 MR1, @0x8c56
00009d06: 01108C44 MMPYF32 MR0, MR1, MR0 || MMOV32 MR1, @0x8c44
00009d08: 7C200004 MADDF32 MR0, MR1, MR0
00009d0a: 74C08C46 MMOV32 @0x8c46, MR0
1188 delta_Z_o_n1= delta_Z_o;
00009d0c: 74C08C44 MMOV32 @0x8c44, MR0
The code on the CLA is more efficient. But why?
Hope it helps.
Thanks Markus
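Reading the two listings side by side: the CLA code does the division inline (MEINVF32 for a reciprocal estimate, followed twice by multiply, subtract from #0x4000, which is the upper half of 2.0f, and multiply), while the CPU code calls FS$$DIV through LCR. The inline pattern matches two Newton-Raphson refinements of a reciprocal. Below is a host-side C sketch of that arithmetic only. The function names are made up, and the starting estimate is passed in as a parameter as a stand-in for what MEINVF32 produces in hardware.

```c
#include <assert.h>

/* One Newton-Raphson refinement of y ~ 1/x:  y' = y * (2 - x * y). */
static float nr_refine(float x, float y)
{
    return y * (2.0f - x * y);
}

/*
 * Divide via a refined reciprocal, mirroring the CLA listing:
 * estimate -> refine -> refine -> multiply by the numerator.
 * 'estimate' is a stand-in (assumption) for the hardware estimate of 1/den.
 */
static float divide_via_reciprocal(float num, float den, float estimate)
{
    assert(den != 0.0f);
    float y = estimate;
    y = nr_refine(den, y);   /* first refinement  */
    y = nr_refine(den, y);   /* second refinement */
    return num * y;
}
```

As a usage example, divide_via_reciprocal(10.0f, 4.0f, 0.249f) starts from an estimate that is off by about 0.4 % and ends up very close to 2.5, because each refinement roughly squares the relative error.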
Hi Edward,
Thanks, you are right. I changed to relaxed mode, but the call LCR FS$$DIV is still present. See the disassembly below.
The settings for Optimization:
Disassembly:
1090 delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;
3dcd66: 8B16 MOVL XAR1, @0x16
3dcd67: FF69 SPM #0
3dcd68: 8214 MOVL XAR3, @0x14
3dcd69: 761F0231 MOVW DP, #0x231
3dcd6b: BDA10F16 MOV32 R1H, @XAR1
3dcd6d: E2AF000C MOV32 R0H, @0xc, UNCF
3dcd6f: 767E136A LCR FS$$DIV
3dcd71: BDA30F16 MOV32 R1H, @XAR3
3dcd73: 7700 NOP
3dcd74: 7700 NOP
3dcd75: E7200038 SUBF32 R0H, R7H, R0H
3dcd77: 761F0231 MOVW DP, #0x231
3dcd79: E7000009 MPYF32 R1H, R1H, R0H
3dcd7b: E2AF004A MOV32 R0H, *-SP[10], UNCF
3dcd7d: E7000040 MPYF32 R0H, R0H, R1H
3dcd7f: E2AF010C MOV32 R1H, @0xc, UNCF
3dcd81: E7100040 ADDF32 R0H, R0H, R1H
3dcd83: 7700 NOP
3dcd84: 7700 NOP
3dcd85: 7700 NOP
3dcd86: BFA20F12 MOV32 @XAR2, R0H
3dcd88: 761F0231 MOVW DP, #0x231
But it makes no difference in speed.
Any Ideas?
Thanks.
Markus
Sorry, I thought wrong: the EINVF32 instruction is not emitted directly (the way DIVF32 is when a TMU is present). It is still a call to FS$$DIV, which should then use EINVF32 for a fast divide. I guess strict vs. relaxed mode is not relevant for you, since your device has no TMU.
You were asked whether you are using the *fast RTS library*. rts2800_fpu32.lib is not the fast one. You certainly need the fast supplement library for a fast divide. Once you add it, make sure it has the highest priority: use the up/down arrows to move it to the top of the list box under C2000 Linker->File Search Path->Include library file ...
To make sure the fast library is in use, single-step in the debugger into the FS$$DIV routine; you should see the EINVF32 instruction there.
Edward
As a general statement it depends on the code. For pure 32-bit floating point math operations I would expect them to be similar.
If, for example, the code had fixed-point operations or a lot of branches, or the C28x could leverage a repeat block, then the CLA would not do as well as the C28x. This is because of the instruction set and addressing modes available to the CLA.
Because in your case the C28x is taking much longer, it seems the comparison is not apples to apples.
Another question on the C28x code, is it running from flash or RAM?
Also to check that your code is pulling in the fast RTS library, take a look at the generated .map file and see which object libraries are being used.
Regards
Lori
Hi Edward,
Hi Lori,
Using rts2800_fpu32_fast_supplement.lib has the most positive effect.
Before, we used rts2800_fpu32.lib.
Also a very good tip: running from flash vs. RAM.
6.2 µs (RAM) vs. 8.4 µs (flash)
6.7 µs (without compiler optimization; RAM) vs. 9.2 µs (without compiler optimization; flash)
Note: all earlier CPU measurements in the posts above were made running from flash (the CLA program always runs from RAM).
For the sake of completeness, the disassembly from flash without compiler optimization:
1084 delta_Z_o = ( ( Ch ) * ( P_Z_o - ( delta_Z_o_n1 / Rjc ) ) * deltat ) + delta_Z_o_n1;
3dd46a: 761F0231 MOVW DP, #0x231
3dd46c: E2AF0004 MOV32 R0H, @0x4, UNCF
3dd46e: 761F0232 MOVW DP, #0x232
3dd470: E2AF0116 MOV32 R1H, @0x16, UNCF
3dd472: 767E137D LCR $C:/ti/controlSUITE/libs/math/FPUfastRTS/V100/source/div_f32.asm:52:71$
3dd474: 761F0231 MOVW DP, #0x231
3dd476: E2AF0100 MOV32 R1H, @0x0, UNCF
3dd478: 761F0232 MOVW DP, #0x232
3dd47a: E3204814 SUBF32 R1H, R1H, R0H || MOV32 R0H, @0x14
3dd47c: 761F0231 MOVW DP, #0x231
3dd47e: E3004116 MPYF32 R0H, R1H, R0H || MOV32 R1H, @0x16
3dd480: 761F0231 MOVW DP, #0x231
3dd482: E7000008 MPYF32 R0H, R1H, R0H
3dd484: E2AF0104 MOV32 R1H, @0x4, UNCF
3dd486: E7100040 ADDF32 R0H, R0H, R1H
3dd488: 761F0231 MOVW DP, #0x231
3dd48a: E2030006 MOV32 @0x6, R0H
1085 delta_Z_o_n1= delta_Z_o;
3dd48c: 0606 MOVL ACC, @0x6
3dd48d: 761F0231 MOVW DP, #0x231
3dd48f: 1E04 MOVL @0x4, ACC
I think the problem is resolved.
Thanks.
Markus