
TMS320F28379D: Addition on FPU seems slower than sin() on TMU?

Part Number: TMS320F28379D

Dear Champs,

I am asking this for our customer.

The user is optimizing their code for speed.

They compared the speed of a simple addition on the FPU with sin() on the TMU and found that sin() is faster than the addition (i += 1.1).

Does this make sense?

Do you have any comment?

Wayne Huang

  • Wayne,

    I presume optimization is disabled? Can you look at the disassembly? That will tell you what is going on. A float add is 1 cycle and SINPUF32 is 4 cycles, so surely there are multiple instructions involved (e.g. for the add: read, add, store). The sine loop operates on the same value every iteration; the compiler might optimize this out, but that depends on the settings and on whether Time_V2 is volatile. The numbers are in the same range, so that tells me the compiler isn't optimizing it out.

    Thanks,

    Sira 

  • Dear Sira,

    The user has run more experiments.

    CPU Timer period is 100000.

    That is, if GH_Timer_us is 90000, the addition takes 100000 - 90000 = 10000 counts.

    If Ori_Time_us is 90000, sin() takes 100000 - 90000 = 10000 counts.

    Compiler is V18.12.7.LTS

    The addition was changed to "Time_V1 = Time_V1 + Timer_V1;"

    The for loop was changed to 2000 iterations to avoid underflow.

    1. Without optimization, the time for the addition is similar to that for sin().

    2. Much to our surprise, with the optimization below enabled, the time for sin() is much shorter than that for the addition.

    Do you have any comment?

    Wayne Huang

  • Wayne,

    So basically, in the case with optimization, the sine loop takes about 8000 cycles for 2000 iterations, i.e. 4 cycles/iteration, which is how long the SINPUF32 instruction takes.

    In the same optimized case, the add loop takes about 14000 cycles for 2000 iterations, i.e. 7 cycles/iteration.

    The disassembly for the add loop looks strange to me. I am still trying to make sense of it. Please give me until Monday to get back to you.

    Meanwhile, can you tell me how Time_V1 is defined? Is it a volatile float?

    Thanks,

    Sira

  • Dear Sira,

    It's as follows.

    float  Time_V1, Time_V2, Time_V0;

    Wayne Huang

  • Wayne,

    The compiler is generating instructions that do the following:

    1. Write the result to memory (R0H -> memory).

    2. Load a register from the same memory location (memory -> R1H). This causes a stall, since the preceding write must complete first.

    3. Perform the addition (R0H = R0H + R1H). This is a 2p-cycle instruction in any case.

    Thanks,

    Sira

  • Wayne, I want to send this test case (the addition loop) to the compiler team so we can understand why there is an intermediate write to memory and read back from memory, instead of just register-to-register transfers, given the fixed, known length of the loop.

    So can you please send me the Compiler version? Is code running out of Flash or RAM?

    Thanks,

    Sira

  • Dear Sira,

    Compiler is V18.12.7.LTS

    The code was running from RAM (copied from flash and executed in RAM).

    No interrupts were used.

    void main(void)
    {
        Device_init();
        Device_initGPIO();
        ...
        ...
        for(;;)
        {
            mainLoop();
        }
    }

    mainLoop() is executed on RAM as below.

    And the comparison codes we are discussing are in mainLoop().

    #pragma CODE_SECTION(mainLoop, ".TI.ramfunc");
    void mainLoop(void)
    {
        …
        …
    }

    Wayne Huang

  • Thank you, Wayne.

    Thanks,

    Sira

  • Wayne,

    I have filed a JIRA for this issue to be addressed. I will go ahead and close this ticket.

    Thanks, Sira