
TMS320F28379D: Addition on FPU seems slower than sin() on TMU?

Part Number: TMS320F28379D

Dear Champs,

I am asking this for our customer.

The user is optimizing their code for speed.

They compared the speed of a simple addition on the FPU with sin() on the TMU and found that sin() is faster than the addition (i += 1.1).

Does this make sense?

Do you have any comment?

Wayne Huang

  • Wayne,

    I presume optimization is disabled? Can you look at the disassembly? That will tell you what is going on. A float add is 1 cycle and SINPUF32 is 4 cycles, so surely there are multiple instructions involved (e.g. for the add: read, add, store). The sine loop operates on the same value every iteration; the compiler might optimize this out, but that depends on the settings and on whether Time_V2 is volatile. The numbers are in the same range, so that tells me the compiler isn't optimizing it out.

    Thanks,

    Sira 

  • Dear Sira,

    The user has run more experiments.

    CPU Timer period is 100000.

    That is, if GH_Timer_us is 90000, the addition takes 100000 - 90000 = 10000 counts.

    If Ori_Time_us is 90000, sin() takes 100000 - 90000 = 10000 counts.

    Compiler is V18.12.7.LTS

    The addition was changed to "Time_V1 = Time_V1 + Timer_V1;"

    The for loop was changed to 2000 iterations to avoid underflow.

    1. Without optimization, the time for the addition is similar to that for sin().

    2. Much to our surprise, with the optimization below enabled, the time for sin() is much shorter than that for the addition.

    Do you have any comment?

    Wayne Huang

  • Wayne,

    So basically, in the case with optimization, the sine loop takes about 8000 cycles for 2000 iterations, i.e. 4 cycles/iteration, which is how long the SINPUF32 instruction takes.

    In the same optimized case, the add loop takes about 14000 cycles for 2000 iterations, i.e. 7 cycles/iteration.

    The disassembly for the add loop looks strange to me. I am still trying to make sense of it. Please give me until Monday to get back to you.

    Meanwhile, can you tell me how Time_V1 is defined? Is it a volatile float?

    Thanks,

    Sira

  • Dear Sira,

    It's as follows.

    float  Time_V1, Time_V2, Time_V0;

    Wayne Huang

  • Wayne,

    The compiler is generating instructions that do the following:

    1. Write the result to memory (R0H -> memory).

    2. Load a register from the same memory location (memory -> R1H). This causes a stall, since the preceding write must complete first.

    3. Perform the addition (R0H = R0H + R1H). This is a 2p-cycle instruction in any case.

    Thanks,

    Sira

  • Wayne, I want to send this test case (the addition loop) to the compiler team so we can understand why there is an intermediate write to memory and read back from memory, instead of just register-to-register transfers, given the fixed, known length of the loop.

    So can you please send me the Compiler version? Is code running out of Flash or RAM?

    Thanks,

    Sira

  • Dear Sira,

    Compiler is V18.12.7.LTS

    The code was running from RAM (copied from flash and executed in RAM).

    No interrupts were used.

    void main(void)
    {
        Device_init();
        Device_initGPIO();
        ...
        ...
        for(;;)
        {
            mainLoop();
        }
    }

    mainLoop() is executed on RAM as below.

    And the comparison codes we are discussing are in mainLoop().

    #pragma CODE_SECTION(mainLoop, ".TI.ramfunc");
    void mainLoop(void)
    {
        …
        …
    }

    Wayne Huang

  • Thank you, Wayne.

    Thanks,

    Sira

  • Wayne,

    I have filed a JIRA for this issue to be addressed. I will go ahead and close this ticket.

    Thanks, Sira