CCS/MSP430F2619: CCS Version Floating Point Computation Speed

Aaron Viejo

Part Number: MSP430F2619
Other Parts Discussed in Thread: MSP-EXP430F5529LP

Tool/software: Code Composer Studio

With CCS v6.1.2, a loop of 128 iterations of double-precision floating-point math.h functions runs in about 2 seconds.

The same loop compiled in CCS v7.4 with math.h runs in about 8-9 seconds. Both are release builds, and --fp_reassoc is off in both.

What could be causing v7.4 floating point code (math.h functions) to run 4x slower?

Aaron

over 7 years ago

0 Keith Barkley over 7 years ago

Guru 42195 points

What is sizeof(double) in both cases? Could the newer compiler have made double a true 64-bit type?

Can you share some example code?

0 George Mock over 7 years ago

TI__Guru**** 251110 points

Please show which versions of the compiler (not CCS) were used. Please see the article Understanding tools versioning.

Consider the possibility that a difference in memory speed could be the cause. Did either the code, or the data, change from fast internal memory to slow external memory?

Thanks and regards,

-George

0 Aaron Viejo over 7 years ago in reply to George Mock

Prodigy 110 points

The loop calculates a 128-point LUT from a 6th-order polynomial, using math.h, with terms of the form coeff*pow(x,n). The coeffs are double precision (64-bit). The x value is the index of the table's for loop, so the input data is identical for both compiled versions. There is no external memory. Is it possible that math.h is slower in CCSv7, perhaps giving up speed for precision?

Aaron

0 Aaron Viejo over 7 years ago in reply to Aaron Viejo

Prodigy 110 points

CCSv6.1.2 using TI v4.4.5 Compiler
CCSv7.4.0 using TI v16.9.7.LTS Compiler

0 George Mock over 7 years ago in reply to Aaron Viejo

TI__Guru**** 251110 points

Aaron Viejo said:
The coeffs are double precision (64-bit).

Are you certain the math is done using 64-bit computations with both the 4.4.x and 16.9.x compiler versions? The built-in type double changed from 32 bits to 64 bits in that time. If this loop uses type double, then you effectively changed from 32-bit computations to 64-bit computations.
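A quick way to check which format each toolchain really uses is to look at sizeof(double) together with DBL_MANT_DIG from <float.h>. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the toolchain's <float.h> reports 24 mantissa bits for a 32-bit double and 53 for a 64-bit IEEE double.

```c
#include <float.h>
#include <stddef.h>

/* Hypothetical helper: storage size of double for the current compiler/ABI. */
size_t double_size_bytes(void)
{
    return sizeof(double);
}

/* Hypothetical helper: mantissa width of double.
   24 suggests a single-precision format, 53 an IEEE 64-bit format. */
int double_mantissa_bits(void)
{
    return DBL_MANT_DIG;
}

/* Returns 1 only when double is stored in 8 bytes AND has a 53-bit mantissa. */
int double_is_64bit(void)
{
    return (sizeof(double) == 8 && DBL_MANT_DIG == 53) ? 1 : 0;
}
```

Calling these in both builds and inspecting the results in the debugger would show whether the two projects really compute in the same precision, which sizeof alone only partly answers.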

Thanks and regards,

-George

0 Tomasz Kocon over 7 years ago in reply to Aaron Viejo

Guru 26200 points

Hello Aaron,

Could you show your Project -> General settings for CCSv6.1.2 using the TI v4.4.5 compiler, as a screen capture like the one below?

0 Tomasz Kocon over 7 years ago

Guru 26200 points

Hello Aaron,

Do you need additional help, or did we resolve your issue?

0 Aaron Viejo over 7 years ago in reply to Tomasz Kocon

Prodigy 110 points

Hello Tomasz,

See Project Properties below.

The polynomial coefficients (cn) are declared as double, and the code uses the function pow(x,n) from <math.h>:

const double sf = value;
long table[128];

for (i=0; i<128; i++)
{
    double y = c5*pow(x,5)+c4*pow(x,4)+c3*pow(x,3)+c2*pow(x,2)+c1*pow(x,1)+c0*pow(x,0);
    table[i] = lround( y / sf );
}

This runs 4x faster in CCS v6.1.2 than in v7.4.

0 Tomasz Kocon over 7 years ago in reply to Aaron Viejo

Guru 26200 points

Aaron,

First of all, I agree with you.

You have not responded to Keith's question regarding the size of double.

That is a pity. Your answer would have helped to speed up the investigation.

Your screenshot showing "eabi (ELF)" means that in both CCS v6.1.2 and v7.4 you are using 64-bit doubles, so you are comparing apples to apples.

I used an MSP-EXP430F5529LP with default settings: a DEBUG build, MPY32, and so on.

Because of limited time, I have not yet checked the default ACLK frequency, the memory regions for data and code, etc.

Using this simple code:

#include <msp430.h>
#include <math.h>

int main(void)
{
    WDTCTL = WDTPW | WDTHOLD;   // Stop watchdog timer
    /* ********* Timer_B Init (can also be done by Grace tool)  *********************** */
    TB0EX0 = 0x7;
    TB0CTL = CNTL_0 + TBSSEL_1 + ID_3 + MC_2 + TBIE;        // 16 bits counter, ACLK, divider 8 (not checked)

    const double c0=1, c1=2, c2=3, c3=4, c4=5, c5=6;
    const double sf = 1.23456789;
    const double x = 2.3;
    int i, dsize;

    dsize = sizeof(sf);

    long table[128];

    TB0CTL &= ~(0x0030); //stop counter
    TB0R = 0;  //set counter 0
    TB0CTL |= MC_2; //start counter

    for (i=0; i<128; i++)
    {
        double y = c5*pow(x,5)+c4*pow(x,4)+c3*pow(x,3)+c2*pow(x,2)+c1*pow(x,1)+c0*pow(x,0);
        table[i] = lround( y / sf );
    }

    TB0CTL &= ~(0x0030); //stop counter

    unsigned int Counter_B0_value = TB0R; // read the counter value into a variable

    __no_operation();                         // For debugger
    return 0;
}

// Timer B0 interrupt service routine
#if defined(__TI_COMPILER_VERSION__) || defined(__IAR_SYSTEMS_ICC__)
#pragma vector=TIMERB0_VECTOR
__interrupt void TIMERB0_ISR (void)
#elif defined(__GNUC__)
void __attribute__ ((interrupt(TIMERB0_VECTOR))) TIMERB0_ISR (void)
#else
#error Compiler not supported!
#endif
{
    __no_operation();
    P1OUT ^= 0x01;                            // Toggle P1.0
    TBCCR0 += 50000;                          // Add Offset to CCR0 [Cont mode]
}

I did some tests.

Below you will find the TimerB values (the TB0R register, to be precise) for the main loop of your core code (plus the loop control overhead, which is negligible in this case):

        double y = c5*pow(x,5)+c4*pow(x,4)+c3*pow(x,3)+c2*pow(x,2)+c1*pow(x,1)+c0*pow(x,0);
        table[i] = lround( y / sf );

CCS v6.1.2, TI v4.4.5, legacy COFF, 4-byte double: 1,804
CCS v6.1.2, TI v4.4.5, eabi (ELF), 8-byte double: 4,676
CCS v7.4.0, TI v18.1.1.LTS, eabi (ELF), 8-byte double: 18,856

I fully agree that your double-precision loop takes about 4 times longer under CCS v7.4 than under v6.1.2.

Interestingly, 18,856 / 4,676 ≈ 4.03, which is very close to 4.
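To put the raw counts into perspective, they can be converted to seconds and to ratios. The sketch below is only a rough illustration under unverified assumptions: ACLK runs at 32,768 Hz, ID_3 divides by 8, and TB0EX0 = 0x7 divides by a further 8, giving a 512 Hz tick. The function names are hypothetical.

```c
/* Assumed timer tick rate: ACLK divided by the ID and TBIDEX dividers. */
double timer_tick_hz(double aclk_hz, unsigned id_div, unsigned idex_div)
{
    return aclk_hz / (double)id_div / (double)idex_div;
}

/* Convert a TB0R reading to seconds for a given tick rate. */
double counts_to_seconds(unsigned long counts, double tick_hz)
{
    return (double)counts / tick_hz;
}

/* Ratio between two measurements, e.g. the v7.4 count over the v6.1.2 count. */
double slowdown(unsigned long slow_counts, unsigned long fast_counts)
{
    return (double)slow_counts / (double)fast_counts;
}
```

If those clock assumptions hold, 1,804 counts is roughly 3.5 s, 4,676 counts roughly 9.1 s, and 18,856 counts roughly 36.8 s. The ratios do not depend on the clock: going from 4-byte to 8-byte double costs about 2.6x, and going from the old to the new compiler with 8-byte double costs about 4.03x.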

Any systematic error?

Memory speed?

I have not had time to check the map files, to repeat the timing with MPY16 only (as on your chip), or to test in RELEASE mode with your optimization flags, etc.

0 Tomasz Kocon over 7 years ago in reply to Aaron Viejo

Guru 26200 points

Aaron,

From the perspective of a 16-bit MCU's processing power, your polynomial statement:
double y = c5*pow(x,5)+c4*pow(x,4)+c3*pow(x,3)+c2*pow(x,2)+c1*pow(x,1)+c0*pow(x,0);
is a disaster.

You can make it, let's say, 10 times faster.

A good method needs 1,090 TimerB counts instead of 18,856.

That is a 17.3x improvement.

0 Aaron Viejo over 7 years ago in reply to Tomasz Kocon

Prodigy 110 points

George and Tomasz,

The 64-bit coeffs are a given - I am just trying to understand why CCS6.1.2 is 4x faster.

I did sizeof(double) for both CCS6.1.2 and CCS7.4 and got 8 bytes for both.

Robert

0 Aaron Viejo over 7 years ago in reply to George Mock

Prodigy 110 points

Hello George,

The 64-bit coeffs are a given - I am just trying to understand why CCS6.1.2 is 4x faster.

I did sizeof(double) for both CCS6.1.2 and CCS7.4 and got 8 bytes for both.

Robert

0 George Mock over 7 years ago in reply to Aaron Viejo

TI__Guru**** 251110 points

I apologize for the delay.

To follow up on this issue, I filed CODEGEN-4883 in the SDOWP system. It does not report a bug against the compiler, but it does point out a degradation in performance. You are welcome to follow it with the SDOWP link below in my signature.

Thanks and regards,

-George
