CCS/MSP430F2619: CCS Version Floating Point Computation Speed

Part Number: MSP430F2619
Other Parts Discussed in Thread: MSP-EXP430F5529LP

Tool/software: Code Composer Studio

With CCS v6.1.2, a loop of 128 iterations of double-precision floating-point math.h calls runs in about 2 seconds.

The same loop compiled in CCS v7.4 with math.h runs in about 8-9 seconds.  Both are release builds, and --fp_reassoc is off for both versions.

What could be causing v7.4 floating point code (math.h functions) to run 4x slower?

Aaron

  • What is sizeof(double) in both cases? Could the newer compiler have made double a true 64-bit type?

    Can you share some example code?

  • Please show which versions of the compiler (not CCS) were used.  Please see the article Understanding tools versioning.

    Consider the possibility that a difference in memory speed could be the cause.  Did either the code, or the data, change from fast internal memory to slow external memory?

    Thanks and regards,

    -George

  • The loop calculates a 128-point LUT from a 6th-order polynomial, using math.h with terms of the form coeff*pow(x,n).
    The coeffs are double precision (64-bit). The x value is the table index (the for-loop counter), so there are no data dependencies between
    the compiled versions. There is no external memory. Is it possible that math.h is slower in CCS v7? Perhaps trading speed for precision?

    Aaron
  • CCSv6.1.2 using TI v4.4.5 Compiler
    CCSv7.4.0 using TI v16.9.7.LTS Compiler
  • Aaron Viejo said:
    The coeffs are double precision (64-bit).

    Are you certain the math is done using 64-bit computations with both the 4.4.x and 16.9.x compiler versions?  The built-in type double changed from 32 bits to 64 bits in that time.  If this loop uses type double, then you effectively changed from 32-bit computations to 64-bit computations.

    Thanks and regards,

    -George
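
A quick way to settle the sizeof(double) question on each toolchain is to check the type directly in the build under test. The sketch below uses only standard C headers (assert.h, float.h, limits.h); the helper names double_width_bits, double_is_binary64 and require_64bit_double are made up for this illustration and are not part of any TI header.

```c
#include <assert.h>
#include <float.h>
#include <limits.h>

/* Hypothetical helper: width of the built-in double type, in bits. */
unsigned double_width_bits(void)
{
    return (unsigned)(sizeof(double) * CHAR_BIT);
}

/* Hypothetical helper: 1 if double has a 53-bit significand (the IEEE 754
 * 64-bit format), 0 otherwise (a 32-bit float-sized double reports 24). */
int double_is_binary64(void)
{
    return DBL_MANT_DIG == 53;
}

/* Hypothetical guard: stop early if the build does not use 64-bit doubles,
 * so that timings from two compilers compare the same arithmetic. */
void require_64bit_double(void)
{
    assert(double_width_bits() == 64u);
    assert(double_is_binary64());
}
```

Calling require_64bit_double() at the start of main() in both projects, or inspecting the two helper values in the debugger, shows whether both builds really use the same floating-point width.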

  • Hello Aaron,

    Could you show your Project -> General settings for CCS v6.1.2 with the TI v4.4.5 compiler, as a screen capture like the one below?

  • Hello Aaron,

    Do you need additional help, or did we resolve your issue?
  • Hello Tomasz,

    See Project Properties below.

    The polynomial coefficients (cn) are declared as double, and the code uses pow(x,n) from <math.h>:

    const double sf = value;
    long table[128];

    for (i=0; i<128; i++)
    {
         double y = c5*pow(x,5)+c4*pow(x,4)+c3*pow(x,3)+c2*pow(x,2)+c1*pow(x,1)+c0*pow(x,0);
         table[i] = lround( y / sf );
    }

    This runs 4x faster in CCS v6.1.2 than in v7.4.

  • Aaron,

    First of all, I agree with you.

    You have not responded to Keith's question regarding the size of double.
    That is a pity; your answer would help speed up the investigation.

    Your screenshot showing "eabi (ELF)" means that you are using 64-bit doubles in both CCS v6.1.2 and v7.4,
    so you are comparing apples to apples.

    I used an MSP-EXP430F5529LP with default settings: a DEBUG build, MPY32, and so on.

    Because of limited time, I have not yet checked the default ACLK frequency, the memory regions for data and code, etc.

    Using this simple code:

    #include <msp430.h>
    #include <math.h>
    
    int main(void)
    {
        WDTCTL = WDTPW | WDTHOLD;   // Stop watchdog timer
        /* ********* Timer_B Init (can also be done by Grace tool)  *********************** */
        TB0EX0 = 0x7;
        TB0CTL = CNTL_0 + TBSSEL_1 + ID_3 + MC_2 + TBIE;        // 16 bits counter, ACLK, divider 8 (not checked)
    
        const double c0=1, c1=2, c2=3, c3=4, c4=5, c5=6;
        const double sf = 1.23456789;
        const double x = 2.3;
        int i, dsize;
    
        dsize = sizeof(sf);
    
        long table[128];
    
        TB0CTL &= ~(0x0030); //stop counter
        TB0R = 0;  //set counter 0
        TB0CTL |= MC_2; //start counter
    
        for (i=0; i<128; i++)
        {
            double y = c5*pow(x,5)+c4*pow(x,4)+c3*pow(x,3)+c2*pow(x,2)+c1*pow(x,1)+c0*pow(x,0);
            table[i] = lround( y / sf );
        }
    
        TB0CTL &= ~(0x0030); //stop counter
    
        unsigned int Counter_B0_value = TB0R;     // read counter value into a variable
    
        __no_operation();                         // For debugger
        return 0;
    }
    
    // Timer B0 interrupt service routine
    #if defined(__TI_COMPILER_VERSION__) || defined(__IAR_SYSTEMS_ICC__)
    #pragma vector=TIMERB0_VECTOR
    __interrupt void TIMERB0_ISR (void)
    #elif defined(__GNUC__)
    void __attribute__ ((interrupt(TIMERB0_VECTOR))) TIMERB0_ISR (void)
    #else
    #error Compiler not supported!
    #endif
    {
        __no_operation();
        P1OUT ^= 0x01;                            // Toggle P1.0
        TBCCR0 += 50000;                          // Add Offset to CCR0 [Cont mode]
    }
    

    I did some tests.

    Below are the Timer B counts (the TB0R register value, to be precise) for the main loop of your core code (including loop-control overhead, which is negligible in this case):

            double y = c5*pow(x,5)+c4*pow(x,4)+c3*pow(x,3)+c2*pow(x,2)+c1*pow(x,1)+c0*pow(x,0);
            table[i] = lround( y / sf );

    CCS v6.1.2, TI v4.4.5,      legacy COFF, 4-byte double:  1,804
    CCS v6.1.2, TI v4.4.5,      eabi (ELF),  8-byte double:  4,676
    CCS v7.4.0, TI v18.1.1.LTS, eabi (ELF),  8-byte double: 18,856

    I fully agree that your double calculation loop takes 4 times longer
    under CCS v7.4 than under v6.1.2.

    Interestingly, 18,856 / 4,676 = 4.03, which is very close to 4.

    Any systematic error?

    Memory speed?

    I had no spare time to check the map files, to measure with MPY16 only (as on your chip), or to check RELEASE mode, your optimization flag, etc.

  • Aaron,

    From the perspective of a 16-bit MCU's processing power,
    your polynomial statement:
    double y = c5*pow(x,5)+c4*pow(x,4)+c3*pow(x,3)+c2*pow(x,2)+c1*pow(x,1)+c0*pow(x,0);
    is a disaster.

    You can make it, let's say, 10 times faster.

    A good method needs 1,090 instead of 18,856 Timer B counts.

    That is a 17.3x improvement.
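
The post does not say which method produced the 1,090-count figure. One standard way to avoid the six pow() calls is Horner's scheme, which evaluates the same six-coefficient polynomial with five multiplications and five additions. The sketch below is an illustration under that assumption; the function name poly_horner and the coefficient-array layout (c[0] is the constant term) are invented here.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) using Horner's scheme:
 * (((c[n-1]*x + c[n-2])*x + ...)*x + c[0]).
 * No pow() call is needed, only n-1 multiplies and n-1 adds. */
double poly_horner(const double *c, size_t n, double x)
{
    double y;
    size_t k;

    assert(c != NULL && n >= 1);

    y = c[n - 1];                 /* start with the highest-order coefficient */
    for (k = n - 1; k > 0; k--) {
        y = y * x + c[k - 1];     /* fold in the next lower coefficient */
    }
    return y;
}
```

With the coefficients from the thread stored as c[] = {c0, c1, c2, c3, c4, c5}, the loop body would become y = poly_horner(c, 6, x). Whether this reproduces the speed-up quoted above has not been measured here.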

  • George and Tomasz,

    The 64-bit coeffs are a given - I am just trying to understand why CCS6.1.2 is 4x faster.

    I did sizeof(double) for both CCS6.1.2 and CCS7.4 and got 8 bytes for both.

    Robert
  • I apologize for the delay.

    To further this issue, I filed CODEGEN-4883 in the SDOWP system.  This does not report a bug against the compiler, but points out a degradation in performance.  You are welcome to follow it with the SDOWP link below in my signature.

    Thanks and regards,

    -George
