
Single-Precision FPU with Square Root

Hi all,

Great to be back! Probably my first post from the new home, on the other side of the pond...

I'm trying to get a better understanding of FPU usage on the TM4C12x. From my reading, it does "float" (as in 32-bit float) operations in hardware, so a multiplication will automagically be ready on the next clock cycle (or within the next few clocks?).

Specifically, reading the TM4C129NCPDT datasheet, section 3.1.5 (FPU), I see it has hardware support for addition, multiplication, etc. AND square root.

In the standard C library, square root (sqrt) takes and returns a double (64-bit) value.

So, how can I use square root with the hardware FPU?

Cheers,

  • For information on the FPU, check Arm's documentation for the Cortex-M4 FPU.

    As for a float implementation of square root, check your compiler documentation. If it exists, it's usually called something like sqrtf.

    Keep in mind that C traditionally required converting float variables to double before performing any operations, and compilers still have some latitude here. There is an as-if exemption, but that is a large hurdle for floating-point arithmetic.

    Robert
  • Yes, Robert is correct here.
    Most functions in math.h come in a double-precision and a single-precision version: sqrt() and sqrtf(), sin() and sinf(), cos() and cosf(), and so on. If you need performance (and can live with the reduced accuracy), stick with "float".
    But be aware of another "feature" of C - floating-point literals are ALWAYS interpreted as double unless explicitly marked as float. For example, "y = x * 3.14159;" promotes x and performs the multiplication in double, while "y = x * 3.14159f;" stays entirely in single precision!
    In loops, such small omissions can make a difference ...
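    To make both points concrete, here is a minimal host-compilable sketch (plain C, nothing target-specific; the variable names are made up for illustration):

    ```c
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        float x = 2.0f;

        /* Single-precision math function: stays in float end to end. */
        float r1 = sqrtf(x);

        /* sqrt() takes and returns double: x is promoted, the work is done
           in 64-bit, and the result is narrowed back - all silently. */
        float r2 = (float)sqrt((double)x);

        /* Literal trap: 0.5 is a double literal, 0.5f is a float literal. */
        float half_slow = x * 0.5;   /* multiply performed in double, then narrowed */
        float half_fast = x * 0.5f;  /* multiply performed entirely in float */

        printf("%f %f %f %f\n", r1, r2, half_slow, half_fast);
        return 0;
    }
    ```

    On a desktop compiler the double detour is nearly free; on a single-precision-only FPU like the Cortex-M4F, every double operation falls back to a software library call.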
  • When posters Robert & f.m. join forces - cut/paste and distribution makes great sense.

    Thank you both for this clarity - much appreciated.   (NOW I know why our (recent) loop "stunk"...)

    Gems such as this may warrant inclusion w/in, "Robert's Bookshelf" (which on occasion appears - here...)

  • Thank you all for indeed such enlightening replies.

    Now I am happy with a much faster code, good enough with a 32-bit square root.

    A few years ago I came across something described as "float precision square root", which meant absolutely nothing to me at the time - and seemed useless as well. Not so useless anymore!

    And the note about implicit double conversion is also welcome! Life is so easy for those PC programmers who have 64-bit 3GHz octa-cores available to run their codes, ain't?

  • Life is so easy for those PC programmers who have 64-bit 3GHz octa-cores available to run their codes, ain't?

    Yes, it is - in this regard. As a side note, even the first 8087 operated internally on 80 bits, not 64 (like "double"). So much to worry about as an embedded developer - may try my luck as used-car salesman ...

  • f. m. said:
    may try my luck as used-car salesman ...

    Contact firm/me when you can supply V-16 - low miles of course.    (and goes w/out saying - engine must employ "float")

  • As it turns out I overstated the conversion problems a bit but there's still a lot of leeway for the compiler to not match your expectations.

    One thing, Bruno: if speed is your issue, test it. There have been instances where the double versions of functions have been as fast as, or faster than, the float versions.

    Floating point is full of traps. I try to avoid it except when it doesn't matter.

    Robert

    Floating point's advantage is dynamic range, by most (all?) other measures it's worse.
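    As one classic illustration of such a trap (a generic example, not from this thread): repeatedly adding 0.1f never lands exactly on the "obvious" total, because 0.1 has no finite binary representation.

    ```c
    #include <stdio.h>

    int main(void) {
        /* volatile keeps the compiler from folding the loop away at build time
           and forces each partial sum to be rounded to single precision */
        volatile float sum = 0.0f;
        for (int i = 0; i < 10; i++) {
            sum += 0.1f;   /* 0.1 is not exactly representable in binary */
        }
        /* sum is close to 1.0f but not equal - never compare floats with == */
        printf("sum = %.9f, equal to 1.0f? %s\n",
               (double)sum, sum == 1.0f ? "yes" : "no");
        return 0;
    }
    ```

    On an IEEE-754 single-precision machine this prints "no": the accumulated sum comes out slightly above 1.0.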

  • Robert,

    Indeed I should test it - and I'm very curious to see the results.
    It won't happen too soon, as I have a really long list of things to do here, and this particular board "is working", so let it be for now. I will share the tests here when I have results available.

    But your comment pulls a trigger here: look at this beautiful sentence taken from the datasheet:

    "The Cortex-M4F FPU fully supports single-precision add, subtract, multiply, divide, multiply and accumulate, and square root operations."

    Now, if I remember from the last time I looked at assembly instructions, there are machine operations in the MCU hardware that mean "add" - and those very same operations are used by the compiler when I code "A = B + 1;".

    But is there likewise a hardware operation for sqrt, as the text suggests? Do CCS and the Arm compiler "know" of it, and magically use a 1-clock operation for "sqrtf(443556.0f);"?

    Regards,
  • A slow Saturday morning for me, and this got me interested enough to "dig deep".

    First of all, about the HW operation - one does really exist: the Cortex-M4F's VSQRT.F32 instruction.

    But it isn't the 1-cycle wonder you would like - it takes 14 cycles. I created a small test program:

    #include "project_config.h"
    
    #include <vtl_core.h>
    #include <vtl_launchpad.h>
    
    #include <driverlib/timer.h>
    #include <inc/hw_timer.h>
    
    #include <math.h>
    
    void stackoverflow(void) {
    	// When this happens "without reason", consider increasing the stack size.
    	// This can be done in Project->Properties->Build->ARM Linker->Basic Options
    	// (in CCS)
    	while(1);
    }
    
    volatile float sqrtval1;
    volatile float sqrtval2;
    volatile float sqrtdiff;
    volatile float val = 1.0f;
    volatile double valdbl = 1.0;
    volatile double sqrtdbl;
    volatile double sqrtdbldiff;
    volatile double err_ppm;
    volatile uint32_t exec_timeWsqrtf = 0;
    volatile uint32_t exec_timeW__sqrtf = 0;
    volatile uint32_t exec_timeWsqrt = 0;
    
    void main(void) {
    	// Setup FPU
    	FPUEnable();
    	FPULazyStackingEnable();
    
    	// Setup clock
    	VTL_ClockSet();
    
    	// Setup stack overflow detection
    	VTL_MPUStackOverflowDetectionEnable(stackoverflow);
    
    	// Set up Timer0 as a free-running up-counter for cycle measurements
    	VTL_SysCtlPeripheralEnable(SYSCTL_PERIPH_TIMER0);
    	TimerConfigure(TIMER0_BASE, TIMER_CFG_PERIODIC_UP);
    	TimerEnable(TIMER0_BASE, TIMER_A);
    
    	while(1) {
    		uint32_t start, end;
    
    		val *= 1.0001f;
    		valdbl *= 1.0001;
    
    		// Time the library sqrtf()
    		start = HWREG(TIMER0_BASE + TIMER_O_TAR);
    		sqrtval1 = sqrtf(val);
    		end = HWREG(TIMER0_BASE + TIMER_O_TAR);
    		exec_timeWsqrtf = end - start;
    
    		// Time the intrinsic __sqrtf()
    		start = HWREG(TIMER0_BASE + TIMER_O_TAR);
    		sqrtval2 = __sqrtf(val);
    		end = HWREG(TIMER0_BASE + TIMER_O_TAR);
    		exec_timeW__sqrtf = end - start;
    
    		// Time the double-precision sqrt()
    		start = HWREG(TIMER0_BASE + TIMER_O_TAR);
    		sqrtdbl = sqrt(valdbl);
    		end = HWREG(TIMER0_BASE + TIMER_O_TAR);
    		exec_timeWsqrt = end - start;
    
    		sqrtdiff = sqrtval2 - sqrtval1;
    		sqrtdbldiff = sqrtdbl - sqrtval1;
    		err_ppm = (sqrtdbldiff / sqrtdbl) * 1000000.0;
    	}
    }

    (Please just ignore project_config.h and all the vtl_ stuff, they're just my little helpers - the function names tell what they do, details not relevant for this exercise.)

    The __sqrtf comes from taking a peek into math.h. If you look at the "Build->ARM Compiler->Optimization" pane in Project Properties (I'm using CCS; I don't know where these settings are in other IDEs), you'll see there is a "Floating point mode" setting with options "strict" and "relaxed". The difference is explained in math.h:

    /* If --fp_mode=relaxed is used and VFP is enabled, use the hardware square  */
    /* root directly instead of calling the sqrtx routine. This will not set     */
    /* errno if the argument is negative.                                        */
    

    So assuming your "math checks out", there is no advantage to using the strict mode from what I gather. Wiser people, please advise if this is untrue!

    Now, how does it all add up? I ran the program with both fp modes and in all optimization levels (off, 0, 1, 2, 3, 4) with the dial "speed vs size tradeoffs" cranked all the way up to 5 for max speed. Here are the results:

    (I thought of the double thing only after having run it all, I re-ran it only for two settings. The execution time for the double-sqrt varied roughly between the indicated values.)

    Opt. level   FP mode   exec_timeWsqrtf   exec_timeW__sqrtf   exec_timeWsqrt
    none         strict    36                26                  -
    none         relaxed   34                26                  ~4200-4400
    0            strict    30                19                  -
    0            relaxed   19                19                  -
    1            strict    29                19                  -
    1            relaxed   19                19                  -
    2            strict    29                19                  -
    2            relaxed   19                19                  -
    3            strict    29                19                  -
    3            relaxed   19                19                  -
    4            strict    24                19                  -
    4            relaxed   19                19                  ~4100-4250

    From this I conclude that you'll get the max. possible performance by using sqrtf if you have FP mode set to relaxed and have optimization turned on - doesn't matter what level, just not 'off'. And don't use sqrt with doubles unless you require the best possible precision. Oh, and the relative error between sqrtf and sqrt (err_ppm) grew rather steadily from some -50 ppm upon startup to some -7500 ppm close to the max. range of a 32-bit float, somewhere just above 1e38.
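    One caveat worth flagging about that growing err_ppm (my reading, easily checked on a PC): a hardware sqrtf is correctly rounded, so for the same input it differs from sqrt by at most a float rounding step (well under 1 ppm). The drift most likely comes from val and valdbl themselves diverging, since 1.0001f is not the same number as 1.0001 and the error compounds every iteration. A host-side sketch of both effects:

    ```c
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* For one and the same input, sqrtf vs sqrt agree to float precision.
           443556 = 666 * 666, the same test value as in the program above. */
        float x = 443556.0f;
        double sqrt_err_ppm = ((double)sqrtf(x) - sqrt((double)x))
                              / sqrt((double)x) * 1e6;
        printf("sqrtf vs sqrt error: %g ppm\n", sqrt_err_ppm);

        /* But replaying the val *= 1.0001f / valdbl *= 1.0001 loop shows the
           two *inputs* drifting apart - this dominates the measured err_ppm. */
        volatile float val = 1.0f;
        volatile double valdbl = 1.0;
        for (int i = 0; i < 100000; i++) {
            val *= 1.0001f;     /* 1.0001f != 1.0001: error compounds */
            valdbl *= 1.0001;
        }
        double drift_ppm = ((double)val - valdbl) / valdbl * 1e6;
        printf("input drift after 100000 iterations: %g ppm\n", drift_ppm);
        return 0;
    }
    ```

    So the ppm figures above say more about accumulating values in float than about the FPU's square root.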

    One more tidbit - if you want to assure that no "double math" slips into your project, check out "Build->ARM Compiler->Advanced Options->Language Options" and the setting "Floating point precision accepted by compiler". When I set that to 32, I get the following error:

    #1558-D 64-bit floating point operations are not allowed
    

    That concludes my morning session - "I learnt something new" can now be ticked off for today!

  • One more tidbit - if you want to assure that no "double math" slips into your project, check out "Build->ARM Compiler->Advanced Options->Language Options" and the setting "Floating point precision accepted by compiler". When I set that to 32, I get the following error:
    #1558-D 64-bit floating point operations are not allowed
    


    My toolchain (and IMHO others as well) allows for something similar: a "treat double as float" option in the project settings. Because such things are not visible in the sources (and rarely in makefiles), and are often buried in proprietary project-settings files, it is mostly of questionable portability. At least I tend to avoid it.