
Single-Precision FPU with Square Root

Hi all,

Great to be back! Probably my first post from the new home, on the other side of the pond...

I'm trying to get a better understanding of FPU usage on the TM4C12x. From my reading, it does "float" (as in 32-bit float) operations in hardware, so a multiplication will automagically be ready on the next clock cycle (or within the next few clocks?).

Specifically, reading the TM4C129NCPDT datasheet, section 3.1.5 (FPU), I see it has hardware support for addition, multiplication, etc. AND square root.

In the standard C library, square root (sqrt) takes and returns a double (64-bit) value.

So, how can I use square root with the hardware FPU?

Cheers,

  • For information on the FPU, check Arm's documentation for the Cortex-M4 FPU.

    As for a float implementation of square root, check your compiler documentation. If it exists, it's usually called something like sqrtf.

    Keep in mind that C traditionally required converting float variables to double before performing any operations, and compilers still have some latitude here. There is an as-if exemption, but that is a large hurdle for floating-point arithmetic.

    Robert
  • Yes, Robert is correct here.
    Most functions in math.h come in a double-precision and a single-precision version: sqrt() and sqrtf(), sin() and sinf(), cos() and cosf(), and so on. If you need performance (and can live with the reduced accuracy), stick with "float".
    But be aware of another "feature" of C - floating-point literals are ALWAYS interpreted as double unless explicitly marked as float. For example, "y = x * 3.14159;" promotes x and performs the multiplication in double, while "y = x * 3.14159f;" stays entirely in single precision!
    In loops, such small omissions can make a difference ...
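    To make both points concrete, here is a minimal host-compilable sketch (plain C, nothing target-specific; the variable names are made up for illustration):

    ```c
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        float x = 2.0f;

        /* Single-precision math function: stays in float end to end. */
        float r1 = sqrtf(x);

        /* sqrt() takes and returns double: x is promoted, the work is done
           in 64-bit, and the result is narrowed back - all silently. */
        float r2 = (float)sqrt((double)x);

        /* Literal trap: 0.5 is a double literal, 0.5f is a float literal. */
        float half_slow = x * 0.5;   /* multiply performed in double, then narrowed */
        float half_fast = x * 0.5f;  /* multiply performed entirely in float */

        printf("%f %f %f %f\n", r1, r2, half_slow, half_fast);
        return 0;
    }
    ```

    On a desktop compiler the double detour is nearly free; on a single-precision-only FPU like the Cortex-M4F, every double operation falls back to a software library call.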
  • When posters Robert & f.m. join forces - cut/paste and distribution makes great sense.

    Thank you both for this clarity - much appreciated.   (NOW I know why our (recent) loop "stunk"...)

    Gems such as this may warrant inclusion w/in, "Robert's Bookshelf" (which on occasion appears - here...)

  • Thank you all for indeed such enlightening replies.

    Now I am happy with a much faster code, good enough with a 32-bit square root.

    A few years ago I came across something described as "float precision square root", which meant absolutely nothing to me at the time - and seemed useless as well. Not so useless anymore!

    And the note about implicit double conversion is also welcome! Life is so easy for those PC programmers who have 64-bit 3GHz octa-cores available to run their codes, ain't?

  • Life is so easy for those PC programmers who have 64-bit 3GHz octa-cores available to run their codes, ain't?

    Yes, it is - in this regard. As a side note, even the first 8087 operated internally on 80 bits, not 64 (like "double"). So much to worry about as an embedded developer - may try my luck as used-car salesman ...

  • f. m. said:
    may try my luck as used-car salesman ...

    Contact firm/me when you can supply V-16 - low miles of course.    (and goes w/out saying - engine must employ "float")

  • As it turns out I overstated the conversion problems a bit but there's still a lot of leeway for the compiler to not match your expectations.

    One thing, Bruno: if speed is your issue, test it. There have been instances where the double versions of functions have been as fast as, or faster than, the float versions.

    Floating point is full of traps. I try to avoid it except when it doesn't matter.

    Robert

    Floating point's advantage is dynamic range, by most (all?) other measures it's worse.
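    As one classic illustration of such a trap (a generic example, not from this thread): repeatedly adding 0.1f never lands exactly on the "obvious" total, because 0.1 has no finite binary representation.

    ```c
    #include <stdio.h>

    int main(void) {
        /* volatile keeps the compiler from folding the loop away at build time
           and forces each partial sum to be rounded to single precision */
        volatile float sum = 0.0f;
        for (int i = 0; i < 10; i++) {
            sum += 0.1f;   /* 0.1 is not exactly representable in binary */
        }
        /* sum is close to 1.0f but not equal - never compare floats with == */
        printf("sum = %.9f, equal to 1.0f? %s\n",
               (double)sum, sum == 1.0f ? "yes" : "no");
        return 0;
    }
    ```

    On an IEEE-754 single-precision machine this prints "no": the accumulated sum comes out slightly above 1.0.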

  • Robert,

    Indeed I should test it - and I'm very curious to see the results.
    It won't happen too soon, as I have a really long list of things to do here, and this particular board "is working", so let it be for now. I will share the tests here when I have results available.

    But your comment pulls a trigger here: look at this beautiful sentence taken from the datasheet:

    "The Cortex-M4F FPU fully supports single-precision add, subtract, multiply, divide, multiply and accumulate, and square root operations."

    Now, if I remember from the last time I looked at assembly instructions, there are machine operations in the MCU hardware that mean "add" - and those very same operations are used by the compiler when I code "A = B + 1;".

    But is there likewise a hardware operation for sqrt, as the text suggests? Do CCS and the Arm compiler "know" of it, and magically use a 1-clock operation for "sqrtf(443556.0f);"?

    Regards,
  • A slow Saturday morning for me, and this got me interested enough to "dig deep".

    First of all, about the HW operation - one does really exist: the Cortex-M4F's VSQRT.F32 instruction.

    But it isn't the 1-cycle wonder you would like - it takes 14 cycles. I created a small test program:

    #include "project_config.h"
    
    #include <vtl_core.h>
    #include <vtl_launchpad.h>
    
    #include <driverlib/timer.h>
    #include <inc/hw_timer.h>
    
    #include <math.h>
    
    void stackoverflow(void) {
    	// When this happens "without reason", consider increasing the stack size.
    	// This can be done in Project->Properties->Build->ARM Linker->Basic Options
    	// (in CCS)
    	while(1);
    }
    
    volatile float sqrtval1;
    volatile float sqrtval2;
    volatile float sqrtdiff;
    volatile float val = 1.0f;
    volatile double valdbl = 1.0;
    volatile double sqrtdbl;
    volatile double sqrtdbldiff;
    volatile double err_ppm;
    volatile uint32_t exec_timeWsqrtf = 0;
    volatile uint32_t exec_timeW__sqrtf = 0;
    volatile uint32_t exec_timeWsqrt = 0;
    
    void main(void) {
    	// Setup FPU
    	FPUEnable();
    	FPULazyStackingEnable();
    
    	// Setup clock
    	VTL_ClockSet();
    
    	// Setup stack overflow detection
    	VTL_MPUStackOverflowDetectionEnable(stackoverflow);
    
    	// Set up Timer0 as a free-running up-counter for cycle measurements
    	VTL_SysCtlPeripheralEnable(SYSCTL_PERIPH_TIMER0);
    	TimerConfigure(TIMER0_BASE, TIMER_CFG_PERIODIC_UP);
    	TimerEnable(TIMER0_BASE, TIMER_A);
    
    	while(1) {
    		uint32_t start, end;
    
    		val *= 1.0001f;
    		valdbl *= 1.0001;
    
    		// Time the library sqrtf()
    		start = HWREG(TIMER0_BASE + TIMER_O_TAR);
    		sqrtval1 = sqrtf(val);
    		end = HWREG(TIMER0_BASE + TIMER_O_TAR);
    		exec_timeWsqrtf = end - start;
    
    		// Time the intrinsic __sqrtf()
    		start = HWREG(TIMER0_BASE + TIMER_O_TAR);
    		sqrtval2 = __sqrtf(val);
    		end = HWREG(TIMER0_BASE + TIMER_O_TAR);
    		exec_timeW__sqrtf = end - start;
    
    		// Time the double-precision sqrt()
    		start = HWREG(TIMER0_BASE + TIMER_O_TAR);
    		sqrtdbl = sqrt(valdbl);
    		end = HWREG(TIMER0_BASE + TIMER_O_TAR);
    		exec_timeWsqrt = end - start;
    
    		sqrtdiff = sqrtval2 - sqrtval1;
    		sqrtdbldiff = sqrtdbl - sqrtval1;
    		err_ppm = (sqrtdbldiff / sqrtdbl) * 1000000.0;
    	}
    }

    (Please just ignore project_config.h and all the vtl_ stuff, they're just my little helpers - the function names tell what they do, details not relevant for this exercise.)

    The __sqrtf comes from taking a peek into math.h. If you look at the "Build->ARM Compiler->Optimization" pane in Project Properties (I'm using CCS; I don't know where these settings are in other IDEs), you'll see there is a "Floating point mode" setting with options "strict" and "relaxed". The difference is explained in math.h:

    /* If --fp_mode=relaxed is used and VFP is enabled, use the hardware square  */
    /* root directly instead of calling the sqrtx routine. This will not set     */
    /* errno if the argument is negative.                                        */
    

    So assuming your "math checks out", there is no advantage to using the strict mode from what I gather. Wiser people, please advise if this is untrue!

    Now, how does it all add up? I ran the program with both fp modes and in all optimization levels (off, 0, 1, 2, 3, 4) with the dial "speed vs size tradeoffs" cranked all the way up to 5 for max speed. Here are the results:

    (I thought of the double thing only after having run it all, I re-ran it only for two settings. The execution time for the double-sqrt varied roughly between the indicated values.)

    Opt. level   FP mode   exec_timeWsqrtf   exec_timeW__sqrtf   exec_timeWsqrt
    none         strict    36                26                  -
    none         relaxed   34                26                  ~4200-4400
    0            strict    30                19                  -
    0            relaxed   19                19                  -
    1            strict    29                19                  -
    1            relaxed   19                19                  -
    2            strict    29                19                  -
    2            relaxed   19                19                  -
    3            strict    29                19                  -
    3            relaxed   19                19                  -
    4            strict    24                19                  -
    4            relaxed   19                19                  ~4100-4250

    From this I conclude that you'll get the max. possible performance by using sqrtf if you have FP mode set to relaxed and have optimization turned on - doesn't matter what level, just not 'off'. And don't use sqrt with doubles unless you require the best possible precision. Oh, and the relative error between sqrtf and sqrt (err_ppm) grew rather steadily from some -50 ppm upon startup to some -7500 ppm close to the max. range of a 32-bit float, somewhere just above 1e38.
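    One caveat worth flagging about that growing err_ppm (my reading, easily checked on a PC): a hardware sqrtf is correctly rounded, so for the same input it differs from sqrt by at most a float rounding step (well under 1 ppm). The drift most likely comes from val and valdbl themselves diverging, since 1.0001f is not the same number as 1.0001 and the error compounds every iteration. A host-side sketch of both effects:

    ```c
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* For one and the same input, sqrtf vs sqrt agree to float precision.
           443556 = 666 * 666, the same test value as in the program above. */
        float x = 443556.0f;
        double sqrt_err_ppm = ((double)sqrtf(x) - sqrt((double)x))
                              / sqrt((double)x) * 1e6;
        printf("sqrtf vs sqrt error: %g ppm\n", sqrt_err_ppm);

        /* But replaying the val *= 1.0001f / valdbl *= 1.0001 loop shows the
           two *inputs* drifting apart - this dominates the measured err_ppm. */
        volatile float val = 1.0f;
        volatile double valdbl = 1.0;
        for (int i = 0; i < 100000; i++) {
            val *= 1.0001f;     /* 1.0001f != 1.0001: error compounds */
            valdbl *= 1.0001;
        }
        double drift_ppm = ((double)val - valdbl) / valdbl * 1e6;
        printf("input drift after 100000 iterations: %g ppm\n", drift_ppm);
        return 0;
    }
    ```

    So the ppm figures above say more about accumulating values in float than about the FPU's square root.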

    One more tidbit - if you want to assure that no "double math" slips into your project, check out "Build->ARM Compiler->Advanced Options->Language Options" and the setting "Floating point precision accepted by compiler". When I set that to 32, I get the following error:

    #1558-D 64-bit floating point operations are not allowed
    

    That concludes my morning session - "I learnt something new" can now be ticked off for today!

  • One more tidbit - if you want to assure that no "double math" slips into your project, check out "Build->ARM Compiler->Advanced Options->Language Options" and the setting "Floating point precision accepted by compiler". When I set that to 32, I get the following error:
    #1558-D 64-bit floating point operations are not allowed
    


    My toolchain (and IMHO others as well) allows for something similar: a "treat double as float" option in the project settings. Because such things are not visible in the sources (and rarely in makefiles), and are often buried in proprietary project-settings files, it is mostly of questionable portability. At least I tend to avoid it.