
Floating point arithmetic precision

We are using an F32F28069 processor. If I do the following using float variables:

8000178 x 0.01 = 80001.78

80001.78 - 80000 = 1.78125

It should be 1.78. I tried using double and float64. Is there a way to get better accuracy?
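
A minimal C sketch of what we are doing (a host-side test for illustration, not the actual device code):

    #include <stdio.h>

    int main(void)
    {
        float scaled = 8000178.0f * 0.01f;   /* expected 80001.78               */
        float diff   = scaled - 80000.0f;    /* expected 1.78, observed 1.78125 */

        printf("scaled = %.10f\n", scaled);  /* prints 80001.7812500000 */
        printf("diff   = %.10f\n", diff);    /* prints 1.7812500000     */
        return 0;
    }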

  • Hi,

    I tried doing the subtraction by hand, and this is what I got:

    decimal: 80001.78
    binary:  01000111100111000100000011100100
    hex:     0x479c40e4
    power 2: 1.2207303047180176 x 2^16 (stored exponent field 143, i.e. 16 + the 127 bias)

    decimal: 80000.0
    binary:  01000111100111000100000000000000
    hex:     0x479c4000
    power 2: 1.220703125 x 2^16 (stored exponent field 143, i.e. 16 + the 127 bias)

      1.00111000100000011100100 x 2^16
    - 1.00111000100000000000000 x 2^16
    ----------------------------------------
      0.00000000000000011100100 x 2^16   (normalize: shift left 16 bits)
    ----------------------------------------
      1.11001000000000000000000 x 2^(16 - 16)
    ----------------------------------------
      1.78125 x 2^0

    After adding the exponent bias, the stored exponent field is 127 (0 + 127), so 1.78125 x 2^0 is encoded as:
    hex:    0x3fe40000
    binary: 0 | 01111111 | 11001000000000000000000
            S | exponent | mantissa

    So the subtraction itself is correct (and it would come out the same with 64 bits); the real problem is that 80001.78 cannot be stored exactly, and the closest float is 80001.78125, which the subtraction then exposes. Now I tried this instead: 8000178 - 8000000.

    A possible solution:
    decimal: 8000178.0
    binary:  01001010111101000010010101100100
    hex:     0x4af42564
    power 2: 1.90739107131958 x 2^22

    decimal: 8000000.0
    binary:  01001010111101000010010000000000
    hex:     0x4af42400
    power 2: 1.9073486328125 x 2^22

      1.11101000010010101100100 x 2^22
    - 1.11101000010010000000000 x 2^22
    ----------------------------------------
      0.00000000000000101100100 x 2^22   (normalize: shift left 15 bits)
    ----------------------------------------
      1.01100100000000000000000 x 2^(22 - 15)
    ----------------------------------------
      1.390625 x 2^7 = 178

    After adding the exponent bias, the stored exponent field is 134 (7 + 127), so 1.390625 x 2^7 is encoded as:
    hex:    0x43320000
    binary: 0 | 10000110 | 01100100000000000000000
            S | exponent | mantissa

    The result is 178, which you then have to divide by 100. That divide will not give you exactly 1.78 but 1.7799999..., because 1.78 is not exactly representable in floating point, whether 32-bit or 64-bit. The downside is that you need the divide to get the more accurate result, and that you have to know the range of the numbers you are dealing with and apply the right amount of scaling.
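
    As a sketch of that idea (subtract in the large-magnitude domain first, scale last; the variable names here are only illustrative):

        #include <stdio.h>

        int main(void)
        {
            float raw    = 8000178.0f;      /* exact in float32 (below 2^24) */
            float offset = 8000000.0f;      /* also exact                    */

            /* Subtract first: both operands are exact, so 178 comes out exactly. */
            float diff_counts = raw - offset;

            /* Scale last: only this one division rounds, giving the float closest
               to 1.78 (about 1.77999997, the best 32-bit float can do).          */
            float value = diff_counts / 100.0f;

            printf("diff_counts = %.1f\n", diff_counts);   /* 178.0       */
            printf("value       = %.9f\n", value);         /* 1.779999971 */
            return 0;
        }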

    I don't know if this is the best way to do this kind of arithmetic; there are probably better ways. I can recommend this paper for more ways to tackle the problem:

    "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (1991) by David Goldberg.

    I also used the following online converter, in case you want to double check the math:

    http://www.h-schmidt.net/FloatConverter/
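
    If you would rather double-check in code, a small sketch like this prints a float's raw bit image (assuming a C99 compiler; dump_float is just an illustrative name):

        #include <stdio.h>
        #include <string.h>
        #include <stdint.h>

        /* Print a float alongside its raw 32-bit encoding. */
        static void dump_float(const char *label, float f)
        {
            uint32_t bits;
            memcpy(&bits, &f, sizeof bits);   /* type-pun safely via memcpy */
            printf("%-6s %14.6f  0x%08lX\n", label, f, (unsigned long)bits);
        }

        int main(void)
        {
            dump_float("a",     80001.78f);              /* expect 0x479C40E4 */
            dump_float("b",     80000.0f);               /* expect 0x479C4000 */
            dump_float("a - b", 80001.78f - 80000.0f);   /* expect 0x3FE40000 */
            return 0;
        }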

  • I had written an MFC dialog application to do the conversion and was lulled into a false sense of security. I thought it was displaying properly in my little app, but the number I was using had a lot of zeroes, so when I formatted it with "%f" it must have rounded or truncated, and I thought I was seeing the correct number.
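
    For what it's worth, "%f" prints only six digits after the decimal point by default, which can make an inexact value look exact; a minimal sketch:

        #include <stdio.h>

        int main(void)
        {
            float x = 1.78f;        /* actually stored as about 1.77999997 */

            printf("%f\n",   x);    /* prints 1.780000    -- looks exact      */
            printf("%.9f\n", x);    /* prints 1.779999971 -- the stored value */
            return 0;
        }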