
Floating point arithmetic precision

We are using an F32F28069 processor. If I do the following using float variables:

8000178 x 0.01 = 80001.78

80001.78 - 80000 = 1.78125

It should be 1.78. I tried using double and float64. Is there a way to get better accuracy?
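
A minimal C sketch of what we are doing (a host-side test for illustration, not the actual device code):

    #include <stdio.h>

    int main(void)
    {
        float scaled = 8000178.0f * 0.01f;   /* expected 80001.78               */
        float diff   = scaled - 80000.0f;    /* expected 1.78, observed 1.78125 */

        printf("scaled = %.10f\n", scaled);  /* prints 80001.7812500000 */
        printf("diff   = %.10f\n", diff);    /* prints 1.7812500000     */
        return 0;
    }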

  • Hi,

    I tried doing the subtraction by hand, and this is what I got:

    decimal: 80001.78
    binary:  01000111100111000100000011100100
    hex:     0x479c40e4
    power 2: 1.2207303047180176 x 2^16 (stored exponent field 143, i.e. 16 + the 127 bias)

    decimal: 80000.0
    binary:  01000111100111000100000000000000
    hex:     0x479c4000
    power 2: 1.220703125 x 2^16 (stored exponent field 143, i.e. 16 + the 127 bias)

      1.00111000100000011100100 x 2^16
    - 1.00111000100000000000000 x 2^16
    ----------------------------------------
      0.00000000000000011100100 x 2^16   (normalize: shift left 16 bits)
    ----------------------------------------
      1.11001000000000000000000 x 2^(16 - 16)
    ----------------------------------------
      1.78125 x 2^0

    After adding the exponent bias, the stored exponent field is 127 (0 + 127), so 1.78125 x 2^0 is encoded as:
    hex:    0x3fe40000
    binary: 0 | 01111111 | 11001000000000000000000
            S | exponent | mantissa

    So the subtraction itself is correct (and it would come out the same with 64 bits); the real problem is that 80001.78 cannot be stored exactly, and the closest float is 80001.78125, which the subtraction then exposes. Now I tried this instead: 8000178 - 8000000.

    A possible solution:
    decimal: 8000178.0
    binary:  01001010111101000010010101100100
    hex:     0x4af42564
    power 2: 1.90739107131958 x 2^22

    decimal: 8000000.0
    binary:  01001010111101000010010000000000
    hex:     0x4af42400
    power 2: 1.9073486328125 x 2^22

      1.11101000010010101100100 x 2^22
    - 1.11101000010010000000000 x 2^22
    ----------------------------------------
      0.00000000000000101100100 x 2^22   (normalize: shift left 15 bits)
    ----------------------------------------
      1.01100100000000000000000 x 2^(22 - 15)
    ----------------------------------------
      1.390625 x 2^7 = 178

    After adding the exponent bias, the stored exponent field is 134 (7 + 127), so 1.390625 x 2^7 is encoded as:
    hex:    0x43320000
    binary: 0 | 10000110 | 01100100000000000000000
            S | exponent | mantissa

    The result is 178, which you then have to divide by 100. That divide will not give you exactly 1.78 but 1.7799999..., because 1.78 is not exactly representable in floating point, whether 32-bit or 64-bit. The downside is that you need the divide to get the more accurate result, and that you have to know the range of the numbers you are dealing with and apply the right amount of scaling.
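
    As a sketch of that idea (subtract in the large-magnitude domain first, scale last; the variable names here are only illustrative):

        #include <stdio.h>

        int main(void)
        {
            float raw    = 8000178.0f;      /* exact in float32 (below 2^24) */
            float offset = 8000000.0f;      /* also exact                    */

            /* Subtract first: both operands are exact, so 178 comes out exactly. */
            float diff_counts = raw - offset;

            /* Scale last: only this one division rounds, giving the float closest
               to 1.78 (about 1.77999997, the best 32-bit float can do).          */
            float value = diff_counts / 100.0f;

            printf("diff_counts = %.1f\n", diff_counts);   /* 178.0       */
            printf("value       = %.9f\n", value);         /* 1.779999971 */
            return 0;
        }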

    I don't know if this is the best way to do this kind of arithmetic; there are probably better ways. I can recommend this paper for more ways to tackle the problem:

    "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (1991) by David Goldberg.

    I also used the following online converter, in case you want to double check the math:

    http://www.h-schmidt.net/FloatConverter/
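
    If you would rather double-check in code, a small sketch like this prints a float's raw bit image (assuming a C99 compiler; dump_float is just an illustrative name):

        #include <stdio.h>
        #include <string.h>
        #include <stdint.h>

        /* Print a float alongside its raw 32-bit encoding. */
        static void dump_float(const char *label, float f)
        {
            uint32_t bits;
            memcpy(&bits, &f, sizeof bits);   /* type-pun safely via memcpy */
            printf("%-6s %14.6f  0x%08lX\n", label, f, (unsigned long)bits);
        }

        int main(void)
        {
            dump_float("a",     80001.78f);              /* expect 0x479C40E4 */
            dump_float("b",     80000.0f);               /* expect 0x479C4000 */
            dump_float("a - b", 80001.78f - 80000.0f);   /* expect 0x3FE40000 */
            return 0;
        }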

  • I had written an MFC dialog application to do the conversion and was lulled into a false sense of security. I thought it was displaying properly in my little app, but the number I was using had a lot of zeroes, so when I formatted it with "%f" it must have rounded or truncated, and I thought I was seeing the correct number.
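
    For what it's worth, "%f" prints only six digits after the decimal point by default, which can make an inexact value look exact; a minimal sketch:

        #include <stdio.h>

        int main(void)
        {
            float x = 1.78f;        /* actually stored as about 1.77999997 */

            printf("%f\n",   x);    /* prints 1.780000    -- looks exact      */
            printf("%.9f\n", x);    /* prints 1.779999971 -- the stored value */
            return 0;
        }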