Floating Point truncation issue

Ronaldo Pace

Hi,

I'm writing a routine on my Concerto MCU to compensate for the standard C library features missing from TI's implementation for this product, but I've run into a truncation issue.

long double value = 2.0;
int64_t integer_part = (int64_t) value;
// at this point integer_part equals 1

long double fraction_part = value - integer_part;
// here fraction_part equals 0.99999996

I also tried using truncl(long double) from math.h instead of the type cast, and the error still persists.

So I ask, how do I make this truncation work properly?

over 14 years ago

0 Ronaldo Pace over 14 years ago

Intellectual 875 points

please? someone? no one?

0 Vishal_Coelho over 14 years ago in reply to Ronaldo Pace

TI__Mastermind 20850 points

Hi Ronaldo,

Can you post the disassembly for these two instructions?

0 Ronaldo Pace over 14 years ago in reply to Vishal_Coelho

Intellectual 875 points

hi Vishal,

Sorry for taking so long to answer; it was the weekend and I was away from the office.

Those two instructions don't exist in that exact form; they were just an example of what is happening. Here is the exact code:

char * doubleToString(long double value, unsigned int dec_places) {
	int64_t integer_part = (int64_t) value;
	long double fraction_part = value - integer_part;

and the disassembly is:

13        char * doubleToString(long double value, unsigned int dec_places) {
          doubleToString:
0020a694:   B510     PUSH            {R4, LR}
0020a696:   F1AD0D30 SUB.W           R13, R13, #48
0020a69a:   9202     STR             R2, [SP, #0x8]
0020a69c:   E88D0003 STMIA.W         R13, {R0, R1}
18        	int64_t integer_part = (int64_t) value;
0020a6a0:   E89D0003 LDMIA.W         R13, {R0, R1}
0020a6a4:   F000FE3A BL              __aeabi_d2lz
0020a6a8:   AA04     ADD             R2, SP, #0x10 
0020a6aa:   E8820003 STMIA.W         R2, {R0, R1}
19        	long double fraction_part = value - integer_part;
0020a6ae:   A804     ADD             R0, SP, #0x10 
0020a6b0:   E8900003 LDMIA.W         R0, {R0, R1}
0020a6b4:   F001F82C BL              __aeabi_l2d
0020a6b8:   4602     MOV             R2, R0
0020a6ba:   460B     MOV             R3, R1
0020a6bc:   E89D0003 LDMIA.W         R13, {R0, R1}
0020a6c0:   F7FEFEE2 BL              __aeabi_dsub
0020a6c4:   AA06     ADD             R2, SP, #0x18 
0020a6c6:   E8820003 STMIA.W         R2, {R0, R1}

It's been 10 years since I last looked at assembly code, so the disassembly is fairly meaningless to me. Can you see what the problem is?

0 Ronaldo Pace over 14 years ago in reply to Ronaldo Pace

Intellectual 875 points

nothing? no one?

0 Lori Heustess over 14 years ago

TI__Guru* 93800 points

Ronaldo,

I am going to move this to the compiler forum to see if they have any suggestions.

Regards

Lori

0 Archaeologist over 14 years ago in reply to Lori Heustess

TI__Guru* 84285 points

I cannot reproduce this problem. We'll need to see a complete, compilable test case (in particular, the values being passed as parameters), as well as the exact command-line options and compiler version (not the CCS version).

0 Ronaldo Pace over 14 years ago in reply to Archaeologist

Intellectual 875 points

Hi,

Lori thanks for the move.

Archaeologist, I'm kind of just accepting the error and minimizing it with a few leading/trailing zeros in my communication, but I still have mixed feelings about what is happening.
Maybe long doubles are just weird and the current floating-point standard is not very good.

I'm using CCS 5.1.0.201 and I guess it's the "ARM Compiler Tools" version 4.9.1 (is that the one you needed to know? if not, let me know how to find it)
The MCU is a Concerto board and the code is running on the M3 core.
Attached is the compiled .obj and .pp for the following source code:

#include "stdint.h"
#include "utils/ustdlib.h"

char doubleToString_buff[64];

char * doubleToString(long double value, unsigned int dec_places) {
	// long double maximum significant digits are 16
	if (dec_places > 16)
		dec_places = 16;

	int64_t integer_part = (int64_t) value;
	long double fraction_part = value - integer_part;

	char * buff_ptr = &doubleToString_buff[0];

	// %d expects an int, so narrow the 64-bit value explicitly
	unsigned long size = usprintf(buff_ptr, "%d.", (int) integer_part);
	buff_ptr = buff_ptr + size;

	int i;
	for (i = 0; i < dec_places; i++) {
		fraction_part = fraction_part * 10;
		integer_part = (int64_t) fraction_part;
		fraction_part = fraction_part - integer_part;
		// integer_part is a single digit here; cast to match %d
		usprintf(buff_ptr, "%d", (int) integer_part);
		buff_ptr++;
	}

	*buff_ptr = 0;

	return (char *) &doubleToString_buff[0];
}

I'm aware that parts of this code are sub-optimal and not very portable (e.g. I always return the same buffer), but it's meant to be used in-line in 'printf'-style expressions, and my code only calls it at very specific points, so the buffer isn't overwritten before I print it. So please, not too much judgement on this part.

So, specifically for testing before sending you the files, I put the following code in an infinite loop, stepped through it, and checked the values in the Expressions window:

	long double a = 5.1;
	char * ptr;
	ptr = doubleToString(a, 5);

At the end, my ptr had "5.099999", and the error was introduced in the first value assigned to fraction_part (before the for loop).
Interestingly, and this is where my mixed feelings come from, if I manually edit the value in the Expressions window before executing the code and type "5.1", it already shows "5.099999", which means the weird error is already there.

Also, analysing it further, I noticed that some of the errors I've been seeing are introduced by the atof call, and doubleToString is just acting accordingly.

I hope you (the experts) can make some sense of it all.

Thanks for the help.

doubleToString.zip

0 Archaeologist over 14 years ago in reply to Ronaldo Pace

TI__Guru* 84285 points

The value 5.1 cannot be exactly represented in IEEE binary format. Only finite sums of powers of two (including negative powers such as 1/2 and 1/8) can be so represented, and 1/10 cannot. See http://c-faq.com/fp/printfprec.html. Do you still see the problem when you use an exactly representable value such as 5.125?
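
To make the effect concrete, here is a minimal sketch that repeats the digit-by-digit truncation from the routine above on a plain double. It assumes long double has the same format as double on this toolchain (the disassembly above calls the double helpers __aeabi_d2lz and __aeabi_dsub, which suggests so). truncate_digits is a hypothetical name, not part of any TI library, and it only handles non-negative values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints a non-negative value by truncating one
   decimal digit at a time, with no rounding anywhere. */
static char *truncate_digits(double value, unsigned dec_places,
                             char *buf, size_t size)
{
    int64_t integer_part;
    double fraction;
    size_t n;
    unsigned i;

    /* Room for up to 20 integer digits, '.', the decimals and the NUL */
    assert(buf != NULL && size >= (size_t)dec_places + 22);
    assert(value >= 0.0 && value < 9.0e18);

    integer_part = (int64_t)value;            /* truncates toward zero */
    fraction = value - (double)integer_part;  /* 0 <= fraction < 1 */

    snprintf(buf, size, "%lld.", (long long)integer_part);
    n = strlen(buf);

    for (i = 0; i < dec_places; i++) {
        int digit;
        fraction *= 10.0;
        digit = (int)fraction;   /* truncation again: 0.9999... gives 0 */
        fraction -= digit;
        buf[n++] = (char)('0' + digit);
    }
    buf[n] = '\0';
    return buf;
}
```

With IEEE doubles, truncate_digits(5.1, 5, buf, sizeof buf) yields "5.09999" because the stored value is slightly below 5.1, while 5.125 yields "5.12500" because 5.125 = 4 + 1 + 1/8 is exact in binary.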

0 Ronaldo Pace over 14 years ago in reply to Archaeologist

Intellectual 875 points

Yes, it works with 5.125.

I've also tested a bunch of different x.1 numbers and they all show the same problem.

So the weird truncation I've been seeing is due to the inability of IEEE to represent some numbers exactly. Further truncation errors may also occur in the for loop.

It just bothers me a bit that "long double a = 5.1" appears fine in the Expressions window.

Would you suggest a different implementation for a doubleToString() that is not actually decoding it based on the IEEE ?

thanks.

0 Archaeologist over 14 years ago in reply to Ronaldo Pace

TI__Guru* 84285 points

The expressions window is no doubt rounding the number before displaying it.

Ronaldo Pace said:
Would you suggest a different implementation for a doubleToString() that is not actually decoding it based on the IEEE ?

I don't understand what you mean. What do you hope to see for a value like 5.1?

The usual method for creating a string from a double is something like:

sprintf(dst, "%g", 5.1)
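
As a minimal sketch of that approach, assuming a C library whose printf family supports floating point and rounds correctly: printing with few significant digits rounds the stored value back to "5.1", which is presumably what the Expressions window does, while asking for 17 significant digits exposes the value actually stored. The helper prints_as is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if value, printed with sig_digits
   significant digits via "%.*g", matches the expected text. */
static int prints_as(double value, int sig_digits, const char *expected)
{
    char buf[64];
    int n;

    assert(expected != NULL && sig_digits > 0 && sig_digits <= 17);
    n = snprintf(buf, sizeof buf, "%.*g", sig_digits, value);
    if (n < 0 || (size_t)n >= sizeof buf)
        return 0;  /* formatting failed or was truncated */
    return strcmp(buf, expected) == 0;
}
```

For an IEEE double, prints_as(5.1, 6, "5.1") and prints_as(5.1, 17, "5.0999999999999996") both hold; 6 significant digits is what a plain "%g" uses.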

0 Ronaldo Pace over 13 years ago in reply to Archaeologist

Intellectual 875 points

TI did not implement any of the floating-point formatting in their printf/sprintf/usprintf etc.

0 Archaeologist over 13 years ago in reply to Ronaldo Pace

TI__Guru* 84285 points

You are using an add-on library that does not ship with the TI compiler. The run-time support library that does ship with the TI compiler fully supports floating-point operations. The drawback to the standard sprintf is that it is bigger because it supports all the crazy formats the standard requires, which is why some choose to use alternate implementations.

0 Douglas Gwyn over 13 years ago in reply to Archaeologist

Expert 2210 points

I think the actual issue is that almost all real numbers cannot be represented precisely in the finite amount of storage allocated for a floating-point type. Thus, a lot of rounding-to-nearest-representable-value goes on in the course of computation with floating-point. You generally need to avoid depending on absolute precision in your f.p. algorithms.
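
One way to follow that advice in a routine like doubleToString, sketched here under stated assumptions and not as a definitive implementation: round once to the requested number of decimals, then produce the digits with integer arithmetic only, so no floating-point printf support is needed. The name double_to_string_rounded, the limit of 9 decimals and the magnitude limit of 1e9 are choices made for this sketch, not anything from TI's libraries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: round to dec_places first, then emit digits
   using integer arithmetic only. */
static char *double_to_string_rounded(double value, unsigned dec_places,
                                      char *buf, size_t size)
{
    char tmp[32];
    size_t pos = sizeof tmp;
    size_t len;
    uint64_t scale = 1;
    uint64_t scaled;
    double magnitude;
    int negative;
    unsigned i;

    assert(buf != NULL);
    if (dec_places > 9)
        dec_places = 9;                 /* assumed limit for this sketch */
    for (i = 0; i < dec_places; i++)
        scale *= 10;

    negative = value < 0.0;
    magnitude = negative ? -value : value;
    assert(magnitude < 1.0e9);          /* keeps magnitude * scale inside uint64_t */

    /* Round to nearest by adding one half, then truncating */
    scaled = (uint64_t)(magnitude * (double)scale + 0.5);

    /* Fill tmp from the right, least significant digit first */
    tmp[--pos] = '\0';
    for (i = 0; i < dec_places; i++) {
        tmp[--pos] = (char)('0' + (int)(scaled % 10));
        scaled /= 10;
    }
    if (dec_places > 0)
        tmp[--pos] = '.';
    do {
        tmp[--pos] = (char)('0' + (int)(scaled % 10));
        scaled /= 10;
    } while (scaled != 0);
    if (negative)
        tmp[--pos] = '-';

    len = sizeof tmp - pos;             /* includes the terminating NUL */
    assert(size >= len);
    memcpy(buf, tmp + pos, len);
    return buf;
}
```

With this sketch, double_to_string_rounded(5.1, 5, buf, sizeof buf) gives "5.10000", since the tiny representation error is absorbed by the single rounding step instead of being exposed by repeated truncation. The stored value is still not exactly 5.1; the routine only decides how to display it.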
