TMS320F28075: C28x FPU: Why no F32TOI32R instruction?

fmdhr

Part Number: TMS320F28075

Hi,

triggered by a recent discussion (e2e.ti.com/.../713771), I wonder why there is no F32TOI32R instruction (and an intrinsic long __f32toi32r(double src))?

This would remove the nonlinearity around zero that the F32TOI32 instruction has: both positive and negative floats are rounded (truncated) towards zero.
Something like float f; long l; l = (int32_t)(f >= 0.0 ? f + 0.5 : f - 0.5); does the job, and the F32TOI16R instruction rounds this way too, but only for 16-bit values.
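To make the gap around zero concrete, here is a minimal sketch in portable C that contrasts the plain cast with the add-0.5 expression quoted above. The function names are made up for this illustration and are not TI intrinsics.

```c
#include <stdint.h>

/* Plain cast: truncates toward zero, so every input in (-1, 1) becomes 0.
   The "stair step" at zero is therefore twice as wide as all the others. */
static int32_t f32_to_i32_trunc(float f)
{
    return (int32_t)f;
}

/* Add +/-0.5 before the cast: rounds to nearest, with exact halves going
   away from zero. All stair steps then have the same width. */
static int32_t f32_to_i32_round(float f)
{
    return (int32_t)(f >= 0.0f ? f + 0.5f : f - 0.5f);
}
```

With these, 0.75 and -0.75 both truncate to 0, while the rounding version returns 1 and -1.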

An idea for a next release?

See SPRU514P, Table 7-7. C/C++ Compiler Intrinsics for FPU.

Regards,

Frank

over 7 years ago

0 Richard Poley over 7 years ago

TI__Mastermind 27200 points

Frank,

Thank you for the suggestion. I see the value.

The FPU instruction set isn't going to be changed any time soon so if this is to be implemented it will have to be by intrinsic. I will ensure your suggestion is passed along to the compiler group. If I learn more I will post here.

Regards,

Richard

0 EK over 7 years ago

Expert 2520 points

Hi,

Instead of (f >= 0.0 ? f + 0.5 : f - 0.5) you should use the branchless ((f + 0x0.Cp24f) - 0x0.Cp24f). It adds 2^23, then subtracts 2^23, and it follows the current FP rounding mode. Since that mode is round to nearest even, it rounds like your expression, except that all halfway values go to even integers: +-0.5 to 0.0, +-1.5 to +-2.0, +-2.5 to +-2.0, +-3.5 to +-4.0 and so on.

Sorry, the +2^23 -2^23 trick is only valid for positive numbers. For negative numbers it is necessary to subtract 2^23 first, then add. So a branch is still required.

Regards,
Edward

0 fmdhr over 7 years ago in reply to EK

Expert 1530 points

Edward,

I received a mail with your first reply, but no mail with the correction added later. Maybe the forum admins want to check that.

I tested your suggestion and got the results below. In my application it is not necessary to keep to the "round x.5 towards even" rule, so I think it is a good solution for my problem.

I did not know the "0x0.Cp24f" syntax, and I did not find it documented. Is it general C or TI-specific syntax? Where can I read more about it?

Thanks & regards,

Frank

static inline
int32_t f32toi32r(float32_t f)
{
    return (int32_t)((f + 0x0.Cp24f) - 0x0.Cp24f);
}

floats fr and f32toi32r results ir:

fr
-4.0   -3.875   -3.75   -3.625   -3.5   -3.375   -3.25   -3.125
-3.0   -2.875   -2.75   -2.625   -2.5   -2.375   -2.25   -2.125
-2.0   -1.875   -1.75   -1.625   -1.5   -1.375   -1.25   -1.125
-1.0   -0.875   -0.75   -0.625   -0.5   -0.375   -0.25   -0.125
0.0   0.125   0.25   0.375   0.5   0.625   0.75   0.875
1.0   1.125   1.25   1.375   1.5   1.625   1.75   1.875
2.0   2.125   2.25   2.375   2.5   2.625   2.75   2.875
3.0   3.125   3.25   3.375   3.5   3.625   3.75   3.875

ir
-4   -4   -4   -4   -4   -3   -3   -3
-3   -3   -3   -3   -2   -2   -2   -2
-2   -2   -2   -2   -2   -1   -1   -1
-1   -1   -1   -1   0   0   0   0
0   0   0   0   0   1   1   1
1   1   1   1   2   2   2   2
2   2   2   2   2   3   3   3
3   3   3   3   4   4   4   4

fr
-3.9375   -3.8125   -3.6875   -3.5625   -3.4375   -3.3125   -3.1875   -3.0625
-2.9375   -2.8125   -2.6875   -2.5625   -2.4375   -2.3125   -2.1875   -2.0625
-1.9375   -1.8125   -1.6875   -1.5625   -1.4375   -1.3125   -1.1875   -1.0625
-0.9375   -0.8125   -0.6875   -0.5625   -0.4375   -0.3125   -0.1875   -0.0625
0.0625   0.1875   0.3125   0.4375   0.5625   0.6875   0.8125   0.9375
1.0625   1.1875   1.3125   1.4375   1.5625   1.6875   1.8125   1.9375
2.0625   2.1875   2.3125   2.4375   2.5625   2.6875   2.8125   2.9375
3.0625   3.1875   3.3125   3.4375   3.5625   3.6875   3.8125   3.9375

ir
-4   -4   -4   -4   -3   -3   -3   -3
-3   -3   -3   -3   -2   -2   -2   -2
-2   -2   -2   -2   -1   -1   -1   -1
-1   -1   -1   -1   0   0   0   0
0   0   0   0   1   1   1   1
1   1   1   1   2   2   2   2
2   2   2   2   3   3   3   3
3   3   3   3   4   4   4   4

nonlinearity with ir = (int32_t)fr;

ir
-3   -3   -3   -3   -3   -3   -3   -3
-2   -2   -2   -2   -2   -2   -2   -2
-1   -1   -1   -1   -1   -1   -1   -1
0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0
1   1   1   1   1   1   1   1
2   2   2   2   2   2   2   2
3   3   3   3   3   3   3   3

0 EK over 7 years ago in reply to fmdhr

Expert 2520 points

Hi,

Sorry again, this time to undo my correction. + 0x0.Cp24f then - 0x0.Cp24f is correct. What was not correct is my description: 0x0.Cp24f is not 2^23 but 1.5 * 2^23. This is where my doubt came from.
Adding this number (12582912) to either a positive or a negative value completely cancels the fractional part; subtracting it back from the sum gives the correctly rounded single-precision number.
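A minimal host-side sketch of why this works, assuming IEEE-754 single precision and the default round-to-nearest-even mode. The name f32_to_i32_magic is made up here. Between 2^23 and 2^24 adjacent floats are exactly 1.0 apart, so the sum f + 1.5 * 2^23 has no room for a fraction and the hardware rounds it to an integer. That only holds while the sum stays in that interval, i.e. for roughly |f| < 2^22.

```c
#include <stdint.h>

/* 1.5 * 2^23. For inputs with |f| < 2^22 the sum f + ROUND_MAGIC lies in
   [2^23, 2^24), where neighbouring floats differ by exactly 1.0. */
#define ROUND_MAGIC 12582912.0f

static int32_t f32_to_i32_magic(float f)
{
    /* volatile forces the sum to be stored as a 32-bit float, so the
       rounding step really happens even on hosts that might otherwise
       keep extra precision in registers. */
    volatile float t = f + ROUND_MAGIC;

    /* The subtraction is exact; the cast then sees an integral value. */
    return (int32_t)(t - ROUND_MAGIC);
}
```

Exact halves go to the even neighbour (0.5 gives 0, 1.5 and 2.5 both give 2), which matches the result tables posted earlier in this thread.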

The 0x0.Cp24f syntax is relatively new; old compilers don't support it. It allows you to specify the FP mantissa in hex.

0x - prefix
0.C or 0.C000000 - hex fraction. Take it as 0xC / 0x10 = 12/16 = 0.75

p24 - 24 here is multiplier 2^24. 24 is decimal, not hex.

0.75 * 2^24 = 12582912
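On the question of whether this is general C: hexadecimal floating constants are part of standard C since C99, so any C99-capable compiler should accept them. A small sketch that spells the same value three ways (the variable names are only for illustration):

```c
/* C99 hexadecimal floating constants: hex mantissa, then 'p' and a
   decimal power-of-two exponent, then an optional 'f' suffix for float. */
static const float magic_hex  = 0x0.Cp24f;   /* 0.75 * 2^24 */
static const float magic_norm = 0x1.8p23f;   /* 1.5  * 2^23, same value */
static const float magic_dec  = 12582912.0f; /* plain decimal spelling */
```

All three constants compare equal, so the choice between them is purely a matter of readability.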

Edward

0 fmdhr over 7 years ago in reply to EK

Expert 1530 points

Edward,

thank you for your explanations. The 0x0.Cp24f syntax is a bit strange, so I finally used

static inline
int32_t f32toi32r(float32_t f)
{
    return (int32_t)((f + 12582912.0) - 12582912.0);
}
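One portability caveat with the decimal spelling: 12582912.0 without the f suffix is a double constant. As far as I understand, that is harmless where double is a 32-bit type (which I believe is the case for the C28x COFF ABI used here), but where double is 64 bits the sum is computed exactly, nothing gets rounded, and the cast simply truncates. A host-side sketch of the difference, with hypothetical function names:

```c
#include <stdint.h>

/* Float constants: the sum is rounded to single precision, so the
   result is rounded to the nearest integer. */
static int32_t round_float_const(float f)
{
    volatile float t = f + 12582912.0f;
    return (int32_t)(t - 12582912.0f);
}

/* Double constants: with a 64-bit double the sum keeps the fraction,
   the subtraction returns f unchanged, and the cast truncates. */
static int32_t round_double_const(float f)
{
    volatile double t = f + 12582912.0;
    return (int32_t)(t - 12582912.0);
}
```

On a host with 64-bit double, the first function maps 0.75 to 1 and the second maps it to 0, so writing the constant with an f suffix is the safer habit if the code may ever be built elsewhere.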

Initially, I had a different approach using a union, but this has its limitations; see the details below. It generated fine code for the CLA, but not for the CPU/FPU (the function was not inlined for some reason).

So your solution is definitely the better one.

Thank you again,
Frank

PS: Maybe you also have a better idea for a static inline float fsign(float f) { return f >= 0.0 ? 1.0 : -1.0; } ?

/* define macros to hide the strange syntax of signf() , halfsignf() usage */
#define fsign(x)        (signf(&(x)).f)
#define halffsign(x)    (halfsignf(&(x)).f)   /* 0.5 * fsign(x) */

/*
A fast version of sign(f) = f >= 0.0 ? 1.0 : -1.0; that does not need the
slow branches in the cla. Unfortunately, the argument must be an addressable
variable.
*/

union fi {
    float32_t   f;
    int32_t     i;
};

static inline
union fi signf(float32_t *f)
{
    register union fi   r;

    /* result = 1.0 | sign_bit(f) */
    r.i = 0x3F800000 | ((*(int32_t *)f) & 0x80000000);
    return r;
}

static inline
union fi halfsignf(float32_t *f)
{
    register union fi   r;

    /* result = 0.5 | sign_bit(f) */
    r.i = 0x3F000000 | ((*(int32_t *)f) & 0x80000000);
    return r;
}

static inline
int32_t f32toi32r_(float32_t f)
{
    return (int32_t)(f + halffsign(f));
}

0 EK over 7 years ago in reply to fmdhr

Expert 2520 points

Hi,

What you are doing in signf() can be macroed in TI compiler:

__u32_bits_as_f32((__f32_bits_as_u32(f) & 0x80000000) | 0x3F800000)

These special __u32xx / __f32xx intrinsics help treating u32 as float and float as u32 respectively, without the need to define any variables.

But I'm not sure what's faster, transfer from FPU register to ACC, bitwise and, bitwise or/add, and transfer back to FPU register, or indeed branchy code? FPU has conditional NEGF instruction. So if we load one extra FPU register with 1.0, compare signf() argument against zero and negate our 1.0 to -1.0 on LT, it should be faster, I think. Unfortunately I don't know how to persuade compiler to do it like I want, I see branches in compiled code.
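For comparison, here is a sketch of the same sign-bit idea in portable C, for compilers where the __f32_bits_as_u32 / __u32_bits_as_f32 intrinsics are not available. It assumes float is an IEEE-754 binary32 of the same size as uint32_t, and the name fsign_bits is made up for this example. It says nothing about cycle counts on the C28x or CLA.

```c
#include <stdint.h>
#include <string.h>

/* Returns +1.0f or -1.0f carrying the sign bit of f. memcpy moves the
   bit pattern without pointer casts, so the argument does not have to
   be an addressable variable at the call site. */
static float fsign_bits(float f)
{
    uint32_t u;

    memcpy(&u, &f, sizeof u);             /* float bits -> integer */
    u = (u & 0x80000000u) | 0x3F800000u;  /* keep sign, force 1.0 */
    memcpy(&f, &u, sizeof f);             /* integer -> float bits */
    return f;
}
```

Note one difference from f >= 0.0 ? 1.0 : -1.0: the bit version returns -1.0 for negative zero, because it looks only at the sign bit. The standard copysignf(1.0f, f) from <math.h> behaves the same way.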

Edward

0 fmdhr over 7 years ago in reply to EK

Expert 1530 points

Edward,

Again, thank you for the information.

I did not find the __u32_bits_as_f32 and __f32_bits_as_u32 intrinsics documented, neither in SPRUEO2B (FPU), nor in SPRU514P (C28x Compiler), nor in SPRU430F (C28x CPU in), nor in SPRU513P (C28 Assembly Tools). Can anyone from TI please help here? Or is this "Herrschaftswissen" (German for knowledge kept within an inner circle)?

Unfortunately, these intrinsics do not work with the CLA compiler.

Below is a code snippet that tests several versions. It includes an assembler-coded function that uses the NEGF instruction as suggested. It seems to do what we want, but I'm not sure whether all conditions (pipeline etc.) are met. It also seems to be the fastest way, with 4 + 4 cycles overhead for call/return (although the call may hinder other optimizations). The & 0x80000000 approach looks good with memory operands, but bad in expressions due to the 4 cycles pipeline delay for ACC <> RnH transfers.

Thanks & regards

Frank

static inline
float32_t   fsign(float32_t f)
{
    return __u32_bits_as_f32((__f32_bits_as_u32(f) & 0x80000000) | 0x3F800000);
}

volatile    float32_t   f1, f2, f3;

extern float32_t   fsignf(float32_t f);

/* r0h = r0h < 0.0 ? -1.0 : 1.0 */
asm("           .global     _fsignf         ");
asm("_fsignf:   movizf32    r1h, #1.0       ");
asm("           cmpf32      r0h, #0.0       ");
asm("           negf32      r0h, r1h, lt    ");
asm("           lretr                       ");

void
main(void)
{

    f1 = 2.0;
    f2 = fsign(f1);
    f1 = -2.0;
    f3 = f1 * fsign(f1 + f2);
    f3 = fsignf(f1);
    f3 = f1 >= 0.0 ? 1.0 : -1.0;
    f3 = fsignf(-f1);
}

Using

"C:/ti/ccsv7/tools/compiler/ti-cgt-c2000_16.9.1.LTS/bin/cl2000" -v28 -ml -mt --vcu_support=vcu2 --cla_support=cla1 --float_support=fpu32 --tmu_support=tmu0 -O2 -g --define=CPU1

compiles to:

015155:   761F0487    MOVW         DP, #0x487
88         f1 = 2.0;
015157:   E8020000    MOVIZ        R0, #0x4000
015159:   E2030016    MOV32        @0x16, R0H
61     {
01515b:   0616        MOVL         ACC, @0x16
90         f1 = -2.0;
01515c:   E8060000    MOVIZ        R0, #0xc000
89         f2 = fsign(f1);
01515e:   18A88000    AND          @AH, #0x8000
015160:   9000        ANDB         AL, #0x0
015161:   1AA83F80    OR           @AH, #0x3f80
015163:   1E14        MOVL         @0x14, ACC
90         f1 = -2.0;
015164:   E2030016    MOV32        @0x16, R0H
61     {
015166:   E2AF0016    MOV32        R0H, @0x16, UNCF
015168:   E2AF0114    MOV32        R1H, @0x14, UNCF
01516a:   E7100040    ADDF32       R0H, R0H, R1H
01516c:   7700        NOP
01516d:   7700        NOP
01516e:   7700        NOP
91         f3 = f1 * fsign(f1 + f2);
01516f:   BFA90F12    MOV32        @ACC, R0H
015171:   18A88000    AND          @AH, #0x8000
015173:   9000        ANDB         AL, #0x0
015174:   1AA83F80    OR           @AH, #0x3f80
015176:   BDA90F12    MOV32        R0H, @ACC
015178:   7700        NOP
015179:   7700        NOP
01517a:   7700        NOP
01517b:   E2AF0116    MOV32        R1H, @0x16, UNCF
01517d:   E7000008    MPYF32       R0H, R1H, R0H
01517f:   7700        NOP
015180:   E2030018    MOV32        @0x18, R0H
92         f3 = fsignf(f1);
015182:   E2AF0016    MOV32        R0H, @0x16, UNCF
015184:   7641514B    LCR          fsignf
015186:   761F0487    MOVW         DP, #0x487
015188:   E2030018    MOV32        @0x18, R0H
93         f3 = f1 >= 0.0 ? 1.0 : -1.0;
01518a:   E2AF0016    MOV32        R0H, @0x16, UNCF
01518c:   E5A0        CMPF32       R0H, #0.0
01518d:   AD14        MOVST0       NF,ZF
01518e:   6304        SB           C$L1, GEQ
01518f:   E805FC00    MOVIZ        R0, #0xbf80
015191:   6F03        SB           C$L2, UNC
015192:   E801FC00    MOVIZ        R0, #0x3f80
015194:   E2030018    MOV32        @0x18, R0H
94         f3 = fsignf(-f1);
015196:   E2AF0016    MOV32        R0H, @0x16, UNCF
015198:   E6AF0000    NEGF32       R0H, R0H, UNCF
01519a:   7641514B    LCR          fsignf
01519c:   761F0487    MOVW         DP, #0x487
01519e:   E2030018    MOV32        @0x18, R0H

0 EK over 7 years ago in reply to fmdhr

Expert 2520 points

Hi,

Yes, I don't see these useful intrinsics documented in the TI PDFs either. I think I found them here:
e2e.ti.com/.../632168

Regarding using them in the CLA: yes, unfortunately it looks like they are not implemented there.

Regards,
Edward

0 Richard Poley over 7 years ago in reply to EK

TI__Mastermind 27200 points

Frank,

Earlier you posted these results:

fr
-4.0 -3.875 -3.75 -3.625 -3.5 -3.375 -3.25 -3.125
-3.0 -2.875 -2.75 -2.625 -2.5 -2.375 -2.25 -2.125
-2.0 -1.875 -1.75 -1.625 -1.5 -1.375 -1.25 -1.125
-1.0 -0.875 -0.75 -0.625 -0.5 -0.375 -0.25 -0.125
0.0 0.125 0.25 0.375 0.5 0.625 0.75 0.875
1.0 1.125 1.25 1.375 1.5 1.625 1.75 1.875
2.0 2.125 2.25 2.375 2.5 2.625 2.75 2.875
3.0 3.125 3.25 3.375 3.5 3.625 3.75 3.875

ir
-4 -4 -4 -4 -4 -3 -3 -3
-3 -3 -3 -3 -2 -2 -2 -2
-2 -2 -2 -2 -2 -1 -1 -1
-1 -1 -1 -1 0 0 0 0
0 0 0 0 0 1 1 1
1 1 1 1 2 2 2 2
2 2 2 2 2 3 3 3
3 3 3 3 4 4 4 4

fr
-3.9375 -3.8125 -3.6875 -3.5625 -3.4375 -3.3125 -3.1875 -3.0625
-2.9375 -2.8125 -2.6875 -2.5625 -2.4375 -2.3125 -2.1875 -2.0625
-1.9375 -1.8125 -1.6875 -1.5625 -1.4375 -1.3125 -1.1875 -1.0625
-0.9375 -0.8125 -0.6875 -0.5625 -0.4375 -0.3125 -0.1875 -0.0625
0.0625 0.1875 0.3125 0.4375 0.5625 0.6875 0.8125 0.9375
1.0625 1.1875 1.3125 1.4375 1.5625 1.6875 1.8125 1.9375
2.0625 2.1875 2.3125 2.4375 2.5625 2.6875 2.8125 2.9375
3.0625 3.1875 3.3125 3.4375 3.5625 3.6875 3.8125 3.9375

ir
-4 -4 -4 -4 -3 -3 -3 -3
-3 -3 -3 -3 -2 -2 -2 -2
-2 -2 -2 -2 -1 -1 -1 -1
-1 -1 -1 -1 0 0 0 0
0 0 0 0 1 1 1 1
1 1 1 1 2 2 2 2
2 2 2 2 3 3 3 3
3 3 3 3 4 4 4 4

I just want to double-check: is this exactly the behaviour you are expecting from an F32TOI32R instruction?

Regards,

Richard

0 fmdhr over 7 years ago in reply to Richard Poley

Expert 1530 points

Richard,

I think I can summarize as follows: an F32TOI32R should do what F32TOI16R does for 16-bit integers, but for 32-bit integers.

I ran my test program again and included the __f32toi16r intrinsic. As discussed with Edward, the treatment of the x.5 case differs: F32TOI16R and Edward's add/subtract of 1.5*2^23 round to the nearest even integer, whereas the branch ((int32_t)(fr[i] < 0.0 ? fr[i] - 0.5 : fr[i] + 0.5);) and my f + halffsign(f) approach do not.

What may be essential for mathematicians and in other applications does not matter in my control application. Since there is always some noise, it is not important where x.5 is rounded to. I just want a "straight-lined staircase" with evenly spaced steps (without the missing step at zero that F32TOI32 has).

Thanks & regards,

Frank

Here is the file; I think you can build a complete test case from it.

testfsign.c

/******************************************************************************/
/*                      testfsign.h - sign / round functions                  */
/******************************************************************************/


typedef long    int32_t;
typedef float   float32_t;


/******************************************************************************/
/*                               enums, defines                               */
/******************************************************************************/


/* define macros to hide the strange syntax of signf() , halfsignf() usage */
#define fsign(x)        (signf(&(x)).f)
#define halffsign(x)    (halfsignf(&(x)).f)   /* 0.5 * fsign(x) */


/******************************************************************************/
/*                            function definitions                            */
/******************************************************************************/


/*
A fast version of sign(f) = f >= 0.0 ? 1.0 : -1.0; that does not need the
slow branches in the cla. Unfortunately, the argument must be an addressable
variable.
*/

union fi {
    float32_t   f;
    int32_t     i;
};


static inline
union fi signf(float32_t *f)
{
    register union fi   r;

    /* result = 1.0 | sign_bit(f) */
    r.i = 0x3F800000 | ((*(int32_t *)f) & 0x80000000);
    return r;
}

#if 0
static inline
float32_t   fsign(float32_t f)
{
    return __u32_bits_as_f32((__f32_bits_as_u32(f) & 0x80000000) | 0x3F800000);
}
#endif
/*
halfsignf(x) = 0.5 * fsign(x)
Good for rounding of floats to integers to avoid the gap around 0.0,
see below.
*/

static inline
union fi halfsignf(float32_t *f)
{
    register union fi   r;

    /* result = 0.5 | sign_bit(f) */
    r.i = 0x3F000000 | ((*(int32_t *)f) & 0x80000000);
    return r;
}

static inline
int32_t f32toi32r_(float32_t f)
{
    return (int32_t)(f + halffsign(f));
}


/* https://e2e.ti.com/support/microcontrollers/c2000/f/171/t/719504 */

static inline
int32_t f32toi32r(float32_t f)
{
    // return (int32_t)((f + 0x0.Cp24f) - 0x0.Cp24f);
    /* round adding/subtracting 1.5 * 2^23 */
    return (int32_t)((f + 12582912.0) - 12582912.0);
}


#if TEST_ROUND || 1 && !defined(__TMS320C28XX_CLA__)


#if 0
/* not better */
float32_t   onedotfive2e23 = 12582912.0;

static inline
int32_t f32toi32r(float32_t f)
{
    // return (int32_t)((f + 0x0.Cp24f) - 0x0.Cp24f);
    /* round adding/subtracting 1.5 * 2^23 */
    return (int32_t)((f + onedotfive2e23) - onedotfive2e23);
}
#endif


/* demonstrate several methods of rounding floats to integers */

#define NROUND  64  /* # of values to round */
#define RROUND  4.0 /* round -range ... range */
#define EROUND  0.0 /* epsilon added to values, e.g. 0.0625 */

float32_t   fr[NROUND];
int32_t     ir[NROUND];

void
testround()
{
    int i;  /* plain int: this file typedefs only int32_t and float32_t */

    for ( i = 0; i < NROUND; ++i ) {
        fr[i] = EROUND - RROUND +
                (float32_t)(2 * i) * RROUND / (float32_t)NROUND;
    }
    /*
    fr =
    -2.0  -1.75 -1.5  -1.25 -1.0  -0.75 -0.5  -0.25
     0.0   0.25  0.5   0.75  1.0   1.25  1.5   1.75
    */
    for ( i = 0; i < NROUND; ++i ) {
        /* round with gap around 0 */
        ir[i] = (int32_t)fr[i];
    }
    /* ir = -2  -1  -1  -1  -1   0   0   0   0   0   0   0   1   1   1   1 */
    for ( i = 0; i < NROUND; ++i ) {
        /* add of 0.5 only shifts the gap */
        ir[i] = (int32_t)(fr[i] + 0.5);
    }
    /* ir = -1  -1  -1   0   0   0   0   0   0   0   1   1   1   1   2   2 */
    for ( i = 0; i < NROUND; ++i ) {
        /* close the gap adding 0.5 * fsign(x) */
        //ir[i] = (int32_t)(fr[i] + halffsign(fr[i]));
        ir[i] = (int32_t)(fr[i] < 0.0 ? fr[i] - 0.5 : fr[i] + 0.5);
    }
    /* ir = -2  -2  -2  -1  -1  -1  -1   0   0   0   1   1   1   1   2   2 */
    for ( i = 0; i < NROUND; ++i ) {
        /* close the gap adding 0.5 * fsign(x) */
        ir[i] = (int32_t)(fr[i] + halffsign(fr[i]));
        //ir[i] = f32toi32r_(fr[i]);
    }
    /* ir = -2  -2  -2  -1  -1  -1  -1   0   0   0   1   1   1   1   2   2 */
    for ( i = 0; i < NROUND; ++i ) {
        ir[i] = (int32_t)__f32toi16r(fr[i]);
    }
    /* ir = -2  -2  -2  -1  -1  -1   0   0   0   0   0   1   1   1   2   2 */
    for ( i = 0; i < NROUND; ++i ) {
        ir[i] = f32toi32r(fr[i]);
    }
    /* ir = -2  -2  -2  -1  -1  -1   0   0   0   0   0   1   1   1   2   2 */
}

#endif  /* TEST_ROUND */
