TMS320F28075: C28x FPU: Why no F32TOI32R instruction?

Part Number: TMS320F28075

Hi,

Triggered by a recent discussion (e2e.ti.com/.../713771), I wonder why there is no F32TOI32R instruction (and a matching intrinsic, long __f32toi32r(double src))?

This would remove the nonlinearity around zero that the F32TOI32 instruction has, because it truncates both positive and negative floats towards zero.
Something like float f; long l; l = (int32_t)(f >= 0.0 ? f + 0.5 : f - 0.5); does it, and the F32TOI16R instruction does it too, but only for 16-bit values.
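
The gap around zero can be reproduced with a small host-side sketch. The helper names trunc_to_i32 and round_half_away_i32 are made up for illustration, and the plain cast is only assumed to match F32TOI32 in that both truncate towards zero:

```c
#include <stdint.h>

/* Plain cast: C truncates towards zero, so every value in (-1, 1) maps to 0. */
static int32_t trunc_to_i32(float f)
{
    return (int32_t)f;
}

/* The add/subtract 0.5 expression from the post: rounds halves away from zero. */
static int32_t round_half_away_i32(float f)
{
    return (int32_t)(f >= 0.0f ? f + 0.5f : f - 0.5f);
}
```

With truncation, the step at zero covers the whole interval (-1, 1) and is therefore twice as wide as every other step; with the rounding expression, all steps are one unit wide.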

An idea for a next release?

See SPRU514P, Table 7-7. C/C++ Compiler Intrinsics for FPU.

Regards,

Frank

  • Frank,

    Thank you for the suggestion. I see the value.

    The FPU instruction set isn't going to be changed any time soon so if this is to be implemented it will have to be by intrinsic. I will ensure your suggestion is passed along to the compiler group. If I learn more I will post here.

    Regards,

    Richard
  • Hi,

    Instead of (f >= 0.0 ? f + 0.5 : f - 0.5) you should use the branchless ((f + 0x0.Cp24f) - 0x0.Cp24f). It adds 2^23, then subtracts 2^23, and it follows the current FP rounding mode. Since that mode is round to nearest even, it rounds like your expression, except that all halfway values go to even integers: +-0.5 to 0.0, +-1.5 to +-2.0, +-2.5 to +-2.0, +-3.5 to +-4.0 and so on.

    Sorry, the +2^23 -2^23 trick is only valid for positive numbers. For negative numbers it is necessary to subtract 2^23 first, then add, so a branch is still required.


    Regards,
    Edward

  • Edward,

    I received a mail with your first reply, but no mail with the correction added later. Maybe the forum admins want to check that.

    I tested your suggestion and got the results below. In my application, it is not necessary to hold the "round x.5 towards even" rule, so I think it is a good solution for my problem.

    I did not know the "0x0.Cp24f" syntax, and I did not find it documented. Is it general C or a TI-specific syntax? Where can I read more about it?

    Thanks & regards,

    Frank

    static inline
    int32_t f32toi32r(float32_t f)
    {
        return (int32_t)((f + 0x0.Cp24f) - 0x0.Cp24f);
    }

    floats fr and f32toi32r results ir:

    fr
    -4.0    -3.875    -3.75    -3.625    -3.5    -3.375    -3.25    -3.125
    -3.0    -2.875    -2.75    -2.625    -2.5    -2.375    -2.25    -2.125
    -2.0    -1.875    -1.75    -1.625    -1.5    -1.375    -1.25    -1.125
    -1.0    -0.875    -0.75    -0.625    -0.5    -0.375    -0.25    -0.125
    0.0    0.125    0.25    0.375    0.5    0.625    0.75    0.875
    1.0    1.125    1.25    1.375    1.5    1.625    1.75    1.875
    2.0    2.125    2.25    2.375    2.5    2.625    2.75    2.875
    3.0    3.125    3.25    3.375    3.5    3.625    3.75    3.875

    ir
    -4    -4    -4    -4    -4    -3    -3    -3
    -3    -3    -3    -3    -2    -2    -2    -2
    -2    -2    -2    -2    -2    -1    -1    -1
    -1    -1    -1    -1    0    0    0    0
    0    0    0    0    0    1    1    1
    1    1    1    1    2    2    2    2
    2    2    2    2    2    3    3    3
    3    3    3    3    4    4    4    4


    fr
    -3.9375    -3.8125    -3.6875    -3.5625    -3.4375    -3.3125    -3.1875    -3.0625
    -2.9375    -2.8125    -2.6875    -2.5625    -2.4375    -2.3125    -2.1875    -2.0625
    -1.9375    -1.8125    -1.6875    -1.5625    -1.4375    -1.3125    -1.1875    -1.0625
    -0.9375    -0.8125    -0.6875    -0.5625    -0.4375    -0.3125    -0.1875    -0.0625
    0.0625    0.1875    0.3125    0.4375    0.5625    0.6875    0.8125    0.9375
    1.0625    1.1875    1.3125    1.4375    1.5625    1.6875    1.8125    1.9375
    2.0625    2.1875    2.3125    2.4375    2.5625    2.6875    2.8125    2.9375
    3.0625    3.1875    3.3125    3.4375    3.5625    3.6875    3.8125    3.9375


    ir
    -4    -4    -4    -4    -3    -3    -3    -3
    -3    -3    -3    -3    -2    -2    -2    -2
    -2    -2    -2    -2    -1    -1    -1    -1
    -1    -1    -1    -1    0    0    0    0
    0    0    0    0    1    1    1    1
    1    1    1    1    2    2    2    2
    2    2    2    2    3    3    3    3
    3    3    3    3    4    4    4    4


    nonlinearity with ir = (int32_t)fr;

    ir
    -3    -3    -3    -3    -3    -3    -3    -3
    -2    -2    -2    -2    -2    -2    -2    -2
    -1    -1    -1    -1    -1    -1    -1    -1
    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0
    1    1    1    1    1    1    1    1
    2    2    2    2    2    2    2    2
    3    3    3    3    3    3    3    3

  • Hi,

    Sorry again, this time to undo my fix. + 0x0.Cp24f then - 0x0.Cp24f is correct. What was not correct is that 0x0.Cp24f is not 2^23 but 1.5 * 2^23. This is where my doubt came from.
    Adding this number (12582912) to either a positive or a negative value completely cancels the fractional part, and subtracting it back from the sum gives a correctly rounded single-precision number.
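
A minimal sketch of that add/subtract step follows, assuming the default round-to-nearest-even mode and inputs with |f| <= 2^22, so that the intermediate sum stays in [2^23, 2^24] where adjacent floats are exactly 1 apart. The names ROUND_MAGIC and round_magic_i32 are hypothetical, and the volatile is only there to force the sum into a real 32-bit float on hosts that might keep extra precision:

```c
#include <assert.h>
#include <stdint.h>

#define ROUND_MAGIC 12582912.0f   /* 1.5 * 2^23 */

static int32_t round_magic_i32(float f)
{
    /* Assumed valid range: outside it the sum leaves [2^23, 2^24]. */
    assert(f >= -4194304.0f && f <= 4194304.0f);   /* |f| <= 2^22 */

    /* The fraction of f is rounded away here, in the FP rounding mode. */
    volatile float sum = f + ROUND_MAGIC;

    /* Subtracting the constant is exact and leaves an integral float. */
    return (int32_t)(sum - ROUND_MAGIC);
}
```

The constant carries an f suffix on purpose: on a toolchain where double is wider than float, an unsuffixed 12582912.0 would promote the addition to double, the fraction would survive, and the final cast would truncate again.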

    The 0x0.Cp24f syntax is relatively new; old compilers don't support it. It lets you specify the FP mantissa in hex.

    0x - prefix
    0.C or 0.C000000 - hex fraction. Take it as 0xC / 0x10 = 12/16 = 0.75

    p24 - binary exponent: multiply by 2^24. The 24 is decimal, not hex.

    0.75 * 2^24 = 12582912
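
For reference, this notation is the hexadecimal floating constant that standard C has had since C99, so it is not TI specific. The breakdown above can be checked with a sketch like the following; the three function names are made up for illustration:

```c
/* Value of a hex floating constant = (hex mantissa) * 2^(decimal exponent). */

static float magic_from_hex(void)
{
    return 0x0.Cp24f;      /* 0.75 * 2^24 */
}

static float magic_normalized(void)
{
    return 0x1.8p23f;      /* 1.5 * 2^23, the same value written differently */
}

static float magic_from_decimal(void)
{
    return 12582912.0f;    /* plain decimal spelling */
}
```

All three return the same float, so the decimal spelling can be used on compilers that reject the hex form.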

    Edward
  • Edward,

    thank you for your explanations. The 0x0.Cp24f syntax is a bit strange, so I finally used

    static inline
    int32_t f32toi32r(float32_t f)
    {
        return (int32_t)((f + 12582912.0) - 12582912.0);
    }

    Initially, I had a different approach using a union, but it has its limitations; see the details below. It generated fine code for the CLA, but not for the CPU/FPU (the function was not inlined for some reason).

    So your solution definitely is the better one.

    Thank you again,
    Frank

    PS: Maybe you also have a better idea for a static inline float fsign(float f) { return f >= 0.0 ? 1.0 : -1.0; } ?



    /* define macros to hide the strange syntax of signf() , halfsignf() usage */
    #define fsign(x) (signf(&(x)).f)
    #define halffsign(x) (halfsignf(&(x)).f) /* 0.5 * fsign(x) */

    /*
    A fast version of sign(f) = f >= 0.0 ? 1.0 : -1.0; that does not need the
    slow branches in the cla. Unfortunately, the argument must be an addressable
    variable.
    */

    union fi {
        float32_t f;
        int32_t i;
    };


    static inline
    union fi signf(float32_t *f)
    {
        register union fi r;

        /* result = 1.0 | sign_bit(f) */
        r.i = 0x3F800000 | ((*(int32_t *)f) & 0x80000000);
        return r;
    }

    static inline
    union fi halfsignf(float32_t *f)
    {
        register union fi r;

        /* result = 0.5 | sign_bit(f) */
        r.i = 0x3F000000 | ((*(int32_t *)f) & 0x80000000);
        return r;
    }

    static inline
    int32_t f32toi32r_(float32_t f)
    {
        return (int32_t)(f + halffsign(f));
    }
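
The same sign-bit idea can be sketched in portable C without the pointer cast by copying the bits with memcpy, which also removes the need for an addressable argument. The names fsign_bits and halffsign_bits are hypothetical, and whether the C28x or CLA compilers turn the memcpy calls into plain register moves has not been checked here:

```c
#include <stdint.h>
#include <string.h>

/* Returns +1.0f or -1.0f carrying the sign bit of f (IEEE-754 single assumed). */
static float fsign_bits(float f)
{
    uint32_t u;
    float r;

    memcpy(&u, &f, sizeof u);              /* reinterpret float bits as integer */
    u = (u & 0x80000000u) | 0x3F800000u;   /* sign_bit(f) | bits of 1.0f */
    memcpy(&r, &u, sizeof r);
    return r;
}

/* Same idea with the bit pattern of 0.5f, i.e. 0.5 * fsign_bits(f). */
static float halffsign_bits(float f)
{
    uint32_t u;
    float r;

    memcpy(&u, &f, sizeof u);
    u = (u & 0x80000000u) | 0x3F000000u;   /* sign_bit(f) | bits of 0.5f */
    memcpy(&r, &u, sizeof r);
    return r;
}
```

One difference from f >= 0.0 ? 1.0 : -1.0 is worth keeping in mind: any version that copies the sign bit returns -1.0 for an input of -0.0, while the comparison returns 1.0.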
  • Hi,

    What you are doing in signf() can be written as a macro with the TI compiler:

    __u32_bits_as_f32((__f32_bits_as_u32(f) & 0x80000000) | 0x3F800000)

    These special __u32xx / __f32xx intrinsics help treating u32 as float and float as u32 respectively, without the need to define any variables.

    But I'm not sure which is faster: a transfer from the FPU register to ACC, a bitwise AND, a bitwise OR/add, and a transfer back to the FPU register, or the branchy code after all. The FPU has a conditional NEGF instruction. So if we load one extra FPU register with 1.0, compare the signf() argument against zero, and negate our 1.0 to -1.0 on LT, it should be faster, I think. Unfortunately I don't know how to persuade the compiler to do it the way I want; I see branches in the compiled code.

    Edward
  • Edward,

    Again, thank you for the information.

    I did not find the __u32_bits_as_f32 and __f32_bits_as_u32 intrinsics documented, neither in SPRUEO2B (FPU), nor in SPRU514P (C28x Compiler), nor in SPRU430F (C28x CPU), nor in SPRU513P (C28 Assembly Tools). Can anyone from TI please help here? Or is this "Herrschaftswissen" (German for knowledge reserved to insiders)?

    Unfortunately, these intrinsics do not work with the CLA compiler.

    Below is a code snippet that tests several versions. It includes an assembler-coded function that uses the NEGF instruction as suggested. It seems to do what we want, but I'm not sure whether all conditions (pipeline etc.) are met. It also seems to be the fastest way: 4 + 4 cycles overhead for call/return (although the call may hinder other optimizations). The & 0x80000000 approach looks good with memory operands, but bad in expressions due to the 4-cycle pipeline delay for ACC <> RnH transfers.

    Thanks & regards

    Frank

    static inline
    float32_t   fsign(float32_t f)
    {
        return __u32_bits_as_f32((__f32_bits_as_u32(f) & 0x80000000) | 0x3F800000);
    }

    volatile    float32_t   f1, f2, f3;


    extern  float32_t   fsignf(float32_t f);

    /* r0h = r0h < 0.0 ? -1.0 : 1.0 */
    asm("           .global     _fsignf         ");
    asm("_fsignf:   movizf32    r1h, #1.0       ");
    asm("           cmpf32      r0h, #0.0       ");
    asm("           negf32      r0h, r1h, lt    ");
    asm("           lretr                       ");

    void
    main(void)
    {

        f1 = 2.0;
        f2 = fsign(f1);
        f1 = -2.0;
        f3 = f1 * fsign(f1 + f2);
        f3 = fsignf(f1);
        f3 = f1 >= 0.0 ? 1.0 : -1.0;
        f3 = fsignf(-f1);
    }

    Using

    "C:/ti/ccsv7/tools/compiler/ti-cgt-c2000_16.9.1.LTS/bin/cl2000" -v28 -ml -mt --vcu_support=vcu2 --cla_support=cla1 --float_support=fpu32 --tmu_support=tmu0 -O2 -g --define=CPU1

    compiles to:

    015155:   761F0487    MOVW         DP, #0x487
     88         f1 = 2.0;
    015157:   E8020000    MOVIZ        R0, #0x4000
    015159:   E2030016    MOV32        @0x16, R0H
     61     {
    01515b:   0616        MOVL         ACC, @0x16
     90         f1 = -2.0;
    01515c:   E8060000    MOVIZ        R0, #0xc000
     89         f2 = fsign(f1);
    01515e:   18A88000    AND          @AH, #0x8000
    015160:   9000        ANDB         AL, #0x0
    015161:   1AA83F80    OR           @AH, #0x3f80
    015163:   1E14        MOVL         @0x14, ACC
     90         f1 = -2.0;
    015164:   E2030016    MOV32        @0x16, R0H
     61     {
    015166:   E2AF0016    MOV32        R0H, @0x16, UNCF
    015168:   E2AF0114    MOV32        R1H, @0x14, UNCF
    01516a:   E7100040    ADDF32       R0H, R0H, R1H
    01516c:   7700        NOP          
    01516d:   7700        NOP          
    01516e:   7700        NOP          
     91         f3 = f1 * fsign(f1 + f2);
    01516f:   BFA90F12    MOV32        @ACC, R0H
    015171:   18A88000    AND          @AH, #0x8000
    015173:   9000        ANDB         AL, #0x0
    015174:   1AA83F80    OR           @AH, #0x3f80
    015176:   BDA90F12    MOV32        R0H, @ACC
    015178:   7700        NOP          
    015179:   7700        NOP          
    01517a:   7700        NOP          
    01517b:   E2AF0116    MOV32        R1H, @0x16, UNCF
    01517d:   E7000008    MPYF32       R0H, R1H, R0H
    01517f:   7700        NOP          
    015180:   E2030018    MOV32        @0x18, R0H
     92         f3 = fsignf(f1);
    015182:   E2AF0016    MOV32        R0H, @0x16, UNCF
    015184:   7641514B    LCR          fsignf
    015186:   761F0487    MOVW         DP, #0x487
    015188:   E2030018    MOV32        @0x18, R0H
     93         f3 = f1 >= 0.0 ? 1.0 : -1.0;
    01518a:   E2AF0016    MOV32        R0H, @0x16, UNCF
    01518c:   E5A0        CMPF32       R0H, #0.0
    01518d:   AD14        MOVST0       NF,ZF
    01518e:   6304        SB           C$L1, GEQ
    01518f:   E805FC00    MOVIZ        R0, #0xbf80
    015191:   6F03        SB           C$L2, UNC
    015192:   E801FC00    MOVIZ        R0, #0x3f80
    015194:   E2030018    MOV32        @0x18, R0H
     94         f3 = fsignf(-f1);
    015196:   E2AF0016    MOV32        R0H, @0x16, UNCF
    015198:   E6AF0000    NEGF32       R0H, R0H, UNCF
    01519a:   7641514B    LCR          fsignf
    01519c:   761F0487    MOVW         DP, #0x487
    01519e:   E2030018    MOV32        @0x18, R0H

  • Hi,

    Yes, I don't see these useful intrinsics documented in the TI PDFs. I think I found them here:
    e2e.ti.com/.../632168

    Regarding using them in the CLA: yes, unfortunately it looks like they are not implemented.

    Regards,
    Edward
  • Frank,

    Earlier you posted these results:

    fr
    -4.0 -3.875 -3.75 -3.625 -3.5 -3.375 -3.25 -3.125
    -3.0 -2.875 -2.75 -2.625 -2.5 -2.375 -2.25 -2.125
    -2.0 -1.875 -1.75 -1.625 -1.5 -1.375 -1.25 -1.125
    -1.0 -0.875 -0.75 -0.625 -0.5 -0.375 -0.25 -0.125
    0.0 0.125 0.25 0.375 0.5 0.625 0.75 0.875
    1.0 1.125 1.25 1.375 1.5 1.625 1.75 1.875
    2.0 2.125 2.25 2.375 2.5 2.625 2.75 2.875
    3.0 3.125 3.25 3.375 3.5 3.625 3.75 3.875

    ir
    -4 -4 -4 -4 -4 -3 -3 -3
    -3 -3 -3 -3 -2 -2 -2 -2
    -2 -2 -2 -2 -2 -1 -1 -1
    -1 -1 -1 -1 0 0 0 0
    0 0 0 0 0 1 1 1
    1 1 1 1 2 2 2 2
    2 2 2 2 2 3 3 3
    3 3 3 3 4 4 4 4


    fr
    -3.9375 -3.8125 -3.6875 -3.5625 -3.4375 -3.3125 -3.1875 -3.0625
    -2.9375 -2.8125 -2.6875 -2.5625 -2.4375 -2.3125 -2.1875 -2.0625
    -1.9375 -1.8125 -1.6875 -1.5625 -1.4375 -1.3125 -1.1875 -1.0625
    -0.9375 -0.8125 -0.6875 -0.5625 -0.4375 -0.3125 -0.1875 -0.0625
    0.0625 0.1875 0.3125 0.4375 0.5625 0.6875 0.8125 0.9375
    1.0625 1.1875 1.3125 1.4375 1.5625 1.6875 1.8125 1.9375
    2.0625 2.1875 2.3125 2.4375 2.5625 2.6875 2.8125 2.9375
    3.0625 3.1875 3.3125 3.4375 3.5625 3.6875 3.8125 3.9375


    ir
    -4 -4 -4 -4 -3 -3 -3 -3
    -3 -3 -3 -3 -2 -2 -2 -2
    -2 -2 -2 -2 -1 -1 -1 -1
    -1 -1 -1 -1 0 0 0 0
    0 0 0 0 1 1 1 1
    1 1 1 1 2 2 2 2
    2 2 2 2 3 3 3 3
    3 3 3 3 4 4 4 4

    I just want to double-check: is this exactly the behaviour you are expecting from an F32TOI32R instruction?

    Regards,

    Richard
  • Richard,

    I think I can summarize as follows: a F32TOI32R should do what F32TOI16R does for 16 bit integers, but for 32 bit integers.

    I ran my test program again and included the __f32toi16r intrinsic. As discussed with Edward, the treatment of the x.5 case is different: F32TOI16R and Edward's add/sub 1.5*2^23 round towards the nearest even integer, while the branch ((int32_t)(fr[i] < 0.0 ? fr[i] - 0.5 : fr[i] + 0.5);) and my f + halffsign(f) approach do not.

    What may be essential for mathematicians and in other applications does not matter in my control application. Since there is always some noise, it is not important where x.5 is rounded to. I just want to have a "straight lined staircase" with evenly spaced stairsteps (without the missing stairstep at zero that F32TOI32 has).
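
To make the x.5 difference concrete, here is a small sketch that counts where the two rounding styles disagree on the grid k/16 in [-4, 4), which includes the tie values from the tables earlier in the thread. The function names are hypothetical, and the add/sub version assumes the default round-to-nearest-even mode:

```c
#include <stdint.h>

/* Add/sub 1.5 * 2^23: ties go to the nearest even integer. */
static int32_t round_half_even_i32(float f)
{
    volatile float sum = f + 12582912.0f;   /* force a true 32-bit float */
    return (int32_t)(sum - 12582912.0f);
}

/* Branchy +/-0.5 followed by truncation: ties go away from zero. */
static int32_t round_half_away_i32(float f)
{
    return (int32_t)(f < 0.0f ? f - 0.5f : f + 0.5f);
}

/* Count the grid points k/16 in [-4, 4) where the two methods disagree. */
static int count_disagreements(void)
{
    int n = 0;
    for (int k = -64; k < 64; ++k) {
        float f = (float)k / 16.0f;
        if (round_half_even_i32(f) != round_half_away_i32(f)) {
            ++n;
        }
    }
    return n;
}
```

On this grid the two methods differ only at -2.5, -0.5, 0.5 and 2.5, the ties whose nearest even neighbour lies towards zero. Everywhere else the staircases are identical.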

    Thanks & regards,

    Frank

    Here is the file; I think you can build a complete test case from it.

    testfsign.c
    /******************************************************************************/
    /* testfsign.h - sign / round functions */
    /******************************************************************************/
    typedef long int32_t;
    typedef float float32_t;
    /******************************************************************************/
    /* enums, defines */
    /******************************************************************************/
    /* define macros to hide the strange syntax of signf() , halfsignf() usage */
    #define fsign(x) (signf(&(x)).f)
    #define halffsign(x) (halfsignf(&(x)).f) /* 0.5 * fsign(x) */
    /******************************************************************************/
    /* function definitions */
    /******************************************************************************/
    /*
    A fast version of sign(f) = f >= 0.0 ? 1.0 : -1.0; that does not need the
    slow branches in the cla. Unfortunately, the argument must be an addressable
    variable.
    */
    union fi {
        float32_t f;
        int32_t i;
    };
    static inline
    union fi signf(float32_t *f)
    {
        register union fi r;
        /* result = 1.0 | sign_bit(f) */
        r.i = 0x3F800000 | ((*(int32_t *)f) & 0x80000000);
        return r;
    }
    #if 0
    static inline
    float32_t fsign(float32_t f)
    {
        return __u32_bits_as_f32((__f32_bits_as_u32(f) & 0x80000000) | 0x3F800000);
    }
    #endif
    /*
    halfsignf(x) = 0.5 * fsign(x)
    Good for rounding of floats to integers to avoid the gap around 0.0,
    see below.
    */
    static inline
    union fi halfsignf(float32_t *f)
    {
        register union fi r;
        /* result = 0.5 | sign_bit(f) */
        r.i = 0x3F000000 | ((*(int32_t *)f) & 0x80000000);
        return r;
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX