Fast and Space efficient Integer and Floating Point on C2000 family.

I have a number of questions about achieving space and CPU-cycle efficiency with the C2000 family.  I have an application where I would like to use integer math for FIR filtering of stream data (to save space) and floating point for more complex math or smaller blocks (to save CPU cycles).

Questions:
1.  Can I get int MAC loops to run as fast as floating-point MAC loops on the C28346?  See the detailed code example below.
2.  Are there C2000 family processors that can do nontrivial integer math as fast as FPU processors do floating point?
3.  Are there any tips for getting the C/C++ compiler to produce the most efficient assembly code?  I found that writing a C source loop in different ways can produce very different compiler output, with the most efficient result being an RPT / MAC sequence for an FIR algorithm.
4.  Is there a CPU-cycle-efficient way to do 16 x 16 MACs without any loss of precision?  For 256 such multiplies, an accumulator size of 40 bits would be required.
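
The 40-bit figure in question 4 can be checked with a little arithmetic: a signed 16 x 16 product has a magnitude of at most 2^30, so 256 of them sum to at most 2^38, which needs 40 bits once the sign is included. The following is only a portable C sketch of an exact accumulation, assuming a 64-bit accumulator type is acceptable; the function name mac16_exact is hypothetical, and nothing here says which instructions the C28x compiler would emit or how many cycles they take.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: exact dot product of 16-bit samples and taps.
   Each product is formed in 32 bits and summed in 64 bits, so no
   precision is lost for 256 terms (the worst case needs 40 bits). */
int64_t mac16_exact(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t acc = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* Widen before multiplying so the product is not truncated
           on targets where int is only 16 bits wide. */
        acc += (int32_t)x[i] * (int32_t)h[i];
    }
    return acc;
}
```

With all 256 samples and taps set to -32768, the function returns exactly 2^38, the worst case described above.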

Benchmarking the simple FIR code below, which uses the preprocessor-defined value of "TYPE" for the calculation, I found the int implementation much less cycle efficient than the float implementation.  More specifically, I get the following cycle counts from the simulator between the xx = 0 lines before and after the FIR loop:


 float:  149 cycles.
 int:    905 cycles (6 X the float cycles)

The float case used an "RPT" followed by a "MACF32" for the loop.  However, the int case uses an actual loop.

---------------------------------------------------------------------------
#define TYPE float

TYPE signal[256];
TYPE filter[32];

volatile int xx;
volatile TYPE result;


TYPE FIRPass(TYPE *input)
{
 static TYPE filter[128] = {0,1,2,3,4,5,6,7,8,9,10,11,12,
    13,14,15,16,17,18,19,20,21,22,
    23,24,25,26,27,28,29,30,31,
    0,1,2,3,4,5,6,7,8,9,10,11,12,
    13,14,15,16,17,18,19,20,21,22,
    23,24,25,26,27,28,29,30,31,
    0,1,2,3,4,5,6,7,8,9,10,11,12,
    13,14,15,16,17,18,19,20,21,22,
    23,24,25,26,27,28,29,30,31,
    0,1,2,3,4,5,6,7,8,9,10,11,12,
    13,14,15,16,17,18,19,20,21,22,
    23,24,25,26,27,28,29,30,31};
 TYPE retVal = 0.0;
 int index;
 xx = 0;
 for(index = 0; index < sizeof(filter)/sizeof(filter[0]); index++)
 { 
  retVal += *(input + index) * *(filter + index);
 }
 xx = 0;
 return retVal;
}

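int main(void)
{
 /* Note: input holds 32 elements, but FIRPass() iterates over all 128
    filter entries, so the loop reads past the end of input. */
 static TYPE input[32] = {0,1,2,3,4,5,6,7,8,9,10,11,12,
    13,14,15,16,17,18,19,20,21,22,
    23,24,25,26,27,28,29,30,31};
 result = FIRPass(input);
 return 0;
}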


------------------------------------------------
TYPE == float  SIMULATOR CPU CYCLES = 149.

27         xx = 0;
0x00A04A:   8F40C0C0    MOVL         XAR5, #0x00c0c0
0x00A04C:   8F00C040    MOVL         XAR4, #0x00c040
0x00A04E:   761F0300    MOVW         DP, #0x300
0x00A050:   2B00        MOV          @0x0, #0
30          retVal += *(input + index) * *(filter + index);
0x00A051:   E592        ZERO         R2
0x00A052:   E593        ZERO         R3
0x00A053:   E596        ZERO         R6
0x00A054:   E597        ZERO         R7
0x00A055:   C5A4        MOVL         XAR7, @XAR4
0x00A056:   F67F     || RPT          #127
0x00A057:   E2501F85    MACF32       R7H, R3H, *XAR5++, *XAR7++
0x00A059:   E71001BF    ADDF32       R7H, R7H, R6H
0x00A05B:   E710009B    ADDF32       R3H, R3H, R2H
0x00A05D:   7700        NOP
0x00A05E:   E71001D8    ADDF32       R0H, R3H, R7H
32         xx = 0;

------------------------------------------------------
TYPE == int  SIMULATOR CPU CYCLES = 905
27         xx = 0;
          main:
0x00A078:   8F00C080    MOVL         XAR4, #0x00c080
0x00A07A:   8F40C040    MOVL         XAR5, #0x00c040
0x00A07C:   761F0300    MOVW         DP, #0x300
0x00A07E:   2B00        MOV          @0x0, #0
25         TYPE retVal = 0.0;
0x00A07F:   BE7F        MOVB         XAR6, #0x7f
0x00A080:   9A00        MOVB         AL, #0x0
30          retVal += *(input + index) * *(filter + index);
0x00A081:   2D84        MOV          T, *XAR4++
0x00A082:   3385        MPY          P, T, *XAR5++
0x00A083:   94AB        ADD          AL, @PL
28         for(index = 0; index < sizeof(filter)/sizeof(filter[0]); index++)
0x00A084:   000EFFFD    BANZ         -3,AR6--
32         xx = 0;

  • Hello World:

     

    I have the very same excellent question that John asks above:

    "Are there any tips for getting the C/C++ compiler to produce the most efficient assembly code?"

    I am performing a simple sum of squares in RMS (Root Mean Square) calculations, and I would like to know if there is a way to write C code that the compiler will recognize and turn into assembly code that employs the highly efficient RPT with MAC instructions.

    Unfortunately, I implemented the (virtually) exact C code that John did above for #define TYPE float, but the Code Composer Studio (CCS) TI compiler did NOT generate the RPT with MACF32 code for the C28 core of the Concerto Microcontroller.

    The assembly code between the xx = 0; lines that was generated for floats follows:

     

    ;----------------------------------------------------------------------
    ; 333 | xx = 0;
    ;----------------------------------------------------------------------
            MOV       @_xx,#0               ; [CPU_] |333|

    ;----------------------------------------------------------------------
    ; 334 | for(index = 0; index < sizeof(filter)/sizeof(filter[0]); index++)
    ;----------------------------------------------------------------------
            MOVB      XAR6,#0               ; [CPU_] |334|
            MOV       AL,AR6                ; [CPU_] |334|
            CMPB      AL,#128               ; [CPU_] |334|
            B         $C$L8,HIS             ; [CPU_] |334|
            ; branchcc occurs ; [] |334|
    $C$L7:

    ;----------------------------------------------------------------------
    ; 336 | retVal += *(input + index) * *(filter + index);
    ;----------------------------------------------------------------------
            SETC      SXM                   ; [CPU_]
            MOVL      XAR5,XAR2             ; [CPU_] |336|
            MOV       ACC,AR6 << 1          ; [CPU_] |336|
            MOVL      XAR4,#_filter$1       ; [CPU_U] |336|
            ADDL      XAR5,ACC              ; [CPU_] |336|
            MOV32     R0H,*+XAR5[0]         ; [CPU_] |336|
            MOV       ACC,AR6 << 1          ; [CPU_] |336|
            ADDL      XAR4,ACC              ; [CPU_] |336|
            MOV32     R1H,*+XAR4[0]         ; [CPU_] |336|
            MPYF32    R2H,R1H,R0H           ; [CPU_] |336|
            NOP       ; [CPU_]
            ADDF32    R4H,R4H,R2H           ; [CPU_] |336|

            ADDB      XAR6,#1               ; [CPU_] |334|
            MOV       AL,AR6                ; [CPU_] |334|
            CMPB      AL,#128               ; [CPU_] |334|
            B         $C$L7,LO              ; [CPU_] |334|
            ; branchcc occurs ; [] |334|
    $C$L8:

    ;----------------------------------------------------------------------
    ; 338 | xx = 0;
    ;----------------------------------------------------------------------
            MOVW      DP,#_xx               ; [CPU_U]
            MOV       @_xx,#0               ; [CPU_] |338|

    ;----------------------------------------------------------------------

     

    The timestamp ticks between the xx = 0; lines above were 7,684 for 128 filter[] elements (which causes the input pointer to run past the end of the 32-element input data) and 1,924 for 32 filter[] elements (matching the 32 input elements).
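
The 128-element run reads past the end of the 32-element input because the loop bound comes from the filter array alone. One way to rule that out is to pass the element count explicitly. This is a minimal sketch, not the code from the thread: the name fir_pass_n and its parameters are hypothetical, and no claim is made about whether this form changes the compiler's choice of RPT/MACF32.

```c
#include <stddef.h>

/* Hypothetical variant of FIRPass(): the caller supplies the number
   of elements, so the loop never reads past the shorter array. */
float fir_pass_n(const float *input, const float *coeffs, size_t n)
{
    float acc = 0.0f;
    size_t i;

    for (i = 0; i < n; i++) {
        acc += input[i] * coeffs[i];
    }
    return acc;
}
```

Called with the thread's 32-element input and the first 32 filter values (both 0 through 31), the result is the sum of squares 0^2 + 1^2 + ... + 31^2 = 10416.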

     

    NOTE: The assembly code generated by the compiler for #define TYPE float, with Optimization set to Register Optimizations (0) and a 32-element filter and input, follows:

    ;----------------------------------------------------------------------
    ; 336 | retVal += *(input + index) * *(filter + index);
    ;----------------------------------------------------------------------
            SETC      SXM                   ; [CPU_]
            MOVL      XAR5,XAR2             ; [CPU_] |336|
            MOV       ACC,AR6 << 1          ; [CPU_] |336|
            MOVL      XAR4,#_filter$1       ; [CPU_U] |336|
            ADDL      XAR5,ACC              ; [CPU_] |336|
            MOV32     R0H,*+XAR5[0]         ; [CPU_] |336|
            MOV       ACC,AR6 << 1          ; [CPU_] |336|
            ADDL      XAR4,ACC              ; [CPU_] |336|
            MOV32     R1H,*+XAR4[0]         ; [CPU_] |336|
            MPYF32    R2H,R1H,R0H           ; [CPU_] |336|
            NOP       ; [CPU_]
            ADDF32    R4H,R4H,R2H           ; [CPU_] |336|

            ADDB      XAR6,#1               ; [CPU_] |334|
            MOV       AL,AR6                ; [CPU_] |334|
            CMPB      AL,#32                ; [CPU_] |334|
            B         $C$L7,LO              ; [CPU_] |334|
            ; branchcc occurs ; [] |334|
    $C$L8:

     

    The debugger code (with @s, UNCFs, etc. added by debugger) follows:

    336       retVal += *(input + index) * *(filter + index);
            C$L7:
    10826e:   3B01        SETC         SXM
    10826f:   83A2        MOVL         XAR5, @XAR2
    108270:   560301A6    MOV          ACC, @AR6 << 1
    108272:   8F00BD00    MOVL         XAR4, #0x00bd00
    108274:   560100A5    ADDL         @XAR5, ACC
    108276:   E2AF00C5    MOV32        R0H, *+XAR5[0], UNCF
    108278:   560301A6    MOV          ACC, @AR6 << 1
    10827a:   560100A4    ADDL         @XAR4, ACC
    10827c:   E2AF01C4    MOV32        R1H, *+XAR4[0], UNCF
    10827e:   E700000A    MPYF32       R2H, R1H, R0H
    108280:   7700        NOP
    108281:   E71000A4    ADDF32       R4H, R4H, R2H
    334      for(index = 0; index < sizeof(filter)/sizeof(filter[0]); index++)
    108283:   DE01        ADDB         XAR6, #1
    108284:   92A6        MOV          AL, @AR6
    108285:   5220        CMPB         AL, #0x20
    108286:   68E8        SB           C$L7, LO

     

    Any input from TI or the remainder of the world would be greatly appreciated.

     

    Thank you all for your time and efforts,

    Tim Ball

    TDB Consulting

  • Hello World:

     

    The RPT/MACF32 assembly code was generated by the compiler after the Project/Properties/Build/C2000 Compiler/Optimization/Optimization level was incremented to "2 Global Optimizations".

    The generated assembly code follows, with some C code rearranged as expected under optimization (e.g. the store to et comes from another line):

    ;----------------------------------------------------------------------
    ; 336 | retVal += *(input + index) * *(filter + index);
    ;----------------------------------------------------------------------
            ZERO      R6H                   ; [CPU_] |336|
            ZERO      R7H                   ; [CPU_] |336|
            MOVW      DP,#_xx               ; [CPU_U]
            MOVL      XAR7,#_filter$1       ; [CPU_U]

            MOV       @_xx,#0               ; [CPU_] |333|
            MOVL      XAR4,XAR2             ; [CPU_]

            MOV32     R3H,R4H               ; [CPU_] |336|
            MOV32     R2H,R5H               ; [CPU_] |336|

            MOVL      @_et,ACC              ; [CPU_] |331|

    ;----------------------------------------------------------------------
    ; 338 | xx = 0;
    ;----------------------------------------------------------------------
            MOV       @_xx,#0               ; [CPU_] |338|

            RPT       #31
    ||      MACF32   R7H,R3H,*XAR4++,*XAR7++ ; [CPU_] |336|
            ADDF32    R4H,R3H,R2H           ; [CPU_] |336|

     

    The debug Disassembly code follows:

    336       retVal += *(input + index) * *(filter + index);
    107a04:   E596        ZERO         R6
    107a05:   E597        ZERO         R7
    107a06:   761F02F2    MOVW         DP, #0x2f2
    107a08:   76C0BD00    MOVL         XAR7, #0x00bd00
    333      xx = 0;
    107a0a:   2B00        MOV          @0x0, #0
    107a0b:   8AA2        MOVL         XAR4, @XAR2
    336       retVal += *(input + index) * *(filter + index);
    107a0c:   E6CF0023    MOV32        R3H, R4H, UNCF
    107a0e:   E6CF002A    MOV32        R2H, R5H, UNCF
    331      et = (Uint32)Timestamp_get32(); //TDB Begin
    107a10:   1E06        MOVL         @0x6, ACC
    338      xx = 0;
    107a11:   2B00        MOV          @0x0, #0
    336       retVal += *(input + index) * *(filter + index);
    107a12:   F61F        RPT          #31
    107a13:   E2501F84 || MACF32       R7H, R3H, *XAR4++, *XAR7++
    107a15:   E710009C    ADDF32       R4H, R3H, R2H

     

    For 32 input and filter elements, elapsed time ticks went from 1,924 to 104 with RPT/MACF32 code created by the compiler.

    However, the compiler does NOT employ RPT/MACF32 when squaring the elements of a single array, for example:

    value += *(phase+i) * *(phase+i);

    RPT/MACF32 is only employed when the elements of two separate arrays are multiplied, as in:

    retVal += *(input + index) * *(filter + index);

     

    What is needed is a way to prompt the compiler to employ RPT/MACF32 when squaring the elements of one array (as opposed to multiplying the elements of two arrays).
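
For reference, the single-array case described here can be written in portable C as below. This is only a sketch with hypothetical names (sum_squares, mean_square); it shows the computation itself and makes no claim about whether the TI compiler turns it into RPT/MACF32. The RMS value is the square root of the mean square.

```c
#include <stddef.h>

/* Hypothetical helper: sum of squares of one array, i.e. the
   "same variable times itself" loop discussed above. */
float sum_squares(const float *x, size_t n)
{
    float acc = 0.0f;
    size_t i;

    for (i = 0; i < n; i++) {
        acc += x[i] * x[i];
    }
    return acc;
}

/* Hypothetical helper: mean of the squares; taking the square root
   of this value gives the RMS. Returns 0 for an empty array. */
float mean_square(const float *x, size_t n)
{
    if (n == 0) {
        return 0.0f;
    }
    return sum_squares(x, n) / (float)n;
}
```

For the two samples 3.0 and 4.0, the sum of squares is 25.0 and the mean square is 12.5.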

    Thanks TI,

    TDB

     

  • Hello TI World:

    In our previous episode (above), we learned that the compiler would only create RPT/MACF32 assembly code when multiplying two separate sets of variables and would not create RPT/MACF32 code when squaring a single set of variables. This issue was (thought to have been) solved by adding a pointer variable, which MUST be a local static variable. Note the CAPITALIZED comments in the 'C' code below, for which the compiler does create RPT/MACF32 code with Optimization set to "2 Global Optimizations":

    #define COUNT 32
    float SumSquares(float *samples)
    {
    static float *ptr; //REQUIRED AND MUST BE STATIC FOR COMPILER TO GENERATE RPT/MACF32 ASSEMBLY CODE
    float value; //MUST BE LOCAL FOR COMPILER TO GENERATE RPT/MACF32 ASSEMBLY CODE
    int i; //MUST BE LOCAL FOR COMPILER TO GENERATE RPT/MACF32 ASSEMBLY CODE

       ptr = samples; //MUST BE ASSIGNED FIRST FOR COMPILER TO GENERATE RPT/MACF32 ASSEMBLY CODE
       et = (Uint32)Timestamp_get32();
       value = 0.0;
       for (i=0; i<COUNT; ++i) value += *(samples+i) * *(ptr+i); //sum squares
       et = (Uint32)Timestamp_get32() - et - 69; //69 cycles to execute Timestamp_get32()
       return value;
    }

    HOWEVER, after testing that code, the line of code: 'et = (Uint32)Timestamp_get32();' was removed and the compiler did NOT create RPT/MACF32 assembly code!

    This exercise demonstrates that it is apparently impossible to guarantee that the compiler will generate RPT/MACF32 assembly code from 'C' code. Therefore, one MUST include inline assembly code to guarantee that the highly efficient RPT/MACF32 assembly instruction is utilized. The payoff is large: in our previous episode, the elapsed time went from 1,924 ticks (without RPT/MACF32) to 104 ticks (with RPT/MACF32).

    Tim Ball

    TDB Consulting

     

  • Texas Instruments:

    Test results (provided below) prove that the Texas Instruments (TI) MACF32 (Multiply and Accumulate Floating point 32 bit values) instruction does not operate as indicated in the TI 'TMS320C28x Floating Point Unit and Instruction Set Reference Guide' as given in the TI SPRUEO2a document.

    This fact was also verified after discovering the TI E2E Community question (with TI responses) at:

    'Compiler bug? Using MAC in C - FIR Filter produces incorrect results with -O2, ok with -O0'

    at: http://e2e.ti.com/support/development_tools/compiler/f/343/t/70835.aspx

    The results (below) also show that the MACF32 instruction operates differently when executed multiple times versus being executed repeatedly by the RPT (Repeat) instruction.

     

    The register values after each of 10 separate (non-repeated) MACF32 instruction executions for summing squared values of 1.0 follow:

    #  R2               R3               R6               R7               XAR6   XAR7
    0  0                0                0                0                0XE7C0 0XE7C0
    1  0X3F800000(1.0)  0X00000000(0.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7C2 0XE7C2
    2  0X3F800000(1.0)  0X3F800000(1.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7C4 0XE7C4
    3  0X3F800000(1.0)  0X40000000(2.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7C6 0XE7C6
    4  0X3F800000(1.0)  0X40400000(3.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7C8 0XE7C8
    5  0X3F800000(1.0)  0X40800000(4.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7CA 0XE7CA
    6  0X3F800000(1.0)  0X40A00000(5.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7CC 0XE7CC
    7  0X3F800000(1.0)  0X40C00000(6.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7CE 0XE7CE
    8  0X3F800000(1.0)  0X40E00000(7.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7D0 0XE7D0
    9  0X3F800000(1.0)  0X41000000(8.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7D2 0XE7D2
    10 0X3F800000(1.0)  0X41100000(9.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7D4 0XE7D4

    Notice that the accumulated sum is in R2 and R3 with R6 and R7 remaining zero.

     

    The register values after 10 Repeated MACF32 instruction executions, with the RPT instruction, for summing squared values of 1.0 follow:

    #  R2               R3               R6               R7               XAR6   XAR7
    0  0                0                0                0                0XE7C0 0XE7C0
    10 0X3F800000(1.0)  0X40A00000(5.0)  0X3F800000(1.0)  0X40800000(4.0)  0XE7D6 0XE7D6

    Notice that the accumulated sum is in R2, R3, R6, and R7.

    Based on the final address (0XE7D6 instead of the 0XE7D4 reached by the 10 separate executions), the instruction is indeed "repeated": it is performed once and then repeated by the RPT count, for a total of count+1 executions. To get 10 executions, the RPT count should therefore be 9.

    The register values after 9 Repeated MACF32 instruction executions for summing squared values of 1.0 follow:

    #  R2               R3               R6               R7               XAR6   XAR7
    0  0                0                0                0                0XE7C0 0XE7C0
    9  0X3F800000(1.0)  0X40800000(4.0)  0X3F800000(1.0)  0X40800000(4.0)  0XE7D4 0XE7D4

    With 9 "repeats", a total of 10 executions are performed.

     

    In none of the results above does the correct MACF32 result reside in R7 and R3 as indicated in the Texas Instruments documentation.

    As demonstrated above, the MACF32 instruction also behaves differently when executed multiple times versus executed by the RPT instruction.

    When executed multiple times, but not by RPT, the accumulated sum resides in the R2 and R3 registers with R6 and R7 set to zero.

    When executed by the RPT instruction, the accumulated sum resides in the R2, R3, R6, and R7 registers.

    However, fortunately, in both cases, the accumulated sum may be determined by adding R2+R3+R6+R7.
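
The register dumps above are consistent with a simple model, sketched here in C. It is inferred only from the tables in this post, not from TI documentation, and all names (MacRegs, mac_step, mac_rpt, mac_total) are hypothetical. In the model, each execution first folds the previously held product into an accumulator and then holds the new product; under RPT, successive executions alternate between the R3/R2 pair and the R7/R6 pair.

```c
/* Model of the four FPU registers named in the tables above. */
typedef struct {
    float r2, r3, r6, r7;
} MacRegs;

/* One standalone MACF32 as observed: add the held product in R2 to
   R3, then hold the new product in R2. R6 and R7 are untouched. */
void mac_step(MacRegs *r, float a, float b)
{
    r->r3 += r->r2;
    r->r2 = a * b;
}

/* RPT #count as observed: count + 1 executions, alternating between
   the R3/R2 pair (even executions) and the R7/R6 pair (odd ones). */
void mac_rpt(MacRegs *r, const float *a, const float *b, unsigned count)
{
    unsigned i;

    for (i = 0; i <= count; i++) {
        if ((i & 1u) == 0u) {
            r->r3 += r->r2;
            r->r2 = a[i] * b[i];
        } else {
            r->r7 += r->r6;
            r->r6 = a[i] * b[i];
        }
    }
}

/* Complete sum: the two accumulators plus the two held products. */
float mac_total(const MacRegs *r)
{
    return r->r2 + r->r3 + r->r6 + r->r7;
}
```

Fed with inputs of 1.0, this model reproduces the three tables: ten standalone steps leave R2 = 1.0 and R3 = 9.0, a repeat count of 10 leaves R2 = 1.0, R3 = 5.0, R6 = 1.0, R7 = 4.0, and a repeat count of 9 leaves R2 = 1.0, R3 = 4.0, R6 = 1.0, R7 = 4.0. In each case the four registers add up to the number of executions.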

     

    Lessons Learned in all the exercises above on this TI page:

    1. Executing RPT/MACF32 can reduce execution time to nearly 1/20 of the time taken by the unoptimized loop (104 versus 1,924 ticks in the measurements above).

    2. It is evidently impossible to guarantee that the TI compiler will generate RPT/MACF32 assembly code from 'C' code.

    3. One MUST include inline assembly code to guarantee that the highly efficient RPT/MACF32 assembly instruction is utilized.

    4. The TI SPRUEO2a document is erroneous in regard to the MACF32 instruction (i.e. results are not in R7 and/or R3).

    5. The MACF32 instruction operates differently when executed multiple times versus being executed repeatedly by the RPT (Repeat) instruction.

    6. Correct results for the MACF32 instruction are produced by adding the floating point registers R2, R3, R6, and R7.

     

    No need to answer these questions, TI, I answered them for you for your valuable customers.

     

    Regards,

    Tim Ball

    TDB Consulting

    http://TDBConsulting.org/

    P.S. For those interested in what the TI acronyms TMS, SPR, and others stand for, refer to:

    http://e2e.ti.com/support/development_tools/code_composer_studio/f/81/t/85359.aspx