Fast and Space efficient Integer and Floating Point on C2000 family.

John Sotack

I have a number of questions related to obtaining space and cpu cycle efficiency with the C2000 family. I have an application where I would like to use integer math for FIR filtering of stream data (to save space) and floating point for more complex math or smaller blocks (to save CPU cycles).

Questions:
1. Can I get integer MAC loops to run as fast as floating-point MAC loops on the C28346? See the detailed code example below.
2. Are there C2000 family processors that can do non-trivial integer math as fast as the FPU processors do floating point?
3. Are there any tips for getting the C/C++ compiler to produce the most efficient assembly code? I found that writing a C source loop in different ways can produce compiler output of very different efficiency, with the most efficient result being an RPT / MAC sequence for an FIR algorithm.
4. Is there a CPU-cycle-efficient way to do 16 * 16 MACs without any loss of precision? For 256 such multiplies, an accumulator size of 40 bits would be required.
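
On question 4, the 40-bit figure can be checked on a host with portable C. The sketch below only illustrates the arithmetic and is not C28x-tuned code: the name mac16_wide is hypothetical, and it uses a 64-bit accumulator simply because standard C has no 40-bit type. Whether the C2000 compiler maps such a loop onto a hardware MAC sequence is not something this sketch shows.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply-accumulate n pairs of 16-bit samples without overflow.
 * Each product is formed in 32 bits and summed in 64 bits
 * (hypothetical helper, for illustration only). */
int64_t mac16_wide(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        /* Widen before multiplying so the 16 x 16 product keeps all 32 bits. */
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

In the worst case each product is (-32768) * (-32768) = 2^30, so 256 of them sum to 2^38, which does not fit in a 32-bit accumulator but does fit in 40 bits as a signed value.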

Benchmarking the simple FIR code below, which uses the preprocessor-defined value of "TYPE" for the calculation, I found the int implementation much less cycle efficient than the float implementation. More specifically, I get the following cycle counts from the simulator between the xx = 0 lines before and after the FIR loop:

float: 149 cycles.
int: 905 cycles (6 X the float cycles)

The float case used an "RPT" followed by a "MACF32" for the loop. However, the int case uses an actual loop.
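
One detail worth keeping in mind for the int case is that int is 16 bits on the C28x, so with TYPE defined as int the sum itself is only 16 bits wide. A common way to write the integer version is with 16-bit data and a 32-bit accumulator, as sketched below. The name fir_dot_i16 is hypothetical and this is ordinary portable C: whether a given compiler version and optimization level turns it into an RPT / MAC sequence is an assumption that has to be checked in the generated assembly.

```c
#include <stdint.h>

/* Dot product of 16-bit samples and coefficients with a 32-bit sum.
 * fir_dot_i16 is a hypothetical name used only for this sketch. */
int32_t fir_dot_i16(const int16_t *input, const int16_t *coeff, uint16_t taps)
{
    int32_t acc = 0;
    uint16_t i;
    for (i = 0; i < taps; i++) {
        /* Cast one operand so the multiply is carried out in 32 bits. */
        acc += (int32_t)input[i] * coeff[i];
    }
    return acc;
}
```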

---------------------------------------------------------------------------
#define TYPE float

TYPE signal[256];
TYPE filter[32];

volatile int xx;
volatile TYPE result;

TYPE FIRPass(TYPE *input)
{
static TYPE filter[128] = {0,1,2,3,4,5,6,7,8,9,10,11,12,
    13,14,15,16,17,18,19,20,21,22,
    23,24,25,26,27,28,29,30,31,
    0,1,2,3,4,5,6,7,8,9,10,11,12,
    13,14,15,16,17,18,19,20,21,22,
    23,24,25,26,27,28,29,30,31,
    0,1,2,3,4,5,6,7,8,9,10,11,12,
    13,14,15,16,17,18,19,20,21,22,
    23,24,25,26,27,28,29,30,31,
    0,1,2,3,4,5,6,7,8,9,10,11,12,
    13,14,15,16,17,18,19,20,21,22,
    23,24,25,26,27,28,29,30,31};
TYPE retVal = 0.0;
int index;
xx = 0;
for(index = 0; index < sizeof(filter)/sizeof(filter[0]); index++)
{
  retVal += *(input + index) * *(filter + index);
}
xx = 0;
return retVal;
}

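int main(void)
{
/* NOTE: input has 32 elements, but FIRPass() iterates over all 128 filter elements */
static TYPE input[32] = {0,1,2,3,4,5,6,7,8,9,10,11,12,
13,14,15,16,17,18,19,20,21,22,
23,24,25,26,27,28,29,30,31};
result = FIRPass(input);
return 0;
}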

------------------------------------------------
TYPE == float SIMULATOR CPU CYCLES = 149.

27        xx = 0;
0x00A04A:   8F40C0C0    MOVL         XAR5, #0x00c0c0
0x00A04C:   8F00C040    MOVL         XAR4, #0x00c040
0x00A04E:   761F0300    MOVW         DP, #0x300
0x00A050:   2B00        MOV          @0x0, #0
30          retVal += *(input + index) * *(filter + index);
0x00A051:   E592        ZERO         R2
0x00A052:   E593        ZERO         R3
0x00A053:   E596        ZERO         R6
0x00A054:   E597        ZERO         R7
0x00A055:   C5A4        MOVL         XAR7, @XAR4
0x00A056:   F67F     || RPT          #127
0x00A057:   E2501F85    MACF32       R7H, R3H, *XAR5++, *XAR7++
0x00A059:   E71001BF    ADDF32       R7H, R7H, R6H
0x00A05B:   E710009B    ADDF32       R3H, R3H, R2H
0x00A05D:   7700        NOP
0x00A05E:   E71001D8    ADDF32       R0H, R3H, R7H
32        xx = 0;

------------------------------------------------------
TYPE == int SIMULATOR CPU CYCLES = 905
27        xx = 0;
          main:
0x00A078:   8F00C080    MOVL         XAR4, #0x00c080
0x00A07A:   8F40C040    MOVL         XAR5, #0x00c040
0x00A07C:   761F0300    MOVW         DP, #0x300
0x00A07E:   2B00        MOV          @0x0, #0
25        TYPE retVal = 0.0;
0x00A07F:   BE7F        MOVB         XAR6, #0x7f
0x00A080:   9A00        MOVB         AL, #0x0
30          retVal += *(input + index) * *(filter + index);
0x00A081:   2D84        MOV          T, *XAR4++
0x00A082:   3385        MPY          P, T, *XAR5++
0x00A083:   94AB        ADD          AL, @PL
28        for(index = 0; index < sizeof(filter)/sizeof(filter[0]); index++)
0x00A084:   000EFFFD    BANZ         -3,AR6--
32        xx = 0;

over 12 years ago

Timothy Ball over 9 years ago

Hello World:

I have the very same excellent question that John does above as:

"Are there any tips for getting the C/C++ compiler to produce the most efficient assembly code?"

I am performing a simple sum of squares in RMS (Root Mean Square) calculations and I would like to know if there is a way to write C code that the compiler will recognize and create assembly code that employs the highly efficient RPT with MAC assembly instructions.
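
For reference, the sum-of-squares step of such an RMS calculation can be written in plain C as below. This is a minimal sketch with a hypothetical function name (mean_square_f32); it shows the kind of loop one would hope to see compiled to RPT / MACF32, not code that is known to compile that way.

```c
#include <stddef.h>

/* Mean of the squared samples (hypothetical helper).
 * Returns 0 for an empty block to avoid dividing by zero. */
float mean_square_f32(const float *samples, size_t n)
{
    float sum = 0.0f;
    size_t i;
    if (n == 0) {
        return 0.0f;
    }
    for (i = 0; i < n; i++) {
        sum += samples[i] * samples[i];   /* sum of squares */
    }
    return sum / (float)n;
}
```

The RMS value is then the square root of the returned mean square, for example via sqrtf() from math.h.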

Unfortunately, I implemented the (virtually) exact C code that John did above for #define TYPE float, but the Code Composer Studio (CCS) TI compiler did NOT generate the RPT with MACF32 code for the C28 core of the Concerto Microcontroller.

The assembly code between the xx = 0; lines that was generated for floats follows:

;----------------------------------------------------------------------
; 333 | xx = 0;
;----------------------------------------------------------------------
    MOV @_xx,#0 ; [CPU_] |333|
;----------------------------------------------------------------------
; 334 | for(index = 0; index < sizeof(filter)/sizeof(filter[0]); index++)
;----------------------------------------------------------------------
    MOVB XAR6,#0 ; [CPU_] |334|
    MOV AL,AR6 ; [CPU_] |334|
    CMPB AL,#128 ; [CPU_] |334|
    B $C$L8,HIS ; [CPU_] |334|
    ; branchcc occurs ; [] |334|
$C$L7:
;----------------------------------------------------------------------
; 336 | retVal += *(input + index) * *(filter + index);
;----------------------------------------------------------------------
    SETC SXM ; [CPU_]
    MOVL XAR5,XAR2 ; [CPU_] |336|
    MOV ACC,AR6 << 1 ; [CPU_] |336|
    MOVL XAR4,#_filter$1 ; [CPU_U] |336|
    ADDL XAR5,ACC ; [CPU_] |336|
    MOV32 R0H,*+XAR5[0] ; [CPU_] |336|
    MOV ACC,AR6 << 1 ; [CPU_] |336|
    ADDL XAR4,ACC ; [CPU_] |336|
    MOV32 R1H,*+XAR4[0] ; [CPU_] |336|
    MPYF32 R2H,R1H,R0H ; [CPU_] |336|
    NOP ; [CPU_]
    ADDF32 R4H,R4H,R2H ; [CPU_] |336|
    ADDB XAR6,#1 ; [CPU_] |334|
    MOV AL,AR6 ; [CPU_] |334|
    CMPB AL,#128 ; [CPU_] |334|
    B $C$L7,LO ; [CPU_] |334|
    ; branchcc occurs ; [] |334|
$C$L8:
;----------------------------------------------------------------------
; 338 | xx = 0;
;----------------------------------------------------------------------
    MOVW DP,#_xx ; [CPU_U]
    MOV @_xx,#0 ; [CPU_] |338|
;----------------------------------------------------------------------

The timestamp ticks between the xx = 0; lines above were 7,684 for 128 filter[] elements (which causes the input pointer to run past the end of the 32-element input data) and 1,924 for 32 filter[] elements (to match the 32 input elements).
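
The overrun mentioned above comes from sizing the loop by the filter array alone. A minimal sketch of one way to avoid it is to pass both lengths and iterate over the shorter one; dot_bounded and its parameters are hypothetical names, not part of the original code.

```c
#include <stddef.h>

/* Multiply-accumulate over the shorter of the two arrays so that
 * neither pointer runs past its data (hypothetical helper). */
float dot_bounded(const float *input, size_t input_len,
                  const float *filter, size_t filter_len)
{
    size_t n = (input_len < filter_len) ? input_len : filter_len;
    float acc = 0.0f;
    size_t i;
    for (i = 0; i < n; i++) {
        acc += input[i] * filter[i];
    }
    return acc;
}
```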

NOTE: The #define TYPE float assembly code generated by the compiler for a 32-element filter and input, with Optimization set to Register Optimizations (0), follows:

;----------------------------------------------------------------------
; 336 | retVal += *(input + index) * *(filter + index);
;----------------------------------------------------------------------
    SETC SXM ; [CPU_]
    MOVL XAR5,XAR2 ; [CPU_] |336|
    MOV ACC,AR6 << 1 ; [CPU_] |336|
    MOVL XAR4,#_filter$1 ; [CPU_U] |336|
    ADDL XAR5,ACC ; [CPU_] |336|
    MOV32 R0H,*+XAR5[0] ; [CPU_] |336|
    MOV ACC,AR6 << 1 ; [CPU_] |336|
    ADDL XAR4,ACC ; [CPU_] |336|
    MOV32 R1H,*+XAR4[0] ; [CPU_] |336|
    MPYF32 R2H,R1H,R0H ; [CPU_] |336|
    NOP ; [CPU_]
    ADDF32 R4H,R4H,R2H ; [CPU_] |336|
    ADDB XAR6,#1 ; [CPU_] |334|
    MOV AL,AR6 ; [CPU_] |334|
    CMPB AL,#32 ; [CPU_] |334|
    B $C$L7,LO ; [CPU_] |334|
    ; branchcc occurs ; [] |334|
$C$L8:

The debugger code (with @s, UNCFs, etc. added by debugger) follows:

336 retVal += *(input + index) * *(filter + index);
C$L7:
10826e:   3B01        SETC SXM
10826f:   83A2        MOVL XAR5, @XAR2
108270:   560301A6    MOV ACC, @AR6 << 1
108272:   8F00BD00    MOVL XAR4, #0x00bd00
108274:   560100A5    ADDL @XAR5, ACC
108276:   E2AF00C5    MOV32 R0H, *+XAR5[0], UNCF
108278:   560301A6    MOV ACC, @AR6 << 1
10827a:   560100A4    ADDL @XAR4, ACC
10827c:   E2AF01C4    MOV32 R1H, *+XAR4[0], UNCF
10827e:   E700000A    MPYF32 R2H, R1H, R0H
108280:   7700        NOP
108281:   E71000A4    ADDF32 R4H, R4H, R2H
334 for(index = 0; index < sizeof(filter)/sizeof(filter[0]); index++)
108283:   DE01        ADDB XAR6, #1
108284:   92A6        MOV AL, @AR6
108285:   5220        CMPB AL, #0x20
108286:   68E8        SB C$L7, LO

Any input from TI or the remainder of the world would be greatly appreciated.

Thank you all for your time and efforts,

Tim Ball

TDB Consulting

Timothy Ball over 9 years ago in reply to Timothy Ball

Hello World:

The RPT/MACF32 assembly code was generated by the compiler after the Project/Properties/Build/C2000 Compiler/Optimization/Optimization level was incremented to "2 Global Optimizations".

The generated assembly code follows. As expected with optimization, some C statements are rearranged (e.g. the et assignment from other lines appears here):

;----------------------------------------------------------------------
; 336 | retVal += *(input + index) * *(filter + index);
;----------------------------------------------------------------------
    ZERO R6H ; [CPU_] |336|
    ZERO R7H ; [CPU_] |336|
    MOVW DP,#_xx ; [CPU_U]
    MOVL XAR7,#_filter$1 ; [CPU_U]
    MOV @_xx,#0 ; [CPU_] |333|
    MOVL XAR4,XAR2 ; [CPU_]
    MOV32 R3H,R4H ; [CPU_] |336|
    MOV32 R2H,R5H ; [CPU_] |336|
    MOVL @_et,ACC ; [CPU_] |331|
;----------------------------------------------------------------------
; 338 | xx = 0;
;----------------------------------------------------------------------
    MOV @_xx,#0 ; [CPU_] |338|
    RPT #31
 || MACF32 R7H,R3H,*XAR4++,*XAR7++ ; [CPU_] |336|
    ADDF32 R4H,R3H,R2H ; [CPU_] |336|

The debug Disassembly code follows:

336 retVal += *(input + index) * *(filter + index);
107a04:   E596        ZERO R6
107a05:   E597        ZERO R7
107a06:   761F02F2    MOVW DP, #0x2f2
107a08:   76C0BD00    MOVL XAR7, #0x00bd00
333 xx = 0;
107a0a:   2B00        MOV @0x0, #0
107a0b:   8AA2        MOVL XAR4, @XAR2
336 retVal += *(input + index) * *(filter + index);
107a0c:   E6CF0023    MOV32 R3H, R4H, UNCF
107a0e:   E6CF002A    MOV32 R2H, R5H, UNCF
331 et = (Uint32)Timestamp_get32(); //TDB Begin
107a10:   1E06        MOVL @0x6, ACC
338 xx = 0;
107a11:   2B00        MOV @0x0, #0
336 retVal += *(input + index) * *(filter + index);
107a12:   F61F        RPT #31
107a13:   E2501F84 || MACF32 R7H, R3H, *XAR4++, *XAR7++
107a15:   E710009C    ADDF32 R4H, R3H, R2H

For 32 input and filter elements, elapsed time ticks went from 1,924 to 104 with RPT/MACF32 code created by the compiler.
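
To put these measurements in per-element terms, the arithmetic can be checked with a few lines of C. The tick counts are the ones quoted in this thread; the helper name is hypothetical.

```c
/* Average timestamp ticks spent per multiply-accumulate
 * (hypothetical helper; inputs are the counts reported above). */
double ticks_per_element(unsigned long ticks, unsigned long elements)
{
    return (double)ticks / (double)elements;
}
```

With the figures above, 1,924 / 32 is about 60.1 ticks per element for the unoptimized loop and 104 / 32 = 3.25 ticks per element with RPT/MACF32, a ratio of 18.5.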

However, RPT/MACF32 is NOT employed by the compiler when squaring the elements of a single array, as for example:

value += *(phase+i) * *(phase+i);

RPT/MACF32 is only employed when the elements of two separate arrays are multiplied, as in:

retVal += *(input + index) * *(filter + index);

A way to prompt the compiler to employ RPT/MACF32 when squaring the elements of a single array (as opposed to multiplying the elements of two arrays) is still needed.

Thanks TI,

TDB

Timothy Ball over 9 years ago in reply to Timothy Ball

Hello TI World:

In our previous episode (above), we learned that the compiler would only create RPT/MACF32 assembly code when multiplying the elements of two separate arrays and would not create RPT/MACF32 code when squaring the elements of a single array. This issue was (thought to have been) solved by adding a pointer variable, which MUST be a local static variable. Note the CAPITALIZED comments in the 'C' code below, for which the compiler does create RPT/MACF32 code with Optimization set to "2 Global Optimizations":

#define COUNT 32

float SumSquares(float *samples)
{
   static float *ptr; //REQUIRED AND MUST BE STATIC FOR COMPILER TO GENERATE RPT/MACF32 ASSEMBLY CODE
   float value;       //MUST BE LOCAL FOR COMPILER TO GENERATE RPT/MACF32 ASSEMBLY CODE
   int i;             //MUST BE LOCAL FOR COMPILER TO GENERATE RPT/MACF32 ASSEMBLY CODE

   ptr = samples;     //MUST BE ASSIGNED FIRST FOR COMPILER TO GENERATE RPT/MACF32 ASSEMBLY CODE

   et = (Uint32)Timestamp_get32(); //et is not declared in this snippet

   value = 0.0;
   for (i=0; i<COUNT; ++i) value += *(samples+i) * *(ptr+i); //sum squares

   et = (Uint32)Timestamp_get32() - et - 69; //69 cycles to execute Timestamp_get32()

   return value;
}

HOWEVER, after testing that code, the line of code: 'et = (Uint32)Timestamp_get32();' was removed and the compiler did NOT create RPT/MACF32 assembly code!

This exercise demonstrates that it is apparently impossible to guarantee that the compiler will generate RPT/MACF32 assembly code from 'C' code. Therefore, one MUST include inline assembly code to guarantee that the highly efficient RPT/MACF32 assembly instruction is utilized. The benefit is substantial: in our previous episode, the elapsed time ticks went from 1,924 cycles (without RPT/MACF32) to 104 cycles (with RPT/MACF32).

Tim Ball

TDB Consulting

Timothy Ball over 9 years ago in reply to Timothy Ball

Texas Instruments:

Test results (provided below) prove that the Texas Instruments (TI) MACF32 (Multiply and Accumulate Floating point 32 bit values) instruction does not operate as indicated in the TI 'TMS320C28x Floating Point Unit and Instruction Set Reference Guide' as given in the TI SPRUEO2a document.

This fact was also verified after discovering the TI E2E Community question (with TI responses) at:

'Compiler bug? Using MAC in C - FIR Filter produces incorrect results with -O2, ok with -O0'

at: http://e2e.ti.com/support/development_tools/compiler/f/343/t/70835.aspx

The results (below) also show that the MACF32 instruction operates differently when executed multiple times versus being executed repeatedly by the RPT (Repeat) instruction.

The register values after each of 10 multiple MACF32 instruction executions for summing squared values of 1.0 follow:

#   R2               R3               R6               R7               XAR6    XAR7
0   0                0                0                0                0XE7C0  0XE7C0
1   0X3F800000(1.0)  0X00000000(0.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7C2  0XE7C2
2   0X3F800000(1.0)  0X3F800000(1.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7C4  0XE7C4
3   0X3F800000(1.0)  0X40000000(2.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7C6  0XE7C6
4   0X3F800000(1.0)  0X40400000(3.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7C8  0XE7C8
5   0X3F800000(1.0)  0X40800000(4.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7CA  0XE7CA
6   0X3F800000(1.0)  0X40A00000(5.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7CC  0XE7CC
7   0X3F800000(1.0)  0X40C00000(6.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7CE  0XE7CE
8   0X3F800000(1.0)  0X40E00000(7.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7D0  0XE7D0
9   0X3F800000(1.0)  0X41000000(8.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7D2  0XE7D2
10  0X3F800000(1.0)  0X41100000(9.0)  0X00000000(0.0)  0X00000000(0.0)  0XE7D4  0XE7D4

Notice that the accumulated sum is in R2 and R3 with R6 and R7 remaining zero.

The register values after 10 Repeated MACF32 instruction executions, with the RPT instruction, for summing squared values of 1.0 follow:

#   R2               R3               R6               R7               XAR6    XAR7
0   0                0                0                0                0XE7C0  0XE7C0
10  0X3F800000(1.0)  0X40A00000(5.0)  0X3F800000(1.0)  0X40800000(4.0)  0XE7D6  0XE7D6

Notice that the accumulated sum is in R2, R3, R6, and R7.

Based on the final address (0XE7D6 instead of the 0XE7D4 reached by the 10 separate executions), the instruction is indeed "repeated" (i.e. performed once and then repeated RPT-count times), so for a desired number of executions the RPT operand should be count-1.

The register values after 9 Repeated MACF32 instruction executions for summing squared values of 1.0 follow:

#   R2               R3               R6               R7               XAR6    XAR7
0   0                0                0                0                0XE7C0  0XE7C0
9   0X3F800000(1.0)  0X40800000(4.0)  0X3F800000(1.0)  0X40800000(4.0)  0XE7D4  0XE7D4

With 9 "repeats", a total of 10 executions are performed.
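
The relationship between the RPT operand, the number of executions and the final pointer value in the tables above can be written out as a small host-side check. This is a sketch under two assumptions taken from the observations in this post: RPT #N runs the following instruction N+1 times, and each 32-bit float access advances the pointer by 2 (16-bit words). The function names are hypothetical.

```c
#include <stdint.h>

/* Number of times the repeated instruction executes for RPT #n,
 * as observed above: once, plus n repeats. */
uint32_t rpt_executions(uint32_t rpt_operand)
{
    return rpt_operand + 1u;
}

/* Pointer value after RPT #n || MACF32 with a post-incremented
 * operand, assuming each float occupies two 16-bit words. */
uint32_t pointer_after_rpt(uint32_t start, uint32_t rpt_operand)
{
    return start + 2u * rpt_executions(rpt_operand);
}
```

Starting from 0XE7C0, this gives 0XE7D6 for RPT #10 and 0XE7D4 for RPT #9, matching the XAR6 and XAR7 columns.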

In none of the results above does the correct MACF32 result reside in R7 and R3 as indicated in the Texas Instruments documentation.

As demonstrated above, the MACF32 instruction also behaves differently when executed multiple times versus executed by the RPT instruction.

When executed multiple times, but not by RPT, the accumulated sum resides in the R2 and R3 registers with R6 and R7 set to zero.

When executed by the RPT instruction, the accumulated sum resides in the R2, R3, R6, and R7 registers.

However, fortunately, in both cases, the accumulated sum may be determined by adding R2+R3+R6+R7.
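
The register tables above are consistent with a simple model, sketched here in host-side C. This model is inferred only from the values reported in this post, not taken from TI documentation, and all names in it are hypothetical: a stand-alone MACF32 adds the previous product in R2 to R3 and then stores the new product in R2, while under RPT the executions alternate between the R3/R2 pair and the R7/R6 pair.

```c
#include <stddef.h>

/* Hypothetical model of the four FPU registers involved. */
typedef struct {
    float r2, r3, r6, r7;
} MacRegs;

/* One MACF32 step on the R3/R2 pair: accumulate the previous
 * product, then store the new one. */
static void mac_step_r3(MacRegs *r, float a, float b)
{
    r->r3 += r->r2;
    r->r2 = a * b;
}

/* One MACF32 step on the R7/R6 pair. */
static void mac_step_r7(MacRegs *r, float a, float b)
{
    r->r7 += r->r6;
    r->r6 = a * b;
}

/* Model of RPT || MACF32: executions alternate between the pairs. */
void mac_repeated(MacRegs *r, const float *x, const float *y, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        if (i % 2 == 0) {
            mac_step_r3(r, x[i], y[i]);
        } else {
            mac_step_r7(r, x[i], y[i]);
        }
    }
}

/* Model of MACF32 issued count times without RPT. */
void mac_individual(MacRegs *r, const float *x, const float *y, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        mac_step_r3(r, x[i], y[i]);
    }
}

/* Final sum in either case: add all four registers. */
float mac_total(const MacRegs *r)
{
    return r->r2 + r->r3 + r->r6 + r->r7;
}
```

With eleven inputs of 1.0, mac_repeated leaves R2 = 1, R3 = 5, R6 = 1 and R7 = 4, as in the RPT #10 row; with ten individual steps, mac_individual leaves R2 = 1 and R3 = 9 with R6 and R7 at zero.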

Lessons Learned in all the exercises above on this TI page:

1. Executing RPT/MACF32 can reduce execution time to nearly 1/20 of the time taken by the unoptimized MPYF32/ADDF32 loop (104 versus 1,924 ticks for 32 elements).

2. It is evidently impossible to guarantee that the TI compiler will generate RPT/MACF32 assembly code from 'C' code.

3. One MUST include inline assembly code to guarantee that the highly efficient RPT/MACF32 assembly instruction is utilized.

4. The TI SPRUEO2a document is erroneous in regard to the MACF32 instruction (i.e. results are not in R7 and/or R3).

5. The MACF32 instruction operates differently when executed multiple times versus being executed repeatedly by the RPT (Repeat) instruction.

6. Correct results for the MACF32 instruction are produced by adding the floating point registers R2, R3, R6, and R7.

No need to answer these questions, TI, I answered them for you for your valuable customers.

Regards,

Tim Ball

TDB Consulting

http://TDBConsulting.org/

P.S. For those interested in what the TI acronyms TMS, SPR, and others stand for, refer to:

http://e2e.ti.com/support/development_tools/code_composer_studio/f/81/t/85359.aspx
