TI claims that a single core can achieve 32 MAC operations per cycle, so that the eight-core 6678 running at 1.25 GHz can produce 320G MAC operations per second. Can you show me a real application that achieves this rate?
Consider the following application: let X be N 1x2 vectors of complex numbers. The application uses two 2x2 complex matrices, and for each matrix the following operations are performed:
Each 1x2 vector of complex numbers is multiplied by the 2x2 matrix to produce a transformed 2x1 complex vector. The transformed vectors are accumulated over all N inputs; that is, the real part and the imaginary part of each component of the 2x1 vector are summed separately, and the four sums are recorded.
The application uses two 2x2 complex matrices to perform two separate transforms, so the results are eight values.
The input vectors and the transform matrices are all 16-bit values; the multiplication results are 32-bit values, as are the accumulations. C code that describes the loop is the following:
for (i = 0; i < N; i++)
{
    // Read the 1x2 vector from memory; each element is 16-bit
    ui1 = *p_in++;
    ur1 = *p_in++;
    ui2 = *p_in++;
    ur2 = *p_in++;

    // Calculate the first transform and sum the results
    xr1 = ur1 * f1_1r - ui1 * f1_1i + ur2 * f1_2r - ui2 * f1_2i;
    xi1 = ur1 * f1_1i + ui1 * f1_1r + ur2 * f1_2i + ui2 * f1_2r;
    xr2 = ur1 * f1_3r - ui1 * f1_3i + ur2 * f1_4r - ui2 * f1_4i;
    xi2 = ur1 * f1_3i + ui1 * f1_3r + ur2 * f1_4i + ui2 * f1_4r;
    ar1 = ar1 + xr1;
    ar2 = ar2 + xr2;
    ai1 = ai1 + xi1;
    ai2 = ai2 + xi2;

    // Calculate the second transform and sum the results
    xr1 = ur1 * f2_1r - ui1 * f2_1i + ur2 * f2_2r - ui2 * f2_2i;
    xi1 = ur1 * f2_1i + ui1 * f2_1r + ur2 * f2_2i + ui2 * f2_2r;
    xr2 = ur1 * f2_3r - ui1 * f2_3i + ur2 * f2_4r - ui2 * f2_4i;
    xi2 = ur1 * f2_3i + ui1 * f2_3r + ur2 * f2_4i + ui2 * f2_4r;
    br1 = br1 + xr1;
    br2 = br2 + xr2;
    bi1 = bi1 + xi1;
    bi2 = bi2 + xi2;
}
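For completeness, the loop compiles under declarations along these lines. The names match the fragment above and the types follow the 16-bit input / 32-bit accumulation description; the exact declarations in the attached project may differ:

    #include <stdint.h>

    /* 16-bit input stream: N vectors of (ui1, ur1, ui2, ur2) */
    const int16_t *p_in;

    /* 2x2 matrix coefficients for the two transforms, 16-bit each */
    int16_t f1_1r, f1_1i, f1_2r, f1_2i, f1_3r, f1_3i, f1_4r, f1_4i;
    int16_t f2_1r, f2_1i, f2_2r, f2_2i, f2_3r, f2_3i, f2_4r, f2_4i;

    /* 32-bit products and 32-bit accumulators */
    int32_t xr1, xi1, xr2, xi2;
    int32_t ar1, ai1, ar2, ai2;   /* sums for the first transform  */
    int32_t br1, bi1, br2, bi2;   /* sums for the second transform */

    int i, N;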
The number of multiplications in the loop is 32 and the number of additions is 32: each of the eight transform expressions contains 4 multiplications and 3 additions (8 x 4 = 32 multiplications, 8 x 3 = 24 additions), and the 8 accumulations bring the additions to 32 in total.
Using TI-optimized code with intrinsics, the loop is implemented as single-cycle code (one iteration per cycle) when SPLOOP is used. When timing the routine on 1024 vectors, the number of cycles, measured on a single core of the EVM6678, is 1092; that is, 1024 cycles for the loop plus 68 cycles of overhead. The measurements were done from the calling main. Since each iteration performs 32 multiplications and 32 additions, one iteration per cycle is 32 MACs per cycle, or 40G MACs per second per core at 1.25 GHz, which is the advertised rate.
The enclosed project has the source code, the generated assembly, and the executable for the case of 1024 vectors. The data as well as the matrix values are randomly chosen from the range -256 to +255. Note that the matrix multiplication intrinsic applies 33-bit saturation to the 32-bit accumulation, which can differ from the non-intrinsic case; the value range was chosen so that the natural C code and the intrinsic code produce exactly the same results.
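For illustration, here is a minimal sketch of how the inner loop can be expressed with the C66x _cmatmpy and _dadd intrinsics. The intrinsic names and signatures are those documented for the TI C6000 compiler; the packing of the matrix coefficients into the __x128_t operands is an assumption here, so treat this as a sketch rather than the exact code in the attached project:

    #include <c6x.h>   /* TI C6000 compiler intrinsics */

    /* Hypothetical intrinsic version of the loop (a sketch, not the
       project code). f1 and f2 each hold a 2x2 matrix of 16-bit
       complex coefficients packed into a __x128_t; each input vector
       is one 64-bit word holding two 16-bit complex elements. */
    void transform_sum(const long long *restrict p_in, int N,
                       __x128_t f1, __x128_t f2, long long out[4])
    {
        long long a1 = 0, a2 = 0, b1 = 0, b2 = 0;
        int i;
        for (i = 0; i < N; i++)
        {
            long long v = p_in[i];          /* 1x2 complex 16-bit vector      */
            __x128_t x1 = _cmatmpy(v, f1);  /* first 2x2 transform            */
            __x128_t x2 = _cmatmpy(v, f2);  /* second 2x2 transform           */
            a1 = _dadd(a1, _hi128(x1));     /* DADD: two 32-bit accumulations */
            a2 = _dadd(a2, _lo128(x1));
            b1 = _dadd(b1, _hi128(x2));
            b2 = _dadd(b2, _lo128(x2));
        }
        out[0] = a1; out[1] = a2; out[2] = b1; out[3] = b2;
    }

Two CMATMPY operations (16 real multiplications each) plus four DADD operations per iteration account for the 32 multiplications and 32 additions, which matches the two-CMATMPY, four-DADD kernel in the listing below.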
Some notes:
The following is the actual SPLOOP code:
*----------------------------------------------------------------------------*
$C$L1: ; PIPED LOOP PROLOG
SPLOOP 1 ;10 ; (P)
|| MV .L2X A16,B16
;** --------------------------------------------------------------------------*
$C$L2: ; PIPED LOOP KERNEL
SPMASK L2
|| MV .L2X A17,B19
|| LDDW .D1T1 *A3++,A17:A16 ; |84| (P) <0,0>
SPMASK L2
|| MV .L2X A23,B18
SPMASK L2
|| MV .L2X A7,B17
SPMASK L1,L2
|| MV .L2X A9,B9
|| MV .L1X B21,A9
SPMASK L1,L2
|| MV .L2X A8,B8
|| MV .L1X B20,A8
CMATMPY .M2X A17:A16,B11:B10:B9:B8,B7:B6:B5:B4 ; |84| (P) <0,5>
|| CMATMPY .M1 A17:A16,A11:A10:A9:A8,A7:A6:A5:A4 ; |87| (P) <0,5>
NOP 2
NOP 1
SPKERNEL 9,0
|| DADD .L2 B17:B16,B7:B6,B17:B16 ; |90| <0,9>
|| DADD .S2 B19:B18,B5:B4,B19:B18 ; |92| <0,9>
|| DADD .L1 A19:A18,A7:A6,A19:A18 ; |94| <0,9>
|| DADD .S1 A21:A20,A5:A4,A21:A20 ; |96| <0,9>
;** --------------------------------------------------------------------------*
$C$L3: ; PIPED LOOP EPILOG
;** --------------------------------------------------------------------------*
NOP 1
MV .L1X B16,A16
MV .L1X B19,A17
MV .L1X B18,A23
MV .L1X B17,A7
;** --------------------------------------------------------------------------*

Attachment: 0804.CMATMPY_DEMO.zip