TI claims that a single core can achieve 32 MAC operations per cycle, so that the eight-core 6678 running at 1.25 GHz can produce 320G MAC operations per second. Can you show me a real application that achieves this rate?
Consider the following application: let X be N 1x2 vectors of complex numbers. The application uses two 2x2 complex matrices, and for each matrix the following operations are performed:
Each 1x2 vector of complex numbers is multiplied by the 2x2 matrix to produce a transformed 2x1 complex vector. The transformed vectors are accumulated over all N inputs; that is, the real part and the imaginary part of each component of the 2x1 vector are summed separately, and the four sums are recorded.
The application uses two 2x2 complex matrices to perform two separate transforms, so the results are eight values.
The input vectors and the transform matrices are all 16-bit values; the multiplication results are 32-bit values, as are the accumulations. C code that describes the loop is the following:
for (i = 0; i < N; i++)
{
    // Read the 1x2 vector from memory; each element is 16-bit
    ui1 = *p_in++;
    ur1 = *p_in++;
    ui2 = *p_in++;
    ur2 = *p_in++;

    // Calculate the first transform and sum the results
    xr1 = ur1 * f1_1r - ui1 * f1_1i + ur2 * f1_2r - ui2 * f1_2i;
    xi1 = ur1 * f1_1i + ui1 * f1_1r + ur2 * f1_2i + ui2 * f1_2r;
    xr2 = ur1 * f1_3r - ui1 * f1_3i + ur2 * f1_4r - ui2 * f1_4i;
    xi2 = ur1 * f1_3i + ui1 * f1_3r + ur2 * f1_4i + ui2 * f1_4r;
    ar1 = ar1 + xr1;
    ar2 = ar2 + xr2;
    ai1 = ai1 + xi1;
    ai2 = ai2 + xi2;

    // Calculate the second transform and sum the results
    xr1 = ur1 * f2_1r - ui1 * f2_1i + ur2 * f2_2r - ui2 * f2_2i;
    xi1 = ur1 * f2_1i + ui1 * f2_1r + ur2 * f2_2i + ui2 * f2_2r;
    xr2 = ur1 * f2_3r - ui1 * f2_3i + ur2 * f2_4r - ui2 * f2_4i;
    xi2 = ur1 * f2_3i + ui1 * f2_3r + ur2 * f2_4i + ui2 * f2_4r;
    br1 = br1 + xr1;
    br2 = br2 + xr2;
    bi1 = bi1 + xi1;
    bi2 = bi2 + xi2;
}
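For completeness, the loop compiles under declarations along these lines. The names match the fragment above and the types follow the 16-bit input / 32-bit accumulation description; the exact declarations in the attached project may differ:

    #include <stdint.h>

    /* 16-bit input stream: N vectors of (ui1, ur1, ui2, ur2) */
    const int16_t *p_in;

    /* 2x2 matrix coefficients for the two transforms, 16-bit each */
    int16_t f1_1r, f1_1i, f1_2r, f1_2i, f1_3r, f1_3i, f1_4r, f1_4i;
    int16_t f2_1r, f2_1i, f2_2r, f2_2i, f2_3r, f2_3i, f2_4r, f2_4i;

    /* 32-bit products and 32-bit accumulators */
    int32_t xr1, xi1, xr2, xi2;
    int32_t ar1, ai1, ar2, ai2;   /* sums for the first transform  */
    int32_t br1, bi1, br2, bi2;   /* sums for the second transform */

    int i, N;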
The number of multiplications in the loop is 32 and the number of additions is 32: each of the eight transform expressions contains 4 multiplications and 3 additions (8 x 4 = 32 multiplications, 8 x 3 = 24 additions), and the 8 accumulations bring the additions to 32 in total.
Using TI-optimized code with intrinsics, the loop is implemented as single-cycle code (one iteration per cycle) when SPLOOP is used. When timing the routine on 1024 vectors, the number of cycles, measured on a single core of the EVM6678, is 1092; that is, 1024 cycles for the loop plus 68 cycles of overhead. The measurements were done from the calling main. Since each iteration performs 32 multiplications and 32 additions, one iteration per cycle is 32 MACs per cycle, or 40G MACs per second per core at 1.25 GHz, which is the advertised rate.
The enclosed project has the source code, the generated assembly, and the executable for the case of 1024 vectors. The data as well as the matrix values are randomly chosen from the range -256 to +255. Note that the matrix multiplication intrinsic applies 33-bit saturation to the 32-bit accumulation, which can differ from the non-intrinsic case; the value range was chosen so that the natural C code and the intrinsic code produce exactly the same results.
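For illustration, here is a minimal sketch of how the inner loop can be expressed with the C66x _cmatmpy and _dadd intrinsics. The intrinsic names and signatures are those documented for the TI C6000 compiler; the packing of the matrix coefficients into the __x128_t operands is an assumption here, so treat this as a sketch rather than the exact code in the attached project:

    #include <c6x.h>   /* TI C6000 compiler intrinsics */

    /* Hypothetical intrinsic version of the loop (a sketch, not the
       project code). f1 and f2 each hold a 2x2 matrix of 16-bit
       complex coefficients packed into a __x128_t; each input vector
       is one 64-bit word holding two 16-bit complex elements. */
    void transform_sum(const long long *restrict p_in, int N,
                       __x128_t f1, __x128_t f2, long long out[4])
    {
        long long a1 = 0, a2 = 0, b1 = 0, b2 = 0;
        int i;
        for (i = 0; i < N; i++)
        {
            long long v = p_in[i];          /* 1x2 complex 16-bit vector      */
            __x128_t x1 = _cmatmpy(v, f1);  /* first 2x2 transform            */
            __x128_t x2 = _cmatmpy(v, f2);  /* second 2x2 transform           */
            a1 = _dadd(a1, _hi128(x1));     /* DADD: two 32-bit accumulations */
            a2 = _dadd(a2, _lo128(x1));
            b1 = _dadd(b1, _hi128(x2));
            b2 = _dadd(b2, _lo128(x2));
        }
        out[0] = a1; out[1] = a2; out[2] = b1; out[3] = b2;
    }

Two CMATMPY operations (16 real multiplications each) plus four DADD operations per iteration account for the 32 multiplications and 32 additions, which matches the two-CMATMPY, four-DADD kernel in the listing below.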
Some notes:
The following is the actual SPLOOP code:
*----------------------------------------------------------------------------*
$C$L1: ; PIPED LOOP PROLOG
SPLOOP 1 ;10 ; (P)
|| MV .L2X A16,B16
;** --------------------------------------------------------------------------*
$C$L2: ; PIPED LOOP KERNEL
SPMASK L2
|| MV .L2X A17,B19
|| LDDW .D1T1 *A3++,A17:A16 ; |84| (P) <0,0>
SPMASK L2
|| MV .L2X A23,B18
SPMASK L2
|| MV .L2X A7,B17
SPMASK L1,L2
|| MV .L2X A9,B9
|| MV .L1X B21,A9
SPMASK L1,L2
|| MV .L2X A8,B8
|| MV .L1X B20,A8
CMATMPY .M2X A17:A16,B11:B10:B9:B8,B7:B6:B5:B4 ; |84| (P) <0,5>
|| CMATMPY .M1 A17:A16,A11:A10:A9:A8,A7:A6:A5:A4 ; |87| (P) <0,5>
NOP 2
NOP 1
SPKERNEL 9,0
|| DADD .L2 B17:B16,B7:B6,B17:B16 ; |90| <0,9>
|| DADD .S2 B19:B18,B5:B4,B19:B18 ; |92| <0,9>
|| DADD .L1 A19:A18,A7:A6,A19:A18 ; |94| <0,9>
|| DADD .S1 A21:A20,A5:A4,A21:A20 ; |96| <0,9>
;** --------------------------------------------------------------------------*
$C$L3: ; PIPED LOOP EPILOG
;** --------------------------------------------------------------------------*
NOP 1
MV .L1X B16,A16
MV .L1X B19,A17
MV .L1X B18,A23
MV .L1X B17,A7
;** --------------------------------------------------------------------------*

Attachment: 0804.CMATMPY_DEMO.zip