Real Application that achieves 32 MAC operations per cycle

TI claims that a single core can achieve 32 MAC operations per cycle, so that a 6678 running at 1.25 GHz can deliver 320G MAC operations per second across its eight cores.  Can you show me a real application that achieves this rate?

  • Example of 32 MAC operations per C66 cycle - Sum of matrix transforms

     

    Consider the following application. Let X be N sets of 1x2 vectors of complex numbers. The application uses two 2x2 complex matrices. For each 2x2 complex matrix, the following operations are done:

    A 1x2 vector of complex numbers is multiplied by the 2x2 matrix to produce a transformed 2x1 complex vector. The resulting vectors are summed over all N inputs: the real part and the imaginary part of each component of the 2x1 vector are accumulated separately, and the four sums are recorded.

     

    The application uses two 2x2 complex matrices to do two separate transforms, so the result is eight values.

     

    The input vectors and the transform matrices are all 16-bit values; the multiplication results and the accumulations are 32-bit values.  C code that describes the loop is the following:

     

            for (i=0; i < N; i++)
            {
                   //    Read the 1x2 vector from memory, each element is 16-bit
                   ui1 = *p_in++;
                   ur1 = *p_in++;
                   ui2 = *p_in++;
                   ur2 = *p_in++;

                   //    Calculate the first transform and sum the results
                   xr1 = ur1 * f1_1r - ui1 * f1_1i + ur2 * f1_2r - ui2 * f1_2i;
                   xi1 = ur1 * f1_1i + ui1 * f1_1r + ur2 * f1_2i + ui2 * f1_2r;

                   xr2 = ur1 * f1_3r - ui1 * f1_3i + ur2 * f1_4r - ui2 * f1_4i;
                   xi2 = ur1 * f1_3i + ui1 * f1_3r + ur2 * f1_4i + ui2 * f1_4r;

                   ar1 = ar1 + xr1;
                   ar2 = ar2 + xr2;
                   ai1 = ai1 + xi1;
                   ai2 = ai2 + xi2;

                   //    Calculate the second transform and sum the results
                   xr1 = ur1 * f2_1r - ui1 * f2_1i + ur2 * f2_2r - ui2 * f2_2i;
                   xi1 = ur1 * f2_1i + ui1 * f2_1r + ur2 * f2_2i + ui2 * f2_2r;

                   xr2 = ur1 * f2_3r - ui1 * f2_3i + ur2 * f2_4r - ui2 * f2_4i;
                   xi2 = ur1 * f2_3i + ui1 * f2_3r + ur2 * f2_4i + ui2 * f2_4r;

                   br1 = br1 + xr1;
                   br2 = br2 + xr2;
                   bi1 = bi1 + xi1;
                   bi2 = bi2 + xi2;
            }

    The number of multiplications in the loop is 32, and the number of additions is 32 (please count).
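To try the loop on a host machine, it can be wrapped into a self-contained function. The following is only a sketch and is not the code from the enclosed project: the types `cmat2x2_t` and `csum2_t` and the function names are hypothetical, introduced here for illustration. The arithmetic and the memory order of the input (imaginary part first, then real part) follow the loop above. The caller is assumed to zero the accumulators first.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for one 2x2 complex matrix.
   r[k] and i[k] correspond to fX_(k+1)r and fX_(k+1)i in the loop above. */
typedef struct {
    int16_t r[4];
    int16_t i[4];
} cmat2x2_t;

/* Hypothetical container for the four 32-bit sums of one transform. */
typedef struct {
    int32_t r1, i1, r2, i2;
} csum2_t;

/* One transform plus accumulation: 16 multiplications and 16 additions
   (each line has 4 multiplies, 3 adds/subtracts and 1 accumulate). */
static void transform_accumulate(const cmat2x2_t *f,
                                 int32_t ur1, int32_t ui1,
                                 int32_t ur2, int32_t ui2,
                                 csum2_t *acc)
{
    acc->r1 += ur1 * f->r[0] - ui1 * f->i[0] + ur2 * f->r[1] - ui2 * f->i[1];
    acc->i1 += ur1 * f->i[0] + ui1 * f->r[0] + ur2 * f->i[1] + ui2 * f->r[1];
    acc->r2 += ur1 * f->r[2] - ui1 * f->i[2] + ur2 * f->r[3] - ui2 * f->i[3];
    acc->i2 += ur1 * f->i[2] + ui1 * f->r[2] + ur2 * f->i[3] + ui2 * f->r[3];
}

/* Runs both transforms over n input vectors (4 int16_t values per vector).
   The sums in *a and *b must be zeroed by the caller. */
void sum_of_transforms(const int16_t *p_in, size_t n,
                       const cmat2x2_t *f1, const cmat2x2_t *f2,
                       csum2_t *a, csum2_t *b)
{
    for (size_t k = 0; k < n; k++) {
        /* Same read order as the original loop: imaginary, then real. */
        int32_t ui1 = *p_in++;
        int32_t ur1 = *p_in++;
        int32_t ui2 = *p_in++;
        int32_t ur2 = *p_in++;

        transform_accumulate(f1, ur1, ui1, ur2, ui2, a);  /* first transform  */
        transform_accumulate(f2, ur1, ui1, ur2, ui2, b);  /* second transform */
    }
}
```

Counting per iteration: two calls, each with 16 multiplications and 16 additions, which gives the 32 and 32 quoted above.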

    Using TI-optimized code with intrinsics, the loop is implemented as a single-cycle loop when SPLOOP is used.   When the routine is timed on 1024 vectors, the number of cycles, measured on a single core of the EVM6678, is 1092.  That is, 1024 cycles for the loop and 68 cycles of overhead. The measurements were taken from the calling main.
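The measured numbers can be turned into an effective rate with a little arithmetic. The sketch below is a plain host-side calculation; the function names are made up for this post, and the eight-core count used in the usage note is the core count of the 6678, not something measured here.

```c
/* MAC operations retired per measured cycle, call overhead included. */
double macs_per_cycle(unsigned macs_per_iter, unsigned iterations,
                      unsigned measured_cycles)
{
    return (double)macs_per_iter * iterations / measured_cycles;
}

/* Rate in GMAC per second for a given per-cycle rate, clock in GHz,
   and number of cores running the same loop. */
double gmacs_per_second(double macs_per_cycle_value, double clock_ghz,
                        unsigned cores)
{
    return macs_per_cycle_value * clock_ghz * cores;
}
```

With 32 MACs per iteration, 1024 iterations and 1092 measured cycles, `macs_per_cycle(32, 1024, 1092)` comes out at roughly 30 MACs per cycle, because the 68 overhead cycles are included. The kernel itself retires 32 per cycle, and `gmacs_per_second(32.0, 1.25, 8)` gives the 320 GMAC per second peak figure from the question.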

    The enclosed project has the source code, the assembly output and the executable for the case of 1024 vectors. The data and the matrix values are chosen randomly from the range -256 to +255. Note that the matrix multiplication intrinsic has 33-bit saturation of the 32-bit accumulation, which can differ from the non-intrinsic case. The value range was chosen so that the natural C code and the intrinsic code produce exactly the same results.
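A quick way to sanity-check the chosen value range is to bound the largest magnitude any of the eight sums can reach. This is a sketch with hypothetical helper names. It only shows that the 32-bit sums cannot overflow for this range and vector count, which is presumably part of why the two versions agree; it does not model the saturation behaviour of the intrinsic itself.

```c
#include <stdint.h>

/* Worst-case magnitude of one accumulator after n vectors, when every
   input and matrix value has magnitude at most max_abs. Each output
   component is a sum of 4 products (see the C loop). */
int64_t worst_case_accumulator(int64_t max_abs, int64_t n)
{
    const int64_t products_per_component = 4;
    return products_per_component * max_abs * max_abs * n;
}

/* Returns 1 if the magnitude fits in a signed 32-bit accumulator. */
int fits_in_int32(int64_t magnitude)
{
    return magnitude <= INT32_MAX;
}
```

For values in the range -256 to +255 and 1024 vectors, `worst_case_accumulator(256, 1024)` is 4 * 65536 * 1024 = 268435456, that is 2^28, which is well below the 2^31 - 1 limit of a 32-bit signed sum.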

     

    Some notes:

    1. The EVM and the tools are in little-endian mode, so the order of the real and imaginary parts when reading 64-bit values from memory requires special attention.
    2. To see the values of each operation, the code has commented-out print functions that print the values of 64-bit and 128-bit registers, and other values. The user can uncomment the print functions if desired.
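To make the first note concrete, the sketch below models what a little-endian 64-bit load returns for four consecutive 16-bit elements: the element at the lowest address ends up in the least significant 16 bits. The helper names are hypothetical, and the code builds the value explicitly so it behaves the same on any host. Which lanes the matrix-multiply intrinsic expects to hold the real and imaginary parts is not covered here; check the source in the enclosed project for that.

```c
#include <stdint.h>

/* Value a little-endian 64-bit load would return for v[0..3] stored at
   increasing addresses: v[0] lands in bits 0..15, v[3] in bits 48..63. */
uint64_t pack_le64(const int16_t v[4])
{
    uint64_t word = 0;
    for (int k = 0; k < 4; k++) {
        word |= (uint64_t)(uint16_t)v[k] << (16 * k);
    }
    return word;
}

/* Extract 16-bit lane k (0 = least significant) as a signed value.
   The conversion back to int16_t assumes the usual two's complement
   behaviour of mainstream compilers. */
int16_t lane16(uint64_t word, int k)
{
    return (int16_t)(uint16_t)(word >> (16 * k));
}
```

With the read order of the C loop (ui1, ur1, ui2, ur2), this means ui1 sits in lane 0 and ur2 in lane 3 of the loaded 64-bit value.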

     

     

     The following is the actual SPLOOP code:

    *----------------------------------------------------------------------------*
    $C$L1:    ; PIPED LOOP PROLOG

               SPLOOP  1       ;10               ; (P)
    ||         MV      .L2X    A16,B16

    ;** --------------------------------------------------------------------------*
    $C$L2:    ; PIPED LOOP KERNEL

               SPMASK          L2
    ||         MV      .L2X    A17,B19
    ||         LDDW    .D1T1   *A3++,A17:A16     ; |84| (P) <0,0>

               SPMASK          L2
    ||         MV      .L2X    A23,B18

               SPMASK          L2
    ||         MV      .L2X    A7,B17

               SPMASK          L1,L2
    ||         MV      .L2X    A9,B9
    ||         MV      .L1X    B21,A9

               SPMASK          L1,L2
    ||         MV      .L2X    A8,B8
    ||         MV      .L1X    B20,A8

               CMATMPY .M2X    A17:A16,B11:B10:B9:B8,B7:B6:B5:B4 ; |84| (P) <0,5>
    ||         CMATMPY .M1     A17:A16,A11:A10:A9:A8,A7:A6:A5:A4 ; |87| (P) <0,5>

               NOP             2
               NOP             1

               SPKERNEL 9,0
    ||         DADD    .L2     B17:B16,B7:B6,B17:B16 ; |90| <0,9>
    ||         DADD    .S2     B19:B18,B5:B4,B19:B18 ; |92| <0,9>
    ||         DADD    .L1     A19:A18,A7:A6,A19:A18 ; |94| <0,9>
    ||         DADD    .S1     A21:A20,A5:A4,A21:A20 ; |96| <0,9>

    ;** --------------------------------------------------------------------------*
    $C$L3:    ; PIPED LOOP EPILOG
    ;** --------------------------------------------------------------------------*
               NOP             1
               MV      .L1X    B16,A16
               MV      .L1X    B19,A17
               MV      .L1X    B18,A23
               MV      .L1X    B17,A7

    ;** --------------------------------------------------------------------------*

    0804.CMATMPY_DEMO.zip