Confused by "Peak MMACS" in the C6678 specification

Other Parts Discussed in Thread: TMS320C6678

Hi,

  The TI website says that the peak MMACS for the C6678 is 320,000.

   I don't understand what this means.

   Does it mean that the eight cores of the C6678 can together perform 320,000 million multiply-accumulates per second?

   (Of course, this assumes the best case.)

   If that's correct, the performance is really good. But how can I achieve it?

   If not, what is the true meaning of "Peak MMACS"?

  

 Thanks a lot.

  • Wu,

    Yes, it's the peak figure, based specifically on a complex conjugate matrix multiplication routine. That instruction effectively gives you 16 multiplies per cycle, and it runs on 2 functional units (.M1 and .M2), so that's 32 multiplies per cycle per core. With 8 cores, that gives you 256 per cycle on the TMS320C6678. Running at 1.25 GHz, that's 320 GMACS (320,000 MMACS), for this specific operation.

    Bringing this back down to earth somewhat, there are a lot of other SIMD-type operations, applicable to a wider range of applications, that will land in the 80 GMACS range.
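
    The arithmetic above can be written out as a small helper. This is only a sketch: the function and parameter names are made up for illustration and are not taken from TI documentation, and the result is a theoretical ceiling, not a measured figure.

```c
#include <assert.h>

/* Theoretical peak throughput in MMACS (millions of multiply-accumulates
 * per second). All names here are illustrative assumptions.
 *
 *   macs_per_m_unit  : MACs one .M unit completes per cycle for the
 *                      instruction in question
 *   m_units_per_core : number of .M units per core (.M1 and .M2 -> 2)
 *   cores            : number of DSP cores on the device
 *   clock_mhz        : core clock in MHz
 */
double peak_mmacs(int macs_per_m_unit, int m_units_per_core,
                  int cores, double clock_mhz)
{
    assert(macs_per_m_unit > 0 && m_units_per_core > 0 && cores > 0);
    /* MACs per cycle for the whole device, times millions of cycles per second. */
    return (double)macs_per_m_unit * m_units_per_core * cores * clock_mhz;
}
```

    With the numbers from this reply, peak_mmacs(16, 2, 8, 1250.0) gives 320,000 MMACS. An operation that sustains 4 MACs per .M unit per cycle instead gives peak_mmacs(4, 2, 8, 1250.0), which is 80,000 MMACS.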

    Best Regards,
    Chad

  • Thank you !!

    So you mean the average is around 80 GMACS?

    But I still have some questions about it.

    Let me give an example:

    short a[100];   short b[100];  short c[100];

    for (i = 0; i < 100; i++)    c[i] = c[i] + a[i] * b[i];

    Suppose it performs a total of cnt MACs.

    Does a single core then need cnt / 40G seconds to run the C code (in the best case)?

    Also, how can I get higher MMACS?

    Thanks

     

    This one probably won't optimize as well, since, the way it's written, you're storing each successive result to memory. But you could still get 2 MACs per cycle doing this, with straightforward instructions.
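
    As a rough sketch of the element-wise loop and of what "2 MACs per cycle" means for run time: the function names below are hypothetical, the use of restrict is an assumption that the three arrays never overlap, and the timing helper ignores loop setup, memory stalls, and everything else that keeps real code below the ideal.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise multiply-accumulate: c[i] = c[i] + a[i] * b[i].
 * 'restrict' tells the compiler the arrays do not overlap, which helps it
 * schedule loads, multiplies, and stores from different iterations together.
 * The result is narrowed back to short, as in the original loop. */
void mac_elementwise(short *restrict c, const short *restrict a,
                     const short *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = (short)(c[i] + a[i] * b[i]);
    }
}

/* Ideal run time in seconds for mac_count MACs at a sustained rate of
 * macs_per_cycle on a core clocked at clock_hz. Best case only. */
double ideal_seconds(double mac_count, double macs_per_cycle, double clock_hz)
{
    assert(macs_per_cycle > 0.0 && clock_hz > 0.0);
    return mac_count / (macs_per_cycle * clock_hz);
}
```

    For the 100-element loop at 2 MACs per cycle on one 1.25 GHz core, ideal_seconds(100.0, 2.0, 1.25e9) works out to 50 cycles, or about 40 ns.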

    If instead you only really cared about the final accumulated value (the sum of all the products), which is much more common in signal processing:

    short a[100];   short b[100];   int c = 0;

    for (i = 0; i < 100; i++)    c += a[i] * b[i];

    then a kernel could be created to do this with 4 multiply-accumulates per cycle. There would effectively be 2 partial results per cycle, which are accumulated on successive cycles, and on the last cycle the two are added together.

    You can use the dot product instructions for this: DOTP2 on 16-bit short data gives 40 GMACS performance, and DOTP4 on 8-bit data would yield 80 GMACS.
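
    The two-accumulator structure described above can be sketched in portable C. This is an illustration only: the function name is hypothetical, no TI intrinsics are used, and whether a given compiler turns each statement into a DOTP2 is not guaranteed. It assumes n is a multiple of 4 and that the sum fits in 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of two 16-bit arrays using two independent accumulators.
 * Each statement in the loop body covers two multiply-accumulates, the
 * work of one packed 2x16-bit dot product, so one iteration is 4 MACs.
 * Assumptions: n is a multiple of 4; the total fits in int32_t. */
int32_t dot_product_2acc(const int16_t *restrict a,
                         const int16_t *restrict b, size_t n)
{
    assert(n % 4 == 0);
    int32_t acc0 = 0;   /* partial sum for the first pair of each group  */
    int32_t acc1 = 0;   /* partial sum for the second pair of each group */

    for (size_t i = 0; i < n; i += 4) {
        acc0 += a[i]     * b[i]     + a[i + 1] * b[i + 1];
        acc1 += a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3];
    }

    /* Final step: add the two partial results together. */
    return acc0 + acc1;
}
```

    Keeping two accumulators means the two partial sums do not depend on each other inside the loop, so they can be updated in the same cycle on different functional units.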

    Best Regards,

    Chad