confused with "Peak MMACS" on C6678 specification

wu sheng-hung

Prodigy 100 points

Other Parts Discussed in Thread: TMS320C6678

Hi,

The TI website tells us that the Peak MMACS for the C6678 is 320,000.

I don't understand what that means.

Does it mean that the eight cores of the C6678 together can perform 320,000 million multiply-accumulates in one second?

(Of course, that would be the best case.)

If that's correct, the performance is really, really good. But how can I achieve it?

If not, what is the true meaning of "Peak MMACS"?

Thanks a lot

over 13 years ago

0 Chad Courtney over 13 years ago

TI__Mastermind 30825 points

Wu,

Yes, it's based on peak performance, specifically for a complex conjugate matrix multiplication routine. There you effectively get 16 multiplies per cycle from the instruction, and it runs on 2 functional units (.M1 and .M2), so that's 32 multiplies per cycle per core. Across the 8 cores that gives you 256 per cycle on the TMS320C6678. Running at 1.25 GHz, that's 320 GMACS (320,000 MMACS) for this specific operation.

Bringing this back down to earth somewhat, there are a lot of other SIMD-type operations, applicable over a wider range of applications, that will put it in the 80 GMACS range.
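The arithmetic behind that figure can be written out as a small C sketch. The function name `peak_mmacs` and its parameters are made up for illustration (this is not a TI API); the inputs are simply the numbers quoted above: 16 multiplies per instruction, 2 .M units, 8 cores, 1.25 GHz.

```c
#include <assert.h>

/* Hypothetical helper (not a TI API): peak multiply-accumulate rate in
 * millions of MACs per second (MMACS).
 *   macs_per_instr : multiplies one instruction performs per cycle
 *   units_per_core : functional units issuing it each cycle (.M1, .M2)
 *   cores          : number of DSP cores on the device
 *   clock_mhz      : core clock in MHz */
long peak_mmacs(long macs_per_instr, long units_per_core,
                long cores, long clock_mhz)
{
    assert(macs_per_instr > 0 && units_per_core > 0);
    assert(cores > 0 && clock_mhz > 0);

    long macs_per_cycle = macs_per_instr * units_per_core * cores;

    /* MACs per cycle times millions of cycles per second gives MMACS. */
    return macs_per_cycle * clock_mhz;
}
```

With the values from this post, `peak_mmacs(16, 2, 8, 1250)` evaluates to 320000, i.e. 320 GMACS, and a single core comes out at 40000 MMACS.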

Best Regards,
Chad

0 wu sheng-hung over 13 years ago in reply to Chad Courtney

Prodigy 100 points

Thank you!

So you mean the average is around 80 GMACS.

But I still have some questions about it.

Let me give an example:

short a[100]; short b[100]; short c[100];

for (i = 0; i < 100; i++) c[i] = c[i] + a[i] * b[i];

Suppose it performs a total of cnt MACs.

Does a single core then need cnt / 40G seconds to run this C code (in the best case)?

Also, how can I get higher MMACS?

Thanks

0 Chad Courtney over 13 years ago in reply to wu sheng-hung

TI__Mastermind 30825 points

This one is probably not the best candidate for optimization, since, the way it's written, you're storing each successive result to memory. But you could still get 2 MACs per cycle doing this with straightforward instructions.
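As a rough sanity check on the timing question, the idealized loop time follows from the MACs per cycle and the clock rate. The helper below is a hypothetical sketch: it assumes the core sustains the stated MAC rate on every cycle, and it ignores loop setup, pipeline fill and drain, and memory stalls, so real times will be longer.

```c
#include <assert.h>

/* Hypothetical helper: idealized execution time in nanoseconds for a
 * loop of mac_count multiply-accumulates on one core.
 * Assumes macs_per_cycle is sustained every cycle with no stalls. */
double best_case_ns(long mac_count, long macs_per_cycle, double clock_mhz)
{
    assert(mac_count >= 0 && macs_per_cycle > 0 && clock_mhz > 0.0);

    /* Round up: a partially filled final cycle still costs one cycle. */
    long cycles = (mac_count + macs_per_cycle - 1) / macs_per_cycle;

    /* One cycle lasts 1000 / clock_mhz nanoseconds. */
    return (double)cycles * 1000.0 / clock_mhz;
}
```

For the 100-element loop at 2 MACs per cycle and an assumed 1.25 GHz clock, `best_case_ns(100, 2, 1250.0)` gives 50 cycles, or 40 ns.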

If instead you only really cared about a single accumulated value (the final sum), which would be much more common in signal processing, you could write:

short a[100]; short b[100]; int c = 0;

for (i = 0; i < 100; i++) c += a[i] * b[i];

Then a kernel could be created to do this with 4 multiply-accumulates per cycle. There would effectively be 2 partial results per cycle, each accumulated over successive cycles, and on the last cycle the two are added together.
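The accumulation pattern described here can be sketched in portable C. This is only an illustration of the idea (two independent partial sums, four products per iteration, one final add); the function name is hypothetical, it is not the kernel the compiler or a hand-written assembly routine would actually produce, and it assumes the running sums fit in an `int`.

```c
#include <assert.h>

/* Portable sketch: four multiply-accumulates per loop iteration, split
 * across two independent partial sums that are combined at the end.
 * On the DSP each pair of products would map to one packed dot-product
 * instruction; here it is written as plain C for illustration. */
int dot_product_two_acc(const short *a, const short *b, int n)
{
    assert(a != 0 && b != 0 && n >= 0);

    int acc0 = 0;  /* partial sum for elements i and i+1   */
    int acc1 = 0;  /* partial sum for elements i+2 and i+3 */
    int i = 0;

    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i]     + a[i + 1] * b[i + 1];
        acc1 += a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3];
    }

    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++) {
        acc0 += a[i] * b[i];
    }

    return acc0 + acc1;  /* final add of the two partial results */
}
```

Keeping the two sums independent is what lets both multiplier units stay busy on every cycle, since neither accumulation has to wait for the other.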

You can use dot product instructions for this, such as DOTP2 for 16-bit short data, which gives 40 GMACS performance. For 8-bit data, DOTP4 instructions would yield 80 GMACS performance.
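To show why 8-bit data doubles the rate, here is a portable emulation of what a packed 8-bit dot product computes: four byte products summed from one pair of 32-bit words. The function names are hypothetical and this is not the TI intrinsic; it treats the bytes as unsigned and assumes the array length is a multiple of 4. Assuming one such instruction per .M unit per cycle, that is 8 MACs per cycle per core, which at 1.25 GHz across 8 cores is consistent with the 80 GMACS figure (and 4 MACs per cycle with 16-bit data with 40 GMACS).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical emulation of a packed 8-bit dot product: multiply the
 * four unsigned bytes of x by the matching bytes of y and sum them. */
uint32_t dotp4_u8_emulated(uint32_t x, uint32_t y)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t xb = (x >> (8 * lane)) & 0xFFu;
        uint32_t yb = (y >> (8 * lane)) & 0xFFu;
        sum += xb * yb;  /* at most 4 * 255 * 255, fits in 32 bits */
    }
    return sum;
}

/* Pack four consecutive bytes into one 32-bit word, lowest index in
 * the lowest byte lane. */
static uint32_t pack4(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Dot product of two unsigned 8-bit arrays, four elements per step.
 * Assumes n is a multiple of 4 to keep the sketch short. */
uint32_t dot_product_u8(const uint8_t *a, const uint8_t *b, int n)
{
    assert(a != 0 && b != 0 && n >= 0 && n % 4 == 0);

    uint32_t acc = 0;
    for (int i = 0; i < n; i += 4) {
        acc += dotp4_u8_emulated(pack4(a + i), pack4(b + i));
    }
    return acc;
}
```

On the actual device the packing is free when the data is already laid out contiguously in memory, because a 32-bit load fetches four bytes at once.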

Best Regards,

Chad
