C6678 DSP performance: why 40 GMAC and 20 GFLOP?

Other Parts Discussed in Thread: TMS320C6678

Why does page 18 say that each unit performs four 16 × 16 multiplies with add/subtract per clock cycle, while page 13 says 40 GMAC? What does the 40 G consist of?

Why does page 18 say that each unit performs four single-precision multiplies per clock cycle, while page 13 says 20 GFLOP? What does the 20 G consist of?

Why does page 14 say 8 single-precision floating-point MAC operations per cycle, while page 18 says four single-precision multiplies per unit? Sometimes the figure is in MACs and sometimes in multiplies, and I don't understand what is meant.

 

TMS320C6678 Multicore Fixed and Floating-Point Digital Signal Processor Data Manual

Literature Number:  SPRS691C February 2012

 

PAGE 18

Each C66x .M unit can perform one of the following fixed-point operations each clock cycle: four 32 × 32 bit multiplies, sixteen 16 × 16 bit multiplies, four 16 × 32 bit multiplies, four 8 × 8 bit multiplies, four 8 × 8 bit multiplies with add operations, and four 16 × 16 multiplies with add/subtract capabilities

Each C66x .M unit can also perform one of the following floating-point operations each clock cycle: one, two, or four single-precision multiplies or a complex single-precision multiply.

PAGE 13

40 GMAC/Core for Fixed Point @ 1.25 GHz

20 GFLOP/Core for Floating Point @ 1.25 GHz

PAGE 14:

the C66x core integrates floating point capability and the per core raw computational performance is an industry-leading 32 MACS/cycle and 16 flops/cycle. It can execute 8 single precision floating point MAC operations per cycle and can perform double- and mixed-precision operations and is IEEE754 compliant

  • Welcome to the TI E2E forum. I hope you will find many good answers here and in the TI.com documents and in the TI Wiki Pages (for processor issues). Be sure to search those for helpful information and to browse for the questions others may have asked on similar topics (e2e.ti.com). Please read all the links below my signature.

    We will get back to you on the above query shortly. Thank you for your patience.

    Note: We strongly recommend that you create a new E2E thread for your queries instead of following up on an old or closed thread. New threads get more attention than old ones, and you can link to the old thread or include its details in the new post for clarity and a faster response.

  • Please refer to the thread below, which may help you understand.

    Thank you.

  • The 32 16x16 MAC(-ish) operations per cycle are calculated assuming that both multiply units execute CMATMPY, or one of its variants, every cycle. CMATMPY calculates the product of a 2x1 complex vector and a 2x2 complex matrix. This requires four complex multiplies and two complex additions. Each complex multiply requires four scalar 16x16 multiplies and two additions of the result. 4*4=16 MAC(-ish) operations per cycle per .M unit, times two units=32 MACs per cycle. (I say "MAC-ish" because there is one less addition than multiply for each output component of CMATMPY, for a total of 16 [16x16->32] multiplies and 12 [32-bit] additions per CMATMPY instruction. Other functional units can perform the remaining additions, so you can get full MACs in the end.)
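
    The operation count above can be checked with a small reference model. This is only a sketch in plain Python, not the instruction itself: the helper names (`OpCounter`, `cmul`, `cadd`, `vec_times_matrix`) are hypothetical, the vector is treated as a row vector by assumption, and the real CMATMPY's fixed-point scaling, rounding, and saturation are not modelled.

```python
from dataclasses import dataclass


@dataclass
class OpCounter:
    """Tallies scalar multiplies and additions (hypothetical helper)."""
    multiplies: int = 0
    additions: int = 0


def cmul(a, b, ops):
    """Complex multiply on (re, im) pairs: 4 scalar multiplies, 2 add/subtracts."""
    ar, ai = a
    br, bi = b
    ops.multiplies += 4
    ops.additions += 2
    return (ar * br - ai * bi, ar * bi + ai * br)


def cadd(a, b, ops):
    """Complex addition on (re, im) pairs: 2 scalar additions."""
    ops.additions += 2
    return (a[0] + b[0], a[1] + b[1])


def vec_times_matrix(v, m, ops):
    """[v0 v1] times a 2x2 complex matrix, giving two complex outputs.

    Assumption: row-vector convention, with m[row][col] indexing.
    """
    out = []
    for col in range(2):
        p0 = cmul(v[0], m[0][col], ops)
        p1 = cmul(v[1], m[1][col], ops)
        out.append(cadd(p0, p1, ops))
    return out


ops = OpCounter()
v = [(1, 2), (3, 4)]
m = [[(5, 6), (7, 8)], [(9, 10), (11, 12)]]
result = vec_times_matrix(v, m, ops)

# 4 complex multiplies and 2 complex additions in total
print(ops.multiplies, ops.additions)  # 16 12
```

    Each of the four real outputs is a sum of four products, which is where the "one addition fewer than multiplies per output component" comes from.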

    If your application uses non-matrix complex arithmetic, the best you can get is eight 16x16 multiplies per .M unit per cycle with DCMPY and its variants. If your application uses non-complex arithmetic, I think the best you can get is four multiplies per .M unit per cycle, which a variety of instructions provide.
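
    To put the three cases side by side, the sketch below turns the per-.M-unit multiply counts from this reply into peak rates per core at the 1.25 GHz clock quoted from page 13. The dictionary labels and the function name are made up for illustration, and the result is a theoretical ceiling, not a measured figure.

```python
CLOCK_GHZ = 1.25        # clock rate quoted on page 13 of the data manual
M_UNITS_PER_CORE = 2    # both multiply units issuing every cycle

# 16x16 multiplies per .M unit per cycle, as stated in this reply
MULTIPLIES_PER_M_UNIT = {
    "complex matrix (CMATMPY)": 16,
    "complex non-matrix (DCMPY)": 8,
    "non-complex": 4,
}


def peak_gmults_per_core(per_unit, units=M_UNITS_PER_CORE, clock_ghz=CLOCK_GHZ):
    """Peak billions of 16x16 multiplies per second for one core."""
    return per_unit * units * clock_ghz


peaks = {kind: peak_gmults_per_core(n) for kind, n in MULTIPLIES_PER_M_UNIT.items()}

for kind, rate in peaks.items():
    print(f"{kind}: {rate:g} G multiplies/s per core")
```

    Only the CMATMPY case reaches the 40 G headline number; the other two cases top out at half and a quarter of it.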

    The FLOPS figure is calculated similarly: 4 multiplies per cycle on each of the two .M units (CMPYSP or QMPYSP), plus 2 adds per cycle on each of the two .L and two .S units (DADDSP), so 8 + 8 = 16 floating-point operations per cycle.
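
    As a worked check, the arithmetic can be written out as below. It assumes two .M, two .L, and two .S units per core and the 1.25 GHz clock from page 13; the variable names are illustrative. Pairing each multiply with an add is also how 16 flops per cycle lines up with the "8 single precision floating point MAC operations per cycle" quoted from page 14.

```python
CLOCK_GHZ = 1.25                    # clock rate quoted on page 13
M_UNITS, L_UNITS, S_UNITS = 2, 2, 2  # assumed functional units per core

mults_per_cycle = M_UNITS * 4                  # CMPYSP or QMPYSP: 4 multiplies each
adds_per_cycle = (L_UNITS + S_UNITS) * 2       # DADDSP: 2 adds each
flops_per_cycle = mults_per_cycle + adds_per_cycle

# One MAC = one multiply paired with one add
macs_per_cycle = min(mults_per_cycle, adds_per_cycle)

gflops_per_core = flops_per_cycle * CLOCK_GHZ

print(flops_per_cycle, macs_per_cycle, gflops_per_core)  # 16 8 20.0
```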

    In any case, these can only be sustained if your inner loop runs mostly from registers, because load/store bandwidth is only 2x64 bits per cycle. Nobody ever said benchmarketing was closely tied to real-world applications :)
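
    A rough way to see the bandwidth limit is to compare the operand bits that two CMATMPY-style instructions would consume per cycle with the 2x64-bit load path. The sketch assumes each complex element is 16-bit real plus 16-bit imaginary, that each instruction reads a two-element vector and a four-element matrix, and that result traffic is ignored; the variable names are illustrative.

```python
BITS_PER_COMPLEX16 = 2 * 16          # assumed: 16-bit real + 16-bit imaginary
VECTOR_ELEMS, MATRIX_ELEMS = 2, 4    # two-element vector, 2x2 matrix
M_UNITS = 2
LOAD_BITS_PER_CYCLE = 2 * 64         # load/store bandwidth stated above

operand_bits_per_instr = (VECTOR_ELEMS + MATRIX_ELEMS) * BITS_PER_COMPLEX16
demand_bits_per_cycle = operand_bits_per_instr * M_UNITS

# How many times over the operands must come from registers rather than memory
reuse_factor_needed = demand_bits_per_cycle / LOAD_BITS_PER_CYCLE

print(operand_bits_per_instr, demand_bits_per_cycle, reuse_factor_needed)  # 192 384 3.0
```

    Under these assumptions, roughly two thirds of the operand bits have to be reused from registers for both .M units to stay busy every cycle.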
  • thank you very much!