How can a DSP's MAC resources be used in parallel to perform multiple multiplications in one cycle?

Yang Lu99085

Other Parts Discussed in Thread: TMS320C6678

Hello,

I plan to use the TMS320C6678 to run an algorithm. According to the technical documentation, the TMS320C6678 can perform 256 16x16-bit fixed-point multiplies or 64 floating-point multiplies per clock cycle. My question is: how is this achieved in practice? Is it done with a specific instruction such as MPY, or by scheduling the pipeline appropriately?

Thanks a lot.

over 13 years ago

0 Uday over 13 years ago

TI__Expert 4920 points

Hello Yang,

There are several techniques you can use to optimize your code and achieve optimum performance. You can go through the C6000 DSP Optimization Guide at http://www.ti.com/lit/an/sprabf2/sprabf2.pdf for more details. There is also a workshop that TI hosts on C6000 DSP optimization. You can find details and register for the workshop at http://focus.ti.com/docs/training/catalog/events/event.jhtml?sku=4DW102260 or you can download the workshop collateral from http://processors.wiki.ti.com/index.php/TMS320C6000_DSP_Optimization_Workshop

0 Yang Lu99085 over 13 years ago in reply to Uday

Intellectual 590 points

Hello,

Thank you very much for your reply. The document you provided is quite useful. However, I think the programming optimization approaches it describes are far from enough on their own to reach the degree of parallelism the TMS320C6678 is capable of. I still don't know how to perform 256 multiplications in one CPU cycle. Is there an example that illustrates this?

Best regards,

Yang

0 Allen Lee over 13 years ago in reply to Yang Lu99085

Genius 3770 points

Hi Yang,

This can only be achieved with special instructions such as CMATMPY. But I think the number is only a reference figure if those instructions don't fit your application. In other words, it depends on whether you can get that much data into the proper registers before execution, whether your algorithm has dependencies between the multiplications (e.g. multiplier #0 needs the result of multiplier #1), and so on.

So generally speaking, I don't think it is very meaningful to dig deeply into the peak multiplier capability; it is better to apply the optimizations that are suitable and feasible for your target application. This is just my opinion, and further discussion is welcome.
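For reference, the headline figures can be decomposed with simple arithmetic. The sketch below assumes the C66x figures as I recall them from the device documentation (8 cores, two .M units per core, sixteen 16x16 multipliers and four single-precision multipliers per .M unit); none of these numbers is stated in this thread, and the function names are only illustrative.

```c
#include <assert.h>

/* Figures assumed from the C66x documentation, not from this thread. */
enum {
    CORES            = 8,   /* C66x cores on the TMS320C6678       */
    M_UNITS_PER_CORE = 2,   /* .M1 and .M2                         */
    MPY16_PER_M_UNIT = 16,  /* 16x16-bit multipliers per .M unit   */
    MPYSP_PER_M_UNIT = 4    /* single-precision multiplies per .M  */
};

/* Peak 16x16 multiplies per cycle on one core. */
static inline int peak_mpy16_per_core(void)
{
    return M_UNITS_PER_CORE * MPY16_PER_M_UNIT;
}

/* Peak 16x16 multiplies per cycle on the whole device. */
static inline int peak_mpy16_per_device(void)
{
    int total = CORES * peak_mpy16_per_core();
    assert(total > 0);
    return total;
}

/* Peak single-precision multiplies per cycle on the whole device. */
static inline int peak_mpysp_per_device(void)
{
    return CORES * M_UNITS_PER_CORE * MPYSP_PER_M_UNIT;
}
```

Under these assumptions the 256 figure is a whole-device peak, so a single core would contribute at most 32 of those multiplies per cycle, and only when every multiplier in both .M units is kept busy.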

Allen

0 may may92122 over 13 years ago in reply to Allen Lee

Expert 1030 points

Hi Allen,

How can I use instructions such as CMATMPY and FMPYSP? When the main framework is written in C, how do I insert these instructions?

Thanks,

May

0 Allen Lee over 13 years ago in reply to may may92122

Genius 3770 points

Hi, please refer to my reply in another thread: http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/170468.aspx
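As a rough illustration of the usual approach from C, SIMD instructions are typically reached through compiler intrinsics rather than inline assembly. The sketch below is not taken from the linked thread. It assumes the TI compiler's `c6x.h` header, the `_TMS320C6600` predefined macro, and the `_mpy2ll`, `_loll` and `_hill` intrinsics; please check these against the compiler guide for your toolchain version. The helper names `pack2` and `mpy2_pair` are hypothetical, and the `#else` branch is a plain C fallback so the code also builds on a host PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef _TMS320C6600
#include <c6x.h>   /* TI intrinsics; assumed to be available on the target */
#endif

/* Pack two signed 16-bit values into one 32-bit word (hi:lo). */
static inline uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Multiply two packed pairs of signed 16-bit values.
 * prod[0] receives the product of the low halves,
 * prod[1] the product of the high halves. */
static inline void mpy2_pair(uint32_t a, uint32_t b, int32_t prod[2])
{
    assert(prod != NULL);
#ifdef _TMS320C6600
    /* Assumed intrinsic: one MPY2 instruction, two 32-bit products. */
    long long p = _mpy2ll((int)a, (int)b);
    prod[0] = (int32_t)_loll(p);
    prod[1] = (int32_t)_hill(p);
#else
    /* Portable fallback with the same lane layout. */
    prod[0] = (int32_t)(int16_t)(a & 0xFFFFu) * (int16_t)(b & 0xFFFFu);
    prod[1] = (int32_t)(int16_t)(a >> 16) * (int16_t)(b >> 16);
#endif
}
```

A call such as `mpy2_pair(pack2(3, -2), pack2(4, 5), prod)` multiplies the low halves and the high halves separately. Wider instructions would follow the same pattern with their own intrinsics and wider operand types.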

0 Yang Lu99085 over 13 years ago in reply to Allen Lee

Intellectual 590 points

Hi Allen,

Is there any instruction that can perform sixteen 16x16-bit signed real-valued multiplications per clock cycle? The CMATMPY instruction you mentioned performs a complex matrix multiply, which does not fit my application. What I want to implement is the following:

s1(1)*s2(1)=d1; s1(2)*s2(2)=d2; s1(3)*s2(3)=d3; s1(4)*s2(4)=d4; s1(5)*s2(5)=d5; s1(6)*s2(6)=d6; s1(7)*s2(7)=d7; s1(8)*s2(8)=d8;

s1(9)*s2(9)=d9; s1(10)*s2(10)=d10; s1(11)*s2(11)=d11; s1(12)*s2(12)=d12; s1(13)*s2(13)=d13; s1(14)*s2(14)=d14; s1(15)*s2(15)=d15; s1(16)*s2(16)=d16;

Can these sixteen multiplications be implemented with a SIMD instruction?

Thanks very much!

Sincerely,

Yang

0 Allen Lee over 13 years ago in reply to Yang Lu99085

Genius 3770 points

Hi Yang,

Suppose that all the data is already in registers before the multiplication:

s1(1) -> A16l, s1(2) -> A16h

s1(3) -> A17l, s1(4) -> A17h

s2(1) -> A18l, s2(2) -> A18h

s2(3) -> A19l, s2(4) -> A19h

s1(9) -> B16l, s1(10) -> B16h

s1(11) -> B17l, s1(12) -> B17h

s2(9) -> B18l, s2(10) -> B18h

s2(11) -> B19l, s2(12) -> B19h

……

Then execute the MPY operation as:

DMPY2 A17:A16, A19:A18, A23:A22:A21:A20

|| DMPY2 B17:B16, B19:B18, B23:B22:B21:B20

DMPY2 A25:A24, A27:A26, A31:A30:A29:A28

|| DMPY2 B25:B24, B27:B26, B31:B30:B29:B28

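So I'm afraid you need 2 cycles to complete the calculation with the DMPY2 instruction. Each DMPY2 performs four 16x16 multiplies, and one DMPY2 can issue on each of the two .M units per cycle, so eight products are issued per cycle.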
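To make the counting concrete, here is a small portable C model of that schedule. It is only a sketch: `dmpy2_model` and `mul16_elementwise` are hypothetical names, the model pairs elements 1-4 with 5-8 in each step (a simplification that differs from the register assignment above), and it counts issue slots only, ignoring instruction latency and the loads needed to fill the registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of one DMPY2: four signed 16x16 multiplies giving
 * four 32-bit products. */
static inline void dmpy2_model(const int16_t a[4], const int16_t b[4],
                               int32_t d[4])
{
    for (int i = 0; i < 4; ++i)
        d[i] = (int32_t)a[i] * (int32_t)b[i];
}

/* Element-wise product d[i] = s1[i] * s2[i] for n values.
 * n must be a multiple of 8 (two DMPY2s per issue cycle).
 * Returns the number of issue cycles used in this model. */
static inline size_t mul16_elementwise(const int16_t *s1, const int16_t *s2,
                                       int32_t *d, size_t n)
{
    assert(n % 8 == 0);
    size_t cycles = 0;
    for (size_t i = 0; i < n; i += 8) {
        dmpy2_model(s1 + i,     s2 + i,     d + i);      /* would issue on .M1 */
        dmpy2_model(s1 + i + 4, s2 + i + 4, d + i + 4);  /* in parallel on .M2 */
        ++cycles;
    }
    return cycles;
}
```

With n = 16 the loop body runs twice, which matches the two execute packets above.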

Allen
