How can the DSP's MAC resources be used in parallel to perform many multiplications in one cycle?

Other Parts Discussed in Thread: TMS320C6678

Hello,

I plan to use the TMS320C6678 to run an algorithm. According to the technical documentation, the TMS320C6678 can perform 256 16x16-bit fixed-point multiplies or 64 floating-point multiplies each clock cycle. My question is: how do I achieve this? By using a certain instruction such as MPY, or by setting up the pipeline properly?

Thanks a lot.

  • Hello Yang,

    There are several techniques you can use to optimize your code and achieve optimum performance. You can go through the C6000 DSP Optimization Guide at http://www.ti.com/lit/an/sprabf2/sprabf2.pdf for more details. There is also a workshop that TI hosts on C6000 DSP optimization. You can find details and register for the workshop at http://focus.ti.com/docs/training/catalog/events/event.jhtml?sku=4DW102260 or you can download the workshop collateral from http://processors.wiki.ti.com/index.php/TMS320C6000_DSP_Optimization_Workshop

  • Hello,

    Thank you very much for your reply. I find the document you provided quite useful. However, I think the programming optimization approaches it describes are far from enough on their own to reach the degree of parallelism that the TMS320C6678 is capable of. I still don't know how to perform 256 multiplications in one CPU cycle. Is there an example that illustrates this?

    Best regards,

    Yang

  • Hi Yang,

    This can only be achieved through special types of instructions, such as CMATMPY. But I think it is just a reference figure if the instruction doesn't fit the way your application is implemented. In other words, it depends on whether you are able to load that much data into the proper registers before execution, and on whether your algorithm has dependencies between these multiplications (e.g. Multiplier #0 needs the result of Multiplier #1), and so on.

    So generally speaking, I don't think it is very meaningful to dig deeply into the peak multiplier capability, rather than making the optimizations that are suitable and feasible for your target application. This is just my opinion; further discussion is welcome.

    Allen

  • Hi Allen,

    How can I use instructions such as CMATMPY and FMPYSP? When the main framework is written in C, how do I insert these instructions?

    Thanks,

    May

  • Hi Allen,

    Is there any instruction that can perform sixteen 16x16-bit signed real-valued multiplications per clock cycle? The CMATMPY instruction you mentioned performs a complex conjugate matrix multiply, which does not fit my application. What I want to implement is the following:

    s1(1)*s2(1)=d1;  s1(2)*s2(2)=d2;  s1(3)*s2(3)=d3;  s1(4)*s2(4)=d4;  s1(5)*s2(5)=d5;  s1(6)*s2(6)=d6;  s1(7)*s2(7)=d7;  s1(8)*s2(8)=d8;

    s1(9)*s2(9)=d9;  s1(10)*s2(10)=d10;  s1(11)*s2(11)=d11;  s1(12)*s2(12)=d12;  s1(13)*s2(13)=d13;  s1(14)*s2(14)=d14;  s1(15)*s2(15)=d15;  s1(16)*s2(16)=d16; 

    Can the above sixteen multiplications be implemented using a certain SIMD instruction?

    Thanks very much!

    Sincerely,

    Yang

  • Hi Yang,

    Suppose that all the data is already in registers before the multiplication:

    s1(1) -> A16l, s1(2) -> A16h

    s1(3) -> A17l, s1(4) -> A17h

    s2(1) -> A18l, s2(2) -> A18h

    s2(3) -> A19l, s2(4) -> A19h

    s1(9) -> B16l, s1(10) -> B16h

    s1(11) -> B17l, s1(12) -> B17h

    s2(9) -> B18l, s2(10) -> B18h

    s2(11) -> B19l, s2(12) -> B19h

    ……

    Then execute the MPY operation as:

       DMPY2 A17:A16,A19:A18,A23:A22:A21:A20

    ||DMPY2 B17:B16,B19:B18,B23:B22:B21:B20

       DMPY2 A25:A24,A27:A26,A31:A30:A29:A28

    ||DMPY2 B25:B24,B27:B26,B31:B30:B29:B28

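    Each DMPY2 performs four 16x16 multiplies, and only one can be issued on each side (A and B) per cycle. So I'm afraid you need 2 cycles to complete the calculation using the DMPY2 instruction.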

     

    Allen
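
For readers who want to check the arithmetic behind the DMPY2 scheme above, here is a minimal sketch in portable C. It models the effect described in the thread (four signed 16x16-bit multiplies per operation, each giving a full 32-bit product) rather than using any TI-specific API. The names `dmpy2_model` and `mul16x16_x16` are hypothetical helpers introduced only for this illustration. On the actual device, the TI compiler is expected to expose DMPY2 through an intrinsic (I believe `_dmpy2`, but please confirm the name and signature in the C6000 compiler user's guide).

```c
#include <stdint.h>

/* Portable model of one DMPY2-style operation: four signed 16x16-bit
 * multiplies, each producing a full 32-bit product.
 * Hypothetical helper for illustration, not a TI API. */
static void dmpy2_model(const int16_t a[4], const int16_t b[4], int32_t d[4])
{
    for (int i = 0; i < 4; ++i) {
        /* Widen before multiplying so the product is computed in 32 bits. */
        d[i] = (int32_t)a[i] * (int32_t)b[i];
    }
}

/* Sixteen element-wise products d[i] = s1[i] * s2[i], grouped four at a
 * time as in the register layout above.
 * Returns the number of DMPY2-style operations that were needed. */
static int mul16x16_x16(const int16_t s1[16], const int16_t s2[16],
                        int32_t d[16])
{
    int ops = 0;
    for (int i = 0; i < 16; i += 4) {
        dmpy2_model(&s1[i], &s2[i], &d[i]);
        ++ops;
    }
    return ops;
}
```

`mul16x16_x16` returns 4: sixteen products take four DMPY2-style operations. With one issued on the A side and one on the B side per cycle, that is two issue cycles, which matches the estimate in the reply above.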