C66x MAC performance

Hello,

I have some questions regarding the MAC performance on the C66x.

I have to perform 32 32x32 MACs as fast as possible on the C66x. The datasheet (SPRS691B) and the document "Optimizing Loops on the C66x DSP" state that the C66x can perform 8 32x32 MACs/cycle. That would mean 4 cycles for all 32 32x32 MACs.

Well, this is not what I see in my performance measurements. When I run, e.g., the DSP_dotprod example from the DSPLib, the C66x needs 63 cycles for 32 16x16 MACs!

Hopefully somebody can comment on this and give some advice?

BR,

     Andreas

 

  • Andreas,

    Welcome to the TI E2E forum. I hope you will find many good answers here and in the TI.com documents and in the TI Wiki Pages. Be sure to search those for helpful information and to browse for the questions others have asked on similar topics.

    In particular, you can search this forum for "MAC performance" (no quotes needed) and you will find other discussions on this issue. You could also try "C66 performance" or "C66x multiplies" and such. You may find more detail in other threads than my simple statements here.

    • The statements in the datasheet are correct.
    • Your measurements of the DSP_dotprod function are probably correct, but they measure a lot more than just the MAC statement in the datasheet.
    • Maximum possible performance is good to know, but measured performance of the whole application is what really matters.
    • Memory and pipeline effects affect your measured performance.

    Regards,
    RandyP

     


  • Hi RandyP,

    OK, I went through the forum but couldn't find an answer ;-)

    Coming back to my original question about the statement that the C66x can perform 8 32x32 MACs/cycle:

    1) MAC == Multiply-Accumulate

    2) The C66x has the following instructions: QMPY32 (fixed) and QMPYSP (float)

         They can do 4 32x32 multiplications (fixed & floating)

         Executing them in both .M-units would result in 8 32x32 multiplications

    But I still don't see how the C66x can perform 8 32x32 MACs/cycle, because there is no DOTP* instruction for 32-bit operands and no ADD that can do 4 32-bit additions.
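The gap described above (a 4-lane multiply exists, but the accumulate is a separate step) can be sketched in portable C. This is only a behavioral model under stated assumptions, not TI intrinsics: `qmpy32_model` and `accumulate_model` are hypothetical names, and the model assumes each lane keeps the low 32 bits of its product.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4 /* one QMPY32 works on four 32-bit lanes */

/* Hypothetical model of a single QMPY32: four independent 32x32
 * multiplies. Assumption: each lane keeps only the low 32 bits of
 * its product. Unsigned arithmetic avoids signed-overflow UB. */
void qmpy32_model(const int32_t a[LANES], const int32_t b[LANES],
                  int32_t prod[LANES])
{
    assert(a && b && prod);
    for (int i = 0; i < LANES; i++) {
        prod[i] = (int32_t)((uint32_t)a[i] * (uint32_t)b[i]);
    }
}

/* The "accumulate" half of the MAC is a separate step: four 32-bit
 * additions, one per lane, with wraparound on overflow. */
void accumulate_model(int32_t acc[LANES], const int32_t prod[LANES])
{
    assert(acc && prod);
    for (int i = 0; i < LANES; i++) {
        acc[i] = (int32_t)((uint32_t)acc[i] + (uint32_t)prod[i]);
    }
}
```

Calling `qmpy32_model` twice (once per .M unit in the description above) yields eight products; the question is then which units perform the eight additions that `accumulate_model` stands in for.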

     

    It would be nice if you could elaborate a little bit on how you would achieve the 32x32 MACs.

    BR,

         Andreas

  • Andreas,

    I think I can clarify a bit.

    At the very lowest level, you're right. If the goal is to multiply 4 pairs of 32-bit values and then add the 4 products to 4 other 32-bit values in less than 8 nanoseconds, it can't be done on a C66x. It would take the C66x twice that long to get the four results (neglecting the time to load the registers from memory, write the results back, etc.). But I suspect that your ultimate goal is not to get a single set of 4 results. I suspect your goal is rather to implement a typical DSP application, where sets of input data stream in, are processed with a MAC-type algorithm, and the results stream out. If that's the real goal, then the C66x can match the performance of a processor that could do 4 32x32-bit MACs in 8 nanoseconds.

    On average, though, you can get the performance of 8 32x32-bit MACs per cycle. While the C66x can't really do 8 complete MACs in a single cycle, if you split each MAC into two halves (one half for the multiply and the other for the accumulate), it can do 16 half MACs (8 32x32-bit multiplies and 8 32-bit adds) in a single cycle.

    After the first cycle, you will have only the results of the 32x32 multiplies:

    QMPY32 - A Side .M1

    QMPY32 - B Side .M2

    After the 2nd cycle, you will have the MAC results for the first 8 samples. But while we are doing the sums (with 4 parallel DADD instructions), we also issue another 2 QMPY32s:

    QMPY32 - A Side .M1

    DADD - A Side .S1

    DADD - A Side .L1

    QMPY32 - B Side .M2

    DADD - B Side .S2

    DADD - B Side .L2

     

    So after 2 cycles, we get our first 8 MAC results, and on every subsequent cycle we get another 8 because of the parallelism.
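At the C source level, the schedule above corresponds to a dot product written with eight independent partial sums, so that the eight multiplies and eight adds of each iteration have no dependencies on each other. The following is a minimal portable sketch, not DSPLib code: `dotprod32_8acc` is a hypothetical name, and it uses 64-bit accumulators to stay overflow-safe in plain C, whereas the QMPY32/DADD scheme in the post works on 32-bit lanes. Whether a compiler actually maps this loop onto QMPY32 and DADD is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NACC 8 /* two 4-lane multiplies per iteration -> 8 partial sums */

/* Hypothetical helper: dot product of two int32 vectors using eight
 * independent accumulators. Assumption: n is a multiple of 8. */
int64_t dotprod32_8acc(const int32_t *x, const int32_t *y, size_t n)
{
    int64_t acc[NACC] = {0};
    int64_t sum = 0;

    assert(n % NACC == 0);

    /* Each outer iteration performs 8 multiplies and 8 adds that do
     * not depend on one another, which is what lets the multiply of
     * one iteration overlap with the add of the previous one. */
    for (size_t i = 0; i < n; i += NACC) {
        for (size_t k = 0; k < NACC; k++) {
            acc[k] += (int64_t)x[i + k] * y[i + k];
        }
    }

    /* Final reduction of the eight partial sums, done once. */
    for (size_t k = 0; k < NACC; k++) {
        sum += acc[k];
    }
    return sum;
}
```

The final reduction is a fixed cost outside the loop, which is one reason short vectors (such as 32 elements) show much lower throughput than the steady-state figure.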

    So, back to the initial problem statement. In a DSP application, the first 8 MACs take 2 cycles, but the first 8,000,000 MACs take only 1,000,001 cycles.
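The cycle counts quoted above follow from a simple pipeline model: one cycle per group of 8 multiplies, plus one extra cycle for the last group's adds to complete. A small sketch of that arithmetic is below; `mac_cycles_model` and both constants are hypothetical names, and the model deliberately ignores loads, stores, loop overhead, and real instruction latencies.

```c
#include <assert.h>
#include <stdint.h>

#define MACS_PER_CYCLE 8       /* 2 x QMPY32, 4 lanes each */
#define PIPELINE_FILL_CYCLES 1 /* adds trail the multiplies by one cycle */

/* Hypothetical idealized model: cycles needed for n_macs 32x32 MACs
 * when multiplies and adds overlap as in the schedule above. */
uint64_t mac_cycles_model(uint64_t n_macs)
{
    uint64_t multiply_cycles;

    if (n_macs == 0) {
        return 0;
    }
    /* Round up: a partial group of 8 still costs a full cycle. */
    multiply_cycles = (n_macs + MACS_PER_CYCLE - 1) / MACS_PER_CYCLE;
    assert(multiply_cycles > 0);
    return multiply_cycles + PIPELINE_FILL_CYCLES;
}
```

Under this model, 8 MACs cost 2 cycles and 8,000,000 MACs cost 1,000,001 cycles, matching the figures in the post; the 32 MACs from the original question would cost 5 cycles, before any memory or call overhead is added.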

    Regards,

    Dan