C64x+ - Device characteristics

Bubsy

Hello,

I'm working on the C64x+ CPU. In SPRS552E (October 2008, revised April 2010), § 1.2 Core Processor, it is written that "Each C64x+ .M unit doubles the multiply throughput versus the C64x core by performing four 16-bit x 16-bit multiply-accumulates".

Using the TMS320C64x+ DSPLIB (SPRUEB8B), I tested the DSP_dotprod function, which computes the dot product of two real vectors, on the C64x+ CPU Cycle Accurate Simulator. For vectors of 200 elements, I measured 71 cycles (your benchmark gives 200/4 + 19 = 69). That seems fine.

Since this function uses the "_dotp2" intrinsic, it performs only two 16-bit x 16-bit multiply-accumulates per functional unit.

Your documentation mentions "four 16-bit x 16-bit multiply-accumulates" per .M unit. For real vectors of 200 elements, 200 multiplications are needed. With 2 functional units, that is 100 multiplications per unit. If each unit performs four 16-bit x 16-bit multiply-accumulates per cycle, I would expect 100/4 = 25 cycles + 19 cycles (pipeline prologue and epilogue) = 44 cycles, not 69 cycles.

So my first question is: how can I get the four 16-bit x 16-bit multiply-accumulates per cycle mentioned in your documentation?

Moreover, I wrote a function that multiplies two complex vectors using the "_cmpy" intrinsic. For complex vectors of 100 elements, 400 real multiplications are needed. With 2 functional units, that is 200 real multiplications per unit. At four 16-bit x 16-bit multiply-accumulates per cycle, I would expect 200/4 = 50 cycles + 19 cycles (pipeline prologue and epilogue) = 69 cycles, not the 220 cycles I measured.

The complex function I'm using is the following:

void cmpy_intrinsic_struct(struct complexe *x, struct complexe *y, int count, struct complexe32 *z)
{
    long long prod;
    int re0 = 0, im0 = 0, re1 = 0, im1 = 0;
    int cplx_1, cplx_2, cplx_3, cplx_4;
    int idx;

    z[0].re = 0;
    z[0].im = 0;
    z[1].re = 0;
    z[1].im = 0;

    #pragma MUST_ITERATE(50)
    for (idx = 0; idx < count; idx += 2) {

        // Pack each sample into one 32-bit register
        // (Re in the upper 16 bits, Im in the lower 16 bits)
        cplx_1 = _pack2(x[idx].re, x[idx].im);     // (1+2j)
        cplx_2 = _pack2(y[idx].re, y[idx].im);     // (2+3j)

        // Complex multiply and accumulate: (1+2j)*(2+3j) = -4+7j.
        // Re and Im are summed separately so that a carry out of the
        // low (Im) word cannot spill into the high (Re) word.
        prod = _cmpy(cplx_1, cplx_2);
        re0 += _hill(prod);
        im0 += _loll(prod);

        cplx_3 = _pack2(x[idx+1].re, x[idx+1].im);
        cplx_4 = _pack2(y[idx+1].re, y[idx+1].im);
        prod = _cmpy(cplx_3, cplx_4);
        re1 += _hill(prod);
        im1 += _loll(prod);
    }

    // Save the real and imaginary parts (32 bits each)
    z[0].re = re0 + re1;
    z[0].im = im0 + im1;
}

Note:

struct complexe {
    short re;
    short im;
};

struct complexe32 {
    int re;
    int im;
};

This is the only way I have found to multiply two complex vectors in an optimized way.

Because of the PACK2 needed before each CMPY, I measured 220 cycles and not the 69 cycles I expected (four 16-bit x 16-bit multiply-accumulates per cycle). Do you have any idea how to get better performance?

Thanks in advance

Bubsy

over 15 years ago

RandyP over 15 years ago

Bubsy said:
How can I get the four 16-bit x 16-bit multiply-accumulates per cycle mentioned in your documentation?

Since you got the performance you wanted from the DSP_dotprod() function, will it be good enough? From your measurements, it is doing 4 16x16 MACs per cycle, right?

The DOTP2 is the right instruction to use to get two 16x16 multiplies, and you can use two DOTP2 instructions per cycle, and you can use 2 ADD instructions per cycle.
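
To illustrate the counting, here is a minimal portable C sketch (not the DSPlib source). `emu_dotp2` is a hypothetical host-side stand-in for the `_dotp2` intrinsic, and `dotprod_sketch` is an illustrative name. Each loop iteration issues two DOTP2 operations, one per .M unit on the real device, so it retires four 16x16 MACs. Vectors of 200 elements then take 200/4 = 50 iterations, which is where the 200/4 term in the benchmark comes from.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side model of the _dotp2 intrinsic:
 * hi(a)*hi(b) + lo(a)*lo(b), with signed 16-bit halves. */
int32_t emu_dotp2(uint32_t a, uint32_t b)
{
    int16_t a_hi = (int16_t)(a >> 16), a_lo = (int16_t)(a & 0xFFFF);
    int16_t b_hi = (int16_t)(b >> 16), b_lo = (int16_t)(b & 0xFFFF);
    return (int32_t)a_hi * b_hi + (int32_t)a_lo * b_lo;
}

/* Dot product of two real 16-bit vectors; n must be a multiple of 4. */
int32_t dotprod_sketch(const int16_t *restrict x,
                       const int16_t *restrict y, int n)
{
    int32_t sum0 = 0, sum1 = 0;   /* two independent accumulators */

    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        uint32_t x01, x23, y01, y23;

        /* Word loads: each 32-bit word carries two 16-bit elements. */
        memcpy(&x01, &x[i],     sizeof x01);
        memcpy(&x23, &x[i + 2], sizeof x23);
        memcpy(&y01, &y[i],     sizeof y01);
        memcpy(&y23, &y[i + 2], sizeof y23);

        sum0 += emu_dotp2(x01, y01);   /* 2 MACs */
        sum1 += emu_dotp2(x23, y23);   /* 2 MACs */
    }
    return sum0 + sum1;
}
```

The two accumulators keep the two DOTP2 results independent, so on the device the compiler is free to schedule them in the same cycle.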

You can get even more with the CMPY since each CMPY does four 16x16 multiplies per instruction.

Bubsy said:
Do you have any idea how to get better performance?

1. Use all the optimization techniques that you can learn from the C Compiler User's Guide and from the TI Wiki pages.
2. Use the "restrict" keyword for your passed pointers.
3. Structure your data like the DSPlib functions do, instead of the structs that you are using now.
4. Pack your data so that you do not need to use the PACK2 instructions. Those kill your performance in this case.
5. Since you want to do 4 multiplies per cycle, change the count from 50 to a multiple of 4.
6. Try unrolling the loop, either manually or using #pragma UNROLL(n) [check my syntax]
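
Several of the ideas above can be combined roughly as follows. This is a portable sketch under assumptions, not DSPlib code: `emu_cmpy` is a hypothetical host-side model of the `_cmpy` intrinsic (no saturation modeled), `cplx32` and `cdot_packed` are illustrative names, and the input is assumed to be stored already packed, one complex sample per 32-bit word with Re in the upper half and Im in the lower half, so that no PACK2 is needed.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int32_t re; int32_t im; } cplx32;

/* Hypothetical host-side model of _cmpy on packed 16-bit complex values:
 * re = hi(a)*hi(b) - lo(a)*lo(b), im = hi(a)*lo(b) + lo(a)*hi(b). */
void emu_cmpy(uint32_t a, uint32_t b, int32_t *re, int32_t *im)
{
    int16_t a_re = (int16_t)(a >> 16), a_im = (int16_t)(a & 0xFFFF);
    int16_t b_re = (int16_t)(b >> 16), b_im = (int16_t)(b & 0xFFFF);
    *re = (int32_t)a_re * b_re - (int32_t)a_im * b_im;
    *im = (int32_t)a_re * b_im + (int32_t)a_im * b_re;
}

/* Sum of x[i]*y[i] over packed complex samples; count must be even. */
cplx32 cdot_packed(const uint32_t *restrict x,
                   const uint32_t *restrict y, int count)
{
    cplx32 acc0 = {0, 0}, acc1 = {0, 0};
    cplx32 out;

    assert(count % 2 == 0);
    for (int i = 0; i < count; i += 2) {   /* manually unrolled by 2 */
        int32_t re, im;

        emu_cmpy(x[i], y[i], &re, &im);           /* 4 multiplies */
        acc0.re += re;
        acc0.im += im;

        emu_cmpy(x[i + 1], y[i + 1], &re, &im);   /* 4 multiplies */
        acc1.re += re;
        acc1.im += im;
    }
    out.re = acc0.re + acc1.re;
    out.im = acc0.im + acc1.im;
    return out;
}
```

On the device, the `emu_cmpy` calls would be replaced by the intrinsic, and the loop body contains nothing but loads, complex multiplies and adds.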

Bubsy over 15 years ago in reply to RandyP

Hello Randy,

1. Dotprod

I agree with your benchmark for the dotprod function on the C64x+. I also share your opinion that DOTP2 is the right instruction to use to get two 16x16 multiplies.

But in SPRS552E (October 2008, revised April 2010), § 1.2 Core Processor, it is written that "Each C64x+ .M unit doubles the multiply throughput versus the C64x core by performing four 16-bit x 16-bit multiply-accumulates".

So which instruction gives four 16-bit x 16-bit multiply-accumulates per .M unit, apart from the CMPY instruction?

Or how can I use two DOTP2 instructions and 2 ADD instructions per cycle per .M unit, if the C64x+ is able to perform four 16-bit x 16-bit multiply-accumulates per cycle per .M unit?

2. CMPY

Thank you for your tips. I am trying to apply them. However, how can I pack my data without using the PACK2 instructions, which kill the performance of this code?

When I use word-wide data access through _amem4_const(), the register is not ordered correctly for the CMPY instruction. For example, if x[0] = {1,2}, that is 1+2j, then _amem4_const(&x[idx]) loads the register with "00020001" and not "00010002".

To use the CMPY instruction, the 2 input registers must be ordered as A1 = "00010002" and A2 = "00020003", with A1 for (1+2j) and A2 for (2+3j).

With word-wide data access, we get "00020001" and "00030002", which gives (2+1j) * (3+2j) != (1+2j) * (2+3j)!

And if I use a loop to reorder my data, I will kill the performance just as PACK2 does. So how can I easily load the registers in the right order?

Thanks in advance

Bubsy

RandyP over 15 years ago in reply to Bubsy

Bubsy said:
So which instruction gives four 16-bit x 16-bit multiply-accumulates per .M unit, apart from the CMPY instruction?

DDOTPL2 and DDOTPH2 perform 4 16 x 16 multiplies and adds. And the CMPY function does the same number of multiplies, with an add and a subtract, if you do not exclude it.

These are very powerful instructions which can help you reach the performance you need for your product. Put the system together and find the performance you need.

Bubsy said:
how can I pack my data without using the PACK2 instructions, which kill the performance of this code?

In my list of "performance ideas", #3 and #4 are really the same, I think:

3. Structure your data like the DSPlib functions do, instead of the structs that you are using now.
4. Pack your data so that you do not need to use the PACK2 instructions. Those kill your performance in this case.

Look at how the data is stored for the complex math functions in DSPlib. They do not require the PACK2 instructions, because the data is already in the right order within the 32-bit words or 64-bit double words.

The key is to have the data already in the right order. If you load from memory and the values are reversed in the registers, then store them in the reverse order in memory when they are originally written to memory. That results in free packing into the order you need for these powerful instructions.
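
As a rough illustration of that free packing, the sketch below models a 32-bit load of two consecutive 16-bit values. It assumes little-endian addressing, which is consistent with the "00020001" you observed. `load_word_le`, `struct complexe_swap` and the two loader functions are hypothetical names for this example only; they are not DSPlib definitions.

```c
#include <stdint.h>

/* Models a 32-bit little-endian load of two consecutive 16-bit values:
 * the halfword at the lower address lands in the low 16 bits. */
uint32_t load_word_le(int16_t first, int16_t second)
{
    return ((uint32_t)(uint16_t)second << 16) | (uint32_t)(uint16_t)first;
}

/* Layout from the thread: re at the lower address. */
struct complexe      { int16_t re; int16_t im; };

/* Hypothetical reordered layout: im at the lower address. */
struct complexe_swap { int16_t im; int16_t re; };

/* Word as it would appear in a register after a word-wide load. */
uint32_t load_complexe(const struct complexe *c)
{
    return load_word_le(c->re, c->im);   /* {1,2} gives 0x00020001 */
}

uint32_t load_complexe_swap(const struct complexe_swap *c)
{
    return load_word_le(c->im, c->re);   /* 1+2j gives 0x00010002 */
}
```

With the swapped layout, whatever code writes the samples stores Im first and Re second, and the word load then delivers Re in the upper half without any extra instruction.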

And look for ways to pull in 64-bit values from memory. This will further improve your code performance.

Bubsy over 15 years ago in reply to RandyP

Hello Randy,

So, concerning item 1 (the four 16 x 16 multiplies and adds), I conclude that one functional unit on the C64x+ can perform four 16 x 16 multiplies and two 16 + 16 additions per cycle, and that to obtain this performance I have to use one of these three instructions: CMPY, DDOTPL2 or DDOTPH2.

Unfortunately, DDOTPL2 and DDOTPH2 do not match my need. I want one functional unit to compute both (A1 * B1) + (A2 * B2) and (A3 * B3) + (A4 * B4) in one cycle.

For item 2 ("performance ideas"), I will consider reversing the order in memory in order to get free packing.

Thank you very much.

Bubsy
