Hello,
I'm working with the C64x+ CPU. SPRS552E (October 2008, revised April 2010), § 1.2 "Core Processor", states that "Each C64x+ .M unit doubles the multiply throughput versus the C64x core by performing four 16-bit x 16-bit multiply-accumulates".
So, using the TMS320C64x+ DSPLIB (SPRUEB8B), I tested the DSP_dotprod function, which multiplies two real vectors (a dot product), on the C64x+ CPU Cycle Accurate Simulator. For vectors of 200 elements, I measured 71 cycles (your benchmark gives 200/4 + 19 = 69). So far, this seems fine.
Since this function uses the "_dotp2" intrinsic, you get only two 16-bit x 16-bit multiply-accumulates per functional unit.
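To make clear what I mean by "two multiplies per _dotp2", here is a minimal host-side sketch of my understanding of the operation. The names model_dotp2 and model_pack16 are hypothetical helpers written for illustration, not TI intrinsics, and the single overflow corner case (all four inputs equal to -32768) is not modeled.

```c
#include <stdint.h>

/* Hypothetical helper: pack two signed 16-bit values into one 32-bit word,
 * hi in the upper halfword and lo in the lower halfword. */
static uint32_t model_pack16(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Hypothetical model of the DOTP2 operation behind _dotp2:
 * result = hi(a) * hi(b) + lo(a) * lo(b), i.e. two 16x16 multiplies per call.
 * Assumption: inputs stay away from the all -32768 overflow case. */
static int32_t model_dotp2(uint32_t a, uint32_t b)
{
    int16_t a_hi = (int16_t)(a >> 16);
    int16_t a_lo = (int16_t)(a & 0xFFFFu);
    int16_t b_hi = (int16_t)(b >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFFu);

    return (int32_t)a_hi * b_hi + (int32_t)a_lo * b_lo;
}
```

For example, model_dotp2(model_pack16(1, 2), model_pack16(3, 4)) evaluates 1*3 + 2*4.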
Your documentation mentions "CPU C64x+ = four 16-bit x 16-bit multiply-accumulates". For real vectors of 200 elements, you need 200 multiplications. Since there are two functional units, that is 100 multiplications per unit. With four 16-bit x 16-bit multiply-accumulates per unit per cycle, I think we should get 100/4 = 25 cycles + 19 cycles (pipeline prologue and epilogue) = 44 cycles, not 69 cycles.
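The two figures differ only in the assumed number of multiply-accumulates the whole CPU completes per cycle: 4 (two per .M unit, which matches the 200/4 + 19 benchmark formula) or 8 (four per .M unit, as in my reasoning). A small sketch of that arithmetic, where estimate_cycles is a hypothetical helper and the 19-cycle overhead is simply taken from the benchmark formula:

```c
/* Hypothetical helper: estimated cycles for a software-pipelined loop,
 * assuming the loop body is limited only by multiplier throughput.
 *   macs           : total number of 16x16 multiplies needed
 *   macs_per_cycle : multiplies completed per cycle by both .M units together
 *   overhead       : fixed cycles (prologue/epilogue), 19 in the benchmark */
static int estimate_cycles(int macs, int macs_per_cycle, int overhead)
{
    /* Round up so a partial last iteration still costs a full cycle. */
    return (macs + macs_per_cycle - 1) / macs_per_cycle + overhead;
}
```

With 200 multiplies and 19 cycles of overhead, 4 per cycle gives the benchmark figure of 69 and 8 per cycle gives my expected 44.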
So my first question is: how can I actually get the four 16-bit x 16-bit multiply-accumulates per cycle mentioned in your documentation?
Moreover, I wrote a function for multiplying two complex vectors using the "_cmpy" intrinsic. For complex vectors of 100 elements, you need 400 real multiplications. Since there are two functional units, that is 200 real multiplications per unit. With four 16-bit x 16-bit multiply-accumulates per unit per cycle, I think we should get 200/4 = 50 cycles + 19 cycles (pipeline prologue and epilogue) = 69 cycles, not the 220 cycles I measured.
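The count of 400 comes from each complex product needing four real multiplications: (a + bj)(c + dj) = (ac - bd) + (ad + bc)j. A plain C reference makes this explicit. The names cplx32_ref and cmul_ref are hypothetical, used only for this illustration, and this is not the TI intrinsic.

```c
#include <stdint.h>

/* Hypothetical 32-bit complex result type for this reference only. */
struct cplx32_ref {
    int32_t re;
    int32_t im;
};

/* Reference complex multiply on 16-bit inputs: four real 16x16 multiplies,
 * one subtraction and one addition. */
static struct cplx32_ref cmul_ref(int16_t a_re, int16_t a_im,
                                  int16_t b_re, int16_t b_im)
{
    struct cplx32_ref r;

    r.re = (int32_t)a_re * b_re - (int32_t)a_im * b_im;  /* ac - bd */
    r.im = (int32_t)a_re * b_im + (int32_t)a_im * b_re;  /* ad + bc */
    return r;
}
```

For (1 + 2j)(2 + 3j), this gives re = 1*2 - 2*3 and im = 1*3 + 2*2, i.e. -4 + 7j.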
The complex function I'm using is the following:
void cmpy_intrinsic_struct(struct complexe *x, struct complexe *y, int count, struct complexe32 *z)
{
    long long result = 0, result1 = 0;  // packed accumulators: Re in the upper 32 bits, Im in the lower 32 bits
    int cplx_1, cplx_2, cplx_3, cplx_4;
    int idx;

    z[0].re = 0;
    z[0].im = 0;
    z[1].re = 0;
    z[1].im = 0;

    #pragma MUST_ITERATE (50)
    for (idx = 0; idx < count; idx += 2) {
        // Pack each value into one 32-bit register (16 bits for Re, and likewise for Im)
        cplx_1 = _pack2(x[idx].re, x[idx].im);  // (1+2j)
        cplx_2 = _pack2(y[idx].re, y[idx].im);  // (2+3j)
        // Complex multiplication and accumulation
        result += _cmpy(cplx_1, cplx_2);        // (1+2j)*(2+3j) = -4+7j
        // Same for the next element, into a second accumulator
        cplx_3 = _pack2(x[idx+1].re, x[idx+1].im);
        cplx_4 = _pack2(y[idx+1].re, y[idx+1].im);
        result1 += _cmpy(cplx_3, cplx_4);
    }

    // Save the real and imaginary parts (32 bits for Re and the same for Im)
    z[0].re = _hill(result) + _hill(result1);   // (1+2j)*(2+3j) = -4+7j
    z[0].im = _loll(result) + _loll(result1);
}
Note:
struct complexe {
    short re;
    short im;
};
struct complexe32 {
    int re;
    int im;
};
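For anyone who wants to check the results off-target, here is a minimal host-side sketch of how I understand _pack2 and _cmpy to behave. All model_* names, hi32, lo32 and cdot_model are hypothetical helpers, not TI APIs; the structs are repeated only to keep the sketch self-contained, and the overflow corner case with all inputs equal to -32768 is not modeled. Unlike the function above, this model keeps the real and imaginary sums in two separate 32-bit accumulators.

```c
#include <stdint.h>

struct complexe   { short re; short im; };
struct complexe32 { int re; int im; };

/* Models _pack2(a, b): low halfword of a goes to the upper half of the
 * result, low halfword of b goes to the lower half. */
static uint32_t model_pack2(int a, int b)
{
    return ((uint32_t)(uint16_t)a << 16) | (uint16_t)b;
}

/* Models _cmpy(a, b) on packed (Re:Im) operands:
 *   upper 32 bits = Re(a)*Re(b) - Im(a)*Im(b)
 *   lower 32 bits = Re(a)*Im(b) + Im(a)*Re(b) */
static uint64_t model_cmpy(uint32_t a, uint32_t b)
{
    int16_t a_re = (int16_t)(a >> 16), a_im = (int16_t)(a & 0xFFFFu);
    int16_t b_re = (int16_t)(b >> 16), b_im = (int16_t)(b & 0xFFFFu);
    int32_t re = (int32_t)a_re * b_re - (int32_t)a_im * b_im;
    int32_t im = (int32_t)a_re * b_im + (int32_t)a_im * b_re;

    return ((uint64_t)(uint32_t)re << 32) | (uint32_t)im;
}

/* Stand-ins for _hill / _loll: signed upper and lower 32-bit halves. */
static int32_t hi32(uint64_t v) { return (int32_t)(uint32_t)(v >> 32); }
static int32_t lo32(uint64_t v) { return (int32_t)(uint32_t)v; }

/* Sum of x[i] * y[i], with Re and Im accumulated separately. */
static struct complexe32 cdot_model(const struct complexe *x,
                                    const struct complexe *y, int count)
{
    struct complexe32 acc = {0, 0};

    for (int i = 0; i < count; i++) {
        uint64_t p = model_cmpy(model_pack2(x[i].re, x[i].im),
                                model_pack2(y[i].re, y[i].im));
        acc.re += hi32(p);
        acc.im += lo32(p);
    }
    return acc;
}
```

One thing this model suggests checking: when packed products are summed with a single 64-bit addition, a carry out of the lower (Im) half can reach the upper (Re) half. In the model, adding the product (0+1j)*(-1+0j) = 0-1j to itself as a 64-bit value yields an upper half of 1, whereas the separately accumulated real part is 0.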
This is the only solution I found for multiplying two complex vectors in an optimized way.
Because of the _pack2 calls needed to feed _cmpy, I measured 220 cycles and not the 69 cycles I expected from the documentation (CPU C64x+ = four 16-bit x 16-bit multiply-accumulates per cycle). Do you have any ideas for getting better performance?
Thanks in advance,
Bubsy