DM6446 DSP loop performance

First of all: since I'm using DVSDK 2.00.00.22, do I need to shut down or unload anything else to free up resources on the DSP?

Second: I have modified the readwrite example on the DSP side to run a simple MAC loop instead (on char-sized buffers).  Looking at the compiler output, I get

   ii = 1  Schedule found with 8 iterations in parallel

Good, so each iteration of the MAC loop takes one cycle.  I have measured (by running nothing besides the buffer handling) that the buffer-passing overhead is negligible, so almost all of the time is spent in this loop.  Still, when measuring I only get ~290 MMAC/s.  With the DSP at 594 MHz I'm only at half the expected performance.  What am I missing?

Moving on, I've been searching for how to utilize 4 16x16 or 8 8x8 MACs per cycle on the C64x+ DSP but haven't found anything.  Does anyone know of a "peak MMACs" example?

Thanks,

Orjan

  • OK, this is the code I'm executing in a loop inside TSKRDWR_execute in the modified readwrite example:


    void loop1(Short *restrict a, Short *restrict b, Short c, Int n)
    {
      Int i;

      WORD_ALIGNED(a);
      WORD_ALIGNED(b);

    #pragma MUST_ITERATE(1024, 16384, 4)
      for (i = n; i > 0; i--)
        *b++ += c * *a++;
    }

    Int loop2(Short *restrict a, Short *restrict b, Int n)
    {
      Int i;
      Int tmp = 0;

      WORD_ALIGNED(a);
      WORD_ALIGNED(b);

    #pragma MUST_ITERATE(1024, 16384, 4)
      for (i = n; i > 0; i--)
        tmp += *a++ * *b++;

      return tmp;
    }

    Simple enough, right?
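
    (For reference, WORD_ALIGNED here is the usual _nassert-based alignment hint from the TI examples; I'm assuming a definition along these lines, telling the compiler that the pointer sits on a 32-bit boundary:)

    /* Assumed definition of the alignment-hint macro used above: promise
       the compiler that x is 32-bit (word) aligned, so it may use LDW to
       fetch two shorts at a time. */
    #define WORD_ALIGNED(x)  (_nassert(((int)(x) & 0x3) == 0))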

    The compiler feedback for loop1 says

      ii = 5 Schedule found with 3 iterations in parallel

    with

      Loop Unroll Multiple : 4x

    So, 4 iterations every 5 cycles, which means I should get about 594 MHz * 4/5 MACs per cycle = 475 MMAC/s.  When measuring I get 470 MMAC/s, so everything is as expected.

     

    Moving on to loop2, the compiler feedback is

      ii = 2 Schedule found with 6 iterations in parallel

    with

      Loop Unroll Multiple : 4x 

    which should give me 594 MHz * 4/2 MACs per cycle = 1188 MMAC/s, but I measure 585 MMAC/s, which is about half.  What am I missing here?

     

    Thanks,

    Orjan

  • How do you know that you are getting 585 MMAC/s?  From the CPU MIPS load graph?

  • No, I'm just dividing the number of iterations of the C-code loop by the execution time.

    (How do I access the MIPS load graph?)

    The resulting assembly code for the piped loop kernel:

       [!A1]   ADD     .L2     B4,B5,B5    ; |230| <0,9> 
    ||         DOTP2   .M2X    B7,A3,B4    ; |230| <2,5> 
    || [ A0]   BDEC    .S1     $C$L2,A0    ; |229| <2,5> 
    ||         LDW     .D1T1   *+A5(4),A4  ; |230| <4,1> 
    ||         LDW     .D2T2   *+B6(4),B4  ; |230| <4,1> 

       [ A1]   SUB     .L1     A1,1,A1     ; <0,10> 
    || [!A1]   ADD     .S1     A3,A6,A6    ; |230| <0,10> 
    ||         DOTP2   .M1X    B4,A4,A3    ; |230| <2,6> 
    ||         LDW     .D1T1   *++A5(8),A3 ; |230| <5,0> 
    ||         LDW     .D2T2   *++B6(8),B7 ; |230| <5,0> 

    Looking at the code above, the 4x unrolled loop issues one DOTP2 every cycle and no MPYs, which to me looks like only 2 iterations of the original loop are being executed per cycle.

    Am I misinterpreting the compiler feedback?
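
    (For reference, my understanding of DOTP2: it is what the _dotp2 intrinsic exposes, i.e. two packed 16x16 multiplies summed into one result, so one DOTP2 should cover two iterations of the original loop.  Sketched as C, assuming 32-bit-aligned buffers and an even n:)

    /* Illustration only, not the generated code: one DOTP2 per two
       iterations of the short-MAC loop.  _dotp2() and _amem4_const()
       are TI C6000 compiler intrinsics. */
    Int dotprod_dotp2(const Short *restrict a, const Short *restrict b, Int n)
    {
      Int i;
      Int tmp = 0;

      for (i = 0; i < n; i += 2)
        tmp += _dotp2(_amem4_const(&a[i]),    /* a[i]*b[i] + a[i+1]*b[i+1] */
                      _amem4_const(&b[i]));

      return tmp;
    }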

  • I was misinterpreting the compiler feedback.  After finding another example on the Internet and then looking at the loop, it seemed like I should be able to do 64-bit loads instead of 32-bit ones.  By asserting 64-bit alignment I got a 4x loop unroll with ii = 1 and two DOTP2s every cycle, and my execution time was cut in half (~1160 MMAC/s, which is what I was hoping for).
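
    Concretely, the change amounts to widening the alignment promise from 32 to 64 bits, roughly like this (using _nassert directly in place of the WORD_ALIGNED macro; a sketch, not a verbatim copy of my code):

    /* loop2 with the alignment hint widened to 8 bytes: this is what
       lets the compiler use LDDW (64-bit loads) and schedule two DOTP2s
       per cycle. */
    Int loop2(Short *restrict a, Short *restrict b, Int n)
    {
      Int i;
      Int tmp = 0;

      _nassert(((int)a & 0x7) == 0);   /* a is 64-bit (double-word) aligned */
      _nassert(((int)b & 0x7) == 0);   /* b is 64-bit (double-word) aligned */

    #pragma MUST_ITERATE(1024, 16384, 4)
      for (i = n; i > 0; i--)
        tmp += *a++ * *b++;

      return tmp;
    }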

  • Sorry that I couldn't help much, and glad to see you found the answer.  Enabling 64-bit loads using nassert was a tricky one and gave me a hard time on the C67x.  It appears that the C64x compiler does a better job than the C67x one.  I wonder if you are also interested in getting the maximum number of 32x16 MACs done per cycle.

     

  • Actually, I'd be interested in the maximum number of 16x16 MACs (i.e. 4 every cycle).  At the moment I'm only at half that, but then again I'm also dereferencing two pointers in the tmp += *a++ * *b++ loop.  Changing the code to tmp += *a++ * 7 cuts the execution time in half, so maybe that's what's needed.

    Comparing the assembly code for the two cases, one LDDW disappears in the second version.  I'm guessing only one of these can be done every cycle, which would account for the execution time being cut in half.  What is strange to me, though, is that in both cases all instructions in the loop are marked with || in the compiler output, indicating that they are executing in parallel.

    (I haven't looked into using intrinsics yet, though I'm sure they will be useful in more advanced examples.)

  • Quick update: I used the DSP_dotprod function from the C64x+ DSP library and got the same performance as with my 64-bit-aligned tmp += *a++ * *b++ loop, but the documentation (sprueb8b.pdf) says "Cycles nx / 4 + 19" (where nx is the length of the short vectors).  Even the source (DSP_dotprod.c) says "1 clock/4 iterations".

    Obviously I'm missing something here.
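
    For reference, the call I'm timing is basically this (a minimal sketch; I'm assuming the standard C64x+ DSPLIB prototype and header name, and per the docs nx has to be a multiple of 4 with both vectors double-word aligned):

    /* Sketch of the DSPLIB call being timed; header name and alignment
       requirements as I understand them from the DSPLIB documentation. */
    #include "DSP_dotprod.h"

    #define N 4096

    #pragma DATA_ALIGN(x, 8)          /* double-word aligned inputs */
    #pragma DATA_ALIGN(y, 8)
    static short x[N], y[N];

    int run_dotprod(void)
    {
      return DSP_dotprod(x, y, N);    /* documented as ~N/4 + 19 cycles */
    }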