AM5708: C66x assembly code

pengfei jin

Part Number: AM5708

Hi,

I am struggling to understand the NOP in the code below:

        MVK     .S1     100,A1
||      ZERO    .L1     A7
LOOP:
        LDW     .D1     *A4++,A2
||      LDW     .D2     *B4++,B2

        SUB     .S1     A1,1,A1
        NOP             2

[A1]    B       .S2     LOOP
        MPYSP   .M1X    A2,B2,A6
        NOP             3
        ADDSP   .L1     A6,A7,A7

From the asm code above, we can see that .M1X, .S1, and .S2 operate in parallel. So I think the code should be written as follows:

        MVK     .S1     100,A1
||      ZERO    .L1     A7
LOOP:
        LDW     .D1     *A4++,A2
||      LDW     .D2     *B4++,B2

        SUB     .S1     A1,1,A1
        NOP             4       ; this is the change

[A1]    B       .S2     LOOP
        MPYSP   .M1X    A2,B2,A6
        NOP             3
        ADDSP   .L1     A6,A7,A7

The change, shown in red in the original post, is NOP 2 -> NOP 4.

Why?

over 6 years ago

0 Biser Gatchev-XID over 6 years ago

TI__Guru**** 393215 points

Please post from where is this assembly code taken.

0 pengfei jin over 6 years ago in reply to Biser Gatchev-XID

Intellectual 585 points

int dotp(short a[], short b[])
{
    int sum, i;

    sum = 0;
    for (i = 0; i < 100; i++)
    {
        sum += a[i] * b[i];   /* dot product: multiply, then accumulate */
    }
    return sum;
}

0 pengfei jin over 6 years ago in reply to Biser Gatchev-XID

Intellectual 585 points

See 5-17 in the TMS320C6000 Programmer's Guide provided by TI.

1) The C code is below:

int dotp(short a[], short b[])
{
    int sum, i;

    sum = 0;
    for (i = 0; i < 100; i++)
    {
        sum += a[i] * b[i];   /* dot product: multiply, then accumulate */
    }
    return sum;
}

2) Nonparallel Assembly Code for Floating-Point Dot Product is below:

        MVK     .S1     100, A1         ; set up loop counter
        ZERO    .L1     A7              ; zero out accumulator
LOOP:
        LDW     .D1     *A4++,A2        ; load ai from memory
        LDW     .D1     *A3++,A5        ; load bi from memory
        NOP             4               ; delay slots for LDW
        MPYSP   .M1     A2,A5,A6        ; ai * bi
        NOP             3               ; delay slots for MPYSP
        ADDSP   .L1     A6,A7,A7        ; sum += (ai * bi)
        NOP             3               ; delay slots for ADDSP
        SUB     .S1     A1,1,A1         ; decrement loop counter
[A1]    B       .S2     LOOP            ; branch to loop
        NOP             5               ; delay slots for branch
                                        ; Branch occurs here

3) Parallel Assembly Code for Floating-Point Dot Product is below:

        MVK     .S1     100, A1         ; set up loop counter
||      ZERO    .L1     A7              ; zero out accumulator
LOOP:
        LDW     .D1     *A4++,A2        ; load ai from memory
||      LDW     .D2     *B4++,B2        ; load bi from memory
        SUB     .S1     A1,1,A1         ; decrement loop counter
        NOP             2               ; delay slots for LDW
[A1]    B       .S2     LOOP            ; branch to loop
        MPYSP   .M1X    A2,B2,A6        ; ai * bi
        NOP             3               ; delay slots for MPYSP
        ADDSP   .L1     A6,A7,A7        ; sum += (ai * bi)
                                        ; Branch occurs here

0 lding over 6 years ago in reply to pengfei jin

TI__Guru* 95265 points

Hi,

I looked at the http://www.ti.com/lit/ug/spru198k/spru198k.pdf

Example 5-5. Fixed-Point Dot Product C Code
int dotp(short a[], short b[])
{
int sum, i;
sum = 0;
for(i=0; i<100; i++)
sum += a[i] * b[i];
return(sum);
}

I believe this is the integer dotp C code you referred to.

Then, if you look at the assembly code:

Example 5-9. Nonparallel Assembly Code for Fixed-Point Dot Product

MVK .S1 100, A1 ; set up loop counter
ZERO .L1 A7 ; zero out accumulator
LOOP:
LDH .D1 *A4++,A2 ; load ai from memory
LDH .D1 *A3++,A5 ; load bi from memory
NOP 4 ; delay slots for LDH
MPY .M1 A2,A5,A6 ; ai * bi
NOP ; delay slot for MPY
ADD .L1 A6,A7,A7 ; sum += (ai * bi)
SUB .S1 A1,1,A1 ; decrement loop counter
[A1] B .S2 LOOP ; branch to loop
NOP 5 ; delay slots for branch
; Branch occurs here

If you use parallel assembly code:

Example 5-10. Parallel Assembly Code for Fixed-Point Dot Product
MVK .S1 100, A1 ; set up loop counter
|| ZERO .L1 A7 ; zero out accumulator
LOOP:
LDH .D1 *A4++,A2 ; load ai from memory
|| LDH .D2 *B4++,B2 ; load bi from memory
SUB .S1 A1,1,A1 ; decrement loop counter
[A1] B .S2 LOOP ; branch to loop
NOP 2 ; delay slots for LDH
MPY .M1X A2,B2,A6 ; ai * bi
NOP ; delay slots for MPY
ADD .L1 A6,A7,A7 ; sum += (ai * bi)
; Branch occurs here

This is explained as follows:

Because the loads of ai and bi do not depend on one another, both LDH
instructions can execute in parallel as long as they do not share the same
resources. To schedule the load instructions in parallel, allocate the functional
units as follows:
- ai and the pointer to ai to a functional unit on the A side, .D1
- bi and the pointer to bi to a functional unit on the B side, .D2
Because the MPY instruction now has one source operand from A and one
from B, MPY uses the 1X cross path.

Rearranging the order of the instructions also improves the performance of the
code. The SUB instruction can take the place of one of the NOP delay slots
for the LDH instructions. Moving the B instruction after the SUB removes the
need for the NOP 5 used at the end of the code in Example 5−9.
The branch now occurs immediately after the ADD instruction so that the MPY
and ADD execute in parallel with the five delay slots required by the branch
instruction.
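
The cycle counting in the passage above can be checked with a small C sketch. It is only a simplified model, not anything from the guide: each execute packet costs one cycle, a `NOP n` costs n cycles, and the names `packet`, `loop_5_10`, `issue_cycle`, `cycles_between`, and `cycles_after` are made up for illustration. The packet table is transcribed from the loop body of Example 5-10.

```c
#include <assert.h>
#include <string.h>

/* One execute packet of the loop body. Instructions joined by "||"
 * share a packet. cycles is 1 except for a multi-cycle NOP.
 * (Hypothetical type, for illustration only.) */
struct packet {
    const char *name;
    int cycles;
};

/* Loop body of Example 5-10, one entry per execute packet. */
static const struct packet loop_5_10[] = {
    { "LDH", 1 },  /* LDH .D1 || LDH .D2 */
    { "SUB", 1 },
    { "B",   1 },
    { "NOP", 2 },  /* NOP 2 */
    { "MPY", 1 },
    { "NOP", 1 },
    { "ADD", 1 },
};

/* Cycle on which the first packet called `name` issues, or -1. */
int issue_cycle(const struct packet *p, int n, const char *name)
{
    int cycle = 0;
    for (int i = 0; i < n; i++) {
        if (strcmp(p[i].name, name) == 0)
            return cycle;
        cycle += p[i].cycles;
    }
    return -1;
}

/* Cycles strictly between the issue of `from` and the issue of `to`. */
int cycles_between(const char *from, const char *to)
{
    int n = (int)(sizeof loop_5_10 / sizeof loop_5_10[0]);
    int a = issue_cycle(loop_5_10, n, from);
    int b = issue_cycle(loop_5_10, n, to);
    assert(a >= 0 && b > a);
    return b - a - 1;
}

/* Cycles left in the loop body after the packet called `name` issues. */
int cycles_after(const char *name)
{
    int n = (int)(sizeof loop_5_10 / sizeof loop_5_10[0]);
    int total = 0;
    for (int i = 0; i < n; i++)
        total += loop_5_10[i].cycles;
    int a = issue_cycle(loop_5_10, n, name);
    assert(a >= 0);
    return total - a - 1;
}
```

Under this model, `cycles_between("LDH", "MPY")` is 4 (the LDH delay slots, filled by SUB, B, and NOP 2), and `cycles_after("B")` is 5 (the branch delay slots, filled by NOP 2, MPY, NOP, and ADD).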

Regards, Eric

0 pengfei jin over 6 years ago in reply to lding

Intellectual 585 points

Thanks, but my question is why the SUB instruction can take the place of one of the NOP delay slots for the LDH instructions. Why can the order be rearranged? I think .D2, .S1, and .S2 run in parallel in the code below, so the MPY cannot get the value of B2:

0 lding over 6 years ago in reply to pengfei jin

TI__Guru* 95265 points

Hi,

I asked our compiler team for help on those assembly instructions.

Regards, Eric

0 George Mock over 6 years ago

TI__Guru**** 249175 points

I'm going to restate the original question in the form of this example assembly code ...

	MVK	.S1	100,A1
||	ZERO	.L1	A7
LOOP:
	LDW	.D1	*A4++,A2
||	LDW	.D2	*B4++,B2

	SUB	.S1	A1,1,A1
	NOP		2		; WHY NOT 4?

[A1]	B	.S2	LOOP
	MPYSP	.M1X	A2, B2, A6
	NOP		3
	ADDSP	.L1	A6,A7,A7

You want to know why the first NOP only waits 2 cycles. Why not 4 instead?

It is correct that 4 cycles must occur between the LDW instructions and the MPYSP. Those are called delay cycles. The compiler always tries to fill delay cycles with useful work. When all the available instructions are scheduled, and some delay cycles still remain, the compiler is forced to issue a NOP for the rest of the delay cycles. In this specific case, a SUB and B instruction are available to take up 2 cycles. So, the NOP 2 takes up the rest of the 4 delay cycles.
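
The arithmetic here can be written as a tiny C sketch. The function name `nop_cycles_needed` and its parameters are hypothetical and exist only for illustration; the 4 delay cycles and the two filler instructions (SUB and B) are the values from this thread, and the sketch assumes each filler instruction occupies exactly one cycle.

```c
#include <assert.h>

/* Hypothetical helper: NOP cycles still required once some of the
 * delay cycles have been filled with useful single-cycle instructions. */
int nop_cycles_needed(int delay_cycles, int filler_instructions)
{
    assert(delay_cycles >= 0 && filler_instructions >= 0);
    int remaining = delay_cycles - filler_instructions;
    return remaining > 0 ? remaining : 0;
}
```

With the values above, `nop_cycles_needed(4, 2)` gives 2, which matches the NOP 2 in the listing.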

Thanks and regards,

-George

0 pengfei jin over 6 years ago in reply to lding

Intellectual 585 points

Thanks.

0 pengfei jin over 6 years ago in reply to George Mock

Intellectual 585 points

Hi George,

Thanks for your reply.

I see what you mean, but I cannot understand why the SUB and B instructions can take up 2 cycles between LDW and MPYSP. I think .S1 (used by SUB), .S2 (used by B), .D2 (used by LDW), and .M1X (used by MPYSP) run in parallel, so a SUB between LDW and MPYSP is equivalent to NOP 0, and a B between LDW and MPYSP is also equivalent to NOP 0. B2 would then not yet be available when .M1X needs it, because only NOP 2 (NOP 0 + NOP 0 + NOP 2) has elapsed. In other words, I don't understand why SUB and B can replace 2 of the 4 cycles (NOP 4).

Looking forward to your reply.

0 pf over 6 years ago in reply to pengfei jin

TI__Expert 4930 points

Physically, they *can* be parallel, because they're different units. In this example, they clearly *aren't*, because in asm only the instructions connected by "||" are executed in parallel. They aren't in parallel because the compiler sets them up that way, to maintain the proper number of delay slots.
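
To make the "||" rule concrete, here is a rough C sketch that walks the loop body from this thread line by line. It uses a deliberately simplified model that is an assumption, not TI's tooling: a line starting with "||" joins the previous execute packet and adds no cycle, `NOP n` adds n cycles, and every other instruction line starts a new packet and adds one cycle. The names `line_cycles`, `issue_cycle_of`, and `fp_loop` are hypothetical, and the input is assumed to contain instruction lines only (no labels or blank lines).

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Loop body of the floating-point example, instruction lines only. */
const char *const fp_loop[] = {
    "     LDW   .D1  *A4++,A2",
    "||   LDW   .D2  *B4++,B2",
    "     SUB   .S1  A1,1,A1",
    "     NOP   2",
    "[A1] B     .S2  LOOP",
    "     MPYSP .M1X A2,B2,A6",
    "     NOP   3",
    "     ADDSP .L1  A6,A7,A7",
};

/* Cycles one source line adds under the simplified model:
 *   "|| ..."  -> 0 (joins the previous execute packet)
 *   "NOP n"   -> n (plain "NOP" -> 1)
 *   otherwise -> 1 (starts a new execute packet) */
int line_cycles(const char *line)
{
    const char *s = line;
    while (*s == ' ' || *s == '\t')
        s++;
    if (strncmp(s, "||", 2) == 0)
        return 0;
    if (strncmp(s, "NOP", 3) == 0 && !isalnum((unsigned char)s[3])) {
        long n = strtol(s + 3, NULL, 10);
        return n > 0 ? (int)n : 1;
    }
    return 1;
}

/* Issue cycle of the first line containing the substring `text`,
 * or -1 if no line matches. A "||" line reports the cycle of the
 * packet it joins. */
int issue_cycle_of(const char *const lines[], int n, const char *text)
{
    int cycle = 0;
    int packet_start = 0;
    for (int i = 0; i < n; i++) {
        int c = line_cycles(lines[i]);
        if (c > 0)
            packet_start = cycle;
        if (strstr(lines[i], text) != NULL)
            return packet_start;
        cycle += c;
    }
    return -1;
}
```

In this model the LDW pair issues on cycle 0 and MPYSP on cycle 5, so exactly 4 cycles (SUB, NOP 2, B) sit between them. SUB and B each occupy their own cycle because neither line starts with "||".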

0 pengfei jin over 6 years ago in reply to pf

Intellectual 585 points

Thanks for your reply.

Could you recommend some material for studying delay slots, "||", and so on?

0 pf over 6 years ago in reply to pengfei jin

TI__Expert 4930 points

The TMS320C6000 Programmer's Guide, already mentioned, and the TMS320C6000 CPU and Instruction Set Reference Guide are good sources. The latter goes into more detail about them.

0 pengfei jin over 6 years ago in reply to pf

Intellectual 585 points

Thanks.
