AM5708: C66x assembly code

Part Number: AM5708

Hi,

I am struggling to understand the NOP below:

                MVK             .S1             100,A1
||             ZERO             .L1             A7
LOOP:
                LDW              .D1             *A4++,A2
||              LDW            .D2            *B4++,B2

                SUB            .S1            A1,1,A1
                NOP                             2

[A1]            B                    .S2            LOOP
                MPYSP        .M1X            A2, B2, A6
                NOP                             3
                ADDSP        .L1            A6,A7,A7
            
From the asm code above, we can see that the .M1X, .S1 and .S2 units run in parallel. So I think the asm code should be written as follows:

                MVK             .S1             100,A1
||             ZERO             .L1             A7
LOOP:
                LDW              .D1             *A4++,A2
||              LDW            .D2            *B4++,B2

                SUB            .S1            A1,1,A1
                NOP                             4 ; this is the change

[A1]            B                    .S2            LOOP
                MPYSP        .M1X            A2, B2, A6
                NOP                             3
                ADDSP        .L1            A6,A7,A7

The change is NOP 2 -> NOP 4.

Why?

  • Please post where this assembly code was taken from.

  • int dotp(short a[], short b[]){
        int sum,i;
        sum = 0;
        for(i=0;i<100;i++)
        {
            sum+=a[i]*b[i];
        }
        return sum;
    }

  • See 5-17 in the TI manual, the TMS320C6000 Programmer's Guide.

    1) The C code is below:

    int dotp(short a[], short b[]){
        int sum,i;
        sum = 0;
        for(i=0;i<100;i++)
        {
            sum+=a[i]*b[i];
        }
        return sum;
    }

    2) Nonparallel Assembly Code for Floating-Point Dot Product is below:

    MVK .S1 100, A1 ; set up loop counter
    ZERO .L1 A7 ; zero out accumulator
    LOOP:
    LDW .D1 *A4++,A2 ; load ai from memory
    LDW .D1 *A3++,A5 ; load bi from memory
    NOP 4 ; delay slots for LDW
    MPYSP .M1 A2,A5,A6 ; ai * bi
    NOP 3 ; delay slots for MPYSP
    ADDSP .L1 A6,A7,A7 ; sum += (ai * bi)
    NOP 3 ; delay slots for ADDSP
    SUB .S1 A1,1,A1 ; decrement loop counter
    [A1] B .S2 LOOP ; branch to loop
    NOP 5 ; delay slots for branch
    ; Branch occurs here

    3) Parallel Assembly Code for Floating-Point Dot Product is below:

          

    MVK .S1 100, A1 ; set up loop counter
    || ZERO .L1 A7 ; zero out accumulator
    LOOP:
    LDW .D1 *A4++,A2 ; load ai from memory
    || LDW .D2 *B4++,B2 ; load bi from memory
    SUB .S1 A1,1,A1 ; decrement loop counter
    NOP 2 ; delay slots for LDW
    [A1] B .S2 LOOP ; branch to loop
    MPYSP .M1X A2,B2,A6 ; ai * bi
    NOP 3 ; delay slots for MPYSP
    ADDSP .L1 A6,A7,A7 ; sum += (ai * bi)
    ; Branch occurs here
  • Hi,

    I looked at http://www.ti.com/lit/ug/spru198k/spru198k.pdf

    Example 5-5. Fixed-Point Dot Product C Code
    int dotp(short a[], short b[])
    {
    int sum, i;
    sum = 0;
    for(i=0; i<100; i++)
    sum += a[i] * b[i];
    return(sum);
    }

    I believe this is the integer dotp C code you referred to.

    Then, if you look at the assembly code:

    Example 5-9. Nonparallel Assembly Code for Fixed-Point Dot Product

    MVK .S1 100, A1 ; set up loop counter
    ZERO .L1 A7 ; zero out accumulator
    LOOP:
    LDH .D1 *A4++,A2 ; load ai from memory
    LDH .D1 *A3++,A5 ; load bi from memory
    NOP 4 ; delay slots for LDH
    MPY .M1 A2,A5,A6 ; ai * bi
    NOP ; delay slot for MPY
    ADD .L1 A6,A7,A7 ; sum += (ai * bi)
    SUB .S1 A1,1,A1 ; decrement loop counter
    [A1] B .S2 LOOP ; branch to loop
    NOP 5 ; delay slots for branch
    ; Branch occurs here

    If you use the parallel assembly code instead:

    Example 5-10. Parallel Assembly Code for Fixed-Point Dot Product
    MVK .S1 100, A1 ; set up loop counter
    || ZERO .L1 A7 ; zero out accumulator
    LOOP:
    LDH .D1 *A4++,A2 ; load ai from memory
    || LDH .D2 *B4++,B2 ; load bi from memory
    SUB .S1 A1,1,A1 ; decrement loop counter
    [A1] B .S2 LOOP ; branch to loop
    NOP 2 ; delay slots for LDH
    MPY .M1X A2,B2,A6 ; ai * bi
    NOP ; delay slots for MPY
    ADD .L1 A6,A7,A7 ; sum += (ai * bi)
    ; Branch occurs here

    This is explained as follows:

    Because the loads of ai and bi do not depend on one another, both LDH
    instructions can execute in parallel as long as they do not share the same
    resources. To schedule the load instructions in parallel, allocate the functional
    units as follows:
    ai and the pointer to ai to a functional unit on the A side, .D1
    bi and the pointer to bi to a functional unit on the B side, .D2
    Because the MPY instruction now has one source operand from A and one
    from B, MPY uses the 1X cross path.

    Rearranging the order of the instructions also improves the performance of the
    code. The SUB instruction can take the place of one of the NOP delay slots
    for the LDH instructions. Moving the B instruction after the SUB removes the
    need for the NOP 5 used at the end of the code in Example 5−9.
    The branch now occurs immediately after the ADD instruction so that the MPY
    and ADD execute in parallel with the five delay slots required by the branch
    instruction.
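
    The effect of the rearrangement can be checked by counting cycles per loop iteration. The following is a minimal sketch, not TI code: `struct asm_line`, `iteration_cycles` and the two wrapper functions are hypothetical names, and the model assumes each execute packet takes one cycle, a `NOP n` takes n cycles, and there are no memory stalls.

```c
#include <stddef.h>

/* One source line of an assembly listing. This struct is a hypothetical
   model for illustration, not part of any TI tool. */
struct asm_line {
    const char *text;   /* instruction text, for readability only */
    int parallel;       /* 1 if the line starts with "||" */
    int nop_cycles;     /* n for "NOP n" (a plain NOP is 1), 0 otherwise */
};

/* Cycles for one pass through the loop body under the stated assumptions. */
int iteration_cycles(const struct asm_line *lines, size_t count)
{
    int cycles = 0;
    for (size_t i = 0; i < count; i++) {
        if (lines[i].parallel)
            continue;   /* shares the cycle of the previous line */
        cycles += lines[i].nop_cycles > 0 ? lines[i].nop_cycles : 1;
    }
    return cycles;
}

/* Loop body of Example 5-9 (nonparallel), from LOOP: to the branch. */
const struct asm_line example_5_9[] = {
    {"LDH .D1 *A4++,A2", 0, 0},
    {"LDH .D1 *A3++,A5", 0, 0},
    {"NOP 4",            0, 4},
    {"MPY .M1 A2,A5,A6", 0, 0},
    {"NOP",              0, 1},
    {"ADD .L1 A6,A7,A7", 0, 0},
    {"SUB .S1 A1,1,A1",  0, 0},
    {"[A1] B .S2 LOOP",  0, 0},
    {"NOP 5",            0, 5},
};

/* Loop body of Example 5-10 (parallel). */
const struct asm_line example_5_10[] = {
    {"LDH .D1 *A4++,A2",    0, 0},
    {"|| LDH .D2 *B4++,B2", 1, 0},
    {"SUB .S1 A1,1,A1",     0, 0},
    {"[A1] B .S2 LOOP",     0, 0},
    {"NOP 2",               0, 2},
    {"MPY .M1X A2,B2,A6",   0, 0},
    {"NOP",                 0, 1},
    {"ADD .L1 A6,A7,A7",    0, 0},
};

int cycles_example_5_9(void)
{
    return iteration_cycles(example_5_9,
                            sizeof example_5_9 / sizeof example_5_9[0]);
}

int cycles_example_5_10(void)
{
    return iteration_cycles(example_5_10,
                            sizeof example_5_10 / sizeof example_5_10[0]);
}
```

    Under these assumptions, `cycles_example_5_9()` returns 16 and `cycles_example_5_10()` returns 8, so the rearranged loop needs half the cycles per iteration.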

    Regards, Eric

  • Thanks, but my question is why the SUB instruction can take the place of one of the NOP delay slots for the LDH instructions. Why can they rearrange the order? I think .D2, .S1 and .S2 run in parallel in the code below, so the MPY cannot get the value of B2:

    Example 5-10. Parallel Assembly Code for Fixed-Point Dot Product
    MVK .S1 100, A1 ; set up loop counter
    || ZERO .L1 A7 ; zero out accumulator
    LOOP:
    LDH .D1 *A4++,A2 ; load ai from memory
    || LDH .D2 *B4++,B2 ; load bi from memory
    SUB .S1 A1,1,A1 ; decrement loop counter
    [A1] B .S2 LOOP ; branch to loop
    NOP 2 ; delay slots for LDH
    MPY .M1X A2,B2,A6 ; ai * bi
    NOP ; delay slots for MPY
    ADD .L1 A6,A7,A7 ; sum += (ai * bi)
    ; Branch occurs here

  • Hi,

    I asked our compiler team for help on those assembly instructions.

    Regards, Eric 

  • I'm going to restate the original question in the form of this example assembly code ...

    	MVK	.S1	100,A1
    ||	ZERO	.L1	A7
    LOOP:
    	LDW	.D1	*A4++,A2
    ||	LDW	.D2	*B4++,B2
    
    	SUB	.S1	A1,1,A1
    	NOP		2		; WHY NOT 4?
    
    [A1]	B	.S2	LOOP
    	MPYSP	.M1X	A2, B2, A6
    	NOP		3
    	ADDSP	.L1	A6,A7,A7

    You want to know why the first NOP only waits 2 cycles.  Why not 4 instead?

    It is correct that 4 cycles must occur between the LDW instructions and the MPYSP.  Those are called delay slots.  The compiler always tries to fill delay slots with useful work.  When all the available instructions are scheduled and some delay slots still remain, the compiler is forced to issue a NOP for the rest of them.  In this specific case, a SUB and a B instruction are available, and each takes up 1 cycle.  So, the NOP 2 takes up the remaining 2 of the 4 delay slots.
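
    To make the counting concrete, here is a small sketch that assigns an issue cycle to each execute packet of the loop and measures the gaps. It is only an illustration: `struct packet`, `issue_cycle` and `cycles_between` are hypothetical names, and it assumes one cycle per execute packet, n cycles for `NOP n`, and no stalls. The delay-slot counts it is checked against (4 for LDW, 3 for MPYSP, 5 for B) are the ones given in the listing comments earlier in this thread.

```c
/* One execute packet of the loop body and the cycles it occupies.
   Hypothetical model for illustration only. */
struct packet {
    const char *text;
    int cycles;         /* 1 for an instruction packet, n for "NOP n" */
};

/* Floating-point loop from this thread, starting at LOOP:. The two LDWs
   are joined by "||", so they form a single packet. */
const struct packet float_loop[] = {
    {"LDW .D1 || LDW .D2", 1},
    {"SUB .S1",            1},
    {"NOP 2",              2},
    {"[A1] B .S2 LOOP",    1},
    {"MPYSP .M1X",         1},
    {"NOP 3",              3},
    {"ADDSP .L1",          1},
};

enum {
    LDW_PACKET   = 0,
    B_PACKET     = 3,
    MPYSP_PACKET = 4,
    ADDSP_PACKET = 6,
    PACKET_COUNT = 7    /* one past the end: where the branch lands */
};

/* Cycle in which packet `index` issues, counting the LDW packet as cycle 0. */
int issue_cycle(int index)
{
    int cycle = 0;
    for (int i = 0; i < index; i++)
        cycle += float_loop[i].cycles;
    return cycle;
}

/* Number of cycles strictly between the issue of two packets. */
int cycles_between(int producer, int consumer)
{
    return issue_cycle(consumer) - issue_cycle(producer) - 1;
}
```

    With this model, `cycles_between(LDW_PACKET, MPYSP_PACKET)` is 4 (SUB, two NOP cycles, B), `cycles_between(MPYSP_PACKET, ADDSP_PACKET)` is 3, and `cycles_between(B_PACKET, PACKET_COUNT)` is 5. Writing NOP 4 instead of NOP 2 would still work, but the LDW-to-MPYSP gap would grow to 6 and each iteration would take 12 cycles instead of 10.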

    Thanks and regards,

    -George

  • Hi George,

    Thanks for your reply.

    I understand what you mean, but I cannot understand why the SUB and B instructions can take up 2 cycles between the LDW and the MPYSP. I think .S1 (used by SUB), .S2 (used by B), .D2 (used by LDW) and .M1X (used by MPYSP) run in parallel, so the SUB between LDW and MPYSP is equivalent to NOP 0, and the B is also equivalent to NOP 0. That leaves only NOP 2 (NOP 0 + NOP 0 + NOP 2), so B2 has not been loaded yet when .M1X needs it. In other words, I don't understand why SUB and B can replace 2 of the 4 delay cycles (NOP 4).

    Looking forward to your reply.

  • Physically, they *can* be parallel, because they're different units.  In this example, they clearly *aren't*, because in asm only the instructions connected by "||" are executed in parallel.  The reason they aren't in parallel is because the compiler sets them up that way, to maintain the proper number of delay slots.
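
    The "||" rule can be sketched as a grouping pass over the source lines: a line opens a new execute packet unless it starts with "||", regardless of which functional unit it names. The helper names below (`starts_with_parallel_bars`, `packet_index`, `count_execute_packets`, `thread_loop_packets`) are hypothetical, and the sketch ignores labels, comments and blank lines.

```c
#include <string.h>

/* Returns 1 if the line begins, after leading blanks, with "||". */
int starts_with_parallel_bars(const char *line)
{
    while (*line == ' ' || *line == '\t')
        line++;
    return strncmp(line, "||", 2) == 0;
}

/* Index of the execute packet that lines[i] belongs to (0-based).
   Assumes lines[0] does not start with "||". */
int packet_index(const char *const *lines, int i)
{
    int packets = 0;
    for (int k = 0; k <= i; k++)
        if (!starts_with_parallel_bars(lines[k]))
            packets++;      /* this line opens a new packet */
    return packets - 1;
}

/* Total number of execute packets in a listing of n lines. */
int count_execute_packets(const char *const *lines, int n)
{
    return n > 0 ? packet_index(lines, n - 1) + 1 : 0;
}

/* Loop body from this thread, one string per source line. */
const char *const thread_loop[] = {
    "        LDW   .D1  *A4++,A2",
    "||      LDW   .D2  *B4++,B2",
    "        SUB   .S1  A1,1,A1",
    "        NOP   2",
    "[A1]    B     .S2  LOOP",
    "        MPYSP .M1X A2,B2,A6",
    "        NOP   3",
    "        ADDSP .L1  A6,A7,A7",
};

int thread_loop_packets(void)
{
    return count_execute_packets(thread_loop, 8);
}
```

    Here the two LDW lines land in packet 0, while SUB is packet 1 and B is packet 3, even though .S1 and .S2 are different units from .D1 and .D2. The eight lines form 7 packets; because NOP 2 and NOP 3 are multicycle, those 7 packets occupy 10 cycles.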

  • Thanks for your reply.

    Could you recommend some material for studying delay slots, ||, and so on?

  • The TMS320C6000 Programmer's Guide, already mentioned, and the TMS320C6000 CPU and Instruction Set Reference Guide are good sources.  The latter goes into more detail about them.