Understanding Linear Assembly Optimizer output.

Du Fu

Hi All!

Most of our previous projects were for the C6416 DSP, and we used the Linear Assembly Optimizer a lot.

Now we are porting an existing project to C66 Keystone, which is new for us. Checking the output produced by the Linear Assembly Optimizer, we found a big difference in the code compared with the result produced for C64x.

It is not clear what the real length of the generated PIPED LOOP KERNEL is.

To understand better, we took the Dot Product Example 4-10 from "Optimizing Compiler v7.4 User Guide p.131"

Generating code for C64x (as well as for C62 or C64+)

cl6x --symdebug:none -O3 -k -mv6400 -mw dotp2.sa

produces the expected result:

;** --------------------------------------------------------------------------*
$C$L2:    ; PIPED LOOP KERNEL
;          EXCLUSIVE CPU CYCLES: 2

   [ B1]   SUB     .L2     B1,1,B1           ; <0,8>
||         ADD     .S2     prod1',sum0,sum0' ; |22| <0,8> ^ sum0 += a[0] * b[0]
||         ADD     .L1     prod2',sum1,sum1' ; |23| <0,8> ^ sum1 += a[1] * b[1]
||         MPY     .M2X    val1$1,val2',prod1' ; |20| <1,6> a[0] * b[0]
||         MPYH    .M1X    val1$1,val2',prod2' ; |21| <1,6> a[1] * b[1]
|| [ I]    B       .S1     $C$L2             ; |33| <2,4> if (!0) goto loop
|| [ B1]   LDW     .D1T1   *a_4++(8),val1    ; |25| <3,2> load a[2-3] bankx+2
|| [ A0]   LDW     .D2T2   *a_0'++(8),val1$1 ; |18| <4,0> load a[0-1] bankx

   [ A0]   SUB     .L1     A0,1,A0           ; <0,9>
||         ADD     .S2     prod1,sum0',sum0 ; |29| <0,9> ^ sum0 += a[2] * b[2]
||         ADD     .S1     prod2,sum1',sum1 ; |30| <0,9> ^ sum1 += a[3] * b[3]
||         MPY     .M2X    val1,val2,prod1   ; |27| <1,7> a[2] * b[2]
||         MPYH    .M1X    val1,val2,prod2   ; |28| <1,7> a[3] * b[3]
|| [ I]    ADD     .L2     0xffffffff,I,I    ; |32| <3,3> I--
|| [ A0]   LDW     .D2T2   *b_4++(8),val2    ; |26| <4,1> load b[2-3] banky+2
|| [ A0]   LDW     .D1T1   *b_0++(8),val2'   ; |19| <4,1> load b[0-1] banky
;** --------------------------------------------------------------------------*
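One way to read the `<i,c>` annotations in this kernel is sketched below. It assumes the usual modulo-scheduling interpretation: `c` is the cycle within a single iteration's schedule, a new iteration starts every 2 cycles (the kernel length), and `i` counts how far an instruction's iteration is behind the oldest one in flight. The helper name `kernel_position` is hypothetical; the cycle numbers are copied from the listing.

```python
II = 2          # kernel length, from "EXCLUSIVE CPU CYCLES: 2"
LAST_CYCLE = 9  # largest cycle number in the annotations above


def kernel_position(cycle, ii=II, last_cycle=LAST_CYCLE):
    """Map a single-iteration cycle to (execute packet, iteration label).

    Assumption: the execute packet is cycle mod ii, and the label is the
    number of pipeline stages still ahead of this instruction.
    """
    stages = last_cycle // ii + 1       # iterations in flight at once
    packet = cycle % ii                 # 0 = first packet, 1 = second
    label = (stages - 1) - cycle // ii  # 0 = oldest iteration in flight
    return packet, label


# (cycle, instruction) pairs copied from the kernel listing.
annotated = [
    (8, "ADD sum0'"), (6, "MPY prod1'"), (4, "B $C$L2"),
    (2, "LDW a_4"), (0, "LDW a_0'"),
    (9, "ADD sum0"), (7, "MPY prod1"), (3, "ADD I"), (1, "LDW b_0"),
]

for cycle, instr in annotated:
    packet, label = kernel_position(cycle)
    print(f"{instr:12s} packet {packet}  <{label},{cycle}>")
```

Under that assumption, every annotation in the first execute packet has an even cycle number and every one in the second has an odd cycle number, which matches the listing.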

Generating code for C66, however, produces totally different output:

$C$L1:    ; PIPED LOOP PROLOG
;          EXCLUSIVE CPU CYCLES: 9

           SPLOOPD 2       ;10               ; (P)
||         MV      .L2X    a_0',a_0          ; |1|
||         MVC     .S2     B6,ILC
;** --------------------------------------------------------------------------*
$C$L2:    ; PIPED LOOP KERNEL
;          EXCLUSIVE CPU CYCLES: 2

           SPMASK          L1,L2
||         MV      .L1X    b_0',b_0          ; |1|
||         ADD     .L2     0x4,b_0',b_4      ; |6|
||         LDW     .D2T2   *a_0++(8),val1'   ; |18| (P) <0,0> load a[0-1] bankx

           SPMASK          L1
||         ADD     .L1     0x4,a_0',a_4      ; |5|
||         LDW     .D2T2   *b_4++(8),val2    ; |26| (P) <0,1> load b[2-3] banky+2
||         LDW     .D1T1   *b_0++(8),val2'   ; |19| (P) <0,1> load b[0-1] banky

           LDW     .D1T1   *a_4++(8),val1    ; |25| (P) <0,2> load a[2-3] bankx+2
           NOP             3

           MPY     .M2X    val1',val2',prod1' ; |20| (P) <0,6> a[0] * b[0]
||         MPYH    .M1X    val1',val2',prod2 ; |21| (P) <0,6> a[1] * b[1]

           SPMASK          L1,L2
||         ZERO    .L2     B7                ; |9|
||         ZERO    .L1     sum1              ; |9|
||         MPY     .M2X    val1,val2,prod1   ; |27| (P) <0,7> a[2] * b[2]
||         MPYH    .M1X    val1,val2,prod2   ; |28| (P) <0,7> a[3] * b[3]

           ADD     .L2     prod1',sum0,sum0' ; |22| <0,8> ^ sum0 += a[0] * b[0]
||         ADD     .L1     prod2,sum1,sum1   ; |23| <0,8> ^ sum1 += a[1] * b[1]

           SPKERNEL 2,0
||         ADD     .L2     prod1,sum0',sum0 ; |29| <0,9> ^ sum0 += a[2] * b[2]
||         ADD     .L1     prod2,sum1,sum1   ; |30| <0,9> ^ sum1 += a[3] * b[3]
;** --------------------------------------------------------------------------*

At first glance the PIPED LOOP KERNEL is much longer now. It seems we missed something important, but further reading about the SPKERNEL and SPMASK instructions did not make the picture clear.

Can you please give some hints on how to understand this?

Thank you.

Best regards.

Dmitry.

over 12 years ago

George Mock over 12 years ago

TI__Guru**** 251090 points

For some basics on SPLOOP and friends, please see this PDF document, which comes from this wiki article.

That said, I recommend you not worry much about the details of how these instructions work. Let the compiler worry about that for you. Use the compiler option --debug_software_pipeline. The compiler emits a block comment about each loop in the generated assembly file. This option tells the compiler to issue a verbose version of that comment block, and to not delete the .asm file. You then inspect the .asm file to see information about each loop. This is true regardless of whether you are building for an older generation device that does not support SPLOOP, or a later generation device that does. But it is much more helpful in understanding SPLOOP loops. Several good examples of getting information from this compiler generated comment block are in this application note.

Thanks and regards,

-George

Du Fu over 12 years ago in reply to George Mock

Intellectual 380 points

Hi George.

Thank you for your quick reply and for pointing to some very useful reading.

The software pipeline loop buffer feature, introduced in C64+ and heavily used on C66 Keystone, is a little difficult to understand, especially when the piped loop kernel takes more than one CPU cycle. I didn't find a more detailed explanation of that situation.

Just an example. Here the loop kernel takes 2 CPU cycles:

;*----------------------------------------------------------------------------*
$C$L1:    ; PIPED LOOP PROLOG
;          EXCLUSIVE CPU CYCLES: 9

           SPLOOPD 2       ;10               ; (P)
||         MV      .L1X    B4,A7             ; |35|
||         MV      .L2X    A4,B6
||         MVC     .S2     B6,ILC

;** --------------------------------------------------------------------------*
$C$L2:    ; PIPED LOOP KERNEL
;          EXCLUSIVE CPU CYCLES: 2

           SPMASK          L1
||         MV      .L1     A6,A8
||         LDW     .D1T1   *A7++,A6          ; |49| (P) <0,0>
||         LDW     .D2T2   *B6++,B5          ; |49| (P) <0,0>

           NOP             4

           SMPYH   .M1X    B5,A6,A3          ; |49| (P) <0,5>
||         SMPY    .M2X    B5,A6,B4          ; |49| (P) <0,5>

           NOP             1
           SHR     .S1     A3,8,A5           ; |49| (P) <0,7>
           SHR     .S1X    B4,8,A4           ; |49| <0,8>

           SPKERNEL 1,0
||         STDW    .D1T1   A5:A4,*A8++(8)    ; |49| <0,9>

;** --------------------------------------------------------------------------*
$C$L3:    ; PIPED LOOP EPILOG

Which instructions in this loop are executed during the first CPU cycle and which during the second?
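A possible way to approach this, as a sketch only: assume the listing shows one iteration's schedule (every annotation is `<0,c>`), and that once the loop buffer is full a new iteration is started every 2 cycles, as requested by `SPLOOPD 2`. Under that assumption an instruction at cycle `c` lands in kernel cycle `c mod 2`. The function name `group_by_kernel_cycle` is hypothetical, and the `SPMASK`ed `MV` is left out on the assumption that it is setup code that is not loaded into the loop buffer.

```python
from collections import defaultdict

II = 2  # iteration interval, from "SPLOOPD 2"

# (cycle within one iteration, instruction) copied from the listing above.
single_iteration = [
    (0, "LDW .D1T1"),
    (0, "LDW .D2T2"),
    (5, "SMPYH .M1X"),
    (5, "SMPY .M2X"),
    (7, "SHR .S1"),
    (8, "SHR .S1X"),
    (9, "STDW .D1T1"),
]


def group_by_kernel_cycle(schedule, ii):
    """Group instructions by the steady-state kernel cycle (cycle mod ii)."""
    slots = defaultdict(list)
    for cycle, instr in schedule:
        slots[cycle % ii].append(instr)
    return {slot: slots[slot] for slot in range(ii)}


kernel = group_by_kernel_cycle(single_iteration, II)
for slot in range(II):
    print(f"kernel cycle {slot}: {', '.join(kernel[slot])}")
```

With this reading, the two `SHR` instructions on `.S1` (cycles 7 and 8) and the `LDW`/`STDW` pair on `.D1` (cycles 0 and 9) fall into different kernel cycles, so no functional unit is used twice in the same cycle.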

Thanks and regards.

Dmitry

George Mock over 12 years ago in reply to Du Fu

TI__Guru**** 251090 points

Dmitry Froloff said:
Which instructions in this loop fulfilled during the first CPU cycle and which during the second?

I don't know. It is an exaggeration to say no one who writes code for C6000 knows. But not by much. Knowing such details is not required. You, and many who have gone before you, will be able to make your code run very, very fast, despite the fact you don't know which instructions in the loop are executing on which cycle.

Thanks and regards,

-George

Du Fu over 12 years ago in reply to George Mock

Intellectual 380 points

Hi George.

George Mock said:

Which instructions in this loop fulfilled during the first CPU cycle and which during the second?

I don't know. It is an exaggeration to say no one who writes code for C6000 knows. But not by much. Knowing such details is not required.


I can accept this, but I can't verify your answer, sorry.

Regards

Dmitry.
