Hi All!
Most of our previous projects were for the C6416 DSP, and we used the Linear Assembly Optimizer a lot.
Now we are porting an existing project to C66 Keystone, which is new for us. Checking the output produced by the Linear Assembly Optimizer, we found a big difference in the code compared with the result produced for C64x.
It is not clear what the real length of the generated PIPED LOOP KERNEL is.
To understand this better, we took the Dot Product Example 4-10 from the "Optimizing Compiler v7.4 User Guide", p. 131.
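For reference, here is a minimal C sketch of what the loop computes, reconstructed only from the comments in the listings below (two accumulators, four 16-bit products per iteration). The function name `dotp_sketch`, its signature, the requirement that `count` is a multiple of 4, and the final `sum0 + sum1` are assumptions for illustration; this is not the text of Example 4-10.

```c
#include <assert.h>

/* Hypothetical C equivalent of the unrolled dot-product loop.
 * Each LDW in the listings fetches two 16-bit elements; MPY multiplies
 * the low halves and MPYH the high halves, so one iteration handles
 * a[0..3] * b[0..3] and feeds two independent accumulators. */
int dotp_sketch(const short *a, const short *b, int count)
{
    int sum0 = 0;   /* accumulates products with even index */
    int sum1 = 0;   /* accumulates products with odd index  */
    int i;

    /* Assumption: the element count is a multiple of 4. */
    assert(count % 4 == 0);

    for (i = 0; i < count; i += 4) {
        sum0 += a[i]     * b[i];      /* sum0 += a[0] * b[0] */
        sum1 += a[i + 1] * b[i + 1];  /* sum1 += a[1] * b[1] */
        sum0 += a[i + 2] * b[i + 2];  /* sum0 += a[2] * b[2] */
        sum1 += a[i + 3] * b[i + 3];  /* sum1 += a[3] * b[3] */
    }

    /* Assumption: the two partial sums are combined after the loop. */
    return sum0 + sum1;
}
```

With four multiplies and four loads per iteration, a 2-cycle kernel keeps both multiply units and both load/store units busy on every cycle, which is what both listings report as "EXCLUSIVE CPU CYCLES: 2".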
Generating code for c64x (as well as for c62 or c64+) with
cl6x --symdebug:none -O3 -k -mv6400 -mw dotp2.sa
produces the expected result:
;** --------------------------------------------------------------------------*
$C$L2: ; PIPED LOOP KERNEL
; EXCLUSIVE CPU CYCLES: 2
[ B1] SUB .L2 B1,1,B1 ; <0,8>
|| ADD .S2 prod1',sum0,sum0' ; |22| <0,8> ^ sum0 += a[0] * b[0]
|| ADD .L1 prod2',sum1,sum1' ; |23| <0,8> ^ sum1 += a[1] * b[1]
|| MPY .M2X val1$1,val2',prod1' ; |20| <1,6> a[0] * b[0]
|| MPYH .M1X val1$1,val2',prod2' ; |21| <1,6> a[1] * b[1]
|| [ I] B .S1 $C$L2 ; |33| <2,4> if (!0) goto loop
|| [ B1] LDW .D1T1 *a_4++(8),val1 ; |25| <3,2> load a[2-3] bankx+2
|| [ A0] LDW .D2T2 *a_0'++(8),val1$1 ; |18| <4,0> load a[0-1] bankx
[ A0] SUB .L1 A0,1,A0 ; <0,9>
|| ADD .S2 prod1,sum0',sum0 ; |29| <0,9> ^ sum0 += a[2] * b[2]
|| ADD .S1 prod2,sum1',sum1 ; |30| <0,9> ^ sum1 += a[3] * b[3]
|| MPY .M2X val1,val2,prod1 ; |27| <1,7> a[2] * b[2]
|| MPYH .M1X val1,val2,prod2 ; |28| <1,7> a[3] * b[3]
|| [ I] ADD .L2 0xffffffff,I,I ; |32| <3,3> I--
|| [ A0] LDW .D2T2 *b_4++(8),val2 ; |26| <4,1> load b[2-3] banky+2
|| [ A0] LDW .D1T1 *b_0++(8),val2' ; |19| <4,1> load b[0-1] banky
;** --------------------------------------------------------------------------*
Generating code for c66, however, produces totally different output:
$C$L1: ; PIPED LOOP PROLOG
; EXCLUSIVE CPU CYCLES: 9
SPLOOPD 2 ;10 ; (P)
|| MV .L2X a_0',a_0 ; |1|
|| MVC .S2 B6,ILC
;** --------------------------------------------------------------------------*
$C$L2: ; PIPED LOOP KERNEL
; EXCLUSIVE CPU CYCLES: 2
SPMASK L1,L2
|| MV .L1X b_0',b_0 ; |1|
|| ADD .L2 0x4,b_0',b_4 ; |6|
|| LDW .D2T2 *a_0++(8),val1' ; |18| (P) <0,0> load a[0-1] bankx
SPMASK L1
|| ADD .L1 0x4,a_0',a_4 ; |5|
|| LDW .D2T2 *b_4++(8),val2 ; |26| (P) <0,1> load b[2-3] banky+2
|| LDW .D1T1 *b_0++(8),val2' ; |19| (P) <0,1> load b[0-1] banky
LDW .D1T1 *a_4++(8),val1 ; |25| (P) <0,2> load a[2-3] bankx+2
NOP 3
MPY .M2X val1',val2',prod1' ; |20| (P) <0,6> a[0] * b[0]
|| MPYH .M1X val1',val2',prod2 ; |21| (P) <0,6> a[1] * b[1]
SPMASK L1,L2
|| ZERO .L2 B7 ; |9|
|| ZERO .L1 sum1 ; |9|
|| MPY .M2X val1,val2,prod1 ; |27| (P) <0,7> a[2] * b[2]
|| MPYH .M1X val1,val2,prod2 ; |28| (P) <0,7> a[3] * b[3]
ADD .L2 prod1',sum0,sum0' ; |22| <0,8> ^ sum0 += a[0] * b[0]
|| ADD .L1 prod2,sum1,sum1 ; |23| <0,8> ^ sum1 += a[1] * b[1]
SPKERNEL 2,0
|| ADD .L2 prod1,sum0',sum0 ; |29| <0,9> ^ sum0 += a[2] * b[2]
|| ADD .L1 prod2,sum1,sum1 ; |30| <0,9> ^ sum1 += a[3] * b[3]
;** --------------------------------------------------------------------------*
At first glance, the PIPED LOOP KERNEL is now longer. It seems we have missed something important, but further reading about the SPKERNEL and SPMASK instructions didn't make the picture any clearer.
Could you please give us some hints on how to understand this?
Thank you.
Best regards.
Dmitry.