Version of MCSDK - MCSDK_HPC_03_00_01_08
Processor and board platform (EVM & its revision) - Hawking chip Evaluation Board EVMK2H Rev 3.0.
API - OpenCL 1.1, no CSL
Hello everyone,
Very specific question: the OpenCL code was compiled with the options -Wall -O3 -k. The generated ASM is below; the loop is software-pipelined with ii=19. How can I optimize this loop and reduce ii? I think the "Split a long life (split-join)" comments should help with this, but I am not sure what they mean. I would also appreciate pointers to reading material: I remember reading a document that describes all the compiler hints for optimizing loops, but I cannot find it again. Even when I had that document, I did not find any explanation of annotations like "[B_M66] <0,15>" (to the right of every loop instruction). What do they mean?
;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop found in file : Unknown
;* Known Minimum Trip Count : 1
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 19
;* Unpartitioned Resource Bound : 6
;* Partitioned Resource Bound(*) : 6
;* Resource Partition:
;* A-side B-side
;* .L units 0 1
;* .S units 1 0
;* .D units 2 4
;* .M units 2 2
;* .X cross paths 0 3
;* .T address paths 2 4
;* Logical ops (.LS) 0 2 (.L or .S unit)
;* Addition ops (.LSD) 13 8 (.L or .S or .D unit)
;* Bound(.L .S .LS) 1 2
;* Bound(.L .S .D .LS .LSD) 6* 5
;*
;* Searching for software pipeline schedule at ...
;* ii = 19 Schedule found with 2 iterations in parallel
;*
;* Register Usage Table:
;* +-----------------------------------------------------------------+
;* |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;* |00000000001111111111222222222233|00000000001111111111222222222233|
;* |01234567890123456789012345678901|01234567890123456789012345678901|
;* |--------------------------------+--------------------------------|
;* 0: |*** ***** |** ** *** *** |
;* 1: |*** ***** |** ** *** *** |
;* 2: |********* |** ** *** *** |
;* 3: |********* |** ** *** *** |
;* 4: |********* |** ** *** *** |
;* 5: |********** |** ** *** *** |
;* 6: |********* |** ** *** *** |
;* 7: |********* |** ****** *** |
;* 8: |********* |** ****** *** |
;* 9: |********** |** ****** **** |
;* 10: |*** ***** |** ****** *** |
;* 11: |*** ***** |** ****** *** |
;* 12: |*** ***** |** ****** *** |
;* 13: |*** ***** |** ****** **** |
;* 14: |*** ***** |** ****** ** |
;* 15: |*** ***** |** ****** *** |
;* 16: |*** ***** |** ****** *** |
;* 17: | ******** |** ****** *** |
;* 18: | ******** |** ****** **** |
;* +-----------------------------------------------------------------+
;*
;* Done
;*
;* Collapsed epilog stages : 1
;* Prolog not removed
;* Collapsed prolog stages : 0
;*
;* Minimum required memory pad : 0 bytes
;*
;* For further improvement on this loop, try option -mh56
;*
;* Minimum safe trip count : 1
;* Min. prof. trip count (est.) : 3
;*
;* Mem bank conflicts/iter(est.) : { min 0.000, est 0.125, max 1.000 }
;* Mem bank perf. penalty (est.) : 0.7%
;*
;* Effective ii : { min 19.00, est 19.12, max 20.00 }
;*
;*
;* Total cycles (est.) : 6 + trip_cnt * 19
;*----------------------------------------------------------------------------*
;* SETUP CODE
;*
;* MVK 1,A2 ; []
;* MV A2,A1 ; []
;* MV A1,A0 ; []
;*
;* SINGLE SCHEDULED ITERATION
;*
;* $C$C205:
;* 0 [ A2] LDW .D2T2 *B5++(4),B4 ; [B_D64P]
;* 1 NOP 1 ; [A_L66]
;* 2 [ A2] LDW .D1T1 *A6++(4),A7 ; [A_D64P]
;* 3 NOP 2 ; [A_L66]
;* 5 CMPEQ .L2 B7,B4,B0 ; [B_L66] ^
;* 6 [ A1] LDW .D2T2 *+B8[B4],B17 ; [B_D64P]
;* 7 MV .L1 A7,A3 ; [A_L66] Split a long life (split-join)
;* || [!B0] LDW .D1T1 *A5(0),A3 ; [A_D64P] ^
;* 8 NOP 1 ; [A_L66]
;* 9 SUB .L2 B1,1,B1 ; [B_L66]
;* 10 MV .S1 A3,A9 ; [A_S66] Split a long life (split-join)
;* || [!B1] ZERO .L1 A2 ; [A_L66]
;* 11 MPYSP .M2X A7,B17,B6 ; [B_M66]
;* || [!B0] MPYSP .M1 A9,A3,A9 ; [A_M66] ^
;* || MV .L1 A2,A3 ; [A_L66] Split a long life (split-join)
;* 12 MV .L2 B4,B6 ; [B_L66] Split a long life (split-join)
;* 13 NOP 1 ; [A_L66]
;* 14 MV .L2 B6,B19 ; [B_L66] Split a long life (split-join)
;* 15 MV .S2 B19,B6 ; [B_Sb66] Split a long life (split-join)
;* || MPYSP .M2X A4,B6,B19 ; [B_M66]
;* || [!B0] MPYSP .M1 A4,A9,A8 ; [A_M66] ^
;* || [!A1] MVK .L2 1,B0 ; [B_L66] ^
;* || MV .L1 A3,A1 ; [A_L66] Split a long life (split-join)
;* 16 [!B0] LDW .D2T2 *+B9[B6],B16 ; [B_D64P] ^
;* 17 NOP 2 ; [A_L66]
;* 19 FADDSP .L2 B19,B16,B16 ; [B_L66]
;* || [ A2] B .S1 $C$C205 ; [A_S66]
;* 20 NOP 1 ; [A_L66]
;* 21 [!B0] FADDSP .L2X B16,A8,B19 ; [B_L66] ^
;* 22 [ A0] MV .L2 B16,B18 ; [B_L66]
;* || MV .L1 A1,A3 ; [A_L66] Split a long life (split-join)
;* 23 NOP 1 ; [A_L66]
;* 24 [!B0] STW .D2T2 B19,*+B9[B6] ; [B_D64P] ^
;* || MV .L1 A3,A0 ; [A_L66] Split a long life (split-join)
;* 25 ; BRANCHCC OCCURS {$C$C205} ; []
;*
;* RESTORE CODE
;*
;* MV B18,B16 ; []
;*----------------------------------------------------------------------------*
$C$L150: ; PIPED LOOP PROLOG
;** --------------------------------------------------------------------------*
$C$L151: ; PIPED LOOP KERNEL
; EXCLUSIVE CPU CYCLES: 19
[ A1] LDW .D2T2 *+B8[B4],B17 ; [B_D64P] <0,6>
|| [!B0] LDW .D1T1 *A5(0),A3 ; [A_D64P] <0,6> ^
MV .L1 A7,A3 ; [A_L66] <0,7> Split a long life (split-join)
NOP 1 ; [A_L66]
SUB .L2 B1,1,B1 ; [B_L66] <0,9>
MV .S1 A3,A9 ; [A_S66] <0,10> Split a long life (split-join)
|| [!B1] ZERO .L1 A2 ; [A_L66] <0,10>
MV .L1 A2,A3 ; [A_L66] <0,11> Split a long life (split-join)
|| MPYSP .M2X A7,B17,B6 ; [B_M66] <0,11>
|| [!B0] MPYSP .M1 A9,A3,A9 ; [A_M66] <0,11> ^
MV .L2 B4,B6 ; [B_L66] <0,12> Split a long life (split-join)
NOP 1 ; [A_L66]
MV .L2 B6,B19 ; [B_L66] <0,14> Split a long life (split-join)
MV .S2 B19,B6 ; [B_Sb66] <0,15> Split a long life (split-join)
|| MV .L1 A3,A1 ; [A_L66] <0,15> Split a long life (split-join)
|| MPYSP .M2X A4,B6,B19 ; [B_M66] <0,15>
|| [!A1] MVK .L2 1,B0 ; [B_L66] <0,15> ^
|| [!B0] MPYSP .M1 A4,A9,A8 ; [A_M66] <0,15> ^
[!B0] LDW .D2T2 *+B9[B6],B16 ; [B_D64P] <0,16> ^
NOP 2 ; [A_L66]
[ A2] BNOP $C$L151,1 ; [] <0,19>
|| FADDSP .L2 B19,B16,B16 ; [B_L66] <0,19>
|| [ A2] LDW .D2T2 *B5++(4),B4 ; [B_D64P] <1,0>
[!B0] FADDSP .L2X B16,A8,B19 ; [B_L66] <0,21> ^
|| [ A2] LDW .D1T1 *A6++(4),A7 ; [A_D64P] <1,2>
MV .L1 A1,A3 ; [A_L66] <0,22> Split a long life (split-join)
|| [ A0] MV .L2 B16,B18 ; [B_L66] <0,22>
NOP 1 ; [A_L66]
MV .L1 A3,A0 ; [A_L66] <0,24> Split a long life (split-join)
|| [!B0] STW .D2T2 B19,*+B9[B6] ; [B_D64P] <0,24> ^
|| CMPEQ .L2 B7,B4,B0 ; [B_L66] <1,5> ^
;** --------------------------------------------------------------------------*
$C$L152: ; PIPED LOOP EPILOG
;** --------------------------------------------------------------------------*
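
For reference, this is how I currently read the numbers in the report above; please correct me if I am wrong. As far as I understand, the compiler starts its search for ii at the larger of the Loop Carried Dependency Bound and the Partitioned Resource Bound, so here it is the dependency chain (the instructions marked ^) and not the functional units that keeps ii at 19. A minimal C sketch of that arithmetic follows; the function names are my own and are not part of any TI tool.

```c
#include <assert.h>

/* Hypothetical helper: smallest ii the scheduler can start searching from,
 * i.e. the larger of the two bounds printed in the report. */
unsigned min_ii(unsigned loop_carried_bound, unsigned partitioned_resource_bound)
{
    return loop_carried_bound > partitioned_resource_bound
               ? loop_carried_bound
               : partitioned_resource_bound;
}

/* Hypothetical helper mirroring the report line
 * "Total cycles (est.) : 6 + trip_cnt * 19", with the constants as parameters. */
unsigned long est_total_cycles(unsigned overhead, unsigned long trip_cnt, unsigned ii)
{
    assert(ii > 0); /* an iteration interval of 0 makes no sense */
    return overhead + trip_cnt * ii;
}
```

With the values from the report, min_ii(19, 6) is 19, and est_total_cycles(6, 100, 19) gives 1906 cycles for a trip count of 100. If the dependency chain could be shortened all the way down to the resource bound of 6, the same call with ii = 6 would give 606 cycles, although I do not know whether that is achievable for this loop.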