Version of MCSDK - MCSDK_HPC_03_00_01_08
Processor and board platform (EVM & its revision) - Hawking chip Evaluation Board EVMK2H Rev 3.0.
API - OpenCL 1.1, no CSL
Hello everyone,
A very specific question: the OpenCL code was compiled with the -Wall -O3 -k options. Below is the generated ASM; ii = 19. How can I optimize this loop and reduce ii? I think "Split a long life (split-join)" should help with this, but I am not sure what it means. I would also appreciate pointers to reading material: I remember reading a document that describes all the compiler hints for optimizing loops, but I cannot find it again. Even when I had that document, I did not find any explanation of annotations like "[B_M66] <0,15>" (to the right of every loop instruction). What do they mean?
;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : Unknown
;*      Known Minimum Trip Count         : 1
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 19
;*      Unpartitioned Resource Bound     : 6
;*      Partitioned Resource Bound(*)    : 6
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        1
;*      .S units                     1        0
;*      .D units                     2        4
;*      .M units                     2        2
;*      .X cross paths               0        3
;*      .T address paths             2        4
;*      Logical  ops (.LS)           0        2     (.L or .S unit)
;*      Addition ops (.LSD)         13        8     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             1        2
;*      Bound(.L .S .D .LS .LSD)     6*       5
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 19 Schedule found with 2 iterations in parallel
;*
;*      Register Usage Table:
;*          +-----------------------------------------------------------------+
;*          |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;*          |00000000001111111111222222222233|00000000001111111111222222222233|
;*          |01234567890123456789012345678901|01234567890123456789012345678901|
;*          |--------------------------------+--------------------------------|
;*       0: |*** ***** |** ** *** *** |
;*       1: |*** ***** |** ** *** *** |
;*       2: |********* |** ** *** *** |
;*       3: |********* |** ** *** *** |
;*       4: |********* |** ** *** *** |
;*       5: |********** |** ** *** *** |
;*       6: |********* |** ** *** *** |
;*       7: |********* |** ****** *** |
;*       8: |********* |** ****** *** |
;*       9: |********** |** ****** **** |
;*      10: |*** ***** |** ****** *** |
;*      11: |*** ***** |** ****** *** |
;*      12: |*** ***** |** ****** *** |
;*      13: |*** ***** |** ****** **** |
;*      14: |*** ***** |** ****** ** |
;*      15: |*** ***** |** ****** *** |
;*      16: |*** ***** |** ****** *** |
;*      17: | ******** |** ****** *** |
;*      18: | ******** |** ****** **** |
;*          +-----------------------------------------------------------------+
;*
;*      Done
;*
;*      Collapsed epilog stages       : 1
;*      Prolog not removed
;*      Collapsed prolog stages       : 0
;*
;*      Minimum required memory pad   : 0 bytes
;*      For further improvement on this loop, try option -mh56
;*
;*      Minimum safe trip count       : 1
;*      Min. prof. trip count  (est.) : 3
;*
;*      Mem bank conflicts/iter(est.) : { min 0.000, est 0.125, max 1.000 }
;*      Mem bank perf. penalty (est.) : 0.7%
;*
;*      Effective ii                  : { min 19.00, est 19.12, max 20.00 }
;*
;*
;*      Total cycles (est.)           : 6 + trip_cnt * 19
;*----------------------------------------------------------------------------*
;*       SETUP CODE
;*
;*                  MVK             1,A2    ; []
;*                  MV              A2,A1   ; []
;*                  MV              A1,A0   ; []
;*
;*        SINGLE SCHEDULED ITERATION
;*
;*        $C$C205:
;*   0      [ A2]   LDW     .D2T2   *B5++(4),B4       ; [B_D64P]
;*   1              NOP             1                 ; [A_L66]
;*   2      [ A2]   LDW     .D1T1   *A6++(4),A7       ; [A_D64P]
;*   3              NOP             2                 ; [A_L66]
;*   5              CMPEQ   .L2     B7,B4,B0          ; [B_L66]  ^
;*   6      [ A1]   LDW     .D2T2   *+B8[B4],B17      ; [B_D64P]
;*   7              MV      .L1     A7,A3             ; [A_L66] Split a long life (split-join)
;*       || [!B0]   LDW     .D1T1   *A5(0),A3         ; [A_D64P]  ^
;*   8              NOP             1                 ; [A_L66]
;*   9              SUB     .L2     B1,1,B1           ; [B_L66]
;*  10              MV      .S1     A3,A9             ; [A_S66] Split a long life (split-join)
;*       || [!B1]   ZERO    .L1     A2                ; [A_L66]
;*  11              MPYSP   .M2X    A7,B17,B6         ; [B_M66]
;*       || [!B0]   MPYSP   .M1     A9,A3,A9          ; [A_M66]  ^
;*       ||         MV      .L1     A2,A3             ; [A_L66] Split a long life (split-join)
;*  12              MV      .L2     B4,B6             ; [B_L66] Split a long life (split-join)
;*  13              NOP             1                 ; [A_L66]
;*  14              MV      .L2     B6,B19            ; [B_L66] Split a long life (split-join)
;*  15              MV      .S2     B19,B6            ; [B_Sb66] Split a long life (split-join)
;*       ||         MPYSP   .M2X    A4,B6,B19         ; [B_M66]
;*       || [!B0]   MPYSP   .M1     A4,A9,A8          ; [A_M66]  ^
;*       || [!A1]   MVK     .L2     1,B0              ; [B_L66]  ^
;*       ||         MV      .L1     A3,A1             ; [A_L66] Split a long life (split-join)
;*  16      [!B0]   LDW     .D2T2   *+B9[B6],B16      ; [B_D64P]  ^
;*  17              NOP             2                 ; [A_L66]
;*  19              FADDSP  .L2     B19,B16,B16       ; [B_L66]
;*       || [ A2]   B       .S1     $C$C205           ; [A_S66]
;*  20              NOP             1                 ; [A_L66]
;*  21      [!B0]   FADDSP  .L2X    B16,A8,B19        ; [B_L66]  ^
;*  22      [ A0]   MV      .L2     B16,B18           ; [B_L66]
;*       ||         MV      .L1     A1,A3             ; [A_L66] Split a long life (split-join)
;*  23              NOP             1                 ; [A_L66]
;*  24      [!B0]   STW     .D2T2   B19,*+B9[B6]      ; [B_D64P]  ^
;*       ||         MV      .L1     A3,A0             ; [A_L66] Split a long life (split-join)
;*  25              ; BRANCHCC OCCURS {$C$C205}       ; []
;*
;*       RESTORE CODE
;*
;*                  MV              B18,B16 ; []
;*----------------------------------------------------------------------------*
$C$L150:    ; PIPED LOOP PROLOG
;** --------------------------------------------------------------------------*
$C$L151:    ; PIPED LOOP KERNEL
;          EXCLUSIVE CPU CYCLES: 19

   [ A1]   LDW     .D2T2   *+B8[B4],B17      ; [B_D64P] <0,6>
|| [!B0]   LDW     .D1T1   *A5(0),A3         ; [A_D64P] <0,6>  ^

           MV      .L1     A7,A3             ; [A_L66] <0,7> Split a long life (split-join)
           NOP             1                 ; [A_L66]
           SUB     .L2     B1,1,B1           ; [B_L66] <0,9>

           MV      .S1     A3,A9             ; [A_S66] <0,10> Split a long life (split-join)
|| [!B1]   ZERO    .L1     A2                ; [A_L66] <0,10>

           MV      .L1     A2,A3             ; [A_L66] <0,11> Split a long life (split-join)
||         MPYSP   .M2X    A7,B17,B6         ; [B_M66] <0,11>
|| [!B0]   MPYSP   .M1     A9,A3,A9          ; [A_M66] <0,11>  ^

           MV      .L2     B4,B6             ; [B_L66] <0,12> Split a long life (split-join)
           NOP             1                 ; [A_L66]
           MV      .L2     B6,B19            ; [B_L66] <0,14> Split a long life (split-join)

           MV      .S2     B19,B6            ; [B_Sb66] <0,15> Split a long life (split-join)
||         MV      .L1     A3,A1             ; [A_L66] <0,15> Split a long life (split-join)
||         MPYSP   .M2X    A4,B6,B19         ; [B_M66] <0,15>
|| [!A1]   MVK     .L2     1,B0              ; [B_L66] <0,15>  ^
|| [!B0]   MPYSP   .M1     A4,A9,A8          ; [A_M66] <0,15>  ^

   [!B0]   LDW     .D2T2   *+B9[B6],B16      ; [B_D64P] <0,16>  ^
           NOP             2                 ; [A_L66]

   [ A2]   BNOP            $C$L151,1         ; [] <0,19>
||         FADDSP  .L2     B19,B16,B16       ; [B_L66] <0,19>
|| [ A2]   LDW     .D2T2   *B5++(4),B4       ; [B_D64P] <1,0>

   [!B0]   FADDSP  .L2X    B16,A8,B19        ; [B_L66] <0,21>  ^
|| [ A2]   LDW     .D1T1   *A6++(4),A7       ; [A_D64P] <1,2>

           MV      .L1     A1,A3             ; [A_L66] <0,22> Split a long life (split-join)
|| [ A0]   MV      .L2     B16,B18           ; [B_L66] <0,22>

           NOP             1                 ; [A_L66]

           MV      .L1     A3,A0             ; [A_L66] <0,24> Split a long life (split-join)
|| [!B0]   STW     .D2T2   B19,*+B9[B6]      ; [B_D64P] <0,24>  ^
||         CMPEQ   .L2     B7,B4,B0          ; [B_L66] <1,5>  ^

;** --------------------------------------------------------------------------*
$C$L152:    ; PIPED LOOP EPILOG
;** --------------------------------------------------------------------------*
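To make the numbers in the report easier to discuss, here is a small C sketch of how I read them. The constants are copied from the listing; the function names (`min_ii`, `estimated_cycles`) are my own and not part of any TI tool. The assumption, as far as I understand the report, is that the compiler cannot find a schedule below the larger of the loop-carried dependency bound and the partitioned resource bound, which would explain why ii ends up at 19 even though the resource bound is only 6.

```c
#include <assert.h>

/* Values copied from the SOFTWARE PIPELINE INFORMATION block. */
enum {
    LOOP_CARRIED_BOUND = 19, /* Loop Carried Dependency Bound(^)            */
    RESOURCE_BOUND     = 6,  /* Partitioned Resource Bound(*)               */
    FOUND_II           = 19, /* "ii = 19 Schedule found"                    */
    OVERHEAD_CYCLES    = 6,  /* constant term in "Total cycles (est.)"      */
    MIN_SAFE_TRIP_CNT  = 1   /* "Minimum safe trip count"                   */
};

/* Lower limit on ii under the assumption stated above:
   the larger of the dependency bound and the resource bound. */
int min_ii(int dependency_bound, int resource_bound)
{
    return dependency_bound > resource_bound ? dependency_bound
                                             : resource_bound;
}

/* The report's own estimate: Total cycles (est.) = 6 + trip_cnt * 19. */
long estimated_cycles(long trip_cnt)
{
    assert(trip_cnt >= MIN_SAFE_TRIP_CNT);
    return OVERHEAD_CYCLES + trip_cnt * FOUND_II;
}
```

With these numbers, `min_ii(LOOP_CARRIED_BOUND, RESOURCE_BOUND)` returns 19 and `estimated_cycles(1000)` returns 19006, so the loop looks dependency-limited rather than resource-limited: only shortening the chain of instructions marked with `^` should be able to bring ii down toward 6.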