Hi,
I'm trying to optimize a loop on the C6657 that calculates the absolute difference between two input images. What I have so far is the following:
Looking at the generated assembly, I see the following:
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop found in file : ../source/main.cpp
;* Loop source line : 115
;* Loop opening brace source line : 116
;* Loop closing brace source line : 137
;* Loop Unroll Multiple : 2x
;* Known Minimum Trip Count : 512
;* Known Max Trip Count Factor : 16
;* Loop Carried Dependency Bound(^) : 18
;* Unpartitioned Resource Bound : 19
;* Partitioned Resource Bound(*) : 19
;* Resource Partition:
;* A-side B-side
;* .L units 5 3
;* .S units 19* 18
;* .D units 1 3
;* .M units 0 0
;* .X cross paths 8 7
;* .T address paths 2 2
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 5 3 (.L or .S unit)
;* Addition ops (.LSD) 4 4 (.L or .S or .D unit)
;* Bound(.L .S .LS) 15 12
;* Bound(.L .S .D .LS .LSD) 12 11
;*
;* Searching for software pipeline schedule at ...
;* ii = 19 Schedule found with 3 iterations in parallel
;*
;* Register Usage Table:
;* +-----------------------------------------------------------------+
;* |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;* |00000000001111111111222222222233|00000000001111111111222222222233|
;* |01234567890123456789012345678901|01234567890123456789012345678901|
;* |--------------------------------+--------------------------------|
;* 0: |* ****** * ** |* ** * * *** |
;* 1: |* *** *** * ** |* * ** * * *** |
;* 2: |* ****** * ** |* * **** * *** |
;* 3: |* **** ** * ** |* ****** * *** |
;* 4: |* ******* * ** |* ***** ***** |
;* 5: |* **** ** * ** |* ****** ***** |
;* 6: |* **** * * ** |* ****** * *** |
;* 7: |* ****** * * |* ****** * *** |
;* 8: |* ****** * * |* ****** * *** |
;* 9: |* * ** * * ** |* ***** * *** |
;* 10: |* **** ** |* ***** * *** |
;* 11: |* * **** ** |* ***** *** |
;* 12: |* ***** ** |* ***** *** |
;* 13: |* ***** ** |* ****** *** |
;* 14: |* ****** ** |* ***** *** |
;* 15: |* ****** ** |* ***** *** |
;* 16: |* ***** ** |* *** ** *** |
;* 17: |* ***** ** |* *** * *** |
;* 18: |* * * * **** |* **** * *** |
;* +-----------------------------------------------------------------+
;*
;* Done
;*
;* Epilog not removed
;* Collapsed epilog stages : 0
;* Collapsed prolog stages : 2
;* Minimum required memory pad : 0 bytes
;*
;* For further improvement on this loop, try option -mh16
;*
;* Minimum safe trip count : 2 (after unrolling)
;* Min. prof. trip count (est.) : 4 (after unrolling)
;*
;* Mem bank conflicts/iter(est.) : { min 0.000, est 0.000, max 0.000 }
;* Mem bank perf. penalty (est.) : 0.0%
;*
;*
;* Total cycles (est.) : 25 + trip_cnt * 19
;*----------------------------------------------------------------------------*
;* SETUP CODE
;*
;* SUB B0,1,B0
;*
;* SINGLE SCHEDULED ITERATION
;*
;* $C$C57:
;* 0 LDDW .D2T2 *B19++,B7:B6 ; |118|
;* 1 LDDW .D1T1 *A18++,A5:A4 ; |119|
;* 2 NOP 4
;* 6 EXTU .S2 B6,24,24,B4 ; |122|
;* || SHRU .S1X B6,24,A9 ; |125|
;* 7 EXTU .S2 B6,8,24,B8 ; |124|
;* 8 EXTU .S1 A4,24,24,A3 ; |122|
;* 9 SUB .L1X B4,A3,A7 ; |122|
;* || EXTU .S1 A4,16,24,A3 ; |123|
;* || SHRU .S2X A4,24,B17 ; |125|
;* 10 ABS .L1 A7,A6 ; |122|
;* || EXTU .S1 A4,8,24,A4 ; |124|
;* || EXTU .S2 B7,24,24,B9 ; |122|
;* 11 EXTU .S1 A6,24,24,A6 ; |122|
;* || SUB .L1X B8,A4,A4 ; |124|
;* || SUB .L2X A9,B17,B8 ; |125|
;* 12 EXTU .S2 B6,16,24,B6 ; |123|
;* || ABS .L1 A4,A7 ; |124|
;* || EXTU .S1 A5,24,24,A4 ; |122|
;* 13 EXTU .S1 A7,24,24,A7 ; |124|
;* || ABS .L2 B8,B8 ; |125|
;* 14 EXTU .S1 A7,24,8,A19 ; |124|
;* || SHL .S2 B8,24,B18 ; |125|
;* || SUB .L2X B9,A4,B8 ; |122|
;* 15 SUB .L2X B6,A3,B4 ; |123|
;* 16 ABS .L2 B4,B6 ; |123|
;* || EXTU .S1 A5,16,24,A3 ; |123|
;* 17 NOP 1
;* 18 EXTU .S2 B7,16,24,B4 ; |123|
;* 19 EXTU .S1 A5,8,24,A7 ; |124|
;* 20 ABS .L2 B8,B8 ; |122|
;* || SUB .L1X B4,A3,A8 ; |123|
;* || EXTU .S2 B7,8,24,B9 ; |124|
;* 21 ABS .L1 A8,A5 ; |123|
;* || SHRU .S1 A5,24,A4 ; |125|
;* || SHRU .S2 B7,24,B4 ; |125|
;* 22 EXTU .S2 B8,24,24,B16 ; |122|
;* || SUB .L1X B9,A7,A7 ; |124|
;* 23 EXTU .S2 B6,24,24,B5 ; |123|
;* || EXTU .S1 A5,24,24,A16 ; |123|
;* || ABS .L1 A7,A8 ; |124|
;* 24 CLR .S1 A17,0,7,A7 ; |122| ^
;* || EXTU .S2 B5,24,16,B9 ; |123|
;* || SUB .L1X B4,A4,A3 ; |125|
;* 25 OR .D1 A6,A7,A3 ; |122| ^
;* || ABS .L1 A3,A7 ; |125|
;* 26 CLR .S1 A3,8,15,A6 ; |123| ^
;* 27 SHL .S2X A7,24,B5 ; |125|
;* 28 NOP 1
;* 29 OR .L2X B9,A6,B4 ; |123| ^
;* 30 CLR .S2 B4,16,23,B4 ; |124| ^
;* 31 OR .L2X A19,B4,B4 ; |124| ^
;* 32 EXTU .S2 B4,8,8,B4 ; |125| ^
;* 33 OR .D2 B18,B4,B4 ; |125| ^
;* 34 STW .D2T2 B4,*++B20(8) ; |136|
;* || CLR .S2 B4,0,7,B6 ; |122| ^
;* || EXTU .S1 A16,24,16,A7 ; |123|
;* 35 OR .S2 B16,B6,B4 ; |122| ^
;* 36 CLR .S2 B4,8,15,B9 ; |123| ^
;* || EXTU .S1 A8,24,24,A4 ; |124|
;* 37 EXTU .S1 A4,24,8,A4 ; |124|
;* 38 OR .L1X A7,B9,A8 ; |123| ^
;* || [ B0] BDEC .S2 $C$C57,B0 ; |115|
;* 39 CLR .S1 A8,16,23,A3 ; |124| ^
;* 40 OR .D1 A4,A3,A3 ; |124| ^
;* 41 EXTU .S1 A3,8,8,A3 ; |125| ^
;* 42 OR .D1X B5,A3,A17 ; |125| ^
;* 43 STW .D2T1 A17,*+B20(4) ; |136|
;* 44 ; BRANCHCC OCCURS {$C$C57} ; |115|
;*----------------------------------------------------------------------------*
I can't get rid of the Loop Carried Dependency Bound because I don't understand where it comes from. Can anybody help me improve this loop, either with a completely new concept or with improvements to the existing one?
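To make the goal concrete, here is a minimal sketch of one possible alternative concept. It is not the loop from my main.cpp. It handles four packed 8-bit pixels per 32-bit word instead of extracting and re-inserting single bytes. The names `absdiff4_u8` and `absdiff_image` are hypothetical. The `restrict` qualifiers assume the three buffers never overlap, and the word count assumes the image size is a multiple of four pixels. As far as I understand the C6000 intrinsics, `_subabs4` computes the same per-byte result in one instruction, so the portable fallback only documents the intended behavior:

```c
#include <stddef.h>
#include <stdint.h>

#if defined(_TMS320C6X)
#include <c6x.h> /* TI C6000 intrinsics, assumed to declare _subabs4 */
#endif

/* Hypothetical portable reference: |a - b| for each of the four
 * unsigned 8-bit lanes packed into a 32-bit word. */
static uint32_t absdiff4_u8(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        uint32_t d = (x > y) ? (x - y) : (y - x);
        out |= d << (8 * lane);
    }
    return out;
}

/* Hypothetical loop: each output word depends only on the input words
 * at the same index, so no value is carried between iterations.
 * Assumption: in1, in2 and out do not alias (hence restrict). */
void absdiff_image(const uint32_t *restrict in1,
                   const uint32_t *restrict in2,
                   uint32_t *restrict out,
                   size_t n_words)
{
    for (size_t i = 0; i < n_words; ++i) {
#if defined(_TMS320C6X)
        out[i] = _subabs4(in1[i], in2[i]); /* packed 8-bit |a - b| */
#else
        out[i] = absdiff4_u8(in1[i], in2[i]);
#endif
    }
}
```

For example, the input words `0x10FF0005` and `0x20000105` give `0x10FF0100`. Whether this actually removes the dependency bound on the C6657 would still have to be checked in the software pipeline information of the new build.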
Thanks for your help,
best regards
Pay Gießelmann