So, I have read the docs on software pipelined loops, and I think I generally understand what is happening and what the compiler comments mean, but there is a fine point that I don't understand.
On my optimized loop, I get the following:
;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 80
;* Loop opening brace source line : 81
;* Loop closing brace source line : 108
;* Loop Unroll Multiple : 2x
;* Known Minimum Trip Count : 2
;* Known Maximum Trip Count : 2
;* Known Max Trip Count Factor : 2
;* Loop Carried Dependency Bound(^) : 28
;* Unpartitioned Resource Bound : 11
;* Partitioned Resource Bound(*) : 11
;* Resource Partition:
;* A-side B-side
;* .L units 0 0
;* .S units 6 7
;* .D units 2 2
;* .M units 8 8
;* .X cross paths 9 11*
;* .T address paths 2 2
;* Long read paths 1 1
;* Long write paths 0 0
;* Logical ops (.LS) 16 13 (.L or .S unit)
;* Addition ops (.LSD) 7 9 (.L or .S or .D unit)
;* Bound(.L .S .LS) 11* 10
;* Bound(.L .S .D .LS .LSD) 11* 11*
;*
;* Searching for software pipeline schedule at ...
;* ii = 28 Schedule found with 3 iterations in parallel
;*
;* Register Usage Table:
;* +-----------------------------------------------------------------+
;* |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;* |00000000001111111111222222222233|00000000001111111111222222222233|
;* |01234567890123456789012345678901|01234567890123456789012345678901|
;* |--------------------------------+--------------------------------|
;* 0: | ******** ***** * | ** ** ******** |
;* 1: | * ** *** ****** * | ** **** ******** |
;* 2: | * ** *** ****** * | ** **** * ****** |
;* 3: | * ** *** ****** * | ** ** * ****** |
;* 4: | * ** *** ******* * | ** ** * * ****** |
;* 5: | ******** ******* * |*** ** * * ******* |
;* 6: | ** * **** ******* * | ** ***** * ******* |
;* 7: | *** **** ******* * | ** *** * ******* |
;* 8: | * ** *** ******* * | ** *** * ******* |
;* 9: | ********* ******* * | ** *** * * ******* |
;* 10: | **** *** ******* * |*** ***** * ******* |
;* 11: | *** *** ******* * | ** ***** * ***** * |
;* 12: | * * *** ***** * * | ** *** * ***** * |
;* 13: | **** *** ***** * * | ** *** * ***** * |
;* 14: | *** *** ***** * * | ** *** ** * ***** * |
;* 15: | * * *** ***** ** | ** ****** * ***** * |
;* 16: | * * *** ***** ** |*** *** ** * ***** |
;* 17: | *** *** ***** ** |*** *** ** * ***** |
;* 18: | **** **** ***** ** | ** ****** * ***** |
;* 19: | ** ***** ***** ** |*** ** *** ******* |
;* 20: | ** ****** ***** * | ** **** * ******* |
;* 21: | ******** ***** * | ** *** * ****** |
;* 22: | * ****** ***** * | ** *** * ****** |
;* 23: | ********* ***** * | ** ** * * ****** |
;* 24: | ******** ***** * |*** ****** ****** |
;* 25: | *** **** ***** * | ** ****** ****** |
;* 26: | * * **** ***** | ** *** * ****** |
;* 27: | *** **** ***** * | ** ** ****** |
;* +-----------------------------------------------------------------+
;*
;* Done
;*
;* Epilog not entirely removed
;* Collapsed epilog stages : 1
;* Collapsed prolog stages : 2
;* Minimum required memory pad : 8 bytes
;* Minimum threshold value : -mh16
;*
;* Minimum safe trip count : 1 (after unrolling)
;* Min. prof. trip count (est.) : 2 (after unrolling)
;*
;* Mem bank conflicts/iter(est.) : { min 0.000, est 0.000, max 0.000 }
;* Mem bank perf. penalty (est.) : 0.0%
;*
;*
;* Total cycles (est.) : 34 + min_trip_cnt * 28 = 90
;*----------------------------------------------------------------------------*
Following the comment block, I see:
;*----------------------------------------------------------------------------*
$C$L4: ; PIPED LOOP PROLOG
;** --------------------------------------------------------------------------*
$C$L5: ; PIPED LOOP KERNEL
; EXCLUSIVE CPU CYCLES: 28
... code omitted ...
;** --------------------------------------------------------------------------*
$C$L6: ; PIPED LOOP EPILOG
; EXCLUSIVE CPU CYCLES: 2
So, in the SPL comment, the line that has the Total Cycles:
;* Total cycles (est.) : 34 + min_trip_cnt * 28 = 90
says 28 cycles for the kernel (which matches the actual code), but it gives 34 cycles for the overhead of the loop. I don't see where that comes from, given that the prolog has been completely collapsed and the epilog is only 2 cycles. How is that 34 calculated? What am I missing in analyzing the performance of the optimized loop code?
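To make the discrepancy concrete, here is a small sketch of the arithmetic as I understand it. The numbers are copied from the comment block above; the function names are my own. The last part is only a guess: it assumes the textbook modulo-scheduling relation (total = single-iteration schedule length + (trip count - 1) * ii), which I have not confirmed is what the TI compiler actually uses.

```python
# Numbers copied from the compiler's software pipeline comment block.
II = 28                      # "ii = 28", also the kernel's exclusive cycle count
OVERHEAD = 34                # constant term in "Total cycles (est.)"
MIN_TRIP_COUNT = 2           # "Known Minimum Trip Count"
ITERATIONS_IN_PARALLEL = 3   # "Schedule found with 3 iterations in parallel"
EPILOG_CYCLES = 2            # exclusive cycles printed on the epilog label


def estimated_total(trip_count):
    """Reproduce the compiler's printed estimate: 34 + trip_count * 28."""
    return OVERHEAD + trip_count * II


def unexplained_overhead():
    """Part of the 34 cycles not accounted for by the visible epilog."""
    return OVERHEAD - EPILOG_CYCLES


def implied_schedule_length():
    """ASSUMPTION (textbook modulo scheduling, not confirmed for this compiler):
    total = schedule_length + (trip_count - 1) * ii, so the constant term
    would be schedule_length - ii."""
    return OVERHEAD + II


def schedule_length_bounds():
    """Under the same assumption, a schedule that overlaps k iterations spans
    more than (k - 1) * ii cycles and at most k * ii cycles."""
    return ((ITERATIONS_IN_PARALLEL - 1) * II + 1, ITERATIONS_IN_PARALLEL * II)


if __name__ == "__main__":
    print("compiler estimate:", estimated_total(MIN_TRIP_COUNT))
    print("overhead not explained by the epilog:", unexplained_overhead())
    low, high = schedule_length_bounds()
    print("implied single-iteration length:", implied_schedule_length(),
          "(expected range under the assumption:", low, "to", high, ")")
```

If that assumption holds, the 34 would be the length of one complete iteration (62 cycles) minus one ii, rather than a count of prolog and epilog instructions, which would at least be consistent with 3 iterations in parallel. I would still like confirmation of how the tool actually derives it.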
TIA,
B.J.