Understanding Software Pipelined loop cycle counts

Hi,

So, I have read the docs on software pipelined loops, and I think I generally understand what is happening and what the compiler comments mean, but there is a fine point that I don't understand.

On my optimized loop, I get the following:

;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop source line                 : 80
;*      Loop opening brace source line   : 81
;*      Loop closing brace source line   : 108
;*      Loop Unroll Multiple             : 2x
;*      Known Minimum Trip Count         : 2                    
;*      Known Maximum Trip Count         : 2                    
;*      Known Max Trip Count Factor      : 2
;*      Loop Carried Dependency Bound(^) : 28
;*      Unpartitioned Resource Bound     : 11
;*      Partitioned Resource Bound(*)    : 11
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0     
;*      .S units                     6        7     
;*      .D units                     2        2     
;*      .M units                     8        8     
;*      .X cross paths               9       11*    
;*      .T address paths             2        2     
;*      Long read paths              1        1     
;*      Long write paths             0        0     
;*      Logical  ops (.LS)          16       13     (.L or .S unit)
;*      Addition ops (.LSD)          7        9     (.L or .S or .D unit)
;*      Bound(.L .S .LS)            11*      10     
;*      Bound(.L .S .D .LS .LSD)    11*      11*    
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 28 Schedule found with 3 iterations in parallel
;*
;*      Register Usage Table:
;*          +-----------------------------------------------------------------+
;*          |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;*          |00000000001111111111222222222233|00000000001111111111222222222233|
;*          |01234567890123456789012345678901|01234567890123456789012345678901|
;*          |--------------------------------+--------------------------------|
;*       0: |  ********      *****   *       | ** **          ********        |
;*       1: |  * ** ***      ******  *       | ** ****        ********        |
;*       2: |  * ** ***      ******  *       | ** ****        * ******        |
;*       3: |  * ** ***      ******  *       | ** **          * ******        |
;*       4: |  * ** ***      ******* *       | ** ** *        * ******        |
;*       5: |  ********      ******* *       |*** ** *        * *******       |
;*       6: | ** * ****      ******* *       | ** *****       * *******       |
;*       7: |  *** ****      ******* *       | ** ***         * *******       |
;*       8: |  * ** ***      ******* *       | ** ***         * *******       |
;*       9: | *********      ******* *       | ** *** *       * *******       |
;*      10: |  **** ***      ******* *       |*** *****       * *******       |
;*      11: |  ***  ***      ******* *       | ** *****       * ***** *       |
;*      12: |  * *  ***      ***** * *       | ** ***         * ***** *       |
;*      13: |  **** ***      ***** * *       | ** ***         * ***** *       |
;*      14: |  ***  ***      ***** * *       | ** *** **      * ***** *       |
;*      15: |  * *  ***      *****  **       | ** ******      * ***** *       |
;*      16: |  * *  ***      *****  **       |*** *** **      * *****         |
;*      17: |  ***  ***      *****  **       |*** *** **      * *****         |
;*      18: | **** ****      *****  **       | ** ******      * *****         |
;*      19: |  ** *****      *****  **       |*** ** ***      *******         |
;*      20: | ** ******      *****  *        | ** **** *      *******         |
;*      21: |  ********      *****  *        | ** ***  *       ******         |
;*      22: |  * ******      *****  *        | ** ***  *       ******         |
;*      23: | *********      *****  *        | ** ** * *       ******         |
;*      24: |  ********      *****  *        |*** ******       ******         |
;*      25: |  *** ****      *****  *        | ** ******       ******         |
;*      26: |  * * ****      *****           | ** *** *        ******         |
;*      27: |  *** ****      *****   *       | ** **           ******         |
;*          +-----------------------------------------------------------------+
;*
;*      Done
;*
;*      Epilog not entirely removed
;*      Collapsed epilog stages       : 1
;*      Collapsed prolog stages       : 2
;*      Minimum required memory pad   : 8 bytes
;*      Minimum threshold value       : -mh16
;*
;*      Minimum safe trip count       : 1 (after unrolling)
;*      Min. prof. trip count  (est.) : 2 (after unrolling)
;*
;*      Mem bank conflicts/iter(est.) : { min 0.000, est 0.000, max 0.000 }
;*      Mem bank perf. penalty (est.) : 0.0%
;*
;*
;*      Total cycles (est.)         : 34 + min_trip_cnt * 28 = 90        
;*----------------------------------------------------------------------------*
and following the comment block, I see:

;*----------------------------------------------------------------------------*
$C$L4:    ; PIPED LOOP PROLOG
;** --------------------------------------------------------------------------*
$C$L5:    ; PIPED LOOP KERNEL
;          EXCLUSIVE CPU CYCLES: 28

... code omitted ...


;** --------------------------------------------------------------------------*
$C$L6:    ; PIPED LOOP EPILOG
;          EXCLUSIVE CPU CYCLES: 2
So, in the SPL comment, the line that has the Total Cycles:
;*      Total cycles (est.)         : 34 + min_trip_cnt * 28 = 90        

It says 28 for the kernel code (which matches the actual code), but it has 34 for the loop overhead, and I don't see where that comes from, given that the prolog has been completely collapsed and the epilog is only 2 cycles. How is that 34 calculated? What am I missing in analyzing the performance of the optimized loop code?

TIA,

B.J.

  • The calculation in the compiler is very complicated, and I don't think I understand it well enough to explain it. You already know the basic explanation: it is the cycles that will be spent during pipe-up (including prolog) and pipe-down (including epilog). Don't forget that because of collapsing, the pipe-up/pipe-down cycles will include instructions from what appears to be the kernel; 28 cycles accounts for the bulk of the value 34.

    Without seeing the code in the prolog, epilog, pre-prolog, and post-epilog blocks, it will be very difficult to check that 34 is the correct number.

  • In general, when one or more prolog or epilog stages are collapsed, the total time spent in the software pipelined loop code doesn't change.  The first term ("34") in your example is the length of a single iteration of the pipelined loop minus the ii.

    The number of cycles in a software pipelined loop is generally:
      single_iteration_length - ii + trip_count * ii

    This doesn't change when collapsing is performed.
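As a quick check of the formula above, here is a minimal Python sketch. The function name `pipelined_loop_cycles` is made up for illustration and is not part of any TI tool. The single iteration length of 62 is not printed by the compiler; it is inferred here from the statement that the first term (34) equals the single iteration length minus the ii (28).

```python
def pipelined_loop_cycles(single_iteration_length, ii, trip_count):
    """Estimated cycles: (single_iteration_length - ii) + trip_count * ii."""
    overhead = single_iteration_length - ii  # the first term in "Total cycles (est.)"
    return overhead + trip_count * ii


# Values from this thread: ii = 28 and an overhead term of 34,
# so the single iteration length is assumed to be 34 + 28 = 62.
ii = 28
single_iteration_length = 34 + ii
print(pipelined_loop_cycles(single_iteration_length, ii, trip_count=2))  # 90
```

With a trip count of 2 this reproduces the "34 + min_trip_cnt * 28 = 90" line from the compiler comment.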

    Drawing the parallel iterations of a software pipelined loop helps visualize what's going on in collapsing.  Say we can collapse the first prolog stage in a loop with an ii of 1 cycle and a single iteration length of 4 cycles:

    loop                 loop
    version 1            version 2
    no collapsing        collapse one prolog stage
    *                            <- collapsed prolog stage
    **                   **
    ***                  ***
    ****  <- kernel ->   ****    <- kernel must execute one more time in ver 2
     ***                  ***
      **                   **
       *                    *


    Note that in the collapsed prolog case, the first time through the kernel is really the last stage of the prolog.
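The same picture can be written down as a small Python sketch. This is a simplified model, not the compiler's actual accounting: it assumes the single iteration length is an exact multiple of the ii (so the loop has a whole number of stages), that the trip count is at least the number of stages, and that each collapsed prolog stage is replaced by exactly one extra pass through the kernel. The name `cycles_by_block` is hypothetical.

```python
def cycles_by_block(stages, ii, trip_count, collapsed_prolog_stages=0):
    """Split the total cycle count into prolog, kernel, and epilog cycles.

    Simplified model: every stage is ii cycles long and each collapsed
    prolog stage becomes one additional kernel pass.
    """
    if not 0 <= collapsed_prolog_stages <= stages - 1:
        raise ValueError("cannot collapse more stages than the prolog has")
    if trip_count < stages:
        raise ValueError("this simple model needs trip_count >= stages")

    prolog = (stages - 1 - collapsed_prolog_stages) * ii
    kernel_passes = trip_count - (stages - 1) + collapsed_prolog_stages
    kernel = kernel_passes * ii
    epilog = (stages - 1) * ii
    return {
        "prolog": prolog,
        "kernel_passes": kernel_passes,
        "kernel": kernel,
        "epilog": epilog,
        "total": prolog + kernel + epilog,
    }


# The example drawn above: ii = 1, single iteration length = 4 (so 4 stages),
# with a trip count of 4.
version1 = cycles_by_block(stages=4, ii=1, trip_count=4)
version2 = cycles_by_block(stages=4, ii=1, trip_count=4, collapsed_prolog_stages=1)
print(version1)
print(version2)
```

Under these assumptions both versions total 7 cycles; version 2 simply moves one cycle from the prolog into an extra kernel pass.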

    A couple of other notes:

    Collapsing reduces code size.  It does not increase performance.  When collapsing is performed, we have to spend more time in the kernel.

    Another tricky thing is that the instructions in the latter (earlier) part of the epilog (prolog) may also have been scheduled with the code following (preceding) the loop, so in the .asm file, the epilog (prolog) may appear smaller than it actually is.

  • Hi Todd,

    Thanks for the info. I think that gets me closer to understanding what is going on, but not fully.

    I am going to post the entire loop without the prolog and epilog collapse, and maybe you can help me understand how the numbers are computed:

               MVC     .S2     CSR,B25
               LDW     .D1T2   *+A26(4),B9       ; |75| 
               LDW     .D1T1   *A26,A5           ; |74| 
               SUB     .L1     A3,8,A7
               LDW     .D1T1   *++A7(8),A4       ; |87| (P)  
               LDW     .D1T1   *+A26(12),A22     ; |77| 
               LDW     .D1T2   *+A26(8),B24      ; |76| 
               AND     .L2     -2,B25,B5
               MVC     .S2     B5,CSR            ; interrupts off
               ABSSP   .S1     A4,A3             ; |87| (P)  
               SUBSP   .L2X    A3,B9,B6          ; |91| (P)   ^ 
               SUBSP   .L1     A3,A5,A6          ; |87| (P)   ^ 
               NOP             2
    ;*----------------------------------------------------------------------------*
    ;*   SOFTWARE PIPELINE INFORMATION
    ;*
    ;*      Loop source line                 : 80
    ;*      Loop opening brace source line   : 81
    ;*      Loop closing brace source line   : 108
    ;*      Loop Unroll Multiple             : 2x
    ;*      Known Minimum Trip Count         : 2                    
    ;*      Known Maximum Trip Count         : 2                    
    ;*      Known Max Trip Count Factor      : 2
    ;*      Loop Carried Dependency Bound(^) : 28
    ;*      Unpartitioned Resource Bound     : 11
    ;*      Partitioned Resource Bound(*)    : 11
    ;*      Resource Partition:
    ;*                                A-side   B-side
    ;*      .L units                     0        0     
    ;*      .S units                     6        7     
    ;*      .D units                     2        2     
    ;*      .M units                     8        8     
    ;*      .X cross paths               9       11*    
    ;*      .T address paths             2        2     
    ;*      Long read paths              1        1     
    ;*      Long write paths             0        0     
    ;*      Logical  ops (.LS)          16       13     (.L or .S unit)
    ;*      Addition ops (.LSD)          7        9     (.L or .S or .D unit)
    ;*      Bound(.L .S .LS)            11*      10     
    ;*      Bound(.L .S .D .LS .LSD)    11*      11*    
    ;*
    ;*      Searching for software pipeline schedule at ...
    ;*         ii = 28 Schedule found with 3 iterations in parallel
    ;*
    ;*      Register Usage Table:
    ;*          +-----------------------------------------------------------------+
    ;*          |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
    ;*          |00000000001111111111222222222233|00000000001111111111222222222233|
    ;*          |01234567890123456789012345678901|01234567890123456789012345678901|
    ;*          |--------------------------------+--------------------------------|
    ;*       0: |  ********      *****   *       | *  **          ********        |
    ;*       1: |  * ** ***      ******  *       | *  ****        ********        |
    ;*       2: |  * ** ***      ******  *       | *  ****        * ******        |
    ;*       3: |  * ** ***      ******  *       | *  **          * ******        |
    ;*       4: |  * ** ***      ******* *       | *  ** *        * ******        |
    ;*       5: |  ********      ******* *       |**  ** *        * *******       |
    ;*       6: | ** * ****      ******* *       | *  *****       * *******       |
    ;*       7: |  *** ****      ******* *       | *  ***         * *******       |
    ;*       8: |  * ** ***      ******* *       | *  ***         * *******       |
    ;*       9: | *********      ******* *       | *  *** *       * *******       |
    ;*      10: |  **** ***      ******* *       |**  *****       * *******       |
    ;*      11: |  ***  ***      ******* *       | *  *****       * ***** *       |
    ;*      12: |  * *  ***      ***** * *       | *  ***         * ***** *       |
    ;*      13: |  **** ***      ***** * *       | *  ***         * ***** *       |
    ;*      14: |  ***  ***      ***** * *       | *  *** **      * ***** *       |
    ;*      15: |  * *  ***      *****  **       | *  ******      * ***** *       |
    ;*      16: |  * *  ***      *****  **       |**  *** **      * *****         |
    ;*      17: |  ***  ***      *****  **       |**  *** **      * *****         |
    ;*      18: | **** ****      *****  **       | *  ******      * *****         |
    ;*      19: |  ** *****      *****  **       |**  ** ***      *******         |
    ;*      20: | ** ******      *****  *        | *  **** *      *******         |
    ;*      21: |  ********      *****  *        | *  ***  *       ******         |
    ;*      22: |  * ******      *****  *        | *  ***  *       ******         |
    ;*      23: | *********      *****  *        | *  ** * *       ******         |
    ;*      24: |  ********      *****  *        |**  ******       ******         |
    ;*      25: |  *** ****      *****  *        | *  ******       ******         |
    ;*      26: |  * * ****      *****           | *  *** *        ******         |
    ;*      27: |  *** ****      *****   *       | *  **           ******         |
    ;*          +-----------------------------------------------------------------+
    ;*
    ;*      Done
    ;*
    ;*      Epilog not entirely removed
    ;*      Collapsed epilog stages       : 1
    ;*
    ;*      Prolog not removed
    ;*      Collapsed prolog stages       : 0
    ;*
    ;*      Minimum required memory pad   : 0 bytes
    ;*
    ;*      For further improvement on this loop, try option -mh16
    ;*
    ;*      Minimum safe trip count       : 2 (after unrolling)
    ;*
    ;*
    ;*      Mem bank conflicts/iter(est.) : { min 0.000, est 0.000, max 0.000 }
    ;*      Mem bank perf. penalty (est.) : 0.0%
    ;*
    ;*
    ;*      Total cycles (est.)         : 34 + min_trip_cnt * 28 = 90        
    ;*----------------------------------------------------------------------------*
    ;*       SETUP CODE
    ;*
    ;*                  MV              A8,B4
    ;*
    ;*        SINGLE SCHEDULED ITERATION
    ;*
    ;*        $C$C91:
    ;*   0              LDW     .D1T1   *++A7(8),A24      ; |87| 
    ;*   1              LDW     .D1T2   *+A7(4),B16       ; |87| 
    ;*   2              NOP             3
    ;*   5              ABSSP   .S1     A24,A5            ; |87| 
    ;*   6              SUBSP   .L2X    A5,B23,B7         ; |91|  ^ 
    ;*   7              SUBSP   .L1     A5,A21,A6         ; |87|  ^ 
    ;*   8              NOP             2
    ;*  10              CMPGTSP .S2     B7,B4,B0          ; |91|  ^ 
    ;*     ||           SUBSP   .L1     A5,A22,A5         ; |95|  ^ 
    ;*  11              CMPGTSP .S1     A6,A8,A1          ; |87|  ^ 
    ;*     ||   [ B0]   MV      .S2     B21,B8            ; |91|  ^ 
    ;*     ||   [!B0]   MV      .D2     B19,B8            ; |91|  ^ 
    ;*     ||           SUBSP   .L2X    A5,B24,B8         ; |99|  ^ 
    ;*  12      [ A1]   MV      .S1     A16,A3            ; |87|  ^ 
    ;*     ||   [!A1]   MV      .L1X    B19,A3            ; |87|  ^ 
    ;*     ||           MPYSP   .M2     B7,B8,B7          ; |91|  ^ 
    ;*  13              MPYSP   .M1     A6,A3,A3          ; |87|  ^ 
    ;*  14              CMPGTSP .S1     A5,A8,A1          ; |95|  ^ 
    ;*  15      [ A1]   MV      .S1X    B20,A3            ; |95|  ^ 
    ;*     ||   [!A1]   MV      .D1     A17,A3            ; |95|  ^ 
    ;*     ||           CMPGTSP .S2     B8,B4,B0          ; |99|  ^ 
    ;*  16              ADDSP   .L2     B7,B23,B9         ; |91|  ^ 
    ;*     ||           MPYSP   .M1     A5,A3,A3          ; |95|  ^ 
    ;*     ||   [ B0]   MV      .D2     B20,B7            ; |99|  ^ 
    ;*     ||   [!B0]   MV      .S2     B22,B7            ; |99|  ^ 
    ;*  17              ADDSP   .L1     A3,A21,A23        ; |87|  ^ 
    ;*     ||           MPYSP   .M2     B8,B7,B7          ; |99|  ^ 
    ;*  18              NOP             1
    ;*  19              ABSSP   .S2     B16,B8            ; |87| 
    ;*  20              ADDSP   .L1     A3,A22,A6         ; |95|  ^ 
    ;*     ||           SUBSP   .L2     B8,B9,B7          ; |91|  ^ 
    ;*  21              ADDSP   .L2     B7,B24,B17        ; |99|  ^ 
    ;*     ||           SUBSP   .L1X    B8,A23,A5         ; |87|  ^ 
    ;*  22              SUBSP   .L2X    A23,B9,B6         ; |105| 
    ;*  23              NOP             1
    ;*  24              CMPGTSP .S2     B7,B4,B0          ; |91|  ^ 
    ;*     ||           SUBSP   .L1X    B8,A6,A5          ; |95|  ^ 
    ;*  25              MV      .L1     A24,A4            ; |87| Split a long life
    ;*     ||           CMPGTSP .S1     A5,A8,A1          ; |87|  ^ 
    ;*     ||   [ B0]   MV      .S2     B21,B5            ; |91|  ^ 
    ;*     ||   [!B0]   MV      .D2     B19,B5            ; |91|  ^ 
    ;*     ||           SUBSP   .L2     B8,B17,B7         ; |99|  ^ 
    ;*  26              SUBSP   .L2X    A6,B17,B8         ; |105| 
    ;*     ||           MV      .S2     B16,B5            ; |87| Split a long life
    ;*     ||   [ A1]   MV      .S1     A16,A3            ; |87|  ^ 
    ;*     ||   [!A1]   MV      .L1X    B19,A3            ; |87|  ^ 
    ;*     ||           MPYSP   .M2     B7,B5,B6          ; |91|  ^ 
    ;*  27              MPYSP   .M2X    A18,B6,B6         ; |105| 
    ;*     ||           MPYSP   .M1     A5,A3,A3          ; |87|  ^ 
    ;*  28              CMPGTSP .S1     A5,A8,A1          ; |95|  ^ 
    ;*  29      [ A1]   MV      .L1X    B20,A3            ; |95|  ^ 
    ;*     ||   [!A1]   MV      .S1     A17,A3            ; |95|  ^ 
    ;*     ||           CMPGTSP .S2     B7,B4,B0          ; |99|  ^ 
    ;*  30              ADDSP   .L2     B6,B9,B23         ; |91|  ^ 
    ;*     ||           MPYSP   .M1     A5,A3,A3          ; |95|  ^ 
    ;*     ||   [ B0]   MV      .D2     B20,B9            ; |99|  ^ 
    ;*     ||   [!B0]   MV      .S2     B22,B9            ; |99|  ^ 
    ;*  31              ADDSP   .L2X    A9,B6,B7          ; |105| 
    ;*     ||           ADDSP   .L1     A3,A23,A21        ; |87|  ^ 
    ;*     ||           MPYSP   .M2     B7,B9,B6          ; |99|  ^ 
    ;*  32              MPYSP   .M2X    A19,B8,B6         ; |105| 
    ;*  33              NOP             1
    ;*  34              ADDSP   .L1     A3,A6,A22         ; |95|  ^ 
    ;*  35              ADDSP   .L2     B6,B17,B24        ; |99|  ^ 
    ;*     ||           SUBSP   .S1X    A21,B23,A3        ; |105| 
    ;*  36              ADDSP   .L2     B6,B7,B6          ; |105| 
    ;*  37              NOP             2
    ;*  39              MPYSP   .M1     A18,A3,A3         ; |105| 
    ;*     ||           SUBSP   .L1X    A22,B24,A6        ; |105| 
    ;*  40              NOP             3
    ;*  43              ADDSP   .L1     A9,A3,A3          ; |105| 
    ;*     ||           MPYSP   .M1     A19,A6,A5         ; |105| 
    ;*  44              NOP             3
    ;*  47              ADDSP   .L1     A5,A3,A3          ; |105| 
    ;*  48              NOP             1
    ;*  49              CMPLTSP .S2X    B6,A8,B0          ; |105| 
    ;*  50              NOP             1
    ;*  51      [ B0]   MV      .S2X    A8,B6             ; |105| 
    ;*     ||           CMPLTSP .S1     A3,A8,A1          ; |105| 
    ;*  52              MPYSP   .M2X    A4,B6,B6          ; |105| 
    ;*     ||   [ A1]   MV      .D1     A8,A3             ; |105| 
    ;*  53              MPYSP   .M1X    B5,A3,A3          ; |105| 
    ;*  54              NOP             1
    ;*  55      [ A2]   SUB     .S1     A2,1,A2           ; |80| 
    ;*  56              MPYSP   .M2X    A20,B6,B6         ; |105| 
    ;*     ||   [ A2]   B       .S2     $C$C91            ; |80| 
    ;*  57              MPYSP   .M1     A20,A3,A3         ; |105| 
    ;*  58              NOP             2
    ;*  60              STW     .D2T2   B6,*++B18(8)      ; |105| 
    ;*  61              STW     .D2T1   A3,*+B18(4)       ; |105| 
    ;*  62              ; BRANCHCC OCCURS {$C$C91}        ; |80| 
    ;*----------------------------------------------------------------------------*
    $C$L4:    ; PIPED LOOP PROLOG
    ;          EXCLUSIVE CPU CYCLES: 24
    
               SUBSP   .L1     A3,A22,A6         ; |95| (P)   ^ 
    ||         CMPGTSP .S2     B6,B4,B0          ; |91| (P)   ^ 
    
               CMPGTSP .S1     A6,A8,A1          ; |87| (P)   ^ 
    ||         SUBSP   .L2X    A3,B24,B5         ; |99| (P)   ^ 
    || [ B0]   MV      .S2     B21,B8            ; |91| (P)   ^ 
    || [!B0]   MV      .D2     B19,B8            ; |91| (P)   ^ 
    
       [ A1]   MV      .S1     A16,A3            ; |87| (P)   ^ 
    || [!A1]   MV      .L1X    B19,A3            ; |87| (P)   ^ 
    ||         MPYSP   .M2     B6,B8,B7          ; |91| (P)   ^ 
    
               MPYSP   .M1     A6,A3,A3          ; |87| (P)   ^ 
    
               LDW     .D1T2   *+A7(4),B6        ; |87| (P)  
    ||         CMPGTSP .S1     A6,A8,A1          ; |95| (P)   ^ 
    
       [!A1]   MV      .S1     A17,A3            ; |95| (P)   ^ 
    || [ A1]   MV      .L1X    B20,A3            ; |95| (P)   ^ 
    ||         CMPGTSP .S2     B5,B4,B0          ; |99| (P)   ^ 
    
               MPYSP   .M1     A6,A3,A3          ; |95| (P)   ^ 
    || [!B0]   MV      .S2     B22,B7            ; |99| (P)   ^ 
    || [ B0]   MV      .D2     B20,B7            ; |99| (P)   ^ 
    ||         ADDSP   .L2     B7,B9,B9          ; |91| (P)   ^ 
    
               ADDSP   .L1     A3,A5,A23         ; |87| (P)   ^ 
    ||         MPYSP   .M2     B5,B7,B7          ; |99| (P)   ^ 
    
               NOP             1
               ABSSP   .S2     B6,B5             ; |87| (P)  
    
               ADDSP   .L1     A3,A22,A6         ; |95| (P)   ^ 
    ||         SUBSP   .L2     B5,B9,B7          ; |91| (P)   ^ 
    
               SUBSP   .L1X    B5,A23,A5         ; |87| (P)   ^ 
    ||         ADDSP   .L2     B7,B24,B17        ; |99| (P)   ^ 
    
               SUBSP   .L2X    A23,B9,B8         ; |105| (P)  
               NOP             1
    
               SUBSP   .L1X    B5,A6,A5          ; |95| (P)   ^ 
    ||         CMPGTSP .S2     B7,B4,B0          ; |91| (P)   ^ 
    
               CMPGTSP .S1     A5,A8,A1          ; |87| (P)   ^ 
    ||         SUBSP   .L2     B5,B17,B6         ; |99| (P)   ^ 
    || [ B0]   MV      .S2     B21,B5            ; |91| (P)   ^ 
    || [!B0]   MV      .D2     B19,B5            ; |91| (P)   ^ 
    
               MV      .S2     B6,B5             ; |87| (P)  Split a long life
    ||         SUBSP   .L2X    A6,B17,B8         ; |105| (P)  
    || [ A1]   MV      .S1     A16,A3            ; |87| (P)   ^ 
    || [!A1]   MV      .L1X    B19,A3            ; |87| (P)   ^ 
    ||         MPYSP   .M2     B7,B5,B16         ; |91| (P)   ^ 
    
               MPYSP   .M2X    A18,B8,B16        ; |105| (P)  
    ||         MPYSP   .M1     A5,A3,A3          ; |87| (P)   ^ 
    
               LDW     .D2T2   *++B26,B7         ; |70| 
    ||         CMPGTSP .S1     A5,A8,A1          ; |95| (P)   ^ 
    ||         LDW     .D1T1   *++A7(8),A24      ; |87| (P)  
    
       [!A1]   MV      .S1     A17,A3            ; |95| (P)   ^ 
    || [ A1]   MV      .L1X    B20,A3            ; |95| (P)   ^ 
    ||         CMPGTSP .S2     B6,B4,B0          ; |99| (P)   ^ 
    ||         LDW     .D1T2   *+A7(4),B16       ; |87| (P)  
    
               MPYSP   .M1     A5,A3,A3          ; |95| (P)   ^ 
    || [!B0]   MV      .S2     B22,B9            ; |99| (P)   ^ 
    || [ B0]   MV      .D2     B20,B9            ; |99| (P)   ^ 
    ||         ADDSP   .L2     B16,B9,B23        ; |91| (P)   ^ 
    
               MVK     .S1     0x2,A3            ; |80| 
    ||         ADDSP   .L2X    A9,B16,B7         ; |105| (P)  
    ||         ADDSP   .L1     A3,A23,A21        ; |87| (P)   ^ 
    ||         MPYSP   .M2     B6,B9,B6          ; |99| (P)   ^ 
    
               SUB     .L1     A3,1,A2
    ||         MPYSP   .M2X    A19,B8,B6         ; |105| (P)  
    
               SUB     .S2X    A2,1,B1
    ||         SUB     .D2     B7,8,B18
    ||         ABSSP   .S1     A24,A5            ; |87| (P)  
    
    ;** --------------------------------------------------------------------------*
    $C$L5:    ; PIPED LOOP KERNEL
    ;          EXCLUSIVE CPU CYCLES: 28
    
               ADDSP   .L1     A3,A6,A22         ; |95|   ^ 
    ||         SUBSP   .L2X    A5,B23,B7         ; |91|   ^ 
    
               SUBSP   .S1X    A21,B23,A3        ; |105|  
    ||         ADDSP   .L2     B6,B17,B24        ; |99|   ^ 
    ||         SUBSP   .L1     A5,A21,A6         ; |87|   ^ 
    
               ADDSP   .L2     B6,B7,B6          ; |105|  
               NOP             1
    
               SUBSP   .L1     A5,A22,A5         ; |95|   ^ 
    ||         CMPGTSP .S2     B7,B4,B0          ; |91|   ^ 
    
               SUBSP   .L1X    A22,B24,A6        ; |105|  
    ||         MPYSP   .M1     A18,A3,A3         ; |105|  
    ||         CMPGTSP .S1     A6,A8,A1          ; |87|   ^ 
    ||         SUBSP   .L2X    A5,B24,B8         ; |99|   ^ 
    || [ B0]   MV      .S2     B21,B8            ; |91|   ^ 
    || [!B0]   MV      .D2     B19,B8            ; |91|   ^ 
    
       [ A1]   MV      .S1     A16,A3            ; |87|   ^ 
    || [!A1]   MV      .L1X    B19,A3            ; |87|   ^ 
    ||         MPYSP   .M2     B7,B8,B7          ; |91|   ^ 
    
               MPYSP   .M1     A6,A3,A3          ; |87|   ^ 
               CMPGTSP .S1     A5,A8,A1          ; |95|   ^ 
    
               ADDSP   .L1     A9,A3,A3          ; |105|  
    ||         MPYSP   .M1     A19,A6,A5         ; |105|  
    || [!A1]   MV      .D1     A17,A3            ; |95|   ^ 
    || [ A1]   MV      .S1X    B20,A3            ; |95|   ^ 
    ||         CMPGTSP .S2     B8,B4,B0          ; |99|   ^ 
    
               MPYSP   .M1     A5,A3,A3          ; |95|   ^ 
    || [!B0]   MV      .S2     B22,B7            ; |99|   ^ 
    || [ B0]   MV      .D2     B20,B7            ; |99|   ^ 
    ||         ADDSP   .L2     B7,B23,B9         ; |91|   ^ 
    
               ADDSP   .L1     A3,A21,A23        ; |87|   ^ 
    ||         MPYSP   .M2     B8,B7,B7          ; |99|   ^ 
    
               NOP             1
    
               ADDSP   .L1     A5,A3,A3          ; |105|  
    ||         ABSSP   .S2     B16,B8            ; |87|  
    
               ADDSP   .L1     A3,A22,A6         ; |95|   ^ 
    ||         SUBSP   .L2     B8,B9,B7          ; |91|   ^ 
    
               CMPLTSP .S2X    B6,A8,B0          ; |105|  
    ||         SUBSP   .L1X    B8,A23,A5         ; |87|   ^ 
    ||         ADDSP   .L2     B7,B24,B17        ; |99|   ^ 
    
               SUBSP   .L2X    A23,B9,B6         ; |105|  
    
       [ B0]   MV      .S2X    A8,B6             ; |105|  
    ||         CMPLTSP .S1     A3,A8,A1          ; |105|  
    
               MPYSP   .M2X    A4,B6,B6          ; |105|  
    || [ A1]   MV      .D1     A8,A3             ; |105|  
    ||         SUBSP   .L1X    B8,A6,A5          ; |95|   ^ 
    ||         CMPGTSP .S2     B7,B4,B0          ; |91|   ^ 
    
               MPYSP   .M1X    B5,A3,A3          ; |105|  
    ||         MV      .L1     A24,A4            ; |87|  Split a long life
    ||         CMPGTSP .S1     A5,A8,A1          ; |87|   ^ 
    ||         SUBSP   .L2     B8,B17,B7         ; |99|   ^ 
    || [ B0]   MV      .S2     B21,B5            ; |91|   ^ 
    || [!B0]   MV      .D2     B19,B5            ; |91|   ^ 
    
               MV      .S2     B16,B5            ; |87|  Split a long life
    ||         SUBSP   .L2X    A6,B17,B8         ; |105|  
    || [ A1]   MV      .S1     A16,A3            ; |87|   ^ 
    || [!A1]   MV      .L1X    B19,A3            ; |87|   ^ 
    ||         MPYSP   .M2     B7,B5,B6          ; |91|   ^ 
    
       [ A2]   SUB     .S1     A2,1,A2           ; |80|  
    ||         MPYSP   .M2X    A18,B6,B6         ; |105|  
    ||         MPYSP   .M1     A5,A3,A3          ; |87|   ^ 
    
               MPYSP   .M2X    A20,B6,B6         ; |105|  
    || [ A2]   B       .S2     $C$L5             ; |80|  
    ||         CMPGTSP .S1     A5,A8,A1          ; |95|   ^ 
    || [ B1]   LDW     .D1T1   *++A7(8),A24      ; |87|  
    
               MPYSP   .M1     A20,A3,A3         ; |105|  
    || [!A1]   MV      .S1     A17,A3            ; |95|   ^ 
    || [ A1]   MV      .L1X    B20,A3            ; |95|   ^ 
    ||         CMPGTSP .S2     B7,B4,B0          ; |99|   ^ 
    || [ B1]   LDW     .D1T2   *+A7(4),B16       ; |87|  
    
               MPYSP   .M1     A5,A3,A3          ; |95|   ^ 
    || [!B0]   MV      .S2     B22,B9            ; |99|   ^ 
    || [ B0]   MV      .D2     B20,B9            ; |99|   ^ 
    ||         ADDSP   .L2     B6,B9,B23         ; |91|   ^ 
    
               ADDSP   .L2X    A9,B6,B7          ; |105|  
    ||         ADDSP   .L1     A3,A23,A21        ; |87|   ^ 
    ||         MPYSP   .M2     B7,B9,B6          ; |99|   ^ 
    
               STW     .D2T2   B6,*++B18(8)      ; |105|  
    ||         MPYSP   .M2X    A19,B8,B6         ; |105|  
    
       [ B1]   SUB     .S2     B1,1,B1           ;  
    ||         STW     .D2T1   A3,*+B18(4)       ; |105|  
    ||         ABSSP   .S1     A24,A5            ; |87|  
    
    ;** --------------------------------------------------------------------------*
    $C$L6:    ; PIPED LOOP EPILOG
    ;          EXCLUSIVE CPU CYCLES: 2
    
               SUB     .S2     B2,1,B2           ; |65| 
    ||         ADDSP   .L1     A3,A6,A22         ; |95| (E)   ^ 
    
               SUBSP   .S1X    A21,B23,A3        ; |105| (E)  
    ||         ADDSP   .L2     B6,B17,B24        ; |99| (E)   ^ 
    
    
    It seems to me that the loop should take 24 + 2*28 + 2 = 82 cycles, not 90. A fine point I suppose, but that is roughly 10%. I just want to understand what I am missing here.

    TIA,

    B.J.
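The count in the post above can be reproduced mechanically. The following Python sketch is only an illustration: the helper `exclusive_cycles` is made up here, the listing is trimmed to the label and "EXCLUSIVE CPU CYCLES" comment lines from the posted assembly, and the two kernel passes are the poster's assumption rather than a verified figure.

```python
import re

# Label and cycle-count comment lines copied from the listing in the post above.
LISTING = """\
$C$L4:    ; PIPED LOOP PROLOG
;          EXCLUSIVE CPU CYCLES: 24
$C$L5:    ; PIPED LOOP KERNEL
;          EXCLUSIVE CPU CYCLES: 28
$C$L6:    ; PIPED LOOP EPILOG
;          EXCLUSIVE CPU CYCLES: 2
"""


def exclusive_cycles(listing):
    """Map each piped-loop block name to its EXCLUSIVE CPU CYCLES value."""
    cycles = {}
    block = None
    for line in listing.splitlines():
        label = re.search(r"PIPED LOOP (PROLOG|KERNEL|EPILOG)", line)
        if label:
            block = label.group(1)
            continue
        count = re.search(r"EXCLUSIVE CPU CYCLES:\s*(\d+)", line)
        if count and block is not None:
            cycles[block] = int(count.group(1))
            block = None
    return cycles


counts = exclusive_cycles(LISTING)
kernel_passes = 2  # assumption taken from the post, not measured
visible = counts["PROLOG"] + kernel_passes * counts["KERNEL"] + counts["EPILOG"]
estimate = 34 + 2 * 28  # the compiler's "Total cycles (est.)" line
print(visible, estimate, estimate - visible)  # 82 90 8
```

The 8-cycle difference is the gap the question is about; the sketch only shows the arithmetic and says nothing about where those cycles are spent.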

  • The last sentence of my last post partly addresses your most recent question:

    Another tricky thing is that the instructions in the latter (earlier) part of the epilog (prolog) may also have been scheduled with the code following (preceding) the loop, so in the .asm file, the epilog (prolog) may appear smaller than it actually is.

    Note there are instructions from the prolog that have been scheduled in the block preceding the prolog block (before the software pipelining comments).  Look for "(P)" in the assembly comment.  I also strongly suspect that there are instructions from the epilog in the block following the epilog block (look for "(E)"), but can't be sure because the subsequent block isn't included in the posted example.

     

    Note that the "Total cycles" is an estimate because of this compiler scheduling behavior.
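One way to spot the prolog instructions mentioned above is to scan the listing for "(P)" comments that appear before the prolog label. The Python sketch below is a hypothetical helper, not a TI utility. It assumes the "(P)" marker and the "PIPED LOOP PROLOG" label appear exactly as in the posted listing; the sample text is the block that precedes the software pipeline comments in that post, with trailing spaces removed.

```python
# The block that precedes the pipeline information comments in the posted listing.
PRE_PROLOG = """\
           MVC     .S2     CSR,B25
           LDW     .D1T2   *+A26(4),B9       ; |75|
           LDW     .D1T1   *A26,A5           ; |74|
           SUB     .L1     A3,8,A7
           LDW     .D1T1   *++A7(8),A4       ; |87| (P)
           LDW     .D1T1   *+A26(12),A22     ; |77|
           LDW     .D1T2   *+A26(8),B24      ; |76|
           AND     .L2     -2,B25,B5
           MVC     .S2     B5,CSR            ; interrupts off
           ABSSP   .S1     A4,A3             ; |87| (P)
           SUBSP   .L2X    A3,B9,B6          ; |91| (P)   ^
           SUBSP   .L1     A3,A5,A6          ; |87| (P)   ^
           NOP             2
$C$L4:    ; PIPED LOOP PROLOG
"""


def prolog_instructions_before_label(listing):
    """Return the lines tagged (P) that appear before the prolog label."""
    found = []
    for line in listing.splitlines():
        if "PIPED LOOP PROLOG" in line:
            break  # everything from here on belongs to the labeled prolog block
        if "(P)" in line:
            found.append(line.strip())
    return found


hoisted = prolog_instructions_before_label(PRE_PROLOG)
for line in hoisted:
    print(line)
print(len(hoisted), "prolog instructions scheduled ahead of the prolog block")
```

For the posted block this finds four "(P)" instructions (one LDW, one ABSSP, two SUBSP). It counts instructions, not cycles, so it does not by itself account for the difference between 82 and 90.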

  • Thanks for the info.

    So, I might have missed it, but I don't think the (P) and (E) markers in the comments are documented in the Compiler guide (alongside the other elements of the compiler comments that are documented in the SPL section). Including them in the documentation would help clarify this for someone trying to understand it for the first time.

    So, because of the scheduling, the total cycle count for the loop is just an estimate and is likely to be off. But if I count the cycles given by the actual "exclusive cycles" comments and get the loop iteration counts right, should that provide an accurate assessment of the cycles required to execute the code?

    Best regards,

    B.J.