Loop not pipelined, although schedule was found

Hi,

I've written a loop targeted at the C6600 using linear assembler. However, I am puzzled by the outcome: although the optimizer states that a schedule was found with 4 iterations in parallel, a non-pipelined version is generated, which is of course very slow.

In my previous attempts the optimizer stated the reason for not generating a pipelined loop; in this case, however, it remains silent.

Any ideas how to find out what's bothering the optimizer?

Thank you in advance, Clemens

[code];*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop source line                 : 34
;*      Loop closing brace source line   : 131
;*      Known Minimum Trip Count         : 8                    
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 7
;*      Unpartitioned Resource Bound     : 11
;*      Partitioned Resource Bound(*)    : 11
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0     
;*      .S units                     3        2     
;*      .D units                     9       10     
;*      .M units                     0        0     
;*      .X cross paths               2        2     
;*      .T address paths             9       10     
;*      Long read paths              0        0     
;*      Long write paths             0        0     
;*      Logical  ops (.LS)          14       12     (.L or .S unit)
;*      Addition ops (.LSD)          5        8     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             9        7     
;*      Bound(.L .S .D .LS .LSD)    11*      11*    
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 11 Schedule found with 4 iterations in parallel
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages       : 0
;*      Collapsed prolog stages       : 0
;*      Minimum required memory pad   : 0 bytes
;*
;*      Minimum safe trip count       : 1
;*----------------------------------------------------------------------------*[/code]

  • Your loop is what we call resource bound.  This line tells the story ...

    Clemens Eisserer said:
    ;*      Bound(.L .S .D .LS .LSD)    11*      11*    

    That says you have so many instructions on the .L, .S and .D functional units that it will take 11 fetch packets (8 instructions per packet) to perform it all.  Thus, the theoretical minimum ii for your loop is 11.  And that's what you get.

    The section titled Understanding Feedback of the C6000 Programmer's Guide explains how to read these compiler-generated comments.  It is a bit out of date: nothing in that section is wrong, but you will see things that are not described there.

    Thanks and regards,

    -George
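For readers who want to reproduce the 11 from the Resource Partition table, here is a minimal sketch. It assumes the two combined bounds are the ceiling of the summed instruction counts divided by the number of units that can take them (2 for .L/.S, 3 for .L/.S/.D). The function names are made up for illustration; the counts are copied from the listing above.

```python
import math

def bound_l_s_ls(l, s, ls):
    """Bound(.L .S .LS): ops that need .L or .S, spread over 2 units."""
    return math.ceil((l + s + ls) / 2)

def bound_l_s_d_ls_lsd(l, s, d, ls, lsd):
    """Bound(.L .S .D .LS .LSD): ops that need .L, .S or .D, spread over 3 units."""
    return math.ceil((l + s + d + ls + lsd) / 3)

# Per-side counts taken from the Resource Partition table in the listing.
a_side = dict(l=0, s=3, d=9, ls=14, lsd=5)
b_side = dict(l=0, s=2, d=10, ls=12, lsd=8)

for name, side in (("A", a_side), ("B", b_side)):
    two_unit = bound_l_s_ls(side["l"], side["s"], side["ls"])
    three_unit = bound_l_s_d_ls_lsd(**side)
    print(name, two_unit, three_unit)
# A 9 11
# B 7 11
```

Under that assumption, both sides come out at 11 for the three-unit bound, matching the starred entries, even though no single unit type (for example .D with 9 and 10) reaches 11 on its own.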

  • Hi Georgem,


    Thanks for your prompt reply. 11 cycles per "iteration" is what I would expect, given the number of instructions executed per iteration.
    However, what I see when reading the generated assembly code seems different: it looks a lot more like a sequential loop without any pipelining, as there are only a few instructions per fetch packet and a lot more than 11 fetch packets per iteration.

    Thank you in advance, Clemens

    [code];*----------------------------------------------------------------------------*
    ;*   SOFTWARE PIPELINE INFORMATION
    ;*
    ;*      Loop source line                 : 35
    ;*      Loop closing brace source line   : 117
    ;*      Known Minimum Trip Count         : 8                     
    ;*      Known Max Trip Count Factor      : 1
    ;*      Loop Carried Dependency Bound(^) : 7
    ;*      Unpartitioned Resource Bound     : 11
    ;*      Partitioned Resource Bound(*)    : 11
    ;*      Resource Partition:
    ;*                                A-side   B-side
    ;*      .L units                     0        0      
    ;*      .S units                     3        2      
    ;*      .D units                     9       10      
    ;*      .M units                     0        0      
    ;*      .X cross paths               2        2      
    ;*      .T address paths             9       10      
    ;*      Long read paths              0        0      
    ;*      Long write paths             0        0      
    ;*      Logical  ops (.LS)          14       12     (.L or .S unit)
    ;*      Addition ops (.LSD)          5        8     (.L or .S or .D unit)
    ;*      Bound(.L .S .LS)             9        7      
    ;*      Bound(.L .S .D .LS .LSD)    11*      11*     
    ;*
    ;*      Searching for software pipeline schedule at ...
    ;*         ii = 11 Schedule found with 4 iterations in parallel
    ;*      Done
    ;*
    ;*      Loop will be splooped
    ;*      Collapsed epilog stages       : 0
    ;*      Collapsed prolog stages       : 0
    ;*      Minimum required memory pad   : 0 bytes
    ;*
    ;*      Minimum safe trip count       : 1
    ;*----------------------------------------------------------------------------*
    $C$L2:    ; PIPED LOOP PROLOG
        .dwpsn    file "../SumC6600.sa",line 35,column 0,is_stmt,isa 0
     
               SPLOOPD 11      ;44               ; (P)  
    ||         MV      .L2X    stride,stride'
    ||         MVC     .S2     B25,ILC
     
    ;** --------------------------------------------------------------------------*
    $C$L3:    ; PIPED LOOP KERNEL
    $C$DW$L$_Sum3x3_C66Asm$4$B:
               ADD     .S2X    src,stride',src$3 ; |41| (P) <0,0>  ^  
               NOP             1
               LDW     .D1T1   *+src(32),us0_m1m2 ; |39| (P) <0,2>  
               LDDW    .D1T1   *src,us0_fe:us0_dc ; |35| (P) <0,3>  
     
               LDDW    .D1T2   *+src(24),us0_32:us0_10 ; |38| (P) <0,4>  
    ||         ADD     .D2     src$3,stride',src$2 ; |48| (P) <0,4>  ^  
     
               MVD     .M2     src$2,src$4       ; |48| (P) <0,5> Split a long life
    ||         LDDW    .D1T1   *+src(8),us0_ba:us0_98 ; |36| (P) <0,5>  
    ||         LDW     .D2T2   *+src$3(32),us1_m1m2 ; |46| (P) <0,5>  
     
               LDDW    .D1T2   *+src(16),us0_76:us0_54 ; |37| (P) <0,6>  
    ||         SUB     .L1X    src$2,stride,src$1 ; |56| (P) <0,6>  ^  
     
               MVD     .M1     us0_m1m2,us0_m1m2' ; |39| (P) <0,7> Split a long life
    ||         SUB     .L1     src$1,stride,src  ; |57| (P) <0,7>  ^  
    ||         LDDW    .D2T1   *src$3,us1_fe:us1_dc ; |42| (P) <0,7>  
     
               ADDK    .S1     0x20,src          ; |58| (P) <0,8>  ^  
               LDDW    .D2T1   *+src$3(8),us1_ba:us1_98 ; |43| (P) <0,9>  
     
               MVD     .M1     us0_ba,us0_ba'    ; |36| (P) <0,10> Split a long life
    ||         LDDW    .D2T2   *+src$3(16),us1_76:us1_54 ; |44| (P) <0,10>  
     
               MVD     .M1     us0_98,us0_98'    ; |36| (P) <0,11> Split a long life
    ||         LDDW    .D2T2   *+src$3(24),us1_32:us1_10 ; |45| (P) <0,11>  
     
               DADD    .L1     0,us0_fe:us0_dc,us0_fe':us0_dc' ; |35| (P) <0,12> Split a long life
    ||         LDW     .D2T2   *+src$4(32),us2_m1m2 ; |53| (P) <0,12>  
     
               LDDW    .D2T2   *+src$4(16),us2_76:us2_54 ; |51| (P) <0,13>  
               LDDW    .D2T2   *+src$4(24),us2_32:us2_10 ; |52| (P) <0,14>  
               NOP             2
               LDDW    .D2T1   *+src$4(8),us2_ba:us2_98 ; |50| (P) <0,17>  
               NOP             1
     
               DADD    .L2     0,us0_32:us0_10,us0_32':us0_10' ; |36| (P) <0,19> Split a long life
    ||         ADD2    .S2X    us0_m1m2',us1_m1m2,usSum_m1m2' ; |72| (P) <0,19>  
    ||         LDDW    .D2T1   *src$4,us2_fe:us2_dc ; |49| (P) <0,19>  
     
               SPMASK          L1
    ||         MV      .L1     dst,dst$1
    ||         DADD2   .L2     us0_32':us0_10',us1_32:us1_10,usSum_32':usSum_10' ; |69| (P) <0,20>  
    ||         ADD2    .S2     us2_m1m2,usSum_m1m2',usSum_m1m2 ; |73| (P) <0,20>  
     
               SPMASK          L1,S1
    ||         MV      .L1     dst,dst$2
    ||         ADDK    .S1     24,dst$1
    ||         DADD2   .L2     us0_76:us0_54,us1_76:us1_54,usSum_76':usSum_54' ; |66| (P) <0,21>  
    ||         DADD2   .S2     us2_32:us2_10,usSum_32':usSum_10',usSum_32:usSum_10 ; |70| (P) <0,21>  
     
               SPMASK          S1
    ||         ADDK    .S1     16,dst$2
    ||         DADD2   .L2     us2_76:us2_54,usSum_76':usSum_54',usSum_76'':usSum_54'' ; |67| (P) <0,22>  
    ||         DADD2   .L1     us0_fe':us0_dc',us1_fe:us1_dc,usSum_fe':usSum_dc' ; |60| (P) <0,22>  
     
               SHRMB   .L2     usSum_m1m2,usSum_32,usSum_21 ; |91| (P) <0,23>  
    ||         SHR     .S2     usSum_10,0x8,tmp5 ; |87| (P) <0,23>  
    ||         DADD2   .S1     us0_ba':us0_98',us1_ba:us1_98,usSum_ba':usSum_98' ; |63| (P) <0,23>  
     
               PACKLH2 .L2     usSum_32,usSum_10,usSum_0m1 ; |90| (P) <0,24>  
    ||         SHRMB   .S2     usSum_10,usSum_76'',usSum_65 ; |86| (P) <0,24>  
    ||         DADD2   .L1     us2_ba:us2_98,usSum_ba':usSum_98',usSum_ba:usSum_98 ; |64| (P) <0,24>  
    ||         DADD2   .S1     us2_fe:us2_dc,usSum_fe':usSum_dc',usSum_fe:usSum_dc ; |61| (P) <0,24>  
     
               PACKLH2 .L2     usSum_76'',usSum_54'',usSum_43 ; |85| (P) <0,25>  
    ||         SHRMB   .S2     tmp5,usSum_65,usSum_65 ; |88| (P) <0,25>  
    ||         PACKLH2 .L1     usSum_ba,usSum_98,usSum_87 ; |80| (P) <0,25>  
    ||         SHRMB   .S1     usSum_98,usSum_fe,usSum_ed' ; |76| (P) <0,25>  
     
               SHR     .S2     usSum_m1m2,0x8,tmp4 ; |92| (P) <0,26>  
    ||         DADD2   .L2     usSum_76'':usSum_54'',usSum_65:usSum_43,usRes_76':usRes_54' ; |103| (P) <0,26>  
    ||         SHR     .S1     usSum_98,0x8,tmp7 ; |77| (P) <0,26>  
    ||         DADD    .L1X    0,usSum_76'':usSum_54'',usSum_76:usSum_54 ; |67| (P) <0,26> Define a twin register
     
               ADD2    .S2     usSum_10,usRes_76',usRes_76 ; |104| (P) <0,27>  
    ||         SHRMB   .L2     tmp4,usSum_21,usSum_21 ; |93| (P) <0,27>  
    ||         PACKLH2 .L1     usSum_fe,usSum_dc,usSum_cb ; |75| (P) <0,27>  
    ||         SHRMB   .S1     tmp7,usSum_ed',usSum_ed ; |78| (P) <0,27>  
     
               SHRMB   .S1     usSum_54,usSum_ba,usSum_a9' ; |81| (P) <0,28>  
    ||         ADD2    .S2     usSum_76'',usRes_54',usRes_54 ; |105| (P) <0,28>  
    ||         DADD2   .L2     usSum_32:usSum_10,usSum_21:usSum_0m1,usRes_32:usRes_10 ; |107| (P) <0,28>  
     
               SHR     .S1     usSum_54,0x8,tmp6 ; |82| (P) <0,29>  
    ||         STDW    .D1T2   usRes_76:usRes_54,*dst$2++(32) ; |114| (P) <0,29>  
    ||         ADD2    .L2     usSum_m1m2,usRes_32,usRes_32 ; |108| (P) <0,29>  
    ||         ADD2    .S2     usSum_32,usRes_10,usRes_10 ; |109| (P) <0,29>  
     
               STDW    .D1T2   usRes_32:usRes_10,*dst$1++(32) ; |115| (P) <0,30>  
    ||         SHRMB   .L1     tmp6,usSum_a9',usSum_a9 ; |83| (P) <0,30>  
     
               SPMASK          D1
    ||         ADD     .D1     8,dst,A2
    ||         DADD2   .L1     usSum_fe:usSum_dc,usSum_ed:usSum_cb,usRes_fe:usRes_dc ; |95| (P) <0,31>  
    ||         DADD2   .S1     usSum_ba:usSum_98,usSum_a9:usSum_87,usRes_ba':usRes_98' ; |99| (P) <0,31>  
     
               ADD2    .L1     usSum_fe,usRes_dc,usRes_dc ; |97| (P) <0,32>  
    ||         ADD2    .S1     usSum_98,usRes_fe,usRes_fe ; |96| (P) <0,32>  
    ||         ADD2    .D1     usSum_ba,usRes_98',usRes_98 ; |101| (P) <0,32>  
     
               ADD2    .S1     usSum_54,usRes_ba',usRes_ba ; |100| <0,33>  
    ||         STDW    .D1T1   usRes_fe:usRes_dc,*dst++(32) ; |112| <0,33>  
     
        .dwpsn    file "../SumC6600.sa",line 117,column 0,is_stmt,isa 0
     
               SPKERNEL 1,8
    ||         STDW    .D1T1   usRes_ba:usRes_98,*A2++(32) ; |113| <0,34>  
     
    $C$DW$L$_Sum3x3_C66Asm$4$E:
    ;** --------------------------------------------------------------------------*
    $C$L4:    ; PIPED LOOP EPILOG
     
               ZERO    .L2     xLoopCnt
    ||         BDEC    .S1     $C$L1,yLoopCnt    ; |119|  
     
               NOP             5
               ; BRANCHCC OCCURS {$C$L1}         ; |119|  
    ;** --------------------------------------------------------------------------*[/code]

  • That long list of fetch packets is folded into 11 slots by virtue of the "SPLOOPD 11" instruction.  That software pipeline kernel really does have ii=11.
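The folding can be sketched with a little arithmetic. This is a rough model only: it assumes the `<0,N>` annotations in the listing are cycle offsets within a single iteration, and that the loop buffer starts a new iteration every ii cycles. The helper names are hypothetical.

```python
import math

II = 11          # from "SPLOOPD 11" / "ii = 11" in the listing
LAST_CYCLE = 34  # from the final "<0,34>" annotation at SPKERNEL

def fold(cycle, ii=II):
    """Map a single-iteration cycle offset to (stage, slot within the kernel)."""
    return divmod(cycle, ii)

def stages(last_cycle=LAST_CYCLE, ii=II):
    """Number of ii-cycle stages that one iteration spans."""
    return math.ceil((last_cycle + 1) / ii)

print(stages())   # 4
print(fold(0))    # (0, 0)
print(fold(29))   # (2, 7)
print(fold(34))   # (3, 1)
```

One iteration spans 35 cycles, which is 4 stages of 11 cycles each. In steady state, 4 iterations at different stages share the same 11 kernel slots, which is consistent with "Schedule found with 4 iterations in parallel". The printed listing shows the single-iteration view, which is why it looks sequential.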

  • Thanks for the reply - indeed I hadn't understood how the software pipelined loop buffer works.
    I had a look at the documentation and it's clear to me now - quite a clever piece of hardware :)

  • George / Archaeologist-

    My apologies for bumping this thread after a long time, but this appears to be the best search result and my question might be helpful to others.

    In the absence of actual disqualification messages, as in the case Clemens describes, what is the best .asm output feedback to actually know that a loop was not pipelined? So far I've simply been looking for SPxx instructions and, not finding them, assuming that pipelining did not happen.

    Thanks.


    -Jeff
    Signalogic

  • Add the build option --debug_software_pipeline.  This causes the compiler to add a very detailed comment block before every loop.  If a loop is not pipelined, this comment clearly says so, and explains why.

    Thanks and regards,

    -George

  • George-

    Thanks for your fast reply.  Yes we've been using

     --debug_software_pipeline --advice:performance

    in our builds, and for the most part that tells us what we need to know.  But my question is whether there is something specific we can search for by presence instead of absence.  Right now, if we don't see "Loop will be splooped" then we know, but that's a process of elimination.

    The reason we need this is that we're building a set of scripts and programs that help automate the optimization process.  The objective of these tools is to find loops that were not splooped, along with good guesses about the possible number of iterations, in order to assign a "worthwhile effort priority" for optimizing loops.  We build these things because (as you may know) Signalogic adds 100s of c66x cores as co-CPUs inside servers and we run a wide range of applications (OpenCV, analytics, telecom, robotics, driverless cars, oil & gas exploration, etc.), and those tend to be huge amounts of code with literally 1000s of loops.

    Thanks.

    -Jeff
    Signalogic

  • Your best bet is to look for "Disqualified". You might also look for "BEGIN LOOP", which indicates that the optimization level command-line option is too low to even attempt SP. You might also consider looking for "Loop will not be splooped", which means the loop will be software pipelined, but using the old style instructions rather than the more efficient SPLOOP family.
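For the kind of tooling Jeff describes, a presence-based scan could look roughly like the sketch below. The marker strings are the ones named in this thread. The function name, the case-insensitive matching and the result format are assumptions for illustration, not anything the compiler defines.

```python
# Marker text from this thread, paired with the meaning given in the replies.
MARKERS = [
    ("Disqualified", "loop was not software pipelined; the comment gives the reason"),
    ("BEGIN LOOP", "optimization level too low to attempt software pipelining"),
    ("Loop will not be splooped", "pipelined, but without the SPLOOP instructions"),
    ("Loop will be splooped", "pipelined using SPLOOP"),
]

def scan_asm(text):
    """Return (line number, marker, meaning) for every marker found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()  # assumption: tolerate differences in capitalization
        for marker, meaning in MARKERS:
            if marker.lower() in lowered:
                hits.append((lineno, marker, meaning))
    return hits

# Usage: two lines copied from the listing earlier in this thread.
sample = (
    ";*         ii = 11 Schedule found with 4 iterations in parallel\n"
    ";*      Loop will be splooped\n"
)
for lineno, marker, meaning in scan_asm(sample):
    print(f"line {lineno}: {marker} -> {meaning}")
```

In a real script, the text would come from the .asm files produced with --debug_software_pipeline. Each hit could be combined with the "Known Minimum Trip Count" line from the same comment block to rank loops by likely payoff.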