How to get huge loop to be pipelined?

Hello,

I am currently optimizing a very time-consuming code fragment that has 3 nested loops. The outer two loop over y/x, while the innermost loop executes some code exactly 4 times.

While the optimizer does a good job on the innermost loop, the loop prolog/epilog is quite long and, because the loop executes only 4 times, contributes significantly to execution time. Furthermore, the "preparation" code, which is executed for every pixel, isn't pipelined at all.

When I manually unroll the innermost loop, I get the following error: [E0800] Specified label is too far away; max range is [-2048,2047]
Furthermore, the unrolled loop isn't pipelined at all.

So, my question consists of 3 parts:

- Is there any switch to instruct the optimizer to try to software-pipeline larger loops?

- Should I use another branch instruction?

- Is this advisable at all? As far as I can see, the code should still fit into L1P.

Thank you in advance, Clemens

for(y=0; y < height; y++) {
    for(x=0; x < width; x++) {
        //Some preparation code
        for(i=0; i < 4; i++) {
            //Innermost loop code (~100 instructions executed in 16 cycles)
        }
    }
}

  • Clemens Eisserer said:
    I get the following error: [E0800] Specified label is too far away; max range is [-2048,2047]

    This error is from the assembler.  Are you programming in C and you got this error?

    Thanks and regards,

    -George

  • Hi George,

    Thanks for your reply. No, the loop is written in linear assembly - so I guess BDEC is just the wrong branch instruction for larger offsets?

    Any idea how I could get the assembly optimizer to pipeline loops containing a lot of instructions (~500)?

    Thank you in advance, Clemens

  • Hi,

    Software pipelining relies on a loop buffer with a limited capacity (14 execution packets), so 500 instructions are too many to be pipelined.

  • And what about the "manual" pipelining the compiler does when no loop buffer is available (as on the C64)?

    However, it seems the optimizer can't cope with that many instructions anyway: unrolling the loop just once caused the optimizer to bail out ("did not find schedule") :/

  • Try unrolling the innermost 4-iteration loop completely.  Unrolling a loop partially will probably not help the compiler software pipeline it.

    Can you interchange the loops so that the 4-iteration loop is no longer the innermost loop?

    Try using the --debug_software_pipeline (-mw) option to get more details about the compiler's attempts to software pipeline.
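
    The two restructurings suggested above can be sketched in plain C as follows. This is only an illustration under stated assumptions, not the original linear-assembly kernel: prepare() and step() are hypothetical stand-ins for the per-pixel preparation code and the body of the 4-iteration loop, and WIDTH, HEIGHT and TAPS are made-up sizes. The interchanged variant also assumes the preparation results for one row can be kept in a small buffer and that the order in which results are accumulated does not matter.

    ```c
    enum { WIDTH = 8, HEIGHT = 4, TAPS = 4 };

    /* Hypothetical stand-ins for the real per-pixel work. */
    int prepare(int x, int y) { return 3 * x + 5 * y; }
    int step(int base, int k) { return (base ^ (k * 7)) + k; }

    /* Original shape: y/x loops around a 4-iteration innermost loop. */
    long nest_original(void) {
        long sum = 0;
        for (int y = 0; y < HEIGHT; y++)
            for (int x = 0; x < WIDTH; x++) {
                int base = prepare(x, y);
                for (int k = 0; k < TAPS; k++)
                    sum += step(base, k);
            }
        return sum;
    }

    /* Fully unrolled: the x loop becomes the innermost loop, so it is
       the one the compiler would try to software pipeline. */
    long nest_unrolled(void) {
        long sum = 0;
        for (int y = 0; y < HEIGHT; y++)
            for (int x = 0; x < WIDTH; x++) {
                int base = prepare(x, y);
                sum += step(base, 0);
                sum += step(base, 1);
                sum += step(base, 2);
                sum += step(base, 3);
            }
        return sum;
    }

    /* Interchanged: the 4-iteration loop moves outside the x loop.
       Preparation results for one row are stored in a buffer first. */
    long nest_interchanged(void) {
        long sum = 0;
        int base[WIDTH];
        for (int y = 0; y < HEIGHT; y++) {
            for (int x = 0; x < WIDTH; x++)
                base[x] = prepare(x, y);
            for (int k = 0; k < TAPS; k++)
                for (int x = 0; x < WIDTH; x++)
                    sum += step(base[x], k);
        }
        return sum;
    }
    ```

    All three functions compute the same sum, so they can be checked against each other before porting either shape back to linear assembly. In both restructured variants the innermost loop runs over x with a trip count of WIDTH instead of 4, which is what spreads the prolog/epilog cost over many more iterations; whether the resulting body stays within the pipeliner's instruction limit is a separate question.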

  • Hi Archaeologist,

    Thanks for your feedback. With the innermost loop unrolled, the optimizer bails out with:

    ;*   SOFTWARE PIPELINE INFORMATION
    ;*      Disqualified loop: Too many instructions (limit = 250)

    The completely unrolled loop has 432 instructions. Is there any way I can raise the limit?

    Thanks a lot, Clemens

  • Found the -mpn switch, so I compiled with "-mpn500", which (in combination with -O3 or -O2) resulted in:

    Renamed pair with base %s above window:         DCMPGTU4    .S2X    VRB2700:VRB2617,VRA2410:VRA2409,VRB2564    ; |554|
    >> ../Census16_C66Turbo.sa, line 111: INTERNAL ERROR: Corrupted IR detected
                                          during check_mve/spilling

    This may be a serious problem.  Please contact customer support with a
    description of this problem and a sample of the source files that caused this
    INTERNAL ERROR message to appear.

    Cannot continue compilation - ABORTING!

  • Well, that's a bug.  Could you send me the test case through private conversation?  I can't guarantee that fixing the bug will make the compiler able to software pipeline this loop, but we should at least try to fix the bug.

  • This is now SDSCM00044316.  This bug appears to have been introduced in release 7.2.x.