Hello,
I am currently optimizing a very time-consuming code fragment that has three nested loops. The outer two iterate over y and x, while the innermost loop executes its body exactly 4 times.
While the optimizer does a good job on the innermost loop, the loop prolog/epilog is quite long, and because the loop runs only 4 times, the prolog/epilog contributes significantly to execution time. Furthermore, the "preparation" code, which is executed for every pixel, isn't pipelined at all.
I tried manually unrolling the innermost loop, but then I get the following error: [E0800] Specified label is too far away; max range is [-2048,2047]
Furthermore, after unrolling, the loop isn't pipelined at all.
So, my question consists of 3 parts:
- Is there a switch that instructs the optimizer to try to software-pipeline larger loops?
- Should I use a different branch instruction?
- Is this advisable at all? As far as I can see, the code should still fit into L1P.
Thank you in advance, Clemens
for (y = 0; y < height; y++) {
    for (x = 0; x < width; x++) {
        // Some preparation code
        for (i = 0; i < 4; i++) {
            // Innermost loop code (~100 instructions executed in 16 cycles)
        }
    }
}
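
To make the manual unrolling concrete, here is a minimal sketch of what I mean. It is not my real code: `prepare` and `inner_step` are hypothetical placeholders for the preparation code and for one iteration of the innermost loop, and the buffer layout is assumed. The point is only the structure: once the 4-iteration loop is written out by hand, the x loop becomes the innermost loop, and that is the loop I would like the optimizer to pipeline.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-pixel "preparation" code. */
static inline int32_t prepare(const uint8_t *src, int x, int y, int width)
{
    return (int32_t)src[(size_t)y * (size_t)width + (size_t)x];
}

/* Hypothetical stand-in for one iteration of the innermost loop. */
static inline int32_t inner_step(int32_t base, int k)
{
    return base * (k + 1);
}

/* Original structure: innermost loop with a fixed trip count of 4. */
void process_rolled(const uint8_t *src, int32_t *dst, int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int32_t base = prepare(src, x, y, width);
            int32_t acc = 0;
            for (int k = 0; k < 4; k++) {
                acc += inner_step(base, k);
            }
            dst[(size_t)y * (size_t)width + (size_t)x] = acc;
        }
    }
}

/* Manually unrolled structure: the four iterations are written out,
 * so the x loop is now the innermost loop and has a much larger body. */
void process_unrolled(const uint8_t *src, int32_t *dst, int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int32_t base = prepare(src, x, y, width);
            int32_t acc = inner_step(base, 0)
                        + inner_step(base, 1)
                        + inner_step(base, 2)
                        + inner_step(base, 3);
            dst[(size_t)y * (size_t)width + (size_t)x] = acc;
        }
    }
}
```

Both functions compute the same result; only the loop structure differs. In my real code the unrolled body is roughly four times the ~100 instructions mentioned above, which is presumably why the branch target ends up out of range.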