This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

RM46L852: What causes instruction buffer stalls without an instruction cache on a Cortex-R4?

Part Number: RM46L852

I have implemented a radix-2 and radix-4 decoding algorithm that runs baremetal on a TI RM46L852 chip with a ARM Cortex-R4F. I have also implemented optimized versions of both and for radix-4 I get a speedup that is almost exactly what I predicted, but with radix-2 I actually have a performance regression.

Then I implemented the radix-2 and optimized radix-2 designs in assembly and used the PMU performance counters to find the cause of the regression. My optimization is basically to skip 30 instructions if an input value is 0. The branch predictor does a good job with around 96% of all branches being correctly predicted and the remaining 4% does not account for the performance regression. Every other PMU counter value is roughly the same except for Instruction Buffer Stalls which is 25 times higher in the optimized design.

Looking at the documentation, it only says that this could happen because of instruction cache misses, but this chip doesn't have any I$ or D$ and has single cycle RAM on-chip. I could not find any other documentation that explain what the instruction buffer exactly is and what could cause this. Does anyone have an explanation?