RM46L852: What causes instruction buffer stalls without an instruction cache on a Cortex-R4?

Part Number: RM46L852

I have implemented radix-2 and radix-4 decoding algorithms that run bare-metal on a TI RM46L852 chip with an ARM Cortex-R4F. I have also implemented optimized versions of both. For radix-4, the speedup is almost exactly what I predicted, but for radix-2 I actually see a performance regression.

Then I implemented the radix-2 and optimized radix-2 designs in assembly and used the PMU performance counters to find the cause of the regression. My optimization is basically to skip 30 instructions if an input value is 0. The branch predictor does a good job, with around 96% of all branches correctly predicted, and the remaining 4% do not account for the performance regression. Every other PMU counter value is roughly the same, except for Instruction Buffer Stalls, which is 25 times higher in the optimized design.
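For readers following along, the shape of the optimization described above can be sketched in C roughly as follows. This is only an illustration: the function names and the multiply-accumulate body are hypothetical stand-ins for the roughly 30 skipped instructions, since the actual decoder was not posted in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-element work (~30 instructions in the
 * real decoder; here just a multiply-accumulate for illustration). */
static inline int32_t process_element(int32_t acc, int32_t x, int32_t coeff)
{
    return acc + x * coeff;
}

/* Baseline: always executes the per-element work. */
int32_t decode_baseline(const int32_t *in, const int32_t *coeff, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc = process_element(acc, in[i], coeff[i]);
    }
    return acc;
}

/* "Optimized": skips the per-element work when the input value is 0.
 * In assembly this becomes a conditional branch over the loop body. */
int32_t decode_skip_zero(const int32_t *in, const int32_t *coeff, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        if (in[i] == 0) {
            continue; /* taken branch: instruction fetch is redirected */
        }
        acc = process_element(acc, in[i], coeff[i]);
    }
    return acc;
}
```

Both variants return the same result. The only difference is the extra data-dependent branch, which changes the instruction fetch pattern each time a zero input is seen.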

Looking at the documentation, it only says that this could happen because of instruction cache misses, but this chip has no I$ or D$ and has single-cycle on-chip RAM. I could not find any other documentation that explains what exactly the instruction buffer is and what could cause it to stall. Does anyone have an explanation?

  • Hi Arvid,

    You are correct. The RM46 has neither an instruction cache nor a data cache.

  • Hi,

    Thanks for the reply. As we both said, this chip doesn't have any I$ and D$, but how can I explain the "Instruction Buffer Stalls"? Do you have an explanation for that?

  • As we both said, this chip doesn't have any I$ and D$, but how can I explain the "Instruction Buffer Stalls"?

    Is the code running from flash or RAM?

    The RM46L852 datasheet shows that RAM is zero wait state, but flash accesses might require wait states depending on the CPU clock speed.

  • Is the code running from flash or RAM?

    The code is running from flash, while the data is in RAM. I'm aware of the wait states that are required when I run the MCU at 220 MHz. But even if I slow the MCU down to 44 MHz (0 wait states for flash), the instruction buffer stalls are still 25 times higher in the optimized design.

    So even after reducing the PLL frequency and setting the wait states to 0, the problem with the instruction buffer stalls is still there.

  • Unaligned LDR instructions have an extra cycle penalty compared with aligned loads. In general, stores are less likely to stall the system compared to loads. STRB and STRH have similar performance to STR, because of the merging write buffer. Because there are four slots in the load/store unit, more than four consecutive pending loads will always cause a pipeline stall.
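As a minimal sketch of the alignment point above, a check like the following can flag word loads whose address is not 4-byte aligned. The helper name is hypothetical and is not part of any TI or ARM library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if p is 4-byte aligned, i.e. a word
 * LDR from this address would not be an unaligned access. */
static inline bool is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 0x3u) == 0u;
}
```

A debug build could assert on this for the buffers passed to the hot loop. Note that it only addresses data-side load alignment, not the instruction fetch path that the stall counter refers to.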