RM46L852: What causes instruction buffer stalls without an instruction cache on a Cortex-R4?

Arvid van den Brink

Part Number: RM46L852

I have implemented radix-2 and radix-4 decoding algorithms that run bare-metal on a TI RM46L852 chip with an ARM Cortex-R4F. I have also implemented optimized versions of both. For radix-4 I get a speedup that is almost exactly what I predicted, but with radix-2 I actually see a performance regression.

Then I implemented the radix-2 and optimized radix-2 designs in assembly and used the PMU performance counters to find the cause of the regression. My optimization is basically to skip 30 instructions if an input value is 0. The branch predictor does a good job: around 96% of all branches are correctly predicted, and the remaining 4% does not account for the performance regression. Every other PMU counter value is roughly the same except for Instruction Buffer Stalls, which is 25 times higher in the optimized design.
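For readers who want to reproduce this kind of measurement, here is a minimal sketch of programming one Cortex-R4 PMU event counter through CP15. It is not the poster's code. The CP15 c9 register encodings follow the ARMv7 PMU layout; the event number 0x40 for "instruction buffer cannot deliver an instruction" is my recollection of the Cortex-R4 TRM event table and should be verified there. The helper names (`pmu_select_event`, `pmu_start`, `pmu_read_event`), the GCC-style inline assembly, and the `pmu_fake` host fallback are all assumptions made for illustration.

```c
#include <stdint.h>

/* Assumed event number (verify against the Cortex-R4 TRM event table). */
#define PMU_EVT_INST_BUFFER_STALL 0x40u

#define PMNC_E (1u << 0) /* PMNC.E: enable all counters */
#define PMNC_P (1u << 1) /* PMNC.P: reset all event counters to zero */

#if defined(__arm__)
/* On target: GCC-style CP15 accesses, which require a privileged mode. */
#define PMU_WR(crm, op2, v) \
    __asm__ volatile("mcr p15, 0, %0, c9, c" #crm ", " #op2 : : "r"((uint32_t)(v)))
#define PMU_RD(crm, op2, v) \
    __asm__ volatile("mrc p15, 0, %0, c9, c" #crm ", " #op2 : "=r"(v))
#define PMU_ISB() __asm__ volatile("isb" : : : "memory")
#else
/* Off target: hypothetical fake register file, so the sequence can be
 * exercised on a host. It does not model per-counter banking. */
static uint32_t pmu_fake[16][8];
#define PMU_WR(crm, op2, v) (pmu_fake[crm][op2] = (uint32_t)(v))
#define PMU_RD(crm, op2, v) ((v) = pmu_fake[crm][op2])
#define PMU_ISB() ((void)0)
#endif

/* Bind event counter n (0..2 on Cortex-R4) to an event and enable that counter. */
static void pmu_select_event(uint32_t n, uint32_t event)
{
    PMU_WR(12, 5, n);        /* PMNXSEL: choose which counter EVTSEL/PMCNT address */
    PMU_ISB();               /* make the selection visible before the next access */
    PMU_WR(13, 1, event);    /* EVTSEL: event to count */
    PMU_WR(12, 1, 1u << n);  /* CNTENS: write-one-to-set enable bit for counter n */
}

/* Zero the event counters and start counting.
 * Note: this overwrites the other PMNC control bits. */
static void pmu_start(void)
{
    PMU_WR(12, 0, PMNC_E | PMNC_P);
}

/* Read the current value of event counter n. */
static uint32_t pmu_read_event(uint32_t n)
{
    uint32_t v;
    PMU_WR(12, 5, n);   /* PMNXSEL */
    PMU_ISB();
    PMU_RD(13, 2, v);   /* PMCNT */
    return v;
}
```

A typical use would be to call `pmu_select_event(0, PMU_EVT_INST_BUFFER_STALL)` and `pmu_start()` before the loop under test, then `pmu_read_event(0)` after it, once for each build, and compare the two values.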

Looking at the documentation, it only says that this could happen because of instruction cache misses, but this chip doesn't have any I$ or D$ and has single-cycle on-chip RAM. I could not find any other documentation that explains what exactly the instruction buffer is and what could cause these stalls. Does anyone have an explanation?

over 3 years ago

QJ Wang over 3 years ago

TI Guru 192486 points

Hi Arvid,

You are correct. The RM46 doesn't have an instruction cache or a data cache.

Arvid van den Brink over 3 years ago in reply to QJ Wang

Prodigy 10 points

Hi,

Thanks for the reply. As we both said, this chip doesn't have any I$ and D$, but how can I explain the "Instruction Buffer Stalls"? Do you have an explanation for that?

Chester Gillon over 3 years ago in reply to Arvid van den Brink

Guru 92251 points

Arvid van den Brink said:
As we both said, this chip doesn't have any I$ and D$, but how can I explain the "Instruction Buffer Stalls"?

Is the code running from flash or RAM?

The RM46L852 datasheet shows RAM is zero wait state, but flash accesses might require wait states depending upon the CPU clock speed.

Arvid van den Brink over 3 years ago in reply to Chester Gillon

Prodigy 10 points

Chester Gillon said:
Is the code running from flash or RAM?

The code is running from flash, while the data is in RAM. I'm aware of the wait states that are required when I run the MCU at 220 MHz. But even if I reduce the PLL frequency so that the MCU runs at 44 MHz and set the flash wait states to 0, the instruction buffer stalls are still 25 times higher in the optimized design.

QJ Wang over 2 years ago in reply to Arvid van den Brink

TI Guru 192486 points

Unaligned LDR instructions have an extra cycle penalty compared with aligned loads. In general, stores are less likely to stall the system compared to loads. STRB and STRH have similar performance to STR, because of the merging write buffer. Because there are four slots in the load/store unit, more than four consecutive pending loads will always cause a pipeline stall.
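The alignment point above can be checked in software before a hot loop. This is a minimal sketch, with `is_word_aligned` as a hypothetical helper name; it only tests the address and says nothing about how the core actually schedules the access.

```c
#include <stdint.h>

/* Returns 1 if p sits on a 4-byte boundary, so a 32-bit LDR from p is aligned. */
static int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 3u) == 0u;
}
```

For example, a pointer to the start of a `uint32_t` array passes the check, while the same address plus one byte does not.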
