Dear TI team,
I'm currently trying to analyze a performance issue with an MCU+ SDK FreeRTOS application running on an R5F core.
Basically, what I wanted to prove is that, while A53+Linux gives us worst-case latencies of ~100us, the combination of R5F+FreeRTOS+TCM should give us worst-case latencies of <10us.
Unfortunately I'm seeing much higher worst-case latencies of 25-27us, even for a high-priority ISR. A high-priority task, triggered via a semaphore given from that ISR, has a worst-case latency that is a few us higher still.
I've already started reducing my test case, and currently it basically consists of the following:
- A high-priority timer ISR triggering every 100us that calculates the latency from when the timer should have fired to when the ISR is entered (see the sketch after this list)
- A high-priority task pending on a semaphore and calculating its latency (similar to the ISR)
- A lower-priority "background" task that prints to the UART every second and executes 1 MB of instructions located in DDR memory that also read from DDR memory
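In simplified form the measurement logic looks roughly like this (a sketch of my test code; read_timer_count(), TICKS_PER_100US and the globals stand in for my actual timer access and bookkeeping):

    #include <stdint.h>
    #include "FreeRTOS.h"
    #include "semphr.h"

    /* Placeholders for my real timer access: the timer free-runs and the
       compare event fires every 100us. */
    extern uint32_t read_timer_count(void);
    #define TICKS_PER_100US  (100000u)

    static SemaphoreHandle_t gLatSem;
    static volatile uint32_t gExpectedExpiry;   /* set when the timer is started */
    static volatile uint32_t gEventTime;
    static volatile uint32_t gIsrMaxLat, gTaskMaxLat;

    void timerIsr(void *arg)
    {
        (void)arg;
        uint32_t lat = read_timer_count() - gExpectedExpiry;  /* ISR latency */
        if (lat > gIsrMaxLat) gIsrMaxLat = lat;

        gEventTime       = gExpectedExpiry;      /* reference point for the task */
        gExpectedExpiry += TICKS_PER_100US;      /* next expected expiry */

        BaseType_t wake = pdFALSE;
        xSemaphoreGiveFromISR(gLatSem, &wake);
        portYIELD_FROM_ISR(wake);
    }

    void latencyTask(void *arg)
    {
        (void)arg;
        for (;;) {
            xSemaphoreTake(gLatSem, portMAX_DELAY);
            uint32_t lat = read_timer_count() - gEventTime;   /* task latency */
            if (lat > gTaskMaxLat) gTaskMaxLat = lat;
        }
    }

gIsrMaxLat and gTaskMaxLat are what I report as the worst-case latencies above.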
All of my high-priority ISR and task code, the task stack, and most of FreeRTOS are linked to TCM. The low-priority task is linked to TCM, too, and executes the runtime-generated 1 MB function in DDR memory. The IRQ/SVC stacks are in TCM as well.
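The placement itself is done with section attributes (tiarmclang accepts the GCC-style syntax); the section names below come from my own linker command file, where they are mapped to TCM, and are not SDK defaults:

    #include <stdint.h>

    /* ".tcm_code" / ".tcm_data" are output sections from my linker command file. */
    __attribute__((section(".tcm_code")))
    void timerIsr(void *arg)
    {
        /* ... latency measurement as in the sketch above ... */
        (void)arg;
    }

    __attribute__((section(".tcm_data"), aligned(32)))
    uint8_t gHighPrioTaskStack[4096];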
The issue I'm seeing is that, while the average latency for the ISR is on the order of a few hundred ns, the ISR occasionally sees a latency of >25us.
The average latency for the task is on the order of a few us, but occasionally it is almost 30us.
I've verified that the timer readings are correct, at least when compared against the core's cycle counter: normally - when the timer-measured latency is "ok" - the cycle counter increments by ~80000 cycles between ISR entries (80000 * 1.25ns = 100us), but in the error case - when the timer-measured latency is e.g. 20000 - the cycle counter increments by ~97000 cycles (~120us).
I've instrumented the HwiP_irq_handler code to record the core's cycle counter as early as possible (immediately after "Push used registers"), and pretty much all of the excess latency has already passed by that point, i.e. the cycle counter value at entry to HwiP_irq_handler is only a few hundred cycles lower than what I read later in the ISR.
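For reference, this is roughly how I read and store the cycle counter (the same PMCCNTR read is used for the ~80000-cycle cross-check above; the section name is again from my own linker command file):

    #include <stdint.h>

    /* PMCCNTR, standard ARMv7 PMU encoding (CP15 c9, c13, 0). */
    static inline uint32_t pmu_read_cycle_counter(void)
    {
        uint32_t ccnt;
        __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(ccnt));
        return ccnt;
    }

    __attribute__((section(".tcm_data"))) volatile uint32_t gIrqEntryCycles;

    /* Roughly equivalent to the instructions I inserted right after
       "Push used registers" in HwiP_irq_handler. */
    void record_irq_entry(void)
    {
        gIrqEntryCycles = pmu_read_cycle_counter();
    }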
I've used core trace to record ETM->ETB traces until a breakpoint triggers on a latency > e.g. 18000, and I can see that every time I hit the excess latency the core was previously executing my low-priority background task. That also matches the return address from IRQ entry that is pushed onto the IRQ stack. Using the core trace I can also be certain that my critical loop executes only code in TCM. Unfortunately CCS is consistently unable to decode my trace and either crashes or aborts after ~32k instructions, while the 64KB trace buffer contains a few hundred thousand instructions at that point. As a workaround I use CCS only to set up the trace, then close the core trace window, manually download the TBR contents via the Scripting Console, and decode the trace buffer with OpenCSD (the CoreSight trace stream decoder).
Originally my "background loop in DDR memory" consisted of 1 MB of "ldr r0, [pc]" instructions to create a high load on the caches and the memory system in general. I've since enhanced the code to use the performance monitoring counters: in the "background loop" I keep resetting the PMU counters, i.e. an MCR that writes 0x3 to the PMCR register (enable counters, reset event counters, don't reset the cycle counter), followed by an "ldr r12, [pc]", so that every other instruction resets the PMU counters while still stressing the caches and the memory system. The latencies measured via the cycle counter at HwiP_irq_handler entry remained roughly the same.
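The DDR code block is generated at runtime roughly like this (a simplified sketch of the original "ldr r0, [pc]" variant; gDdrCodeBuf is a placeholder for the 1 MB buffer my linker places in DDR, and the cache maintenance call is omitted):

    #include <stdint.h>
    #include <stddef.h>

    #define OP_LDR_R0_PC  0xE59F0000u  /* ldr r0, [pc, #0]  (ARM encoding) */
    #define OP_NOP        0xE1A00000u  /* mov r0, r0 */
    #define OP_BX_LR      0xE12FFF1Eu  /* bx lr */

    extern uint32_t gDdrCodeBuf[];     /* 1 MB array linked to DDR */

    typedef void (*ddr_func_t)(void);

    static ddr_func_t generate_ddr_loop(size_t numWords)
    {
        size_t i;

        /* PC-relative loads keep both the I-cache and the LSU busy;
           each load stays within the buffer. */
        for (i = 0; i + 2 < numWords; i++) {
            gDdrCodeBuf[i] = OP_LDR_R0_PC;
        }
        gDdrCodeBuf[numWords - 2] = OP_NOP;
        gDdrCodeBuf[numWords - 1] = OP_BX_LR;

        /* Clean D-cache / invalidate I-cache for the buffer before calling it. */

        return (ddr_func_t)(void *)gDdrCodeBuf;
    }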
Upon entry to HwiP_irq_handler I also store the contents of the PMU registers to TCM memory, and I can see that the counter for instruction buffer stalls (event type 0x40) shows ~17000 cycles and the one for LSU busy (event type 0x44) shows ~1000 cycles.
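The PMU accesses are plain CP15 reads/writes using the standard ARMv7 encodings, roughly like this (sketch; the counter assignment and the TCM section name are mine):

    #include <stdint.h>

    static inline void pmu_select(uint32_t idx)     /* PMSELR */
    {
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 5" :: "r"(idx));
    }

    static inline void pmu_set_event(uint32_t evt)  /* PMXEVTYPER */
    {
        __asm__ volatile("mcr p15, 0, %0, c9, c13, 1" :: "r"(evt));
    }

    static inline uint32_t pmu_read_event(void)     /* PMXEVCNTR */
    {
        uint32_t v;
        __asm__ volatile("mrc p15, 0, %0, c9, c13, 2" : "=r"(v));
        return v;
    }

    void pmu_setup(void)
    {
        pmu_select(0); pmu_set_event(0x40);  /* instruction buffer stalls */
        pmu_select(1); pmu_set_event(0x44);  /* LSU busy */
        /* PMCNTENSET: counters 0, 1 and the cycle counter */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(0x80000003u));
        /* PMCR: global enable */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(0x1u));
    }

    __attribute__((section(".tcm_data"))) volatile uint32_t gStallCycles, gLsuBusy;

    void pmu_snapshot(void)   /* called at HwiP_irq_handler entry */
    {
        pmu_select(0); gStallCycles = pmu_read_event();
        pmu_select(1); gLsuBusy     = pmu_read_event();
    }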
If I change the "background loop" code that executes from DDR memory to NOPs ("mov r0, r0"), the excess latency seems to disappear. I'm still seeing unexpectedly high latencies in the task, but the ISR then always shows a worst case of a few us.
The DDR memory region is marked as Normal, write-back write-allocate, non-shared memory (the MPU region access control is set to 0x30B), so loads from DDR memory should be interruptible.
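For reference, 0x30B decodes as follows (Cortex-R5 MPU Region Access Control Register fields):

    /* MPU region access control value used for the DDR region:
     *   0x30B: XN  (bit 12)     = 0     -> executable
     *          AP  (bits 10:8)  = 0b011 -> full access
     *          TEX (bits 5:3)   = 0b001, C = 1, B = 1
     *                                   -> Normal, write-back write-allocate
     *          S   (bit 2)      = 0     -> non-shared
     */
    #define DDR_MPU_REGION_ACCESS_CTRL  (0x30Bu)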
My test code currently only runs on our own hardware because of the requirement to use DDR memory (LPDDR4 in our case). I've tried running it on the EVM, but if I use GEL files to initialize DDR memory after booting via the NULL bootloader from OSPI, my application won't get past Sciclient_init(). On our own hardware I'm using a modified NULL bootloader that includes the DDR setup code from the PDK tree plus a few fixes for LPDDR4. I'm working on getting this to run on the EVM, too.
It seems that entry into an IRQ handler located in TCM, with a stack located in TCM, sometimes takes ~20us. I know this sounds very unlikely and that it indicates a bug somewhere, but at the moment I'm pretty much out of ideas what else to look for.
Regards,
Dominic