Dear TI team,
I'm currently trying to analyze a performance issue with an MCU+ SDK FreeRTOS application running on an R5F core.
Basically, what I wanted to prove is that, while A53+Linux gives us worst-case latencies of ~100us, the combination of R5F+FreeRTOS+TCM should give us worst-case latencies of <10us.
Unfortunately I'm seeing much higher worst-case latencies of 25-27us, even for a high-priority ISR. A high-priority task, triggered via a semaphore given from that ISR, has a worst-case latency that is a few us higher still.
I've already started reducing my test case, and currently it basically consists of the following:
- A high-priority timer ISR triggering every 100us that calculates the latency from when the timer should have fired to when the ISR is entered (see the sketch after this list)
- A high-priority task pending on a semaphore and calculating its latency (similar to the ISR)
- A lower-priority "background" task that prints to the UART every second and executes 1 MB of instructions located in DDR memory that also read from DDR memory
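In simplified form the measurement logic looks roughly like this (a sketch of my test code; read_timer_count(), TICKS_PER_100US and the globals stand in for my actual timer access and bookkeeping):

    #include <stdint.h>
    #include "FreeRTOS.h"
    #include "semphr.h"

    /* Placeholders for my real timer access: the timer free-runs and the
       compare event fires every 100us. */
    extern uint32_t read_timer_count(void);
    #define TICKS_PER_100US  (100000u)

    static SemaphoreHandle_t gLatSem;
    static volatile uint32_t gExpectedExpiry;   /* set when the timer is started */
    static volatile uint32_t gEventTime;
    static volatile uint32_t gIsrMaxLat, gTaskMaxLat;

    void timerIsr(void *arg)
    {
        (void)arg;
        uint32_t lat = read_timer_count() - gExpectedExpiry;  /* ISR latency */
        if (lat > gIsrMaxLat) gIsrMaxLat = lat;

        gEventTime       = gExpectedExpiry;      /* reference point for the task */
        gExpectedExpiry += TICKS_PER_100US;      /* next expected expiry */

        BaseType_t wake = pdFALSE;
        xSemaphoreGiveFromISR(gLatSem, &wake);
        portYIELD_FROM_ISR(wake);
    }

    void latencyTask(void *arg)
    {
        (void)arg;
        for (;;) {
            xSemaphoreTake(gLatSem, portMAX_DELAY);
            uint32_t lat = read_timer_count() - gEventTime;   /* task latency */
            if (lat > gTaskMaxLat) gTaskMaxLat = lat;
        }
    }

gIsrMaxLat and gTaskMaxLat are what I report as the worst-case latencies above.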
All of my high-priority ISR and task code, the task stack, and most of FreeRTOS are linked to TCM. The low-priority task is linked to TCM, too, and executes the runtime-generated 1 MB function in DDR memory. The IRQ/SVC stacks are in TCM as well.
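The placement itself is done with section attributes (tiarmclang accepts the GCC-style syntax); the section names below come from my own linker command file, where they are mapped to TCM, and are not SDK defaults:

    #include <stdint.h>

    /* ".tcm_code" / ".tcm_data" are output sections from my linker command file. */
    __attribute__((section(".tcm_code")))
    void timerIsr(void *arg)
    {
        /* ... latency measurement as in the sketch above ... */
        (void)arg;
    }

    __attribute__((section(".tcm_data"), aligned(32)))
    uint8_t gHighPrioTaskStack[4096];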
The issue I'm seeing is that, while the average latency for the ISR is on the order of a few hundred ns, the ISR occasionally sees a latency of >25us.
The average latency for the task is on the order of a few us, but occasionally it is almost 30us.
I've verified that the timer readings are correct, at least when compared against the core's cycle counter: normally - when the timer-measured latency is "ok" - the cycle counter increments by ~80000 cycles between ISR entries (80000 * 1.25ns = 100us), but in the error case - when the timer-measured latency is e.g. 20000 - the cycle counter increments by ~97000 cycles (~120us).
I've instrumented the HwiP_irq_handler code to record the core's cycle counter as early as possible (immediately after "Push used registers"), and pretty much all of the excess latency has already passed by that point, i.e. the cycle counter value at entry to HwiP_irq_handler is only a few hundred cycles lower than what I read later in the ISR.
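For reference, this is roughly how I read and store the cycle counter (the same PMCCNTR read is used for the ~80000-cycle cross-check above; the section name is again from my own linker command file):

    #include <stdint.h>

    /* PMCCNTR, standard ARMv7 PMU encoding (CP15 c9, c13, 0). */
    static inline uint32_t pmu_read_cycle_counter(void)
    {
        uint32_t ccnt;
        __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(ccnt));
        return ccnt;
    }

    __attribute__((section(".tcm_data"))) volatile uint32_t gIrqEntryCycles;

    /* Roughly equivalent to the instructions I inserted right after
       "Push used registers" in HwiP_irq_handler. */
    void record_irq_entry(void)
    {
        gIrqEntryCycles = pmu_read_cycle_counter();
    }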
I've used core trace to record ETM->ETB traces until a breakpoint triggers on a latency > e.g. 18000, and I can see that every time I hit the excess latency the core was previously executing my low-priority background task. That also matches the return address from IRQ entry that is pushed onto the IRQ stack. Using the core trace I can also be certain that my critical loop executes only code in TCM. Unfortunately CCS is consistently unable to decode my trace and either crashes or aborts after ~32k instructions, while the 64KB trace buffer contains a few hundred thousand instructions at that point. As a workaround I use CCS only to set up the trace, then close the core trace window, manually download the TBR contents via the Scripting Console, and decode the trace buffer with OpenCSD (the CoreSight trace stream decoder).
Originally my "background loop in DDR memory" consisted of 1 MB of "ldr r0, [pc]" instructions to create a high load on the caches and the memory system in general. I've since enhanced the code to use the performance monitoring counters: in the "background loop" I keep resetting the PMU counters, i.e. an MCR that writes 0x3 to the PMCR register (enable counters, reset event counters, don't reset the cycle counter), followed by an "ldr r12, [pc]", so that every other instruction resets the PMU counters while still stressing the caches and the memory system. The latencies measured via the cycle counter at HwiP_irq_handler entry remained roughly the same.
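The DDR code block is generated at runtime roughly like this (a simplified sketch of the original "ldr r0, [pc]" variant; gDdrCodeBuf is a placeholder for the 1 MB buffer my linker places in DDR, and the cache maintenance call is omitted):

    #include <stdint.h>
    #include <stddef.h>

    #define OP_LDR_R0_PC  0xE59F0000u  /* ldr r0, [pc, #0]  (ARM encoding) */
    #define OP_NOP        0xE1A00000u  /* mov r0, r0 */
    #define OP_BX_LR      0xE12FFF1Eu  /* bx lr */

    extern uint32_t gDdrCodeBuf[];     /* 1 MB array linked to DDR */

    typedef void (*ddr_func_t)(void);

    static ddr_func_t generate_ddr_loop(size_t numWords)
    {
        size_t i;

        /* PC-relative loads keep both the I-cache and the LSU busy;
           each load stays within the buffer. */
        for (i = 0; i + 2 < numWords; i++) {
            gDdrCodeBuf[i] = OP_LDR_R0_PC;
        }
        gDdrCodeBuf[numWords - 2] = OP_NOP;
        gDdrCodeBuf[numWords - 1] = OP_BX_LR;

        /* Clean D-cache / invalidate I-cache for the buffer before calling it. */

        return (ddr_func_t)(void *)gDdrCodeBuf;
    }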
Upon entry to HwiP_irq_handler I also store the contents of the PMU registers to TCM memory, and I can see that the counter for instruction buffer stalls (event type 0x40) shows ~17000 cycles and the one for LSU busy (event type 0x44) shows ~1000 cycles.
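The PMU accesses are plain CP15 reads/writes using the standard ARMv7 encodings, roughly like this (sketch; the counter assignment and the TCM section name are mine):

    #include <stdint.h>

    static inline void pmu_select(uint32_t idx)     /* PMSELR */
    {
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 5" :: "r"(idx));
    }

    static inline void pmu_set_event(uint32_t evt)  /* PMXEVTYPER */
    {
        __asm__ volatile("mcr p15, 0, %0, c9, c13, 1" :: "r"(evt));
    }

    static inline uint32_t pmu_read_event(void)     /* PMXEVCNTR */
    {
        uint32_t v;
        __asm__ volatile("mrc p15, 0, %0, c9, c13, 2" : "=r"(v));
        return v;
    }

    void pmu_setup(void)
    {
        pmu_select(0); pmu_set_event(0x40);  /* instruction buffer stalls */
        pmu_select(1); pmu_set_event(0x44);  /* LSU busy */
        /* PMCNTENSET: counters 0, 1 and the cycle counter */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(0x80000003u));
        /* PMCR: global enable */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(0x1u));
    }

    __attribute__((section(".tcm_data"))) volatile uint32_t gStallCycles, gLsuBusy;

    void pmu_snapshot(void)   /* called at HwiP_irq_handler entry */
    {
        pmu_select(0); gStallCycles = pmu_read_event();
        pmu_select(1); gLsuBusy     = pmu_read_event();
    }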
If I change the "background loop" code that executes from DDR memory to NOPs ("mov r0, r0"), the excess latency seems to disappear. I'm still seeing unexpectedly high latencies in the task, but the ISR then always shows a worst case of a few us.
The DDR memory region is marked as Normal, write-back write-allocate, non-shared memory (the MPU region access control is set to 0x30B), so loads from DDR memory should be interruptible.
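For reference, 0x30B decodes as follows (Cortex-R5 MPU Region Access Control Register fields):

    /* MPU region access control value used for the DDR region:
     *   0x30B: XN  (bit 12)     = 0     -> executable
     *          AP  (bits 10:8)  = 0b011 -> full access
     *          TEX (bits 5:3)   = 0b001, C = 1, B = 1
     *                                   -> Normal, write-back write-allocate
     *          S   (bit 2)      = 0     -> non-shared
     */
    #define DDR_MPU_REGION_ACCESS_CTRL  (0x30Bu)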
My test code currently only runs on our own hardware because of the requirement to use DDR memory (LPDDR4 in our case). I've tried running it on the EVM, but if I use GEL files to initialize DDR memory after booting via the NULL bootloader from OSPI, my application won't get past Sciclient_init(). On our own hardware I'm using a modified NULL bootloader that includes the DDR setup code from the PDK tree plus a few fixes for LPDDR4. I'm working on getting this to run on the EVM, too.
It seems that entry into an IRQ handler located in TCM, with a stack located in TCM, sometimes takes ~20us. I know this sounds very unlikely and that it indicates a bug somewhere, but at the moment I'm pretty much out of ideas what else to look for.
Regards,
Dominic