
AM6442: FIQ Interrupt on A53

Part Number: AM6442

Short version: The TRM seems to be missing information on how to configure a GIC interrupt for FIQ instead of IRQ (on the A53 cores using Linux). I also couldn't find any examples in Processor Linux SDK v8.04 showing how to do this. Is this something TI can provide?

Longer version: We are porting over an application that used a processor from another vendor. This system ran a control loop in a FIQ interrupt handler under Linux, which gave <5 microseconds of latency and reasonably low jitter. We've ported this code over to the AM6442 on the A53 cores, but benchmarks on Linux and Linux-RT show interrupt (IRQ, not FIQ) latencies up to 60 microseconds, much too high for our application. We investigated running this code on the R5F cores, but 1) DDR access from the R5F is way too slow, presumably due to the memory interconnect, and 2) many of TI's network stacks take a significant amount of MSRAM.

So getting this running on the A53 cores / Linux seems to be our best option. The previous processor vendor supplied a function in the Linux kernel that could configure an interrupt to be hooked by FIQ instead of IRQ. I've been looking for something similar in TI's documentation but so far haven't found it.

Any help is much appreciated, thanks!

  • Hi Steven,

    let me look into what we have on this front, it may take a couple of days to get back on this. Will be in touch.

    Regards, Andreas

  • Thank you Andreas, much appreciated.

  • Hi Andreas, checking back to see if you've been able to look into this.

    I skimmed the GIC-500 TRM, but didn't see any info on selecting FIQ vs IRQ. I'm guessing that means I need to look at the Cortex A53 docs instead, so I'll try to do that today. But if you can point me in the right direction that would be a major help. Thank you!

  • Steven,

    traditionally the Kernel doesn't really use FIQs, however there are some signs of support being added to the upstream Kernel, mostly in the context of enabling support for the Apple M1 Armv8 silicon. See these patch series for example: https://lore.kernel.org/lkml/CAK8P3a1bXiWcieqTSZARN+to=J5RjC2cwbn_8ZOCYw2hhyyBYw@mail.gmail.com/T/ and https://lore.kernel.org/linux-arm-kernel/20210302101211.2328-9-mark.rutland@arm.com/T/

    If you look at arch/arm64/include/asm in the upstream Kernel tree specifically, you can find several traces of FIQ support that got merged, and that Kernel is supposed to be bootable on M1 silicon. I have not spent time looking at what is possible in terms of our AM6x SoCs, but getting FIQs to work would likely involve extending/adding/backporting the existing Apple M1 support to the ti-linux-5.10.y tree. That's not something we support or have even tried to my knowledge. I would not recommend going down this path by yourself at this time.

    Taking a step back, the real issue you are having is Linux-RT latency related and we do have other customer reports actually about similar concerns and our R&D team is currently actively investigating how this behavior can be improved. I suggest we keep this thread open for 1-2 more weeks so we can share additional details through this avenue.

    Regards, Andreas

  • Longer version: We are porting over an application that used a processor from another vendor. This system ran a control loop in a FIQ interrupt handler under Linux, which gave <5 microseconds of latency and reasonably low jitter. We've ported over this code to the AM6442 on the A53 cores, but benchmarks on Linux and Linux-RT show interrupt (IRQ, not FIQ) latencies up to 60 microseconds, much too high for our application

    Also, can you please share some system-level details on how you set this up and test the performance? Are you using any standard testing tools?

    Thanks, Andreas

  • Hi Andreas, thanks for the feedback.

    I did some digging into FIQ options and it appears somewhat complicated, as ATF/OPTEE set SCR_EL3.FIQ=1 so the A53 routes all FIQs to EL3 (secure world). I have yet to find out if there's a way for ATF to route a periodic FIQ to Linux in EL1, but I'm guessing this would be difficult, to say the least.
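    For context, here's a minimal sketch of where that routing decision lives, based on upstream Trusted Firmware-A conventions. The function below is hypothetical; the real logic sits in TF-A's context management code, and the interrupt's GIC group assignment also factors into whether it is signaled as IRQ or FIQ in the first place:

    /* Hypothetical sketch based on upstream TF-A conventions; not the
     * actual TI port. With SCR_EL3.FIQ = 1, a FIQ raised while the core
     * runs in non-secure EL0/EL1 traps to EL3 instead of reaching Linux
     * at EL1.
     */
    #include <arch.h>     /* SCR_NS_BIT, SCR_FIQ_BIT */
    #include <stdint.h>

    static uint32_t build_nonsecure_scr_el3(void)
    {
        uint32_t scr_el3 = SCR_NS_BIT;  /* lower ELs are non-secure */

        scr_el3 |= SCR_FIQ_BIT;         /* route FIQs to EL3 (the current
                                           ATF/OPTEE behavior described above) */
        /* Clearing SCR_FIQ_BIT instead would leave FIQ delivery to EL1,
         * but the ti-linux-5.10 kernel has no arm64 FIQ vector to handle
         * it, per the upstream patch discussion linked earlier. */
        return scr_el3;
    }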

    I agree that if Linux-RT latency could be improved, that would be a better solution. Another option would be using AMP mode, where one A53 core runs Linux and the other core runs bare metal or FreeRTOS. However, this doesn't seem to be supported as of Processor SDK v8.04.

    For benchmarking, I'm using the default SD card image from Processor Linux-RT SDK v8.04 on the (new, SR2.0) AM64xEVM board. I've added "isolcpus=1" to the bootargs to isolate the 2nd A53 core from the scheduler. The two benchmarks I've run are:

    1) cyclictest with "taskset" to schedule only on the 2nd A53 core (CPU 1). First I load the CPU using:

    yes > /dev/null &

    Then I start cyclictest and log results to a file:

    taskset -c 1 cyclictest -l300000 -m -S -p99 -i200 -h400 > cyclictest.txt

    Note I'm only running 300,000 loops @ 200us, so this only runs for 60 seconds, which is very short. This gives a more optimistic result compared to running for a longer time (e.g., hours). But here is the result, showing typical single-digit latencies with maxima of 30-40us over a 60 second period. I see closer to 60-70us when running for longer periods of time.

    2) The other benchmark I use is a kernel module interrupt handler (interrupt requested with IRQF_NO_THREAD). This is more representative of what I'm trying to achieve long term. The interrupt is the PRU IEP Compare Event (periodic @ 20kHz) routed using the Compare Event Interrupt Router. In the IRQ handler I immediately read the IEP Timer and compare its value to the value it would have had if there was zero latency. The results are written to MSRAM and I graph them using the graph tool in CCS. The results look like this (left axis in nanoseconds, bottom axis is sample #, sampled @ 20kHz):

    Note periodic spiking to 46us:

    I seem to get similar results from both benchmarks. If it helps I can share the kernel module code, but cyclictest might be the easier option for you to reproduce.
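    In case it's useful before then, here's a minimal sketch of how that kernel module is structured (not my exact code; the IRQ number, IEP counter address, and tick period below are placeholders to be taken from the device tree and the AM64x TRM):

    #include <linux/module.h>
    #include <linux/interrupt.h>
    #include <linux/io.h>

    #define BENCH_IRQ     123           /* placeholder: routed compare event */
    #define IEP_COUNT_REG 0x3002E000UL  /* placeholder: IEP counter address  */
    #define PERIOD_TICKS  10000ULL      /* placeholder: IEP ticks per 50us   */

    static void __iomem *iep_count;
    static u64 expected;                /* counter value if latency were zero */

    static irqreturn_t iep_cmp_handler(int irq, void *dev_id)
    {
            u64 now = readq(iep_count); /* sample the timer immediately */
            u64 latency = now - expected;

            (void)latency;              /* log this somewhere fast (MSRAM) */
            expected += PERIOD_TICKS;   /* next expected compare value */
            return IRQ_HANDLED;
    }

    static int __init bench_init(void)
    {
            iep_count = ioremap(IEP_COUNT_REG, 8);
            if (!iep_count)
                    return -ENOMEM;
            expected = readq(iep_count) + PERIOD_TICKS;

            /* IRQF_NO_THREAD keeps this a hard IRQ on PREEMPT_RT rather
             * than a threaded handler, avoiding scheduler-induced delay. */
            return request_irq(BENCH_IRQ, iep_cmp_handler, IRQF_NO_THREAD,
                               "iep-bench", NULL);
    }

    static void __exit bench_exit(void)
    {
            free_irq(BENCH_IRQ, NULL);
            iounmap(iep_count);
    }

    module_init(bench_init);
    module_exit(bench_exit);
    MODULE_LICENSE("GPL");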

  • Hi Steven,

    this is helpful, thanks for this additional level of detail. I'll feed this to our R&D team so that this can be considered as part of the ongoing investigation. On the surface for sure it looks very applicable. Let's check in again later next week to see where we are at with this on our end. Unfortunately I don't have an immediate answer for you but you can probably appreciate the complexity of the associated system-level and SoC-level investigation.

    Regards, Andreas

  • Hi Andreas, no problem, I understand w.r.t. SoC complexity.

    As you can see in the graph I posted previously, the kernel driver usually sees <3 microseconds of latency with occasional bursting to 40+ microseconds. If we could remove the bursting behavior the remaining 2-3 microseconds does meet our requirements.

    This week I plan to look into the kernel configuration to see if there are any settings that improve this behavior. Unfortunately I don't yet have an OS-aware debugger, so that makes it more difficult to isolate what is causing the latency spiking.

    Please let me know if your team makes progress on this issue or if there is any test data I can provide.

  • As you can see in the graph I posted previously, the kernel driver usually sees <3 microseconds of latency with occasional bursting to 40+ microseconds. If we could remove the bursting behavior the remaining 2-3 microseconds does meet our requirements.

    For reference here's a related thread also part of our investigation: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1127375/sk-am62-real-time-performance/4186772

    Please let me know if your team makes progress on this issue or if there is any test data I can provide.

    There are a couple of things and theories being looked at, but the final root cause is yet to be confirmed. Will keep you posted.

    Regards, Andreas

  • Hi Andreas, I was pulled off to work on another project, but will be back to looking at this tomorrow. Has the TI team made any progress finding a way to solve this latency issue on AM64x? I read the thread you linked previously and it seems to be the same problem I'm encountering here.

  • Hi Steven,

    the investigation is still ongoing, but our preliminary findings are as follows, so take this with a grain of salt. We found that Cortex-A SMP (ARM64) with background cache thrashing (like stress-ng memrate) coupled with narrow DRAM produces cyclictest outliers. This is regardless of SoC vendor. In terms of outright single-core worst case, older cores are likely better; newer ARM cores are better on average. Think on the order of 1 outlier per 10M cycles in the worst case. Limiting Linux to a single core (using maxcpus=1 on the Kernel command line) seems to make the problem disappear, probably because it limits the DDR bursts.
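    If you want to try this quickly on the EVM, here's one way, assuming the TI SDK U-Boot environment where the contents of the optargs variable get appended to the kernel command line (adjust if your bootargs are assembled differently):

    => setenv optargs maxcpus=1
    => saveenv
    => boot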

    Regards, Andreas

  • Hi Andreas,

    Thanks for sharing your results, however preliminary. We will try to re-run our benchmarks using maxcpus=1. Although we'll have to evaluate the performance impact overall as it isn't ideal for 1/2 of the largest cores on the chip to go unused.

    It's a shame the memory interconnect has so much latency, because this is exactly the problem the Cortex-R5s are designed to solve. Unfortunately, when running R5 code out of DDR we see 60+ clock cycles per instruction (assuming our code won't fit into cache), effectively turning the 800 MHz R5 into a 12 MHz microcontroller.

    Since the latency spikes are DDR burst related, would you expect the same results with FreeRTOS SMP instead of Linux?

  • Hi Steven,

    We will try to re-run our benchmarks using maxcpus=1.

    Thanks, it's always good to have more pairs of eyes on it, please report back any findings.

    Although we'll have to evaluate the performance impact overall as it isn't ideal for 1/2 of the largest cores on the chip to go unused.

    Understood. We are not at the end of it, there may or may not be other ways to resolve this more elegantly.

    Unfortunately, when running R5 code out of DDR we see 60+ clock cycles per instruction (assuming our code won't fit into cache), effectively turning the 800 MHz R5 into a 12 MHz microcontroller.

    I haven't double checked the numbers but from discussions I know the MCU cores are best operated from on-chip memory only, not DDR. How much code space would you need for R5 code?

    Since the latency spikes are DDR burst related, would you expect the same results with FreeRTOS SMP instead of Linux?

    Generally speaking FreeRTOS is "leaner" than Linux, so the absolute numbers are likely better/smaller, but the general concept should still apply, if it is indeed a fundamental limitation of ARM64 SMP + narrow DDR.

    Will keep you posted, this has a lot of internal attention for sure.

    Regards, Andreas

  • Hello Steven,

    regarding R5F: If that's an option for you, you should try to fit as much code as possible into the TCMs. They're as large as the caches, equally fast, and not affected by (pseudo-)random eviction. If you can afford to run a (dual-)R5FSS in single-core mode, you can use twice the TCM (128 KB in total) for that single core.

    On-chip SRAM (2 MB) is already a lot worse at ~60 cycles latency, and DDR memory is really not an option at ~160 cycles latency (latency for a single strongly-ordered 32-bit read). Combined with the caches the SRAM might still work, since the 60 cycles affect you only once per cache line (for sequential code).
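    To illustrate the placement, here's a hypothetical sketch of pinning hot code and data into the TCMs; the section and memory-region names are illustrative and must match the linker command file of your actual MCU+ SDK project:

    /* C side: tag latency-critical code/data for TCM placement. */
    #include <stdint.h>

    #define TCM_CODE __attribute__((section(".tcm_a_code")))
    #define TCM_DATA __attribute__((section(".tcm_b_data")))

    TCM_DATA static volatile uint32_t control_state[64]; /* hot data in TCMB */

    TCM_CODE void control_loop_isr(void)
    {
        /* Instruction fetches and data accesses stay in single-cycle
         * TCM, immune to (pseudo-)random cache eviction. */
        control_state[0]++;
    }

    /* Linker command file side (TI ARM linker syntax, names illustrative):
     *   .tcm_a_code : {} > R5F_TCMA
     *   .tcm_b_data : {} > R5F_TCMB
     */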

    Regards,

    Dominic

  • Hi Andreas and Dominic,

    Really appreciate the suggestions. We are planning to use TI network stacks (EtherNet/IP, Profinet, and EtherCAT), and last I checked, some of these use a large percentage of on-chip SRAM. For example, the EtherNet/IP adapter demo takes up ~1MB of MSRAM, and I'm unsure how these numbers will change over time with new SDKs. Additionally, I believe the DMSC reserves a portion of MSRAM even after boot completes. For those reasons I was treating MSRAM as unavailable.

    We do see a significant performance improvement when running R5F code from TCM. But the TCM size is quite limited. The application we're trying to port to AM64x currently runs from a kernel module with ~50kB of code and ~4MB of data. Most of the data is a large shared buffer, so we could possibly reduce the data to ~100kB by moving that buffer to DDR. But my assumption is our application can't easily fit into TCM without a significant overhaul. We also make use of 64-bit floating point for control algorithms, and 64-bit float operations seem to have a large performance advantage on A53 cores compared to R5F cores. Again, we may be able to overhaul some of this to 32-bit floating point, but it would require significant changes.

    Thanks for the suggestion about single-core mode Dominic, having 128kB of TCM might make sense if we do run on R5F. I cross-checked my numbers and the ~67 cycles I saw when executing from DDR on R5 must have been mostly cache hits. 160 cycles of latency definitely takes strongly-ordered DDR execution off the table.

    I'm going to continue looking for ways to reduce latency on the A53 cores unless we hit a major roadblock. The performance we see on the A53 cores is good aside from the occasional latency burst. If we can't find a solution to the latency bursting, we'll look at the R5F cores again.

  • Good discussion here, let's keep this thread open. I'll post any additional findings from our side. Note that next week is a US holiday week, which is coming up quickly, so this may briefly put things on pause on our end.

  • Hi Andreas,

    We are seeing essentially no change in latency when using "maxcpus=1", both with cyclictest and in our kernel-mode interrupt test on the SR2.0 AM64EVM board. We confirmed Linux is restricted to 1 core by looking at the output of 'lscpu' and 'cat /proc/cpuinfo':

    root@am64xx-evm:~# lscpu
    Architecture: aarch64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 2
    On-line CPU(s) list: 0
    Off-line CPU(s) list: 1
    Thread(s) per core: 1
    Core(s) per socket: 1
    Socket(s): 1
    Vendor ID: ARM
    Model: 4
    Model name: Cortex-A53
    Stepping: r0p4
    BogoMIPS: 400.00
    L1d cache: 32 KiB
    L1i cache: 32 KiB
    L2 cache: 256 KiB
    Vulnerability Itlb multihit: Not affected
    Vulnerability L1tf: Not affected
    Vulnerability Mds: Not affected

    Here are cyclictest results with both cores active (command = "cyclictest -l300000 -m -S -p99 -i200 -h400 > cyclictest.txt"):

    Cyclictest results with "maxcpus=1" (command same as above):

  • Thanks for the detailed results and console logs. This doesn't seem aligned with what we observed. Checking with the team for an explanation...

  • Hi Andreas: I re-flashed the board to the default image shipped with the 8.04 Linux-RT SDK (tisdk-default-image-am64xx-evm.wic.xz) to ensure I hadn't inadvertently broken something, but I still don't see any difference in latency with "maxcpus=1". Please let me know if I can provide additional info to help troubleshoot.

    Here is one additional plot showing latency for a kernel-mode interrupt handler over time with "maxcpus=1" (vertical axis in nanoseconds):

  • Hi Steve,

    while re-digesting some of this discussion and what we are discussing/testing internally, I noticed that while the concerns are related, they are not exactly the same. What I mean by that is that your original concern was about latency you observed in the ~50us range, whereas our internal investigation was about occasional latency spikes in the hundreds-of-us range (200-500us). We see these larger spikes as the main issue, which according to our testing can be addressed with maxcpus=1 or isolcpus=X. We think tuning a system for sub-50us latency is a realistic target for embedded systems, but this involves hands-on tuning per target, drivers, and the application running. We don't think you would be able to get into the 20s or even 10s of us with this kind of processor.

    Can you please also review the E2E FAQ post we created for some additional comments/insight: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1172055/faq-am625-how-to-measure-interrupt-latency-on-multicore-sitara-devices-using-cyclictest

    I'm afraid at this time I don't have a good AM6-based solution that would get you to sub-10us latency on the A53 cores running from DDR.

    The options worth exploring further might be to see how to make the best use of internal MSRAM given the other things you want to do (e.g., validate the actual needs of the other stacks; btw, the System Controller FW (SYSFW) running on the DMSC itself only seems to need 48KB once it's up and running, according to https://software-dl.ti.com/mcu-plus-sdk/esd/AM64X/latest/exports/docs/api_guide_am64x/MEMORY_MAP.html) to see how much we actually have free for use. Or somehow squeeze things into the R5 TCMs as Dominic suggested, or do a combination of all of those things, but you already said it might require some major re-architecting of your FW.

    Regards, Andreas

  • Hi Andreas,

    Thanks for your feedback. I agree with your findings, although it was surprising to find that a 15-year-old Cortex-A9 can achieve lower interrupt latency than a more modern ARM core. But I assume this is a tradeoff that's been made at an architectural level to increase performance in other areas.

    We've decided to move forward porting our code base to run on the R5F cores. Running our interrupt benchmark on R5F shows <300ns of interrupt latency with <75ns of jitter; a massive improvement over the A53. Thanks for your assistance guiding us in the right direction.