
AM6442: Latency in Linux-RT is well above expected values

Part Number: AM6442
Other Parts Discussed in Thread: AM5718

I see there is a new release of the RT-Linux SDK for AM64x - 08.00.00.21. In the previous one (07.xx.xx.xx) I reported quite bad latency results (cyclictest). I see that the new one has some changes in the kernel area (e.g. the kernel version has moved from 5.4-rt to 5.10-rt).

Does TI run some kind of real-time verification/validation on the new versions of the SDK? I see some notes about LTP-DDT in the SDK documentation, but no results or any other references.

What is the goal for real-time behavior on the Cortex-A53 cluster? I understand that the Cortex-R5 cluster is a much better match for real-time application demands, but in many cases there are legacy applications that need the RT properties of Linux-RT - for example, soft-PLC runtimes, NC/CNC kernels, and so on.

I will surely run the benchmarks again on the SK, but it's better to know whether these kinds of applications are supported, or targeted, by TI for the AM64x series. If not, there is no guarantee that the next release won't introduce a regression that would no longer allow using this series.

  • Hi,

    I will check whether the cyclictest results have changed or have been published yet. Could you please describe the bad latency that you are witnessing and what latency you require or are expecting to see? Latency is sometimes inherited or limited by the architecture.

    The development team does look for latency regressions between SDK releases. Concerning the goal for RT behavior, TI would like to get the best entitlement it can for the core(s) and peripheral set of the AM64x in this case. Essentially, TI does not develop the RT kernel but inherits it, along with some of the latency that goes with the A53s. TI does develop the AM64 peripheral drivers to be as low latency as possible.

    The question about support is a broad one. TI only supports the peripheral drivers that were developed for the AM64. This is the support model for both the non-RT and RT kernels alike. TI uses LTS kernel releases from the Linux mainline community. Could you please expand more on the type of support you require?

    Best Regards,

    Schuyler

    I reported this some weeks ago using the older SDK. We are evaluating the AM64x as a possible replacement for an AM5718 design. With the 57x series, we were able to achieve sub-30us latency - measured with cyclictest on a system under load, on an RT kernel with SMP disabled. This is an excellent result, and we ran a similar test on the AM64x. But we saw worst-case results close to 200us.

    This is far from the best A53 designs like the Zynq UltraScale+ (17us) and even some Chinese SoCs like Allwinner, where the latency is around 50-55us. This might be because of the architecture around the core, or peripheral/driver misbehavior. We could dig into this deeper (as we did with some targets before), but I think TI might be even more interested in this - right now we don't have the resources to do it (mainly time), so we would probably put this on hold until we can allocate someone to work on it.

    What we expect: in the best case, we would dream of the sub-30us of the AM57x and the Zynq - this would allow us to replace the AM57x and have a single base platform (it would also allow us to consider the AM24x for other designs because of the common source base). It would probably be acceptable if we got something like 50us - but in that case we would need to keep the AM57x design because of its faster reaction time and the higher CPU performance of the Cortex-A15.

  • Hi,

    Thank you for the data and for the comparison with other processors, that really helps put your question in context. I will be out of the office tomorrow through next week. I will start a discussion before I leave, and I plan to report back to you the week after next.

    Best Regards,

    Schuyler

  • Krasi,

    We do run the OSADL style cyclictest in automated testing, and it will be added to the benchmarks we publish at https://software-dl.ti.com/processor-sdk-linux-rt/esd/docs/08_00_00_21/devices/AM64X/RT_Linux_Performance_Guide.html . We just had a scripting bug so the results were not made public, which we'll fix.

    cyclictest -l100000000 -m -Sp99 -i200 -h400

    is what we run on the https://www.ti.com/tool/TMDS64GPEVM with the 7.3 default filesystem and services running; the OSADL plot comes out to:

    The average is 8us, and there were 6 outliers (out of 100M) on CPU0 and a single one on CPU1, which caused the worst case to be ~88us on CPU0 and ~70us on CPU1. This is with the default filesystem and all the services it comes with running. I will run this on the 8.0.
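
    For reference, the flags in that command map to standard cyclictest options:

    # -l100000000 : 100 million measurement loops per thread
    # -m          : lock memory (mlockall) so paging cannot add latency
    # -S          : standard SMP test, one measurement thread per CPU
    # -p99        : run the measurement threads at SCHED_FIFO priority 99
    # -i200       : 200 us wake-up interval
    # -h400       : collect a latency histogram up to 400 us
    cyclictest -l100000000 -m -Sp99 -i200 -h400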

    As a note on the https://www.ti.com/tool/SK-AM64: there are two reasons why it currently scores worse out of the box, and the GP EVM should be used to judge realtime. The LPDDR4 settings are overly conservative, to the point of visibly impacting realtime, and by default there is a wifi hotspot running. This is what it looked like with 7.3 and the full default filesystem and set of services (the average stayed at 8us):

    I'll update the 8.0 RT numbers here once I get the runs done (>5h, so it will need to run overnight - please have patience).

    While waiting for that, would it be possible to share a glimpse of what you have running when you get the good results on the AM57x or the other A53s, for example just what ps -aux prints out?

      Pekka

    I didn't know about this limitation of the SK board, so I will obviously need to obtain the GPEVM.

    About the AM571x tests - we were running our own image based on the TI SDK: a Yocto build based on the smallest TI image, but with additional modifications - stripping out unneeded features plus small kernel changes. Mainly, disabling SMP (as we use the single-core AM5718) did the magic and brought the latency below the 30us level. In our image we have no graphical output. To put some load on the system we do intensive network access, since this matches the basic usage of the final product - it will be a headless controller running industrial Ethernet protocols.
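
    A minimal sketch of that kind of loaded, headless measurement - cyclictest running while the network is kept busy - could look like the lines below; the iperf3 load generator and the exact values are assumptions, not the exact setup described above:

    # on a multi-core part, booting with maxcpus=1 on the kernel command line is one
    # way to approximate a single-core / no-SMP configuration (assumption)
    # generate sustained network load in the background (iperf3 server on a remote host):
    iperf3 -c <server-ip> -t 3600 &
    # measure wake-up latency while the load is running:
    cyclictest -l100000000 -m -Sp99 -i200 -h400 -q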

    We haven't validated any other A53 SoC internally - the values that I referenced are based on some public (internet) papers. For example, the best that I have seen is a paper about Zynq UltraScale+ SoC that also uses the A53 cores: 

    https://events19.linuxfoundation.org/wp-content/uploads/2017/12/Preempt-RT-Latency-Benchmarking-of-the-Cortex-A53-Processor-Paul-Thomas-AMSC.pdf

    Other references show different scores - the RPi 3B+ is about 90-100us depending on the kernel version, while the Allwinner H3 is in the range of 42-60us. Of course, this depends on the peripherals used and many other "environment" options.

    For now, I will accept that the SK DDR issue is the reason and will try to obtain the GPEVM and repeat the testing. 

    But if those values are added to the SDK release documentation, I would just use them - then we could discuss internally whether they fit our needs.

    Thank you!

    Thank you for the pointer to the paper. It clarifies quite a lot: the 17us is reached with CPUSETS, i.e. isolating a core for the realtime cyclictest and putting background load on the other core(s). We'll look into replicating that (a rough sketch of that kind of run is included after the plots below). The OSADL setup command I had above (from https://www.osadl.org/Create-a-latency-plot-from-cyclictest-hi.bash-script-for-latency-plot.0.html ) is SMP, without process affinity. A real application can probably leverage affinity to a point, but measurements with affinity and plain SMP measurements will give different results. The SK board LPDDR4 conservative-configuration issue is actually not significant enough to show up in the tests, at least when using the tiny filesystem image. Here are 100M iteration plots on https://www.ti.com/tool/SK-AM64:

    and https://www.ti.com/tool/TMDS64GPEVM 

    both with /filesystem/tisdk-tiny-image-am64xx-evm.wic.xz from the 8.0 SDK.
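
    A minimal sketch of that affinity-style run - background load pinned to one core, a single cyclictest thread pinned to the other - assuming CPU1 is the measurement core and hackbench as the load generator (both are assumptions, not the paper's exact setup):

    # pin a background load to CPU0 (hackbench is just one possible load generator)
    taskset -c 0 hackbench -l 1000000 &
    # run one measurement thread pinned to CPU1
    cyclictest -l100000000 -m -a1 -t1 -p99 -i200 -h400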

      Pekka

    Thank you for your effort in analyzing and documenting this - I see similar results. It will surely be interesting to find out whether core isolation could help - in real-world scenarios this matches the typical use case quite well, where one "runtime kernel" needs to get real-time performance (PLC runtime, CNC kernel, and so on). I've made some trials based on the paper and some other references, but without significant success.
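
    For completeness, the core-isolation approach described in the paper typically starts at the kernel command line; a minimal sketch, where the exact parameter set is an assumption and results will vary by platform:

    # append to the kernel command line (e.g. the bootargs variable in U-Boot):
    #   isolcpus=1 nohz_full=1 rcu_nocbs=1
    # then keep the measurement (or the RT application) on the isolated core:
    cyclictest -l100000000 -m -a1 -t1 -p99 -i200 -h400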