AM62L: AM62L Real-time Performance Issue

Part Number: AM62L

Hi,
We are encountering real-time performance issues while using the AM62L chip. The current SDK version in use is ti-processor-sdk-linux-rt-am62lxx-evm-11.02.08.02-Linux-x86-Install.bin, with a kernel version of 6.12.57. The issue details are as follows:
  1. Core isolation configuration was applied to CPU1, and real-time optimization parameters were set in the cmdline, as shown below:
root@am62xx-evm:/# cat /proc/cmdline
console=ttyS0,115200n8 earlycon=ns16550a,mmio32,0x02800000 ubi.mtd=ospi_nand.rootfs root=ubi0:rootfs rw rootfstype=ubifs rootwait rcu_nocb_poll rcu_nohz=1 idle=poll rcu_nocbs=1 nohz=on nohz_full=1 kthread_cpus=0 irqaffinity=0 isolcpus=managed_irq,domain,1
  2. A CPU load was added to Core 1, and jitter was tested using cyclictest with the following commands:
root@am62xx-evm:/# taskset -c 1 stress-ng --cpu 1 --cpu-load 70 --vm 1 --vm-bytes 80% &
root@am62xx-evm:/# cyclictest -a 0-1 -t 2 -p 99 -m -D 0 &
  3. Under this stress load, the maximum jitter exceeded 150µs within a 10-minute test, as shown in the output below:

    T: 0 (  404) P:99 I:1000 C:72252 Min:      5 Act:   11 Avg:   15 Max:     203
    T: 1 (  405) P:99 I:1500 C:48159 Min:      9 Act:   33 Avg:   33 Max:     152

We are particularly concerned about the real-time performance of Core 1, as real-time tasks in actual application scenarios will also be assigned to this core. However, the current system results deviate significantly from the 60–70µs jitter self-tested in the official documentation. We would appreciate your suggestions for real-time performance optimization.
Thank you for your support!
Attachment: Our kernel .config file.
  • Hello Xi,

    I always like to make sure that we have the same "known good" starting point before doing any development.

    Before attempting any core isolation, did you attempt to replicate the jitter results of the official documentation by following the exact steps in the docs? Please share the results of that test with an unmodified default SDK image:
    https://software-dl.ti.com/processor-sdk-linux-rt/esd/AM62LX/11_02_08_02/exports/docs/devices/AM62LX/linux/RT_Linux_Performance_Guide.html

    Regards,

    Nick

  • Hi Nick,

    On our own product, we have performed driver adaptation based on the official SDK, incorporating considerations specific to our product design. Following the stress test commands and cyclictest parameters described in the issue, we observed the following:
    When conducting tests strictly per the official documentation, the applied stress appears relatively light, yielding real-time performance results of approximately 69μs. However, when using our customized test commands—which simulate higher workloads aligned with actual application scenarios—the real-time performance degrades significantly.
    Given this discrepancy, we would appreciate real-time performance optimization recommendations tailored to our revised test methodology and increased load conditions.
    Regards,
    Xi
  • Hello Xi,

    Before we continue:

    Disclaimer: I am not an expert at tuning RT-Linux 

    I will be learning with you as we work to tune your RT Linux system.

    What processor are you using?

    You mentioned AM62L, but your terminal logs report AM62x (am62xx-evm instead of am62lxx-evm). Which processor are you using?

    These different processors have slightly different hardware, which affects some of the adjustments you could make to improve performance.

    Tell me more about your use case

    Ok, so we have 2 cores. You want core1 dedicated to performing a single heavy task (I assume this is a stand-in for CodeSys), and everything else running on core0. Is that correct?

    What other tasks are running on core0?

    Do I see any issues with your current test setup? 

    I have used cgroups in the past to divide tasks between processor cores. I am not sure if it would cause significantly different behavior from isolcpus. I have attached feedback from an AI agent below [1].

    How to tune the RT-Linux performance for your application? 

    You can find some getting started documentation for tuning RT Linux performance here:
    https://software-dl.ti.com/processor-sdk-linux-rt/esd/AM62LX/latest/exports/docs/linux/How_to_Guides/Target/How_to_Tune_Real_Time_Linux.html

    Bootlin also provides excellent trainings. You can find their RT Linux training here:
    https://bootlin.com/training/preempt-rt/

    The first step is to disable any applications or drivers that you are not using, regardless of which core that code is running on. Why?

    Core1 is still impacted by the code running on core0, even after you isolate the tasks and the interrupts:

    1) shared cache: all cores in the A53 cluster share a single L2 cache, so code running on core0 can evict core1's data. A cache miss means a longer response time

    2) DDR interface: this could be a bottleneck if both cores are doing heavy DDR traffic
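
    One way to see what still lands on core1 after isolation is to list every thread together with the CPU it last ran on. The following is a minimal sketch, assuming a procps-style ps that supports the psr column (BusyBox ps may not); tasks_on_cpu is a hypothetical helper name, not an SDK tool.

```shell
#!/bin/sh
# tasks_on_cpu CPU: read "pid psr comm" lines on stdin and print only the
# lines whose second column (the CPU the thread last ran on) equals CPU.
tasks_on_cpu() {
    awk -v cpu="$1" '$2 == cpu'
}

# On the target (assumes procps ps): list threads that last ran on CPU 1.
# ps -eLo pid=,psr=,comm= | tasks_on_cpu 1
```

    Anything in that list other than the pinned real-time task and per-CPU kernel threads is a candidate for moving to core0 or disabling.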

    Regards,

    Nick

    [1] AI feedback on isolcpus vs cgroup isolation:

    Q: What are the differences between isolcpus and cgroup isolation?

    A: [GENERAL KNOWLEDGE] Both mechanisms, key differences:

    isolcpus (what you have)

    Kernel scheduler mechanism. Works at boot, no userspace needed.

    What it does:
    - domain — removes CPU 1 from SMP load balancing domains. Scheduler won't migrate tasks onto it spontaneously
    - managed_irq — prevents managed IRQs (MSI/MSI-X) from being assigned to CPU 1
    - Tasks can still be explicitly pinned to isolated CPU via taskset/sched_setaffinity

    Limits:
    - Kernel threads (kthread_cpus=0 helps but not complete coverage)
    - Some kernel subsystems ignore it
    - No enforcement — any process can call sched_setaffinity and land on CPU 1
    - No memory, I/O, or network isolation — only scheduling

    cpuset cgroups (what you don't have — CONFIG_CPUSETS not set)

    Hierarchical resource container. Enforced by kernel at task level.

    What it does differently:
    - Hard partition: assign CPUs to cgroups, tasks in that cgroup cannot run elsewhere
    - cpuset.cpus.exclusive=1 (cgroups v2) — full exclusive ownership, not just isolation hint
    - Combined with cpuset.mems — also isolates NUMA memory nodes (less relevant on single-socket AM62L, but still affects page allocator
    behavior)
    - Inheritable: child processes stay in cgroup automatically
    - Composable with other cgroup controllers (memory limits, I/O weights)

    The key semantic difference:

    isolcpus: "scheduler, don't put things HERE unless explicitly asked"
    cpusets: "this CPU belongs to THIS cgroup, full stop"

    isolcpus is an opt-out hint to the scheduler. cpuset is an enforced partition.


    Without CONFIG_CPUSETS, what can still escape CPU 1?

    [GENERAL KNOWLEDGE] Even with your full cmdline:
    - Workqueue threads — kthread_cpus=0 helps, but WQ_UNBOUND workqueues may still use any CPU
    - Softirqs — bound to CPU where hardware interrupt fires; irqaffinity=0 helps but not complete
    - Timer softirqs — nohz_full=1 suppresses most, but not all kernel activity
    - Any process that explicitly calls sched_setaffinity(1) — no enforcement layer to stop this without cpusets

    Practical recommendation for real-time workload on CPU 1

    [GENERAL KNOWLEDGE]

    Option A — pure isolcpus (what you have): Sufficient if your RT task is the only thing you deliberately pin to CPU 1, and you trust
    userspace not to misbehave. Simpler.

    Option B — enable cpusets: Stronger guarantee. Boot CPU 1 isolated, then:
    # Create exclusive cpuset for RT task
    mkdir /sys/fs/cgroup/cpuset/rt
    echo 1 > /sys/fs/cgroup/cpuset/rt/cpuset.cpus
    echo 0 > /sys/fs/cgroup/cpuset/rt/cpuset.mems
    echo $RT_PID > /sys/fs/cgroup/cpuset/rt/tasks

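    # The root cpuset's cpuset.cpus is read-only, so remaining system tasks cannot be
    # confined there; create a second cpuset for CPU 0 ("system" is a hypothetical name)
    # and move the movable tasks into it
    mkdir /sys/fs/cgroup/cpuset/system
    echo 0 > /sys/fs/cgroup/cpuset/system/cpuset.cpus
    echo 0 > /sys/fs/cgroup/cpuset/system/cpuset.mems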

    Both together is strongest: isolcpus keeps scheduler away during boot before cgroups are configured; cpusets enforce the partition at
    runtime.

    Given CONFIG_CPUSETS is not set — if your RT workload is sensitive, worth enabling it. Single kernel recompile cost, significant
    runtime isolation guarantee gain.
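
    Before rebuilding, it may be worth confirming how the running kernel was actually configured. The following is a minimal sketch, assuming the kernel exposes its configuration at /proc/config.gz (this needs CONFIG_IKCONFIG_PROC=y); kconfig_state is a hypothetical helper name.

```shell
#!/bin/sh
# kconfig_state SYMBOL: read a kernel .config on stdin and print the value of
# SYMBOL (for example "y" or "m"), or "n" if the symbol is unset or absent.
kconfig_state() {
    awk -F= -v sym="$1" '
        $1 == sym { val = $2; found = 1 }
        END { print (found ? val : "n") }
    '
}

# On the target (assumes /proc/config.gz exists):
# zcat /proc/config.gz | kconfig_state CONFIG_CPUSETS
```

    The same helper can be pointed at the attached .config file on the build host instead of /proc/config.gz.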

  • What is the target performance? 

    Are you looking for a specific benchmark number?

    Or are you just trying to get the cyclictest number to go down, and then re-running Codesys with the new configuration to see if a faster cyclictest is associated with better Codesys performance?

  • Hi Nick,

    Thank you very much for your reply.

    Firstly, the actual hardware I am using is the AM62L3.

    Secondly, my core business application is Codesys. The real-time workload is the Codesys EtherCAT master. Using the Codesys host software, I configure all Codesys EtherCAT tasks to be scheduled on the real-time core, Core 1.

    As for my test plan: I will first verify the system's real-time jitter with standard test tools. My target is a maximum jitter below 70 microseconds in a 48-hour continuous test. After reaching this target, I will deploy the actual business scenarios, which include driving servo motors via the EtherCAT bus and monitoring the peak jitter values through the Codesys host software. This is why I use cyclictest and stress-ng to simulate high CPU occupancy and heavy memory pressure consistent with the actual business load.

    In parallel, I am running verification tests on the cpuset isolation approach you mentioned. I will share the test data with you as soon as I have new results.

    Regards,

    Xi

  • EDITED May 18, 2026 - Edits in RED

    Hello Xi,

    Expectations for this week

    My AM62L EVM is set up. I will run tests this week to isolate specific settings which could reduce worst-case latency.

    I expect lots of DDR traffic will lead to higher worst-case interrupt response time. I am not sure if <70usec cyclictest is reasonable for a 48 hour test, but let's see how low we can get it.
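
    For a sense of scale, the number of wakeups that one cyclictest thread measures follows directly from the run length and the interval. The sketch below is simple arithmetic, assuming a shell with 64-bit integer arithmetic; expected_samples is a hypothetical helper name.

```shell
#!/bin/sh
# expected_samples HOURS INTERVAL_US: number of wakeups one cyclictest thread
# performs in HOURS when it wakes every INTERVAL_US microseconds.
expected_samples() {
    echo $(( $1 * 3600 * 1000000 / $2 ))
}

expected_samples 6 200     # 6 h run at -i200            -> 108000000
expected_samples 48 200    # 48 h run at -i200           -> 864000000
expected_samples 48 1000   # 48 h run at 1000 us default -> 172800000
```

    A 48 hour run at the same interval takes 8 times as many samples as a 6 hour run, so rare worst-case events have more opportunities to show up in the Max value.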

    Test 260518_1: exactly replicate commands in SDK 11.2 docs (except using OP-TEE TRNG driver)

    UPDATE: The SDK benchmark results were captured with the Pseudo RNG driver enabled. I did not enable it for this test. Will run again with updated results. 

    My initial results do not match docs. Will look into
    * stress-ng -c 4 vs -c 2 for a dual core A53
    * WARN: stat /dev/cpu_dma_latency failed: No such file or directory

    FOLLOWUP NOTES:

    Yes, stress-ng -c 4 is the standard test TI runs for all the AM6 devices, regardless of whether the A53 cluster is quad-core or dual-core. This is unrelated to the test results.

    The warning about cpu_dma_latency is expected. The TI DMA drivers do not have a cpu_dma_latency option, which is why cyclictest is unable to find it on this processor.

    root@am62lxx-evm:~# uname -a
    Linux am62lxx-evm 6.12.57-ti-rt-g31b07ab8dfbc #1 SMP PREEMPT_RT Thu Dec  4 13:07:37 UTC 2025 aarch64 GNU/Linux
    
    root@am62lxx-evm:~# stress-ng --cpu-method=all -c 4 &
    [2] 1002
    [1]   Done                    stress-ng --cpu-method=all -c 2
    root@am62lxx-evm:~# stress-ng: info:  [1002] defaulting to a 1 day, 0 secs run per stressor
    stress-ng: info:  [1002] dispatching hogs: 4 cpu
    root@am62lxx-evm:~# cyclictest -m -Sp80 -D6h -h400 -i200 -M -q
    WARN: stat /dev/cpu_dma_latency failed: No such file or directory
    # Histogram
    000000 000000   000000
    000001 000000   000000
    000002 000000   000000
    000003 000000   000000
    000004 000000   000000
    000005 345248   762554
    000006 27360548 24662134
    000007 37407885 25009511
    000008 24363913 18800255
    000009 10887474 15688180
    000010 3723560  10831733
    000011 1546892  5943833
    000012 880035   2643272
    000013 538533   1211026
    000014 319457   674355
    000015 187475   449129
    000016 117677   316653
    000017 083693   227635
    000018 066541   163902
    000019 054741   119252
    000020 042386   090612
    000021 028144   076970
    000022 016581   069648
    000023 009738   059346
    000024 005631   046448
    000025 003812   034093
    000026 002662   024576
    000027 001900   018179
    000028 001396   014439
    000029 001044   011841
    000030 000767   009951
    000031 000603   008023
    000032 000472   006384
    000033 000324   005158
    000034 000257   004241
    000035 000189   003392
    000036 000097   002807
    000037 000086   002390
    000038 000047   001997
    000039 000034   001506
    000040 000021   001195
    000041 000012   000865
    000042 000009   000688
    000043 000004   000463
    000044 000010   000362
    000045 000001   000248
    000046 000002   000206
    000047 000004   000144
    000048 000002   000081
    000049 000001   000076
    000050 000001   000050
    000051 000004   000038
    000052 000003   000028
    000053 000005   000015
    000054 000002   000013
    000055 000001   000006
    000056 000002   000005
    000057 000004   000006
    000058 000002   000002
    000059 000001   000000
    000060 000000   000000
    000061 000001   000000
    000062 000000   000000
    000063 000003   000000
    000064 000004   000001
    000065 000003   000000
    000066 000001   000001
    000067 000003   000000
    000068 000003   000000
    000069 000002   000000
    000070 000001   000000
    000071 000002   000000
    000072 000000   000000
    000073 000004   000000
    000074 000002   000000
    000075 000003   000000
    000076 000000   000000
    000077 000002   000000
    000078 000001   000000
    000079 000002   000000
    000080 000001   000000
    000081 000004   000000
    000082 000002   000000
    000083 000002   000000
    000084 000001   000000
    000085 000003   000000
    000086 000003   000000
    000087 000001   000000
    000088 000002   000000
    000089 000000   000000
    000090 000000   000000
    000091 000001   000000
    000092 000000   000000
    000093 000000   000000
    000094 000001   000000
    000095 000000   000000
    000096 000001   000000
    000097 000000   000000
    000098 000001   000000
    000099 000000   000000
    000100 000000   000000
    000101 000000   000000
    000102 000001   000000
    000103 000001   000000
    000104 000000   000000
    000105 000001   000000
    000106 000000   000000
    000107 000000   000000
    000108 000001   000000
    000109 000000   000000
    000110 000001   000000
    000111 000000   000000
    000112 000000   000000
    000113 000000   000000
    000114 000000   000000
    000115 000000   000000
    000116 000000   000000
    000117 000000   000000
    000118 000000   000000
    000119 000001   000000
    000120 000000   000000
    000121 000000   000000
    000122 000000   000000
    000123 000000   000000
    000124 000000   000000
    000125 000000   000000
    000126 000001   000000
    …
    # Total: 108000000 107999918
    # Min Latencies: 00005 00005
    # Avg Latencies: 00007 00008
    # Max Latencies: 00126 00066
    # Histogram Overflows: 00000 00000
    # Histogram Overflow at cycle number:
    # Thread 0:
    # Thread 1:
    
    root@am62lxx-evm:~# stress-ng: info:  [1002] skipped: 0
    stress-ng: info:  [1002] passed: 4: cpu (4)
    stress-ng: info:  [1002] failed: 0
    stress-ng: info:  [1002] metrics untrustworthy: 0
    stress-ng: info:  [1002] successful run completed in 1 day, 0.77 secs
    
    [2]+  Done                    stress-ng --cpu-method=all -c 4
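
    To compare a histogram like this against a fixed target such as the 70 µs goal mentioned earlier in the thread, the rows above the limit can be summed per thread. The following is a minimal sketch, assuming the two-thread histogram has been saved as plain text; hist_over is a hypothetical helper name, and samples counted under "Histogram Overflows" are not included.

```shell
#!/bin/sh
# hist_over LIMIT_US: read a two-thread cyclictest histogram on stdin and
# print, for each thread, how many samples had a latency above LIMIT_US.
hist_over() {
    awk -v limit="$1" '
        # histogram rows are exactly three numeric fields: latency, T0, T1
        NF == 3 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
            if ($1 + 0 > limit) { t0 += $2; t1 += $3 }
        }
        END { printf "T0:%d T1:%d\n", t0, t1 }
    '
}

# Example use with a saved log (hist.txt is a hypothetical file name):
# hist_over 70 < hist.txt
```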

    Next test 

    exactly replicate your initial test above

    re-do Test 260518_1 with OP-TEE TRNG disabled

    Topics for later

    DDR & A53 Quality of Service (QoS) and Class of Service (CoS) might be helpful for us here. I am still reading up on it. Links for future readers:

    https://www.ti.com/lit/sprads6

     RE: AM6422: [LinuxRT] Poor lantency performance on isolated core  

    Regards,

    Nick

  • Hello Xi,

    Did you build OP-TEE with Pseudo RNG drivers for your tests?
    https://software-dl.ti.com/processor-sdk-linux-rt/esd/AM62LX/11_02_08_02/exports/docs/linux/Foundational_Components_OPTEE.html#building-optee-with-prng

    The performance guide document has a slight bug. It mentions disabling the OP-TEE true RNG (TRNG) driver and enabling the pseudo RNG (PRNG) driver as an option, which is good. But the performance guide does not mention that the TI tests were conducted with the default SDK image plus U-Boot files that were rebuilt with a modified OP-TEE using PRNG:
    https://software-dl.ti.com/processor-sdk-linux-rt/esd/AM62LX/11_02_08_02/exports/docs/linux/Foundational_Components/U-Boot/BG-Build-K3.html 

    I have filed a ticket to update the performance guide document in future SDK releases.

    I will re-run Test 260518_1 with OP-TEE TRNG disabled to see if that allows me to replicate the benchmark results.

    Regards,

    Nick

  • Please note that there is a bug in the SDK 11.2 version of the SDK docs for building OP-TEE which is fixed starting in SDK 12.0 docs. 

    This is the wrong argument in the SDK 11.2 version of docs:
    https://software-dl.ti.com/processor-sdk-linux-rt/esd/AM62LX/11_02_08_02/exports/docs/linux/Foundational_Components_OPTEE.html

    PLATFORM=k3-k3-am62lx

    This is the correct argument in the SDK 12.0 version of docs:
    https://software-dl.ti.com/processor-sdk-linux-rt/esd/AM62LX/latest/exports/docs/linux/Foundational_Components_OPTEE.html

    PLATFORM=k3-am62lx
  • I have filed a ticket to update the performance guide document in future SDK releases.

    Jira can only update future versions. Is there a way to correct the error online in the already released version, to avoid confusing later readers, since it has been reported and confirmed?

    All released versions remain online indefinitely, so leaving a known error online for readers worldwide does not make sense.

    PLATFORM=k3-k3-am62lx

    I also found this error when I built OP-TEE, and I reported it to someone. If it had been updated, you would not have seen it again.

  • Hi Nick,
    My OP-TEE firmware uses the SDK-provided prebuilt image: board-support/prebuilt-images/am62lxx-evm/bl32.bin. Since there were no development or modification requirements, I directly used this firmware without recompilation.
    However, upon reviewing my log records, it appears that the OP-TEE module was not properly loaded. While this does not affect the overall system boot or functionality, could you help confirm the potential implications of this issue? Below are relevant excerpts from my U-Boot logs and kernel logs related to OP-TEE:

    UBOOT LOG:

    NOTICE: bl1_plat_arch_setup arch setup
    NOTICE: Booting Trusted Firmware
    NOTICE: BL1: v2.12.0(release):11.02.01-14-g5939ceaeb-dirty
    NOTICE: BL1: Built : 08:05:07, Jan 28 2026
    NOTICE: BL1: dram_class: 10
    NOTICE: lpddr4: post start - PI training status=0x29c02000
    NOTICE: bl1_platform_setup DDR init done
    NOTICE: k3_bl1_handoff ENTERING WFI - end of bl1
    NOTICE: BL31: v2.12.0(release):11.02.01-14-g5939ceaeb-dirty
    NOTICE: BL31: Built : 08:05:11, Jan 28 2026
    NOTICE: SYSFW ABI: 4.0 (firmware rev 0x000b '11.2.5-v11.02.05a (Fancy Rat)')
    get_device_type a0a
    ERROR: Agent 0 Protocol 0x10 Message 0x7: not supported
    
    U-Boot SPL 2025.01-g398f44b6f7db (Apr 30 2026 - 03:21:19 +0000)
    SPL initial stack usage: 2048 bytes
    Trying to boot from SPINAND
    ERROR: Agent 0 Protocol 0x10 Message 0x7: not supported
    
    
    U-Boot 2025.01-g398f44b6f7db (Apr 30 2026 - 03:21:19 +0000)
    
    SoC: AM62LX SR1.1 HS-FS
    Model: Texas Instruments AM62L3 Evaluation Module
    DRAM: 512 MiB
    ERROR: Agent 0 Protocol 0x10 Message 0x7: not supported
    Core: 67 devices, 31 uclasses, devicetree: separate
    MMC: mmc@fa10000: 0
    Loading Environment from nowhere... OK
    In: serial@2800000
    Out: serial@2800000
    Err: serial@2800000
    Net: eth0: ethernet@8000000port@1
    Warning: ethernet@8000000port@2 (eth1) using random MAC address - 4e:11:69:6c:28:a0
    , eth1: ethernet@8000000port@2
    Hit any key to stop autoboot: 0
    Total of 1 byte(s) were the same
    Total of 1 byte(s) were the same
    Setting bus to 0
    ubi0: attaching mtd4
    ubi0: scanning is finished
    ubi0: attached mtd4 (name "ospi_nand.rootfs", size 123 MiB)
    ubi0: PEB size: 131072 bytes (128 KiB), LEB size: 126976 bytes
    ubi0: min./max. I/O unit sizes: 2048/2048, sub-page size 2048
    ubi0: VID header offset: 2048 (aligned 2048), data offset: 4096
    ubi0: good PEBs: 991, bad PEBs: 0, corrupted PEBs: 0
    ubi0: user volume: 1, internal volumes: 1, max. volumes count: 128
    ubi0: max/mean erase counter: 4/1, WL threshold: 4096, image sequence number: 80226512
    ubi0: available PEBs: 0, total reserved PEBs: 991, PEBs reserved for bad PEB handling: 20
    Loading file '/boot/Image' to addr 0x82000000...
    Done
    Loading file '/boot/dtb/ti/k3-am62l3-plc.dtb' to addr 0x88000000...
    Done
    ## Flattened Device Tree blob at 88000000
    Booting using the fdt blob at 0x88000000
    Working FDT set to 88000000
    Loading Device Tree to 000000008fff2000, end 000000008ffff7b9 ... OK
    Working FDT set to 8fff2000
    
    Starting kernel ...

    And here is my kernel log about op-tee:

    [ 1.338980] optee: probing for conduit method.
    [ 1.338998] optee: api uid mismatch
    [ 1.339004] optee firmware:optee: probe with driver optee failed with error -22

    Regards,

    Xi

  • Test 260519_1: exactly replicate commands in SDK 11.2 docs, INCLUDING disabling OP-TEE TRNG driver 

    I am now able to replicate the observations in the Linux SDK 11.2 docs.

    Step 1: Get the OP-TEE source code. I will link to the SDK 12.0 version of the docs since it fixes some bugs in the SDK 11.2 docs.

    Step 2: Build the OP-TEE code with Pseudo RNG instead of True RNG:

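    #!/bin/bash
    # stop at the first failing command, so the success message is only printed if every step worked
    set -e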
    
    # set variables
    export OPTEE_PLATFORM="k3-am62lx"
    export SDK_INSTALL_DIR="<path_to>/ti-processor-sdk-linux-rt-am62lxx-evm-11.02.08.02"
    export CROSS_COMPILE_64="${SDK_INSTALL_DIR}/linux-devkit/sysroots/x86_64-arago-linux/usr/bin/aarch64-oe-linux/aarch64-oe-linux-"
    export SYSROOT_64="${SDK_INSTALL_DIR}/linux-devkit/sysroots/aarch64-oe-linux"
    export CC_64="${CROSS_COMPILE_64}gcc --sysroot=${SYSROOT_64}"
    export CROSS_COMPILE_32="${SDK_INSTALL_DIR}/k3r5-devkit/sysroots/x86_64-arago-linux/usr/bin/arm-oe-eabi/arm-oe-eabi-"
    export CFLAGS64="--sysroot=${SYSROOT_64}"
    export KCFLAGS="--sysroot=${SYSROOT_64}"
    export LDFLAGS="--sysroot=${SYSROOT_64}"
    
    # clean sources
    make CROSS_COMPILE="$CROSS_COMPILE_64" clean
    
    # make OPTEE
    make CROSS_COMPILE64="$CROSS_COMPILE_64" PLATFORM="$OPTEE_PLATFORM" CFG_ARM64_core=y CFG_WITH_SOFTWARE_PRNG=y CFG_USER_TA_TARGETS=ta_arm64
    
    echo "Make succeeded!"
    

    Step 3: Rebuild U-Boot with the updated OP-TEE code:

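    #!/bin/bash
    # stop at the first failing command, so the success message is only printed if every step worked
    set -e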
    
    # set variables
    export PLATFORM_DEFCONFIG="am62lx_evm_defconfig"
    export SDK_INSTALL_DIR="<path_to>/ti-processor-sdk-linux-rt-am62lxx-evm-11.02.08.02"
    export CROSS_COMPILE_64="${SDK_INSTALL_DIR}/linux-devkit/sysroots/x86_64-arago-linux/usr/bin/aarch64-oe-linux/aarch64-oe-linux-"
    export SYSROOT_64="${SDK_INSTALL_DIR}/linux-devkit/sysroots/aarch64-oe-linux"
    export CC_64="${CROSS_COMPILE_64}gcc --sysroot=${SYSROOT_64}"
    export CROSS_COMPILE_32="${SDK_INSTALL_DIR}/k3r5-devkit/sysroots/x86_64-arago-linux/usr/bin/arm-oe-eabi/arm-oe-eabi-"
    export CFLAGS64="--sysroot=${SYSROOT_64}"
    export KCFLAGS="--sysroot=${SYSROOT_64}"
    export LDFLAGS="--sysroot=${SYSROOT_64}"
    
    export UBOOT_DIR="${SDK_INSTALL_DIR}/board-support/ti-u-boot-2025.01+git"
    export TI_LINUX_FW_DIR="${SDK_INSTALL_DIR}/board-support/prebuilt-images/am62lxx-evm"
    # I used prebuilt TFA
    export TFA_DIR="${TI_LINUX_FW_DIR}"
    # I used rebuilt OPTEE
    # NOTE: Make sure to check out the appropriate OPTEE tag, check Release Notes
    export OPTEE="/home/a0226750local/git/optee_os/out/arm-plat-k3/core/tee-pager_v2.bin"
    # Using prebuilt OPTEE
    #export OPTEE="${TI_LINUX_FW_DIR}/bl32.bin"
    
    # clean sources
    make CROSS_COMPILE="$CROSS_COMPILE_64" clean
    
    #configure u-boot
    make CROSS_COMPILE="$CROSS_COMPILE_64" "$PLATFORM_DEFCONFIG"
    
    # build u-boot
    make CROSS_COMPILE="$CROSS_COMPILE_64" BL1=$TFA_DIR/bl1.bin BL31=$TFA_DIR/bl31.bin BINMAN_INDIRS=$TI_LINUX_FW_DIR TEE=$OPTEE
    
    echo "Make succeeded!"
    

    Test results:

    root@am62lxx-evm:~# uname -a
    Linux am62lxx-evm 6.12.57-ti-rt-g31b07ab8dfbc #1 SMP PREEMPT_RT Thu Dec  4 13:07:37 UTC 2025 aarch64 GNU/Linux
    root@am62lxx-evm:~# stress-ng --cpu-method=all -c 4 &
    [1] 936
    root@am62lxx-evm:~# stress-ng: info:  [936] defaulting to a 1 day, 0 secs run per stressor
    stress-ng: info:  [936] dispatching hogs: 4 cpu
    
    root@am62lxx-evm:~# cyclictest -m -Sp80 -D6h -h400 -i200 -M -q
    WARN: stat /dev/cpu_dma_latency failed: No such file or directory
    # Histogram
    000000 000000   000000
    000001 000000   000000
    000002 000000   000000
    000003 000000   000000
    000004 000000   000000
    000005 316691   705021
    000006 20586842 22417152
    000007 29812256 33595187
    000008 23874842 24719871
    000009 16066047 15036644
    000010 8610582  6758354
    000011 3841554  2421157
    000012 1763729  936561
    000013 1002112  478290
    000014 640858   266835
    000015 406657   151581
    000016 249747   094050
    000017 157131   069519
    000018 111235   059859
    000019 091638   056972
    000020 084283   053234
    000021 078843   046950
    000022 070029   037546
    000023 057772   027135
    000024 042498   018041
    000025 030672   011984
    000026 023045   008631
    000027 017919   006895
    000028 014503   005833
    000029 012056   004995
    000030 009495   003834
    000031 007223   002603
    000032 005447   001679
    000033 003909   001064
    000034 002929   000702
    000035 002046   000463
    000036 001535   000347
    000037 001102   000274
    000038 000761   000189
    000039 000577   000142
    000040 000375   000100
    000041 000272   000080
    000042 000190   000027
    000043 000148   000025
    000044 000131   000019
    000045 000087   000009
    000046 000064   000005
    000047 000032   000004
    000048 000028   000002
    000049 000025   000001
    000050 000025   000003
    000051 000014   000000
    000052 000011   000000
    000053 000011   000001
    000054 000005   000001
    000055 000003   000001
    000056 000003   000000
    000057 000002   000000
    000058 000003   000000
    000059 000000   000000
    000060 000003   000000
    000061 000002   000000
    000062 000001   000000
    000063 000000   000000
    ...
    000399 000000   000000
    # Total: 108000000 107999872
    # Min Latencies: 00005 00005
    # Avg Latencies: 00008 00007
    # Max Latencies: 00062 00055
    # Histogram Overflows: 00000 00000
    # Histogram Overflow at cycle number:
    # Thread 0:
    # Thread 1:
    

  • How does changing OP-TEE settings impact Linux interrupt response time?

    My understanding is that the context switch between Linux and OP-TEE can add latency to the kernel's interrupt response time. Since Linux also has its own pseudo RNG, disabling the OP-TEE hardware RNG driver means that Linux does NOT switch to OP-TEE when random numbers are needed. With fewer context switches to OP-TEE, latency is reduced.

    This is NOT the same as disabling or removing OP-TEE. My current understanding is that if your application required regular switching to OP-TEE, then we should expect that the latency should get worse again.
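
    One way to check whether Linux is currently drawing hardware random numbers through OP-TEE is to look at the active hwrng device in sysfs. The following is a minimal sketch; rng_is_optee is a hypothetical helper name, and matching on the substring "optee" is an assumption about how the OP-TEE backed driver is named on this SDK.

```shell
#!/bin/sh
# rng_is_optee NAME: succeed if the hwrng device name looks OP-TEE backed.
# Assumption: the OP-TEE backed driver has "optee" somewhere in its name.
rng_is_optee() {
    case "$1" in
        *optee*) return 0 ;;
        *)       return 1 ;;
    esac
}

# On the target: report which hardware RNG the kernel is currently using.
# cur=$(cat /sys/class/misc/hw_random/rng_current)
# if rng_is_optee "$cur"; then
#     echo "hwrng is served by OP-TEE: $cur"
# else
#     echo "hwrng is not OP-TEE backed: $cur"
# fi
```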

    Regards,

    Nick

  • Please create a separate thread for the OP-TEE loading question 

    that way we can make sure that separate question gets addressed properly

  • Edited May 20 2026

    Test 260519_2: exactly replicate customer setup from May 12, including using the out-of-box (OoB) OP-TEE

    root@am62lxx-evm:~# cat /proc/cmdline
    console=ttyS0,115200n8 vt.global_cursor_default=0 rcu_nocb_poll rcu_nohz=1 rcu_nocbs=1 idle=poll nohz=on nohz_full=1 kthread_cpus=0 irqaffinity=0 isolcpus=managed_irq,domain,1 earlycon=ns16550a,mmio32,0x02800000 root=PARTUUID=076c4a2a-02 rw rootfstype=ext4 rootwait
    root@am62lxx-evm:~# uname -a
    Linux am62lxx-evm 6.12.57-ti-rt-g31b07ab8dfbc #1 SMP PREEMPT_RT Thu Dec  4 13:07:37 UTC 2025 aarch64 GNU/Linux
    root@am62lxx-evm:~# taskset -c 1 stress-ng --cpu 1 --cpu-load 70 --vm 1 --vm-bytes 80% &
    [1] 986
    root@am62lxx-evm:~# stress-ng: info:  [986] defaulting to a 1 day, 0 secs run per stressor
    stress-ng: info:  [986] dispatching hogs: 1 cpu, 1 vm
    stress-ng: info:  [988] cpu: for stable load results, select a specific cpu stress method with --cpu-method other than 'all'
    
    root@am62lxx-evm:~# cyclictest -a 0-1 -t 2 -p 99 -m -D 0 &
    
    # output from earlier in the day
    policy: fifo: loadavg: 0.57 1.56 1.84 1/158 1920
    stress-ng: info:  [986] failed: 0
    T: 0 (  992) P:99 I:1000 C:86443004 Min:      5 Act:    9 Avg:   15 Max:     489
    T: 1 (  993) P:99 I:1500 C:57628654 Min:      6 Act:    8 Avg:   28 Max:     295
    
    # output as of 3:30pm central, not sure if stress-ng has finished running or not
    policy: fifo: loadavg: 0.00 0.01 0.01 1/157 1964
    T: 0 (  992) P:99 I:1000 C:90242144 Min:      5 Act:    5 Avg:   15 Max:     489
    T: 1 (  993) P:99 I:1500 C:60161415 Min:      6 Act:    7 Avg:   27 Max:     295
    

    This is the test after running for almost 24 hours ^
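
    When several snapshots like the ones above are collected over a long run, the per-thread worst case can be pulled out automatically. The following is a minimal sketch, assuming the status lines have been saved as plain text (for example, pasted from the terminal); max_per_thread is a hypothetical helper name.

```shell
#!/bin/sh
# max_per_thread: read cyclictest status lines ("T: 0 ( 992) ... Max: 489")
# on stdin and print the highest Max value seen for each thread.
max_per_thread() {
    awk '
        /^[ \t]*T:/ && /Max:/ {
            sub(/^[ \t]*T:[ \t]*/, "")   # line now starts with the thread id
            id = $1 + 0
            v = $NF + 0                  # the Max value is the last field
            if (!(id in max) || v > max[id]) max[id] = v
        }
        END { for (id in max) printf "T%d max=%d\n", id, max[id] }
    ' | sort
}

# Example use with a saved log (cyclictest.log is a hypothetical file name):
# max_per_thread < cyclictest.log
```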

    Plan for tomorrow:

    1) Re-run Test 260519_2 with OP-TEE TRNG disabled

    2) Re-run Test 260519_2 with OoB OP-TEE, but different cyclictest arguments:

    * cyclictest -m -S -p99 -D6h -h600 -i200 -M -q will not work: the cmdline entry isolcpus=domain,1 breaks -S. Need to explicitly set affinity.

    * cyclictest -m -a 0-1 -t 2 -p99 -D6h -h600 -i200 -M -q
    * change 1ms --> 200 usec test interval (-i200)
    * lock memory so that pages cannot be swapped out or faulted in during the test (-m); -M only delays the screen refresh until a new max latency is hit
    * leave cyclictest priority at 99 (not sure how this relates to CodeSys behavior)

    see an AI comparison of the differences between the initial two tests at [1]

    Regards,

    Nick

    [1] compare the behavior of "cyclictest -a 0-1 -t 2 -p 99 -m -D 0" against "cyclictest -m -Sp80 -D6h -h400 -i200 -M -q"

      ┌───────────────┬────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────┐
      │   Parameter   │                cyclictest -a 0-1 -t 2 -p 99 -m -D 0                │ cyclictest -m -Sp80 -D6h -h400 -i200 -M  │
      │               │                                                                    │                    -q                    │
      ├───────────────┼────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────┤
      │ Thread        │ -t 2 explicit 2 threads                                            │ -S auto: 1 thread per CPU                │
      │ creation      │                                                                    │                                          │
      ├───────────────┼────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────┤
      │ CPU affinity  │ -a 0-1 pin to CPU0+1                                               │ -S handles affinity automatically        │
      ├───────────────┼────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────┤
      │ RT priority   │ -p 99 (max SCHED_FIFO)                                             │ -p 80                                    │
      ├───────────────┼────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────┤
      │ Duration      │ -D 0 [UNCERTAIN: 0 = indefinite or exit immediately — test before  │ -D 6h explicit 6-hour run                │
      │               │ relying on it]                                                     │                                          │
      ├───────────────┼────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────┤
      │ Sample        │ default = 1000 µs                                                  │ -i 200 = 200 µs (5× more samples)        │
      │ interval      │                                                                    │                                          │
      ├───────────────┼────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────┤
      │ Histogram     │ none                                                               │ -h 400 (buckets up to 400 µs)            │
      ├───────────────┼────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────┤
      │ Stack         │ standard stack                                                     │ -M mmap stack (avoids stack growth       │
      │               │                                                                    │ faults)                                  │
      ├───────────────┼────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────┤
      │ Output        │ live per-thread summary line, refreshed in place                   │ -q quiet: prints only final              │
      │               │                                                                    │ summary/histogram                        │
      └───────────────┴────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────┘
    

  • Hi, Nick,

    In our project, the OP-TEE module is not loaded, and our actual business scenarios do not utilize any security modules. I would like to know whether pursuing this direction and conducting related verification is necessary?

    Additionally, in the provided test commands, I originally applied only 70% CPU load and 80% memory load to CPU1 while leaving CPU0 idle. However, in real-world scenarios, CPU0 runs business-critical functions such as CANopen and Ethernet-related CODESYS operations. Only the CODESYS EtherCAT master station operates on CPU1. Therefore, I have also added a similar load to CPU0 using the following command:
    taskset -c 0 stress-ng --cpu 1 --cpu-load 70 --vm 1 --vm-bytes 40% &
    (The CPU/memory load was reduced because assigning too high a load would trigger OOM [Out-Of-Memory] on our hardware platform.) Under this configuration, the significant jitter issue on CPU1 can be reproduced more quickly.
    Could you please provide some additional kernel-level optimization approaches to help mitigate the maximum jitter on CPU1?

    Regards,

    Xi

  • I am now able to replicate the observations in the Linux SDK 11.2 docs.

    When I tested with a self-built OP-TEE, the jitter varied between tests and between power cycles: sometimes I got results as good as yours and the SDK user guide, and sometimes larger numbers. Can you run the test more times?

  • Tony and I are discussing offline.

    Hello Xi,

    Let's set expectations: this will take time. There is no "magic" kernel optimization that will allow us to skip important tests.

    You have set a difficult goal. Right now, I do not have a pre-packaged solution which magically reaches your goal.

    Your goal might be possible, or it might not. There are a LOT of ways to optimize performance on a complex part like this. It will take time to run tests and see which tests improve performance. You can speed up the process by running tests on your side and sharing your observations.

    What is my approach? 

    I talk with our TI experts about things that they expect to impact RT Linux performance, and then I run tests to verify (or challenge) their expectations. I am starting with things that the TI experts have already tested, or that the experts are very confident about (like OP-TEE TRNG).

    I am also looking at specific hardware settings to make sure that the hardware configuration is not slowing down code execution.

    Once I have investigated the suggestions of the TI experts and we are confident that the hardware is running as efficiently as possible, then I could look at other software settings. If you want to start testing your own software configuration, feel free to do the Bootlin RT training and get started: https://bootlin.com/training/preempt-rt/

    Questions for you 

    "isolcpus=managed_irq" - is there a specific reason you are using managed_irq for this test?

    Have you benchmarked resource usage on your CodeSys platform, and found that taskset -c 0 stress-ng --cpu 1 --cpu-load 70 --vm 1 --vm-bytes 40% accurately mirrors your resource usage? Or is this just a guess? These tests will be most useful if we are actually approximating your system.

    If there is a different setup that you would like me to use, please give me the updated cmdline arguments, taskset for both cores, etc.

    Test 260520_1: replicate customer setup from May 12, but disable OP-TEE TRNG 

    Disabling OP-TEE TRNG appears to improve performance (Max 489 → 372 µs on thread 0 and 295 → 252 µs on thread 1). It is hard to tell for sure, since the cyclictest command used did not generate a histogram output.

    Out-of-the-box OP-TEE:
    T: 0 ( 992) P:99 I:1000 C:86443004 Min: 5 Act: 9 Avg: 15 Max: 489
    T: 1 ( 993) P:99 I:1500 C:57628654 Min: 6 Act: 8 Avg: 28 Max: 295

    OP-TEE TRNG disabled:
    T: 0 ( 945) P:99 I:1000 C:75969951 Min: 4 Act: 42 Avg: 13 Max: 372
    T: 1 ( 946) P:99 I:1500 C:50646620 Min: 7 Act: 23 Avg: 25 Max: 252

    screenshot at 2:30pm: ~21 hrs test run
    
    root@am62lxx-evm:~# uname -a
    Linux am62lxx-evm 6.12.57-ti-rt-g31b07ab8dfbc #1 SMP PREEMPT_RT Thu Dec  4 13:07:37 UTC 2025 aarch64 GNU/Linux
    root@am62lxx-evm:~# cat /proc/cmdline
    console=ttyS0,115200n8 vt.global_cursor_default=0 rcu_nocb_poll rcu_nohz=1 rcu_nocbs=1 idle=poll nohz=on nohz_full=1 kthread_cpus=0 irqaffinity=0 isolcpus=managed_irq,domain,1 earlycon=ns16550a,mmio32,0x02800000 root=PARTUUID=076c4a2a-02 rw rootfstype=ext4 rootwait
    root@am62lxx-evm:~# taskset -c 1 stress-ng --cpu 1 --cpu-load 70 --vm 1 --vm-bytes 80% &
    [1] 939
    root@am62lxx-evm:~# stress-ng: info:  [939] defaulting to a 1 day, 0 secs run per stressor
    stress-ng: info:  [939] dispatching hogs: 1 cpu, 1 vm
    stress-ng: info:  [941] cpu: for stable load results, select a specific cpu stress method with --cpu-method other than 'all'
    
    root@am62lxx-evm:~# # started test at 5:20pm central
    root@am62lxx-evm:~# cyclictest -a 0-1 -t 2 -p 99 -m -D 0 &
    [2] 944
    root@am62lxx-evm:~# WARN: stat /dev/cpu_dma_latency failed: No such file or directory
    policy: fifo: loadavg: 2.04 2.05 2.00 3/161 1794
    
    T: 0 (  945) P:99 I:1000 C:75969951 Min:      4 Act:   42 Avg:   13 Max:     372
    T: 1 (  946) P:99 I:1500 C:50646620 Min:      7 Act:   23 Avg:   25 Max:     252
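
    As a sanity check on the run length: each cyclictest thread wakes once per interval (the I: value, in µs), so cycles × interval approximates the elapsed time. A minimal sketch using the counts above; the function name is only illustrative:

    ```bash
    # Convert a cyclictest cycle count and interval (µs) to whole elapsed seconds.
    elapsed_s() {
        local cycles=$1 interval_us=$2
        echo $(( cycles * interval_us / 1000000 ))
    }

    t0=$(elapsed_s 75969951 1000)   # thread 0: C:75969951 at I:1000
    t1=$(elapsed_s 50646620 1500)   # thread 1: C:50646620 at I:1500
    echo "thread 0: ${t0} s (~$(( t0 / 3600 )) h), thread 1: ${t1} s (~$(( t1 / 3600 )) h)"
    ```

    Both threads come out at roughly 21 hours, which agrees with the screenshot note.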

    Test 260520_2: replicate customer setup from May 12 with OoB OP-TEE, but use different cyclictest arguments

    cyclictest -m -a 0-1 -t 2 -p99 -D6h -h600 -i200 -M -q
    * change 1ms --> 200 usec test interval (-i200)
    * avoid page faults from stack & memory swapped to disk (-m -M)
    * leave cyclictest priority at 99 (not sure how this relates to CodeSys behavior)

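    This leads to MUCH lower maximum latencies: 127 µs (thread 0) and 113 µs (thread 1) over this 6-hour run, versus 489 µs and 295 µs in the out-of-the-box OP-TEE run of test 260520_1. Additional testing is needed to determine which change or changes contributed.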

    root@am62lxx-evm:~# uname -a
    Linux am62lxx-evm 6.12.57-ti-rt-g31b07ab8dfbc #1 SMP PREEMPT_RT Thu Dec  4 13:07:37 UTC 2025 aarch64 GNU/Linux
    root@am62lxx-evm:~# cat /proc/cmdline
    console=ttyS0,115200n8 vt.global_cursor_default=0 rcu_nocb_poll rcu_nohz=1 rcu_nocbs=1 idle=poll nohz=on nohz_full=1 kthread_cpus=0 irqaffinity=0 isolcpus=managed_irq,domain,1 earlycon=ns16550a,mmio32,0x02800000 root=PARTUUID=076c4a2a-02 rw rootfstype=ext4 rootwait
    root@am62lxx-evm:~# taskset -c 1 stress-ng --cpu 1 --cpu-load 70 --vm 1 --vm-bytes 80% &
    [1] 985
    root@am62lxx-evm:~# stress-ng: info:  [985] defaulting to a 1 day, 0 secs run per stressor
    stress-ng: info:  [985] dispatching hogs: 1 cpu, 1 vm
    stress-ng: info:  [987] cpu: for stable load results, select a specific cpu stress method with --cpu-method other than 'all'
    
    root@am62lxx-evm:~# cyclictest -m -a 0-1 -t 2 -p99 -D6h -h600 -i200 -M -q
    WARN: stat /dev/cpu_dma_latency failed: No such file or directory
    # Histogram
    000000 000000   000000
    000001 000000   000000
    000002 000000   000000
    000003 000000   000000
    000004 000084   000000
    000005 63517791 000000
    000006 23151139 000149
    000007 9505702  000107
    000008 4289494  000096
    000009 2509577  1496370
    000010 1595242  24763741
    000011 929138   32345454
    000012 531403   23753747
    000013 373906   13131295
    000014 293321   5402244
    000015 237471   2276379
    000016 192376   1205394
    000017 081270   761352
    000018 000679   543054
    000019 000040   418520
    000020 000027   338381
    000021 013613   283021
    000022 058118   238270
    000023 085291   199003
    000024 097215   163140
    000025 101050   135156
    000026 097044   112877
    000027 084739   094526
    000028 067873   077422
    000029 050777   062743
    000030 035735   048960
    000031 024383   037456
    000032 016773   028125
    000033 011804   020976
    000034 008708   015963
    000035 006392   011913
    000036 005001   009026
    000037 004123   006706
    000038 003454   004873
    000039 002870   003411
    000040 002557   002530
    000041 002241   001877
    000042 001884   001409
    000043 001550   001037
    000044 001373   000724
    000045 001123   000466
    000046 000908   000308
    000047 000733   000219
    000048 000568   000180
    000049 000422   000152
    000050 000373   000105
    000051 000277   000084
    000052 000237   000107
    000053 000206   000072
    000054 000158   000088
    000055 000163   000060
    000056 000142   000053
    000057 000133   000054
    000058 000119   000052
    000059 000127   000051
    000060 000105   000044
    000061 000096   000039
    000062 000083   000031
    000063 000070   000031
    000064 000055   000024
    000065 000073   000023
    000066 000063   000035
    000067 000045   000023
    000068 000048   000021
    000069 000035   000016
    000070 000027   000014
    000071 000028   000013
    000072 000032   000011
    000073 000032   000007
    000074 000022   000007
    000075 000022   000008
    000076 000025   000006
    000077 000017   000011
    000078 000015   000006
    000079 000012   000004
    000080 000013   000009
    000081 000011   000006
    000082 000008   000005
    000083 000009   000008
    000084 000008   000004
    000085 000013   000008
    000086 000000   000001
    000087 000007   000003
    000088 000005   000002
    000089 000003   000002
    000090 000004   000003
    000091 000003   000005
    000092 000004   000004
    000093 000003   000002
    000094 000002   000001
    000095 000003   000003
    000096 000005   000002
    000097 000003   000001
    000098 000006   000002
    000099 000003   000002
    000100 000005   000000
    000101 000007   000000
    000102 000005   000002
    000103 000007   000001
    000104 000001   000002
    000105 000003   000002
    000106 000003   000000
    000107 000003   000000
    000108 000004   000001
    000109 000002   000001
    000110 000001   000002
    000111 000004   000000
    000112 000004   000000
    000113 000001   000001
    000114 000000   000000
    000115 000002   000000
    000116 000001   000000
    000117 000001   000000
    000118 000000   000000
    000119 000001   000000
    000120 000000   000000
    000121 000001   000000
    000122 000001   000000
    000123 000002   000000
    000124 000001   000000
    000125 000002   000000
    000126 000000   000000
    000127 000003   000000
    000128 000000   000000
    ...
    000599 000000   000000
    # Total: 108000000 107999937
    # Min Latencies: 00004 00006
    # Avg Latencies: 00006 00011
    # Max Latencies: 00127 00113
    # Histogram Overflows: 00000 00000
    # Histogram Overflow at cycle number:
    # Thread 0:
    # Thread 1:
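
    To compare runs by their tails rather than by a single Max value, the histogram can be reduced to "how many samples landed at or above N µs" per thread. The sketch below is one possible way to do that with awk; `hist_tail` is a hypothetical helper, and the sample input is made up for illustration (it is not data from the run above):

    ```bash
    # Count samples at or above a latency threshold (µs) in `cyclictest -h` output.
    # Reads the histogram on stdin; summary lines starting with '#' are skipped.
    hist_tail() {
        local threshold_us=$1
        awk -v thr="$threshold_us" '
            $1 ~ /^[0-9]+$/ && NF >= 2 {
                for (i = 2; i <= NF; i++) {
                    total[i] += $i
                    if ($1 + 0 >= thr) tail[i] += $i
                }
                if (NF > cols) cols = NF
            }
            END {
                for (i = 2; i <= cols; i++)
                    printf "thread %d: %d of %d samples >= %d us\n", i - 2, tail[i], total[i], thr
            }'
    }

    # Tiny made-up histogram: bucket (µs), thread 0 count, thread 1 count.
    printf '%s\n' \
        '# Histogram' \
        '000005 000090 000000' \
        '000010 000008 000095' \
        '000100 000002 000005' \
        '# Total: 000000100 000000100' | hist_tail 100
    ```

    On a real run, save the cyclictest output to a file and pipe it through the same function. The totals are also a useful cross-check: 6 hours at a 200 µs interval is 6 × 3600 × 1,000,000 / 200 = 108,000,000 samples, which matches the "# Total" line for thread 0 above.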
    

    Next steps: 

    I am working on modifying the DDR QoS / CoS settings so that we can ensure DDR accesses are happening as efficiently as possible, and generating a minimal filesystem to check the impact of removing unneeded software (I have been told that removing unneeded software is a BIG part of the software optimizations you asked about). But it will be another day or so before those tests are ready.

    In the meantime, tonight I will
    1) replicate test 260520_2, but with OP-TEE TRNG disabled
    2) replicate test 260520_2, but with 1ms test interval (cyclictest -m -a 0-1 -t 2 -p99 -D6h -h600 -i1000 -M -q)

    Regards,

    Nick

  • Hi Nick,

    Thanks a lot for helping analyze this performance issue.

    First, regarding the isolcpus=managed_irq parameter, there are no extra dedicated business configurations at the application layer.

    Second, to evaluate the resource usage of Codesys, we mainly use the stress simulation on Core 1 that we discussed before. I have observed that the CPU load on Core 0 fluctuates dynamically from 0% to 50% in actual service scenarios. Such load changes normally do not cause jitter on Core 1. My only concern is that unexpected sharp load surges on Core 0 may impact the running performance of Core 1.

    In short, I believe there is no need to change the current test environment configuration.

    Regards,

    Xi

  • Hello Xi,

    Thanks for the confirmation. I will continue using your initial stressors from May 12 in my tests for now. Let me know if you find a better way to model the system in the future.

    Brief update: tests from today 

    My goal for the previous 24 hours was to dig further into how the cyclictest parameters impact the test results. I collected histograms from 3 more tests, and I started graphing them to make it easier to compare behavior. I will share that information when it is ready - maybe tomorrow, but my higher priority goal is to run tests with different DDR options. So it might be a few days before you see those graphs.

    Those tests were:
    260521_1 replicate 260520_2 with -i1000 (cyclictest -m -a 0-1 -t 2 -p99 -D6h -h600 -i1000 -M -q)
    260521_2 replicate 260520_2 (OP-TEE TRNG disabled)
    260521_3 replicate 260520_2 with -i1000 (OP-TEE TRNG disabled)
    260522_1 replicate 260520_2 without -M (OoB OP-TEE)

    Next steps

    I added cyclictest and stress-ng and their dependencies to a base filesystem image. I will report initial tests tomorrow. (this is a barebones image. SDK documentation is here: https://software-dl.ti.com/processor-sdk-linux-rt/esd/AM62LX/11_02_08_02/exports/docs/linux/Foundational_Components_Filesystem.html ).

    My hypothesis is that removing most of the code from the filesystem will improve latency. We will see how much of a difference it makes.

    I am still figuring out the DDR settings. I hope to start running DDR tests by Friday night, but it might take until next week. Please note that Monday is a holiday in the U.S.

    Regards,

    Nick

  • quick update since I am on vacation today:

    adjusting the DDR settings made a huge improvement in latency numbers. By tomorrow I’ll have histograms from 20+ different test runs.

    I’ll post an analysis of the results before Wednesday your time. Looking at:
    * DDR setting impact
    * OP-TEE setting impact
    * filesystem impact (i.e., impact of removing code)
    * core isolation impact
    * impact of different cyclictest arguments

  • Hello Xi,

    Status update 

    26 tests have been run, all 6 hours or more. 24 histograms generated. I will continue running a few 6-hour tests every day for the next few days.

    Summary of findings: 

    1) DDR QoS has the biggest impact on cyclictest results for your test setup. I will provide settings for you to run experiments.

    2) A smaller filesystem does NOT result in a clear reduction of latency.

    3) The cyclictest arguments have a big impact on the reported results (specifically, -i1000 vs -i200). Since there is no 1-to-1 mapping between cyclictest results and CodeSys performance, I would prioritize actual CodeSys performance over specific cyclictest output numbers.

    More tests needed to comment on OP-TEE impact and core isolation impact.

    How to apply the DDR QoS configuration to test with cyclictest & CodeSys? 

    Type this into your terminal before running tests:

    # first, set CBASS QoS priority
    # A53 READ and WRITE ports default to the same EPRIORITY = 7
    # 7 is the lowest priority
    # set A53 READ port EPRIORITY from 7 (default) to 6 (higher priority)
    root@am62lxx-evm:~# devmem2 0x45D20500 w 0x00006000
    
    # verify write
    root@am62lxx-evm:~# devmem2 0x45D20500
    
    # next, set DDR priority
    # DDR priority defaults to all entries = AXI priority 0 (highest priority)
    # DDRSS DEF_PRI_MAP — map VBUSM priority 7 → DDR AXI priority 1
    # all other PRIMAP entries stay 0 = DDR AXI priority 0 (higher priority)
    # thus A53 READ has higher priority than WRITE
    root@am62lxx-evm:~# devmem2 0x0F300030 w 0x00000001
    
    # verify write
    root@am62lxx-evm:~# devmem2 0x0F300030

    Show me the raw data please

    Sure.

    test_table.csv has the list of all the tests that have been run that resulted in a histogram, and all the configurations that were used for each test.

    260525_test_table.csv

    cat1/2/3 files compare the histograms for tests that can be used to learn more about the impact of cyclictest parameters, DDR configuration, and the filesystem.

    cat1_cyclictest_parameters.html

    cat2_ddr_configuration.html

    cat3_filesystem.html

    latency_report has all histograms.

    latency_report.html

    Regards,

    Nick

  • Hello Xi,

    Just wanted to let you know that I am still running tests. I will put together reports on whether I observe any improvements with OP-TEE and with core isolation at the end of the day on Friday.

    Regards,

    Nick

  • Hello Xi,

    I got some interesting findings for how to get the best cyclictest results for your test case:

    1) use DDR QoS settings - already discussed

    2) Do NOT isolate cores - performance was significantly worse in all tests

    3) Do NOT use OP-TEE TRNG (I assume this is not a factor if you are not actually loading OP-TEE)

    4) The data around whether the filesystem matters is inconclusive. If it matters, the effect is much smaller than the previous 3 points.

    I was surprised that core isolation caused such a drop in performance. I am not sure how this would affect Codesys, but I would suggest testing with core isolation off to see if the performance improves.

    Full reports

    report_core_isolation.html

    report_ddr_qos.html

    report_optee.html

    report_filesystem.html

    Any explanation of the DDR QoS settings? 

    Here is my current draft. Let me know if there are followup questions.

    # Prioritizing A53 Memory Reads Over Writes on AM\* Sitara™ Processors
    
    ---
    
    ## Section 1 — Background and Motivation
    
    Real-time Linux workloads on AM\* processors can exhibit elevated cyclictest latency under heavy DDR
    load, even when the A53 core is isolated (`isolcpus`, `nohz_full`, etc.). A representative
    reproduction command:
    
    ```bash
    stress-ng --vm-method=zero-one --memrate 2 &
    cyclictest -m -p 99 -i 200 -l 100000 -a <isolated_cpu>
    ```
    
    The symptom is a large increase in worst-case latency (maximum, not average) that appears only when
    DDR bandwidth is under pressure. On AM64x, field testing showed worst-case cyclictest latency
    drop from **800+ µs to ~170 µs** after applying the register writes described in this document.
    
    **Root cause mechanism:**
    
    1. An RT task preempts a lower-priority task and must run on the isolated A53 core.
    2. The RT task's working set is not fully in L1/L2 cache (cold cache, or evicted by the previous
       task). This causes cache-fill read misses.
    3. Cache-fill reads must complete before the A53 core can execute the first instruction of the RT
       task. The core stalls until the data arrives from SDRAM.
    4. Meanwhile, background write traffic from the same or other masters is in flight. At reset,
       DDR QoS is not configured, so reads and writes compete at equal priority inside the DDR
       controller's command queue.
    5. A53 cache-fill reads are delayed behind write traffic. The RT task does not actually begin
       executing until those reads complete — this delay is the observed cyclictest latency spike.
    
    **Solution:** Configure the CBASS (interconnect) and the DDR subsystem (DDRSS) to give A53 read
    transactions higher priority than write transactions. Both hardware blocks have independent
    priority arbitrators, and configuring both yields the maximum latency reduction.
    
    **Devices covered:** AM62L, AM62x, AM64x, AM62Ax, AM62Px.
    
    ---
    
    ## Section 2 — Hardware Architecture
    
    ### 2.1 Signal Path from A53 to DDR
    
    Two hardware building blocks sit between the A53 cluster and SDRAM:
    
    1. **CBASS (Crossbar Switch)** — Routes transactions between all initiators (A53, R5F, DMA, GPU,
       etc.) and targets (DDR, MSMC, peripherals). Performs priority arbitration. Priority is encoded
       as a 3-bit VBUSM field where **0 = highest priority, 7 = lowest**. Each initiator port has a
       QoS block that injects a configurable EPRIORITY value into outgoing transactions.
    
    2. **DDR Subsystem (DDRSS)** — Contains a VBUSM2AXI bridge (called MSMC2DDR on AM64x) that
       translates the VBUSM priority carried on incoming transactions to an AXI priority used by the
       DDR controller's command queue. A second stage of priority arbitration occurs here, operating
       on all transactions already buffered inside the DDR subsystem.
    
    ```
    A53 Cluster (128-bit VBUSM per port)
      │
      ├─ AXI Read port ──→ [CBASS QoS Block]  ─┐
      └─ AXI Write port ─→ [CBASS QoS Block]  ─┤
                                                │  VBUSM bus (carries priority + Route ID)
                           Other initiators ───→┤
                                                ↓
                                   ┌─────────────────────────────────┐
                                   │  DDR Subsystem (DDRSS)          │
                                   │  ┌──────────────────────────┐   │
                                   │  │  VBUSM2AXI Bridge        │   │
                                   │  │  • Route ID comparators  │   │
                                   │  │  • VBUSM→AXI priority    │   │
                                   │  │    mapping               │   │
                                   │  └──────────┬───────────────┘   │
                                   │             ↓                    │
                                   │  ┌──────────────────────────┐   │
                                   │  │  DDR Controller          │   │
                                   │  │  (AXI priority queue)    │   │
                                   │  └──────────┬───────────────┘   │
                                   └─────────────│────────────────────┘
                                                 ↓
                                            SDRAM (LPDDR4/DDR4)
    ```
    
    **DDR32SS devices (AM62Ax, AM62Px)** use a bridge with two physical VBUSM input ports:
    
    ```
                                   │  ┌──────────────────────────┐   │
                                   │  │  VBUSM2AXI Bridge        │   │
                                   │  │                          │   │
                                   │  │  HPT port ← orderID 8-15 │   │  (High Priority Thread)
                                   │  │  LPT port ← orderID 0-7  │   │  (Low Priority Thread)
                                   │  │                          │   │
                                   │  │  HPT always preempts LPT │   │
                                   │  └──────────────────────────┘   │
    ```
    
    HPT transactions always enter the DDR controller's command queue ahead of LPT transactions —
    this is a hardware-enforced structural priority, not a software-controlled mapping.
    
    ### 2.2 Key Concept: Two-Stage Priority Control
    
    There are two independent priority control points:
    
    | Stage | Location | What it controls |
    |-------|----------|-----------------|
    | **Stage 1 — CBASS** | CBASS QoS MAP0 register, EPRIORITY field | Which transactions win arbitration through the interconnect fabric, before they reach the DDR subsystem |
    | **Stage 2 — DDRSS** | VBUSM2AXI bridge DEF_PRI_MAP register | Which transactions win arbitration inside the DDR controller's command queue |
    
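    For maximum latency reduction, configure both stages. Stage 2 alone can still provide meaningful
    improvement when the DDRSS can tell transactions apart without help from Stage 1 (for example by
    Route ID, Approach C), because the DDR controller holds many in-flight transactions simultaneously
    and its internal arbitration determines which complete first. With Approach A, Stage 2 has no
    effect on its own, because at reset every initiator carries VBUSM priority 7 (see Section 2.5).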
    
    ### 2.3 Route ID
    
    Every VBUSM transaction carries a 12-bit **Route ID** identifying the originating initiator
    interface. The DDRSS VBUSM2AXI bridge can inspect this Route ID for its range-match CoS
    mechanism (see Section 3, Approach C).
    
    Route IDs are assigned per initiator port in the CBASS connectivity table. Confirmed values
    across AM62x, AM62Ax, AM62Px, and AM64x:
    
    | Initiator                  | Route ID range |
    |----------------------------|---------------|
    | A53 Write port (CBA_AXI_W) | **0–7**       |
    | A53 Read port (CBA_AXI_R)  | **16–23**     |
    | Other initiators (R5F, GPU, DMA, etc.) | 64+ (device-specific) |
    
    > **[UNCERTAIN for AM62L]** Route IDs for AM62L (write 0–7, read 16–23) have not been directly
    > verified from the AM62L TRM Route ID table. The pattern is consistent across all other devices
    > and the same CBASS IP is used.
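
    When reading bus traces or planning a range match, it can help to turn a raw Route ID back into an
    initiator name. The sketch below only encodes the ranges from the table above; `route_id_owner`
    is a hypothetical helper, and IDs outside the listed ranges are reported as "unlisted" rather
    than guessed:

    ```bash
    # Classify a VBUSM Route ID using the ranges in the table above.
    route_id_owner() {
        local id=$1
        if   (( id >= 0  && id <= 7  )); then echo "A53 write port"
        elif (( id >= 16 && id <= 23 )); then echo "A53 read port"
        elif (( id >= 64 ));             then echo "other initiator (device-specific)"
        else                                  echo "unlisted"
        fi
    }

    for id in 3 18 70 12; do
        echo "Route ID $id: $(route_id_owner "$id")"
    done
    ```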
    
    ### 2.4 Device Comparison Table
    
    | Device | A53 Cores    | DDR SS   | DDR Ports     | Bridge        | CBASS QoS base |
    |--------|-------------|---------|---------------|---------------|----------------|
    | AM62L  | 2 (dual)    | DDR16SS | 1 (LPT only)  | VBUSM2AXI     | 0x45D20000     |
    | AM62x  | 4 (quad)    | DDR16SS | 1 (LPT only)  | VBUSM2AXI     | 0x45D20000     |
    | AM64x  | 2 (dual)    | DDR16SS | 1 (LPT only)  | MSMC2DDR      | **0x45D80000** |
    | AM62Ax | 4 (quad)    | DDR32SS | 2 (HPT + LPT) | VBUSM2AXI     | 0x45D20000     |
    | AM62Px | 4 (quad)    | DDR32SS | 2 (HPT + LPT) | VBUSM2AXI     | 0x45D20000     |
    
    **AM64x** is the exception: its A53 QoS block resides in a separate CBASS region at **0x45D80000**,
    not 0x45D20000. AM64x also has four R5F cores (two R5FSS subsystems). AM62Ax has a C7x DSP and
    MCU R5F. AM62Px has an MCU R5F.
    
    ### 2.5 How Priority Settings Work
    
    This section traces a single transaction end-to-end under two configurations — reset defaults
    and after applying Approach A1 — to make the interaction between Stage 1 and Stage 2 concrete.
    
    **Default state (at reset):**
    
    - All CBASS QoS MAP0 registers reset to **0x7000**: every initiator port injects EPRIORITY = 7
      (numerically lowest priority) into outgoing transactions. A53 reads, A53 writes, DMA, GPU, and
      all other masters leave CBASS carrying the same VBUSM priority 7.
    - DDRSS DEF_PRI_MAP resets to **0x00000000**: all VBUSM priorities (0–7) map to DDR AXI
      priority 0 (numerically highest). The DDR controller sees every transaction at equal AXI
      priority 0 and services them in arrival order.
    - Net effect: every master competes identically at both the CBASS arbitration stage and the DDR
      controller command queue. A53 cache-fill reads wait behind write traffic with no mechanism to
      advance.
    
    **After Approach A1:**
    
    - A53 read port MAP0 is written to **0x6000**: A53 reads now leave CBASS carrying VBUSM
      priority 6. All other traffic (A53 writes, DMA, etc.) remains at VBUSM priority 7.
    - DDRSS DEF_PRI_MAP is written to **0x00000001**: VBUSM priority 7 now maps to DDR AXI
      priority 1 (one step lower). VBUSM priority 6 retains its mapping to DDR AXI priority 0.
    - Net effect: A53 reads carry VBUSM priority 6 and are mapped to DDR AXI priority 0 (highest).
      A53 writes and all other traffic carry VBUSM priority 7 and are mapped to DDR AXI priority 1.
      At both the CBASS arbitration point and the DDR controller command queue, A53 reads win.
    
    | Quantity | Default | After Approach A1 |
    |-------|---------|-------------------|
    | A53 read VBUSM priority out of CBASS | 7 (equal to writes) | 6 (beats writes at 7) |
    | A53 read DDR controller AXI priority | 0 (equal to all traffic) | 0 (writes degraded to AXI 1) |
    
    **Why both stages are necessary:** At reset, DEF_PRI_MAP maps all VBUSM priorities to DDR AXI
    priority 0 regardless of their VBUSM value. If only Stage 1 (CBASS EPRIORITY) is configured
    without updating DEF_PRI_MAP, the DDR controller receives transactions at VBUSM 6 and VBUSM 7
    but maps both to AXI priority 0 — the distinction is lost, and reads and writes still compete
    equally inside the DDR controller. Stage 2 (DDRSS DEF_PRI_MAP) must be written to translate
    the distinct VBUSM priorities into distinct DDR AXI priorities for the differentiation to have
    effect inside the DDR controller.
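
    The mapping can be checked with a few lines of shell arithmetic. This is a sketch under one
    assumption: that each PRIMAPn is a 3-bit field at bit offset 4 × (7 − n), which is extrapolated
    from the PRIMAP7[2:0] and PRIMAP6[6:4] positions quoted in this document and should be confirmed
    against the device TRM. `axi_prio` is a hypothetical helper name.

    ```bash
    # Look up the DDR AXI priority that a DEF_PRI_MAP value assigns to a VBUSM priority.
    # ASSUMPTION: PRIMAPn occupies 3 bits at offset 4*(7-n); verify in the TRM.
    axi_prio() {
        local def_pri_map=$1 vbusm_prio=$2
        echo $(( (def_pri_map >> (4 * (7 - vbusm_prio))) & 0x7 ))
    }

    for map in 0x00000000 0x00000001; do
        echo "DEF_PRI_MAP=$map: VBUSM 6 -> AXI $(axi_prio "$map" 6), VBUSM 7 -> AXI $(axi_prio "$map" 7)"
    done
    ```

    With the reset value, VBUSM 6 and VBUSM 7 both land on AXI 0, which is the "Stage 1 only" case
    described above. With 0x00000001, only VBUSM 7 moves to AXI 1.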
    
    ---
    
    ## Section 3 — Configuration Approaches
    
    ### Overview: Which approaches are available per device
    
    | Approach | Works on | What it does |
    |----------|----------|--------------|
    | **A: EPRIORITY + DEF_PRI_MAP** | All devices | Prioritizes at both CBASS and DDR levels |
    | **B: HPT/LPT orderID routing** | AM62Ax, AM62Px only | Routes reads to structurally-preferred HPT port |
    | **C: DDRSS Route ID range match** | All devices | DDR-level differentiation by initiator Route ID |
    
    Approaches can be combined. Approach A is the well-tested baseline. Approach B is the most
    direct method on DDR32SS devices. Approach C provides the finest per-initiator control.
    
    ---
    
    ### Approach A: EPRIORITY + DEF_PRI_MAP (all devices)
    
    Register addresses for this approach: CBASS QoS MAP0 (see Appendix A.1), DDRSS DEF_PRI_MAP
    (see Appendix A.2).
    
    **How it works:**
    
    1. Write a lower EPRIORITY value (numerically smaller = higher priority) into the A53 read port
       MAP0 register. This raises the read port's VBUSM priority above the write port's value.
    2. At reset, all traffic enters the DDR controller with the same AXI priority (DEF_PRI_MAP = 0).
       After step 1, reads arrive with a distinct (numerically lower) VBUSM priority value.
    3. Configure DEF_PRI_MAP to map the now-distinct VBUSM priorities to distinct DDR AXI priorities.
    4. The effect operates at both the interconnect (CBASS arbitration) and the DDR controller
       (command queue arbitration).
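
    The MAP0 values used for Stage 1 follow from placing the 3-bit EPRIORITY at bit 12, as in the
    "6 << 12 = 0x6000" arithmetic in this document. A minimal sketch, assuming all other MAP0 fields
    stay at 0 (as they do in the register writes shown here); `map0_value` is a hypothetical helper:

    ```bash
    # Encode a 3-bit EPRIORITY (0 = highest, 7 = lowest) into a MAP0 register value.
    # ASSUMPTION: every other MAP0 field is left at 0.
    map0_value() {
        local eprio=$1
        if (( eprio < 0 || eprio > 7 )); then
            echo "EPRIORITY must be 0..7" >&2
            return 1
        fi
        printf '0x%08X\n' $(( eprio << 12 ))
    }

    map0_value 7   # reset default
    map0_value 6   # one step higher priority than the default
    ```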
    
    #### Variant A1: Read priority only (minimal, recommended starting point)
    
    Result: A53 reads → VBUSM priority 6 → DDR AXI priority 0 (highest). A53 writes and all other
    masters → VBUSM priority 7 → DDR AXI priority 1.
    
    ```bash
    # Run at the U-Boot prompt: mw.l writes a 32-bit word, md.l reads it back.
    # ── Stage 1: CBASS QoS — raise A53 read EPRIORITY from 7 to 6 ──────────────────
    # EPRIORITY = 6 → 6 << 12 = 0x6000
    # AM62L / AM62x / AM62Ax / AM62Px
    mw.l 0x45D20500 0x00006000   # A53 read port EPRIORITY=6
    md.l 0x45D20500 1            # read-back verification
    
    # AM64x (different CBASS region)
    mw.l 0x45D80500 0x00006000   # A53 read port EPRIORITY=6
    md.l 0x45D80500 1
    
    # ── Stage 2: DDRSS DEF_PRI_MAP — map VBUSM priority 7 → DDR AXI priority 1 ──────
    # PRIMAP7[2:0] = 1 → 0x00000001  (all other PRIMAP entries stay 0 = DDR AXI priority 0)
    mw.l 0x0F300030 0x00000001   # same address on all devices
    ```
    
    > **NOTE for AM62Ax / AM62Px:** 0x0F300030 is LPT_DEF_PRI_MAP. This setting applies only to
    > traffic on the LPT port. If A53 reads are routed to HPT via Approach B, also configure
    > HPT_DEF_PRI_MAP at 0x0F30004C with the same value.
    
    #### Variant A2: Both read and write boosted above other masters
    
    Result: A53 reads → VBUSM 5 → DDR AXI 0. A53 writes → VBUSM 6 → DDR AXI 1. All other
    masters → VBUSM 7 → DDR AXI 2.
    
    ```bash
    # ── Stage 1: CBASS QoS ────────────────────────────────────────────────────────────
    # AM62L / AM62x / AM62Ax / AM62Px
    mw.l 0x45D20500 0x00005000   # A53 read EPRIORITY=5 (5<<12 = 0x5000)
    md.l 0x45D20500 1
    mw.l 0x45D20900 0x00006000   # A53 write EPRIORITY=6 (6<<12 = 0x6000)
    md.l 0x45D20900 1
    
    # AM64x
    mw.l 0x45D80500 0x00005000   # A53 read EPRIORITY=5
    md.l 0x45D80500 1
    mw.l 0x45D80900 0x00006000   # A53 write EPRIORITY=6
    md.l 0x45D80900 1
    
    # ── Stage 2: DDRSS DEF_PRI_MAP ───────────────────────────────────────────────────
    # PRIMAP6[6:4] = 1 (VBUSM 6 → DDR AXI 1), PRIMAP7[2:0] = 2 (VBUSM 7 → DDR AXI 2)
    # = (1<<4) | (2<<0) = 0x10 | 0x02 = 0x00000012
    mw.l 0x0F300030 0x00000012   # all devices (LPT_DEF_PRI_MAP on DDR32SS)
    ```
    
    > **NOTE on 0x00000102:** Some historical AM64x references cite this value for DEF_PRI_MAP in
    > the A2 variant. This value encodes PRIMAP5=1 (VBUSM 5 → AXI 1) and PRIMAP7=2 (VBUSM 7 →
    > AXI 2), leaving PRIMAP6=0 (VBUSM 6 → AXI 0). That would give A53 writes (VBUSM 6) higher DDR
    > AXI priority than A53 reads (VBUSM 5) — the opposite of the intended behavior. The correct
    > value for the stated intent is **0x00000012**.
    >
    > Field testing showed no measurable latency difference between A1 and A2. **Variant A1 is
    > recommended** as the simpler starting point.
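    
    The two DEF_PRI_MAP values discussed in the note above can be checked by decoding them field
    by field. This is a minimal sketch that assumes the PRIMAP layout listed in Appendix A.2.
    `pri_map_decode` is a hypothetical helper name (not part of U-Boot or the SDK), and it is meant
    to run in a host shell, not at the U-Boot prompt.

    ```bash
    # Hypothetical helper: print the DDR AXI priority that each VBUSM priority maps to.
    # PRIMAPn occupies three bits starting at bit (7 - n) * 4.
    pri_map_decode() {
        value=$(( $1 ))
        vbusm=0
        while [ "$vbusm" -le 7 ]; do
            axi=$(( (value >> ((7 - vbusm) * 4)) & 7 ))
            echo "VBUSM $vbusm -> DDR AXI $axi"
            vbusm=$(( vbusm + 1 ))
        done
    }

    # Only VBUSM 5..7 are used by Variant A2, so show the last three lines.
    pri_map_decode 0x00000012 | tail -n 3   # reads (5) -> 0, writes (6) -> 1, others (7) -> 2
    pri_map_decode 0x00000102 | tail -n 3   # reads (5) -> 1, writes (6) -> 0, others (7) -> 2
    ```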
    
    ---
    
    ### Approach B: HPT/LPT orderID Routing (AM62Ax, AM62Px only)
    
    Register addresses for this approach: CBASS QoS MAP0 ORDERID field (see Appendix A.1). DDRSS
    range registers not required for B1; see Appendix A.2 for HPT_DEF_PRI_MAP used in B2.
    
    **How it works:**
    
    The DDR32SS VBUSM2AXI bridge has two physical VBUSM input ports: HPT (High Priority Thread) and
    LPT (Low Priority Thread). HPT always has structural priority over LPT — HPT commands enter
    the DDR controller's command queue ahead of LPT commands at every arbitration cycle. This is
    enforced in hardware and does not require any DEF_PRI_MAP configuration.
    
    The CBASS routes a transaction to HPT if its ORDERID field ≥ 8, to LPT if ORDERID ≤ 7. ORDERID
    is set per-initiator in the CBASS QoS MAP0 register, bits [7:4].
    
    #### Approach B1: Route A53 reads to HPT, leave writes on LPT
    
    ```bash
    # ── CBASS QoS: set ORDERID=8 on A53 read port ────────────────────────────────────
    # MAP0: EPRIORITY stays at 7 (default = 0x7000), ORDERID = 8 → 8<<4 = 0x0080
    # Full register value: 0x7000 | 0x0080 = 0x00007080
    mw.l 0x45D20500 0x00007080   # AM62Ax / AM62Px: A53 read → HPT
    md.l 0x45D20500 1
    # A53 write port stays at default 0x7000 (ORDERID=0 → LPT). No write needed.
    ```
    
    No DDRSS register changes are required. The HPT structural priority handles the differentiation.
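    
    When checking the `md.l` read-back, it can help to decode the MAP0 value and see which port the
    ORDERID selects. The sketch below assumes the MAP0 bit layout from Appendix A.1 and the
    ORDERID rule stated above. `map0_decode` is a hypothetical helper name for a host shell, and
    the reported port is only meaningful on DDR32SS devices (AM62Ax, AM62Px).

    ```bash
    # Hypothetical helper: decode a CBASS QoS MAP0 value read back with md.l.
    map0_decode() {
        value=$(( $1 ))
        epriority=$(( (value >> 12) & 7 ))    # bits 14:12
        orderid=$(( (value >> 4) & 15 ))      # bits 7:4
        if [ "$orderid" -ge 8 ]; then port=HPT; else port=LPT; fi
        echo "EPRIORITY=$epriority ORDERID=$orderid port=$port"
    }

    map0_decode 0x00007080   # A53 read port after B1
    map0_decode 0x00007000   # A53 write port left at its reset value
    ```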
    
    #### Approach B2: Combine HPT routing with DEF_PRI_MAP for finer control
    
    ```bash
    # A53 reads to HPT (ORDERID=8) and also elevated EPRIORITY=6
    # EPRIORITY=6 → 0x6000, ORDERID=8 → 0x0080 → combined: 0x00006080
    mw.l 0x45D20500 0x00006080   # AM62Ax / AM62Px: read → HPT, EPRIORITY=6
    md.l 0x45D20500 1
    
    # HPT_DEF_PRI_MAP: A53 reads arrive at HPT with VBUSM priority 6; PRIMAP6 stays 0 → DDR AXI 0
    # PRIMAP7[2:0] = 1 → other HPT traffic at VBUSM 7 → DDR AXI 1
    mw.l 0x0F30004C 0x00000001   # HPT_DEF_PRI_MAP: PRIMAP7=1
    
    # LPT_DEF_PRI_MAP: writes arrive at LPT with VBUSM priority 7 → DDR AXI 1
    mw.l 0x0F300030 0x00000001   # LPT_DEF_PRI_MAP: PRIMAP7=1
    ```
    
    > **NOTE (AM62Ax):** UDMA write channels are mapped to HPT by hardware for guaranteed QoS (per
    > TRM section 4.6). When using HPT routing for A53 reads, UDMA writes also compete on the HPT
    > port. If UDMA write bandwidth is a concern, use EPRIORITY or range-match registers
    > (LPT_R\*/HPT_R\*) to further differentiate within the HPT port.
    
    ---
    
    ### Approach C: DDRSS Route ID Range Matching (all devices)
    
    Register addresses for this approach: DDRSS range match MAT registers and range priority map
    registers (see Appendix A.3), DEF_PRI_MAP (see Appendix A.2).
    
    **How it works:**
    
    The VBUSM2AXI bridge inspects the Route ID on every incoming transaction. If the Route ID
    matches one of the three range match registers (R1, R2, R3 MAT), the corresponding range
    priority map register (R1, R2, R3 PRI_MAP) overrides DEF_PRI_MAP for that transaction. This
    allows different VBUSM→AXI priority mappings per initiator, entirely within the DDRSS and
    without any CBASS EPRIORITY change.
    
    **Match logic:** `(incoming_routeid >> MASK) == (ROUTEID_field >> MASK)`, where MASK is the
    number of LSBs to ignore. MASK=3 therefore matches an aligned block of eight Route IDs (e.g.,
    16–23 all match with ROUTEID=16 and MASK=3).
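    
    The comparison can be tried out on a host shell before committing a MASK value to hardware.
    This is a sketch of the match rule exactly as written above. `routeid_matches` is a
    hypothetical helper name and does not model the enable bit.

    ```bash
    # Hypothetical helper: succeed if an incoming Route ID matches a pattern
    # once the lowest MASK bits of both are ignored.
    routeid_matches() {
        incoming=$1; pattern=$2; mask=$3
        [ $(( incoming >> mask )) -eq $(( pattern >> mask )) ]
    }

    # ROUTEID=16, MASK=3: IDs 16..23 match, their neighbours 15 and 24 do not.
    for id in 15 16 23 24; do
        if routeid_matches "$id" 16 3; then
            echo "$id: match"
        else
            echo "$id: no match"
        fi
    done
    ```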
    
    **Route ID values confirmed for AM62x, AM62Ax, AM62Px, AM64x:**
    
    | Initiator                   | Route ID range | ROUTEID_x field | MASK_x |
    |-----------------------------|---------------|-----------------|--------|
    | A53 Write (CBA_AXI_W)       | 0–7           | 0x000           | 3      |
    | A53 Read (CBA_AXI_R)        | 16–23         | 0x010 (= 16)    | 3      |
    | C7x DSP (AM62Ax)            | 32–39         | 0x020 (= 32)    | 3      |
    | GPU Read/Write (AM62x/Px)   | 64–65         | see device TRM  | 0      |
    | R5FSS (AM64x)               | 66–74         | see device TRM  | varies |
    | PRU_ICSSG (AM64x)           | 384–447       | see device TRM  | 6      |
    | MMCSD, GIC, USB, etc.       | 256+          | see device TRM  | varies |
    
    **Encoding example for A53 reads (Route IDs 16–23):**
    
    ```
    R1_MAT = (RANGEEN_A=1 << 31) | (MASK_A=3 << 28) | (ROUTEID_A=16 << 16)
           = 0x80000000 | 0x30000000 | 0x00100000
           = 0xB0100000
    ```
    
    #### Approach C1: A53 reads → highest DDR priority (Route ID match, no CBASS change)
    
    Effect: A53 reads → DDR AXI priority 0 (via range match, reset default). All other traffic →
    DDR AXI priority 1 (via DEF_PRI_MAP). CBASS EPRIORITY is not changed.
    
    ```bash
    # R1_MAT: enable A, MASK_A=3 (ignore lower 3 bits), ROUTEID_A=16 → matches Route IDs 16-23
    mw.l 0x0F300024 0xB0100000
    
    # R1_PRI_MAP stays at reset (0x0) → A53 reads get DDR AXI priority 0 (all fields = 0)
    
    # DEF_PRI_MAP: PRIMAP7[2:0]=1 → everyone else (VBUSM 7) → DDR AXI priority 1
    mw.l 0x0F300030 0x00000001
    
    # AM62Ax / AM62Px: the above sets LPT registers. HPT range map registers (0x0F300050)
    # reset to 0 (AXI priority 0) and typically need no change.
    ```
    
    #### Approach C2: Three-tier — A53 reads > A53 writes > everyone else
    
    Uses two range registers. R3 matches A53 reads; R1 matches A53 writes. R3 > R1 in precedence.
    
    ```bash
    # R3_MAT: match A53 reads (Route IDs 16-23)
    mw.l 0x0F30002C 0xB0100000
    
    # R3_PRI_MAP stays at reset (0x0) → A53 reads → DDR AXI priority 0
    
    # R1_MAT: match A53 writes (Route IDs 0-7), ROUTEID_A=0, MASK_A=3
    # (1<<31)|(3<<28)|(0<<16) = 0x80000000|0x30000000|0x00000000 = 0xB0000000
    mw.l 0x0F300024 0xB0000000
    
    # R1_PRI_MAP: PRIMAP7[2:0]=1 → A53 writes → DDR AXI priority 1
    mw.l 0x0F300034 0x00000001
    
    # DEF_PRI_MAP: PRIMAP7[2:0]=2 → everyone else → DDR AXI priority 2
    mw.l 0x0F300030 0x00000002
    ```
    
    On AM62Ax / AM62Px, the PRI_MAP addresses written above are LPT_R3_PRI_MAP (0x0F30003C) and
    LPT_R1_PRI_MAP (0x0F300034), so these settings apply to range-matched LPT traffic only.
    
    #### Approach C3: MCU R5F above A53 above others (AM64x, AM62Ax, AM62Px)
    
    Use case: a system where an MCU R5F runs the hardest RT task and needs priority over A53.
    AM64x R5FSS0 CPU0 read Route ID = 66, write = 67.
    
    ```bash
    # R3_MAT: match R5F read (Route ID 66, exact match MASK=0)
    # RANGEEN_A=1, MASK_A=0, ROUTEID_A=66 (0x042) → (1<<31)|(0<<28)|(0x042<<16) = 0x80420000
    mw.l 0x0F30002C 0x80420000
    
    # R3_PRI_MAP stays at reset → R5F reads → DDR AXI priority 0
    
    # R2_MAT: match A53 reads (Route IDs 16-23)
    mw.l 0x0F300028 0xB0100000
    
    # R2_PRI_MAP: PRIMAP7[2:0]=1 → A53 reads → DDR AXI priority 1
    mw.l 0x0F300038 0x00000001
    
    # DEF_PRI_MAP: PRIMAP7[2:0]=2 → everyone else → DDR AXI priority 2
    mw.l 0x0F300030 0x00000002
    ```
    
    ### Approach Comparison Summary
    
    | Aspect | A: EPRIORITY + DEF_PRI_MAP | B: HPT/LPT orderID | C: Route ID range match |
    |--------|---------------------------|---------------------|-------------------------|
    | Devices | All | AM62Ax, AM62Px only | All |
    | Affects CBASS arbitration | Yes | Indirectly (via orderID) | No |
    | Affects DDR arbitration | Yes | Yes (structurally) | Yes |
    | Requires Route ID knowledge | No | No | Yes |
    | Per-initiator granularity | Low (all share same PRIMAP bucket) | Low | High (up to 6 initiators) |
    | Field tested | Yes (AM64x, AM62Px) | Not specifically confirmed | Not confirmed |
    | Complexity | Low | Low | Medium |
    
    **Recommendation:** Start with **Approach A, Variant A1** on all devices. On AM62Ax/AM62Px,
    **Approach B1** is the most direct method due to the hardware-enforced HPT priority. Use
    Approach C when multiple RT masters need independent priority tiers (primarily AM64x with MCU
    R5F and A53 both running RT workloads).
    
    ---
    
    ## Section 4 — When to Apply These Settings
    
    ### 4.1 Initialization Timing
    
    Apply these register writes during early firmware initialization (BL1 / R5 SPL), before the
    A53 cluster starts executing workloads. The TRM states:
    
    > *"QoS block programming shall happen during device initialization time while there is no
    > in-flight transaction for that initiator."*
    
    For Linux-only setups, these can also be applied as one-time register writes in early kernel
    init or via a pre-boot register write mechanism, since the registers persist until the next
    power cycle.
    
    ### 4.2 U-Boot K3 QoS Framework Support
    
    U-Boot provides a K3 QoS framework (`setup_qos()` iterating over a `qos_data[]` array in
    `arch/arm/mach-k3/`) that writes CBASS QoS MAP registers from per-device data files at SPL
    init time. The `K3_QOS_REG(base, i)` macro computes `base + 0x100 + i*4` to locate MAP
    register offsets; `K3_QOS_VAL()` encodes the EPRIORITY, ORDERID, and ASEL fields into the
    register value.
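    
    The address arithmetic described above is easy to reproduce when checking a block base against
    the MAP0 addresses used elsewhere in this document. The sketch below mirrors only the stated
    formula (`base + 0x100 + i*4`). `qos_map_addr` is a hypothetical shell helper, not the U-Boot
    macro itself.

    ```bash
    # Hypothetical helper: MAP register address for QoS block base $1 and index $2.
    qos_map_addr() {
        printf '0x%08X\n' $(( $1 + 0x100 + ${2:-0} * 4 ))
    }

    qos_map_addr 0x45D20400 0   # A53 read block  -> 0x45D20500
    qos_map_addr 0x45D20800 0   # A53 write block -> 0x45D20900
    qos_map_addr 0x45D80400 0   # AM64x A53 read block -> 0x45D80500
    ```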
    
    | Device | U-Boot QoS file | A53 priority entries | Notes |
    |--------|-----------------|----------------------|-------|
    | AM62Ax | `r5/am62ax/am62a_qos_uboot.c` | Not present | DSS display DMA only (orderID=8) |
    | AM62Px | `r5/am62px/am62p_qos_uboot.c` | Not present | DSS display DMA ×2 (orderID=15) |
    | AM62L  | None | N/A | Must write registers manually |
    | AM62x  | None | N/A | Must write registers manually |
    | AM64x  | None | N/A | Must write registers manually |
    
    `CONFIG_K3_QOS` defaults to `y` only for `SOC_K3_AM62A7` in Kconfig. It must be explicitly
    enabled for AM62Px (`CONFIG_K3_QOS=y` in defconfig). It is not set for AM62x, AM64x, or AM62L.
    
    ### 4.3 Gaps — What Is Missing from the Current U-Boot Framework
    
    The following gaps affect all five devices and are documented here as reference for the
    development team.
    
    **Gap 1 — AM62Ax and AM62Px: A53 priority entries absent from qos_data[]**
    
    The `am62a_qos.h` and `am62p_qos.h` headers define the A53 CBASS block addresses:
    
    - `SAM62A_A53_512KB_WRAP_MAIN_0_A53_QUAD_WRAP_CBA_AXI_R = 0x45D20400`
    - `SAM62A_A53_512KB_WRAP_MAIN_0_A53_QUAD_WRAP_CBA_AXI_W = 0x45D20800`
    
    The MAP0 registers for these blocks are at base + 0x100, i.e.:
    - A53 read MAP0 = 0x45D20400 + 0x100 + 0×4 = **0x45D20500**
    - A53 write MAP0 = 0x45D20800 + 0x100 + 0×4 = **0x45D20900**
    
    Despite these addresses being defined in the headers, the existing `qos_data[]` arrays in both
    `am62a_qos_uboot.c` and `am62p_qos_uboot.c` only configure DSS display DMA. No entries exist
    to set EPRIORITY on the A53 read or write ports.
    
    **Gap 2 — AM62L, AM62x, AM64x: No _qos_uboot.c files exist at all**
    
    `CONFIG_K3_QOS` defaults to `y` only for `SOC_K3_AM62A7`; it is not enabled for AM62x, AM64x,
    or AM62L in Kconfig. No `qos.h` header files with initiator endpoint addresses exist for these
    devices in the U-Boot tree (verified against both SDK 11.2 / ti-u-boot-2025.01 and SDK 12.0 /
    ti-u-boot-2026.01 — both are identical in QoS content). Any QoS configuration for these devices
    must be written manually via direct register access in SPL or a separate init stage.
    
    **Gap 3 — DDRSS Stage 2 registers not in framework scope**
    
    The K3 QoS framework (`setup_qos()` / `k3_qos_data`) only covers CBASS QoS MAP registers.
    The DDRSS registers — DEF_PRI_MAP (0x0F300030), HPT_DEF_PRI_MAP (0x0F30004C), and range match
    MAT and PRI_MAP registers at the 0x0F300000 base — are outside the current framework and would
    require a separate mechanism (e.g., explicit writes in the board init sequence alongside or
    after `setup_qos()`).
    
    ---
    
    ## Verification
    
    After applying the register settings on hardware:
    
    1. **Baseline:** `cyclictest -m -p 99 -i 200 -l 100000 -a <isolated_cpu>`
    2. **Apply DDR load:** `stress-ng --vm-method=zero-one --memrate 2 &`
    3. **Loaded run:** `cyclictest -m -p 99 -i 200 -l 100000 -a <isolated_cpu>`
    4. **Compare** worst-case (maximum) latency between the two runs.
    5. **Expected result:** Significant reduction in maximum latency. AM64x reference:
       800+ µs → ~170 µs.
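    
    To compare the two runs without reading the numbers off the screen, the worst-case value can
    be pulled out of saved cyclictest summary lines. This is a minimal sketch that assumes the
    per-thread summary format shown earlier in this thread (`... Avg: 15 Max: 203`).
    `cyclictest_max` is a hypothetical helper name.

    ```bash
    # Hypothetical helper: print the largest "Max:" value found in cyclictest
    # summary lines, read from the given files or from stdin.
    cyclictest_max() {
        awk '{
            for (i = 1; i < NF; i++)
                if ($i == "Max:" && $(i + 1) + 0 > max) max = $(i + 1) + 0
        } END { print max + 0 }' "$@"
    }

    # Sample lines taken from the original post in this thread.
    printf '%s\n' \
        'T: 0 (  404) P:99 I:1000 C:72252 Min:      5 Act:   11 Avg:   15 Max:     203' \
        'T: 1 (  405) P:99 I:1500 C:48159 Min:      9 Act:   33 Avg:   33 Max:     152' \
        | cyclictest_max
    ```

    One way to use it is to redirect the output of the baseline and loaded runs to two files (file
    names are up to you) and call `cyclictest_max` on each.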
    
    **Verify register writes took effect** by reading back immediately after writing:
    
    ```bash
    # AM62L / AM62x / AM62Ax / AM62Px — after Approach A1
    md.l 0x45D20500 1   # should read 0x00006000
    md.l 0x0F300030 1   # should read 0x00000001
    
    # AM64x — after Approach A1
    md.l 0x45D80500 1   # should read 0x00006000
    md.l 0x0F300030 1   # should read 0x00000001
    ```
    
    ---
    
    ## Appendix A — Register Reference
    
    ### A.1 CBASS QoS MAP0 Registers (Stage 1)
    
    Each initiator port has one QoS block. The MAP0 register is at `block_base + 0x100`. All
    registers reset to **0x7000** (EPRIORITY = 7, lowest priority).
    
    **MAP0 register bitfield layout (identical on all devices):**
    
    | Bits  | Field     | Reset | Description |
    |-------|-----------|-------|-------------|
    | 14:12 | EPRIORITY | 7h    | VBUSM priority injected on outgoing transactions. 0 = highest, 7 = lowest. |
    | 11:8  | ASEL      | 0h    | Leave at 0 for DDR access. (Values 14/15 reserved for A53 ACP.) |
    | 7:4   | ORDERID   | 0h    | 0–7 → LPT port; 8–15 → HPT port. Relevant only on DDR32SS devices (AM62Ax, AM62Px). |
    | 2:0   | QOS       | 0h    | Not used. |
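    
    The MAP0 values used in Section 3 follow directly from this layout. A minimal sketch for a
    host shell is shown below. `map0_value` is a hypothetical helper name (it is not `K3_QOS_VAL()`
    or any other SDK function) and it leaves the unused QOS field at 0.

    ```bash
    # Hypothetical helper: build a CBASS QoS MAP0 value from its fields.
    #   $1 = EPRIORITY (bits 14:12), $2 = ORDERID (bits 7:4), $3 = ASEL (bits 11:8)
    map0_value() {
        epriority=$1
        orderid=${2:-0}
        asel=${3:-0}
        printf '0x%08X\n' $(( ((epriority & 7) << 12) | ((asel & 15) << 8) | ((orderid & 15) << 4) ))
    }

    map0_value 6      # Variant A1 read port:  0x00006000
    map0_value 7 8    # Approach B1 read port: 0x00007080
    map0_value 6 8    # Approach B2 read port: 0x00006080
    ```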
    
    **A53 QoS register addresses (MAP0 = block_base + 0x100):**
    
    | Device | A53 Read port MAP0 | A53 Write port MAP0 | QoS block bases (R / W) |
    |--------|--------------------|---------------------|--------------------------|
    | AM62L  | **0x45D20500**     | **0x45D20900**      | 0x45D20400 / 0x45D20800  |
    | AM62x  | **0x45D20500**     | **0x45D20900**      | 0x45D20400 / 0x45D20800  |
    | AM64x  | **0x45D80500**     | **0x45D80900**      | 0x45D80400 / 0x45D80800  |
    | AM62Ax | **0x45D20500**     | **0x45D20900**      | 0x45D20400 / 0x45D20800  |
    | AM62Px | **0x45D20500**     | **0x45D20900**      | 0x45D20400 / 0x45D20800  |
    
    > **TRM NOTE:** After any write to the 0x4500_0000–0x45FF_FFFF range, always read back the
    > register to confirm the write landed.
    
    > **TRM NOTE:** For peripherals with both a QoS block0 and block1 serving the same function
    > (e.g., MMCSD), both must be written to the same value. For A53 specifically, block0 = read
    > port and block1 = write port — intentionally different values are correct.
    
    ### A.2 DDRSS Priority Map Registers (Stage 2)
    
    **Base address: 0x0F300000** (DDR16SS0 SSCFG, all devices)
    
    **Default priority map registers (VBUSM → DDR AXI priority):**
    
    | Register                            | Offset | Address      | Devices                        |
    |-------------------------------------|--------|--------------|--------------------------------|
    | EMIF_SSCFG_V2A_DEF_PRI_MAP_REG      | 0x30   | 0x0F300030   | AM62L, AM62x, AM64x            |
    | EMIF_SSCFG_V2A_LPT_DEF_PRI_MAP_REG  | 0x30   | 0x0F300030   | AM62Ax, AM62Px (LPT port)      |
    | EMIF_SSCFG_V2A_HPT_DEF_PRI_MAP_REG  | 0x4C   | 0x0F30004C   | AM62Ax, AM62Px (HPT port)      |
    
    **Range match registers (all devices; shared between LPT/HPT on DDR32SS):**
    
    | Register                  | Offset | Address    |
    |---------------------------|--------|------------|
    | EMIF_SSCFG_V2A_R1_MAT_REG | 0x24   | 0x0F300024 |
    | EMIF_SSCFG_V2A_R2_MAT_REG | 0x28   | 0x0F300028 |
    | EMIF_SSCFG_V2A_R3_MAT_REG | 0x2C   | 0x0F30002C |
    
    **Range priority map registers:**
    
    | Register                           | Offset | Address    | Devices                        |
    |------------------------------------|--------|------------|--------------------------------|
    | EMIF_SSCFG_V2A_R1_PRI_MAP_REG      | 0x34   | 0x0F300034 | AM62L, AM62x, AM64x            |
    | EMIF_SSCFG_V2A_R2_PRI_MAP_REG      | 0x38   | 0x0F300038 | AM62L, AM62x, AM64x            |
    | EMIF_SSCFG_V2A_R3_PRI_MAP_REG      | 0x3C   | 0x0F30003C | AM62L, AM62x, AM64x            |
    | EMIF_SSCFG_V2A_LPT_R1_PRI_MAP_REG  | 0x34   | 0x0F300034 | AM62Ax, AM62Px (LPT)           |
    | EMIF_SSCFG_V2A_LPT_R2_PRI_MAP_REG  | 0x38   | 0x0F300038 | AM62Ax, AM62Px (LPT)           |
    | EMIF_SSCFG_V2A_LPT_R3_PRI_MAP_REG  | 0x3C   | 0x0F30003C | AM62Ax, AM62Px (LPT)           |
    | EMIF_SSCFG_V2A_HPT_R1_PRI_MAP_REG  | 0x50   | 0x0F300050 | AM62Ax, AM62Px (HPT)           |
    | EMIF_SSCFG_V2A_HPT_R2_PRI_MAP_REG  | 0x54   | 0x0F300054 | AM62Ax, AM62Px (HPT)           |
    | EMIF_SSCFG_V2A_HPT_R3_PRI_MAP_REG  | 0x58   | 0x0F300058 | AM62Ax, AM62Px (HPT)           |
    
    **DEF_PRI_MAP / LPT_DEF_PRI_MAP / HPT_DEF_PRI_MAP bitfield layout (identical for all):**
    
    | Bits  | Field   | Description |
    |-------|---------|-------------|
    | 30:28 | PRIMAP0 | VBUSM priority 0 → DDR AXI priority (0 = highest, 7 = lowest) |
    | 26:24 | PRIMAP1 | VBUSM priority 1 → DDR AXI priority |
    | 22:20 | PRIMAP2 | VBUSM priority 2 → DDR AXI priority |
    | 18:16 | PRIMAP3 | VBUSM priority 3 → DDR AXI priority |
    | 14:12 | PRIMAP4 | VBUSM priority 4 → DDR AXI priority |
    | 10:8  | PRIMAP5 | VBUSM priority 5 → DDR AXI priority |
    | 6:4   | PRIMAP6 | VBUSM priority 6 → DDR AXI priority |
    | 2:0   | PRIMAP7 | VBUSM priority 7 → DDR AXI priority |
    
    **Reset value: 0x00000000** — all VBUSM priorities map to DDR AXI priority 0. At reset, every
    master has equal highest priority inside the DDR controller.
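    
    Since PRIMAP0 sits in the top nibble and PRIMAP7 in the bottom one, it is easy to get the field
    order backwards when building a value by hand. The sketch below assembles the register from
    eight AXI priorities given in VBUSM order. `pri_map_value` is a hypothetical helper name for a
    host shell.

    ```bash
    # Hypothetical helper: build a *_PRI_MAP value.
    # Arguments are the DDR AXI priorities for VBUSM priority 0, 1, ... 7, in that order.
    pri_map_value() {
        value=0
        vbusm=0
        for axi in "$@"; do
            value=$(( value | ((axi & 7) << ((7 - vbusm) * 4)) ))
            vbusm=$(( vbusm + 1 ))
        done
        printf '0x%08X\n' "$value"
    }

    pri_map_value 0 0 0 0 0 0 0 1   # Variant A1: 0x00000001
    pri_map_value 0 0 0 0 0 0 1 2   # Variant A2: 0x00000012
    ```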
    
    ### A.3 Range Match Register Bitfields
    
    Each MAT register contains two independent Route ID matchers (A and B):
    
    | Bits  | Field      | Description |
    |-------|------------|-------------|
    | 31    | RANGEEN_A  | Enable matcher A |
    | 30:28 | MASK_A     | Number of LSBs to ignore: 0 = exact match, 1 = match pairs, 3 = match octets |
    | 27:16 | ROUTEID_A  | 12-bit Route ID pattern for matcher A |
    | 15    | RANGEEN_B  | Enable matcher B |
    | 14:12 | MASK_B     | Number of LSBs to ignore for matcher B |
    | 11:0  | ROUTEID_B  | 12-bit Route ID pattern for matcher B |
    
    **Priority resolution:** if multiple range registers match a transaction, the highest-numbered
    range wins: R3 > R2 > R1 > DEF.
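    
    The precedence rule can be modelled on a host shell to sanity-check a planned configuration.
    This is only a sketch of the behaviour described in this appendix (enable bit, mask compare,
    highest-numbered range wins), not a model verified against hardware. `matcher_hit` and
    `resolve_range` are hypothetical helper names.

    ```bash
    # Hypothetical helper: test one 16-bit matcher half (enable, mask, pattern)
    # against Route ID $1. Matcher A and matcher B share this layout.
    matcher_hit() {
        enable=$(( ($2 >> 15) & 1 ))
        mask=$(( ($2 >> 12) & 7 ))
        pattern=$(( $2 & 0xFFF ))
        [ "$enable" -eq 1 ] && [ $(( $1 >> mask )) -eq $(( pattern >> mask )) ]
    }

    # Hypothetical helper: $1 = Route ID, then the R1, R2, R3 MAT values in order.
    # Prints which priority map would apply: R1, R2, R3 or DEF.
    resolve_range() {
        id=$1; shift
        winner=DEF; n=1
        for mat in "$@"; do
            half_a=$(( ($mat >> 16) & 0xFFFF ))
            half_b=$(( $mat & 0xFFFF ))
            if matcher_hit "$id" "$half_a" || matcher_hit "$id" "$half_b"; then
                winner="R$n"      # a later (higher-numbered) range overrides
            fi
            n=$(( n + 1 ))
        done
        echo "$winner"
    }

    # Approach C2 settings: R1 = A53 writes, R2 unused, R3 = A53 reads.
    resolve_range 16 0xB0000000 0 0xB0100000   # A53 read  -> R3
    resolve_range 3  0xB0000000 0 0xB0100000   # A53 write -> R1
    resolve_range 64 0xB0000000 0 0xB0100000   # other     -> DEF
    ```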
    
    **Encoding formula:**
    
    ```
    REG = (RANGEEN_A<<31) | (MASK_A<<28) | (ROUTEID_A<<16)
        | (RANGEEN_B<<15) | (MASK_B<<12) | (ROUTEID_B<<0)
    ```
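    
    The formula can be evaluated on a host shell to reproduce the MAT values used in Approach C.
    This is a minimal sketch. `mat_value` is a hypothetical helper name, and matcher B is left
    disabled when its arguments are omitted.

    ```bash
    # Hypothetical helper: build a Rn_MAT value.
    #   $1 $2 $3 = RANGEEN_A MASK_A ROUTEID_A
    #   $4 $5 $6 = RANGEEN_B MASK_B ROUTEID_B (optional, default 0)
    mat_value() {
        # Matcher A uses the same 16-bit layout as matcher B, shifted up by 16 bits.
        half_a=$(( (($1 & 1) << 15) | (($2 & 7) << 12) | ($3 & 0xFFF) ))
        half_b=$(( ((${4:-0} & 1) << 15) | ((${5:-0} & 7) << 12) | (${6:-0} & 0xFFF) ))
        printf '0x%08X\n' $(( (half_a << 16) | half_b ))
    }

    mat_value 1 3 16   # A53 reads, Route IDs 16-23: 0xB0100000
    mat_value 1 3 0    # A53 writes, Route IDs 0-7:  0xB0000000
    mat_value 1 0 66   # exact match on Route ID 66: 0x80420000
    ```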
    
    ---
    
    *Document scope: DDR QoS configuration for A53 read/write prioritization. Out of scope: leaky
    bucket threshold registers, ECC CoS configuration, per-range priority tuning beyond the
    examples given.*
    

    Next steps 

    I am going to do a few more test runs of the "best case scenario" with the default and base filesystems to see if there is a difference in behavior. After that, optimizations would be at the software level.

    Regards,

    Nick

  • Hi Nick,

    Thanks for your help.

    After adding the DDR QoS configuration, we ran the CODESYS application test environment (1ms cycle, 8 motor axes in operation). The test has been ongoing for 6 hours, with jitter ranging from -141 μs to 140 μs. Another performance indicator shows the maximum cycle duration reached 1120 μs, exceeding the expected 1ms.

    We also ran cyclictest with the same DDR configuration. Over a 6-hour test, the maximum jitter was 132 μs. We would therefore like to explore further optimization approaches to bring the jitter down to our 100 μs target.

    As for core isolation: disabling it brought no noticeable improvement in the CODESYS scenario. Hence I do not recommend turning it off. In the current setup, CPU1 maintains a steady load of 70%–75% for a long time, while CPU0 runs at 40%–50%. If the load on CPU0 rises along with business operations, disabling core isolation will likely impact CPU1 performance.

    Regards,

    Xi

  • For future readers: I moved the discussion about relocating interrupts with CODESYS to another thread:
    RE: AM62L: Testing Codesys

    This thread is already pretty long, so we will focus on cyclictest results here.

    Hello Xi,

    1) Did you test with core isolation = OFF and DDR QoS = ON together? I would suggest running with both settings.

    2) Did you ever create a separate thread to confirm the OP-TEE behavior discussed? If you are confident that your use case is not loading OP-TEE, then we do not need to investigate that further. My hypothesis is that having the OP-TEE TRNG enabled led to increased latencies by switching between the "regular" Linux context and the OP-TEE context, so if you do not have OP-TEE then there is no OP-TEE context switch to worry about.

    Regards,

    Nick

  • Hi Nick,

    Yes, I tested with isolcpus disabled and DDR QoS enabled. My test commands still pin a fixed load to specific CPU cores:

    cyclictest -a 0-1 -t 2 -p 99 -m -D 0
    taskset -c 1 stress-ng --cpu 1 --cpu-load 70 --vm 1 --vm-bytes 80% &
    taskset -c 0 stress-ng --cpu 1 --cpu-load 50 &

    By the way, I'm applying stress to CPU0 because, using the CODESYS environment as a reference, loading CPU0 reproduces the issue quickly. In this configuration (isolcpus disabled, DDR QoS enabled), the maximum jitter after 6 hours of testing is 166 μs. With isolcpus + DDR QoS enabled under the same stress conditions, the 6-hour maximum is 132 μs.

    Regarding OP-TEE, there are no logs showing its successful startup in either our kernel or U-Boot, so I believe OP-TEE is not being invoked in my scenario, and therefore I'm not considering this direction.
    Regards,
    Xi