
SK-AM62: Real-time performance

Part Number: SK-AM62
Other Parts Discussed in Thread: AM625

Hello all,

I ran a series of tests using the PREEMPT_RT kernel (v5.10.120-rt70) on an AM625 SoC, aimed at assessing the real-time performance I can get from this device, specifically the worst-case latency for a user task waiting for timer events. My question is whether the results below accurately represent the typical worst-case latency we may expect (sorry for the long post, but the details may help).

The test configuration is as follows:

- PLL set to 25 MHz (boot switches)

- worst-case latency on timer events measured with the regular 'cyclictest' program from the rt-tests suite (clock_nanosleep() interface only).

- TI vendor kernel available from [1]

- kernel config tweaks: 

* disable ACPI (CONFIG_ACPI)

* force enable CPU_FREQ 'performance' governor
(CPU_FREQ_DEFAULT_GOV_PERFORMANCE), disable all other governors

* all kernel debug switches off

- 20-minute sampling loop running at 1 kHz, performed by a single thread. This may be way too short to observe the true worst case, but long enough in our case to observe high values already.

- the sampling thread was always pinned on a single CPU, either isolated (CPU2) or not (CPU1).

- a stress load was running in parallel to the test, composed of a dd loop continuously clearing memory and a 'hackbench' loop issuing a massive amount of context switches, all left freely running on the non-isolated CPUs.

Practically, the commands used were:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# dd if=/dev/zero of=/dev/null bs=128M&
# while :; do hackbench; done&
# cyclictest -a <cpu_nr> -p 98 -m -n -i 1000 -D 20m -q

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The 'isolcpus=2' boot parameter was added when testing the isolated CPU case. Proper CPU affinity for the sampling thread was double-checked.
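For anyone reproducing this, the affinity double-check can be done from the shell via the standard procfs interface; a minimal sketch (here inspecting the current shell as a stand-in for the cyclictest PID):

```shell
# Substitute the cyclictest PID for $$; for a specific thread, read
# /proc/<pid>/task/<tid>/status instead.
pid=$$

# Cpus_allowed_list should read "2" when pinned on the isolated CPU2.
grep '^Cpus_allowed_list' /proc/"$pid"/status

# Equivalent check with util-linux, if installed:
#   taskset -cp "$pid"
```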

The results are as follows, displayed as "worst-case(average)", all in microseconds,

     ISOLATED (CPU2)   |   NON-ISOLATED (CPU1)
         170(27)       |        368(59)

These figures seem high for this class of hardware, with significant disturbance/noise on the isolated CPU running the latency test, caused by activities running on other CPUs which move memory around and switch context at a high rate. As expected, it's much worse in the non-isolated case.

Any insight about those figures, and a way to get them down if possible would be much appreciated.

Sidenote: the xenomai4 EVL core, ported to this SoC on top of the TI base kernel [2], revealed the same impact of the non-rt stress load, with 130(10) and 284(40) respectively. This may rule out a PREEMPT_RT-specific issue, since the implementations have nothing in common. Another takeaway from this particular test: we could definitely see a (negative) impact of enabling transparent huge page support in the configuration when looking at the EVL figures. Ftracing tells us that this may have to do with I/D cache maintenance operations after fixing up page table entries, significantly delaying interrupts even though the CPU is not masking them. I could not check this for PREEMPT_RT, since this configuration is not supported.
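As a reproduction note, transparent huge pages can also be toggled at run time through the standard sysfs knob, no rebuild needed; a sketch (whether the knob is exposed depends on the kernel configuration, and writing it needs root):

```shell
# Current THP policy; the bracketed word is the active setting
cat /sys/kernel/mm/transparent_hugepage/enabled

# Disable THP entirely for an A/B latency comparison
echo never > /sys/kernel/mm/transparent_hugepage/enabled
```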

Thanks,

[1] git.ti.com/.../ at #gca705d5c043

[2] source.denx.de/.../v5.10.120

  • Phillippe,

    This seems similar to an issue we have been chasing for a little while on AM64x and AM62x. Set up a memory load pinned to one core, and all the other cores (1 with the dual-core AM64x, 3 with AM62x) show a significant jump in latency. In general the memory stressor settings don't seem to matter too much: swap in a bitmask for any of the 4 cores on AM62x in place of the 2 I have below, and the max cyclictest latency on the other 3 is quickly in the >300us range, while the core with the memory stressor stays more under control.

    taskset 2 stress-ng --memrate 1 --memrate-rd-mbs 100 --memrate-wr-mbs 100 &
    cyclictest -n -m -Sp91 -i400 -M

    I know this is not solving the issue, but I wanted to ask whether you have comments and whether you think this is the same issue. It does seem like the choice of DDR-intensive program makes no difference (lmbench bw_mem will also bring it up), nor do the exact parameters; they just cause some variation in how quickly this shows up.

    If I don't pin down the background memory-stressing program, the issue shows up more slowly.
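One footnote on the taskset invocation above, since it trips people up: the positional argument is a hex CPU bitmask, not a CPU number, so 'taskset 2' pins to CPU1. A quick sketch of the correspondence (the stress-ng line is just the command above restated with a CPU list):

```shell
# taskset <mask> <cmd> expects a bitmask: mask = 1 << cpu_nr,
# so CPU1 -> 2, CPU2 -> 4, CPU3 -> 8.
cpu_nr=1
mask=$((1 << cpu_nr))
printf 'CPU%d -> mask 0x%x\n' "$cpu_nr" "$mask"    # prints: CPU1 -> mask 0x2

# Equivalent, less error-prone CPU-list form:
#   taskset -c 1 stress-ng --memrate 1 --memrate-rd-mbs 100 --memrate-wr-mbs 100 &
```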

      Pekka

  • Hello Pekka,

    I ran a few tests more with xenomai4 on a mainline v5.19 kernel, and I believe that we are indeed observing the same issue:

    - running both the stressors and the latency measurement program (1 kHz loop) on the same (isolated) core yields < 45 µs worst-case, 10 µs on average, which is OK.

    - running the same test configuration on a kernel booted with maxcpus=1 yields figures in the same ballpark as above.

    - now, running a 10 kHz sampling loop (instead of 1 kHz) on an isolated CPU core with the stressors running freely on all other cores - i.e. the problematic case - yields < 60 µs worst-case, 2 µs on average, which is good too.

    (All tests ran only for a few minutes, so these worst-case figures may be optimistic, though probably not by a wide margin.)

    IOW, it looks like:

    - the busier the CPU core running the sampling loop, the rarer the issue.

    - the faster the sampling loop, the better the figures. This outcome is generally expected, thanks to hotter caches and less time for the non-rt activities to evict cache lines used by the rt side. However, in this case, increasing the sampling frequency seems to paper over the issue entirely, not just improve the figures marginally.
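For concreteness, the 10 kHz variant only changes the cyclictest interval relative to the 1 kHz runs; a sketch (the -D duration is an arbitrary choice here, the other flags are carried over from the first post):

```shell
# 100 µs interval = 10 kHz sampling loop, pinned on the isolated CPU2
cyclictest -a 2 -p 98 -m -n -i 100 -D 5m -q
```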

    I wonder if something related to the CPU idle states could be at work? On that front, a few more test results:

    - Disabling CPU_IDLE entirely, or leaving the PSCI idle driver out did not improve the figures.

    - Booting with idle=poll did not improve them either.

    For all those tests, CPU_FREQ was enabled, with the 'performance' governor forced on. Conversely, disabling CPU_FREQ did not fix the problem.
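To confirm what cpufreq/cpuidle actually expose at run time in each of these configurations, the standard sysfs interfaces can be inspected; a small sketch (the guards keep it harmless on kernels where either subsystem is compiled out):

```shell
# Effective cpufreq governor per core (path absent if CPU_FREQ is off)
for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    if [ -f "$f" ]; then printf '%s: %s\n' "$f" "$(cat "$f")"; fi
done

# cpuidle states of CPU2 with usage counters (absent if CPU_IDLE is off)
for d in /sys/devices/system/cpu/cpu2/cpuidle/state*; do
    if [ -d "$d" ]; then printf '%s: usage=%s\n' "$(cat "$d/name")" "$(cat "$d/usage")"; fi
done
```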