
AM6442: memcpy impacts real-time performance

Part Number: AM6442

While root-causing a cyclictest max latency issue, the customer found that memcpy() calls in their application affect the result.

Running an application that memcpy()s 1 MByte of data in a while (1) loop degrades the cyclictest result significantly. I thought the memcpy() in libc would be built with optimization; if it is similar to the DSP compiler, it would disable interrupts ahead of the loop kernel to avoid pipeline corruption.

Do you agree? Is there an alternative method to replace memcpy() in the SDK, such as DMA? If yes, is there an example of using DMA from user space?

Or should libc be rebuilt with some option like the DSP compiler's -i n? If libc is rebuilt, does the SDK need to be rebuilt with the new libc?

  • Good to hear about identifying a source of interrupt latency jitter. What libc are they using? The glibc that came with the TI SDK and GCC?

    While memcpy() is often extremely optimized for throughput (like https://github.com/ARM-software/optimized-routines/blob/master/string/aarch64/memcpy-advsimd.S ), it should not directly disable interrupts. Rather, it should use the vector side (also called NEON) and invoke write streaming (stop doing write-allocate, do write-through in the L2 cache) for maximum throughput. The read side will still allocate in the L2 cache. All this should not show up as more than a dozen or so cache misses' worth of impact on cyclictest, but it is worth digging into the exact libc used; maybe we included a version that has severe interrupt latency issues.

    There are many other similar tight loops, such as memset(), that arbitrary user-space code can call, so replacing just memcpy() with DMA is likely only a partial solution.

      Pekka

  • Pekka,

    It is from the SDK rootfs:

    gcc-arm-9.2-2019.12-x86_64-aarch64-none-linux-gnu

    root@am64xx-evm:/# find ./ -name libc.so.6
    ./lib/libc.so.6

    Is it using NEON instead of a CPU loop? Then it should not impact interrupt latency?

  • Is it using NEON instead of a CPU loop? Then it should not impact interrupt latency?

    Yes, the Arm-optimized library uses optimized routines in general and specifically in memcpy(). Below are two ways to find out the glibc version:

    root@am64xx-evm:~# ldd --version                  
    ldd (GNU Toolchain for the A-profile Architecture 9.2-2019.12 (arm-9.10)) 2.30
    Copyright (C) 2019 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    Written by Roland McGrath and Ulrich Drepper.
    root@am64xx-evm:~# /lib/libc.so.6
    GNU C Library (GNU Toolchain for the A-profile Architecture 9.2-2019.12 (arm-9.10)) stable release version 2.30.
    Copyright (C) 2019 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.
    There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
    PARTICULAR PURPOSE.
    Compiled by GNU CC version 9.2.1 20191025.
    libc ABIs: UNIQUE ABSOLUTE
    For bug reporting instructions, please see:
    <https://bugs.linaro.org/>.
    root@am64xx-evm:~# 

    There should not be a direct >100 us level impact on interrupt latency, but indirectly, via cache thrashing, a few microseconds of impact (DDR refresh, cache misses, ...) could happen. There must be some indirect mechanism causing the 100 us level scheduling delay.

      Pekka

  • Pekka,

    Thanks for the details. How about rebuilding libc with an interrupt threshold? Is there a build option like the DSP compiler's -i n? Does that make sense for Arm and a Linux system?

  • The C6000 DSP has an exposed-pipeline VLIW architecture, which results in interrupts being disabled in the tightest loops. The Arm architecture and most RISC architectures do not have this issue; interrupts are generally never disabled by the compiler in optimized loops on Arm.

    But the impact of various optimizations can have an effect, and compiling and linking against a different version of glibc might be worth investigating. It looks like https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain#Technical-Specifications and then https://www.gnu.org/software/libc/libc.html is where the latest code is.

      Pekka

  • Tony,

    I did some more research, especially on osadl.org, and ran some experiments. A consistent way to get good latency on a multicore system, and a Cortex-A53 multicore in particular, seems to be isolating CPU cores. So halting at U-Boot and passing the command line:

    optargs="isolcpus=1 nohz_full=1 rcu_nocbs=1"

    gets core index 1, so the second core on AM64x, into good interrupt latency performance (<<100 us on the isolated core, >>100 us on the non-isolated one). I'm trying this out on the tiny filesystem and also on the default one with a stress-ng memrate background load (which will only run on core index 0). The runs will take overnight, so I'll report back here if there are any conclusions.

    If you have not used the kernel command line before, a step-by-step guide is at https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1172055/faq-am625-how-to-measure-interrupt-latency-on-multicore-sitara-devices-using-cyclictest .

    I just added a couple more options to further restrict what can run on the isolated core. I think the optimized memcpy() probably contributes somehow, but considering the use case is to prioritize a real-time application such as the Codesys runtime, this is probably a useful path.

      Pekka

  • Here is a run with the optargs above, the tiny filesystem from the SDK, the background load (which will be scheduled on CPU0), and cyclictest:

    stress-ng --memrate 1 --memrate-rd-mbs 70 --memrate-wr-mbs 140 &
    cyclictest -l100000000 -n -m -Sp98 -i200 -h400 -q > output

    The results are plotted following the steps in https://www.osadl.org/Create-a-latency-plot-from-cyclictest-hi.bash-script-for-latency-plot.0.html

    So sub-100 us for the isolated core.

    Same for AM62x, except using the default filesystem and isolating CPU3:

    This shows sub-50 us on the isolated CPU3. So in this use case the larger cache, higher clock speed, and larger number of cores allow one core to do better. The example also shows that the large number of things running in the default filesystem seems to carry a latency penalty on the non-isolated cores.

      Pekka

  • Here is an AM64x SDK 8.6 default-filesystem run with stress-ng memrate running on the non-isolated core:

    So for exactly the same test case as the tiny-filesystem case above, with the default filesystem the performance is maybe 30% worse for both the isolated core and the background core.

  • Pekka,

    Thanks for sharing. It would help to identify what in the default filesystem impacts the latency result. Usually customers won't do this kind of optimization by themselves; if there were an optimization guide, maybe they would try it.

    The tiny filesystem is too simple to use as a reference.

  • We will add tracking of this to the RT Linux performance guide.

    But in general, to chase latency, a customer should not start with the default SD card image. If interrupt latency is the primary goal, the process must be to start from the minimum set of services and add only what is needed. The default filesystem does not target the best possible interrupt latency.

    The osadl.org builds are a good example of what to run if interrupt latency is the primary goal.

      Pekka