AM6442: Codesys performance problems

Manuel Philippin

Intellectual 1335 points

Part Number: AM6442
Other Parts Discussed in Thread: SK-AM64B

Edited March 26 2025

Hello Sitara team,

my customer has developed a small-scale PLC based on the AM6442, running Codesys as an EtherCAT Master.

They are trying to achieve the best CODESYS EtherCAT performance (Max. Cycle Time) with the AM6442.
First I will describe the problem (in the attached presentation) and then the measures we have taken to improve the cycle time.

4863.CODESYS Performance.pptx

Some basic information:
For our tests we are using the SK-AM64B evaluation board (SK-AM64B Evaluation board | TI.com).
We are using Yocto to build custom firmware based on SDK 09.02.00 with the RT patch added.

root@plcnext:~# uname -a
Linux plcnext 6.1.83-rt28-ti-rt-g96b0ebd82722 #1 SMP PREEMPT_RT Mon May 13 23:06:24 UTC 2024 aarch64 GNU/Linux

We are using a very simple CODESYS project with almost no code, only the code recommended for Trace. The EtherCAT Master is running and cyclically exchanges process data with a Beckhoff EK1100 and EL2252.
We have a USB dongle for CODESYS licensing.

This is the Codesys project:
https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/3324.Trace_5F00_demo.project

The goal is to rotate 10 servo drives in SoftMotion at a 1 msec cycle time (corrected from 1 usec).

Is this possible?

Do you see other measures to improve CODESYS Max. Cycle time?

Best regards
Manuel

6 months ago

Nick Saulnier 5 months ago

TI Guru 100980 points

Hello Manuel,

The thread owner is out of the office for the rest of March. Feel free to ping the thread in early April if you have not received a response.

Regards,

Nick

Manuel Philippin 5 months ago in reply to Nick Saulnier

TI Intellectual 1335 points

Hello,

In the meantime I have pointed my customer to this similar thread: PROCESSOR-SDK-AM64X: XDP Support - Processors forum - Processors - TI E2E support forums

Unfortunately, changing the ksoftirqd priority to FIFO 52 has not improved the maximum cycle time.

Can you please help to debug the cause?

Best regards
Manuel

Pekka Varis 5 months ago in reply to Manuel Philippin

TI Mastermind 27050 points

Have you contacted Codesys? What is their expected performance for a dual core A53 at 1GHz? Step through all the guidance they and others have:

https://content.helpme-codesys.com/en/CODESYS%20Control/_rtsl_performance_optimization_linux.html

https://www.linutronix.de/blog/A-Checklist-for-Real-Time-Applications-in-Linux

Manuel Philippin said:
We have a USB dongle for CODESYS licensing.

It is very likely that the latency outliers are related to USB and to CodeMeter, the license server Codesys uses. Have you run the same test without USB involved, using some other licensing method or just the demo version that does not require USB? In our Codesys test runs the worst offender we saw was the licensing infrastructure and the related USB drivers.

Is there a specific reason you are using AM6442? For running Codesys I'd recommend https://www.ti.com/product/DRA821U , https://www.ti.com/product/AM62P or https://www.ti.com/product/AM67 . These should perform significantly better even without tuning the RT Linux setup.

Pekka

Pekka Varis 5 months ago

TI Mastermind 27050 points

Manuel Philippin said:
The goal is to rotate 10 servo drives in SoftMotion at a 1 msec cycle time (corrected from 1 usec).

Is this possible?

Good clarification, I was assuming 1 millisecond. The theoretical best EtherCAT cycle time is 31.25 us (microseconds). See https://www.ibv-augsburg.de/downloads/icECAT_EtherCAT_Master_Stack_Benchmark.pdf , which shows a 100 us cycle time benchmark breakdown on the AM64x R5 core. You'll also see 500 us A53 Linux results, so assuming the setup is tuned properly, a 1 ms EtherCAT master seems possible.

Pekka

Manuel Philippin 5 months ago in reply to Pekka Varis

TI Intellectual 1335 points

The customer has tested licensing without the USB dongle, but the results did not change much.

This is with USB licensing:

This is without:

Would it make any difference if they re-compiled Codesys with their own environment and toolchain?

Best regards
Manuel

Pekka Varis 5 months ago in reply to Manuel Philippin

TI Mastermind 27050 points

First question what is Codesys saying?

Second question why AM64x? Why not a more powerful device from the TI portfolio. AM64x A53 Linux performance is the worst of all the TI AM6x portfolio.

Manuel Philippin said:
Would it make any difference if they re-compiled Codesys with their own environment and toolchain?

Codesys is a black box binary, to my knowledge they do not allow source code access to their customers.

From the two screenshots, both runs look like they meet the 1000 us cycle time. The run without the USB dongle looks very short, 370 s compared with 1429 s, but both look like they meet 1000 us. To my knowledge (they should of course check with the vendor that sold them Codesys), "max cycle time" is the worst case; when it is below the desired cycle time, the schedule is being met.

Pekka

Manuel Philippin 4 months ago in reply to Pekka Varis

TI Intellectual 1335 points

The design with AM6442 is done already and performance was expected to be better based on the DMIPS compared to the Zynq solution used before, with lower core clock.

The issue is that the 1ms cycle time is only reached with one device in the loop.
With 8 devices the performance drops to ~3ms:

No. of Axes	PRG/LINE	Min Cycle Time	Average Cycle Time	Max Cycle Time	Difference	Recommended Max	Target Cycle Time
8	Idle	1348	1423	1710	287	3200	8000
8	CAM	1499	1565	1879	314	3200	8000
8	60000	2699	2765	2948	183	3200	8000
8	120000	4021	4089	4303	214	3200	8000
8	240000	6661	6722	6949	227	3200	8000
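
The rows above can be checked against a cycle budget with a few lines of awk. This is only a sketch: it assumes the table is saved tab-separated in the column order shown and that all times share one unit (presumably microseconds). The function name cycle_report and the file name cycles.tsv are made up for illustration. The difference column in the table equals max minus average, which is what the helper recomputes.

```shell
# Hypothetical helper: read tab-separated rows
# (axes, program, min, avg, max, ...) from stdin and print
# program, max-minus-average, and whether max fits the budget.
cycle_report() {
    budget="$1"
    awk -F '\t' -v budget="$budget" '
        $3 ~ /^[0-9]+$/ {    # skip the header row
            verdict = ($5 <= budget) ? "ok" : "over"
            printf "%s\t%d\t%s\n", $2, $5 - $4, verdict
        }'
}

# Usage, with the budget in the same unit as the table:
#   cycle_report 1000 < cycles.tsv   # against the 1 ms goal
#   cycle_report 3200 < cycles.tsv   # against the Recommended Max column
```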

This is their full task priority list:

There is no IPC between R5F and A53 running.

root@plcnext:~# uname -a
Linux plcnext 6.1.83-rt28-ti-rt-g96b0ebd82722 #1 SMP PREEMPT_RT Mon May 13 23:06:24 UTC 2024 aarch64 GNU/Linux

In a discussion with Thomas Schneider he mentioned that we have seen ~300us on AM6442 with 3 drives connected.

There seems to be something wrong with the overall Codesys configuration?

Regards
Manuel

Pekka Varis 4 months ago in reply to Manuel Philippin

TI Mastermind 27050 points

OK, so this seems quite a long way from the goal. For the reference device, do you have more specifics: which ZYNQ device? An old one like the ZYNQ-7000 with Cortex-A9s, or an Ultrascale+ with A53s? Have they followed all the same steps and kernel configurations that were used there to get the performance they needed?

It should be down to "normal" RT tuning and latency optimizations. This is something companies like Linutronix and BayLibre, or other high-end Linux contractors, are good at. But here are a few starting steps.

1. Let's make sure interrupt handling is not the issue. PREEMPT_RT uses the per-core ksoftirqd threads for softirq handling (https://bootlin.com/doc/training/preempt-rt/preempt-rt-slides.pdf ). What is their priority? Type:

ps aux | grep ksoftirq

Look at the PIDs of the two ksoftirqd threads (one per core). By default in our SDKs they do not use RT (FIFO) scheduling. Change them to a real-time priority such as FIFO 10, with commands like the ones below, assuming the PIDs are 13 and 27:

chrt -f -p 10 13 
chrt -f -p 10 27

2. In the full task priority list I see lots of stuff at RT priority (negative numbers) in the PRI+RT column. I think more than 90% of the entries in green should not be at that high a priority. Only things related to the Ethernet interface used with Codesys should be RT; everything else should not. RT is a zero-sum game: the priority only has value when as little as possible runs at high priority, so squash down everyone else so the ambulance can pass. The more there is at high priority, the less high priority is worth. Get the same printout on the ZYNQ and compare.

3. Make the system lean. A good rule of thumb is that the smaller the Linux kernel, the better the RT performance will be. Remove all services and kernel modules you don't need.

4. Start going down the list at https://www.linutronix.de/blog/A-Checklist-for-Real-Time-Applications-in-Linux
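
The PID lookup and the chrt calls from step 1 can be folded into one small helper. This is only a sketch: the function name ksoftirqd_chrt_cmds is made up for illustration, and it assumes pgrep and chrt are available on the target. It prints the commands instead of running them, so they can be checked before anything is changed.

```shell
# Hypothetical helper: read PIDs from stdin (one per line) and print
# one "chrt -f -p <prio> <pid>" command per PID. Nothing is changed
# until the printed commands are actually executed as root.
ksoftirqd_chrt_cmds() {
    prio="$1"
    while read -r pid; do
        # Skip blank or non-numeric lines
        case "$pid" in
            ''|*[!0-9]*) continue ;;
        esac
        echo "chrt -f -p $prio $pid"
    done
}

# Usage (as root on the target):
#   pgrep ksoftirqd | ksoftirqd_chrt_cmds 10        # review the commands
#   pgrep ksoftirqd | ksoftirqd_chrt_cmds 10 | sh   # apply them
#   chrt -p 13                                      # check one PID afterwards
```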

Pekka

Ivan Stoyanov 4 months ago in reply to Pekka Varis

Prodigy 20 points

Hello, and thanks for adding me to the forum and thank you for your support!

Changing the priority of the ksoftirqd threads doesn't improve the performance.

Pekka Varis said:
OK, so this seems quite a long way from the goal. For the reference device, do you have more specifics: which ZYNQ device? An old one like the ZYNQ-7000 with Cortex-A9s, or an Ultrascale+ with A53s? Have they followed all the same steps and kernel configurations that were used there to get the performance they needed?

For our old systems we are using ZYNQ 7020 based on Cortex-A9.

"Second question why AM64x? Why not a more powerful device from the TI portfolio. AM64x A53 Linux performance is the worst of all the TI AM6x portfolio."

This statement is quite interesting for me. What is the reason? And would the performance become worse if we use rpmsgs between all 4 R5 cores and ARM64 core?

Pekka Varis 4 months ago in reply to Ivan Stoyanov

TI Mastermind 27050 points

Why do you have all the interrupts as real-time at priority -51? You should only have the Codesys tasks, ksoftirq, maybe a couple other things as real-time. Everything else should be lower priority. Real-time is a zero sum game, you can really have only one thing prioritized, everyone else should be pushed down and suffer. Can you print out the tasks and priorities in your old system and the new evaluation system? Something like

uname -a
to get the kernel version

ps -ALo psr,policy,priority,pid,tid,cputime,comm
to get what is running and what is real-time
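
To see at a glance which threads are real-time, the ps output can be filtered on the policy column. A minimal sketch, assuming the procps version of ps, which prints FF for SCHED_FIFO and RR for SCHED_RR in that column; the function name rt_tasks is made up for illustration.

```shell
# Hypothetical helper: keep the header line plus every thread whose
# policy column (2nd field) is FF (SCHED_FIFO) or RR (SCHED_RR).
rt_tasks() {
    awk 'NR == 1 || $2 == "FF" || $2 == "RR"'
}

# Usage on the target:
#   ps -ALo psr,policy,priority,pid,tid,cputime,comm | rt_tasks
```

Running this on both the old and the new system gives two short lists that are easier to compare than the full printouts.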

Manuel Philippin said:
performance was expected to be better based on the DMIPS compared to the Zynq solution

DMIPS measures average performance with a warm L1 cache. It has nothing to do with real-time performance, and its correlation with the performance of a complex Linux application is almost non-existent.

Ivan Stoyanov said:
"Second question why AM64x? Why not a more powerful device from the TI portfolio. AM64x A53 Linux performance is the worst of all the TI AM6x portfolio."

This statement is quite interesting for me. What is the reason? And would the performance become worse if we use rpmsgs between all 4 R5 cores and ARM64 core?

What are you trying to do? Codesys does not utilize the R5's. For Codesys Linux performance you just want A-cores, max clock speed and cache, and DDR performance. Generally yes, the more things you have running in parallel, the worse Linux real-time performance will be.

Manuel Philippin said:
This is with USB licensing:

This is without:

The results in these screenshots look like they meet the 1 ms cycle time and are 5x better than:

Manuel Philippin said:

No. of Axes	PRG/LINE	Min Cycle Time	Average Cycle Time	Max Cycle Time	Difference	Recommended Max	Target Cycle Time
8	Idle	1348	1423	1710	287	3200	8000
8	CAM	1499	1565	1879	314	3200	8000
8	60000	2699	2765	2948	183	3200	8000
8	120000	4021	4089	4303	214	3200	8000
8	240000	6661	6722	6949	227	3200	8000

Can you elaborate what is different?

Pekka