AM6442: IPC performance

Part Number: AM6442

Hi, 

I checked the documentation online but I could not find the information I was looking for. Could you please answer the following:

  • What is the minimum latency when communicating between 1x Cortex-R5F (RTOS) and 1x Cortex-R5F (RTOS)
    • When using IPC RPMessage
    • When using IPC Notify 
  • What is the minimum latency when communicating between 1x Cortex-R5F (RTOS) and 1x Cortex-A53 (Linux)
    • When using IPC RPMessage, transmitting from RTOS to Linux
    • When using IPC RPMessage, transmitting from Linux to RTOS
  • What is the fastest way to communicate between the cores?

Regards,

Geoffrey

  • Hello Geoffrey,

    It does not look like we are publishing an AM64x MCU+ SDK Performance Guide at this point in time. I can look into whether that is planned for the future if you want.

    RTOS / NORTOS to RTOS / NORTOS

    I would need another day or so to look at what tests have already been done by the team. I know for sure that we tested IPC Notify average round-trip latency between RTOS cores (R5Fs and the M4F). For SDK 8.1, the average one-way latency for IPC Notify from one R5F to any other R5F was less than 1us, while the average one-way latency for IPC Notify from an R5F to the M4F was less than 2us. IPC Notify is the fastest IPC software TI provides for communication between RTOS/NORTOS cores.

    We could pretty easily test worst-case round-trip latency between RTOS cores. The team has not looked at taking one-way worst-case measurements at this point in time. That is not to say that one-way worst-case latency is impossible to measure, but it is definitely trickier: you need to synchronize timebases across the different cores, and once we are talking about measurements with precision in the range of tens or hundreds of nanoseconds, factors like read latency to shared memory, interrupt propagation delays, etc. can start to affect the measurements.
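
    To make this concrete, here is a rough, untested sketch of how a round-trip measurement could look on the R5F side with the MCU+ SDK IPC Notify API. The client ID is an arbitrary assumption, the remote core is assumed to echo every message back on the same client ID, and the exact callback signature should be verified against your SDK version:

    ```c
    /* Hypothetical sketch: IPC Notify round-trip latency on one R5F.
     * Assumes the remote core echoes each message back on ECHO_CLIENT_ID.
     * Verify all signatures against your MCU+ SDK version. */
    #include <kernel/dpl/ClockP.h>
    #include <kernel/dpl/DebugP.h>
    #include <kernel/dpl/SemaphoreP.h>
    #include <kernel/dpl/SystemP.h>
    #include <drivers/ipc_notify.h>

    #define ECHO_CLIENT_ID   (4u)     /* assumed-free client ID on both cores */
    #define NUM_ITERATIONS   (1000u)

    static SemaphoreP_Object gEchoSem;

    /* Runs in interrupt context when the remote core's reply arrives */
    static void echoCallback(uint16_t remoteCoreId, uint16_t localClientId,
                             uint32_t msgValue, void *args)
    {
        SemaphoreP_post(&gEchoSem);
    }

    void ipcNotifyRoundTripTest(uint32_t remoteCoreId)
    {
        uint64_t start, totalUsec = 0;

        SemaphoreP_constructBinary(&gEchoSem, 0);
        IpcNotify_registerClient(ECHO_CLIENT_ID, echoCallback, NULL);

        for (uint32_t i = 0; i < NUM_ITERATIONS; i++)
        {
            start = ClockP_getTimeUsec();
            IpcNotify_sendMsg(remoteCoreId, ECHO_CLIENT_ID, i, 1);
            SemaphoreP_pend(&gEchoSem, SystemP_WAIT_FOREVER);
            totalUsec += ClockP_getTimeUsec() - start;
        }
        /* The semaphore wakeup adds scheduling overhead on top of the raw
         * notify latency, so treat this as an upper bound on the average. */
        DebugP_log("avg round trip: %d us\r\n",
                   (uint32_t)(totalUsec / NUM_ITERATIONS));
    }
    ```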

    IPC RPMsg between two R5F cores will vary based on message size (larger messages take longer), and might be affected by whether you are messaging between cores in the same R5F subsystem or across subsystems. At this point, I do not have a feel for the time an IPC RPMsg message takes as opposed to IPC Notify + shared memory.
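
    For reference, the RTOS-side RPMsg flow looks roughly like the untested sketch below. The endpoint numbers are arbitrary assumptions, SysConfig normally generates the underlying IPC initialization, and the function signatures should be checked against your SDK version:

    ```c
    /* Hypothetical sketch: RPMessage send/receive between two RTOS cores.
     * Endpoint numbers are arbitrary; verify signatures against your SDK. */
    #include <string.h>
    #include <stdint.h>
    #include <kernel/dpl/SystemP.h>
    #include <drivers/ipc_rpmsg.h>

    #define LOCAL_ENDPT    (14u)   /* assumed-free endpoints */
    #define REMOTE_ENDPT   (14u)

    static RPMessage_Object gRecvMsgObj;

    void rpmsgPingPong(uint16_t remoteCoreId)
    {
        RPMessage_CreateParams createParams;
        char     txBuf[64] = "hello";
        char     rxBuf[64];
        uint16_t rxLen = sizeof(rxBuf);
        uint16_t srcCoreId, srcEndPt;

        /* Create a local endpoint to receive the reply on */
        RPMessage_CreateParams_init(&createParams);
        createParams.localEndPt = LOCAL_ENDPT;
        RPMessage_construct(&gRecvMsgObj, &createParams);

        /* Unlike IPC Notify's single 32-bit value, the payload is copied
         * through a shared-memory VRING, so larger messages take longer */
        RPMessage_send(txBuf, strlen(txBuf) + 1, remoteCoreId,
                       REMOTE_ENDPT, LOCAL_ENDPT, SystemP_WAIT_FOREVER);

        RPMessage_recv(&gRecvMsgObj, rxBuf, &rxLen,
                       &srcCoreId, &srcEndPt, SystemP_WAIT_FOREVER);
    }
    ```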

    RTOS / NORTOS to Linux 

    I have not run benchmarks on IPC RPMsg communication between R5 and Linux A53 on AM64x. Is there a specific timeframe that the customer is looking for? Keep in mind that just interrupting the Linux OS and getting it to switch tasks once an interrupt is received can take a non-deterministic period of time. If the customer has latency needs for RTOS / NORTOS to Linux IPC, we suggest they use RT Linux (and even then, keep in mind Linux will not respond as quickly as an RTOS would).
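
    On the Linux side, RPMsg is typically reached from userspace through an rpmsg character device. A minimal sketch, assuming an already-created /dev/rpmsg0 node bound to the R5F endpoint (node naming and endpoint setup vary; TI's rpmsg_char support normally handles that part):

    ```c
    /* Hypothetical sketch: Linux userspace exchanging RPMsg messages with
     * an R5F through an rpmsg character device. Assumes /dev/rpmsg0 exists. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[496];   /* virtio rpmsg payload limit: 512 bytes minus header */
        int fd = open("/dev/rpmsg0", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* Each write() becomes one rpmsg message to the remote endpoint */
        write(fd, "ping", 5);

        /* read() blocks until the R5F sends a message back */
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0)
            printf("got %zd bytes from R5F\n", n);

        close(fd);
        return 0;
    }
    ```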

    Regards,

    Nick

  • Hi Nick, 

    Thank you for the answer. The goal here is to optimize the communication between Linux RT and the Cortex-R5F core for low latency.

    In the training here: https://training.ti.com/process-inter-processor-communication at 21:25, the speaker says that to optimize for low latency and/or high throughput, one can create character drivers that expose the interrupt and memory directly to userspace instead of going through a couple of copies. Then the userspace application and the RTOS can both directly access the memory themselves, somewhat similar to an IPC Notify implementation.

    Do you have more information about it?

    Regards,

    Geoffrey

  • Hello Geoffrey,

    Apologies for the delayed response. That speaker was me. At this point TI has not developed an example character / UIO driver that customers could use as a template for custom IPC. I am actually working on putting together a requirement for a shared memory / low latency IPC offering, so if you could send me a direct email with the customer information and anything else that would help build a business case for the development, that would be very helpful.
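
    For reference, the generic Linux UIO pattern looks roughly like the sketch below. This is hypothetical: it assumes a UIO driver (e.g. uio_pdrv_genirq) has been bound to a device tree node exposing a shared memory region and a mailbox interrupt, which is exactly the piece TI has not developed yet:

    ```c
    /* Hypothetical sketch: generic Linux UIO usage. Assumes a UIO device
     * that exposes a shared memory region and an interrupt; TI does not
     * currently ship such an example. */
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define SHMEM_SIZE 0x1000   /* hypothetical region size */

    int main(void)
    {
        int fd = open("/dev/uio0", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* Map the shared memory region directly into userspace: no copies */
        volatile uint32_t *shmem = mmap(NULL, SHMEM_SIZE,
                                        PROT_READ | PROT_WRITE,
                                        MAP_SHARED, fd, 0);
        if (shmem == MAP_FAILED) { perror("mmap"); return 1; }

        uint32_t irq_count;
        /* read() blocks until the device interrupt fires, then returns the
         * total interrupt count. This replaces the kernel-driver copies
         * with a single wakeup plus direct memory access. */
        read(fd, &irq_count, sizeof(irq_count));
        printf("interrupt %u, data word: 0x%08x\n", irq_count, shmem[0]);

        munmap((void *)shmem, SHMEM_SIZE);
        close(fd);
        return 0;
    }
    ```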

    Regards,

    Nick

  • Hello Geoffrey,

    Summarizing our offline discussion here.

    CUSTOMER USE CASE

    One R5F core runs the EtherCAT stack. Another core does computations. The customer needs IPC between the cores that sends an interrupt or mailbox notification when data is ready to be processed, targeting low latency (potentially ~20us one-way).

    Something else to keep in mind: what is the required cycle time? i.e., an output is required by the EtherCAT core within a specific amount of time after reading an input. How much time does the entire system have to read the input, do all the processing and computations, and provide an output? The latency requirements to notify another core to do processing will be dictated by this overall cycle time.

    QUESTION: How do I design my multicore system so that computations occur within a set cycle time? 

    ANSWER

    Introduction

    The "cycle time" is the maximum allowable time for a system to receive an input, do some processing, and provide an output.

    If a single processor core is doing all the work, then the time required to complete the computing cycle looks like
    Input arrives at processor --> time for input to be detected --> core does processing --> time for output to be sent out of the chip and arrive at the destination

    If multiple cores are in the control loop, then inter-processor communication (IPC) latencies also have to be taken into account. "core does processing" above expands to
    core 1 does processing --> IPC latency to core 2 --> core 2 does processing --> IPC latency to core 1 --> etc
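
    For example (numbers purely illustrative, not measured): with a 100us cycle time, a budget might look like 5us input detection + 40us processing on core 1 + 2us IPC to core 2 + 35us processing on core 2 + 2us IPC back + 10us output = 94us, leaving 6us of margin. The IPC latency you can tolerate falls directly out of this kind of budget.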

    Different systems can be optimized for different use cases: data throughput, worst-case latency, and average latency are all factors that might need to be taken into account. A use case that requires high throughput of data may not care about worst-case latency, and so on.

    This discussion will focus on short cycle times (i.e., 10s to 100s of microseconds (us)). Short cycle times in turn require low latency IPC. The discussion is provided as a general introduction to basic concepts. TI cannot provide full training on system design. You are the expert on your use case. When in doubt, try searching the internet for more information.

    RTOS / Bare metal cores 

    Real-time operating systems (RTOS) perform tasks in "real time" (which actually means "a known, deterministic time"). This means that designs with short cycle times typically want to use cores that are running RTOS or bare metal.

    As of SDK 8.1, MCU+ SDK FreeRTOS and NORTOS are NOT supported on the A53 (https://software-dl.ti.com/mcu-plus-sdk/esd/AM64X/08_01_00_36/exports/docs/api_guide_am64x/RELEASE_NOTES_08_01_00_PAGE.html#EXPERIMENTAL_FEATURES ). So we will focus on the other cores.

    If the customer is using an AM642x or AM644x device, then there will be 2 or 4 R5F cores total, respectively. And there is an M4F core on all AM64x devices. As of SDK 8.1, TI supports general purpose development on the M4F core, and supports loading the M4F core from Linux. Average one-way latency from any R5F/M4F to any other R5F/M4F is less than 2us when using IPC Notify (https://software-dl.ti.com/mcu-plus-sdk/esd/AM64X/08_01_00_36/exports/docs/api_guide_am64x/DRIVERS_IPC_NOTIFY_PAGE.html ).

    Linux cores 

    Linux is one kind of high level operating system (HLOS). It is more complex than an RTOS. While Linux is a good design choice for use cases like human-machine interaction (HMI), the complexity means that it can take longer than an RTOS to complete a task, and the time it takes to complete that task can vary (i.e., it is non-deterministic).

    In general, we do not suggest using Linux cores in the processing chain for control loops with short cycle times. For example, an AM64x motor control application may use just the PRU_ICSSG and R5F cores in the motor control loop, but it will probably use Linux to update the HMI outside of the control loop. In this use case, if the motor control calculations were delayed by 1ms, something might break; by contrast, if the user's touchscreen is updated 1ms late, nothing bad happens.

    But what if the design needs a Linux core to be in the control loop?

    Use RT Linux 

    Use RT Linux instead of regular Linux if the Linux core must be in the control loop. RT Linux has been modified to be more real-time than regular Linux. Note that RT Linux is not suddenly a true RTOS! There will be fewer edge cases where a process takes longer to complete than expected, but edge cases may still exist. It is up to the customer to rigorously test RT Linux to make sure it fulfills their use case over long periods of testing.
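
    Independent of which IPC is used, the usual first step for a latency-sensitive RT Linux application is to lock its memory and run under a real-time scheduling policy. A minimal sketch of that generic POSIX pattern (not TI-specific):

    ```c
    /* Sketch: standard RT Linux setup for a latency-sensitive process:
     * lock memory, then run under the SCHED_FIFO real-time scheduler. */
    #include <stdio.h>
    #include <string.h>
    #include <sched.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Lock all pages so a page fault never stalls the RT path */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }

        /* Run this process at a high fixed real-time priority */
        struct sched_param sp;
        memset(&sp, 0, sizeof(sp));
        sp.sched_priority = 80;   /* 1..99; leave headroom for kernel RT tasks */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");
            return 1;
        }

        /* ... latency-sensitive loop goes here ... */
        return 0;
    }
    ```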

    Understand your system's interrupt response time 

    Most of the time when Linux is involved in a control loop, we need to care about the interrupt response time.

    When we look at the latency for sending a mailbox / interrupt to a Linux core, it looks like this:
    RTOS/Bare metal core writes to mailbox / interrupt --> Signal latency for the mailbox / interrupt to travel through the processor --> interrupt response time (Linux receives the interrupt, reaches a stopping point in the current task, context switches to store the data for the current task, and switches to the interrupt handler)

    Interrupt response time impacts the system even if the Linux core is periodically polling for updates instead of waiting for an interrupt. Let's say we have a high priority Linux application that polls a remote core for data every 250 us (https://www.ti.com/tool/TIDA-01555. See Linux code here: https://git.ti.com/cgit/apps/tida01555/tree/ARM_User_Space_App/arm_user_space_app.c). This use case is NOT actually polling exactly every 250 us. Instead, the timing is impacted by the interrupt response time:
    nanosleep starts background timer, application sleeps --> Linux switches to a different thread --> timer interrupt goes off after 250 us --> interrupt response time --> Linux returns to userspace application
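
    For reference, a minimal sketch of that kind of periodic poll loop, loosely modeled on the TIDA-01555 userspace app (generic POSIX code; the work done per wakeup is a placeholder):

    ```c
    /* Sketch: 250 us periodic poll loop using an absolute timer. The
     * interrupt response time still applies at every timer wakeup. */
    #define _POSIX_C_SOURCE 200112L
    #include <time.h>

    #define PERIOD_NS 250000L   /* 250 us */

    static void timespec_add_ns(struct timespec *t, long ns)
    {
        t->tv_nsec += ns;
        while (t->tv_nsec >= 1000000000L) {
            t->tv_nsec -= 1000000000L;
            t->tv_sec++;
        }
    }

    int main(void)
    {
        struct timespec next;
        clock_gettime(CLOCK_MONOTONIC, &next);

        for (;;) {
            timespec_add_ns(&next, PERIOD_NS);
            /* Absolute sleep avoids drift from processing time; the wakeup
             * itself is still delayed by the interrupt response time. */
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);

            /* ... check the remote core / shared memory data here ... */
        }
        return 0;
    }
    ```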

    Ok, so the interrupt response time will contribute towards the IPC latency. What kind of interrupt response time does your system need to plan for?

    Linux is so complex that there is no way to create a theoretical model that predicts interrupt response time. The best way to discover interrupt response times for your system is to generate a Linux build similar to your design, and run tests to experimentally determine what latencies can be expected. Cyclictest is a good starting point for these tests. For example, the out-of-the-box cyclictest results for AM64x SDK 8.1 are here: https://software-dl.ti.com/processor-sdk-linux/esd/AM64X/08_01_00_39/exports/docs/devices/AM64X/RT_Linux_Performance_Guide.html#maximum-latency-under-different-use-cases .

    For AM64x Linux SDK 8.1, the worst-case RT Linux interrupt response time that was observed was 72 us. If I were prototyping a system and got this as the worst-case result, then my control loop design would need to take this interrupt response time into account.

    Other notes about cyclictest:
    * The latencies for the SDK numbers are in microseconds (us)
    * The performance guide test uses create_cgroup to move as many tasks as possible from the RT core to the non-RT core. However, there ARE Linux tasks that will not move from one core to the other. These tasks can continue to play a role in the worst-case max latencies
    * You can find additional discussions around running cyclictest at https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1033771/am6442-latency-in-linux-rt-is-well-above-expected-values/3835092#3835092 
    * Different Linux builds will have different threads that contribute to different max latencies. One option is to start from the minimal build (e.g., the tiny filesystem found in the AM64x Linux SDK under filesystem/), see the cyclictest results, and then see how that changes as you add other Linux features needed in your system. A design will need to find the compromise between the real time task and the system calls and services needed by the wider system

    Can I get around the interrupt response time?

    Keep in mind that the following thoughts are hypothetical. TI has NOT tested them, and we do NOT necessarily recommend them. This is just to help customers consider their options.

    If the A53 core is constantly reading from the memory location, then there is no interrupt response time. The total latency to notify the Linux core from the R5F core reduces to

    R5F write to memory location --> A53 Linux read latency from memory location

    Is that possible?

    One option is to do a two-stage notification: use an interrupt, RPMsg, or longer poll time to tell the Linux core when it needs to start watching for a low latency message. Once Linux knows that the low latency message is coming soon, then Linux just constantly reads the memory location without allowing any other threads to run on the Linux core until the R5F write occurs.
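
    A minimal sketch of what that second stage could look like, assuming the shared flag has already been mapped into userspace (e.g. via a UIO or /dev/mem mapping as in the earlier sketch); note the bounded spin, whose timeout edge case is discussed below:

    ```c
    /* Sketch of the second stage: once pre-notified, busy-read a shared
     * memory flag with a bounded spin so other threads are not starved
     * forever. The flag address and its mapping are hypothetical. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    #define SPIN_TIMEOUT_NS 2000000L   /* 2 ms upper bound on the spin */

    static long elapsed_ns(const struct timespec *a, const struct timespec *b)
    {
        return (b->tv_sec - a->tv_sec) * 1000000000L
             + (b->tv_nsec - a->tv_nsec);
    }

    /* flag points at a shared memory word that the R5F writes */
    bool wait_for_r5f(volatile uint32_t *flag)
    {
        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);

        while (*flag == 0) {            /* pure reads: no interrupt needed */
            clock_gettime(CLOCK_MONOTONIC, &now);
            if (elapsed_ns(&start, &now) > SPIN_TIMEOUT_NS)
                return false;           /* the timeout edge case below */
        }
        return true;
    }
    ```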

    The AM64x has two A53 cores. So another option is to isolate one A53 core and dedicate it purely to reading the memory location until a message is received. The downside of this option is that your computing power is cut in half for all other Linux applications.
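
    Isolating a core is typically a combination of the isolcpus=1 kernel command-line argument and pinning the polling thread to that core. A minimal sketch of the pinning part (generic Linux, not TI-specific):

    ```c
    /* Sketch: pin the polling thread to A53 core 1, which has been removed
     * from the general scheduler with the isolcpus=1 kernel argument.
     * Kernel threads like the arch timer can still run on the isolated
     * core, per the edge case discussed below. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static int pin_self_to_core1(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(1, &set);   /* core 1 of the dual A53 cluster */
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        if (pin_self_to_core1() != 0) {
            fprintf(stderr, "failed to set CPU affinity\n");
            return 1;
        }
        /* ... run the busy-poll loop from the previous sketch here ... */
        return 0;
    }
    ```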

    However, there are other challenges that the system designer needs to keep in mind. For the two-stage notification option, the constant memory reads will need to have a timeout mechanism to make sure Linux does not starve all the other threads. What if the thread times out before the low latency notification happens? Then an edge case occurs where the Linux core does not respond to the R5F core notification in time.

    Even if we dedicate a core purely to reading a shared memory location, there are issues. If you set the thread priority high enough, you can prevent most other threads from running on the Linux core... except for the arch timer (in our limited experiments). And when the arch timer takes control, the userspace application is not reading for however long the sequence interrupt response time --> arch timer handling --> context switch back to userspace application takes. If the notification is sent while the arch timer is in control, this is another edge case where the system would not meet the cycle time requirements.

    SUMMARY 

    1) Keep the overall system design in mind, including the required overall cycle time / control loop.

    2) RTOS / bare metal cores are designed to respond within a known amount of time. These cores are typically preferred when designing applications with short cycle times (in the tens or hundreds of microseconds (us)).

    3) It is technically possible to include a Linux core in the processing chain for a control loop. However, the complexity of the Linux OS and the interrupt response time makes it very difficult to get Linux to behave 100% deterministically within short cycle times.

    Regards,

    Nick

  • FYI, I am editing the response above based on feedback from other team members. Might need to edit the response again in another couple of days

    -Nick

    Ok, I got a couple of things wrong in the second edit. Modifying the response above a third time, and starting to change it more into a form where I can turn it into an e2e FAQ

  • Hi Nick, 

    Thank you for this detailed answer!

    Just one question: in the mcu_plus_sdk_am64x_08_01_00_36 SDK there are many examples running RTOS on the A53, and the release notes mention it is indeed possible. Why do you say it is not supported? Maybe you meant AMP Linux/RTOS is not?

    Regards,

    Geoffrey

  • Hello Geoffrey,

    In this release, there is a difference between code existing in the SDK, and code being supported by TI.

    Note the "Attention" box on the home page of the SDK docs:
    https://software-dl.ti.com/mcu-plus-sdk/esd/AM64X/08_01_00_36/exports/docs/api_guide_am64x/index.html 

    Take a closer look at the Experimental Features link:
    https://software-dl.ti.com/mcu-plus-sdk/esd/AM64X/08_01_00_36/exports/docs/api_guide_am64x/RELEASE_NOTES_08_01_00_PAGE.html#EXPERIMENTAL_FEATURES

    Per that page, A53 NORTOS and A53 FreeRTOS are "early versions and should be considered as 'experimental'. Users can evaluate the feature, however the feature is not fully tested at TI side. TI would not support these feature on public e2e."

    Regards,

    Nick