
[FAQ] Sitara multicore system design: How to ensure computations occur within a set cycle time?

Part Number: AM6442

I am designing a system with a Sitara multicore device (e.g., AM24x, AM64x, AM65x). The processor needs to complete certain activities within a set cycle time. How do I design my system to make sure that I can meet that cycle time?

The "cycle time" is the maximum allowable time for a system to receive an input, do some processing, and provide an output. For example, a motor control application may have a 10 microsecond (uS) cycle time to measure the angle and speed of the motor, do some math, and then update the PWM output that controls the motor.

  • INTRODUCTION

    This discussion is provided as a general introduction to basic concepts. TI cannot provide full training on system design. You are the expert on your use case. When in doubt, try searching the internet for more information.

    Different systems can be optimized for different use cases: data throughput, worst-case latency, and average latency are all factors that might need to be taken into account. A use case that requires high data throughput may not care about worst-case latency, and so on.

    This discussion will focus on a use case that requires short cycle times (i.e., 10s to 100s of microseconds (uS)). That means the system needs to be optimized to reduce the worst-case latency.

    What contributes to cycle time?

    If a single processor core does all the work, then the time required to complete the computing cycle looks like this:
    Input arrives at processor --> time for input to be detected --> processing occurs --> time for output to be sent out of the chip and arrive at the destination

    If multiple cores are in the control loop, then inter-processor communication (IPC) latencies also have to be taken into account. In the above flow, "processing occurs" could expand to something like this:
    core 1 does processing --> IPC latency to core 2 --> core 2 does processing --> IPC latency to core 1 --> etc

    Short cycle times therefore require low-latency IPC.
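
    As a rough illustration of how these contributions add up, here is a made-up budget for a 10 uS (10,000 ns) cycle time. Every number below is a placeholder; real numbers have to come from device documentation and from measurements on your own system.

        #include <stdio.h>

        /* Hypothetical budget for a 10 uS cycle time; every number is a placeholder */
        #define CYCLE_TIME_NS      10000
        #define INPUT_LATENCY_NS    1000   /* input arrives --> input is detected */
        #define PROCESSING_NS       5000   /* core 1 + core 2 processing */
        #define IPC_LATENCY_NS      2000   /* two one-way IPC hops */
        #define OUTPUT_LATENCY_NS   1500   /* output leaves the chip and reaches its destination */

        int main(void)
        {
            int total = INPUT_LATENCY_NS + PROCESSING_NS + IPC_LATENCY_NS + OUTPUT_LATENCY_NS;
            printf("worst case total: %d ns, budget: %d ns, margin: %d ns\n",
                   total, CYCLE_TIME_NS, CYCLE_TIME_NS - total);
            return 0;
        }

    If the worst-case total is larger than the cycle time, something has to change: faster processing, fewer IPC hops, or a different partitioning of the work across cores.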

  • RTOS / BARE METAL CORES

    Real-time operating systems (RTOS) perform tasks in "real time" (which actually means "a known, deterministic time"). Designs with short cycle times typically want to use cores that are running RTOS or bare metal (i.e., no operating system).

    AM24x: R5F, M4F
    AM64x: R5F, M4F, A53

    As of SDK 8.2, the AM64x MCU+ SDK lists A53 FreeRTOS and A53 NORTOS as an experimental feature. That means FreeRTOS and NORTOS on AM64x A53 are NOT supported by TI (https://software-dl.ti.com/mcu-plus-sdk/esd/AM64X/08_02_00_28/exports/docs/api_guide_am64x/RELEASE_NOTES_08_02_00_PAGE.html#autotoc_md37 ). So this discussion will focus on the other cores.

    R5F cores exist on AM24x, AM642x, and AM644x devices.

    M4F cores exist on all AM24x and AM64x devices. Starting in SDK 8.1, TI supports general-purpose development on the M4F core, and supports loading the M4F core from Linux. Average one-way latency from any R5F/M4F to any other R5F/M4F is less than 2 uS when using IPC Notify (https://software-dl.ti.com/mcu-plus-sdk/esd/AM64X/08_01_00_36/exports/docs/api_guide_am64x/DRIVERS_IPC_NOTIFY_PAGE.html ).
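
    As a minimal sketch, this is roughly what signaling another R5F/M4F core with IPC Notify looks like. The function names follow the IPC Notify documentation linked above, but the client ID, message value, and core ID macro are made-up examples, and exact signatures can vary between SDK versions, so start from the ipc_notify example in your MCU+ SDK rather than from this snippet.

        #include <stdint.h>
        #include <stddef.h>
        #include <drivers/ipc_notify.h>

        #define EXAMPLE_CLIENT_ID  (4u)   /* made-up client ID, must match on both cores */

        /* Receiver side: callback runs in interrupt context when a message arrives */
        static void cycleStartCallback(uint32_t remoteCoreId, uint16_t localClientId,
                                       uint32_t msgValue, void *args)
        {
            /* start the time-critical processing for this cycle */
        }

        void ipcSetup(void)
        {
            IpcNotify_registerClient(EXAMPLE_CLIENT_ID, cycleStartCallback, NULL);
        }

        /* Sender side: post a small message value to the other core without blocking */
        void ipcSendCycleStart(void)
        {
            IpcNotify_sendMsg(CSL_CORE_ID_R5FSS0_1, EXAMPLE_CLIENT_ID, 0x1u, 0u);
        }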

    AM65x: R5F, A53

    The AM65x RTOS SDK is supported on both the R5F and A53 cores.

    AM24x: PRU_ICSSG
    AM64x: PRU_ICSSG
    AM65x: PRU_ICSSG

    ICSSG cores run bare-metal software and are physically close to the PRU GPI/GPO pins. This makes the ICSSG cores a good candidate for getting signals into and out of the processor quickly.
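
    For illustration, PRU firmware typically reads GPI pins through the __R31 register and drives GPO pins through the __R30 register (TI PRU compiler). The pin numbers below are placeholders and the pins still have to be pinmuxed for PRU use; treat this as a sketch, not a tested example.

        #include <stdint.h>

        volatile register uint32_t __R30;   /* PRU GPO output register */
        volatile register uint32_t __R31;   /* PRU GPI input register */

        #define IN_PIN_MASK   (1u << 3)     /* placeholder GPI bit */
        #define OUT_PIN_MASK  (1u << 5)     /* placeholder GPO bit */

        void main(void)
        {
            while (1) {
                /* Mirror the input pin onto the output pin within a few PRU cycles */
                if (__R31 & IN_PIN_MASK) {
                    __R30 |= OUT_PIN_MASK;
                } else {
                    __R30 &= ~OUT_PIN_MASK;
                }
            }
        }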

  • LINUX CORES

    Linux is a high-level operating system (HLOS). It is more complex than an RTOS. While Linux is a good design choice for many use cases (e.g., human-machine interface (HMI)), the complexity of the OS means that it can take longer than an RTOS to complete a task. Additionally, the time it takes Linux to complete a task can vary (i.e., Linux is non-deterministic).

    In general, we do not suggest using Linux cores in the processing chain for control loops with short cycle times. For example, an AM64x motor control application may use just the PRU_ICSSG and R5F cores in the motor control loop, and then use Linux to update the HMI outside of the control loop. This is because if the motor control calculations are delayed so that the system misses the cycle time, something might break. By contrast, if the user's touchscreen is updated 0.5 ms late, nothing bad happens.

    But what if the design absolutely needs a Linux core to be in the control loop?

    Use RT Linux 

    Use RT Linux instead of regular Linux if the Linux core must be in the control loop. RT Linux has been modified to be more real-time than regular Linux. However, RT Linux is NOT a true RTOS! There will be fewer edge cases where a process takes longer to complete than expected, but edge cases may still exist. RT Linux cannot guarantee that timing will always be met.

    It is up to the customer to rigorously test RT Linux over long periods to make sure it fulfills their use case. These tests can demonstrate that an edge case is statistically unlikely. However, observing that RT Linux is unlikely to miss timing is NOT the same as guaranteeing that it will never miss timing.
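
    As a starting point, a time-critical thread on RT Linux is usually locked into memory and moved to a real-time scheduling policy, roughly like the sketch below. The priority value is an arbitrary example, and this is not a complete application.

        #include <sched.h>
        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
            struct sched_param sp = { .sched_priority = 80 };   /* example priority */

            /* Lock current and future pages so the thread never stalls on a page fault */
            if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
                perror("mlockall");
            }

            /* Run this thread under the SCHED_FIFO real-time policy */
            if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
                perror("sched_setscheduler");
            }

            /* ... time-critical work goes here ... */
            return 0;
        }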

    Understand your system's interrupt response time 

    Most of the time when Linux is involved in a control loop, we need to care about the interrupt response time.

    When we look at the latency for sending a mailbox or interrupt to a Linux core, it looks like this:
    RTOS/Bare metal core writes to mailbox or interrupt --> Signal latency for the mailbox or interrupt to travel through the processor --> interrupt response time (Linux receives the interrupt, reaches a stopping point in the current task, context switches to store the data for the current task, and switches to the interrupt handler)

    Interrupt response time impacts the system even if the Linux core is periodically polling for updates instead of waiting for an interrupt. Let's say we have a high-priority Linux application that polls a remote core for data every 250 uS. (For example, https://www.ti.com/tool/TIDA-01555. The Linux code is at https://git.ti.com/cgit/apps/tida01555/tree/ARM_User_Space_App/arm_user_space_app.c). The time between polls is actually slightly more than 250 uS. This is because the timing is impacted by the interrupt response time:
    nanosleep starts background timer, application sleeps --> Linux switches to a different thread --> timer interrupt goes off after 250 uS --> interrupt response time --> Linux returns to userspace application
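
    For illustration, the polling loop can be sketched like this (the linked arm_user_space_app.c is the actual reference; this snippet just shows why the measured period comes out a little above 250 uS):

        #include <stdio.h>
        #include <time.h>

        #define PERIOD_NS  (250 * 1000)   /* 250 uS polling period */

        int main(void)
        {
            const struct timespec period = { .tv_sec = 0, .tv_nsec = PERIOD_NS };
            struct timespec prev, now;
            clock_gettime(CLOCK_MONOTONIC, &prev);

            for (int i = 0; i < 1000; i++) {
                /* Relative sleep: the wakeup happens 250 uS plus the interrupt
                 * response time plus a context switch after this call */
                nanosleep(&period, NULL);

                clock_gettime(CLOCK_MONOTONIC, &now);
                long long observed_ns = (long long)(now.tv_sec - prev.tv_sec) * 1000000000LL
                                        + (now.tv_nsec - prev.tv_nsec);
                printf("observed period: %lld ns\n", observed_ns);   /* expect > 250000 */
                prev = now;

                /* ... poll the remote core's data here ... */
            }
            return 0;
        }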

    Ok, so the interrupt response time contributes to the IPC latency. What kind of interrupt response time does your system need to plan for?

    Linux is so complex that there is no way to create a theoretical model that predicts interrupt response time. The best way to determine interrupt response times for your system is to generate a Linux build similar to your design, and then run tests to experimentally determine what latencies can be expected. Cyclictest provides a good starting point. For example, the out-of-the-box cyclictest results for AM64x SDK 8.1 are here: https://software-dl.ti.com/processor-sdk-linux/esd/AM64X/08_01_00_39/exports/docs/devices/AM64X/RT_Linux_Performance_Guide.html#maximum-latency-under-different-use-cases .

    For more information about using cyclictest, refer to this FAQ: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1172055/faq-am625-how-to-measure-interrupt-latency-on-multicore-sitara-devices-using-cyclictest 

    For the AM64x Linux SDK 8.1, the worst-case interrupt response time observed on the RT Linux core was 72 uS. If I were prototyping a system and got this as my worst-case result, then my control loop design would need to take this interrupt response time into account.

    Other notes about cyclictest:
    * The latencies for the SDK numbers are in microseconds (uS)
    * The performance guide test uses create_cgroup to move as many tasks as possible from the RT core to the non-RT core. However, there ARE Linux tasks that will not move from one core to the other. These tasks can continue to play a role in the max latencies of the worst-case scenarios.
    * You can find additional discussions around running cyclictest at https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1033771/am6442-latency-in-linux-rt-is-well-above-expected-values/3835092#3835092 
    * Different Linux builds will have different threads that contribute to different max latencies. One option is to start from the minimal build (e.g., the tiny filesystem found in the AM64x Linux SDK under filesystem/), look at the cyclictest results, and then see how the latencies change as you add the other Linux features needed in your system. A design will need to find the compromise between the demands of the real-time task and the system calls and services needed by the wider system.

    Can I get around the interrupt response time?

    The following thoughts are hypothetical. TI has NOT tested them, and we do NOT necessarily recommend them. This is just to help customers consider their options.

    If the A53 core is constantly reading from a shared memory location, then there is no interrupt response time. The total latency to notify the Linux core from the R5F core reduces to this:

    R5F write to memory location --> A53 Linux read latency from memory location

    Is that possible?

    One option would be to do a two-stage notification: use an interrupt, RPMsg, or a longer poll time to tell the Linux core when it needs to start watching for a low-latency message. Once Linux knows that the low-latency message is coming soon, Linux just constantly reads the memory location, without allowing any other threads to run on the Linux core, until the R5F write occurs.
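
    A minimal sketch of the second (busy-poll) stage is shown below. It assumes the R5F firmware sets a 32-bit flag in a shared memory region at a made-up physical address, mapped through /dev/mem, and it gives up after a timeout so other threads are not starved forever. The address, mapping method, cache handling, and timeout are all system-specific choices that TI has not tested.

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <time.h>
        #include <unistd.h>

        #define SHARED_MEM_ADDR  0xA5000000UL     /* made-up address of the shared flag */
        #define TIMEOUT_NS       (500 * 1000LL)   /* give up after 500 uS */

        int main(void)
        {
            int fd = open("/dev/mem", O_RDWR | O_SYNC);
            if (fd < 0) { perror("open /dev/mem"); return 1; }

            volatile uint32_t *flag = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                           MAP_SHARED, fd, SHARED_MEM_ADDR);
            if (flag == MAP_FAILED) { perror("mmap"); return 1; }

            struct timespec start, now;
            clock_gettime(CLOCK_MONOTONIC, &start);

            /* Stage 2: spin on the flag the R5F will write, but with a timeout */
            while (*flag == 0) {
                clock_gettime(CLOCK_MONOTONIC, &now);
                long long elapsed = (long long)(now.tv_sec - start.tv_sec) * 1000000000LL
                                    + (now.tv_nsec - start.tv_nsec);
                if (elapsed > TIMEOUT_NS) {
                    fprintf(stderr, "timed out waiting for the R5F write\n");
                    break;
                }
            }

            /* ... handle the low-latency message here ... */
            munmap((void *)flag, 4096);
            close(fd);
            return 0;
        }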

    The AM64x A53 is a dual-core cluster. So another option is to isolate one A53 core and dedicate it purely to reading the memory location until a message is received. The downside of this option is that your computing power is cut in half for all other Linux applications.
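
    For the dedicated-core option, a sketch of pinning the polling process to one A53 core is shown below. It assumes that core has already been removed from the general scheduling pool (for example with the isolcpus kernel command line parameter); CPU 1 is just an example.

        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdio.h>

        int main(void)
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(1, &set);   /* example: A53 core 1, isolated via isolcpus=1 */

            /* Restrict this process to the isolated core */
            if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                return 1;
            }

            /* ... busy-poll the shared memory location here (see the sketch above) ... */
            return 0;
        }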

    However, there are other challenges that the system designer needs to keep in mind. For the two-stage notification option, the constant memory reads will need to have a timeout mechanism to make sure Linux does not starve all the other threads. What if the thread times out before the low-latency notification happens? Then an edge case could occur where the Linux core does not respond to the R5F core notification in time.

    Even if we dedicate a core purely to reading a shared memory location, there are issues. If you set the thread priority high enough, you can prevent most other threads from running on the Linux core... except for the archtimer (in our limited experiments). And when the archtimer takes control, the userspace application is not reading for however long it takes to go through interrupt response time --> archtimer handling --> context switch back to the userspace application. If the notification is sent while the archtimer is in control, this is another edge case where the system may not meet the cycle time requirements.