TDA4VH-Q1: Using rpmsg for a (soft) realtime system between Linux/A72 and MCU island R5F

Part Number: TDA4VH-Q1

We are building a system that includes an application running on an MCU island R5F core (mcu1_1). This application runs every 15 ms for about 10 ms. During this runtime it exchanges data multiple times with another application that runs on the A72. That application has a real-time (SCHED_FIFO) priority. The A72 runs Linux with the PREEMPT_RT patch set.

We planned to use rpmsg to implement the communication between these applications (using ti-rpmsg-char on the A72). It does work, and the communication time and general latency seem to be sufficient for our purposes. However, we have problems if a lower-priority Linux task happens to run on the same core as the A72 application. If this lower-priority task (e.g. some systemd housekeeping) runs for several milliseconds, receiving data from the R5F is delayed by about that amount of time. We have made sure that the involved interrupt handlers have a sufficiently high priority, but that is not enough.

In traces, we can see that the normal flow for our A72 application receiving data from the R5F is this:

- mbox interrupt handler for mcu1_0 (mbox-mcu-r5fss0-core0) runs

- mbox interrupt handler for mcu1_1 (mbox-mcu-r5fss0-core1) runs

- kworker for this core runs

- our A72 application is awakened and receives the data

The problem is that the kworker thread has a normal priority (SCHED_OTHER) and has to wait for other Linux tasks before it can start its work.

Are you aware of these kinds of problems? As far as we know, there is no easy way to change the default priority of these kworker threads in the Linux kernel. It is possible to change the priority of kworker threads that are currently running, but these threads are stopped and re-spawned regularly. Do you know of any Linux configuration options that could skip the use of a kworker and complete everything directly in the interrupt handler? Do you have other suggestions for this problem? We would like to avoid having to quarantine this core, so as not to waste CPU resources.

  • Hi Robert,

    - kworker for this core runs

    - our A72 application is awakened and receives the data

    The mailbox interrupt handler schedules the mailbox processing onto a workqueue; the workqueue function (executed by a kworker thread) is what processes the incoming rpmsg messages.
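
    In a simplified sketch (illustrative names, not the literal omap-mailbox code), the pattern is:

      #include <linux/interrupt.h>
      #include <linux/workqueue.h>

      /* The hard IRQ handler only queues a work item; the actual
       * message processing runs later in a kworker thread. */
      static struct work_struct mbox_rx_work;  /* INIT_WORK()'d at channel setup */

      static void mbox_rx_work_func(struct work_struct *work)
      {
              /* kworker context: drain the mailbox FIFO and invoke the
               * registered client (rpmsg) callbacks for each message */
      }

      static irqreturn_t mbox_irq_handler(int irq, void *dev_id)
      {
              schedule_work(&mbox_rx_work);  /* queued on the system workqueue */
              return IRQ_HANDLED;
      }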

    The kworker thread should be created during the mailbox channel creation and should remain the same. It does not get re-spawned (I am not sure what your definition of re-spawned is).

    Do you know of any Linux configuration options that could skip the use of a kworker and complete everything directly in the interrupt handler?

    No, we don't have support for this. This used to be done with tasklets, but tasklets are completely frowned upon now, and the mailbox driver has long since moved from tasklets to workqueues.

    I am confused by your statement about a low-priority thread holding up the kworker thread. How are you able to determine that it is indeed the kworker thread that is held up rather than your application read thread?

    regards

    Suman

  • Thank you for taking the time to answer. I'll try to address your questions.

    The kworker thread should be created during the mailbox channel creation and should remain the same. It does not get re-spawned (I am not sure what your definition of re-spawned is).

    There are clearly new kworker threads being created on our system. When we look at the system after a few hours, the kworker threads that are running (checked with ps) have higher PIDs than at system start and have their default priority again. The IRQ threads and our A72 program remain the same (same PIDs).

    I am confused by your statement about a low-priority thread holding up the kworker thread. How are you able to determine that it is indeed the kworker thread that is held up rather than your application read thread?

    By looking at an ftrace using KernelShark. For delayed cycles, we can see that a low-priority task happens to run on the same core. This low-priority task is preempted by the threaded interrupt handlers and other real-time tasks, but not by our A72 application, which also has a real-time priority. So we assume that our A72 application (which is waiting in a read on the rpmsg device) would also preempt this low-priority task if it could, but it has to wait for the kworker to run, which unfortunately starts out as low priority and has to wait for the other low-priority task to complete (or for the default Linux scheduler to grant it a time slice).

    This is a trace of such a case. The blue is a low-priority task (in this case pidstat). The first box from the left contains the IRQ threads, which correctly preempt the blue task. The second box is an example of another real-time-priority thread that correctly preempts the blue task. The third and last box is the kworker finally receiving its time slice, which then also allows our real-time A72 program to run and complete the communication.

  • Hi Robert,

    There are clearly new kworker threads being created on our system. When we look at the system after a few hours, the kworker threads that are running (checked with ps) have higher PIDs than at system start and have their default priority again.

    The Mailbox kworker threads would have been created during the boot when the Linux remoteproc and rpmsg kernel modules are probed. You should be able to determine the kworker threads associated with the Mailbox.

    it has to wait for the kworker to run, which unfortunately starts out as low priority and has to wait for the other low-priority task to complete (or for the default Linux scheduler to grant it a time slice).

    What are the thread priority levels of the low-priority task and the kworker thread? I am surprised that a userspace spawned thread has a higher priority than a kernel-spawned kworker thread.

    You are running this on RT Linux, right? Do you see the same behavior on a regular Linux? (Your application's round-trip usage should be well within the latency scope on regular Linux as well.)

    Aren't you able to use the nice command to adjust the kworker thread priority?

    regards

    Suman

  • The Mailbox kworker threads would have been created during the boot when the Linux remoteproc and rpmsg kernel modules are probed. You should be able to determine the kworker threads associated with the Mailbox.

    I understand what you're saying, and we also expected it to work this way. So at the start of our tests, we changed the priority of all kworker threads for the specific core. After that, the system ran for a few hours. But after some time the original kworker threads are simply gone, even though the system was running uninterrupted. The new kworker threads have their original priority again, and this inevitably leads to interference from low-priority threads (as described).

    What are the thread priority levels of the low-priority task and the kworker thread? I am surprised that a userspace spawned thread has a higher priority than a kernel-spawned kworker thread.

    They seem to have the same priorities (SCHED_OTHER with niceness zero). Niceness zero is the default priority for kworker threads. Since they have the same priority, the default scheduler treats them the same way, so a user-space thread may get to run for some time before it has exceeded its time slice. Only after its time slice runs out may the kworker start.

    For the record, we use Linux kernel version 6.1. We have not tested it yet with a "normal" Linux without the PREEMPT_RT patch. Do you think the restarted kworker threads may be specific to PREEMPT_RT?

    Since you are very sure that the kworker threads should normally not be restarted, we will check whether we have some Linux kernel configuration that may lead to that behavior.

  • Hi Robert,

    But after some time the original kworker threads are simply gone, even though the system was running uninterrupted.

    Interesting. I have not personally encountered this, and it is very strange as well. Is the system somehow going through a reboot, or are the remoteproc modules getting reloaded? Is this happening for all kworker threads, or just a few? How long after boot do you typically see this? Can you provide a snapshot log around these events?

    We have not tested it yet with a "normal" Linux without the PREEMPT_RT patch. Do you think the restarted kworker threads may be specific to PREEMPT_RT?

    I do not know.

    regards

    Suman

    PS: I am out on vacation early next week and won't be able to provide any responses until later next week.

    Interesting. I have not personally encountered this, and it is very strange as well. Is the system somehow going through a reboot, or are the remoteproc modules getting reloaded? Is this happening for all kworker threads, or just a few? How long after boot do you typically see this? Can you provide a snapshot log around these events?

    The system does not reboot during that time, and all of our related tasks (the A72 task and the R5F firmware) keep running. There is no indication that kernel modules are being reloaded.

    The kworker threads that are regularly restarted (after a few hours, it seems) are only the general-purpose kworker threads. This seems to be as intended, as these general-purpose workers are kept or killed at the workqueue's discretion. See https://www.kernel.org/doc/html/v6.1/core-api/workqueue.html ("Keeping idle workers around doesn’t cost other than the memory space for kthreads, so cmwq holds onto idle ones for a while before killing them.")

    I've been looking into the kernel sources, and neither the mbox driver nor the rpmsg driver seems to create a dedicated kworker thread. The mbox driver uses "schedule_work" in the interrupt function (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/mailbox/omap-mailbox.c?h=v6.1.78#n310), which queues work for such a general kworker thread. The rpmsg_char implementation doesn't seem to use kworkers at all and just waits for data in a queue during the read (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/rpmsg/rpmsg_char.c?h=v6.1.78#n191). So we still think our problem is the (general-purpose) kworker, which is triggered by the mbox interrupt.
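
    For reference, the receive path in rpmsg_char.c boils down to something like this (a condensed sketch of the v6.1 source with error handling trimmed, not the verbatim code):

      /* rpmsg callback: runs in the kworker context triggered by the mbox
       * interrupt; it only queues the message and wakes up the reader */
      static int rpmsg_ept_cb(struct rpmsg_device *rpdev, void *buf, int len,
                              void *priv, u32 addr)
      {
              struct rpmsg_eptdev *eptdev = priv;
              struct sk_buff *skb;

              skb = alloc_skb(len, GFP_ATOMIC);
              if (!skb)
                      return -ENOMEM;

              skb_put_data(skb, buf, len);
              skb_queue_tail(&eptdev->queue, skb);
              wake_up_interruptible(&eptdev->readq);  /* unblocks the read() */
              return 0;
      }

      /* the read side simply sleeps until the callback above has queued data,
       * via wait_event_interruptible(eptdev->readq, ...) */

    So the chain that has to complete before our application can run is: hard IRQ -> kworker (rpmsg callback) -> our reader thread.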

  • We have experimented with this small patch.
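
    The essence of the change, in diff form (a sketch against omap-mailbox.c; mq->work is the mailbox RX work item):

      /* Queue the mailbox RX work on the kernel's high-priority system
       * workqueue instead of the default one. Workers of system_highpri_wq
       * run at niceness -20, but still as SCHED_OTHER. */
      -       schedule_work(&mq->work);
      +       queue_work(system_highpri_wq, &mq->work);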

    With it, we explicitly request a high-priority worker, which runs at SCHED_OTHER with niceness -20. This seems to run much better. We still have to run some longer tests, but we have experimented with stress-ng: even with the CPU cores at 100% load from the (non-RT) stress-ng tasks, we did not have problems.

    If the mbox driver is to be part of a communication chain, I think it should use these higher-priority queues. Do you think it would be possible to make such a change upstream?

  • Hi Robert,

    I've been looking into the kernel sources, and neither the mbox driver nor the rpmsg driver seems to create a dedicated kworker thread.

    Correct. The OMAP Mailbox driver just initializes a work item and uses the Linux system workqueue for execution. The rpmsg driver callbacks are executed in the same context for processing incoming messages, so there are no separate kworker threads.

    The rpmsg_char implementation doesn't seem to use kworkers at all and just waits for data in a queue during the read.

    The rpmsg_char driver reads are executed in the context of the user application's reader threads (dequeueing the messages). The queueing itself is done as part of the rpmsg driver callbacks.

    If the mbox driver is to be part of a communication chain, I think it should use these higher-priority queues. Do you think it would be possible to make such a change upstream?

    I don't think this can be accepted upstream; it will most likely be a no-go, since there is no valid/proper justification for such a change. Every driver consumer could make the same argument, and it would all be the same contention.

    If the solution works for you, you can make the change on your system. The only other solution I can think of is to use a dedicated workqueue (which is probably where the kernel was before the cmwq consolidation); this would again be an out-of-tree solution.
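
    A sketch of that dedicated-workqueue alternative (illustrative names, assuming the queue is allocated at probe time):

      #include <linux/workqueue.h>

      static struct workqueue_struct *mbox_wq;

      static int mbox_init_wq(void)
      {
              /* dedicated (optionally high-priority) workqueue, created once
               * at driver probe time instead of relying on system_wq */
              mbox_wq = alloc_workqueue("omap-mbox-rx", WQ_HIGHPRI, 0);
              return mbox_wq ? 0 : -ENOMEM;
      }

      /* in the interrupt handler: queue_work(mbox_wq, &mq->work); */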

    regards

    Suman

  • We have created a kernel patch that changes the OMAP mailbox driver to use dedicated kthread workers, which are set to a SCHED_FIFO priority. We're still testing, but this seems to be sufficient to protect the real-time communication from any normal Linux tasks.
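
    A minimal sketch of the approach (illustrative names, not the literal patch): create a dedicated kthread worker for the mailbox, raise it to SCHED_FIFO with sched_set_fifo(), and queue the RX work on it from the interrupt handler.

      #include <linux/err.h>
      #include <linux/kthread.h>
      #include <linux/sched.h>

      static struct kthread_worker *mbox_kworker;
      static struct kthread_work mbox_rx_kwork;

      static void mbox_rx_kwork_func(struct kthread_work *work)
      {
              /* drain the mailbox FIFO and deliver the messages, as before */
      }

      static int mbox_rx_setup(void)
      {
              mbox_kworker = kthread_create_worker(0, "omap-mbox-rx");
              if (IS_ERR(mbox_kworker))
                      return PTR_ERR(mbox_kworker);

              /* run the worker thread at a real-time (SCHED_FIFO) priority,
               * so it is no longer held up by SCHED_OTHER tasks */
              sched_set_fifo(mbox_kworker->task);

              kthread_init_work(&mbox_rx_kwork, mbox_rx_kwork_func);
              return 0;
      }

      /* in the IRQ handler: kthread_queue_work(mbox_kworker, &mbox_rx_kwork); */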

    Thank you for your help!