
AM625: rpmsg_char_zerocopy performance

Part Number: AM625

Hello Champs,

Customer is testing the performance of rpmsg_char_simple and rpmsg_char_zerocopy. 

As shown in the following figure, they calculated the performance with a Linux script that measures the interval between entry and exit of the send and receive functions.

The result is below:

The rpmsg_char_simple performance is similar to the numbers given in the MCU+ SDK user guide.

But rpmsg_char_zerocopy is slower than rpmsg_char_simple.

1. Why is rpmsg_char_zerocopy slower?

2. Is there another IPC scheme that is faster than rpmsg_char_simple?


  • Hello Shine,

    The output numbers the customer provided do not match the code snippet that you attached. I am not sure how they generated their numbers, so I cannot tell you exactly what is going on.

    Does the code snippet from rpmsg_char_zerocopy actually measure latency? No.

    t measures the time it takes to execute the function send_msg. It does NOT measure the time between when Linux sends the RPMsg and when the remote core receives and processes it.

    t2 measures the time for the remote core to receive the RPMsg, process it, execute whatever other code it wants to execute, rewrite the shared memory, and send an RPMsg back to Linux, plus the time for Linux to receive and process that RPMsg, minus the time spent going through all the print statements on the Linux side.

    Neither t, nor t2, actually measures RPMsg latency.
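If the goal really is latency, one common approximation is to timestamp immediately before a send and immediately after the matching echo reply is read, repeat many times, and look at the distribution (one-way latency is then roughly half the round trip, assuming symmetric paths). A minimal sketch of that methodology, where `send` and `recv` are placeholders for whatever transport is used (e.g. write()/read() on an rpmsg_char endpoint file descriptor) and the loopback stand-in below is only there so the sketch runs without hardware:

```python
import time
import statistics

def measure_roundtrips(send, recv, payload: bytes, iterations: int):
    """Time send -> echo-reply round trips; returns per-trip times in seconds.

    `send` and `recv` are stand-ins for the real transport calls
    (not TI API names).
    """
    latencies = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        send(payload)
        reply = recv()           # blocks until the remote side echoes back
        t1 = time.perf_counter()
        assert reply == payload  # sanity-check the echo
        latencies.append(t1 - t0)
    return latencies

# Loopback stand-in so the sketch is runnable anywhere.
_queue = []
lat = measure_roundtrips(_queue.append, _queue.pop, b"x" * 496, 100)
stats = {"avg": statistics.mean(lat), "worst": max(lat)}
```

Looking at the worst-case value as well as the average matters here, since (as noted below) Linux RPMsg latency is not deterministic.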

    Ok, so what could the customer do if they DO want to benchmark the zerocopy example? 

    The whole point of the zerocopy example is to show how to move large amounts of data between cores. This is NOT an example of how to minimize latency. It is an example of how to maximize THROUGHPUT.

    If I were a customer benchmarking the zerocopy example, I would compare code like this:


    // Define 1MB of data to send between Linux & the remote core.
    // It is probably easiest to just run the RPMsg echo example
    // 1048576 bytes / 496 bytes per message = 2,115 messages,
    // i.e., 4,230 total messages, 2,115 in each direction.
    // Echoing avoids potential issues like overflow that could occur if we
    // sent 2,115 messages back-to-back from Linux to the remote core, or
    // vice versa.
    start_time = now()
    send & receive 2,115 echo messages
    end_time = now()
    total_time = end_time - start_time
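Once total_time is measured, throughput follows from the byte count. A quick helper using the message size and count from the comments above (the 0.5 s figure at the end is an arbitrary illustration, not a measured result):

```python
import math

PAYLOAD_BYTES = 496            # max RPMsg payload per message
TOTAL_BYTES = 1 * 1024 * 1024  # 1 MB in each direction

# Messages needed to carry 1 MB at 496 bytes per message (round up).
msgs_per_direction = math.ceil(TOTAL_BYTES / PAYLOAD_BYTES)  # 2,115
total_msgs = 2 * msgs_per_direction                          # 4,230

def throughput_mbps(total_bytes: int, total_time_s: float) -> float:
    """Payload throughput in MB/s for one direction."""
    return total_bytes / total_time_s / 1e6

# Example: if the echo loop took 0.5 s of wall time end to end:
rate = throughput_mbps(TOTAL_BYTES, 0.5)
```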

    Then I would run the zerocopy example like this:

    // Define a 1MB region of shared memory.
    start_time = now()
    write 1MB of data to the shared memory region
    send RPMsg
    wait for RPMsg reply
    // While we are waiting, the remote core receives the RPMsg,
    // reads in the 1MB of data, writes 1MB of data back,
    // then sends an RPMsg reply.
    read 1MB of data from the shared memory region
    end_time = now()
    total_time = end_time - start_time
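The zerocopy-side measurement above can be sketched the same way. In this sketch, `shared` stands in for the mmap'd shared-memory region and `kick`/`wait` stand in for the RPMsg notify/reply pair; none of these names come from TI's examples, and the fake remote core at the bottom exists only so the sketch runs without hardware:

```python
import time

CHUNK = 1 * 1024 * 1024  # size of the shared-memory region (1 MB)

def zerocopy_roundtrip(shared: bytearray, payload: bytes, kick, wait):
    """Time one 1 MB round trip through a shared buffer."""
    t0 = time.perf_counter()
    shared[:] = payload     # write 1 MB into shared memory
    kick()                  # send the RPMsg notification
    wait()                  # block until the remote core replies
    result = bytes(shared)  # read back the (rewritten) 1 MB
    t1 = time.perf_counter()
    return result, t1 - t0

# Stand-ins so the sketch is runnable: the "remote core" rewrites the
# buffer while we "wait".
buf = bytearray(CHUNK)
def fake_remote():
    buf[:] = b"B" * CHUNK

result, elapsed = zerocopy_roundtrip(buf, b"A" * CHUNK,
                                     kick=lambda: None, wait=fake_remote)
```

Comparing total_time from the two runs gives an apples-to-apples throughput comparison: two RPMsg notifications plus two 1 MB shared-memory copies, versus 4,230 individual messages.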

    Why else would I want to use the zerocopy example? 

    One use case I have seen: when customers try to send data 496 bytes at a time through RPMsg in a single direction, they can hit an overflow situation where the sending core sends data faster than the receiving core can keep up. Eventually they start losing data.

    Instead of interrupting the receiving core for every 496 bytes of data, a shared memory example allows the sending core to send fewer interrupts to transmit the same amount of data. That can help the receiving core run more efficiently, since it is interrupted less, so it has to context switch less.
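The interrupt savings can be quantified with back-of-the-envelope arithmetic (one interrupt per RPMsg message versus one notification per shared-memory buffer handoff; these are counts implied by the message sizes in this thread, not measured figures):

```python
import math

total_bytes = 1 * 1024 * 1024  # 1 MB of application data
rpmsg_payload = 496            # bytes carried per RPMsg message

# Interrupting the receiver once per message vs. once per 1 MB buffer.
interrupts_rpmsg = math.ceil(total_bytes / rpmsg_payload)
interrupts_shared = 1
reduction = interrupts_rpmsg / interrupts_shared
```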

    What if I am trying to minimize latency between Linux and a remote core? 

    First of all, make sure that you ACTUALLY want Linux to be in a critical control path where low, deterministic latency is required. Even RT Linux is NOT a true real-time OS, so there is always the risk that Linux will miss timing eventually. Refer here for more details: 

    Additionally, Linux RPMsg is NOT currently designed to be deterministic. The average latency may be on the order of tens of microseconds to 100 microseconds, but the worst-case latency can occasionally spike to 1 ms or more on Linux kernel 6.1 and earlier.

    RPMsg is the TI-supported IPC between Linux and a remote core. If you do not actually need to send 496 bytes of data with each message, you could implement your own IPC, such as one based directly on mailboxes. Keep in mind that TI does NOT provide support for mailbox communication between Linux userspace and remote cores. If the customer decides to develop their own mailbox IPC, we will NOT be able to support that development.



  • Hello Nick,

    Thank you very much for your great support.

    The customer's understanding is that this inter-core communication works by opening the device driver via rpmsg_char_simple and then writing data into the shared memory through that driver. Where can they find the source code for this device driver? They want to see how data is read from and written to the shared memory.


  • Hello Shine,

    MCU+ Side documentation (low-level driver used by IPC_RPMsg)

    source code in MCU+ SDK source/drivers/ipc_rpmsg

    Linux side 

    The rpmsg driver is in the Linux SDK under drivers/rpmsg.

    The source code for the rpmsg-char utility library for interacting with Linux userspace is here: