
Linux/66AK2E05: Linux <-> TI-RTOS IPC performance & problems

Part Number: 66AK2E05

Tool/software: Linux

Hi,

I am trying to evaluate the IPC options between Linux on the ARM cores and TI-RTOS on the DSP core of the Keystone II 66AK2E05. I started with the ex02_messageq example and modified it to explore the performance and functionality limits. The most important modifications I made are (a simplified sketch of the resulting send loop follows the list):

  • the Linux side allocates and sends messages as fast as it can instead of waiting for each reply from the DSP.
  • the length of the MessageQ message is increased to the maximum allowed of 469 bytes (including MessageQ header). I had to find this limit by trial and error, since the only relevant information I found in any documentation was the 512 byte limit of rpmsg.
  • added an option to disable the DSP->ARM communication in order to test max unidirectional performance
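
For reference, a simplified sketch of the modified ARM-side send loop (illustrative only: the names and sizes are from my test setup, error handling is trimmed, and Ipc_start()/MessageQ_open() are assumed to have already succeeded):

#include <stdint.h>
#include <ti/ipc/MessageQ.h>

#define PAYLOAD_SIZE 448    /* found by trial and error, see above */

typedef struct {
    MessageQ_MsgHeader hdr;     /* mandatory MessageQ header */
    uint32_t seq;               /* sequence number used to detect lost messages */
    char     data[PAYLOAD_SIZE];
} TestMsg;

/* remoteQueueId: the DSP queue opened with MessageQ_open() */
static void blast(MessageQ_QueueId remoteQueueId, uint32_t count)
{
    uint32_t i;
    for (i = 0; i < count; i++) {
        TestMsg *msg = (TestMsg *)MessageQ_alloc(0, sizeof(TestMsg));
        if (msg == NULL) {
            break;                              /* allocation failed on the Linux side */
        }
        msg->seq = i;
        /* No pacing between puts: this is exactly what triggers the failure described below */
        if (MessageQ_put(remoteQueueId, (MessageQ_Msg)msg) != MessageQ_S_SUCCESS) {
            MessageQ_free((MessageQ_Msg)msg);
            break;
        }
    }
}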

With these changes I have found that, if there are no measures to limit the rate at which the ARM/Linux application allocates and puts new messages on the DSP's queue, the IPC fails when MessageQ_put() is called, with the following message on the ARM side:

TransportRpmsg_put: send failed: 512 (Unknown error 512)

On the DSP trace I see that execution fails after that:

[      1.398] [t=0x74b78c74] xdc.runtime.Memory: ERROR: line 52: out of memory: heap=0x87cdc0, size=496
[      1.398] xdc.runtime.Memory: line 52: out of memory: heap=0x87cdc0, size=496
[      1.398] [t=0x74b96e10] ti.sdo.ipc.MessageQ: ERROR: line 503: assertion failure: A_invalidMsg: Invalid message
[      1.398] ti.sdo.ipc.MessageQ: line 503: assertion failure: A_invalidMsg: Invalid message
[      1.398] xdc.runtime.Error.raise: terminating execution

Moreover, by artificially limiting the message rate I managed to get a maximum transfer rate of a little over 20MB/s using the maximum message size and with only the ARM side generating traffic. With smaller message sizes or bidirectional traffic this decreases significantly.

The above means that it is not hard for an application burst to make MessageQ fail ungracefully! Please note that I have taken care to check the return status of all MessageQ-related functions, but I cannot get any information that would let me avoid the failure. I was hoping that if the rpmsg/vring buffers are full, MessageQ_put() would fail gracefully and allow the receiving end some time to empty the buffers, or that I could at least check for the required space before actually calling MessageQ_put(), but none of this worked.
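
To be concrete, this is the kind of defensive pattern I tried (a sketch, not working code; in practice the put either succeeds or the transport/DSP has already failed, so the retry never gets a chance to help):

#include <unistd.h>
#include <ti/ipc/MessageQ.h>

/* Hoped-for behaviour: if the transport is full, back off and retry instead of crashing the DSP */
static int putWithBackoff(MessageQ_QueueId remoteQueueId, MessageQ_Msg msg)
{
    int tries;
    for (tries = 0; tries < 100; tries++) {
        if (MessageQ_put(remoteQueueId, msg) == MessageQ_S_SUCCESS) {
            return 0;                  /* accepted by the transport */
        }
        usleep(1000);                  /* give the DSP time to drain its queue */
    }
    MessageQ_free(msg);                /* give up without leaking the message */
    return -1;
}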

Also, the max throughput of 20-25MB/s seems rather low, given that I am working with high performance ARM and DSP cores with shared RAM. Moreover, when reaching this limit one of the ARM cores is at 100% usage. I understand that this could be due to the new MessageQ/IPC implementation using rpmsg and not shared memory, so that the Linux kernel has to copy the data from ARM memory to DSP memory, but still I was expecting higher performance and more robust behavior at the limit. Is there any option to improve performance, using the shared memory or otherwise? Do I have to resort to custom shared memory implementation using cmem for higher performance?

Is it "normal" for the DSP application to fail like that when the MessageQ heap is full? I could increase the heap size (eg. by placing the heap in the DDR) but still I would like to know that if the heap becomes full my application will not fail, just slow down until some messages are processed to make room in the heap.

As others have noted, the information about IPC and memory management is scattered and incomplete. I believe a short IPC fact sheet describing the basic operation, limitations, performance expectations, etc. would be very useful.

  • Hi,

    Which Processor SDK Linux version is this? Also, which Processor SDK RTOS version?

    Best Regards,
    Yordan
  • This is with SDK 05.01.00.11 (Linux & TI-RTOS).
    Do you know if there are any relevant improvements in 05.02.00.10?
  • I tried with Processor SDK 05_02_00_10 (Linux & RTOS) with the same results. Also, I've seen another, probably related, problem: when testing both directions simultaneously (ARM sends as fast as possible, DSP echoes back as fast as possible), sometimes packets are lost (or received in the wrong order) in the DSP -> ARM direction without any warning or error. I have added a sequence number in the messages to track them, and I see that the DSP receives and sends back all messages in order, but the ARM/Linux process may miss a random number of messages (sometimes ~5, sometimes ~300), e.g. it receives message 50000 and then message 503xx.

  • Hi Yordan, can you provide any advice on this?
  • Sorry, I was OoO. I've escalated this to the IPC experts.

    Best Regards,
    Yordan
  • Hello,

    Please refer to the following guide for tips on further optimizing your IPC application:

    Regards,
    Sahin

  • Hi Yordan and Sahin,

    I have seen the IPC optimisation guide; however, it is not clear whether it applies to HLOS too, given the HLOS restrictions with IPC 3.x (the page was last edited in 2014). Also, and more importantly, my major issue is the failure of IPC to fail gracefully when the rpmsg buffers are full, letting the application know that it needs to wait until the MessageQ receives the messages before it can put new messages on the queue.

    I am testing v5.02 but the results are pretty much the same: 23 MB/s one way (ARM->DSP), 6.5-14 MB/s per direction in full duplex communication, depending on the exact code setup. The problems remain, i.e.:

    • If the ARM/Linux side puts messages on the queue as fast as possible then I get in the DSP trace
      [t=0xe04952c8] xdc.runtime.Memory: ERROR: line 52: out of memory: heap=0x87cdc8, size=496
      [      2.687] xdc.runtime.Memory: line 52: out of memory: heap=0x87cdc8, size=496
      [      2.687] [t=0xe04a4ff2] ti.sdo.ipc.MessageQ: ERROR: line 503: assertion failure: A_invalidMsg: Invalid message
      [      2.687] ti.sdo.ipc.MessageQ: line 503: assertion failure: A_invalidMsg: Invalid message
      [      2.687] xdc.runtime.Error.raise: terminating execution

      and in the ARM/Linux application:

      TransportRpmsg_put: send failed: 512 (Unknown error 512)

      without any way to avoid program termination.

    • If I add a small delay between MessageQ_put() calls (a printf() works fine for that), then I don't get the out-of-memory error, but the ARM misses some of the messages that the DSP echoes back.

    I will try the performance tricks mentioned in the guide, but it is far more important to ensure that my application cannot fail unexpectedly because IPC doesn't gracefully handle a full message heap.

  • Hello,

    That error indicates you have run out of heap. Can you try increasing the heap in your cfg file?

    Regards,
    Sahin
  • Hi Sahin,

    I understand that the DSP runs out of heap; however, I would expect the MessageQ_alloc() or MessageQ_put() functions to check for that and handle it in a non-catastrophic way. Otherwise I can never be sure that my application won't crash during a packet burst. I have tried increasing the heap, but right now it is placed in L2SRAM, so there is not much room for growth. I could try moving it to the DDR3, but it may be slower, plus I would like to know that a message heap overrun is handled gracefully.

    Do you have any explanation or solution for the packets that the ARM receives out of order or not at all?

    Also, if the IPC 3.x Linux implementation uses vring buffers which are copied by the Linux kernel from ARM-exclusive memory to DSP-exclusive memory, does that mean the Multicore Shared Memory (MSMC) is unused? If yes, could I place the DSP's MessageQ heap there instead of in the external DDR3 memory? And if so, would this be faster than DDR3, since MSMC is internal to the SoC?

    Thank you in advance,

    Giannis

  • Hi, Giannis,

    I ran MessageQBench in the released PLSDK 5.2. The results show:

    run succeeded
    Running MessageQBench:
    Using numLoops: 1000; payloadSize: 8, procId : 1
    Entered MessageQApp_execute
    Local MessageQId: 0x80
    Remote queueId [0x10080]
    Exchanging 1000 messages with remote processor CORE0...
    CORE0: Avg round trip time: 82 usecs
    Leaving MessageQApp_execute

    For payload size of 8 bytes of 1000 messages, that is 64Kb in 82 usecs. So, I get roughly 780mbps.

    Rex
  • By the way, I was using a K2H EVM, which is in the same family as the K2E. I don't expect the numbers to be much different.

    Rex
  • Hi Rex,
    reading your reply the first time made me very happy, because it meant I was doing something wrong and the MessageQ mechanism is indeed very fast. BUT, reading it again more carefully and cross-checking with the MessageQBench.c source and the DSP trace made me even more worried!
    First of all, the test prints the AVERAGE round trip time, i.e. PER MESSAGE, not for all 1000 messages as you assumed. Thus, the actual throughput is 1000 times lower, 780 kbps or ~100 kB/s! This is heartbreakingly slow, but it's not the full story. Looking into the actual code I saw that the MessageQ payload is not 8 bytes but 16, because the messages sent include an application header of 8 bytes. To be fair, I will take that into account and calculate the final throughput at 200 kB/s, still painfully slow! However, this is with a rather small message size, so I tried with the maximum possible size of 456 bytes (448 bytes of "application" payload + 8 "application" header bytes). The results were pretty much the same, 82 usec per message, or 5.3 MB/s or 42.2 Mb/s, still a bit slower than my worst measurements. In any case, it is a strong indication that raw memory throughput is not the bottleneck.
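    To spell out the arithmetic: 8 B / 82 µs ≈ 97.5 kB/s ≈ 780 kb/s; counting the 8-byte application header as payload, 16 B / 82 µs ≈ 195 kB/s, i.e. the ~200 kB/s mentioned above.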
    Finally, I noticed that the MessageQBench.c code, as well as the accompanying messageq_single.c DSP code, do not allocate new messages for each transfer but send back the message they receive. This means that the two directions need to be synchronous, which is not the case in many applications, where one direction needs to send a burst of messages without waiting for a response from the other side. Still, avoiding the message allocation in each loop should offer at least a small performance benefit to the MessageQBench test, but somehow it still manages to be slower than my test.

    So, to conclude: instead of showing that IPC performance is better than in my tests, the MessageQBench test more than validates my results and concerns! To be honest, I am disappointed by TI's support of IPC. Keystone II seems like a very capable HW platform, provided that IPC can be used efficiently to exploit all the cores, but the support and documentation by TI are not on par with the HW. The responses by TI employees in this thread read as if I am asking a trivial question that they want to quickly mark as resolved, by telling me to read the documentation and look at the example code to make my application better. In reality, however, I believe I have presented in detail PERFORMANCE and FUNCTIONAL issues with the IPC library. TI's first response was to direct me to an outdated Wiki page, not applicable to my case, that was supposed to address the performance issues (it did not), while ignoring the problems of lost or out-of-order messages I have seen. Then, MessageQBench was suggested as a reference implementation that offers good IPC performance, but what it actually shows is that my disappointing IPC performance measurements are correct, if not optimistic.
    I hope this has been coincidental and that TI can provide better support on IPC. If not, it would be acceptable to me for TI to clarify that IPC support (at least through E2E and the public documentation) is limited, so that anyone looking into IPC can take that into account and accept the risks (or not).
    Regards,
    Giannis
    PS: No personal offense meant, just disappointment at TI IPC support.

  • After more experimentation and some tricks I managed to get to what is probably the limit of the IPC library with the current HW & SW. To do this I used bidirectional communication and set a maximum limit on the number of messages the ARM side can send without having received the corresponding responses from the DSP, something like the receive window in TCP (a simplified sketch follows the list below). By limiting that to ~130 messages (150 causes the test to fail as before), so that for example if the ARM has received 100 responses it will proceed to send up to message 230 before waiting for more responses from the DSP, I don't get any errors. Thus I can remove all artificial delays in my TX loop on the ARM and get ~14.7 MB/s per direction of application throughput, i.e. without counting the MessageQ header bytes.
    While this is an improvement it is far from ideal because:
    a) Even at that moderate throughput and with the test application doing nothing more than exchanging dummy messages I get a ~45% total CPU usage in Linux, so almost two of the 4 ARM cores are maxed out! Specifically, my application consumes ~130% CPU (3/4 of that is in kernel space) and the kernel another ~60% for a total of ~190% (out of 400% for all 4 ARM cores).
    b) Most importantly, this requires the application to track the messages sent and acknowledged by a response message from the DSP, in order to avoid not just losing messages but crashing the application altogether. This offloads significant responsibility for synchronizing the two sides onto the application, while IPC claims to handle that.
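
    For clarity, a simplified sketch of the TX-window throttling described above (the sendMessage()/receiveEcho() helpers are illustrative wrappers around MessageQ_alloc/MessageQ_put and MessageQ_get; the real test code differs):

      #define MAX_TX_WINDOW 130       /* 150 already makes the test fail as before */

      uint32_t txSeq = 0;             /* next sequence number to send              */
      uint32_t rxSeq = 0;             /* latest sequence number echoed by the DSP  */

      while (txSeq < totalMsgs) {
          /* Send only while fewer than MAX_TX_WINDOW messages are unacknowledged */
          while (txSeq < totalMsgs && (txSeq - rxSeq) < MAX_TX_WINDOW) {
              sendMessage(txSeq++);   /* MessageQ_alloc + MessageQ_put to the DSP queue */
          }
          rxSeq = receiveEcho();      /* blocking MessageQ_get on the local queue       */
      }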

    I truly hope I am missing something big here.
    Giannis
  • Hi, Giannis,

    That was my mistake for the throughput calculation.

    What is your application's use case? Are you sending a large piece of data to the DSP, or a small chunk each time?

    IPC is meant for small messages, such as control messages, sent between ARM and DSP, and isn't meant to be a data pipe like Ethernet for sending a large data flow. If you are sending large pieces of data, you can try to use CMEM to allocate, for example, a 1MB space, and send the pointer across so the DSP can access and get the data.
    For more info on CMEM, please refer to CMEM User Guide, software-dl.ti.com/.../Foundational_Components_CMEM.html and k2hk-evm-cmem.dtsi.
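
    As a rough illustration of the idea (not a drop-in solution; the message layout and names are made up, and cache/address-translation details are omitted), the ARM side could do something like:

      #include <stddef.h>
      #include <stdint.h>
      #include <ti/cmem.h>
      #include <ti/ipc/MessageQ.h>

      typedef struct {
          MessageQ_MsgHeader hdr;
          uint64_t physAddr;      /* physical address of the CMEM buffer for the DSP */
          uint32_t length;        /* number of valid bytes in the buffer */
      } PointerMsg;

      static int sendLargeBuffer(MessageQ_QueueId remoteQueueId, size_t size)
      {
          CMEM_AllocParams prms = CMEM_DEFAULTPARAMS;   /* CMEM_init() assumed done at startup */
          void *buf = CMEM_alloc(size, &prms);          /* contiguous, DSP-visible buffer */
          PointerMsg *msg;

          if (buf == NULL) {
              return -1;
          }
          /* ... fill buf with the data to transfer (CMEM_cacheWb() may be needed,
           *     depending on the pool's cache settings) ... */
          msg = (PointerMsg *)MessageQ_alloc(0, sizeof(PointerMsg));
          if (msg == NULL) {
              return -1;
          }
          msg->physAddr = (uint64_t)CMEM_getPhys(buf);
          msg->length   = (uint32_t)size;
          return MessageQ_put(remoteQueueId, (MessageQ_Msg)msg);   /* only the pointer crosses IPC */
      }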

    I am not sure why the DSP execution terminates, and also not sure whether it has something to do with your changes. The message the DSP sends back to acknowledge the received message shouldn't have anything to do with your throughput measurement. We don't suggest, nor think, you need to change anything for that purpose.

    Rex

  • Hi Rex,

    in our application we need to transfer multiple streams of data from and to the DSP with an AVERAGE total (aggregate) throughput of around 10% of what I have achieved in my best-case scenario. So on average we should be OK, but I still wanted to make sure that the IPC framework can gracefully handle short but quick bursts. Being packet/message-based, MessageQ gives the impression that it can (at least somewhat) handle issues like packet ordering, delivery status information and buffer exhaustion gracefully, i.e. inform the application and avoid crashing.

    Having to build our own IPC over a CMEM area somewhat negates the MessageQ advantage of TI-RTOS, especially since the average throughput is achievable with MessageQ. I would suggest that you include these limitations and the intended usage of MessageQ in the IPC User's Guide, because in its current form it presents MessageQ as a one-size-fits-all solution, which apparently it is not.

    It is not clear to me whether the problems I see with MessageQ, i.e. packets out of order or lost, and the application crash when the receiving buffer/heap is full, are known to TI. If not, I could share my test code with you so that you can reproduce and look into them.

    Best regards,

    Giannis

    PS: I did some more testing, artificially limiting the message rate at various levels to see what happens. It seems that the ARM->DSP direction is capable of ~25MB/s if not kept in sync with the DSP->ARM messages. The DSP->ARM direction can do up to ~14.7MB/s, and if pushed beyond that, messages are lost without any indication from the IPC/MessageQ functions. The ARM->DSP direction, although faster, fails even worse if pushed beyond its limit, causing an out-of-heap error on the DSP and making the ARM/Linux application hang, requiring a reboot.

  • Hi, Giannis,

    Please share the code if you don't mind, so we can reproduce the issue and understand the implications of your changes.

    Rex
  • Hi, Giannis,

    Just acknowledging receipt of the code. I'll get back to you when I have something, or if I have any questions about the code.

    Rex
  • Hi Rex,

    having to send (and receive) an ACK message means that I cannot exploit the multiple-writers, single-reader architecture of the MessageQ framework without complicating the IPC mechanism with additional queues just for the ACK messages. If I were going to use a 1-to-1 bidirectional writer-reader structure, your suggestion would be trivial (and obvious).

    Please let me know if you need any help with my test, or have any conclusions to share.

    Best regards,

    Giannis

  • Hi, Giannis,

    I meant to let you know that I received your code. I built it and was able to reproduce the issue you saw. I'll take a look at the code and discuss the crash issue internally.

    Rex
  • Hi, Giannis,

    What should I expect when running the test? I see no issues and am getting 14.95MB/s each way. The DSP is in a good state and allows me to run multiple times. I am using a 512MB data file.

    Rex
  • Hi Rex, in the code I sent you I have implemented a throttling mechanism by limiting the maximum number of unacknowledged messages the ARM is allowed to send to the DSP before waiting for an ACK. This is adjusted by the MAX_TX_WINDOW macro, now set to 130. Increasing this (e.g. to 200) will almost certainly cause the error. Also, if you enable the TX_ONLY macro you can test just the ARM->DSP direction.

  • Hi, Giannis,

    I have been busy, and won't be able to look at it until tomorrow afternoon at the earliest. I'll play around with those variables, then get back to you.

    Rex
  • Hi, Giannis,

    I changed the window to 200, but I don't see any crashes. The logs below show the end of running test_host with window 200; it does not come back to the kernel prompt, so I Ctrl-C out of it and then dump the DSP trace, which doesn't show any crashes. I then ran test_host with window 130, which ran without issues.

    I see rpmsg on the Linux side ran out of skb buffers, but the DSP is still up and running. RPMSG has a limit on the number of buffers, 512 (256 in each direction). The messages printed on the console are just warnings.

    [ 131.603307] rpmsg_proto virtio0.rpmsg-proto.-1.61: sock_queue_rcv_skb failed: -12
    [ 131.603321] rpmsg_proto virtio0.rpmsg-proto.-1.61: sock_queue_rcv_skb failed: -12
    [ 138.639902] systemd-journald[98]: /dev/kmsg buffer overrun, some messages lost.

    ^CSIGINT: exiting
    root@k2e-evm:~# cat /sys/kernel/debug/remoteproc/remoteproc0/trace0
    [ 0.000] 2 Resource entries at 0x800000
    [ 0.000] [t=0x0011e9dc] xdc.runtime.Main: --> main:
    [ 0.000] registering rpmsg-proto:rpmsg-proto service on 61 with HOST
    [ 0.000] [t=0x0015000d] xdc.runtime.Main: NameMap_sendMessage: HOST 53, port=61
    [ 0.000] [t=0x0016c041] ipc_echo: ipc_test_dsp_create: ipc_test_dsp is ready
    [ 0.000] [t=0x00174da9] ipc_echo: <-- ipc_test_dsp_create: 0
    [ 0.000] [t=0x0017b809] ipc_echo: --> ipc_test_dsp_exec:
    [ 14.686] [t=0x00000004:c9a0d659] ipc_echo: ipc_test_dsp_exec: received 450 messages with 122176 bytes total
    [ 14.686] [t=0x00000004:c9a1bbbf] ipc_echo: <-- ipc_test_dsp_exec: 0
    [ 14.686] [t=0x00000004:c9a248f7] ipc_echo: --> ipc_test_dsp_delete:
    [ 14.686] [t=0x00000004:c9a32f2f] ipc_echo: <-- ipc_test_dsp_delete: 0
    [ 14.686] [t=0x00000004:c9a46105] ipc_echo: ipc_test_dsp_create: ipc_test_dsp is ready
    [ 14.686] [t=0x00000004:c9a51327] ipc_echo: <-- ipc_test_dsp_create: 0
    [ 14.686] [t=0x00000004:c9a59f59] ipc_echo: --> ipc_test_dsp_exec:
    [ 89.209] [t=0x0000001d:1459374b] ipc_echo: ipc_test_dsp_exec: received 1198373 messages with 594392816 bytes total
    [ 89.209] [t=0x0000001d:145a287f] ipc_echo: <-- ipc_test_dsp_exec: 0
    [ 89.209] [t=0x0000001d:145ab873] ipc_echo: --> ipc_test_dsp_delete:
    [ 89.209] [t=0x0000001d:145b9b15] ipc_echo: <-- ipc_test_dsp_delete: 0
    [ 89.209] [t=0x0000001d:145cdaa3] ipc_echo: ipc_test_dsp_create: ipc_test_dsp is ready
    [ 89.209] [t=0x0000001d:145d8d77] ipc_echo: <-- ipc_test_dsp_create: 0
    [ 89.209] [t=0x0000001d:145e1af5] ipc_echo: --> ipc_test_dsp_exec:
    root@k2e-evm:~#
    root@k2e-evm:~#
    root@k2e-evm:~#
    root@k2e-evm:~#
    root@k2e-evm:~# ./ipc_test_host-130 CORE0
    --> main entry:
    Ipc_start succeeded: status = 0
    --> Main_main:
    --> dsp_comm_create:
    dsp_com_create: Creating HOST:MsgQ queue
    dsp_com_create: Opening CORE0:MsgQ queue
    dsp_com_create: Host is ready
    <-- dsp_com_create:
    --> ipc_test_exec:
    Trying packTransportRpmsg_put: send failed: 90 (Message too long)
    Trying packet with payload size 449 : put error -1
    Using max payload size of 448, total message size 496
    Sent last packet to DSP
    --> dsp_comm_delete:
    <-- dsp_comm_delete:
    --> dsp_comm_create:
    dsp_com_create: Creating HOST:MsgQ queue
    dsp_com_create: Opening CORE0:MsgQ queue
    dsp_com_create: Host is ready
    <-- dsp_com_create:
    ipc_test_exec: Opened input.dat with 536870912 bytes for sending
    Packet Tx (MB/s) Rx (MB/s)
    1198373 14.896 14.894
    Done sending, waiting to receive remaining 57600 bytes
    Lost RX packets: 0
    Transferred complete in 34.989 sec, rate 14.894 MB/sec
    Average round trip latency: 29.20 usec/message
    Sent 1198373 messages, received 1198373 messages
    <-- ipc_test_exec: 0
    --> dsp_comm_delete:
    <-- dsp_comm_delete:
    <-- Main_main:
    <-- main:
    root@k2e-evm:~#
  • Hi, Giannis,

    I wasn't able to reproduce the crash issue. The messages printed in the console are just warnings which don't affect the system as a whole. I'll close this thread. If you have other issues, please create new ones. Thanks!

    Rex