Tool/software: Linux
Hi,
I am trying to evaluate the IPC options between Linux on the ARM cores and TI-RTOS on the DSP core of the Keystone 2 66AK2E05. I started with the ex02_messageq example and modified it to explore the performance and functionality limits. The most important modification I made is:
With this change I have found that, if there are no measures to limit the rate at which the ARM/Linux application allocates and puts new messages into the DSP's queue, the IPC fails when MessageQ_put() is called, with the following message on the ARM side:
TransportRpmsg_put: send failed: 512 (Unknown error 512)
In the DSP trace I see that execution terminates after that:
[ 1.398] [t=0x74b78c74] xdc.runtime.Memory: ERROR: line 52: out of memory: heap=0x87cdc0, size=496
[ 1.398] xdc.runtime.Memory: line 52: out of memory: heap=0x87cdc0, size=496
[ 1.398] [t=0x74b96e10] ti.sdo.ipc.MessageQ: ERROR: line 503: assertion failure: A_invalidMsg: Invalid message
[ 1.398] ti.sdo.ipc.MessageQ: line 503: assertion failure: A_invalidMsg: Invalid message
[ 1.398] xdc.runtime.Error.raise: terminating execution
Moreover, by artificially limiting the message rate I managed to get a maximum transfer rate of a little over 20 MB/s, using the maximum message size and with only the ARM side generating traffic. With smaller message sizes or bidirectional traffic this decreases significantly.
The above means that it is not that hard for an application burst to make MessageQ fail ungracefully! Please note that I have taken care to check the return status of all MessageQ-related functions, but I cannot get any information that would let me avoid the failure. I was hoping that if the rpmsg/vring buffers are full MessageQ_put() would fail gracefully and allow some time for the receiving end to empty the buffers, or that I could at least check for the required space before actually calling MessageQ_put(), but neither of these worked.
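For illustration, this is the shape of the send loop I am describing; the heap ID, message size and function name below are placeholders, not my exact test code:

/* Stripped-down sketch of the ARM/Linux send loop (placeholders, not my
 * actual test code). Ipc_start() and MessageQ_open() of the DSP queue are
 * assumed to have been done already. */
#include <ti/ipc/Std.h>
#include <ti/ipc/MessageQ.h>

#define HEAP_ID   0      /* placeholder heap id */
#define MSG_SIZE  496    /* placeholder total message size, incl. MessageQ header */

int send_burst(MessageQ_QueueId dspQueueId, int numMsgs)
{
    int i;

    for (i = 0; i < numMsgs; i++) {
        /* Allocation failures are reported cleanly (NULL return)... */
        MessageQ_Msg msg = MessageQ_alloc(HEAP_ID, MSG_SIZE);
        if (msg == NULL)
            return -1;

        /* ...but when the vring buffers / DSP heap fill up during a burst,
         * the transport just prints "send failed: 512", the DSP terminates,
         * and nothing here lets me anticipate or recover from it. */
        if (MessageQ_put(dspQueueId, msg) < 0) {
            MessageQ_free(msg);
            return -1;
        }
    }
    return 0;
}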
Also, the max throughput of 20-25 MB/s seems rather low, given that I am working with high-performance ARM and DSP cores with shared RAM. Moreover, when reaching this limit one of the ARM cores is at 100% usage. I understand that this could be due to the new MessageQ/IPC implementation using rpmsg rather than shared memory, so that the Linux kernel has to copy the data from ARM memory to DSP memory, but I was still expecting higher performance and more robust behavior at the limit. Is there any option to improve performance, using shared memory or otherwise? Do I have to resort to a custom shared memory implementation using CMEM for higher performance?
Is it "normal" for the DSP application to fail like that when the MessageQ heap is full? I could increase the heap size (eg. by placing the heap in the DDR) but still I would like to know that if the heap becomes full my application will not fail, just slow down until some messages are processed to make room in the heap.
As others have noted, the information about IPC and memory management is scattered and incomplete. I believe a short IPC fact sheet describing the basic operation, limitations, performance expectations, etc. would be very useful.
I tried with Processor SDK 05_02_00_10 (Linux & RTOS) with the same results. Also, I've seen another, probably related, problem: when testing both directions simultaneously (ARM sends as fast as possible, DSP echoes back as fast as possible) I sometimes get packets lost (or received in the wrong order) in the DSP -> ARM direction without any warning or error. I have added a sequence number to the messages to track them, and I see that the DSP receives and sends back all messages in order, but the ARM/Linux process may miss a random number of messages (sometimes ~5, sometimes ~300), e.g. it receives message 50000 and then message 503xx.
Hi Yordan and Sahir,
I have seen the IPC optimisation guide; however, it is not clear if it applies to an HLOS too, given the HLOS restrictions with IPC 3.x (the page was last edited in 2014). Also, and more importantly, my major issue is that IPC does not fail gracefully when the rpmsg buffers are full, letting the application know that it needs to wait until the MessageQ receives the messages before it can put new messages into the queue.
I am testing v5.02 but the results are pretty much the same: 23 MB/s one way (ARM->DSP), 6.5-14 MB/s per direction in full-duplex communication, depending on the exact code setup. The problems remain, i.e.:
[t=0xe04952c8] xdc.runtime.Memory: ERROR: line 52: out of memory: heap=0x87cdc8, size=496
[ 2.687] xdc.runtime.Memory: line 52: out of memory: heap=0x87cdc8, size=496
[ 2.687] [t=0xe04a4ff2] ti.sdo.ipc.MessageQ: ERROR: line 503: assertion failure: A_invalidMsg: Invalid message
[ 2.687] ti.sdo.ipc.MessageQ: line 503: assertion failure: A_invalidMsg: Invalid message
[ 2.687] xdc.runtime.Error.raise: terminating execution
and in the ARM/Linux application:
TransportRpmsg_put: send failed: 512 (Unknown error 512)
without any way to avoid program termination.
I will try the performance tricks mentioned in the guide, but it is far more important to ensure that my application cannot fail unexpectedly because IPC does not gracefully handle a full message heap.
Hi Sahir,
I understand that the DSP runs out of heap; however, I would expect the MessageQ_alloc() or MessageQ_put() functions to check for that and handle it in a non-catastrophic way. Otherwise I can never be sure that my application won't crash during a packet burst. I have tried increasing the heap, but right now it is placed in L2SRAM so there is not much room for growth. I could try moving it to the DDR3, but it may be slower, plus I would like to know that a message heap overrun is handled gracefully.
Do you have any explanation and solution for the packets being received by the ARM out of order or lost?
Also, if the IPC 3.x Linux implementation uses vring buffers which are copied by the Linux kernel from ARM-exclusive memory to DSP-exclusive memory, does that mean that the Multicore Shared Memory (MSMC) is unused? If so, could I place the DSP's MessageQ heap there instead of in the external DDR3 memory? And if so, would this be faster than DDR3, since the MSMC is internal to the SoC?
Thank you in advance,
Giannis
Hi Rex,
reading your reply the first time made me very happy, because it meant I was doing something wrong and the MessageQ mechanism is indeed very fast. BUT, reading it again more carefully and cross-checking with the MessageQBench.c source and the DSP trace made me even more worried!
First of all, the test prints the AVERAGE round-trip time, i.e. PER MESSAGE, not for all 1000 messages as you assumed. Thus, the actual throughput is 1000 times lower, 780 kbps or ~100 kB/s! This is heartbreakingly slow, but it's not the full story. Looking into the actual code I saw that the MessageQ payload is not 8 bytes but 16, because the messages sent include an application header of 8 bytes. To be fair, I will take that into account and calculate the final throughput as 200 kB/s, still painfully slow! However, this is with a rather small message size, so I tried with the maximum possible size of 456 bytes (448 bytes of "application" payload + 8 "application" header bytes). The results were pretty much the same: 82 usec per message, or 5.3 MB/s or 42.2 Mb/s, still a bit slower than my worst measurements. In any case, it is a strong indication that raw memory throughput is not the bottleneck.
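To spell out the arithmetic (assuming roughly the same ~82 usec average round trip for the small messages, which is consistent with the 100-200 kB/s figures above):

    throughput ≈ payload per message / average round-trip time

    16 bytes  / 82 usec ≈ 0.195 MB/s  (the ~200 kB/s above)
    456 bytes / 82 usec ≈ 5.6 MB/s    (~5.3 MB/s if counting 1 MB = 2^20 bytes)

In other words, the per-message overhead of roughly 80 usec dominates, regardless of payload size.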
Finally, I noticed that the MessageQBench.c code, as well as the accompanying messageq_single.c DSP code, do not allocate new messages for each transfer but send back the message they receive. This means that the two directions need to be synchronous, which is not the case in many applications, where one direction needs to send a burst of messages without waiting for a response from the other side. Still, avoiding the message allocation in each loop iteration should give the MessageQBench test at least a small performance advantage, yet somehow it still manages to be slower than my test.
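For clarity, the DSP-side loop in messageq_single.c is essentially an echo of this form (my paraphrase, not the verbatim example code):

/* Paraphrase of the DSP-side echo loop: the received message object itself
 * is sent back, so no per-message allocation happens and the two sides
 * stay in lock-step. */
#include <xdc/std.h>
#include <ti/ipc/MessageQ.h>

Void echoLoop(MessageQ_Handle dspQueue)
{
    MessageQ_Msg msg;
    MessageQ_QueueId replyQueue;

    while (TRUE) {
        if (MessageQ_get(dspQueue, &msg, MessageQ_FOREVER) < 0) {
            break;                                 /* queue deleted / error */
        }
        replyQueue = MessageQ_getReplyQueue(msg);  /* reply queue set by the ARM sender, as I read the example */
        MessageQ_put(replyQueue, msg);             /* bounce the same message back */
    }
}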
So, to conclude, the MessageQBench test, instead of showing that IPC performance is better than my tests suggest, more than validates my results and concerns! To be honest, I am disappointed by TI's support for IPC. Keystone II seems like a very capable HW platform, provided that IPC can be used efficiently to exploit all the cores, but the support and documentation by TI are not on par with the HW. The responses by TI employees in this thread read as if I am asking a trivial question that they want to quickly mark as resolved and be done with, by telling me to read the documentation and look at the example code to make my application better. In reality, however, I believe I have presented in detail PERFORMANCE and FUNCTIONAL issues with the IPC library. TI's first response was to direct me to an outdated Wiki page, not applicable to my case, that was supposed to address the performance issues (it did not), while ignoring the problems of lost or out-of-order messages I have seen. Then MessageQBench was suggested as a reference implementation that offers good IPC performance, but what it actually shows is that my disappointing IPC performance measurements are correct, if not optimistic.
I hope this has been coincidental and that TI can provide better support on IPC. If not, it would be acceptable to me for TI to clarify that IPC support (at least through E2E and the public documentation) is limited, so that anyone looking into IPC can take that into account and accept the risks (or not).
Regards,
Giannis
PS: No personal offense meant, just disappointment at TI IPC support.
Hi, Giannis,
That was my mistake for the throughput calculation.
What is your application's use case? Are you sending a large piece of data to the DSP, or a small chunk each time?
IPC is meant for small messages, such as control messages, sent between the ARM and DSP, and isn't meant to be a data pipe like Ethernet carrying a large data flow. If you are sending large pieces of data, you can try to use CMEM to allocate, for example, a 1MB space, and send the pointer across so the DSP can access and get the data.
For more info on CMEM, please refer to the CMEM User Guide, software-dl.ti.com/.../Foundational_Components_CMEM.html, and k2hk-evm-cmem.dtsi.
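A rough sketch of that pattern on the Linux side might look like this (the message layout, queue setup and the send_buffer()/PointerMsg names are only placeholders, not code from the SDK):

/* Sketch (not SDK example code) of the suggestion above: put the payload in
 * a CMEM buffer and send only its physical address over MessageQ.
 * Assumes CMEM_init() and the Ipc/MessageQ setup were done at startup. */
#include <string.h>
#include <stdint.h>
#include <stddef.h>
#include <ti/cmem.h>          /* cmem.h in older CMEM releases */
#include <ti/ipc/Std.h>
#include <ti/ipc/MessageQ.h>

typedef struct {
    MessageQ_MsgHeader header;   /* mandatory MessageQ header */
    uint64_t physAddr;           /* physical address of the CMEM block */
    uint32_t length;             /* number of valid bytes in the block */
} PointerMsg;

int send_buffer(MessageQ_QueueId dspQueueId, const void *data, size_t len)
{
    CMEM_AllocParams params = CMEM_DEFAULTPARAMS;   /* heap allocation, cached */
    PointerMsg *msg;

    void *buf = CMEM_alloc(len, &params);   /* contiguous, DSP-visible memory */
    if (buf == NULL)
        return -1;

    memcpy(buf, data, len);
    CMEM_cacheWb(buf, len);                 /* write back ARM cache before handing off */

    msg = (PointerMsg *)MessageQ_alloc(0, sizeof(PointerMsg));
    if (msg == NULL) {
        CMEM_free(buf, &params);
        return -1;
    }
    msg->physAddr = CMEM_getPhys(buf);      /* address the DSP can translate and use */
    msg->length   = (uint32_t)len;

    if (MessageQ_put(dspQueueId, (MessageQ_Msg)msg) < 0) {
        MessageQ_free((MessageQ_Msg)msg);
        CMEM_free(buf, &params);
        return -1;
    }
    return 0;
}

On the DSP side the physical address still has to be translated to a local address and the cache invalidated before reading, and ownership of the buffer has to be handed back to the ARM at some point, so this is only the ARM half of the scheme.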
I am not sure why the DSP execution termination happens, and also not sure whether it has something to do with your changes. The message replied from the DSP to acknowledge the received message shouldn't have anything to do with your throughput measurement; we don't suggest, nor think you need, to change anything for that purpose.
Rex
Hi Rex,
in our application we need to transfer multiple streams of data from and to the DSP, with an AVERAGE total (aggregate) throughput around 10% of what I have achieved in my best-case scenario. So on average we should be OK, but I still wanted to make sure that the IPC framework can gracefully handle short but quick bursts. Being packet/message-based, MessageQ gives the impression that it can (at least somewhat) handle issues like packet ordering, delivery status information and buffer exhaustion gracefully, i.e. inform the application and avoid crashing.
Having to build our own IPC over a CMEM area kind of negates the MessageQ advantage of TI-RTOS, especially since the average throughput is achievable with MessageQ. I would suggest that you include these limitations and the intended usage of MessageQ in the IPC User's Guide, because in its current form it presents MessageQ as a one-size-fits-all solution, which apparently it is not.
It is not clear to me if the problems I see with MessageQ, i.e. packets out of order or lost and the application crashing when the receiving buffer/heap is full, are known to TI. If not, I could share my test code with you so you can replicate and look into them.
Best regards,
Giannis
PS: I did some more testing, artificially limiting the message rate at various levels to see what happens. It seems that the ARM->DSP direction is capable of ~25 MB/s if not kept in sync with the DSP->ARM messages. The DSP->ARM direction can do up to ~14.7 MB/s, and if pushed beyond that, messages are lost without any indication from the IPC/MessageQ functions. The ARM->DSP direction, although faster, fails even worse if pushed beyond its limit, causing an out-of-heap error on the DSP and causing the ARM/Linux application to hang, requiring a reboot.
Hi Rex,
having to send (and receive) an ack message means that I cannot exploit the multiple-writers, single-reader architecture of the MessageQ framework without complicating the IPC mechanism with additional queues just for the ACK messages. If I were going to use a 1-to-1 bidirectional writer-reader structure, your suggestion would be trivial (and obvious).
Please let me know if you need any help with my test, or if you have any conclusions to share.
Best regards,
Giannis
Hi Rex, in the code I sent you I have implemented a throttling mechanism by limiting the maximum number of unacknowledged messages the ARM is allowed to send to the DSP before waiting for an ACK. This is adjusted by the MAX_TX_WINDOW macro, now set to 130. Increasing this (e.g. to 200) will almost certainly cause the error. Also, if you enable the TX_ONLY macro you can test just the ARM->DSP direction.
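For reference, the throttling logic is essentially a credit counter of this form (a simplified sketch, not a verbatim excerpt from the code I sent):

/* Simplified sketch of the throttling idea (not the verbatim test code):
 * stop sending once MAX_TX_WINDOW messages are outstanding and wait for
 * ACKs from the DSP to open the window again. */
#include <ti/ipc/Std.h>
#include <ti/ipc/MessageQ.h>

#define MAX_TX_WINDOW 130          /* max unacknowledged messages in flight */

static unsigned int outstanding = 0;

int throttled_put(MessageQ_Handle armQueue, MessageQ_QueueId dspQueueId,
                  MessageQ_Msg msg)
{
    MessageQ_Msg ack;

    /* If the window is full, block until the DSP acknowledges something. */
    while (outstanding >= MAX_TX_WINDOW) {
        if (MessageQ_get(armQueue, &ack, MessageQ_FOREVER) < 0)
            return -1;
        MessageQ_free(ack);
        outstanding--;
    }

    if (MessageQ_put(dspQueueId, msg) < 0)
        return -1;

    outstanding++;
    return 0;
}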