In our project we are receiving audio data on the ARM/Linux processor, putting the data into a shared memory region, and then messaging the DSP/BIOS side to perform work on the audio data. Once audio processing is completed by the DSP, the same IPC message is returned to the ARM, indicating that processing is complete and the next message may be sent. This process works as expected with one caveat: as the amount of time the DSP spends processing audio data increases, the amount of time it takes for the DSP's return message to be received by the ARM, after audio processing has finished, also increases. The increase in overhead appears to be approximately 18% of the time spent by the DSP doing its work.
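(For context, the ARM-side exchange is the usual MessageQ round trip; the sketch below is illustrative only, with placeholder names such as HEAP_ID, dspQueueId, hostQueue, offset and length, and with the message carrying just a reference to the audio buffer that lives in the shared region.)

#include <stdint.h>
#include <ti/ipc/MessageQ.h>

#define HEAP_ID 0   /* placeholder heap ID */

/* Illustrative only: placeholder names, error handling omitted.
 * The audio data stays in the shared-memory region; the MessageQ message
 * only tells the DSP where to find it. */
typedef struct {
    MessageQ_MsgHeader header;   /* required IPC message header            */
    uint32_t bufferOffset;       /* offset of the audio data in shared mem */
    uint32_t bufferLength;
} AudioMsg;

static void sendAudioToDsp(MessageQ_QueueId dspQueueId, MessageQ_Handle hostQueue,
                           uint32_t offset, uint32_t length)
{
    AudioMsg *msg = (AudioMsg *)MessageQ_alloc(HEAP_ID, sizeof(AudioMsg));
    msg->bufferOffset = offset;
    msg->bufferLength = length;
    MessageQ_put(dspQueueId, (MessageQ_Msg)msg);                      /* kick the DSP    */
    MessageQ_get(hostQueue, (MessageQ_Msg *)&msg, MessageQ_FOREVER);  /* wait for 'done' */
    MessageQ_free((MessageQ_Msg)msg);
}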
We instrumented the ARM/Linux application to measure the round-trip time of a single message for various amounts of DSP work. Below is a table of measurements:
| Linux Monotonic Elapsed Round-Trip TX/RX Time (ms) | DSP Processing Delay (ms) | Overhead (ms) | Overhead as % of Round-Trip Time |
|---|---|---|---|
| 0.15 | 0 | 0.15 | 100 |
| 6.27 | 5 | 1.27 | 20.26 |
| 12.38 | 10 | 2.38 | 19.22 |
| 24.58 | 20 | 4.58 | 18.63 |
| 36.79 | 30 | 6.79 | 18.46 |
When no work is performed by the DSP, a round-trip IPC message takes 0.15 ms. We are assuming this is the 'base' overhead for a round-trip packet between the ARM and DSP. As the processing delay on the DSP increases, so does the overhead. We would expect the round-trip time to be close to the DSP processing delay plus the base overhead (0.15 ms), plus or minus some small variance. Instead, we find that the overhead increases as the DSP processing increases: if you subtract the base overhead (0.15 ms) from all of the round-trip measurements made on the ARM/Linux side, the remaining overhead is approximately 18% of the time spent processing on the DSP.
We are trying to understand where this additional overhead is coming from. We have tried all of the following and still see the same behavior:
- Non-RT kernel vs RT kernel
- Running the Linux thread as RT vs non-RT
- Running DSP from L2SRAM vs DDR (which is shared with Linux)
- Moving DSP RX message queue alone into L2SRAM (with code in DDR)
- With and without any other processes running on the ARM - aside from built-in/standard processes (covered in non-RT vs RT kernel)
- Task_sleep() as the delay in the DSP versus a hard-coded NOP loop versus doing actual audio processing
- Polling RX MessageQ for returned packet versus waiting forever
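(On the last point, 'polling' versus 'waiting forever' just refers to the timeout argument passed to MessageQ_get on the Linux side, roughly as below; hostQueue is a placeholder handle.)

/* Blocking receive: sleep inside MessageQ_get until the DSP returns the message. */
Int status;
MessageQ_Msg msg;
status = MessageQ_get(hostQueue, &msg, MessageQ_FOREVER);

/* Polling receive: non-blocking gets in a loop until the message shows up. */
do {
    status = MessageQ_get(hostQueue, &msg, 0);
} while (status == MessageQ_E_TIMEOUT);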
As a sanity check, I also performed the exact same experiment using the stock ipc_3_50_04_08/examples/66AK2G_linux_elf/ex02_messageq demo application. The only changes I made were to add timing instrumentation to the Linux application around the round-trip message, and to add a fixed delay on the DSP after reception of the message but before returning it.
In App.c I added
clock_gettime(CLOCK_MONOTONIC, &start_mono);
in App_exec() before the message is allocated and added
clock_gettime(CLOCK_MONOTONIC, &end_mono);
diff = 1000000000ULL * (end_mono.tv_sec - start_mono.tv_sec) + end_mono.tv_nsec - start_mono.tv_nsec;
printf("rt:%llu nanoseconds\n", diff);
fflush(stdout);
after the message is received back from the DSP and freed.
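Pieced together, the instrumented round trip inside App_exec() looks roughly like the sketch below (illustrative, not a verbatim diff of the demo; the queue/heap names and MSG_SIZE are placeholders, and diff is declared as an unsigned long long to match the %llu format).

#include <stdio.h>
#include <time.h>
#include <ti/ipc/MessageQ.h>

/* Sketch of the timing instrumentation around one round trip (placeholder names). */
struct timespec start_mono, end_mono;
unsigned long long diff;
MessageQ_Msg msg;

clock_gettime(CLOCK_MONOTONIC, &start_mono);          /* before the message is allocated */
msg = MessageQ_alloc(HEAP_ID, MSG_SIZE);
MessageQ_put(dspQueueId, msg);                        /* send work to the DSP            */
MessageQ_get(hostQueue, &msg, MessageQ_FOREVER);      /* wait for the DSP to return it   */
MessageQ_free(msg);
clock_gettime(CLOCK_MONOTONIC, &end_mono);            /* after the message is freed      */

diff = 1000000000ULL * (end_mono.tv_sec - start_mono.tv_sec)
     + end_mono.tv_nsec - start_mono.tv_nsec;
printf("rt:%llu nanoseconds\n", diff);
fflush(stdout);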
In Server.c I added
Task_sleep(X);
in Server_exec() after receiving the message but before returning it.
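On the DSP side, the change amounts to the sketch below (again illustrative rather than a verbatim copy of Server.c; serverQueue is a placeholder handle, and X is in system ticks, which is 1 ms per tick with the default SYS/BIOS Clock.tickPeriod).

#include <ti/sysbios/knl/Task.h>
#include <ti/ipc/MessageQ.h>

/* Sketch of the DSP-side loop in Server_exec() with the artificial delay. */
MessageQ_Msg msg;

for (;;) {
    MessageQ_get(serverQueue, &msg, MessageQ_FOREVER);   /* wait for work from the ARM     */
    Task_sleep(X);                                       /* stand-in for audio processing  */
    MessageQ_put(MessageQ_getReplyQueue(msg), msg);      /* return the same message        */
}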
This experiment with the 'stock' MessageQ demo application yields the same result. As the value of X on the DSP increases, so does the round-trip overhead; the round-trip time is not simply the base overhead (0.15 ms) + Task_sleep(X), it is approximately the base overhead (0.15 ms) + Task_sleep(X) + 0.18 * Task_sleep(X).
We are hoping someone can help us understand where this extra overhead is coming from.
Thanks,
Jeremy