In our project we are receiving audio data on the ARM/Linux processor, putting the data into a shared memory region, and then messaging the DSP/BIOS side to perform work on the audio data. Once audio processing is completed by the DSP, the same IPC message is returned to the ARM, indicating that processing is complete and the next message may be sent. This process works as expected with one caveat: as the amount of time the DSP spends processing audio data increases, the amount of time it takes for the DSP's return message to be received by the ARM, after audio processing has finished, also increases. The increase in overhead appears to be approximately 18% of the time spent by the DSP doing its work.
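(For context, the ARM-side exchange is the usual MessageQ round trip; the sketch below is illustrative only, with placeholder names such as HEAP_ID, dspQueueId, hostQueue, offset and length, and with the message carrying just a reference to the audio buffer that lives in the shared region.)

#include <stdint.h>
#include <ti/ipc/MessageQ.h>

#define HEAP_ID 0   /* placeholder heap ID */

/* Illustrative only: placeholder names, error handling omitted.
 * The audio data stays in the shared-memory region; the MessageQ message
 * only tells the DSP where to find it. */
typedef struct {
    MessageQ_MsgHeader header;   /* required IPC message header            */
    uint32_t bufferOffset;       /* offset of the audio data in shared mem */
    uint32_t bufferLength;
} AudioMsg;

static void sendAudioToDsp(MessageQ_QueueId dspQueueId, MessageQ_Handle hostQueue,
                           uint32_t offset, uint32_t length)
{
    AudioMsg *msg = (AudioMsg *)MessageQ_alloc(HEAP_ID, sizeof(AudioMsg));
    msg->bufferOffset = offset;
    msg->bufferLength = length;
    MessageQ_put(dspQueueId, (MessageQ_Msg)msg);                      /* kick the DSP    */
    MessageQ_get(hostQueue, (MessageQ_Msg *)&msg, MessageQ_FOREVER);  /* wait for 'done' */
    MessageQ_free((MessageQ_Msg)msg);
}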
We instrumented the ARM/Linux application to measure the round-trip time of a single message for various amounts of DSP work. Below is a table of measurements:
| Linux Monotonic Elapsed Round-Trip TX/RX Time (ms) | DSP Processing Delay (ms) | Overhead (ms) | Overhead as % of Round-Trip Time |
|---|---|---|---|
| 0.15 | 0 | 0.15 | 100 |
| 6.27 | 5 | 1.27 | 20.26 |
| 12.38 | 10 | 2.38 | 19.22 |
| 24.58 | 20 | 4.58 | 18.63 |
| 36.79 | 30 | 6.79 | 18.46 |
When no work is performed by the DSP, a round-trip IPC message takes 0.15 ms. We are assuming this is the 'base' overhead for a round-trip packet between the ARM and DSP. As the processing delay on the DSP increases, so does the overhead. We would expect the round-trip time to be close to the DSP processing delay plus the base overhead (0.15 ms), plus or minus some small variance. Instead, we find that the overhead increases as the DSP processing increases: if you subtract the base overhead (0.15 ms) from all of the round-trip measurements made on the ARM/Linux side, the remaining overhead is approximately 18% of the time spent processing on the DSP.
We are trying to understand where this additional overhead is coming from. We have tried all of the following and still see the same behavior:
- Non-RT kernel vs RT kernel
- Running the Linux thread as RT vs non-RT
- Running DSP from L2SRAM vs DDR (which is shared with Linux)
- Moving DSP RX message queue alone into L2SRAM (with code in DDR)
- With and without any other processes running on the ARM - aside from built-in/standard processes (covered in non-RT vs RT kernel)
- Task_sleep() as the delay in the DSP versus a hard-coded NOP loop versus doing actual audio processing
- Polling RX MessageQ for returned packet versus waiting forever
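(On the last point, 'polling' versus 'waiting forever' just refers to the timeout argument passed to MessageQ_get on the Linux side, roughly as below; hostQueue is a placeholder handle.)

/* Blocking receive: sleep inside MessageQ_get until the DSP returns the message. */
Int status;
MessageQ_Msg msg;
status = MessageQ_get(hostQueue, &msg, MessageQ_FOREVER);

/* Polling receive: non-blocking gets in a loop until the message shows up. */
do {
    status = MessageQ_get(hostQueue, &msg, 0);
} while (status == MessageQ_E_TIMEOUT);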
As a sanity check, I also performed the exact same experiment using the stock ipc_3_50_04_08/examples/66AK2G_linux_elf/ex02_messageq demo application. The only changes I made were to add timing instrumentation to the Linux application around the round-trip message, and to add a fixed delay on the DSP after reception of the message but before returning it.
In App.c I added
clock_gettime(CLOCK_MONOTONIC, &start_mono);
in App_exec() before the message is allocated and added
clock_gettime(CLOCK_MONOTONIC, &end_mono);
diff = 1000000000ULL * (end_mono.tv_sec - start_mono.tv_sec) + end_mono.tv_nsec - start_mono.tv_nsec;
printf("rt:%llu nanoseconds\n", diff);
fflush(stdout);
after the message is received back from the DSP and freed.
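Pieced together, the instrumented round trip inside App_exec() looks roughly like the sketch below (illustrative, not a verbatim diff of the demo; the queue/heap names and MSG_SIZE are placeholders, and diff is declared as an unsigned long long to match the %llu format).

#include <stdio.h>
#include <time.h>
#include <ti/ipc/MessageQ.h>

/* Sketch of the timing instrumentation around one round trip (placeholder names). */
struct timespec start_mono, end_mono;
unsigned long long diff;
MessageQ_Msg msg;

clock_gettime(CLOCK_MONOTONIC, &start_mono);          /* before the message is allocated */
msg = MessageQ_alloc(HEAP_ID, MSG_SIZE);
MessageQ_put(dspQueueId, msg);                        /* send work to the DSP            */
MessageQ_get(hostQueue, &msg, MessageQ_FOREVER);      /* wait for the DSP to return it   */
MessageQ_free(msg);
clock_gettime(CLOCK_MONOTONIC, &end_mono);            /* after the message is freed      */

diff = 1000000000ULL * (end_mono.tv_sec - start_mono.tv_sec)
     + end_mono.tv_nsec - start_mono.tv_nsec;
printf("rt:%llu nanoseconds\n", diff);
fflush(stdout);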
In Server.c I added
Task_sleep(X);
in Server_exec() after receiving the message but before returning it.
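On the DSP side, the change amounts to the sketch below (again illustrative rather than a verbatim copy of Server.c; serverQueue is a placeholder handle, and X is in system ticks, which is 1 ms per tick with the default SYS/BIOS Clock.tickPeriod).

#include <ti/sysbios/knl/Task.h>
#include <ti/ipc/MessageQ.h>

/* Sketch of the DSP-side loop in Server_exec() with the artificial delay. */
MessageQ_Msg msg;

for (;;) {
    MessageQ_get(serverQueue, &msg, MessageQ_FOREVER);   /* wait for work from the ARM     */
    Task_sleep(X);                                       /* stand-in for audio processing  */
    MessageQ_put(MessageQ_getReplyQueue(msg), msg);      /* return the same message        */
}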
This experiment with the 'stock' MessageQ demo application yields the same result. As the value of X on the DSP increases, so does the round-trip overhead; the round-trip time is not simply the base overhead (0.15 ms) + Task_sleep(X), it is approximately the base overhead (0.15 ms) + Task_sleep(X) + 0.18 * Task_sleep(X).
We are hoping someone can help us understand where this extra overhead is coming from.
Thanks,
Jeremy