DSPLINK and Arm processor load

This was posted a while back in the Linux forum, but I think it might belong in the BIOS forum. I appreciate any insight someone can offer.

Hi,

I'm running some performance tests with DSPLINK 1.64 on the DM6467. My aim is to continuously stream 200Mbps of data from a thread running on the ARM to a task running on the DSP (only in that direction), and I'm trying to find the best method to achieve that. I'm running the classic loop, message, and ring_io samples provided with DSPLINK, slightly modified to calculate bandwidth and to remove the data transfers from DSP to ARM.
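The bandwidth calculation itself is nothing more than timing the send loop. A minimal sketch of what I added, where send_buffer is a placeholder for whatever transfer call the sample actually makes (ISSUE/RECLAIM, MSGQ_put, etc.):

    #include <stdio.h>
    #include <time.h>

    static double elapsed_sec(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    /* Time numBufs transfers of bufBytes each and print the bitrate.
     * send_buffer is a placeholder for the sample's transfer call. */
    void measure(unsigned bufBytes, unsigned numBufs, void (*send_buffer)(void))
    {
        struct timespec t0, t1;
        unsigned i;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < numBufs; i++) {
            send_buffer();
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%.1f Mbps\n",
               (double) bufBytes * numBufs * 8.0 / elapsed_sec(t0, t1) / 1e6);
    }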

So far I'm quite puzzled by the high ARM processor load associated with the data transfers. I expected the zero-copy methods to allow for a lower CPU load on the ARM side. No memcpy is required on the buffers, since in the demo the buffer data is generated only once, at application startup. Assuming there is no actual data movement (zero-copy transfer channel), there seems to be a high overhead associated with the exchange of synchronization information between ARM and DSP. As an example, the loop demo can be executed with 512-byte, 1024-byte, or 16KB buffers. As the buffer size increases, the actual bitrate rises (as expected), but the CPU load stays the same. This is expected, since zero copy means there is no load associated with the amount of data transferred. But it also means that the processor load comes from the synchronization information passed between ARM and DSP: the cost scales with the number of buffers per second, not with the bytes in each buffer.

As an example, using 15KB buffers I can achieve 200Mbps, but the ARM processor load is nearly 80%. All this without actually moving any data between the processors.

Is this what I can expect from DSPLINK, or is there some catch I am not aware of?

  • David,

    Unfortunately, I don't have performance numbers for DSPLink. But there are many factors involved in performance analysis. Are you building DSPLink optimized? Is your test case optimized? The DSPLink samples were written to illustrate the API usage; they are not optimized for performance.

    First, I'm trying to understand your test case. 200Mbps is equal to 25MB/sec. 25MB/sec divided by 15KB buffers is equal to 1,706 buffers/sec. When using message queue, that equals sending 1,706 messages per second, or about one message every 586us. Have I got this correct so far?
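    A quick way to redo this arithmetic for other buffer sizes (assuming 1MB = 1024KB, so that 200Mbps works out to the 25MB/sec used above):

        #include <stdio.h>

        int main(void)
        {
            double bytesPerSec = 25.0 * 1024 * 1024;  /* 25MB/sec     */
            double bufBytes    = 15.0 * 1024;         /* 15KB buffers */
            double bufsPerSec  = bytesPerSec / bufBytes;

            /* Prints: 1707 buffers/sec, one message every 586 us */
            printf("%.0f buffers/sec, one message every %.0f us\n",
                   bufsPerSec, 1e6 / bufsPerSec);
            return 0;
        }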

    The load on the ARM side has to do with many factors. When using message queue, are you allocating a new message every time? It sounds like your data is traveling one-way, so the DSP would need to free each message it receives. This implies that the message heap is shared between the host and the DSP, so every alloc and free operation on the heap requires entering and leaving a gate. This adds overhead, as well as delay if the remote processor is currently holding the gate. If the ARM thread is doing a busy-wait on the gate, it should be a short delay. But if the ARM thread is actually blocked, then the Linux scheduler will add a significant delay before the ARM thread gets to run again, due to its scheduling algorithm.

    It might be faster to pre-allocate the messages, then send a full message to the DSP and have the DSP send the empty message back to the ARM instead of freeing it. If you pipeline the messaging (say with three messages), that might help reduce latency; see the sketch below.
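    Something along these lines on the GPP side. This is only a sketch: SAMPLE_POOL_ID, APP_MSG_SIZE, the SampleDspMsgq/SampleGppMsgq queue handles, and fill_payload() are modeled on the message sample, and the exact MSGQ signatures should be checked against the msgq.h in your DSPLink 1.64 tree.

        #define PIPELINE_DEPTH 3

        MsgqMsg    msg    = NULL;
        DSP_STATUS status = DSP_SOK;
        int        i;

        /* Prime the pipeline: allocate PIPELINE_DEPTH messages up front
         * and send them all, so alloc/free never happens on the fast path. */
        for (i = 0; (i < PIPELINE_DEPTH) && DSP_SUCCEEDED(status); i++) {
            status = MSGQ_alloc(SAMPLE_POOL_ID, APP_MSG_SIZE, &msg);
            if (DSP_SUCCEEDED(status)) {
                fill_payload(msg);                      /* application-specific */
                status = MSGQ_put(SampleDspMsgq, msg);  /* full message to DSP  */
            }
        }

        /* Steady state: every empty message the DSP hands back is refilled
         * and resent, so neither side touches the shared heap (or its gate). */
        while (DSP_SUCCEEDED(status)) {
            status = MSGQ_get(SampleGppMsgq, WAIT_FOREVER, &msg);
            if (DSP_SUCCEEDED(status)) {
                fill_payload(msg);
                status = MSGQ_put(SampleDspMsgq, msg);
            }
        }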

    It's also important to add some work load. Running a load test where all you are doing is sending messages, without any processing, can give misleading results. For example, if the ARM is doing nothing but sending a message and then waiting for a return message, it will get swapped out by the scheduler, resulting in large delays. However, if the ARM thread is busy for a while between messages, then the return message will be available when the ARM thread needs it, and the thread will not get preempted. Using larger buffers with fewer messages per second would also help performance. A simple stand-in for per-buffer work is sketched below.
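    For example, a hypothetical spin-wait that stands in for per-buffer processing; the 400us figure is only illustrative, chosen to sit under the 586us message period computed above. Call it between MSGQ_get and MSGQ_put in the steady-state loop.

        #include <time.h>

        /* Busy-wait for roughly 'us' microseconds so the thread stays
         * runnable between messages instead of blocking in the scheduler. */
        static void do_work_us(long us)
        {
            struct timespec t0, t1;
            long elapsed;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            do {
                clock_gettime(CLOCK_MONOTONIC, &t1);
                elapsed = (t1.tv_sec - t0.tv_sec) * 1000000L
                        + (t1.tv_nsec - t0.tv_nsec) / 1000L;
            } while (elapsed < us);
        }

        /* usage: do_work_us(400); between MSGQ_get and MSGQ_put */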

    ~ Ramsey