Hi,
I'm running some performance tests with DSPLINK 1.64 on the DM6467. My aim is to continuously stream 200 Mbps of data from a thread running on the Arm to a task running on the DSP (in that direction only), and I'm trying to find the best method to achieve that. I'm using the classic loop, message and ring_io sample applications provided with DSPLINK, slightly modified to compute bandwidth and with the DSP-to-Arm data transfers removed.
So far I'm quite puzzled by the high Arm processor load associated with the data transfers. I expected zero-copy methods to keep the CPU load on the Arm side low. No memcpy is required on the buffers, since in the demo the buffer data is generated only once, at application start. Since no data is actually moved (zero-copy transfer channel), there seems to be a high overhead in exchanging synchronization information between Arm and DSP. As an example, the loop demo can be run with 512-byte, 1024-byte and 16 KB buffers. As the buffer size increases, the achieved bitrate rises (as expected), but the CPU load stays the same. This is consistent with zero copy, where the load does not depend on the amount of data transferred. But it also means that the load comes from passing synchronization information between Arm and DSP, i.e. a roughly fixed cost for each buffer exchanged.
For instance, with 15 KB buffers I can reach 200 Mbps, but the Arm processor load is nearly 80%, all without actually moving any data between the processors.
Is this what I should expect from DSPLINK, or is there some catch I am not aware of?