I updated the big data IPC example to benchmark the data being sent, similar to how MessageQBench is set up. However, the throughput numbers are very low no matter the buffer size: from 16KB buffers up to 4MB buffers, throughput ranged from 11,000 KB/s to 15,000 KB/s, where we were expecting hundreds of MB/s.
Posted the code here: https://github.com/jcormier/big-data-ipc-example/commits/benchmark
I started digging into where the majority of the time is spent and was able to localize it to the loop on the DSP which validates the ARM's count pattern.
So for the 2MB messages, it consistently takes the DSP 161ms (~12.4MB/s) to read the buffer from RAM:

    /* Check values to see expected results */
    for (j = 0; j < bigDataLocalDesc.size/sizeof(uint32_t); j++) {
        if (bigDataLocalPtr[j] != (msg->id + j)) {
            errorCount++;
        }
    }
but only 9ms (~222MB/s) to fill it:

    /* Fill new data */
    for (j = 0; j < bigDataLocalDesc.size/sizeof(uint32_t); j++)
        bigDataLocalPtr[j] = msg->id + 10 + j;
I ran a couple of tests:
- Disabled cache with Cache_disable(Cache_Type_ALL);
With this change, the check loop stayed at the same 161ms but the fill loop now took >300ms. This would seem to indicate that somehow the cache isn't affecting reads of this data; I would have expected both loops to become slower.
- Malloc'd a 256KB buffer on the DSP heap and used that pointer instead of the one passed via IPC
Note: I wasn't able to figure out how to get a larger malloc to succeed, so I had to adjust all my numbers for the now much smaller buffer, but the issue is still visible.
ARM CMEM ptr (256KB) took ~21ms to verify buffer and ~1ms to write
DSP Malloc ptr (256KB) took ~1ms to verify buffer and ~1ms to write
Note: I'm using the timestamps in the DSP tracebuffer to time the DSP code.
And with such a small buffer I'd need better timing resolution to calculate exact rates, but regardless, the large timing difference between the ARM CMEM pointer and the DSP malloc pointer shows that something is wrong.
The debugging commits are in this branch: https://github.com/jcormier/big-data-ipc-example/commits/benchmark_debugging
Can TI run my benchmark code and verify they are seeing this same slow throughput?
Any feedback on what may be misconfigured on the DSP side would be helpful.