Hi,
I'm using TI Processor SDK Linux 03.00.00.04 with an AM572x GP EVM Rev A3.
I've been working on a simple OpenCL program. I'm not using the C++ bindings but rather the raw OpenCL 1.1 C APIs, though I don't think that matters for my question. My two kernels simply use printf to say hello, and I see that output on the console of my host app. If anybody is interested, here's my program as an embedded string. To keep it simple, I don't even pass any arguments to the kernels.
const char * programStr =
"__kernel void hello1(void) {\n"
" printf(\"Hello1 from DSP%d\\n\", __core_num());\n"
" __cycle_delay(750000000);\n"
"}\n"
"\n"
"__kernel void hello2(void) {\n"
" printf(\"Hello2 from DSP%d\\n\", __core_num());\n"
" __cycle_delay(750000000);\n"
"}\n"
"\n";
What I found by experiment is that the runtime only uses both DSPs if I call clEnqueueNDRangeKernel() with multiple work-groups. If I use clEnqueueTask(), or clEnqueueNDRangeKernel() with just one work-group, the kernels always execute sequentially on the same DSP core, even when several are queued. I put a long delay in the kernel via __cycle_delay() to make sure the kernels actually pile up in the queue. When I enqueue a kernel 16 times in a row, the 16 instances still run one after another on a single DSP. I have the host environment variable set with export TI_OCL_COMPUTE_UNIT_LIST="0, 1". If I change it to list only core 1, the kernel always runs on core 1, as expected, which at least confirms that I can run something on core 1. With both cores listed, I expected the queued kernels to be spread across both DSPs instead of waiting in the queue.
I then switched from clEnqueueTask(), which is a single work-group, to clEnqueueNDRangeKernel() with multiple work-groups, and this time both DSPs were used. Taking it a step further, I called clEnqueueNDRangeKernel() twice back to back with a different kernel each time. Each enqueue used both DSPs, but the two commands still dequeued sequentially.
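To make the work-group arithmetic concrete, here is a minimal sketch of how I size the two cases. The helper names make_ndrange_1d and work_group_count are hypothetical (they are not part of OpenCL or the TI SDK), and the OpenCL call itself appears only in a comment so the snippet builds without the SDK headers. Per the OpenCL 1.1 spec, clEnqueueTask() behaves like a 1-D NDRange with a global size of 1 and a local size of 1, i.e. exactly one work-group.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: sizes for a 1-D NDRange. */
typedef struct {
    size_t global_size; /* total work-items (global_work_size[0]) */
    size_t local_size;  /* work-items per work-group (local_work_size[0]) */
} ndrange_1d;

/* Build sizes so the range splits into exactly num_groups work-groups. */
ndrange_1d make_ndrange_1d(size_t num_groups, size_t items_per_group)
{
    ndrange_1d r;
    assert(num_groups > 0 && items_per_group > 0);
    r.local_size = items_per_group;
    r.global_size = num_groups * items_per_group;
    return r;
}

/* Number of work-groups the runtime has available to distribute. */
size_t work_group_count(ndrange_1d r)
{
    return r.global_size / r.local_size;
}

/*
 * Intended use with an existing queue and kernel:
 *
 *   ndrange_1d r = make_ndrange_1d(2, 1);   // two work-groups
 *   clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
 *                          &r.global_size, &r.local_size,
 *                          0, NULL, NULL);
 *
 * make_ndrange_1d(1, 1) gives the same shape as clEnqueueTask().
 */
```

With make_ndrange_1d(1, 1) there is only one work-group to hand out, which matches the single-DSP behavior I see. With two or more work-groups, both DSPs get used in my tests.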
I thought that maybe this is a result of kernels executing in a FIFO order, although I would think it could still ensure the FIFO order and still use two DSPs. So I set the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property when I created my queue but this made no difference.
So my questions to the experts are: Is using clEnqueueNDRangeKernel() with multiple work-groups the only way to take advantage of multiple DSPs in parallel? Is there some requirement that the same kernel must be running on both DSPs if they are to run in parallel? My hope was that I could queue up multiple kernels, a single work-group each, and have them dispatched across both DSPs. Do I need to create a second command queue to make this happen?
Thanks