AM572x OpenCL() question

Brad Caldwell

Hi,

I'm using TI processor sdk Linux 03.00.00.04 with an AM572x GP EVM Rev A3.

I've been working on a simple OpenCL() program. I'm not using the C++ bindings but rather the raw OpenCL 1.1 C APIs. For my question though I don't think that matters. My kernels (I have two in my program) simply use printf to say hello and I see that in the console output of my host app.If anybody is interested, here's my program as an embedded string. To keep it simple, I don't even pass any arguments to the kernels.

const char * programStr =
"__kernel void hello1(void) {\n"
" printf(\"Hello1 from DSP%d\\n\", __core_num());\n"
" __cycle_delay(750000000);\n"
"}\n"
"\n"
"__kernel void hello2(void) {\n"
" printf(\"Hello2 from DSP%d\\n\", __core_num());\n"
" __cycle_delay(750000000);\n"
"}\n"
"\n";

So what I found by experiment is, it will only use both DSPs if I use clEnqueueNDRangeKernel() with multiple work-groups. If I use clEnqueueTask() to queue up the kernels for execution or clEnqueueNDRangeKernel() with just one work-group, they always execute sequentially on the same DSP core even though multiple are queued up. I put a long delay in the kernel itself via __cycle_delay() to ensure the kernels get queued up. When I run it and queue up a kernel 16 times back to back, the kernel always executes sequentially back to back on only one DSP. I set the host environment variable export TI_OCL_COMPUTE_UNIT_LIST="0, 1". If I change this to only core 1, the kernel always runs on core 1 only as expected but it at least validated I can run something on core 1. When I have it set to use both cores, I would have expected multiple queued up kernels to use up both DSPs instead of just waiting in the queue.

I decided to use clQueueNDRangeKernel() with multiple work-groups instead of clEnqueueTask() which is a single work-group and this time it did use both DSPs. I took it a step further and called clQueueNDRangeKernel() twice back to back but using a different kernel for each and although each queueing did use both DSPs, it still dequeued sequentially.

I thought that maybe this is a result of kernels executing in a FIFO order, although I would think it could still ensure the FIFO order and still use two DSPs. So I set the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property when I created my queue but this made no difference.

So my question(s) to the experts are this: Is the only way to take advantage of multiple DSPs in parallel is to use clEnqueueNDRangeKernel() with multiple work-groups? Is there some requirement that the same kernel must be running on both DSPs if they are to run in parallel? My hope was I could queue up multiple kernels, single work-group each, and have them dispatched to use both DSPs. Do I need to create a second command queue to make this happen?

Thanks

over 7 years ago

0 award over 7 years ago

TI__Intellectual 1400 points

Brad,

You should be able to get both DSP’s executing concurrently using the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE flag when creating the command queue and using task rather than NDRangeKernel, and obviously not waiting on individual kernel enqueues to complete. I took your opencl C embedded string and put together a simple host side app and got the following output.

[core 0] Hello1 from DSP0

[core 1] Hello2 from DSP1

I did use the CPP bindings, simply because I can test things more quickly, but the pertinent parts of the host application that I used are:

Kernel K1(program, "hello1");

Kernel K2(program, "hello2");

CommandQueue Q(context, devices[0],CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);

Q.enqueueTask(K1);

Q.enqueueTask(K2);

Q.finish();

I am using the same version of the SDK as you and did use a Linux host.

-Alan

0 Brad Caldwell over 7 years ago in reply to award

TI__Genius 12595 points

Alan,

Thanks for looking into this. I have it working now such that I can enqueue multiple tasks and have them dispatch to both DSPs. I didn't really change anything to make it work but I did do a make clean, that shouldn't have done it, only have one source file. Maybe there was a sync error since I'm developing on a host NFS mounted folder. Anyways I have it working now with regular C code.

I also took the C++ code you provided and it executes as expected using Q.enqueueTask(). I then replaced the two Q.enqueueTask() calls that with Q.enqueueNDRangeKernel() calls using single a work-group. In this case, the two kernels executed in series on a single DSP. Finally I specified more than one work-group in the calls to Q.enqueueNDRangeKernel() and the kernels still executed sequentially but within each kernel, the work-groups were dispatched to both DSPs.

For what I'm doing, I think this is okay behavior. None of the OpenCL 1.1 documentation/specification is clear on these finer points of dispatching, at least not that I've seen. That's why I'm doing these experiments, to get a handle on the behavior.

For anybody whose interested, here are the results of the three experiments...

----------------------------------------------------------------------------

Kernels are queued as single work-group tasks...
Q.enqueueTask(K1);
Q.enqueueTask(K2);

Console output...
[core 0] Hello1 from DSP0
[core 1] Hello2 from DSP1

----------------------------------------------------------------------------
Kernels enqueued using NDRange with single work-group...
Q.enqueueNDRangeKernel(K1, NullRange, NDRange(256), NDRange(256));
Q.enqueueNDRangeKernel(K2, NullRange, NDRange(256), NDRange(256));

Console output...
[core 0] Hello1 from DSP0
[core 0] Hello2 from DSP0

----------------------------------------------------------------------------
Kernels are queued as NDRange with multiple work-groups...
Q.enqueueNDRangeKernel(K1, NullRange, NDRange(256), NDRange(128));
Q.enqueueNDRangeKernel(K2, NullRange, NDRange(256), NDRange(128));

Console output...
[core 1] Hello1 from DSP1
[core 0] Hello1 from DSP0
[core 0] Hello2 from DSP0
[core 1] Hello2 from DSP1
----------------------------------------------------------------------------

Thanks

Processors

Processors forum

AM572x OpenCL() question