This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

AM572x OpenCL() question

Hi,

I'm using TI processor sdk Linux 03.00.00.04 with an AM572x GP EVM Rev A3.

I've been working on a simple OpenCL() program. I'm not using the C++ bindings but rather the raw OpenCL 1.1 C APIs. For my question though I don't think that matters. My kernels (I have two in my program) simply use printf to say hello and I see that in the console output of my host app.If anybody is interested, here's my program as an embedded string. To keep it simple, I don't even pass any arguments to the kernels.

const char * programStr =
  "__kernel void hello1(void) {\n"
  "  printf(\"Hello1 from DSP%d\\n\", __core_num());\n"
  "  __cycle_delay(750000000);\n"
  "}\n"
  "\n"
  "__kernel void hello2(void) {\n"
  "  printf(\"Hello2 from DSP%d\\n\", __core_num());\n"
  "  __cycle_delay(750000000);\n"
  "}\n"
  "\n";

So what I found by experiment is, it will only use both DSPs if I use clEnqueueNDRangeKernel() with multiple work-groups. If I use clEnqueueTask() to queue up the kernels for execution or clEnqueueNDRangeKernel() with just one work-group, they always execute sequentially on the same DSP core even though multiple are queued up. I put a long delay in the kernel itself via __cycle_delay() to ensure the kernels get queued up. When I run it and queue up a kernel 16 times back to back, the kernel always executes sequentially back to back on only one DSP. I set the host environment variable export TI_OCL_COMPUTE_UNIT_LIST="0, 1". If I change this to only core 1, the kernel always runs on core 1 only as expected but it at least validated I can run something on core 1. When I have it set to use both cores, I would have expected multiple queued up kernels to use up both DSPs instead of just waiting in the queue.

I decided to use clQueueNDRangeKernel() with multiple work-groups instead of clEnqueueTask() which is a single work-group and this time it did use both DSPs. I took it a step further and called clQueueNDRangeKernel() twice back to back but using a different kernel for each and although each queueing did use both DSPs, it still dequeued sequentially.

I thought that maybe this is a result of kernels executing in a FIFO order, although I would think it could still ensure the FIFO order and still use two DSPs. So I set the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property when I created my queue but this made no difference.

So my question(s) to the experts are this: Is the only way to take advantage of multiple DSPs in parallel is to use clEnqueueNDRangeKernel() with multiple work-groups? Is there some requirement that the same kernel must be running on both DSPs if they are to run in parallel? My hope was I could queue up multiple kernels, single work-group each, and have them dispatched to use both DSPs. Do I need to create a second command queue to make this happen?

Thanks

  • Brad, 

    You should be able to get both DSP’s executing concurrently using the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE flag when creating the command queue and using task rather than NDRangeKernel, and obviously not waiting on individual kernel enqueues to complete.  I took your opencl C embedded string and put together a simple host side app and got the following output.

    [core 0] Hello1 from DSP0

    [core 1] Hello2 from DSP1

     

    I did use the CPP bindings, simply because I can test things more quickly, but the pertinent parts of the host application that I used are:

     

           Kernel K1(program, "hello1");

       Kernel K2(program, "hello2");

     

       CommandQueue Q(context, devices[0],CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);

     

       Q.enqueueTask(K1);

       Q.enqueueTask(K2);

     

       Q.finish();

     

    I am using the same version of the SDK as you and did use a Linux host.

     

    -Alan 

  • Alan,

    Thanks for looking into this. I have it working now such that I can enqueue multiple tasks and have them dispatch to both DSPs. I didn't really change anything to make it work but I did do a make clean, that shouldn't have done it, only have one source file. Maybe there was a sync error since I'm developing on a host NFS mounted folder. Anyways I have it working now with regular C code.

    I also took the C++ code you provided and it executes as expected using Q.enqueueTask(). I then replaced the two Q.enqueueTask() calls that with Q.enqueueNDRangeKernel() calls using single a work-group. In this case, the two kernels executed in series on a single DSP. Finally I specified more than one work-group in the calls to Q.enqueueNDRangeKernel() and the kernels still executed sequentially but within each kernel, the work-groups were dispatched to both DSPs.

    For what I'm doing, I think this is okay behavior. None of the OpenCL 1.1 documentation/specification is clear on these finer points of dispatching, at least not that I've seen. That's why I'm doing these experiments, to get a handle on the behavior.

    For anybody whose interested, here are the results of the three experiments...

    ----------------------------------------------------------------------------

    Kernels are queued as single work-group tasks...
    Q.enqueueTask(K1);
    Q.enqueueTask(K2);

    Console output...
    [core 0] Hello1 from DSP0
    [core 1] Hello2 from DSP1

    ----------------------------------------------------------------------------    
    Kernels enqueued using NDRange with single work-group...  
    Q.enqueueNDRangeKernel(K1, NullRange, NDRange(256), NDRange(256));
    Q.enqueueNDRangeKernel(K2, NullRange, NDRange(256), NDRange(256));

    Console output...
    [core 0] Hello1 from DSP0

    [core 0] Hello2 from DSP0

    ----------------------------------------------------------------------------   
    Kernels are queued as NDRange with multiple work-groups...
    Q.enqueueNDRangeKernel(K1, NullRange, NDRange(256), NDRange(128));

    Q.enqueueNDRangeKernel(K2, NullRange, NDRange(256), NDRange(128));

    Console output...
    [core 1] Hello1 from DSP1

    [core 0] Hello1 from DSP0
    [core 0] Hello2 from DSP0
    [core 1] Hello2 from DSP1
    ----------------------------------------------------------------------------   

    Thanks