OpenCL Platform on EVMK2H

Other Parts Discussed in Thread: SYSBIOS

Hi,

I am a graduate student at Purdue University. I am doing performance testing on the EVMK2H OpenCL runtime, and I would like to know how to do the following.

1> Limit the number of DSP cores to which workgroups are offloaded to fewer than 8, using a configurable parameter within the OpenCL runtime, from Linux, or perhaps via a SYS/BIOS system call.

2> View the L2 usage of the DSP when OpenCL offloads work-items to the DSP cores.

3> View the SYS/BIOS thread ID corresponding to an OpenCL work-item. Is the thread ID the same or different for work-items executing within a workgroup?


4> Is there a way to view DSP registers and other state through CCS/JTAG while the OpenCL host offloads work-items to the DSP cores, in order to gather a scheduling timeline per core? P.S. I have a fully licensed CCS version 6.0.

These questions are quite important, and the answers would help many in the ARM+DSP OpenCL developer community improve their code.

Thanks in advance for getting back to me at your earliest convenience.

Ravi Gupta 

  • Hi Ravi,

       The following are the answers to your 4 questions.

    1>  Currently we do not export a control to limit the number of DSP cores that participate in the workgroup computation.  The only way a user can limit the number of DSP cores is to set the total number of workgroups to a value between 1 and 7.  We do plan to support the device-partitioning feature from OpenCL 1.2, but it is not available right now.

    2> We do not expose the L2 usage of the DSP at this time.  The available size of the OpenCL __local address space (mapped to L2 SRAM in our implementation) can be queried through the device-information API.  At kernel enqueue time, if the user application has not used up all available L2 buffers, the OpenCL runtime reserves the right to place certain runtime data in L2 to improve performance.  Part of the L2 SRAM is statically partitioned as regular L2 cache.
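    A hedged OpenCL C sketch of how a kernel might use that __local space (the kernel name and buffer size are invented for illustration; on TI's implementation the local buffer lives in L2 SRAM, and its limit is what the host reads via clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE):

```c
/* Illustrative kernel, not from the thread: a __local tile that the TI
   runtime places in L2 SRAM.  Keep its total size within the value
   reported by CL_DEVICE_LOCAL_MEM_SIZE. */
kernel void scale2x(global float *out, global const float *in)
{
    local float tile[256];            /* allocated from L2 SRAM */
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    tile[lid] = in[gid];              /* stage through fast local memory */
    barrier(CLK_LOCAL_MEM_FENCE);
    out[gid] = tile[lid] * 2.0f;
}
```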

    3> Our DSP architecture differs from a GPU's in that we only want a limited number of threads (tasks, in SYS/BIOS terms).  All work-items in the same workgroup are run by the same task/thread on the same core; different workgroups can run on different cores.  We support printf() in OpenCL C kernel source, which prepends a core number telling you which core the printf ran on.  For example, you can write "if (get_local_id(0) == 0) printf("workgroup %d\n", get_group_id(0));", and on the ARM side you might see "[core 2] workgroup 5".

    4> You can certainly pause the OpenCL application and use the JTAG interface to examine DSP registers and memory contents.  We have a debug interface that helps you set up breakpoints and pause: the environment variable is "TI_OCL_DEBUG".  For example, you can run "TI_OCL_DEBUG=1 ./your_ocl_app", and it will print a gdbc6x command that you can run from a different window.  We also support certain events with ULM, which sends timestamps and events back to the ARM side; the tool is invoked as "dsptop -l last".  For example, run "dsptop" in one ARM window, then start your OpenCL app in a different ARM window.  Once your OpenCL app finishes, type "q" in the first window to see the dumped ULM messages.
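    Put together, the debug workflow described above looks roughly like this as a terminal sketch ("./your_ocl_app" is a placeholder for the actual application):

```
# Window 1: start the ULM collector before the run
dsptop -l last

# Window 2: run the app under the debug hook; it prints a gdbc6x
# command to execute in yet another window for breakpoints/registers
TI_OCL_DEBUG=1 ./your_ocl_app

# Window 1, after the app finishes: press "q" to dump the ULM messages
```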

    - Yuan

     

  • Hi Yuan,

    Thanks for the quick response. I have used dsptop before and it is a really good utility.


    I also wanted to check whether there is a way to get a floating-point operation count or an instruction count from the DSP counters when OpenCL offloads a task to the DSP.  Does the C66x have a hardware performance counter to measure FLOPs/instructions?

    Thanks

    Ravi

  • Hi Ravi,

    No, I am not aware of such hardware performance counters for FLOPs/instructions.  You can get cycle counts, though.  We offer two APIs you can use directly in your OpenCL C kernel source for DSP cycle counts (32-bit and 64-bit):

    From /usr/share/ti/opencl/dsp_c.h:

        uint32_t __clock   (void);
        uint64_t __clock64 (void);

    I am not sure the number of (floating-point) instructions is a valid measurement.  Otherwise, as a compiler writer, I could exploit every empty slot in a VLIW instruction packet to insert a dummy FLOP instruction.  Would that mean I can artificially pump up the FLOP performance?  I think people usually compute the FLOPs required at the algorithm level and measure the time spent to derive the performance, rather than measuring it directly on the hardware.

    - Yuan