
EVMK2H: OpenCL and DSP Tracing

Part Number: EVMK2H

Hi,

We have been using TI OpenCL on the EVMK2H to test the signal-processing capability provided by the DSPs. We have implemented most of the functionality in standard C code linked into the OpenCL C code.

For general DSP debugging and tracing we have been using printf() and it works fine. However, we are wondering whether there is any way we can make use of the CTools debugging capabilities, for example, using STM with the ETB to capture DSP core trace and dump the data from the ETB to some memory that can be transferred back to the OpenCL host (and potentially view the result with the CCS Trace Analyzer).

My understanding is that the CTools capabilities cannot be used with OpenCL because the CTools libraries require a linker command file for memory mapping and OpenCL does not support that. Is that correct, or is there a way that CTools can work with OpenCL?

Thanks,

Kathy

  • I've forwarded this to the experts. Their feedback should be posted here.

    BR
    Tsvetolin Shulev
  • Hi Kathy,

        Glad to hear that you are using OpenCL on K2H and that it has been working for your use case so far.  We have a work item on our TODO list to integrate some of the CTools libraries (e.g. AETLib, if not all of them) into the OpenCL runtime; we would then either expose those APIs to OpenCL kernel code as an extended C API, like what we did in /usr/share/ti/opencl/dsp_c.h, or provide profiling information at the work-group or kernel boundary on the host side.  However, we haven't had time to work on this yet.

        Depending on the APIs that you wish to use, you might be able to link the CTools libraries into your OpenCL kernel, just as you have done to link your standard C code into the OpenCL C code.  You mentioned the linker command file.  The default linker command file that we use for OpenCL kernels is /usr/share/ti/opencl/dsp.syms.  We put all kernel code and data in DDR memory, and they get dynamically loaded onto the DSPs by the OpenCL runtime.  I looked at a few linker command examples in AETLib; they put code and data in L2 SRAM.  If you don't mind putting them in DDR, the default OpenCL runtime linker command file should continue to work for your use case.

        In your user code (standard C or OpenCL C), you just call these CTools APIs as usual.  It might be helpful to restrict the OpenCL work-group size to (1,1,1) by using "__attribute__((reqd_work_group_size(1,1,1)))" so that your CTools setup and teardown APIs are called only once per work-group, at the beginning and the end.  Otherwise, you will have to deal with the semantics between OpenCL work-items (your kernel code as written) and OpenCL work-groups (loops wrapped around your kernel code).
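
        For illustration, a minimal kernel skeleton along those lines could look like the sketch below.  my_trace_setup()/my_trace_teardown() are hypothetical placeholder names for whatever AETLib/ETBLib wrappers you implement in your linked standard C code; they are not CTools API names.

            /* Placeholder prototypes: implemented in the standard C code that is
               linked with the OpenCL C code; not actual CTools APIs. */
            void my_trace_setup(void);
            void my_trace_teardown(void);

            __kernel __attribute__((reqd_work_group_size(1, 1, 1)))
            void traced_kernel(__global const float *in,
                               __global float       *out,
                               int n)
            {
                my_trace_setup();                /* runs once per work-group */
                for (int i = 0; i < n; ++i)      /* your existing processing */
                    out[i] = 2.0f * in[i];
                my_trace_teardown();             /* runs once at the end */
            }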

        Another concern is core-private data.  This is related to the linker command question that you asked.  It is okay to share the library code in DDR, but not okay to share data in DDR if it is core-private.  I guess you can find out whether any CTools library data are private.  If they are indeed core-private, you might need to run your kernel on only one core by enqueuing it as a task.  Note that the AETLib linker command files worked around this issue by using core-private L2 SRAM.  Be sure to check the return codes of the CTools APIs and make sure each API call has succeeded.
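
        A minimal host-side sketch of that approach is below, assuming the context, device, program and buffers already exist.  clEnqueueTask runs the kernel as a single work-item in a single work-group, so it lands on one DSP core; the kernel name and error handling here are illustrative only.

            #include <stdio.h>
            #include <stdlib.h>
            #include <CL/cl.h>

            #define CHECK(expr, what) \
                do { cl_int _e = (expr); if (_e != CL_SUCCESS) { \
                    fprintf(stderr, "%s failed: %d\n", (what), _e); exit(1); } } while (0)

            void run_on_single_core(cl_context ctx, cl_device_id dev, cl_program prog,
                                    cl_mem in_buf, cl_mem out_buf, cl_int n)
            {
                cl_int err;
                cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);
                CHECK(err, "clCreateCommandQueue");

                cl_kernel k = clCreateKernel(prog, "traced_kernel", &err);
                CHECK(err, "clCreateKernel");

                CHECK(clSetKernelArg(k, 0, sizeof(cl_mem), &in_buf),  "clSetKernelArg 0");
                CHECK(clSetKernelArg(k, 1, sizeof(cl_mem), &out_buf), "clSetKernelArg 1");
                CHECK(clSetKernelArg(k, 2, sizeof(cl_int), &n),       "clSetKernelArg 2");

                /* Task = one work-item, one work-group, one DSP core, so any
                   core-private CTools data stays private to that core. */
                CHECK(clEnqueueTask(q, k, 0, NULL, NULL), "clEnqueueTask");
                CHECK(clFinish(q), "clFinish");

                clReleaseKernel(k);
                clReleaseCommandQueue(q);
            }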

        Last but not least, you may need to know where the kernel code and data are dynamically loaded if you want to use the CCS tools to analyze the data that you collected.  You can run your application with "TI_OCL_DEBUG=ccs" to see the code offset.  You can Ctrl-C to terminate your app right after seeing the offset.  Currently this is the only exposed method for obtaining the code offset (we added it for debugging purposes, not profiling purposes).

        Hope this answers your question.  Let us know how it goes.  We are interested in your OpenCL use cases as well, if you don't mind sharing :)

    - Yuan

  • Hi Yuan,

    Thank you very much for your reply. It's great to know that trying to make CTools work with OpenCL is not totally far-fetched. However, your reply also confirms that it might not be a very straightforward matter.

    As you pointed out, core-private data is one of our biggest concerns. We are interested in using STM and the ETB. We can see that ETBLib requires specific memory sections to be allocated (such as ETBLib_dmaData), and the examples all use L2 SRAM for that. The comment in the code, "If this data section is located in MSMC or DDR3 memory, it should be put in a non-cacheable region", seems to imply that it can work with DDR memory, but we are not completely sure whether the default OpenCL memory mapping will be suitable.
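
    The kind of experiment we have in mind from the linked standard C code looks roughly like the sketch below. The buffer name, size and alignment are placeholders rather than anything taken from ETBLib, and whether the OpenCL dynamic loader places a custom-named section somewhere ETBLib can actually use is exactly the open question; the only part we are fairly sure of is that on C66x the MAR registers (starting at 0x01848000, one per 16 MB region) control cacheability, and clearing the PC bit makes a region non-cacheable.

        /* Hypothetical buffer backing the ETBLib_dmaData section; the real size
           and alignment requirements would have to come from ETBLib. */
        #pragma DATA_SECTION(etb_dma_buffer, "ETBLib_dmaData")
        #pragma DATA_ALIGN(etb_dma_buffer, 128)
        unsigned char etb_dma_buffer[0x2000];

        /* Make the 16 MB region containing 'addr' non-cacheable by clearing the
           PC bit of the corresponding C66x MAR register (MAR0 at 0x01848000). */
        static void make_region_noncacheable(unsigned int addr)
        {
            volatile unsigned int *MAR = (volatile unsigned int *)0x01848000;
            MAR[addr >> 24] &= ~1u;
        }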

    I guess we will need to experiment and explore a bit more based on your suggestions and see how things work out.

    We don't mind sharing our experience at all. Our application involves using the 66AK2H SoC to process about 200 audio channels concurrently. It is a time-critical application, so we need to utilise the DSPs as efficiently as possible to meet the performance requirements. Our first, simplest approach had about 200 threads (one per channel) on the ARM host sharing a single OpenCL context and command queue. Every 20 ms, each host thread simply writes its real-time audio data to an OpenCL buffer and enqueues a kernel task to the command queue for the DSP to do the complex algorithm processing. That approach didn't work well because the command queue became a bottleneck. (We tried having a command queue per thread at an earlier stage, but it seemed to corrupt memory occasionally, so the idea was abandoned.)

    Since then, we have changed the design so that a single host thread collects data from all channels. The host thread enqueues a kernel at a regular interval, and the kernel is distributed across multiple work-groups. By tweaking the work-group/work-item size and the number of channels processed by each work-item, we can almost meet our requirement. We are currently going through another round of design review to see how we can improve the performance further. We have been using dsptop to give us an idea of how the DSPs are utilised.
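
    For reference, the shape of our per-period enqueue is roughly as follows; the channel count, channels-per-work-item and work-group split shown here are illustrative numbers, not our exact configuration:

        #include <stdio.h>
        #include <CL/cl.h>

        #define NUM_CHANNELS     200
        #define CHANNELS_PER_WI  5     /* illustrative: each work-item handles 5 channels */

        /* Called once per 20 ms period by the single host thread. */
        void enqueue_period(cl_command_queue q, cl_kernel k)
        {
            size_t global = NUM_CHANNELS / CHANNELS_PER_WI;  /* 40 work-items */
            size_t local  = global / 8;                      /* 8 work-groups, one per C66x core */

            cl_int err = clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local,
                                                0, NULL, NULL);
            if (err != CL_SUCCESS)
                fprintf(stderr, "clEnqueueNDRangeKernel failed: %d\n", err);
        }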

    Thanks again for your help.
    Kathy
  • Hi Kathy,

       Thanks for enlightening us about your use case (our team consists of compiler and runtime developers :).  This will definitely help us understand real-world applications better.

       We have an online OpenCL optimization guide:

    In general, there are three areas that we can optimize:

    1) instruction pipeline efficiency: has the loop been software pipelined at an efficient II (initiation interval)?

    2) SIMD efficiency: has the computation utilized SIMD instructions on C66?  (A short sketch follows this list.)

    3) memory hierarchy performance: can EDMA with double buffering in on-chip memory be applied to overlap computation and data movement, if the application data size and algorithm permit?  Software pipelining also benefits from data in on-chip memory, because pipeline stalls due to cache misses are resolved much more quickly.
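
    Regarding 2), OpenCL C vector types are an easy way to reach the C66 SIMD instructions.  A minimal sketch, with made-up kernel and parameter names (n4 is the number of float4 elements, i.e. samples/4):

        /* Vector types such as float4 map onto C66x SIMD instructions; restrict
           helps the compiler software-pipeline the loop. */
        __kernel void scale_channels(__global const float4 *restrict in,
                                     __global float4       *restrict out,
                                     float gain, int n4)
        {
            for (int i = 0; i < n4; ++i)
                out[i] = gain * in[i];   /* 4 samples per iteration */
        }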

    You might have already been doing all of these.

       Without knowing your particular data size, I did a back-of-the-envelope calculation with some googled audio data sizes: 24-bit mono at 96 kHz will be about 1152KB per 200 channels per 20ms.  We do have 4.5MB of shared MSMC on-chip memory on K2H (it can be used with the CL_MEM_USE_MSMC_TI flag), and we also have 864KB of L2 OpenCL local memory on each core.  If your total data size fits, you could put all the data into MSMC memory.  And if each channel's data fits in a half or a quarter of the L2 OpenCL local memory, you could use EDMA with double buffering and perform pipelined processing across channels.  Say, before processing the data of the current channel in the L2 ping buffer, EDMA the data of the next channel into the L2 pong buffer, and so on.  We have shipped OpenCL sgemm/dgemm examples that are optimized with EDMA and double buffering.  They might be slightly more complicated because they deal with 2-D matrices and transposing.  I'll see if we can put a 1-D example out there (please excuse my limited domain knowledge, I am assuming audio processing is 1-D).
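
       To make the ping/pong idea concrete, a rough 1-D sketch using the standard OpenCL async_work_group_copy built-ins is below (the OpenCL package also provides an EdmaMgr extension for finer-grained DMA control).  CHAN_SAMPLES and the simple gain loop are placeholders for your per-channel block size and real processing:

        #define CHAN_SAMPLES 1024   /* hypothetical samples per channel per period */

        __kernel __attribute__((reqd_work_group_size(1, 1, 1)))   /* one work-item per group */
        void channels_double_buffered(__global const float *in,
                                      __global float *out,
                                      float gain, int num_channels)
        {
            __local float ping[CHAN_SAMPLES];             /* L2 ping buffer */
            __local float pong[CHAN_SAMPLES];             /* L2 pong buffer */
            __local float *cur = ping, *nxt = pong;

            /* Prefetch channel 0 into L2. */
            event_t ev = async_work_group_copy(cur, in, CHAN_SAMPLES, 0);

            for (int c = 0; c < num_channels; ++c) {
                wait_group_events(1, &ev);                /* channel c now in L2 */

                if (c + 1 < num_channels)                 /* start copying channel c+1 */
                    ev = async_work_group_copy(nxt, in + (c + 1) * CHAN_SAMPLES,
                                               CHAN_SAMPLES, 0);

                for (int i = 0; i < CHAN_SAMPLES; ++i)    /* placeholder processing */
                    out[c * CHAN_SAMPLES + i] = gain * cur[i];

                __local float *tmp = cur; cur = nxt; nxt = tmp;   /* swap ping/pong */
            }
        }

       In a real kernel, each work-group would of course walk only its own subset of channels rather than all of them.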


        Feel free to contact us if we can be of any help.  Thanks!

    - Yuan