OpenCL build for OMAP 3 family

Michael Harney

Does anybody here know where I can find a resource to port OpenCL to the OMAP 3 family? I realize OMAP 3 is obsolete, but we still have legacy products (Gumstix in particular) in production that we need to use and we would like to port OpenCL for speed improvements. Thank you.

over 3 years ago

Nick Saulnier over 3 years ago

TI Guru 100980 points

Hello Michael,

Unfortunately, if the documentation is not on the product pages or the legacy software pages for the OMAP 3 devices you are interested in, it may not exist anymore. You can also search the e2e forums to see if previous discussions are helpful.

I did a quick check of Processors wiki pages that went offline that mentioned OpenCL. Attaching that information below.

DISCLAIMERS
These pages may or may not have anything to do with OMAP 3. Links from these pages might be broken. I cannot answer any questions about information from these pages.

---------------------------------------------------------------------------------------------------

Using existing DSP libraries in openCL_files.zip

-----------------------------------------------------------------------------

OpenCL Memory_files.zip

----------------------------------------------------------------------------

Determine which version of OpenCL is installed

You can determine which version of TI's OpenCL implementation you have installed and are using in one of 4 ways:

If the device is running Ubuntu and ti-opencl was installed using dpkg or apt-get, the command "dpkg -s ti-opencl" will display the currently installed version.
Executing the command "clocl --version" will display the version of the OpenCL compiler installed. This works only for OpenCL 0.12.0 and later.
Executing the command "ls -l /usr/lib/libOpenCL*" will display the OpenCL libraries installed on the device. Follow the soft links from /usr/lib/libOpenCL.so to a fully version-qualified libOpenCL library like /usr/lib/libOpenCL.so.0.12.0. The version on the library indicates which version of the OpenCL package is installed.
The version can be queried programmatically in an application by using the OpenCL APIs to query the platform version. The returned string will have a format similar to: "OpenCL 1.1 TI product version 0.12.0 (Sep 19 2014 16:02:54)". Sample C++ code to query the version follows:

#include <CL/cl.hpp>
#include <iostream>

int main()
{
    std::vector<cl::Platform> platforms;
    std::string str;

    cl::Platform::get(&platforms);                    // list the available platforms
    platforms[0].getInfo(CL_PLATFORM_VERSION, &str);  // version string of the first one
    std::cout << str << std::endl;
}
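If an application needs only the TI product version from that string, it could be extracted with a small helper. This is a minimal sketch that assumes the string keeps the "product version X.Y.Z" layout quoted above; the function name tiProductVersion is hypothetical and is not part of any TI or OpenCL API.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: pulls the TI product version (e.g. "0.12.0") out of a
// platform version string. Returns an empty string if the marker is missing.
std::string tiProductVersion(const std::string& platformVersion)
{
    const std::string marker = "product version ";
    const std::string::size_type start = platformVersion.find(marker);
    if (start == std::string::npos) {
        return "";
    }
    const std::string::size_type begin = start + marker.size();
    assert(begin <= platformVersion.size());

    // The version token ends at the next space, or at the end of the string.
    const std::string::size_type end = platformVersion.find(' ', begin);
    const std::string::size_type length =
        (end == std::string::npos) ? std::string::npos : end - begin;
    return platformVersion.substr(begin, length);
}
```

Passing the string returned by getInfo(CL_PLATFORM_VERSION, ...) to this helper would yield only the product version portion, which is easier to compare against a required minimum.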

----------------------------------------------------------------------------

OpenCL Applications Differences from Standalone DSP applications

For C66 DSP developers moving to OpenCL on the K2H from a standalone DSP application environment, the following guidelines will be helpful for the transition.

Memory management for the C66 DSP is accomplished exclusively through OpenCL buffers defined in the host application. OpenCL buffers can be defined in DDR, MSMC, and L2 memory regions (see the sections Global Buffers and Local Buffers for details). This eliminates the need for all heap management on the DSP. Calls to malloc, calloc, realloc and free should not be used in OpenCL C programs or in any standard C code call tree originating from OpenCL C programs.
OpenCL C kernels and any call tree originating from a kernel inherit a stack from the OpenCL runtime executing on the DSPs. To limit the amount of on-chip memory that is reserved on the K2, this stack is small, currently 10K bytes. Auto (function scope) variables should therefore be kept to a minimum. Additionally, call tree depth must be limited. Recursion is not allowed in OpenCL C code. If OpenCL C code calls standard C code, then recursion should not be used in the standard C code.
OpenCL C allows the definition of constant data in the constant memory address space at global scope. No other global data is allowed in OpenCL C. If OpenCL C is linked with standard C code, the standard C code should avoid global variables. Again, if the DSP code needs persistent data, it should be allocated an OpenCL buffer and the data should reside and persist in the buffer.
OpenCL C does not run a boot setup for dispatch of kernels. Dependencies on items that typically run before main is called or after main returns in a standalone DSP application are therefore not supported in an OpenCL C environment.
Linking OpenCL C code with standard C code that requires a user-specified linker command file is not supported.
Calling CSL or BIOS functions is allowed from OpenCL C code or a standard C code call tree originating from OpenCL C, if it conforms to the restrictions above (particularly, no memory management, no link command file, and no requirement on boot time code).
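As an illustration of these guidelines, the following host-compilable sketch shows the general shape of a helper routine that would fit the restrictions: all persistent state lives in storage supplied by the caller (on the DSP this would be an OpenCL buffer), there is no heap use, no global data, no recursion, and almost no stack use. The names AverageState and updateAverage are hypothetical and do not come from the original wiki page.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical persistent state. On the DSP this struct would reside in an
// OpenCL buffer created by the host, not in a global variable or on the heap.
struct AverageState {
    float       sum;
    std::size_t count;
};

// Updates a running average in place using only caller-supplied storage.
// No malloc/free, no globals, no recursion, and only a few bytes of stack.
float updateAverage(AverageState* state, const float* samples, std::size_t n)
{
    assert(state != nullptr && (samples != nullptr || n == 0));
    for (std::size_t i = 0; i < n; ++i) {
        state->sum += samples[i];
    }
    state->count += n;
    return state->count ? state->sum / static_cast<float>(state->count) : 0.0f;
}
```

Because the state is passed in rather than owned by the routine, it persists between calls for as long as the buffer that holds it does.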

----------------------------------------------------------------------------

OpenCL existing app performance challenges

There are a number of constructs used in existing OpenCL code that pose a performance challenge for the TI OpenCL implementation.

For existing apps that use enqueueNDRangeKernel (which is the majority of cases), recall that the OpenCL C kernel specifies the algorithm for a work-item. When one of these kernels is enqueued, the OpenCL runtime will group some number of work-items into a work-group, and there will be as many work-groups as needed to cover the entire range of the problem space.

For the DSP device in TI's OpenCL, we turn work-groups into loops iterating over the work-items in the work-group. When the work-item algorithm has barrier statements, the generation of those iterating loops becomes more complicated and the resulting performance of the kernel suffers.
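The idea of a work-group becoming a loop can be pictured with a small host-side emulation. This is only a conceptual sketch in plain C++, not TI's actual code generation; the function names scaleWorkItem, runWorkGroup and runNDRange are hypothetical, and the range size is assumed to be a multiple of the work-group size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Body of one work-item: the per-element algorithm a kernel would specify.
void scaleWorkItem(std::size_t globalId, const float* in, float* out)
{
    out[globalId] = 2.0f * in[globalId];
}

// One work-group executed as a loop over its work-items.
void runWorkGroup(std::size_t groupId, std::size_t localSize,
                  const float* in, float* out)
{
    for (std::size_t local = 0; local < localSize; ++local) {
        scaleWorkItem(groupId * localSize + local, in, out);
    }
}

// The whole range: as many work-groups as needed to cover the input.
std::vector<float> runNDRange(const std::vector<float>& in, std::size_t localSize)
{
    assert(localSize > 0 && in.size() % localSize == 0);
    std::vector<float> out(in.size());
    for (std::size_t group = 0; group < in.size() / localSize; ++group) {
        runWorkGroup(group, localSize, in.data(), out.data());
    }
    return out;
}
```

A barrier inside the work-item body would force the single loop in runWorkGroup to be split into several loops, one per region between barriers, which is where the extra complexity comes from.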

Additionally, existing OpenCL C kernels use a method for calculating reductions by staging log2(N) steps. During step 1, half the work-items perform useful work. During step 2, a quarter of the work-items perform useful work. This continues until the final step, when only 1 work-item performs useful work. This is inherently a poor utilization of resources, but is necessary in a GPU environment where the work-items are executing concurrently in independent SIMD lanes. On a DSP, by contrast, reductions are a strength since the DSP is iterative in nature. However, the current OpenCL C compiler does a more straightforward translation of the GPU reduction idiom, which does not result in a highly efficient DSP program today. As more compiler optimization is implemented in future releases of the TI OpenCL product, this will improve.
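To make the contrast concrete, the sketch below emulates both styles on the host in plain C++. It is illustrative only: treeReduce mimics the staged GPU idiom (assuming a power-of-two element count), while loopReduce is the simple iterative form that suits a DSP. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// GPU-style staged reduction: log2(N) steps, and at each step only the first
// `stride` work-items do useful work. In OpenCL C a barrier separates steps.
float treeReduce(std::vector<float> data)
{
    assert(!data.empty() && (data.size() & (data.size() - 1)) == 0);
    for (std::size_t stride = data.size() / 2; stride > 0; stride /= 2) {
        for (std::size_t item = 0; item < stride; ++item) {
            data[item] += data[item + stride];
        }
    }
    return data[0];
}

// Iterative reduction: a single loop with no idle work-items and no barriers.
float loopReduce(const std::vector<float>& data)
{
    float sum = 0.0f;
    for (float value : data) {
        sum += value;
    }
    return sum;
}
```

With 8 elements, treeReduce runs 3 steps with 4, 2 and 1 active work-items, whereas loopReduce performs the same additions in one pass.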

Existing OpenCL programs are structured with the size of work-groups defined to be small. The work-group size is typically set to align with the number of SIMD lanes in the GPU. The DSP would prefer to have larger work-group sizes since we turn the work-groups into loops. Longer executing loops hide both the overhead of the loop and the overhead of dispatching the work-group.

OpenCL C vector types are very useful for algorithm expression and also for explicitly controlling SIMD in the OpenCL C compiler. The TI OpenCL C compiler and built-in library have not yet been optimized for vector types. They are accepted and correct code will be generated, but not all the C66 DSP SIMD instructions will be utilized in the compilation of vector types. Again, this will be improved, but for version 0.9.0 it may be best to avoid vector types. Vectors of length 2 can sometimes be used effectively. Vectors with total width > 64 bits are unlikely to improve performance and may degrade it vs. scalar types.

Kernels enqueued with enqueueTask, where the kernel is an algorithm expressing all of a task (i.e. it is not just 1 work-item out of a group of work-items), can be processed effectively by the DSP device. In fact, the DSP device can process up to 8 of these concurrently. However, GPUs would typically process one of these at a time, and therefore existing applications using tasks would not necessarily be structured for concurrent task submission. In this case, the DSP device may be only 1/8 utilized.

--------------------------------------------------------------------------------

OpenCL FAQ

How do I get support for TI OpenCL products?

Post your questions and/or suspected defects to the High Performance Computing forum with the tag opencl.

Which version of OpenCL do I have installed?

See the page Which OpenCL Version is Installed for details.

Can multiple OpenMP threads in the host application submit to OpenCL queues?

Yes, each thread could have a private queue or the threads could share a queue. The OpenCL APIs are thread safe. See the page OpenCL Interoperability with Host OpenMP for more details.

When running an OpenCL application, I get the error messages:

<< D L O A D >> ERROR: File location of segment 0 is past the end of file.
<< D L O A D >> ERROR: Attempt to load invalid ELF file, '(null)'.

OpenCL uses the directory /tmp to store intermediate compilation results and to cache compilation results. This error typically results when /tmp is full. You can issue the command "rm /tmp/opencl*" to free /tmp space. When either of the environment variables TI_OCL_CACHE_KERNELS or TI_OCL_KEEP_FILES is set, the OpenCL runtime will keep more persistent data in /tmp and this error could become more frequent. Either unset these environment variables or modify your Linux setup to increase the amount of space allocated to /tmp.
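An application could also check how much space is left before launching OpenCL work. The following is a minimal sketch using the POSIX statvfs call; the helper name freeBytes is hypothetical, and checking space this way is a suggestion here, not something the TI runtime is documented to do.

```cpp
#include <cassert>
#include <sys/statvfs.h>

// Hypothetical helper: bytes available to unprivileged users on the
// filesystem that holds `path`, or -1 if the query fails.
long long freeBytes(const char* path)
{
    assert(path != nullptr);
    struct statvfs info;
    if (statvfs(path, &info) != 0) {
        return -1;
    }
    return static_cast<long long>(info.f_bavail) *
           static_cast<long long>(info.f_frsize);
}
```

Calling freeBytes("/tmp") and comparing the result against a threshold chosen for your kernels would give an early warning before the DLOAD errors appear.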

When running OpenCL applications, I get messages about /var/lock/opencl

The TI OpenCL implementation currently allows only 1 OpenCL-enabled process to execute at any given time. To enforce this, the OpenCL implementation locks the file /var/lock/opencl when an OpenCL application begins and frees it when it completes. If two OpenCL processes attempt to run concurrently, then one will block waiting for the file lock to be released. It is possible for an OpenCL application to terminate abnormally and leave the locked file in place. If you determine that no other OpenCL process is running and your OpenCL application still receives the waiting on /var/lock/opencl message, then the /var/lock/opencl file can be safely removed to allow your process to continue.

--------------------------------------------------------------------------

OpenCL math status

The OpenCL math functions from section 6.11.2 of the OpenCL 1.1 spec are fully supported in this OpenCL release. However, they are not all completely conformant per the precision and special value handling defined in the spec. The table below illustrates the cases that are not fully conformant. If a math function is not listed in the table, then it is fully conformant across the float scalar and vector types and across the double scalar and vector types. Each function has 12 columns representing the float scalar and vector types and the double scalar and vector types.

The table has the following color code:

Function columns highlighted as dark green in this table are fully conformant.
Function columns highlighted as light green in this table are fully conformant with the exception of subnormal values.
Function columns highlighted as yellow in the table are conformant with the exception of subnormals and/or minor ULP precision.
Function columns highlighted as red have larger ULP precision errors.

Michael Harney over 3 years ago in reply to Nick Saulnier

Prodigy 10 points

Thank you Nick, I appreciate your response and information!
