Using OpenCL™ to offload to C66x DSPs on Sitara™ AM572x processors

Customers traditionally program complex ARM® + C66x digital signal processor (DSP) systems on chip (SoCs), such as TI’s Sitara™ AM57x SoCs, by manually partitioning the application across the ARM cores and DSPs and hand-optimizing each section of the application for its core. This approach tends to yield the highest achievable performance but has drawbacks: the resulting application cannot be easily ported from one TI SoC to another, and time to market increases because programmers need to manage dispatch, communication and synchronization mechanisms.

To provide customers an alternative, TI has enabled industry-standard heterogeneous multicore programming models such as OpenCL and OpenMP® offload on ARM + DSP SoCs. This post will focus on OpenCL and TI’s implementation of it. In a later post, I will write about OpenMP Offload.

What is OpenCL?

OpenCL is a framework for expressing programs where parallel computation is dispatched across heterogeneous devices. It is an open, royalty-free standard managed by the Khronos Group. On a heterogeneous SoC, OpenCL views one of the programmable cores as a host and the other cores as devices. For example, on a Sitara AM572x SoC, the host is the ARM® Cortex®-A15 cluster running SMP/Linux or TI-RTOS and the device is the C66x DSP cluster. The application running on the host (i.e., the host program) manages execution of code (kernels) on the device and is also responsible for making data available to the device. A device consists of one or more compute units. On Sitara AM572x SoCs, each C66x DSP is a compute unit.

The OpenCL runtime consists of two components: (1) an API for the host program to create and submit kernels for execution and (2) a cross-platform language for expressing kernels, OpenCL C, which is based on C99 with some additions and restrictions.
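To give a feel for the second component, here is a minimal sketch of a vector-add kernel written in OpenCL C. The `kernel` and `global` qualifiers and the `get_global_id` built-in are standard OpenCL C. Everything marked "host-only" (the two `#define` lines, `shim_global_id`, the stand-in `get_global_id` and `run_vector_add`) is an assumption added purely for illustration, so that the same source also compiles as ordinary C99 on a desktop. None of it belongs to OpenCL or to TI's implementation, and a real host program would submit the kernel through the OpenCL API instead.

```c
#include <assert.h>
#include <stddef.h>

/* --- Host-only shim (an assumption for illustration, not part of OpenCL) ---
 * An OpenCL C compiler understands these names natively; defining them here
 * lets the kernel below also compile as plain C99 for a quick check. */
#define kernel
#define global
static size_t shim_global_id;                 /* index of the current work-item */
static size_t get_global_id(unsigned dim) { (void)dim; return shim_global_id; }

/* --- Kernel, written as it would appear in an OpenCL C source file --- */
kernel void vector_add(global const float *a,
                       global const float *b,
                       global float *c)
{
    size_t i = get_global_id(0);   /* each work-item handles one element */
    c[i] = a[i] + b[i];
}

/* --- Host-only stand-in for a 1-D dispatch: one kernel call per work-item,
 * executed serially. A real runtime would spread these over the device. --- */
void run_vector_add(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c);
    for (shim_global_id = 0; shim_global_id < n; ++shim_global_id)
        vector_add(a, b, c);
}
```

The kernel body contains no loop over the data: the index space is supplied at dispatch time, which is what allows a runtime to divide the work among compute units.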

OpenCL supports both data parallel and task parallel programming paradigms. Data parallel execution parallelizes the execution across compute units on a device. Task parallel execution enables asynchronous dispatch of tasks to each compute unit.
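The data parallel case can be pictured with a small plain-C model. In OpenCL 1.1 the work-items of a dispatch are grouped into equally sized work-groups, and the runtime executes those work-groups on the device's compute units. The sketch below counts how many work-groups each compute unit would receive, assuming a simple round-robin hand-out. That policy, and the function names, are assumptions for illustration only; the actual scheduling is up to the OpenCL implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups needed to cover global_size work-items.
 * OpenCL 1.1 requires the global size to be a multiple of the local size. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Fill counts[0..num_units-1] with the number of work-groups each compute
 * unit receives, ASSUMING a round-robin hand-out (illustrative policy only). */
void groups_per_compute_unit(size_t global_size, size_t local_size,
                             size_t num_units, size_t *counts)
{
    assert(num_units > 0 && counts);
    size_t groups = num_work_groups(global_size, local_size);

    for (size_t u = 0; u < num_units; ++u)
        counts[u] = 0;
    for (size_t g = 0; g < groups; ++g)
        counts[g % num_units] += 1;   /* work-group g goes to unit g mod N */
}
```

With 1024 work-items in work-groups of 128, this model gives each of two compute units four work-groups, and each of eight compute units one work-group. The kernel source is the same in both cases; only the distribution changes.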

TI’s implementation of OpenCL

Here are a few key features of TI’s OpenCL implementation for Sitara AM572x SoCs:

  • The host is the dual ARM Cortex-A15 cluster running SMP/Linux
  • One device with either one or two C66x DSP cores (configurable via an environment variable)
  • The compute unit is a single C66x DSP
  • OpenCL implementation conformant to v1.1 (full profile)
  • No support for images; ‘double’ is supported as an extension (not included in the conformance submission)

TI’s extensions to OpenCL

TI has enabled various extensions to OpenCL to take advantage of features on its SoCs, leverage the existing C66x DSP code base and make the SoCs easier to program. Examples of such extensions include:

  • Calling standard C code from OpenCL C code, including code with OpenMP pragmas
  • On-chip global buffers using OCMC RAM
    • __malloc_ddr and __malloc_msmc substitutes for malloc (like OpenCL 2.0 clSVMAlloc)
    • Enables zero copy buffer capability
  • Support for C66x DSP intrinsic functions in OpenCL C, such as _dcmpy and _dotpsu4
  • Printf from OpenCL C (similar to OpenCL 1.2 printf capability)
  • Access to EDMA functionality from OpenCL C using the EdmaMgr API functions
  • Cache resize and control operations for the DSP’s L1D and L2 caches from OpenCL C
  • Additional OpenCL C built-in functions, such as __clock and __cycle_delay
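The first extension in the list, calling standard C code from OpenCL C, can be sketched as a thin wrapper kernel around an existing C routine. The names `dot_product` and `dot_product_wrapper` are hypothetical, and the two `#define` lines are a host-only shim (an assumption for illustration) so the snippet compiles as plain C99; an OpenCL C compiler knows the `kernel` and `global` qualifiers natively. How the C routine is built for the DSP and linked with the kernel is not shown here; the TI OpenCL User Guide covers those steps.

```c
#include <assert.h>

/* Host-only shim (an assumption for illustration, not part of OpenCL). */
#define kernel
#define global

/* Existing standard C routine, e.g. from a legacy DSP code base.
 * On the target it would sit in its own .c file compiled for the DSP. */
float dot_product(const float *x, const float *y, int n)
{
    assert(n >= 0);
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}

/* OpenCL C wrapper kernel (hypothetical name): it does no work of its own
 * and simply forwards its buffer arguments to the standard C routine. */
kernel void dot_product_wrapper(global const float *x,
                                global const float *y,
                                int n,
                                global float *result)
{
    *result = dot_product(x, y, n);
}
```

A wrapper like this would typically be dispatched as a single task rather than over a range of work-items, since the loop lives inside the C routine.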

Why use OpenCL?

Using a standard approach to programming heterogeneous SoCs simplifies programming; it allows the programmer to use standard, well-documented APIs to handle the mechanics of dispatching code and data to the DSPs and focus on optimizing the dispatched code. Other benefits include:

  • Seamless migration of applications between TI SoCs: for example, an OpenCL application written for a 66AK2H SoC with eight C66x DSP cores can run on an AM572x SoC with two C66x DSP cores with only a recompile
  • TI extensions to OpenCL enable programmers to leverage optimized TI-provided DSP libraries such as dsplib, mathlib, imagelib
  • Use the DSPs to offload computation within open source libraries such as OpenCV

When does OpenCL apply?

OpenCL was designed for a scenario in which one of the cores on the SoC is designated the host. This core runs the OpenCL API and manages execution of kernels on the devices. As described earlier, the host on an AM57x SoC is an ARM Cortex-A15 core running Linux or TI-RTOS. All orchestration of control and data is done by the host. If the customer’s use case fits within the host-centric model, OpenCL is an option to offload computation to the DSPs.

Other considerations include:

  • Can the application take advantage of TI libraries optimized using OpenCL to dispatch to the DSP? Examples include OpenCV and Linear Algebra libraries
  • Is the OpenCL execution and memory model a good fit for the region of application code dispatched to the device?
  • What is the nature of the existing code base? Is it already written to use OpenCL for dispatch?
  • Are there real-time requirements with respect to computation offloaded to the DSP? OpenCL does not offer any real-time guarantees
  • Programmer expertise and preference: using OpenCL APIs for dispatch vs. TI inter-processor communication (IPC)
  • Can OpenCL be used to get to a quick prototype for offloading code to the DSPs?
  • Are the necessary OpenCL features supported by TI’s implementation? (For example, TI OpenCL does not support some optional OpenCL v1.1 features such as images.)

How can I get started?

OpenCL on Sitara AM572x SoCs is available via TI’s Processor SDK Linux. Download and install the SDK following the instructions on the Processor SDK Linux page listed below. Refer to the OpenCL User Guide for instructions on building and running OpenCL examples.


  1. TI’s Sitara AM5728 SoC:
  2. TI Processor SDK Linux for AM57x SoCs:
  3. TI OpenCL User Guide:
  4. Khronos OpenCL:
  5. OpenMP Offload User Guide:
  6. TI Signal processing libraries:
  7. TI OpenCV library:
  8. TI Linear Algebra library: