OpenCL Memory
Understanding memory usage in OpenCL for 66AK2H
Device Memory
The 66AK2H12 device will be referred to as the K2H for the remainder of this document.
The following K2H device components are relevant to the memory discussion for OpenCL.
- Four ARM A15 CPU cores @ 1 GHz,
- Eight TI C66 DSP cores @ 1 GHz,
- 8GB of DDR3 attached to the K2H device through a 1600 MHz 72-bit bus,
- 6MB of internal shared memory referred to as MSMC,
- 1MB L2 memory per C66 core (data & instruction cache, scratchpad, or both)
- 32KB L1P memory per C66 core (instruction cache, scratchpad, or both)
- 32KB L1D memory per C66 core (data cache, scratchpad, or both)
The ARM A15 cores also have a shared 4MB L2 cache, and 32KB Instruction and 32KB Data caches per core. However, Linux will manage these and they are otherwise not user visible.
The K2H device allows the L1 and L2 memory areas in the C66 cores to be configured as all cache, all scratchpad, or a partition of both. For OpenCL applications this partitioning is fixed as follows:
- L1P is configured as all cache on all C66 cores
- L1D is configured as all cache on all C66 cores
- L2 is configured as 128K cache, 128K reserved scratchpad, and 768K scratchpad available for OpenCL Local buffers on all C66 cores
Additionally, the MSMC memory has 1.25MB reserved and 4.75MB available for OpenCL buffer use. Of the 8GB of DDR3, 48MB is reserved; the remainder is partitioned between Linux system memory and CMEM contiguous memory for use by OpenCL. The next section describes the DDR3 usage in more detail.
The A15 CPUs are cache coherent with each other, but they are not cache coherent with the C66 DSPs, and the C66 DSPs are not cache coherent with each other. In most use cases, the OpenCL runtime will manage coherency of the various device caches through software cache coherency operations.
How DDR3 is Partitioned for Linux System and OpenCL
The 8GB of attached DDR3 memory is accessible to the K2H device through a 64-bit 1600 MHz bus. The 8GB of DDR3 is populated in the K2H 36-bit address space at locations 8:0000:0000 through 9:FFFF:FFFF.
The DDR3 is partitioned into three components:
- Linux system memory,
- CMEM contiguous memory, and
- Reserved memory.
The Linux system memory is the underlying memory store for the Linux virtual memory system and would contain standard items like:
- A15 stacks,
- A15 heaps,
- A15 application code,
- A15 application variables, etc.
The CMEM contiguous memory is controlled by a Linux kernel module that guarantees that contiguous virtual addresses within a range are mapped to contiguous physical addresses within the range. This is required for buffer communication between the A15 and C66 cores, because the C66 cores do not access memory through a shared MMU with the A15 CPUs and thus require that buffers be allocated in contiguous physical memory.
The reserved memory is a very small portion of the DDR3 memory that is used in the OpenCL implementation and is exposed to neither CMEM nor Linux.
The first 2GB of DDR3 are fixed in usage to the following:
 8:0000:0000 - 8:1FFF:FFFF :  512M Linux System
 8:2000:0000 - 8:22FF:FFFF :   48M Reserved
 8:2300:0000 - 8:7FFF:FFFF : 1488M CMEM
The remaining 6GB of DDR3 can be split between Linux and CMEM using boot time variables. The default partition of the remaining 6GB would be:
 8:8000:0000 - 8:BFFF:FFFF : 1GB Linux System
 8:C000:0000 - 9:FFFF:FFFF : 5GB CMEM
You can verify the partition in your system by viewing the /proc/iomem system file. The bottom of this file will contain the external DDR memory map, for example:
 800000000-81fffffff : System RAM
   800008000-8006b5277 : Kernel code
   8006fa000-8007ace53 : Kernel data
 823000000-87fffffff : CMEM
 880000000-8bfffffff : System RAM
 8c0000000-9ffffffff : CMEM
The CMEM memory areas are managed by OpenCL for allocation to OpenCL buffers and OpenCL C programs. The default partition of 1.5GB of Linux system memory and roughly 6.5GB of CMEM provides the minimum suggested Linux system memory size and a larger area for OpenCL buffer and program space.
The OpenCL Memory Model
The OpenCL 1.1 specification available from Khronos defines a memory model in Section 3.3. Please refer to the specification for details on these memory regions and how they relate to work-items, work-groups, and kernels. This document will focus on the mapping of the OpenCL memory model to the K2H device. There are four virtual memory regions defined:
- Global Memory
- This memory region contains global buffers and is the primary conduit for data transfers between the host A15 CPUs and the C66 DSPs. This region will also contain OpenCL C program code that will be executed on the C66 DSPs. For this OpenCL implementation, global memory by default maps to the portion of DDR3 partitioned as CMEM contiguous memory. Additionally, 4.75MB of MSMC memory is also available as global memory, and buffers can be defined to reside in this memory instead of DDR3 through an OpenCL API extension specific to TI. This mechanism is described in the later section detailing the OpenCL buffer creation flags.
- Constant Memory
- This memory region contains content that remains constant during the execution of a kernel. OpenCL C program code and constant data defined in that code would be placed in this region. For this implementation, constant memory is mapped to the portion of DDR3 partitioned as CMEM contiguous memory.
- Local Memory
- The local memory region is not defined by the spec to be accessible from the host (ARM A15 cores). This memory is local to a work group. It can be viewed as a core local scratchpad memory and in fact for this implementation it is mapped to the 768K of L2 per core that is reserved for this purpose. The use case for local memory is for an OpenCL work-group to migrate a portion of a global buffer to/from a local buffer for performance reasons. This use case is optional for users as access to global buffers in DDR will be cached in both the 128K L2 cache and the 32K L1D cache on the C66 DSPs. However, performance can often be improved by taking the extra step in OpenCL C programs to manage local memory as a scratchpad.
- Private Memory
- This memory region is for values that are private to a work-item and these values are typically allocated to registers in the C66 DSP core. Sometimes it may be necessary for these values to exist in memory. In these cases the values are stored on the C66 DSP stack which resides in the reserved portion of the L2 memory.
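As a concrete illustration of how these regions appear to a kernel, the following minimal OpenCL C sketch touches all four regions; the kernel and variable names are illustrative only and are not part of any shipped example:
<source lang="c">
constant int coeff[4] = {1, 2, 3, 4};      // constant memory: read-only data in DDR3 (CMEM)

kernel void regions(global int *in,        // global memory: buffer in DDR3 or MSMC (CMEM)
                    local  int *tmp)       // local memory: per-core L2 scratchpad
{
    int i = get_local_id(0);               // private memory: typically a register or the L2 stack
    tmp[i] = in[i] * coeff[i & 3];
    barrier(CLK_LOCAL_MEM_FENCE);
    in[i] = tmp[i];
}
</source>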
OpenCL Buffers
Global Buffers
OpenCL global buffers are the conduit through which data is communicated from the host application to OpenCL C kernels running on the C66 DSP. The C prototype for the OpenCL API function that creates global buffers is:
<source lang="C"> cl_mem clCreateBuffer (cl_context context, cl_mem_flags flags, size_t size,
void *host_ptr, cl_int *errcode_ret);
</source>
The C++ binding for OpenCL specifies a Buffer object and the constructor for that object has the following prototype:
<source lang="CPP">
Buffer(const Context& context, cl_mem_flags flags, size_t size,
void* host_ptr = NULL, cl_int* err = NULL);
</source>
For the remainder of this section on OpenCL Buffers, the examples will use the C++ binding and the Buffer constructor. Conversion to the C API is straightforward, as the arguments to both methods are the same. The C++ Buffer constructor does have default values of NULL for host_ptr and err, so in examples where those arguments are not specified, conversion to the C API will require adding NULL arguments in those parameter slots.
Also for the remainder of this section we will assume an OpenCL context named ctx has been created with only the DSPs present in the context. The C++ code to create such a context is:
<source lang="CPP"> Context ctx(CL_DEVICE_TYPE_ACCELERATOR); </source>
Note that the device type accelerator is used in the context constructor. In this OpenCL implementation accelerator equates to DSP.
With the context parameter now fixed to ctx, and default values of NULL for host_ptr and err, buffer creation depends on the flags argument and the size argument. The size argument is relatively straightforward: it should always be specified and represents the size of the buffer in bytes. It is a frequent error to attempt to specify the size as a number of elements. For example, if a buffer of 100 ints is required, you need to pass in sizeof(int)*100, i.e. 400, as the size and not just 100.
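For example, a minimal sketch of creating a buffer intended to hold 100 int values (the variable names are illustrative):
<source lang="CPP">
int num_elements = 100;
Buffer buf(ctx, CL_MEM_READ_WRITE, num_elements * sizeof(int));   // size is 400 bytes, not 100
</source>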
The flags argument defines some important properties for the buffer. Section 5.2.1 in the OpenCL 1.1 spec defines the flag values. They are also listed below with their significance to this implementation. In general the flag values may be or'ed together to create buffers with a combination of properties. The OpenCL 1.1 spec enumerates the cases of mutually exclusive buffer creation flags.
The flags are:
- CL_MEM_READ_WRITE
- CL_MEM_WRITE_ONLY
- CL_MEM_READ_ONLY
- The above three flags are mutually exclusive. Only one should be specified; if none are specified, then CL_MEM_READ_WRITE is assumed. These flags indicate to the OpenCL runtime how the buffer will be accessed from the perspective of OpenCL C kernels running on the DSP. If a buffer will only be passed to a kernel that simply reads the buffer, then specify CL_MEM_READ_ONLY. These flags are used to control the cache coherency operations that the OpenCL runtime performs for you. On the K2H device the ARM A15 cores are not cache coherent with the C66 DSPs, so the OpenCL runtime will issue cache coherency operations between the writing of a buffer on one device and the reading of the buffer on a different device. When read only or write only is specified, some coherency operations may be skipped for performance.
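- For example, a minimal sketch of creating a buffer that kernels will only read (the name in_buf and the size bufsize are illustrative):
<source lang="CPP">
Buffer in_buf(ctx, CL_MEM_READ_ONLY, bufsize);   // kernels on the DSP will only read this buffer
</source>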
- CL_MEM_USE_HOST_PTR
- If using this buffer creation flag, a non-NULL host_ptr argument must also be provided. This flag indicates to the OpenCL runtime that the underlying memory store for this buffer object should be the memory area pointed to by the host_ptr argument.
<source lang="CPP"> int size = 1024 *sizeof(int); int *p = (int*) malloc(size); Buffer buf(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, p); </source>
- The above code fragment will allocate an area sized for 1K int values in the A15 Linux heap. It will then create a buffer using the CL_MEM_USE_HOST_PTR flag and will pass in the address of the heap area as the host_ptr. The result will be an OpenCL buffer whose underlying memory will be on the Linux heap at address p.
- Recall from the previous section that the DSP cannot reliably read from Linux system memory because it can be paged and non-contiguous. The DSP requires a contiguous buffer, so when a buffer created with this flag is passed to an OpenCL C kernel, the runtime must allocate an area of CMEM memory, copy from the host heap memory into the CMEM area, dispatch the kernel, and copy from CMEM back to the host heap memory. This is clearly not ideal from a performance perspective: multiple memory copies are involved, and because they occur immediately before and after kernel dispatch, they may lengthen a critical path that includes the kernel invocation. Other buffer creation flags and OpenCL API calls can eliminate both of these performance drawbacks.
- The benefit of using this flag is it can simplify the OpenCL API calls in your program. You would not need to explicitly read/write the buffer, nor explicitly map/unmap the buffer. You could write code in the manner shown below:
<source lang="CPP"> for (i = 0; i < 1024; ++i) p[i] = ... foo(buf).wait(); for (i = 0; i < 1024; ++i) ... = p[i]; </source>
- The above uses a previously defined C++ kernel functor named foo to enqueue a kernel using buf as an argument. Please see the OpenCL C++ binding specification for details on OpenCL kernel functors. For the purposes of this example, it enqueues a kernel with the buffer buf as an argument and then waits for completion of the kernel. It is recommended that this flag not be used for performance critical OpenCL code, although, as you can see, it does simplify the API calls and can be useful for prototyping.
- CL_MEM_ALLOC_HOST_PTR
- A host_ptr argument is not necessary for buffers created with this creation flag; the default NULL value is valid. This flag is mutually exclusive with CL_MEM_USE_HOST_PTR. This flag indicates that OpenCL should allocate an underlying memory store for the buffer that can be accessed from the host. For this implementation, a buffer created with this flag is allocated memory in the CMEM contiguous memory region and can be accessed directly from both the host A15 and the C66 DSPs. This flag is recommended for performance in buffer handling. It is also the default if none of CL_MEM_USE_HOST_PTR, CL_MEM_ALLOC_HOST_PTR or CL_MEM_COPY_HOST_PTR is specified in the creation API. Buffers of this type can be used with the read and write buffer OpenCL APIs, or they can be used with the map and unmap APIs for zero copy operation. The read/write and map/unmap APIs are described later in this section.
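- A minimal sketch of creating such a buffer, assuming bufsize is already defined (no host_ptr is required):
<source lang="CPP">
// The underlying store is allocated by the runtime in CMEM and is accessible from both the A15 and the DSPs.
Buffer buf(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, bufsize);
</source>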
- CL_MEM_COPY_HOST_PTR
- A host_ptr argument is required for this buffer creation flag. This creation flag is identical to the CL_MEM_ALLOC_HOST_PTR flag in allocation and usage. The only difference is that on creation (or at least before first use) of a buffer with this flag, the memory pointed to by the argument host_ptr is used to initialize the underlying memory store for the buffer which will be in CMEM contiguous memory.
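- As a sketch, assuming a host array p of bufsize bytes has already been populated:
<source lang="CPP">
// The CMEM-backed buffer is created and initialized with a copy of the data at p.
Buffer buf(ctx, CL_MEM_COPY_HOST_PTR, bufsize, p);
</source>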
- CL_MEM_USE_MSMC_TI
- This flag is a TI extension to standard OpenCL. It can be used in combination with the other buffer creation flags, except for CL_MEM_USE_HOST_PTR. When this flag is used, the buffer will be allocated from a CMEM block in the MSMC memory area, rather than a CMEM block in the DDR3 area. The MSMC area available for OpenCL buffers is limited to 4.75MB, so use of this flag must be judicious. However, in most circumstances the DSP can access MSMC buffers significantly faster than DDR buffers. This flag only affects the underlying memory store used for the buffer. It will still be considered a global buffer and can be used anywhere a global buffer can be used.
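- A sketch of placing a small, frequently accessed buffer in MSMC (the 1MB size is illustrative and must fit within the 4.75MB available):
<source lang="CPP">
Buffer fast_buf(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_MSMC_TI, 1024 * 1024);
</source>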
Global buffers can contain data that persists from one kernel invocation to the next. It is therefore possible for OpenCL C kernels to communicate data forward in time by simply having kernel 1 produce data and kernel 2 consume that data, all on the C66 DSP. Other than creating the buffer through which the communication will occur and sequencing the kernel enqueues, the host A15 does not need to be involved in that data communication from kernel 1 to kernel 2; i.e. the A15 does not need to read the data from kernel 1 and transfer it to kernel 2, the data can simply persist on the C66 DSP.
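As a sketch of that persistence, assuming kernels K1 and K2 and an in-order CommandQueue Q have already been created, the intermediate data never needs to be read back by the A15:
<source lang="CPP">
Buffer inter(ctx, CL_MEM_READ_WRITE, bufsize);   // intermediate data stays on the DSP side
K1.setArg(0, inter);                             // kernel 1 produces into the buffer
Q.enqueueTask(K1);
K2.setArg(0, inter);                             // kernel 2 consumes the same buffer
Q.enqueueTask(K2);
Q.finish();
</source>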
Local Buffers
Local buffers are quite different from global buffers. You cannot access local buffers from the host, and you do not create them using APIs as you do global buffers. Local buffers are allocated from local memory, which in this implementation is the L2 scratchpad memory on the C66 DSP cores. Data cannot persist from kernel to kernel in a local buffer; the lifetime of a local buffer is the dynamic lifetime of the kernel execution. Local buffers are never required, but they are often used in OpenCL C kernels for potential performance improvement. The typical use case, for a kernel that is passed a global buffer, is for the user's OpenCL C kernel to use the local buffer explicitly as a fast scratchpad for the larger and slower global buffer. This scratchpad is managed by the user using asynchronous built-in functions to move data between the global and local buffers. Again, local buffers are never required, and an OpenCL C kernel can depend on the C66 DSP cache to alleviate DDR access delay rather than use local buffers. However, it is often the case that manual data movement to/from local buffers is advantageous to performance.
Local buffers can be defined in two ways. The first way is to simply define an array in your OpenCL C kernel that is defined with the local keyword. For example, the following OpenCL C kernel defines a local buffer named scratch and then calls the async_work_group_copy builtin function to copy 100 char values from the passed in global buffer to the local buffer. The limitation to this method, is that the local buffers are statically sized, in this case to 100 chars.
<source lang="c">kernel void foo(global char *buf) {
local char scratch[100]; async_work_group_copy(scratch, buf, 100, 0); ...
}</source>
Alternatively, local buffers can be passed to OpenCL C kernels as an argument and can be sized dynamically. In this method you simply define your OpenCL C kernel with a local buffer argument. For example:
<source lang="c">kernel void foo(global char *buf, local char *scratch) {
async_work_group_copy(scratch, buf, 100, 0); ...
}</source>
and then from the host side you set up the local buffer argument by passing a null pointer and a size to the clSetKernelArg function.
The OpenCL API for setting an argument to a kernel has the following prototype:
<source lang="c">
cl_int clSetKernelArg(cl_kernel   kernel,
                      cl_uint     arg_index,
                      size_t      arg_size,
                      const void *arg_value);
</source>
To set up the 1st argument to the kernel foo with a global buffer, the API call would look like:
<source lang="c">
cl_mem buf = clCreateBuffer(...);
clSetKernelArg(foo, 0, sizeof(buf), &buf);
</source>
To set up the 2nd argument to kernel foo with a local buffer, the API call would look like:
<source lang="c">
clSetKernelArg(foo, 1, 100, NULL);
</source>
The OpenCL runtime will interpret the size and null pointer passed to clSetKernelArg as a local buffer specification. It will temporarily allocate an area of local memory (L2 in this implementation) of that size and will pass a pointer to that area as the local buffer argument.
If the host code is using the C++ bindings then the previous two code boxes combined would look like:
<source lang="CPP"> Buffer buf(...); foo.setArg(0, buf); foo.setArg(1, __local(100)); </source>
In the C++ case, the __local() object is used to indicate a local buffer of size 100 bytes.
Sub-Buffers
OpenCL Sub-Buffers are aliases to existing OpenCL global Buffers. Creating a sub-buffer does not result in any underlying memory store allocation above what is already required for the aliased buffer. There are two primary use cases for sub-buffers:
- Accessing a buffer with different access flags than were specified in buffer creation, or
- Accessing a subset of a buffer.
The C++ APIs for creating sub-buffers are described below. Please see the OpenCL 1.1 specification or the OpenCL 1.1 Online Reference for the syntax of the C API for sub-buffer creation.
<source lang="cpp">typedef struct _cl_buffer_region { size_t origin; size_t size;} cl_buffer_region;
Buffer createSubBuffer(cl_mem_flags flags, cl_buffer_create_type buffer_create_type,
const void * buffer_create_info, cl_int * err = NULL);
</source>
createSubBuffer is a member function of the OpenCL C++ Buffer object. The flags argument should be one of CL_MEM_READ_WRITE, CL_MEM_READ_ONLY, or CL_MEM_WRITE_ONLY. The buffer_create_type should be CL_BUFFER_CREATE_TYPE_REGION; that is the only cl_buffer_create_type supported in OpenCL 1.1. The buffer_create_info argument should be a pointer to a cl_buffer_region structure, in which you define the buffer subset for the sub-buffer. Usage of these APIs may look like:
<source lang="cpp">Buffer buf(ctx, CL_MEM_READ_WRITE, bufsize);
cl_buffer_region rgn = {0, bufsize};
Buffer buf_read = buf.createSubBuffer(CL_MEM_READ_ONLY, CL_BUFFER_CREATE_TYPE_REGION, &rgn); Buffer buf_write = buf.createSubBuffer(CL_MEM_WRITE_ONLY, CL_BUFFER_CREATE_TYPE_REGION, &rgn);</source>
The prior subsection indicated that global buffers can be persistent from one kernel invocation to the next. It is a common use case that kernel K1 only writes a buffer and kernel K2 only reads it. The buffer must be created with the CL_MEM_READ_WRITE access flag, because the buffer is being both read and written by OpenCL C kernels running on the C66 DSPs. However, no individual kernel is both reading and writing the buffer, so the CL_MEM_READ_WRITE property may result in underlying cache coherency operations that are unnecessary. For performance reasons, sub-buffers can be used to specify more restrictive buffer access flags, customized for the behavior of the particular kernel to which the buffer is passed as an argument. The sub-buffer creation code above is the setup for this process: a Buffer buf has been defined as read/write, and two sub-buffer aliases have been created, one read only and the other write only. These sub-buffers may then be passed to kernels K1 and K2 instead of the buffer buf directly, as sketched below. This ensures that the OpenCL runtime does not perform any unnecessary cache coherency operations.
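Continuing the sketch above, the sub-buffers, rather than buf itself, are set as the kernel arguments. The kernels K1 and K2 and the queue Q are assumed to already exist:
<source lang="cpp">
K1.setArg(0, buf_write);   // K1 only writes the buffer
Q.enqueueTask(K1);

K2.setArg(0, buf_read);    // K2 only reads what K1 produced
Q.enqueueTask(K2);
</source>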
The other use case for sub-buffers is to create an object representing a subset of a buffer. For example, it may be desirable to process a buffer in chunks. Sub-buffers can be used to express those chunks in a form suitable as arguments to OpenCL C kernels. Assuming an OpenCL queue named Q and a Kernel named K are already set up, the following code results in K being dispatched twice, once with the first half of Buffer buf and again with the second half.
<source lang="cpp">Buffer bufA(ctx, CL_MEM_READ_ONLY, bufsize);
cl_buffer_region rgn_half1 = {0, bufsize/2}; cl_buffer_region rgn_half2 = {bufsize/2, bufsize/2};
Buffer buf_half1 = buf.createSubBuffer(CL_MEM_READ_ONLY, CL_BUFFER_CREATE_TYPE_REGION, &rgn_half1); Buffer buf_half2 = buf.createSubBuffer(CL_MEM_READ_ONLY, CL_BUFFER_CREATE_TYPE_REGION, &rgn_half2);
K.setArg(0, buf_half1); Q.enqueueTask(K);
K.setArg(0, buf_half2); Q.enqueueTask(K); </source>
Buffer Read/Write vs. Map/Unmap
The OpenCL APIs support two mechanisms for the host application to interact with OpenCL buffers. They can:
- Read and write buffers using clEnqueueReadBuffer and clEnqueueWriteBuffer in C or the member functions enqueueReadBuffer and enqueueWriteBuffer in C++, or
- Map and unmap using clEnqueueMapBuffer and clEnqueueUnmapMemObject in C or the member functions enqueueMapBuffer and enqueueUnmapMemObject in C++.
The read and write APIs imply a movement of data to and from OpenCL buffers. This typically means moving data between Linux system memory and the CMEM memory where an OpenCL buffer resides.
The map/unmap APIs map the underlying memory store of a buffer into the host address space and allow the host application to read and write the buffer's contents directly. This method has the advantages of:
- Not requiring 2 storage areas containing the same data (one in Linux system memory and one in the Buffer in CMEM memory), and
- Not requiring extra data movement between the two storage areas.
There are situations where read/write buffers are preferable, however. For smaller buffers the overhead of the extra copies is small, and the extra commands enqueued to the CommandQueue have some overhead of their own. In the examples of read/write and map/unmap use below, four commands are enqueued for data movement in the map/unmap case, but only two in the read/write case.
The map/unmap commands perform cache coherency operations and do entail some cost. The read and write buffer commands currently use memcpy for data transfer. In a future release this will be enhanced to use the K2H's DMA capability; because the DMA engines are cache coherent with the cores, it may then sometimes be faster to read/write than to map/unmap.
For the examples below, please refer to the OpenCL 1.1 specification or online reference pages for the details of the APIs. In these examples, most of the arguments to the read/write or map/unmap enqueue commands are self-explanatory, with the exception of CL_TRUE as the second argument and 0 as the third argument to read/write and fourth argument to map/unmap. The CL_TRUE argument indicates to the OpenCL runtime that the enqueue command should block until the operation is complete. OpenCL enqueue commands are typically asynchronous: the command is enqueued and the main thread continues execution in parallel with the enqueued operations. If these APIs are passed CL_FALSE as the second argument, then they behave asynchronously as well.
The 0 argument is an offset into the buffer being read, written, mapped or unmapped.
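For example, a nonzero offset could be used to write only the second half of a buffer. This is a sketch, reusing the Q, buf, ary and bufsize names from the examples below:
<source lang="cpp">
// Write bufsize/2 bytes from ary into the buffer, starting at byte offset bufsize/2.
Q.enqueueWriteBuffer(buf, CL_TRUE, bufsize/2, bufsize/2, ary);
</source>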
The below code fragment illustrates a write/read buffer use case using the C++ OpenCL Binding. An OpenCL context ctx, CommandQueue Q and Kernel K are already created and bufsize represents the number of bytes in the buffers. Note that bufsize bytes are allocated in the Linux heap for the array ary and an additional bufsize bytes are allocated in CMEM for the buffer buf. This double allocation is clearly a limitation if bufsize is particularly large. If it is not large, then an application can double buffer using this approach. After the buffer write, the memory pointed to by ary can be repopulated and a pipeline can be established. Obviously, in that use case, the example would need some modification to not reuse ary for the read buffer.
<source lang="cpp"> int *ary = (int*) malloc(bufsize);
// populate ary
Buffer buf (ctx, CL_MEM_READ_WRITE, bufsize); K.setArg(0, buf);
Q.enqueueWriteBuffer(buf, CL_TRUE, 0, bufsize, ary); Q.enqueueTask(K); Q.enqueueReadBuffer (buf, CL_TRUE, 0, bufsize, ary);
// consume ary </source>
The below code fragment illustrates a map/unmap buffer use case using the C++ OpenCL Binding.
<source lang="cpp">Buffer buf (ctx, CL_MEM_READ_WRITE, bufsize); K.setArg(0, buf);
int * ary = (int*)Q.enqueueMapBuffer(buf, CL_TRUE, CL_MAP_WRITE, 0, bufsize); // populate ary Q.enqueueUnmapMemObject(buf, ary);
Q.enqueueTask(K);
ary = (int*)Q.enqueueMapBuffer(buf, CL_TRUE, CL_MAP_READ, 0, bufsize); // consume ary Q.enqueueUnmapMemObject(buf, ary);</source>
Large OpenCL buffers and Memory Beyond 32-bit DSP Address Space
The K2H device supports up to 8GB of DDR3 on the DDR3A bus. The C66 DSP, however, is a 32-bit architecture and cannot access all 8GB at any given time. The C66 DSP does have a memory translation capability that allows it to access any portion of that memory, but there are constraints on the mapping that are described here.
The 8GB of DDR3 exist in the K2H 36-bit physical address space at addresses 8:0000:0000 to 9:FFFF:FFFF. The K2H device boots with the C66 DSPs mapping the upper 2GB of their 32-bit address space, 8000:0000 to FFFF:FFFF, to the beginning of that physical range. For the remainder of this section, the physical range from 8:0000:0000 to 8:7FFF:FFFF will be referred to as the low 2GB and the range from 8:8000:0000 to 9:FFFF:FFFF will be referred to as the upper 6GB.
The figure below illustrates the mapping, using 512M blocks of memory. The red blocks are Linux system memory and the green blocks are either CMEM or reserved memory. See the section above on DDR partition for definitions of these memory types. The actual use of the lower 2GB can vary based on memory usage boot variables, but the use illustrated below is the default and is typical.
If the entire upper 6GB of memory is configured as Linux system memory and is therefore unavailable to OpenCL, then OpenCL will have the 1488MB of memory in the lower 2GB available for OpenCL C programs and buffers, no further constraints are necessary, and the remainder of this section is not applicable. Additionally, if the environment variable TI_OCL_DSP_NOMAP is set, then OpenCL will ignore any CMEM region that is defined in the upper 6GB; OpenCL operation will be restricted to the lower 2GB and again the remainder of this section is not applicable.
If there is memory in the upper 6GB that is given to CMEM to manage, then that memory will be available to OpenCL as well, and understanding how OpenCL will use that memory is important so an application can maximize resource utilization. The figure below illustrates a potential DDR partition with CMEM in the upper 6GB.
In the partition above, 1.5GB is partitioned to Linux system memory and 6.5GB is partitioned for OpenCL use. Note that only the 512M block of memory from A000:0000 to BFFF:FFFF is indicated as green for OpenCL. The other three 512M blocks are blue, indicating that they are available destinations for mapping from alternate regions of the 36-bit address space. The one green 512M block will always be fixed to its corresponding location in physical memory and is not available for mapping.
Within that 512M fixed block there is 48M of reserved memory and 464M of CMEM OpenCL memory. OpenCL will manage allocations using two heaps: a fixed heap and a mapped heap. The fixed heap is the 464M of OpenCL memory in the fixed block of DSP memory from A000:0000 to BFFF:FFFF. The mapped heap manages all other OpenCL memory. In reality, the mapped heap may be more than one heap in the OpenCL implementation, if the additional OpenCL memory is not contiguous, as is the case in the above example figure. The number of actual heaps in the virtual mapped heap is unimportant to the user, except that a single very large buffer may not span all the additional OpenCL memory if that memory is not contiguous.
The OpenCL runtime will manage which heap is used for allocation using the following algorithm:
- OpenCL C programs are always allocated from the fixed heap
- OpenCL C Buffers <= 16K bytes are allocated from the fixed heap, until it is full and then from the mapped heap.
- OpenCL C Buffers <= 16M bytes are allocated from the fixed heap, while fixed heap space available >= 64M, and then from the mapped heap.
- OpenCL C Buffers > 16M are always allocated from the mapped heap.
The mapping of buffers from the mapped heap into the blue (mapping destination) regions of the 32-bit C66 address space occurs at OpenCL C kernel execution boundaries. Immediately before the launch of a kernel, mapping occurs for the buffer arguments to the kernel. Immediately after the kernel completes, the mapping is returned to the default mode. Mapping will not change during a single execution of a kernel. This execution model results in some constraints for any single kernel invocation. Most importantly, all buffer arguments to an OpenCL C kernel that are allocated from the mapped heap must cumulatively be mappable to the blue mapping destination region of the 32-bit C66 DSP address space. The mapping destination region includes one 512M block and one 1024M block, so this could support one 1GB buffer and one 512MB buffer, or three 512M buffers, or two 512M buffers and four 128M buffers, etc.
Additionally, OpenCL can manage only 7 mapped regions, and each mapped region must be sized to a power of 2 and aligned to a power of 2. This limitation of 7 mapped regions will limit the number of buffers (from the mapped heap) that can be passed to a kernel. Subject to the size limits from above, at least 7 such buffers are possible. Given buffer sizes and locations, it may be possible to map more than 7 buffers to a single OpenCL C kernel, because the OpenCL runtime may be able to map a larger region that covers multiple buffers. For example, four 128M buffers that are consecutive in the mapped heap, where the first is aligned to a 512M boundary, could potentially be mapped using only 1 of the 7 map regions.
To maximize OpenCL memory usage in the upper 6GB, and prevent fragmentation, it is recommended that buffers larger than 16MB be:
- sized to a power of 2.
- allocated in decreasing size order.
Smaller buffers allocated in the fixed heap are not subject to the memory mapping constraints and need not be power of 2 sized.
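As a sketch of that recommendation, three large buffers might be allocated in decreasing, power-of-2 sizes (the sizes and names here are purely illustrative):
<source lang="cpp">
size_t MB = 1024 * 1024;
Buffer buf_a(ctx, CL_MEM_READ_ONLY,  1024 * MB);   // largest buffer allocated first
Buffer buf_b(ctx, CL_MEM_READ_ONLY,   512 * MB);
Buffer buf_c(ctx, CL_MEM_WRITE_ONLY,  256 * MB);
</source>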
Large Buffer Use Cases
For the purposes of this section, a requirement for 3 large OpenCL buffers will be assumed. From the operational discussion in the previous section, we can deduce that three 512M buffers can be passed to an OpenCL C kernel. If the OpenCL host application requires buffers larger than 512M, for example three 2GB buffers, it can allocate the three 2GB buffers that are populated by the host, and sub-buffers (see the sub-buffer section) can be used to define 512M subsets of the larger buffers. The host application can then enqueue an OpenCL C kernel four times, once for each 512M sub-buffer section of the larger buffers, as sketched below.
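A sketch of that sub-buffer approach for one 2GB buffer is shown below. The kernel K and queue Q are assumed to already exist, and error handling is omitted:
<source lang="cpp">
size_t chunk = 512 * 1024 * 1024;                    // process a 2GB buffer in 512M chunks
Buffer big(ctx, CL_MEM_READ_WRITE, 4 * chunk);

for (int i = 0; i < 4; ++i)
{
    cl_buffer_region rgn = { i * chunk, chunk };
    Buffer piece = big.createSubBuffer(CL_MEM_READ_WRITE, CL_BUFFER_CREATE_TYPE_REGION, &rgn);
    K.setArg(0, piece);
    Q.enqueueTask(K);
}
Q.finish();
</source>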
Alternatively, an OpenCL application could base its calculations on a 512M buffer size and define multiple sets of buffers for a ping-pong buffer implementation. Set A of three 512M buffers is populated and a kernel is enqueued to the DSP to process those buffers. Concurrent with the processing of buffer set A, a second set B of three buffers is being populated by the host. Buffer processing and buffer population then alternate between the two buffer sets.
A third use case involves using enqueueTask to enqueue kernels. In this model 8 independent kernels can execute concurrently, one on each of the C66 DSPs in the K2H device. Each of these tasks can operate on an independent set of buffers. In this case there would be eight sets of three buffers, each limited to 256M. Combining this approach with ping-pong buffers would require 16 sets of three buffers, each limited to 128M.
There are other use-cases for large buffers, but the above briefly describes some of the common use cases. The vecadd_mpax example shipped with OpenCL provides a framework for the sub-buffer use-case.
DSP memory in OpenCL vs. standalone DSP application
For C66 DSP developers moving to OpenCL on the K2H from a standalone DSP application environment, the following guidelines will be helpful for the transition.
- Memory management for the C66 DSP is accomplished exclusively through OpenCL buffers defined in the host application. OpenCL buffers can be defined in DDR, MSMC, and L2 memory regions (See sections Global Buffers and Local Buffers for details). This eliminates the need for all heap management on the DSP. Calls to malloc, calloc, realloc and free should not be used in OpenCL C programs or in any standard C code call tree originating from OpenCL C programs.
- OpenCL C kernels and any call tree originating from a kernel inherit a stack from the OpenCL runtime executing on the DSPs. To limit the amount of on-chip memory that is reserved on the K2H, this stack is small, currently 10K bytes. Auto (function scope) variables should therefore be kept to a minimum. Additionally, call tree depth must be limited. Recursion is not allowed in OpenCL C code. If OpenCL C code calls standard C code, then recursion should not be used in the standard C code.
- OpenCL C allows the definition of constant data in the constant memory address space at global scope. No other global data is allowed in OpenCL C. If OpenCL C is linked with standard C code, the standard C code should avoid global variables. Again, if the DSP code needs persistent data, it should be allocated in an OpenCL buffer, and the data should reside and persist in that buffer.
- OpenCL C does not run a boot setup for dispatch of kernels; therefore, dependencies on items that typically run before main is called or after main returns in a standalone DSP application are not supported in an OpenCL C environment.
- Linking OpenCL C code with standard C code that requires a user specified linker command file is not supported.
- Calling CSL or BIOS functions is allowed from OpenCL C code or a standard C code call tree originating from OpenCL C, provided it conforms to the restrictions above (particularly, no memory management, no linker command file, and no dependence on boot time code).
Discovering OpenCL Memory Sizes and Limits
TBD
Cache Coherency
TBD
Using DMA for data movement within a DSP kernel
TBD



