66AK2H12: MSMC vs DDR, no performance improvement

Other Parts Discussed in Thread: 66AK2H12

Hi,

We are working on the Keystone 2 66AK2H12 EVM, executing convolution of a 32x32 image on the DSP cores using OpenCL. We are storing the image and the mask in MSMC for better performance, but the results are not as expected. We tried keeping the image and mask in DDR to compare, and the performance turned out to be similar. As per our understanding, keeping the image and mask in MSMC should reduce the access time and hence improve performance.

We use the following calls to allocate space in MSMC for the image, mask, and output:

Output = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_MSMC_TI, mem_size_A, NULL, &err);
Image = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR | CL_MEM_USE_MSMC_TI, mem_size_A, h_A, &err);
Mask = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR | CL_MEM_USE_MSMC_TI, mem_size_B, h_B, &err);
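
For reference, a minimal check we could add after each clCreateBuffer call (a sketch only; we are assuming a failed MSMC request would be reported back through err):

/* clCreateBuffer() returns NULL and sets err on failure, so verify
 * each buffer right after it is created before using it. */
if (Image == NULL || err != CL_SUCCESS)
{
    printf("clCreateBuffer for Image failed, err = %d\n", err);
    exit(1);
}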

The timing we got for the execution is as follows:

execution time with CL_MEM_USE_MSMC_TI = 338 microseconds

execution time without CL_MEM_USE_MSMC_TI = 329 microseconds

Is there anything extra we need to do for creating the MSMC buffer space? Also can you give us more suggestions on improving the performance?

Thanks,

Faizan

  • It's hard to say what happened without knowing what you are measuring and how your code is structured. Are you dispatching your kernel to all 8 cores? Can you try 1 core and see if you see any performance improvement?

    General optimization tips for OpenCL on TI SoCs can be found here: downloads.ti.com/.../index.html
  • Hi Yuan,

    Sorry, I should have mentioned that. We are measuring the time from the host side. We measure the kernel computation time by timing the clEnqueueNDRangeKernel call, like below:

    ptimer1 = PAPI_get_virt_usec();

    err = clEnqueueNDRangeKernel(commands, kernel, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, &event);

    ptimer2 = PAPI_get_virt_usec();

    As you can see, we are using PAPI for the time measurement. The kernel code is included at the end of this reply for your reference. (I tried attaching the file but couldn't; sorry for that.)

    Since we are using OpenCL, we assumed that the kernel would be executed on all the cores and that this is taken care of internally. I am not sure if there is a way to execute the kernel on one DSP only; correct me if I'm wrong.

    Thank you for the reply and the optimization link. Please let us know if you need anything else.

    Regards,

    Faizan.

    Kernel code for your reference:

    __kernel void gaussianblur(
        const __global float * const input,
        __constant float * const mask,
        __global float * const output,
        const int inputWidth,
        const int maskWidth)
    {
        /* One work-item per output pixel. */
        const int x = get_global_id(0);
        const int y = get_global_id(1);

        /* Accumulate the mask-weighted neighbourhood of (x, y). */
        float sum = 0;
        for (int r = 0; r < maskWidth; r++)
        {
            const int idxIntmp = (y + r) * inputWidth + x;

            for (int c = 0; c < maskWidth; c++)
            {
                sum += ((mask[(r * maskWidth) + c]) / 9) * input[idxIntmp + c];
            }
        }

        output[y * get_global_size(0) + x] = sum;
    }

  • Okay, I see. Is there a "clWaitForEvents(1, &event);" between your measurements of ptimer1 and ptimer2? It needs to be there to wait for the kernel to finish, because kernel execution happens asynchronously with respect to host execution. Otherwise, you are only measuring the cost of the host API call.

    Alternatively, you can use the OpenCL profiling information associated with events, but you will need to set CL_QUEUE_PROFILING_ENABLE when creating the OpenCL command queue. Then you can use the clGetEventProfilingInfo() API to query your event for the time from START to END. See the OpenCL spec for details.

    To make your kernel run on only 1 core, you can make your "localWorkSize" exactly the same as "globalWorkSize"; then you will have only 1 workgroup, and it will be dispatched to only 1 core. A sketch of the whole flow is below.
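
    Roughly like this (a sketch only: "device" and the 32x32 sizes are placeholders, the other names are taken from your snippets):

    // Create the command queue with profiling enabled (once, at setup).
    commands = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);

    // Enqueue the kernel, then block until it has actually finished on the DSP.
    err = clEnqueueNDRangeKernel(commands, kernel, 2, NULL,
                                 globalWorkSize, localWorkSize, 0, NULL, &event);
    clWaitForEvents(1, &event);

    // Device-side timestamps are in nanoseconds.
    cl_ulong start, end;
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,   sizeof(end),   &end,   NULL);
    printf("Kernel Start to End: %llu us\n", (unsigned long long)((end - start) / 1000));

    // For a single workgroup (hence a single DSP core), e.g. for your 32x32 image:
    // size_t globalWorkSize[2] = { 32, 32 };
    // size_t localWorkSize[2]  = { 32, 32 };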

  • Hi

    Sorry for the late reply.

    Actually, I am calculating the computation time of the kernel with OpenCL events, as shown below; the PAPI library is used to measure the time taken to execute each API call on the host side.

    cl_ulong time_start, time_end;
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);

    As per your suggestion, I made "localWorkSize" the same as "globalWorkSize" and measured the time; it is around 2280 us, which is about 8 times the previously measured value, since the kernel is now executed on only one DSP core.

    I also launched an empty kernel (no arguments and no code in it), again with "localWorkSize" the same as "globalWorkSize". The time taken in this scenario is around 130 us, which seems too high. Can you explain why there is this much overhead just for launching one kernel and returning from it?

    Regards
    Faizan
  • The measured time may or may not include the time for loading the kernel program onto the DSP, depending on whether you have run any kernel from that same program before.

    We ship a similar example that measures the elapsed time between kernel launch and kernel finish in the APPLICATION thread: /usr/share/ti/examples/opencl/null/null. If you run it a few times, you can see that the best elapsed time is around 80 us. I will attach a slightly modified null example, which separates the time spent into Queue to Submit, Submit to Start, and Start to End. Run it a few times; you can see that the actual kernel round-trip time inside the OpenCL runtime (Start to End) is about 30 us. The rest is spent in the OpenCL runtime, including thread-scheduling overhead in Linux. Our OpenCL runtime itself is a multi-threaded implementation, and we try to move the actual OpenCL runtime work away from the user's APPLICATION thread into OpenCL runtime threads.

    However, a null kernel does not represent the typical use case. A typical use case should try to optimize the overall system performance, utilizing the asynchronous execution built into OpenCL. After a kernel gets enqueued onto the DSP queue, the ARM could be doing some other useful work and come back later to check for kernel completion. The APPLICATION thread on the ARM can also enqueue a sequence of commands onto the DSP queue and use event dependencies to take care of execution ordering, without waiting for each individual command to finish before enqueuing the next one (see the sketch at the end of this reply).

    ---------------- A dividing line -----------
    Now switching to a different topic: by limiting the number of workgroups to one, what I intended was for you to compare the performance between using MSMC and not using MSMC. Depending on your application, when all 8 DSP cores are accessing MSMC there might not be enough MSMC bandwidth. You may also want to increase the problem size in your application when investigating MSMC performance. There should be a few relevant examples in the shipped examples: /usr/share/ti/examples/opencl/.
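
    To illustrate the event-dependency idea mentioned above, a rough sketch (not from your application; "kernel1" and "kernel2" are placeholders, the buffer and size names are taken from your snippets):

    cl_event write_done, k1_done, k2_done;

    // Each command lists the event(s) it must wait on, so the runtime takes
    // care of ordering and the ARM never blocks between enqueues.
    clEnqueueWriteBuffer(commands, Image, CL_FALSE, 0, mem_size_A, h_A,
                         0, NULL, &write_done);
    clEnqueueNDRangeKernel(commands, kernel1, 2, NULL, globalWorkSize, localWorkSize,
                           1, &write_done, &k1_done);
    clEnqueueNDRangeKernel(commands, kernel2, 2, NULL, globalWorkSize, localWorkSize,
                           1, &k1_done, &k2_done);

    // ... the ARM can do other useful work here ...

    clWaitForEvents(1, &k2_done);   // block only when the results are needed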

  • A slightly modified null example that also reports the timing components inside the OpenCL runtime, within the elapsed time between kernel enqueue and kernel finish in the user's application thread.


    5466.null_main_monidifed.cpp


    Sample output:

    root@k2hk-evm:~/oclexamples/null# ./null
    The OpenCL runtime will lazily load the device program upon the
     first enqueue of a kernel from the program, so the elapsed time
     overall from the first enqueue will be longer to account for the
     loading of the kernel.  Also, subsequent enqueue's of a kernel will
     also potentially benefit from a warm cache on the device.

    Elapsed (with Load): 481 usecs
    Null Kernel Exec : Queue  to Submit: 11 us
    Null Kernel Exec : Submit to Start : 54 us
    Null Kernel Exec : Start  to End   : 121 us

    Elapsed (w/o  Load): 109 usecs
    Null Kernel Exec : Queue  to Submit: 2 us
    Null Kernel Exec : Submit to Start : 37 us
    Null Kernel Exec : Start  to End   : 30 us

    Elapsed (w/o  Load): 92 usecs
    Null Kernel Exec : Queue  to Submit: 2 us
    Null Kernel Exec : Submit to Start : 34 us
    Null Kernel Exec : Start  to End   : 28 us

    Elapsed (w/o  Load): 88 usecs
    Null Kernel Exec : Queue  to Submit: 2 us
    Null Kernel Exec : Submit to Start : 31 us
    Null Kernel Exec : Start  to End   : 27 us

    Elapsed (w/o  Load): 88 usecs
    Null Kernel Exec : Queue  to Submit: 2 us
    Null Kernel Exec : Submit to Start : 31 us
    Null Kernel Exec : Start  to End   : 28 us

    Elapsed (w/o  Load): 88 usecs
    Null Kernel Exec : Queue  to Submit: 2 us
    Null Kernel Exec : Submit to Start : 33 us
    Null Kernel Exec : Start  to End   : 26 us

    Done!
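
    In case the attachment does not come through: the breakdown above is based on the four timestamps OpenCL records for each event, queried roughly like this (a sketch, not the actual contents of the attached file; "ev" is a placeholder event):

    cl_ulong queued, submit, start, end;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_SUBMIT, sizeof(submit), &submit, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,  sizeof(start),  &start,  NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,    sizeof(end),    &end,    NULL);

    printf("Null Kernel Exec : Queue  to Submit: %llu us\n", (unsigned long long)((submit - queued) / 1000));
    printf("Null Kernel Exec : Submit to Start : %llu us\n", (unsigned long long)((start  - submit) / 1000));
    printf("Null Kernel Exec : Start  to End   : %llu us\n", (unsigned long long)((end    - start)  / 1000));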