This thread has been locked.

Linux/PROCESSOR-SDK-AM57X: OpenCL+DSP accelerate

Part Number: PROCESSOR-SDK-AM57X
Other Parts Discussed in Thread: AM5718

Tool/software: Linux

Hi TI,

I am doing OpenCV + DSP acceleration on an AM5718 board. The SDK version is ti-processor-sdk-linux-rt-am57xx-evm-04.01.00.06.

I found that converting a UMat to a Mat takes too much time.

Here is my test code and result:

code:

int main(int argc, char **argv)
{
    unsigned char *yuv_raw_data = NULL;
    int c = -1;
    struct timespec tp0, tp1, tp2, tp3, tp4, tp5;

    cameraOpenDevice(1);

    Mat mat_yuv(CAMERA_HEIGHT, CAMERA_WIDTH, CV_8UC2);
    Mat mat_rgb(CAMERA_HEIGHT, CAMERA_WIDTH, CV_8UC3);

    UMat umat_yuv(CAMERA_HEIGHT, CAMERA_WIDTH, CV_8UC2);
    UMat umat_rgb(CAMERA_HEIGHT, CAMERA_WIDTH, CV_8UC3);

    while (1) {
        clock_gettime(CLOCK_MONOTONIC, &tp0);
        DqBuffer(&yuv_raw_data); // YUV raw data from V4L2 camera
        QBuffer();
        memcpy(mat_yuv.data, yuv_raw_data, CAMERA_WIDTH * CAMERA_HEIGHT * 2);

        clock_gettime(CLOCK_MONOTONIC, &tp1);
        mat_yuv.copyTo(umat_yuv);

        clock_gettime(CLOCK_MONOTONIC, &tp2);
        cvtColor(umat_yuv, umat_rgb, CV_YUV2BGR_YUYV);

        clock_gettime(CLOCK_MONOTONIC, &tp3);
        umat_rgb.copyTo(mat_rgb);

        clock_gettime(CLOCK_MONOTONIC, &tp4);
        imshow("frame", mat_rgb);

        clock_gettime(CLOCK_MONOTONIC, &tp5);

        printf("get yuv Mat tdiff=%lf ms \n", tdiff_calc(tp0, tp1));
        printf("Mat2UMat tdiff=%lf ms \n", tdiff_calc(tp1, tp2));
        printf("cvtColor_YUV2BGR tdiff=%lf ms \n", tdiff_calc(tp2, tp3));
        printf("UMat2Mat tdiff=%lf ms \n", tdiff_calc(tp3, tp4));
        printf("imshow tdiff=%lf ms \n", tdiff_calc(tp4, tp5));

        c = waitKey(1);
        if (c == 27 || c == 'q' || c == 'Q')
            break;
    }

    cameraCloseDevice();
    return 0;
}
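
The tdiff_calc() helper is not shown in the post; a minimal implementation consistent with the calls above (the body is an assumption, not the poster's actual code) would be:

```cpp
#include <cassert>
#include <time.h>

// Assumed implementation of the tdiff_calc() helper used in the timing code
// above: elapsed time between two CLOCK_MONOTONIC samples, in milliseconds.
double tdiff_calc(struct timespec start, struct timespec end)
{
    return (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_nsec - start.tv_nsec) / 1000000.0;
}
```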

Then I compile it into an executable binary: dsp_accelerate_opencv_opencl

result:

root@ok5718-idk:/home/forlinx/qt# on_dsp_accelerate.sh

get yuv Mat tdiff=5.017779 ms
Mat2UMat tdiff=0.672464 ms
cvtColor_YUV2BGR tdiff=0.165757 ms
UMat2Mat tdiff=151.863810 ms
imshow tdiff=1.787381 ms

root@ok5718-idk:/home/forlinx/qt# off_dsp_accelerate.sh

get yuv Mat tdiff=0.509635 ms
Mat2UMat tdiff=0.479541 ms
cvtColor_YUV2BGR tdiff=11.217494 ms
UMat2Mat tdiff=5.070482 ms
imshow tdiff=2.764032 ms

root@ok5718-idk:/home/forlinx/qt# cat on_dsp_accelerate.sh
export TI_OCL_CACHE_KERNELS=Y
export OPENCV_OPENCL_DEVICE='TI AM57:ACCELERATOR:TI Multicore C66 DSP'
echo "OpenCL on"
./dsp_accelerate_opencv_opencl

root@ok5718-idk:/home/forlinx/qt# cat off_dsp_accelerate.sh
export OPENCV_OPENCL_DEVICE='disabled'
echo "OpenCL off"
./dsp_accelerate_opencv_opencl

The result shows that converting a UMat to a Mat takes too much time when DSP acceleration is enabled.

I found the TI docs at http://software-dl.ti.com/processor-sdk-linux/esd/docs/latest/linux/Foundational_Components_OpenCV.html#alternative-approach-to-add-new-opencl-kernels-at-opencv-application-level :

It says the UMat conversion cost includes the waiting time (for the DSP to finish) plus the actual write and any format conversion (done on the CPU). It also depends a lot on the data types used, and on whether floating-point operations are involved. This can be accelerated if a DSP-optimized implementation of remap() is created.

So, my question is: how do I optimize remap() with DSP acceleration?

I read the chapters from "Creating OpenCL C kernel optimized for C66 core" to "Alternative approach to add new OpenCL kernels at OpenCV application level", but I could not find a way to do this.

Please help me, thank you.

  • Hi,

    Please take a look at the 3rd bullet in the OpenCV FAQ:
    software-dl.ti.com/.../Foundational_Components_OpenCV.html

    If it answers your question, please click "Resolved".

    Rex
  • Hi, 

    Thanks for your timely reply.

    I have read the chapter: software-dl.ti.com/.../Foundational_Components_OpenCV.html .

    Then I improved my test code.

    I use CMEM to share memory between the DSP and the A15, and I write the OpenCL kernel code inside my OpenCV application.

    I compile it into an executable binary: dsp_accelerate_opencv_opencl.

    Then I found the root cause of the time cost when doing OpenCL + DSP acceleration.

    Below is my improved code:

    main function:

    int main(int argc, char **argv)
    {
        unsigned char *raw_yuv = NULL;
        int c = -1;
        struct timespec tp0, tp1, tp2, tp3;

        void *cmem_src  = __malloc_ddr(CAMERA_HEIGHT * CAMERA_WIDTH * 2);
        void *cmem_dest = __malloc_ddr(CAMERA_HEIGHT * CAMERA_WIDTH * 3);
        void *cmem_out  = __malloc_ddr(CAMERA_HEIGHT * CAMERA_WIDTH * 3);

        cameraOpenDevice(1);

        while (1) {
            clock_gettime(CLOCK_MONOTONIC, &tp0);

            DqBuffer(&raw_yuv);
            QBuffer();

            Mat mat_yuv(CAMERA_HEIGHT, CAMERA_WIDTH, CV_8UC2);
            memcpy(mat_yuv.data, raw_yuv, CAMERA_HEIGHT * CAMERA_WIDTH * 2);

            clock_gettime(CLOCK_MONOTONIC, &tp1);

            Mat mat_bgr(CAMERA_HEIGHT, CAMERA_WIDTH, CV_8UC3, cmem_out);
            ProcRawClCmem(mat_yuv, "cvtcolor.cl", cmem_src, cmem_dest, mat_bgr);

            clock_gettime(CLOCK_MONOTONIC, &tp2);

            imshow("frame", mat_bgr);
            clock_gettime(CLOCK_MONOTONIC, &tp3);

            printf("get raw_yuv Mat tdiff=%lf ms \n", tdiff_calc(tp0, tp1));
            printf("ProcRawClCmem tdiff=%lf ms \n", tdiff_calc(tp1, tp2));
            printf("imshow tdiff=%lf ms \n", tdiff_calc(tp2, tp3));

            c = waitKey(1);
            if (c == 27 || c == 'q' || c == 'Q')
                break;
        }

        cameraCloseDevice();

        __free_ddr(cmem_src);
        __free_ddr(cmem_dest);
        __free_ddr(cmem_out);
        return 0;
    }


    void ProcRawClCmem(Mat &mat_in, const std::string &kernel_name, void *cmem_src, void *cmem_dest, Mat &mat_out)
    {
        int err;
        struct timespec tp0, tp1, tp2, tp3, tp4, tp5, tp6, tp7;

        Mat mat_src(CAMERA_HEIGHT, CAMERA_WIDTH, CV_8UC2, cmem_src);
        Mat mat_dest(CAMERA_HEIGHT, CAMERA_WIDTH, CV_8UC3, cmem_dest);

        clock_gettime(CLOCK_MONOTONIC, &tp0);
        mat_in.copyTo(mat_src);
        clock_gettime(CLOCK_MONOTONIC, &tp1);

        Context context(CL_DEVICE_TYPE_ACCELERATOR);
        std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

        ifstream ifs(kernel_name);
        if (ifs.fail()) {
            cout << "opencl file is not available..." << endl;
        }
        std::string kernelStr((istreambuf_iterator<char>(ifs)), istreambuf_iterator<char>());

        Program::Sources source(1, std::make_pair(kernelStr.c_str(), kernelStr.length()));
        Program program = Program(context, source);
        program.build(devices);

        Kernel kernel(program, "YUV2RGB_CMEM");
        CommandQueue queue(context, devices[0], CL_QUEUE_PROFILING_ENABLE);
        clock_gettime(CLOCK_MONOTONIC, &tp2);

        cl_mem bufSrc = clCreateBuffer(context(), CL_MEM_READ_ONLY  | CL_MEM_USE_HOST_PTR, BUF_SRC,  cmem_src,  &err);
        cl_mem bufDst = clCreateBuffer(context(), CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, BUF_DEST, cmem_dest, &err);

        cl_event ev0;
        err = clEnqueueWriteBuffer(queue(), bufSrc, CL_TRUE, 0, BUF_SRC, mat_src.data, 0, NULL, &ev0);
        clWaitForEvents(1, &ev0);
        clock_gettime(CLOCK_MONOTONIC, &tp3);

        err = clSetKernelArg(kernel(), 0, sizeof(cl_mem), &bufSrc);
        err = clSetKernelArg(kernel(), 1, sizeof(cl_mem), &bufDst);

        cl_event ev1;
        std::size_t globalWorkSize[2] = { (std::size_t)CAMERA_WIDTH, (std::size_t)CAMERA_HEIGHT };
        err = clEnqueueNDRangeKernel(queue(), kernel(), 2, NULL, globalWorkSize, NULL, 0, NULL, &ev1);
        clWaitForEvents(1, &ev1);
        clock_gettime(CLOCK_MONOTONIC, &tp4);

        cl_event ev2;
        err = clEnqueueReadBuffer(queue(), bufDst, CL_TRUE, 0, BUF_DEST, mat_dest.data, 0, NULL, &ev2);
        clWaitForEvents(1, &ev2);
        clock_gettime(CLOCK_MONOTONIC, &tp5);

        err = clReleaseMemObject(bufSrc);
        err = clReleaseMemObject(bufDst);
        clock_gettime(CLOCK_MONOTONIC, &tp6);

        mat_dest.copyTo(mat_out);
        clock_gettime(CLOCK_MONOTONIC, &tp7);

        printf("Mat2Mat tdiff=%lf ms \n", tdiff_calc(tp0, tp1));                //  0.472872 ms
        printf("build kernel tdiff=%lf ms \n", tdiff_calc(tp1, tp2));           //  3.456178 ms
        printf("clEnqueueWriteBuffer tdiff=%lf ms \n", tdiff_calc(tp2, tp3));   //  0.608536 ms
        printf("clEnqueueNDRangeKernel tdiff=%lf ms \n", tdiff_calc(tp3, tp4)); // 97.881413 ms
        printf("clEnqueueReadBuffer tdiff=%lf ms \n", tdiff_calc(tp4, tp5));    //  0.871893 ms
        printf("clReleaseMemObject tdiff=%lf ms \n", tdiff_calc(tp5, tp6));     //  0.011387 ms
        printf("Mat2Mat tdiff=%lf ms \n", tdiff_calc(tp6, tp7));                //  1.283602 ms
    }

    cvtcolor.cl:

    __kernel void YUV2RGB_CMEM(__global const uchar* srcptr, __global uchar* dstptr)
    {
        int src_offset = 0;
        int dst_offset = 0;
        int src_step = 640 * 2;
        int dst_step = 640 * 3;
        int rows = 480;
        int cols = 640;

        int x = get_global_id(0);
        int y = get_global_id(1);

        if (x < cols / 2)
        {
            __global const uchar* src = srcptr + mad24(y, src_step, (x << 2) + src_offset);
            __global uchar* dst = dstptr + mad24(y, dst_step, mad24(x << 1, 3, dst_offset));

            #pragma unroll
            for (int cy = 0; cy < 1; ++cy)
            {
                if (y < rows)
                {
                    __constant float* coeffs = c_YUV2RGBCoeffs_420;
                    int load_src = *((__global int*) src);
                    float vec_src[4] = { load_src & 0xff, (load_src >> 8) & 0xff, (load_src >> 16) & 0xff, (load_src >> 24) & 0xff };
                    float U = vec_src[1] - 128;
                    float V = vec_src[3] - 128;
                    float y00 = max(0.f, vec_src[0] - 16.f) * coeffs[0];
                    float y01 = max(0.f, vec_src[2] - 16.f) * coeffs[0];

                    float ruv = ((coeffs[4] * V) + 0.5f);
                    float guv = (coeffs[3] * V) + ((coeffs[2] * U) + 0.5f);
                    float buv = ((coeffs[1] * U) + 0.5f);

                    dst[2] = convert_uchar_sat(y00 + ruv);
                    dst[1] = convert_uchar_sat(y00 + guv);
                    dst[0] = convert_uchar_sat(y00 + buv);
                    dst[5] = convert_uchar_sat(y01 + ruv);
                    dst[4] = convert_uchar_sat(y01 + guv);
                    dst[3] = convert_uchar_sat(y01 + buv);
                }
                ++y;
                src += src_step;
                dst += dst_step;
            }
        }
    }

    result:

    root@ok5718-idk:/home/forlinx/qt# on_dsp_accelerate.sh

    get raw_yuv Mat tdiff=4.396880 ms
    ProcRawClCmem tdiff=106.702501 ms
    imshow tdiff=1.879939 ms


    Mat2Mat tdiff=0.455792 ms
    build kernel tdiff=8.894614 ms
    clEnqueueWriteBuffer tdiff=1.317600 ms
    clEnqueueNDRangeKernel tdiff=98.243997 ms
    clEnqueueReadBuffer tdiff=1.948584 ms
    clReleaseMemObject tdiff=0.004392 ms
    Mat2Mat tdiff=1.185027 ms

    root@ok5718-idk:/home/forlinx/qt# cat on_dsp_accelerate.sh
    export TI_OCL_CACHE_KERNELS=Y
    export OPENCV_OPENCL_DEVICE='TI AM57:ACCELERATOR:TI Multicore C66 DSP'
    echo "OpenCL on"
    ./dsp_accelerate_opencv_opencl

    As you can see, the clEnqueueNDRangeKernel() call takes the most time when the DSP is working.

    My question is:

    Is there any way to improve the efficiency of the clEnqueueNDRangeKernel() call?

    Thanks.

  • Hi,
    The clEnqueueNDRangeKernel() call might have included the kernel loading time.

    - Please hoist the program build and load out of the time measurement. The program can be pre-loaded by calling a null kernel function from the same program; see the "null" example in the OpenCL User's Guide for details:
      downloads.ti.com/.../overview.html
    - Check whether the compiled kernel got software pipelined. Use "-k" to keep the generated assembly file.
    - If further kernel optimization is desired, please try double buffering with EDMA/async copies. See the "conv1d" example and the TI online documentation for details.

    Your Processor SDK is old; if you can, please migrate to the latest 5.3 release.

    Rex
  • Hi,
    I followed your suggestions; below are the results:

    1. OpenCL "null" example

    root@ok5718-idk:/home/forlinx/qt# ./null
    The OpenCL runtime will lazily load the device program upon the
    first enqueue of a kernel from the program, so the elapsed time
    overall from the first enqueue will be longer to account for the
    loading of the kernel. Also, subsequent enqueue's of a kernel will
    also potentially benefit from a warm cache on the device.

    Elapsed (with Load): 1008 usecs
    Elapsed (w/o Load): 286 usecs
    Null Kernel Exec : Queue to Submit: 2 us
    Null Kernel Exec : Submit to Start : 30 us
    Null Kernel Exec : Start to End : 214 us

    Elapsed (w/o Load): 279 usecs
    Null Kernel Exec : Queue to Submit: 2 us
    Null Kernel Exec : Submit to Start : 31 us
    Null Kernel Exec : Start to End : 210 us

    Elapsed (w/o Load): 269 usecs
    Null Kernel Exec : Queue to Submit: 2 us
    Null Kernel Exec : Submit to Start : 30 us
    Null Kernel Exec : Start to End : 202 us

    Elapsed (w/o Load): 280 usecs
    Null Kernel Exec : Queue to Submit: 1 us
    Null Kernel Exec : Submit to Start : 29 us
    Null Kernel Exec : Start to End : 215 us

    Elapsed (w/o Load): 277 usecs
    Null Kernel Exec : Queue to Submit: 1 us
    Null Kernel Exec : Submit to Start : 28 us
    Null Kernel Exec : Start to End : 214 us

    Done!



    2. Use “-k” to keep the generated assembly file.

    I compile my opencl kernel with this command: clocl -t -k cvtcolor.cl

    And I modified my code:
    #include "cvtcolor.dsp_h"
    Program::Binaries binary(1, std::make_pair(cvtcolor_dsp_bin,sizeof(cvtcolor_dsp_bin)));
    Program program = Program(context, devices, binary);
    program.build(devices);

    Then I executed it in the shell; below is the result:

    build kernel tdiff=1.538664 ms
    clEnqueueWriteBuffer tdiff=0.623989 ms
    clEnqueueNDRangeKernel tdiff=97.438472 ms
    clEnqueueReadBuffer tdiff=0.905565 ms
    clReleaseMemObject tdiff=0.007808 ms

    The clEnqueueNDRangeKernel() time did not change,
    so the root cause is not the kernel loading time.


    3. I found that the real time cost is in the kernel function __kernel void YUV2RGB_CMEM(__global const uchar* srcptr, __global uchar* dstptr).

    This function comes from the YUV2RGB_422() function in OpenCV's cvtcolor.cl:
    __kernel void YUV2RGB_422(__global const uchar* srcptr, int src_step, int src_offset,
    __global uchar* dstptr, int dst_step, int dst_offset,
    int rows, int cols)

    I copied it into my kernel function YUV2RGB_CMEM().
    Executing it on the DSP with floating-point operations really takes too much time.
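
    Since the floating-point math appears to be the bottleneck, one common workaround is a fixed-point integer version of the same per-pixel conversion. Below is a host-side C++ sketch of that idea using 16.16 fixed-point coefficients. The coefficient values are assumptions approximating the BT.601 constants in OpenCV's c_YUV2RGBCoeffs_420, and the function name is hypothetical; verify the constants against your OpenCV version before porting the math into the OpenCL kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumed fixed-point (16.16) equivalents of OpenCV's BT.601 coefficients,
// c_YUV2RGBCoeffs_420 ~= {1.164, 2.017, -0.391, -0.813, 1.596}.
static const int32_t CY  = (int32_t)(1.164f * 65536);   // luma scale
static const int32_t CUB = (int32_t)(2.017f * 65536);   // U -> blue
static const int32_t CUG = (int32_t)(-0.391f * 65536);  // U -> green
static const int32_t CVG = (int32_t)(-0.813f * 65536);  // V -> green
static const int32_t CVR = (int32_t)(1.596f * 65536);   // V -> red

static inline uint8_t sat_u8(int32_t v)
{
    return (uint8_t)std::min(std::max(v, (int32_t)0), (int32_t)255);
}

// Convert one YUYV macropixel (Y0 U Y1 V) to two BGR pixels, integer-only.
void yuyv2bgr_fixed(const uint8_t yuyv[4], uint8_t bgr[6])
{
    int32_t y0 = std::max(0, yuyv[0] - 16) * CY;
    int32_t y1 = std::max(0, yuyv[2] - 16) * CY;
    int32_t u  = yuyv[1] - 128;
    int32_t v  = yuyv[3] - 128;

    int32_t ruv = CVR * v + 32768;            // +0.5 in 16.16 for rounding
    int32_t guv = CVG * v + CUG * u + 32768;
    int32_t buv = CUB * u + 32768;

    bgr[0] = sat_u8((y0 + buv) >> 16);
    bgr[1] = sat_u8((y0 + guv) >> 16);
    bgr[2] = sat_u8((y0 + ruv) >> 16);
    bgr[3] = sat_u8((y1 + buv) >> 16);
    bgr[4] = sat_u8((y1 + guv) >> 16);
    bgr[5] = sat_u8((y1 + ruv) >> 16);
}
```

    This is only a sketch: any fixed-point port should be validated against cvtColor() output before replacing the float path in the kernel.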
  • Hi,
    I found a new issue with the OpenCL kernel function YUV2RGB_CMEM():

    The original YUV2RGB_CMEM() function is:

    __kernel void YUV2RGB_422_CMEM(__global const uchar* srcptr, __global uchar* dstptr)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);

        if (x < 320) {
            __global const uchar* src = srcptr + (y * 1280 + (x << 2));
            __global uchar* dst = dstptr + (y * 1920 + ((x << 1) * 3));

            if (y < 480) {
                __constant float* coeffs = c_YUV2RGBCoeffs_420;
                int load_src = *((__global int*) src);
                float vec_src[4] = { load_src & 0xff, (load_src >> 8) & 0xff, (load_src >> 16) & 0xff, (load_src >> 24) & 0xff };
                float U = vec_src[1] - 128;
                float V = vec_src[3] - 128;
                float y00 = max(0.f, vec_src[0] - 16.f) * coeffs[0];
                float y01 = max(0.f, vec_src[2] - 16.f) * coeffs[0];
                float ruv = ((coeffs[4] * V) + 0.5f);
                float guv = (coeffs[3] * V) + ((coeffs[2] * U) + 0.5f);
                float buv = ((coeffs[1] * U) + 0.5f);

                //////////////////////////////////////////////////////
                dst[2] = convert_uchar_sat(y00 + ruv);
                dst[1] = convert_uchar_sat(y00 + guv);
                dst[0] = convert_uchar_sat(y00 + buv);
                dst[5] = convert_uchar_sat(y01 + ruv);
                dst[4] = convert_uchar_sat(y01 + guv);
                dst[3] = convert_uchar_sat(y01 + buv);
                //////////////////////////////////////////////////////
            }
        }
    }

    With this opencl kernel, clEnqueueNDRangeKernel() cost time is:
    clEnqueueNDRangeKernel tdiff= 98.937283 ms

    I used the Linux top command in the shell and found that the CPU usage was 31.0%.

    Then I modified YUV2RGB_CMEM() as follows:

    __kernel void YUV2RGB_422_CMEM(__global const uchar* srcptr, __global uchar* dstptr)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);

        if (x < 320) {
            __global const uchar* src = srcptr + (y * 1280 + (x << 2));
            __global uchar* dst = dstptr + (y * 1920 + ((x << 1) * 3));

            if (y < 480) {
                __constant float* coeffs = c_YUV2RGBCoeffs_420;
                int load_src = *((__global int*) src);
                float vec_src[4] = { load_src & 0xff, (load_src >> 8) & 0xff, (load_src >> 16) & 0xff, (load_src >> 24) & 0xff };
                float U = vec_src[1] - 128;
                float V = vec_src[3] - 128;
                float y00 = max(0.f, vec_src[0] - 16.f) * coeffs[0];
                float y01 = max(0.f, vec_src[2] - 16.f) * coeffs[0];
                float ruv = ((coeffs[4] * V) + 0.5f);
                float guv = (coeffs[3] * V) + ((coeffs[2] * U) + 0.5f);
                float buv = ((coeffs[1] * U) + 0.5f);

                ////////////////////////////////////////////////////// modified start
                int y1, y2, y3, y4, y5, y6;
                y1 = (int)(y00 + ruv);
                y2 = (int)(y00 + guv);
                y3 = (int)(y00 + buv);
                y4 = (int)(y01 + ruv);
                y5 = (int)(y01 + guv);
                y6 = (int)(y01 + buv);

                dst[2] = (uchar)(y1 < 0 ? 0 : (y1 > 255 ? 255 : y1));
                dst[1] = (uchar)(y2 < 0 ? 0 : (y2 > 255 ? 255 : y2));
                dst[0] = (uchar)(y3 < 0 ? 0 : (y3 > 255 ? 255 : y3));
                dst[5] = (uchar)(y4 < 0 ? 0 : (y4 > 255 ? 255 : y4));
                dst[4] = (uchar)(y5 < 0 ? 0 : (y5 > 255 ? 255 : y5));
                dst[3] = (uchar)(y6 < 0 ? 0 : (y6 > 255 ? 255 : y6));
                ////////////////////////////////////////////////////// modified end
            }
        }
    }

    clEnqueueNDRangeKernel() cost time is:
    clEnqueueNDRangeKernel tdiff= 18.490971 ms.

    Then I used the top command again and found that the CPU usage was 58.4%.

    Although the time cost is reduced a lot, the DSP acceleration effect is still not obvious.

    I only expanded the function convert_uchar_sat() according to its definition in usr/share/ti/opencl/dsp.h, yet the performance of the two versions differs greatly.

    Could you explain the root cause of this? Thanks very much.
  • Hi,

    If you look at the generated assembly file, your kernel is still not software pipelined.
    1. Use the correct global work size in clEnqueueNDRangeKernel(), so that you can get rid of the if (y < 480) and if (x < 320) checks in the kernel.

    - Yuan
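
To illustrate that last point: each work-item in the kernels above handles one YUYV macropixel (two output pixels), so the NDRange can cover the image exactly and the in-kernel guards become dead code. A minimal host-side sketch (the 640x480 frame size is taken from this thread; the helper name is hypothetical):

```cpp
#include <cassert>
#include <cstddef>

// Each work-item processes one YUYV macropixel (2 output pixels), so the
// NDRange should cover exactly cols/2 x rows -- no in-kernel bounds checks.
void yuyv_global_size(int cols, int rows, std::size_t global[2])
{
    global[0] = (std::size_t)(cols / 2); // x dimension: macropixels per row
    global[1] = (std::size_t)rows;       // y dimension: image rows
}
```

With global = {320, 480} passed to clEnqueueNDRangeKernel() (instead of {CAMERA_WIDTH, CAMERA_HEIGHT} = {640, 480} as in the earlier host code, where half the work-items failed the x < 320 test and did nothing), the if (x < 320) and if (y < 480) branches can be deleted, which helps the compiler software-pipeline the kernel loop.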