This thread has been locked.

Linux/PROCESSOR-SDK-AM57X: OpenCL+DSP accelerate

Part Number: PROCESSOR-SDK-AM57X
Other Parts Discussed in Thread: AM5718

Tool/software: Linux

Hi TI,

I am doing OpenCV + DSP acceleration on an AM5718 board. The SDK version is ti-processor-sdk-linux-rt-am57xx-evm-04.01.00.06.

I found that converting a UMat to a Mat takes too much time.

Here is my test code and result:

code:

int main(int argc, char **argv)
{
    unsigned char *yuv_raw_data = NULL;
    int c = -1;
    struct timespec tp0, tp1, tp2, tp3, tp4, tp5;

    cameraOpenDevice(1);

    Mat mat_yuv(CAMERA_HEIGHT, CAMERA_WIDTH, CV_8UC2);
    Mat mat_rgb(CAMERA_HEIGHT, CAMERA_WIDTH, CV_8UC3);

    UMat umat_yuv(CAMERA_HEIGHT, CAMERA_WIDTH, CV_8UC2);
    UMat umat_rgb(CAMERA_HEIGHT, CAMERA_WIDTH, CV_8UC3);

    while (1) {
        clock_gettime(CLOCK_MONOTONIC, &tp0);
        DqBuffer(&yuv_raw_data); // YUV raw data from V4L2 camera
        QBuffer();
        memcpy(mat_yuv.data, yuv_raw_data, CAMERA_WIDTH * CAMERA_HEIGHT * 2);

        clock_gettime(CLOCK_MONOTONIC, &tp1);
        mat_yuv.copyTo(umat_yuv);

        clock_gettime(CLOCK_MONOTONIC, &tp2);
        cvtColor(umat_yuv, umat_rgb, CV_YUV2BGR_YUYV);

        clock_gettime(CLOCK_MONOTONIC, &tp3);
        umat_rgb.copyTo(mat_rgb);

        clock_gettime(CLOCK_MONOTONIC, &tp4);
        imshow("frame", mat_rgb);

        clock_gettime(CLOCK_MONOTONIC, &tp5);

        printf("get yuv Mat tdiff=%lf ms \n", tdiff_calc(tp0, tp1));
        printf("Mat2UMat tdiff=%lf ms \n", tdiff_calc(tp1, tp2));
        printf("cvtColor_YUV2BGR tdiff=%lf ms \n", tdiff_calc(tp2, tp3));
        printf("UMat2Mat tdiff=%lf ms \n", tdiff_calc(tp3, tp4));
        printf("imshow tdiff=%lf ms \n", tdiff_calc(tp4, tp5));

        c = waitKey(1);
        if (c == 27 || c == 'q' || c == 'Q')
            break;
    }

    cameraCloseDevice();
    return 0;
}
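
The tdiff_calc() helper is not shown in the post; a minimal implementation consistent with the calls above (the body is an assumption, not the poster's actual code) would be:

```cpp
#include <cassert>
#include <time.h>

// Assumed implementation of the tdiff_calc() helper used in the timing code
// above: elapsed time between two CLOCK_MONOTONIC samples, in milliseconds.
double tdiff_calc(struct timespec start, struct timespec end)
{
    return (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_nsec - start.tv_nsec) / 1000000.0;
}
```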

Then I compile it into an executable binary: dsp_accelerate_opencv_opencl

result:

root@ok5718-idk:/home/forlinx/qt# on_dsp_accelerate.sh

get yuv Mat tdiff=5.017779 ms
Mat2UMat tdiff=0.672464 ms
cvtColor_YUV2BGR tdiff=0.165757 ms
UMat2Mat tdiff=151.863810 ms
imshow tdiff=1.787381 ms

root@ok5718-idk:/home/forlinx/qt# off_dsp_accelerate.sh

get yuv Mat tdiff=0.509635 ms
Mat2UMat tdiff=0.479541 ms
cvtColor_YUV2BGR tdiff=11.217494 ms
UMat2Mat tdiff=5.070482 ms
imshow tdiff=2.764032 ms

root@ok5718-idk:/home/forlinx/qt# cat on_dsp_accelerate.sh
export TI_OCL_CACHE_KERNELS=Y
export OPENCV_OPENCL_DEVICE='TI AM57:ACCELERATOR:TI Multicore C66 DSP'
echo "OpenCL on"
./dsp_accelerate_opencv_opencl

root@ok5718-idk:/home/forlinx/qt# cat off_dsp_accelerate.sh
export OPENCV_OPENCL_DEVICE='disabled'
echo "OpenCL off"
./dsp_accelerate_opencv_opencl

The result shows that converting a UMat to a Mat takes too much time when DSP acceleration is enabled.

I found the TI docs at http://software-dl.ti.com/processor-sdk-linux/esd/docs/latest/linux/Foundational_Components_OpenCV.html#alternative-approach-to-add-new-opencl-kernels-at-opencv-application-level :

It says the UMat conversion cost includes the waiting time (for the DSP to finish) plus the actual write and any format conversion (done on the CPU). It also depends a lot on the data types used, and on whether floating-point operations are involved. This can be accelerated if a DSP-optimized implementation of remap() is created.

So, my question is: how do I optimize remap() with DSP acceleration?

I read the chapters from "Creating OpenCL C kernel optimized for C66 core" to "Alternative approach to add new OpenCL kernels at OpenCV application level", but I could not find a way to do this.

Please help me, thank you.

  • Hi,

    Please take a look at the 3rd bullet in the OpenCV FAQ:
    software-dl.ti.com/.../Foundational_Components_OpenCV.html

    If it answers your question, please click "Resolved".

    Rex
  • Hi, 

    Thanks for your timely reply.

    I have read the chapter: software-dl.ti.com/.../Foundational_Components_OpenCV.html .

    Then I improved my test code.

    I use CMEM to share memory between the DSP and the A15, and I write the OpenCL kernel code inside my OpenCV application.

    I compile it into an executable binary: dsp_accelerate_opencv_opencl.

    Then I found the root cause of the time cost when doing OpenCL + DSP acceleration.

    Below is my improved code:

    main function:

    int main(int argc, char **argv)
    {
        unsigned char *raw_yuv = NULL;
        int c = -1;
        struct timespec tp0, tp1, tp2, tp3;

        void *cmem_src  = __malloc_ddr(CAMERA_HEIGHT * CAMERA_WIDTH * 2);
        void *cmem_dest = __malloc_ddr(CAMERA_HEIGHT * CAMERA_WIDTH * 3);
        void *cmem_out  = __malloc_ddr(CAMERA_HEIGHT * CAMERA_WIDTH * 3);

        cameraOpenDevice(1);

        while (1) {
            clock_gettime(CLOCK_MONOTONIC, &tp0);

            DqBuffer(&raw_yuv);
            QBuffer();

            Mat mat_yuv(CAMERA_HEIGHT, CAMERA_WIDTH, CV_8UC2);
            memcpy(mat_yuv.data, raw_yuv, CAMERA_HEIGHT * CAMERA_WIDTH * 2);

            clock_gettime(CLOCK_MONOTONIC, &tp1);

            Mat mat_bgr(CAMERA_HEIGHT, CAMERA_WIDTH, CV_8UC3, cmem_out);
            ProcRawClCmem(mat_yuv, "cvtcolor.cl", cmem_src, cmem_dest, mat_bgr);

            clock_gettime(CLOCK_MONOTONIC, &tp2);

            imshow("frame", mat_bgr);
            clock_gettime(CLOCK_MONOTONIC, &tp3);

            printf("get raw_yuv Mat tdiff=%lf ms \n", tdiff_calc(tp0, tp1));
            printf("ProcRawClCmem tdiff=%lf ms \n", tdiff_calc(tp1, tp2));
            printf("imshow tdiff=%lf ms \n", tdiff_calc(tp2, tp3));

            c = waitKey(1);
            if (c == 27 || c == 'q' || c == 'Q')
                break;
        }

        cameraCloseDevice();

        __free_ddr(cmem_src);
        __free_ddr(cmem_dest);
        __free_ddr(cmem_out);
        return 0;
    }


    void ProcRawClCmem(Mat &mat_in, const std::string &kernel_name, void *cmem_src, void *cmem_dest, Mat &mat_out)
    {
        int err;
        struct timespec tp0, tp1, tp2, tp3, tp4, tp5, tp6, tp7;

        Mat mat_src(CAMERA_HEIGHT, CAMERA_WIDTH, CV_8UC2, cmem_src);
        Mat mat_dest(CAMERA_HEIGHT, CAMERA_WIDTH, CV_8UC3, cmem_dest);

        clock_gettime(CLOCK_MONOTONIC, &tp0);
        mat_in.copyTo(mat_src);
        clock_gettime(CLOCK_MONOTONIC, &tp1);

        Context context(CL_DEVICE_TYPE_ACCELERATOR);
        std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

        ifstream ifs(kernel_name);
        if (ifs.fail()) {
            cout << "opencl file is not available..." << endl;
        }
        std::string kernelStr((istreambuf_iterator<char>(ifs)), istreambuf_iterator<char>());

        Program::Sources source(1, std::make_pair(kernelStr.c_str(), kernelStr.length()));
        Program program = Program(context, source);
        program.build(devices);

        Kernel kernel(program, "YUV2RGB_CMEM");
        CommandQueue queue(context, devices[0], CL_QUEUE_PROFILING_ENABLE);
        clock_gettime(CLOCK_MONOTONIC, &tp2);

        cl_mem bufSrc = clCreateBuffer(context(), CL_MEM_READ_ONLY  | CL_MEM_USE_HOST_PTR, BUF_SRC,  cmem_src,  &err);
        cl_mem bufDst = clCreateBuffer(context(), CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, BUF_DEST, cmem_dest, &err);

        cl_event ev0;
        err = clEnqueueWriteBuffer(queue(), bufSrc, CL_TRUE, 0, BUF_SRC, mat_src.data, 0, NULL, &ev0);
        clWaitForEvents(1, &ev0);
        clock_gettime(CLOCK_MONOTONIC, &tp3);

        err = clSetKernelArg(kernel(), 0, sizeof(cl_mem), &bufSrc);
        err = clSetKernelArg(kernel(), 1, sizeof(cl_mem), &bufDst);

        cl_event ev1;
        std::size_t globalWorkSize[2] = { (std::size_t)CAMERA_WIDTH, (std::size_t)CAMERA_HEIGHT };
        err = clEnqueueNDRangeKernel(queue(), kernel(), 2, NULL, globalWorkSize, NULL, 0, NULL, &ev1);
        clWaitForEvents(1, &ev1);
        clock_gettime(CLOCK_MONOTONIC, &tp4);

        cl_event ev2;
        err = clEnqueueReadBuffer(queue(), bufDst, CL_TRUE, 0, BUF_DEST, mat_dest.data, 0, NULL, &ev2);
        clWaitForEvents(1, &ev2);
        clock_gettime(CLOCK_MONOTONIC, &tp5);

        err = clReleaseMemObject(bufSrc);
        err = clReleaseMemObject(bufDst);
        clock_gettime(CLOCK_MONOTONIC, &tp6);

        mat_dest.copyTo(mat_out);
        clock_gettime(CLOCK_MONOTONIC, &tp7);

        printf("Mat2Mat tdiff=%lf ms \n", tdiff_calc(tp0, tp1));                //  0.472872 ms
        printf("build kernel tdiff=%lf ms \n", tdiff_calc(tp1, tp2));           //  3.456178 ms
        printf("clEnqueueWriteBuffer tdiff=%lf ms \n", tdiff_calc(tp2, tp3));   //  0.608536 ms
        printf("clEnqueueNDRangeKernel tdiff=%lf ms \n", tdiff_calc(tp3, tp4)); // 97.881413 ms
        printf("clEnqueueReadBuffer tdiff=%lf ms \n", tdiff_calc(tp4, tp5));    //  0.871893 ms
        printf("clReleaseMemObject tdiff=%lf ms \n", tdiff_calc(tp5, tp6));     //  0.011387 ms
        printf("Mat2Mat tdiff=%lf ms \n", tdiff_calc(tp6, tp7));                //  1.283602 ms
    }

    cvtcolor.cl:

    __kernel void YUV2RGB_CMEM(__global const uchar* srcptr, __global uchar* dstptr)
    {
        int src_offset = 0;
        int dst_offset = 0;
        int src_step = 640 * 2;
        int dst_step = 640 * 3;
        int rows = 480;
        int cols = 640;

        int x = get_global_id(0);
        int y = get_global_id(1);

        if (x < cols / 2)
        {
            __global const uchar* src = srcptr + mad24(y, src_step, (x << 2) + src_offset);
            __global uchar* dst = dstptr + mad24(y, dst_step, mad24(x << 1, 3, dst_offset));

            #pragma unroll
            for (int cy = 0; cy < 1; ++cy)
            {
                if (y < rows)
                {
                    __constant float* coeffs = c_YUV2RGBCoeffs_420;
                    int load_src = *((__global int*) src);
                    float vec_src[4] = { load_src & 0xff, (load_src >> 8) & 0xff, (load_src >> 16) & 0xff, (load_src >> 24) & 0xff };
                    float U = vec_src[1] - 128;
                    float V = vec_src[3] - 128;
                    float y00 = max(0.f, vec_src[0] - 16.f) * coeffs[0];
                    float y01 = max(0.f, vec_src[2] - 16.f) * coeffs[0];

                    float ruv = ((coeffs[4] * V) + 0.5f);
                    float guv = (coeffs[3] * V) + ((coeffs[2] * U) + 0.5f);
                    float buv = ((coeffs[1] * U) + 0.5f);

                    dst[2] = convert_uchar_sat(y00 + ruv);
                    dst[1] = convert_uchar_sat(y00 + guv);
                    dst[0] = convert_uchar_sat(y00 + buv);
                    dst[5] = convert_uchar_sat(y01 + ruv);
                    dst[4] = convert_uchar_sat(y01 + guv);
                    dst[3] = convert_uchar_sat(y01 + buv);
                }
                ++y;
                src += src_step;
                dst += dst_step;
            }
        }
    }

    result:

    root@ok5718-idk:/home/forlinx/qt# on_dsp_accelerate.sh

    get raw_yuv Mat tdiff=4.396880 ms
    ProcRawClCmem tdiff=106.702501 ms
    imshow tdiff=1.879939 ms


    Mat2Mat tdiff=0.455792 ms
    build kernel tdiff=8.894614 ms
    clEnqueueWriteBuffer tdiff=1.317600 ms
    clEnqueueNDRangeKernel tdiff=98.243997 ms
    clEnqueueReadBuffer tdiff=1.948584 ms
    clReleaseMemObject tdiff=0.004392 ms
    Mat2Mat tdiff=1.185027 ms

    root@ok5718-idk:/home/forlinx/qt# cat on_dsp_accelerate.sh
    export TI_OCL_CACHE_KERNELS=Y
    export OPENCV_OPENCL_DEVICE='TI AM57:ACCELERATOR:TI Multicore C66 DSP'
    echo "OpenCL on"
    ./dsp_accelerate_opencv_opencl

    As you can see, the clEnqueueNDRangeKernel() call takes the most time when the DSP is working.

    My question is:

    Is there any way to improve the efficiency of the clEnqueueNDRangeKernel() call?

    Thanks.

  • Hi,
    The clEnqueueNDRangeKernel() call might have included the kernel loading time.

    - Please hoist the program build and load out of the time measurement. The program can be pre-loaded by calling a null kernel function from the same program; see the "null" example in the OpenCL User's Guide for details:
      downloads.ti.com/.../overview.html
    - Check whether the compiled kernel got software pipelined. Use "-k" to keep the generated assembly file.
    - If further kernel optimization is desired, please try double buffering with EDMA/async copies. See the "conv1d" example and the TI online documentation for details.

    Your Processor SDK is old; if you can, please migrate to the latest 5.3 release.

    Rex
  • Hi,
    I followed your suggestions; below are the results:

    1. OpenCL "null" example

    root@ok5718-idk:/home/forlinx/qt# ./null
    The OpenCL runtime will lazily load the device program upon the
    first enqueue of a kernel from the program, so the elapsed time
    overall from the first enqueue will be longer to account for the
    loading of the kernel. Also, subsequent enqueue's of a kernel will
    also potentially benefit from a warm cache on the device.

    Elapsed (with Load): 1008 usecs
    Elapsed (w/o Load): 286 usecs
    Null Kernel Exec : Queue to Submit: 2 us
    Null Kernel Exec : Submit to Start : 30 us
    Null Kernel Exec : Start to End : 214 us

    Elapsed (w/o Load): 279 usecs
    Null Kernel Exec : Queue to Submit: 2 us
    Null Kernel Exec : Submit to Start : 31 us
    Null Kernel Exec : Start to End : 210 us

    Elapsed (w/o Load): 269 usecs
    Null Kernel Exec : Queue to Submit: 2 us
    Null Kernel Exec : Submit to Start : 30 us
    Null Kernel Exec : Start to End : 202 us

    Elapsed (w/o Load): 280 usecs
    Null Kernel Exec : Queue to Submit: 1 us
    Null Kernel Exec : Submit to Start : 29 us
    Null Kernel Exec : Start to End : 215 us

    Elapsed (w/o Load): 277 usecs
    Null Kernel Exec : Queue to Submit: 1 us
    Null Kernel Exec : Submit to Start : 28 us
    Null Kernel Exec : Start to End : 214 us

    Done!



    2. Use “-k” to keep the generated assembly file.

    I compile my opencl kernel with this command: clocl -t -k cvtcolor.cl

    And I modified my code:
    #include "cvtcolor.dsp_h"
    Program::Binaries binary(1, std::make_pair(cvtcolor_dsp_bin,sizeof(cvtcolor_dsp_bin)));
    Program program = Program(context, devices, binary);
    program.build(devices);

    Then I executed it in the shell; below is the result:

    build kernel tdiff=1.538664 ms
    clEnqueueWriteBuffer tdiff=0.623989 ms
    clEnqueueNDRangeKernel tdiff=97.438472 ms
    clEnqueueReadBuffer tdiff=0.905565 ms
    clReleaseMemObject tdiff=0.007808 ms

    The clEnqueueNDRangeKernel() time did not change,
    so the root cause is not the kernel loading time.


    3. I found that the real time cost is in the kernel function __kernel void YUV2RGB_CMEM(__global const uchar* srcptr, __global uchar* dstptr).

    This function comes from the YUV2RGB_422() function in OpenCV's cvtcolor.cl:
    __kernel void YUV2RGB_422(__global const uchar* srcptr, int src_step, int src_offset,
    __global uchar* dstptr, int dst_step, int dst_offset,
    int rows, int cols)

    I copied it into my kernel function YUV2RGB_CMEM().
    Executing it on the DSP with floating-point operations really takes too much time.
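
    Since the floating-point math appears to be the bottleneck, one common workaround is a fixed-point integer version of the same per-pixel conversion. Below is a host-side C++ sketch of that idea using 16.16 fixed-point coefficients. The coefficient values are assumptions approximating the BT.601 constants in OpenCV's c_YUV2RGBCoeffs_420, and the function name is hypothetical; verify the constants against your OpenCV version before porting the math into the OpenCL kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumed fixed-point (16.16) equivalents of OpenCV's BT.601 coefficients,
// c_YUV2RGBCoeffs_420 ~= {1.164, 2.017, -0.391, -0.813, 1.596}.
static const int32_t CY  = (int32_t)(1.164f * 65536);   // luma scale
static const int32_t CUB = (int32_t)(2.017f * 65536);   // U -> blue
static const int32_t CUG = (int32_t)(-0.391f * 65536);  // U -> green
static const int32_t CVG = (int32_t)(-0.813f * 65536);  // V -> green
static const int32_t CVR = (int32_t)(1.596f * 65536);   // V -> red

static inline uint8_t sat_u8(int32_t v)
{
    return (uint8_t)std::min(std::max(v, (int32_t)0), (int32_t)255);
}

// Convert one YUYV macropixel (Y0 U Y1 V) to two BGR pixels, integer-only.
void yuyv2bgr_fixed(const uint8_t yuyv[4], uint8_t bgr[6])
{
    int32_t y0 = std::max(0, yuyv[0] - 16) * CY;
    int32_t y1 = std::max(0, yuyv[2] - 16) * CY;
    int32_t u  = yuyv[1] - 128;
    int32_t v  = yuyv[3] - 128;

    int32_t ruv = CVR * v + 32768;            // +0.5 in 16.16 for rounding
    int32_t guv = CVG * v + CUG * u + 32768;
    int32_t buv = CUB * u + 32768;

    bgr[0] = sat_u8((y0 + buv) >> 16);
    bgr[1] = sat_u8((y0 + guv) >> 16);
    bgr[2] = sat_u8((y0 + ruv) >> 16);
    bgr[3] = sat_u8((y1 + buv) >> 16);
    bgr[4] = sat_u8((y1 + guv) >> 16);
    bgr[5] = sat_u8((y1 + ruv) >> 16);
}
```

    This is only a sketch: any fixed-point port should be validated against cvtColor() output before replacing the float path in the kernel.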
  • Hi,
    I found a new issue with the OpenCL kernel function YUV2RGB_CMEM():

    The original YUV2RGB_CMEM() function is:

    __kernel void YUV2RGB_422_CMEM(__global const uchar* srcptr, __global uchar* dstptr)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);

        if (x < 320) {
            __global const uchar* src = srcptr + (y * 1280 + (x << 2));
            __global uchar* dst = dstptr + (y * 1920 + ((x << 1) * 3));

            if (y < 480) {
                __constant float* coeffs = c_YUV2RGBCoeffs_420;
                int load_src = *((__global int*) src);
                float vec_src[4] = { load_src & 0xff, (load_src >> 8) & 0xff, (load_src >> 16) & 0xff, (load_src >> 24) & 0xff };
                float U = vec_src[1] - 128;
                float V = vec_src[3] - 128;
                float y00 = max(0.f, vec_src[0] - 16.f) * coeffs[0];
                float y01 = max(0.f, vec_src[2] - 16.f) * coeffs[0];
                float ruv = ((coeffs[4] * V) + 0.5f);
                float guv = (coeffs[3] * V) + ((coeffs[2] * U) + 0.5f);
                float buv = ((coeffs[1] * U) + 0.5f);

                //////////////////////////////////////////////////////
                dst[2] = convert_uchar_sat(y00 + ruv);
                dst[1] = convert_uchar_sat(y00 + guv);
                dst[0] = convert_uchar_sat(y00 + buv);
                dst[5] = convert_uchar_sat(y01 + ruv);
                dst[4] = convert_uchar_sat(y01 + guv);
                dst[3] = convert_uchar_sat(y01 + buv);
                //////////////////////////////////////////////////////
            }
        }
    }

    With this opencl kernel, clEnqueueNDRangeKernel() cost time is:
    clEnqueueNDRangeKernel tdiff= 98.937283 ms

    I used the Linux top command in the shell and found that the CPU usage was 31.0%.

    Then I modified YUV2RGB_CMEM() as follows:

    __kernel void YUV2RGB_422_CMEM(__global const uchar* srcptr, __global uchar* dstptr)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);

        if (x < 320) {
            __global const uchar* src = srcptr + (y * 1280 + (x << 2));
            __global uchar* dst = dstptr + (y * 1920 + ((x << 1) * 3));

            if (y < 480) {
                __constant float* coeffs = c_YUV2RGBCoeffs_420;
                int load_src = *((__global int*) src);
                float vec_src[4] = { load_src & 0xff, (load_src >> 8) & 0xff, (load_src >> 16) & 0xff, (load_src >> 24) & 0xff };
                float U = vec_src[1] - 128;
                float V = vec_src[3] - 128;
                float y00 = max(0.f, vec_src[0] - 16.f) * coeffs[0];
                float y01 = max(0.f, vec_src[2] - 16.f) * coeffs[0];
                float ruv = ((coeffs[4] * V) + 0.5f);
                float guv = (coeffs[3] * V) + ((coeffs[2] * U) + 0.5f);
                float buv = ((coeffs[1] * U) + 0.5f);

                ////////////////////////////////////////////////////// modified start
                int y1, y2, y3, y4, y5, y6;
                y1 = (int)(y00 + ruv);
                y2 = (int)(y00 + guv);
                y3 = (int)(y00 + buv);
                y4 = (int)(y01 + ruv);
                y5 = (int)(y01 + guv);
                y6 = (int)(y01 + buv);

                dst[2] = (uchar)(y1 < 0 ? 0 : (y1 > 255 ? 255 : y1));
                dst[1] = (uchar)(y2 < 0 ? 0 : (y2 > 255 ? 255 : y2));
                dst[0] = (uchar)(y3 < 0 ? 0 : (y3 > 255 ? 255 : y3));
                dst[5] = (uchar)(y4 < 0 ? 0 : (y4 > 255 ? 255 : y4));
                dst[4] = (uchar)(y5 < 0 ? 0 : (y5 > 255 ? 255 : y5));
                dst[3] = (uchar)(y6 < 0 ? 0 : (y6 > 255 ? 255 : y6));
                ////////////////////////////////////////////////////// modified end
            }
        }
    }

    clEnqueueNDRangeKernel() cost time is:
    clEnqueueNDRangeKernel tdiff= 18.490971 ms.

    Then I used the top command again and found that the CPU usage was 58.4%.

    Although the time cost is reduced a lot, the DSP acceleration effect is still not obvious.

    I only expanded the function convert_uchar_sat() according to its definition in usr/share/ti/opencl/dsp.h, yet the performance of the two versions differs greatly.

    Could you explain the root cause of this? Thanks very much.
  • Hi,

    If you look at the generated assembly file, your kernel is still not software pipelined.
    1. Use the correct global work size in clEnqueueNDRangeKernel(), so that you can get rid of the if (y < 480) and if (x < 320) checks in the kernel.

    - Yuan
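
To illustrate that last point: each work-item in the kernels above handles one YUYV macropixel (two output pixels), so the NDRange can cover the image exactly and the in-kernel guards become dead code. A minimal host-side sketch (the 640x480 frame size is taken from this thread; the helper name is hypothetical):

```cpp
#include <cassert>
#include <cstddef>

// Each work-item processes one YUYV macropixel (2 output pixels), so the
// NDRange should cover exactly cols/2 x rows -- no in-kernel bounds checks.
void yuyv_global_size(int cols, int rows, std::size_t global[2])
{
    global[0] = (std::size_t)(cols / 2); // x dimension: macropixels per row
    global[1] = (std::size_t)rows;       // y dimension: image rows
}
```

With global = {320, 480} passed to clEnqueueNDRangeKernel() (instead of {CAMERA_WIDTH, CAMERA_HEIGHT} = {640, 480} as in the earlier host code, where half the work-items failed the x < 320 test and did nothing), the if (x < 320) and if (y < 480) branches can be deleted, which helps the compiler software-pipeline the kernel loop.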