Linux/OpenCL: Performance of OpenCL + OpenMP

jinhu wu

Intellectual 800 points

Other Parts Discussed in Thread: AM5728, DM3730

PSDK: 03.02.05

Kernel: 4.4.32

CPU: AM5728

I tested with the TI vecadd_openmp example.

Result for OpenCL + OpenMP (Dual Core)

./vecadd_openmp

DEVICE: TI Multicore C66 DSP

[core 0] i:0
[core 0] i:100
[core 0] i:200
[core 0] i:300
[core 0] i:400
[core 0] i:500
[core 0] i:600
[core 0] i:700
[core 0] i:800
[core 0] i:900
[core 0] i:1000
[core 0] i:1100
[core 0] i:1200
[core 0] i:1300
[core 0] i:1400
[core 0] i:1500
[core 0] i:1600
[core 0] i:1700
[core 0] i:1800
[core 0] i:1900
[core 1] i:4100
[core 0] i:2000
[core 1] i:4200
[core 0] i:2100
[core 1] i:4300
[core 0] i:2200
[core 1] i:4400
[core 0] i:2300
[core 1] i:4500
[core 0] i:2400
[core 1] i:4600
[core 0] i:2500
[core 1] i:4700
[core 0] i:2600
[core 1] i:4800
[core 0] i:2700
[core 1] i:4900
[core 0] i:2800
[core 1] i:5000
[core 0] i:2900
[core 1] i:5100
[core 0] i:3000
[core 1] i:5200
[core 0] i:3100
[core 1] i:5300
[core 0] i:3200
[core 1] i:5400
[core 0] i:3300
[core 1] i:5500
[core 0] i:3400
[core 1] i:5600
[core 0] i:3500
[core 1] i:5700
[core 0] i:3600
[core 1] i:5800
[core 0] i:3700
[core 1] i:5900
[core 0] i:3800
[core 1] i:6000
[core 0] i:3900
[core 1] i:6100
[core 0] i:4000
[core 1] i:6200
Write BufA : Queue to Submit: 23 us
Write BufA : Submit to Start : 59 us
Write BufA : Start to End : 145 us

Write BufB : Queue to Submit: 177 us
Write BufB : Submit to Start : 77 us
Write BufB : Start to End : 131 us

Kernel : Queue to Submit: 5 us
Kernel : Submit to Start : 134 us
Kernel : Start to End : 4823 us

Read BufDst : Queue to Submit: 4926 us
Read BufDst : Submit to Start : 245 us
Read BufDst : Start to End : 146 us

PASS!

Result for OpenCL + OpenMP End
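The interleaved [core 0] / [core 1] lines in the log above are consistent with the loop iterations being split into two contiguous halves. The sketch below illustrates that reading. It assumes a static schedule with near-equal chunks, the 8*1024-element input mentioned later in this thread, and a print on every 100th index; none of these is confirmed by the example's source, which is not shown here. `static_chunks` is a hypothetical helper, not part of any TI or OpenMP API.

```python
def static_chunks(n_items, n_threads):
    """Split range(n_items) into contiguous, near-equal chunks, one per thread."""
    base, extra = divmod(n_items, n_threads)
    chunks = []
    start = 0
    for t in range(n_threads):
        # The first `extra` threads take one additional item each.
        size = base + (1 if t < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


N = 8 * 1024  # assumed element count, taken from the follow-up post
for core, chunk in enumerate(static_chunks(N, 2)):
    # Assumption: the kernel prints only when i is a multiple of 100.
    printed = [i for i in chunk if i % 100 == 0]
    print(f"core {core}: i in [{chunk.start}, {chunk.stop}), "
          f"first print i:{printed[0]}, last print i:{printed[-1]}")
```

Under these assumptions core 0 covers i = 0..4095 and core 1 starts at i = 4096, which matches core 0's last line (i:4000) and core 1's first line (i:4100) in the log.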

Result for OpenCL only (with "#pragma omp parallel for" commented out) (Single Core)

root@am57xx-evm:/vecadd_openmp# ./vecadd_openmp
DEVICE: TI Multicore C66 DSP

[core 0] i:0
[core 0] i:100
[core 0] i:200
[core 0] i:300
[core 0] i:400
[core 0] i:500
[core 0] i:600
[core 0] i:700
[core 0] i:800
[core 0] i:900
[core 0] i:1000
[core 0] i:1100
[core 0] i:1200
[core 0] i:1300
[core 0] i:1400
[core 0] i:1500
[core 0] i:1600
[core 0] i:1700
[core 0] i:1800
[core 0] i:1900
[core 0] i:2000
[core 0] i:2100
[core 0] i:2200
[core 0] i:2300
[core 0] i:2400
[core 0] i:2500
[core 0] i:2600
[core 0] i:2700
[core 0] i:2800
[core 0] i:2900
[core 0] i:3000
[core 0] i:3100
[core 0] i:3200
[core 0] i:3300
[core 0] i:3400
[core 0] i:3500
[core 0] i:3600
[core 0] i:3700
[core 0] i:3800
[core 0] i:3900
[core 0] i:4000
[core 0] i:4100
[core 0] i:4200
[core 0] i:4300
[core 0] i:4400
[core 0] i:4500
[core 0] i:4600
[core 0] i:4700
[core 0] i:4800
[core 0] i:4900
[core 0] i:5000
[core 0] i:5100
[core 0] i:5200
[core 0] i:5300
[core 0] i:5400
[core 0] i:5500
[core 0] i:5600
[core 0] i:5700
[core 0] i:5800
[core 0] i:5900
[core 0] i:6000
[core 0] i:6100
[core 0] i:6200
[core 0] i:6300
[core 0] i:6400
[core 0] i:6500
[core 0] i:6600
[core 0] i:6700
[core 0] i:6800
[core 0] i:6900
[core 0] i:7000
[core 0] i:7100
[core 0] i:7200
[core 0] i:7300
[core 0] i:7400
[core 0] i:7500
[core 0] i:7600
[core 0] i:7700
[core 0] i:7800
[core 0] i:7900
[core 0] i:8000
[core 0] i:8100
Write BufA : Queue to Submit: 22 us
Write BufA : Submit to Start : 57 us
Write BufA : Start to End : 161 us

Write BufB : Queue to Submit: 192 us
Write BufB : Submit to Start : 176 us
Write BufB : Start to End : 127 us

Kernel : Queue to Submit: 3 us
Kernel : Submit to Start : 52 us
Kernel : Start to End : 6973 us

Read BufDst : Queue to Submit: 7000 us
Read BufDst : Submit to Start : 242 us
Read BufDst : Start to End : 162 us

PASS!
root@am57xx-evm:/vecadd_openmp#

Result for OpenCL only (with "#pragma omp parallel for" commented out) End

Why is the "Kernel : Start to End" time almost the same in both runs? I expected that with "#pragma omp parallel for" commented out, only one DSP core does the computation, so the time should be nearly twice that of the dual-core version.
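For reference, the gap between the two runs can be put in numbers. The short calculation below uses only the two "Kernel : Start to End" values from the logs above; the variable names are illustrative.

```python
# "Kernel : Start to End" values copied from the two logs above, in microseconds.
dual_core_us = 4823    # with "#pragma omp parallel for"
single_core_us = 6973  # with the pragma commented out

speedup = single_core_us / dual_core_us
time_saved = 1 - dual_core_us / single_core_us

print(f"speedup: {speedup:.2f}x (ideal for two cores: 2.00x)")
print(f"kernel time saved: {time_saved:.1%}")
```

So the dual-core run is measurably faster than the single-core run, but it falls well short of the 2x that the question expects.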

over 7 years ago

0 jinhu wu over 7 years ago

Intellectual 800 points

When I increase the number of data elements by a factor of 1024, i.e. from 8*1024 to 8*1024*1024:

Result for OpenCL + OpenMP (Dual Core): Kernel : Start to End : 199836 us
Result for OpenCL only (with "#pragma omp parallel for" commented out) (Single Core): Kernel : Start to End : 259696 us

The difference is only about 23%.

But when I move the code from vadd_openmp.c into vadd_wrapper.cl, Kernel : Start to End is 119738 us. (With enqueueTask, only one DSP core is active, right?)

Is this performance result correct?

I want to use OpenCL + OpenMP because of legacy DM3730 DSP code. Is there any way to improve the OpenMP performance?

Is pure OpenCL the highest-performance option for TI's dual-core DSP?
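The three kernel timings reported in this post can be compared side by side. The sketch below takes the single-core run as the baseline; the dictionary labels are shorthand introduced here, and the thread does not confirm how many DSP cores are active in the third configuration.

```python
# "Kernel : Start to End" for 8*1024*1024 elements, in microseconds, from this post.
timings_us = {
    "dual core, OpenCL + OpenMP": 199836,
    "single core, pragma commented out": 259696,
    "loop moved into vadd_wrapper.cl": 119738,
}

baseline_us = timings_us["single core, pragma commented out"]
speedups = {name: baseline_us / t for name, t in timings_us.items()}

for name, t in timings_us.items():
    saved = 1 - t / baseline_us
    print(f"{name}: {t} us, {speedups[name]:.2f}x vs. single core, "
          f"{saved:.1%} less kernel time")
```

On these numbers the OpenMP version is about 1.3x faster than the single-core baseline, while the version with the loop in the .cl file is a little over 2x faster.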

0 Biser Gatchev-XID over 7 years ago in reply to jinhu wu

TI__Guru 393215 points

The person who can answer this is currently travelling on business. We apologize for any delay in responding.

0 Rex Chang over 7 years ago in reply to jinhu wu

TI__Guru 50170 points

Hi, Jinhu,

The performance improvement won't be linear. There may be overhead involved.
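One common way to reason about non-linear scaling is Amdahl's law, which is brought in here as an illustration and is not something the reply itself cites. The sketch below solves it for the fraction of kernel time that would have to scale across cores to produce the speedups measured earlier in this thread. Treating all overhead as a fixed serial fraction is a simplification: thread start-up cost or memory-bandwidth contention may behave differently. The function names are hypothetical.

```python
def implied_parallel_fraction(speedup, n_cores):
    """Solve Amdahl's law, S = 1 / ((1 - p) + p / n), for the parallel fraction p."""
    return (1 - 1 / speedup) / (1 - 1 / n_cores)


def amdahl_speedup(p, n_cores):
    """Predicted speedup when a fraction p of the run time scales across n_cores."""
    return 1 / ((1 - p) + p / n_cores)


# Single-core and dual-core kernel times quoted earlier in the thread (microseconds).
for label, single_us, dual_us in [
    ("8*1024 elements", 6973, 4823),
    ("8*1024*1024 elements", 259696, 199836),
]:
    s = single_us / dual_us
    p = implied_parallel_fraction(s, 2)
    print(f"{label}: speedup {s:.2f}x, implied parallel fraction {p:.2f}")
```

Under this model, a speedup close to 2x on two cores would require nearly all of the kernel time to be parallelizable; the measured runs imply a much smaller fraction.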

To optimize, please refer to the Optimization Tips in the OpenCL User's Guide: downloads.ti.com/.../index.html

Rex

0 jinhu wu over 7 years ago in reply to Rex Chang

Intellectual 800 points

From my tests, I think the overhead of OpenCL + OpenMP is too high, and it is not suitable for TI's multicore DSP.

The best way is pure OpenCL, right?

0 Rex Chang over 7 years ago in reply to jinhu wu

TI__Guru 50170 points

Hi, Jinhu,

The numbers you got show a performance improvement using OpenCL + OpenMP, don't they? OpenCL allows offloading, and OpenMP provides the parallelism on the DSP. Whether the combination pays off really depends on the type of application.

Rex

0 jinhu wu over 7 years ago in reply to Rex Chang

Intellectual 800 points

But in my app, I found that when I use "#pragma omp parallel for", the processing frame rate (fps) drops and is worse than with only a single DSP core (by about 30%).

So I want to know which kinds of applications are recommended for OpenCL + OpenMP and which for pure OpenCL.

0 Rex Chang over 7 years ago in reply to jinhu wu

TI__Guru 50170 points

Hi, Jinhu,

That is a generic question for the open forum. I searched the internet and only found a comparison between OpenCL and OpenMP: stackoverflow.com/.../opencl-vs-openmp-performance

It is at your discretion to choose the one that best fits your application.

Rex
