AM572x OpenCL float_compute performance issue

Oleg Ostap

Other Parts Discussed in Thread: AM5728

The AM5728 SDK image includes an OpenCL example called float_compute, which should run about 2 times faster on the DSP through OpenCL than on the ARM, according to the TI OpenCL documentation: http://downloads.ti.com/mctools/esd/docs/opencl/examples/float_compute.html

This example computes y[i] = M[i] * x[i] + C on single precision floating point arrays of size 2097152
- Computation on the ARM is parallelized across the A15s using OpenMP.
- Computation on the DSP is performed by dispatching an OpenCL NDRange kernel across the compute units (C66x cores) in the compute device.
Running.....
Average across 5 runs:
ARM (2 OpenMP threads) : 0.012077 secs
DSP (OpenCL NDRange kernel) : 0.005909 secs

OpenCL-DSP speedup : 2.043985

What I see on my AM5728 with SDK 3.1.0 is:

Running[18615.976168] omap_hwmod: mmu0_dsp2: _wait_target_disable failed
[18615.982083] omap-iommu 41501000.mmu: 41501000.mmu: version 3.0
[18615.988175] omap-iommu 41502000.mmu: 41502000.mmu: version 3.0
[18616.003441] omap_hwmod: mmu0_dsp1: _wait_target_disable failed
[18616.009340] omap-iommu 40d01000.mmu: 40d01000.mmu: version 3.0
[18616.018610] omap-iommu 40d02000.mmu: 40d02000.mmu: version 3.0
.....

Average across 5 runs:
ARM (2 OpenMP threads) : 0.007770 secs
DSP (OpenCL NDRange kernel) : 0.007961 secs
OpenCL-DSP speedup : 0.975968

The run on the DSP is even slightly slower than on the ARM.
Any ideas why that is, and how I can get the expected speedup?

over 9 years ago

0 Yordan Kovachev over 9 years ago

TI__Guru**** 161600 points

Moving this to AM5728 device forum.

Best Regards,
Yordan

0 Biser Gatchev-XID over 9 years ago in reply to Yordan Kovachev

TI__Guru**** 393215 points

The software team have been notified. They will respond here.

0 BO over 9 years ago in reply to Biser Gatchev-XID

TI__Genius 12760 points

Biser,

Please see the email with background information that I sent you offline.

Regards,
Bernd

0 Biser Gatchev-XID over 9 years ago in reply to BO

TI__Guru**** 393215 points

The query has already been assigned. I have no influence over the software team schedules and priorities. If you do not receive a response within a few days, ping me here on this thread.

0 ran35366 over 9 years ago in reply to Biser Gatchev-XID

TI__Genius 12805 points

I am not sure what was sent offline, but when I look at the problem here is what I see:

Unless the system can partly hide the data movement (for example, by using DMA to move data while other computation is taking place), this is a data movement problem. Both the DSP cores and the ARM cores can perform a floating-point multiply and add in a single pipelined step, so the real bottleneck is moving data in and out of DDR: 2M 32-bit elements must be read for each of the two input vectors and 2M written for the output vector, and all of them must reside in DDR (or be generated on the fly).

Since this is a data movement problem and not a processing problem, I would not expect the DSP to be much faster, unless the ARM is busy doing something else at the same time.
So the advantage of OpenCL in this case is that the ARM can do something else while the DSP is busy running this (very simple) algorithm.
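
To put rough numbers on the data movement argument, here is a minimal Python sketch (not part of the SDK example) that estimates how many bytes one run has to move and what effective bandwidth each timing quoted in this thread implies. It assumes each of the three arrays is touched exactly once and ignores cache effects and any extra traffic from write allocation, so treat the results as a lower bound on real traffic. The function names are hypothetical and chosen only for this illustration.

```python
# Back-of-the-envelope estimate of memory traffic for y[i] = M[i] * x[i] + C.
# Array size and timings are the ones quoted in this thread; the traffic model
# (each array touched exactly once, no cache effects) is an assumption.

N = 2097152            # elements per array, from the example banner
BYTES_PER_FLOAT = 4    # single precision
ARRAYS_TOUCHED = 3     # read M, read x, write y (C is a scalar)
FLOPS_PER_ELEMENT = 2  # one multiply and one add


def bytes_moved(n=N):
    """Minimum number of bytes that must be read or written for one run."""
    return n * BYTES_PER_FLOAT * ARRAYS_TOUCHED


def effective_bandwidth_gb_s(seconds, n=N):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes) implied by a run time."""
    return bytes_moved(n) / seconds / 1e9


def flops_per_byte():
    """Arithmetic intensity: floating-point operations per byte moved."""
    return FLOPS_PER_ELEMENT / (BYTES_PER_FLOAT * ARRAYS_TOUCHED)


# Timings in seconds, copied from the outputs quoted earlier in the thread.
timings = {
    "documented ARM": 0.012077,
    "documented DSP": 0.005909,
    "measured ARM": 0.007770,
    "measured DSP": 0.007961,
}

print(f"bytes moved per run: {bytes_moved()}")
print(f"flops per byte:      {flops_per_byte():.3f}")
for label, seconds in timings.items():
    print(f"{label:15s}: {effective_bandwidth_gb_s(seconds):.2f} GB/s")
```

Under these assumptions the two measured times both work out to a little over 3 GB/s, which is what you would expect if both processors are waiting on the same memory rather than on arithmetic: only 2 floating-point operations are performed for every 12 bytes moved.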

Ran

0 Oleg Ostap over 9 years ago in reply to ran35366

Intellectual 260 points

Thank you for the clarification.

0 ran35366 over 9 years ago in reply to Oleg Ostap

TI__Genius 12805 points

Can you close the thread?

Thanks

Ran

0 Oleg Ostap over 9 years ago in reply to ran35366

Intellectual 260 points

Done.
