The AM5728 SDK image includes an OpenCL example called float_compute. According to the TI OpenCL documentation ( http://downloads.ti.com/mctools/esd/docs/opencl/examples/float_compute.html ), it should run about 2 times faster on the DSP through OpenCL than on the ARM. The documented output is:
This example computes y[i] = M[i] * x[i] + C on single precision floating point arrays of size 2097152
- Computation on the ARM is parallelized across the A15s using OpenMP.
- Computation on the DSP is performed by dispatching an OpenCL NDRange kernel across the compute units (C66x cores) in the compute device.
Running.....
Average across 5 runs:
ARM (2 OpenMP threads) : 0.012077 secs
DSP (OpenCL NDRange kernel) : 0.005909 secs
OpenCL-DSP speedup : 2.043985
What I see on my AM5728 with SDK 3.1.0 is:
This example computes y[i] = M[i] * x[i] + C on single precision floating point arrays of size 2097152
- Computation on the ARM is parallelized across the A15s using OpenMP.
- Computation on the DSP is performed by dispatching an OpenCL NDRange kernel across the compute units (C66x cores) in the compute device.
Running[18615.976168] omap_hwmod: mmu0_dsp2: _wait_target_disable failed
[18615.982083] omap-iommu 41501000.mmu: 41501000.mmu: version 3.0
[18615.988175] omap-iommu 41502000.mmu: 41502000.mmu: version 3.0
[18616.003441] omap_hwmod: mmu0_dsp1: _wait_target_disable failed
[18616.009340] omap-iommu 40d01000.mmu: 40d01000.mmu: version 3.0
[18616.018610] omap-iommu 40d02000.mmu: 40d02000.mmu: version 3.0
.....
Average across 5 runs:
ARM (2 OpenMP threads) : 0.007770 secs
DSP (OpenCL NDRange kernel) : 0.007961 secs
OpenCL-DSP speedup : 0.975968
Running on the DSP is even slightly slower than on the ARM.
Any ideas why that is? And how can I get the documented speedup?
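
For reference, the computation being timed can be sketched in plain C as below. This is only a minimal sketch under stated assumptions, not the TI example source: the names `float_compute_ref` and `speedup` are hypothetical, and the assumption that the printed speedup is ARM time divided by DSP time is inferred from the numbers above (0.012077 / 0.005909 is roughly 2.04, and 0.007770 / 0.007961 is roughly 0.98). The OpenMP pragma only takes effect when compiling with OpenMP enabled (for example `-fopenmp` with GCC); otherwise the loop runs serially.

```c
#include <stddef.h>

/* Hypothetical reference kernel (not the TI source): y[i] = M[i] * x[i] + C */
void float_compute_ref(const float *M, const float *x, float C,
                       float *y, size_t n)
{
    long i; /* signed index keeps the pragma valid on older OpenMP versions */

    /* Splits the iterations across the available cores (the two A15s here). */
#pragma omp parallel for
    for (i = 0; i < (long)n; ++i)
        y[i] = M[i] * x[i] + C;
}

/* Assumed definition of the reported figure: ARM time over DSP time.
 * A value below 1.0 means the DSP run was slower than the ARM run. */
double speedup(double arm_secs, double dsp_secs)
{
    return arm_secs / dsp_secs;
}
```

With the times from my run, `speedup(0.007770, 0.007961)` is below 1.0, which matches the reported result that the DSP path is not faster here.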