According to the float_compute example page (http://downloads.ti.com/mctools/esd/docs/opencl/examples/float_compute.html#compute-on-opencl-device), float_compute on the C66x is about 2x faster than on the A15. On which platform was that result measured? Do the DSP and the A15 run at the same frequency on that platform?
Sample Output
./float_compute
This example computes y[i] = M[i] * x[i] + C on single precision floating point arrays of size 2097152
- Computation on the ARM is parallelized across the A15s using OpenMP.
- Computation on the DSP is performed by dispatching an OpenCL NDRange kernel across the compute units (C66x cores) in the compute device.
Running.....
Average across 5 runs:
ARM (2 OpenMP threads) : 0.012077 secs
DSP (OpenCL NDRange kernel) : 0.005909 secs
OpenCL-DSP speedup : 2.043985
For more information on:
* TI's OpenCL product, 
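For reference, the computation being timed is a plain elementwise multiply-add. The following is only a sketch in Python, not TI's source: the function names and the two-way split are assumptions, chosen to mirror how the index range would be divided across two A15 threads or two C66x cores.

```python
def multiply_add(m, x, c):
    """Reference version: y[i] = M[i] * x[i] + C for every element."""
    return [mi * xi + c for mi, xi in zip(m, x)]


def multiply_add_chunked(m, x, c, workers=2):
    """Same result, computed chunk by chunk.

    Each chunk stands in for the share of the index range that one
    worker (an OpenMP thread or a DSP core) would process. The chunks
    run sequentially here; this only illustrates the partitioning.
    """
    n = len(m)
    # Ceiling division, with a floor of 1 so an empty input is handled.
    chunk = max(1, (n + workers - 1) // workers)
    y = []
    for start in range(0, n, chunk):
        end = min(start + chunk, n)
        y.extend(multiply_add(m[start:end], x[start:end], c))
    return y


m = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [10.0, 20.0, 30.0, 40.0, 50.0]
# The chunked version must agree with the reference version.
print(multiply_add_chunked(m, x, 0.5) == multiply_add(m, x, 0.5))
```

Each element is read once and written once with a single multiply-add in between, so the workload does very little arithmetic per byte of memory traffic.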
However, when I run the same test on the AM572x EVM, I get the result below: the C66x is slower than the A15. What is causing this? Is it because the C66x frequency is half that of the A15 on AM57xx?
root@am57xx-evm:~/float_compute# ./float_compute
This example computes y[i] = M[i] * x[i] + C on single precision floating point arrays of size 2097152
- Computation on the ARM is parallelized across the A15s using OpenMP.
- Computation on the DSP is performed by dispatching an OpenCL NDRange kernel across the compute units (C66x cores) in the compute device.
Running[ 185.210646] omap_hwmod: mmu0_dsp2: _wait_target_disable failed
[ 185.216555] omap-iommu 41501000.mmu: 41501000.mmu: version 3.0
[ 185.222616] omap-iommu 41502000.mmu: 41502000.mmu: version 3.0
[ 185.238733] omap_hwmod: mmu0_dsp1: _wait_target_disable failed
[ 185.244636] omap-iommu 40d01000.mmu: 40d01000.mmu: version 3.0
[ 185.252979] omap-iommu 40d02000.mmu: 40d02000.mmu: version 3.0
.....
Average across 5 runs:
ARM (2 OpenMP threads) : 0.007032 secs
DSP (OpenCL NDRange kernel) : 0.008084 secs
OpenCL-DSP speedup : 0.869913
For more information on:
* TI's OpenCL product, http://software-dl.ti.com/mctools/esd/docs/opencl/index.html
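To see which side changed between the two runs, the averages can be compared directly. This is a small sketch that uses only the numbers from the two outputs in this post; the dictionary keys and variable names are my own. The speedup is taken as ARM time divided by DSP time, which is consistent with the printed values, although the ratios differ slightly from the printed ones because the printed times are rounded.

```python
# Average times in seconds, copied from the two outputs in this post.
doc = {"arm": 0.012077, "dsp": 0.005909}  # documentation sample output
evm = {"arm": 0.007032, "dsp": 0.008084}  # AM572x EVM run


def speedup(times):
    """DSP speedup over ARM, assumed to be ARM time / DSP time."""
    return times["arm"] / times["dsp"]


doc_speedup = speedup(doc)
evm_speedup = speedup(evm)

# How much each side changed between the documentation run and the EVM run.
arm_ratio = doc["arm"] / evm["arm"]  # > 1 means ARM is faster on the EVM
dsp_ratio = evm["dsp"] / doc["dsp"]  # > 1 means DSP is slower on the EVM

print(f"speedup (documentation): {doc_speedup:.3f}")
print(f"speedup (AM572x EVM):    {evm_speedup:.3f}")
print(f"ARM faster on EVM by:    {arm_ratio:.2f}x")
print(f"DSP slower on EVM by:    {dsp_ratio:.2f}x")
```

On these numbers, the ARM side of the EVM run is roughly 1.7x faster than in the documentation and the DSP side is roughly 1.4x slower, so both sides contribute to the lower speedup.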