float_compute benchmark

Tony Tang

According to the documentation (http://downloads.ti.com/mctools/esd/docs/opencl/examples/float_compute.html#compute-on-opencl-device), float_compute runs about 2x faster on the C66x than on the A15. Which platform produced that result? Do the DSP and the A15 run at the same frequency on it?

Sample Output

./float_compute

This example computes y[i] = M[i] * x[i] + C on single precision floating point arrays of size 2097152
- Computation on the ARM is parallelized across the A15s using OpenMP.
- Computation on the DSP is performed by dispatching an OpenCL NDRange kernel across the compute units (C66x cores) in the compute device.

Running.....

Average across 5 runs:
ARM (2 OpenMP threads)         : 0.012077 secs
DSP (OpenCL NDRange kernel)    : 0.005909 secs
OpenCL-DSP speedup             : 2.043985

For more information on:
  * TI's OpenCL product,

But when I run the test on the AM572x EVM I get the result below: the C66x is slower than the A15. What is the cause? Is it because the C66x runs at half the A15 frequency on AM57xx?

root@am57xx-evm:~/float_compute# ./float_compute

This example computes y[i] = M[i] * x[i] + C on single precision floating point arrays of size 2097152

- Computation on the ARM is parallelized across the A15s using OpenMP.

- Computation on the DSP is performed by dispatching an OpenCL NDRange kernel across the compute units (C66x cores) in the compute device.

Running[ 185.210646] omap_hwmod: mmu0_dsp2: _wait_target_disable failed

[ 185.216555] omap-iommu 41501000.mmu: 41501000.mmu: version 3.0

[ 185.222616] omap-iommu 41502000.mmu: 41502000.mmu: version 3.0

[ 185.238733] omap_hwmod: mmu0_dsp1: _wait_target_disable failed

[ 185.244636] omap-iommu 40d01000.mmu: 40d01000.mmu: version 3.0

[ 185.252979] omap-iommu 40d02000.mmu: 40d02000.mmu: version 3.0

.....

Average across 5 runs:

ARM (2 OpenMP threads) : 0.007032 secs

DSP (OpenCL NDRange kernel) : 0.008084 secs

OpenCL-DSP speedup : 0.869913

For more information on:

* TI's OpenCL product, http://software-dl.ti.com/mctools/esd/docs/opencl/index.html

over 9 years ago

Biser Gatchev-XID over 9 years ago

Hi,

I will ask the software team for clarification. They will respond here.

Tony Tang over 9 years ago in reply to Biser Gatchev-XID

Thanks Biser,

One more question: is there a version of this example that uses the A15 and C66x together through OpenCL, to take full advantage of the AM57xx device?

ran35366 over 9 years ago in reply to Tony Tang

If I understand correctly, the code is y[i] = x[i] * m[i] + c, where y, x, and m are 32-bit floating-point values. It seems to me that the bottleneck is data movement, not computation. The code is too "simple" to really take advantage of the DSP's power.
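A rough back-of-the-envelope check of this idea, as a sketch only. It uses the array size and the AM572x timings quoted earlier in this thread, and it assumes each element is read or written exactly once (two loads and one store per output, ignoring cache effects such as write-allocate traffic). The variable and function names are illustrative.

```python
# Figures taken from the AM572x run quoted in this thread.
N = 2097152              # elements per array
BYTES_PER_FLOAT = 4      # single precision
ARM_SECS = 0.007032      # ARM (2 OpenMP threads)
DSP_SECS = 0.008084      # DSP (OpenCL NDRange kernel)

# Per element: one multiply and one add; two loads (M[i], x[i]) and one store (y[i]).
# Assumption: every element crosses the memory interface exactly once.
flops = 2 * N
bytes_moved = 3 * N * BYTES_PER_FLOAT

# Arithmetic intensity: floating-point operations per byte of memory traffic.
intensity = flops / bytes_moved


def effective_bandwidth_gb_s(seconds):
    """Memory traffic divided by run time, in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9


print(f"bytes moved per run : {bytes_moved}")
print(f"arithmetic intensity: {intensity:.3f} flop/byte")
print(f"ARM effective rate  : {effective_bandwidth_gb_s(ARM_SECS):.2f} GB/s")
print(f"DSP effective rate  : {effective_bandwidth_gb_s(DSP_SECS):.2f} GB/s")
```

With only one sixth of a floating-point operation per byte moved, the loop does very little arithmetic for the amount of data it touches, which is consistent with a memory-bound kernel. The numbers do not prove it; the memory-placement measurements are still the real test.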

To see if my theory is correct do the following:

1. Run DSP-only code. Using CCS, place the data in L1 memory and time the run with the CCS timer or TSCL/TSCH. This shows how fast the DSP runs from L1.

2. Run DSP-only code. Using CCS, place the data in L2 memory and time the run with the CCS timer or TSCL/TSCH. This shows how fast the DSP runs from L2.

3. Next, place the data in OCMC memory and benchmark the DSP.

4. Last, place the data in DDR and benchmark the DSP.

This will give you insight into how much of the time is processing, how much is data movement, and how much is overhead associated with the OpenCL calls to the DSP code.
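The bookkeeping for these measurements could look like the sketch below. It is host-side arithmetic only, not DSP code: it joins TSCH and TSCL (the high and low 32-bit halves of the C6000 time stamp counter) into one cycle count, converts a cycle difference to seconds, and splits an OpenCL run into three parts. The function names, the `clock_hz` parameter, and the choice of the L2 run as the "pure compute" baseline are all assumptions for illustration. The three times must cover the same number of elements; the full 2097152-element arrays are far larger than the C66x L1 and L2 memories, so the on-chip runs would need a smaller size, scaled per element.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def combine_tsc(tsch, tscl):
    """Join the high (TSCH) and low (TSCL) 32-bit halves into one 64-bit count."""
    return ((tsch & MASK32) << 32) | (tscl & MASK32)


def elapsed_seconds(start_cycles, end_cycles, clock_hz):
    """Convert a cycle-count difference to seconds; tolerates 64-bit wraparound.

    clock_hz is the DSP clock of your board (an input, not a value from this thread).
    """
    delta = (end_cycles - start_cycles) & MASK64
    return delta / clock_hz


def breakdown(t_l2, t_ddr, t_opencl):
    """Split an OpenCL run into compute, data movement, and dispatch overhead.

    t_l2     : DSP-only time with data in L2 (assumed close to pure compute)
    t_ddr    : DSP-only time with data in DDR
    t_opencl : time for the same work dispatched through OpenCL
    """
    return {
        "compute": t_l2,
        "data_movement": t_ddr - t_l2,
        "opencl_overhead": t_opencl - t_ddr,
    }
```

A typical use would read TSCH/TSCL before and after the loop, pass both pairs through `combine_tsc`, and feed the resulting times for the L2, DDR, and OpenCL runs into `breakdown`.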

Please report back to the forum

Best Regards

Ran

Tony Tang over 9 years ago in reply to ran35366

Ran,

But the same code also runs on the ARM A15. I think the example is meant to show that the DSP's compute performance is better than the ARM's, yet on the AM57xx device the two come out about the same.

I guess the result in the link was measured on Keystone II, where the A15 and C66x run at almost the same frequency, so the DSP shows about 2x the A15's compute performance.

On AM57xx the DSP runs at half the A15 frequency, so the two give similar results. Based on this result, it does not seem to matter whether the algorithm runs on the ARM or the DSP.
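One way to sanity-check the frequency explanation is to normalize both reported speedups to equal clocks. This is a sketch under two assumptions stated in the post itself, not measured values: a DSP-to-A15 clock ratio of about 1.0 on the documentation platform and about 0.5 on AM57xx. It also assumes run time scales inversely with clock frequency, which ignores memory effects.

```python
# Speedups as printed by float_compute in this thread.
DOC_SPEEDUP = 2.043985       # documentation sample output
AM572X_SPEEDUP = 0.869913    # AM572x EVM run


def per_clock_speedup(speedup, dsp_to_arm_clock_ratio):
    """DSP-over-ARM speedup if both cores ran at the same clock.

    Assumes run time scales inversely with clock frequency.
    """
    return speedup / dsp_to_arm_clock_ratio


# Clock ratios are the poster's assumptions, not measured values.
doc_norm = per_clock_speedup(DOC_SPEEDUP, 1.0)
am57_norm = per_clock_speedup(AM572X_SPEEDUP, 0.5)

print(f"documentation run, per clock: {doc_norm:.2f}x")
print(f"AM572x run, per clock       : {am57_norm:.2f}x")
```

If the two normalized figures land close to each other, the clock difference accounts for most of the gap; whatever remains would have to come from something else, such as memory bandwidth or OpenCL dispatch overhead.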

That leads to another question: how can I take advantage of the A15 + C66x combination for a vision application?

ran35366 over 9 years ago in reply to Tony Tang

Tony

DSP execution is very efficient for complex code. You are right that on AM57x, because the DSP runs slower than the ARM, it is sometimes better to implement an application on the ARM rather than on the DSP.

For example, in TI's implementation of OpenCV, which is based on OpenCL, some functions are implemented only on the ARM and some on both the ARM and the DSP. (The developers benchmarked both cases and then decided which implementation to include in the library.)

You can look at the TI Design for OpenCL: http://www.ti.com/tool/TIDEP0046

It shows an implementation of a complex algorithm on the ARM and DSP for AM57x. The idea is to use the DSP for complex calculations while the ARM works on a different part of the algorithm at the same time. This may answer the question of how to use the ARM + C66x for a vision application. I am not familiar enough with vision algorithms to recommend an architecture, but you may be.
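To put a rough number on the "both working at the same time" idea for a data-parallel loop like float_compute, the sketch below splits the array so that both devices would finish together, using the standalone AM572x times from this thread. It is an idealized upper bound, not a prediction: it assumes the two devices do not slow each other down, which is optimistic if the loop is limited by shared DDR bandwidth. The function names are illustrative and are not part of TI's OpenCL product.

```python
N = 2097152          # array size from the benchmark
T_ARM = 0.007032     # ARM time for the whole array (AM572x run)
T_DSP = 0.008084     # DSP time for the whole array (AM572x run)


def balanced_split(n, t_arm, t_dsp):
    """Return (n_arm, n_dsp) so both devices finish at about the same time.

    t_arm and t_dsp are each device's time to process all n elements alone.
    """
    rate_arm = n / t_arm
    rate_dsp = n / t_dsp
    n_arm = round(n * rate_arm / (rate_arm + rate_dsp))
    return n_arm, n - n_arm


def ideal_combined_time(t_arm, t_dsp):
    """Best-case time with both devices running, assuming no interference."""
    return t_arm * t_dsp / (t_arm + t_dsp)


n_arm, n_dsp = balanced_split(N, T_ARM, T_DSP)
print(f"ARM share: {n_arm} elements, DSP share: {n_dsp} elements")
print(f"ideal combined time: {ideal_combined_time(T_ARM, T_DSP):.6f} secs")
```

For work that is not a simple element-wise loop, a pipeline where the ARM and DSP handle different stages would be the closer match to the approach described above.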

I hope it helps

Best Regards

Ran
