OMAP 35 floating point speed problem

David Reitons

Hello,

I am new to embedded systems, and especially to the OMAP 35. I am trying to run C++ image processing code on the OMAP 35x EVM board. My problem is that it is very slow compared to an Intel PC when applying filters to float grayscale images.

I am compiling with the CodeSourcery ARM GCC compiler, with the options -O3 -mtune=cortex-a8 -march=armv7-a -ftree-vectorize -funroll-loops -mfpu=neon -mfloat-abi=softfp.

Is there something special I should configure concerning memory access?

Is there a known benchmark of float computation (preferably something similar to applying a filter to an image) that I can run on my board?

Is there any profiling tool or method that can tell me which lines of code are the most expensive? (I have gprof, but it only reports which functions are the most expensive.)

over 14 years ago

0 Jeff L over 14 years ago

TI__Expert 5960 points

Hi David,

Oprofile is a very good Linux profiler. It will profile your application and kernel. It currently is not included in the SDK, so you have to build it yourself.

You will also need to enable Oprofile in the kernel.

Does your application call floating point functions in the math library - libm? Is it single or double precision float?

0 David Reitons over 14 years ago in reply to Jeff L

Prodigy 20 points

Hi Jeff,

Thanks for the information,

Currently I don't have access to the kernel. Once I do, I will try to build and enable Oprofile.

My application is single precision, and I am using very simple arithmetic: add, multiply, compare.

I suspect that I might have a problem with the cache. I saw some discussions in the support forums about enabling the L2 cache on OMAP. Do you know if the cache is enabled by default on the EVM?

How can I check it? Can it be checked from user mode?

0 Jeff L over 14 years ago in reply to David Reitons

TI__Expert 5960 points

Hi David,

Cache should be enabled when you boot up. I am assuming you are using Linux?

You can run some simple benchmarks to verify cache is enabled. You can download a statically built Dhrystone here: https://gforge.ti.com/gf/project/am_benchmarks/frs/

Dhrystone should return 1.9 DMIPS / MHz. The calculation is (Dhrystones per sec.) / 1757 / (CPU frequency in MHz). Please use at least 10000000 iterations.

Linpack is a simple floating point benchmark. You can use it to verify performance. It just does simple floating point (+, -, /, *). You can find it here: http://www.netlib.org/benchmark/linpackc

Here are the CFLAGS: -DUNROLL -DSP -O3 -mfloat-abi=softfp -march=armv7-a -c -fmessage-length=0

Here are the results for OMAP35x:

CPU MHz    KFlops (SP, unrolled)
125        4308
500        17234
600        20927
720        24415

0 ChickenDuck over 14 years ago in reply to Jeff L

Intellectual 780 points

Hi Jeff

I'm also seeing very poor floating point performance. I'm running Ubuntu on the Beagleboard xM, and I've compiled the Linpack program for both an Atom processor (my netbook) and the OMAP with the compile options you give above, using the free CodeSourcery toolchain. The netbook gives around 1,716,667 kflops (which sounds about right for the speed of the processor), but here's the output of linpackc on the Beagleboard xM:

     norm. resid      resid           machep         x[0]-1        x[n-1]-1
       1.6        3.80277634e-05 1.19209290e-07 -1.38282776e-05 -7.51018524e-06
    times are reported for matrices of order   100
      dgefa      dgesl      total       kflops     unit      ratio
times for array with leading dimension of 201
       0.02       0.00       0.02      29297       0.07       0.42
       0.02       0.00       0.02      43947       0.05       0.28
       0.02       0.00       0.02      43947       0.05       0.28
       0.01       0.00       0.01      46260       0.04       0.27
times for array with leading dimension of 200
       0.02       0.00       0.02      43947       0.05       0.28
       0.02       0.00       0.02      43947       0.05       0.28
       0.02       0.00       0.02      43947       0.05       0.28
       0.02       0.00       0.02      41854       0.05       0.29

So, it looks like I'm getting about 43 MFlops from the OMAP, somewhat better than the figures in your table, but then again this is the 3730 on the xM, so that's plausible.

To double check things, I tried setting the floating point option to soft (rather than softfp), and sure enough the performance got worse by a factor of 4 or so. So, it seems like floating point instructions are being generated, so maybe something else is drastically affecting things?

There must be a way to get more out of this processor...any ideas?

0 ChickenDuck over 14 years ago in reply to ChickenDuck

Intellectual 780 points

Sorry, forgot to post my Dhrystones result:

Microseconds for one run through Dhrystone: 0.4
Dhrystones per Second: 2500000.0
