OMAP 35 floating point speed problem

Hello,

I am new to embedded systems, and especially to the OMAP35x. I am trying to run C++ image processing code on the OMAP35x EVM board. My problem is that it runs very slowly (compared with an Intel PC) when applying filters to floating-point grayscale images.

I am compiling with the CodeSourcery ARM GCC compiler, with the options -O3 -mtune=cortex-a8 -march=armv7-a -ftree-vectorize -funroll-loops -mfpu=neon -mfloat-abi=softfp.

Is there something special I should configure concerning memory access?

Is there a known benchmark of floating-point computation (preferably something similar to applying a filter to an image) that I can run on my board?

Is there a profiling tool or method that can tell me which lines of code are the most expensive? (I have gprof, but it only tells me which functions are the most expensive.)
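
As a starting point for the kind of benchmark asked about here, the following is a minimal sketch of a single-precision 3x3 mean filter with a timing wrapper. The function names `box_filter_3x3` and `filter_mpixels_per_sec` are made up for this illustration, the filter is only a stand-in for the real image code, and the sketch assumes a compiler with C++11 support for `<chrono>`.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Apply a 3x3 box (mean) filter to a single-precision grayscale image.
// Border pixels are copied unchanged to keep the sketch short.
void box_filter_3x3(const std::vector<float>& src, std::vector<float>& dst,
                    std::size_t width, std::size_t height)
{
    assert(src.size() == width * height);
    dst = src;  // copies the border; interior pixels are overwritten below
    if (width < 3 || height < 3) return;
    const float ninth = 1.0f / 9.0f;
    for (std::size_t y = 1; y + 1 < height; ++y) {
        const float* up  = &src[(y - 1) * width];
        const float* mid = &src[y * width];
        const float* dn  = &src[(y + 1) * width];
        float* out = &dst[y * width];
        for (std::size_t x = 1; x + 1 < width; ++x) {
            float sum = up[x - 1]  + up[x]  + up[x + 1]
                      + mid[x - 1] + mid[x] + mid[x + 1]
                      + dn[x - 1]  + dn[x]  + dn[x + 1];
            out[x] = sum * ninth;
        }
    }
}

// Run the filter `repeats` times on a constant test image and return the
// throughput in megapixels per second (hypothetical helper for this sketch).
double filter_mpixels_per_sec(std::size_t width, std::size_t height, int repeats)
{
    std::vector<float> src(width * height, 1.0f), dst;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i)
        box_filter_3x3(src, dst, width, height);
    auto t1 = std::chrono::steady_clock::now();
    volatile float sink = dst[0];  // keep the result observable
    (void)sink;
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return static_cast<double>(width) * height * repeats / secs / 1e6;
}
```

Calling `filter_mpixels_per_sec(640, 480, 100)` on both the board and the PC gives one number to compare per platform. When building with `-mfpu=neon`, it may also be worth checking the GCC documentation for that option, which notes that floating-point auto-vectorization for NEON is only performed when `-funsafe-math-optimizations` is also given.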

 

  • Hi David,

    Oprofile is a very good Linux profiler. It will profile your application and kernel. It currently is not included in the SDK, so you have to build it yourself.

    You will also need to enable Oprofile in the kernel.

    Does your application call floating-point functions in the math library (libm)? Is it single- or double-precision float?

     

     

  • Hi Jeff,

    Thanks for the information,

    Currently I don't have access to the kernel. When I do, I will try to install and enable Oprofile.

    My application is single precision, and I am using very simple arithmetic: add, multiply, compare.

    I suspect that I might have a problem with the cache. I saw some discussions in the support forums about enabling the L2 cache on OMAP. Do you know if the cache is enabled by default on the EVM?

    How can I check it? Can it be checked from user mode?
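
One way to probe the cache from user mode, without kernel access, is to time memory reads over working sets of increasing size and look for jumps in the cost per access. The sketch below assumes a 64-byte cache line and C++11 `<chrono>`; the name `ns_per_access` is hypothetical, and the buffer sizes to sweep depend on the cache sizes of the particular part.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Read one int per assumed 64-byte cache line from a buffer of `bytes`
// bytes, wrapping around until `total_accesses` reads have been made.
// Returns the average time per read in nanoseconds.
double ns_per_access(std::size_t bytes, std::size_t total_accesses)
{
    const std::size_t stride = 64 / sizeof(int);  // assumption: 64-byte lines
    std::vector<int> buf(bytes / sizeof(int), 1);
    assert(buf.size() >= stride && total_accesses > 0);

    long long sum = 0;
    std::size_t done = 0;
    auto t0 = std::chrono::steady_clock::now();
    while (done < total_accesses) {
        for (std::size_t i = 0; i < buf.size() && done < total_accesses;
             i += stride, ++done) {
            sum += buf[i];
        }
    }
    auto t1 = std::chrono::steady_clock::now();

    volatile long long sink = sum;  // prevent the loop from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count()
           / static_cast<double>(total_accesses);
}
```

Comparing, say, `ns_per_access(8 * 1024, 10000000)` with `ns_per_access(128 * 1024, 10000000)` and `ns_per_access(8 * 1024 * 1024, 10000000)` should show distinct cost levels if both cache levels are active; if the small and large buffers cost about the same, the caches may be disabled. A linear sweep like this can be flattered by prefetching, so a randomized pointer chase would be a more robust variant.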

  • Hi David,

    Cache should be enabled when you boot up. I am assuming you are using Linux?

    You can run some simple benchmarks to verify cache is enabled.  You can download a statically built Dhrystone here: https://gforge.ti.com/gf/project/am_benchmarks/frs/

    Dhrystone should return 1.9 DMIPS / MHz. The calculation is (Dhrystones per sec.) / 1757 / (CPU frequency in MHz). Please use at least 10000000 iterations.
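
The Dhrystone calculation above can be written as a short helper. This is only a sketch of the arithmetic; the function names are invented for the example.

```cpp
#include <cassert>

// Convert a raw Dhrystone score to DMIPS. The divisor 1757 is the one
// given in the formula above.
double dmips(double dhrystones_per_sec)
{
    return dhrystones_per_sec / 1757.0;
}

// Normalize by the CPU clock (in MHz) to get DMIPS / MHz.
double dmips_per_mhz(double dhrystones_per_sec, double cpu_mhz)
{
    assert(cpu_mhz > 0.0);
    return dmips(dhrystones_per_sec) / cpu_mhz;
}
```

For example, to reach 1.9 DMIPS / MHz at 600 MHz, the benchmark would need to report about 2,002,980 Dhrystones per second, since 1.9 x 1757 x 600 = 2,002,980.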

    Linpack is a simple floating point benchmark. You can use it to verify performance. It just does simple floating point (+, -, /, *).  You can find it here: http://www.netlib.org/benchmark/linpackc

    Here are the CFLAGS: -DUNROLL -DSP -O3 -mfloat-abi=softfp -march=armv7-a -c -fmessage-length=0

    Here are the results for OMAP35x:

    CPU MHz   KFlops (SP, unrolled)
    125       4308
    500       17234
    600       20927
    720       24415
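
To put these figures in per-clock terms, the small sketch below converts a clock speed and a Linpack score into CPU cycles per floating-point operation. The helper name `cycles_per_flop` is made up for this illustration.

```cpp
#include <cassert>

// Cycles spent per floating-point operation, given the CPU clock in MHz
// and a Linpack result in KFlops.
// cycles/s = mhz * 1e6, flops/s = kflops * 1e3, so the ratio is
// mhz * 1000 / kflops.
double cycles_per_flop(double cpu_mhz, double kflops)
{
    assert(cpu_mhz > 0.0 && kflops > 0.0);
    return cpu_mhz * 1000.0 / kflops;
}
```

Applied to the table, `cycles_per_flop(500, 17234)` is about 29.0, and the other three rows fall between roughly 28.7 and 29.5, so the results scale almost linearly with clock speed at around 29 cycles per floating-point operation.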

     

     

  • Hi Jeff

    I'm also seeing very poor floating-point performance. I'm running Ubuntu on the BeagleBoard xM, and I've compiled the Linpack program for both an Atom processor (my netbook) and the OMAP with the compile options you give above, using the free CodeSourcery toolchain. The netbook gives around 1,716,667 kflops (which sounds about right for the speed of the processor), but here's the output of linpackc on the BeagleBoard xM:

         norm. resid      resid           machep         x[0]-1        x[n-1]-1
           1.6        3.80277634e-05  1.19209290e-07 -1.38282776e-05 -7.51018524e-06
        times are reported for matrices of order   100
          dgefa      dgesl      total       kflops     unit      ratio
     times for array with leading dimension of  201
           0.02       0.00       0.02      29297       0.07       0.42
           0.02       0.00       0.02      43947       0.05       0.28
           0.02       0.00       0.02      43947       0.05       0.28
           0.01       0.00       0.01      46260       0.04       0.27
     times for array with leading dimension of 200
           0.02       0.00       0.02      43947       0.05       0.28
           0.02       0.00       0.02      43947       0.05       0.28
           0.02       0.00       0.02      43947       0.05       0.28
           0.02       0.00       0.02      41854       0.05       0.29

    So it looks like I'm getting about 43 MFlops from the OMAP, a little better than what you're seeing, but this is the 3730 on the xM, so that's plausible.

    To double-check, I tried setting the floating-point option to soft (rather than softfp), and sure enough the performance got worse by a factor of 4 or so. So floating-point instructions are evidently being generated, and maybe something else is drastically affecting things?

    There must be a way to get more out of this processor. Any ideas?

     

     

     

     

  • Sorry, I forgot to post my Dhrystone result:

    Microseconds for one run through Dhrystone:    0.4
    Dhrystones per Second:                      2500000.0