
BeagleBoard Performance Issue

Other Parts Discussed in Thread: OMAP3530

Hi all,

Recently I have been using the BeagleBoard to run an image processing application. I have two sets of code: version one is ANSI C, and version two is NEON code (i.e., I used NEON C intrinsics for the parallel tasks, not assembly instructions).

Both versions work fine on the BeagleBoard, and the output obtained is correct.

When I profile the C and NEON versions, a performance issue shows up. I used the gettimeofday() function available in <sys/time.h> to check the time taken to complete the task, and unfortunately the NEON and C versions show almost the same time.
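
The measurement is along these lines (a minimal sketch only; the saturating-add kernel here is just a placeholder standing in for my actual image processing code, the buffer size is arbitrary, and the file has to be built with -mfpu=neon so the intrinsics compile):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <arm_neon.h>

    /* Placeholder kernel in plain ANSI C: saturating add of two 8-bit images. */
    static void add_scalar(const uint8_t *a, const uint8_t *b, uint8_t *dst, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i++) {
            int s = a[i] + b[i];
            dst[i] = (s > 255) ? 255 : (uint8_t)s;
        }
    }

    /* The same kernel written with NEON C intrinsics, 16 pixels per iteration. */
    static void add_neon(const uint8_t *a, const uint8_t *b, uint8_t *dst, size_t n)
    {
        size_t i;
        for (i = 0; i + 16 <= n; i += 16) {
            uint8x16_t va = vld1q_u8(a + i);
            uint8x16_t vb = vld1q_u8(b + i);
            vst1q_u8(dst + i, vqaddq_u8(va, vb));   /* saturating add */
        }
        for (; i < n; i++) {                        /* leftover pixels */
            int s = a[i] + b[i];
            dst[i] = (s > 255) ? 255 : (uint8_t)s;
        }
    }

    /* Wall-clock seconds from gettimeofday(). */
    static double now_sec(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(void)
    {
        size_t n = 1024 * 1024;            /* placeholder image size */
        uint8_t *a = malloc(n), *b = malloc(n), *dst = malloc(n);
        double t0, t1, t2;

        memset(a, 100, n);
        memset(b, 200, n);

        t0 = now_sec();
        add_scalar(a, b, dst, n);
        t1 = now_sec();
        add_neon(a, b, dst, n);
        t2 = now_sec();

        printf("scalar: %f s, neon: %f s\n", t1 - t0, t2 - t1);
        free(a); free(b); free(dst);
        return 0;
    }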

I generated the assembly from the C files, and it does contain NEON instructions for the intrinsic part.

My question is: did I miss anything needed to enable the NEON core on the BeagleBoard?

Or is any other setup needed to enable the NEON core on the BeagleBoard?

Rgds

Dave

 

  • Audio Dave said:
    My question is: did I miss anything needed to enable the NEON core on the BeagleBoard?

    You do need NEON support in your OS to use the instructions; assuming you are using Linux, you have to enable it in your kernel build options as mentioned in this wiki article. I was under the impression that you would get a runtime failure, not just low performance, if NEON code were executed without NEON being enabled, so there may be something else going on here (a quick runtime check is sketched below).

    It could simply be that your code has a bottleneck elsewhere, for example if it is always waiting on memory transfers, to the point that faster processing does not improve the tangible performance.
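
    As a quick sanity check, something along these lines can be run on the board (a minimal sketch; it only looks for the "neon" flag in the Features line of /proc/cpuinfo, which the kernel advertises when its NEON support is in place):

        #include <stdio.h>
        #include <string.h>

        /* Print the CPU feature line and say whether the kernel advertises NEON. */
        int main(void)
        {
            char line[256];
            FILE *f = fopen("/proc/cpuinfo", "r");

            if (!f) {
                perror("/proc/cpuinfo");
                return 1;
            }
            while (fgets(line, sizeof line, f)) {
                if (strncmp(line, "Features", 8) == 0) {
                    printf("%s", line);
                    printf(strstr(line, " neon") ? "NEON reported by kernel\n"
                                                 : "NEON NOT reported by kernel\n");
                    break;
                }
            }
            fclose(f);
            return 0;
        }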

  • Bernie Thompson said:

    You do need NEON support in your OS to use the instructions; assuming you are using Linux, you have to enable it in your kernel build options as mentioned in this wiki article. I was under the impression that you would get a runtime failure, not just low performance, if NEON code were executed without NEON being enabled, so there may be something else going on here.

    It could simply be that your code has a bottleneck elsewhere, for example if it is always waiting on memory transfers, to the point that faster processing does not improve the tangible performance.

    The reason for my doubt:

    When the source is profiled in an RTSM emulation, the core function shows a significant cycle saving. When the same source is ported to the BeagleBoard, the difference does not show up.

    Because of this, I was under the impression that the NEON core is not enabled.

    But the OS kernel configuration files and the build output show that the NEON core is enabled successfully.

    So I trust the OS side and am putting my effort into the source code.

    If a bottleneck exists in the source code (as you mentioned earlier), then why does the RTSM emulator show much better performance and a saving in machine cycles?

     

    Rgds

    Dave

  • Audio Dave said:
    When the source is profiled in an RTSM emulation, the core function shows a significant cycle saving. When the same source is ported to the BeagleBoard, the difference does not show up.

    Do you have any specific figures as to the difference in performance? I am somewhat curious how far off the hardware is from the simulator you are using.

    Audio Dave said:
    If a bottleneck exists in the source code (as you mentioned earlier), then why does the RTSM emulator show much better performance and a saving in machine cycles?

    I have never used RTSM so I could not say for sure, but my first thought would be how the simulator models external memory and possibly caching. If your code does a large number of external memory accesses that happen to thrash the cache, but the simulator treats external memory with the same performance properties as internal memory, you could see a dramatic performance difference between the simulator and the actual hardware. Essentially this would mean you are limited by the available memory bandwidth more than by the available CPU cycles (a rough bandwidth check is sketched at the end of this reply).

    Another possibility is how you are using the simulator: if you run your code directly on the simulator, you have taken out any potential OS overhead, which may also impact performance.

    These are just some possible explanations. In general, if you have NEON enabled in the kernel, you should be able to use the NEON instructions seamlessly; at least this seems to have worked for other developers in the past.
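
    One rough way to test the memory-bandwidth theory on the board (a minimal sketch; the buffer size is an assumption and should be comparable to your image working set, well beyond the Cortex-A8's 256 KB L2): time a plain memcpy over the buffer and compare it with the runtime of your image kernel. If the two are close, the code is essentially waiting on memory and NEON has little room to help.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/time.h>

        /* Wall-clock seconds from gettimeofday(). */
        static double now_sec(void)
        {
            struct timeval tv;
            gettimeofday(&tv, NULL);
            return tv.tv_sec + tv.tv_usec * 1e-6;
        }

        int main(void)
        {
            size_t n = 8 * 1024 * 1024;        /* placeholder working-set size */
            char *src = malloc(n), *dst = malloc(n);
            double t0, t1;

            memset(src, 1, n);                 /* touch the pages first */
            memset(dst, 0, n);

            t0 = now_sec();
            memcpy(dst, src, n);
            t1 = now_sec();

            printf("memcpy of %lu bytes: %f s (%.1f MB/s)\n",
                   (unsigned long)n, t1 - t0, (n / (1024.0 * 1024.0)) / (t1 - t0));
            free(src);
            free(dst);
            return 0;
        }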

  • Hi,

    Please see my toolchain below.

    I have two sets of source code:

    a) C Only

    b) C + NEON C intrinsics

    Both sets of code are compiled with the following toolchain settings:

    ABC=arm-none-linux-gnueabi-gcc
    PQR=-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp -flax-vector-conversions

    The resulting binaries are run on the BeagleBoard, which is running an Angstrom Linux kernel.

    The OS kernel is built with OMAP3530 support enabled.

    I get almost the same time to complete the image viewer application with the two versions of the code.

    That is, the NEON part does not seem to be executing properly...

    Is there any problem with my toolchain?

  • Audio Dave said:
    Is there any problem with my toolchain?

    Your options seem reasonable, except for -flax-vector-conversions, which I am not very familiar with; it may not be a problem, but it seems risky to use if you are working with potentially unstable code. You may want to try -ftree-vectorize, as mentioned here, to see if you can squeeze any cycles out of the compiler's auto-vectorization capability; a sketch of the kind of loop it handles follows below.
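
    For what it is worth, the loops the auto-vectorizer tends to handle are simple element-wise passes with no overlapping pointers and no data-dependent branches, for example (a minimal sketch; add_u16 is only an illustration, and whether it actually vectorizes depends on the GCC version, the optimization level, and the -mfpu=neon flag):

        #include <stddef.h>
        #include <stdint.h>

        /* Element-wise add of two 16-bit buffers.  The __restrict qualifiers tell
         * the compiler the arrays do not overlap, which it needs before it will
         * vectorize the loop; check the generated assembly for vld1/vadd to see
         * whether NEON code was actually emitted. */
        void add_u16(uint16_t *__restrict dst,
                     const uint16_t *__restrict a,
                     const uint16_t *__restrict b,
                     size_t n)
        {
            size_t i;
            for (i = 0; i < n; i++)
                dst[i] = a[i] + b[i];
        }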

  • Another suggestion is to post your question to the list at beagleboard.org. The community there have been doing a lot of NEON work and might be able to help.

    Steve K.