Trying to get Neon optimization working for benchmarking OMAP3530 (gcc/linux)

Hi All,

 

   I've been trying to sort this out for a couple of days now and I'm beaten. I've got a simple suite of benchmarks called nbench ( http://www.tux.org/~mayer/linux/bmark.html ) that I'm running on the OMAP3EVM so as to compare it with some other potential processors for a new product. However, I cannot seem to get it to make use of the Neon coprocessor! I'm using GCC 4.3.2 (CodeSourcery 2008q3).

 

These are my current compiler flags:

CFLAGS = -s -save-temps -static -Wall -O3 -march=armv7-a -mtune=cortex-a8 -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon -ftree-vectorize -fomit-frame-pointer -ffast-math

 

   The annoying thing is that, in one function in one file, I do a float * float multiply and there is a vmul.f32 in the generated assembler. However for other float * float multiplies in the same file it has just used fmul!

 

   I'm sure this chip has more floating-point power than it's demonstrating at the moment, but if I can't demonstrate it then we can't really take it seriously. Can anyone offer any hints as to where I'm going wrong? Does GCC only optimize certain types of multiply or certain types of variables?

 

Thanks in advance for any assistance,

--

Olly

  • With regard to NEON, have you enabled it in the kernel? See http://tiexpressdsp.com/wiki/index.php?title=FAQ_OMAP35x_Linux_PSP

    [edit] I also checked the compile options you are using and they appear to be correct.  Let us know if you get better performance numbers...

  • Also, I thought you might be interested in the following forum post: http://community.ti.com/forums/p/3230/11914.aspx#11914

  • Hi Juan, thanks for the reply.

     

    Yes, I have NEON on in the kernel - the app compiled with these options runs fine, and the floating point operations are somewhat quicker than with just -O3. However, the overall result for floating point is still roughly half the performance of an AMD K6-233! I would imagine that the OMAP should give the K6 a sound thrashing, yes?

     

    Benchmark results with just -O3:

    BYTEmark* Native Mode Benchmark ver. 2 (10/95)
    Index-split by Andrew D. Balsa (11/97)
    Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

    TEST                : Iterations/sec.  : Old Index   : New Index
                        :                  : Pentium 90* : AMD K6/233*
    --------------------:------------------:-------------:------------
    NUMERIC SORT        :          239.92  :       6.15  :       2.02
    STRING SORT         :          30.619  :      13.68  :       2.12
    BITFIELD            :      7.8572e+07  :      13.48  :       2.82
    FP EMULATION        :          59.552  :      28.58  :       6.59
    FOURIER             :          241.25  :       0.27  :       0.15
    ASSIGNMENT          :          3.5729  :      13.60  :       3.53
    IDEA                :          475.25  :       7.27  :       2.16
    HUFFMAN             :          321.73  :       8.92  :       2.85
    NEURAL NET          :         0.35361  :       0.57  :       0.24
    LU DECOMPOSITION    :          10.798  :       0.56  :       0.40
    ==========================ORIGINAL BYTEMARK RESULTS==========================
    INTEGER INDEX       : 11.619
    FLOATING-POINT INDEX: 0.443
    Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
    ==============================LINUX DATA BELOW===============================
    CPU                 :
    L2 Cache            :
    OS                  : Linux 2.6.22.18-omap3
    C compiler          : arm-none-linux-gnueabi-gcc
    libc                : static
    MEMORY INDEX        : 2.760
    INTEGER INDEX       : 3.009
    FLOATING-POINT INDEX: 0.246
    Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

    Benchmark results with all the options above:

    BYTEmark* Native Mode Benchmark ver. 2 (10/95)
    Index-split by Andrew D. Balsa (11/97)
    Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

    TEST                : Iterations/sec.  : Old Index   : New Index
                        :                  : Pentium 90* : AMD K6/233*
    --------------------:------------------:-------------:------------
    NUMERIC SORT        :          283.14  :       7.26  :       2.38
    STRING SORT         :          26.628  :      11.90  :       1.84
    BITFIELD            :      8.6004e+07  :      14.75  :       3.08
    FP EMULATION        :          57.834  :      27.75  :       6.40
    FOURIER             :          239.59  :       0.27  :       0.15
    ASSIGNMENT          :          3.8164  :      14.52  :       3.77
    IDEA                :          644.03  :       9.85  :       2.92
    HUFFMAN             :          315.83  :       8.76  :       2.80
    NEURAL NET          :         0.76717  :       1.23  :       0.52
    LU DECOMPOSITION    :          41.756  :       2.16  :       1.56
    ==========================ORIGINAL BYTEMARK RESULTS==========================
    INTEGER INDEX       : 12.370
    FLOATING-POINT INDEX: 0.899
    Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
    ==============================LINUX DATA BELOW===============================
    CPU                 :
    L2 Cache            :
    OS                  : Linux 2.6.22.18-omap3
    C compiler          : arm-none-linux-gnueabi-gcc
    libc                : static
    MEMORY INDEX        : 2.775
    INTEGER INDEX       : 3.343
    FLOATING-POINT INDEX: 0.499
    Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

    Notice that the string sort has actually been made slower by these optimisations!

     

    Cheers,

    --

    Olly

  • Forgot to mention, it did occur to me that this benchmark is using some functions from the math library, which (I'm guessing) won't have been compiled with all these optimisations. So, I made a start on making my own small math library so I could compile it with the same options. However, having made one function (pow, which is full of float*float multiplies) that would compile, I had a look at the assembler output and there weren't any vmul.f32 instructions in there...

     

    Cheers,

    --

    Olly
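One plausible reason a hand-written pow shows no vmul.f32 is that a scalar power routine is a chain of dependent multiplies, so there is nothing to run in parallel lanes. The sketch below contrasts that with a batched form whose inner loop works on independent array elements. The names `powi_f32` and `powi_f32_batch` are made up for this illustration, the code assumes C99 (for `restrict`), and whether GCC 4.3 actually vectorises the batched loop has not been verified here.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar integer power by square-and-multiply (hypothetical helper).
 * Each multiply needs the result of the previous one, so the work is
 * inherently serial and offers nothing to a SIMD unit. */
float powi_f32(float base, unsigned exp)
{
    float result = 1.0f;
    while (exp != 0u) {
        if (exp & 1u)
            result *= base;
        base *= base;
        exp >>= 1;
    }
    return result;
}

/* Batched form (hypothetical helper): the same exponent applied to n
 * independent bases. The inner loop is unit-stride, single precision,
 * and free of calls, which is the shape an auto-vectoriser looks for.
 * It uses plain repeated multiplication to keep the sketch short. */
void powi_f32_batch(float *restrict out, const float *restrict in,
                    unsigned exp, size_t n)
{
    size_t i;
    unsigned e;

    assert(out != NULL && in != NULL);
    for (i = 0; i < n; ++i)
        out[i] = 1.0f;
    for (e = 0; e < exp; ++e)
        for (i = 0; i < n; ++i)
            out[i] *= in[i];
}
```

Comparing the -save-temps assembler for the two functions should show whether the compiler treats them differently.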

  • My understanding is that many rules must be followed in your C code for CodeSourcery tools to generate NEON instructions and therefore, I suspect these instructions are not being generated.  Have you checked the resulting assembly code...?   For performance reasons, using intrinsics may be the way to go. 

  • Regarding the intrinsics Juan mentions, they are documented at http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html. Using these you would most likely see large performance gains. Unfortunately the compiler is not always good at leveraging NEON on its own; at the moment you have to massage the code so that it maps cleanly onto NEON.

    If the code is not leveraging NEON and is using the VFP instead, it will appear relatively slow, as you have seen.
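As a rough illustration of the intrinsics route, here is a minimal sketch of an element-wise single-precision multiply. The function name `mul_f32` and the `HAVE_NEON` macro are made up for this example; `vld1q_f32`, `vmulq_f32` and `vst1q_f32` come from `arm_neon.h`. A plain C path is included so the same file also builds for a target without NEON, which is handy for comparisons.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: dst[i] = a[i] * b[i] for i in [0, n). */
void mul_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && a != NULL && b != NULL);
#if HAVE_NEON
    /* Four single-precision lanes per iteration. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vmulq_f32(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; ++i)
        dst[i] = a[i] * b[i];
}
```

Built with -mfpu=neon -mfloat-abi=softfp as in the original post, the vector loop is the part that should appear as vmul.f32 on q registers in the assembler output.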

  • Juan Gonzales said:

    My understanding is that many rules must be followed in your C code for CodeSourcery tools to generate NEON instructions 

     

    Does anyone know what these rules might be? I think at least if I know the rules I can make the decision whether to tweak the C code to enable the compiler to optimise, or (shudder) start inlining assembler for the whole math library...

     

    Thanks,

    --

    Olly

  • Hi,

    My name is Felipe; I'm a Brazilian engineer working with the OMAP 3530. We are developing a Linux application but are getting poor performance. First we used the CodeSourcery compiler (2007q1) with Linux 2.6.22.18 and could not get the expected results. Next we tried Linux 2.6.27 with the same compiler, but the results were the same. Finally we changed to the CodeSourcery compiler (2008q3), but then Linux does not boot.

    To compare the results of the two compilers we are trying the benchmark program you mention, but with CodeSourcery (2007q1) the program stops at the ASSIGNMENT line (after FOURIER).

    We think the problems with the benchmark program and with the compiled Linux kernel are in the flags. We are using the following flags:
    KBUILD_CFLAGS   := -mlittle-endian -fno-strict-aliasing -fno-common -Os -marm -fno-omit-frame-pointer -mapcs -mno-sched-prolog -mabi=aapcs-linux -mno-thumb-interwork -march=armv5t -Wa, -march=armv7a -msoft-float -Uarm -fno-optimize-sibling-calls -fno-stack-protector

    Is this approach correct? Is there any help you can provide?

    Thanks,

    Felipe Elias

  • Hi Felipe,

     

    I think you may be running into a problem that I had with nbench, where it is trying to work in 64-bit due to the makefile detecting the memory width of the build machine rather than the target. Try changing the original makefile instruction for pointer.h (around line 132) to this:

     

    # compiler running on 64-bit system for cross compile, target is 32!
    pointer.h:
        touch pointer.h

     

    Remove any existing pointer.h and rebuild, and hopefully it will no longer get stuck.

     

    Good luck, hope you find something that speeds things up!

    --

    Olly

     

    [edit: spelling]
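The underlying issue with the pointer.h step is that a width measured on the build machine need not match the target. As a sketch of one way to sidestep that (this does not reproduce nbench's own pointer.c, and the names `TARGET_LONG_BITS` and `long_bits` are invented here), the width can be derived from `<limits.h>` by whichever compiler builds the file, so a cross compiler reports the target's value:

```c
#include <assert.h>
#include <limits.h>

/* Evaluated by the compiler that builds this file, so with a cross
 * compiler it reflects the target rather than the build machine. */
#if ULONG_MAX > 0xFFFFFFFFUL
#define TARGET_LONG_BITS 64
#else
#define TARGET_LONG_BITS 32
#endif

/* Hypothetical helper: width of long in bits as seen at run time. */
int long_bits(void)
{
    return (int)(sizeof(long) * CHAR_BIT);
}

/* Sanity check that the compile-time and run-time views agree. */
void check_long_width(void)
{
    assert(long_bits() == TARGET_LONG_BITS);
}
```

With arm-none-linux-gnueabi-gcc this should select the 32-bit branch regardless of the host doing the build.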

  • Have you guys seen the following wiki regarding NEON compiler options and how to write code for NEON: http://tiexpressdsp.com/index.php?title=Cortex_A8

     

  • Funkster,

    I've tried to compile nbench once again and this time it works; the problem is that I get lower values with my processor. The modification I made to the Makefile was the following:

    pointer.h:
        touch pointer.h

    did not work (make reported a missing separator), so I tried:

    pointer.h: pointer Makefile
        $(CC) $(MACHINE) $(DEFINES) $(CFLAGS)\
            -o pointer pointer.c
        rm -f pointer.h
        touch pointer.h

    I get the same behaviour with:

    pointer.h: pointer Makefile
        touch pointer.h

    Is that correct?

    The performance of the 3530 is described below:


    BYTEmark* Native Mode Benchmark ver. 2 (10/95)
    Index-split by Andrew D. Balsa (11/97)
    Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

    TEST                : Iterations/sec.  : Old Index   : New Index
                        :                  : Pentium 90* : AMD K6/233*
    --------------------:------------------:-------------:------------
    NUMERIC SORT        :           50.16  :       1.29  :       0.42
    STRING SORT         :          13.718  :       6.13  :       0.95
    BITFIELD            :      1.2291e+07  :       2.11  :       0.44
    FP EMULATION        :          4.6154  :       2.21  :       0.51
    FOURIER             :          215.06  :       0.24  :       0.14
    ASSIGNMENT          :          1.1834  :       4.50  :       1.17
    IDEA                :          189.04  :       2.89  :       0.86
    HUFFMAN             :          91.979  :       2.55  :       0.81
    NEURAL NET          :          0.4095  :       0.66  :       0.28
    LU DECOMPOSITION    :          23.313  :       1.21  :       0.87
    ==========================ORIGINAL BYTEMARK RESULTS==========================
    INTEGER INDEX       : 2.761
    FLOATING-POINT INDEX: 0.579
    Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
    ==============================LINUX DATA BELOW===============================
    CPU                 :
    L2 Cache            :
    OS                  : Linux 2.6.27-omap1
    C compiler          : arm-none-linux-gnueabi-gcc
    libc                :
    MEMORY INDEX        : 0.787
    INTEGER INDEX       : 0.623
    FLOATING-POINT INDEX: 0.321


  • Hi Felipe,

     

    I think either of your makefile solutions is fine - the point is to end up with a completely empty pointer.h.

     

    What platform is your OMAP on? Mine is the Mistral OMAP35xx evaluation module. I see that you are using a different kernel, much newer than the one that came in the software kit for the EVM - perhaps this is having an effect?

     

    All the best,

    --

    Olly

  • Hello,

    Here is a wiki article on getting started with Cortex-A8 and NEON.  http://wiki.omap.com/index.php?title=Cortex_A8

    This article shows the 3 basic methods for getting NEON instructions generated.

    1. Using the compiler to autovectorize
    2. Using intrinsics
    3. Coding in Assembly.

    The NEON unit is very powerful for any code that can be vectorized. I've taken a brief look at nbench. I specifically looked at the code for FP emulation. It does not look like a good candidate for autovectorization. Which nbench test are you interested in?  I looked at a couple different tests and I don't see any floating point math. In the FP emulation, there are no floats or doubles, they are using some algorithms on fixed point math.

    By the way, NEON supports integer math and single-precision floating point; it does not support double-precision floating-point math. There is a VFP (floating-point accelerator) that will speed up double precision, but it will not be as fast because it is not SIMD.

    To compare you might want to write some code for simple loops that can autovectorize such as the ones in the wiki article and then compare with other processors.

    Jeff L
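To make the "simple loops that can autovectorize" suggestion concrete, here is a minimal sketch of such a loop alongside its double-precision twin. The function names are invented for this example and the code assumes C99 (for `restrict`). The float version is unit-stride, has no calls in the body, and promises non-overlapping arrays, so it is a candidate for -ftree-vectorize; the double version would be expected to go through the VFP instead, per the note above that NEON has no double-precision support.

```c
#include <assert.h>
#include <stddef.h>

/* Single precision: out[i] = k * x[i] + y[i].
 * `restrict` tells the compiler the arrays do not alias, which removes
 * one common obstacle to vectorisation. */
void scale_add_f32(float *restrict out, const float *restrict x,
                   const float *restrict y, float k, size_t n)
{
    size_t i;

    assert(out != NULL && x != NULL && y != NULL);
    for (i = 0; i < n; ++i)
        out[i] = k * x[i] + y[i];
}

/* Same loop in double precision, for comparison of the generated code. */
void scale_add_f64(double *restrict out, const double *restrict x,
                   const double *restrict y, double k, size_t n)
{
    size_t i;

    assert(out != NULL && x != NULL && y != NULL);
    for (i = 0; i < n; ++i)
        out[i] = k * x[i] + y[i];
}
```

Timing both functions over a large array, and inspecting the -save-temps output for each, gives a more direct picture of what NEON contributes than the nbench floating-point index does.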

  • Hi,

    The results I showed were obtained with the flags Juan mentions in the wiki. The flags you posted give an error:

    BYTEmark* Native Mode Benchmark ver. 2 (10/95)                                 
    Index-split by Andrew D. Balsa (11/97)                                         
    Linux/Unix* port by Uwe F. Mayer (12/96,11/97)                                 
                                                                                   
    TEST                : Iterations/sec.  : Old Index   : New Index               
                        :                  : Pentium 90* : AMD K6/233*             
    --------------------:------------------:-------------:------------             
    NUMERIC SORT        :          219.52  :       5.63  :       1.85              
    Illegal instruction

     

    The flags I used before give the following result (after the pointer modification):

    BYTEmark* Native Mode Benchmark ver. 2 (10/95)                                  
    Index-split by Andrew D. Balsa (11/97)                                          
    Linux/Unix* port by Uwe F. Mayer (12/96,11/97)                                  
                                                                                    
    TEST                : Iterations/sec.  : Old Index   : New Index                
                        :                  : Pentium 90* : AMD K6/233*              
    --------------------:------------------:-------------:------------              
    NUMERIC SORT        :          217.52  :       5.58  :       1.83               
    STRING SORT         :          21.992  :       9.83  :       1.52               
    BITFIELD            :      5.0316e+07  :       8.63  :       1.80               
    FP EMULATION        :          15.768  :       7.57  :       1.75               
    FOURIER             :          181.37  :       0.21  :       0.12               
    ASSIGNMENT          :          2.6996  :      10.27  :       2.66               
    IDEA                :           411.6  :       6.30  :       1.87               
    HUFFMAN             :          211.21  :       5.86  :       1.87               
    NEURAL NET          :         0.27533  :       0.44  :       0.19               
    LU DECOMPOSITION    :          8.5048  :       0.44  :       0.32               
    ==========================ORIGINAL BYTEMARK RESULTS==========================   
    INTEGER INDEX       : 7.517                                                     
    FLOATING-POINT INDEX: 0.343                                                     
    Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0       
    ==============================LINUX DATA BELOW===============================   
    CPU                 :                                                           
    L2 Cache            :                                                           
    OS                  : Linux 2.6.22.18-omap3                                     
    C compiler          : arm-none-linux-gnueabi-gcc                                
    libc                :                                                           
    MEMORY INDEX        : 1.940                                                     
    INTEGER INDEX       : 1.829                                                     
    FLOATING-POINT INDEX: 0.190                                                     
    Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38    
    * Trademarks are property of their respective holder.

    But this is still lower than your result. Could you send me the source you modified?

    My e-mail is felipe.eletrica@ufpr.br.

     

    Thanks

     

     

  •  

    > The results i show were obtained with the flags Juan mention in the Wiki. The flags you pass give an error:

    > Illegal instruction

     

    Looks like you might not have NEON enabled in the kernel? I don't think it's enabled in the SDK kernel; I built the kernel from source with a few tweaks to the default config. I'll email you my kernel config file.

     

    I didn't modify any of the source of nbench to get the results above, just the makefile.

     

    As Jeff notes, it looks like the floating-point math routines in nbench aren't good candidates for auto-vectorisation; perhaps a simpler dedicated floating-point benchmark would be the way to go instead. Still, it does make the OMAP a lot less useful, as any application we wish to port to it is likely to have similarly complex math in it, which again won't vectorise well.

     

    All the best,

    --

    Olly
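On the "Illegal instruction" point above, one quick check before running a NEON build is to look for `neon` in the Features line of /proc/cpuinfo. The sketch below assumes the running kernel advertises NEON that way, which should be confirmed for the kernel in use; the function names are invented for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if `line` is a cpuinfo-style "Features" line that lists
 * "neon" as a whole word, 0 otherwise. */
int features_line_has_neon(const char *line)
{
    const char *p;
    size_t len;

    assert(line != NULL);
    if (strncmp(line, "Features", 8) != 0)
        return 0;
    p = strchr(line, ':');
    if (p == NULL)
        return 0;
    ++p;
    while (*p != '\0') {
        /* Length of the next whitespace-delimited token (may be 0). */
        len = strcspn(p, " \t\r\n");
        if (len == 4 && strncmp(p, "neon", 4) == 0)
            return 1;
        p += len;
        if (*p != '\0')
            ++p; /* step over one separator character */
    }
    return 0;
}

/* Returns 1 if the file at `path` has a Features line listing neon,
 * 0 if it does not, and -1 if the file cannot be opened. */
int cpuinfo_has_neon(const char *path)
{
    char line[1024];
    int found = 0;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (features_line_has_neon(line)) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

Calling `cpuinfo_has_neon("/proc/cpuinfo")` on the target and getting 0 would point at the kernel configuration rather than the compiler flags.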