Trying to get Neon optimization working for benchmarking OMAP3530 (gcc/linux)

Funkster

Hi All,

I've been trying to sort this out for a couple of days now and I'm beaten. I've got a simple suite of benchmarks called nbench ( http://www.tux.org/~mayer/linux/bmark.html ) that I'm running on the OMAP3EVM so as to compare it with some other potential processors for a new product. However, I cannot seem to get it to make use of the Neon coprocessor! I'm using GCC 4.3.2 (CodeSourcery 2008q3).

These are my current compiler flags:

CFLAGS = -s -save-temps -static -Wall -O3 -march=armv7-a -mtune=cortex-a8 -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon -ftree-vectorize -fomit-frame-pointer -ffast-math

The annoying thing is that, in one function in one file, I do a float * float multiply and there is a vmul.f32 in the generated assembler. However for other float * float multiplies in the same file it has just used fmul!

I'm sure this chip has more floating-point power than it's demonstrating at the moment, but if I can't demonstrate it then we can't really take it seriously. Can anyone offer any hints as to where I'm going wrong? Does GCC only optimize certain types of multiply or certain types of variables?

Thanks in advance for any assistance,

Olly

over 16 years ago

0 Juan Gonzales over 16 years ago

TI__Mastermind 37340 points

With regard to NEON, have you enabled it in the kernel per http://tiexpressdsp.com/wiki/index.php?title=FAQ_OMAP35x_Linux_PSP ?

[edit] I also checked the compile options you are using and they appear to be correct. Let us know if you get better performance numbers...

0 Juan Gonzales over 16 years ago

TI__Mastermind 37340 points

Also, I thought you might be interested in the following forum post: http://community.ti.com/forums/p/3230/11914.aspx#11914

0 Funkster over 16 years ago in reply to Juan Gonzales

Prodigy 70 points

Hi Juan, thanks for the reply.

Yes, I have NEON on in the kernel - the app compiled with these options runs fine, and the floating point operations are somewhat quicker than with just -O3. However, the overall result for floating point is still roughly half the performance of an AMD K6-233! I would imagine that the OMAP should give the K6 a sound thrashing, yes?

Benchmark results with just -O3:

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec. : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          239.92 :       6.15 :       2.02
STRING SORT         :          30.619 :      13.68 :       2.12
BITFIELD            :      7.8572e+07 :      13.48 :       2.82
FP EMULATION        :          59.552 :      28.58 :       6.59
FOURIER             :          241.25 :       0.27 :       0.15
ASSIGNMENT          :          3.5729 :      13.60 :       3.53
IDEA                :          475.25 :       7.27 :       2.16
HUFFMAN             :          321.73 :       8.92 :       2.85
NEURAL NET          :         0.35361 :       0.57 :       0.24
LU DECOMPOSITION    :          10.798 :       0.56 :       0.40
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 11.619
FLOATING-POINT INDEX: 0.443
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 :
L2 Cache            :
OS                  : Linux 2.6.22.18-omap3
C compiler          : arm-none-linux-gnueabi-gcc
libc                : static
MEMORY INDEX        : 2.760
INTEGER INDEX       : 3.009
FLOATING-POINT INDEX: 0.246
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

Benchmark results with all the options above:

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec. : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          283.14 :       7.26 :       2.38
STRING SORT         :          26.628 :      11.90 :       1.84
BITFIELD            :      8.6004e+07 :      14.75 :       3.08
FP EMULATION        :          57.834 :      27.75 :       6.40
FOURIER             :          239.59 :       0.27 :       0.15
ASSIGNMENT          :          3.8164 :      14.52 :       3.77
IDEA                :          644.03 :       9.85 :       2.92
HUFFMAN             :          315.83 :       8.76 :       2.80
NEURAL NET          :         0.76717 :       1.23 :       0.52
LU DECOMPOSITION    :          41.756 :       2.16 :       1.56
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 12.370
FLOATING-POINT INDEX: 0.899
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 :
L2 Cache            :
OS                  : Linux 2.6.22.18-omap3
C compiler          : arm-none-linux-gnueabi-gcc
libc                : static
MEMORY INDEX        : 2.775
INTEGER INDEX       : 3.343
FLOATING-POINT INDEX: 0.499
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

Notice that the string sort has actually been made slower by these optimisations!
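
A note on reading the two FLOATING-POINT INDEX figures: they are consistent with a geometric mean of the three floating-point test ratios (FOURIER, NEURAL NET, LU DECOMPOSITION). That is an assumption about how nbench combines them, but it reproduces 0.246 and 0.499 to within about 1% from the rounded table values. Under that assumption, the unchanged FOURIER ratio of 0.15 holds the index down even though LU DECOMPOSITION improved almost fourfold. A minimal sketch, where fp_index and cube_root are hypothetical helper names:

```c
/* Cube root by Newton's method, so the example does not need libm.
 * Valid for v > 0, which holds for benchmark ratios. */
static double cube_root(double v)
{
    double r = v > 1.0 ? v : 1.0;   /* start at or above the root */
    int i;

    for (i = 0; i < 200; i++)
        r = (2.0 * r + v / (r * r)) / 3.0;
    return r;
}

/* Geometric mean of the three floating-point test ratios
 * (assumed to be how the index is formed). */
double fp_index(double fourier, double neural_net, double lu)
{
    return cube_root(fourier * neural_net * lu);
}
```

With the "New Index" column from the two runs above, fp_index(0.15, 0.24, 0.40) and fp_index(0.15, 0.52, 1.56) differ by roughly a factor of two, which matches the change from 0.246 to 0.499.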

Cheers,

Olly

0 Funkster over 16 years ago in reply to Funkster

Prodigy 70 points

Forgot to mention, it did occur to me that this benchmark is using some functions from the math library, which (I'm guessing) won't have been compiled with all these optimisations. So, I made a start on making my own small math library so I could compile it with the same options. However, having made one function (pow, which is full of float*float multiplies) that would compile, I had a look at the assembler output and there weren't any vmul.f32 instructions in there...

Cheers,

Olly

0 Juan Gonzales over 16 years ago in reply to Funkster

TI__Mastermind 37340 points

My understanding is that many rules must be followed in your C code for CodeSourcery tools to generate NEON instructions, and I therefore suspect these instructions are not being generated. Have you checked the resulting assembly code? For performance reasons, using intrinsics may be the way to go.

0 Bernie Thompson TI over 16 years ago in reply to Funkster

TI__Mastermind 41665 points

Regarding the intrinsics Juan mentions, they are documented at http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html. Using these you would most likely see large performance gains. Unfortunately the compiler is not always good at using NEON on its own; at the moment you have to massage the code into a form that suits NEON.

If the code is not using NEON and is falling back to the VFP instead, it will appear relatively slow, as you have seen.
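
As a rough illustration of the intrinsics route, here is a minimal sketch of an element-wise float multiply. The function name mul_f32 is hypothetical and not part of nbench. The intrinsics used (vld1q_f32, vmulq_f32, vst1q_f32) come from arm_neon.h, and the preprocessor guard lets the same file build on a non-NEON target, where only the scalar loop is compiled.

```c
#include <stddef.h>

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Multiply two float arrays element by element: out[i] = a[i] * b[i].
 * mul_f32 is a hypothetical helper name used for illustration only. */
void mul_f32(float *out, const float *a, const float *b, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
    /* NEON path: four single-precision lanes per iteration. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vmulq_f32(va, vb));
    }
#endif

    /* Scalar path: the whole array without NEON, or the 0-3 leftovers. */
    for (; i < n; i++)
        out[i] = a[i] * b[i];
}
```

Built with -mfpu=neon, the first loop should show up as a vmul.f32 on q registers in the -save-temps assembler output.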

0 Funkster over 16 years ago in reply to Juan Gonzales

Prodigy 70 points

Juan Gonzales said:

My understanding is that many rules must be followed in your C code for CodeSourcery tools to generate NEON instructions

Does anyone know what these rules might be? I think at least if I know the rules I can make the decision whether to tweak the C code to enable the compiler to optimise, or (shudder) start inlining assembler for the whole math library...

Thanks,

Olly

0 Daniboy over 16 years ago

Prodigy 150 points

Hi,

My name is Felipe. I'm a Brazilian engineer working with the OMAP3530. We are developing a Linux application, but its performance is poor. We started with the CodeSourcery compiler (2007q1) and Linux 2.6.22.18 and could not get the expected results. Next we tried Linux 2.6.27 with the same compiler, but the results were the same. Finally we switched to the CodeSourcery compiler (2008q3), but then Linux does not boot.

To compare the two compilers we tried the benchmark program you mention, but when built with CodeSourcery (2007q1) it stops at the ASSIGNMENT line (after FOURIER).

We think the problems with both the benchmark program and the compiled kernel are in the flags. We are using the following flags:
KBUILD_CFLAGS := -mlittle-endian -fno-strict-aliasing -fno-common -Os -marm -fno-omit-frame-pointer -mapcs -mno-sched-prolog -mabi=aapcs-linux -mno-thumb-interwork -march=armv5t -Wa, -march=armv7a -msoft-float -Uarm -fno-optimize-sibling-calls -fno-stack-protector

Is this approach correct? Is there any help you can provide?

Thanks,

Felipe Elias

0 Funkster over 16 years ago in reply to Daniboy

Prodigy 70 points

Hi Felipe,

I think you may be running into a problem that I had with nbench, where it is trying to work in 64-bit due to the makefile detecting the memory width of the build machine rather than the target. Try changing the original makefile instruction for pointer.h (around line 132) to this:

# compiler running on 64-bit system for cross compile, target is 32!
# (the recipe line below must begin with a TAB character)
pointer.h:
	touch pointer.h

Remove any existing pointer.h and rebuild, and hopefully it will no longer get stuck.
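
An alternative that avoids running anything on the build machine is to let the cross compiler's own preprocessor decide. This is only a sketch under an assumption: that the sole job of pointer.h is to define a macro when long is 64 bits wide. The macro name LONG64 and the helper built_for_long64 are assumed here for illustration. Because limits.h comes from the target toolchain, ULONG_MAX reflects the target, not the host.

```c
#include <limits.h>

/* Assumption: the generated header only needs to define LONG64 when
 * the target's long is wider than 32 bits. */
#if ULONG_MAX > 0xFFFFFFFFUL
#define LONG64
#endif

/* Hypothetical helper: reports which way the check above went. */
int built_for_long64(void)
{
#ifdef LONG64
    return 1;
#else
    return 0;
#endif
}
```

For a 32-bit ARM target the macro stays undefined, which has the same effect as the empty pointer.h.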

Good luck, hope you find something that speeds things up!

Olly

[edit: spelling]

0 Juan Gonzales over 16 years ago in reply to Daniboy

TI__Mastermind 37340 points

Have you guys seen the following wiki regarding NEON compiler options and how to write code for NEON: http://tiexpressdsp.com/index.php?title=Cortex_A8

0 Daniboy over 16 years ago in reply to Funkster

Prodigy 150 points

Funkster,

I tried compiling nbench again and this time it works; the problem is that I get lower values with my processor. The modification I made to the Makefile was the following:

pointer.h:
touch pointer.h

That does not work (make needs a separator), so I tried:

pointer.h: pointer Makefile
    $(CC) $(MACHINE) $(DEFINES) $(CFLAGS)\
        -o pointer pointer.c
    rm -f pointer.h
    touch pointer.h

I get the same behaviour with:

pointer.h: pointer Makefile
touch pointer.h

Is that correct?

The performance of the 3530 is shown below:

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec. : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :           50.16 :       1.29 :       0.42
STRING SORT         :          13.718 :       6.13 :       0.95
BITFIELD            :      1.2291e+07 :       2.11 :       0.44
FP EMULATION        :          4.6154 :       2.21 :       0.51
FOURIER             :          215.06 :       0.24 :       0.14
ASSIGNMENT          :          1.1834 :       4.50 :       1.17
IDEA                :          189.04 :       2.89 :       0.86
HUFFMAN             :          91.979 :       2.55 :       0.81
NEURAL NET          :          0.4095 :       0.66 :       0.28
LU DECOMPOSITION    :          23.313 :       1.21 :       0.87
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 2.761
FLOATING-POINT INDEX: 0.579
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 :
L2 Cache            :
OS                  : Linux 2.6.27-omap1
C compiler          : arm-none-linux-gnueabi-gcc
libc                :
MEMORY INDEX        : 0.787
INTEGER INDEX       : 0.623
FLOATING-POINT INDEX: 0.321

0 Funkster over 16 years ago in reply to Daniboy

Prodigy 70 points

Hi Felipe,

I think either of your makefile solutions is fine - the point is to end up with a completely empty pointer.h.

What platform is your OMAP on? Mine is the Mistral OMAP35xx evaluation module. I see that you are using a different kernel, much newer than the one that came in the software kit for the EVM - perhaps this is having an effect?

All the best,

Olly

0 Jeff L over 16 years ago

TI__Expert 5960 points

Hello,

Here is a wiki article on getting started with Cortex-A8 and NEON. http://wiki.omap.com/index.php?title=Cortex_A8

This article shows the 3 basic methods for getting NEON instructions generated.

Using the compiler to autovectorize
Using intrinsics
Coding in Assembly.

The NEON unit is very powerful for any code that can be vectorized. I've taken a brief look at nbench, specifically at the code for FP emulation. It does not look like a good candidate for autovectorization. Which nbench test are you interested in? I looked at a couple of different tests and I don't see any floating-point math. In the FP emulation there are no floats or doubles; it runs its algorithms on fixed-point math.

By the way, NEON supports integer math and single-precision floating point. NEON does not support double-precision floating-point math. There is a VFP (floating-point accelerator) that will speed up double precision, but it will not be as fast because it is not SIMD.
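
One practical consequence of the single-precision limit is that C's usual promotion rules can quietly move arithmetic into double precision. The sketch below uses two hypothetical functions (scale_promoted and scale_single) to show the difference a literal suffix makes. Staying in float is a precondition for NEON code generation, though not sufficient by itself.

```c
/* Both functions compute x * 2.5 for a float input.
 * The names are hypothetical and used for illustration only. */

/* The literal 2.5 has type double, so x is promoted, the multiply is
 * done in double precision, and the result is converted back to float. */
float scale_promoted(float x)
{
    return x * 2.5;
}

/* The f suffix makes the literal a float, so the multiply stays in
 * single precision, the only floating-point format NEON handles. */
float scale_single(float x)
{
    return x * 2.5f;
}
```

The same applies to math library calls: sqrt, sin and pow take and return double, while sqrtf, sinf and powf are the float versions.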

To compare you might want to write some code for simple loops that can autovectorize such as the ones in the wiki article and then compare with other processors.
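
A loop of the kind meant here might look like the following sketch. It is not taken from nbench or from the wiki article, and saxpy_f32 is a hypothetical name. The properties that usually matter to GCC's vectorizer are visible in it: contiguous float arrays, unit stride, a simple counted loop with no calls or branches in the body, and a promise that the arrays do not overlap.

```c
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i] over contiguous float arrays.
 * __restrict__ is the GCC spelling of C99 restrict and works in any
 * -std mode; it tells the compiler that x and y never overlap. */
void saxpy_f32(float *__restrict__ y, const float *__restrict__ x,
               float alpha, size_t n)
{
    size_t i;

    /* Unit stride, float only, no function calls, no branches. */
    for (i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

Whether a given compiler version actually vectorizes it can be checked in the generated assembler, for example by looking for vmul.f32 or vmla.f32 on q registers.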

Jeff L

0 Daniboy over 16 years ago in reply to Funkster

Prodigy 150 points

Hi,

The results I showed were obtained with the flags Juan mentions in the wiki. The flags you posted give an error:

Illegal instruction

The flags I used before give the following result (after the pointer modification):

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec. : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          217.52 :       5.58 :       1.83
STRING SORT         :          21.992 :       9.83 :       1.52
BITFIELD            :      5.0316e+07 :       8.63 :       1.80
FP EMULATION        :          15.768 :       7.57 :       1.75
FOURIER             :          181.37 :       0.21 :       0.12
ASSIGNMENT          :          2.6996 :      10.27 :       2.66
IDEA                :           411.6 :       6.30 :       1.87
HUFFMAN             :          211.21 :       5.86 :       1.87
NEURAL NET          :         0.27533 :       0.44 :       0.19
LU DECOMPOSITION    :          8.5048 :       0.44 :       0.32
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 7.517
FLOATING-POINT INDEX: 0.343
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 :
L2 Cache            :
OS                  : Linux 2.6.22.18-omap3
C compiler          : arm-none-linux-gnueabi-gcc
libc                :
MEMORY INDEX        : 1.940
INTEGER INDEX       : 1.829
FLOATING-POINT INDEX: 0.190
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.

But this is still lower than your result. Could you send me the source you modified?

My e-mail is felipe.eletrica@ufpr.br.

Thanks

0 Funkster over 16 years ago in reply to Daniboy

Prodigy 70 points

> The results i show were obtained with the flags Juan mention in the Wiki. The flags you pass give an error:

> Illegal instruction

It looks like you might not have NEON enabled in the kernel. I don't think it's enabled in the SDK kernel; I built the kernel from source with a few tweaks to the default config. I'll email you my kernel config file.
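
One way for a program to find out before it hits an illegal instruction is to look at the Features line in /proc/cpuinfo. This is a sketch under an assumption: that the running ARM kernel lists "neon" there when NEON is usable, which may not hold for every kernel version. The function names has_feature and cpuinfo_reports_neon are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole word in `line`, else 0. */
int has_feature(const char *line, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = line;

    if (len == 0)
        return 0;

    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
        if (starts && ends)
            return 1;
        p += len;
    }
    return 0;
}

/* Return 1 if /proc/cpuinfo lists "neon" on its Features line,
 * 0 if it does not, and -1 if the file cannot be read. */
int cpuinfo_reports_neon(void)
{
    char line[512];
    int found = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "Features", 8) == 0 && has_feature(line, "neon"))
            found = 1;
    }
    fclose(f);
    return found;
}
```

A benchmark could call cpuinfo_reports_neon() at start-up and select a non-NEON code path when it returns 0.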

I didn't modify any of the source of nbench to get the results above, just the makefile.

As Jeff notes, it looks like the floating-point math routines in nbench aren't good candidates for auto-vectorisation, so perhaps a simpler, dedicated floating-point benchmark would be the way to go instead. Still, it does make the OMAP a lot less useful, as any application we wish to port to it is likely to have similarly complex math in it, which again won't vectorise well.

All the best,

Olly
