NEON FPU not working with Data from Stack?

Karl Krach

Hello,

I ran into the problem that NEON does not comply with IEEE 754 arithmetic. So I wrote a test program to

- evaluate how to use NEON in only some parts of the code and VFP in the others
- verify that the GCC flags we use actually enable either NEON or VFP

Now it seems to work, but only when working with data on the heap. With data on the stack, the NEON builds are no faster than the VFP build (about 4.8 s instead of 0.56 s, roughly 9 times slower).

# sync ; ./fpu-test
Using stack

Starting tests with 'FPU auto detection':
Performance:     4.798976 s
c[0]=0.000000
precision problem: c[0] was flushed to zero
Flushes to zero: true

Starting tests with 'NEON FPU':
Performance:     4.800814 s
c[0]=0.000000
precision problem: c[0] was flushed to zero
Flushes to zero: true

Starting tests with 'VFP FPU':
Performance:     4.795331 s
c[0]=0.010000
Flushes to zero: false
#
# sync ; ./fpu-test
Using heap

Starting tests with 'FPU auto detection':
Performance:     0.560130 s
c[0]=0.000000
precision problem: c[0] was flushed to zero
Flushes to zero: true

Starting tests with 'NEON FPU':
Performance:     0.560076 s
c[0]=0.000000
precision problem: c[0] was flushed to zero
Flushes to zero: true

Starting tests with 'VFP FPU':
Performance:     4.781635 s
c[0]=0.010000
Flushes to zero: false
#

To describe the output:

- Without NEON instructions, the test takes about 4.8 seconds.
- Furthermore, NEON flushes to zero, while VFP does not.
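For readers without the attachment, a flush-to-zero check of this kind can be sketched roughly as follows. This is not the code from fpu-test.zip; the function names (multiply, denormal_round_trip, flushes_to_zero) and the choice of FLT_MIN * 0.01 as the denormal are assumptions made for illustration. The idea is to produce a denormal in a vectorizable loop and scale it back up: with denormal support the result is about 0.01, with flush-to-zero it is 0.

```cpp
#include <cfloat>
#include <cstddef>

// Element-wise multiply. A plain loop like this is the kind of code that
// -ftree-vectorize may turn into NEON instructions under -mfpu=neon.
void multiply(const float* a, const float* b, float* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] * b[i];
}

// Hypothetical probe: returns about 0.01 if denormals survive, 0 if not.
float denormal_round_trip()
{
    const std::size_t n = 8;

    // volatile reads keep the compiler from folding the math at compile time.
    volatile float smallest_normal = FLT_MIN;   // about 1.18e-38
    volatile float scale = 0.01f;

    float a[n], b[n], tiny[n], inv[n], c[n];
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = smallest_normal;
        b[i] = scale;
        inv[i] = 1.0f / smallest_normal;        // 2^126, still finite
    }

    multiply(a, b, tiny, n);    // denormal under IEEE 754, 0 when flushed
    multiply(tiny, inv, c, n);  // scale back up to a normal number
    return c[0];
}

bool flushes_to_zero()
{
    return denormal_round_trip() == 0.0f;
}
```

Whether the loop is really executed by NEON depends on the compiler and flags, so the generated assembly should be inspected before trusting the result.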

The code is compiled with:

$ make
arm-linux-g++ -g -c -Wall -I. -O2 -march=armv7-a -mtune=cortex-a8 -ftree-vectorize -mfloat-abi=softfp CTLAutoTests.cpp
arm-linux-g++ -g -c -Wall -I. -O2 -march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp CTLNeonTests.cpp
arm-linux-g++ -g -c -Wall -I. -O2 -march=armv7-a -mtune=cortex-a8 -mfpu=vfp -ftree-vectorize -mfloat-abi=softfp CTLVFPTests.cpp
arm-linux-g++ -g -c -Wall -I. -O2 -march=armv7-a -mtune=cortex-a8 -ftree-vectorize -mfloat-abi=softfp main.cpp
arm-linux-g++ -Wall -lrt CTLAutoTests.o CTLNeonTests.o CTLVFPTests.o main.o -o fpu-test
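Since each file is built with its own -mfpu setting, one way to double-check what a given translation unit was compiled for is to look at GCC's predefined macros. The sketch below is an assumption about how that could be done, not part of the attached test; compiled_fpu is a hypothetical helper, and the exact set of macros should be confirmed with "arm-linux-g++ -dM -E" for the toolchain in use.

```cpp
// Hypothetical helper: reports the FPU the *current* translation unit was
// compiled for. It is static so that each .cpp file gets its own copy,
// reflecting that file's own -mfpu flag.
static const char* compiled_fpu()
{
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
    // Defined by GCC when NEON is enabled, e.g. with -mfpu=neon.
    return "NEON";
#elif defined(__arm__) && defined(__VFP_FP__) && !defined(__SOFTFP__)
    // VFP hardware floating point without NEON, e.g. -mfpu=vfp.
    return "VFP";
#elif defined(__arm__)
    return "ARM without hardware floating point";
#else
    return "not a 32-bit ARM build";
#endif
}
```

Printing compiled_fpu() from CTLAutoTests.cpp, CTLNeonTests.cpp and CTLVFPTests.cpp would show what the "auto detection" build actually defaults to. Note that the macros only tell which instructions the compiler is allowed to emit, not which ones it chose for a particular loop.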

Best regards,

Charly

PS: Maybe interesting further information:

Flush-to-zero mode on arm.com:
"Flush-to-zero mode replaces denormalized numbers with 0. This does not comply with IEEE 754 arithmetic, but in some circumstances can improve performance considerably. [...] NEON always uses flush-to-zero mode."
Using NEON even if no -ffast-math is defined - GCC Bugtracking

fpu-test.zip

over 14 years ago

Karl Krach over 14 years ago

Hmmm... it seems to be a compiler issue.

With the compilers from CodeSourcery, both heap and stack work as specified. Only with a newer compiler (see the bug tracking link in the previous post) do you need the -funsafe-math-optimizations flag to enable NEON.

Compiler                      GCC    without -mfpu          -mfpu=neon             -mfpu=vfp
                                     time       flush2zero  time       flush2zero  time       flush2zero
custom compiler               4.4.5  0.560130s  yes         0.560076s  yes         4.781635s  no
CodeSourcery arm-2009q3       4.4.1  3.518515s  no          0.560107s  yes         3.518906s  no
CodeSourcery arm-2011.03      4.5.2  4.621518s  no          4.628804s  no          4.628980s  no
CodeSourcery arm-2011.03 (*)  4.5.2  4.628695s  no          0.560841s  yes         4.628162s  no

(*) compiled with -funsafe-math-optimizations

Furthermore, GCC 4.4.1 performs better than 4.5.2 in the non-NEON cases (about 3.5 s versus 4.6 s).
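Timings like the ones in the table can be collected with a small harness along these lines. This is only a sketch under assumptions: the real benchmark is in fpu-test.zip and is not shown in this thread, the buffer size and function names are made up, and clock_gettime is a guess based on the -lrt in the link line. It runs the same loop once over stack arrays and once over heap buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <time.h>    // clock_gettime (POSIX; older glibc needs -lrt)
#include <vector>

// The loop under test; identical for stack and heap buffers.
static void multiply_arrays(const float* a, const float* b, float* c,
                            std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] * b[i];
}

// Runs `repeats` passes over the buffers and returns the elapsed seconds.
double time_multiply(const float* a, const float* b, float* c,
                     std::size_t n, int repeats)
{
    assert(n > 0 && repeats > 0);
    struct timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int r = 0; r < repeats; ++r) {
        multiply_arrays(a, b, c, n);
        // GCC/Clang-specific barrier so the repeated passes are not dropped.
        asm volatile("" : : "r"(c) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &stop);
    return (stop.tv_sec - start.tv_sec) + (stop.tv_nsec - start.tv_nsec) * 1e-9;
}

double time_on_stack(int repeats)
{
    const std::size_t n = 1024;          // assumed size, small enough for the stack
    float a[n], b[n], c[n];
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = 1.0f + static_cast<float>(i);
        b[i] = 0.5f;
    }
    return time_multiply(a, b, c, n, repeats);
}

double time_on_heap(int repeats)
{
    const std::size_t n = 1024;
    std::vector<float> a(n), b(n), c(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = 1.0f + static_cast<float>(i);
        b[i] = 0.5f;
    }
    return time_multiply(&a[0], &b[0], &c[0], n, repeats);
}
```

Comparing time_on_stack(100000) with time_on_heap(100000) for each compiler and -mfpu setting would give one row of such a table; the absolute numbers will of course differ from the ones posted here.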
