Hello,
I ran into the problem that NEON does not comply with IEEE 754 arithmetic, so I wrote a test program to
- evaluate how to use NEON in only some parts of the code and VFP in the others
- verify that the GCC flags we use really enable either NEON or VFP
It now seems to work, but NEON is only faster when the data lives on the heap. With stack data the NEON build is about 9 times slower and takes as long as the VFP build.
# sync ; ./fpu-test
Using stack
Starting tests with 'FPU auto detection':
Performance: 4.798976 s
c[0]=0.000000
precision problem: c[0] was flushed to zero
Flushes to zero: true
Starting tests with 'NEON FPU':
Performance: 4.800814 s
c[0]=0.000000
precision problem: c[0] was flushed to zero
Flushes to zero: true
Starting tests with 'VFP FPU':
Performance: 4.795331 s
c[0]=0.010000
Flushes to zero: false
#
# sync ; ./fpu-test
Using heap
Starting tests with 'FPU auto detection':
Performance: 0.560130 s
c[0]=0.000000
precision problem: c[0] was flushed to zero
Flushes to zero: true
Starting tests with 'NEON FPU':
Performance: 0.560076 s
c[0]=0.000000
precision problem: c[0] was flushed to zero
Flushes to zero: true
Starting tests with 'VFP FPU':
Performance: 4.781635 s
c[0]=0.010000
Flushes to zero: false
#
To describe the output:
- It seems that without the NEON instructions (VFP build) the test takes about 4.8 seconds, while the NEON builds take about 0.56 seconds, but only on heap data
- Furthermore, NEON flushes to zero while VFP does not
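For readers who want to reproduce the flush-to-zero check, here is a minimal sketch of how such a test could look. It is not the code from fpu-test: the names multiply, was_flushed_to_zero and flushes_to_zero are made up for illustration, and the input values are my own choice. The idea is to multiply the smallest normal float by 0.5 in a plain loop, so the exact result is a subnormal number. IEEE 754 arithmetic keeps it non-zero, a unit in flush-to-zero mode returns 0. Whether the loop really ends up as NEON instructions depends on the compiler version and flags.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical kernel: an element-wise product written as a plain loop so
// that -ftree-vectorize is free to turn it into NEON instructions.
void multiply(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] * b[i];
    }
}

// True when the value is exactly zero, which for the inputs used below
// means a subnormal result was flushed.
bool was_flushed_to_zero(float value) {
    return std::fpclassify(value) == FP_ZERO;
}

// Fills the inputs so that every exact product is 2^-127, a subnormal
// float, then reports whether the first result came back as zero.
// Assumes n > 0.
bool flushes_to_zero(std::size_t n = 1024) {
    std::vector<float> a(n, std::numeric_limits<float>::min()); // smallest normal
    std::vector<float> b(n, 0.5f);
    std::vector<float> c(n, 1.0f); // non-zero start value, overwritten by the kernel
    multiply(&a[0], &b[0], &c[0], n);
    return was_flushed_to_zero(c[0]);
}
```

Calling flushes_to_zero() from a file built with -mfpu=neon and from one built with -mfpu=vfp should then show the difference, provided the vectorizer actually picked NEON for the loop.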
The code is compiled with:
$ make
arm-linux-g++ -g -c -Wall -I. -O2 -march=armv7-a -mtune=cortex-a8 -ftree-vectorize -mfloat-abi=softfp CTLAutoTests.cpp
arm-linux-g++ -g -c -Wall -I. -O2 -march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp CTLNeonTests.cpp
arm-linux-g++ -g -c -Wall -I. -O2 -march=armv7-a -mtune=cortex-a8 -mfpu=vfp -ftree-vectorize -mfloat-abi=softfp CTLVFPTests.cpp
arm-linux-g++ -g -c -Wall -I. -O2 -march=armv7-a -mtune=cortex-a8 -ftree-vectorize -mfloat-abi=softfp main.cpp
arm-linux-g++ -Wall -lrt CTLAutoTests.o CTLNeonTests.o CTLVFPTests.o main.o -o fpu-test
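The per-file -mfpu flags above can also be cross-checked from inside each translation unit. The following is only a sketch and not part of fpu-test: compiled_fpu is a hypothetical helper, and it relies on the macros GCC predefines for ARM targets (__ARM_NEON__, __SOFTFP__, __VFP_FP__). Since __VFP_FP__ only describes the float format and is also set for soft-float EABI builds, __SOFTFP__ is tested first.

```cpp
#include <string>

// Hypothetical helper: reports which FPU this translation unit was built
// for, based on macros that GCC predefines for ARM targets.
// static gives every object file its own copy, so files compiled with
// different -mfpu flags do not clash at link time.
static std::string compiled_fpu() {
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
    return "neon";             // e.g. -mfpu=neon
#elif defined(__arm__) && defined(__SOFTFP__)
    return "soft-float";       // -mfloat-abi=soft, no FPU instructions
#elif defined(__arm__) && defined(__VFP_FP__)
    return "vfp";              // e.g. -mfpu=vfp with softfp or hard ABI
#elif defined(__arm__)
    return "arm, unknown fpu";
#else
    return "not an arm build"; // e.g. when compiled on the host
#endif
}
```

Printing compiled_fpu() once from CTLAutoTests.cpp, CTLNeonTests.cpp and CTLVFPTests.cpp would show what each file was really compiled for, including what the toolchain default is when no -mfpu is given.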
Best regards,
Charly
PS: Some further information that may be of interest:
- Flush-to-zero mode on arm.com:
"Flush-to-zero mode replaces denormalized numbers with 0. This does not comply with IEEE 754 arithmetic, but in some circumstances can improve performance considerably. [...] NEON always uses flush-to-zero mode."
- Using NEON even if no -ffast-math is defined - GCC Bugtracking