NEON FPU not working with Data from Stack?



Hello,

I ran into the problem that NEON does not comply with IEEE 754 arithmetic, so I wrote a test program to

  • evaluate how to use NEON in some parts of the code and VFP in others
  • verify that the GCC flags we use enable either NEON or VFP

Now it seems to work, but only when the data is on the heap. With data on the stack, the same code is about 9 times slower.

# sync ; ./fpu-test
Using stack

Starting tests with 'FPU auto detection':
  Performance:     4.798976 s
  c[0]=0.000000
  precision problem: c[0] was flushed to zero
  Flushes to zero: true

Starting tests with 'NEON FPU':
  Performance:     4.800814 s
  c[0]=0.000000
  precision problem: c[0] was flushed to zero
  Flushes to zero: true

Starting tests with 'VFP FPU':
  Performance:     4.795331 s
  c[0]=0.010000
  Flushes to zero: false
#
# sync ; ./fpu-test
Using heap

Starting tests with 'FPU auto detection':
  Performance:     0.560130 s
  c[0]=0.000000
  precision problem: c[0] was flushed to zero
  Flushes to zero: true

Starting tests with 'NEON FPU':
  Performance:     0.560076 s
  c[0]=0.000000
  precision problem: c[0] was flushed to zero
  Flushes to zero: true

Starting tests with 'VFP FPU':
  Performance:     4.781635 s
  c[0]=0.010000
  Flushes to zero: false
#

To describe the output:

  • It seems that without the NEON instructions the test takes about 4.8 seconds
  • Furthermore, NEON flushes to zero, while VFP does not
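
For readers without the attachment, here is a minimal sketch of how a flush-to-zero check of this kind could look. It is an assumption, not the code from fpu-test.zip: the names `multiply`, `scale_through_subnormal` and `was_flushed_to_zero` are hypothetical, and the factor 0.01 is chosen only because it mirrors the `c[0]=0.010000` line in the output above. The idea is to push a value into the subnormal range and scale it back. If the FPU keeps denormals, the result is close to 0.01. If it flushes them, the result is exactly 0.

```cpp
#include <cfloat>
#include <cstddef>

// Element-wise product. A plain loop like this is a candidate for
// auto-vectorization when -ftree-vectorize is enabled.
void multiply(const float* a, const float* b, float* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] * b[i];
}

// Hypothetical check: scales a value through the subnormal range and back.
// Returns about 0.01 if subnormals are preserved, 0 if they are flushed.
float scale_through_subnormal()
{
    const std::size_t n = 8;
    float tiny[n], small[n], big[n], sub[n], c[n];

    for (std::size_t i = 0; i < n; ++i) {
        tiny[i]  = FLT_MIN;          // smallest normalized float (2^-126)
        small[i] = 0.01f;
        big[i]   = 1.0f / FLT_MIN;   // 2^126, exactly representable
    }

    multiply(tiny, small, sub, n);   // FLT_MIN * 0.01 lies in the subnormal range
    multiply(sub, big, c, n);        // scale back into the normal range
    return c[0];
}

// True if the scaled-back value shows that the subnormal was flushed.
bool was_flushed_to_zero(float c)
{
    return c == 0.0f;
}
```

Whether the loop really runs on the NEON unit depends on the compiler version and flags, and an optimizer is free to fold parts of such a small test at compile time, so the generated assembly is worth checking before trusting the result.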

The code is compiled with:

$ make
arm-linux-g++ -g -c -Wall -I. -O2 -march=armv7-a -mtune=cortex-a8            -ftree-vectorize -mfloat-abi=softfp CTLAutoTests.cpp
arm-linux-g++ -g -c -Wall -I. -O2 -march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp CTLNeonTests.cpp
arm-linux-g++ -g -c -Wall -I. -O2 -march=armv7-a -mtune=cortex-a8 -mfpu=vfp  -ftree-vectorize -mfloat-abi=softfp CTLVFPTests.cpp
arm-linux-g++ -g -c -Wall -I. -O2 -march=armv7-a -mtune=cortex-a8            -ftree-vectorize -mfloat-abi=softfp main.cpp
arm-linux-g++ -Wall -lrt CTLAutoTests.o CTLNeonTests.o CTLVFPTests.o main.o -o fpu-test

Best regards,

Charly

PS: Maybe interesting further information:

  • Flush-to-zero mode on arm.com:
    "Flush-to-zero mode replaces denormalized numbers with 0. This does not comply with IEEE 754 arithmetic, but in some circumstances can improve performance considerably. [...] NEON always uses flush-to-zero mode."
  • Using NEON even if no -ffast-math is defined - GCC Bugtracking
fpu-test.zip
  • Hmmm... it seems to be a compiler issue.

    When using the compiler from CodeSourcery, both cases (heap and stack) work as specified. Only with a newer compiler (see the bug tracking link in the previous post) do you need the -funsafe-math-optimizations flag to enable NEON.

    Compiler                  GCC    without -mfpu          -mfpu=neon             -mfpu=vfp
                                     time       flush2zero  time       flush2zero  time       flush2zero
    custom compiler           4.4.5  0.560130s  yes         0.560076s  yes         4.781635s  no
    CodeSourcery arm-2009q3   4.4.1  3.518515s  no          0.560107s  yes         3.518906s  no
    CodeSourcery arm-2011.03  4.5.2  4.621518s  no          4.628804s  no          4.628980s  no
    CodeSourcery arm-2011.03  4.5.2  4.628695s  no          0.560841s  yes         4.628162s  no
      with -funsafe-math-optimizations

    Furthermore, the code built with GCC 4.4.1 performs better than the code built with 4.5.2...
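
For anyone who wants to reproduce the heap versus stack timings from the first post, the following is a minimal sketch of a measurement harness. It is an assumption and not the code from fpu-test.zip: the names `now_seconds`, `time_multiply`, `time_on_stack` and `time_on_heap`, the buffer size and the use of `clock_gettime` (suggested only by the -lrt in the link line) are all hypothetical.

```cpp
#include <cstddef>
#include <time.h>
#include <vector>

const std::size_t N = 1024;   // hypothetical buffer size

// Monotonic wall-clock time in seconds (POSIX clock_gettime).
static double now_seconds()
{
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

// Times `repeats` passes of an element-wise multiply over the given buffers.
double time_multiply(const float* a, const float* b, float* c,
                     std::size_t n, int repeats)
{
    const double start = now_seconds();
    for (int r = 0; r < repeats; ++r)
        for (std::size_t i = 0; i < n; ++i)
            c[i] = a[i] * b[i];
    volatile float sink = c[0];   // keep the result observable
    (void)sink;
    return now_seconds() - start;
}

// Same loop, buffers on the stack.
double time_on_stack(int repeats)
{
    float a[N], b[N], c[N];
    for (std::size_t i = 0; i < N; ++i) { a[i] = 1.5f; b[i] = 2.0f; c[i] = 0.0f; }
    return time_multiply(a, b, c, N, repeats);
}

// Same loop, buffers on the heap.
double time_on_heap(int repeats)
{
    std::vector<float> a(N, 1.5f), b(N, 2.0f), c(N, 0.0f);
    return time_multiply(&a[0], &b[0], &c[0], N, repeats);
}
```

Because both variants go through the same `time_multiply`, any difference between them comes from the buffers themselves (placement, alignment, contents) and not from differently generated loop code. An optimizer may still simplify the repeated passes, so the absolute numbers should be read with care.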