I ran the trivial NEON/VFP benchmark (as seen here) on a BeagleBone, both as a Linux user process and as a simple StarterWare program. The StarterWare program uses MMU startup code similar to the other StarterWare examples, plus some code I added to initialize the FPU. Both were built with gcc, versions 2011.09-69/arm-none-gnueabi and 2010q1-202/arm-none-linux-gnueabi, with similar CFLAGS: -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp
The StarterWare code takes about 2.1x as long to run the benchmark. (The Linux time, ~0.79s, is consistent with the table and the 720 MHz MPU clock.) Replacing the float computations with ints yields a similar result (2.5x slower on StarterWare). Coding the multiply as a vmulq_f32 intrinsic helped a little, but it helped the Linux version too, by about the same amount.
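For reference, here is a minimal sketch of what the intrinsic version of the multiply might look like. It is not the actual benchmark source: the function name mul_f32 and the element-wise out[i] = a[i] * b[i] shape are assumptions made for illustration. The NEON path uses the standard vld1q_f32, vmulq_f32 and vst1q_f32 intrinsics from arm_neon.h, and a scalar fallback is included so the same file also builds on a non-NEON host.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical kernel (not the original benchmark): out[i] = a[i] * b[i].
 * Assumption: n is a multiple of 4, so the NEON path needs no tail loop. */
static void mul_f32(float *out, const float *a, const float *b, size_t n)
{
    size_t i;

    assert(n % 4 == 0);

#if HAVE_NEON
    /* Process four floats per iteration with one 128-bit multiply. */
    for (i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vmulq_f32(va, vb));
    }
#else
    /* Scalar fallback for builds without NEON support. */
    for (i = 0; i < n; i++)
        out[i] = a[i] * b[i];
#endif
}
```

Calling mul_f32 in a timed loop on both targets, with identical buffers and sizes, should give a like-for-like comparison of the intrinsic path under Linux and under StarterWare.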
I'm hoping a StarterWare expert can chime in and help me here. I'm sure I've overlooked something, as I'm quite new to all this.
Thanks all,
G