Hi,
I am trying to find out how fast the floating point coprocessor (VFPv3) in the OMAP3530 is in terms of MFLOPS. Has anyone benchmarked it? Can anyone give me some numbers for single precision as well as double precision?
Thanks,
Max
Hi,
I've run flops, which is double precision, and linpack, which can run in single precision.
I was using an OMAP3530 EVM running the Cortex-A8 at 600MHz (operating point 5), with the CFLAGS and compiler shown below:
CFLAGS=-O3 -march=armv7-a -mtune=cortex-a8 -mfpu=vfp -mfloat-abi=softfp
gcc version 4.3.3 (Sourcery G++ Lite 2009q1-203)
FLOPS C Program (Double Precision), V2.0 18 Dec 1992
Module Error RunTime MFLOPS
(usec)
1 1.3358e-12 0.3871 36.1678
2 2.0517e-13 0.1888 37.0679
3 1.7542e-14 0.3995 42.5493
4 -5.4512e-14 0.3865 38.8124
5 3.3307e-16 0.8024 36.1430
6 -1.9040e-14 0.7319 39.6211
7 2.6034e-11 0.4915 24.4173
8 -5.4068e-14 0.7601 39.4668
Iterations = 64000000
NullTime (usec) = 0.0000
MFLOPS(1) = 38.6977
MFLOPS(2) = 32.5258
MFLOPS(3) = 36.8781
MFLOPS(4) = 39.9460
For Linpack I used the same platform, Cortex-A8 clock speed, and compiler, with the following CFLAGS:
CFLAGS=-O3 -march=armv7-a -mfloat-abi=softfp -ftree-vectorize -mfpu=vfp -DSP -DROLL
Unrolled Single Precision Linpack
Unrolled Single Precision Linpack
norm. resid resid machep x[0]-1 x[n-1]-1
1.6 3.80277634e-05 1.19209290e-07 -1.38282776e-05 -7.51018524e-06
times are reported for matrices of order 100
dgefa dgesl total kflops unit ratio
times for array with leading dimension of 201
0.02 0.00 0.02 43947 0.05 0.28
0.02 0.00 0.02 29297 0.07 0.42
0.02 0.00 0.02 29298 0.07 0.42
0.02 0.00 0.02 32553 0.06 0.38
times for array with leading dimension of 200
0.02 0.00 0.02 29298 0.07 0.42
0.02 0.00 0.02 29297 0.07 0.42
0.02 0.00 0.02 29298 0.07 0.42
0.02 0.00 0.02 32553 0.06 0.38
Unrolled Single Precision 32553 Kflops ; 10 Reps
You can find source for flops here: http://linux.maruhn.com/sec/flops.html and linpack here: http://www.netlib.org/benchmark/
Hi,
I am not familiar with those benchmarks. So you are basically saying that you get about 35 MFLOPS when running flops, which is double precision, and 29 MFLOPS when running Linpack, which is single precision. Why did you get fewer MFLOPS for single precision? Also, those numbers seem very low to me. The DaVinci DSP67xx can get 1800 MFLOPS. The ARM VFP9-S could get 320 MFLOPS running at 200MHz and can do 1.3 MFLOPS/MHz (http://www.arm.com/products/processors/technologies/vector-floating-point.php).
What MFLOPS would you get with -mfloat-abi=soft? How fast would it be with NEON SIMD?
Thanks,
Max
Max,
I just posted the only examples I have of single and double precision floating point benchmarks. I don't know how accurately they compare to each other or to a real system. But to address some of your questions: the VFP on the Cortex-A8 is not as powerful as on previous ARM cores, whereas NEON is fast in comparison. To summarize, VFP supports both double and single precision floating point, but NEON supports only single precision. See this wiki page for more information and an example comparing NEON and VFP performance on the same piece of code: http://wiki.omap.com/index.php/Cortex_A8#Neon_and_VFP_both_support_floating_point.2C_which_should_I_use.3F
To use NEON, the code has to be vectorized. There is a compiler flag to enable auto-vectorization, but unfortunately many benchmarks will not auto-vectorize. I do see on the ARM webpage that the most recent RealView compiler does a better job of auto-vectorizing benchmarks, but I have not tried it yet. I also have an assembly NEON benchmark where I was trying to maximize MFLOPS. I can dig it out and post the results if you are interested in pursuing NEON further.
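To make "vectorized" a bit more concrete, here is a minimal sketch of the kind of loop GCC's auto-vectorizer can usually handle: independent iterations, unit stride, and `restrict` pointers so the compiler knows the arrays do not overlap. The function name `scale_f32` and the build line in the comment are assumptions for illustration, not taken from the benchmarks in this thread. One detail from the GCC documentation is worth knowing: with -mfpu=neon, GCC does not auto-vectorize floating point operations unless -funsafe-math-optimizations (or -ffast-math) is also given, because NEON is not fully IEEE 754 compliant.

```c
#include <stddef.h>

/*
 * Hypothetical example (not one of the benchmarks in this thread):
 * a simple single-precision loop that an auto-vectorizer can
 * typically map onto 4-wide NEON multiplies.
 *
 * Assumed build line for a Cortex-A8 target:
 *   gcc -std=c99 -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=neon
 *       -mfloat-abi=softfp -ftree-vectorize -funsafe-math-optimizations
 */
void scale_f32(float *restrict dst, const float *restrict src,
               float k, size_t n)
{
    /* Each iteration is independent, so 4 floats can be done per step. */
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * k;
    }
}
```

Loops with data-dependent branches, pointer chasing, or possible aliasing between the arrays are the usual reasons a benchmark fails to vectorize.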
Regards,
Jeff L
Jeff,
I am very surprised that the VFPv3 in the Cortex-A8 is so much slower, even running at 600MHz. Can you disable the VFPv3 and run the same programs with software floating point emulation? I think the ARM VFP9-S can get around 4.5 MFLOPS with software emulation at 200MHz. It would be interesting to see the comparison.
I am also very interested in the numbers with NEON. If you don't mind, can you show me your CFLAGS and source code for the NEON test?
Thanks very much,
Max
Max,
I can run the tests you are asking for with the floating point libs, but not until early next week. Maybe someone else will jump in with results.
I'm not sure I can post the NEON code to this forum; I need to check if it is O.K. first. But for the NEON test I will post the inner assembly loop and show you the results I obtained:
Running again at a 600MHz CPU clock rate, I wrote some NEON intrinsics code that performed 900 million SIMD single precision multiply operations in 3.008 seconds. Each operation performs 4 single precision multiplies, so there were actually 3.6 billion multiplies. Below is the inner loop, which was iterated 100 million times.
.L7:
add r2, r2, #1
vmul.f32 q9, q9, q8
cmp r2, r1
vmul.f32 q2, q2, q8
vmul.f32 q3, q3, q8
vmul.f32 q15, q15, q8
vmul.f32 q14, q14, q8
vmul.f32 q13, q13, q8
vmul.f32 q12, q12, q8
vmul.f32 q11, q11, q8
vmul.f32 q10, q10, q8
bne .L7
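For anyone who wants to turn those figures into MFLOPS, the arithmetic can be sketched as below. The helper names `mflops` and `flops_per_cycle` are made up for this illustration; only the inputs (900 million SIMD operations, 4 lanes each, 3.008 seconds, 600MHz) come from the post.

```c
/* Hypothetical helper: convert a SIMD operation count, the number of
 * lanes per operation, and the elapsed time into MFLOPS. */
double mflops(double simd_ops, double lanes, double seconds)
{
    return simd_ops * lanes / seconds / 1e6;
}

/* Hypothetical helper: floating point operations completed per CPU
 * clock cycle, given a rate in MFLOPS and a clock in MHz. */
double flops_per_cycle(double mflops_value, double clock_mhz)
{
    return mflops_value / clock_mhz;
}
```

With the numbers above, `mflops(900e6, 4, 3.008)` comes out to roughly 1197 MFLOPS, and `flops_per_cycle` at 600MHz gives just under 2, which works out to one four-lane vmul.f32 about every two cycles in this loop. This is a multiply-only loop with no loads or stores, so it is closer to a peak figure than to what a real workload would see.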
Regards,
Jeff L
Max,
Again, these benchmarks can give a comparison, but I wouldn't quote the MFLOPS from these tests. How are you getting the MFLOPS numbers you are quoting? Maybe we should be comparing apples to apples.
I ran the same benchmarks as before with similar compiler flags, just without enabling the VFP.
For flops:
CFLAGS= -O3 -march=armv7-a -mtune=cortex-a8 -msoft-float
[root@OMAP3EVM /neon]# ./flops2
FLOPS C Program (Double Precision), V2.0 18 Dec 1992
Module Error RunTime MFLOPS
(usec)
1 -1.5632e-13 1.9727 7.0970
2 -1.0347e-13 1.3584 5.1531
3 -3.1197e-14 1.6230 10.4741
4 7.7938e-14 1.4697 10.2060
5 -3.2641e-14 3.5664 8.1314
6 -9.9920e-16 2.8896 10.0358
7 -5.5650e-11 3.0850 3.8898
8 2.7700e-14 2.9600 10.1353
Iterations = 8000000
NullTime (usec) = 0.0000
MFLOPS(1) = 6.1794
MFLOPS(2) = 6.3701
MFLOPS(3) = 8.3113
MFLOPS(4) = 10.1763
For Linpack:
CFLAGS=-DSP -DROLL -O3 -march=armv7-a -mtune=cortex-a8 -msoft-float
[root@OMAP3EVM linpack]# ./linpack_arm
Rolled Single Precision Linpack
Rolled Single Precision Linpack
norm. resid resid machep x[0]-1 x[n-1]-1
1.6 3.80277634e-05 1.19209290e-07 -1.38282776e-05 -7.51018524e-06
times are reported for matrices of order 100
dgefa dgesl total kflops unit ratio
times for array with leading dimension of 201
0.05 0.00 0.05 14649 0.14 0.84
0.04 0.00 0.04 17579 0.11 0.70
0.05 0.00 0.05 14649 0.14 0.84
0.05 0.00 0.05 14649 0.14 0.84
times for array with leading dimension of 200
0.04 0.00 0.04 17578 0.11 0.70
0.05 0.00 0.05 14649 0.14 0.84
0.05 0.00 0.05 14649 0.14 0.84
0.05 0.00 0.05 14897 0.13 0.82
Rolled Single Precision 14649 Kflops ; 10 Reps
Jeff L
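To put the VFP and soft-float runs in this thread side by side, a rough ratio can be computed as sketched below. The helper name `speedup` is an assumption for illustration; the inputs are the summary figures printed by the two sets of runs (MFLOPS(1) for flops, and the final Kflops line for Linpack).

```c
/* Hypothetical helper: how many times faster the VFP build ran than
 * the soft-float build, given both rates in the same unit. */
double speedup(double vfp_rate, double soft_rate)
{
    return vfp_rate / soft_rate;
}
```

Using the reported values, `speedup(38.6977, 6.1794)` is about 6.3 for double precision flops, and `speedup(32553, 14649)` is about 2.2 for single precision Linpack. Treat the Linpack ratio with caution: the VFP run reported itself as unrolled while the soft-float run reported rolled, and the timings of 0.02 to 0.05 seconds are printed with only 0.01 second resolution.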