vfpv3 speed

Max42430

Other Parts Discussed in Thread: OMAP3530

Hi,

I am trying to find out how fast the floating-point coprocessor (VFPv3) in the OMAP3530 is in terms of MFLOPS. Has anyone benchmarked it? Can anyone give me some numbers for single precision as well as double precision?

Thanks,

Max

over 14 years ago

0 Jeff L over 14 years ago

TI__Expert 5960 points

Hi,

I've run flops, which is double precision, and linpack, which can run in single precision.

I was using an OMAP3530 EVM running the Cortex-A8 at 600MHz (operating point 5), with the CFLAGS and compiler shown below:

CFLAGS=-O3 -march=armv7-a -mtune=cortex-a8 -mfpu=vfp -mfloat-abi=softfp

gcc version 4.3.3 (Sourcery G++ Lite 2009q1-203)

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1      1.3358e-12      0.3871     36.1678
     2      2.0517e-13      0.1888     37.0679
     3      1.7542e-14      0.3995     42.5493
     4     -5.4512e-14      0.3865     38.8124
     5      3.3307e-16      0.8024     36.1430
     6     -1.9040e-14      0.7319     39.6211
     7      2.6034e-11      0.4915     24.4173
     8     -5.4068e-14      0.7601     39.4668

   Iterations      =   64000000
   NullTime (usec) =     0.0000
   MFLOPS(1)       =    38.6977
   MFLOPS(2)       =    32.5258
   MFLOPS(3)       =    36.8781
   MFLOPS(4)       =    39.9460

For Linpack I was using the same platform, the same Cortex-A8 clock speed, the same compiler, and the following CFLAGS:

CFLAGS=-O3 -march=armv7-a -mfloat-abi=softfp -ftree-vectorize -mfpu=vfp -DSP -DROLL

Unrolled Single Precision Linpack

     norm. resid      resid           machep         x[0]-1        x[n-1]-1
       1.6        3.80277634e-05 1.19209290e-07 -1.38282776e-05 -7.51018524e-06
    times are reported for matrices of order   100
      dgefa      dgesl      total       kflops     unit      ratio
times for array with leading dimension of 201
       0.02       0.00       0.02      43947       0.05       0.28
       0.02       0.00       0.02      29297       0.07       0.42
       0.02       0.00       0.02      29298       0.07       0.42
       0.02       0.00       0.02      32553       0.06       0.38
times for array with leading dimension of 200
       0.02       0.00       0.02      29298       0.07       0.42
       0.02       0.00       0.02      29297       0.07       0.42
       0.02       0.00       0.02      29298       0.07       0.42
       0.02       0.00       0.02      32553       0.06       0.38
Unrolled Single Precision 32553 Kflops ; 10 Reps

You can find source for flops here: http://linux.maruhn.com/sec/flops.html and linpack here: http://www.netlib.org/benchmark/

0 Max42430 over 14 years ago in reply to Jeff L

Prodigy 110 points

Hi,

I am not familiar with those benchmarks. So you are basically saying that you get about 35 MFLOPS when running flops, which is double precision, and 29 MFLOPS when running Linpack, which is single precision. Why did you get fewer MFLOPS for single precision? Also, those numbers seem very low to me. The DaVinci DSP67xx can get 1800 MFLOPS. The ARM VFP9-S could get 320 MFLOPS running at 200MHz and can do 1.3 MFLOPS/MHz (http://www.arm.com/products/processors/technologies/vector-floating-point.php).

What MFLOPS would you get using -mfloat-abi=soft? How fast would it be if you used NEON SIMD?

Thanks,

Max

0 Jeff L over 14 years ago in reply to Max42430

TI__Expert 5960 points

Max,

I just posted the only examples I have of single and double precision floating-point benchmarks. I don't know how accurately they compare to each other or to a real system. But to address some of your questions: the VFP on the Cortex-A8 is not as powerful as on previous ARM cores, whereas NEON is fast in comparison. To summarize, VFP supports both double and single precision floating point, but NEON only supports single precision. See this wiki page for more information and an example comparing NEON and VFP performance on the same piece of code: http://wiki.omap.com/index.php/Cortex_A8#Neon_and_VFP_both_support_floating_point.2C_which_should_I_use.3F

To use NEON, the code has to be vectorized. There is a compiler flag to enable auto-vectorization, but unfortunately many benchmarks will not auto-vectorize. I do see on the ARM webpage that the most recent RealView compiler does a better job of auto-vectorizing benchmarks, but I have not tried it yet. I do have an assembly NEON benchmark in which I was trying to maximize MFLOPS. I can dig it out and post the results if you are interested in NEON further.

Regards,

Jeff L

0 Max42430 over 14 years ago in reply to Jeff L

Prodigy 110 points

Jeff,

I am very surprised that the VFPv3 in the Cortex-A8 is so slow when running at 600MHz. Can you disable the VFPv3 and run the same programs with the software floating-point emulator? I think the ARM VFP9-S can get around 4.5 MFLOPS with the software emulator at 200MHz. It would be interesting to see the comparison.

I am also very interested in the numbers with NEON. If you don't mind, can you show me your CFLAGS and your source code for the NEON test?

Thanks very much,

Max

0 Jeff L over 14 years ago in reply to Max42430

TI__Expert 5960 points

Max,

I can run the tests you are asking for with the floating-point libs, but not until early next week. Maybe someone else will jump in with results.

I'm not sure I can post the NEON code to this forum; I need to check whether it is O.K. first. But for the NEON test I will post the inner assembly loop and show you the results I obtained:

Running again at a 600MHz CPU clock rate, I wrote some code with NEON intrinsics that performed 900 million SIMD single precision multiply operations in 3.008 seconds. Each operation performed 4 single precision multiplies, so there were actually 3.6 billion multiplies. Below is the inner loop, which was iterated 100 million times.

.L7:
        add     r2, r2, #1
        vmul.f32        q9, q9, q8
        cmp     r2, r1
        vmul.f32        q2, q2, q8
        vmul.f32        q3, q3, q8
        vmul.f32        q15, q15, q8
        vmul.f32        q14, q14, q8
        vmul.f32        q13, q13, q8
        vmul.f32        q12, q12, q8
        vmul.f32        q11, q11, q8
        vmul.f32        q10, q10, q8
        bne     .L7

Regards,

Jeff L

0 Max42430 over 14 years ago in reply to Jeff L

Prodigy 110 points

Jeff,

I'd appreciate it if you could run the comparison tests as soon as you can. Thanks very much.

Max

0 Jeff L over 14 years ago in reply to Max42430

TI__Expert 5960 points

Max,

Again, these benchmarks can give a comparison, but I wouldn't quote the MFLOPS from these tests. How are you getting the MFLOPS numbers you are quoting? Maybe we should be comparing apples to apples.

I ran the same benchmarks as before with similar compiler flags, just without enabling the VFP.

For flops:

CFLAGS= -O3 -march=armv7-a -mtune=cortex-a8 -msoft-float

[root@OMAP3EVM /neon]# ./flops2

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1     -1.5632e-13      1.9727      7.0970
     2     -1.0347e-13      1.3584      5.1531
     3     -3.1197e-14      1.6230     10.4741
     4      7.7938e-14      1.4697     10.2060
     5     -3.2641e-14      3.5664      8.1314
     6     -9.9920e-16      2.8896     10.0358
     7     -5.5650e-11      3.0850      3.8898
     8      2.7700e-14      2.9600     10.1353

   Iterations      =    8000000
   NullTime (usec) =     0.0000
   MFLOPS(1)       =     6.1794
   MFLOPS(2)       =     6.3701
   MFLOPS(3)       =     8.3113
   MFLOPS(4)       =    10.1763

For Linpack:

CFLAGS=-DSP -DROLL -O3 -march=armv7-a -mtune=cortex-a8 -msoft-float

[root@OMAP3EVM linpack]# ./linpack_arm
Rolled Single Precision Linpack

Rolled Single Precision Linpack

     norm. resid      resid           machep         x[0]-1        x[n-1]-1
       1.6        3.80277634e-05 1.19209290e-07 -1.38282776e-05 -7.51018524e-06
    times are reported for matrices of order   100
      dgefa      dgesl      total       kflops     unit      ratio
times for array with leading dimension of 201
       0.05       0.00       0.05      14649       0.14       0.84
       0.04       0.00       0.04      17579       0.11       0.70
       0.05       0.00       0.05      14649       0.14       0.84
       0.05       0.00       0.05      14649       0.14       0.84
times for array with leading dimension of 200
       0.04       0.00       0.04      17578       0.11       0.70
       0.05       0.00       0.05      14649       0.14       0.84
       0.05       0.00       0.05      14649       0.14       0.84
       0.05       0.00       0.05      14897       0.13       0.82
Rolled Single Precision 14649 Kflops ; 10 Reps

Jeff L
