vfpv3 speed

Other Parts Discussed in Thread: OMAP3530

Hi,

 

I am trying to find out how fast the floating-point coprocessor (VFPv3) in the OMAP3530 is in terms of MFLOPS. Has anyone benchmarked it? Can anyone give me some numbers for single precision as well as double precision?

Thanks,

Max

  • Hi,

    I've run flops, which is double precision, and linpack, which can run in single precision.

    I was using an OMAP3530 EVM running the Cortex-A8 at 600MHz (operating point 5), with the CFLAGS and compiler shown below:

    CFLAGS=-O3 -march=armv7-a -mtune=cortex-a8 -mfpu=vfp -mfloat-abi=softfp

    gcc version 4.3.3 (Sourcery G++ Lite 2009q1-203)

    FLOPS C Program (Double Precision), V2.0 18 Dec 1992

       Module     Error        RunTime      MFLOPS
                                (usec)
         1      1.3358e-12      0.3871     36.1678
         2      2.0517e-13      0.1888     37.0679
         3      1.7542e-14      0.3995     42.5493
         4     -5.4512e-14      0.3865     38.8124
         5      3.3307e-16      0.8024     36.1430
         6     -1.9040e-14      0.7319     39.6211
         7      2.6034e-11      0.4915     24.4173
         8     -5.4068e-14      0.7601     39.4668

       Iterations      =   64000000
       NullTime (usec) =     0.0000
       MFLOPS(1)       =    38.6977
       MFLOPS(2)       =    32.5258
       MFLOPS(3)       =    36.8781
       MFLOPS(4)       =    39.9460

    For Linpack I was using the same platform, the same Cortex-A8 clock speed, and the same compiler, with the following CFLAGS:

    CFLAGS=-O3 -march=armv7-a -mfloat-abi=softfp -ftree-vectorize -mfpu=vfp -DSP -DROLL

    Unrolled Single Precision Linpack

    Unrolled Single Precision Linpack

         norm. resid      resid           machep         x[0]-1        x[n-1]-1
           1.6        3.80277634e-05  1.19209290e-07 -1.38282776e-05 -7.51018524e-06
        times are reported for matrices of order   100
          dgefa      dgesl      total       kflops     unit      ratio
     times for array with leading dimension of  201
           0.02       0.00       0.02      43947       0.05       0.28
           0.02       0.00       0.02      29297       0.07       0.42
           0.02       0.00       0.02      29298       0.07       0.42
           0.02       0.00       0.02      32553       0.06       0.38
     times for array with leading dimension of 200
           0.02       0.00       0.02      29298       0.07       0.42
           0.02       0.00       0.02      29297       0.07       0.42
           0.02       0.00       0.02      29298       0.07       0.42
           0.02       0.00       0.02      32553       0.06       0.38
    Unrolled Single  Precision 32553 Kflops ; 10 Reps

    You can find the source for flops here: http://linux.maruhn.com/sec/flops.html and for linpack here: http://www.netlib.org/benchmark/

  • Hi,

     

    I am not familiar with those benchmarks. So you are basically saying that you get about 35 MFLOPS running flops, which is double precision, and about 29 MFLOPS running Linpack, which is single precision. Why did you get fewer MFLOPS for single precision? Those numbers also seem very low to me. The DaVinci DSP67xx can get 1800 MFLOPS. The ARM VFP9-S could get 320 MFLOPS running at 200MHz and can do 1.3 MFLOPS/MHz (http://www.arm.com/products/processors/technologies/vector-floating-point.php).

    What MFLOPS would you get with -mfloat-abi=soft? How fast would it be if you used NEON SIMD?

     

    Thanks,

     

    Max
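Since the comparison in this post is expressed per MHz, it may help to put the measured figures on the same footing. The sketch below is only arithmetic on numbers already posted in this thread (the flops MFLOPS(1) summary and the final Linpack line, both at 600MHz); the function name `mflops_per_mhz` is made up for illustration, and the result says nothing about whether the two benchmarks are comparable to each other.

```python
# Normalize the posted benchmark results by CPU clock.
# All inputs are copied from the logs in this thread; the helper name is illustrative.

CPU_MHZ = 600.0                  # Cortex-A8 clock used for both runs

def mflops_per_mhz(mflops: float, mhz: float) -> float:
    """Return throughput per MHz of CPU clock."""
    return mflops / mhz

flops_dp_mflops = 38.6977        # flops, double precision, MFLOPS(1)
linpack_sp_mflops = 32553 / 1000 # Linpack reports Kflops; convert to MFLOPS

flops_dp_per_mhz = mflops_per_mhz(flops_dp_mflops, CPU_MHZ)
linpack_sp_per_mhz = mflops_per_mhz(linpack_sp_mflops, CPU_MHZ)

print(f"flops (DP):   {flops_dp_per_mhz:.4f} MFLOPS/MHz")
print(f"linpack (SP): {linpack_sp_per_mhz:.4f} MFLOPS/MHz")
```

Both values come out well below 0.1 MFLOPS/MHz, which is the gap being asked about here.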

  • Max,

    I just posted the only examples I have of single and double precision floating point benchmarks. I don't know how accurately they compare to each other or to a real system. But to address some of your questions: the VFP on the Cortex-A8 is not as powerful as on previous ARM cores, whereas NEON is fast in comparison. To summarize, VFP supports both double and single precision floating point, but NEON only supports single precision. See this wiki page for more information and an example comparing NEON and VFP performance on the same piece of code: http://wiki.omap.com/index.php/Cortex_A8#Neon_and_VFP_both_support_floating_point.2C_which_should_I_use.3F

    To use NEON, the code has to be vectorized. There is a compiler flag to enable auto-vectorization, but unfortunately many benchmarks will not auto-vectorize. I do see on the ARM web page that the most recent RealView compiler does a better job of auto-vectorizing benchmarks, but I have not yet tried it. I do have an assembly NEON benchmark in which I was trying to maximize MFLOPS. I can dig it out and post the results if you are interested in pursuing NEON further.

    Regards,

    Jeff L

  • Jeff,

     

    I am very surprised that the VFPv3 in the Cortex-A8 is so slow even when running at 600MHz. Can you disable the VFPv3 and run the same programs with software floating-point emulation? I think the ARM VFP9-S gets around 4.5 MFLOPS with software emulation at 200MHz. It would be interesting to see the comparison.

     

    I am also very interested in the NEON numbers. If you don't mind, can you show me your CFLAGS and source code for the NEON test?

     

    Thanks very much,

     

    Max

  • Max,

    I can run the tests you are asking for with the floating point libs, but not until early next week. Maybe someone else will jump in with results.

    I'm not sure I can post the NEON code to this forum; I need to check whether it is O.K. first. For now I will post the inner assembly loop of the NEON test and the results I obtained:

    Running again at a 600MHz CPU clock rate, I wrote some code using NEON intrinsics which performed 900 million SIMD single precision multiply operations in 3.008 seconds. Each operation performed 4 single precision multiplies, so there were actually 3.6 billion multiplies. Below is the inner loop, which was iterated 100 million times.

    .L7:
            add     r2, r2, #1
            vmul.f32        q9, q9, q8
            cmp     r2, r1
            vmul.f32        q2, q2, q8
            vmul.f32        q3, q3, q8
            vmul.f32        q15, q15, q8
            vmul.f32        q14, q14, q8
            vmul.f32        q13, q13, q8
            vmul.f32        q12, q12, q8
            vmul.f32        q11, q11, q8
            vmul.f32        q10, q10, q8
            bne     .L7

    Regards,

    Jeff L
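The figures in the post above can be turned into a throughput number with a few lines of arithmetic. This is a minimal sketch that uses only the quoted values (9 `vmul.f32` instructions per iteration, 100 million iterations, 4 lanes per instruction, 3.008 seconds, 600MHz). It assumes the elapsed time is dominated by the loop and that each scalar multiply counts as one floating-point operation; the constant names are illustrative.

```python
# Throughput implied by the NEON inner loop quoted above.
# Inputs are the numbers from the post; names are made up for this sketch.

VMULS_PER_ITERATION = 9      # vmul.f32 instructions in the .L7 loop body
ITERATIONS = 100_000_000     # loop trip count stated in the post
LANES_PER_VMUL = 4           # a q register holds four 32-bit floats
ELAPSED_SECONDS = 3.008      # measured run time stated in the post
CPU_HZ = 600_000_000         # Cortex-A8 clock

simd_ops = VMULS_PER_ITERATION * ITERATIONS          # 900 million
scalar_multiplies = simd_ops * LANES_PER_VMUL        # 3.6 billion
mflops = scalar_multiplies / ELAPSED_SECONDS / 1e6   # assumes 1 multiply = 1 flop
cycles_per_vmul = CPU_HZ * ELAPSED_SECONDS / simd_ops

print(f"{mflops:.1f} MFLOPS, {cycles_per_vmul:.3f} cycles per vmul.f32")
```

Under those assumptions the loop sustains roughly 1.2 GFLOPS of single precision multiplies, or about two clock cycles per four-lane `vmul.f32`. It is a peak-style microbenchmark, so it is not directly comparable to the flops or Linpack results.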

     

  • Jeff,

    I'd appreciate it if you could run the comparison tests as soon as you can. Thanks very much.

     

    Max

  • Max,

    Again, these benchmarks can give a comparison, but I wouldn't quote the MFLOPS from these tests. How are you getting the MFLOPS numbers you are quoting? Maybe we should be comparing apples to apples.

    I ran the same benchmarks as earlier with similar compiler flags, just without enabling the VFP.

    For flops:

    CFLAGS= -O3 -march=armv7-a -mtune=cortex-a8 -msoft-float

    [root@OMAP3EVM /neon]# ./flops2

       FLOPS C Program (Double Precision), V2.0 18 Dec 1992

       Module     Error        RunTime      MFLOPS
                                (usec)
         1     -1.5632e-13      1.9727      7.0970
         2     -1.0347e-13      1.3584      5.1531
         3     -3.1197e-14      1.6230     10.4741
         4      7.7938e-14      1.4697     10.2060
         5     -3.2641e-14      3.5664      8.1314
         6     -9.9920e-16      2.8896     10.0358
         7     -5.5650e-11      3.0850      3.8898
         8      2.7700e-14      2.9600     10.1353

       Iterations      =    8000000
       NullTime (usec) =     0.0000
       MFLOPS(1)       =     6.1794
       MFLOPS(2)       =     6.3701
       MFLOPS(3)       =     8.3113
       MFLOPS(4)       =    10.1763

    For Linpack:

    CFLAGS=-DSP -DROLL -O3 -march=armv7-a -mtune=cortex-a8 -msoft-float

    [root@OMAP3EVM linpack]# ./linpack_arm
    Rolled Single Precision Linpack

    Rolled Single Precision Linpack

         norm. resid      resid           machep         x[0]-1        x[n-1]-1
           1.6        3.80277634e-05  1.19209290e-07 -1.38282776e-05 -7.51018524e-06
        times are reported for matrices of order   100
          dgefa      dgesl      total       kflops     unit      ratio
     times for array with leading dimension of  201
           0.05       0.00       0.05      14649       0.14       0.84
           0.04       0.00       0.04      17579       0.11       0.70
           0.05       0.00       0.05      14649       0.14       0.84
           0.05       0.00       0.05      14649       0.14       0.84
     times for array with leading dimension of 200
           0.04       0.00       0.04      17578       0.11       0.70
           0.05       0.00       0.05      14649       0.14       0.84
           0.05       0.00       0.05      14649       0.14       0.84
           0.05       0.00       0.05      14897       0.13       0.82
    Rolled Single  Precision 14649 Kflops ; 10 Reps

    Jeff L
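To make the VFP versus soft-float comparison concrete, the ratios can be computed directly from the two sets of logs in this thread. This is a rough sketch, not a controlled comparison: the two Linpack builds differ in more than the float ABI (the VFP build also passed -ftree-vectorize, and the program labels the runs "Unrolled" and "Rolled" respectively), and the two flops runs used different iteration counts. The dictionary and variable names are illustrative.

```python
# Speedup of the VFP builds over the -msoft-float builds,
# using the summary lines posted in this thread.

flops_vfp = {"MFLOPS(1)": 38.6977, "MFLOPS(2)": 32.5258,
             "MFLOPS(3)": 36.8781, "MFLOPS(4)": 39.9460}
flops_soft = {"MFLOPS(1)": 6.1794, "MFLOPS(2)": 6.3701,
              "MFLOPS(3)": 8.3113, "MFLOPS(4)": 10.1763}

# Ratio of hardware VFP throughput to software floating point, per summary line.
flops_speedup = {name: flops_vfp[name] / flops_soft[name] for name in flops_vfp}

linpack_vfp_kflops = 32553    # final line of the VFP Linpack run
linpack_soft_kflops = 14649   # final line of the soft-float Linpack run
linpack_speedup = linpack_vfp_kflops / linpack_soft_kflops

for name, ratio in flops_speedup.items():
    print(f"flops {name}: {ratio:.2f}x")
print(f"linpack: {linpack_speedup:.2f}x")
```

On these numbers the VFP build of flops is roughly 4x to 6x faster than soft-float, while the Linpack result is only a little over 2x.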