ARM OpenMP runs slower than single-thread

Mohamed Mansour

I built an ARM-only (no DSP) OpenMP test program (a simple vector add) on Keystone II. The code works and produces the correct result, but the OpenMP version turns out to be much slower than the single-core, single-threaded ARM version.

I checked the number of available processors using omp_get_num_procs(), and it returns the correct number: 4.

I made the computation in each loop iteration heavier by taking the logarithm and exponential before adding, but the single-threaded version still appears to be much faster.

Is this normal behavior? Is there any advice for boosting OpenMP performance?

over 11 years ago

0 dzhou over 11 years ago

TI__Genius 9065 points

Hello Mohamed,

Can you share the code to demo the issue?

regards,

David

0 Mohamed Mansour over 11 years ago in reply to dzhou

TI__Prodigy 100 points

Hi David,

Here is the piece of code:

    cout << "num of cores = " << omp_get_num_procs()
         << ", max threads = " << omp_get_max_threads() << endl;
    typedef uint32 dataType;

    dataType *A  = new dataType[VEC_SIZE];
    dataType *B  = new dataType[VEC_SIZE];
    dataType *C  = new dataType[VEC_SIZE];
    dataType *C2 = new dataType[VEC_SIZE];

    // Fill the inputs with random data
    uint32 i;
    for (i = 0; i < VEC_SIZE; i++)
    {
        A[i] = rand();
        B[i] = rand();
    }

    // Single ARM core, single thread
    double t1 = get_time_us();
    for (i = 0; i < VEC_SIZE; i++)
    {
        C[i] = A[i] * B[i]; // log(exp(A[i]))*log(exp(B[i]));
    }
    double t2 = get_time_us();
    cout << "Single time = " << (uint32)(t2 - t1) << " us" << endl;

    // OpenMP version (the loop variable i is private to each thread)
    t1 = get_time_us();
    #pragma omp parallel for
    for (i = 0; i < VEC_SIZE; i++)
    {
        C2[i] = A[i] * B[i];
    }
    t2 = get_time_us();

    // Compare both results; subtract in double to avoid unsigned wrap-around
    double err = 0;
    for (i = 0; i < VEC_SIZE; i++)
    {
        double d = (double)C2[i] - (double)C[i];
        err += d * d;
    }

    cout << "Multiple time = " << (uint32)(t2 - t1) << " us" << " err = " << err << endl;

    // Arrays allocated with new[] must be released with delete[]
    delete[] A; delete[] B; delete[] C; delete[] C2;
    return 0;

where get_time_us() returns the absolute system time in microseconds. For VEC_SIZE = 2000000L, the output I get is:

num of cores = 4, max threads = 4
Single time = 13045 us
Multiple time = 33634 us err = 0

Is there another way (rather than omp_get_num_procs()) to check whether ARM cores 1-3 are activated?
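One possible way to check this independently of OpenMP, as a minimal sketch: on Linux with glibc, sysconf() can report how many processors the kernel currently has online and how many are configured. The helper names online_cpus() and configured_cpus() are hypothetical, and the two _SC_NPROCESSORS_* names are a common extension rather than part of POSIX, so their availability on a given toolchain is an assumption.

```cpp
#include <unistd.h>  // sysconf, _SC_NPROCESSORS_ONLN, _SC_NPROCESSORS_CONF

// Number of processors the kernel currently has online
// (hypothetical helper name, not part of the original program).
long online_cpus()
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}

// Number of processors configured in the system, online or not
// (hypothetical helper name, not part of the original program).
long configured_cpus()
{
    return sysconf(_SC_NPROCESSORS_CONF);
}
```

If online_cpus() returns fewer than 4 on this device, some cores are not online. On Linux the same information is also listed in /sys/devices/system/cpu/online. Note that this only shows that the cores are available, not that the OpenMP threads are actually scheduled on them.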

0 Mohamed Mansour over 11 years ago in reply to Mohamed Mansour

TI__Prodigy 100 points

I think I figured out the problem (thanks to Alan).

To compute the time I was using

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &newTime);

With OpenMP, CLOCK_PROCESS_CPUTIME_ID reports the CPU time consumed by all threads of the process added together, so it matches the elapsed time only when a single thread is used. To measure elapsed time, use CLOCK_MONOTONIC instead.
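For reference, a minimal sketch of what a wall-clock get_time_us() could look like. The original implementation of get_time_us() was not posted, so the body below is an assumption, and read_clock_us() and get_cpu_time_us() are hypothetical helpers added only to show the two clocks side by side.

```cpp
#include <time.h>  // clock_gettime, CLOCK_MONOTONIC, CLOCK_PROCESS_CPUTIME_ID (POSIX)

// Read the given POSIX clock and convert its value to microseconds.
// (Hypothetical helper, not part of the original program.)
static double read_clock_us(clockid_t clock_id)
{
    struct timespec ts;
    clock_gettime(clock_id, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

// Elapsed (wall-clock) time: suitable for timing a parallel region.
double get_time_us()
{
    return read_clock_us(CLOCK_MONOTONIC);
}

// CPU time summed over all threads of the process: with several busy
// threads this grows faster than wall-clock time. Shown for comparison only.
double get_cpu_time_us()
{
    return read_clock_us(CLOCK_PROCESS_CPUTIME_ID);
}
```

Taking the difference of two get_time_us() readings around the parallel loop gives the elapsed time. Doing the same with get_cpu_time_us() gives the total CPU time across all threads, which is the quantity the earlier measurement was reporting.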
