ARM openMP runs slower than single-thread



I built an ARM-only (no DSP) OpenMP test program (a simple vector add) on Keystone II. The code works and produces the correct result,

but it turns out that the OpenMP version is much slower than the single-core, single-threaded ARM version.

I checked the number of available processors using omp_get_num_procs(), and it gives the correct number: 4.

I made the work in each loop iteration heavier by taking the logarithm and exponential before adding, but the single-threaded version still seems to be much faster.

Is this normal behavior? Is there any advice to boost the OpenMP performance?

  • Hello Mohamed,

    Can you share the code to demo the issue?

    regards,

    David

  • Hi David,

    Here is the piece of code:

        cout << "num of cores = " << omp_get_num_procs() << ", max threads =  " << omp_get_max_threads()<< endl;
        typedef uint32 dataType ;

        dataType * A= new dataType[VEC_SIZE];
        dataType * B = new dataType[VEC_SIZE];
        dataType * C = new dataType[VEC_SIZE];
        dataType * C2 = new dataType[VEC_SIZE];
     // single ARM
        uint32 i;
        for(i=0;i<VEC_SIZE;i++)
        {
            A[i] = rand();
            B[i] = rand();
        }
        double t1 = get_time_us();
        for(i=0;i<VEC_SIZE;i++)
        {
            C[i] = A[i]*B[i];//log(exp(A[i]))*log(exp(B[i]));
        }
        double t2 = get_time_us();
        cout<< "Single time = " << (uint32)(t2-t1) << " us" << endl;

        t1 = get_time_us();
    #pragma omp parallel for
        for(i=0 ; i < VEC_SIZE ; i++)
        {
            C2[i] = A[i]*B[i];
        }

        t2 = get_time_us();
        double err = 0;
        for(i=0;i<VEC_SIZE;i++)
            err += (C2[i]-C[i])*(C2[i]-C[i]);


        cout<< "Multiple time = " << (uint32)(t2-t1) << " us" << "  err = " << err << endl;


        // Arrays allocated with new[] must be released with delete[]
        delete[] A;     delete[] B;    delete[] C;    delete[] C2;
        return 0;


    where get_time_us() returns the absolute system time in us. For VEC_SIZE = 2000000L, the output I am getting is:

    num of cores = 4, max threads =  4
    Single time = 13045 us
    Multiple time = 33634 us  err = 0

    Is there another way (rather than omp_get_num_procs()) to check whether ARM cores 1-3 are activated?

  • I think I figured out the problem (thanks to Alan).

    To compute the time I was using

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &newTime);

    With OpenMP, CLOCK_PROCESS_CPUTIME_ID reports the CPU time consumed by all threads of the process added together, so it only matches the elapsed time when a single thread is used. To measure elapsed (wall-clock) time, use CLOCK_MONOTONIC instead.
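
    For anyone hitting the same thing, here is a minimal sketch of what the fixed timer might look like. The original get_time_us() was never posted, so its body here is an assumption, and read_clock_us() and get_cpu_time_us() are hypothetical helper names added only for illustration. It assumes a POSIX system (Linux on the ARM cores) where clock_gettime() is available.

```cpp
#include <cassert>
#include <time.h>   // clock_gettime, clockid_t, CLOCK_MONOTONIC (POSIX)

// Hypothetical helper: read the given POSIX clock and convert to microseconds.
static double read_clock_us(clockid_t clock_id)
{
    timespec ts;
    int rc = clock_gettime(clock_id, &ts);
    assert(rc == 0);   // sanity check only; real code may want proper error handling
    (void)rc;
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

// Elapsed (wall-clock) time in microseconds: suitable for timing a parallel region.
static double get_time_us()
{
    return read_clock_us(CLOCK_MONOTONIC);
}

// CPU time of the whole process in microseconds, summed over all of its threads.
// With several busy threads this can grow faster than wall-clock time.
static double get_cpu_time_us()
{
    return read_clock_us(CLOCK_PROCESS_CPUTIME_ID);
}
```

    Calling get_time_us() before and after the "#pragma omp parallel for" loop should then give the elapsed time of the loop rather than the combined CPU time of all threads. The OpenMP runtime function omp_get_wtime(), which returns wall-clock time in seconds, is another option if you prefer not to call clock_gettime() directly.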