I built an ARM-only (no DSP) OpenMP test program that does a simple vector add on Keystone II. The code works and produces the correct result,
but the OpenMP version turns out to be much slower than the single-core, single-threaded ARM version.
I checked the number of available processors using omp_get_num_procs(), and it returns the correct number: 4.
I also made the work in each loop iteration more involved by taking the logarithm and exponential before adding, but the single-threaded version still seems to be much faster.
Is this normal behavior? Is there any advice for boosting OpenMP performance?
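
For reference, the comparison described above can be sketched roughly as follows. This is a minimal illustration, not the exact program: the function names (vec_add_serial, vec_add_omp, available_procs, wall_seconds) are hypothetical, only the plain vector-add variant is shown, and it assumes a GCC-style build with -fopenmp.

```c
#include <stddef.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Single-threaded reference: c[i] = a[i] + b[i]. */
void vec_add_serial(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Same loop, with iterations split statically across the OpenMP threads.
 * Without -fopenmp the pragma is ignored and this runs serially. */
void vec_add_omp(const float *a, const float *b, float *c, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        c[i] = a[i] + b[i];
}

/* Processor count as reported by the OpenMP runtime (1 if built without it). */
int available_procs(void)
{
#ifdef _OPENMP
    return omp_get_num_procs();
#else
    return 1;
#endif
}

/* Timestamp in seconds. With OpenMP this is wall-clock time; the fallback
 * uses clock(), which is CPU time and only used in the single-threaded build. */
double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}
```

Each version would be timed by calling wall_seconds() immediately before and after the call on a large array. Wall-clock time is used in the OpenMP build because clock() reports CPU time for the whole process, which adds up across threads and can make a parallel run look slower than it is.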