Hi everyone,
I am trying out OpenMP, and after the Hello World example I went on to something more complex: the matrix-vector multiplication example.
When I compared the OpenMP version against sequential code for the same block, the sequential code turned out to be ~20 times faster.
Here is the OpenMP block:
#pragma omp parallel shared(A,b,c,total) private(tid,i)
{
    tid = omp_get_thread_num();

    /* Loop work-sharing construct - distribute rows of matrix */
    #pragma omp for private(j)
    for (i=0; i < SIZE; i++)
    {
        for (j=0; j < SIZE; j++)
            c[i] += (A[i][j] * b[i]);

        /* Update and display of running total must be serialized */
        #pragma omp critical
        {
            total = total + c[i];
            // printf(" thread %d did row %d\t c[%d]=%.2f\t",tid,i,i,c[i]);
            // printf("Running total= %.2f\n",total);
        }
    } /* end of parallel i loop */
} /* end of parallel construct */
And the equivalent sequential one:
for (i=0; i < SIZE; i++)
{
    for (j=0; j < SIZE; j++)
        c[i] += (A[i][j] * b[i]);
    total = total + c[i];
}
These are the timings I measured with omp_get_wtime():
OpenMP: 74.850458 microseconds
Sequential: 4.183113 microseconds
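For anyone trying to reproduce the measurement, here is a minimal sketch of how the two blocks might be timed with omp_get_wtime(), which returns seconds as a double. The post does not give SIZE, the input values, or the number of repetitions, so SIZE = 10, the fill values, and the helper names (init_data, run_sequential, run_parallel, time_us) are all assumptions made for illustration. The clock() fallback is only there so the sketch also builds without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
/* Fallback (assumption): CPU time from clock() when OpenMP is unavailable */
static double omp_get_wtime(void) { return (double)clock() / CLOCKS_PER_SEC; }
#endif

#define SIZE 10   /* hypothetical; the post does not state SIZE */

static float A[SIZE][SIZE], b[SIZE], c[SIZE];

/* Fill the inputs with hypothetical values and zero the output vector */
static void init_data(void)
{
    int i, j;
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++)
            A[i][j] = 1.0f;
        b[i] = 2.0f;
        c[i] = 0.0f;
    }
}

/* Sequential block as posted; returns the running total */
static float run_sequential(void)
{
    int i, j;
    float total = 0.0f;
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++)
            c[i] += (A[i][j] * b[i]);
        total = total + c[i];
    }
    return total;
}

/* OpenMP block as posted (without the tid bookkeeping); returns the total */
static float run_parallel(void)
{
    int i, j;
    float total = 0.0f;
    #pragma omp parallel shared(A, b, c, total) private(i)
    {
        #pragma omp for private(j)
        for (i = 0; i < SIZE; i++) {
            for (j = 0; j < SIZE; j++)
                c[i] += (A[i][j] * b[i]);
            /* Update of the shared total is serialized */
            #pragma omp critical
            {
                total = total + c[i];
            }
        }
    }
    return total;
}

/* Average elapsed time of fn in microseconds over reps runs */
static double time_us(float (*fn)(void), int reps, float *result)
{
    double elapsed = 0.0;
    int r;
    for (r = 0; r < reps; r++) {
        double t0;
        init_data();                      /* reset c[] outside the timed region */
        t0 = omp_get_wtime();
        *result = fn();
        elapsed += omp_get_wtime() - t0;
    }
    return elapsed / reps * 1e6;          /* seconds -> microseconds */
}
```

Calling time_us(run_sequential, reps, &total) and time_us(run_parallel, reps, &total) with the same reps gives two comparable averages. Averaging over several runs is a choice made here to smooth out one-off startup cost; the original numbers may have come from a single run.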
Can someone try to reproduce these results?
I have tried some of the Pi calculation examples from the OpenMP presentation, and in the case where we accumulate a sum we really do get an impressive performance increase. But when I set up the loop structure for my own project, OpenMP came out about 5-10 times slower than the sequential code (I even aligned my buffers to the cache line size).
Does anyone know the reason for this kind of performance?
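For context, the accumulating-sum case mentioned here typically looks something like the sketch below: a midpoint-rule estimate of pi as the integral of 4/(1+x^2) over [0,1], with the partial sums combined through a reduction clause. The exact code from the presentation is not reproduced in this post, so the function name estimate_pi and the step count are assumptions.

```c
/* Midpoint-rule estimate of pi = integral of 4/(1+x^2) over [0,1] */
static double estimate_pi(long num_steps)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    long i;

    /* Each thread accumulates a private partial sum; OpenMP combines
       the partial sums once the loop has finished */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;   /* midpoint of the i-th interval */
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

Unlike the matrix-vector block above, this loop does not enter a critical section on every iteration.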
I am using a TMS320C6670 DSP with OpenMP 2_01_16_03, XDC tools 3_23_04_60 and SYS/BIOS 6_33_06_50.
Best regards
Pavlo!