OpenMP Example Matrix-vector multiplication performance

Pavlo Sergiienko

Other Parts Discussed in Thread: TMS320C6670

Hi everyone,

I'm trying out OpenMP, and after the Hello World example I went on to something more complex: the matrix-vector multiplication example.

When I compared the OpenMP version against sequential code for the same block, the sequential code turned out to be ~20 times faster.

Here is the OpenMP block:

#pragma omp parallel shared(A,b,c,total) private(tid,i)
  {
      tid = omp_get_thread_num();

      /* Loop work-sharing construct - distribute rows of matrix */
#pragma omp for private(j)
      for (i=0; i < SIZE; i++)
      {
          for (j=0; j < SIZE; j++)
              c[i] += (A[i][j] * b[i]);

          /* Update and display of running total must be serialized */
#pragma omp critical
          {
              total = total + c[i];
//              printf("  thread %d did row %d\t c[%d]=%.2f\t",tid,i,i,c[i]);
//              printf("Running total= %.2f\n",total);
          }

      }   /* end of parallel i loop */

  } /* end of parallel construct */

And the equivalent sequential one:

  for (i=0; i < SIZE; i++)
  {
      for (j=0; j < SIZE; j++)
          c[i] += (A[i][j] * b[i]);
      total = total + c[i];
  }
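
For comparison, the same loop can be written with a reduction clause instead of a critical section, which is the pattern the Pi examples use for accumulating a sum. The sketch below is only an illustration: whether the critical section explains the slowdown is not confirmed anywhere in this thread. The function name matvec_total and the value of SIZE are hypothetical, c[i] is overwritten rather than accumulated, and b is indexed by j as the usual definition of a matrix-vector product requires, whereas the posted code uses b[i].

```c
#define SIZE 10   /* hypothetical value; the post does not state SIZE */

/* Computes c = A * b and returns the sum of all elements of c.
 * Builds with or without OpenMP: the pragma is ignored by a
 * compiler that has OpenMP support switched off. */
float matvec_total(float A[SIZE][SIZE], const float b[SIZE], float c[SIZE])
{
    float total = 0.0f;
    int i, j;

    /* Rows are distributed across threads; each thread keeps a private
     * copy of total, and the copies are summed when the loop ends. */
    #pragma omp parallel for private(j) reduction(+:total)
    for (i = 0; i < SIZE; i++) {
        float row = 0.0f;          /* accumulate locally, store c[i] once */
        for (j = 0; j < SIZE; j++)
            row += A[i][j] * b[j];
        c[i] = row;
        total += row;              /* no critical section needed */
    }
    return total;
}
```

Because each thread only touches its private copy of total inside the loop, the threads no longer wait for each other once per row.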

The timings I measured with the omp_get_wtime() function were as follows:

OpenMP: 74.850458 microseconds

Sequential: 4.183113 microseconds

Can someone try to reproduce these results?
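
For anyone reproducing the measurement, a timing helper along the following lines may be useful. It is a minimal sketch, not the code that produced the numbers above: the names now_seconds, average_us and sample_kernel are hypothetical, and the warm-up call and averaging over several runs are assumptions about good practice for intervals of only a few microseconds. It falls back to clock() when the compiler is not in OpenMP mode.

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Seconds from omp_get_wtime() when built with OpenMP, else from clock(). */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Average time of one kernel() call in microseconds over `repeats` runs.
 * One untimed warm-up call is made first so that one-time costs
 * (thread start-up, cold caches) are not counted. */
double average_us(void (*kernel)(void), int repeats)
{
    int r;
    double start;

    kernel();                      /* warm-up, not timed */
    start = now_seconds();
    for (r = 0; r < repeats; r++)
        kernel();
    return (now_seconds() - start) * 1e6 / repeats;
}

/* Hypothetical stand-in for the block being measured. */
static volatile double sink;
void sample_kernel(void)
{
    double s = 0.0;
    int i;
    for (i = 0; i < 1000; i++)
        s += i * 0.5;
    sink = s;                      /* keep the loop from being optimized away */
}
```

A call such as average_us(sample_kernel, 100) then gives a per-call figure that can be compared between the OpenMP and sequential variants.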

I have tried some of the Pi calculation examples from the OpenMP presentation, and in the case where we accumulate a sum we really can get an amazing performance increase. But when I was planning the loop structure in my project, OpenMP turned out to be about 5-10 times slower than sequential code (I even aligned my buffers to the cache line size).

Does anyone know the reason why the performance is like that?

I'm using the TMS320C6670 DSP, OpenMP 2_01_16_03, XDC tools 3_23_04_60, SYS/BIOS 6_33_06_50.

Best regards

Pavlo!

over 9 years ago

0 Raja over 9 years ago

TI__Guru* 81335 points

Hi Pavlo,
Please refer to the threads below to improve the performance:

e2e.ti.com/.../415412
e2e.ti.com/.../1154659

Thank you.

0 Pavlo Sergiienko over 9 years ago in reply to Raja

Intellectual 660 points

Hi Raja,

Thank you for the links. Unfortunately, they don't answer the question about the matrix-vector application. Since I don't have much time, I have optimized only the loop in my application and never got to the matrix example. What I do know is that nested loops are a very bad idea for OpenMP: they make false sharing between cores even worse.
So my first suggestion is to align the input and output data on 128 bytes, which is the cache line length on the C6670.
The second is to split the input and output data blocks so that they don't overlap in memory. E.g. core 0 does the first 1/4 of the data, core 1 the second 1/4, and so on.
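
The two suggestions above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from my application: the buffer names, N, NUM_CORES, and the helper functions are hypothetical, and the alignment uses C11 alignas rather than any TI-specific pragma. A static schedule with a chunk of N / NUM_CORES hands each thread one contiguous quarter of the data, and the compile-time check makes sure each quarter is a whole number of 128-byte lines so no line is shared between cores.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 128                 /* line length quoted above for the C6670 */
#define N          1024                /* hypothetical buffer length */
#define NUM_CORES  4                   /* hypothetical core count */
#define BLOCK      (N / NUM_CORES)     /* elements handled by each core */

/* Each core's block must cover whole cache lines. */
static_assert((BLOCK * sizeof(float)) % CACHE_LINE == 0,
              "block must be a whole number of cache lines");

/* Input and output buffers start on a cache line boundary. */
static alignas(CACHE_LINE) float in_buf[N];
static alignas(CACHE_LINE) float out_buf[N];

/* Returns 1 if p sits on a cache line boundary, 0 otherwise. */
int is_line_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE) == 0;
}

/* out_buf = k * in_buf, one contiguous block per thread. */
void scale_blocks(float k)
{
    int i;

    /* schedule(static, BLOCK): thread 0 gets [0, BLOCK),
     * thread 1 gets [BLOCK, 2*BLOCK), and so on. */
    #pragma omp parallel for schedule(static, BLOCK) num_threads(NUM_CORES)
    for (i = 0; i < N; i++)
        out_buf[i] = k * in_buf[i];
}
```

A single flat loop is used here on purpose, in line with the remark about nested loops.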

But I have a question about QMSS initialization for OpenMP usage. Should I continue in this thread or create a new one?
The problem is the following: our application already uses QMSS regions, but OpenMP requires this function to be launched from the config file: Startup.lastFxns.$add('&__TI_omp_initialize_rtsc_mode'); As I read in the OpenMP_user_guide, it initializes the QMSS by default if that has not been done yet. I don't want that.

So, as the manual says, I made some changes in my config file:
ompSettings.runtimeInitializesQmss = false;
OpenMP.qmssMemRegionIndex = 4;
OpenMP.qmssFirstDescIdxInLinkingRam = 0;
I also put Startup.lastFxns.$add('CustomQMSSinit'); before the line Startup.lastFxns.$add('&__TI_omp_initialize_rtsc_mode'); where the CustomQMSSinit function just performs the Qmss_init(), Qmss_start() routine and sets up the memory regions.

But that way I'm getting this error:
[C66xx_0] Error! Inserting memory region -1, Error code : -138
[C66xx_1] Error! Inserting memory region -1, Error code : -138
[C66xx_2] Error! Inserting memory region -1, Error code : -138
[C66xx_3] Error! Inserting memory region -1, Error code : -138
[C66xx_0] INTERNAL ERROR: QMSS queue operation failed - src/omp_init.c, 182

Any idea what could be wrong?

Best Regards,
Pavlo!

0 Garrett Ding over 9 years ago in reply to Pavlo Sergiienko

TI__Mastermind 43296 points

Hi Pavlo,

From e2e.ti.com/.../468231, you seem to have fixed the build error. I have also updated that thread with regard to the runtime issue.

Regards,
Garrett

0 Garrett Ding over 9 years ago in reply to Garrett Ding

TI__Mastermind 43296 points

Pavlo,

Please refer to e2e.ti.com/.../472014 for the matvec performance issue.

Regards,
Garrett
