OpenMP performance with MATHLIB

Hi Team, 

Here is the background on the system:

Hardware: C6678 EVM Rev 3. 

OMP: omp_1_02_00_05 (following this post: e2e.ti.com/.../261272)

MATHLIB: mathlib_c66x_3_0_1_1

MCSDK: mcsdk_2_01_02_06

PDK: pdk_C6678_1_1_2_6

Compiler Optimization level: 3

Description: I am performing a performance analysis of the MATHLIB functions on the C6678. The first test was run on Core 0 with no OS and no OpenMP. All MATHLIB functions were tested, but I will pick just one here to explain the process and results.

The time to run logsp over a 4K buffer was measured as follows:

// Get Start Time
g_ui64StartTime = (uint64_t)(TSCL);
g_ui64StartTime |= (uint64_t)((uint64_t)TSCH << 32);

for (ui32Idx = 0; ui32Idx < ui32IdxCount; ui32Idx++)  // ui32IdxCount = 4096
{
    pfBuffer[ui32Idx] = logsp(pfBuffer[ui32Idx]);
}

// Get Stop Time
g_ui64StopTime = (uint64_t)(TSCL);
g_ui64StopTime |= (uint64_t)((uint64_t)TSCH << 32);
g_ui64ElapsedTime = g_ui64StopTime - g_ui64StartTime;

The time to run the inline version, logsp_i, over the same 4K buffer was measured the same way:

// Get Start Time
g_ui64StartTime = (uint64_t)(TSCL);
g_ui64StartTime |= (uint64_t)((uint64_t)TSCH << 32);

for (ui32Idx = 0; ui32Idx < ui32IdxCount; ui32Idx++)  // ui32IdxCount = 4096
{
    pfBuffer[ui32Idx] = logsp_i(pfBuffer[ui32Idx]);
}

// Get Stop Time
g_ui64StopTime = (uint64_t)(TSCL);
g_ui64StopTime |= (uint64_t)((uint64_t)TSCH << 32);
g_ui64ElapsedTime = g_ui64StopTime - g_ui64StartTime;

The inline function (logsp_i) did much better, as expected, averaging 10 cycles per log operation.
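
(For reference, the cycles-per-operation figure quoted here and below is just the elapsed cycle count divided by the element count; a minimal sketch, with ui64CyclesPerOp introduced purely for illustration:)

// Average cycles per log operation (integer division is close enough here)
uint64_t ui64CyclesPerOp = g_ui64ElapsedTime / (uint64_t)ui32IdxCount;  // ~10 for logsp_i on a single core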

The next step is to further improve performance by parallelizing the computation across all 8 cores. To do this, I started with the omp_hello example from “\ti\omp_1_02_00_05\packages\ti\omp\examples” and then modified the code to do the same computation as above:

Non-inline version first:

// Get Start Time
g_ui64StartTime = (uint64_t)(TSCL);
g_ui64StartTime |= (uint64_t)((uint64_t)TSCH << 32);

#pragma omp parallel private(ui32Idx) shared(ui32IdxCount, pfBuffer)
{
    #pragma omp for
    for (ui32Idx = 0; ui32Idx < ui32IdxCount; ui32Idx++)
    {
        pfBuffer[ui32Idx] = logsp(pfBuffer[ui32Idx]);
    }
}

// Get Stop Time
g_ui64StopTime = (uint64_t)(TSCL);
g_ui64StopTime |= (uint64_t)((uint64_t)TSCH << 32);
g_ui64ElapsedTime = g_ui64StopTime - g_ui64StartTime;

Inline version next:

// Get Start Time
g_ui64StartTime = (uint64_t)(TSCL);
g_ui64StartTime |= (uint64_t)((uint64_t)TSCH << 32);

#pragma omp parallel private(ui32Idx) shared(ui32IdxCount, pfBuffer)
{
    #pragma omp for
    for (ui32Idx = 0; ui32Idx < ui32IdxCount; ui32Idx++)
    {
        pfBuffer[ui32Idx] = logsp_i(pfBuffer[ui32Idx]);
    }
}

// Get Stop Time
g_ui64StopTime = (uint64_t)(TSCL);
g_ui64StopTime |= (uint64_t)((uint64_t)TSCH << 32);
g_ui64ElapsedTime = g_ui64StopTime - g_ui64StartTime;

Problem: The inline function (logsp_i) is much slower when running under OpenMP. The performance of the non-inline function improves, but the inline version goes from 10 cycles per operation on a single core with no OpenMP to more than 70 cycles per operation once OpenMP is added. I understand there is overhead to fork when the #pragma omp parallel is reached, but this seems like far too much added overhead. If I run all 8 cores in parallel, I get 9 cycles per operation for the loop above: only 1 cycle better than the single-core, no-OpenMP code. In other words, all 8 cores together perform about as well as 1 core with no parallelism at all. It looks to me as though the software pipelining the compiler applies to the inline function is not working correctly when the project includes OpenMP, and the performance is therefore greatly degraded. I tried placing the data buffer in other sections of memory (it is in DDR3 by default) and that is not the cause of the issue.
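
As a sanity check on how much of this is pure fork/join cost, the same TSC-based timing can be wrapped around an empty parallel region (a minimal sketch, reusing the timing variables from above):

// Get Start Time
g_ui64StartTime = (uint64_t)(TSCL);
g_ui64StartTime |= (uint64_t)((uint64_t)TSCH << 32);

// Empty parallel region: measures only the cost of forking and joining the thread team
#pragma omp parallel
{
    // no work
}

// Get Stop Time
g_ui64StopTime = (uint64_t)(TSCL);
g_ui64StopTime |= (uint64_t)((uint64_t)TSCH << 32);
g_ui64ElapsedTime = g_ui64StopTime - g_ui64StartTime;  // fork/join overhead in cycles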

 

Any ideas of what could be going on?

 

Thanks,

Damian

  • I have requested help with your question.  I hope to get back to you in a few days.

    Thanks and regards,

    -George

  • Hi Damian,

    Use firstprivate for the pointer variables in the parallel region.

    ull logsp_vector_parallel(float *restrict in, float *restrict out, int length)
    {
       ull start, stop;
       int i;

       init_clock();

       start = read_clock();
    #pragma omp parallel for firstprivate(in, out, length)
       for (i=0; i<length; i++)
          out[i] = logsp(in[i]);
       stop = read_clock();

       return stop - start;
    }

    The compiler should be able to figure this out, but by asserting firstprivate each thread gets its own copy of the pointers.
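
    Applied to the loop from your post, the same idea would look roughly like this (a sketch, using the variable names from your code):

    #pragma omp parallel for firstprivate(pfBuffer, ui32IdxCount)
    for (ui32Idx = 0; ui32Idx < ui32IdxCount; ui32Idx++)
    {
       pfBuffer[ui32Idx] = logsp_i(pfBuffer[ui32Idx]);
    }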

    I saw a decent speedup when I tested this, but I had to bump the vector size up to 4*4096 to overcome the OpenMP overhead.

    My code is attached: omp_logsp.zip

    There is an OpenMP 2.0 runtime, but as of now you have to do some work to get it running on the 6678. The 2.0 runtime significantly reduces the overhead of the OpenMP constructs. See here for instructions:

    http://processors.wiki.ti.com/index.php/Porting_OpenMP_2.x_to_KeyStone_1

    Eric