Some performance tests for OpenMP on a C6678 EVM

Other Parts Discussed in Thread: SYSBIOS

I carried out some performance tests for OpenMP on a C6678 EVM using the following code.

#include <stdio.h>
#include <math.h>
#include <omp.h>
#include <ti/csl/csl_tsc.h>    /* CSL_Uint64, CSL_tscRead(); header path may differ with CSL version */

#define SYS_CLK    (1000000000.f)
#define NTHREADS   8
#define N          100000

double    x[N], y[N], z[N];

void main()
{
    int i, nthread;
    CSL_Uint64 start, end;

    CSL_tscEnable();    /* make sure the 64-bit time-stamp counter is running (harmless if already enabled) */

    /******************** Single core *******************/
    start = CSL_tscRead( );
    for(i=0; i<N; i++ )
    {
        x[i] = log(fabs(sin((double)i))+0.1);
        y[i] = log(fabs(cos((double)i))+0.1);
        z[i] = x[i]/y[i];
    }
    end = CSL_tscRead( );
    printf("Elapsed time(1 core ) = %f sec\n", (end-start)/SYS_CLK );
    /***************************************************************/

    for(nthread = 2; nthread<=NTHREADS; nthread++ )
    {
        omp_set_num_threads(nthread);

        /****************** nthread cores **********************/
        start = CSL_tscRead( );
        #pragma omp parallel
        {
            #pragma omp for
            for(i=0; i<N; i++ )
            {
                x[i] = log(fabs(sin((double)i))+0.1);
                y[i] = log(fabs(cos((double)i))+0.1);
                z[i] = x[i]/y[i];
            }
        }
        end = CSL_tscRead( );
        printf("Elapsed time(%d cores) = %f sec\n", nthread, (end-start)/SYS_CLK )
        /********************************************************/
    }
}

I obtained the following results for two cases.

1. Using the default shared region heap (OpenMP.stackRegionId = 0)

Elapsed time(1 core ) = 0.516394 sec
Elapsed time(2 cores) = 0.255472 sec
Elapsed time(3 cores) = 0.170711 sec
Elapsed time(4 cores) = 0.129573 sec
Elapsed time(5 cores) = 0.105580 sec
Elapsed time(6 cores) = 0.088497 sec
Elapsed time(7 cores) = 0.076308 sec
Elapsed time(8 cores) = 0.066832 sec

2. Using a local heap (OpenMP.stackRegionId = -1)

Elapsed time(1 core ) = 0.206598 sec
Elapsed time(2 cores) = 0.102902 sec
Elapsed time(3 cores) = 0.068666 sec
Elapsed time(4 cores) = 0.051578 sec
Elapsed time(5 cores) = 0.044793 sec
Elapsed time(6 cores) = 0.046065 sec
Elapsed time(7 cores) = 0.050581 sec
Elapsed time(8 cores) = 0.053255 sec

Comparing the two cases, using the local heap gives better performance. However, while the performance with the default shared region heap improves almost in proportion to the number of cores (about 7.7x at 8 cores), the performance with the local heap stops improving beyond 5 cores (peaking at about 4.6x).


When using the local heap, why does the performance stop improving beyond 5 cores?

Is there any way to make the performance scale in proportion to the number of cores, as it does when using the default shared region?


The environment for the test is as follows:

CCS v5.5

C6000 compiler 7.4.4

XDCtools 3.24.5.48

SYSBIOS 6.34.4.22

IPC 1.25.1.09

OpenMP BIOS library 1.1.3.02
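
For reference, the only setting changed between the two cases was OpenMP.stackRegionId in the application .cfg file. A minimal sketch of the relevant lines (the ti.omp.utils.OpenMP module path here is an assumption, not copied from my actual project):

var OpenMP = xdc.useModule('ti.omp.utils.OpenMP');   /* module path assumed */

/* Case 1: per-core stacks allocated from shared region 0 (MSMC/DDR) */
OpenMP.stackRegionId = 0;

/* Case 2: per-core stacks allocated from a core-local (L2) heap */
/* OpenMP.stackRegionId = -1; */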

  • IChung,

    If my assumptions are correct, you have:

    1. The x, y, and z arrays in a non-cached shared memory section either in MSMC or DDR memory
    2. Setting stackRegionId = 0 puts the stack of each core into shared MSMC or DDR memory
    3. Setting stackRegionId = -1 puts the stack of each core into local L2 memory

    It is recommended to place the stack of each core into its own local L2 memory for performance reasons. Accessing data on a local stack is always going to be much faster than accessing data on a stack stored in MSMC or DDR memory. This is evidenced by the difference between your two single-core elapsed times above: one core with a remote stack took 0.5164 seconds, while one core with a local stack took only 0.2066 seconds. That's nearly a 2.5x speed-up just from placing the stack in local memory.

    In the local stack case above, it looks like your example is running into memory bandwidth saturation around the 5 or 6 core mark. Because your x, y, and z arrays are stored in remote, non-cached memory, every read or write of an array element is a data access to that remote memory location. Somewhere around 5 cores it appears that these reads and writes start to overwhelm the bandwidth of the remote memory and incur stalls while the previous accesses complete. As more cores are added beyond that point, there are even more outstanding memory requests, which slows your example down further.
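
    To put rough numbers on it: each iteration writes x[i], y[i], and z[i] and (unless the compiler keeps the computed values in registers) reads x[i] and y[i] back, i.e. up to five 8-byte accesses to non-cached memory per iteration, or roughly N * 40 bytes = 4 MB of individual remote accesses per pass. That total is fixed no matter how many cores share the loop, so once the memory path is saturated, extra cores only add outstanding requests rather than throughput.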

    Your first case (with a remote stack for each core) is slowed down by the stack accesses to the point that it never reaches memory bandwidth saturation, so you see a proportional improvement as each core is added.
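
    One quick way to confirm this would be a compute-only variant of the loop that accumulates into an OpenMP reduction variable instead of storing to the shared arrays (a rough, untested sketch reusing the variables from your example). If this version keeps scaling past 5 cores, the remote array accesses are the bottleneck:

        double sum = 0.0;

        start = CSL_tscRead( );
        #pragma omp parallel for reduction(+:sum)
        for(i=0; i<N; i++ )
        {
            double xi = log(fabs(sin((double)i))+0.1);
            double yi = log(fabs(cos((double)i))+0.1);
            sum += xi/yi;   /* stays in registers/on the local stack, no remote array traffic */
        }
        end = CSL_tscRead( );
        printf("Compute-only time = %f sec (sum = %f)\n", (end-start)/SYS_CLK, sum );

    If it does keep scaling, the next thing to look at would be where x, y, and z are placed and whether that section can be made cacheable.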

    Jason Reeder

  • IChung,

    You should also know that we have continued to improve our OpenMP implementation and have released those updates in a software release for KeyStone II devices. A wiki page has been created with instructions on how to download the new OpenMP implementation and port it to KeyStone I devices; please see processors.wiki.ti.com/.../Porting_OpenMP_2.x_to_KeyStone_1 for the porting instructions.

    Thanks,

    Jason Reeder
  • Thank you for your explanation. I also suspected that a memory bandwidth limitation might be preventing further improvement.