I ran some performance tests for OpenMP on a C6678 EVM using the following code.
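#include <stdio.h>
#include <math.h>
#include <omp.h>             /* header path assumed; adjust to your OpenMP BIOS install */
#include <ti/csl/csl_tsc.h>  /* header path assumed; adjust to your PDK install */
#define SYS_CLK (1000000000.f)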
#define NTHREADS 8
#define N 100000
double x[N], y[N], z[N];
void main()
{
int i, nthread;
CSL_Uint64 start, end;
/******************** Single core *******************/
start = CSL_tscRead( );
for(i=0; i<N; i++ )
{
x[i] = log(fabs(sin((double)i))+0.1);
y[i] = log(fabs(cos((double)i))+0.1);
z[i] = x[i]/y[i];
}
end = CSL_tscRead( );
printf("Elapsed time(1 core ) = %f sec\n", (end-start)/SYS_CLK );
/***************************************************************/
for(nthread = 2; nthread<=NTHREADS; nthread++ )
{
omp_set_num_threads(nthread);
/****************** nthread cores **********************/
start = CSL_tscRead( );
#pragma omp parallel
{
#pragma omp for
for(i=0; i<N; i++ )
{
x[i] = log(fabs(sin((double)i))+0.1);
y[i] = log(fabs(cos((double)i))+0.1);
z[i] = x[i]/y[i];
}
}
end = CSL_tscRead( );
printf("Elapsed time(%d cores) = %f sec\n", nthread, (end-start)/SYS_CLK );
/********************************************************/
}
}
I obtained the following results for two cases.
1. Using the default shared region heap (OpenMP.stackRegionId = 0)
Elapsed time(1 core ) = 0.516394 sec
Elapsed time(2 cores) = 0.255472 sec
Elapsed time(3 cores) = 0.170711 sec
Elapsed time(4 cores) = 0.129573 sec
Elapsed time(5 cores) = 0.105580 sec
Elapsed time(6 cores) = 0.088497 sec
Elapsed time(7 cores) = 0.076308 sec
Elapsed time(8 cores) = 0.066832 sec
2. Using the local heap (OpenMP.stackRegionId = -1)
Elapsed time(1 core ) = 0.206598 sec
Elapsed time(2 cores) = 0.102902 sec
Elapsed time(3 cores) = 0.068666 sec
Elapsed time(4 cores) = 0.051578 sec
Elapsed time(5 cores) = 0.044793 sec
Elapsed time(6 cores) = 0.046065 sec
Elapsed time(7 cores) = 0.050581 sec
Elapsed time(8 cores) = 0.053255 sec
Comparing the two cases, the local heap gives better absolute performance. However, while performance with the default shared region heap scales almost linearly with the number of cores, performance with the local heap stops improving at 5 cores and gets slightly worse beyond that.
With the local heap, why does the performance stop improving beyond 5 cores?
Is there any way to make it scale with the number of cores, as it does with the default shared region heap?
The test environment is as follows:
CCS v5.5
C6000 compiler 7.4.4
XDCtools 3.24.5.48
SYS/BIOS 6.34.4.22
IPC 1.25.1.09
OpenMP BIOS library 1.1.3.02