Hi Team,
Here is the background on the system:
Hardware: C6678 EVM Rev 3.
OMP: omp_1_02_00_05 (following this post: e2e.ti.com/.../261272)
MCSDK: mcsdk_2_01_02_06
PDK: pdk_C6678_1_1_2_6
Compiler Optimization level: 3
Description: I am performing a performance analysis of the MATHLIB functions on the C6678. The first test was done on Core 0 with no OS and no OpenMP. All MATHLIB functions were tested, but I will pick just one here to explain the process and results.
The time to complete logsp over 4K of data was measured as follows:
// Get Start Time
g_ui64StartTime  = (uint64_t)TSCL;
g_ui64StartTime |= ((uint64_t)TSCH << 32);

for (ui32Idx = 0; ui32Idx < ui32IdxCount; ui32Idx++)   // ui32IdxCount = 4096
{
    pfBuffer[ui32Idx] = logsp(pfBuffer[ui32Idx]);
}

// Get Stop Time
g_ui64StopTime  = (uint64_t)TSCL;
g_ui64StopTime |= ((uint64_t)TSCH << 32);
g_ui64ElapsedTime = g_ui64StopTime - g_ui64StartTime;
The time to complete the inlined logsp_i over the same 4K of data was measured the same way:
// Get Start Time
g_ui64StartTime  = (uint64_t)TSCL;
g_ui64StartTime |= ((uint64_t)TSCH << 32);

for (ui32Idx = 0; ui32Idx < ui32IdxCount; ui32Idx++)   // ui32IdxCount = 4096
{
    pfBuffer[ui32Idx] = logsp_i(pfBuffer[ui32Idx]);
}

// Get Stop Time
g_ui64StopTime  = (uint64_t)TSCL;
g_ui64StopTime |= ((uint64_t)TSCH << 32);
g_ui64ElapsedTime = g_ui64StopTime - g_ui64StartTime;
The inlined function (logsp_i) did much better, as expected, averaging about 10 cycles per log operation.
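For reference, the cycles-per-operation figure quoted above is just the measured cycle delta divided by the element count. A minimal sketch of that conversion, using the same globals as the snippets above (g_ui64CyclesPerOp is a name introduced here for illustration, not from the original code):

// Sketch: convert the measured TSC delta into an average cost per call.
// TSCL/TSCH count CPU cycles, so this figure also includes the loop overhead.
uint64_t g_ui64CyclesPerOp = g_ui64ElapsedTime / (uint64_t)ui32IdxCount;   // ~10 for logsp_i on one core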
The next step is to further improve the performance by parallelizing the computation across all 8 cores. To do this, I started from the omp_hello example in "\ti\omp_1_02_00_05\packages\ti\omp\examples" and modified it to do the same computation as above.
Non-inlined version (logsp) first:
// Get Start Time
g_ui64StartTime  = (uint64_t)TSCL;
g_ui64StartTime |= ((uint64_t)TSCH << 32);

#pragma omp parallel private(ui32Idx) shared(ui32IdxCount, pfBuffer)
{
    #pragma omp for
    for (ui32Idx = 0; ui32Idx < ui32IdxCount; ui32Idx++)
    {
        pfBuffer[ui32Idx] = logsp(pfBuffer[ui32Idx]);
    }
}

// Get Stop Time
g_ui64StopTime  = (uint64_t)TSCL;
g_ui64StopTime |= ((uint64_t)TSCH << 32);
g_ui64ElapsedTime = g_ui64StopTime - g_ui64StartTime;
Inlined version (logsp_i) next:
// Get Start Time
g_ui64StartTime  = (uint64_t)TSCL;
g_ui64StartTime |= ((uint64_t)TSCH << 32);

#pragma omp parallel private(ui32Idx) shared(ui32IdxCount, pfBuffer)
{
    #pragma omp for
    for (ui32Idx = 0; ui32Idx < ui32IdxCount; ui32Idx++)
    {
        pfBuffer[ui32Idx] = logsp_i(pfBuffer[ui32Idx]);
    }
}

// Get Stop Time
g_ui64StopTime  = (uint64_t)TSCL;
g_ui64StopTime |= ((uint64_t)TSCH << 32);
g_ui64ElapsedTime = g_ui64StopTime - g_ui64StartTime;
Problem: The inlined function (logsp_i) is much slower when running under OpenMP. The non-inlined function improves, but the inlined version goes from 10 cycles per operation on a single core with no OpenMP to more than 70 cycles per operation once OpenMP is added. I understand there is fork overhead when the #pragma omp parallel is reached, but this seems like far too much added overhead. If I run all 8 cores in parallel, I get 9 cycles per operation to complete the above for loop: only a 1-cycle improvement over the single-core, no-OpenMP code. In other words, all 8 cores together perform only about as well as a single core with no parallelism.

It looks to me like the software pipelining the compiler applies to the inlined function stops working once OpenMP is in the project, and that is what degrades the performance so badly. I already tried placing the data buffer in other memory sections (it is in DDR3 by default), and that is not the cause of the issue.
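One experiment that might separate the pipelining question from the OpenMP fork overhead is to move the element loop into its own restrict-qualified helper with no OpenMP constructs inside it, and hand each thread one contiguous chunk. This is only a sketch: logsp_block is a name made up here, it assumes the same MATHLIB includes as the code above, and it assumes ui32IdxCount divides evenly by the number of threads (4096 / 8 = 512).

#include <stdint.h>
#include <omp.h>
/* logsp_i() comes from the same MATHLIB include already used above */

/* Hypothetical helper: a plain counted loop with no OpenMP constructs in it,
 * so the compiler sees the same loop shape as in the single-core build. */
static void logsp_block(float * restrict pfData, uint32_t ui32Count)
{
    uint32_t ui32Idx;
    for (ui32Idx = 0; ui32Idx < ui32Count; ui32Idx++)
    {
        pfData[ui32Idx] = logsp_i(pfData[ui32Idx]);
    }
}

/* In the timed section, replace the #pragma omp for loop with one
 * contiguous chunk per thread (assumes even divisibility). */
#pragma omp parallel shared(ui32IdxCount, pfBuffer)
{
    uint32_t ui32Chunk = ui32IdxCount / (uint32_t)omp_get_num_threads();
    uint32_t ui32Start = ui32Chunk * (uint32_t)omp_get_thread_num();
    logsp_block(&pfBuffer[ui32Start], ui32Chunk);
}

Keeping the compiler's assembly output for logsp_block and comparing its software-pipeline information against the original OpenMP loop body should show whether the pipelining really is being lost inside the parallel region, or whether the time is going somewhere else.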
Any ideas of what could be going on?
Thanks,
Damian