Proper Way to Benchmark Timing within OpenMP on C6678

Ryan Radjabi

Other Parts Discussed in Thread: SYSBIOS, TMS320C6678

What is the proper way to calculate execution times within a single core for a multi-core application using OpenMP?

In a single-core application, I use the Timestamp_get32() function to count cycles between lines of code, although this doesn't seem to return the correct value for code inside the #pragma omp parallel private(nthreads, tid) block in my code.

I ran the multiplication for a Hanning window both inside and outside the OMP pragma. Outside the pragma I get roughly the results I was seeing in my single-core application, but time benchmarks within the pragma are around 6-8 times what they are outside it. Check out the simple C code below. There isn't anything fancy going on; this is built around the HelloWorld template for OpenMP. The time difference is the same whether I configure the application for 1 core, 4 cores, or 8 cores.

Can Timestamp_get32() be trusted within the pragma statement?

/******************************************************************************

* FILE: omp_hello.c

* DESCRIPTION:

* OpenMP Example - Hello World - C/C++ Version

* In this simple example, the master thread forks a parallel region.

* All threads in the team obtain their unique thread number and print it.

* The master thread only prints the total number of threads. Two OpenMP

* library routines are used to obtain the number of threads and each

* thread's number.

* AUTHOR: Blaise Barney 5/99

* LAST REVISED: 04/06/05

******************************************************************************/

#include <ti/omp/omp.h>

#include <string.h>

#include <assert.h>

#include <stdio.h>

#include <time.h>

#include <stdint.h>

#include <xdc/std.h>

#include <xdc/runtime/System.h>

#include <ti/sysbios/BIOS.h>

#include <xdc/runtime/Log.h>

#include <xdc/runtime/Timestamp.h>

#include <math.h>

#define NTHREADS 1

#define FFT_MAX_L 2048

#define PI 3.14159265358979323846

float multiplier[ 2048 ];

void generateHanningLookup( void )

{

int32_t i;

for (i = 0; i < 2048; i++)

{

// equation from stackoverflow.com

multiplier[ i ] = 0.5 * ( 1 - cos( 2 * PI * i / ( FFT_MAX_L - 1 ) ) );

}

}

void main()

{

int nthreads, tid;

nthreads = NTHREADS;

omp_set_num_threads(NTHREADS);

int16_t windowOutputData[ FFT_MAX_L ];

uint16_t j;

uint32_t start, totalTime;

generateHanningLookup();

start = Timestamp_get32();

for( j = 0; j < FFT_MAX_L; j++ )

{

windowOutputData[ j ] = j * multiplier[ j ];

}

printf( "HANNING#1 = [ %u ] cycles \n", ( Timestamp_get32() - start ) );

totalTime = Timestamp_get32();

/* Fork a team of threads giving them their own copies of variables */

#pragma omp parallel private(nthreads, tid)

{

/* Obtain thread number */

tid = omp_get_thread_num();

printf("Hello World from thread = %d\n", tid);

/* Only master thread does this */

if (tid == 0)

{

nthreads = omp_get_num_threads();

printf("Number of threads = %d\n", nthreads);

// Hanning Window

start = Timestamp_get32();

for( j = 0; j < FFT_MAX_L; j++ )

{

windowOutputData[ j ] = j * multiplier[ j ];

}

printf( "HANNING#OMP = [ %u ] cycles \n", ( Timestamp_get32() - start ) );

}

else

{

tid = omp_get_thread_num();

uint32_t startMp = Timestamp_get32();

for( j = 0; j < FFT_MAX_L; j++ )

{

windowOutputData[ j ] = j * multiplier[ j ];

}

printf( "HANNING#%u = [ %u ] cycles \n", tid, ( Timestamp_get32() - startMp ) );

}

} /* All threads join master thread and disband */

printf( "OMP TIME = [ %u ] cycles \n", ( Timestamp_get32() - totalTime ) );

start = Timestamp_get32();

for( j = 0; j < FFT_MAX_L; j++ )

{

windowOutputData[ j ] = j * multiplier[ j ];

}

printf( "HANNING#2 = [ %u ] cycles \n", ( Timestamp_get32() - start ) );

}
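The listing above computes elapsed cycles as `Timestamp_get32() - start` on `uint32_t` values. A minimal sketch of why that subtraction stays valid when the 32-bit counter wraps once between the two readings; `elapsed_u32` and `elapsed_u32_example` are hypothetical helper names, not part of SYS/BIOS:

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed ticks between two readings of a free-running 32-bit counter.
 * Unsigned arithmetic is modulo 2^32, so the difference is still correct
 * when the counter wrapped once between the two readings. */
uint32_t elapsed_u32(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Worked example: the counter wraps between the two readings. */
void elapsed_u32_example(void)
{
    uint32_t start = 0xFFFFFFF0u;  /* 16 ticks before the wrap */
    uint32_t end   = 0x00000010u;  /* 16 ticks after the wrap  */
    assert(elapsed_u32(start, end) == 32u);
}
```

The subtraction cannot tell apart intervals that differ by a multiple of 2^32 ticks, so it only suits regions shorter than one full counter period.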

over 12 years ago

0 Ryan Radjabi over 12 years ago

Expert 1125 points

This is not a Linux related question. Please move it to the appropriate forum. This application uses SYSBIOS.

0 David Friedland over 12 years ago in reply to Ryan Radjabi

TI__Mastermind 18320 points

Ryan,

Actually, this question does not really seem to be BIOS-specific. I have moved this post over to the Keystone device forum in hopes that it will get a faster response there...

0 Xiaohui Li over 12 years ago

TI__Intellectual 1870 points

7633.tsc_h.asm

Hi,

I made a couple of changes to your original code.

1) Instead of using Timestamp_get32(), I directly read the timer registers of the C66x to get exact cycle information. The file with the simple functions for reading the registers is attached.

2) I put the Hanning window operation into a separate function to make sure the same code is used both inside and outside the #pragma. You never know whether the compiler generates the same code inside and outside the parallel region.

When I run the code, I got the following performance numbers.

[C66xx_0] HANNING#1 = [ 6507 ] cycles

[C66xx_0] HANNING#OMP = [ 2145 ] cycles

[C66xx_0] HANNING#2 = [ 6543 ] cycles

The performance of the windowing inside the #pragma is 2145 cycles. If you look at the compiler output, this number matches very closely the theoretical performance of the code the compiler generates for the Hanning windowing. The problem is that the cycle count is much larger when the windowing is done outside the #pragma.

The reason is that before the 1st windowing is done, the data are in MSMC, so the measurement outside the #pragma region includes cache-miss penalties. By the time the windowing inside the #pragma starts, the data are already in cache. That's why the performance inside the #pragma is much better. With the code for the 1st windowing commented out and the program rerun, the performance of the windowing inside the #pragma is as follows.

[C66xx_0] HANNING#OMP = [ 6329 ] cycles

This closely matches the performance of the windowing before the #pragma. The code I am using is attached below.

Xiaohui

#include <ti/omp/omp.h>

#include <string.h>
#include <assert.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <xdc/std.h>
#include <xdc/runtime/System.h>
#include <ti/sysbios/BIOS.h>
#include <xdc/runtime/Log.h>
#include <xdc/runtime/Timestamp.h>
#include <math.h>

#define NTHREADS 1

#define FFT_MAX_L 2048
#define PI 3.14159265358979323846

float multiplier[ 2048 ];

void generateHanningLookup( void )
{
int32_t i;

       for (i = 0; i < 2048; i++)
       {
              // equation from stackoverflow.com
              multiplier[ i ] = 0.5 * ( 1 - cos( 2 * PI * i / ( FFT_MAX_L - 1 ) ) );
       }

}

void hanningWindow( int16_t * restrict windowOutputData, float * restrict multiplier )
{
int j;
       for( j = 0; j < FFT_MAX_L; j++ )
       {
              windowOutputData[ j ] = j * multiplier[ j ];
       }

}

void main()
{

int nthreads, tid;

nthreads = NTHREADS;

omp_set_num_threads(NTHREADS);

       int16_t   windowOutputData[ FFT_MAX_L ];
       uint16_t j;
       uint32_t start, end, totalTime;
       uint32_t start1, end1;
       generateHanningLookup();

       TSC_enable();

//     start = Timestamp_get32();
       start = TSC_read();
#if 0
       for( j = 0; j < FFT_MAX_L; j++ )
       {
              windowOutputData[ j ] = j * multiplier[ j ];
       }
#else
       hanningWindow( windowOutputData, multiplier );
#endif
       end = TSC_read();
//     printf( "HANNING#1 = [ %u ] cycles \n", ( Timestamp_get32() - start ) );
       printf( "HANNING#1 = [ %u ] cycles \n", ( end - start ) );


//     totalTime = Timestamp_get32();
       totalTime = TSC_read();
       /* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel private(nthreads, tid)
       {

              /* Obtain thread number */
              tid = omp_get_thread_num();
              printf("Hello World from thread = %d\n", tid);

              if (tid == 0){
                     nthreads = omp_get_num_threads();
                     printf("Number of threads = %d\n", nthreads);
                 }

              // Hanning Window
              uint32_t startMp0, endMp0;
              startMp0 = TSC_read();
#if 0
              for( j = 0; j < FFT_MAX_L; j++ )
              {
                     windowOutputData[ j ] = j * multiplier[ j ];
              }
#else
              hanningWindow( windowOutputData, multiplier );
#endif
              endMp0 = TSC_read();
              printf( "HANNING#OMP = [ %u ] cycles \n", ( endMp0 - startMp0 ) );

       } /* All threads join master thread and disband */
//     printf( "OMP TIME = [ %u ] cycles \n", ( Timestamp_get32() - totalTime ) );
//     printf( "OMP TIME = [ %u ] cycles \n", ( clock() - totalTime ) );
       printf( "OMP TIME = [ %u ] cycles \n", ( TSC_read() - totalTime ) );

//     start = Timestamp_get32();
       start1= TSC_read();
#if 0
       for( j = 0; j < FFT_MAX_L; j++ )
       {
              windowOutputData[ j ] = j * multiplier[ j ];
       }
#else
       hanningWindow( windowOutputData, multiplier );
#endif
       end1= TSC_read();
//     printf( "HANNING#2 = [ %u ] cycles \n", ( Timestamp_get32() - start ) );
       printf( "HANNING#2 = [ %u ] cycles \n", ( end1 - start1 ) );

}
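The cache effect described in this reply suggests reporting the first (cold) run separately from later (warm) runs of the same function. A minimal sketch, assuming the per-run cycle counts have already been collected (for example with TSC_read() as in the listing above); `cold_warm_t` and `split_cold_warm` are hypothetical names introduced only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t cold;  /* first run: includes cache-miss penalties */
    uint32_t warm;  /* fastest later run: data already in cache */
} cold_warm_t;

/* Split a series of per-run cycle counts into the cold first run and the
 * best warm run. With only one sample, warm is reported equal to cold. */
cold_warm_t split_cold_warm(const uint32_t *cycles, size_t n)
{
    cold_warm_t r;
    size_t i;

    assert(cycles != NULL && n > 0);

    r.cold = cycles[0];
    r.warm = (n > 1) ? cycles[1] : cycles[0];
    for (i = 2; i < n; i++) {
        if (cycles[i] < r.warm) {
            r.warm = cycles[i];
        }
    }
    return r;
}
```

Reporting both numbers makes it clear whether a difference between two measurement points comes from the code itself or from where the data happened to be when the timer started.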

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

Hi Xiaohui Li,

Thank you for your response. I'm unable to build my project (RELEASE). It gives the following errors, which I find very strange since I'm not building the DEBUG project, but the RELEASE.

Is there something I must do besides adding the 7633.tsc_h.asm to the C6000 Linker\File Search Path\ -I include line?

The build errors point to these lines in the linker.cmd file (in the DEBUG folder, don't understand that!).

MEMORY
{
L2SRAM : org = 0x800000, len = 0x80000
DDR3 : org = 0x80000000, len = 0x20000000
MSMCSRAM : org = 0xc000000, len = 0x100000
MSMCSRAM_NOCACHE : org = 0xa0100000, len = 0x300000
}

My build errors are listed below. Thanks!

Ryan

Description Resource Path Location Type
#10263 DDR3 memory range has already linker.cmd /mc_sanity/Debug/configPkg line 39 C/C++ Problem
#10263 L2SRAM memory range has linker.cmd /mc_sanity/Debug/configPkg line 38 C/C++ Problem
#10263 MSMCSRAM memory range has linker.cmd /mc_sanity/Debug/configPkg line 40 C/C++ Problem
#10263 MSMCSRAM_NOCACHE memory range linker.cmd /mc_sanity/Debug/configPkg line 41 C/C++ Problem
#10264 DDR3 memory range overlaps linker.cmd /mc_sanity/Debug/configPkg line 39 C/C++ Problem
#10264 L2SRAM memory range overlaps linker.cmd /mc_sanity/Debug/configPkg line 38 C/C++ Problem
#10264 MSMCSRAM memory range overlaps linker.cmd /mc_sanity/Debug/configPkg line 40 C/C++ Problem
#10264 MSMCSRAM_NOCACHE memory range linker.cmd /mc_sanity/Debug/configPkg line 41 C/C++ Problem
#99900 error limit reached; 100 errors detected mc_sanity C/C++ Problem

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

Hi Ryan,

You should rename the .asm file to tsc_h.asm and put it in the directory where your source files are located.

I am not sure what the errors are about. Did you start with the original OpenMP Hello example? That's what I did and it compiled just fine without any changes in the settings or configurations.

Xiaohui

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

Hi Xiaohui,

I got it to build by creating a new omp_helloWorld project, but I'm not getting the same results you had (listed below).

When I run the code, I got the following performance numbers.

[C66xx_0] HANNING#1 = [ 6507 ] cycles

[C66xx_0] HANNING#OMP = [ 2145 ] cycles

[C66xx_0] HANNING#2 = [ 6543 ] cycles

My results are

[C66xx_0] HANNING#1 = [ 32210 ] cycles

[C66xx_0] HANNING#OMP = [ 32254 ] cycles

[C66xx_0] HANNING#2 = [ 32240 ] cycles

Am I missing a configuration, perhaps the RTSC Platform file?

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

I ran the code in Release mode.

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

Why am I getting such different results? How many cores are you loading this on? 4 in my case.

Should NTHREADS be defined 4 instead of 1? That shouldn't affect this benchmark.

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

I am loading to all 8 cores, but NTHREADS is set to 1 as in your original code. So only core0 is doing the computation. The length of the computation is 2048 as specified in your original code. Are you using the code I sent to you?

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

I loaded the code on 8 cores (even though my CFG file has "OpenMP.setNumProcessors(4);" in it). When I run, it outputs:

[C66xx_0] Hello World from thread = 0
[C66xx_0] Number of threads = 1
[C66xx_0] HANNING#OMP = [ 32236 ] cycles
[C66xx_0] OMP TIME = [ 144885 ] cycles
[C66xx_0] HANNING#2 = [ 32228 ] cycles

What does your RTSC Platform config look like? Mine is:

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

4834.OpenMp_Hello_1.zip

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

Xiaohui,

Your MAP file was identical to mine (except for the timestamp). I used it, nonetheless, but I didn't get different results. Still:

[C66xx_0] HANNING#1 = [ 32216 ] cycles
[C66xx_0] Hello World from thread = 0
[C66xx_0] Number of threads = 1
[C66xx_0] HANNING#OMP = [ 32236 ] cycles
[C66xx_0] OMP TIME = [ 146152 ] cycles
[C66xx_0] HANNING#2 = [ 32298 ] cycles

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

6201.platform.zip

Actually, your platform is quite different from mine. Your data are allocated in DDR, whereas mine are in MSMC. Your stack is in MSMC, but mine is in L2 SRAM. Attached is my platform.

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

Okay, I reconfigured my platform to nearly match yours, except I don't have the MSMCSRAM_NOCACHE section. I still get:

[C66xx_0] HANNING#1 = [ 32204 ] cycles
[C66xx_0] Hello World from thread = 0
[C66xx_0] Number of threads = 1
[C66xx_0] HANNING#OMP = [ 32288 ] cycles
[C66xx_0] HANNING#2 = [ 32262 ] cycles

My platform settings are attached.

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

Ryan,

If I remember correctly, your code only computes 2048 samples for the Hanning windowing, and each sample requires only one multiply. Your measured performance indicates it takes more than 15 cycles per sample. That's way too much. Could you share your code and the .asm from the compiler output?

Xiaohui
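The per-sample estimate in this reply is a plain division of the measured cycles by the 2048 samples. A minimal sketch of that arithmetic; `cycles_per_sample` is a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Average cycles spent per processed sample. */
double cycles_per_sample(uint32_t cycles, uint32_t samples)
{
    assert(samples > 0);
    return (double)cycles / (double)samples;
}
```

With the numbers reported in this thread, 32210 cycles over 2048 samples is about 15.7 cycles per sample, while 2145 cycles over 2048 samples is about 1.05.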

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

Ryan,

Could you delete the two lines for L1DSRAM and L1PSRAM from your platform?

Xiaohui

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

Xiaohui,

I've attached my code and the .asm file (my only ASM file is the one you provided me). I also deleted L1PSRAM and L1DSRAM from my platform config as you requested.

Ryan

5751.omp_hello.c

#include <ti/omp/omp.h>

#include <string.h>
#include <assert.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <xdc/std.h>
#include <xdc/runtime/System.h>
#include <ti/sysbios/BIOS.h>
#include <xdc/runtime/Log.h>
#include <xdc/runtime/Timestamp.h>
#include <math.h>

#define NTHREADS  1

#define FFT_MAX_L 2048
#define PI 3.14159265358979323846

float     multiplier[ 2048 ];

void generateHanningLookup( void )
{
       int32_t i;

       for (i = 0; i < 2048; i++)
       {
              // equation from stackoverflow.com
              multiplier[ i ] = 0.5 * ( 1 - cos( 2 * PI * i / ( FFT_MAX_L - 1 ) ) );
       }

}
void hanningWindow( int16_t * restrict windowOutputData, float * restrict multiplier )
{
int j;
       for( j = 0; j < FFT_MAX_L; j++ )
       {
              windowOutputData[ j ] = j * multiplier[ j ];
       }
}

void main()
{

       int nthreads, tid;

       nthreads = NTHREADS;

       omp_set_num_threads(NTHREADS);

       int16_t   windowOutputData[ FFT_MAX_L ];
       uint16_t j;
       uint32_t start,  end, totalTime;
       uint32_t start1, end1;
       generateHanningLookup();

       hanningWindow( windowOutputData, multiplier );

       TSC_enable();
//     start = Timestamp_get32();
       start = TSC_read();
#if 0
       for( j = 0; j < FFT_MAX_L; j++ )
       {
              windowOutputData[ j ] = j * multiplier[ j ];
       }
#else
       hanningWindow( windowOutputData, multiplier );
#endif
       end = TSC_read();
//     printf( "HANNING#1 = [ %u ] cycles \n", ( Timestamp_get32() - start ) );
       printf( "HANNING#1 = [ %u ] cycles \n", ( end - start ) );


//     totalTime = Timestamp_get32();
       totalTime = TSC_read();
       /* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel private(nthreads, tid)
       {

              /* Obtain thread number */
              tid = omp_get_thread_num();
              printf("Hello World from thread = %d\n", tid);

              if (tid == 0){
                     nthreads = omp_get_num_threads();
                     printf("Number of threads = %d\n", nthreads);
                 }
              // Hanning Window
              uint32_t startMp0, endMp0;
              startMp0 = TSC_read();
#if 0
              for( j = 0; j < FFT_MAX_L; j++ )
              {
                     windowOutputData[ j ] = j * multiplier[ j ];
              }
#else
              hanningWindow( windowOutputData, multiplier );
#endif
              endMp0 = TSC_read();
              printf( "HANNING#OMP = [ %u ] cycles \n", ( endMp0 - startMp0 ) );

       }  /* All threads join master thread and disband */
//     printf( "OMP TIME = [ %u ] cycles \n", ( Timestamp_get32() - totalTime ) );
//     printf( "OMP TIME = [ %u ] cycles \n", ( clock() - totalTime ) );
//       printf( "OMP TIME = [ %u ] cycles \n", ( TSC_read() - totalTime ) );

//     start = Timestamp_get32();
       start1= TSC_read();
#if 0
       for( j = 0; j < FFT_MAX_L; j++ )
       {
              windowOutputData[ j ] = j * multiplier[ j ];
       }
#else
       hanningWindow( windowOutputData, multiplier );
#endif
       end1= TSC_read();
//     printf( "HANNING#2 = [ %u ] cycles \n", ( Timestamp_get32() - start ) );
       printf( "HANNING#2 = [ %u ] cycles \n", ( end1 - start1 ) );

}

7607.tsc_h.asm

Ryan

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

Ryan,

Actually, I was asking for the .asm file from the compiler output. The compiler should generate an omp_hello.asm file if you enable the option to keep the compiler-generated asm file.

Properties->Build->C6000 Compiler ->Advanced Options->Assembler-> check the box for "Keep the generated assembly language"

Xiaohui

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

Okay I attached the program ASM. 2134.omp_hello.asm

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

The .asm file looks fine. Could you also send me the .map file? What's your optimization level, -O2 or -O3? Do you have any debugging features turned on?

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

**** Build of configuration Release for project omp_hello_timing ****

C:\ti\ccsv5\utils\bin\gmake -k all
'Building file: ../omp_config.cfg'
'Invoking: XDCtools'
"C:/ti/xdctools_3_24_06_63/xs" --xdcpath="C:/ti/omp_1_01_03_02/packages;C:/ti/bios_6_33_06_50/packages;C:/ti/ipc_1_24_03_32/packages;C:/ti/pdk_C6678_1_1_2_6/packages;C:/ti/ccsv5/ccs_base;" xdc.tools.configuro -o configPkg -t ti.targets.elf.C66 -p ti.omp.examples.platforms.evm6678 -r debug -c "C:/ti/ccsv5/tools/compiler/c6000_7.4.2" "../omp_config.cfg"
making package.mak (because of package.bld) ...
generating interfaces for package configPkg (because package/package.xdc.inc is older than package.xdc) ...
configuring omp_config.xe66 from package/cfg/omp_config_pe66.cfg ...
cle66 package/cfg/omp_config_pe66.c ...
'Finished building: ../omp_config.cfg'
' '
'Building file: ../omp_hello.c'
'Invoking: C6000 Compiler'
"C:/ti/ccsv5/tools/compiler/c6000_7.4.2/bin/cl6x" -mv6600 --abi=eabi -O2 --include_path="C:/ti/ccsv5/tools/compiler/c6000_7.4.2/include" --display_error_number --diag_warning=225 --diag_wrap=off --openmp -k --preproc_with_compile --preproc_dependency="omp_hello.pp" --cmd_file="./configPkg/compiler.opt" "../omp_hello.c"
"../omp_hello.c", line 59: warning #225-D: function declared implicitly
"../omp_hello.c", line 61: warning #225-D: function declared implicitly
"../omp_hello.c", line 52: warning #179-D: variable "j" was declared but never referenced
"../omp_hello.c", line 53: warning #552-D: variable "totalTime" was set but never used
'Finished building: ../omp_hello.c'
' '
'Building file: ../tsc_h.asm'
'Invoking: C6000 Compiler'
"C:/ti/ccsv5/tools/compiler/c6000_7.4.2/bin/cl6x" -mv6600 --abi=eabi -O2 --include_path="C:/ti/ccsv5/tools/compiler/c6000_7.4.2/include" --display_error_number --diag_warning=225 --diag_wrap=off --openmp -k --preproc_with_compile --preproc_dependency="tsc_h.pp" --cmd_file="./configPkg/compiler.opt" "../tsc_h.asm"
'Finished building: ../tsc_h.asm'
' '
'Building target: omp_hello_timing.out'
'Invoking: C6000 Linker'
"C:/ti/ccsv5/tools/compiler/c6000_7.4.2/bin/cl6x" -mv6600 --abi=eabi -O2 --display_error_number --diag_warning=225 --diag_wrap=off --openmp -k -z -m"omp_hello_timing.map" -i"C:/ti/ccsv5/tools/compiler/c6000_7.4.2/lib" -i"C:/ti/ccsv5/tools/compiler/c6000_7.4.2/include" --reread_libs --warn_sections --display_error_number --diag_wrap=off --rom_model -o "omp_hello_timing.out" -l"./configPkg/linker.cmd" "./tsc_h.obj" "./omp_hello.obj" -l"libc.a"
<Linking>
'Finished building target: omp_hello_timing.out'
' '

**** Build Finished ****

0 Ryan Radjabi over 12 years ago in reply to Ryan Radjabi

Expert 1125 points

Also see my map file; I meant to attach it in the last message. 2500.omp_hello_timing_map.zip

0 Xiaohui Li over 12 years ago in reply to Xiaohui Li

TI__Intellectual 1870 points

Ryan,

Your multiplier array starts at 0xa0194500 which is in DDR3. In my code, it starts at 0x0c0b78a8 which is in MSMC. This will make a huge difference. I am not sure whether your DDR3 is configured as cacheable or non-cacheable. Again, that can make a huge difference.

Your memory configuration is:

         name            origin    length      used     unused   attr    fill
---------------------- -------- --------- -------- -------- ---- --------
L2SRAM                00800000   00080000 0002bca0 00054360 RWIX
MSMCSRAM              0c000000   00100000 00039a0a 000c65f6 RWIX
DDR3                  80000000   20000000 01000000 1f000000 RWIX
MSMCSRAM_NOCACHE      a0100000   00300000 00096a92 0026956e RWIX

My configuration is:

         name            origin    length      used     unused   attr    fill
---------------------- -------- --------- -------- -------- ---- --------
L2SRAM                00800000   00080000 0002bcc0 00054340 RWIX
MSMCSRAM              0c000000   003c0000 000bab2a 003054d6 RWIX
DDR3                  80000000   20000000 00000000 20000000 RWIX
MSMCSRAM_NOCACHE      a03c0000   00040000 0003996e 00006692 RWIX

The two configurations are drastically different in the sense that yours uses a significant amount of DDR3. Mine does not use DDR3 at all.

I think that explains the performance differences we are seeing.

Xiaohui
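Checking a symbol address from the .map file against the region table can be done mechanically. A minimal sketch, using the origins and lengths from the first memory configuration quoted in this reply; `mem_region`, `first_map`, and `region_of` are hypothetical names introduced only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t    origin;
    uint32_t    length;
} mem_region;

/* Regions copied from the first memory configuration quoted above. */
const mem_region first_map[] = {
    { "L2SRAM",           0x00800000u, 0x00080000u },
    { "MSMCSRAM",         0x0c000000u, 0x00100000u },
    { "DDR3",             0x80000000u, 0x20000000u },
    { "MSMCSRAM_NOCACHE", 0xa0100000u, 0x00300000u },
};

/* Return the name of the region containing addr, or "unmapped". */
const char *region_of(const mem_region *map, size_t n, uint32_t addr)
{
    size_t i;

    assert(map != NULL);

    for (i = 0; i < n; i++) {
        /* addr - origin < length avoids overflow in origin + length */
        if (addr >= map[i].origin && addr - map[i].origin < map[i].length) {
            return map[i].name;
        }
    }
    return "unmapped";
}
```

Calling `region_of(first_map, 4, addr)` with an address taken from the .map file returns the name of the region that contains it. Note that by this table the DDR3 range ends at 0x9fffffff and the range 0xa0100000 to 0xa03fffff belongs to MSMCSRAM_NOCACHE, so it is worth checking which of the two an address such as 0xa0194500 actually lands in.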

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

Xiaohui,

That's what I expected. What can I do to place the multiplier array in a different address? If we used the same code and platform configuration, what is causing my array to be stored in DDR3?

Ryan

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

Ryan,

I am not sure we are using the same platform configuration because it ultimately decides the memory configuration.

Are we using the same versions of everything? The attachment shows the versions of the various software components on my setup. Otherwise, just double-check that your platform configuration puts data in MSMC and does not use any DDR.

8585.configuration.zip

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

I'm using the same versions as you, except I'm using OpenMP BIOS runtime library 1.1.3.02 and you are using 1.2.0.05

Where can I find 1.2.0.05 to install that?

Can you send me your platform configuration file so I can just load that?

Ryan

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

http://software-dl.ti.com/sdoemb/sdoemb_public_sw/omp/1_02_00_05/index_FDS.html

8838.evm6678.zip

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

That didn't help me. How do I load your platform settings? Also, what is the procedure for loading these settings and applying them? Do I need to terminate and reload my target configuration? I've tried that, but didn't produce different results.

Ryan

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

Ryan,

You need to unzip the platform file I sent you and save it on your hard drive. Then open the properties of your project and do the following to add the platform to the repository.

property->general->RTSC->Add->Select repository from file-system->Browse...

After adding the desired platform, you can pull down the Platform under the Property window and find and select the platform you need.

To modify a platform, go to Tools->RTSC Tools->Platform->Edit/New->Browse under Platform Package Repository to go to the directory that has the Platform file then choose the platform under Package Name -> OK. A window for the platform configuration should pop up to allow you to make any changes.

Xiaohui

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

Xiaohui,

Thank you. I was not importing the new platform repository, and I was never setting my project to use a different platform file, so there was no change being made.

Now my MAP file looks like this:

MEMORY CONFIGURATION

name origin length used unused attr fill
---------------------- -------- --------- -------- -------- ---- --------
NEARRAM 00000001 00007fff 00000000 00007fff RWIX
RAM 00008000 fffffffe 0000793a ffff86c4 RWIX

Does this look correct? What does yours look like?

Ryan

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

Ryan,

It doesn't look right. Here is the memory configuration from my .map file.

MEMORY CONFIGURATION

         name            origin    length      used     unused   attr    fill
---------------------- -------- --------- -------- -------- ---- --------
L2SRAM                00800000   00080000 0002bcc0 00054340 RWIX
MSMCSRAM              0c000000   003c0000 000bab2a 003054d6 RWIX
DDR3                  80000000   20000000 00000000 20000000 RWIX
MSMCSRAM_NOCACHE      a03c0000   00040000 0003996e 00006692 RWIX

Again, this is from the original OpenMP Hello example. I didn't make any changes. I would expect you to see the same thing.

Xiaohui

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

My stock project direct from the original OpenMP Hello example produces the same DDR3 usage. I don't understand what is going on. I have the same RTSC Platform as you, see the screenshot. I'm making sure to import this in the Project Build Settings. I'm not sure if you're doing the Debug or Release build, but I have the same usage in DDR3 regardless.

DEBUG

L2SRAM 00800000 00080000 0002bc60 000543a0 RW X
MSMCSRAM 0c000000 003c0000 000b957f 00306a81 RW X
DDR3 80000000 20000000 01000000 1f000000 RW X
MSMCSRAM_NOCACHE a03c0000 00040000 00014752 0002b8ae RW X

RELEASE

L2SRAM 00800000 00080000 0002bc60 000543a0 RWIX
MSMCSRAM 0c000000 00100000 00037a16 000c85ea RWIX
DDR3 80000000 20000000 01000000 1f000000 RWIX
MSMCSRAM_NOCACHE a0100000 00300000 00094a92 0026b56e RWIX

Do you think you can zip up your project and share that with me? I can't see what else I'm doing wrong.

Ryan

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

4212.OpenMp_Hello_1.zip

My project is attached.

Could you try deleting DDR from your platform or at least set the length to zero?

Have you changed to the same version of IPC that I am using?

Xiaohui

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

2 problems:

1) It fails to build and gives the following console output.

**** Build of configuration Release for project OpenMp_Hello_1 ****

**** Build Finished ****

2) I don't have your version of the OpenMP BIOS runtime library, 1.2.0.05. Do you have a link to where I can download that?

Ryan

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

Ryan,

Could you please check my previous post? I already sent out the link for OMP 1.02.00.05 which is what I am using.

Xiaohui

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

Xiaohui,

Yes, updating to OMP 1.02.00.05 corrected my MAP output file.

L2SRAM 00800000 00080000 0002bcc0 00054340 RWIX
MSMCSRAM 0c000000 003c0000 000b9906 003066fa RWIX
DDR3 80000000 20000000 00000000 20000000 RWIX
MSMCSRAM_NOCACHE a03c0000 00040000 0003996e 00006692 RWIX

My benchmark times are much nicer now.

Back to my original question, can you help me benchmark the following? I want to:

1) time hanningWindow() before the OMP parallel section

2) time hanningWindow() inside the parallel section, where I want each core to run the window function on its own set of input/output data

3) time hanningWindow() outside and after the parallel section

Ryan

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

Ryan,

One of my earlier posts shows the code and the benchmark results for the things you listed. Are you able to duplicate the results I got?

The results address your original questions about inconsistent benchmark results inside and outside the parallel region.

Xiaohui

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

Xiaohui,

I'm able to reproduce those earlier results. You only ran this with 1 thread; I want to run it with 4 threads. So I want to run 4 hanningWindow() calls concurrently, each on its own thread, and print out the time spent in each thread.

I'd expect outcome like this:

HANNING#1 = [ 6500 ] cycles

HANNING[ core 0 ] = [ 6500 ] cycles

HANNING[ core 1 ] = [ 6500 ] cycles

HANNING[ core 2 ] = [ 6500 ] cycles

HANNING[ core 3 ] = [ 6500 ] cycles

HANNING#2 = [ 6500 ] cycles

What's the code to generate that output?

Ryan

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

Here is one way to do it.

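#include <ti/omp/omp.h>

#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* TSC_enable() and TSC_read() are provided by the attached tsc_h.asm. */

#define NTHREADS 4

#define FFT_MAX_L 2048
#define PI 3.14159265358979323846

float multiplier[ 2048 ];
int16_t windowOutputData[ FFT_MAX_L*NTHREADS ];

void generateHanningLookup( void )
{
int32_t i;

       for (i = 0; i < 2048; i++)
       {
              // equation from stackoverflow.com
              multiplier[ i ] = 0.5 * ( 1 - cos( 2 * PI * i / ( FFT_MAX_L - 1 ) ) );
       }

}

/* Same helper as in the earlier listing; each call windows FFT_MAX_L samples. */
void hanningWindow( int16_t * restrict windowOutputData, float * restrict multiplier )
{
int j;
       for( j = 0; j < FFT_MAX_L; j++ )
       {
              windowOutputData[ j ] = j * multiplier[ j ];
       }
}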

void main()
{
       int nthreads, tid;
       uint32_t start, end, totalTime;
       uint32_t start1, end1;
       uint32_t startMp0, endMp0;

       nthreads = NTHREADS;
       omp_set_num_threads(NTHREADS);

       generateHanningLookup();

       TSC_enable();

       /* Baseline: one window on the master thread, before the parallel region */
       start = TSC_read();
       hanningWindow( &windowOutputData[FFT_MAX_L*0], multiplier );
       end = TSC_read();
       printf("\n");
       printf( "HANNING#1 = [ %u ] cycles \n", ( end - start ) );

       totalTime = TSC_read();
       /* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel shared(windowOutputData, multiplier ) private(nthreads, tid, startMp0, endMp0)
       {
              /* Each core has its own time-stamp counter, so enable it on every thread */
              TSC_enable();

              /* Obtain thread number */
              tid = omp_get_thread_num();
              printf("Hello World from thread = %d\n", tid);

              // Hanning Window: each thread writes its own slice of the output
              startMp0 = TSC_read();
              hanningWindow( &windowOutputData[FFT_MAX_L*tid], multiplier );
              endMp0 = TSC_read();

              printf( "HANNING#OMP = [ %u ] cycles \n", ( endMp0 - startMp0 ) );
       } /* All threads join master thread and disband */
       printf( "OMP TIME = [ %u ] cycles \n", ( TSC_read() - totalTime ) );

       /* Baseline again on the master thread, after the parallel region */
       start1 = TSC_read();
       hanningWindow( windowOutputData, multiplier );
       end1 = TSC_read();
       printf( "HANNING#2 = [ %u ] cycles \n", ( end1 - start1 ) );
}

----------------------------------------------------------------------

[C66xx_0]

[C66xx_0] HANNING#1 = [ 6726 ] cycles

[C66xx_0] Hello World from thread = 0

[C66xx_0] HANNING#OMP = [ 6732 ] cycles

[C66xx_1] Hello World from thread = 1

[C66xx_2] Hello World from thread = 2

[C66xx_3] Hello World from thread = 3

[C66xx_1] HANNING#OMP = [ 6818 ] cycles

[C66xx_2] HANNING#OMP = [ 6806 ] cycles

[C66xx_3] HANNING#OMP = [ 6818 ] cycles

[C66xx_0] OMP TIME = [ 145931975 ] cycles

[C66xx_0] HANNING#2 = [ 6796 ] cycles
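
Note that the printf calls inside the parallel region are counted in OMP TIME. One way to keep I/O out of the timed region is to have each thread store its result in a shared array and print afterwards from the master thread. The following is only a sketch in portable OpenMP C, not the TI-specific code above: run_timed_team, sample_work, and now_seconds are hypothetical names, and it assumes omp_get_wtime() (with a clock() fallback when built without OpenMP) in place of TSC_read().

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define NUM_THREADS 4

/* One slot per thread; each thread writes only its own slot. */
static double elapsed_s[NUM_THREADS];

/* Wall-clock seconds with OpenMP; CPU seconds via clock() otherwise. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical stand-in for the per-thread kernel (hanningWindow in this thread). */
static void sample_work(int tid)
{
    volatile double acc = 0.0;
    int i;

    for (i = 0; i < 10000; i++) {
        acc += (double)(i + tid);
    }
}

/* Runs work() on each thread and records per-thread time with no I/O
 * inside the region. Returns the number of threads that actually ran. */
static int run_timed_team(void (*work)(int))
{
    int team_size = 1;

#pragma omp parallel num_threads(NUM_THREADS)
    {
        int tid = 0;
        double t0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        t0 = now_seconds();
        work(tid);
        elapsed_s[tid] = now_seconds() - t0;
#ifdef _OPENMP
#pragma omp master
        team_size = omp_get_num_threads();
#endif
    }
    return team_size;
}
```

After run_timed_team(sample_work) returns, the master thread can loop over the first team_size entries of elapsed_s and print them, so the region itself contains only the work and the two timer reads.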

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

Excellent! I got the same results.

[C66xx_0]
[C66xx_0] HANNING#1 = [ 6294 ] cycles
[C66xx_0] Hello World from thread = 0
[C66xx_0] HANNING#OMP = [ 6304 ] cycles
[C66xx_1] Hello World from thread = 1
[C66xx_2] Hello World from thread = 2
[C66xx_3] Hello World from thread = 3
[C66xx_1] HANNING#OMP = [ 6368 ] cycles
[C66xx_2] HANNING#OMP = [ 6304 ] cycles
[C66xx_3] HANNING#OMP = [ 6342 ] cycles
[C66xx_0] OMP TIME = [ 206855207 ] cycles
[C66xx_0] HANNING#2 = [ 6270 ] cycles

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

Xiaohui,

What is the meaning of the number of cycles at OMP TIME? Is this a meaningless calculation?

[C66xx_0] OMP TIME = [ 145931975 ] cycles

Thank you,

Ryan

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

Ryan,

This is the number of cycles from the start of the parallel region to the end of the parallel region, measured in the master thread.

Xiaohui
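
For a sense of scale, a raw cycle count can be converted to time. The sketch below is an illustration rather than code from this thread: cycles_elapsed and cycles_to_us are hypothetical helper names, and the 1 GHz core clock used in the example is an assumption (check the actual clock of your board). It also shows why subtracting two 32-bit counter readings as unsigned values remains correct across a single counter wrap.

```c
#include <stdint.h>

/* Hypothetical helper: difference of two 32-bit counter readings.
 * Unsigned arithmetic is modulo 2^32, so the result is still correct
 * if the counter wrapped once between the two reads. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Hypothetical helper: convert a cycle count to microseconds
 * for a given core clock in Hz. */
static double cycles_to_us(uint32_t cycles, double clock_hz)
{
    return (double)cycles * 1.0e6 / clock_hz;
}
```

Under the assumed 1 GHz clock, cycles_to_us(145931975u, 1.0e9) is about 145932 microseconds, or roughly 0.146 s for the whole region. At that rate a 32-bit counter wraps about every 4.29 s, so longer intervals need a 64-bit reading.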

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

Xiaohui,

Then why is that cycle count so large?

[C66xx_0] OMP TIME = [ 206855207 ] cycles <-------result in the previous code

I understand printf is expensive. But even if I take out all of that overhead and simplify the region to the following code, I still get a cycle count that is 18 times greater than I would expect:

#pragma omp parallel shared(windowOutputData, multiplier ) private(nthreads, tid, startMp0, endMp0)
       {
              tid = omp_get_thread_num();

              hanningWindow( &windowOutputData[FFT_MAX_L*tid], &multiplier[FFT_MAX_L*tid] );
       } /* All threads join master thread and disband */

[C66xx_0] OMP TIME = [ 113837 ] cycles <--- result from this code

Ryan

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

I am getting about 11 times the expected count, which is rather high. I will look into it and let you know.

Xiaohui

0 Ryan Radjabi over 12 years ago in reply to Xiaohui Li

Expert 1125 points

Hi Xiaohui,

Have you found out anything about this overhead time?

Thanks,

Ryan

0 Xiaohui Li over 12 years ago in reply to Ryan Radjabi

TI__Intellectual 1870 points

Ryan,

I added a dummy parallel region, as follows, before your actual parallel region.

#pragma omp parallel
{
}

Then I went to Properties -> General -> RTSC and selected 'release' under Build-profile. I got the following performance:

[C66xx_0] HANNING#1 = [ 6724 ] cycles

[C66xx_0] OMP TIME = [ 21184 ] cycles

[C66xx_0] HANNING#2 = [ 6769 ] cycles

The OpenMP overhead in this case is around 13000 cycles. The L2 cache is enabled in this case. If you disable the L2 cache, the overhead drops to around 7000 cycles.
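
One rough way to estimate such an overhead figure is to subtract the slowest thread's kernel time from the cycle count of the whole region. The helper below is a sketch under that assumption, not necessarily how the number above was derived. region_overhead is a hypothetical name, and the values in the usage note are illustrative, not measurements from this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: estimate the fork/join overhead of a parallel region as
 * (cycles for the whole region, measured on the master thread) minus
 * (cycles of the slowest thread's work inside the region).
 * Returns 0 if the inputs are inconsistent (work longer than the region). */
static uint32_t region_overhead(uint32_t region_cycles,
                                const uint32_t *thread_cycles,
                                size_t num_threads)
{
    uint32_t slowest = 0;
    size_t i;

    for (i = 0; i < num_threads; i++) {
        if (thread_cycles[i] > slowest) {
            slowest = thread_cycles[i];
        }
    }
    return (region_cycles > slowest) ? (region_cycles - slowest) : 0u;
}
```

For example, a 20000-cycle region whose slowest thread spent 6800 cycles in its kernel gives an estimate of 13200 cycles. This requires timing each thread's work separately, and those extra timer reads add a few cycles of their own.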

Xiaohui

void main()
{
       int nthreads, tid;
       uint32_t start, end, totalTime;
       uint32_t start1, end1;
       uint32_t startMp0, endMp0;

       nthreads = NTHREADS;
       omp_set_num_threads(NTHREADS);

       generateHanningLookup();

       TSC_enable();

       /* Baseline: one window on the master thread, before any parallel region */
       start = TSC_read();
       hanningWindow( &windowOutputData[FFT_MAX_L*0], multiplier );
       end = TSC_read();
       printf("\n");
       printf( "HANNING#1 = [ %u ] cycles \n", ( end - start ) );

       /* Dummy parallel region, run once before the timed region */
#pragma omp parallel
       {
       }

       totalTime = TSC_read();
       /* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel shared(windowOutputData, multiplier ) private(nthreads, tid, startMp0, endMp0)
       {
              //TSC_enable();

              /* Obtain thread number */
              tid = omp_get_thread_num();

              // Hanning Window
              //startMp0 = TSC_read();
              hanningWindow( &windowOutputData[FFT_MAX_L*tid], multiplier );
              //endMp0 = TSC_read();
       } /* All threads join master thread and disband */
       totalTime = TSC_read() - totalTime;
       printf( "OMP TIME = [ %u ] cycles \n", totalTime );

       /* Baseline again on the master thread, after the parallel region */
       start1 = TSC_read();
       hanningWindow( windowOutputData, multiplier );
       end1 = TSC_read();
       printf( "HANNING#2 = [ %u ] cycles \n", ( end1 - start1 ) );
}
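
The dummy parallel region in the listing above keeps one-time start-up cost out of the measurement. In many OpenMP runtimes the first parallel region sets up the thread team and so costs more than later ones. How much this applies to a particular runtime is an assumption that should be checked by measuring. A minimal portable sketch of such a check follows, using hypothetical helper names and omp_get_wtime() (with a clock() fallback when built without OpenMP) in place of TSC_read():

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock seconds with OpenMP; CPU seconds via clock() otherwise. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Times an empty parallel region, i.e. fork/join cost with no work in it. */
static double time_empty_region(void)
{
    double t0 = now_seconds();

#pragma omp parallel
    {
        /* intentionally empty */
    }
    return now_seconds() - t0;
}
```

Calling time_empty_region() twice and comparing the two results shows how much of the first region's cost is one-time setup. If the first call is much slower than the second, a warm-up region before the timed one is worthwhile.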

0 Lopet azer over 12 years ago in reply to Xiaohui Li

Intellectual 680 points

Hi Xiaohui, Ryan,

I am learning parallel programming using OpenMP on the TMS320C6678 board. I tried to compile your application (Hanning windowing) on my system and ran into these problems:

- Shared memory is invalid: to solve this, I used the original config file from the OpenMP example.

- My compiler cannot find TSC_enable() and TSC_read(), which causes a compile error.

- The OpenMP pragma directive is still unrecognized, which gives a compiler warning.

My MCSDK is 2.01.06 and my OpenMP version is 1_01_03_02. I have tried to update these, but without success.

Can you please help me compile and run your project?

Thank you for your reply.
