Proper Way to Benchmark Timing within OpenMP on C6678

Other Parts Discussed in Thread: SYSBIOS, TMS320C6678

What is the proper way to calculate execution times within a single core for a multi-core application using OpenMP?

In a single-core application, I use the Timestamp_get32() function to count cycles between lines of code. However, it doesn't seem to return the correct value for code inside the #pragma omp parallel private(nthreads, tid) block in my code.

I ran the multiplication for a Hanning window both inside and outside the OMP pragma and roughly got the results I was seeing on my single-core application. Time benchmarks within the pragma are around 6-8 times what they are outside the pragma. Check out the simple C code below. There isn't anything fancy going on; it is built around the HelloWorld template for OpenMP. The time difference is the same whether I set this application for 1 core, 4 cores, or 8 cores.

Can Timestamp_get32() be trusted within the pragma statement?

/******************************************************************************
* FILE: omp_hello.c
* DESCRIPTION:
*   OpenMP Example - Hello World - C/C++ Version
*   In this simple example, the master thread forks a parallel region.
*   All threads in the team obtain their unique thread number and print it.
*   The master thread only prints the total number of threads.  Two OpenMP
*   library routines are used to obtain the number of threads and each
*   thread's number.
* AUTHOR: Blaise Barney  5/99
* LAST REVISED: 04/06/05
******************************************************************************/
#include <ti/omp/omp.h>

 

#include <string.h>
#include <assert.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <xdc/std.h>
#include <xdc/runtime/System.h>
#include <ti/sysbios/BIOS.h>
#include <xdc/runtime/Log.h>
#include <xdc/runtime/Timestamp.h>
#include <math.h>

 

#define NTHREADS  1

 

#define FFT_MAX_L 2048
#define PI 3.14159265358979323846

 

float  multiplier[ 2048 ];

 

void generateHanningLookup( void )
{
       int32_t i;

 

       for (i = 0; i < 2048; i++)
       {
              // equation from stackoverflow.com
              multiplier[ i ] = 0.5 * ( 1 - cos( 2 * PI * i / ( FFT_MAX_L - 1 ) ) );
       }

 

}

 

void main()
{

 

       int nthreads, tid;

 

       nthreads = NTHREADS;

 

       omp_set_num_threads(NTHREADS);

 

       int16_t   windowOutputData[ FFT_MAX_L ];
       uint16_t j;
       uint32_t start, totalTime;
       generateHanningLookup();

 

       start = Timestamp_get32();
       for( j = 0; j < FFT_MAX_L; j++ )
       {
              windowOutputData[ j ] = j * multiplier[ j ];
       }
       printf( "HANNING#1 = [ %u ] cycles \n", ( Timestamp_get32() - start ) );

 

       totalTime = Timestamp_get32();
       /* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel private(nthreads, tid)
       {

 

              /* Obtain thread number */
              tid = omp_get_thread_num();
              printf("Hello World from thread = %d\n", tid);

 

              /* Only master thread does this */
              if (tid == 0)
              {
                     nthreads = omp_get_num_threads();
                     printf("Number of threads = %d\n", nthreads);

 

                     // Hanning Window
                     start = Timestamp_get32();
                     for( j = 0; j < FFT_MAX_L; j++ )
                     {
                           windowOutputData[ j ] = j * multiplier[ j ];
                     }
                     printf( "HANNING#OMP = [ %u ] cycles \n", ( Timestamp_get32() - start ) );

  

              }
              else
              {
                     tid = omp_get_thread_num();

 

                     uint32_t startMp = Timestamp_get32();
                     for( j = 0; j < FFT_MAX_L; j++ )
                     {
                           windowOutputData[ j ] = j * multiplier[ j ];
                     }
                     printf( "HANNING#%u = [ %u ] cycles \n", tid, ( Timestamp_get32() - startMp ) );

 

              }

 

       }  /* All threads join master thread and disband */
       printf( "OMP TIME = [ %u ] cycles \n", ( Timestamp_get32() - totalTime ) );

 

       start = Timestamp_get32();
       for( j = 0; j < FFT_MAX_L; j++ )
       {
              windowOutputData[ j ] = j * multiplier[ j ];
       }
       printf( "HANNING#2 = [ %u ] cycles \n", ( Timestamp_get32() - start ) );

 

}

 

  • This is not a Linux related question. Please move it to the appropriate forum. This application uses SYSBIOS.

  • Ryan,

    Actually, this question does not really seem to be BIOS-specific.  I have moved this post over to the Keystone device forum in hopes that it will get a faster response there...

  • 7633.tsc_h.asm

    Hi,

    I made a couple of changes to your original code.

    1) Instead of using Timestamp_get32(), I directly read the timer registers of the C66x to get exact cycle information.  The file with the simple function for reading the registers is attached.

    2) I put the Hanning window operation into a separate function to make sure the same code is used both inside and outside the #pragma.  You never know if the compiler generates the same kind of code inside and outside the parallel region.
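The attached tsc_h.asm is not reproduced in the thread, so the following is only a sketch of the same idea in C. It assumes the TI compiler's c6x.h header, which exposes the time-stamp counter as the TSCL control register. The names tsc_start, tsc_read_low, and tsc_elapsed are hypothetical, and the clock() branch is just a host-side stand-in so the file also compiles off target.

```c
#include <stdint.h>

#ifdef _TMS320C6X            /* predefined by the TI C6000 compiler */
#include <c6x.h>             /* declares the TSCL/TSCH control registers */
#else
#include <time.h>            /* host stand-in only, not cycle accurate */
#endif

/* Start the free-running counter. On the C66x any write to TSCL starts it. */
void tsc_start(void)
{
#ifdef _TMS320C6X
    TSCL = 0;
#endif
}

/* Read the low 32 bits of the counter. */
uint32_t tsc_read_low(void)
{
#ifdef _TMS320C6X
    return TSCL;
#else
    return (uint32_t)clock();
#endif
}

/* Unsigned subtraction stays correct across a single 32-bit wrap. */
uint32_t tsc_elapsed(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}
```

Each core has its own counter, so a start/end pair should be read on the same core that runs the code being timed.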

    When I run the code, I got the following performance numbers.

    [C66xx_0] HANNING#1 = [ 6507 ] cycles

    [C66xx_0] HANNING#OMP = [ 2145 ] cycles

    [C66xx_0] HANNING#2 = [ 6543 ] cycles

    The performance of the windowing inside the #pragma is 2145 cycles.  If you look at the compiler output, this number closely matches the theoretical performance of the code generated by the compiler for the Hanning windowing.  The problem is that the cycle count is much larger when the windowing is done outside the #pragma.

    The reason is that before the 1st windowing is done, the data are in MSMC.  The performance outside the #pragma region includes cache penalties.  By the time the windowing starts inside the #pragma, the data are already in cache.  That's why the performance inside the #pragma is much better.  After commenting out the code for the 1st windowing and rerunning, the performance of the windowing inside the #pragma is as follows.

    [C66xx_0] HANNING#OMP = [ 6329 ] cycles

    This closely matches the performance of the windowing before the #pragma.  The code I am using is attached below.

    Xiaohui

     

    #include <ti/omp/omp.h>
     

    #include <string.h>
    #include <assert.h>
    #include <stdio.h>
    #include <time.h>
    #include <stdint.h>
    #include <xdc/std.h>
    #include <xdc/runtime/System.h>
    #include <ti/sysbios/BIOS.h>
    #include <xdc/runtime/Log.h>
    #include <xdc/runtime/Timestamp.h>
    #include <math.h>
     

    #define NTHREADS  1
     

    #define FFT_MAX_L 2048
    #define PI 3.14159265358979323846
     

    float     multiplier[ 2048 ];
     

    void generateHanningLookup( void )
    {
           int32_t i;
     

           for (i = 0; i < 2048; i++)
           {
                  // equation from stackoverflow.com
                  multiplier[ i ] = 0.5 * ( 1 - cos( 2 * PI * i / ( FFT_MAX_L - 1 ) ) );
           }
     

    }

    void hanningWindow( int16_t * restrict windowOutputData, float * restrict multiplier )
    {
    int j;
           for( j = 0; j < FFT_MAX_L; j++ )
           {
                  windowOutputData[ j ] = j * multiplier[ j ];
           }

    }


    void main()
    {
     

           int nthreads, tid;
     

           nthreads = NTHREADS;
     

           omp_set_num_threads(NTHREADS);
     

           int16_t   windowOutputData[ FFT_MAX_L ];
           uint16_t j;
           uint32_t start,  end, totalTime;
           uint32_t start1, end1;
           generateHanningLookup();
     
           TSC_enable();

    //     start = Timestamp_get32();
           start = TSC_read();
    #if 0
           for( j = 0; j < FFT_MAX_L; j++ )
           {
                  windowOutputData[ j ] = j * multiplier[ j ];
           }
    #else
           hanningWindow( windowOutputData, multiplier );
    #endif
           end = TSC_read();
    //     printf( "HANNING#1 = [ %u ] cycles \n", ( Timestamp_get32() - start ) );
           printf( "HANNING#1 = [ %u ] cycles \n", ( end - start ) );
     
      
    //     totalTime = Timestamp_get32();
           totalTime = TSC_read();
           /* Fork a team of threads giving them their own copies of variables */
    #pragma omp parallel private(nthreads, tid)
           {
     

                  /* Obtain thread number */
                  tid = omp_get_thread_num();
                  printf("Hello World from thread = %d\n", tid);
     
                  if (tid == 0){
                         nthreads = omp_get_num_threads();
                         printf("Number of threads = %d\n", nthreads);
                     }

                  // Hanning Window
                  uint32_t startMp0, endMp0;
                  startMp0 = TSC_read();
    #if 0
                  for( j = 0; j < FFT_MAX_L; j++ )
                  {
                         windowOutputData[ j ] = j * multiplier[ j ];
                  }
    #else
                  hanningWindow( windowOutputData, multiplier );
    #endif
                  endMp0 = TSC_read();
                  printf( "HANNING#OMP = [ %u ] cycles \n", ( endMp0 - startMp0 ) );


           }  /* All threads join master thread and disband */
    //     printf( "OMP TIME = [ %u ] cycles \n", ( Timestamp_get32() - totalTime ) );
    //     printf( "OMP TIME = [ %u ] cycles \n", ( clock() - totalTime ) );
           printf( "OMP TIME = [ %u ] cycles \n", ( TSC_read() - totalTime ) );
     

    //     start = Timestamp_get32();
           start1= TSC_read();
    #if 0
           for( j = 0; j < FFT_MAX_L; j++ )
           {
                  windowOutputData[ j ] = j * multiplier[ j ];
           }
    #else
           hanningWindow( windowOutputData, multiplier );
    #endif
           end1= TSC_read();
    //     printf( "HANNING#2 = [ %u ] cycles \n", ( Timestamp_get32() - start ) );
           printf( "HANNING#2 = [ %u ] cycles \n", ( end1 - start1 ) );
     

    }
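One way to keep the cache penalty described above out of a measurement is to run one untimed pass first and time only the second pass. The sketch below illustrates that pattern under stated assumptions: read_cycles() is a hypothetical stand-in for the TSC_read() used in this thread (here backed by clock() so it compiles on a host), and time_warm_pass() is an illustrative name, not part of any TI library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define FFT_MAX_L 2048

/* Hypothetical stand-in for TSC_read(); replace with the real counter read on target. */
static uint32_t read_cycles(void)
{
    return (uint32_t)clock();
}

/* Same windowing loop as in the thread, with an explicit cast on the store. */
void hanningWindow(int16_t *restrict out, const float *restrict mult)
{
    int j;
    for (j = 0; j < FFT_MAX_L; j++)
    {
        out[j] = (int16_t)(j * mult[j]);
    }
}

/* Run one untimed pass to pull the buffers into cache, then time a second pass. */
uint32_t time_warm_pass(int16_t *out, const float *mult)
{
    uint32_t start;

    assert(out != NULL && mult != NULL);

    hanningWindow(out, mult);            /* warm-up, not measured */

    start = read_cycles();
    hanningWindow(out, mult);            /* measured with warm caches */
    return (uint32_t)(read_cycles() - start);
}
```

Comparing this warm figure against a first-call figure gives a rough idea of how much of a measurement is memory latency rather than loop execution.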
     

  • Hi Xiaohui Li,

    Thank you for your response. I'm unable to build my project (RELEASE). It gives the following errors, which I find very strange since I'm building the RELEASE project, not the DEBUG one.

    Is there something I must do besides adding the 7633.tsc_h.asm to the C6000 Linker\File Search Path\ -I include line?

    The build errors point to these lines in the linker.cmd file (in the DEBUG folder, which I don't understand!).

    MEMORY
    {
    L2SRAM : org = 0x800000, len = 0x80000
    DDR3 : org = 0x80000000, len = 0x20000000
    MSMCSRAM : org = 0xc000000, len = 0x100000
    MSMCSRAM_NOCACHE : org = 0xa0100000, len = 0x300000
    }

    My build errors are listed below. Thanks!

    Ryan

    Description Resource Path Location Type
    #10263 DDR3 memory range has already linker.cmd /mc_sanity/Debug/configPkg line 39 C/C++ Problem
    #10263 L2SRAM memory range has linker.cmd /mc_sanity/Debug/configPkg line 38 C/C++ Problem
    #10263 MSMCSRAM memory range has linker.cmd /mc_sanity/Debug/configPkg line 40 C/C++ Problem
    #10263 MSMCSRAM_NOCACHE memory range linker.cmd /mc_sanity/Debug/configPkg line 41 C/C++ Problem
    #10264 DDR3 memory range overlaps linker.cmd /mc_sanity/Debug/configPkg line 39 C/C++ Problem
    #10264 L2SRAM memory range overlaps linker.cmd /mc_sanity/Debug/configPkg line 38 C/C++ Problem
    #10264 MSMCSRAM memory range overlaps linker.cmd /mc_sanity/Debug/configPkg line 40 C/C++ Problem
    #10264 MSMCSRAM_NOCACHE memory range linker.cmd /mc_sanity/Debug/configPkg line 41 C/C++ Problem
    #99900 error limit reached; 100 errors detected mc_sanity C/C++ Problem

  • Hi Ryan,

    You should rename the .asm file to tsc_h.asm and put it in the directory where your source files are located.

    I am not sure what the errors are about.  Did you start with the original OpenMP Hello example?  That's what I did and it compiled just fine without any changes in the settings or configurations.

    Xiaohui

  • Hi Xiaohui,

    I got it to build by creating a new omp_helloWorld project, but I'm not getting the same results you had (listed below).

    When I run the code, I got the following performance numbers.

    [C66xx_0] HANNING#1 = [ 6507 ] cycles

    [C66xx_0] HANNING#OMP = [ 2145 ] cycles

    [C66xx_0] HANNING#2 = [ 6543 ] cycles


    My results are 

    [C66xx_0] HANNING#1 = [ 32210 ] cycles

    [C66xx_0] HANNING#OMP = [ 32254 ] cycles

    [C66xx_0] HANNING#2 = [ 32240 ] cycles

    Am I missing a configuration, perhaps the RTSC Platform file?

  • I run the code in Release mode.

  • Why am I getting such different results? How many cores are you loading this on? 4 in my case.

    Should NTHREADS be defined 4 instead of 1? That shouldn't affect this benchmark.

  • I am loading to all 8 cores, but NTHREADS is set to 1 as in your original code.  So only core0 is doing the computation.  The length of the computation is 2048 as specified in your original code.  Are you using the code I sent to you?

  • I loaded the code on 8 cores (even though my CFG file has "OpenMP.setNumProcessors(4);" in it). When I run, it outputs:

    [C66xx_0] Hello World from thread = 0
    [C66xx_0] Number of threads = 1
    [C66xx_0] HANNING#OMP = [ 32236 ] cycles
    [C66xx_0] OMP TIME = [ 144885 ] cycles
    [C66xx_0] HANNING#2 = [ 32228 ] cycles

    What does your RTSC Platform config look like? Mine is:

  • Xiaohui,

    Your MAP file was identical to mine (except for the timestamp). I used it, nonetheless, but I didn't get different results. Still: 

    [C66xx_0] HANNING#1 = [ 32216 ] cycles
    [C66xx_0] Hello World from thread = 0
    [C66xx_0] Number of threads = 1
    [C66xx_0] HANNING#OMP = [ 32236 ] cycles
    [C66xx_0] OMP TIME = [ 146152 ] cycles
    [C66xx_0] HANNING#2 = [ 32298 ] cycles

  • 6201.platform.zip

    Actually your platform is quite different from mine.  Your data are allocated in DDR whereas mine are in MSMC.  Your stack is in MSMC but mine is in L2 SRAM.  Attached is my platform.

  • Okay, I reconfigured my platform to nearly match yours, except I don't have the MSMCSRAM_NOCACHE section. I still get: 

    [C66xx_0] HANNING#1 = [ 32204 ] cycles
    [C66xx_0] Hello World from thread = 0
    [C66xx_0] Number of threads = 1
    [C66xx_0] HANNING#OMP = [ 32288 ] cycles
    [C66xx_0] HANNING#2 = [ 32262 ] cycles

    My platform settings are attached.

  • Ryan,

    If I remember correctly, your code only computes 2048 samples for the Hanning windowing and each sample only requires one multiply.  Your measured performance indicates it takes more than 15 cycles for each sample. That's way too much.  Could you share your code and the .asm from the compiler output?

    Xiaohui
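The per-sample figure quoted above is just the measured cycle count divided by the window length. A minimal sketch of that arithmetic, using the cycle counts reported in this thread (cycles_per_sample is an illustrative helper name):

```c
#include <assert.h>
#include <stdint.h>

#define FFT_MAX_L 2048

/* Average cycles spent per output sample for one timed pass. */
double cycles_per_sample(uint32_t cycles, uint32_t samples)
{
    assert(samples > 0);
    return (double)cycles / (double)samples;
}
```

With the numbers posted here, 32210 cycles over 2048 samples is about 15.7 cycles per sample, 6507 cycles is about 3.2, and 2145 cycles is about 1.05.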

  • Ryan,

    Could you delete the two lines for L1DSRAM and L1PSRAM from your platform?

    Xiaohui

  • Xiaohui,

    I've attached my code and the .asm file (my only ASM file is the one you provided me). I also deleted L1PSRAM and L1DSRAM from my platform config as you requested.

    Ryan

    #include <ti/omp/omp.h>
    
    #include <string.h>
    #include <assert.h>
    #include <stdio.h>
    #include <time.h>
    #include <stdint.h>
    #include <xdc/std.h>
    #include <xdc/runtime/System.h>
    #include <ti/sysbios/BIOS.h>
    #include <xdc/runtime/Log.h>
    #include <xdc/runtime/Timestamp.h>
    #include <math.h>
    
    #define NTHREADS  1
    
    #define FFT_MAX_L 2048
    #define PI 3.14159265358979323846
    
    float     multiplier[ 2048 ];
    
    void generateHanningLookup( void )
    {
           int32_t i;
    
           for (i = 0; i < 2048; i++)
           {
                  // equation from stackoverflow.com
                  multiplier[ i ] = 0.5 * ( 1 - cos( 2 * PI * i / ( FFT_MAX_L - 1 ) ) );
           }
    
    }
    void hanningWindow( int16_t * restrict windowOutputData, float * restrict multiplier )
    {
    int j;
           for( j = 0; j < FFT_MAX_L; j++ )
           {
                  windowOutputData[ j ] = j * multiplier[ j ];
           }
    }
    
    void main()
    {
    
           int nthreads, tid;
    
           nthreads = NTHREADS;
    
           omp_set_num_threads(NTHREADS);
    
           int16_t   windowOutputData[ FFT_MAX_L ];
           uint16_t j;
           uint32_t start,  end, totalTime;
           uint32_t start1, end1;
           generateHanningLookup();
    
           hanningWindow( windowOutputData, multiplier );
    
           TSC_enable();
    //     start = Timestamp_get32();
           start = TSC_read();
    #if 0
           for( j = 0; j < FFT_MAX_L; j++ )
           {
                  windowOutputData[ j ] = j * multiplier[ j ];
           }
    #else
           hanningWindow( windowOutputData, multiplier );
    #endif
           end = TSC_read();
    //     printf( "HANNING#1 = [ %u ] cycles \n", ( Timestamp_get32() - start ) );
           printf( "HANNING#1 = [ %u ] cycles \n", ( end - start ) );
    
    
    //     totalTime = Timestamp_get32();
           totalTime = TSC_read();
           /* Fork a team of threads giving them their own copies of variables */
    #pragma omp parallel private(nthreads, tid)
           {
    
                  /* Obtain thread number */
                  tid = omp_get_thread_num();
                  printf("Hello World from thread = %d\n", tid);
    
                  if (tid == 0){
                         nthreads = omp_get_num_threads();
                         printf("Number of threads = %d\n", nthreads);
                     }
                  // Hanning Window
                  uint32_t startMp0, endMp0;
                  startMp0 = TSC_read();
    #if 0
                  for( j = 0; j < FFT_MAX_L; j++ )
                  {
                         windowOutputData[ j ] = j * multiplier[ j ];
                  }
    #else
                  hanningWindow( windowOutputData, multiplier );
    #endif
                  endMp0 = TSC_read();
                  printf( "HANNING#OMP = [ %u ] cycles \n", ( endMp0 - startMp0 ) );
    
           }  /* All threads join master thread and disband */
    //     printf( "OMP TIME = [ %u ] cycles \n", ( Timestamp_get32() - totalTime ) );
    //     printf( "OMP TIME = [ %u ] cycles \n", ( clock() - totalTime ) );
    //       printf( "OMP TIME = [ %u ] cycles \n", ( TSC_read() - totalTime ) );
    
    //     start = Timestamp_get32();
           start1= TSC_read();
    #if 0
           for( j = 0; j < FFT_MAX_L; j++ )
           {
                  windowOutputData[ j ] = j * multiplier[ j ];
           }
    #else
           hanningWindow( windowOutputData, multiplier );
    #endif
           end1= TSC_read();
    //     printf( "HANNING#2 = [ %u ] cycles \n", ( Timestamp_get32() - start ) );
           printf( "HANNING#2 = [ %u ] cycles \n", ( end1 - start1 ) );
    
    }
    
    
    7607.tsc_h.asm

    Ryan

  • Ryan,

    Actually, I was asking for the .asm file from the compiler output.  The compiler should generate an omp_hello.asm file if you enable keeping the compiler-generated assembly file.

    Properties->Build->C6000 Compiler ->Advanced Options->Assembler-> check the box for "Keep the generated assembly language"

    Xiaohui

     

  • Okay I attached the program ASM. 2134.omp_hello.asm

  • The .asm file looks fine.  Could you also send me the .map file?  What's your optimization level, -O2 or -O3?  Do you have any debugging features turned on?

  • **** Build of configuration Release for project omp_hello_timing ****

    C:\ti\ccsv5\utils\bin\gmake -k all
    'Building file: ../omp_config.cfg'
    'Invoking: XDCtools'
    "C:/ti/xdctools_3_24_06_63/xs" --xdcpath="C:/ti/omp_1_01_03_02/packages;C:/ti/bios_6_33_06_50/packages;C:/ti/ipc_1_24_03_32/packages;C:/ti/pdk_C6678_1_1_2_6/packages;C:/ti/ccsv5/ccs_base;" xdc.tools.configuro -o configPkg -t ti.targets.elf.C66 -p ti.omp.examples.platforms.evm6678 -r debug -c "C:/ti/ccsv5/tools/compiler/c6000_7.4.2" "../omp_config.cfg"
    making package.mak (because of package.bld) ...
    generating interfaces for package configPkg (because package/package.xdc.inc is older than package.xdc) ...
    configuring omp_config.xe66 from package/cfg/omp_config_pe66.cfg ...
    cle66 package/cfg/omp_config_pe66.c ...
    'Finished building: ../omp_config.cfg'
    ' '
    'Building file: ../omp_hello.c'
    'Invoking: C6000 Compiler'
    "C:/ti/ccsv5/tools/compiler/c6000_7.4.2/bin/cl6x" -mv6600 --abi=eabi -O2 --include_path="C:/ti/ccsv5/tools/compiler/c6000_7.4.2/include" --display_error_number --diag_warning=225 --diag_wrap=off --openmp -k --preproc_with_compile --preproc_dependency="omp_hello.pp" --cmd_file="./configPkg/compiler.opt" "../omp_hello.c"
    "../omp_hello.c", line 59: warning #225-D: function declared implicitly
    "../omp_hello.c", line 61: warning #225-D: function declared implicitly
    "../omp_hello.c", line 52: warning #179-D: variable "j" was declared but never referenced
    "../omp_hello.c", line 53: warning #552-D: variable "totalTime" was set but never used
    'Finished building: ../omp_hello.c'
    ' '
    'Building file: ../tsc_h.asm'
    'Invoking: C6000 Compiler'
    "C:/ti/ccsv5/tools/compiler/c6000_7.4.2/bin/cl6x" -mv6600 --abi=eabi -O2 --include_path="C:/ti/ccsv5/tools/compiler/c6000_7.4.2/include" --display_error_number --diag_warning=225 --diag_wrap=off --openmp -k --preproc_with_compile --preproc_dependency="tsc_h.pp" --cmd_file="./configPkg/compiler.opt" "../tsc_h.asm"
    'Finished building: ../tsc_h.asm'
    ' '
    'Building target: omp_hello_timing.out'
    'Invoking: C6000 Linker'
    "C:/ti/ccsv5/tools/compiler/c6000_7.4.2/bin/cl6x" -mv6600 --abi=eabi -O2 --display_error_number --diag_warning=225 --diag_wrap=off --openmp -k -z -m"omp_hello_timing.map" -i"C:/ti/ccsv5/tools/compiler/c6000_7.4.2/lib" -i"C:/ti/ccsv5/tools/compiler/c6000_7.4.2/include" --reread_libs --warn_sections --display_error_number --diag_wrap=off --rom_model -o "omp_hello_timing.out" -l"./configPkg/linker.cmd" "./tsc_h.obj" "./omp_hello.obj" -l"libc.a"
    <Linking>
    'Finished building target: omp_hello_timing.out'
    ' '

    **** Build Finished ****

  • Also see my map file; I meant to attach it in the last message. 2500.omp_hello_timing_map.zip

  • Ryan,

    Your multiplier array starts at 0xa0194500 which is in DDR3.  In my code, it starts at 0x0c0b78a8 which is in MSMC.  This will make a huge difference.  I am not sure whether your DDR3 is configured as cacheable or non-cacheable.  Again, that can make a huge difference.

    Your memory configuration is:

             name            origin    length      used     unused   attr    fill
    ----------------------  --------  ---------  --------  --------  ----  --------
      L2SRAM                00800000   00080000  0002bca0  00054360  RWIX
      MSMCSRAM              0c000000   00100000  00039a0a  000c65f6  RWIX
      DDR3                  80000000   20000000  01000000  1f000000  RWIX
      MSMCSRAM_NOCACHE      a0100000   00300000  00096a92  0026956e  RWIX

    My configuration is:

             name            origin    length      used     unused   attr    fill
    ----------------------  --------  ---------  --------  --------  ----  --------
      L2SRAM                00800000   00080000  0002bcc0  00054340  RWIX
      MSMCSRAM              0c000000   003c0000  000bab2a  003054d6  RWIX
      DDR3                  80000000   20000000  00000000  20000000  RWIX
      MSMCSRAM_NOCACHE      a03c0000   00040000  0003996e  00006692  RWIX

    The two configurations are drastically different in the sense that yours uses a significant amount of DDR3.  Mine does not use DDR3 at all.

    I think that explains the performance differences we are seeing.

    Xiaohui
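A symbol address from the .map file can be checked against the MEMORY CONFIGURATION table mechanically. The sketch below hard-codes the two tables quoted above and looks up which named range contains an address; MemRegion and region_index are hypothetical names used only for illustration. Going purely by the origin and length columns, 0xa0194500 lands inside the MSMCSRAM_NOCACHE window of the first table (0xa0100000, length 0x300000), which sits just above the listed DDR3 range, while 0x0c0b78a8 lands in MSMCSRAM. Either way the first address is outside the cached MSMCSRAM range.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t    origin;
    uint32_t    length;
} MemRegion;

/* Values copied from the two MEMORY CONFIGURATION tables in this thread. */
const MemRegion ryan_map[] = {
    { "L2SRAM",           0x00800000u, 0x00080000u },
    { "MSMCSRAM",         0x0c000000u, 0x00100000u },
    { "DDR3",             0x80000000u, 0x20000000u },
    { "MSMCSRAM_NOCACHE", 0xa0100000u, 0x00300000u },
};

const MemRegion xiaohui_map[] = {
    { "L2SRAM",           0x00800000u, 0x00080000u },
    { "MSMCSRAM",         0x0c000000u, 0x003c0000u },
    { "DDR3",             0x80000000u, 0x20000000u },
    { "MSMCSRAM_NOCACHE", 0xa03c0000u, 0x00040000u },
};

/* Return the index of the region containing addr, or -1 if none does. */
int region_index(const MemRegion *map, size_t count, uint32_t addr)
{
    size_t i;
    for (i = 0; i < count; i++)
    {
        if (addr >= map[i].origin && (addr - map[i].origin) < map[i].length)
        {
            return (int)i;
        }
    }
    return -1;
}
```

Printing map[region_index(...)].name for each large array is a quick way to confirm where the linker actually placed the benchmark data.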

     

     

  • Xiaohui,

    That's what I expected. What can I do to place the multiplier array in a different address? If we used the same code and platform configuration, what is causing my array to be stored in DDR3?

    Ryan

  • Ryan,

    I am not sure we are using the same platform configuration because it ultimately decides the memory configuration.

    Are we using the same versions of everything?  The attachment shows the versions of the various software components on my setup.  Otherwise, just double-check that your platform configuration puts data in MSMC and does not use any DDR.

    8585.configuration.zip

  • I'm using the same versions as you, except I'm using OpenMP BIOS runtime library 1.1.3.02 and you are using 1.2.0.05

    Where can I find 1.2.0.05 to install that? 

    Can you send me your platform configuration file so I can just load that?

    Ryan

  • That didn't help me. How do I load your platform settings? Also, what is the procedure for loading these settings and applying them? Do I need to terminate and reload my target configuration? I've tried that, but didn't produce different results.

    Ryan

  • Ryan,

    You need to unzip the platform file I sent you and save it on your hard drive.  Then open the properties of your project and do the following to add the platform to the repository.

    property->general->RTSC->Add->Select repository from file-system->Browse...

    After adding the desired platform, you can pull down the Platform under the Property window and find and select the platform you need.

    To modify a platform, go to Tools->RTSC Tools->Platform->Edit/New->Browse under Platform Package Repository to go to the directory that has the Platform file then choose the platform under Package Name -> OK.  A window for the platform configuration should pop up to allow you to make any changes.

    Xiaohui

     

     

  • Xiaohui,

    Thank you. I was not importing the new platform repository, and I was never setting my project to use a different platform file, so there was no change being made.

    Now my MAP file looks like this:

    MEMORY CONFIGURATION

             name            origin    length      used     unused   attr    fill
    ----------------------  --------  ---------  --------  --------  ----  --------
      NEARRAM               00000001   00007fff  00000000  00007fff  RWIX
      RAM                   00008000   fffffffe  0000793a  ffff86c4  RWIX

    Does this look correct? What does yours look like?

    Ryan

  • Ryan,

    It doesn't look right.  Here is the memory configuration from my .map file.

    MEMORY CONFIGURATION

             name            origin    length      used     unused   attr    fill
    ----------------------  --------  ---------  --------  --------  ----  --------
      L2SRAM                00800000   00080000  0002bcc0  00054340  RWIX
      MSMCSRAM              0c000000   003c0000  000bab2a  003054d6  RWIX
      DDR3                  80000000   20000000  00000000  20000000  RWIX
      MSMCSRAM_NOCACHE      a03c0000   00040000  0003996e  00006692  RWIX

    Again, this is from the original OpenMP Hello example.  I didn't make any changes.  I would expect you to get the same thing.

    Xiaohui

  • My stock project direct from the original OpenMP Hello example produces the same DDR3 usage. I don't understand what is going on. I have the same RTSC Platform as you, see the screenshot. I'm making sure to import this in the Project Build Settings. I'm not sure if you're doing the Debug or Release build, but I have the same usage in DDR3 regardless. 

    DEBUG

      L2SRAM                00800000   00080000  0002bc60  000543a0  RW X
      MSMCSRAM              0c000000   003c0000  000b957f  00306a81  RW X
      DDR3                  80000000   20000000  01000000  1f000000  RW X
      MSMCSRAM_NOCACHE      a03c0000   00040000  00014752  0002b8ae  RW X

    RELEASE

      L2SRAM                00800000   00080000  0002bc60  000543a0  RWIX
      MSMCSRAM              0c000000   00100000  00037a16  000c85ea  RWIX
      DDR3                  80000000   20000000  01000000  1f000000  RWIX
      MSMCSRAM_NOCACHE      a0100000   00300000  00094a92  0026b56e  RWIX


    Do you think you can  zip up your project and share that with me? I can't see what else I'm doing wrong.

    Ryan

  • 4212.OpenMp_Hello_1.zip

    My project is attached.

    Could you try deleting DDR from your platform or at least set the length to zero?

    Have you changed to use the same version of IPC that I am using?

    Xiaohui

  • 2 problems:

    1) It fails to build and gives the following console output. 

    **** Build of configuration Release for project OpenMp_Hello_1 ****

    C:\ti\ccsv5\utils\bin\gmake -k all
    'Building file: ../omp_config.cfg'
    'Invoking: XDCtools'
    "C:/ti/xdctools_3_24_06_63/xs" --xdcpath="C:/ti/omp_1_01_03_02/packages;C:/ti/bios_6_33_06_50/packages;C:/ti/ipc_1_24_03_32/packages;C:/ti/pdk_C6678_1_1_2_6/packages;C:/ti/ccsv5/ccs_base;" xdc.tools.configuro -o configPkg -t ti.targets.elf.C66 -p ti.omp.examples.platforms.evm6678 -r debug -c "C:/ti/ccsv5/tools/compiler/c6000_7.4.2" "../omp_config.cfg"
    making package.mak (because of package.bld) ...
    generating interfaces for package configPkg (because package/package.xdc.inc is older than package.xdc) ...
    configuring omp_config.xe66 from package/cfg/omp_config_pe66.cfg ...
    error: ti.omp.utils.HeapOMP: "C:/ti/omp_1_01_03_02/packages/ti/omp/utils/HeapOMP.xs", line 77: ti.omp.utils.HeapOMP : HeapOMP.sharedRegionId is invalid
    js: "C:/ti/xdctools_3_24_06_63/packages/xdc/cfg/Main.xs", line 149: Error: Configuration failed!
    gmake.exe: *** [package/cfg/omp_config_pe66.xdl] Error 1
    js: "C:/ti/xdctools_3_24_06_63/packages/xdc/tools/Cmdr.xs", line 51: Error: xdc.tools.configuro: configuration failed due to earlier errors (status = 2); 'linker.cmd' deleted.
    gmake: *** [configPkg/compiler.opt] Error 1
    gmake: Target `all' not remade because of errors.

    **** Build Finished ****

    2) I don't have your version of the OpenMP BIOS runtime library, 1.2.0.05. Do you have a link to where I can download that?

    Ryan

  • Ryan,

    Could you please check my previous post?  I already sent out the link for OMP 1.02.00.05 which is what I am using.

    Xiaohui

  • Xiaohui,

    Yes, updating to OMP 1.02.00.05 corrected my MAP output file. 

      L2SRAM                00800000   00080000  0002bcc0  00054340  RWIX
      MSMCSRAM              0c000000   003c0000  000b9906  003066fa  RWIX
      DDR3                  80000000   20000000  00000000  20000000  RWIX
      MSMCSRAM_NOCACHE      a03c0000   00040000  0003996e  00006692  RWIX

    My benchmark times are much nicer now.

    Back to my original question, can you help me benchmark the following? I want to:

    1) time hanningWindow() before the OMP parallel section

    2) time hanningWindow() inside the parallel section, where I want each core to run the window function on its own set of input/output data

    3) time hanningWindow() outside and after the parallel section 

    Ryan

  • Ryan,

    One of my earlier posts shows the code and the benchmark results for the things you listed.  Are you able to duplicate the results I got?

    The results address your original questions about inconsistent benchmark results inside and outside the parallel region.

    Xiaohui

  • Xiaohui,

    I'm able to reproduce those results from before. You only ran this with 1 thread; I want to run it with 4 threads. So I want to run 4 instances of hanningWindow() concurrently, each on its own thread, and print out the time spent in each thread.

    I'd expect outcome like this:

    HANNING#1 = [ 6500 ] cycles

    HANNING[ core 0 ] = [ 6500 ] cycles

    HANNING[ core 1 ] = [ 6500 ] cycles

    HANNING[ core 2 ] = [ 6500 ] cycles

    HANNING[ core 3 ] = [ 6500 ] cycles

    HANNING#2 = [ 6500 ] cycles


    What's the code to generate that output?

    Ryan

  • Here is one way to do it.

    #include <ti/omp/omp.h>

    #include <string.h>
    #include <assert.h>
    #include <stdio.h>
    #include <time.h>
    #include <stdint.h>
    #include <xdc/std.h>
    #include <xdc/runtime/System.h>
    #include <ti/sysbios/BIOS.h>
    #include <xdc/runtime/Log.h>
    #include <xdc/runtime/Timestamp.h>
    #include <math.h>
     

    #define NTHREADS  4
     

    #define FFT_MAX_L 2048
    #define PI 3.14159265358979323846
     

    float     multiplier[ 2048 ];
    int16_t   windowOutputData[ FFT_MAX_L*NTHREADS ];
     

    void generateHanningLookup( void )
    {
           int32_t i;
     

           for (i = 0; i < 2048; i++)
           {
                  // equation from stackoverflow.com
                  multiplier[ i ] = 0.5 * ( 1 - cos( 2 * PI * i / ( FFT_MAX_L - 1 ) ) );
           }
     

    }

    void hanningWindow( int16_t * restrict windowOutputData, float * restrict multiplier )
    {
           int j;

           for( j = 0; j < FFT_MAX_L; j++ )
           {
                  windowOutputData[ j ] = j * multiplier[ j ];
           }
    }


    void main()
    {
     

           int nthreads, tid;
     

           nthreads = NTHREADS;
     

           omp_set_num_threads(NTHREADS);
     

    //     int16_t   windowOutputData[ FFT_MAX_L ];
           uint16_t j;
           uint32_t start,  end, totalTime;
           uint32_t start1, end1;
           uint32_t startMp0, endMp0;

           generateHanningLookup();
     
           TSC_enable();

           start = TSC_read();
           hanningWindow( &windowOutputData[FFT_MAX_L*0], multiplier );
           end = TSC_read();
           printf("\n");
           printf( "HANNING#1 = [ %u ] cycles \n", ( end - start ) );
     
      
           totalTime = TSC_read();
           /* Fork a team of threads giving them their own copies of variables */
    #pragma omp parallel shared(windowOutputData, multiplier ) private(nthreads, tid, startMp0, endMp0)
           {
     
                  TSC_enable();

                  /* Obtain thread number */
                  tid = omp_get_thread_num();
                  printf("Hello World from thread = %d\n", tid);

                  // Hanning Window
                  startMp0 = TSC_read();
                  hanningWindow( &windowOutputData[FFT_MAX_L*tid], multiplier );
                  endMp0 = TSC_read();

                  printf( "HANNING#OMP = [ %u ] cycles \n", ( endMp0 - startMp0 ) );


           }  /* All threads join master thread and disband */
           printf( "OMP TIME = [ %u ] cycles \n", ( TSC_read() - totalTime ) );
     

           start1= TSC_read();
           hanningWindow( windowOutputData, multiplier );
           end1= TSC_read();
           printf( "HANNING#2 = [ %u ] cycles \n", ( end1 - start1 ) );
     }

    ----------------------------------------------------------------------

    [C66xx_0]

    [C66xx_0] HANNING#1 = [ 6726 ] cycles

    [C66xx_0] Hello World from thread = 0

    [C66xx_0] HANNING#OMP = [ 6732 ] cycles

    [C66xx_1] Hello World from thread = 1

    [C66xx_2] Hello World from thread = 2

    [C66xx_3] Hello World from thread = 3

    [C66xx_1] HANNING#OMP = [ 6818 ] cycles

    [C66xx_2] HANNING#OMP = [ 6806 ] cycles

    [C66xx_3] HANNING#OMP = [ 6818 ] cycles

    [C66xx_0] OMP TIME = [ 145931975 ] cycles

    [C66xx_0] HANNING#2 = [ 6796 ] cycles

  • Excellent! I got the same results.

    [C66xx_0]
    [C66xx_0] HANNING#1 = [ 6294 ] cycles
    [C66xx_0] Hello World from thread = 0
    [C66xx_0] HANNING#OMP = [ 6304 ] cycles
    [C66xx_1] Hello World from thread = 1
    [C66xx_2] Hello World from thread = 2
    [C66xx_3] Hello World from thread = 3
    [C66xx_1] HANNING#OMP = [ 6368 ] cycles
    [C66xx_2] HANNING#OMP = [ 6304 ] cycles
    [C66xx_3] HANNING#OMP = [ 6342 ] cycles
    [C66xx_0] OMP TIME = [ 206855207 ] cycles
    [C66xx_0] HANNING#2 = [ 6270 ] cycles

  • Xiaohui,

    What is the meaning of the number of cycles at OMP TIME? Is this a meaningless calculation?

    [C66xx_0] OMP TIME = [ 145931975 ] cycles

    Thank you,

    Ryan

  • Ryan,

    This is the number of cycles from the start of the parallel region to the end of the parallel region, measured in the master thread.
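
    As a rough illustration of that distinction (a sketch in portable C, not the TI code from this thread): each thread records its own elapsed time in a slot indexed by its thread number, while the master thread times the whole fork-to-join span and prints nothing until after the join. Here `omp_get_wtime()` stands in for `TSC_read()`, and `time_region`, `kernel`, `now`, `thread_id`, and `MAX_THREADS` are hypothetical names. The fallbacks let the sketch build without OpenMP.

```c
#include <assert.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_THREADS 8     /* hypothetical upper bound for this sketch */
#define N           2048

static float table[N];
static short out[MAX_THREADS][N];

/* Wall-clock seconds; falls back to clock() when built without OpenMP. */
static double now(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Calling thread's number; always 0 when built without OpenMP. */
static int thread_id(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Stand-in for hanningWindow(): scale each index by a table entry. */
static void kernel(short *dst, const float *mul)
{
    int j;
    for (j = 0; j < N; j++)
        dst[j] = (short)(j * mul[j]);
}

/* Runs kernel() once per thread. per_thread[t] receives thread t's own
 * elapsed seconds, or stays negative if thread t did not run. Returns the
 * master-measured fork-to-join time, which also includes runtime overhead. */
double time_region(double per_thread[MAX_THREADS])
{
    int t;
    double region_start;

    for (t = 0; t < MAX_THREADS; t++)
        per_thread[t] = -1.0;
    for (t = 0; t < N; t++)
        table[t] = 0.5f;

    region_start = now();
    #pragma omp parallel num_threads(4)
    {
        int tid = thread_id();
        double t0;

        assert(tid < MAX_THREADS);
        t0 = now();
        kernel(out[tid], table);
        per_thread[tid] = now() - t0;   /* no printf inside the region */
    }
    return now() - region_start;
}
```

    Printing the `per_thread` entries after `time_region()` returns keeps console I/O out of both measurements.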

    Xiaohui

  • Xiaohui,

    Then why is that cycle count so large? 

    [C66xx_0] OMP TIME = [ 206855207 ] cycles <-------result in the previous code

    I understand printf is expensive, so even if I take out all the overhead, and simplify it to the following code, I still get a cycle count that is 18 times greater than I would expect:

    #pragma omp parallel shared(windowOutputData, multiplier ) private(nthreads, tid, startMp0, endMp0)
    {

          tid = omp_get_thread_num();

          hanningWindow( &windowOutputData[FFT_MAX_L*tid], &multiplier[FFT_MAX_L*tid] );

    } /* All threads join master thread and disband */

    [C66xx_0] OMP TIME = [ 113837 ] cycles <--- result from this code
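
    The arithmetic behind that comparison can be sketched as follows. It assumes the baseline is the HANNING#1 figure of 6294 cycles from the earlier run, which the post does not state explicitly; both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles spent in the timed span on everything other than one kernel call. */
uint32_t overhead_cycles(uint32_t region_cycles, uint32_t kernel_cycles)
{
    assert(region_cycles >= kernel_cycles);
    return region_cycles - kernel_cycles;
}

/* Ratio of region time to kernel time, rounded to the nearest integer. */
uint32_t slowdown_ratio(uint32_t region_cycles, uint32_t kernel_cycles)
{
    assert(kernel_cycles != 0);
    return (region_cycles + kernel_cycles / 2u) / kernel_cycles;
}
```

    Under that assumption, `slowdown_ratio(113837, 6294)` gives 18 and `overhead_cycles(113837, 6294)` gives 107543.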


    Ryan

    I am getting about 11 times, which is rather high. I will look into it and let you know.

    Xiaohui

  • Hi Xiaohui,

    Have you found out anything about this overhead time?

    Thanks,

    Ryan

  • Ryan,

    I added the following dummy (empty) parallel region before your actual parallel region.

    #pragma omp parallel
           {
           }

    Then go to Properties->General->RTSC and select 'release' under Build-profile. With those two changes I got the following performance.

    [C66xx_0] HANNING#1 = [ 6724 ] cycles

    [C66xx_0] OMP TIME = [ 21184 ] cycles

    [C66xx_0] HANNING#2 = [ 6769 ] cycles

    The OpenMP overhead in this case is around 13000 cycles, with the L2 cache enabled. If you disable the L2 cache, the overhead drops to around 7000 cycles.

    Xiaohui

    void main()
    {
     

           int nthreads, tid;
     

           nthreads = NTHREADS;
           omp_set_num_threads(NTHREADS);

           uint16_t j;
           uint32_t start,  end, totalTime;
           uint32_t start1, end1;
           uint32_t startMp0, endMp0;

           generateHanningLookup();
     
           TSC_enable();

           start = TSC_read();
           hanningWindow( &windowOutputData[FFT_MAX_L*0], multiplier );
           end = TSC_read();
           printf("\n");
           printf( "HANNING#1 = [ %u ] cycles \n", ( end - start ) );


     
    #pragma omp parallel
           {
           }
     

           totalTime = TSC_read();
           /* Fork a team of threads giving them their own copies of variables */
    #pragma omp parallel shared(windowOutputData, multiplier ) private(nthreads, tid, startMp0, endMp0)
           {
                  //TSC_enable();

                  /* Obtain thread number */
                  tid = omp_get_thread_num();

                  // Hanning Window
                  //startMp0 = TSC_read();
                  hanningWindow( &windowOutputData[FFT_MAX_L*tid], multiplier );
                  //endMp0 = TSC_read();

           }  /* All threads join master thread and disband */
           totalTime = TSC_read() - totalTime;
           printf( "OMP TIME = [ %u ] cycles \n", totalTime );
     

           start1= TSC_read();
           hanningWindow( windowOutputData, multiplier );
           end1= TSC_read();
           printf( "HANNING#2 = [ %u ] cycles \n", ( end1 - start1 ) );
     }
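
    The warm-up idea can also be sketched in portable C, outside the TI toolchain. This assumes a standard OpenMP runtime where the first parallel region may pay a one-time startup cost; `omp_get_wtime()` replaces `TSC_read()`, and `timed_region_after_warmup` and `now_seconds` are hypothetical names. The fallback lets it build without OpenMP.

```c
#include <assert.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define TEAM_SIZE 4   /* matches NTHREADS in the thread's example */

/* Wall-clock seconds; falls back to clock() when built without OpenMP. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Marks done[i] = 1 for each of TEAM_SIZE work items inside a timed
 * parallel region, after an untimed empty region has run first.
 * Returns the elapsed seconds of the timed region only. */
double timed_region_after_warmup(int done[TEAM_SIZE])
{
    double t0;
    int i;

    assert(done != 0);
    for (i = 0; i < TEAM_SIZE; i++)
        done[i] = 0;

    /* Warm-up: empty region, so any one-time startup cost lands here. */
    #pragma omp parallel num_threads(TEAM_SIZE)
    {
    }

    t0 = now_seconds();
    #pragma omp parallel for num_threads(TEAM_SIZE)
    for (i = 0; i < TEAM_SIZE; i++)
        done[i] = 1;                /* placeholder for the real kernel */
    return now_seconds() - t0;
}
```

    Timing the same region with and without the warm-up block is one way to see how much of the measured span is startup cost rather than steady-state fork/join overhead.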

  • Hi Xiaohui , Ryan ,

    I am learning parallel programming using OpenMP on the TMS320C6678 board. I tried to compile your application (Hanning windowing) on my system, and I ran into these problems:

    - The shared memory is invalid. To work around this, I used the original config file from the OpenMP example.

    - My compiler cannot find TSC_enable() and TSC_read(), which causes a compile error.

    - The OpenMP pragma directive is still unrecognized, which causes a compiler warning.

    My MCSDK version is 2.01.06 and my OpenMP version is 1_01_03_02. I have tried to update these, but without success.

    Could you please help me compile and run your project?

    Thank you for your reply.