This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

Compiler/TMDSEVM6678: What is the purpose of OpenMP Heap Management API?

Part Number: TMDSEVM6678
Other Parts Discussed in Thread: TMS320C6678

Tool/software: TI C/C++ Compiler

Hi all,

I am applying OpenMP to my code, and the results are correct. Performance-wise, however, it is much slower. When I profiled the cycle counts for different parts of the code, I noticed that Memory_alloc and Memory_free take far more cycles than when I run the app in serial mode (2.6e9 in OpenMP mode vs 0.25e9 in serial mode).

The above made me wonder whether I am doing heap management correctly.

For one thing, I pass NULL as the first argument of Memory_alloc and Memory_free, which I believe uses the default system heap, as shown in the snapshot from the link below.

However, I came across the OpenMP Heap Management API, which covers heap management on DDR3 and MSMC. So I am thinking I might get better performance if I use this API instead. From what I read, its main stated purpose is to avoid cache inconsistency, but apparently I don't experience such an issue, because the code runs fine (or maybe it was just good luck so far).

Therefore, I seem to have three questions:

1) Why do I experience slow execution in OpenMP mode on the DSP, while the same code runs faster on my PC with OpenMP enabled?

2) Would using the OpenMP Heap Management API potentially speed up the code, or is it only meant to guarantee cache consistency?

3) If I use the OpenMP Heap Management API, would memory allocations/frees be thread-safe, so that I don't need to use the "omp critical" directive?

I guess I need to better understand the main purpose of the OpenMP Heap Management API; apologies if the questions sound naive ...

Please let me know if you need more details to answer my questions.

Regards,

  • Hello!

    In my team we made a brief attempt with OpenMP, so I can't speak as an expert, but I'd like to share a couple of observations and thoughts.

    We tested OpenMP on data windowing, which is effectively a convolution of input data with a window function.

    When we applied OpenMP, we saw the execution speed degrade versus a single core. So we gradually applied the recommended techniques: we placed all data in multicore shared memory, made sure all buffers were aligned on the cache line size, and partitioned the data for each core into pieces whose size is a multiple of a cache line. Only after this could we get a bit over a 3x speedup on 4 cores versus the single-core app.

    So my feeling is that, in contrast to compiler optimizations, which may bring a visible performance effect just by applying a compiler option, OpenMP is a beast that brings no such improvement without careful planning and resource partitioning. I see that as a consequence of the fact that a DSP is quite a special machine compared to a general-purpose PC, and thus requires more care.

    I told you about our experience with a multicore app just to illustrate that it is worth the effort if one can give each core independent access to its data without contention. MSMC is such a storage: it has an independent port per core and can serve them all efficiently when the accessed locations do not overlap. With this in mind, one may expect dynamic memory allocation to be efficient only when the above restrictions are met: cache-line boundary alignment and non-overlapping locations. In my imagination there should be per-core heaps for it to be efficient, but honestly I never went that deep; we used statically preallocated buffers instead.

    Hope this gives you some ground for thinking about your app.

    Ok, thanks very much for sharing your experience. So, here is mine ...

    1) I used "omp for schedule(dynamic,1024)" in all loops to make sure the calculations are cache-line aligned, as below. I'd like to know whether that is enough, and what else I could potentially do to improve. What are your thoughts (I would appreciate it if you could shed some more light on this)?

    2) However, I placed the heap in DDR3, which may or may not have caused the program to become memory-bound. I use cfg files, and as far as I understand, the heap is placed in a shared region, which conceptually I don't understand. Below is the cfg file I use. I would appreciate it if you could take a look and let me know how I can place the heap in MSMC, and whether I need to use the OpenMP Heap Management API in the code instead of the regular Memory_alloc(NULL, size).

    I believe that in the snippet below, changing ddr3.base to msmc.base would do the job. Am I correct?

    	       // Initialize a Shared Region & create a heap in the DDR3 memory region
    	       var SharedRegion   = xdc.useModule('ti.sdo.ipc.SharedRegion');
    	       SharedRegion.setEntryMeta( sharedRegionId,
    	                                  {   base: ddr3.base,
    	                                      len:  sharedHeapSize,
    	                                      ownerProcId: OpenMP.masterCoreIdx,
    	                                      cacheEnable: true,
    	                                      createHeap: true,
    	                                      isValid: true,
    	                                      name: "DDR3_SR0",
    	                                  });
    	

    3) In my program we do image processing, and the .out file is around 20MB, mostly because we use many images throughout the program execution. I know MSMC is only 4MB, so I probably have to use EDMA to move chunks of the images between DDR3 and MSMC. Please let us know your thoughts on that too (much appreciated).

    4) You mentioned "made sure all buffers were aligned on cache line size". Do you mean that the start address of a buffer should be a multiple of the cache line size, or that the size of the buffer needs to be a multiple of the cache line size (I guess the schedule clause takes care of the latter but not the former; this might be the point I am missing)?

    Regards,

    omp_config.cfg

    /***************************/
    /* SECTION MAPPING         */
    /***************************/
    var program = xdc.useModule('xdc.cfg.Program');
    
    program.sectMap[".args"]        = new Program.SectionSpec();
    program.sectMap[".bss"]         = new Program.SectionSpec();
    program.sectMap[".cinit"]       = new Program.SectionSpec();
    program.sectMap[".cio"]         = new Program.SectionSpec();
    program.sectMap[".const"]       = new Program.SectionSpec();
    program.sectMap[".data"]        = new Program.SectionSpec();
    program.sectMap[".far"]         = new Program.SectionSpec();
    program.sectMap[".fardata"]     = new Program.SectionSpec();
    program.sectMap[".neardata"]    = new Program.SectionSpec();
    program.sectMap[".rodata"]      = new Program.SectionSpec();
    program.sectMap[".stack"]       = new Program.SectionSpec();
    program.sectMap[".switch"]      = new Program.SectionSpec();
    program.sectMap[".sysmem"]      = new Program.SectionSpec();
    program.sectMap[".text"]        = new Program.SectionSpec();
        
    // Must place these sections in core local memory 
    program.sectMap[".args"].loadSegment        = "DDR3";
    program.sectMap[".cio"].loadSegment         = "L2SRAM";
    
    // Variables in the following data sections can potentially be 'shared' in
    // OpenMP. These sections must be placed in shared memory.
    program.sectMap[".bss"].loadSegment         = "DDR3";
    program.sectMap[".cinit"].loadSegment       = "DDR3";
    program.sectMap[".const"].loadSegment       = "DDR3";
    program.sectMap[".data"].loadSegment        = "DDR3";
    program.sectMap[".far"].loadSegment         = "DDR3";
    program.sectMap[".fardata"].loadSegment     = "DDR3";
    program.sectMap[".neardata"].loadSegment    = "DDR3";
    program.sectMap[".rodata"].loadSegment      = "DDR3";
    program.sectMap[".sysmem"].loadSegment      = "DDR3";
    
    // Code sections shared by cores - place in shared memory to avoid duplication
    program.sectMap[".switch"].loadSegment      = program.platform.codeMemory;
    program.sectMap[".text"].loadSegment        = "MSMCSRAM";
    print(" program.platform.codeMemory = ", program.platform.codeMemory);
    
    // Size the default stack and place it in L2SRAM 
    var deviceName = String(Program.cpu.deviceName);
    // if  (deviceName.search("DRA7XX") == -1) { program.stack = 0x20000; }
    // if  (deviceName.search("DRA7XX") == -1) { program.stack = 256*1024; } // working in serial mode 
    // if  (deviceName.search("DRA7XX") == -1) { program.stack = 330*1024; }
    if  (deviceName.search("DRA7XX") == -1) { program.stack = 330*1024; }
    // if  (deviceName.search("DRA7XX") == -1) { program.stack = 512*1024; } 
    else                                    { program.stack = 0x8000;  }
    program.sectMap[".stack"].loadSegment       = "L2SRAM"; // works in serial mode 256KB .stack
    // program.sectMap[".stack"].loadSegment       = "MSMCSRAM"; // doesnt work in OpenMP
    
    // Since there are no arguments passed to main, set .args size to 0
    program.argSize = 0;
    
    
    /********************************/
    /* OPENMP RUNTIME CONFIGURATION */
    /********************************/
    USE_OPENMP = false;
    if (USE_OPENMP) {
    	// Include OMP runtime in the build
    	var ompSettings = xdc.useModule("ti.runtime.openmp.Settings");
    	
    	// Set to true if the application uses or has dependencies on BIOS components
    	ompSettings.usingRtsc = true;
    	
    	if (ompSettings.usingRtsc)
    	{
    	    /* Configure OpenMP for BIOS
    	     * - OpenMP.configureCores(masterCoreId, numberofCoresInRuntime)
    	     *       Configures the id of the master core and the number of cores
    	     *       available to the runtime.
    	     */
    	
    	    var OpenMP = xdc.useModule('ti.runtime.ompbios.OpenMP');
    	
    	    // Configure the index of the master core and the number of cores available
    	    // to the runtime. The cores are contiguous.
    	    OpenMP.masterCoreIdx = 0;
    	
    	    // Setup number of cores based on the device
    	    if      (deviceName.search("DRA7XX") != -1) { OpenMP.numCores = 2; }
    	    else if (deviceName.search("6670")   != -1) { OpenMP.numCores = 4; }
    	    else if (deviceName.search("6657")   != -1) { OpenMP.numCores = 2; }
    	    else if (deviceName.search("K2L")    != -1) { OpenMP.numCores = 4; }
    	    else                                        { OpenMP.numCores = 8; }
    	
    	    // Pull in memory ranges described in Platform.xdc to configure the runtime
    	    var ddr3       = Program.cpu.memoryMap["DDR3"];
    	    var ddr3_nc    = Program.cpu.memoryMap["DDR3_NC"];
    	    var msmc       = Program.cpu.memoryMap["MSMCSRAM"];
    	    var msmcNcVirt = Program.cpu.memoryMap["OMP_MSMC_NC_VIRT"];
    	    var msmcNcPhy  = Program.cpu.memoryMap["OMP_MSMC_NC_PHY"];
    	
    	    // Initialize the runtime with memory range information
    	    if (deviceName.search("DRA7XX") == -1) {
    	       OpenMP.msmcBase = msmc.base
    	       OpenMP.msmcSize = msmc.len;
    	
    	       OpenMP.msmcNoCacheVirtualBase  = msmcNcVirt.base;
    	       OpenMP.msmcNoCacheVirtualSize  = msmcNcVirt.len;
    	
    	       OpenMP.msmcNoCachePhysicalBase  = msmcNcPhy.base;
    	       
    //	       OpenMP.allocateStackFromHeap = true;
    //	       OpenMP.allocateStackFromHeapSize = 4*330*1024;
    	       
    	    }
    	    else
    	    {
    	       OpenMP.allocateStackFromHeap = true;
    	       OpenMP.allocateStackFromHeapSize = 0x010000;
    	
    	       OpenMP.hasMsmc = false;
    	       OpenMP.ddrNoCacheBase = ddr3_nc.base;
    	       OpenMP.ddrNoCacheSize = ddr3_nc.len;
    	    }
    	
    	    OpenMP.ddrBase          = ddr3.base;
    	    OpenMP.ddrSize          = ddr3.len;
    	
    	    // Configure memory allocation using HeapOMP
    	    // HeapOMP handles 
    	    // - Memory allocation requests from BIOS components (core local memory)
    	    // - Shared memory allocation by utilizing the IPC module to enable 
    	    //   multiple cores to allocate memory out of the same heap - used by malloc
    	    if (deviceName.search("DRA7XX") == -1) {
    	       print("deviceName.search(\"DRA7XX\") == -1")
    	       var HeapOMP = xdc.useModule('ti.runtime.ompbios.HeapOMP');
    	
    	       // Shared Region 0 must be initialized for IPC 
    	       var sharedRegionId = 0;
    	
    	       // Size of the core local heap
    	       var localHeapSize  = 0x8000;
    	
    	       // Size of the heap shared by all the cores
    	       // var sharedHeapSize = 0x08000000;
    	       var sharedHeapSize = (512-32-22)*1024*1024;
    	
    	       // Initialize a Shared Region & create a heap in the DDR3 memory region
    	       var SharedRegion   = xdc.useModule('ti.sdo.ipc.SharedRegion');
    	       SharedRegion.setEntryMeta( sharedRegionId,
    	                                  {   base: ddr3.base,
    	                                      len:  sharedHeapSize,
    	                                      ownerProcId: OpenMP.masterCoreIdx,
    	                                      cacheEnable: true,
    	                                      createHeap: true,
    	                                      isValid: true,
    	                                      name: "DDR3_SR0",
    	                                  });
    	
    	       // Configure and setup HeapOMP
    	       HeapOMP.configure(sharedRegionId, localHeapSize);
    	
    	/*       
    	       var HeapTrack = xdc.useModule('ti.sysbios.heaps.HeapTrack'); 
    			var heapTrackParams = new HeapTrack.Params; 
    			heapTrackParams.heap = HeapOMP; 
    			var myHeapTracker = HeapTrack.create(heapTrackParams);
    			*/
    	    }
    	    else
    	    {
    	       OpenMP.useIpcSharedHeap = false;
    	       OpenMP.allocateLocalHeapSize = 0x8000
    	       OpenMP.allocateSharedHeapSize = 0x00800000
    	    }
    	
    	    
    	    var Startup = xdc.useModule('xdc.runtime.Startup');
    	    Startup.lastFxns.$add('&__TI_omp_initialize_rtsc_mode');
    	}
    	else
    	{
    	    /* Size the heap. It must be placed in shared memory */
    	    program.heap = sharedHeapSize;
    	}
    	
    
    // Use no openmp 
    } else {
    	var BIOS = xdc.useModule('ti.sysbios.BIOS');	
    	BIOS.heapSize = (512-32-22)*1024*1024;
    	Program.sectMap["systemHeap"]	= new Program.SectionSpec();
    	Program.sectMap["systemHeap"].loadSegment	= "DDR3";
    	BIOS.heapSection = "systemHeap";	
    }
    
    
    var BIOS = xdc.useModule('ti.sysbios.BIOS');
    print("BIOS.heapSize = ", BIOS.heapSize)
    BIOS.cpuFreq.lo = 1000000000;
    
    
    	


  • Hi, rrlagic,

    Thanks for sharing your observation and experience in the forum.

    Hi, Feng,

    Our OpenMP expert is out of office for a few weeks, we'll try to answer your query if we can, but the response time may be slow.  If not, we may have to wait for his return.

    Rex

  • Hi again,

    Once again I have to start with the disclaimer that my experience with OpenMP is quite limited, so I will just outline a few points that look suspicious to me; you'd better wait until the OpenMP guru from TI chimes in.

    As to 1), I don't see how your actual data are cache-line aligned, as there is only the processing macro, not the data itself. In our case we had to allocate the data buffers with explicit alignment.

    Next, in the same 1), I see 3 nested loops. To my knowledge, that is not the best idea, and it makes identifying sharing issues even harder. We also don't know your input matrix organization: is it a linear array with address manipulation, or an array of pointers to unrelated regions? I see division and modulo operations, which might prevent outer-loop unrolling if the divisors are not powers of two. Finally, there are 4 conditions in the inner loop; I don't know whether those could be optimized.

    To illustrate my doubts, consider a simple windowing scenario: out[i] = in[i] * window[i]. See this piece of code:

      omp_set_num_threads(4);
    
      iq_pairs=64/sizeof(float); // number of IQ pairs in one cache line (128 bytes)
      c_l_n= (num_sampled_data+iq_pairs-1)/iq_pairs;// cache lines number
      cl_pc= (c_l_n +3)/4; // cache lines per core number.
      num_p_pc = cl_pc*iq_pairs; // number of IQ pairs per core
    
      st_o = omp_get_wtime();
    
    #pragma omp parallel private (i)
      {
          int id;
          float * restrict pout;
          const float * restrict pin,* restrict pw;
          id=omp_get_thread_num();
    
          pout=&data[PAD+id*num_p_pc];
          pin=(float*)&iqs[id*num_p_pc];
          pw=&meas->window[id*num_p_pc];
    #pragma omp for
          for (i=0;i<num_p_pc/2;i++)
          {
              pout[2*i]=pin[2*i]*pw[i];
              pout[2*i+1]=pin[2*i+1]*pw[i];
          }
      }
    

    As you can see, each core gets its piece of the linear array 'data' for processing through the private pointer 'pout', which in turn is derived from the core id in such a way that each piece is aligned on a 128-byte boundary, which is the cache line size. The array 'data' itself was allocated in multicore shared memory with 128B alignment. I'm showing this excerpt just to illustrate the amount of work required to optimize the convolution of two linear arrays. With matrices one has to be doubly cautious. Say, if your matrices are linear blocks with address arithmetic, I would add padding between rows to make sure each new row starts on a 128B boundary. Then one may assign rows to cores in round-robin fashion, e.g. Core0 would process rows 0, 8, 16, ..., Core1 would process rows 1, 9, 17, and so on.

    As to 2), that's simple. MSMC has an independent port for each core, and theoretically they can all pull data in parallel. Now count this way: each core has a 128-bit data bus, 1024 bits in total over 8 cores, but the EMIF controller to DDR3 has only 64 bits. Next count the clock difference and you'll see that memory access would be the bottleneck. Having the heap in multicore shared memory allows you to (re)allocate memory at runtime, yet have it in fast memory with a wide data bus.

    3) looks good to me; using EDMA to move data between the huge but slow DDR3 and the small but fast MSMCSRAM seems to be common and recommended practice.

    4) is partially answered in 1), and yes, I mean it is beneficial to have the array origin aligned on the cache line size too.

    Sorry, I'm not good enough with OpenMP to judge more definitively, specifically about heaps. In your shoes I would start with a static buffer in MSMC and try thread partitioning either by rows or at least by cache line size.

    Just in case, do you apply compiler optimizations? Did you look at the compiler feedback on loop scheduling? About those division and modulo operations: are the divisors fixed, or known in advance? One may avoid them with an overflowing multiplicative inverse too.

  • Hi rrlagic,

    Thanks for the valuable input.

    Yes, the 2D matrices/images are simple linear arrays, and to access M(i,j) we do (j*nrows+i) (column-major layout).

    Other than aligning the starting address on a cache line and placing the heap in MSMC, as you pointed out, I guess the omp schedule clause pretty much takes care of everything: for instance, "omp for schedule(dynamic,1024)" makes the runtime assign batches of 1024 consecutive iterations to each available thread dynamically. Therefore, given that the arrays/2D matrices are stored linearly, and assuming the matrix datatype is float or double, when data is loaded into a cache line, all elements of the array in that line are updated by only one thread. In other words, no matter the size of the updated array, as long as the chunk of loop iterations allocated to each thread is a multiple of cache line size / element size (e.g. 128 bytes (TMS320C6678 L2 cache line size) / 4 bytes (float)), one can be sure that no cache line is written by two different threads. This is called avoiding false sharing.

    I haven't tried MSMC yet, but that's valuable input regarding the bus width ... I will definitely test and post an update here.

    >> Just in case, do you apply compiler optimizations? Did you see compiler feedback on loop scheduling? About those division and modulo operations - are divisors fixed, or known in advance? One may avoid them with overflowing multiplicative inverse too.

    Yes, I usually try different optimization levels just to compare and contrast. Before applying scheduling, I used to get calculation errors, so I started searching here and there, and it is mentioned pretty much everywhere that false sharing can cause serious inaccuracies in code execution. But I guess its impact shows up more clearly when the data is aligned on cache-line boundaries. The divisor is simply the number of rows of the 2D matrix, which is stored in column-major order (see here).

    What did you mean by "multiplicative inverse"?

    Regards,

  • Hello!

    I did not dig deep into analyzing your loops; just seeing division/modulo inside a loop is a kind of attention trigger. You'd better check whether the presence of division/modulo prevents any optimization of the loop.

    As to the multiplicative inverse, it is a technique to replace a division operation. I guess there should be some reading on the Internet, and in its simplest form one may take it this way. Assume we have to do integer division by 3 with 16-bit types. Let us define full-scale unity as 0x1 0000. Then 16 bits are good to store the fractional part, while the integer part overflows. One third of unity is 0x1 0000 / 3 = 0x5555. Let us divide, say, 125 / 3. Instead of dividing, we multiply first: 125 x 0x5555 = 0x29 AA81, and then shift right by 16: 0x29 AA81 >> 16 = 0x29 = 41. It takes one multiplication and one shift, and both pipeline well. From what I saw, the compiler is smart enough to insert this kind of substitution when it sees constants or simple variables, but inside expressions this substitution may not happen, so one may have to code it by hand. So once again, I would check the produced assembly first to see whether those divisions cause any trouble, and if they do, replace them with this trick.

    Hope this helps.

    Ok, thanks ... I will look into the multiplicative inverse too.

    But I think I first need to learn how to use MSMC heaps and how to allocate cache-line-aligned memory from them ... this looks potentially more promising, as you said. ;)

    Thanks a lot for now, stay tuned for further updates :)

  • rrlagic said:

    As to 2), that's simple. MSMC has an independent port for each core, and theoretically they can all pull data in parallel. Now count this way: each core has a 128-bit data bus, 1024 bits in total over 8 cores, but the EMIF controller to DDR3 has only 64 bits. Next count the clock difference and you'll see that memory access would be the bottleneck. Having the heap in multicore shared memory allows you to (re)allocate memory at runtime, yet have it in fast memory with a wide data bus.

    Hello rrlagic,

    Do you have any snippet, just to get an idea of how you did it? Something like benchmark code ...

    I just have a hard time figuring out how to place a buffer array (char buf[32*1024];) in MSMC. I know I can do that from linker.cmd; however, my linker.cmd file is generated by xdctools via a cfg file. Is there any way to do it directly from the cfg file, or is there possibly an include statement so that I can pull a helper .cmd into the one auto-generated by xdctools?

    Regards,

  • Hello!

    The only reference I have now is a shared heap in DDR3, but with L2 caching enabled:

        // Initialize a Shared Region & create a heap in the DDR3 memory region
        var SharedRegion   = xdc.useModule('ti.sdo.ipc.SharedRegion');
        SharedRegion.setEntryMeta( sharedRegionId,
                                   {   base: ddr3.base,
                                       len:  sharedHeapSize,
                                       ownerProcId: 0,
                                       cacheEnable: true, 
                                       createHeap: true,
                                       isValid: true,
                                       name: "DDR3_SR0",
                                   });
    
    
    
    #ifdef OMP_PLATFORM
        float *data   = (float*) Memory_alloc (NULL,2*(num_sampled_data+PAD)*sizeof(float),128,NULL);
    #else
    
    

    I'm afraid that's not exactly the result I was describing.

    As to placing a static memory buffer at a specific location, we used:

    #pragma DATA_SECTION

    #pragma DATA_ALIGN

    Hi,

    Maybe you can look at the *.cfg files under pdk_c667x_2_0_16\packages\ti\drv\hyplnk\example\memoryMappedExample\c6678\c66\bios to see how to place a data section at a specific location, with syntax like below:

    Program.sectMap[".init_array"] = "L2SRAM";
    Program.sectMap[".csl_vect"] = "L2SRAM";

    /* Create data sections for specific memory locations */

    Program.sectMap[".bss:hyplnkData"] = new Program.SectionSpec();
    Program.sectMap[".bss:hyplnkData"].loadAddress=0x830000;
    Program.sectMap[".bss:testData"] = "L2SRAM";
    Program.sectMap[".bss:QMSSData"] = new Program.SectionSpec();
    Program.sectMap[".bss:QMSSData"].loadAddress=0x850000;
    Program.sectMap[".bss:packetData"] = new Program.SectionSpec();
    Program.sectMap[".bss:packetData"].loadAddress=0x870000;

    Regards, Eric

  • Thanks Eric,

    Program.sectMap[".bss:hyplnkData"] = new Program.SectionSpec();
    Program.sectMap[".bss:hyplnkData"].loadAddress=0x830000;
    

    So, in the above example, the first line means the .bss section of hyplnkData.obj is going to be placed at address 0x830000, right?

    or in the below example:

    Program.sectMap[".bss:testData"] = "L2SRAM";

    .bss of testData.obj will be placed into L2SRAM?

  • Hi rrlagic,

    Here I did some testing to compare the impact of MSMC on OpenMP apps, based on your valuable suggestions.

    So, as you suggested, I started by allocating a buffer in MSMC using the cfg file (as you kindly provided in another thread), and then allocated a buffer of the same size from the DDR3 heap using Memory_alloc(NULL, 4*1024).

    Then I compared the execution time of looping over the two buffers and doing some simple math. Interestingly enough, when using DDR3, the serial version runs faster than the OpenMP version, as you pointed out.


    -= Program Started=- 
    MSMC BUFF PROCESS TIME: 24570 
    OMP DDR BUFF PROCESS TIME: 89496 
    NOOMP DDR BUFF PROCESS TIME: 70598 
    -= Program Ended=- 

    I leave the code below, along with the cfg file I used, in case somebody is interested in confirming or criticizing.

    #include <stdio.h>
    #include <stdlib.h> // for exit()
    #include <c6x.h>
    #include <math.h>
    #include <omp.h>
    
    #include <xdc/std.h>
    #include <xdc/runtime/IHeap.h>
    #include <xdc/runtime/System.h>
    #include <xdc/runtime/Memory.h>
    #include <xdc/runtime/Error.h>
    
    typedef unsigned long long dt_uint64;
    
    // buffer on MSMC
    #pragma DATA_SECTION( msmc_buff, ".msmc_bufSec" )
    #pragma DATA_ALIGN( msmc_buff, 256 )
    #define BUFLEN  (4*1024)
    char msmc_buff[BUFLEN];
    
    // buffer on DDR3 allocated from heap
    char *ddr_buff;
    
    int main(int argc, char* argv[]){
        printf("-= Program Started=- \n");
    
        TSCH = 0; TSCL = 0;
        dt_uint64  t_start = _itoll(TSCH, TSCL);
    
        // OMP, Buffer on MSMC
    #pragma omp parallel num_threads(8)
        {
    #pragma omp for
            for(int j = 0; j < BUFLEN; j++){
    //          putchar('M'); putchar('0' + omp_get_thread_num());
                msmc_buff[j] = 1+msmc_buff[j]/2*3-50;
            }
        }
        dt_uint64  t_end = _itoll(TSCH, TSCL);
        printf("MSMC BUFF PROCESS TIME: %llu \n", t_end - t_start);
    
    
        Error_Block eb;
        Error_init(&eb);
        xdc_Ptr retval = Memory_alloc(NULL, BUFLEN, 0, &eb);
    
        ddr_buff = (char *) retval;
    
        if (!ddr_buff){
            printf("Cannot allocate ddr_buff from heap \n");
            exit(1);
        }
    
        // OMP, Buffer on DDR
        t_start = _itoll(TSCH, TSCL);
    #pragma omp parallel num_threads(8)
        {
    #pragma omp for
            for(int j = 0; j < BUFLEN; j++){
                ddr_buff[j] = 1+ddr_buff[j]/2*3-50;
            }
        }
        t_end = _itoll(TSCH, TSCL);
        printf("OMP DDR BUFF PROCESS TIME: %llu \n", t_end - t_start);
    
        // NO-OMP, Buffer on DDR
        t_start = _itoll(TSCH, TSCL);
            for(int j = 0; j < BUFLEN; j++){
                ddr_buff[j] = 1+ddr_buff[j]/2*3-50;
            }
        t_end = _itoll(TSCH, TSCL);
        printf("NOOMP DDR BUFF PROCESS TIME: %llu \n", t_end - t_start);
    
    
    
        Memory_free(NULL, ddr_buff, BUFLEN);
    
        printf("-= Program Ended=- \n");
    }
    

    configuration script (cfg file)


    /***************************/
    /* SECTION MAPPING         */
    /***************************/
    var program = xdc.useModule('xdc.cfg.Program');
    
    program.sectMap[".args"]        = new Program.SectionSpec();
    program.sectMap[".bss"]         = new Program.SectionSpec();
    program.sectMap[".cinit"]       = new Program.SectionSpec();
    program.sectMap[".cio"]         = new Program.SectionSpec();
    program.sectMap[".const"]       = new Program.SectionSpec();
    program.sectMap[".data"]        = new Program.SectionSpec();
    program.sectMap[".far"]         = new Program.SectionSpec();
    program.sectMap[".fardata"]     = new Program.SectionSpec();
    program.sectMap[".neardata"]    = new Program.SectionSpec();
    program.sectMap[".rodata"]      = new Program.SectionSpec();
    program.sectMap[".stack"]       = new Program.SectionSpec();
    program.sectMap[".switch"]      = new Program.SectionSpec();
    program.sectMap[".sysmem"]      = new Program.SectionSpec();
    program.sectMap[".text"]        = new Program.SectionSpec();
        
    // Must place these sections in core local memory 
    program.sectMap[".args"].loadSegment        = "DDR3";
    program.sectMap[".cio"].loadSegment         = "L2SRAM";
    
    // Variables in the following data sections can potentially be 'shared' in
    // OpenMP. These sections must be placed in shared memory.
    program.sectMap[".bss"].loadSegment         = "DDR3";
    program.sectMap[".cinit"].loadSegment       = "DDR3";
    program.sectMap[".const"].loadSegment       = "DDR3";
    program.sectMap[".data"].loadSegment        = "DDR3";
    program.sectMap[".far"].loadSegment         = "DDR3";
    program.sectMap[".fardata"].loadSegment     = "DDR3";
    program.sectMap[".neardata"].loadSegment    = "DDR3";
    program.sectMap[".rodata"].loadSegment      = "DDR3";
    program.sectMap[".sysmem"].loadSegment      = "DDR3";
    
    // Code sections shared by cores - place in shared memory to avoid duplication
    program.sectMap[".switch"].loadSegment      = program.platform.codeMemory;
    program.sectMap[".text"].loadSegment        = "MSMCSRAM";
    print(" program.platform.codeMemory = ", program.platform.codeMemory);
    
    // Size the default stack and place it in L2SRAM 
    var deviceName = String(Program.cpu.deviceName);
    // if  (deviceName.search("DRA7XX") == -1) { program.stack = 0x20000; }
    // if  (deviceName.search("DRA7XX") == -1) { program.stack = 256*1024; } // working in serial mode 
    // if  (deviceName.search("DRA7XX") == -1) { program.stack = 330*1024; }
    if  (deviceName.search("DRA7XX") == -1) { program.stack = 330*1024; }
    // if  (deviceName.search("DRA7XX") == -1) { program.stack = 512*1024; } 
    else                                    { program.stack = 0x8000;  }
    program.sectMap[".stack"].loadSegment       = "L2SRAM"; // works in serial mode with a 256KB .stack
    // program.sectMap[".stack"].loadSegment       = "MSMCSRAM"; // doesn't work in OpenMP
    
    // Since there are no arguments passed to main, set .args size to 0
    program.argSize = 0;
    
    
    /********************************/
    /* OPENMP RUNTIME CONFIGURATION */
    /********************************/
    USE_OPENMP = true;
    if (USE_OPENMP) {
    	// Include OMP runtime in the build
    	var ompSettings = xdc.useModule("ti.runtime.openmp.Settings");
    	
    	// Set to true if the application uses or has dependencies on BIOS components
    	ompSettings.usingRtsc = true;
    	
    	if (ompSettings.usingRtsc)
    	{
    	    /* Configure OpenMP for BIOS
    	     * - OpenMP.configureCores(masterCoreId, numberofCoresInRuntime)
    	     *       Configures the id of the master core and the number of cores
    	     *       available to the runtime.
    	     */
    	
    	    var OpenMP = xdc.useModule('ti.runtime.ompbios.OpenMP');
    	
    	    // Configure the index of the master core and the number of cores available
    	    // to the runtime. The cores are contiguous.
    	    OpenMP.masterCoreIdx = 0;
    	
    	    // Setup number of cores based on the device
    	    if      (deviceName.search("DRA7XX") != -1) { OpenMP.numCores = 2; }
    	    else if (deviceName.search("6670")   != -1) { OpenMP.numCores = 4; }
    	    else if (deviceName.search("6657")   != -1) { OpenMP.numCores = 2; }
    	    else if (deviceName.search("K2L")    != -1) { OpenMP.numCores = 4; }
    	    else                                        { OpenMP.numCores = 8; }
    	
    	    // Pull in memory ranges described in Platform.xdc to configure the runtime
    	    var ddr3       = Program.cpu.memoryMap["DDR3"];
    	    var ddr3_nc    = Program.cpu.memoryMap["DDR3_NC"];
    	    var msmc       = Program.cpu.memoryMap["MSMCSRAM"];
    	    var msmcNcVirt = Program.cpu.memoryMap["OMP_MSMC_NC_VIRT"];
    	    var msmcNcPhy  = Program.cpu.memoryMap["OMP_MSMC_NC_PHY"];
    	
    	    // Initialize the runtime with memory range information
    	    if (deviceName.search("DRA7XX") == -1) {
    	       OpenMP.msmcBase = msmc.base;
    	       OpenMP.msmcSize = msmc.len;
    	
    	       OpenMP.msmcNoCacheVirtualBase  = msmcNcVirt.base;
    	       OpenMP.msmcNoCacheVirtualSize  = msmcNcVirt.len;
    	
    	       OpenMP.msmcNoCachePhysicalBase  = msmcNcPhy.base;
    	       
    //	       OpenMP.allocateStackFromHeap = true;
    //	       OpenMP.allocateStackFromHeapSize = 4*330*1024;
    	       
    	    }
    	    else
    	    {
    	       OpenMP.allocateStackFromHeap = true;
    	       OpenMP.allocateStackFromHeapSize = 0x010000;
    	
    	       OpenMP.hasMsmc = false;
    	       OpenMP.ddrNoCacheBase = ddr3_nc.base;
    	       OpenMP.ddrNoCacheSize = ddr3_nc.len;
    	    }
    	
    	    OpenMP.ddrBase          = ddr3.base;
    	    OpenMP.ddrSize          = ddr3.len;
    	
    	    // Configure memory allocation using HeapOMP
    	    // HeapOMP handles 
    	    // - Memory allocation requests from BIOS components (core local memory)
    	    // - Shared memory allocation by utilizing the IPC module to enable 
    	    //   multiple cores to allocate memory out of the same heap - used by malloc
    	    if (deviceName.search("DRA7XX") == -1) {
    	       print("deviceName.search(\"DRA7XX\") == -1");
    	       var HeapOMP = xdc.useModule('ti.runtime.ompbios.HeapOMP');
    	
    	       // Shared Region 0 must be initialized for IPC 
    	       var sharedRegionId = 0;
    	
    	       // Size of the core local heap
    	       var localHeapSize  = 0x8000;
    	
    	       // Size of the heap shared by all the cores
    	       // var sharedHeapSize = 0x08000000;
    	       var sharedHeapSize = (512-32-22)*1024*1024;
    	
    	       // Initialize a Shared Region & create a heap in the DDR3 memory region
    	       var SharedRegion   = xdc.useModule('ti.sdo.ipc.SharedRegion');
    	       SharedRegion.setEntryMeta( sharedRegionId,
    	                                  {   base: ddr3.base,
    	                                      len:  sharedHeapSize,
    	                                      ownerProcId: OpenMP.masterCoreIdx,
    	                                      cacheEnable: true,
    	                                      createHeap: true,
    	                                      isValid: true,
    	                                      name: "DDR3_SR0",
    	                                  });
    	
    	       // Configure and setup HeapOMP
    	       HeapOMP.configure(sharedRegionId, localHeapSize);
    	
    	/*
    	       var HeapTrack = xdc.useModule('ti.sysbios.heaps.HeapTrack');
    	       var heapTrackParams = new HeapTrack.Params;
    	       heapTrackParams.heap = HeapOMP;
    	       var myHeapTracker = HeapTrack.create(heapTrackParams);
    	*/
    	    }
    	    else
    	    {
    	       OpenMP.useIpcSharedHeap = false;
    	       OpenMP.allocateLocalHeapSize = 0x8000;
    	       OpenMP.allocateSharedHeapSize = 0x00800000;
    	    }
    	
    	    
    	    var Startup = xdc.useModule('xdc.runtime.Startup');
    	    Startup.lastFxns.$add('&__TI_omp_initialize_rtsc_mode');
    	}
    	else
    	{
    	    /* Size the heap. It must be placed in shared memory */
    	    program.heap = sharedHeapSize;
    	}
    	
    
    // No OpenMP: fall back to a plain BIOS heap placed in DDR3
    } else {
    	var BIOS = xdc.useModule('ti.sysbios.BIOS');	
    	BIOS.heapSize = (512-32-22)*1024*1024;
    	Program.sectMap["systemHeap"]	= new Program.SectionSpec();
    	Program.sectMap["systemHeap"].loadSegment	= "DDR3";
    	BIOS.heapSection = "systemHeap";	
    }
    
    
    var BIOS = xdc.useModule('ti.sysbios.BIOS');
    print("BIOS.heapSize = ", BIOS.heapSize)
    BIOS.cpuFreq.lo = 1000000000;
    
    Program.sectMap[".msmc_bufSec"]    = "MSMCSRAM";
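
Regarding the original question about `Memory_alloc(NULL, ...)`: with HeapOMP configured as in this .cfg, a NULL heap handle resolves to the default heap (HeapOMP), whose shared allocations go through the IPC shared-region heap. That heap is gate-protected across cores, so allocations should be multi-core safe without an `omp critical`, but each call pays the cost of that cross-core synchronization, which is a plausible source of the extra cycles observed. A minimal usage sketch (not from the original post; the buffer size and 128-byte alignment are illustrative):

```c
/* Sketch: allocation through the default heap on SYS/BIOS.
 * With HeapOMP configured, a NULL heap handle resolves to HeapOMP,
 * so these calls are expected to be multi-core safe, at the price of
 * a cross-core gate on every alloc/free. */
#include <xdc/runtime/Memory.h>
#include <xdc/runtime/Error.h>

void example(void)
{
    Error_Block eb;
    Error_init(&eb);

    /* NULL => default heap (HeapOMP here); alignment value is illustrative */
    float *buf = (float *)Memory_alloc(NULL, 1024 * sizeof(float), 128, &eb);
    if (buf != NULL) {
        /* ... use buf ... */
        Memory_free(NULL, buf, 1024 * sizeof(float));
    }
}
```

Given that per-allocation cost, hoisting allocations out of parallel regions (allocate once, reuse inside the region) is usually more effective than switching heaps.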
    
    

     

  • Hello!

    Thanks for sharing your observations; it's important for others following this thread to know the outcome.

    One thing that may affect your measurements is that the non-OMP version is executed after the OMP one over the same ddr_buff. Perhaps portions of it were already cached, so the non-OMP run had an unfair advantage. Invalidating the caches would make the initial conditions closer to equal.
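
    To level the initial conditions between runs, the caches can be written back and invalidated before each timed run. A minimal sketch using the C66x CSL cache API (this assumes the PDK's csl_cacheAux.h; verify the exact function names and wait semantics against your CSL version, and note that in OpenMP mode each core has its own L1/L2, so every participating core would need to run this):

```c
/* Sketch: flush all caches before a timed run so neither the serial nor
 * the OpenMP build benefits from data cached by a previous run.
 * Assumes TI CSL (csl_cacheAux.h) from the C6678 PDK. */
#include <ti/csl/csl_cacheAux.h>

static void flushAllCaches(void)
{
    CACHE_wbInvAllL1d(CACHE_WAIT);  /* write back + invalidate L1D */
    CACHE_invAllL1p(CACHE_WAIT);    /* invalidate L1P */
    CACHE_wbInvAllL2(CACHE_WAIT);   /* write back + invalidate L2 */
}
```

    Calling this immediately before starting the cycle counter for each variant removes the warm-cache advantage from whichever run happens to go second.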