OpenMP performance issue for adding a large set of numbers to find mean

Other Parts Discussed in Thread: SYSBIOS

I am working on a curve-fit application. I generate a data set of 4096k elements of type float and store it in DDR3. With one core, finding the mean takes 0.59 sec, but this doesn't scale well with OpenMP.

I have attached the platform file.

Stack: L2
Data: MSMCSRAM_NOCACHE
Code: MSMCSRAM
L2 Cache: 256k
L1D Cache: 32k
L1P Cache: 32k

Is there any optimization that would give a scalable speedup? With 4 cores it takes 0.25 sec, and with 8 cores 0.31 sec.

  • Hi Ajay,

    What version of OMP, and other RTSC components (MCSDK, BIOS, IPC, etc) are you using? The settings in the platform file and cfg file you use would depend on that.

    If you aren't already using MCSDK 2.01.02.06, I would recommend installing it from http://software-dl.ti.com/sdoemb/sdoemb_public_sw/bios_mcsdk/latest/index_FDS.html. Once you have that, the next step is to update OpenMP to OMP 1.02.00.05: http://software-dl.ti.com/sdoemb/sdoemb_public_sw/omp/1_02_00_05/index_FDS.html  This version of OMP has various performance enhancements over previous versions.

    Once you have these installed and CCS recognizes the OMP package, please ensure that you go through the OpenMP examples from this new release, to see both platform file changes and cfg changes. Once you incorporate these changes, if you continue to face issues, please reply to this post with both your cfg file and platform file, and a list of the software versions used.

    Thanks!

  • Hi Uday,

    Thanks for your reply.

    I am already using MCSDK 2.01.02.06 with OpenMP version 1.01.03.02. I will update the OMP runtime. If I DMA the data from DDR3 to L2 SRAM (with L2 cache disabled) using OMP, will this provide a performance improvement?

    Also, could you suggest a good cache configuration for this kind of memory-bound computation, where the data doesn't stay in the cache for long?

    Thank you

  • Hi Ajay

    You should typically see the scaling without the need for data movement from DDR3 to L2. Let's try a few things:

    Once you update to OMP 1.02.00.05, I would suggest starting out with the following settings in the RTSC platform file that you use. (You can edit the platform you are using by going to the CCS Debug perspective and choosing the menu Tools --> RTSC Tools --> Platform --> Edit/View.)

    Once you've saved the platform, please ensure that you select it to be your RTSC platform of choice in your project settings.

    Now in your configuration file (.cfg), among what's already there, add the following lines if they aren't already present:

    var Cache = xdc.useModule('ti.sysbios.family.c66.Cache');
    // all external memory cacheable; write-through disabled (better performance if disabled)
    Cache.setMarMeta(0x80000000, 0x20000000, Cache.PFX | Cache.PC);
    // 'OpenMP' is assumed to be bound already via xdc.useModule() earlier in this .cfg
    OpenMP.enableMemoryConsistency = true;
    OpenMP.noncachedMsmcAlias = 0xA0000000;

    Let us know what performance you observe once you've implemented this, and whether it scales with the number of cores.

    Also, in case there are many 'if...then...else' statements inside your parallel region, see if you can eliminate them or move them out of the parallel region.
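
    For reference, here is a minimal sketch of the kind of loop discussed in this thread, not the original poster's code. It assumes a plain OpenMP `parallel for` with a `reduction` clause is used to sum the array, and it illustrates the last point by deciding a mode flag once, outside the parallel loops, so that each loop body stays branch-free. The function names `mean_omp` and `mean_select` and the `squared` flag are hypothetical, chosen only for illustration. The accumulator is a `double` to limit rounding error over millions of elements; whether that costs too much on a given target is something to measure.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of n floats using an OpenMP reduction.
 * Each thread accumulates a private partial sum over its own chunk of the
 * array; the runtime combines the partial sums once, after the loop. */
float mean_omp(const float *data, size_t n)
{
    double sum = 0.0;   /* double accumulator: an assumption, not from the thread */
    long i;

    assert(data != NULL && n > 0);

    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (i = 0; i < (long)n; i++) {
        sum += data[i];
    }
    return (float)(sum / (double)n);
}

/* Hypothetical variant: returns the mean of the squares when 'squared' is
 * non-zero, otherwise the plain mean. The flag is tested once, outside the
 * parallel loops, instead of on every iteration inside them. */
float mean_select(const float *data, size_t n, int squared)
{
    double sum = 0.0;
    long i;

    assert(data != NULL && n > 0);

    if (squared) {
        #pragma omp parallel for reduction(+:sum) schedule(static)
        for (i = 0; i < (long)n; i++) {
            sum += (double)data[i] * (double)data[i];
        }
    } else {
        #pragma omp parallel for reduction(+:sum) schedule(static)
        for (i = 0; i < (long)n; i++) {
            sum += data[i];
        }
    }
    return (float)(sum / (double)n);
}
```

    With a static schedule each core streams through one contiguous block of the array, which is usually the friendliest access pattern for a memory-bound sum. If the code is built without OpenMP support, the pragmas are ignored and the functions still return the same results serially.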