OpenMP: Schedule type and chunk-size ignored?

Clemens Eisserer

Other Parts Discussed in Thread: SYSBIOS

Hi again,

To initialize per-core data structures, we currently have loops which are intended to be executed once by each core, and look like this:

    int nOmpNumProcs = omp_get_num_procs();
    #pragma omp parallel for schedule(static, 1) private(i)
    for(i=0; i < nOmpNumProcs; i++)
    {
        printf("core %d was initialized", omp_get_thread_num());
        fflush(stdout);
    }

However, even with a static schedule and a chunk-size of 1, the output suggests only core 0 (first iteration) and core 6 were running the code:

[C66xx_0] core 0 was initialized
[C66xx_6] core 1 was initialized
[C66xx_6] core 2 was initialized
[C66xx_6] core 3 was initialized
[C66xx_6] core 4 was initialized
[C66xx_6] core 5 was initialized
[C66xx_6] core 6 was initialized
[C66xx_6] core 7 was initialized

So although omp_get_thread_num() returns the intended result, the actual scheduling seems to be done in a dynamic way.

Is there a better way to execute code in parallel which relies on core-local resources like L2SRAM?
Our legacy code has a clearly separated initialization phase (which prepares data structures for multiple parallel loops), so unfortunately it isn't that easy to do the initialization inside the loop :(

Thank you in advance, Clemens

over 13 years ago

Clemens Eisserer over 13 years ago

Expert 2430 points

After having validated that all my other loops execute the same way (only one iteration on core 0, all others on core 6) I have the feeling something is going wrong here.

When executing the following code as part of a standalone project, everything works fine - each core immediately gets its share of the work:

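#include <stdio.h>

int main(void) {
   int i;

   #pragma omp parallel for
   for(i=0; i < 8; i++) {
       printf("Hello\n");
       fflush(stdout);

       /* Busy loop so that each iteration runs for a noticeable time;
          volatile keeps the compiler from optimizing it away. */
       int z = 0;
       volatile int m = 0;
       for(z=0; z < 2007483647; z++) {
           m = z;
       }
   }

   return 0;
}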

However, as part of my openmp project, I get the behaviour described above with only core6 doing work, while all other cores seem to sit idle with the following stack:

Sorry I am so support-intensive with this OpenMP project, things just don't seem to work right for me this time :/

Thank you in advance, Clemens

Ajay Jayaraj over 13 years ago

TI__Expert 3350 points

Clemens,

schedule(static, 1) only guarantees that iterations are divided into chunks of size 1 and assigned to threads in the team in a round-robin manner. It does not provide any guarantees about how the threads are mapped to cores on the system.
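As a minimal sketch of that guarantee (not the C6x runtime's actual implementation), the iteration-to-thread assignment under schedule(static, chunk) can be modelled in plain C. The helper name static_owner is hypothetical; it only reproduces the round-robin rule and says nothing about which core a given thread runs on.

```c
#include <assert.h>

/* Hypothetical helper: thread number that executes iteration `iter` of a
 * loop under schedule(static, chunk) with `nthreads` threads in the team.
 * Chunks are handed out round-robin in thread-number order. */
static int static_owner(int iter, int chunk, int nthreads)
{
    assert(iter >= 0 && chunk > 0 && nthreads > 0);
    return (iter / chunk) % nthreads;
}
```

With chunk = 1 and 8 threads, iterations 0..7 map to threads 0..7. That matches the omp_get_thread_num() values in the output above, even though the [C66xx_n] prefixes show that several of those threads ran on the same core.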

An alternative is to use DNUM - the DSP core number register. E.g.

#include <c6x.h>

...

    #pragma omp parallel private(i)
    for(i=0; i < nOmpNumProcs; i++)
    {
        if (i == DNUM)
           printf("thread %d was initialized by core %d\n", omp_get_thread_num(), DNUM);
    }
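
A variation on the same idea, as a rough sketch: because every thread in the parallel region can read DNUM itself, the loop can be dropped and each thread can index its own slot directly. The names current_core_id, per_core_ready, MAX_CORES and init_per_core_state are hypothetical, and the non-TI fallbacks exist only so the sketch builds on a host compiler. It assumes the runtime places one thread on each core, which the OpenMP specification does not guarantee.

```c
#include <assert.h>

#define MAX_CORES 8   /* assumption: 8 cores, as on the C6678 */

#if defined(_TMS320C6X)
#include <c6x.h>      /* declares the DNUM core-number register */
static unsigned current_core_id(void) { return DNUM; }
#elif defined(_OPENMP)
#include <omp.h>      /* host fallback: thread number stands in for the core */
static unsigned current_core_id(void) { return (unsigned)omp_get_thread_num(); }
#else
static unsigned current_core_id(void) { return 0u; }  /* serial host build */
#endif

/* Stand-in for the real core-local setup (for example, buffers in L2SRAM). */
static volatile int per_core_ready[MAX_CORES];

void init_per_core_state(void)
{
    #pragma omp parallel
    {
        unsigned core = current_core_id();
        if (core < MAX_CORES)
            per_core_ready[core] = 1;   /* each thread marks only its own slot */
    }

    /* Sanity check: the calling thread took part in the region,
     * so its own slot must have been set. */
    assert(per_core_ready[current_core_id()] == 1);
}
```

If fewer slots than cores end up marked, that is a hint that several threads shared a core, which is the situation described at the start of this thread.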

Ajay

Ajay Jayaraj over 13 years ago in reply to Clemens Eisserer

TI__Expert 3350 points

Clemens,

Just to make sure I understand you correctly - When you try a simple standalone OpenMP project (say, A), the 8 threads are evenly distributed across the DSP cores. This is expected behavior of the C6x OpenMP runtime. However, when you run your OpenMP project (say, B), the threads execute only on cores 0 and 6?

Do you have any schedule clauses on your OpenMP directives in project B? Is there code outside of OpenMP that initiates tasks on the cores?

Ajay

Clemens Eisserer over 13 years ago in reply to Ajay Jayaraj

Expert 2430 points

Hi Ajay,

Exactly, in project A thread distribution works well, in project B the first iteration takes place on core0 and all subsequent ones only on core6.
There are no schedule clauses or calls to the runtime that alter OpenMP behaviour, and no other tasks involved - it's just a simple main file of ~10 lines.

The only thing I can think of is possibly incorrect placement of code/data, I will compare the linker.cmd file tomorrow.
I am also experiencing other strange effects when running with openmp enabled, maybe those problems have a common root.

Thanks, Clemens

Clemens Eisserer over 13 years ago in reply to Clemens Eisserer

Expert 2430 points

Did some testing - the binary works as expected on the simulator, the problems I observe only occur when running on my EVM6678 (rev 3, silicon 2).
Just double-checked, I am using the gel-file supplied by the OMP package on all 8 cores.
Also, I reverted all my placement settings; this is the generated linker.cmd file: 1374.linker.zip

Any ideas how to diagnose this issue further?
Would it help to upload a precompiled test-binary which triggers the issue?

Thanks a lot, Clemens

Ajay Jayaraj over 13 years ago in reply to Clemens Eisserer

TI__Expert 3350 points

Yes, a binary that triggers the issue would help. Could you also upload the Platform file, application config file and generated map file? You mentioned that the source file is small - if there is no proprietary information, could you also upload the source?

Ajay

Clemens Eisserer over 13 years ago in reply to Ajay Jayaraj

Expert 2430 points

Hi Ajay,

I've sent you a friend request which includes the link to the binary. The cfg file, the map file and the platform file will have to wait ~15h until I am at work again.
The source itself is very small; however, it's embedded in a large project (mostly dead code when executing this test), and because the problem only occurs when the test case is embedded in this project, I unfortunately cannot share it.

Thanks a lot for investigating this issue, I am really grateful for that.

Thanks, Clemens

Ajay Jayaraj over 13 years ago in reply to Clemens Eisserer

TI__Expert 3350 points

Clemens,

I think I've found the problem - the configuration file associated with the project, EV3D.cfg, is incorrect. It creates a shared region in memory at address 0x90000000:

SharedRegion.setEntryMeta( HeapOMP.sharedRegionId,
                            {   base: 0x90000000,
                                len: HeapOMP.sharedHeapSize,
                                ownerProcId: 0,
                                cacheLineSize: 0,
                                cacheEnable: false,
                                createHeap: true,
                                isValid: true,
                                name: "sr2-ddr3",
                            });

Note that when the region is created, we are indicating that this region is not cached, via the 'cacheEnable: false' setting. However, the MAR setting disables caching for an address range starting at 0x80000000:

var Cache = xdc.useModule('ti.sysbios.family.c66.Cache');
Cache.setMarMeta(0x80000000, 0x20000000, 0);

So, the problem is that cache is left enabled (the default) for the memory associated with shared region 2, leading to unpredictable behavior. Changing the start address to match the shared region's start address, 0x90000000, fixes the problem. I verified this with your project.

Cache.setMarMeta(0x90000000, HeapOMP.sharedHeapSize, 0);

Ajay

Clemens Eisserer over 13 years ago in reply to Ajay Jayaraj

Expert 2430 points

Hi Ajay,

Thanks a lot for this great news - I will give it a try on Monday when I have access to the EVM again.

I copied that memory configuration 1:1 from the omp math_vec example, that is distributed with the mcsdk.
It would be great if you could fix the configuration of the example project, too.

Thanks again, Clemens

Ajay Jayaraj over 13 years ago in reply to Clemens Eisserer

TI__Expert 3350 points

Clemens,

Unfortunately, that is not the issue - I came to the wrong conclusion. The setMarMeta range starts at 0x80000000, but the length extends into the Shared region range. However, I'm noticing that if I run the test case from a clean start (reset EVM), it works. When I run the test case a second time, it fails. I can then load and run an arbitrary OpenMP program, and when I run the test case again, it passes. It seems like EV3D.out leaves the system in a bad state after it runs.
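
The range arithmetic behind this correction can be checked with a short sketch. The helper names mar_index and range_covers are hypothetical; the only hardware fact assumed is that each C66x MAR register controls cacheability for one 16 MB block, so the register index is the address shifted right by 24 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the MAR register covering `addr` (one register per 16 MB). */
static unsigned mar_index(uint32_t addr)
{
    return (unsigned)(addr >> 24);
}

/* Returns 1 if `addr` lies inside [base, base + len), 0 otherwise.
 * 64-bit arithmetic avoids wrap-around near the top of the address space. */
static int range_covers(uint32_t base, uint32_t len, uint32_t addr)
{
    uint64_t end = (uint64_t)base + len;
    assert(len > 0);
    return addr >= base && (uint64_t)addr < end;
}
```

range_covers(0x80000000, 0x20000000, 0x90000000) is true because the range ends at 0xA0000000. The original Cache.setMarMeta call therefore spans MAR128 through MAR159 and already includes MAR144, the register for the shared region base.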

It needs further investigation.

Ajay

Clemens Eisserer over 13 years ago in reply to Ajay Jayaraj

Expert 2430 points

Please let me know if there is any information I can provide that would help to track this issue down.

Did you have a look at the example-project (including source), or only at the binary-only EV3D I sent you before?

Thanks again, Clemens

Ajay Jayaraj over 13 years ago in reply to Clemens Eisserer

TI__Expert 3350 points

Clemens,

Yes, I looked at the example project and was able to build and run it. That's when I noticed the intermittent behavior - the binary I built ran correctly the first time, but subsequent runs failed - it seemed like the runtime was leaving the DSP in a bad state. Let me find out if there is a more recent version of the OpenMP runtime that is available externally for you to try.

Ajay

Clemens Eisserer over 13 years ago in reply to Ajay Jayaraj

Expert 2430 points

Hi Ajay,

I was able to find the problem - it was project related. The linker settings had initialization model = none set, which didn't seem to cause any problems with single-threaded code, but caused unreliable behaviour with OpenMP enabled. With initialization model = ROM, everything works as expected on my EVM.

It would be great to have some warning when an invalid initialization model is chosen with OMP enabled.

Sorry I consumed so much of your time, and thank you for providing support.

Thanks, Clemens
