
OpenMP in single core vs multi-core

Other Parts Discussed in Thread: SYSBIOS

Hi everyone,

I’ve built a Single Program Multiple Data (SPMD) program using OpenMP (omp_1_01_02_06) and MCSDK (mcsdk_2_01_01_04) on a TMDSEVMC6678L board, with GEL file Ver 2.004 to initialize it (currently using the ‘no boot’ configuration). I use the following OpenMP pragmas and for loop to assign iterations to each thread:

#pragma omp parallel
#pragma omp master
{
    no_threads = omp_get_num_threads();
    platform_write("Executing in %d threads\n\n", no_threads);
}

#pragma omp parallel shared(/* some vars here */) private(/* some vars here */)
{
    thread_id = omp_get_thread_num();

    for (a = thread_id; a < iteration; a = (a + no_threads))
    {
        ......
    }
}

My .cfg is as attached. 7331.image_processing_openmp_evmc6678l.cfg

The project is about processing a cube of data. At the beginning of the code, the variable ‘iteration’ is determined first; if ‘iteration’ is less than or equal to 8, omp_set_num_threads(iteration) is called to set the number of cores used. 8 is the maximum value for ‘iteration’.

I’ve executed the code, and when ‘iteration’ is equal to 1 (i.e. using only 1 core) the code executes without any problem, and likewise with 2 cores. However, when I use a larger number of iterations, some of the cores keep running while others have exited (with an exception error message on the console). The exception message that I captured from ROV is as follows:

Exception,

Address,0x0c00d110

Decoded,Internal: Opcode exception;

Exception Context,

$addr,0x0084d2f0

$type,ti.sysbios.family.c64p.Exception.Context

A0,0x00000000

A1,0x00000402

A10,0x00000000

A11,0x00000000

A12,0x00000000

A13,0x00000000

A14,0x00000000

A15,0x00000000

A16,0x00000000

A17,0x00000000

A18,0x9007fbd4

A19,0x00000020

A2,0x00000000

A20,0x902407a0

A21,0x00000000

A22,0x600c8144

A23,0xc09291b4

A24,0x00000000

A25,0x00000000

A26,0x00000000

A27,0x00000000

A28,0x00000400

A29,0x00000001

A3,0x00000031

A30,0x00000000

A31,0x00000000

A4,0x00000030

A5,0x41ba44e2

A6,0x80000000

A7,0x40200000

A8,0x00000031

A9,0x0c0e5ca0

AMR,0x00000000

B0,0x00000000

B1,0x00000001

B10,0x00000000

B11,0x00000000

B12,0x00000000

B13,0x00000000

B14,0xa02515a0

B15,0x9007ffe8

B16,0x00000000

B17,0x00000000

B18,0x00004980

B19,0x00000000

B2,0x9007fb38

B20,0x00000000

B21,0x00000000

B22,0x00000000

B23,0x00000000

B24,0x00000030

B25,0x00000000

B26,0x00000000

B27,0x00000030

B28,0x00000000

B29,0x00000030

B3,0x0c00d0f0

B30,0x00000000

B31,0x00000000

B4,0x00000100

B5,0x00000100

B6,0x00000000

B7,0x00000000

B8,0x21000000

B9,0x41ba44e2

EFR,0x00000002

IERR,0x00000008

ILC,0x00000000

IRP,0x0c06f434

ITSR,0x0000000f

NRP,0x0c00d110

NTSR,0x0001000f

RILC,0x00000031

SSR,0x00000000

 

The idea of using SPMD is to reduce memory usage on each core, so the memory used when running on 1 core is larger than when running on 2 cores. Can anyone give any idea or feedback on the problem I’m facing?

Thanks and kind regards,

Rizuan

  • Hi,

    A small addition to my previous post. My RTSC platform is as below:

    Inside my .cfg (in the previous post), I created a heap named "ddr_heap" inside the DDR3 memory region. Inside this heap I allocate several arrays using Memory_alloc, for example S, P, L, and Ss. So basically all these arrays are in DDR3. In my #pragma omp parallel, I set these arrays as private.

      

    Is this safe? Sorry if this has been questioned and answered before.

    I'm guessing this is why no error happens when only 1 core is used; the exception error happens when more than 1 core is used.

    Any response/help is appreciated.

    [EDIT: Is this problem related to http://e2e.ti.com/support/embedded/bios/f/355/p/180742/654302.aspx ?]

    Rizuan

  • Hi,

    Can anyone help me with this issue, please?

    Kind regards,

    Rizuan

  • Rizuan, the error message appears to indicate that an overflow is corrupting the text section in MSMCSRAM.

    RIZUAN said:

    Exception,

    Address,0x0c00d110

    Decoded,Internal: Opcode exception;

  • Hi Ajay,

    Sorry for the delay. I've been away on other work. You've been helpful.

    I couldn't recreate the same exception error. I've just upgraded to MCSDK 2.1.2.5 and CGT 7.4.2.

    Now I've moved the .text section to DDR3 and received the following exception error:

    Exception, Address,0xf0800000 Decoded,Internal: Opcode exception; Exception Context, $addr,0x0084d290 $type,ti.sysbios.family.c64p.Exception.Context A0,0x00000000 A1,0x00000404 A10,0xe6000000 A11,0xfe400000 A12,0xad000000 A13,0x41d23555 A14,0xe3000000 A15,0xc1cff1a3 A16,0x00000000 A17,0x00000000 A18,0x00000000 A19,0x0000003a A2,0x00000000 A20,0x00000039 A21,0x901bab58 A22,0x00000000 A23,0x901bab58 A24,0x00000001 A25,0x8e120c1c A26,0x8e120c1a A27,0x8e120c18 A28,0x8e120c16 A29,0x8e120c14 A3,0x00000038 A30,0x8e120c12 A31,0x000001c8 A4,0x85800000 A5,0xc1cfd535 A6,0x00000208 A7,0x71eff119 A8,0x00000200 A9,0x0c0e6700 AMR,0x00000000 B0,0x00000000 B1,0x00000001 B10,0x59800000 B11,0xc1c0aceb B12,0x30c00000 B13,0x41d00734 B14,0xa02515a0 B15,0x901baf20 B16,0x00104000 B17,0x80000000 B18,0x0000003b B19,0x901bab58 B2,0x901bad28 B20,0x00000000 B21,0x901bab58 B22,0x901bab58 B23,0x00000000 B24,0x00000000 B25,0x00000000 B26,0x8e117ae0 B27,0x00004d00 B28,0x8e117920 B29,0x8e118aa0 B3,0xf0800000 B30,0x0000001c B31,0x00103800 B4,0x00000038 B5,0x41d50852 B6,0x80104820 B7,0x71eff115 B8,0x5421488a B9,0x00000208 EFR,0x00000002 IERR,0x00000008 ILC,0x00000000 IRP,0x0c00eb36 ITSR,0x0000000f NRP,0xf0800000 NTSR,0x0001000f RILC,0x00000001 SSR,0x00000000

    How did you know that the previous exception error related to the .text section? Is there any documentation that I missed?

    Rizuan

  • Because it says "Opcode exception".  This is presumably an attempt to decode an invalid instruction opcode, and instructions typically reside in the .text section.

  • Hi Ajay and Archaeologist,

    Thanks for your reply and help. It turns out that the reason I received these exception errors is that the Image Processing Demo example has an issue with CGT 7.4.1 and above, as stated here. I created my project based on that example.

    After changing to CGT 7.4.0, the problem is gone.

    I have no problem running my SPMD program (I explained briefly how I created it in my first post in this thread) on one or two cores. When I divide the program to run on 4 cores, it hangs and sometimes gives the exception errors attached at the end of this post. This is my .cfg file: 0552.RKLT_OpenMPv2.1.cfg. My platform.xdc is shown below.

    As you may notice, my OpenMP.stackRegionId is 2, and the exception error says it is an Internal: Instruction fetch exception. Why is the address 0x00000000? Can anyone help me with this, and perhaps with other problems that may exist in these configuration and platform files?

    Exception, Address,0x00000000 Decoded,Internal: Instruction fetch exception; Exception Context, $addr,0x0084d360 $type,ti.sysbios.family.c64p.Exception.Context A0,0x00000000 A1,0x00000000 A10,0x00000000 A11,0x00000000 A12,0x00000000 A13,0x40a12200 A14,0x00000000 A15,0x40a14000 A16,0x00000000 A17,0x00000000 A18,0x00000000 A19,0x0000003a A2,0x00000000 A20,0x00000039 A21,0x9023aad8 A22,0x00000000 A23,0x9023aad8 A24,0x00000000 A25,0x8e120c1c A26,0x8e120c1a A27,0x8e120c18 A28,0x8e120c16 A29,0x8e120c14 A3,0x00000038 A30,0x8e120c12 A31,0x000001c8 A4,0x00000000 A5,0x409a6000 A6,0x00000208 A7,0x71effbf5 A8,0x00000200 A9,0x0c0e69c8 AMR,0x00000000 B0,0x00000000 B1,0x00000001 B10,0x00000000 B11,0x40a1ea00 B12,0x00000000 B13,0x40a25800 B14,0xa03215a0 B15,0x9023aea0 B16,0x00104000 B17,0x80000000 B18,0x0000003b B19,0x9023aad8 B2,0x9023aca8 B20,0x00000000 B21,0x9023aad8 B22,0x9023aad8 B23,0x00000000 B24,0x00000000 B25,0x00000000 B26,0x8e117ae0 B27,0x00004d00 B28,0x8e117920 B29,0x8e118aa0 B3,0x00000000 B30,0x0000001c B31,0x00103800 B4,0x00000038 B5,0x4099ec00 B6,0x80104820 B7,0x71effbf1 B8,0x0000067b B9,0x00000208 EFR,0x00000002 IERR,0x00000001 ILC,0x00000000 IRP,0x0c08114c ITSR,0x0000000f NRP,0x00000000 NTSR,0x0001000f RILC,0x00000001 SSR,0x00000000

     

    Many thanks,

    Rizuan

     

  • Trying to execute address 0x0 indicates a NULL function pointer somewhere in the code.  Are you able to determine what the PC was before this exception?

    C6000 compiler version 7.4.1 should add only bug fixes to 7.4.0; if 7.4.0 is working for you, it is unlikely to be a bug that was introduced in 7.4.1.  More likely, 7.4.1 does something slightly different that happens to trigger a bug that was already in either your code or some part of the toolchain.

    With regards to the configuration, I don't know; I am not an expert in that area.

  • Probably not related to your issue - In your cfg file, you set "OpenMP.enableMemoryConsistency = false;". This implies that you will manage cache consistency for shared variables manually by inserting Cache invalidate calls at the appropriate points in your program. Any specific reason you disabled the memory consistency feature in the OMP runtime?

  • One more thing - have you been able to get some of the simpler examples, such as matrix multiplication to work across 8 cores in your OpenMP environment?

    Ajay

  • Hi everyone,

    Thanks for the replies.

    Ajay Jayaraj said:
    have you been able to get some of the simpler examples, such as matrix multiplication to work across 8 cores in your OpenMP environment?

    Yes. I brought the code from OMP Examples->C6678 Examples->OpenMP matrix vector multiplication example into my project, using the configuration file and platform.xdc that I showed before, running on 8 cores. It returned the same result as the example.

    Ajay Jayaraj said:
    Any specific reason you disabled the memory consistency feature in the OMP runtime?

    No, I don't have a specific reason. I just followed what is in the configuration file from the image processing example by TI. I've set it to true now, but it gives the same problem when using 4 cores or more.

    Archaeologist said:
    Are you able to determine what the PC was before this exception?

    No. But I will do that later.

    At this point, I think it is necessary for me to show my code, at least the OMP pragma part, since the full code is lengthy. Here it is:

    This is the output from console when using only 1 core:

    And this is the output if using 2 cores:

    And finally 4 cores:

    We can notice that in the 4-core execution, it ignores the second #pragma omp critical, marked by "Begin Applying P,L,U,S". It doesn't wait for all the cores to finish executing the previous lines, as in the 2-core execution. This is what I suspect is the problem.

    Can anyone comment on this or any other solutions?

    Many thanks, guys.

    Rizuan

  • Forgive my ignorance, I'm not an OMP expert.  However, I don't see why the other threads should be required to finish just because thread 2 entered the second critical section.  I would assume that "critical section" merely means "no other thread may enter this particular critical section", and does not provide a guarantee that all threads will wait at the critical section until the same time, nor provide a guarantee which thread executes it first.  Do I misunderstand how OMP "critical" works?

  • Hi Archaeologist,

    You've suddenly made me confused, since I'm also a beginner in OpenMP =D. Then I realised that I've used the same approach with Visual C++ OpenMP 2.0 and also with the Intel Compiler's OpenMP 3.0. Actually, my project in CCS comes from the desktop implementation I created before, and that was my first time using OpenMP.

    You can refer to this pdf file, page 23: threads wait their turn; only one at a time calls the function inside the critical region.

    Kind regards,

    Rizuan

  • Hi Archaeologist,

    Reading your question again, I think you are right. I now have no idea how to troubleshoot the problem; the critical section was only my suspicion.

    P/S: That was a good question though.

    Rizuan

  • I believe the fact that you've demonstrated it successfully multithreading with 2 cores, and failing with 4, is an important observation.

  • Arch is correct about critical sections. It appears that Core 2 is crashing as it executes the critical section. One possibility is that the application is writing past the bounds of some data structures (either malloc'ed or on the stack) and overwriting OMP runtime structures resulting in the crash.

    Ajay

  • Hi everyone,

    Thanks for the support. Many many thanks.

    It turns out that I don't need that critical section. I debugged my code further and found that making dynamically allocated variables private is not stable, even when calling Memory_alloc inside the parallel region. During debugging, the addresses of the dynamic variables changed, and they appeared to be shared rather than private.

    Shared variables work perfectly. What I did was increase OpenMP.stackSize, so the previously private dynamic variables are now fixed-size arrays declared inside the functions, and I use large shared variables to update those arrays. No more critical section.

    I don't know how to explain this fully, but I hope it helps others who want to implement memory-hungry image processing using OpenMP. I am now able to run Integer KLT encoding of a multi-component image for space applications using OpenMP on the TMDSEVMC6678L board, and to decode the image to verify its losslessness. The encoding uses OpenMP, whereas the decoding uses a single core. I can run the encoding using 1, 2, 4, 7, or 8 cores.

    Thanks a lot to Ajay, Arch for helping me.

    Rizuan

  • Hello, did you ever get this problem solved? What about not using MSMCSRAM_NOCACHE for data allocation? Did you try that?
    Joe