
OpenMP on c6678 performance

I'm testing a disparity-calculation program with OpenMP runtimes 1.2.0.5 and 2.1.17.1, and I have run into some performance issues.

Without the #pragma omp clauses, the code ran in 0.15 s.

With OpenMP 1.2, it took 0.5 s on 1 core and 0.06 s on 8 cores.

With OpenMP 2.1, the time on 1 core was the same, but it was 0.3 s on 8 cores, and the result does not change whether I put the data in DDR or MSMC.

I need some help on:

1. Why might using OpenMP on 1 core be slower than not using OpenMP at all?

2. Why might OpenMP 2.1 perform worse than OpenMP 1.2 on 8 cores when the memory settings are the same?

3. The OpenMP 2.1 specification says the runtime needs some non-cached MSMC but does not mention program data, whereas in OpenMP 1.x data memory in MSMC had to be non-cached. Is this still necessary in 2.x? If it is, can you give an example of how to do that?

4. I wrote another program using the "omp atomic" / "omp critical" clauses and saw no difference between them, but in general atomic should be much faster than critical. Does the C6678 not support atomic?

Summary of the code:

char *data1  = malloc(size);
char *data2  = malloc(size);
char *output = malloc(sizeof(char) * size);

#pragma omp parallel for private(i,j) shared(data1,data2,output) collapse(2) schedule(static) num_threads(8)
	for (i = height - 1; i >= 0; i--)
	{
		for (j = Width - 1; j >= 0; j--)
		{
			output[i * Width + j] = dosomething(data1, data2);
		}
	}

The .cfg file used was taken from the examples in the OpenMP packages.

The tools used are: CCS 6.1.1, compiler 8.1.0, XDC 3.25.6.96, IPC 1.24.3.32, C6678 PDK 1.1.2.6, SYS/BIOS 6.33.6.50, UIA 1.2.0.7.

Thanks,

Han

  • Welcome to the TI E2E forum. I hope you will find many good answers here, in the TI.com documents, and in the TI Wiki Pages (for processor issues). Be sure to search those for helpful information and to browse the questions others have asked on similar topics (e2e.ti.com). Please read all the links below my signature.

    There are multiple threads discussing OpenMP performance on the E2E forum. I have assigned this to an OpenMP expert; in the meantime, I recommend you go through those posts for some help.

    Thank you.

  • Thanks.

    As I can't post my code here, I have some more specific questions regarding the memory configuration:

    var sharedRegionId = 0;
    
    // Size of the core local heap
    var localHeapSize = 0x8000;
    
    // Size of the heap shared by all the cores
    var sharedHeapSize = 0x8000000;
    
    
    var SharedRegion   = xdc.useModule('ti.sdo.ipc.SharedRegion');
        SharedRegion.setEntryMeta( sharedRegionId,
                                   {   base: ddr3.base,
                                       len:  sharedHeapSize,
                                       ownerProcId: 0,
                                       cacheEnable: true,
                                       createHeap: true,
                                       isValid: true,
                                       name: "DDR3_SR0",
                                   });
    

    1. This cfg code is from the example code in the OMP package; it sets up a shared region in DDR. Will malloc use this region?

    2. If I put this region in MSMC, will performance improve?

    3. Should this region be cached?

    4. The cfg code is the same in both versions of OpenMP except for sharedRegionId, which is 2 in OMP 1.1 and 0 in OMP 2. What is the difference?
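    To make question 2 concrete, this is roughly what the same SharedRegion entry might look like based in MSMC instead of DDR3. The base address, length, and name below are my assumptions for a C6678 (4 MB of MSMC at 0x0C000000), not values taken from the example cfg; check them against your platform file:

```
var SharedRegion = xdc.useModule('ti.sdo.ipc.SharedRegion');

// Same entry as the DDR3 version, but based in MSMC.
// The heap must fit within the C6678's 4 MB of MSMC.
SharedRegion.setEntryMeta( 0,
                           {   base: 0x0C000000,
                               len:  0x00400000,
                               ownerProcId: 0,
                               cacheEnable: true,
                               createHeap: true,
                               isValid: true,
                               name: "MSMC_SR0",
                           });
```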

  • Hi Han,

    Please focus on OpenMP 2.x, as we no longer support OpenMP 1.x. The latest OpenMP 2.2 is available at www.ti.com/.../processor-sdk-c667x.

    Regarding the shared region for the OpenMP heap, please refer to processors.wiki.ti.com/.../SharedRegion_Module.

    HeapMP is created from the shared region, and used by malloc in OpenMP parallel construct, see the comment in om_config.cfg from \ti\openmp_dsp_c667x_2_02_00_02\packages\examples\matmul\:

    // Configure memory allocation using HeapOMP
    // HeapOMP handles
    // - Memory allocation requests from BIOS components (core local memory)
    // - Shared memory allocation by utilizing the IPC module to enable
    // multiple cores to allocate memory out of the same heap - used by malloc

    The field "cacheEnable: true" indicates the region is cached.

    Regards,
    Garrett
  • Thanks Garrett,

    I installed this version but cannot compile my program; it fails with the following error:

    js: "D:/ccs/openmp_dsp_c667x_2_02_00_02/packages/ti/runtime/openmp/package.xs", line 175: XDC runtime error: ti.csl.Settings: no element named 'deviceType'

    Do I need to follow the "porting OpenMP 2.x to KeyStone 1" tutorial again? Which versions of the other tools should I use? The documentation under "openmp_dsp_c667x_2_02_00_02\docs" is not helpful.

    About the SharedRegion: I understand how to enable the cache, but I'm still unsure whether I should enable it. Many posts say data should be placed in non-cached shared memory, yet in the example cfg file cacheEnable is set to true.

    Regards,

    Han

  • Han,

    CSL comes from the C667x PDK 2.0.0. Do you have that PDK selected in your project under Properties -> General -> RTSC?

    Here is a detailed explanation of the cacheability of each memory section: downloads.ti.com/.../configuring_runtime.html.

    Regards,
    Garrett
  • Changing to PDK 2 results in the following error message:
    error #10056: symbol "ti_runtime_ompbios_HeapOMP_isBlocking__E" redefined: first defined in "D:\project\Debug\configPkg\package\cfg\omp2_pe66.oe66"; redefined in "D:\ccs\openmp_dsp_c667x_2_02_00_02\packages\ti\runtime\ompbios\lib\ti_runtime_ompbios_debug_e66.ae66<HeapOMP.oe66>"

    Or can you give me a known-working RTSC environment setup?
    According to the documentation, MSMC and DDR are both cached, so I assume program data should be placed in cached MSMC for the best performance?
  • Waiting for a reply.