Use of Shared Code Program Memory

FeruzM

Other Parts Discussed in Thread: SYSBIOS

Hi,

I wonder if someone can guide me on memory management on the C6670.

I want to try the "use of a shared message buffer" approach for a large array allocated in DDR, described in section 6.3 of http://www.ti.com/lit/an/sprab27a/sprab27a.pdf, which mentions the formula

<base address> + <per-core-area size> × DNUM

Let's say I have the following code and I want to run the same code on all cores, with each core working on a different part of the array.

#pragma DATA_SECTION(A, ".mydata") // .mydata is allocated in DDR

int A[1000];

When each of the four cores runs the above code, each should own a part of length 250.

I would appreciate it if someone could give an example of allocating such an array.

Thanks,

over 13 years ago

0 one and zero over 13 years ago

TI__Mastermind 18146 points

Hi Feruz,

the following code initializes your array A with 0 to 249, where each core uses its own area of the array:

for (i = 0; i < 250; i++)
{
    A[DNUM*250 + i] = i;
}

This assumes you're loading the same code image (.out file) to each core. That way array A will be linked to the same address in DDR. Also, it's your own responsibility to make sure each core sticks to its assigned portion of the array A.
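As a minimal host-side sketch of the partitioning described above: the core number is passed in as a plain parameter (on the DSP it would be DNUM), and the names `core_slice`, `fill_slice`, and `NUM_CORES` are illustrative assumptions, not part of any TI library.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CORES      4     /* assumption: four cores sharing the array */
#define TOTAL_ELEMS    1000
#define ELEMS_PER_CORE (TOTAL_ELEMS / NUM_CORES)

/* Return the start of the slice owned by `core_id`.
 * On the DSP, `core_id` would be DNUM; here it is a parameter
 * so the sketch can be compiled and checked on a host PC. */
static int *core_slice(int *base, unsigned core_id)
{
    assert(core_id < NUM_CORES);
    return base + (size_t)core_id * ELEMS_PER_CORE;
}

/* Fill one core's slice with 0..249, mirroring the loop in the post. */
static void fill_slice(int *base, unsigned core_id)
{
    int *mine = core_slice(base, core_id);
    for (int i = 0; i < ELEMS_PER_CORE; i++)
        mine[i] = i;
}
```

Each core only ever touches indices `core_id*250` through `core_id*250 + 249`, so no two cores write the same element.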

Kind regards,

one and zero

0 FeruzM over 13 years ago in reply to one and zero

Intellectual 740 points

Hello "one and zero",

Thank you for your response!

I did try this method, but if I have larger arrays, or use malloc to allocate memory for the array, the CPU somehow hangs on simple calculations.

Could you confirm that using malloc won't affect these operations? Or should I malloc not the full array, but just the part that each core owns?

Sincerely,

0 one and zero over 13 years ago in reply to FeruzM

TI__Mastermind 18146 points

Hi Feruz,

the first thing to check is cache coherency in the simple example I provided. Cache coherency is not automatically maintained for DDR. One thing you can do is mark the memory region where the shared array is located as non-cacheable. The C66x DSP Cache User's Guide explains that in detail.

Then on malloc: this won't work, since it only provides a single-core view.

In case you plan to use SYS/BIOS and IPC, there's a multicore heap implementation you can use for this purpose.

Kind regards,

one and zero

0 twentz over 13 years ago in reply to FeruzM

Intellectual 435 points

DDR memory is global, so it does not require the use of the formula based on DNUM -- the same address (somewhere around 0x80000000) means the same thing on all cores without an alias.

I am completely confused at why the CPU would hang on any such operations -- but if you were writing in the wrong address, that might be a reason. What I gathered from your first post is that you wanted to do something like:

#pragma DATA_SECTION(A, ".mydata") // DDR

int A[1000];

...

int *Aptr; // core-local memory

Aptr = &A[DNUM*250]; // no address translation used because DDR addresses are the same for all cores

int blahblah = Aptr[0]; // local and different for each core

But you talk about a shared memory buffer, so I'm not sure this was your goal (everything would still be shared in A, but the relative indexes into Aptr would be different)

Calling malloc() uses memory from your DataSection or, more specifically, .sysmem -- so if this is located in L2SRAM, then it will actually be core-local. Setting up a HeapMP object in DDR and then using Memory_alloc() will get you dynamic memory allocations (but also needs IPC configuration and such)

0 FeruzM over 13 years ago in reply to twentz

Intellectual 740 points

Hello,

Thank you for the answers, but the CPU keeps running for a long time; it only stops when I cancel the operation.

As you suggested, dynamic memory allocation is the last option. How can I configure IPC and HeapMP in DDR and use Memory_alloc() for my application in particular?

Thank you.

I look forward to hearing from you.

Regards,

0 FeruzM over 13 years ago in reply to FeruzM

Intellectual 740 points

Hi again,

int *A = malloc(1000*sizeof(int));

When the same code runs on each core, will the line above allocate the same memory for A, or will each core allocate a different, arbitrary region?

How can I put my array in one memory area and use it from all cores?

Thanks,

0 DanRinkes over 13 years ago in reply to FeruzM

TI__Expert 8055 points

Feruz,

No, running malloc for each core will not necessarily allocate the same memory for A.

The L2SRAM of each core is accessible through the memory map of every core. For example, Core 0's L2SRAM can be accessed from any core at address 0x10800000. If you allocate the array in one of these sections, it is accessible to every core.

Note that malloc has no knowledge of other cores. It's just going to allocate memory available from its core's heap section.
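The local-to-global translation implied above can be sketched as follows. This assumes a KeyStone-style memory map in which core N's local SRAM window (local addresses below 0x01000000) is aliased globally at 0x10000000 + N × 0x01000000, which matches the 0x10800000 example for Core 0; the function name `local_to_global` is hypothetical, and the device data manual should be checked for the actual map.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: core N's local SRAM is aliased globally at
 * 0x10000000 + N * 0x01000000 (KeyStone-style memory map). */
#define GLOBAL_ALIAS_BASE   0x10000000u
#define GLOBAL_ALIAS_STRIDE 0x01000000u

/* Convert a core-local SRAM address into the address that any core can use.
 * On the DSP, `core_id` would be the DNUM of the core that owns the memory. */
static uint32_t local_to_global(uint32_t local_addr, unsigned core_id)
{
    /* Addresses outside the local window (e.g. DDR at 0x80000000)
     * are already global, so they are returned unchanged. */
    if (local_addr >= GLOBAL_ALIAS_STRIDE)
        return local_addr;
    return GLOBAL_ALIAS_BASE + core_id * GLOBAL_ALIAS_STRIDE + local_addr;
}
```

A pointer returned by malloc from a heap in local L2SRAM would need this kind of translation before another core could dereference it.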

Regards,

Dan

0 FeruzM over 13 years ago in reply to DanRinkes

Intellectual 740 points

Hi,

So you mean like, this would work?

int *A;

switch(DNUM)

case 0: A = malloc(10*sizeof(int)); //sysmem is in L1SRAM of core 0

for(i=0; i<10;i++)

A[i] =i;

break;

case 1: printf("%d\n", A[1]);

....

Thanks,

Feruz

0 twentz over 13 years ago in reply to FeruzM

Intellectual 435 points

(note, I edited this multiple times, and it's still not correct, but it should convey the point)

Not exactly. Core1 needs to access the address, so something that might work *assuming the cores have exactly the same stack state and variable locations*:

int *A;

switch(DNUM)

case 0: A = malloc(10*sizeof(int)); //sysmem is in L1SRAM of core 0

for(i=0; i<10;i++)

A[i] =i;

break;

case 1:

// You need to make sure that Core1 will execute only after Core0 has written to A (and then the values of A before printing)

// Core0's A points to some place in Core0's local alias

A = (int *)( 0x10000000 + (long unsigned int)A); // now Core1's A is a pointer to Core0's memory by using the global address for Core0's L1SRAM

printf("%d\n", A[1]);

For more generic Shared memory approaches: two similar constructs that would work:

Static allocation:

#pragma DATA_SECTION(A, ".shared")

#pragma DATA_ALIGN(A, 128)

int A[10];

#pragma DATA_SECTION(flag, ".shared")

#pragma DATA_ALIGN(flag, 128) // or you could be tricky and stick this as A[11] or something similar

int flag = 0;

...

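switch (DNUM) {
case 0:
    for (i = 0; i < 10; i++)
        A[i] = i;
    Cache_wb(A, sizeof(int)*10, Cache_Type_ALL, TRUE);   // write the data back to shared memory first
    flag = 1;
    Cache_wb(&flag, sizeof(int), Cache_Type_ALL, TRUE);  // then publish the flag
    break;
case 1:
    while (flag == 0) {
        Cache_inv(&flag, sizeof(int), Cache_Type_ALL, TRUE); // drop the stale cached copy so the next read goes to memory
    }
    Cache_inv(A, sizeof(int)*10, Cache_Type_ALL, TRUE);  // make sure no stale copy of A is cached either
    printf("%d\n", A[1]);
    break;
}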

workaround dynamic allocation:

#include <ti/sysbios/hal/Cache.h> // or something similar, depending on what method you use to invalidate the cache

#pragma DATA_SECTION(heap_addr, ".shared")

long unsigned int heap_addr = 0;

...

int *A;

long unsigned int base_wbAddr;

int wbSize = 128; // an L2 cacheline is 128 bytes, this is needed if A is in DDR. an L1 cacheline is only 64 bytes

switch (DNUM) {
case 0:
    A = malloc(10*sizeof(int)); // sysmem is in L1SRAM of core 0
    heap_addr = (long unsigned int)A; // if sysmem is core-local, store the global alias (0x10000000 + address) so other cores can dereference it
    if (/* a tedious calculation to find out if A exceeds a cacheline boundary */) {
        wbSize = 128*2; // 2 cachelines since A exceeded the boundary and 10*sizeof(int) won't go to a 3rd line
        base_wbAddr = ((long unsigned int)A) & ~127UL; // base of the cacheline
    } else {
        wbSize = 128;
        base_wbAddr = (long unsigned int)A;
    }
    for (i = 0; i < 10; i++)
        A[i] = i;
    Cache_wb((void *)base_wbAddr, wbSize, Cache_Type_ALL, TRUE);
    Cache_wb(&heap_addr, sizeof(long unsigned int), Cache_Type_ALL, TRUE); // pass the address of heap_addr, not its value
    break;
case 1:
    while (heap_addr == 0) {
        Cache_inv(&heap_addr, sizeof(long unsigned int), Cache_Type_ALL, TRUE);
    }
    A = (int *)heap_addr;
    printf("%d\n", A[1]);
    break;
}

EDIT: I had to edit the code quite a bit to be more correct (though it still might not be completely correct)
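The "tedious calculation" in the code above, deciding whether a buffer crosses a cache-line boundary and how many bytes to write back, could look roughly like this. It is a sketch that assumes the 128-byte L2 line size quoted in the post; the helper names `line_base`, `line_span`, and `crosses_line` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 128u  /* assumption: 128-byte L2 line, as stated in the post */

/* First byte of the cache line containing `addr`. */
static uintptr_t line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE - 1);
}

/* Number of bytes covered by all cache lines that [addr, addr+len) touches. */
static size_t line_span(uintptr_t addr, size_t len)
{
    uintptr_t end  = addr + len;  /* one past the last byte */
    uintptr_t last = (end + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);
    return (size_t)(last - line_base(addr));
}

/* Non-zero if the buffer needs more than one cache line. */
static int crosses_line(uintptr_t addr, size_t len)
{
    return line_span(addr, len) > CACHE_LINE;
}
```

With these helpers the write-back range would be `line_base(addr)` for `line_span(addr, len)` bytes, which removes the need for a separate two-branch case.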

0 FeruzM over 13 years ago in reply to twentz

Intellectual 740 points

Hi,

Thank you twentz!

On your first point: is there an automatic way to define which processor starts working first? Does changing the priority help in this case (in CCS, choosing the master and slave options in the target configuration file)?

On the second and third points: I found them a bit difficult to follow, since I am not familiar with the Cache_... functions. Could you be a bit more specific, please?

Thank you very much for your support.

Regards,
Feruz

0 twentz over 13 years ago in reply to FeruzM

Intellectual 435 points

As far as I know, there isn't a standard way to synchronize processors. The only way I know how is to do the "flag in shared memory" method -- or a couple other operations that involve shared memory in the IPC module (part of SYS/BIOS). I don't think master/slave has any effect.

The Cache functions I was using are part of SYS/BIOS (at least the header files); there are similar CSL functions (which don't require using all of SYS/BIOS, only the CSL library), but they are slightly different.

The main operations of the Cache_ functions are:

1) Writeback

2) Invalidate

3) Writeback-Invalidate (performs both)

----

Typically, when you perform a "store" in code, it writes only to the copy of the variable in L1 cache, which is core-local. For example, if you write to a variable located in DDR (shared), the new value stays in the L1/L2 caches instead of going to DDR until the cache line gets replaced. Unless you explicitly write back the memory, the value might theoretically never be written out, and certainly not at the moment you want it to be. Therefore, if another core wants to read a value in DDR that the first core wrote (and that still sits in the first core's L1 cache), the writing core must write the result back to DDR before the other core can read the new value.

Invalidating is the counterpart in that if Core1 and Core2 both read a value from DDR, then each core will have a copy in L1/L2 cache. Since the DSPs don't have cache coherency, if Core1 writes a new value, then Core2 will not know about this but it will still have its old copy of the value. Even if Core1 writes back the new value, Core2 thinks it already has the value since it is in the cache. Hence, if Core2 tries to use the cached variable, it will be using its own, old copy instead of Core1's new value. To fix this, Core2 needs to "invalidate" its cache. After this, when it tries to read the value, it will incur a cache miss and read the value directly from the original location (DDR in this case), so if Core1 performed a writeback, then Core2 will now have the new value.
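The write-back and invalidate behaviour described in the two paragraphs above can be illustrated with a toy model that runs on a host PC. It is only a sketch: one shared word stands in for DDR, each "core" has a one-entry private cache, and every name here (`ToyCache`, `toy_read`, and so on) is hypothetical and unrelated to the SYS/BIOS Cache API.

```c
#include <assert.h>

/* Toy model: one word of shared memory plus a private one-entry cache per core. */
typedef struct {
    int value;   /* cached copy */
    int valid;   /* 1 if the cache holds a copy */
    int dirty;   /* 1 if the copy is newer than shared memory */
} ToyCache;

static int shared_word;  /* stands in for a location in DDR */

static int toy_read(ToyCache *c)
{
    if (!c->valid) {          /* miss: fetch from shared memory */
        c->value = shared_word;
        c->valid = 1;
        c->dirty = 0;
    }
    return c->value;          /* hit: shared memory is not consulted */
}

static void toy_write(ToyCache *c, int v)
{
    c->value = v;             /* the write lands in the cache only */
    c->valid = 1;
    c->dirty = 1;
}

static void toy_writeback(ToyCache *c)
{
    if (c->valid && c->dirty) {
        shared_word = c->value;
        c->dirty = 0;
    }
}

static void toy_invalidate(ToyCache *c)
{
    c->valid = 0;             /* the next read will miss and refetch */
    c->dirty = 0;
}
```

In this model a reader keeps seeing its old copy after the writer's `toy_write`, and even after the writer's `toy_writeback`; only once the reader calls `toy_invalidate` does its next `toy_read` return the new value.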

For the SYS/BIOS cache functions, the most useful ones are:

Cache_wb, Cache_inv, Cache_wbInv

with the arguments to these functions being:

Cache_*(Ptr blockPtr, SizeT byteCnt, Bits16 type, Bool wait)

Since any writeback/invalidation operates on entire cachelines, the affected range is automatically rounded out to whole cachelines (64 or 128 bytes).

Most of the time you will want to wait for completion (wait = TRUE), but it depends on how asynchronous your code is.

The type argument specifies which level of the cache you are operating on; Cache_Type_ALL is safest.

Example:

#pragma DATA_SECTION(A, ".shared") // assuming there is a section called ".shared" in shared memory

#pragma DATA_SECTION(flag, ".shared")

int A[10];

int flag;

...

Cache_wbInv(A, sizeof(int)*10, Cache_Type_ALL, TRUE); // covers one 128-byte line if A is in DDR and 128-byte aligned

...

Cache_wbInv(&flag, sizeof(int), Cache_Type_ALL, TRUE);

There is a little bit about the cache functions in the SYS/BIOS user guide

http://www.ti.com/lit/ug/spruex3i/spruex3i.pdf

0 FeruzM over 13 years ago in reply to twentz

Intellectual 740 points

Hi,

Thank you twentz, the theoretical explanations above were very clear.

You mentioned that there are other possible ways, using the IPC module for shared memory. Could you please give brief information about their usage and how I can implement them in this case?

The Cache functions are understood now. Below is my test code, but it has a problem:

#pragma DATA_SECTION(A, ".shared")

#pragma DATA_ALIGN(A, 128)

int A[10];

#pragma DATA_SECTION(flag, ".shared")

#pragma DATA_ALIGN(flag, 128) // or you could be tricky and stick this as A[11] or something similar

int flag = 0;

int i;

switch(DNUM){

case 2:

for(i=0; i<10;i++)

A[i] =i;

Cache_wbInv(A, sizeof(int)*10, Cache_Type_ALL, TRUE); //0x400 - 128 bytes

flag = 1;

Cache_wbInv(&flag, sizeof(int), Cache_Type_ALL, TRUE);

printf("%d\n", A[5]);

break;

case 3:

while (flag != 1) {

Cache_wbInv(&flag, sizeof(int),Cache_Type_ALL, TRUE);

}

printf("%d\n", A[5]);

break;

default: printf("default\n");

break;

}

Core 3 is stuck in the loop forever. How would you get out of this while loop? What is causing the problem? Is it possible that the cache is not performing the writeback?

Thank you very much for keeping this thread alive. I think this thread is already answering a number of people's questions!

Regards,

Feruz

0 twentz over 13 years ago in reply to FeruzM

Intellectual 435 points

Core3 needs to call Cache_inv() -- NOT wbInv(). Core3 is only trying to *read* the value, not *write* it.

IPC doesn't have an exact solution for shared memory programming, but it does have 2 features:

1) Create a heap memory pool that is shared between processors. This doesn't solve your problem though.

2) Message passing through MessageQ. Shared memory is one programming model, and often the "other" method is message passing.

There are a few things that need to be set up for MessageQ (the examples in CCS from the MCSDK have these), but after you set it up, you can do something like:

-----------------------------------------

This is pseudo code (again)

------------------------------------------

Core0:

struct myMsg {
    MessageQ_MsgHeader header; // must be the first field
    int payload;
};

...

struct myMsg *msg = (struct myMsg *) MessageQ_alloc(HEAPID, sizeof(struct myMsg)); // HEAPID: a heap registered with MessageQ

msg->payload = 10;

MessageQ_put(remoteQueueId, (MessageQ_Msg) msg); // remoteQueueId: obtained by opening Core1's queue

..............

Core1:

int value;

struct myMsg *recvMsg;

MessageQ_get(messageQueueHandle, (MessageQ_Msg *) &recvMsg, MessageQ_FOREVER); // This blocks until the other core has performed the "put"

value = recvMsg->payload;

0 FeruzM over 13 years ago in reply to twentz

Intellectual 740 points

Hi twentz,

1. Changing the function to Cache_inv did not help; it is still running endlessly.

2. About the MessageQ module: so it is possible to send arrays by creating your own message structure?

Inside the IPC module I found a MessageQ example that uses a single image on all cores, and I am trying to build my implementation on it.

So far I have made a few changes to that single-image MessageQ example from IPC:

typedef struct myMsg{

MessageQ_MsgHeader header;

Int payload;

}myMsg;

myMsg msg;

error: a value of type "MessageQ_Msg" cannot be assigned to an entity of type "myMsg" in MessageQ_alloc(sizeof(myMsg))

error: argument of type "myMsg *" is incompatible with parameter of type "MessageQ_Msg *" in MessageQ_get(MessageQ_Handle, &msg, MessageQ_FOREVER);

error: argument of type "myMsg" is incompatible with parameter of type "MessageQ_Msg" in MessageQ_put(remoteQueueId, msg);

The structure is defined as in the IPC user guide, and the code should recognize MessageQ_MsgHeader in the structure, right? So I don't know what is causing the errors.

3. Could you explain the image below? Of course it is different hardware, but how would you describe the configuration with "+No Cache"?

As far as I can see, in that case we cannot use the Cache_x functions if we unbind all caches from RTSC or from the GEL file?

Then how would shared memory work? Earlier you mentioned that even with shared memory in DDR the caches are used; without them, is the shared memory (DDR) accessed directly?

Thank you...

Regards,

Feruz

0 FeruzM over 13 years ago in reply to FeruzM

Intellectual 740 points

Hello everybody,

I wonder if anybody could help me with above issues?

Thank you!

/Feruz

0 twentz over 13 years ago

Intellectual 435 points

#pragma DATA_SECTION(A, ".mydata") // .mydata is allocated in DDR

int A[1000];

....

int *myA = &A[250*DNUM];

myA[0] = ...

you would need to keep track of cache coherence issues though

0 FeruzM over 13 years ago in reply to twentz

Intellectual 740 points

Hello, twentz,

Sorry, but the post you answered is not my latest question. I guess the forum paginates automatically, so my very first post shows up at the top of every new page; my actual question is at the end of the first page.

Thanks!

/Feruz

0 twentz over 13 years ago in reply to FeruzM

Intellectual 435 points

Replace "myMsg" with "myMsg *" and that should get rid of those errors. "MessageQ_Msg" is actually an alias for "MessageQ_MsgHeader *", so everything works with pointers.

"No Cache" means that the cache is turned off completely. Obviously this makes performance very bad, but it means that reads and writes go directly to the memory location instead of the cache.

0 FeruzM over 13 years ago in reply to twentz

Intellectual 740 points

I have tried, please feel free to try:

typedef struct myMsg{

MessageQ_MsgHeader header;

Int payload;

}myMsg;

...

myMsg *msg;

MessageQ_Handle messageQ;

...

msg = MessageQ_alloc(HEAPID, sizeof(myMsg)); //returns MessageQ_MsgHeader *pointer in a sense code should work, but?!

...

status = MessageQ_put(remoteQueueId, msg);

...

status = MessageQ_get(messageQ, &msg, MessageQ_FOREVER);

I still get errors on these lines. That is the reason I asked the question in the first place; I think the structure is not working...

Thanks,

Feruz

0 twentz over 13 years ago in reply to FeruzM

Intellectual 435 points

status = MessageQ_put(remoteQueueId, (MessageQ_Msg) msg);

...

status = MessageQ_get(messageQ, (MessageQ_Msg *)&msg, MessageQ_FOREVER);
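The casts above rely on the message header being the first member of the user's structure, so a pointer to the whole message is also a valid pointer to its header. A host-side sketch of that idiom follows; `MsgHeader`, `Msg`, `mock_alloc`, `mock_put`, and `mock_get` are hypothetical stand-ins written for this example, not the real IPC types or functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the IPC types, so the sketch builds on a host. */
typedef struct { unsigned size; } MsgHeader;  /* plays the role of MessageQ_MsgHeader */
typedef MsgHeader *Msg;                       /* plays the role of MessageQ_Msg */

typedef struct {
    MsgHeader header;  /* must be the first member */
    int payload;
} MyMsg;

/* Mock allocator: returns a generic message pointer sized for the caller's struct. */
static Msg mock_alloc(unsigned size)
{
    Msg m = calloc(1, size);
    if (m)
        m->size = size;
    return m;
}

static Msg mailbox;  /* one-slot mock queue */

/* Mock "put": hand the generic pointer to the queue. */
static void mock_put(Msg m)
{
    mailbox = m;
}

/* Mock "get": return 0 and the stored pointer, or -1 if the slot is empty. */
static int mock_get(Msg *out)
{
    *out = mailbox;
    mailbox = NULL;
    return *out ? 0 : -1;
}
```

The sender casts its `MyMsg *` to `Msg` for the put, and the receiver passes `(Msg *)&rx` to the get, the same pattern as the two casts in the post.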

0 FeruzM over 13 years ago in reply to twentz

Intellectual 740 points

Hi twentz,

It has been a long time, but I have a small favor to ask!

When I run the code from your post above,

int *A;

switch(DNUM)

case 0: A = malloc(10*sizeof(int)); //sysmem is in L1SRAM of core 0

for(i=0; i<10;i++)

A[i] =i;

//printf("A[1]=%d\n",A[1]);

break;

case 1:

// You need to make sure that Core1 will execute only after Core0 has written to A (and then the values of A before printing)

// Core0's A points to some place in Core0's local alias

A = (int *)( 0x10000000 + (long unsigned int)A); // now Core1's A is a pointer to Core0's memory by using the global address for Core0's L1SRAM

//printf("A[1]=%d\n", A[1]);

the output looks like the following when I do it for nine elements:

[C66xx_0] 2 0 1 
[C66xx_0] 4 3 3 
[C66xx_0] 2 1 2 
[C66xx_0] A[0] = 2

[C66xx_1] Core #1
[C66xx_1] 
[C66xx_1] A = 
[C66xx_1] 32  749731840  32 
[C66xx_1] 32  749731840  32 
[C66xx_1] 32  749731840  32 
[C66xx_1] A[0] = 749731840

Somewhere while reading the pointer from memory or assigning it to pointer A on core 1 I get an error; I couldn't figure out what the problem might be.

Could you please help me on this?

Thank you.

Regards,
Feruz
