Problem with multicore shared data

admeltech

Other Parts Discussed in Thread: TMS320C6670, SYSBIOS

Hello, everyone.

I've been developing a FIR filter on the TMS320C6670 DSP for a while, and I need to increase the processing frequency as much as possible. To do so, I use DSPLIB for the filtering (it has good benchmarks) together with SYS/BIOS. My idea is to split the input data buffer into 4 segments and process each one on a different core. To avoid copying the data locally, the input and output buffers are placed in shared memory, which I allocate dynamically at "initialization" using the HeapBufMP module. Then I use a MessageQ to send the shared pointers from the "master" Core 0 to the other cores.

The problem occurs when the main loop begins. Previously, I tried to synchronize the four cores by sending a MessageQ message back to Core 0 when the filtering of a block finished; the master core then sent a new message with the new pointer to control the flow (the input buffers are "ping-ponged"). However, that introduced a very high latency (a round trip of messages, not counting the time to update the input buffer), 4 or 5 times longer than the filtering itself.

To correct that, I now send just one message at the beginning with the shared location of a variable (also allocated with HeapBufMP), called dataRep. The struct has an array named coreStates in which each core writes a flag, at its own position only, to indicate whether it is ready to process data; before processing, it polls all the flags until all the other cores are ready. This should work and remove the latency, but both in simulation and when running on the DSP, the program flow stops at the flag polling because the contents of the supposedly shared variable are different on each core, so they cannot sync.

I don't know the reason, but I guess it could be some kind of caching performed by the BIOS, because this error does not seem to happen with the processing data, which are longer buffers to be cached.

My questions, then, are:

a) Is there a way to prevent SYS/BIOS from caching the contents of a specific variable or memory section?

b) Is the synchronization method I'm using valid?

c) Is there a way to send messages using SYS/BIOS APIs with lower overhead?

Thanks for any help.

-- Adrian

Note: I attach my main .c file and the RTSC .cfg file.

message_multicore.zip

over 13 years ago

0 judahvang over 13 years ago

TI__Mastermind 32475 points

Adrian,

What memory have you defined to be shared across the cores? Is it DDR or MSMCRAM?

a. There is a way to prevent DDR memory from being cached: set the MAR attributes for the memory segment. This can be done with the ti/sysbios/family/c66/Cache module. The granularity, though, is 16MB. With MSMCRAM it's possible too, but it requires a bit more work: you would have to map the MSMCRAM into a different address space, because the default address space does not allow you to change the cacheability.

b. I suppose it's valid if it works. You could do a Cache_wbInv() of the single word to get your flag out of your cache if you can't bear the 16MB granularity.

c. You could try the Notify module instead of MessageQ. Notify allows a 32-bit payload, which should suffice for your case, so no message needs to be sent.
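To make the 16MB granularity concrete: each MAR register covers one 16MB-aligned window of the 32-bit address space, so the register that governs an address is selected by its top 8 bits. A minimal sketch in plain C, assuming nothing beyond that rule (the helper names are hypothetical and are not part of SYS/BIOS or CSL):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not SYS/BIOS or CSL APIs. */

/* Each MAR register controls the cacheability of one 16 MB window. */
#define MAR_WINDOW_SHIFT 24u   /* 2^24 bytes = 16 MB */

static_assert((UINT32_C(1) << MAR_WINDOW_SHIFT) == 16u * 1024u * 1024u,
              "a MAR window is 16 MB");

/* Index of the MAR register that covers a 32-bit logical address. */
static uint32_t mar_index(uint32_t addr)
{
    return addr >> MAR_WINDOW_SHIFT;
}

/* First address of the 16 MB window that contains addr. */
static uint32_t mar_window_base(uint32_t addr)
{
    return addr & ~((UINT32_C(1) << MAR_WINDOW_SHIFT) - 1u);
}
```

For example, mar_index(0x80000000) evaluates to 128. A flag placed anywhere inside a window shares its cacheability setting with everything else in the same 16MB, which is the cost of this approach.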

Judah

0 admeltech over 13 years ago in reply to judahvang

Prodigy 245 points

Hello, Judah.

I'm using the MSMCRAM. Excuse me, but could you be more specific about the process of mapping the MSMCRAM into a different address space? I don't know how I could do it: the first step in my project was to create a custom platform to map the memory, but it doesn't show any option related to multicore.

I will also try the other options until I get a better performance and I'll report if they are successful.

Thanks.

0 judahvang over 13 years ago in reply to admeltech

TI__Mastermind 32475 points

Adrian,

I have not done this myself but have seen others do it. It requires using the MPAX to create an alias for MSMCRAM and then turning off the cacheability of the new address space. Someone else might have to jump in here to correct it.

#include <ti/csl/csl_xmcAux.h>

#define SEGSZ_4MB       0x15
#define RADDR           0x00C000 /* Physical address of MSMC RAM */
#define MPAX_INDEX      3

static void initMPAX(void)
{
    CSL_XMC_XMPAXH mpaxh;
    CSL_XMC_XMPAXL mpaxl;

    CSL_XMC_getXMPAXL (MPAX_INDEX, &mpaxl);
    CSL_XMC_getXMPAXH (MPAX_INDEX, &mpaxh);

    mpaxh.segSize = SEGSZ_4MB;
    mpaxh.bAddr = OpenMP_nonCachedMsmcAlias >> 12;

    CSL_XMC_setXMPAXH (MPAX_INDEX, &mpaxh);

    mpaxl.ux = 1;
    mpaxl.uw = 1;
    mpaxl.ur = 1;
    mpaxl.sx = 1;
    mpaxl.sw = 1;
    mpaxl.sr = 1;
    mpaxl.rAddr = RADDR;

    CSL_XMC_setXMPAXL (MPAX_INDEX, &mpaxl);

}

Judah

0 admeltech over 13 years ago in reply to judahvang

Prodigy 245 points

Hello, Judah.

I tried to use the Cache_wbInv() function and the problem remains the same, I guess. I also tried just writing back the cache content with Cache_wb() and invalidating the cache with Cache_inv() right after each core receives the MessageQ message with the pointer to the shared variable.

I added the code segment you posted for the MSMCRAM MPAX, but I can't find any definition called "OpenMP_nonCachedMsmcAlias" in the CSL packages. What is that value?

0 judahvang over 13 years ago in reply to admeltech

TI__Mastermind 32475 points

Adrian,

You would need to align/pad your "flag" to 128 bytes because that's the cache line size.

You are doing Cache_wbInv() on the "flag" address, or whatever memory you were using to signal the remote core, right?

If Cache_wbInv() doesn't work, then do Cache_wb() for the core that did the write and Cache_inv() for the core that is doing the read.
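A rough sketch of that pattern, with one flag per 128-byte line, a write-back after the write, and an invalidate before the read. The struct layout, function names and the two FLAG_CACHE_* hooks are assumptions for illustration. On the DSP the hooks would wrap Cache_wb() and Cache_inv(); here they are no-ops so the sketch also builds on a host compiler. It further assumes the block itself is allocated on a 128-byte boundary.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 128u  /* cache line size quoted above */
#define NUM_CORES        4u

/* Hypothetical hooks. On the target they would map to something like
 *   Cache_wb(p, n, Cache_Type_ALL, TRUE)  and
 *   Cache_inv(p, n, Cache_Type_ALL, TRUE).
 * They expand to no-ops here so the sketch compiles anywhere. */
#ifndef FLAG_CACHE_WB
#define FLAG_CACHE_WB(p, n)  ((void)(p), (void)(n))
#endif
#ifndef FLAG_CACHE_INV
#define FLAG_CACHE_INV(p, n) ((void)(p), (void)(n))
#endif

/* One flag per cache line, so a write-back by one core can never
 * overwrite a neighbouring core's flag with a stale copy. */
typedef struct {
    volatile uint32_t ready;
    uint8_t pad[CACHE_LINE_BYTES - sizeof(uint32_t)];
} CoreFlag;

static_assert(sizeof(CoreFlag) == CACHE_LINE_BYTES,
              "each flag must fill exactly one cache line");

typedef struct {
    CoreFlag coreStates[NUM_CORES];
} DataRep;

/* Writer side: store the flag, then push the line out to shared memory. */
static void flag_publish(DataRep *rep, unsigned core, uint32_t value)
{
    rep->coreStates[core].ready = value;
    FLAG_CACHE_WB(&rep->coreStates[core], sizeof(CoreFlag));
}

/* Reader side: drop any cached copy, then read from shared memory. */
static uint32_t flag_read(DataRep *rep, unsigned core)
{
    FLAG_CACHE_INV(&rep->coreStates[core], sizeof(CoreFlag));
    return rep->coreStates[core].ready;
}
```

Each core would call flag_publish() only on its own slot and flag_read() on the slots of the others. Without the padding, several flags share one line and a write-back from one core can clobber what another core has just written.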

MessageQ should not have a cache coherency problem. This is taken care of by IPC: we ensure the cache coherency of the message when it is passed to the remote core with MessageQ.

I found that OpenMP_nonCachedMsmcAlias has the value 0xa0000000.
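To show what those numbers mean together, here is a simplified host-side model of a single MPAX segment: bAddr and rAddr are the logical and physical base addresses shifted right by 12, and a segSize code of n spans 2^(n+1) bytes. The struct and function names are hypothetical and this is not CSL code. It also assumes the logical base is aligned to the segment size, which is the case for the values in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one MPAX segment; the field names mirror the
 * CSL structures in the code above, but this is not CSL code. */
typedef struct {
    uint32_t bAddr;    /* logical base address >> 12 */
    uint32_t rAddr;    /* physical (36-bit) base address >> 12 */
    uint32_t segSize;  /* size code: segment spans 2^(segSize + 1) bytes */
} MpaxSegment;

/* Size of the segment in bytes, decoded from the size code. */
static uint64_t mpax_segment_bytes(const MpaxSegment *s)
{
    return UINT64_C(1) << (s->segSize + 1u);
}

/* Returns 1 and stores the physical address if `logical` lies inside
 * the segment; returns 0 otherwise. */
static int mpax_translate(const MpaxSegment *s, uint32_t logical,
                          uint64_t *physical)
{
    uint64_t base = (uint64_t)s->bAddr << 12;
    uint64_t off;

    if (logical < base) {
        return 0;
    }
    off = (uint64_t)logical - base;
    if (off >= mpax_segment_bytes(s)) {
        return 0;
    }
    *physical = ((uint64_t)s->rAddr << 12) + off;
    return 1;
}
```

With bAddr = 0xa0000000 >> 12, rAddr = 0x00C000 and segSize = 0x15, the segment is 4MB long and logical address 0xa0000100 maps to physical 0x0C000100 in MSMC RAM. The alias is only uncached once caching is also turned off for the 16MB window that contains 0xa0000000.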

Judah

0 admeltech over 13 years ago in reply to judahvang

Prodigy 245 points

Hello, Judah.

It has been a long time since I finished the filter I was trying to design. I tested your suggested answer but didn't obtain the results I was expecting.

I just wanted to briefly explain the method I actually used:

1) When I created my custom platform, I disabled all the cache memories. I tried to disable only the cache for the MSMCRAM with the MPAX configuration you gave me, but I didn't notice any change.

2) I created a shared memory section in the same way as I had from the beginning, and then allocated space for an array of Boolean flags.

3) Using the Notify module of IPC, I transmitted the shared memory pointer from the "master" core 0 to the remaining cores.

4) Inside the main loop, every core writes its state at its own array index (its core number) and the master core polls the flags until all of them are in the same expected state.
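Step 4 can be sketched roughly as follows. This is not the code from the attached project: the type and function names are hypothetical, and the sketch assumes the flags live in shared memory that is not cached, as in step 1, so no cache maintenance calls appear.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CORES 4u

/* Hypothetical layout: one state word per core, indexed by core number,
 * placed in shared, non-cached memory. */
typedef struct {
    volatile uint32_t coreStates[NUM_CORES];
} SyncFlags;

/* Called by every core: publish the state it has reached in its own slot. */
static void sync_set_state(SyncFlags *f, unsigned core, uint32_t state)
{
    f->coreStates[core] = state;
}

/* One polling pass: returns 1 if every core reports `expected`. */
static int sync_all_in_state(const SyncFlags *f, uint32_t expected)
{
    unsigned i;

    for (i = 0; i < NUM_CORES; i++) {
        if (f->coreStates[i] != expected) {
            return 0;
        }
    }
    return 1;
}

/* Master side: spin until all cores have reached `expected`. */
static void sync_wait_all(const SyncFlags *f, uint32_t expected)
{
    while (!sync_all_in_state(f, expected)) {
        /* busy-wait; volatile forces a fresh read on every pass */
    }
}
```

Because each core writes only its own slot, no locking is needed; the master only reads. The state can be a Boolean as described above or any small code that distinguishes "ready" from "busy".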

As you can see, I used exactly the method I had proposed before. What makes everything work is forcing the cache to be disabled. When I measured the filter performance with the Code Composer Studio "Clock" tool, I also noticed that the cache-disabled version was even faster, which is not supposed to happen.

Thanks for your help. I hope my idea is helpful to someone else.

--Adrian
