Cache coherency problem with EVMC6678

Maciej Fajfer

I have a cache coherency issue. If I change a variable in shared RAM on one core, it isn't updated on another core, or it is updated only once. Currently I'm using only four cores. I have a separate image for each core, with the same cfg file and memory map.

My buffers declarations:

float (*A_ES1)[6][6] = (float (*)[6][6]) 0x0C200000; //for CORE1
float (*A_ES2)[6][6] = (float (*)[6][6]) 0x0C200100; //for CORE2
float (*A_ES3)[6][6] = (float (*)[6][6]) 0x0C200200; //for CORE3

float (*B_ES1)[6]    = (float (*)[6])    0x0C200300; //for CORE1
float (*B_ES2)[6]    = (float (*)[6])    0x0C200320; //for CORE2
float (*B_ES3)[6]    = (float (*)[6])    0x0C200340; //for CORE3

float (*Vs_1)[3]     = (float (*)[3])    0x0C200360; //for CORE1
float (*Vs_2)[3]     = (float (*)[3])    0x0C200370; //for CORE2
float (*Vs_3)[3]     = (float (*)[3])    0x0C200380; //for CORE3

A_ES1...A_ES3 and B_ES1...B_ES3 are IN buffers on CORE0, but IN/OUT buffers on CORE1...3.

Vs_1 is an OUT buffer on CORE0, but an IN buffer on CORE1...3.

Here is code for CORE0:

CACHE_wbInvL1d (&A_ES1, 256, CACHE_FENCE_WAIT);
CACHE_wbInvL1d (&A_ES2, 256, CACHE_FENCE_WAIT);
CACHE_wbInvL1d (&A_ES3, 256, CACHE_FENCE_WAIT);

//do something with A_ES1...3 buffers - only read operation

CACHE_wbInvL1d (&B_ES1, 32, CACHE_FENCE_WAIT);
CACHE_wbInvL1d (&B_ES2, 32, CACHE_FENCE_WAIT);
CACHE_wbInvL1d (&B_ES3, 32, CACHE_FENCE_WAIT);

//do something with B_ES1...3 buffers - only read operation

... 

CACHE_invL1d (&Vs_1, 16, CACHE_FENCE_WAIT); 
CACHE_invL1d (&Vs_2, 16, CACHE_FENCE_WAIT); 
CACHE_invL1d (&Vs_3, 16, CACHE_FENCE_WAIT); 

//do something with buffers - only write operation 

CACHE_wbL1d (&Vs_1, 16, CACHE_FENCE_WAIT); 
CACHE_wbL1d (&Vs_2, 16, CACHE_FENCE_WAIT); 
CACHE_wbL1d (&Vs_3, 16, CACHE_FENCE_WAIT); 

...

And similar code for other cores:

...

CACHE_wbInvL1d (&A_ES1, 256, CACHE_FENCE_WAIT);
CACHE_wbInvL1d (&B_ES1, 32, CACHE_FENCE_WAIT);

//do something with buffers A_ES1 and A_ES2 - read and write operation

CACHE_wbL1d (&A_ES1, 256, CACHE_FENCE_WAIT);
CACHE_wbL1d (&B_ES1, 32, CACHE_FENCE_WAIT);

...

CACHE_wbInvL1d (&Vs_1, 16, CACHE_FENCE_WAIT);

//do something with buffers Vs_1 - read and write operation

...

I'm using BIOS and IPC Notify for synchronization, but I take the cache functions from CSL because they are probably faster than the BIOS ones.

over 11 years ago

0 Chad Courtney over 11 years ago

TI__Mastermind 30825 points

It looks like you may be running into the issue described in Advisory 33 of the TMS320C6670 Errata. The workaround is to disable prefetching for the MAR range 0x0C00 0000 - 0x0F00 0000 (i.e., what is effectively the MSMC range).

You can do this by setting the PFX bits of MARs 12-15 to 0 (they are 1 by default). This disables prefetching for those addresses.

Best Regards,

Chad

0 Michael P over 11 years ago

Expert 1810 points

Cache operations expect the first argument to be a pointer to the range they should operate on. A_ES1 (for example) is already a pointer, so when you write CACHE_wbInvL1d(&A_ES1, ...), it writes back and invalidates the block of memory that contains the pointer called A_ES1 -- not the region in shared RAM that you probably expect.

Does removing the & operators from the CACHE_* functions' arguments resolve your problem?

0 Maciej Fajfer over 11 years ago in reply to Michael P

Intellectual 330 points

Hi Michael and Chad,

I removed the & operators from the cache function calls, but it still doesn't work. I also tried setting the PFX bits of MARs 12-15 to 0. That doesn't solve my problem either. My MAR definitions:

#define MAR12 *( volatile Uint32* )( 0x01848030 )
#define MAR13 *( volatile Uint32* )( 0x01848034 )
#define MAR14 *( volatile Uint32* )( 0x01848038 )
#define MAR15 *( volatile Uint32* )( 0x0184803C )

MAR settings:

MAR12 &= 0x00000007;
MAR13 &= 0x00000007;
MAR14 &= 0x00000007;
MAR15 &= 0x00000007;

I put this in the CORE0 code. How can I solve my problem?

0 Maciej Fajfer over 11 years ago in reply to Maciej Fajfer

Intellectual 330 points

Hi,

I solved it. Something was wrong with the WB/INV operations and with the & operator. Another problem was inside my algorithm. I have some additional questions:

1. What happens if two cores read the same address in shared memory? Obviously it's a conflict in the write case, but what happens in the read case?

2. I have a question about buffer sizes in shared memory. Can I use any size? Or does it depend on the cache line size (128B)?

0 Chad Courtney over 11 years ago in reply to Maciej Fajfer

TI__Mastermind 30825 points

1.) There's no conflict when both are reading; operations proceed normally. If neither has the data cached, both will cache the values read. One core will stall for a cycle if both attempt to access on the same cycle.

2.) Your code's buffer sizes have no bearing on cache line sizes. You can make them as large as you want so long as they fit within the memory. Aligning them to specific access boundaries will help the cache performance a little bit. And aligning them to cache line start boundaries may help prevent cache thrashing if you have buffers that you're going to be modifying right next to each other. I'd suggest reading the C66x Cache User Guide for details.

0 Michael P over 11 years ago in reply to Maciej Fajfer

Expert 1810 points

One expansion on what Chad said about buffer sizes:

For any given cache line, you should provide mutual exclusion between any cores writing to that cache line. For example, if you have two 64-byte structures that are on the same cache line, it is unsafe for core 0 to write to one of the structures and core 1 to write to the other structure. If they do that, they could both read (and cache) the line, modify their cached version, and write their cached copies back -- in which case one core's changes would be "reverted" by the other core.

MCSDK examples round a lot of structures up to a multiple of 128 bytes for this reason. You could also use the same hardware semaphore for all structures that share a cache line, although there you have to do additional bookkeeping to figure out how to select the "right" semaphore for each structure (without over-serializing your application), or do something more application-specific to guarantee the right mutual exclusion.

0 Maciej Fajfer over 11 years ago in reply to Michael P

Intellectual 330 points

Hi Chad and Michael,

Thank you for your help. I will round my buffers in shared memory up to a multiple of 128 bytes.

Regards,

Maciej

0 Chad Courtney over 11 years ago in reply to Maciej Fajfer

TI__Mastermind 30825 points

Maciej,

It's better to align them to 128B address boundaries; if you do this, the buffers won't overlap within a cache line. Better yet, if you have cores accessing memory that only that core writes back to, then separate it into its own space in MSMC.

Best Regards,

Chad

0 Maciej Fajfer over 11 years ago in reply to Chad Courtney

Intellectual 330 points

Hi,

I have another problem with cache coherency. In my current project I don't use BIOS, so my code is based on CSL. In this new project I have some buffers in shared memory:

double (*AS1)[10][10] = (double (*)[10][10]) 0x0C200000;
double (*BS1)[10]     = (double (*)[10])     0x0C200320;

double (*AS2)[10][10] = (double (*)[10][10]) 0x0C200380;
double (*BS2)[10]     = (double (*)[10])     0x0C2006A0;

double (*AS3)[10][10] = (double (*)[10][10]) 0x0C200700;
double (*BS3)[10]     = (double (*)[10])     0x0C200A20;

double (*AS4)[10][10] = (double (*)[10][10]) 0x0C200A80;
double (*BS4)[10]     = (double (*)[10])     0x0C200DA0;

double (*AS5)[10][10] = (double (*)[10][10]) 0x0C200E00;
double (*BS5)[10]     = (double (*)[10])     0x0C201120;

double (*AS6)[10][10] = (double (*)[10][10]) 0x0C201180;
double (*BS6)[10]     = (double (*)[10])     0x0C2014A0;

double (*AS7)[10][10] = (double (*)[10][10]) 0x0C201500;
double (*BS7)[10]     = (double (*)[10])     0x0C201820;

double (*VS)[10]      = (double (*)[10])     0x0C201880;

Everything is aligned to the cache line size. I think I have a problem with the VS buffer. It's an output buffer on Core 0 and an input buffer for Cores 1...7. My code on Core 0:

CACHE_invL1d (VS, 128, CACHE_FENCE_WAIT);
//writing VS buffer
CACHE_wbL1d (VS, 128, CACHE_FENCE_WAIT);

My code on Cores 1...7:

CACHE_wbInvL1d (VS, 128, CACHE_FENCE_WAIT);
//reading VS

On Cores 1...7 I get incorrect values in the VS buffer, but they are not zero. I don't understand it. In my previous project the same cache coherency management worked correctly. Can you help me with this issue?

0 Michael P over 11 years ago in reply to Maciej Fajfer

Expert 1810 points

Maciej,

I suspect the problem is using CACHE_wbInvL1d() rather than CACHE_invL1d() on cores 1 through 7. If their copies of the cache lines containing VS are marked as dirty, this will make them all try to write their copies of VS back to the global memory, and that would probably overwrite whatever core 0 wrote earlier.

Alternatively, L2 cache might be enabled, and contain those lines. If L2 cache is enabled, you should use CACHE_wbInvL2() rather than CACHE_wbInvL1d() (and similarly for CACHE_inv*() and CACHE_wb*()). These program-initiated L2 operations also affect the corresponding lines in L1P and L1D, but -- if I understand the documentation correctly -- the L1D cache operations do not affect the L2 cache.

Less likely is that cores 1 through 7 are running their code before core 0 finishes its writeback. However, I suspect you have serialization code that will make them run in the order you expect.

If you see similar problems for the ASn arrays, another thing to confirm is that the invalidate/writeback size (128 bytes in the code snippets you pasted) is large enough -- for double[10][10] arrays, it should be 800. Otherwise only the first entries would be invalidated or written back.

[Edited to clarify that cache lines are what get marked as dirty, not the array itself, which occupies only part of one line of L2 cache and one line plus a fraction in L1D cache, and also to mention what would happen if L2 cache is enabled.]

Michael

0 Maciej Fajfer over 11 years ago in reply to Michael P

Intellectual 330 points

Michael,

Thank you for your response. I don't know whether the L2 cache is enabled or disabled. I use only the default cache settings. How can I enable the L2 cache?

I checked physical memory using the memory browser in CCS. It contains the correct values of the VS buffer after CACHE_wbL1d() on core 0. After CACHE_wbInvL1d() on core 1 (and probably the others), the values of the VS buffer in physical memory are incorrect. The question is why CACHE_wbInvL1d() on core 1 corrupts the VS buffer in physical memory. How can I fix it? Another issue concerns the nature of my algorithm: Cores 1...7 can perform the CACHE_wbInvL1d() operation at the same time. Is that a problem?

I changed all the CSL cache functions to their L2 versions (for example, CACHE_wbInvL1d() to CACHE_wbInvL2()), but it still doesn't work.

Of course I use IPC interrupts, based on the IPCGRx registers, for core synchronization. That works well after several hours of debugging.

I think I need to find out how to disable caching for MSMC memory. Obviously the cache operations take some time, but for now it's more important to solve the current problem.

0 Michael P over 11 years ago in reply to Maciej Fajfer

Expert 1810 points

Maciej,

Whether L2 cache is enabled by default depends on the platform. I think most platforms enable it by default (in the boot loader or standard pre-main() initialization code). Without SYS/BIOS to provide an ROV display of the cache's mode, you will probably need to look at the control registers directly. Section 3.4 of SPRUGW0 describes the registers; they exist inside each core.

For your use case, cores 1 through 7 should use CACHE_invL1d() instead of CACHE_wbInvL1d(). The general rule of thumb is to write-back only after modifying data, and to invalidate before reading. You can combine write-back operations to the same memory region, and you can skip invalidates if you can prove that nothing else wrote to the memory since your last read, but those are performance optimizations and should usually be done after your code is working properly.

When you check the contents using the memory browser, it is useful to pay attention to which data are cached on different cores. CCS's memory browser indicates this by changing the background color -- one color if the data is in L1D, another if it is in L2 but not in L1D, and plain white if it is in neither. You can uncheck the boxes for L1D and L2 inside the memory browser to disable the display of cached data. This will make the background color change for cached data, and will show different values if the cache's copy of the data is either stale or dirty. This is a powerful, but easily overlooked, feature of CCS to help debug cache consistency issues.

Michael

0 Maciej Fajfer over 11 years ago in reply to Michael P

Intellectual 330 points

Hi Michael,

Currently I use CACHE_invL1d() before reading and CACHE_wbL1d() after writing on all cores. I also investigated my application and found a bug in the function that processes the VS buffer on cores 1 through 7. That function wrote something to the VS buffer instead of performing a read-only operation. Question: is it possible to corrupt the VS buffer in physical memory without a CACHE_wbL1d() operation after writing?

Currently everything looks good. The question about disabling caching for MSMC is still open. I think I have to do the following steps:

1. Map the physical memory address to a logical address using the MPAX registers.

2. Disable caching for the specific region using the MARx registers. The MARs are writable and readable only in supervisor mode. I think all cores are in supervisor mode by default.

Is this correct?
