
Cache coherence problems in IPC and MCSDK

Other Parts Discussed in Thread: SYSBIOS

Hello all,

I'm working on a multicore project for the C6472. It uses the NDK, SYS/BIOS, IPC, etc.

I have run into several issues that, as I now see, are due to cache coherence problems.

The first issue is related to networking: from time to time I received corrupted data on a UDP socket. I realized that this happens because of cache coherence problems in nimu_eth.lib (I use mcsdk_1_00_00_08). Because my project is rather big, I had to place .far:NDK_PACKETMEM in external memory. There is some "support" for external memory in nimu_eth.c, but it's buggy. First, it decides whether a given address can be cached using the check (nAddress & EMAC_EXTMEM). This check does not take SL2 memory into account, even though there is no hardware cache coherence for that region either. So I replaced the check with the following function (for the C6472):

static int MaybeCacheable(UINT8* pAddress)
{
	UINT32 nAddress = (UINT32)pAddress;
	if(nAddress & EMAC_EXTMEM) return TRUE; /* DDR (external memory) */
	/* SL2 on C6472: 0x00200000 - 0x002BFFFF (768 KB), also cacheable */
	return (nAddress >= 0x200000) && (nAddress < 0x2C0000);
}

The other issue in nimu_eth.c is in EmacRxPkt. It calls OEMCacheClean at the end of the function, but there are two problems:

1) It uses that memory before the cache invalidation (see the line protocol = (pBuffer[12] << 8) | pBuffer[13];)

2) OEMCacheClean does a Cache_wbInv, but we need Cache_inv here. The writeback corrupts the data received from the EMAC.

So I added OEMCacheInv, a wrapper around Cache_inv, and call it instead of OEMCacheClean at the beginning of the function, before the pBuffer pointer is used.

The second issue is related to MessageQ (I use ipc_3_40_01_08). I use several SharedRegions in SL2 as the source of heaps for MessageQ. I ran into a cache coherence problem when I tried MessageQ.SetupTransportProxy = xdc.useModule('ti.sdo.ipc.transports.TransportShmNotifySetup');

The default transport (TransportShm) didn't cause cache coherence problems, but that happened only by chance. Here is my explanation:

A transport has two sides, sender and receiver, running on different cores (with separate caches). On the receiver's side the transport waits for notifications and handles them to fetch the messages sent from the sender's side. These messages are located in shared memory (typically with cache enabled), but the receiver does nothing to invalidate the message header in its notification handler. It does call Cache_inv later in MessageQ_put, but that is too late, because it calls MessageQ_getDstQueue(msg) before MessageQ_put to get the queue id, and msg has not been invalidated at that point. So my fix for this issue was to add code like

id = SharedRegion_getId(msg);

/* Assert that the region is valid */
Assert_isTrue(id != SharedRegion_INVALIDREGIONID,
        ti_sdo_ipc_Ipc_A_addrNotInSharedRegion);

/* invalidate the message header before using it */
if (SharedRegion_isCacheEnabled(id)) {
    Cache_inv(msg, sizeof(*msg), Cache_Type_ALL, TRUE);
}

before the call to MessageQ_getDstQueue(msg).

This change fixed the cache coherence problems I had observed.

Finally, why didn't TransportShm cause cache problems? This transport uses a ListMP to store the messages it passes from sender to receiver. A ListMP element is really a message header (the first two fields of the message header are used by ListMP as links to the neighboring elements). ListMP does a Cache_inv when you take an element from it. That invalidation is meant to cover only these two fields, but by luck the whole message header is smaller than one cache line, so all of it gets invalidated.

So, my question is: are TI's engineers aware of these issues?

  • Thanks Anatoliy

    I wonder: if you disable the cache entirely and use the original code, do you still see the same errors?

    Thanks

    Ran
  • Hello Ran,
    I have not yet tried running my code (with the original IPC and nimu_eth.lib) with the L1D cache set to 0K. Currently I'm very busy porting another part of my project to an OMAP board, and the deadline for that project is almost here.
    I'll try running the original code with the L1D cache off as soon as I get a chance.
    However, I did try running my code with the L2 cache off. It made the nimu_eth.lib issue much less frequent, which is to be expected. Of course, the same configuration didn't help much with IPC (where SL2 is used).
    Anyway, if you have any questions regarding my changes, assumptions, or anything else, you're welcome to ask.
  • Hello All,

    I made a mistake in my initial post. MessageQ_put does not call Cache_inv on the receiver's side, because the queue is local at that point.

    So invalidating the cache for the message header alone is not enough; one should invalidate the whole message, and the correct fix looks like this:

    if (SharedRegion_isCacheEnabled(id)) {
        /* invalidate the header first, so msg->msgSize is read coherently */
        Cache_inv(msg, sizeof(*msg), Cache_Type_ALL, TRUE);
        /* then invalidate the whole message, using the now-valid size */
        Cache_inv(msg, msg->msgSize, Cache_Type_ALL, TRUE);
    }

    To Ran: I still haven't had a chance to try IPC with the L1D cache off. However, I can confirm that the data corruption was caused by an incoherent cache: I observed incorrect data on the receiver's side of the message queue become correct after a call to Cache_inv.

  • Hi Anatoliy

    I still maintain that the best way to deal with bugs when you suspect the cache is the problem is simply to disable the cache and see what happens.

    It might add time to your other calculations (so be aware of timeouts and the like), but at least you will know whether this is a cache problem or not.

    Ran