Invalidate Cache, Errata and Workarounds

Benjamin Meyer

Hello,

We designed a custom board with two C6678 DSPs. Image data is sent from an FPGA to the two DSPs over SRIO. A doorbell signals the end of the transmission, and after that I invalidate the image buffer. Today I found out that the invalidation does not work correctly.

Last year I wrote a few functions for cache invalidation / writeback / writeback-and-invalidation. These functions should take care of Silicon Errata advisories 7 and 22. Here is the cache invalidation function I was using:

void Cache_Invalidate(void* blockPtr, Uint32 byteCnt)
{
    unsigned int key;

    byteCnt = (byteCnt + 127) & 0xFFFFFF80; // byteCnt must be multiple of 128

    key = _disable_interrupts();
    CSL_XMC_invalidatePrefetchBuffer(); // Cleanup the prefetch buffer also.
    ASM_16X_NOP();
    CACHE_invL1d(blockPtr, byteCnt, CACHE_FENCE_WAIT);
    ASM_16X_NOP();
    CACHE_invL2(blockPtr, byteCnt, CACHE_FENCE_WAIT);
    ASM_16X_NOP();
    _restore_interrupts(key);
}

Maybe there are a few unnecessary NOPs (the macro ASM_16X_NOP() executes asm(" NOP "); 16 times). I verified that code in the past, but back then I was using silicon revision 1.0.

Today I am using silicon revision 2.0 and I see old image data in the buffer, because the cache invalidation did not work correctly. So I had another look at the Silicon Errata and found advisory 27. Since I am using mfence, I modified my code to:
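A note on the rounding line in the function above: it rounds only the byte count up to a multiple of 128, while the start address is passed through unchanged. The following is a minimal host-side sketch of rounding both ends of a span to whole cache lines, which makes it easy to see exactly which lines a block operation would touch. It assumes a 128-byte line (the L2 line size on C66x; L1D lines are 64 bytes). The names CacheSpan and Cache_AlignSpan are hypothetical and not part of the CSL, and nothing here is required by the errata.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 128-byte L2 cache line, as on C66x. */
#define CACHE_LINE_BYTES 128u

/* Hypothetical helper type: a span that starts and ends on line boundaries. */
typedef struct {
    uintptr_t start;   /* first byte covered, rounded down to a line boundary */
    uint32_t  byteCnt; /* bytes covered, always a multiple of CACHE_LINE_BYTES */
} CacheSpan;

/* Round [addr, addr + byteCnt) outwards to whole cache lines. */
static CacheSpan Cache_AlignSpan(uintptr_t addr, uint32_t byteCnt)
{
    CacheSpan span;
    const uintptr_t mask = (uintptr_t)CACHE_LINE_BYTES - 1u;
    uintptr_t end = addr + byteCnt;

    /* The masking below only works for a power-of-two line size. */
    assert((CACHE_LINE_BYTES & (CACHE_LINE_BYTES - 1u)) == 0u);

    span.start = addr & ~mask;           /* round start down */
    end = (end + mask) & ~mask;          /* round end up */
    span.byteCnt = (uint32_t)(end - span.start);
    return span;
}
```

For a buffer that is already line-aligned (as with a 256-byte aligned image buffer), this reduces to the byte-count rounding used in the posted function.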

void Cache_Invalidate(void* blockPtr, Uint32 byteCnt)
{
    unsigned int key;

    byteCnt = (byteCnt + 127) & 0xFFFFFF80; // byteCnt must be multiple of 128

    key = _disable_interrupts();
    CSL_XMC_invalidatePrefetchBuffer(); // Cleanup the prefetch buffer also.
    ASM_16X_NOP();
    CACHE_invL1d(blockPtr, byteCnt, CACHE_FENCE_WAIT);
    _mfence();
    ASM_16X_NOP();
    CACHE_invL2(blockPtr, byteCnt, CACHE_FENCE_WAIT);
    _mfence();
    ASM_16X_NOP();
    _restore_interrupts(key);
}

But cache invalidation is still not working. Finally I was able to fix the problem by using my own Cache_WriteBackInvalidate(). It even works without taking care of advisory 27. Here is the function Cache_WriteBackInvalidate() I am using now:

void Cache_WriteBackInvalidate(void* blockPtr, Uint32 byteCnt)
{
    unsigned int key;

    byteCnt = (byteCnt + 127) & 0xFFFFFF80; // byteCnt must be multiple of 128

    key = _disable_interrupts();
    CACHE_wbInvL1d(blockPtr, byteCnt, CACHE_FENCE_WAIT);
    ASM_16X_NOP();
    CACHE_wbInvL2(blockPtr, byteCnt, CACHE_FENCE_WAIT);
    ASM_16X_NOP();
    _restore_interrupts(key);
}

It should not be harmful to use this function, because I never write to this buffer; it is only read. So the writeback portion has no effect. However, I want to know why it behaves differently from my Cache_Invalidate().

Am I still doing something wrong?

One post which helped a lot was:
http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/253690

Thanks in advance
Benny

over 10 years ago

0 Ganapathi Dhandapani95 over 10 years ago

TI__Mastermind 28085 points

Hi,

Take a look at the thread below for clarification of the C6678 Silicon Errata advisory 27 workaround:
e2e.ti.com/.../1012840

Thanks,

0 Benjamin Meyer over 10 years ago in reply to Ganapathi Dhandapani95

Intellectual 265 points

Hi Ganapathi,

thanks for the link.

This thread only says that the NOPs can be replaced with a second mfence. My revised code uses a second mfence and still keeps the NOPs, so I assume my code is simply a little slower.

Regards
Benny

0 Benjamin Meyer over 10 years ago

Intellectual 265 points

I have a small update on my cache issue. It is not true that the function Cache_WriteBackInvalidate() works correctly every time, even after I added the prefetch buffer invalidation and the double mfence as in Cache_Invalidate(). So it works a little better than Cache_Invalidate(), but not reliably. At the moment it looks like this:

void Cache_Invalidate(void* blockPtr, Uint32 byteCnt)
{
    unsigned int key;

    byteCnt = (byteCnt + 127) & 0xFFFFFF80; // byteCnt must be multiple of 128

    key = _disable_interrupts();
    CSL_XMC_invalidatePrefetchBuffer(); // Cleanup the prefetch buffer also.
    ASM_16X_NOP();
    CACHE_invL1d(blockPtr, byteCnt, CACHE_FENCE_WAIT);
    _mfence();
    ASM_16X_NOP();
    CACHE_invL2(blockPtr, byteCnt, CACHE_FENCE_WAIT);
    _mfence();
    ASM_16X_NOP();
    _restore_interrupts(key);
}

What I can say at the moment, after a lot of testing, is that a global cache coherence function like Cache_WriteBackInvalidateAll() works reliably:

void Cache_WriteBackInvalidateAll()
{
    unsigned int key;

    key = _disable_interrupts();
    CSL_XMC_invalidatePrefetchBuffer(); // Cleanup the prefetch buffer also.
    ASM_16X_NOP();
    CACHE_wbInvAllL1d(CACHE_FENCE_WAIT);
    _mfence();
    ASM_16X_NOP();
    CACHE_wbInvAllL2(CACHE_FENCE_WAIT);
    _mfence();
    ASM_16X_NOP();
    _restore_interrupts(key);
}

Before starting to process image data I can execute my function Cache_WriteBackInvalidateAll(), but I don't want to do that every time I would otherwise use the block coherence versions Cache_Invalidate() or Cache_WriteBackInvalidate(). I use several multicore shared variables which I need to synchronize between cores with writeback and invalidate. For those I cannot use the global versions, since that would invalidate all image data in the cache and processing would slow down dramatically.

Thanks in advance
Benny

0 Andy Polyakov over 10 years ago in reply to Benjamin Meyer

Expert 1340 points

What makes me feel uneasy in this context is: how does one ensure that the workaround conditions are met by compiler-generated code? Especially if the generated code is affected by optimization options? For example, advisory 27 talks about two MFENCEs, but specifically back to back. The advisories talk about ensuring inactivity of the memory interface, but can an L2 miss on behalf of the suggested code be a problem? They do mention "read or write allocate for some address" as part of the drama. Bottom line is that I wouldn't trust the compiler, but would implement such a subroutine in assembly to ensure that the critical instruction sequence is contained within a single cache line, so that it can be executed in a guaranteed cache-conflict-free manner.

BTW, they also speak of an L1D write miss as the first step that can trigger corruption, but then wouldn't an mfence prior to the block invalidation be appropriate (or even solve the problem)?

0 Aditya over 10 years ago in reply to Benjamin Meyer

TI__Expert 6815 points

Benny,

Sorry for the delay in our response. It is strange that the double MFENCE workaround does not work and you have to perform a global WbInv.

A few questions for my information:

1) If I understand correctly, CACHE_FENCE_WAIT + 16 NOPs did not pose any problems for silicon rev 1.0?

2) Have you tried disabling cache completely (L1 and L2) and see if the code works reliably without any cache dependencies?

3) What is your bytecnt?

4) Can you share the disassembly dump between interrupt disable and restore for the reliable (WbInvAll) and unreliable (block Inv) blocks of code?

5) Do you have some testcode that can demonstrate the problem? We can try to reproduce internally.

Yes, global invalidate will have a negative performance impact. Let's see if we can get to the bottom of this.

0 Aditya over 10 years ago in reply to Aditya

TI__Expert 6815 points

One more question, about the dataflow --> where does the SRIO drop the image buffer (DDR3?). I assume the image buffer needs to be read into the cache by the DSP for some processing? And before this happens, you want to invalidate the buffer data from the previous SRIO transaction? Please let me know if I have understood your use case correctly. Thanks.

0 Aditya over 10 years ago in reply to Benjamin Meyer

TI__Expert 6815 points

Benny,
Looking at the code once more, is there a reason you are invalidating both L1D and L2? Invalidating L2 also invalidates L1D in hardware, so you could just invalidate L2.

0 Benjamin Meyer over 10 years ago in reply to Andy Polyakov

Intellectual 265 points

Hello Andy,
thank you for your reply. I also had a look at the assembler output of the compiler and it looked good (MFENCE back to back).
Benny

0 Benjamin Meyer over 10 years ago in reply to Aditya

Intellectual 265 points

Hello Aditya,

I'll try to give you answers for all your questions.

Aditya said:

1) If I understand correctly, CACHE_FENCE_WAIT + 16 NOPs did not pose any problems for silicon rev 1.0?

I cannot definitely say yes, because our software was much older back then and I no longer have the chance to test with the old hardware. The only thing I can say is that I didn't see such problems with rev 1.0. Maybe they would have arisen sooner or later.

Aditya said:

2) Have you tried disabling cache completely (L1 and L2) and see if the code works reliably without any cache dependencies?

Yes, it works reliably (and dramatically slower) if the cache is completely disabled.

Aditya said:

3) What is your bytecnt?

The bytecnt is quite large, because we can potentially receive quite large images. The image buffer should be 146 MB, aligned to 256 bytes.

Aditya said:

4) Can you share the disassembly dump between interrupt disable and restore for the reliable (WbInvAll) and unreliable (block Inv) blocks of code?

I'll get back to you later and share disassembly with you.

Aditya said:

5) Do you have some testcode that can demonstrate the problem? We can try to reproduce internally.

Puh, that will be difficult, because the software is quite complicated by now and I would have to strip it down a lot. I'll see what I can do.

Benny

0 Benjamin Meyer over 10 years ago in reply to Aditya

Intellectual 265 points

Aditya said:
Benny,
Looking at the code once more, is there a reason you are invalidating both L1D and L2? Invalidating L2 also invalidates L1D in hardware, so you could just invalidate L2.

The code grew a lot because of the problem, and I tried everything possible. The original version only invalidated L2, but I can test once more if you want.

Regards

Benny

0 Benjamin Meyer over 10 years ago in reply to Aditya

Intellectual 265 points

Aditya said:

One more question, about the dataflow --> where does the SRIO drop the image buffer (DDR3?). I assume the image buffer needs to be read into the cache by the DSP for some processing? And before this happens, you want to invalidate the buffer data from the previous SRIO transaction? Please let me know if I have understood your use case correctly. Thanks.

Hello Aditya,

your understanding is absolutely correct.

The image data goes via SRIO directly into an input image buffer in DDR3. We are using SWRITE packets over SRIO. The DSP processes the image and writes the result into a second, output image buffer, which is sent back via SRIO. Before processing the image, we have to invalidate the input buffer. (And before sending the output buffer, we have to do a write-back.)

Thanks a lot for your help

Benny

0 Andy Polyakov over 10 years ago in reply to Benjamin Meyer

Expert 1340 points

Benjamin Meyer said:
I also had a look at the assembler output of the compiler and it looked good (MFENCE back to back).

Note that the original formulation included "for example", i.e. whether or not a pair of MFENCEs is generated back to back was simply the example that was easiest to illustrate/visualize. You say "I had a look", but the concern is of a more general, principled character, namely that the compiler gives no guarantee that it will always be done that way. Which in turn would mean that formally you would have to look every time, not just once. Or at least for every new compiler version and every combination of compiler flags used. But that was only part of the concern. Another part (which to me appears just as [if not more] important) is the possibility of an L2 miss on behalf of the workaround code. I mean, the workaround code attempts to ensure that there is no activity on the memory bus, but it can itself cause a cache miss when fetching its own instructions. Which is why I wonder if the workaround code should be implemented in a way that precludes such a possibility. And the point is that you can't do that reliably in a high-level language.

I should probably clarify that these questions are not addressed specifically to you, Benny, but just as much [if not more] to TI. The answer might well be that my concerns are unfounded, but they would have to say so, preferably arguing why.

0 Aditya over 10 years ago in reply to Benjamin Meyer

TI__Expert 6815 points

Benjamin Meyer said:

I cannot definitely say yes, because our software was much older back then and I no longer have the chance to test with the old hardware. The only thing I can say is that I didn't see such problems with rev 1.0. Maybe they would have arisen sooner or later.

Got it, no worries. We can focus on the current behavior you see on rev 2.0.

Good to know it works reliably with cache disabled.

Benjamin Meyer said:

The bytecnt is quite large, because we can potentially receive quite large images. The image buffer should be 146 MB, aligned to 256 bytes.

Do keep in mind that the max invalidate byte count supported by hardware is 256KB (0xFFFF words x 4 bytes/word). See Section 3.4.3.2.2 L1D Invalidate Word Count Register (L1DIWC) in the C66x CorePac users guide (SPRUGW0C). Anything larger than that is not supported in hardware - can you confirm your byte count is within limits? Edit: If we are talking about 146 MB, then you might be better off using global invalidate.
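The arithmetic behind that limit can be sketched on the host as follows. This is only an illustration that takes the 0xFFFF-word and 4-bytes-per-word figures quoted above at face value; the macro and function names are hypothetical and not part of the CSL. Read literally, the limit is 262140 bytes, four bytes short of a full 256 KB, and a 146 MB buffer (taking MB as 2^20 bytes) would need 585 block operations of that maximum size.

```c
#include <assert.h>
#include <stdint.h>

/* Figures quoted in the post above; the names are hypothetical. */
#define CACHE_MAX_WORDS        0xFFFFu
#define CACHE_BYTES_PER_WORD   4u
#define CACHE_MAX_BLOCK_BYTES  (CACHE_MAX_WORDS * CACHE_BYTES_PER_WORD) /* 262140 */

/* Returns 1 if byteCnt fits into a single block coherence operation. */
static int Cache_FitsSingleBlockOp(uint32_t byteCnt)
{
    return byteCnt <= CACHE_MAX_BLOCK_BYTES;
}

/* Number of maximum-size block operations needed to cover byteCnt bytes. */
static uint32_t Cache_BlockOpsNeeded(uint32_t byteCnt)
{
    return (byteCnt + CACHE_MAX_BLOCK_BYTES - 1u) / CACHE_MAX_BLOCK_BYTES;
}

/* Small self-check of the numbers discussed in the thread. */
static void Cache_LimitSelfCheck(void)
{
    assert(Cache_FitsSingleBlockOp(CACHE_MAX_BLOCK_BYTES) == 1);
    assert(Cache_FitsSingleBlockOp(256u * 1024u) == 0);   /* 4 bytes too many */
    assert(Cache_BlockOpsNeeded(146u * 1024u * 1024u) == 585u);
}
```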

Benjamin Meyer said:

Puh, that will be difficult, because the software is quite complicated by now and I would have to strip it down a lot. I'll see what I can do.

No problem. We probably do not need this now. I thought I'd ask in case you already had one.

0 Aditya over 10 years ago in reply to Andy Polyakov

TI__Expert 6815 points

Andy,
Thanks a lot for your contribution and the questions you have raised. I am following up internally and will post my response.

0 Aditya over 10 years ago in reply to Andy Polyakov

TI__Expert 6815 points

Andy,

The compiler guarantees back-to-back mfence instructions in Benny's implementation. It will not optimize out either one. It will also not reorder any loads or stores around mfence, since it is also a scheduling barrier.

Regarding your next point on a cache miss resulting from an instruction fetch for the workaround code: the second mfence just covers the window where the first mfence might have released early because, as described in the erratum, it falsely assumed the memory bus was idle when it wasn't. Additional program fetches from a cache miss will not impact the workaround.

0 Andy Polyakov over 10 years ago in reply to Aditya

Expert 1340 points

Aditya said:
The compiler guarantees back-to-back mfence instructions in Benny's implementation. It will not optimize out either one. It will also not reorder any loads or stores around mfence, since it is also a scheduling barrier.

If the assertion in the first sentence is supposed to be a logical conclusion from the two following sentences, then I would have to reject it. I don't question whether or not the compiler will optimize out one mfence. I don't question whether or not the compiler will reorder loads and stores around mfence. The question is whether there is a guarantee that the compiler won't insert any other instruction. Of course the key point might be that mfence is treated as a "scheduling barrier" for all instructions, but then why would you explicitly mention loads and stores?

Aditya said:
Regarding your next point on a cache miss resulting from an instruction fetch for the workaround code: the second mfence just covers the window where the first mfence might have released early because, as described in the erratum, it falsely assumed the memory bus was idle when it wasn't. Additional program fetches from a cache miss will not impact the workaround.

But that point was of a more general character. Why do you refer to what the second mfence instruction is for, when the question is about, for example, the first mfence instruction incurring an L2 cache miss? Note "for example". I mean, consider any instruction in the workaround code incurring a miss. Can a miss on any of them, be it mfence or not, contribute to the drama? In other words, the concern is of the broadest possible character and is not about advisory 27, but rather 7 and 22. Well, I can imagine a course of discussion where it boils down to the following. The 3rd precondition for the mishap is "write allocation for some address". As a code fetch can't cause write allocation by definition (it can cause eviction and write-back, but not write allocation, right?), an L2 cache miss in the workaround code would actually do good by disrupting the potentially fatal chain of events. Is that it?

0 Benjamin Meyer over 10 years ago in reply to Aditya

Intellectual 265 points

Hello Aditya,

I am very sorry for getting back to you so late.

Aditya said:
Do keep in mind that the max invalidate byte count supported by hardware is 256KB (0xFFFF words x 4 bytes/word). See Section 3.4.3.2.2 L1D Invalidate Word Count Register (L1DIWC) in the C66x CorePac users guide (SPRUGW0C). Anything larger than that is not supported in hardware - can you confirm your byte count is within limits? Edit: If we are talking about 146 MB, then you might be better off using global invalidate.

Thanks a lot. This is definitely the reason why the cache invalidate was unreliable. I did not realize that we are restricted to 256KB.

So either I have to stay with the global invalidate, or invalidate my 146 MB in chunks of 256KB in a loop.
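A minimal host-side sketch of the chunked variant could look like the following. It is only an illustration under stated assumptions: the per-chunk operation is passed in as a function pointer so that a block function such as Cache_Invalidate() from earlier in the thread could be plugged in on target, and the names Cache_ForEachChunk, CacheBlockOp and Cache_StubOp are hypothetical. The chunk size is deliberately set to 128 KB rather than a full 256 KB, because the quoted limit of 0xFFFF words works out to 262140 bytes, and a chunk that is a multiple of the 128-byte line size keeps every chunk line-aligned when the buffer itself is aligned.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES   128u            /* assumption: C66x L2 line size */
#define CACHE_CHUNK_BYTES  (128u * 1024u)  /* hypothetical choice, below the quoted limit */

/* Signature of a per-chunk block operation (hypothetical typedef). */
typedef void (*CacheBlockOp)(void *blockPtr, uint32_t byteCnt);

/* Apply op to [blockPtr, blockPtr + byteCnt) in chunks of at most
 * CACHE_CHUNK_BYTES. Returns the number of calls made to op. */
static uint32_t Cache_ForEachChunk(void *blockPtr, uint32_t byteCnt, CacheBlockOp op)
{
    uint8_t *p = (uint8_t *)blockPtr;
    uint32_t calls = 0u;

    /* Every chunk boundary must fall on a cache line boundary. */
    assert(CACHE_CHUNK_BYTES % CACHE_LINE_BYTES == 0u);

    while (byteCnt > 0u) {
        uint32_t n = (byteCnt > CACHE_CHUNK_BYTES) ? CACHE_CHUNK_BYTES : byteCnt;
        op(p, n);
        p += n;
        byteCnt -= n;
        calls++;
    }
    return calls;
}

/* Host-side stand-in for a real block operation: it only records sizes. */
static uint32_t g_stubTotalBytes;
static uint32_t g_stubMaxChunk;

static void Cache_StubOp(void *blockPtr, uint32_t byteCnt)
{
    (void)blockPtr;
    g_stubTotalBytes += byteCnt;
    if (byteCnt > g_stubMaxChunk) {
        g_stubMaxChunk = byteCnt;
    }
}
```

On target, the stub would be replaced by the real block function; whether interrupts should stay disabled across the whole loop or only per chunk is a design choice this sketch leaves open.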

Also many thanks to Andy for starting a good discussion.

Benny
