trouble with cache coherency - local vs. global invalidate dm643x

MattLipsey

Genius 3575 points

Other Parts Discussed in Thread: TMS320DM6435

I am having what I believe to be cache coherency problems on a dm6435 custom app.

Basically, I am in a loop of acquire then process:

preloop: invalidate cache for buffer 1

during loop:

trigger image (will be placed in buffer1 by vpfe)

wait for image acquired interrupt from vpfe

process image, placing results in buffer2

invalidate cache for buffer 1

repeat

If I perform a global L2 invalidate, everything appears to work fine. But if I try to only invalidate the cache for buffer1, I get artifacts in buffer2, but not buffer1. Here is my routine to perform local invalidates:

void invalidateCacheBlock( Uint32 blockStart, Uint32 blockSize )
{
    Uint32 currentSize;

    while (blockSize > 0) {
        // limit each block operation to 2 << 18 bytes
        if (blockSize > (2 << 18)) {
            currentSize = 2 << 18;
        } else {
            currentSize = blockSize;
        }

        // set the base address and the word count (bytes / 4)
        CACHE_L2IBAR = blockStart;
        CACHE_L2IWC = currentSize >> 2;

        // wait for invalidate operation to complete
        while (CACHE_L2IWC != 0);

        blockSize -= currentSize;
        blockStart += currentSize;
    }
}

I have a few questions:

1) In my loop, I believe I should only have to invalidate buffer1 (there is no DMA activity besides the VPFE filling the buffer). Is this correct, or would I need to invalidate buffer2 for some reason?

2) If 1 is true, is there anything wrong with the routine above for performing a localized invalidate? The frame I am invalidating is 256-byte aligned at start and finish.

3) Invalidating L2 also invalidates L1D, correct?

4) I know that the image consumes > 100% of my cache. Is there any speed penalty to just performing a global invalidate vs. the local invalidate?

Thanks for any insight

edit:

I realized that I can't just do a global invalidate; I would have to do a global writeback/invalidate (to save changes to various other data not related to image processing), which I'd rather not do if possible for time reasons.

over 16 years ago

MattLipsey over 14 years ago

Genius 3575 points

So, I was investigating a mysterious cache coherency problem on another project and decided to do a forum search for info, when I found this unanswered question I posted back in the murky depths of time. DM6435 is the chip for both projects.

Since I never got a response, I went back to look at my original code, and I found that I had limited the size of each cache block operation so that it would always end on an L2 line boundary (basically changing the 2^18 references to 0x03fe0). That made my original problems vanish.

The project that is currently not working sets a stride limit of 0x3ffc bytes (which works out to setting the word count registers to 0x0000ffff for max steps). I believe this should work according to the cache register descriptions in spru871, yet it does not. When I change the max value for the word count register to 0x0000ff80 for max steps (this makes sure I always land on a 128 byte boundary, which is the line size for L2 cache), everything works. I define "not working" as follows: I fill the image with a set value, write back and then invalidate the cache, take a picture, and observe a "hole" in the image that still contains my prefilled value. When I do a memory watch on this section, I can see that the "hole" is still cached in L2. If I do a global invalidate on L2, the hole disappears.
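The "hole" check described above can be automated on a captured frame. This is a minimal sketch, assuming an 8-bit image and a prefill value that rarely occurs in real pixels; `longest_prefill_run` is a hypothetical helper name, not part of any TI library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Finds the longest run of bytes that still equal the prefill value.
 * Returns the run length in bytes; if offset is non-NULL, stores the
 * byte offset where that run starts (0 when no run is found). */
static size_t longest_prefill_run(const uint8_t *img, size_t len,
                                  uint8_t fill, size_t *offset)
{
    size_t best = 0, best_off = 0, run = 0;

    assert(img != NULL);

    for (size_t i = 0; i < len; i++) {
        if (img[i] == fill) {
            run++;
            if (run > best) {
                best = run;
                best_off = i + 1 - run;   /* first byte of the current run */
            }
        } else {
            run = 0;
        }
    }
    if (offset != NULL) {
        *offset = best_off;
    }
    return best;
}
```

A run that starts on a 128-byte boundary and is a multiple of 128 bytes long would be consistent with stale L2 lines, although a scene that really contains the prefill value would trigger it too.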

Can I get an expert to comment on this behavior and point me to the relevant documentation for cache coherency problems? I don't know if the L2 line size limit just magically happens to meet some other requirement that I am not aware of.

RandyP over 14 years ago in reply to MattLipsey

TI__Guru* 84110 points

MattLipsey,

Not very good response on our part, for your previous post. No excuses, but we do try to be faster now.

MattLipsey said:
Can I get an expert to comment on this behavior and point me to the relevant documentation for cache coherency problems?

Since no one answered you after two years on the previous post, I hope you will settle for me rather than whatever "expert" means. Someone else may join in with better answers, but we can at least get started. For documentation, SPRU871 is the technical document you need, and that is where you are already looking. If you go to the TMS320DM6435 Product Folder (auto-link in red) and go to Technical Documents, you can look for "cache" and "memory" in the document titles to find Application Notes and User's Guides related to these issues. There is a lot of useful information on the TI Wiki Pages also; go there and search for "cache coherency" (no quotes) to find related articles.

As to the behavior, I believe some of your numbers in the post above are missing 4 bits in some of your translations from bytes to word counts. In addition, the register field descriptions for the L2xWC registers specifically state that the max value is 0x0000FFE0, so 0x0000FFFF is invalid.
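The byte-to-word-count arithmetic can be sketched in C. This is a host-testable sketch under stated assumptions, not the routine either poster used: it takes the 128-byte L2 line size and the 0x0000FFE0 word-count maximum from this thread, and the names `l2_block_op`, `l2_for_each_chunk`, and `l2_trace_op` are hypothetical. On the target, the callback is where the base-address and word-count registers would be written and polled.

```c
#include <assert.h>
#include <stdint.h>

#define L2_LINE_BYTES   128u      /* L2 line size cited in the thread */
#define L2_WC_MAX_WORDS 0xFFE0u   /* word-count maximum cited in the thread */
#define L2_CHUNK_BYTES  (L2_WC_MAX_WORDS * 4u)   /* 0x3FF80 bytes */

/* Hypothetical hook for one block operation (base address + word count). */
typedef void (*l2_block_op)(uint32_t base, uint32_t words, void *ctx);

/* Walks [start, start + bytes) in chunks of at most L2_CHUNK_BYTES and
 * calls op once per chunk. Returns the number of chunks issued. */
static unsigned l2_for_each_chunk(uint32_t start, uint32_t bytes,
                                  l2_block_op op, void *ctx)
{
    unsigned chunks = 0;

    assert(start % L2_LINE_BYTES == 0);   /* assumed: buffer is line aligned */

    while (bytes > 0) {
        uint32_t cur = bytes > L2_CHUNK_BYTES ? L2_CHUNK_BYTES : bytes;

        op(start, (cur + 3u) / 4u, ctx);  /* bytes -> 32-bit words, rounded up */
        start += cur;
        bytes -= cur;
        chunks++;
    }
    return chunks;
}

/* Host-side stand-in for the register writes: records what was requested. */
struct l2_trace {
    uint32_t max_words;    /* largest word count passed to the hook */
    uint32_t total_bytes;  /* sum of all word counts, in bytes */
    unsigned misaligned;   /* chunks whose base is not line aligned */
};

static void l2_trace_op(uint32_t base, uint32_t words, void *ctx)
{
    struct l2_trace *t = ctx;

    if (words > t->max_words) {
        t->max_words = words;
    }
    if (base % L2_LINE_BYTES != 0) {
        t->misaligned++;
    }
    t->total_bytes += words * 4u;
}
```

For a line-aligned 1 MB buffer this issues five chunks, none with a word count above 0xFFE0, and every chunk starts on a line boundary. 0xFFE0 words is 0x3FF80 bytes, a whole number of 128-byte lines, whereas 0xFFFF words (0x3FFFC bytes) is not.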

In the Training section of TI.com, there is a training video set for the C6474 which uses three (3) of the C64x+ cores. It may be helpful for you to review several of the modules. But in particular, the Memory and Cache Module will apply to handling the cache coherency issues. You can find the complete video set at http://focus.ti.com/docs/training/catalog/events/event.jhtml?sku=OLT110002 .

MattLipsey said:
Is there any speed penalty to just performing a global invalidate vs. the local invalidate?

The answer is "it depends". The relative size of your total cache compared to your buffer to be invalidated is important to the answer here. If you have 32KB of cache and a 2MB buffer, then a global operation should be faster than a set of block operations. But you also have to make the tradeoff of the side-effects of a global operation on other variables.

The best answer is to closely examine your algorithm and determine what portions of the large buffer will still reside in cache when you have finished the algorithm. Since the buffer size is much larger than the total cache size, then perhaps only the last 64KB needs to be written back. The same could be true for the input buffer, but all of this will depend on what the algorithm is doing with respect to reading the input values while writing the output values, and possibly whether the output values are also read. The datasheet and the cache documents will help you understand the conditions when cache lines are allocated and auto-flushed.

Be sure to keep a distinction between the cases for invalidating the cache and writing-back the cache. You can do one or the other or both, but they have different purposes as explained in the documentation.

Regards,
RandyP


MattLipsey over 14 years ago in reply to RandyP

Genius 3575 points

So I found that when SPRU871 went from rev J to rev K in August 2010, the ranges for all the L2xxxWC registers changed from 0xFFFF to 0xFFE0.

Guess I will go and sign up to be notified when dm6435 documents change. Will that take care of docs like spru871, which is a 64x+ family document?

"In addition, the register field descriptions for the L2xWC registers specifically state that the max value is 0x0000FFE0, so 0x0000FFFF is invalid." That answers my question.
