Cache Invalidate not working

Hello,

I posted this problem back in December, and by some magic combination I was able to get it to work.

  http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/p/233503/824058.aspx#824058

However, I am redoing the same code with some slight differences and the problem is back!  I can't get this section of code to invalidate the value in the L1D cache.

The pseudocode is this: CORE0 sets a flag (in a struct) to tell CORE1 that it is time to start working.  CORE1 reads the variable from the struct and patiently waits until the flag is set.  After every read, it sleeps for a bit, then invalidates the cache and checks again.  I do not have L2 cache enabled, only L1.

#pragma DATA_ALIGN(128)

#pragma DATA_SECTION(".srioSharedMem")

volatile TrackerSharedMemoryStruct far trackerMem;

1. while((trackerMem.asscVars.assocStartFlag & (CORE0_ASSOC_TASK << CORE_NUM)) == 0) //wait for your flag to be active.

{

2. Task_sleep(ASSOC_SLEEP_DELAY);

3. CACHE_invL1d((void *) &trackerMem, 1024,CACHE_WAIT);

}

Like before, the emulator shows me that the flag is correctly written in MSMCSRAM and that CORE1 is just not seeing it.  Using the line numbers above as reference: Line 1 reads the flag and it is zero.  The task sleeps, then Line 3 invalidates the entire structure (which is volatile and aligned).  If I break at Line 3 and check the memory, I can see the flag is set to a value.  Then, when I loop back around to Line 1, the value is zero again!  I can also see that CORE0 set the flag and wrote back its cache.

Please advise.  I didn't really understand why the solution worked the first time and it obviously wasn't viable because a small change in the memory map broke it again!

Things I have tried:

- volatile and far -> no improvement

- DATA_ALIGN 64, 128, 512 -> none worked

- Cache invalidate just the variable with size 128 (CACHE_invL1d ((void *) &trackerMem.asscVars.assocStartFlag, 128,CACHE_WAIT))

- The structure is only 304 bytes, so I have tried to invalidate every size, 128, 256, 512, 1024. 

- I disabled cache in MSMCSRAM.  This solves the problem but I don't want to do that, this structure is used all the time.  Once the cores read the variables once, I would prefer them to cache them!
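A minimal sketch of how the invalidate range could be derived instead of guessed. It assumes a 64-byte L1D line (the figure used for padding later in this thread); `LineSpan` and `line_span` are hypothetical names, not CSL calls. It only computes the line-aligned range covering an object, which is the range a block invalidate effectively touches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte L1D cache line. */
#define L1D_LINE_BYTES 64u

/* Hypothetical helper type: the line-aligned range covering an object. */
typedef struct {
    uintptr_t start;  /* address of the first cache line touched */
    size_t    bytes;  /* length, always a whole number of lines  */
} LineSpan;

/* Round the start down and the end up to cache-line boundaries. */
static LineSpan line_span(uintptr_t addr, size_t size)
{
    const uintptr_t mask = (uintptr_t)L1D_LINE_BYTES - 1u;
    LineSpan span;
    uintptr_t end;

    assert(size > 0);
    end = (addr + size + mask) & ~mask;   /* one past the last line */
    span.start = addr & ~mask;
    span.bytes = (size_t)(end - span.start);
    return span;
}
```

For a 304-byte structure that starts on a line boundary this gives 320 bytes. Invalidating more than that (for example 1024) also discards whatever neighbouring data happens to be cached in the extra lines.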

 

Thanks for any help!  This is just so silly.  I have similar loops all throughout my code.  2 of them are working.  This one is not and there are 2 untested because they are after this loop.   All 5 worked before the memory map got changed (what I mean is I deleted some unused variables that were existing in MSMCSRAM).

Brandy

  • On the writer core, are you doing a cache writeback?  If not the written value is likely just sitting in the writer core's cache.

    Are the assocStartFlags written by more than one core?  If so, you might have false sharing.  Let's say core 0 sets bit 4 and core 1 sets bit 5 (to start cores 4 and 5 on a 6678, respectively).

    core0:

    cacheInv

    assocStartFlags |= 0x10; // start core 4

    cacheWB

    core1:

    cacheInv

    assocStartFlags |= 0x20; // start core 5

    cacheWB.

    If both core0 and core1 run at about the same time, you don't know whether assocStartFlags will contain 0x10, 0x20, or 0x30.  Thus one of the cores 4 or 5 won't start.
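The race described above can be reproduced with a toy host-side model. This is a sketch under loud assumptions: each "core" is just a private copy of one shared word, and all `toy_*` names are hypothetical. It is not a model of the real C66x cache, only of the read-modify-writeback ordering.

```c
#include <stdint.h>

/* Toy model: a private cached copy of one shared word per core. */
typedef struct {
    uint32_t cached;
} ToyCore;

/* Invalidate followed by a read: the core reloads from memory. */
static void toy_invalidate_and_read(ToyCore *core, const uint32_t *mem)
{
    core->cached = *mem;
}

/* The OR happens in the core's private copy only. */
static void toy_set_bits(ToyCore *core, uint32_t bits)
{
    core->cached |= bits;
}

/* Writeback replaces the word in memory with the private copy. */
static void toy_writeback(const ToyCore *core, uint32_t *mem)
{
    *mem = core->cached;
}

/* Both cores read before either writes back: the later writeback wins. */
static uint32_t toy_overlapped(int core0_writes_last)
{
    uint32_t mem = 0;
    ToyCore core0, core1;

    toy_invalidate_and_read(&core0, &mem);
    toy_invalidate_and_read(&core1, &mem);
    toy_set_bits(&core0, 0x10u);
    toy_set_bits(&core1, 0x20u);
    if (core0_writes_last) {
        toy_writeback(&core1, &mem);
        toy_writeback(&core0, &mem);
    } else {
        toy_writeback(&core0, &mem);
        toy_writeback(&core1, &mem);
    }
    return mem;
}

/* Each core finishes its whole sequence before the other starts,
 * as it would when the sequence is guarded by a semaphore. */
static uint32_t toy_serialized(void)
{
    uint32_t mem = 0;
    ToyCore core0, core1;

    toy_invalidate_and_read(&core0, &mem);
    toy_set_bits(&core0, 0x10u);
    toy_writeback(&core0, &mem);

    toy_invalidate_and_read(&core1, &mem);
    toy_set_bits(&core1, 0x20u);
    toy_writeback(&core1, &mem);
    return mem;
}
```

In this model the overlapped orderings end with 0x10 or 0x20, and only the serialized ordering ends with 0x30.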

    If you have variables that are written and read by multiple cores, there are three solutions:

    1) protect the entire structure with a semaphore.  Invalidate the entire structure anytime you are outside the semaphore.  This ensures that it is only in one cache at a time, eliminating false sharing.  You still need to pad the structure to a multiple of 64 bytes to make sure nothing placed immediately after the structure by the linker is unintentionally destroyed during invalidate.

    2) Align and pad each element to 64 bytes, such that it gets its own cache line.  This eliminates false sharing.

    3) Run with cache off just for this structure by mapping an alias of MSMC using the XMC MPAX into a region where the MAR is set to disabled.  No align/pad/cache operations are needed.  The rest of MSMC through 0x0c000000 is still cacheable.
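A minimal sketch of what option 2 could look like with explicit padding. It assumes a 64-byte cache line and that the structure itself is placed on a 64-byte boundary (for example with the DATA_ALIGN pragma). `PaddedFlags` and `assocDoneFlag` are hypothetical and are not taken from the real TrackerSharedMemoryStruct.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache line. */
#define CACHE_LINE_BYTES 64u

/* Hypothetical layout: one flag per cache line, padded by hand so the
 * same source builds on compilers without C11 alignment support. */
typedef struct {
    volatile uint32_t assocStartFlag;  /* written by core 0 only */
    uint8_t           pad0[CACHE_LINE_BYTES - sizeof(uint32_t)];
    volatile uint32_t assocDoneFlag;   /* hypothetical: written by core 1 only */
    uint8_t           pad1[CACHE_LINE_BYTES - sizeof(uint32_t)];
} PaddedFlags;

/* Returns 1 when every flag starts on its own line and the total size
 * is a whole number of lines. */
static int padded_flags_layout_ok(void)
{
    return offsetof(PaddedFlags, assocStartFlag) % CACHE_LINE_BYTES == 0
        && offsetof(PaddedFlags, assocDoneFlag) % CACHE_LINE_BYTES == 0
        && sizeof(PaddedFlags) % CACHE_LINE_BYTES == 0;
}
```

The cost is 64 bytes per element, which is the trade-off against options 1 and 3.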

  • Hi John,

    Thank you for the quick reply.  Core 0 is the only writer of this flag at this point.  And whenever the flag is written there is a semaphore wrapped around it.  Here is the code from when Core 0 writes it:

    while ((CSL_semAcquireDirect (ASSOC_HW_SEM)) == 0);

    {

    //Set flag for association start

    LOG_DEBUG("Starting Assoc Task.");

    CACHE_invL1d((void *) &trackerMem, 512, CACHE_WAIT);

    trackerMem.asscVars.assocStartFlag = trackerMem.asscVars.assocStartFlagValue;

    CACHE_wbL1d((void *) &trackerMem, 512, CACHE_WAIT);

    }CSL_semReleaseSemaphore (ASSOC_HW_SEM);

    The only thing is, I invalidate it after I have acquired the semaphore.

    Option 1:  I thought this is what I was doing.  I do not have extra bytes at the end of the structure but that is easy enough.

    Option 2:  Yuck.  That sounds really ugly and why bother with a struct at that point, I might as well just make individual variables and use the pragma to align.

    Option 3:  I have a need for speed, I feel like that would slow me down a bit.  These variables are used a lot but mostly as a read-only capacity.

    Can you give me any other ideas for making Option 1 work properly?  What other information would be helpful to you?

    Thanks,
    Brandy

     

  • Hi John,

     

    I added the padding to get a multiple of 64.  Then I invalidated only this amount (320) when I invalidated the struct.  Still no luck.  My loop just sits there and spins.

    Thanks!

     

    Brandy

  • Is assocStartFlag declared "volatile"?


  • Yes, I have done trials with the entire struct as volatile and with the single variable inside the struct as volatile.  I am sure you are correct that for some reason the value is being refreshed from the registers, but I can't seem to make it read directly from memory.  I have even checked the assembly, and it looks like it is loading from the memory address it should read from, but it reads zero.

    It's weird that this has happened twice now.  I guess it is very sensitive to alignment, but why doesn't volatile work?  And invalidating the entire structure doesn't work either.

    Thanks!

    Brandy

  • Hi,

    Are there other structures that you inv/wb allocated just before trackerMem? If so, does each inv/wb always match the real structure size, and are there gaps between the structures so that an inv/wb of another structure cannot inadvertently overlap trackerMem?

    When you break at line 3 (CACHE_invL1d) and then loop back to line 1, is core 0 still running, or are there other tasks on core 1 that could run?

  • Hi Alberto,

    The TrackerMem is a struct of 4 other smaller structs.  Do you think that is a problem?  I will check the memory map in the morning and see.  There is only one task on Core 1.  Core 0 has the task that sets the flag, the NDK task and a health monitor task.  The last two do not use TrackerMem.

    Thanks for thinking about this problem.

    Brandy

     

  • Brandy,

    There is nothing obviously wrong in what you have shown in this thread, that I can see. You have followed the steps the way John said, and it should be working right.

    The bad news is that this means there is something wrong in the rest of the code that is not shown here, and you have to do debug to figure it out.

    The good news is that CCS has great tools in the Memory Browser window to help you figure out what is in the cache and what is not in the cache.

    As a basepoint, I have attached a simple project that I built for the EVM6678 that does approximately what you seem to be trying, so you can see it working. Maybe you can compare something in this project with what is in yours and see a difference that we can try to explain. You can download the zip file and do an Import into your workspace in the CCSv5.3 Project Explorer window in the CCS Debug perspective.

    In the InterCoreCacheTest2 project, the following sequence is followed:

    1. All 8 cores sync up so they are all running at about the same point in the code.
    2. Core0 prints that it is starting, then writes to an MSMCSRAM location to flag Core1 to start running.
    3. The other seven cores repeatedly do an L1dInv, then test the same MSMCSRAM location for their bit being set.
    4. When each core sees its bit set, it prints that it is continuing, then writes the flag bit for the next core, followed by an L1dWb command.

    The flag bits are written directly, but they do go into the cache since the variable was read to test it.

    Build and load the program onto all 8 cores, then run them all at the same time. The 8 cores execute in sequence, as expected.

    To debug your situation, if you do not see differences between the two programs, watch the flags in the Memory Browser window and turn the cache highlights on and off to see what is in cache and what is in the target memory. You should be able to single-step over the various lines and see the right things happening in the cache and in the MSMCSRAM.

    If not, please tell us which cache things are not happening right, based on your CCS Memory Browser observations.

    Regards,
    RandyP

    InterCoreCacheTest2.zip
  • BrandyJ said:

    The TrackerMem is a struct of 4 other smaller structs.  Do you think that is a problem?  There is only one task on Core 1.  Core 0 has the task that sets the flag, the NDK task and a health monitor task.  The last two do not use TrackerMem.

    It is hard to say. If you update this structure only in one place, I suppose it is not. The problem could be hidden in some other structure placed before or after it in memory, if you inv/wb those structures.

    Just to debug, try this:

    On core 1, just to be sure:

    unsigned int saved=_disable_interrupts();

    CACHE_invL1d(....);

    flag_value=trackerMem.asscVars.assocStartFlag;

    _restore_interrupts(saved);

    And, when you break at the CACHE_invL1d() line, be sure to stop core 0 as well. Then, as RandyP suggests, check the memory on both cores (using the cache and no-cache views and paying attention to the "colors") and single-step to see what you read in flag_value and whether it is coherent with the memory. All other cores have to be stopped.

    By the way: I'm sure the cache works, but in my application I prefer to map a small control area as uncached; once inside the critical section I copy the data into a local structure and then use that copy. Continuing to use the shared copy could produce unpredictable effects (depending on how you use it), since you cannot assume how the cache controller will use the cache. You say you prefer not to do this because you want to benefit from the cache speed, but the cache controller could choose to discard the line whenever it wants.

  • I made a quick video of the problem.  I had to go fast to keep the size small enough.  I thought maybe this might help with ideas.  Now I will try the suggestions you gave as well.  Thanks!

    Brandy

     

     

     

     

  • Also, it looks like the video won't play in IE.  It played in Firefox for me though.

  • Hi,

    I tried disabling interrupts first.  This did not improve the situation.

    unsigned int myValue = 0;

    // while((trackerMem.asscVars.assocStartFlag & (CORE0_ASSOC_TASK << CORE_NUM)) == 0) //wait for your flag to be active.

    while((myValue & (CORE0_ASSOC_TASK << CORE_NUM)) == 0) //wait for your flag to be active.

    {

     Task_sleep(ASSOC_SLEEP_DELAY);

    //CACHE_invL1d((void *) &trackerMem, CACHE_INV_SIZE_TRK_STRCT,CACHE_WAIT);

    unsigned int saved = _disable_interrupts();

    CACHE_invL1d((void *) &trackerMem, CACHE_INV_SIZE_TRK_STRCT,CACHE_WAIT);

    myValue = trackerMem.asscVars.assocStartFlag;

    _restore_interrupts(saved);

    }

    Now the memory reverts back to the "cached" (or wherever it is coming from) value when I go to assign myValue to the flag.

    I moved the assignment of myValue outside of the disable interrupts - this did not help either.

    Then I went back to my original code and removed the task_sleep().  This did not change anything.

    Now I will test Randy's code and look at the memory map some.


    Thanks,
    Brandy

     

     

     

  • I was able to play the video by saving it to disk and playing with vlc.

    I see one problem in the code.  In CMHTAssociate_MT::Execute(), the CACHE_wbL1d call uses CACHE_WAIT instead of CACHE_FENCE_WAIT.

    I'd take the following steps to debug this directly.  Let's find out whether the writeback is failing, the invalidate is failing, or the read is failing.  I think your video shows that the read is NOT failing, since you used WorkerMain() to view a memory expression of trackerMem.asscVars.assocStartFlag==0.  This does a read of memory (through cache) independent of whatever is prefetched into registers.

    Now we need to narrow this down to whether the writeback or the invalidate is failing.  Is the new value really landing in memory?  If you look on core 0, you see core 0's cache contents, not necessarily the memory.  If you look from core 1, you are seeing 0, so that is not helpful.  You can instead use core 2 (or some other core) to view the memory location 0x0c011fe8 in a memory browser window.  You won't be influenced by cache on core 2, assuming core 2 never accesses around this location.  You can try the cache selection buttons in the memory browser (which tell the browser to look at memory instead of cache).  You can also disable the caches on core 2 or another unused core by using a memory window to write 0 into 0x1840000, 0x1840020, and 0x1840040.  This will ensure you are really looking at the memory and not a cache.

    If you find the correct value in 0x0c011fe8 from another core, then it means that the writeback worked, but the invalidate didn't.

    If you find the wrong value from the other core, then it means the writeback didn't work, and the invalidate is inconclusive.

    At least this cuts the problem in half.

  • Hi John,

    Thanks for the help again!

    John Dowdal said:
    I see one problem in the code.  In CMHTAssociate_MT::Execute(), the CACHE_wbL1d call uses CACHE_WAIT instead of CACHE_FENCE_WAIT.

    I changed it to CACHE_FENCE_WAIT and saw no improvement.  What is the difference by the way?  What does MFENCE mean?

    John Dowdal said:
    You can instead use core 2 (or some other core) to view the memory location 0x0c011fe8 in a memory browser window.  You won't be influenced by cache on core 2, assuming core 2 never accesses around this location.  You can try the cache selection buttons in the memory browser (which tells the browser to look at memory instead of cache).  You can also disable the caches on core 2 or another unused core by using a memory window to write 0 into 0x1840000, 0x1840020, and 0x1840040.  This will ensure you are really looking in the memory not a cache.

    I checked the value in Core 2 (which does not have any tasks running on it; after main() it just exits).  The value is correctly set in CORE 2.  Would you like another video?

    So, the invalidate did not work.  I tried CACHE_FENCE_WAIT there too.

    Here's another video:

     

    It seems like the invalidate is working, but the VOLATILE nature of the variable is not?  Or maybe this is also a symptom of the invalidate not working.

     

  • Hi Randy,

    I tried to build your code but it gets an unresolvable resource that I can't seem to make it find.  I placed CG_TOOL_ROOT in both my env vars and system env vars.  Then I also placed it in the build options. 

    I cannot find a reference to this in your project so I am not sure what the value of the variable should be.  I assumed it was: C:\ti\ccsv5\tools

    Please advise.

    Brandy

  • The mfence ensures the data actually lands in memory.  It's similar to dmb/dsb on ARM.

    Your video makes it very clear the invalidate didn't work.  What is CACHE_INV_SIZE_TRK_STRCT?  Can you try a hardcoded "64" and see what happens?


  • CACHE_INV_SIZE_TRK_STRCT is 320.  The struct was 304 bytes, but I padded it with 16 bytes to make it divisible by 64.  It doesn't make sense to me that it should be 64, but I tried it anyhow.  No good.

    I checked the memory map.  The memory right before the struct is empty and before that are the variables from the electrocardiography code (lol - a joke because it is the health monitor).  The EKG task is not actually running, I have turned it off for this testing so it shouldn't be affecting it.

    .srioSharedMem.1

    *          0    0c002000    0000fecf     UNINITIALIZED

                      0c002000    0000fecf     electrocardiography.obj (.srioSharedMem)

     

    .srioSharedMem.2

    *          0    0c011f00    00000174     UNINITIALIZED

                      0c011f00    00000140     tracking.obj (.srioSharedMem)

                      0c012040    00000034     dspLumberjack.obj (.srioSharedMem)

     

    .srioSharedMem.3

    *          0    0c012080    00000120     UNINITIALIZED

                      0c012080    00000100     ti.drv.srio.ae66 : srio_drv.oe66 (.srioSharedMem)

                      0c012180    00000020     pxmTracking.obj (.srioSharedMem)
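A small sketch for reading a map like the one above: it checks whether the end of one object and the start of the next fall in the same cache line. It assumes the first region sits below the second and has non-zero length; `regions_share_line` is a hypothetical helper, and the line sizes passed in (64 bytes for L1D, 128 bytes for L2) are assumptions about the device.

```c
#include <stdint.h>

/* Returns 1 when the last byte of region A and the first byte of
 * region B land in the same cache line of the given size.
 * Assumes a_len > 0 and that region A lies below region B. */
static int regions_share_line(uint32_t a_start, uint32_t a_len,
                              uint32_t b_start, uint32_t line_bytes)
{
    uint32_t a_last_line  = (a_start + a_len - 1u) / line_bytes;
    uint32_t b_first_line = b_start / line_bytes;

    return a_last_line == b_first_line;
}
```

With the addresses above, tracking.obj (0x0c011f00, length 0x140) and dspLumberjack.obj (0x0c012040) do not share a 64-byte line, but they would share a 128-byte line if L2 caching were enabled.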

     

    John, if you are willing - I can work with my TI rep, Erin McCook, and set up a webex.  I believe we have an NDA on file.  Then you can see what is happening directly and I can post the solution when we find it.


    Thanks,
    Brandy

  • Ok.  The struct is now completely aligned.  The only thing left to do would be to align each variable on 64 bytes.  What a waste that would be, but I'll try it just to see.

     

    [C66xx_0] [DEBUG] [0000000000:04.670] [CORE0] Config Vars is 256 bytes

    [C66xx_0] [DEBUG] [0000000000:04.670] [CORE0] Assoc  Vars is 64 bytes

    [C66xx_0] [DEBUG] [0000000000:04.675] [CORE0] Solver Vars is 64 bytes

    [C66xx_0] [DEBUG] [0000000000:04.675] [CORE0] Shared Vars is 64 bytes

    [C66xx_0] [DEBUG] [0000000000:04.675] [CORE0] -----------------------

    [C66xx_0] [DEBUG] [0000000000:04.675] [CORE0] Total is 448 bytes

  • Ok, I've since padded the entire Assoc Vars struct so that each variable aligns on 64 bytes.  Then I invalidate only the startFlag.  Still no good.

     

    I moved the "volatile" qualifier to just the variable, then to the AsscVars struct, and again to the entire struct, with no improvement.

     

    I copied the code from the csl function (which is supposed to be inlined anyhow...) directly into my program.

        - If I use the CACHE_invL1dWait() call, the variable is still not read properly.

        - If I use the mfence() call, the code just hangs on the call, waiting forever.  That I don't get but it obviously does not solve my problem either.

     

    Wow, this is really frustrating. I am running out of good (or even bad) ideas!

     

    Thanks,
    Brandy

     

  • I just looked at source code for the CSL cache implementations, and it doesn't seem to include the workaround for advisory 6 in the device errata (www.ti.com/lit/sprz332).  Please see forum post http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/161625.aspx for a possible workaround (which is to use the BIOS calls if you are already using BIOS or to wrap the workaround around the CSL calls).
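A host-runnable sketch of the general shape of such a wrapper, going only by what this thread mentions: interrupts masked around the coherence operation, a wait for completion, and a prefetch-buffer invalidate on the reader. The ordering shown is an assumption, and every name here (`CacheHooks`, `reader_invalidate`, the `host_*` stand-ins) is hypothetical. On the target the hooks would wrap the real intrinsics and CSL or BIOS calls; the authoritative sequence is the one in the errata and the linked post.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hook table; none of these names are TI APIs. */
typedef struct {
    unsigned (*disable_interrupts)(void);
    void     (*restore_interrupts)(unsigned key);
    void     (*invalidate_prefetch)(void);
    void     (*invalidate_l1d_and_wait)(void *addr, size_t bytes);
} CacheHooks;

/* Assumed ordering: mask interrupts, drop the prefetch buffer,
 * invalidate L1D and wait for completion, then unmask. */
static void reader_invalidate(const CacheHooks *hooks, void *addr, size_t bytes)
{
    unsigned key = hooks->disable_interrupts();

    hooks->invalidate_prefetch();
    hooks->invalidate_l1d_and_wait(addr, bytes);
    hooks->restore_interrupts(key);
}

/* Host-side stand-ins that only record the order of the calls. */
static char   g_trace[8];
static size_t g_trace_len;

static void trace_push(char step)
{
    assert(g_trace_len + 1 < sizeof g_trace);
    g_trace[g_trace_len++] = step;
    g_trace[g_trace_len] = '\0';
}

static unsigned host_disable(void)            { trace_push('D'); return 0u; }
static void host_restore(unsigned key)        { (void)key; trace_push('R'); }
static void host_inv_prefetch(void)           { trace_push('P'); }
static void host_inv_l1d(void *a, size_t n)   { (void)a; (void)n; trace_push('I'); }

/* Returns 1 when the recorded order is disable, prefetch, L1D, restore. */
static int host_order_is_dpir(void)
{
    static unsigned char buffer[64];
    const CacheHooks hooks = {
        host_disable, host_restore, host_inv_prefetch, host_inv_l1d
    };

    g_trace_len = 0;
    g_trace[0] = '\0';
    reader_invalidate(&hooks, buffer, sizeof buffer);
    return strcmp(g_trace, "DPIR") == 0;
}
```

Keeping the sequence in one function means every polling loop gets the same ordering, rather than each call site repeating it by hand.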

    I'll follow up in a bit with a CPR (bug tracking ticket) # for the CSL.

  • Hi John,

    Thanks for pointing this out.  I did not know about it.  I tested it with the workaround and it works fine.  I guess since the order in which the instructions occur can cause the problem, that would explain why some of my loops would have it and some wouldn't?  Can you confirm why it would work sometimes depending on where in my code it was?

    Also, would disabling interrupts that often on multiple cores affect the network stack?  I'm just wondering because I had to increase it after I introduced this fix.  But maybe it's just a coincidence.

    Thanks,

    Brandy

  • Brandy,

    If I follow this, John solved the real problem, which was that the CSL was not including the MFENCE work-around.

    My code did not use CSL, only because I did not have immediate access to the CSL in my C6678 test workspace. So my "ugly" code did use the MFENCE and did not have the problem. A comparison would not have helped much because of the problem being in the CSL and not in your code.

    In my build, CG_TOOL_ROOT is set to C:\TI\CCSv5.3\ccsv5\tools\compiler\c6000_7.4.1\  , which I did not set outside of the normal installation process. I maintain multiple copies of CCS and various libraries, so I put all the CCSv5.3 installs into C:\TI\CCSv5.3 to keep them separate from the CCSv5.2 installs, for example. But the path from ccsv5 and lower should be the same for any build. Or it can be set to wherever you have the compiler installed.

    FYI, in CCS, I found this by right-clicking on the project name and selecting Properties, then clicking the expansion arrow on Resource in the left-hand pane, then selecting the Linked Resources item in the left-hand pane.

    Regards,
    RandyP

  • Hi Randy,

     

    I think the solution was that I had to invalidate the prefetch buffer also.  But if I read the errata correctly, you had to do things in a precise order -

    A potential L2 cache corruption issue during block coherence operations has been identified. Under a specific set of circumstances, L1D or L2 block coherence operations can cause L2 cache corruption. The problem arises when the following four actions happen back-to-back in the same L2 set:

    1. L1D write miss
    2. Writeback or invalidate or writeback-with-invalidate due to block coherence operations
    3. Write allocate for some address
    4. Read or write allocate for some address

    This issue applies to all the block coherence operations listed below:

    • L1D writeback
    • L1D invalidate
    • L1D writeback with invalidate
    • L2 writeback
    • L2 invalidate
    • L2 writeback with invalidate

     

    I just followed the solution from the linked post.  Thanks for the code.  It has been a long time since I coded all in registers :)  I have been using TI project for three years now and you have so many libraries, I rarely write to the register directly :)

    Brandy

  • Hello John,

    Can you confirm that I only need to invalidate the prefetch buffer with an invalidate command?  Or do I also need to do it on a writeback/invalidate command?


    Thanks,
    Brandy

  • Hello John,

    Do you know the answer to the above?  Please advise.

     

    Brandy

  • [edited s/alter/later]

    The prefetch invalidate needs to be tied to the reader, not the writer.  The writeback+invalidate is usually done just after handing off the buffer to someone else, so it doesn't matter what is in your prefetch on the core doing the writeback+invalidate (and the prefetch in one xmc is coherent with that xmc's own writes). 

    However, if you later read that location on the same core that previously did the wb+inv, the prefetcher could have sucked in "old" data due to nearby accesses, even if that cache line was never sucked into the cache.  Thus it's possible to need a prefetch invalidate without a cache invalidate.

    See section 7.5.3 of www.ti.com/lit/sprugw0 for a brief overview of xmc prefetch coherence.
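A toy host-side model of the effect described above, to show why a cache invalidate alone can still return old data. This is a sketch, not a description of the XMC hardware: the single-entry "prefetch buffer", the fill policy, and all `toy_*` names are assumptions made for illustration.

```c
#include <stdint.h>

/* Toy reader core: one cache entry and one prefetch entry for the
 * same shared location. */
typedef struct {
    uint32_t memory;          /* value in shared memory          */
    int      cache_valid;
    uint32_t cache_data;
    int      prefetch_valid;
    uint32_t prefetch_data;
} ToyReader;

/* A nearby access makes the prefetcher capture the current value. */
static void toy_nearby_access(ToyReader *r)
{
    r->prefetch_data  = r->memory;
    r->prefetch_valid = 1;
}

/* Another core writes the location and writes back its cache. */
static void toy_remote_write(ToyReader *r, uint32_t value)
{
    r->memory = value;
}

static void toy_invalidate_cache(ToyReader *r)    { r->cache_valid = 0; }
static void toy_invalidate_prefetch(ToyReader *r) { r->prefetch_valid = 0; }

/* On a cache miss the fill comes from the prefetch entry when it is
 * valid, otherwise from memory. */
static uint32_t toy_read(ToyReader *r)
{
    if (!r->cache_valid) {
        r->cache_data  = r->prefetch_valid ? r->prefetch_data : r->memory;
        r->cache_valid = 1;
    }
    return r->cache_data;
}

/* Cache invalidate only: the stale prefetch entry refills the cache. */
static uint32_t toy_read_after_cache_invalidate_only(void)
{
    ToyReader r = {0};

    toy_nearby_access(&r);       /* prefetcher captures the old 0 */
    toy_remote_write(&r, 2u);
    toy_invalidate_cache(&r);
    return toy_read(&r);
}

/* Cache and prefetch invalidate: the read goes to memory. */
static uint32_t toy_read_after_both_invalidates(void)
{
    ToyReader r = {0};

    toy_nearby_access(&r);
    toy_remote_write(&r, 2u);
    toy_invalidate_cache(&r);
    toy_invalidate_prefetch(&r);
    return toy_read(&r);
}
```

In this model the first scenario returns the old value 0 even though memory already holds 2, which matches the symptom reported earlier in the thread; the second scenario returns 2.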

  • Hi John,

     

    Yup, I just found the case where the wb wasn't working.  When I sped my project up with -O2, the timing caused caching issues again.  I invalidated the prefetch buffer for my wbs and now it works.  Just for good measure (so hopefully this doesn't bite me again) I also invalidate the prefetch buffer when I do the inv/wbALL. 

    Thanks again,
    brandy