Cache Coherence problems when SYS/BIOS (BIOSv6) is not used

Vikram Ragukumar

Other Parts Discussed in Thread: TMS320C6472

Hello,

We are running some experiments with SYS/BIOS and "NON-BIOS" projects under CCS4 with an EVMC6472 board.

We have observed that hardware cache coherence between L1D cache and L2 SRAM for CPU accesses is handled differently when SYS/BIOS is used than when it is not. Possibly, SYS/BIOS sets up some cache configuration registers in the background, and the same setup needs to be done in the NON-BIOS project. Coherence between L1D and L2 SRAM for CPU accesses seems to be handled automatically when SYS/BIOS is used.

Given below is a snippet of code from the example application program.

#pragma DATA_SECTION(Ready_Flag, "SharedMem")
volatile int Ready_Flag = 0;

int main() {
    ...
    ...
    volatile unsigned int *pL1DCFG = (volatile unsigned int *)0x01840040;  /* L1D cache configuration */
    volatile unsigned int *pL1dINV = (volatile unsigned int *)0x01845048;  /* L1D global invalidate */
    *pL1DCFG = 0x7;
    *pL1dINV = 0x1;
    printf("Begin\n");
    if (DNUM == 0) {
      Ready_Flag = 1;
    }
    while (Ready_Flag != 1);
    printf("Done\n");
    ...
    ...
}

This is the only user source file, and is identical in both projects.
In the code shown above, L1D cache is enabled by writing to the L1DCFG register. When the generated .out file is run on all 6 cores, the expected output is that Cores 1-5 wait in the blocking while loop until Core 0 writes a value of 1 to Ready_Flag, after which all cores print a "Done" message.

The SYS/BIOS project runs on 6 cores and its output matches the expected output. However, when the NON-BIOS project is run on 6 cores, only Core 0 prints the "Done" message. Upon investigating further, we determined that this was due to incoherence in the cache: if we disable L1D caching, the NON-BIOS project also produces the expected output. Hence it appears that SYS/BIOS handles cache coherence in the background, and this is not done in the NON-BIOS project. Could someone point out which registers we should look at to enable cache coherence management in hardware?

Also, upon reading section 2.4 (Cache Coherence) in the C64x Cache User's Guide (SPRU862B), it appears that cache coherence is automatically maintained only on DMA reads and writes. Is this correct? Does the hardware cache coherence protocol also apply to CPU reads and writes (between L1D cache and L2 SRAM)?

Thanks and Regards,
Vikram.


0 RandyP over 14 years ago

TI__Guru* 84110 points

Vikram,

The best document for you to read is the C64x+ Megamodule Reference Guide, SPRU871(k). It describes how the cache is architected and how the coherency operations work.

Check out the CAUTION near the bottom of page 56 about using the L1DINV command that you are using.

Also, read the instructions for changing cache mode at run-time on page 53, Table 3-3.

I do not know if these are the issues that you are dealing with, but what you are doing needs to be avoided, per the sections I have referenced.
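As a rough illustration of the run-time mode change mentioned above, the sketch below writes the L1D mode field and then reads it back, so that the caller only continues once the register reflects the new mode. This is a minimal sketch and not a reproduction of Table 3-3: the helper name `l1d_set_mode`, the 3-bit mode mask, and the read-back step are assumptions that should be checked against the reference guide. The register is passed in as a pointer so the helper can also be exercised against an ordinary variable.

```c
#include <stdint.h>

/* L1DCFG address as used in the original post. */
#define L1DCFG_ADDR   0x01840040u
/* Assumption: the L1D mode occupies the low 3 bits of L1DCFG. */
#define L1D_MODE_MASK 0x7u

/*
 * Hypothetical helper: request an L1D cache mode and read it back.
 * The read-back is assumed to be the way to make sure the mode change
 * has taken effect before the caller touches cached data.
 * Returns the mode field as read back from the register.
 */
static inline uint32_t l1d_set_mode(volatile uint32_t *l1dcfg, uint32_t mode)
{
    *l1dcfg = mode & L1D_MODE_MASK;   /* request the new mode  */
    return *l1dcfg & L1D_MODE_MASK;   /* read back the setting */
}
```

On the target this would be called as `l1d_set_mode((volatile uint32_t *)L1DCFG_ADDR, 0x7)`. Any write-back or invalidate steps that the reference guide requires around the mode change are deliberately left out here.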

You should be able to debug DSP-related (meaning not related to other masters like EDMA3) cache issues using the CCS memory window. It shows clearly, with highlighting, which values are held in cache and which values are held in the target memory.

Vikram Ragukumar said:
Also, upon reading section 2.4 (Cache Coherence) in the C64x Cache User's Guide (SPRU862B), it appears that cache coherence is automatically maintained only on DMA reads and writes. Is this correct? Does the hardware cache coherence protocol also apply to CPU reads and writes (between L1D cache and L2 SRAM)?

I think you will benefit from some of our training videos on the C6472. In the Training section of TI.com, there is a training video set for the C6472. It may be helpful for you to review all of the modules. But in particular, the Memory & Cache Module will apply to your current questions. You can find the complete video set at http://focus.ti.com/docs/training/catalog/events/event.jhtml?sku=OLT110001 .

In a nutshell, cache exists for the sake of the CPU. By definition the CPU does not know that it has cache, and by design the CPU does not require cache coherency protocols. Cache coherency issues arise when other masters, of which DMA is one, access memory that is cached for the benefit of a CPU.

I hope some of this will help.

Regards,
RandyP

If this answers your question, please click the Verify Answer button below. If not, please reply back with more information.

0 Vikram Ragukumar over 14 years ago in reply to RandyP

Prodigy 120 points

RandyP,

Thank you for your response.

I should have stated explicitly in my post that I am referring to cache incoherence between L1D cache and Shared L2 RAM. The problem is that Cores 1-5 cache the value of Ready_Flag, so when Core 0 later writes the new value "1" into Ready_Flag in Shared L2 RAM, the change is not reflected in the values that Cores 1-5 read from their caches.

However by forcing a Cache Invalidate, Cores 1-5 read the correct value because the L1D Cache gets updated through the Read from L2 after the invalidate operation.

The workaround we use is shown below:

#pragma DATA_SECTION(Ready_Flag, "SharedMem")
volatile int Ready_Flag = 0;  /* Ready_Flag is in Shared L2 RAM */

int main() {
    ...
    ...
    volatile unsigned int *pL1DCFG = (volatile unsigned int *)0x01840040;  /* L1D cache configuration */
    volatile unsigned int *pL1dINV = (volatile unsigned int *)0x01845048;  /* L1D global invalidate */
    *pL1DCFG = 0x7;
    printf("Begin\n");
    if (DNUM == 0) {
      Ready_Flag = 1;
    }
    while (Ready_Flag != 1) {
        /* This invalidate forces a read from Shared L2 RAM to obtain the value of Ready_Flag. */
        *pL1dINV = 0x1;
    }
    printf("Done\n");
    ...
    ...
}

I would like to know if there is a way to let the hardware handle this cache coherency problem, without requiring the application to invalidate cache entries whenever something in Shared L2 RAM is changed.

Thanks,

Vikram.

PS : Thank you for your notes on the use of L1DINV, we understand the side effects of using this and will use a block invalidate instead.
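A block invalidate of the kind mentioned in the PS could look roughly like the sketch below. It is a minimal sketch under stated assumptions: the register addresses for L1DIBAR and L1DIWC, the 64-byte L1D line size, and the "poll the word count until it reads zero" completion check are taken from memory of the C64x+ Megamodule Reference Guide and should be verified there. The names `block_range_t`, `l1d_block_range` and `l1d_block_invalidate` are hypothetical. The address arithmetic is kept in a separate pure function so that it can be checked on a host.

```c
#include <stdint.h>

/* Assumed C64x+ register addresses; verify against the reference guide. */
#define L1DIBAR_ADDR   0x01844048u  /* L1D invalidate base address register */
#define L1DIWC_ADDR    0x0184404Cu  /* L1D invalidate word count register   */
#define L1D_LINE_BYTES 64u          /* assumed L1D cache line size          */

typedef struct {
    uint32_t base;   /* line-aligned start address      */
    uint32_t words;  /* length in 32-bit words          */
} block_range_t;

/* Round [addr, addr + bytes) outward to whole L1D cache lines. */
static inline block_range_t l1d_block_range(uint32_t addr, uint32_t bytes)
{
    block_range_t r;
    uint32_t mask  = ~(L1D_LINE_BYTES - 1u);
    uint32_t start = addr & mask;
    uint32_t end   = (addr + bytes + L1D_LINE_BYTES - 1u) & mask;

    r.base  = start;
    r.words = (end - start) / 4u;
    return r;
}

/*
 * Start a block invalidate and wait for it to finish.
 * Assumption: writing the word count starts the operation and the
 * register reads back as zero once the operation has completed.
 * On the target, pass (volatile uint32_t *)L1DIBAR_ADDR and
 * (volatile uint32_t *)L1DIWC_ADDR.
 */
static inline void l1d_block_invalidate(volatile uint32_t *ibar,
                                        volatile uint32_t *iwc,
                                        block_range_t r)
{
    *ibar = r.base;
    *iwc  = r.words;
    while (*iwc != 0u) {
        /* spin until the hardware reports completion */
    }
}
```

In the polling loop above, the global `*pL1dINV = 0x1;` would be replaced by a call such as `l1d_block_invalidate(ibar, iwc, l1d_block_range((uint32_t)&Ready_Flag, sizeof Ready_Flag))`. Because the invalidate covers whole cache lines, it seems safest to place Ready_Flag on a cache line of its own so that unrelated dirty data sharing the line is not discarded.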

0 RandyP over 14 years ago in reply to Vikram Ragukumar

TI__Guru* 84110 points

Vikram,

In my reply on another thread here, I attached an example project (C6472_Edma_IPC_BIOS) for the C6472 that has the core-sync code that I like to use. You could use some of the ideas in that example to make your code a bit simpler. I honestly did not check this part rigorously with the C6472, but I did check it with the C6474 for which it was originally written - I might have missed something in the transfer so please let me know if you find a mistake. In particular, in main.c starting on about line 350 is the following code:

C6472_Edma_IPC_BIOS main.c line 350 said:
    // all cores need to wait here until Core0 has finished with the EDMA setup call above
#if 1
    {
    Uint32                        uSum;
    Uint32                        uMarSave;
    Uint32                        uMarBlock;
    uMarBlock = ((Uint32)InterCoreComm) >> 24;                            // determine MAR block
    uMarSave = ((CSL_CacheRegsOvly)CSL_CACHE_0_REGS)->MAR[uMarBlock];    // save MAR
    ((CSL_CacheRegsOvly)CSL_CACHE_0_REGS)->MAR[uMarBlock] = 0;             // clear MAR
    while (((CSL_CacheRegsOvly)CSL_CACHE_0_REGS)->MAR[uMarBlock] == 1); // wait

    for ( i = 0; i < NUM_CORES; i++ ) InterCoreComm[i] = 0; // clear all words, done by all cores
    do {
        if ( InterCoreComm[CoreNum] == 0 ) // if this core's is 0, set to 1
            InterCoreComm[CoreNum] = 1;
        for ( i = 0, uSum = 0; i < NUM_CORES; i++ ) // add up all the locations
            uSum += InterCoreComm[i];
    }
    while ( uSum != NUM_CORES ); // wait until all cores' ICC locs are set to 1

    ((CSL_CacheRegsOvly)CSL_CACHE_0_REGS)->MAR[uMarBlock] = uMarSave;    // restore this MAR
    // at this point all cores are running at this same location
    }
#endif

The entire project may answer questions like where the memory locations are and where the array declaration/definition is.

Any cache coherency command, global or block, is a multi-cycle operation. There is more to using these than you are implementing. You need to read the sections I pointed to before and include everything. But better yet, do not do them at all and just turn off caching during the sync time using the MARs.
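The `>> 24` in the code above selects the MAR that covers a given address, since each MAR controls cacheability for one 16 MB block. The sketch below makes that arithmetic explicit. It is only an illustration: the MAR base address of 0x01848000 is an assumption to verify against the reference guide, the helper names `mar_index` and `mar_reg_addr` are hypothetical, and the addresses in the usage note are arbitrary examples rather than the location of InterCoreComm.

```c
#include <stdint.h>

/* Assumed address of MAR0; each MAR is one 32-bit register. */
#define MAR_BASE_ADDR 0x01848000u

/* Index of the MAR that covers addr (one MAR per 16 MB block). */
static inline uint32_t mar_index(uint32_t addr)
{
    return addr >> 24;
}

/* Address of the MAR register that covers addr. */
static inline uint32_t mar_reg_addr(uint32_t addr)
{
    return MAR_BASE_ADDR + 4u * mar_index(addr);
}
```

For example, an address of 0x80000000 maps to MAR128, whose register would sit at 0x01848200 under the base address assumed here. Clearing that register makes the whole 16 MB block non-cacheable for the core that writes it, which is why the example code saves and restores the original value around the sync loop.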

Regards,
RandyP


0 Vikram Ragukumar over 14 years ago in reply to RandyP

Prodigy 120 points

RandyP,

From the C64x+ Megamodule Reference Guide,

(a) Section 4.3.8 (L1/L2 Coherence Support), states that "Coherence between the C64x+ megamodule's L2 RAM segments and L1D cache is maintained". Does the usage of L2 RAM in the previous statement not include Shared L2 RAM (port 1 of L2 RAM) ? If it does include Shared L2 RAM, then why is there a requirement for the application to disable caching when shared memory writes occur ? Isn't coherency supposed to be maintained by Snooping ?

(b) Section 4.4.4 states that the Memory Attribute Registers (MARs) are used to control the cacheability of memory spaces external to the megamodule. Both L1D cache and Shared L2 RAM are internal to the megamodule, so why would we need to use MARs in our application, as you suggested in your previous post?

I can't find anywhere in TI C64x+ documentation where it says that snooping circuitry maintains cache coherency between CPU initiated Shared L2 RAM data accesses. So right now it looks like cache coherence operations are not supported for inter-core communication through Shared L2 RAM (using basic load/store instructions, not DMA). Can you confirm?

Thanks and Regards,

Vikram.

0 RandyP over 14 years ago in reply to Vikram Ragukumar

TI__Guru* 84110 points

Vikram,

When I start working with a new TI processor, I go to the Product Folder such as TMS320C6472 and download the datasheet, errata, all the User/Reference Guides, and all the Application Notes. I use an excellent desktop search program that allows me to search all of the docs for any keywords that I may be interested in.

When I searched the Megamodule RG for the word "shared", it only showed up once, to say that the megamodule includes an interface that can reach shared memory. There is no shared memory as part of the C64x+ Megamodule.

In the list of User's Guides for the C6472, you will find one called the Shared Memory Controller User's Guide. This document may answer all of your questions.

Vikram Ragukumar said:
Section 4.3.8 (L1/L2 Coherence Support), states that "Coherence between the C64x+ megamodule's L2 RAM segments and L1D cache is maintained". Does the usage of L2 RAM in the previous statement not include Shared L2 RAM (port 1 of L2 RAM) ?

No, it does not. See page 8 of the Shared Memory Controller User's Guide.

Also, you will want to search for "coherency" in this document. This may answer your remaining questions.

I agree that this is a confusing point about the memory of the C6472. Advantageous hooks were put into the architecture for performance and ease-of-use. But some of these make it harder to understand.

I would have hoped that the training video(s) I referenced above would have been more help in bridging this gap between the documentation and the system as a whole. Did you find the Memory & Cache module helpful?

Regards,
RandyP
