L2 controller does not handle cache coherency between L2 SRAM and L1D !!!

Other Parts Discussed in Thread: TMS320C6747

Hi all,

I am using a TMS320C6747 DSP, which has a 64x+ core.

The DSP Megamodule document (spru871j) says that there is an L2 memory controller that handles cache coherency between L1D and L2 (when L2 is used as SRAM).

I am using the ACPY3 interface, which uses an IDMA channel to copy memory from one location to another. When I use this API to copy data from SDRAM to L2 while the L2 data is cached in L1D, the cache is not invalidated by the L2 controller as I would expect. But when I use CACHE commands to invalidate or write back the data, it works.

So, isn't there an L2 controller that handles coherency between L1D and L2? Or do I need to configure something to enable it?

Thanks...

  • Hasan,

    The TMS320C6747 DSP has the C674x core which is described in sprufk5. Section 4.3.8 of that document describes the hardware features implemented in the C674x core for maintaining L1D-L2 cache coherency and what conditions are maintained.

    It may just be a typo, but the ACPY3 interface will most likely use a DSP QDMA channel. It may be implemented using functions that are, confusingly, called IDMA3. The C674x has IDMA0 and IDMA1 for specific types of transfers that can be performed inside the Megamodule without using the EDMA3; ACPY3 will use the EDMA3 through the QDMA channels, to the best of my understanding.

    If you are seeing cache coherency problems between L2SRAM contents and L1D cached copies of that L2SRAM, please provide detailed specific examples that will allow us to duplicate the problem. Screen shots from CCS with the different cache visibility options selected may be helpful.

    In the Megamodule Guide I referenced above, it says in Table 4-9 that when a DMA write to L2SRAM occurs, then the corresponding L1D action will be:

    SPRUFK5 Table 4-9 said:
    Up to 256 bits of new data is sent from L2 to L1D. L1D and L2 both update their respective copies of the data. The dirty and valid bits for the line in L1D do not change.

    You can see that the L1D contents will be updated rather than being invalidated.

    If L1D has dirty contents that have not been written to L2, but the DMA operation writes to those locations in L2, then the contents of L1D are also overwritten with the new data from the DMA operation and are still marked as dirty and valid. Could this be related to your observations?
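    The Table 4-9 behavior quoted above can be sketched as a small host-side toy model. This is only an illustration, not the real hardware: the `ToyMem` type and the `cpu_read`, `cpu_write` and `dma_write` names are hypothetical, the model tracks a single line, and the 64-byte line size and read-allocate policy are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64 /* assumed L1D line size for this toy model */

/* One L1D line shadowing one line-sized region of L2 SRAM (hypothetical types). */
typedef struct {
    uint8_t l2[LINE_BYTES];      /* contents of L2 SRAM */
    uint8_t l1d[LINE_BYTES];     /* cached copy held in L1D */
    bool valid;                  /* L1D line holds a copy of the L2 region */
    bool dirty;                  /* L1D copy was modified by the CPU */
} ToyMem;

/* CPU read: a miss allocates the line from L2 (read-allocate assumption). */
static uint8_t cpu_read(ToyMem *m, size_t off)
{
    assert(off < LINE_BYTES);
    if (!m->valid) {
        memcpy(m->l1d, m->l2, LINE_BYTES);
        m->valid = true;
        m->dirty = false;
    }
    return m->l1d[off];
}

/* CPU write: a hit updates only L1D and marks the line dirty;
 * a miss goes straight to L2 without allocating a line. */
static void cpu_write(ToyMem *m, size_t off, uint8_t v)
{
    assert(off < LINE_BYTES);
    if (m->valid) {
        m->l1d[off] = v;
        m->dirty = true;
    } else {
        m->l2[off] = v;
    }
}

/* DMA write to L2 SRAM: L2 is updated, and if the line is cached the L1D
 * copy is updated too. The valid and dirty bits are left unchanged. */
static void dma_write(ToyMem *m, size_t off, uint8_t v)
{
    assert(off < LINE_BYTES);
    m->l2[off] = v;
    if (m->valid) {
        m->l1d[off] = v;
    }
}
```

    In this model, a CPU read followed by a CPU write leaves a dirty line in L1D; a later `dma_write` to the same offset is then visible to the next `cpu_read` without any manual invalidate, and the line stays dirty.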

  • RandyP said:

    If you are seeing cache coherency problems between L2SRAM contents and L1D cached copies of that L2SRAM, please provide detailed specific examples that will allow us to duplicate the problem.

    Here is an example of the problem:

    #define SIZE (512)

    float L2_destination[SIZE], SD_source[SIZE];

    for(i = 0; i<SIZE; i++)
    {
        L2_destination[i] = 1;
    }  //L2_destination[*] = 1;

    for(i = 0; i<SIZE; i++)
    {
        SD_source[i] = 2;
    } //SD_source[*] = 2;

    for(i = 0; i<SIZE; i++) // a dummy operation just to be cached
    {
        L2_destination[i] = L2_destination[i]+5;
    }

    ACPY3_transferAndWaitToBeCompleted(L2_destination, SD_source, SIZE*sizeof(float),&handleDma); //L2_destination[*]= 2;

    for(i = 0; i<SIZE; i++)
    {
        L2_destination[i] = L2_destination[i]+7;
    }

    At the end, I expect every element of L2_destination to be 2+7 = 9, but what I see is 1+5+7 = 13. In other words, the data in the cache is used instead of the copied data.

    When I use cache invalidate commands, on the other hand, there is no problem: every element of L2_destination is 9. But I don't want to use these commands if the L2 controller can handle this situation automatically.

    RandyP said:

    Screen shots from CCS with the different cache visibility options selected may be helpful.

    I can see L2 is cached in L1D when I use cache visibility in memory window.

     

     

  • hasan turken said:
    ACPY3_transferAndWaitToBeCompleted(L2_destination, SD_source, SIZE*sizeof(float),&handleDma); //L2_destination[*]= 2;

    This must be an oversight of mine, but I do not find this function on my computer. I do not have all packages installed for every device, but I do have the PSP 1.30. Did this come from Codec Engine? Does it set OPT.TCCMODE=NORMAL?

    From what you have supplied so far, I do not see why you should be getting the results that you do. The L2-L1D coherency is not something you turn on or off, so there is not anything extra for you to do to get it to work. I agree that your results are troublesome, so a little more investigation is needed.

    Please indicate the following in case someone is able to try to duplicate this:

    • Physical addresses for L2_destination and SD_source
    • Are they global or local variables?
    • L1D cache size setting
    • L2 cache size setting
    • What specific manual cache command do you use for invalidation and where do you place it?
    • In the CCS Memory window, what values are in L2 when you un-check the L1D box?
  • RandyP said:

    This must be an oversight of mine, but I do not find this function on my computer. I do not have all packages installed for every device, but I do have the PSP 1.30. Did this come from Codec Engine? Does it set OPT.TCCMODE=NORMAL?

    • I don't know how to set it, and I didn't see it in any ACPY3 document.
    • It is my own function; sorry, I should have mentioned that. I wrote it to bundle the ACPY3 configuration together.

    int ACPY3_transferAndWaitToBeCompleted(void* dst,void* src,int size,IDMA3_Handle* dmaHandle)
    {
        ACPY3_Params tcfg;

        tcfg.transferType = ACPY3_1D1D;
        tcfg.numElements = 1;
        tcfg.numFrames = 1;
        tcfg.waitId = 0;
        tcfg.dstElementIndex = sizeof(char);

        ACPY3_configure (*dmaHandle, &tcfg, 0);
        ACPY3_fastConfigure32b(*dmaHandle,ACPY3_PARAMFIELD_SRCADDR,(Uns)src,0);
        ACPY3_fastConfigure32b(*dmaHandle,ACPY3_PARAMFIELD_DSTADDR,(Uns)dst,0);
        ACPY3_fastConfigure16b(*dmaHandle,ACPY3_PARAMFIELD_ELEMENTSIZE,size*sizeof(char),0);

        ACPY3_init();
        ACPY3_start (*dmaHandle);
        ACPY3_wait (*dmaHandle);

        return 0;
    }

    • I call this initialization function once at the beginning of my main function:

    int ACPY3_initialize(IDMA3_Handle* dmaHandle,IDMA3_ChannelRec* dmaTab)
    {
        short status;

        DMAN3_PARAMS.heapInternal = INT_L2;
        DMAN3_PARAMS.heapExternal = EXT_SD;

        dmaTab->numTransfers = 1;
        dmaTab->numWaits = 1;
        dmaTab->priority = IDMA3_PRIORITY_LOW;
        dmaTab->protocol = &ACPY3_PROTOCOL;
        dmaTab->persistent = FALSE;

        DMAN3_init();
        status = DMAN3_createChannels(0, dmaTab, 1);

        if (status == DMAN3_SOK) LOG_printf(&trace,"DMAN3 initialization is ok\n");
        else LOG_printf(&trace,"DMAN3 initialization error!!!\n");

        *dmaHandle = dmaTab->handle;
        ACPY3_activate(*dmaHandle);

        return 0;
    }

    RandyP said:

    Please indicate the following in case someone is able to try to duplicate this:

    • Physical addresses for L2_destination and SD_source
    • Are they global or local variables?

    • They are global variables, and I placed the #pragma DATA_SECTION directives before the main function as shown below. So the address of L2_destination is something like 0x1180xxxx and the address of SD_source is something like 0xC000xxxx.

    #pragma DATA_ALIGN(L2_destination,64)//to be in safe side for cache operations
    #pragma DATA_ALIGN(L2_source,64)
    #pragma DATA_ALIGN(SD_destination,64)
    #pragma DATA_ALIGN(SD_source,64)

    #pragma DATA_SECTION(L2_source,".L2"); //.L2,.L3 or .SD
    #pragma DATA_SECTION(L2_destination,".L2");
    #pragma DATA_SECTION(SD_source,".SD");
    #pragma DATA_SECTION(SD_destination,".SD");

    #define SIZE (512)

    float L2_source[SIZE];
    float L2_destination[SIZE];
    float SD_source[SIZE];
    float SD_destination[SIZE];

    int main(void)
    {
        float fTime1,fTime2,fCPUCycles,fTimeAbsolute;
        int i;

        ACPY3_initialize(&handleDma,&tabDma);

        ...

        ACPY3_transferAndWaitToBeCompleted(L2_destination, SD_source, SIZE*sizeof(float),&handleDma);
    }

    RandyP said:
    • L1D cache size setting
    • L2 cache size setting
    • What specific manual cache command do you use for invalidation and where do you place it?
    • In the CCS Memory window, what values are in L2 when you un-check the L1D box?

    • L1D is 32 KB of cache, in other words all cache
    • L2 is all SRAM, no cache
    • I used the BCACHE commands from DSP/BIOS
    • When I uncheck it, I see that all the data has been transferred correctly: every value is 2 just after the transfer.

    Something new:

    When I step through the code in the debugger, the L1D cache is updated correctly after the transfer even though I do not use any cache commands. But when there is no breakpoint after the transfer, the first one or two elements are miscalculated. I changed the size and the result is the same, even for SIZE=2: again the two elements are miscalculated. When I uncheck the L1D cache in the memory window, all values are "2". Below is a screenshot of this issue.

     

    By the way, thanks for your interest...

  • hasan turken said:

    Something new:

    When I step through the code in the debugger, the L1D cache is updated correctly after the transfer even though I do not use any cache commands. But when there is no breakpoint after the transfer, the first one or two elements are miscalculated. I changed the size and the result is the same, even for SIZE=2: again the two elements are miscalculated. When I uncheck the L1D cache in the memory window, all values are "2". Below is a screenshot of this issue.

    Looks like there is a race condition. This is what you would expect to happen if some interrupt started a transfer running and then your code was accessing the same area of memory at the same time the DMA transfer is in progress. To test this theory, please insert the following code between the ACPY3_transfer and the for-loop:

    {
    volatile int DCnt;
    for ( DCnt = 0; DCnt < 10000; DCnt+=2 ) DCnt--;
    }
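    To make the suspected race concrete, here is a host-side toy simulation; it is a sketch only and does not use or model the real EDMA3 or ACPY3 code. The `simulate` function and its `latency`, `rate` and `wait_for_dma` parameters are hypothetical. The DMA starts writing after `latency` ticks and then copies `rate` elements per tick, while the CPU adds 7 to one element per tick.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a DMA copy racing with a CPU loop over the same buffer.
 * If wait_for_dma is false, the CPU starts at tick 0 (as if the wait call
 * returned too early); if true, the CPU starts only after the copy is done. */
static void simulate(float *dst, const float *src, size_t n,
                     size_t latency, size_t rate, bool wait_for_dma)
{
    size_t copied = 0; /* elements the DMA has written so far */
    size_t done = 0;   /* elements the CPU has processed so far */
    size_t tick = 0;

    assert(rate > 0);
    while (copied < n || done < n) {
        /* DMA side: nothing happens until the start-up latency has passed. */
        if (tick >= latency) {
            for (size_t k = 0; k < rate && copied < n; k++, copied++) {
                dst[copied] = src[copied];
            }
        }
        /* CPU side: one read-modify-write per tick. */
        if (done < n && (!wait_for_dma || copied == n)) {
            dst[done] += 7.0f;
            done++;
        }
        tick++;
    }
}
```

    With a buffer of eight elements initialized to 13, a source of 2s, `latency = 2` and `rate = 4`, the run without waiting ends with wrong values in the first two elements only, while the run that waits ends with 9 everywhere. That is the same pattern as the "first 1-2 elements" symptom, although the real timing on the device may of course differ.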

     

  • When I insert that code, it works. I see what I expect to see in the memory window.

    In fact, anything that "waits" between ACPY3_transfer and the for-loop seems to solve the problem.

    I timed the ACPY3_wait call inside ACPY3_transferAndWaitToBeCompleted and found that it always waits for the same amount of time, whatever the transfer size.

    For example, a transfer of 64 floats takes 0.0017 ms, and a transfer of 8192 floats also takes 0.0017 ms.

    So there is something wrong with the wait function: it always waits for the same amount of time regardless of the transfer size!

    Am I missing something in the DMAN or ACPY3 configuration?

    Is any part of this configuration device-specific?

    I am using the 6747, but I did not make any configuration specific to my device. The data is in fact transferred without any problem, but I cannot tell when the transfer has completed!
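    As a sanity check on the timings above, the throughput they would imply can be worked out with a few lines of C. The helper name `implied_mb_per_s` is hypothetical, and the sketch assumes 4-byte floats and 1 MB = 1e6 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Implied throughput in MB/s (1 MB = 1e6 bytes) for a transfer of
 * `bytes` bytes that is reported to take `millis` milliseconds. */
static double implied_mb_per_s(size_t bytes, double millis)
{
    assert(millis > 0.0);
    return ((double)bytes / (millis * 1e-3)) / 1e6;
}
```

    For 64 floats (256 bytes) in 0.0017 ms this gives roughly 150 MB/s, while 8192 floats (32768 bytes) in the same 0.0017 ms would require roughly 19 GB/s. The second figure looks implausibly high for a copy out of external SDRAM, which fits the suspicion that the measured time is not the time the transfer actually takes.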

  • I cannot help you with the ACPY3 functions. Your best option may be to post a new question onto the appropriate Embedded Software forum.