Query on usage of Cache_Inv() with no_wait

Other Parts Discussed in Thread: SYSBIOS

Hi, 

See the two cases below:

case 1:

block {
    Cache_inv() of 500 KB with no_wait
    DSP processing of ~200,000 cycles
    Cache_wait()
}

case 2:

block {
    Cache_inv() of 500 KB with wait
    DSP processing of ~200,000 cycles
}
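
In terms of the actual API calls, I mean roughly the following (a sketch; pu1_ptr and the 500 KB size are just placeholders, and it assumes #include <ti/sysbios/family/c66/Cache.h>):

    /* case 1: start the invalidate, overlap it with processing, then wait */
    Cache_inv(pu1_ptr, 500 * 1024, Cache_Type_ALL, FALSE);
    /* DSP processing of ~200,000 cycles */
    Cache_wait();

    /* case 2: invalidate and wait for completion up front */
    Cache_inv(pu1_ptr, 500 * 1024, Cache_Type_ALL, TRUE);
    /* DSP processing of ~200,000 cycles */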

In both cases, the cycle count profiled across the block is the same. Is there any problem with the above usage? How can I make the best use of cache invalidate with no_wait?

Regards,

Jeeva

  • Jeeva,

    The Cache_inv() API works by invalidating a block of 65,280 words (261,120 bytes) at a time until the entire requested block has been invalidated.

    Each block invalidation takes a finite amount of time and must be finished prior to starting the next block invalidation operation.

    The 'wait' argument determines whether Cache_inv() waits for the LAST block invalidate operation to complete before returning.

    Consequently, for large block sizes, the length of time spent in Cache_inv() is dominated by the time it takes to invalidate each 65,280 words of memory.

    As an experiment, you might try breaking your 500KB Cache_inv() operation into several calls of 261,120 bytes each.
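
    Something along these lines (just a sketch; the buffer pointer, the wrapper function name, and the total size are placeholders):

        #include <xdc/std.h>
        #include <ti/sysbios/family/c66/Cache.h>

        #define CHUNK_SIZE 261120u   /* 65,280 words = one block invalidate operation */

        Void invInChunks(Ptr buf, SizeT totalSize)
        {
            SizeT offset;

            for (offset = 0; offset < totalSize; offset += CHUNK_SIZE) {
                SizeT len = (totalSize - offset) < CHUNK_SIZE ?
                            (totalSize - offset) : CHUNK_SIZE;

                /* wait=FALSE: return as soon as this block operation is started */
                Cache_inv((Ptr)((UInt8 *)buf + offset), len, Cache_Type_ALL, FALSE);
            }

            Cache_wait();   /* wait for the final block invalidate to complete */
        }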

    I believe you'll notice a performance difference on the very first call to Cache_inv() with wait=true vs wait=false.

    However, subsequent calls will have to wait for the previous operation to complete before the next invalidate operation can begin.

    This experiment will not improve the overall time to invalidate the entire block. It will just serve to demonstrate the value of the 'wait' argument.

    Alan

  • In my experiment, when I profile across Cache_inv(10240 bytes) with or without the wait flag, the cycles consumed by the DSP are the same. The wait flag has no influence.

    I am using CCSv5 with xdctools_3_22_04_06. Please let me know what could cause this issue.

    - Jeeva

  • Your experiment must measure one call to Cache_inv().

    Back to back calls to Cache_inv() will internally wait for the previous Cache_inv() operation to complete before executing the current request.
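
    For example (a sketch; the buffer names and sizes are placeholders):

        Cache_inv(bufA, 261120, Cache_Type_ALL, FALSE);   /* returns once the operation is started       */
        Cache_inv(bufB, 261120, Cache_Type_ALL, FALSE);   /* internally waits for the first one to finish */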

    Alan

  • Yes, I measured just one call.

    -Jeeva

  • Please share the code you are using to measure the API times as well as the measured values.

    Alan

  • Which device or which Cache are you using?  I assume either C64P or C66, which one?

    Judah

  • Alan and Judah,

    The device being used here is an 8-core TMS320C66x device. The code usage is as follows:

    pu1_ptr is the SL2 memory pointer.

    /* TSCL is the time-stamp counter register, from <c6x.h> */
    unsigned int start, end, Inv_Cycles, Wait_Cycles;

    /* first block */
    Cache_inv(pu1_ptr, 10240, Cache_Type_ALL, FALSE);
    Cache_wait();

    while (!end_of_processing)   /* pseudocode loop condition */
    {
        start = TSCL;

        /* second or (n+1)th block */
        Cache_inv(pu1_ptr, 10240, Cache_Type_ALL, FALSE);

        end = TSCL;
        Inv_Cycles = end - start;

        /* algorithm uses the first or nth block of SL2 memory */

        start = TSCL;
        Cache_wait();   /* for the second or (n+1)th block */
        end = TSCL;
        Wait_Cycles = end - start;
    }

    The above code is scheduled across all 8 cores, and profiling is done for each core.

    Since I am using a separate Cache_wait() after each invalidate, my assumption is that Inv_Cycles should only cover starting the cache invalidation, and the invalidation time itself should show up in Wait_Cycles. But Inv_Cycles is high, as shown below, Wait_Cycles is negligible, and there is no difference between the wait flag being TRUE or FALSE.

    Core        Wait flag   Cycles
    [C66xx_0]   FALSE       11,735
    [C66xx_1]   FALSE       11,788
    [C66xx_2]   FALSE       11,788
    [C66xx_3]   FALSE       11,788
    [C66xx_4]   FALSE       11,790
    [C66xx_5]   FALSE       11,788
    [C66xx_6]   FALSE       11,790
    [C66xx_7]   FALSE       11,788
    [C66xx_0]   TRUE        11,735
    [C66xx_1]   TRUE        11,735
    [C66xx_2]   TRUE        11,735
    [C66xx_3]   TRUE        11,735
    [C66xx_4]   TRUE        11,735
    [C66xx_5]   TRUE        11,735
    [C66xx_6]   TRUE        11,735
    [C66xx_7]   TRUE        11,735
  • I think I know what's going on. Could you try this out? In your *.cfg file, do the following:

        var Cache = xdc.useModule('ti.sysbios.family.c66.Cache');
        Cache.atomicBlockSize = 0;

    We worked around a silicon bug by breaking up a huge buffer into smaller chunks. If you set atomicBlockSize = 0, this chunking doesn't happen and the cache operation uses the maximum size the cache hardware can handle.

    See the silicon errata sprz331a, Advisory 14.

    Judah

  • Thanks Judah.

    With Cache.atomicBlockSize = 0, the cache invalidate call now takes fewer cycles, and I can schedule it better to improve performance.

    But I have a doubt regarding the workaround mentioned in Advisory 14.

    In my project, MSMC RAM is configured in shared L3 mode, which is cached first in L2 and then in L1D. Most of the time, my design uses "cache invalidate" on SL2 data in L2.

    Advisory 14 from the Silicon errata sprz331a says that

    "The workaround requires that the memory system be idle during the block coherence operations. Hence programs must wait for block coherence operations to complete before continuing.This applies to L1D and L2 memory block coherence operations."

    Will "Cache.atomicBlockSize = 0;" break the workaround done to avoid the cache corruption?

    is it safe to use delayed cache wait with "Cache.atomicBlockSize = 0"?

    -Jeeva

  • Workaround 2 mentioned in Advisory 32 is as follows:

    This workaround is also generic, but will allow CPU traffic to go on in parallel with cache coherence operations. To issue a block coherence operation, follow the sequence below:

    1. Issue an MFENCE command.
    2. Freeze the L1D cache.
    3. Start the L1D WBINV.
    4. Restart CPU traffic (CPU operations happen in parallel with the WBINV).
    5. Poll the WC register until the word count field reads 0.
    6. The WBINV completes when the word count field reads 0.
    7. Issue an MFENCE command.

    So, as a programmer, do I need to follow the above steps when using cache operations, or is this taken care of inside the Cache API calls themselves?

    -Jeeva

  • Jeeva,

    That is correct. By setting atomicBlockSize = 0, it is possible that you may encounter Advisory 14. If you don't set atomicBlockSize = 0, the code is compliant with Advisory 14, so you don't have to do anything extra. Below I describe how you can set atomicBlockSize = 0 and still keep your code compliant.

    The main reason we implemented this atomicBlockSize, instead of just making the call always compliant with Advisory 14, is that the workaround requires disabling interrupts. If you're working on a very large buffer, disabling interrupts could mean a large latency, which some people may not want. That's why we break up the large buffer into smaller chunks to allow interrupts.

    Now, if the latency doesn't really matter to you, you could set atomicBlockSize = 0 and then put your Cache call within a Hwi_disable/restore pair.
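
    Roughly like this (a sketch; the buffer pointer, size, and wrapper function name are placeholders, and I've used wait=TRUE so the invalidate completes before interrupts are restored):

        #include <xdc/std.h>
        #include <ti/sysbios/hal/Hwi.h>
        #include <ti/sysbios/family/c66/Cache.h>

        /* With Cache.atomicBlockSize = 0 in the .cfg, keep the whole block
         * coherence operation inside an interrupt-disabled region. */
        Void invWithIntsDisabled(Ptr buf, SizeT size)
        {
            UInt key = Hwi_disable();
            Cache_inv(buf, size, Cache_Type_ALL, TRUE);   /* complete before re-enabling interrupts */
            Hwi_restore(key);
        }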

    Judah

  • Thank You Judah :-)