Query on usage of Cache_Inv() with no_wait

Other Parts Discussed in Thread: SYSBIOS

Hi, 

See the two cases below:

case 1:

block {
    Cache_inv() of 500 KB with no_wait
    DSP processing of ~200,000 cycles
    Cache_wait()
}

case 2:

block {
    Cache_inv() of 500 KB with wait
    DSP processing of ~200,000 cycles
}
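
In terms of the actual API calls, I mean roughly the following (a sketch; pu1_ptr and the 500 KB size are just placeholders, and it assumes #include <ti/sysbios/family/c66/Cache.h>):

    /* case 1: start the invalidate, overlap it with processing, then wait */
    Cache_inv(pu1_ptr, 500 * 1024, Cache_Type_ALL, FALSE);
    /* DSP processing of ~200,000 cycles */
    Cache_wait();

    /* case 2: invalidate and wait for completion up front */
    Cache_inv(pu1_ptr, 500 * 1024, Cache_Type_ALL, TRUE);
    /* DSP processing of ~200,000 cycles */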

In both cases, the cycle count profiled across the block is the same. Is there any problem with the above usage? How can I make the best use of cache invalidate with no_wait?

Regards,

Jeeva

  • Jeeva,

    The Cache_inv() API works by invalidating a block of 65,280 words (261,120 bytes) at a time until the entire requested block has been invalidated.

    Each block invalidation takes a finite amount of time and must be finished prior to starting the next block invalidation operation.

    The 'wait' argument determines whether Cache_inv() waits for the LAST block invalidate operation to complete before returning.

    Consequently, for large block sizes, the length of time spent in Cache_inv() is dominated by the time it takes to invalidate each 65,280 words of memory.

    As an experiment, you might try breaking your 500KB Cache_inv() operation into several calls of 261,120 bytes each.
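
    Something along these lines (just a sketch; the buffer pointer, the wrapper function name, and the total size are placeholders):

        #include <xdc/std.h>
        #include <ti/sysbios/family/c66/Cache.h>

        #define CHUNK_SIZE 261120u   /* 65,280 words = one block invalidate operation */

        Void invInChunks(Ptr buf, SizeT totalSize)
        {
            SizeT offset;

            for (offset = 0; offset < totalSize; offset += CHUNK_SIZE) {
                SizeT len = (totalSize - offset) < CHUNK_SIZE ?
                            (totalSize - offset) : CHUNK_SIZE;

                /* wait=FALSE: return as soon as this block operation is started */
                Cache_inv((Ptr)((UInt8 *)buf + offset), len, Cache_Type_ALL, FALSE);
            }

            Cache_wait();   /* wait for the final block invalidate to complete */
        }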

    I believe you'll notice a performance difference on the very first call to Cache_inv() with wait=true vs wait=false.

    However, subsequent calls will have to wait for the previous operation to complete before the next invalidate operation can begin.

    This experiment will not improve the overall time to invalidate the entire block. It will just serve to demonstrate the value of the 'wait' argument.

    Alan

  • In my experiment, when I profile across Cache_inv(10240 bytes) with or without the wait flag, the cycles consumed by the DSP are the same. The wait flag has no influence.

    I am using CCSv5 with xdctools_3_22_04_06. Please let me know what could cause this issue.

    - Jeeva

  • Your experiment must measure one call to Cache_inv().

    Back to back calls to Cache_inv() will internally wait for the previous Cache_inv() operation to complete before executing the current request.
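
    For example (a sketch; the buffer names and sizes are placeholders):

        Cache_inv(bufA, 261120, Cache_Type_ALL, FALSE);   /* returns once the operation is started       */
        Cache_inv(bufB, 261120, Cache_Type_ALL, FALSE);   /* internally waits for the first one to finish */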

    Alan

  • Yes, I measured just one call.

    -Jeeva

  • Please share the code you are using to measure the API times as well as the measured values.

    Alan

  • Which device or which Cache are you using?  I assume either C64P or C66, which one?

    Judah

  • Alan and Judah,

    The device being used here is an 8-core TMS320C66x device. The code usage is as follows:

    pu1_ptr is the SL2 memory pointer.

    /* TSCL is the time-stamp counter register, from <c6x.h> */
    unsigned int start, end, Inv_Cycles, Wait_Cycles;

    /* first block */
    Cache_inv(pu1_ptr, 10240, Cache_Type_ALL, FALSE);
    Cache_wait();

    while (!end_of_processing)   /* pseudocode loop condition */
    {
        start = TSCL;

        /* second or (n+1)th block */
        Cache_inv(pu1_ptr, 10240, Cache_Type_ALL, FALSE);

        end = TSCL;
        Inv_Cycles = end - start;

        /* algorithm uses the first or nth block of SL2 memory */

        start = TSCL;
        Cache_wait();   /* for the second or (n+1)th block */
        end = TSCL;
        Wait_Cycles = end - start;
    }

    The above code is scheduled across all 8 cores, and profiling is done for each core.

    Since I am using a separate Cache_wait() after each invalidate, my assumption is that Inv_Cycles should only cover starting the cache invalidation, and the invalidation time itself should show up in Wait_Cycles. But Inv_Cycles is high, as shown below, Wait_Cycles is negligible, and there is no difference between the wait flag being TRUE or FALSE.

    Core        Wait flag   Cycles
    [C66xx_0]   FALSE       11,735
    [C66xx_1]   FALSE       11,788
    [C66xx_2]   FALSE       11,788
    [C66xx_3]   FALSE       11,788
    [C66xx_4]   FALSE       11,790
    [C66xx_5]   FALSE       11,788
    [C66xx_6]   FALSE       11,790
    [C66xx_7]   FALSE       11,788
    [C66xx_0]   TRUE        11,735
    [C66xx_1]   TRUE        11,735
    [C66xx_2]   TRUE        11,735
    [C66xx_3]   TRUE        11,735
    [C66xx_4]   TRUE        11,735
    [C66xx_5]   TRUE        11,735
    [C66xx_6]   TRUE        11,735
    [C66xx_7]   TRUE        11,735
  • I think I know what's going on. Could you try this out? In your *.cfg file, do the following:

        var Cache = xdc.useModule('ti.sysbios.family.c66.Cache');
        Cache.atomicBlockSize = 0;

    We worked around a silicon bug by breaking up a huge buffer into smaller chunks. If you set atomicBlockSize = 0, this chunking doesn't happen and the cache operation uses the maximum size the cache hardware can handle.

    See the silicon errata sprz331a, Advisory 14.

    Judah

  • Thanks Judah.

    With Cache.atomicBlockSize = 0, the cache invalidate call now takes fewer cycles, and I can schedule it better to improve performance.

    But I have a doubt regarding the workaround mentioned in Advisory 14.

    In my project, MSMC RAM is configured in shared L3 mode, which is cached first in L2 and then in L1D. Most of the time, my design uses "cache invalidate" on SL2 data in L2.

    Advisory 14 from the Silicon errata sprz331a says that

    "The workaround requires that the memory system be idle during the block coherence operations. Hence programs must wait for block coherence operations to complete before continuing.This applies to L1D and L2 memory block coherence operations."

    Will "Cache.atomicBlockSize = 0;" break the workaround done to avoid the cache corruption?

    is it safe to use delayed cache wait with "Cache.atomicBlockSize = 0"?

    -Jeeva

  • Workaround 2 mentioned in Advisory 32 is as follows:

    This workaround is also generic, but will allow CPU traffic to go on in parallel with cache coherence operations. To issue a block coherence operation, follow the sequence below:

    1. Issue an MFENCE command.
    2. Freeze the L1D cache.
    3. Start the L1D WBINV.
    4. Restart CPU traffic (CPU operations happen in parallel with the WBINV).
    5. Poll the WC register until the word count field reads 0.
    6. The WBINV completes when the word count field reads 0.
    7. Issue an MFENCE command.

    So, as a programmer, do I need to follow the above steps when using cache operations, or is this taken care of inside the Cache API calls themselves?

    -Jeeva

  • Jeeva,

    That is correct. By setting atomicBlockSize = 0, it is possible that you may encounter Advisory 14. If you don't set atomicBlockSize = 0, the code is compliant with Advisory 14, so you don't have to do anything extra. Below I describe how you can set atomicBlockSize = 0 and still keep your code compliant.

    The main reason we implemented this atomicBlockSize, instead of just making the call always compliant with Advisory 14, is that the workaround requires disabling interrupts. If you're working on a very large buffer, disabling interrupts could mean a large latency, which some people may not want. That's why we break up the large buffer into smaller chunks to allow interrupts.

    Now, if the latency doesn't really matter to you, you could set atomicBlockSize = 0 and then put your Cache call within a Hwi_disable/restore pair.
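
    Roughly like this (a sketch; the buffer pointer, size, and wrapper function name are placeholders, and I've used wait=TRUE so the invalidate completes before interrupts are restored):

        #include <xdc/std.h>
        #include <ti/sysbios/hal/Hwi.h>
        #include <ti/sysbios/family/c66/Cache.h>

        /* With Cache.atomicBlockSize = 0 in the .cfg, keep the whole block
         * coherence operation inside an interrupt-disabled region. */
        Void invWithIntsDisabled(Ptr buf, SizeT size)
        {
            UInt key = Hwi_disable();
            Cache_inv(buf, size, Cache_Type_ALL, TRUE);   /* complete before re-enabling interrupts */
            Hwi_restore(key);
        }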

    Judah

  • Thank You Judah :-)