RTOS/AM5728: sysbios cache writeback invalidate

Mohsen Khayami

Mastermind 22391 points

Part Number: AM5728
Other Parts Discussed in Thread: SYSBIOS

Tool/software: TI-RTOS

I would like to understand what happens with the command below:

Cache_inv(addr,size,Cache_Type_ALLD,TRUE)

addr = 0xC0000000

size=500MBytes

We are using this command to invalidate the cache. The problem is that the invalidate takes about 52441674 DSP clock cycles, which is around 70 ms with the DSP running at 750 MHz. We would have thought it should be shorter than that. During this time we are not doing anything else, and all the HWIs are disabled before executing this command.

The E2E thread below appears to say that the number of DSP clock cycles it takes to invalidate the cache is (total addressable RAM space)/(cache line length):

https://e2e.ti.com/support/processors/f/791/t/422590?L1D-cache-invalidate-latency

If this is true, then for our case it should take:

L1D = 500MB/64= 8.2M cycles

L2D=500MB/128 = 4.1Mcycles

The total should be 8.2M + 4.1M = 12.3M cycles, not the 52M cycles we measured.

I would like to know why the number of cycles is about 4x what it is supposed to be.
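For reference, the estimate above can be written out as a small host-side calculation. This is only a sketch of the arithmetic in this post: it assumes one cycle per cache line, 64-byte L1D lines and 128-byte L2 lines as used above, and treats 500 MB as 500 * 1024 * 1024 bytes. The function names are made up for illustration.

```c
#include <stdint.h>

/* Line sizes as used in the post: 64-byte L1D lines, 128-byte L2 lines. */
#define L1D_LINE_BYTES 64u
#define L2_LINE_BYTES  128u

/* Number of cache lines needed to cover byte_count bytes (rounded up). */
static uint64_t lines_covered(uint64_t byte_count, uint32_t line_bytes)
{
    return (byte_count + line_bytes - 1u) / line_bytes;
}

/* Estimated cycles, ASSUMING one cycle per line at each cache level. */
static uint64_t estimated_inv_cycles(uint64_t byte_count)
{
    return lines_covered(byte_count, L1D_LINE_BYTES)
         + lines_covered(byte_count, L2_LINE_BYTES);
}
```

Calling estimated_inv_cycles(500ull * 1024 * 1024) gives 12288000 cycles, so the measured 52441674 cycles is roughly 4.3 times this estimate.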

Thanks

Regards

Mohsen

over 6 years ago

0 lding over 6 years ago

TI__Guru* 95265 points

Mohsen,

Do you use -O3 to compile the DSP code? Do you see that the DSP cycle count is roughly linear with the data size? E.g., if you invalidate 1/10 of the 500 MB, is the cycle count reduced to 1/10?

Regards, Eric

0 Mohsen Khayami over 6 years ago in reply to lding

TI__Mastermind 22391 points

Hi Eric

Yes, it is linear with the size of the memory region. Yes, we did compile with -O3.

0 lding over 6 years ago in reply to Mohsen Khayami

TI__Guru* 95265 points

Hi,

Cache_inv(addr,size,Cache_Type_ALLD,TRUE) is a SYSBIOS call, implemented in bios_6_xx_xx_00\packages\ti\sysbios\family\c66\cache.c

Void Cache_inv(Ptr blockPtr, SizeT byteCnt, Bits16 type, Bool wait)
{
    Cache_block(blockPtr, byteCnt, wait, L2IBAR);
}

This call also invalidates the prefetch buffer. I don't have the exact cycle count for each step. Perhaps you can extract a few functions from bios_6_xx_xx_00\packages\ti\sysbios\family\c66\cache.c into a CCS project; that would make it easier to see where the cycles are spent.

I also added our C66x expert to this.

Regards, Eric

0 Mohsen Khayami over 6 years ago in reply to lding

TI__Mastermind 22391 points

Hi Eric
Thanks for the reply. I am hoping the C66x experts have some numbers for the cycles it takes to invalidate the L2 and L1 caches. I would have imagined that the cost is related to the size of the cache and not the size of the memory space. Let's assume that we only want to work on the L2, and that the size of the L2 is 256K. The number of ways should not matter. Assume that I want to invalidate a big memory section, so the hardware has to check every location in the L2. Also assume it takes about 10 CPU cycles to check a cache address to see whether it has to be invalidated.

Therefore it would take about 256*1024*10 = 2621440 CPU cycles. Assuming the C66x is running at 750 MHz, one cycle is around 1.3 ns, so the total time should be 3.495 ms just for the L2. This assumes nothing else is going on in the C66x core, which is our case.
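The same back-of-the-envelope estimate as a host-side sketch. The 10 cycles per location is the assumption stated above, not a measured or documented figure, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Cycles to visit every location in a cache of cache_bytes bytes,
 * ASSUMING a fixed (hypothetical) cost per location. */
static uint64_t full_scan_cycles(uint64_t cache_bytes,
                                 uint32_t cycles_per_location)
{
    return cache_bytes * cycles_per_location;
}

/* Convert a cycle count to milliseconds at the given clock rate in Hz. */
static double cycles_to_ms(uint64_t cycles, double clock_hz)
{
    return (double)cycles / clock_hz * 1000.0;
}
```

With full_scan_cycles(256 * 1024, 10) this gives 2621440 cycles, and cycles_to_ms of that at 750e6 Hz is about 3.495 ms. The same conversion applied to the measured 52441674 cycles gives about 69.9 ms.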

If the above is true, maybe what we need is the number of cycles it takes to invalidate one cache location. I am also assuming that the hardware has to search every location and invalidate it, which would be the worst case. I guess the worst case would be invalidating a location at the very end of the cache.

Thanks

0 jian35385 over 6 years ago in reply to Mohsen Khayami

TI__Mastermind 23125 points

Mohsen,

Note that cache memory is accessed through addressing and tag RAM, all done in the hardware cache controllers, so there is no need to search through the whole address range. To invalidate the entire cache, all the hardware needs to do is mark the tag RAM to signal that all cache lines are invalid, so the next time the CPU needs to read something from external memory it has to fill the cache line (and thus wait). To write back and invalidate, the written data also needs to be written out to external memory.

I will run the 52M cycle number you mentioned above by the compiler team to see if it makes sense to them.

regards
jian

0 jian35385 over 6 years ago in reply to jian35385

TI__Mastermind 23125 points

Mohsen,
I just asked Yuan Zhao from the compiler team. He showed me the ARM side of the library function and mentioned that the DSP lib will work similarly.
Basically, the suggestion was to determine a "threshold" by timing, based on the size supplied to the inv function. If the size exceeds the threshold, just go ahead and do an inv_all.
Since you already mentioned that the cycles are pretty linear with the size, you can compare the inv_all cycles against a few trials of inv_addr cycles to find the threshold, then call either function directly based on the threshold.
Now I get what Mohsen was saying in the earlier post.
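
A minimal sketch of that threshold idea, written so it compiles on a host. Everything here is hypothetical naming: inv_block and inv_all are function pointers standing in for whatever block and global invalidate calls the target provides (the stubs below only count calls), and the crossover assumes the linear cost model mentioned above. Note that a global invalidate discards everything in the cache, so this is only reasonable when nothing else cached needs to survive.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hooks for the target's block and global invalidate calls. */
typedef struct {
    size_t threshold_bytes;                      /* found by timing */
    void (*inv_block)(void *addr, size_t bytes); /* address-range invalidate */
    void (*inv_all)(void);                       /* global invalidate */
} inv_policy;

/* Use the global invalidate above the threshold, block invalidate otherwise.
 * Returns 1 if the global path was taken, 0 for the block path. */
static int invalidate_region(const inv_policy *p, void *addr, size_t bytes)
{
    if (bytes > p->threshold_bytes) {
        p->inv_all();
        return 1;
    }
    p->inv_block(addr, bytes);
    return 0;
}

/* Size at which a block invalidate costs as much as a global one, ASSUMING
 * block cost is linear in size: inv_all_cycles / (block_cycles/block_bytes). */
static size_t crossover_bytes(uint64_t block_cycles, size_t block_bytes,
                              uint64_t inv_all_cycles)
{
    if (block_cycles == 0) {
        return SIZE_MAX; /* no usable measurement: never switch to inv_all */
    }
    return (size_t)((inv_all_cycles * (uint64_t)block_bytes) / block_cycles);
}

/* Host-only stand-ins so the sketch can be exercised off target. */
static int block_calls;
static int all_calls;
static void stub_inv_block(void *addr, size_t bytes)
{
    (void)addr; (void)bytes;
    block_calls++;
}
static void stub_inv_all(void) { all_calls++; }
```

On target, the two stubs would be replaced by the real calls, and threshold_bytes would come from crossover_bytes fed with one timed block invalidate and one timed global invalidate.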
Jian

0 Brad Griffis over 6 years ago in reply to jian35385

TI__Guru*** 125430 points

jian35385 said:
Basically, the suggestion was to determine a "threshold" by timing, based on the size supplied to the inv function. If the size exceeds the threshold, just go ahead and do an inv_all.

How do you guarantee that you don't blow away "good" data that's cached elsewhere? Global invalidation is much faster because you are iterating across cache lines (i.e. you are bound by the cache size). Block invalidation takes longer because you're doing the opposite, i.e. looking at locations in the cache that could contain a given address, and so it scales with the size of the memory being invalidated.

Going back to the original intent -- Mohsen, why are you invalidating 500 MB of memory? Is it possible to operate on smaller pieces of memory as needed?

Here's another thought.... How much total memory is being used? For example, if there's 512 MB of memory, perhaps you could do a writeback of the "other 12 MB" to make sure you don't lose good data, and then you could perform a global invalidate of the entire 512 MB. Of course, writebacks can be dangerous too... You would need to know precisely the content of the other 12 MB. You wouldn't want any DMA buffers located in there.
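
A rough sketch of that sequence, for illustration only. The cache_ops hooks are hypothetical stand-ins: on target, wb_block would presumably map to a block writeback such as the SYSBIOS Cache_wb call, and inv_all to whatever global invalidate is available. The host stubs just record the call order, since the point is that the writeback of the "good" region has to complete before the global invalidate.

```c
#include <stddef.h>

/* Hypothetical hooks for the target's cache operations. */
typedef struct {
    void (*wb_block)(void *addr, size_t bytes); /* block writeback */
    void (*inv_all)(void);                      /* global invalidate */
} cache_ops;

/* Write back the region holding live data first, then invalidate globally.
 * The caller must know the region holds no DMA buffers, as noted above. */
static void wb_then_inv_all(const cache_ops *ops, void *live_addr,
                            size_t live_bytes)
{
    ops->wb_block(live_addr, live_bytes);
    ops->inv_all();
}

/* Host-only stubs that log the order of calls ('W' then 'I' expected). */
static char call_log[8];
static int call_count;
static void stub_wb_block(void *addr, size_t bytes)
{
    (void)addr; (void)bytes;
    if (call_count < 7) call_log[call_count++] = 'W';
}
static void stub_inv_all(void)
{
    if (call_count < 7) call_log[call_count++] = 'I';
}
```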

0 jian35385 over 6 years ago in reply to Brad Griffis

TI__Mastermind 23125 points

Brad,

I think Mohsen mentioned this is the only cache use, so I assume this data set is read-only and he has already carved out some SRAM space for writes. I agree that if there is any written data in the cache, then a writeback is needed along with the invalidate. I also agree with your suggestion to write back a smaller data set and then invalidate the whole cache.

Good to hear from you since the FAE summit. Hope you had a good vacation.

Jian
