TMS570LC4357: Simulate a lock mechanism for the cache

Part Number: TMS570LC4357
Other Parts Discussed in Thread: LAUNCHXL2-570LC43

Hello

I have a question about the TMS570LC4357.

We cannot operate the caches with their normal replacement mechanism, because in the worst case the performance with cache is lower than without cache, which is not acceptable for a safety-critical system. For this reason, we are planning to simulate a lock mechanism so that we can use the data and instruction caches of this processor. To achieve this goal we have planned the following actions:

 

  • The cacheable MPU regions are limited to the size of the cache to avoid cache misses.
  • During the initialisation of the SW, data and instructions are loaded into the cache (a sketch of this preload follows below).
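
For illustration, a minimal sketch of the preload we have in mind (cached_data_start / cached_data_end and resident_func_a / resident_func_b are placeholder names, not an existing API; whether a read or a write per line is needed depends on the allocation policy configured in the MPU):

    /* Hypothetical preload sketch -- all names are placeholders. */
    #include <stdint.h>

    #define CACHE_LINE_BYTES 32u

    extern uint8_t cached_data_start[];   /* linker symbols bounding the */
    extern uint8_t cached_data_end[];     /* 32 KB cacheable data set    */

    static void preload_dcache(void)
    {
        volatile uint8_t *p;
        /* Touch one byte per 32-byte line so every line is allocated.
         * With a write-allocate policy, a write per line may be needed
         * instead of a read. */
        for (p = cached_data_start; p < cached_data_end; p += CACHE_LINE_BYTES)
        {
            (void)*p;
        }
    }

    extern void resident_func_a(void);    /* placeholder resident functions */
    extern void resident_func_b(void);

    static void preload_icache(void)
    {
        /* Executing the code fetches it into the I-cache. */
        resident_func_a();
        resident_func_b();
    }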

 

Questions:

  • For the instructions to be loaded into the cache, we plan to execute all of them during the initialisation, so that they are loaded into the instruction-cache. Is this procedure sufficient to ensure that these instructions are always deterministically loaded into the cache?
  • For the data to be loaded into the cache, we plan to write all of them during the initialisation, so that they are loaded into the data-cache. Is this procedure sufficient to ensure that this data set is always deterministically loaded into the cache?
  • What is the best method to check that a set of data or instructions is loaded into the cache?
  • Can the instructions to be loaded into the cache be allocated to two non-contiguous segments, for instance one of 8 KB and another of 24 KB? Could the cache-line alignment (8 words = 32 bytes) be a problem?

 

Best regards and thanks for the support!!

 

Mathieu

  • Hello,

    I think the above method will work to ensure that the I$ and D$ retain their contents during your application's execution. I am trying to confirm with ARM whether the above method is sufficient to ensure that the cache is not refilled for any access to the cacheable regions. Stay tuned.

  • Hello Sunil,

    Could you also please clarify with ARM how exactly the pseudo-random replacement policy of the cache works?

    I understand that a cache line contains four blocks.

    Scenario: if a memory block is being allocated into a cache line with three free blocks and one used block, how does the policy work? Does it choose randomly (non-deterministically, and so could evict the used block), or does it allocate one of the empty blocks?

    Thank you

    Mathieu

  • Hello,

    ARM's response was as expected. I am pasting the thread from the support ticket here:


    > The cache contents on Cortex-R5F cannot be locked. To overcome this situation, we are planning to:
    > - define cacheable regions to be the same size as the available I$ and D$, both 32kB
    > - load the I$ and D$ with the instructions and data that are required to be cached

    I want to start off by clarifying what you are already aware of: caches are inherently non-deterministic. I assume your Cortex-R5 does not implement TCMs, or that they are full? The properties of TCM memory seem to be exactly what you require here. As you are also aware, the Cortex-R5 does not include support for cache lockdown.

    Let me start off by giving you the architectural answer, which will ultimately be the only Arm-approved answer, as the rest of this reply is based on theory and assumptions that have not been validated in any way.

    See section 'B2.2.2 Cache behavior' in the latest Armv8-A Architecture Reference Manual.

      https://developer.arm.com/documentation/ddi0487/latest/

    General behavior of the caches
    When a memory location is marked with a Normal Cacheable memory attribute, determining whether a copy of the
    memory location is held in a cache still depends on many aspects of the implementation. The following
    non-exhaustive list of factors might be involved:
    • The size, line length, and associativity of the cache.
    • The cache allocation algorithm.
    • Activity by other elements of the system that can access the memory.
    • Speculative instruction fetching algorithms.
    • Speculative data fetching algorithms.
    • Interrupt behaviors.

    Given this range of factors, and the large variety of cache systems that might be implemented, the architecture
    cannot guarantee whether:
    • A memory location present in the cache remains in the cache.
    • A memory location not present in the cache is brought into the cache.

    Instead, the following principles apply to the behavior of caches:

    An unlocked entry in the cache cannot be relied upon to remain in the cache. If an unlocked entry does remain
    in the cache, it cannot be relied upon to remain incoherent with the rest of memory. In other words, software
    must not assume that an unlocked item that remains in the cache remains dirty.

    There is no mechanism that can guarantee that the memory location cannot be allocated to an enabled cache
    at any time if a memory location both:
    — Has permissions that mean it can be accessed, either by reads or by writes, for the translation scheme
    at either the current level of privilege or at a higher level of privilege.
    — Is marked as Cacheable for that translation regime.

    So to clarify: the rest of the answers given in this reply are purely theoretical, and are based on assumptions to the best of our understanding. They have not been validated. It will be your responsibility to validate your requirement, and to understand that there may be corner cases which cause evictions and allocations in a non-deterministic manner, consistent with the original intent of caches.

    > 1) For the instructions to be loaded into the cache, we plan to execute all of them during the initialization, so that they are loaded into the instruction-cache. Is this procedure sufficient to ensure that these instructions are always deterministically loaded into the cache?

    I believe this is possible, assuming the MPU regions covering the remaining (4GB - 32KB) of memory are defined as Non-cacheable and/or XN (eXecute Never). In this case, the 32KB region used for data would need to be XN, and the other MPU regions covering (4GB - 64KB) would need to be Non-cacheable.
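
    As an illustrative sketch of such a configuration (this uses the architectural ARMv7-R CP15 MPU registers; the region numbers, base addresses and attribute choices are placeholders to be checked against the Cortex-R5 TRM or the HalCoGen-generated startup code, and enabling the MPU itself via SCTLR is assumed to happen elsewhere):

    /* Hypothetical MPU setup sketch (CP15, ARMv7-R PMSA). */
    #include <stdint.h>

    static void mpu_set_region(uint32_t num, uint32_t base,
                               uint32_t size_enable, uint32_t access)
    {
        __asm__ volatile("mcr p15, 0, %0, c6, c2, 0" :: "r"(num));         /* RGNR (region select) */
        __asm__ volatile("mcr p15, 0, %0, c6, c1, 0" :: "r"(base));        /* region base address  */
        __asm__ volatile("mcr p15, 0, %0, c6, c1, 4" :: "r"(access));      /* access control       */
        __asm__ volatile("mcr p15, 0, %0, c6, c1, 2" :: "r"(size_enable)); /* size + enable        */
    }

    #define SIZE_32KB   ((14u << 1) | 1u)            /* SIZE field = log2(32K) - 1, EN = 1 */
    #define SIZE_4GB    ((31u << 1) | 1u)
    #define NORMAL_WBWA ((1u << 3) | (1u << 1) | 1u) /* TEX=001, C=1, B=1 */
    #define NORMAL_NC   (1u << 3)                    /* TEX=001, C=0, B=0 */
    #define XN_BIT      (1u << 12)
    #define AP_FULL     (3u << 8)                    /* privileged + user RW */

    void mpu_example(void)
    {
        /* Region 0: background, whole 4 GB Non-cacheable (lowest priority). */
        mpu_set_region(0, 0x00000000u, SIZE_4GB, AP_FULL | NORMAL_NC);
        /* Region 1: 32 KB cacheable instruction region (placeholder base). */
        mpu_set_region(1, 0x00200000u, SIZE_32KB, AP_FULL | NORMAL_WBWA);
        /* Region 2: 32 KB cacheable data region, XN (placeholder base). */
        mpu_set_region(2, 0x08000000u, SIZE_32KB, AP_FULL | NORMAL_WBWA | XN_BIT);
    }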

    > 2) For the data to be loaded into the cache, we plan to write all of them during the initialization, so that they are loaded into the data-cache. Is this procedure sufficient to ensure that this data set is always deterministically loaded into the cache?

    The MPU region(s) covering 32KB of instructions could theoretically be allocated into the data cache due to speculation.

    That is, a load executed speculatively that targeted an address in the 32KB instruction region(s) could cause a line in the data cache to be evicted.

    It is not impossible for the branch predictor to point to an incorrect location; if that incorrect location happened to contain a random opcode that reads from a location in the instruction region, then it could cause an unwanted allocation into the data cache.

    That said, if your instruction regions only contain known opcodes (unused memory could be filled with NOPs or UDFs), I see this as unlikely.

    > 3) What is the best method to check that a set of data or instructions is loaded into the cache?

    On the Cortex-R5, there is no direct access to the internal cache RAMs. I suspect using the PMU to monitor access latencies and miss counts may be a good mechanism.
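
    For example, a minimal sketch of counting I-cache misses with the PMU (event 0x01 is the ARMv7 "L1 instruction cache refill" event and the CP15 encodings are the architectural ARMv7 ones; this is a sketch, not a validated driver):

    #include <stdint.h>

    static void pmu_start_icache_miss_count(void)
    {
        uint32_t v;
        __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(v));  /* PMCR */
        v |= 0x7u;                          /* E, P, C: enable, reset counters */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(v));
        v = 0u;                             /* select event counter 0 */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 5" :: "r"(v));  /* PMSELR */
        v = 0x01u;                          /* L1 I-cache refill event */
        __asm__ volatile("mcr p15, 0, %0, c9, c13, 1" :: "r"(v));  /* PMXEVTYPER */
        v = 1u;                             /* enable counter 0 */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(v));  /* PMCNTENSET */
    }

    static uint32_t pmu_read_icache_miss_count(void)
    {
        uint32_t v = 0u;
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 5" :: "r"(v));  /* select counter 0 */
        __asm__ volatile("mrc p15, 0, %0, c9, c13, 2" : "=r"(v));  /* PMXEVCNTR */
        return v;
    }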

    > 4) Can the instructions to be loaded into the cache be allocated to two non-contiguous segments, for instance one of 8 KB and another of 24 KB? Could the cache-line alignment (8 words = 32 bytes) be a problem?

    The MPU allows regions to be defined at a granularity down to 32 bytes. However, the region base address needs to be aligned to the region size. I see region size support for 8KB and 16KB, so you would need 3 MPU regions to achieve this (the 24 KB segment split into a 16 KB and an 8 KB region).

    Again, you will also need to make sure that the set indices of the addresses in the regions don't conflict: PA[12:5] is used for indexing into the 4-way 32KB L1 instruction cache, so at most four lines may map to the same set.
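
    A quick sanity check of this condition could look like the following sketch (256 sets follow from 32 KB / (4 ways x 32 B line); the function name is a placeholder):

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES 32u
    #define NUM_SETS   256u
    #define NUM_WAYS   4u

    /* Set index = PA[12:5]. */
    static unsigned set_of(uint32_t pa) { return (pa >> 5) & 0xFFu; }

    /* Returns 0 if no cache set receives more lines than there are ways. */
    int check_regions(uint32_t base_x, uint32_t size_x,
                      uint32_t base_y, uint32_t size_y)
    {
        unsigned count[NUM_SETS] = {0};
        uint32_t a;
        for (a = base_x; a < base_x + size_x; a += LINE_BYTES) count[set_of(a)]++;
        for (a = base_y; a < base_y + size_y; a += LINE_BYTES) count[set_of(a)]++;
        for (unsigned s = 0u; s < NUM_SETS; s++) {
            if (count[s] > NUM_WAYS) {
                printf("set %u is oversubscribed (%u lines)\n", s, count[s]);
                return 1;
            }
        }
        return 0;
    }

    For instance, an 8 KB segment and a 24 KB segment that are each at least 8 KB aligned together supply exactly 1024 lines, four per set, so no set is oversubscribed.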

    I assume your application will need more guaranteed determinism than is offered in this reply. For this reason, I would strongly suggest using TCM memory, or a RAM connected to the LLPP for deterministic data (although that would probably slightly lower performance, and doesn't support instruction fetches), or even something like an external L2C-310, since that supports lockdown.


  • Hello Sunil,

    Thank you for your answer!

    Could you please answer my fifth question?

    "Could you also please clarify with ARM, how exactly the pseudo-random replacement policy of the cache works?

    I understand that a cache line contains four blocks.

    Scenario: If a memory block is being assigned into a cache line with 3 free blocks and one used block, how does the policy work? Randomly (Non-deterministic and could allocate the used block) or it allocates one of the empty blocks."

    Best regards

    Mathieu

  • I would appreciate more details about your answer to item 2, as we have observed a corner case.

    Configuration:

    MPU Region X contains instructions, starts at 0x0005E000 with a size of 8KB and is cacheable.

    MPU Region Y contains instructions, starts at 0x00200000 with a size of 24KB and is cacheable.

     

    Case 1: Execution of all instructions of region X, n times. 256 cache misses were observed with the PMU during the first iteration (8 KB / 32 B = 256 lines, i.e. one compulsory miss per line). In the other iterations, no misses were observed.

    Case 2: Execution of all instructions of region Y, n times. 768 cache misses were observed with the PMU during the first iteration. In the other iterations, no misses were observed.

    Case 3: Execution of all instructions of regions X and Y, n times. 1024 cache misses were observed with the PMU during the first iteration. In the other iterations, there are one or two misses per iteration.

    Case 4: Execution of all instructions (except the last 32 bytes of every region) of regions X and Y, n times. 1022 cache misses were observed with the PMU during the first iteration. In the other iterations, no misses were observed.

    We cannot understand case 3. Are the instruction cache misses caused by the branch predictor reading into the adjacent regions, or because the cache is full?

    Have a good weekend!

  • On the Cortex-R5, there is no direct access to internal cache RAMs.

    TMS570LC4357: L1 cache memory usable as scratchpad memory if cache disabled? was a previous investigation into leaving the caches disabled and directly accessing the cache memories. The conclusion was that direct access to the cache memories is only intended to support functional tests of the cache memories, and not for general use by a program. E.g. there seemed to be an issue with quick back-to-back writes causing the cache memories not to be updated correctly.

  • Thank you Chester for the reference! I am also aware of that thread. We are currently examining the corner case at a cacheable region border. Our best guess is that the branch prediction causes a cache miss, but we don't know why; maybe because of an invalid opcode.

  • ARM clearly stated that the Cortex-R5 does not support locking cache contents and will not guarantee that this method (cacheable region size = cache size) will work all the time. Caches are inherently non-deterministic, and this specific usage mode has not been verified at ARM (nor is it planned). ARM will also not support any further technical queries related to this (as expected).

    You can try disabling branch prediction and see if that affects the behavior.

    See information on this page for disabling branch prediction: https://developer.arm.com/documentation/ddi0460/c/Prefetch-Unit/Controlling-instruction-prefetch-and-program-flow-prediction?lang=en
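
    As a minimal sketch (assuming GCC-style inline assembly), program flow prediction can be turned off by clearing the SCTLR.Z bit (bit 11); the page above also describes further ACTLR controls such as the return stack, which are left out here:

    static void disable_branch_prediction(void)
    {
        unsigned long v;
        __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(v)); /* read SCTLR  */
        v &= ~(1ul << 11);                                       /* clear Z bit */
        __asm__ volatile("mcr p15, 0, %0, c1, c0, 0" :: "r"(v)); /* write SCTLR */
        __asm__ volatile("isb");                                 /* synchronise */
    }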

  • Hello Sunil,

    As already written, case 4 demonstrates that the cache can behave deterministically. We need to understand what happens when the last 32 bytes of a region are loaded into the cache, leading to cache misses. For that I need the support of TI.

    Please answer my outstanding question (pseudo-random replacement policy) of a previous post.

    Thank you

    Mathieu

  • Hi Mathieu,

    Please file a support case with ARM directly: https://developer.arm.com/support

  • Case 3: Execution of all instructions of regions X and Y, n times. 1024 cache misses were observed with the PMU during the first iteration. In the other iterations, there are one or two misses per iteration.

    Case 4: Execution of all instructions (except the last 32 bytes of every region) of regions X and Y, n times. 1022 cache misses were observed with the PMU during the first iteration. In the other iterations, no misses were observed.

    We cannot understand case 3. Are the instruction cache misses caused by the branch predictor reading into the adjacent regions, or because the cache is full?

    As a learning exercise I attempted to re-create your test based upon the description.

    In the attached project, the PMU is used to report the PMU_INST_CACHE_MISS / PMU_INST_CACHE_ACCESS / PMU_CYCLE_COUNT counts for the first two test iterations and then any others in which a miss occurs.

    The results for the test as per your case 3:

    iter 1 Region X size:0x2000 Region Y size:0x6000
      After region X instructions: miss=256 access=1024 cycles=5949
      After region X+Y instructions: miss=1024 access=4096 cycles=23145
    
    iter 2 Region X size:0x2000 Region Y size:0x6000
      After region X instructions: miss=0 access=1024 cycles=2359
      After region X+Y instructions: miss=0 access=4096 cycles=8797

    And there were no further misses in the following 4,744,585 iterations. 

    The results for the test as per your case 4:

    iter 1 Region X size:0x1fe0 Region Y size:0x5fe0
      After region X instructions: miss=256 access=1021 cycles=5927
      After region X+Y instructions: miss=1024 access=4090 cycles=23101
    
    iter 2 Region X size:0x1fe0 Region Y size:0x5fe0
      After region X instructions: miss=0 access=1022 cycles=2325
      After region X+Y instructions: miss=0 access=4092 cycles=8729

    And there were no further misses in the following 3,378,403 iterations.

    I failed to reproduce your issue with case 3: with instructions occupying the entire 32 KB cache, I didn't get any misses after the first iteration.

    However, that might be down to the instructions placed in the cache; to keep the test simple, I created one function per MPU region which contained only NOPs followed by a final bx lr. Depending upon the instructions, the Cortex-R5 PreFetch Unit (PFU) may behave differently.
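
    For reference, such a filler function could be written as in the following sketch (GCC-style syntax; the section name and function name are placeholders, and the attached project may construct it differently):

    /* An 8 KB function: 2047 NOPs (4 bytes each) plus a final BX LR,
     * i.e. (8192 - 4) / 4 = 2047 instructions, placed in its own
     * section so a dedicated MPU region can cover it. */
    __attribute__((naked, section(".regionX")))
    void region_x_func(void)
    {
        __asm__ volatile(
            ".rept 2047 \n"
            "nop        \n"
            ".endr      \n"
            "bx lr      \n"
        );
    }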

    My test program, which can run in a LAUNCHXL2-570LC43, is attached.

    TMS570LC4357_cache_lock.zip

  • Thank you very much Chester for reproducing our test!! We are checking your program and ours to understand the discrepancy. We'll let you know.