AM6422: how to implement A53 data cache self-test

lina zhang

Part Number: AM6422

While trying to inject an error into the A53SS Data Cache RAM for self-test purposes, following the AM64x Technical Reference Manual, I found that the TRM does not describe a complete implementation procedure. I am confused about the following:

1. What kind of ECC aggregator is A53SS_ECC_AGGR? An ECC Wrapper or an Interconnect ECC Component?

These two types have different register formats, but the TRM does not say which type A53SS_ECC_AGGR belongs to. I read 0x0071 7014 (the ECC_CTRL register of ecc_aggr_corepac_regs) and got 0x181, which looks very similar to the ECC Wrapper data format, but I would still like an official confirmation.

2. If it is an ECC Wrapper, what should ECC_ROW be? And if it is an Interconnect ECC Component, what should ECC_GRP be?

In the related question, Sreenivasa showed some content from "12.2.1.4.6.12.1 Packet Header ECC", which does describe ECC_ROW. But what if I want to inject an error into the A53SS Data Cache RAM? "6.1.3.9 A53SS Functional Safety - ECC Error Injection Support" says nothing about ECC_ROW or ECC_GRP. Does this mean that neither needs to be initialized in this case?

For example, A53 L2 Data RAM 0 has 72 data bits (63:0 – Data; 71:64 – ECC). If I want to flip bit 8, do I need to select vector 16 in ecc_aggr_corepac_regs, then write 8u to ECC_BIT_1 and leave ECC_ROW/ECC_GRP at 0u?

over 2 years ago

0 Pekka Varis over 2 years ago

TI__Mastermind 27050 points

For error injection we support using the interface in the software diagnostics library, AM64x MCU+ SDK: Software Diagnostics Library (SDL). I'm assuming you have looked there and noticed that A53 cache error injection is missing. I'm assigning this to the owner of the SDL to comment on when we will have the support.

Pekka

0 lina zhang over 2 years ago in reply to Pekka Varis

Intellectual 440 points

Hi Pekka

In fact, to achieve multi-platform compatibility, my code is not based on the SDK or SDL; it is compiled directly into ELF files and booted through U-Boot.

Due to certain requirements, my code needs to be debugged and passing within this year. Will the support code in the SDK be available within this year, or are there any early reference suggestions?

0 Swargam Anil over 2 years ago

TI__Guru 50707 points

lina zhang said:
While trying to inject an error into the A53SS Data Cache RAM for self-test purposes, following the AM64x Technical Reference Manual, I found that the TRM does not describe a complete implementation procedure. I am confused about the following:

Hello lina,

If you wanted to inject errors into cache memory, it is not possible. Since the A53 L2 cache does not have an SoC address, it is not possible to enable ECC for this memory using SDL ECC aggregators. The ECC for these memories is an ARM core functionality and is not provided by TI.

Typically, these errors can be controlled by the A53 core itself.

You can look at the A53 core technical reference manual and enable the corresponding control registers that enable ECC checks for this memory region.

There are CEC (Cache Enabled Control) bits that can be controlled on the R5F in the AXCTRL register. Similarly, see the A53 TRM, which describes a control register to enable ECC for cache memories.

Regards,

S.Anil.

0 lina zhang over 2 years ago in reply to Swargam Anil

Intellectual 440 points

Swargam Anil said:
If you wanted to inject errors into cache memory, it is not possible.

In chapter "6.1.3.9 A53SS Functional Safety - ECC Error Injection Support" of the AM64x TRM, the ECC Aggregator is introduced, and it supports an Inject Only Mode ("12.6.4.3.6 Inject Only Mode" of the AM64x TRM). Can you confirm that this feature is not supported?

0 Neelima Muralidharan over 2 years ago in reply to lina zhang

TI__Expert 6980 points

Hello Lina,

Yes, this feature is currently supported in hardware. TI SDL also provides the API to control the ECC aggregator to cause fault injection. However, since this is a cache, the system integrator is expected to perform writes and reads to cause cache line eviction and then cause the ECC aggregator to inject errors. Another approach could be a memory copy large enough to evict most cache lines.

Additionally, the A53 supports error injection on the L1D and L2 cache RAMs. See section 8.3 of revision r0p4 of the A53 TRM. This will potentially be an easier approach to cause error injection and test the ECC logic.

Let us know if you have any questions.

Regards,
Neelima

0 lina zhang over 2 years ago in reply to Neelima Muralidharan

Intellectual 440 points

Hi Neelima,

I have made some attempts recently, but I still have not been able to trigger the ESM signal associated with the A53SS.

As Pekka mentioned earlier, the SDL has not yet implemented the functions related to "A53 cache error injection", so I am currently unable to get effective help from the SDL.

I also tried following your suggestion and performed a number of memory copy operations (a 4 KB buffer, 100 times) after injecting errors, but there seems to be no improvement.

The A53 TRM says that the MMU needs to be disabled before errors can be injected directly, but this is unacceptable for the code I am running. In this case, if TI's ECC Aggregator can achieve it, I still prefer to use this TI module.

By the way, no one has replied to the two questions I asked at the beginning of this post. I still cannot confirm whether the registers I wrote, and the values I wrote to them, are correct.

0 lina zhang over 2 years ago in reply to lina zhang

Intellectual 440 points

I am still paying attention to this issue. Can any experts provide any suggestions?

0 Neelima Muralidharan over 2 years ago in reply to lina zhang

TI__Expert 6980 points

Hello Lina,

Apologies for the delay. Here are some details. I am combining responses for a few E2E tickets.

Regarding the error injection to L2 caches:

Due to the intended memory being a cache, there needs to be a specific sequence to cause the memory read so that the read data from the cache can be injected with the error. Note that the error injection is done on the read data and not on the
Here is a simple sequence that can be followed to cause a read from the L2 cache and then the error injection (thanks to Dave and Harshil):
1. First, set up the ECC aggregator to target an endpoint ID (16 – 22) by setting the ECC_VECTOR register. Additionally, set up ECC_CTRL with force_n_row and force_ded set.
2. Read or write 32 consecutive addresses in external memory (DDR). Consecutive is important so that all the physical banks of the cache (endpoint IDs 16 – 22 on the ECC aggregator) are populated, which guarantees a hit in the targeted endpoint.
3. This will cache the 32 addresses in the L1D data cache of core 0 and in the unified L2 cache.
4. Now read the 32 addresses from core 1. This will directly read from L2 cache as the L1D cache of core1 will not have these addresses cached.
5. You should see a 2 bit error being reported via the interrupt signal.
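
The register programming in step 1 and the accesses in steps 2 to 4 can be sketched roughly as below. This is only a sketch under stated assumptions, not TI reference code: the thread confirms ECC_CTRL at 0x0071 7014 (so offset 0x14 from the aggregator base), but the ECC_VECTOR offset and the force_sec / force_ded / force_n_row bit positions are assumptions that must be checked against the AM64x TRM. The function names are hypothetical. The caller passes in the current ECC_CTRL contents (0x181 was the value reported earlier in this thread) so that the existing enable bits are preserved.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMED offsets inside one ECC aggregator instance. Only ECC_CTRL (0x14)
 * follows from the address quoted in the thread; verify both in the TRM. */
#define ECC_VECTOR_OFFSET      0x08u
#define ECC_CTRL_OFFSET        0x14u

/* ASSUMED ECC_CTRL bit positions (ECC Wrapper layout); verify in the TRM. */
#define ECC_CTRL_FORCE_SEC     (1u << 3)
#define ECC_CTRL_FORCE_DED     (1u << 4)
#define ECC_CTRL_FORCE_N_ROW   (1u << 5)

/* Endpoint IDs named in the thread for the L2 data RAM banks. */
#define L2_ENDPOINT_FIRST      16u
#define L2_ENDPOINT_LAST       22u

static void ecc_reg_write(volatile uint32_t *base, uint32_t offset, uint32_t value)
{
    base[offset / 4u] = value;
}

/* Step 1: select the endpoint, then arm a double-bit injection that fires on
 * the next read of any row. Returns 0 on success, -1 for a bad endpoint ID. */
int ecc_arm_ded_injection(volatile uint32_t *aggr_base, uint32_t endpoint,
                          uint32_t ctrl_current)
{
    if (endpoint < L2_ENDPOINT_FIRST || endpoint > L2_ENDPOINT_LAST) {
        return -1;
    }
    ecc_reg_write(aggr_base, ECC_VECTOR_OFFSET, endpoint);
    ecc_reg_write(aggr_base, ECC_CTRL_OFFSET,
                  ctrl_current | ECC_CTRL_FORCE_N_ROW | ECC_CTRL_FORCE_DED);
    return 0;
}

/* Steps 2 to 4: read `count` consecutive locations `stride` bytes apart.
 * The stride (for example one cache line) is an assumption; the thread only
 * says "32 consecutive addresses". The sum is returned so the reads are used. */
uint64_t touch_consecutive(const volatile uint8_t *buf, size_t count, size_t stride)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += buf[i * stride];
    }
    return sum;
}
```

On the target, core 0 would call `touch_consecutive()` on a cacheable DDR buffer to populate L1D and L2, and core 1 would then call it on the same buffer so that the reads are served from L2.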

Note that using ECC_ROW in the ECC_CTRL1 register is not the right way to inject errors in this scenario, as the row of the cache where the data will land is unpredictable.

Regarding error reporting capabilities:

For A53 memories (L2 data, L1 data RAMs, L2 tag RAMs), uncorrectable errors will be reported via the interrupt signal (nINTERRIRQ). This is routed to the ESM.
Single-bit errors are not routed as an interrupt but are tracked in the CPUMERRSR and L2MERRSR registers. However, these registers can only be read by software running on the A53.

Please let us know if you have questions.

Regards,

Neelima

0 lina zhang over 2 years ago in reply to Neelima Muralidharan

Intellectual 440 points

Hi Neelima,

Thank you for your reply again. I will try your suggestions in my testing environment later to check whether nINTERRIRQ works.

But I still have some questions that need your confirmation:

Neelima Muralidharan said:
uncorrectable errors will be reported via the interrupt signal (nINTERRIRQ).

Many caches are associated with this interrupt. How can I figure out which cache is at fault? By checking CPUMERRSR and L2MERRSR?

Neelima Muralidharan said:
You should see a 2 bit error being reported via the interrupt signal.

After receiving this report, what should I do to recover the system environment? I need to run the self-test function periodically, so I need to ensure that everything is back to normal after each execution.

Neelima Muralidharan said:
This will directly read from L2 cache as the L1D cache of core1 will not have these addresses cached.

This is an ingenious idea, and I can't wait to implement it in my testing environment. However, the final product will only run in single-core mode. Is there another solution that can run in single-core mode?

This last part is just a concern of mine. If there is an uncorrectable error in the data cache, what data will it return? The wrong data, empty data, or the raw data from DDR? I am considering whether it is necessary to completely disable the scheduling of other threads during this self-test to avoid potential corruption...

0 Neelima Muralidharan over 2 years ago in reply to lina zhang

TI__Expert 6980 points

Lina,

Regarding 1 - Yes, you will need to read the mentioned registers to determine which cache has the issue.

Regarding 2 - For this I need more understanding of your system. Typically we recommend that such tests be run before the actual application is running, so that the testing does not interfere with the application or the safety function. If the test needs to run periodically, it should be during a test interval session where the safety function can be stopped.

The other option to consider is to configure the ESM so that this error does not trigger the error pin while such testing is done. The software that handles the interrupts can also be written such that these interrupts are expected during the testing window and no action is taken.

Regarding 3 - The other solution would be to invalidate the L1D cache after L2 has already cached the data, in which case only one core will be needed.

Regarding 4 - Yes, the testing should be done during a window when the safety execution is not ongoing, typically at start-up.
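
For the single-core variant in "Regarding 3", one possible way to drop lines from L1D only is cache maintenance by set/way at level 1, since maintenance by address to the point of coherency would also affect L2. The sketch below builds the set/way operand as defined by the Armv8-A architecture and, on a target build only, walks the L1D with `DC CISW` (clean and invalidate, so unrelated dirty lines are not lost). The geometry (4 ways, 64-byte lines, 128 sets for a 32 KB L1D) is an assumption that should be read from CCSIDR_EL1 on the real device, `A53_BARE_METAL_BUILD` is a hypothetical build macro, and whether the lines then stay in L2 on this SoC is not confirmed in this thread.

```c
#include <stdint.h>

/* ceil(log2(n)) for 1 <= n <= 2^31. */
static unsigned ceil_log2(uint32_t n)
{
    unsigned bits = 0;
    while (bits < 31u && (1u << bits) < n) {
        bits++;
    }
    return bits;
}

/* Build the operand for DC ISW / DC CSW / DC CISW:
 * way in the top bits [31 : 32 - log2(ways)], set starting at bit
 * log2(line size), and (cache level - 1) in bits [3:1]. */
uint64_t dc_setway_operand(unsigned level, uint32_t way, uint32_t set,
                           uint32_t num_ways, uint32_t line_bytes)
{
    unsigned way_bits = ceil_log2(num_ways);
    unsigned line_shift = ceil_log2(line_bytes);
    uint64_t operand = ((uint64_t)set << line_shift)
                     | ((uint64_t)(level - 1u) << 1);

    if (way_bits > 0u) {
        operand |= (uint64_t)way << (32u - way_bits);
    }
    return operand;
}

#if defined(__aarch64__) && defined(A53_BARE_METAL_BUILD) /* hypothetical macro */
/* Clean and invalidate every line of the level-1 data cache (EL1 or higher). */
void l1d_clean_invalidate_all(uint32_t num_sets, uint32_t num_ways,
                              uint32_t line_bytes)
{
    for (uint32_t way = 0; way < num_ways; way++) {
        for (uint32_t set = 0; set < num_sets; set++) {
            uint64_t op = dc_setway_operand(1u, way, set, num_ways, line_bytes);
            __asm__ volatile("dc cisw, %0" : : "r"(op) : "memory");
        }
    }
    __asm__ volatile("dsb sy\n\tisb" : : : "memory");
}
#endif
```

With the assumed geometry, the call on the target would be `l1d_clean_invalidate_all(128, 4, 64)`, issued after the buffer has been read once and before it is read again.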

Regards,
Neelima

0 lina zhang over 2 years ago in reply to Neelima Muralidharan

Intellectual 440 points

Hi Neelima,

Since a DED cache error event may cause issues that cannot be recovered from at present, what if I trigger a SEC instead? Will there be any values that I can read?

0 Neelima Muralidharan over 2 years ago in reply to lina zhang

TI__Expert 6980 points

Lina,

I did not quite understand your question. Are you asking what would happen in the case when there is SEC instead of DED? What values are you referring to - status registers?

Regards,
Neelima

0 lina zhang over 2 years ago in reply to Neelima Muralidharan

Intellectual 440 points

Hi Neelima,

Yes, I would like to know what happens in the SEC case, and the status registers are among the values that interest me.

I am just trying to finish the self-test without getting stuck in EL3.

0 Thomas Yang55737 over 2 years ago in reply to lina zhang

TI__Expert 5496 points

Hi Lina,

Error reporting

For A53 memories (L2 data, L1 data RAMs, L2 tag RAMs), uncorrectable errors will be reported via the interrupt signal (nINTERRIRQ). This is routed to the ESM.
Single-bit errors are not routed as an interrupt but are tracked in the CPUMERRSR and L2MERRSR registers. However, these registers can only be read by software running on the A53.

No SEC error event is routed out of the A53 cluster, so software needs to check the CPUMERRSR and L2MERRSR registers. This is the conclusion we reached in the other thread.

Please check Arm Cortex-A53 MPCore Processor Technical Reference Manual r0p4 for details:

CPUMERRSR_EL1 (RW, 64-bit): CPU Memory Error Syndrome Register, page 4-120
L2MERRSR_EL1 (RW, 64-bit): L2 Memory Error Syndrome Register, page 4-123
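
As a rough illustration of checking these registers in software, the sketch below decodes a raw 64-bit CPUMERRSR_EL1 or L2MERRSR_EL1 value. The field positions (Fatal [63], Other error count [47:40], Repeat error count [39:32], Valid [31], RAMID [30:24]) and the RAMID values are my reading of the r0p4 TRM and should be treated as assumptions to verify there. The type and function names are hypothetical, and the `MRS` reads are compiled only when the hypothetical macro `A53_BARE_METAL_BUILD` is defined, because they require EL1 or higher on the A53 itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of CPUMERRSR_EL1 / L2MERRSR_EL1 (field layout ASSUMED from
 * the Cortex-A53 r0p4 TRM; verify before relying on it). */
typedef struct {
    bool    fatal;         /* [63]    uncorrectable error recorded        */
    bool    valid;         /* [31]    an error has been recorded          */
    uint8_t ram_id;        /* [30:24] which RAM reported the error        */
    uint8_t repeat_count;  /* [39:32] repeats of the same error location  */
    uint8_t other_count;   /* [47:40] errors at other locations           */
} merrsr_info;

merrsr_info merrsr_decode(uint64_t raw)
{
    merrsr_info info;
    info.fatal        = ((raw >> 63) & 1u) != 0u;
    info.valid        = ((raw >> 31) & 1u) != 0u;
    info.ram_id       = (uint8_t)((raw >> 24) & 0x7Fu);
    info.repeat_count = (uint8_t)((raw >> 32) & 0xFFu);
    info.other_count  = (uint8_t)((raw >> 40) & 0xFFu);
    return info;
}

/* RAMID values as I read them in the TRM (ASSUMED); others are not listed. */
const char *merrsr_ram_name(uint8_t ram_id)
{
    switch (ram_id) {
    case 0x08: return "L1 data tag RAM";
    case 0x09: return "L1 data data RAM";
    case 0x10: return "L2 tag RAM";
    case 0x11: return "L2 data RAM";
    default:   return "other or unknown RAM";
    }
}

#if defined(__aarch64__) && defined(A53_BARE_METAL_BUILD) /* hypothetical macro */
static inline uint64_t read_cpumerrsr_el1(void)
{
    uint64_t value;
    __asm__ volatile("mrs %0, S3_1_C15_C2_2" : "=r"(value));
    return value;
}

static inline uint64_t read_l2merrsr_el1(void)
{
    uint64_t value;
    __asm__ volatile("mrs %0, S3_1_C15_C2_3" : "=r"(value));
    return value;
}
#endif
```

If that reading of the TRM is right, a corrected single-bit error would show up as Valid set with Fatal clear, so a SEC self-test would check `valid` and `ram_id` rather than `fatal` alone.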

-Thomas

0 lina zhang over 2 years ago in reply to Thomas Yang55737

Intellectual 440 points

Hi Thomas,

Can you describe in more detail which bits of this register can be used to determine whether an error has occurred?

In my current code, I check the Fatal [63] bit of L2MERRSR after injecting into the L2 data cache with force_n_row and force_sec and then copying some data on core 0 and core 1, but I have never seen the Fatal bit set to 1 after the injection.
