AM5K2E02, how to test ECC?

Brad Caldwell

Other Parts Discussed in Thread: AM5K2E02

The AM5K2E02 device has 4MB of L2 cache with ECC/parity for the Arm cores. Then there's 2MB of MSMC RAM. I don't see any mention of ECC for it, so I'll assume it doesn't have any.

Certain customers have a requirement to test the ECC features before they can use them.

I have a couple of questions related to this:

Does the MSMC RAM have ECC, EDC, or parity?
Is there a way for a customer to test the Arm L2 ECC feature?
How does TI validate the silicon ECC features? Is this tested on silicon?

Thanks

over 8 years ago

0 Yordan Kovachev over 8 years ago

TI__Guru**** 161600 points

Hi,

I've notified the design team to elaborate on these queries.

Best Regards,
Yordan

0 ran35366 over 8 years ago in reply to Yordan Kovachev

TI__Genius 12805 points

Brad

This is what we do with MSMC ECC errors when a DSP is involved. The MSMC document linked from the AM5K2E02 page is the same as the one for the devices with a DSP, so I think it should work for the ARM as well, especially if you run bare metal. So here it comes:

I looked at the MSMC User's Guide for AM5K2E02, http://www.ti.com/lit/ug/spruhj6/spruhj6.pdf, chapter 2.6, for error correction.

Following 2.6.4 about the scrubbing engine and the scrubbing control register (table 3-3, chapter 3.3.1 of the same document), I understand that in order to test the ECC the user must do the following:

Write at least 32 bytes to the memory, aligned on a 32-byte boundary

Disable the scrubbing (bit 31 of the control register)

Change a single byte in the 32-byte word by flipping a single bit

Read the data and verify that the new value is in the memory (cache invalidate and so on)

Enable the scrubbing

Read the data from the memory, observe that the data was changed (corrected), and check the value of the SMCERRAR register

Repeat the process, but with a two-bit error. This time the ECC cannot correct the value, but look at the SMNCEA register
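The bit manipulation behind these steps can be sketched as below. This is only a minimal sketch and has not been run on silicon: it builds the control-register values and flips a bit in a 32-byte block, nothing more. The only hardware detail it relies on is the one stated above (bit 31 of the scrubbing control register disables scrubbing); the helper names `scrub_disabled`, `scrub_enabled` and `flip_bit` are made up for illustration.

```c
#include <stdint.h>

/* Per the "disable the scrubbing" step above: bit 31 of the control register. */
#define SCRUB_DISABLE_BIT (1u << 31)

/* Control-register value with the scrubbing engine disabled (hypothetical helper). */
uint32_t scrub_disabled(uint32_t ctrl)
{
    return ctrl | SCRUB_DISABLE_BIT;
}

/* Control-register value with the scrubbing engine enabled (hypothetical helper). */
uint32_t scrub_enabled(uint32_t ctrl)
{
    return ctrl & ~SCRUB_DISABLE_BIT;
}

/* Flip one bit (0..255) in a 32-byte block; bit 0 is the LSB of byte 0. */
void flip_bit(volatile uint8_t block[32], unsigned bit)
{
    block[(bit / 8u) % 32u] ^= (uint8_t)(1u << (bit % 8u));
}
```

On the target, the value returned by `scrub_disabled()` would be written to the scrubbing control register before calling `flip_bit()`, and the value from `scrub_enabled()` written afterwards; for the two-bit case `flip_bit()` is simply called twice with different bit numbers.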

Try it. If it does not work, we will approach the architecture team.

Ran

0 ran35366 over 8 years ago in reply to ran35366

TI__Genius 12805 points

Just noticed (thanks to Rex) that you asked about L2 ECC.

The same procedure can be done for the L2 (namely: disable scrubbing, change the value, enable scrubbing). Just find the appropriate registers.

Ran

0 Bill Smeed over 8 years ago in reply to ran35366

Prodigy 40 points

Hi Ran,

I've looked at spruhj6, section 2.6, but I don't think it offers the path I need to do this testing. If you do have it working on the DSP as suggested above, could you let me know what was done that may be unique?

As I read the scrubbing engine, it seems to do two things for you as part of its Read-Modify-Write Scrub Burst on a 256-bit datum (32-bytes):
(1) If the parity stored in parity RAM is valid, it'll do the checks for validity of the 256-bit datum, correcting 1-bit errors, reporting the errors to the various registers, etc. and if it was corrected, write back the corrected data as well as re-generate the parity bits.
(2) If the parity stored in parity RAM is invalid, it'll read the 32-bytes and write those 32-bytes back and calculate parity and mark that entry for parity RAM as valid.

Like 2.6.2 says, writes that aren't an aligned 32-bytes result in the parity RAM entry being marked invalid. I'm assuming the parity RAM, in addition to the parity bits, must have a bit indicating whether the parity RAM entry is valid or not. I imagine that bit gets set when parity RAM has been calculated and gets cleared when a non aligned 32-byte write occurs. (In addition, I believe a write to the SRAM could only NOT be a 32-byte aligned write if the cache is disabled. Otherwise, the SRAM shouldn't get written unless the cache entry is cleaned, and those entries are aligned 64-byte).

Because of that, I don't believe the approach you listed above will work. If cache is enabled, then even when changing a single-byte, that won't go to SRAM until the cache line is cleaned and at that point it will be a write that is granular to 32-bytes so new parity will be generated. Cache has to be disabled. However, if cache is disabled, and a single-byte write is done, as I read section 2.6.2, that will mark the corresponding parity RAM entry as invalid which temporarily makes this region of memory unprotected by ECC. The scrubbing engine appears to be the mechanism used to come along and eventually do the scrub burst (as a contrast, the DDR3 memory controller has an option (spruhn7c) in its ECCCTL register, RMW_EN, that tells the DDR3 controller to effectively do the same type of operation that is done with the scrub burst).
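To make that reasoning concrete, here is a toy software model of one 32-byte SRAM line and its parity RAM entry. It is only a sketch of my reading of 2.6.2 and 2.6.4, not TI's implementation: the per-entry valid flag is the assumption described above, and the XOR checksum is just a stand-in for the real parity bits, whose encoding I don't know. All names (`line_t`, `write_full`, `write_byte`, `scrub`) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* One 32-byte SRAM line plus its (assumed) parity RAM entry. */
typedef struct {
    uint8_t data[32];
    uint8_t check;        /* stand-in for the stored parity bits */
    bool    parity_valid; /* assumed per-entry valid bit */
} line_t;

/* Stand-in for parity generation: XOR of all 32 bytes. */
uint8_t checksum(const uint8_t d[32])
{
    uint8_t c = 0;
    for (unsigned i = 0; i < 32; i++)
        c ^= d[i];
    return c;
}

/* Full, aligned 32-byte write: parity is regenerated and marked valid. */
void write_full(line_t *l, const uint8_t src[32])
{
    memcpy(l->data, src, 32);
    l->check = checksum(l->data);
    l->parity_valid = true;
}

/* Partial write (one byte, caches off): the parity entry is marked invalid. */
void write_byte(line_t *l, unsigned idx, uint8_t value)
{
    l->data[idx % 32u] = value;
    l->parity_valid = false;
}

/* Scrub burst. Returns true only if valid parity disagrees with the data.
 * With invalid parity it just regenerates parity from whatever is stored. */
bool scrub(line_t *l)
{
    if (!l->parity_valid) {
        l->check = checksum(l->data);
        l->parity_valid = true;
        return false;
    }
    return checksum(l->data) != l->check;
}
```

In this model, flipping a bit with `write_byte()` and then calling `scrub()` never reports an error: the scrubber simply adopts the modified data as the new truth. If the hardware behaves the same way, a single-byte CPU write cannot be used to inject an ECC error.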

I included the test code I used to try out the approach you listed above, as well as the results I got. I have a delay in there that lasts for a few seconds each time it's run. It's being run with all caches off and the MMU off. I tried turning on the MMU and running it that way, too, but get the same result.

-------------------------------------------------------
static unsigned int buffer_32bytes[32/sizeof(unsigned int)] __attribute__ ((aligned (32))) = {0};

static void do_1_bit_ecc_test_in_sram (void)
{
    /* Byte and word views of the same 32-byte aligned test buffer */
    volatile unsigned char *pu8 = (volatile unsigned char *)&buffer_32bytes;
    volatile unsigned int *pu32 = (volatile unsigned int *)&buffer_32bytes;
    /* SMEDCC (scrubbing engine control) register */
    volatile unsigned int *psmedcc = (volatile unsigned int *)0x0bc00010;
    unsigned int smedcc = *psmedcc;   /* value before the test starts */
    unsigned int i;

    platform_write ("Address of SRAM: 0x%08X\r\n", (unsigned int)&buffer_32bytes);
    platform_write ("Size of test SRAM: %d\r\n", (int)sizeof(buffer_32bytes));
    platform_write ("SMEDCC: 0x%08X\r\n", smedcc);

    /* Initialize the buffer with 0x00..0x1F (byte writes), then wait */
    for (i=0; i<32/sizeof(unsigned char); i++)
        pu8[i]=(unsigned char)i;
    platform_delaycycles(0x00200000 / 32 * 1024 / 16);
    asm (" isb"); asm (" dsb");

    /* Dump the buffer: scrubber still on */
    for (i=0; i<32/sizeof(unsigned int); i++)
        platform_write ("%08X ", pu32[i]);
    platform_write ("\r\n");
    platform_delaycycles(0x00200000 / 32 * 1024 / 16);
    asm (" isb"); asm (" dsb");

    *psmedcc = 0x80000000 | (smedcc & 0x7FFFFFFF); // Turn off scrubbing engine
    platform_delaycycles(0x00200000 / 32 * 1024 / 16);
    asm (" isb"); asm (" dsb");

    pu8[17] = pu8[17] ^ 0x80; // Flip one bit at byte index 17
    platform_delaycycles(0x00200000 / 32 * 1024 / 16);
    asm (" isb"); asm (" dsb");

    /* Dump the buffer: scrubber off, bit flipped */
    for (i=0; i<32/sizeof(unsigned int); i++)
        platform_write ("%08X ", pu32[i]);
    platform_write ("\r\n");
    platform_delaycycles(0x00200000 / 32 * 1024 / 16);
    asm (" isb"); asm (" dsb");

    *psmedcc = smedcc; // Restore SMEDCC - Turn on scrubbing engine
    platform_delaycycles(0x00200000 / 32 * 1024 / 16);
    asm (" isb"); asm (" dsb");

    /* Dump the buffer: scrubber back on, after the delay */
    for (i=0; i<32/sizeof(unsigned int); i++)
        platform_write ("%08X ", pu32[i]);
    platform_write ("\r\n");
    platform_delaycycles(0x00200000 / 32 * 1024 / 16);
    asm (" isb"); asm (" dsb");
}
-------------------------------------------------------

Which gave me the following output:

-------------------------------------------------------
Address of SRAM: 0x0C0A0280
Size of test SRAM: 32
SMEDCC: 0x44000001
03020100 07060504 0B0A0908 0F0E0D0C 13121110 17161514 1B1A1918 1F1E1D1C
03020100 07060504 0B0A0908 0F0E0D0C 13129110 17161514 1B1A1918 1F1E1D1C
03020100 07060504 0B0A0908 0F0E0D0C 13129110 17161514 1B1A1918 1F1E1D1C
-------------------------------------------------------
The SMEDCC value reported is the value before starting the test.
The first line of output is how the test initializes that memory (32 bytes). Scrubber is still on.
The second line of output is after the scrubber is disabled and one bit of byte 17 is flipped (the 32-bit word "13121110" changed to "13129110").
The third line is after the scrubber was activated and read, but as it shows, the data wasn't corrected.
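As a sanity check on reading that dump, the small host-side sketch below rebuilds the 0x00..0x1F pattern, applies an XOR mask to byte 17, and assembles one 32-bit word in little-endian order, which is how the words are printed above. The function name `word_after_flip` is hypothetical and the code does not touch any hardware.

```c
#include <stdint.h>

/* Fill 32 bytes with 0x00..0x1F, XOR `mask` into byte 17, and return
 * 32-bit word number `word` (0..7) assembled in little-endian order. */
uint32_t word_after_flip(unsigned word, uint8_t mask)
{
    uint8_t buf[32];
    for (unsigned i = 0; i < 32; i++)
        buf[i] = (uint8_t)i;
    buf[17] ^= mask;

    const uint8_t *p = &buf[(word % 8u) * 4u];
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Byte 17 is the second-lowest byte of word 4, so `word_after_flip(4, 0x80)` gives 0x13129110 while every other word is unchanged.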

If you have it working, do you see where my test above is missing something that your test has working? I don't imagine it's possible to directly access the parity RAM? Is there a way to confirm that the parity RAM entries for the SRAM include a bit that marks that entry valid or not valid (where valid would mean the parity bits are valid and can be used, and not valid means the parity bits aren't valid and shouldn't be used, essentially indicating there is no ECC at that time)?

I'm assuming the ARM ROM code handled enabling the ECC on SRAM? I ask because the manual says by default that ECC is disabled (bit 30 of SMEDCC), but if I boot the processor in IDLE mode, break with the debugger, then SMEDCC already shows 0x44000001, indicating PRR is done (as expected... in terms of me breaking the debugger, it would be long done) and that ECM is enabled.

0 ran35366 over 8 years ago in reply to Bill Smeed

TI__Genius 12805 points

OK, I guess I did not understand what you were asking.

You say that you want to test the ECC. Can you please describe in words what you mean by testing the ECC?

Do you want to inject an error and see if the ECC corrects it?

Do you want to make sure that there is no error in the ECC for a long time?

Please describe the test that you have in mind.

Regards

Ran

0 Bill Smeed over 8 years ago in reply to ran35366

Prodigy 40 points

Hi Ran,

Thank you for the questions. Ideally, we would like to test the ECC functionality as part of our initialization. By "test the ECC", I'm referring to just what you were thinking:
* Parity: Inject a single-bit error, and confirm that it is detected
* ECC: Inject a single-bit error, confirm that it is corrected; Inject a double-bit error, confirm that it is detected

Currently, with the DDR3 controller, we can mostly test its ECC behaviors. We can disable DDR3 ECC, flip one bit, reenable DDR3 ECC, then read that DDR3 location and confirm that the bit was corrected as well as read the DDR3 status registers that indicated that a 1-bit error was corrected. We do this test for each of the data bits (all 64) in that quanta. Because we don't have direct access to the 8 parity bits associated with those 64 data bits, we don't inject single-bit errors into the parity bits. We do a similar test with 2-bit errors as well (flip all possible 2-bit error patterns within the 64 data bits, and ensure that on read we do get an interrupt and that the DDR3 status bits indicated the 2-bit error), again skipping injecting errors into the 8 parity bits since we don't have direct access.
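The pattern generation for that DDR3 test looks roughly like the sketch below. It only enumerates the error masks for the 64 data bits and hands each one to a callback; the callback type `mask_fn` and the names `for_each_single_bit`, `for_each_double_bit` and `count_bits` are illustrative, not our production code. On the target, the callback would disable ECC, XOR the mask into the 64-bit location, re-enable ECC, read the location back and check the status registers.

```c
#include <stdint.h>

/* Called once per error mask; the mask is XORed into the 64-bit data word. */
typedef void (*mask_fn)(uint64_t mask, void *ctx);

/* Visit every single-bit error mask. Returns the number of masks visited. */
unsigned for_each_single_bit(mask_fn fn, void *ctx)
{
    unsigned n = 0;
    for (unsigned i = 0; i < 64; i++) {
        fn(1ull << i, ctx);
        n++;
    }
    return n;
}

/* Visit every distinct two-bit error mask. Returns the number of masks visited. */
unsigned for_each_double_bit(mask_fn fn, void *ctx)
{
    unsigned n = 0;
    for (unsigned i = 0; i < 64; i++) {
        for (unsigned j = i + 1; j < 64; j++) {
            fn((1ull << i) | (1ull << j), ctx);
            n++;
        }
    }
    return n;
}

/* Example callback for host-side checking: adds up the set bits in each mask. */
void count_bits(uint64_t mask, void *ctx)
{
    unsigned *total = (unsigned *)ctx;
    while (mask) {
        *total += (unsigned)(mask & 1u);
        mask >>= 1;
    }
}
```

That is 64 single-bit patterns and 64*63/2 = 2016 two-bit patterns per 64-bit location, which is small enough to run at every initialization.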

Ideally, we'd like to do the same type of approach across cache (both L1 and L2) and SRAM, checking to make sure errors (where they can be injected) are detected and the documented mitigations are taken.
* L1-I: Inject a single bit error into the cache, then attempt to execute the instruction at that location, confirm that the cache line is invalidated and reloaded from memory as the mitigation indicates
* L1-I: Do not test 2-bit injection, since it appears the parity scheme only can detect single-bit errors reliably. Multi-bit errors are undetermined
* L1-D: Inject a single bit error into "clean" cache, then attempt to read from that location and confirm that the cache line is invalidated and reloaded from the L2.
* L1-D: Inject a single bit error into "dirty" cache, then attempt to read from that location and confirm that the corrected cache line was written to the L2, that the L1-D cache line was invalidated, and reloaded from the (now-updated) L2.
* L1-D: Inject a 2-bit error into "clean" cache, then attempt to read from that location and confirm that the cache line is invalidated and reloaded from the L2.
* L1-D: Inject a 2-bit error into "dirty" cache, then attempt to read from that location and confirm that an exception occurred.
* L2: (similar to the L1-D tests)
* SRAM: Inject a single-bit error, wait for a sufficient period to ensure the scrubbing engine has covered all of SRAM, then attempt to read from that location and confirm that the data has been corrected. Check the SRAM status bits to confirm that the scrubbing engine corrected a single-bit error.
* SRAM: Inject a double-bit error, wait for a sufficient period to ensure the scrubbing engine has covered all of SRAM, then attempt to read from that location and confirm that the data has not been corrected. Check the SRAM status bits to confirm that the scrubbing engine detected a double-bit error.
* SRAM: Disable the scrubbing engine. Inject a single-bit error, then attempt to read from that location and confirm that the data has been corrected. Check the SRAM status bits to confirm that the SRAM returned the corrected data. Read a second time and confirm the SRAM once again returned corrected data (since the data is not corrected in SRAM unless the corrected data is written back).
* SRAM: Disable the scrubbing engine. Inject a double-bit error, then attempt to read from that location and confirm that an exception occurs. Check the SRAM status bits to confirm the SRAM detected the 2-bit error.

The ideal for each of the single-bit error tests would be to perform the test for each bit covered by the parity, including the parity bits themselves. The ideal for the double-bit error tests would be to perform the test for all combinations of 2-bit errors across the data and parity bits. At a minimum, we'd like to do each of the tests above to demonstrate that the error detection and resulting mitigation steps are operating as expected.
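As an illustration of what "all bits, including the parity bits" means, here is a toy SECDED code (extended Hamming, 4 data bits plus 4 check bits) with an exhaustive self-test. It is not the code used by the MSMC, the caches or the DDR3 controller, whose encodings are not described in this thread; it is only a small software stand-in showing the expected outcomes: every single-bit flip is corrected and every two-bit flip is flagged as uncorrectable. All names are hypothetical.

```c
#include <stdint.h>

enum ecc_status { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE };

/* XOR of the positions (1..7) of all set bits; 0 for a valid codeword. */
static unsigned syndrome(uint8_t cw)
{
    unsigned s = 0;
    for (unsigned k = 1; k < 8; k++)
        if ((cw >> k) & 1u)
            s ^= k;
    return s;
}

/* Parity over all 8 bits; 0 for a valid codeword. */
static unsigned parity8(uint8_t cw)
{
    unsigned p = 0;
    for (unsigned k = 0; k < 8; k++)
        p ^= (cw >> k) & 1u;
    return p;
}

/* Data bits go to positions 3, 5, 6, 7; check bits to 1, 2, 4; bit 0 is overall parity. */
uint8_t secded_encode(uint8_t nibble)
{
    uint8_t cw = 0;
    cw |= (uint8_t)(((nibble >> 0) & 1u) << 3);
    cw |= (uint8_t)(((nibble >> 1) & 1u) << 5);
    cw |= (uint8_t)(((nibble >> 2) & 1u) << 6);
    cw |= (uint8_t)(((nibble >> 3) & 1u) << 7);
    unsigned s = syndrome(cw);
    cw |= (uint8_t)(((s & 1u) << 1) | (((s >> 1) & 1u) << 2) | (((s >> 2) & 1u) << 4));
    cw |= (uint8_t)parity8(cw);
    return cw;
}

/* Decode, correcting a single-bit error and flagging a double-bit error. */
enum ecc_status secded_decode(uint8_t cw, uint8_t *nibble)
{
    unsigned s = syndrome(cw);
    enum ecc_status st = ECC_OK;
    if (parity8(cw)) {            /* odd parity: exactly one bit flipped */
        cw ^= (uint8_t)(1u << s); /* s == 0 means the overall parity bit itself */
        st = ECC_CORRECTED;
    } else if (s != 0) {          /* even parity but nonzero syndrome: two bits */
        st = ECC_UNCORRECTABLE;
    }
    *nibble = (uint8_t)(((cw >> 3) & 1u) | (((cw >> 5) & 1u) << 1) |
                        (((cw >> 6) & 1u) << 2) | (((cw >> 7) & 1u) << 3));
    return st;
}

/* Returns 1 if all 1-bit flips are corrected and all 2-bit flips are flagged. */
int secded_self_test(void)
{
    for (unsigned d = 0; d < 16; d++) {
        uint8_t cw = secded_encode((uint8_t)d), out;
        if (secded_decode(cw, &out) != ECC_OK || out != d)
            return 0;
        for (unsigned i = 0; i < 8; i++) {
            uint8_t one = (uint8_t)(cw ^ (1u << i));
            if (secded_decode(one, &out) != ECC_CORRECTED || out != d)
                return 0;
            for (unsigned j = i + 1; j < 8; j++)
                if (secded_decode((uint8_t)(one ^ (1u << j)), &out) != ECC_UNCORRECTABLE)
                    return 0;
        }
    }
    return 1;
}
```

The self-test covers all 8 single-bit and all 28 double-bit patterns for each of the 16 data values, data and check bits alike. The hardware tests we'd like to run have the same shape, except that the check bits can only be exercised if there is some way to reach them.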

As for your last question about making sure there is no error in the ECC for a long time: we believe that if we can perform this testing during each initialization, we will minimize the exposure to an ECC failure. Statistically, for any individual period between initialization and the next reset or power removal, it is extremely unlikely that ECC functionality that was successfully tested during initialization will fail. However, if the unlikely does happen, testing the ECC functionality during each initialization means that the next time the processor is initialized, the test will detect that ECC is no longer functioning correctly, so the exposure to that error is minimal.

Bill
