RM48L952: RAM ECC single bit error test fails if repeated

Part Number: RM48L952
Other Parts Discussed in Thread: HALCOGEN

Hi,

I want to run a RAM ECC self-test periodically in my application. I started from the self-test functions that HALCoGen generates in the sys_selftest.c file. Since a 2-bit error causes an abort, I removed the 2-bit error injection from the functions. I attach the two modified functions that I'm calling. I see two different results:

-   If I call the checkRAMECC_mod() function periodically, everything is fine. Every time I inject a 1-bit error, both the tcram1ErrStat and tcram2ErrStat flags are set as expected.

-   If I call the checkB0RAMECC_mod() function periodically, tcram1ErrStat is set only the first time I call the function. From the second call onward, the flag is no longer set, as if the error were not corrected.

Is there a reason for this behavior? How should I proceed?

Thanks

8463.Functions.c
bool checkRAMECC_mod(void)
{
    bool error = false;
    volatile SIS_U64 ramread = 0U;
    volatile SIS_U32 regread = 0U;
    SIS_U32 tcram1ErrStat = 0U;
    SIS_U32 tcram2ErrStat = 0U;
    /* Back up the two locations used for the 1-bit error test */
    SIS_U64 tcramA1_bk = tcramA1bit;
    SIS_U64 tcramB1_bk = tcramB1bit;

    /* Clear RAMOCCUR before setting RAMTHRESHOLD register */
    tcram1REG->RAMOCCUR = 0U;
    tcram2REG->RAMOCCUR = 0U;

    /* Set Single-bit Error Threshold Count as 1 */
    tcram1REG->RAMTHRESHOLD = 1U;
    tcram2REG->RAMTHRESHOLD = 1U;

    /* Disable single-bit error interrupt generation */
    tcram1REG->RAMINTCTRL = 0U;
    tcram2REG->RAMINTCTRL = 0U;

    /* Enable writes to ECC memory, enable ECC error response */
    tcram1REG->RAMCTRL = 0x0005010AU;
    tcram2REG->RAMCTRL = 0x0005010AU;

    /* Force a single-bit error in both banks. */
    /* Flipping one bit of the stored ECC causes a corrected 1-bit ECC error; the SERR flag is set if the threshold is 1 */
    _coreDisableRamEcc_();
    tcramA1bitError ^= 1U;
    tcramB1bitError ^= 1U;
    _coreEnableRamEcc_();

    /* Read the corrupted data to generate the single-bit error */
    ramread = tcramA1bit;
    ramread = tcramB1bit;

    /* Check for error status */
    tcram1ErrStat = tcram1REG->RAMERRSTATUS & mcECC_SERR_FLAG_MASK;
    tcram2ErrStat = tcram2REG->RAMERRSTATUS & mcECC_SERR_FLAG_MASK;
    if ((tcram1ErrStat == 0U) || (tcram2ErrStat == 0U)) // SERR not set in TCRAM1 or TCRAM2 modules
    {
        error = true; // TCRAM module does not reflect 1-bit error reported by CPU
    }
    else
    {
        /* Clear the SERR flag by writing 1; otherwise it stays set and an error is reported even when none is present */
        tcram1REG->RAMERRSTATUS = 0x1U;
        tcram2REG->RAMERRSTATUS = 0x1U;
    }
    regread = tcram1REG->RAMUERRADDR;
    regread = tcram2REG->RAMUERRADDR;

    /* Restore the control registers, disabling writes to ECC RAM */
    tcram1REG->RAMCTRL = 0x0005000AU;
    tcram2REG->RAMCTRL = 0x0005000AU;

    /* Write back the backup values so that the correct ECC is recomputed. */
    /* The 2-bit test locations (tcramA2bit, tcramB2bit) are never modified in this variant, so they are not restored. */
    tcramA1bit = tcramA1_bk;
    tcramB1bit = tcramB1_bk;

    return error;
}

  • Hello Smeet,

    This is expected.

    If the address is the same, the CPU reads from a cache instead of from the SRAM. This cache is a special buffer designed for ECC, and it is not visible to the user.

    The Cortex-R4 processor attempts to correct 1-bit errors in the SRAM by writing the corrected data back to the SRAM and retrying the access. If a 1-bit error is due to a hard fault, then doing this will not change the data read from the SRAM, and when the access is retried, the same error will be detected again and the processor will livelock, forever detecting the error and retrying and not making any progress.

    The purpose of the hard error cache is to prevent the CPU from re-reading an SRAM location that has a permanent single-bit error. Suppose one of the memory cells is defective. If you read from it, the CPU detects a single-bit ECC error. The CPU then saves the corrected data to the hard error cache, writes the corrected data back to the SRAM, and retries the access. The next time the CPU reads from the same error address, it simply reads from the cache instead of the SRAM, because the address matches.

    You can do it like this:
    1. address 1: corrupt its ECC, read the data to get ESM error
    2. address 2: corrupt its ECC, read the data to get ESM error
    3. address 1 again: corrupt its ECC, read the data to get ESM error
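
The behavior described in this reply can be illustrated with a toy model. This is only a sketch, not the Cortex-R4 implementation: it assumes a single-entry buffer (the thread does not state the real depth), and the names `hard_error_cache_t`, `corrupt_and_read` and `count_reported` are hypothetical. It shows why repeating the test on one address reports the error only once, while alternating between two addresses reports it every time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only. ASSUMPTION: the buffer holds a single address. */
typedef struct {
    bool     valid;
    uint32_t addr;
} hard_error_cache_t;

/* Models "corrupt the ECC of addr, then read addr".
 * Returns true if the read would report a single-bit error. */
static bool corrupt_and_read(hard_error_cache_t *cache, uint32_t addr)
{
    if (cache->valid && (cache->addr == addr)) {
        /* Address match: the read is served from the buffer, so the
         * corrupted ECC in SRAM is never checked and nothing is reported. */
        return false;
    }
    /* Miss: the read goes to SRAM, the error is detected and corrected,
     * and the buffer now remembers this address. */
    cache->valid = true;
    cache->addr  = addr;
    return true;
}

/* Runs a sequence of test addresses and counts the reported errors. */
static unsigned count_reported(const uint32_t *seq, size_t n)
{
    hard_error_cache_t cache = { false, 0U };
    unsigned reported = 0U;

    assert((seq != NULL) || (n == 0U));
    for (size_t i = 0U; i < n; i++) {
        if (corrupt_and_read(&cache, seq[i])) {
            reported++;
        }
    }
    return reported;
}
```

In this model, the sequence "address 1, address 1, address 1" reports one error, and "address 1, address 2, address 1" reports three, which matches the two results seen in the original question.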
  • Hello QJ Wang,

    Thanks for the explanation. There is still something I do not understand about the mode of operation of the RM48 device's ECC. First, I want to ask if there is an official document where I can find all necessary information about this topic.

    Based on your answer, I'm expecting such behavior (correct me if I'm wrong). Considering the checkB0RAMECC_mod function attached before, the steps are:

    /* (1) Read from SRAM: during a memory read, the ECC bits in the ECC space are read along with the data to detect or correct any error. In this case there is no error to detect because I have not forced a single-bit error yet. */

    SIS_U64 tcramA1_bk = tcramA1bit;

    /* Configuration of tcram1REG and tcram2REG */

    /* (2) Force a single bit error in ECC space */

    tcramA1bitError ^= 1U;

    /* (3) Read from SRAM: at this point an error must be detected because the ECC bits in the ECC space are different from the ECC generated for the current data read from tcramA1bit address of SRAM. */

    ramread = tcramA1bit;

    /* Check for error status */

    /* (4) Restore the initial value at tcramA1bit address of SRAM : during this write operation the hardware generates the corresponding ECC check bits for the tcramA1_bk data and stores the value in the ECC space */

    tcramA1bit = tcramA1_bk;

    Question: which data is corrected between points (3) and (4)? Is it the data stored at the tcramA1bit address of SRAM that is changed to match the new ECC bits written in (2)? How can I verify this change of value?

    Then, the second time I call this function, everything is the same except for point (3): the CPU reads from the cache the value that was corrected during the previous call. This value does not match the new ECC bits, which were changed again at (2) of the current call, but the CPU does not correct the error because it is treated as a hard error, so the error status flag is not set. Am I right?

    Thanks

  • Hello,

    This RAM is protected by ECC, which allows the CPU to correct any single-bit error. If the CPU reads data from SRAM and detects a single-bit ECC error, the data is corrected based on the ECC value in ECC memory.

    Your understanding is correct, but the data that is read is still correct, since the corrected data is in the cache.

    Use the following sequence to get the error every time:
    1. address 1: corrupt its ECC, read the data to get ESM error
    2. address 2: corrupt its ECC, read the data to get ESM error
    3. address 1 again: corrupt its ECC, read the data to get ESM error

     

    This sequence will not generate the error a second time:

    1. address 1: corrupt its ECC, read the data to get ESM error

    2. address 1: corrupt its ECC, read the corrected data (from the cache), but don't get ESM error

    3. address 1: corrupt its ECC, read the corrected data (from the cache), but don't get ESM error
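
For a periodic self-test, one way to follow the first sequence is to alternate between two test locations on successive calls. The sketch below covers only that selection logic. The addresses and the name `next_ecc_test_address` are hypothetical, and the ECC corruption and read-back themselves would still be done as in the poster's function.

```c
#include <stdint.h>

/* HYPOTHETICAL test locations: in a real application these would be two
 * 64-bit aligned RAM words reserved for the self-test. */
#define ECC_TEST_ADDR_A (0x08000010U)
#define ECC_TEST_ADDR_B (0x08000020U)

/* Returns the location to test on this call and toggles to the other one
 * for the next call, so the same address is never tested twice in a row. */
static uint32_t next_ecc_test_address(void)
{
    static uint32_t use_b = 0U;
    uint32_t addr = (use_b != 0U) ? ECC_TEST_ADDR_B : ECC_TEST_ADDR_A;

    use_b ^= 1U;
    return addr;
}
```

Each periodic invocation would call `next_ecc_test_address()` once and run the single-bit test on the returned location.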

     

  • Hello,

    Thanks for the reply. Just one more general clarification about RM48 ECC: is the corrected data ONLY in the cache?

    What I mean is this: does processor correct 1-bit error by writing the corrected data ALSO back to the SRAM?

    If this is the case, if I use the following sequence (not repeated!):

    1. address 1: read the data from SRAM (called value 1)
    2. address 1: corrupt its ECC
    3. address 1: read the data from SRAM to get ESM error
    4. reset error flag
    5. address 1: read the data from SRAM (called value 2)

    At point 5, should I read a value 2 that differs from value 1?

    Thanks

  • Just to clarify: I always read 0, and the value never changes.
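
An unchanged data value is what a single-error-correcting code gives when only a check bit is flipped: the decoder locates the error in the check bits and leaves the data alone. The sketch below shows this with a toy Hamming(7,4) code. It is an assumption-laden illustration, not the 64-bit data / 8-bit ECC code used by the device, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns the bit at 1-based position pos of codeword w. */
static uint8_t cw_bit(uint8_t w, unsigned pos)
{
    return (uint8_t)((w >> (pos - 1U)) & 1U);
}

/* Toy Hamming(7,4): check bits at positions 1, 2, 4; data at 3, 5, 6, 7. */
static uint8_t hamming74_encode(uint8_t data)
{
    uint8_t d0 = data & 1U;
    uint8_t d1 = (data >> 1) & 1U;
    uint8_t d2 = (data >> 2) & 1U;
    uint8_t d3 = (data >> 3) & 1U;
    uint8_t p1 = d0 ^ d1 ^ d3;   /* covers positions 3, 5, 7 */
    uint8_t p2 = d0 ^ d2 ^ d3;   /* covers positions 3, 6, 7 */
    uint8_t p4 = d1 ^ d2 ^ d3;   /* covers positions 5, 6, 7 */

    return (uint8_t)(p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
                     (d1 << 4) | (d2 << 5) | (d3 << 6));
}

/* Decodes a codeword, correcting at most one flipped bit.
 * *corrected is set to true if a single-bit error was found. */
static uint8_t hamming74_decode(uint8_t cw, bool *corrected)
{
    unsigned s1 = cw_bit(cw, 1) ^ cw_bit(cw, 3) ^ cw_bit(cw, 5) ^ cw_bit(cw, 7);
    unsigned s2 = cw_bit(cw, 2) ^ cw_bit(cw, 3) ^ cw_bit(cw, 6) ^ cw_bit(cw, 7);
    unsigned s4 = cw_bit(cw, 4) ^ cw_bit(cw, 5) ^ cw_bit(cw, 6) ^ cw_bit(cw, 7);
    unsigned syndrome = s1 | (s2 << 1) | (s4 << 2);

    *corrected = (syndrome != 0U);
    if (syndrome != 0U) {
        /* The syndrome is the 1-based position of the flipped bit. */
        cw ^= (uint8_t)(1U << (syndrome - 1U));
    }
    return (uint8_t)(cw_bit(cw, 3) | (cw_bit(cw, 5) << 1) |
                     (cw_bit(cw, 6) << 2) | (cw_bit(cw, 7) << 3));
}
```

For example, decoding `hamming74_encode(0U) ^ 0x01U` (check bit at position 1 flipped) reports a corrected error and still returns data 0.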

  • Hello,

    Is there any clarification on my last post?

    Thanks

  • Hi QJ Wang,

    The mechanism you mention, "cache is a special buffer designed for ECC", is not documented. Looking at the ARM technical reference, it seems to be implemented in the R7 architecture (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0458c/BGBFGEFD.html) but not in the R4. Is this the one you were referring to? Is it undocumented for the R4? How deep is that cache?

    We really need more details on RAM ECC to face questions and tests with our TUV assessors. RAM ECC is a fundamental safety mechanism.

    We do not understand its behavior in many circumstances and it is difficult to implement systematic tests in these conditions.

    For example, please explain the behavior of the following code, which is executed right at the end of the initialization produced by HALCoGen ("_c_int00()"). I just wanted to understand how the RAM ECC changes, but it does not seem predictable.

    The code just changes the RAM content and reads the corresponding ECC. According to the documentation, RAM ECC can always be read, even if ECC writes are not enabled. Assume that "_coreEnableRamEcc_()" has been called beforehand.

    Three things are strange (numbered in the code comments):

    1. The 8 bytes of a single ECC location are not all equal within the same 64 bits (they should be, according to your documentation)
    2. The ECC no longer changes after the first write to RAM
    3. A data abort is raised if a read is performed on a second location and then again on the first location

    void triggerECCabort()
    {
    #define tcramAddress1 (0x08000010U)
    #define tcram1Location (*(volatile uint64 *)(tcramAddress1))
    #define tcram1EccLocation (*(volatile uint32 *)(tcramAddress1|0x0400000U))
    #define tcramAddress2 (tcramAddress1+0x10)
    #define tcram2Location (*(volatile uint64 *)(tcramAddress2))
    #define tcram2EccLocation (*(volatile uint32 *)(tcramAddress2|0x0400000U))

    volatile SIS_U32 ramEccread = 0U;

    //DEBUG: All ram is zeroed and ECC memory is all 0x0C

    /* Location 1: change value in RAM and view corresponding value in ECC RAM */
    ramEccread = tcram1EccLocation; //ECC=0x0C0C0C0C
    tcram1Location = 4U; // write in RAM something
    ramEccread = tcram1EccLocation; //ECC changes to 0xDFDFDFDB, 1. why least significant byte is read differently?
    tcram1Location = 312312U; // write in RAM something
    ramEccread = tcram1EccLocation; //2. ECC does not change, why?

    /* Location 2: change value in RAM and view corresponding value in ECC RAM */
    ramEccread = tcram2EccLocation;
    tcram2Location = 4U;
    ramEccread = tcram2EccLocation;
    tcram2Location = 312312U;
    ramEccread = tcram2EccLocation;

    /* Again Location 1: change value in RAM and view corresponding value in ECC RAM */
    ramEccread = tcram1EccLocation; //3. This causes a data abort, from Fault Status Registers I see it is from source "Synchronous Parity or ECC Error". Why??
    tcram1Location = 4U;
    ramEccread = tcram1EccLocation;
    tcram1Location = 312312U;
    ramEccread = tcram1EccLocation;
    }

    Are we missing something?

    Strangely, if I disable ECC (_coreDisableRamEcc_), problems 1 and 2 disappear. So does problem 3, but I guess that is because the ECC check is skipped.

    Regards,

    Valerio

  • Hi Valerio,

    Will study your code.

  • Hello Valerio,

    For every 64-bit write to the RAM, the CPU writes an 8-bit ECC to ECC space. For every 64-bit read from the RAM, an 8-bit ECC is read by the CPU on its ECC bus automatically. Only the first byte is valid.

    I tried your code and observed the same phenomena. I am investigating and will come back to you later. Sorry for the late response.

  • Hi QJ Wang,

    Any suggestions?

    We are really under pressure to understand ECC; poor documentation and confusing behavior are not helping...

    Thank you,

    Valerio 

  • Hi Valerio,

    I haven't had a chance to investigate this phenomenon yet. I am sorry for the delay.

  • QJ Wang, 

    Do you have any estimate of when you will have that chance?

    I do not rule out a mistake on our side, but I would expect a different kind of support on one of the main diagnostics of the micro.

    Thanks,

    Valerio

  • Hello Valerio,

    ECC memory can be written or read by a master because it is memory mapped. During an ECC read, the returned ECC is duplicated in all the bytes. For every 64-bit write to the RAM, the CPU also writes an 8-bit ECC value using the ECC bus. But there is no ECC for the ECC value itself. To avoid false single-bit or double-bit errors, the ECC check must be disabled before reading ECC data.

    When I did the test, I got data abort, and the abort status shows that the abort is caused by the ECC error.
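
The ECC-space conventions used in this thread can be captured in two small helpers. This is a sketch under stated assumptions: the 0x00400000 offset is taken from the test code posted above, and the helper names `ecc_address_for` and `ecc_read_is_replicated` are hypothetical. On the target, the ECC word should only be read with the ECC check disabled, as described in the reply.

```c
#include <stdbool.h>
#include <stdint.h>

/* Offset between a RAM data address and its ECC location, as used in the
 * test code earlier in this thread (tcramAddress1 | 0x0400000U). */
#define TCRAM_ECC_OFFSET (0x00400000U)

/* Maps a RAM data address to the address of its ECC word. */
static uint32_t ecc_address_for(uint32_t data_addr)
{
    return data_addr | TCRAM_ECC_OFFSET;
}

/* Checks that a 32-bit ECC read carries the same 8-bit ECC value in every
 * byte, which is what the reply says a valid ECC read should look like. */
static bool ecc_read_is_replicated(uint32_t ecc_word)
{
    uint32_t low_byte = ecc_word & 0xFFU;

    return ecc_word == (low_byte * 0x01010101U);
}
```

With the values reported in the test code, 0x0C0C0C0C passes this check and 0xDFDFDFDB does not, which is consistent with that read having been made while the ECC check was still enabled.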