RAM ECC Questions

Other Parts Discussed in Thread: HALCOGEN

I'm currently working on enabling RAM ECC functionality and am running into several issues that I have questions about.

As I understand it, the ECC in the R4 core is always calculated and written (i.e. there is no way to enable or disable it).  This I have gathered from this thread...

http://e2e.ti.com/support/microcontrollers/tms570/f/312/t/107339.aspx

So as I see it, the process for testing ECC functionality cannot be done as outlined in the spna126 guide (ECC Handling in TMSx70-Based Microcontrollers).  Is this correct?

 

To summarize my questions:

1)  What is the recommended method of testing ECC functionality for the R4F core?  The only thing I can think of is directly writing the ECC memory to purposely corrupt it.

2)  What is the difference between enabling and disabling the RAM Wrapper ECC and the R4 CPU ECC?  Is there a situation on the R4 where I would only want to do one vs the other?  My initial understanding was that the wrapper enable controlled ECC reporting, and that the R4 CPU ECC enable controlled the generation of the ECC data itself, but when I found this thread saying that on the R4 the ECC is always generated, I became unclear on this.

3)  I am a little unclear on what influence the debugger/Code Composer 4 has on RAM ECC.  Should I expect to be able to read and write ECC-covered RAM data and the ECC data itself through the debugger (and view both in memory windows)?  Should I be able to monitor registers showing ECC failures (failed address, number of occurrences, etc.)?  I noticed this statement in the spna125 guide...

In case of Cortex-R4, single stepping with ECC on results in an abort.

What exactly does this mean?

4)  I noticed that HALCOGEN tool will generate some functions for enabling and disabling RAM ECC.  This appears to have less functionality than the recommended methods in the spna126 guide?  Can someone explain the differences?

5)  Are there any examples of setting up and testing the RAM ECC functionality on the R4 that are available?

  • Lucas,

    Thanks for the post.

    I have forwarded your questions to our ECC experts. I will get back to you as soon as possible.

    Yes, HALCoGen does not support RAM ECC error testing by introducing errors. It only enables or disables RAM ECC error generation and the reporting of ECC errors.
    I will get back to you with more details.

    Best Regards
    Prathap

     

  • Lucas,

    I'm working on a sample code for this, which I'll send across to you this week.

    That should help.

    Regards,

    Pratip

  • Thanks, I will wait for the sample code and answers to my original questions.  I have had some luck with trying to verify functionality by writing to the ECC memory directly to simulate bit-flips.  There are a couple things that still don't seem to be working the way I would expect though that I don't know if your sample code will help with or not...

    1) I don't see the RAMSERRADDR or RAMUERRADDR registers getting updated with the address of the failure for single or double bit failures when running these bit-flip simulations.  This is with the RAMTHRESHOLD register set to "1", which from what I can tell is the only thing required (for RAMSERRADDR).  Is there something I'm missing as to why these registers always read "0" for me?  NOTE: Found my problem here... I was testing address 0x8000000 and didn't realize the register only captures address bits 17-3 (which were all zeros for the address I was testing).

     

    2) When I simulate a single bit ECC error on a particular address with RAMTHRESHOLD > 1, I see the RAMOCCUR count up.  However, if I run the same corruption simulation on the same address again, I don't see the RAMOCCUR count increment, and even if I run a multi bit error on this address, it doesn't appear to detect it either.  However, if I run a corruption test on a different address between the two I run on the same address, the RAMOCCUR increments to a value of "3" as I would expect.   Is there anything that could be causing this particular behavior? 

     

    3) As a side question, in the Code Composer debugger, I see an "ERRPOSITION" register in the RamWrapper register section.  However, I don't see this register discussed at all in the Technical Reference Manual.  Is there a reason for this?
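
The resolution noted under item 1 above can be checked with a few lines of C. This is a minimal sketch that assumes, as stated in the post, that the error address registers capture only address bits 17-3 and keep them in place. The mask and the helper name `captured_error_address` are illustrative and are not taken from the TRM or from HALCoGen.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption from the thread: only address bits 17:3 are captured,
 * i.e. a 64-bit-aligned offset within a 256 KB window. */
#define ERR_ADDR_MASK 0x0003FFF8u

/* Hypothetical helper: the part of a CPU address that would show up in
 * RAMSERRADDR / RAMUERRADDR under the assumption above. */
static uint32_t captured_error_address(uint32_t cpu_addr)
{
    return cpu_addr & ERR_ADDR_MASK;
}
```

Under this assumption the RAM base address 0x08000000 maps to 0, which looks the same as a register that was never updated, so a test address with a nonzero offset (for example 0x08000128) is easier to verify.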

  • 1)  What is the recommended method of testing ECC functionality for the R4F core?  The only thing I can think of is directly writing the ECC memory to purposely corrupt it.

    pk> Yes, you are right. The ECC can only be checked by corrupting the ECC bits.

    2)  What is the difference between enabling and disabling the RAM Wrapper ECC and the R4 CPU ECC?  Is there a situation on the R4 where I would only want to do one vs the other?  My initial understanding was that the wrapper enable controlled ECC reporting, and that the R4 CPU ECC enable controlled the generation of the ECC data itself, but when I found this thread saying that on the R4 the ECC is always generated, I became unclear on this.

    pk> ECC calculation, detection, and correction are carried out by the R4 CPU; the TCRAM wrapper handles only logging, profiling, and a few controls.

    In the R4 CPU: The ECC calculation logic is always enabled. Only the ECC check and ECC correction can be enabled/disabled.

    In the TCRAM wrapper: Only the error logging can be enabled/disabled.

    If ECC checking is disabled in the R4, no status gets updated in the wrapper anyway, so enabling/disabling the wrapper is not strictly required, but it is recommended.

    - Pratip
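
To make the CPU-side enable concrete, here is a minimal sketch of the bit manipulation involved. It assumes that the device RAM sits on the B0TCM/B1TCM interfaces and that the Cortex-R4 Auxiliary Control Register (ACTLR) uses bits 25, 26, and 27 for the ATCM, B0TCM, and B1TCM check enables. These bit positions are assumptions that should be verified against the Cortex-R4 TRM for your core revision. The helper name `actlr_with_ram_ecc_check` is hypothetical. On the target, the value would be read and written with CP15 accesses (c1, c0, opcode2 1), which are left out so that the sketch compiles on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed ACTLR bit positions (verify against the Cortex-R4 TRM). */
#define ACTLR_ATCMPCEN   (1u << 25)  /* ATCM check enable  */
#define ACTLR_B0TCMPCEN  (1u << 26)  /* B0TCM check enable */
#define ACTLR_B1TCMPCEN  (1u << 27)  /* B1TCM check enable */
#define ACTLR_BTCM_CHECK (ACTLR_B0TCMPCEN | ACTLR_B1TCMPCEN)

/* Hypothetical helper: returns the ACTLR value with the RAM (BTCM)
 * ECC check turned on or off. Only check/correct is switched here;
 * per the thread, ECC generation itself cannot be disabled. */
static uint32_t actlr_with_ram_ecc_check(uint32_t actlr, int enable)
{
    return enable ? (actlr | ACTLR_BTCM_CHECK)
                  : (actlr & ~ACTLR_BTCM_CHECK);
}
```

The wrapper-side logging enable is a separate register write and is not modeled here.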

     

     

  • 3) As a side question, in the code composer debugger, I see an "ERRPOSITION" register in the RamWrapper register section. However, I don't see this register discussed at all in the Technical Reference Manual. Reasons?

    pk> In the CCS V4.x debugger register view of the RAM Wrapper, the following three descriptions are incorrect:

    1. ErrPosition  2. DErrAdd (should be UErrAddress)  3. Ctrl2

    This will be corrected in the next CCS update.

    - Pratip

     

  • 2) When I simulate a single bit ECC error on a particular address with RAMTHRESHOLD > 1, I see the RAMOCCUR count up. However, if I run the same corruption simulation on the same address again, I don't see the RAMOCCUR count increment, and even if I run a multi bit error on this address, it doesn't appear to detect it either. However, if I run a corruption test on a different address between the two I run on the same address, the RAMOCCUR increments to a value of "3" as I would expect. Is there anything that could be causing this particular behavior?

    pk>

    - RAMOCCUR reflects only single-bit error occurrences; for a multi-bit error, check the error flag and the uncorrectable error address instead.

    - When a single-bit error occurs, it gets corrected, so it won't be detected a second time. Reading the same address again may not show the error.

    Regards,

    Pratip

     

  • Lucas,

    Attached is sample code that generates a single-bit error and checks the error count.

    The project file is in TI_CODE\RAM_ECC_TEST_R4\TCRAMW\ECC16\ECC16_R4.pjt

    Go through the ReadMe.txt before execution.

    - Pratip

     1884.RAMECC_CODE_TI.zip

  • pk>

    Now we understand what you mean:

    1. Single bit error followed by single bit error on the same address

    2. Single bit error followed by double bit error on the same address.

    In both cases the second error does not get detected.

    We ran a couple of tests at TI and had to dig into the design to understand the R4 architecture.

    This is how the R4 ECC works:

    Whenever a single-bit error is detected, the R4 corrects the data+ECC, stores it in an internal buffer, and traps the address that was corrected.

    The TCRAMW also writes the corrected value back to RAM, but any subsequent read of that address is served from the internal buffer.

    When a double- or single-bit ECC error is then created again on the same address, the ECC in RAM is updated, but a read returns the data and ECC from the internal buffer.

    So this scenario is restricted by the R4 design.

    When testing single- and double-bit errors one after the other, use different addresses, and make sure the software doesn't use the same address to check single- and double-bit errors back to back.

    Regards,

    Pratip
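
The behaviour described in the post above can be captured in a small host-side model. This is only a behavioural sketch of the explanation given in this thread, not a description of the actual R4 or TCRAMW implementation. The type `ecc_model`, the function `model_read`, and the eight-word RAM are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum fault { FAULT_NONE, FAULT_SINGLE, FAULT_DOUBLE };

struct ecc_model {
    enum fault ram_fault[8];  /* fault injected into each RAM word */
    bool trapped;             /* a corrected address is being held */
    unsigned trapped_addr;    /* address held by the internal buffer */
    unsigned single_count;    /* stands in for RAMOCCUR */
    unsigned abort_count;     /* stands in for data aborts taken */
};

/* Model a CPU read of word `addr`. */
static void model_read(struct ecc_model *m, unsigned addr)
{
    if (m->trapped && m->trapped_addr == addr)
        return;  /* served from the buffer: RAM is never checked */

    switch (m->ram_fault[addr]) {
    case FAULT_SINGLE:
        m->single_count++;
        m->ram_fault[addr] = FAULT_NONE;  /* corrected value written back */
        m->trapped = true;                /* address is now trapped */
        m->trapped_addr = addr;
        break;
    case FAULT_DOUBLE:
        m->abort_count++;                 /* uncorrectable: abort */
        break;
    default:
        break;
    }
}
```

In this model, injecting a single-bit fault at word 1 and reading it counts one correction. A double-bit fault injected afterwards at word 1 goes unnoticed, while the same fault at word 2 is caught, which matches the observations reported earlier in the thread.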

  • Pratip,

    Thanks for the response.  Unfortunately, I'm not sure if I fully understand... I don't suppose you have or can put together a visual diagram showing the process to help clarify?  Let me pose a couple new questions and outline the test I was doing to see if this makes sense to you based on your understanding of the RAM wrapper and the buffering.

    My questions are...

    According to your safety manual for the TMS570, upon detecting a single bit ECC error, the software shall attempt to correct the error by writing the data back to the address and check the corrected memory (presumably by reading the data again and checking if another ECC fault has occurred).  If the check on the corrected memory fails, it is considered a hard error.  Is this possible to do then with your comments about the buffering and the address trapping?  Also, you mentioned in your response that the TCRAMW writes back the corrected value to RAM.  Does this mean that I don't have to manually do the write back that the safety manual specifies because the RAM wrapper will do this for me?

    Second, in the event where a single bit failure occurs, and the application continues to run without failures, and then at some point in the future, that same address fails with a multi-bit failure, will the multibit failure be detected?  This seems to be my scenario as I'm simulating the ECC faults (I'm not sure if when you mentioned "single and double bit errors back to back" if you literally meant back to back without any access of other RAM addresses in between - that is NOT what I'm doing in my test).

     

    Here's a high level outline of the ECC tests I have been running and encountering these issues with:

    Setup: RAMTHRESHOLD = 1, Single Bit Error Interrupt Enabled (RAMINTCTRL)

    Test:

    1) Disable Interrupts

    2) R4 core - Disable RAM ECC and Disable Event Bus Exporting

    3) Enable Writes to ECC Memory (RAMCTRL)

    4) Write "0" to test RAM address

    5) Read ECC data corresponding to my test RAM address

    6) Write "1" to test RAM address

    7) Write the previously read ECC data back to ECC RAM (so "1" should now be the data, but the ECC corresponds to a "0" value... i.e. single bit flip)

    8) Read test RAM address (ECC is disabled so read back "1")

    9) Disable Writes to ECC Memory (RAMCTRL)

    10) R4 core - Enable Event Bus Exporting and Enable RAM ECC

    11) Re-enable Interrupts

    12) Read test RAM address (ECC is enabled so triggers Single Bit Error (which I have causing an interrupt via ESM module))

    13) In ISR-> RAMOCCUR set to "0", Failed Address is captured by reading RAMSERRADDR, RAMERRSTATUS set to "1"

    .... Then a bunch of normal application code runs....

    14) At some point in the future I trigger a double bit error with the exact same process as for the single bit error, except that in step 6 I write "3" instead of "1" (which is two bits flipped compared to the ECC data).

    15) I expect to trigger a DATAABORT exception on step 12 during the read and expect to see registers get updated indicating the multi-bit error status and address (I don't see any of this; I just read back the "3" I wrote with no indication that it is invalid data).

     

    If I trigger a multi bit error without doing the single bit first, I see the proper DATAABORT exception response.
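
The core of steps 4 to 7 and step 14 (new data paired with the ECC of the old data) can be tried on a host. The sketch below uses a textbook Hamming code with an extra overall parity bit over a 64-bit word. This is an assumption for illustration only and is not the ECC code the Cortex-R4 actually implements. The names `ecc_encode` and `ecc_classify` are made up as well.

```c
#include <assert.h>
#include <stdint.h>

enum ecc_result { ECC_CLEAN, ECC_SINGLE_BIT, ECC_DOUBLE_BIT };

/* Parity (XOR of all bits) of a 64-bit value. */
static unsigned parity64(uint64_t v)
{
    unsigned p = 0;
    while (v) { p ^= 1u; v &= v - 1; }
    return p;
}

/* Illustrative encoder: 7 Hamming check bits (bits 6:0) plus one
 * overall parity bit (bit 7). Data bits occupy the codeword positions
 * that are not powers of two. */
static uint8_t ecc_encode(uint64_t data)
{
    unsigned check = 0, pos = 1;
    for (unsigned bit = 0; bit < 64; bit++) {
        while ((pos & (pos - 1)) == 0) pos++;  /* skip 1, 2, 4, ... */
        if ((data >> bit) & 1u) check ^= pos;
        pos++;
    }
    unsigned overall = parity64(data) ^ parity64(check);
    return (uint8_t)(check | (overall << 7));
}

/* Classify a data word against the ECC byte stored alongside it. */
static enum ecc_result ecc_classify(uint64_t data, uint8_t stored_ecc)
{
    unsigned syndrome = (ecc_encode(data) ^ stored_ecc) & 0x7Fu;
    unsigned overall  = parity64(data) ^ parity64(stored_ecc);
    if (overall) return ECC_SINGLE_BIT;  /* odd number of flipped bits */
    return syndrome ? ECC_DOUBLE_BIT : ECC_CLEAN;
}
```

With this code, pairing the data value 1 with the ECC computed for 0 classifies as a single-bit error, and pairing 3 with the ECC for 0 classifies as a double-bit error, mirroring the two injection cases in the test outline.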

     

  • Lucas,

    According to your safety manual for the TMS570, upon detecting a single bit ECC error, the software shall attempt to correct the error by writing the data back to the address and check the corrected memory (presumably by reading the data again and checking if another ECC fault has occurred).  If the check on the corrected memory fails, it is considered a hard error.  Is this possible to do then with your comments about the buffering and the address trapping?

    pk>> Upon detecting a single-bit error, the hardware (not software) attempts to correct the error. First it buffers the corrected value, and then it attempts to correct the value in RAM. Correcting the actual RAM might take a couple of cycles, but the correction in the TCM buffer is quick.

    Also, you mentioned in your response that the TCRAMW writes back the corrected value to RAM.  Does this mean that I don't have to manually do the write back that the safety manual specifies because the RAM wrapper will do this for me?

    pk>> No, you do not need to explicitly write back in the case of a single-bit error. You may read back and check in the ISR, but when you read back, the data is read from the buffer. (Outside of an emulated error, software does not know the correct value and so would not know what to write back.)

    Second, in the event where a single bit failure occurs, and the application continues to run without failures, and then at some point in the future, that same address fails with a multi-bit failure, will the multibit failure be detected?

    pk>> No, the second multi-bit failure won't get detected. This is as per the R4 design.

    This is because the read happens from the buffer, and will continue to be served from the buffer until the buffer address is updated, whereas the error we are talking about is in the RAM, which is not actually being read.

    The suggestion from our design expert is that the second double-bit/single-bit error can be caught by the CCM module (if enabled) when there is any bit change in the buffer.

    Regards,

    Pratip

  • Lucas,

    For steps 1 to 15 of your test, the behaviour you see is expected as per the R4 ECC design.

    Regards,

    Pratip

  • Pratip,

    To be clear then in my thinking, the buffer contains "data", an "ecc value", and an "address", and once the buffer has an address in it, all application data writes to that address update the buffer's "data" and "ecc value".  Reads from the address will read from the buffer, not RAM.  Corruption of the buffer itself would be detected through the CCM (because the buffer is duplicated on both cores).

    Correct?

    Luke

  • Pratip,

    In the spna126.pdf file, I see the mention of DIAG_EN_KEY

    Diagnostic Control Register - DIAG_EN_KEY [3:0]

    • DIAG_EN_KEY [3:0] = 0x5 - Enables diagnostic mode.

    • DIAG_EN_KEY [3:0] = 0xA, or others - Disables diagnostic mode.

    Can you please tell me where do I find the description of those fields within spnu489b.pdf file [TRM]?

    Thank you.

    Regards

    Pashan

  • Luke,

    True. Do you think this explanation of the R4 behaviour matches what you observe?

    -Pratip

     

  • Pashan,

    This relates to F035 Flash ECC. I'll move this to a separate thread.

    These registers are not documented in this version of the TRM. I'll get back to you with those descriptions.

    Regards,

    Pratip

  • Pratip,

    Yes, I believe this buffer functionality explains the results I was seeing.  I would only suggest that it is somehow better documented.  

    Thanks,

    Luke

  • Luke,

    The next version of the app note is being updated with this case study.

    One way I can think of to work around this scenario, if you are really concerned about missing a second single/double-bit error on the same location, is to introduce a known error in a dedicated location. Whenever a real single-bit error occurs, have the CPU read this known-error location in the ISR so that the trapped address gets cleared.

    -Pratip
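
The suggested workaround can be sketched with a very small host-side model that tracks only the trapped address. It assumes, following the description in this thread, that a read which finds a single-bit error replaces the trapped address. The names `DECOY_ADDR`, `read_reaches_ram`, and `single_bit_error_isr` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_TRAP    (-1)
#define DECOY_ADDR 7  /* hypothetical word kept with a known single-bit error */

static int trapped_addr = NO_TRAP;

/* Returns true if a read of `addr` actually reaches RAM and is therefore
 * ECC-checked. A read that finds a single-bit error becomes the new
 * trapped address (assumption taken from the thread). */
static bool read_reaches_ram(int addr, bool has_single_bit_error)
{
    if (addr == trapped_addr)
        return false;  /* served from the internal buffer */
    if (has_single_bit_error)
        trapped_addr = addr;
    return true;
}

/* Sketch of the suggested ISR step: read the decoy so that the address
 * of the real error is no longer the trapped one. */
static void single_bit_error_isr(void)
{
    (void)read_reaches_ram(DECOY_ADDR, true);
}
```

Two points are not settled by the thread and would need checking on hardware. The decoy read is itself corrected and written back, so the known error presumably has to be re-injected before it can be used again. The decoy read may also increment RAMOCCUR or raise another single-bit error interrupt.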