TMS570LC4357-EP: TMS570 ECC Code Error Handling and Correction Mechanism Clarification

Part Number: TMS570LC4357-EP

Hi TI team,

I’m currently working with the TMS570 microcontroller and have been studying its ECC (Error Correcting Code) mechanism, specifically the SECDED (Single Error Correction, Double Error Detection) implementation for SRAM. I have a few questions regarding the handling of ECC code errors, particularly in cases where the ECC code itself is affected.

My specific questions are:

  1. If a single-bit error occurs in the ECC code itself (as opposed to the data), how does the SECDED mechanism handle this in terms of both error detection and correction?

    • Could the ECC mechanism potentially detect this as both a single-bit error and a multi-bit error simultaneously?
    • From a correction standpoint, is there a risk that the mechanism might incorrectly correct the data based on a faulty ECC code, while leaving the ECC code unchanged? This is a major concern in my application.
  2. If such a scenario (i.e., erroneous correction of data due to a faulty ECC code) is possible, are there any strategies or mechanisms available to enhance the reliability of the ECC code itself?

    • For example, are there recommended methods or features within the TMS570 architecture to further protect the ECC code from bit errors?
  3. Lastly, does TMS570 implement any kind of redundancy for the ECC code (e.g., storing multiple copies of the ECC code and using majority voting mechanisms), or is it limited to a single copy of the ECC code for each 64-bit data word?

Any clarification or recommendations on improving the ECC code's reliability would be greatly appreciated!

Best regards,
Hanson

  • Hi Hanson,

    If a single-bit error occurs in the ECC code itself (as opposed to the data), how does the SECDED mechanism handle this in terms of both error detection and correction?

    The SECDED logic also detects single-bit errors in the ECC code itself.

    Please check the highlighted entries in the syndrome table:

    For example, if ECC bit 4 is corrupted, the generated syndrome is 0b00010000. From this syndrome the logic identifies a single-bit error in ECC bit 4 and corrects that bit.

    From a correction standpoint, is there a risk that the mechanism might incorrectly correct the data based on a faulty ECC code, while leaving the ECC code unchanged? This is a major concern in my application

    No, there is no such risk in the correction mechanism.

    If such a scenario (i.e., erroneous correction of data due to a faulty ECC code) is possible, are there any strategies or mechanisms available to enhance the reliability of the ECC code itself?

    No, such a scenario will not occur.

    Lastly, does TMS570 implement any kind of redundancy for the ECC code (e.g., storing multiple copies of the ECC code and using majority voting mechanisms), or is it limited to a single copy of the ECC code for each 64-bit data word?

    There is no redundant storage for the ECC code. As I mentioned earlier, the device performs single-bit error correction and double-bit error detection on the stored ECC code as well.

    --
    Thanks & regards,
    Jagadish.

  • Hi Jagadish,

    Thank you for your response.

    I would like to confirm my understanding regarding single-bit errors in the ECC code.

    You mentioned that if a single-bit error occurs in ECC bit-4, the syndrome would be 0b00010000. I understand that the syndrome is generated during the ECC verification process when comparing the received ECC code with the recalculated one. For example, if data bit 26 has a single-bit error, the syndrome generated would be 0b11100110, allowing the system to detect and correct this error.

    Could you please confirm if my understanding is correct?

    Best regards,
    Hanson

  • Additionally, I am currently trying to intentionally inject ECC errors in the Cache area for testing purposes. Do you have any insights or suggestions on how I can achieve this?

  • Hi Hanson,

    I understand that the syndrome is generated during the ECC verification process when comparing the received ECC code with the recalculated one.

    You are correct, here the syndrome is nothing but XOR of regenerated ECC and old ECC.

    For example, if data bit 26 has a single-bit error, the syndrome generated would be 0b11100110, allowing the system to detect and correct this error.

    You are totally correct.
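
The relationship confirmed above can be sketched in a few lines of Python. The `toy_ecc` function is a made-up 8-bit check-bit generator used only for illustration; it is not the TMS570 encoder, and its syndromes for data-bit errors will not match the device's syndrome table. What the sketch does show holds for any encoder of this form: if the data is intact and only stored ECC bit 4 flips, the XOR of the regenerated ECC and the stored ECC is exactly 0b00010000.

```python
def toy_ecc(data: int) -> int:
    """Hypothetical 8-bit check-bit generator (NOT the TMS570 H-matrix)."""
    ecc = 0
    for bit in range(64):
        if (data >> bit) & 1:
            # Fold a fixed per-bit pattern into the check bits; any fixed
            # pattern is good enough for this illustration.
            ecc ^= (bit * 37 + 11) & 0xFF
    return ecc


def syndrome(data: int, stored_ecc: int) -> int:
    """Syndrome = regenerated ECC XOR stored ECC."""
    return toy_ecc(data) ^ stored_ecc


data = 0x00218ED408065924      # arbitrary 64-bit example word
good_ecc = toy_ecc(data)       # what a correct write would store

no_error = syndrome(data, good_ecc)
ecc_bit4_flipped = syndrome(data, good_ecc ^ (1 << 4))

print(f"{no_error:#010b}")          # 0b00000000
print(f"{ecc_bit4_flipped:#010b}")  # 0b00010000
```

Because the regenerated ECC depends only on the data, the encoder cancels out of the XOR and only the flipped ECC bit remains in the syndrome.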

    Additionally, I am currently trying to intentionally inject ECC errors in the Cache area for testing purposes. Do you have any insights or suggestions on how I can achieve this?

    Honestly, I have never done this; however, you can try it, and if you run into any difficulty I will help you with it.

    --
    Thanks & regards,
    Jagadish.

  • You are correct, here the syndrome is nothing but XOR of regenerated ECC and old ECC.

    That's helpful! Thank you!

    Honestly, I have never done this; however, you can try it, and if you run into any difficulty I will help you with it.

    Honestly, I have tried and failed. Here is what I did:

    1. I enabled the cache via the HCG code.
    2. I set the cache ECC mode to 0b010 by modifying bits [5:3] of the Auxiliary Control Register.
    3. I tried to create an ECC error by enabling DR2B in the Secondary Auxiliary Control Register.
    4. The error was supposed to be triggered, but nothing appeared in ESM Status Register 3.
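
Step 2 above is a read-modify-write of a 3-bit field. Below is a minimal sketch of that arithmetic, assuming only what the post states (the field occupies bits [5:3] and the target value is 0b010). The helper `set_field` and the sample register value are hypothetical, and on the device the register would be read and written through CP15 rather than computed in Python.

```python
def set_field(reg: int, msb: int, lsb: int, value: int) -> int:
    """Return reg with bits [msb:lsb] replaced by value (32-bit register)."""
    width = msb - lsb + 1
    mask = ((1 << width) - 1) << lsb
    if value >> width:
        raise ValueError("value does not fit in the field")
    return ((reg & ~mask) | (value << lsb)) & 0xFFFFFFFF


actlr_before = 0x00000028   # hypothetical current value, bits [5:3] = 0b101
actlr_after = set_field(actlr_before, 5, 3, 0b010)

print(f"{actlr_after:#010x}")                # 0x00000010
print(f"{(actlr_after >> 3) & 0b111:03b}")   # 010
```

Clearing the field before OR-ing in the new value matters here: OR-ing 0b010 into a field that already holds 0b101 would leave 0b111 rather than the intended mode.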

  • Hi Hanson,

    Could you refer to the thread below? It might be useful in this context:

    TMS570LC4357: Cache ECC and ESM Group 3 channel 9 (Arm-based microcontrollers forum, TI E2E support forums)

    --
    Thanks & regards,
    Jagadish.

  • Hi Jagadish,

    Thanks for your reply; it was very helpful!

    However, there is still something that confuses me.
    I've checked the thread and attempted to generate an ECC error.
    Here is my procedure.

    When the program reaches the code in the red box, a data abort is triggered.

    0x30000000 is the start address of the D-cache, which is also the address I tried to read from.

    The data fault status provides the following information:


    Based on this, I believe either the cache is not properly prepared, or the way I am accessing it is incorrect.

    So, I have a few questions:

    1. Should the cache be enabled when trying to access it as regular memory? In my case I have disabled it.
    2. When accessing the cache as regular memory, are there any specific tips I should be aware of?

    Any clarification will be greatly appreciated.

    --
    Thanks & regards,

    Hanson.

  • Hi Hanson,

    Apologies for the late response; I was tied up with a lot of other issues in the meantime.

    Regarding this issue, I suspect one thing:

    If we try to directly access any area that is not configured in the MPU regions, the access fails and can produce an abort exception. Please verify the default configuration below.

    So my suggestion is to configure this area as one of the regions as well, like below:

    This might solve the issue. Please check and let me know.

    --
    Thanks & regards,
    Jagadish.

  • Hi Jagadish,

    Apologies for the late reply. I just finished my holiday.

    Thank you for your advice. I tried configuring the MPU as you suggested, but unfortunately, it didn’t resolve the issue. I’m still encountering the same problem as before.

    I actually don't think the data abort is triggered by the MPU. As we can see, writing 0x11223344U into the cache memory didn't trigger a data abort, while reading from the same address did.

    The data abort status indicates a Synchronous External Abort, whereas the MPU would trigger a background abort if we accessed an undefined area.

    Thanks again for your advice. I initially overlooked the MPU when accessing the cache as regular memory. But it seems that this might not be the cause of the issue.

    Any further suggestions would be greatly appreciated.

    Thanks & regards,

    Hanson.

  • Hi Hanson,

    Is it possible to share sample code that reproduces the above issue, so that I can quickly debug it on my end?

    --
    Thanks & regards,
    Jagadish.

  • Hi Jagadish,

    I've prepared a simple demo for you.

    Please check it.

    Thanks & regards,

    Hanson.

    Cache_ECC_error_create.zip

  • Hi Hanson,

    Sincere apologies for the late response.

    Are you still stuck with this issue?

    --
    Thanks & Regards,
    Jagadish.

  • Hi Jagadish,

    Thanks for your help. I've actually moved past this issue since I couldn't think of any other solutions to try.

    Testing it would definitely be ideal, but I can also proceed without directly testing it myself.

    Currently, I'm focused on the ECC mechanism in the flash memory.

    It appears that even though the actual data in flash memory remains correct, we still can’t retrieve it accurately if there’s a 3-bit ECC error within the ECC code itself.

    In other words, if I attempt to read from a flash region where an ECC error exceeds the SECDED (Single Error Correction, Double Error Detection) capability—despite the data itself being intact—the returned value will differ from what’s stored in flash. 

    Any clarification on this would be helpful.

    Thanks & regards,

    Hanson.

  • Hi Hanson,

    Sincere apologies for the further delayed response; I will make sure to respond without delay next time.

    It appears that even though the actual data in flash memory remains correct, we still can’t retrieve it accurately if there’s a 3-bit ECC error within the ECC code itself.

    In other words, if I attempt to read from a flash region where an ECC error exceeds the SECDED (Single Error Correction, Double Error Detection) capability—despite the data itself being intact—the returned value will differ from what’s stored in flash. 

    I have never tested this in practice, but to my knowledge this should not happen. If there is a multi-bit (more than 2-bit) error in the ECC code, the returned data should not be wrong.

    According to the syndrome table, a 3-bit error in the ECC code is reported as a multi-bit error. For example, assume the three least significant bits of the ECC code are corrupted:

    Example:

    ECC before corruption: 0x46

    ECC after corruption: 0x41

    The syndrome (the XOR of the ECC before corruption and the ECC after the data or ECC is corrupted) then becomes: 0x07

    If you look up 0x07 in the syndrome table, it points to the 8th element of the first column, which is M in this case; M means multi-bit error. Any multi-bit error in the device triggers the uncorrectable-error ESM flag, just as a double-bit error does, and it might also generate an exception.
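
The arithmetic in this example can be checked quickly. The sketch below only reproduces the XOR and lists the flipped bits, assuming the data is intact so that the regenerated ECC equals the value before corruption. Whether a given syndrome is classified as correctable or multi-bit (M) still has to be read from the device's syndrome table, which is not modeled here.

```python
ecc_before = 0x46   # ECC value before corruption (from the example above)
ecc_after = 0x41    # ECC value after the three least significant bits flip

syndrome = ecc_before ^ ecc_after
flipped_bits = [bit for bit in range(8) if (syndrome >> bit) & 1]

print(f"syndrome = {syndrome:#04x}")          # syndrome = 0x07
print(f"flipped ECC bits = {flipped_bits}")   # flipped ECC bits = [0, 1, 2]
```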

    --
    Thanks & regards,
    Jagadish.

  • Hi Jagadish,

    Happy to see your response. Your explanation clarified that the flash ECC mechanism accurately detects both single-bit and multi-bit errors, which is very helpful. However, my main concern remains unresolved. In my testing, I observed that when the ECC code has multiple bit errors, the data we read can differ from the actual stored value.

    Normal State:

    Address = 0xF021A000
    Data = 0x00218ED4 08065924
    ECC code = 0x6E

    Single-bit ECC Flip
    Flipping ECC code bit 4:

    Expected:
    Data = 0x00218ED4 08065924
    ECC code = 0x7E

    Actual:

    1. Erase:

    2. Write Data = 0x00218ED4 08065924

      Expected data = 0x00218ED4 08065924, but actual data = 0x2C218ED4 08068924
      Data does not match expectation.

    3. Write ECC code = 0x7E

      Data read matches expectation: data = 0x00218ED4 08065924

    This shows that when a single-bit error occurs, the data remains consistent with the expectation.

    Double-bit Error
    Flipping bits 4 and 5 of the ECC code:
    Expected:
    Data = 0x00218ED4 08065924
    ECC code = 0x5E

    Actual:

    1. Erase:

    2. Write Data = 0x00218ED4 08065924

      Expected data = 0x00218ED4 08065924, but actual data = 0x2C218ED4 08068924
      Data does not match expectation.

    3. Write ECC code = 0x5E

      ECC code matches expectation, but data still does not: actual data = 0x00A18ED4 08065964.

    The method for creating single-bit and double-bit ECC errors is the same, only the ECC code written differs. Why does this discrepancy occur?
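
To make the mismatches above easier to compare, the following sketch XORs each read-back value against the expected data and lists the bit positions that differ. It uses only the values quoted in this post; `diff_bits` is a helper name introduced here, and the sketch does not attempt to explain why those bits differ.

```python
def diff_bits(expected: int, actual: int) -> list:
    """Return the bit positions (0 = LSB) where two 64-bit words differ."""
    delta = expected ^ actual
    return [bit for bit in range(64) if (delta >> bit) & 1]


expected = 0x00218ED408065924

# Read-back after writing the data only (same value in both experiments).
after_data_write = 0x2C218ED408068924
# Read-back after writing the double-bit-flipped ECC code 0x5E.
after_ecc_5e = 0x00A18ED408065964

print(diff_bits(expected, after_data_write))  # [12, 14, 15, 58, 59, 61]
print(diff_bits(expected, after_ecc_5e))      # [6, 55]

# The written ECC codes differ from the original 0x6E as intended.
print(f"{0x6E ^ 0x7E:#04x}")  # 0x10 -> bit 4
print(f"{0x6E ^ 0x5E:#04x}")  # 0x30 -> bits 4 and 5
```

In the double-bit case the read-back differs from the stored data in exactly two bit positions, while the read-back before the ECC code is written differs in six.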

    Any clarification on this would be helpful.

    Thanks & regards,

    Hanson.