66AK2H06: Single-bit error in the SL2/MSMC memory

Part Number: 66AK2H06

Hi

Recently, a device using a 66AK2H06 processor reported a single-bit error in the SL2/MSMC memory that persisted. We have conducted simulation tests and would appreciate your feedback and follow-up as soon as possible.


Symptom: A single-bit error occurred in the constant data of the 66AK2H06 processor's SL2 memory. The memory address was 0x0C3A4048, and the memory data changed from 0x0C38CB04 to 0x0C18CB04 (Bit 21 changed from 1 to 0). The device was powered on and running for nine months, and the log showed two single-bit error corrections in the SL2 memory.
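
The reported change can be cross-checked by XOR-ing the expected and observed words. The following is a minimal sketch in C; the helper name `single_flipped_bit` is illustrative and does not come from any TI library.

```c
#include <stdint.h>

/* Return the index (0 = LSB) of the single bit that differs between the
 * two words, or -1 if they differ in zero bits or in more than one bit. */
static int single_flipped_bit(uint32_t expected, uint32_t observed)
{
    uint32_t diff = expected ^ observed;

    /* A power of two has exactly one bit set. */
    if (diff == 0u || (diff & (diff - 1u)) != 0u)
        return -1;

    int bit = 0;
    while ((diff >> bit) != 1u)
        bit++;
    return bit;
}
```

For the values in this report, `single_flipped_bit(0x0C38CB04u, 0x0C18CB04u)` returns 21, and bit 21 is set in the expected word, which matches the described 1-to-0 change.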


Problem Analysis: The suspected cause of the problem is a write operation to other data in the same SL2 memory subbank. The small-scale write caused the ECC checksum to be temporarily invalid, which, combined with particle radiation, caused the single-bit error in the SL2 memory to persist.

Simulation test: Disable the SL2 memory background scanning function, compare large and small data writes to the same subbank, simulate a single-bit error in the same subbank, and observe whether accessing the erroneous data can trigger an ECC interrupt.

Regards

Zekun

  • Hi Zekun,

    I have informed the corresponding expert to have a look into this query.

    Thanks in advance for your patience.

    Regards

    Gokul

  • Hi Gokul

    This issue is occurring in mass production, so we urgently need a pointer. Thank you for your support.

    Regards

    Zekun

  • Hi Zekun,

    Problem Analysis: The suspected cause of the problem is a write operation to other data in the same SL2 memory subbank. The small-scale write caused the ECC checksum to be temporarily invalid, which, combined with particle radiation, caused the single-bit error in the SL2 memory to persist.

    Yes, exactly. I think the partial write to SL2 memory is causing the issue. Writes to ECC-protected memory should always be aligned and full-width.

    Regards,

    Betsy Varughese
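
The advice above to use aligned, full-width writes can be expressed as a simple check on a write's address and length. This is only a sketch: the thread does not state the width of the ECC-protected unit, so the granule size is a parameter here, and the function name `is_full_width_write` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when the write [addr, addr + len) starts and ends on granule
 * boundaries, so it never touches only part of a protected unit.
 * `granule` is the protected width in bytes and must be a nonzero
 * power of two; its real value must come from the device documentation. */
static bool is_full_width_write(uint32_t addr, size_t len, uint32_t granule)
{
    if (granule == 0u || (granule & (granule - 1u)) != 0u || len == 0u)
        return false;

    return (addr & (granule - 1u)) == 0u && (len & (granule - 1u)) == 0u;
}
```

With an assumed 32-byte granule, a 4-byte write at the reported address 0x0C3A4048 would not pass this check, while a 32-byte write at 0x0C3A4040 would.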

  • Hi Betsy

    Is there a more detailed requirement for this write operation?

    For example, must the address be 32-bit or 64-bit aligned, and is the minimum size 128k or 256k?

    Regards

    Zekun

  • Hi Zekun,

    You can find these details below.

    Reference Link: https://www.ti.com/lit/ug/spruhj6/spruhj6.pdf

    Regards,

    Betsy Varughese

  • 1) To supplement the earlier description of the simulation test case: disable the L1D cache and L2 cache, turn off the SL2 memory background scan function, compare large-granule and small-granule writes to the same subbank, simulate a single-bit error in the same subbank, and observe whether accessing the erroneous data triggers an ECC interrupt.
    2) The on-site device actually uses the L1D cache and L2 cache, with cache lines of 32 and 128 bytes respectively. Data written to SL2 memory has to go through the cache, so in theory every write to SL2 memory is a large-granule write.
    3) Under these actual working conditions, can small-granule writes or short-term invalidity of the SL2 memory ECC check code still occur? Please help analyze this and provide feedback again. Thank you!
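
The hypothesis being tested in this thread can be sketched as a small host-side model in C. Everything here is an assumption made for illustration: the 32-byte granule, the rule that a sub-granule write leaves the check code stale, the scrub that regenerates a stale check code, and the FNV-1a hash standing in for real ECC bits (it can only detect, not correct). It is not a description of the MSMC hardware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GRANULE_BYTES 32u  /* assumed protected width, hypothetical */

typedef struct {
    uint8_t  data[GRANULE_BYTES];
    uint32_t check;        /* toy check code, stands in for ECC bits */
    bool     check_valid;  /* false while the check code is stale */
} toy_granule;

/* FNV-1a hash, used only as a stand-in error-detecting code. */
static uint32_t toy_check(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Full-granule write: data and check code are updated together. */
static void write_full(toy_granule *g, const uint8_t *src)
{
    memcpy(g->data, src, GRANULE_BYTES);
    g->check = toy_check(g->data, GRANULE_BYTES);
    g->check_valid = true;
}

/* Assumed behavior: a sub-granule write leaves the check code stale. */
static void write_partial(toy_granule *g, size_t off,
                          const uint8_t *src, size_t n)
{
    if (off >= GRANULE_BYTES || n > GRANULE_BYTES - off)
        return;
    memcpy(&g->data[off], src, n);
    g->check_valid = false;
}

/* Simulated upset: flip one data bit without touching the check code. */
static void inject_bit_flip(toy_granule *g, size_t byte, unsigned bit)
{
    g->data[byte % GRANULE_BYTES] ^= (uint8_t)(1u << (bit % 8u));
}

/* A read can only flag an error while the check code is valid. */
static bool read_reports_error(const toy_granule *g)
{
    return g->check_valid && toy_check(g->data, GRANULE_BYTES) != g->check;
}

/* Assumed scrub: regenerate a stale check code from the data as it is now. */
static void scrub(toy_granule *g)
{
    if (!g->check_valid) {
        g->check = toy_check(g->data, GRANULE_BYTES);
        g->check_valid = true;
    }
}
```

In this model, a bit flip after `write_full` is reported on the next read. A bit flip that lands after `write_partial` is not reported, and a later `scrub` regenerates the check code over the already corrupted data, so the error persists silently. Whether the real device behaves this way is exactly what the simulation test on hardware is meant to establish.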

  • Hi,

    Could you please have a look at the silicon errata: https://www.ti.com/lit/er/sprz402f/sprz402f.pdf

    It reports a false DDR3 (not MSMC) write ECC error under certain conditions and suggests possible workarounds. We still need to check whether it also applies to MSMC.

    Regards,

    Betsy Varughese

  • Hi Betsy

    1) We have checked the DDR ECC configuration and enabled the RMW function. Does MSMC memory have a similar function?

    2) The on-site device actually uses the L1D cache and L2 cache. Leaving the background scrubbing engine aside, can the MSMC memory ECC check code be invalid for a short time?

    Regards,

    GQ Zhou

  • Hi,

    1) We have checked the DDR ECC configuration and enabled the RMW function. Does MSMC memory have a similar function?

    The MSMC memory does not support Read-Modify-Write (RMW) ECC like DDR memory. Its ECC relies on aligned writes and standard error correction, with no special handling for partial or misaligned writes. RMW is exclusive to the DDR3 memory controller.

    2) The on-site device actually uses the L1D cache and L2 cache. Leaving the background scrubbing engine aside, can the MSMC memory ECC check code be invalid for a short time?

    If your system is heavily cache-based (L1D/L2 caches), many memory accesses are served from cache and do not trigger ECC checking in MSMC immediately. Without active scrubbing, single-bit errors in MSMC can persist undetected until a direct MSMC read is performed or the scrubbing engine cycles through the address.

    Regards,
    Shabary S Sundar.

  • Hi,

    Thank you for the reply!

    1) One probable cause we have identified is a single-bit error in the L1D cache. Because the L1D cache has no ECC/EDC function, the corrupted data would be written back to MSMC memory, where the single-bit error then becomes permanent.

    2) The device actually uses the L1D cache and L2 cache. When cache data is written back to MSMC memory, is the ECC check code calculated synchronously, with no need to wait for the scrubbing engine?

    3) Based on theoretical analysis, are there other probable causes?

    Regards,

    GQ Zhou

  • Hi,

    2) The device actually uses the L1D cache and L2 cache. When cache data is written back to MSMC memory, is the ECC check code calculated synchronously, with no need to wait for the scrubbing engine?

    The ECC is synchronously calculated by the MSMC memory hardware during the write operation. ECC check codes are generated and stored immediately as the data is written back, ensuring data integrity without delay. There is no need to wait for the scrubbing engine to calculate ECC during the normal cache writeback process.

    3) Based on theoretical analysis, are there other probable causes?

    I haven't noticed any other scenarios, but I will check on this and get back to you.

    Regards,
    Shabary S Sundar 
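
If check bits are generated at write time, as described in the reply above, they protect only what arrives at the memory. The sketch below illustrates why a bit that flips upstream (for example in an unprotected cache, as GQ Zhou suggests) would then be stored with a consistent check code. It uses a single even-parity bit as a stand-in for real ECC; the type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Even-parity bit over a 32-bit word: a stand-in for real ECC check bits. */
static unsigned parity32(uint32_t w)
{
    w ^= w >> 16;
    w ^= w >> 8;
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return w & 1u;
}

typedef struct {
    uint32_t data;
    unsigned check;
} toy_word;

/* The check bit is generated at write time from whatever data arrives. */
static toy_word toy_store(uint32_t incoming)
{
    toy_word w = { incoming, parity32(incoming) };
    return w;
}

/* A load passes when the stored check bit matches the stored data. */
static bool toy_load_ok(toy_word w)
{
    return parity32(w.data) == w.check;
}
```

Storing `0x0C38CB04u ^ (1u << 21)` (a value already corrupted before the write) yields a word that passes `toy_load_ok`, whereas flipping bit 21 of `data` after `toy_store` makes the check fail. Only the second case is visible to the memory's own error detection.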

  • Hi,

    Thank you for the reply!

    1) When cache data is written back to MSMC memory, is there any situation in which the ECC check code is not calculated synchronously or is temporarily invalid?

    2) When the cache is used, can any write reach MSMC without going through the cache?

    Regards,

    GQ Zhou

  • Hi,

    1) When cache data is written back to MSMC memory, is there any situation in which the ECC check code is not calculated synchronously or is temporarily invalid?

    Sorry for the delay, it was a holiday here. Under normal operating conditions, the ECC for MSMC memory is calculated synchronously. I will check on this and get back to you soon.

    Regards,
    Shabary S Sundar

  • Hi,

    2) When the cache is used, can any write reach MSMC without going through the cache?


    By default, if cache is enabled, MSMC/Level 2 memory is cacheable by both L1D and L1P caches, unless the memory regions are explicitly configured as non-cacheable.

    Regards,
    Shabary S Sundar

  • Hi,

    Thank you for the reply!

    1) In our usage scenario, MSMC memory stores only data, not code. Since ECC check codes are generated and stored immediately when cache data is written back to MSMC memory, is the scrubbing engine unnecessary when the L1D cache is enabled?

    2) Returning to our original problem (a single-bit error in MSMC memory occurred and persisted), are there other probable causes? Is an L1D cache error the only cause?

    Regards,

    GQ Zhou

  • Hi,

    1) In our usage scenario, MSMC memory stores only data, not code. Since ECC check codes are generated and stored immediately when cache data is written back to MSMC memory, is the scrubbing engine unnecessary when the L1D cache is enabled?

    Even with L1D cache and ECC on writeback, the scrubbing engine is needed to keep MSMC memory reliable by regularly finding and fixing errors over time.

    2) Returning to our original problem (a single-bit error in MSMC memory occurred and persisted), are there other probable causes? Is an L1D cache error the only cause?

    I will check and update on that within a day.

    Regards,
    Shabary S Sundar

  • Hi,

    Is there any update? Are there other probable causes of our original problem?

    Regards,

    GQ Zhou

  • Hi,

    Is there any update? Are there other probable causes of our original problem?

    Based on the documentation, the most likely causes appear to be partial writes and alignment issues. At this time, we have not identified any other factors that could be linked to your case. The silicon errata also does not indicate any additional issues related to this.

    Regards,
    Shabary S Sundar