TMS320C6655: MSM ECC check beyond the memory space range

Part Number: TMS320C6655


Hi all,

A customer has enabled the MSM ECC check on the C6655.

The datasheet gives the C6655 MSM memory range as 0x0C000000 to 0x0C1FFFFF. But the C6655 has only 1 MB of MSM, so this looks like an error in the datasheet. The MSM memory range should be 0x0C000000 to 0x0C0FFFFF.

The customer has also seen ECC errors reported at 0x0C137020 and 0x0C136820. These are beyond the 1 MB space!

Why did this happen?

Is there any register that restricts the ECC check range to 0x0C000000 to 0x0C0FFFFF?

Thanks!

BR,
Denny

  • Hi,

    I've notified the team. Their feedback will be posted here.

    Best Regards,
    Yordan
  • Hi Denny
    I looked at the internal design spec, and I can confirm that this seems to be a typo in the datasheet.
    MSMC memory on this device is 1 MB, and the relevant address range should be 0x0C00 0000 to 0x0C0F FFFF.

    I will submit a documentation ticket on this.

    I will check with the software team to see whether any setting can restrict the ECC range to match the chip memory map.

    Regards
    Mukul
  • Hi Mukul,

    Do you have any update on this issue? Thanks!

    BR,
    Denny

  • Hi Mukul,

    The customer found that the two ECC errors occurred about 350 ms apart. This looks abnormal, and the customer hopes the BU can help analyze the cause. Is it possible that a single radiation strike hit two addresses on the chip, and that the MSMC refresh time caused the two interrupts to be reported 350 ms apart?

    thanks.

    BR,
    Denny

  • Hi Denny

    Can you clarify your statement above?
    Is the customer still seeing the ECC error in the reserved section?

    When you say 2 errors in 350 ms, does this happen consistently? How many boards were tested, and how many show this issue?

    Did they expose the board to any radiation to test ECC?
  • Mukul,

    Is the customer still seeing the ECC error in the reserved section?
    Yes, the addresses are 0x0c137020 and 0x0c136820.

    When you say 2 errors in 350 ms, does this happen consistently?

    The two errors triggered the interrupt one after the other, with an interval of less than 350 ms. There are some recorded times within 350 ms, so it may happen consistently. We are not sure.

    How many boards were tested, and how many show this issue?

    Only one. The customer built several hundred boards and found this log on one of them.

    Did they expose the board to any radiation to test ECC?

    No, it was in normal use.

    The customer identifies the address with:

    errAddr = (gpMSMC_regs->SMCEA&0xFFFFFF)+0xc000000
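
    For reference, here is a minimal sketch of that calculation together with a check against the 1 MB range discussed above. The base address and size come from this thread. The function names `decode_err_addr` and `in_msmc_sram` are hypothetical, and `smcea` stands in for a value read from `gpMSMC_regs->SMCEA`. Like the customer's formula, it assumes the low 24 bits of SMCEA are a byte offset from the MSMC SRAM base, which this thread does not confirm.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Values taken from the thread: MSMC SRAM starts at 0x0C000000, size 1 MB. */
    #define MSMC_SRAM_BASE 0x0C000000u
    #define MSMC_SRAM_SIZE 0x00100000u

    /* Same arithmetic as the customer's formula. Treating the low 24 bits
     * as a byte offset is an assumption carried over from that formula. */
    uint32_t decode_err_addr(uint32_t smcea)
    {
        return (smcea & 0x00FFFFFFu) + MSMC_SRAM_BASE;
    }

    /* True if addr lies inside the 1 MB MSMC SRAM window. */
    bool in_msmc_sram(uint32_t addr)
    {
        return addr >= MSMC_SRAM_BASE &&
               (addr - MSMC_SRAM_BASE) < MSMC_SRAM_SIZE;
    }

    /* Self-check using the two addresses reported in this thread. */
    void check_reported_addresses(void)
    {
        assert(!in_msmc_sram(decode_err_addr(0x00137020u)));
        assert(!in_msmc_sram(decode_err_addr(0x00136820u)));
    }
    ```

    With the two reported values, both decoded addresses fall outside the 1 MB window, which matches the customer's observation.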

    The customer thinks it is not reasonable for 2 ECC errors to happen within such a short time, and wants an explanation.

    BR,
    Denny

  • Hi Denny
    Thanks for the information.

    A few more questions and requests for information:

    1) Can they share the MSMC register dump, so that we can see the ECC error log?
    2) You said 1 out of 100 boards shows this. Is this board a field return, or something they caught in their in-house testing? Are all boards exposed to the same testing and software?
    3) ECC errors should be generated if a master is accessing or reading this memory. Given that we have established that this address space is truly reserved in the device memory map, can we ensure that software is not accessing this memory area, intentionally or accidentally? Once that is done, does it still show errors?
    4) I am checking with design whether this reserved address should also generate a CPU bus error (see the CorePac user guide for the bus error registers). Can you confirm whether the bus error interrupt is enabled in the customer software, and what the bus error register values are when they see failures?


    Regards
    Mukul
  • Denny
    On #3, some more specific questions:
    1. What address does the customer intend to access? Are they trying to access the two addresses in the reserved space between 0x0C100000:0x0C1FFFFF, and getting ECC errors during the access?
    2. Or, when they access the legal MSMC SRAM space between 0x0C000000:0x0C0FFFFF, are the ECC errors triggered on the two unrelated addresses reported?
  • Hi Mukul,

    1. Sorry, they didn't record the registers when the error happened. The log only records the time of the error.

    2. The boards are in field use. All boards run the same software.

    3. We can confirm that the software doesn't access the reserved space.

    4. Yes, the bus error interrupt is enabled in the customer software.

    BR,

    Denny

  • It's 2: when they access the legal MSMC SRAM space between 0x0C000000:0x0C0FFFFF, the ECC errors are triggered on the two unrelated addresses.
  • Thanks for the update Denny.
    To debug this we will likely need some sort of error log or register dump for the CorePac and MSMC registers.

    Is the issue easy to reproduce, so that they can capture the relevant registers when it fails again?
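
    One way to make the next failure debuggable is to copy a block of registers into a small in-memory log from the error handler, before any status is cleared. Below is a minimal sketch of such a capture routine. All names in it (`snap_store_t`, `snap_capture`, `SNAP_WORDS`, `SNAP_DEPTH`) are hypothetical, and the sizes are arbitrary. Which MSMC and CorePac registers to pass in is left to the caller, since this thread does not list them.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    #define SNAP_WORDS 16u  /* 32-bit words kept per event (assumed size) */
    #define SNAP_DEPTH  8u  /* number of events kept (assumed size) */

    typedef struct {
        uint32_t words[SNAP_WORDS];
    } reg_snapshot_t;

    typedef struct {
        reg_snapshot_t entries[SNAP_DEPTH];
        uint32_t count;  /* total number of events captured so far */
    } snap_store_t;

    /* Copy up to SNAP_WORDS consecutive 32-bit registers starting at 'regs'
     * into the next slot of a ring buffer. The oldest entry is overwritten
     * once SNAP_DEPTH events have been stored. */
    void snap_capture(snap_store_t *store, const volatile uint32_t *regs, size_t n)
    {
        assert(store != NULL && regs != NULL);
        if (n > SNAP_WORDS) {
            n = SNAP_WORDS;  /* clamp rather than overrun the slot */
        }
        reg_snapshot_t *slot = &store->entries[store->count % SNAP_DEPTH];
        for (size_t i = 0; i < SNAP_WORDS; i++) {
            slot->words[i] = (i < n) ? regs[i] : 0u;
        }
        store->count++;
    }
    ```

    The handler would call `snap_capture` once per register block of interest, and the store could then be written out alongside the existing exception log.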
  • Some more follow-up questions based on internal discussion.

    Please clarify your observation of the ECC error reported from the reserved region when accessing regular MSMC memory space:

    1. When the customer code was accessing legal SRAM space between 0x0C000000:0x0C0FFFFF, MSMC responded with an all-zeros data bus and ECC error status, but the log register, according to their calculation, saved an address that appears to be outside the legal region?
    2. When the customer code was accessing legal SRAM space between 0x0C000000:0x0C0FFFFF, MSMC responded with correct data and success status, but a two-bit error interrupt was triggered and the log register, according to their calculation, saved an address that appears to be outside the legal region?

    As previously requested, we may need the MSMC register dump, but if the customer has previously captured the SMCEA register content, please share it.
  • 2. When the customer code was accessing legal SRAM space between 0x0C000000:0x0C0FFFFF, MSMC responded with correct data and success status, but a two-bit error interrupt was triggered and the log register, according to their calculation, saved an address that appears to be outside the legal region.
    This is the customer's observation.
  • I am sharing the log. I hope it helps.


    **********ExceptionLog**********

    Date :Mar 14 2017,version :1.01


    BOARD CPU TYPE : TI C66X
    FORMAT VERSION : 1.00

    |----------------------------------|
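    ----------Most recent exception info---------
    |----------------------------------|

    *****************************Error No.: 2*****************************

    Exception occurred on CORE0

    Exception time: 2017-08-15 03:28:49.817

    Interrupt service routine where the exception occurred: 0x00000000
    --- Exception caught in main loop

    Exception cause: 0x00000051
    --- core external exception

    Detailed exception info:
    SL2 Correctable error occurred at address 0x0c137020 by scrubbing

    Register contents at the time of the exception: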
    B3 = 0x00000000 return pointer of caller
    A4 = 0x00000000 first input parameter of caller
    B4 = 0x00000000 second input parameter of caller
    B14 = 0x00000000 data pointer
    B15 = 0x00000000 stack pointer
    NTSR = 0x00000000 NMI/Exception Task State Register
    NRP = 0x00000000 Nonmaskable Interrupt Return Pointer Register
    EFR = 0x00000000 Exception flag register
    ITSR = 0x00000000 Interrupt task state register
    IRP = 0x00000000 Interrupt Return Pointer Register

    Exception PC pointer: 0x00000000
    Exception function pointer: 0x8A0E7350
    ************************end of an exception inf************************


    |----------------------------------|
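    -----------All exception info------------
    |----------------------------------|
    2 exception records stored in total!

    *****************************Error No.: 1*****************************

    Exception occurred on CORE0

    Exception time: 2017-08-15 03:28:49.464

    Interrupt service routine where the exception occurred: 0x00000000
    --- Exception caught in main loop

    Exception cause: 0x00000051
    --- core external exception

    Detailed exception info:
    SL2 Correctable error occurred at address 0x0c136820 by scrubbing

    Register contents at the time of the exception: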
    B3 = 0x00000000 return pointer of caller
    A4 = 0x00000000 first input parameter of caller
    B4 = 0x00000000 second input parameter of caller
    B14 = 0x00000000 data pointer
    B15 = 0x00000000 stack pointer
    NTSR = 0x00000000 NMI/Exception Task State Register
    NRP = 0x00000000 Nonmaskable Interrupt Return Pointer Register
    EFR = 0x00000000 Exception flag register
    ITSR = 0x00000000 Interrupt task state register
    IRP = 0x00000000 Interrupt Return Pointer Register

    Exception PC pointer: 0x00000000
    Exception function pointer: 0x8A0E7350
    ************************end of an exception inf************************


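    *****************************Error No.: 2*****************************

    Exception occurred on CORE0

    Exception time: 2017-08-15 03:28:49.817

    Interrupt service routine where the exception occurred: 0x00000000
    --- Exception caught in main loop

    Exception cause: 0x00000051
    --- core external exception

    Detailed exception info:
    SL2 Correctable error occurred at address 0x0c137020 by scrubbing

    Register contents at the time of the exception: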
    B3 = 0x00000000 return pointer of caller
    A4 = 0x00000000 first input parameter of caller
    B4 = 0x00000000 second input parameter of caller
    B14 = 0x00000000 data pointer
    B15 = 0x00000000 stack pointer
    NTSR = 0x00000000 NMI/Exception Task State Register
    NRP = 0x00000000 Nonmaskable Interrupt Return Pointer Register
    EFR = 0x00000000 Exception flag register
    ITSR = 0x00000000 Interrupt task state register
    IRP = 0x00000000 Interrupt Return Pointer Register

    Exception PC pointer: 0x00000000
    Exception function pointer: 0x8A0E7350
    ************************end of an exception inf************************
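
    The two timestamps in the log (03:28:49.464 for error 1 and 03:28:49.817 for error 2, same date) can be subtracted to check the interval quoted earlier in the thread. A small sketch of that arithmetic follows. The type `log_time_t` and the function names are hypothetical and not taken from the customer's code.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Time of day as printed in the exception log (date omitted because
     * both entries share the same date). */
    typedef struct {
        uint32_t hour, min, sec, ms;
    } log_time_t;

    /* Milliseconds elapsed since midnight. */
    uint32_t ms_of_day(log_time_t t)
    {
        return ((t.hour * 60u + t.min) * 60u + t.sec) * 1000u + t.ms;
    }

    /* Interval between two same-day timestamps, in milliseconds. */
    uint32_t interval_ms(log_time_t earlier, log_time_t later)
    {
        assert(ms_of_day(later) >= ms_of_day(earlier));
        return ms_of_day(later) - ms_of_day(earlier);
    }
    ```

    For the two entries above this gives 817 - 464 = 353 ms, in line with the roughly 350 ms interval mentioned earlier.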
  • Thanks Denny. This log is likely not enough; the next time they see a failure, we may need an MSMC register dump.

    Can you also share the program code that generates the following message?

    >>Detailed exception info:
    SL2 Correctable error occurred at address 0x0c137020 by scrubbing