TMS320C6655: MSM ecc check over memory space range.

Denny Yang99373

Part Number: TMS320C6655

Hi all,

The customer has enabled the MSM ECC check on the C6655.

The datasheet lists the C6655 MSM memory as 0x0C000000~0x0C1FFFFF. But the C6655 only has 1 MB of MSM, so this looks like an error in the datasheet. The MSM memory should be 0x0C000000~0x0C0FFFFF.

The customer has also seen ECC errors reported at 0x0c137020 and 0x0c136820, which is beyond the 1 MB space!

Why did this happen?

Is there any register that restricts the ECC check to the range 0x0C000000~0x0C0FFFFF?

thanks!

BR,
Denny

over 8 years ago

0 Yordan Kovachev over 8 years ago

TI__Guru**** 161600 points

Hi,

I've notified the team. Their feedback will be posted here.

Best Regards,
Yordan

0 Mukul Bhatnagar over 8 years ago

TI__Guru* 84865 points

Hi Denny
I looked at the internal design spec and can confirm that this appears to be a typo in the datasheet.
MSMC memory on this device is 1 MB, and the relevant address range should be 0x0C00 0000 to 0x0C0F FFFF.
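
As a quick sanity check on the sizes, the two ranges can be compared with a little arithmetic. This is only an illustrative sketch; `region_size` is a made-up helper name, not something from the datasheet or the CSL.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of an inclusive address range [first, last].
 * Hypothetical helper for illustration only. */
static uint32_t region_size(uint32_t first, uint32_t last)
{
    assert(last >= first);
    return last - first + 1u;
}
```

With this helper, 0x0C000000 to 0x0C0FFFFF comes out as 0x100000 bytes (1 MB), while 0x0C000000 to 0x0C1FFFFF comes out as 0x200000 bytes (2 MB), which is why the wider range cannot be right for a 1 MB device.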

I will submit a documentation ticket on this.

I will check with the software team to see whether there is any setting that restricts the ECC range to match the chip memory map.

Regards
Mukul

0 Denny Yang99373 over 8 years ago in reply to Mukul Bhatnagar

TI__Expert 3805 points

Hi Mukul,

Do you have any update on this issue? Thanks!

BR,
Denny

0 Denny Yang99373 over 8 years ago in reply to Mukul Bhatnagar

TI__Expert 3805 points

Hi Mukul,

The customer found that the two ECC errors occurred about 350 ms apart. This seems abnormal, and the customer hopes the BU can help analyze the cause. Is it possible that a single radiation event hit two addresses on the chip, and that the MSMC refresh time caused the two interrupts to be reported 350 ms apart?

thanks.

BR,
Denny

0 Mukul Bhatnagar over 8 years ago in reply to Denny Yang99373

TI__Guru* 84865 points

Hi Denny

Can you clarify your statement above?
Is the customer still seeing the ECC error in the reserved section?

When you say 2 errors in 350 ms, does this happen consistently? How many boards were tested, and how many show this issue?

Did they expose the board to any radiation to test ECC?

0 Denny Yang99373 over 8 years ago in reply to Mukul Bhatnagar

TI__Expert 3805 points

Mukul,

Is the customer still seeing the ECC error in the reserved section?
Yes, the addresses are 0x0c137020 and 0x0c136820.

When you say 2 errors in 350 ms, does this happen consistently?

The two errors raised interrupts one after the other, less than 350 ms apart. The log contains records within 350 ms of each other, so it may happen consistently, but we are not sure.

How many boards were tested, and how many show this issue?

Only one. The customer built several hundred boards and found this log on one of them.

Did they expose the board to any radiation to test ECC?

No, the board is in normal use.

The customer derives the error address as follows:

errAddr = (gpMSMC_regs->SMCEA&0xFFFFFF)+0xc000000
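
To make the out-of-range observation concrete, here is a small sketch that applies the same formula to a raw register value and checks the result against the 1 MB window confirmed earlier in this thread. The function and macro names are invented for this example, the raw SMCEA value is passed in as a plain parameter because it was not captured, and the sketch does not claim that this is the correct way to decode SMCEA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSMC_SRAM_BASE 0x0C000000u /* start of MSMC SRAM */
#define MSMC_SRAM_SIZE 0x00100000u /* 1 MB, as confirmed in this thread */

/* The window check below relies on the size being exactly 1 MB. */
static_assert(MSMC_SRAM_SIZE == 1024u * 1024u, "MSMC SRAM is expected to be 1 MB");

/* Customer's formula from the post above: low 24 bits of SMCEA plus the base.
 * Hypothetical helper; whether this matches the real field layout is not
 * confirmed here. */
static uint32_t err_addr_from_smcea(uint32_t smcea_raw)
{
    return (smcea_raw & 0x00FFFFFFu) + MSMC_SRAM_BASE;
}

/* True if addr lies inside the 1 MB MSMC SRAM window. */
static bool in_msmc_sram(uint32_t addr)
{
    return addr >= MSMC_SRAM_BASE && (addr - MSMC_SRAM_BASE) < MSMC_SRAM_SIZE;
}
```

With this formula, a register value whose low 24 bits are 0x137020 yields 0x0C137020, and `in_msmc_sram` rejects both 0x0C137020 and 0x0C136820. Logging the raw register value next to the derived address would make it possible to re-check the decoding afterwards.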

The customer thinks it is not reasonable for two ECC errors to happen within such a short time and would like TI to explain this.

BR,
Denny

0 Mukul Bhatnagar over 8 years ago in reply to Denny Yang99373

TI__Guru* 84865 points

Hi Denny
Thanks for the information

A few more questions and requests for information:

1) Can they share the MSMC register dump so that we can see the ECC error log?
2) You said only one board out of several hundred shows this. Is this board a field return, or something they caught in their in-house testing? Are all boards exposed to the same testing and software?
3) ECC errors should be generated when a master is accessing or reading this memory. Given that we have established that this address space is truly reserved in the device memory map, can we ensure that software is not accessing this memory area, intentionally or accidentally? Once that is ensured, does it still show errors?
4) I am checking with design whether this reserved address should also generate a CPU bus error (see the CorePac user guide for the bus error registers). Can you confirm whether the bus error interrupt is enabled in the customer software, and what the bus error register values are when they see failures?

Regards
Mukul

0 Mukul Bhatnagar over 8 years ago in reply to Mukul Bhatnagar

TI__Guru* 84865 points

Denny
On #3, some further specific questions:
1. What address does the customer intend to access? Are they trying to access the reserved space 0x0C100000:0x0C1FFFFF, including the two addresses, and getting the ECC errors during that access?
2. Or are the ECC errors triggered on the two unrelated reported addresses when they access the legal MSMC SRAM space 0x0C000000:0x0C0FFFFF?

0 Denny Yang99373 over 8 years ago in reply to Mukul Bhatnagar

TI__Expert 3805 points

Hi Mukul,

1. Sorry, they didn't record the registers when the error happened. The log only records the time at which the error happened.

2. The boards are in field use. All boards run the same software.

3. We can confirm that the software did not access the reserved space.

4. Yes, the bus error interrupt is enabled in the customer software.

BR,

Denny

0 Denny Yang99373 over 8 years ago in reply to Mukul Bhatnagar

TI__Expert 3805 points

It's case 2: when they access the legal MSMC SRAM space 0x0C000000:0x0C0FFFFF, the ECC errors are triggered on the two unrelated addresses.

0 Mukul Bhatnagar over 8 years ago in reply to Denny Yang99373

TI__Guru* 84865 points

Thanks for the update Denny.
To debug this we will likely need some sort of error log or register dump for the CorePac and MSMC registers.

Is the issue easy to reproduce, so that they can capture the relevant registers when it fails again?
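
One possible way for the firmware to capture such a dump is to copy a block of raw register words into a small ring buffer from the error handler, so that the values survive until someone reads the log. The sketch below is only an assumption about how that could be done: all type and function names are hypothetical, and which registers to copy (MSMC error registers, bus error registers) is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SNAPSHOT_WORDS 8u /* raw register words kept per event (arbitrary) */
#define LOG_DEPTH      4u /* number of events kept before wrapping (arbitrary) */

/* One captured event: a timestamp plus raw, undecoded register words. */
typedef struct {
    uint32_t timestamp_ms;
    uint32_t words[SNAPSHOT_WORDS];
} snapshot_t;

/* Ring buffer of the most recent LOG_DEPTH events. */
typedef struct {
    snapshot_t slots[LOG_DEPTH];
    uint32_t total; /* events recorded since start, including overwritten ones */
} snapshot_log_t;

/* Copy n_words raw register values into the next slot, overwriting the
 * oldest entry once the buffer is full. Unused words are zeroed. */
static void snapshot_record(snapshot_log_t *log, uint32_t timestamp_ms,
                            const volatile uint32_t *regs, uint32_t n_words)
{
    assert(log != NULL && regs != NULL && n_words <= SNAPSHOT_WORDS);

    snapshot_t *slot = &log->slots[log->total % LOG_DEPTH];
    slot->timestamp_ms = timestamp_ms;
    for (uint32_t i = 0u; i < SNAPSHOT_WORDS; i++) {
        slot->words[i] = (i < n_words) ? regs[i] : 0u;
    }
    log->total++;
}
```

Keeping the raw words instead of only a derived address means the decoding can be reviewed later, which is the information missing from the current log.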

0 Mukul Bhatnagar over 8 years ago in reply to Mukul Bhatnagar

TI__Guru* 84865 points

Some more follow up questions based on internal discussion

Please clarify your observation of the ECC error reported from the reserved region when accessing the regular MSMC memory space:

1. When the customer code was accessing the legal SRAM space 0x0C000000:0x0C0FFFFF, MSMC responded with all 0's on the data bus and an ECC error status. But the log register, according to their calculation, saved an address that appears to be outside the legal region?
2. When the customer code was accessing the legal SRAM space 0x0C000000:0x0C0FFFFF, MSMC responded with correct data and a success status. But a two-bit error interrupt was triggered, and the log register, according to their calculation, saved an address that appears to be outside the legal region?

As previously requested, we may need the MSMC register dump, but if the customer has already captured the SMCEA register content, please share it.

0 Denny Yang99373 over 8 years ago in reply to Mukul Bhatnagar

TI__Expert 3805 points

Case 2: when the customer code was accessing the legal SRAM space 0x0C000000:0x0C0FFFFF, MSMC responded with correct data and a success status. But a two-bit error interrupt was triggered, and the log register, according to their calculation, saved an address that appears to be outside the legal region.
This is the customer's observation.

0 Denny Yang99373 over 8 years ago in reply to Mukul Bhatnagar

TI__Expert 3805 points

I am sharing the log below. I hope it helps.

**********ExceptionLog**********

Date :Mar 14 2017,version :1.01

BOARD CPU TYPE : TI C66X
FORMAT VERSION : 1.00

|----------------------------------|
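------ Most recent exception ------
|----------------------------------|

*****************************Error number: 2*****************************

Exception occurred on CORE0

Exception time: 2017-08-15 03:28:49.817

Interrupt service routine in which the exception occurred: 0x00000000
--- exception caught in the main loop

Exception cause: 0x00000051
--- exception external to the core

Detailed exception information:
SL2 Correctable error occurred at address 0x0c137020 by scrubbing

Register contents at the time of the exception: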
B3 = 0x00000000 return pointer of caller
A4 = 0x00000000 first input parameter of caller
B4 = 0x00000000 second input parameter of caller
B14 = 0x00000000 data pointer
B15 = 0x00000000 stack pointer
NTSR = 0x00000000 NMI/Exception Task State Register
NRP = 0x00000000 Nonmaskable Interrupt Return Pointer Register
EFR = 0x00000000 Exception flag register
ITSR = 0x00000000 Interrupt task state register
IRP = 0x00000000 Interrupt Return Pointer Register

Exception PC pointer: 0x00000000
Exception function pointer: 0x8A0E7350
************************end of an exception inf************************

|----------------------------------|
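---------- All exceptions ----------
|----------------------------------|
2 exception records stored in total!

*****************************Error number: 1*****************************

Exception occurred on CORE0

Exception time: 2017-08-15 03:28:49.464

Interrupt service routine in which the exception occurred: 0x00000000
--- exception caught in the main loop

Exception cause: 0x00000051
--- exception external to the core

Detailed exception information:
SL2 Correctable error occurred at address 0x0c136820 by scrubbing

Register contents at the time of the exception: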
B3 = 0x00000000 return pointer of caller
A4 = 0x00000000 first input parameter of caller
B4 = 0x00000000 second input parameter of caller
B14 = 0x00000000 data pointer
B15 = 0x00000000 stack pointer
NTSR = 0x00000000 NMI/Exception Task State Register
NRP = 0x00000000 Nonmaskable Interrupt Return Pointer Register
EFR = 0x00000000 Exception flag register
ITSR = 0x00000000 Interrupt task state register
IRP = 0x00000000 Interrupt Return Pointer Register

Exception PC pointer: 0x00000000
Exception function pointer: 0x8A0E7350
************************end of an exception inf************************

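*****************************Error number: 2*****************************

Exception occurred on CORE0

Exception time: 2017-08-15 03:28:49.817

Interrupt service routine in which the exception occurred: 0x00000000
--- exception caught in the main loop

Exception cause: 0x00000051
--- exception external to the core

Detailed exception information:
SL2 Correctable error occurred at address 0x0c137020 by scrubbing

Register contents at the time of the exception: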
B3 = 0x00000000 return pointer of caller
A4 = 0x00000000 first input parameter of caller
B4 = 0x00000000 second input parameter of caller
B14 = 0x00000000 data pointer
B15 = 0x00000000 stack pointer
NTSR = 0x00000000 NMI/Exception Task State Register
NRP = 0x00000000 Nonmaskable Interrupt Return Pointer Register
EFR = 0x00000000 Exception flag register
ITSR = 0x00000000 Interrupt task state register
IRP = 0x00000000 Interrupt Return Pointer Register

Exception PC pointer: 0x00000000
Exception function pointer: 0x8A0E7350
************************end of an exception inf************************
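
For reference, the interval between the two records can be worked out directly from the timestamps in the log. The sketch below assumes both timestamps fall on the same day, and `log_time_t`, `to_ms` and `interval_ms` are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Time of day with millisecond resolution, as printed in the exception log. */
typedef struct {
    uint32_t hour;
    uint32_t minute;
    uint32_t second;
    uint32_t millisecond;
} log_time_t;

/* Convert a time of day to milliseconds since midnight. */
static uint32_t to_ms(log_time_t t)
{
    assert(t.hour < 24u && t.minute < 60u && t.second < 60u && t.millisecond < 1000u);
    return ((t.hour * 60u + t.minute) * 60u + t.second) * 1000u + t.millisecond;
}

/* Milliseconds elapsed from 'earlier' to 'later' (same day assumed). */
static uint32_t interval_ms(log_time_t earlier, log_time_t later)
{
    assert(to_ms(later) >= to_ms(earlier));
    return to_ms(later) - to_ms(earlier);
}
```

For the two records in this log, 03:28:49.464 and 03:28:49.817, the result is 353 ms.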

0 Mukul Bhatnagar over 8 years ago in reply to Denny Yang99373

TI__Guru* 84865 points

Thanks Denny. This log is likely not enough, and the next time they see a failure we may need an MSMC register dump.

Can you also share the program code that generates the following message?

>>Detailed exception information:
SL2 Correctable error occurred at address 0x0c137020 by scrubbing
