TMS570LS3137: ESM Group3 RAM even bank ECC error on cold boot

Chuck Wong

Part Number: TMS570LS3137

EDITED: HDK at ambient temperature.

Hello there,

I have an early version of the TMS570LS31x HDK with a TMX570LS3137BZWTQWQI YFB - 25ASX4W GI MCU that we acquired a couple of years ago.

Recently, every time it powers up in the morning, ESMSR3=0x00000008, indicating an ESM Group3 RAM even bank ECC error. The nERROR pin is asserted (onboard red LED ON) and the MCU refuses to start.

Please note that I can read the ESMSR3 value only after connecting the debugger via JTAG and reprogramming the MCU with the same binary code. After that, the MCU operates correctly, but the onboard red LED remains ON.

Once the MCU has warmed up for a while, if the HDK is powered OFF for a matter of seconds or minutes and then turned ON again, everything works fine and there is no RAM ECC error.

Is there any explanation for this behavior?

Thanks.

over 4 years ago

0 Sunil Oak over 4 years ago

TI__Mastermind 49120 points

Hi Chuck,

The CPU should have also responded to this error with a data abort. The data abort handler (or a user with a debugger connection) can query the data fault status and address registers to identify the root cause of the abort.

How does your application handle a data abort condition?

Regards, Sunil

0 Chuck Wong over 4 years ago in reply to Sunil Oak

Genius 3920 points

Hi Sunil,

My application is compiled/linked using the IAR EWARM IDE, in which there is a default exception handler for such purpose (the code will stop if caught in the handler).

I was expecting at least the debugger to be able to connect via JTAG without downloading, so I could inspect the state of the CPU system following such error. Since connect without downloading is not possible, I have to re-download the complete application so the state is probably lost but it could now run to main() thereafter.

Is my expectation wrong?

Thanks.

0 Sunil Oak over 4 years ago in reply to Chuck Wong

TI__Mastermind 49120 points

Hi Chuck,

You should be able to "Debug without Downloading" the code into the Flash again. See attached picture for the option to use:

0 Chuck Wong over 4 years ago in reply to Sunil Oak

Genius 3920 points

I believe I have already tried that, without success.

Since the unit is now warm and working without nERROR asserted, I will have to wait until Monday morning to try again.

I'll keep you posted.

Thanks.

0 Sunil Oak over 4 years ago in reply to Chuck Wong

TI__Mastermind 49120 points

Hi Chuck,

Were you able to replicate the issue today?

Regards, Sunil

0 Chuck Wong over 4 years ago in reply to Sunil Oak

Genius 3920 points

Hi Sunil,

Thanks for the follow-up. I have issued the "Debug without Downloading" command from the IAR interface, and it does connect without error.

However, I suspect that the CPU is lost somewhere. When I pause the debugger, it does break and shows the ESMSR3 register value as 0x00000008, as before, but the debugger doesn't tell me where the code has stopped.

The question is: why does this happen only on a cold boot (when the board has not been warmed up beforehand)? Is the MCU defective?

0 Sunil Oak over 4 years ago in reply to Chuck Wong

TI__Mastermind 49120 points

Hi Chuck,

If you look at the CPU registers, the program counter (PC) should show where the code is stuck. I suspect it is going to be in the data abort handler, where there could be a "loop here" code construct keeping the CPU from returning to the main application.

That said, a TMX570LS3137BZWTQWQI part has not been through the final production test, so there could be functionality that is not correct across the entire temperature range.

Regards, Sunil

0 Chuck Wong over 4 years ago in reply to Sunil Oak

Genius 3920 points

Hi Sunil,

The HDK has been powered OFF since 11:32 this morning, after running for less than 5 minutes (with the RAM ECC error the whole time), time that I used to reply to your question.

Now it is almost 5 hours later. I've just connected it back to power, and guess what, the MCU is running fine with last week's binary code in the Flash.

So I guess I can confirm the program counter location only tomorrow morning ...

Weird ... just how long does the board need to stay off before the RAM ECC error appears?!?

Thanks.

0 Chuck Wong over 4 years ago in reply to Chuck Wong

Genius 3920 points

Please also note that this RAM ECC error, which appears after an extended power-off period, only started showing up on my HDK recently. The indicator, other than the CPU hang-up and the ESMSR3 register contents, is the onboard red LED.

I know that I can find an old build to try out, but it will take another day ...

Could a SW issue cause the RAM ECC error in ESM Group3 (which is not SW controllable and not configurable)?

0 Sunil Oak over 4 years ago in reply to Chuck Wong

TI__Mastermind 49120 points

Chuck,

You can try using some freeze spray and see if you can cause the failure to happen on demand. A software issue can certainly cause this, for example if you read a RAM location without first initializing it. This can happen even on a write access to the SRAM. The Cortex-R4F CPU performs a read-modify-write operation for any write to the SRAM that is not 64-bit wide. So any push onto the stack could cause this RMW operation to trigger a double-bit ECC error.

Regards, Sunil
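
The read-modify-write behavior described above can be illustrated with a small host-side toy model. This is only a sketch: the `ecc_word_t` type, the `toy_check` XOR fold and the function names are hypothetical, and the check byte is a stand-in for the real SECDED code, not an implementation of it. It shows why a narrow write to a word whose check bits were never initialized can fail, while a full 64-bit write cannot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one 64-bit ECC-protected RAM word. The check byte is a
 * simple XOR fold of the data bytes, NOT the real SECDED code. */
typedef struct {
    uint64_t data;
    uint8_t  check;
} ecc_word_t;

static uint8_t toy_check(uint64_t d)
{
    uint8_t c = 0;
    for (int i = 0; i < 8; i++) {
        c ^= (uint8_t)(d >> (8 * i));
    }
    return c;
}

/* Full-width write: data and check bits are both replaced, nothing is read. */
void write64(ecc_word_t *w, uint64_t value)
{
    assert(w != NULL);
    w->data  = value;
    w->check = toy_check(value);
}

/* Narrow write: the existing word is read and checked before the new half
 * is merged in. Returns false where the real CPU would take a data abort. */
bool write32(ecc_word_t *w, unsigned half, uint32_t value)
{
    assert(w != NULL && half < 2u);
    if (toy_check(w->data) != w->check) {
        return false;   /* stale check bits detected during the read phase */
    }
    unsigned shift  = 32u * half;
    uint64_t merged = (w->data & ~(0xFFFFFFFFull << shift))
                    | ((uint64_t)value << shift);
    write64(w, merged);
    return true;
}
```

In this model, a word that powers up with random data and check bits rejects `write32` until it has been written once with `write64`, which mirrors the need to initialize the SRAM before the first stack push.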

0 Chuck Wong over 4 years ago in reply to Sunil Oak

Genius 3920 points

Sunil,

Kind of strange, remember that the HDK is in my office and always at room temperature, around 23°C.

I couldn't find a can of freeze spray, only a can of compressed-gas duster. I sprayed the MCU for around 60 seconds and made sure that it was really cold (still above the freezing point, but very cold to the touch). No luck: it just runs fine when power is connected.

I'm asking myself whether this is temperature-related or simply some molecules (read: flip-flop gates) inside the MCU that are poking fun at me! It would be happy once energized for a couple of minutes, then discharged slowwwwwwwwly (more than 5 hours) ... lol

I remember that in the morning, if I plug the HDK into power for a couple of seconds, see the problem, then disconnect and reconnect it, the problem is still there. My observation here is that it should take more than a couple of seconds to energize the things ... if the things are real :)

I will try again tomorrow.

0 Chuck Wong over 4 years ago

Genius 3920 points

Good morning Sunil,

Yes the program counter is indeed at 0x000cc98, which points to the interrupt handler of the Abort exception.

All it needs is a couple of seconds connected to power; it then works fine after the power supply is cycled.

Should I get the MCU changed on this HDK?

Any other suggestion for such behavior?

Thanks again.

0 Sunil Oak over 4 years ago in reply to Chuck Wong

TI__Mastermind 49120 points

Good morning Chuck,

Yes, there are some things you can do to debug further.

The CPU captures the address and the actual cause of the data abort in the data fault address and status registers. Please read the contents of these registers from the debugger. This will tell you the actual address that was accessed and the cause of the error (double-bit ECC fault, access permission fault, etc.).

Based on the above information, you can look into your program to see if there is any scope for your application to make an access to the SRAM address without the SRAM being first initialized.

Another suggestion: please update the data abort handler to manage such a condition so that it is not stuck in a "loop-forever" code construct, unless you have an external watchdog to take the system into a safe state.

Regards, Sunil

0 Chuck Wong over 4 years ago in reply to Sunil Oak

Genius 3920 points

Hi Sunil,

Could you please let me know how to inspect the data fault address and status registers? Are those registers managed by the ARM core and copied to some MCU register space, or are they managed directly by the MCU?

Thanks for sharing your knowledge.

Regards.

0 Sunil Oak over 4 years ago in reply to Chuck Wong

TI__Mastermind 49120 points

Hi Chuck,

You should be able to see the CPU fault registers in the debugger window that shows the CPU registers. In CCS, try View -> Registers. One of the register sets will be the CPU registers (first one on top). If you open the group of "System Registers", you can see the DFAR and DFSR. See some screenshots below.

Using IAR, you can View -> Registers when connected to the part. Then choose the register "Group" to be PL1 Fault Handling Registers. Screenshot below.

Regards, Sunil

0 Chuck Wong over 4 years ago in reply to Sunil Oak

Genius 3920 points

Great Sunil,

I see both DFSR and DFAR registers under IAR EWARM.

I understand that the DFAR register indicates the data access fault address, but where may I find the interpretation of the DFSR register please?

Thanks.

0 Sunil Oak over 4 years ago in reply to Chuck Wong

TI__Mastermind 49120 points

DFSR is described in the Cortex-R4F TRM. See https://developer.arm.com/docs/ddi0363/g/system-control/register-descriptions/fault-status-and-address-registers
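
As a rough aid for reading a DFSR value by hand, the following sketch splits it into fields. The bit positions and status encodings are my reading of the Cortex-R4 TRM linked above and should be checked against it; the type and function names (`dfsr_fields_t`, `dfsr_decode`, `dfsr_status_name`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t status;   /* 5-bit fault status: DFSR bit 10 followed by bits 3:0 */
    bool    is_write; /* DFSR bit 11 (RW): true if a write caused the abort */
    bool    sd;       /* DFSR bit 12 (SD): only meaningful for external aborts */
} dfsr_fields_t;

dfsr_fields_t dfsr_decode(uint32_t dfsr)
{
    dfsr_fields_t f;
    f.status   = (uint8_t)((((dfsr >> 10) & 1u) << 4) | (dfsr & 0xFu));
    f.is_write = ((dfsr >> 11) & 1u) != 0u;
    f.sd       = ((dfsr >> 12) & 1u) != 0u;
    return f;
}

/* Status names as assumed from the TRM fault status table. */
const char *dfsr_status_name(uint8_t status)
{
    assert(status < 32u);
    switch (status) {
    case 0x00: return "background";
    case 0x01: return "alignment";
    case 0x02: return "debug event";
    case 0x08: return "synchronous external abort";
    case 0x0D: return "permission";
    case 0x16: return "asynchronous external abort";
    case 0x18: return "asynchronous parity/ECC error";
    case 0x19: return "synchronous parity/ECC error";
    default:   return "reserved/unknown";
    }
}
```

For example, a DFSR of 0x00000C09 has bits 11 and 10 set and bits 3:0 equal to 0b1001, so this sketch reports a write access with status 0x19, a synchronous parity/ECC error.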

0 Chuck Wong over 4 years ago in reply to Sunil Oak

Genius 3920 points

Thank you Sunil,

Will check tomorrow and keep you posted.

Regards.

0 Chuck Wong over 4 years ago in reply to Chuck Wong

Genius 3920 points

Morning Sunil,

Well kind of missed a chance to reproduce the problem this morning ...

I cold-started the HDK as usual. The nERROR LED was still lit, as on every previous day, but this time the "Debug without Downloading" command actually worked: the debugger ran to main() without trouble and waited there, with the LED still red, instead of looping endlessly in the Abort handler as before.

Inspecting the ESM registers, the same ESMSR3=0x00000008 for the RAM ECC error was still reported, but all PL1 Fault Handling Registers were 0x00000000.

Pressing the "Run" button actually launched normal SW execution, with the LED still red.

How could this happen?

Thanks.

0 Sunil Oak over 4 years ago in reply to Chuck Wong

TI__Mastermind 49120 points

Hi Chuck,

I am surprised the code execution got out of the data abort handler as well. ESMSR3 register is not cleared on a system reset, so the following scenario could explain your observation:

1) A double-bit ECC error occurred on a RAM access, causing the CPU to take the data abort handler.
2) CPU code execution got stuck in the data abort handler in a "while(1)" loop.
3) Something like a watchdog (internal or external) caused a system reset to be asserted, restarting code execution.
4) This time there was no double-bit ECC error from the RAM and the code executed fine.
5) The ESM group3 status register still shows the previous error status and nERROR is still driven.

You will need to confirm the source of the system reset, which you can do if you log it as part of the reset handler.
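
Logging the reset source could look roughly like the sketch below. The bit positions are assumptions about the device's system exception status register (SYSESR) and must be verified in the TMS570LS31x technical reference manual; the `reset_log_t` type and both function names are hypothetical. The sketch takes the register value as a parameter so that it does not depend on any particular register address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed SYSESR bit positions; verify against the device TRM. */
#define SYSESR_PORST   (1u << 15)  /* power-on reset */
#define SYSESR_OSCRST  (1u << 14)  /* oscillator failure reset */
#define SYSESR_WDRST   (1u << 13)  /* watchdog (or debugger-initiated) reset */
#define SYSESR_CPURST  (1u << 5)   /* CPU reset */
#define SYSESR_SWRST   (1u << 4)   /* software reset */
#define SYSESR_EXTRST  (1u << 3)   /* external reset pin */

/* Returns the highest-priority reset cause flagged in the given value. */
const char *reset_source_name(uint32_t sysesr)
{
    if (sysesr & SYSESR_PORST)  return "power-on";
    if (sysesr & SYSESR_OSCRST) return "oscillator failure";
    if (sysesr & SYSESR_WDRST)  return "watchdog/debugger";
    if (sysesr & SYSESR_CPURST) return "cpu";
    if (sysesr & SYSESR_SWRST)  return "software";
    if (sysesr & SYSESR_EXTRST) return "external";
    return "none/unknown";
}

/* Minimal log kept across resets; it has to live in memory that the
 * start-up code does not re-initialize. */
typedef struct {
    uint32_t last_sysesr;
    uint32_t count;
} reset_log_t;

void reset_log_record(reset_log_t *log, uint32_t sysesr)
{
    assert(log != NULL);
    log->last_sysesr = sysesr;
    log->count++;
}
```

Called early in the reset handler with the raw register value, this would show whether the restart in the scenario above came from a power-on, a watchdog or debugger reset, or something else.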

0 Chuck Wong over 4 years ago in reply to Sunil Oak

Genius 3920 points

Hi Sunil,

I understand what you're saying. The MCU is on the TI HDK so there is no external watchdog reset. The only reset would be the debugger JTAG initiated reset when it is trying to connect (without downloading).

The trouble is that sometimes the reset is successful and the code restarts, but at other times it is not, and the CPU stays stuck in Abort_Handler().

Does this make sense?

Good day!

0 Sunil Oak over 4 years ago in reply to Chuck Wong

TI__Mastermind 49120 points

Hi Chuck,

Yes, that matches the sequence that I sent. I am not an expert user when it comes to IAR tools, but there should be a way to connect to the part without asserting system reset. Then you can catch the CPU in the abort handler.

Alternatively, you can write the data abort handler to log the DFAR and DFSR values for you and store them in a RAM region that is not auto-initialized during start-up.

Regards, Sunil
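
A minimal sketch of such a logging scheme follows, under several assumptions: the `fault_record_t` layout, the magic value and the function names are hypothetical, the inline assembly uses GCC syntax (IAR provides its own intrinsics for coprocessor reads), and the placement of the record in a no-init section is toolchain-specific and not shown. The magic word and XOR check let the start-up code tell a real record from power-up garbage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAULT_MAGIC 0xFA17C0DEu   /* arbitrary marker, hypothetical */

typedef struct {
    uint32_t magic;
    uint32_t dfsr;
    uint32_t dfar;
    uint32_t check;   /* magic ^ dfsr ^ dfar */
} fault_record_t;

#if defined(__GNUC__) && defined(__arm__)
/* CP15 reads: c5,c0,0 is the DFSR and c6,c0,0 is the DFAR. */
static inline uint32_t read_dfsr(void)
{
    uint32_t v;
    __asm__ volatile ("mrc p15, 0, %0, c5, c0, 0" : "=r"(v));
    return v;
}
static inline uint32_t read_dfar(void)
{
    uint32_t v;
    __asm__ volatile ("mrc p15, 0, %0, c6, c0, 0" : "=r"(v));
    return v;
}
#else
/* Host build: stubs so the logic can be exercised off-target. */
static inline uint32_t read_dfsr(void) { return 0u; }
static inline uint32_t read_dfar(void) { return 0u; }
#endif

void fault_record_store(fault_record_t *rec, uint32_t dfsr, uint32_t dfar)
{
    assert(rec != NULL);
    rec->magic = FAULT_MAGIC;
    rec->dfsr  = dfsr;
    rec->dfar  = dfar;
    rec->check = FAULT_MAGIC ^ dfsr ^ dfar;
}

bool fault_record_valid(const fault_record_t *rec)
{
    assert(rec != NULL);
    return rec->magic == FAULT_MAGIC
        && rec->check == (rec->magic ^ rec->dfsr ^ rec->dfar);
}

/* Intended call from the data abort handler. */
void fault_record_capture(fault_record_t *rec)
{
    fault_record_store(rec, read_dfsr(), read_dfar());
}
```

One caveat, following from the read-modify-write behavior discussed earlier in the thread: if the record itself sits in ECC RAM that has never been written, the 32-bit stores in `fault_record_store` might fault as well, so the record's memory would need to be initialized once at a known-safe point.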

0 Chuck Wong over 4 years ago in reply to Sunil Oak

Genius 3920 points

Hello Sunil,

Well, I've finally captured something, so please bear with me.

So the IAR command to use is "Attach to Running Target" rather than "Debug without Downloading".

On one occasion, I got ESMSR3=0x00000028, which signifies that both the EVEN and ODD RAM banks have reported a double-bit ECC error. The data fault status register, DFSR=0x00000C09, indicates an AXI Decode Error on a WRITE operation, classified as a "Sync Parity/ECC Error", and shows that the data fault address register DFAR is valid. DFAR is 0x080017C8.

I tried to locate what is in this static RAM location, but couldn't interpret it properly:

1) Watch window on dfar and dfsr variables

2) Linker map at lines 1670 and 1671: dfar at 0x080017c8 and dfsr at 0x080017cc, each with a length of 0x4 bytes, which is fine.

3) Memory dump at address 0x080017c8: guess what, all data in this address range contains the value 0x080017c8. How should I interpret address=value?
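
The two ESMSR3 values seen in this thread can be decoded with a few lines of C. This sketch only names the two channels identified in the thread (bit 3 for the RAM even bank and bit 5 for the RAM odd bank); any other bits are passed through unnamed rather than guessed. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ESM3_RAM_EVEN_ECC (1u << 3)  /* 0x08: RAM even bank uncorrectable ECC error */
#define ESM3_RAM_ODD_ECC  (1u << 5)  /* 0x20: RAM odd bank uncorrectable ECC error */

typedef struct {
    bool     ram_even_ecc;
    bool     ram_odd_ecc;
    uint32_t other_bits;   /* any remaining flags, left undecoded */
} esm3_flags_t;

void esmsr3_decode(uint32_t esmsr3, esm3_flags_t *out)
{
    assert(out != NULL);
    out->ram_even_ecc = (esmsr3 & ESM3_RAM_EVEN_ECC) != 0u;
    out->ram_odd_ecc  = (esmsr3 & ESM3_RAM_ODD_ECC) != 0u;
    out->other_bits   = esmsr3 & ~(ESM3_RAM_EVEN_ECC | ESM3_RAM_ODD_ECC);
}
```

With this, 0x00000008 decodes to the even bank only and 0x00000028 to both banks, with no other flags set in either case.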

0 Sunil Oak over 4 years ago in reply to Chuck Wong

TI__Mastermind 49120 points

Chuck,

This is what I expected. It appears that you do have a stack write (push) operation occurring before the CPU SRAM is initialized (resulting in a R-M-W operation that causes the abort). The way to identify the part of the code performing this write is to look at the link register while in abort mode (R14_abt). This holds the address two instructions after the one that caused the data abort. Also, the normal return instruction from the data abort handler should return to the instruction that caused the abort (unless the handler is trapped in a while(1) loop).

Regards, Sunil
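
The arithmetic on R14_abt can be written down as a one-line helper. On a data abort the architecture sets the abort-mode link register to the address of the aborted instruction plus 8, which in ARM state is two instructions further on. The function name and the example value below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the instruction that caused a data abort, given R14_abt.
 * The usual handler return, SUBS PC, LR, #8, re-executes that instruction. */
uint32_t data_abort_faulting_pc(uint32_t lr_abt)
{
    assert(lr_abt >= 8u);
    return lr_abt - 8u;
}
```

For instance, if the debugger showed R14_abt = 0x00001238 (a made-up value), the store to look up in the disassembly or linker map would be at 0x00001230.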
