TM4C1294NCPDT: TM4C1294 MCU is not reachable and likely core hung

Senthil Paramasivam

Part Number: TM4C1294NCPDT

Hi,

We are seeing an issue at one of our customer deployments where the TM4C1294 MCU is not reachable after a few days and the core has likely hung.
The TM4C1294 is connected to an embedded CPU (communication processor) through a UART interface. We tried resetting the UART interface from
the embedded CPU, and it didn't help.

I have the following questions:

1. Does the TM4C1294 support a core dump mechanism?

2. When the TM4C1294 is stuck, is it possible to collect the core dump from the embedded CPU?

3. Since it is at the customer site, we can't connect with CCS. Do you have any other suggestions for isolating the root cause or avoiding this issue?

Thanks,
Senthil

How to generate the core dump for TM4C1294

over 7 years ago

0 cb1_mobile over 7 years ago

Guru 117855 points

Senthil Paramasivam said:
TM4C1294 MCU is not reachable after a few days and the core has likely hung.

Perhaps that's true - yet you've presented (very) minimal evidence - in the advancement of such (core hung) diagnosis.

Might the following approach - which probes for a (pardon) 'more sound & detailed diagnosis' - be deployed?

Adding a regularly toggled Led to your TM4C (quickly) identifies any 'significant' loss of function. (tests MCU's: Timer, GPIO & Interrupt!)
You note, 'Tried resetting the UART interface from the embedded CPU (communication processor).' Should you not have tried 'Resetting the TM4C's UART' - as well?
It proves 'advisable' - during development - to create 'active tests' - which 'monitor individual MCU peripherals' - and provide for 'Peripheral Recovery' - as/if/when needed...
As a 'Standard Operating Procedure' - should not you have a 'Golden Client Board' (or closest facsimile - @ your site) - so that you CAN, 'Connect w/CCS' - and 'observe this (or other) conditions - directly?'
And - what percent of your 'Total Field Population' do these 'inflicted boards' represent? That's important - is it not?

The 'push' for core-dump - even if such is available - seems (very) premature - and far 'second' to the approach (detailed) above. (unless of course - gaining insight into 'core-dump' - proves the (not so) 'hidden' objective...)

0 Charles Tsai over 7 years ago

TI Guru 191736 points

Hi,
There is no core dump capability per se. If you can connect to the processor via the debugger, then you can observe the processor registers and the memory to diagnose the issue. Cb1 provided excellent diagnostic suggestions. The LED toggle will be a good way to know whether the core is hung or not. Do you have other peripherals (e.g. another UART, SPI, timers) that are still functioning when the specific UART communicating with the embedded CPU is dead?

One very important question: is this failure repeatable, or was it a one-time event? What type of operating environment is the MCU subjected to? Things to check are noise, temperature and power supply stability. Could any of them be outside the specified limits? I could also imagine a soft error, where radiation corrupts memory bit cells and registers.

0 Senthil Paramasivam over 7 years ago in reply to cb1_mobile

Prodigy 95 points

Thanks cb1 for your insights. See inline.

Adding a regularly toggled Led to your TM4C (quickly) identifies any 'significant' loss of function. (tests MCU's: Timer, GPIO & Interrupt!)
<Senthil> It is a good idea. We could validate the timer and GPIO access periodically and toggle the LED. How do you suggest validating the interrupt?

You note, 'Tried resetting the UART interface from the embedded CPU (communication processor).' Should you not have tried 'Resetting the TM4C's UART' - as well?
<Senthil> In our design the TM4C can only be accessed from the embedded CPU via the UART interface; there is no direct access. Since we couldn't reach the TM4C from the embedded board, we tried resetting the UART on the embedded side.

It proves 'advisable' - during development - to create 'active tests' - which 'monitor individual MCU peripherals' - and provide for 'Peripheral Recovery' - as/if/when needed...
<Senthil> Good idea, we will include them.

As a 'Standard Operating Procedure' - should not you have a 'Golden Client Board' (or closest facsimile - @ your site) - so that you CAN, 'Connect w/CCS' - and 'observe this (or other) conditions - directly?'
<Senthil> Yes, we do have a golden board locally, where we haven't seen this issue so far.

And - what percent of your 'Total Field Population' do these 'inflicted boards' represent? That's important - is it not?
<Senthil> It is not important. We don't have a large enough sample to draw conclusions; we have observed this issue multiple times, but only on one customer setup.

0 Senthil Paramasivam over 7 years ago in reply to Charles Tsai

Prodigy 95 points

Thanks, Charles, for the suggestions. An LED toggle would help if the setup were local, but not for debugging customer issues. Since communication with the MCU is broken, it is hard to confirm whether other peripherals are working. This issue was observed at least 4 or 5 times in the last month on one customer setup, and has not been seen on the local setup or at other customers. Based on all the test observations, it is operating in normal conditions with stable power, temperature, etc. How do I confirm whether there is a soft error? Does the TM4C have a parity error interrupt or counters? Which memory is prone to these errors? Is there correction logic (ECC)?

-Senthil

0 Genatco over 7 years ago

Guru 55913 points

Hello,

How have you configured the Watchdog timer on the remote system? The default configuration should soft reset the MCU on the second timeout and can be made to watch the NMI pin. Ideally you can have software check the MCU reset cause register and print (dump) the reason via the UART to a debug terminal after POR.
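A minimal sketch of the reset-cause dump described above, written so it can be tried on a PC first. The bit masks are my understanding of the TM4C129x RESC register layout (they mirror TivaWare's SYSCTL_CAUSE_* values) and should be checked against the datasheet; the function name and table are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed RESC bit positions; verify against the datasheet before use. */
#define CAUSE_EXT    0x00000001u  /* external reset pin      */
#define CAUSE_POR    0x00000002u  /* power-on reset          */
#define CAUSE_BOR    0x00000004u  /* brown-out reset         */
#define CAUSE_WDOG0  0x00000008u  /* watchdog timer 0        */
#define CAUSE_SW     0x00000010u  /* software-requested      */
#define CAUSE_WDOG1  0x00000020u  /* watchdog timer 1        */

static const struct { uint32_t mask; const char *name; } k_causes[] = {
    { CAUSE_EXT, "EXT" },     { CAUSE_POR, "POR" }, { CAUSE_BOR, "BOR" },
    { CAUSE_WDOG0, "WDOG0" }, { CAUSE_SW, "SW" },   { CAUSE_WDOG1, "WDOG1" },
};

/* Writes a space-separated list of the causes set in `cause` into `out`.
 * Returns how many causes were written; stops early if `out` fills up. */
int reset_cause_describe(uint32_t cause, char *out, size_t out_len)
{
    int found = 0;
    size_t used = 0;

    if (out == NULL || out_len == 0) {
        return 0;
    }
    out[0] = '\0';
    for (size_t i = 0; i < sizeof k_causes / sizeof k_causes[0]; i++) {
        if ((cause & k_causes[i].mask) == 0) {
            continue;
        }
        int n = snprintf(out + used, out_len - used, "%s%s",
                         found ? " " : "", k_causes[i].name);
        if (n < 0 || (size_t)n >= out_len - used) {
            out[used] = '\0';   /* drop the partial name: buffer is full */
            break;
        }
        used += (size_t)n;
        found++;
    }
    return found;
}
```

On the target, the value would come from TivaWare's SysCtlResetCauseGet() early in main(), the text would go out through UARTprintf(), and SysCtlResetCauseClear() would then clear the bits so the next boot reports only its own cause.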

Another method includes enabling the Debug directive switch in the compiled code so the offending application or TivaWare library calls can be determined in a similar way.

0 cb1_mobile over 7 years ago in reply to Senthil Paramasivam

Guru 117855 points

Thank you - answers follow: (cb1's initial suggestion; cb1's follow-up response)

Senthil Paramasivam said:
Adding a regularly toggled Led to your TM4C (quickly) identifies any 'significant' loss of function. (tests MCU's: Timer, GPIO & Interrupt!)
<Senthil> It is a good idea. We could validate the timer and GPIO access periodically and toggle the LED. How do you suggest validating the interrupt?

That's easy - as the objective is the 'Evaluation of as many MCU System elements as (reasonably) possible' - simply have the (chosen) Timer's periodic 'Interrupt' call the 'X-Or'ed' GPIO. Thus ... THREE BIRDS - 'SINGLE (cb1) STONE.' (Timer, GPIO & Interrupt all (regularly) tested & clearly monitored!) Should the Led continue blinking - 'Hung Core' is (dismissed) as a realistic finding.
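A minimal sketch of that timer-driven toggle, assuming hypothetical names throughout. It is written against a plain pointer so it can be exercised on a PC; on the TM4C the pointer would be the LED port's GPIO data register (or the body would call TivaWare's GPIOPinWrite()), and the function would be called from the timer interrupt handler after clearing the timer's interrupt flag.

```c
#include <stdint.h>

/* Hypothetical heartbeat state; none of these names come from TivaWare. */
typedef struct {
    volatile uint32_t *led_port;   /* register (or variable) holding the LED bit */
    uint32_t           led_mask;   /* bit that drives the LED                    */
    volatile uint32_t  ticks;      /* number of timer interrupts serviced        */
} heartbeat_t;

/* Call once per periodic timer interrupt. The XOR flips the LED each time,
 * so a steady blink shows that the timer, the interrupt path and the GPIO
 * are all still alive. */
void heartbeat_on_timer_irq(heartbeat_t *hb)
{
    *hb->led_port ^= hb->led_mask;
    hb->ticks++;
}
```

The tick counter is an extra beyond the suggestion above: it gives the application a cheap way to report "still alive" over another channel when nobody can see the LED.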

Senthil Paramasivam said:
Should you not have tried 'Resetting the TM4C's UART' - as well?
<Senthil> In our design the TM4C can only be accessed from the embedded CPU via the UART interface; there is no direct access. Since we couldn't reach the TM4C from the embedded board, we tried resetting the UART on the embedded side.

It is unusual for the MCU's UART to 'just quit' - after performing properly from power-up. Starting w/basics - have the signal levels (at the MCU) been recently measured - and confirmed as 'w/in spec?' In many/most of our firm's designs - we program the UART(s) (via a Timer) to regularly 'Handshake (*) w/their attached device' (unless transaction activity is (both) high & recent!) When a sufficient period of 'quiet' is detected - the MCU will 'Handshake' (*) - and await response. Should that response fail to arrive - or appear (somehow) corrupted - then the MCU's UART will execute 'Peripheral Reset' - re-initialize - and attempt a new handshake.(*) This involves the 'Addition of such 'broadening and robustness heightening' code strategy - w/in your design.
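The handshake-and-recover strategy above can be sketched as a small state machine, under stated assumptions: every name below is hypothetical, time is counted in ticks of whatever periodic timer is used, and the function only returns what should be done, leaving the actual UART calls to the caller.

```c
#include <stdint.h>

typedef enum { LINK_IDLE, LINK_AWAIT_REPLY } link_state_t;

typedef enum {
    LINK_NONE,                  /* nothing to do this tick               */
    LINK_SEND_PING,             /* link was quiet: send the handshake    */
    LINK_RESET_UART_THEN_PING   /* no valid reply: reset UART, try again */
} link_action_t;

typedef struct {
    link_state_t state;
    uint32_t quiet_ticks;   /* ticks since the last valid traffic       */
    uint32_t wait_ticks;    /* ticks spent waiting for a ping reply     */
    uint32_t quiet_limit;   /* quiet period that triggers a handshake   */
    uint32_t reply_limit;   /* how long to wait before giving up        */
    uint32_t uart_resets;   /* diagnostic count of peripheral resets    */
} link_monitor_t;

/* Call whenever a valid frame (including a handshake reply) is received. */
void link_on_valid_rx(link_monitor_t *m)
{
    m->quiet_ticks = 0;
    m->wait_ticks = 0;
    m->state = LINK_IDLE;
}

/* Call from the periodic timer tick; the caller carries out the action. */
link_action_t link_on_tick(link_monitor_t *m)
{
    if (m->state == LINK_IDLE) {
        if (++m->quiet_ticks >= m->quiet_limit) {
            m->wait_ticks = 0;
            m->state = LINK_AWAIT_REPLY;
            return LINK_SEND_PING;
        }
        return LINK_NONE;
    }
    if (++m->wait_ticks >= m->reply_limit) {
        /* Reply missing or rejected as corrupt: count it and start over. */
        m->uart_resets++;
        m->wait_ticks = 0;
        return LINK_RESET_UART_THEN_PING;
    }
    return LINK_NONE;
}
```

Returning an action instead of touching the UART directly keeps the logic testable off-target. A corrupted reply is handled by simply not calling link_on_valid_rx() for it, so it is treated the same as no reply at all.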

Senthil Paramasivam said:
It proves 'advisable' - during development - to create 'active tests' - which 'monitor individual MCU peripherals' - and provide for 'Peripheral Recovery' - as/if/when needed...
<Senthil> Good idea, we will include them.

Good that! Response (immediately above) notes (one method) to achieve such 'active test' - which 'Enables the MCU to 'detect issues' - and to 'Intelligently - attempt a fix.' Note too - that each such 'Issue Detection' may be programmed to produce a unique pattern upon the 'MCU monitoring Led' - announcing the issue - and 'Returning to normal' - when or should the 'fix' have succeeded! In addition - it (always) proves wise - to 'Alert' the connected (Processor or Board) - that the MCU has 'Detected some issue' - is (about) to attempt a 'fix' - and then to signal the 'fix's' completion...
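One possible way to produce such a unique pattern per issue, sketched with hypothetical names: issue code N is shown as N flashes followed by a dark pause, and code 0 (healthy) is the plain toggle. The pause length is an arbitrary assumed value, and codes are expected to be small.

```c
#include <stdbool.h>
#include <stdint.h>

enum { BLINK_GAP_TICKS = 4 };   /* dark pause between bursts (assumed value) */

/* Returns whether the LED should be lit at tick `t` while showing `code`.
 * code 0: healthy, plain on/off toggle.
 * code N > 0: N flashes (on, off, on, off, ...) and then a pause. */
bool blink_led_state(uint32_t code, uint32_t t)
{
    if (code == 0) {
        return (t & 1u) == 0;
    }
    uint32_t period = 2u * code + BLINK_GAP_TICKS;
    uint32_t phase = t % period;
    return phase < 2u * code && (phase & 1u) == 0;
}
```

The monitoring timer would call this once per tick with a running tick count and write the result to the LED pin; setting the code back to 0 returns the LED to its normal blink once a fix has succeeded.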

Senthil Paramasivam said:
And - what percent of your 'Total Field Population' do these 'inflicted boards' represent? That's important - is it not?
<Senthil> It is not important. We don't have a large enough sample to draw conclusions; we have observed this issue multiple times, but only on one customer setup.

May it be noted that MANY here would challenge your assessment of, 'NOT Important!' Should the sample size be SO SMALL - then (any) failure proves 'GRAVE' - and the (thus far) 'limited discovery of such defects' - may reflect MORE upon your/client's 'Detection Techniques - and/or Effort' - than upon the 'Defects' (expanded) presence! I believe that the adage, 'SAFE trumping Sorry' - (very) much applies - here...

Another notes 'Watchdog Usage' - which (can) cause a complete MCU Reset - and that (Reset) proves far more 'disruptive & time consuming' - than the 'specialized & focused - Peripheral-based techniques' - outlined herein. And of course - W-Dog provides, 'NONE of the key INSIGHTS' - which the 'far more focused/effective' diagnostic methods presented herein - do so well...

(*) 'Handshake' - as noted here - does not involve 'flow control' or other (extra) signals. Usage here describes a (very) brief, 'UART Exchange' - between the MCU & Remote - to ensure that communication 'exists, is bi-directional - and is proper...'

0 Senthil Paramasivam over 7 years ago in reply to cb1_mobile

Prodigy 95 points

Thank you cb1 for the prompt responses and suggestions.

That's easy - as the objective is the 'Evaluation of as many MCU System elements as (reasonably) possible' - simply have the (chosen) Timer's periodic 'Interrupt' call the 'X-Or'ed' GPIO. Thus ... THREE BIRDS - 'SINGLE (cb1) STONE.' (Timer, GPIO & Interrupt all (regularly) tested & clearly monitored!) Should the Led continue blinking - 'Hung Core' is (dismissed) as a realistic finding.

<Senthil> Sounds good. Will try this option.

It is unusual for the MCU's UART to 'just quit' - after performing properly from power-up. Starting w/basics - have the signal levels (at the MCU) been recently measured - and confirmed as 'w/in spec?' In many/most of our firm's designs - we program the UART(s) (via a Timer) to regularly 'Handshake (*) w/their attached device' (unless transaction activity is (both) high & recent!) When a sufficient period of 'quiet' is detected - the MCU will 'Handshake' (*) - and await response. Should that response fail to arrive - or appear (somehow) corrupted - then the MCU's UART will execute 'Peripheral Reset' - re-initialize - and attempt a new handshake.(*) This involves the 'Addition of such 'broadening and robustness heightening' code strategy - w/in your design.

<Senthil> When you say 'Peripheral Reset', are you referring to a reset of the embedded CPU (device) connected over UART, or just resetting the UART interface? It is a good idea to have a periodic heartbeat or handshake between these two devices that triggers the reset.

Good that! Response (immediately above) notes (one method) to achieve such 'active test' - which 'Enables the MCU to 'detect issues' - and to 'Intelligently - attempt a fix.' Note too - that each such 'Issue Detection' may be programmed to produce a unique pattern upon the 'MCU monitoring Led' - announcing the issue - and 'Returning to normal' - when or should the 'fix' have succeeded! In addition - it (always) proves wise - to 'Alert' the connected (Processor or Board) - that the MCU has 'Detected some issue' - is (about) to attempt a 'fix' - and then to signal the 'fix's' completion...

<Senthil> It is a good idea to have a periodic heartbeat or handshake between these two devices and to try auto recovery when there is a problem.

May it be noted that MANY here would challenge your assessment of, 'NOT Important!' Should the sample size be SO SMALL - then (any) failure proves 'GRAVE' - and the (thus far) 'limited discovery of such defects' - may reflect MORE upon your/client's 'Detection Techniques - and/or Effort' - than upon the 'Defects' (expanded) presence! I believe that the adage, 'SAFE trumping Sorry' - (very) much applies - here...

<Senthil> Yes, currently the sample size is very small.

Another notes 'Watchdog Usage' - which (can) cause a complete MCU Reset - and that (Reset) proves far more 'disruptive & time consuming' - than the 'specialized & focused - Peripheral-based techniques' - outlined herein. And of course - W-Dog provides, 'NONE of the key INSIGHTS' - which the 'far more focused/effective' diagnostic methods presented herein - do so well...

<Senthil> Yes, we are already planning to enable the watchdog, which is missing today.

0 Senthil Paramasivam over 7 years ago in reply to Genatco

Prodigy 95 points

Hi,

How have you configured the Watchdog timer on the remote system? The default configuration should soft reset the MCU on the second timeout and can be made to watch the NMI pin.

<Senthil> We are planning to enable the watchdog, which is missing today. Can you provide sample code or a pointer for watching the NMI pin with the watchdog?

Ideally you can have software check the MCU reset cause register and print (dump) the reason via the UART to a debug terminal after POR.

<Senthil> Okay, we will check the Reset Cause (RESC) register. It may not give much new information: we power cycled the MCU to recover after it hung, so the reset cause will likely record a power-on reset as the reason.

Another method includes enabling the Debug directive switch in the compiled code so the offending application or TivaWare library calls can be determined in a similar way.

<Senthil> Can you provide an example of enabling the debug directive switch? Does it print all the debug messages on the console? How does it help to find the offending application or TivaWare library calls?

Thanks,
Senthil

0 cb1_mobile over 7 years ago in reply to Senthil Paramasivam

Guru 117855 points

Perhaps time (now) for a, 'This RESOLVED' green marker! (note that the vendor has registered 'pleasure' w/such guidance.)

When you say 'Peripheral Reset', are you referring to a reset of the embedded CPU (device) connected over UART, or just resetting the UART interface?

I deliberately employed 'Peripheral Reset' to distinguish the 'UART Peripheral's Reset.' (such 'individual peripheral resets' are faster - and less disruptive - than an entire MCU Reset.) In addition - they may be individually 'counted' - thus providing insight into potential: 'program, component or system' issues!

I should note that our firm (usually) includes a small (external) EEProm (deliberately NOT that w/in the MCU) and this device is 'loaded' w/pertinent 'Diagnostic Info' which provides key data - even when the board is away (in client's possession.)
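As an illustration only, such a diagnostic record might look like the sketch below. The field list, the names and the simple additive checksum are all assumptions rather than our firm's actual layout; writing the record to the EEPROM (over I2C or SPI) is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical diagnostic record kept in the external EEPROM. */
typedef struct {
    uint32_t boot_count;         /* incremented on every start-up           */
    uint32_t last_reset_cause;   /* raw reset-cause value seen at boot      */
    uint32_t uart_resets;        /* how often the UART peripheral was reset */
    uint32_t watchdog_warnings;  /* first-stage watchdog timeouts seen      */
    uint32_t checksum;           /* covers all fields above                 */
} diag_record_t;

/* Additive checksum with a non-zero seed, so that a blank record
 * (all 0x00 or all 0xFF) does not pass validation by accident. */
static uint32_t diag_sum(const diag_record_t *r)
{
    return 0xA5A5A5A5u + r->boot_count + r->last_reset_cause
         + r->uart_resets + r->watchdog_warnings;
}

/* Call after changing any field and before writing the record out. */
void diag_seal(diag_record_t *r)
{
    r->checksum = diag_sum(r);
}

/* Call after reading the record back, before trusting its contents. */
bool diag_is_valid(const diag_record_t *r)
{
    return r->checksum == diag_sum(r);
}
```

The connected processor can then ask for this record over the UART, which gives some of the insight a core dump would have provided even when the board is in the field.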

Freed from the (earlier) 'core-hung' (sense) - you now should be better prepared to produce a more robust design - and likely one which can 'Detect, (even) Overcome & then Record' (issues) - rather than ... run blind...

0 Genatco over 7 years ago in reply to Senthil Paramasivam

Guru 55913 points

Like, like, like! CB1's recent use of various colored text is very pleasing to read. Besides, the mission-to-Mars debug routines he suggests are also rather heavy in the pull of Earth's gravity.

The debug for Tivaware console prints:

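#include <stdint.h>
#include "utils/uartstdio.h"   // UARTprintf(); the console must already be set up with UARTStdioConfig()

#define DEBUG              1   // usually defined project-wide (e.g. -DDEBUG) so driverlib's ASSERT() is active

//*****************************************************************************
//
// The error routine that is called if the driver library encounters an error.
//
//*****************************************************************************

#ifdef DEBUG
void
__error__(char *pcFilename, uint32_t ui32Line)
{
    // Report the file and line of the failed driverlib ASSERT().
    UARTprintf("ASSERT_DEBUG: %s line:%u\r\n", pcFilename, ui32Line);
}
#endif

// The same style of print at a call site in application code, where msg is a
// caller-supplied string:
//   UARTprintf("ASSERT_DEBUG: %s line:%d  %s\r\n", msg, __LINE__, __FILE__);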

The Watchdog example project can be found in C:\ti\Tivaware\MyLibraryVersion\Examples. The NMI pin defaults to PD7 and must be configured for Watchdog use (consult the SYSCTRL registers text in the datasheet); it gives a software-simulated POR and can be triggered remotely from another device. It would seem certain exceptions might also trigger the NMI interrupt if the Watchdog's handling of NMI is configured to do so. That part is above my current pay grade; perhaps Charles/CB1 can chime in on how that might be accomplished.

0 Genatco over 7 years ago in reply to Genatco

Guru 55913 points

The idea is that the Watchdog's first timeout can reset any peripheral. Calls to the peripheral also refresh the dog's counter to keep it from expiring. They call this petting the dog or feeding the dog, and we disable the MCU reset on second timeout feature.
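To make the two-stage behaviour concrete, here is a small PC-side model of it, not driver code. It assumes the usual arrangement where the first timeout raises an interrupt, feeding the dog clears it, and a second timeout with the interrupt still pending resets the MCU only if that feature is enabled. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    WDT_OK,            /* counter still running                         */
    WDT_TIMEOUT_IRQ,   /* timeout with no MCU reset: interrupt asserted */
    WDT_RESET          /* second timeout with reset enabled             */
} wdt_event_t;

typedef struct {
    uint32_t reload;         /* value the counter restarts from        */
    uint32_t counter;        /* current down-counter                   */
    bool     irq_pending;    /* first timeout happened and was not fed */
    bool     reset_enabled;  /* reset the MCU on the second timeout?   */
} wdt_model_t;

/* "Petting the dog": restart the counter and clear the pending interrupt. */
void wdt_feed(wdt_model_t *w)
{
    w->counter = w->reload;
    w->irq_pending = false;
}

/* Advance the model by one clock tick. */
wdt_event_t wdt_tick(wdt_model_t *w)
{
    if (w->counter > 0 && --w->counter > 0) {
        return WDT_OK;
    }
    w->counter = w->reload;
    if (!w->irq_pending) {
        /* First timeout: the handler can reset a peripheral and log it. */
        w->irq_pending = true;
        return WDT_TIMEOUT_IRQ;
    }
    return w->reset_enabled ? WDT_RESET : WDT_TIMEOUT_IRQ;
}
```

With reset_enabled set to false the model matches what we do: the first-timeout handler gets its chance to recover a peripheral, and a missed feed never takes the whole MCU down.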
