TM4C1297NCZAD: The proper way to handle NMI, Fault Interrupt, and Default Interrupts in deployed equipment

Part Number: TM4C1297NCZAD

I am not experienced with this processor, but I am reviewing some code and this does not seem like the best way to handle these interrupts for a deployed system (no debugger connected).

Any suggestions or comments?

//============================================================================
// This is the code that gets called when the processor receives a NMI. This
// simply enters an infinite loop, preserving the system state for examination
// by a debugger.
//============================================================================
static void
NmiSR(void)
{
    //
    // Enter an infinite loop.
    //
    while(1)
    {
    }
}
//============================================================================
// This is the code that gets called when the processor receives a fault
// interrupt. This simply enters an infinite loop, preserving the system state
// for examination by a debugger.
//============================================================================
static void
FaultISR(void)
{
    //
    // Enter an infinite loop.
    //
    while(1)
    {
    }
}
//============================================================================
// This is the code that gets called when the processor receives an unexpected
// interrupt. This simply enters an infinite loop, preserving the system state
// for examination by a debugger.
//============================================================================
static void
IntDefaultHandler(void)
{
    //
    // Go into an infinite loop.
    //
    while(1)
    {
    }
}

  • You are correct - yet there are vendor documents which suggest that landing in such "segregated bins" should allow additional code to attempt to correct the situation, and to then reset the MCU.

    The segmentation of such issues enables a much reduced "corrective strategy" to be created.

    It is also worth noting that a properly designed Watchdog should "escape one" from these loops...    (yet it is devoid of any further corrective mechanism...)

  • We have an issue where I think the Watchdog is causing us to reset. We just don't know which part of the code is causing the infinite loop. Does a properly designed Watchdog tell you which part of the code is causing the problem, or just cause a reset?
  • I believe that the API includes functions which enable one to, "Determine the cause of such Reset."      (after writing this - I don't believe it properly addresses your issue.)

    Mark Sheats said:
    We just don't know which part of the code is causing the infinite loop

    Can you not, "Set a unique bit - while w/in each "infinite segment?"      Such could then be tested/examined w/in the Watchdog code - could it not?

    It seems preferred to "escape the loop" via a properly constructed (corrective) action - avoiding watchdog entry...
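
    As a minimal sketch of that tagging idea (not from the project's code; the section attribute, names, and values below are illustrative, and the exact "noinit" pragma/attribute depends on the toolchain), each trap handler could tag a reserved RAM word before it spins:

    #include <stdint.h>

    //
    // One reserved word shared by all of the trap handlers. It must live in
    // a RAM section that the C startup code does not re-initialize; the
    // GCC-style ".noinit" attribute below is only an example.
    //
    __attribute__((section(".noinit"))) volatile uint32_t g_ui32TrapSource;

    #define TRAP_NMI      1
    #define TRAP_FAULT    2
    #define TRAP_DEFAULT  3

    static void
    FaultISR(void)
    {
        //
        // Tag which trap was entered, then spin until the watchdog fires.
        // The watchdog handler (or the post-reset startup code) can read
        // the tag to see which handler was hit.
        //
        g_ui32TrapSource = TRAP_FAULT;
        while(1)
        {
        }
    }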

  • I have not been working on this code up until now (still haven't compiled it myself), but hope to soon. I'm looking for guidance so I can make good suggestions to help us solve the problem. I wrote code years ago with different tools, different processor, etc.
  • It is suspected that the guidance as supplied:

    • set some unique, identifying factor when w/in such infinite loop
    • from w/in each infinite loop - launch a proper "corrective action attempt"  (thus escaping that loop)

    reasonably meets the "guidance requirement" you've identified.     Good luck.

  • You can't just set a normal variable, because its value will be lost when the startup code reinitializes RAM after reset. You need to write to flash or EEPROM to record hitting these loops.
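
    A minimal sketch of the EEPROM option, assuming TivaWare's internal-EEPROM driver (the word offset and fault code are arbitrary, and the EEPROM peripheral is assumed to have been enabled and EEPROMInit() run successfully during normal startup; whether the write completes from a badly faulted part is not guaranteed):

    #include <stdint.h>
    #include "driverlib/eeprom.h"

    #define FAULT_LOG_OFFSET  0x0   // arbitrary word-aligned offset in the EEPROM

    //
    // Record a fault code in the internal EEPROM. EEPROMProgram() blocks
    // until the word is written (or an error is reported).
    //
    static void
    LogFaultCode(uint32_t ui32Code)
    {
        EEPROMProgram(&ui32Code, FAULT_LOG_OFFSET, sizeof(ui32Code));
    }

    A handler would then call LogFaultCode() with its own tag value before entering its loop (or before forcing a reset).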

  • It is hoped that you (meant) to direct your writing to the thread's originator.    Our group is NOT experiencing such issue.    (at least - Not yet!)

    Note too - that our "corrective action launch" occurs well before any Reset - thus our  "identifying variables" are, "Alive & Well."

  • Peter,

    Thanks for your advice. I am not sure that we are "unintentionally hitting interrupts that require explicit enables". I am just starting on the project and trying to get a grip on things. But... we'll see. Another programmer in house has much more experience with this processor family than I. I'm the expert on the previous version of this instrument.

    We have already discussed writing something to flash. I'm hoping that write can be completed before the processor resets.

    Thanks,

    Mark
  • Not everything I said is 100% true. Few people explicitly hit a fault. All statements seem to come with an asterisk...

    I always hit the bottom reply button. Wasn't a targeted message.

  • cb1_mobile, I'm sorry for my novice questions. Are you able to avoid Reset completely with your corrective action launch?
  • Mark Sheats said:
    Are you able to avoid Reset completely with your corrective action launch?

    I'm still, "on the hunt"  for a (more) skilled "consultant."     (taking a past tech firm PUBLIC seems "insufficient" for some!)

    Having attended Engineering & Law School - your question "Avoid Reset completely" - causes discomfort!      What & whom defines "completely?"   And what if  "Reset" occurs tomorrow?

    I would note that "most always" (and tending to always) we "Avoid Reset."    And - in a "High-Current, Noise Assured Environment" (Autonomous Auto & Cordless, Power Tools) such ability is highly valued.

    Your questions are "far from novice" - your "anticipation of problems" - prior to their occurrence - and subsequent, "Design of appropriate & efficient "Corrective Action" - describes (much) of our (on-going) success...

  • Yes, but it does require work on your part. I know I've written of this before but I cannot find a reference now. I'll post more on it later when I've got a little more time.

    Robert
  • Thank you. I look forward to any more information you or someone else can provide with regards to best practices handling these types of interrupts, and actual details when possible. One of our plans is to insert some diagnostic code in strategic places to see if we can figure out why things are locking up.
  • I should add that my group has found it "especially useful" to be able to, "Provoke such issue!"    (i.e. cause such an issue/occurrence (almost) "Upon Command!)

    When such issues are "fleeting" - their "cause & resolution" - are rendered far harder to identify - and reduce/prevent. 

    Note that both poster/friend Robert & I are big proponents of, "Test-Driven Development." (TDD)     We both recommend the book, "Test-Driven Development for Embedded C," by James Grenning.

    We've discovered that - when fortune smiles - a "Proper (even better) an Inspired Test - may greatly assist the "teasing out" - of even a  "difficult to detect - issue's cause!"    (and resolution!)

  • Thank you very much. That's very helpful.
  • Mark,

    I couldn't find my previous notes on watchdogs and resets. So, starting from scratch, this is what I would consider my current best practice for watchdogs and unexpected interrupts. It is assumed you have already eliminated as many cases as you could find; this is to catch the stray intermittents that testing did not reveal (although the same techniques help in testing as well).

    Unexpected interrupts

    All unexpected interrupts go into an endless loop with interrupts turned off, or they trigger a reset directly. The endless loop is expected to trigger the watchdog. Since this is not a normal or expected interrupt, something has gone very wrong and you cannot trust the state of the micro. Before they enter the endless loop (or reset), the handlers perform two operations:

    They write values to a set of reserved locations to indicate that an unexpected interrupt occurred and which one (for certain processor fault interrupts you may record other information as well to help with later diagnosis).

    If there is simple I/O you can do to safe the system, the routines do that as well. Note that in general this is difficult, since you cannot know what state the system is in. This is something you need to consider in HW design as well (such as providing an output to float all power drivers).
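
    As a sketch of what such a handler might look like on this part (the reserved words, the tag values, and the "safe the outputs" pin are hypothetical; the TivaWare calls and the NVIC ICSR register address are the real ones):

    #include <stdint.h>
    #include <stdbool.h>
    #include "inc/hw_memmap.h"
    #include "driverlib/gpio.h"
    #include "driverlib/interrupt.h"

    //
    // Reserved words in RAM that the startup code must not re-initialize
    // (placement is toolchain-specific, as sketched earlier in the thread).
    //
    extern volatile uint32_t g_ui32TrapSource;
    extern volatile uint32_t g_ui32TrapDetail;

    static void
    IntDefaultHandler(void)
    {
        //
        // Lock out further interrupts first - the state of the micro is no
        // longer trusted.
        //
        IntMasterDisable();

        //
        // Record that an unexpected interrupt happened and which vector is
        // active (VECTACTIVE, the low 9 bits of the NVIC ICSR register at
        // 0xE000ED04 on a Cortex-M4).
        //
        g_ui32TrapSource = 0xDEADBEEF;                      // arbitrary "unexpected" tag
        g_ui32TrapDetail = (*(volatile uint32_t *)0xE000ED04) & 0x1FF;

        //
        // Whatever simple, known-safe I/O is possible, e.g. driving a
        // hypothetical "float the power drivers" output low.
        //
        GPIOPinWrite(GPIO_PORTB_BASE, GPIO_PIN_0, 0);

        //
        // Spin and let the watchdog reset the part.
        //
        while(1)
        {
        }
    }

    For the fault handlers in particular, the stacked return address and the Cortex-M fault status registers are worth capturing in the same way.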

    Watchdog

    The watchdog runs a multilevel process checking all critical threads. The physical watchdog has its timeout set for the fastest watched process (in my case that's a 10 kHz A/D process, so I would set the watchdog reset time to something like 150 or 200 µs; you can set it tighter or looser depending on your process tolerances). In that fast process the watchdog is conditionally fed. The feeding depends not only on the condition of this fast process (or interrupt) but also on the other processes' SW watchdogs. The fast process's watchdog feed checks each SW watchdog for expiry (usually by decrementing a 'timer' and looking for underflow). If any SW watchdog has expired, its ID is written to the reserved locations previously mentioned, the HW watchdog is not fed, and it times out, resetting the processor.

    Each SW watchdog is periodically reset (fed) by the process/loop it is watching. So for a loop that is expected to run every 10th iteration of the fast critical process watched by the HW watchdog (i.e. at 1/10 the frequency), you might set the counter to 12 or 15 every time the loop runs.
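
    A compressed sketch of that layering, assuming TivaWare's watchdog driver on Watchdog 0 and a small table of SW watchdogs (the counts, the reload value, and the complemented shadow copy - a crude stand-in for the ECC check mentioned under fine tuning below - are all illustrative):

    #include <stdint.h>
    #include <stdbool.h>
    #include "inc/hw_memmap.h"
    #include "driverlib/watchdog.h"

    #define NUM_SW_WDOGS         4
    #define HW_WDOG_RELOAD_TICKS 24000     // ~200 us at a 120 MHz system clock

    //
    // A SW watchdog is a down-counter plus a complemented shadow copy, so a
    // wild write is unlikely to look like a valid feed. Both words start at
    // zero (static initialization) and the entry is ignored until its first
    // feeding.
    //
    typedef struct
    {
        volatile uint32_t ui32Count;
        volatile uint32_t ui32CountInv;
    } tSwWatchdog;

    static tSwWatchdog g_psSwWdog[NUM_SW_WDOGS];

    extern volatile uint32_t g_ui32TrapSource;   // reserved location (see above)

    //
    // Called by each watched loop/process: "I ran, give me N more ticks."
    //
    void
    SwWatchdogFeed(uint32_t ui32Id, uint32_t ui32Ticks)
    {
        g_psSwWdog[ui32Id].ui32Count    = ui32Ticks;
        g_psSwWdog[ui32Id].ui32CountInv = ~ui32Ticks;
    }

    //
    // Called from the fastest watched process (the 10 kHz A/D handler in the
    // example above). The HW watchdog is only reloaded if every SW watchdog
    // is still healthy.
    //
    void
    SwWatchdogCheckAndFeedHw(void)
    {
        uint32_t ui32Id;
        bool bAllOk = true;

        for(ui32Id = 0; ui32Id < NUM_SW_WDOGS; ui32Id++)
        {
            tSwWatchdog *psWdog = &g_psSwWdog[ui32Id];

            //
            // Never fed yet (still in its zero-initialized state): skip.
            //
            if((psWdog->ui32Count == 0) && (psWdog->ui32CountInv == 0))
            {
                continue;
            }

            //
            // Corrupted or expired: record the culprit's ID and stop feeding
            // the HW watchdog so it resets the processor.
            //
            if((psWdog->ui32Count != ~psWdog->ui32CountInv) ||
               (psWdog->ui32Count == 0))
            {
                g_ui32TrapSource = 0x53570000 | ui32Id;    // arbitrary "SW wdog" tag
                bAllOk = false;
                continue;
            }

            psWdog->ui32Count--;
            psWdog->ui32CountInv = ~psWdog->ui32Count;
        }

        if(bAllOk)
        {
            WatchdogReloadSet(WATCHDOG0_BASE, HW_WDOG_RELOAD_TICKS);
        }
    }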

    On startup (after reset) you perform a couple of checks (sketched below).

    • Read the reset cause register and the reserved location(s). Log this information in some way.
    • Then set the reserved locations to some value the reporting code will never write (usually all bits zero or all ones) and clear the reset cause register.
    • Then you proceed with the rest of the startup.
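
    A minimal sketch of that startup check on this family, using TivaWare's reset-cause API (LogResetEvent() is a placeholder for whatever logging you have, and the reserved words are the hypothetical ones from the sketches above):

    #include <stdint.h>
    #include "driverlib/sysctl.h"

    extern volatile uint32_t g_ui32TrapSource;   // reserved, not zeroed at startup
    extern volatile uint32_t g_ui32TrapDetail;

    extern void LogResetEvent(uint32_t ui32Cause,
                              uint32_t ui32Source,
                              uint32_t ui32Detail);

    void
    CheckResetCause(void)
    {
        //
        // Raw cause bits: POR, BOR, watchdog 0/1, external, software, ...
        //
        uint32_t ui32Cause = SysCtlResetCauseGet();

        LogResetEvent(ui32Cause, g_ui32TrapSource, g_ui32TrapDetail);

        //
        // Scrub the evidence so the next reset starts from a known state,
        // then carry on with the rest of the startup.
        //
        g_ui32TrapSource = 0;
        g_ui32TrapDetail = 0;
        SysCtlResetCauseClear(ui32Cause);
    }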

    There are a couple of places to do some fine tuning to this.

    • I use an ECC code on the SW watchdog timers to reduce the chance that a wild write will act as a watchdog feed. An invalid value is treated as a watchdog timeout.
    • The SW watchdogs are ignored until the first feeding. This allows for any needed initialization.

    With this, unexpected interrupts and all critical processes (and usually some not-so-critical ones) are watched, and regardless of the cause of a reset a record of it will be gathered and hopefully preserved. While the information gathered may not immediately diagnose an issue, it should narrow the area you are looking in, and it can be expanded to help with further diagnosis, to the limit of your ability to gather and store the information.

    Robert

  • Bravo - I'm told my "hand clapping" - from the (proper) side of Niagara - was heard (across the falls) in Canada.
    You may want to make mention of "our" favored method of, "Writes to fast responding, non-volatile, EXTERNAL memory" - to hold those key/critical values - in a far more secure & (errata-free) cooperative device...

  • cb1_mobile said:
    You may want to make mention of "our" favored method of, "Writes to fast responding, non-volatile, EXTERNAL memory" - to hold those key/critical values - in a far more secure & (errata-free) cooperative device...

    Good point. Logging the reset cause at that point in the startup, after the reset, is done partly to reduce timing pressure. External memory is both faster and more reliable (particularly FRAM), so using it for logging is a considerable benefit. The process is independent of which processor is used, though (except that not all processors can distinguish between reset sources).

    BTW, I have successfully used this technique to diagnose, in the field, oversensitivity* in the brown-out circuit on the TM4C123.

    Robert

    * For our particular application

  • Might any such "Brown-Out, over-sensitivity" be the result of your connection to CPL?     (Canada Power & Light ...)
    Last trip to Vancouver - Swear to God - the 62" flat-screen shrunk to ~40" - ran that way for a good hour - before CPL plunged us (all) into darkness...

    (Power Company's name is fictional - "Brown to Black-Out" was Not!)

    In this particular case the reverse would be more of a concern. The noise was all from internal sources. The power was stable and nothing was riding on the edge of capability. Boards in different locations showed different frequencies of occurrence. From all appearances the brown-out circuitry was responding to conducted noise; the power wasn't actually dipping.

    The external supervisor (have I mentioned I don't trust the reset circuits on micros?) never tripped. The solution was multi-pronged:

    • board layout fixes (it was a first run board)
    • addition of ferrites to reduce conducted noise
    • disabling of the brown-out

    Robert

  • Thank you all for your very valuable input. This gives me some very helpful tips, and we will be working hard on this the next few days... Hopefully we can find and solve the problem within a week!!!!
  • Robert Adsett said:
    Have I mentioned, "I don't trust the reset circuits on micros?"

    Perhaps ... maybe ... Ok - "SO often" - that even Bruno (and Univ student Luis) - both in Portugal - now, "Force such external supervisors" - upon ALL they meet!

    Associates of ours - present at the "Decommissioning of your pre-fixed board" - report,  "We come (not) to praise the Brown-Out Circuit - but to bury it..."

    Appears that crack, millennial staff here, now,  "have it" ...  "Long live supervisors and fast, non-volatile, errata-free, external memory..."

  • Robert Adsett said:
    disabling of the brown-out

    Why did you do this? Turning off error detection does not seem like an appropriate way to deal with problems.

  • Peter Borenstein said:
    Robert Adsett
    disabling of the brown-out

    I thought I was clear in the text. It was turned off because it was introducing false alarms. I didn't rely on them for protection in any case, proper supply protection was present elsewhere.

    The pants were not only held up by belt and suspenders but also a couple of paper clips. I just removed the paper clips because they were drawing blood.

    Robert

  • In general - I agree. Yet - in poster Robert's case - he (carefully) probed/analyzed - and determined that the brown-out circuit (w/in the MCU) appeared to be the causal nexus.

    Do you propose, "Continuing the Use" of a failed/failing "detection method?"    (Only because it is present?)     That seems - less than - appropriate...