TM4C startup file handling FaultISR....

Hello everyone, 

I have a more general question regarding the startup file in CCS. I'm working with M4 MCUs, mostly testing on the TM4C123GXL and TM4C1294XL Tiva LaunchPads. As I see it, NmiSR(), FaultISR() and IntDefaultHandler() all end in an infinite loop so the error can be examined with the debugger. I'm now interested in the best way to handle those interrupts in production code; I believe it is wrong to leave them this way, since I don't want my system to freeze at some point once deployed. I found some nice tutorials on how to solve this kind of problem, but they are built to be CMSIS compliant, and according to them the TivaC library cannot be used in such cases since it is not CMSIS compliant, and I really want to be able to use the TivaC library.

Is there some kind of documentation or tutorial I can study so I can resolve these handlers in an acceptable way, so that my system can recover itself if it jumps into FaultISR during operation? Or maybe I'm overcomplicating a simple situation, and all I need is some kind of restart?

Thank you!

Kind regards.

  • Hello Djedjica,

    How to handle fault and default interrupt handlers is a system-specific implementation decision. We do not provide any guidelines on implementation.

    Personally, I believe a good program should not end up in any of these situations, and it must have a watchdog to recover the system.

    Regards
    Amit
  • Amit said:
    How to handle fault and default interrupt handlers is a system-specific implementation decision. We do not provide any guidelines on implementation.
    Personally, I believe a good program should not end up in any of these situations, and it must have a watchdog to recover the system.

    Joseph Yiu's books "The Definitive Guide to ARM Cortex-Mx" (with x = 0, 3, 4) will provide you with some examples. AFAIK, there are several of his extensive hard fault handlers for debug support floating around; just use your favorite search engine.

    However, I must agree with Amit. What should a (virtually unserviceable) embedded system do in case of a fatal error?

    Doing nothing is usually considered the least dangerous option. Rigorous testing should have ruled out all critical software bugs, so hardware failure (or EMI) has to be assumed as the cause of such a failure. It is even debatable whether a watchdog makes sense here ...

  • Hi,

    Did you read/examine the spma043.pdf document? Only registers are involved, so it is not tied to a specific implementation.

    I agree with Amit - in production this should not happen...

  • System specific, definitely. But there are some general considerations. While you do want to catch these in test, it is, practically speaking, impossible to test all code paths in the system. And once you include unanticipated environmental conditions (did you really anticipate partially surviving a direct lightning strike?), it is necessary to include a proper mechanism for dealing with unanticipated exceptions.

    Unless the system cannot do anything dangerous, doing nothing is usually not a viable option. At a minimum, the system absolutely must be brought to a safe condition if at all possible. Systems that are not themselves dangerous but need to function remotely and continuously also need a recovery mechanism. If service means a 4-hour flight and a 2-hour hike to push a reset button, you are not going to be looked at kindly.

    So, some options (there are probably others, but this should give you some things to consider); a sketch of option 1 follows the list:

    1. Record the fault, wait for the watchdog to restart, and continue. The restart is responsible for safing the system, which means the startup must be prepared for a system that starts in an unsafe state. It needs to be able to do this if a watchdog fires, in any case.
    2. Record the fault, safe the system and halt. Halt may mean actually halting the micro in some fashion, or it could mean setting a flag so the system idles on startup. The advantage of this over option 1 is that the time from the fault until entering a 'safe' state is minimized.
    3. One of the above with a delay or rate limit, so restarts do not strain the equipment.
    4. One of the above with a limit on the number of retries. Sorry, you're making that hike.
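
    For illustration, here is a minimal sketch of what option 1's fault handler could look like on a Cortex-M part. It is untested; FaultRecord and g_faultRecord are made-up names, the __attribute__/".noinit" syntax assumes GCC, and the record must live in RAM the startup code does not zero:

    #include <stdint.h>

    typedef struct
    {
        uint32_t magic;      /* marks the record as valid          */
        uint32_t stackedPC;  /* PC pushed by the Cortex-M on entry */
        uint32_t stackedLR;  /* LR pushed by the Cortex-M on entry */
    } FaultRecord;

    #define FAULT_MAGIC 0xDEADFA11u

    volatile FaultRecord g_faultRecord __attribute__((section(".noinit")));

    void FaultISR(void)
    {
        uint32_t *stackFrame;

        /* Pick MSP or PSP depending on the EXC_RETURN value in LR. */
        __asm volatile("tst lr, #4\n"
                       "ite eq\n"
                       "mrseq %0, msp\n"
                       "mrsne %0, psp\n" : "=r"(stackFrame));

        g_faultRecord.magic     = FAULT_MAGIC;
        g_faultRecord.stackedPC = stackFrame[6];  /* stacked PC */
        g_faultRecord.stackedLR = stackFrame[5];  /* stacked LR */

        /* Stop feeding the watchdog and wait for it to reset us. */
        for(;;)
        {
        }
    }

    On the next startup, checking g_faultRecord.magic tells you whether you came up from a fault, so the startup code can safe the system and log the stacked PC before clearing the record.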

    Robert

  • Robert said:
    System specific, definitely. But there are some general considerations. While you do want to catch these in test, it is, practically speaking, impossible to test all code paths in the system. And once you include unanticipated environmental conditions (did you really anticipate partially surviving a direct lightning strike?), it is necessary to include a proper mechanism for dealing with unanticipated exceptions.

    Testing is surely a delicate and important issue - and often a neglected one. But you are right: for any nontrivial system, the number of possible states approaches infinity, so it is impossible to test them all. Catching everything relevant for the envisaged use cases is vital, however. That is a science by itself.

    Robert said:
    Unless the system cannot do anything dangerous, doing nothing is usually not a viable option. At a minimum, the system absolutely must be brought to a safe condition if at all possible.

    IMHO it is dangerous to rely on software alone for reaching a safe state. A safe system usually requires all outputs to go into the safe state when the driving MCU fails and its pins go into a high-impedance state. The bar of an (old) railway signal (the original "semaphore") would be a good example: the "line free" state is signaled by the up position, which requires active driving. In case of failure, the bar drops down, signaling a "line busy" state.

    I have often seen projects neglecting this, especially for cheaper, not safety-relevant mass products. But even when not required, such a "saving" always backfired ...

    Robert said:
    Systems that are not themselves dangerous but need to function remotely and continuously also need a recovery mechanism. If service means a 4-hour flight and a 2-hour hike to push a reset button, you are not going to be looked at kindly.

    I left a job ten years ago where I had to do this sometimes. Only the "flight" consisted of 3 consecutive flights, taking more than 10 hours. And I'm happy to be out of this business now ...

  • f. m. said:
    IMHO it is dangerous to rely on software alone for reaching a safe state

    I didn't mean to imply you should necessarily rely only on SW. Just that the SW must always return to a safe state.

    f. m. said:
    I have often seen projects neglecting this, especially for cheaper, not safety-relevant mass products. But even when not required, such a "saving" always backfired ...

    Like the reported stove fan that needs occasional power cycle resets.

    f. m. said:
    And I'm happy to be out of this business now ...

    I bet.

    Robert

  • Robert said:
    I didn't mean to imply you should necessarily rely only on SW. Just that the SW must always return to a safe state.

    And your post didn't imply this, as far as I understood you correctly. However, I have met people in my career (not software engineers) who were foolish enough to think and act that way. That was in a company where 25...30% of the code memory (Flash) in their main product was occupied by code that worked around hardware design bugs and omissions. Fortunately, that product had no safety requirements at all ...

  • Ah, good. I was worried I'd misdirected. Sorry for my confusion.

    I've certainly had to work around HW problems but thankfully nothing that extreme.

    Robert
  • Robert said:
    Ah, good. I was worried I'd misdirected. Sorry for my confusion.

    I think I need to say sorry for my misleading digression ...

    Robert said:
    I've certainly had to work around HW problems but thankfully nothing that extreme.

    I've learned the hard way that electronics mass production is a very special kind of business. Everything is sacrificed to the BOM costs:

    • The controller was always the cheapest one with the least possible amount of memory; no problem if that meant another vendor and another toolchain.
    • The hardware design strove for a minimal number of components (no inherent safety, minimal parameter safety margins, cheapest possible components) - I've seen dozens of boards die during debugging and testing, not to mention those that died at customer sites ...
    • The firmware had to have a minimal Flash footprint. A structured design and implementation approach was taboo, because it created "overhead". The chosen controller had no real stack, so function calls were expensive. Optimization toward the controller was demanded from the beginning; portability was a third-priority issue.

    Some people felt comfortable with this style of development. Not to say it is generally bad, but it's not for me - it goes against a lot of things I have learned.

    And to hark back to the original post, things such as MCU fault behaviour were not a point of concern there. The MCU architecture mostly used had the "pleasant feature" of interpreting erased Flash and unimplemented memory (0xFF) as NOP, so if the code got lost there, it usually wrapped around at the 64k border and re-started at the reset entry point ...

  • Hello everyone,

    I read a little about the watchdog timer in TivaWare, and below is the code for setting it up. In my interrupt handler all I do is WatchdogIntClear() (a sketch of the handler follows the code below); if I understood correctly, this is sufficient to have the watchdog reset the system in case of some kind of stalling, or? Of course, I need to look more into this topic. I tried to test it on the Tiva LaunchPad with a flag that gets raised on a button click and causes the watchdog service routine to return without clearing the interrupt, in order to get a system reset, but it seems not everything is in order. Maybe some more configuration for the watchdog is needed?

    void Watchdog_Configuration(void)
    {
        // Enable the watchdog peripheral clock and wait until it is ready.
        SysCtlPeripheralEnable(SYSCTL_PERIPH_WDOG0);
        while(!SysCtlPeripheralReady(SYSCTL_PERIPH_WDOG0))
        {
        }

        // Reload value for a 250 ms timeout at the current system clock.
        uint32_t ui32_sysClk = SysCtlClockGet();
        uint32_t ui32_countDownVal = (ui32_sysClk / 1000) * 250;

        // Unlock the watchdog registers if they are locked.
        if(WatchdogLockState(WATCHDOG0_BASE) == true)
        {
            WatchdogUnlock(WATCHDOG0_BASE);
        }

        // Enable the watchdog interrupt in the NVIC and in the peripheral.
        IntEnable(INT_WATCHDOG);
        WatchdogIntEnable(WATCHDOG0_BASE);
        WatchdogIntTypeSet(WATCHDOG0_BASE, WATCHDOG_INT_TYPE_INT);

        // Load the timeout, arm the reset on a second (uncleared)
        // timeout, and start the counter.
        WatchdogReloadSet(WATCHDOG0_BASE, ui32_countDownVal);
        WatchdogResetEnable(WATCHDOG0_BASE);
        WatchdogEnable(WATCHDOG0_BASE);
    }
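
    For reference, my handler currently looks roughly like this (a sketch; g_skipService is just the name I'll use here for the flag raised by the button click, and the handler must be entered in the vector table for INT_WATCHDOG):

    volatile bool g_skipService = false;   // set on button click for the test

    void WatchdogIntHandler(void)
    {
        if(!g_skipService)
        {
            // Clearing the interrupt means the watchdog will not
            // reset the part on its next timeout.
            WatchdogIntClear(WATCHDOG0_BASE);
        }
        // If we return without clearing, the interrupt stays pending
        // and the second timeout should assert the reset, because
        // WatchdogResetEnable() was called.
    }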

  • Djedjica said:
    I read a little about the watchdog timer in TivaWare, and below is the code for setting it up. In my interrupt handler all I do is WatchdogIntClear(); if I understood correctly, this is sufficient to have the watchdog reset the system in case of some kind of stalling

    No, absolutely not. (Actually, I consider the presence of an interrupt associated with the watchdog one of the weaknesses of the TI implementation.) All this does is reset the processor if the watchdog's own interrupt handler stops running, a close to useless test.

    What you need to do is ensure that your application is running. Jack Ganssle has written several good articles on this subject.

    What I've done for many years is to use a cascaded Watchdog.

    1. First, I use the primary timer interrupt (or another, faster interrupt) to reset the watchdog. This runs contrary to many recommendations you will see on implementing watchdogs, but there is a method to this madness. This will reset the processor if the timer interrupt stops running.
    2. Since this does not provide any assurance about the rest of the system, I only reset the watchdog if system health flags are set for all other periodic processes. Each process has its own health counter that the timer interrupt decrements. If any counter reaches zero, the watchdog is no longer reset. Each process periodically resets its health counter to indicate it is continuing to operate.
    3. The counters are protected (usually by a Hamming code) so random writes are unlikely to refresh a counter with a valid value. Invalid values are treated as zeros.

    You probably want to start simpler than that, but it gives you an idea of what you can do; a rough sketch follows below. Ganssle's articles cover the weaknesses of various watchdog implementations well.
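
    In TivaWare terms, the core of it could look something like this rough, untested sketch (the names are made up and the Hamming protection is left out here; feeding the dog is just WatchdogIntClear() as in your code above):

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_TASKS      3
    #define HEALTH_RELOAD  100            // ticks of the fast interrupt

    static volatile uint32_t g_health[NUM_TASKS];

    // Called from the fast (primary) timer interrupt.
    void Watchdog_Service(void)
    {
        bool allAlive = true;

        for(int i = 0; i < NUM_TASKS; i++)
        {
            if(g_health[i] == 0)
            {
                allAlive = false;         // this task has stalled
            }
            else
            {
                g_health[i]--;
            }
        }

        // Only feed the hardware watchdog if every task checked in.
        if(allAlive)
        {
            WatchdogIntClear(WATCHDOG0_BASE);
        }
    }

    // Each periodic task calls this to prove it is still running.
    void Task_ReportHealthy(int taskId)
    {
        g_health[taskId] = HEALTH_RELOAD;
    }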

    Robert

  • Hello Mr. Robert,
    I know it took me some time, but I was busy with a different subject. Can I just sum up how I understood what you tried to explain about the watchdog method you used, since I'm trying to find the best solution for a system health check for the watchdog reset?
    So if I have several periodic processes in the system (for example I2C reading, SSI data sending, reading states from GPIO, ADC and some PWM function...), I'll give each of those processes a health counter which, as long as the process is not stalling, will always hold a non-zero value in spite of the timer interrupt decrementing it. If one of the processes halts, its counter will not be refreshed to a "healthy" value, and the timer interrupt will decrement it to zero, in which case I "starve the watchdog", which will cause the system reset. Plus that humming code you mentioned, which will, as you said, provide some assurance against random writes to the counter.
  • Yes, I think you have it.

    And it's Hamming code; I don't know what a search for humming code is going to find.

    The reference I've used for implementation is the old National Semiconductor App Note AN-482, now nearly 30 years old.

    Robert
  • Yes, I searched on Google and will try to implement the algorithm myself. Just one more question, since you obviously have much more experience than I do.
    If I decrement the counter in the watchdog interrupt, should I then encode it with the Hamming code, or should I decode the counter and then decrement it? I'm a bit confused about where to check whether a value is valid. In every scenario I can think of, I need to do either a Hamming decode or encode in the timer interrupt. Do you think this calculation of coded healthy values in an interrupt is too time consuming? I don't think I will have too many of them, but nevertheless I go by the rule of not doing much calculation in interrupt handlers, so I'm a bit worried about that.
    Thank you.
    Djedjica
  • Djedjica said:
    If I decrement the counter in the watchdog interrupt, should I then encode it with the Hamming code, or should I decode the counter and then decrement it? I'm a bit confused about where to check whether a value is valid.

    The general process is:

    In the highest-priority (usually fastest) watchdogged loop,

    • for each lower watchdog:
      • decode the encoded value to an integer value. The Hamming encoding I use is derived from an error correction code (4 bits encoded into 7). If the decode detects an error, I do not update the watchdog.
      • decrement the integer value
      • re-encode the value and store it
    • if no watchdog has faulted or timed out, then reset the (hardware) watchdog.

    In each lower-priority loop,

    • update its watchdog value with a new value. This is a constant, not computed. (A sketch of the encode/decode follows this list.)
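
    For illustration only (a sketch, not my production code), a (7,4) encode/decode could look like this; the codeword is laid out p1 p2 d1 p3 d2 d3 d4 in bits 0..6, so the parity checks follow the classic Hamming positions:

    #include <stdint.h>

    // Encode 4 data bits (bits 0..3 of 'data') into a 7-bit codeword.
    static uint8_t hamming_encode(uint8_t data)
    {
        uint8_t d1 = (data >> 0) & 1, d2 = (data >> 1) & 1;
        uint8_t d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;
        uint8_t p1 = d1 ^ d2 ^ d4;
        uint8_t p2 = d1 ^ d3 ^ d4;
        uint8_t p3 = d2 ^ d3 ^ d4;

        return (uint8_t)(p1 | (p2 << 1) | (d1 << 2) | (p3 << 3) |
                         (d2 << 4) | (d3 << 5) | (d4 << 6));
    }

    // Return the 4 data bits, or -1 if the syndrome shows an error.
    // Treating errors as invalid (instead of correcting) fits the
    // watchdog use: an invalid counter is simply treated as zero.
    static int hamming_decode(uint8_t code)
    {
        uint8_t s1 = ((code >> 0) ^ (code >> 2) ^ (code >> 4) ^ (code >> 6)) & 1;
        uint8_t s2 = ((code >> 1) ^ (code >> 2) ^ (code >> 5) ^ (code >> 6)) & 1;
        uint8_t s3 = ((code >> 3) ^ (code >> 4) ^ (code >> 5) ^ (code >> 6)) & 1;

        if(s1 | s2 | s3)
        {
            return -1;
        }
        return ((code >> 2) & 1) | (((code >> 4) & 1) << 1) |
               (((code >> 5) & 1) << 2) | (((code >> 6) & 1) << 3);
    }

    The constant each lower-priority loop stores is then just a precomputed codeword (wider counters can use several 4-bit groups), so no encoding runs in the tasks themselves.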

    Djedjica said:
    Do you think this calculation of coded healthy values in an interrupt is too time consuming? I don't think I will have too many of them, but nevertheless I go by the rule of not doing much calculation in interrupt handlers, so I'm a bit worried about that.

    That is a good worry. For reference, this is what I have:

    Main watchdog in a 10 kHz task:

    • Error calculation per watchdog: ~24 logical operations
    • Data decoding: ~29 logical operations
    • Data encoding: ~28 logical operations

    It's not a significant load on my system, and obviously it's less of a load if you go slower. I don't have latency numbers to hand, but the watchdog is a small proportion of the work in that loop, so in my case it's not a big contributor.

    I'd measure whatever you come up with using a 'scope or logic analyser; see the snippet below for one way to do that.
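
    For example (assuming TivaWare and the Watchdog_Service() sketch from my earlier post; PF1 is just a spare output pin):

    #include "inc/hw_memmap.h"
    #include "driverlib/gpio.h"

    // Wrap the watchdog work with a pin toggle so its duration
    // shows up directly on the 'scope.
    void Watchdog_Service_Measured(void)
    {
        GPIOPinWrite(GPIO_PORTF_BASE, GPIO_PIN_1, GPIO_PIN_1);
        Watchdog_Service();               // the work being measured
        GPIOPinWrite(GPIO_PORTF_BASE, GPIO_PIN_1, 0);
    }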

    Although I only use a single level of cascade, you could use multiple levels to reduce the load on your main interrupt if desirable. You could also check a subset of the watchdogs each loop rather than all of them.

    Robert