Hints for debugging a rare Hardfault

Per E

Other Parts Discussed in Thread: TM4C1237H6PZ

I am experiencing a Hardfault from time to time and I cannot find the cause. The FAULTSTAT register is 0x00000001 so I do not have a valid fault address. The only real information I have is the exception number (21) which makes sense since this is the UART0 peripheral which is used to communicate with the device during regression testing. Below is an output from my application.

20150414 02:46:51.674 :   LOG :[!] Hardfault detected\r\n
20150414 02:46:51.674 :   LOG :FAULTSTAT: 0x00000001\r\n
20150414 02:46:51.674 :   LOG :No valid bus fault address\r\n
20150414 02:46:51.674 :   LOG :Stacked core registers:\r\n
20150414 02:46:51.674 :   LOG :R0 = 0x00000004\r\n
20150414 02:46:51.674 :   LOG :R1 = 0x80022049\r\n
20150414 02:46:51.674 :   LOG :R2 = 0x00000010\r\n
20150414 02:46:51.674 :   LOG :R3 = 0x00000000\r\n
20150414 02:46:51.674 :   LOG :R12 = 0x0001BBBD\r\n
20150414 02:46:51.690 :   LOG :LR = 0xFFFFFFF1\r\n
20150414 02:46:51.690 :   LOG :PC = 0x7180F04E\r\n
20150414 02:46:51.690 :   LOG :PSR = 0x01000015\r\n

How do I go ahead and debug this kind of problem?

Also, the LR and PC registers make no sense. I have the following code in my startup.s file that is supposed to give me a valid stack pointer both in thread mode and handler mode. Does it seem correct?

FaultISR
        TST   lr, #4 ; Test for MSP or PSP
        ITE   EQ
        MRSEQ r0, MSP
        MRSNE r0, PSP
        B     mcu_hardfault_handler

MpuISR
        TST   lr, #4 ; Test for MSP or PSP
        ITE   EQ
        MRSEQ r0, MSP
        MRSNE r0, PSP
        B     mpu_fault_handler
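
For reference, the stub above hands the active stack pointer to the C handler in r0. A minimal sketch of what the C side could look like follows, assuming the basic 8-word exception frame (no floating-point state stacked). The names frame_is_on_psp and dump_stacked_frame are hypothetical and not taken from the original project.

```c
#include <stdint.h>
#include <stdio.h>

/* Word offsets inside the basic 8-word exception frame (no FPU state),
 * lowest address first, as pushed by a Cortex-M core on exception entry. */
enum { FRAME_R0, FRAME_R1, FRAME_R2, FRAME_R3,
       FRAME_R12, FRAME_LR, FRAME_PC, FRAME_PSR, FRAME_WORDS };

/* EXC_RETURN bit 2: 0 = frame was pushed on MSP, 1 = on PSP.
 * This is the same test as "TST lr, #4" in the assembly stub. */
static int frame_is_on_psp(uint32_t exc_return)
{
    return (exc_return & 0x4u) != 0u;
}

/* Hypothetical body for a C-level handler: 'sp' is the pointer the stub
 * passes in r0. Prints the registers in the same order as the log. */
static void dump_stacked_frame(const uint32_t *sp)
{
    static const char *const names[FRAME_WORDS] = {
        "R0", "R1", "R2", "R3", "R12", "LR", "PC", "PSR"
    };
    for (int i = 0; i < FRAME_WORDS; i++) {
        printf("%-3s = 0x%08lX\n", names[i], (unsigned long)sp[i]);
    }
}
```

Note that the dump is only as trustworthy as the stack pointer itself: if the stack pointer or the stack contents were corrupted before the fault, the printed values are whatever happens to be in memory at that address.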

Kind regards, Per

over 10 years ago

0 Robert Adsett over 10 years ago

Guru 27665 points

When this gets difficult enough the only way I have found to track these down is with a trace. You have to have a trace running and when the hard fault is reached back track through the trace until you find a thread to grab and pull on.

This is difficult with a full ICE and the only facility ARM cores have is JTAG/SWD and the supplemental trace output. ARM claims it is as good but I haven't had the need to test that assertion. You will also need a debugger capable of logging the trace and maybe the appropriate pins brought out. The upside is you won't be using a $10k to $100K ICE with a delicate tower to attach the emulator pod to the board.

Thankfully there are some intermediate steps you can take. Add an extra memory location or two to your dump. At strategic places in the program update these memory locations to indicate where you are executing. This allows you to narrow in on the location triggering the fault. The code would be something like

extern unsigned int progress; /* Keep track of program position */

void function(void)
{
    unsigned int savest;

    savest = progress; /* Preserve so we can restore on exit */
    progress = FUNCTION_ST1;
    /* Do something */
    progress = FUNCTION_ST2;
    /* Do something */
    progress = FUNCTION_ST3;
    /* Do something */
    progress = savest;
}

The values assigned to progress should be defines or part of an enum. This could be wrapped in macros so the markers can be compiled in or out as needed.
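
One possible way to wrap the markers in macros is sketched below. The macro names (PROGRESS_TRACE, PROGRESS_SAVE, PROGRESS_SET, PROGRESS_RESTORE) and the enum are hypothetical; only progress and the FUNCTION_STn names come from the example above.

```c
#include <stdint.h>

/* Hypothetical checkpoint identifiers; one enum keeps them unique. */
enum progress_point {
    PROGRESS_NONE = 0,
    FUNCTION_ST1,
    FUNCTION_ST2,
    FUNCTION_ST3
};

#ifndef PROGRESS_TRACE
#define PROGRESS_TRACE 1   /* build with -DPROGRESS_TRACE=0 to remove markers */
#endif

/* volatile so the compiler does not optimize the marker stores away */
volatile unsigned int progress = PROGRESS_NONE;

#if PROGRESS_TRACE
#define PROGRESS_SAVE(var)     unsigned int var = progress
#define PROGRESS_SET(point)    (progress = (point))
#define PROGRESS_RESTORE(var)  (progress = (var))
#else
#define PROGRESS_SAVE(var)     ((void)0)
#define PROGRESS_SET(point)    ((void)0)
#define PROGRESS_RESTORE(var)  ((void)0)
#endif

void function(void)
{
    PROGRESS_SAVE(saved);        /* preserve the caller's marker */
    PROGRESS_SET(FUNCTION_ST1);
    /* Do something */
    PROGRESS_SET(FUNCTION_ST2);
    /* Do something */
    PROGRESS_SET(FUNCTION_ST3);
    /* Do something */
    PROGRESS_RESTORE(saved);     /* caller's marker is visible again */
}
```

The fault handler then only needs to print progress alongside the stacked registers. Keeping the markers behind a single switch makes it cheap to leave them in the code base between hunts.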

Once you've narrowed down the region you can save more information into registers to dump on the fault.

One of the downsides to this approach is that these are often Heisenbugs that skitter away as you approach them or Will o' Wisp bugs that lead you astray. Persistence and patience are very much needed.

Robert

0 Per E over 10 years ago in reply to Robert Adsett

Intellectual 490 points

Thanks for your elaborate answer Robert.

We do use the trace functionality in our development environment with appropriate debugger. However, we have only seen this during regression testing and here we only have UART output for logging. Also, the amount of code that can be run from this context is huge and we do not have the resources to debug the rough way.

What I really am looking for is to get more relevant information from the MCU. I am thinking something like a timing issue when coming out of deep-sleep (which we do on UART reception). Perhaps a TI employee can have a look at the thread?

0 Amit Ashara over 10 years ago in reply to Per E

TI__Guru**** 244440 points

Hello Per,

FAULTSTAT = 0x1 means the CPU attempted to execute from an XN (execute-never) region. Is there an external memory being used? Also, since debug is not available, can you use a GPIO toggle to isolate the piece of code that gets executed prior to the fault?

Regards
Amit

0 Robert Adsett over 10 years ago in reply to Per E

Guru 27665 points

You are quite welcome

Per E said:
Also, the amount of code that can be run from this context is huge and we do not have the resources to debug the rough way.

I'm not sure there is another choice. Amit just proposed the same method using GPIO output rather than the serial port.

Not a lot of resources are required. Initially all you are looking for is to see if it happens in a consistent thread or block of code. These can be quite large and you can narrow down as you close in on the problem; there is no need to instrument the entire code base. If you are running multiple threads/tasks, the first thing to do is find out whether you are running in a consistent task when it fails.

One of the worst case scenarios is if you are overwriting a return address in an ISR. That can be very hard to track without using trace.

Your only other tools are code review (and human reviewers tire quickly) and static analysis tools. If you are not using something like PC-Lint and you are using C/C++ I do recommend you start.

None of these are foolproof or rapid. I once spent three months before finding a race condition (in a third party RTK). Not full-time, but it kept changing and disappearing as other conditions changed, so I would try to capture it, it would cease to fail, and I would have to turn my attention to other items until it re-appeared.

Once I caught it in a trace dump I could see the faulting memory access and go back the many instructions to find the corrupting instructions (several task switches previous).

Robert

0 Amit Ashara over 10 years ago in reply to Robert Adsett

TI__Guru**** 244440 points

Hello Robert,

Very well said. Program execution faults when they are sporadic are tough to debug.

Regards
Amit

0 cb1 over 10 years ago in reply to Per E

Guru 47900 points

Per E said:
... thinking something like a timing issue when coming out of deep-sleep (which we do on UART reception)

Might it make sense to, "Reduce your baud rate" (at least temporarily) and see if this reduces the issue's occurrence? At too high baud rates - your "level shifter" (if used) may also play a role.

Have you tested to see if the issue is confined to UART only - when escaping from, "deep sleep?" Might a simpler GPIO pin toggle - causing the same escape - help detect if the issue is (truly/fully) deep sleep or UART specific?

We work w/many ARM MCUs (multiple firms) not all behave "well - or consistently" when having their "sleep time" disrupted! Our cure - do not enforce nor expect "time critical" operations upon such, "awakenings."

0 Per E over 10 years ago in reply to Amit Ashara

Intellectual 490 points

OK, now I finally followed Robert's and Amit's recommendations and, by means of dumping a progress variable, found that the hardfault does not seem to occur in my ISR handler but rather before it is executed. This conclusion comes from the fact that the progress variable has the value which is assigned just before exiting the ISR (e.g. from the last interrupt).

So, what I have is a hardfault with UART0 interrupt active (at least if I can trust the PSR register) when coming out of deep-sleep. This sometimes happens on UART1 as well but this bus is not as busy as UART0. The peripheral is enabled in deep-sleep (or else we would not wake up on incoming data). The last dump contains the following information:

20150512 15:30:41.262 : LOG :[!] Hardfault detected\r\n
20150512 15:30:41.262 : LOG :FAULTSTAT: 0x00000001\r\n
20150512 15:30:41.262 : LOG :No valid bus fault address\r\n
20150512 15:30:41.262 : LOG :Stacked core registers:\r\n
20150512 15:30:41.262 : LOG :R0 = 0x00000004\r\n
20150512 15:30:41.262 : LOG :R1 = 0x80022049\r\n
20150512 15:30:41.262 : LOG :R2 = 0x00000010\r\n
20150512 15:30:41.262 : LOG :R3 = 0x00000000\r\n
20150512 15:30:41.262 : LOG :R12 = 0x0001BDE1\r\n
20150512 15:30:41.262 : LOG :LR = 0xFFFFFFF1\r\n
20150512 15:30:41.262 : LOG :PC = 0x400FE18C\r\n
20150512 15:30:41.262 : LOG :PSR = 0x00000015\r\n
20150512 15:30:41.262 : LOG :Debug variable = 17\r\n

Any ideas?

0 Amit Ashara over 10 years ago in reply to Per E

TI__Guru**** 244440 points

Hello PerE

My suspicion is on the following line:
20150512 15:30:41.262 : LOG :PC = 0x400FE18C\r\n

Program Counter cannot be a peripheral address and hence FaultStat of 0x1 does make sense as an IERR.
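
To make the address argument concrete, here is a small sketch that sorts an address into the regions of the architectural default Cortex-M memory map and reports whether that region is execute-never by default. It assumes the MPU is not overriding the default attributes. The type and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    const char *name;      /* region name in the default memory map */
    int execute_never;     /* 1 if instruction fetch is not allowed */
} mem_region_t;

/* Classify an address against the ARMv7-M default memory map.
 * Assumption: no MPU region changes the default attributes. */
static mem_region_t classify_address(uint32_t addr)
{
    if (addr <= 0x1FFFFFFFu) return (mem_region_t){ "Code", 0 };
    if (addr <= 0x3FFFFFFFu) return (mem_region_t){ "SRAM", 0 };
    if (addr <= 0x5FFFFFFFu) return (mem_region_t){ "Peripheral", 1 };
    if (addr <= 0x9FFFFFFFu) return (mem_region_t){ "External RAM", 0 };
    if (addr <= 0xDFFFFFFFu) return (mem_region_t){ "External device", 1 };
    return (mem_region_t){ "System", 1 };
}
```

With this map, 0x400FE18C lands in the peripheral region, which is execute-never, while 0x7180F04E from the first dump lands in the external RAM region, which the default map does not mark as execute-never.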

Regards
Amit

0 Per E over 10 years ago in reply to Amit Ashara

Intellectual 490 points

Yes, I know that the program counter is way off - in my first post in this thread it is 0x7180F04E. But I don't really see how I should proceed debugging from here. The ISR handler is called directly from the NVIC.

0 Robert Adsett over 10 years ago in reply to Per E

Guru 27665 points

I think it is likely you are barking up the wrong tree by looking at interrupt entry rather than interrupt exit. Do you have any evidence that program execution continues after the end of the interrupt? Stack corruption will easily lead to this sort of behaviour.

Robert

0 Per E over 10 years ago in reply to Robert Adsett

Intellectual 490 points

Good point, I will try to verify this overnight.

0 Amit Ashara over 10 years ago in reply to Per E

TI__Guru**** 244440 points

Hello Per E,

It seems that the exception is happening on exit of the interrupt. The PSR is showing the value 0x15 where bit-24 is clear and indicates a POP, or ISR exit and 0x15 indicates the active interrupt at the time. In all likelihood as Robert mentioned, it could be the Stack Pointer getting corrupted. Can you increase the size of the Stack to see if it delays the Fault?
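
The fields of the stacked PSR can be pulled apart with a few masks. The sketch below follows the ARMv7-M xPSR layout, where bits 8:0 hold the active exception number and bit 24 is the EPSR Thumb-state (T) bit. The helper names are hypothetical.

```c
#include <stdint.h>

#define PSR_ISRNUM_MASK  0x1FFu        /* IPSR: active exception number */
#define PSR_THUMB_BIT    (1u << 24)    /* EPSR: Thumb-state (T) bit */

/* Exception number that was active when the frame was stacked. */
static unsigned psr_exception_number(uint32_t psr)
{
    return psr & PSR_ISRNUM_MASK;
}

/* 1 if the stacked context had the Thumb-state bit set, else 0. */
static int psr_thumb_bit(uint32_t psr)
{
    return (psr & PSR_THUMB_BIT) != 0u;
}

/* Exception numbers 16 and up are external interrupts; the NVIC
 * interrupt number is the exception number minus 16.
 * Returns -1 for system exceptions. */
static int exception_to_irq(unsigned exception)
{
    return (exception >= 16u) ? (int)exception - 16 : -1;
}
```

For the dumps in this thread, 0x01000015 and 0x00000015 both decode to exception 21 (interrupt 5) and differ only in bit 24, while 0x00000016 decodes to exception 22 (interrupt 6).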

Regards
Amit

0 Per E over 10 years ago in reply to Amit Ashara

Intellectual 490 points

Hello Amit,

I have increased the main stack (we also have separate stacks for our RTOS tasks, Keil RTX) and added some more assignments to the progress debug variable to verify your theory about stack pointer corruption, should the hardfault manifest again.

Kind regards Per

0 Chester Gillon over 10 years ago in reply to Per E

Guru 92251 points

Per E said:
I am thinking something like a timing issue when coming out of deep-sleep (which we do on UART reception).

Which device and which revision are you using?

The reason is there may be a device errata affecting exiting from deep-sleep. E.g. TM4C devices have errata SYSCTL#04 "Device May not Wake Correctly From Sleep Mode Under Certain Circumstances", the description of which is

With a certain configuration, the device may not wake correctly from Sleep mode because invalid data may be fetched from the prefetch buffer. The configuration that causes this issue is as follows:
• The system clock must be at least 40 MHz
• Interrupts must be disabled

This specific errata may not apply to your device / configuration, but is an example of an errata which could cause erratic behavior - due to invalid data (code) being fetched from the prefetch buffer.

0 Amit Ashara over 10 years ago in reply to Per E

TI__Guru**** 244440 points

Hello Per,

If the issue is that of a recursive function, then the stack should overflow again, only later in time. If the issue is simply not enough stack space, then it would not.
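
A common way to measure how much stack is actually used, rather than guessing from whether the fault moves, is to paint the stack with a known pattern and count how much of the pattern survives. The sketch below shows the idea on a plain array; the names and the fill value are hypothetical, and on the target the painting would have to happen in startup code before the stack is in use.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_PATTERN 0xDEADBEEFu   /* arbitrary marker value */

/* Fill the whole stack area with the marker before it is used. */
static void stack_paint(uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        base[i] = STACK_FILL_PATTERN;
    }
}

/* Count marker words still intact, starting at the lowest address.
 * Cortex-M stacks grow downward, so this is the unused headroom. */
static size_t stack_unused_words(const uint32_t *base, size_t words)
{
    size_t n = 0;
    while (n < words && base[n] == STACK_FILL_PATTERN) {
        n++;
    }
    return n;
}
```

Logging the headroom from the fault handler, or periodically over the UART, would show whether the main stack ever gets close to its limit. A result of zero means the stack reached, and possibly ran past, the lowest address.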

Regards
Amit

0 Per E over 10 years ago in reply to Amit Ashara

Intellectual 490 points

Hello again,

I would say the issue manifests randomly rather than after a certain amount of time. We have units that have been running for weeks with no problem but still we see it from time to time in our regression testing where we reset the device all the time.

During the last nights it happened twice but in different ISRs (UART0 and SSI2). The SSI2 ISR does not call any RTX functions which rules out some suspicions.

When testing I set the main stack to 2 kB which should be more than enough for ISRs + RTX SVC functions. I can also confirm that we never return from the ISR as you said. Furthermore the PC value 0x400FE18C is the address of the last write operation before we enter deep-sleep mode (Deep-Sleep Power Configuration register).

Kind regards Per

0 Amit Ashara over 10 years ago in reply to Per E

TI__Guru**** 244440 points

Hello Per E,

Chester's earlier point requires us to know which TM4C12x device this is, so that we can see if there is a known issue. Please confirm whether it is a TM4C123 or a TM4C129 device.

Regards
Amit

0 Chester Gillon over 10 years ago in reply to Per E

Guru 92251 points

Per E said:
We have units that have been running for weeks with no problem but still we see it from time to time in our regression testing where we reset the device all the time.

How is the device reset done during the regression testing?

This thread https://e2e.ti.com/support/microcontrollers/tiva_arm/f/908/t/399372#pi239031349=2 describes an issue where a TM4C129 device would end up in the FaultISR after a reset. See the work-around for a change to TivaWare SysCtlClockFreqSet() function (in the sysctl.c file)

0 cb1_mobile over 10 years ago in reply to Per E

Guru 117855 points

Per E said:
see it from time to time in our regression testing where we reset the device all the time.

Should not full attention then be directed to:

a) any/all reset circuitry - especially the voltage levels and rise/fall times

b) the method of causing such "reset" - is it proper & in full compliance w/MCU spec

Our small firm has noted issues when such "reset" has (on occasion) "glitched" the MCU power rail or was not - at all times - fully w/in spec! Note that even if the MCU "appears" to have properly reset - such may not be (entirely) true - and leads directly to conditions similar to those you report...

0 Per E over 10 years ago in reply to Amit Ashara

Intellectual 490 points

The device is TM4C1237H6PZ and I tried to apply a workaround for errata issue SYSCTL#04 (SPMZ849E). The issue deals with sleep mode but I guess it might apply for deep sleep as well. So far I have not seen the hardfault occur but we need to run several iterations to confirm a fix.

Chester Gillon and cb1_mobile: There is no problem with reset whatsoever. We were discussing the time it takes before the fault occurs and this was to illustrate that the problem does not seem to be caused by triggering the ISR repeatedly.

0 Amit Ashara over 10 years ago in reply to Per E

TI__Guru**** 244440 points

Hello Per E,

Or it may be another effect of SYSCTL#01 as well.

Regards
Amit

0 Per E over 10 years ago in reply to Amit Ashara

Intellectual 490 points

We had another problem caused by SYSCTL #01 but these were not related.

After several runs, I have not seen this kind of hardfault which looks promising. However, last night we had a similar one:

20150519 20:06:47.835 :	 LOG  :[!] Hardfault detected\r\n
20150519 20:06:47.835 :	 LOG  :FAULTSTAT: 0x00020000\r\n
20150519 20:06:47.835 :	 LOG  :No valid bus fault address\r\n
20150519 20:06:47.851 :	 LOG  :Stacked core registers:\r\n
20150519 20:06:47.851 :	 LOG  :R0  = 0x00000004\r\n
20150519 20:06:47.851 :	 LOG  :R1  = 0x80022049\r\n
20150519 20:06:47.851 :	 LOG  :R2  = 0x00000010\r\n
20150519 20:06:47.851 :	 LOG  :R3  = 0x00000000\r\n
20150519 20:06:47.851 :	 LOG  :R12 = 0x0001BFDD\r\n
20150519 20:06:47.851 :	 LOG  :LR  = 0xFFFFFFF1\r\n
20150519 20:06:47.851 :	 LOG  :PC  = 0x00004770\r\n
20150519 20:06:47.851 :	 LOG  :PSR = 0x00000016\r\n

The difference is that FAULTSTAT now has INVSTAT set instead of IERR and the PC value points to a valid address. However, the address does not make much sense to me. What conclusions can I draw from the INVSTAT bit?
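
For comparing dumps like these, it can help to have the handler print the FAULTSTAT bit names instead of the raw value. The sketch below uses the bit names and positions from the TM4C123 FAULTSTAT register description (the same ones TivaWare exposes as NVIC_FAULT_STAT_*). The table and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t mask; const char *name; } fault_bit_t;

/* FAULTSTAT bits: memory management (7:0), bus (15:8), usage (31:16). */
static const fault_bit_t fault_bits[] = {
    { 0x00000001u, "IERR"    }, { 0x00000002u, "DERR"    },
    { 0x00000008u, "MUSTKE"  }, { 0x00000010u, "MSTKE"   },
    { 0x00000020u, "MLSPERR" }, { 0x00000080u, "MMARV"   },
    { 0x00000100u, "IBUS"    }, { 0x00000200u, "PRECISE" },
    { 0x00000400u, "IMPRE"   }, { 0x00000800u, "BUSTKE"  },
    { 0x00001000u, "BSTKE"   }, { 0x00002000u, "BLSPERR" },
    { 0x00008000u, "BFARV"   }, { 0x00010000u, "UNDEF"   },
    { 0x00020000u, "INVSTAT" }, { 0x00040000u, "INVPC"   },
    { 0x00080000u, "NOCP"    }, { 0x01000000u, "UNALIGN" },
    { 0x02000000u, "DIV0"    }
};

/* Write the space-separated names of all set bits into buf and return
 * how many known bits were set. Output is truncated if buf is too small. */
static int faultstat_decode(uint32_t faultstat, char *buf, size_t len)
{
    int count = 0;
    size_t used = 0;

    if (len > 0) {
        buf[0] = '\0';
    }
    for (size_t i = 0; i < sizeof fault_bits / sizeof fault_bits[0]; i++) {
        if ((faultstat & fault_bits[i].mask) == 0u) {
            continue;
        }
        count++;
        if (used < len) {
            int n = snprintf(buf + used, len - used, "%s%s",
                             used ? " " : "", fault_bits[i].name);
            if (n > 0) {
                used += (size_t)n;
            }
        }
    }
    return count;
}
```

Neither MMARV nor BFARV is set in any of the dumps in this thread, which matches the "No valid bus fault address" line in the log.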

0 Amit Ashara over 10 years ago in reply to Per E

TI__Guru**** 244440 points

Hello Per E,

Datasheet excerpt:

When this bit is set, the PC value stacked for the exception return points
to the instruction that attempted the illegal use of the Execution
Program Status Register (EPSR) register.

The stacked PC will show which instruction caused the issue, which you can correlate with the context of the code in the disassembly.

Regards
Amit
