Watchdog reset

Richard Dolf

Other Parts Discussed in Thread: CC2540, CC2541, CC2543

A question for TI developers or anyone else who has access to BLE v1.2 source code:

We're experiencing occasional, non-deterministic, watchdog resets after our devices have been running 4-5 days continuously. Is there any circumstance in the BLE libraries which might cause a watchdog reset? Any use of HAL_SYSTEM_RESET() for example? Any code like while (1) or while(TRUE), etc?

Our code enables the watchdog timer (1 second timeout), but we are not deliberately causing the WD timer to expire.

The library in question is CC2540_ble_single_chip_peri.lib without power-saving. We have a rich RF environment (25-30 BLE peripheral devices in very close proximity).

over 12 years ago

0 J Lindh over 12 years ago

TI__Guru 57865 points

Hi Richard,

I did a brief search in the source and could not find any obvious code that would cause the watchdog timer to run out. I have not heard of any similar reports...

Can you reproduce the issue in a more "RF clean" environment?

Best Regards

0 Richard Dolf over 12 years ago in reply to J Lindh

Expert 2170 points

Hi Nick,

Thanks for looking. We've not seen the issue in less dense RF environments (4-5 devices in range of one another), but we've just recently added the code to detect and report watchdog reset so it is possible that the problem has existed for a while but gone unnoticed. Investigations continue :)

Richard

0 Richard Dolf over 12 years ago in reply to Richard Dolf

Expert 2170 points

I would like to re-open this issue. We have added more detailed event logging to our code, and the problem of watchdog resets persists; in fact it occurs more often than we first realized. More details:

We have a production burn-in area where as many as 50+ CC2540 peripheral devices are running at the same time, all within radio range of one another, each advertising. Over the course of 12 hours, typically one or two units experience a watchdog reset.

We have other units in our engineering development/test area which have never experienced watchdog reset over the course of several months. In the engineering area, there are typically 5-10 units in radio proximity to one another.

We have the watchdog timer programmed for a 1 second period, and we kick the watchdog timer every 500 msecs, driven by an OSAL timer event (osal_start_reload_timer).

None of our code is deliberately triggering the watchdog reset. None of our code blocks for more than a few milliseconds at most.

At this point, we believe that the watchdog reset is caused by something in the TI BLE stack. Since that source code is not available to us, we would like to ask TI for any suggestions for solving this problem. In the meantime, we're migrating to the 1.3 stack, perhaps that will change the behavior.

TIA,

Richard

0 Richard Dolf over 12 years ago in reply to Richard Dolf

Expert 2170 points

FWIW - we've been running v1.3 stack for about 6 weeks now, still getting watchdog resets. Frequency varies from once or twice per day to once a week or so.

We now have 3 units running special firmware where we continuously enable advertising for 1/2 hour and then disable advertising for 1/2 hour. The idea was to see if WD resets only occur when BLE is active. Curiously, over the past 2 weeks that this test has been running, we have not seen any watchdog resets in these units. Perhaps enable/disable of BLE resets something in the stack. Given the opacity of the BLE stack, I feel like a blind man trying to figure out why the elephant has diabetes.

0 Greenja over 12 years ago in reply to Richard Dolf

Guru 22270 points

Hello Richard,

Are the devices that are timing out by WDT changing the clock division at any point? According to the spec sheet, only in the CC2541 will the WDT interval remain the same if there is a clock division; see WDCTL (0xC9) – Watchdog Timer Control in SWRU191C.pdf.

Just a suggestion.

Thanks,

0 Richard Dolf over 12 years ago in reply to Greenja

Expert 2170 points

Hi Greenja,

Good idea, but I don't think that's it. Our code never writes to CLKCONCMD; that is managed completely by the low-level HAL and OSAL code. As far as I can tell, CLKCONCMD is always 0; i.e. no division of the clock. We are line-powered and are not using any power-saving modes.

One thing we've wondered about is that our WD timeout is 1 second and we're using a 500 msec tick to kick the dog. The 500 msec tick is driven by an OSAL timer (osal_start_reload_timer). If OSAL should ever skip a timer callback, it would be a close race between the next OSAL tick and the WD timeout.
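
The race described above can be put in numbers. This is only a minimal sketch, assuming the kick is driven purely by a reloading timer and that a skipped callback delays the kick by exactly one more period; the function names are hypothetical and are not part of OSAL.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case gap between two successful kicks, in milliseconds.
 * period_ms:  reload period of the timer that drives the kick (500 in this thread)
 * skipped:    number of consecutive timer callbacks that were missed
 * latency_ms: extra delay before the event handler actually runs */
static uint32_t worst_case_kick_gap_ms(uint32_t period_ms, uint32_t skipped,
                                       uint32_t latency_ms)
{
    assert(period_ms > 0);
    return period_ms * (skipped + 1u) + latency_ms;
}

/* Non-zero when the gap still fits strictly inside the watchdog timeout. */
static int kick_gap_is_safe(uint32_t gap_ms, uint32_t wdt_timeout_ms)
{
    return gap_ms < wdt_timeout_ms;
}
```

With a 500 msec period and a 1 second timeout, a single skipped callback already gives a 1000 msec gap, which leaves no margin at all. A 100 msec period would tolerate several skipped callbacks under the same assumptions.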

One reason we suspect BLE as the culprit is that the WD reset occurs in a quiescent system; the only things running are OSAL timers and BLE advertisements.

0 Greenja over 12 years ago in reply to Richard Dolf

Guru 22270 points

Did you use the SLEEPSTA to determine it was the WDT? I just stumbled upon an example from the CC2543/44/45 chips for the WDT that tests the bits in the SLEEPSTA to determine what really caused the reset.

0 Richard Dolf over 12 years ago in reply to Greenja

Expert 2170 points

Yes, exactly that:

/* SLEEPSTA - 8051 register contains reason for most recent reset in bits 3-4 */
#define RESET_CAUSED_BY_WATCHDOG() ((SLEEPSTA & 0x18) == 0x10)
#define RESET_CAUSED_BY_CLOCK_LOSS() ((SLEEPSTA & 0x18) == 0x18)
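
For readers who want all four reset causes rather than two macros, the same two-bit field can be decoded in one place. This is a sketch only: the encoding (00 power-on/brown-out, 01 external, 10 watchdog, 11 clock loss) is my reading of the SLEEPSTA description in the CC254x user's guide and should be checked there, and the type and function names are hypothetical. Note that a value masked with 0x18 can only be 0x00, 0x08, 0x10 or 0x18.

```c
#include <stdint.h>

/* Reset field of SLEEPSTA occupies bits 4:3 (mask 0x18). */
#define SLEEPSTA_RST_MASK  0x18u
#define SLEEPSTA_RST_SHIFT 3

typedef enum {
    RESET_POWER_ON_OR_BROWNOUT = 0, /* 00 */
    RESET_EXTERNAL             = 1, /* 01 */
    RESET_WATCHDOG             = 2, /* 10 */
    RESET_CLOCK_LOSS           = 3  /* 11 */
} reset_cause_t;

/* Takes the register value as an argument so it can be tried off-target;
 * on the device the caller would pass SLEEPSTA, read once early in startup. */
static reset_cause_t decode_reset_cause(uint8_t sleepsta)
{
    return (reset_cause_t)((sleepsta & SLEEPSTA_RST_MASK) >> SLEEPSTA_RST_SHIFT);
}
```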

0 Greenja over 12 years ago in reply to Richard Dolf

Guru 22270 points

I would think that the OSAL timer set at 500ms would only have ~100ms latency unless you were using Power_Savings and the oscillators needed time to settle.

You can always set it to 100ms or the smallest interval your code allows.

0 tylerw over 12 years ago in reply to Greenja

Intellectual 555 points

Not sure what kind of debug/logging interface you have. How about turning off the hw wdog and making your own sw watchdog on one of the other high res hw timers? Then you can dump a bunch of state when your watchdog fires and see what's going on. Of course there are cases when maybe your handler would be blocked depending on what the issue is, but just an idea.

-Tyler
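
The software-watchdog idea above could look roughly like this. It is a host-side sketch with hypothetical names, not code for any particular CC2540 timer: a periodic timer ISR would call sw_wdog_tick(), the main context would call sw_wdog_kick() and sw_wdog_mark(), and on expiry the state is captured for logging instead of resetting the chip.

```c
#include <stdint.h>

typedef struct {
    uint16_t reload;      /* ticks allowed between kicks */
    uint16_t remaining;   /* ticks left before expiry */
    uint8_t  expired;     /* latched once the countdown hits zero */
    uint8_t  last_marker; /* most recent marker written by the main code */
    uint8_t  dump_marker; /* marker captured at the moment of expiry */
} sw_wdog_t;

static void sw_wdog_init(sw_wdog_t *w, uint16_t reload_ticks)
{
    w->reload = reload_ticks;
    w->remaining = reload_ticks;
    w->expired = 0;
    w->last_marker = 0;
    w->dump_marker = 0;
}

/* Called by main-context code: restart the countdown. */
static void sw_wdog_kick(sw_wdog_t *w)
{
    w->remaining = w->reload;
}

/* Called by main-context code to record where execution currently is. */
static void sw_wdog_mark(sw_wdog_t *w, uint8_t marker)
{
    w->last_marker = marker;
}

/* Called from the periodic hardware-timer ISR. On expiry it captures state
 * instead of resetting, so the state can be logged afterwards. */
static void sw_wdog_tick(sw_wdog_t *w)
{
    if (w->expired || w->remaining == 0) {
        return;
    }
    if (--w->remaining == 0) {
        w->expired = 1;
        w->dump_marker = w->last_marker;
    }
}
```

As noted in the post, this only helps if the timer ISR itself still runs; a stall with interrupts disabled would block it too.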

0 Richard Dolf over 12 years ago in reply to tylerw

Expert 2170 points

Excellent suggestion. To date, we've been pursuing other avenues because WD resets occur when our application is idle, so the interesting 'bunch of state' is all in TI code, much of which we have no source to. The most recent test (disabling advertising for 30 minutes of each hour) has been running in 3 separate units for 2 weeks now with no watchdog resets. We're going to switch back to continuous advertising next week and see if WD resets start happening more frequently.

Thanks,

Richard

0 Richard Dolf over 12 years ago in reply to Richard Dolf

Expert 2170 points

Update on our testing. One unit has been running 24/7 since April 1 with special test software which disables advertising for 30 minutes of each hour. Here is the record of watchdog resets from our event logging:

Sun, 12 May 2013, 23:37 *** Reset, watchdog
Sat, 11 May 2013, 11:38 *** Reset, watchdog
Sat, 11 May 2013, 04:32 *** Reset, watchdog
Thu, 09 May 2013, 12:30 *** Reset, watchdog
Tue, 07 May 2013, 16:44 *** Reset, watchdog
Thu, 02 May 2013, 22:46 *** Reset, watchdog
Tue, 30 Apr 2013, 00:34 *** Reset, watchdog
Fri, 26 Apr 2013, 03:32 *** Reset, watchdog

Curiously, the unit ran for over 3 weeks with no watchdog resets (April 1-26), followed by 8 watchdog resets over the next 2 1/2 weeks.

Of note is that the watchdog resets all occur when Bluetooth is actively advertising (minutes 30-59 of each hour). None of the resets were associated with attempts to connect to the device, the device was simply advertising.
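
The claim about minutes 30-59 can be checked mechanically against the log. A small sketch, with the times of day copied from the entries above and hypothetical helper names; it assumes, as the post does, that the advertising window is aligned to the logged wall-clock minutes.

```c
#include <stdio.h>

/* Times of day copied from the reset log above, newest first. */
static const char *const reset_times[] = {
    "23:37", "11:38", "04:32", "12:30",
    "16:44", "22:46", "00:34", "03:32"
};
#define RESET_COUNT (sizeof reset_times / sizeof reset_times[0])

/* Returns the minute field of an "HH:MM" string, or -1 if it does not parse. */
static int minute_of_hour(const char *hhmm)
{
    int hour, minute;
    if (sscanf(hhmm, "%2d:%2d", &hour, &minute) != 2) {
        return -1;
    }
    if (hour < 0 || hour > 23 || minute < 0 || minute > 59) {
        return -1;
    }
    return minute;
}

/* Counts log entries whose minute falls in the advertising window (30-59). */
static unsigned resets_in_advertising_window(void)
{
    unsigned count = 0;
    size_t i;
    for (i = 0; i < RESET_COUNT; i++) {
        int m = minute_of_hour(reset_times[i]);
        if (m >= 30 && m <= 59) {
            count++;
        }
    }
    return count;
}
```

For the eight entries listed, the count is eight, which matches the observation in the post.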

This test software also records each entry and exit to our application via osal event handler as well as each entry to osal_mem_alloc() and osal_mem_free(). None of the watchdog resets occurred while in our application's osal event handler, nor while in osal malloc/free.

Our conclusion is that something in the BLE stack is causing the watchdog resets.

0 Richard Dolf over 12 years ago in reply to Richard Dolf

Expert 2170 points

This weekend we pulled logs from 8 customer units which have been running "in the wild" since May. Every unit has experienced one or more watchdog resets. Furthermore, all our in-house long-term test units are experiencing watchdog resets. The frequency of resets is completely random, sometimes weeks between resets, sometimes several resets within a few hours.

We really only have 2 clues as to the source of this problem:
- The resets only occur when Bluetooth is actively advertising (see previous post)
- With rare exceptions, the resets only occur when the circuit board is installed in our cast aluminum casings.
My best guess at this point is that stray RF somehow results in a malformed packet that is mishandled by the BLE stack.

Although our system is tolerant of the watchdog resets, we really would like to get to the bottom of this problem. If anyone at TI is interested in finding this problem, we are willing to commit resources to assist.

Richard

0 Aslak N. over 12 years ago in reply to Richard Dolf

TI__Mastermind 23440 points

Hi Richard,

It can be due to some timing issue, perhaps. It could be that your 32MHz is too slow to start up, especially if you use a DC/DC converter which could add some noise when enabled. In this case, the device may hang.

Please verify the xtal layout, and perhaps try to increase the define in hal_sleep.c called HAL_SLEEP_ADJ_TICKS upwards to allow more time for the 32MHz to stabilize. Perhaps by 5 or 10.

Also (in-house) test the setup without power saving to see if xtals are at all the issue.

Best regards,
Aslak

0 Richard Dolf over 12 years ago in reply to Aslak N.

Expert 2170 points

Hi Aslak,

Thanks for considering this problem, but it seems unlikely to be related to clock start-up because 1) we use the Panasonic 1720 module which contains the crystals as well as the 2540 chip and 2) our units are all line powered and we don't use any power-saving. Once powered up, the units essentially run forever (until watchdog reset occurs).

What is running in our system:
- 4 timers with intervals of 250 msecs, 500 msecs, one second and one minute
- ADC conversion every 250 msecs
- short SPI activity (8 bytes), once per minute
- Bluetooth advertising

As described in previous posts, the watchdog reset occurs outside our task and the watchdog reset only occurs when Bluetooth advertising is enabled. The randomness of reset frequency suggests BLE receive because everything else in the system seems deterministic. My guess is BLE mishandling of malformed packets, but it's just a guess.

Richard

0 Aslak N. over 12 years ago in reply to Richard Dolf

TI__Mastermind 23440 points

Hi Richard,

So if you don't advertise it never fails? How about if you remain in a connection? If you disable SPI?

If ble packets are malformed then this will be caught by the CRC check in hardware and not be passed up to the software stack at all.

Can you try calling the command HCI_EXT_HaltDuringRfCmd( HCI_EXT_HALT_DURING_RF_DISABLE ) during init?

The default condition is that the MCU is stopped during RF events - it may be that this causes things to stop working 100% in your setup.

Another thing that some projects have set up is HCI_EXT_ClkDivOnHaltCmd( HCI_EXT_ENABLE_CLK_DIVIDE_ON_HALT ) which divides the system clock to 1MHz during RF events. However, I think you would have noticed that by now if it was active.

Best regards,
Aslak

0 Richard Dolf over 12 years ago in reply to Aslak N.

Expert 2170 points

>> So if you don't advertise it never fails? How about if you remain in a connection? If you disable SPI?

That is correct, watchdog resets do not occur when advertising is disabled via GAPRole_SetParameter( GAPROLE_ADVERT_ENABLED, .....).
We haven't tried disabling SPI, but
- all SPI activity occurs within our task, we poll the UART for TX/RX complete, no interrupts AND
- our task is never active when watchdog reset occurs so SPI seems to be eliminated as possible cause.
Also, the SPI peripheral is fundamental to our application, so it's not really feasible to disable SPI.

>> Can you try calling the command HCI_EXT_HaltDuringRfCmd( HCI_EXT_HALT_DURING_RF_DISABLE ) during init?

I will try this.

>> Another thing that some project have set up is HCI_EXT_ClkDivOnHaltCmd( HCI_EXT_ENABLE_CLK_DIVIDE_ON_HALT ) which divides the system clock to 1MHz during RF events. However, I think you would have noticed that by now if it was active.

There are no calls to HCI_EXT_ClkDivOnHaltCmd() anywhere in our project source code

0 Richard Dolf over 11 years ago in reply to Richard Dolf

Expert 2170 points

Hi Aslak,

Regarding the command: HCI_EXT_HaltDuringRfCmd( HCI_EXT_HALT_DURING_RF_DISABLE );

I have 4 test units running identical code except 2 issue the command during initialization and 2 do not. Results are:
Units issuing the HCI command:
49:ED - experienced 10 watchdog resets since Sept 10
95:95 - experienced 5 watchdog resets since Sept 10
Units which do NOT use the HCI command:
97:C7 - experienced 1 watchdog reset since Sept 10
3D:84 - experienced no watchdog resets since Sept 10

It seems that using the HCI command exacerbates the watchdog reset problem.

One other small bit of information is that I am tracking whether the watchdog reset occurs while inside or outside our code. In previous testing (prior to issuing the HCI command), the watchdog resets all occurred outside our code. Now, the 2 units which are issuing the HCI command have watchdog reset occurring about half the time inside our code. When the watchdog reset occurs while in our code, we are executing the OSAL event handler and the specific event being handled is the OSAL timer event which tells us to kick the dog. I'm not sure what to make of this because there is only a very small period of time between entering our OSAL event handler and the handler's issuing the WD_KICK sequence; that window of opportunity seems very small.

Best regards,
Richard

0 Aslak N. over 11 years ago in reply to Richard Dolf

TI__Mastermind 23440 points

Hi Richard,

Hm. The reason could be that the WDT is ticking along more when the MCU is not halted. I need to confirm this with some designers.

I'm also not sure what to make of that, but if let's say there's nothing much going on except for that kicking event, then there aren't a lot of other places it could trigger.

In any case, using an OSAL event to kick the dog is not a good idea because the user task has very low priority. Still, I am surprised that it can take more than 1s before something executes.

I would recommend to put the dog-kicks in the osal_run_system() function so that it gets executed before _any_ system event call. If the main context stalls this will pick up on that.

I'm not sure how you perform the dog-kicks, but it may be that they are interrupted for some reason, so that not all kicks "count" because of the timing constraint of 1 32kHz period between 0x0A and 0x05 writes.

I suggest a macro such as this;

#define KICK_DOG() \
 { \
 HAL_ENTER_CRITICAL_SECTION(intState); \
 WDCTL = 0xA8; \
 WDCTL = 0x58; \
 HAL_EXIT_CRITICAL_SECTION(intState); \
 }

And if inserted in the osal_run_system function, that the intState variable declaration is moved to the top of the function.

Best regards,
Aslak
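
To illustrate why an interrupted kick might not "count", here is a toy model of the clear sequence. It assumes only what the post above states (0xA followed by 0x5 in the upper nibble of WDCTL, with the second write arriving within one watchdog clock period); the struct and function are hypothetical and simplify the real hardware.

```c
#include <stdint.h>

typedef struct {
    uint8_t  armed;     /* non-zero between a 0xA write and the next write */
    uint32_t armed_at;  /* watchdog clock tick at which 0xA was written */
    uint32_t clears;    /* number of kick sequences that actually counted */
} wdt_model_t;

/* Model of a write to WDCTL at time `now`, measured in watchdog clock ticks. */
static void wdt_model_write(wdt_model_t *w, uint8_t value, uint32_t now)
{
    uint8_t clr = (uint8_t)(value >> 4);   /* upper nibble carries 0xA / 0x5 */

    if (clr == 0xA) {
        w->armed = 1;
        w->armed_at = now;
    } else {
        if (clr == 0x5 && w->armed && (now - w->armed_at) <= 1u) {
            w->clears++;                   /* sequence completed in time */
        }
        w->armed = 0;                      /* any other write ends the sequence */
    }
}
```

In this model, two back-to-back writes clear the timer, while the same two writes separated by an interrupt lasting a few 32 kHz periods do not. That is the case the critical section is meant to rule out.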

0 Richard Dolf over 11 years ago in reply to Aslak N.

Expert 2170 points

Hi Aslak,

I will try your suggestion to move kick to the osal_run_system() function. Can you share who calls this function and how often? If power-saving is disabled, is osal_run_system() called continuously from something like while(1) loop? I'm not seeing the system's idle loop in source code.

For kicking the dog, we're using the macro defined in hal_mcu.h which is
#define WD_KICK() st( WDCTL = (0xA0 | WDCTL & 0x0F); WDCTL = (0x50 | WDCTL & 0x0F); )

Your comment about interrupting the kick sequence is interesting because that could explain why the resets are so random. I will also try putting the kick sequence into a critical section.

Thanks,
Richard

0 Aslak N. over 11 years ago in reply to Richard Dolf

TI__Mastermind 23440 points

Hi Richard,

osal_run_system() is what's running all the time in an infinite loop. At the top it polls HAL, next it finds timed out timers and sets events, next it checks each task for events starting from 0 and executes xx_ProcessEvent. If no events->sleep. It's called from osal_start_system().

Ok, that macro is more sensible - I just made something up for your setup.

Best regards,
Aslak
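
The loop described in this reply can be sketched as follows. This is not TI's osal_run_system() source; it is a simplified host-side illustration with hypothetical names that leaves out HAL polling and timer updates and keeps only the parts discussed here: a kick at the top of every pass, a scan for pending events starting from task 0, and a dispatch to that task's handler.

```c
#include <stddef.h>
#include <stdint.h>

#define TASK_COUNT 3

typedef uint16_t (*task_handler_t)(uint8_t task_id, uint16_t events);

typedef struct {
    task_handler_t handlers[TASK_COUNT]; /* index 0 has highest priority */
    uint16_t       events[TASK_COUNT];   /* pending event flags per task */
    uint32_t       kicks;                /* stand-in for the watchdog kick */
    uint32_t       idle_passes;          /* passes that found nothing to do */
} scheduler_t;

/* Example handler: records the call order and consumes every event. */
static uint8_t call_log[8];
static size_t  call_count;

static uint16_t logging_handler(uint8_t task_id, uint16_t events)
{
    (void)events;
    if (call_count < sizeof call_log) {
        call_log[call_count++] = task_id;
    }
    return 0;
}

/* One pass of the loop: kick, find the first task with events, dispatch it. */
static void scheduler_run_once(scheduler_t *s)
{
    uint8_t id;
    uint16_t pending;

    s->kicks++;                      /* kick before any task event handler */

    for (id = 0; id < TASK_COUNT; id++) {
        if (s->events[id] != 0) {
            break;
        }
    }
    if (id == TASK_COUNT) {
        s->idle_passes++;            /* the real loop would sleep here */
        return;
    }

    pending = s->events[id];
    s->events[id] = 0;
    /* The handler returns the events it did not process. */
    s->events[id] |= s->handlers[id](id, pending);
}
```

Because the kick sits at the top of the pass, it runs even when a low-priority user task never gets scheduled, which is the point of moving it out of an OSAL timer event.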

0 Richard Dolf over 11 years ago in reply to Aslak N.

Expert 2170 points

Hi Aslak,

On Sept 25, we added watchdog kick to the osal_run_system():
HAL_ENTER_CRITICAL_SECTION(intState);
WD_KICK();
HAL_EXIT_CRITICAL_SECTION(intState);
So now, the watchdog is getting kicked continuously before each task event handler is called.

As before, we have 4 test units running the same code, 2 are calling HCI_EXT_HaltDuringRfCmd( HCI_EXT_HALT_DURING_RF_DISABLE ) and 2 are not.

Units issuing the HCI command:
49:ED - experienced 0 watchdog resets since Sept 25
95:95 - experienced 0 watchdog resets since Sept 25
Units which do use the HCI command:
97:C7 - experienced 3 watchdog resets since Sept 25
3D:84 - experienced 11 watchdog resets since Sept 25

Interestingly, units issuing the HCI command had no watchdog resets. This is the opposite result from previous test. Perhaps we're just seeing the randomness of watchdog resets or perhaps the HCI command has some side effect that we don't understand. At this point, I'm inclined to remove this HCI command unless TI recommends otherwise.

As to watchdog resets themselves, we're still in the dark. Watchdog resets occur seemingly randomly, while executing TI code, in the absence of any activity other than advertising.

BR,
Richard

0 Richard Dolf over 11 years ago in reply to Richard Dolf

Expert 2170 points

Sorry for the typo, should be

Units issuing the HCI command:

49:ED - experienced 0 watchdog resets since Sept 25
95:95 - experienced 0 watchdog resets since Sept 25

Units which do NOT use the HCI command:
97:C7 - experienced 3 watchdog resets since Sept 25
3D:84 - experienced 11 watchdog resets since Sept 25

0 Aslak N. over 11 years ago in reply to Richard Dolf

TI__Mastermind 23440 points

Hi Richard,

The side-effect is simply that the WDT will have exactly the length of an RF event less time between kicks. In that sense it makes sense to issue the HALT_DURING_RF_DISABLE.

It seems like you are on a marginal case here. Even though the WDR doesn't occur in the SPI routine, it or other events can still be to blame for delaying the time it takes until the kick occurs.

Do you have timing numbers for the serial communication? Is it perhaps not entirely deterministic? Do you have other events or tasks that may be long-running?

Best regards,
Aslak

0 Richard Dolf over 11 years ago in reply to Aslak N.

Expert 2170 points

Hi Aslak,

I don't have timing numbers for SPI, but I don't see how that could be the source of the problem. We are the SPI master and the communication consists of a 7-byte transfer at 1 Mbps. Our code is modeled after the suggestions in TI's Design Note #113 (swra223a.pdf).

Twice a day we have a routine that runs 200-300 msecs. This causes problems with Bluetooth (see http://e2e.ti.com/support/low_power_rf/f/538/t/294169.aspx) but the running of this routine is not correlated with any watchdog resets. We know this because the long running routine occurs at specific times of the day and those times are not aligned with any watchdog resets.

I've started a new test which will record the active task ID at the time watchdog reset occurs. Will let you know what that turns up.

Is it possible that watchdog resets are caused by electrical problem like unconnected inputs or sagging power or...? Something that would cause the WD timer to keep ticking but prevent software from running?

Richard

0 Richard Dolf over 11 years ago in reply to Richard Dolf

Expert 2170 points

Hi Aslak,

For the latest test, we're kicking the dog inside osal_run_system(), the kick is also inside a critical section. I've added code to osal_run_system() which tracks the currently active task.

This test has been running since Friday, Oct 4. In that time:
49:ED - experienced 7 watchdog resets
95:95 - experienced 0 watchdog resets
97:C7 - experienced 1 watchdog reset
3D:84 - experienced 1 watchdog reset
All of the watchdog resets occurred with activeTaskID = 0xFF, indicating that no task was running.
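
The thread does not show how the active task ID was recorded across the reset. One possible approach, given here purely as an assumption, is a small record kept in RAM that startup code does not clear, validated by a magic value. The names and the magic constant are hypothetical, and placing the variable in non-initialized RAM is toolchain-specific and not shown.

```c
#include <stdint.h>

#define NO_ACTIVE_TASK   0xFFu    /* value reported in the post above */
#define CRUMB_MAGIC      0xA55Au  /* hypothetical marker for "record is valid" */

/* On target this would live in RAM that startup code does not clear. */
typedef struct {
    uint16_t magic;
    uint8_t  active_task;
} crumb_t;

static crumb_t crumb;

/* Call just before dispatching a task's event handler. */
static void crumb_enter_task(uint8_t task_id)
{
    crumb.magic = CRUMB_MAGIC;
    crumb.active_task = task_id;
}

/* Call as soon as the handler returns. */
static void crumb_leave_task(void)
{
    crumb.active_task = NO_ACTIVE_TASK;
}

/* Call early after reset. Returns 1 and fills *task_id when a record from
 * the previous run is present, 0 otherwise. The record is then invalidated. */
static int crumb_read_after_reset(uint8_t *task_id)
{
    if (crumb.magic != CRUMB_MAGIC) {
        return 0;
    }
    *task_id = crumb.active_task;
    crumb.magic = 0;
    return 1;
}
```

A reading of 0xFF after a watchdog reset then means the reset happened between handlers, which is the situation reported above.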

Since no task is active, I think we're left with:
1. Some ISR runs too long or interrupt condition is not cleared resulting in continuous interrupts
2. Idle loop which calls osal_run_system()
3. Problem in osalTimeUpdate()
4. Hardware problem

Since the problem only occurs when BLE advertising is enabled, I would put my money on #1.

Suggestions? (this problem has been festering for almost a year now)

Best regards,
Richard

0 Andrew King over 11 years ago in reply to Richard Dolf

Prodigy 230 points

Richard

It is possible that there is a 5th WDT trigger possibility. As part of the advertising process the units listen for responses, and in your case many of the messages they receive will not be valid advertising responses (they are advertisements from other units). An occasional buffer overflow could therefore corrupt RAM and indirectly trigger the WDT.

We have recently seen circumstances where, in field deployments of 50+ units which both Advertise and Observe, there has seemingly been corruption of configuration parameters stored in RAM. Unfortunately, the in field diagnostics capability of these types of systems is extremely limited. This behaviour was not seen in smaller, production testing quantities.

It would be very helpful if TI could quantify the level of stress testing that the BLE stack has undergone, in terms of concurrent Advertising, Observing, etc., so that we may have confidence that we are working inside a validated, parameterized envelope.

Regards

ayemk

0 Richard Dolf over 11 years ago in reply to Andrew King

Expert 2170 points

Hi ayemk,

I'd hazard a guess that most testing has been done with the most common use case, i.e. a battery operated device that responds to user input by advertising for a short time and then going back to sleep. That could explain why the watchdog reset problem and your issues are not seen by more developers.

Richard

0 Aslak N. over 11 years ago in reply to Andrew King

TI__Mastermind 23440 points

Ayemk,

How did you proceed to determine that it was RAM corruption that occurred?

I can tell you that we have not tested with 50+ units that I'm aware of.

BR,
Aslak

0 Andrew King over 11 years ago in reply to Aslak N.

Prodigy 230 points

Aslak

Our application uses 3 bytes to define hardware configuration and software functionality. When these 3 bytes were in RAM, and only in the presence of large numbers of advertising units, we were seeing faulty behavior that could only be explained if the internal values of these bytes had changed.

Changing these bytes to #defines (ROM) had a major effect on system reliability. This application also uses Observer functionality, but stress testing with 30+ advertising devices has not allowed us to localize the fault to the Observer function.
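
When configuration bytes have to stay in RAM, corruption of the kind described here can at least be detected. The following is a generic technique, not something used in this thread: each byte is stored together with its bitwise complement and the pair is checked before use. All names are hypothetical.

```c
#include <stdint.h>

#define CONFIG_LEN 3

typedef struct {
    uint8_t value[CONFIG_LEN];
    uint8_t inverse[CONFIG_LEN]; /* bitwise complement of value[] */
} guarded_config_t;

/* Stores the configuration bytes together with their complements. */
static void config_store(guarded_config_t *c, const uint8_t bytes[CONFIG_LEN])
{
    int i;
    for (i = 0; i < CONFIG_LEN; i++) {
        c->value[i] = bytes[i];
        c->inverse[i] = (uint8_t)~bytes[i];
    }
}

/* Returns 1 when every byte still matches its complement, 0 otherwise. */
static int config_is_intact(const guarded_config_t *c)
{
    int i;
    for (i = 0; i < CONFIG_LEN; i++) {
        if ((uint8_t)(c->value[i] ^ c->inverse[i]) != 0xFFu) {
            return 0;
        }
    }
    return 1;
}
```

A failed check does not say what overwrote the bytes, but logging it would turn a suspicion of RAM corruption into a recorded event.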

Our suspicion remains overflow in the BLE stack when large numbers of devices are issuing connectable advertisements and then listening for responses.

It would be EXTREMELY useful if you could publish metrics on the BLE RAM requirements under varying traffic levels.

Regards

Ayemk

0 Kevin Lockwood over 11 years ago in reply to Andrew King

Expert 1035 points

I would like to bump this thread as I've been seeing a similar issue recently. Peripherals have been resetting themselves while connected or advertising. Upon restarting, the batteries measure well above 70%, which likely rules out a brown-out reset. My application does not call any events during connection and so cannot be consuming memory, which leaves the stack as the suspect for this overflow reset. Any help/guidance would be great. Thanks

0 Richard Dolf over 11 years ago in reply to Kevin Lockwood

Expert 2170 points

What does the SLEEPSTA register say is cause of reset?

0 Kevin Lockwood over 11 years ago in reply to Richard Dolf

Expert 1035 points

Waiting to get the hardware back from the field so I can investigate. I cannot replicate on the bench. I don't imagine a POR or Brownout as the batteries in these were recently replaced and upon restart measure to be 90%. How did you solve your issue? Was it a Clock loss reset?

p.s. Thanks for the reply Richard

0 Richard Dolf over 11 years ago in reply to Kevin Lockwood

Expert 2170 points

Our units continue to experience watchdog resets, during manufacturing burn-in as well as in the field. Our units are line-powered, so resets are not battery related. Extensive logging and instrumentation has eliminated our code as the source so we believe the TI stack is responsible for the resets, likely something related to BLE receive data ISRs.

In our case, the resets are all watchdog resets, not clock loss or anything else (so says SLEEPSTA).

0 Greenja over 11 years ago in reply to Richard Dolf

Guru 22270 points

Hello Richard,

Your saga continues. How about putting a control unit beside your field unit? Load the control unit with SimplePeripheral, the KeyFob demo or any other example code. Modify it so that the WT is in the same place relative to your field code and attach the logging and instrumentation to it.

I personally don't use the WT, but I have had units in the field for over 6 months running a modified SimplePeripheral to control a TriMedia Billboard. I have not experienced any hanging of the controller.

Thanks,

0 Richard Dolf over 11 years ago in reply to Greenja

Expert 2170 points

At this point, there is really no saga because in our units, the watchdog reset is benign and invisible. Given a design life of 15-20 years unattended, we felt it prudent to enable a watchdog. In the absence of TI's help, we're not spending any more effort to solve the issue.

FWIW, the BLE stack includes a macro HAL_SYSTEM_RESET which enables the watchdog and causes it to expire. If the TI stack is using this macro, it would be possible for anyone to experience watchdog reset, even when deliberately not enabling the WD timer.

I suspect that in many applications, a watchdog reset would be undetected. For example, if my Fitbit reset itself every few months, who would know?

0 Greenja over 11 years ago in reply to Richard Dolf

Guru 22270 points

I was not aware of the system reset enabling the WT. Since all parameter changes were only stored in RAM, I would have noticed any resets since the defaults would have been reloaded.

When the time comes for doing the final board (using the USB Dongle now), the tried and true DS1232 monitor will be used.
