CC2652R: Intermittent reboot on NV write

Grant China

Part Number: CC2652R

Hi, I'm trying to track down an intermittent reboot in our product. Our product uses a CC2652R and is using the Zigbee stack in SimpleLink version 4.20.1.0. Our application is fairly sizable and unfortunately I can't really provide a minimal code example that reproduces the issue that I'm seeing. But I can tell you the sequence of function calls that I think lead to the reboot.

The scenario where we're seeing the reboot is when trying to change the Zigbee channel on our device. Our system carries out various actions in response to incoming commands over Zigbee. I have been able to pare down our command handlers to the bare minimum code that will reproduce the issue that we're seeing. My test script sends down two Zigbee commands in a loop.

The first command handler just saves the received Zigbee channel mask to NV storage and tells the stack to reset the network settings on the next boot by calling the following lines of code:

zclport_writeNV(NV_USER_CHANNEL, 0, sizeof(mask), &mask);
zgWriteStartupOptions(ZG_STARTUP_SET, ZCD_STARTOPT_DEFAULT_NETWORK_STATE);

The second command handler just reboots the device by calling the following line of code:

SysCtrlSystemReset();

When the device boots back up, it reads the saved Zigbee channel mask from NV storage and initializes the stack via a call to Zstackapi_sysConfigWriteReq().

As I said, the test script just sends these two Zigbee commands down in a loop, passing down a new channel on each iteration. I can usually run this loop a few dozen times with no incident but intermittently, the device reboots when the first command is sent down. (Yes, the second command would have rebooted the unit anyway but remember, this is a minimal example to reproduce the problem. In our actual application, there are other actions that need to take place between those two commands.)

Note that I'm using the watchdog API in SimpleLink so the fact that the system is rebooting is very possibly a watchdog reset because something is hanging. I've got a fairly long watchdog timeout, 45 seconds, so I doubt it's simply a case of some command taking too long to complete.

The interesting part of this problem is that if I take the second command out of our test suite, i.e., we're just looping on the command that calls zclport_writeNV() and zgWriteStartupOptions(), then I can't reproduce the issue. So it seems to manifest only if I call zclport_writeNV() and zgWriteStartupOptions() and follow them up with a call to SysCtrlSystemReset().

Are there any known issues with zclport_writeNV() or zgWriteStartupOptions() that could lead to a reboot or hang? From what I can tell, zgWriteStartupOptions() also basically boils down to an NV write. In particular, is there anything that might also involve SysCtrlSystemReset()?

Thanks in advance,
Grant China
WattIQ, Inc.

over 2 years ago

Ryan Brown1 over 2 years ago

Hi Grant,

I recall your thread from five months ago: https://e2e.ti.com/f/1/t/1233083

How do you know that the device resets before SysCtrlSystemReset is called? Can you debug the device during an unexpected reset to determine the call stack leading up to it? You can also use print logs to confirm the application state and to prove that the watchdog is not involved. Please insert a delay between the two commands (your NV write and the system reset), as it is dangerous to intentionally perform these two operations consecutively. This is why bdb_resetLocalAction delays 500 ms after resetting the NV to factory new contents. You should also be able to use osal_nv_write or the NV driver function pointers (i.e. NVINTF_nvFuncts_t *pfnZdlNV) in place of zclport_writeNV, although it's difficult to remember given the age of the v4.20 SDK.
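One way to picture the suggested delay is to have the reset command handler record a request and let the application loop issue the reset only after a settle window has passed since the last NV write. The following is a minimal host-side sketch under stated assumptions: `DeferredReset`, the three helper functions, and `RESET_SETTLE_MS` are hypothetical names, not part of the SDK, and the 500 ms value is simply borrowed from the bdb_resetLocalAction figure mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical settle time, borrowed from the 500 ms figure mentioned above. */
#define RESET_SETTLE_MS 500u

typedef struct {
    bool     pending;     /* a reset has been requested            */
    uint32_t deadline_ms; /* tick value at which it may be issued  */
} DeferredReset;

/* Call from the reset command handler instead of resetting immediately. */
void deferred_reset_request(DeferredReset *dr, uint32_t now_ms)
{
    dr->pending = true;
    dr->deadline_ms = now_ms + RESET_SETTLE_MS;
}

/* Call after each NV write so the settle window restarts from that write. */
void deferred_reset_note_nv_write(DeferredReset *dr, uint32_t now_ms)
{
    if (dr->pending) {
        dr->deadline_ms = now_ms + RESET_SETTLE_MS;
    }
}

/* Poll from the application loop; true means the reset may be issued now. */
bool deferred_reset_due(const DeferredReset *dr, uint32_t now_ms)
{
    /* Signed difference keeps the comparison correct across tick wraparound. */
    return dr->pending && (int32_t)(now_ms - dr->deadline_ms) >= 0;
}
```

On target, the application loop would call SysCtrlSystemReset() once `deferred_reset_due()` returns true; the millisecond tick source is left to the application.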

Regards,
Ryan

Mark Madsen over 2 years ago

Ryan Brown1 - I am a colleague of Grant's, also involved in this issue. Grant is traveling this week and will be slower to reply. One observation, and then I have a question.

The observation is that we do have an inherent delay between the NV writes and the system reset in the test Grant is running: the system reset is sent by a cloud-based controller application after the NV writes attempt to set the new channel value. We will check the timing on that (it's a good reminder), but typically the two are not happening right on top of each other. So something else is going on.

Which leads to my question. We have a relatively short watchdog timer interval (45 seconds), to keep the system interactive and data flowing. In the zStack documentation I was reading about how NV writes work, and about the occasional need for "reclaiming" unused locations as writes are done to fresh locations, changing the pointer used by the ID scheme. It seems that this would occasionally trigger a "compaction" of the stored items. How long does this internal reorganization usually take, and could it cause the timing of the NV write call to vary? There seem to be bare hints of this in other forum posts, where flash operations could vary widely in timing. We're trying to formulate more conditions we can test and rule out. Thanks in advance for any pointers you might have.
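One way to test whether NV write timing varies is to bracket each write with a tick reading and keep the worst case seen so far. This is a minimal sketch, assuming the application has some millisecond tick source; `NvWriteStats` and `nv_stats_record` are hypothetical names introduced for illustration, not SDK APIs.

```c
#include <stdint.h>

/* Hypothetical bookkeeping for NV write durations. */
typedef struct {
    uint32_t calls;   /* number of writes recorded       */
    uint32_t max_ms;  /* longest write observed so far   */
    uint32_t last_ms; /* duration of the most recent one */
} NvWriteStats;

/* Record one write, given tick readings taken just before and just after it. */
void nv_stats_record(NvWriteStats *st, uint32_t start_ms, uint32_t end_ms)
{
    /* Unsigned subtraction stays correct if the tick counter wraps once. */
    uint32_t elapsed = end_ms - start_ms;

    st->calls++;
    st->last_ms = elapsed;
    if (elapsed > st->max_ms) {
        st->max_ms = elapsed;
    }
}
```

In the command handler this would wrap the existing calls: read the tick, call zclport_writeNV() and zgWriteStartupOptions(), read the tick again, and pass both readings to `nv_stats_record()`. Logging `max_ms` over a long run would show whether any write comes close to the watchdog interval.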

Mark Madsen

WattIQ, Inc.

Ryan Brown1 over 2 years ago in reply to Mark Madsen

Hi Mark,

You can find the uses of compactPage and COMPACT_PAGE_CLEANUP in osal_nv.c, then use further debugging or print logs to track when they run and determine whether compaction is conflicting with the watchdog. You should also attempt to restart the watchdog timer immediately before writing to the NV, or increase the watchdog interval in general for testing purposes. There is a distinct risk of compaction failure detailed in the function comments, which must be further investigated if you can verify that this occurs on your system.
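The suggestion to restart the watchdog immediately before the NV write can be sketched as a small wrapper. This is only an illustration that runs on a host: `wdt_kick`, `nv_write_item`, and `guarded_nv_write` are hypothetical stand-ins, and on target they would wrap whatever watchdog-clear call and NV write the application already uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-ins (hypothetical); the counters exist only so the
 * ordering of the two operations can be observed. */
unsigned g_wdt_kicks;
unsigned g_nv_writes;
unsigned g_kicks_seen_at_last_write;

/* Stand-in for the application's watchdog restart. */
void wdt_kick(void)
{
    g_wdt_kicks++;
}

/* Stand-in for the application's NV write; returns 0 on success here. */
int nv_write_item(uint16_t id, const void *buf, size_t len)
{
    (void)id;
    (void)buf;
    (void)len;
    g_kicks_seen_at_last_write = g_wdt_kicks;
    g_nv_writes++;
    return 0;
}

/* Restart the watchdog right before the write, so the full timeout is
 * available if the write happens to trigger a page compaction. */
int guarded_nv_write(uint16_t id, const void *buf, size_t len)
{
    wdt_kick();
    return nv_write_item(id, buf, len);
}
```

This does not protect against the infinite-loop case described in the compactPage comments below; it only rules out a slow but finite compaction as the cause of the reset.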

/******************************************************************************
 * @fn      compactPage
 *
 * @brief   Compacts the page specified.
 *
 * @param   srcPg - Valid NV page to erase.
 * @param   skipId - Item Id to not compact.
 *
 * @return  TRUE if valid items from 'srcPg' are successully compacted onto the 'pgRes';
 *          FALSE otherwise.
 *          Note that on a failure, this could loop, re-erasing the 'pgRes' and re-compacting with
 *          the risk of infinitely looping on HAL flash failure.
 *          Worst case scenario: HAL flash starts failing in general, perhaps low Vdd?
 *          All page compactions will fail which will cause all osal_nv_write() calls to return
 *          NV_OPER_FAILED.
 *          Eventually, all pages in use may also be in the state of "pending compaction" where
 *          the page header member OSAL_NV_PG_XFER is zeroed out.
 *          During this "HAL flash brown-out", the code will run and OTA should work (until low Vdd
 *          causes an actual chip brown-out, of course.) Although no new NV items will be created
 *          or written, the last value written with a return value of SUCCESS can continue to be
 *          read successfully.
 *          If eventually HAL flash starts working again, all of the pages marked as
 *          "pending compaction" may or may not be eventually compacted. But, initNV() will
 *          deterministically clean-up one page pending compaction per power-cycle
 *          (if HAL flash is working.) Nevertheless, one erased reserve page will be maintained
 *          through such a scenario.
 */

Regards,
Ryan
