CC2651P3: End-Devices losing network information

Tiago Lone

Part Number: CC2651P3

Hello,

I am developing an application in which my Zigbee end-devices stay in shutdown mode most of the time and only leave it when a button is pressed. The device then rejoins the network, sends an APS packet with APS Ack, and goes back to shutdown.

My code is based on the zed_switch example, but was modified to send an APS packet with APS Ack to the coordinator when the button is pressed. I am using the latest SDK (simplelink_cc13xx_cc26xx_sdk@7.10.01.24).

Most of the time everything works, but from time to time a device loses all network information and I need to recommission it.

I could reproduce this problem more than once, but only under harsh network conditions (pressing the buttons on many devices at once so that they all try to rejoin at the same time, and at a distance at which some packets may be lost).

I think that something may be triggering a Factory New reset. I do not see any Leave message on the air with rejoin set to false, which would trigger a Factory New reset on the end-device, so it must be something triggered on the end-device alone. In the documentation, I saw that a TCLK exchange failure could do that. But could that happen with a commissioned device that is just rejoining the network? Is there any other situation in which a Factory New reset can be triggered on the end-device? Any other idea of what could cause the device to lose the network information?

In my test setup, I had one coordinator and 11 end-devices. In one test that ran for 4 days, I lost 3 end-devices with the following addresses:

00:12:4b:00:2a:80:41:5c

00:12:4b:00:2a:80:41:6c

00:12:4b:00:2a:80:41:72

I have the sniffer file from this test. How can I send it here?

Last packets from 00:12:4b:00:2a:80:41:5c - Last messages from this device.

Last packets from 00:12:4b:00:2a:80:41:6c - The last packets (APS) could not be decrypted. I don't know why. In almost 70k packets captured, this kind of packet (orange) appears only here.

Last packets from 00:12:4b:00:2a:80:41:72 - After a transmission that appears to be OK, there are packets with Transport Key from the coordinator to the end-device short address, but without response. No more packets to or from this device after that.

over 2 years ago

0 Tiago Lone over 2 years ago

Prodigy 80 points

Here is a link for the sniffer capture file:

www.dropbox.com/.../2023-09-25_captura_4_dias_quinta_a_segunda.pcapng

0 Ryan Brown1 over 2 years ago in reply to Tiago Lone

TI__Guru**** 218057 points

Hi Tiago,

Thank you for providing the sniffer log. Factory resets inside the application only occur if MAX_DEVICE_UNAUTH_TIMEOUT is exceeded (the TCLK exchange failure case you mentioned), if the device is asked to leave the network without rejoining (leave command), or if one is requested through the application UI (UART terminal menu or BTN-2 at start-up). You've already covered the first two options. Have you disabled the UI or changed this feature in any way?

The erratic behavior of these devices indicates that the NV memory could be corrupted. Do you have to force a factory new condition to recommission these devices? Otherwise, what specific steps do you take to get them to rejoin the network after the failure occurs? At what voltage are you powering these devices, does this voltage fluctuate, and have you implemented low voltage detection? Also, how are your devices entering shutdown mode from the application?

Regards,
Ryan

0 Tiago Lone over 2 years ago in reply to Ryan Brown1

Prodigy 80 points

Hi Ryan,

UI is disabled and I do not see any Leave message on the air with rejoin set to false, so I think both options can be discarded.

Regarding your questions, yes, I have to force a factory new condition to recommission the devices. The device is powered by a CR2032 lithium battery and I think the voltage is stable but I have not implemented low voltage detection. I do not believe it is causing the problem but I will take a look at it and implement low-voltage detection so that the system is more robust.

I am forcing the device to enter shutdown mode. I was not considering memory corruption as a probable cause, but your question made me think: by forcing shutdown I could be causing NVS corruption if the stack was writing to it at that moment. I will correct this by using TI-RTOS power management routines to avoid entering shutdown while some other activity is in progress.

Anyway, I will also investigate TCLK exchange failure in parallel. What could cause MAX_DEVICE_UNAUTH_TIMEOUT to expire? Could RF communication failures alone cause it? For example, if the device is orphaned in the middle of this process, would that cause this situation? What triggers a new TCLK exchange procedure in a commissioned device?

Thank you

0 Ryan Brown1 over 2 years ago in reply to Tiago Lone

TI__Guru**** 218057 points

Just so you are aware, if the battery is dropping below the 2200 mV range or insufficiently powering the device then this can certainly cause NV flash write failure and thus corrupt the NV memory.
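A minimal sketch of the low-voltage check discussed here, for illustration only. It assumes the raw reading comes from the CC13xx/CC26xx battery monitor as a fixed-point value with 8 fractional bits (which is how driverlib's AONBatMonBatteryVoltageGet() is understood to report it; verify this against your SDK). The helper names and the idea of gating NV writes on the result are hypothetical; the 2200 mV figure is the one quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Threshold quoted in this thread; below it NV flash writes may fail. */
#define NV_MIN_SAFE_MILLIVOLTS 2200u

/* Convert a raw battery-monitor reading to millivolts.
 * Assumption: raw is fixed point, integer volts in bits [10:8] and
 * fractional volts in bits [7:0], i.e. volts = raw / 256. */
uint32_t batmon_raw_to_millivolts(uint32_t raw)
{
    return (raw * 1000u) >> 8;
}

/* Hypothetical gate: returns true when the supply is high enough that an
 * NV write (or a rejoin that may trigger one) is reasonable to attempt. */
bool nv_write_is_safe(uint32_t raw)
{
    return batmon_raw_to_millivolts(raw) >= NV_MIN_SAFE_MILLIVOLTS;
}
```

On a CR2032, the voltage under RF load can dip well below the resting voltage, so a check like this would be most informative if sampled right before or during the transmit burst rather than while idle.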

Tiago Lone said:
I could be causing NVS corruption if the stack was writing on it when I force shutdown

This is certainly a concern and I highly advise the use of TI-RTOS power management routines and the Power TI Driver.

MAX_DEVICE_UNAUTH_TIMEOUT will only happen if the joining device requests a TCLK update from the ZC trust center and does not receive a reply within 10 seconds. This only occurs during the original join process, not rejoins, so I do not believe this is an issue in your case.

Regards,
Ryan

0 Tiago Lone over 2 years ago in reply to Ryan Brown1

Prodigy 80 points

I checked my code to enter shutdown mode and I am using Power_shutdown(0, 0). Its documentation states that this routine should respect any constraint set by the stack or any other driver that uses power management. So I don't think I am corrupting NVS going to shutdown, right? My code is the following:

Battery voltage barely falls below 3 V during operation, and never below 2.9 V in my tests. I will implement low voltage detection, but I don't think this is a problem now.

Well, I will try to identify if my problem is caused by a factory new reset or NVS corruption. For factory new reset, I will try to find where it could be triggered and put some code to notify me if that code path is executed. And for NVS corruption, I think that I need to examine NVS content after the event happened. Do you have any other suggestions about how I can identify NVS corruption?

0 Ryan Brown1 over 2 years ago in reply to Tiago Lone

TI__Guru**** 218057 points

Thinking back, entering shutdown from the application task should be safe. You should likely enter an empty while loop afterwards, since no other operations should take place once shutdown is being entered; however, I would not expect the application to get past Power_shutdown. You may also consider using OsalPort_enterCS to enter a critical section (essentially disabling hardware interrupts).

NV corruption is difficult to recognize since the flash memory region is expected to change during operation to account for network changes, and thus corrupted entries can go unnoticed unless the entire region is parsed for further context. TI does offer a network cloning guide which could give you some further hints.

Thank you for the oscilloscope reading. If possible, it would be great to capture the same for a device at the exact moment of failure, although I recognize the complexity of this task as it is very difficult to replicate the behavior. It would be valuable to find a way to easily recreate the problem to further identify which variable directly causes it.

You can break or add print logs to bdb_resetLocalAction from bdb.c for UI factory new resets, and monitor ZDApp_ResetTimerStart usage from zd_app.c otherwise. You should be able to establish whether factory new is called by the application at any time, but as you've stated that you "have to force a factory new condition to recommission the devices" it seems more likely that NV corruption is involved (if not something else entirely).

Regards,
Ryan

0 Tiago Lone over 2 years ago in reply to Ryan Brown1

Prodigy 80 points

Hi Ryan,

I discovered an easy and reproducible sequence in which my end-device loses network information. I am not 100% sure that the problems that occurred in my past tests were all caused by exactly this situation, but I think this path of investigation is promising.

First of all, I found this sequence by disabling shutdown and enabling the UI on the end-device. The use of shutdown can block the problem or reduce its probability, but I don't think it eliminates it completely. I don't think the UI has anything to do with this sequence. I use only a coordinator and an end-device. The complete sequence that triggers the problem is the following:

1. Turn on coordinator and end-device.

2. Open the network on the coordinator and commission end-device using the UI

3. End-device is commissioned and working as expected.

4. Turn off the coordinator and press the button on the end-device.

5. The end-device is connected and tries to send a message, but as the coordinator does not respond, it retries many times and goes to the orphaned state.

6. The end-device detects the orphaned state and tries to rejoin several times, with a 1 s interval between retries (I use Zstackapi_bdbRecoverNwkReq for that).

7. Turn on the coordinator; after it reboots, it answers the beacon request.

8. The end-device sends a rejoin message, but unencrypted, triggering an unauth rejoin.

9. The unauth rejoin triggers a 10 s timer on the end-device that, when it expires, restarts the device and clears its network information (ZCD_STARTOPT_DEFAULT_NETWORK_STATE).

10. The end-device starts sending many Data Request messages and continues for 10 s until the timer expires, which clears all network information.

11. I need to recommission the end-device in order to connect it to the network again. Note: I said before that I needed to do a Factory New reset every time, but it is not necessary in this sequence. I think I did it on the devices that had the problem without first trying simply to recommission them.
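The race described in steps 8 to 10 can be modeled with a small standalone sketch. This is not Z-Stack code: the type and function names are hypothetical, and the model only captures what the steps state, namely that an unauth rejoin starts a 10 s window and that network information is cleared if the window expires before the device has processed the Transport Key.

```c
#include <stdbool.h>
#include <stdint.h>

#define UNAUTH_TIMEOUT_MS 10000u  /* 10 s window described in step 9 */

typedef enum {
    REJOIN_PENDING,        /* timer running, still waiting for the key */
    REJOIN_AUTHENTICATED,  /* Transport Key processed in time */
    REJOIN_NWK_INFO_LOST   /* timer expired, network information cleared */
} rejoin_outcome_t;

typedef struct {
    uint32_t unauth_start_ms;  /* when the unauth rejoin started the timer */
    bool     key_processed;    /* Transport Key handled by the device */
    uint32_t key_time_ms;      /* when it was handled, if at all */
} rejoin_model_t;

/* Outcome of the rejoin as seen at time now_ms.
 * Timestamp wraparound is ignored to keep the model short. */
rejoin_outcome_t rejoin_outcome_at(const rejoin_model_t *m, uint32_t now_ms)
{
    uint32_t deadline = m->unauth_start_ms + UNAUTH_TIMEOUT_MS;

    if (m->key_processed &&
        m->key_time_ms <= now_ms &&
        m->key_time_ms < deadline) {
        return REJOIN_AUTHENTICATED;
    }
    return (now_ms >= deadline) ? REJOIN_NWK_INFO_LOST : REJOIN_PENDING;
}
```

In this model, a Transport Key that is acknowledged at the MAC level but never processed by the device is equivalent to key_processed staying false, which always ends in REJOIN_NWK_INFO_LOST once the window closes.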

Some observations:

- As I said, the problem appears under harsh network conditions. Those could trigger the problem without the coordinator being turned off.

- If I manually restart the end-device before the timer expires, it returns to the network and works without problems, which demonstrates that the end-device was not in an unrecoverable state.

- I could trigger this situation with shutdown enabled. It is a little harder because I need to let the device connect to the coordinator, then turn the coordinator off, and then turn it on again, and the entire sequence must occur without the device going to shutdown.

- Even though I could trigger the situation with shutdown enabled, the process does not run to completion because the device goes to shutdown before the 10 s timer expires (as with a manual reset). I couldn't make it run to completion, but that does not mean that there is no situation in which that occurs.

With that said, I have some questions:

- What is triggering an unauth rejoin? The many beacon requests without an answer? If so, what should I be doing? How can I avoid it?

- Even if an unauth rejoin is triggered, why can't the device recover from it?

- I know this simple sequence that triggers the problem, but there may be more complex sequences that trigger it too. Should I lose network information in a situation like that? Am I doing something wrong? Is there some way to prevent it in every situation?

Relevant code analysis and sniffer capture:

Step 8 triggers the following code block (red arrows) on ZDApp_ProcessNetworkJoin in zd_app.c

At step 10 the timer expires and triggers the following event in ZDApp_event_loop in zd_app.c

Comment explaining what the flag ZCD_STARTOPT_DEFAULT_NETWORK_STATE does:

Sniffer capture commented with full sequence of events:

Rejoin message unencrypted triggering an unauth rejoin.

Sequence reproduced in a device with shutdown enabled. It goes to shutdown before the timer expires so the problem does not occur.

If necessary I can send the sniffer capture file.

Thanks.

+1 Ryan Brown1 over 2 years ago in reply to Tiago Lone

TI__Guru**** 218057 points

Thank you for providing all of this detailed information. Although I followed your replication steps as instructed, I was not able to replicate the behavior (see the attached sniffer log). I am using the default zc_light and zed_sw from SDK v7.10. The greatest issue I see from your sniffer screenshots is that the ZED appears to ACK but then ignore the Transport Key from the ZC in packet 6204. You will need to perform more debugging to determine why ZDSecMgrTransportKeyInd is not entered from the ZED to continue the rejoin process. The network key had previously been abandoned so that the ZED could choose to find a new network to join. Thus the issue is based on the TC rejoin feature. There are ways to work around this issue, including setting BDB_ATTEMPT_UNSECURE_REJOIN to FALSE or changing the application behavior of the DEV_END_DEVICE_UNAUTH to reset the device without erasing NV information. Here are some relevant E2E threads you can read through:

https://e2e.ti.com/f/1/t/1179401
https://e2e.ti.com/f/1/t/1176456

https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/158/sniffer_5F00_unsecure_5F00_rejoin_5F00_test.pcapng

Regards,
Ryan

0 Tiago Lone over 2 years ago in reply to Ryan Brown1

Prodigy 80 points

I changed BDB_ATTEMPT_UNSECURE_REJOIN to FALSE and BDB_MAX_SECURE_REJOIN_ATTEMPTS to 255. With this new configuration, I couldn't recreate the problem. So I will advance with my development.
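For reference, the configuration change described above could look like the fragment below. This is a sketch that assumes both symbols are compile-time defines the project is allowed to override (for example through predefined symbols or the stack configuration); where they are defined in a given SDK version should be checked before relying on it.

```c
/* Disable the unsecured rejoin attempt, as suggested earlier in the thread. */
#define BDB_ATTEMPT_UNSECURE_REJOIN     FALSE

/* Number of secured rejoin attempts; 255 is the value used in this thread. */
#define BDB_MAX_SECURE_REJOIN_ATTEMPTS  255
```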

In the future I want to investigate why my ZED ignored the Transport Key from the ZC as it can indicate some other problem in my code that can be triggered in some different situation.

Changing the application behavior of the DEV_END_DEVICE_UNAUTH to reset the device without erasing NV information can be a valid defensive strategy too.
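The defensive strategy mentioned above can be sketched as a choice of startup options made when the unauthenticated-rejoin timer expires. This is an illustration only: the function and enum names are hypothetical, the bit value used for ZCD_STARTOPT_DEFAULT_NETWORK_STATE is an assumption that should be taken from the stack headers, and in a real project the change would live in the stack's timeout handling in zd_app.c rather than in a standalone helper.

```c
#include <stdint.h>

/* Assumed value; use the definition from the Z-Stack headers in real code. */
#define ZCD_STARTOPT_DEFAULT_NETWORK_STATE 0x02u

typedef enum {
    UNAUTH_POLICY_ERASE_NWK,    /* behavior described in this thread */
    UNAUTH_POLICY_PRESERVE_NWK  /* reset only, keep NV network information */
} unauth_policy_t;

/* Startup options to request before resetting after the timer expires. */
uint8_t unauth_timeout_startup_options(unauth_policy_t policy)
{
    if (policy == UNAUTH_POLICY_ERASE_NWK) {
        return ZCD_STARTOPT_DEFAULT_NETWORK_STATE;
    }
    return 0u;  /* no flag set: the device restarts with its stored network */
}
```

With the preserving policy, a timeout would behave like the manual restart described earlier in the thread, after which the device returned to the network without recommissioning.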

Thank you for your help Ryan!
