CC2652R: Router Communication Stops

Damon Stewart

Part Number: CC2652R
Other Parts Discussed in Thread: CC2592

Hello,

We have a pretty simple design with the CC2652R and the CC2592 and are using the SimpleLink CC13x2 26x2 SDK 4.20.01.04 stack. We are running our router application on dozens of products and it generally runs well.

During our testing, we have seen cases where one of our routers powers on and is able to transmit and receive for several seconds, but after that it no longer sends or receives anything. In a sniffer capture, the packets it did send have strong RSSI (e.g., -40 or -50 dBm). But after some number of seconds (2-15?), it doesn't route, repeat broadcasts, or even send link status packets.

We instrumented code on the router to watchdog reset it if it doesn't receive any packets for 35 minutes, and when it is reset, it operates fine. It is rare for a router to get into this state, and it is not persistent.

When the 35-minute timer expires without receiving any packets, I wrote code to read a bunch of values from the stack & application, and write them to a nonvolatile blob. I then send this blob in a special packet after the reset to help us see what state the stack was in when it was not transmitting or receiving anything. Here are some of the data items I'm saving, and how I'm getting them.

1. What the stack thinks the PAN ID and channel are, along with the rx_on_when_idle flag and if it's part of a network:

I call Zstackapi_sysConfigReadReq() every 15 seconds and read the panID, chanList, macRxOnIdle, and devPartOfNetwork values. When a router gets into this bad state and the 35-minute timer expires, I save the most recently read values to nonvol right before watchdog resetting the software.

2. If the stack thinks it is joined:

This method is probably redundant with reading the devPartOfNetwork field via the Zstackapi_sysConfigReadReq() command in #1.
I set a boolean flag to true in the zclGenericApp_ProcessCommissioningStatus() function, in the BDB_COMMISSIONING_NWK_STEERING case handler, if the bdbCommissioningStatus is BDB_COMMISSIONING_SUCCESS. Or, I set the boolean flag to false if the status is not BDB_COMMISSIONING_SUCCESS.

3. Monitor ZStackMessages to see if any "leave" or other unexpected messages are received.

In zclGenericApp_processZStackMsgs(), I shift in the most recent event into a U32 if it doesn't match the most recently received event, and keep a counter of the number of received stack messages. This will track the 4 most recent stack events and provide a counter of how many stack events were received.

4. Transmit Packet Status

Our application sends a packet to the coordinator every 15 seconds. I track the number of packets we think we are sending, and the return status value from AF_DataRequest().

5. Number of Received Packets

I increment a counter every time zclGenericApp_processAfIncomingMsgInd() is called. There are broadcast commands being sent by the coordinator once every couple of minutes that it should receive.

6. Transmit Power Level

I set a value in ZMacSetTransmitPower() anytime the transmit power is set.
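As an illustration of the event tracking described in item 3, here is a minimal sketch of packing the four most recent distinct stack events into a 32-bit word alongside a running count. The type and function names are hypothetical and not part of the SDK; the sketch assumes event IDs fit in one byte, as the 0xc5/0x91/0x92 values reported later in this post do.

```c
#include <stdint.h>

/* Hypothetical diagnostic record; names are illustrative, not from the SDK. */
typedef struct {
    uint32_t recent_events; /* four most recent distinct event IDs, newest in the low byte */
    uint32_t event_count;   /* total number of stack messages seen */
} stack_event_log_t;

/* Record one stack message. The history only shifts when the new event
 * differs from the most recently stored one; the counter always advances. */
static void stack_event_log_record(stack_event_log_t *log, uint8_t event_id)
{
    log->event_count++;

    /* The first event is always stored, even if its ID happens to be 0. */
    if (log->event_count == 1u ||
        (uint8_t)(log->recent_events & 0xFFu) != event_id) {
        log->recent_events = (log->recent_events << 8) | event_id;
    }
}
```

Feeding this the sequence 0xc5, 0x91, 0x91, 0x92 leaves `recent_events` at 0x00C59192 with `event_count` at 4, so a repeated event is counted but does not push older history out of the word.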

There are many other fields, but these are most of the stack-related settings. When the router resets, and sends the debug information, I extracted the following information:

#1 - The values read from Zstackapi_sysConfigReadReq() indicate the panID and channel are correct. The RxOnWhenIdle setting is true, and the devPartOfNetwork setting is also true. So the stack seems to think it's connected and operating on the right channel & PAN ID.

#2 - My boolean "joined" flag is set to true.

#3 - The most recent ZStack messages include: BDB_Notification (0xc5), AF_DATA_CONFIRM_IND (0x91), and INCOMING_MSG_IND (0x92). (There were two 0x91 messages.) The number of received stack messages is 4.

#4 - The number of application messages in the counter is 140, which matches the expected number (4 per minute * 35 minutes = 140). The most recent AF_DataRequest() return status was SUCCESS. So the application is sending messages when it should, and the stack is responding with a SUCCESS status, even though they are never seen in the sniffer capture, or received by the coordinator.

#5 - The number of received packets is only 1.

#6 - The transmit power level is set to what we expect it to be (0xF7), or -9dBm. (Keep in mind this is the power level that feeds into the CC2592.)
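For reference, the 0xF7 to -9 dBm mapping in #6 is simply the byte read as a signed two's-complement value (0xF7 = 247, and 247 - 256 = -9). A minimal sketch of that conversion, using a hypothetical helper name and assuming the saved value is a signed 8-bit dBm figure:

```c
#include <stdint.h>

/* Interpret a raw power byte as a two's-complement dBm value.
 * Hypothetical helper for decoding the saved diagnostic field. */
static int8_t tx_power_dbm_from_raw(uint8_t raw)
{
    /* Values with the top bit set represent negative numbers: raw - 256. */
    return (raw & 0x80u) ? (int8_t)((int)raw - 256) : (int8_t)raw;
}
```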

I also noticed that when the router is reset after the 35-minute timeout, it ASSOCIATES and joins a new network and receives a new short address. That seems to indicate something is wrong with the stack settings or something, to make it join instead of just resuming operation on its current network.

So from what I can tell, when the router stops transmitting or receiving, the stack thinks it is joined to the correct network, the power level is set correctly, and the stack thinks it should be able to transmit packets since it is not throwing an error. A couple of questions:

1. Are these valid conclusions? Am I reading the right APIs to see if the stack thinks it is joined and what channel and pan ID it thinks it's operating on? Or is there a better way to get that information? I've wondered if the radio is operating on a different channel for some reason, but that doesn't seem to be the case.

2. Are there any other APIs we could call or settings we can check to try and figure out why the radio is not transmitting or receiving, or to confirm if the radio or stack is currently enabled?

3. Are there any other hardware pins or firmware settings we should be looking at to troubleshoot this failure?

Lastly, we just noticed this problem seems very similar to the reported problem here, although that was posted years ago with a much older stack. We do have a Wifi router operating close by though, so channel interference could be a culprit. If this is a similar problem to what is in the link, I should be able to detect the problem as suggested in that post by monitoring the number of neighbors with >0 link cost, and then reset the router if it drops to 0.

over 1 year ago

0 Alex Fager over 1 year ago

TI__Genius 14384 points

Hello Damon Stewart,

One thing that does stick out to me is point #5 where only one packet was received. Do we know a specific number of packets the device is supposed to receive (something like 18 since you wait for 35 minutes)?

Damon Stewart said:
ASSOCIATES and joins a new network and receives a new short address.

-So after the timeout the device joins a new network. Just to confirm: the coordinator is not setting up a new network for it (i.e., the old network is still in use)?

We could also try isolating the wifi router to see if that helps as a test.

1. Thank you for providing so much information, I do think you were on the right track here.

2. We could check these two threads for extra information:

https://e2e.ti.com/f/1/t/1360488/

https://e2e.ti.com/f/1/t/1324931/

Thanks,
Alex F

0 Damon Stewart over 1 year ago in reply to Alex Fager

Prodigy 120 points

Hi Alex,

I don't have an exact number for the expected rx packet count, but it is easily a couple dozen. (I could get that number from the sniffer capture, but let's just say 1 is much less than it should be.) It appears TX and RX worked properly for a few seconds and then both stopped working until the reset.

You are correct, the coordinator did not set up a new network. The interesting observation in that point was that the router needed to associate to get back on the network, instead of just coming up and resuming operation.

I agree with your thought about isolating the wifi router - I will try turning it off and see if it becomes more difficult to reproduce this condition. I will also try having another device send beacon requests when the router is in this state and see if that can recover it.

Thank you for the thread links. I didn't spend more than a couple minutes looking at them, but they both seem to be related to routing issues. Since our transmit requests are returning SUCCESS status and since we don't even see link status messages being sent when a router gets into this state, those may not be relevant to the issue we are seeing, but it's worth keeping them in mind.

Damon

0 Alex Fager over 1 year ago in reply to Damon Stewart

TI__Genius 14384 points

Hello Damon,

Thank you for your reply. I also wanted to ask whether the device joins during heavy traffic or when a lot of other devices are joining.

-How many devices are on the network, and do they join at the same time?

Then, instead of timing out and resetting after 35 minutes, could you try resetting the device as soon as it detects that it does not route, repeat broadcasts, or send link status packets? (Maybe we reset as soon as the count has not incremented for 2 minutes.)
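One way the 2-minute check could look is sketched below. It is only an illustration: the type, function names, and the millisecond tick source are hypothetical, and the application would supply its own received-packet counter and clock.

```c
#include <stdbool.h>
#include <stdint.h>

#define RX_STALL_TIMEOUT_MS (2u * 60u * 1000u) /* 2 minutes, as suggested above */

/* Hypothetical monitor state; not an SDK type. */
typedef struct {
    uint32_t last_rx_count;    /* counter value when progress was last seen */
    uint32_t last_progress_ms; /* tick time when the counter last changed */
} rx_stall_monitor_t;

static void rx_stall_monitor_init(rx_stall_monitor_t *m,
                                  uint32_t rx_count, uint32_t now_ms)
{
    m->last_rx_count = rx_count;
    m->last_progress_ms = now_ms;
}

/* Call periodically. Returns true once the counter has not changed
 * for at least RX_STALL_TIMEOUT_MS. */
static bool rx_stall_monitor_poll(rx_stall_monitor_t *m,
                                  uint32_t rx_count, uint32_t now_ms)
{
    if (rx_count != m->last_rx_count) {
        m->last_rx_count = rx_count;
        m->last_progress_ms = now_ms;
        return false;
    }
    /* Unsigned subtraction stays correct across tick-counter wraparound. */
    return (uint32_t)(now_ms - m->last_progress_ms) >= RX_STALL_TIMEOUT_MS;
}
```

The application would call `rx_stall_monitor_poll()` from a periodic timer and trigger its existing diagnostic save and reset path when it returns true.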

Thanks,
Alex F

0 Damon Stewart over 1 year ago in reply to Alex Fager

Prodigy 120 points

To generally describe our setup, I have about 30 routers joined to the coordinator. They send a unicast ping message to the coordinator every 15 seconds, and we have a web application that monitors the "health" of these routers and indicates what percentage of ping messages have been received in the last couple of minutes.

In this particular test, I power cycle all of the routers and watch them for a couple of minutes to make sure they are all able to communicate with the coordinator after the power cycle. Once I verify all routers are communicating with the coordinator, I reset them again and repeat the test. Occasionally one will fail to communicate, and we find that it is in this state where it no longer transmits or receives any packets. Our firmware detects this condition, and after 35 minutes, it writes a bunch of diagnostic data to nonvol, and resets the socket. When the timeout expires on the bad unit, it resets and associates to the network and receives a new short address. There are no other devices joining or resetting when the "bad state" router resets.

As far as the detection idea goes, yes, I would like to detect this condition and reset faster. I just added code to monitor the tx cost of the neighbor table. If the tx cost for all neighbor devices drops to 0, then I should be able to reset the router. Does that seem like a reasonable algorithm?
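A minimal sketch of that check, operating on a copied snapshot of the neighbor table rather than on the stack's own structures. The struct layout and function name are hypothetical; the real neighbor-table entry type in the stack is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of one neighbor-table entry; not an SDK type. */
typedef struct {
    uint16_t short_addr;
    uint8_t  tx_cost; /* 0 is treated as "no usable outgoing link" */
} neighbor_snapshot_t;

/* Count the neighbors that still have a non-zero tx cost. */
static size_t count_usable_neighbors(const neighbor_snapshot_t *table,
                                     size_t count)
{
    size_t usable = 0;
    for (size_t i = 0; i < count; i++) {
        if (table[i].tx_cost > 0u) {
            usable++;
        }
    }
    return usable;
}
```

Returning a count rather than a yes/no answer leaves the reset policy to the caller. An empty table also yields 0, so a reset decision based on this would probably need a grace period after boot, before any link status messages have had a chance to populate the table.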

Also, I just got another router into this tx/rx failure state, and having another router send beacon requests did not recover it.

0 Alex Fager over 1 year ago in reply to Damon Stewart

TI__Genius 14384 points

Hello Damon,

Thank you for the extra background into your system!

There was a similar issue where 30 routers joined at the same time. In that case, some of the bad routers had NIB.nwkState as NWK_INIT and devState as DEV_END_DEVICE, and the problem was caused by NLME_StartRouterRequest failing. A probable cause was that ZMacStartReq does not succeed if the router receives a lot of broadcasts while it is executing. Could you check the ZMacStartReq state?

Damon Stewart said:
As far as the detection idea goes, yes, I would like to detect this condition and reset faster. I just added code to monitor the tx cost of the neighbor table. If the tx cost for all neighbor devices drops to 0, then I should be able to reset the router. Does that seem like a reasonable algorithm?

-Considering the ping message to coordinator every 15 seconds couldn't we use this to detect the faulty router faster? But yes your detection does sound ok, though some tests might be needed.

Thanks,
Alex F

0 Ryan Brown1 over 1 year ago in reply to Damon Stewart

TI__Guru**** 220577 points

Hi Damon,

Damon Stewart said:
The most recent AF_DataRequest() return status was SUCCESS. So the application is sending messages when it should, and the stack is responding with a SUCCESS status, even though they are never seen in the sniffer capture, or received by the coordinator.

This is the expected behavior of AF_DataRequest: it returns immediately once the command has been sent to the radio core. The actual over-the-air result is reported in the zstackmsg_CmdIDs_AF_DATA_CONFIRM_IND callback, and its status (matched to the transID you used to send the message) would further indicate what the issue could be:

https://dev.ti.com/tirex/explore/content/simplelink_cc13xx_cc26xx_sdk_7_40_00_77/docs/zigbee/html/zigbee/z-stack-overview.html?highlight=zstackmsg_cmdids_af_data_confirm_ind#end-to-end-acknowledgements

Regards,
Ryan

0 Damon Stewart over 1 year ago in reply to Ryan Brown1

Prodigy 120 points

Unfortunately, our application does not use end-to-end acknowledgements for various reasons. Hopefully the AF_DATA_CONFIRM_IND() gives the status of the mac-layer transmission too. If so, I can monitor that and see if we can learn anything from it to more quickly detect this condition.

I can also check the ZMacStartReq state.

Thank you Ryan and Alex for your suggestions.

0 Ryan Brown1 over 1 year ago in reply to Damon Stewart

TI__Guru**** 220577 points

Damon Stewart said:
AF_DATA_CONFIRM_IND() gives the status of the mac-layer transmission

Exactly so. "Otherwise, with the APS ACK flag disabled, zstackmsg_CmdIDs_AF_DATA_CONFIRM_IND occurs whenever the stack sends out an APS layer frame successfully or will return a fail message to notify the application of an issue. "
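Since AF_DataRequest() returning SUCCESS only means the request was accepted, one possible way to use the confirm for faster detection is to compare accepted requests against confirms received. The sketch below is an assumption-laden illustration, not SDK code: the type, function names, and both thresholds are hypothetical, and it treats either a backlog of unconfirmed requests or a run of failed confirms as a sign of trouble.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical thresholds; tune for the application's send interval. */
#define MAX_UNCONFIRMED   4u /* accepted requests with no confirm yet */
#define MAX_FAILED_IN_ROW 4u /* consecutive confirms with a failure status */

/* Hypothetical tracker state; not an SDK type. */
typedef struct {
    uint32_t requests_accepted;    /* AF_DataRequest() calls that returned SUCCESS */
    uint32_t confirms_seen;        /* data confirm messages received */
    uint32_t consecutive_failures; /* failed confirms since the last good one */
} tx_confirm_tracker_t;

/* Call when a send request was accepted by the stack. */
static void tx_tracker_on_request_accepted(tx_confirm_tracker_t *t)
{
    t->requests_accepted++;
}

/* Call from the data confirm handler with the decoded status. */
static void tx_tracker_on_confirm(tx_confirm_tracker_t *t, bool success)
{
    t->confirms_seen++;
    t->consecutive_failures = success ? 0u : t->consecutive_failures + 1u;
}

/* True when confirms have stopped arriving or keep reporting failure. */
static bool tx_tracker_looks_stuck(const tx_confirm_tracker_t *t)
{
    uint32_t outstanding = 0u;
    if (t->requests_accepted > t->confirms_seen) {
        outstanding = t->requests_accepted - t->confirms_seen;
    }
    return outstanding >= MAX_UNCONFIRMED ||
           t->consecutive_failures >= MAX_FAILED_IN_ROW;
}
```

With a 15-second ping interval, a threshold of four unconfirmed requests would flag the condition after roughly a minute instead of 35 minutes, though the right values would need testing on a busy network.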
