I’m working with a Zigbee Pro system using zstack that occasionally has a problem where a router node “goes mute”. When this condition occurs all traffic from the offending router ceases except for Data Acknowledgements in response to associated end device Data Requests. Sorry for the long description. My specific request is at the end.
This “muteness” occurs about once every 24 hours in the small test system we have running. That system consists of:
- 14 or 30 routers. (We’ve gotten it to happen in both cases.)
- 59 end devices. (55 of which are only making Data Requests every 12 seconds. The other 4 devices are each sending six 1-hop broadcast packets in bursts every 15 minutes.)
- A Digi-based Coordinator / Gateway / Server node.
After it goes mute, the router will not pass traffic again unless it is either reset (obviously) or, more interestingly, receives a Beacon Request or hears a Beacon sent from another router or the coordinator. It’s not clear which of these packets is the cause of the recovery, as the router only comes back to life after both have been seen in traffic in its vicinity. We provoke the Beacons by resetting one of the end devices and allowing it to re-associate with the network. It is probably worth noting that the offending router only responds to 2 of the 3 Beacon Requests from the end device, while other routers can be seen responding to all 3. When the router recovers in this way, about 12 old routed packets appear to be transmitted almost immediately. After this, packet forwarding picks up again as usual.
We’ve also added some code to rxFcsIsr() in mac_rx.c, at the point after the CRC is found to be correct, to filter for several specific things. The only one of these that should be active during any of the traffic seen in these tests is code that detects Data Requests and Association Requests and places their source addresses into a queue of our own implementation. This snippet of code also makes a call to APSME_LookupNwkAddr() when the only address present is the 64 bit extended address. The vast majority of the traffic seen, and all of the traffic seen before one of the routers goes mute, contains only short addresses.
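To make the shape of that queue concrete, here is a minimal sketch of one way an ISR-fed address queue can be built: a single-producer (ISR) / single-consumer (task) ring buffer. This is an illustration only, not our actual code. All type and function names (addr_entry_t, addr_queue_t, addr_queue_push, addr_queue_pop) and the queue size are hypothetical, and it assumes that exactly one context writes head and exactly one writes tail.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical size; must be a power of two so the index mask works. */
#define ADDR_QUEUE_SIZE 16u

typedef struct {
    bool     isExtended;  /* true: 64-bit extended address, false: 16-bit short */
    uint16_t shortAddr;
    uint8_t  extAddr[8];
} addr_entry_t;

typedef struct {
    addr_entry_t      slots[ADDR_QUEUE_SIZE];
    volatile uint8_t  head;     /* written only by the producer (ISR) */
    volatile uint8_t  tail;     /* written only by the consumer (task) */
    volatile uint16_t dropped;  /* entries discarded because the queue was full */
} addr_queue_t;

/* Producer side: intended to be called from the receive ISR. Never blocks;
   when the queue is full the entry is dropped and counted. */
static bool addr_queue_push(addr_queue_t *q, const addr_entry_t *e)
{
    uint8_t next = (uint8_t)((q->head + 1u) & (ADDR_QUEUE_SIZE - 1u));
    if (next == q->tail) {
        q->dropped++;
        return false;
    }
    q->slots[q->head] = *e;
    q->head = next;  /* publish only after the slot has been filled */
    return true;
}

/* Consumer side: intended to be called from task context. */
static bool addr_queue_pop(addr_queue_t *q, addr_entry_t *out)
{
    if (q->tail == q->head) {
        return false;  /* empty */
    }
    *out = q->slots[q->tail];
    q->tail = (uint8_t)((q->tail + 1u) & (ADDR_QUEUE_SIZE - 1u));
    return true;
}
```

One slot is always left unused so that full and empty can be told apart, which gives a usable capacity of ADDR_QUEUE_SIZE - 1 entries. The dropped counter is there purely as a diagnostic. In a layout like this the ISR only copies the raw address, so any address lookup could be left to the consumer in task context.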
Here are some things we’ve found by adding diagnostics to attempt to identify the state of the router that goes mute:
- Main task loop in osal_run_system() keeps going.
- No task listed in tasksArr[] ever has event flags remain set for 4 seconds or more without being allowed to run. (That is, no task appears to be starved for CPU.)
- (In progress) Determine whether our application-level code continues to think it’s transmitting during one of these mute conditions. I’m hoping to find that it’s not successfully sending a packet. I’ll have more information on that once the “mute” problem occurs again.
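For anyone who wants to reproduce the starvation check in the second item, a check of that kind can be sketched roughly as below. This is a simplified, self-contained model rather than the exact diagnostic we run: NUM_TASKS, starve_monitor_t and starve_monitor_check are hypothetical names, the event flags are passed in as a plain array snapshot (I'm assuming 16-bit per-task event words), and nowMs is assumed to come from some free-running millisecond counter.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_TASKS        8u     /* hypothetical; match the length of tasksArr[] */
#define STARVE_LIMIT_MS  4000u  /* the 4 second threshold used above */

typedef struct {
    uint32_t pendingSinceMs[NUM_TASKS];  /* when flags were first seen set */
    bool     pending[NUM_TASKS];         /* flags were set on the last sample */
} starve_monitor_t;

/* Call once per pass of the main loop with a snapshot of each task's event
   flags. Returns the index of the first task whose flags have been seen set
   on every sample for STARVE_LIMIT_MS or longer, or -1 if there is none. */
static int starve_monitor_check(starve_monitor_t *m,
                                const uint16_t *events,
                                uint32_t nowMs)
{
    int starved = -1;
    for (uint8_t i = 0; i < NUM_TASKS; i++) {
        if (events[i] == 0u) {
            m->pending[i] = false;            /* task has nothing waiting */
        } else if (!m->pending[i]) {
            m->pending[i] = true;             /* flags newly set: start timing */
            m->pendingSinceMs[i] = nowMs;
        } else if (starved < 0 &&
                   (uint32_t)(nowMs - m->pendingSinceMs[i]) >= STARVE_LIMIT_MS) {
            starved = (int)i;                 /* set on every sample for too long */
        }
    }
    return starved;
}
```

The unsigned subtraction keeps the comparison valid across counter wraparound. Because this only samples the flags, a task that runs and immediately has events set again between samples will look continuously pending, so a check like this can over-report starvation but should not under-report it.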
I realize it’s possible that we have a wild pointer or something that is causing us to corrupt memory. I’m looking for answers to one or more of these questions:
- Are there any calls I can make to determine that things are locked up this way? Is there a way to know how many packets are queued for transmission, originating either from the application or from NWK layer routing? (This would be a more specific question if I knew whether packets were not being received or simply could not be transmitted.)
- What’s special about the beacon requests when a node joins that would kick the router out of whatever state it’s in? Does this provide any kind of clue as to what is getting hung up internally?
- Is this a problem that anyone has seen before in a system small or large?
I can provide lots of traffic examples in Wireshark captures if that would be helpful. (TI sniffer tool -> UDP 5000 -> perl script -> UDP port -> Wireshark).
Other useful information:
- CC2530 in end devices.
- CC2530 + CC2591 (or is it CC2590) in router
- Zstack version 2.5.1a