This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

Router Goes "Mute"

Other Parts Discussed in Thread: CC2530, CC2591, CC2590, CC2520, Z-STACK

I’m working with a Zigbee Pro system using zstack that occasionally has a problem where a router node “goes mute”.  When this condition occurs all traffic from the offending router ceases except for Data Acknowledgements in response to associated end device Data Requests.  Sorry for the long description.  My specific request is at the end.

 This “muteness” occurs about once every 24 hours in the small test system we have running.  That system consists of:

  • 14 or 30 routers.  (We’ve gotten it to happen in both cases.)
  • 59 end devices.  (55 of which are only making Data Requests every 12 seconds.  The other 4 devices are each doing 6, 1 hop broadcast packets in bursts every 15 minutes.)
  • A Cooridinator / Gateway / Server node is Digi based.

After it goes mute, the router will not pass traffic again unless it is either reset (obviously) or, more interestingly, receives either a Beacon Request or hears a Beacon sent from another router or the coordinator.  It’s not clear which of these packets are the cause of the recovery as it only comes back to life after both have been seen in traffic in its vicinity.  We provoke the Beacons by resetting one of the end devices and allowing it to re-associate with the network.  It is probably worth noting that the offending router only responds to 2 of the 3 beacon requests from the end device while other routers can be seen responding to all of the 3 of the beacon requests.  When the router recovers in this way it appears that there are about 12 old routed packets that are transmitted almost immediately.  After this, forwarding packets picks up again as usual.

 We’ve also added some code to rxFcsIsr() in mac_rx.c, at the point after the CRC is found to be correct, to filter for several specific things.  The only one of these that should be active during any of the traffic seen in these tests is code that detects Data Requests and Association Requests and places their source addresses into a queue of our own implementation.  This snippet of code also makes a call to APSME_LookupNwkAddr() when the only address present is the 64 bit extended address.  The vast majority of the traffic seen, and only seen before one of the routers goes mute, contains only short addresses.

 Here are some things we’ve found by adding diagnostics to attempt to identify the state of the router that goes mute:

  • Main task loop in osal_run_system() keeps going.
  • No task listed in tasksArr[] ever has event flags remain set for 4 seconds or more without being allowed to run.  (That is, no task appears to be starved for CPU.)
  • (In process) Determine if application level code of ours continues to think it’s transmitting during one of these mute conditions.  I’m hoping to find that it’s not successfully sending a packet.  I’ll have more information on that once the “mute” problem occurs again.

 I realize it’s possible that we have a wild pointer or something that is causing us to corrupt memory.  I’m looking for answers to one or more of these questions:

  • Are there any calls I can make to determine that things are locked up this way?  Is there a way to know how many packets are queued for transmission originating either from the application or from the NWK layer routing?  (This would be a more specific question if I knew if packets were not being seen or it was just that they could not transmit.)
  • What’s special about the beacon requests when a node joins that would kick the router out of whatever state it’s in?  Does this provide any kind of clue as to what getting hung up internally?
  • Is this a problem that anyone has seen before in a system small or large?

I can provide lots of traffic examples in Wireshark captures if that would be helpful.  (TI sniffer tool -> UDP 5000 -> perl script -> UDP port -> Wireshark).

Other useful information:

  • CC2530 in end devices.
  • CC2530 + CC2591 (or is it CC2590) in router
  • Zstack version 2.5.1a

  • Hi Andy,

    A few questions in order to understand it better:

    Is it a specific router that always fail, or is it another router each time?

    When the router "goes mute", does it still send link status messages?

    I understand it will not route messages when mute. It is not clear, though, whether packets sent from the application in the effected router are actually sent of blocked as well.

    Please provide sniffer traces of the issue in the original TI packet sniffer format (psd). The logs should include as much information as possible, before and after the issue happens. Please also refer us to the actual line numbers where you observed the issue.

    Best regards,

    OD.

  • OD,

    Thanks for the quick followup.

    Which Router: The router that fails seems to be random.  For the last few instances of this that we've had, the routers were all powered down for long enough that all of the end devices knew that they weren't connected anymore and tried to rejoin.  I don't know the relationship of the router that fails to the rest of the mesh.

    Link status messages: None.  In fact, no traffic at all with the short source address it had before going away until the Beacon thing kicks it again.  I'm presuming the Data Ack's are from the locator because of packets immediately before that share the same 802.15.4 sequence number.

    Other traffic/packets from the application: We don't see that either.  No application generated traffic.  Assuming the diagnostic code I added did what I intended, our application task was running.  There is a timer that would cause it to sent an update to the coordinator every 25 seconds.  Noe of those come out.

    Unfortunately I can't provide TI format sniffer format data.  With the TI sniffer I wouldn't have been able to reliably capture packets continuously over the last few days.  Wireshark is easy to download; there's nothing special about the version I'm using to review my capture data.  In total, I have about 1.1GB of capture data.  Each file is 20MB.  I can work on pairing it down but it might be easier if I could get you some of the 20MB files and point you to places in those.  Is there a place I could do that to?

    Thank you for your quick attention!

    Andy

  • The capture files .zip'd down to about 4.5MB each.  Here's a try at all of the relevant details:

    Wireshark available here: http://www.wireshark.org/download.html

    File #00051 shows the router going mute.

    File #00056 shows where it was provoked by beacon activity to come back.

    Offending node extended address: 0x00124B00040BD63E.  In both files it’s short address is 0x25C6.

    End device in File #56 that was rebooted extended address: 0x00124B00040BD42B.  Short address, both before and after reset, 0x9812

     

    WS Searching examples:

    802.15.4 addresses: (long) wpan.src64 == 0x00124B00040BD63E

    (short) wpan.src16 == 0x25C6

    Logical operators work: 

    Search for NWK layer traffic originated by this node: wpan.src16 == 0x25C6 && (wpan.src16 == zbee_nwk.src)

    Search for 802.15.4 traffic in or out of the node (less data acks): wpan.src16 == 0x25C6 || wpan.dst16 == 0x25C6

    Particular types of the Application Endpoint 232 message can be distinguished by zbee_aps.dst==232 && data[18]==0x15. Values at data[18] that may be interesting from the router are 0x15, 0x25, 0x26.  From end devices a value at data[0] of 0x42 should correlate with their broadcast messages that cause all of the routers that heart it to produce their 0x15 message types.

     

    The files:

    Look for the file number in the name right before the timestamp field; for instance PairingAfterL09Onward-_00051_20140721202352.pcapng.

    File #51:3716.PairingAfterL09Onward-_00051_20140721202352.zip2287.PairingAfterL09Onward-_00052_20140721231807.zip8407.PairingAfterL09Onward-_00056_20140722110753.zip  Router goes mute sometime after its last transmit at Pkt#165718; (wpan.src16==0x25c6).  Others continue to direct traffic at it until after Pkt#166023; (wpan.src16==0x25c6 || wpan.dst16 == 0x25c6).  After that, only short address 0x9196 keeps making data requests.

    File #52: Aftermath; (wpan.src16==0x25c6 || wpan.dst16 == 0x25c6).  Nobody except 0x9169 is interested in talking to it any more.

    The next day, I show up and find the router in the weeds and reset an end device to see if that will wake it up.

    File #56: Reset of end device causes router to start transmitting again around Pkt#79849; (wpan.src16==0x25c6 || wpan.dst16 == 0x25c6 || wpan.src16 == 0x9812 || wpan.src64 == 0x00124B00040BD42B).  Running that search, selecting packet 79849 and then “clear”ing the search will leave the selection in the same place and allow you to see the first beacon request at 79843 and the other router responses after it.

     

  • This problem continues to occur.  From diagnostics that have been incorporated into the code we can tell:

    • Main task loop is running and no tasks are being starved of CPU.
    • Stacks are not being overrun.  XDATA and IDATA stack diagnostics are in place as "fences" to detect if the stack pointers run past the defined area.
    • Periodic application level messages sent through AF_DataRequest() return afStatus_SUCCESS, perhaps with multiple tries, within 5 seconds of starting to try to send a message.

    With updates to our diagnostics we can now tell what the maximum stack usage and heap high water mark is.  In addition, we have a flag that is set inside of osal_mem_alloc() if the alloc ever returns NULL since reset.  These three pieces of information are available in messages being sent from the routers.  From them we see in the two new cases that heap usage before the "mute" event was 1879 to 2437 bytes out of an allocation of ~3600 bytes. Afterward, we see that some osal_mem_alloc() call did receive a null and the heap high watermark is on the order of 3300 to 3400 bytes.

    More code review was done on the application code, looking for possible buffer overrun errors.  None were found, even on storage that was allocated from osal_mem_alloc().

    The most troubling part of this problem is that it seems to be undetectable from the application code.  If we could see that we were not getting packets out we could restart the node and go back to normal operation.

    What can be done from TI's end to identify methods to detect this or the mechanism that is in play here that causes it in a way that Beacon Requests / Beacon Reception wakes it up?

  • Over the last two days there is some more, hopefully helpful information, to report.

    We've added an additional diagnostic to walk the heap.  This is run every 5 seconds out of the main task loop.  It has not tripped yet, even though over the last 24 hours we've had another group of 4 routers go "mute", taking their end devices with them because of the data request ACKing that they do.  If necessary, I can proved the captured traffic around both the disappearance and the recovery of these nodes.

    Also, with the error logging capability we have seen one instance where a router stopped talking but in a different way: No packets of any kind came out.   It took a power cycle on the device to bring it out of this state; for some reason the logging / restarting function got hung up.  When we looked at the logged error we determined that it came from a call to halAssertHandler() which in this system is only called by the TI libraries.  We've replaced this function with one that calls our own assert handler.  Our assumption is that there are no parameters passed to this function by the TI libraries.  Hence we have no further information about its source.

  • Hi Andy,

    We are still looking into finding the reason for the issue. As a work-around, The application shall be able to identify this "mute" state, by monitoring the neighbor table:

    You specified that link status are not being sent out when in mute. This means that the neighboring devices will mark the link cost from the mute device as 0 after about 45 seconds (the default 3 link status periods). Assuming the nwk layer operates correctly, the mute device, upon reception of the updated link status from its neighbors, will reflect the outgoing link cost in its neighbor table, to indicate a dead link to these neighbors. Another option is that the mute device will fail to process the incoming link statuses due to this erroneous state, and in this case even the incoming cost will be marked as 0. In any way, assuming that the age-out of the neighbor table entries is operational despite the error, the content of the neighbor table shall reflect the fact that the device went mute, so that the application can decide to reset the device.

    Best regards,

    OD

  • Hi OD,

    We now have a diagnostic implemented that scans the neighborTable[], counting all of the entries with non-zero .linkInfo.txCost members.  When this count has been above zero in the past and drops back to zero, we log the error and restart.

    From its name it seemed that nwkNeighborCount() should have done the same job as our table scan & count.  It didn't seem to reliably go back to zero, even after all of the txCost values had change to zero.

    Can you confirm that this method that we're using is what you were recommending and that the behavior of nwkNeighborCount() is what you'd expect?

    Andy

  • More update:

    After reviewing the various logs it appears that the diagnostic tripped in five routers overnight.  Of those, only one appears to be a suspect false-positive; it happened when router nodes and end devices were being powered up.

    It's taking about 80 seconds for the diagnostic to conclude that all of its neighbors have gone away.  With the design constraints on the system, this period inactivity will be visible to the end user.

    Any updates on finding the root cause?

    Andy

  • Hi Andy,

    The diagnostic implementation you describe is indeed according to my suggestion.

    You specified it takes it 80 seconds to identify the failure, while I'd expect a maximum of 45 seconds (unless you changed NWK_LINK_STATUS_PERIOD or NWK_ROUTE_AGE_LIMIT from their defaults of 15 and 3, respectively, on the peer nodes).

    nwkNeighborCount() includes all valid neighbor table entries, including past neighbors even if they are unreachable (they are kept in the table as long as there is enough space for new entries). This explains your observation.

    I hope to be able to provide more insight regarding the root cause by the end of the week.

    Best regards,

    OD

  • OD,

    Yes, NWK_LINK_STATUS_PERIOD is still 15 and NWK_ROUTE_AGE_LIMIT  is still 3.

    Thank you for the confirmation that our diagnostic is valid.

    Andy

  • Hi Andy,

    In order to narrow down the possible reasons for this issue, could you please verify whether the network task is being executed and the right event is being called when the mute state happens? To check this, you can hook into osal_run_system() in osal.c. The network task ID is available at the variable NWK_TaskID, and the event that is triggering message propagation from network layer to mac layer is 0x0020. This event is normally called in short intervals as long as there are packets in the network tx queue.

    Regards,

    OD

  • OD,

    To have an idea of how to implement this, please provide some guidance.  Right now, our focus is on restoring the router to a functioning state as quickly as possible after this occurs.

    For the test you suggest, it seems that we'd need to wait in the "mute" state a while to observe the event flags for the NWK task.  How do we know if we've waited long enough? How many times should we expect to see this flag set if it's being set "enough"?  Once implemented, we will have some work to do to attempt to intentionally recreate the conditions that seem to expose this problem.

    For us right now, implementing this test for TI is a low priority.  We are focused on removing other reliability issues in the system.

    Andy

  • Hi Andy,
    (and OD)
    Did you find a solution or workaround for this problem?
    I've found your thread and I think I've got the same problem that you do with the "go-mute" routers over here.
    I'm turning nuts trying to find the cause.
    We use:
    MSP430 + CC2520 + CC2591 routers & coordinator
    Z-Stack 2.5.1a

    We have set up networks of up to ~500 nodes with the same problem, but the behavior can be seen in smaller networks as with 100 nodes. It happens to be quite random but we haven't observed it in smaller networks, though we haven't thoroughly tested it in such scenarios.

    After reading your post I can confirm that, at least for me and now, a beacon request from a powering up devices makes the muted router 'come to life'.
    I can also see in Wireshark that the incoming and outgoing cost for the offending router in other router's link status messages are both 0. And when the muted router starts transmitting again, and link status from it are first seen, those link status have all the neighbors with incoming and outgoing costs equal 0. Afterwards they start getting 1's.

    I understand the suggested workaround of reseting the device when the costs of all neighbors are 0, but I cannot do it as we use a 'software switch' that, whenever we reset the device it means we power it off :-(
    Is there any other way to restart the communications? (I'm afraid restarting z-stack is only possible through a reset or power cycle)

    By the way, I don't know if this is your case, but in mine, the 'muted' router still receives messages at the application level. I can send it a message and it gets displayed in the screen.

    Thanks a lot for any light you can bring, best regards,
    Asier.
  • Hi Again,
    I forgot to add a couple of things:
    - I've seen this situation to happen when traffic increases (i.e. sending a broadcast from the concentrator and having all of the devices to answer in less than 30 secs, so at an average of 3 per second) repeteadly

    - Has anyone seen an improvement migrating to Z-Stack 2.6.0.3? I know it's been out for a while, but, despite the hardware abstraction layers, the migration is not trivial for us, as we've done some tweakings here and there.

    Thanks a lot. Regards,
    Asier
  • Hi 
    As you may have seen I've got what looks like the same problems as Andy. Finally I've found some time for more testing and I've implemented debug code for:
    - reporting the number of neighbors with non zero txCost and the maximum of these reached over time, and
    - reporting the number of times event 0x0020 has been called for the network task
    In my first test, this morning, with a small deployment of 40 routers plus 1 coordinator, I've seen 5 routers going mute after the first batch of high traffic load (I don't know if they were mute before but they had had traffic working properly for sure).
    This muteness for me means they receive traffic (both unicast and broadcast up to the app layer) but they don't output any traffic. The situation is only solved when a new device is turned on (sending the beacon), or power cycling the muted device.
    As for the reporting number for the muted routers all I can say is:
    - all of them had a max of neighbors with non zero txCost >0
    - all of them had 0 neighbors with txCost>0
    - all of them kept updating (incrementing) the counter at the event 0x0020 of the network task

    Once I've brought them back to the network by turning on a new device (all of them have come back all together with just this new device) they have been working properly for ours with new bursts of high traffic and no other device has gone mute.

    Does this bring any light into the problem? May I add any further information in order to debug it?

    Thanks a lot for your help. Regards,
    Asier.

  • Hi again OD,
    I hope you or someone else is following this thread, otherwise I'm talking to myself :-(
    I've managed to conduct some more tests:
    This time I've deployed 66 routers + 1 coordinator.
    Everything was working fine. All of them receiving broadcast and unicast traffic and sending unicast traffic properly.
    Suddenly one router has gone "mute":
    - It received broadcast traffic at the Application layer
    - Its unicast application traffic was not getting to the network
    - It didn't receive unicast traffic
    - The number tracking of event 0x0020 at NWK_TaskID has been incrementing as usual
    - The maximum number of neighbours with non-zero tx cost has reached 48
    - The current number of neighbours with non-zero tx cost was equal to 0

    This situation hasn't changed until I've turned on an additional routers. After waiting some seconds (~10-30) the number of neighbours with non-zero tx cost was growing up to 13. Then communications have been reestablished. The final number of neighbours with non-zero tx cost is now 18.

    Does this provide any help clue? Can I debug anything else?
    Is there any workaround not involving reset?
    Would upgrading the Z-stack solve this? (it won't be trivial and we are in production runs so I've to take these steps very carefully)
    A short note letting me know someone reads this and is looking into it would be highly appreciated.

    Regards,
    Asier.
  • Hi again,
    any chance to get this issue looked at by TI at all or any other expert?

    TIA,
    Asier
  • Hi Asier,

    Sorry for not replying.  I feel your pain.  The project that my original question was about has been canceled.  I have not been following this in detail over the last year or so, since my original question and work-around.

    Andy

  • Hi Andy,

    Thanks for the empathy, I hope someone else get into it... It's turning me mad....

    If at least someone could assure a z-stack version migration would get rid of the problem, I could maybe venture into it...

    Regards,

    Asier

  • Hi again,
    after 5 months I'm still experimenting the same failure. Any expert (TI or not) willing to get hands (really) dirty?
    Thanks!
    Asier
  • TI just release Z-Stack Home 1.2.2a this week and I would suggest you to test on it.
  • Thanks, I guess that will be my only workable path so far, but I was trying to avoid it by know as:
    - the error is not 100% repeatable, or at least you never know when it's gonna happen, and
    - the test involves quite a change as our zstack has been quite customized, trying to respect the HAL but customized nevertheless, so it's not trivial work
    That's why I was searching any kind of confirmation about that being the best way, no just the test-error way :-(

    Thanks anyway, I didn't received the alert on z-stack being updated this week.
    Regards,
    Asier.
  • You are welcome and I wish your test on Z-Stack Home 1.2.2a will show no problem.
  • Thanks. But I'm afraid I'm stuck, I've just realized that Z-Stack Home is just for CCs SOC, while my setup includes MSP430+CC2520, which attending to the z-stack page would leave me at Z-Stack Energy 1.1.0.
    Can anyone confirm this?

    TIA
    Asier.
  • Yes, Z-Stack Home 1.2.2a is only for CC SoC.
  • Hello Asier,

    I had similar issues if you see this thread.
    e2e.ti.com/.../317563

    Did you find out anything newer on the problem? I am now using ZStack 1.2.0 and ZStack 1.2.2a but sometimes i think the problem still occurs.

    Best Regards,
    Nikos
  • When you test this, do you use devices with Z-Stack home 1.2.2a or you have Z-Stack 1.2.0 hybrid? Do you have sniffer log for this issue?
  • Hi Andy and Asier,

    I have just uncover the same problem recently.
    I have had no problems to date with small networks running for year with no issues. 

    Just recently we implemented AES security and a few other little features and this problem has stated occurring even in little networks of 5 routers.
    I have also tried turned security back off, but the issues still persist, on devices we have had security turn enabled on. (However this is not conclusive and will investigate further).

    The symptoms I have found are;
    The router stops transmitting, but still is responsive to unicast messages i.e identity
    We have an application broadcast message every 60 seconds sourced from the coordinator.
    We have enabled SECURE=1 and TC_LINKKEY_JOIN with pre-configured link key.

    Environment:
    CC2530 
    Zstack 2.5.0

    Any luck solving the problem ?  

  • I would suggest you to upgrade to Z-Stack Home run 1.2.2a since there are lots of bug fixing after Z-Stack 2.5.0.
  • Hi ,
    I'm sorry to say I'm still suffering the same issue and have no further info or workaround other than starting a new device when a detect "muted" devices, that 'solves' the issue temporarily.
    I need to find time to test new versions of Z-stack, but we need to migrate the code first which needs time as well.
    Besides Z-stack Home is not for me as I don't use a CC SoC.
    I'll have to give a shot to either z-stack energy or mesh.

    If at least I would be able to send a BEACON REQUEST automatically from any of the devices in the network, I could schedule the 'workaround' to be run every x minutes, which would help me a lot, but I haven't managed that as it only happens when a new devices joins the network for the very first time.
    Does anybody know how to do it? I'm using a ZNP as coordinator, so maybe there's something in the API I can use, any hint?
  • Hi YK Chen,

    I don't find your reply helpful.
    For those of us that have mature products in the market, just changing stacks version is not an easy option without a lot of re-qualification that all product functional works as intended.

    I find this generic comment a cop out. Unless you know that the specific bug has been address , why make this recommendation.
    Ready the release notes from Z-home and Z-mesh I cannot see any indication that a bug like this has been address.
    The only changes I see in the new releases are additional functionality for specific industry profiles.
  • I know it's not easy to change stack version. If you don't want to change stack version, I suggest you to contact TI local FAE to see if they can help you to fix this.
  • Hi Asier,
    I have found this problem started for me when using a channel I have never used before and that changing the zigbee network channel has solved the problem. This leads me to believe the issues in interference related from other 2.45GHz system ie Wifi.
    I then conducted a Wifi channel scan and discovered that the nearby Wifi access point was on a channel that over lapped with the Zigbee channel i was using.

    The reasoning I now come to for the issues is now a physical channel access problem not a software issues.
    As the IEEE802.15.4 standard uses CSMA ie listen before transmit. If a near by WiFi access point is raising the in channel noise floor above the IEEE802.15.4 CCA Threshold, about-80dBm, the device will never transmit. This inference could happen at different node and antinode points in the environment so not all devices will be detecting the interference. Another factor that may be leading to the problem being intermittent, is Load on the WiFi channel.

    This physical problem may also be triggering other untested scenarios in the stack. Due to buffers filling up or overflowing due to the blockage in the transmission pipeline. This mite the reason why beacons seem to clear issues.

    Regards Darren

  • You find a good point and WiFi interference might cause this. I think it's easy to verify this. You can test it again and try to turn off WiFi when this happens. If router goes to work again, you can confirm the problem is WiFi interference.