delayed APS ACKs when link is bad

I'm running a robustness test on my network, feeding in 1000 messages from a USB interface on the coordinator (CC2531 USB dongle) and then unicasting them to a router, which displays the number received.  AF_ACK_REQUEST is set, and each time an OTA message is sent, a flag is set to block the transmission of any further messages until the AF_DATA_CONFIRM_CMD event is triggered, at which point the flag is reset.  A new message therefore cannot be sent until an APS acknowledgement has been received for the previous one.
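To make the gating described above concrete, here is a minimal sketch of the flag logic in plain C. The type `sender_t` and the functions `sender_try_send` and `sender_on_confirm` are hypothetical names for illustration only and are not part of Z-Stack. The actual AF data request and the AF_DATA_CONFIRM_CMD handler are only indicated in comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical application state; not part of Z-Stack. */
typedef struct {
    bool     awaiting_confirm;  /* set on send, cleared on data confirm */
    uint32_t sent;              /* messages handed to the stack */
    uint32_t confirmed_ok;      /* confirms that reported success */
    uint32_t confirmed_fail;    /* confirms that reported failure */
} sender_t;

/* Try to send the next message. Returns false if one is still in flight.
 * In the real application, this is where the AF data request would be
 * issued with AF_ACK_REQUEST set in the transmit options. */
bool sender_try_send(sender_t *s)
{
    assert(s != NULL);
    if (s->awaiting_confirm) {
        return false;           /* blocked until the confirm arrives */
    }
    s->awaiting_confirm = true;
    s->sent++;
    return true;
}

/* Intended to be called from the AF_DATA_CONFIRM_CMD handler with the
 * outcome reported by the confirm. */
void sender_on_confirm(sender_t *s, bool success)
{
    assert(s != NULL);
    if (!s->awaiting_confirm) {
        return;                 /* stray confirm; nothing is in flight */
    }
    if (success) {
        s->confirmed_ok++;
    } else {
        s->confirmed_fail++;
    }
    s->awaiting_confirm = false;
}
```

With this structure, the message rate is bounded entirely by how quickly confirms come back, which is why a stalled confirm stalls the whole queue.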

Under good RF conditions, the router receives all 1000 messages in less than a minute.

If I separate the two devices to, say, 100 m, the link is much more variable, and I sometimes see the messages stop completely for 7-8 seconds, then resume at a slower rate.  They may then speed up again to their original rate after a few more seconds.

If I switch off the router in the middle of a burst, the same thing often happens, although the faster rate will not resume unless I switch it back on and let it rejoin the network.

I assume that the 7-8 second delay is due to the coordinator and router trying to re-establish a link.  The slower subsequent rate is presumably if the link remains bad and retries are needed.  What I don't understand is why APS acknowledgements aren't being processed at all for those 7-8 seconds (I've confirmed this is the case at the coordinator end). 

I would guess that this is ZigBee's attempt to maximise the chances of a successful acknowledgement, by waiting to see whether the link can be repaired.  But I would prefer to get a failed acknowledgement sooner than to have all my messages back up.  Does anyone know of a way around this?
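One possible application-level workaround, independent of any stack setting, is to stop waiting for the confirm after a deadline of your own. The sketch below is an assumption about how that could look, not a Z-Stack feature: `gate_t`, `gate_send`, `gate_confirm`, `gate_poll` and `APP_CONFIRM_TIMEOUT_MS` are all hypothetical names, and the 2000 ms value is arbitrary. A sequence number is used so that a late confirm for an abandoned message is ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define APP_CONFIRM_TIMEOUT_MS 2000u  /* hypothetical application deadline */

/* Hypothetical send gate with an application-side timeout. */
typedef struct {
    bool     in_flight;   /* a message is waiting for its confirm */
    uint8_t  seq;         /* sequence number of the latest message */
    uint32_t sent_at_ms;  /* timestamp of the latest send */
    uint32_t timeouts;    /* messages abandoned by the application */
} gate_t;

/* Start a new message if none is in flight. */
bool gate_send(gate_t *g, uint32_t now_ms)
{
    if (g == NULL || g->in_flight) {
        return false;
    }
    g->seq++;
    g->sent_at_ms = now_ms;
    g->in_flight = true;
    return true;
}

/* Handle a confirm; confirms for abandoned messages are ignored. */
void gate_confirm(gate_t *g, uint8_t seq)
{
    if (g != NULL && g->in_flight && seq == g->seq) {
        g->in_flight = false;
    }
}

/* Call periodically. Returns true if the current message was abandoned.
 * Unsigned subtraction keeps the comparison valid across timer wrap. */
bool gate_poll(gate_t *g, uint32_t now_ms)
{
    if (g != NULL && g->in_flight &&
        (uint32_t)(now_ms - g->sent_at_ms) >= APP_CONFIRM_TIMEOUT_MS) {
        g->in_flight = false;
        g->timeouts++;
        return true;
    }
    return false;
}
```

Note that this only unblocks the application. The stack may still be retrying the abandoned message and holding its buffer, so sending more messages to an unreachable device could make the backlog worse rather than better.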

  • I'm still having problems with this issue, and even with a small number of devices on my network, I can sometimes get delays of 20-25 seconds where no unicast messages are being successfully sent from the coordinator. It's particularly bad when a device that was on the network has just moved out of range.  Is there a way to have a message return a "delivery failed" acknowledgement in a shorter amount of time?  Which config parameters influence this?

  • ZigBee-PRO, which you are using with the 2.5.0 ZStack, specifies a mandatory "Link Status" message every 30 seconds. So it could take up to 30 seconds after you move the ZR back into radio range of the ZC before the ZC NWK layer realizes that its ZR neighbor/child is back in range and thereby "addressable" via OTA message.

  • Sounds promising, thanks!  If I were to reduce this link status period to, say, 10 seconds, do you think there would be any obvious knock-on effects (apart from slightly increased traffic, of course)?

  • Perhaps you will want to think seriously about the realistic problem that you are trying to solve before changing numbers and constants and playing around with serious network parameters. Perhaps your invented stress test is not testing anything useful at all. In the big picture of a well-designed and installed self-healing mesh network, who cares about a few tens of seconds of latency in re-establishing a neighbor router anyway? Perhaps a more realistic and useful stress test would be to place 3 ZRs within 1 hop of the ZC and a 4th ZR out of radio range of the ZC but within 1 hop of the other 3 ZRs. Then do your 1000-message stress test between the ZC and this 4th ZR while randomly turning 1 or 2 of the 3 ZRs on and off, such that at no time will all 3 have been turned off within the last 30-second window (NV_RESTORE and NV_INIT should be compile options for this to work best). I think you will find that your 1000-message stress test passes every time in such a test of mesh routing. That is what ZigBee is useful for, not some arbitrary, single point-to-point, make-or-break hop.

  • Hi Dirty Harry,

    I suppose I should have been clearer - I am no longer doing the 1000-packet device-to-device test that I mentioned in my original post, but instead setting up a small network where I take a device out of range of all other devices and then bring it back in range.  This is a realistic scenario for our application, and tens of seconds of delay in rediscovering the network would greatly hamper our system operation.  That being said, this is more of an issue for end devices than it is for routers in our system, and we are now able to reduce the end device delay by reducing the value of APSC_ACK_WAIT_DURATION_POLLED in the main config file.

    Jimbo
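As a rough illustration of why lowering APSC_ACK_WAIT_DURATION_POLLED shortens the time to a failure report, the sketch below estimates the worst-case APS-level wait. It assumes a simple model in which the sender waits one full acknowledgement period after the initial transmission and after each retransmission. The function name `aps_worst_case_fail_ms` is hypothetical, and the values in the usage comment are illustrative rather than confirmed Z-Stack defaults.

```c
#include <stdint.h>

/* Assumed model: one ack-wait period after the initial transmission,
 * plus one after each of max_retries retransmissions. Delays added by
 * the NWK and MAC layers (for example route repair) are not included. */
uint32_t aps_worst_case_fail_ms(uint32_t ack_wait_ms, uint32_t max_retries)
{
    return ack_wait_ms * (max_retries + 1u);
}

/* Illustrative use: with 3 retries, reducing the wait from 3000 ms to
 * 1000 ms reduces the estimate from 12000 ms to 4000 ms. */
```

Under this model the wait scales linearly with both the acknowledgement wait and the retry count, so either parameter could be traded against delivery reliability on a marginal link.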