Some routing issues in a large ZigBee network using TI Z-Stack

Other Parts Discussed in Thread: CC2538, CC2592, Z-STACK

We have some questions arising from our 250-node ZigBee test. It is a long-term test that ran for about one month with one coordinator and 250 routers.

Our ZigBee module is CC2538+CC2592.

 

Test environment:

We run the test in our office using 5 test racks. Each rack holds 50 modules, and the distance between racks is less than 30 meters.

Figure1. The deployment of coordinator and test racks

Figure2. The deployment of ZigBee nodes on the test rack

 

Test scenario:

  1. One coordinator and 250 routers.
  2. The coordinator sends a data request to each router in turn and repeats this procedure continuously throughout the test period.
  3. Each router sends a data request to the coordinator every 15 minutes.
  4. The coordinator sends a heartbeat broadcast message to the routers every 3 minutes. If a router has not received a heartbeat from the coordinator for 15 minutes, the router reboots.

 

Parameters setting:

We use TI Z-Stack v2.6.2.

Coordinator:

INT_HEAP_LEN=4096
NWK_MAX_DEVICE_LIST=20
CONCENTRATOR_ENABLE=1
CONCENTRATOR_DISCOVERY_TIME=120
MAX_RTG_SRC_ENTRIES=255
CONCENTRATOR_ROUTE_CACHE=1
SRC_RTG_EXPIRY_TIME=2
MTO_RREQ_LIMIT_TIME=5000
ZMAC_MAX_DATA_IND=5
ROUTE_EXPIRY_TIME=30
MAX_RTG_ENTRIES=64
NWK_LINK_STATUS_PERIOD=60

Router:

MAX_NEIGHBOR_ENTRIES=48
NWK_MAX_DEVICE_LIST=20
ZMAC_MAX_DATA_IND=5
ROUTE_EXPIRY_TIME=30
MAX_RTG_ENTRIES=40
NWK_LINK_STATUS_PERIOD=60
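
As a quick sanity check on the coordinator table sizes, the sketch below compares them with the number of routers in the test. It assumes the options are passed as -D compile flags; the fallback values mirror the coordinator settings quoted in this thread (MAX_RTG_SRC_ENTRIES=255, MAX_RTG_ENTRIES=64). TEST_ROUTER_COUNT and both helper functions are hypothetical names introduced for the example.

```c
#include <assert.h>

/* Fallbacks mirror the coordinator settings from this thread. */
#ifndef MAX_RTG_SRC_ENTRIES
#define MAX_RTG_SRC_ENTRIES 255
#endif
#ifndef MAX_RTG_ENTRIES
#define MAX_RTG_ENTRIES 64
#endif

/* Hypothetical name: number of routers in this test network. */
#define TEST_ROUTER_COUNT 250

/* Fail the build if the source route table cannot hold one entry per router. */
_Static_assert(MAX_RTG_SRC_ENTRIES >= TEST_ROUTER_COUNT,
               "source route table is smaller than the number of routers");

/* Spare source route entries once every router has one. */
static int src_route_headroom(void)
{
    return MAX_RTG_SRC_ENTRIES - TEST_ROUTER_COUNT;
}

/* Routers that cannot hold an entry in rtgTable[] at the same time. */
static int rtg_table_shortfall(void)
{
    return TEST_ROUTER_COUNT > MAX_RTG_ENTRIES
         ? TEST_ROUTER_COUNT - MAX_RTG_ENTRIES
         : 0;
}
```

The arithmetic shows that only the source route table is large enough for all 250 routers. If the coordinator had to rely on rtgTable[] alone, at most 64 routers could have an entry at any one time.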

Questions:

We use the many-to-one (MTO) routing scheme and have sufficient memory (MAX_RTG_SRC_ENTRIES=255) to store the path back to all the nodes, yet in the sniffer we often see AODV route discovery from the coordinator and route record commands from the routers.

  1. How does the coordinator select the path when it sends a data request or an APS ACK back to a router?
  2. In Z-Stack, the coordinator has two tables, rtgTable[] and rtgSrcTable[]. The path from AODV is stored in rtgTable[] and the path from the route record command is stored in rtgSrcTable[]. Is that correct?
  3. In routers, only rtgTable[] is used to store the destination address and the next-hop address, regardless of whether the AODV or the MTO routing scheme is used. Is that correct?
  4. Are rtgTable[] and rtgSrcTable[] both used in the coordinator when the MTO routing scheme is used? Which one has higher priority for path selection? Additionally, how should we size the two tables?
  5. According to the ZigBee specification, the route record required field is set to TRUE if the routing table entry is new, if the no route cache flag is set to TRUE, or if the next hop field has changed. We did not move the nodes, so we expect the network to be stable and the paths between the coordinator and the routers not to change frequently. Why do we still see route record commands so often?
  6. On the 26th day there was no MTO route discovery from the coordinator (CONCENTRATOR_DISCOVERY_TIME=120), no route record command from any router, and many network status errors (many-to-one failure, 0x0C). As a result, all data requests failed to reach their destination. Why?
  • Hi,
    1) Whenever data is sent out, the node checks in the network layer whether it has a routing entry for the final destination. First the routing table is checked, then the source routing table. If the source routing table has an entry, the frame is formatted using the source routing header fields in the network layer.
    2) Yes, that is correct, though the statement technically applies to the concentrator only. The concentrator is typically the coordinator in actual implementations, but this may not always be the case.
    3) Yes, non-concentrator nodes (typically the routers) only store routing entries in the routing table.
    4) Both could be used. The key to understanding this is the way the source routing table is filled. It contains entries that are populated when a Route Record message is received. A node that wants to send a NWK data packet to the concentrator sends this message automatically, as a unicast. Therefore, if the concentrator starts sending data packets to a node before that node has sent its first data packet to the concentrator, an AODV route discovery from the concentrator to that node starts. I would recommend using only one table, though, and making sure the target nodes send a packet to the concentrator so that the source routing table is filled.
    5) Yes, that is correct: whenever the routing table entry is updated (or created), the route record message is sent out. If you configured the concentrator to have a route cache (and verified this over the air), then the only explanation is that the routing entry of some nodes is updated frequently. This can happen when the radio link quality varies. Please also note that your network is very densely populated, with nodes at very close distance. The 'next hop' link is maintained by checking the link status messages received from surrounding nodes, and the number of nodes reported in this message depends on the MAX_NEIGHBOR_ENTRIES parameter. If a node does not see a neighbor in that list for 3 link status messages, it considers the link 'broken' and therefore updates the routing entry. This may be another possible reason. I don't recommend increasing MAX_NEIGHBOR_ENTRIES: if you increase it too much, the link status has to be split across multiple 802.15.4 messages, which increases the OTA traffic.
    6) The lack of MTO route discovery seems to indicate that the coordinator is not alive anymore. In Z-Stack, sending out the MTO Route-Req is controlled by a timer, and it seems that this timer is no longer running. Do the NWK-STATUS messages reporting a many-to-one failure arrive at the concentrator, and are they acknowledged? What happens if you reset the concentrator?
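
The lookup order from answer 1) can be sketched as follows. This is a simplified illustration under stated assumptions, not the Z-Stack implementation: the entry structs, the enum, and choose_path() are hypothetical, and the real rtgTable[] and rtgSrcTable[] entries carry more fields. The fall-through to route discovery when neither table has an entry follows what is described later in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the two tables' entries. */
typedef struct { uint16_t dst; uint16_t next_hop;    int valid; } rtg_entry_t;
typedef struct { uint16_t dst; uint8_t  relay_count; int valid; } src_entry_t;

typedef enum {
    SEND_VIA_ROUTING_TABLE,  /* forward to the next hop found in the routing table */
    SEND_VIA_SOURCE_ROUTE,   /* format the frame with a source routing header */
    START_ROUTE_DISCOVERY    /* no entry in either table: AODV route discovery */
} send_decision_t;

static send_decision_t choose_path(uint16_t dst,
                                   const rtg_entry_t *rtg, size_t rtg_len,
                                   const src_entry_t *src, size_t src_len)
{
    assert(rtg != NULL || rtg_len == 0);
    assert(src != NULL || src_len == 0);

    /* Step 1: the routing table is checked first. */
    for (size_t i = 0; i < rtg_len; i++) {
        if (rtg[i].valid && rtg[i].dst == dst) {
            return SEND_VIA_ROUTING_TABLE;
        }
    }
    /* Step 2: then the source routing table. */
    for (size_t i = 0; i < src_len; i++) {
        if (src[i].valid && src[i].dst == dst) {
            return SEND_VIA_SOURCE_ROUTE;
        }
    }
    /* Step 3: nothing found, so a route discovery is needed. */
    return START_ROUTE_DISCOVERY;
}
```

In this sketch a destination present in both tables is always sent via the routing table entry, which is one reason to keep only one of the two tables populated.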

    Thanks,
    TheDarkSide
  • Hi TheDarkSide

    Thanks for your reply; the answers are useful to us.

    We have some follow-up questions about your answers:

    1) You explained that the routing table is checked first, then the source routing table. In the coordinator, when there is no routing entry for the destination in either rtgTable[] or rtgSrcTable[], an AODV route discovery from the coordinator is triggered. Is that correct? Do we need to increase both table sizes to the number of our test nodes (MAX_RTG_SRC_ENTRIES=255 and MAX_RTG_ENTRIES=64 in our setting)?

    2) MAX_RTG_SRC_ENTRIES is 255 in our test, which is enough to store the route records from all nodes, but we often see AODV from the coordinator. Why?

    3) For Q6, we can still control our coordinator module in that situation, and the network recovers after a reset. The callback function ZDO_ManytoOneFailureIndicationCB() in ZDApp.c triggers RTG_MTORouteReq() when a many-to-one failure is received, but we can't find the resulting route request in the sniffer.

    Thanks.

  • Hi,
    1) An AODV route discovery is triggered if there is no routing entry in the rtgTable or the rtgSrcTable and, at the same time, the concentrator attempts to send an application data packet to another node. Ideally, all the entries should be in the rtgSrcTable. This means that the MTO route request procedure should complete first, and then, before any packet is sent from the concentrator to, say, node 'X', node 'X' should have issued a route record packet to the concentrator. One way to achieve this is a 'controlled' MTO route discovery/source route discovery. For instance, you can increase the zgConcentratorRadius parameter progressively, so that the paths between the nodes and the concentrator are discovered progressively in terms of number of hops/distance from the concentrator. Also, say that you have discovered the routes from all nodes up to radius 3 from the concentrator. Instead of moving on to discover the nodes at a higher radius, you could start a route record procedure by sending a trigger broadcast message (with radius 3) to all the nodes. Each node then sends a unicast route record packet to the concentrator, which reports the path to it and builds the path from the concentrator to that node.
    This way you would control the route discovery for all the nodes in a large network, avoid routing storms, and be sure that the routes are all in place before you start sending actual data to/from the concentrator. This procedure may take longer than a traditional full MTO Route Req, but you would have control over it.
    2) AODV happens if a node needs to send data to destination 'X' and there is no routing entry (in either the rtgTable or the rtgSrcTable), i.e. when application data is sent before the MTO/source routing procedure is completed. That is why I am suggesting the scheme above, where you can control and sequence the MTO/source routing and avoid broadcast storms.
    3) That is weird. It means either that there is a memory leak and you can't allocate memory to send the message out (but in that case you would not even be called back through ZDO_ManytoOneFailureIndicationCB()), or, more probably, that the route discovery table is full. A full route discovery table may be a symptom of a routing storm, or it may mean that the table has not been sized well for your network.
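
The staged procedure from answer 1) can be illustrated with a small simulation. Everything here is a sketch under assumptions: the node model, the function names, and the rule that a node reports only once are invented for the example, and no Z-Stack API is called. In a real network the application would also have to wait between stages for the broadcasts and route records to settle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t hops;             /* hop distance from the concentrator */
    bool knows_concentrator;  /* an MTO route request has reached this node */
    bool in_src_table;        /* its route record has reached the concentrator */
} sim_node_t;

/* Stand-in for broadcasting an MTO route request with the given radius. */
static void sim_mto_route_request(sim_node_t *nodes, size_t n, uint8_t radius)
{
    for (size_t i = 0; i < n; i++) {
        if (nodes[i].hops <= radius) {
            nodes[i].knows_concentrator = true;
        }
    }
}

/* Stand-in for the trigger broadcast: every reached node that has not yet
 * reported answers with one route record. Returns the number of records. */
static size_t sim_trigger_route_records(sim_node_t *nodes, size_t n, uint8_t radius)
{
    size_t sent = 0;
    for (size_t i = 0; i < n; i++) {
        if (nodes[i].hops <= radius && nodes[i].knows_concentrator &&
            !nodes[i].in_src_table) {
            nodes[i].in_src_table = true;
            sent++;
        }
    }
    return sent;
}

/* Grows the radius one hop at a time. records_per_stage[r - 1] receives the
 * number of route records seen at radius r (may be NULL). Returns the total
 * number of nodes that ended up in the source route table. */
static size_t staged_discovery(sim_node_t *nodes, size_t n, uint8_t max_radius,
                               size_t *records_per_stage)
{
    assert(nodes != NULL || n == 0);
    size_t total = 0;
    for (unsigned r = 1; r <= max_radius; r++) {
        sim_mto_route_request(nodes, n, (uint8_t)r);
        size_t sent = sim_trigger_route_records(nodes, n, (uint8_t)r);
        if (records_per_stage != NULL) {
            records_per_stage[r - 1] = sent;
        }
        total += sent;
    }
    return total;
}
```

The point of the staging is visible in the per-stage counts: each step only produces route records from the nodes newly brought into range, instead of all nodes answering at once.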

    Thanks,
    TheDarkSide
  • Hi

    1) “The MTO/Source routing procedure is completed” means that the concentrator broadcasts an MTO route discovery, the nodes send a unicast route record command to the concentrator, and the concentrator stores the path in rtgSrcTable[]. Is that right?

    2) In our test procedure, all nodes send a data packet to the coordinator every 15 minutes, so each node may trigger a route record command at some point. Ideally, all the entries should be in rtgSrcTable[] (MAX_RTG_SRC_ENTRIES=255 > 250 test nodes) if the concentrator receives the route record commands from all nodes. We would expect that when the concentrator has a data packet for a node, it finds a path to the destination in rtgSrcTable[] and does not trigger route discovery. Why do we often see AODV after a long test run? Are the entries in rtgSrcTable[] deleted automatically?

    3) Do the expiry times of rtgTable[] and rtgSrcTable[] (ROUTE_EXPIRY_TIME=30 and SRC_RTG_EXPIRY_TIME=2) have any effect on the route discovery scheme?

    4) rtgTable[] seems to be a supplement to rtgSrcTable[] when the MTO routing scheme is used. Is that right?

    5) We can see the data packets in the sniffer even when the MTO route discovery is missing, so maybe the route discovery table is full. Do you have a suggestion for its size (MAX_RREQ_ENTRIES=8; swra427b gives no recommendation)?

    Thanks.

  • Hi,
    I can comment on, and maybe answer, three of your questions, and would also be interested to know about the others.

    1) Yes this is correct.

    2) This is most likely caused by the fact that there are many devices close together. I've seen routes change many times (especially in the first hour after network start). I think this is because the neighbor tables change frequently. If a source-routed packet fails to arrive, AODV is automatically used as a fallback.

    3) The swra427 app note is confusing: it uses a device with 'enough' memory to store all routes, but then still sets SRC_RTG_EXPIRY_TIME to 2, going against what is described in chapter 2. Ideally you wouldn't want any route to expire if you can store them all, but it is explained nowhere how to do this (maybe set it to 0). Naturally, if routes expire after 2 seconds, there will be many AODV route discoveries to get the expired ones back.
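
The effect of a short expiry time can be seen in a toy aging table. This is a sketch, not Z-Stack's rtgSrcTable[] logic: the struct and functions are hypothetical, one tick stands for one expiry period, and treating an expiry value of 0 as "never expire" is exactly the guess made above, not something confirmed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t dst;     /* destination short address */
    uint8_t  ttl;     /* remaining lifetime in ticks */
    bool     in_use;
} src_route_t;

/* Store or refresh an entry, e.g. when a route record arrives.
 * Returns false if the table is full. */
static bool src_route_store(src_route_t *tbl, size_t len, uint16_t dst, uint8_t expiry)
{
    assert(tbl != NULL || len == 0);
    src_route_t *free_slot = NULL;
    for (size_t i = 0; i < len; i++) {
        if (tbl[i].in_use && tbl[i].dst == dst) {
            tbl[i].ttl = expiry;          /* refresh the existing entry */
            return true;
        }
        if (!tbl[i].in_use && free_slot == NULL) {
            free_slot = &tbl[i];
        }
    }
    if (free_slot == NULL) {
        return false;
    }
    free_slot->dst = dst;
    free_slot->ttl = expiry;
    free_slot->in_use = true;
    return true;
}

/* Advance time by one tick and drop entries whose lifetime has run out.
 * ASSUMPTION: expiry == 0 disables aging. Returns the number of entries dropped. */
static size_t src_route_tick(src_route_t *tbl, size_t len, uint8_t expiry)
{
    if (expiry == 0) {
        return 0;
    }
    size_t expired = 0;
    for (size_t i = 0; i < len; i++) {
        if (!tbl[i].in_use) {
            continue;
        }
        if (tbl[i].ttl > 0) {
            tbl[i].ttl--;
        }
        if (tbl[i].ttl == 0) {
            tbl[i].in_use = false;
            expired++;
        }
    }
    return expired;
}

static bool src_route_lookup(const src_route_t *tbl, size_t len, uint16_t dst)
{
    for (size_t i = 0; i < len; i++) {
        if (tbl[i].in_use && tbl[i].dst == dst) {
            return true;
        }
    }
    return false;
}
```

With an expiry of 2 ticks and routers that only report every 15 minutes, an entry learned from a route record is gone long before the coordinator's next request to that node unless something else refreshes it, so the lookup fails and the coordinator falls back to route discovery.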


    Regards,
    Sjef.
  • Hi

    1) Have you tested with SRC_RTG_EXPIRY_TIME set to 0?

    2) If SRC_RTG_EXPIRY_TIME is set to 0, are there any side effects?

    Thanks.
  • Hi

    1) Do the expiry times of rtgTable[] and rtgSrcTable[] (ROUTE_EXPIRY_TIME=30 and SRC_RTG_EXPIRY_TIME=2) have any effect on the route discovery scheme? If SRC_RTG_EXPIRY_TIME is set to 0, are there any side effects?

    2) I ran a second test for about 30 days. The coordinator sends an aps_data broadcast message (the heartbeat in my application), and eventually the coordinator could no longer send it (it does not appear in the sniffer). I then tried to let another node join the network. In the sniffer I can see the association request and response, but no transport key message (which normally appears in the join procedure). As you mentioned before, could this be a memory leak, so that the stack can't allocate memory when a message is to be sent?

    Below is the memory information (built with IAR, CC2538NF23):

    -DINT_HEAP_LEN=4096

    111 282 bytes of readonly code memory
    4 254 bytes of readonly data memory
    31 807 bytes of readwrite data memory (+ 6 144 absolute)

    Do you have any suggestions?

    Thanks a lot.