
CC2652R: Zigbee Source Routing failure

Part Number: CC2652R
Other Parts Discussed in Thread: Z-STACK, CC2538

There is a fault (it appears to be either a race condition or the use of uninitialized memory) in TI's source routing implementation. Specifically, under certain conditions the MAC-layer destination used when sending a packet with a source route is incorrect: the frame is addressed to the device that will ultimately receive the packet, not to the next relay.

The packet contains the correct source route information (as it should, since the coordinator has the source route in its DB and it has not expired). It is, however, addressed to the wrong device at the MAC layer.
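To spell out the expected behavior, here is a minimal illustrative sketch (my own made-up structures, not TI's code or the real frame layout): for an outgoing source-routed frame, the MAC destination should be the relay selected by the current relay index, never the final NWK destination.

    #include <stdint.h>

    /* Illustrative pseudo-structure only, not the real Z-Stack frame layout. */
    typedef struct {
        uint16_t nwk_dest;       /* final NWK destination                    */
        uint8_t  relay_count;    /* number of relays in the source route     */
        uint8_t  relay_index;    /* originator sets this to relay_count - 1  */
        uint16_t relay_list[8];  /* relay short addresses                    */
    } src_routed_frame_t;

    /* The short address the MAC header should be addressed to: the next hop
     * named by the source route, never f->nwk_dest. */
    static uint16_t expected_mac_dest(const src_routed_frame_t *f)
    {
        return f->relay_list[f->relay_index];
    }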

I've circled the incorrect MAC destination field and put a line next to the correct value (the source route relay). The packets highlighted in black are those that have been incorrectly addressed by the ZC.

This is a regular occurrence on our ZNP (SimpleLink) project, which uses source routing for 99%+ of all communication. It can do this because the source route table is large and the source route expiration time is very long, so a valid route is almost always available. The issue is most likely to occur when many devices are communicating at the same time.

I've done some debugging and currently suspect the problem is in or around the call to NLDE_BuildSrcRtgFrame in NLDE_DataReqSend in the libZstack blob. However, that is difficult to say with certainty due to the closed nature of the blob.

  • Hi Mathew,

    Thank you for reporting this behavior.  Can you provide the sniffer log you've displayed, and are there any other packets which exhibit this behavior?  Which SDK version are you evaluating and what are your values for SRC_RTG_EXPIRY_TIME/MAX_RTG_SRC_ENTRIES?  Have you tried increasing heap memory or is the behavior temporarily mitigated when the ZNP device is reset?  How many devices do you believe must be connected and communicating to replicate the issue?  And could this be related to the prior E2E thread  https://e2e.ti.com/f/1/t/1289742? 

    Regards,
    Ryan

  • >> Can you provide the sniffer log you've displayed, and are there any other packets which exhibit this behavior?

    I've since reset Wireshark; however, here's a sniffer log of the same issue:

    drive.google.com/.../view


    >> Which SDK version are you evaluating

    simplelink_cc13xx_cc26xx_sdk_6_41_00_17

    >> and what are your values for SRC_RTG_EXPIRY_TIME/MAX_RTG_SRC_ENTRIES?

    -DMAX_RTG_SRC_ENTRIES=40
    -DSRC_RTG_EXPIRY_TIME=180

    >> Have you tried increasing heap memory or is the behavior temporarily mitigated when the ZNP device is reset?

    I have not adjusted the heap memory. If this is necessary, please provide any recommendations.

    I have not seen any error codes from MT commands that indicate memory pressure.

    >> How many devices do you believe must be connected and communicating to replicate the issue?

    We have 17 ZEDs + 4 ZRs, with most of the ZEDs connected to the final ZRs in a chain.

    The common outgoing routes are:
    - ZC -> ZR1 or ZR4 -> ZR2 -> ZED
    - ZC -> ZR1 or ZR4 -> ZR3 -> ZED

    This is to force the ZEDs to remain connected to ZRs (as they are outside the range of the ZC). We are testing over roughly 150 meters.

    I do not know if a large number of ZEDs is necessary to replicate the issue, or if the issue is only seen with larger numbers because of its relative infrequency (~2-3 incidents per hour with 17 devices, each typically communicating at least once every minute or two).

    We send every packet for which we have a known source route with AF:dataReqSrcRtg, so we use source routes almost exclusively. We have found this to be necessary to prevent route requests from overloading downstream ZRs when dealing with larger numbers of routers (4+), mostly due to broadcast storms caused by ZRs with a limited ability to de-duplicate broadcasts.
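    For context, our host-side send decision is roughly the following (a simplified sketch; the wrapper functions and route cache are hypothetical names for our own code around the MT AF commands, not part of the TI API):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical host-side helpers; names are illustrative only. */
    bool route_cache_lookup(uint16_t dst, uint16_t *relays, uint8_t *relay_count);
    int  znp_af_data_req_src_rtg(uint16_t dst, const uint16_t *relays,
                                 uint8_t relay_count,
                                 const uint8_t *payload, uint8_t len);
    int  znp_af_data_req(uint16_t dst, const uint8_t *payload, uint8_t len);

    /* Prefer the source-routed request whenever we hold a cached route for the
     * destination; otherwise fall back to a plain data request (which may let
     * the ZNP perform route discovery). */
    int send_to_device(uint16_t dst, const uint8_t *payload, uint8_t len)
    {
        uint16_t relays[8];
        uint8_t  relay_count = 0;

        if (route_cache_lookup(dst, relays, &relay_count) && relay_count > 0) {
            return znp_af_data_req_src_rtg(dst, relays, relay_count, payload, len);
        }
        return znp_af_data_req(dst, payload, len);
    }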

    >> And could this be related to the prior E2E thread e2e.ti.com/.../1289742

    That issue, which we have since mitigated (through an ugly workaround), is on the receive path and is very clearly the result of incorrect pre-processor defines used when building the maclib, so it is unlikely to be related. TI should increase the size of the mac src pend table if they want to build a ZC capable of supporting more than 5 directly connected ZEDs at any reasonable performance (performance rapidly deteriorates otherwise).

    This issue was first noted before we worked around the auto-pend table issue. However, a full investigation of the srcrtg issue only began this week.

  • Thank you for providing this detailed account.  I have verified that what you are observing in the sniffer log does not align with Z-Stack expectations and have filed a bug ticket for this to be further investigated by TI's Z-Stack Development Team.  Details of any other stack changes you have made which you think might contribute to this behavior would be appreciated.

    Since you are using SDK v6.41, the heap for Z-Stack is allocated dynamically and should not be a concern.  Thanks also for mentioning that you have not seen any heap or memory errors through the MT interface.

    Regards,
    Ryan

  • Most of what we have added is additional MT commands exposing further insight into ZNP state. A few other TI quirks are patched, mostly on the receive path. We have also reviewed and applied many of the patches from https://github.com/Koenkk/Z-Stack-firmware (which also covers the known issues). None of these should be relevant here.

    There isn't much we can change that gets inside the TI blobs at this level. Unfortunately, having looked into this problem and a workaround, we have strong confidence about the exact area of the problem and have had to resort to a pretty bad hack to get the situation resolved.

    For those following along, here's a hacky patch that directly rewrites the faulty frames at the MAC layer.

    MAC_INTERNAL_API void macTxFrameHook(uint8 txType)
    {
      if(pMacDataTx->internal.frameType == 1) {
        // data frame, addressed to a unicast short address (not broadcast)
        if(pMacDataTx->internal.dest.dstAddr.addrMode == ApiMac_addrType_short && pMacDataTx->internal.dest.dstAddr.addr.shortAddr < 0xFFF8) {
          // msdu offsets: dst PAN 3,4; MAC dst 5,6; MAC src 7,8; NWK frame control 9,10
          uint16_t nwk_fctrl = BUILD_UINT16(pMacDataTx->msdu.p[9], pMacDataTx->msdu.p[10]);
          // NWK frame control bit 10 set => a source route subframe is present
          if((nwk_fctrl & 0x0400) && pMacDataTx->msdu.len > 18) {
            uint8_t relay_count = pMacDataTx->msdu.p[11 + 6];
            if(relay_count != 0 && pMacDataTx->msdu.len > 18 + (relay_count * 2)) {
              uint8_t relay_index = pMacDataTx->msdu.p[11 + 7];
              if(relay_index + 1 > relay_count) {
                relay_index = relay_count - 1;
              }

              // relay named by the current relay index is the correct next MAC hop
              uint16_t correct_destination = BUILD_UINT16(pMacDataTx->msdu.p[11 + 8 + (relay_index * 2)],
                                                          pMacDataTx->msdu.p[11 + 8 + (relay_index * 2) + 1]);

              if(correct_destination != 0) {
                //debug_uint32("sending with srcrtg ", correct_destination);

                if(correct_destination != pMacDataTx->internal.dest.dstAddr.addr.shortAddr) {
                  //debug_uint32("found srcfail mac ", pMacDataTx->internal.dest.dstAddr.addr.shortAddr);
                  //debug_uint32("correct_dest ", correct_destination);

                  // rewrite the MAC destination (in both the raw frame and the descriptor)
                  pMacDataTx->msdu.p[5] = LO_UINT16(correct_destination);
                  pMacDataTx->msdu.p[6] = HI_UINT16(correct_destination);

                  pMacDataTx->internal.dest.dstAddr.addr.shortAddr = correct_destination;
                }
              }
            }
            //debug_uint32("frame counter ", pMacDataTx->msdu.p[2]);
          }
        }
      }

      macTxFrame(txType);
    }

    It's an ugly hack at best, and if you don't know how to integrate that function into Z-Stack you probably shouldn't use it.

    Since doing that, our devices in the field with larger numbers of ZRs have finally been stable.

  • Thanks for providing this information!  I've asked the Z-Stack Development Team to review this, and they might follow up about testing your rudimentary patch for Zigbee2MQTT.

    Regards,
    Ryan

  • Thanks for the patch!

    Have you been experiencing this problem since a certain SDK version? For Z2M we have been seeing instability issues starting from 6.20, which seem to be related to having many TuYa devices on the network / routing requests.

  • Hey Koen,

    Unless otherwise stated, Mathew is observing the issue with SDK v6.41.

    Mathew,

    To Koen's point, are you aware whether this specific issue is prevalent with SDK v6.10 or earlier?

    Regards,
    Ryan

  • We have not used a version prior to v6.41; before v6.41 we used the 3.x stack on CC2538 hardware. The effects of this issue have been observed on all stack versions we have tested; however, last week was the first time I had the time to sit down and properly investigate.


    We have never seen this issue lead to a crash. However, it is not impossible that it could; there do appear to be some failures on TI's part to keep structures consistent in the event of allocation errors. It's just not affecting us, as on the CC2652R we believe we have more than ample memory for our needs.

    The only TuYa device (a PIR temperature/humidity sensor) I have access to is a sleepy ZED. I've got it joined to the test network currently and it has not caused any crashes for me; however, I have only one of them. As this device is sleepy (it silently disconnects at the end of every event), you do want to be careful about how you make outgoing communication to it (don't exhaust your Z-Stack network buffers with queued packets, etc.).

    The other devices on the test network are a mix of ZRs (IKEA and Philips Hue ZRs) and ZEDs (a mix of Develco sensors). Those are full-fat, persistently connected devices.

    If TI is actively working in the source routing area, then (while I don't think it's the issue I'm reporting) they should also fix the following: when a source route is received while ZNP is out of memory, the older route is freed (and the relayList set to NULL) but the relayCount field is left non-zero. This logic should be improved in both RTG_ProcessRrec and RTG_AddSrcRtgEntry_Guaranteed; if the module runs out of memory, it may result in a NULL relayList being dereferenced because relayCount is non-zero. That seems to be guarded against in most cases, but perhaps not all.

    I'd also suggest that (see the sketch below for the shape of what I mean):

    1. The allocation for the new relayList is made before freeing the old one. That way, if the allocation fails, the old route can be left in place.

    2. When the replacement relayList is the same size as the existing one, just memcpy over it; a new allocation is not necessary.
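    As a rough illustration of both points, here is a sketch with hypothetical types and standard malloc/free (I don't have TI's source; the real code presumably uses its own table entry type and OSAL-style allocators):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical stand-in for the source-route table entry. */
    typedef struct {
        uint16_t *relayList;   /* NULL when no route is stored         */
        uint8_t   relayCount;  /* must be 0 whenever relayList is NULL */
    } rtg_src_entry_t;

    static bool update_src_route(rtg_src_entry_t *entry,
                                 const uint16_t *relays, uint8_t count)
    {
        /* Point 2: same size, reuse the existing buffer, no allocation needed. */
        if (entry->relayList != NULL && entry->relayCount == count) {
            memcpy(entry->relayList, relays, count * sizeof(uint16_t));
            return true;
        }

        /* Point 1: allocate the new list BEFORE freeing the old one, so an
         * allocation failure leaves the previous route fully intact. */
        uint16_t *newList = malloc(count * sizeof(uint16_t));
        if (newList == NULL) {
            return false;  /* old relayList and relayCount stay consistent */
        }
        memcpy(newList, relays, count * sizeof(uint16_t));

        free(entry->relayList);
        entry->relayList  = newList;
        entry->relayCount = count;
        return true;
    }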


    My analysis is based only on reading the assembly for RTG_ProcessRrec & RTG_AddSrcRtgEntry_Guaranteed. I do not have access to TI's source code for these blobs.

    I do not believe either of these potential memory safety issues to be the problem I have reported; however, they are probably real issues and, being simple fixes, could be addressed as part of the same scope of work.

  • Thanks for the feedback; this information has been passed on to the Software Development Team for their review.

    Regards,
    Ryan

  • - Where did you apply this change?

    >> A few other TI quirks are patched, mostly on the receive path.

    - Would you want to share these?

    I'm interested in applying these changes to my firmware.