This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

RTOS/CC2538: ZR Stop sending messages to network

Part Number: CC2538
Other Parts Discussed in Thread: Z-STACK, CC2531,

Tool/software: TI-RTOS

Hi,

I'm using Z-Stack 1.2.1 in the Coordinator, ZR and ZED. 

We already have some solutions created in the market, and apparently working fine. 

at a few months ago  We have had some customer complaints related that some devices stop communicating to coordinator. We go to the site to analyze the problem and we can verify that the device that can not communicate is able to receive and execute the orders sent by the coordinator, but can not transmit any messages to the network. For this reason, the coordinator reports on the platform that the device is unavailable, when in the interface is not. only can't  transmit messages to the network. 

With a TI Dongle CC2531 and Packet Sniffer, we can check that device don't transmit anything to network, Link Status, broadcast messages and uni-cast messages. We can't see any of this type of messages in the sniffer. 

The device only recovers normal operation when we reboot it. 

So that reason, we thought the problem could be one of the Z-Stack parameters that was corrupted, but i can't understand how is it?

I know, that one of the first recommendations is to upgrade the Z-Stack version, but unfortunately, we can't do it because we have the solution on the market and to resolve this problem we must upgrade OTA the firmware of the all devices, and the OTA upgrades with a Z-Stack version put in risc the normal functionalities of the devices. 

For this reason I am asking for your support in trying to resolve this issue based on the feedback you have and whether this issue has been detected and resolved in the most recent versions. If so, could you tell me what the changes were in the open code or do you hear any changes at the compiled  libraries?

Best regards

Nalves

  • Hi Nalves,

    Based on your description this appears to be an issue with the CC2538 radio. There are obviously several updates and improvements to Z-Stack and the low-level MAC layer since v1.2.1 but it would be difficult to test any one improvement when the behavior cannot be easily replicated. Also, it sounds like you cannot upgrade Z-Stack so any pre-compiled library updates would not be possible. Since you did mention that commands from the ZC can be executed by the ZR and that resetting the device fixes the issue, my advice is to recognize the behavior within the network and send the faulty device a Leave Request with rejoin enabled so that the device performs a software reset and is able to resume operation.

    Regards,
    Ryan
  • Hi Ryan,

    Thank you for your feedback.

    to turn ZR working again, it is necessary to send a leave request with rejoin enabled. it will enough for the device to reboot, and the device becomes operational.

    For now, until understand what is the main reason of this issue, do you have a list of stack parameters that are directly responsible for the tX communications, that I could read/write the values for i can understand if are the cause of this issue? Or, for mitigate the problem a way that device can understand that are with TX problems, for can a auto reboot?


    Best regards

    Nalves
  • Hi,

    Just as a reference here are the release notes with the list of things fixed and changed in the stack from 1.2.1 to 1.2.2 

    Texas Instruments, Inc.
    
    Z-Stack Core Release Notes
    
    --------------------------------------------------------------------------------
    --------------------------------------------------------------------------------
    
    Version 2.6.3
    February 25, 2015
    
    
    Notices:
     - Z-Stack 2.6.3 has been certified as a ZigBee Compliant Platform (ZCP) for
       the devices, toolchains, and TI-RTOS SimpleLink versions listed below.
    
     - Z-Stack for the CC2530/CC2531 platforms has been built and tested with
       IAR's CLIB library, which provides a light-weight C library which does not
       support Embedded C++. Use of DLIB is not recommended for those platforms.
       Z-Stack for CC2538 and CC2630/CC2650 platforms has been tested with the
       DLIB library.
    
     - Z-Stack projects specify compile options in two ways: (1) "-D" statements
       in f8wConfig.cfg and other *.cfg files, (2) entries in the IDE Compiler
       Preprocessor Defined Symbols. When using the EW8051 or EW430 compilers,
       settings made in the IDE will override settings in the *.cfg files. When
       using the EWARM compiler, settings in the *.cfg files will override any
       made in the IDE.
    
     - When programming devices for the first time with this release, the entire
       Flash memory should be erased. For the CC2530/CC2531 and EXP5438 platforms,
       select "Erase Flash" in the "Debugger->Texas Instruments->Download" tab of
       the project options. For the CC2538 and CC2630/CC2650 devices, select the
       "Project->Download->Erase Memory" tab of the project options
    
     - Application, library, and hex files were built/tested with the following
       versions of IAR tools and may not work with different IAR tool versions:
         - CC2630 + SRF06EB:  EWARM 7.30.4  (7.30.4.8167)
         - CC2538 + SRF06EB:  EWARM 7.30.4  (7.30.4.8167)
         - CC2530 + SRF05EB:  EW8051 9.10.1 (9.10.1.2146)
         - CC2530 + EXP5438:  EW430 6.20.1  (6.20.1.931)
    
     - Foundation software library files for the CC2538 include the following:
         - bsp.lib, version 1.3.1
         - driverlib.lib, version 1.3.1
         - usblib.lib, version 1.0.1
    
     - Support software packages for the CC2630/CC2650 include:
         - TI-RTOS:  tirtos_simplelink_2_11_01_09
         - XDCTOOLS: xdctools_3_30_06_67_core
    
    
    Changes:
     - [6357] A new Monitor-Test command has been added to provide parameters for
       the new enhanced rejoin mechanism. See section 3.12.1.33 of the "Z-Stack
       Monitor and Test API" document for details on ZDO_SET_REJOIN_PARAMETERS.
    
     - [6180] Added the function ZDApp_ChangeState() to enhance processing of
       'devState' changes and setting the ZDO event to send App notifications.
    
     - [6122] Enhanced the leave mechanism to retain NV information when the
       leave request is with rejoin=1 and the device does not reset, allowing it
       to preserve volatile application data.
    
     - [6079] Improved handling of the situation where the original parent of a
       sleeping End-Device misses the Device Announcement when the child changes
       its parent. This provides faster recovery than waiting for Child Aging.
    
     - [6078] Implemented a Parent Announce mechanism for routers to exchange two
       new ZDO messages specified by R21 Child Management. The new "Parent_annce"
       and "Parent_annce_rsp" messages are used to determine which is the real
       parent router of specific child End-Devices.
    
     - [5956] Reduced memory usage in End-Device builds by removing un-necessary
       references to the Associated Device list.
    
     - [5476] Implemented the "enhanced" rejoin mechanism for aged out End-Devices
       as suggested by the R21 specification. An End-Device can try both Secure
       Rejoin and TC Rejoin to rejoin a network after it has been aged out and
       told to leave.
    
     - [5475] Implemented Orphan Scan Notification as the child aging keep-alive
       mechanism required by the R21 specification.
    
     - [5474] Implemented timeout establishment for the child aging mechanism in
       the R21 specification. This includes End-Devices sending the new "End
       Device Timeout Request" command once they have joined the network.
    
     - [5472] Updated neighbor table behavior as needed to support the child
       aging mechanism in the R21 specification.
    
     - [5471] Added NIB elements, stored in Z-Stack "zg" variables, as required to
       support the child aging mechanism in the R21 specification.
    
     - [5470] Moved EndDeviceTimeout Request and Response commands from ZDO to NWK
       layer as required in the R21 specification.
    
     - [5468] Removed the ZIGBEE_CHILD_AGING compile flag from Z-Stack code since
       "Child Aging and Recovery Protocol" becomes a mandatory feature in the R21
       specification.
       
     - [5467] Implemented End-Device Initiator functionality as required for the
       child aging feature in the approved R21 specification.
    
     - [5217] Modified the NWK message buffer system to delete queued messages for
       a non-RX_ON_IDLE device when a rejoin request is received from it. Queued
       messages are also deleted for a child (non-parent) when a leave indication
       is received for it.
    
     - [5168] Changed the NLME_SetPollRate() function to accept a 32-bit timer
       parameter, allowing for polling intervals longer than 65.535 seconds. The
       variable 'zgPollRate' was changed to 32-bit, along with the associated
       NV item. Added backward-compatible MT support forthe old 16-bit NV item.
    
     - [4583] Modified Coordinator network start-up behavior to save the selected
       PanID and other NIB information to Non-Volatile memory before any child
       device associates to it. If the Coordinator has a reset while waiting for
       other devices to join, it will retain the original PanID.
    
    
    Bug Fixes:
     - [6365] Fixed a problem where an UpdateDevice message for a leaving device
       would only be sent to the Trust Center by the parent router even if other
       routers at radius=1 heard the broadcast LeaveIndication. The UpdateDevice
       message would not get delivered if the parent router was turned off.
    
     - [6346] Fixed a race condition that could occur when an End-device sent two
       NWK Rejoin Requests to a potential parent and didn't received the Rejoin
       Response. The End-device would stop polling and could never rejoin until
       the device was reset.
    
     - [6231] Changed ZDApp_NVUpdate() to set an event, instead of a timer, for
       End-Devices to save network parameters immediately to NV after it joins.
    
     - [6165] Changed the definition of 0xFFFE as an invalid PanID to allow it
       to be used - this would not allow TI devices to operate in a network with
       that PanID, creating a potential inter-operability problem.
    
     - [6153] Handled a rare situation where two parents have the same End-Device
       in their association table with different short addresses and the former
       parent sends a Coordinator Realignment with the former short address that
       is different from the current short address of the End-Device.
    
     - [6077] Fixed an issue where an End-Device Secure Rejoin failed because the
       Update Device message could not be sent to the Trust Center and the parent
       silently removed the End-Device from its associated device list, stopping
       it from forwarding messages to the End-Device.
    
     - [6070] Reduced the time for an End-Device to send orphan notification on
       higher channels - now does not default to trying all 16 channels.
    
     - [6043] Removed possible transmissions from a device that has been requested
       to leave but is waiting for LEAVE_RESET_DELAY to reset. This eliminates the
       "spurious" End-Device polling and Router link-status messages.
    
     - [5783] Fixed a problem in the CC2530 MAC where a transmit packet is would
       be "pended" during active packet reception but never got sent. Transmit
       packets would not be sent until the device received its next packet.
    
     - [5771] Enhanced RF error handling in the CC2538 MAC by also processing
       strobe and synth lock errors. Refer to macMcuRfErrIsr() for details.
    
     - [5770] Corrected RF error handling in the CC2538 MAC by also clearing the
       error flag from the NVIC register. Refer to macMcuRfErrIsr() for details.
    
     - [5698] Fixed a problem where a child was aged out of the child table and
       sent a leave command, it performed a MAC association. If the association
       permit was false it would not rejoin the network and find a new parent.
    
     - [5696] Fixed a problem on CC2538 devices where memory area for customer
       commissioning of IEEE address (CCA) would be filled with zeroes, causing
       the device to wake up with a bogus IEEE address of 0x0000000000000000 in
       the secondary location, preventing use of the primary IEEE address.
    
     - [5626] Fixed a problem where an RX-Always-On End-Device saw a DeviceAnnce,
       its parent could be removed from the neighbor table, resulting in sending
       NwkStatus "No Route Available" when it receives messages from its parent.
    
     - [5563] Improved End-Device power consumption by turning off the receiver
       sooner after receiving a MAC ACK.
    
     - [5455] Fixed the APS retransmission mechanism to increment the APS frame
       counter for each retry and for each APS ACK.
    
     - [5270] Fixed a problem with Many-To-One routing where the routing table
       entry was not properly freed even when the concentrator left the network.
    
     - [5067] Improved sleeping End-Device power consumption by allowing it to
       sleep while waiting for an APS ACK. Previously, it would not sleep until
       it revived the ACK or exhausted its APS retries.
    
     - [4966] Fixed a scenario that once a Many-To-One (MTO) route discovery was
       complete and every router had the MTO routing entry to the concentrator, a
       newly joined device that hadn't yet participated in the MTO route discovery
       was not able to send a packet to the concentrator until it got the MTO
       routing entry by next MTO route discovery request from the concentrator.
    
     - [4811] Fixed a problem where a child device changed its NWK address because
       itr rejoined the network via MAC association. It was no longer able to do
       secure communications because it had reset the frame counter to zero. 
    
     - [4579] Fixed a problem with PA/LNA devices where the control signals for
       the PA/LNA would not works correctly after the radio woke up from sleep. 
    
     - [4504] Fixed a problem where a device that received a Leave Request with
       Rejoin set to true, would clear its NV information, making it not capable
       of rejoining the previous network without doing MAC association.
    
     - [4454] Fixed a problem where an End-Device would not try to rejoin its
       parent if the parent's end-device capacity status was set to zero.
    
     - [4453] Fixed a problem where an End-Device would not try to rejoin its
       parent if the parent's associate permit status was set to false.
    
     - [4225] Fixed a Monitor-Test (MT) problem where the device did not respond
       to a SYS_OSAL_NV_READ command from Z-Tool for NV item 0x0022. MT response
       would be dropped silently if the requested data block was larger than the
       serial TX buffer. 
    
     - [4035] Fixed a problem where an End-Device would get confused after reset
       in the following scenario: (1) two Coordinators, A and B, have networks on
       different channels, (2) End-Device joins network A on lower channel, (3)
       Coordinator A is turned off, (4) End-Device resets, clears NV, and joins
       Coordinator B, (5) Coordinator A is turned back on, (6) reset End-device
       and observe that it sent rejoin request to Coordinator A instead of B.
    
     - [3366] Fixed a Monitor-Test (MT) problem where the device did not respond
       to a UTIL_ZCL_KEY_EST_SIGN command from Z-Tool.
    
    
    Known Issues:
     - Processing of Manufacturer Specific foundation commands is not supported.
    
     - The ZDO Complex Descriptor is not supported.
    
    --------------------------------------------------------------------------------
    --------------------------------------------------------------------------------
    
     

    Some of the things that I can guess could make something like this happen are the following:

    - Somehow CSMA algorithm at the MAC layer keeps failing and MAC decides not to send the packet because of this.

    - Potential memory corruption/ stack overflow.

    - Potential memory leak filling heap and NWK layer cant allocate memory for TX buffer.

    If you have any way of modifying the code in your devices and want to attempt at detecting this failure in your code I recommend you look into processing the data confirmation comming from the stack and checking the status of the sent packets. You can add the following code to your event loop inside the "switch ( MSGpkt->hdr.event )"

            case AF_DATA_CONFIRM_CMD:
                myApplication_ProcessDataConfirm( (afDataConfirm_t *)MSGpkt );
              break;


    See example below so you can have an idea of what to do in your process data confirm function

    static void myApplication_ProcessDataConfirm( afDataConfirm_t *pDataConfirm )
    {
        if(pDataConfirm->hdr.status != ZSuccess)
        {
            //Something went wrong, Packet was not sent. 
            //Log status here or count failures. 
            //If failures > MY_APP_MAX_ALLOWED_FAILS then something might be wrong and need to reset device
        }
    }

    I hope this helps

  • Hi Hector,

    Thank you for your feedback.

    Any of the "thinks" that you referee, could be detected checking of internal radio or stack registers?
    I would be considering developing a supervisor routine that takes decisions when any anomaly occurs at the level of stack overflow or memory leaks, but to do that, i must understand and know if exists some register, flag or internal stack variables that detect this type of issues.

    The solution that you proposed, I had already thought about it, but it could occur more frequently than desired, since the device may want to communicate with other devices that is off, for example, the device will restart permanently.

    If something register or flag reports that the TX communication had problems, for me is a better solution.

    Best regards

    Nalves
  • I think the best place to keep track of this will be in the data confirm.
    The status will indicate what the failure was for example if CSMA fails then the status will be "ZMacChannelAccessFailure"
    If there was a memory error the status will be ZMemError or ZMacMemError. And based on these statuses you should be able to determine when you reset or not. For example if you receive a status of ZMacNoACK then you don't increase the number of failures because this will indicate that the packet was sent out but did not received an Ack.

    You can check the list of status codes in zcomdef.h

    Another thing that you should be checking is the return status of the APIs that you use in your application to send packets over the air. For example check the return status of "zcl_SendCommand" to verify that the packet was successfully allocated in the TX buffer.

    Regarding memory more specifically Heap, I recommend reading section "14. Heap Memory Management" of the users guide which explains on how to enable the heap memory profiler which should help you track the heap to see if there are any leaks or if you need to increase the heap size in your application.

    When it comes to identifying stack overflows this becomes a bit trickier. You will have to initialize the stack with a known pattern and then check if the pattern was overwritten at the boundary where the stack ends.
    Also, there is an easier way if you have a debugger attached and if you are using IAR. There is an option to track the stack which will show you the max usage of the stack while you are debugging.
  • Hi Hector,

    Many thanks for your feedback. This clues are very helpful and I will implement it.

    I hope, soon, can I send you some feedback.

    Best regards.

    Nalves