A RESUME/REJOIN bug in Z-Stack 2.5.1.a needs confirmation

Other Parts Discussed in Thread: Z-STACK, CC2530

I am writing this post to get confirmation from TI.

I am working on an end device using the CC2530, and the Z-Stack version is 2.5.1.a. The coordinator is based on an Ember chip.

Problem: when the end device is moved out of range of the hub for a while and then brought back, it sometimes fails to rejoin the network. The failure rate is about 10%. After more testing, I found that the failure rate increases considerably when I move the end device around at the edge of the valid range, that is, in an unstable signal environment.

Debug: after a series of debugging sessions, I narrowed the problem down to the code below:

ZDApp.c  -> ZDApp_ProcessNetworkJoin()

{
  if ( (devState == DEV_NWK_JOINING) ||
      ((devState == DEV_NWK_ORPHAN)  &&
       (ZDO_Config_Node_Descriptor.LogicalType == NODETYPE_ROUTER)) )
  {
    ...................
  }
  else if ( devState == DEV_NWK_ORPHAN || devState == DEV_NWK_REJOIN )
  {
    if (nwkStatus == ZSuccess)
    {
      .......................
    }
    else
    {
      if ( devStartMode == MODE_RESUME )
      {
        if ( ++retryCnt <= MAX_RESUME_RETRY )
        {
          if ( _NIB.nwkPanId == 0xFFFF || _NIB.nwkPanId == INVALID_PAN_ID )
          {
            devStartMode = MODE_JOIN;
          }
          else
          {
            devStartMode = MODE_REJOIN;
            _tmpRejoinState = true;
          }
        }
        // Do a normal join to the network after a certain number of rejoin retries
        else if( AIB_apsUseInsecureJoin == true )
          devStartMode = MODE_JOIN;
      }

      // Clear the neighbor Table and network discovery tables.
      nwkNeighborInitTable();
      NLME_NwkDiscTerm();

      // setup a retry for later...
      ZDApp_NetworkInit( (uint16)(NWK_START_DELAY
           + (osal_rand()& EXTENDED_JOINING_RANDOM_MASK)) );
    }
  }
}

My analysis:

(1) Once the end device is out of range, ZDO_SyncIndicationCB() is called and triggers the ZDO_NWK_JOIN_REQ event, which sets the mode to RESUME and initializes the network to perform NLME_OrphanJoinRequest().

(2) If the RESUME succeeds, there is no problem. If it fails, execution reaches the code above. Since retryCnt is initialized to 0, the condition "++retryCnt <= MAX_RESUME_RETRY" is true on the first failure, so the mode always shifts to REJOIN. The RESUME is therefore performed only once.

(3) So after the single RESUME attempt fails, the device tries to rejoin the network. But if that rejoin also fails, _tmpRejoinState is not set to true again by the code above, and the second REJOIN will not find a suitable network to rejoin.

(4) My analysis assumes two consecutive failures. That would not normally happen in a good signal environment, but with an unstable signal it does happen, as I described at the beginning.
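
Point (2) can be checked on a PC with a small host-side model of the failure branch. This is only a sketch of the counter logic quoted above, not Z-Stack code: the function name resume_attempts_original and the enum are made up for illustration, every attempt is assumed to fail, and the PAN ID is assumed to be valid.

```c
/* Host-side model only; it does not call into Z-Stack. */
enum model_start_mode { MODEL_MODE_JOIN, MODEL_MODE_RESUME, MODEL_MODE_REJOIN };

#define MAX_RESUME_RETRY 3   /* value quoted in the post */

/* Returns how many RESUME attempts are made before the mode changes,
 * assuming every attempt fails and the PAN ID is valid. */
int resume_attempts_original(void)
{
    enum model_start_mode mode = MODEL_MODE_RESUME;
    int retryCnt = 0;
    int attempts = 0;

    /* The cap of 10 only guarantees that the model terminates. */
    while (mode == MODEL_MODE_RESUME && attempts < 10) {
        attempts++;                            /* one failed orphan scan */
        if (++retryCnt <= MAX_RESUME_RETRY) {  /* original comparison */
            mode = MODEL_MODE_REJOIN;          /* valid PAN ID assumed */
        }
    }
    return attempts;
}
```

Under these assumptions the function returns 1: the first failure already satisfies the comparison, so the device leaves RESUME mode immediately.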

 

My questions:

(1) I understand MAX_RESUME_RETRY to be the number of retries for the RESUME process, but its defined value of 3 has no effect: the RESUME is performed only once.

(2) The code "devStartMode = MODE_JOIN;" will never be executed once the mode has shifted to REJOIN. It is also not reasonable to shift to JOIN mode, because that could cause the "drop off" symptom: the coordinator does not permit joining at that time.

(3) If the first REJOIN fails, _tmpRejoinState is reset to 0 and is never set to true again by the code above. That means the network scan results will not have Devicecapacity true, so the end device will never rejoin the network. I think "_tmpRejoinState = true;" should be added to the code above.
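
Question (3) can be illustrated with another host-side sketch. It rests on my observation that a failed rejoin clears _tmpRejoinState; how the stack really consumes the flag is inside the NWK library, so the clearing step below is an assumption. The function name rejoin_tries_with_flag and the constant MODEL_REJOIN_TRIES are hypothetical.

```c
#include <stdbool.h>

#define MODEL_REJOIN_TRIES 3   /* hypothetical number of REJOIN attempts to model */

/* Counts how many REJOIN attempts start with the flag set.
 * set_flag_each_retry == false: flag is set once, at the RESUME -> REJOIN
 *                               transition, as in the original code.
 * set_flag_each_retry == true:  flag is set again before every retry. */
int rejoin_tries_with_flag(bool set_flag_each_retry)
{
    bool tmpRejoinState = true;   /* set when leaving MODE_RESUME */
    int with_flag = 0;

    for (int i = 0; i < MODEL_REJOIN_TRIES; i++) {
        if (set_flag_each_retry) {
            tmpRejoinState = true;   /* the addition proposed in question (3) */
        }
        if (tmpRejoinState) {
            with_flag++;
        }
        tmpRejoinState = false;      /* assumption: a failed rejoin clears the flag */
    }
    return with_flag;
}
```

With the original behavior only the first of the three modeled attempts runs with the flag set; when the flag is set before every retry, all three do.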

 

My change: I modified the code above as follows, and the drop-off issue no longer occurs.

if ( devStartMode == MODE_RESUME )
      { 
        if ( ++retryCnt >= MAX_RESUME_RETRY )  //"<=" Changed to ">=" by Andy 20130630
        {
          if ( _NIB.nwkPanId == 0xFFFF || _NIB.nwkPanId == INVALID_PAN_ID )
          {  
            devStartMode = MODE_JOIN;
          }
          else
          {
            devStartMode = MODE_REJOIN;
            _tmpRejoinState = true;
          }
        }
        
        // Do a normal join to the network after certain times of rejoin retries
        //Commented by Andy to avoid the JOIN mode because it does not make sense. 20130630
        //else if( AIB_apsUseInsecureJoin == true )  
        //  devStartMode = MODE_JOIN;    
        
        else  //Added by Andy to keep RESUME until MAX_RESUME_RETRY times. 20130630
        {
          devStartMode = MODE_RESUME;
          _tmpRejoinState = true;
          osal_cpyExtAddr( ZDO_UseExtendedPANID, _NIB.extendedPANID );
          zgDefaultStartingScanDuration = BEACON_ORDER_60_MSEC;
          ZDApp_NetworkInit( 0 );           
        }
       
      }
     
      //Added by Andy. 20130630
      //If the mode has shifted from RESUME to REJOIN. Make sure _tmpRejoinState is set to true.
      if (devStartMode == MODE_REJOIN) 
      {
        _tmpRejoinState = true;
        // Clear the neighbor Table and network discovery tables.
        nwkNeighborInitTable();
        NLME_NwkDiscTerm();

        // setup a retry for later...
        ZDApp_NetworkInit( (uint16)(NWK_START_DELAY
             + (osal_rand()& EXTENDED_JOINING_RANDOM_MASK)) );
      }
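
The effect of changing "<=" to ">=" can be checked with the same kind of host-side model. Again this is only a sketch: resume_attempts_modified and SKETCH_MAX_RESUME_RETRY are illustrative names, every attempt is assumed to fail, and the PAN ID is assumed to be valid.

```c
enum { SKETCH_MAX_RESUME_RETRY = 3 };   /* same value as MAX_RESUME_RETRY in the post */

/* Returns how many RESUME attempts are made before the mode shifts to
 * REJOIN when the comparison is ">=" and the else branch keeps the
 * device in RESUME mode. */
int resume_attempts_modified(void)
{
    int retryCnt = 0;
    int attempts = 0;
    int in_resume = 1;

    while (in_resume) {
        attempts++;                                   /* one failed RESUME attempt */
        if (++retryCnt >= SKETCH_MAX_RESUME_RETRY) {  /* modified comparison */
            in_resume = 0;                            /* shift to REJOIN */
        }
        /* else: stay in RESUME and try again */
    }
    return attempts;
}
```

Under these assumptions the function returns 3, so the RESUME is attempted MAX_RESUME_RETRY times before the mode shifts to REJOIN.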