A RESUME/REJOIN bug in Z-Stack 2.5.1.a that needs confirmation

Andy Yang1

Other Parts Discussed in Thread: Z-STACK, CC2530

I am writing this post to get confirmation from TI.

I am working on an end device using the CC2530, and the Z-Stack version is 2.5.1.a. The coordinator is based on an Ember chip.

Problem: if the end device is moved out of the hub's valid range for a while and then brought back, it may fail to rejoin the network. The failure rate is about 10%. After further testing, I found that the failure rate increases considerably when I move the end device around at the edge of the valid range, that is, in an unstable signal environment.

Debug: after a series of debugging sessions, I have narrowed the problem down to the code below:

ZDApp.c -> ZDApp_ProcessNetworkJoin()

{
  if ( (devState == DEV_NWK_JOINING) ||
      ((devState == DEV_NWK_ORPHAN) &&
       (ZDO_Config_Node_Descriptor.LogicalType == NODETYPE_ROUTER)) )
  {
    ...................
  }
  else if ( devState == DEV_NWK_ORPHAN || devState == DEV_NWK_REJOIN )
  {
    if (nwkStatus == ZSuccess)
    {
      .......................
    }
    else
    {
      if ( devStartMode == MODE_RESUME )
      {
        if ( ++retryCnt <= MAX_RESUME_RETRY )
        {
          if ( _NIB.nwkPanId == 0xFFFF || _NIB.nwkPanId == INVALID_PAN_ID )
          {
            devStartMode = MODE_JOIN;
          }
          else
          {
            devStartMode = MODE_REJOIN;
            _tmpRejoinState = true;
          }
        }
        else if( AIB_apsUseInsecureJoin == true )    // Do a normal join to the network after certain times of rejoin retries
          devStartMode = MODE_JOIN;
      }

      // Clear the neighbor Table and network discovery tables.
      nwkNeighborInitTable();
      NLME_NwkDiscTerm();

      // setup a retry for later...
      ZDApp_NetworkInit( (uint16)(NWK_START_DELAY
           + (osal_rand()& EXTENDED_JOINING_RANDOM_MASK)) );
    }
  }
}

My analysis:

(1) Once the end device is out of range, ZDO_SyncIndicationCB() is called and triggers the ZDO_NWK_JOIN_REQ event, which sets the mode to RESUME and initializes the network to perform NLME_OrphanJoinRequest().

(2) If the RESUME succeeds, there is no problem. If it fails, execution reaches the code above. Since retryCnt is initialized to 0, "++retryCnt <= MAX_RESUME_RETRY" is true on the first failure, so the mode always shifts to REJOIN. The RESUME is therefore performed only once.

(3) After that single RESUME attempt fails, the device tries to rejoin the network. But if the rejoin also fails, the code above does not set _tmpRejoinState to true again, so the second REJOIN will not find a proper network to rejoin.

(4) My analysis assumes two consecutive failures. That is unlikely in a good signal environment, but in an unstable one it is bound to happen, as I described at the beginning.
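
The sequence in points (2) and (3) can be traced with a small stand-alone model. This is only a sketch under stated assumptions, not Z-Stack code: the struct, the function names, and the value used for INVALID_PAN_ID are hypothetical stand-ins, MAX_RESUME_RETRY is set to 3 as mentioned in the post, and the step that clears the flag is taken from the poster's description rather than from TI sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the Z-Stack definitions; values are illustrative. */
enum start_mode { MODE_JOIN, MODE_RESUME, MODE_REJOIN };
#define MAX_RESUME_RETRY 3
#define INVALID_PAN_ID   0xFFFE

struct join_model {
    enum start_mode devStartMode;
    uint8_t         retryCnt;
    bool            tmpRejoinState;  /* models _tmpRejoinState */
    uint16_t        panId;           /* models _NIB.nwkPanId */
    bool            insecureJoin;    /* models AIB_apsUseInsecureJoin */
};

/* Mirrors the failure branch of the original code for one failed attempt.
   Table clearing and the delayed retry are left out of the model. */
static void original_on_failure(struct join_model *m)
{
    assert(m != NULL);
    if (m->devStartMode == MODE_RESUME) {
        if (++m->retryCnt <= MAX_RESUME_RETRY) {
            if (m->panId == 0xFFFF || m->panId == INVALID_PAN_ID) {
                m->devStartMode = MODE_JOIN;
            } else {
                m->devStartMode = MODE_REJOIN;
                m->tmpRejoinState = true;
            }
        } else if (m->insecureJoin) {
            m->devStartMode = MODE_JOIN;
        }
    }
}

/* Assumption from the post: the flag is reset to 0 once a rejoin attempt
   has used it. Where exactly the stack does this is not shown in the thread. */
static void scan_consumes_flag(struct join_model *m)
{
    assert(m != NULL);
    m->tmpRejoinState = false;
}
```

Starting from MODE_RESUME with a valid PAN ID and retryCnt at 0, the first call to original_on_failure() moves the model to MODE_REJOIN with the flag set and retryCnt at 1. After scan_consumes_flag(), a second call changes nothing: the mode is no longer MODE_RESUME, so the flag stays cleared, which matches point (3).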

My questions:

(1) I understand MAX_RESUME_RETRY to be the number of retries of the RESUME process, but its defined value of 3 has no effect: the RESUME is performed only once.

(2) The code "devStartMode = MODE_JOIN;" will never be executed after the mode has shifted to REJOIN. It also does not seem reasonable to shift to JOIN mode, which could result in the "drop off" symptom because the coordinator does not permit joining at that time.

(3) If the first REJOIN fails, _tmpRejoinState is reset to 0 and, based on the code above, is never set to true again. That means the network scan result will not have Devicecapacity true, so the end device will certainly fail to rejoin the network. I think "_tmpRejoinState = true;" should be added to the code above.

My change: I made the following changes to the code above, and the drop-off issue no longer occurs.

if ( devStartMode == MODE_RESUME )
      {
        if ( ++retryCnt >= MAX_RESUME_RETRY ) //"<=" Changed to ">=" by Andy 20130630
        {
          if ( _NIB.nwkPanId == 0xFFFF || _NIB.nwkPanId == INVALID_PAN_ID )
          {
            devStartMode = MODE_JOIN;
          }
          else
          {
            devStartMode = MODE_REJOIN;
            _tmpRejoinState = true;
          }
        }

        // Do a normal join to the network after certain times of rejoin retries
        //Commented by Andy to avoid the JOIN mode because it does not make sense. 20130630
        //else if( AIB_apsUseInsecureJoin == true )
        // devStartMode = MODE_JOIN;

        else //Added by Andy to keep RESUME until MAX_RESUME_RETRY times. 20130630
        {
          devStartMode = MODE_RESUME;
          _tmpRejoinState = true;
          osal_cpyExtAddr( ZDO_UseExtendedPANID, _NIB.extendedPANID );
          zgDefaultStartingScanDuration = BEACON_ORDER_60_MSEC;
          ZDApp_NetworkInit( 0 );
        }

      }

      //Added by Andy. 20130630
      //If the mode has shifted from RESUME to REJOIN. Make sure _tmpRejoinState is set to true.
      if (devStartMode == MODE_REJOIN)
      {
        _tmpRejoinState = true;
        // Clear the neighbor Table and network discovery tables.
        nwkNeighborInitTable();
        NLME_NwkDiscTerm();

        // setup a retry for later...
        ZDApp_NetworkInit( (uint16)(NWK_START_DELAY
             + (osal_rand()& EXTENDED_JOINING_RANDOM_MASK)) );
      }
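
The effect of the modified condition can be checked with the same kind of stand-alone model. Again, this is a sketch and not Z-Stack code: the struct and function names are hypothetical, the value of INVALID_PAN_ID is illustrative, MAX_RESUME_RETRY is assumed to be 3, and the calls into the stack (ZDApp_NetworkInit() and the table clearing) are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the Z-Stack definitions; values are illustrative. */
enum start_mode { MODE_JOIN, MODE_RESUME, MODE_REJOIN };
#define MAX_RESUME_RETRY 3
#define INVALID_PAN_ID   0xFFFE

struct join_model {
    enum start_mode devStartMode;
    uint8_t         retryCnt;
    bool            tmpRejoinState;  /* models _tmpRejoinState */
    uint16_t        panId;           /* models _NIB.nwkPanId */
};

/* Mirrors the modified failure branch for one failed attempt. */
static void patched_on_failure(struct join_model *m)
{
    assert(m != NULL);
    if (m->devStartMode == MODE_RESUME) {
        if (++m->retryCnt >= MAX_RESUME_RETRY) {
            if (m->panId == 0xFFFF || m->panId == INVALID_PAN_ID) {
                m->devStartMode = MODE_JOIN;
            } else {
                m->devStartMode = MODE_REJOIN;
                m->tmpRejoinState = true;
            }
        } else {
            /* Stay in RESUME until the retry limit is reached. */
            m->devStartMode = MODE_RESUME;
            m->tmpRejoinState = true;
        }
    }
    /* Once in REJOIN, the flag is set again on every failed attempt. */
    if (m->devStartMode == MODE_REJOIN) {
        m->tmpRejoinState = true;
    }
}

/* Counts failed attempts until the model leaves RESUME mode. */
static unsigned failures_until_mode_change(struct join_model *m)
{
    unsigned failures = 0;
    assert(m != NULL);
    while (m->devStartMode == MODE_RESUME) {
        patched_on_failure(m);
        failures++;
    }
    return failures;
}
```

With retryCnt starting at 0 and a valid PAN ID, the model stays in MODE_RESUME for the first two failures and switches to MODE_REJOIN on the third. From then on, every further failure sets the flag again, even if something cleared it in between.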

over 12 years ago

0 YiKai Chen over 12 years ago

Guru 735685 points

If your finding is correct, it is a serious problem in Z-Stack. I don't know why the TI employees on this forum have skipped this post. By the way, could you also attach the packet sniffer log? TI will probably need the sniffer log to investigate.

0 OD over 12 years ago in reply to YiKai Chen

TI__Expert 3050 points

Hi Andy,

Thanks for your detailed post.

Your changes make sense and look right.
In addition, as was already suggested to you offline, you may use the following change:
In ZDO_beaconNotifyIndCB() in ZDApp.c, change the following:

if ( ( pBeacon->LQI > gMIN_TREE_LINK_COST ) &&
( ( pBeacon->permitJoining == TRUE ) || ( _tmpRejoinState ) ) )

To:

if ( ( pBeacon->LQI > gMIN_TREE_LINK_COST ) &&
( ( pBeacon->permitJoining == TRUE ) || (devStartMode == MODE_REJOIN) ) )

Best regards,
OD
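
The difference between the two beacon conditions in the reply above can be sketched as two small predicates. This is only an illustration under assumptions: the struct, the function names, and the parameters are hypothetical and simply mirror the fields and globals named in the reply (pBeacon->LQI, pBeacon->permitJoining, gMIN_TREE_LINK_COST, _tmpRejoinState, devStartMode).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the Z-Stack types. */
enum start_mode { MODE_JOIN, MODE_RESUME, MODE_REJOIN };

struct beacon {
    uint8_t lqi;            /* models pBeacon->LQI */
    bool    permitJoining;  /* models pBeacon->permitJoining */
};

/* Original condition: depends on the transient _tmpRejoinState flag. */
static bool accept_original(const struct beacon *b, uint8_t minLinkCost,
                            bool tmpRejoinState)
{
    return (b->lqi > minLinkCost) &&
           (b->permitJoining || tmpRejoinState);
}

/* Suggested condition: depends on the start mode itself. */
static bool accept_suggested(const struct beacon *b, uint8_t minLinkCost,
                             enum start_mode mode)
{
    return (b->lqi > minLinkCost) &&
           (b->permitJoining || mode == MODE_REJOIN);
}
```

For a beacon with good LQI from a network that does not permit joining, the original predicate accepts it only while the flag happens to be set, whereas the suggested one accepts it whenever the device is in MODE_REJOIN.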

0 YiKai Chen over 12 years ago in reply to OD

Guru 735685 points

Hi OD,

So this is a confirmed bug, isn't it? If so, would you please specify which versions of Z-Stack need this patch? Since this bug might be critical, I would also suggest that TI post a new thread and pin it to the top to help others apply this patch.

0 YiKai Chen over 12 years ago in reply to YiKai Chen

Guru 735685 points

Would any TI employee please specify which versions of Z-Stack need this patch? Since this bug might be critical, I would also suggest that TI post a new thread and pin it to the top to help others apply this patch.

0 Jason hu1 over 11 years ago in reply to YiKai Chen

Prodigy 150 points

Thanks to OD and Andy.

0 Timofeev Boris over 10 years ago in reply to Jason hu1

Prodigy 30 points

Thanks very much. In Z-Stack 2.6.2 (Home Automation) this error has not been corrected.
