End device fails to go to sleep when response to rejoin request is 'PAN Access Denied'

Leo Cahalan

Prodigy 110 points

Other Parts Discussed in Thread: CC2538, Z-STACK, CC2530, CC2630

If an end device receives a ‘PAN access denied’ response to the rejoin request, it fails to return to sleep.

#define MAC_ASSOC_DENIED 2 /* PAN access denied */

It looks like the mac and network tasks continue to run and prevent the radio from going to sleep.

Setup

Z-Stack Home 1.2.2 running on CC2538.

-DREJOIN_BACKOFF=900000

-DREJOIN_POLL_RATE=440

-DREJOIN_SCAN=10000

How to create a ‘PAN access denied’ response

This response is created if the parent no longer has the child address in its associated device and network manager tables.

Build the coordinator with NWK_MAX_DEVICE_LIST=1.

This will limit the coordinator to having 2 end devices (#define NWK_MAX_DEVICES ( NWK_MAX_DEVICE_LIST + 1 ))

1. Join 2 end devices to the coordinator

2. Power down one device (Device #1)

3. Remove Device #1 from the coordinator table using the following calls.

ZDSecMgrAddrClear( addrEntry.extAddr );

AssocRemove( addrEntry.extAddr );

4. Join a third device

5. Power down all devices

6. Power up Device #1. It tries to rejoin but is rejected with "PAN access denied" response.

#define MAC_ASSOC_DENIED 2 /* PAN access denied */

Here is the log, with time stamp in tenths of seconds, showing the ZDO state change call backs.

0.0 ZDO St: 10 - NWK_ORPHAN

0.6 ZDO St: 2 - NWK_DISC

1.1 ZDO St: 15 - TC_REJOIN_ALL_CH

1.8 ZDO St: 2 - NWK_DISC

9.9 ZDO St: 14 - TC_REJOIN_CURR_CH

10.0 ZDO St: 12 - NWK_BACKOFF

11.7 ZDO St: 12 - NWK_BACKOFF

When the NWK_BACKOFF state is entered, the radio is on and drawing about 40 mA. Data requests are being transmitted every 440 msec, which is the rejoin poll rate. The system remains stuck in this state and will keep sending the data requests forever.
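To put rough numbers on the stuck state, here is a small arithmetic sketch. It assumes the 40 mA draw is constant and that the polling continues for the full 900000 ms that REJOIN_BACKOFF is set to, i.e. the period during which the device should have been asleep. The helper names are hypothetical and not part of Z-Stack.

```c
#include <stdint.h>

/* Number of polls that fit in a time window; both arguments in milliseconds. */
uint32_t polls_in_window(uint32_t window_ms, uint32_t poll_ms)
{
    return (poll_ms == 0u) ? 0u : (window_ms / poll_ms);
}

/* Charge drawn in mAh, assuming a constant current for the whole window. */
double charge_mah(double current_ma, uint32_t window_ms)
{
    /* 3600000 ms per hour */
    return current_ma * ((double)window_ms / 3600000.0);
}
```

With the values from this setup, polls_in_window(900000, 440) gives 2045 data requests and charge_mah(40.0, 900000) gives 10 mAh for every backoff period spent awake.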

I tried a few fixes, but nothing I do seems to be able to get it out of this state.

I called MAC_PwrOffReq(MAC_PWR_SLEEP_DEEP), but it returned the error code 0xe2 – ZmacDenied.

Calling NLME_SetPollRate(0) stops the data requests, but the radio remains on and the data requests restart after 5 minutes at the 440 msec poll rate.

The closest to a solution I have come is to add the following in ZDO_STATE change when the NWK_BACKOFF is reported:

NLME_SetPollRate(0);

uint8 rxOnIdle = false;

ZMacSetReq( ZMacRxOnIdle, &rxOnIdle );

This turns off the radio and normal sleep mode resumes. However, after 5 minutes, the rejoin poll rate starts up again. I have verified that NLME_SetPollRate( zgRejoinPollRate ) is not being called from ZDApp.c and this is the only instance of this call I see in the code.

The 'PAN access denied' response is probably quite rare. In fact, I had never encountered it before this test. But it appears that it is not being handled correctly by the stack, and I am hoping that someone out there has come across it and found a workaround.

Thanks,

over 9 years ago

0 JasonB over 9 years ago

TI__Expert 8950 points

Hi Leo,

We're looking into this. We'll let you know once we've made any developments.

0 JasonB over 9 years ago

TI__Expert 8950 points

Quick update here, we have been able to reproduce your problem on our devices and we are currently working towards a solution. I'll keep you posted. Have you discovered anything else in your system that could be relevant since your initial post?

0 Leo Cahalan over 9 years ago in reply to JasonB

Prodigy 110 points

I don't have anything further to add at this point.

0 JasonB over 9 years ago in reply to Leo Cahalan

TI__Expert 8950 points

My testing was done using a fresh install of Z-Stack Home 1.2.2a, 4 Smart RF06 + 4 CC2538EM, and using the SampleSwitch project as end devices and a SampleLight as the coordinator.

So it looks like this issue stems from a few different things:

- The power manager is never set to allow the device to go into sleep mode if it does not successfully connect to a network, i.e., the device won't go to sleep when it is in backoff unless the system is initialized with a call to osal_pwrmgr_device( PWRMGR_BATTERY ) instead of osal_pwrmgr_device( PWRMGR_ALWAYS_ON ) (the default), or unless that function call is added in an appropriate place.

- Like you stated, the radio and polling must be turned off when the device goes into backoff as well.

- In ZDApp_ProcessNetworkJoin(), while the device attempts to connect to the network, it tries 4 different methods, ordered as follows: DEV_NWK_SEC_REJOIN_CURR_CHANNEL -> DEV_NWK_SEC_REJOIN_ALL_CHANNEL -> DEV_NWK_TC_REJOIN_CURR_CHANNEL -> DEV_NWK_TC_REJOIN_ALL_CHANNEL, but once it reaches DEV_NWK_TC_REJOIN_ALL_CHANNEL there is no logic to stop it and force it into backoff, e.g. after a certain number of failed attempts. However, the rejoin scan timer timeout will make it go into backoff after a defined period of time.

- Your rejoin scan timeout value (-DREJOIN_SCAN) should be large enough to cover the entire length of your scanning procedure, which in my case takes around 20 seconds of real time. You generally wouldn't want ZDO_REJOIN_BACKOFF to activate in the middle of your scanning procedure.

- ZDApp_ProcessOSALMsg() was sometimes executed while devState == DEV_NWK_BACKOFF, and the devState changing logic was handling it incorrectly.
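The escalation order in the third bullet can be pictured with a small standalone model. This is only a sketch: the enum is local to the example and does not use the real devStates_t values, and next_rejoin_mode() is a hypothetical helper rather than a Z-Stack function.

```c
/* Illustrative rejoin modes; a local enum, not the Z-Stack devStates_t values. */
typedef enum {
    REJOIN_SEC_CURR_CHANNEL,
    REJOIN_SEC_ALL_CHANNEL,
    REJOIN_TC_CURR_CHANNEL,
    REJOIN_TC_ALL_CHANNEL
} rejoin_mode_t;

/* Next mode to try after a failed attempt. The last mode simply repeats,
 * which is the gap described above: nothing here ever selects backoff. */
rejoin_mode_t next_rejoin_mode(rejoin_mode_t mode)
{
    switch (mode) {
    case REJOIN_SEC_CURR_CHANNEL: return REJOIN_SEC_ALL_CHANNEL;
    case REJOIN_SEC_ALL_CHANNEL:  return REJOIN_TC_CURR_CHANNEL;
    case REJOIN_TC_CURR_CHANNEL:  return REJOIN_TC_ALL_CHANNEL;
    default:                      return REJOIN_TC_ALL_CHANNEL;
    }
}
```

Calling next_rejoin_mode() repeatedly from REJOIN_SEC_CURR_CHANNEL walks through the four modes and then stays on REJOIN_TC_ALL_CHANNEL, so in this model only an external timer can end the scanning.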

I set my project up as you stated, but I used the following config values for testing purposes:

-DREJOIN_SCAN=60000

-DREJOIN_BACKOFF=30000

I've made the following changes in ZDApp.c:

Defines/Global variables:

#define MAX_TC_ALL_CH_REJOIN_FAILURES 3
uint8 rejoinFailureCount = 0;

in ZDApp_ProcessOSALMsg():

else
{
  if( prevDevState == DEV_NWK_SEC_REJOIN_ALL_CHANNEL )
  {
	ZDApp_ChangeState( DEV_NWK_TC_REJOIN_CURR_CHANNEL );
  }
  else if (prevDevState == DEV_NWK_TC_REJOIN_CURR_CHANNEL)
  {
	ZDApp_ChangeState( DEV_NWK_TC_REJOIN_ALL_CHANNEL );
  }
  // Here I increment how many times I have failed to connect to the network using
  // DEV_NWK_TC_REJOIN_ALL_CHANNEL mode. This is used by ZDApp_ProcessNetworkJoin
  else if (prevDevState == DEV_NWK_TC_REJOIN_ALL_CHANNEL){
	rejoinFailureCount++;
	ZDApp_ChangeState( DEV_NWK_TC_REJOIN_ALL_CHANNEL );
  }
  // if we are just coming out of backoff, we want to switch back to 
  // DEV_NWK_TC_REJOIN_ALL_CHANNEL mode
  else if (prevDevState == DEV_NWK_BACKOFF){
	ZDApp_ChangeState( DEV_NWK_TC_REJOIN_ALL_CHANNEL );
	prevDevState = DEV_NWK_TC_REJOIN_ALL_CHANNEL;
  }
}

in ZDApp_event_loop():

if( events & ZDO_REJOIN_BACKOFF )
  {
    // if we are here and devState == DEV_NWK_BACKOFF, we are coming out
    // of backoff after the timer elapsed, so start network rejoining process
    if( devState == DEV_NWK_BACKOFF )
    {
      ZDApp_ChangeState(DEV_NWK_DISC);
      prevDevState = DEV_NWK_BACKOFF;
      // Restart scan for rejoin
      ZDApp_StartJoiningCycle();
      osal_start_timerEx( ZDAppTaskID, ZDO_REJOIN_BACKOFF, zgDefaultRejoinScan );
    }
    
    // otherwise, this event was generated from the rejoin scan timer expiring
    // which means scanning was attempted and unsuccessful for
    // zgDefaultRejoinScan milliseconds, so we should go into backoff/sleep
    else
    {
      // Rejoin backoff, silent period
      ZDApp_ChangeState(DEV_NWK_BACKOFF);
      ZDApp_StopJoiningCycle();
      
      NLME_SetPollRate(0);
      uint8 rxOnIdle = false;
      ZMacSetReq( ZMacRxOnIdle, &rxOnIdle );
      
      #if defined ( POWER_SAVING )
        if(pwrmgr_attribute.pwrmgr_device != PWRMGR_BATTERY) {
          osal_pwrmgr_device( PWRMGR_BATTERY );
        }
      #endif
      
      osal_start_timerEx( ZDAppTaskID, ZDO_REJOIN_BACKOFF, zgDefaultRejoinBackoff );
    }

    return ( events ^ ZDO_REJOIN_BACKOFF);
  }

in ZDApp_ProcessNetworkJoin():

else 
{
 if ( devStartMode == MODE_RESUME )
      {	
		... // leave this code as is
	  }
else if(devStartMode == MODE_REJOIN)
      {
        if ( ZSTACK_END_DEVICE_BUILD )
        {
          devStartMode = MODE_REJOIN;
          _tmpRejoinState = true;
          _NIB.nwkState = NWK_INIT;

          if( prevDevState == DEV_NWK_SEC_REJOIN_CURR_CHANNEL )
          {
            runtimeChannel = MAX_CHANNELS_24GHZ;
            prevDevState = DEV_NWK_SEC_REJOIN_ALL_CHANNEL ;
          }
          else if ( prevDevState == DEV_NWK_SEC_REJOIN_ALL_CHANNEL)
          {
            // Set the flag that will ask the device to do trust center network layer rejoin.
            _NIB.nwkKeyLoaded = FALSE;
            ZDApp_ResetNwkKey(); // Clear up the old network key.
            runtimeChannel = (uint32) (1L << _NIB.nwkLogicalChannel);
            prevDevState = DEV_NWK_TC_REJOIN_CURR_CHANNEL ;
            
          }
          else if ( prevDevState == DEV_NWK_TC_REJOIN_CURR_CHANNEL )
          {
            runtimeChannel = MAX_CHANNELS_24GHZ;
            prevDevState= DEV_NWK_TC_REJOIN_ALL_CHANNEL ;
            
          }
          else if (prevDevState == DEV_NWK_TC_REJOIN_ALL_CHANNEL)
          {
            // rejoinFailureCount is incremented in ZDApp_ProcessOSALMsg
            // where devState is changed based on prevDevState
            
            // if we fail to connect MAX_TC_ALL_CH_REJOIN_FAILURES times,
            // go into backoff before trying again
            if (rejoinFailureCount >= MAX_TC_ALL_CH_REJOIN_FAILURES){
              rejoinFailureCount = 0;
              // ZDO_REJOIN_BACKOFF timer is currently running with 
              // zgDefaultRejoinScan timeout, stop it before continuing 
              osal_stop_timerEx(ZDAppTaskID, ZDO_REJOIN_BACKOFF);
              ZDApp_ChangeState(DEV_NWK_BACKOFF);
              ZDApp_StopJoiningCycle();

              NLME_SetPollRate(0);
              uint8 rxOnIdle = false;
              ZMacSetReq( ZMacRxOnIdle, &rxOnIdle );

              #if defined ( POWER_SAVING )
              if(pwrmgr_attribute.pwrmgr_device != PWRMGR_BATTERY) {
                osal_pwrmgr_device( PWRMGR_BATTERY );
              }
              #endif
              
              // start the backoff timer, go to sleep if power_saving is defined
              osal_start_timerEx( ZDAppTaskID, ZDO_REJOIN_BACKOFF, zgDefaultRejoinBackoff );
            }
          }
        }
      }

      // Clear the neighbor Table and network discovery tables.
      nwkNeighborInitTable();
      NLME_NwkDiscTerm();

      if (devState != DEV_NWK_BACKOFF) {
        // setup a retry for later...
        ZDApp_NetworkInit( (uint16)(NWK_START_DELAY
             + (osal_rand()& EXTENDED_JOINING_RANDOM_MASK)) );
      
      }
}

The event handling function for ZDO_REJOIN_BACKOFF may be somewhat unclear at first glance. With my code changes, it is entered under two conditions:
1. If a timer is started with zgDefaultRejoinScan as the timeout value and allowed to expire without being stopped, the ZDO_REJOIN_BACKOFF handler runs and the system is put into backoff, following the else.
2. If a timer is started with zgDefaultRejoinBackoff as the timeout value and allowed to expire, the system is coming out of backoff, hence the current state being DEV_NWK_BACKOFF, and the if part is executed.

Putting the system INTO backoff after a defined number of failed connection attempts is handled separately inside ZDApp_ProcessNetworkJoin().
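The two entry conditions can also be modelled outside the stack. The sketch below is a simplified stand-in, not the ZDApp.c code: the struct, the state names and on_rejoin_backoff_event() are hypothetical, and the single radio_on flag lumps together the poll rate, RxOnIdle and power manager settings.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { ST_SCANNING, ST_BACKOFF } model_state_t;

typedef struct {
    model_state_t state;
    bool          radio_on;    /* simplification of poll rate + RxOnIdle */
    uint32_t      scan_ms;     /* plays the role of zgDefaultRejoinScan */
    uint32_t      backoff_ms;  /* plays the role of zgDefaultRejoinBackoff */
    uint32_t      timer_ms;    /* timeout armed for the next event */
} rejoin_model_t;

/* One event, two meanings: the current state decides which branch runs. */
void on_rejoin_backoff_event(rejoin_model_t *m)
{
    if (m->state == ST_BACKOFF) {
        /* Backoff timer expired: resume scanning, arm the scan timeout. */
        m->state    = ST_SCANNING;
        m->radio_on = true;
        m->timer_ms = m->scan_ms;
    } else {
        /* Scan timer expired: go quiet, arm the backoff timeout. */
        m->state    = ST_BACKOFF;
        m->radio_on = false;
        m->timer_ms = m->backoff_ms;
    }
}
```

Starting in ST_SCANNING with scan_ms = 60000 and backoff_ms = 30000, the first event puts the model into backoff with a 30000 ms timer and the second returns it to scanning with a 60000 ms timer.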

Let me know if you have any questions and if you're able to integrate these changes.

0 YiKai Chen over 9 years ago in reply to JasonB

Guru 735695 points

Hi Jason,
I see others reporting a similar issue on the E2E forum. Is this a formal patch to fix the rejoin issue?

0 JasonB over 9 years ago in reply to YiKai Chen

TI__Expert 8950 points

We are still testing the proposed fixes, but it would help us if others with similar issues could also try this code.

0 YiKai Chen over 9 years ago in reply to JasonB

Guru 735695 points

Hi Jason,
You can read The Seven's issue in e2e.ti.com/.../502380

0 The Seven over 9 years ago

Guru 11440 points

Hi Leo Cahalan, thank you for your post!

I am using Z-Stack 1.2.2 on the CC2530, and I have a similar issue.

In my case the device executes the REJOIN_BACKOFF case in the ZDO_REJOIN_BACKOFF event,

but the device state always stays at DEV_NWK_DISC.

Have you seen that?

SET UP:

After the ZED joins the network, power off the ZC, so that the ZED cycles through rejoin scan === rejoin backoff === rejoin scan.

Sometimes my issue happens!

This issue happens on both 1.2.2 and 1.2.2a.

CHANGE ITEM TEST:

I have changed the rejoin scan period to 10 s, 20 s, 3 minutes, 5 minutes, and so on,

and the rejoin backoff to 30 s, 30 minutes, and so on.

It seems that longer periods trigger this issue more easily (this is based only on the results of repeated testing).

BR!

0 JasonB over 9 years ago in reply to The Seven

TI__Expert 8950 points

Hi The Seven,

Try implementing my changes as shown above. If I recall correctly, the device being stuck in DEV_NWK_DISC is a side effect of the same issue.

0 The Seven over 9 years ago in reply to JasonB

Guru 11440 points

Hi JasonB
If the device is stuck in DEV_NWK_DISC, can it rejoin the network once the next rejoin cycle comes?

br!

0 YiKai Chen over 9 years ago in reply to The Seven

Guru 735695 points

I would suggest that you test Jason's patch first, and if you still see problems, you can report them.

0 The Seven over 9 years ago in reply to YiKai Chen

Guru 11440 points

Hi Yikai

I added that code. Could I also add "Start rejoin scan on key press" in the application?

if ( keys & HAL_KEY_SW_4 ) // P04 wakeup press
{
 // make sure the rejoin scan is started only once
 if(Being_Scaning == FALSE)
 { 
   // only act when the device is in DEV_NWK_BACKOFF;
   // a key press makes the device switch to rejoin scan right away
   if(zclEmerButton_NwkState == DEV_NWK_BACKOFF)
   {
    // set a flag so that the device does not enter here a second time
    // while it is restarting the rejoin scan
    Being_Scaning = TRUE;
    // stop the zgDefaultRejoinBackoff timer
    osal_stop_timerEx( ZDAppTaskID, ZDO_REJOIN_BACKOFF );
    // switch the device to rejoin scan right away
    ZDApp_ChangeState(DEV_NWK_DISC);
    ZDApp_StartJoiningCycle();
    // make sure the device only scans for zgDefaultRejoinScan
    // and then goes back into rejoin backoff
    osal_start_timerEx( ZDAppTaskID, ZDO_REJOIN_BACKOFF, zgDefaultRejoinScan );
   }

  }
}

BR!

0 YiKai Chen over 9 years ago in reply to The Seven

Guru 735695 points

I see no problem with it.

0 The Seven over 9 years ago in reply to YiKai Chen

Guru 11440 points

Hi Yikai,
It seems to work so far,
but I don't get the point of JasonB's code.

1. Why does it set prevDevState to DEV_NWK_BACKOFF no matter what state the device is in during the ZDO_REJOIN_BACKOFF event?
2. What is the purpose of the rejoin failure limit? It seems JasonB did not find the root cause and it is just like "doing a reset".

It is really hard for me to understand!

0 YiKai Chen over 9 years ago in reply to The Seven

Guru 735695 points

I agree with you that Jason seems to be doing a reset. I think the root cause might be hidden in the Z-Stack kernel. If you want to fix the root cause, the kernel might need to be updated. So I think it would be easier to use the patch to fix it outside the Z-Stack kernel.

0 The Seven over 9 years ago in reply to YiKai Chen

Guru 11440 points

Hi all,

Unfortunately, this patch does not seem to fix it completely.

SET UP:

rejoin backoff period: 1800000 // 30 minutes

rejoin scan period: 180000 // 3 minutes

PollRate: 45 s

Shut down the ZC once the ZED has successfully joined the network, and then test the rejoin logic.

After several rejoin backoff/scan cycles, turn the ZC back on and check whether the ZED can rejoin successfully.

Today I found two strange cases.

1. I applied this patch to the IAS_ZONE devices. After several test cycles, I found that the ZED could not notice that it was disconnected from the ZC.

At that moment the ZC was shut down. I pressed the ZED key to send a message to the ZC; normally, when it does not receive the ACK, it becomes an orphan.

But it didn't. I looked at the log and there were no more ACKs when it sent the message. I did it again, and it still never became an orphan even though its messages were no longer being ACKed!

This case is rare. I tested all day and it happened only once.

2. Sometimes the ZED does not follow the rejoin logic (rejoin scan 3 minutes =====> rejoin backoff 30 minutes =====> rejoin scan 3 minutes).

Normally it must follow the default rejoin timing.

But I found one ZED that did a rejoin scan for 3 minutes, then slept for 3 minutes, and then did a rejoin scan for 3 minutes again. It did not keep this rejoin cycle forever, though.

After several cycles, it switched back to the default cycle (rejoin scan 3 minutes =====> rejoin backoff 30 minutes =====> rejoin scan 3 minutes).

BR!

0 YiKai Chen over 9 years ago in reply to The Seven

Guru 735695 points

I would suggest that you also attach a sniffer log for Jason to check.

0 JasonB over 9 years ago in reply to The Seven

TI__Expert 8950 points

1. I made a small mistake in my initial solution post, I should not have set prevDevState to DEV_NWK_BACKOFF inside the else part of the ZDO_REJOIN_BACKOFF event handling code. It should not affect the functionality because prevDevState would be set to DEV_NWK_BACKOFF anyway right after the device exits backoff, but it was redundant. I will update my other post.

2. The rejoin limit counter is not a necessary addition, but it is something that ZDApp.c did not previously implement. Leo's original problem (constant 440ms pinging) is actually unrelated to the rejoin limit counter I implemented, but both together provide a more complete solution.

Leo's problem was due to ZDApp_ProcessOSALMsg() not properly changing the value of devState based on prevDevState, which should be solved with my changes. Without the rejoin limit counter I added, *normal* operation would mean the device would try to do an unsecure rejoin on all channels over and over, until the ZDO_REJOIN_BACKOFF timer set with zgDefaultRejoinScan as the timeout value finally expires and the device is put into backoff. If the user wants this to be the case, they're welcome to either not implement the counter changes or set the max tries value to a sufficiently large number, but I figure if the user leaves zgDefaultRejoinScan to its default value of 15 minutes or some other large-ish time value, they should also be given the option to force the device into backoff after a certain number of failed unsecure all channel connection attempts.
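As a rough illustration of the counter idea, here is a standalone sketch. It merges the increment (done in ZDApp_ProcessOSALMsg() in the patch) and the threshold check (done in ZDApp_ProcessNetworkJoin()) into one hypothetical helper, so it models the intent rather than the exact call order in ZDApp.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors MAX_TC_ALL_CH_REJOIN_FAILURES from the patch. */
#define MAX_FAILURES 3u

/* Record one failed all-channel attempt. Returns true when the caller
 * should force the device into backoff; the counter is reset at that point. */
bool note_all_channel_failure(uint8_t *failure_count)
{
    (*failure_count)++;
    if (*failure_count >= MAX_FAILURES) {
        *failure_count = 0u;
        return true;
    }
    return false;
}
```

With a counter starting at 0, the first two calls return false and the third returns true and resets the counter. Raising MAX_FAILURES to a large value leaves the scan timer as the only way into backoff.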

Also, it seems that your problem is not the same as Leo's, so as YK suggested it would help me if you could post a sniffer log of your issue happening.

0 The Seven over 9 years ago in reply to JasonB

Guru 11440 points

Sorry JasonB, I regret my carelessness.

****1

I forgot to save the log when case 1 (not becoming an orphan) happened.

****2

I saw case 2 in the recorded video.

(I set an LED to blink in ZDApp_NetworkInit(), so that I can tell when a rejoin scan is in progress.)

Yesterday I found a new strange case where the ZED sometimes stops rejoin scanning (but I don't know what state it was in at that moment).

SET UP:

rejoin scan period: 180000

rejoin backoff period: 1800000

pollrate: 45 s

default channel: -DDEFAULT_CHANLIST=0x04888800 // 26 23 19 15 11

Shut down the ZC after the ZED joins the network.
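For anyone checking the channel list above: each set bit N in the mask selects channel N, so 0x04888800 decodes to channels 11, 15, 19, 23 and 26. A small sketch of that decoding follows; channels_from_mask() is a hypothetical helper, not a Z-Stack function.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the 2.4 GHz channel numbers (11..26) whose bits are set in mask
 * into out, lowest first, and returns how many were written.
 * out must have room for 16 entries. */
size_t channels_from_mask(uint32_t mask, uint8_t *out)
{
    size_t n = 0;
    for (uint8_t ch = 11u; ch <= 26u; ch++) {
        if (mask & ((uint32_t)1u << ch)) {
            out[n++] = ch;
        }
    }
    return n;
}
```

An all-channel scan has to visit up to 16 channels while this mask limits it to 5, which is worth keeping in mind when comparing scan durations between setups.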

Log : 5187.456.rar

It shows that after 15:08 there are no more beacon requests until 30 minutes later, although the rejoin backoff period is only 30 minutes.

I have some questions:

1. When a "strange case" happens, what can I do for you?

2. Is it OK to change the default rejoin periods to whatever I want (e.g. 30 minutes or 3 minutes)?

3. Does it matter if I set up 10 ZEDs to test the rejoin logic at the same time in the same place?

BR!

0 The Seven over 9 years ago

Guru 11440 points

Hi Leo
If the device does not need POWER_SAVING enabled, does that mean Jason's patch does not need to be added?

BR!

0 Leo Cahalan over 9 years ago in reply to The Seven

Prodigy 110 points

If you have not enabled POWER_SAVING, then the problem I encountered should not be an issue for you. However, I think that some of the suggested code changes may impact other error handling and so it may be safer to add it in your project.

0 Leo Cahalan over 9 years ago in reply to JasonB

Prodigy 110 points

Hi Jason,

The code change fixes the not going back to sleep problem. Thanks for sorting this out.

I have now spent more time examining the code, looking for some explanation for an occasional problem we are experiencing in the field where an end device leaves the network for no apparent reason. The only area that I suspect may cause this is where the end device attempts an insecure rejoin; when it does this, the key is deleted by a call to ZDApp_ResetNwkKey(). If the end device rejoins to a router and the coordinator is not reachable at the time, then the end device will not get a security key and will leave the network and reboot.

When I run a test with the setup described earlier where the end device is orphaned, the parent is the coordinator but the device entry is deleted in the coordinator table and the table is full, I only see the state machine using DEV_NWK_SEC_REJOIN_CURR_CHANNEL. ProcessNetworkJoin does not change devState as prevDevState is set to DEV_NWK_BACKOFF which is not one of the states that triggers a transition.

This is the state machine flow:
DEV_INIT
DEV_NWK_DISC
DEV_NWK_SEC_REJOIN_CURR_CHANNEL
ProcessNetworkJoin (devState 4, prevDevSte 12, nwkStatus 2)
ZDApp_NetworkInit

If I reboot the end device, then it will cycle through the different join modes as expected. I am not sure though if the DEV_NWK_TC_ join modes are handled. I don't see anything specific to DEV_NWK_TC_REJOIN_CURR_CHANNEL when the end device attempts to rejoin.

In case I made a mistake pasting your code into ZDApp.c, could you post your file?

One other worrying thing I have noticed while running this test is that if I leave the end device running for several hours where it is attempting to rejoin to the coordinator every 10 minutes, it will eventually stop seeing beacon requests and require a reboot to get out of this mode. Right now I am not sure if this may be some issue related to running the debugger. I will repeat this test in normal run mode and see if I can repeat it.

Thanks

0 Simeon Felis over 9 years ago in reply to JasonB

Intellectual 350 points

Thanks for this patch. I have the following requirement on a portable end-device:

Transmit every 4-6 hours a status message

When out of range, use the 4-6 hours interval as backoff-time

On button press, leave the backoff state immediately and, when the end device joins, immediately send a packet (personal alarm message).

Now in Switch_processZStackMsgs, I catch the zstackmsg_CmdIDs_DEV_STATE_CHANGE_IND. When the state changes to zstack_DevState_DEV_END_DEVICE, I immediately send the packet. This crashes the CC2630.

So you can see that after the rejoin some messages are exchanged on Endpoint 0x00. But my packet is not transferred (I'm using endpoint 0x05).

I cannot say whether the CC2630 really crashes, but it does not respond to anything anymore.

0 Simeon Felis over 9 years ago in reply to Simeon Felis

Intellectual 350 points

When I add a delay of 100ms after the rejoin, the packet transmission works. Btw I use Zstackapi_AfDataReq() for the transmission.
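One way to express that delay without blocking is to note the time of the state change and hold the packet until enough time has passed. The sketch below only shows the timing check; ok_to_send() and the millisecond tick source are assumptions, and the 100 ms figure is simply the value that worked in this test, not a documented requirement.

```c
#include <stdbool.h>
#include <stdint.h>

/* Settle time after the rejoin before the first application packet. */
#define SETTLE_MS 100u

/* True once at least SETTLE_MS have elapsed since the join was reported. */
bool ok_to_send(uint32_t joined_at_ms, uint32_t now_ms)
{
    /* Unsigned subtraction stays correct across tick-counter wraparound. */
    return (uint32_t)(now_ms - joined_at_ms) >= SETTLE_MS;
}
```

The application would record joined_at_ms when the device state change is reported and call ok_to_send() from a periodic or timer-driven handler before queuing the alarm packet.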
