CC1352P: zigbeeHAgw script: recovery after server crash

Peter Hoyer

Part Number: CC1352P

Hi Ryan,

We still have rare cases in which the ZigBee Linux server crashes. When that happens, the script enters a loop like this:

[20:08:39.495,055] [Z_STACK/HNDL] ERROR : ERROR: signal 11 was trigerred:
[20:08:39.495,216] [Z_STACK/HNDL] ERROR : Fault address: 0xfffffff8
[20:08:39.495,248] [Z_STACK/HNDL] ERROR : Fault reason: address not mapped to object
[20:08:39.496,883] [Z_STACK/HNDL] ERROR : Stack trace unavailable
[20:08:39.496,979] [Z_STACK/HNDL] ERROR : Executing original handler...
pid 725 is not there
count is 1, not 4
kill -SIGUSR2 681
caught SIGUSR2, a server other than NWKMGR died!
waiting for GATEWAY SERVER to exit
tracker exiting
[20:08:41.496,238] [GATEWAY/LSTN] ERROR : SRSP Cond Wait timed out!
[20:08:41.496,418] [GATEWAY/LSTN] ERROR : apicSendSynchData() failed getting response
[20:08:43.497,055] [GATEWAY/MAIN] ERROR : SRSP Cond Wait timed out!
[20:08:43.497,161] [GATEWAY/MAIN] ERROR : apicSendSynchData() failed getting response
recv: Connection reset by peer
waiting for OTA SERVER to exit
waiting for Zstack linux to exit
waiting for NPI to exit
NETWORK MANAGER exited with code 140 on Sun Mar 7 20:08:48 CET 2021
making sure there are no lingering servers...
there are 0 NPI servers
there are 0 ZLS servers
there are 0 GATEWAY servers
there are 0 NWKMGR servers
there are 0 OTA servers
(total 0)
done
a server besides NWKMGR has exited!
ignoring exit code 140 from netmgr
making sure there are no lingering servers...
there are 0 NPI servers
there are 0 ZLS servers
there are 0 GATEWAY servers
there are 0 NWKMGR servers
there are 0 OTA servers
(total 0)
done
waiting for netmgr to exit ( pid 0 ) on Sun Mar 7 20:08:51 CET 2021
oops! Network manager has already exited (!) on Sun Mar 7 20:08:51 CET 2021
making sure there are no lingering servers...
there are 0 NPI servers
there are 0 ZLS servers
there are 0 GATEWAY servers
there are 0 NWKMGR servers
there are 0 OTA servers
(total 0)
done
a server besides NWKMGR has exited!
ignoring exit code 127 from netmgr
making sure there are no lingering servers...
there are 0 NPI servers
there are 0 ZLS servers
there are 0 GATEWAY servers
there are 0 NWKMGR servers
there are 0 OTA servers
(total 0)
done
waiting for netmgr to exit ( pid 0 ) on Sun Mar 7 20:08:53 CET 2021
oops! Network manager has already exited (!) on Sun Mar 7 20:08:53 CET 2021
making sure there are no lingering servers...
there are 0 NPI servers
there are 0 ZLS servers
there are 0 GATEWAY servers
there are 0 NWKMGR servers
there are 0 OTA servers
(total 0)
done
a server besides NWKMGR has exited!
ignoring exit code 127 from netmgr
making sure there are no lingering servers...
there are 0 NPI servers
there are 0 ZLS servers
there are 0 GATEWAY servers
there are 0 NWKMGR servers
there are 0 OTA servers
(total 0)
done
waiting for netmgr to exit ( pid 0 ) on Sun Mar 7 20:08:56 CET 2021
oops! Network manager has already exited (!) on Sun Mar 7 20:08:56 CET 2021
making sure there are no lingering servers...
there are 0 NPI servers
there are 0 ZLS servers
there are 0 GATEWAY servers
there are 0 NWKMGR servers
there are 0 OTA servers
(total 0)
done
a server besides NWKMGR has exited!
ignoring exit code 127 from netmgr
making sure there are no lingering servers...
there are 0 NPI servers
there are 0 ZLS servers
there are 0 GATEWAY servers
there are 0 NWKMGR servers
there are 0 OTA servers
(total 0)
done
waiting for netmgr to exit ( pid 0 ) on Sun Mar 7 20:08:58 CET 2021
oops! Network manager has already exited (!) on Sun Mar 7 20:08:58 CET 2021
making sure there are no lingering servers...
there are 0 NPI servers
there are 0 ZLS servers
there are 0 GATEWAY servers
there are 0 NWKMGR servers
there are 0 OTA servers
(total 0)
done

Technically I could kill the script at the point where it prints "oops" and restart the process. I am asking first in case you have an in-script idea for handling this correctly, since it looks like there is already a restart mechanism in place that fails for some reason.
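A minimal sketch of that external workaround, assuming the script's console output is redirected to a log file. The function names and the `OOPS_LIMIT` threshold are hypothetical and not part of zigbeeHAgw; only the "oops" marker text is taken from the log above.

```shell
#!/bin/sh
# Hypothetical threshold: number of "oops" lines tolerated before a restart.
OOPS_LIMIT=3

# Count how often the loop marker appears in the given log file.
count_oops() {
    grep -c 'oops! Network manager has already exited' "$1"
}

# Succeed (exit status 0) when the marker count reaches the threshold.
needs_restart() {
    [ "$(count_oops "$1")" -ge "$OOPS_LIMIT" ]
}
```

A watchdog could call `needs_restart` periodically on the log file and, when it succeeds, kill and relaunch the gateway script. Rotating or truncating the log after a restart would be needed so that old markers are not counted again.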

Regards
Peter

over 4 years ago

Ryan Brown1 over 4 years ago

TI Guru 211947 points

Hi Peter,

Inside the while loop of zigbeeHAgw, an indication of a network manager failure triggers a reset condition, which issues a stop_all followed by a start_all. I can see the stop_all prints in your debug log but nothing indicating that start_all had begun (the first thing it should do is echo "Starting the ZigBee gateway subsystem"). You could investigate this further on your system, or manually restart the process as you already suggested.
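The log check described above can be automated. The following is only a sketch, assuming the console output has been captured to a file; `restart_seen` is a hypothetical helper, and the two marker strings are the stop_all cleanup line from the log and the start_all banner quoted above.

```shell
#!/bin/sh
# Succeed (exit status 0) if the log contains the start_all banner somewhere
# after a stop_all cleanup line; fail (exit status 1) otherwise.
restart_seen() {
    awk '
        /making sure there are no lingering servers/ { stopped = 1 }
        stopped && /Starting the ZigBee gateway subsystem/ { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

If `restart_seen` fails on a log that shows repeated cleanup passes, start_all most likely never ran, which points at the restart branch of the script rather than at the servers themselves.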

Regards,
Ryan

Peter Hoyer over 4 years ago in reply to Ryan Brown1

Expert 1360 points

Thank you Ryan, this was our mistake: a test version with "start_all" commented out accidentally went into production.
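A pre-deployment check along these lines might catch that kind of slip. This is a sketch only: `has_disabled_start_all` is a hypothetical helper, and it flags any line that begins with `#` followed by `start_all`, so an ordinary comment that starts with that word would also match.

```shell
#!/bin/sh
# Succeed (exit status 0) if the given script contains a line where
# start_all appears to be commented out, e.g. "#start_all" or "  # start_all".
has_disabled_start_all() {
    grep -qE '^[[:space:]]*#[[:space:]]*start_all' "$1"
}
```

Running it against the gateway script before release, and failing the build when it succeeds, would flag a disabled restart call.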
