This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

CC1352P: SRSP Cond Wait timed out

Part Number: CC1352P
Other Parts Discussed in Thread: Z-STACK, CC2538

We have are having an issue with a smart home gateway with CC1352 adn Z-Stack Linux gateway. After a certain amount of commands sent to a smart plug, we get the following message:

[16:54:21.697,127] [Z_STACK/LSTN] ERROR  : SRSP Cond Wait timed out!
[0m[37m[16:54:21.696,797] [GATEWAY/LSTN] ERROR  : SRSP Cond Wait timed out!
[0m[37m[16:54:21.696,894] [GATEWAY/LSTN] ERROR  : apicSendSynchData() failed getting response

I doubt that this has to do with UART speed or other speed-related parameters as the problem can be reproduced depending on the amount of commands sent, not their interval:

  1. 2 secs interval: after about 50 minutes
  2. 5 secs interval: after about 3 hours
  3. 15 secs interval: after about 10 hours

When running a test script changing the plug state each x seconds, at some point we do not receive the state anymore (this is when we get the timeout message) but the plug continues to switch correctly. After another 20 minutes or so the plug also stops switching. For the second state, I cannot find a specific point in the log, you can just see this on the plug.

Could it be some counter or buffer which will overrun after some time?

In the attached log you can find the event at 16:54:21.697,127. I needed, we can provide further logs. The problem happens with different units (gateway and plug) so it is not related to a specific hardware.

ZigBee gateway version is the latest (exact versions of each module at the beginning of the trace), the Z-Stack on the module is 4.20.00.35.

Regards
Peter

  • Sorry, the attachment was missing in the previous post.

    5807.zigbee.zip

  • Hi,

    Does the issue clear up if you:
    - reset the ZNP?
    - restart the gateway servers?

    Is it possible to provide a log with more verbose logs enabled?
    For example, in zigbeeHAgw --> setup_for_arm --> export NPI_CMD="./$NPI_NAME $NPI_CONFIG -v 0xFFFFFFFF" (for all the other servers too).

    Regards,
    Toby

  • Hi Toby,

    Attached please find the logs with verbose enables.

    zigbee1.log is the log of the initial test where it stopped responding after 2520 toggle operations.Timestamp 10:57:42:772.

    I have also tested if it is a problem with the end device. I unplugged and replugged our plug to see if it was working again but no. So we can exclude that option.

    Switching works fine again when restarting the server (zigbee2.log).

    In your answer, did you mean restarting single processes? I have tried with ZLSZNP and it was not a good idea as all other processes except GATEWAY_SRVR were closed as well. If you want me to try restarting single server processes, can you please advise on a good order? As I have to wait each time for about 2 hours to get back to the condition, it is not an ideal scenario having to try all combinations...

    Regards
    Peter

    ZigBee Logs.zip

  • Thanks for sharing the logs, I see if I can find anything there.

    Peter Hoyer said:
    In your answer, did you mean restarting single processes?


    I meant restarting all of them (I guess one by one). An example of this is shown in the script zigbeeHAgw:

    stop_all()
    {
    	stop_tracker
    	stop_nwkmgr
    	stop_others
    stop_others()                                              
    {                                                   
    	stop_gateway
    	stop_ota
    	stop_zlsznp
    	stop_npi                                                    
    }     
    start_all()                                                 
    {    
    ...
    				start_npi
    				SERVER_PID_VAR=NPI_PID
    			;;
    			2)
    				start_zlsznp
    				SERVER_PID_VAR=ZLSZNP_PID
    			;;
    			3)
    				start_netmgr
    				SERVER_PID_VAR=NETWORK_MGR_PID
    			;;
    			4)
    				start_gateway
    				SERVER_PID_VAR=GATEWAY_SERVER_PID
    			;;
    			5)
    				start_otaserver
    				SERVER_PID_VAR=OTA_SERVER_PID
    			;;

  • Hi Toby,

    Then I have done the right thing. Yes, if restarting all services, it works again well. You have it in the logs.

    Regards
    Peter

  • Looks like Z_STACK/READ is no longer active around where the issue begins to happen (which explains why the timeouts occur every 2 seconds for Z_STACK/LSTN (your app is configured to send onoff commands every 2 secs in this case?)).

    I don't see any further debug prints, so this could be a socket issue (recv  calls by Z_STACK/READ ).

    Also, there seems to be a lot of unnecessary route requests (that I've seen come up in another recent case).
    Can you use the following patch for reference?

    reduce_route_req.patch

  • Hi Toby,

    We are still testing but the switching issue seems to be resolved with your patch or a least it does not occur within very few hours anymore.

    Thank you and regards
    Peter

  • Hi Toby,

    It just takes longer now but unfortunately, we do still have this issue. It depends on the system. Sometimes it happens after just 2 hours with a 5 seconds switching interval, sometimes only after 12 or more hours with even a 2 seconds switching interval. Software and hardware are always the same.

    I have attached some more logs in case they might be helpful.

    How can we find the actual cause of this problem?

    Regards
    Peter

    zb.tar.gz

  • Ok, I will look into this and update you within the next week.

    If possible, can you share sniffer logs as well ?
    Ideally the log encompasses the period where the issue happens, but having steady-state network traffic should be sufficient to see what is happening in the typical case.

    Thanks,
    Toby

  • Hello,

    I have the same problem. I posted about it few month ago (e2e.ti.com/.../873858) but unfortunately no clear solution was obtained. In my case I send some commands (for example 40 on/off commands) to some devices (Interval is 500ms at maximum, but after any command is sent I wait for its response and then send next command to GW server). In first two or three tries it works good but after that confirmation responses are failed and I need to restart Zigbee Server to go back to normal state. I use CC2538 as ZNP device and Z-Stack home 1.2.2. I will be so thankful if you can help me about solving this issue because it has stopped our work.

  • Can this patch be used with Z-Stack Linux GW 1.2.2? and how can I use it?

  • Hi Ahmad.TI,

    This patch was made for the Zigbee 3.0 gateway. I recommend migrating to that one if possible. It should be compatible with Zigbee HA devices.

    The files which the patch touches are gatewaysrvr.c and zcl.c.

    For zcl.c, the change is applied in function zcl_SendCommand:

        status = AF_DataRequest( destAddr, epDesc, clusterID, msgLen, msgBuf,
                                 &zcl_TransID, options, AF_DEFAULT_RADIUS );

    is replaced with:

        // TP: if response message, use the same seqNum as original message instead of zcl_TransID
        if ( !specific &&
            ( cmd == ZCL_CMD_READ_RSP ||
              cmd == ZCL_CMD_WRITE_RSP ||
              cmd == ZCL_CMD_CONFIG_REPORT_RSP ||
              cmd == ZCL_CMD_DEFAULT_RSP ||
              cmd == ZCL_CMD_DISCOVER_ATTRS_RSP ) )
        {
            status = AF_DataRequest( destAddr, epDesc, clusterID, msgLen, msgBuf,
                                     &seqNum, options, AF_DEFAULT_RADIUS );
        }
        else
        {
            status = AF_DataRequest( destAddr, epDesc, clusterID, msgLen, msgBuf,
                                     &zcl_TransID, options, AF_DEFAULT_RADIUS );
        }

    For gatewaysrvr.c:

      if (pTransEntry &&  ZCL_CMD_DEFAULT_RSP == pInMsg->hdr.commandID) {
         // We received ack for our command, remove the transaction.
        gwMsgRemoveTrans( pTransEntry->appTransId );
      }
    

    is added before:

       // Free memory
       free( zclFrameInd.payload.data );

    Feel free to evaluate with these changes. There may be other changes between the 1.2.2 and 3.0 gateway to consider, but again, it is recommended if possible to use 3.0 gateway.

    >>Interval is 500ms at maximum

    In general I would not recommend sending too frequently, as this could overflow the buffers at ZNP. This could cause some messages to not be sent over the air, meaning that the gateway will time out commands that it is expecting a response for. Having this continuously happen could put the gateway in a non-ideal state.

    Since it seems this issue is easily reproducible for you, my suggestion for debugging your current setup is to check the transactions (e.g. "[GATEWAY/LSTN] INFO: (GW posted) cmdId: 39, zclTransId 68") with what is sent over the air (e.g. do you see the ZCL command with this sequence number?).

  • Hi Toby, 

    Thanks you for you reply. I made these changes but unfortunately no significant improvement was achieved. I increased interval time to one second and it became better but still not acceptable. In most cases that this situation happens, the command is sent to destination device but confirm command fails and also no status from device is received, so the command is sent over the air. 

    Here is the log file when the problem occurs, but in this case the server came back to normal state after a while (after that no command was sent to it for few minutes). In time [12:25:47.219,580], first time out happened and this failure continued until [12:27:45.898,562]. After this time the server worked fine. Why it can't recover from failure state and why it became OK after not sending command for a while?

    4010.Zigbee-Server-Log1.txt

  • We have made various updates to server wrapper functionality (e.g. api_client.c, api_server.c, ... , please see the files in C:\ti\Zigbee_3_0_Linux_Gateway_1_0_1\source\Projects\zstack\linux\srvwrapper). Your issue seems related to this. Please check out those files and consider migrating that to your project.

  • Hi Toby,

    Attached a very quick one. This time it took less than one hour while switching 2 plugs on and off in alternation. After restarting the server, it will work fine again.

    Regards
    Peter

    1321.zigbee.zip3487.sniffer log.zip

  • Thank you, this looks to be a helpful data point; creating a similar traffic profile on my end will ideally reproduce the issue in similar fashion, to find the root cause.

  • Hi Peter,

    An update from my side:

    After investigating the logs which you've provided, I believe the issue is related to and EINTR, which occurs during call to recv() in SISreadThreadFunc: recv: Interrupted system call. Notice that in zigbee.log, all the "SRSP Cond Wait timed out!" happen after "recv: Interrupted system call".

    Specifically for Z_STACK/READ:

      // Read from socket
      do
      {
        // Normal data
        n = recv( pInstance->sAPIconnected, &hdrbuf, sizeof(hdrbuf), MSG_WAITALL );
    
        if ( n <= 0 )
        {
          if ( n < 0 )
          {
            perror( "recv" );
          }
          else
          {
            uiPrintfEx(trINFO, "Peer closed connection\n" );
          }
          done = 1;
        }

    As you can see, if recv() runs across EINTR, then this thread essentially ends (done = 1). The result is that Z_STACK/READ no longer receives messages, causing Z_STACK/LSTN to always time out for synchronous transactions (READ is the only thread which signals that a SRSP is received).

    To overcome this, we can re-do the recv() if it was interrupted by a system call, something like:

    // Normal data
        do {
            n = recv( pInstance->sAPIconnected, &hdrbuf, sizeof(hdrbuf), MSG_WAITALL );
        } while ( n < 0   &&   errno == EINTR);
        
    // ...
            if ( len > 0 )
            {
              do {
                n = recv( pInstance->sAPIconnected, pMsg + 1, len, MSG_WAITALL );
              } while ( n < 0   &&   errno == EINTR);
            }
            else
            {
              // There are no payload bytes; which is also valid.
              n = 0;
            }

    Regards,
    Toby

  • Hi Toby,

    I have applied your changes but now it runs ino a segfault after some time, I suppose when the othe problem would have normally occured.

    I have attached the full log file and below a screenshot from when it happens. There is also the ZB trace attached and, just to make sure, a copy of the file which I have changed.

    Regards
    Peter

    5875.zigbee.zip

    sniffer trace.zip

    api_client.c
    Fullscreen
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    /*********************************************************************
    Filename: api_client.c
    Revised: $Date: 2014-11-18 18:32:59 -0800 (Tue, 18 Nov 2014) $
    Revision: $Revision: 41168 $
    Description: This file contains the API Server Wrapper client APIs.
    Copyright 2013 - 2014 Texas Instruments Incorporated. All rights reserved.
    IMPORTANT: Your use of this Software is limited to those specific rights
    granted under the terms of a software license agreement between the user
    who downloaded the software, his/her employer (which must be your employer)
    and Texas Instruments Incorporated (the "License"). You may not use this
    Software unless you agree to abide by the terms of the License. The License
    limits your use, and you acknowledge, that the Software may not be modified,
    copied or distributed unless used solely and exclusively in conjunction with
    a Texas Instruments radio frequency device, which is integrated into
    your product. Other than for the foregoing purpose, you may not use,
    reproduce, copy, prepare derivative works of, modify, distribute, perform,
    display or sell this Software and/or its documentation for any purpose.
    YOU FURTHER ACKNOWLEDGE AND AGREE THAT THE SOFTWARE AND DOCUMENTATION ARE
    PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED,
    INCLUDING WITHOUT LIMITATION, ANY WARRANTY OF MERCHANTABILITY, TITLE,
    NON-INFRINGEMENT AND FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT SHALL
    TEXAS INSTRUMENTS OR ITS LICENSORS BE LIABLE OR OBLIGATED UNDER CONTRACT,
    NEGLIGENCE, STRICT LIABILITY, CONTRIBUTION, BREACH OF WARRANTY, OR OTHER
    LEGAL EQUITABLE THEORY ANY DIRECT OR INDIRECT DAMAGES OR EXPENSES
    INCLUDING BUT NOT LIMITED TO ANY INCIDENTAL, SPECIAL, INDIRECT, PUNITIVE
    OR CONSEQUENTIAL DAMAGES, LOST PROFITS OR LOST DATA, COST OF PROCUREMENT
    OF SUBSTITUTE GOODS, TECHNOLOGY, SERVICES, OR ANY CLAIMS BY THIRD PARTIES
    (INCLUDING BUT NOT LIMITED TO ANY DEFENSE THEREOF), OR OTHER SIMILAR COSTS.
    Should you have any questions regarding your right to use this Software,
    contact Texas Instruments Incorporated at www.TI.com.
    *********************************************************************/
    #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <pthread.h>
    #include <semaphore.h>
    #include <poll.h>
    #include <sys/time.h>
    #include <unistd.h>
    #include <signal.h>
    #include <bits/local_lim.h>
    #ifndef NPI_UNIX
    #include <netdb.h>
    #include <arpa/inet.h>
    #endif
    #include "hal_types.h"
    #include "hal_rpc.h"
    #include "api_lnx_ipc_rpc.h"
    #include "api_client.h"
    #include "trace.h"
    #include "pb_utils.h"
    #include "api_server.h"
    /*********************************************************************
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

  • This time it looks like the NPI server catches this issue (C:\ti\Zigbee_3_0_Linux_Gateway_1_0_1\source\Projects\zstack\linux\RemoTI-Linux-master\Projects\tools\LinuxHost\ipclib\server\npi_lnx_ipc.c).

    Similar concept can be applied here to the recv() calls.

    int NPI_LNX_IPC_ConnectionHandle(int connection)
    {
    //...
    
    	do {
    		n = recv(connection, npi_ipc_buf[0], RPC_FRAME_HDR_SZ, 0);
    	} while(n == -1 && errno == EINTR);
    
    //...
    			do {
    				n = recv(connection, (uint8*) &npi_ipc_buf[0][RPC_FRAME_HDR_SZ], ((npiMsgData_t *) npi_ipc_buf[0])->len , 0);
    			} while(n == -1 && errno == EINTR);

  • Hi Toby,

    Unfortunately, this resolves the segfault but not the problem. Actually my impressions is that - since intrdoducing the while loop in znp_zdo - the error happens even much earlier. Attached one more log.

    Any chance to resolve this issue within a reasonable time. Otherwise, any suggestion how to implement a "sof restart" which which only restart that specific thread and not the whole server once the SRSP Timeout has been reached?

    Regards
    Peter

    zigbee1 (2).log

  • Apologies for the delay.

    Looking at this most recent log, there is a similar error which causes the NPI server to exit:

    ERR:: Interrupted system call
    [ERR] npi_ipc_errno 0x02030100
    [10:01:21.687,265] [NPISRVR/U_RX] PKT_HEX: [SOCZIGB>>NPISRVR ] [ASNC] 03:44:80:00:01:8E
    [10:01:21.687,385] [NPISRVR/ACBK] PKT_HEX: [ NPISRVR>>Z_STACK ] [bcst] 03:44:80:00:01:8E
    [10:01:21.687,799] [Z_STACK/MAIN] PKTTYPE: [ Z_STACK>>>>>>>>>>>GATEWAY ] afDataConfirmInd
    [10:01:21.687,863] [Z_STACK/MAIN] PKTBODY: cmdID = <NOT_FOUND>
    [10:01:21.687,904] [Z_STACK/MAIN] PKTBODY: status = ZSuccess
    [10:01:21.687,944] [Z_STACK/MAIN] PKTBODY: endpoint = 0x00000001 (1)
    [10:01:21.687,985] [Z_STACK/MAIN] PKTBODY: transID = 0x0000008E (142)
    Exit socket while loop
    pid 692 is not there

    Close to top of log, we see NPI_PID=692.

    The npi_ipc_errno 0x02030100 is defined in npi_lnx_error.h: #define NPI_LNX_ERROR_UART_SEND_FRAME_FAILED_TO_WRITE 0x02030100

    Later we see the part of log you've screenshot:

    [10:01:30.477,901] [Z_STACK/LSTN] PKTTYPE: [ NPISRVR<<Z_STACK ] [SREQ] 04:25:45:C0:4D:00:05
    [10:01:32.478,236] [Z_STACK/LSTN] ERROR : SRSP Cond Wait timed out!
    [10:01:32.478,333] [Z_STACK/LSTN] ERROR : apicSendSynchData() failed getting response

    At this point the NPISRVR is not active, and we see that the zigbeeHAgw script is attempting to restart all the servers:

    resetting ZigBee SoC
    ===================================================
    starting NPI, cmd ' ./NPI_lnx_arm_server NPI_Gateway.cfg -v 0x0000010E ' on Thu Nov 5 10:13:17 CET 2020
    [10:13:17.518,404] [NPISRVR/MAIN] UNMSKBL:
    [10:13:17.518,855] [NPISRVR/MAIN] UNMSKBL: ************************************************
    [10:13:17.518,887] [NPISRVR/MAIN] UNMSKBL: * NPI Server v1.0.2d *

    After this, it looks like the communication has been restored (e.g there are GW_ATTRIBUTE_REPORTING_IND).

    Based on these, there are two options:

    1. Restart the specific server (in this case NPI). This would involve changes to other_server_died() (in zigbeeHAgw, which currently stops then restarts all the servers), and also the track_servers script to specify which PID died.
    2. The "Interrupted system call" can be fixed, using similar methods as before (repeating a blocking operation if EINTR is detected, such as for npi_write()).

  • Toby Pan said:

    1. The "Interrupted system call" can be fixed, using similar methods as before (repeating a blocking operation if EINTR is detected, such as for npi_write()).

    Hi Toby,

    I have added the part you mentioned in 1 to npi_lnx_uart.c and it has been working fine for more than week now so I suppose it is resolved:

    	pthread_mutex_lock(&npi_write_mutex);
    + 	do { 
    		result = write(npi_fd, buf, count);
    + 	} while ( result < 0   &&   errno == EINTR);
    	pthread_mutex_unlock(&npi_write_mutex);

    Maybe you can fix this together with your changes in your previous answers in your version control so that in the next version it will not be necessary to follow-up this fix...

    Regards
    Peter