
CC2538: ZStack 3.0: nwk_buffer messages stuck in send

Part Number: CC2538
Other Parts Discussed in Thread: Z-STACK

I've suspected this for a while, but have only just built the code to confirm it.

The problem:
Over a period of continuous usage, messages (nwkDB_t) build up in the sent state (NWK_DATABUF_SENT), eventually leading to an out-of-memory situation.

Testing for this issue:

We built the following MT command to dump the state of the buffered packets.

/* Callback for nwkDB_FindMatch: emits one AREQ per buffered entry whose
   state matches the requested flag mask and optional destination filter. */
static uint8 nwkAreqSend(nwkDB_t* db, void* _mf ){
  nwk_outgoing_result_t* mf = (nwk_outgoing_result_t*)_mf;
  uint8 buf[9];
  if(!(mf->flags & (1 << db->state))) return 0;
  if(!db->pDataReq) return 0;
  if(mf->dstAddr != 0xfffe && (db->pDataReq->DstAddr.addrMode != Addr16Bit || db->pDataReq->DstAddr.addr.shortAddr != mf->dstAddr)) return 0;
  mf->count ++;
  mf->limit --;

  buf[0] = LO_UINT16( db->pDataReq->DstAddr.addr.shortAddr );
  buf[1] = HI_UINT16( db->pDataReq->DstAddr.addr.shortAddr );
  buf[2] = db->state;
  buf[3] = db->retries;
  buf[4] = db->apsRetries;
  buf[5] = db->lastCnfStatus;
  buf[6] = LO_UINT16( db->macSrcAddr );
  buf[7] = HI_UINT16( db->macSrcAddr );
  buf[8] = db->nsduHandle;

  MT_BuildAndSendZToolResponse(((uint8)MT_RPC_CMD_AREQ | (uint8)MT_RPC_SYS_NWK), MT_NWK_QUEUED_ENTRY, sizeof(buf), buf );

  return !mf->limit;
}

static void MT_NwkQueuedDump(uint8 *pBuf)
{
  uint8 cmdId;
  uint8 retValue[4];
  nwk_outgoing_result_t hd;

  /* parse header */
  cmdId = pBuf[MT_RPC_POS_CMD1];
  pBuf += MT_RPC_FRAME_HDR_SZ;
  
  /* read network address */
  hd.dstAddr = osal_build_uint16( pBuf );
  retValue[0] = pBuf[0];
  retValue[1] = pBuf[1];
  pBuf += sizeof( hd.dstAddr );
  
  /* read type */
  hd.flags = pBuf[0];
  retValue[2] = pBuf[0];
  pBuf += sizeof( hd.flags );

  /* initialize count */
  hd.count = 0;

  /* read byte limit */
  hd.limit = pBuf[0];

  /* get buffered packet count */
  nwkDB_FindMatch(&nwkAreqSend, &hd);
  retValue[3] = hd.count;
  
  //retValue = nwkDB_MaxIndirectSent(dstAddr);

  /* Build and send back the response */
  MT_BuildAndSendZToolResponse(((uint8)MT_RPC_CMD_SRSP | (uint8)MT_RPC_SYS_NWK), cmdId, 4, retValue);
}

Having left a gateway running for a few days (25 devices, a mix of routers and end devices with bidirectional unicast and multicast traffic), there are a few buffered packets in a sent state that are never cleared. Eventually (over weeks) this can exceed the total limit (NWK_MAX_DATABUFS_TOTAL).

We have confirmed this against an otherwise unmodified 3.0.2 build (only the modification above and the known-issue fixes applied) as well as against our patched build.


Hypothesis: This seems to occur most frequently when high numbers of new nsdu handles are being created. Perhaps a rollover is colliding with existing scheduled packets? nsduHandle is only an 8-bit unsigned integer (0...255). If this happens, the delete function, which deletes via nsduHandle, would be unable to clear the correct packet. This all happens inside the binary blob, so I can't prove or disprove this behaviour.
  • Hey Mathew,

    I assume that you've already reviewed the Z-Stack 3.0.2 Known Issues and Fixes and applied changes concerning NWK_MAX_DATABUFS_* among others?  Are you able to provide ZNP/gateway/sniffer logs and point to the direct issue?  Given that the Z-Stack 3.0.2 CC253X solution is no longer updated by TI, I recommend mitigating the issue by instructing a device reset when detected.

    Regards,
    Ryan

  • >> I assume that you've already reviewed the Z-Stack 3.0.2 Known Issues and Fixes and applied changes concerning NWK_MAX_DATABUFS_* among others? 

As per my post, yes. All known issues/fixes have been applied, including the NWK_MAX_DATABUFS_* changes (which would actually make this issue worse).

    >> Are you able to provide ZNP/gateway/sniffer logs and point to the direct issue?  

As this deals with Z-Stack internals, sniffer logs are not particularly useful. We have logs from our own application, which we have been using to diagnose this problem. I've provided the MT command we developed to diagnose the issue.

You can see the issue when two AREQs arrive from two scans sufficiently far apart, showing the same entry stuck in sent status:

    2021-02-15T10:38:23.945Z cc-znp:AREQ <== NWK:queuedEntry { nwkaddr: 0x7812, state: 2, retries: 0, apsRetries: 0, lastCnfStatus: 0, macSrcAddr: 0, nsduHandle: 170 }
    
    2021-02-15T11:12:24.410Z cc-znp:AREQ <== NWK:queuedEntry { nwkaddr: 0x7812, state: 2, retries: 0, apsRetries: 0, lastCnfStatus: 0, macSrcAddr: 0, nsduHandle: 170 }

Messages in this state (sent) will build up indefinitely. With our heavily used devices this means exhaustion within a week; lightly used devices may not exhibit symptoms, or may accumulate slowly (supporting the rollover-collision hypothesis). In my hypothesis I've pointed at the issue (nsduHandle rollover). Of course that's just an educated guess from observation.


    >> Given that the Z-Stack 3.0.2 CC253X solution is no longer updated by TI

Given that both the CC2538 platform and Z-Stack 3.0.x are listed as ACTIVE, should this status not be changed and EOL procedures initiated?

If I am correct in my hypothesis and this is indeed a break in the TI binary blob that you are not going to fix, you should at the very least publish a full disclaimer on the Z-Stack release page. You should also update the status of Z-Stack 3.0 to "no longer developed, with known bugs" rather than the currently stated ACTIVE. No one should be investing time and money in building a solution on a known-broken platform. That doesn't help anyone who has already invested years and is at the end of the process, however.

    >> I recommend mitigating the issue by instructing a device reset when detected.

You recommend resetting the device daily to weekly?

We have a better hack solution via an MT command to manage the issue (by deleting the stalled messages). However, it does need to be fixed; if my hypothesis is correct, this is a reliability reduction due to stalled packets that are never delivered.

    static void MT_NwkBufRemove(uint8 *pBuf)
    {
      uint8 cmdId;
      uint8 retValue;
      nwkDB_t *nwk;
    
      /* parse header */
      cmdId = pBuf[MT_RPC_POS_CMD1];
      pBuf += MT_RPC_FRAME_HDR_SZ;
      
      nwk = nwkDB_FindHandle( pBuf[0] );
      if(nwk != NULL){
        nwkDB_DeleteRecAll(nwk);
        retValue = 0;
      }else{
        retValue = 1;
      }
    
      /* Build and send back the response */
      MT_BuildAndSendZToolResponse(((uint8)MT_RPC_CMD_SRSP | (uint8)MT_RPC_SYS_NWK), cmdId, sizeof(retValue), &retValue);
    }

By the way, from a glance at the SimpleLink firmware, this is likely also an issue there. I do not, however, have the ability to test that at the moment.

  • Hi,

    Many areas in Dallas metroplex are currently experiencing unforeseen power outages due to harsher-than-normal winter weather; we appreciate your patience, as responses may be delayed here.

    Regards,
    Toby

  • Hi Mathew,

    We've supported similar use cases which have not experienced this issue, which is why I requested sniffer logs to better understand your network's traffic profile.  Some of these instances have successfully quadrupled the number of NWK_MAX_DATABUFS_* so I would like to know why you believe this will make the matter worse.  I'd appreciate any suggestions you could offer to replicate this issue.

    Regards,
    Ryan

It will get worse if my hypothesis of nsduHandle rollover within the blob is correct. More packets in flight means a higher likelihood of collision due to rollover. Further, as these are permanently leaked packets (never cleared), increasing the maximums only pushes the problem down the line. You will still need a full module restart at some point, or a hack like ours to clear the stuck packets.

The easiest way to replicate the hypothesis is to create a situation where more than 255 packets are sent before the last scheduled packets are cleared (e.g. waiting for an End Device, or for an ACK that never comes).

The easiest way to test this is probably to drive up the indirect timeout and then break the ACK of a different indirectly connected device.