CC1352P: collector with 15.4 stack and custom-phy becomes non-responsive when there are large number of messages in TX data queue

Part Number: CC1352P
Other Parts Discussed in Thread: CC2652R, CC1312R7, CC1312R

Hi,

In one of our networks with 20 sensor nodes, the collector occasionally becomes non-responsive (roughly once every 1 to 2 months). During this time, if a sensor node becomes orphaned and is then restarted by the external MCU to rejoin the network, it goes into the orphan state again after about 15 seconds. This keeps happening until the collector is restarted by the external MCU.

This seems to coincide with a high number of pending messages in the TX data queue. Since the SDK doesn't seem to provide an easy way to track the number of messages in the TX queue, we added code to track pending messages manually. While the collector is non-responsive, the number of pending messages in the TX data queue usually stays around 16 to 20. I'm not entirely sure whether this is a result or a cause of the collector becoming non-responsive.
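For readers who want to reproduce this kind of manual tracking, here is a minimal sketch in C. It is not the code used in this thread and none of the names come from the SDK: `TxQueueStats` and the `txStats_*` functions are hypothetical, and it assumes the application can hook the point where a data request is accepted and the point where the matching data confirm arrives.

```c
#include <stdint.h>

/* Hypothetical application-side bookkeeping; these names are not SDK APIs. */
typedef struct {
    uint16_t pending;    /* data requests submitted but not yet confirmed */
    uint16_t highWater;  /* largest value of pending seen so far */
    uint32_t underflows; /* confirms that arrived with no matching request */
} TxQueueStats;

static TxQueueStats txStats = {0, 0, 0};

/* Call right after a data request has been accepted for transmission. */
void txStats_onDataRequest(void)
{
    txStats.pending++;
    if (txStats.pending > txStats.highWater) {
        txStats.highWater = txStats.pending;
    }
}

/* Call from the data confirm handler, regardless of the confirm status. */
void txStats_onDataConfirm(void)
{
    if (txStats.pending > 0) {
        txStats.pending--;
    } else {
        /* A confirm without a tracked request points at a bookkeeping bug. */
        txStats.underflows++;
    }
}

uint16_t txStats_pending(void)    { return txStats.pending; }
uint16_t txStats_highWater(void)  { return txStats.highWater; }
uint32_t txStats_underflows(void) { return txStats.underflows; }
```

The sketch assumes both hooks run in the same task context; if they can run concurrently, the counter updates would need to be protected. Logging `txStats_highWater()` periodically gives a rough picture of how close the queue gets to its configured limit.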

Here are the TX/RX queue sizes as defined in the collector.opts file:

-DMAC_CFG_TX_DATA_MAX=60
-DMAC_CFG_TX_MAX=150
-DMAC_CFG_RX_MAX=16

Other relevant information: 1) the maximum number of devices is left at the default of 50; 2) POLLING_INTERVAL is changed to 60 seconds; 3) SDK version: 7.40

There is a recent post on a related issue with the CC2652R. Presumably something has been done about it in the latest SDK, v8.30.

Can people with knowledge of the SDK internals elaborate on how we should manage such a situation? I suspect this has something to do with insufficient RAM in the collector radio, because we have never observed such behavior in smaller networks with fewer sensor nodes. But we also haven't observed it with 15.4-FH, where network sizes, TX/RX data queue sizes, and pretty much everything else remain the same.

So please advise.

Thanks,

ZL

  • Hi Zhiyong,

    Thank you for posting and linking the thread. We have released SimpleLink F2 SDK 8.30, so testing it to see whether the issue is still present in this version would be a good first step. I will also bring this issue to the SW developers next week.

    1. Are you using beacon mode or non-beacon mode?

    2. How many devices are connected to the collector when this happens?

    3. Which RTOS and compiler options are you using?

    Cheers,

    Marie H

  • Hi Marie,

    Thanks for your reply. To answer your questions:

    1) Non-beacon mode; it seems to be the only mode that is compatible with custom-phy.

    2) I observed such behavior in networks with 18 and 20 sensor nodes, but not in networks with < 10 sensor nodes.

    3) TI-RTOS7, TI Clang v3.2.2

    I will give the new SDK and other ways to control the TX queue sizes a try. The problem is that this seems to happen randomly. When it happens, only the RF part of the CC1352P becomes non-responsive. The other core, which handles the GPIO and UART peripherals etc., works just fine. This makes it very hard, if possible at all, for the external MCU to manage.
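One way the application core could make this condition visible to an external MCU is a small stall detector: if messages are pending but no data confirm has arrived for longer than some timeout, the application raises a flag (for example over UART or a GPIO). The sketch below is only an illustration under that assumption. `TxStallDetector` and the `txStall_*` functions are hypothetical names, not SDK APIs, and the timeout is a placeholder that would have to be longer than the longest time a message can legitimately wait (the thread mentions a 60-second polling interval).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stall detector; names and thresholds are illustrative only. */
typedef struct {
    uint32_t lastProgressMs; /* time of the last confirm, or of the first
                                request submitted while the queue was empty */
    uint32_t timeoutMs;      /* how long "pending but no confirm" is tolerated */
    uint16_t pending;        /* requests submitted but not yet confirmed */
} TxStallDetector;

void txStall_init(TxStallDetector *d, uint32_t timeoutMs, uint32_t nowMs)
{
    d->lastProgressMs = nowMs;
    d->timeoutMs = timeoutMs;
    d->pending = 0;
}

/* Call when a data request has been accepted for transmission. */
void txStall_onRequest(TxStallDetector *d, uint32_t nowMs)
{
    if (d->pending == 0) {
        /* Queue was idle, so start timing from this request. */
        d->lastProgressMs = nowMs;
    }
    d->pending++;
}

/* Call when any data confirm arrives, whatever its status. */
void txStall_onConfirm(TxStallDetector *d, uint32_t nowMs)
{
    if (d->pending > 0) {
        d->pending--;
    }
    d->lastProgressMs = nowMs;
}

/* True if something is pending and nothing has been confirmed for too long.
 * Unsigned subtraction keeps the comparison valid across tick wraparound. */
bool txStall_isStalled(const TxStallDetector *d, uint32_t nowMs)
{
    return d->pending > 0 &&
           (uint32_t)(nowMs - d->lastProgressMs) > d->timeoutMs;
}
```

Polling `txStall_isStalled()` from a periodic application timer and forwarding the result to the external MCU would at least give it a signal to act on. Whether a stalled confirm stream actually corresponds to the non-responsive state described here is untested.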

    Best,

    Zhiyong

  • Hi Zhiyong,

    In the 8.30 SDK we have also added a logging tool that enables logs in the TI 15.4-Stack. Maybe this could help us get an indication of what's happening on the collector. 

    If you think the TX/RX data queues are related to the issue, maybe decreasing them would make the problem happen more frequently, thus making it easier to debug?

    (In the case where the RX/TX data queue is overflowing, I would expect the collector to return an error to the application and potentially lose a few packets, but not become unresponsive...)

    Cheers,

    Marie H

  • BTW here is a SimpleLink Academy lab which walks you through adding Logging to your project:

    https://dev.ti.com/tirex/explore/node?node=A__AUtBCHw9KYI99xItuPjn4w__com.ti.SIMPLELINK_ACADEMY_CC13XX_CC26XX_SDK__AfkT0vQ__LATEST&placeholder=true

    Cheers,

    Marie H

  • Hi Marie,

    Thanks for the tip on newly added logging capabilities. I will see if we can add this into debugging.

    Meanwhile could you please advise on how to calculate the RAM requirement for increased TX/RX queue sizes? My hunch is that this may have something to do with limited RAM on CC1352P because we never observed this problem in smaller networks with < 10 sensor nodes. If this is true, I assume we can solve/improve the problem by switching to parts with larger RAM such as CC1352P7, CC1312R7 or CC1312R4.

    Best,

    ZL

  • Hi Zhiyong,

    You can use the global variable heapmgrMemFreeTotal to see how much heap you have available at any point, for example right after scheduling your TX packet.
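A single reading is easy to miss, so it may help to keep a low-water mark of the sampled value. The following is a minimal sketch, not SDK code: `HeapWatch` and the `heapWatch_*` functions are hypothetical, and the free-heap value is passed in as a plain `uint32_t` so the sketch does not depend on how `heapmgrMemFreeTotal` is declared (its exact type and header should be checked in the SDK).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper that remembers the smallest free-heap value sampled. */
typedef struct {
    uint32_t lowWater; /* smallest free-heap value seen so far, in bytes */
    uint32_t samples;  /* number of samples taken */
} HeapWatch;

void heapWatch_init(HeapWatch *w)
{
    w->lowWater = UINT32_MAX;
    w->samples = 0;
}

/* Record one sample; returns true when it sets a new low-water mark. */
bool heapWatch_sample(HeapWatch *w, uint32_t freeBytes)
{
    w->samples++;
    if (freeBytes < w->lowWater) {
        w->lowWater = freeBytes;
        return true;
    }
    return false;
}
```

In a project, the call would look something like `heapWatch_sample(&watch, heapmgrMemFreeTotal)` placed after each TX request, with the low-water mark logged whenever the function returns true.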

    If you want to swap to a higher memory device it should be very simple since we have pin compatible alternatives for most devices.

    Cheers,

    Marie H

  • Hi Marie,

    Have you heard back from your SW team about the issue?

    I want to report that we also observed a similar issue in a large network of CC1312R radios running SDK v7.4 and the 15.4 stack in frequency hopping mode. When the TX data queue on the collector radio grows beyond 16 (just an arbitrary reporting threshold), some, but not all, sensor nodes may (not always) have trouble getting tracking messages and therefore become orphaned. This is different from the network with radios in non-beacon mode and the custom WB-DSSS PHY: in the NB/custom-PHY network, when this happens, all sensor nodes become orphaned and cannot reconnect to the collector until the collector is reset.

    Hope this will help your SW team debug the issue.

    Best,

    ZL

  • Hi Zhiyong,

    Can you capture a sniffer log of this happening? I would be very curious about what's going on over the air: whether the orphaned sensors are trying to connect, whether the collector is able to receive the PAN AS, and whether something happens during the connection phases.

    Cheers,

    Marie H

  • Hi Marie,

    It will be very hard to capture any sniffer logs. The affected networks are running at a client's site, and this has only happened a few times, at random. Also, the last time I checked, the sniffer tool didn't support custom-phy, and FH networks operate on tens of different channels.

    Judging from the fact that all sensor nodes were able to (re)join the network within minutes of the collector radio being restarted (once when it was power-cycled together with the entire system, another time after the external MCU toggled the reset pin), I believe the problem lies in the collector radio alone.

    Hope this helps,

    ZL

  • Hi Zhiyong,

    Did you have any luck triggering this issue by sending more packets from the collector side? Or do you no longer think it's related to heap/tx buffer?

    Cheers,

    Marie H