
CC1352P7: BeaglePlay - using co-processor binary in 1352, configured FH, 200 kbps - higher incidence of spontaneous re-joins with larger channel list.

Part Number: CC1352P7
Other Parts Discussed in Thread: UNIFLASH


I'm testing a system that uses a BeaglePlay as the collector for a sensor network. It uses the co-processor model, with the collector application running on the BeaglePlay Linux host. When running in frequency-hopping (FH) mode with a larger number of sensor nodes, I've noticed that if I set the channel list to around 8 channels, the network runs without issue. However, if I configure the default 64 channels, I see 3-10 re-joins per sensor per day. Is there some reason why this would happen? It's difficult to debug because the problem goes away when the channel list is reduced, which makes packet sniffing impractical as a means of investigation.

Some pertinent info:

  • Sensors are CC1352R1 devices.
  • Network is configured for FH, 200 kbps 2GFSK.
  • Sensor firmware is based on the dmm_154sensor_remote_display_oad_app example.
  • Sensors are set to send a data report every 20 seconds, and poll every 90 seconds.
  • The tracking interval (TRACKING_INIT_TIMEOUT_VALUE) has been set to 300 seconds.
  • The network has 9 sensors joined. Another test network with only 2 sensors does not exhibit this problem.
  • Sensors are about 10 feet from the collector, which should rule out saturation as an issue.

What I'm looking for is some ideas of things to check that might be causing this, and/or suggestions for how to narrow down the cause.

  • Hi Joshua,

    1. You can do a simple noise-floor test with SmartRF Studio: just listen to the background noise where you're running the test, to make sure there's no interference disturbing it.
    2. Were you able to see the disconnect reason on the sensors before the re-join? It would be useful to know if there are specific channels that are prone to disconnect.
    3. Is it the same sensors each time or does it vary between all of them? If it's the same ones you can check the frequency offset on these specific devices.
    4. You say you're using the DMM example. Are you also using BLE when this happens?

    Cheers,

    Marie H

  • Marie,

    1.) I've not done this yet, but will and report back.

    2.) I've not seen the disconnect reason. Could you provide any info on how best to capture this? I presume maybe the sync loss callback in the sensor could be instrumented?

    3.) It varies across the population of sensors.

    4.) I am using only advertising, and quite infrequently: a burst of 4 advertisements every 30 seconds. I did notice that the frequency of re-connects decreased a bit when I changed from exactly every 30 seconds to every 30.25 seconds. I did this because the sensors send data at 20-second intervals, and I didn't want the advertisements to beat against the data reports every third report cycle (30 s and 20 s coincide every 60 s).

  • Marie,

    Here are noise measurements for the 64 channels in use. They don't look concerning to me; what do you think?

    If it is advisable to exclude the worst channels, do you have a recommended algorithm or method for doing so? Is there an example?

    Regards,

    Josh

    Channel  Frequency (MHz)  RSSI (dBm)
    0        902.4            -101
    1        902.8            -98
    2        903.2            -102
    3        903.6            -100
    4        904.0            -85
    5        904.4            -102
    6        904.8            -96
    7        905.2            -102
    8        905.6            -94
    9        906.0            -97
    10       906.4            -101
    11       906.8            -85
    12       907.2            -100
    13       907.6            -97
    14       908.0            -95
    15       908.4            -102
    16       908.8            -94
    17       909.2            -102
    18       909.6            -85
    19       910.0            -102
    20       910.4            -100
    21       910.8            -90
    22       911.2            -104
    23       911.6            -97
    24       912.0            -102
    25       912.4            -93
    26       912.8            -97
    27       913.2            -103
    28       913.6            -96
    29       914.0            -102
    30       914.4            -94
    31       914.8            -102
    32       915.2            -102
    33       915.6            -100
    34       916.0            -100
    35       916.4            -84
    36       916.8            -100
    37       917.2            -100
    38       917.6            -90
    39       918.0            -100
    40       918.4            -93
    41       918.8            -102
    42       919.2            -97
    43       919.6            -100
    44       920.0            -102
    45       920.4            -92
    46       920.8            -104
    47       921.2            -102
    48       921.6            -102
    49       922.0            -102
    50       922.4            -101
    51       922.8            -103
    52       923.2            -92
    53       923.6            -104
    54       924.0            -95
    55       924.4            -100
    56       924.8            -102
    57       925.2            -98
    58       925.6            -103
    59       926.0            -85
    60       926.4            -102
    61       926.8            -102
    62       927.2            -92
    63       927.6            -101
  • It appears that the noise is a red herring in this case.

    What I've observed is that with the co-processor configuration, there are a lot of incorrectly formatted messages coming from the co-processor, which triggers a bug in the host_collector binary. Specifically, in the file mt_msg.c, in the function mt_msg_rx, the restart of parsing after an RX checksum error doesn't work correctly. The issue is that the mt_msg isn't reset once an error occurs, so when parsing restarts, the code thinks there are already bytes in the receive buffer when there are not. This leads to either no recovery, or a very slow one. The fixed code (change starting at line 25) looks like this:

    /*!
     * @brief - read a message from the interface.
     * @param pMI - msg interface
     * @returns NULL if no message received, otherwise a valid msg
     */
    static struct mt_msg *mt_msg_rx(struct mt_msg_interface *pMI)
    {
        int r;
        int nneed;
        struct mt_msg *pMsg;
    
        /* do we allocate a new message? */
        if(pMI->pCurRxMsg == NULL)
        {
            pMI->pCurRxMsg = MT_MSG_alloc(-1, -1, -1);
            /* allocation error :-(*/
            if(pMI->pCurRxMsg == NULL)
            {
                return (NULL);
            }
            pMI->pCurRxMsg->pLogPrefix = _incoming_msg;
            MT_MSG_setSrcIface(pMI->pCurRxMsg, pMI);
        }
    
        // JVT, move this block of code down below the try_again: label, so that when
        // a checksum or other error happens, the parsing is correctly restarted.
    
        //else
        //{
        //    MT_MSG_resetMsg(pMI->pCurRxMsg, -1, -1, -1);
        //}
        // pMsg = pMI->pCurRxMsg;
        // LOG_printf(LOG_DBG_MT_MSG_traffic,
        //            "%s: rx-msg looking for start\n",
        //            pMI->dbg_name);
    
    try_again:
        /* zap what we have */
        MT_MSG_resetMsg(pMI->pCurRxMsg, -1, -1, -1);
    
        pMsg = pMI->pCurRxMsg;
    
        LOG_printf(LOG_DBG_MT_MSG_traffic,
                   "%s: rx-msg looking for start\n",
                   pMI->dbg_name);
    
    
        /* zap the existing buffer for debug reasons */
        memset((void *)(&(pMsg->iobuf[0])), 0, pMsg->iobuf_idx_max);
    
        /* how many bytes should we get? */
        nneed = (
            (pMI->frame_sync ? 1 : 0) + /* sync */
            (pMI->len_2bytes ? 2 : 1) + /* len */
            1 + /* cmd0 */
            1 + /* cmd1 */
            0 + /* unknown length yet */
            (pMI->include_chksum ? 1 : 0)); /* checksum */
    
    read_more:
        r = mt_msg_rx_bytes(pMI, nneed, pMI->intermsg_timeout_mSecs);
        if(r == 0)
        {
            LOG_printf(LOG_DBG_MT_MSG_traffic, "%s: rx-silent\n", pMI->dbg_name);
            return (NULL);
        }
    
        if(r < 0)
        {
            /* something is wrong */
            LOG_printf(LOG_DBG_MT_MSG_traffic, "%s: Io error?\n", pMI->dbg_name);
            return (NULL);
        }
    
        /* we start reading at byte 0 in the message */
        pMsg->iobuf_idx = 0;
    
        /* should we find a frame sync? */
        if(pMI->frame_sync)
        {
            /* hunt for the sync byte */
            /* and move the sync byte to byte 0 in the buffer */
            uint8_t *p8;
    
            p8 = (uint8_t *)memchr((void *)(&pMsg->iobuf[0]),
                                   0xfe,
                                   pMsg->iobuf_nvalid);
            if(p8 == NULL)
            {
                /* not found */
                MT_MSG_log(LOG_DBG_MT_MSG_traffic | LOG_DBG_MT_MSG_raw,
                           pMsg, "Garbage data...\n");
                goto try_again;
            }
    
            /* frame sync must start at zero. */
            if(p8 != pMsg->iobuf)
            {
                /* need to shift data over some */
    
                /* how many bytes to shift? */
                int n;
                n = (int)(&(pMsg->iobuf[pMsg->iobuf_nvalid]) - p8);
                /* shift */
                memmove((void *)(&pMsg->iobuf[0]), (void *)p8, n);
                /* zero what we deleted */
                memset((void *)(&(pMsg->iobuf[n])), 0, pMsg->iobuf_nvalid - n);
                pMsg->iobuf_nvalid = n;
                /* Do we have enough bytes? */
                if(nneed > pMsg->iobuf_nvalid)
                {
                    /* No - we need more, go get more */
                    goto read_more;
                }
            }
            /* DUMMY read of the sync byte */
            MT_MSG_rdU8(pMsg);
        }
    
        /* Found start */
        /* how big is our length? */
        if(pMI->len_2bytes)
        {
            pMsg->expected_len = MT_MSG_rdU16(pMsg);
        }
        else
        {
            pMsg->expected_len = MT_MSG_rdU8(pMsg);
        }
    
        pMsg->cmd0 = MT_MSG_rdU8(pMsg);
        pMsg->cmd1 = MT_MSG_rdU8(pMsg);
    
        nneed += pMsg->expected_len;
    
        /* read the data component */
        r = mt_msg_rx_bytes(pMI, nneed, pMI->intersymbol_timeout_mSecs);
        if(r != nneed)
        {
            /* something is wrong? */
            if(r > 0)
            {
                /* we got some ... but not enough .. */
            LOG_printf(LOG_DBG_MT_MSG_raw,
                "Short read ... got: %d, want: %d, try again...\n",
                r, nneed);
                goto read_more;
            }
            MT_MSG_log(LOG_ERROR, pMsg, "%s: expected: %d, got: %d\n",
                pMI->dbg_name, nneed, r);
        dump_recover:
            LOG_printf(LOG_DBG_MT_MSG_traffic, "Flushing RX stream\n");
            /* Dump all incoming data until we find a sync byte */
            STREAM_rdDump(pMI->hndl, pMI->flush_timeout_mSecs);
            goto try_again;
        }
    
        /* Dummy read to the end of the data. */
        /* this puts us at the checksum byte (if present) */
        MT_MSG_rdBuf(pMsg, NULL, pMsg->expected_len);
    
        /* do the checksum */
        if(pMI->include_chksum)
        {
            r = MT_MSG_calc_chksum(pMsg, 'f', pMsg->iobuf_nvalid);
            if(r != 0)
            {
                MT_MSG_log(LOG_ERROR, pMI->pCurRxMsg, "%s: chksum error\n",
                    pMI->dbg_name);
                LOG_hexdump(!LOG_ERROR, 0, pMsg->iobuf, pMsg->iobuf_nvalid);
                goto dump_recover;
            }
            else
            {
                //LOG_printf(LOG_DBG_MT_MSG, "chksum ok, message =\n");
                LOG_hexdump(LOG_DBG_MT_MSG_traffic, 0, pMsg->iobuf, pMsg->iobuf_nvalid);
            }
        }
        /* We have a message */
        MT_MSG_set_type(pMsg, pMsg->pSrcIface);
        /* Set the iobuf_idx to the start of the payload */
    
        /* since we will be parsing the message... */
        /* Set the iobuf_idx to the start of the payload */
        pMsg->iobuf_idx = (
            (pMI->frame_sync ? 1 : 0) +
            (pMI->len_2bytes ? 2 : 1) +
            1 + /* cmd0 */
            1 /* cmd1 */
           );
        /* next time we need to allocate a new message */
        pMI->pCurRxMsg = NULL;
    
        /* return our message */
        return (pMsg);
    }
    

    I have other findings as well, about the source of the corrupted data, but I will add another reply for that.

  • The other finding is that there definitely seems to be a problem with the MAC co-processor, which leads to the corrupted messages, sometimes to crashes, and in a few cases even to overwritten flash.

    I'm running a mostly unmodified collector. I've added one feature to read out the free heap, status of message queues and such, and I've increased some stack sizes, but other than that, it's a stock TI project.

    In my test setup I've got 8 sensors: 3 report at 2-10 second intervals, and the remaining 5 report at 100 ms intervals, sending ~60-byte packets. Running this way, the co-processor emits a defective packet approximately once per hour, and at least last night it froze after about 5 hours of running. Because of the previously described issue with host_collector not recovering correctly, this was hard to discern.

    I proceeded to perform the following investigations:

    1. I read out the entire flash memory using UniFlash, in this case, the only change was to the area used for non-volatile storage. In the past, I've seen the last block of flash erased.
    2. I attached the debugger without resetting the target. The code did not operate correctly, and was running code in the 0x10000000 region of memory. Resetting the device with the reset pin did not restore proper operation.
    3. Using the "restart" button in the debugger did restore proper operation, until the next time I reset the device via the reset pin. This suggests that something very early in the startup code (which the debugger skips when restarting the app via the restart button) is going wrong.
    4. The only thing that fully restored proper operation was a full power down of the board. Upon powering back up, with no other changes, the 1352 booted normally whenever the reset line was toggled.

    All of this suggests something along the lines of a wild write to memory outside of the intended buffer, perhaps hitting some control register in the device that isn't fully re-configured on a pin reset.

  • Hi Joshua,

    Can you test with the 50 kbps PHY and see if it's reproducible?

    Cheers,

    Marie H

  • Marie, it would appear from further investigation, which I've detailed in other threads, that the problem here was related to the serial link between the 1352 and the BeaglePlay being unreliable. In particular, the BeaglePlay has an issue with dropping bytes received from the UART; it is believed that this is related to an erratum and will be fixed shortly. Additionally, when data loss does occur, the NPI parser does not recover correctly. I've fixed that problem, and it has essentially eliminated the re-joins.

    The other posts:
    https://e2e.ti.com/support/wireless-connectivity/sub-1-ghz-group/sub-1-ghz/f/sub-1-ghz-forum/1468716/cc1352p7-mac-coprocessor-npi-link-loses-packets---beagleplay-with-integrated-cc1352p7-radio/5675554#5675554

    https://e2e.ti.com/support/wireless-connectivity/sub-1-ghz-group/sub-1-ghz/f/sub-1-ghz-forum/1470276/cc1352p7-sub-gig-15-4-co-processor-locks-up---long-interval-between-occurrences-recovery-requires-power-cycle

    I think we should close this ticket, as the remaining issues are addressed in these other two tickets.