CC3235SF: SL_DEVICE_EVENT_FATAL_SYNC_LOSS

Part Number: CC3235SF
Other Parts Discussed in Thread: 66AK2G12, CC3135

Hello, a customer is seeing a concerning issue under a specific set of circumstances that seems to be related to some interaction between the UART and the MQTT client that has us baffled at this point.

Background: The customer is using a TI 66AK2G12 SoC running TiRTOS as the OS and writing data to the UART to be published via MQTT. They are sending out two MQTT messages per second with a total payload of ~100 and 800 bytes at QoS 1. We have not made any modifications to SDK MQTT libraries. This has been observed using both SDKs 3.20 and 5.30 and their related service packs.

Issue: After some time (it seems to range between 15 and 60 minutes, and occurs somewhat faster with SDK 5.30) we suffer an MQTT disconnection and are unable to reconnect to the broker without a reset. At this point the HTTP servers are also unresponsive, although the AP reports that we are still connected. We are unable to reproduce the issue when another host is used to write the messages via UART, or when the published messages are hardcoded, even if we drastically increase the frequency and/or size of the published messages. Equally perplexing, if the customer switches the host OS from TiRTOS to Linux, they report that they are unable to observe the issue. We were finally able to capture a debug session of the event: at the moment the MQTT disconnection occurs, there is a SL_DEVICE_EVENT_FATAL_SYNC_LOSS. This is the call stack:

Both devices are utilizing RTS/CTS flow control, and the customer reports that the baud rate does not seem to affect the issue. They have reported other issues related to the TiRTOS UART driver in the past, but have (seemingly) resolved these issues. Interestingly, if a 10ms delay (have not tested with less, but I suspect that number may be able to be lowered) is inserted between the UART writes in their code the issue seems to be resolved. Similarly, if a 0.10ms delay is inserted prior to calling MQTTClient_publish the issue is likewise resolved. The same issues are observed when QoS 2 is used, but not with QoS 0. I am unsure of how the UART is interacting with the MQTT to cause the fatal event and why they are only observing this when using TiRTOS.

Questions: Have any similar issues been observed before? Why would the issue not occur when the host's OS is changed? Why does inserting a delay between UART writes or before a call to MQTTClient_publish resolve the issue--i.e., what is the overlap? Where should we be looking to solve this?

Thanks for your help.

  • Hi Sam,

    I've assigned this thread to the subject matter expert and will have an answer for you shortly.

    BR,

    Seong

  • It definitely looks like a porting issue (and thus related to the OS running on the host).

    The host driver lost sync with the NWP.

    Note that the driver requires use of Mutex (to protect commands/event processing) and wrong implementation can lead to such issues.
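
    As a rough illustration of the mutex point above, here is a minimal sketch of serializing a command/response exchange with a lock. It is not the SimpleLink host driver's code: `host_driver_t`, `host_driver_send_command` and `transport_exchange` are hypothetical names, and the transport is a stand-in that simply echoes the sequence number. A POSIX mutex is used only because it is easy to try on a desktop; a real port would use whatever lock primitive the host OS provides.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side driver state; all names are illustrative only. */
typedef struct {
    pthread_mutex_t lock;     /* serializes command/response exchanges */
    uint32_t next_seq;        /* sequence number of the next command   */
    uint32_t last_acked_seq;  /* sequence number of the last response  */
} host_driver_t;

void host_driver_init(host_driver_t *drv)
{
    assert(drv != NULL);
    pthread_mutex_init(&drv->lock, NULL);
    drv->next_seq = 1;
    drv->last_acked_seq = 0;
}

/* Stand-in for the real transport: echoes the sequence number back. */
static uint32_t transport_exchange(uint32_t seq)
{
    return seq;
}

/*
 * Sends one command and waits for its response while holding the lock,
 * so no other thread can interleave its own exchange.
 * Returns 0 on success, -1 if the response does not match the command.
 */
int host_driver_send_command(host_driver_t *drv)
{
    int status = 0;

    pthread_mutex_lock(&drv->lock);
    uint32_t seq = drv->next_seq++;
    uint32_t acked = transport_exchange(seq);
    if (acked != seq) {
        status = -1;              /* command and response out of step */
    } else {
        drv->last_acked_seq = acked;
    }
    pthread_mutex_unlock(&drv->lock);

    return status;
}
```

    Without the lock, two threads could each read and increment `next_seq` around the other's exchange, and a response could be matched against the wrong command.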

    (I guess the device is a CC3135 and not a CC3235 as posted.)

  • Hi Kobi, thanks for the response. Is there any way for the host (66AK2G12) to fix this aside from changing the OS (this is planned long term, but is not an immediate/near term solution), or anywhere that we should suggest they take a closer look at?

    Also, when you say that the driver requires a mutex, are you thinking that we might be missing a mutex on the code running on our side (the cc3235)? If so this would be related to sl_start and sl_stop calls?

  • I'm not sure what exactly your architecture is. If you are using the CC3235, then what do you mean by the host? How does this host connect to the CC3235?

  • The host, in this scenario, is a TI 66AK2G12 SoC using TiRTOS as the OS (when the issue is observed, and Linux when it's not observed) communicating with a TI CC3235SF SoC using FreeRTOS as the OS via UART. In theory, the host could be some other device running some other OS communicating with the CC3235SF by UART or SPI, but in this instance it's those two specifically. The issue has not been observable with any other host configuration.

  • I see. Are you using the SimpleLink AT-CMD interface or a proprietary protocol between the CC3235 and the host?

    I'm not familiar with any SYNC LOSS issue on CC3235. We'll need more details from you. 

    When running a different host configuration, the main impact will be on timing (the host may send bursts of commands, or respond faster to events from the CC3235).

    Are you sending one packet every 0.5 seconds, and does the CC3235 go to LPDS in between (or is the CC3235 always on)?

  • It's a proprietary protocol but is similar in overall functionality to the AT CMD interface.

    The CC3235 is always on. When we are seeing this issue, the 66AK2G12 host is performing:

    while connected to broker:
    {
        UART write to cc3235 to initiate an MQTT publish
        UART write to cc3235 to initiate an MQTT publish
        sleep 1 second
    }

    which is successful for 15-60 minutes before a broker disconnection occurs due to the fatal event.

    If Linux is used rather than TiRTOS, but no other modifications are made (as I understand it), the disconnection doesn't occur, so it doesn't seem to be directly an issue with either codebase but some strange interaction with TiRTOS.

    Likewise if their code is altered (still using TiRTOS) to introduce a short sleep period (tested at 10ms, but I suspected even shorter durations could be used) between the UART writes, the issue doesn't occur. Alternately, if their original code is left as is, but the cc3235 code is changed to add a 0.10ms sleep before a call to MQTTClient_publish the issue doesn't occur.
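
    For reference, the host-side workaround can be sketched roughly as below. This is not the customer's code: `send_publish_pair`, `uart_write_fn` and `delay_ms_fn` are hypothetical names, and the write and delay operations are passed in as callbacks because the real ones are platform specific (on the TiRTOS host the delay would presumably be an RTOS sleep call). The recording stubs at the bottom exist only so the call order can be checked on a desktop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback types standing in for the platform's UART write
 * and millisecond sleep; neither is an actual TI driver signature. */
typedef int  (*uart_write_fn)(const void *buf, size_t len);
typedef void (*delay_ms_fn)(unsigned ms);

/*
 * Sends the two publish requests with a gap between them instead of
 * back to back. Returns 0 on success, or the first non-zero write result.
 */
int send_publish_pair(uart_write_fn write_fn, delay_ms_fn delay_fn,
                      const void *first, size_t first_len,
                      const void *second, size_t second_len,
                      unsigned gap_ms)
{
    assert(write_fn != NULL && delay_fn != NULL);

    int rc = write_fn(first, first_len);
    if (rc != 0) {
        return rc;
    }
    delay_fn(gap_ms);             /* the inserted delay, e.g. 10 ms */
    return write_fn(second, second_len);
}

/* Recording stubs: 'W' marks a UART write, 'D' marks a delay. */
char trace[8];
size_t trace_len;
unsigned last_gap_ms;

int fake_uart_write(const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    if (trace_len < sizeof(trace) - 1) {
        trace[trace_len++] = 'W';
    }
    return 0;
}

void fake_delay_ms(unsigned ms)
{
    last_gap_ms = ms;
    if (trace_len < sizeof(trace) - 1) {
        trace[trace_len++] = 'D';
    }
}
```

    Calling `send_publish_pair(fake_uart_write, fake_delay_ms, "a", 1, "b", 1, 10)` with the stubs records the sequence write, delay, write.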

    Happy to provide additional info if needed.

  • What are the priorities of the threads on the CC3235 side? 

    Can you test with sl_Task at a higher priority than the UART thread (the one that waits on RX from the host)?

  • Currently sl_Task has priority set to 9 and UART RX thread is using 5. TX thread priority is set at 6. I don't know what the thread priorities are for the host, but I can inquire if needed.

  • I changed the sl_Task to priority 2 to see what would happen (left RX as is) and behavior is the same. The call stack at the point of failure is also the same.

    Took 25 minutes to observe the failure, which is about average as well.

  • Priority 9 is high. Keep sl_Task at this high priority.

    What is the priority of the mqtt RX thread?

    Can you provide the NWP log (see chapter 20 in the programmer's guide)?

  • The MQTT RX thread is only at a 1. I can try bumping that up to see if it makes any difference. To provide the NWP logs, we would need access to pin 62, which is being utilized in the current setup for flow control. I've been trying to avoid having to go down that route due to having to make changes for that while not being sure the issue would still persist after those changes were made. Although if you know of any other way to obtain those logs, I'm game.

    One thing I did notice, which may or may not help--after playing with some of the other thread priorities with no discernible impact, I modified FreeRTOSConfig.h to enable configUSE_TIME_SLICING, and that caused the sync error to occur very quickly (~2 minutes).

  • I bumped the thread priority of the thread running the MQTT RX task up to the same level as the UART RX (5), and it seems like this may have resolved the issue. We'll need more testing to ensure this hasn't/doesn't cause other issues, but it's a good sign. 
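
    To make the before/after configurations easy to compare, here is a small sketch that writes the priorities down and checks them against the rule of thumb this thread arrived at. The rule itself is an assumption drawn from our testing, not a documented SDK requirement, and `thread_priorities_t` and `priorities_look_safe` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Task priorities on the CC3235 side (FreeRTOS: larger number means
 * higher priority). The struct is illustrative, not an SDK type. */
typedef struct {
    int sl_task;   /* SimpleLink host driver task (sl_Task) */
    int uart_rx;   /* thread waiting on RX from the host    */
    int uart_tx;   /* thread writing back to the host       */
    int mqtt_rx;   /* thread running the MQTT RX task       */
} thread_priorities_t;

/*
 * Rule of thumb from this thread (an assumption, not an SDK rule):
 *  - sl_Task stays above all application threads, and
 *  - the MQTT RX thread is not below the UART RX thread, so a busy
 *    UART RX thread cannot starve it.
 */
bool priorities_look_safe(const thread_priorities_t *p)
{
    assert(p != NULL);
    bool sl_task_highest = p->sl_task > p->uart_rx &&
                           p->sl_task > p->uart_tx &&
                           p->sl_task > p->mqtt_rx;
    bool mqtt_rx_not_starved = p->mqtt_rx >= p->uart_rx;
    return sl_task_highest && mqtt_rx_not_starved;
}
```

    With the original values (sl_Task 9, UART RX 5, UART TX 6, MQTT RX 1) the check fails; with MQTT RX raised to 5 it passes.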

    Do you have any insight into why this issue is only observed when the host is running TI RTOS? I would expect that this issue would occur regardless of the OS on the host, but that's not what we have observed.

  • I don't have a specific observation. I can only suspect that somehow (due to the timing of this specific configuration) the MQTT RX thread was starved long enough to cause the sync loss. I will need to re-check the code to see how that is possible, as it is not a common issue.