
AWS-IOT CC3200 hangup in aws_iot_mqtt_yield

Other Parts Discussed in Thread: CC3200

Hi,

I am using the latest TI-RTOS 2.16.01.14, the NS package included with it, the latest CC3200 service pack, and AWS 1.1.1 from git.ti.com.

In my project I have a task, slTask, based on the AWS sample code on git. It pends on an event with a 1 s timeout, then calls mqtt_yield and/or publishes ~600 bytes roughly every 1.3 s, depending on the event. I suppose this means yield can be called at 0.3 s intervals in some cases.

It works well, but after some hours of use, it will suddenly hang in Ssock_recv -> sl_Recv when calling yield.

See call stack below, and see attached picture:

From the UART log output it is apparent that '--> sleep', which is printed after yield, stops being output. The background threads keep producing content to be published, and after a few seconds the event handler (netwifi/wifi.c) reports that the socket has been closed because of some error. I think it mentions RX fragmentation or something, which it tends to report a lot, even though nothing is being sent to the device over this socket except perhaps keepalives. I don't have the log handy, unfortunately.

After a while of this, the background threads don't have any more slots to store outgoing data, and when execution is halted it is invariably the above scenario.

This problem has persisted across TI-RTOS upgrades and AWS upgrades, and it's unpredictable but it always occurs.

I hope you may have some insight to offer.

Best regards,
Aslak

  • Aslak,

    I've used the AWS examples but have not run any extended tests as you are doing.

    Let me make sure I understand your situation. Your application task wakes up every 1 second, calls mqtt_yield and then publishes 600 bytes of data. This works fine for several hours and then hangs in the call to mqtt_yield. Meanwhile, the wifi event handler runs asynchronously and reports that a socket has been closed. Have I got this correct?

    Does your application subscribe to any AWS topics? If not, then there is no need to call mqtt_yield.

    When your task wakes up, might it run past the next scheduled wakeup event? This should be okay, but maybe try increasing the sleep time to see if it makes a difference.

    I'm guessing that when the event handler reports the socket has been closed, that's when your application hangs. Is this correct? If so, then your application needs to re-establish a connection with the AWS MQTT broker. You probably need to call aws_iot_mqtt_disconnect to release resources at the middle layer.

    Network robustness is always a challenge. For example, the AWS server might arbitrarily close a socket just to free up resources. I also wonder if AWS expects IoT devices to maintain a continuous connection for long periods of time or if they should reconnect on every transaction. If you really want to update every second, then it is not practical to reconnect on each transaction. But you might consider reconnecting every hour (i.e. disconnect and then reconnect). At a minimum, your application needs to handle network related errors. The AWS examples are simple demonstrations of API usage, not starting points for IoT applications.
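    To make the periodic-reconnect idea concrete, here is a minimal sketch in plain C. The `should_reconnect` helper and the one-hour interval are my own illustration (not part of the AWS SDK); the aws_iot_mqtt_* calls in the comment are the SDK functions already discussed in this thread.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch of the "reconnect every hour" idea. The unsigned subtraction
     * is modular, so the check is safe across 32-bit tick wrap-around.
     * In the application loop you would do roughly:
     *
     *   if (should_reconnect(now_s, last_connect_s, 3600u)) {
     *       aws_iot_mqtt_disconnect();   // release middle-layer resources
     *       // ... reconnect and re-subscribe here ...
     *       last_connect_s = now_s;
     *   }
     */
    static bool should_reconnect(uint32_t now_s, uint32_t last_connect_s,
                                 uint32_t interval_s)
    {
        return (uint32_t)(now_s - last_connect_s) >= interval_s;
    }
    ```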

    ~Ramsey

  • Hi Ramsey,

    Thanks for your quick response!

    In this case the MQTT disconnect callback doesn't occur (it prints to screen if it does).

    The asynchronous socket message happens a while after the apparent hang in mqtt_yield, and is independent as you say.

    It does subscribe, but nothing is received in this case. However, the doxygen for yield says: [...] Yield() must be called at a rate faster than the keepalive interval. [...] The keepalive interval here is 10 sec, so 1 s is well within that; if anything I may be too eager, but well.

    I would be quite happy if I could just re-establish the connection based on some mqtt error code returned, but yield doesn't return to give me this error.

    The thing is, I *could* capture the socket message, but since the AWS lib is somewhat abstracted, I can't be sure it's my socket that was disconnected. Also, even if I capture this async event, the stalled thread is still stalled. I had just assumed that on a low-level socket close, any blocking reads (such as sl_Recv pending forever) would be released. Should I call aws_iot_mqtt_disconnect from another thread to release this?

    Now, I'm not saying there isn't something else odd about my project that makes this happen, and I'll make a reduced test case to check. But whether it's the AWS cloud closing the socket or a problem with the router, I kind of expect yield() to return and let me know.

    Best regards,
    Aslak

  • Aslak,

    Understood. But we have seen low-level failures in the SimpleLink driver where upper layers like MQTT don't return. One issue is that we ship the SimpleLink library with asserts enabled. You might need to rebuild SimpleLink to return status codes instead. This might help the MQTT layer recognize low-level failures. In fact, the original intent was for SimpleLink to be rebuilt by the end user.

    Regarding your stuck task: if your task is spinning and you get an asynchronous event from SimpleLink, you might need to lower the task's priority in your event handler to stop it from spinning. Then you can schedule another task to perform error handling, which might include terminating the stuck task and restarting it. It will be a challenge to do this without resource leaks.
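    The supervision idea above can be sketched generically with a heartbeat counter (this is a portable pthread illustration, not TI-RTOS API): the monitored task bumps a counter on every loop iteration, and a watchdog samples the counter twice and declares the task stuck if it did not advance.

    ```c
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <unistd.h>

    static atomic_uint heartbeat;

    /* Stands in for the yield/publish loop: bump the heartbeat each pass. */
    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 50; i++) {
            atomic_fetch_add(&heartbeat, 1);
            usleep(10000);                  /* 10 ms of "work" */
        }
        return NULL;
    }

    /* Sample the heartbeat, wait, sample again: no progress means stuck. */
    static bool watchdog_says_stuck(unsigned window_ms)
    {
        unsigned before = atomic_load(&heartbeat);
        usleep(window_ms * 1000);
        return atomic_load(&heartbeat) == before;
    }
    ```

    On a real target the watchdog task would then trigger the cleanup/restart path instead of just returning a flag.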

    ~Ramsey

  • Hi Ramsey,

    I don't think this is related to asserts being enabled, as all the tasks are pending on something sensible (the SimpleLink tasks are pending on a mailbox), and the slTask task is pending on a semaphore, not spinning.

    I reduced my task to just sub/pub every 1 second, and it failed overnight in the same way.

    I can of course delete the task and all the resources I have allocated, but I wasn't "responsible" for the library code's allocations, such as the pool object you can see it pending on, so I wouldn't know how to clean those up.

    Not sure what you mean by rebuilding SimpleLink - I'm just using TI-RTOS here, and including ti.drivers in the config file, which pulls in the WiFi stuff.

    Best regards,
    Aslak

  • Aslak,

    I misunderstood about the slTask behavior. I see now that it is blocked waiting for data. Sorry about that.

    So, I think mqtt_yield is trying to read data from a socket, expecting data that will never come. This causes the task to block indefinitely. Eventually, the TCP keep-alive code times out and closes the socket. This *should* cause the MQTT layer to return with an error code, but it does not seem to do so.
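    As a defensive measure while this is investigated, the read can be bounded with a receive timeout, so a read on a dead connection returns an error instead of blocking the task forever. SimpleLink exposes the analogous SL_SO_RCVTIMEO socket option; the sketch below demonstrates the idea with POSIX sockets so it is self-contained (it is an illustration of the technique, not the MQTT library's internals).

    ```c
    #include <arpa/inet.h>
    #include <errno.h>
    #include <netinet/in.h>
    #include <stdbool.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    /* With SO_RCVTIMEO set, a read on an idle socket fails with
     * EAGAIN/EWOULDBLOCK after the timeout instead of pending forever,
     * giving the caller a chance to run its error/reconnect path. */
    static bool recv_times_out(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0)
            return false;

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = 0;                  /* any free port; no sender */
        bind(s, (struct sockaddr *)&addr, sizeof addr);

        struct timeval tv = { .tv_sec = 0, .tv_usec = 100000 };  /* 100 ms */
        setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);

        char buf[16];
        ssize_t n = recv(s, buf, sizeof buf, 0);   /* nothing ever arrives */
        bool timed_out = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));
        close(s);
        return timed_out;
    }
    ```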

    We will have to investigate more deeply to resolve this issue. I'll try to set up a similar test case to reproduce it at my end. My only suggestion at this point is a workaround to avoid the issue: disconnect and reconnect at some rate less than the failure rate. Maybe once an hour?

    The SimpleLink library is part of the WiFi driver. Look in products/tidrivers_###/packages/ti/mw/wifi/cc3x00/simplelink. It is possible to rebuild the tidrivers product and pass in build options for the SimpleLink source code. But this is not documented. I would need to experiment to see if it works. At any rate, I don't think you need to do this.

    ~Ramsey

  • Hi Ramsey,

    As I mentioned, I set up a monitoring/watchdog task. When the main thread hung in yield, the watchdog task was able to connect successfully via a UDP socket to another machine several times, even long after the failure.

    The failure rate is somewhat unpredictable. It could be, as you say, that AWS disconnects, but it seems this can happen at any time, so disconnecting/reconnecting feels like an unreliable workaround.

    Best regards,
    Aslak

  • Hi Aslak,

    I found a way to reproduce this problem more easily here locally, and Ramsey and I are debugging it. I wanted to ask you to check something, just to make sure it's the same issue you're seeing.

    Can you check the call stack of your spawn task?  Does it look like what I'm seeing (in the screen shot)?

    Steve

  • Hi Steven,

    Nice! How do you provoke the issue? As mentioned it takes some hours here.

    Sorry for not getting back to you sooner - been holidaying. Well, the simpleLinkSpawnTask seems to be running fine on its own here. The last time I broke on an error it was pending on a mailbox, which seemed reasonable. I think it has also been in the middle of SPIDataGet when I broke in, but I believe that was transient.

    Did your spawn task get stuck in SPIDataGet while running? Is that what the picture shows? If so, no, I don't think I've seen that.

    Best regards,
    Aslak

  • Aslak,

    Yes, our test case always gets stuck in a call to SPIDataGet. It happens within a few seconds of starting the program (5-15 sec).

    We are using an AWS Shadow example as our test case. We intentionally set our DMA minimum transfer size to zero in the board file. This forces all SPI transfers to use DMA (when the buffers are 4-byte aligned). Then we get a SPI transfer request (receive) which has an unaligned receive buffer. This forces the SPI driver to perform a polling transfer. It is on this transfer that the task gets stuck waiting for the first SPI word to be received.

    You might be able to reproduce your failure faster by configuring your DMA minimum transfer size to zero. Add the following to your board SPI configuration.

    const SPICC3200DMA_HWAttrs spiCC3200DMAHWAttrs[CC3200_LAUNCHXL_SPICOUNT] = {
        :
        {
            .baseAddr = LSPI_BASE,
            :
            .minDmaTransferSize = 0 /* failure */
        }
    };

    Make sure you are only changing the LSPI configuration.
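    The transfer-mode decision described above can be modeled with a small helper (my own simplified illustration, not the actual SPICC3200DMA driver source): with minDmaTransferSize set to 0, every 4-byte-aligned transfer goes through DMA, and any unaligned buffer forces a polling transfer.

    ```c
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Simplified model of the DMA-vs-polling choice: DMA is used only
     * when both buffers are 4-byte aligned and the transfer length meets
     * the configured minimum. */
    static bool uses_dma(const void *txBuf, const void *rxBuf,
                         size_t count, size_t minDmaTransferSize)
    {
        bool aligned = (((uintptr_t)txBuf & 0x3u) == 0) &&
                       (((uintptr_t)rxBuf & 0x3u) == 0);
        return aligned && count >= minDmaTransferSize;
    }
    ```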

    We are still investigating.

    ~Ramsey