
CC3220SF: sl_Close deadlock

Part Number: CC3220SF

Hi TI,  

During stress testing of our network stack, we observe sl_Close deadlocking in various use cases after enough runtime.  We have a test that can trigger the deadlock in roughly 24 hours.  Essentially, an MQTT connection is opened and closed repeatedly because the PINGREQ sometimes does not get a response in time.  I have an NWP log of this procedure with the deadlock occurring.  It is 50 MB, so please let me know which email to send it to.

Unfortunately, the task in which sl_Close is blocking is critical to our network stack; it expects the call to return eventually, otherwise the MQTT connection will never be re-established.  We are on SDK 4.3.  Please advise.

Thanks,

Ben

  • Hi Ben,

    For some reason it looks like _SlDrvWaitForPoolObj is getting stuck in the following call:

    (void)_SlDrvSyncObjWaitForever(&g_pCB->ObjPool[CurrObjIndex].SyncObj);

    It could be that the NWP is not signaling the host driver to progress closing the socket, and that is why it's getting stuck. The logs should be able to give me more information on this. Also, we cannot receive files that large. Can you share it through a cloud drive link? Send me a private message through E2E if you are concerned about sharing a log on a public forum.

    When you say "various use cases" what do you mean exactly? I'd like to better understand if there's anything unique about those cases that may indicate the issue as opposed to the "typical" use case. 

    Jesu

  • By "various use cases" I mean using the AWS FreeRTOS network stack to open an MQTT connection, send some data, and reconnect when WLAN/networking/application errors occur.

  • Understood. Are you able to share the NWP logs via a Google Drive link or something similar?

    Jesu

  • Hi Jesu. I’m not seeing an option for a private message. Can you accept my contact request?

  • Hi Ben,

    I just sent you a private message. Please confirm you got it.

    Jesu

  • Got it and log sent, thanks!

  • Hi Ben,

    I have received your log. A team member is converting it to a .lgr file for me. I will try to review it tomorrow and respond to you here.

    Jesu

  • Hey Ben,

    The logs you sent me were a bit too big. I could only translate 5 MB before my tool crashed, which gave me a log with about 137,000 lines of NWP activity. I can see some socket activity. Judging by the little I saw, the NWP appears to free the socket resource during socket close and to signal the host driver correctly. I'm afraid I didn't learn much here because I cannot see the part of the log where the failure actually occurs. Any chance you could auto-save the logs in smaller chunks? Keep them under 5 MB so that my tool won't crash. Apologies...

    Jesu

  • Caught it again with a much smaller file.  sl_close.log

  • Hi Ben,

    Thanks for providing this. I was able to convert the full log you sent me. Do these logs also capture the deadlock event? I see the same activity as in the previous log, just more repetitions. I would not expect to see any more socket create or destroy traces if sl_Close is stuck.

    Do you have other threads creating and closing sockets by any chance?

    Jesu

  • Yes, that log captures the sl_close deadlock.   Try looking around 2.5 hours before the end of the log.  I think that's when it happened.

    There might be another socket opened and closed to get the NTP time periodically, but that would be from a separate task and only every 24 hours.

  • Hi Jesu,

    As I'm theorizing about what is going on here, I think it may be helpful to explain the general flow of sl_* calls.

    First, a socket is created via sl_Socket and connected to our server through the MQTT library (under the hood, sl_Connect). Then a thread is created that calls sl_Recv on this socket descriptor in a blocking fashion.

    Then an application error occurs that requires us to reconnect to the server, so first we must disconnect. To unblock the sl_Recv and recreate everything, we call sl_Close on the socket, which unblocks the blocked sl_Recv call. Then the steps listed previously are executed again (sl_Socket, sl_Connect, etc.).

    The problem is that sometimes the sl_Close call doesn't return.

    What I've realized is that, due to some task priority interactions, it's actually possible for the closing function to call sl_Close in a small window of time after sl_Recv has unblocked itself (probably because it received data) and the receive task has deleted itself as the close routine begins. So the timing that induces the deadlock is possibly very soon after sl_Recv has returned, when the socket is no longer pending on sl_Recv.

    This may be unusual timing for the NWP, but still, calling sl_Close on a socket that is still open and connected via sl_Connect is completely valid, yes? Maybe this will help with examining the logs. A rough sketch of the flow is below.
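
    To make the flow concrete, here is a minimal sketch of what I described (a simplification rather than our actual code; the broker address, port, buffer sizes, and task parameters are placeholders, and error handling is omitted):

    #include <ti/drivers/net/wifi/simplelink.h>  /* SimpleLink host driver, SDK 4.x include path */
    #include "FreeRTOS.h"
    #include "task.h"

    static _i16 g_sock = -1;

    static void rx_task(void *arg)               /* blocking receive thread */
    {
        _u8 buf[512];
        for (;;)
        {
            _i16 len = sl_Recv(g_sock, buf, sizeof(buf), 0);  /* blocks until data, error, or close */
            if (len <= 0)
            {
                break;                           /* socket closed or error -> leave the loop */
            }
            /* ... hand the data to the MQTT library ... */
        }
        vTaskDelete(NULL);                       /* task deletes itself */
    }

    static void connect_to_broker(void)
    {
        SlSockAddrIn_t addr;
        addr.sin_family      = SL_AF_INET;
        addr.sin_port        = sl_Htons(8883);                        /* placeholder port */
        addr.sin_addr.s_addr = sl_Htonl(SL_IPV4_VAL(192, 0, 2, 1));   /* placeholder broker IP */

        g_sock = sl_Socket(SL_AF_INET, SL_SOCK_STREAM, 0);            /* 1) create the socket */
        sl_Connect(g_sock, (SlSockAddr_t *)&addr, sizeof(addr));      /* 2) connect (MQTT CONNECT follows) */
        xTaskCreate(rx_task, "mqtt_rx", 1024, NULL, 2, NULL);         /* 3) blocking receive thread */
    }

    static void reconnect_on_error(void)
    {
        sl_Close(g_sock);        /* 4) should unblock sl_Recv and free the NWP socket,
                                        but occasionally never returns (the deadlock) */
        connect_to_broker();     /* 5) repeat the sequence */
    }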

  • Continuing from my point above, simply not calling sl_Close in this timing case is not a viable workaround. sl_Close still needs to be called in order to return the socket to the NWP; otherwise we will eventually run out of sockets to reconnect with.

  • Hi Jesu, 

    Please let me know if there is any more info or hints I can provide to help in resolving this issue.

    Thanks,

    Ben

  • Hi Ben,

    Thank you for providing this information. It is definitely helpful to understand what is going on in your application. Please give us some more time to review this and we will get back to you.

    Jesu

  • Hi Jesu

    Any luck?

    Let me know if I can provide any more help from our end.

    Ben

  • Hi Ben,

    I also had (and maybe still have ;) ) a problem with sl_Close deadlocking.

    The problem went away once I opened and closed the connection from inside a single thread/function.

    To do this, I set the socket to blocking with a timeout, let's say 1 second, after a client is accepted.

    The read call now returns at least every second. The loop itself is controlled by a flag variable; when the flag is set, the read loop terminates and sl_Close is called from that same thread. A rough sketch is below.
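
    A minimal sketch of that pattern (my own illustration of the idea, not production code; the exact timeout error macro may differ between SDK versions):

    #include <ti/drivers/net/wifi/simplelink.h>

    static volatile int g_run = 1;     /* cleared by another task/function to stop the loop */

    static void serve_client(_i16 clientSock)
    {
        /* Make sl_Recv block for at most 1 second on this socket */
        SlTimeval_t tv;
        tv.tv_sec  = 1;
        tv.tv_usec = 0;
        sl_SetSockOpt(clientSock, SL_SOL_SOCKET, SL_SO_RCVTIMEO, &tv, sizeof(tv));

        _u8 buf[256];
        while (g_run)
        {
            _i16 len = sl_Recv(clientSock, buf, sizeof(buf), 0);
            if (len > 0)
            {
                /* ... process the data ... */
            }
            else if (len == SL_ERROR_BSD_EAGAIN)
            {
                continue;              /* timeout expired: loop again and re-check g_run */
            }
            else
            {
                break;                 /* real error: leave the loop */
            }
        }

        /* Close from the same thread that did the receiving */
        sl_Close(clientSock);
    }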

    Hint: Do not use "select" to wait for a timeout or data. At least in SDK version 3.40 there is a bug: after 65535 "select" events I got a semaphore overrun of an internally used TI semaphore.

    Best regards,
    Roman

  • Hi Ben,

    Jesus had to transfer this thread to me to continue looking into it. I'll need a bit of time to review the history of the thread and provide additional suggestions.

    Please give me a couple days to follow-up.

    Best Regards,

    Ben M

    Wi-Fi Stack SW Apps Manager

  • Hi Ben,

    I think I'm starting to get a solid understanding of your scenario. To recap, you are -

    • Using the FreeRTOS MQTT library on the CC3220 device (on top of the NWP's integrated TCP/IP stack)
    • Using the host driver from the v4.3 CC32xx SDK

    What version of the FreeRTOS code are you using? I assume you are working based on the FreeRTOS SDK and then just pulling in driver support from the CC32xx SDK.

    The general flow of your application (as you mentioned) from the SL API perspective is -

    1) Create new socket w/ sl_Socket()

    2) Connect to MQTT broker w/ sl_Connect()

    3) Once connected, block on sl_Recv() in a separate thread

    4) Call sl_Close() IF some sort of error is detected requiring a disconnect/reconnect to the broker

    5) Repeat 1-4

    I see that the symptom of the problem you are facing is some sort of deadlock. The deadlock happens while the host is executing the sl_Close() call, correct? 
    I see Jesus made a comment about the statement he thinks the system is hanging on. Is that where you are actually seeing the driver get stuck, i.e., inside the "_SlDrvProtectAsyncRespSetting()" call that you highlight in your original post?

    Regarding your latest observation, I see you mention that the problem appears to occur if the sl_Close() is executed in a small window of time after the blocking sl_Recv() has returned. That seems strange because I would generally expect an sl_Close() call to succeed in the case that you have an open, connected socket with no pending receive operation. From the NWP perspective, I don't see an issue.

    Perhaps that would imply that the issue is just on the host driver side and related to a conflict that happens in the host driver when processing the receive event and the request to send the close command with a certain relative timing.
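
    Purely to illustrate the kind of timing conflict I have in mind (this is a hypothetical sketch to probe the theory, not a confirmed root cause or an official workaround; the semaphore, its timeout value, and all names are placeholders), one experiment would be a small, bounded hand-off between the receive thread and the closing thread:

    #include <ti/drivers/net/wifi/simplelink.h>
    #include "FreeRTOS.h"
    #include "semphr.h"

    /* Hypothetical guard, created once at init with xSemaphoreCreateBinary() (not shown).
       The receive thread gives it right after its final sl_Recv() returns, just before
       it deletes itself. */
    static SemaphoreHandle_t g_rxExited;
    static _i16 g_sock = -1;

    static void close_and_reconnect(void)
    {
        /* If the receive thread has just come out of sl_Recv(), let the host driver
           finish that path before the close command is queued. If the thread is still
           blocked in sl_Recv(), the wait simply times out and we close anyway, since
           sl_Close() is what unblocks it in that case. */
        (void)xSemaphoreTake(g_rxExited, pdMS_TO_TICKS(50));   /* placeholder timeout */

        sl_Close(g_sock);            /* the call that occasionally hangs today */

        /* Drain any stale "exited" signal, then repeat sl_Socket()/sl_Connect() and
           spawn a fresh receive thread as before. */
        (void)xSemaphoreTake(g_rxExited, 0);
    }

    If serializing the two paths this way makes the hang disappear, that would support the idea that the conflict is in how the host driver handles the receive completion and the close command back to back.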

    Does my understanding sound accurate?

    Thanks,

    Ben M