CC3220SF: sl_Stop deadlock despite timeout provided

Part Number: CC3220SF

Hi Support,

After an OTA, we call sl_Stop to use our new bundle.

Unfortunately, it sometimes deadlocks the MCU, even though we pass a non-zero timeout to the sl_Stop() call.

Here's where it's hung up:

#ifdef SL_PLATFORM_MULTI_THREADED
    /* Do not continue until all sync object deleted (in relevant context) */
    while (g_pCB->NumOfDeletedSyncObj < MAX_CONCURRENT_ACTIONS)
    {
        usleep(100000);
    }
#endif    

The value of NumOfDeletedSyncObj is one less than MAX_CONCURRENT_ACTIONS.  Should we be manually closing all sockets before calling sl_Stop?  This shouldn't be necessary, right?

Thanks!

  • We upgraded to the latest SDK, 3.30.01.02, after seeing this in the release notes:

    Various simplelink Wi-Fi host driver fixes relating to sync object handling
    Various OTA library fixes

    but unfortunately the issue persists!

  • Hi Ben,

    Do you know which SimpleLink call the semaphore belongs to when sl_Stop is waiting on it?

    Jesu

  • Sure, the semaphore not being released by sl_Stop has ActionID = RECV_ID (0x1B), so the call is sl_Recv.

    We do use blocking sockets in the system, necessitated by low-power requirements.

  • Hi Ben,

    Thank you for the screenshot, this is actually what I need. Give me some time to investigate and I will get back to you.

    Jesu

  • Great, thanks.

    In the meantime, we are using the hardware watchdog to protect against this issue.

    We only see this come up after 50 or so upgrade cycles, so it may not be easy to see at first.

    The issue can be exacerbated by increasing the number of SimpleLink calls. If we connect to an NTP time server continuously while calling sl_Stop(1), the issue happens almost every time.

  • Hi Ben,

    Do you know if sl_Recv ever returns? If it does, what value does it return? I took a look at the source code for sl_Recv (sl_socket.c): it calls _SlDrvDataReadOp (in driver.c), which is supposed to free the semaphore by calling _SlDrvReleasePoolObj at the very end. I suspect it's not actually reaching that call, which would explain why you get stuck in sl_Stop. Could you verify?

    Jesu

  • Hi Jesu,

    The offending sl_Recv never returns, which is quite confusing, as I can see it being signaled in this loop:

        while (MAX_CONCURRENT_ACTIONS > ActiveIndex)
        {
            /* Set error in case sync objects release due to stop device command */
            if (g_pCB->ObjPool[ActiveIndex].ActionID == NETUTIL_CMD_ID)
            {
                ((_SlNetUtilCmdData_t *)(g_pCB->ObjPool[ActiveIndex].pRespArgs))->Status = SL_RET_CODE_STOP_IN_PROGRESS;
            }
            else if (g_pCB->ObjPool[ActiveIndex].ActionID == RECV_ID)
            {
                ((SlSocketResponse_t *)((_SlArgsData_t *)(g_pCB->ObjPool[ActiveIndex].pRespArgs))->pArgs)->StatusOrLen = SL_RET_CODE_STOP_IN_PROGRESS;
            }
            /* First 2 bytes of all async response holds the status except with NETUTIL_CMD_ID and RECV_ID */
            else
            {
                ((SlSocketResponse_t *)(g_pCB->ObjPool[ActiveIndex].pRespArgs))->StatusOrLen = SL_RET_CODE_STOP_IN_PROGRESS;
            }
            /* Signal the pool obj*/
            SL_DRV_SYNC_OBJ_SIGNAL(&g_pCB->ObjPool[ActiveIndex].SyncObj);
            ActiveIndex = g_pCB->ObjPool[ActiveIndex].NextIndex;
        }

    The offending socket has an SL_SO_RCVTIMEO of 5 minutes, but even after 5 minutes it still does not return. Not sure if that's a clue.

    Ben

  • Hi Ben,

    It seems the signal to release the object is failing. Try stepping into SL_DRV_SYNC_OBJ_SIGNAL in a debug session to see what _SlDrvSyncObjSignal returns. You'll notice it calls OSI_RET_OK_CHECK, which can cause _SlDrvSyncObjSignal to return an error instead of SL_OS_RET_CODE_OK.

    /* Wrappers for the object functions */
    _SlReturnVal_t _SlDrvSyncObjSignal(_SlSyncObj_t *pSyncObj)
    {
        OSI_RET_OK_CHECK(sl_SyncObjSignal(pSyncObj));
        return SL_OS_RET_CODE_OK;
    }
    #define OSI_RET_OK_CHECK(Func)                  {_SlReturnVal_t _RetVal = (Func); if (SL_OS_RET_CODE_OK != _RetVal) return  _RetVal;}

    I'm confident you're getting an error here, but please confirm. To step into the code above, you have to copy the device.c file into your project.

    The file is located in this directory depending on where you installed the SDK: C:\ti\simplelink_cc32xx_sdk_3_30_01_02\source\ti\drivers\net\wifi\source

    You could also add sl_socket.c and driver.c to step through sl_Recv to see exactly where it's stopping if you're curious.

    Lastly, do you have a simple project, or a code excerpt from your project, that I can use to try to reproduce the behavior on my end?

    Jesu
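
A minimal host-side sketch of the early-return behavior described above, for readers who want to see it without a debug session. It is not the SDK code: the typedefs, the value of SL_OS_RET_CODE_OK, the stubbed sl_SyncObjSignal, and the helper signal_and_log are all assumptions made for illustration. Only the macro and the wrapper mirror the excerpt quoted in this post.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in types and codes (assumptions); the real ones come from the SDK headers. */
typedef short _SlReturnVal_t;
typedef int   _SlSyncObj_t;
#define SL_OS_RET_CODE_OK 0

/* Same shape as the macro quoted above: return early on any non-OK result. */
#define OSI_RET_OK_CHECK(Func) \
    { _SlReturnVal_t _RetVal = (Func); if (SL_OS_RET_CODE_OK != _RetVal) return _RetVal; }

/* Hypothetical stub for the OS-level signal call; returns whatever the test sets. */
static _SlReturnVal_t g_stubResult = SL_OS_RET_CODE_OK;
static _SlReturnVal_t sl_SyncObjSignal(_SlSyncObj_t *pSyncObj)
{
    (void)pSyncObj;
    return g_stubResult;
}

/* Wrapper as quoted above: propagates the OS error instead of returning OK. */
_SlReturnVal_t _SlDrvSyncObjSignal(_SlSyncObj_t *pSyncObj)
{
    OSI_RET_OK_CHECK(sl_SyncObjSignal(pSyncObj));
    return SL_OS_RET_CODE_OK;
}

/* Hypothetical helper: capture and log the result instead of discarding it. */
_SlReturnVal_t signal_and_log(_SlSyncObj_t *pSyncObj, _SlReturnVal_t stubResult)
{
    assert(pSyncObj != NULL);
    g_stubResult = stubResult;
    _SlReturnVal_t ret = _SlDrvSyncObjSignal(pSyncObj);
    if (ret != SL_OS_RET_CODE_OK)
    {
        printf("sync object signal failed: %d\n", (int)ret);
    }
    return ret;
}
```

Calling signal_and_log with a non-zero stub result shows the error code passing straight through the wrapper. On the target, a breakpoint or watch on the wrapper's return value would give the same information.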

  • Hi Jesu,

    Thanks for your efforts in looking into this.  It did point me to the right place eventually.

    Basically, one of our tasks that uses sl_* API calls was starving the system after sl_Stop. Another task, blocked in sl_Recv, had a lower priority than the first and therefore never got to run and return from sl_Recv, so it could never signal back to sl_Stop that it had finished.

    We also discovered that in our

    void SimpleLinkFatalErrorEventHandler(SlDeviceFatal_t *slFatalErrorEvent)

    function, we were calling PRCMHibernateCycleTrigger instead of PRCMMCUReset, which seems unsafe, since sl_Stop is supposed to be called before PRCMHibernateCycleTrigger. We noticed that PRCMHibernateCycleTrigger would sometimes not return when called at the end of SimpleLinkFatalErrorEventHandler.

    Thanks again,

    Ben
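
The starvation described in this post can be illustrated with a toy model of fixed-priority scheduling. This is a sketch under stated assumptions, not code from the SDK or from the project in question: simulate_stop, the tick-based scheduler, and the value chosen for MAX_CONCURRENT_ACTIONS are all hypothetical. The only thing taken from the thread is the condition that sl_Stop waits until the deleted-object count reaches MAX_CONCURRENT_ACTIONS, with one object still held by the blocked sl_Recv.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative value only; not necessarily what the SDK defines. */
#define MAX_CONCURRENT_ACTIONS 4

/*
 * Toy model: one high-priority task and one low-priority task that has been
 * signaled out of sl_Recv but still needs CPU time to return and delete its
 * sync object. Returns the tick at which the stop condition is met, or -1 if
 * it is never met within maxTicks.
 */
int simulate_stop(bool highPrioYields, int maxTicks)
{
    assert(maxTicks >= 0);

    /* All sync objects are deleted except the one owned by the blocked sl_Recv. */
    int numOfDeletedSyncObj = MAX_CONCURRENT_ACTIONS - 1;

    for (int tick = 0; tick < maxTicks; tick++)
    {
        /* A busy high-priority task is always ready; a yielding one
         * blocks (e.g. sleeps) on every other tick. */
        bool highReady = highPrioYields ? (tick % 2 == 0) : true;

        if (!highReady)
        {
            /* The low-priority task finally runs, returns from sl_Recv,
             * and deletes its sync object. */
            numOfDeletedSyncObj++;
        }

        /* Same condition the sl_Stop wait loop polls. */
        if (numOfDeletedSyncObj >= MAX_CONCURRENT_ACTIONS)
        {
            return tick;
        }
    }
    return -1; /* the wait loop would still be spinning */
}
```

With highPrioYields set to false the count stays one short of MAX_CONCURRENT_ACTIONS for as long as the simulation runs, which matches the symptom reported at the top of the thread. Once the high-priority task blocks even briefly, the low-priority task gets to run and the condition clears.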

  • Hi Ben,

    Glad I could help. If you have any more questions, feel free to create a new thread. Good luck with your project. Closing thread.

    Jesu