This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

RTOS/TM4C1294NCPDT: Stuck on HTTPCli_sendField - task running indefinitely - HTTPS

Part Number: TM4C1294NCPDT
Other Parts Discussed in Thread: SYSBIOS

Tool/software: TI-RTOS

Hello,

I have a very rare bug in my HTTPS Post Task code where occasionally, the task will get stuck in a while loop effectively blocking all tasks permanently until the device is reset. I cannot easily reproduce it but I was finally able to capture this bug with the debugger hooked up and step through the code.

I am using: RTOS version: 2.14.04.31
My code has 2 HTTPS Post tasks that run intermittently (at least once a minute, sometimes more).

I see that my code is stuck on the 2nd of 4 "HTTPCli_sendField" calls that I make.
I am not sure if that is a consistent thing as this is the first time I have captured this with the debugger hooked up.

While stepping through the code, I can see that it's stuck in the while loop inside the "Ssock_send()" function in "C:\ti\tirtos_tivac_2_14_04_31\packages\ti\net\http\ssock.c"

The while loop runs while (len > 0) but "nbytes = ssock->sec.send(ssock->ctx, ssock->s, buf, ilen, flags);" keeps returning 0.
This means that the next if condition is hit "if (nbytes >= 0)" but "len -= nbytes;" doesn't decrement since nbytes is 0.

I do not understand what would cause ssock->sec.send to keep returning 0 all the time. I even removed the ethernet connection and nothing changes.

I included some pictures of the stack trace (I think that's what it's called?) after pausing and resuming the device a few times.
It looks like the NDK Stack Thread is toggling in priority between 1 (ROV shows 2 for some reason) and 9.
At least, that's what's reported in the SYSBIOS System Logger:
"LM_setPri: tsk: 0x20001978, func: 0xbbed, oldPri: 1, newPri 9"
"LM_setPri: tsk: 0x20001978, func: 0xbbed, oldPri: 9, newPri 1"


Deeper down in the WolfSSL code, I believe this error is being hit: "WOLFSSL_CBIO_ERR_WANT_WRITE"
But I do not know what that means, or why it would be happening.

EDIT: I think it's actually just returning WOLFSSL_CBIO_ERR_GENERAL or SOCKET_ERROR_E.

NDK_Send() calls fdint_lockfd() which hits this code in "tirtos_tivac_2_14_04_31\products\ndk_2_24_03_35\packages\ti\ndk\stack\fdt\file.c"

    /* Verify Socket and verify that it is open */
    if( !pfd ||
        (Type && pfd->Type != Type) ||
        ((pfd->Type != HTYPE_RAWETHSOCK) &&
#ifdef _INCLUDE_IPv6_CODE
        (pfd->Type != HTYPE_SOCK6) &&
#endif
        (pfd->Type != HTYPE_SOCK && pfd->Type != HTYPE_PIPE)
        ) || !pfd->OpenCount )
    {
        pfdt->error = EBADF;
        return( SOCKET_ERROR );
    }


Due to the very rare occurence of this issue, I do not have any wireshark captures or anything like that. I haven't figured out how to reproduce it, it takes days to weeks for this error to happen. The HTTPS tasks run every few minutes (or less).

This is an example of my HTTPS Post task. I trimmed a lot of the supporting code down so you can just see the key functionality.

static void HttpsPostTask(UArg arg0, UArg arg1)
{
    bool moreFlag = false;
    char responseDataBuf[256];        /* buffer for response data */
    int ret;

    // Storage variables
    char responseLengthStr[10];
    int responseLengthInt;
    char signature[41] = {0};   //40 hex characters
    char counterString[12] = {0}; //1 to 11 characters (unsigned int --> decimal)
    char http_uri[50] = {0};

    HTTPCli_Struct cli;
    // This section builds the static HTTP headers
    HTTPCli_Field fields[4] = {
        { HTTPStd_FIELD_NAME_HOST, HOSTNAME },
        { HTTPStd_FIELD_NAME_USER_AGENT, USER_AGENT },
        { HTTPStd_FIELD_NAME_CONTENT_TYPE, CONTENT_TYPE },
        { NULL, NULL }
    };

    HTTPCli_Params httpParams = {0};
    httpParams.timeout = HTTP_POST_TIMEOUT; /* Timeout value in seconds */

    for(;;) {
        if (!IsDomainResolved()) {
            // DNS resolution function call
        }

        if (!DoesSSL_ContextExist()) {
            // Setup SSL Context function call
        }

        if (IP_AddressIsSet && IsDomainResolved() && DoesSSL_ContextExist()) {

            // Clear storage
            memset(&responseLengthStr[0], 0, sizeof(responseLengthStr));
            memset(&counterString[0], 0, sizeof(counterString));
            memset(&signature[0], 0, sizeof(signature));
            memset(&http_uri[0], 0, sizeof(http_uri));
            
            // Set the correct URI (feedback or metrics)
            if (httpsEndpoint == metrics) {
                PrintDebug("\n ** HTTP Post to /metrics/  ** \n");
                strncpy(http_uri, METRICS_URI, sizeof(http_uri));
            } else if (httpsEndpoint == feedback) {
                PrintDebug("\n ** HTTP Post to /feedback/  ** \n");
                strncpy(http_uri, FEEDBACK_URI, sizeof(http_uri));
            } else if (httpsEndpoint == authentication) {
                PrintDebug("\n ** HTTP Post to /authentication/  ** \n");
                strncpy(http_uri, AUTHENTICATION_URI, sizeof(http_uri));
            }

            // Get counter int to string
            httpCounter++;
            sprintf(counterString, "%u", httpCounter);
            
            // Signature generation
            GenerateSignature(signature, httpDeviceId, counterString, httpsPostData);

            // Store the size of the data in INT and STRING
            responseLengthInt = strlen(httpsPostData); 
            sprintf(responseLengthStr, "%d", responseLengthInt);

            // Start HTTP transaction
            HTTPCli_construct(&cli);
            HTTPCli_setRequestFields(&cli, fields);

			// In the actual code, I have error checks on each of these calls!
            ret = HTTPCli_connect(&cli, DNS_Results->ai_addr, HTTPCli_TYPE_TLS, &httpParams);
            ret = HTTPCli_sendRequest(&cli, HTTPStd_POST, http_uri, true);
            ret = HTTPCli_sendField(&cli, HTTP_HEADER_ID, httpDeviceId, false);
            ret = HTTPCli_sendField(&cli, HTTP_HEADER_COUNT, counterString, false);
            ret = HTTPCli_sendField(&cli, HTTP_HEADER_SIGN, signature, false);
            ret = HTTPCli_sendField(&cli, HTTPStd_FIELD_NAME_CONTENT_LENGTH, responseLengthStr, true);
            ret = HTTPCli_sendRequestBody(&cli, httpsPostData, responseLengthInt);
            ret = HTTPCli_getResponseStatus(&cli);
            ret = HTTPCli_getResponseField(&cli, responseDataBuf, sizeof(responseDataBuf), &moreFlag);

            do {
                ret = HTTPCli_readResponseBody(&cli, responseDataBuf, (sizeof(responseDataBuf)-sizeof(responseDataBuf[0])), &moreFlag);
                if (ret < 0) {
                    // error code
                }
                else if (ret) {
                    if (ret < sizeof(responseDataBuf)) {
                        responseDataBuf[ret] = '\0';
                    }
                    // handle data
                }
            } while (moreFlag);

            // Need to check for success here.
            if(strncmp(METRICS_EXPECTED_RETURN, responseDataBuf, strlen(METRICS_EXPECTED_RETURN)) != 0) {
                PrintError(-1, "HttpsPostTask: response body does not match success");
                goto httpError;
            }

            failedCount = 0;
            goto httpCleanup;

            // This goto is hit if there is an error.
            // Uses a simple backoff timer up to 30 seconds.
            httpError:
            failedCount++;
            ResetFleetCommsSuccess();
            PrintDebug("Fail Count: %u\n", failedCount);
            Task_sleep(failedCount > 8 ? 30000 : (1 << failedCount)*100);

            httpCleanup:
            HTTPCli_disconnect(&cli);
            HTTPCli_destruct(&cli);

        } else {
            // Do nothing
            PrintDebug("No IP address or DNS not resolved. HttpsPostTask doing nothing.\n");
            failedCount++;
            ResetFleetCommsSuccess();
        }

        // "Turn off" this task every run.
        // Makes this task a "run on command" task.
        Task_setPri(taskHandleHttpsPost, -1);
    }
}


// Creates the https post task and starts the handler.
void InitHttpsPostTask() {
    // Task creation and starting code.
}

void StartHttpsPost(char* data, enum Endpoints endpoint) {
    if (Task_getPri(taskHandleHttpsPost) == -1) {
        Task_setPri(taskHandleHttpsPost, 1);

        strcpy(httpsPostData, data);
        httpsEndpoint = endpoint;
    }
}

bool IsHttpsPostTaskRunning() {
    return (Task_getPri(taskHandleHttpsPost) > 0);
}

int GetPostFailureCount() {
    return failedCount;
}

void ClearPostFailureCount() {
    failedCount = 0;
}

  • This is a lot to digest here, I will ask our team expert if he can help you.

    Regards,

    - Rob
  • Thanks, I'd appreciate any help that I can get as I'm kind of stuck here and this is a critical failure for our product.

    Here's some more information I found as I have been stepping through the code:

    Removal of the ethernet is detected. This matches my previous finding that the NDK is still running, I can still ping the device...
    The NC_setLinkHook(LinkStateCallback) is still working.
    Basically, if I remove the Ethernet connection, I can remove and add an IP address using my callback function that I register using NC_setLinkHook.

    Also, the ROV shows all my other tasks set at priority 1. Is that expected?

    Most of my tasks are of different priorities between 1 to 12. I have a few UART tasks that should be priority 10-12 but they are all showing as 1 in the ROV.

    Please let me know if I can help clarify anything else.

  • It looks like the server is shutting down the connection (close notify alert) which is resulting in a return value of 0 from the send()/recv() calls. I am surprised that retrying send() after 0 was received did not result in a send failure ( < 0 return value). But anyways, this looks like a bug in Ssock that it does not handle the condition.

    I have filed a bug "NS-50 - Ssock should handle peer shutdown cases". But currently, I don't know of any new HTTP releases planned for TivaC. So, I would suggest the following options:

    • Is there a limit on how long the connection to the server can be opened? Can you make sure that you close the HTTP connection within that period and re-open the connection again?
    • Alternatively, in ti/net/http/sswolfssl.c, fix the return value check of wolfSSL_send() and wolfSSL_recv() to check for (ret <= 0) instead of (ret < 0)? After making this fix, you should rebuild the libraries by running "make -f tirtos.mak .clean-drivers .drivers" from the TI-RTOS directory.

    Hope this helps. Sorry about the inconvenience caused by the bug.

    Vikram

  • Hi Vikram,

    Thanks for the reply. I sure hope this is the root cause because then it will be pretty easy to fix.
    However, I still have a few questions/comments to make sure I understand what's happening.

    1. Is it expected that the ROV should show all my tasks as priority 1?
      1. During normal execution, I have all my tasks set at different priority levels (UART higher, HTTP lower, USB in the middle, etc...)
    2. NDK_Send is returning "SOCKET_ERROR". It is the WolfSSL code that takes that return value, and later returns a value of 0 send bytes. Is that what you were referring to by the "send()/recv() calls"?
      1. The reason this seems to happen is because this specific condition is hit: "if (ssl->error == SOCKET_ERROR_E && ssl->options.connReset)"
      2. The HTTPCli_Struct cli variable contains this: cli->ssock->ctx. The ctx is the WOLFSSL struct and it contains the error and options.connReset settings.
      3. Error is set during the NDK_Send, and I can only assume that at some point, I hit a peer reset or TCP reset and options.connReset were set as well.

    Regarding the time limit, I have 2 tasks. 1 HTTP task always returns immediately, the other task returns after 30 or 60 seconds. If there is no server response for either of these, both of them have a 90 sec timeout set in the http params.

    Once this error happens, I have no control over anything anymore because no other tasks run at that point. This is why I am so curious about the task priorities all being set to 1. This wouldn't be such a critical error if my other tasks kept running. In fact, it would be a lot easier to fix if the thing didn't lock up due to this issue. The NDK is still functional, HWIs still seem to function, but this tasks never returns and all my other tasks never run even though they are ready.

    Thanks again for the help!

    -Subhash

  • S D said:
    Is it expected that the ROV should show all my tasks as priority 1?

    No, that is not expected. I have never seen such a behavior. ROV retrieves the actual data from the device when program is halted. So, it may be that something in your program is changing the priorities of your Tasks.

    S D said:
    • NDK_Send is returning "SOCKET_ERROR". It is the WolfSSL code that takes that return value, and later returns a value of 0 send bytes. Is that what you were referring to by the "send()/recv() calls"?
      1. The reason this seems to happen is because this specific condition is hit: "if (ssl->error == SOCKET_ERROR_E && ssl->options.connReset)"
      2. The HTTPCli_Struct cli variable contains this: cli->ssock->ctx. The ctx is the WOLFSSL struct and it contains the error and options.connReset settings.
      3. Error is set during the NDK_Send, and I can only assume that at some point, I hit a peer reset or TCP reset and options.connReset were set as well.

    Yes, send()/recv() refers to underlying WolfSSL send in this case. 

    You are correct, the NDK_send() would return SOCKET_ERROR which is defined as -1 (ti/ndk/inc/bsd/socketndk.h).

    Let's try to trace this code flow. In wolfssl/src/io.c, the EmbedSend would make the call to underlying socket layer (NDK in this case). When it receives an error code, it would check errno. If the errno matches SOCKET_ECONNRESET it returns back WOLFSSL_CBIO_ERR_CONN_RST which if you trace it back would set  ssl->options.connReset to indicate peer closed and wolfSSL_send would return 0 (You can also check WolfSSL manual for the description of WolfSSL_send() return values).

    So, I still think you are hitting a peer connection closed condition.

    S D said:

    Regarding the time limit, I have 2 tasks. 1 HTTP task always returns immediately, the other task returns after 30 or 60 seconds. If there is no server response for either of these, both of them have a 90 sec timeout set in the http params.

    Once this error happens, I have no control over anything anymore because no other tasks run at that point. This is why I am so curious about the task priorities all being set to 1. This wouldn't be such a critical error if my other tasks kept running. In fact, it would be a lot easier to fix if the thing didn't lock up due to this issue. The NDK is still functional, HWIs still seem to function, but this tasks never returns and all my other tasks never run even though they are ready.

    I meant how long the HTTP connection to the server can be kept open? If you have an open connection for days/weeks, then some servers tend to close the connection to release memory held by these connections. Ideally, when this happened the HTTP Client should have returned an error code (instead of hanging). But, since we don't have an error check for this condition, one suggestion would be to close your HTTP connection and reopen your HTTP connection every day or after some time period. This way, you wouldn't hit the issue of server closing your connection.

  • Hi Vikram,

    Thanks for the quick reply and all your help. It's greatly appreciated!

    I'm definitely going to try out the fix you suggested. I would like to try and figure out how to reproduce this issue so that I can prove this is the root cause.
    When you say a "peer connection closed condition", what does that mean at a protocol level? Does it mean a TCP RST was sent by the server?
    Is there an easy way to simulate this on a typical web server, maybe just shutting it down during a HTTP request?

    Do you have any ideas as to what might be causing the tasks to all be set to priority 1?
    I only change the task priorities in a few very specific locations and I can't find anything that would set them all to 1.
    Is there any way this is a side effect of encountering an error like this where a particular task doesn't return? Or maybe a side effect of the debugger stopping and starting or anything like that?

    I am unsure what our server limit is on keeping a HTTP connection alive. However, my current task structure is not very optimal so it opens and closes the HTTP connection every time it sends data.
    I'm working on a version that keeps a HTTP connection open to be reused multiple times but I haven't finish that yet.

    Thanks,
    Subhash

  • In addition to my previous questions,

    Does TIRTOS 2.16.00.08 and later versions, also have this issue? Will the bug report you created likely be put into a fix for a future TIRTOS release?

    It looks like the file to  modify is "...\products\ns_X_XX_XX_XX\packages\ti\net\ssock_wolfssl.c"? Is that right?

    Thanks,

    Subhash

  • S D said:
    I'm definitely going to try out the fix you suggested. I would like to try and figure out how to reproduce this issue so that I can prove this is the root cause.
    When you say a "peer connection closed condition", what does that mean at a protocol level? Does it mean a TCP RST was sent by the server?
    Is there an easy way to simulate this on a typical web server, maybe just shutting it down during a HTTP request?

    Since this is a protocol concept, I would recommend reading online about it as the online sources will explain in a detailed manner than I can do.

    S D said:
    Do you have any ideas as to what might be causing the tasks to all be set to priority 1?
    I only change the task priorities in a few very specific locations and I can't find anything that would set them all to 1.
    Is there any way this is a side effect of encountering an error like this where a particular task doesn't return? Or maybe a side effect of the debugger stopping and starting or anything like that?

    I don't know what may be causing this issue. I don't think any side effect should cause this behavior. I would recommend narrowing it down by putting multiple breakpoints in your programs and looking at ROV. Also, can you check ROV after you create a Task to make sure Task priority is set correctly.

  • S D said:
    Does TIRTOS 2.16.00.08 and later versions, also have this issue? Will the bug report you created likely be put into a fix for a future TIRTOS release?

    Yes. This issue occurs in the latest version of the HTTP Client software. We will fix it in a future release but we do not have any current plans for a future release for TivaC.

    S D said:
    It looks like the file to  modify is "...\products\ns_X_XX_XX_XX\packages\ti\net\ssock_wolfssl.c"? Is that right?

    That's correct.