MSP432E401Y: TCP/IP stack issue - ack and sequence numbers

Part Number: MSP432E401Y
Other Parts Discussed in Thread: MSP-EXP432E401Y

Hi,

I'm testing the TCP/IP communication stability between MSP432 and a configuration SW running on PC. The configuration SW opens a TCP/IP session with the MSP432 based device and then sends MODBUS TCP requests, to which the MSP432 based device responds correctly. So far everything works as expected.

The issue occurs when the communication channel is disturbed by electromagnetic interference, which sometimes causes packet loss. A MODBUS TCP response sent from the MSP432 is not received by the configuration SW on the PC, so the configuration SW sends the same request again. From then on we see the issue: the MSP432 based device responds to the retransmission, but the configuration SW does not consider the response valid. Checking the included Wireshark logs, I noticed that the sequence number of the answer is not equal to the ACK number in the request. I guess this is why the answer is not considered valid. In the following requests from the configuration SW, the ACK number remains the same, whereas the sequence number of the responses from the MSP432 keeps incrementing.

I attach the logs: the PC IP address is 172.20.4.33, the MSP432 IP address is 172.20.4.240.

At time around 4.3 seconds, we can see that no response is received to query (message number 149) and retransmissions are issued. When a response is received from the MSP432 device (response message number 158), the sequence number of the response is not equal to the ACK of the request.

I recompiled the TCP/IP stack, but I was not able to find a suitable fix. Do you think the analysis is correct, or maybe the configuration SW is not behaving correctly when sending the retransmissions?


Thanks

MSP432_TCP_IP_issue.7z

  • I forgot to mention that I'm using the latest available SDK: 3.10.00.11
  • Hello Luca,

    We will review this and get back to you with more information.

    Thanks,
    Sai
  • Hello Luca,

    I reviewed the Wireshark log and noticed that Wireshark is complaining that previous segment(s) were not captured (around Frame 153).

    Luca Ortu said:
    At time around 4.3 seconds, we can see that no response is received to query (message number 149) and retransmissions are issued.

    According to Wireshark, Frame 153 is an ACK to Frame 149, but the Modbus response packet that should follow it is missing from the MSP432E4 side. So it looks like the MSP432E4 was about to send two TCP segments (in response to Frame 149), but ended up sending only one segment.

    Since the TCP protocol ensures that no packet is lost, I would expect the PC's TCP/IP stack to retransmit Frame 149. I am not aware of how a missing segment of a TCP frame is supposed to be handled by the receiving side.
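    On the question of how a receiver handles a missing segment: in standard TCP the receiver keeps acknowledging the last in-order byte it has, so a gap shows up as repeated (duplicate) ACKs with the same ACK number. The following is only a minimal sketch of that rule. The struct, the function name, and the 11-byte segment length are assumptions made for illustration (the length is chosen to match the relative numbers quoted in this thread), and a real stack would also buffer out-of-order data and handle partial overlaps.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, minimal receiver state: the next in-order byte expected. */
struct rx_state {
    uint32_t rcv_nxt;
};

/* Process one arriving segment and return the ACK number the receiver
 * would send back. Only a segment that starts exactly at rcv_nxt advances
 * the state. A segment that starts beyond rcv_nxt (a gap) or before it
 * (old data) leaves rcv_nxt unchanged, so the same ACK number is repeated. */
static uint32_t rx_on_segment(struct rx_state *rx, uint32_t seq, uint32_t len)
{
    assert(rx != 0);
    if (seq == rx->rcv_nxt) {
        rx->rcv_nxt += len;   /* in order: accept and advance */
    }
    return rx->rcv_nxt;       /* out of order: duplicate ACK */
}
```

    With rcv_nxt at 71, a segment arriving with seq 82 is answered with ACK 71 again, and only a (re)transmitted segment starting at 71 moves the ACK forward.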

    This log stops a couple of seconds after the missing segment. What happens if you run it longer, say for another minute or so?

    Thanks,

    Sai

  • Hi Sai,

    Thanks for your response.

    Here's how I'm interpreting frames 149, 152, 153:

    Frame 149: configuration SW sends a TCP MODBUS request. No answer is received. The answer in fact is sent by the MSP432, but it is not received by the configuration SW, due to packet loss on the network. So, according to TCP rules, frame 149 is sent again.

    Frame 152: TCP on PC side forces a re-send: the same message sent in frame 149 is sent again. 

    Frame 153: this is the ACK to packet 152, received by PC immediately after sending frame 152. In my opinion here we see the issue: sequence number (82) of frame 153 is not the sequence number expected by configuration SW. Expected sequence number would be 73, that is the ACK number specified in request 152. I think Wireshark signals this misbehavior with the comment "TCP Previous segment not captured". 

    As you point out, here we receive only the ACK, not the response, possibly, again, due to packet loss in the network. In any case, from now on the ACKs and responses from the MSP432 have "wrong" sequence numbers. The configuration SW keeps sending ACK number = 71 (see frames 160, 169, 178), whereas the sequence numbers in the responses from the MSP432 keep incrementing.
    My impression is that the sequence number keeps incrementing without taking into account the ACK number in the request. So, when a packet is lost, the ACK number in the request and the sequence number in the response, which should be aligned, are different.
    We never recover from this situation: the messages keep following the same pattern as the last messages you can see in the included logs, and this goes on forever.

    I would expect sequence number on MSP432 to align to ACK number expected by configuration SW on PC, so responses would be considered valid by PC.
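    For reference, here is a minimal sketch of how a standard TCP sender is usually described as choosing sequence numbers. It is not taken from the NDK sources: the struct and function names are hypothetical, and the 11-byte lengths are assumed so that the values line up with the 60, 71 and 82 seen in the capture. The point it illustrates is that new data always goes out at snd_nxt, while the peer's ACK only moves snd_una, so a lost segment is repaired by a retransmission starting at snd_una rather than by renumbering later segments.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, minimal sender state (names follow the usual TCP terms). */
struct tx_state {
    uint32_t snd_una;  /* oldest byte sent but not yet acknowledged */
    uint32_t snd_nxt;  /* sequence number for the next new byte */
};

/* Queue new application data: returns the sequence number of the segment. */
static uint32_t tx_send_new(struct tx_state *tx, uint32_t len)
{
    uint32_t seq = tx->snd_nxt;
    tx->snd_nxt += len;
    return seq;
}

/* Handle an incoming ACK number. Only ACKs that cover new data, and that
 * do not go beyond what was sent, advance snd_una. The signed casts keep
 * the comparison correct across 32-bit wraparound. */
static void tx_on_ack(struct tx_state *tx, uint32_t ack)
{
    assert(tx != 0);
    if ((int32_t)(ack - tx->snd_una) > 0 &&
        (int32_t)(tx->snd_nxt - ack) >= 0) {
        tx->snd_una = ack;
    }
}

/* A retransmission restarts from the oldest unacknowledged byte. */
static uint32_t tx_retransmit_seq(const struct tx_state *tx)
{
    return tx->snd_una;
}
```

    In this model, a request that still carries ACK 71 while responses go out at 82 and beyond is not an error by itself. It indicates that the segment starting at 71 is still outstanding and has to be retransmitted.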

    Thanks,

    Luca

  • Hi Luca,

    I'm trying to understand how you're using MODBUS and the SDK. Could you provide a sample of the code you're running on the MSP432?

    Best,
    Brandon
  • Hi Brandon,

    Here are the files where MODBUS and TCP are implemented.

    Please let me know if you need further info.

    Thanks,

    Luca

    TCP_files.7z

  • Hi Luca,

    Sorry, I've been out of the office recently, but I do intend to keep looking at this. Are you able to consistently reproduce this? And have you, by any chance, observed anything new?

    Regards,
    Brandon
  • Hi Brandon,

    Thanks.

    No news, the problem keeps on occurring in the aforementioned conditions.

  • Hi Luca,

    Unfortunately I can't get too close to the source of the problem because I can't replicate your setup. It looks like you're using TIRTOS based on the files you sent. My current best suggestion is to enable debugging within the TCP library and see what you can learn:

    In <SDK_INSTALL_DIR>/source/ti/ndk/stack/tcp/tcpout.c, uncomment #define TCP_DEBUG (around line 42).

    This enables calls to DbgPrintf scattered throughout this file. I would add similar calls to DbgPrintf inside tcpin.c:TcpInput().

    Around line 262 of tcpin.c you will see the variables 'seq' and 'ack' being filled in. I would try tracking down what the NDK is doing as it continually receives packets with the same ack value. After these steps you will have to rebuild the NDK using xdctools.
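    As a rough illustration of the kind of trace that helps here, the sketch below reads the sequence and acknowledgment fields from a raw TCP header and formats one log line. It is host-side C, not NDK code: read_be32 and format_seq_ack are hypothetical helpers, and on the target the resulting string would be handed to whatever debug print the stack provides (DbgPrintf in this discussion). The field offsets (4 for the sequence number, 8 for the acknowledgment number, both big-endian) come from the standard TCP header layout.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Read a 32-bit big-endian (network order) value from a byte buffer. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Format the seq/ack fields of a received TCP header into 'out'.
 * 'tcp_hdr' must point at the first byte of the TCP header (at least
 * 12 bytes readable). Returns the snprintf() result. */
static int format_seq_ack(char *out, size_t out_len, const uint8_t *tcp_hdr)
{
    uint32_t seq = read_be32(tcp_hdr + 4);  /* sequence number field */
    uint32_t ack = read_be32(tcp_hdr + 8);  /* acknowledgment number field */
    return snprintf(out, out_len, "rx seq=%" PRIu32 " ack=%" PRIu32, seq, ack);
}
```

    Note that these are the absolute values from the wire. Wireshark shows relative sequence numbers by default, so the logged values need the initial sequence numbers subtracted before they can be compared with the capture.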

    I'm assuming you're using CCS as well. In your tirtos_builds_MSP_EXP432E401Y_release_xxx project, edit the release.cfg file to include the SysMin module and exclude the SysCallback module, like so:

    var SysMin = xdc.useModule('xdc.runtime.SysMin');
    SysMin.bufSize = 1024; // Increasing the buffer size here may be desirable
    System.SupportProxy = SysMin;
    //var SysCallback = xdc.useModule('xdc.runtime.SysCallback');
    //System.SupportProxy = SysCallback;

    I'd also try to use ROV to get a better view of the NDK usage and to view the SysMin output. After starting your debug session in CCS, go to Tools -> Runtime Object View and connect to the target. I've attached a dashboard with the modules I think will be useful here. After connecting to the target you can find a small folder icon in the upper right portion of the ROV window. Click it to import the attached rov.json file.

    tcp_debugging.rov.json
    [
      {
        "viewName": "OutputBuffer",
        "id": "rovModuleView_0",
        "left": 0,
        "top": 0,
        "width": "341px",
        "height": "374px",
        "position": "",
        "zIndex": "14",
        "repeatDivider": 1,
        "repeatRefreshEnabled": true,
        "defaultSize": true,
        "moduleName": "xdc.runtime.SysMin",
        "viewsData": {
          "xdc.runtime.SysMin.OutputBuffer": {
            "columnStates": [
              {
                "name": "entry",
                "checked": true,
                "hasFormat": false,
                "format": null
              }
            ],
            "fixedFont": false,
            "hasFormats": false
          }
        },
        "dashboardVersion": "1.0",
        "inFullScreenMode": false,
        "comment": ""
      },
      {
        "viewName": "Sockets",
        "id": "rovModuleView_1",
        "left": 363,
        "top": -3,
        "width": "722px",
        "height": "95px",
        "position": "",
        "zIndex": "11",
        "repeatDivider": 1,
        "repeatRefreshEnabled": true,
        "defaultSize": true,
        "moduleName": "C.NDK",
        "viewsData": {
          "C.NDK.Sockets": {
            "columnStates": [
              {
                "name": "Address",
                "checked": true,
                "hasFormat": true,
                "format": "Hex"
              },
              {
                "name": "Ctx",
                "checked": true,
                "hasFormat": true,
                "format": "Decimal"
              },
              {
                "name": "Protocol",
                "checked": true,
                "hasFormat": false,
                "format": null
              },
              {
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    I want to believe this has something to do with the MODBUS application code implementation sitting on top, since this issue is previously unheard of. You're doing groundbreaking work! My next suggestion would be to re-examine that application code, but I can't give specific directions as to how to go about that since I'm unfamiliar with MODBUS.

    Best,

    Brandon

  • Hi Brandon,

    Thanks for your support, I really appreciate it.

    Before asking your support on this forum, I tried to figure out what happens in tcpin.c and tcpout.c files, but they're quite complicated and I didn't notice any obvious bug. I just recompiled my project including them and stepped in with the debugger.

    But going back to these files suggested a way to reproduce my "bug". You should be able to reproduce it consistently, too.

    This "bug" is generated on my real device by the loss of a transmitted packet in the network. I simulate the loss of a transmitted packet (transmitted by MSP432) with the following modification in file tcpout.c:

    At the send: label, around line 750, I skip the call to "IPTxPacket()" once every 100 times. So, 99 times the TCP stack responds as expected; on every 100th call the stack creates the TCP response, but does not send it. Here's how I did it:


    TcpCounter++;

    if (TcpCounter%100 == 0)
    {

       // here's where we are not sending the response
       error = 0;
    }
    else
    {
       error = IPTxPacket( pPkt, SockGetOptionFlags(pt->hSock) & FLG_IPTX_SOSUPPORTED );
    }

    My assumption is that this modification reproduces a loss of a transmitted packet. The stack performs all the needed calculations (seq, ack...), creates the response packet and no error is generated. The stack is not immediately aware that the packet has not been sent out, as it happens when a packet is lost in the network (in my case due to electromagnetic noise).

    This way I can reproduce exactly the same bug that I see when a packet is lost in the network, with the same problem of seq numbers and acks: the MSP432 receives an ACK number, but responds with a different sequence number. The response sequence number keeps increasing, whereas the request ACK number remains the same. The problem is the same one that I mentioned in my previous posts.

    I can reproduce the bug with our custom MODBUS client. But I can also reproduce it with a third party MODBUS client (QModMaster). Modbus is a very simple master/slave protocol, where a request is always followed by a response.

    I can reproduce the bug also with HTTP traffic, using the webserver that is running on our MSP432 based device. The HTTP server sits on top of the same TCP/IP stack. Wireshark logs highlight the same problem.

    I think that in general any kind of communication based on TCP/IP stack is affected by this problem.

    By the way, looking into the code, I noticed that very similar debug code already exists in the ipout.c file, commented out.

    At line 55, inside function IPTxPacket, I uncommented the debug code:

    /*///// Simulate Trouble //////// */
    /* static int foo = 0; */
    /* if( !(foo++ % 13) ){ PktFree(hPkt); return(IPTX_SUCCESS); } */
    /*/////////////////////////////// */


    I think this code simulates exactly what I did by skipping the call to IPTxPacket altogether.

    Uncommenting this debug code generates the same error scenario.

    Could you please reproduce these modifications on your side? 

    Thanks,
    Luca

     

     

  • Hi Luca,

    Thanks a lot for filling me in on what you've tried and found. With the modification of forcing the tcp stack to drop a packet every 100 packets I wasn't able to reproduce your exact issue, but I did hit another issue that on the surface seems unrelated to yours, but I am also chasing it down. If I force my ndk to drop every 100th tcp packet, my device retransmits the dropped packet within receiving three duplicate acks - usually after receiving one duplicate ack. The wireshark capture provided doesn't go on long enough for me to see, but from your comments it sounds like yours does not retransmit whatsoever?
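    The "three duplicate acks" mentioned above refers to the fast retransmit rule in standard TCP congestion control. The sketch below shows only the counting part of that rule. The struct, the function, and the threshold macro are hypothetical names, the full rule has extra conditions (for example that the ACK carries no data and that data is outstanding) which are left out, and the NDK may trigger its retransmission differently, for instance from a timer.

```c
#include <assert.h>
#include <stdint.h>

#define DUPACK_THRESHOLD 3u  /* standard fast retransmit threshold */

/* Hypothetical tracker for duplicate ACKs seen by the sender. */
struct dupack_tracker {
    uint32_t last_ack;   /* most recent ACK number received */
    unsigned dup_count;  /* how many times it has been repeated */
};

/* Feed one received ACK number. Returns 1 exactly when the ACK is the
 * third duplicate, which is the point where the sender should retransmit
 * the segment starting at 'last_ack'. Returns 0 otherwise. */
static int dupack_on_ack(struct dupack_tracker *t, uint32_t ack)
{
    assert(t != 0);
    if (ack == t->last_ack) {
        t->dup_count++;
        return t->dup_count == DUPACK_THRESHOLD;
    }
    t->last_ack = ack;   /* a new ACK number resets the count */
    t->dup_count = 0;
    return 0;
}
```

    Fed with the ACK number 71 repeatedly, the function returns 1 on the fourth arrival overall, which is the third duplicate.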

    Regards,

    Brandon

  • Hi Brandon,

    I also see the duplicate ACKs when the error scenario occurs, but they seem to be a side effect of the issue.

    The issue can be reproduced consistently on my side, dropping the transmitted packets, as we did with a code modification.

    I will refer to the logs I attached before, to make it clear.

    Message num 146, REQ, ack num = 60   -> Message num 147, RESP, seq num = 60. REQ ack and RESP seq are equal, that's ok, communication is fine.

    ------ Here we drop the packet   ------- from now on, we see the error:

    Message num 157, REQ, ack num = 71  -> Message num 158,  RESP, seq num = 82. REQ ack and RESP seq are NOT equal, that's wrong.

    From now on, the REQ ack number is always 71 (since the response is not considered valid) and the RESP seq number keeps incrementing, as you can see in the following messages in the log.
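    To make the comparison above concrete, this is a small sketch of how the distance between the ACK number in a request and the sequence number in the next response can be computed. The helper name tcp_seq_gap is an assumption. It uses the usual modulo 2^32 arithmetic, so it also works on absolute sequence numbers that wrap around.

```c
#include <assert.h>
#include <stdint.h>

/* Signed distance from 'expected' (the ACK number in the request) to
 * 'actual' (the sequence number of the response), modulo 2^32.
 *   0  : the response starts exactly where the peer expects it
 *   >0 : the response starts that many bytes after the expected byte,
 *        so earlier bytes are missing on the receiving side
 *   <0 : the response starts before the expected byte (old data) */
static int32_t tcp_seq_gap(uint32_t expected, uint32_t actual)
{
    return (int32_t)(actual - expected);
}
```

    For the two pairs quoted above, tcp_seq_gap(60, 60) is 0 and tcp_seq_gap(71, 82) is 11, meaning the PC is still missing the 11 bytes starting at 71.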

    When I use a web browser to communicate with the web server of the MSP432, the wireshark log changes a little (after some time the web browser stops the requests), but the issue I get is exactly the same:

    the mismatch between request ACK number and response SEQUENCE number. 

    The issue occurs consistently, even though the stack seems to react correctly to the first packet dropped. From the second dropped packet on, the TCP/IP stack seems to respond with an increasing seq num, even though the request ACK number always remains the same.

    When the error scenario occurs, the client (our proprietary Modbus client, a generic Modbus client or the web browser) considers the response wrong and the communication does not work anymore.

    I have here a MSP-EXP432E401Y demo board. Do you think it is useful if I try to reproduce the issue on the demo board, with the demo code?

    Regards,

    Luca

  • Hello,

    Maybe I'm misunderstanding your first sentence - the duplicate acks are the error response implemented into tcp.

    The code here

    /*///// Simulate Trouble //////// */
    /* static int foo = 0; */
    /* if( !(foo++ % 13) ){ PktFree(hPkt); return(IPTX_SUCCESS); } */
    /*/////////////////////////////// */

    isn't quite the same as the code you inserted to reproduce the error. I ran into memory issues after I inserted your example code, which is actually expected, because the packet memory that would normally be freed inside the call to IPTxPacket() is never freed. To get a more accurate representation of the issue I would insert a call to PBM_free( pPkt ); before setting error = 0 in your error producing code. With this addition, on my end, the NDK has no issues at all and successfully retransmits packets after receiving a few duplicate acks. I cannot replicate your issue. Is it possible to provide me with a full copy of your application that I can run?
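    The ownership point can be shown with a host-side mock. Nothing below is NDK code: mock_pkt_alloc, mock_pkt_free and mock_ip_tx are hypothetical stand-ins (for a packet buffer allocator, PBM_free and IPTxPacket respectively), and the sketch assumes, as described above, that the transmit call takes ownership of the buffer and frees it. The drop path therefore has to free the buffer itself, otherwise one buffer leaks per simulated loss.

```c
#include <assert.h>
#include <stdlib.h>

static int live_buffers;  /* mock buffers allocated and not yet freed */
static int tx_count;      /* buffers actually handed to the mock IP layer */

/* Hypothetical stand-ins for the packet buffer allocator and PBM_free(). */
static void *mock_pkt_alloc(void)  { live_buffers++; return malloc(64); }
static void mock_pkt_free(void *p) { free(p); live_buffers--; }

/* Hypothetical stand-in for IPTxPacket(): takes ownership of the buffer
 * and releases it once it has been "sent". Returns 0 for success. */
static int mock_ip_tx(void *p)
{
    tx_count++;
    mock_pkt_free(p);
    return 0;
}

/* Send a packet, silently dropping every 'drop_every'-th one. The dropped
 * packet is freed here because the transmit path never sees it. */
static int send_with_simulated_loss(void *pkt, unsigned *counter,
                                    unsigned drop_every)
{
    assert(counter != 0);
    (*counter)++;
    if (drop_every != 0 && (*counter % drop_every) == 0) {
        mock_pkt_free(pkt);  /* the step that was missing in the test code */
        return 0;            /* report success, as a lossy network would */
    }
    return mock_ip_tx(pkt);
}
```

    After 200 calls with drop_every set to 100, this model has transmitted 198 buffers and holds no live buffers. Without the free in the drop path, two buffers would remain allocated.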

    Thanks,

    Brandon

  • Hi Brandon,

    Now I can consistently reproduce the packet loss by uncommenting the lines you pointed out in the ipout.c file. I tried to reproduce the error scenario on the demo board with the httpserver demo, and I can't: it works properly. I now see the TCP retransmission after some DUP ACKs. So I ported the httpserver demo to our board, and again everything works properly: the TCP stack reacts correctly.

    I'll review our application implementation; there is definitely a bug on our side.

    Thanks for guiding me through this debug!

    Regards,

    Luca
