NDK fails after TCP retransmission starts

Other Parts Discussed in Thread: AM3352

Hi,

We are running SYS/BIOS with NDK on our production board using the CPSW interface. Our current software versions:

Processor AM3352
Code Composer Version: 6.0.1.00040
Compiler TI v5.1.6
XDCTools 3.30.3.47_core
SYS/BIOS 6.40.3.39
NDK 2.24.0.11

     After ~4 days of running two TCP ports to a directly connected Windows XP laptop with our software, our embedded system fails to respond to pings and our NDK stack tests fail. I will follow this post with screenshots of the Wireshark captures and NDK internal statistics (so that the first post in this thread is not too long).
The laptop-to-embedded-system link runs through a switch that mirrors the Ethernet traffic to a third laptop, which captures the packets. See the following post for Wireshark screenshots (I can also send the capture if you need it, but it is 256 MB in size).

Thanks,

     John C.

  • Needed information:

    IP of XP laptop:                     10.1.9.210

    IP of Embedded system:       10.1.9.200

    Embedded Port 1033:          Data Port

    Embedded Port 1031:          Data out port

    Failure after ~4 days, TCP retransmission of data out port:

    Both ports 1033 and 1031 retransmit, causing retransmits from Laptop

    The embedded system continues to retransmit the TCP packets for a short period and the XP laptop
     gives up and tries to reestablish a new connection.

    The embedded system stops responding after a period of time.

  • Here are my final notes on the problem:
    - The embedded system's ACK packets on port 1031 always seem to have a frame check sequence error (bad checksum). Because we are getting the data from a mirrored port on a switch and the receiving laptop's Ethernet driver has checksum offload turned off, we are not sure why we see this (port 1031 ACK packets are the only ones that show this error).
    - We added a call to retrieve some of the internal stats:
    cslRxCnt: 381924790
    ethRxCnt: 0
    nimuRxCnt: 381945963
    notGoCnt: 0
    appRxCnt: 0
    intPktEnq: 0
    mem_squeeze_err: 0
    csl_err: 0
    emac_fatal: 0
    - We also carry stats for packets processed by each port and how long it took:
    ----------------
    Data port msg Count: 1097387 Min: 25953 Max: 6700015 Avg: 121975 nsecs
    Data Out port msg Count: 380427515 Min: 574 Max: 141367385 Avg: 84829 nsecs
    ----------------
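
    For readers who want to collect similar per-port timing figures, here is a minimal sketch of a count/min/max/average accumulator. It is not the code used on this board; `PortStats` and the `port_stats_*` functions are hypothetical names introduced only for illustration. The one design point worth noting is the 64-bit total: with the data out port figures above (about 380 million messages averaging about 85,000 ns), the running sum is on the order of 3.2e13 ns, which would overflow a 32-bit accumulator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-port timing statistics (illustrative names only). */
typedef struct {
    uint64_t count;     /* number of messages processed */
    uint64_t min_ns;    /* shortest processing time seen, in nanoseconds */
    uint64_t max_ns;    /* longest processing time seen, in nanoseconds */
    uint64_t total_ns;  /* running sum; 64-bit so days of traffic do not overflow */
} PortStats;

/* Reset the statistics; min starts at the largest value so any sample lowers it. */
void port_stats_init(PortStats *s)
{
    assert(s != NULL);
    s->count = 0;
    s->min_ns = UINT64_MAX;
    s->max_ns = 0;
    s->total_ns = 0;
}

/* Record the processing time of one message. */
void port_stats_record(PortStats *s, uint64_t elapsed_ns)
{
    assert(s != NULL);
    s->count++;
    s->total_ns += elapsed_ns;
    if (elapsed_ns < s->min_ns) {
        s->min_ns = elapsed_ns;
    }
    if (elapsed_ns > s->max_ns) {
        s->max_ns = elapsed_ns;
    }
}

/* Integer average in nanoseconds; returns 0 when nothing has been recorded. */
uint64_t port_stats_avg(const PortStats *s)
{
    assert(s != NULL);
    return (s->count != 0) ? (s->total_ns / s->count) : 0;
}
```

    If the record function is called from more than one task, the caller would need to protect the structure (for example with a gate or by keeping one instance per task).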

    Thanks,
    John C.
  • John,

    Can you check what state the application is in when the hang happens? Is the RX/TX interrupt getting hit in this scenario? If so, where does it break?

    Regards,
    Vinesh

  • Hey Vinesh,

    Technically there is no "hang". The system continues to run, except that the embedded device can no longer communicate via Ethernet. There are two tasks, each responsible for a different TCP port. Are you asking about the status of the tasks?

    Thanks,

         John C.

  • Hi John,

    Can you provide some more info on the state of the system when this issue occurs?

    For example, what Tasks are shown in the ROV tool?  Are the NDK tasks still running?

    Steve

  • Hey Steve,

    The current failure was captured without the embedded system running under the debugger (the image was flashed). I can restart the test using the debugger, but I will not be able to get you an answer for 3-4 days. Do you wish to look at the full Wireshark capture? In the trace you can see that the retransmissions start at time 171.253387, and the embedded system keeps sending retransmissions intermittently until time 250.683758. So for roughly 80 seconds the embedded system attempts to retransmit. The zipped capture is still 25 MB, so I can't post it.

    Thanks,
    John C.
  • Hi John,

    Yes, I would recommend retrying the test with debug enabled, if it's not too difficult to do. I think it will be useful information to see the task states when this failure occurs.

    If you can save the WS capture, this may be helpful, too. We can figure out how to transfer it outside of the forums.

    While I do think it will be good/helpful to gain insight about these things, I also feel that Vinesh's suggestion about the TX/RX interrupts firing/not firing could be related to this. You need to also check that the Ethernet ISRs for TX/RX are still triggering at the failure point.
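
    One way to check this on a flashed image, without a debugger attached, is to count ISR entries and have a periodic task report when a counter stops advancing. The sketch below is only an illustration under assumptions: `IsrMonitor` and the `isr_monitor_*` functions are hypothetical names, not NDK or CPSW driver APIs, and where the two `on_*_isr` hooks would be called from depends on the actual driver source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ISR_RX_STALLED 0x1u  /* RX counter did not advance since the last poll */
#define ISR_TX_STALLED 0x2u  /* TX counter did not advance since the last poll */

/* Hypothetical monitor state (illustrative; not part of the NDK). */
typedef struct {
    volatile uint32_t rx_isr_count;  /* incremented from the Ethernet RX ISR */
    volatile uint32_t tx_isr_count;  /* incremented from the Ethernet TX ISR */
    uint32_t last_rx;                /* RX count seen at the previous poll */
    uint32_t last_tx;                /* TX count seen at the previous poll */
} IsrMonitor;

void isr_monitor_init(IsrMonitor *m)
{
    assert(m != NULL);
    m->rx_isr_count = 0;
    m->tx_isr_count = 0;
    m->last_rx = 0;
    m->last_tx = 0;
}

/* Assumed hook: call at the top of the RX interrupt handler. */
void isr_monitor_on_rx_isr(IsrMonitor *m)
{
    m->rx_isr_count++;
}

/* Assumed hook: call at the top of the TX interrupt handler. */
void isr_monitor_on_tx_isr(IsrMonitor *m)
{
    m->tx_isr_count++;
}

/* Call periodically from a task. Returns a bitmask of ISR_*_STALLED flags
 * for counters that have not changed since the previous call. Comparing for
 * inequality keeps the check valid when the 32-bit counters wrap around. */
unsigned isr_monitor_poll(IsrMonitor *m)
{
    unsigned stalled = 0;
    uint32_t rx;
    uint32_t tx;

    assert(m != NULL);
    rx = m->rx_isr_count;
    tx = m->tx_isr_count;

    if (rx == m->last_rx) {
        stalled |= ISR_RX_STALLED;
    }
    if (tx == m->last_tx) {
        stalled |= ISR_TX_STALLED;
    }
    m->last_rx = rx;
    m->last_tx = tx;
    return stalled;
}
```

    A stalled flag is only meaningful if traffic was expected during the polling interval, for example while the laptop is still sending pings or connection attempts. The result could be written to a log or LED so it survives a multi-day run without a debugger.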

    Steve
  • Hey Steve,

         I do have the capture if you figure out how I can send it. My 90-day license ran out so I'm working on getting another so I can restart the tests. As soon as I get more information I'll post the results.

    Thanks,

         John C.