TM4C1292NCPDT: lwIP 1.4.1 stack stops responding after multiple hours of TCP communications

Part Number: TM4C1292NCPDT
Other Parts Discussed in Thread: LM3S6911

I am experiencing an issue with the lwIP 1.4.1 stack on the TM4C1292NCPDT chip. After a variable amount of time, generally at least eight hours, of TCP communications and responses, the TCP port abruptly stops responding to TCP packets. Using debug code and Wireshark, I can trace the incoming packets through the lwIP stack and into the application layer and see that a response is generated in my application layer, but the response never exits the stack, and further TCP communication on that connection is impossible. Using lwIP stats, I can see that on a failed unit the TCPmemerror counter reaches many tens of thousands of events. Before failure, TCPmemerror does not report any events.

UDP communication is unaffected, and TCP communication over a different connection is also possible.

For many years, and across many thousands of devices, I had used TI's lwIP 1.3.0 port on the LM3S6911 without a single occurrence of this issue. It would appear that this is an issue with the TCP send mechanism in lwIP 1.4.1.

This post seemed promising, but implementing the fix outlined did not solve the problem: http://e2e.ti.com/support/microcontrollers/tiva_arm/f/908/p/374100/1316428#1316428

  • Bravo - a very well constructed, detailed, caring posting - very well done!     I could not bear to see your writing, "just sit."

    Far "over my head" in matters "TCP" - yet maybe the following provides some aid.

    • you note, "Variable amount of time prior to failure" - can that time be logged - then reviewed for (some) consistency/pattern?
    • might some "unwanted" rare disturbance occur which causes the upset?   (nearby power device turns off/on - sparks/arcs - strong (external) RF signal - etc.)
    • might the "variable amount of time" result from the varying, "data quantity/load" - which when sufficient - cause an, "over-flow?"
    • have you observed (exactly) the same failure across MULTIPLE Boards - better still boards placed at different locations - lessening the chance of single board/location anomaly?  

    Again - operating w/out "TCP" knowledge - might it prove useful to "completely control" both the input & output responses - limiting such to elementary (brief) transactions?   With this level of control - perhaps the failure times may converge - which should "point the way" toward the discovery of the failure mechanism...

    Firm/I have enjoyed much diagnostic success - even when - and (sometimes) especially when - we know NOTHING of the client's (failed) application...

  • Hello Michael,

    Which version of the TivaWare are you using? I believe there were two changes that were made since then. One of them you have already highlighted.
  • Hi Amit,

    Is this to mean (i.e. the 2 changes) that one such "change" addressed this, "Stops Responding" issue?
    If that's the case - once more - should not the correction/fix have been better publicized/published?
  • Hello cb1

    Both of the fixes were rolled into the TivaWare release. I am having trouble locating the two fixes in the forum. One of them is the one mentioned by the user, and the other I am still trying to ascertain.
  • Thank you, Amit.    Is it not true that, "A fix unknown or not quickly/easily found" may be described as ineffective?     That's a pity - so hard to justify and/or understand...

  • Hello cb1

    I checked the release notes and the fixes made were in the qs_iot application. They may not be the same as what the OP has described.

    Hello Michael,

    When the issue occurs, do you see that the IP address is still valid in the lwIP stack?
  • Hi Amit,

    Many thanks for the answers.  Here are answers to your specific questions:

    TivaWare: I’m not sure of the original version we were using (I will check into that), but we upgraded to 2.1.3.156 over the weekend and performed some testing. Seven or eight test units failed within 4-5 hours, and one lasted overnight, which is fairly consistent with our earlier testing results. As before, we are able to communicate over UDP or a different TCP connection to these units.

    IP address: I don’t have a way to check this externally at the moment, but the fact that I am able to open up additional TCP connections, and communicate through UDP, makes me believe that the IP address is still valid in the stack.

    Let me know what other testing I can do to help you narrow down this issue.

    Thanks,

    Mike

  • Hello Mike

    Thanks for the details. If you increase the stack size then does it have an impact on the up time of the application?
  • Hi Amit,


    Unfortunately not. We've tested increasing the stack size from 2k to 10k with no measurable improvement in uptime.  I verified the previous version of TivaWare we were using: 2.0.1.11577.

    Thanks,

    Mike

  • Hello Mike,

    OK. Is it possible to reproduce the issue on a LaunchPad with a simplified code example so that I can reproduce the same in lab conditions?
  • Thanks for the offer - I'll work on getting something ready for you.
    Mike
  • Hello Michael

    Can you please send the lwipopts.h file that you have for your project which gives the error, the wireshark log when the error occurs and the Debug output from lwIP?

  • Hi Amit,

    See attached for the lwipopts.h, and a screenshot with some lwipstats info.  The two drives with 0 TCP memerrs recorded have not yet failed.  Note that the TCP xmit + TCP memerr number is nearly equal to the TCP recv number for all the failed drives.

    The App xmit and App recv numbers are the number of TCP packets received and transmitted by our application layer before freezing - note the variation from hundreds of thousands to over 1.7 million packets.

    We don't have a JTAG port on this device, so we have to get all our debug info out through a UDP response from pre-coded routines in the units.  We can add more information to that packet if needed.

    I don't have a wireshark trace at hand, but I'll get one and post it.

    Thanks,

    Mike

    6765.lwipopts.h

  • One comment on the screenshot: the TCP xmit, TCP recv, and TCP memerr counters are 16-bit numbers, so they're constantly rolling over with this level of traffic. If you add the TCP xmit and TCP memerr on lines 1, 6, and 8 and take the lower 16 bits, you get close to the TCP recv number on those lines.
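For anyone repeating this check, the rollover arithmetic described above can be sketched in C as follows. This is only an illustration of the modulo-65536 comparison. The helper names (`wrapped_sum16`, `wrapped_distance16`, `counters_balance`) and the tolerance parameter are hypothetical and not part of lwIP or TivaWare, and the numbers in the usage note are made up rather than taken from the screenshot.

```c
#include <stdint.h>

/* Add two 16-bit counters the way a 16-bit counter wraps (modulo 65536). */
static uint16_t wrapped_sum16(uint16_t a, uint16_t b)
{
    return (uint16_t)((uint32_t)a + (uint32_t)b);
}

/* Smallest distance between two 16-bit counter values, allowing for rollover. */
static uint16_t wrapped_distance16(uint16_t x, uint16_t y)
{
    uint16_t d = (uint16_t)(x - y);              /* difference modulo 65536 */
    return (d > 0x8000u) ? (uint16_t)(0u - d) : d;
}

/* Returns 1 if the lower 16 bits of (xmit + memerr) are within `tolerance`
 * of recv, i.e. nearly every received packet was either answered or
 * counted as a memory error. Hypothetical helper for illustration only. */
static int counters_balance(uint16_t xmit, uint16_t memerr,
                            uint16_t recv, uint16_t tolerance)
{
    return wrapped_distance16(wrapped_sum16(xmit, memerr), recv) <= tolerance;
}
```

With made-up values, `counters_balance(60000, 10000, 4470, 10)` returns 1, because 60000 + 10000 wraps to 4464, which is within 10 of 4470. If wider counters are wanted, it may be worth checking whether the lwIP build in use offers a `LWIP_STATS_LARGE` option for 32-bit statistics counters; that would need to be confirmed against the stats header in your tree.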
  • Hello Michael

    Thanks for the lwipopts.h. As I understand it, the TM4C129x works as a TCP server, so a TCP echo server running on the TM4C129x with a PC client would also work as a test suite. Is that a correct assumption? Also, what is the size of the packets that you see on the bus when communicating with the TM4C129x device?
  • Hello Michael,

    The MEM_SIZE is set to 16K. In our examples we use 64K as the MEM_SIZE in the lwipopts.h. Could you check your example with the same?