I am experiencing an issue with the lwIP 1.4.1 stack on the TM4C1292NCPDT chip. After a variable amount of time of TCP communication and responses, generally at least eight hours, the TCP port abruptly stops responding to TCP packets. Using debug code and Wireshark, I can trace incoming packets through the lwIP stack and into the application layer, and see that my application layer generates a response, but the response never exits the stack, and further TCP communication is impossible. Using lwIP stats, I can see that on a failed unit the TCP memerr counter reaches very large values, many tens of thousands of events. Before failure, the TCP memerr counter reports no events.
UDP communication is unaffected, and TCP communication over a different connection is also still possible.
For many years, and across many thousands of devices, I used TI's lwIP 1.3.0 port on the LM3S6911 without a single occurrence of this issue. It appears this is an issue with the TCP send mechanism in lwIP 1.4.1.
This post seemed promising, but implementing the fix outlined did not solve the problem: http://e2e.ti.com/support/microcontrollers/tiva_arm/f/908/p/374100/1316428#1316428
Bravo - a very well constructed, detailed, caring posting - very well done! I could not bear to see your writing, "just sit."
Far "over my head" in matters "TCP" - yet maybe the following provides some aid.
Again - operating w/out "TCP" knowledge - might it prove useful to "completely control" both the input & output responses - limiting such to elementary (brief) transactions? With this level of control - perhaps the failure times may converge - which should "point the way" toward the discovery of the failure mechanism...
Firm/I have enjoyed much diagnostic success - even when - and (sometimes) especially when - we know NOTHING of the client's (failed) application...
Thank you, Amit. Is it not true that, "A fix unknown or not quickly/easily found" may be described as ineffective? That's a pity - so hard to justify and/or understand...
Hi Amit,
Many thanks for the answers. Here are answers to your specific questions:
TivaWare: I’m not sure of the original version we were using (I will check into that), but we upgraded to 2.1.3.156 over the weekend and performed some testing. Seven of eight test units failed within 4-5 hours, and one lasted overnight, which is fairly consistent with our earlier testing results. As before, we are able to communicate over UDP, or over a different TCP connection, to these units.
IP address: I don’t have a way to check this externally at the moment, but the fact that I can open additional TCP connections, and communicate over UDP, makes me believe that the IP address is still valid in the stack.
Let me know what other testing I can do to help you narrow down this issue.
Thanks,
Mike
Hi Amit,
Unfortunately not. We've tested increasing the stack size from 2k to 10k with no measurable improvement in uptime. I verified the previous version of TivaWare we were using: 2.0.1.11577.
Thanks,
Mike
Hello Michael
Can you please send the lwipopts.h file from the project that gives the error, the Wireshark log from when the error occurs, and the debug output from lwIP?
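For readers following along, these are the lwipopts.h settings most directly tied to TCP memerr counts in lwIP 1.4.1. The values below are illustrative examples only, not the settings from the project's actual attached file:

```c
/* Illustrative lwipopts.h excerpt -- example values, not the project's
 * actual configuration. */
#define LWIP_STATS              1   /* collect lwip_stats */
#define TCP_STATS               1   /* including lwip_stats.tcp.memerr */

#define MEM_SIZE                (16 * 1024)  /* heap used for PBUF_RAM sends */
#define MEMP_NUM_TCP_SEG        32           /* max queued TCP segments */
#define PBUF_POOL_SIZE          24           /* pool pbufs for received frames */

#define TCP_MSS                 1460
#define TCP_SND_BUF             (4 * TCP_MSS)
/* TCP_SND_QUEUELEN must be sized to cover TCP_SND_BUF; if it is too
 * small, tcp_write() returns ERR_MEM and the memerr counter climbs. */
#define TCP_SND_QUEUELEN        (4 * TCP_SND_BUF / TCP_MSS)
```

If the send-side heap or segment pool is exhausted and never recovers, the symptom matches what is described above: receives still traverse the stack, but every transmit attempt fails allocation.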
Hi Amit,
See attached for the lwipopts.h, and a screenshot with some lwIP stats info. The two drives with 0 TCP memerrs recorded have not yet failed. Note that TCP xmit + TCP memerr is nearly equal to TCP recv for all the failed drives.
The App xmit and App recv numbers are the counts of TCP packets transmitted and received by our application layer before freezing; note the variation, from hundreds of thousands to over 1.7 million packets.
We don't have a JTAG port on this device, so we have to get all our debug info out through a UDP response from pre-coded routines in the units. We can add more information to that packet if needed.
I don't have a Wireshark trace at hand, but I'll capture one and post it.
Thanks,
Mike