RTOS/TM4C129XNCZAD: Is there a way to reset Ethernet Phy and NDK?

Yan Li29

Part Number: TM4C129XNCZAD

Tool/software: TI-RTOS

I have a use case where the TM4C micro-controller is in an enclosed environment with only Ethernet access to the outside world and restricted access to the reset switch. The micro-controller is running TI-RTOS and the TI NDK. Intermittently (about once a week), the Ethernet connection drops, meaning that devices in the outside world can no longer communicate with the micro-controller over Ethernet. The Cisco router can see periodic DHCP requests coming from the micro-controller, which indicates that the link to the micro-controller is still up but that DHCP responses are not reaching it. Resetting the micro-controller restores the Ethernet connection, but that is not a long-term solution. Instead of rebooting the micro-controller, is there a way to reset just the Ethernet PHY and the NDK? Because access to the micro-controller is limited, debug options are very limited too. Are there any suggestions to help determine whether this is an issue with the network configuration or with the NDK? The fact that resetting the micro-controller re-establishes the Ethernet link has me leaning towards the NDK.

over 7 years ago

0 Peter Borenstein over 7 years ago

Mastermind 8695 points

NC_NetStop() will shut down the network. You can start it again with NC_NetStart().
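A restart like this is usually driven by some policy that decides the stack has gone too long without a usable address. The sketch below shows only that decision, in plain C. `net_watch_t`, `net_watch_on_ip_change` and `net_watch_should_restart` are hypothetical names, not NDK APIs, and the time source is assumed to be a free-running seconds counter. The NDK documentation describes NC_NetStop as taking an integer argument that NC_NetStart then returns, so confirm the exact signatures against netctrl.h in your installation before wiring this in.

```c
#include <stdint.h>

/* Hypothetical bookkeeping for a "no DHCP address for too long" policy. */
typedef struct {
    int      has_addr;   /* 1 while a DHCP-assigned address is held */
    uint32_t lost_at_s;  /* seconds counter value when the address was lost */
    uint32_t limit_s;    /* how long to tolerate having no address */
} net_watch_t;

/* Call from the IP address hook: added != 0 when an address was added,
 * 0 when it was removed. */
void net_watch_on_ip_change(net_watch_t *w, int added, uint32_t now_s)
{
    w->has_addr = added ? 1 : 0;
    if (!added) {
        w->lost_at_s = now_s;
    }
}

/* Returns 1 when the stack has been without an address for longer than
 * limit_s. Unsigned subtraction keeps the comparison valid across a
 * wraparound of the seconds counter. */
int net_watch_should_restart(const net_watch_t *w, uint32_t now_s)
{
    if (w->has_addr) {
        return 0;
    }
    return (uint32_t)(now_s - w->lost_at_s) > w->limit_s;
}
```

A periodic task could poll `net_watch_should_restart` and call NC_NetStop when it returns 1. That only brings the network back if the stack thread wraps NC_NetStart in a loop that starts it again, which is an assumption about the application.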

Your interest in finding the root problem seems mild. Do you want to fix your widget or not?

0 Yan Li29 over 7 years ago in reply to Peter Borenstein

Expert 1015 points

I actually do want to fix my application, but don't know where to start since I can't just put a JTAG probe on it and start debugging. Being in a closed environment, I can only rely on logs. Do you have any suggestions on how to debug?

0 Peter Borenstein over 7 years ago in reply to Yan Li29

Mastermind 8695 points

Debugging is inspiring. "turn it off and on again" is saddening, but sometimes reasonable.

DHCP is a 4 step process. Is the device responding to the offer with a request?

0 Yan Li29 over 7 years ago in reply to Peter Borenstein

Expert 1015 points

For a product that is already in customer hands, "turn it off and on again" buys me time to debug and find the root cause.

I have attached a Wireshark capture from the customer. The micro-controller currently goes into Auto-IP mode when link loss is detected, and you can see that it is sending DHCP Discovers from the IP address 169.254.4.5. The network sees the request and responds with a DHCP Offer. However, the micro-controller does not take the offer and continues to send DHCP Discovers. One thing I am planning to try is to disable Auto-IP so that the micro-controller does not get the 169.254.4.5 address. I am not sure whether that address is preventing it from seeing the DHCP Offer.

CAMERA25.zip

0 Peter Borenstein over 7 years ago in reply to Yan Li29

Mastermind 8695 points

I would expect a DHCP discover datagram to have 0.0.0.0 as the source address. Odd.

0 Peter Borenstein over 7 years ago in reply to Peter Borenstein

Mastermind 8695 points

Does the MAC address look correct? The OUI is registered to the IEEE Registration Authority, and the DHCP offer is going to a different IP than the discovery source.

Is there a router between this capture and your device?

0 Yan Li29 over 7 years ago in reply to Peter Borenstein

Expert 1015 points

The MAC address of the micro-controller is correct. If you notice, the last 2 bytes of the IP address are the same as the last 2 bytes of the MAC. When the micro-controller does not get a valid DHCP response after a period of time, it reverts to APIPA and picks an IP address for itself. We specifically use the last 2 bytes of the MAC address as the 2 least significant bytes of the initial APIPA IP address to avoid IP conflicts. That explains why the micro-controller is using the 169.254.4.5 IP address. This indicates to me that at the time the capture was taken, the micro-controller had been failing to get a DHCP lease for some time and had reverted to APIPA. While in APIPA, the micro-controller still sends out DHCP requests periodically.
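The address scheme described above can be sketched as follows. This is only an illustration in plain C: `apipa_from_mac` and `apipa_third_octet_ok` are hypothetical helper names, not NDK functions, and the address is returned in host byte order. The second helper is there because RFC 3927 reserves the first and last 256 addresses of 169.254.0.0/16, so a MAC whose fifth byte is 0x00 or 0xFF would map to a reserved address under this scheme.

```c
#include <stdint.h>

/* Build 169.254.x.y from the last two bytes of a 6-byte MAC address.
 * The result is in host byte order, e.g. 0xA9FE0405 for 169.254.4.5. */
uint32_t apipa_from_mac(const uint8_t mac[6])
{
    return (169u << 24) | (254u << 16) |
           ((uint32_t)mac[4] << 8) | (uint32_t)mac[5];
}

/* RFC 3927 reserves 169.254.0.x and 169.254.255.x, so a third octet of
 * 0 or 255 should not be used as a self-assigned address. */
int apipa_third_octet_ok(uint32_t addr)
{
    uint32_t third = (addr >> 8) & 0xFFu;
    return third != 0u && third != 255u;
}
```

With a MAC ending in 04:05, `apipa_from_mac` yields 169.254.4.5, matching the source address seen in the capture.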

As I am not proficient in network equipment and topologies, I assumed that he was able to capture from the router itself. It is something I can query and clarify with the customer.

With APIPA disabled, the micro-controller will not get into the state where it is sending DHCP requests from the APIPA address, and hopefully that resolves the issue. But that is somewhat of a WAG.

0 Charles Tsai over 7 years ago in reply to Peter Borenstein

TI__Guru**** 191906 points

Peter,

It seems that in the BOOTP payload of the DHCP Offer frame, the client is sending the IP address 0.0.0.0 but the server is trying to lease a new IP address of 10.101.155.2. I don't know if the client is refusing to take this new address and keeping the prior one instead, and hence not moving on to the DHCP Request phase.

0 Peter Borenstein over 7 years ago in reply to Charles Tsai

Mastermind 8695 points

What does the exchange look like after a reset when things are working?

Why is the subnet mask in the offer 255.255.255.254? Normally a private network address that starts with 10. has a mask of 255.0.0.0. I tried starting a server with this mask and it failed to load.
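For reference, 255.255.255.254 is a contiguous mask with 31 leading one-bits, that is, a /31, which RFC 3021 defines for point-to-point links with two addresses and no separate network or broadcast address. A small C sketch for checking a mask and reading off its prefix length follows. The helper name `mask_prefix_len` is made up for illustration, and whether a particular DHCP server or the NDK accepts a /31 is a separate question that this does not answer.

```c
#include <stdint.h>

/* Return the prefix length of an IPv4 netmask given in host byte order,
 * or -1 if the one-bits are not contiguous from the top. */
int mask_prefix_len(uint32_t mask)
{
    int len = 0;

    /* Count one-bits from the most significant end. */
    while (mask & 0x80000000u) {
        len++;
        mask <<= 1;
    }

    /* Any one-bit left after the first zero means a non-contiguous mask. */
    return (mask == 0u) ? len : -1;
}
```

Here `mask_prefix_len(0xFFFFFFFEu)` gives 31 for the mask in the offer, and `mask_prefix_len(0xFF000000u)` gives 8 for 255.0.0.0.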

0 Yan Li29 over 7 years ago in reply to Peter Borenstein

Expert 1015 points

Charles,

In this case the micro-controller is the client. How can I tell if the micro-controller is not taking the IP address? As DHCP is handled by the NDK almost exclusively, is there a way to find out if the NDK is getting the DHCP offer? My code has a callback set in the "Network IP address hook" and uses whatever IP is assigned there as long as fAdd is greater than 0. I don't believe I have control over the DHCP request. That is handled entirely by the NDK. Is there any way I can get visibility into that?

-Yan

0 Yan Li29 over 7 years ago in reply to Peter Borenstein

Expert 1015 points

Peter,

Good info on the subnet mask on the offer. I will dig into that a little deeper. Maybe the problem is with the switch or router.

-Yan

0 cb1_mobile over 7 years ago in reply to Yan Li29

Guru 117855 points

May I note firm's/my full agreement w/Peter re: "Substantial Value of Debug."

You stated that your, "System is in the hands/possession of the (distant) client" - as "justification for NOT debugging!" Is it possible that you, "Failed to build EXTRA/SPARE/BACK-UP UNITS" - so that you could retain several - for continued test/analysis - even replacement? If true - you (may) have created your own gallows!

And if true - should not "your build of additional units (or recovery of 1 from the client)" - quickly - be of HIGHEST PRIORITY? (Enabling YOUR more comfortable & efficient - local debug!)

While this, "One particular issue has been recognized" - is it not likely that (others) - not yet having presented - or escaping notice - await you? Again - debug IS essential!

0 Yan Li29 over 7 years ago in reply to cb1_mobile

Expert 1015 points

re: cb1

You are way off base with your comment. "in the hands/possession of the (distant) client" means a product that has already been released and is in the field. Meaning we have more than a thousand products in the field and this failure is only occurring for a specific (albeit big) customer.

Issue only occurs in the user's environment and is not reproducible on site. And yes, we have units in house ... and why did I even bother answering your post cb1???

0 cb1_mobile over 7 years ago in reply to Yan Li29

Guru 117855 points

Yan Li29 said:
For a product that is already in customer hands

Does your above quote - not "reasonably" suggest - that INDEED - your product resides w/client? And thus - how can such comment be, "Way off base" - as you loudly proclaim?

You further note, "a thousand products in the field and this failure is only occurring for a specific (albeit big) customer." That's KEY information - is it not? And NOT earlier provided.

You further state (now), "And yes, we have units in house" - yet that (appeared) not to be clearly presented earlier - is that not true?

As to your (rather unfriendly), "Why did you bother to answer" - ONLY you can provide that answer.

It is uncertain if you have the wisdom/experience to recognize that, "Many/most of the issues I raised - provide (necessary) amplification & clarification - for Vendor Agents!" (who represent - your surest & fastest - path towards success!)

The one, "Off Base" - may not be me! All "facts in evidence" point to my suggestions being, "Fair & Reasonable" ... your "Venom" - not so much!

Despite your "over the top" attack - I do hope your issues may be resolved...

0 Charles Tsai over 7 years ago in reply to Yan Li29

TI__Guru**** 191906 points

Hi Yan,
I was speculating rather than stating that the client (MCU) wants the prior IP address instead of the new one. DHCP is a four-step process: Discover -> Offer -> Request -> Acknowledge. In your Wireshark capture, the client either didn't receive the Offer from the server or didn't want it. Perhaps my earlier speculation was wrong. The MCU may be in a state where it is not properly receiving the Offer and hence keeps sending DHCP Discovers.
I don't have enough knowledge of the NDK. I will seek some help from our TI-RTOS NDK expert. Please allow some time for them to respond. I will be out of office for the next two days and unable to monitor this thread.

0 Peter Borenstein over 7 years ago in reply to Yan Li29

Mastermind 8695 points

The network IP address hook is the same call back you would get if setting the IP statically. I don't think it will help you debug DHCP specifically.

0 Peter Borenstein over 7 years ago

Mastermind 8695 points

Debugging the stack itself might be tricky. NDK uses pre-built libraries. The process for building the libraries yourself is here: processors.wiki.ti.com/.../Rebuilding_The_NDK_Core_Using_Gmake

I have not seen a way to debug the NDK stack in CCS (or had a reason to look), but maybe it is as simple as telling the project where the source files are.

Can you find a way to recreate the DHCP server's offer? I do not see the value in debugging the stack if you can't recreate the situation.

I doubt you really want to modify the stack, but this may help you discover the problem. Modifying the stack seems risky.

0 Yan Li29 over 7 years ago in reply to Charles Tsai

Expert 1015 points

Charles,

Understood. I will await a response from you. In the meantime, I will see if I can reproduce the issue at my location as well.

Thank you.

-Yan

BTW, who is this cb1_mobile person? Seems like a bot or someone spamming the forum in an attempt to get high prestige. I checked his 5 most recent posts and found none of them offer any value.

0 Yan Li29 over 7 years ago in reply to Peter Borenstein

Expert 1015 points

Peter,

Thank you for the info and really appreciate your help. I will try to re-create this issue on site. I have already sent a build to the customer with APIPA removed, recovery mechanism, and additional logging capability. Hopefully that gives me more insight into this issue.

I agree, debugging the stack is not an option right now as I can't even re-create the issue in-house and there is no smoking gun that the issue resides in the stack.

-Yan

0 Charles Tsai over 7 years ago in reply to Yan Li29

TI__Guru**** 191906 points

Hi Yan,
I will ping our NDK expert to see if they can provide some comments. Apologies if there is some delay, as the issue may not be straightforward without a good understanding of the network and the NDK. I'm still out of office (actually in hospital right now).

I think there is some misunderstanding about cb1. He is very knowledgeable and is a long time contributor to the forum. Perhaps his words to you may have miscommunicated his intention to help you speed up the problem resolution. Let's focus on the problem resolution instead of who cb1 is. Thank you for your understanding.

0 Vincent W. over 7 years ago

TI__Genius 12865 points

Hi Yan,

Could you please confirm the version of TI-RTOS and NDK that are being used? If you can share the TI-RTOS configuration (.cfg) file that was used to compile the code it would also be handy.

In addition, I agree with Peter's earlier comment that it'd be good to get a Wireshark capture of what the DHCP exchange looks like after a reset, when things are working, in the same environment. That could help shed some light on the issue.

Best regards,
Vincent

0 Yan Li29 over 7 years ago in reply to Vincent W.

Expert 1015 points

Hi Vincent,

RTOS version is 2.12.1.33. Cfg file is attached.

AppMaster.zip

-Yan

0 Yan Li29 over 7 years ago in reply to Vincent W.

Expert 1015 points

Vincent,

Wireshark capture of a good working case is attached. Look for DHCP for 10.101.155.2

-Yan

CAMERA25_good.zip

0 cb1_mobile over 7 years ago in reply to Charles Tsai

Guru 117855 points

Thank you, Charles.

I'm checking w/our printer to learn if my (promotion) to "bot" may be added to latest business cards.

0 Chester Gillon over 7 years ago in reply to Yan Li29

Guru 92251 points

Yan Li29 said:
Wireshark capture of a good working case is attached.

Looking at the CAMERA25.pcap (bad) and CAMERA25_good.pcap files I can't see any differences in the DHCP Offer messages from the DHCP server between the good and bad cases.

Once the device is in the failed state is it responding to any Ethernet messages? E.g. does it respond to pings on either the original DHCP server allocated IP address or the auto-allocated IP address?

From the description of the problem, I'm not sure whether the symptom is just the device not accepting DHCP Offer messages after a DHCP lease renewal, or whether no communication with the device is possible at all.

0 Vincent W. over 7 years ago in reply to Yan Li29

TI__Genius 12865 points

Thanks Yan for the information. I see that the DHCP lease time is set to 8 days from the wireshark capture. Does that match the time when the disconnection occurs? In your original post, you mentioned 'about once a week'. We are wondering if the issue is triggered by an expiring lease. Would it be easy to change the DHCP lease time set on their server to a shorter period (e.g. a few minutes) and see if the disconnection occurs? We'll try to reproduce this on our end as well and let you know if we can see the issue.
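As a worked example of the timing in question: RFC 2131 gives default renewal (T1) and rebinding (T2) times of 50% and 87.5% of the lease, unless the server overrides them with options 58 and 59. The sketch below just does that arithmetic in C. `dhcp_timers_t` and `dhcp_default_timers` are hypothetical names, and it is an assumption that the NDK client uses these defaults unchanged.

```c
#include <stdint.h>

/* Default DHCP timer values derived from a lease duration, in seconds. */
typedef struct {
    uint32_t t1_s;  /* renewal time: client starts renewing with its server */
    uint32_t t2_s;  /* rebinding time: client starts broadcasting requests */
} dhcp_timers_t;

/* Compute the RFC 2131 defaults: T1 = 0.5 * lease, T2 = 0.875 * lease. */
dhcp_timers_t dhcp_default_timers(uint32_t lease_s)
{
    dhcp_timers_t t;

    t.t1_s = lease_s / 2u;
    /* Widen to 64 bits so lease_s * 7 cannot overflow for long leases. */
    t.t2_s = (uint32_t)(((uint64_t)lease_s * 7u) / 8u);
    return t;
}
```

For an 8-day lease (691200 seconds) this gives T1 = 345600 seconds (4 days) and T2 = 604800 seconds (7 days).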

The other thing that would have been interesting to do is to connect to the MCU via telnet after the failure and do some checks (e.g. out-of-memory), but I don't see the telnet server being configured in the .cfg file. I suppose it is not possible to run a modified configuration on the setup that is failing, correct?

Best regards,
Vincent

0 Yan Li29 over 7 years ago in reply to Vincent W.

Expert 1015 points

Hi Vincent,

The customer reports that the micro-controllers are failing DHCP at random times with the average time of once a week. I am still waiting for them to get back to me with the results of the disable APIPA test.

When DHCP fails, the customer is unable to ping the micro-controller. As such even if there is a telnet server, telnet into the micro-controller would fail.

-Yan

0 Yan Li29 over 7 years ago in reply to Chester Gillon

Expert 1015 points

Chester,

When the device is in the failed state, it does not respond to any pings, but it is still attempting DHCP. From the log of the bad scenario, we know the device is still able to transmit, as it is still sending DHCP Discovers. Unfortunately, I have not thought of a good way to determine whether the device is not receiving the offer or whether the stack is rejecting the offer. The only thing I've been told is that after a reboot, and without any cabling changes, DHCP works again.

-Yan

0 Vincent W. over 7 years ago in reply to Yan Li29

TI__Genius 12865 points

Hi Yan,

Thanks for the update. If it is failing for periods less than the lease duration, then this rules out our theory.

The key function that handles the DHCP exchange is StateSelecting() in C:\ti\tirtos_tivac_2_12_01_33\products\ndk_2_24_02_31\packages\ti\ndk\nettools\dhcp\dhcpsm.c. It builds and sends the Discover packet, and then goes into a loop where it waits for the Offer packet for 3 seconds. If it doesn't receive an Offer, it'd retry by sending another Discover packet. That seems to match the behavior in the 'bad' capture.

Two possible hypotheses are:
1. The packet containing the Offer is received, but it is corrupted/invalid so it is rejected by dhcpVerifyMessage().
2. The packet is not received. The recv() call in dhcpPacketReceive() returns -1.
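One way to tell these two hypotheses apart from logs alone is to count each outcome per Discover attempt. The following is a minimal sketch in plain C, not code from dhcpsm.c. `dhcp_diag_t` and `dhcp_diag_record` are hypothetical names, `recv_rc` is assumed to mirror the return value of recv() (negative when nothing arrived), and `verify_ok` is an assumed boolean, since the actual return convention of dhcpVerifyMessage() is not shown in this thread.

```c
/* Outcome of one Discover/Offer attempt. */
typedef enum {
    OFFER_ACCEPTED,
    OFFER_NOT_RECEIVED,   /* hypothesis 2: recv() returned -1 */
    OFFER_REJECTED        /* hypothesis 1: packet arrived but failed checks */
} offer_result_t;

/* Running counters that could be written to the application log. */
typedef struct {
    unsigned discovers_sent;
    unsigned offers_not_received;
    unsigned offers_rejected;
} dhcp_diag_t;

/* Record one attempt and classify it. */
offer_result_t dhcp_diag_record(dhcp_diag_t *d, int recv_rc, int verify_ok)
{
    d->discovers_sent++;

    if (recv_rc < 0) {
        d->offers_not_received++;
        return OFFER_NOT_RECEIVED;
    }
    if (!verify_ok) {
        d->offers_rejected++;
        return OFFER_REJECTED;
    }
    return OFFER_ACCEPTED;
}
```

If `offers_not_received` climbs while `offers_rejected` stays at zero, the Offer is not reaching the socket at all, which points away from message validation and towards the driver, the receive path, or the network.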

Is it possible to run modified binaries on their setup to troubleshoot this?

Furthermore, we tried to reproduce the issue by unplugging the ethernet cable after an IP address is obtained, but so far we haven't been able to see the issue. Does the DHCP service ever get restarted in their application?

Best regards,
Vincent

0 ToddMullanix over 7 years ago in reply to Vincent W.

TI__Guru* 96960 points

Yan,

What's the status on this one?

Todd

0 Yan Li29 over 7 years ago in reply to ToddMullanix

Expert 1015 points

Hi Todd,

We found that when we disabled APIPA (Auto-IP) on our device, the issue went away. There is a switch between the device and the router. I suspect that when the device gets into the Auto-IP state, the switch no longer forwards DHCP Offers to the device. I would say the issue is resolved now. Thank you all for your help.

-Yan

0 Peter Borenstein over 7 years ago in reply to Yan Li29

Mastermind 8695 points

Sorry to hear APIPA didn't work out. TI's demo project enet_io gets a static IP with the same APIPA 169.254.0.0/16 IP range. This project uses lwip.
