CCS/TM4C1294NCPDT: Emac getting into bad state after months of run time

Part Number: TM4C1294NCPDT

Tool/software: Code Composer Studio

Hello,

I have 8 prototype products using the TM4C1294NCPDTI3 processor running a simple application sending/receiving data over a UDP port and sending out data on a different multicast port.  They have been running well for several months.  I recently noticed that 4 of them had gotten into a strange state.  They could not be pinged but they were still in the ARP table on the device they were connected to and would continue to get re-added if I cleared the entries using arp -d.  The multicast UDP ports were also still sending successfully.  I determined that the UDP ports on the Tiva devices were still receiving data from my host but weren't sending responses back out. I believe they were also receiving the Ping requests but just not sending the responses.  Because they would still receive data, I could send a reboot command and they all came back working 100%. 

I checked memory usage and there don't appear to be any memory leaks or tasks that could be overflowing, and the fact that a ping doesn't work makes me think the problem is in the NDK and not my application code. My best guess is that some event on our network is not being handled quite right by the NDK, maybe a malformed packet of some sort. My plan is to get a network emulator and try to recreate the scenario in the lab.

I'm using TI-RTOS 2_16_01_14.

Are there any known issues or a configuration item that could possibly cause this behavior?   

-Thanks

  • I don't know of any known issue that would cause this.

    Can you describe your network topology?

    Are you able to debug these prototype boards?  I.e. are they connected to JTAG and CCS still? If so, this would greatly help in seeing what might be happening.

    Steve Packwood said:
    Because they would still receive data, I could send a reboot command and they all came back working 100%.

    If no JTAG, maybe you have the Telnet server running on the device? (actually this could be useful in any case). If so, are you able to Telnet into it? There's some useful stuff in there like UDP stats we could check.

    Steve Packwood said:
    running a simple application sending/receiving data over a UDP port and sending out data on a different multicast port.

    How heavy is the traffic?

    What else is happening in your system? Are there a lot of Task threads doing heavy work, or is it lightly loaded, or somewhere in between?

    Steve Packwood said:
    They could not be pinged but they were still in the ARP table on the device they were connected to and would continue to get re-added if I cleared the entries using arp -d.

    This means that ARP packets are coming out of the device, so it's able to send something.

    Are you able to get a Wireshark capture when this is happening? Do you see any ICMP error packets? Can you see ping replies on the wire? (this might indicate that they are indeed coming out of the Tiva board but are not making it back to your PC that sent the ping. Maybe there is a router in between them that's dropping them, for example, just a guess.)

    Steve Packwood said:
    the UDP ports on the Tiva devices were still receiving data from my host but weren't sending responses back out. [...] The multicast UDP ports were also still sending successfully.

    OK, I just wanted to make sure I'm clear on what's going on. Your boards are no longer able to TX UDP data or ping replies.

    They are still sending MCast data out, and are able to RX UDP data and (you think) RX pings.

    Steve

  • Thanks for the reply.

    They aren't connected to JTAG.  They are actually running around on locomotives which is why debugging is difficult.

    I do have a Telnet server but I can't connect to it when it's in this degraded state.

    You understand the problem correctly that the boards don't TX UDP or ping replies.  Another method I have used to verify this is by looking at the switch counters and seeing that they aren't incrementing so it really does look like TX traffic isn't making it out at all and not that the traffic is getting dropped or mis-routed. 

    The network is lightly loaded and there are < 15 tasks running.   

    It may be several months to get more data but I'll post back when I have more info.   

  • Steven Connell said:
    Can you see ping replies on the wire? (this might indicate that they are indeed coming out of the Tiva board but are not making it back to your PC that sent the ping. Maybe there is a router in between them that's dropping them, for example, just a guess.)

    Yet - as poster's subject line states, "works for months!"     Does this not remove (any) router issue?    (We must assume that multiple routers - do not all fail - after months of success.    And - as the system did work for months - routers seem "Free from blame!"

    All other diagnostic advice seemed well-thought - yet the "logic" w/in the above - rose beyond my (and few others') understanding...

     

  • I agree that just because it "works for months" doesn't mean it couldn't be a router issue. However, the fact that it begins working normally after a power cycle of the device would seem to indicate that the problem is not router related.

  • Steve Packwood said:

    I agree that just because it "works for months" doesn't mean it couldn't be a router issue. However, the fact that it begins working normally after a power cycle of the device would seem to indicate that the problem is not router related.

    Not necessarily. Rebooting the device could have forced the router to flush previous entries. It's also possible that rebooting the device reboots the router (if, for instance, they are on the same power line and the reboot is done by power cycling).

    Now if re-booting the router alone does not improve matters I'd put more focus on the device.

    Robert

  • cb1_mobile said:
    Yet - as poster's subject line states, "works for months!"     Does this not remove (any) router issue?    (We must assume that multiple routers - do not all fail - after months of success.

    No, I don't see how you can conclude this at all. How can you logically make this assumption?

    It seems that we are looking at a system that contains many components, any of which could cause a problem at any given time. Just because such a system worked for months doesn't mean you can automatically eliminate certain variables as soon as something goes wrong.

    You can't, at least, based on the information given thus far.  You mention multiple routers above, but no one said anything about multiple routers being present. Actually, no one said anything about any router being present, for that matter.  As I originally said - *maybe* there is a router, as this was also part of my inquiring into the topology of the system (which still isn't clear to me).

    Or perhaps you know more about this issue than has been said in this thread?



  • Steve,

    I am still not sure what your network topology is here. Can you please provide some insight there?

    Steve Packwood said:
    I checked memory usage and there doesn't appear to be any memory leaks or tasks that could be overflowing

    I'm curious, how were you able to check this without JTAG? You must have some other means of getting some diagnostics off of the board(?)

    Steve Packwood said:
    Another method I have used to verify this is by looking at the switch counters and seeing that they aren't incrementing

    Can you please provide some more details on this?

    Thanks,

    Steve

  • Robert Adsett72 said:

    Not necessarily. Rebooting the device could have forced the router to flush previous entries. It's also possible that rebooting the device reboots the router (if, for instance, they are on the same power line and the reboot is done by power cycling).

    Now if re-booting the router alone does not improve matters I'd put more focus on the device.

    Good point. It's also possible that the NDK's route table has a problem (e.g. corrupt data in the table causing TX packets to drop, being fixed by a reboot).

    I think rebooting the router when this issue comes up again would give us a good data point (that is, if there is a router :)

    Steve

  • My friend - was it not "you" who introduced the (possible) presence of the router?    I present a copy of your earlier writing:

    cb1_mobile said:
    Steven Connell
    Can you see ping replies on the wire? (this might indicate that they are indeed coming out of the Tiva board but are not making it back to your PC that sent the ping. Maybe there is a router in between them that's dropping them, for example, just a guess.)

    Might some of your protest then - properly target the SOURCE of the suggestion of  "router?"    (that would be you - was it not?)     And - while your "introduction of the router has been "frozen in time" (via your direct quote) - it is noted that it has (now) "exited" this thread.  

    O.P. HAS identified 4 systems as "failing" - thus it appears "reasonable" to assume that (each) may have contained such "router" - which justifies my use of  "multiple."  

    Surprising (and unwelcome)  "harshness"  from a vendor rep - aimed at an outsider - trying to assist...    

    It has become clear now that (your) suggestion of  "router's presence" was correct - and that my sense of  "once working (may) remove it from future/downstream, "suspect list" was premature.    (as the router may be impacted by outside conditions - thus such "continuation of working" - cannot be guaranteed.    My bad!      It (does) appear as if "multiple" routers are present - which raised a "flag" as to ALL (four in this case) having FAILED - while FOUR OTHERS HAVE SURVIVED - adding yet more "doubt" to "Routers Alone" proving the (sole) suspect...

  • Steve,

    The network has a Windows box on the same subnet as the Tiva device, so there is no router. There is a cell modem on the network that allows me to RDP into the Windows box and initiate pings to the Tiva device. Every component on the network, including the switch, gets power cycled frequently, which doesn't clear the problem. Only power cycling the Tiva device clears the problem.

    I can't check for memory leaks on the field units, but I haven't seen any leaks on my lab units that I do have the debugger on.

    When I try to ping the Tiva device while it is in the bad state, the outgoing switch port packet counter (to the Tiva device) increases but I don't see the incoming packet counter increase, which suggests that the Tiva device isn't transmitting the ping response. I do see the incoming packet counter increase in response to an ARP request.
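
The counter reasoning above can be written down as a small check. This is only a sketch: `port_counters_t`, `ping_verdict_t` and `classify_ping` are hypothetical names, and it assumes the two counters can be read before and after a ping and that no other traffic touches them in between (the boards' own multicast output, for example, would have to be excluded or counted separately).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of one switch port, read before and after a ping. */
typedef struct {
    uint32_t to_device;   /* frames the switch sent out toward the board  */
    uint32_t from_device; /* frames the switch received from the board    */
} port_counters_t;

typedef enum {
    PING_NOT_DELIVERED, /* request never left the switch port            */
    PING_NO_REPLY,      /* request went out, nothing came back           */
    PING_REPLY_SEEN     /* request went out and the board sent something */
} ping_verdict_t;

/* Classify one ping attempt from the change in the two port counters.
 * Unsigned subtraction keeps the deltas correct across counter wraparound. */
ping_verdict_t classify_ping(const port_counters_t *before,
                             const port_counters_t *after)
{
    assert(before != 0 && after != 0);

    uint32_t sent = after->to_device - before->to_device;
    uint32_t received = after->from_device - before->from_device;

    if (sent == 0) {
        return PING_NOT_DELIVERED;
    }
    if (received == 0) {
        return PING_NO_REPLY;
    }
    return PING_REPLY_SEEN;
}
```

With the behavior described in the post, a ping in the bad state would classify as `PING_NO_REPLY`, while the same check around an ARP request would classify as `PING_REPLY_SEEN`.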

  • Steve,

    I agree that the symptoms suggest that the route table may be getting corrupted. The question I have is what can cause this and why would a new ARP request from my host device not fix it?

    Thanks
  • cb1_mobile said:
    Surprising (and unwelcome)  "harshness"  from a vendor rep - aimed at an outsider - trying to assist...  

    Sir - Perhaps we've gotten off on the wrong foot. I only took this to be a debate among professionals on problem solving logic. If I came off as being overly harsh, my apologies, this was not my intention!

    Cheers,

    Steve

  • Steve,

    Do you have Wireshark installed on the Windows box (that's on the same subnet as the Tiva board)? If so, please grab some captures once this problem arises again.

    1. Do you see any ARP requests coming out of the Tiva board when this issue is happening? (in Wireshark)

    2. Any ICMP packets coming out of the board?

    3. If there is any way to check the values of the following global variables, this could be helpful. I realize you probably can only do this in your test environment:
        a. ips
            - ips.Cantforward
            - ips.Localnoroute
            - ips.Localout

        b. udps
            - udps.RcvNoPort
            - udps.SndTotal
            - udps.SndNoPacket

        c. _ICMPIn[]
            - ICMP stats are stored in an array, as opposed to the above two, which are structs

        We really want to look for *any* stats that are increasing, as those may provide us with all-important clues.

    4. I realize this is happening in a production environment, so not sure if this is possible. But, just in case - are these devices hooked up to any kind of display that could show you output from the program? Or do you have all program trace disabled?

        - If you have trace, you could create a new thread that listens for a command to print. This would probably need to be done on a UDP socket, since the board isn't able to send TX packets and therefore couldn't complete the TCP three-way handshake.

        - You could print out some of the stats mentioned above as well as the route table (see the telnet console code for an example of how to do this [ti/ndk/tools/console/conroute.c, DumpRouteTable()])

        - Also, wondering if you see any trace statements regarding a route being removed...

    Or, and maybe this is better, you could await a special command on a UDP socket, and then TX debug info out on a multicast socket (since TX works there). You could await that multicast data on your PC and display it there.
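
As a rough illustration of the idea above (watch for counters that are increasing, then push the result out over the multicast path that still transmits), the following C sketch compares two snapshots and formats only the counters that changed into a text buffer. It is a sketch under assumptions: `diag_snapshot_t`, `diag_names` and `diag_format_deltas` are hypothetical names, the counters are assumed to be 32-bit, and both copying the values out of the NDK's `ips` and `udps` structures and sending the buffer on the multicast socket are left to the application.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DIAG_NUM_COUNTERS 6

/* Labels for the counters named in the thread, in snapshot order. */
static const char *const diag_names[DIAG_NUM_COUNTERS] = {
    "ips.Cantforward", "ips.Localnoroute", "ips.Localout",
    "udps.RcvNoPort",  "udps.SndTotal",    "udps.SndNoPacket"
};

/* Hypothetical snapshot type: the application fills value[] from the
 * stack's statistics each time a debug command arrives. */
typedef struct {
    uint32_t value[DIAG_NUM_COUNTERS];
} diag_snapshot_t;

/* Write one "name +delta" line for every counter that changed between
 * prev and cur. Returns the number of bytes written, not counting the
 * terminating NUL. A line that does not fit is dropped whole. */
size_t diag_format_deltas(const diag_snapshot_t *prev,
                          const diag_snapshot_t *cur,
                          char *buf, size_t buflen)
{
    size_t used = 0;

    assert(prev != NULL && cur != NULL && buf != NULL && buflen > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < DIAG_NUM_COUNTERS; i++) {
        /* Unsigned subtraction stays correct across counter wraparound. */
        uint32_t delta = cur->value[i] - prev->value[i];
        if (delta == 0) {
            continue;
        }
        int n = snprintf(buf + used, buflen - used, "%s +%lu\n",
                         diag_names[i], (unsigned long)delta);
        if (n < 0 || (size_t)n >= buflen - used) {
            buf[used] = '\0'; /* discard the partial line and stop */
            break;
        }
        used += (size_t)n;
    }
    return used;
}
```

The returned length can be passed straight to whatever send call the multicast socket already uses, so the debug path reuses the one transmit route that is known to keep working in the bad state.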

    One thing to note, all of the route look ups are skipped for multicast, so this also indicates a route issue.

    Steve Packwood said:
    The question I have is what can cause this and why would a new ARP request from my host device not fix it?

    Hard to say at this point. It could be an out-of-memory issue, such as not being able to add an ARP entry for the PC when an ARP is received from it. Memory being stomped could also be a possible cause. It would be helpful to be able to see the route table in this scenario (more on that above). The fix would depend on the root cause. Sorry for the vagueness...

    Steve

  • Thank you for your response - your earlier writing did strike several here as "borderline harsh/aggressive."      No apology was/is sought - yet targeting an outsider who attempts to assist does prove curious...   (your concluding sentence clearly "over-reached" - earlier statements logically defended your tech. position - the last one appeared aimed (only) at offense.)        (having attended UCLA Eng. & Law school - I believe such assessment proves correct - it was judged similarly by others as well - and clearly fails to reflect proper, "normal/customary" forum interaction.)

    You appear very well "tech qualified" w/in this thread's subject matter - perhaps a better "accommodation" of the "views of others" (even if/when differing) deserves consideration...

  • cb1_mobile said:
    the last one appeared aimed (only) at offense.

    I think you are referring to the following (please correct me if I'm wrong)?

    "Or perhaps you know more about this issue than has been said in this thread?"

    If so, I think there may be a simple miscommunication.

    No offense was meant here at all. As I read your original post in this thread, I was actually under the impression that you had more information on this issue than what was presented in the thread. I was thinking that you might be a co-worker of Steve's (Packwood's) and therefore had some intimate knowledge of the network topology, that was not possible to discern based on the info presented in the thread itself.

    Again, I think this was a case of miscommunication.

    Cheers,

    Steve

  • Sentence was "...perhaps you know more..." myself/several others read that as aggressive - but accept (now) your disclaimer.

    As for "miscommunication" - might the "best judge" of that result from a comparison of a sampling of your "other" recent postings vs. the one towards me?     There IS a clear difference in tone - which speaks against, "miscommunication" - does it not?

    My hope is that your expertise may assist this poster in the resolution of this troubling issue.      I go now in "peace."

  • Steve,

    Thanks for providing some more things I can look into to provide more clues. I have other projects I'm having to juggle with this effort, so it may be a while before I get any more information. Just wanted to thank you again for the support.