Update on the problem of ethernet communication failure after many hours of operation
Continuing the discussion on the earlier thread,
As discussed earlier, the problem occurred at one site, while at another site the system worked for days on end without a problem.
So we repeated the experiment with the switch and cables from the site where the problem was encountered, and this time the problem occurred after 30 hours. The following are the observations.
There are 2 active tcp_pcbs in the Established state (state 4), corresponding to the 2 PCs connected through the switch. They continue to show the Established state even after the hang. These 2 are closed automatically about 30 minutes after the hang. I think lwIP waits for this time and then closes the connections.
The EMACIntStatus is mostly 0x10041 when it is working normally and sometimes 0x180c1 when there is an abnormal interrupt. There are about 32 abnormal interrupts in the 30 hours of operation.
At the time of the hang the EMACIntStatus is zero, and it remains zero afterwards. In this condition no new connection to the board is possible, from Tera Term or in any other way. The only way to re-establish the connection is to reset the MCU.
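For reference, the two status words quoted above can be decoded bit by bit. The sketch below is a minimal, host-runnable decoder. The bit masks and the names (ST_NORMAL, ST_RX_NO_BUFFER, decode_emac_status and so on) are assumptions based on my reading of the TM4C129 EMAC DMA status layout, not something confirmed in this thread, so please check them against emac.h before relying on them.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* ASSUMED bit masks for the EMAC DMA interrupt status word.
 * Hypothetical local names; verify the values against TivaWare's emac.h. */
#define ST_TX_DONE      0x00000001u  /* frame transmitted */
#define ST_RX_DONE      0x00000040u  /* frame received */
#define ST_RX_NO_BUFFER 0x00000080u  /* receive descriptor unavailable */
#define ST_ABNORMAL     0x00008000u  /* abnormal interrupt summary */
#define ST_NORMAL       0x00010000u  /* normal interrupt summary */

typedef struct {
    uint32_t    mask;
    const char *name;
} StatusBit;

static const StatusBit kStatusBits[] = {
    { ST_NORMAL,       "NORMAL" },
    { ST_ABNORMAL,     "ABNORMAL" },
    { ST_RX_NO_BUFFER, "RX_NO_BUFFER" },
    { ST_RX_DONE,      "RX_DONE" },
    { ST_TX_DONE,      "TX_DONE" },
};

/* Writes the names of the set bits into out, separated by spaces.
 * Returns the bits of status that were NOT named in out. */
uint32_t decode_emac_status(uint32_t status, char *out, size_t out_len)
{
    uint32_t unknown = status;
    size_t used = 0;

    if (out == NULL || out_len == 0) {
        return status;
    }
    out[0] = '\0';

    for (size_t i = 0; i < sizeof(kStatusBits) / sizeof(kStatusBits[0]); i++) {
        if ((status & kStatusBits[i].mask) == 0) {
            continue;
        }
        int n = snprintf(out + used, out_len - used, "%s%s",
                         used ? " " : "", kStatusBits[i].name);
        if (n < 0 || (size_t)n >= out_len - used) {
            out[used] = '\0';   /* name did not fit: drop it and stop */
            break;
        }
        used += (size_t)n;
        unknown &= ~kStatusBits[i].mask;
    }
    return unknown;
}
```

Under those assumed masks, 0x10041 is the normal summary bit plus receive and transmit, and 0x180c1 adds the abnormal summary and the receive-buffer-unavailable bit, which would fit the "abnormal interrupt" description. A status of zero simply has none of these bits set.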
The question is, what is causing the EMACIntStatus to get stuck at zero? Can the switch cause this?
What could result in this type of behavior?
Any help is appreciated.
Hi Krishna,
Unfortunately, I really don't know what is causing it to fail after 30 hours. In the last thread you were going to investigate whether changing to a different switch makes a difference. Can you provide an update?
I have some questions and suggestions.
- Will changing a switch make a difference?
- Do all of your boards have the same problem? If this is the case, then I would focus on the software side.
- You said your application is based on the tcpecho example. If you run the tcpecho example as is, do you see a failure after the same number of hours of operation as when you run your own application?
- Here I'm just trying to understand whether the TivaWare tcpecho example itself could be the problem, since you said you used it as the reference for your application. Can you try another tcpecho example? See https://github.com/dreamcat4/lwip/blob/master/contrib/apps/tcpecho_raw/echo.c. This is a stock lwIP tcpecho example. It looks more complicated but will work. All you need to do is call echo_init() from this file. If you base your application on this echo example, does it make a difference? I just want to know whether it is a software issue or whether there are other issues to look at. What I'm doing is a process of elimination to narrow down the cause.
Hi Charles,
- Will changing a switch make a difference?
I have tried a different set of switch and cables, and the problem did not appear even after a week. But with this set of switch and cables the problem occurred twice: once after 30 hours and then after 6 hours. To rule out the cables, I tested the 3 cables individually without the switch, and they each worked for a day without a problem. I strongly suspect the switch. To confirm it I have to go back to the other switch and observe again.
In the meantime, I have modified my program so that it runs on the Connected LaunchPad without the other hardware, and it has been running since yesterday. The program is fairly complex: RTC, SD card storage, 2 SPI communications with an MSP432 and a BeagleBone Black, NOSYS, and no Ethernet interrupts. Ethernet is handled in the main super loop. I will observe for a couple of days and let you know. If the problem appears here, it may be a bit easier to find the root cause, as I am logging the UART output using Tera Term.
I will try the other suggestion of running a simpler program on the launch pad.
- You said your application is based off the tcpecho example. If you were to run the tcpecho example as is will you see any failure after the number of hours of operation you have seen when you run your own application.
I will try this too.
Thank you for your suggestions.
Hi,
Ethernet is handled in the main super loop
Please read carefully the response by David Wilson in this post https://e2e.ti.com/support/microcontrollers/other/f/908/t/319674?Optimizing-UDP-in-lwIP-on-TIVA-how-to-determine-when-a-packet-has-been-sent-. Basically, you should never call lwIP functions such as tcp_write from a context other than the Ethernet interrupt handler. lwIP is not re-entrant, so you must only call it from one context.
Update:
The program running on the LaunchPad also stopped communicating over TCP after 37 hours. I logged the status at regular intervals of 20 minutes and found that at the time of the hang EMACIntStatus becomes zero. I have reset the LaunchPad to observe again for the next several hours. Later I will replace the switch, try again, and report.
Once the firmware detects the hang, it waits for 30 seconds and calls lwIPNetworkConfigChange with the old values in an attempt to restart. But this has no effect.
However, if we do a software reset of the MCU, it recovers. But this is not a desirable workaround, as we lose some data during the reset.
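The detect, wait, then recover sequence described above can be sketched as a small state machine. This is only an illustration and not the actual firmware: the names (HangMonitor, hang_monitor_step, HANG_HOLDOFF_MS) are hypothetical, and how "activity" is judged is an assumption left to the caller, for example whether EMACIntStatus showed any non-zero value since the last sample.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hold-off before attempting recovery; 30 s as described in the post. */
#define HANG_HOLDOFF_MS 30000u

typedef enum {
    LINK_OK,           /* activity seen recently */
    LINK_SUSPECT,      /* no activity, hold-off timer running */
    LINK_RESTART_DUE   /* hold-off expired, recovery already requested */
} LinkState;

typedef struct {
    LinkState state;
    uint32_t  suspect_since_ms;
} HangMonitor;

/* Feed one sample from the main loop.
 * activity_seen: the caller's own criterion (an assumption here).
 * Returns true exactly once per hang, when the hold-off has expired and
 * the caller should attempt its recovery action. */
bool hang_monitor_step(HangMonitor *m, uint32_t now_ms, bool activity_seen)
{
    if (activity_seen) {
        m->state = LINK_OK;
        return false;
    }
    if (m->state == LINK_OK) {
        m->state = LINK_SUSPECT;
        m->suspect_since_ms = now_ms;
        return false;
    }
    /* Unsigned subtraction stays correct across tick counter wrap-around. */
    if (m->state == LINK_SUSPECT &&
        (uint32_t)(now_ms - m->suspect_since_ms) >= HANG_HOLDOFF_MS) {
        m->state = LINK_RESTART_DUE;
        return true;
    }
    return false;
}
```

When the function returns true, the firmware would take its recovery action, which in our case is the call to lwIPNetworkConfigChange that had no effect, or the software reset that did.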
Is there any other parameter I have to log to get an insight into the problem?
Hi,
Ethernet is handled in the main super loop
You said you handle Ethernet in the main loop. Did you have a chance to read the other post answered by David Wilson? I've copied his answer below again. I suspect your problem may be due to the reason he describes.
The biggest problem here is the fact that you are calling lwIP from a context other than the Ethernet interrupt handler. lwIP is not re-entrant so you must only call it from one context. When using an RTOS, this means only calling the lwIP APIs from a single thread. When not using an RTOS (like most of our examples), you must make calls in the context of the Ethernet interrupt handler. This can be tricky to do but there's a timer callback implemented that makes it a bit easier. Make sure you set HOST_TMR_INTERVAL to some non-zero value (I think it's a number of milliseconds) and implement lwIPHostTimerHandler in your application then make all your lwIP calls from the lwIPHostTimerHandler function.
This may sound awkward but it's vital. Calling udp_send from the main loop will likely work for a while but I guarantee you that it will crash after some period of time as some internal data structure in lwIP gets corrupted. This kind of problem is VERY difficult to debug so it's far better to fix this now rather than trying to pretend everything is OK and have it fail catastrophically later.
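One way to follow the single-context rule quoted above is to have the main loop only queue outgoing data, and let the one context that owns lwIP drain the queue. The following is a minimal sketch under stated assumptions, not TivaWare or lwIP code: TxQueue, txq_push, txq_drain and demo_sink are hypothetical names, there is exactly one producer and one consumer, and the send callback stands in for whatever lwIP call the application makes.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TXQ_SLOTS   8u    /* power of two, so index wrap-around stays consistent */
#define TXQ_MSG_MAX 64u

typedef struct {
    uint8_t data[TXQ_MSG_MAX];
    size_t  len;
} TxMsg;

typedef struct {
    TxMsg       slots[TXQ_SLOTS];
    atomic_uint head;   /* written only by the producer (main loop) */
    atomic_uint tail;   /* written only by the consumer (lwIP context) */
} TxQueue;

typedef void (*TxSendFn)(const uint8_t *data, size_t len);

void txq_init(TxQueue *q)
{
    atomic_init(&q->head, 0u);
    atomic_init(&q->tail, 0u);
}

/* Producer side: copy the message into a free slot. Never touches lwIP. */
bool txq_push(TxQueue *q, const void *buf, size_t len)
{
    unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);

    if (len > TXQ_MSG_MAX || (head - tail) == TXQ_SLOTS) {
        return false;   /* message too large or queue full */
    }
    TxMsg *m = &q->slots[head % TXQ_SLOTS];
    memcpy(m->data, buf, len);
    m->len = len;
    /* Publish the slot only after it is completely filled. */
    atomic_store_explicit(&q->head, head + 1u, memory_order_release);
    return true;
}

/* Consumer side: call this only from the single context that owns lwIP,
 * for example from the application's lwIPHostTimerHandler.
 * Returns the number of messages handed to send. */
unsigned txq_drain(TxQueue *q, TxSendFn send)
{
    unsigned sent = 0;
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);

    while (tail != atomic_load_explicit(&q->head, memory_order_acquire)) {
        const TxMsg *m = &q->slots[tail % TXQ_SLOTS];
        send(m->data, m->len);
        tail++;
        atomic_store_explicit(&q->tail, tail, memory_order_release);
        sent++;
    }
    return sent;
}

/* Stand-in for the real lwIP call: only counts bytes, for demonstration. */
size_t demo_bytes_sent = 0;

void demo_sink(const uint8_t *data, size_t len)
{
    (void)data;
    demo_bytes_sent += len;
}
```

The point of the design is that txq_push never calls into lwIP, so it is safe from the main loop, while every real lwIP call happens inside the send callback in one context only.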
Hi Charles,
I think I am doing this part correctly. I will cross check again. Our requirement is that the timer interrupt (10ms) is used for data capture which cannot be interrupted by any other task, even the Ethernet. For this reason we have taken pains to see that Ethernet functionality is done without the Hardware Ethernet Interrupt.
I will make a concise file showing the modifications made to achieve this.
Essentially I have followed this article by High 12 noon Blog to the letter
high12noon.neocities.org/lwip_polled_tm4c129.html
BTW the second test with Launch pad failed after 10 hours.
We have now replaced the suspect switch (D-Link D1008D) with another one (D-Link DES1008C). We will observe this for a couple of days and report.
Thank you.
Hi Charles,
The problem is sorted out at last. After seeing the discussion on another post
I changed the lwipopts.h options from
#define EMAC_PHY_CONFIG (EMAC_PHY_TYPE_INTERNAL | EMAC_PHY_INT_MDIX_EN | \
EMAC_PHY_FORCE_10B_T_FULL_DUPLEX)
to
#define EMAC_PHY_CONFIG (EMAC_PHY_TYPE_INTERNAL | EMAC_PHY_INT_MDIX_EN | \
EMAC_PHY_AN_10B_T_HALF_DUPLEX)
Apparently the FORCE option does not work with the suspect switch (D-Link D1008D), whereas it was OK with the other switch (D-Link DES1008C).
We ran the program for 120 hours without any problem.
Now that the problem is sorted out, we have switched over to full duplex.
Thank you all for the help.