TM4C129ENCZAD: Ethernet comms stops with lack of error messages

Part Number: TM4C129ENCZAD
Other Parts Discussed in Thread: SEGGER, DP83822I, DP83822EVM

Hello,

I have two boards, a custom TM4C129EN-based board and an STM32F4 dev kit, connected together with a poorly made Ethernet cable 90m in length. The two can communicate for a short time (sometimes seconds, sometimes hours) before reception on both boards comes to a complete stop. I have enabled the following options for the onboard MAC and PHY:

	// Initialize the ethernet MAC and bus related DMA parameters
	ROM_EMACPHYConfigSet(EMAC0_BASE, EMAC_PHY_TYPE_INTERNAL |
	                     EMAC_PHY_INT_ROBUST_MDIX |
	                     EMAC_PHY_AN_10B_T_HALF_DUPLEX |
	                     EMAC_PHY_INT_LD_ON_RX_ERR_COUNT |
	                     EMAC_PHY_INT_LD_ON_MTL3_ERR_COUNT |
	                     EMAC_PHY_INT_LD_ON_LOW_SNR |
	                     EMAC_PHY_INT_LD_ON_SIGNAL_ENERGY);
	ROM_EMACInit(EMAC0_BASE, _ClockFrequency,
	             EMAC_BCONFIG_MIXED_BURST | EMAC_BCONFIG_TX_PRIORITY,
	             32, 32, 0);
	ROM_EMACConfigSet(EMAC0_BASE,
	                  (EMAC_CONFIG_CHECKSUM_OFFLOAD |
	                   EMAC_CONFIG_7BYTE_PREAMBLE |
	                   EMAC_CONFIG_IF_GAP_96BITS |
	                   EMAC_CONFIG_USE_MACADDR0 |
	                   EMAC_CONFIG_SA_FROM_DESCRIPTOR |
	                   EMAC_CONFIG_BO_LIMIT_16 |
	                   EMAC_CONFIG_JABBER_DISABLE),
	                  (EMAC_MODE_RX_STORE_FORWARD |
	                   EMAC_MODE_TX_STORE_FORWARD |
	                   EMAC_MODE_TX_THRESHOLD_64_BYTES |
	                   EMAC_MODE_RX_THRESHOLD_64_BYTES |
	                   EMAC_MODE_RX_ERROR_FRAMES), 0);

After the fault occurs I can observe both boards transmitting Ethernet frames, but neither board's frames show up at the other's receiver. Resetting the STM32 dev kit only yields a link down followed by a link up on my custom board, after which both units attempt to transmit ARP requests. However, neither of them receives the other's requests. When I issue a soft PHY reset on the custom (TM4C) board, communication comes alive again, with both units receiving and transmitting frames.

The PHY reset is issued by toggling the EMAC_PC_DIGRESTART bit in the register view.

Any help with further steps to diagnose this problem would be greatly appreciated.

Thanks

  • Hi,
    So the two boards are connected on the LAN without any hub or switch? Since you said the cable is poorly made, I suggest that you change to a different cable and try again. If the two boards are on your bench, you don't really need a 90m cable. See if replacing the cable resolves the problem, since you said it sometimes works for hours.

    Do you use static IP addresses, or is there a DHCP server on the network assigning them?
  • Correct, the two boards are directly connected to one another.

    Unfortunately I must use this specific 90m long cable for this application. I was hoping to get an error from one of the interrupts or a register I could poll so that I would know that a fault has occurred. I understand that a better cable should be used but an undetected fault on a good cable would still be bad for my design. I would have thought that the link would go down if the cable is too poorly made but perhaps I don't fully understand the feature yet.

    Currently using statically set IP addresses on both devices. No DHCP server exists on the network with the two devices.

    As an update I've also read the following MAC registers to ensure that data has been counted by the MAC:
    EMAC_TXCNTGB
    EMAC_TXOCTCNTG

    They both increment after the fault.

    EMAC_RXCNTGB does not increment even though the devkit has sent data.

    Any other ideas for debugging would be great.

    Thanks
  • While continuing testing we've changed the setup to include a PC in the mix. This means that a switch now sits in the middle. All IP addresses are still static.

    Unfortunately the switch doesn't allow us to see traffic between the two devices but I've ordered one with port mirroring that should arrive before end of day. What is interesting is that after the fault when I pull out the ethernet cable connected to the TM4C a link down is detected. When plugged back in link up is detected and an ARP frame is transmitted. The PC is able to see this frame confirming that the transmitter is still working on the TM4C.

    We've also set up the dev kit as the gateway, so that the PC would send ARP frames its way and the dev kit is able to receive and transmit responses to the PC.

    When setting the gateway address of the PC to the TM4C, ARP frames from the PC are not received by the board.

    This leads us to believe that there is something wrong on the receiving end of the TM4C. We've tried the following:
    checked EMAC_DMARIS_RS - 0x03 - Running: Waiting for receive packet
    toggled EMAC_CFG_RE - no change
    toggled EMAC_DMAOPMODE_SR - no change

    The only thing that gets things moving again is toggling EMAC_PC_DIGRESTART.

    Will update again when new switch shows up and reveals anything new.
  • After sniffing all packets with wireshark, it's confirmed that the transmitter is still able to transmit data but the receiver is unable to receive anything at all.

    We also tried to hack together a fix by adding the bit toggle of EMAC_PC_DIGRESTART to the program.  Unfortunately it does not work and only accomplishes bringing the link down.  Toggling it in CCS still appears to restart communication though.

    Can you elaborate on exactly what EMAC_PC_DIGRESTART does? Below is the snippet I found on it:

    This bit allows the user to restart the PHY. Asserting this bit causes the
    PHY logic and internal register to reset to initial conditions. This bit does
    not affect the configuration bits provided by the EMACPC register, which
    are stored in the PHY following a chip reset. To initiate the soft reset to
    the PHY, this bit must be written to a 1 and written again to a 0.

    Are there any timing requirements on this bit?
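
    For reference, a minimal sketch of the write-1-then-write-0 sequence as that snippet describes it. The function name, the delay hook and the hold time are assumptions made for illustration only; the datasheet text quoted above gives no timing, so the hold time is left as a parameter to experiment with rather than a known-good value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical delay hook: the quoted datasheet text gives no timing,
 * so the caller decides how long to hold the bit (an assumption). */
typedef void (*delay_fn)(uint32_t microseconds);

/* Pulse a restart bit: write 1, optionally wait, then write 0.
 * 'reg' and 'bit_mask' are passed in so the sketch does not hard-code
 * a register address or bit position. */
void phy_soft_restart(volatile uint32_t *reg, uint32_t bit_mask,
                      delay_fn delay, uint32_t hold_us)
{
    assert(reg != NULL);

    *reg |= bit_mask;       /* assert the restart bit */
    if (delay != NULL) {
        delay(hold_us);     /* hold time is a guess, not a datasheet value */
    }
    *reg &= ~bit_mask;      /* release the bit; other bits are left untouched */
}
```

    On the target, reg would point at the EMACPC register and bit_mask would be EMAC_PC_DIGRESTART; passing them in keeps the sketch testable off-target.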

  • Hi Roque,

    Are you using LWIP 1.4.1? Perhaps the burst setting of 32 is flooding the DMA descriptors and causing receive buffer unavailable (RBUN) at one end or the other. I suggest trying a DMA burst setting of 16 to slow transmission down. We also discovered a link-down detection issue in the tiva129.c abstraction layer that was caused by the way the code was written, so check that code.

    Perhaps add a UARTprintf() call in the abnormal interrupt handler for debug; that helps to see which abnormal interrupts EMAC0 is processing when it all falls apart. Typically it ends up being several things causing mayhem, not just one. The blocking in ringbuffer.c is especially sensitive, and Boolean flags execute much faster than slow macros.
  • Hi,

    Yeah, we are using lwip 1.4.1. Unfortunately we are not getting any RU interrupts. Appreciate the tip for adding the debug printout; we are using a Segger debugger and that functionality is easily added to the terminal output. We've been keeping a running counter to count normal and abnormal interrupts and RU never popped up. However, we started getting OVFs (RX overflow). From what I can see, this means that the RX DMA buffer is being overrun with data before the DMA is able to transfer it to the buffers indicated by the descriptors:

    The receive buffer had an overflow during frame reception. If
    the partial frame is transferred to the application, the overflow
    status is set in RDES0[11].

    I am not sure why this condition is happening, but the device appears to recover from it well, with the driver dropping the frame. The fault occurs whether this count is 0 or as high as 10.

    Over the weekend we moved the 90m "poor" quality cable between the switch and the STM32 device and set it up for 10 Mbps full duplex (same settings on the TM4C). We put the TM4C on a good quality cable of 45m length. The two devices still communicate with each other, just with the switch in the middle. It worked for 48+ hours before we manually stopped the test. We then changed the TM4C to half duplex, and it failed three hours later.

    With this test I have to wonder if there is a better way to debug the MAC/PHY after the fault condition occurs? Perhaps the MAC is dropping the frame and not sending it over the RX DMA line?

  • Roque Obusan said:
    We've been keeping a running counter to count normal and abnormal interrupts and RU never popped up

    That seems a bit odd: if the DMA receive buffer overflows, the engine should trip RU. Be sure you are reporting the abnormal status in hex, not as a decimal integer. Perhaps also report the PHY interrupt status, as it seems to show some odd codes at times of TCP stress.

    Roque Obusan said:
    Perhaps the MAC is dropping the frame and not sending it over the RX DMA line?

    LWIP loves to drop frames if the application hogs too much processor time, and it responds with RST even with fairly small frame buffer settings; an orphaned PBUF is always NULL when that occurs. Perhaps turn on the LWIP debug stats in lwipopts.h, then place a stats_display() call somewhere an application-layer error trap might invoke it.

    Believe me, there are likely more issues under the hood. One that seems to plague LWIP is the TCP interval timer running a bit too fast (TM4C 150 DMIPS), which can cause TCP to drop frames.

  • BP101 said:

    That seems a bit odd: if the DMA receive buffer overflows, the engine should trip RU. Be sure you are reporting the abnormal status in hex, not as a decimal integer. Perhaps also report the PHY interrupt status, as it seems to show some odd codes at times of TCP stress.

    I also find it odd. Confirmed; please see the code below:

    if (ui32Status & EMAC_INT_ABNORMAL_INT)
    {
        ++netif->_DriverStats.ui32AbnormalInts;

        netif->CheckAndIncCount(netif->_DriverStats.ui32TXStopped, ui32Status,
                                EMAC_INT_TX_STOPPED);
        netif->CheckAndIncCount(netif->_DriverStats.ui32TXJabber, ui32Status,
                                EMAC_INT_TX_JABBER);
        netif->CheckAndIncCount(netif->_DriverStats.ui32RXOverflow, ui32Status,
                                EMAC_INT_RX_OVERFLOW);
        netif->CheckAndIncCount(netif->_DriverStats.ui32TXUnderflow, ui32Status,
                                EMAC_INT_TX_UNDERFLOW);
        netif->CheckAndIncCount(netif->_DriverStats.ui32RXNoBuffer, ui32Status,
                                EMAC_INT_RX_NO_BUFFER);
        netif->CheckAndIncCount(netif->_DriverStats.ui32RXStopped, ui32Status,
                                EMAC_INT_RX_STOPPED);
        netif->CheckAndIncCount(netif->_DriverStats.ui32RXWatchdog, ui32Status,
                                EMAC_INT_RX_WATCHDOG);
        netif->CheckAndIncCount(netif->_DriverStats.ui32TXEarlyTransmit, ui32Status,
                                EMAC_INT_EARLY_TRANSMIT);
        netif->CheckAndIncCount(netif->_DriverStats.ui32BusError, ui32Status,
                                EMAC_INT_BUS_ERROR);
    }

    // Increment the counter when the status bit selected by mask is set.
    void CheckAndIncCount(uint32_t& counter,
                          const uint32_t stat,
                          const uint32_t mask)
    {
        if (stat & mask)
            ++counter;
    }

    The above code is found in the interrupt handler for the ethernet driver.  

    We are currently seeing >95% of idle time on the processor during these dropped transmissions.  When the fault occurs the interrupt stops firing completely.  We even monitor the following registers for activity:

    EMAC_RXCNTGB (ethernet mac rx frame count for good and bad frames)

    EMAC_RXCNTCRCERR (ethernet mac rx frame count for crc error frames)

    EMAC_RXCNTALGNERR (ethernet mac rx frame count for alignment error frames)

    EMAC_RXCNTGUNI (ethernet mac rx frame count for good unicast frames)

    Activity ceases completely. When we see transmissions firing off, the equivalent ethernet mac tx counters increment, and this is verified with wireshark.

  • BTW: Your EMAC_CONFIG_BO_LIMIT_16 is very small; we typically set this to 1024 for 10/100 duplex. We also stop giant frames from causing buffer issues, and with 32 bit bursting who knows if LWIP conjures similar beasts?

    //RX frames > 1518 bytes Giant, drop all frames > 2048 bytes.
    EMAC_DMAOPMODE_DGF), 2048); 

  • Perhaps KISS code would reveal the abnormal INT in your case.

    It seems netif might require debug to be enabled, whereas we disable the tiva129 HAL debug. TI left it enabled (if debug) for who knows why; we simply renamed the (def Debug) guard around the PBUF checking to #if NETIF_DEBUG, which turns it off.

    Perhaps KISS works better? This code always reports the register value in hex, and RU/TU are nasty little bits.

    SysPrintf("> Abnormal INT Status --->:%x\r\n", ui32IntStatus);

  • I understand what you're trying to say, but I believe my code above should also catch the abnormal interrupts. It also carefully bins the errors into easy to see counters we can see from CCS. That makes them a lot harder to miss when logging over a long period of time.

    I think what we're missing here is the fact that the RX counters from the MAC stop completely. RX interrupts also stop firing altogether. I want to believe that it is a stack error, but I cannot see frames arriving at all. As an example: before the fault I can halt my program and still see the rx counter register incrementing. After the fault that counter stays stopped whether the program is halted or running.
  • Roque Obusan said:
    It also carefully bins the errors into easy to see counters we can see from CCS

    Are you not assuming that your custom counters actually work in all cases? What do you have to lose? With frames being dropped, the bet is that abnormal interrupts have occurred from lost packets (P=Null), runts or giants. Did you turn on the LWIP debug stats? It will all become clearer once you see the source file/assert error messages posted every time a frame drops.

    Perhaps your DMA engine changes deviate from the tried and true; baby steps before giant leaps is the recommended practice. Again, we never had much luck with 32 bit bursting; 16 bit TX/RX bursting at 2:1 TX priority is more realistic with LWIP, especially if the TCP Nagle algorithm is disabled (no delay) in callbacks.

    Our view is that the 64-byte thresholds are not functional in certain DMA bursting modes; if you study the register descriptions, it becomes obvious they are ignored.

    Perhaps add TCP client priority if UDP is also being used for DNS etc.:

    // Setup the TCP connection priority.
    tcp_setprio(pcb, TCP_PRIO_MAX);

  • I have lwip_stats enabled. They support our findings that the rx packets do not even make it into the link layer, and thus never reach the lwip stack. We have also tried the dma thresholds from 4 to 32 and none of them alleviates the issue. This is further supported by the MAC counters (hardware) failing to increment after the fault occurs.

    We have had success with using a good quality cable of 45m length with no issues for over 48 hours.  My issue is when we use our poor quality cable the results appear to be a dropped RX channel with no error bits set.  We are able to reliably make the fault occur (rx comms only stopped) when we swap out the cable.  I understand it seems odd that we are testing with a poor quality cable.  The reality of our design is that we need to have a known failure mode so it can be communicated to the operator.  Currently rx comms stops and no error bits from the MAC, PHY or DMA are set.  

    Furthermore the tx channel still continues to operate, verified through the following steps:

    1. unplug ethernet cable only, link drops, verified by phy interrupt firing notifying of link status change and reading the EPHYSTS register

    2. plug ethernet cable back in, link achieved, verified by phy interrupt firing notifying of link status change and reading the EPHYSTS register

    3. lwip gets notified of the link up and fires an arp frame, verified by lwip_stats increment xmit on etharp counter, further verified by wireshark on PC

    However, even after resetting the link in that way rx still does not work.  What gets rx working are the following:

    1. Full system reset which includes a reset of the firmware and lwip

    2. Toggling the DIGRESTART bit in the EMACPC register. As far as I can tell it only resets logic and registers in the internal PHY. This does not reset the firmware or the lwip stack, yet data reception begins again. One caveat is that it only appears to work when we toggle the bit in CCS. The documentation doesn't provide timing info, so we may be using it incorrectly when inserting it into the program.

    Appreciate your assistance.

  • Roque Obusan said:
    lwip gets notified of the link up and fires an arp frame, verified by lwip_stats increment xmit on etharp counter, further verified by wireshark on PC

    Perhaps your application layer is causing issues in LWIP or even EMAC0. I agree: the activity LED randomly flashes on abnormal RU, which seems to indicate the DMA engine TX FIFO attempts to send packets even when it is in TU status. Log jams act that way on a river, causing major flooding upstream.

    Please post </> your tiva129.c HAL link state up function so we might both be on the same page here.

  • To further prove that this is most likely not an lwip issue we ran the following tests overnight:

    1. custom board connected to stm32f4 dev kit running good quality 90m cable - still works > 16 hours currently

    2. custom board connected to stm32f4 dev kit running poor quality 90m cable - failed within minutes

    3. ti dev kit with modified firmware connected to stm32f4 dev kit running poor quality 90m cable - failed within minutes

    4. ti dev kit with modified firmware connected to stm32f4 dev kit running good quality 90m cable - still works > 16 hours currently

    5. stm32f4 dev kit connected to switch through poor quality 90m cable, custom board connected to switch through good quality 90m cable - still works > 16 hours currently

    I'd like to, once again, reiterate that toggling the DIGRESTART bit gets reception started again before it fails again. If the application or lwip were causing these errors, all of these tests should have failed. Instead, the pattern we are seeing is that the failure occurs with poor quality cables (capable of only 10 Mbps speeds) and on TI internal PHYs. Connecting these poor quality cables to a PC allows us to transmit data with no issues at nearly the full 10 Mbps. Our code transmits nowhere near 10 Mbps, more like 200 kBps. We have no issue with the poor quality cable failing; what we have an issue with is that we can find no errors to signal that the fault has occurred. Ideally one of the failure bits on the DMA, MAC or PHY would set so we can positively identify the issue.

    As you requested I've posted the link related code below:

    // Read the interrupt status, so we know what has changed
    const uint16_t misr1 = ROM_EMACPHYRead(EMAC0_BASE, PHY_PHYS_ADDR,
                                           EPHY_MISR1);
    const uint16_t misr2 = ROM_EMACPHYRead(EMAC0_BASE, PHY_PHYS_ADDR,
                                           EPHY_MISR2);
    const uint16_t bmsr = ROM_EMACPHYRead(EMAC0_BASE, PHY_PHYS_ADDR, EPHY_BMSR);

    // Has the link status changed?
    if (misr1 & EPHY_MISR1_LINKSTAT)
    {
        if (bmsr & EPHY_BMSR_LINKSTAT)
        {
            tcpip_callback((tcpip_callback_fn)netif_set_link_up,
                           &inst->_NetIfInstance);
            ++debugcounter.linkup;
            _bLinkActive = true;
        }
        else
        {
            tcpip_callback((tcpip_callback_fn)netif_set_link_down,
                           &inst->_NetIfInstance);
            ++debugcounter.linkdn;
            _bLinkActive = false;
        }
    }

    The transmission of ARP frames is then handled by lwip. We can successfully trace that the ARP frame has left through the ethernet driver using MAC counters, breakpoints along the way, and wireshark.

  • That is totally different (tiva_129.c) HAL code around link up/down state conditions. Did you rewrite this part of the Tivaware DMA engine HAL? LWIP is extremely touchy around EMAC0 link state register handling, especially if or when the link bounces up/down.
  • It's actually quite similar, see below (taken from tivatm4c129.c)

        ui16Val = EMACPHYRead(EMAC0_BASE, PHY_PHYS_ADDR, EPHY_MISR1);
    
        /* Read the current PHY status. */
        ui16Status = EMACPHYRead(EMAC0_BASE, PHY_PHYS_ADDR, EPHY_STS);
    
        /* Has the link status changed? */
        if(ui16Val & EPHY_MISR1_LINKSTAT)
        {
            /* Is link up or down now? */
            if(ui16Status & EPHY_STS_LINK)
            {
                /* Tell lwIP the link is up. */
    #if NO_SYS
                netif_set_link_up(psNetif);
    #else
                tcpip_callback((tcpip_callback_fn)netif_set_link_up, psNetif);
    #endif
    
                /* In this case we drop through since we may need to reconfigure
                 * the MAC depending upon the speed and half/full-duplex settings.
                 */
            }
            else
            {
                /* Tell lwIP the link is down */
    #if NO_SYS
                netif_set_link_down(psNetif);
    #else
                tcpip_callback((tcpip_callback_fn)netif_set_link_down, psNetif);
    #endif
            }
        }


    Also see note from datasheet:

    Link Status
    Value  Description
    0      Link is not established.
    1      Valid link is established (for either 10 or 100 Mb/s operation).

    This bit is a duplicate of the Link Status bit in the EPHYBMSR
    register (PHY offset 0x001).
    This bit is not cleared upon a read of the EPHYSTS register.

    We simply use the EPHYBMSR register instead of EPHYSTS.  This is getting a bit off topic though since the link doesn't toggle unless we physically unplug and plug the cable back in.  Detecting link up / down is working correctly WHEN the interrupt fires.

  • We've settled on a workaround for now. Unfortunately, since there is still a lack of error messages, our workaround can be delayed by as much as 500 ms (in its current implementation).

    We are resetting the MAC, PHY and DMA when we see no activity from the MAC RX counters for at least 500 ms. This solution is proving satisfactory even at 135m of poor quality cable. We will continue to investigate and update this thread when we find a proper solution.
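
    A minimal sketch of that kind of stall check, not our actual driver code: the type and function names are hypothetical, and it assumes a free-running RX frame counter (such as EMAC_RXCNTGB, not cleared on read) plus a millisecond tick supplied by the caller. The caller performs the MAC, PHY and DMA reset when the poll returns true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* State for a receive-stall watchdog (hypothetical names throughout). */
typedef struct {
    uint32_t last_rx_count;   /* last sampled RX frame counter value */
    uint32_t last_change_ms;  /* tick at which the counter last moved */
    uint32_t timeout_ms;      /* stall threshold, e.g. 500 ms */
} rx_watchdog_t;

void rx_watchdog_init(rx_watchdog_t *wd, uint32_t rx_count,
                      uint32_t now_ms, uint32_t timeout_ms)
{
    assert(wd != NULL);
    wd->last_rx_count = rx_count;
    wd->last_change_ms = now_ms;
    wd->timeout_ms = timeout_ms;
}

/* Call periodically with the current RX frame counter and tick.
 * Returns true when the counter has not moved for at least timeout_ms,
 * meaning the caller should reset the MAC, PHY and DMA. */
bool rx_watchdog_poll(rx_watchdog_t *wd, uint32_t rx_count, uint32_t now_ms)
{
    assert(wd != NULL);

    if (rx_count != wd->last_rx_count) {
        /* Frames arrived since the last poll: restart the stall timer. */
        wd->last_rx_count = rx_count;
        wd->last_change_ms = now_ms;
        return false;
    }

    /* Unsigned subtraction stays correct across tick wrap-around. */
    if ((uint32_t)(now_ms - wd->last_change_ms) >= wd->timeout_ms) {
        wd->last_change_ms = now_ms;  /* re-arm so resets are spaced out */
        return true;
    }
    return false;
}
```

    Note that this only makes sense when the peer is expected to send at least one frame per timeout period; on a link that may legitimately go quiet, the check would trigger needless resets.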

    Appreciate the assistance.
  • Roque Obusan said:
    This is getting a bit off topic though since the link doesn't toggle unless we physically unplug and plug the cable back in.

    Far above, as I recall, you stated that when changing to the lower quality cable the EMAC stopped receiving data after plugging it in. This forum typically supports the vendor's HAL, and the fact that you failed to mention your custom changes leads me to wonder what else you have changed.

    Please try the vendor's DMA descriptor HAL and compare whether your debug trace results are the same or different. I seem to recall the PHY interrupt fires more often than just after POR or link up/down, especially on RU/TU flags, and the EMAC requires special handling in the HAL to restart the DMA engine when it has stopped for various reasons. I also recall issues reported with link state detection when reading the STS register, and we went back to using BMSR. We then discovered the link state test was using a logical (!) that failed to properly inform LWIP the netif was down; a bitwise (~) works far better with binary registers.
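
    To illustrate the logical versus bitwise point, here is a small sketch of one way a logical (!) link test can go wrong. It is not the HAL code: the function names are hypothetical, and the mask assumes the Link Status bit sits at bit 2 of the BMSR as in the standard IEEE 802.3 register layout. The buggy form applies ! to the whole register before masking, so it never reports link down; masking first and then negating, or inverting with ~ and then masking, both work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Link Status bit of the basic mode status register (bit 2, assumed). */
#define BMSR_LINK_MASK 0x0004u

/* Buggy: this is how '!bmsr & MASK' parses. (!bmsr) is only ever 0 or 1,
 * so ANDing it with 0x0004 always gives 0 and link down is never seen. */
bool link_is_down_buggy(uint16_t bmsr)
{
    return ((!bmsr) & BMSR_LINK_MASK) != 0;
}

/* Correct: mask the bit first, then negate the result. */
bool link_is_down_logical(uint16_t bmsr)
{
    return !(bmsr & BMSR_LINK_MASK);
}

/* Also correct: invert all bits, then mask the one of interest. */
bool link_is_down_bitwise(uint16_t bmsr)
{
    return ((uint16_t)~bmsr & BMSR_LINK_MASK) != 0;
}

/* The two correct forms must always agree for any register value. */
void link_forms_agree(uint16_t bmsr)
{
    assert(link_is_down_logical(bmsr) == link_is_down_bitwise(bmsr));
}
```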

    BTW: Your IDE debug hardware trace macrocell may not always report the interrupt events in real time, or at all, depending on where JTAG stopped. That was my frustration with the EMAC; trying to debug the interrupt handlers in CCS debug was mostly useless.

  • Just wanted to post an update to this issue in case anyone stumbles upon this problem in the future:

    We purchased a TM4C129XNCZAD based dev kit (DK-TM4C129X), which allows for both internal and external PHYs, and installed a DP83822I based Ethernet PHY (DP83822EVM). The only change to the firmware was to switch the interface from internal PHY to MII.

    We have had our poor quality cable running with no issues for 48 hours and counting. We will be modifying our custom board to include this PHY.

  • Hey my friend, check this out first, since given the recent update the EMAC0 behavior could be a custom PCB layout issue.

    FYI: if your custom PCB Ethernet differential pair is uneven or too closely spaced (1.27mm) TXD to RXD pair, odd things can occur such as crosstalk etc...

    /cfs-file/__key/communityserver-discussions-components-files/908/TI_2D00_TM4C_2D00_PCB-layout-facts.pdf

  • BP101 said:
    if your custom PCB Ethernet differential pair is uneven or too closely spaced (1.27mm) TXD to RXD pair, odd things can occur

    Really?    Does this not, "fly in the face" of 48 hours "proper operation?"   

    Such hardware/layout issues are UNLIKELY to reveal their presence after so much (successful operating time) has elapsed!

    Kudos to poster Obusan for persisting - and developing a REAL Solution!

    (we note that you (alone) "Liked" your own {Mar 15, 6:47 p.m.} post.)

  • The poster made a custom PCB, and no to your question: 48 hours has nothing to do with signal quality if the EMAC differential pair degrades under stress. Who knows if the custom PCB is multilayer and/or followed the recommended design layout.
  • We've tried with a TI dev kit, see a few posts above:

    3. ti dev kit with modified firmware connected to stm32f4 dev kit running poor quality 90m cable - failed within minutes

    4. ti dev kit with modified firmware connected to stm32f4 dev kit running good quality 90m cable - still works > 16 hours currently

    (emphasis added to ti dev kit)

    The modified firmware I am speaking about consists of the exact same lwip stack and ethernet driver, but since it cannot interface to the same peripherals we've modified the firmware to return a set payload instead of acquired data.

    As for signal integrity, we have not only followed TI's recommendations but even purchased signal integrity software to confirm correct impedance and minimal crosstalk.  Ethernet is fairly low speed compared to some of our other signals on this custom board.