Why is EK-TM4C1294XL EMAC0 PHY TX/RX gates latching up & stopping TX/RX controllers data I/O.

Genatco


It seems very apparent that the EK-TM4C1294XL needs an FSB/FCN generated to compensate for a PCB design anomaly and/or a lack of bypass capacitors in key power areas. Otherwise we pray the silicon is fault free, errata tested and certified to be free of CMOS latch up in key areas. EAPX systems of the early 1980s had big yet correctable issues with CMOS latch up in CPUs/CODECs, even with internal ESD protection on I/O pins.

You may ask what an FSB or FCN is. That's a story going back over 20 years, but Field Change Notices and Field Service Bulletins saved the day time and time again. They meant engineers were in the laboratory litmus testing their designs for troubles they didn't expect to occur in the field.

The questions are what, where and why: is latch up occurring in and around the EMAC, which can also generate exception 11?

We can't find in schematic a ceramic, aluminum electrolytic or even a bypass capacitor on either OTG or ICDI USB port +5 volt USB0VBus power pins.

The application remains running while the PHY TX/RX goes AWOL:

SocketOpenTCP(-1): << TCP Disconnect >> 

SocketOpenTCP(-1): << Reset Connection>> 

>> Abnormal INT Status --->>:99461
>> Abnormal INT Status --->>:32896
>> Abnormal INT Status --->>:99461
<< The Exosite Connection FAILED >> 

SocketOpenTCP(-1): << TCP Disconnect >> 

SocketOpenTCP(-1): << Reset Connection>> 

>> Abnormal INT Status --->>:99461
>> Abnormal INT Status --->>:32896
>> Abnormal INT Status --->>:99461
<< The Exosite Connection FAILED >>

over 10 years ago

0 Amit Ashara over 10 years ago

TI__Guru**** 244400 points

Hello BP101

Is this related to the "USB Bulk client error 31"?

EDIT: Also, "EMAC0 PHY TX/RX gates latching up" would show up on a scope, and I do not see a snapshot or scope plot that would support the conclusion that the TX/RX pins are indeed latching up.

Regards
Amit

0 Genatco over 10 years ago in reply to Amit Ashara

Guru 55923 points

Hi Amit,

Don't believe the EMAC issue has anything to do with USB error 31.

Impossible that screen shot is making an assumption that the TX/RX controller is not what is latched up as the probable suspect.

Oddly, a modulated carrier seems to be present from the PHY. That is to say, the EMAC activity and switch LEDs are flickering at about the same rate as if frames with data were passing in and out of the PHY. The HTTP status changes from 204, with the PHY passing frame data, to not passing frame data, repeating HTTP 255 every server connect cycle. It will cycle the error codes above for hours. Once it is cycling the codes posted above, the external telnet client cannot reconnect its session, which gives a clue that packets are not getting past the PHY into the TX/RX controller. When the heap corrupts the outcome is not pretty and more often faults the MPU, so this condition seems very different.

More evidence below: the 99525 and other codes are somewhat benign; the application recovers, continues to process RX packets, posts HTTP 204 and runs for many hours. Whatever is going on self clears with the codes below, all but code 32896 (AIS/RU), which eventually shows up in the codes posted above: latch up.

<< The Exosite Connection FAILED >> 

SocketOpenTCP(-1): << TCP Disconnect >> 

SocketOpenTCP(-1): << Reset Connection>> 

>> Abnormal INT Status --->>:99525
<< The Exosite Connection FAILED >> 

SocketOpenTCP(-1): << TCP Disconnect >> 

SocketOpenTCP(-1): << Reset Connection>> 

<< The Exosite Connection FAILED >> 

SocketOpenTCP(-1): << TCP Disconnect >> 

SocketOpenTCP(-1): << Reset Connection>> 

>> Abnormal INT Status --->>:98496
>> Abnormal INT Status --->>:98496
<< The Exosite Connection FAILED >> 

SocketOpenTCP(-1): << TCP Disconnect >> 

SocketOpenTCP(-1): << Reset Connection>> 

<< The Exosite Connection FAILED >> 

SocketOpenTCP(-1): << TCP Disconnect >> 

SocketOpenTCP(-1): << Reset Connection>>

0 Amit Ashara over 10 years ago in reply to Genatco

TI__Guru**** 244400 points

Hello BP101,

First things first: what exactly is the definition of latch up you intend to present? Is it transistor latch up, which concerns the basis of the device, or a software loop, which concerns the use of the device? If a transistor is what you are referring to as latched up, there are more issues at hand than just software not working. Transistor latch up is a substantial source of physical damage to a device, and its effect as a software issue may be milder if the device core survived it.

It would be incorrect to assume that the LEDs are not configured to what is otherwise the default. Do check what the LED configuration is set to and present it in the forum post.

If suspicion of a "latch up" is still there then "Debug 101" rules apply: go back to a golden binary file like qs_iot on the LaunchPad and see whether the same issue occurs. If it shows the same behavior on an unmodified binary, then present data to support "physical" latch up, with scope snapshots that show the RX/TX pins are not working as expected.

Regards
Amit

0 Genatco over 10 years ago in reply to Amit Ashara

Guru 55923 points

Hi Amit,

> A transistor latch up is a substantial source of physical damage.

Possibly your definition of latch up differs from the one I learned decades ago. Latch up in this case is not damaging to the silicon; rather, it is the effect of a gate region becoming saturated so that it will no longer change state. CMOS latch up is recoverable by removing the condition causing the self latch, in this case with an MPU reset.

The speed at which the TX/RX controller is accessed is governed by Timer2A, in this case polling 3 times faster than the TI examples. The timer reload value (120 MHz SYSCLK/300) forms the lwIP timer interval, versus the slower polling of the examples (SYSCLK/100). The slower interval is directly noticeable in the horizontal movement speed of the GUI scope widgets. The slower timer reload value is far too slow for servicing two TCP port clients in the lwIP stack. With the faster interval we have to set a short delay after a W/R to the ring buffer prior to setting the RX/TX flags. Failure to add settling time before the ring buffer R/W flags results in a near immediate exception (11).

When we actually use the MPU to do some heavy data transfers with EMAC0 that is when the gremlins come out to play.

BTW: The addition of a 33uF capacitor on both USB ports stabilized the ADC temperature readout, which is now very steady at 49-50 C versus being all over the place. Noteworthy: since adding the 33uF caps the EMAC0 symptoms have changed. The caps seem to have helped to arrest 99525, 98496 codes; the result now has more often been a TCP error.

0 Amit Ashara over 10 years ago in reply to Genatco

TI__Guru**** 244400 points

Hello BP101,

Thanks for presenting the facts. Given that the 33uF caps have stabilized the ADC temp readout (no details on what the temp reading was before the addition of the caps) and that the codes 99525 and 98496 are no longer coming, can we go and see what the source of the MPU reset was? Maybe there is a BOR condition that is being triggered.

Also, in case the issue is still there and reproducible, can the device be submitted for FA for any damage that may have been caused by the prolonged failures you have been seeing?

Regards
Amit

0 Genatco over 10 years ago in reply to Amit Ashara

Guru 55923 points

Hi Amit,

> helped to arrest 99525, 98496 codes.

Not completely gone, rather very much calmed down.

>(no details on what the temp reading was before the addition of the caps)

Load any EK-TM4C1294XL with code to produce ADC temperature readings and you will witness abnormal, rapidly changing digits. It appears the ADC voltage reference was introducing some noise into the samples. After adding the 33uF caps, I twice witnessed the PHY as suspect: in the UARTprintf() messages a DHCP event occurred shortly after TCP callback errors, yet the Watchdog timer reported that a valid link exists, and then the application simply stops. The generic IOT code fails to test a proper flag indicating that the DHCP IP address has been acquired, and so triggers repeating DHCP exoHal events, wasting MPU cycles. That is how we know the PHY is suspect for causing the TCP error in both clients, as each client posted event callbacks of TCP failure. In that TCP failure the application SysTick timer appeared to stop asserting (run LED stops blinking), but the EMAC activity LED was randomly flickering and, surprisingly, the MPU was neither halted nor in exception 11. We are gaining ground, as they say.

>can we go and see what the source of the MPU reset was.

That has not occurred for a long time now, since disabling the Watchdog MPU reset switch.

The added flag to top line stops exoHal DHCP events from constantly triggering with every socket open cycle!

    else if((g_sEnet.eState == iEthDHCPWait) && (HWREGBITW(&g_sEnet.ui32Flags, FLAG_DHCP_STARTED) == 0)) 
    {
        // Get IP address.
        //
        ui32IPAddr = lwIPLocalIPAddrGet();

0 Genatco over 10 years ago in reply to Amit Ashara

Guru 55923 points

Hi Amit,

Captured the ADC temperature readings from the IOT Cloud. There is quite a difference from the temperature readings taken before Friday morning, when the 33uF caps were installed on the USB ports. There must have been some small amount of supply ripple getting into USB0VBus. I never actually checked for ripple with the scope, rather added 33uF/0.1uF caps as a preventive measure. A true RMS DVM shows no AC ripple down to 0.000 mV. Recently checked all electrolytics on the 550 Watt PC power supply output: the +5 VDC rail has two 2200uF caps in parallel, each measuring roughly 2300uF.

0 Amit Ashara over 10 years ago in reply to Genatco

TI__Guru**** 244400 points

Hello BP101,

The ADC temp sensor may be getting coupled to the USB0VBUS supply ripple. Note that the temp sense accuracy mentioned in the data sheet is under lab char conditions.

Regards
Amit

0 Genatco over 10 years ago in reply to Amit Ashara

Guru 55923 points

Hi Amit,
The datasheet ADC electrical specification also states that 2.2uF is the minimum value. We plan to use a 330 Ohm series resistor to 3V3, an SMT .05m ferrite bead and a 3.3uF ceramic for the ADC VREF source.
Past experience has proven that a series resistor adds current ripple rejection and circuit isolation. ADC VREF isolation should come in handy should 12.5 kHz PWM ripple be riding on the DC.

0 Genatco over 10 years ago in reply to Amit Ashara

Guru 55923 points

Hi Amit,

An interesting idea came to me about the IEEE 802.3 standard for flow control. The Ethernet handshake is set in MAC (base 0x400E.C000) REG7 at offset 0x018 and is not configured for TX/RX flow control bit times in the control frames. FCBBPA[0] is being tested in a while loop ... working ok when placed above both packet W/R in tcp_output() and tcp_rcvd(). That is to say, we can still make a connection to the host. The EMAC_FLOWCTL_PT_M (PT bit field 31:16) pause time is not set the same as the datasheet page 1487 suggests, 0x0100 for 256 bit times, and is much higher in TivaWare, set to 0xFFFF -- for what reason?

802.3 flow control is part of the IEEE handshake protocol at the MLID device layer. The handshake is present but not enabled in systems that have Ethernet capability, and is also absent in the examples set forth in (eth_client_lwip.c) and (tiva-tm4c129.c).

Perhaps IEEE flow control missing at the MLID device layer is causing random latch up in the PHY?

0 Genatco over 10 years ago in reply to Genatco

Guru 55923 points

Anyone at TI:

Below, observe that SysCtlDelay() is still necessary to help arrest a would-be exception (11), even with IEEE 802.3 flow control enabled. SysCtlDelay() has an undesired effect that can be directly noticed as a slight (stop) hesitation in other TCP port data.

Shouldn't the host/client respect Pauses inserted into the control frame as the RX/TX controller FIFO data is stored in the Ringbuffer? Seems each ignores IEEE 802.3 pause frames or perhaps the PHY is not actually inserting the pause frames? The pause interval bit times are (EMAC_FLOWCTL_PT_M) = 0xFFFF0000.

Note: Oddly suspicious, there is never a single collision count change during hundreds of hours in CCS debug register refresh monitor mode. In theory, Ethernet CSMA/CD listens for carrier before transmitting and detects collisions during transmission rather than sending a dummy packet to test the wire, and on a full-duplex link to a switch CSMA/CD is not used at all, so a zero collision count would be expected there. If memory serves, Token Ring is entirely collision free by design because it uses token passing.

/* Wait for the transmission of any previous pause frame to complete;
 * the MAC then clears REG-7 FCBBPA[0] */
while(HWREG(EMAC0_BASE + EMAC_O_FLOWCTL) & EMAC_FLOWCTL_FCBBPA)
{
}

/* Initiate a pause frame cycle */
HWREG(EMAC0_BASE + EMAC_O_FLOWCTL) |= EMAC_FLOWCTL_FCBBPA;

// Send the buffer contents to the server if there is
// space in the 4096 bytes eth_client send buffer. Internet
// traffic may have caused a buffer backup of send data.
//
if(error = (EthClientSend((int8_t *)pcBuffer, (uint32_t)iLength) == ERR_OK))
{
    /* Set the Sent flag to indicate the contents of the
     * Ringbuffer have been transferred */
    HWREGBITW(&g_sExosite.ui32Flags, FLAG_SENT) = 1;

    /* Allow time for the transmitted data to exit the TX FIFO */
    SysCtlDelay(g_ui32SysClock / 250); // 100, g_ui32SysClock / 100
}

The 802.3 pause control frame handshake appears to fail:

Initialization of PHY:
>> EMAC PHY INTs --->>:-2147483648
>> EMAC PHY INTs --->>:-2147417019

Minutes later:

<< TCP Connection Failed >> 

<< TCP Connection Failed >> 

<< TCP Connection Failed >> 

>> Abnormal INT Status --->>:99525
>> Abnormal INT Status --->>:98496
>> Abnormal INT Status --->>:98496
<< TCP Connection Failed >> 

<< TCP Connection Failed >> 

<< TCP Connection Failed >> 

Client is still well connected and receives http 204 after each TX cycle.

0 Amit Ashara over 10 years ago in reply to Genatco

TI__Guru**** 244400 points

Hello BP101,

Or the interrupt mechanism or flag status needs to be taken into account when sending data. That would be a better solution than using SysCtlDelay.

Regards
Amit

0 Genatco over 10 years ago in reply to Amit Ashara

Guru 55923 points

Hi Amit,
Both RX and TX use the ring buffer and flag in a similar way. It seems the DMARIS flags would change faster than the ring buffer can process packets into/from SRAM and have no effect, if only momentary.

The question is why the pause control frame does not seem to delay the MAC FIFO TX/RX controller, which would make the SysCtlDelay unneeded. One hint is that if the flag position is changed, it remains subject to the IP prefetch cycle, which grabs the next few instructions regardless of the flag's immediate position before or after the delay. That tends to indicate the ring buffer SW cannot process packets as fast as the EMAC HW can produce them.

The entire buffer platform takes a random dive into several different error conditions (some worse than others) depending on the amount of delay and the traffic conditions on the wire/local switch. It seems at times the ring buffer is being overrun by the MAC FIFO, even with all the safeguards to block that from occurring. That is the reason for attempting to add 802.3 control frame pauses in both TX/RX ring buffer loops.

There was by default an unusually long SysCtlDelay(Sysclk/20) in the blocking flag cycle of SocketReceive() that may account for some of the PHY behavior. For giggles, just changed the blocking loop to SysCtlDelay(Sysclk/250), about 12 ms given that SysCtlDelay() takes 3 cycles per loop, and tripled the timeout period in the loop. Getting a few more pcBuffer read timeouts than usual, but so far it actually works.

0 Genatco over 10 years ago in reply to Amit Ashara

Guru 55923 points

Hi Amit,

The Exosite server is not responding to pause frames and continues to TX data during the pause frame time the LaunchPad client sends to the Exosite server with the 0xffff.0000 span. This issue is a bit outside my normal network device configurations, which run on the Windows platform or on Cisco Spanning Tree with clients having megabytes of SRAM RX buffer space versus only 4096 bytes of SRAM.

802.3 pause frame control is not enabled by default in the device driver for any or most network cards. This might explain why the TicTacToe game often misses RX data packets at the LaunchPad. The Exosite cloud server is a gender specific client, meaning it only talks to the TM4C IOT clients with EMAC DMA descriptors and a 4096 byte ring buffer configuration.

Hence we must enable 802.3 flow control at the server interface and the IOT clients, or the client RX buffer can become swamped at times of high packet transfers, especially with the default TCP WND set at 4096 bytes. We set that TCP window to advertise a far smaller frame size to the Exosite server, but even 2400 byte frames overflow the RX receiver at times. It takes about 5 minutes to enable flow control on the server's network interface card. TI experts will concur that 802.3 flow control is non disruptive to IOT clients without it enabled.

In case the SysTick timer is ever run faster than the generic IOT code setting, SYSCLK/10: we see the TM4C1294 EMAC TX controller keep sending packets while the RX FIFO fails in several different ways, not always with the same exact outcome.

Please take the time to review this Cisco article; it mostly covers internal network 802.1Q tag frames but elaborates on the 802.3 frame control standard in the beginning paragraphs.

Thanks,

802.1q Flow Control white_paper_c11-542809.pdf

0 Genatco over 10 years ago in reply to Genatco

Guru 55923 points

The TX/RX frame initialization and loop shown below were added in (lwiplib.c). Asserting the pause loop during the EMAC0 interrupt handler has no effect in stopping the Exosite server from sending TX data frames during this absurdly long pause frame period (0xffff.0000) versus the (0x0100.0000) shown in the datasheet for 256 bit time slots.

EMAC logged errors :

>> Abnormal INT Status --->>:98496
>> Abnormal INT Status --->>:99461
>> Abnormal INT Status --->>:98496
>> Abnormal INT Status --->>:99461
>> Abnormal INT Status --->>:98496
>> Abnormal INT Status --->>:99461
>> Abnormal INT Status --->>:98496
>> Abnormal INT Status --->>:99461
>> Abnormal INT Status --->>:98496
>> Abnormal INT Status --->>:99461
>> Abnormal INT Status --->>:99525
>> Abnormal INT Status --->>:99525
>> Abnormal INT Status --->>:98496
>> Abnormal INT Status --->>:99461
>> Abnormal INT Status --->>:98496
>> Abnormal INT Status --->>:99461
>> Abnormal INT Status --->>:98496
>> Abnormal INT Status --->>:99461
>> Abnormal INT Status --->>:98496
>> Abnormal INT Status --->>:99525
>> Abnormal INT Status --->>:99525
exoHAL: << pcBuffer: Read Timeout >>
<< IOT Write (EXO_STATUS_END: Code 10 - http:0) >>
<< [Exosite_Write] No Response >

0 Genatco over 10 years ago in reply to Genatco

Guru 55923 points

It now makes sense why there are so many repeating AIS errors.

The (ui32Status) test in (tiva-tm4c129.c) is not an explicit test for the EMACDMARIS flag AIS bit 15. Accordingly, RDES0 bit 8 in the RX descriptor indicates whether the ORed status flag of AIS bit 15 is valid or not. In effect we get fooled over and over by false AIS messages.

Likewise the EMACDMARIS interrupt flag AIS bit 15 is not being constrained by the DMA engine's RX descriptor RDES0 bit 8, which should actually have enable control over the AIS ORed interrupt flags in the EMACDMARIS register. The other NIS ORed interrupt flags appear ok.

Patches:

/* Update our debug interrupt counters. */

// EMAC_INT_ABNORMAL_INT is the logical OR of the masked state of
// EMAC_INT_TX_STOPPED | EMAC_INT_TX_JABBER | EMAC_INT_RX_OVERFLOW |
// EMAC_INT_TX_UNDERFLOW | EMAC_INT_RX_NO_BUFFER | EMAC_INT_RX_STOPPED |
// EMAC_INT_RX_WATCHDOG | EMAC_INT_EARLY_RECEIVE | EMAC_INT_BUS_ERROR.
if(ui32Status == (EMAC_INT_ABNORMAL_INT | DES0_RX_STAT_LAST_DESC))
{
    g_ui32AbnormalInts++;

    UARTprintf(">> Abnormal INT Status --->>:%i\r\n", ui32Status);
}

// IEEE 802.3 pause flow control:
/* While no pause frame is in flight (the MAC resets REG-7 FCBBPA[0]
 * to 0x0 once a TX/RX pause frame has been transmitted) and the
 * abnormal status above is present, start a pause frame. */
while(!(HWREG(EMAC0_BASE + EMAC_O_FLOWCTL) & EMAC_FLOWCTL_FCBBPA)
      && (ui32Status == (EMAC_INT_ABNORMAL_INT | DES0_RX_STAT_LAST_DESC)))
{
    /* Initiate a pause frame cycle on the TX/RX control frame */
    HWREG(EMAC0_BASE + EMAC_O_FLOWCTL) |= EMAC_FLOWCTL_FCBBPA;

    //ui32Status = MAP_EMACIntStatus(EMAC0_BASE, false);
}
