TMS570LC4357: Ethernet controller (EMAC) and new undocumented race condition on receive

Part Number: TMS570LC4357
Other Parts Discussed in Thread: HALCOGEN, TMS570LS3137

Hello,

We have rare incidents (roughly once a day) in which the EMAC stops receiving data until the next MCU restart.
After some investigation we found an undocumented race condition between the EMAC and the MCU core.

See TRM document SPNU563, chapter 32.2.6.2, Transmit and Receive Descriptor Queues.

Quote from this document:

There is a potential race condition where the EMAC may read the “next” pointer of a descriptor as NULL in
the instant before an application appends additional descriptors to the list by patching the pointer. This
case is handled by the software application always examining the buffer descriptor flags of all EOP
packets, looking for a special flag called end of queue (EOQ). The EOQ flag is set by the EMAC on the
last descriptor of a packet when the descriptor’s “next” pointer is NULL. This is the way the EMAC
indicates to the software application that it believes it has reached the end of the list. When the software
application sees the EOQ flag set, the application may at that time submit the new list, or the portion of
the appended list that was missed by writing the new list pointer to the same HDP that started the
process.

But there is a NEW race condition. The MCU can read EOQ=0 when the EMAC has already read NULL from the next pointer. In this case the SW does not restart the current receive buffer pointer. The result is that data reception stops until the next reboot.
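
The check-and-restart sequence from the TRM quote, and the window in which the stale read would occur, might look like the host-side sketch below. It is a simplified model, not the HALCoGen driver: `rx_bd_t`, `emac_model_t`, and `append_per_trm` are hypothetical names, the EOQ bit value is an assumption, and the `hdp` field only stands in for the RX head descriptor pointer register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit position of the end-of-queue flag in the flags/length word. */
#define BD_EOQ 0x10000000u

/* Simplified receive buffer descriptor (hypothetical host-side model). */
typedef struct rx_bd {
    struct rx_bd *volatile next;    /* link to the next descriptor, NULL ends the list */
    volatile uint32_t flags_pktlen; /* flags in the upper bits, packet length below */
} rx_bd_t;

/* Stand-in for the EMAC: only the RX head descriptor pointer is modelled. */
typedef struct {
    rx_bd_t *hdp;
} emac_model_t;

/*
 * Append a new chain behind old_tail the way the TRM text describes:
 * patch the "next" pointer, then look at EOQ on the old tail and rewrite
 * the HDP only if the EMAC has flagged that it reached the end of the list.
 * Returns 1 if the HDP was rewritten, 0 otherwise.
 */
int append_per_trm(emac_model_t *emac, rx_bd_t *old_tail, rx_bd_t *new_head)
{
    assert(emac != NULL && old_tail != NULL && new_head != NULL);

    old_tail->next = new_head;

    /*
     * Window reported in this thread: the EMAC may already have read
     * next == NULL from old_tail, yet this read can still return EOQ = 0.
     * In that case no restart happens here and reception stays stopped.
     */
    if ((old_tail->flags_pktlen & BD_EOQ) != 0u) {
        emac->hdp = new_head;
        return 1;
    }
    return 0;
}
```

In this model the function returns 0 whenever EOQ reads as clear, which is exactly the path on which a stale flag would go unnoticed.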

Test code that generates a packet flood and reproduces this behavior is in the attachment (flood.pl).

We flood this interface with packets. Statistically, the buffer underflow handling described in the TRM and used in the HalCoGen EMAC driver works correctly in only 9996 cases out of 10000. In the remaining 4 cases, reception ends up stopped.

It looks like some kind of time-domain synchronization problem in silicon between the core and the EMAC. Without investigation from your side, we have no idea whether it could be fixed by adding ISB/DSB/DMB instructions to the code.

For this problem we have made a patch (the attachment contains a version for the HalCoGen-generated HL_emac.c file). The code checks the EOQ flag again while processing received data, because at that moment we are sure that the flag is valid.

This looks similar to the EOQ problem discussed for TX, here: e2e.ti.com/support/microcontrollers/hercules/f/312/t/526697#pi239031350 or e2e.ti.com/support/arm/sitara_arm/f/791/t/543686#pi316653

We are able to reproduce both variants of the EOQ problem (TX and the new RX one described here). The TX variant has been known for more than 2 years, but there is no erratum and no fix in the HalCoGen-generated code. Remember that Hercules is designed for safety, and we have been waiting for more than 2 years. I personally have been waiting on the TX variant of this problem for over a year (e2e.ti.com/support/microcontrollers/hercules/f/312/t/583653, CQ ticket: SDOCM00122906). Horrible!

This behaviour has a big impact on all our current projects; please give maximum priority to solving this issue.

Jiri Dobry

--- HL_emac.c	2017-05-09 14:40:58.112626600 +0200
+++ HL_emac_fix.c	2018-04-09 15:14:04.275155500 +0200
@@ -1662,6 +1662,13 @@
       /*SAFETYMCUSW 45 D MR:21.1 <APPROVED> "Valid non NULL input parameters are assigned in this driver" */            
         curr_bd = (emac_rx_bd_t *)EMACSwizzleData((uint32)curr_bd->next);
 
+      if (   (0u != (curr_bd->flags_pktlen & EMACSwizzleData(EMAC_BUF_DESC_EOQ)))
+          && (NULL != curr_bd->next))
+      {
+      	//see #RX_EOQ workaround
+        EMACRxHdrDescPtrWrite(hdkif->emac_base, EMACSwizzleData((U32)(curr_bd->next)), (U32)EMAC_CHANNELNUMBER);
+      }
+
       /* Acknowledge that this packet is processed */
       /*SAFETYMCUSW 439 S MR:11.3 <APPROVED> "Address stored in pointer is passed as as an int parameter. - Advisory as per MISRA" */
       /*SAFETYMCUSW 45 D MR:21.1 <APPROVED> "Valid non NULL input parameters are assigned in this driver" */     
@@ -1693,7 +1700,13 @@
         /*SAFETYMCUSW 134 S MR:12.2 <APPROVED> "LDRA Tool issue" */ 
         /*SAFETYMCUSW 134 S MR:12.2 <APPROVED> "LDRA Tool issue" */
       /*SAFETYMCUSW 45 D MR:21.1 <APPROVED> "Valid non NULL input parameters are assigned in this driver" */            
-        if((EMACSwizzleData(curr_tail->flags_pktlen) & EMAC_BUF_DESC_EOQ) == EMAC_BUF_DESC_EOQ) {
+        if((EMACSwizzleData(curr_tail->flags_pktlen) & (EMAC_BUF_DESC_EOQ | EMAC_BUF_DESC_OWNER)) == EMAC_BUF_DESC_EOQ) {
+          /*
+           * Clear EMAC_BUF_DESC_EOQ flag to protect against duplicate write pointer to controller.
+           * We can do it because we know that this fragment isn't owned by EMAC.
+           * see #RX_EOQ workaround
+           */          
+          curr_tail->flags_pktlen &= ~EMACSwizzleData((U32)EMAC_BUF_DESC_EOQ);
           /*SAFETYMCUSW 439 S MR:11.3 <APPROVED> "Address stored in pointer is passed as as an int parameter. - Advisory as per MISRA" */
           /*SAFETYMCUSW 45 D MR:21.1 <APPROVED> "Valid non NULL input parameters are assigned in this driver" */             
           EMACRxHdrDescPtrWrite(hdkif->emac_base, (uint32)(rxch_int->free_head), (uint32)EMAC_CHANNELNUMBER);
 
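# flood.pl: send bursts of small UDP datagrams to the target to provoke RX buffer underflow.
use IO::Socket;
use strict;
use warnings;
use Time::HiRes qw(usleep);

# UDP socket towards the device under test.
my $sock = IO::Socket::INET->new(
    Proto    => 'udp',
    PeerPort => 12345,
    PeerAddr => '192.168.1.10',
) or die "Could not create socket: $!\n";

# Two-byte payload (0x12 0x34).
my $data = pack("H*","1234");

# 129 bursts of 257 datagrams each, with a 5 ms pause between bursts.
for (my $i=0; $i <= 128; $i++)
{
	for (my $j=0; $j <= 256; $j++)
	{
		print $sock $data;
	}
	usleep(5000);
}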

  • Hi Jiri,

    I am so sorry for the late response. I will set up the HW (HDK + lwIP) and use Ostinato to feed packets for the test. If you have any suggestions regarding my test, please let me know.
  • Hi,

    for the test you can use my patch from the attachment and add counters next to the EMACRxHdrDescPtrWrite calls.
    Theoretically you should never see a call inside this condition:
    if ( (0u != (curr_bd->flags_pktlen & EMACSwizzleData(EMAC_BUF_DESC_EOQ))) && (NULL != curr_bd->next))

    You "cannot" see it, because this situation is handled in the second place, by the buffer chaining code.
    The problem is that you will see this condition met at a ratio of about 1:250 compared with normal RX restarts in the RX buffer queue chaining.
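
    One way to add such counters is sketched below, assuming a small wrapper that is called at both restart sites instead of the raw register write. The names `restart_site_t`, `g_restart_count`, `g_last_hdp`, and `rx_hdp_write_counted` are hypothetical, and the register write is replaced by a plain variable so the sketch can run on a host.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* The two places where the patched driver rewrites the RX HDP. */
    typedef enum {
        RESTART_FROM_CHAINING = 0,  /* normal EOQ handling when buffers are re-queued */
        RESTART_FROM_RECHECK  = 1,  /* extra check added by the workaround */
        RESTART_SITE_COUNT
    } restart_site_t;

    uint32_t g_restart_count[RESTART_SITE_COUNT]; /* one counter per restart site */
    uint32_t g_last_hdp;                          /* host stand-in for the RX HDP register */

    /* Count the restart for the given site, then perform the (modelled) HDP write. */
    void rx_hdp_write_counted(uint32_t desc_addr, restart_site_t site)
    {
        assert(site < RESTART_SITE_COUNT);
        g_restart_count[site]++;
        g_last_hdp = desc_addr; /* on the target, the EMACRxHdrDescPtrWrite call goes here */
    }
    ```

    If the TRM procedure alone were sufficient, the `RESTART_FROM_RECHECK` counter would stay at zero during a flood test.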

    Jiri

  • Any news?
  • Hi Jiri,

    The SW team is working on other projects and has a very aggressive release date. We will run the test as soon as we can.
  • Hi Jiri,

    I am not sure that this issue is caused by a "race condition". I did a stress test for one customer months ago. lwIP was running on the MCU, and Ostinato was sending data to and receiving data from the MCU. With "cache writeback" enabled (MPU settings), the code sometimes got stuck. After changing "cache writeback" to "cache write through", the problem was solved.
  • Sorry, but this problem is real. It took us almost 2 years and hundreds of hours to find out why our units sometimes stop communicating.

    This isn't a cache problem. We use this memory in "device" mode (direct access without cache).

    The problem is simple. We can't use Ethernet interrupts directly, because that is too heavy for a hard real-time system with our timing requirements. The second problem is that the amount of internal memory suitable for Ethernet communication is limited. This can cause input buffer underflow during Ethernet traffic peaks. This is normal. I am sure you will be able to reproduce it with interrupt-driven Ethernet handling too, because this MCU is not fast enough to serve the maximum theoretical Ethernet traffic.

    Again: the problem is that the MCU can read EOQ=0 when the EMAC has already read NULL from the next pointer. Definitely.

    Here is another description of this problem:

    1. The MCU adds a new chain of packets to the receive queue.
    2. The MCU tests the EOQ bit on the last fragment of the previous end, to see whether the EMAC has reached the end of the list (in device mode, without any cache effects).
    3. This EOQ is 0, which means that the EMAC has not reached the end of the list.
    4. The EMAC stops receiving because it has reached the end of the list, but the SW has read an out-of-date EOQ.
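
    The four steps can be replayed in a small host-side model, sketched below. It rests on an assumption taken from this description and not confirmed by TI: that the EMAC's read of the "next" pointer and the moment its EOQ write becomes visible to software are two separate events. All names (`emac_model_t`, `emac_read_next`, `emac_publish_eoq`, `sw_append`, `sw_recheck`) are hypothetical, and the EOQ bit value is assumed.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    #define BD_EOQ 0x10000000u  /* assumed end-of-queue flag bit */

    typedef struct rx_bd {
        struct rx_bd *volatile next;
        volatile uint32_t flags_pktlen;
    } rx_bd_t;

    /* Hypothetical EMAC model: the EOQ write-back is split from the "next" read. */
    typedef struct {
        rx_bd_t *active;       /* descriptor being filled, NULL once reception stopped */
        rx_bd_t *eoq_pending;  /* descriptor whose EOQ flag is not yet visible */
    } emac_model_t;

    /* EMAC side, first half: finish a packet and follow the "next" pointer. */
    void emac_read_next(emac_model_t *emac)
    {
        assert(emac != NULL && emac->active != NULL);
        rx_bd_t *done = emac->active;
        emac->active = done->next;
        if (emac->active == NULL) {
            emac->eoq_pending = done;  /* end of list seen, flag not written yet */
        }
    }

    /* EMAC side, second half: the EOQ flag becomes visible to software. */
    void emac_publish_eoq(emac_model_t *emac)
    {
        if (emac->eoq_pending != NULL) {
            emac->eoq_pending->flags_pktlen |= BD_EOQ;
            emac->eoq_pending = NULL;
        }
    }

    /* Software, steps 1 to 3: append a chain and restart only if EOQ is seen. */
    int sw_append(emac_model_t *emac, rx_bd_t *old_tail, rx_bd_t *new_head)
    {
        old_tail->next = new_head;
        if ((old_tail->flags_pktlen & BD_EOQ) != 0u) {
            emac->active = new_head;   /* models the HDP write */
            return 1;
        }
        return 0;
    }

    /* Re-check while processing a received descriptor, in the spirit of the patch. */
    int sw_recheck(emac_model_t *emac, rx_bd_t *processed)
    {
        if (((processed->flags_pktlen & BD_EOQ) != 0u) && (processed->next != NULL)) {
            processed->flags_pktlen &= ~BD_EOQ;  /* avoid a second restart for the same event */
            emac->active = processed->next;      /* models the HDP write */
            return 1;
        }
        return 0;
    }
    ```

    Calling `emac_read_next`, then `sw_append`, then `emac_publish_eoq` leaves `active` at NULL in this model, which corresponds to step 4; a later `sw_recheck` on the old tail restarts reception.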

    I am definitely sure that it isn't a cache problem, because we are able to reproduce it on the TMS570LS3137, which has no cache.

    PS: don't forget another race condition problem in the EMAC code. When you chain a new fragment onto the link chain, the new fragment must be terminated BEFORE this operation, not AFTER! Otherwise the EMAC could be faster than the SW.
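
    The ordering from the PS can be sketched as follows, assuming a simplified descriptor type. `rx_bd_t`, `rx_chain_link`, and `rx_chain_length` are hypothetical names, not HALCoGen functions. The `next` field is declared volatile so that the compiler keeps the two stores in program order; whether an additional barrier is needed on the target is not established in this thread.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct rx_bd {
        struct rx_bd *volatile next;    /* volatile: stores to it are not reordered by the compiler */
        volatile uint32_t flags_pktlen;
    } rx_bd_t;

    /* Link the chain new_head..new_tail behind old_tail. */
    void rx_chain_link(rx_bd_t *old_tail, rx_bd_t *new_head, rx_bd_t *new_tail)
    {
        assert(old_tail != NULL && new_head != NULL && new_tail != NULL);
        new_tail->next = NULL;      /* 1. terminate the new chain first */
        old_tail->next = new_head;  /* 2. only then make it reachable from the live list */
    }

    /* Helper for checking: number of descriptors reachable from head, capped at limit. */
    size_t rx_chain_length(const rx_bd_t *head, size_t limit)
    {
        size_t n = 0;
        while (head != NULL && n < limit) {
            n++;
            head = head->next;
        }
        return n;
    }
    ```

    With the stores in the opposite order, a consumer that follows `old_tail->next` immediately could walk past the new tail through a stale link.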

    Jiri

  • This is NOT a cache problem; it is an out-of-date flag read from the EMAC. See my other comments.