Other Parts Discussed in Thread: HALCOGEN, TMS570LS3137
Hello,
We see rare incidents (about once per day) in which the EMAC stops receiving data until the next MCU restart.
After some investigation, we found an undocumented race condition between the EMAC and the MCU core.
See the TRM (SPNU563), chapter 32.2.6.2, "Transmit and Receive Descriptor Queues".
Quote from this document:
There is a potential race condition where the EMAC may read the “next” pointer of a descriptor as NULL in
the instant before an application appends additional descriptors to the list by patching the pointer. This
case is handled by the software application always examining the buffer descriptor flags of all EOP
packets, looking for a special flag called end of queue (EOQ). The EOQ flag is set by the EMAC on the
last descriptor of a packet when the descriptor’s “next” pointer is NULL. This is the way the EMAC
indicates to the software application that it believes it has reached the end of the list. When the software
application sees the EOQ flag set, the application may at that time submit the new list, or the portion of
the appended list that was missed by writing the new list pointer to the same HDP that started the
process.
But there is a NEW race condition: the MCU can read EOQ=0 after the EMAC has already read NULL from the next pointer. In that case the software does not rewrite the receive head descriptor pointer (HDP), so the EMAC is never restarted. The result is that reception stops until the next reboot.
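The interleaving described above can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions: the two-step EMAC model (read "next", then write flags), the type and function names (`sim_bd_t`, `sim_emac_t`, `emac_read_next`, `emac_write_flags`, `sw_append`, `simulate_rx`) and the flag bit values are hypothetical and are not taken from the TRM or from the HalCoGen driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits assumed for this simulation only. */
#define SIM_BD_EOQ   0x10000000u
#define SIM_BD_OWNER 0x20000000u

typedef struct sim_bd {
    struct sim_bd *next;
    uint32_t flags;
} sim_bd_t;

typedef struct {
    sim_bd_t *hdp;      /* head descriptor pointer; NULL means channel idle */
    sim_bd_t *latched;  /* descriptor whose NULL "next" was already read */
} sim_emac_t;

/* EMAC step 1: read the "next" pointer of the descriptor being completed. */
static void emac_read_next(sim_emac_t *emac, sim_bd_t *bd)
{
    assert(emac != NULL && bd != NULL);
    emac->latched = (bd->next == NULL) ? bd : NULL;
}

/* EMAC step 2: write back flags; if NULL was latched, set EOQ and stop. */
static void emac_write_flags(sim_emac_t *emac, sim_bd_t *bd)
{
    bd->flags &= ~SIM_BD_OWNER;
    if (emac->latched == bd) {
        bd->flags |= SIM_BD_EOQ;
        emac->hdp = NULL;
        emac->latched = NULL;
    } else {
        emac->hdp = bd->next;
    }
}

/* Software append as in the TRM text: patch "next", restart only on EOQ. */
static int sw_append(sim_emac_t *emac, sim_bd_t *tail, sim_bd_t *fresh)
{
    tail->next = fresh;
    if ((tail->flags & SIM_BD_EOQ) != 0u) {
        emac->hdp = fresh;
        return 1;
    }
    return 0;
}

/* Returns 1 if the channel ends up idle while a free descriptor is queued. */
static int simulate_rx(int racy)
{
    sim_bd_t fresh = { NULL, SIM_BD_OWNER };
    sim_bd_t tail  = { NULL, SIM_BD_OWNER };
    sim_emac_t emac = { &tail, NULL };

    emac_read_next(&emac, &tail);              /* EMAC sees next == NULL */
    if (racy) {
        (void)sw_append(&emac, &tail, &fresh); /* software still sees EOQ == 0 */
        emac_write_flags(&emac, &tail);        /* EOQ is set too late */
    } else {
        emac_write_flags(&emac, &tail);        /* EOQ is visible first */
        (void)sw_append(&emac, &tail, &fresh); /* software restarts the channel */
    }
    return (emac.hdp == NULL) && (tail.next == &fresh);
}
```

In this model `simulate_rx(1)` ends with the channel stopped although a free descriptor is linked, while `simulate_rx(0)` is the ordering that the TRM procedure handles.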
Test code that generates a packet flood and reproduces this behavior is attached (flood.pl).
We flood this interface with packets. Statistically, the buffer underflow handling described in the TRM and used in the HalCoGen EMAC driver works correctly in only 9996 cases out of 10000. The remaining 4 cases cause reception to stop permanently.
It looks like a time-domain synchronization problem in the silicon between the core and the EMAC. Without an investigation on your side, we cannot tell whether it could be fixed by adding ISB/DSB/DMB instructions to the code.
As a workaround we made a patch (attached as a diff against the HalCoGen-generated HL_emac.c file). The code checks the EOQ flag again while processing received data, because at that moment we are sure that the flag is valid.
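The idea of re-checking EOQ while processing received descriptors can be sketched in isolation as follows. This is a simplified host-side illustration, not the patch itself: the names (`rx_bd_t`, `rx_chan_t`, `rx_process`) and the flag bit values are assumptions, and writing `ch->hdp` stands in for the real HDP register write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits assumed for illustration only. */
#define BD_EOQ   0x10000000u
#define BD_OWNER 0x20000000u

typedef struct rx_bd {
    struct rx_bd *next;
    uint32_t flags;
} rx_bd_t;

typedef struct {
    rx_bd_t *hdp;       /* stand-in for the RX head descriptor pointer */
    unsigned restarts;  /* how many times the channel was restarted */
} rx_chan_t;

/*
 * Walk all descriptors the EMAC has released (OWNER cleared), starting at
 * *head. If a released descriptor carries EOQ but already has a successor,
 * the EMAC stopped there without the append path noticing, so restart the
 * channel at the successor. EOQ is cleared afterwards so the same
 * descriptor cannot trigger a second restart.
 * Returns the number of descriptors processed.
 */
static unsigned rx_process(rx_chan_t *ch, rx_bd_t **head)
{
    unsigned processed = 0u;
    rx_bd_t *bd;

    assert(ch != NULL && head != NULL);
    bd = *head;

    while (bd != NULL && (bd->flags & BD_OWNER) == 0u) {
        if ((bd->flags & BD_EOQ) != 0u && bd->next != NULL) {
            bd->flags &= ~BD_EOQ;
            ch->hdp = bd->next;
            ch->restarts++;
        }
        bd = bd->next;
        processed++;
    }

    *head = bd;  /* first descriptor still owned by the EMAC, or NULL */
    return processed;
}
```

The flag is safe to read here because a descriptor with OWNER cleared has already been written back by the EMAC, which is the same reasoning the patch relies on.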
This looks similar to the EOQ problem discussed for TX, here: e2e.ti.com/support/microcontrollers/hercules/f/312/t/526697#pi239031350 or e2e.ti.com/support/arm/sitara_arm/f/791/t/543686#pi316653
We are able to reproduce both variants of the EOQ problem (TX, and the new RX variant described here). The TX variant has been known for more than 2 years, but there is no erratum and no fix in the HalCoGen-generated code. Remember that Hercules is designed for safety, and we have been waiting for more than 2 years. I personally have been waiting for a fix of the TX variant for over one year (e2e.ti.com/support/microcontrollers/hercules/f/312/t/583653 CQ ticket: SDOCM00122906). Horrible!
This behaviour has a big impact on all our current projects; please give this issue maximum priority.
Jiri Dobry
--- HL_emac.c 2017-05-09 14:40:58.112626600 +0200
+++ HL_emac_fix.c 2018-04-09 15:14:04.275155500 +0200
@@ -1662,6 +1662,13 @@
         /*SAFETYMCUSW 45 D MR:21.1 <APPROVED> "Valid non NULL input parameters are assigned in this driver" */
         curr_bd = (emac_rx_bd_t *)EMACSwizzleData((uint32)curr_bd->next);
 
+        if ( (0u != (curr_bd->flags_pktlen & EMACSwizzleData(EMAC_BUF_DESC_EOQ)))
+          && (NULL != curr_bd->next))
+        {
+          //see #RX_EOQ workaround
+          EMACRxHdrDescPtrWrite(hdkif->emac_base, EMACSwizzleData((U32)(curr_bd->next)), (U32)EMAC_CHANNELNUMBER);
+        }
+
         /* Acknowledge that this packet is processed */
         /*SAFETYMCUSW 439 S MR:11.3 <APPROVED> "Address stored in pointer is passed as as an int parameter. - Advisory as per MISRA" */
         /*SAFETYMCUSW 45 D MR:21.1 <APPROVED> "Valid non NULL input parameters are assigned in this driver" */
@@ -1693,7 +1700,13 @@
         /*SAFETYMCUSW 134 S MR:12.2 <APPROVED> "LDRA Tool issue" */
         /*SAFETYMCUSW 134 S MR:12.2 <APPROVED> "LDRA Tool issue" */
         /*SAFETYMCUSW 45 D MR:21.1 <APPROVED> "Valid non NULL input parameters are assigned in this driver" */
-        if((EMACSwizzleData(curr_tail->flags_pktlen) & EMAC_BUF_DESC_EOQ) == EMAC_BUF_DESC_EOQ) {
+        if((EMACSwizzleData(curr_tail->flags_pktlen) & (EMAC_BUF_DESC_EOQ | EMAC_BUF_DESC_OWNER)) == EMAC_BUF_DESC_EOQ) {
+          /*
+           * Clear EMAC_BUF_DESC_EOQ flag to protect against duplicate write pointer to controller.
+           * We can do it because we know that this fragment isn't owned by EMAC.
+           * see #RX_EOQ workaround
+           */
+          curr_tail->flags_pktlen &= ~EMACSwizzleData((U32)EMAC_BUF_DESC_EOQ);
           /*SAFETYMCUSW 439 S MR:11.3 <APPROVED> "Address stored in pointer is passed as as an int parameter. - Advisory as per MISRA" */
           /*SAFETYMCUSW 45 D MR:21.1 <APPROVED> "Valid non NULL input parameters are assigned in this driver" */
           EMACRxHdrDescPtrWrite(hdkif->emac_base, (uint32)(rxch_int->free_head), (uint32)EMAC_CHANNELNUMBER);
use IO::Socket;
use strict;
use Time::HiRes qw(usleep);

my $sock = IO::Socket::INET->new(
    Proto    => 'udp',
    PeerPort => 12345,
    PeerAddr => '192.168.1.10',
) or die "Could not create socket: $!\n";

my $data = pack("H*","1234");

for (my $i=0; $i <= 128; $i++) {
    for (my $j=0; $j <= 256; $j++) {
        print $sock $data;
    }
    usleep(5000);
}