TMS570LC4357: Ethernet (EMAC) controller doesn't set the EOQ bit on the transmit descriptor in time

Part Number: TMS570LC4357
Other Parts Discussed in Thread: HALCOGEN, AM3505

Hello,

We have found what appears to be a new silicon bug in the EMAC module; it is not recorded in the errata.

The problem is that the EMAC controller reads the NULL pointer terminating the packet fragment chain. That in itself is correct: the NULL is the terminator. But when new data has to be appended to the chain, the TRM gives an official recommendation for handling the potential race condition. Here is the quotation from chapter 32.2.6.2, Transmit and Receive Descriptor Queues:

There is a potential race condition where the EMAC may read the “next” pointer of a descriptor as NULL in
the instant before an application appends additional descriptors to the list by patching the pointer. This
case is handled by the software application always examining the buffer descriptor flags of all EOP
packets, looking for a special flag called end of queue (EOQ). The EOQ flag is set by the EMAC on the
last descriptor of a packet when the descriptor’s “next” pointer is NULL. This is the way the EMAC
indicates to the software application that it believes it has reached the end of the list. When the software
application sees the EOQ flag set, the application may at that time submit the new list, or the portion of
the appended list that was missed by writing the new list pointer to the same HDP that started the
process.
This process applies when adding packets to a transmit list, and empty buffers to a receive list.

Here is the equivalent in C:

  {
    // Chain the bd's.
    volatile struct emacTxBuffDesc *tail = txch->active_tail;
    tail->next = (volatile struct emacTxBuffDesc *)cppiOrder((U32)(active_head));
    if ((U32)0 != (cppiOrder(tail->flags_pktlen) & EMAC_BUF_DESC_EOQ))
    {
      /*
       * If the DMA engine, already reached the end of the chain,
       * the EOQ will be set. In that case, the HDP shall be written again.
       */

      /* Write the Header Descriptor Pointer and start DMA */
      EMACTxHdrDescPtrWrite(hdkif->emac_base, (unsigned int)(active_head), 0 /*channel*/);
    }
  }

This is clear. The code writes the new pointer in place of the terminating NULL and AFTERWARDS checks whether it was too late. If it was too late, it restarts the transmitter from the new head. It fits the TRM recommendation perfectly, but it sometimes fails.
It looks as if the EMAC reads the terminating NULL but does not set EOQ at the same time; the flag update is delayed.
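
The suspected sequence can be reproduced in a small host-side model. This is only a sketch under assumptions: `sim_bd`, `sim_emac`, `sim_emac_complete`, `sim_append_trm` and the `SIM_EOQ` value are hypothetical names for illustration, not part of HALCoGen or lwIP, and the model assumes that the HDP reads as zero once the channel has gone idle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model; not the HALCoGen or lwIP API. */
#define SIM_EOQ 0x10000000u

struct sim_bd {
    struct sim_bd *next;   /* NULL terminates the chain */
    uint32_t flags;        /* only SIM_EOQ is modeled */
};

struct sim_emac {
    struct sim_bd *hdp;    /* models TX HDP; NULL means the channel is idle */
};

/* The model finishes `bd`: it latches `next` and goes idle on NULL.
 * With `delayed_eoq` the EOQ write-back has not happened yet when the
 * function returns, which is the suspected hardware behavior. */
static void sim_emac_complete(struct sim_emac *e, struct sim_bd *bd,
                              bool delayed_eoq)
{
    assert(e != NULL && bd != NULL);
    e->hdp = bd->next;
    if (bd->next == NULL && !delayed_eoq) {
        bd->flags |= SIM_EOQ;
    }
}

/* Append as the TRM describes: patch `next`, then restart only if EOQ
 * is visible. Returns true if the channel is running afterwards and
 * false if the new descriptor was never started (stall). */
static bool sim_append_trm(struct sim_emac *e, struct sim_bd *tail,
                           struct sim_bd *new_head)
{
    tail->next = new_head;
    if ((tail->flags & SIM_EOQ) != 0u) {
        e->hdp = new_head;   /* rewrite HDP to restart transmission */
    }
    return e->hdp != NULL;
}
```

When `sim_emac_complete` is called with `delayed_eoq` set and `sim_append_trm` runs before the flag appears, the model ends with `hdp` at NULL and the appended descriptor never started. With an immediate EOQ write-back the same append restarts the channel.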

The original code from HALCoGen has this "workaround" in its source:

    while (EMAC_BUF_DESC_EOQ != (EMACSwizzleData(curr_bd->flags_pktlen) & EMAC_BUF_DESC_EOQ))
    {
    }
    while (((uint32)0U != *((uint32 *)0xFCF78600U)))
    {
    }

This workaround has several problems:

  1. It is undocumented: nothing in the source code, nothing in the errata.
  2. The function adds only one packet to the TX chain and does not allow another to be queued. With only one packet in it, it can hardly be called a queue.
  3. The code actively polls until the previous packet has gone out. In the optimistic case this wastes MCU processor time. In the pessimistic case the previous packet may be delayed, and the Ethernet code can spend a long time in this polling loop.

Therefore I tried to design another (better) workaround. It has two parts. The first part adds a packet to the queue:

  if(txch->next_bd_to_process == NULL)
  {
    /* For the first time, write the HDP with the filled bd */
    EMACTxHdrDescPtrWrite(hdkif->emac_base, (unsigned int)(active_head), 0);
    txch->next_bd_to_process = active_head;
  }
  else
  {
    // Chain the bd's.
    volatile struct emacTxBuffDesc *tail = txch->active_tail;
    tail->next = (volatile struct emacTxBuffDesc *)cppiOrder((U32)(active_head));
  }

The second part is called after transmission to clean up:

  curr_bd = txch->next_bd_to_process;

  if (NULL != curr_bd)
  {
    /* Check for correct start of packet */
    while // break inside
      (cppiOrder(EMAC_BUF_DESC_SOP) == (curr_bd->flags_pktlen & cppiOrder(EMAC_BUF_DESC_OWNER | EMAC_BUF_DESC_SOP)))
    {
      hdkif->free_tail->next = (volatile struct emacTxBuffDesc *)cppiOrder((U32)curr_bd);

      /* Traverse till the end of packet is reached */
      while((cppiOrder(curr_bd->flags_pktlen) & EMAC_BUF_DESC_EOP) != EMAC_BUF_DESC_EOP)
      {
        curr_bd->flags_pktlen = 0;
        curr_bd = (volatile struct emacTxBuffDesc *)cppiOrder((U32)(curr_bd->next));
      }
      /* Acknowledge the EMAC */
      EMACTxCPWrite(hdkif->emac_base, 0, (U32)curr_bd);

      curr_bd->flags_pktlen &= ~cppiOrder(EMAC_BUF_DESC_EOP | EMAC_BUF_DESC_SOP);

      /* Free the corresponding pbuf */
      pbuf_free((struct pbuf *)curr_bd->pbuf);

      volatile struct emacTxBuffDesc *next = (volatile struct emacTxBuffDesc *)cppiOrder((U32)(curr_bd->next));

      curr_bd->next = NULL;
      hdkif->free_tail = curr_bd;
      curr_bd = next;
      LINK_STATS_INC(link.xmit);
      if (NULL == curr_bd)
      {
        txch->active_tail = NULL;
        break; // while loop
      }
    }
    if ((curr_bd != NULL)
        && (0 == EMACTxHdrDescPtrRead(hdkif->emac_base, 0 /*channel*/))
        && (0 != (curr_bd->flags_pktlen & cppiOrder(EMAC_BUF_DESC_OWNER))))
    {
      /*
       * The DMA engine has already reached the end of the chain,
       * so the HDP must be written again.
       *
       * This is a workaround for the case where the EMAC has consumed the
       * NULL pointer but a subsequent read of EOQ still returns 0.
       * It looks like an unconfirmed silicon problem.
       */
      EMACTxHdrDescPtrWrite(hdkif->emac_base, (unsigned int)(curr_bd), 0 /*channel*/);
    }

    txch->next_bd_to_process = curr_bd;
  }

This works perfectly, but I would like confirmation that the workaround is correct, mainly this condition:

if ((curr_bd != NULL)
        && (0 == EMACTxHdrDescPtrRead(hdkif->emac_base, 0 /*channel*/))
        && (0 != (curr_bd->flags_pktlen & cppiOrder(EMAC_BUF_DESC_OWNER))))
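
To reason about this condition separately from the cleanup loop, it can be isolated as a pure predicate. The following is only a sketch: `tx_needs_restart` and `DEMO_OWNER` are hypothetical names, the flag value is assumed to match `EMAC_BUF_DESC_OWNER`, and the flags word is taken in CPU byte order, whereas the real code compares in descriptor order via `cppiOrder`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match EMAC_BUF_DESC_OWNER; hypothetical name for this sketch. */
#define DEMO_OWNER 0x20000000u

/* Returns true when the transmit channel looks stalled:
 *   - there is still a descriptor waiting to be processed,
 *   - the HDP reads as zero, so the EMAC considers the channel idle,
 *   - the waiting descriptor is still owned by the EMAC, so it has
 *     not been transmitted yet.
 * In that state the HDP has to be rewritten with the waiting descriptor. */
static bool tx_needs_restart(bool has_pending_bd, uint32_t hdp_value,
                             uint32_t flags_pktlen)
{
    return has_pending_bd
        && (hdp_value == 0u)
        && ((flags_pktlen & DEMO_OWNER) != 0u);
}
```

Only the combination "descriptor pending, HDP zero, OWNER still set" returns true. A non-zero HDP means the EMAC is still working through the chain, and a cleared OWNER bit means the descriptor has already been sent, so neither case calls for a restart.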

Could you confirm that this is a correct workaround? Could you release an errata update for this problem (including recommendations)?

The problem appears to be in an IP block that is included in many MCUs.
I can confirm it for the TMS570LC4357, and I have found the same problem reported on other silicon. For example:

RM57 see e2e.ti.com/.../526697
AM3505 see https://e2e.ti.com/support/arm/sitara_arm/f/791/t/543686#pi316653

The MCU used is TMS570LC4357BZWTQQ1, rev. B.

  • Hello Jiri,

    There certainly seems to be a lot of debate in the other posts about whether this is just a result of the race condition described in the TRM or whether it is outside the scope of what the TRM describes. If the latter, it would potentially be considered a silicon erratum; if the former, it would not.

    I have forwarded this post to some of our experts to see whether anything has been captured on this topic beyond what is in the TRM, and also to our HALCoGen team so they can look at the workaround you have proposed and provide some feedback.