TMS570LC4357 EMAC: Avoiding RXERRCODE=Ownership bit not set in SOP buffer

Other Parts Discussed in Thread: TMS570LC4357

Dear Sirs,

we are currently observing a severe issue on the Ethernet Controller (EMAC).

The following happens from the perspective of our software when we release a Packet Buffer Descriptor (PBD) to the EMAC:

  • The software initializes all fields of the PBD, in particular

    • setting OWNERSHIP,

    • clearing all other flags,

    • setting packet length to 0,

    • setting buffer length to the size of the provided buffer,

    • setting the next pointer to 0;

  • The software then sets the next pointer of the last PBD, currently being 0, to the address of the to-be-released PBD;

  • The software executes a Data Synchronization Barrier (dsb) instruction;
  • The software then checks whether the channel’s Completed PBD Pointer RXnCP points to a PBD (let’s call it PbdEnd) which has EndOfQueue set AND whether the channel’s RXnHDP is 0

  • If all conditions are true the software checks PbdEnd’s next pointer non-zero AND if true whether next PBD is owned by the EMAC

  • If this is true it will set the channel’s RXnHDP to this PBD

  • Otherwise if the to-be-released PBD is still owned by the EMAC, the software sets the channel’s RXnHDP to this PBD.

  • Otherwise we currently have no free buffer to set RXnHDP to, expecting that the software will process and then release another filled PBD
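For reference, the release procedure listed above can be sketched in C roughly as follows. This is only a host-runnable model, not our driver code: `pbd_t`, `rx_channel_t`, `rx_release` and `data_sync_barrier` are hypothetical names, RXnHDP and RXnCP are modelled as plain pointers instead of 32-bit registers, and the flag bit positions (OWNER = bit 29, EOQ = bit 28 of word 3) should be checked against the TRM. The sketch also assumes that the last two steps apply only once the channel has been found halted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PBD_OWNER (1u << 29)   /* word 3: descriptor is owned by the EMAC */
#define PBD_EOQ   (1u << 28)   /* word 3: EMAC saw a zero next pointer    */

typedef struct pbd {
    struct pbd *volatile next;     /* word 0 (a 32-bit address on the target) */
    uint8_t *buffer;               /* word 1 */
    volatile uint32_t off_len;     /* word 2: offset 31:16, buffer length 15:0 */
    volatile uint32_t flags_len;   /* word 3: flags 31:16, packet length 15:0  */
} pbd_t;

typedef struct {                   /* hypothetical stand-in for the registers */
    pbd_t *volatile hdp;           /* RXnHDP */
    pbd_t *volatile cp;            /* RXnCP  */
} rx_channel_t;

/* Placeholder: on the Cortex-R5 this would execute a DSB instruction. */
static void data_sync_barrier(void) { }

/* Releases pbd to the EMAC. Returns the descriptor written to RXnHDP,
 * or NULL if RXnHDP was left untouched. */
pbd_t *rx_release(rx_channel_t *ch, pbd_t **tail, pbd_t *pbd,
                  uint8_t *buf, uint16_t buf_len)
{
    assert(ch != NULL && tail != NULL && pbd != NULL);

    /* Initialize all fields: OWNER set, other flags clear, packet length 0. */
    pbd->next = NULL;
    pbd->buffer = buf;
    pbd->off_len = buf_len;            /* offset 0, length = buffer size */
    pbd->flags_len = PBD_OWNER;

    /* Append to the software's chain tail. */
    if (*tail != NULL) {
        (*tail)->next = pbd;
    }
    *tail = pbd;
    data_sync_barrier();

    /* Restart only if the channel halted: RXnCP's PBD has EOQ, RXnHDP is 0. */
    pbd_t *end = ch->cp;
    if (ch->hdp != NULL || end == NULL || (end->flags_len & PBD_EOQ) == 0u) {
        return NULL;
    }

    /* Prefer the successor of the PBD the EMAC stopped at, if EMAC-owned. */
    pbd_t *next = end->next;
    if (next != NULL && (next->flags_len & PBD_OWNER) != 0u) {
        ch->hdp = next;
        return next;
    }
    /* Otherwise fall back to the PBD just released, if still EMAC-owned. */
    if ((pbd->flags_len & PBD_OWNER) != 0u) {
        ch->hdp = pbd;
        return pbd;
    }
    return NULL;   /* no free buffer to restart with */
}
```

In this model nothing runs concurrently, so it only documents the order of the software's reads and writes; it cannot reproduce the error itself.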

 

Nevertheless, in rare cases under heavier reception load on the EMAC receive channel, the channel is stopped with MACSTATUS.RXERRCODE set to “Ownership bit not set in SOP buffer”. This means that although RXnHDP is 0 AND the software has checked that the PBD to be assigned to RXnHDP is still owned by the EMAC, the EMAC finds the PBD already owned by the software when it next examines RXnHDP. In other words, we have no clear rule for determining whether the EMAC is currently processing this PBD before we set RXnHDP to it.

 

Could you please provide us with the following information:

 

  • What is the exact order of activities that the EMAC is performing in processing a PBD, beginning with detecting that an Ethernet packet is arriving to the point in time where the EMAC clears the Ownership flag and sets RXnCP, preferably with timing information and information which activities are performed in atomic read-modify-writes?
  • What is the proper way to handle the above described problem?

 

This issue is really severe to us, as the Technical Reference Manual states having a non-zero value of MACSTATUS.TXERRCODE requires a hardware reset (cf. TMS570LC43x TRM May 2014, sec. 32.5.30, Table 32-69, Field TXERRCODE: “Transmit host error code. These bits indicate that EMAC detected transmit DMA related host errors. The host should read this field after a host error interrupt (HOSTPEND) to determine the error. Host error interrupts require hardware reset in order to recover. …”).

With best regards

Martin

  • Hi Martin,
    We have received your post. I have forwarded it to our Ethernet expert. We will look into your issue and get back to you.

    It seems like some type of race condition. Is the CPU MPU configured in DEVICE mode or STRONGLY-ORDERED mode for access to the EMAC module? If you are in DEVICE mode, can you change to STRONGLY-ORDERED and see if it makes a difference?
  • Hi Charles,

    sorry for the delayed response.

    We are not using the MPU, i.e. it is disabled.

    Looking forward to receiving a message from your Ethernet expert.

    With best regards

    Martin

  • Hi Martin,

     Our expert is out of office. As soon as he is back I hope he can give some insights.

     In the meantime can you please tell me if you are in polling mode or interrupt mode? It seems you are in polling mode but I'm not sure.

     I have reformatted your bullets into numbers for easier reference. Since I'm not an expert in this module I have some questions rather than solutions to your problems.

    1. The software initializes all fields of the PBD, in particular
      1. setting OWNERSHIP,
      2. clearing all other flags,
      3. setting packet length to 0,
      4. setting buffer length to the size of the provided buffer,
      5. setting the next pointer to 0;
    2. The software then sets the next pointer of the last PBD, currently being 0, to the address of the to-be-released PBD;

      Charles>> If the next pointer of the last PBD is 0 then it is already the end of queue. I wonder if you should move this step to after step 5. My point is that before appending a new PBD can you first check if EOQ is already reached and if the OWNER bit is zero which means the EMAC  has finished with all the descriptors.

    3. The software executes a Data Synchronization Barrier (dsb) instruction;

      Charles>> Since you use the dsb, it leads me to wonder whether the cache has something to do with the race condition. Have you tried disabling the cache to see if it makes a difference?

    4. The software then checks whether the channel’s Completed PBD Pointer RXnCP points to a PBD (let’s call it PbdEnd) which has EndOfQueue set AND whether the channel’s RXnHDP is 0

      Charles>> Perhaps it's my lack of understanding, I'm just curious why the RXnHDP will be 0. Isn't RxnHDP the pointer to the first PBD in the queue? Or the RXnHDP will get reset automatically to 0 when the EOQ is reached. I'm not too sure about this. 

    5. If all conditions are true the software checks PbdEnd’s next pointer non-zero AND if true whether next PBD is owned by the EMAC

      Charles>> This is where I'm a bit confused. Isn't PbdEnd's next pointer just set by the software in step 2? 

    6. If this is true it will set the channel’s RXnHDP to this PBD
    7. Otherwise if the to-be-released PBD is still owned by the EMAC, the software sets the channel’s RXnHDP to this PBD.
    8. Otherwise we currently have no free buffer to set RXnHDP to, expecting that the software will process and then release another filled PBD
  • Hi Charles,

    please see my reply to your remarks inserted below.

    With best regards

    Martin

    Charles Tsai said:

    1. The software initializes all fields of the PBD, in particular
      1. setting OWNERSHIP,
      2. clearing all other flags,
      3. setting packet length to 0,
      4. setting buffer length to the size of the provided buffer,
      5. setting the next pointer to 0;
    2. The software then sets the next pointer of the last PBD, currently being 0, to the address of the to-be-released PBD;

      Charles>> If the next pointer of the last PBD is 0 then it is already the end of queue. I wonder if you should move this step to after step 5. My point is that before appending a new PBD can you first check if EOQ is already reached and if the OWNER bit is zero which means the EMAC  has finished with all the descriptors.

      Martin>> Please see TMS570LC4357 Technical Reference Manual (May 2014), sec. 32.2.6.2.
      Appending the PBD as early as possible should enable the EMAC to already take this PBD for packet reception; otherwise the EMAC would drop that packet.
      We later on check whether the EMAC has seen that new non-zero next pointer.

    3. The software executes a Data Synchronization Barrier (dsb) instruction;

      Charles>> Since you use the dsb, it leads me to wonder whether the cache has something to do with the race condition. Have you tried disabling the cache to see if it makes a difference?

      Martin>> We currently have caches disabled; I expect that with caching enabled a dsb instruction alone will not be sufficient, and that a cache data synchronization barrier will be required in addition.

    4. The software then checks whether the channel’s Completed PBD Pointer RXnCP points to a PBD (let’s call it PbdEnd) which has EndOfQueue set AND whether the channel’s RXnHDP is 0

      Charles>> Perhaps it's my lack of understanding, I'm just curious why the RXnHDP will be 0. Isn't RxnHDP the pointer to the first PBD in the queue? Or the RXnHDP will get reset automatically to 0 when the EOQ is reached. I'm not too sure about this. 

      Martin>> Please see TMS570LC4357 Technical Reference Manual (May 2014), sec. 32.5.47. The TRM states that writing to a non-zero HDP is an error (actually of the same severity as "Ownership bit not set in SOP buffer"), so checking that the HDP is zero is a prerequisite to writing the new PBD address to it.
      Nevertheless, the TRM does not state when exactly the EMAC sets the HDP to zero in the process of receiving a packet, but it is clear that the EMAC writes a zero to the HDP once it has observed that there is no next PBD, and it appears to be clear that the EMAC never writes a non-zero value to an HDP.

    5. If all conditions are true the software checks PbdEnd’s next pointer non-zero AND if true whether next PBD is owned by the EMAC

      Charles>> This is where I'm a bit confused. Isn't PbdEnd's next pointer just set by the software in step 2? 

      Martin>> No, in step 2 the next pointer of the last PBD in our PBD chain is set. Here we check whether the next pointer of the PBD that RXnCP points to is non-zero. That would mean that we set the next pointer in step 2, but the EMAC had already read the previously zero next pointer, halting the channel and setting EoQ, and that the appended PBD is still the next one to be filled by the EMAC.
      Nevertheless we cannot be sure that RXnCP is pointing to the same PBD that our PBD chain tail is pointing to, as RXnCP is in the hands of the EMAC while the PBD chain tail is in ours. Therefore we differentiate between steps 5 and 7.

    6. If this is true it will set the channel’s RXnHDP to this PBD
    7. Otherwise if the to-be-released PBD is still owned by the EMAC, the software sets the channel’s RXnHDP to this PBD.
    8. Otherwise we currently have no free buffer to set RXnHDP to, expecting that the software will process and then release another filled PBD

  • This is a picky point, but in some places I see MACSTATUS.TXERRCODE.

    Is that an error (it should always read in the description MACSTATUS.RXERRCODE) or is the TXERRCODE also sometimes indicating an error.

    Want to confirm because obviously if TXERRCODE is set then this expands the problem complexity quite significantly .. the rest of the description only indicates a problem w. the receive DMA host operations.
  • Also, just checking other E2E posts.

    This one doesn't have an answer posted but it has an interesting 'observation'
    e2e.ti.com/.../421949

    He claims that with the descriptors at a different address, not inside the CPPI RAM, the problem doesn't occur.

    I don't know how practical it would be to try this out as an experiment but it would be interesting to know if the issues might be related. MAC is the same IP.
  • Hello Anthony,

    actually TXERRCODE is a typo.

    In all places it should read RXERRCODE.

    With best regards

    Martin

  • Martin,
    is the PBD also initialized with the offset set to 0? I didn't see that in the list above.
  • I'm not sure the above matters too much but it's what the instructions say.

    Anyway, I just read through the logic related to this error bit.

    It is very straightforward, not complicated.

    The DMA state machine has some states:

    IDLE - sits here waiting for a cell to be ready to move from the MAC to RAM
    RD_ST0 - reads the channel state
    CHK_HEAD_PTR - reads the queue head pointer; if it is not zero, moves to RD_DESC_SETUP
    RD_DESC_SETUP - starts reading the descriptor
    RD_DESC0
    RD_DESC1
    RD_DESC2
    RD_DESC3 - this is where the ownership bit is checked; if it is not set, the next state is "HOST_ERR", which is a terminal state until another reset. Bit 29 of word 3 of the descriptor (the fourth word read) is all that is checked in order to set the error code to 2.

    The only subtle difference that appears to me by reading the source code for the MAC is that while the description says:
    "Ownership bit not set in SOP buffer"

    In this case the MAC decides whether or not the buffer is an 'sop buffer'.

    If you come from the IDLE state it's an 'sop buffer'. If that one buffer doesn't complete the packet receive and the DMA has to read another PBD for the same packet, then that one is not an 'sop buffer'.

    But in any case I don't see the 'sop buffer' being checked as a condition for setting the "Ownership bit not set in SOP buffer" error.

    It actually looks like *any descriptor* that is read by the DMA without the ownership bit set would cause this HOST_ERR state to be entered.
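To make that reading concrete, here is a small C model of the decision taken in RD_DESC3 as I described it above. It is only a sketch of my reading of the logic, not the RTL: the function name `rx_after_rd_desc3` and the state `RX_MOVE_DATA` are made up for illustration, and the `sop_flag` argument is there only to show that it plays no part in the check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RX_DESC_OWNER      (1u << 29)  /* bit 29 of descriptor word 3    */
#define RXERRCODE_NO_OWNER 2u          /* error code set on this failure */

typedef enum {
    RX_IDLE, RX_RD_ST0, RX_CHK_HEAD_PTR, RX_RD_DESC_SETUP,
    RX_RD_DESC0, RX_RD_DESC1, RX_RD_DESC2, RX_RD_DESC3,
    RX_HOST_ERR,    /* terminal until reset */
    RX_MOVE_DATA    /* hypothetical name for "carry on with the transfer" */
} rx_state_t;

/* Next state after word 3 of a descriptor has been read in RD_DESC3. */
rx_state_t rx_after_rd_desc3(uint32_t word3, int sop_flag, uint32_t *rxerrcode)
{
    assert(rxerrcode != NULL);
    (void)sop_flag;   /* deliberately unused: the check is not SOP-qualified */

    if ((word3 & RX_DESC_OWNER) == 0u) {
        *rxerrcode = RXERRCODE_NO_OWNER;
        return RX_HOST_ERR;
    }
    return RX_MOVE_DATA;
}
```

Calling it with `sop_flag` set to 0 and the OWNER bit clear still returns `RX_HOST_ERR`, which is the point of the observation above.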

    That's the only real difference that I 'see' compared to the spec, and it's subtle, but maybe think about whether you would *ever* have added a PBD without the ownership bit set.


    It doesn't really look like there is anything wrong in the software sequence you have written up either.

    So I would say that my next step here if I were debugging would be to look for the software doing something that I don't expect.


    I would try the Trace emulator since this problem occurs infrequently.

    I would probably set up a data trace to emit all writes to the PBD area.

    Then I would set the device up to halt on detection of the error.

    I'd then go from the halt back into the trace and look for an earlier write where the PBD is written without the ownership bit set.

    If found - then you could try turning on a bit more information to get the context as to why this is happening.

    If you can reproduce this problem with no SDRAM, so that you can use all of the trace pins, you can probably achieve this with both program and data trace turned on; otherwise you'll need to limit yourself to filtered data trace with just the default 8 trace pins available.

    Best Regards,
    -Anthony
  • Anthony,

    the offset is set to 0 in step 1.d.

    With best regards

    Martin

  • Hello Anthony,

    as you have the state machine available, could you look for the states/transitions where the next pointer is read and written to HDP, where the EoQ decision is taken, and where the Ownership flag is toggled? Are those operations done atomically or is there a delay between them; if the latter, which delays would we have to expect?

    We always have the ownership flag set when we pass the descriptor to the EMAC. What we expect is that just when we set the tail's next pointer to the released descriptor, the EMAC has just reached the state where it reads the next pointer, takes it, fills in the next packet, flips the ownership bit, and now, as we have come to setting HDP, the ownership bit is no longer with the EMAC and the error is raised.

    Using this working hypothesis, timing appears to be essential in this process, so it would be good to know more about the following states in the state machine.

    With best regards

    Martin
  • Hi Martin,

    Ok, I'll look at these today. I am not sure though I see yet why these operations need to be atomic but if it's not clear after some more study we can discuss further.
  • BTW are you ever using the channel teardown feature?  [for the receive]

  • So Martin the reads and writes of the head pointer and the descriptors are not atomic at all.
    Both are bus accesses and the receive DMA state machine waits until there is a 'ready' returned on the bus.

    I can't tell you exactly how long but the head pointer I believe is in a small scratch ram inside the MAC and so it will be fast,
    but the descriptor accesses can be very long as they could be on the EMIF.

    Meantime - Can you please elaborate more on where you think the hazard is that would cause your software to create the HOSTERR?
  • So regarding the queue head pointers, I believe they are in the STATE RAM of Figure 32-11, EMAC Module Block Diagram.
    The DMA control state machine can read and write this RAM. The bus to this RAM is pipelined, meaning that you set up the read address in one cycle and the read response is generally returned in the next cycle, provided the RAM is single cycle and there isn't any delay for arbitration. When the state machine is reading and writing the RAM, though, it will wait for the ready at each state transition. I think there is likely a bus arbiter between the state RAM and the tx/rx state machines as well as the external host bus [to be confirmed], so this timing could vary but should always be short. And it will always be non-atomic: reads and writes will always occur separately, and no single-cycle read-modify-write is possible.

    So hopefully that gives enough of an idea about the head pointer (as viewed in state RAM) manipulation.
  • Martin,

    Here's roughly the order:

    From Idle:

    1. The head pointer for the correct DMA channel is retrieved from State RAM.
    The sop flag is internally set to '1'.

    2. The head pointer is checked to see if it is zero. If it is, then an overrun occurs and the DMA goes to a state called 'Abort Chk Eop'.

    3. Using the pointer from (1), the descriptor is read via DMA from anywhere in memory.
    Multiple cycles / states. Four 32-bit words are read.

    3a. When the first word of the descriptor is read, the internal copy of the head pointer is updated with the value of the descriptor's 'Next Descriptor Pointer'.
    3d. When the fourth word is read, the ownership flag is checked.
    If ownership is not set, then the terminal "Host Error" state is reached (what you report).

    4. The DMA then moves the incoming cells to memory.
    On the first write the sop flag is cleared.
    If multiple PBDs are needed to finish off the packet, then (2, 3) are repeated to get the next PBD.

    Note that during this time the head pointer isn't written back to State RAM, but its internal copy is updated.

    When the packet completes, step 5...

    5. The internal head pointer is written back to State RAM.

    6. The PBDs are written to indicate end of packet.
    If it is a single PBD for the packet, then the SOP and EOP PBDs are the same.
    If not, then both the SOP and EOP PBDs are updated: first the EOP, then the SOP.
    Word 2 (optional) and Word 3 are written.

    7. The Completion Pointer is written to State RAM

    8. State machine returns to idle
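The sequence in steps 1 to 8 can be written down as a small host-side C model, which may make it easier to reason about what software can observe at each point. Treat it as a sketch of my description, not of the actual hardware: `desc_t`, `state_ram_t` and `rx_model_receive` are hypothetical names, data movement is reduced to subtracting buffer lengths, the abort handling after an overrun is left out, and the details of step 6 (OWNER cleared only in the SOP descriptor, EOQ set in the EOP descriptor when its next pointer is zero) are assumptions to verify against the TRM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define F_SOP   (1u << 31)
#define F_EOP   (1u << 30)
#define F_OWNER (1u << 29)
#define F_EOQ   (1u << 28)

typedef struct desc {
    struct desc *next;    /* word 0: next descriptor pointer                */
    uint8_t *buf;         /* word 1: buffer pointer                         */
    uint32_t off_len;     /* word 2: buffer length in bits 15:0             */
    uint32_t flags_len;   /* word 3: flags 31:16, packet length 15:0        */
} desc_t;

typedef struct {
    desc_t *head;         /* head pointer in State RAM (visible as RXnHDP)  */
    desc_t *cp;           /* completion pointer in State RAM (RXnCP)        */
} state_ram_t;

typedef enum { RX_OK, RX_OVERRUN, RX_HOST_ERR } rx_result_t;

/* Models the reception of one packet of pkt_len bytes. */
rx_result_t rx_model_receive(state_ram_t *ram, uint32_t pkt_len)
{
    assert(ram != NULL);

    desc_t *internal = ram->head;              /* step 1: read State RAM    */
    desc_t *sop = NULL;
    desc_t *eop = NULL;
    uint32_t remaining = pkt_len;

    do {
        if (internal == NULL) {                /* step 2: zero head pointer */
            return RX_OVERRUN;
        }
        desc_t *d = internal;                  /* step 3: read descriptor   */
        internal = d->next;                    /* step 3a: internal copy    */
        if ((d->flags_len & F_OWNER) == 0u) {  /* step 3d: ownership check  */
            return RX_HOST_ERR;                /* State RAM head unchanged  */
        }
        if (sop == NULL) {
            sop = d;
        }
        eop = d;
        uint32_t room = d->off_len & 0xFFFFu;  /* step 4: move the data     */
        remaining = (remaining > room) ? (remaining - room) : 0u;
    } while (remaining > 0u);

    ram->head = internal;                      /* step 5: write back head   */

    eop->flags_len |= F_EOP;                   /* step 6: EOP first ...     */
    if (eop->next == NULL) {
        eop->flags_len |= F_EOQ;
    }
    sop->flags_len = (sop->flags_len | F_SOP | (pkt_len & 0xFFFFu)) & ~F_OWNER;

    ram->cp = eop;                             /* step 7: completion ptr    */
    return RX_OK;                              /* step 8: back to idle      */
}
```

Because the host-error return in the model happens before step 5, `ram->head` keeps its old value in that case, in line with the note above that the head pointer is not written back while a packet is in progress.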


    That's the normal case. This seems like a solid use model to me with the exception of the one well documented hazard that you need to check for when processing the completion queue - which is that some of your buffers may need to be submitted again.

    There are two other paths though that may result in the head pointer being updated.

    The first is through the teardown process. The channel you tear down has its head pointer written to a value of 0 and then its completion pointer to the value 0xFFFFFFFC.

    The second occurs if there is an abort after a DMA has started writing to the descriptor. In this case the head pointer is written back with the 'next buffer descriptor pointer' of the PBD that was used.

    Let me know if I missed any critical points.

    Thanks and Best Regards,
    Anthony
  • Talking this through with Charles, again it seems like a solid design without a hazard, but one thing that isn't clear is this description:
    "Ownership bit not set in SOP"

    We don't see any qualification of "in SOP" in the check. Instead, every time a descriptor is read, the Ownership bit is checked when word 3 of the descriptor is read, and if it is not set you go to the HOST_ERR state and stay there until a reset.
    But in that case the head pointer would not be written back.

    So let's say hypothetically that you had linked in 3 PBDs but didn't set the ownership bit on the 2nd one in that list.

    The head pointer would start out pointing to the 1st PBD in the list.
    When this PBD is read, the internal copy of the head pointer now points to the 2nd PBD.
    Then when the 2nd PBD is read, the internal copy of the head pointer points to the 3rd PBD.
    Ok so that's all good.

    But the head pointer you can see and inspect is still pointing to the 1st PBD in the list.

    Normally, when the packet completes, the visible head pointer is written back.

    But if the ownership flag is not set in, say, the 2nd or the 3rd PBD in the list, then you would be stuck in HOST_ERR.
    The head pointer, which still points to PBD#1, would be pointing to a PBD with the ownership bit set.

    So you need to check not only that PBD but every PBD it is linked to until you reach the end of the list, as any one of them may be the culprit that set off the HOST_ERR state.
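A check like that could look roughly like the C sketch below, run after the error has been detected and the DMA has stopped. The names `desc_t` and `find_unowned` are hypothetical, the descriptor layout is modelled with host pointers, and `max_hops` is just a guard I added against a corrupted, looping chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_OWNER (1u << 29)   /* bit 29 of descriptor word 3 */

typedef struct desc {
    struct desc *next;     /* word 0 */
    uint8_t *buf;          /* word 1 */
    uint32_t off_len;      /* word 2 */
    uint32_t flags_len;    /* word 3 */
} desc_t;

/* Walks the chain starting at head and returns the first descriptor whose
 * OWNER bit is clear, or NULL if none is found within max_hops links. */
const desc_t *find_unowned(const desc_t *head, size_t max_hops)
{
    assert(max_hops > 0u);

    for (const desc_t *d = head; d != NULL && max_hops-- > 0u; d = d->next) {
        if ((d->flags_len & DESC_OWNER) == 0u) {
            return d;
        }
    }
    return NULL;
}
```

Started from the PBD that the visible head pointer refers to, this would return the 2nd PBD in the hypothetical three-descriptor example.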

    -Anthony
  • We are currently not actively using the channel teardown feature. Nevertheless the software is checking for that magic value in RXnCP (0xFFFFFFFCu) and not trying to set HDP in that case.
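A minimal sketch of that guard, assuming the registers are passed in as pointers so that it can also be run on a host; `rx_try_start` and `RX_CP_TEARDOWN_DONE` are hypothetical names, and the word-alignment assertion is an assumption about how the descriptors are placed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RX_CP_TEARDOWN_DONE 0xFFFFFFFCu   /* magic value written on teardown */

/* Writes pbd_addr to RXnHDP only if no teardown is signalled in RXnCP and
 * RXnHDP currently reads 0. Returns true if the write was performed. */
bool rx_try_start(volatile uint32_t *rxhdp, const volatile uint32_t *rxcp,
                  uint32_t pbd_addr)
{
    assert(pbd_addr != 0u && (pbd_addr & 3u) == 0u);   /* word aligned */

    if (*rxcp == RX_CP_TEARDOWN_DONE) {
        return false;   /* channel was torn down: do not touch HDP */
    }
    if (*rxhdp != 0u) {
        return false;   /* writing to a non-zero HDP is an error */
    }
    *rxhdp = pbd_addr;
    return true;
}
```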