TMS570LC4357 EMAC: Avoiding RXERRCODE=Ownership bit not set in SOP buffer

Dr. Martin Momberg


Dear Sirs,

we are currently observing a severe issue on the Ethernet Controller (EMAC).

The following happens from the perspective of our software when we release a Packet Buffer Descriptor (PBD) to the EMAC:

The software initializes all fields of the PBD, in particular
- setting OWNERSHIP,
- clearing all other flags,
- setting packet length to 0,
- setting buffer length to the size of the provided buffer,
- setting the next pointer to 0;
The software then sets the next pointer of the last PBD, currently being 0, to the address of the to-be-released PBD;
The software executes a Data Synchronization Barrier (dsb) instruction;
The software then checks whether the channel’s Completed PBD Pointer RXnCP points to a PBD (let’s call it PbdEnd) which has EndOfQueue set AND whether the channel’s RXnHDP is 0.
If all conditions are true, the software checks whether PbdEnd’s next pointer is non-zero AND, if so, whether that next PBD is owned by the EMAC.
If this is true, it sets the channel’s RXnHDP to this PBD.
Otherwise, if the to-be-released PBD is still owned by the EMAC, the software sets the channel’s RXnHDP to this PBD.
Otherwise we currently have no free buffer to set RXnHDP to, and expect that the software will process and then release another filled PBD.
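
The release sequence described above can be sketched in C roughly as follows. This is a host-runnable model and not driver code: `pbd_t`, `rx_chan_t` and `rx_release` are hypothetical names, RXnHDP and RXnCP are stood in for by plain struct fields, C pointers replace the 32-bit descriptor addresses, and a C11 fence stands in for the `dsb` instruction. The OWNER and EOQ bit positions (bits 29 and 28 of the flags word) are assumptions, as is the reading that the last two "otherwise" branches only apply once the RXnCP/RXnHDP checks have passed.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define PBD_OWNER (1u << 29)   /* assumed position of the ownership flag */
#define PBD_EOQ   (1u << 28)   /* assumed position of the end-of-queue flag */

typedef struct pbd {
    struct pbd *volatile next;    /* word 0: next descriptor pointer */
    uint8_t *buffer;              /* word 1: buffer pointer */
    volatile uint32_t off_len;    /* word 2: offset (high 16) | buffer length (low 16) */
    volatile uint32_t flags_len;  /* word 3: flags (high 16) | packet length (low 16) */
} pbd_t;

typedef struct {
    pbd_t *volatile hdp;          /* stand-in for RXnHDP */
    pbd_t *volatile cp;           /* stand-in for RXnCP */
} rx_chan_t;

/* Releases descriptor d to the channel. Returns 1 if RXnHDP was written. */
int rx_release(rx_chan_t *ch, pbd_t **tail, pbd_t *d, uint8_t *buf, uint16_t len)
{
    /* Initialize all fields: OWNER set, other flags clear, packet length 0. */
    d->next = NULL;
    d->buffer = buf;
    d->off_len = len;             /* offset 0, buffer length = len */
    d->flags_len = PBD_OWNER;

    /* Append to the software-side tail of the chain. */
    if (*tail != NULL) {
        (*tail)->next = d;
    }
    *tail = d;

    atomic_thread_fence(memory_order_seq_cst);   /* "dsb" on the target */

    /* Restart is only considered if the EMAC stopped at end of queue. */
    pbd_t *end = ch->cp;
    if (ch->hdp != NULL || end == NULL || !(end->flags_len & PBD_EOQ)) {
        return 0;
    }

    pbd_t *cand = end->next;
    if (cand != NULL && (cand->flags_len & PBD_OWNER)) {
        ch->hdp = cand;           /* EMAC missed the link: restart there */
        return 1;
    }
    if (d->flags_len & PBD_OWNER) {
        ch->hdp = d;              /* restart at the freshly released PBD */
        return 1;
    }
    return 0;                     /* nothing free; wait for the next release */
}
```

In this sketch the OWNER check and the write to `hdp` are two separate accesses, which is exactly the window the question below is about.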

Nevertheless, in rare conditions, under heavier reception load on the EMAC receive channel, the channel is stopped with MACSTATUS.RXERRCODE set to “Ownership bit not set in SOP buffer”. This means that although RXnHDP is 0 AND the software has checked that the PBD to be assigned to RXnHDP is still owned by the EMAC, the EMAC finds the PBD already owned by the software when it next examines RXnHDP. In other words, we have no clear rule to determine whether the EMAC is currently processing this PBD prior to setting RXnHDP to this PBD.

Could you please provide us with the following information:

What is the exact order of activities that the EMAC is performing in processing a PBD, beginning with detecting that an Ethernet packet is arriving to the point in time where the EMAC clears the Ownership flag and sets RXnCP, preferably with timing information and information which activities are performed in atomic read-modify-writes?

What is the proper way to handle the above described problem?

This issue is really severe to us, as the Technical Reference Manual states having a non-zero value of MACSTATUS.TXERRCODE requires a hardware reset (cf. TMS570LC43x TRM May 2014, sec. 32.5.30, Table 32-69, Field TXERRCODE: “Transmit host error code. These bits indicate that EMAC detected transmit DMA related host errors. The host should read this field after a host error interrupt (HOSTPEND) to determine the error. Host error interrupts require hardware reset in order to recover. …”).

With best regards

Martin

over 9 years ago

0 Charles Tsai over 9 years ago

TI__Guru**** 191906 points

Hi Martin,
We have received your post. I have forwarded your post to our Ethernet expert. We will look into your issue and get back with you.

It seems like some type of race condition. Is the CPU MPU configured in DEVICE mode or STRONGLY-ORDERED mode for access to the EMAC module? If you are in DEVICE mode, can you change to STRONGLY-ORDERED and see if it makes a difference?

0 Dr. Martin Momberg over 9 years ago in reply to Charles Tsai

Prodigy 125 points

Hi Charles,

sorry for the delayed response.

We are not using the MPU, i.e. it is disabled.

Looking forward to receiving a message from your Ethernet expert.

With best regards

Martin

0 Charles Tsai over 9 years ago in reply to Dr. Martin Momberg

TI__Guru**** 191906 points

Hi Martin,

Our expert is out of office. As soon as he is back I hope he can give some insights.

In the meantime can you please tell me if you are in polling mode or interrupt mode? It seems you are in polling mode but I'm not sure.

I have reformatted your bullets into numbers for easier reference. Since I'm not an expert in this module I have some questions rather than solutions to your problems.

1. The software initializes all fields of the PBD, in particular
   a. setting OWNERSHIP,
   b. clearing all other flags,
   c. setting packet length to 0,
   d. setting buffer length to the size of the provided buffer,
   e. setting the next pointer to 0;
2. The software then sets the next pointer of the last PBD, currently being 0, to the address of the to-be-released PBD;
Charles>> If the next pointer of the last PBD is 0 then it is already the end of queue. I wonder if you should move this step to after step 5. My point is that before appending a new PBD, can you first check whether EOQ is already reached and whether the OWNER bit is zero, which means the EMAC has finished with all the descriptors?
3. The software executes a Data Synchronization Barrier (dsb) instruction;
Charles>> Since you use the dsb, it leads me to wonder whether the cache has something to do with the race condition. Have you tried disabling the cache to see if it makes a difference?
4. The software then checks whether the channel’s Completed PBD Pointer RXnCP points to a PBD (let’s call it PbdEnd) which has EndOfQueue set AND whether the channel’s RXnHDP is 0
Charles>> Perhaps it's my lack of understanding, but I'm curious why the RXnHDP would be 0. Isn't RXnHDP the pointer to the first PBD in the queue? Or does the RXnHDP get reset automatically to 0 when the EOQ is reached? I'm not too sure about this.
5. If all conditions are true the software checks PbdEnd’s next pointer non-zero AND if true whether next PBD is owned by the EMAC
Charles>> This is where I'm a bit confused. Isn't PbdEnd's next pointer just set by the software in step 2?
6. If this is true it will set the channel’s RXnHDP to this PBD
7. Otherwise if the to-be-released PBD is still owned by the EMAC, the software sets the channel’s RXnHDP to this PBD.
8. Otherwise we currently have no free buffer to set RXnHDP to, expecting that the software will process and then release another filled PBD

0 Dr. Martin Momberg over 9 years ago in reply to Charles Tsai

Prodigy 125 points

Hi Charles,

please see my reply to your remarks inserted below.

With best regards

Martin

Charles Tsai said:

1. The software initializes all fields of the PBD, in particular
   a. setting OWNERSHIP,
   b. clearing all other flags,
   c. setting packet length to 0,
   d. setting buffer length to the size of the provided buffer,
   e. setting the next pointer to 0;

2. The software then sets the next pointer of the last PBD, currently being 0, to the address of the to-be-released PBD;
Charles>> If the next pointer of the last PBD is 0 then it is already the end of queue. I wonder if you should move this step to after step 5. My point is that before appending a new PBD, can you first check whether EOQ is already reached and whether the OWNER bit is zero, which means the EMAC has finished with all the descriptors?

Martin>> Please see TMS570LC4357 Technical Reference Manual (May 2014), sec. 32.2.6.2.
Appending the PBD as early as possible should enable the EMAC to use this PBD for packet reception right away; otherwise the EMAC would drop that packet.
We check later whether the EMAC has seen that new non-zero next pointer.

3. The software executes a Data Synchronization Barrier (dsb) instruction;
Charles>> Since you use the dsb, it leads me to wonder whether the cache has something to do with the race condition. Have you tried disabling the cache to see if it makes a difference?

Martin>> We currently have caches disabled; I expect that once caching is enabled a dsb instruction will not be sufficient, and a cache data synchronization operation will be required in addition.

4. The software then checks whether the channel’s Completed PBD Pointer RXnCP points to a PBD (let’s call it PbdEnd) which has EndOfQueue set AND whether the channel’s RXnHDP is 0
Charles>> Perhaps it's my lack of understanding, but I'm curious why the RXnHDP would be 0. Isn't RXnHDP the pointer to the first PBD in the queue? Or does the RXnHDP get reset automatically to 0 when the EOQ is reached? I'm not too sure about this.

Martin>> Please see TMS570LC4357 Technical Reference Manual (May 2014), sec. 32.5.47. The TRM states that writing to a non-zero HDP is an error (actually of the same severity as "Ownership bit not set in SOP buffer"), so checking that the HDP is zero is a prerequisite to writing the new PBD address to it.
Nevertheless, the TRM does not state when exactly the EMAC sets the HDP to zero in the process of receiving a packet. It is clear, however, that the EMAC writes a zero to the HDP once it has observed that there is no next PBD, and it appears to be clear that the EMAC never writes a non-zero value to an HDP.

5. If all conditions are true the software checks PbdEnd’s next pointer non-zero AND if true whether next PBD is owned by the EMAC
Charles>> This is where I'm a bit confused. Isn't PbdEnd's next pointer just set by the software in step 2?

Martin>> No, in step 2 the next pointer of the last PBD in our PBD chain is set. We check whether the next pointer of the PBD referenced by RXnCP is non-zero. That would mean that we tried in step 2 to set the next pointer, but the EMAC had already read the previously zero next pointer, thereby halting the channel and setting EoQ, and that the appended PBD is still the next one to be filled by the EMAC.
Nevertheless, we cannot be sure that RXnCP is pointing to the same PBD that our PBD chain tail is pointing to, as RXnCP is in the hands of the EMAC while the PBD chain tail is in ours. Therefore we differentiate between steps 5 and 7.

6. If this is true it will set the channel’s RXnHDP to this PBD

7. Otherwise if the to-be-released PBD is still owned by the EMAC, the software sets the channel’s RXnHDP to this PBD.

8. Otherwise we currently have no free buffer to set RXnHDP to, expecting that the software will process and then release another filled PBD

0 Anthony F. Seely over 9 years ago

TI__Guru 68950 points

This is a picky point, but in some places I see MACSTATUS.TXERRCODE.

Is that a typo (so that the description should always read MACSTATUS.RXERRCODE), or is TXERRCODE also sometimes indicating an error?

I want to confirm because, obviously, if TXERRCODE is set then this expands the problem complexity quite significantly; the rest of the description only indicates a problem with the receive DMA host operations.

0 Anthony F. Seely over 9 years ago in reply to Anthony F. Seely

TI__Guru 68950 points

Also, just checking other E2E posts.

This one doesn't have an answer posted but it has an interesting 'observation'
e2e.ti.com/.../421949

He claims that with the descriptors at a different address, not inside the CPPI RAM, the problem doesn't occur.

I don't know how practical it would be to try this out as an experiment but it would be interesting to know if the issues might be related. MAC is the same IP.

0 Dr. Martin Momberg over 9 years ago in reply to Anthony F. Seely

Prodigy 125 points

Hello Anthony,

actually TXERRCODE is a typo.

In all places it should read RXERRCODE.

With best regards

Martin

0 Anthony F. Seely over 9 years ago in reply to Dr. Martin Momberg

TI__Guru 68950 points

Martin,
is the PBD also initialized with the offset set to 0? I didn't see that in the list above.

0 Anthony F. Seely over 9 years ago in reply to Anthony F. Seely

TI__Guru 68950 points

I'm not sure the above matters too much but it's what the instructions say.

Anyway, I just read through the logic related to this error bit.

It is very straightforward, not complicated.

The DMA state machine has some states:

IDLE - sits here waiting for a cell to be ready to move from the MAC to RAM
RD_ST0 - reads the channel state
CHK_HEAD_PTR - reads the queue head pointer; if not zero, moves to RD_DESC_SETUP
RD_DESC_SETUP - starts reading the descriptor
RD_DESC0
RD_DESC1
RD_DESC2
RD_DESC3 - this is where the ownership bit is checked; if it is not set, the next state is "HOST_ERR", which is a terminal state until another reset. Bit 29 of word 3 of the descriptor (the fourth 32-bit word) is all that is checked in order to set the error code to 2.
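
As a rough illustration of the check described for RD_DESC3, the transition can be modelled in a few lines of C. This is only a sketch of the behaviour as described here, not the actual hardware logic: the type and function names are hypothetical, and `ST_XFER` is a made-up placeholder for whatever state follows a successful check.

```c
#include <stdint.h>

#define PBD_OWNER            (1u << 29)  /* bit 29 of descriptor word 3 */
#define RXERR_OWNER_NOT_SET  2u          /* error code set on a failed check */

enum rx_dma_state {
    ST_IDLE, ST_RD_ST0, ST_CHK_HEAD_PTR, ST_RD_DESC_SETUP,
    ST_RD_DESC0, ST_RD_DESC1, ST_RD_DESC2, ST_RD_DESC3,
    ST_HOST_ERR,   /* terminal until reset */
    ST_XFER        /* hypothetical name for the state after a good check */
};

typedef struct {
    enum rx_dma_state state;
    uint32_t rxerrcode;
} rx_dma_t;

/* Models the RD_DESC3 step: only the ownership bit of word 3 is examined. */
void rd_desc3(rx_dma_t *m, uint32_t word3)
{
    if (m->state == ST_HOST_ERR) {
        return;                          /* terminal state, nothing changes */
    }
    if (word3 & PBD_OWNER) {
        m->state = ST_XFER;
    } else {
        m->state = ST_HOST_ERR;
        m->rxerrcode = RXERR_OWNER_NOT_SET;
    }
}
```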

The only subtle difference that appears to me from reading the source code for the MAC is that while the description says:
"Ownership bit not set in SOP buffer"

In this case the MAC decides whether or not the buffer is an 'sop buffer'.

If you come from the IDLE state it's an 'sop buffer'. If that one buffer doesn't complete the packet receive and the MAC has to read another PBD for the same packet, then that is not an 'sop buffer'.

But in any case I don't see the 'sop buffer' being checked as a condition for setting the "Ownership bit not set in SOP buffer" error.

It actually looks like *any descriptor* that is read by the DMA without the ownership bit set would cause this HOST_ERR state to be entered.

That's the only real difference that I 'see' compared to the spec, and it's subtle, but maybe think about whether you *ever* would have added a PBD without the ownership bit set.

It doesn't look really like there is anything wrong in the software sequence you have written up either.

So I would say that my next step here if I were debugging would be to look for the software doing something that I don't expect.

I would try the Trace emulator since this problem occurs infrequently.

I would probably set up a data trace to emit all writes to the PBD area.

Then I would set the device up to halt on detection of the error.

I'd then go from the halt back into the trace and look for a write prior to this where the PBD is written without the ownership bit set.

If found - then you could try turning on a bit more information to get the context as to why this is happening.

If you can reproduce this problem with no SDRAM, so that you can use all of the trace pins, you can probably achieve this with both program and data trace turned on. Otherwise you'll need to limit yourself to filtered data trace with just the default 8 trace pins available.

Best Regards,
-Anthony

0 Dr. Martin Momberg over 9 years ago in reply to Anthony F. Seely

Prodigy 125 points

Anthony,

the offset is set to 0 in step 1.d.

With best regards

Martin

0 Dr. Martin Momberg over 9 years ago in reply to Anthony F. Seely

Prodigy 125 points

Hello Anthony,

as you have the state machine available, could you look for the states/transitions where the next pointer is read, where it is written to HDP, where the EoQ decision is taken, and where the Ownership flag is toggled? Are those operations done atomically or is there a delay between the operations; if the latter, which delays would we have to expect?

We always have the ownership flag set when we pass the descriptor to the EMAC. What we expect is that just when we set the tail's next pointer to the released descriptor, the EMAC has just reached the state where it reads the next pointer, takes it, fills in the next packet, flips the ownership bit, and now, as we have come to setting HDP, the ownership bit is no longer with the EMAC and the error is raised.

Using this working hypothesis, timing appears to be essential in this process, so it would be good to know more about the following states in the state machine.

With best regards

Martin

0 Anthony F. Seely over 9 years ago in reply to Dr. Martin Momberg

TI__Guru 68950 points

Hi Martin,

Ok, I'll look at these today. I am not sure though I see yet why these operations need to be atomic but if it's not clear after some more study we can discuss further.

0 Anthony F. Seely over 9 years ago in reply to Anthony F. Seely

TI__Guru 68950 points

BTW are you ever using the channel teardown feature? [for the receive]

0 Anthony F. Seely over 9 years ago in reply to Anthony F. Seely

TI__Guru 68950 points

So Martin, the reads and writes of the head pointer and the descriptors are not atomic at all. Both are bus accesses, and the receive DMA state machine waits until a 'ready' is returned on the bus.

I can't tell you exactly how long they take, but the head pointer is, I believe, in a small scratch RAM inside the MAC, so it will be fast, whereas the descriptor accesses can be very long as they could be on the EMIF.

Meantime - Can you please elaborate more on where you think the hazard is that would cause your software to create the HOSTERR?

0 Anthony F. Seely over 9 years ago in reply to Anthony F. Seely

TI__Guru 68950 points

So regarding the queue head pointers, I believe they are in the STATE RAM of Figure 32-11, EMAC Module Block Diagram. The DMA control state machine can read and write this RAM. The bus to this RAM is pipelined, meaning that you set up the read address on one cycle and the read response is generally returned in the next cycle, if the RAM is single-cycle and there isn't any delay for arbitration. When the state machine is reading and writing the RAM, though, it will wait for the ready at each state transition. I think there is likely a bus arbiter between the state RAM and the tx/rx state machines as well as the external host bus [to be confirmed], so this timing could vary but should always be short. It will always be non-atomic: reads and writes will always occur separately, and no single-cycle read-modify-write is possible.

So hopefully that gives enough of an idea about the head pointer (as viewed in state RAM) manipulation.

0 Anthony F. Seely over 9 years ago in reply to Anthony F. Seely

TI__Guru 68950 points

Martin,

Here's roughly the order:

From Idle:

1. The head pointer for the correct DMA channel is retrieved from State RAM. The internal sop flag is set to '1'.

2. The head pointer is checked to see if it is zero. If it is, then an overrun occurs and the DMA goes to a state called 'Abort Chk Eop'.

3. Using the pointer from (1), the descriptor is read via DMA from anywhere in memory. Multiple cycles / states. Four 32-bit words are read.

3a. When the first word of the descriptor is read, the internal copy of the head pointer is updated with the value of the descriptor's 'Next Descriptor Pointer'.
3d. When the fourth word is read, the ownership flag is checked. If ownership is not set, then the terminal "Host Error" state is reached (what you report).

4. The DMA then moves the incoming cells to memory. On the first write the sop flag is cleared. If multiple PBDs are needed to finish off the packet, then (2, 3) are repeated to get the next PBD.

Note that during this time, the head pointer isn't written back to State RAM, but its internal copy is updated.

When the packet completes, step 5...

5. The internal head pointer is written back to State RAM.

6. The PBDs are written to indicate end of packet. If it is a single PBD for the packet then the SOP & EOP PBDs are the same. If not then both the SOP and EOP PBDs are updated: first the EOP, then the SOP. Word 2 (optional) and Word 3 are written.

7. The Completion Pointer is written to State RAM.

8. The state machine returns to idle.
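
To make the ordering easier to reason about, here is a small host-side C model of steps 1 to 8. It is a sketch under assumptions, not the hardware: all names are hypothetical, the SOP/EOP bit positions (31/30) and the clearing of OWNER in the SOP descriptor at completion are assumed rather than taken from this list, and the abort path is not modelled (running out of descriptors simply returns an error without any write-back).

```c
#include <stddef.h>
#include <stdint.h>

#define PBD_SOP   (1u << 31)   /* assumed bit position */
#define PBD_EOP   (1u << 30)   /* assumed bit position */
#define PBD_OWNER (1u << 29)   /* bit checked when word 3 is read */

typedef struct pbd {
    struct pbd *next;      /* word 0: next descriptor pointer */
    uint32_t buf_len;      /* word 2 (low 16 bits): buffer length */
    uint32_t flags_len;    /* word 3: flags | packet length */
} pbd_t;

typedef struct {
    pbd_t *hdp;            /* head pointer as held in State RAM */
    pbd_t *cp;             /* completion pointer as held in State RAM */
    int host_err;          /* non-zero once the Host Error state is reached */
} rx_model_t;

/* Receives one packet. Returns 0 on success, -1 on overrun, -2 on host error. */
int rx_packet(rx_model_t *m, uint32_t pkt_len)
{
    if (m->host_err) {
        return -2;
    }
    pbd_t *head = m->hdp;                    /* step 1: internal copy */
    pbd_t *sop = NULL, *eop = NULL;
    uint32_t left = pkt_len;

    do {
        if (head == NULL) {
            return -1;                       /* step 2: overrun (abort not modelled) */
        }
        pbd_t *d = head;                     /* step 3: read the descriptor */
        head = d->next;                      /* 3a: internal copy advances */
        if (!(d->flags_len & PBD_OWNER)) {   /* 3d: ownership check */
            m->host_err = 1;                 /* State RAM head pointer untouched */
            return -2;
        }
        if (sop == NULL) {
            sop = d;                         /* step 4: first buffer of the packet */
        }
        eop = d;
        left -= (left < d->buf_len) ? left : d->buf_len;
    } while (left > 0);

    m->hdp = head;                           /* step 5: write back head pointer */
    eop->flags_len |= PBD_EOP;               /* step 6: EOP first ... */
    sop->flags_len = (sop->flags_len & ~(PBD_OWNER | 0xFFFFu))
                     | PBD_SOP | (pkt_len & 0xFFFFu);   /* ... then SOP */
    m->cp = eop;                             /* step 7: completion pointer */
    return 0;                                /* step 8: back to idle */
}
```

In this model the State RAM head pointer changes only in step 5, so a host reading it while a packet is in flight still sees the value from before step 1.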

That's the normal case. This seems like a solid use model to me with the exception of the one well documented hazard that you need to check for when processing the completion queue - which is that some of your buffers may need to be submitted again.

There are two other paths though that may result in the head pointer being updated.

The first is through the teardown process. The channel you tear down has its head pointer written to a value of 0 and then its completion pointer to the value 0xFFFFFFFC.

The second occurs if there is an abort after a DMA has started writing to the descriptor. In this case the head pointer is written back with the 'next buffer descriptor pointer' of the PBD that was used.

Let me know if I missed any critical points.

Thanks and Best Regards,
Anthony

0 Anthony F. Seely over 9 years ago in reply to Anthony F. Seely

TI__Guru 68950 points

Talking this through with Charles, again it seems like a solid design without a hazard, but one thing that isn't clear is this description:
"Ownership bit not set in SOP"

We don't see any qualification of "in SOP" in the check. Instead, every time a descriptor is read, the Ownership bit is checked when word 3 of the descriptor is read, and if it is not set you go to the HOST_ERR state and stay there until a reset.
But in that case the head pointer would not be written back.

So let's say hypothetically that you had linked in 3 PBDs but didn't set the ownership bit on the 2nd one in that list.

The head pointer would start pointing to the 1st PBD in the list.
When this PBD is read, the internal copy of the head pointer now points to the 2nd PBD.
Then if the 2nd PBD is read, the internal copy of the head pointer points to the 3rd PBD.
Ok so that's all good.

But the head pointer you can see and inspect is still pointing to the 1st PBD in the list.

Normally when the packet completes, then the visible head pointer is written back.

But if you have the ownership flag not set in, say, the 2nd or the 3rd PBD in the list, then you would be stuck in HOST_ERR.
The head pointer, which still points to PBD#1, would be pointing to a PBD with the ownership bit set.

So you need to check not only that PBD but any PBD that it's linked to until you reach the end of the list, as any one of them may be the culprit that set off the HOST_ERR state.
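
A post-mortem check along these lines could be a simple walk of the chain starting at the visible head pointer. The following is a minimal sketch with hypothetical names (`pbd_t`, `find_unowned`), using C pointers in place of the 32-bit descriptor addresses; the hop limit is an added safeguard against a corrupted or circular chain and is not something discussed above.

```c
#include <stddef.h>
#include <stdint.h>

#define PBD_OWNER (1u << 29)   /* ownership flag in descriptor word 3 */

typedef struct pbd {
    struct pbd *next;          /* word 0: next descriptor pointer */
    uint32_t flags_len;        /* word 3: flags | packet length */
} pbd_t;

/* Returns the first descriptor reachable from head whose OWNER bit is clear,
 * or NULL if none is found within max_hops descriptors. */
const pbd_t *find_unowned(const pbd_t *head, size_t max_hops)
{
    for (size_t i = 0; head != NULL && i < max_hops; ++i, head = head->next) {
        if (!(head->flags_len & PBD_OWNER)) {
            return head;
        }
    }
    return NULL;
}
```

Called with the value read from the head pointer after the error, a non-NULL result would point at a candidate for the descriptor that triggered HOST_ERR.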

-Anthony

0 Dr. Martin Momberg over 9 years ago in reply to Anthony F. Seely

Prodigy 125 points

We are currently not actively using the channel teardown feature. Nevertheless the software is checking for that magic value in RXnCP (0xFFFFFFFCu) and not trying to set HDP in that case.
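
The guard described here can be sketched as a small predicate. The function and macro names are hypothetical, and the register values are passed in as plain integers so that the sketch stays independent of any particular register-access layer.

```c
#include <stdbool.h>
#include <stdint.h>

#define RX_TEARDOWN_CP 0xFFFFFFFCu   /* value found in RXnCP after a teardown */

/* Returns true only if writing a new descriptor address to RXnHDP is
 * permissible: the channel was not torn down and RXnHDP currently reads 0. */
bool rx_hdp_write_allowed(uint32_t rxcp, uint32_t rxhdp)
{
    if (rxcp == RX_TEARDOWN_CP) {
        return false;                /* teardown complete: do not restart here */
    }
    return rxhdp == 0u;              /* writing to a non-zero HDP is an error */
}
```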
