AM6548: PRU ICSSG (SR 2.0) receive stalls

Daniel Hornaes

Part Number: AM6548

Hello,

As discussed in the related thread, we've been experiencing stalls on the receive path when connecting two PRU-ICSSG Ethernet interfaces point to point (e.g. emac0 to emac1).
This has been discussed in a previous thread on this forum: e2e.ti.com/.../am6548-pru-icssg-sr-2-0-receive-stall-with-point-to-point-connection-between-interfaces-on-same-icssg

Luckily, we were able to rule out this type of connection from our use-cases, so this hasn't been a critical issue for us.
Unfortunately, as we've ramped up testing of our custom board, we're seeing several network failures which look very similar to the previous case. One big difference: this happens even when the two PRU-ICSSG ethernet interfaces are not connected point to point. As our technical observations seem to match the ones discussed in the previous thread, I'll state some high-level observations here.

The following setup was used to provoke the issue:

Our custom board was connected to a host computer. Emac0 and emac1 were connected to two different ethernet interfaces, without any switches/hubs.

The observed issue is as follows:

After booting the board, the emac interfaces were configured with IP addresses and brought up (i.e. ifconfig emacX <IP address> up).
When brought into the UP state, both interfaces produced a gratuitous ARP; these were visible on their respective sub-networks (observed via Wireshark).
However, when attempting to ping the board (on both interfaces), only emac0 seemed to receive the ping request properly and produce a response.
When attempting to ping the host computer from the board, network traffic from both emac0 and emac1 was visible: both interfaces made an ARP request towards the host computer.
The host computer responded correctly to both ARP requests.
However, only emac0 seemed to receive the ARP response and process it correctly. It then went on to complete a ping request/reply sequence.
emac1 did not seem to receive the ARP response at all, and never produced any subsequent ping requests.

The general observation is that the receive path of the secondary interface in a PRU-ICSSG pair (in this case emac1) seems to get stuck in a broken state. This seems to occur as part of system start-up, and we're unable to bring emac1 out of this state. However, the problem doesn't occur every time: in a minority of cases, both interfaces seem to work just fine.

Some additional observations that may or may not be relevant:

The issue seems less frequent when the operating system/networking drivers are compiled without debugging information. This shouldn't affect the logical path in the driver itself, but it will surely affect timing.
The issue occurs less frequently when we place a switch between the board and the host computer.
1. One explanation for this is that the switch limits the network speed to 100 Mbps, unlike a point-to-point connection, which will auto-negotiate to 1 Gbps.
2. We attempted to restrict the auto-negotiation speed of the network interfaces on the host computer to 100 Mbps. Interestingly, this seems to make things much more stable.
Apart from this, the observations made in the linked forum thread should still be valid and relevant.

over 1 year ago

0 Bin Liu over 1 year ago

TI__Guru*** 133266 points

Hi,

Our PRU ICSSG expert is out of office today. Please expect a delayed response.

0 Nick Saulnier over 1 year ago in reply to Bin Liu

TI__Guru* 76435 points

Hello Daniel,

I am back from vacation. I did not have time to read all the way through the previous e2e thread today; I will finish reading through it tomorrow.

1) Mukul mentioned you would try to keep the PHY in reset during boot to see if it would prevent the issue. Any luck?

2) From the above, it looks like you are observing the issue with static IP addresses, not during DHCP negotiation. Is that correct?

3) I heard you are using VxWorks instead of Linux. What version of the Linux PRUETH driver is the VxWorks driver based on (e.g., SDK release, tag in the ti-linux-kernel repo, etc.)? Which version of the PRU Ethernet binary are you testing?

Regards,

Nick

0 Daniel Hornaes over 1 year ago in reply to Nick Saulnier

Prodigy 30 points

Hello Nick,
I hope you had an enjoyable vacation.

To the matter at hand:

1) I've been trying out various combinations of PHY reset (via a GPIO pin), and I haven't been able to make any useful progress. If I de-assert the reset too late, the associated interface fails completely (in both TX and RX directions). If I de-assert the reset any earlier, we still seem to end up in the "RX stall" case.
2) Correct, we're not using DHCP.
3) We're mainly running PRU firmware 02.02.08.02; however, we've observed the same issue on 02.02.09.0[2367] and 02.02.11.01 as well. As for the Linux version the VxWorks driver is based on, it is the one provided as part of Linux SDK 07.03.00.07.

We have managed to isolate a test case which seems fairly easy to reproduce on an evaluation board. I can provide it to you in the form of a bootable SD card image, if you are interested. This forum is probably not the right place for exchanging files. Please ask Mukul to provide my e-mail address, and we'll sort out the details that way.

Best regards,
Daniel Hornæs

0 Nick Saulnier over 1 year ago in reply to Daniel Hornaes

TI__Guru* 76435 points

For future readers:

The PRU firmware versions Daniel is discussing can be found here: https://git.ti.com/cgit/processor-firmware/ti-linux-firmware/tree/ti-pruss?h=ti-linux-firmware

Looking at the commit logs, https://git.ti.com/cgit/processor-firmware/ti-linux-firmware/commit/ti-pruss/am65x-sr2-pru0-prueth-fw.elf?h=ti-linux-firmware&id=224f82474e4029dd2e550600ba7e59f687ff7131 tells us that 02.02.11.01 is the firmware version associated with the latest SDK version that was released (AM64x Linux SDK 8.4).

For Daniel:

1) I was provided a test binary this morning that fixes an error scenario that may or may not be related to your issue. Please feel free to run tests and let me know if you see better behavior: icssg-prueth-fw.tar.gz

2) To confirm: You are able to observe the issue on a TI EVM?

3) What is your timeframe of need here? We have made a lot of fixes to the PRUETH driver since SDK 7.3, and we will probably continue making a lot of changes to driver and firmware over the next couple of weeks. If the VxWorks driver needs to get updated (it probably should), it would make more sense to use the Linux driver in a couple weeks or months instead of the Linux driver today.

Regards,

Nick

0 Daniel Hornaes over 1 year ago in reply to Nick Saulnier

Prodigy 30 points

1) Test firmware trial results

I was able to load and start the END drivers using the test binaries you provided yesterday. Unfortunately, they only seem to make things worse:
Using the new FW, the RX path appears dead on both interfaces, and in most cases I seem to lose the TX path as well. My bag of "tricks" (forcing 100 Mbps, starting the END driver without the network cables plugged in to avoid PHY auto-negotiation during driver start-up) doesn't work anymore.

2) Yes, we have observed the issue on a TI EVM (easily reproducible using a point-to-point connection between two adjacent emac interfaces).

3) Timeframe considerations

We are nearing release/production, so we certainly don't have "several months" to wait for this problem to be resolved.
Regarding keeping the VxWorks driver in line with the official Linux driver, I agree that we should do it. Apart from resolving the problem at hand, this task can probably be postponed a while.
In short: if we require a large rewrite of the driver in order to resolve the RX stall issue, we need to start working on it within the next few weeks. Rewriting the driver in a broader sense can wait until you've implemented your planned changes.

New findings since yesterday:

We've dumped some snapshots of the MSMC RAM at various stages throughout the driver start-up, in order to observe where the error reports start appearing. Turns out the critical point is just after the first network interface (emac0) has called miiBusModeSet(). This function basically configures link advertisement and toggles the auto-negotiate restart bit in the PHY control register.
Note that error reports appear for emac1 in the MSMC memory, even though this interface isn't even running firmware at this point!
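
For readers who want to picture what a function like miiBusModeSet() does at the register level, here is a minimal sketch using the standard IEEE 802.3 Clause 22 PHY register layout. It is not the VxWorks implementation: FakeMdio and restrict_to_100mbps_and_restart() are hypothetical names, and the in-memory bus and its initial register values are assumptions made only so the example runs without hardware. It also shows the "force 100 Mbps" trick as a change to the 1000BASE-T advertisement followed by an auto-negotiation restart.

```python
# Standard IEEE 802.3 Clause 22 PHY register numbers and bits.
MII_BMCR = 0x00          # basic mode control register
MII_ADVERTISE = 0x04     # auto-negotiation advertisement (10/100)
MII_CTRL1000 = 0x09      # 1000BASE-T control (gigabit advertisement)

BMCR_ANRESTART = 0x0200  # restart auto-negotiation (self-clearing)
BMCR_ANENABLE = 0x1000   # enable auto-negotiation

ADVERTISE_100FULL = 0x0100   # bit in MII_ADVERTISE
ADVERTISE_1000HALF = 0x0100  # bit in MII_CTRL1000
ADVERTISE_1000FULL = 0x0200  # bit in MII_CTRL1000


class FakeMdio:
    """Hypothetical in-memory stand-in for an MDIO bus with one PHY."""

    def __init__(self):
        # Assumed start-up values: autoneg enabled, everything advertised.
        self.regs = {MII_BMCR: 0x1140, MII_ADVERTISE: 0x01E1, MII_CTRL1000: 0x0300}
        self.restarts = 0

    def read(self, reg):
        return self.regs[reg]

    def write(self, reg, value):
        if reg == MII_BMCR and (value & BMCR_ANRESTART):
            self.restarts += 1          # a real PHY starts negotiating here
            value &= ~BMCR_ANRESTART    # the restart bit clears itself
        self.regs[reg] = value & 0xFFFF


def restrict_to_100mbps_and_restart(mdio):
    """Stop advertising 1000BASE-T, then restart auto-negotiation."""
    ctrl1000 = mdio.read(MII_CTRL1000)
    mdio.write(MII_CTRL1000, ctrl1000 & ~(ADVERTISE_1000HALF | ADVERTISE_1000FULL))
    bmcr = mdio.read(MII_BMCR)
    mdio.write(MII_BMCR, bmcr | BMCR_ANENABLE | BMCR_ANRESTART)
```

The point of the sketch is only that the restart bit triggers a fresh link negotiation, which is the moment at which the error reports described above start to appear.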

A potentially interesting bit is the following command that can be issued from PRU firmware to the MII_G_RT block:

(AM654x TRM, SPRUID7E, 6.5.11.2.5 table 6-2730, page 5029):
RX_RESET: RX_RESET is used to reset the receive FIFO and clear all contents.
This is required to recover from a RX FIFO overrun, if software
does not want to undrain. The typical use case is assertion after
RX_EOF.

RX overrun seems to be one of the errors reported in the MSMC RAM when we get stuck in the "RX stall" state. Is this command being issued by the PRU FW when an ethernet interface is started? If not, perhaps it should be?

0 Nick Saulnier over 1 year ago in reply to Daniel Hornaes

TI__Guru* 76435 points

Hello Daniel,

I was unable to look at your response today, apologies. I will read everything and provide a response tomorrow.

Regards,

Nick

0 Nick Saulnier over 1 year ago in reply to Nick Saulnier

TI__Guru* 76435 points

Hello Daniel,

Understood on timeframe, and thanks for attaching debug observations. That is a fair question about the firmware behavior in the case of an RX overrun. I can check with the firmware team on current behavior.

In the associated customer debug, we are seeing some improvements (but potentially not complete fixes) by changing the order in which the PRU cores are initialized by the Linux driver. Would it be helpful for you if I tested the latest version of that Linux firmware patch to try modifying the VxWorks driver?

Regards,

Nick

0 Pekka Varis over 1 year ago in reply to Nick Saulnier

TI__Mastermind 23360 points

Daniel,

I understand this issue is still present even with the latest firmware?

Pekka

0 Daniel Hornaes over 1 year ago in reply to Pekka Varis

Prodigy 30 points

Hi Pekka,

The issue in its original form has been resolved, as far as we can see:
The problem was that we didn't properly perform soft resets of the various PRU cores when restarting the Ethernet interfaces after booting our main operating system kernel. Once we discovered that the Linux reference implementation did a pure write (as opposed to a read-modify-write) of the ICSSG_PRU_CONTROL.PRU_SOFT_RST_N fields, things finally clicked into place!
(I'd argue that using a self-clearing, active-low reset register is more suitable for pure HW implementations than in a SW register interface, but it is what it is).

Anyway, implementing the soft-reset as mentioned above, we haven't seen any issues starting up the PRU-based ethernet interfaces in a "simple" environment.
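
For future readers, the difference between the two write styles can be illustrated with a toy model. This is only a sketch under assumptions: FakePruControl is a hypothetical class, not the real register, and the way the read-modify-write misses the reset here is one plausible failure mode rather than a description of our original driver code. The model assumes the reset bit is active-low, self-clearing, and always reads back as 1.

```python
SOFT_RST_N = 1 << 0   # active-low soft reset; assumed to always read back as 1
ENABLE = 1 << 1       # core run enable


class FakePruControl:
    """Toy model of a control register with a self-clearing, active-low reset bit."""

    def __init__(self):
        self.enable = True   # core left running by an earlier boot stage
        self.resets = 0      # number of reset pulses seen by the "hardware"

    def read(self):
        # The reset bit never reads as 0, because it clears itself.
        return SOFT_RST_N | (ENABLE if self.enable else 0)

    def write(self, value):
        if not (value & SOFT_RST_N):
            self.resets += 1       # writing 0 pulses the reset
            self.enable = False    # in this model a reset also halts the core
        else:
            self.enable = bool(value & ENABLE)


def stop_with_rmw(reg):
    """Read-modify-write: clears ENABLE, but writes SOFT_RST_N back as 1."""
    reg.write(reg.read() & ~ENABLE)


def reset_with_pure_write(reg):
    """Pure write of 0: SOFT_RST_N is written as 0, so the core is reset."""
    reg.write(0)
```

In this model the read-modify-write halts the core without ever pulsing the reset, so any state left over from the previous boot stage survives, while the pure write resets it.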

Unfortunately, once we released the patched driver to the testing departments, we received several reports of very similar issues occurring.
The issue itself looks very similar: one or more PRU-based Ethernet interfaces appear to be stalled in the RX direction, just after system start-up.
We've since upgraded to the latest official PRU FW (02.02.12.01) released recently, and the issue is present there as well.

Interestingly, we're only able to reproduce this issue if there is a certain amount of traffic going on the network while the interfaces are started.
By blasting the PRU ethernet interfaces with a steady stream of broadcast packets during system start-up, we're able to reproduce the issue quite consistently.
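
A traffic generator for this kind of test can be very small. The following is a rough sketch of one way to do it from a host computer, not the tool we actually used: the function names, payload size, send interval and the choice of UDP are all assumptions.

```python
import socket
import time


def make_broadcast_socket():
    """Create a UDP socket that is allowed to send to a broadcast address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    return sock


def blast_broadcast(sock, address, count, payload_size=64, interval_s=0.001):
    """Send `count` UDP datagrams to `address`; returns the number sent."""
    payload = bytes(payload_size)   # zero-filled dummy payload
    sent = 0
    for _ in range(count):
        sock.sendto(payload, address)
        sent += 1
        if interval_s:
            time.sleep(interval_s)  # crude rate limiting
    return sent


# Example (not executed here): start this on the host, then boot the board.
# The address is the subnet broadcast address; port 9 is an arbitrary choice.
#   blast_broadcast(make_broadcast_socket(), ("172.21.255.255", 9), count=100000)
```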

Running ifconfig after a stall has occurred doesn't reveal much, the output generally looks like this:
emac0   Link type:Ethernet  HWaddr 02:41:4c:42:10:17
        capabilities: VLAN_MTU
        inet 172.21.140.123  mask 255.255.0.0  broadcast 172.21.255.255
        UP RUNNING SIMPLEX BROADCAST MULTICAST
        MTU:1500  metric:1  VR:0  ifindex:2
        RX packets:0 mcast:0 errors:0 dropped:0
        TX packets:31 mcast:0 errors:0
        collisions:0 unsupported proto:0
        RX bytes:0  TX bytes:1302

The HW counters (memory mapped to address 0x0b033000+0x54c for ICSSG0.PRU0) are more interesting though; after a stall they generally look like this:
emac0 (stats at 0x47e0054c):

good              0
broadcast         0
multicast        28
CRC error      6247
MII error         0
odd nibble        0
max size error    0
min size error 6219
overrun error     1

In general, we see a high number of CRC errors and an overrun error count of 1.
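
When checking many boards, it can help to test for this signature automatically. Below is a minimal sketch that parses a counter dump in the text layout shown above and flags the pattern we see after a stall. The function names are hypothetical, and the rule itself (no good frames, at least one CRC error, at least one overrun) is only our working heuristic, not a documented criterion.

```python
def parse_counters(dump):
    """Parse 'name value' lines into a dict; the value is the last token."""
    counters = {}
    for line in dump.splitlines():
        parts = line.split()
        # Skip headers and anything whose last token is not a plain number.
        if len(parts) >= 2 and parts[-1].isdigit():
            counters[" ".join(parts[:-1])] = int(parts[-1])
    return counters


def looks_like_rx_stall(counters):
    """Heuristic: nothing received cleanly, CRC errors present, overrun seen."""
    return (
        counters.get("good", 0) == 0
        and counters.get("CRC error", 0) > 0
        and counters.get("overrun error", 0) >= 1
    )


# The counter values from the dump above.
DUMP = """\
emac0 (stats at 0x47e0054c):
good              0
broadcast         0
multicast        28
CRC error      6247
MII error         0
odd nibble        0
max size error    0
min size error 6219
overrun error     1
"""
```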

Our current interpretation is that the network traffic is being incorrectly delivered/incorrectly received by the PRU's MAC. If a certain amount is received, we get into a non-recoverable RX overrun state.

Two important observations:

This error seems to occur (somewhat randomly) in both the bootloader, and the subsequent full OS kernel. This is unlike the case discussed previously, which never seemed to occur in the bootloader. This makes sense, as the issue was related to a soft reset of the PRUs - the bootloader operated in a "cold start" environment.
The issue can occur with only a single PRU-emac slice connected. This probably rules out various arbitration/muxing issues.

We have two main theories at the moment:

There is an issue with the PRU FW, causing one or more PRU Ethernet interfaces to lock up.
Something is wrong with our PHY<->MAC setup, causing corrupted frames to be delivered to the MAC during the "early" phase, eventually causing an RX overflow lock-up in the PRU.

Any ideas or suggestions are greatly appreciated - we are able to inspect most registers in the am65x via JTAG (PRUs included).

/Daniel

0 Nick Saulnier over 1 year ago in reply to Daniel Hornaes

TI__Guru* 76435 points

I am going to mark this thread resolved on my end since we are tracking offline. Feel free to respond to this thread to get it to pop back up in my inbox as needed.

Regards,

Nick
