PROCESSOR-SDK-AM57X: ti-pruss/am57xx-pru0/1-prusw-fw.elf firmware not loaded into PRUs when creating the bridge between redundant ports

Naiara Moreira

Part Number: PROCESSOR-SDK-AM57X
Other Parts Discussed in Thread: TLK105L, AM5718

We are integrating RSTP into our design, which is based on the AM5701 processor and was recently ported from SDK 06.02.00.81 to SDK 08.02.01.00. We used the example script provided by TI in the related application note (Industrial_Protocols_RSTP) and found that there is no hardware offloading on the PRUs. Instead of the prueth driver changing to switch mode and loading the ti-pruss/am57xx-pru0/1-prusw-fw.elf firmware into the PRUs, the default ti-pruss/am57xx-pru0/1-prueth-fw.elf firmware is loaded:

root@am57xx-evm:~# cat /sys/class/remoteproc/remoteproc*/firmware
dra7-ipu1-fw.xem4
dra7-ipu2-fw.xem4
dra7-dsp1-fw.xe66
am57xx-pru1_0-fw
am57xx-pru1_1-fw
ti-pruss/am57xx-pru0-prueth-fw.elf
ti-pruss/am57xx-pru1-prueth-fw.elf
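
The check above can be automated with a small helper. This is only a minimal sketch: the function name pru_fw_mode is made up for illustration, and it assumes the firmware names follow the ti-pruss/am57xx-pru0/1-prueth-fw.elf and ti-pruss/am57xx-pru0/1-prusw-fw.elf patterns quoted in this post.

```shell
#!/bin/sh
# Classify the PRU Ethernet firmware from a list of remoteproc firmware
# names read from stdin (one per line).
# Prints one of: switch, emac, mixed, none.
# pru_fw_mode is a hypothetical helper name, not part of the SDK.
pru_fw_mode() {
    sw=0
    emac=0
    # "|| [ -n "$name" ]" also handles a last line without a trailing newline
    while IFS= read -r name || [ -n "$name" ]; do
        case "$name" in
            ti-pruss/am57xx-pru[01]-prusw-fw.elf)  sw=$((sw + 1)) ;;
            ti-pruss/am57xx-pru[01]-prueth-fw.elf) emac=$((emac + 1)) ;;
        esac
    done
    if [ "$sw" -gt 0 ] && [ "$emac" -eq 0 ]; then
        echo switch
    elif [ "$emac" -gt 0 ] && [ "$sw" -eq 0 ]; then
        echo emac
    elif [ "$sw" -gt 0 ] && [ "$emac" -gt 0 ]; then
        echo mixed
    else
        echo none
    fi
}
```

On a target it could be fed with `cat /sys/class/remoteproc/remoteproc*/firmware | pru_fw_mode`; for the listing above it would print emac.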

According to another thread (https://e2e.ti.com/support/processors-group/processors/f/processors-forum/945142/am5708-rstp-performance-issue-in-am57xx-idk), created by a colleague three years ago, in SDK 6.02 the change to switch mode was triggered by enabling the l2-fwd-offload feature on both interfaces through ethtool. We found a similar feature in SDK 8.02, but it is shown as [fixed], so we cannot change it.

root@predixedge:~# ethtool -k eth1 | grep l2
l2-fwd-offload: on [fixed]
root@predixedge:~# ethtool -k eth2 | grep l2
l2-fwd-offload: on [fixed]
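
To check this on several interfaces without reading the output by eye, the feature line can be parsed. The sketch below assumes the `ethtool -k` line format shown above; the helper name feature_state is hypothetical.

```shell
#!/bin/sh
# Parse "ethtool -k" output on stdin and report the state of one feature.
# Prints "<on|off> <fixed|changeable>" and returns non-zero if the feature
# is not listed. feature_state is a hypothetical helper name.
feature_state() {
    awk -v f="$1:" '
        $1 == f {
            print $2, ($3 == "[fixed]" ? "fixed" : "changeable")
            found = 1
        }
        END { exit !found }
    '
}
```

For example, `ethtool -k eth1 | feature_state l2-fwd-offload` would print "on fixed" for the output above, which is why `ethtool -K` cannot toggle it.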

Looking at the code (prueth_core.c), "eth_type" is assigned in prueth_change_to_switch_mode() (prueth->eth_type = PRUSS_ETHTYPE_SWITCH;), but that function is only called from prueth_port_offload_fwd_mark_update(), which in turn is called by prueth_ndev_port_link() and prueth_ndev_port_unlink(). prueth_ndev_event() should call prueth_ndev_port_link() on a NETDEV_CHANGEUPPER event, but the call to prueth_sw_port_dev_check() returns "false" unless NETIF_F_HW_HSR_TAG_RM is set. It does not make much sense to require that HSR flag for the switch firmware, but when that flag is enabled we have occasionally seen the prusw firmware loaded.

We did some investigation on the prueth_core.c file history in ti-linux-kernel repo (https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/drivers/net/ethernet/ti?h=ti-rt-linux-5.10.y) and found the following:

This commit added support for RSTP in April 2021: https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/commit/drivers/net/ethernet/ti/prueth_core.c?h=ti-rt-linux-5.10.y&id=c4b5529b070dfdedf85b4b1cf6dc0167af29f87d --> This commit introduced NETIF_F_HW_L2FW_DOFFLOAD and used it in prueth_sw_port_dev_check(), which caused the SW firmware to be loaded.
Then in July 2021 this commit added HSR/PRP: https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/commit/drivers/net/ethernet/ti/prueth_core.c?h=ti-rt-linux-5.10.y&id=d2e8eb5a46ec7216223407be3d3840648f554be0 --> This commit replaced the NETIF_F_HW_L2FW_DOFFLOAD by NETIF_F_HW_HSR_FWD and NETIF_F_HW_HSR_TAG_RM:
```
@@ -2267,12 +2548,22 @@ static int prueth_netdev_init(struct prueth *prueth,
         ndev->features |= NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_TC;
 
-       if (of_device_is_compatible(prueth->dev->of_node, "ti,am57-prueth"))
-               ndev->features |= NETIF_F_HW_L2FW_DOFFLOAD;
+       if (prueth->support_lre)
+               ndev->hw_features |= (NETIF_F_HW_HSR_FWD | NETIF_F_HW_HSR_TAG_RM);
+
+       ndev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER;
```
In a follow-up commit from the same period (https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/commit/drivers/net/ethernet/ti/prueth_core.c?h=ti-rt-linux-5.10.y&id=25fc691922bd7da8941b044df9b9a10c62188db8), prueth_sw_port_dev_check() was modified to check the NETIF_F_HW_HSR_TAG_RM flag instead of NETIF_F_HW_L2FW_DOFFLOAD.
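
The effect of that change on a plain bridge port can be illustrated with a toy model. This is only a sketch of the gating logic: the bit values and function names below are placeholders chosen for the example, not the kernel's real NETIF_F_* values or driver symbols.

```shell
#!/bin/sh
# Toy model of the prueth_sw_port_dev_check() gating before and after the
# HSR/PRP commits. Bit values are placeholders, NOT the kernel's NETIF_F_* bits.
F_HSR_TAG_RM=1
F_L2FW_DOFFLOAD=2

# April 2021 behaviour: a port qualifies if L2 forwarding offload is set.
sw_port_check_before() {
    [ $(( $1 & F_L2FW_DOFFLOAD )) -ne 0 ]
}

# July 2021 behaviour: a port qualifies only if the HSR tag removal flag is set.
sw_port_check_after() {
    [ $(( $1 & F_HSR_TAG_RM )) -ne 0 ]
}

# A port that only advertises L2 forwarding offload (the RSTP case):
features=$F_L2FW_DOFFLOAD
if sw_port_check_before "$features"; then echo "before: switch mode"; else echo "before: stays emac"; fi
if sw_port_check_after "$features"; then echo "after: switch mode"; else echo "after: stays emac"; fi
```

In this model the same port is treated as a switch port by the older check and rejected by the newer one, which matches the behaviour we observe with the bridge.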

We believe that RSTP offloading was broken unintentionally when HSR/PRP was introduced. Based on these commits, we made the modifications below to the prueth_core.c driver to check whether the prusw firmware would then load, and it did. We are now running tests to verify that the forwarding is actually done in the PRU firmware rather than in the Linux driver.

diff --git a/drivers/net/ethernet/ti/prueth_core.c b/drivers/net/ethernet/ti/prueth_core.c
--- a/drivers/net/ethernet/ti/prueth_core.c
+++ b/drivers/net/ethernet/ti/prueth_core.c
@@ -2966,6 +2966,9 @@ static int prueth_netdev_init(struct prueth *prueth,
 
 	ndev->features |= NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_TC;
 
+	if (of_device_is_compatible(prueth->dev->of_node, "ti,am57-prueth"))
+		ndev->features |= NETIF_F_HW_L2FW_DOFFLOAD;
+
 	if (prueth->support_lre)
 		ndev->hw_features |= (NETIF_F_HW_HSR_FWD | NETIF_F_HW_HSR_TAG_RM);
 
@@ -3021,6 +3024,9 @@ bool prueth_sw_port_dev_check(const struct net_device *ndev)
 	if (ndev->features & NETIF_F_HW_HSR_TAG_RM)
 		return true;
 
+	if (ndev->features & NETIF_F_HW_L2FW_DOFFLOAD)
+		return true;
+
 	return false;
 }

With this change, the same script configures the redundant interfaces, and the dmesg log shows that the ti-pruss/am57xx-pru0/1-prusw-fw.elf firmware is indeed loaded into both PRUs.

root@predixedge:~# cat /sys/class/remoteproc/remoteproc*/firmware
dra7-ipu1-fw.xem4
dra7-ipu2-fw.xem4
dra7-dsp1-fw.xe66
am57xx-pru1_0-fw
am57xx-pru1_1-fw
ti-pruss/am57xx-pru0-prusw-fw.elf
ti-pruss/am57xx-pru1-prusw-fw.elf

Could you please confirm that this was unintentionally removed from the prueth driver?

over 1 year ago

0 Naiara Moreira over 1 year ago

Prodigy 40 points

I have copied here the code of the script we used to bring up interfaces and add RSTP support:

#!/bin/bash
# rstp.sh

set -x #echo on

ETH1=${1:-eth1}
ETH2=${2:-eth2}
BR0=${4:-br0}

mstpd
sleep 1

ip link set dev $ETH1 up
sleep 1

ip link set dev $ETH2 up
sleep 1

brctl addbr $BR0
sleep 1

# manually add STP MC address to enable it in PRU MC filter table
ip maddr add 01:80:c2:00:00:00 dev $BR0

ip link set dev $BR0 address $(cat /sys/class/net/$ETH1/address)

brctl addif $BR0 $ETH1
sleep 1

brctl addif $BR0 $ETH2
sleep 1

brctl stp $BR0  on
sleep 1

mstpctl setforcevers $BR0 rstp
sleep 1

ip link set dev $BR0 up
sleep 1

mstpctl showbridge

0 Josue Zamitiz-Ayala over 1 year ago in reply to Naiara Moreira

TI__Mastermind 33806 points

Hello Naiara,

Please allow me some time to check with the development team and I will get back to you, hopefully by next week if the team's bandwidth permits.

-Josue

0 Josue Zamitiz-Ayala over 1 year ago in reply to Josue Zamitiz-Ayala

TI__Mastermind 33806 points

Hello Naiara,

Update:

Our team has indeed confirmed that l2-fwd-offload support changes are missing in SDK_08.02. Whether this was intentional or not is not clear yet and there will be more testing done next week.

-Josue

0 Josue Zamitiz-Ayala over 1 year ago in reply to Josue Zamitiz-Ayala

TI__Mastermind 33806 points

Naiara,

Your changes were confirmed to be effective and compared to similar changes done by other customers. At this point the removal seems unintentional.

The team is focused on different priorities at the moment, and the people who worked on this software before SDK 8.2 was released are no longer on the team, so finding out the full story will take a little longer due to bandwidth.

If this helps resolve your question please click Resolved.

Best,

Josue

0 Naiara Moreira over 1 year ago in reply to Josue Zamitiz-Ayala

Prodigy 40 points

Hi Josue,

Thanks for your support. It is reassuring to know that the removal was unintentional and that we are not the only customers affected. However, we continued testing the prusw firmware for L2 switching offload and ran into issues at the lower levels (below the MAC layer). We configured the Linux interfaces as a bridge with RSTP, as shown above, and forced several topology changes to analyze the behavior of the protocol.

Specifically, when we disconnect and reconnect one cable at the RJ45 connector several times (usually after 2 or 3 disconnections), communication is interrupted and does not recover by itself unless we delete the bridge and restart the prueth driver manually (forcing the PRUs to stop and the prusw firmware to be downloaded again). However, the mstpctl tool shows that everything is fine at the higher levels, since bridge information and port status are updated as expected by RSTP.

We used tcpdump to capture Ethernet frames at the MAC layer, and we see STP BPDUs being both transmitted and received, but when sniffing at the output port we cannot see the transmitted frames. While debugging the prueth driver with the failure intentionally triggered, we saw that the PRU's read pointer in the transmit queue stops and never resumes. The queue quickly runs out of free space and never recovers.
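
The effect of a stalled read pointer can be illustrated with a generic ring-buffer model. This is a sketch only and not the driver's exact arithmetic: the queue size and the helper names free_blocks and packets_until_full are made up for the example.

```shell
#!/bin/sh
# Generic ring-buffer model (NOT the prueth driver's exact arithmetic).
# QUEUE_BLOCKS is a made-up size; one block is left unused so that a full
# ring can be told apart from an empty one.
QUEUE_BLOCKS=8

# free_blocks WRITE_IDX READ_IDX -> number of blocks the producer may still fill
free_blocks() {
    echo $(( ($2 - $1 - 1 + QUEUE_BLOCKS) % QUEUE_BLOCKS ))
}

# Enqueue one-block packets while the consumer's read index stays at 0.
# Prints how many packets fit before the queue reports "no space".
packets_until_full() {
    wr=0
    rd=0
    sent=0
    while [ "$(free_blocks "$wr" "$rd")" -ge 1 ]; do
        wr=$(( (wr + 1) % QUEUE_BLOCKS ))
        sent=$((sent + 1))
    done
    echo "$sent"
}

echo "packets accepted with a stalled reader: $(packets_until_full)"
```

Once the consumer stops advancing, only a bounded number of packets is accepted and every later enqueue fails, which is consistent with the driver returning -ENOBUFS for all subsequent frames.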

We tried to reproduce the issue on the AM571x IDK without success. Our design uses a different PHY device (DP83822 instead of TLK105L), so we think it could be an incompatibility between the PHY device and the prusw firmware. Assuming the problem came from PHY auto-negotiation when the link comes up, we debugged the link down / link up detection path from the PHY device to the prueth driver (in our design this is done by polling the PHY registers), and from the prueth driver to the PRU through shared memory, and we saw that the link status was correctly updated in that memory. See below how we printed this information out.

diff --git a/ti-processor-sdk-linux-rt-am57xx-evm-08_02_01_00/board-support/linux-rt-5.10.100+gitAUTOINC+204ec708dc-g204ec708dc/drivers/net/ethernet/ti/prueth_core.c b/ti-processor-sdk-linux-rt-am57xx-evm-08_02_01_00/board-support/linux-rt-5.10.100+gitAUTOINC+204ec708dc-g204ec708dc/drivers/net/ethernet/ti/prueth_core.c
index 2ac8bb7e0..d9a7ac19a 100644
--- a/ti-processor-sdk-linux-rt-am57xx-evm-08_02_01_00/board-support/linux-rt-5.10.100+gitAUTOINC+204ec708dc-g204ec708dc/drivers/net/ethernet/ti/prueth_core.c
+++ b/ti-processor-sdk-linux-rt-am57xx-evm-08_02_01_00/board-support/linux-rt-5.10.100+gitAUTOINC+204ec708dc-g204ec708dc/drivers/net/ethernet/ti/prueth_core.c
@@ -569,6 +569,9 @@ static void emac_update_phystatus(struct prueth_emac *emac)
 	if (emac->link)
 		port_status |= PORT_LINK_MASK;
 	writeb(port_status, prueth->mem[region].va + PORT_STATUS_OFFSET);
+
+	port_status = readb(prueth->mem[region].va + PORT_STATUS_OFFSET);
+	netdev_err(emac->ndev, "NMCDC20240425 -> emac_update_phystatus port_status = 0x%08x (%d)\n", port_status, __LINE__);
 }
 
 /* called back by PHY layer if there is change in link state of hw port*/
@@ -594,11 +597,13 @@ static void emac_adjust_link(struct net_device *ndev)
 		if (!emac->link) {
 			new_state = true;
 			emac->link = 1;
+			netdev_err(ndev, "NMCDC20240425 -> emac_adjust_link %d\n", __LINE__);
 		}
 	} else if (emac->link) {
 		new_state = true;
 		emac->link = 0;
 		/* defaults for no link */
+		netdev_err(ndev, "NMCDC20240425 -> emac_adjust_link %d\n", __LINE__);
 
 		/* f/w only support 10 or 100 */
 		emac->speed = SPEED_100;
@@ -915,8 +920,11 @@ static int prueth_tx_enqueue(struct prueth_emac *emac, struct sk_buff *skb,
 	}
 	pkt_block_size = DIV_ROUND_UP(pktlen, ICSS_BLOCK_SIZE);
 	if (pkt_block_size > free_blocks) /* out of queue space */
+	{
+		netdev_err(emac->ndev, "NMC20240418 -> prueth_tx_enqueue txport=%d queue_id=%d write_block=%d read_block=%d free_blocks=%d line=%d\n", txport, queue_id, write_block, read_block, free_blocks, __LINE__);
 		return -ENOBUFS;
-
+	}
+	
 	/* calculate end BD address post write */
 	update_block = write_block + pkt_block_size;
 
@@ -2966,6 +2974,9 @@ static int prueth_netdev_init(struct prueth *prueth,
 
 	ndev->features |= NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_TC;
 
+    if (of_device_is_compatible(prueth->dev->of_node, "ti,am57-prueth"))
+        ndev->features |= NETIF_F_HW_L2FW_DOFFLOAD;
+
 	if (prueth->support_lre)
 		ndev->hw_features |= (NETIF_F_HW_HSR_FWD | NETIF_F_HW_HSR_TAG_RM);
 
@@ -3021,6 +3032,9 @@ bool prueth_sw_port_dev_check(const struct net_device *ndev)
 	if (ndev->features & NETIF_F_HW_HSR_TAG_RM)
 		return true;
 
+    if (ndev->features & NETIF_F_HW_L2FW_DOFFLOAD)
+        return true;
+
 	return false;
 }

Have you seen this issue before? Are we perhaps missing some configuration in the DTB? What about the ti,pruss-gp-mux-sel and ti,pru-interrupt-map properties of the prueth node in the DTB? The behavior seems to improve somewhat (we reached almost 20 consecutive successful cable removals) if we set the ti,pruss-gp-mux-sel property as follows:

ti,pruss-gp-mux-sel = <0>, /* GP, default */
                      <0>; /* GP, default */

How could we proceed with debugging to resolve the issue?

0 Praveen Rao over 1 year ago in reply to Naiara Moreira

TI__Mastermind 48483 points

Hello, The assigned engineer, Josue, is out of the office until May 2nd. Please expect a 4-5 business days delay in response.

Thanks.

0 Naiara Moreira over 1 year ago in reply to Naiara Moreira

Prodigy 40 points

We found that after forcing a soft reset of the PRUs from the prueth driver, the connection is re-established after several seconds. To do that, we clear the SOFT_RST_N bit of the PRU_CONTROL register of both PRUs each time a link-down state is detected from the PHY, as shown below:

diff --git a/ti-processor-sdk-linux-rt-am57xx-evm-08_02_01_00/board-support/linux-rt-5.10.100+gitAUTOINC+204ec708dc-g204ec708dc/drivers/net/ethernet/ti/prueth_core.c b/ti-processor-sdk-linux-rt-am57xx-evm-08_02_01_00/board-support/linux-rt-5.10.100+gitAUTOINC+204ec708dc-g204ec708dc/drivers/net/ethernet/ti/prueth_core.c
index 2ac8bb7e0..836c36905 100644
--- a/ti-processor-sdk-linux-rt-am57xx-evm-08_02_01_00/board-support/linux-rt-5.10.100+gitAUTOINC+204ec708dc-g204ec708dc/drivers/net/ethernet/ti/prueth_core.c
+++ b/ti-processor-sdk-linux-rt-am57xx-evm-08_02_01_00/board-support/linux-rt-5.10.100+gitAUTOINC+204ec708dc-g204ec708dc/drivers/net/ethernet/ti/prueth_core.c
@@ -569,8 +569,13 @@ static void emac_update_phystatus(struct prueth_emac *emac)
 	if (emac->link)
 		port_status |= PORT_LINK_MASK;
 	writeb(port_status, prueth->mem[region].va + PORT_STATUS_OFFSET);
+
+	port_status = readb(prueth->mem[region].va + PORT_STATUS_OFFSET);
+	netdev_err(emac->ndev, "NMCDC20240425 -> emac_update_phystatus port_status = 0x%08x (%d)\n", port_status, __LINE__);
 }
 
+extern void pru_reboot(struct rproc *rproc);
+
 /* called back by PHY layer if there is change in link state of hw port*/
 static void emac_adjust_link(struct net_device *ndev)
 {
@@ -578,6 +583,8 @@ static void emac_adjust_link(struct net_device *ndev)
 	struct phy_device *phydev = emac->phydev;
 	unsigned long flags;
 	bool new_state = false;
+	struct prueth *prueth = emac->prueth;
+	struct prueth_emac *other_emac;
 
 	spin_lock_irqsave(&emac->lock, flags);
 
@@ -594,11 +601,13 @@ static void emac_adjust_link(struct net_device *ndev)
 		if (!emac->link) {
 			new_state = true;
 			emac->link = 1;
+			netdev_err(ndev, "NMCDC20240425 -> emac_adjust_link %d\n", __LINE__);
 		}
 	} else if (emac->link) {
 		new_state = true;
 		emac->link = 0;
 		/* defaults for no link */
+		netdev_err(ndev, "NMCDC20240425 -> emac_adjust_link %d\n", __LINE__);
 
 		/* f/w only support 10 or 100 */
 		emac->speed = SPEED_100;
@@ -621,9 +630,17 @@ static void emac_adjust_link(struct net_device *ndev)
 			netif_wake_queue(ndev);
 	} else {
 		/* link OFF */
+		pru_reboot(emac->pru);
+		other_emac = prueth->emac[other_port_id(emac->port_id) - 1];
+		pru_reboot(other_emac->pru);
 		netif_carrier_off(ndev);
-		if (!netif_queue_stopped(ndev))
+		if (!netif_queue_stopped(ndev)) {
+			/* SPV: Probably this is a better place to reboot? So it's only done once? */
+			//pru_reboot(emac->pru);
+			//other_emac = prueth->emac[other_port_id(emac->port_id) - 1];
+			//pru_reboot(other_emac->pru);
 			netif_stop_queue(ndev);
+		}
 	}
 
 	spin_unlock_irqrestore(&emac->lock, flags);
@@ -915,8 +932,11 @@ static int prueth_tx_enqueue(struct prueth_emac *emac, struct sk_buff *skb,
 	}
 	pkt_block_size = DIV_ROUND_UP(pktlen, ICSS_BLOCK_SIZE);
 	if (pkt_block_size > free_blocks) /* out of queue space */
+	{
+		//netdev_err(emac->ndev, "NMC20240429 -> prueth_tx_enqueue txport=%d queue_id=%d write_block=%d read_block=%d free_blocks=%d line=%d\n", txport, queue_id, write_block, read_block, free_blocks, __LINE__);
 		return -ENOBUFS;
-
+	}
+	
 	/* calculate end BD address post write */
 	update_block = write_block + pkt_block_size;
 
@@ -2073,6 +2093,7 @@ static int emac_ndo_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 static void emac_ndo_tx_timeout(struct net_device *ndev, unsigned int txqueue)
 {
 	struct prueth_emac *emac = netdev_priv(ndev);
+	//struct prueth_emac *other_emac;
 
 	if (netif_msg_tx_err(emac))
 		netdev_err(ndev, "xmit timeout");
@@ -2080,7 +2101,10 @@ static void emac_ndo_tx_timeout(struct net_device *ndev, unsigned int txqueue)
 	ndev->stats.tx_errors++;
 
 	/* TODO: can we recover or need to reboot firmware? */
-
+	/* SPV: Perhaps this does the trick? */
+	//pru_reboot(emac->pru);
+	//other_emac = prueth->emac[other_port_id(emac->port_id) - 1];
+	//pru_reboot(other_emac->pru);
 	netif_wake_queue(ndev);
 }
 
@@ -2966,6 +2990,9 @@ static int prueth_netdev_init(struct prueth *prueth,
 
 	ndev->features |= NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_TC;
 
+    if (of_device_is_compatible(prueth->dev->of_node, "ti,am57-prueth"))
+		ndev->features |= NETIF_F_HW_L2FW_DOFFLOAD;
+
 	if (prueth->support_lre)
 		ndev->hw_features |= (NETIF_F_HW_HSR_FWD | NETIF_F_HW_HSR_TAG_RM);
 
@@ -3021,6 +3048,9 @@ bool prueth_sw_port_dev_check(const struct net_device *ndev)
 	if (ndev->features & NETIF_F_HW_HSR_TAG_RM)
 		return true;
 
+    if (ndev->features & NETIF_F_HW_L2FW_DOFFLOAD)
+        return true;
+
 	return false;
 }

diff --git a/ti-processor-sdk-linux-rt-am57xx-evm-08_02_01_00/board-support/linux-rt-5.10.100+gitAUTOINC+204ec708dc-g204ec708dc/drivers/remoteproc/pru_rproc.c b/ti-processor-sdk-linux-rt-am57xx-evm-08_02_01_00/board-support/linux-rt-5.10.100+gitAUTOINC+204ec708dc-g204ec708dc/drivers/remoteproc/pru_rproc.c
index 677be00d6..d20d61aa7 100644
--- a/ti-processor-sdk-linux-rt-am57xx-evm-08_02_01_00/board-support/linux-rt-5.10.100+gitAUTOINC+204ec708dc-g204ec708dc/drivers/remoteproc/pru_rproc.c
+++ b/ti-processor-sdk-linux-rt-am57xx-evm-08_02_01_00/board-support/linux-rt-5.10.100+gitAUTOINC+204ec708dc-g204ec708dc/drivers/remoteproc/pru_rproc.c
@@ -484,6 +484,42 @@ static int pru_rproc_debug_ss_get(void *data, u64 *val)
 DEFINE_DEBUGFS_ATTRIBUTE(pru_rproc_debug_ss_fops, pru_rproc_debug_ss_get,
 			 pru_rproc_debug_ss_set, "%llu\n");
 
+static int pru_rproc_reboot_set(void *data, u64 val)
+{
+	struct rproc *rproc = data;
+	struct pru_rproc *pru = rproc->priv;
+	u32 reg_val;
+
+	reg_val = pru_control_read_reg(pru, PRU_CTRL_CTRL);
+	printk("SPV:%s:%d: PRU_CTRL_CTRL read %x\n", __func__, __LINE__, reg_val);
+	/*if (val && !pru->dbg_single_step)
+		pru->dbg_continuous = reg_val;*/
+
+	reg_val &= ~(1);
+	pru_control_write_reg(pru, PRU_CTRL_CTRL, reg_val);
+	printk("SPV:%s:%d: PRU_CTRL_CTRL wrote %x\n", __func__, __LINE__, reg_val);
+
+	return 0;
+}
+
+static int pru_rproc_reboot_get(void *data, u64 *val)
+{
+	struct rproc *rproc = data;
+	struct pru_rproc *pru = rproc->priv;
+
+	*val = pru->dbg_single_step;
+
+	return 0;
+}
+DEFINE_DEBUGFS_ATTRIBUTE(pru_rproc_reboot_fops, pru_rproc_reboot_get,
+			 pru_rproc_reboot_set, "%llu\n");
+			 
+void pru_reboot(struct rproc *rproc)
+{
+	pru_rproc_reboot_set(rproc, 1);
+}
+EXPORT_SYMBOL_GPL(pru_reboot);
+
 /*
  * Create PRU-specific debugfs entries
  *
@@ -499,6 +535,8 @@ static void pru_rproc_create_debug_entries(struct rproc *rproc)
 			    rproc, &regs_fops);
 	debugfs_create_file("single_step", 0600, rproc->dbg_dir,
 			    rproc, &pru_rproc_debug_ss_fops);
+	debugfs_create_file("reboot", 0600, rproc->dbg_dir,
+			    rproc, &pru_rproc_reboot_fops);
 }
 
 static void pru_dispose_irq_mapping(struct pru_rproc *pru)

This workaround confirms that the problem is in the prusw firmware used by RSTP for L2 forwarding offload.
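
The patch also creates a debugfs file named "reboot" for each PRU remoteproc, so the same reset can be triggered by hand while testing. The helper below is a minimal sketch: the function name reset_prusw_cores is hypothetical, and the default paths assume debugfs is mounted at /sys/kernel/debug with one directory per remoteproc instance.

```shell
#!/bin/sh
# Trigger the debugfs "reboot" hook added by the patch above for every
# remoteproc whose firmware name contains "prusw-fw".
# $1: remoteproc sysfs class directory, $2: remoteproc debugfs directory.
# Both defaults are assumptions about the target's mount layout.
reset_prusw_cores() {
    class_dir=${1:-/sys/class/remoteproc}
    debug_dir=${2:-/sys/kernel/debug/remoteproc}
    for rp in "$class_dir"/remoteproc*; do
        [ -r "$rp/firmware" ] || continue
        case "$(cat "$rp/firmware")" in
            *prusw-fw*)
                name=${rp##*/}
                # The patched handler ignores the written value and always
                # clears bit 0 of PRU_CTRL_CTRL.
                echo 1 > "$debug_dir/$name/reboot" && echo "reset $name"
                ;;
        esac
    done
}
```

Calling `reset_prusw_cores` without arguments on the target would reset only the cores that are running the switch firmware.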

Moreover, each time we disconnect/reconnect the cable after applying the patch above, we see a kernel warning in both IDK and our design:

[ 3492.920837] ------------[ cut here ]------------
[ 3492.920837] WARNING: CPU: 0 PID: 9016 at net/switchdev/switchdev.c:277 switchdev_port_obj_add_now+0xcc/0x114
[ 3492.920867] eth2: Commit of object (id=2) failed.
[ 3492.920867] Modules linked in: xt_nat xt_policy xt_tcpudp xt_conntrack xt_MASQUERADE nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_tables x_tables br_netfilter bridge stp llc usb_f_ecm dwc3 roles irq_pruss_intc prueth icss_iep pru_rproc extcon_usb_gpio omap_aes_driver libaes pruss ti_vpe ti_sc ti_csc ti_vpdma dwc3_omap c_can_platform c_can can_dev omap_hdq omap_des omap_crypto wire libdes omap_sham crypto_engine omap_remoteproc sch_fq_codel kgoose(O) kbdriver(O) g_ether usb_f_rndis u_ether libcomposite udc_core usb_common cmemk(O)
[ 3492.921081] CPU: 0 PID: 9016 Comm: kworker/0:2 Tainted: G        W  O      5.10.100-rt62 #62
[ 3492.921081] Hardware name: Generic DRA72X (Flattened Device Tree)
[ 3492.921112] Workqueue: events switchdev_deferred_process_work
[ 3492.921112] [<c020d29c>] (unwind_backtrace) from [<c0209d40>] (show_stack+0x10/0x14)
[ 3492.921142] [<c0209d40>] (show_stack) from [<c0a82608>] (__warn+0xd4/0xec)
[ 3492.921173] [<c0a82608>] (__warn) from [<c0a826b8>] (warn_slowpath_fmt+0x98/0xc8)
[ 3492.921173] [<c0a826b8>] (warn_slowpath_fmt) from [<c0a80314>] (switchdev_port_obj_add_now+0xcc/0x114)
[ 3492.921234] [<c0a80314>] (switchdev_port_obj_add_now) from [<c0a80370>] (switchdev_port_obj_add_deferred+0x14/0x60)
[ 3492.921264] [<c0a80370>] (switchdev_port_obj_add_deferred) from [<c0a7ff84>] (switchdev_deferred_process+0x78/0x118)
[ 3492.921264] [<c0a7ff84>] (switchdev_deferred_process) from [<c0a80030>] (switchdev_deferred_process_work+0xc/0x14)
[ 3492.921295] [<c0a80030>] (switchdev_deferred_process_work) from [<c023f03c>] (process_one_work+0x1c4/0x44c)
[ 3492.921295] [<c023f03c>] (process_one_work) from [<c023f31c>] (worker_thread+0x58/0x5cc)
[ 3492.921325] [<c023f31c>] (worker_thread) from [<c0244c84>] (kthread+0x168/0x1ac)
[ 3492.921325] [<c0244c84>] (kthread) from [<c0200140>] (ret_from_fork+0x14/0x34)
[ 3492.921356] Exception stack(0xc3ec9fb0 to 0xc3ec9ff8)
[ 3492.921356] 9fa0:                                     00000000 00000000 00000000 00000000
[ 3492.921356] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 3492.921386] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 3492.921386] ---[ end trace b8c197f3d8f9b85d ]---

0 Praveen Rao over 1 year ago in reply to Naiara Moreira

TI__Mastermind 48483 points

Thanks for sharing more details.

But as noted, the assigned engineer, Josue, is out of the office until May 2nd. Please expect a 3-4 business day delay in response.

Thanks.

0 Josue Zamitiz-Ayala over 1 year ago in reply to Praveen Rao

TI__Mastermind 33806 points

Hello Naiara,

I will again need some time to interface with our PRU FW team.

Can you please confirm my understanding: the PRU "hang" does not occur on the TI EVM but it does on your custom device which has a different PHY device and both devices produce the same warning when connecting/disconnecting the RJ45 cable after applying the work-around (WA) patches above.

Is it true that if you are able to reproduce the same functionality with your device and DP83822 PHY as the AM571x IDK and TLK105L PHY, then there is no need for the WA mentioned above?

-Josue

0 Naiara Moreira over 1 year ago in reply to Josue Zamitiz-Ayala

Prodigy 40 points

Sorry for my unclear explanation. See my responses below each quoted question.

>>> Can you please confirm my understanding: the PRU "hang" does not occur on the TI EVM but it does on your custom device which has a different PHY device and both devices produce the same warning when connecting/disconnecting the RJ45 cable after applying the work-around (WA) patches above.

Without the WA, the PRU hangs only on our custom device (AM5701 processor) with a DP83822 PHY device. The IDK (AM5718 processor) with the TLK105L PHY does not fail when disconnecting/reconnecting the cable, but it does produce the warning in net/switchdev/switchdev.c.

When applying the WA on our custom device, the PRU recovers the transmission and RSTP protocol works correctly, and we also see the warning of switchdev.

It seems that the switchdev warning occurs each time we disconnect/reconnect the cable while the protocol is running correctly, but not when the PRU hangs after several consecutive link status changes without the WA applied.

>>> Is it true that if you are able to reproduce the same functionality with your device and DP83822 PHY as the AM571x IDK and TLK105L PHY, then there is no need for the WA mentioned above?

The WA mentioned above is needed to give our custom device a mechanism to recover the PRU from a hang caused by several link status changes. The IDK does not need this WA to work correctly, since it does not hang in the same way.

0 Josue Zamitiz-Ayala over 1 year ago in reply to Naiara Moreira

TI__Mastermind 33806 points

Thank you for the confirmation, Naiara,

I am still awaiting inputs from our Dev team.

-Josue

0 Josue Zamitiz-Ayala over 1 year ago in reply to Josue Zamitiz-Ayala

TI__Mastermind 33806 points

Hello Naiara,

I think that since the issue is not reproducible on the TI IDK, we should focus on the one difference which is your PHY configuration.

1. Are you able to share the ethernet portion of schematics and device tree changes made to accommodate the different PHY?

-Josue

0 Sebastian Pastor over 1 year ago in reply to Josue Zamitiz-Ayala

Prodigy 210 points

Hello Josue,

We were able to reproduce the issue in AM571x IDK hardware, and also to fix the problem on our hardware.

The difference is that the PHY interrupts are declared in the IDK DTB and not declared in ours. Thus, using the IRQs fixes (or masks) the problem, and using polling for the PHY exposes it. Along the way we found some posts by our former colleague Paritosh Dixit regarding SDK6 and the DP83822:

- https://e2e.ti.com/support/processors-group/processors/f/processors-forum/888389/am5748-am5748-issue-with-phy-dp83822-driver-in-ti-sdk-6-02
- https://e2e.ti.com/support/interface-group/interface/f/interface-forum/893985/dp83822i-interrupt-questions

To reproduce the issue you'll need 3 devices that support RSTP (we used 3 IDKs, but the first time we reproduced it we used the IDK, an RSTP-enabled switch, and another RSTP device).

1. Load the vanilla SDK8 image tisdk-default-image-am57xx-evm.wic onto the 3 IDKs.
2. On one of them, modify the kernel (to fix the loading of the prusw firmware) and the DTB (to remove the interrupt declarations from the PHYs). I am attaching the modified zImage, modules and DTBs: prueth_core_mod_kernel_modules_dtbs.tar.gz
3. Load the start_rstp.sh scripts onto the three devices; they are the same except for the IP and MAC addresses: start_rstp_scripts.zip
4. Execute the scripts on the three IDKs and connect them as shown in the image (caution: on my side I've noticed that eth2/eth3 are not always the same physical ports, i.e. the ones on the corner came up as eth2/eth3 instead of the ones shown in the image).
5. Ensure that IDK-2 has loaded the prusw firmware with cat /sys/class/remoteproc/remoteproc*/firmware | grep prusw-fw
6. Log into IDK-1 and ping IDK-3: ping 192.168.1.13. The ping then travels through the IDK-2 switch.
7. Disconnect the cable from IDK-2 to IDK-3 and either move it to the other port or reconnect it to the same one.
8. Observe that at some point the pings from IDK-1 to IDK-3 stop.
9. Log into IDK-2.
10. Check that it also cannot ping IDK-3.
11. Check that tcpdump -i eth2 stp shows repeated messages (it can be eth3 instead, depending on the port used to connect to IDK-3, and it might take a few minutes to enter the loop).
12. The situation does not recover until you restart IDK-2.
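
To catch the moment the pings stop during the cable changes, the ping from IDK-1 can be wrapped in a small loop. This is a minimal sketch: the helper name count_until_failure and the INTERVAL variable are made up for illustration, and the address is the one used in the steps above.

```shell
#!/bin/sh
# Run a command repeatedly until it fails or MAX runs have succeeded, and
# print the number of consecutive successful runs.
# $1: maximum number of runs; remaining arguments: the command to run.
# INTERVAL (seconds between runs, default 1) is a hypothetical knob.
count_until_failure() {
    max=$1
    shift
    n=0
    while [ "$n" -lt "$max" ] && "$@" >/dev/null 2>&1; do
        n=$((n + 1))
        sleep "${INTERVAL:-1}"
    done
    echo "$n"
}

# Example on IDK-1 while the cable on IDK-2 is being moved:
#   count_until_failure 100000 ping -c 1 -W 1 192.168.1.13
```

The printed count gives a rough measure of how long the link survived, which makes runs with and without the PHY interrupts easier to compare.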

I suspect that the problem is not in the PHYs but in the prusw firmware, and that the PHYs behaving differently exposes this issue. I have not tried the setup with the vanilla image on all of them and only the DTB modified, but in our early tests (when the prusw firmware was not being loaded), the issue was not reproducible on our hardware and the DTBs had the IRQs disabled.

0 Josue Zamitiz-Ayala over 1 year ago in reply to Sebastian Pastor

TI__Mastermind 33806 points

Hello Sebastian,

Thank you for the detailed write up.

Does this mean that with the interrupts declared in your DTB, this is not an issue in your boards anymore?
I understand that it's not a fix per se, just a mask or workaround for the underlying problem, which is suspected to lie in the PRU firmware.
I want to make sure I understand your statement:

Sebastian Pastor said:
We were able to reproduce the issue in AM571x IDK hardware, and also to fix the problem on our hardware.

-Josue

0 Sebastian Pastor over 1 year ago in reply to Josue Zamitiz-Ayala

Prodigy 210 points

Hello Josue,

You got it right: if the PHY has no interrupts declared, the problem is reproducible on the IDK and on our HW. If interrupts are declared, the issue is not reproducible (at least not this way!).
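
To confirm which mode a given board is actually running in, one option is to look for the PHY's device name in the interrupt table. The sketch below assumes that the PHY interrupt, when declared, is listed in /proc/interrupts under the PHY's MDIO device name; the helper name phy_link_mode is hypothetical.

```shell
#!/bin/sh
# Report whether a PHY appears to run in interrupt or polling mode by
# searching for its device name in an interrupts table.
# $1: PHY device name on the MDIO bus
# $2: interrupts file (defaults to /proc/interrupts)
# Assumption: a PHY with a declared IRQ shows up there under its device name.
phy_link_mode() {
    phy_name=$1
    irq_file=${2:-/proc/interrupts}
    if grep -qF -- "$phy_name" "$irq_file"; then
        echo interrupt
    else
        echo polling
    fi
}

# Example on a target (the interface name is an assumption):
#   phy_link_mode "$(basename "$(readlink -f /sys/class/net/eth2/phydev)")"
```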

We are investigating why the interrupts were removed from the DTBs, because at some point they were there. It was probably because of the continuous interrupts when there is no link, which Paritosh mentions in his threads.

Don't hesitate to contact us if you need more details on the setup or have another test that we can run to debug this.

I forgot to add to my post the patches needed to reproduce in IDK-2, here it is:

diff --git a/arch/arm/boot/dts/am571x-idk.dts b/arch/arm/boot/dts/am571x-idk.dts
index ce40c5d8d..1e4a170b8 100644
--- a/arch/arm/boot/dts/am571x-idk.dts
+++ b/arch/arm/boot/dts/am571x-idk.dts
@@ -269,14 +269,14 @@ &pruss1_mdio {

        pruss1_eth0_phy: ethernet-phy@0 {
                reg = <0>;
-               interrupt-parent = <&gpio3>;
-               interrupts = <28 IRQ_TYPE_EDGE_FALLING>;
+//             interrupt-parent = <&gpio3>;
+//             interrupts = <28 IRQ_TYPE_EDGE_FALLING>;
        };

        pruss1_eth1_phy: ethernet-phy@1 {
                reg = <1>;
-               interrupt-parent = <&gpio3>;
-               interrupts = <29 IRQ_TYPE_EDGE_FALLING>;
+//             interrupt-parent = <&gpio3>;
+//             interrupts = <29 IRQ_TYPE_EDGE_FALLING>;
        };
 };

diff --git a/arch/arm/boot/dts/am57xx-idk-common.dtsi b/arch/arm/boot/dts/am57xx-idk-common.dtsi
index a064db43b..5d87ff9b4 100644
--- a/arch/arm/boot/dts/am57xx-idk-common.dtsi
+++ b/arch/arm/boot/dts/am57xx-idk-common.dtsi
@@ -656,14 +656,14 @@ &pruss2_mdio {
        status = "okay";
        pruss2_eth0_phy: ethernet-phy@0 {
                reg = <0>;
-               interrupt-parent = <&gpio3>;
-               interrupts = <30 IRQ_TYPE_EDGE_FALLING>;
+//             interrupt-parent = <&gpio3>;
+//             interrupts = <30 IRQ_TYPE_EDGE_FALLING>;
        };

        pruss2_eth1_phy: ethernet-phy@1 {
                reg = <1>;
-               interrupt-parent = <&gpio3>;
-               interrupts = <31 IRQ_TYPE_EDGE_FALLING>;
+//             interrupt-parent = <&gpio3>;
+//             interrupts = <31 IRQ_TYPE_EDGE_FALLING>;
        };
 };

diff --git a/drivers/net/ethernet/ti/prueth_core.c b/drivers/net/ethernet/ti/prueth_core.c
index c4cb25422..27276cf59 100644
--- a/drivers/net/ethernet/ti/prueth_core.c
+++ b/drivers/net/ethernet/ti/prueth_core.c
@@ -2837,6 +2837,11 @@ static int prueth_netdev_init(struct prueth *prueth,

        ndev->features |= NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_TC;

+       if (of_device_is_compatible(prueth->dev->of_node, "ti,am57-prueth")) {
+               printk("SPV:%s:%d: adding NETIF_F_HW_L2FW_DOFFLOAD\n", __func__, __LINE__);
+               ndev->features |= NETIF_F_HW_L2FW_DOFFLOAD;
+       }
+
        if (prueth->support_lre)
                ndev->hw_features |= (NETIF_F_HW_HSR_FWD | NETIF_F_HW_HSR_TAG_RM);

@@ -2892,6 +2897,9 @@ bool prueth_sw_port_dev_check(const struct net_device *ndev)
        if (ndev->features & NETIF_F_HW_HSR_TAG_RM)
                return true;

+       if (ndev->features & NETIF_F_HW_L2FW_DOFFLOAD)
+               return true;
+
        return false;
 }

0 Josue Zamitiz-Ayala over 1 year ago in reply to Sebastian Pastor

TI__Mastermind 33806 points

Thank you Sebastian,

I will bring this issue to our PRU firmware team for comments.

-Josue

0 Sebastian Pastor over 1 year ago in reply to Josue Zamitiz-Ayala

Prodigy 210 points

Hello Josue,

It's been a while, when could we expect an update on this issue?

0 Josue Zamitiz-Ayala over 1 year ago in reply to Sebastian Pastor

TI__Mastermind 33806 points

Hello Sebastian,

Our firmware team is not convinced, and the action lies with me to run the testing mentioned above. Unfortunately, I have been out of office for a week or so due to local storms and I am behind. I will hopefully try this this week and update you.

-Josue

0 Sebastian Pastor over 1 year ago in reply to Josue Zamitiz-Ayala

Prodigy 210 points

I understand. We went ahead with defining the interrupts, until we hit it again (or hopefully, not!).

It's a tedious test that is hard to set up (plus, you need 3 boards available); sorry that they dropped it back on you. If there is something we can test here to help you debug this or convince the FW team that it's reproducible on the IDKs, please let us know.

0 Josue Zamitiz-Ayala over 1 year ago in reply to Sebastian Pastor

TI__Mastermind 33806 points

Thank you Sebastian, will do.

-Josue

0 Sebastian Pastor over 1 year ago in reply to Josue Zamitiz-Ayala

Prodigy 210 points

Hello Josue,

Do you have any update on this, or an estimate of when there will be one?

Best regards,

0 Josue Zamitiz-Ayala over 1 year ago in reply to Sebastian Pastor

TI__Mastermind 33806 points

Hi Sebastian,

Unfortunately, I do not have an estimate for completion. I've had other priorities, so I have not had a chance to work this into my testing queue. Given that you have a workaround, this has taken lower priority.

-Josue

0 Josue Zamitiz-Ayala over 1 year ago in reply to Josue Zamitiz-Ayala

TI__Mastermind 33806 points

Hello Sebastian,

Could you provide a dump of the ICSS DMEM?

FW team has the following idea:

"There is a port status offset which gets disabled during link down and gets enabled during link up. May be this is not done correctly in polling mode."

-Josue

0 Sebastian Pastor over 1 year ago in reply to Josue Zamitiz-Ayala

Prodigy 210 points

Hello Josue,

Sorry for the delay, I was out for a week and got swamped when I came back. How can I get the dump you are asking for? I do have an XDS debug probe, but I don't have a lot of experience with these devices.

Also, in which situation should I get it? With the polling or with the interrupt mode?

Best regards,

0 Josue Zamitiz-Ayala over 1 year ago in reply to Sebastian Pastor

TI__Mastermind 33806 points

Hello Sebastian,

Yes, this is usually done using CCS Memory browser, see the following thread: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1250542/pru-icss-industrial-sw-hsr-prp-transmit-not-working-for-our-custom-board-with-am3356-soc/4772285#4772285

There is also plenty of CCS documentation, for example:

The DMEM will be both RAM0 and RAM1 for the ICSS core:

Best,

Josue
