PROCESSOR-SDK-AM64X: HSR network stopped after several hour power up/down

Milan Stevanovic

Intellectual 706 points

Part Number: PROCESSOR-SDK-AM64X

Hi support,

I'm using the TI SDK release 09.02.01.10:

https://www.ti.com/tool/download/PROCESSOR-SDK-LINUX-AM64X/09.02.01.10

Test applied:

- Rebooting the Board_#1 by toggling the power supply on and off.

- 100Mbps line rate

- Each power off/on cycle takes ~1 min (10 s reboot, <1 min to set up HSR after reboot)

Trouble-shooting:

Board_#2 is an AM64x EVM board, on the left in the picture. Board_#3 is a project-specific board, on the right in the picture.

- ProfiShark traffic capture: Board_#2 didn’t forward incoming packets from port B to port A, and Board_#3 didn’t forward incoming packets from port A to port B.

- The HSR supervision A packets of Board_#1 and Board_#3 were transmitted into #2.B but did not appear on the #2.A output.
- The HSR supervision B packets of Board_#1 and Board_#2 were transmitted into #3.A but did not appear on the #3.B output.

- ethtool -S: the “rx_min_size_error_frames” counters of the PRU ports on Board_#2 and Board_#3 increased abnormally.
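To quantify how much rx_min_size_error_frames grows per power cycle, the `ethtool -S` output can be snapshotted before and after a cycle and compared. The following is only a minimal Python sketch: the interface name `eth1`, the helper names and the sample counter values are assumptions for illustration, not taken from the actual setup.

```python
import re
import subprocess


def parse_ethtool_stats(text):
    """Parse the text printed by `ethtool -S <iface>` into a {name: int} dict."""
    stats = {}
    for line in text.splitlines():
        # Counter lines look like "     name: 123"; the "NIC statistics:" header has no value.
        m = re.match(r"\s*(\S[^:]*):\s*(-?\d+)\s*$", line)
        if m:
            stats[m.group(1).strip()] = int(m.group(2))
    return stats


def read_stats(iface):
    """Run `ethtool -S` on the given interface (requires ethtool on the target)."""
    out = subprocess.run(["ethtool", "-S", iface],
                         capture_output=True, text=True, check=True).stdout
    return parse_ethtool_stats(out)


def counter_delta(before, after, name="rx_min_size_error_frames"):
    """Return how much one counter grew between two snapshots."""
    return after.get(name, 0) - before.get(name, 0)


# Hypothetical sample snapshots (values invented for the example).
SAMPLE_BEFORE = "NIC statistics:\n     rx_good_frames: 100\n     rx_min_size_error_frames: 3\n"
SAMPLE_AFTER = "NIC statistics:\n     rx_good_frames: 250\n     rx_min_size_error_frames: 551\n"

before = parse_ethtool_stats(SAMPLE_BEFORE)
after = parse_ethtool_stats(SAMPLE_AFTER)
print("rx_min_size_error_frames grew by", counter_delta(before, after))
```

On a board, `read_stats("eth1")` would replace the sample strings, with `eth1` swapped for whichever interface is an HSR slave port.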

IMPORTANT: The issue is reproduced only when Board_#1 is powered off and on, with a period of 10 s between off and on. If we instead reboot from Linux with the reboot command, the issue is not reproduced after 24 h.

2 months ago

0 Daolin Qiu 2 months ago

TI__Expert 7575 points

Hi Milan,

After discussing this issue internally, we have some additional questions:

1. After the issue is observed and the packets are no longer able to be sent from Board_#1, are they able to retest this setup? In other words, are they able to recover from this state to retest the setup?

2. The custom Board_#1 and Board_#3, are these using the same ethernet PHYs as the AM64x EVM?

3. Is the software running on the custom Board_#1 and Board_#3 the same as what was used on the AM64x EVMs in your team's test setup (i.e. the same PRU/HSR firmware and Linux image)?

4. What is the specific reason why the custom Board_#1 and Board_#3 cannot be replaced with an AM64x EVM? I recall you shared that the automated power off/on cannot be replicated on an AM64x EVM, but I was wondering if the AM64x EVM power supply could be powered by a programmable power outlet?

5. The results of the profishark data shared via email, do you mean that the Board_#1 was able to send good frames out to both Board_#2 and Board_#3 but Board_#2 and Board_#3 ended up receiving corrupted packets (i.e. rx_min_size_error_frames increasing)?

-Daolin

0 Milan Stevanovic 2 months ago in reply to Daolin Qiu

Intellectual 706 points

Hi Daolin,

1. Only rebooting Boards #2 and #3 recovers the system.

2. No. The PHYs are 100M and are not the same as on the EVM. Note, however, that Board #2 is an EVM and it also gets stuck. The PHY on Board #1 could still have an impact, even though the packets appear to be sent correctly, as we saw with the ProfiShark. The Ethernet PHY is the DP83822.

3. Yes, the PRU/HSR firmware is the same. Linux is 6.1 with some kernel modifications, but none for networking or the PRU. They are using the same software as we are.

4. Boards #2 and #3 can be EVM boards. In their setup, Board #1 is connected to an Omicron device: the Omicron sends some voltage signals, which causes a GOOSE packet to be sent to the Omicron, and the Omicron then reboots Board #1. Using a programmable power outlet risks damaging the EVM.

5. I checked the ProfiShark file and did not see any bad packets. This seemed very strange to me, so I asked them about the ProfiShark configuration. They believe it is correct and that there are no bad packets. They will repeat the tests to be sure.

I sent some logs offline by email because of their size.

0 Daolin Qiu 2 months ago in reply to Milan Stevanovic

TI__Expert 7575 points

Hi Milan,

Thanks for sharing these details.

We will need some more time to review the logs you shared. I hope to get back with an update tomorrow/early next week.

-Daolin

0 Daolin Qiu 2 months ago in reply to Daolin Qiu

TI__Expert 7575 points

Update:

I checked internally: while powering the EVM off/on at the 12V DC input will cause damage to the EVM, powering it off/on at the AC input to the adapter/transformer is safer. Based on this, a programmable power outlet switching the AC supply could be an option for recreating the issue on an EVM setup. Would it be possible to try to recreate it on your end with the EVM setup?

I looked through the logs shared offline via email, and here are my thoughts:

1. The memory dump shows a nonzero value for rx_min_size_error_frames on BOARD4 after issue shows --> aligning with what you previously mentioned

2. The memory dump also shows a nonzero value for rx_min_size_error_frames on BOARD2 before the issue shows on BOARD4 --> rx_min_size_error_frames errors occur on BOARD2 as well

3. PCAP between BOARD1 and BOARD2 shows at least one instance of an unusually short Ethernet packet --> implying that BOARD1 may be sending out corrupted packets

Internally, it appears the team is looking at the BOARD2/BOARD4 receiving end rather than the BOARD1 transmit end to try some firmware patches. We will try to reproduce this issue to test these patches and, in parallel, share the patched firmware with you next week.
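For anyone wanting to repeat the check for short packets in a capture, here is a rough Python sketch that scans a capture file and lists frames below a length threshold. It assumes the capture was exported in classic pcap format (pcapng is not handled), and the default threshold of 60 bytes assumes the capture excludes the 4-byte FCS (64-byte minimum frame minus FCS). Both are assumptions about the capture, and the function name is hypothetical.

```python
import struct

# First four bytes of a classic pcap file mapped to the struct byte-order prefix.
PCAP_MAGICS = {
    b"\xd4\xc3\xb2\xa1": "<",  # little-endian, microsecond timestamps
    b"\xa1\xb2\xc3\xd4": ">",  # big-endian, microsecond timestamps
    b"\x4d\x3c\xb2\xa1": "<",  # little-endian, nanosecond timestamps
    b"\xa1\xb2\x3c\x4d": ">",  # big-endian, nanosecond timestamps
}


def find_short_frames(data, min_len=60):
    """Return [(frame_number, original_length)] for frames shorter than min_len.

    frame_number is 1-based, matching the numbering shown by Wireshark.
    """
    endian = PCAP_MAGICS.get(data[:4])
    if endian is None:
        raise ValueError("not a classic pcap file (pcapng is not handled)")
    offset = 24          # size of the pcap global header
    frame_number = 1
    short = []
    while offset + 16 <= len(data):
        # Record header: ts_sec, ts_frac, captured length, original length.
        _sec, _frac, incl_len, orig_len = struct.unpack_from(endian + "IIII", data, offset)
        offset += 16 + incl_len
        if orig_len < min_len:
            short.append((frame_number, orig_len))
        frame_number += 1
    return short


# Usage with a capture file (the file name is a placeholder):
# with open("capture.pcap", "rb") as f:
#     print(find_short_frames(f.read()))
```

If the capture tool is configured to keep the FCS, `min_len=64` would be the appropriate threshold instead.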

BOARD 4 ICSSG0 AFTER ISSUE
0x30033B30:0x00000224 00000040 00000000 000007D0

RX MAX Size Frame (PRU1) - PRU_ICSSG0_PR1_MII_RT_PR1_MII_RT_G_CFG_REGS_G Physical Address
0x30033B30=0x000007D0=2000

RX MAX Size Error Frame Count (PRU1) - PRU_ICSSG0_PR1_MII_RT_PR1_MII_RT_G_CFG_REGS_G Physical Address
0x30033B34=0x00000000=0

RX Min Size Frame (PRU1) - PRU_ICSSG0_PR1_MII_RT_PR1_MII_RT_G_CFG_REGS_G Physical Address
0x30033B38=0x00000040=64

RX Min Size Error Frame Count (PRU1) - PRU_ICSSG0_PR1_MII_RT_PR1_MII_RT_G_CFG_REGS_G Physical Address
0x30033B3C=0x00000224=548 <------------------------------

0x30033B40:0x00000000000000000000022400000000

===================================================================================================================

BOARD 2 (EVM) ICSSG1 BEFORE ISSUE
0x300B3B30:0x00000097 00000040 00000000 000007D0

RX MAX Size Frame (PRU1) - PRU_ICSSG1_PR1_MII_RT_PR1_MII_RT_G_CFG_REGS_G Physical Address
0x300B3B30=0x000007D0=2000

RX MAX Size Error Frame Count (PRU1) - PRU_ICSSG1_PR1_MII_RT_PR1_MII_RT_G_CFG_REGS_G Physical Address
0x300B3B34=0x00000000=0

RX Min Size Frame (PRU1) - PRU_ICSSG1_PR1_MII_RT_PR1_MII_RT_G_CFG_REGS_G Physical Address
0x300B3B38=0x00000040=64

RX Min Size Error Frame Count (PRU1) - PRU_ICSSG1_PR1_MII_RT_PR1_MII_RT_G_CFG_REGS_G Physical Address
0x300B3B3C=0x00000097=151 <--------------------------------

0x300B3B40:0x0000000000000248000005BF00000000
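The per-register values above can be derived mechanically from the raw dump lines. Below is a minimal Python sketch of that decoding. It assumes each dump line prints its 32-bit words with the highest address first, which is what the annotated values above imply. The function name and the short register labels are made up for the example.

```python
def decode_dump_line(line):
    """Map a dump line 'BASE:W3 W2 W1 W0' to {address: value}.

    Assumption: the words are printed highest address first, so the last
    word on the line sits at BASE and the first word at BASE + 0xC.
    """
    addr_text, words_text = line.split(":", 1)
    base = int(addr_text, 16)
    words = [int(w, 16) for w in words_text.split()]
    n = len(words)
    return {base + 4 * (n - 1 - i): w for i, w in enumerate(words)}


# Offsets from the line's base address, labelled as in the post above.
LABELS = {
    0x0: "RX max size frame",
    0x4: "RX max size error frame count",
    0x8: "RX min size frame",
    0xC: "RX min size error frame count",
}

line = "0x30033B30:0x00000224 00000040 00000000 000007D0"
base = int(line.split(":", 1)[0], 16)
for addr, value in sorted(decode_dump_line(line).items()):
    print(f"0x{addr:08X} {LABELS[addr - base]} = 0x{value:08X} = {value}")
```

Applied to the two annotated lines, this gives 548 for BOARD 4 and 151 for BOARD 2 at offset 0xC, and 0x7D0 = 2000 for the max frame size on both boards.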

-Daolin

0 Milan Stevanovic 2 months ago in reply to Daolin Qiu

Intellectual 706 points

Hi Daolin,

It took us more time to get results from the setup with two AM64x EVMs and one project board.

We also have PRU and ProfiShark logs for this setup. I will send them to you by email, as I am having trouble uploading them here.

+1 Daolin Qiu 2 months ago in reply to Milan Stevanovic

TI__Expert 7575 points

Hi Milan,

Thanks for sharing this additional data.

hsr_filter_short_frame_fixes.zip

As discussed in the call today, attached here is the firmware version that contains fixes for the two main blocking issues:

1. Power down/up resulting in increased rx_min_size_error_frames described in this thread

2. MC filter-classifier issue in https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1397896/processor-sdk-am64x-processor-sdk-am64x-icssg-multicast-packets-filter-classification

Note that for the second issue (MC filter-classifier), there needs to be a Linux patch as well. I'm still waiting for this specific patch from the developer and will update once I receive it (pushing for tomorrow). In the meantime, please start testing the power down/up case using this patch as that seems like a longer duration test.

For documentation purposes, below are more details on specifically what the PRU firmware binaries fix (from firmware team):

- Fixes corner cases of firmware lockup for received packet sizes of 20B-21B or 32B-35B at 100M/10M speed
- Error handling for the RX_OVF case
- Classification based on filter-classifier configuration (+ special bit in FDB entry)

Details about the Linux patch for MC filter-classifier issue:

It appears that "ICSSG_FDB_ENTRY_BLOCK" needs to be removed from the icssg_prueth.c driver (https://lore.kernel.org/netdev/20240904100506.3665892-6-danishanwar@ti.com/)

-Daolin

0 Milan Stevanovic 2 months ago in reply to Daolin Qiu

Intellectual 706 points

Hi Daolin,

Power up/down was tested on 7 boards for over 72 hours without any issues. This is great news. The MC filter-classifier issue will be tested this week, and we expect to have good news for you.

I will share new information as soon as we have it.

Thanks a lot for your support

0 Mukul Bhatnagar 2 months ago in reply to Milan Stevanovic

TI__Guru* 81605 points

Hello Milan

Daolin and Pekka are out of office for the next few days. We do appreciate you sending this note; this is great news indeed.

We look forward to hearing from you on the other tests.

Regards

Mukul

0 Daolin Qiu 2 months ago in reply to Mukul Bhatnagar

TI__Expert 7575 points

Hi Milan,

I understand from your last message here that the power up/down problem did not show up after 72 hours. Please let us know if you encounter issues about the MC problem in the other thread.

-Daolin
