DP83867E: Stability Problems Connecting to Ethernet 100Base-T Switch (RGMII)

Part Number: DP83867E

Hello TI forums,

I am experiencing link instability when connecting to a customer-provided Ethernet switch that supports only up to 100 Mbps (100Base-T). We are using a GEM controller in the Xilinx ZynqMP SoC to implement the RGMII interface. We have no issues connecting to other Ethernet switches rated for 1GbE/10GbE; however, when we connect to this switch at 100Base-T, the link drops cyclically and never holds steadily. We have used ethtool to try forcing the link to 10/100, full/half duplex, with auto-negotiation on and off. Our end goal is 100Base-T full duplex with auto-negotiation on. 10Base-T full duplex with auto-negotiation on works reliably, but it does not give us the throughput we need.

I have tried connecting the Ethernet lane to a PC NIC forced to operate at 100Base-T; there I experience no issues and get the expected throughput (11 MB/s). We have verified the external PHY strapping and dumped the PHY registers in kernel space to verify the PHY configuration. The MDIO interface is working. We have briefly inspected the traffic in Wireshark and see no bad frames or CRC errors, so I am ruling out a signal integrity issue. Clocks have been verified to be at the right frequencies.
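
For reference, the 11 MB/s figure is close to what a 100 Mbps link can carry as TCP payload once framing overhead is subtracted. A rough sketch of that arithmetic, assuming a 1500-byte MTU and IPv4/TCP headers without options (none of these values are stated in the post):

```python
# Per-frame byte counts for standard Ethernet carrying TCP over IPv4.
PREAMBLE_SFD = 8      # preamble plus start-of-frame delimiter
ETH_HEADER = 14       # destination MAC, source MAC, EtherType
FCS = 4               # frame check sequence
INTERFRAME_GAP = 12   # minimum idle time between frames, in byte times
MTU = 1500            # assumed; not stated in the post
IP_HEADER = 20        # assumes no IP options
TCP_HEADER = 20       # assumes no TCP options


def tcp_goodput_mbytes_per_s(line_rate_bps):
    """Upper bound on TCP payload throughput for full-size frames."""
    wire_bytes = PREAMBLE_SFD + ETH_HEADER + MTU + FCS + INTERFRAME_GAP
    payload_bytes = MTU - IP_HEADER - TCP_HEADER
    return line_rate_bps / 8 * payload_bytes / wire_bytes / 1e6


if __name__ == "__main__":
    print(f"{tcp_goodput_mbytes_per_s(100e6):.2f} MB/s")
```

Under these assumptions the ceiling is a little under 12 MB/s, so a measured 11 MB/s is consistent with a healthy 100 Mbps link.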

I have noticed that when the link drops, the Receiver Error Counter Register (RECR), address 0x0015, increments. How is this register updated? From reviewing the Linux PHY driver, this register is not used. How do these errors bubble up to the higher OSI layers?

I would like to mention I can reliably ping across the switch but with 3-5% packet loss.

Any insight on connecting to 100Base-T Ethernet switches would be greatly appreciated.

Thank you.

  • Hi! I have a few questions and things I'd like your help looking at in order to debug this!

    Can you tell me what the link partner settings are? And, have you looked at any of the fast link drop modes? 

    If possible, could you also share what model 100BASE T switch you are using?

    To answer your question about reg 0x15: it updates when data is not being clocked properly, and it is a read-only register (no driver interference). My thought is that the increment is happening as an effect of the link drop.

    Thanks,

    Lysny

  • Hello Lysny,

    Thank you for the response.

    Can you tell me what the link partner settings are?

    If by link partner you mean the endpoint device, we are using a host PC with a PCIe network adapter set to 100Base-T with auto-negotiation on.

    I have not looked into the fast link drop modes; I will review that register and follow up with you on any improvements.

    Could you also share what model 100BASE T switch you are using?

    Here is a link to the Ethernet switch datasheet: https://datasheetspdf.com/pdf-file/514249/ZarlinkSemiconductor/MVTX2601/1

    Thank you.

  • Hi! Please let me know if you see any new information with the fast link drop modes. I am looking into the switch and link partner. From first glance, I don't see anything that concerns me, but I'll review them with my team to make sure I'm not overlooking anything. 

    Thanks,
    Lysny

  • Hello,

    I tried enabling each of the fast link drop options individually and then all 5 together, and saw no improvement. Is there a method or register I can use to filter out bad packets on the RX path at the PHY level, or even to set an error threshold?

    Thank you

  • Hi,

    For a threshold, there are interrupts you can set for FLD, such as Rx error.

    Also, I suggest trying 50%, 75%, and 100% utilization for the interpacket gap and looking for similar or different failure rates. If the switch is being pushed to an edge limit, this might ease the communication stress that could be causing the link drops.

    Let me know your findings!

    Lysny

  • Hello,

    Unfortunately, setting the RX errors bit (bit 3) in the FLD_CFG register did not bring any improvement.

    What method do you have in mind to reduce the utilization for interpacket gap? Do you mean increasing the interpacket gap duration? Would this be done in PHY-level registers or within the Ethernet MAC?

    I have provided an example ftrace:

    ! 195.530 us | phy_state_machine();
    ! 195.570 us | phy_state_machine();
    ! 195.610 us | phy_state_machine();
    | phy_state_machine() {
    # 4814.921 us | phy_link_change();
    # 5013.381 us | }
    + 12.832 us | gem_update_stats();
    | phy_state_machine() {
    + 32.833 us | phy_aneg_done();
    # 5828.782 us | phy_link_change();
    # 6061.766 us | }
    + 11.401 us | gem_update_stats();
    ! 196.050 us | phy_state_machine();
    ! 195.409 us | phy_state_machine();
    ! 195.650 us | phy_state_machine();
    ! 195.809 us | phy_state_machine();
    ! 195.640 us | phy_state_machine();

    With the unit left idle, the PHY will randomly detect and report a link change. Is there any insight into why? Is there a state diagram I can follow for auto-negotiation and PHY link changes?

    Thank you.
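
To compare link-change events across longer captures, the trace can be reduced to just the phy_link_change() durations. A minimal sketch, assuming the function_graph layout shown in the excerpt (an optional marker character, a duration in microseconds, a pipe, then the function name); the helper name and regular expression are illustrative only:

```python
import re

# Optional ftrace duration marker, optional "<number> us", then "| body".
LINE_RE = re.compile(
    r"^\s*[!#+*@$]?\s*(?:(?P<us>\d+(?:\.\d+)?)\s+us\s*)?\|\s*(?P<body>.*?)\s*$"
)

# Excerpt copied from the trace in the post.
SAMPLE = """\
! 195.610 us | phy_state_machine();
| phy_state_machine() {
# 4814.921 us | phy_link_change();
# 5013.381 us | }
+ 12.832 us | gem_update_stats();
| phy_state_machine() {
+ 32.833 us | phy_aneg_done();
# 5828.782 us | phy_link_change();
# 6061.766 us | }
"""


def link_change_durations_us(trace):
    """Return the duration (in us) of every phy_link_change() call."""
    durations = []
    for line in trace.splitlines():
        match = LINE_RE.match(line)
        if match is None or match.group("us") is None:
            continue  # function-entry lines carry no duration
        if match.group("body").startswith("phy_link_change("):
            durations.append(float(match.group("us")))
    return durations


if __name__ == "__main__":
    print(link_change_durations_us(SAMPLE))
```

Applied to the excerpt, this yields two entries, one per phy_link_change() call. Pairing each entry with a wall-clock timestamp would need a trace captured with absolute times, which the excerpt does not include.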

  • Hi,

    Sorry, the interpacket gap wording was misleading. I suggest sending less data from the MAC side. So for 50% utilization, send only 50 Mbps of data.

    Can we also look into the switch's PMD compliance with IEEE 802.3 100BASE-T? This could help us figure out if there are any issues with the transmitter that are causing the link drops/packet loss.

    Just to confirm, the link change detection is happening with the switch connected? My personal experience with ethernet switches is not very expansive, but these are some of my thoughts.

    Is it possible that the memory buffer in the switch is overflowing and causing some sort of switch reset, which could be causing the link drop and potential packet loss? I also see a section about Aging in the switch datasheet, did you look into maybe lengthening the aging time in reg 0x400 and 0x401 in the switch? I was also looking to see if there were any error registers in the switch, but didn't see anything really.

    Also, Clause 28 of the IEEE 802.3 specification is the recommended place for more information on the auto-negotiation process.

    Another topic that has come up in discussion is possible jitter with the switch. My other question is whether the link drop happens at a regular interval, like once every 2 min. And how long is the link down for? This might give us a better idea of what is causing the link drop, because FLD wasn't much help :(

    I know I listed lots of things to look at and try. Please let me know your feedback and results of any that you try! 
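
One way to run the 50%, 75%, and 100% utilization sweep is with a traffic generator that accepts a target rate. The sketch below only builds the command lines; it assumes iperf3 is the tool in use and that a server is listening at the placeholder address 192.0.2.10, neither of which is stated in the thread.

```python
LINK_RATE_MBPS = 100  # nominal rate of the 100 Mbps switch port


def offered_load_mbps(utilization_pct, link_rate_mbps=LINK_RATE_MBPS):
    """Translate a utilization percentage into an offered load in Mbit/s."""
    return link_rate_mbps * utilization_pct / 100


def iperf3_udp_command(server, utilization_pct, seconds=60):
    """Build an iperf3 UDP client command for the given utilization."""
    rate = offered_load_mbps(utilization_pct)
    return f"iperf3 -c {server} -u -b {rate:g}M -t {seconds}"


if __name__ == "__main__":
    for pct in (50, 75, 100):
        # 192.0.2.10 is a documentation placeholder, not a real host.
        print(iperf3_udp_command("192.0.2.10", pct))
```

The -b target is an application-level rate, so the 100% step slightly oversubscribes the wire once framing overhead is counted. Comparing the link-drop count at each step shows whether the failure rate tracks the offered load.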

  • Hello Lysny,

    After reviewing the Ethernet switch's PHY, it appears to be PMD compliant. The link drops do not occur at a regular interval. I ran an iperf test and verified that no additional jitter is introduced when sending TCP/UDP packets through the Ethernet switch.

    After cross-checking against the IEEE 802.3 spec, I have a question on register 0x0006 bits 5 and 6. The Ethernet switch's PHY does not support these bits. Could this be the contention that is causing our link up/down issues? These are read-only bits; do I have software control over them, or what is the best method to control them?

    Thank you.

  • Hi,

    Okay, glad to hear we ruled out the jitter and the PMD compliance. I looked into the register you mentioned, and from what I found this shouldn't be an issue. As long as the auto-negotiation completes correctly, the PHY will ignore these bits if the switch does not support them. Because you have a link with occasional dropping, I don't think this is the issue.

    One topic that came up was the strap for RX_CTRL. In the datasheet, there is an easy-to-miss footnote saying that the strap must be configured in mode 3 or mode 4 (which is not the default), or Reg 0x31[7] must be programmed. I copied and pasted the footnote here:

    "Strap modes 1 and 2 are not applicable for RX_CTRL. The RX_CTRL strap must be configured for strap mode 3 or strap mode 4. If the RX_CTRL pin cannot be strapped to mode 3 or mode 4, bit[7] of Configuration Register 4 (address 0x0031) must be cleared to 0. Autoneg Disable should always be set to 0 when using gigabit Ethernet." page 38

    The register write will only matter if RX_CTRL is strapped in mode 1 (default) or 2. So, either a strap to mode 3 or 4 (recommended) or a reg write is needed. Could you check this and see if this might be the cause of the issue?

    Thanks,
    Lysny
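
The register check described above can be sketched as a read-modify-write on Configuration Register 4. This is only an illustration: it assumes register 0x0031 is reached through the standard indirect MMD access registers 0x0D and 0x0E with device address 0x1F, and the `bus` object with `read`/`write` methods is a hypothetical stand-in for whatever MDIO accessor the platform provides.

```python
REGCR = 0x0D       # MMD access control register
ADDAR = 0x0E       # MMD access address/data register
DEVAD = 0x1F       # assumed device address for the extended registers
CFG4 = 0x0031      # Configuration Register 4, per the quoted footnote
CFG4_BIT7 = 1 << 7


def read_extended(bus, reg):
    bus.write(REGCR, DEVAD)            # select the "address" function
    bus.write(ADDAR, reg)              # latch the extended register address
    bus.write(REGCR, 0x4000 | DEVAD)   # select "data, no post-increment"
    return bus.read(ADDAR)


def write_extended(bus, reg, value):
    bus.write(REGCR, DEVAD)
    bus.write(ADDAR, reg)
    bus.write(REGCR, 0x4000 | DEVAD)
    bus.write(ADDAR, value & 0xFFFF)


def ensure_cfg4_bit7_cleared(bus):
    """Clear CFG4 bit 7 if it is set; return True when a write was needed."""
    value = read_extended(bus, CFG4)
    if value & CFG4_BIT7:
        write_extended(bus, CFG4, value & ~CFG4_BIT7)
        return True
    return False


class FakeBus:
    """In-memory stand-in for a real MDIO accessor (hypothetical)."""

    def __init__(self, cfg4):
        self.ext = {CFG4: cfg4}
        self.addr = 0
        self.data_mode = False

    def write(self, reg, value):
        if reg == REGCR:
            self.data_mode = bool(value & 0xC000)
        elif reg == ADDAR and self.data_mode:
            self.ext[self.addr] = value
        elif reg == ADDAR:
            self.addr = value

    def read(self, reg):
        return self.ext[self.addr]
```

With `FakeBus(0x00B0)` the first call clears bit 7 and reports that a write was needed; a second call finds the bit already cleared and leaves the register alone.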

  • Hello Lysny, 

    Yes, I just double-checked the strapping and read register 0x0031: bit 7 is cleared and we are not configured for mode 1 or 2. Currently, RX_CTRL is strapped with a 5.76 kΩ resistor pulled up to 1.8 V and a 2.49 kΩ resistor to GND.

    Going back to the FLD_CFG register: with the FLD error options enabled, the FLD_STS status bits (12:8) report bit 10 (MLT3 errors) and bit 9 (SNR level). Do these errors give more information on the issue?

    Thank you.
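
Decoding the fast link drop status field can be scripted so that every register dump is reported the same way. A minimal sketch: the positions for RX errors, MLT3 errors, and SNR level follow the bit numbers quoted in the thread, while the names for the highest and lowest bits of the field are assumptions that should be checked against the datasheet.

```python
STATUS_SHIFT = 8     # status field occupies bits 12:8
STATUS_MASK = 0x1F

# Offsets within the 5-bit field, matching the enable bits 4:0.
CONDITIONS = {
    4: "descrambler loss of sync",  # assumption, verify in the datasheet
    3: "RX errors",                 # enable bit 3, per the thread
    2: "MLT3 errors",               # status bit 10, per the thread
    1: "SNR level",                 # status bit 9, per the thread
    0: "signal/energy lost",        # assumption, verify in the datasheet
}


def decode_fast_drop_status(reg_value):
    """List the conditions flagged in bits 12:8, highest bit first."""
    status = (reg_value >> STATUS_SHIFT) & STATUS_MASK
    return [
        name
        for bit, name in sorted(CONDITIONS.items(), reverse=True)
        if status & (1 << bit)
    ]


if __name__ == "__main__":
    # 0x0600 has bits 10 and 9 set, as reported in the post.
    print(decode_fast_drop_status(0x0600))
```

Logging the decoded list alongside each link-down event would show whether the MLT3 and SNR conditions always appear together or whether one precedes the other.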

  • Hi,

    Okay, so that strap does not seem to be the issue. Sorry for the slow reply, I wanted to discuss this with my team first. Based on the signal quality errors, the team suggested that the switch be tested with IEEE compliance testing for jitter, noise, etc. I'm linking a document from Tektronix that talks about it (https://download.tek.com/document/61W_17381_3.pdf).

    The other suggestion is that clock variability could also be the cause. This could be tested by measuring the clock during transmission and identifying if there are any differences when the link is dropped.

    I hope these testing suggestions can help find the root cause of the packet loss! Let me know your findings/feedback!

    Thanks,

    Lysny