am335x: frequently enabling and disabling the Ethernet interface will cause the PHY to deadlock.

Greetings,

On my am335x hardware, using the Linux v3.12 kernel from TI SDK 7.0, I notice that repeatedly enabling and disabling the Ethernet connection will cause the PHY to go into a DEADLOCK.

Using DHCP, switching from enabled to disabled approximately 5 to 10 times results in an unsuccessful network connection.
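
For reference, each enable/disable cycle amounts to roughly the following from the command line (just a sketch; the interface is configured for DHCP, and the exact commands issued by my setup may differ):

    # /etc/network/interfaces (relevant portion, assumed):
    #   auto eth0
    #   iface eth0 inet dhcp

    ifdown eth0    # disable the connection
    sleep 1
    ifup eth0      # re-enable; DHCP negotiation runs again on the way up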

After reading an old revision of the ti2013.12.01 Linux Kernel Release Notes, I found "Known Issue" #D-01263, which says:
"Ethernet interface up/down stress leads to DEADLOCK"
"When bringing the ethernet interface up/down repeatedly during stress testing a DEADLOCK can be encountered that prevents the phy from coming back up."
processors.wiki.ti.com/index.php

The workaround for the issue says "Will be fixed in next release".


This is my exact issue! I would like to patch my kernel with this fix; however, I could not find any documentation or notes in the next release about how this problem was resolved.

Is anybody able to point me to any documentation or patches made in the next release that include this fix?

Cheers!
-Eric Zaluzec

  • Upon further inspection it might actually be this one:

    git.ti.com/.../e7cf277a6eca5237349de2356de12ff95d9eeb15

    Hopefully one of the two does the trick! Let me know.
  • To clarify a bit, it looks like there are multiple ways that can cause you to arrive in that same scenario. So I believe both patches I have mentioned above are needed to completely avoid the issue.
  • Thanks for this info! Yeah I was looking at applying both patches listed above.
    Going to patch and test now. Will share results once I get my system up and running.
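
    In case it's useful to anyone following along, this is roughly how I'm applying them (a sketch; it assumes the SDK kernel source is a git tree and that the branch containing the fix has been fetched from git.ti.com):

    cd <kernel-source-dir>
    # second patch referenced above (cpsw/phy fix)
    git cherry-pick e7cf277a6eca5237349de2356de12ff95d9eeb15
    # plus the first patch mentioned earlier, applied the same way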
  • After applying these kernel patches, the issue was still observed: the Ethernet locks up until the system is rebooted.


    Here is the script used to test bringing the interface up and down:

    #!/bin/sh
    
    pingIP=<address>
    myIP=<address>
    myNetmask=<address>
    myGW=<address>
    
    cat <<- EOF > /etc/network/interfaces
    auto lo
    iface lo inet loopback
    
    auto eth0
    iface eth0 inet static
    address $myIP
    netmask $myNetmask
    gateway $myGW
    EOF
    
    
    while [ 1 ]; do
    ifdown eth0
    sleep 1
    ifup eth0
    sleep 1
    ping -c 3 $pingIP
    if [ $? != 0 ]; then
    echo 'get ip failed, exit'
    exit 1


  • I think you're missing a "fi" and a "done" on your script.  (I needed to add those lines to get the script to run.)
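
    For reference, the closing lines I added were just:

    fi
    done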

    I ran the test on a BeagleBone Black using the latest SDK.  I do see a failure after a random number of iterations (usually somewhere less than 10), but I do NOT see a lockup in this scenario.  In other words, I can run the script again and it will work for a few times.

    Do you have a TI development board (e.g. an AM335x EVM or BeagleBone Black)? This would probably be a good one to debug on TI hardware, at least for starters. Most likely there is another kernel patch somewhere related to this behavior that had already gone into the kernel before this bug was fixed on the TI side. Or there could be a relationship with the PHY on your board (e.g. in its driver). With a TI development board it would be a lot easier to run this same test on a few different SDK versions in order to narrow down any other patch dependencies.

    So for example, if you have TI hardware I think some good tests would be:

    1. Do you see the same behavior as me using the latest SDK (i.e. occasional initialization failures, but never a deadlock)?
    2. Do you see that same behavior using SDK 7.00 + the two patches? This would help us understand whether the issue is board-specific or kernel-specific. If it's kernel-specific we may want to test some newer SDKs with the patches to help narrow down what other potential dependencies might exist.

  • Eric,

    I reverted to the SDK 7.00 binaries on the BBB. The first packet of the ping was always being lost, so I had to increase the second "sleep" statement from 1 to 2 (i.e. just before calling "ping"). That test has been running for a few minutes without any issues at all. In other words, the behavior actually seems to be better than what I observed on the 4.1 kernel, in the sense that I never see any failures.

    How long does it usually take you to run into this issue?

    Do you have any TI hardware? It would be good if we could verify consistency in test results for a given board or SDK release.

    Brad
  • Hey Brad,

    I have a BBB that I was going to use to try to replicate this problem with SDK 7. I assume that if you were not able to reproduce it, then I will find the same results. When I observed the issue for the first time and saw the TI SDK release notes documenting it, I assumed it was a known issue in the community. I am not sure why the fixes did not resolve my problem.

    It takes anywhere from 3 to 8 minutes, at an average of about 13 up/down cycles per minute, to run into the issue (roughly 40 to 100 cycles). Obviously, dropping your network connection 13 times per minute is not a standard use case; the issue just came up while running other tests on the am335x hardware.

    Our Linux 3.12 kernel is modified a bit from what is provided out of the box by the TI SDK. Although no direct modifications were made to the Ethernet cpsw driver files, it is possible that another change has caused this problem.


    Thanks for your help on this. I'll be sure to post any updates when I learn more too.

    -Eric Z

  • TI's original problem description mentioned that this issue occurred in the presence of heavy traffic. That might be the reason it's not being observed (at least on my end).
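
    If the traffic level turns out to matter, one simple way to add background load while the up/down script runs might be something like this (just a sketch; it assumes the test runs as root and that $pingIP is a reachable host):

    # start some background traffic alongside the up/down test
    ( while true; do ping -c 10 -s 1400 "$pingIP" > /dev/null 2>&1; done ) &
    LOAD_PID=$!

    # ... run the up/down test here ...

    kill $LOAD_PID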
  • Eric Zaluzec said:
    Our Linux 3.12 kernel is modified a bit from what is provided out of the box by the TI SDK. Although no direct modifications were made to the Ethernet cpsw driver files, it is possible that another change has caused this problem.

    I recommend that you test your kernel on the BBB to see if the issue is consistent.  That would help to understand whether you somehow introduced this issue in your kernel, or if it's consistent across the TI kernel and your own.

  • By the way, I went back to the 4.1 kernel to understand why it seemed to be showing a failure every 5-10 iterations.  It looked like it was simply not waiting long enough for the link to come back up.  I've updated your script to poll sysfs for the link to come back up:

    #!/bin/sh
    
    pingIP=192.157.144.228
    myIP=192.157.144.32
    myNetmask=255.255.255.0
    myGW=192.157.144.1
    
    cat <<- EOF > /etc/network/interfaces
    auto lo
    iface lo inet loopback
    
    auto eth0
    iface eth0 inet static
    address $myIP
    netmask $myNetmask
    gateway $myGW
    EOF
    
    
    while [ 1 ]; do

        ifdown eth0
        # wait until the link actually reports down
        while [ "$(cat /sys/class/net/eth0/operstate)" != "down" ]; do
            sleep 1
        done

        ifup eth0
        # wait for the link to come back up before pinging
        while [ "$(cat /sys/class/net/eth0/operstate)" != "up" ]; do
            sleep 1
        done

        ping -c 3 $pingIP
        if [ $? != 0 ]; then
            echo 'get ip failed, exit'
            exit 1
        fi
    done

    I just kicked off a test to see if it will run successfully for a bit.  It's gone a couple minutes so far.  I'll let it run for 30 minutes or so.

    Brad

  • FYI, it has been running for 45+ minutes without issue.

    At this point I have been unable to reproduce the issue using either SDK 7.00 or Proc SDK 2.00.01. I suspect part of the issue may be the lack of heavy network traffic as mentioned in the TI internal ticket, or perhaps there's some other issue related to either your board (e.g. PHY-specific) or your kernel.

    At this point I think I really need you to try some of these tests on the TI hardware so we can narrow down whether the issue pertains to your board or your software.
  • Hey,
    Just as a follow-up: I tested this issue using the TISDKv7 sitara_linux_sdk_image_am335x.img on the BBB hardware. After about 40 minutes of running the test script, the network was unreachable and could not reconnect. This problem, however, was a known issue on the Linux 3.12 kernel, so the TISDKv7 test on the BBB just confirmed that the issue was reproducible using the base kernel with no modifications.

    Along with the 2nd patch listed, git.ti.com/.../e7cf277a6eca5237349de2356de12ff95d9eeb15, were there any other modifications made in the v3.14 kernel Ethernet drivers that would have indirectly resolved this problem? If possible, though probably not recommended, could the v3.14 drivers/net/ethernet/ti/cpsw* files be backported and used with kernel v3.12? It seems like the problem is isolated to the 3.12 kernel even with the fix patches applied.
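
    For reference, one way to survey what changed in that driver between the two kernels is something like this (the tag names below are assumptions and may differ in the TI tree):

    # list cpsw/mdio changes between the 3.12-based and 3.14-based kernels
    git log --oneline v3.12..v3.14 -- drivers/net/ethernet/ti/
    git diff --stat v3.12..v3.14 -- drivers/net/ethernet/ti/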

    Logs:

    --- 172.217.2.78 ping statistics ---
    3 packets transmitted, 2 packets received, 33% packet loss
    round-trip min/avg/max = 25.079/25.250/25.421 ms
    [ 968.348744] net eth0: initializing cpsw version 1.12 (0)
    [ 968.356450] net eth0: phy found : id is : 0x7c0f1
    [ 968.369107] 8021q: adding VLAN 0 to HW filter on device eth0
    PING 172.217.2.78 (172.217.2.78): 56 data bytes
    [ 970.348079] libphy: 4a101000.mdio:00 - Link is Up - 100/Full

    --- 172.217.2.78 ping statistics ---
    3 packets transmitted, 0 packets received, 100% packet loss
    get ip failed, exit
    root@am335x-evm:~# ifconfig
    eth0 Link encap:Ethernet HWaddr D0:39:72:40:CB:E7
    inet addr:10.207.15.119 Bcast:10.207.15.255 Mask:255.255.255.0
    UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
    RX packets:1470 errors:0 dropped:0 overruns:0 frame:0
    TX packets:501 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:1000
    RX bytes:129059 (126.0 KiB) TX bytes:42676 (41.6 KiB)
    Interrupt:56

    lo Link encap:Local Loopback
    inet addr:127.0.0.1 Mask:255.0.0.0
    UP LOOPBACK RUNNING MTU:65536 Metric:1
    RX packets:117 errors:0 dropped:0 overruns:0 frame:0
    TX packets:117 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:0
    RX bytes:248360 (242.5 KiB) TX bytes:248360 (242.5 KiB)

    root@am335x-evm:~# ping www.google.com
    ping: bad address 'www.google.com'
    root@am335x-evm:~#
    root@am335x-evm:~# udhcpc
    udhcpc (v1.20.2) started
    Sending discover...
    Sending discover...
  • Eric,

    Thanks for the follow-up. Apparently I must have tested just a little bit shy of what was needed to reproduce the issue! I kicked off a test to run overnight with SDK 7.00, and when I checked back 90 minutes later I could see it was stopped with the "network unreachable" messages. I needed to reboot to regain connectivity, so I have reproduced the original issue.

    I'm not using dual_emac=1, so I've only added this patch to my kernel:

    git.omapzoom.org/

    I'm rebuilding right now and about to re-test. Based on your results, it sounds like I should expect to still hit this error.
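
    (For the record, the rebuild is just the usual SDK kernel build, roughly along these lines; the defconfig name and toolchain prefix below are placeholders/assumptions:)

    cd <kernel-source-dir>
    make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- <sdk_defconfig>
    make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- zImage dtbs modules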

    I'm also running a 4.1 kernel (latest Proc SDK) in parallel to be sure the issue isn't still present.

    Brad
  • Eric,

    I can confirm that even with my patched 3.12 kernel, I'm still hitting the issue. I can also confirm my board running 4.1 is still going (4+ hours).

    So my conclusion is the same as yours, i.e. something else in the 3.14 transition is necessary for the *complete* fix. Unfortunately, I don't know what that is. I'll check with our networking experts to see if one of them has any suggestions as to a specific patch that might impact this.

    Brad
  • Thanks Brad. In the meantime, I will attempt to move some source files from 3.14 to 3.12 in an attempt to resolve the issue. I'll let you know if I find anything useful.

    Keep me posted.
  • Hey Brad, any response from the networking group on any more patches we might need to apply to resolve this issue?
  • They were not aware of which specific dependency might have impacted this issue. (Thanks for the reminder.)
  • By the way, I just noticed that the test I started on May 27 with the 4.1 kernel was still going! I never stopped the test and I have been traveling the last couple weeks. So I think we can definitively conclude the issue is not present with the latest kernel!