Part Number: AM3352
We have an issue with a USB ethernet controller (lan9500, smsc95xx driver) on a custom board with an am3352. When DMA is enabled, packets suddenly can no longer be transmitted and the TX error count increases. It occurs under moderate traffic after anywhere from a few hours to a few weeks. It cannot be reproduced by simply transmitting a terabyte of data; the packet count seems to be more influential than the byte count.
The problem is present with all kernel versions we have used so far (4.0, 4.4, 4.14; kernel.org and git.ti.com). Currently we are using 4.14.172 based on the ti-linux-4.14.y branch with the missing stable kernel patches applied. The lan9500 is connected through a USB hub and has an ethernet switch attached to it. The switch is controlled by a Distributed Switch Architecture (DSA) driver, but the same setup is used on the cpsw ethernet and it is working fine there. I don't think the USB hub is causing the problem either, because there have been similar errors with other devices (CDC Ether, or another lan9500 without the switch) where the TX error count increased. However, those resulted in netdev watchdog errors. The likelihood of the error occurring increases if more (up to 3) lan9500 are attached to the USB hub. Once it has occurred, it is also a lot more likely to recur.
The workaround so far has been to check the TX error counter on the network interface and restart the interface once the counter increases. This causes packet loss and a downtime of at least a few seconds. A big customer will no longer accept that, and furthermore restarting the interface is not fully reliable: it might simply stop working, and then the only workaround is to reboot the whole device. Receiving packets still works while TX is failing, and another lan9500 attached to the USB hub can continue to function properly.
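For reference, the workaround is roughly equivalent to the following sketch. It is not our production code: the function names, the 5 second polling interval and the use of the iproute2 `ip` tool are assumptions made for illustration. The counter is read from the standard sysfs statistics file `/sys/class/net/<iface>/statistics/tx_errors`.

```python
import subprocess
import time
from pathlib import Path


def read_tx_errors(iface, sysfs_root="/sys/class/net"):
    """Return the TX error counter the kernel exports for iface."""
    path = Path(sysfs_root) / iface / "statistics" / "tx_errors"
    return int(path.read_text().strip())


def tx_errors_increased(previous, current):
    """True if the counter grew since the last poll."""
    return current > previous


def restart_interface(iface):
    """Bring the interface down and up again.

    Assumption: the iproute2 'ip' tool is installed. On a busybox-only
    system 'ifconfig <iface> down' / 'ifconfig <iface> up' would be used.
    """
    subprocess.run(["ip", "link", "set", "dev", iface, "down"], check=True)
    subprocess.run(["ip", "link", "set", "dev", iface, "up"], check=True)


def watch(ifaces, interval_s=5.0):
    """Poll the counters forever and restart any interface whose
    TX error count has increased since the previous poll."""
    last = {iface: read_tx_errors(iface) for iface in ifaces}
    while True:
        time.sleep(interval_s)
        for iface in ifaces:
            current = read_tx_errors(iface)
            if tx_errors_increased(last[iface], current):
                restart_interface(iface)
                # Re-read after the restart so the next comparison
                # starts from the post-restart value.
                current = read_tx_errors(iface)
            last[iface] = current


if __name__ == "__main__":
    # Interface names taken from the log below.
    watch(["swm4", "swm5"])
```

The restart step is exactly what causes the packet loss and downtime described above, and as noted it does not always bring the interface back.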
Turning off DMA is not an option, because of performance issues.
Here is an example of the error occurring. The network interfaces are called swm4 and swm5.
[142502.538889] smsc95xx 2-1.1:1.0 swm5: kevent 0 may have been dropped
[142502.538966] smsc95xx 2-1.1:1.0 swm5: kevent 0 may have been dropped
[142502.539055] smsc95xx 2-1.1:1.0 swm5: kevent 0 may have been dropped
[142502.539187] smsc95xx 2-1.2:1.0 swm4: kevent 0 may have been dropped
[142502.539253] smsc95xx 2-1.2:1.0 swm4: kevent 0 may have been dropped
[142502.568045] smsc95xx 2-1.1:1.0 swm5: kevent 0 may have been dropped
[142502.568147] smsc95xx 2-1.1:1.0 swm5: kevent 0 may have been dropped
[142507.341366] net_ratelimit: 298 callbacks suppressed
[142507.341389] smsc95xx 2-1.1:1.0 swm5: kevent 0 may have been dropped
[142507.341458] smsc95xx 2-1.1:1.0 swm5: kevent 0 may have been dropped
[142507.341520] smsc95xx 2-1.1:1.0 swm5: kevent 0 may have been dropped
[142507.601300] smsc95xx 2-1.1:1.0 swm5: kevent 0 may have been dropped
[142507.601380] smsc95xx 2-1.1:1.0 swm5: kevent 0 may have been dropped
[142507.601447] smsc95xx 2-1.1:1.0 swm5: kevent 0 may have been dropped
[142507.601567] smsc95xx 2-1.2:1.0 swm4: kevent 0 may have been dropped
[142507.601630] smsc95xx 2-1.2:1.0 swm4: kevent 0 may have been dropped
[142507.754775] smsc95xx 2-1.1:1.0 swm5: kevent 0 may have been dropped
[142507.754853] smsc95xx 2-1.1:1.0 swm5: kevent 0 may have been dropped
~ # ifconfig swm4
swm4 Link encap:Ethernet HWaddr 00:05:B6:99:88:78
inet6 addr: fe80::205:b6ff:fe99:8878/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:183313 errors:0 dropped:0 overruns:0 frame:0
TX packets:3478853 errors:70475 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:35783060 (34.1 MiB) TX bytes:409513090 (390.5 MiB)
~ # ifconfig swm5
swm5 Link encap:Ethernet HWaddr 00:05:B6:03:10:52
inet6 addr: fe80::205:b6ff:fe03:1052/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:344310 errors:0 dropped:0 overruns:0 frame:0
TX packets:3967330 errors:58438 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:29199213 (27.8 MiB) TX bytes:337955614 (322.2 MiB)
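To put the counters above into perspective, here is a small sketch that extracts the TX numbers from busybox-style ifconfig output and derives the error ratio and the average TX packet size. The helper name `tx_summary` and the choice of metrics are only illustrative assumptions; the sample text is the swm4 output from above.

```python
import re

# ifconfig output for swm4, copied from the post above.
SAMPLE = """\
swm4 Link encap:Ethernet HWaddr 00:05:B6:99:88:78
RX packets:183313 errors:0 dropped:0 overruns:0 frame:0
TX packets:3478853 errors:70475 dropped:0 overruns:0 carrier:0
RX bytes:35783060 (34.1 MiB) TX bytes:409513090 (390.5 MiB)
"""


def tx_summary(ifconfig_text):
    """Parse TX packets, TX errors and TX bytes from ifconfig output.

    Returns a dict with the raw counters plus two derived values:
    errors per counted TX packet and average bytes per TX packet.
    Raises ValueError if the expected fields are missing.
    """
    pkt_match = re.search(r"TX packets:(\d+)\s+errors:(\d+)", ifconfig_text)
    byte_match = re.search(r"TX bytes:(\d+)", ifconfig_text)
    if pkt_match is None or byte_match is None:
        raise ValueError("no TX counters found in ifconfig output")

    packets = int(pkt_match.group(1))
    errors = int(pkt_match.group(2))
    tx_bytes = int(byte_match.group(1))

    return {
        "packets": packets,
        "errors": errors,
        "bytes": tx_bytes,
        # Guard against division by zero on an idle interface.
        "error_ratio": errors / packets if packets else 0.0,
        "avg_packet_bytes": tx_bytes / packets if packets else 0.0,
    }


if __name__ == "__main__":
    summary = tx_summary(SAMPLE)
    print(f"TX error ratio:     {summary['error_ratio']:.2%}")
    print(f"avg TX packet size: {summary['avg_packet_bytes']:.1f} bytes")
```

Running the same function on the swm5 output gives the corresponding numbers for the second interface.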
I will give kernel 5.4 a try, but it will take me at least a week to merge my other changes and run the tests.
Any suggestions?