AM5726: NETDEV WATCHDOG: eth0 (cpsw): transmit queue 0 timed out

Part Number: AM5726

Hi everyone,
I hope you can help me understand and solve the following problem.

First, here are the SDK and kernel versions I'm running:

SDK 03.01.00.06
Linux with RT support Linux 4.4.19-rt25 SMP PREEMPT RT.

I have already seen other threads on this topic, but none of them showed the real cause of the problem or what triggers it. Sometimes the Ethernet port stops working with a NETDEV WATCHDOG message, as shown below:

[ 1568.403944] NOHZ: local_softirq_pending 80
[ 1574.143709] NOHZ: local_softirq_pending 80
[ 1581.853441] NOHZ: local_softirq_pending 80
[ 1597.993219] NOHZ: local_softirq_pending 80
[ 1602.332767] NOHZ: local_softirq_pending 80
[ 1644.541335] NOHZ: local_softirq_pending 80
[ 1662.670705] NOHZ: local_softirq_pending 80
[ 1690.839761] NOHZ: local_softirq_pending 80
[ 1711.029097] NOHZ: local_softirq_pending 80
[ 1720.258780] NOHZ: local_softirq_pending 80
[ 1921.951512] ------------[ cut here ]------------
[ 1921.998844] WARNING: CPU: 0 PID: 4 at net/sched/sch_generic.c:306 dev_watchdog+0x26c/0x278()
[ 1922.007324] NETDEV WATCHDOG: eth0 (cpsw): transmit queue 0 timed out
[ 1922.013709] Modules linked in: cbc jitterentropy_rng drbg rpmsg_proto virtio_rpmsg_bus pru_rproc pruss_intc sha512_generic sha512_arm sha1_generic omap_aes_driver sha1_arm_neon sha1_arm pruss omap_sham omap_wdt omap_des omap_remoteproc debugss_kmodule(O) remoteproc virtio virtio_ring sch_fq_codel cryptodev(O) cmemk(O)
[ 1922.043675] CPU: 0 PID: 4 Comm: ktimersoftd/0 Tainted: G           O    4.4.19-rt25-gf572d285f0 #1
[ 1922.043679] Hardware name: Generic DRA74X (Flattened Device Tree)
[ 1922.043684] Backtrace: 
[ 1922.043704] [<c0012ec4>] (dump_backtrace) from [<c00130c0>] (show_stack+0x18/0x1c)
[ 1922.043716]  r7:c0501bbc r6:20030113 r5:00000000 r4:c0889284
[ 1922.043728] [<c00130a8>] (show_stack) from [<c033e404>] (dump_stack+0x8c/0xa0)
[ 1922.043739] [<c033e378>] (dump_stack) from [<c0033264>] (warn_slowpath_common+0x88/0xb8)
[ 1922.043750]  r7:c0501bbc r6:00000132 r5:00000009 r4:ee8a1e00
[ 1922.043759] [<c00331dc>] (warn_slowpath_common) from [<c00332cc>] (warn_slowpath_fmt+0x38/0x40)
[ 1922.043771]  r8:ee31f000 r7:00000001 r6:c0852580 r5:ee062000 r4:c07c17c8
[ 1922.043782] [<c0033298>] (warn_slowpath_fmt) from [<c0501bbc>] (dev_watchdog+0x26c/0x278)
[ 1922.043788]  r3:ee062000 r2:c07c17c8
[ 1922.043792]  r4:00000000
[ 1922.043803] [<c0501950>] (dev_watchdog) from [<c008fce8>] (call_timer_fn+0x30/0xa0)
[ 1922.043816]  r10:c0501950 r9:ee062000 r8:00000200 r7:c0501950 r6:00000000 r5:eed54580
[ 1922.043820]  r4:ffffe000
[ 1922.043829] [<c008fcb8>] (call_timer_fn) from [<c008ff08>] (run_timer_softirq+0x1b0/0x23c)
[ 1922.043840]  r7:00000000 r6:00000000 r5:eed54580 r4:ee0622bc
[ 1922.043851] [<c008fd58>] (run_timer_softirq) from [<c0036174>] (do_current_softirqs+0x1b8/0x254)
[ 1922.043863]  r10:00000001 r9:ee8a1ed0 r8:00000000 r7:04208140 r6:ee8a0000 r5:00000004
[ 1922.043867]  r4:c084b2b0
[ 1922.043875] [<c0035fbc>] (do_current_softirqs) from [<c003668c>] (run_ksoftirqd+0x34/0x64)
[ 1922.043887]  r10:00000000 r9:00000000 r8:ffffe000 r7:c0864630 r6:00000001 r5:ee843bc0
[ 1922.043891]  r4:ffffe000
[ 1922.043901] [<c0036658>] (run_ksoftirqd) from [<c00511f0>] (smpboot_thread_fn+0x164/0x2b8)
[ 1922.043907]  r5:ee843bc0 r4:ee8a0000
[ 1922.043915] [<c005108c>] (smpboot_thread_fn) from [<c004de24>] (kthread+0xe4/0xfc)
[ 1922.043927]  r10:00000000 r9:00000000 r8:00000000 r7:c005108c r6:ee843bc0 r5:ee843c40
[ 1922.043932]  r4:00000000 r3:ee8915c0
[ 1922.043941] [<c004dd40>] (kthread) from [<c000fa90>] (ret_from_fork+0x14/0x24)
[ 1922.043950]  r7:00000000 r6:00000000 r5:c004dd40 r4:ee843c40
[ 1922.043953] ---[ end trace 0000000000000002 ]---
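
To correlate these events with traffic tests, the watchdog lines can be pulled out of a saved kernel log. The following is only a minimal sketch, assuming the log lines keep the `[ timestamp] NETDEV WATCHDOG: <iface> (<driver>): transmit queue <n> timed out` layout shown above. The names `WatchdogEvent` and `find_watchdog_events` are hypothetical helpers for illustration, not part of any TI or kernel tool.

```python
import re
from typing import Iterable, List, NamedTuple


class WatchdogEvent(NamedTuple):
    """One NETDEV WATCHDOG occurrence extracted from a kernel log line."""
    timestamp: float   # seconds since boot, as printed by the kernel
    interface: str     # e.g. "eth0"
    driver: str        # e.g. "cpsw"
    queue: int         # transmit queue index


# Assumed line layout, taken from the log excerpt in this thread.
_WATCHDOG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NETDEV WATCHDOG: "
    r"(?P<iface>\S+) \((?P<drv>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_events(lines: Iterable[str]) -> List[WatchdogEvent]:
    """Return every transmit-queue timeout found in the given log lines."""
    events = []
    for line in lines:
        match = _WATCHDOG_RE.search(line)
        if match:
            events.append(
                WatchdogEvent(
                    timestamp=float(match.group("ts")),
                    interface=match.group("iface"),
                    driver=match.group("drv"),
                    queue=int(match.group("queue")),
                )
            )
    return events
```

For example, calling `find_watchdog_events(open("dmesg.txt"))` on a captured log (the file name is just a placeholder) would give one entry per timeout, which can then be compared against the times at which the UDP traffic was started.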

After that, the Ethernet link no longer passed any traffic. I had to unplug the RJ45 cable and plug it back in before communication would work again. To avoid having to reconnect the cable, I cherry-picked a commit from a newer kernel version:

net: ethernet: ti: cpsw: wake tx queues on ndo_tx_timeout

After that, the problem still occurred, but the Ethernet connection kept working after the NETDEV WATCHDOG triggered.
Following other threads on the TI forum, I cherry-picked two more commits that make the CPDMA descriptor pool size configurable:
 
net: ethernet: ti: cpdma: minimize number of parameters in cpdma_desc_pool_create/destroy()
net: ethernet: ti: cpsw: add support for descs_pool_size dt property
After that, the problem happens less frequently, but it still occurs from time to time.

I haven't found a specific scenario that reliably reproduces the problem, but it sometimes happens when I generate heavy network traffic with UDP packets. This includes running iperf with the target and the host each acting as server and client simultaneously, and having the target send UDP packets with the udpsend tool I got from the TI forum.
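
For anyone who wants to try a similar load without the exact tools, the following is a rough sketch of a UDP burst sender. It is not the udpsend tool from the TI forum, only a stand-in written under my own assumptions. The function name `send_udp_burst` and its parameters (`payload_size`, `interval_s`) are hypothetical.

```python
import socket
import time


def send_udp_burst(host: str, port: int, count: int,
                   payload_size: int = 1400,
                   interval_s: float = 0.0) -> int:
    """Send `count` UDP datagrams of `payload_size` zero bytes to host:port.

    Returns the total number of payload bytes handed to the socket.
    With interval_s == 0 the datagrams are sent back to back, which is
    the kind of load that stresses the transmit path.
    """
    payload = bytes(payload_size)  # zero-filled dummy payload
    total_sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(count):
            total_sent += sock.sendto(payload, (host, port))
            if interval_s > 0:
                time.sleep(interval_s)
    return total_sent
```

Running something like `send_udp_burst("192.168.1.10", 5001, 100000)` on the target (the address and port are placeholders) while iperf runs in the opposite direction would approximate the bidirectional load described above, though it may not trigger the timeout in the same way.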

Does anyone know the real trigger of this problem? Can anyone give me some insight into this issue?

Thank you very much. 
  • Hi,

    Are you referring to this E2E thread?

    Do you see the issue on PRU-ICSS as well?

    Regards

    Vineet

  • Hi Vineet,

    Thank you for your reply.

    I didn't see any PRU-ICSS issue in my tests. The messages are always related to CPSW. I also ran the iperf command from the thread you mentioned, and the problem didn't occur.

    In my case, when I inject UDP traffic on eth0, the problem happens on eth0, and the board shows the same behavior on eth1.

    The other two threads I mentioned are these:

     
    I'm planning to double the descriptor pool size, which is currently 1024, but that would only cover up the problem rather than fix it properly or identify what triggers the issue.

    Best,

    Diogo

  • Hi Diogo,

    These are very old threads, and your SDK version is also fairly old. Can you check with the latest SDK? The issue may already be fixed there.

    The current SDK version is 6.03 and the kernel is 4.19.

    Regards

    Vineet