am3359 ethernet hang (hw_stats: desc_alloc_fail steadily increasing)

Other Parts Discussed in Thread: AM3359, AM3354

We are using an AM3359 device based on the EVM 05.04.01.00 package.  We have built the Etherlab EtherCAT master and have an application cycling at about 2 ms. After about 8 hours of runtime the Ethernet interface associated with EtherCAT hangs.  The link is still up, and every second or so we see a flicker of activity on the slaves' link lights.  The only way to get the interface operating again is a reboot; restarting the app and the EtherCAT master does not help.

When the system is in the hung state, cat /sys/class/net/eth0/hw_stats shows:

CPSW Statistics:
rxgoodframes ............................   48914245
rxbroadcastframes .......................   48914245
rxoctets ................................ -929601998
txgoodframes ............................   48914245
txbroadcastframes .......................   48914245
txoctets ................................ -929601998
octetframes64 ...........................    1309034
octetframes65t127 .......................   96518092
octetframes128t255 ......................       1364
netoctets ............................... -1859203996
rxsofoverruns ...........................     817856
rxdmaoverruns ...........................     817856

RX DMA Statistics:
head_enqueue ............................          1
tail_enqueue ............................   48096388
busy_dequeue ............................   48121676
good_dequeue ............................   48096325

TX DMA Statistics:
head_enqueue ............................   48100585
tail_enqueue ............................     813660
misqueued ...............................     813660
desc_alloc_fail .........................      25555
empty_dequeue ...........................   48125935
good_dequeue ............................   48914053

Diffing against a collection of these stats taken a few seconds earlier gives:

@@ -1,16 +1,16 @@
 CPSW Statistics:
-rxgoodframes ............................   48914053
-rxbroadcastframes .......................   48914053
-rxoctets ................................ -929614286
-txgoodframes ............................   48914053
-txbroadcastframes .......................   48914053
-txoctets ................................ -929614286
-octetframes64 ...........................    1308650
+rxgoodframes ............................   48914245
+rxbroadcastframes .......................   48914245
+rxoctets ................................ -929601998
+txgoodframes ............................   48914245
+txbroadcastframes .......................   48914245
+txoctets ................................ -929601998
+octetframes64 ...........................    1309034
 octetframes65t127 .......................   96518092
 octetframes128t255 ......................       1364
-netoctets ............................... -1859228572
-rxsofoverruns ...........................     817664
-rxdmaoverruns ...........................     817664
+netoctets ............................... -1859203996
+rxsofoverruns ...........................     817856
+rxdmaoverruns ...........................     817856
 
 RX DMA Statistics:
 head_enqueue ............................          1
@@ -19,9 +19,9 @@
 good_dequeue ............................   48096325
 
 TX DMA Statistics:
-head_enqueue ............................   48100584
-tail_enqueue ............................     813469
-misqueued ...............................     813469
-desc_alloc_fail .........................      25551
-empty_dequeue ...........................   48125934
-good_dequeue ............................   48913861
+head_enqueue ............................   48100585
+tail_enqueue ............................     813660
+misqueued ...............................     813660
+desc_alloc_fail .........................      25555
+empty_dequeue ...........................   48125935
+good_dequeue ............................   48914053
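
To compare snapshots without reading the diff by eye, a small script can compute per-counter deltas. The following is a minimal sketch, assuming the negative values (rxoctets, txoctets, netoctets) are 32-bit counters that the driver prints as signed integers; `parse_hw_stats` and `delta` are hypothetical helper names, and the sample text is copied from the two snapshots above.

```python
import re

U32 = 1 << 32  # assumption: counters are 32 bits wide and printed as signed


def parse_hw_stats(text):
    """Return {section: {counter: value}} with values folded to unsigned 32-bit."""
    stats, section = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith(":"):
            section = line[:-1]
            stats[section] = {}
            continue
        m = re.match(r"^(\w+)\s+\.+\s+(-?\d+)$", line)
        if m and section is not None:
            # Python's % always yields a non-negative result here
            stats[section][m.group(1)] = int(m.group(2)) % U32
    return stats


def delta(before, after):
    """Per-counter increase between two snapshots, modulo 2**32."""
    out = {}
    for section, counters in after.items():
        for name, value in counters.items():
            prev = before.get(section, {}).get(name)
            if prev is not None:
                d = (value - prev) % U32
                if d:
                    out[(section, name)] = d
    return out


EARLIER = """CPSW Statistics:
rxoctets ................................ -929614286
TX DMA Statistics:
desc_alloc_fail .........................      25551
good_dequeue ............................   48913861
"""

LATER = """CPSW Statistics:
rxoctets ................................ -929601998
TX DMA Statistics:
desc_alloc_fail .........................      25555
good_dequeue ............................   48914053
"""

for key, inc in sorted(delta(parse_hw_stats(EARLIER), parse_hw_stats(LATER)).items()):
    print(key, "+%d" % inc)
```

In practice the two inputs would be the contents of /sys/class/net/eth0/hw_stats read a few seconds apart. For the excerpt above, desc_alloc_fail grows by 4 while good_dequeue grows by 192 and rxoctets by 12288 (64 bytes per frame).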

Now I'm trying to shrink the descriptor/driver memory area so that I can reproduce this problem more easily.  Any help with setting the proper parameters would be appreciated.

I'm diffing the driver source from evm -05.05.00.00 but I'm not seeing any definite fix for this.  I also plan to get the git tree to see what the latest is there.

I also found this possible fix: https://github.com/pgibson/u-boot-innotech/commit/93c05158992ffb0e6ce68f5f8ee3ff691694db7b but that path and file don't correspond to anything I have from the arago git trees or the ti packaged distributions.  Where is the best place to get a git tree to peruse the source changes?

In summary:  Does anyone know of a fix for this problem?  How do I make the descriptor/memory area smaller to reproduce the problem more easily?  Where is the source repo for Arago?

  • I backported linux-3.2-psp04.06.00.08.sdk/drivers/net/ethernet/ti to linux-3.2-psp04.06.00.07.  Same result after 6 hours of run time.

  • With the backported driver, I caught the hang shortly after it happened. Here is the stack trace that was reported:

    [  943.015147] ------------[ cut here ]------------
    [  943.015213] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x28c/0x29c()
    [  943.015228] NETDEV WATCHDOG: eth0 (cpsw): transmit queue 0 timed out
    [  943.015238] Modules linked in: g_ether ec_generic(O) ec_master(O)
    [  943.015260] Backtrace:
    [  943.015296] [<c0017b64>] (dump_backtrace+0x0/0x110) from [<c03c2e64>] (dump_stack+0x18/0x1c)
    [  943.015311]  r6:c04d0b64 r5:000000ff r4:cf82fe70 r3:00000000
    [  943.015344] [<c03c2e4c>] (dump_stack+0x0/0x1c) from [<c003f62c>] (warn_slowpath_common+0x5c/0x6c)
    [  943.015366] [<c003f5d0>] (warn_slowpath_common+0x0/0x6c) from [<c003f6e0>] (warn_slowpath_fmt+0x38/0x40)
    [  943.015379]  r8:c05bad78 r7:c0560a58 r6:00000000 r5:cfae49d8 r4:cfae4800
    [  943.015398] r3:00000009
    [  943.015416] [<c003f6a8>] (warn_slowpath_fmt+0x0/0x40) from [<c032b120>] (dev_watchdog+0x28c/0x29c)
    [  943.015429]  r3:cfae4800 r2:c04d0b7c
    [  943.015457] [<c032ae94>] (dev_watchdog+0x0/0x29c) from [<c004aaa8>] (run_timer_softirq+0x110/0x23c)
    [  943.015484] [<c004a998>] (run_timer_softirq+0x0/0x23c) from [<c0044f80>] (__do_softirq_common+0xd8/0x174)
    [  943.015507] [<c0044ea8>] (__do_softirq_common+0x0/0x174) from [<c00450dc>] (__thread_do_softirq+0xc0/0x10c)
    [  943.015530] [<c004501c>] (__thread_do_softirq+0x0/0x10c) from [<c00451ac>] (run_ksoftirqd+0x84/0x168)
    [  943.015543]  r6:cf82ff84 r5:cf82e000 r4:00000000 r3:00000002
    [  943.015580] [<c0045128>] (run_ksoftirqd+0x0/0x168) from [<c0059cf0>] (kthread+0x90/0x94)
    [  943.015592]  r8:00000000 r7:00000013 r6:c0045128 r5:00000000 r4:cf82bef4
    [  943.015623] [<c0059c60>] (kthread+0x0/0x94) from [<c0042b7c>] (do_exit+0x0/0x6b8)
    [  943.015634]  r6:c0042b7c r5:c0059c60 r4:cf82bef4
    [  943.015650] ---[ end trace 0000000000000002 ]---

  • I hand applied the diff as given in https://github.com/pgibson/u-boot-innotech/commit/93c05158992ffb0e6ce68f5f8ee3ff691694db7b

    This will be tested tonight.

    Below is the diff -cp against the original psp04.06.00.08.sdk:

    *** ~/ti-sdk-am335x-evm-05.05.00.00-azcsm-rtpatch/board-support/linux-3.2.0-psp04.06.00.08.sdk/drivers/net/ethernet/ti/davinci_cpdma.c    2012-07-19 15:21:35.000000000 -0500
    --- ./davinci_cpdma.c    2012-09-27 13:43:28.743883603 -0500
    *************** static int __cpdma_chan_process(struct c
    *** 757,762 ****
    --- 757,766 ----
          status    = __raw_readl(&desc->hw_mode);
          outlen    = status & 0x7ff;
          if (status & CPDMA_DESC_OWNER) {
    +       if(chan_read(chan,hdp) == NULL ) {
    +         if(desc_read(desc, hw_mode) & CPDMA_DESC_OWNER)
    +           chan_write(chan, hdp, desc);
    +       }
              chan->stats.busy_dequeue++;
              status = -EBUSY;
              goto unlock_ret;

  • I should mention that we are running with the rt10 version of the 3.2.0 RT_PREEMPT patches applied.

  • Bruno,

    Did this patch address the issue you are observing?

  • No, the patch didn't work.  Sorry, I thought I had supplied that info already.

  • Here is the dmesg output that includes the hang with all msg debug enabled in cpsw.c (note the "net eth0: desc submit failed" message):

    [21122.115318] net_ratelimit: 12515 callbacks suppressed
    [21127.125317] net_ratelimit: 12515 callbacks suppressed
    [21132.135317] net_ratelimit: 12515 callbacks suppressed
    [21137.145319] net_ratelimit: 12515 callbacks suppressed
    [21138.516040] EtherCAT 0: Domain 0: Working counter changed to 0/27.
    [21138.635127] EtherCAT ERROR 0-1: Failed to receive AL state datagram: Datagram timed out.
    [21138.995123] EtherCAT WARNING 0: 39 datagrams TIMED OUT!
    [21139.995138] EtherCAT WARNING 0: 83 datagrams TIMED OUT!
    [21140.995136] EtherCAT WARNING 0: 84 datagrams TIMED OUT!
    [21141.995139] EtherCAT WARNING 0: 84 datagrams TIMED OUT!
    [21142.295786] net_ratelimit: 3417 callbacks suppressed
    [21142.295806] net eth0: desc submit failed
    [21142.995138] EtherCAT WARNING 0: 84 datagrams TIMED OUT!
    [21143.995133] EtherCAT WARNING 0: 68 datagrams TIMED OUT!
    [21144.295770] net eth0: desc submit failed
    [21146.295837] net eth0: desc submit failed
    [21147.015127] ------------[ cut here ]------------
    [21147.015192] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x28c/0x29c()
    [21147.015206] NETDEV WATCHDOG: eth0 (cpsw): transmit queue 0 timed out
    [21147.015215] Modules linked in: g_ether ec_generic(O) ec_master(O)
    [21147.015236] Backtrace:
    [21147.015270] [<c0017b64>] (dump_backtrace+0x0/0x110) from [<c03c2e84>] (dump_stack+0x18/0x1c)
    [21147.015283]  r6:c04d0b64 r5:000000ff r4:cf82fe70 r3:00000000
    [21147.015316] [<c03c2e6c>] (dump_stack+0x0/0x1c) from [<c003f62c>] (warn_slowpath_common+0x5c/0x6c)
    [21147.015336] [<c003f5d0>] (warn_slowpath_common+0x0/0x6c) from [<c003f6e0>] (warn_slowpath_fmt+0x38/0x40)
    [21147.015348]  r8:c05bad78 r7:c0560a58 r6:00000000 r5:cfae49d8 r4:cfae4800
    [21147.015366] r3:00000009
    [21147.015384] [<c003f6a8>] (warn_slowpath_fmt+0x0/0x40) from [<c032b140>] (dev_watchdog+0x28c/0x29c)
    [21147.015395]  r3:cfae4800 r2:c04d0b7c
    [21147.015424] [<c032aeb4>] (dev_watchdog+0x0/0x29c) from [<c004aaa8>] (run_timer_softirq+0x110/0x23c)
    [21147.015449] [<c004a998>] (run_timer_softirq+0x0/0x23c) from [<c0044f80>] (__do_softirq_common+0xd8/0x174)
    [21147.015471] [<c0044ea8>] (__do_softirq_common+0x0/0x174) from [<c00450dc>] (__thread_do_softirq+0xc0/0x10c)
    [21147.015492] [<c004501c>] (__thread_do_softirq+0x0/0x10c) from [<c00451ac>] (run_ksoftirqd+0x84/0x168)
    [21147.015504]  r6:cf82ff84 r5:cf82e000 r4:00000000 r3:00000002
    [21147.015540] [<c0045128>] (run_ksoftirqd+0x0/0x168) from [<c0059cf0>] (kthread+0x90/0x94)
    [21147.015550]  r8:00000000 r7:00000013 r6:c0045128 r5:00000000 r4:cf82bef4
    [21147.015580] [<c0059c60>] (kthread+0x0/0x94) from [<c0042b7c>] (do_exit+0x0/0x6b8)
    [21147.015590]  r6:c0042b7c r5:c0059c60 r4:cf82bef4
    [21147.015606] ---[ end trace 0000000000000002 ]---
    [21147.015625] net eth0: transmit timeout, restarting dma
    [21147.018375] net eth0: desc submit failed
    [21147.995137] EtherCAT WARNING 0: 80 datagrams TIMED OUT!
    [21148.295807] net eth0: desc submit failed
    [21148.995133] EtherCAT WARNING 0: 79 datagrams TIMED OUT!
    [21150.295807] net eth0: desc submit failed
    [21152.295790] net eth0: desc submit failed
    [21154.295778] net eth0: desc submit failed
    [21156.295827] net eth0: desc submit failed
    [21157.015143] net eth0: transmit timeout, restarting dma
    [21157.017926] net eth0: desc submit failed
    [21157.995139] EtherCAT WARNING 0: 81 datagrams TIMED OUT!
    [21158.295853] net eth0: desc submit failed
    [21158.995134] EtherCAT WARNING 0: 79 datagrams TIMED OUT!
    [21160.295797] net eth0: desc submit failed
    [21162.295780] net eth0: desc submit failed
    [21164.295834] net eth0: desc submit failed
    [21166.295819] net eth0: desc submit failed
    [21167.015141] net eth0: transmit timeout, restarting dma
    [21167.017909] net eth0: desc submit failed
    [21167.995140] EtherCAT WARNING 0: 81 datagrams TIMED OUT!
    [21168.295842] net eth0: desc submit failed
    [21168.995135] EtherCAT WARNING 0: 76 datagrams TIMED OUT!
    [21170.295786] net eth0: desc submit failed
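
The timing of the repeated messages in a log like this can be extracted with a short script. This is a rough sketch, assuming the usual `[seconds.micros] message` dmesg prefix; `event_times` and `intervals` are hypothetical helper names, and the sample lines are copied from the log above.

```python
import re

# Matches "[21147.015625] some message", tolerating leading indentation
LINE_RE = re.compile(r"^\s*\[\s*(\d+\.\d+)\]\s+(.*)$")


def event_times(log_text, needle):
    """Timestamps (in seconds) of every log line whose message contains needle."""
    times = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m and needle in m.group(2):
            times.append(float(m.group(1)))
    return times


def intervals(times):
    """Gaps between consecutive timestamps, rounded to milliseconds."""
    return [round(b - a, 3) for a, b in zip(times, times[1:])]


SAMPLE = """
[21147.015625] net eth0: transmit timeout, restarting dma
[21147.018375] net eth0: desc submit failed
[21157.015143] net eth0: transmit timeout, restarting dma
[21157.017926] net eth0: desc submit failed
[21167.015141] net eth0: transmit timeout, restarting dma
[21167.017909] net eth0: desc submit failed
"""

print(intervals(event_times(SAMPLE, "transmit timeout, restarting dma")))
```

On the excerpt above the DMA restarts come roughly 10 seconds apart, and each one is followed within a few milliseconds by another "desc submit failed", so the restart does not appear to free any descriptors.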

  • I tried ifdown/ifup on our failed system.  While that did not solve the problem (nor clear the counters reported by ifconfig), I noticed a burst of activity for about a second or two before locking up again.  Sometimes the burst of activity would occur on the ifup, other times the burst occurred when the EtherCAT stack was restarted. In both cases the system went into the hung state after the small burst.

    I plan on getting wireshark into the system to see what the packets are.

  • Messages of the form "net_ratelimit: 12515 callbacks suppressed" come from limiting the amount of messaging going to syslog.  This was put in to prevent Denial-of-Service attacks.  It has nothing to do with hardware rate limiting.  We've verified the ports are in normal priority mode.
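
To illustrate where those suppressed counts come from, here is a simplified model of an interval/burst log rate limiter. It is a sketch and not the kernel's code; the `RateLimit` class name is hypothetical, and the defaults of a 5-second interval and a burst of 10 are assumptions about the usual net_ratelimit settings.

```python
class RateLimit:
    """Allow at most `burst` messages per `interval`; count the rest as suppressed."""

    def __init__(self, interval=5.0, burst=10):
        self.interval = interval  # window length, same units as `now`
        self.burst = burst        # messages allowed per window
        self.begin = None         # start time of the current window
        self.printed = 0
        self.missed = 0

    def allow(self, now):
        """Return (allowed, report); report is the suppressed count of the
        window that just ended, or None if nothing needs reporting."""
        report = None
        if self.begin is None:
            self.begin = now
        if now - self.begin > self.interval:
            if self.missed:
                report = self.missed  # "N callbacks suppressed"
            self.begin = now
            self.printed = 0
            self.missed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True, report
        self.missed += 1
        return False, report


# One message per millisecond for about ten seconds (times in integer ms)
rl = RateLimit(interval=5000, burst=10)
for t in range(10003):
    allowed, report = rl.allow(t)
    if report is not None:
        print("t=%d ms: %d callbacks suppressed" % (t, report))
```

Under this model, 12515 suppressed callbacks in a roughly 5-second window corresponds to about 2500 log calls per second, which reflects how often the driver is trying to print and says nothing about traffic shaping in the switch.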

  • Bruno,

    Would you mind sharing the outcome of your tests regarding the Ethernet issue?

    We are facing the same problem here.

    Thanks.

    Oliver

  • These patches worked on the psp that came with evm 05.04.01:

    You may need this mdio fix:

    7776.davinci_mdio_fix.diff

    This is the patch that fixes the problem:

    2548.cpsw-irq-fixes-patch.diff

    A mainlining attempt was also made:

    http://www.spinics.net/lists/netdev/msg222891.html

  • Thanks Bruno,

    The patches work for my similar issue with kernel 3.2 + RT patch on the TI AM3354.

    It saved me lots of time!

    Cheers,
    John
  • Hello Bruno and John,

    I am using kernel 3.2.0 with RT patch 3.2.0-rt10 and have observed the same hang issue.

    After a long struggle debugging the driver, I found your thread.

    I applied the patch you posted, but there were conflicts and some hunks failed.

    So I made the changes manually.

    But I am still facing the same hang problem.

    Can you help me find out whether I am missing anything?