AM572x Ethernet receive issues

We are seeing an unusually high rate of TCP retries and UDP packet loss when receiving over the AM5728 CPSW interface with Linux and the processor SDK on both the EVM and our own board.

Using iperf3, we can transmit over a dedicated link at well over 800Mb/s without issues (no UDP packet loss or TCP retries), but when receiving we get unexpectedly large UDP losses (>50% sometimes) and hundreds of TCP retries over the same link. Furthermore, network problems appear to get worse when the HDMI output is active; sometimes, large file transfers to the AM5728 can fail an SHA-256 check, suggesting a possible memory corruption issue.

Below is a sample of UDP receive at 100Mb/s (10% link bandwidth) showing 36% packet loss; increasing the buffer size to 1MB (-w 1M) reduces the loss to about 15% but it really should be 0 for this case. 

On the TCP side we see hundreds of retries/sec using similar parameters. The TCP retries occur regardless of how much we increase the window size.

Adding a receive coalesce interval of 20 µs does not appear to have any effect.

The above tests are with kernel 4.1.13 (SDK 2.0.1.7). Older SDK with kernel 3.14.43 does not exhibit this issue. (3.x SDK does not boot on our EVM.)

Ethtool does not show errors for either scenario, but clearly the packets are dropped.

Is there a fix for this? 

----

[  4] local 172.30.0.30 port 55300 connected to 172.30.0.2 port 5201

[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams

[  4]   0.00-1.00   sec  8.49 MBytes  71.2 Mbits/sec  0.406 ms  465/1552 (30%)  

[  4]   1.00-2.00   sec  7.82 MBytes  65.6 Mbits/sec  0.120 ms  523/1524 (34%)  

[  4]   2.00-3.00   sec  7.54 MBytes  63.2 Mbits/sec  0.115 ms  560/1525 (37%)  

[  4]   3.00-4.00   sec  7.59 MBytes  63.6 Mbits/sec  0.419 ms  554/1525 (36%)  

[  4]   4.00-5.00   sec  8.26 MBytes  69.3 Mbits/sec  0.150 ms  476/1533 (31%)  

[  4]   5.00-6.00   sec  7.44 MBytes  62.4 Mbits/sec  0.159 ms  575/1527 (38%)  

[  4]   6.00-7.00   sec  7.20 MBytes  60.4 Mbits/sec  0.424 ms  596/1517 (39%)  

[  4]   7.00-8.00   sec  7.88 MBytes  66.1 Mbits/sec  0.187 ms  526/1535 (34%)  

[  4]   8.00-9.00   sec  7.55 MBytes  63.4 Mbits/sec  0.151 ms  558/1525 (37%)  

[  4]   9.00-10.00  sec  7.78 MBytes  65.3 Mbits/sec  0.401 ms  520/1516 (34%)  

[  4]  10.00-11.00  sec  8.02 MBytes  67.3 Mbits/sec  0.128 ms  509/1536 (33%)  

[  4]  11.00-12.00  sec  7.48 MBytes  62.7 Mbits/sec  0.147 ms  570/1527 (37%)  

[  4]  12.00-13.00  sec  6.94 MBytes  58.2 Mbits/sec  0.520 ms  637/1525 (42%)  

[  4]  13.00-14.00  sec  7.74 MBytes  64.9 Mbits/sec  0.146 ms  536/1527 (35%)  

[  4]  14.00-15.00  sec  7.77 MBytes  65.2 Mbits/sec  0.179 ms  530/1525 (35%)  

[  4]  15.00-16.00  sec  7.55 MBytes  63.3 Mbits/sec  0.431 ms  547/1513 (36%)  

^C[  4]  16.00-16.93  sec  6.95 MBytes  62.4 Mbits/sec  0.120 ms  491/1380 (36%)  

- - - - - - - - - - - - - - - - - - - - - - - - -

[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams

[  4]   0.00-16.93  sec  0.00 Bytes  0.00 bits/sec  0.120 ms  9173/25812 (36%)  
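For comparing runs, the per-interval loss figures in iperf3 output like the above can be totalled with a few lines of Python. This is a minimal sketch and not part of the original post: the function name `udp_loss` is hypothetical, and it assumes the iperf3 text layout shown here. It stops at the dashed separator so the summary line is not counted a second time.

```python
import re

# Matches the "lost/total (pct%)" tail of an iperf3 UDP interval line.
LOSS_RE = re.compile(r"(\d+)/(\d+)\s+\(\d+(?:\.\d+)?%\)")


def udp_loss(report):
    """Sum lost/total datagrams over the interval lines of an iperf3 report."""
    lost = total = 0
    for line in report.splitlines():
        # The dashed line separates the intervals from the summary.
        if line.strip().startswith("- - -"):
            break
        match = LOSS_RE.search(line)
        if match:
            lost += int(match.group(1))
            total += int(match.group(2))
    percent = 100.0 * lost / total if total else 0.0
    return lost, total, percent


# Two interval lines and the summary line copied from the report above.
SAMPLE = """\
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  8.49 MBytes  71.2 Mbits/sec  0.406 ms  465/1552 (30%)
[  4]   1.00-2.00   sec  7.82 MBytes  65.6 Mbits/sec  0.120 ms  523/1524 (34%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   0.00-16.93  sec  0.00 Bytes  0.00 bits/sec  0.120 ms  9173/25812 (36%)
"""

lost, total, percent = udp_loss(SAMPLE)
print(f"{lost}/{total} datagrams lost ({percent:.1f}%)")
```

Feeding the full report instead of the two-line sample should reproduce the 9173/25812 figure from the summary line.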

  • Hi,

    The Ethernet experts have been notified. They will respond here. Please note that delays are possible due to upcoming holidays in the USA.
  • Apps will be posting their comments later today.
  • This sounds like the RX packets are being dropped due to descriptor exhaustion. Internal RAM only holds a fixed number of descriptors, about 128 total for both RX and TX. When you run ethtool -S eth0, is the Rx DMA Overruns count nonzero?

    If so, the RX descriptors are being exhausted. See the bindings document located here:

    https://git.ti.com/ti-linux-kernel/ti-linux-kernel/blobs/master/Documentation/devicetree/bindings/net/cpsw.txt    

    It has a field called bd_ram_size. Increase this size from 0x2000 to 0x20000; this will move the descriptors from internal RAM to external DDR and reduce the possibility of overruns.

    The bd_ram_size can be modified in the processor DTSI or in your board DTS file. I recommend the latter; example here:

    &mac {
    bd_ram_size = <0x2000>;
    };
  • Unfortunately, changing the bd_ram_size to 0x20000 causes a kernel crash when the network is first used (see attached traceback.) Is there perhaps some other setting or fix required for this to work?

    [   21.118124] ------------[ cut here ]------------

    [   21.122781] WARNING: CPU: 0 PID: 1048 at drivers/bus/omap_l3_noc.c:147 l3_interrupt_handler+0x248/0x34c()

    [   21.132388] 44000000.ocp:L3 Standard Error: MASTER GMAC_SW TARGET GPMC (Read Link): At Address: 0x48496000 : Data Access in User mode during Functional access

    [   21.146616] Modules linked in: usb_f_acm u_serial g_serial libcomposite usb_storage xhci_plat_hcd xhci_hcd usbcore dwc3 rpmsg_rpc udc_core virtio_rpmsg_bus snd_soc_pcm5102a rtc_palmas phy_omap_usb2 extcon_palmas snd_soc_tlv320aic3x rtc_ds1307 omap_wdt dwc3_omap rtc_omap omap_remoteproc remoteproc virtio extcon_usb_gpio virtio_ring extcon

    [   21.176847] CPU: 0 PID: 1048 Comm: rep2 Not tainted 4.1.13-test-gb5be33b #8

    [   21.184184] Hardware name: Generic DRA74X (Flattened Device Tree)

    [   21.190300] Backtrace: 

    [   21.192772] [<c0012f78>] (dump_backtrace) from [<c001319c>] (show_stack+0x18/0x1c)

    [   21.200371]  r7:c03655e8 r6:00000093 r5:c09b1024 r4:00000000

    [   21.206088] [<c0013184>] (show_stack) from [<c06bbaf8>] (dump_stack+0x9c/0xdc)

    [   21.213343] [<c06bba5c>] (dump_stack) from [<c0039a38>] (warn_slowpath_common+0x88/0xb8)

    [   21.221465]  r5:00000009 r4:eca05e00

    [   21.225072] [<c00399b0>] (warn_slowpath_common) from [<c0039aa0>] (warn_slowpath_fmt+0x38/0x40)

    [   21.233804]  r8:c08af9c0 r7:00000004 r6:ee1af190 r5:c08af4cc r4:c08af57c

    [   21.240567] [<c0039a6c>] (warn_slowpath_fmt) from [<c03655e8>] (l3_interrupt_handler+0x248/0x34c)

    [   21.249473]  r3:ee1af000 r2:c08af57c

    [   21.253071]  r4:80080001

    [   21.255624] [<c03653a0>] (l3_interrupt_handler) from [<c0079f54>] (handle_irq_event_percpu+0x80/0x13c)

    [   21.264966]  r10:c09dcfb5 r9:ee1a9300 r8:00000017 r7:00000000 r6:00000000 r5:ee1a9360

    [   21.272860]  r4:ee1af500

    [   21.275413] [<c0079ed4>] (handle_irq_event_percpu) from [<c007a054>] (handle_irq_event+0x44/0x64)

    [   21.284318]  r10:9eea8bc8 r9:9e4fd340 r8:ee008000 r7:00000000 r6:ee1af500 r5:ee1a9360

    [   21.292212]  r4:ee1a9300

    [   21.294763] [<c007a010>] (handle_irq_event) from [<c007cd80>] (handle_fasteoi_irq+0xb8/0x17c)

    [   21.303319]  r7:00000000 r6:c099713c r5:ee1a9360 r4:ee1a9300

    [   21.309030] [<c007ccc8>] (handle_fasteoi_irq) from [<c00795b8>] (generic_handle_irq+0x34/0x44)

    [   21.317674]  r7:00000000 r6:00000000 r5:00000017 r4:00000017

    [   21.323386] [<c0079584>] (generic_handle_irq) from [<c0079890>] (__handle_domain_irq+0x64/0xbc)

    [   21.332117]  r5:00000017 r4:c098cd38

    [   21.335721] [<c007982c>] (__handle_domain_irq) from [<c00094ac>] (gic_handle_irq+0x2c/0x64)

    [   21.344103]  r9:9e4fd340 r8:30c5387d r7:fa212000 r6:eca05fb0 r5:c099294c r4:fa21200c

    [   21.351915] [<c0009480>] (gic_handle_irq) from [<c06c1aa8>] (__irq_usr+0x48/0x60)

    [   21.359426] Exception stack(0xeca05fb0 to 0xeca05ff8)

    [   21.364495] 5fa0:                                     9e4fd26c 9eea8bc8 9fbcc93c 00000000

    [   21.372706] 5fc0: 9e4fd26c 9eea8bc8 9fbcc93c 9fa941bc 00000000 9e4fd340 9eea8bc8 0000003e

    [   21.380916] 5fe0: 0009dee0 9e4fd210 b2c1aa39 b2c17ea0 60000030 ffffffff

    [   21.387553]  r7:30c5387d r6:ffffffff r5:60000030 r4:b2c17ea0

    [   21.393261] ---[ end trace 975205f49dcfbb88 ]---

    [   21.397952] ------------[ cut here ]------------

    [   21.398138] Unhandled fault: asynchronous external abort (0x1211) at 0x00000000

    [   21.398140] pgd = ecca6340

    [   21.398148] [00000000] *pgd=ac930003, *pmd=94c22003, *pte=00000000

    [   21.398154] Internal error: : 1211 [#1] PREEMPT SMP ARM

    [   21.398202] Modules linked in: usb_f_acm u_serial g_serial libcomposite usb_storage xhci_plat_hcd xhci_hcd usbcore dwc3 rpmsg_rpc udc_core virtio_rpmsg_bus snd_soc_pcm5102a rtc_palmas phy_omap_usb2 extcon_palmas snd_soc_tlv320aic3x rtc_ds1307 omap_wdt dwc3_omap rtc_omap omap_remoteproc remoteproc virtio extcon_usb_gpio virtio_ring extcon

    [   21.398207] CPU: 1 PID: 982 Comm: klogd Tainted: G        W       4.1.13-test-gb5be33b #8

    [   21.398210] Hardware name: Generic DRA74X (Flattened Device Tree)

    [   21.398213] task: d4dbde00 ti: dfc1a000 task.ti: dfc1a000

    [   21.398220] PC is at vfp_reload_hw+0x1c/0x44

    [   21.398224] LR is at __und_usr_fault_32+0x0/0x8

    [   21.398228] pc : [<c000ad80>]    lr : [<c06c1c40>]    psr: 600f0013

    [   21.398228] sp : dfc1bfb0  ip : 00000000  fp : 00000001

    [   21.398232] r10: dfc1a178  r9 : c06c1ca0  r8 : 00000b00

    [   21.398235] r7 : 00000001  r6 : dfc1a04c  r5 : 00000002  r4 : ecad80f8

    [   21.398238] r3 : c09e0068  r2 : b6efb436  r1 : 40000000  r0 : ed2d8b02

    [   21.398242] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user

    [   21.398245] Control: 30c5387d  Table: acca6340  DAC: 55555555

    [   21.398248] Process klogd (pid: 982, stack limit = 0xdfc1a218)

    [   21.398251] Stack: (0xdfc1bfb0 to 0xdfc1c000)

    [   21.398255] bfa0:                                     000bf008 b6f9dde4 be9ce9c8 000bbc82

    [   21.398261] bfc0: 000bf008 be9ce9e4 b6fd54c0 b6fb7000 ffffffff 00000000 000a8e88 b6fd54c0

    [   21.398265] bfe0: 00000000 be9ce994 b6f021d9 b6efb436 200f0030 ffffffff 00000000 00000000

    [   21.398270] Backtrace: invalid frame pointer 0x00000001

    [   21.398275] Code: ecba0b20 eef75a10 e205500f e3550002 (0cfa0b20) 

    [   21.398280] ---[ end trace 975205f49dcfbb89 ]---

    [   21.398286] note: klogd[982] exited with preempt_count 1

    [   21.581885] WARNING: CPU: 0 PID: 1048 at drivers/bus/omap_l3_noc.c:147 l3_interrupt_handler+0x248/0x34c()

    [   21.591491] 44000000.ocp:L3 Custom Error: MASTER MPU TARGET L4_PER2_P3 (Idle): Data Access in Supervisor mode during Functional access

    [   21.603622] Modules linked in: usb_f_acm u_serial g_serial libcomposite usb_storage xhci_plat_hcd xhci_hcd usbcore dwc3 rpmsg_rpc udc_core virtio_rpmsg_bus snd_soc_pcm5102a rtc_palmas phy_omap_usb2 extcon_palmas snd_soc_tlv320aic3x rtc_ds1307 omap_wdt dwc3_omap rtc_omap omap_remoteproc remoteproc virtio extcon_usb_gpio virtio_ring extcon

    [   21.633836] CPU: 0 PID: 1048 Comm: rep2 Tainted: G      D W       4.1.13-test-gb5be33b #8

    [   21.642393] Hardware name: Generic DRA74X (Flattened Device Tree)

    [   21.648506] Backtrace: 

    [   21.650972] [<c0012f78>] (dump_backtrace) from [<c001319c>] (show_stack+0x18/0x1c)

    [   21.658571]  r7:c03655e8 r6:00000093 r5:c09b1024 r4:00000000

    [   21.664284] [<c0013184>] (show_stack) from [<c06bbaf8>] (dump_stack+0x9c/0xdc)

    [   21.671538] [<c06bba5c>] (dump_stack) from [<c0039a38>] (warn_slowpath_common+0x88/0xb8)

    [   21.679659]  r5:00000009 r4:eca05cf8

    [   21.683263] [<c00399b0>] (warn_slowpath_common) from [<c0039aa0>] (warn_slowpath_fmt+0x38/0x40)

    [   21.691994]  r8:c08af418 r7:00000000 r6:ee1af190 r5:c08af4d8 r4:c08af57c

    [   21.698758] [<c0039a6c>] (warn_slowpath_fmt) from [<c03655e8>] (l3_interrupt_handler+0x248/0x34c)

    [   21.707664]  r3:ee1af000 r2:c08af57c

    [   21.711262]  r4:80080003

    [   21.713815] [<c03653a0>] (l3_interrupt_handler) from [<c0079f54>] (handle_irq_event_percpu+0x80/0x13c)

    [   21.723157]  r10:c09dcfb5 r9:ee1a9300 r8:00000017 r7:00000000 r6:00000000 r5:ee1a9360

    [   21.731052]  r4:ee1af500

    [   21.733604] [<c0079ed4>] (handle_irq_event_percpu) from [<c007a054>] (handle_irq_event+0x44/0x64)

    [   21.742511]  r10:9eea8bc8 r9:9e4fd340 r8:ee008000 r7:00000000 r6:ee1af500 r5:ee1a9360

    [   21.750405]  r4:ee1a9300

    [   21.752955] [<c007a010>] (handle_irq_event) from [<c007cd80>] (handle_fasteoi_irq+0xb8/0x17c)

    [   21.761511]  r7:00000000 r6:c099713c r5:ee1a9360 r4:ee1a9300

    [   21.767223] [<c007ccc8>] (handle_fasteoi_irq) from [<c00795b8>] (generic_handle_irq+0x34/0x44)

    [   21.775866]  r7:00000000 r6:eca05fb0 r5:00000017 r4:00000017

    [   21.781581] [<c0079584>] (generic_handle_irq) from [<c0079890>] (__handle_domain_irq+0x64/0xbc)

    [   21.790312]  r5:00000017 r4:c098cd38

    [   21.793916] [<c007982c>] (__handle_domain_irq) from [<c00094ac>] (gic_handle_irq+0x2c/0x64)

    [   21.802298]  r9:9e4fd340 r8:ee008000 r7:fa212000 r6:eca05ea8 r5:c099294c r4:fa21200c

    [   21.810111] [<c0009480>] (gic_handle_irq) from [<c06c17c0>] (__irq_svc+0x40/0x74)

    [   21.817620] Exception stack(0xeca05ea8 to 0xeca05ef0)

    [   21.822691] 5ea0:                   c06c8154 00000000 c09e12c0 00000000 00000082 00000013

    [   21.830901] 5ec0: 00000000 00000000 ee008000 9e4fd340 9eea8bc8 eca05f4c c09e12c0 eca05ef0

    [   21.839110] 5ee0: c003cf18 c003cfb0 20000113 ffffffff

    [   21.844177]  r7:eca05edc r6:ffffffff r5:20000113 r4:c003cfb0

    [   21.849890] [<c003cef8>] (__do_softirq) from [<c003d438>] (irq_exit+0xb8/0x120)

    [   21.857225]  r10:9eea8bc8 r9:9e4fd340 r8:ee008000 r7:00000000 r6:00000000 r5:00000013

    [   21.865120]  r4:c098cd38

    [   21.867669] [<c003d380>] (irq_exit) from [<c0079894>] (__handle_domain_irq+0x68/0xbc)

    [   21.875528]  r5:00000013 r4:c098cd38

    [   21.879131] [<c007982c>] (__handle_domain_irq) from [<c00094ac>] (gic_handle_irq+0x2c/0x64)

    [   21.887513]  r9:9e4fd340 r8:30c5387d r7:fa212000 r6:eca05fb0 r5:c099294c r4:fa21200c

    [   21.895324] [<c0009480>] (gic_handle_irq) from [<c06c1aa8>] (__irq_usr+0x48/0x60)

    [   21.902834] Exception stack(0xeca05fb0 to 0xeca05ff8)

    [   21.907904] 5fa0:                                     9e4fd26c 9eea8bc8 9fbcc93c 00000000

    [   21.916114] 5fc0: 9e4fd26c 9eea8bc8 9fbcc93c 9fa941bc 00000000 9e4fd340 9eea8bc8 0000003e

    [   21.924324] 5fe0: 0009dee0 9e4fd210 b2c1aa39 b2c17ea0 60000030 ffffffff

    [   21.930961]  r7:30c5387d r6:ffffffff r5:60000030 r4:b2c17ea0

    [   21.936666] ---[ end trace 975205f49dcfbb8a ]---

  • Hey Chris,

    Sorry you are having trouble. I'm trying to jump in and provide some help.

    I wanted to let you know that I have been unable to replicate your results. I've got an HDMI monitor plugged into an AM572x EVM running PLSDK 2.1.0.7 (kernel version 4.1.13). I ran iperf server on the EVM and client on my Linux box and here are the results:

    sitara@sitara67-OptiPlex-745:~/ti-processor-sdk-linux-am335x-evm-02.00.01.07/board-support/u-boo
    015.07+gitAUTOINC+5922e09363$ iperf -c 192.168.2.119 -u -b 100M -i 1
    ------------------------------------------------------------
    Client connecting to 192.168.2.119, UDP port 5001
    Sending 1470 byte datagrams
    UDP buffer size:  224 KByte (default)
    ------------------------------------------------------------
    [  3] local 192.168.2.1 port 44009 connected with 192.168.2.119 port 5001
    [ ID] Interval       Transfer     Bandwidth
    [  3]  0.0- 1.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  1.0- 2.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  2.0- 3.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  3.0- 4.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  4.0- 5.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  5.0- 6.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  6.0- 7.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  7.0- 8.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  8.0- 9.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  9.0-10.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  0.0-10.0 sec   120 MBytes   101 Mbits/sec
    [  3] Sent 85471 datagrams
    [  3] Server Report:
    [  3]  0.0-10.0 sec   120 MBytes   101 Mbits/sec   0.004 ms    0/85470 (0%)
    [  3]  0.0-10.0 sec  1 datagrams received out-of-order
    sitara@sitara67-OptiPlex-745:~/ti-processor-sdk-linux-am335x-evm-02.00.01.07/board-support/u-boo
    015.07+gitAUTOINC+5922e09363$ iperf -c 192.168.2.119 -u -b 100M -i 1
    ------------------------------------------------------------
    Client connecting to 192.168.2.119, UDP port 5001
    Sending 1470 byte datagrams
    UDP buffer size:  224 KByte (default)
    ------------------------------------------------------------
    [  3] local 192.168.2.1 port 59366 connected with 192.168.2.119 port 5001
    [ ID] Interval       Transfer     Bandwidth
    [  3]  0.0- 1.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  1.0- 2.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  2.0- 3.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  3.0- 4.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  4.0- 5.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  5.0- 6.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  6.0- 7.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  7.0- 8.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  8.0- 9.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  9.0-10.0 sec  12.0 MBytes   101 Mbits/sec
    [  3]  0.0-10.0 sec   120 MBytes   101 Mbits/sec
    [  3] Sent 85471 datagrams
    [  3] Server Report:
    [  3]  0.0-10.0 sec   120 MBytes   101 Mbits/sec   0.004 ms    0/85470 (0%)


    As you can see, I'm achieving higher bandwidth and no packet loss at 100M. If I move to 200M, I do see a little packet loss. But my results are not nearly as bad as yours.

    You can see the client commands I've used above. If there is something else you would like me to try, I will be happy to.

  • That's odd; I don't understand what the variable here would be other than perhaps iperf2 vs. iperf3, but we are seeing receive-side network problems on all of our AM5728-based systems running the 4.1.13 kernel - iperf3 just gives us a useful metric that seems to showcase the problem. Reviewing test conditions:

    1. EVM running from clean SD card burned from default SDK 2.0.1.7 pre-built demo distribution filesystem using supplied create-sdcard.sh script.
    2. Dedicated Ethernet connection to PC client (running CentOS 7) through upper (away from board) network port on EVM.
    3. EVM has camera module and LCD attached.
    4. Test using iperf3 (3.1.3, specifically), built from source on CentOS 7 and with TI SDK toolkit (iperf3 has better reporting than the iperf 2 provided in the SDK; also, CentOS client-side iperf 2 usually fails when exiting with "did not receive ack of last datagram after n tries"/"connection refused" and won't give packet loss info on my setup; this appears to be due to a long-standing iperf 2 bug).
    5. Since we can run the test error-free with the same EVM hardware using an earlier SDK, the issue appears to be with SDK 2.0.1.7 specifically.

    Perhaps you could try testing with iperf3? Or is there some other element of the test that is different on your side, or could have an influence here which I haven't mentioned above?

    Thanks, I appreciate whatever help you can provide.

  • Is there a fix for the omap_l3_noc crash that results from increasing bd_ram_size? We are certainly seeing receive DMA overruns reported by ethtool, so if the DMA descriptor RAM size could be increased it may fix our primary issue.

  • Could you please post the ethtool -S output? I am looking for RX CRC errors too.

    I am not sure this is a descriptor exhaustion issue. What is the version of the TI EVM that you are using?
  • Attached below is more test output, including ethtool -S. I don't see CRC errors, although there are lots of RX DMA overruns.

    Regarding the EVM version, I don't know where to find that info. I do know that we had to do some cuts/jumpers for the serial connection because of an SDK change.

    Accepted connection from 192.168.254.200, port 45938

    [  5] local 192.168.254.4 port 5201 connected to 192.168.254.200 port 38431

    [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams

    [  5]   0.00-1.00   sec  1.14 MBytes  9.57 Mbits/sec  151734.699 ms  1220/1366 (89%)

    [  5]   1.00-2.00   sec  1.47 MBytes  12.3 Mbits/sec  0.947 ms  1222/1410 (87%)

    [  5]   2.00-3.00   sec  1.23 MBytes  10.3 Mbits/sec  0.632 ms  1502/1659 (91%)

    [  5]   3.00-4.00   sec  1.65 MBytes  13.8 Mbits/sec  0.124 ms  1182/1393 (85%)

    [  5]   4.00-5.00   sec  1.38 MBytes  11.5 Mbits/sec  0.625 ms  1352/1528 (88%)

    [  5]   5.00-6.00   sec  1.29 MBytes  10.8 Mbits/sec  0.067 ms  1359/1524 (89%)

    [  5]   6.00-7.00   sec  1.22 MBytes  10.2 Mbits/sec  0.111 ms  1371/1527 (90%)

    [  5]   7.00-8.00   sec  1.29 MBytes  10.8 Mbits/sec  0.181 ms  1360/1525 (89%)

    [  5]   8.00-9.00   sec  1.23 MBytes  10.4 Mbits/sec  0.067 ms  1368/1526 (90%)

    [  5]   9.00-10.00  sec  1.45 MBytes  12.2 Mbits/sec  0.136 ms  1339/1525 (88%)

    [  5]  10.00-10.04  sec  0.00 Bytes  0.00 bits/sec  0.136 ms  0/0 (0%)

    - - - - - - - - - - - - - - - - - - - - - - - - -

    [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams

    [  5]   0.00-10.04  sec  0.00 Bytes  0.00 bits/sec  0.136 ms  13275/14983 (89%)

    -----------------------------------------------------------

    Server listening on 5201

    -----------------------------------------------------------

    iperf3: interrupt - the server has terminated

    ^C

    root@am57xx-evm:~# ethtool -S eth0

    NIC statistics:

         Good Rx Frames: 822802

         Broadcast Rx Frames: 2

         Multicast Rx Frames: 0

         Pause Rx Frames: 0

         Rx CRC Errors: 0

         Rx Align/Code Errors: 0

         Oversize Rx Frames: 0

         Rx Jabbers: 0

         Undersize (Short) Rx Frames: 0

         Rx Fragments: 0

         Rx Octets: 1238415402

         Good Tx Frames: 14599

         Broadcast Tx Frames: 51

         Multicast Tx Frames: 20

         Pause Tx Frames: 0

         Deferred Tx Frames: 0

         Collisions: 0

         Single Collision Tx Frames: 0

         Multiple Collision Tx Frames: 0

         Excessive Collisions: 0

         Late Collisions: 0

         Tx Underrun: 0

         Carrier Sense Errors: 0

         Tx Octets: 1071655

         Rx + Tx 64 Octet Frames: 31

         Rx + Tx 65-127 Octet Frames: 14562

         Rx + Tx 128-255 Octet Frames: 9

         Rx + Tx 256-511 Octet Frames: 120

         Rx + Tx 512-1023 Octet Frames: 15299

         Rx + Tx 1024-Up Octet Frames: 807380

         Net Octets: 1239487057

         Rx Start of Frame Overruns: 15298

         Rx Middle of Frame Overruns: 0

         Rx DMA Overruns: 15298

         Rx DMA chan: head_enqueue: 1

         Rx DMA chan: tail_enqueue: 807567

         Rx DMA chan: pad_enqueue: 0

         Rx DMA chan: misqueued: 24

         Rx DMA chan: desc_alloc_fail: 0

         Rx DMA chan: pad_alloc_fail: 0

         Rx DMA chan: runt_receive_buf: 0

         Rx DMA chan: runt_transmit_buf: 0

         Rx DMA chan: empty_dequeue: 0

         Rx DMA chan: busy_dequeue: 381499

         Rx DMA chan: good_dequeue: 807504

         Rx DMA chan: requeue: 204

         Rx DMA chan: teardown_dequeue: 0

         Tx DMA chan: head_enqueue: 7813

         Tx DMA chan: tail_enqueue: 6786

         Tx DMA chan: pad_enqueue: 0

         Tx DMA chan: misqueued: 6786

         Tx DMA chan: desc_alloc_fail: 0

         Tx DMA chan: pad_alloc_fail: 0

         Tx DMA chan: runt_receive_buf: 0

         Tx DMA chan: runt_transmit_buf: 25

         Tx DMA chan: empty_dequeue: 7751

         Tx DMA chan: busy_dequeue: 29

         Tx DMA chan: good_dequeue: 14599

         Tx DMA chan: requeue: 7813

         Tx DMA chan: teardown_dequeue: 0
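The overrun counters are easy to miss in a listing of this length. As a rough sketch (not from the thread), the following Python parses `ethtool -S` text and reports only the nonzero Rx overrun/error counters. The helper names `parse_ethtool_stats` and `nonzero_rx_problems` are hypothetical, and the parsing assumes the `name: value` layout shown above. The sample uses a handful of lines copied from that output.

```python
def parse_ethtool_stats(text):
    """Turn 'name: value' lines from `ethtool -S` into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        # Split on the LAST colon, since some names contain one
        # (for example "Rx DMA chan: busy_dequeue").
        name, sep, value = line.strip().rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def nonzero_rx_problems(stats):
    """Return the Rx overrun/error counters that are not zero."""
    return {
        name: count
        for name, count in stats.items()
        if count and name.startswith("Rx") and ("Overrun" in name or "Error" in name)
    }


SAMPLE = """\
NIC statistics:
     Good Rx Frames: 822802
     Rx CRC Errors: 0
     Rx Start of Frame Overruns: 15298
     Rx Middle of Frame Overruns: 0
     Rx DMA Overruns: 15298
     Rx DMA chan: busy_dequeue: 381499
"""

stats = parse_ethtool_stats(SAMPLE)
for name, count in nonzero_rx_problems(stats).items():
    print(f"{name}: {count}")
```

Taking a snapshot before and after a test run and comparing the two dicts is likely more useful than a single reading, since the counters are not reset between runs.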

  • Could you post a picture of both sides of the board?
    That data loss is really high for the bit rate being used.
    Could you also post the command lines used for both the client and server sides of iperf?
    Could you also post the plain ethtool eth0 results, please?
  • Board pictures are attached.

    Server command (on board):  iperf3 -s

    Client command (on PC, attached via dedicated gig-E link): iperf3 -c 192.168.254.4 -u -b 100M

    Ethtool output:

    root@am57xx-evm:~# ethtool eth0

    Settings for eth0:

            Supported ports: [ TP MII ]

            Supported link modes:   10baseT/Half 10baseT/Full

                                    100baseT/Half 100baseT/Full

                                    1000baseT/Half 1000baseT/Full

            Supported pause frame use: Symmetric

            Supports auto-negotiation: Yes

            Advertised link modes:  10baseT/Half 10baseT/Full

                                    100baseT/Half 100baseT/Full

                                    1000baseT/Half 1000baseT/Full

            Advertised pause frame use: Symmetric

            Advertised auto-negotiation: Yes

            Link partner advertised link modes:  10baseT/Half 10baseT/Full

                                                 100baseT/Half 100baseT/Full

                                                 1000baseT/Full

            Link partner advertised pause frame use: Symmetric Receive-only

            Link partner advertised auto-negotiation: Yes

            Speed: 1000Mb/s

            Duplex: Full

            Port: MII

            PHYAD: 1

            Transceiver: external

            Auto-negotiation: on

            Supports Wake-on: d

            Wake-on: d

            Current message level: 0x00000000 (0)

            Link detected: yes

  • Chris,

    We put iperf3 on our board and can confirm that there is definitely a difference between iperf (2) and iperf3. Our guess is iperf3 is more "bursty" in how it sends out the packets. Google searches seem to reveal this is occurring in a lot of places.

    When we use the -l 65507 option, the number of dropped packets improves quite a bit. This seems to confirm the burstiness aspect. So, if it is more bursty, larger buffers are needed to handle it.

    The real question is what does your system expect? Do you need to design for this level of burstiness? There are two places to make changes: the kernel itself and the descriptors that we've already mentioned. The kernel can be tuned with sysctl changes, and you may have already done that.

    For the descriptor, we are going to try to backport a patchset we put into 4.4 to make the number of descriptors tunable. We will let you know how we are progressing tomorrow.
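For context on the -l 65507 option mentioned above: 65507 bytes is the largest UDP payload that fits in a single IPv4 datagram, so each datagram is split into many Ethernet frames on the wire. The sketch below works through that arithmetic. It assumes a 1500-byte MTU and a 20-byte IPv4 header without options, neither of which is stated in the thread, and the function names are hypothetical.

```python
import math

IPV4_MAX_TOTAL = 65535  # maximum IPv4 total length in bytes
IPV4_HEADER = 20        # ASSUMPTION: IPv4 header without options
UDP_HEADER = 8


def max_udp_payload():
    """Largest UDP payload that fits in one IPv4 datagram."""
    return IPV4_MAX_TOTAL - IPV4_HEADER - UDP_HEADER


def fragments_per_datagram(payload, mtu=1500):
    """Number of IP fragments needed for one UDP datagram (ASSUMED MTU)."""
    # Fragment data must be a multiple of 8 bytes, except in the last one.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return math.ceil((payload + UDP_HEADER) / per_fragment)


print(max_udp_payload())
print(fragments_per_datagram(65507))
print(fragments_per_datagram(1470))
```

Under these assumptions a 65507-byte datagram becomes 45 frames, while the 1470-byte datagrams used in the earlier iperf run each fit in a single frame.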

  • Our system must not drop packets, regardless of burstiness; it's a high-end streaming media device and has to run multiple video streams with the best possible reliability both in and out over the network. Some of the libraries we are using don't give us the ability to easily tune buffer sizes, so we need to be able to tune the kernel to use as much memory as necessary to buffer incoming/outgoing data.

    Thanks for your help. We've done some sysfs tuning without much effect, so please do send me any info/pointers/patches you've got so we can resolve this ASAP, as it's a "can't ship like this" problem for the product.
  • Any progress towards availability of a patch for this issue?

  • Attached is a patch that I tested on the kernel you are using that will enable functionality to move RX descriptors off chip. The attached file is a .txt because the .patch extension prevents posting.

    From a97fce86a9e48bc776cd8f9c89489f472b185c1e Mon Sep 17 00:00:00 2001
    From: Mugunthan V N <mugunthanvnm@ti.com>
    Date: Tue, 22 Sep 2015 19:16:38 +0530
    Subject: [PATCH 1/2] driver: net: cpsw: add no_bd_ram dt parsing
    
    cpdma is capable of placing the dma descriptors in ddr using
    dma_alloc_coherent() when the internal bd ram size is not enough.
    To utilize this feature parse the DT parameter "no_bd_ram" and
    pass it to cpdma.
    
    Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
    ---
     drivers/net/ethernet/ti/cpsw.c | 4 ++++
     drivers/net/ethernet/ti/cpsw.h | 1 +
     2 files changed, 5 insertions(+)
    
    diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
    index b536b4c..3ec3e1f 100644
    --- a/drivers/net/ethernet/ti/cpsw.c
    +++ b/drivers/net/ethernet/ti/cpsw.c
    @@ -1987,6 +1987,8 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
     	}
     	data->ale_entries = prop;
     
    +	data->no_bd_ram = of_property_read_bool(node, "no_bd_ram");
    +
     	if (of_property_read_u32(node, "bd_ram_size", &prop)) {
     		dev_err(&pdev->dev, "Missing bd_ram_size property in the DT.\n");
     		return -EINVAL;
    @@ -2321,6 +2323,8 @@ static int cpsw_probe(struct platform_device *pdev)
     	dma_params.desc_mem_size	= data->bd_ram_size;
     	dma_params.desc_align		= 16;
     	dma_params.has_ext_regs		= true;
    +	if (data->no_bd_ram)
    +		dma_params.desc_mem_phys = 0;
     	dma_params.desc_hw_addr         = dma_params.desc_mem_phys;
     
     	priv->dma = cpdma_ctlr_create(&dma_params);
    diff --git a/drivers/net/ethernet/ti/cpsw.h b/drivers/net/ethernet/ti/cpsw.h
    index ca90efa..b654ac2 100644
    --- a/drivers/net/ethernet/ti/cpsw.h
    +++ b/drivers/net/ethernet/ti/cpsw.h
    @@ -33,6 +33,7 @@ struct cpsw_platform_data {
     	u32	cpts_clock_mult;  /* convert input clock ticks to nanoseconds */
     	u32	cpts_clock_shift; /* convert input clock ticks to nanoseconds */
     	u32	ale_entries;	/* ale table size */
    +	bool	no_bd_ram;	/* set if cpsw bd ram should not be used */
     	u32	bd_ram_size;  /*buffer descriptor ram size */
     	u32	rx_descs;	/* Number of Rx Descriptios */
     	u32	mac_control;	/* Mac control register */
    -- 
    1.9.1
    
    

    Here is the change that is necessary in the board dts file to enable this capability. I applied this to the am572x-idk.dts as a test:

    &mac {
        no_bd_ram = <1>;
        bd_ram_size = <0x40000>;
        rx_descs = <2048>;
    };
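To get a feel for what these bd_ram_size values mean in descriptors, the following sketch converts a pool size into a descriptor slot count. The 32-byte slot size is an assumption and is not stated anywhere in this thread; it is based on the 16-byte desc_align visible in the patch context and may differ in your kernel. The function name is hypothetical.

```python
# ASSUMPTION: each descriptor occupies one 32-byte slot in the pool
# (descriptor structure rounded up to the 16-byte desc_align).
DESC_SLOT_BYTES = 32


def descriptor_slots(bd_ram_size, slot_bytes=DESC_SLOT_BYTES):
    """Number of descriptor slots that fit in a pool of bd_ram_size bytes."""
    return bd_ram_size // slot_bytes


RX_DESCS = 2048  # rx_descs value from the DTS fragment above

for size in (0x2000, 0x20000, 0x40000):
    slots = descriptor_slots(size)
    fits = "yes" if slots >= RX_DESCS else "no"
    print(f"bd_ram_size={size:#x}: {slots} slots, room for {RX_DESCS} RX descriptors: {fits}")
```

Under this assumption the default 0x2000 pool is far too small for 2048 RX descriptors, while 0x40000 leaves room for TX descriptors as well.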

    Some changes to network parameters via sysctl will also be necessary; otherwise UDP packets will be dropped at the network layer. Here are the commands I used. They may work for your environment as well, though you may need to tune these values further:

    sysctl -w net.core.netdev_max_backlog=20000

    sysctl -w net.ipv4.udp_mem='17565 87380 50331648'

    sysctl -w net.core.rmem_max=50331648

    sysctl -w net.core.rmem_default=50331648

    sysctl -w net.ipv4.route.flush=1

    The 4.4 kernel in the 3.01.00.06 PLSDK has support for moving the descriptors to RAM. Backporting the support in the current SDK is not really feasible at this point but the attached will enable that capability.
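When tuning the sysctl values above, it can help to check what is currently in effect and to keep the units in mind. The sketch below is illustrative only and uses hypothetical helper names. It maps each key to its /proc/sys path and converts net.ipv4.udp_mem, which the kernel counts in pages rather than bytes, into bytes. The 4096-byte page size is an assumption about the target system.

```python
import os

# Values copied from the sysctl commands above (route.flush is write-only
# and is left out).
SETTINGS = {
    "net.core.netdev_max_backlog": "20000",
    "net.ipv4.udp_mem": "17565 87380 50331648",
    "net.core.rmem_max": "50331648",
    "net.core.rmem_default": "50331648",
}


def proc_path(key):
    """Map a sysctl key to its file under /proc/sys."""
    return "/proc/sys/" + key.replace(".", "/")


def udp_mem_bytes(value, page_size=4096):
    """Convert the three udp_mem fields from pages to bytes (ASSUMED page size)."""
    return [int(field) * page_size for field in value.split()]


def current_value(key):
    """Return the running value, or None if it cannot be read on this system."""
    try:
        with open(proc_path(key)) as handle:
            return handle.read().strip()
    except OSError:
        return None


for key, wanted in SETTINGS.items():
    print(f"{key}: wanted={wanted!r} current={current_value(key)!r}")

print("udp_mem in bytes:", udp_mem_bytes(SETTINGS["net.ipv4.udp_mem"]))
```

Note that 50331648 is 48 MiB when read as bytes (as for rmem_max), but as a udp_mem page count it is far larger than the memory on the board, so that limit is effectively disabled.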

  • The patched kernel with the DTS and sysctl changes no longer exhibits packet loss issues, but the network then stops working entirely after about 15 to 30 minutes of continuous traffic. Apps then see network syscall errors and the interface stops responding externally (e.g. no response to ARP). There are no kernel messages; all network activity stops although the interface status appears OK. Issuing 'ifconfig eth0 down' then gives a kernel warning/traceback originating in davinci_cpdma.c:896 (it's a WARN(!timeout) in cpdma_chan_stop()).

    Rebooting is the only way to recover to a working interface again.

    Can you provide a patch that works reliably with this kernel? Moving to the 3.x SDK with the newer kernel is not a short-term option for us.

    Note: The provided patch did not match up well with the line numbers in the cpsw driver code in our 4.1.13 kernel from the 02.00.01.07 TI SDK (off by hundreds of lines) although I was able to apply it via context.

  • Is your test running on the TI EVM?
    With modifications from the earlier post I have been running iperf2 at 600 Mbps receive without overruns, errors or the link stopping.
    When the link stops responding could you post what devmem2 for 0x484848e0 is? This is the free buffer count on receive.
    The test I am doing is probably not as demanding as your test. Can you describe the network traffic that is being run against the board?
  • This test is running on our board, which has an EVM-based design plus some video capture hardware etc. for our application. The board itself has been running fine on the 2.0.1.7 base SDK/kernel with the exception of this network issue for many months now. I'll work to reproduce it on the EVM if that will help you diagnose, but I can't run the same code there due to I/O limitations.

    When the network dies, all of the receive buffers appear to be free:
    root@salami:~# devmem2 0x484848e0
    /dev/mem opened.
    Memory mapped at address 0xb6fc0000.
    Read at address 0x484848E0 (0xb6fc08e0): 0x00000800

    However, no traffic can be received or sent (wireshark sees no traffic at all at the other end; any attempt to send/receive packets e.g. via ping just hangs).
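For logging this register repeatedly, the hexadecimal value can be pulled out of the devmem2 output and printed in decimal. This is a minimal sketch, assuming the "Read at address ...: 0x..." line format shown in the output above; the function name devmem2_value is hypothetical.

```shell
# Read devmem2 output on stdin and print the register value in decimal.
# Assumes the value is on a line of the form:
#   Read at address 0x484848E0 (0xb6fc08e0): 0x00000800
devmem2_value() {
    hex=$(sed -n 's/^ *Read at address .*: *\(0x[0-9A-Fa-f]*\).*$/\1/p' | head -n 1)
    [ -n "$hex" ] || return 1      # no matching line found
    echo $(($hex))                 # shell arithmetic accepts 0x-prefixed constants
}

# Example using the output captured above (no hardware access needed):
printf '%s\n' \
    '/dev/mem opened.' \
    'Memory mapped at address 0xb6fc0000.' \
    'Read at address 0x484848E0 (0xb6fc08e0): 0x00000800' | devmem2_value
```

On the target this would be used as `devmem2 0x484848e0 | devmem2_value`. The value 0x800 is 2048 in decimal, which matches the rx_descs = <2048> setting in the DTS change.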

    Regarding the network traffic mix, for this test it is a relatively light combination of:

    - An outgoing H.264 RTMP (TCP) stream at ~11Mb/s
    - Frequent small HTTP traffic back and forth with web server
    - ssh session doing performance measurements e.g. htop
    - other (mostly UDP) protocols in background including DHCP, NTP, DNS, mDNS, SSDP.

    I didn't get to the heavyweight tests because this one fails fairly quickly.

    The board itself is also running VIP, VPE, codec, DSS driving 1080p60 output, active USB & eMMC, CPU running at OPP1 (1.16GHz) with ~40% utilization of both cores.
  • We have isolated the network failure on a patched system to occur only if the video codec is also in operation.

    Specifically, if the previously described iperf3 test is run while also running 'videnc2test' the failure is produced. Note that it may take some time (many minutes) for the network to die, but the failure is quite repeatable.

    Since we have a streaming application, we need to run the network and the video codec at the same time. Is there a fix for this?

  • Chris,
    Sorry for late reply...
    Most of the apps team are on vacation this week, I am checking if I can get an answer for you asap.
  • I set up a system here that is perhaps similar, but so far I do not see the problem that you are having. I ran a test here combining iperf running at about 300Mbps UDP receive and the h264 enc test from the out of box demo that was modified to run continuously.

    To summarize, the problem is the network is down but the system is functioning fine otherwise?

    To look at the network down problem for the moment:

    - Network traffic was running fine until iperf+video enc test is run?

    - When running "ethtool eth0" (assuming this is the port being used) is the link status still shown as detected?

    - After the problem is detected, does "ethtool -S eth0" show the RX byte count increasing during the ping message?

  • > To summarize, the problem is the network is down but the system is functioning fine otherwise?

    Yes. The network stops moving any packets from an app perspective but otherwise appears to still be up. Control operations such as "ifconfig eth0 down" result in kernel errors related to CPDMA. ethtool shows that all CP DMA descriptors are free.

    > To look at the network down problem for the moment:

    > - Network traffic was running fine until iperf+video enc test is run?

    Yes. iperf/network in both directions will run indefinitely without errors in the absence of codec test. Running codec test causes network failure within 1-30 minutes or so.

    > - When running "ethtool eth0" (assuming this is the port being used) is the link status still shown as detected?

    Yes.

    > - After the problem is detected, does "ethtool -S eth0" show the RX byte count increasing during the ping message?

    The "good Rx frame" count does increase, though I don't see a byte count in the ethtool output.

  • Thanks for the answers. They indicate that the link between the PHY and the processor is passing packets, since the ethtool dump of the hardware statistics block shows an increasing RX frame count. The RX Octets should also be increasing along with the RX frames.

    Does "cat /proc/net/snmp | grep IcmpMsg" show an increasing count after the link stops responding while ping messages are being sent from another machine? The ICMP count should also be the same between in and out packets.

    This current test, is it on your hardware or the TI EVM?
  • Could you also please post the message you are seeing about CPDMA errors?
    Does an ifconfig up/down sequence recover the network connectivity?
  • Stopping the codec and doing ifconfig eth0 down / up does recover the network connection (with some delay).

    The failure occurs again soon if the video codec is still active.

    I've attached the two errors we see and an explanation of when they occur, below.

    The initial failure is a transmit queue timeout:

    [   69.058050] ------------[ cut here ]------------

    [   69.062731] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x264/0x270()

    [   69.071236] NETDEV WATCHDOG: eth0 (cpsw): transmit queue 0 timed out

    [   69.077632] Modules linked in: usb_f_acm u_serial g_serial libcomposite usb_storage xhci_plat_hcd xhci_hcd usbcore rpmsg_rpc dwc3 rtc_palmas virtio_rpmsg_bus udc_core extcon_palmas rtc_ds1307 phy_omap_usb2 snd_soc_tlv320aic3x omap_wdt rtc_omap dwc3_omap snd_soc_pcm5102a omap_remoteproc remoteproc virtio virtio_ring extcon_usb_gpio extcon

    [   69.108009] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.1.13-aja-helo-g72e25e4-dirty #3

    [   69.116070] Hardware name: Generic DRA74X (Flattened Device Tree)

    [   69.122211] Backtrace:

    [   69.124692] [<c0012f78>] (dump_backtrace) from [<c001319c>] (show_stack+0x18/0x1c)

    [   69.132320]  r7:c05f3628 r6:0000012f r5:c09b1024 r4:00000000

    [   69.138062] [<c0013184>] (show_stack) from [<c06bbb98>] (dump_stack+0x9c/0xdc)

    [   69.145325] [<c06bbafc>] (dump_stack) from [<c0039a38>] (warn_slowpath_common+0x88/0xb8)

    [   69.153469]  r5:00000009 r4:c0991d60

    [   69.157090] [<c00399b0>] (warn_slowpath_common) from [<c0039aa0>] (warn_slowpath_fmt+0x38/0x40)

    [   69.165850]  r8:c09dd167 r7:c0992100 r6:ed8f5840 r5:ed930000 r4:c08f6230

    [   69.172638] [<c0039a6c>] (warn_slowpath_fmt) from [<c05f3628>] (dev_watchdog+0x264/0x270)

    [   69.180875]  r3:ed930000 r2:c08f6230

    [   69.184486]  r4:00000000

    [   69.187046] [<c05f33c4>] (dev_watchdog) from [<c0089314>] (call_timer_fn+0x2c/0xa0)

    [   69.194742]  r10:ed930000 r9:c05f33c4 r8:00200200 r7:00000000 r6:c05f33c4 r5:00000101

    [   69.202680]  r4:ed930264

    [   69.205242] [<c00892e8>] (call_timer_fn) from [<c0089930>] (run_timer_softirq+0x1d4/0x250)

    [   69.213558]  r6:c0991e00 r5:c09f3540 r4:ed930264

    [   69.218237] [<c008975c>] (run_timer_softirq) from [<c003d038>] (__do_softirq+0x140/0x264)

    [   69.226465]  r10:c0992080 r9:40000001 r8:00000001 r7:00000101 r6:c0990000 r5:c0992084

    [   69.234397]  r4:000000a0

    [   69.236958] [<c003cef8>] (__do_softirq) from [<c003d438>] (irq_exit+0xb8/0x120)

    [   69.244309]  r10:c09dcfb3 r9:c06c8168 r8:ee008000 r7:00000000 r6:00000000 r5:00000013

    [   69.252245]  r4:c098cd38

    [   69.254810] [<c003d380>] (irq_exit) from [<c0079894>] (__handle_domain_irq+0x68/0xbc)

    [   69.262683]  r5:00000013 r4:c098cd38

    [   69.266313] [<c007982c>] (__handle_domain_irq) from [<c00094ac>] (gic_handle_irq+0x2c/0x64)

    [   69.274714]  r9:c06c8168 r8:00000000 r7:fa212000 r6:c0991ef8 r5:c099294c r4:fa21200c

    [   69.282558] [<c0009480>] (gic_handle_irq) from [<c06c1880>] (__irq_svc+0x40/0x74)

    [   69.290086] Exception stack(0xc0991ef8 to 0xc0991f40)

    [   69.295161] 1ee0:                                                       00000001 00000000

    [   69.303401] 1f00: c09e06b0 00000000 c0990000 c09925b4 c0992568 00000000 00000000 c06c8168

    [   69.311625] 1f20: c09dcfb3 c0991f4c c0991f2c c0991f40 c002a838 c00104c0 60030013 ffffffff

    [   69.319856]  r7:c0991f2c r6:ffffffff r5:60030013 r4:c00104c0

    [   69.325594] [<c0010498>] (arch_cpu_idle) from [<c0070618>] (cpu_startup_entry+0x2a0/0x31c)

    [   69.333917] [<c0070378>] (cpu_startup_entry) from [<c06b7fd8>] (rest_init+0x90/0x94)

    [   69.341702]  r7:00000000

    [   69.344288] [<c06b7f48>] (rest_init) from [<c093ed4c>] (start_kernel+0x404/0x410)

    [   69.351843]  r5:00000000 r4:c09e0050

    [   69.355510] [<c093e948>] (start_kernel) from [<80008090>] (0x80008090)

    [   69.362082] ---[ end trace b4cdc5c99a7b5964 ]---

    ******* 

    Subsequent to that, we get CPDMA channel stop errors:

    [   80.160529] ------------[ cut here ]------------

    [   80.165175] WARNING: CPU: 0 PID: 0 at drivers/net/ethernet/ti/davinci_cpdma.c:896 cpdma_chan_stop+0x16c/0x184()

    [   80.175302] Modules linked in: usb_f_acm u_serial g_serial libcomposite usb_storage xhci_plat_hcd xhci_hcd usbcore rpmsg_rpc dwc3 rtc_palmas virtio_rpmsg_bus udc_core extcon_palmas rtc_ds1307 phy_omap_usb2 snd_soc_tlv320aic3x omap_wdt rtc_omap dwc3_omap snd_soc_pcm5102a omap_remoteproc remoteproc virtio virtio_ring extcon_usb_gpio extcon

    [   80.205529] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.1.13-aja-helo-g72e25e4-dirty #3

    [   80.214786] Hardware name: Generic DRA74X (Flattened Device Tree)

    [   80.220901] Backtrace:

    [   80.223374] [<c0012f78>] (dump_backtrace) from [<c001319c>] (show_stack+0x18/0x1c)

    [   80.230972]  r7:c04d2fbc r6:00000380 r5:c09b1024 r4:00000000

    [   80.236689] [<c0013184>] (show_stack) from [<c06bbb98>] (dump_stack+0x9c/0xdc)

    [   80.243946] [<c06bbafc>] (dump_stack) from [<c0039a38>] (warn_slowpath_common+0x88/0xb8)

    [   80.252067]  r5:00000009 r4:00000000

    [   80.255671] [<c00399b0>] (warn_slowpath_common) from [<c0039b0c>] (warn_slowpath_null+0x24/0x2c)

    [   80.264490]  r8:00000000 r7:ed8e9050 r6:ee3864b0 r5:200d0113 r4:ee386490

    [   80.271255] [<c0039ae8>] (warn_slowpath_null) from [<c04d2fbc>] (cpdma_chan_stop+0x16c/0x184)

    [   80.279818] [<c04d2e50>] (cpdma_chan_stop) from [<c04d58dc>] (cpsw_ndo_tx_timeout+0x5c/0xb4)

    [   80.288287]  r9:00000140 r8:c09dd167 r7:c0992100 r6:ed8f5840 r5:00000000 r4:ed930000

    [   80.296100] [<c04d5880>] (cpsw_ndo_tx_timeout) from [<c05f35f4>] (dev_watchdog+0x230/0x270)

    [   80.304481]  r5:ed930000 r4:00000000

    [   80.308085] [<c05f33c4>] (dev_watchdog) from [<c0089314>] (call_timer_fn+0x2c/0xa0)

    [   80.315770]  r10:ed930000 r9:c05f33c4 r8:00200200 r7:00000000 r6:c05f33c4 r5:00000101

    [   80.323664]  r4:ed930264

    [   80.326214] [<c00892e8>] (call_timer_fn) from [<c0089930>] (run_timer_softirq+0x1d4/0x250)

    [   80.334510]  r6:c0991e00 r5:c09f3540 r4:ed930264

    [   80.339168] [<c008975c>] (run_timer_softirq) from [<c003d038>] (__do_softirq+0x140/0x264)

    [   80.347376]  r10:c0992080 r9:40000001 r8:00000001 r7:00000101 r6:c0990000 r5:c0992084

    [   80.355271]  r4:000000a0

    [   80.357821] [<c003cef8>] (__do_softirq) from [<c003d438>] (irq_exit+0xb8/0x120)

    [   80.365156]  r10:c09dcfb3 r9:c06c8168 r8:ee008000 r7:00000000 r6:00000000 r5:00000013

    [   80.373051]  r4:c098cd38

    [   80.375604] [<c003d380>] (irq_exit) from [<c0079894>] (__handle_domain_irq+0x68/0xbc)

    [   80.383463]  r5:00000013 r4:c098cd38

    [   80.387067] [<c007982c>] (__handle_domain_irq) from [<c00094ac>] (gic_handle_irq+0x2c/0x64)

    [   80.395449]  r9:c06c8168 r8:00000000 r7:fa212000 r6:c0991ef8 r5:c099294c r4:fa21200c

    [   80.403261] [<c0009480>] (gic_handle_irq) from [<c06c1880>] (__irq_svc+0x40/0x74)

    [   80.410772] Exception stack(0xc0991ef8 to 0xc0991f40)

    [   80.415842] 1ee0:                                                       00000001 00000000

    [   80.424054] 1f00: c09e06b0 00000000 c0990000 c09925b4 c0992568 00000000 00000000 c06c8168

    [   80.432264] 1f20: c09dcfb3 c0991f4c c0991f2c c0991f40 c002a838 c00104c0 600d0013 ffffffff

    [   80.440471]  r7:c0991f2c r6:ffffffff r5:600d0013 r4:c00104c0

    [   80.446184] [<c0010498>] (arch_cpu_idle) from [<c0070618>] (cpu_startup_entry+0x2a0/0x31c)

    [   80.454483] [<c0070378>] (cpu_startup_entry) from [<c06b7fd8>] (rest_init+0x90/0x94)

    [   80.462254]  r7:00000000

    [   80.464806] [<c06b7f48>] (rest_init) from [<c093ed4c>] (start_kernel+0x404/0x410)

    [   80.472316]  r5:00000000 r4:c09e0050

    [   80.475921] [<c093e948>] (start_kernel) from [<80008090>] (0x80008090)

    [   80.482474] ---[ end trace b4cdc5c99a7b5966 ]---

    1483985612.041512 enetd W 1103 ../../enetd/src/enetd.cpp:175: RX 0 (LEN 0, OVR 0, CRC 0, FRM 0, FFO 0, MIS 0)

    1483985612.041532 enetd W 1103 ../../enetd/src/enetd.cpp:182: TX 2 (ABT 0, CAR 0, FFO 0, HB 0, WIN 0)

    1483985612.041549 enetd I 1103 ../../enetd/src/enetd.cpp:193: Link Errors: 2

    [   91.098073] omapdrm omapdrm.0: atomic complete timeout (pipe 0)!

    [   91.160546] ------------[ cut here ]------------

  • The icmp counts are shown below, captured during pinging. The ping eventually fails because we get 'destination host unreachable' errors.

    IcmpMsg: InType3 InType8 OutType0 OutType3
    IcmpMsg: 4971 173 173 4969

    IcmpMsg: InType3 InType8 OutType0 OutType3
    IcmpMsg: 4975 176 176 4973

    IcmpMsg: InType3 InType8 OutType0 OutType3
    IcmpMsg: 4978 179 179 4976

    IcmpMsg: InType3 InType8 OutType0 OutType3
    IcmpMsg: 4982 181 181 4980

    The ICMP counts are not symmetric, as you can see.
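To compare the counters sample by sample, the header and value lines can be paired up programmatically. The sketch below assumes the two-line IcmpMsg format shown above and uses the standard ICMP type numbers (8 = echo request, 0 = echo reply, 3 = destination unreachable); the function name icmp_echo_balance is hypothetical.

```shell
# Read "IcmpMsg:" header/value line pairs on stdin and report, for each
# sample, how many echo requests arrived and how many replies were sent.
icmp_echo_balance() {
    awk '
        # Header line: remember which column holds which counter name.
        $1 == "IcmpMsg:" && $2 !~ /^[0-9]+$/ {
            for (i = 2; i <= NF; i++) name[i] = $i
            next
        }
        # Value line: map each value to its counter name and report.
        $1 == "IcmpMsg:" {
            for (i = 2; i <= NF; i++) val[name[i]] = $i
            printf "echo requests in=%d, echo replies out=%d, unanswered=%d\n",
                   val["InType8"], val["OutType0"], val["InType8"] - val["OutType0"]
        }
    '
}

# Example using the first sample above:
printf '%s\n' \
    'IcmpMsg: InType3 InType8 OutType0 OutType3' \
    'IcmpMsg: 4971 173 173 4969' | icmp_echo_balance
```

On the target this would be run as `grep IcmpMsg /proc/net/snmp | icmp_echo_balance`. In the four samples above, InType8 and OutType0 stay equal, while InType3 and OutType3 differ by two in every sample.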

    This latest test is on our own board; all of the tests I've tried fail identically on the TI EVM board, but if you need me to try something specific there, I can do so.
  • The watchdog timeout might be a clue. Could you please attach the full results of ethtool -S eth0 for the TI EVM?

    Also is the link partner and cable the same when testing with both your board and the TI EVM?

    I am still unable to reproduce the network down problem, could you post the exact command you are using to run the video encoder on the TI EVM?
  • Could you also please dump the following address using devmem2  0x4a100D84 after the network down problem has occurred?

    This is the MAC SL1 mac control address, I am looking to see if the interface is in half duplex mode.
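Once a valid value has been read from the MAC control register, the duplex setting can be decoded from its bits. The sketch below is an illustration only: the bit positions (bit 0 = full duplex, bit 5 = GMII enable, bit 7 = gigabit) are taken from how the mainline Linux cpsw driver programs this register and should be checked against the TRM for the device. The function name decode_mac_control and the sample value 0xa1 are hypothetical, not readings from this thread.

```shell
# Decode selected bits of a CPSW sliver MAC control value.
# $1 = register value, e.g. 0x000000a1
# Bit positions are an assumption based on the Linux cpsw driver.
decode_mac_control() {
    v=$(($1))
    [ $((v & 0x01)) -ne 0 ] && duplex=full  || duplex=half     # bit 0: full duplex
    [ $((v & 0x20)) -ne 0 ] && gmii=enabled || gmii=disabled   # bit 5: GMII enable
    [ $((v & 0x80)) -ne 0 ] && gig=yes      || gig=no          # bit 7: gigabit mode
    echo "duplex=$duplex gmii=$gmii gigabit=$gig"
}

# Hypothetical value with bits 0, 5 and 7 set:
decode_mac_control 0xa1
```

A value with bit 0 clear would point to the half duplex condition being asked about here.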

    Could you also attach the full ethtool eth0 output before causing the network down condition and after it occurs?

    Is the link partner a switch or a HUB or direct connect to test equipment of some kind?

  • Output of ethtool eth0 prior to net failure:

    Settings for eth0:

    Supported ports: [ TP MII ]
    Supported link modes: 10baseT/Half 10baseT/Full
    100baseT/Half 100baseT/Full
    1000baseT/Half 1000baseT/Full
    Supported pause frame use: Symmetric
    Supports auto-negotiation: Yes
    Advertised link modes: 10baseT/Half 10baseT/Full
    100baseT/Half 100baseT/Full
    1000baseT/Half 1000baseT/Full
    Advertised pause frame use: Symmetric
    Advertised auto-negotiation: Yes
    Link partner advertised link modes: 10baseT/Half 10baseT/Full
    100baseT/Half 100baseT/Full
    1000baseT/Full
    Link partner advertised pause frame use: No
    Link partner advertised auto-negotiation: Yes
    Speed: 1000Mb/s
    Duplex: Full
    Port: MII
    PHYAD: 1
    Transceiver: external
    Auto-negotiation: on
    Supports Wake-on: d
    Wake-on: d
    Current message level: 0x00000000 (0)

    Link detected: yes

    Output after failure:

    Settings for eth0:
    Supported ports: [ TP MII ]
    Supported link modes: 10baseT/Half 10baseT/Full
    100baseT/Half 100baseT/Full
    1000baseT/Half 1000baseT/Full
    Supported pause frame use: Symmetric
    Supports auto-negotiation: Yes
    Advertised link modes: 10baseT/Half 10baseT/Full
    100baseT/Half 100baseT/Full
    1000baseT/Half 1000baseT/Full
    Advertised pause frame use: Symmetric
    Advertised auto-negotiation: Yes
    Link partner advertised link modes: 10baseT/Half 10baseT/Full
    100baseT/Half 100baseT/Full
    1000baseT/Full
    Link partner advertised pause frame use: Symmetric Receive-only
    Link partner advertised auto-negotiation: Yes
    Speed: 1000Mb/s
    Duplex: Full
    Port: MII
    PHYAD: 1
    Transceiver: external
    Auto-negotiation: on
    Supports Wake-on: d
    Wake-on: d
    Current message level: 0x00000000 (0)

    Link detected: yes

    ethtool -S eth0 after failure:

    NIC statistics:
    Good Rx Frames: 2022
    Broadcast Rx Frames: 220
    Multicast Rx Frames: 0
    Pause Rx Frames: 0
    Rx CRC Errors: 0
    Rx Align/Code Errors: 0
    Oversize Rx Frames: 0
    Rx Jabbers: 0
    Undersize (Short) Rx Frames: 0
    Rx Fragments: 0
    Rx Octets: 236946
    Good Tx Frames: 41428
    Broadcast Tx Frames: 16
    Multicast Tx Frames: 76
    Pause Tx Frames: 0
    Deferred Tx Frames: 0
    Collisions: 0
    Single Collision Tx Frames: 0
    Multiple Collision Tx Frames: 0
    Excessive Collisions: 0
    Late Collisions: 0
    Tx Underrun: 0
    Carrier Sense Errors: 0
    Tx Octets: 54751650
    Rx + Tx 64 Octet Frames: 252
    Rx + Tx 65-127 Octet Frames: 2044
    Rx + Tx 128-255 Octet Frames: 215
    Rx + Tx 256-511 Octet Frames: 496
    Rx + Tx 512-1023 Octet Frames: 648
    Rx + Tx 1024-Up Octet Frames: 39795
    Net Octets: 54988596
    Rx Start of Frame Overruns: 0
    Rx Middle of Frame Overruns: 0
    Rx DMA Overruns: 0
    Rx DMA chan: head_enqueue: 1
    Rx DMA chan: tail_enqueue: 4069
    Rx DMA chan: pad_enqueue: 0
    Rx DMA chan: misqueued: 0
    Rx DMA chan: desc_alloc_fail: 0
    Rx DMA chan: pad_alloc_fail: 0
    Rx DMA chan: runt_receive_buf: 0
    Rx DMA chan: runt_transmit_buf: 0
    Rx DMA chan: empty_dequeue: 0
    Rx DMA chan: busy_dequeue: 1563
    Rx DMA chan: good_dequeue: 2022
    Rx DMA chan: requeue: 0
    Rx DMA chan: teardown_dequeue: 0
    Tx DMA chan: head_enqueue: 37182
    Tx DMA chan: tail_enqueue: 30040
    Tx DMA chan: pad_enqueue: 0
    Tx DMA chan: misqueued: 1750
    Tx DMA chan: desc_alloc_fail: 19
    Tx DMA chan: pad_alloc_fail: 0
    Tx DMA chan: runt_receive_buf: 0
    Tx DMA chan: runt_transmit_buf: 232
    Tx DMA chan: empty_dequeue: 25185
    Tx DMA chan: busy_dequeue: 15579
    Tx DMA chan: good_dequeue: 41428
    Tx DMA chan: requeue: 37559
    Tx DMA chan: teardown_dequeue: 24576
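With this many counters it can help to filter the dump down to the non-zero ones whose names suggest a problem. This is a minimal sketch, assuming the "name: value" layout shown above; the function name nonzero_counters and the default name pattern are arbitrary choices for illustration, not a list of counters that are known to indicate a fault.

```shell
# Read "ethtool -S" output on stdin and print non-zero counters whose
# name matches a pattern.
# $1 = optional extended regex for counter names (the default is an assumption)
nonzero_counters() {
    pattern=${1:-'Error|Overrun|Collision|Underrun|Jabber|fail|misqueued'}
    awk -F': ' -v pat="$pattern" '
        NF >= 2 {
            value = $NF + 0                 # last field is the counter value
            name = $0
            sub(/: [0-9]+ *$/, "", name)    # strip the trailing value
            sub(/^ +/, "", name)            # strip leading indentation
            if (value > 0 && name ~ pat)
                print name " = " value
        }
    '
}

# Example using a few lines from the dump above:
printf '%s\n' \
    'NIC statistics:' \
    'Rx CRC Errors: 0' \
    'Good Tx Frames: 41428' \
    'Tx DMA chan: misqueued: 1750' \
    'Tx DMA chan: desc_alloc_fail: 19' | nonzero_counters
```

On the target this would be run as `ethtool -S eth0 | nonzero_counters`, optionally before and after the failure so the two results can be compared.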

    devmem2 0x4a100d84 after failure produces the following output:

    # devmem2 0x4a100d84
    /dev/mem opened.[ 314.900296] ------------[ cut here ]------------
    [ 314.906261] WARNING: CPU: 0 PID: 1317 at drivers/bus/omap_l3_noc.c:147 l3_interrupt_handler+0x248/0x34c()
    [ 314.915869] 44000000.ocp:L3 Custom Error: MASTER MPU TARGET L4_CFG (Read): Data Access in User mode during Functional access
    [ 314.927130] Modules linked in: usb_f_acm u_serial g_serial libcomposite usb_storage xhci_plat_hcd xhci_hcd usbcore dwc3 rpmsg_rpc rtc_palmas virtio_rpmsg_bus extcon_palmas phy_omap_usb2 udc_core snd_soc_pcm5102a snd_soc_tlv320aic3x rtc_ds1307 omap_wdt dwc3_omap rtc_omap omap_remoteproc remoteproc virtio virtio_ring extcon_usb_gpio extcon
    [ 314.957350] CPU: 0 PID: 1317 Comm: devmem2 Tainted: G W 4.1.13-aja-helo-g72e25e4-dirty #3
    [ 314.966693] Hardware name: Generic DRA74X (Flattened Device Tree)
    [ 314.972808] Backtrace:
    [ 314.975279] [<c0012f78>] (dump_backtrace) from [<c001319c>] (show_stack+0x18/0x1c)
    [ 314.982878] r7:c03655e8 r6:00000093 r5:c09b1024 r4:00000000
    [ 314.988595] [<c0013184>] (show_stack) from [<c06bbb98>] (dump_stack+0x9c/0xdc)
    [ 314.995851] [<c06bbafc>] (dump_stack) from [<c0039a38>] (warn_slowpath_common+0x88/0xb8)
    [ 315.003971] r5:00000009 r4:ecce1e00
    [ 315.007577] [<c00399b0>] (warn_slowpath_common) from [<c0039aa0>] (warn_slowpath_fmt+0x38/0x40)
    [ 315.016309] r8:c08af41c r7:00000002 r6:ee1af190 r5:c08af4dc r4:c08af580
    [ 315.023077] [<c0039a6c>] (warn_slowpath_fmt) from [<c03655e8>] (l3_interrupt_handler+0x248/0x34c)
    [ 315.031982] r3:ee1af000 r2:c08af580
    [ 315.035580] r4:80080003
    [ 315.038135] [<c03653a0>] (l3_interrupt_handler) from [<c0079f54>] (handle_irq_event_percpu+0x80/0x13c)
    [ 315.047476] r10:c09dcfb5 r9:ee1a9300 r8:00000017 r7:00000000 r6:00000000 r5:ee1a9360
    [ 315.055370] r4:ee1af500
    [ 315.057923] [<c0079ed4>] (handle_irq_event_percpu) from [<c007a054>] (handle_irq_event+0x44/0x64)
    [ 315.066828] r10:beed9a74 r9:00000001 r8:ee008000 r7:00000000 r6:ee1af500 r5:ee1a9360
    [ 315.074723] r4:ee1a9300
    [ 315.077272] [<c007a010>] (handle_irq_event) from [<c007cd80>] (handle_fasteoi_irq+0xb8/0x17c)
    [ 315.085829] r7:00000000 r6:c099713c r5:ee1a9360 r4:ee1a9300
    [ 315.091542] [<c007ccc8>] (handle_fasteoi_irq) from [<c00795b8>] (generic_handle_irq+0x34/0x44)
    [ 315.100186] r7:00000000 r6:00000000 r5:00000017 r4:00000017
    [ 315.105897] [<c0079584>] (generic_handle_irq) from [<c0079890>] (__handle_domain_irq+0x64/0xbc)
    [ 315.114629] r5:00000017 r4:c098cd38
    [ 315.118232] [<c007982c>] (__handle_domain_irq) from [<c00094ac>] (gic_handle_irq+0x2c/0x64)
    [ 315.126614] r9:00000001 r8:30c5387d r7:fa212000 r6:ecce1fb0 r5:c099294c r4:fa21200c
    [ 315.134426] [<c0009480>] (gic_handle_irq) from [<c06c1b68>] (__irq_usr+0x48/0x60)
    [ 315.141938] Exception stack(0xecce1fb0 to 0xecce1ff8)
    [ 315.147009] 1fa0: 00010417 b6ff8960 b6ff66e8 b6ff8b18
    [ 315.155220] 1fc0: 00000000 00020f5c b6fea000 b6fea4c0 00000000 00000001 beed9a74 b6ff4d84
    [ 315.163430] 1fe0: 00000017 beed9a08 00010400 b6fdaaf4 200b0030 ffffffff
    [ 315.170067] r7:30c5387d r6:ffffffff r5:200b0030 r4:b6fdaaf4
    [ 315.175775] ---[ end trace c41436b29ff88a5f ]---

    Memory mapped at address 0xb6ff4000.
    [ 315.180627] Unhandled fault: asynchronous external abort (0x1211) at 0x00000000
    [ 315.191288] pgd = d3fd8ec0
    [ 315.194004] [00000000] *pgd=9316e003, *pmd=92f5f003, *pte=00000000
    Read at address 0x4A100D84 (0xb6ff4d84): 0x00000000

    The link partner, cable, and machine are common across tests of our own board and the EVM. Link partner is an Intel gig-E ethernet card connected directly and installed in a linux PC - not a hub.

    The observed issue when the network is down is that transmit from the EVM no longer occurs and sending packets time out.

    The failure is produced by running the IVAHD (videnc2test) continuously while also running 'iperf3 -s' on the EVM, with the iperf client elsewhere doing a UDP transfer test, usually at 100Mb/s. However, the failure occurs with almost any kind of network traffic while also using the codec. Typically we are encoding 1080i30 material in our application.

    We will get details of the videnc2 command/setup/input to you tomorrow.

  • To reproduce the issue on the EVM, you will need a local 1080p NV12 YUV-formatted file for codec input. This must be <2GB in size due to videnc2test limitations. It can be prepared with ffmpeg from suitable source material e.g. as follows:

    ffmpeg -ss 00:00:30.000 -t 00:00:15 -i big_buck_bunny_1080p_h264.mov -c:v rawvideo -pix_fmt nv12 nv12.yuv
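To judge how long a clip can be while staying under the size limit, the raw file size can be estimated from the frame geometry. NV12 stores 12 bits per pixel (a full-resolution luma plane plus a half-size interleaved chroma plane), so a frame takes width x height x 3/2 bytes. The sketch below assumes the limit is exactly 2 GiB (2^31 bytes), which is a guess at what "<2GB" means here; the function names are hypothetical.

```shell
# Bytes needed for one NV12 frame: width * height * 3 / 2.
# $1 = width in pixels, $2 = height in pixels
nv12_frame_bytes() {
    echo $(( $1 * $2 * 3 / 2 ))
}

# Whole frames that fit within a byte limit.
# $1 = width, $2 = height, $3 = limit in bytes
max_frames_under() {
    frame=$(nv12_frame_bytes "$1" "$2")
    echo $(( $3 / $frame ))
}

limit=$((2 * 1024 * 1024 * 1024))   # assumed meaning of the 2GB limit
echo "bytes per 1920x1080 NV12 frame: $(nv12_frame_bytes 1920 1080)"
echo "frames that fit under $limit bytes: $(max_frames_under 1920 1080 "$limit")"
```

Dividing the frame count by the source frame rate gives the longest usable value for the ffmpeg -t option.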

    Then, on the EVM, run 'iperf3 -s' to start the server.

    Run the IVAHD continuously with a command of the form:

    watch -n 1 videnc2test 1920 1080 108000 nv12.yuv out.h264 30 17000 h264 high 42 OMAPDRM

    Start the iperf3 client on a connected system with a command like:

    iperf3 -c 192.168.0.2 -u -b 500M -t 80000

    Network failure may occur within minutes or may take hours. Rebooting and restarting the test seems to hasten the failure if it does not occur within an hour or two.

    Having run this test many times now, we have observed total transmit failure, a 'network socket closed unexpectedly' error from iperf3, and also ipu2 crashes as follows:

    [76087.328992] remoteproc1: crash detected in 55020000.ipu: type watchdog
    [76087.335669] remoteproc1: handling crash #43 in 55020000.ipu
    [76087.341818] remoteproc1: recovering 55020000.ipu
    [76087.376487] omap_hwmod: mmu_ipu2: _wait_target_disable failed
    [76087.384217] remoteproc1: stopped remote processor 55020000.ipu
    [76087.406380] remoteproc1: powering up 55020000.ipu
    [76087.414336] remoteproc1: Booting fw image dra7-ipu2-fw.xem4, size 3485072
    [76087.422570] omap-iommu 55082000.mmu: 55082000.mmu: version 2.1
    [76087.502657] remoteproc1: remote processor 55020000.ipu is now up
    [76087.510014] virtio_rpmsg_bus virtio1: rpmsg host is online
    [76087.515561] remoteproc1: registered virtio1 (type 7)
    [76097.507475] remoteproc1: crash detected in 55020000.ipu: type watchdog
  • Thanks for the console outputs and steps for the EVM, I will try them out. 

    I need to apologize for the address I gave you; it was for another processor, which is the reason for the kernel warning. The one for the AM572x is 0x48484D84.

  • Has there been any further progress towards a resolution on this? The issue is critical for us and it is blocking product release.
  • We tried several different methods to try to reproduce the problem. We have replicated the condition that you are seeing, though not with the same steps. The most consistent way we found to recreate the problem was to use iperf3 in client mode. The video encode part of the test doesn't need to run to recreate the condition. Only iperf3 seems to cause the issue. We are continuing to look at the cause of the network down condition.

  • Chris,
    Last night I sent an email to your team, and since I was using my phone, I may have left you out. Sorry about that. I was letting you know that our apps team would be posting comments today, as they have now done, and that we are continuing to investigate the issue.