AM572x Ethernet receive issues

Chris Anderson20

Other Parts Discussed in Thread: AM5728, SHA-256, PCM5102A

We are seeing an unusually high rate of TCP retries and UDP packet loss when receiving over the AM5728 CPSW interface with Linux and the processor SDK on both the EVM and our own board.

Using iperf3, we can transmit over a dedicated link at well over 800Mb/s without issues (no UDP packet loss or TCP retries), but when receiving we get unexpectedly large UDP losses (>50% sometimes) and hundreds of TCP retries over the same link. Furthermore, network problems appear to get worse when the HDMI output is active; sometimes, large file transfers to the AM5728 can fail an SHA-256 check, suggesting a possible memory corruption issue.

Below is a sample of UDP receive at 100Mb/s (10% link bandwidth) showing 36% packet loss; increasing the buffer size to 1MB (-w 1M) reduces the loss to about 15% but it really should be 0 for this case.

On the TCP side we see hundreds of retries/sec using similar parameters. The TCP retries occur regardless of how much we increase the window size.

Adding a receive coalesce interval of 20 µs does not appear to have any effect.

The above tests are with kernel 4.1.13 (SDK 2.0.1.7). Older SDK with kernel 3.14.43 does not exhibit this issue. (3.x SDK does not boot on our EVM.)

Ethtool does not show errors for either scenario, but clearly the packets are dropped.

Is there a fix for this?

----

[ 4] local 172.30.0.30 port 55300 connected to 172.30.0.2 port 5201

[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams

[ 4] 0.00-1.00 sec 8.49 MBytes 71.2 Mbits/sec 0.406 ms 465/1552 (30%)

[ 4] 1.00-2.00 sec 7.82 MBytes 65.6 Mbits/sec 0.120 ms 523/1524 (34%)

[ 4] 2.00-3.00 sec 7.54 MBytes 63.2 Mbits/sec 0.115 ms 560/1525 (37%)

[ 4] 3.00-4.00 sec 7.59 MBytes 63.6 Mbits/sec 0.419 ms 554/1525 (36%)

[ 4] 4.00-5.00 sec 8.26 MBytes 69.3 Mbits/sec 0.150 ms 476/1533 (31%)

[ 4] 5.00-6.00 sec 7.44 MBytes 62.4 Mbits/sec 0.159 ms 575/1527 (38%)

[ 4] 6.00-7.00 sec 7.20 MBytes 60.4 Mbits/sec 0.424 ms 596/1517 (39%)

[ 4] 7.00-8.00 sec 7.88 MBytes 66.1 Mbits/sec 0.187 ms 526/1535 (34%)

[ 4] 8.00-9.00 sec 7.55 MBytes 63.4 Mbits/sec 0.151 ms 558/1525 (37%)

[ 4] 9.00-10.00 sec 7.78 MBytes 65.3 Mbits/sec 0.401 ms 520/1516 (34%)

[ 4] 10.00-11.00 sec 8.02 MBytes 67.3 Mbits/sec 0.128 ms 509/1536 (33%)

[ 4] 11.00-12.00 sec 7.48 MBytes 62.7 Mbits/sec 0.147 ms 570/1527 (37%)

[ 4] 12.00-13.00 sec 6.94 MBytes 58.2 Mbits/sec 0.520 ms 637/1525 (42%)

[ 4] 13.00-14.00 sec 7.74 MBytes 64.9 Mbits/sec 0.146 ms 536/1527 (35%)

[ 4] 14.00-15.00 sec 7.77 MBytes 65.2 Mbits/sec 0.179 ms 530/1525 (35%)

[ 4] 15.00-16.00 sec 7.55 MBytes 63.3 Mbits/sec 0.431 ms 547/1513 (36%)

^C[ 4] 16.00-16.93 sec 6.95 MBytes 62.4 Mbits/sec 0.120 ms 491/1380 (36%)

- - - - - - - - - - - - - - - - - - - - - - - - -

[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams

[ 4] 0.00-16.93 sec 0.00 Bytes 0.00 bits/sec 0.120 ms 9173/25812 (36%)
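The overall loss figure can be cross-checked against the per-interval lines. What follows is a minimal Python sketch and was not part of the original post; the helper name `total_loss` and the regular expression are assumptions made for illustration. It sums the `Lost/Total` column of iperf3 UDP interval lines. Leave the final summary line out of the input so it is not counted twice.

```python
import re

# Matches the "Lost/Total (pct%)" column at the end of an iperf3 UDP interval line.
LOSS_RE = re.compile(r"(\d+)/(\d+) \(\d+%\)\s*$")

def total_loss(lines):
    """Return (lost, total, fraction) summed over iperf3 UDP interval lines."""
    lost = total = 0
    for line in lines:
        m = LOSS_RE.search(line)
        if m:
            lost += int(m.group(1))
            total += int(m.group(2))
    fraction = lost / total if total else 0.0
    return lost, total, fraction

# First two interval lines from the run above.
sample = [
    "[ 4] 0.00-1.00 sec 8.49 MBytes 71.2 Mbits/sec 0.406 ms 465/1552 (30%)",
    "[ 4] 1.00-2.00 sec 7.82 MBytes 65.6 Mbits/sec 0.120 ms 523/1524 (34%)",
]
lost, total, fraction = total_loss(sample)
print(f"{lost}/{total} datagrams lost ({fraction:.1%})")
```

Summed over all seventeen interval lines of the run above, the counts come to 9173/25812, which agrees with the summary line.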

over 9 years ago

0 Biser Gatchev-XID over 9 years ago

TI__Guru**** 393215 points

Hi,

The Ethernet experts have been notified. They will respond here. Please note that delays are possible due to upcoming holidays in the USA.

0 Rogerio Almeida over 9 years ago in reply to Biser Gatchev-XID

TI__Mastermind 26215 points

The apps team will post their comments later today.

0 Schuyler Patton over 9 years ago in reply to Rogerio Almeida

TI__Mastermind 40570 points

This sounds like the RX packets are being dropped due to descriptor exhaustion. Internal RAM only holds a fixed number of descriptors, about 128 total for both RX and TX. When you run ethtool -S eth0, is the Rx DMA Overruns count non-zero?

If so, the RX descriptors are being exhausted. The bindings document is located here:

https://git.ti.com/ti-linux-kernel/ti-linux-kernel/blobs/master/Documentation/devicetree/bindings/net/cpsw.txt

There is a field called bd_ram_size; increase it from 0x2000 to 0x20000. This will move the descriptors from internal RAM to external DDR and reduce the possibility of overruns.

The bd_ram_size can be modified in the processor DTSI or in your board DTS file. I recommend the latter; example here:

&mac {
       bd_ram_size = <0x20000>;
};

0 Chris Anderson20 over 8 years ago in reply to Schuyler Patton

Intellectual 690 points

Unfortunately, changing the bd_ram_size to 0x20000 causes a kernel crash when the network is first used (see attached traceback.) Is there perhaps some other setting or fix required for this to work?

[ 21.118124] ------------[ cut here ]------------

[ 21.122781] WARNING: CPU: 0 PID: 1048 at drivers/bus/omap_l3_noc.c:147 l3_interrupt_handler+0x248/0x34c()

[ 21.132388] 44000000.ocp:L3 Standard Error: MASTER GMAC_SW TARGET GPMC (Read Link): At Address: 0x48496000 : Data Access in User mode during Functional access

[ 21.146616] Modules linked in: usb_f_acm u_serial g_serial libcomposite usb_storage xhci_plat_hcd xhci_hcd usbcore dwc3 rpmsg_rpc udc_core virtio_rpmsg_bus snd_soc_pcm5102a rtc_palmas phy_omap_usb2 extcon_palmas snd_soc_tlv320aic3x rtc_ds1307 omap_wdt dwc3_omap rtc_omap omap_remoteproc remoteproc virtio extcon_usb_gpio virtio_ring extcon

[ 21.176847] CPU: 0 PID: 1048 Comm: rep2 Not tainted 4.1.13-test-gb5be33b #8

[ 21.184184] Hardware name: Generic DRA74X (Flattened Device Tree)

[ 21.190300] Backtrace:

[ 21.192772] [<c0012f78>] (dump_backtrace) from [<c001319c>] (show_stack+0x18/0x1c)

[ 21.200371] r7:c03655e8 r6:00000093 r5:c09b1024 r4:00000000

[ 21.206088] [<c0013184>] (show_stack) from [<c06bbaf8>] (dump_stack+0x9c/0xdc)

[ 21.213343] [<c06bba5c>] (dump_stack) from [<c0039a38>] (warn_slowpath_common+0x88/0xb8)

[ 21.221465] r5:00000009 r4:eca05e00

[ 21.225072] [<c00399b0>] (warn_slowpath_common) from [<c0039aa0>] (warn_slowpath_fmt+0x38/0x40)

[ 21.233804] r8:c08af9c0 r7:00000004 r6:ee1af190 r5:c08af4cc r4:c08af57c

[ 21.240567] [<c0039a6c>] (warn_slowpath_fmt) from [<c03655e8>] (l3_interrupt_handler+0x248/0x34c)

[ 21.249473] r3:ee1af000 r2:c08af57c

[ 21.253071] r4:80080001

[ 21.255624] [<c03653a0>] (l3_interrupt_handler) from [<c0079f54>] (handle_irq_event_percpu+0x80/0x13c)

[ 21.264966] r10:c09dcfb5 r9:ee1a9300 r8:00000017 r7:00000000 r6:00000000 r5:ee1a9360

[ 21.272860] r4:ee1af500

[ 21.275413] [<c0079ed4>] (handle_irq_event_percpu) from [<c007a054>] (handle_irq_event+0x44/0x64)

[ 21.284318] r10:9eea8bc8 r9:9e4fd340 r8:ee008000 r7:00000000 r6:ee1af500 r5:ee1a9360

[ 21.292212] r4:ee1a9300

[ 21.294763] [<c007a010>] (handle_irq_event) from [<c007cd80>] (handle_fasteoi_irq+0xb8/0x17c)

[ 21.303319] r7:00000000 r6:c099713c r5:ee1a9360 r4:ee1a9300

[ 21.309030] [<c007ccc8>] (handle_fasteoi_irq) from [<c00795b8>] (generic_handle_irq+0x34/0x44)

[ 21.317674] r7:00000000 r6:00000000 r5:00000017 r4:00000017

[ 21.323386] [<c0079584>] (generic_handle_irq) from [<c0079890>] (__handle_domain_irq+0x64/0xbc)

[ 21.332117] r5:00000017 r4:c098cd38

[ 21.335721] [<c007982c>] (__handle_domain_irq) from [<c00094ac>] (gic_handle_irq+0x2c/0x64)

[ 21.344103] r9:9e4fd340 r8:30c5387d r7:fa212000 r6:eca05fb0 r5:c099294c r4:fa21200c

[ 21.351915] [<c0009480>] (gic_handle_irq) from [<c06c1aa8>] (__irq_usr+0x48/0x60)

[ 21.359426] Exception stack(0xeca05fb0 to 0xeca05ff8)

[ 21.364495] 5fa0: 9e4fd26c 9eea8bc8 9fbcc93c 00000000

[ 21.372706] 5fc0: 9e4fd26c 9eea8bc8 9fbcc93c 9fa941bc 00000000 9e4fd340 9eea8bc8 0000003e

[ 21.380916] 5fe0: 0009dee0 9e4fd210 b2c1aa39 b2c17ea0 60000030 ffffffff

[ 21.387553] r7:30c5387d r6:ffffffff r5:60000030 r4:b2c17ea0

[ 21.393261] ---[ end trace 975205f49dcfbb88 ]---

[ 21.397952] ------------[ cut here ]------------

[ 21.398138] Unhandled fault: asynchronous external abort (0x1211) at 0x00000000

[ 21.398140] pgd = ecca6340

[ 21.398148] [00000000] *pgd=ac930003, *pmd=94c22003, *pte=00000000

[ 21.398154] Internal error: : 1211 [#1] PREEMPT SMP ARM

[ 21.398202] Modules linked in: usb_f_acm u_serial g_serial libcomposite usb_storage xhci_plat_hcd xhci_hcd usbcore dwc3 rpmsg_rpc udc_core virtio_rpmsg_bus snd_soc_pcm5102a rtc_palmas phy_omap_usb2 extcon_palmas snd_soc_tlv320aic3x rtc_ds1307 omap_wdt dwc3_omap rtc_omap omap_remoteproc remoteproc virtio extcon_usb_gpio virtio_ring extcon

[ 21.398207] CPU: 1 PID: 982 Comm: klogd Tainted: G W 4.1.13-test-gb5be33b #8

[ 21.398210] Hardware name: Generic DRA74X (Flattened Device Tree)

[ 21.398213] task: d4dbde00 ti: dfc1a000 task.ti: dfc1a000

[ 21.398220] PC is at vfp_reload_hw+0x1c/0x44

[ 21.398224] LR is at __und_usr_fault_32+0x0/0x8

[ 21.398228] pc : [<c000ad80>] lr : [<c06c1c40>] psr: 600f0013

[ 21.398228] sp : dfc1bfb0 ip : 00000000 fp : 00000001

[ 21.398232] r10: dfc1a178 r9 : c06c1ca0 r8 : 00000b00

[ 21.398235] r7 : 00000001 r6 : dfc1a04c r5 : 00000002 r4 : ecad80f8

[ 21.398238] r3 : c09e0068 r2 : b6efb436 r1 : 40000000 r0 : ed2d8b02

[ 21.398242] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user

[ 21.398245] Control: 30c5387d Table: acca6340 DAC: 55555555

[ 21.398248] Process klogd (pid: 982, stack limit = 0xdfc1a218)

[ 21.398251] Stack: (0xdfc1bfb0 to 0xdfc1c000)

[ 21.398255] bfa0: 000bf008 b6f9dde4 be9ce9c8 000bbc82

[ 21.398261] bfc0: 000bf008 be9ce9e4 b6fd54c0 b6fb7000 ffffffff 00000000 000a8e88 b6fd54c0

[ 21.398265] bfe0: 00000000 be9ce994 b6f021d9 b6efb436 200f0030 ffffffff 00000000 00000000

[ 21.398270] Backtrace: invalid frame pointer 0x00000001

[ 21.398275] Code: ecba0b20 eef75a10 e205500f e3550002 (0cfa0b20)

[ 21.398280] ---[ end trace 975205f49dcfbb89 ]---

[ 21.398286] note: klogd[982] exited with preempt_count 1

[ 21.581885] WARNING: CPU: 0 PID: 1048 at drivers/bus/omap_l3_noc.c:147 l3_interrupt_handler+0x248/0x34c()

[ 21.591491] 44000000.ocp:L3 Custom Error: MASTER MPU TARGET L4_PER2_P3 (Idle): Data Access in Supervisor mode during Functional access

[ 21.603622] Modules linked in: usb_f_acm u_serial g_serial libcomposite usb_storage xhci_plat_hcd xhci_hcd usbcore dwc3 rpmsg_rpc udc_core virtio_rpmsg_bus snd_soc_pcm5102a rtc_palmas phy_omap_usb2 extcon_palmas snd_soc_tlv320aic3x rtc_ds1307 omap_wdt dwc3_omap rtc_omap omap_remoteproc remoteproc virtio extcon_usb_gpio virtio_ring extcon

[ 21.633836] CPU: 0 PID: 1048 Comm: rep2 Tainted: G D W 4.1.13-test-gb5be33b #8

[ 21.642393] Hardware name: Generic DRA74X (Flattened Device Tree)

[ 21.648506] Backtrace:

[ 21.650972] [<c0012f78>] (dump_backtrace) from [<c001319c>] (show_stack+0x18/0x1c)

[ 21.658571] r7:c03655e8 r6:00000093 r5:c09b1024 r4:00000000

[ 21.664284] [<c0013184>] (show_stack) from [<c06bbaf8>] (dump_stack+0x9c/0xdc)

[ 21.671538] [<c06bba5c>] (dump_stack) from [<c0039a38>] (warn_slowpath_common+0x88/0xb8)

[ 21.679659] r5:00000009 r4:eca05cf8

[ 21.683263] [<c00399b0>] (warn_slowpath_common) from [<c0039aa0>] (warn_slowpath_fmt+0x38/0x40)

[ 21.691994] r8:c08af418 r7:00000000 r6:ee1af190 r5:c08af4d8 r4:c08af57c

[ 21.698758] [<c0039a6c>] (warn_slowpath_fmt) from [<c03655e8>] (l3_interrupt_handler+0x248/0x34c)

[ 21.707664] r3:ee1af000 r2:c08af57c

[ 21.711262] r4:80080003

[ 21.713815] [<c03653a0>] (l3_interrupt_handler) from [<c0079f54>] (handle_irq_event_percpu+0x80/0x13c)

[ 21.723157] r10:c09dcfb5 r9:ee1a9300 r8:00000017 r7:00000000 r6:00000000 r5:ee1a9360

[ 21.731052] r4:ee1af500

[ 21.733604] [<c0079ed4>] (handle_irq_event_percpu) from [<c007a054>] (handle_irq_event+0x44/0x64)

[ 21.742511] r10:9eea8bc8 r9:9e4fd340 r8:ee008000 r7:00000000 r6:ee1af500 r5:ee1a9360

[ 21.750405] r4:ee1a9300

[ 21.752955] [<c007a010>] (handle_irq_event) from [<c007cd80>] (handle_fasteoi_irq+0xb8/0x17c)

[ 21.761511] r7:00000000 r6:c099713c r5:ee1a9360 r4:ee1a9300

[ 21.767223] [<c007ccc8>] (handle_fasteoi_irq) from [<c00795b8>] (generic_handle_irq+0x34/0x44)

[ 21.775866] r7:00000000 r6:eca05fb0 r5:00000017 r4:00000017

[ 21.781581] [<c0079584>] (generic_handle_irq) from [<c0079890>] (__handle_domain_irq+0x64/0xbc)

[ 21.790312] r5:00000017 r4:c098cd38

[ 21.793916] [<c007982c>] (__handle_domain_irq) from [<c00094ac>] (gic_handle_irq+0x2c/0x64)

[ 21.802298] r9:9e4fd340 r8:ee008000 r7:fa212000 r6:eca05ea8 r5:c099294c r4:fa21200c

[ 21.810111] [<c0009480>] (gic_handle_irq) from [<c06c17c0>] (__irq_svc+0x40/0x74)

[ 21.817620] Exception stack(0xeca05ea8 to 0xeca05ef0)

[ 21.822691] 5ea0: c06c8154 00000000 c09e12c0 00000000 00000082 00000013

[ 21.830901] 5ec0: 00000000 00000000 ee008000 9e4fd340 9eea8bc8 eca05f4c c09e12c0 eca05ef0

[ 21.839110] 5ee0: c003cf18 c003cfb0 20000113 ffffffff

[ 21.844177] r7:eca05edc r6:ffffffff r5:20000113 r4:c003cfb0

[ 21.849890] [<c003cef8>] (__do_softirq) from [<c003d438>] (irq_exit+0xb8/0x120)

[ 21.857225] r10:9eea8bc8 r9:9e4fd340 r8:ee008000 r7:00000000 r6:00000000 r5:00000013

[ 21.865120] r4:c098cd38

[ 21.867669] [<c003d380>] (irq_exit) from [<c0079894>] (__handle_domain_irq+0x68/0xbc)

[ 21.875528] r5:00000013 r4:c098cd38

[ 21.879131] [<c007982c>] (__handle_domain_irq) from [<c00094ac>] (gic_handle_irq+0x2c/0x64)

[ 21.887513] r9:9e4fd340 r8:30c5387d r7:fa212000 r6:eca05fb0 r5:c099294c r4:fa21200c

[ 21.895324] [<c0009480>] (gic_handle_irq) from [<c06c1aa8>] (__irq_usr+0x48/0x60)

[ 21.902834] Exception stack(0xeca05fb0 to 0xeca05ff8)

[ 21.907904] 5fa0: 9e4fd26c 9eea8bc8 9fbcc93c 00000000

[ 21.916114] 5fc0: 9e4fd26c 9eea8bc8 9fbcc93c 9fa941bc 00000000 9e4fd340 9eea8bc8 0000003e

[ 21.924324] 5fe0: 0009dee0 9e4fd210 b2c1aa39 b2c17ea0 60000030 ffffffff

[ 21.930961] r7:30c5387d r6:ffffffff r5:60000030 r4:b2c17ea0

[ 21.936666] ---[ end trace 975205f49dcfbb8a ]---

0 RonB over 8 years ago in reply to Chris Anderson20

TI__Mastermind 30706 points

Hey Chris,

Sorry you are having trouble. I'm trying to jump in and provide some help.

I wanted to let you know that I have been unable to replicate your results. I've got an HDMI monitor plugged into an AM572x EVM running PLSDK 2.1.0.7 (kernel version 4.1.13). I ran iperf server on the EVM and client on my Linux box and here are the results:

sitara@sitara67-OptiPlex-745:~/ti-processor-sdk-linux-am335x-evm-02.00.01.07/board-support/u-boo
015.07+gitAUTOINC+5922e09363$ iperf -c 192.168.2.119 -u -b 100M -i 1
------------------------------------------------------------
Client connecting to 192.168.2.119, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 224 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.2.1 port 44009 connected with 192.168.2.119 port 5001
[ ID] Interval       Transfer     Bandwidth
[ 3] 0.0- 1.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 1.0- 2.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 2.0- 3.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 3.0- 4.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 4.0- 5.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 5.0- 6.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 6.0- 7.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 7.0- 8.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 8.0- 9.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 9.0-10.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 0.0-10.0 sec   120 MBytes   101 Mbits/sec
[ 3] Sent 85471 datagrams
[ 3] Server Report:
[ 3] 0.0-10.0 sec   120 MBytes   101 Mbits/sec   0.004 ms    0/85470 (0%)
[ 3] 0.0-10.0 sec 1 datagrams received out-of-order
sitara@sitara67-OptiPlex-745:~/ti-processor-sdk-linux-am335x-evm-02.00.01.07/board-support/u-boo
015.07+gitAUTOINC+5922e09363$ iperf -c 192.168.2.119 -u -b 100M -i 1
------------------------------------------------------------
Client connecting to 192.168.2.119, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 224 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.2.1 port 59366 connected with 192.168.2.119 port 5001
[ ID] Interval       Transfer     Bandwidth
[ 3] 0.0- 1.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 1.0- 2.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 2.0- 3.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 3.0- 4.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 4.0- 5.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 5.0- 6.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 6.0- 7.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 7.0- 8.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 8.0- 9.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 9.0-10.0 sec 12.0 MBytes   101 Mbits/sec
[ 3] 0.0-10.0 sec   120 MBytes   101 Mbits/sec
[ 3] Sent 85471 datagrams
[ 3] Server Report:
[ 3] 0.0-10.0 sec   120 MBytes   101 Mbits/sec   0.004 ms    0/85470 (0%)

As you can see, I'm achieving higher bandwidth and no packet loss at 100M. If I move to 200M, I do see a little packet loss. But my results are not nearly as bad as yours.

You can see the client commands I've used above. If there is something else you would like me to try, I will be happy to.

0 Chris Anderson20 over 8 years ago in reply to RonB

Intellectual 690 points

That's odd; I don't understand what the variable here would be other than perhaps iperf2 vs. iperf3, but we are seeing receive-side network problems on all of our AM5728-based systems running the 4.1.13 kernel - iperf3 just gives us a useful metric that seems to showcase the problem. Reviewing test conditions:

- EVM running from clean SD card burned from default SDK 2.0.1.7 pre-built demo distribution filesystem using supplied create-sdcard.sh script.
- Dedicated ethernet connection to PC client (running Centos 7) through upper (away from board) network port on EVM.
- EVM has camera module and LCD attached.
- Test using iperf3 (3.1.3, specifically), built from source on Centos 7 and with TI SDK toolkit (iperf3 has better reporting than the iperf 2 provided in the SDK; also, Centos client-side iperf 2 usually fails when exiting with "did not receive ack of last datagram after n tries"/"connection refused" and won't give packet loss info on my setup; this appears to be due to a long-standing iperf 2 bug.)
- Since we can run the test error-free with the same EVM hardware using an earlier SDK, the issue appears to be with SDK 2.0.1.7 specifically.

Perhaps you could try testing with iperf3? Or is there some other element of the test that is different on your side, or could have an influence here which I haven't mentioned above?

Thanks, I appreciate whatever help you can provide.

0 Chris Anderson20 over 8 years ago in reply to Chris Anderson20

Intellectual 690 points

Is there a fix for the omap_l3_noc crash that results from increasing bd_ram_size? We are certainly seeing receive DMA overruns reported by ethtool, so if the DMA descriptor RAM size could be increased, it may fix our primary issue.

0 Schuyler Patton over 8 years ago in reply to Chris Anderson20

TI__Mastermind 40570 points

Could you please post the ethtool -S output? I am looking for RX CRC errors too.

I am not sure this is a descriptor exhaustion issue. What is the version of the TI EVM that you are using?

0 Chris Anderson20 over 8 years ago in reply to Schuyler Patton

Intellectual 690 points

Attached below is more test output, including ethtool -S. I don't see CRC errors, although there are lots of RX DMA overruns.

Regarding the EVM version, I don't know where to find that info. I do know that we had to do some cuts/jumpers for the serial connection because of an SDK change.

Accepted connection from 192.168.254.200, port 45938

[ 5] local 192.168.254.4 port 5201 connected to 192.168.254.200 port 38431

[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams

[ 5] 0.00-1.00 sec 1.14 MBytes 9.57 Mbits/sec 151734.699 ms 1220/1366 (89%)

[ 5] 1.00-2.00 sec 1.47 MBytes 12.3 Mbits/sec 0.947 ms 1222/1410 (87%)

[ 5] 2.00-3.00 sec 1.23 MBytes 10.3 Mbits/sec 0.632 ms 1502/1659 (91%)

[ 5] 3.00-4.00 sec 1.65 MBytes 13.8 Mbits/sec 0.124 ms 1182/1393 (85%)

[ 5] 4.00-5.00 sec 1.38 MBytes 11.5 Mbits/sec 0.625 ms 1352/1528 (88%)

[ 5] 5.00-6.00 sec 1.29 MBytes 10.8 Mbits/sec 0.067 ms 1359/1524 (89%)

[ 5] 6.00-7.00 sec 1.22 MBytes 10.2 Mbits/sec 0.111 ms 1371/1527 (90%)

[ 5] 7.00-8.00 sec 1.29 MBytes 10.8 Mbits/sec 0.181 ms 1360/1525 (89%)

[ 5] 8.00-9.00 sec 1.23 MBytes 10.4 Mbits/sec 0.067 ms 1368/1526 (90%)

[ 5] 9.00-10.00 sec 1.45 MBytes 12.2 Mbits/sec 0.136 ms 1339/1525 (88%)

[ 5] 10.00-10.04 sec 0.00 Bytes 0.00 bits/sec 0.136 ms 0/0 (0%)

- - - - - - - - - - - - - - - - - - - - - - - - -

[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams

[ 5] 0.00-10.04 sec 0.00 Bytes 0.00 bits/sec 0.136 ms 13275/14983 (89%)

-----------------------------------------------------------

Server listening on 5201

-----------------------------------------------------------

iperf3: interrupt - the server has terminated

root@am57xx-evm:~# ethtool -S eth0

NIC statistics:

Good Rx Frames: 822802

Broadcast Rx Frames: 2

Multicast Rx Frames: 0

Pause Rx Frames: 0

Rx CRC Errors: 0

Rx Align/Code Errors: 0

Oversize Rx Frames: 0

Rx Jabbers: 0

Undersize (Short) Rx Frames: 0

Rx Fragments: 0

Rx Octets: 1238415402

Good Tx Frames: 14599

Broadcast Tx Frames: 51

Multicast Tx Frames: 20

Pause Tx Frames: 0

Deferred Tx Frames: 0

Collisions: 0

Single Collision Tx Frames: 0

Multiple Collision Tx Frames: 0

Excessive Collisions: 0

Late Collisions: 0

Tx Underrun: 0

Carrier Sense Errors: 0

Tx Octets: 1071655

Rx + Tx 64 Octet Frames: 31

Rx + Tx 65-127 Octet Frames: 14562

Rx + Tx 128-255 Octet Frames: 9

Rx + Tx 256-511 Octet Frames: 120

Rx + Tx 512-1023 Octet Frames: 15299

Rx + Tx 1024-Up Octet Frames: 807380

Net Octets: 1239487057

Rx Start of Frame Overruns: 15298

Rx Middle of Frame Overruns: 0

Rx DMA Overruns: 15298

Rx DMA chan: head_enqueue: 1

Rx DMA chan: tail_enqueue: 807567

Rx DMA chan: pad_enqueue: 0

Rx DMA chan: misqueued: 24

Rx DMA chan: desc_alloc_fail: 0

Rx DMA chan: pad_alloc_fail: 0

Rx DMA chan: runt_receive_buf: 0

Rx DMA chan: runt_transmit_buf: 0

Rx DMA chan: empty_dequeue: 0

Rx DMA chan: busy_dequeue: 381499

Rx DMA chan: good_dequeue: 807504

Rx DMA chan: requeue: 204

Rx DMA chan: teardown_dequeue: 0

Tx DMA chan: head_enqueue: 7813

Tx DMA chan: tail_enqueue: 6786

Tx DMA chan: pad_enqueue: 0

Tx DMA chan: misqueued: 6786

Tx DMA chan: desc_alloc_fail: 0

Tx DMA chan: pad_alloc_fail: 0

Tx DMA chan: runt_receive_buf: 0

Tx DMA chan: runt_transmit_buf: 25

Tx DMA chan: empty_dequeue: 7751

Tx DMA chan: busy_dequeue: 29

Tx DMA chan: good_dequeue: 14599

Tx DMA chan: requeue: 7813

Tx DMA chan: teardown_dequeue: 0
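When comparing several runs, it can help to pull the overrun counters out of the `ethtool -S` text automatically. The following Python sketch is illustrative only; `parse_ethtool_stats` and `overrun_counters` are hypothetical helper names, and the sample text is a subset of the counters posted above. It assumes the plain `name: value` layout shown in this post.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S` output into a {counter name: int} dict."""
    stats = {}
    for line in text.splitlines():
        # Split on the last colon: names such as
        # "Rx DMA chan: head_enqueue" contain a colon themselves.
        name, sep, value = line.strip().rpartition(":")
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value)
    return stats

def overrun_counters(stats):
    """Return the non-zero counters whose name mentions an overrun."""
    return {k: v for k, v in stats.items() if "overrun" in k.lower() and v}

# Subset of the statistics posted above.
sample = """NIC statistics:
Good Rx Frames: 822802
Rx CRC Errors: 0
Rx Start of Frame Overruns: 15298
Rx Middle of Frame Overruns: 0
Rx DMA Overruns: 15298
Rx DMA chan: desc_alloc_fail: 0
"""
stats = parse_ethtool_stats(sample)
print(overrun_counters(stats))
```

Header lines, shell prompts, and blank lines carry no numeric value after the last colon, so they are skipped.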

0 Schuyler Patton over 8 years ago in reply to Chris Anderson20

TI__Mastermind 40570 points

Could you post a picture of both sides of the board?
That data loss is really high for the bit rate being used.
Could you also post the command lines used for both the client and server sides of iperf?
Could you also post the plain ethtool eth0 results, please?

0 Chris Anderson20 over 8 years ago in reply to Schuyler Patton

Intellectual 690 points

Board pictures are attached.

Server command (on board): iperf3 -s

Client command (on PC, attached via dedicated gig-E link): iperf3 -c 192.168.254.4 -u -b 100M
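For repeated runs, iperf3 can emit a JSON report (`-J` / `--json`), which is easier to post-process than the text table. The sketch below is a hedged illustration: `udp_loss_from_json` is a hypothetical helper, the field names under `end.sum` are what iperf3 is understood to emit for UDP tests and should be verified against your build, and the sample report is hand-written from the totals posted earlier in this thread rather than captured iperf3 output.

```python
import json

def udp_loss_from_json(report_text):
    """Extract (lost, total, percent) from an iperf3 --json UDP report."""
    report = json.loads(report_text)
    summary = report["end"]["sum"]
    lost = summary["lost_packets"]
    total = summary["packets"]
    percent = 100.0 * lost / total if total else 0.0
    return lost, total, percent

# Hand-written stand-in for a report, reduced to the two fields used here.
sample_report = json.dumps(
    {"end": {"sum": {"lost_packets": 13275, "packets": 14983}}}
)

lost, total, percent = udp_loss_from_json(sample_report)
print(f"{lost}/{total} lost ({percent:.0f}%)")
```

A report could be captured with something like `iperf3 -c 192.168.254.4 -u -b 100M -J > run.json` and its contents passed to the helper.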

Ethtool output:

root@am57xx-evm:~# ethtool eth0

Settings for eth0:

Supported ports: [ TP MII ]

Supported link modes: 10baseT/Half 10baseT/Full

100baseT/Half 100baseT/Full

1000baseT/Half 1000baseT/Full

Supported pause frame use: Symmetric

Supports auto-negotiation: Yes

Advertised link modes: 10baseT/Half 10baseT/Full

100baseT/Half 100baseT/Full

1000baseT/Half 1000baseT/Full

Advertised pause frame use: Symmetric

Advertised auto-negotiation: Yes

Link partner advertised link modes: 10baseT/Half 10baseT/Full

100baseT/Half 100baseT/Full

1000baseT/Full

Link partner advertised pause frame use: Symmetric Receive-only

Link partner advertised auto-negotiation: Yes

Speed: 1000Mb/s

Duplex: Full

Port: MII

PHYAD: 1

Transceiver: external

Auto-negotiation: on

Supports Wake-on: d

Wake-on: d

Current message level: 0x00000000 (0)

Link detected: yes

0 RonB over 8 years ago in reply to Chris Anderson20

TI__Mastermind 30706 points

Chris,

We put iperf3 on our board and can confirm that there is definitely a difference between iperf (2) and iperf3. Our guess is iperf3 is more "bursty" in how it sends out the packets. Google searches seem to reveal this is occurring in a lot of places.

When we use the -l 65507 option, the packet loss improves quite a bit. This seems to confirm the burstiness aspect. So, if the traffic is more bursty, larger buffers are needed to handle it.

The real question is what does your system expect? Do you need to design for this level of burstiness? There are two places to make changes: the kernel itself and the descriptors that we've already mentioned. The kernel can be tuned with sysctl changes, and you may have already done that.
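To put the burstiness argument into rough numbers: a sender that averages 100 Mb/s but emits its datagrams back to back on a gigabit link delivers frames about ten times faster than the average rate suggests. The Python sketch below is a back-of-the-envelope model, not a description of the driver. The 66-byte per-frame overhead and the descriptor counts are assumptions chosen for illustration, and the model takes the worst case in which the CPU returns no descriptors during the burst.

```python
PAYLOAD_BYTES = 1470        # datagram size reported by iperf 2 earlier in the thread
WIRE_OVERHEAD_BYTES = 66    # assumed: UDP 8 + IPv4 20 + Ethernet 14 + FCS 4 + preamble 8 + IFG 12
LINK_BPS = 1_000_000_000    # gigabit link
TARGET_BPS = 100_000_000    # requested payload rate (-b 100M)

def frame_time_us(payload=PAYLOAD_BYTES, link_bps=LINK_BPS):
    """Time one frame occupies the wire at line rate, in microseconds."""
    return (payload + WIRE_OVERHEAD_BYTES) * 8 / link_bps * 1e6

def average_gap_us(payload=PAYLOAD_BYTES, target_bps=TARGET_BPS):
    """Mean spacing between datagrams at the requested payload rate."""
    return payload * 8 / target_bps * 1e6

def burst_fill_time_ms(descriptors, payload=PAYLOAD_BYTES):
    """Time for a line-rate burst to use up `descriptors` free RX descriptors
    if none are handed back in the meantime (worst case)."""
    return descriptors * frame_time_us(payload) / 1000

print(f"frame time at line rate: {frame_time_us():.1f} us")
print(f"average gap at 100 Mb/s: {average_gap_us():.1f} us")
for n in (64, 2048):  # illustrative descriptor counts, not measured values
    print(f"{n} descriptors absorb a line-rate burst of about {burst_fill_time_ms(n):.2f} ms")
```

Under these assumptions a small descriptor pool is used up by a burst lasting under a millisecond, which is why either more descriptors or a smoother sender reduces the overruns.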

For the descriptor, we are going to try to backport a patchset we put into 4.4 to make the number of descriptors tunable. We will let you know how we are progressing tomorrow.

0 Chris Anderson20 over 8 years ago in reply to RonB

Intellectual 690 points

Our system needs not to drop packets, regardless of burstiness; it's a high-end streaming media device and has to be running multiple video streams with best reliability both in and out over the network. Some of the libraries we are using don't give us the ability to easily tune buffer sizes, so we need to be able to tune the kernel to use memory as necessary to buffer incoming/outgoing data as necessary.

Thanks for your help. We've done some sysfs tuning without much effect, so please do send me any info/pointers/patches you've got so we can resolve this ASAP, as it's a "can't ship like this" problem for the product.

0 Chris Anderson20 over 8 years ago in reply to Chris Anderson20

Intellectual 690 points

Any progress towards availability of a patch for this issue?

0 Schuyler Patton over 8 years ago in reply to Chris Anderson20

TI__Mastermind 40570 points

Attached is a patch that I tested on the kernel you are using that will enable functionality to move RX descriptors off chip. The attached file is a .txt because the .patch extension prevents posting.

0001-driver-net-cpsw-add-no_bd_ram-dt-parsing.txt

From a97fce86a9e48bc776cd8f9c89489f472b185c1e Mon Sep 17 00:00:00 2001
From: Mugunthan V N <mugunthanvnm@ti.com>
Date: Tue, 22 Sep 2015 19:16:38 +0530
Subject: [PATCH 1/2] driver: net: cpsw: add no_bd_ram dt parsing

cpdma is capable of placing the dma descriptors in ddr using
dma_alloc_coherent() when the internal bd ram size is not enough.
To utilize this feature parse the DT parameter "no_bd_ram" and
pass it to cpdma.

Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
---
 drivers/net/ethernet/ti/cpsw.c | 4 ++++
 drivers/net/ethernet/ti/cpsw.h | 1 +
 2 files changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index b536b4c..3ec3e1f 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -1987,6 +1987,8 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
 	}
 	data->ale_entries = prop;
 
+	data->no_bd_ram = of_property_read_bool(node, "no_bd_ram");
+
 	if (of_property_read_u32(node, "bd_ram_size", &prop)) {
 		dev_err(&pdev->dev, "Missing bd_ram_size property in the DT.\n");
 		return -EINVAL;
@@ -2321,6 +2323,8 @@ static int cpsw_probe(struct platform_device *pdev)
 	dma_params.desc_mem_size	= data->bd_ram_size;
 	dma_params.desc_align		= 16;
 	dma_params.has_ext_regs		= true;
+	if (data->no_bd_ram)
+		dma_params.desc_mem_phys = 0;
 	dma_params.desc_hw_addr         = dma_params.desc_mem_phys;
 
 	priv->dma = cpdma_ctlr_create(&dma_params);
diff --git a/drivers/net/ethernet/ti/cpsw.h b/drivers/net/ethernet/ti/cpsw.h
index ca90efa..b654ac2 100644
--- a/drivers/net/ethernet/ti/cpsw.h
+++ b/drivers/net/ethernet/ti/cpsw.h
@@ -33,6 +33,7 @@ struct cpsw_platform_data {
 	u32	cpts_clock_mult;  /* convert input clock ticks to nanoseconds */
 	u32	cpts_clock_shift; /* convert input clock ticks to nanoseconds */
 	u32	ale_entries;	/* ale table size */
+	bool	no_bd_ram;	/* set if cpsw bd ram should not be used */
 	u32	bd_ram_size;  /*buffer descriptor ram size */
 	u32	rx_descs;	/* Number of Rx Descriptios */
 	u32	mac_control;	/* Mac control register */
-- 
1.9.1

Here is the change that is necessary in the board dts file to enable this capability, I applied this to the am572x-idk.dts as a test:

&mac{

no_bd_ram = <1>;

bd_ram_size = <0x40000>;

rx_descs = <2048>;

};

Some network parameters will also need to be changed using sysctl; otherwise UDP packets will be dropped at the network layer. Here are the commands I used. They may work for your environment as well, but you may need to tune these values further:

sysctl -w net.core.netdev_max_backlog=20000

sysctl -w net.ipv4.udp_mem='17565 87380 50331648'

sysctl -w net.core.rmem_max=50331648

sysctl -w net.core.rmem_default=50331648

sysctl -w net.ipv4.route.flush=1
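A note on units when tuning these values: net.core.rmem_max and net.core.rmem_default are given in bytes, whereas net.ipv4.udp_mem is counted in memory pages. The Python sketch below converts the values from the commands above into MiB. The 4 KiB page size is an assumption; confirm it on the target with `getconf PAGESIZE`.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; confirm with `getconf PAGESIZE`

def mib(nbytes):
    """Convert bytes to MiB."""
    return nbytes / (1024 * 1024)

rmem_max = 50331648                        # bytes (also used for rmem_default)
udp_mem_pages = (17565, 87380, 50331648)   # min, pressure, max, in pages

print(f"rmem_max / rmem_default: {mib(rmem_max):.0f} MiB per socket")
for label, pages in zip(("min", "pressure", "max"), udp_mem_pages):
    print(f"udp_mem {label}: {pages} pages = {mib(pages * PAGE_SIZE):.1f} MiB")
```

With a 4 KiB page, the udp_mem maximum works out to 192 GiB, so in practice it acts as "no global limit", and the per-socket receive buffer size is the value that bounds each receive queue.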

The 4.4 kernel in the 3.01.00.06 PLSDK has support for moving the descriptors to RAM. Backporting the support in the current SDK is not really feasible at this point but the attached will enable that capability.

0 Chris Anderson20 over 8 years ago in reply to Schuyler Patton

Intellectual 690 points

The patched kernel with the DTS and sysctl changes no longer exhibits packet loss issues, but then the network stops working entirely after about 15 - 30 minutes of continuous traffic. Apps then see network syscall errors and the interface stops responding externally (e.g. no response to ARP). There are no kernel messages; all network activity stops although the interface status appears OK. Issuing 'ifconfig eth0 down' then gives a kernel warning/traceback originating in davinci_cpdma.c:896 (it's a WARN(!timeout) in cpdma_chan_stop()).

Rebooting is the only way to recover to a working interface again.

Can you provide a patch that works reliably with this kernel? Moving to the 3.x SDK with the newer kernel is not a short-term option for us.

Note: The provided patch did not match up well with the line numbers in the cpsw driver code in our 4.1.13 kernel from the 02.00.01.07 TI SDK (off by hundreds of lines) although I was able to apply it via context.

0 Schuyler Patton over 8 years ago in reply to Chris Anderson20

TI__Mastermind 40570 points

Is your test running on the TI EVM?
With modifications from the earlier post I have been running iperf2 at 600 Mbps receive without overruns, errors or the link stopping.
When the link stops responding could you post what devmem2 for 0x484848e0 is? This is the free buffer count on receive.
The test I am doing is probably not as demanding as your test. Can you describe the network traffic that is being run against the board?

0 Chris Anderson20 over 8 years ago in reply to Schuyler Patton

Intellectual 690 points

This test is running on our board, which has an EVM-based design plus some video capture hardware etc. for our application. The board itself has been running fine on the 2.0.1.7 base SDK/kernel with the exception of this network issue for many months now. I'll work to reproduce it on the EVM if that will help you diagnose, but I can't run the same code there due to I/O limitations.

When the network dies, all of the receive buffers appear to be free:
root@salami:~# devmem2 0x484848e0
/dev/mem opened.
Memory mapped at address 0xb6fc0000.
Read at address 0x484848E0 (0xb6fc08e0): 0x00000800

However, no traffic can be received or sent (wireshark sees no traffic at all at the other end; any attempt to send/receive packets e.g. via ping just hangs).
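The register value quoted above is hexadecimal, so it is worth decoding before comparing it with the configured descriptor count. The Python sketch below is illustrative; `parse_devmem2_read` is a hypothetical helper, the meaning of the register (free receive buffer count) is taken from the earlier post, and `RX_DESCS` is the rx_descs value from the DTS fragment posted with the patch.

```python
import re

def parse_devmem2_read(line):
    """Return (address, value) from a devmem2 'Read at address ...' line."""
    m = re.search(
        r"Read at address\s+(0x[0-9A-Fa-f]+)\s+\(.*?\):\s+(0x[0-9A-Fa-f]+)", line
    )
    if m is None:
        raise ValueError("not a devmem2 read line")
    return int(m.group(1), 16), int(m.group(2), 16)

RX_DESCS = 2048  # rx_descs value from the DTS fragment posted with the patch

addr, free_buffers = parse_devmem2_read(
    "Read at address 0x484848E0 (0xb6fc08e0): 0x00000800"
)
print(f"free RX buffers at {addr:#x}: {free_buffers} of {RX_DESCS}")
```

0x800 is 2048, which equals the configured rx_descs, consistent with the observation that every receive buffer is free while no traffic flows.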

Regarding the network traffic mix, for this test it is a relatively light combination of:

- An outgoing H.264 RTMP (TCP) stream at ~11Mb/s
- Frequent small HTTP traffic back and forth with web server
- ssh session doing performance measurements e.g. htop
- other (mostly UDP) protocols in background including DHCP, NTP, DNS, mDNS, SSDP.

I didn't get to the heavyweight tests because this one fails fairly quickly.

The board itself is also running VIP, VPE, codec, DSS driving 1080p60 output, active USB & eMMC, CPU running at OPP1 (1.16GHz) with ~40% utilization of both cores.

0 Chris Anderson20 over 8 years ago in reply to Schuyler Patton

Intellectual 690 points

We have isolated the network failure on a patched system to occur only if the video codec is also in operation.

Specifically, if the previously described iperf3 test is run while also running 'videnc2test' the failure is produced. Note that it may take some time (many minutes) for the network to die, but the failure is quite repeatable.

Since we have a streaming application, we need to run the network and the video codec at the same time. Is there a fix for this?

0 Rogerio Almeida over 8 years ago in reply to Chris Anderson20

TI__Mastermind 26215 points

Chris,
Sorry for late reply...
Most of the apps team are on vacation this week; I am checking if I can get an answer for you asap.

0 Schuyler Patton over 8 years ago in reply to Rogerio Almeida

TI__Mastermind 40570 points

I set up a system here that is perhaps similar, but so far I do not see the problem that you are having. I ran a test here combining iperf running at about 300Mbps UDP receive and the h264 enc test from the out of box demo, which was modified to run continuously.

To summarize, the problem is the network is down but the system is functioning fine otherwise?

To look at the network down problem for the moment:

- Network traffic was running fine until iperf+video enc test is run?

- When running "ethtool eth0" (assuming this is the port being used) is the link status still shown as detected?

- After the problem is detected, does "ethtool -S eth0" show the RX byte count increasing during the ping message?

0 Chris Anderson20 over 8 years ago in reply to Schuyler Patton

Intellectual 690 points

> To summarize, the problem is the network is down but the system is functioning fine otherwise?

Yes. The network stops moving any packets from an app perspective but otherwise appears to still be up. Control operations such as "ifconfig eth0 down" result in kernel errors related to CPDMA. ethtool shows that all CP DMA descriptors are free.

> To look at the network down problem for the moment:

> - Network traffic was running fine until iperf+video enc test is run?

Yes. iperf/network in both directions will run indefinitely without errors in the absence of codec test. Running codec test causes network failure within 1-30 minutes or so.

> - When running "ethtool eth0" (assuming this is the port being used) is the link status still shown as detected?

Yes.

> - After the problem is detected, does "ethtool -S eth0" show the RX byte count increasing during the ping message?

The "good Rx frame" count does increase, though I don't see a byte count in the ethtool output.

0 Schuyler Patton over 8 years ago in reply to Chris Anderson20

TI__Mastermind 40570 points

Thanks for the answers. That says that the link between the PHY and the processor is passing packets, since the ethtool dump of the hardware statistics block shows an increasing rx frame count. The RX Octets should also be increasing along with the RX frames.

Does "cat /proc/net/snmp | grep IcmpMsg" show an increasing count after the link is not responding and the ping messages are being sent from another machine? The icmp count should also be the same between in and out packets.

This current test, is it on your hardware or the TI EVM?

0 Schuyler Patton over 8 years ago in reply to Schuyler Patton

TI__Mastermind 40570 points

Could you also please post the message you are seeing about CPDMA errors?
Does an ifconfig up/down sequence recover the network connectivity?

0 Chris Anderson20 over 8 years ago in reply to Schuyler Patton

Intellectual 690 points

Stopping the codec and doing ifconfig eth0 down / up does recover the network connection (with some delay).

The failure occurs again soon if the video codec is still active.

I've attached the two errors we see and an explanation of when they occur, below.

The initial failure is a transmit queue timeout:

[ 69.058050] ------------[ cut here ]------------

[ 69.062731] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x264/0x270()

[ 69.071236] NETDEV WATCHDOG: eth0 (cpsw): transmit queue 0 timed out

[ 69.077632] Modules linked in: usb_f_acm u_serial g_serial libcomposite usb_storage xhci_plat_hcd xhci_hcd usbcore rpmsg_rpc dwc3 rtc_palmas virtio_rpmsg_bus udc_core extcon_palmas rtc_ds1307 phy_omap_usb2 snd_soc_tlv320aic3x omap_wdt rtc_omap dwc3_omap snd_soc_pcm5102a omap_remoteproc remoteproc virtio virtio_ring extcon_usb_gpio extcon

[ 69.108009] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.1.13-aja-helo-g72e25e4-dirty #3

[ 69.116070] Hardware name: Generic DRA74X (Flattened Device Tree)

[ 69.122211] Backtrace:

[ 69.124692] [<c0012f78>] (dump_backtrace) from [<c001319c>] (show_stack+0x18/0x1c)

[ 69.132320] r7:c05f3628 r6:0000012f r5:c09b1024 r4:00000000

[ 69.138062] [<c0013184>] (show_stack) from [<c06bbb98>] (dump_stack+0x9c/0xdc)

[ 69.145325] [<c06bbafc>] (dump_stack) from [<c0039a38>] (warn_slowpath_common+0x88/0xb8)

[ 69.153469] r5:00000009 r4:c0991d60

[ 69.157090] [<c00399b0>] (warn_slowpath_common) from [<c0039aa0>] (warn_slowpath_fmt+0x38/0x40)

[ 69.165850] r8:c09dd167 r7:c0992100 r6:ed8f5840 r5:ed930000 r4:c08f6230

[ 69.172638] [<c0039a6c>] (warn_slowpath_fmt) from [<c05f3628>] (dev_watchdog+0x264/0x270)

[ 69.180875] r3:ed930000 r2:c08f6230

[ 69.184486] r4:00000000

[ 69.187046] [<c05f33c4>] (dev_watchdog) from [<c0089314>] (call_timer_fn+0x2c/0xa0)

[ 69.194742] r10:ed930000 r9:c05f33c4 r8:00200200 r7:00000000 r6:c05f33c4 r5:00000101

[ 69.202680] r4:ed930264

[ 69.205242] [<c00892e8>] (call_timer_fn) from [<c0089930>] (run_timer_softirq+0x1d4/0x250)

[ 69.213558] r6:c0991e00 r5:c09f3540 r4:ed930264

[ 69.218237] [<c008975c>] (run_timer_softirq) from [<c003d038>] (__do_softirq+0x140/0x264)

[ 69.226465] r10:c0992080 r9:40000001 r8:00000001 r7:00000101 r6:c0990000 r5:c0992084

[ 69.234397] r4:000000a0

[ 69.236958] [<c003cef8>] (__do_softirq) from [<c003d438>] (irq_exit+0xb8/0x120)

[ 69.244309] r10:c09dcfb3 r9:c06c8168 r8:ee008000 r7:00000000 r6:00000000 r5:00000013

[ 69.252245] r4:c098cd38

[ 69.254810] [<c003d380>] (irq_exit) from [<c0079894>] (__handle_domain_irq+0x68/0xbc)

[ 69.262683] r5:00000013 r4:c098cd38

[ 69.266313] [<c007982c>] (__handle_domain_irq) from [<c00094ac>] (gic_handle_irq+0x2c/0x64)

[ 69.274714] r9:c06c8168 r8:00000000 r7:fa212000 r6:c0991ef8 r5:c099294c r4:fa21200c

[ 69.282558] [<c0009480>] (gic_handle_irq) from [<c06c1880>] (__irq_svc+0x40/0x74)

[ 69.290086] Exception stack(0xc0991ef8 to 0xc0991f40)

[ 69.295161] 1ee0: 00000001 00000000

[ 69.303401] 1f00: c09e06b0 00000000 c0990000 c09925b4 c0992568 00000000 00000000 c06c8168

[ 69.311625] 1f20: c09dcfb3 c0991f4c c0991f2c c0991f40 c002a838 c00104c0 60030013 ffffffff

[ 69.319856] r7:c0991f2c r6:ffffffff r5:60030013 r4:c00104c0

[ 69.325594] [<c0010498>] (arch_cpu_idle) from [<c0070618>] (cpu_startup_entry+0x2a0/0x31c)

[ 69.333917] [<c0070378>] (cpu_startup_entry) from [<c06b7fd8>] (rest_init+0x90/0x94)

[ 69.341702] r7:00000000

[ 69.344288] [<c06b7f48>] (rest_init) from [<c093ed4c>] (start_kernel+0x404/0x410)

[ 69.351843] r5:00000000 r4:c09e0050

[ 69.355510] [<c093e948>] (start_kernel) from [<80008090>] (0x80008090)

[ 69.362082] ---[ end trace b4cdc5c99a7b5964 ]---

*******

subsequent to that, we get CPDMA channel stop errors:

[ 80.160529] ------------[ cut here ]------------

[ 80.165175] WARNING: CPU: 0 PID: 0 at drivers/net/ethernet/ti/davinci_cpdma.c:896 cpdma_chan_stop+0x16c/0x184()

[ 80.175302] Modules linked in: usb_f_acm u_serial g_serial libcomposite usb_storage xhci_plat_hcd xhci_hcd usbcore rpmsg_rpc dwc3 rtc_palmas virtio_rpmsg_bus udc_core extcon_palmas rtc_ds1307 phy_omap_usb2 snd_soc_tlv320aic3x omap_wdt rtc_omap dwc3_omap snd_soc_pcm5102a omap_remoteproc remoteproc virtio virtio_ring extcon_usb_gpio extcon

[ 80.205529] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.1.13-aja-helo-g72e25e4-dirty #3

[ 80.214786] Hardware name: Generic DRA74X (Flattened Device Tree)

[ 80.220901] Backtrace:

[ 80.223374] [<c0012f78>] (dump_backtrace) from [<c001319c>] (show_stack+0x18/0x1c)

[ 80.230972] r7:c04d2fbc r6:00000380 r5:c09b1024 r4:00000000

[ 80.236689] [<c0013184>] (show_stack) from [<c06bbb98>] (dump_stack+0x9c/0xdc)

[ 80.243946] [<c06bbafc>] (dump_stack) from [<c0039a38>] (warn_slowpath_common+0x88/0xb8)

[ 80.252067] r5:00000009 r4:00000000

[ 80.255671] [<c00399b0>] (warn_slowpath_common) from [<c0039b0c>] (warn_slowpath_null+0x24/0x2c)

[ 80.264490] r8:00000000 r7:ed8e9050 r6:ee3864b0 r5:200d0113 r4:ee386490

[ 80.271255] [<c0039ae8>] (warn_slowpath_null) from [<c04d2fbc>] (cpdma_chan_stop+0x16c/0x184)

[ 80.279818] [<c04d2e50>] (cpdma_chan_stop) from [<c04d58dc>] (cpsw_ndo_tx_timeout+0x5c/0xb4)

[ 80.288287] r9:00000140 r8:c09dd167 r7:c0992100 r6:ed8f5840 r5:00000000 r4:ed930000

[ 80.296100] [<c04d5880>] (cpsw_ndo_tx_timeout) from [<c05f35f4>] (dev_watchdog+0x230/0x270)

[ 80.304481] r5:ed930000 r4:00000000

[ 80.308085] [<c05f33c4>] (dev_watchdog) from [<c0089314>] (call_timer_fn+0x2c/0xa0)

[ 80.315770] r10:ed930000 r9:c05f33c4 r8:00200200 r7:00000000 r6:c05f33c4 r5:00000101

[ 80.323664] r4:ed930264

[ 80.326214] [<c00892e8>] (call_timer_fn) from [<c0089930>] (run_timer_softirq+0x1d4/0x250)

[ 80.334510] r6:c0991e00 r5:c09f3540 r4:ed930264

[ 80.339168] [<c008975c>] (run_timer_softirq) from [<c003d038>] (__do_softirq+0x140/0x264)

[ 80.347376] r10:c0992080 r9:40000001 r8:00000001 r7:00000101 r6:c0990000 r5:c0992084

[ 80.355271] r4:000000a0

[ 80.357821] [<c003cef8>] (__do_softirq) from [<c003d438>] (irq_exit+0xb8/0x120)

[ 80.365156] r10:c09dcfb3 r9:c06c8168 r8:ee008000 r7:00000000 r6:00000000 r5:00000013

[ 80.373051] r4:c098cd38

[ 80.375604] [<c003d380>] (irq_exit) from [<c0079894>] (__handle_domain_irq+0x68/0xbc)

[ 80.383463] r5:00000013 r4:c098cd38

[ 80.387067] [<c007982c>] (__handle_domain_irq) from [<c00094ac>] (gic_handle_irq+0x2c/0x64)

[ 80.395449] r9:c06c8168 r8:00000000 r7:fa212000 r6:c0991ef8 r5:c099294c r4:fa21200c

[ 80.403261] [<c0009480>] (gic_handle_irq) from [<c06c1880>] (__irq_svc+0x40/0x74)

[ 80.410772] Exception stack(0xc0991ef8 to 0xc0991f40)

[ 80.415842] 1ee0: 00000001 00000000

[ 80.424054] 1f00: c09e06b0 00000000 c0990000 c09925b4 c0992568 00000000 00000000 c06c8168

[ 80.432264] 1f20: c09dcfb3 c0991f4c c0991f2c c0991f40 c002a838 c00104c0 600d0013 ffffffff

[ 80.440471] r7:c0991f2c r6:ffffffff r5:600d0013 r4:c00104c0

[ 80.446184] [<c0010498>] (arch_cpu_idle) from [<c0070618>] (cpu_startup_entry+0x2a0/0x31c)

[ 80.454483] [<c0070378>] (cpu_startup_entry) from [<c06b7fd8>] (rest_init+0x90/0x94)

[ 80.462254] r7:00000000

[ 80.464806] [<c06b7f48>] (rest_init) from [<c093ed4c>] (start_kernel+0x404/0x410)

[ 80.472316] r5:00000000 r4:c09e0050

[ 80.475921] [<c093e948>] (start_kernel) from [<80008090>] (0x80008090)

[ 80.482474] ---[ end trace b4cdc5c99a7b5966 ]---

1483985612.041512 enetd W 1103 ../../enetd/src/enetd.cpp:175: RX 0 (LEN 0, OVR 0, CRC 0, FRM 0, FFO 0, MIS 0)

1483985612.041532 enetd W 1103 ../../enetd/src/enetd.cpp:182: TX 2 (ABT 0, CAR 0, FFO 0, HB 0, WIN 0)

1483985612.041549 enetd I 1103 ../../enetd/src/enetd.cpp:193: Link Errors: 2

[ 91.098073] omapdrm omapdrm.0: atomic complete timeout (pipe 0)!

[ 91.160546] ------------[ cut here ]------------

0 Chris Anderson20 over 8 years ago in reply to Schuyler Patton

Intellectual 690 points

The icmp counts are shown below, captured during pinging. The ping eventually fails because we get 'destination host unreachable' errors.

IcmpMsg: InType3 InType8 OutType0 OutType3
IcmpMsg: 4971 173 173 4969

IcmpMsg: InType3 InType8 OutType0 OutType3
IcmpMsg: 4975 176 176 4973

IcmpMsg: InType3 InType8 OutType0 OutType3
IcmpMsg: 4978 179 179 4976

IcmpMsg: InType3 InType8 OutType0 OutType3
IcmpMsg: 4982 181 181 4980

The ICMP counts are not symmetric, as you can see.
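
To see which counters diverge, the IcmpMsg header/value line pairs can be compared programmatically. This is a rough sketch, assuming the input is the text captured from /proc/net/snmp as above; the function names are illustrative only. ICMP type 8 is an echo request, type 0 an echo reply, and type 3 destination unreachable.

```python
def parse_icmpmsg(text):
    """Turn alternating 'IcmpMsg:' header/value lines into a list of dicts."""
    samples = []
    header = None
    for line in text.splitlines():
        if not line.startswith("IcmpMsg:"):
            continue
        fields = line.split()[1:]
        if header is None:
            header = fields  # first line of a pair: counter names
        else:
            samples.append(dict(zip(header, map(int, fields))))
            header = None    # next IcmpMsg line starts a new pair
    return samples


def in_out_difference(samples, in_key, out_key):
    """Per-sample difference between an input and an output counter."""
    return [s.get(in_key, 0) - s.get(out_key, 0) for s in samples]


if __name__ == "__main__":
    capture = """\
IcmpMsg: InType3 InType8 OutType0 OutType3
IcmpMsg: 4971 173 173 4969
IcmpMsg: InType3 InType8 OutType0 OutType3
IcmpMsg: 4975 176 176 4973
"""
    samples = parse_icmpmsg(capture)
    print("echo request minus reply:", in_out_difference(samples, "InType8", "OutType0"))
    print("unreachable in minus out:", in_out_difference(samples, "InType3", "OutType3"))
```

Applied to the four samples above, the echo request and echo reply counts match in every sample, while InType3 and OutType3 differ by a constant 2.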

This latest test is on our own board; all of the tests I've tried fail identically on the TI EVM board, but if you need me to try something specific there, I can do so.

0 Schuyler Patton over 8 years ago in reply to Chris Anderson20

TI__Mastermind 40570 points

The watchdog timeout might be a clue. Could you please attach the full results of ethtool -S eth0 for the TI EVM?

Also is the link partner and cable the same when testing with both your board and the TI EVM?

I am still unable to reproduce the network down problem, could you post the exact command you are using to run the video encoder on the TI EVM?

0 Schuyler Patton over 8 years ago in reply to Schuyler Patton

TI__Mastermind 40570 points

Could you also please dump the following address using devmem2 0x4a100D84 after the network down problem has occurred?

This is the MAC SL1 mac control address, I am looking to see if the interface is in half duplex mode.

Could you also attach the full ethtool eth0 output before causing the network down condition and after it occurs?

Is the link partner a switch or a HUB or direct connect to test equipment of some kind?

0 Chris Anderson20 over 8 years ago in reply to Schuyler Patton

Intellectual 690 points

Output of ethtool eth0 prior to net failure:

Settings for eth0:

Supported ports: [ TP MII ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Link partner advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Link partner advertised pause frame use: No
Link partner advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: MII
PHYAD: 1
Transceiver: external
Auto-negotiation: on
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000000 (0)

Link detected: yes

Output after failure:

Settings for eth0:
Supported ports: [ TP MII ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Link partner advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Link partner advertised pause frame use: Symmetric Receive-only
Link partner advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: MII
PHYAD: 1
Transceiver: external
Auto-negotiation: on
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000000 (0)

Link detected: yes
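
Comparing the two dumps by eye is error-prone, so here is a small sketch for diffing two saved "ethtool eth0" outputs. It assumes each dump is available as plain text; the function names are made up for illustration. Lines without a colon are treated as continuations of the previous field (the wrapped link-mode lists).

```python
def parse_settings(text):
    """Parse 'Key: value' lines of an ethtool settings dump into a dict."""
    settings = {}
    key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("Settings for"):
            continue
        name, sep, value = line.partition(":")
        if sep:
            key = name.strip()
            settings[key] = value.strip()
        elif key is not None:
            # Continuation line, e.g. further link modes of the previous field.
            settings[key] += " " + line
    return settings


def diff_settings(before, after):
    """Return {field: (before, after)} for every field that differs."""
    keys = sorted(set(before) | set(after))
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}


if __name__ == "__main__":
    before = parse_settings("Speed: 1000Mb/s\nLink partner advertised pause frame use: No\n")
    after = parse_settings("Speed: 1000Mb/s\nLink partner advertised pause frame use: Symmetric Receive-only\n")
    for field, (old, new) in diff_settings(before, after).items():
        print(f"{field}: {old!r} -> {new!r}")
```

In the two dumps above, the only field that differs is "Link partner advertised pause frame use" (No before the failure, Symmetric Receive-only after).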

ethtool -S eth0 after failure:

NIC statistics:
Good Rx Frames: 2022
Broadcast Rx Frames: 220
Multicast Rx Frames: 0
Pause Rx Frames: 0
Rx CRC Errors: 0
Rx Align/Code Errors: 0
Oversize Rx Frames: 0
Rx Jabbers: 0
Undersize (Short) Rx Frames: 0
Rx Fragments: 0
Rx Octets: 236946
Good Tx Frames: 41428
Broadcast Tx Frames: 16
Multicast Tx Frames: 76
Pause Tx Frames: 0
Deferred Tx Frames: 0
Collisions: 0
Single Collision Tx Frames: 0
Multiple Collision Tx Frames: 0
Excessive Collisions: 0
Late Collisions: 0
Tx Underrun: 0
Carrier Sense Errors: 0
Tx Octets: 54751650
Rx + Tx 64 Octet Frames: 252
Rx + Tx 65-127 Octet Frames: 2044
Rx + Tx 128-255 Octet Frames: 215
Rx + Tx 256-511 Octet Frames: 496
Rx + Tx 512-1023 Octet Frames: 648
Rx + Tx 1024-Up Octet Frames: 39795
Net Octets: 54988596
Rx Start of Frame Overruns: 0
Rx Middle of Frame Overruns: 0
Rx DMA Overruns: 0
Rx DMA chan: head_enqueue: 1
Rx DMA chan: tail_enqueue: 4069
Rx DMA chan: pad_enqueue: 0
Rx DMA chan: misqueued: 0
Rx DMA chan: desc_alloc_fail: 0
Rx DMA chan: pad_alloc_fail: 0
Rx DMA chan: runt_receive_buf: 0
Rx DMA chan: runt_transmit_buf: 0
Rx DMA chan: empty_dequeue: 0
Rx DMA chan: busy_dequeue: 1563
Rx DMA chan: good_dequeue: 2022
Rx DMA chan: requeue: 0
Rx DMA chan: teardown_dequeue: 0
Tx DMA chan: head_enqueue: 37182
Tx DMA chan: tail_enqueue: 30040
Tx DMA chan: pad_enqueue: 0
Tx DMA chan: misqueued: 1750
Tx DMA chan: desc_alloc_fail: 19
Tx DMA chan: pad_alloc_fail: 0
Tx DMA chan: runt_receive_buf: 0
Tx DMA chan: runt_transmit_buf: 232
Tx DMA chan: empty_dequeue: 25185
Tx DMA chan: busy_dequeue: 15579
Tx DMA chan: good_dequeue: 41428
Tx DMA chan: requeue: 37559
Tx DMA chan: teardown_dequeue: 24576
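
A dump this long is easier to scan if it is parsed and filtered for non-zero counters whose names suggest trouble. The sketch below assumes the "ethtool -S" output has been saved as text; the keyword list and function names are arbitrary choices for illustration, not anything defined by ethtool or the cpsw driver.

```python
import re

# Assumed keyword list; adjust to taste.
SUSPECT = re.compile(r"error|overrun|underrun|fail|misqueued|collision", re.IGNORECASE)


def parse_stats(text):
    """Parse 'name: value' lines; names may themselves contain colons."""
    stats = {}
    for raw in text.splitlines():
        name, sep, value = raw.strip().rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def suspicious(stats):
    """Non-zero counters whose names match the keyword list."""
    return {k: v for k, v in stats.items() if v and SUSPECT.search(k)}


if __name__ == "__main__":
    sample = """\
NIC statistics:
Good Rx Frames: 2022
Rx CRC Errors: 0
Tx DMA chan: misqueued: 1750
Tx DMA chan: desc_alloc_fail: 19
"""
    for name, value in suspicious(parse_stats(sample)).items():
        print(f"{name}: {value}")
```

Taking two such snapshots a few seconds apart and subtracting them shows which counters are still moving after the failure.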

devmem2 0x4a100d84 after failure produces the following output:

# devmem2 0x4a100d84
/dev/mem opened.[ 314.900296] ------------[ cut here ]------------
[ 314.906261] WARNING: CPU: 0 PID: 1317 at drivers/bus/omap_l3_noc.c:147 l3_interrupt_handler+0x248/0x34c()
[ 314.915869] 44000000.ocp:L3 Custom Error: MASTER MPU TARGET L4_CFG (Read): Data Access in User mode during Functional access
[ 314.927130] Modules linked in: usb_f_acm u_serial g_serial libcomposite usb_storage xhci_plat_hcd xhci_hcd usbcore dwc3 rpmsg_rpc rtc_palmas virtio_rpmsg_bus extcon_palmas phy_omap_usb2 udc_core snd_soc_pcm5102a snd_soc_tlv320aic3x rtc_ds1307 omap_wdt dwc3_omap rtc_omap omap_remoteproc remoteproc virtio virtio_ring extcon_usb_gpio extcon
[ 314.957350] CPU: 0 PID: 1317 Comm: devmem2 Tainted: G W 4.1.13-aja-helo-g72e25e4-dirty #3
[ 314.966693] Hardware name: Generic DRA74X (Flattened Device Tree)
[ 314.972808] Backtrace:
[ 314.975279] [<c0012f78>] (dump_backtrace) from [<c001319c>] (show_stack+0x18/0x1c)
[ 314.982878] r7:c03655e8 r6:00000093 r5:c09b1024 r4:00000000
[ 314.988595] [<c0013184>] (show_stack) from [<c06bbb98>] (dump_stack+0x9c/0xdc)
[ 314.995851] [<c06bbafc>] (dump_stack) from [<c0039a38>] (warn_slowpath_common+0x88/0xb8)
[ 315.003971] r5:00000009 r4:ecce1e00
[ 315.007577] [<c00399b0>] (warn_slowpath_common) from [<c0039aa0>] (warn_slowpath_fmt+0x38/0x40)
[ 315.016309] r8:c08af41c r7:00000002 r6:ee1af190 r5:c08af4dc r4:c08af580
[ 315.023077] [<c0039a6c>] (warn_slowpath_fmt) from [<c03655e8>] (l3_interrupt_handler+0x248/0x34c)
[ 315.031982] r3:ee1af000 r2:c08af580
[ 315.035580] r4:80080003
[ 315.038135] [<c03653a0>] (l3_interrupt_handler) from [<c0079f54>] (handle_irq_event_percpu+0x80/0x13c)
[ 315.047476] r10:c09dcfb5 r9:ee1a9300 r8:00000017 r7:00000000 r6:00000000 r5:ee1a9360
[ 315.055370] r4:ee1af500
[ 315.057923] [<c0079ed4>] (handle_irq_event_percpu) from [<c007a054>] (handle_irq_event+0x44/0x64)
[ 315.066828] r10:beed9a74 r9:00000001 r8:ee008000 r7:00000000 r6:ee1af500 r5:ee1a9360
[ 315.074723] r4:ee1a9300
[ 315.077272] [<c007a010>] (handle_irq_event) from [<c007cd80>] (handle_fasteoi_irq+0xb8/0x17c)
[ 315.085829] r7:00000000 r6:c099713c r5:ee1a9360 r4:ee1a9300
[ 315.091542] [<c007ccc8>] (handle_fasteoi_irq) from [<c00795b8>] (generic_handle_irq+0x34/0x44)
[ 315.100186] r7:00000000 r6:00000000 r5:00000017 r4:00000017
[ 315.105897] [<c0079584>] (generic_handle_irq) from [<c0079890>] (__handle_domain_irq+0x64/0xbc)
[ 315.114629] r5:00000017 r4:c098cd38
[ 315.118232] [<c007982c>] (__handle_domain_irq) from [<c00094ac>] (gic_handle_irq+0x2c/0x64)
[ 315.126614] r9:00000001 r8:30c5387d r7:fa212000 r6:ecce1fb0 r5:c099294c r4:fa21200c
[ 315.134426] [<c0009480>] (gic_handle_irq) from [<c06c1b68>] (__irq_usr+0x48/0x60)
[ 315.141938] Exception stack(0xecce1fb0 to 0xecce1ff8)
[ 315.147009] 1fa0: 00010417 b6ff8960 b6ff66e8 b6ff8b18
[ 315.155220] 1fc0: 00000000 00020f5c b6fea000 b6fea4c0 00000000 00000001 beed9a74 b6ff4d84
[ 315.163430] 1fe0: 00000017 beed9a08 00010400 b6fdaaf4 200b0030 ffffffff
[ 315.170067] r7:30c5387d r6:ffffffff r5:200b0030 r4:b6fdaaf4
[ 315.175775] ---[ end trace c41436b29ff88a5f ]---

Memory mapped at address 0xb6ff4000.
[ 315.180627] Unhandled fault: asynchronous external abort (0x1211) at 0x00000000
[ 315.191288] pgd = d3fd8ec0
[ 315.194004] [00000000] *pgd=9316e003, *pmd=92f5f003, *pte=00000000
Read at address 0x4A100D84 (0xb6ff4d84): 0x00000000

The link partner, cable, and machine are common across tests of our own board and the EVM. Link partner is an Intel gig-E ethernet card connected directly and installed in a linux PC - not a hub.

The observed issue when the network is down is that transmit from the EVM no longer occurs and sending packets time out.

The failure is produced by running the IVAHD (videnc2test) continuously while also running 'iperf3 -s' on the EVM, with the iperf client elsewhere doing a UDP transfer test, usually at 100Mb/s. However, the failure occurs with almost any kind of network traffic while the codec is also in use. Typically we are encoding 1080i30 material in our application.

We will get details of the videnc2 command/setup/input to you tomorrow.

0 Chris Anderson20 over 8 years ago in reply to Chris Anderson20

Intellectual 690 points

To reproduce the issue on the EVM, you will need a local 1080p NV12 YUV-formatted file for codec input. This must be <2GB in size due to videnc2test limitations. It can be prepared with ffmpeg from suitable source material e.g. as follows:

ffmpeg -ss 00:00:30.000 -t 00:00:15 -i big_buck_bunny_1080p_h264.mov -c:v rawvideo -pix_fmt nv12 nv12.yuv
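
As a rough check on the size limit: NV12 stores 12 bits per pixel (a full-resolution luma plane plus a half-size interleaved chroma plane), so the raw file size follows directly from the frame count. The sketch below assumes "2GB" means 2^31 bytes, which the thread does not spell out, and the function names are illustrative.

```python
def nv12_frame_bytes(width, height):
    """Bytes in one NV12 frame: Y plane plus interleaved UV at half size."""
    return width * height * 3 // 2


def max_frames(limit_bytes, width, height):
    """Largest whole number of frames that stays strictly under limit_bytes."""
    return (limit_bytes - 1) // nv12_frame_bytes(width, height)


if __name__ == "__main__":
    limit = 2**31  # assumption: the "<2GB" limit is 2 GiB
    print("bytes per 1080p frame:", nv12_frame_bytes(1920, 1080))
    print("max frames under limit:", max_frames(limit, 1920, 1080))
```

How many seconds of video that corresponds to depends on the frame rate of the source clip.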

Then, on the EVM, run 'iperf3 -s' to start the server.

Run the IVAHD continuously with a command of the form:

watch -n 1 videnc2test 1920 1080 108000 nv12.yuv out.h264 30 17000 h264 high 42 OMAPDRM

Start the iperf3 client on a connected system with a command like:

iperf3 -c 192.168.0.2 -u -b 500M -t 80000

Network failure may occur within minutes or may take hours. Rebooting and restarting the test seems to hasten the failure if it does not occur within an hour or two.

Having run this test many times now, we have observed total transmit failure, a 'network socket closed unexpectedly' error from iperf3, and also ipu2 crashes as follows:

[76087.328992] remoteproc1: crash detected in 55020000.ipu: type watchdog
[76087.335669] remoteproc1: handling crash #43 in 55020000.ipu
[76087.341818] remoteproc1: recovering 55020000.ipu
[76087.376487] omap_hwmod: mmu_ipu2: _wait_target_disable failed
[76087.384217] remoteproc1: stopped remote processor 55020000.ipu
[76087.406380] remoteproc1: powering up 55020000.ipu
[76087.414336] remoteproc1: Booting fw image dra7-ipu2-fw.xem4, size 3485072
[76087.422570] omap-iommu 55082000.mmu: 55082000.mmu: version 2.1
[76087.502657] remoteproc1: remote processor 55020000.ipu is now up
[76087.510014] virtio_rpmsg_bus virtio1: rpmsg host is online
[76087.515561] remoteproc1: registered virtio1 (type 7)
[76097.507475] remoteproc1: crash detected in 55020000.ipu: type watchdog

0 Schuyler Patton over 8 years ago in reply to Chris Anderson20

TI__Mastermind 40570 points

Thanks for the console outputs and steps for the EVM, I will try them out.

I need to apologize for the address I gave you; it was for another processor, and that is the reason for the kernel warning. The one for the AM572x is 0x48484D84.
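
Once the value at 0x48484D84 has been read, the duplex setting can be picked out of it. The following is only a sketch: the bit positions (bit 0 full duplex, bit 5 GMII enable, bit 7 gigabit) are an assumption based on how the mainline Linux cpsw driver builds its mac_control value, and should be checked against the AM572x TRM before relying on them.

```python
# Assumed bit positions in the CPSW sliver MACCONTROL register;
# verify against the AM572x TRM.
MACCONTROL_BITS = {
    "fullduplex": 0,
    "gmii_en": 5,
    "gig": 7,
}


def decode_maccontrol(value):
    """Return the state of the assumed MACCONTROL bits as booleans."""
    return {name: bool((value >> bit) & 1) for name, bit in MACCONTROL_BITS.items()}


if __name__ == "__main__":
    # Hypothetical register value, not one reported in this thread.
    for name, is_set in decode_maccontrol(0x000000A1).items():
        print(f"{name}: {is_set}")
```

If the fullduplex bit reads back clear while ethtool reports "Duplex: Full", the MAC and PHY disagree about the link mode.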

0 Chris Anderson20 over 8 years ago in reply to Schuyler Patton

Intellectual 690 points

Has there been any further progress towards a resolution on this? The issue is critical for us and it is blocking product release.

0 Schuyler Patton over 8 years ago in reply to Chris Anderson20

TI__Mastermind 40570 points

We tried several different methods to try to reproduce the problem. We have replicated the condition that you are seeing, though not with the same steps. The most consistent way we found to recreate the problem was to use iperf3 in client mode. The video encode part of the test doesn't need to run to recreate the condition; iperf3 alone seems to cause the issue. We are continuing to look at the cause of the network down condition.

0 Rogerio Almeida over 8 years ago in reply to Chris Anderson20

TI__Mastermind 26215 points

Chris,
Last night I sent an email to your team, and since I was using my phone, I may have left you out. Sorry about that. I was letting them know that our apps team would be posting comments today, which they have now done, and that we are continuing to investigate the issue.
