
AM6442: am65-cpsw-nuss ethernet driver lockup

Part Number: AM6442
Other Parts Discussed in Thread: SK-AM64B

Hi,

I have observed the following issue with Ethernet on AM6442 when running TI kernel 6.1.46-rt13 (09.01.00.002-rt).

After this happens, Ethernet communications are dead and require a reboot to function again.

[ 2826.211565] ------------[ cut here ]------------
[ 2826.211581] NETDEV WATCHDOG: eth0 (am65-cpsw-nuss): transmit queue 0 timed out
[ 2826.211650] WARNING: CPU: 0 PID: 11 at net/sched/sch_generic.c:525 dev_watchdog+0x254/0x260
[ 2826.211688] CPU: 0 PID: 11 Comm: ktimers/0 Not tainted 6.1.46-rt13 #1
[ 2826.211698] Hardware name:
[ 2826.211704] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 2826.211714] pc : dev_watchdog+0x254/0x260
[ 2826.211725] lr : dev_watchdog+0x254/0x260
[ 2826.211736] sp : ffff800008bb3c20
[ 2826.211739] x29: ffff800008bb3c20 x28: ffff8000085d2430 x27: ffff800008bb3d30
[ 2826.211755] x26: ffff00006fc23d50 x25: 0000000000000000 x24: 0000000000000000
[ 2826.211766] x23: ffff8000089b8000 x22: 0000000000000000 x21: ffff000001c683a0
[ 2826.211778] x20: ffff000001c68000 x19: ffff000001c68468 x18: ffffffffffffffff
[ 2826.211789] x17: 6f2064656d697420 x16: 3020657565757120 x15: 74696d736e617274
[ 2826.211801] x14: 203a297373756e2d x13: 74756f2064656d69 x12: 7420302065756575
[ 2826.211812] x11: 712074696d736e61 x10: 7274203a29737375 x9 : 777370632d35366d
[ 2826.211824] x8 : ffff8000089cb1f0 x7 : ffff800008bb3a30 x6 : 000000000000000c
[ 2826.211836] x5 : ffff00006fc23b40 x4 : 0000000000000000 x3 : 0000000000000027
[ 2826.211846] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000000cd8d80
[ 2826.211858] Call trace:
[ 2826.211862] dev_watchdog+0x254/0x260
[ 2826.211874] call_timer_fn.constprop.0+0x20/0x80
[ 2826.211891] __run_timers+0x270/0x2f0
[ 2826.211902] run_timer_softirq+0x1c/0x40
[ 2826.211913] _stext+0xf4/0x228
[ 2826.211922] run_timersd+0x60/0xc0
[ 2826.211935] smpboot_thread_fn+0x278/0x2e0
[ 2826.211945] kthread+0x110/0x120
[ 2826.211953] ret_from_fork+0x10/0x20
[ 2826.211963] ---[ end trace 0000000000000000 ]---
[ 2826.211983] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:5660 dql_avail:-2 free_desc:509
[ 2831.331490] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:10780 dql_avail:-2 free_desc:509
[ 2837.219386] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:16668 dql_avail:-2 free_desc:509
[ 2842.083298] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:21532 dql_avail:-2 free_desc:509
[ 2847.203202] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:26652 dql_avail:-2 free_desc:509
[ 2852.323117] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:31772 dql_avail:-2 free_desc:509
[ 2858.211011] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:37660 dql_avail:-2 free_desc:509
[ 2863.330918] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:42780 dql_avail:-2 free_desc:509
[ 2869.218815] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:48668 dql_avail:-2 free_desc:509

  • I've done some more investigation and this appears to happen when receive timestamping is enabled in the driver (for 1588 PTP).

    i.e. running the command

    hwstamp_ctl -i eth0 -r 1

    appears to reliably cause the ethernet driver to lock up after a few hundred megabytes of data transfer.
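
    For context, a minimal sketch of what that command asks the driver to do (hwstamp_ctl wraps the standard Linux SIOCSHWTSTAMP ioctl, and rx_filter value 1 is HWTSTAMP_FILTER_ALL, i.e. timestamp every received frame):

    /* minimal sketch: the equivalent of "hwstamp_ctl -i eth0 -r 1" */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <net/if.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <linux/sockios.h>
    #include <linux/net_tstamp.h>

    int main(void)
    {
        struct hwtstamp_config cfg = {
            .tx_type   = HWTSTAMP_TX_OFF,      /* leave TX timestamping off */
            .rx_filter = HWTSTAMP_FILTER_ALL,  /* "-r 1": timestamp all received frames */
        };
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0) {
            perror("socket");
            return 1;
        }
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&cfg;

        /* this request is serviced by the driver's hwtstamp_set callback */
        if (ioctl(fd, SIOCSHWTSTAMP, &ifr) < 0)
            perror("SIOCSHWTSTAMP");
        else
            printf("applied: tx_type %d, rx_filter %d\n", cfg.tx_type, cfg.rx_filter);
        close(fd);
        return 0;
    }

    Like hwstamp_ctl, this needs to run as root (CAP_NET_ADMIN).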

  • Did you have this issue with the 9.0 SDK release version? The 9.1 releases are preliminary; there could be many issues.

  • Testing with 09.00.00.001-rt produced a kernel panic on the same test:

    [ 54.633619] Unable to handle kernel paging request at virtual address 1794ecb8bfe302a7
    [ 54.641564] Mem abort info:
    [ 54.644350] ESR = 0x0000000096000044
    [ 54.648088] EC = 0x25: DABT (current EL), IL = 32 bits
    [ 54.653389] SET = 0, FnV = 0
    [ 54.656433] EA = 0, S1PTW = 0
    [ 54.659566] FSC = 0x04: level 0 translation fault
    [ 54.664431] Data abort info:
    [ 54.667301] ISV = 0, ISS = 0x00000044
    [ 54.671125] CM = 0, WnR = 1
    [ 54.674083] [1794ecb8bfe302a7] address between user and kernel address ranges
    [ 54.681204] Internal error: Oops: 0000000096000044 [#1] PREEMPT_RT SMP
    [ 54.681216] CPU: 0 PID: 93 Comm: irq/152-8000000 Not tainted 6.1.26-rt8 #1
    [ 54.681225] Hardware name:
    [ 54.681230] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [ 54.681239] pc : am65_cpsw_nuss_rx_poll+0x150/0x390
    [ 54.681263] lr : am65_cpsw_nuss_rx_poll+0x130/0x390
    [ 54.681271] sp : ffff800009c13b80
    [ 54.681274] x29: ffff800009c13b80 x28: 1794ecb8bfe30297 x27: ffff00004dcdc4b0
    [ 54.681288] x26: ffff0000022e3670 x25: 0000000000000600 x24: 0000000000000040
    [ 54.681298] x23: ffff0000022e2080 x22: ffff00004dcdc480 x21: ffff00000243c000
    [ 54.681309] x20: ffff0000022e3648 x19: 0000000000000006 x18: 0000000000000000
    [ 54.681319] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
    [ 54.681329] x14: 0000000000000001 x13: 0000000000000089 x12: 0000000000000000
    [ 54.681339] x11: 000000000000c400 x10: 0000000000000038 x9 : ffff00000234b308
    [ 54.681349] x8 : ffff000002c5b0b0 x7 : 000f0000cdcdc480 x6 : 000f0000cdcdc480
    [ 54.681359] x5 : ffff00004dcdc4c0 x4 : 000f000082b05ac0 x3 : ffff00004dcdc4c0
    [ 54.681369] x2 : 00000000000003d8 x1 : ffff000002439080 x0 : ffff000002439080
    [ 54.681380] Call trace:
    [ 54.681384] am65_cpsw_nuss_rx_poll+0x150/0x390
    [ 54.681394] __napi_poll.constprop.0+0x34/0x180
    [ 54.681406] net_rx_action+0x128/0x2c0
    [ 54.681413] _stext+0xf4/0x228
    [ 54.681423] __local_bh_enable_ip+0xc0/0x130
    [ 54.681437] irq_forced_thread_fn+0x94/0xb0
    [ 54.681448] irq_thread+0x140/0x200
    [ 54.681455] kthread+0x110/0x120
    [ 54.681462] ret_from_fork+0x10/0x20
    [ 54.681477] Code: b90077e2 52807b02 9ba20400 f9400415 (f9000b95)
    [ 54.837123] ---[ end trace 0000000000000000 ]---
    [ 54.837129] Kernel panic - not syncing: Oops: Fatal exception in interrupt
    [ 54.848591] SMP: stopping secondary CPUs
    [ 54.848615] Kernel Offset: disabled
    [ 54.848618] CPU features: 0x00000,00000004,0000400b
    [ 54.848624] Memory Limit: none

  • 09.00.00.011-rt also fails.

  • 09.00.00.011 (not realtime) also fails.

  • I ran iperf3 tests for multiple GB with ptp4l and HW timestamps (ptp4l -P -2 -H -i eth0 -f gPTP.cfg --step_threashold=1 -m -q -p /dev/ptp0) without seeing this issue. I did the same after calling hwstamp_ctl -i eth0 -r 1 and did not see this issue.

    Can you elaborate a little on your overall setup and what you have running? For example, the sequence of commands you run.

    One thing I suspect is that the RT priorities are set up in such a way that some of the servicing of the DMA structures has an issue.

  • Hi Pekka,

    The hardware setup is a PHYTEC phyCORE-AM64x SOM. We see the issue on our own hardware and with the PHYTEC development board.

    In all cases we're using CPSW3G port 1 (bus@f4000/ethernet@8000000).

    Our Linux system is built using Buildroot, based on Andreas's patches [1]. We run entirely from RAM without a persistent rootfs. I understand that this is experimental, but I don't think this issue has anything to do with our build.

    The most minimal reproducer for the issue I have is as follows:
    * AM6442 board connected to our corporate network (this seems to be important; I think the extra traffic helps to trigger the issue).
    * Receive timestamping enabled on the AM6442: "hwstamp_ctl -i eth0 -r 1"
    * Run iperf3 on the AM6442: "iperf3 -s"
    * Run iperf3 on another machine: "iperf3 -c am6442.hostname" (see log [2] below for example of failure).

    See [3] below for a trace of the issue with DMA and ethernet driver logging turned on.

    Patrick

    [1] e2e.ti.com/.../faq-buildroot-support-for-sitara-am62x-am62ax-am64x-devices

    [2] failing iperf run on AM6442
    # iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.128, port 49712
    [ 5] local 10.117.68.234 port 5201 connected to 10.117.68.128 port 49714
    [ ID] Interval Transfer Bitrate
    [ 5] 0.00-1.00 sec 70.0 MBytes 585 Mbits/sec
    [ 5] 1.00-2.00 sec 71.7 MBytes 602 Mbits/sec
    [ 5] 2.00-3.00 sec 72.8 MBytes 609 Mbits/sec
    [ 5] 3.00-4.00 sec 72.2 MBytes 607 Mbits/sec
    [ 5] 4.00-5.00 sec 70.2 MBytes 589 Mbits/sec
    [ 5] 5.00-6.00 sec 72.5 MBytes 607 Mbits/sec
    [ 5] 6.00-7.00 sec 73.2 MBytes 614 Mbits/sec
    [ 5] 7.00-8.00 sec 72.0 MBytes 604 Mbits/sec
    [ 5] 8.00-9.00 sec 46.5 MBytes 391 Mbits/sec
    [ 5] 9.00-10.00 sec 0.00 Bytes 0.00 bits/sec
    [ 5] 10.00-11.00 sec 0.00 Bytes 0.00 bits/sec
    [ 5] 11.00-12.00 sec 0.00 Bytes 0.00 bits/sec
    [ 5] 12.00-13.00 sec 0.00 Bytes 0.00 bits/sec
    [ 5] 13.00-14.00 sec 0.00 Bytes 0.00 bits/sec
    [ 5] 14.00-15.00 sec 0.00 Bytes 0.00 bits/sec
    [ 5] 15.00-16.00 sec 0.00 Bytes 0.00 bits/sec
    [ 5] 16.00-17.00 sec 0.00 Bytes 0.00 bits/sec
    [ 5] 17.00-18.00 sec 0.00 Bytes 0.00 bits/sec
    [ 5] 18.00-19.00 sec 0.00 Bytes 0.00 bits/sec
    [ 5] 19.00-20.00 sec 0.00 Bytes 0.00 bits/sec
    [ 52.961495] ------------[ cut here ]------------
    [ 52.961511] NETDEV WATCHDOG: eth0 (am65-cpsw-nuss): transmit queue 0 timed out
    [ 52.961580] WARNING: CPU: 0 PID: 11 at net/sched/sch_generic.c:525 dev_watchdog+0x254/0x260
    [ 52.961618] CPU: 0 PID: 11 Comm: ktimers/0 Not tainted 6.1.46-rt13 #1
    [ 52.961629] Hardware name: Relectrify Stack Controller: Relecblox-Main-Controller-SOM (DT)
    [ 52.961635] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [ 52.961644] pc : dev_watchdog+0x254/0x260
    [ 52.961655] lr : dev_watchdog+0x254/0x260
    [ 52.961666] sp : ffff800008bb3c20
    [ 52.961669] x29: ffff800008bb3c20 x28: ffff8000085d2460 x27: ffff800008bb3d30
    [ 52.961685] x26: ffff00006fc23d50 x25: 0000000000000000 x24: 0000000000000000
    [ 52.961696] x23: ffff8000089b8000 x22: 0000000000000000 x21: ffff000001c703a0
    [ 52.961708] x20: ffff000001c70000 x19: ffff000001c70468 x18: ffffffffffffffff
    [ 52.961719] x17: 6f2064656d697420 x16: 3020657565757120 x15: 74696d736e617274
    [ 52.961731] x14: 203a297373756e2d x13: 74756f2064656d69 x12: 7420302065756575
    [ 52.961742] x11: 712074696d736e61 x10: 7274203a29737375 x9 : 777370632d35366d
    [ 52.961754] x8 : ffff8000089cb1f0 x7 : ffff800008bb3a30 x6 : 000000000000000c
    [ 52.961766] x5 : ffff00006fc23b40 x4 : 0000000000000000 x3 : 0000000000000027
    [ 52.961777] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000000cd8d80
    [ 52.961788] Call trace:
    [ 52.961792] dev_watchdog+0x254/0x260
    [ 52.961804] call_timer_fn.constprop.0+0x20/0x80
    [ 52.961822] __run_timers+0x270/0x2f0
    [ 52.961832] run_timer_softirq+0x1c/0x40
    [ 52.961843] _stext+0xf4/0x228
    [ 52.961852] run_timersd+0x60/0xc0
    [ 52.961865] smpboot_thread_fn+0x278/0x2e0
    [ 52.961875] kthread+0x110/0x120
    [ 52.961883] ret_from_fork+0x10/0x20
    [ 52.961893] ---[ end trace 0000000000000000 ]---
    [ 52.961916] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:10104 dql_avail:-24 free_desc:514

    [3] ethernet driver & DMA driver logging of a failure
    [ 57.075671] ti-udma 485c0000.dma-controller: ring_pop: occ6 index229
    [ 57.075679] ti-udma 485c0000.dma-controller: k3_dmaring_reverse_pop: occ5 index230 pos_ptrffff0000028ce78
    [ 57.075688] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets flow_idx: 0 desc 0x00000000cdcd720
    [ 57.075699] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx port_id:1
    [ 57.075707] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx csum_info:0x14ffff
    [ 57.075736] ti-udma 485c0000.dma-controller: ring_push: free0 index229
    [ 57.075744] ti-udma 485c0000.dma-controller: ring_push_mem: free10 index230
    [ 57.075751] ti-udma 485c0000.dma-controller: ring_pop: occ5 index230
    [ 57.075758] ti-udma 485c0000.dma-controller: k3_dmaring_reverse_pop: occ4 index231 pos_ptrffff0000028ce70
    [ 57.075766] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets flow_idx: 0 desc 0x00000000cdcd730
    [ 57.075775] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx port_id:1
    [ 57.075782] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx csum_info:0x14ffff
    [ 57.075799] ti-udma 485c0000.dma-controller: ring_push: free10 index230
    [ 57.075806] ti-udma 485c0000.dma-controller: ring_push_mem: free9 index231
    [ 57.075813] ti-udma 485c0000.dma-controller: ring_pop: occ4 index231
    [ 57.075820] ti-udma 485c0000.dma-controller: k3_dmaring_reverse_pop: occ3 index232 pos_ptrffff0000028ce78
    [ 57.075828] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets flow_idx: 0 desc 0x00000000cdcd730
    [ 57.075836] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx port_id:1
    [ 57.075843] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx csum_info:0x14ffff
    [ 57.075855] ti-udma 485c0000.dma-controller: ring_push: free9 index231
    [ 57.075863] ti-udma 485c0000.dma-controller: ring_push_mem: free8 index232
    [ 57.075869] ti-udma 485c0000.dma-controller: ring_pop: occ3 index232
    [ 57.075876] ti-udma 485c0000.dma-controller: k3_dmaring_reverse_pop: occ2 index233 pos_ptrffff0000028ce70
    [ 57.075885] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets flow_idx: 0 desc 0x00000000cdcd740
    [ 57.075893] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx port_id:1
    [ 57.075900] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx csum_info:0x14ffff
    [ 57.075914] ti-udma 485c0000.dma-controller: ring_push: free8 index232
    [ 57.075921] ti-udma 485c0000.dma-controller: ring_push_mem: free7 index233
    [ 57.075928] ti-udma 485c0000.dma-controller: ring_pop: occ2 index233
    [ 57.075934] ti-udma 485c0000.dma-controller: k3_dmaring_reverse_pop: occ1 index234 pos_ptrffff0000028ce78
    [ 57.075943] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets flow_idx: 0 desc 0x00000000cdcd740
    [ 57.075952] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx port_id:1
    [ 57.075959] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx csum_info:0x14ffff
    [ 57.075971] ti-udma 485c0000.dma-controller: ring_push: free7 index233
    [ 57.075978] ti-udma 485c0000.dma-controller: ring_push_mem: free6 index234
    [ 57.075985] ti-udma 485c0000.dma-controller: ring_pop: occ1 index234
    [ 57.075992] ti-udma 485c0000.dma-controller: k3_dmaring_reverse_pop: occ0 index235 pos_ptrffff0000028ce70
    [ 57.076000] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets flow_idx: 0 desc 0x00000000cdcd750
    [ 57.076008] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx port_id:1
    [ 57.076015] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx csum_info:0x14ffff
    [ 57.076030] ti-udma 485c0000.dma-controller: ring_push: free6 index234
    [ 57.076037] ti-udma 485c0000.dma-controller: ring_push_mem: free5 index235
    [ 57.076044] ti-udma 485c0000.dma-controller: ring_pop: occ5 index235
    [ 57.076050] ti-udma 485c0000.dma-controller: k3_dmaring_reverse_pop: occ4 index236 pos_ptrffff0000028ce78
    [ 57.076059] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets flow_idx: 0 desc 0x00000000cdcd750
    [ 57.076067] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx port_id:1
    [ 57.076074] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx csum_info:0x14ffff
    [ 57.076086] ti-udma 485c0000.dma-controller: ring_push: free5 index235
    [ 57.076094] ti-udma 485c0000.dma-controller: ring_push_mem: free4 index236
    [ 57.076100] ti-udma 485c0000.dma-controller: ring_pop: occ4 index236
    [ 57.076107] ti-udma 485c0000.dma-controller: k3_dmaring_reverse_pop: occ3 index237 pos_ptrffff0000028ce70
    [ 57.076116] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets flow_idx: 0 desc 0x00000000cdcd760
    [ 57.076124] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx port_id:1
    [ 57.076131] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx csum_info:0x14ffff
    [ 57.076143] ti-udma 485c0000.dma-controller: ring_push: free4 index236
    [ 57.076150] ti-udma 485c0000.dma-controller: ring_push_mem: free3 index237
    [ 57.076157] ti-udma 485c0000.dma-controller: ring_pop: occ3 index237
    [ 57.076163] ti-udma 485c0000.dma-controller: k3_dmaring_reverse_pop: occ2 index238 pos_ptrffff0000028ce78
    [ 57.076172] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets flow_idx: 0 desc 0x00000000cdcd760
    [ 57.076180] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx port_id:1
    [ 57.076187] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx csum_info:0x14ffff
    [ 57.076200] ti-udma 485c0000.dma-controller: ring_push: free3 index237
    [ 57.076207] ti-udma 485c0000.dma-controller: ring_push_mem: free2 index238
    [ 57.076214] ti-udma 485c0000.dma-controller: ring_pop: occ2 index238
    [ 57.076221] ti-udma 485c0000.dma-controller: k3_dmaring_reverse_pop: occ1 index239 pos_ptrffff0000028ce70
    [ 57.076229] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets flow_idx: 0 desc 0x00000000cdcd770
    [ 57.076238] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx port_id:1
    [ 57.076245] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx csum_info:0x14ffff
    [ 57.076256] ti-udma 485c0000.dma-controller: ring_push: free2 index238
    [ 57.076263] ti-udma 485c0000.dma-controller: ring_push_mem: free1 index239
    [ 57.076270] ti-udma 485c0000.dma-controller: ring_pop: occ1 index239
    [ 57.076277] ti-udma 485c0000.dma-controller: k3_dmaring_reverse_pop: occ0 index240 pos_ptrffff0000028ce78
    [ 57.076285] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets flow_idx: 0 desc 0x00000000cdcd770
    [ 57.076293] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx port_id:1
    [ 57.076300] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx csum_info:0x14ffff
    [ 57.076312] ti-udma 485c0000.dma-controller: ring_push: free1 index239
    [ 57.076319] ti-udma 485c0000.dma-controller: ring_push_mem: free0 index240
    [ 57.076326] ti-udma 485c0000.dma-controller: ring_pop: occ0 index240
    [ 57.076333] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_poll num_rx:11 64
    [ 57.076761] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_ndo_slave_xmit skb_queue:0
    [ 57.076784] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_ndo_slave_xmit tx psdata:0x33230020
    [ 57.076794] ti-udma 485c0000.dma-controller: ring_push: free102 index414
    [ 57.076927] ti-udma 485c0000.dma-controller: ring_push_mem: free101 index415
    [ 57.076982] ti-udma 485c0000.dma-controller: ring_pop: occ1 index414
    [ 57.076990] ti-udma 485c0000.dma-controller: k3_dmaring_reverse_pop: occ0 index415 pos_ptrffff00004dcccc0
    [ 57.077010] ti-udma 485c0000.dma-controller: ring_pop: occ0 index415
    [ 57.077018] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_tx_compl_packets:0 pkt:1
    [ 57.077053] ti-udma 485c0000.dma-controller: ring_pop: occ1 index240
    [ 57.077061] ti-udma 485c0000.dma-controller: k3_dmaring_reverse_pop: occ0 index241 pos_ptrffff0000028ce70
    [ 57.077070] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets flow_idx: 0 desc 0x00000000cdcd780
    [ 57.077081] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx port_id:1
    [ 57.077088] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_packets rx csum_info:0x14ffff
    [ 57.077114] ti-udma 485c0000.dma-controller: ring_push: free0 index240
    [ 57.077121] ti-udma 485c0000.dma-controller: ring_push_mem: free0 index241
    [ 57.077128] ti-udma 485c0000.dma-controller: ring_pop: occ0 index241
    [ 57.077135] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_rx_poll num_rx:1 64
    [ 57.120924] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_ndo_slave_xmit skb_queue:0
    [ 57.120956] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_ndo_slave_xmit tx psdata:0x33230020
    [ 57.120968] ti-udma 485c0000.dma-controller: ring_push: free101 index415
    [ 57.121018] ti-udma 485c0000.dma-controller: ring_push_mem: free100 index416
    [ 57.121072] ti-udma 485c0000.dma-controller: ring_pop: occ1 index415
    [ 57.121081] ti-udma 485c0000.dma-controller: k3_dmaring_reverse_pop: occ0 index416 pos_ptrffff00004dcccc8
    [ 57.121101] ti-udma 485c0000.dma-controller: ring_pop: occ0 index416
    [ 57.121109] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_tx_compl_packets:0 pkt:1
    *** We appear to be dead now.
    [ 61.963018] am65-cpsw-nuss 8000000.ethernet: promisc disabled
    [ 61.966540] am65-cpsw-nuss 8000000.ethernet: promisc disabled
    [ 61.970154] am65-cpsw-nuss 8000000.ethernet: promisc disabled
    [ 61.974281] am65-cpsw-nuss 8000000.ethernet: promisc disabled
    [ 61.988938] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_ndo_slave_xmit skb_queue:0
    [ 61.988977] ti-udma 485c0000.dma-controller: ring_push: free100 index416
    [ 61.989035] ti-udma 485c0000.dma-controller: ring_push_mem: free99 index417
    [ 61.989093] ti-udma 485c0000.dma-controller: ring_pop: occ1 index416
    [ 61.989101] ti-udma 485c0000.dma-controller: k3_dmaring_reverse_pop: occ0 index417 pos_ptrffff00004dcccd0
    [ 61.989122] ti-udma 485c0000.dma-controller: ring_pop: occ0 index417
    [ 61.989131] am65-cpsw-nuss 8000000.ethernet: am65_cpsw_nuss_tx_compl_packets:0 pkt:1

  • Thanks, this helps in narrowing it down. With "hwstamp_ctl -i eth0 -r 1", the "-r 1" looks to be asking to timestamp every incoming packet, so this is likely where things get overloaded with high packets-per-second traffic. I tried a few different combinations of the 9.0 public release and the 9.1 release candidates with this and I was not able to reproduce it. The default iperf3 test is TCP with max-size MTUs, so the packet rate is relatively low; I also tried "iperf3 -c am6442.hostname -u -l100 -b100M" to create a much higher packets-per-second rate. I also played around with changing priorities (chrt for the ksoftirqs and also iperf3 to FIFO priorities) to see if that would bring this out.

    It seems that in your system, at some point, timestamping every packet overloads the system and it behaves ungracefully. Something must be different from the default SDK, as I'm not able to reproduce it. The Buildroot support in general, and RT in particular, has not gone through the full testing a normal release has; as long as I can't reproduce this on a supported SDK, narrowing it down to a bug is problematic.

    To proceed, I suggest a couple of possible paths:

    - Do you really want to timestamp all received packets, or just the PTP ones?
    - Is the use case to run ptp4l? Do you see an issue if you run ptp4l with HW timestamps (option -H for ptp4l)? Or are you running the helper program hwstamp_ctl for some other reason?
    - On your printouts the iperf3 throughput looks quite low for TCP; it should reach 900-950 Mbit/s. Are you running something else in parallel? What is the TCP throughput without asking for timestamps on all received frames?

      Pekka

  • Hi Pekka,

    A few comments and updates for you.

    * Timestamping all incoming packets is the only option with the CPSW/CPTS hardware. See am64_cpsw_nuss_hwtstamp_set in am64-cpsw-nuss.c (a paraphrased sketch of the relevant handling follows this list).
    * Yes, the use case is to run ptp4l; however, running hwstamp_ctl directly reduces the test setup.
    * I agree that the iperf3 throughput looks low. I haven't analysed this yet, but it may be due to the threaded interrupt handlers in the -rt kernel.
    * Interestingly, running iperf3 --bidir always gives very poor throughput (~150 Mbits/sec) in one direction, no matter which kernel is running.
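
    For reference, here is a paraphrased sketch (not the verbatim driver code; the struct and function names are stand-ins) of the RX-filter handling pattern that the hwtstamp_set path follows: anything other than HWTSTAMP_FILTER_NONE ends up coerced to HWTSTAMP_FILTER_ALL, so enabling RX timestamping means timestamping every received frame.

    /* paraphrased sketch of the driver's RX-filter handling; names are stand-ins */
    #include <errno.h>
    #include <stdbool.h>
    #include <linux/net_tstamp.h>

    struct port_state {
        bool rx_ts_enabled;                       /* stand-in for the driver's per-port state */
    };

    int set_rx_filter(struct port_state *port, struct hwtstamp_config *cfg)
    {
        switch (cfg->rx_filter) {
        case HWTSTAMP_FILTER_NONE:
            port->rx_ts_enabled = false;          /* RX timestamping off */
            break;
        case HWTSTAMP_FILTER_ALL:
        case HWTSTAMP_FILTER_PTP_V2_EVENT:        /* ...and the other PTP filters */
            port->rx_ts_enabled = true;
            cfg->rx_filter = HWTSTAMP_FILTER_ALL; /* reported back: all RX frames are timestamped */
            break;
        default:
            return -ERANGE;
        }
        return 0;
    }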

    I previously ran the same test with a non-realtime kernel and it also fails (see previous comments).

    I found an SK-AM64B board in the office and loaded the latest yocto SDK [1]. Running the following command sequence fails almost immediately for me:

    root@am64xx-evm:~# cat /etc/issue
    [Arago ASCII-art banner]

    Arago Project \n \l

    Arago 2023.04 \n \l

    root@am64xx-evm:~# uname -a
    Linux am64xx-evm 6.1.33-g40c32565ca #1 SMP PREEMPT Thu Jul 6 14:17:24 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
    root@am64xx-evm:~# hwstampctl -i eth0 -r 1
    root@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.128, port 44656
    [ 5] local 10.117.68.142 port 5201 connected to 10.117.68.128 port 44672
    [ ID] Interval Transfer Bitrate
    [ 5] 0.00-1.00 sec 53.9 MBytes 452 Mbits/sec
    [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec
    [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec
    [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec
    [ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec
    Some lines removed
    [ 607.074314] ------------[ cut here ]------------
    [ 607.079005] NETDEV WATCHDOG: eth0 (am65-cpsw-nuss): transmit queue 0 timed out
    [ 607.086300] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x214/0x220
    [ 607.094590] Modules linked in: iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_tables x_tables ov6
    [ 607.137936] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G O 6.1.33-g40c32565ca #1
    [ 607.146360] Hardware name: Texas Instruments AM642 SK (DT)
    [ 607.151833] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [ 607.158783] pc : dev_watchdog+0x214/0x220
    [ 607.162788] lr : dev_watchdog+0x214/0x220
    [ 607.166791] sp : ffff80000800be20
    [ 607.170093] x29: ffff80000800be20 x28: 0000000000000005 x27: ffff800008abef10
    [ 607.177222] x26: ffff8000092679c0 x25: ffff00007fbd01a8 x24: ffff80000800bef0
    [ 607.184349] x23: ffff800009267000 x22: 0000000000000000 x21: ffff000000c4239c
    [ 607.191476] x20: ffff000000c42000 x19: ffff000000c42448 x18: ffffffffffffffff
    [ 607.198603] x17: 6f2064656d697420 x16: 3020657565757120 x15: 74696d736e617274
    [ 607.205731] x14: 203a297373756e2d x13: ffff800009281550 x12: 00000000000005b5
    [ 607.212858] x11: 00000000000001e7 x10: ffff8000092d9550 x9 : ffff800009281550
    [ 607.219985] x8 : 00000000ffffefff x7 : ffff8000092d9550 x6 : 0000000000000000
    [ 607.227112] x5 : ffff00007fbcfb60 x4 : 0000000000000000 x3 : 0000000000000000
    [ 607.234239] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff00000015e3c0
    [ 607.241368] Call trace:
    [ 607.243806] dev_watchdog+0x214/0x220
    [ 607.247465] call_timer_fn.constprop.0+0x24/0x80
    [ 607.252081] __run_timers.part.0+0x1f0/0x234
    [ 607.256344] run_timer_softirq+0x3c/0x7c
    [ 607.260258] _stext+0x124/0x2a4
    [ 607.263394] ____do_softirq+0x10/0x20
    [ 607.267049] call_on_irq_stack+0x24/0x4c
    [ 607.270964] do_softirq_own_stack+0x1c/0x30
    [ 607.275139] __irq_exit_rcu+0xcc/0xf4
    [ 607.278797] irq_exit_rcu+0x10/0x20
    [ 607.282279] el1_interrupt+0x38/0x70
    [ 607.285854] el1h_64_irq_handler+0x18/0x2c
    [ 607.289941] el1h_64_irq+0x64/0x68
    [ 607.293335] arch_cpu_idle+0x18/0x2c
    [ 607.296902] default_idle_call+0x30/0x6c
    [ 607.300822] do_idle+0x244/0x2c0
    [ 607.304048] cpu_startup_entry+0x24/0x30
    [ 607.307966] secondary_start_kernel+0x124/0x150
    [ 607.312491] __secondary_switched+0xb0/0xb4
    [ 607.316669] ---[ end trace 0000000000000000 ]---
    [ 607.321342] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:5972 dql_avail:-21 free_desc:461
    [ 612.962317] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:11616 dql_avail:-21 free_desc:461
    [ 618.082303] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:16736 dql_avail:-21 free_desc:461
    [ 622.946289] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:21600 dql_avail:-21 free_desc:461
    [ 628.066272] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:26720 dql_avail:-21 free_desc:461
    [ 633.186255] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:31840 dql_avail:-21 free_desc:461
    [ 639.074233] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:37728 dql_avail:-21 free_desc:461
    [ 643.938217] am65-cpsw-nuss 8000000.ethernet eth0: txq:0 DRV_XOFF:0 tmo:42592 dql_avail:-21 free_desc:461

    [1] dr-download.ti.com/.../tisdk-default-image-am64xx-evm.wic.xz

  • See am64_cpsw_nuss_hwtstamp_set in am64-cpsw-nuss.c

    Sorry, typo. That's am65_cpsw_nuss_hwtstamp_set in am65-cpsw-nuss.c

  • I do suspect there is a bug underneath here, probably minor, but I need to isolate a reproducer to file it. In general, I don't recommend running any embedded network testing related to real-time or throughput in the presence of corporate IT traffic.

    * Yes, the use case is to run ptp4l; however, running hwstamp_ctl directly reduces the test setup.

    So you are observing the same issue with ptp4l? Is that blocking your development?

    I found an SK-AM64B board in the office and loaded the latest yocto SDK [1]. Running the following command sequence fails almost immediately for me:

    This is not the case for me. I also notice your copy-paste is slightly different (different tag, and it also looks like a typo: your command hwstampctl should be hwstamp_ctl, and it is not printing out anything?). I've highlighted the differences below. Also note the bandwidth difference. I'm running an SK-AM64B (192.168.1.168) and an Ubuntu desktop (192.168.1.99), with no corporate IT network traffic.

    root@am64xx-evm:~# uname -a
    Linux am64xx-evm 6.1.33-rt11-g685e771524 #1 SMP PREEMPT_RT Thu Jul  6 16:09:58 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
    root@am64xx-evm:~# 
    root@am64xx-evm:~# hwstampctl -i eth0 -r 1
    -sh: hwstampctl: command not found
    root@am64xx-evm:~# hwstamp_ctl -i eth0 -r 1                                                                                       
    current settings:
    tx_type 0
    rx_filter 0
    new settings:
    tx_type 0
    rx_filter 1
    root@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 192.168.1.99, port 51880
    [  5] local 192.168.1.168 port 5201 connected to 192.168.1.99 port 51888
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec   107 MBytes   895 Mbits/sec                  
    [  5]   1.00-2.00   sec   108 MBytes   908 Mbits/sec                  
    [  5]   2.00-3.00   sec   107 MBytes   898 Mbits/sec                  
    [  5]   3.00-4.00   sec   107 MBytes   897 Mbits/sec                  
    [  5]   4.00-5.00   sec   108 MBytes   904 Mbits/sec                  
    [  5]   5.00-6.00   sec   108 MBytes   904 Mbits/sec                  
    [  5]   6.00-7.00   sec   112 MBytes   938 Mbits/sec                  
    [  5]   7.00-8.00   sec   112 MBytes   942 Mbits/sec                  
    [  5]   8.00-9.00   sec   112 MBytes   936 Mbits/sec                  
    [  5]   9.00-10.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  10.00-11.00  sec   111 MBytes   929 Mbits/sec                  
    [  5]  11.00-12.00  sec   112 MBytes   942 Mbits/sec                  
    [  5]  12.00-13.00  sec   110 MBytes   920 Mbits/sec                  
    [  5]  13.00-14.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  14.00-15.00  sec   107 MBytes   900 Mbits/sec                  
    [  5]  15.00-16.00  sec   112 MBytes   942 Mbits/sec                  
    [  5]  16.00-17.00  sec   110 MBytes   925 Mbits/sec                  
    [  5]  17.00-18.00  sec   112 MBytes   942 Mbits/sec                  
    [  5]  18.00-19.00  sec   111 MBytes   932 Mbits/sec                  
    [  5]  19.00-20.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  20.00-21.00  sec   110 MBytes   924 Mbits/sec                  
    [  5]  21.00-22.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  22.00-23.00  sec   111 MBytes   926 Mbits/sec                  
    [  5]  23.00-24.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  24.00-25.00  sec   110 MBytes   926 Mbits/sec                  
    [  5]  25.00-26.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  26.00-27.00  sec   110 MBytes   924 Mbits/sec                  
    [  5]  27.00-28.00  sec   113 MBytes   944 Mbits/sec                  
    [  5]  28.00-29.00  sec   110 MBytes   925 Mbits/sec                  
    [  5]  29.00-30.00  sec   112 MBytes   939 Mbits/sec                  
    [  5]  30.00-31.00  sec   112 MBytes   942 Mbits/sec                  
    [  5]  31.00-32.00  sec   105 MBytes   884 Mbits/sec                  
    [  5]  32.00-33.00  sec   112 MBytes   942 Mbits/sec                  
    [  5]  33.00-34.00  sec   110 MBytes   924 Mbits/sec                  
    [  5]  34.00-35.00  sec   112 MBytes   942 Mbits/sec                  
    [  5]  35.00-36.00  sec   110 MBytes   926 Mbits/sec                  
    [  5]  36.00-37.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  37.00-38.00  sec   110 MBytes   927 Mbits/sec                  
    [  5]  38.00-39.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  39.00-40.00  sec   110 MBytes   920 Mbits/sec                  
    [  5]  40.00-41.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  41.00-42.00  sec   111 MBytes   929 Mbits/sec                  
    [  5]  42.00-43.00  sec   112 MBytes   942 Mbits/sec                  
    [  5]  43.00-44.00  sec   109 MBytes   910 Mbits/sec                  
    [  5]  44.00-45.00  sec   110 MBytes   925 Mbits/sec                  
    [  5]  45.00-46.00  sec   110 MBytes   927 Mbits/sec                  
    [  5]  46.00-47.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  47.00-48.00  sec   111 MBytes   929 Mbits/sec                  
    [  5]  48.00-49.00  sec   112 MBytes   942 Mbits/sec                  
    [  5]  49.00-50.00  sec   110 MBytes   923 Mbits/sec                  
    [  5]  50.00-51.00  sec   112 MBytes   942 Mbits/sec                  
    [  5]  51.00-52.00  sec   110 MBytes   923 Mbits/sec                  
    [  5]  52.00-53.00  sec   112 MBytes   940 Mbits/sec                  
    [  5]  53.00-54.00  sec   110 MBytes   920 Mbits/sec                  
    [  5]  54.00-55.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  55.00-56.00  sec   110 MBytes   922 Mbits/sec                  
    [  5]  56.00-57.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  57.00-58.00  sec   110 MBytes   924 Mbits/sec                  
    [  5]  58.00-59.00  sec   112 MBytes   940 Mbits/sec                  
    [  5]  59.00-60.00  sec   110 MBytes   926 Mbits/sec                  
    [  5]  60.00-61.00  sec   112 MBytes   942 Mbits/sec                  
    [  5]  61.00-62.00  sec   106 MBytes   885 Mbits/sec                  
    [  5]  62.00-63.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  63.00-64.00  sec   107 MBytes   897 Mbits/sec                  
    [  5]  64.00-65.00  sec   112 MBytes   940 Mbits/sec                  
    [  5]  65.00-66.00  sec   110 MBytes   924 Mbits/sec                  
    [  5]  66.00-67.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  67.00-68.00  sec   110 MBytes   922 Mbits/sec                  
    [  5]  68.00-69.00  sec   112 MBytes   942 Mbits/sec                  
    [  5]  69.00-70.00  sec   111 MBytes   931 Mbits/sec                  
    [  5]  70.00-71.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  71.00-72.00  sec   110 MBytes   921 Mbits/sec                  
    [  5]  72.00-73.00  sec   113 MBytes   946 Mbits/sec                  
    [  5]  73.00-74.00  sec   111 MBytes   929 Mbits/sec                  
    [  5]  74.00-75.00  sec   112 MBytes   939 Mbits/sec                  
    [  5]  75.00-76.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  76.00-77.00  sec   110 MBytes   924 Mbits/sec                  
    [  5]  77.00-78.00  sec   112 MBytes   942 Mbits/sec                  
    [  5]  78.00-79.00  sec   108 MBytes   903 Mbits/sec                  
    [  5]  79.00-80.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  80.00-81.00  sec   110 MBytes   925 Mbits/sec                  
    [  5]  81.00-82.00  sec   112 MBytes   942 Mbits/sec                  
    [  5]  82.00-83.00  sec   111 MBytes   926 Mbits/sec                  
    [  5]  83.00-84.00  sec   112 MBytes   940 Mbits/sec                  
    [  5]  84.00-85.00  sec   110 MBytes   925 Mbits/sec                  
    [  5]  85.00-86.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  86.00-87.00  sec   111 MBytes   932 Mbits/sec                  
    [  5]  87.00-88.00  sec   112 MBytes   942 Mbits/sec                  
    [  5]  88.00-89.00  sec   110 MBytes   927 Mbits/sec                  
    [  5]  89.00-90.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  90.00-91.00  sec   110 MBytes   925 Mbits/sec                  
    [  5]  91.00-92.00  sec   112 MBytes   939 Mbits/sec                  
    [  5]  92.00-93.00  sec   110 MBytes   921 Mbits/sec                  
    [  5]  93.00-94.00  sec   112 MBytes   942 Mbits/sec                  
    [  5]  94.00-95.00  sec   110 MBytes   924 Mbits/sec                  
    [  5]  95.00-96.00  sec   105 MBytes   882 Mbits/sec                  
    [  5]  96.00-97.00  sec   107 MBytes   895 Mbits/sec                  
    [  5]  97.00-98.00  sec   112 MBytes   941 Mbits/sec                  
    [  5]  98.00-99.00  sec   110 MBytes   922 Mbits/sec                  
    [  5]  99.00-100.00 sec   112 MBytes   941 Mbits/sec                  
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-100.00 sec  10.8 GBytes   929 Mbits/sec                  receiver
    -----------------------------------------------------------
    Server listening on 5201 (test #3)
    -----------------------------------------------------------
    ^Ciperf3: interrupt - the server has terminated
    root@am64xx-evm:~# 
    
    

    There must be something else involved here. I see a couple of differences:

    1. Your SDK image is slightly different (I don't think this is a problem), based on uname -a.

    2. Your network and traffic are different: unknown corporate IT traffic.

    3. Basic iperf3 throughput is down by maybe 30% compared to what I see; my guess is something extra is going on in the corporate LAN or on your machine.

    4. hwstampctl vs hwstamp_ctl? Maybe just a copy-paste mistake?

      Pekka

  • Hi Pekka,

    I'm confused as to how your SDK could be different; I downloaded the file directly from the link I sent you. I'll try to come up with a more easily reproducible test case for you today.

    Yes, this issue is holding up our development. It is absolutely critical that this works reliably.

    In summary I currently see two different behaviours running the latest TI kernel, either:
    1. Ethernet stops transmitting/receiving packets, or
    2. Kernel panics in am65_cpsw_nuss_rx_poll

    These could be separate issues, but I'm hoping that they're both symptoms of the same underlying bug.

    Here's a quick analysis of the panic. I haven't gone into enough detail to understand exactly what's happening yet:

    Accepted connection from 10.117.68.128, port 48592
    [ 5] local 10.117.68.234 port 5201 connected to 10.117.68.128 port 48604
    [ ID] Interval Transfer Bitrate
    [ 5] 0.00-1.00 sec 68.6 MBytes 574 Mbits/sec
    [ 30.602777] Unable to handle kernel paging request at virtual address 17961a951231185b
    [ 30.610723] Mem abort info:
    [ 30.613507] ESR = 0x0000000096000044
    [ 30.617245] EC = 0x25: DABT (current EL), IL = 32 bits
    [ 30.622546] SET = 0, FnV = 0
    [ 30.625591] EA = 0, S1PTW = 0
    [ 30.628722] FSC = 0x04: level 0 translation fault
    [ 30.633588] Data abort info:
    [ 30.636458] ISV = 0, ISS = 0x00000044
    [ 30.640283] CM = 0, WnR = 1
    [ 30.643242] [17961a951231185b] address between user and kernel address ranges
    [ 30.650364] Internal error: Oops: 0000000096000044 [#1] PREEMPT_RT SMP
    [ 30.650376] CPU: 0 PID: 93 Comm: irq/152-8000000 Not tainted 6.1.46-rt13 #2
    [ 30.650386] Hardware name: Relectrify Stack Controller: Relecblox-Main-Controller-SOM (DT)
    [ 30.650391] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [ 30.650400] pc : am65_cpsw_nuss_rx_poll+0x150/0x390
    [ 30.650428] lr : am65_cpsw_nuss_rx_poll+0x130/0x390
    [ 30.650438] sp : ffff800009acbb80
    [ 30.650441] x29: ffff800009acbb80 x28: 17961a951231184b x27: ffff00004dcd53b0
    [ 30.650456] x26: ffff0000022eb670 x25: 0000000000000600 x24: 0000000000000040
    [ 30.650467] x23: ffff0000022ea080 x22: ffff00004dcd5380 x21: ffff000001c70000
    [ 30.650479] x20: ffff0000022eb648 x19: 0000000000000000 x18: 0000000000000000
    [ 30.650490] x17: ffff80006727c000 x16: ffff800008000000 x15: ffff000002b12682
    [ 30.650502] x14: 0000000000000001 x13: 00000000000001a8 x12: 0000000000000001
    [ 30.650513] x11: 0000000000000040 x10: ffff800008a3b4d0 x9 : ffff800008a3b4a8
    [ 30.650525] x8 : ffff80006727c000 x7 : 000f0000cdcd5380 x6 : 000f0000cdcd5380
    [ 30.650536] x5 : ffff00004dcd53c0 x4 : 000f000082a9f140 x3 : ffff00004dcd53c0
    [ 30.650548] x2 : 00000000000003e0 x1 : ffff000001c6d080 x0 : ffff000001c6d080
    [ 30.650560] Call trace:
    [ 30.650564] am65_cpsw_nuss_rx_poll+0x150/0x390
    [ 30.650574] __napi_poll.constprop.0+0x34/0x180
    [ 30.650588] net_rx_action+0x128/0x2c0
    [ 30.650598] _stext+0xf4/0x228
    [ 30.650609] __local_bh_enable_ip+0xc0/0x130
    [ 30.650623] irq_forced_thread_fn+0x94/0xb0
    [ 30.650635] irq_thread+0x140/0x200
    [ 30.650643] kthread+0x110/0x120
    [ 30.650651] ret_from_fork+0x10/0x20
    [ 30.650667] Code: b90077e2 52807c02 9ba20400 f9400415 (f9000b95) 
    [ 30.806403] ---[ end trace 0000000000000000 ]---
    [ 30.806409] Kernel panic - not syncing: Oops: Fatal exception in interrupt
    [ 30.817871] SMP: stopping secondary CPUs
    [ 30.817898] Kernel Offset: disabled
    [ 30.817901] CPU features: 0x00000,00000004,0000400b
    [ 30.817908] Memory Limit: none
    
    0000000000458ee0 <am65_cpsw_nuss_rx_poll>:
     458ee0: a9b67bfd stp x29, x30, [sp, #-160]!
     458ee4: d5384102 mrs x2, sp_el0
     458ee8: 910003fd mov x29, sp
     458eec: a90153f3 stp x19, x20, [sp, #16]
     458ef0: 52800013 mov w19, #0x0 // #0
     458ef4: a90363f7 stp x23, x24, [sp, #48]
     458ef8: 2a0103f8 mov w24, w1
     458efc: a9046bf9 stp x25, x26, [sp, #64]
     458f00: aa0003fa mov x26, x0
     458f04: f9428040 ldr x0, [x2, #1280]
     458f08: f9004fe0 str x0, [sp, #152]
     458f0c: d2800000 mov x0, #0x0 // #0
     458f10: 9282bde2 mov x2, #0xffffffffffffea10 // #-5616
     458f14: 8b020340 add x0, x26, x2
     458f18: f9003fe0 str x0, [sp, #120]
     458f1c: 34000301 cbz w1, 458f7c <am65_cpsw_nuss_rx_poll+0x9c>
     458f20: d100a354 sub x20, x26, #0x28
     458f24: aa0003f7 mov x23, x0
     458f28: a9025bf5 stp x21, x22, [sp, #32]
     458f2c: a90573fb stp x27, x28, [sp, #80]
     458f30: 14000008 b 458f50 <am65_cpsw_nuss_rx_poll+0x70>
     458f34: f94047e1 ldr x1, [sp, #136]
     458f38: 36000421 tbz w1, #0, 458fbc <am65_cpsw_nuss_rx_poll+0xdc>
     458f3c: b94012e0 ldr w0, [x23, #16]
     458f40: 37081260 tbnz w0, #1, 45918c <am65_cpsw_nuss_rx_poll+0x2ac>
     458f44: 11000673 add w19, w19, #0x1
     458f48: 6b13031f cmp w24, w19
     458f4c: 54000fe0 b.eq 459148 <am65_cpsw_nuss_rx_poll+0x268> // b.none
     458f50: f9400e80 ldr x0, [x20, #24]
     458f54: 910223e2 add x2, sp, #0x88
     458f58: f94002f5 ldr x21, [x23]
     458f5c: 52800001 mov w1, #0x0 // #0
     458f60: a908ffff stp xzr, xzr, [sp, #136]
     458f64: 94000000 bl 351130 <k3_udma_glue_pop_rx_chn>
     458f68: 34fffe60 cbz w0, 458f34 <am65_cpsw_nuss_rx_poll+0x54>
     458f6c: 3100f41f cmn w0, #0x3d
     458f70: 540016a1 b.ne 459244 <am65_cpsw_nuss_rx_poll+0x364> // b.any
     458f74: a9425bf5 ldp x21, x22, [sp, #32]
     458f78: a94573fb ldp x27, x28, [sp, #80]
     458f7c: 6b13031f cmp w24, w19
     458f80: 5400006d b.le 458f8c <am65_cpsw_nuss_rx_poll+0xac>
     458f84: 6b18027f cmp w19, w24
     458f88: 540010cb b.lt 4591a0 <am65_cpsw_nuss_rx_poll+0x2c0> // b.tstop
     458f8c: d5384100 mrs x0, sp_el0
     458f90: f9404fe2 ldr x2, [sp, #152]
     458f94: f9428001 ldr x1, [x0, #1280]
     458f98: eb010042 subs x2, x2, x1
     458f9c: d2800001 mov x1, #0x0 // #0
     458fa0: 540014c1 b.ne 459238 <am65_cpsw_nuss_rx_poll+0x358> // b.any
     458fa4: a94363f7 ldp x23, x24, [sp, #48]
     458fa8: 2a1303e0 mov w0, w19
     458fac: a94153f3 ldp x19, x20, [sp, #16]
     458fb0: a9446bf9 ldp x25, x26, [sp, #64]
     458fb4: a8ca7bfd ldp x29, x30, [sp], #160
     458fb8: d65f03c0 ret
     458fbc: f9400a80 ldr x0, [x20, #16]
     458fc0: 94000000 bl 45e480 <k3_cppi_desc_pool_dma2virt>
     458fc4: b9400002 ldr w2, [x0]                                       *** w2 = *x0 = pkt_info0
     458fc8: aa0003f6 mov x22, x0                                        *** x22 = x0 = desc_rx
     458fcc: d2800001 mov x1, #0x0                                       *** x1 = 0
     458fd0: 37e00062 tbnz w2, #28, 458fdc <am65_cpsw_nuss_rx_poll+0xfc> *** w2 & CPPI5_INFO_HDESC_PSINFO_LOCATION
     458fd4: 53167c41 lsr w1, w2, #22
     458fd8: d37e1421 ubfiz x1, x1, #2, #6                               *** x1 = (w2 & CPPI5_INFO_HDESC_PSINFO_SIZE_MASK) >> CPPI5_INFO_HDESC_PSINFO_SIZE_SHIFT
     458fdc: f9400e80 ldr x0, [x20, #24]
     458fe0: f263005f tst x2, #0x20000000
     458fe4: f94016c4 ldr x4, [x22, #40]
     458fe8: 910102c3 add x3, x22, #0x40                                 *** x3 = x22 + 64
     458fec: b94026c2 ldr w2, [x22, #36]
     458ff0: 9100c2db add x27, x22, #0x30                                *** x27 = x22 + 48
     458ff4: 9a9b1065 csel x5, x3, x27,                                  *** x5 = x3 or x27 based on x2 & CPPI5_INFO0_HDESC_EPIB_PRESENT
     458ff8: f90037e3 str x3, [sp, #104]
     458ffc: 12006c59 and w25, w2, #0xfffffff
     459000: f86168bc ldr x28, [x5, x1]                                  *** x28 = x5[x1] - load of invalid address
     459004: 910243e1 add x1, sp, #0x90
     459008: f9004be4 str x4, [sp, #144]
     45900c: 94000000 bl 351250 <k3_udma_glue_rx_cppi5_to_dma_addr>
     459010: 79401ec0 ldrh w0, [x22, #14]
     459014: f94036e1 ldr x1, [x23, #104]
     459018: 51000400 sub w0, w0, #0x1
     45901c: b94002c2 ldr w2, [x22]
     459020: b90077e2 str w2, [sp, #116]
     459024: 52807c02 mov w2, #0x3e0
     459028: 9ba20400 umaddl x0, w0, w2, x1
     45902c: f9400415 ldr x21, [x0, #8]
     459030: f9000b95 str x21, [x28, #16]                                *** x28[16] = x21 - PANIC!
    
    x28=17961a951231184b is clearly bogus.
    x5=ffff00004dcd53c0 looks ok.
    Unfortunately x1 and x2 have been trampled.

    I'll try to collect as much additional information as I can.

    Patrick

  • This latest error just above, in rx_poll after "[ 30.602777] Unable to handle kernel paging request at virtual address 17961a951231185b", looks different to me. The previous error you had was "NETDEV WATCHDOG: eth0 (am65-cpsw-nuss): transmit queue 0 timed out".

    I'm still focused on why your setup looks so different. I guess you have multiple (3?) HW setups: SK-AM64B, the Phytec SoM, and your own HW? The first thing that mismatches is the iperf3 throughput, before any of the errors show up. If you skip running "hwstamp_ctl -i eth0 -r 1", what iperf3 throughput do you observe? With two AM64x devices running iperf3, I would expect >900 Mbit/s with default iperf3 TCP.

  • Hi Pekka,

    I reported the panic previously in this thread 6 days ago.

    I'm not particularly interested in throughput; however, you should be able to reproduce my results by:
    1. Testing with the realtime kernel
    2. Running iperf3 with --bidir with any kernel will show poor throughput in one direction (also mentioned previously in this thread).

    You're misinterpreting the "[ 5] 0.00-1.00 sec 68.6 MBytes 574 Mbits/sec" result in my last post. It's low because the bug happened before the test even managed to run for one second. The other lower results are a result of running the realtime kernel.

    I have tested with three different hardware setups, including the SK-AM64B and they all show the same results.

    Timestamping appears to have no impact (measurements within margin of error) on performance for me.

    Can you please escalate this issue as it is seriously impacting our development timeline.

    Patrick

  • I have tested with three different hardware setups, including the SK-AM64B and they all show the same results.

    This is where I see a major mismatch. I don't see this issue with the SK-AM64B, running iperf3 and hwstamp_ctl. And prior to any crash, the throughput you see with iperf3 is off significantly. So there must be something that is different or some additional configuration you are doing.

    I need a test to reproduce. With the SD card WIC image, hwstamp_ctl and then iperf3 works between SKs and between an SK and an Ubuntu desktop. For you it does not, and apparently creates this crash.

  • One outside chance is lots of corrupted frames. That would fit both the decreased iperf3 throughput and a lockup. There is a HW issue where a very specifically corrupted frame can create a lockup. I'm checking on the patch to work around the lockup, but can you check if you have CRC errors just running iperf3 without hwstamp_ctl? Point-to-point with two boards and static IP addresses would be another way to check.
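
    If it helps, a minimal sketch for reading the interface's CRC error counter (it assumes the standard sysfs statistics path and the interface name eth0; "ethtool -S eth0" gives the fuller per-driver statistics):

    /* minimal sketch: print the RX CRC error counter for eth0 from sysfs */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long crc_errs = 0;
        FILE *f = fopen("/sys/class/net/eth0/statistics/rx_crc_errors", "r");

        if (!f) {
            perror("rx_crc_errors");
            return 1;
        }
        if (fscanf(f, "%llu", &crc_errs) == 1)
            printf("eth0 rx_crc_errors: %llu\n", crc_errs);
        fclose(f);
        return 0;
    }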

    This is where I see a major mismatch. I don't see this issue with the SK-AM64B, running iperf3 and hwstamp_ctl. And prior to any crash, the throughput you see with iperf3 is off significantly. So there must be something that is different or some additional configuration you are doing.

    As I mentioned previously, if I run the normal (non realtime) kernel, the throughput is fine (> 900MBits/sec). It's just that the one test I copied/pasted in failed in less than 1 second, so the statistics show a lower datarate.

    Running the realtime kernel yields around 600MBits/sec or so, probably due to the threaded interrupt handlers.

    Running iperf3 in --bidir mode yields poor performance on all kernels tested.

    I need a test to reproduce. With the SD card WIC image, hwstamp_ctl and then iperf3 works between SKs and between an SK and an Ubuntu desktop. For you it does not, and apparently creates this crash.

    I'm trying my best to come up with an easily reproducible test case.

    One outside chance is lots of corrupted frames. That would fit both the decreased iperf3 throughput and a lockup. There is a HW issue where a very specifically corrupted frame can create a lockup. I'm checking on the patch to work around the lockup, but can you check if you have CRC errors just running iperf3 without hwstamp_ctl? Point-to-point with two boards and static IP addresses would be another way to check.

    That sounds interesting. I would be happy to test any patches you have to see if they help.

    Here's the ethernet stats on a board which is currently locked up, looks like no errors. 

    # cat /proc/net/dev
    Inter-|   Receive                                                |  Transmit
     face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
        lo:   22034     228    0    0    0     0          0         0    22034     228    0    0    0     0       0          0
      sit0:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
      eth0: 965237411  637841    0   82    0     0          0         0   963269   14480    0    0    0     0       0          0

  • Hi Pekka,

    I setup a test with an old laptop connected directly to the SK-AM64B with hardware timestamping enabled. The same bug occurs, it just takes a bit longer to manifest (37 minutes or so this time).

    root@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from fe80::56e1:adff:fe14:d67c, port 42952
    [  5] local fe80::3608:e1ff:fe80:b813 port 5201 connected to fe80::56e1:adff:fe14:d67c port 42956
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec   110 MBytes   923 Mbits/sec                  
    [  5]   1.00-2.00   sec   111 MBytes   928 Mbits/sec                  
    [  5]   2.00-3.00   sec   110 MBytes   919 Mbits/sec                  
    [  5]   3.00-4.00   sec   111 MBytes   928 Mbits/sec                  
    [  5]   4.00-5.00   sec   110 MBytes   924 Mbits/sec                  
    [  5]   5.00-6.00   sec   111 MBytes   928 Mbits/sec                  
    [  5]   6.00-7.00   sec   109 MBytes   914 Mbits/sec                  
    [  5]   7.00-8.00   sec   111 MBytes   928 Mbits/sec                  
    [  5]   8.00-9.00   sec   110 MBytes   921 Mbits/sec                  
    ...
    [  5] 2246.00-2247.00 sec   111 MBytes   928 Mbits/sec                  
    [  5] 2247.00-2248.00 sec   111 MBytes   928 Mbits/sec                  
    [  5] 2248.00-2249.00 sec   111 MBytes   928 Mbits/sec                  
    [  5] 2249.00-2250.00 sec   111 MBytes   928 Mbits/sec                  
    [  5] 2250.00-2251.00 sec   111 MBytes   928 Mbits/sec                  
    [  5] 2251.00-2252.00 sec   111 MBytes   928 Mbits/sec                  
    [  5] 2252.00-2253.00 sec  9.69 MBytes  81.3 Mbits/sec                  
    [  5] 2253.00-2254.00 sec  0.00 Bytes  0.00 bits/sec                  
    [  5] 2254.00-2255.00 sec  0.00 Bytes  0.00 bits/sec                  
    [  5] 2255.00-2256.00 sec  0.00 Bytes  0.00 bits/sec                  
    [  5] 2256.00-2257.00 sec  0.00 Bytes  0.00 bits/sec                  
    [  5] 2257.00-2258.00 sec  0.00 Bytes  0.00 bits/sec                  
    [  5] 2258.00-2259.00 sec  0.00 Bytes  0.00 bits/sec                  

    Patrick

  • I still see a couple of moving parts (the printouts and text seem to mismatch); let's try to close them one by one.

    #1 throughput

    I setup a test with an old laptop connected directly to the SK-AM64B with hardware timestamping enabled. The same bug occurs, it just takes a bit longer to manifest (37 minutes or so this time).
    if I run the normal (non realtime) kernel, the throughput is fine (> 900MBits/sec). It's just that the one test I copied/pasted in failed in less than 1 second, so the statistics show a lower datarate.

    Was this a point-to-point Ethernet connection with static IP addresses? Was this run with the RT image? So, https://dr-download.ti.com/software-development/software-development-kit-sdk/MD-InmvA50mCw/09.00.00.03/tisdk-default-image-am64xx-evm.wic.xz? I see >900 Mbit/s out of the box with iperf3 -s on the AM64x and iperf3 -c <ipaddr> from another eval board or a desktop, like your cut-and-paste shows. But your earlier posts say that with RT you see ~600 Mbit/s?

    On some releases, and depending on how you have changed priorities, I also do see ~600 Mbit/s. For a TCP test, just running something like chrt 9 iperf3 -s then gets to >900 Mbit/s with RT. For TCP, it seems that the test application running at a higher priority than the ksoftirqs is key.

    #2 usage of --bidir

    Running iperf3 in --bidir mode yields poor performance on all kernels tested.

    The printout above does not look like what I'd expect if you ran --bidir at the client. If I run iperf3 -c <ipaddr> --bidir, I see an additional column with the header [role] and RX-S and TX-S entries on the device running iperf3 -s.

    #3 corrupted frames

    Here's the ethernet stats on a board which is currently locked up, looks like no errors. 

    So probably not erroneous packets. 

    #4 possible other things to try

    This is not really a misunderstanding, just something to see if it helps you proceed. I'm assuming the hwstamp_ctl usage is just a synthetic test to help reproduce the problem and your real application is something else? As for the ksoftirqd threads, have you tried running them at an elevated priority? In a loaded system, managing these can help maintain networking performance.


    ps aux | grep ksoftirq
    # see the thread ids of the per-core ksoftirqd threads, in my case 13 and 27
    chrt -f -p 10 13
    chrt -f -p 10 27

  • Hi Pekka,

    Another way I've found to reproduce the issue is by unplugging and plugging in the ethernet cable while iperf is running. After 10-20 times of doing this the ethernet driver will lock up.

    #1 throughput

    I set up a test with an old laptop connected directly to the SK-AM64B with hardware timestamping enabled. The same bug occurs, it just takes a bit longer to manifest (37 minutes or so this time).
    If I run the normal (non-realtime) kernel, the throughput is fine (> 900MBits/sec). It's just that the one test I copied/pasted in failed in less than 1 second, so the statistics show a lower datarate.

    Was this a point-to-point Ethernet connection with static IP addresses? Was this run with the RT image, i.e. https://dr-download.ti.com/software-development/software-development-kit-sdk/MD-InmvA50mCw/09.00.00.03/tisdk-default-image-am64xx-evm.wic.xz ? I see >900Mbit/s out of the box with iperf3 -s on AM64x and iperf3 -c <ipaddr> from another eval board or a desktop, like your cut-and-paste shows. But your earlier posts say that with RT you see ~600Mbit/s?

    On some releases, and depending on how you have changed priorities, I also see ~600Mbit/s. For a TCP test, just running something like chrt 9 iperf3 -s gets to >900Mbit/s with RT. For TCP it seems that running the test application at a higher priority than the ksoftirqd threads is key.

    I need to clarify that we have no concerns about throughput. This is a distraction from the issue that the kernel either panics or ethernet traffic stops.

    In answer to your questions:
    * Yes, this is a point-to-point connection.
    * This test was running the exact "wic" file you linked.
    * Yes, with a realtime kernel you will see reduced ~600MBit/s vs ~900MBit/s with the non-realtime kernel.

    I have not changed any thread priorities.

    #2 usage of --bidir

    Running iperf3 in --bidir mode yields poor performance on all kernels tested.

    The printout above does not look like what I'd expect if you run --bidir at the client. If I run iperf3 -c <ipaddr> --bidir I see an additional column with the header [role], and RX-S and TX-S rows on the device running iperf3 -s.

    I was merely pointing out that if you're interested in doing throughput tests you may find some more interesting results when running in --bidir mode.

    #4 possible other things to try

    This is not really a misunderstanding, just something to see if it helps you proceed. I'm assuming the hwstamp_ctl usage is just a synthetic test to help reproduce the problem and your real application is something else? Have you tried running the ksoftirqd threads at an elevated priority? In a loaded system, managing these can help maintain networking performance.

    Yes, I am using hwstamp_ctl to configure the ethernet driver to quickly reproduce the problem. Our real application needs to operate in the field reliably for many months or even years at a time. Stability is of great concern to us.

    I have not adjusted any thread priorities as I am not concerned with performance at this stage.

    Patrick

  • I need to clarify that we have no concerns about throughput. This is a distraction from the issue that the kernel either panics or ethernet traffic stops.

    In answer to your questions:
    * Yes, this is a point-to-point connection.
    * This test was running the exact "wic" file you linked.
    * Yes, with a realtime kernel you will see reduced ~600MBit/s vs ~900MBit/s with the non-realtime kernel.

    I have not changed any thread priorities.

    I understand you don't care about throughput. My point in chasing this is that something is different in your setup: with the 9.0 default RT image and supposedly the exact same commands, the behavior is significantly different. So there must be some other variable, or the throughput should be identical. Lots of dropped frames because of errors would be an example of this.

    I was merely pointing out that if you're interested in doing throughput tests you may find some more interesting results when running in --bidir mode.

    Ok, so nothing to do with --bidir. No need to pursue further.

    Yes, I am using hwstamp_ctl to configure the ethernet driver to quickly reproduce the problem. Our real application needs to operate in the field reliably for many months or even years at a time. Stability is of great concern to us.

    There is a workaround coming for the timestamping that removes the option of timestamping all frames. We'll only support timestamping IEEE1588 frames and use the older FIFO-based approach instead of the timestamp in the descriptor. This is for stability with extreme amounts of corrupted frames. So timestamping all frames via hwstamp_ctl is something to avoid anyway; it will be removed.
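
    For illustration only, a selective request would look something like the sketch below. The rx_filter value 12 is HWTSTAMP_FILTER_PTP_V2_EVENT and tx_type 1 is HWTSTAMP_TX_ON from the Linux net_tstamp interface; whether this particular filter is accepted, and whether it avoids the descriptor-based delivery, depends on the driver/SDK version, so treat it as a sketch rather than the confirmed workaround:

    # request timestamps for PTPv2 event frames only, instead of all frames (-r 1)
    hwstamp_ctl -i eth0 -t 1 -r 12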

    Another way I've found to reproduce the issue is by unplugging and plugging in the ethernet cable while iperf is running. After 10-20 times of doing this the ethernet driver will lock up.

    This would point towards the corrupted frames. Do you see high amounts of errors (ethtool -S eth0 | grep err) in this unplug/plug sequence? Everything other than the "cat /proc/net/dev" output showing no errors would match corrupted frames and then the issue with timestamping all frames. I understood you have 3 types of HW (your own, SK, PHYTEC); do you have issues with just embedded boards connected directly to each other?

    The workaround for the all-frame timestamping issue, timestamping just IEEE1588 frames, uses completely different logic and a different delivery mechanism to SW. It is not just a stochastic workaround but avoids the underlying corrupted memory access.

      Pekka

  • My point in chasing this is that something is different in your setup: with the 9.0 default RT image and supposedly the exact same commands, the behavior is significantly different. So there must be some other variable, or the throughput should be identical. Lots of dropped frames because of errors would be an example of this.

    I will repeat my testing next time I'm in the office (tomorrow) and report my results.

    I've just realised that the filename for the "normal" and the "RT" images is exactly the same (tisdk-default-image-am64xx-evm.wic.xz), so now I'm doubting myself as to exactly which of your images I was running.

    Ok, so nothing to do with --bidir. No need to pursue further.

    I think this is worth investigating separately purely from a performance perspective.

    There is a workaround coming for the timestamping that removes the option of timestamping all frames. We'll only support timestamping IEEE1588 frames and use the older FIFO-based approach instead of the timestamp in the descriptor. This is for stability with extreme amounts of corrupted frames. So timestamping all frames via hwstamp_ctl is something to avoid anyway; it will be removed.

    Is there an ETA for this fix? I think you might be right that this is causing the issue (see below).

    This would point towards the corrupted frames. Do you see high amounts of errors (ethtool -S eth0 | grep err) in this unplug/plug sequence? Everything other than the "cat /proc/net/dev" output showing no errors would match corrupted frames and then the issue with timestamping all frames. I understood you have 3 types of HW (your own, SK, PHYTEC); do you have issues with just embedded boards connected directly to each other?

    I'm working remotely today so I can't change the hardware setup; I'll try the plug/unplug sequence again tomorrow. However, there's a board currently connected to our corporate network that I can test with. Here are the ethtool and /proc/net/dev stats. ethtool shows some errors but /proc/net/dev shows 0, so I think the /proc/net/dev stats misled us earlier.

    # ethtool -S eth0 |grep err
         p0_rx_crc_errors: 0
         p0_ale_overrun_drop: 0
         p0_ale_len_err_drop: 0
         p0_tx_mem_protect_err: 0
         rx_crc_errors: 0
         rx_align_code_errors: 1
         ale_overrun_drop: 0
         tx_deferred_frames: 0
         rx_ipg_error: 0
         tx_carrier_sense_errors: 0
         ale_len_err_drop: 0
         iet_rx_assembly_err: 0
         iet_rx_smd_err: 29
         tx_mem_protect_err: 0
    # cat /proc/net/dev
    Inter-|   Receive                                                |  Transmit
     face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
        lo:     576       8    0    0    0     0          0         0      576       8    0    0    0     0       0          0
      sit0:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
      eth0: 74756498   53812    0 1047    0     0          0         0    99199    1484    0    0    0     0       0          0

    The workaround for the all-frame timestamping issue, timestamping just IEEE1588 frames, uses completely different logic and a different delivery mechanism to SW. It is not just a stochastic workaround but avoids the underlying corrupted memory access.

    I am very keen to test this! Can you please give me access to patches as soon as they are available.

    Thanks,

    Patrick

  • "RT" images is exactly the same (tisdk-default-image-am64xx-evm.wic.xz)

    I share the pain on this; it's unfortunate naming. I often mix up which one I'm running because of it. The best I can do is always run uname -a before the test and copy-paste it into the logs.

    I am very keen to test this! Can you please give me access to patches as soon as they are available.

    I'll try to get a patch for you; it is about falling back to an old mechanism for PTP. For this bug to show up there needs to be severe EMI or corruption of frames, so that the Ethernet MAC level sees a certain pattern (related to the TSN feature preemption/IET/802.3br). So someone sending corrupted frames or intentional IET fragments could cause this. The statistic in your last post, "iet_rx_smd_err: 29", points towards this problem: 29 frames had something that looks like IET, but the rest did not match. We should of course not lock up, but this amount of errors points to severe quality issues in the LAN where you are running. I have never seen this corrupted frame issue in a live network, only with testers able to generate packets with faulty L1 level fields in the preamble.

    If you want to manually turn the timestamps off, it is the TSTAMP_EN field in register 12.2.1.6.3.2 CPSW_CPTS_CONTROL_REG in the TRM. Write a 0 to bit 3 at address 0803 D004h with, for example, devmem2.
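
    A minimal read-modify-write sketch with devmem2, assuming the address 0803 D004h from the TRM reference above and that bit 3 is TSTAMP_EN (the written value shown is illustrative only; read the register first and clear just bit 3 in whatever value you actually read back):

    # read the current value of CPSW_CPTS_CONTROL_REG
    devmem2 0x0803D004 w
    # e.g. if it reads back 0x0000000D, clear bit 3 (TSTAMP_EN) and write the result
    devmem2 0x0803D004 w 0x00000005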

    In a normal Ethernet network with quality cables, getting a frame with a preamble that is faulty in exactly the way needed to create the lockup should be really rare. The ethtool counter iet_rx_smd_err being non-zero is something that should not be observed within a few minutes or even hours.

    The erratum is not yet in the AM64x errata document, but the issue is the same as in https://www.ti.com/lit/pdf/sprz488 : i2401 — CPSW: Host Timestamps Cause CPSW Port to Lock up. The workaround in PTP is to use the selective timestamping as I stated earlier.

  • I'll try to get a patch for you, it is about falling back to an old mechanism for PTP.

    Thanks!

    If you want to manually turn the timestamps off, it is the TSTAMP_EN field in register 12.2.1.6.3.2 CPSW_CPTS_CONTROL_REG in the TRM. Write a 0 to bit 3 at address 0803 D004h with, for example, devmem2.

    Okay, my understanding of the TRM is that this disables all hardware timestamps, is that correct?

    Clearing the bit resolves the lockup, but clearly I can't continue development of our PTP application like this.

    Workaround in PTP is to use the selective timestamping as I stated earlier.

    Is this still a hardware timestamp?

    Patrick

  • To me the first thing to confirm is whether you are operating in an environment with a lot of EMI noise and a lot of errors, leading to corruption of the Ethernet preamble / start-of-frame delimiter, which then exposes the HW issue ( https://www.ti.com/lit/pdf/sprz488 : i2401 — CPSW: Host Timestamps Cause CPSW Port to Lock up). This is why I keep asking about the large drop in throughput; that would be another side effect of a bad-quality LAN or a lot of EMI noise. An isolated network of a few AM64x devices with good-quality cables would be my reference. You will see many other side effects if there are lots of corrupted frames.

    Clearing the bit resolves the lockup, but clearly I can't continue development of our PTP application like this.

    Are you running linuxptp/ptp4l, or something else? While developing your PTP use case, what is the network topology and HW? I would not do that in an environment with erroneous frames.

    Workaround in PTP is to use the selective timestamping as I stated earlier.

    Is this still a hardware timestamp?

    Yes. The location where the timestamp is taken is unchanged. Just the mechanism to deliver it to SW is different (not in every packet descriptor; instead in a FIFO and only for IEEE1588 packets).

    I'll try to get a patch for you, it is about falling back to an old mechanism for PTP.

    Thanks!

    The older CPSW driver for legacy devices such as AM335x, drivers/net/ethernet/ti/cpsw.c, implemented RX timestamping using CPTS FIFO events and can therefore be used as a reference, although it will not work as is. The function cpts_rx_timestamp() in drivers/net/ethernet/ti/cpts.c implements this approach.
    Link to the implementation: https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/drivers/net/ethernet/ti/cpts.c?h=ti-linux-6.1.y#n505

    Okay, my understanding of the TRM is that this disables all hardware timestamps, is that correct?

    Yes. The point of this test would be to verify that you are indeed running into this issue.

      Pekka

  • Hi Pekka

    To me the first thing to confirm is whether you are operating in an environment with a lot of EMI noise and a lot of errors, leading to corruption of the Ethernet preamble / start-of-frame delimiter, which then exposes the HW issue ( https://www.ti.com/lit/pdf/sprz488 : i2401 — CPSW: Host Timestamps Cause CPSW Port to Lock up). This is why I keep asking about the large drop in throughput; that would be another side effect of a bad-quality LAN or a lot of EMI noise. An isolated network of a few AM64x devices with good-quality cables would be my reference. You will see many other side effects if there are lots of corrupted frames.

    This is an office environment, so normal levels of office EMI, nothing unusual.

    I have now tested in a number of different setups (complete description of test setups & logs below) and I have narrowed down the issue to the combination of the SK-AM64B and TL-SX3428X switch. I have tried two of these switches with the SK-AM64B and in both cases the combination causes rx errors to be reported by the SK-AM64B. In summary:

    Connected via QSW-2104-2S: no errors
    Connected via TL-SX3428X v1.0 (1): rx errors
    Connected via TL-SX3428X v1.0 (2): rx errors

    There appears to be some incompatibility between the AM64x and the TL-SX3428X as all other devices connected to the switches report no errors.

    Are you running linuxptp/ptp4l, or something else? While developing your PTP use case, what is the network topology and HW? I would not do that in an environment with erroneous frames.

    Errors happen in the real world. As you stated previously, the AM64x should not lock up regardless of errors.

    Thanks for the reference.

    Do you have an estimate of when a patch for this issue will be available?

    Yes. The point of this test would be to verify that you are indeed running into this issue.

    Is there anything else you need me to test before you're convinced I'm running into the issue?

     

    Test descriptions & logs:

    SK-AM64B Setup
    ==============
    
    root@am64xx-evm:~# uname -a
    Linux am64xx-evm 6.1.33-g40c32565ca #1 SMP PREEMPT Thu Jul  6 14:17:24 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
    
    
    TEST 1
    ======
    
    The SK-AM64B is reporting rx errors when connected to TL-SG3428X port 15 with a
    brand new CAT8 cable.
    
    Switching on hardware timestamping locks up the ethernet driver.
    
    Hardware Setup
    --------------
    
                    CAT8                       DAC                      CAT6
    SK-AM64B[eth0] <-1G-> [15]TL-SG3428X[26] <-10G-> [1]QSW-2104-2S[4] <-1G-> Laptop
    
    Test Logs
    ---------
    
    root@am64xx-evm:~# hwstamp_ctl -i eth0
    current settings:
    tx_type 0
    rx_filter 0
    
    root@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.128, port 57560
    [  5] local 10.117.68.143 port 5201 connected to 10.117.68.128 port 57574
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec   111 MBytes   929 Mbits/sec
    [  5]   1.00-2.00   sec   112 MBytes   937 Mbits/sec
    [  5]   2.00-3.00   sec   108 MBytes   910 Mbits/sec
    [  5]   3.00-4.00   sec   112 MBytes   939 Mbits/sec
    [  5]   4.00-5.00   sec   109 MBytes   916 Mbits/sec
    [  5]   5.00-6.00   sec   112 MBytes   940 Mbits/sec
    [  5]   6.00-7.00   sec   109 MBytes   912 Mbits/sec
    [  5]   7.00-8.00   sec   112 MBytes   937 Mbits/sec
    [  5]   8.00-9.00   sec   109 MBytes   918 Mbits/sec
    [  5]   9.00-10.00  sec   112 MBytes   936 Mbits/sec
    
    root@am64xx-evm:~# ethtool -S eth0 |grep err
         p0_rx_crc_errors: 0
         p0_ale_overrun_drop: 0
         p0_ale_len_err_drop: 0
         p0_tx_mem_protect_err: 0
         rx_crc_errors: 4
         rx_align_code_errors: 10
         ale_overrun_drop: 0
         tx_deferred_frames: 0
         rx_ipg_error: 0
         tx_carrier_sense_errors: 0
         ale_len_err_drop: 0
         iet_rx_assembly_err: 0
         iet_rx_smd_err: 152
         tx_mem_protect_err: 0
    
    root@am64xx-evm:~# hwstamp_ctl -i eth0 -r 1
    current settings:
    tx_type 0
    rx_filter 0
    new settings:
    tx_type 0
    rx_filter 1
    
    root@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.128, port 55866
    [  5] local 10.117.68.143 port 5201 connected to 10.117.68.128 port 55878
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec   108 MBytes   903 Mbits/sec
    [  5]   1.00-2.00   sec   112 MBytes   939 Mbits/sec
    [  5]   2.00-3.00   sec   106 MBytes   893 Mbits/sec
    [  5]   3.00-4.00   sec   112 MBytes   936 Mbits/sec
    [  5]   4.00-5.00   sec   109 MBytes   916 Mbits/sec
    [  5]   5.00-6.00   sec   112 MBytes   937 Mbits/sec
    [  5]   6.00-7.00   sec  80.3 MBytes   674 Mbits/sec
    [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec
    
    root@am64xx-evm:~# ethtool -S eth0 |grep err
         p0_rx_crc_errors: 0
         p0_ale_overrun_drop: 0
         p0_ale_len_err_drop: 0
         p0_tx_mem_protect_err: 0
         rx_crc_errors: 5
         rx_align_code_errors: 13
         ale_overrun_drop: 0
         tx_deferred_frames: 1
         rx_ipg_error: 0
         tx_carrier_sense_errors: 0
         ale_len_err_drop: 0
         iet_rx_assembly_err: 0
         iet_rx_smd_err: 261
         tx_mem_protect_err: 0
    
    
    TEST 2
    ======
    
    Change the hardware setup to eliminate the TL-SG3428X switch. All cables are
    the same as TEST 1.
    
    The SK-AM64B no longer reports any errors.
    
    Hardware timestamping no longer locks up the ethernet driver.
    
    Hardware Setup
    --------------
    
                    CAT8                     CAT6
    SK-AM64B[eth0] <-1G-> [1]QSW-2104-2S[6] <-1G-> Laptop
    
    Test Logs
    ---------
    
    root@am64xx-evm:~# hwstamp_ctl -i eth0
    current settings:
    tx_type 0
    rx_filter 0
    
    root@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.128, port 54584
    [  5] local 10.117.68.143 port 5201 connected to 10.117.68.128 port 54588
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec   112 MBytes   940 Mbits/sec
    [  5]   1.00-2.00   sec   112 MBytes   942 Mbits/sec
    [  5]   2.00-3.00   sec   112 MBytes   941 Mbits/sec
    [  5]   3.00-4.00   sec   112 MBytes   941 Mbits/sec
    [  5]   4.00-5.00   sec   112 MBytes   940 Mbits/sec
    [  5]   5.00-6.00   sec   112 MBytes   942 Mbits/sec
    [  5]   6.00-7.00   sec   112 MBytes   942 Mbits/sec
    [  5]   7.00-8.00   sec   112 MBytes   942 Mbits/sec
    [  5]   8.00-9.00   sec   112 MBytes   942 Mbits/sec
    [  5]   9.00-10.00  sec   112 MBytes   941 Mbits/sec
    
    root@am64xx-evm:~# ethtool -S eth0 |grep err
         p0_rx_crc_errors: 0
         p0_ale_overrun_drop: 0
         p0_ale_len_err_drop: 0
         p0_tx_mem_protect_err: 0
         rx_crc_errors: 0
         rx_align_code_errors: 0
         ale_overrun_drop: 0
         tx_deferred_frames: 0
         rx_ipg_error: 0
         tx_carrier_sense_errors: 0
         ale_len_err_drop: 0
         iet_rx_assembly_err: 0
         iet_rx_smd_err: 0
         tx_mem_protect_err: 0
    
    root@am64xx-evm:~# hwstamp_ctl -i eth0 -r 1
    current settings:
    tx_type 0
    rx_filter 0
    new settings:
    tx_type 0
    rx_filter 1
    
    root@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.128, port 46868
    [  5] local 10.117.68.143 port 5201 connected to 10.117.68.128 port 46878
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec   112 MBytes   941 Mbits/sec
    [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec
    [  5]   2.00-3.00   sec   112 MBytes   941 Mbits/sec
    [  5]   3.00-4.00   sec   111 MBytes   934 Mbits/sec
    [  5]   4.00-5.00   sec   112 MBytes   941 Mbits/sec
    [  5]   5.00-6.00   sec   111 MBytes   930 Mbits/sec
    [  5]   6.00-7.00   sec   112 MBytes   941 Mbits/sec
    [  5]   7.00-8.00   sec   111 MBytes   934 Mbits/sec
    [  5]   8.00-9.00   sec   112 MBytes   941 Mbits/sec
    [  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec
    
    root@am64xx-evm:~# ethtool -S eth0 |grep err
         p0_rx_crc_errors: 0
         p0_ale_overrun_drop: 0
         p0_ale_len_err_drop: 0
         p0_tx_mem_protect_err: 0
         rx_crc_errors: 0
         rx_align_code_errors: 0
         ale_overrun_drop: 0
         tx_deferred_frames: 0
         rx_ipg_error: 0
         tx_carrier_sense_errors: 0
         ale_len_err_drop: 0
         iet_rx_assembly_err: 0
         iet_rx_smd_err: 0
         tx_mem_protect_err: 0
    
    TEST 3
    ======
    
    Change the test setup to isolate the TL-SG3428X switch.
    
    Test results are identical to TEST 1.
    
    Hardware Setup
    --------------
    
                    CAT8                      CAT6
    SK-AM64B[eth0] <-1G-> [15]TL-SG3428X[17] <-1G-> Laptop
    
    Test Logs
    ---------
    
    root@am64xx-evm:~# hwstamp_ctl -i eth0
    current settings:
    tx_type 0
    rx_filter 0
    
    root@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.128, port 60090
    [  5] local 10.117.68.143 port 5201 connected to 10.117.68.128 port 60106
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec   112 MBytes   936 Mbits/sec
    [  5]   1.00-2.00   sec   111 MBytes   932 Mbits/sec
    [  5]   2.00-3.00   sec   112 MBytes   937 Mbits/sec
    [  5]   3.00-4.00   sec   112 MBytes   940 Mbits/sec
    [  5]   4.00-5.00   sec   111 MBytes   932 Mbits/sec
    [  5]   5.00-6.00   sec   112 MBytes   937 Mbits/sec
    [  5]   6.00-7.00   sec   111 MBytes   934 Mbits/sec
    [  5]   7.00-8.00   sec   112 MBytes   938 Mbits/sec
    [  5]   8.00-9.00   sec   108 MBytes   906 Mbits/sec
    [  5]   9.00-10.00  sec   112 MBytes   938 Mbits/sec
    
    root@am64xx-evm:~# ethtool -S eth0 |grep err
         p0_rx_crc_errors: 0
         p0_ale_overrun_drop: 0
         p0_ale_len_err_drop: 0
         p0_tx_mem_protect_err: 0
         rx_crc_errors: 1
         rx_align_code_errors: 6
         ale_overrun_drop: 0
         tx_deferred_frames: 0
         rx_ipg_error: 0
         tx_carrier_sense_errors: 0
         ale_len_err_drop: 0
         iet_rx_assembly_err: 0
         iet_rx_smd_err: 148
         tx_mem_protect_err: 0
    
    root@am64xx-evm:~# hwstamp_ctl -i eth0 -r 1
    current settings:
    tx_type 0
    rx_filter 0
    new settings:
    tx_type 0
    rx_filter 1
    
    root@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.128, port 48780
    [  5] local 10.117.68.143 port 5201 connected to 10.117.68.128 port 48794
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec   110 MBytes   920 Mbits/sec
    [  5]   1.00-2.00   sec   111 MBytes   935 Mbits/sec
    [  5]   2.00-3.00   sec   109 MBytes   912 Mbits/sec
    [  5]   3.00-4.00   sec   112 MBytes   940 Mbits/sec
    [  5]   4.00-5.00   sec   110 MBytes   919 Mbits/sec
    [  5]   5.00-6.00   sec   112 MBytes   939 Mbits/sec
    [  5]   6.00-7.00   sec  11.5 MBytes  96.8 Mbits/sec
    [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec
    [  5]  10.00-11.00  sec  0.00 Bytes  0.00 bits/sec
    
    
    TEST 4
    ======
    
    Isolate TL-SG3428X port 15 by swapping laptop & SK-AM64B ports.
    
    The SK-AM64B is reporting rx errors when connected to TL-SG3428X port 15 with a
    brand new CAT8 cable.
    
    Again the ethernet driver locks up when timestamping is enabled.
    
    
    Hardware Setup
    --------------
    
                    CAT8                      CAT6
    SK-AM64B[eth0] <-1G-> [17]TL-SG3428X[15] <-1G-> Laptop
    
    Test Logs
    ---------
    
    root@am64xx-evm:~# hwstamp_ctl -i eth0
    current settings:
    tx_type 0
    rx_filter 0
    
    root@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.128, port 36060
    [  5] local 10.117.68.143 port 5201 connected to 10.117.68.128 port 36070
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec   109 MBytes   912 Mbits/sec
    [  5]   1.00-2.00   sec   109 MBytes   916 Mbits/sec
    [  5]   2.00-3.00   sec   109 MBytes   914 Mbits/sec
    [  5]   3.00-4.00   sec   109 MBytes   913 Mbits/sec
    [  5]   4.00-5.00   sec   107 MBytes   895 Mbits/sec
    [  5]   5.00-6.00   sec   110 MBytes   927 Mbits/sec
    [  5]   6.00-7.00   sec   106 MBytes   889 Mbits/sec
    [  5]   7.00-8.00   sec   107 MBytes   896 Mbits/sec
    [  5]   8.00-9.00   sec   108 MBytes   908 Mbits/sec
    [  5]   9.00-10.00  sec   108 MBytes   910 Mbits/sec
    [  5]  10.00-10.00  sec   128 KBytes   647 Mbits/sec
    
    root@am64xx-evm:~# ethtool -S eth0 |grep err
         p0_rx_crc_errors: 0
         p0_ale_overrun_drop: 0
         p0_ale_len_err_drop: 0
         p0_tx_mem_protect_err: 0
         rx_crc_errors: 7
         rx_align_code_errors: 21
         ale_overrun_drop: 0
         tx_deferred_frames: 0
         rx_ipg_error: 0
         tx_carrier_sense_errors: 0
         ale_len_err_drop: 0
         iet_rx_assembly_err: 0
         iet_rx_smd_err: 386
         tx_mem_protect_err: 0
    
    root@am64xx-evm:~# hwstamp_ctl -i eth0 -r 1
    current settings:
    tx_type 0
    rx_filter 0
    new settings:
    tx_type 0
    rx_filter 1
    
    root@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.128, port 58806
    [  5] local 10.117.68.143 port 5201 connected to 10.117.68.128 port 58814
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec  95.4 MBytes   800 Mbits/sec
    [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec
    
    root@am64xx-evm:~# ethtool -S eth0 |grep err
         p0_rx_crc_errors: 0
         p0_ale_overrun_drop: 0
         p0_ale_len_err_drop: 0
         p0_tx_mem_protect_err: 0
         rx_crc_errors: 8
         rx_align_code_errors: 24
         ale_overrun_drop: 0
         tx_deferred_frames: 1
         rx_ipg_error: 0
         tx_carrier_sense_errors: 0
         ale_len_err_drop: 0
         iet_rx_assembly_err: 0
         iet_rx_smd_err: 421
         tx_mem_protect_err: 0
    
    
    TEST 5
    ======
    
    Reverse the direction of the test so that the SK-AM64B sends data to the laptop; check for rx
    errors on the laptop.
    
    Laptop does not report any rx errors.
    
    Hardware Setup
    --------------
    
                    CAT8                      CAT6
    SK-AM64B[eth0] <-1G-> [17]TL-SG3428X[15] <-1G-> Laptop
    
    Test Logs
    ---------
    
    patrick@laptop ~ % ethtool -S enp2s0f0
    NIC statistics:
         tx_packets: 719627360
         rx_packets: 128530885
         tx_errors: 0
         rx_errors: 0
         rx_missed: 0
         align_errors: 0
         tx_single_collisions: 0
         tx_multi_collisions: 0
         unicast: 51510885
         broadcast: 2245764
         multicast: 74774236
         tx_aborted: 0
         tx_underrun: 0
    
    patrick@laptop ~ % iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.143, port 43064
    [  5] local 10.117.68.128 port 5201 connected to 10.117.68.143 port 43076
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec   107 MBytes   894 Mbits/sec
    [  5]   1.00-2.00   sec   108 MBytes   908 Mbits/sec
    [  5]   2.00-3.00   sec   108 MBytes   908 Mbits/sec
    [  5]   3.00-4.00   sec   108 MBytes   905 Mbits/sec
    [  5]   4.00-5.00   sec   108 MBytes   907 Mbits/sec
    [  5]   5.00-6.00   sec   110 MBytes   924 Mbits/sec
    [  5]   6.00-7.00   sec   104 MBytes   873 Mbits/sec
    [  5]   7.00-8.00   sec   108 MBytes   906 Mbits/sec
    [  5]   8.00-9.00   sec   108 MBytes   909 Mbits/sec
    [  5]   9.00-10.00  sec  95.4 MBytes   800 Mbits/sec
    [  5]  10.00-10.01  sec   640 KBytes   944 Mbits/sec
    
    patrick@laptop ~ % ethtool -S enp2s0f0
    NIC statistics:
         tx_packets: 719670489
         rx_packets: 130102493
         tx_errors: 0
         rx_errors: 0
         rx_missed: 0
         align_errors: 0
         tx_single_collisions: 0
         tx_multi_collisions: 0
         unicast: 53082331
         broadcast: 2245900
         multicast: 74774262
         tx_aborted: 0
         tx_underrun: 0
    
    
    TEST 6
    ======
    
    Swap cables between SK-AM64B and laptop.
    
    The SK-AM64B is reporting rx errors when connected to TL-SG3428X port 15 with a
    CAT6 cable.
    
    Again the ethernet driver locks up when timestamping is enabled.
    
    Hardware Setup
    --------------
    
                    CAT6                      CAT8
    SK-AM64B[eth0] <-1G-> [17]TL-SG3428X[15] <-1G-> Laptop
    
    Test Logs
    ---------
    
    root@am64xx-evm:~# hwstamp_ctl -i eth0
    current settings:
    tx_type 0
    rx_filter 0
    
    root@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.128, port 60142
    [  5] local 10.117.68.143 port 5201 connected to 10.117.68.128 port 60158
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec   112 MBytes   938 Mbits/sec
    [  5]   1.00-2.00   sec   108 MBytes   905 Mbits/sec
    [  5]   2.00-3.00   sec   111 MBytes   934 Mbits/sec
    [  5]   3.00-4.00   sec   111 MBytes   934 Mbits/sec
    [  5]   4.00-5.00   sec   112 MBytes   939 Mbits/sec
    [  5]   5.00-6.00   sec   112 MBytes   939 Mbits/sec
    [  5]   6.00-7.00   sec   111 MBytes   929 Mbits/sec
    [  5]   7.00-8.00   sec   112 MBytes   940 Mbits/sec
    [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec
    [  5]   9.00-10.00  sec   112 MBytes   940 Mbits/sec
    [  5]  10.00-10.00  sec  63.6 KBytes   744 Mbits/sec
    
    root@am64xx-evm:~# ethtool -S eth0 |grep err
         p0_rx_crc_errors: 0
         p0_ale_overrun_drop: 0
         p0_ale_len_err_drop: 0
         p0_tx_mem_protect_err: 0
         rx_crc_errors: 5
         rx_align_code_errors: 12
         ale_overrun_drop: 0
         tx_deferred_frames: 0
         rx_ipg_error: 0
         tx_carrier_sense_errors: 0
         ale_len_err_drop: 0
         iet_rx_assembly_err: 0
         iet_rx_smd_err: 157
         tx_mem_protect_err: 0
    
    root@am64xx-evm:~# hwstamp_ctl -i eth0 -r 1
    current settings:
    tx_type 0
    rx_filter 0
    new settings:
    tx_type 0
    rx_filter 1
    
    root@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.128, port 51740
    [  5] local 10.117.68.143 port 5201 connected to 10.117.68.128 port 51756
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec   112 MBytes   937 Mbits/sec
    [  5]   1.00-2.00   sec   112 MBytes   937 Mbits/sec
    [  5]   2.00-3.00   sec   112 MBytes   940 Mbits/sec
    [  5]   3.00-4.00   sec   111 MBytes   936 Mbits/sec
    [  5]   4.00-5.00   sec   111 MBytes   936 Mbits/sec
    [  5]   5.00-6.00   sec   112 MBytes   936 Mbits/sec
    [  5]   6.00-7.00   sec   111 MBytes   932 Mbits/sec
    [  5]   7.00-8.00   sec   111 MBytes   933 Mbits/sec
    [  5]   8.00-9.00   sec  26.9 MBytes   225 Mbits/sec
    [  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec
    [  5]  10.00-11.00  sec  0.00 Bytes  0.00 bits/sec
    
    
    TEST 7
    ======
    
    Reverse the direction of the test so that the SK-AM64B sends data to the laptop; check for rx
    errors on the laptop.
    
    Laptop does not report any rx errors.
    
    Hardware Setup
    --------------
    
                    CAT6                      CAT8
    SK-AM64B[eth0] <-1G-> [17]TL-SG3428X[15] <-1G-> Laptop
    
    patrick@laptop ~ % ethtool -S enp2s0f0
    NIC statistics:
         tx_packets: 721161960
         rx_packets: 130440533
         tx_errors: 0
         rx_errors: 0
         rx_missed: 0
         align_errors: 0
         tx_single_collisions: 0
         tx_multi_collisions: 0
         unicast: 53417943
         broadcast: 2247289
         multicast: 74775301
         tx_aborted: 0
         tx_underrun: 0
    
    patrick@laptop ~ % iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.143, port 55000
    [  5] local 10.117.68.128 port 5201 connected to 10.117.68.143 port 55004
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec   110 MBytes   917 Mbits/sec
    [  5]   1.00-2.00   sec   110 MBytes   918 Mbits/sec
    [  5]   2.00-3.00   sec   111 MBytes   930 Mbits/sec
    [  5]   3.00-4.00   sec   112 MBytes   935 Mbits/sec
    [  5]   4.00-5.00   sec   110 MBytes   927 Mbits/sec
    [  5]   5.00-6.00   sec   112 MBytes   935 Mbits/sec
    [  5]   6.00-7.00   sec   112 MBytes   936 Mbits/sec
    [  5]   7.00-8.00   sec   111 MBytes   932 Mbits/sec
    [  5]   8.00-9.00   sec   111 MBytes   933 Mbits/sec
    [  5]   9.00-10.00  sec   112 MBytes   935 Mbits/sec
    [  5]  10.00-10.01  sec   512 KBytes   947 Mbits/sec
    
    patrick@laptop ~ % ethtool -S enp2s0f0
    NIC statistics:
         tx_packets: 721188893
         rx_packets: 131251995
         tx_errors: 0
         rx_errors: 0
         rx_missed: 0
         align_errors: 0
         tx_single_collisions: 0
         tx_multi_collisions: 0
         unicast: 54228208
         broadcast: 2247812
         multicast: 74775975
         tx_aborted: 0
         tx_underrun: 0
    
    
    TEST 8
    ======
    
    Swap SK-AM64B ethernet port.
    
    SK-AM64B eth1 behaves identically to SK-AM64B eth0.
    
    Concerningly eth1 reports mem_protect_err now?
    
    Hardware Setup
    --------------
    
                    CAT8                      CAT6
    SK-AM64B[eth1] <-1G-> [17]TL-SG3428X[15] <-1G-> Laptop
    
    Test Logs
    ---------
    
    root@am64xx-evm:~# hwstamp_ctl -i eth1
    current settings:
    tx_type 0
    rx_filter 0
    
    root@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.128, port 37326
    [  5] local 10.117.68.102 port 5201 connected to 10.117.68.128 port 37338
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec   108 MBytes   905 Mbits/sec
    [  5]   1.00-2.00   sec   106 MBytes   889 Mbits/sec
    [  5]   2.00-3.00   sec   110 MBytes   919 Mbits/sec
    [  5]   3.00-4.00   sec   110 MBytes   921 Mbits/sec
    [  5]   4.00-5.00   sec   109 MBytes   913 Mbits/sec
    [  5]   5.00-6.00   sec   105 MBytes   884 Mbits/sec
    [  5]   6.00-7.00   sec   104 MBytes   876 Mbits/sec
    [  5]   7.00-8.00   sec   108 MBytes   909 Mbits/sec
    [  5]   8.00-9.00   sec   107 MBytes   898 Mbits/sec
    [  5]   9.00-10.00  sec   111 MBytes   933 Mbits/sec
    
    root@am64xx-evm:~# ethtool -S eth1 |grep err
         p0_rx_crc_errors: 0
         p0_ale_overrun_drop: 0
         p0_ale_len_err_drop: 0
         p0_tx_mem_protect_err: 0
         rx_crc_errors: 12
         rx_align_code_errors: 15
         ale_overrun_drop: 0
         tx_deferred_frames: 0
         rx_ipg_error: 0
         tx_carrier_sense_errors: 0
         ale_len_err_drop: 0
         iet_rx_assembly_err: 0
         iet_rx_smd_err: 364
         tx_mem_protect_err: 0
    
    root@am64xx-evm:~# hwstamp_ctl -i eth1 -r 1
    current settings:
    tx_type 0
    rx_filter 0
    new settings:
    tx_type 0
    rx_filter 1
    
    ^Croot@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.128, port 42126
    [  5] local 10.117.68.102 port 5201 connected to 10.117.68.128 port 42132
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec   110 MBytes   919 Mbits/sec
    [  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec
    [  5]   2.00-3.00   sec  15.2 MBytes   127 Mbits/sec
    [  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec
    
    root@am64xx-evm:~# ethtool -S eth1 |grep err
         p0_rx_crc_errors: 0
         p0_ale_overrun_drop: 0
         p0_ale_len_err_drop: 0
         p0_tx_mem_protect_err: 20
         rx_crc_errors: 13
         rx_align_code_errors: 18
         ale_overrun_drop: 0
         tx_deferred_frames: 2
         rx_ipg_error: 0
         tx_carrier_sense_errors: 0
         ale_len_err_drop: 0
         iet_rx_assembly_err: 0
         iet_rx_smd_err: 423
         tx_mem_protect_err: 1
    
    
    TEST 9
    ======
    
    Swap to a different TL-SG3428X to eliminate the original TL-SG3428X unit and possible local EMI.
    
    Behaviour is identical to the original TL-SG3428X.
    
    Hardware Setup
    --------------
    
                    CAT8               DAC               Fibre               DAC                CAT6
    SK-AM64B[eth0] <-1G-> TL-SG3428X <-10G-> TL-SX3016F <-10G-> TL-SG3428X <-10G-> QSW-2104-2S <-1G-> Laptop
    
    Test Logs
    ---------
    
    root@am64xx-evm:~# hwstamp_ctl -i eth0
    current settings:
    tx_type 0
    rx_filter 0
    
    root@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.128, port 43142
    [  5] local 10.117.68.143 port 5201 connected to 10.117.68.128 port 43156
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec   109 MBytes   911 Mbits/sec
    [  5]   1.00-2.00   sec   107 MBytes   899 Mbits/sec
    [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec
    [  5]   3.00-4.00   sec   109 MBytes   913 Mbits/sec
    [  5]   4.00-5.00   sec   108 MBytes   902 Mbits/sec
    [  5]   5.00-6.00   sec   108 MBytes   905 Mbits/sec
    [  5]   6.00-7.00   sec   108 MBytes   909 Mbits/sec
    [  5]   7.00-8.00   sec   105 MBytes   879 Mbits/sec
    [  5]   8.00-9.00   sec   110 MBytes   924 Mbits/sec
    [  5]   9.00-10.00  sec   109 MBytes   918 Mbits/sec
    [  5]  10.00-10.00  sec   177 KBytes   877 Mbits/sec
    
    root@am64xx-evm:~# ethtool -S eth0 |grep err
         p0_rx_crc_errors: 0
         p0_ale_overrun_drop: 0
         p0_ale_len_err_drop: 0
         p0_tx_mem_protect_err: 0
         rx_crc_errors: 5
         rx_align_code_errors: 15
         ale_overrun_drop: 0
         tx_deferred_frames: 0
         rx_ipg_error: 0
         tx_carrier_sense_errors: 0
         ale_len_err_drop: 0
         iet_rx_assembly_err: 0
         iet_rx_smd_err: 270
         tx_mem_protect_err: 0
    
    root@am64xx-evm:~# hwstamp_ctl -i eth0 -r 1
    current settings:
    tx_type 0
    rx_filter 0
    new settings:
    tx_type 0
    rx_filter 1
    
    root@am64xx-evm:~# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 10.117.68.128, port 44854
    [  5] local 10.117.68.143 port 5201 connected to 10.117.68.128 port 44858
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec  85.7 MBytes   719 Mbits/sec
    [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec
    [  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec
    [  5]  10.00-11.00  sec  0.00 Bytes  0.00 bits/sec
    
    root@am64xx-evm:~# ethtool -S eth0 |grep err
         p0_rx_crc_errors: 0
         p0_ale_overrun_drop: 0
         p0_ale_len_err_drop: 0
         p0_tx_mem_protect_err: 0
         rx_crc_errors: 6
         rx_align_code_errors: 16
         ale_overrun_drop: 0
         tx_deferred_frames: 2
         rx_ipg_error: 0
         tx_carrier_sense_errors: 0
         ale_len_err_drop: 0
         iet_rx_assembly_err: 0
         iet_rx_smd_err: 295
         tx_mem_protect_err: 0
    

    Patrick

  • Thanks. This makes it clear that lots of corrupted frames are occurring (iet_rx_smd_err: 386), which also seems to be why I'm not reproducing this. I confirmed with HW design that if the statistics rx_bottom_fifo_drop (ethtool -S prints this as well) or iet_rx_smd_err are non-zero, this lockup is likely.
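
    As a quick check (a sketch; the counter names are the ones from the ethtool output earlier in this thread), something like this can be run periodically to watch for the condition:

    # non-zero values for either counter indicate the corrupted-frame pattern
    ethtool -S eth0 | grep -E 'rx_bottom_fifo_drop|iet_rx_smd_err'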

    As a short-term path forward: I'm assuming you are doing SW development or evaluation, as you are not in an embedded environment but an enterprise network? I'd suggest continuing development in the LAN where you don't see errors, at least for the tests where you care about time accuracy. We test PTP with a small network (most of the time 3-4) of various AM6x devices, with some of them configured as switches. I don't think your switches should be causing errors (maybe the cables?); I checked ethtool -S on my Linux desktop on our enterprise network and did not see any errors. So I'm still surprised to see lots of errors in your case.

    The switches you are using seem like fairly typical enterprise class; they don't have any IEEE1588 capability, so using HW timestamps is somewhat wasted for accuracy anyway, as the hop through the switch will create multiple microseconds of jitter. For the deployed system and PTP with HW timestamps, I'm assuming you plan to use switches that support IEEE1588?

    I can't see a schedule for the fix or patch to go back to selective timestamping. I'm assuming we can get it into the 9.2 release, so maybe in a few months in CI/CD.

      Pekka

  • Hi Pekka,

    I don't think your switches should be causing errors (maybe the cables?)

    I tried with multiple brand-new cables and two different TL-SG3428X switches. When the SK-AM64B is connected to a TL-SG3428X (two different switches) we see these errors.

    I suspect there is some compatibility issue between the AM64 and the switch as no other devices connected to the switches report errors. We also see these errors with the Phytec board and our own board.

    I can't see a schedule for the fix or patch to go back to selective timestamping. I'm assuming we can get it into the 9.2 release, so maybe in a few months in CI/CD.

    Thanks for the update,

    Patrick

  • I don't think your switches should be causing errors (maybe the cables?)

    I tried with multiple brand-new cables and two different TL-SG3428X switches. When the SK-AM64B is connected to a TL-SG3428X (two different switches) we see these errors.

    I suspect there is some compatibility issue between the AM64 and the switch as no other devices connected to the switches report errors. We also see these errors with the Phytec board and our own board.

    Can you use a profishark https://www.profitap.com/profishark-1g/ or another tester to look at the frames coming out of the TL-SG3428X? Or, if you do not have a wire tap, try other devices and check whether there are any Ethernet-level errors in the frames they receive from the TL-SG3428X? Setting aside this lockup, which is only present with this specific corrupted frame, I'd think narrowing down what corrupts the frames is something you need to understand, or it will likely have further side effects. We have, for example, run the SK-AM64B against https://www.keysight.com/us/en/products/network-test/network-test-hardware/novus-one-plus-l2-7-fixed-chassis.html without observing errors.

    The large throughput drop with TCP is likely a side effect of corrupted frames.

  • Hi Pekka,

    Can you use a profishark https://www.profitap.com/profishark-1g/ or another tester to look at the frames coming out of TL-SG3428X?

    Unfortunately we don't have an ethernet analyser available. I've looked at the data using port mirroring to do a capture, but that didn't show any issues.

    I'm not sure that that type of tester would show much anyway -- using the same port connected to another device shows no errors, and as far as I know, those types of testers usually have two PHYs back-to-back, which would likely mask the issue we're seeing.

    I suspect that there's a configuration or electrical issue between the switch and AM64 causing the corruption.

    Patrick

  • I've looked at the data using port mirroring to do a capture, but that didn't show any issues.

    Port mirroring will not mirror corrupted frames, but I think tcpdump -n -e -i eth0 and a raw capture might get you something. Record the traffic the AM64x sends and what the switch TL-SG3428X sends, and open it in Wireshark.
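
    A minimal capture sketch (the output filename is just an example): run it on the AM64x while reproducing the errors, then open the .pcap in Wireshark:

    # print link-layer headers to the console, no name resolution
    tcpdump -n -e -i eth0
    # or write a raw capture to a file for Wireshark
    tcpdump -n -i eth0 -w am64x-eth0.pcap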

    I'm not sure that that type of tester would show much anyway -- using the same port connected to another device shows no errors, and as far as I know, those types of testers usually have two PHYs back-to-back, which would likely mask the issue we're seeing.

    A true network analyzer like Profishark will capture all traffic, including corrupted frames, at the L1 (IEEE 802.3) level rather than going PHY to MAC to another PHY, so you'd see exactly how the frames are corrupted.

    I suspect that there's a configuration or electrical issue between the switch and AM64 causing the corruption.

    I agree, I suspect something similar, but at the xMII or PHY level, specific to the TL-SG3428X and/or AM64x. I don't think this is at the MAC level, as the preambles and similar L1 fields are getting corrupted.

  • A true network analyzer like Profishark will capture all traffic, including corrupted frames, at the L1 (IEEE 802.3) level rather than going PHY to MAC to another PHY, so you'd see exactly how the frames are corrupted.

    If you can loan me an analyser for testing I'm happy to perform captures to try to debug the hardware.

    Even if we do that it doesn't solve the problem that the AM64x hardware locks up. Do you have any news on a fix for the AM64x hardware lockup?

  • If you can loan me an analyser for testing I'm happy to perform captures to try to debug the hardware.

    How are you testing LAN-level issues in your end system? A Profishark is maybe $1-2k (I don't know what it is in your location); if you are building an embedded networking product, at a minimum I'd recommend getting one. But there are many other network traffic analyzers and testers as well.

    Also, how are you testing PTP accuracy at the nanosecond level? You won't need hardware timestamps until you are below the millisecond level.

    Do you have any news on a fix for the AM64x hardware lockup?

    The updated timestamping, using the FIFO method instead of all frames, won't be in the 9.1 SDK release which is out next week. So it will likely be in the 9.2 release, which is 1H24, and earlier in CI/CD.

       Pekka

  • How are you testing LAN-level issues in your end system? A Profishark is maybe $1-2k (I don't know what it is in your location); if you are building an embedded networking product, at a minimum I'd recommend getting one. But there are many other network traffic analyzers and testers as well.

    Sure, but it's hard to make a case to the bean counters to spend even $1 on this as:
    1. We've already established that there's a bug in TI's silicon.
    2. TI is going to provide a fix.
    3. Once TI provides a fix to stop the hardware locking up this will not be an issue for our product.

    In the final product configuration there is a separate internal control network and a customer-facing ethernet port for cloud connectivity. We're unlikely to see packet corruption internally (it's always a possibility due to electrical noise, of course), but we can't control what the customer connects to their port.

    I would personally like to get to the bottom of what's causing the packet corruption in the test configuration, but it's so far from what we're actually trying to achieve that it's not a particularly useful way to spend our time.

    Also, how are you testing PTP accuracy at the nanosecond level? You won't need hardware timestamps until you are below the millisecond level.

    The system runs a 20us control loop which needs to be synchronised between nodes.

    The updated timestamping, using the FIFO method instead of all frames, won't be in the 9.1 SDK release which is out next week. So it will likely be in the 9.2 release, which is 1H24, and earlier in CI/CD.

    Thanks for the update,

    Patrick

  • In the final product configuration there is a separate internal control network and a customer-facing ethernet port for cloud connectivity. We're unlikely to see packet corruption internally (it's always a possibility due to electrical noise, of course), but we can't control what the customer connects to their port.

    Ok, I'll close this. For the lockup this is a clear bug that needs a fix, which will likely be in the 9.2 SDK from TI, and earlier in https://software-dl.ti.com/cicd-report/linux/index.html?section=platform&platform=am64xx . I'm assuming you would not run PTP on the cloud port.

    I'm a little surprised that maybe 10% (?) corrupted Ethernet frames is not a justification to try to figure out where the corruption is coming from and to fix it or react to it. The bit-error rate Ethernet requires is usually in the 10^-9 range or a couple of orders of magnitude better; I think the exact value is physical-interface specific. At the amount of errors I see in the logs it is no longer meeting Ethernet specs, and what else is going to break is unknown.

      Pekka

  • I'm a little surprised that maybe 10% (?)

    It's nowhere near 10%. One of the logs above shows 30 errors out of 53812 packets, so for that run around 0.06%. That's still unacceptably bad though.

    is not a justification to try to figure out where the corruption is coming from, and trying to fix that or react to it.

    It's clear from our testing that the TL-SG3428X switch is unlikely to be at fault here. There are many other devices connected to the switch. None of them report any errors on any of the switch ports. Only the SK-AM64B and other AM64x based hardware has a problem.

    The one thing everything has in common is that there is a DP83867 + AM64x on the problem end.

    Patrick

     

  • Thanks, I'll give it a try!