WL1835MOD: wl1835 firmware gets stuck and does not recover when bringing up large mesh ( > 10 peers within range)

Part Number: WL1835MOD
Other Parts Discussed in Thread: WL1835

Our devices set up a wireless mesh using a wl1835 connected over SDIO to an i.MX6 processor.

Our platform is based on Linux 4.9. The drivers/net/wireless/ti directory appears to be in sync with the version in the Linux stable 4.9 tag.

Firmware versions used for all logs attached are:
[ 171.787880] wlcore: PHY firmware version: Rev 8.2.0.0.240
[ 171.851427] wlcore: firmware booted (Rev 8.9.0.0.76)

However, the issue also seems to be reproducible with the latest firmware version taken from:
git.ti.com/.../

A text dump of the firmware configuration file (wlconf -i /lib/firmware/ti-connectivity/wl18xx-conf.bin -g > ./wlconf.txt) is attached as wlconf.txt. It was configured using the configure_devices.sh script provided by TI. The device has two antennas mounted.

We set up the mesh using wpa_supplicant, the TI R8.8 version built from the upstream_29_rebase branch of git.ti.com/.../

The wpa_supplicant configuration file is attached as wpa_supplicant_mesh.conf. The network is a mesh without SAE enabled, on channel 6.

We configure one node as a mesh gateway with a DHCP server running and NAT rules to an Ethernet interface. The rest of the nodes run a DHCP client and have no Ethernet interface. DHCP is handled by a systemd-networkd config.

The mesh gateway is set to enable root mode and gate announcements:
> iw dev st_wlan0 set mesh_param mesh_hwmp_rootmode 4
> iw dev st_wlan0 set mesh_param mesh_gate_announcements 1

We also disable power save on all devices and enable RTS for all packets (RTS threshold 0):
> iw phy `ls /sys/class/ieee80211/` set rts 0
> iw dev st_wlan0 set power_save off

This works fine until we bring up a lot of nodes that are relatively close together, i.e. close enough for one or more nodes to reach the 10-peer-link limit.

In that situation the wl1835 firmware sometimes seems to get stuck, where "stuck" means that cat /sys/kernel/debug/ieee80211/phy/wlcore/tx_queue_len keeps going up, reaching several hundred queued messages within a minute or so.

The device does not seem to have entered ELP mode, i.e. /sys/kernel/debug/ieee80211/phy/wlcore/sleep_auth always indicates 0x0.

The situation can be recovered from manually by triggering the wl1835 recovery using:
> cat 0x1 > /sys/kernel/debug/ieee80211/phy/wlcore/sleep_auth

The recovery does not kick in automatically though.
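
As a stopgap until the root cause is fixed, the manual check and recovery described here could be automated. The following is only a sketch under stated assumptions: the helper names `queue_looks_stuck` and `sample_counter`, the threshold of 200 and the sampling schedule are all hypothetical, and it assumes tx_queue_len prints a single integer. It uses the start_recovery debugfs file given in the typo fix further down the thread.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (returns 0) when every sample is larger than
# the previous one AND the last sample exceeds THRESHOLD.
# usage: queue_looks_stuck THRESHOLD SAMPLE SAMPLE...
queue_looks_stuck() {
    threshold=$1
    shift
    [ "$#" -ge 2 ] || return 1      # need at least two samples to see a trend
    prev=$1
    shift
    for cur in "$@"; do
        [ "$cur" -gt "$prev" ] || return 1
        prev=$cur
    done
    [ "$prev" -gt "$threshold" ]
}

# Hypothetical helper: print COUNT readings of FILE, INTERVAL seconds apart.
# usage: sample_counter FILE COUNT INTERVAL
sample_counter() {
    file=$1
    count=$2
    interval=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        cat "$file"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example use (values are assumptions, tune for your setup):
#   DBG=/sys/kernel/debug/ieee80211/phy/wlcore
#   if queue_looks_stuck 200 $(sample_counter "$DBG/tx_queue_len" 5 10); then
#       echo 1 > "$DBG/start_recovery"
#   fi
```

Requiring a strictly increasing series rather than a single high reading is a deliberate choice: a busy but healthy queue can spike briefly, whereas the stuck state described above never drains.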

The situation becomes very reproducible by just rebooting the gateway or restarting its supplicant.

In one of those reproductions I increased the dynamic debug level of the wlcore driver; the kernel log from that reproduction is attached as wlcore_part.xt. The debug_level change was applied just before bringing up the supplicant, with the following settings:
> echo -n 'module wlcore +p' > /sys/kernel/debug/dynamic_debug/control
> echo -n 'module wl18xx +p' > /sys/kernel/debug/dynamic_debug/control
> echo -n 'module mac80211 +p' > /sys/kernel/debug/dynamic_debug/control
> echo -n 'module cfg80211 +p' > /sys/kernel/debug/dynamic_debug/control
> echo 0x1840 > /sys/module/wlcore/parameters/debug_level
> echo 8 > /proc/sys/kernel/printk

In one of those reproductions I enabled a monitor interface above the radio and did a packet capture using tcpdump (started just before bringing up the supplicant); the capture is attached as wireless.cap.
> iw phy phy interface add mon0 type monitor
> ifconfig mon0 up
> tcpdump -i mon0 -n -w /data/wireless.cap

  • Typo fix:

    The situation can be recovered from manually by triggering the wl1835 recovery using:
    > echo 0x1 > /sys/kernel/debug/ieee80211/phy/wlcore/start_recovery

    To repeat: it does not recover by itself. Established peer links are no longer reachable at the TCP/IP level, tx_queue_len goes up and never down, and the peer links and mesh paths that were already established eventually time out, disappear and are never re-established.

    If the recovery is triggered manually before that happens, the mesh recovers, and once all devices are in the mesh and the mesh paths are relatively stable, the issue is not immediately seen again.

    Can you provide us with a proper solution?

  • I add the following observation:

    The firmware stats still seem to change once we are in this 'stuck' state: TX retries on one specific MCS index (bin 4 in the dump below) go up quickly while the transmit path seems stuck. Maybe the firmware is retrying a specific packet indefinitely?

    cat /sys/kernel/debug/ieee80211/phy2/wlcore/wl18xx/fw_stats/tx_tx_retry_per_rate
    [0] = 0
    [1] = 0
    [2] = 0
    [3] = 1
    [4] = 9114
    [5] = 0
    [6] = 0
    [7] = 1
    [8] = 8
    [9] = 14
    [10] = 20
    [11] = 28
    [12] = 0
    [13] = 0
    [14] = 0
    [15] = 1
    [16] = 14
    [17] = 37
    [18] = 23
    [19] = 48
    [20] = 68
    [21] = 0
    [22] = 5
    [23] = 22
    [24] = 65
    [25] = 150
    [26] = 152

    TX appears stuck to us because the queues between the mac80211 layer and the wl1835 physical layer show a continuously increasing tx_queue_len, i.e. data gets queued but is never actually transmitted.
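
When comparing such dumps across nodes or over time, it may help to pull out the busiest bin automatically. The snippet below is a minimal sketch: `busiest_retry_bin` is a hypothetical helper name, and it assumes the debugfs file prints one `[index] = value` pair per line exactly as in the dump above; any other lines are ignored.

```shell
#!/bin/sh
# Hypothetical helper: read an "[index] = value" dump on stdin and print
# "<index> <value>" for the entry with the largest value.
busiest_retry_bin() {
    awk '
        /^[ \t]*\[[0-9]+\][ \t]*=[ \t]*[0-9]+[ \t]*$/ {
            line = $0
            gsub(/[^0-9]+/, " ", line)   # "[4] = 9114" becomes " 4 9114"
            split(line, f, " ")          # f[1] = index, f[2] = value
            if (!seen || f[2] + 0 > max) {
                max = f[2] + 0
                best = f[1]
                seen = 1
            }
        }
        END { if (seen) print best, max }
    '
}

# Example use, with the path from the dump above:
#   busiest_retry_bin < /sys/kernel/debug/ieee80211/phy2/wlcore/wl18xx/fw_stats/tx_tx_retry_per_rate
```

Run against the dump above, this would report bin 4 with 9114 retries. Calling it twice a few seconds apart shows whether that single bin is still climbing while the others stay flat.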

  • Hi,

    Can you please confirm whether these patches are applied to the kernel: https://git.ti.com/cgit/wilink8-wlan/build-utilites/tree/patches/kernel_patches/4.19.38?h=r8.8

    Thanks

    Saurabh

  • Upgrading to a 4.19 kernel and including those patches does indeed solve the problem we're seeing.

    Without the patches the issue is also reproducible on a 4.19 kernel.

    I assume there are no backports to a 4.9 kernel available?

    Some additional questions:

    - After joining a mesh, wpa_supplicant continues to scan periodically if ap_scan is set to 1. Normally, at least for managed-mode connections, ap_scan only triggers scans when disconnected, and background scanning while connected is controlled by the bgscan configuration. However, for this supplicant or for mesh mode, disabling bgscan does not seem to stop the periodic scans while connected. These scans show up in, for example, throughput tests with iperf, where the throughput periodically drops to 0 bits/s for a second or more while the radio is being used for a scan. What is the purpose of these scans, and what is the right configuration to control their period and/or disable them?

    - Is there a Linux version of the wireless connectivity tools (i.e. the RTTT tool) other than the calibrator tool? The documentation suggests there is, but the download page only has a Windows version.