WL1835MOD: kernel segmentation fault due to 4.19 kernel patches provided with wilink R8.8

Part Number: WL1835MOD

About a year ago we were told to upgrade to a 4.19 kernel and apply the R8.8 wilink patches on top of the 4.19 kernel stack.

This includes mac80211 stack patches, one of which introduces a bug: a kernel NULL pointer dereference, reported to user space as a segmentation fault (see the example below), when removal of a peer link/station is requested for a MAC address that is not in the peer/station list.

It is easily reproducible, for example:
iw phy phy1 interface add wlan0 type mp
ifconfig wlan0 up
iw dev wlan0 mesh join testmesh
iw dev wlan0 station del <some-random-mac-that-is-not-in-the-mesh-goes-here>
In practice, however, we hit the problem infrequently, in a mesh managed by the TI wilink R8.8 version of wpa-supplicant.

Looking at the code of patch 014, the bug is obvious: at line 31, sta is dereferenced without a NULL pointer check,
but it is in fact NULL if the station to be removed is no longer connected or was never in the station list.
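
To illustrate the kind of check that is missing, here is a minimal user-space sketch, not the actual kernel code or the actual patch. All names in it (`struct sta_table`, `sta_lookup`, `sta_destroy_addr`, `MAC_LEN`, `MAX_STA`) are hypothetical, simplified stand-ins for the mac80211 station table and its lookup/destroy helpers, and returning `-ENOENT` for an unknown station is an assumption about what a sensible fix would do.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAC_LEN 6   /* length of a MAC address in bytes */
#define MAX_STA 4   /* hypothetical table size, for illustration only */

/* Hypothetical, heavily simplified stand-in for a station entry. */
struct sta_info {
    unsigned char addr[MAC_LEN];
    int in_use;
};

struct sta_table {
    struct sta_info sta[MAX_STA];
};

/* Returns the matching entry, or NULL if the address is not in the table. */
static struct sta_info *sta_lookup(struct sta_table *t,
                                   const unsigned char *addr)
{
    for (size_t i = 0; i < MAX_STA; i++) {
        if (t->sta[i].in_use &&
            memcmp(t->sta[i].addr, addr, MAC_LEN) == 0)
            return &t->sta[i];
    }
    return NULL;
}

/* Removes a station by address; reports an error for unknown stations. */
static int sta_destroy_addr(struct sta_table *t, const unsigned char *addr)
{
    struct sta_info *sta = sta_lookup(t, addr);

    /* The check that is missing: never touch sta before testing it. */
    if (!sta)
        return -ENOENT;   /* assumed error code for "no such station" */

    sta->in_use = 0;
    return 0;
}
```

With this shape, a delete request for a MAC address that is not in the table returns an error to the caller instead of dereferencing a NULL pointer.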

Makes me wonder if you guys even remotely validate this solution?

Example kernel log dump on segmentation fault caused by this patch:

[  112.739850] Unable to handle kernel NULL pointer dereference at virtual address 00000078
[  112.748057] pgd = d28b11e7
[  112.750777] [00000078] *pgd=8e193831
[  112.754426] Internal error: Oops: 17 [#1] SMP ARM
[  112.759145] Modules linked in: wl18xx wlcore wlcore_sdio tlv320adc3100_i2c(O) tlv320adc3100(O) frontend_audio(O)
[  112.769354] CPU: 1 PID: 728 Comm: iw Tainted: G           O      4.19.112 #1
[  112.776406] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[  112.782959] PC is at sta_info_destroy_addr_bss+0x58/0xb4
[  112.788279] LR is at sta_info_get_bss+0x18/0x74
[  112.792814] pc : [<8089f048>]    lr : [<8089cd3c>]    psr: 60070013
[  112.799085] sp : d9091c08  ip : ebf83a22  fp : d9091c24
[  112.804316] r10: 00000000  r9 : 80e48f00  r8 : d8a7f900
[  112.809547] r7 : 80e05d48  r6 : d8db0220  r5 : 00000464  r4 : d88dc540
[  112.816078] r3 : 00000007  r2 : a57b0255  r1 : 00000001  r0 : 00000000
[  112.822611] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[  112.829749] Control: 10c5387d  Table: 690bc04a  DAC: 00000051
[  112.835501] Process iw (pid: 728, stack limit = 0xee5e00e8)
[  112.841079] Stack: (0xd9091c08 to 0xd9092000)
[  112.845447] 1c00:                   d8db0220 80bea7cc d8db0214 80e05d48 d9091c3c d9091c28
[  112.853634] 1c20: 808b52ec 8089effc 80e05d48 80bea7cc d9091c64 d9091c40 80872544 808b52d8
[  112.861819] 1c40: 80aa1934 d8db0220 0002000c a3423682 80bea7cc 80aa1934 d9091cf4 d9091c68
[  112.870003] 1c60: 807ae5f8 80872478 80aa0a84 d9091cfc d9091ca8 80e05d48 d808d000 d9091cfc
[  112.878188] 1c80: 00000000 d9091c84 d9091c84 a3423682 630f5dd4 d6b42908 0000032c 630f5dd5
[  112.886373] 1ca0: f24002d8 d8db0200 d8db0210 d8db0214 d808d000 80e48f00 d8da8000 d88dc000
[  112.894558] 1cc0: d9091cfc a3423682 d9091cfc d8a7f900 80e05d48 807ae2c8 d8db0200 00000028
[  112.902742] 1ce0: 00000000 80e05d48 d9091d44 d9091cf8 807ad590 807ae2d4 802625a4 00000000
[  112.910926] 1d00: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 a3423682
[  112.919111] 1d20: 00000028 80e4a77c d8a7f900 d8a7f900 80e07a4c d9091d68 d9091d5c d9091d48
[  112.927296] 1d40: 807ae2b8 807ad4dc d8233c00 d8acb400 d9091d9c d9091d60 807acd0c 807ae298
[  112.935481] 1d60: 80440a50 00000028 7fffffff a3423682 00000001 d9091f44 80e05d48 d8acb400
[  112.943666] 1d80: d8a7f900 006000c0 00000028 00000000 d9091dfc d9091da0 807ad1d4 807acb34
[  112.951850] 1da0: 00000000 d9091e18 00000051 00000000 00000008 00000000 d8d41300 00000000
[  112.960035] 1dc0: 000002d8 00000000 00000000 a3423682 7ee07b98 d9091f44 80e05d48 00000000
[  112.968220] 1de0: 00000000 d6b06000 00000000 00000000 d9091e0c d9091e00 807462f8 807ace7c
[  112.976405] 1e00: d9091f2c d9091e10 807468b0 807462e8 80a98e40 d9091edc 00000000 01eee408
[  112.984590] 1e20: 00000028 80234dd0 d6b061e0 fffff000 d9091e54 d9091e40 80265518 80745e84
[  112.992774] 1e40: d6b061e0 fffff000 d9091e74 d9091e58 80265668 802654e4 d800fc00 00000010
[  113.000959] 1e60: 00000000 00000000 d9091e94 d9091e78 80265ac0 80265540 d6b42a18 d6b061e0
[  113.009143] 1e80: d6b42a18 00000000 00000000 00000000 d9091eb4 d9091ea0 8090ad2c 8019002c
[  113.017330] 1ea0: d6b42a18 00000000 d9091ed4 d9091eb8 80261588 8090ad0c d6b42a18 00000000
[  113.025514] 1ec0: d6b061e0 00080060 d9091ef4 d9091ed8 d9091f0c d9091ee0 8018ecf8 80191b98
[  113.033699] 1ee0: 00080040 00080060 d8d43180 00000000 d6b061e0 d8006910 d9091f14 a3423682
[  113.041885] 1f00: 8026827c 80e05d48 7ee07b6c 00000000 d6b06000 80101204 d9090000 00000128
[  113.050071] 1f20: d9091f94 d9091f30 80747948 80746684 00000000 00000000 00000000 00000000
[  113.058255] 1f40: fffffff7 d9091e5c 0000000c 00000001 00000000 00000000 d9091e24 00000000
[  113.066440] 1f60: 8014c434 00000000 00000000 00000000 00000000 a3423682 01eee330 01ef35b0
[  113.074624] 1f80: 01ef3668 00000128 d9091fa4 d9091f98 80747998 807478fc 00000000 d9091fa8
[  113.082808] 1fa0: 80101000 80747990 01eee330 01ef35b0 00000003 7ee07b6c 00000000 00000000
[  113.090993] 1fc0: 01eee330 01ef35b0 01ef3668 00000128 76f1d000 01ef3668 01ef35b0 00000001
[  113.099181] 1fe0: 76f1d0c8 7ee07b10 76f031d4 76e865fc 60070010 00000003 00000000 00000000
[  113.107358] Backtrace:
[  113.109830] [<8089eff0>] (sta_info_destroy_addr_bss) from [<808b52ec>] (ieee80211_del_station+0x20/0x30)
[  113.119318]  r7:80e05d48 r6:d8db0214 r5:80bea7cc r4:d8db0220
[  113.124997] [<808b52cc>] (ieee80211_del_station) from [<80872544>] (nl80211_del_station+0xd8/0x124)
[  113.134047]  r5:80bea7cc r4:80e05d48
[  113.137641] [<8087246c>] (nl80211_del_station) from [<807ae5f8>] (genl_rcv_msg+0x330/0x428)
[  113.145995]  r4:80aa1934
[  113.148540] [<807ae2c8>] (genl_rcv_msg) from [<807ad590>] (netlink_rcv_skb+0xc0/0x118)
[  113.156466]  r10:80e05d48 r9:00000000 r8:00000028 r7:d8db0200 r6:807ae2c8 r5:80e05d48
[  113.164298]  r4:d8a7f900
[  113.166841] [<807ad4d0>] (netlink_rcv_skb) from [<807ae2b8>] (genl_rcv+0x2c/0x3c)
[  113.174332]  r8:d9091d68 r7:80e07a4c r6:d8a7f900 r5:d8a7f900 r4:80e4a77c
[  113.181044] [<807ae28c>] (genl_rcv) from [<807acd0c>] (netlink_unicast+0x1e4/0x280)
[  113.188704]  r5:d8acb400 r4:d8233c00
[  113.192293] [<807acb28>] (netlink_unicast) from [<807ad1d4>] (netlink_sendmsg+0x364/0x3a8)
[  113.200565]  r10:00000000 r9:00000028 r8:006000c0 r7:d8a7f900 r6:d8acb400 r5:80e05d48
[  113.208398]  r4:d9091f44
[  113.210945] [<807ace70>] (netlink_sendmsg) from [<807462f8>] (sock_sendmsg+0x1c/0x2c)
[  113.218782]  r10:00000000 r9:00000000 r8:d6b06000 r7:00000000 r6:00000000 r5:80e05d48
[  113.226614]  r4:d9091f44
[  113.229157] [<807462dc>] (sock_sendmsg) from [<807468b0>] (___sys_sendmsg+0x238/0x24c)
[  113.237086] [<80746678>] (___sys_sendmsg) from [<80747948>] (__sys_sendmsg+0x58/0x94)
[  113.244926]  r10:00000128 r9:d9090000 r8:80101204 r7:d6b06000 r6:00000000 r5:7ee07b6c
[  113.252757]  r4:80e05d48
[  113.255300] [<807478f0>] (__sys_sendmsg) from [<80747998>] (sys_sendmsg+0x14/0x18)
[  113.262875]  r7:00000128 r6:01ef3668 r5:01ef35b0 r4:01eee330
[  113.268547] [<80747984>] (sys_sendmsg) from [<80101000>] (ret_fast_syscall+0x0/0x54)
[  113.276293] Exception stack(0xd9091fa8 to 0xd9091ff0)
[  113.281351] 1fa0:                   01eee330 01ef35b0 00000003 7ee07b6c 00000000 00000000
[  113.289537] 1fc0: 01eee330 01ef35b0 01ef3668 00000128 76f1d000 01ef3668 01ef35b0 00000001
[  113.297720] 1fe0: 76f1d0c8 7ee07b10 76f031d4 76e865fc
[  113.302781] Code: e0800005 eb01b2a3 e1a00006 e89da8f0 (e5903078)
[  113.308983] ---[ end trace fedeffbbd4a639b8 ]---
Segmentation fault
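
The dump itself points at a NULL base pointer: r0 is 00000000, the faulting address is 00000078, and the bracketed word in the Code: line is e5903078. As a minimal sketch, assuming that the bracketed word is the faulting instruction and that it is an ARM-mode single data transfer with an immediate offset, the following decodes it. The struct and function names are hypothetical helpers written for this post, not part of any kernel or toolchain API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded fields of an ARM (A32) LDR/STR instruction with immediate offset. */
struct ldr_imm {
    bool valid;      /* true if bits 27:25 are 010 (immediate-offset form) */
    bool load;       /* L bit (20): 1 = load, 0 = store */
    bool add;        /* U bit (23): 1 = add the offset, 0 = subtract it */
    bool pre;        /* P bit (24): 1 = offset applied before the access */
    unsigned rn;     /* base register number */
    unsigned rt;     /* destination/source register number */
    unsigned imm12;  /* 12-bit immediate offset */
};

static struct ldr_imm decode_ldr_imm(uint32_t insn)
{
    struct ldr_imm d = {0};

    if (((insn >> 25) & 0x7u) != 0x2u)
        return d;                 /* not the immediate-offset form */

    d.valid = true;
    d.pre   = (insn >> 24) & 1u;
    d.add   = (insn >> 23) & 1u;
    d.load  = (insn >> 20) & 1u;
    d.rn    = (insn >> 16) & 0xfu;
    d.rt    = (insn >> 12) & 0xfu;
    d.imm12 = insn & 0xfffu;
    return d;
}

/* Address accessed by the instruction, given the value of the base register. */
static uint32_t effective_address(struct ldr_imm d, uint32_t rn_value)
{
    if (!d.pre)
        return rn_value;          /* post-indexed: access uses the base as-is */
    return d.add ? rn_value + d.imm12 : rn_value - d.imm12;
}
```

Calling `decode_ldr_imm(0xe5903078)` gives a load into r3 from base register r0 with offset 0x78, and `effective_address` with a base value of 0 gives 0x78, which matches the faulting address in the first line of the log.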

  • Thanks for sharing this. We will review the patch. It seems you are able to fix this issue by adding a NULL pointer check to the referenced patch.

  • While we do know how to ‘patch’ this specific problem ourselves, I’m afraid that is not the answer we are looking for.

    While the problem described above is easily recognisable as a major quality issue, we have accumulated evidence of harder-to-communicate issues with the quality of the wilink solution, or at least the wl1835mod-based version: plain crashes of the stock wilink wpa supplicant in certain mesh use cases; corner cases where the ‘recovery’ procedure (which is itself somewhat questionable) on the physical layer is not actually guaranteed to be transparent to upper layers; issues with radio DSP features such as MRC in specific corner cases; and so on.

    Besides those actual incidents, looking closely at the codebase of the wlcore driver and the TI wilink versions of wpa-supplicant, I’m afraid that, even without the specific issues listed, our qualified engineers can see that neither the design nor the implementation appears to have been carried out with sufficient quality control or traceability procedures in place. As a simple example based on the above: one would expect the development of an nl80211-compatible physical layer driver for mac80211, with patches to the mac80211 interfaces, to include unit testing of those nl80211 interfaces; but the issue above and several other corner cases we can reproduce show that this never actually happened. Similar concerns apply to the supplicant codebase. We have no insight into the firmware, but the need for a recovery procedure on the driver side raises similar questions.

    While we have so far managed to deal with or work around those issues for a solution on a relatively small scale, we now face the challenge of scaling up this solution (for business reasons I can’t discuss), which will clearly be impossible without additional and more effective quality control on the wilink stack.

    Now, I realize that is a lot of criticism, but the truth is that I’m looking for a constructive communication channel here, to provide people on our side with the information and support needed to bring the quality of this solution under control.

    I think this forum is not the right communication platform, and I’m therefore looking for a commitment on your side to provide us with the right communication channels and transparency.

    I can already list some input that I think we will require or want for the team on our side to be able to do the work that is needed:



    • We would need an SDIO interface specification for the wl1835mod chipset, so we can quantify and unit-test the chipset solution independent of the rest of the wilink wireless stack.


    • We would need an nl80211 interface specification for the additional interfaces added as part of the wilink stack, plus detailed design information on the interfaces that were changed.


    • We would like a detailed design and motivation for the TI-specific changes and workarounds put in place in your wilink version of the supplicant.


    • Can you provide us with insight into the test and validation plan that was used on your side, so that we can complete it and reproduce part or all of it without having to re-validate what was already validated?


    • We would need insight into a root cause analysis of the known problems; at a minimum, that should include details on the root cause of the chipset-side issue that motivates the recovery workaround in the wlcore kernel driver. We can see that the chipset can become unresponsive after certain commands related to peerlink/station-description offloading from mac80211 to the chipset, but we would like to understand the root cause of that: is this a firmware issue, and if so, why was it decided to work around it rather than solve it? Is it a hardware issue? If so, please provide details so we can evaluate whether it can be avoided on our side somehow.


    • Is there any chance that, under certain conditions, the source of your wl1835mod chipset firmware could be shared or even open-sourced? That would help us a lot in understanding the solution as a whole, and might make it a lot easier to introduce or propose the necessary quality control toggles.

    regards,
    Wim Decelle