AM62L: CAN init kernel panics

Part Number: AM62L

Hello,

We're seeing kernel panics on our custom SoM/dev board with an AM62L32 when CAN is initialized at boot.


Kernel:

https://github.com/phytec/linux-phytec-ti/blob/v6.12.49-11.02.02-phy/arch/arm64/boot/dts/ti/k3-am62l3-phyflex-libra-rdk.dts

U-Boot:

https://github.com/phytec/u-boot-phytec-ti/tree/v2025.01-11.02.02-phy/board/phytec/am62lx-phyflex-fpsc-g

TF-A is just the TI fork with our DDR4 configs added.

(MT40A1G16TB-062E 2GB DDR4 or AS4C512M16D4A-62BIN 1GB DDR4)

We're using MAIN_MCAN0 and MAIN_MCAN1 with TCAN1042 transceivers.

Since the following TI TF-A commit, we saw occasional kernel panics (roughly 1 in 100 to 500 boots) during boot while systemd was initializing CAN. The system then got stuck and had to be power cycled.

https://github.com/TexasInstruments/arm-trusted-firmware/pull/34/commits/e3500f2bb713ea8044c5943f9bbf1486ec7a16e8

[    9.711664] Internal error: synchronous external abort: 0000000096000010 [#1] PREEMPT SMP
[    9.719870] Modules linked in: ti_am335x_adc(+) kfifo_buf crct10dif_ce phy_can_transceiver tps65219_pwrbutton rtc_rv3028 tmp102 rtc_ti_k3(+) k3_j72xx_bandgap leds_pca9532 dthev2 md5 crypto_engine m_can_platform m_can ti_am335x_tscadc can_dev lm75 at24 cfg80211 rfkill cryptodev(O) fuse ipv6
[    9.745576] CPU: 1 UID: 0 PID: 173 Comm: (udev-worker) Tainted: G   M       O       6.12.49-g5c3b6790-g5c3b67907416 #1
[    9.756275] Tainted: [M]=MACHINE_CHECK, [O]=OOT_MODULE
[    9.761401] Hardware name: phyFLEX-AM62L Libra Rapid Development Kit (DT)
[    9.768172] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    9.775119] pc : iomap_read_reg+0x8/0x20 [m_can_platform]
[    9.780526] lr : m_can_get_berr_counter+0x3c/0xe4 [m_can]
[    9.785927] sp : ffff80008272b2b0
[    9.789229] x29: ffff80008272b2b0 x28: 0000000000000224 x27: ffff8000797f3058
[    9.796360] x26: 0000000000000000 x25: ffff00000549f800 x24: 0000000000000000
[    9.803487] x23: ffff00000549fa24 x22: ffff00000386dc10 x21: ffff0000089c0000
[    9.810614] x20: ffff0000089c0980 x19: ffff80008272b2ec x18: ffffffffffffffff
[    9.817741] x17: 0000000000000000 x16: 0000000000000000 x15: 2f74656e2f6e6163
[    9.824868] x14: 0000000000000000 x13: 0000000100000200 x12: 0000000100000080
[    9.831994] x11: 0000000000000040 x10: ffff000003b32138 x9 : ffff000003b32130
[    9.839121] x8 : ffff00000404bdd0 x7 : 0000000000000000 x6 : 0000000000000080
[    9.846247] x5 : ffff000003b4aef8 x4 : 0000000000000000 x3 : 0000000000000001
[    9.853374] x2 : ffff80007981210c x1 : 0000000000000040 x0 : ffff80008211d040
[    9.860501] Call trace:
[    9.862939]  iomap_read_reg+0x8/0x20 [m_can_platform]
[    9.867989]  can_fill_info+0x108/0x52c [can_dev]
[    9.872620]  rtnl_fill_ifinfo.isra.0+0xac8/0x121c
[    9.877330]  rtmsg_ifinfo_build_skb+0xc4/0x140
[    9.881767]  rtnetlink_event+0xb0/0xd8
[    9.885509]  raw_notifier_call_chain+0x54/0x74
[    9.889946]  call_netdevice_notifiers_info+0x58/0xa4
[    9.894907]  dev_change_name+0x17c/0x348
[    9.898824]  do_setlink+0xc18/0xec8
[    9.902307]  rtnl_setlink+0x120/0x1d8
[    9.905962]  rtnetlink_rcv_msg+0x128/0x390
[    9.910051]  netlink_rcv_skb+0x60/0x130
[    9.913882]  rtnetlink_rcv+0x18/0x24
[    9.917453]  netlink_unicast+0x324/0x3a8
[    9.921367]  netlink_sendmsg+0x17c/0x3cc
[    9.925282]  __sys_sendto+0x110/0x178
[    9.928940]  __arm64_sys_sendto+0x28/0x38
[    9.932946]  invoke_syscall+0x48/0x10c
[    9.936690]  el0_svc_common.constprop.0+0xc0/0xe0
[    9.941385]  do_el0_svc+0x1c/0x28
[    9.944693]  el0_svc+0x28/0x98
[    9.947746]  el0t_64_sync_handler+0x120/0x12c
[    9.952093]  el0t_64_sync+0x190/0x194
[    9.955754] Code: 52800000 d65f03c0 f942fc00 8b21c000 (b9400000)
[    9.961835] ---[ end trace 0000000000000000 ]---

Another version looked like this:

[    9.948162] SError Interrupt on CPU1, code 0x00000000bf000000 -- SError
[    9.948193] CPU: 1 UID: 996 PID: 207 Comm: systemd-network Tainted: G           O       6.12.35-gb9b94b26-01037-gb9b94b267b88 #1
[    9.948204] Tainted: [O]=OOT_MODULE
[    9.948207] Hardware name: PHYTEC Libra AM62L RDK FPSC (DT)
[    9.948212] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    9.948219] pc : iomap_write_fifo+0x0/0x38 [m_can_platform]
[    9.948243] lr : m_can_init_ram+0x78/0xb8 [m_can]
[    9.948261] sp : ffff8000827bb480
[    9.948264] x29: ffff8000827bb490 x28: ffff0000049f0000 x27: 0000000000000003
[    9.948281] x26: ffff8000797e6ec8 x25: ffff0000056402d0 x24: 0000000000000001
[    9.948291] x23: ffff0000081f0344 x22: 0000000000040080 x21: 0000000000004400
[    9.948301] x20: ffff0000081f0980 x19: 0000000000000008 x18: ffffffffffffffff
[    9.948311] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000827bb270
[    9.948320] x14: ffff8001027bb3fd x13: 007473696c5f7974 x12: 696e696666615f65
[    9.948333] x11: 0000000000000040 x10: ffff800080ed6210 x9 : 0000000000000000
[    9.948342] x8 : ffff000003c88d80 x7 : 0000000000000000 x6 : ffff800081566290
[    9.948352] x5 : ffff000003c88c88 x4 : ffff80007981a0bc x3 : 0000000000000001
[    9.948361] x2 : ffff8000827bb484 x1 : 0000000000000008 x0 : ffff0000081f0980
[    9.948373] Kernel panic - not syncing: Asynchronous SError Interrupt
[    9.948378] CPU: 1 UID: 996 PID: 207 Comm: systemd-network Tainted: G           O       6.12.35-gb9b94b26-01037-gb9b94b267b88 #1
[    9.948387] Tainted: [O]=OOT_MODULE
[    9.948390] Hardware name: PHYTEC Libra AM62L RDK FPSC (DT)
[    9.948394] Call trace:
[    9.948398]  dump_backtrace+0x90/0xe8
[    9.948417]  show_stack+0x18/0x24
[    9.948426]  dump_stack_lvl+0x34/0x8c
[    9.948438]  dump_stack+0x18/0x24
[    9.948446]  panic+0x390/0x3a4
[    9.948457]  nmi_panic+0x40/0x8c
[    9.948465]  arm64_serror_panic+0x64/0x70
[    9.948475]  do_serror+0x3c/0x70
[    9.948484]  el1h_64_error_handler+0x30/0x48
[    9.948495]  el1h_64_error+0x64/0x68
[    9.948502]  iomap_write_fifo+0x0/0x38 [m_can_platform]
[    9.948512]  m_can_start+0x24/0x580 [m_can]
[    9.948521]  m_can_open+0x6c/0x264 [m_can]
[    9.948531]  __dev_open+0x120/0x1dc
[    9.948542]  __dev_change_flags+0x194/0x20c
[    9.948551]  dev_change_flags+0x24/0x6c
[    9.948559]  do_setlink+0x27c/0xec8
[    9.948569]  rtnl_setlink+0x120/0x1d8
[    9.948577]  rtnetlink_rcv_msg+0x128/0x390
[    9.948585]  netlink_rcv_skb+0x60/0x130
[    9.948597]  rtnetlink_rcv+0x18/0x24
[    9.948605]  netlink_unicast+0x318/0x380
[    9.948612]  netlink_sendmsg+0x17c/0x3c8
[    9.948620]  __sys_sendto+0x110/0x178
[    9.948630]  __arm64_sys_sendto+0x28/0x38
[    9.948642]  invoke_syscall+0x48/0x10c
[    9.948652]  el0_svc_common.constprop.0+0xc0/0xe0
[    9.948660]  do_el0_svc+0x1c/0x28
[    9.948668]  el0_svc+0x28/0x98
[    9.948676]  el0t_64_sync_handler+0x120/0x12c
[    9.948687]  el0t_64_sync+0x190/0x194
[    9.948695] SMP: stopping secondary CPUs
[    9.948711] Kernel Offset: disabled
[    9.948714] CPU features: 0x00,00000080,00200000,4200420b
[    9.948720] Memory Limit: none
[   10.229502] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---

Other than the occasional error at boot, we didn't have any issues, and we don't have any functional problems with CAN (FD) either. The system also passed all our tests from -40°C to +85°C ambient.


However, with the latest commit in the TI TF-A fork, we see those panics on every boot:

https://github.com/TexasInstruments/arm-trusted-firmware/commit/58bfb476c908d4c220d8f0bae88536f457452f06

Now the system still boots to the Linux prompt, but everything network-related is broken.


The problem in these two E2E threads looks similar to ours, although the affected SoC there is the AM62x, which has DM on a separate core.

https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1309832/am625-crash-internal-error-synchronous-external-abort-in-mcan-driver-at-low-temperatures-20-degrees-celsius

https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1354528/am625-boot-stall-and-crashes?pifragment-323307=2#pifragment-323307=1

However, we don't see a correlation with temperature. I'll check whether adding delays to the clock handling also affects our issue.


Best regards

Dominik

  • Hi Dominik,

    The TF-A patches you linked are for low power mode support, so I wouldn't expect them to impact MCAN.

    Is the MCAN driver built-in to the kernel or is it a module?

    If you remove MCAN from the device tree, do you still see the errors? If so, it might be related to interrupt routing, maybe not MCAN specific.

    Thanks,

    Anshu

  • Hi Anshu,

    I wouldn't expect that either.

    I think it's some odd race condition in power or clock management that affects CAN.

    Strangely, when BL31 is built outside of Yocto with a different compiler, the bug doesn't happen.

    MCAN is built as a loadable module. However, configuring everything involved as built-in only moves the kernel panic a few seconds earlier in the boot.

    With CAN disabled in the dts, the error is gone.

    When adding a delay to the m_can runtime PM resume, exactly as was done in the other thread [1] involving the AM62x, the error is gone on our AM62L setup too.

    +++ b/drivers/net/can/m_can/m_can_platform.c
    @@ -8,6 +8,7 @@
     #include <linux/hrtimer.h>
     #include <linux/phy/phy.h>
     #include <linux/platform_device.h>
    +#include <linux/delay.h>
     
     #include "m_can.h"
     
    @@ -209,6 +210,9 @@ static int __maybe_unused m_can_runtime_resume(struct device *dev)
            if (err)
                    clk_disable_unprepare(mcan_class->hclk);
     
    +       printk("delay 50ms....");
    +       mdelay(50);
    +       printk("end delay....");
            return err;
     }

    [   10.863746] delay 50ms....
    [   10.969583] end delay....
    [   10.977588] m_can_platform 20701000.can: m_can device registered (irq=223, version=32)
    [  OK  ] Finished OpenSSH Key Generation.
    [   11.003345] delay 50ms....
    [   11.096743] end delay....
    [   11.105388] m_can_platform 20711000.can: m_can device registered (irq=224, version=32)
    [   11.412264] am65-cpsw-nuss 8000000.ethernet end0: PHY [8000f00.mdio:00] driver [TI DP83867] (irq=41)
    [   11.430688] am65-cpsw-nuss 8000000.ethernet end0: configuring for phy/rgmii-rxid link mode
    [  OK  ] Listening on Load/Save RF Kill Switch Status /dev/rfkill Watch.
    [   11.527976] am65-cpsw-nuss 8000000.ethernet end1: PHY [8000f00.mdio:01] driver [TI DP83867] (irq=POLL)
    [   11.545146] am65-cpsw-nuss 8000000.ethernet end1: configuring for phy/rgmii-rxid link mode
    [  OK  ] Started User Manager for UID 1000.
    [   11.635960] delay 50ms....
    [   11.737645] end delay....
    [  OK  ] Started Session c1 of User weston.
    [   11.742295] delay 50ms....
    [   11.802674] end delay....
    [   11.805920] delay 50ms....
    [   11.863017] end delay....
    [   11.867096] delay 50ms....
    [   11.920428] end delay....
    [   11.920777] delay 50ms....
    [   11.997853] end delay....
    [   12.001194] m_can_platform 20701000.can main_mcan0: renamed from can0
    [   12.015206] delay 50ms....
    [   12.065265] end delay....
    [   12.071850] delay 50ms....
    [   12.161617] end delay....
    [   12.164722] delay 50ms....
    [   12.222836] end delay....
    [   12.226648] m_can_platform 20711000.can main_mcan1: renamed from can1
    [   12.237783] delay 50ms....
    [   12.296198] end delay....
    [  OK  ] Started Weston, a Wayland compositor, as a system service.
    [   12.303554] delay 50ms....
    [   12.402396] rtc-ti-k3 2b1f0000.rtc: registered as rtc1
    [  OK  ] Started PHYTEC's Qt6 reference demo implementation.
    [  OK  ] Reached target Graphical Interface.
    [   12.444527] end delay....
    [   12.498888] delay 50ms....
    [   12.683504] end delay....
    [   12.692096] delay 50ms....
    [   12.761608] end delay....
    [   12.765175] delay 50ms....
    [   12.823269] end delay....
    [   12.828588] delay 50ms....
    [   12.925726] end delay....
    [   12.931201] delay 50ms....
    [   13.055909] end delay....
    [   13.074119] delay 50ms....
    [   13.172501] end delay....
    [   13.197690] delay 50ms....
    [   13.259366] end delay....
    [   13.270486] delay 50ms....
    

    Best regards

    Dominik

    [1] e2e.ti.com/.../am625-boot-stall-and-crashes

  • Hi Dominik,

    Glad that you found a workaround.

    You wrote that the bug doesn't happen when BL31 is built outside of Yocto with a different compiler.

    Can you explain this further?

    Thanks,

    Anshu

  • Hi Anshu,

    When trying to find out which commit introduced the kernel panics, I built tispl.bin and all its components, including bl31, manually outside of Yocto.

    In the end, the bl31 binary built with the toolchain from my local Ubuntu 24.04 does not trigger the kernel panics.

    The bl31 binary built with the toolchain used in our Yocto scarthgap setup does trigger them.

    So, except for a few compiler flags, they're the same.

    Best regards

    Dominik

  • Thanks for the update Dominik.