Other Parts Discussed in Thread: AM623
We are experiencing the following random failure during boot.
[ 10.512202] Internal error: synchronous external abort: 0000000096000010 [#1] PREEMPT SMP
[ 10.522024] Modules linked in: crct10dif_ce snd_soc_simple_card snd_soc_simple_card_utils display_connector ti_k3_r5_remoteproc virtio_rpmsg_bus rpmsg_ns rtc_ti_k3 ti_k3_m4_remoteproc ti_k3_common mcrc sa2ul tidss snd_soc_davinci_mcasp snd_soc_ti_udma drm_dma_helper pruss snd_soc_ti_edma snd_soc_ti_sdma m_can_platform m_can can_dev snd_soc_nau8822 pwm_tiehrpwm spi_omap2_mcspi ina2xx lontium_lt8912b tc358768 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops lm75 optee_rng rng_core libcomposite fuse drm drm_panel_orientation_quirks ipv6
[ 10.573193] CPU: 1 PID: 180 Comm: systemd-udevd Not tainted 6.1.46-6.5.0-devel+git.3e7fd3d544db #1
[ 10.582156] Hardware name: Toradex Verdin AM62 on Verdin Development Board (DT)
[ 10.589458] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 10.596413] pc : iomap_read_reg+0xc/0x30 [m_can_platform]
[ 10.601822] lr : m_can_get_berr_counter+0x3c/0x10c [m_can]
[ 10.607313] sp : ffff8000098d35f0
[ 10.610621] x29: ffff8000098d35f0 x28: 0000000000000240 x27: ffff800000cc21b8
[ 10.617754] x26: ffff0000024b3000 x25: ffff0000024b3240 x24: 0000000000000000
[ 10.624885] x23: 0000000000000000 x22: ffff000000c06010 x21: ffff000002430000
[ 10.632016] x20: ffff000002430980 x19: ffff8000098d362c x18: 0000000000000000
[ 10.639146] x17: ffff800036e64000 x16: ffff800008008000 x15: 0000ccbd4c56cc8a
[ 10.646277] x14: 0000000000000037 x13: 0000000000000037 x12: 0000000000000000
[ 10.653408] x11: 0000000000000001 x10: 00000000000009b0 x9 : ffff8000098d31e0
[ 10.660538] x8 : ffff00003fda2180 x7 : 0000000100000300 x6 : ffff000000c06190
[ 10.667668] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 10.674799] x2 : ffff800000ce7000 x1 : 0000000000000040 x0 : ffff8000097e3040
[ 10.681931] Call trace:
[ 10.684372] iomap_read_reg+0xc/0x30 [m_can_platform]
[ 10.689423] can_fill_info+0x108/0x524 [can_dev]
[ 10.694058] rtnl_fill_ifinfo+0x844/0x11b0
[ 10.698161] rtnl_getlink+0x23c/0x424
[ 10.701821] rtnetlink_rcv_msg+0x130/0x3a0
[ 10.705914] netlink_rcv_skb+0x60/0x130
[ 10.709747] rtnetlink_rcv+0x18/0x2c
[ 10.713318] netlink_unicast+0x2e4/0x340
[ 10.717235] netlink_sendmsg+0x1b0/0x420
[ 10.721153] __sys_sendto+0x134/0x170
[ 10.724812] __arm64_sys_sendto+0x28/0x40
[ 10.728815] invoke_syscall+0x48/0x114
[ 10.732566] el0_svc_common.constprop.0+0xd4/0xfc
[ 10.737264] do_el0_svc+0x20/0x30
[ 10.740575] el0_svc+0x28/0xa0
[ 10.743630] el0t_64_sync_handler+0xbc/0x140
[ 10.747894] el0t_64_sync+0x18c/0x190
[ 10.751564] Code: bad PC value
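For reference, the value printed after "synchronous external abort" in the first log line is the exception syndrome (ESR). Below is a minimal sketch of how it can be split into fields, assuming the standard AArch64 ESR_ELx layout; the helper names (`esr_ec`, `esr_il`, `esr_dfsc`, `esr_describe`) are made up for this post and are not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * AArch64 ESR_ELx layout: EC = bits [31:26], IL = bit 25, ISS = bits [24:0].
 * For data aborts, the fault status code (DFSC) is ISS bits [5:0].
 * All helper names here are hypothetical, for illustration only.
 */
static unsigned int esr_ec(uint64_t esr)
{
    return (unsigned int)((esr >> 26) & 0x3f);
}

static unsigned int esr_il(uint64_t esr)
{
    return (unsigned int)((esr >> 25) & 0x1);
}

static unsigned int esr_dfsc(uint64_t esr)
{
    return (unsigned int)(esr & 0x3f);
}

/* Write a one-line summary of the fields into buf; returns snprintf's result. */
static int esr_describe(uint64_t esr, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "EC=0x%02x IL=%u DFSC=0x%02x",
                    esr_ec(esr), esr_il(esr), esr_dfsc(esr));
}
```

For 0x96000010 this yields EC=0x25 (data abort taken without a change in exception level, so the faulting access was made by the kernel), IL=1, and DFSC=0x10 (synchronous external abort, not on a translation table walk). That matches the faulting `pc` being inside the register read helper `iomap_read_reg`.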
HW: custom board; the issue occurs with both AM623 and AM625 SKUs, on both GP and HSFS devices.
SW: custom BSP, based on TI 09.01.00.008. No changes related to MCAN, clocks, or anything that seems related to this issue. Linux kernel GIT: https://git.toradex.com/cgit/linux-toradex.git/log/?h=toradex_ti-linux-6.1.y
An abort like this would be expected in `m_can_get_berr_counter()` if `__m_can_get_berr_counter()` were called without first calling `m_can_clk_start()`, in other words without the clocks enabled. I looked at the code and could not spot any bug in the m_can driver that would explain this behavior.
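To make the failure mode concrete, here is a small user-space model of the clock bracket around the register read. It is only a sketch under my own assumptions, not the driver source: `struct fake_mcan` and every `fake_*` function are hypothetical stand-ins, and the "bus fault" flag merely imitates what the hardware reports as an external abort.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a clock-gated peripheral; NOT m_can driver code. */
struct fake_mcan {
    int clk_refs;      /* > 0 means the clock is considered running */
    uint32_t ecr;      /* stand-in for the error counter register */
    bool bus_fault;    /* set when a register is read with the clock off */
};

/* Stand-in for the clock-start step: take a reference on the clock. */
static int fake_clk_start(struct fake_mcan *dev)
{
    dev->clk_refs++;
    return 0;
}

/* Stand-in for the clock-stop step: drop the reference again. */
static void fake_clk_stop(struct fake_mcan *dev)
{
    assert(dev->clk_refs > 0);
    dev->clk_refs--;
}

/* Raw register read: faults if nobody holds the clock. */
static uint32_t fake_read_ecr(struct fake_mcan *dev)
{
    if (dev->clk_refs == 0) {
        dev->bus_fault = true;  /* real hardware: synchronous external abort */
        return 0;
    }
    return dev->ecr;
}

/* Guarded read: enable the clock, read the counter, release the clock. */
static int fake_get_berr_counter(struct fake_mcan *dev, uint32_t *out)
{
    int err = fake_clk_start(dev);

    if (err)
        return err;
    *out = fake_read_ecr(dev);
    fake_clk_stop(dev);
    return 0;
}
```

In this model the guarded path never faults, while calling `fake_read_ecr()` directly with `clk_refs == 0` does. Note that the model cannot represent a clock-start step that reports success while the hardware clock is not yet running; that would need a separate "actually running" state.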
It's important to note that so far we have not been able to reproduce the issue at temperatures above -20 degrees Celsius.
We were not able to reproduce the issue by running `ip -det link show can0` continuously after the system had booted properly at room temperature.
The only explanation I can think of is that `m_can_runtime_resume()` returns before the clocks are actually enabled. Could this be related to some kind of race condition with the DM firmware running on the Cortex-R5?