Hello,
we are using SDK 10.00.08 (kernel ti-linux-6.6.y) and, with this version, we get a kernel panic during boot if an NVMe PCIe device is connected.
With the old SDK 09.02.00.010 (kernel ti-linux-6.1.y) it was working correctly.
The same PCIe slot works correctly if we plug a different board (a PCIe to USB 3.0 adapter).
I checked the kernel configuration and the device tree and they seem to be correct.
The kernel panic is triggered in the nvme_pci_enable function at this instruction:
if (readl(dev->bar + NVME_REG_CSTS) == -1) {
Here the excerpt of the kernel panic:
[ 5.998134] j721e-pcie 2900000.pcie: PCI host bridge to bus 0000:00
[ 6.004436] pci_bus 0000:00: root bus resource [bus 00-ff]
[ 6.009942] pci_bus 0000:00: root bus resource [io 0x0000-0xffff] (bus address [0x10001000-0x10010fff])
[ 6.019437] pci_bus 0000:00: root bus resource [mem 0x10011000-0x17ffffff]
[ 6.026359] pci 0000:00:00.0: [104c:b012] type 01 class 0x060400
[ 6.032370] pci_bus 0000:00: 2-byte config write to 0000:00:00.0 offset 0x4 may corrupt adjacent RW1C bits
[ 6.042151] pci 0000:00:00.0: supports D1
[ 6.046156] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
[ 6.051930] pci 0000:00:00.0: reg 0x224: [mem 0x00000000-0x003fffff 64bit]
[ 6.058802] pci 0000:00:00.0: VF(n) BAR0 space: [mem 0x00000000-0x00ffffff 64bit] (contains BAR0 for 4 VFs)
[ 6.070865] pci 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 6.079034] pci 0000:01:00.0: [144d:a808] type 00 class 0x010802
[ 6.085091] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[ 6.092385] pci 0000:01:00.0: 15.752 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x2 link at 0000:00:00.0 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link)
[ 6.123674] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[ 6.130306] pci 0000:00:00.0: BAR 7: assigned [mem 0x10400000-0x113fffff 64bit]
[ 6.137635] pci 0000:00:00.0: BAR 14: assigned [mem 0x10100000-0x101fffff]
[ 6.144514] pci 0000:01:00.0: BAR 0: assigned [mem 0x10100000-0x10103fff 64bit]
[ 6.151851] pci 0000:00:00.0: PCI bridge to [bus 01]
[ 6.162334] pci 0000:00:00.0: bridge window [mem 0x10100000-0x101fffff]
[ 6.169411] pcieport 0000:00:00.0: of_irq_parse_pci: failed with rc=-22
[ 6.176042] pcieport 0000:00:00.0: enabling device (0000 -> 0002)
[ 6.182469] pcieport 0000:00:00.0: PME: Signaling with IRQ 617
[ 6.188592] pcieport 0000:00:00.0: AER: enabled with IRQ 617
[ 6.194688] pcieport 0000:00:00.0: of_irq_parse_pci: failed with rc=-22
[ 6.201879] nvme nvme0: pci function 0000:01:00.0
[ 6.206701] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[ OK ] Created slice Slice /system/systemd[ 6.215812] SError Interrupt on CPU7, code 0x00000000bf000000 -- SError
[ 6.215818] CPU: 7 PID: 64 Comm: kworker/u16:3 Not tainted 6.6.32-01373-gda8dd76693a4-dirty #35
[ 6.215823] Hardware name: Toradex Aquila AM69 on Aquila Development Board (DT)
[ 6.215826] Workqueue: events_unbound deferred_probe_work_func
[ 6.215841] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 6.215845] pc : nvme_pci_enable+0x5c/0x524
[ 6.215856] lr : nvme_pci_enable+0x50/0x524
[ 6.215859] sp : ffffffc0824737e0
[ 6.215861] x29: ffffffc0824737e0 x28: 0000000000000000 x27: ffffffc081106000
[ 6.215866] x26: ffffff8800283100 x25: ffffff8800c5b800 x24: 000000000000ffff
[ 6.215870] x23: ffffff88020f71f0 x22: ffffff880102a000 x21: ffffff880102a000
[ 6.215875] x20: ffffff880102a0c0 x19: ffffff88020f7000 x18: ffffffffffffffff
[ 6.215879] x17: 0000000000000000 x16: 0000000000000000 x15: 0720072007200720
[ 6.215883] x14: 0720072007200720 x13: ffffffc08111ad70 x12: 0000000000000621
[ 6.215887] x11: 000000000000020b x10: ffffffc081172d70 x9 : ffffffc08111ad70
[ 6.215891] x8 : 00000000ffffefff x7 : ffffffc081172d70 x6 : 0000000000000000
[ 6.215895] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 6.215899] x2 : 0000000000000000 x1 : ffffff8800ff8000 x0 : 0000000000000000
[ 6.215905] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 6.215908] CPU: 7 PID: 64 Comm: kworker/u16:3 Not tainted 6.6.32-01373-gda8dd76693a4-dirty #35
[ 6.215911] Hardware name: Toradex Aquila AM69 on Aquila Development Board (DT)
[ 6.215912] Workqueue: events_unbound deferred_probe_work_func
[ 6.215916] Call trace:
[ 6.215919] dump_backtrace+0x94/0x114
[ 6.215928] show_stack+0x18/0x24
[ 6.215932] dump_stack_lvl+0x48/0x60
[ 6.215937] dump_stack+0x18/0x24
[ 6.215939] panic+0x314/0x364
[ 6.215945] nmi_panic+0x8c/0x90
[ 6.215949] arm64_serror_panic+0x6c/0x78
[ 6.215951] do_serror+0x3c/0x78
[ 6.215953] el1h_64_error_handler+0x30/0x48
[ 6.215957] el1h_64_error+0x64/0x68
[ 6.215960] nvme_pci_enable+0x5c/0x524
[ 6.215963] nvme_probe+0x280/0x6f8
[ 6.215966] pci_device_probe+0xa8/0x16c
[ 6.215971] really_probe+0x184/0x3c8
[ 6.215975] __driver_probe_device+0x7c/0x16c
[ 6.215978] driver_probe_device+0x3c/0x10c
[ 6.215981] __device_attach_driver+0xbc/0x158
[ 6.215984] bus_for_each_drv+0x80/0xdc
[ 6.215990] __device_attach+0xa8/0x1d4
[ 6.215993] device_attach+0x14/0x20
[ 6.215996] pci_bus_add_device+0x64/0xd4
[ 6.216002] pci_bus_add_devices+0x3c/0x88
[ 6.216006] pci_bus_add_devices+0x68/0x88
[ 6.216010] pci_host_probe+0x44/0xbc
[ 6.216015] cdns_pcie_host_setup+0x10c/0x1c8
[ 6.216020] j721e_pcie_probe+0x3cc/0x444
[ 6.216024] platform_probe+0x68/0xdc
[ 6.216027] really_probe+0x184/0x3c8
[ 6.216030] __driver_probe_device+0x7c/0x16c
[ 6.216032] driver_probe_device+0x3c/0x10c
[ 6.216035] __device_attach_driver+0xbc/0x158
[ 6.216037] bus_for_each_drv+0x80/0xdc
[ 6.216042] __device_attach+0xa8/0x1d4
[ 6.216044] device_initial_probe+0x14/0x20
[ 6.216047] bus_probe_device+0xac/0xb0
[ 6.216052] deferred_probe_work_func+0x9c/0xec
[ 6.216054] process_one_work+0x138/0x260
[ 6.216061] worker_thread+0x32c/0x438
[ 6.216065] kthread+0x118/0x11c
[ 6.216070] ret_from_fork+0x10/0x20
[ 6.216074] SMP: stopping secondary CPUs
[ 6.216084] Kernel Offset: disabled
[ 6.216086] CPU features: 0x0,80000000,28020000,1000420b
[ 6.216089] Memory Limit: none
[ 6.539032] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
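For anyone who wants to map the faulting pc from the dump back to a source line, the kernel's scripts/faddr2line helper can be used; this is just a sketch and assumes a vmlinux with debug info built from the same tree:
# Sketch: resolve the pc reported in the SError dump to a source line.
# Assumes vmlinux was built with debug info (CONFIG_DEBUG_INFO) from this tree.
cd ti-linux-kernel
./scripts/faddr2line vmlinux nvme_pci_enable+0x5c/0x524
# Should point into drivers/nvme/host/pci.c, i.e. the readl(dev->bar + NVME_REG_CSTS) check quoted above.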
I gave this patch a try, without any change in behavior: lore.kernel.org/.../
What do you think about this? Are you experiencing the same behavior?
Emanuele
Hi Emanuele,
What is the model of the NVMe PCIe card?
It should not be applicable to AM69, but a sister processor, TDA4VM, has an issue with certain multi-function PCIe devices. So I would like to make sure there is nothing special about this NVMe card, such as SR-IOV features.
In the meantime, let me see if I can obtain an AM69 SK board from a colleague to test SDK 10.0.
Regards,
Takuma
Hi Takuma,
the model is a Samsung 970 EVO Plus (MZ-V7S250).
Thank you and regards,
Emanuele
Hi Emanuele,
I tried out SDK 10.0 with the SK-AM69 board just now and cannot reproduce the issue. I used the prebuilt edgeai image for 10.0 here: https://www.ti.com/tool/download/PROCESSOR-SDK-LINUX-AM69A.
As for differences, I am using different SSD cards, a "Kingston TC2200" and a "Sandisk WD Black SN770". I will need to search for a Samsung SSD card to replicate the exact setup.
Attaching logs below for what I am seeing:
root@am69-sk:/opt/edgeai-gst-apps# uname -a
Linux am69-sk 6.6.32-ti-gdb8871293143-dirty #1 SMP PREEMPT Thu Aug 1 19:10:56 UTC 2024 aarch64 GNU/Linux
root@am69-sk:/opt/edgeai-gst-apps# lspci
0000:00:00.0 PCI bridge: Texas Instruments Device b012
0000:01:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. NV2 NVMe SSD TC2200 (DRAM-less)
0001:00:00.0 PCI bridge: Texas Instruments Device b012
0001:01:00.0 Non-Volatile memory controller: Sandisk Corp WD Black SN770 / PC SN740 256GB / PC SN560 (DRAM-less) NVMe SSD (rev 01)
0002:00:00.0 PCI bridge: Texas Instruments Device b012
root@am69-sk:/opt/edgeai-gst-apps#
On your end, could you try out a different SSD card to see if the same issue is seen? Best would be an SSD card from a different vendor, to verify whether all SSD cards fail on your system, and additionally a different unit of the Samsung 970 SSD, to verify whether the issue is specific to the single SSD you tested.
Regards,
Takuma
Hello Takuma,
thank you very much for the verification.
I will try to find another NVMe M.2 device and double-check the kernel configuration and the device tree.
It is important to note that the same device is working with ti-linux-6.1.y.
Can I kindly ask you to provide the full dmesg output?
Kind regards,
Emanuele
Hi Emanuele,
I am currently out of office, but I can get back to you with the dmesg output on Wednesday.
Regards,
Takuma
Hi Emanuele,
Attached below are my boot logs. The commit ID for Linux might have shifted, since I was doing a different debug for MCAN, but there should be no changes with regard to the PCIe interface.
Please feel free to use it for comparison with your dmesg logs.
Regards,
Takuma
Hi Emanuele,
I was able to obtain a Samsung NVMe PCIe card. The closest I could get is a Samsung 970 PRO, and so far I am not able to reproduce the issue:
10_0_0_8_samsung970PRO_dmesg.txt
It could be some issue with the particular instance of Samsung 970 EVO PLUS PCIe card you have. As a workaround, could you continue development on a different NVMe PCIe card?
Regards,
Takuma
Dear Takuma Fujiwara,
Thank you for the support.
Allow me to give some background on the issue here. We have three interfaces, PCIE0 (2 lanes), PCIE2 (1 lane) and QSGMII_LANE2, from AM69 SERDES1. As the default SW configuration does not allow three interfaces from a single SERDES1, we have the below patch in our downstream branch to make PCIe multi-link work without SSC, as suggested by TI.
This patch, along with our DT, works fine with the Samsung 970 EVO Plus (connected at AM69 PCIE0, 2 lanes) on ti-linux-6.1.y, but causes the kernel panic reported here once we updated to ti-linux-6.6.y.
Could you please check from your end whether the given patch and/or any other configuration needs to be modified in order to support PCIe multi-link on ti-linux-6.6.y?
Appreciate your support.
Regards,
Parth P
Hi Parth,
Thank you for the context. If that is the case, I think we should review the device tree as well as review the patched up driver.
Could you share four files:
Regards,
Takuma
Hi Takuma Fujiwara,
Please find the below files for your reference, as per your request.
Kindly let me know if you would need more info. Appreciate your support.
Regards,
Parth P
Hi Parth,
For clarification: SERDES0, with 2x PCIe and 1x USB, has no issue when the SSD card is plugged in? And only SERDES1, connected to 2x PCIe and 1x SGMII, is having issues, and only with the Samsung NVMe SSD card, with no issues on any other PCIe card?
So far, I do not see anything out of place in the device tree. I will need some time to review the patched Torrent driver.
However, I do see one thing in your boot logs that is different from my working setup. It looks like the NVMe device is probed very soon after the PCIe controller is up.
Could you try out this patch, which increases the delay for deasserting PCIe reset: https://lore.kernel.org/lkml/20230707095119.447952-1-a-verma1@ti.com/
Regards,
Takuma
Dear Takuma Fujiwara,
Thank you for helping with this issue. At the moment, the mentioned PCIe NVMe SSD crash is resolved for us with the below changes in the kernel configuration.
-CONFIG_PHY_CADENCE_TORRENT=m
+CONFIG_PHY_CADENCE_TORRENT=y
-CONFIG_PHY_J721E_WIZ=m
+CONFIG_PHY_J721E_WIZ=y
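For completeness, a quick way to confirm on the running target how these options ended up being built is sketched below; it assumes CONFIG_IKCONFIG_PROC is enabled so that /proc/config.gz exists:
# Check how the SERDES-related options are built in the running kernel
# (assumes CONFIG_IKCONFIG_PROC=y so /proc/config.gz is available).
zcat /proc/config.gz | grep -E 'CONFIG_PHY_CADENCE_TORRENT|CONFIG_PHY_J721E_WIZ'
# If they were still modules, they would show up here once loaded:
lsmod | grep -E 'phy_cadence_torrent|phy_j721e_wiz'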
Could you please let me know if you have any idea why building the Cadence Torrent PHY and J721E WIZ drivers as modules does not work as expected for this use case? Is this some limitation or a known issue?
Regards,
Parth P
Hi Parth,
The assigned engineer is out of office for the day. Please expect a delay in response.
Thank you,
Fabiana
Hi Parth,
Most likely the issue is the timing of when the SERDES driver is initialized (Torrent is the name of the SERDES IP, while WIZ is a wrapper around this SERDES IP). Changing "m" to "y" builds the SERDES drivers into the kernel instead of as loadable kernel modules, which makes them probe earlier. I would say this is a valid fix for this timing issue.
As mentioned in my previous response, I saw that the NVMe driver is probed very early in your shared boot logs compared to the boot logs from the default TI SDK on the default TI EVM board. The dependency chain is NVMe -> PCIe -> SERDES. So most likely the default TI SDK has some extra drivers and modules which delay the actual probing of the NVMe driver, so the issue is not seen.
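If you want to compare the probe ordering between your image and the default SDK image, a rough way is to filter the boot log and compare timestamps; the grep patterns below are only guesses at the relevant driver messages:
# Rough sketch: compare when the SERDES (WIZ/Torrent), PCIe controller and
# NVMe drivers probe by filtering dmesg timestamps on both images.
dmesg | grep -Ei 'wiz|torrent|j721e-pcie|pcieport|nvme'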
Regards,
Takuma
Dear Takuma Fujiwara,
Appreciate your support on this issue. I would like to report that the mentioned PCIe NVMe crash is not fully resolved for us. The solution I mentioned in my previous response, changing the kernel configs to build the drivers statically instead of as modules, improved the situation a bit but did not permanently resolve the issue.
We still observe the PCIe NVMe crash with the latest TI 10.x SDK components on AM69 in a systematic way when we have a display connected to the device. As I understand it, DisplayPort plays no role for PCIe here, except that it may affect the probing sequence and cause some race condition, which results in a kernel panic for us.
Full Logs: /cfs-file/__key/communityserver-discussions-components-files/791/dmesg_2D00_nvme_2B00_dp_2D00_panic.txt
I have also tried the below suggestion, which did not improve the situation.
Could you try out this patch here which increases delay for deasserting PCIe reset: https://lore.kernel.org/lkml/20230707095119.447952-1-a-verma1@ti.com/
Could you please have a look into this and let me know what could be done in order to permanently fix the mentioned PCIe NVMe kernel panic?
Thank you.
Regards,
Parth P
Hi Parth,
I have not been able to reproduce the issue on our end. So far, with the TI EVM and the default TI 10.0 SDK image, 100% of the time the board successfully boots and PCIe enumerates without any crashes, with or without a display, and this is true for all NVMe SSD cards that I can obtain.
I can only say that the issue could be hardware related (something to do with signal integrity or power, if connecting a display causes issues), or there is something wrong with your particular PCIe NVMe card, or, as you said, it is something to do with the timing of how the drivers are probed. However, I do see that in the latest logs the NVMe is probed much later after the PCIe ports are initialized.
If you have a TI EVM, please try it on the TI EVM to re-confirm what I am seeing on our boards. And is my understanding correct that without the display, the kernel panic does not occur and PCIe enumerates correctly?
Regards,
Takuma
Takuma Fujiwara
To me this seems to be just a bug triggered by some timing situation (e.g. a race condition). The fact that you cannot reproduce it with the EVM is not enough, IMO. Different boards will potentially probe different drivers in a different order and have different execution timing.
The fact that moving the driver from `m` to `y` (module to built-in) made a big difference, to the point that we thought the issue was fixed, to me is just one more hint of this being a timing issue.
I'd also like to reiterate that with the previous SDK, based on TI's 6.1 kernel, the issue is not reproducible, which is not consistent with your hypothesis of this being a HW issue.
If you look at the logs, the crash happens when trying to access the PCIe configuration space; to me this could happen if this space is not yet mapped for the CPU.
Can you please look into my considerations?
Hi fd,
If you have a TI EVM board, then please try to reproduce the issue on the board and share with us the method to reproduce the issue. As of now, we cannot reproduce this issue on our end, which I am sure you understand makes it difficult for us to debug.
If neither of us can reproduce it on the TI EVM board, then you will have to drive this debug, since the issue is specific to your system (aka, we cannot collect logs nor can we test fixes on a system that does not show the issue). The best we can do is give suggestions for experiments and speculations for how to analyze results of experiments that you run on your system.
So I ask one of two things:
Regards,
Takuma
Dear Takuma Fujiwara,
Please find the logs for the below experiment, as suggested. I see the same crash upon PCIe rescan when trying to hot-plug the PCIe NVMe SSD card.
If you cannot reproduce the issue on a TI EVM board, then can you do this experiment:
- Boot the system without PCIe connected to SSD. Keep the display connected, since it sounds like display is what was causing issues in the latest response.
- Once the system is booted, connect the SSD card via PCIe
- Run "echo 1 > /sys/bus/pci/rescan" to do a manual rescanning of the PCIe bus
- Share resulting logs. This should enumerate PCIe correctly if timing is the issue.
Thanks.
Regards,
Parth P
Hi Parth,
Interesting results. It seems like it is not a timing issue (at least, not a timing issue between PCIe controller and when NVMe is probed).
I assume this is on a custom board, but is it using the internal reference clock, or is an external clock generator providing the 100 MHz reference clock to the SoC and device?
I ask because there is an erratum that affects a very rare set of devices when using the internal clock.
Regards,
Takuma
Dear Takuma Fujiwara,
Thank you for the response. As I mentioned before, our custom HW uses an external clock generator.
As the default SW configuration does not allow three interfaces from a single SERDES1, we have the below patch in our downstream branch to make PCIe multi-link work without SSC, as suggested by TI.
One more interesting finding is that the PCIe NVMe crash for the suggested hot-plug and PCIe rescan experiment is also present on the TI AM69 SK board with SDK 10.x pre-built binaries.
Logs : /cfs-file/__key/communityserver-discussions-components-files/791/AM69A_5F00_SK_5F00_SDK_5F00_10.x_5F00_PCIE_5F00_Rescan_5F00_Crash.txt
Considering the above findings, and assuming this use case is supposed to work without any issues, do you agree that there could be something to check on the TI/PCIe driver side as well?
Regards,
Parth P
Hi Parth,
Thank you for the suggestion. I was able to reproduce the issue on the SK-AM69 as well. So far, it looks like this only gets triggered when hot-plugging and manually rescanning the PCIe slots with a Samsung SSD (the issue does not show if the board is booted normally, only when manually rescanning, and only with a specific SSD). I have tried the same experiment with a Kingston SSD and could not see the issue. Therefore, it looks to be a corner case.
I am looking into the issue, but as of now, it does not look to be a timing issue. I've narrowed the issue down to line 2490 in the NVMe driver: https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/drivers/nvme/host/pci.c?h=ti-linux-6.6.y#n2490. One experiment I tried was adding a 3-second sleep before each line within this nvme_pci_enable function, but the error still persists, which implies it is not a timing issue.
Comparing the NVMe driver between the 6.1 kernel used in the 9.2 SDK and the 6.6 kernel used in the 10.0 SDK, there have been a couple of major changes, such as a change in the max size for NVMe allocation. These changes are all from upstream Linux, not from TI, so I would like to warn you that debug will take considerable time due to this being a niche corner case involving upstream Linux changes. However, this is something that we are checking.
Regards,
Takuma
Hi Parth,
Could you find a PCIe card capable of PME from D0 and D3cold/hot and test the boot on the 6.6 kernel, as well as the rescan? Running lspci -vv should show the capabilities of the PCIe card.
So far, I am finding that cards that have power management enabled (+) do not error out, while cards that have power management disabled (-) cause a kernel panic on both the previous 6.1 kernel and the 6.6 kernel when doing a rescan. Since I am seeing the kernel panic on both kernels, I am unsure whether I am chasing the same issue manifesting on your board.
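As an example, something like the following should show the Power Management capability of the endpoint; the BDF 0000:01:00.0 is just the address from the earlier logs, so adjust it for your setup:
# Sketch: dump the Power Management capability of the NVMe endpoint.
# The BDF 0000:01:00.0 is taken from the earlier logs; adjust as needed.
lspci -s 0000:01:00.0 -vv | grep -A 3 'Power Management'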
Regards,
Takuma
Hi Parth,
Closing and continuing on this thread, as I assume these two threads are the same: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1424539/am69-pcie-kernel-panic-cdns_ti_pcie_config_read-0x18-0x34-pci_generic_config_read-0xd4-0x124/5474573?tisearch=e2e-sitesearch&keymatch=%252520user%25253A584660#5474573
Regards,
Takuma