Linux/AM5728: PCIe USB hub failures

Part Number: AM5728
Other Parts Discussed in Thread: TUSB7340, TUSB8041

Tool/software: Linux

We are developing a product based on the TI AM5728 EVM. The product uses a TUSB7340 PCIe USB host controller for additional USB ports. The TUSB7340 is detected and set up properly and works fine with low-data-rate devices. However, hot plugging a Realtek USB network adapter and running Ethernet bandwidth tests with iperf3 causes the host to lock up. The TUSB7340 host appears to stop communicating, and the log shows "xhci_hcd 0000:01:00.0: HC died; cleaning up".

We tried to set up a test on the TI AM5728 EVM using the TI TUSB7340 EVM and a PCIe adapter cable, but we could not get a stable PCIe bus even at Gen1 speeds, so we were unable to replicate the test.

So we looked at using another host controller and found a mini PCIe card that uses the µPD720201 and can be installed directly on the LCD on the EVM. The card is detected properly, and we reran the transfer test. The µPD720201 locks up with the same problem.

The AM5728 testing was performed using the stock am57xx-evm-linux-04.00.00.04.img on the SD card, kernel am57xx-evm 4.9.28-geed43d1050, and it reports that it is using the TI AM572x EVM Rev A3 Device tree.

It shows the following logging when it fails.

[  630.400899] xhci_hcd 0000:01:00.0: xHCI host not responding to stop endpoint command.
[  630.408769] xhci_hcd 0000:01:00.0: Assuming host is dying, halting host.
[  630.420849] r8152 2-4:1.0 enp1s0u4: Tx status -108
[  630.425667] r8152 2-4:1.0 enp1s0u4: Tx status -108
[  630.430483] r8152 2-4:1.0 enp1s0u4: Tx status -108
[  630.435297] r8152 2-4:1.0 enp1s0u4: Tx status -108
[  630.440122] xhci_hcd 0000:01:00.0: HC died; cleaning up
[  630.453961] usb 2-4: USB disconnect, device number 2

The problem appears to be a general driver issue given we get the same problem with both the TUSB7340 and the µPD720201.
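
For readers who want to triage a similar capture, the log excerpt above can be summarized mechanically. The following is a minimal sketch, not part of the original report: the helper names (`parse_dmesg`, `summarize`) are hypothetical, and the errno table lists only a few Linux values from `asm-generic/errno.h`. It extracts the first `xhci_hcd` message and translates the numeric `Tx status` codes.

```python
import re

# A few Linux errno numbers (asm-generic/errno.h); extend as needed.
LINUX_ERRNO = {22: "EINVAL", 108: "ESHUTDOWN", 110: "ETIMEDOUT"}

DMESG_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
TX_STATUS_RE = re.compile(r"Tx status (-?\d+)")


def parse_dmesg(text):
    """Return (timestamp, message) pairs for every dmesg-style line in text."""
    events = []
    for line in text.splitlines():
        match = DMESG_RE.match(line.strip())
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


def summarize(events):
    """Return the first xhci_hcd event and the errno names seen in Tx status lines."""
    first_xhci = next(
        ((ts, msg) for ts, msg in events if msg.startswith("xhci_hcd")), None
    )
    tx_errors = set()
    for _, msg in events:
        match = TX_STATUS_RE.search(msg)
        if match:
            code = abs(int(match.group(1)))
            tx_errors.add(LINUX_ERRNO.get(code, str(code)))
    return first_xhci, sorted(tx_errors)


# Shortened copy of the log lines quoted in the post.
SAMPLE = """\
[  630.400899] xhci_hcd 0000:01:00.0: xHCI host not responding to stop endpoint command.
[  630.408769] xhci_hcd 0000:01:00.0: Assuming host is dying, halting host.
[  630.420849] r8152 2-4:1.0 enp1s0u4: Tx status -108
[  630.440122] xhci_hcd 0000:01:00.0: HC died; cleaning up
"""

if __name__ == "__main__":
    first, errors = summarize(parse_dmesg(SAMPLE))
    print("first xhci_hcd event:", first)
    print("Tx status errors:", errors)
```

Status -108 corresponds to ESHUTDOWN on Linux, which the kernel's USB error-code documentation associates with a device or host controller that has been disabled. That is consistent with the r8152 errors here being a consequence of the host controller being halted.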

Any suggestions on how we can address this problem?

Thanks!

  • Chris,

    The overnight run failed too, but it provided more information. It looks to me like the issue is in the r8152 driver, which causes the watchdog to fire. I searched online, and it appears to be a known Realtek issue that is not specific to AM57x PCIe.

    https://bbs.archlinux.org/viewtopic.php?id=213517
    ubuntuforums.org/showthread.php
    ubuntuforums.org/showthread.php


    [ 4] 2055.00-2056.00 sec 27.4 MBytes 230 Mbits/sec 1 1.41 KBytes
    [ 4] 2056.00-2057.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes
    [ 4] 2057.00-2058.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes
    [ 4] 2058.00-2059.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes
    [ 4] 2059.00-2060.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes
    [ 4] 2060.00-2061.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes
    [ 4] 2061.00-2062.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes
    [ 4] 2062.00-2063.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes
    [ 2211.039936] ------------[ cut here ]------------
    [ 2211.044608] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x258/0x25c
    [ 2211.052924] NETDEV WATCHDOG: enp1s0u2 (r8152): transmit queue 0 timed out
    [ 2211.059745] Modules linked in: cdc_ether usbnet r8152 sha512_generic sha512_arm sha256_generic sha1_generic sha1_arm_neon sha1_arm md5 cbc bc_example(O) xfrm_user xfrm4_tunnel ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo pru_rproc pruss_intc xhci_plat_hcd rpmsg_proto pruss rpmsg_rpc dwc3 udc_core bluetooth snd_soc_simple_card snd_soc_simple_card_utils ahci_platform libahci_platform libahci ti_vip pvrsrvkm(O) snd_soc_omap_hdmi_audio libata omap_sham pruss_soc_bus omap_aes_driver omap_wdt scsi_mod xhci_pci xhci_hcd ti_vpe snd_soc_tlv320aic3x ti_sc ti_csc ti_vpdma usbcore omap_des usb_common dwc3_omap pixcir_i2c_ts rtc_omap extcon_palmas mt9t11x extcon_core des_generic crypto_engine rtc_palmas rtc_ds1307 omap_remoteproc virtio_rpmsg_bus rpmsg_core remoteproc sch_fq_codel uio_module_drv(O) uio gdbserverproxy(O) cryptodev(O) cmemk(O)
    [ 2211.134077] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 4.9.28-geed43d1050 #2
    [ 2211.142290] Hardware name: Generic DRA74X (Flattened Device Tree)
    [ 2211.148406] Backtrace:
    [ 2211.150883] [<c020b29c>] (dump_backtrace) from [<c020b558>] (show_stack+0x18/0x1c)
    [ 2211.158487] r7:00000009 r6:60000113 r5:00000000 r4:c1022410
    [ 2211.164174] [<c020b540>] (show_stack) from [<c04c9f40>] (dump_stack+0x8c/0xa0)

  • That makes no sense. Several things:

    1) The entire USB host is going down. How does a peripheral manage to bring down the entire host?

    2) It works fine on the native, non-PCIe USB host. If the Realtek adapter were the problem, then all USB hosts would show the same issue.

    3) I've been able to make it fail with everything I've tried that works as a network peripheral. I've made it fail with our USB products, which are CDC Ether based, with an Asus adapter, and with the Realtek adapter.

    The watchdog times out because the USB host has failed. With the host down, nothing can be transmitted to the Realtek adapter, so the watchdog fires.
  • I'm going to be away and will be getting back to this on Oct 10th. Thanks for all your help so far.
  • I looked at the articles you posted. None of them reflects this problem.

    1) There is no xhci host failure in either case.
    2) In both cases, the adapter fails and the port is reset. I've never seen this happen. It is always a host failure, not a peripheral failure.

    Have a look through your own logs. I'm positive you're not going to find a port or peripheral failure; it is always going to be a host failure.

    As I mentioned, the watchdog fires because once the host fails it can no longer communicate with the peripheral so you are going to get:

    [ 2211.052924] NETDEV WATCHDOG: enp1s0u2 (r8152): transmit queue 0 timed out

    that is completely expected as a result of the USB host failure.

    These articles have nothing to do with the problem we are seeing.
  • My advice to you, if you still think it is the Realtek, is to get another adapter. I've had it fail with three different types so far.

    Choose whatever you want; it doesn't matter. I have yet to find one that doesn't manifest the problem.
  • Chris,

    Here are the full logs. The "xHCI host not responding" message shows up at the end of the logs. Wouldn't the xHCI be affected by the watchdog killing the system? The links describe failures on PCs that had Ethernet going through the PCIe bus.


    [ 4] 2061.00-2062.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes
    [ 4] 2062.00-2063.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes
    [ 2211.039936] ------------[ cut here ]------------
    [ 2211.044608] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x258/0x25c
    [ 2211.052924] NETDEV WATCHDOG: enp1s0u2 (r8152): transmit queue 0 timed out
    [ 2211.059745] Modules linked in: cdc_ether usbnet r8152 sha512_generic sha512_arm sha256_generic sha1_generic sha1_arm_neon sha1_arm md5 cbc bc_example(O) xfrm_user xfrm4_tunnel ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo pru_rproc pruss_intc xhci_plat_hcd rpmsg_proto pruss rpmsg_rpc dwc3 udc_core bluetooth snd_soc_simple_card snd_soc_simple_card_utils ahci_platform libahci_platform libahci ti_vip pvrsrvkm(O) snd_soc_omap_hdmi_audio libata omap_sham pruss_soc_bus omap_aes_driver omap_wdt scsi_mod xhci_pci xhci_hcd ti_vpe snd_soc_tlv320aic3x ti_sc ti_csc ti_vpdma usbcore omap_des usb_common dwc3_omap pixcir_i2c_ts rtc_omap extcon_palmas mt9t11x extcon_core des_generic crypto_engine rtc_palmas rtc_ds1307 omap_remoteproc virtio_rpmsg_bus rpmsg_core remoteproc sch_fq_codel uio_module_drv(O) uio gdbserverproxy(O) cryptodev(O) cmemk(O)
    [ 2211.134077] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 4.9.28-geed43d1050 #2
    [ 2211.142290] Hardware name: Generic DRA74X (Flattened Device Tree)
    [ 2211.148406] Backtrace:
    [ 2211.150883] [<c020b29c>] (dump_backtrace) from [<c020b558>] (show_stack+0x18/0x1c)
    [ 2211.158487] r7:00000009 r6:60000113 r5:00000000 r4:c1022410
    [ 2211.164174] [<c020b540>] (show_stack) from [<c04c9f40>] (dump_stack+0x8c/0xa0)
    [ 2211.171430] [<c04c9eb4>] (dump_stack) from [<c022dcf0>] (__warn+0xec/0x104)
    [ 2211.178421] r7:00000009 r6:c0c0b2a8 r5:00000000 r4:c1001d40
    [ 2211.184106] [<c022dc04>] (__warn) from [<c022dd48>] (warn_slowpath_fmt+0x40/0x48)
    [ 2211.191622] r9:ffffffff r8:c1002d00 r7:ed97c294 r6:ed349a00 r5:ed97c000 r4:c0c0b26c
    [ 2211.199400] [<c022dd0c>] (warn_slowpath_fmt) from [<c07ce83c>] (dev_watchdog+0x258/0x25c)
    [ 2211.207610] r3:ed97c000 r2:c0c0b26c
    [ 2211.211198] r4:00000000
    [ 2211.213749] [<c07ce5e4>] (dev_watchdog) from [<c0290e24>] (call_timer_fn.constprop.3+0x30/0xa0)
    [ 2211.222486] r10:40000001 r9:ed97c000 r8:c07ce5e4 r7:00000000 r6:c07ce5e4 r5:00000101
    [ 2211.230347] r4:ffffe000
    [ 2211.232895] [<c0290df4>] (call_timer_fn.constprop.3) from [<c0290f34>] (expire_timers+0xa0/0xac)
    [ 2211.241718] r6:00000200 r5:c1001df0 r4:eed36440
    [ 2211.246359] [<c0290e94>] (expire_timers) from [<c0290fd8>] (run_timer_softirq+0x98/0x184)
    [ 2211.254573] r9:00000001 r8:c1002080 r7:eed36440 r6:c1002d00 r5:c1001dec r4:00000001
    [ 2211.262354] [<c0290f40>] (run_timer_softirq) from [<c023284c>] (__do_softirq+0xf8/0x234)
    [ 2211.270479] r7:00000101 r6:c1000000 r5:c1002084 r4:00000020
    [ 2211.276166] [<c0232754>] (__do_softirq) from [<c0232cc8>] (irq_exit+0xe0/0x148)
    [ 2211.283508] r10:c10030ac r9:c1000000 r8:ee808000 r7:00000000 r6:00000000 r5:00000013
    [ 2211.291368] r4:c0e5bd88
    [ 2211.293914] [<c0232be8>] (irq_exit) from [<c027ea24>] (__handle_domain_irq+0x68/0xbc)
    [ 2211.301779] [<c027e9bc>] (__handle_domain_irq) from [<c02014a0>] (gic_handle_irq+0x40/0x7c)
    [ 2211.310167] r9:c1000000 r8:fa213000 r7:fa212000 r6:c1001ef0 r5:fa21200c r4:c1003424
    [ 2211.317945] [<c0201460>] (gic_handle_irq) from [<c020c078>] (__irq_svc+0x58/0x8c)
    [ 2211.325458] Exception stack(0xc1001ef0 to 0xc1001f38)
    [ 2211.330531] 1ee0: 00000001 00000000 fe600000 00000000
    [ 2211.338745] 1f00: c1000000 c100303c 00000001 c10030a4 00000000 00000000 c10030ac c1001f4c
    [ 2211.346957] 1f20: c1001f2c c1001f40 c0220814 c02086f4 60000013 ffffffff
    [ 2211.353601] r9:c1000000 r8:00000000 r7:c1001f24 r6:ffffffff r5:60000013 r4:c02086f4
    [ 2211.361385] [<c02086cc>] (arch_cpu_idle) from [<c08c625c>] (default_idle_call+0x28/0x34)
    [ 2211.369514] [<c08c6234>] (default_idle_call) from [<c026e170>] (cpu_startup_entry+0x1b4/0x230)
    [ 2211.378165] [<c026dfbc>] (cpu_startup_entry) from [<c08c1584>] (rest_init+0x8c/0x90)
    [ 2211.385938] r7:ffffffff
    [ 2211.388487] [<c08c14f8>] (rest_init) from [<c0e00d80>] (start_kernel+0x3e0/0x3ec)
    [ 2211.396001] r5:00000000 r4:c105004c
    [ 2211.399593] [<c0e009a0>] (start_kernel) from [<80008090>] (0x80008090)
    [ 2211.406185] ---[ end trace 7b35e15081296556 ]---
    [ 2211.410838] r8152 2-2:1.0 enp1s0u2: Tx timeout
    [ 4] 2063.00-2064.06 sec 0.00 Bytes 0.00 bits/sec 0 1.41[ 2211.415342] xhci_hcd 0000:01:00.0: xHCI host not responding to stop endpoint command.
    [ 2211.428748] xhci_hcd 0000:01:00.0: Assuming host is dying, halting host.
  • No, it is the other way around. The watchdog path is this:

    watchdog
    ^
    |
    v
    PCIe
    ^
    |
    v
    USB Host
    ^
    |
    v
    Network adapter

    The watchdog will fire if any of the PCIe, USB host, or network adapter fails. All three are required for the watchdog to successfully monitor the network adapter.

    Our logs show the USB host is failing. The USB host fails, the watchdog attempts to communicate with the network adapter but can't because of the USB host failure, and so the watchdog fires.

    The xhci_hcd subsystem then times out on the same failure, unable to communicate with the USB host.

    Root cause here is the USB host failure. There is absolutely nothing indicating a failure elsewhere.
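
One way to check this ordering against a captured console log is to look for where iperf3 throughput first drops to zero relative to the watchdog warning. The following is a minimal sketch under stated assumptions: the function name `stall_before_watchdog` is hypothetical, the regular expression assumes the iperf3 interval format shown in the logs in this thread, and it relies on line order only, because iperf3 interval times and kernel timestamps come from different clocks.

```python
import re

# Matches iperf3 interval lines such as:
# [ 4] 2055.00-2056.00 sec 27.4 MBytes 230 Mbits/sec 1 1.41 KBytes
IPERF_RE = re.compile(
    r"\[\s*\d+\]\s+(\d+\.\d+)-(\d+\.\d+)\s+sec\s+\S+\s+\S+\s+"
    r"(\d+(?:\.\d+)?)\s+\S*bits/sec"
)


def stall_before_watchdog(lines):
    """Return (stall_start_seconds, zero_intervals) for the run of zero-throughput
    intervals directly preceding the first NETDEV WATCHDOG line.

    Returns None if the log contains no NETDEV WATCHDOG line.
    """
    stall_start = None
    zero_run = 0
    for line in lines:
        if "NETDEV WATCHDOG" in line:
            return stall_start, zero_run
        match = IPERF_RE.search(line)
        if not match:
            continue  # kernel backtrace lines and other noise
        start, bitrate = float(match.group(1)), float(match.group(3))
        if bitrate == 0.0:
            if zero_run == 0:
                stall_start = start
            zero_run += 1
        else:
            stall_start, zero_run = None, 0
    return None


if __name__ == "__main__":
    sample = [
        "[ 4] 2055.00-2056.00 sec 27.4 MBytes 230 Mbits/sec 1 1.41 KBytes",
        "[ 4] 2056.00-2057.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes",
        "[ 4] 2057.00-2058.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes",
        "[ 2211.052924] NETDEV WATCHDOG: enp1s0u2 (r8152): transmit queue 0 timed out",
    ]
    print(stall_before_watchdog(sample))
```

In the first excerpt posted in this thread, traffic stops at the 2056 s interval and seven zero-throughput intervals are printed before the watchdog warning appears, so the transmit path had already stalled before the watchdog reported it.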
  • I've seen this watchdog failure as well on tests that are able to run for a while, and it is clear that it is due to the USB host failure.

    I put the question to you: if you really think it is the watchdog, how do you explain the same failure when the watchdog doesn't fire?

    Try a bunch of other adapters if you don't want to believe this; they will all fail.
  • Another thought, turn off the watchdog. You'll get the same failure.
  • Chris,

    We are investigating the issue and will update if we root cause it.

    Rex
  • Ok thanks, I'm back from holiday so let me know if there is any additional information or testing you need.

  • Chris,

    Over the past week I did more runs with different setups and had internal discussions on this issue. We'll debug the PCIe side further. We have an internal Jira record to track the issue, and I'll post back with status or any findings. I think we have enough information for now; if I need anything from you, I'll let you know.

    Rex
  • Chris,

    You mentioned that it also crashed with your USB product, without the Ethernet adapter. I am interested in that setup and the test you ran on it. Does the issue happen as quickly as it does with the Ethernet adapter? I am trying to collect more data on this issue.

    Rex
  • Our products are "cassettes" that plug into a back plane on the main product chassis.

    The cassettes are USB v2.0 cdc_ether type devices connecting to TUSB7340 PCIe USB host ports.

    The data transfer rates are much lower than with the USB Ethernet adapters; I think the test I was running saw transfers of < 500 kbs. Failure typically occurs within 20-30 minutes.

    The test transferred data in blocks of roughly 100K bytes.

    However, failures also occur during hot-plug operation, on the pull. A cassette pull quite frequently caused the host to lock up; in fact, that was how we first encountered the problem that led to this investigation.

  • Hi, Chris,

    Got you. We'll stick with the Realtek adapter.

    You mentioned hot plug. These cassettes plug into the backplane over the PCIe interface, don't they? We don't support PCIe hot plug, if that is the application.

    Rex
  • No, the backplane connections are TUSB7340 USB ports, not PCIe.  Cassettes plug into TUSB7340 ports just like the testing you are doing with the Realtek.  This was how we first encountered the failure during hot plug testing of the cassettes to the TUSB7340 ports on the backplane.

    The TUSB7340 host gets the same lock out failure typically on cassette removal.

    PCIe
      |
      v
    TUSB7340
    |    |    |    |
    v    v    v    v
    Backplane connections using the TUSB7340 ports

  • Thanks for the info, Chris. I just want to be sure I understand the setup and can cover any questions from the development team. I'll update you once I have information on our progress.

    Rex
  • Any update on this Rex?

  • Hi, Chris,

    We found a missing IRQ by looking at the xHCI IRQ registers mapped into PCIe memory space. We are currently clarifying with the hardware designer how the hardware is supposed to work.

    Rex
  • Great thanks for the update
  • Chris,

    We are running a week-long test of the fix, and there have been no issues in the first 2 days. As soon as the patches are available, I may need you to test them on your platform.

    Rex
  • Yes I received an update from Vignesh with info on code/patch changes.  I can't try it right away, but should be able to look at it in the next few days.

    Glad to hear it is working well so far.

  • The patch has worked very well, we haven't had a failure in a couple of days.

    The following additional changes should also be incorporated, as they resolved problems we had to fix to get our product to work.

    The first change fixes an off-by-one error with the interrupt lines.

    The second change extends a patch developed for errata i870, which we found applies to RC operation as well as to EPs. Thanks very much for your help!

    diff --git a/drivers/pci/dwc/pci-dra7xx.c b/drivers/pci/dwc/pci-dra7xx.c
    old mode 100644
    new mode 100755
    index defa272..6245d89
    --- a/drivers/pci/dwc/pci-dra7xx.c
    +++ b/drivers/pci/dwc/pci-dra7xx.c
    @@ -238,8 +238,8 @@ static int dra7xx_pcie_init_irq_domain(struct pcie_port *pp)
     		dev_err(dev, "No PCIe Intc node found\n");
     		return -ENODEV;
     	}
    -
    -	dra7xx->irq_domain = irq_domain_add_linear(pcie_intc_node, 4,
    +	// PCI interrupt lines start at 1 not zero so need to add 1
    +	dra7xx->irq_domain = irq_domain_add_linear(pcie_intc_node, 4 + 1,
     						   &intx_domain_ops, pp);
     	if (!dra7xx->irq_domain) {
     		dev_err(dev, "Failed to get a INTx IRQ domain\n");
    @@ -706,10 +706,16 @@ static int __init dra7xx_pcie_probe(struct platform_device *pdev)
     		dra7xx_pcie_writel(dra7xx, PCIECTRL_TI_CONF_DEVICE_TYPE,
     				   DEVICE_TYPE_RC);
     
    +		// Errata i870 applies to RC as well as EP
    +		ret = dra7xx_pcie_ep_legacy_mode(dev);
    +		if (ret)
    +			goto err_gpio;
    +
     		ret = dra7xx_add_pcie_port(dra7xx, pdev);
     		if (ret < 0)
     			goto err_gpio;
     		break;
  • Chris,

    Thanks for your feedback. I'll close the thread for now. If anything comes up, please submit a new thread.

    Rex
  • For anyone else going through this and requiring the fix:

    >>> So, could you try reverting commit 8c934095fa2f3 and also apply
    >>> below patch and let me know if that fixes the issue?
    >>>
    >>> -----------
    >>>
    >>> diff --git a/drivers/pci/dwc/pci-dra7xx.c
    >>> b/drivers/pci/dwc/pci-dra7xx.c index e77a4ceed74c..8280abc56f30
    >>> 100644
    >>> --- a/drivers/pci/dwc/pci-dra7xx.c
    >>> +++ b/drivers/pci/dwc/pci-dra7xx.c
    >>> @@ -259,10 +259,17 @@ static irqreturn_t dra7xx_pcie_msi_irq_handler(int irq, void *arg)
    >>> u32 reg;
    >>>
    >>> reg = dra7xx_pcie_readl(dra7xx,
    >>> PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI);
    >>> + dra7xx_pcie_writel(dra7xx,
    >>> + PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI, reg);
    >>>
    >>> switch (reg) {
    >>> case MSI:
    >>> - dw_handle_msi_irq(pp);
    >>> + /*
    >>> + * Need to make sure no MSI IRQs are pending before
    >>> + * exiting handler, else the wrapper will not catch new
    >>> + * IRQs. So loop around till dw_handle_msi_irq() returns
    >>> + * IRQ_NONE
    >>> + */
    >>> + while (dw_handle_msi_irq(pp) != IRQ_NONE);
    >>
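
The comment in the quoted patch describes a race between newly arriving MSIs and the point at which the wrapper status is cleared. The following is a toy model of that race, not a description of the actual hardware: the class `ToyMsiHardware`, both handler functions, and in particular the rule that the wrapper latches only when the controller goes from "nothing pending" to "something pending" are assumptions made for illustration, inferred from the patch comment.

```python
class ToyMsiHardware:
    """Toy model: a controller with per-vector pending bits behind a wrapper latch."""

    def __init__(self):
        self.pending = 0          # controller MSI status bits, one per vector
        self.wrapper_irq = False  # stands in for the wrapper's IRQSTATUS_MSI bit
        self.handled = []         # vectors dispatched so far

    def raise_msi(self, vector):
        # Assumption: the wrapper only latches on an idle -> pending transition.
        was_idle = self.pending == 0
        self.pending |= 1 << vector
        if was_idle:
            self.wrapper_irq = True

    def handle_pending_once(self, arrivals):
        """One pass over the pending bits; returns False if nothing was pending."""
        snapshot = self.pending
        if snapshot == 0:
            return False
        for vector in range(32):
            if snapshot & (1 << vector):
                self.pending &= ~(1 << vector)
                self.handled.append(vector)
                # A device may raise another MSI while we are dispatching.
                if arrivals:
                    self.raise_msi(arrivals.pop(0))
        return True


def old_handler(hw, arrivals):
    """Handle once, then clear the wrapper status at the end."""
    hw.handle_pending_once(arrivals)
    hw.wrapper_irq = False


def new_handler(hw, arrivals):
    """Clear the wrapper status first, then loop until nothing is pending."""
    hw.wrapper_irq = False
    while hw.handle_pending_once(arrivals):
        pass


def is_stuck(hw):
    """An MSI is pending but no wrapper interrupt will announce it."""
    return hw.pending != 0 and not hw.wrapper_irq


if __name__ == "__main__":
    for handler in (old_handler, new_handler):
        hw = ToyMsiHardware()
        hw.raise_msi(0)
        handler(hw, [1])  # vector 1 arrives while vector 0 is being handled
        print(handler.__name__, "handled", hw.handled, "stuck:", is_stuck(hw))
```

In this model the old ordering leaves vector 1 pending with the wrapper latch cleared, so no further interrupt is delivered, while the new ordering drains both vectors. The bounded loop in the later 4.9.41 diff in this thread adds a retry limit on top of the same idea.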
  • Hi Chris,

    What kernel version were these changes tested with? Commit 8c934095fa2f3 isn't in TI's 04.00.00.04 or 04.01.00 release, and PCIe fails to initialize when I apply all of the changes described here to the 04.01 kernel.

  • Most of it was done on 4.9.45 taken from the TI git repository.  I have patched SDK 04.01.00.06 without any issues as well.  I think TI has released the fixes to the repository as they have asked me to test them.

    This is my current change in 4.9.45, you may need to adjust the line #s a bit, but I think 4.9.41 in 04.01.00.06 was the same:

    wel52996@ubuntuvm:~/map300/ti-linux-kernel-dev/KERNEL$ git diff drivers/pci/dwc/pci-dra7xx.c drivers/pci/dwc/pcie-designware-host.c
    diff --git a/drivers/pci/dwc/pci-dra7xx.c b/drivers/pci/dwc/pci-dra7xx.c
    index 6245d89..0a0a43f
    --- a/drivers/pci/dwc/pci-dra7xx.c
    +++ b/drivers/pci/dwc/pci-dra7xx.c
    @@ -257,10 +257,19 @@ static irqreturn_t dra7xx_pcie_msi_irq_handler(int irq, void *arg)
            u32 reg;
     
            reg = dra7xx_pcie_readl(dra7xx, PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI);
    +       // cgw possible fix for PCIe USB host failure
    +    dra7xx_pcie_writel(dra7xx, PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI, reg);
     
            switch (reg) {
            case MSI:
    -               dw_handle_msi_irq(pp);
    +//             dw_handle_msi_irq(pp);
    +       /*
    +        * Need to make sure no MSI IRQs are pending before
    +        * exiting handler, else the wrapper will not catch new
    +        * IRQs. So loop around till dw_handle_msi_irq() returns
    +        * IRQ_NONE
    +        */
    +        while (dw_handle_msi_irq(pp) != IRQ_NONE);
                    break;
            case INTA:
            case INTB:
    @@ -271,8 +280,8 @@ static irqreturn_t dra7xx_pcie_msi_irq_handler(int irq, void *arg)
                    break;
            }
     
    -       dra7xx_pcie_writel(dra7xx, PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI, reg);
    -
    +//     dra7xx_pcie_writel(dra7xx, PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI, reg);
    +//
            return IRQ_HANDLED;
     }
     
    diff --git a/drivers/pci/dwc/pcie-designware-host.c b/drivers/pci/dwc/pcie-designware-host.c
    index abc137c..41dab6a 100644
    --- a/drivers/pci/dwc/pcie-designware-host.c
    +++ b/drivers/pci/dwc/pcie-designware-host.c
    @@ -69,8 +69,11 @@ irqreturn_t dw_handle_msi_irq(struct pcie_port *pp)
                            while ((pos = find_next_bit(&val, 32, pos)) != 32) {
                                    irq = irq_find_mapping(pp->irq_domain,
                                                           i * 32 + pos);
    +// cgw to try to fix PCIe USB host problem
    +//                             generic_handle_irq(irq);
                                    dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_STATUS +
                                                        i * 12, 4, 1 << pos);
    +// cgw to try to fix PCIe USB host problem, revert git SHA 8c934095fa2f3
                                    generic_handle_irq(irq);
                                    pos++;
                            }
  • The TI linux git host can be found at git.ti.com/ti-linux-kernel
  • I'm finding the patch doesn't work in the 4.9.41 kernel. I'm looking into the problem.
  • Thanks Chris. I added the following commits from ti-linux-kernel branch "ti-linux-4.9.y" to our kernel based on TISDK 4.9.41 and am still having issues with PCIe. We are using an NVMe device so it is possible that it is related to the NVMe driver as well.

    404d6b8 nvme-pci: Use PCI bus address for data/queues in CMB
    94e2881 PCI: designware-ep: Fix ->get_msi() to check MSI_EN bit
    e43479f pci: dwc: pci-dra7xx: Improve MSI IRQ handling
    2d40690 PCI: dra7xx: Clear IRQSTATUS_MSI as soon as its read
    d089e27 PCI: dwc: pci-dra7xx: Enable x2 mode support
    892037c PCI: dwc: dra7xx: Add support for SoC specific compatible strings

    TI has added the patches you listed to ti-linux-4.9.y (v4.9.67 347081a98a8a8f935c2a5f3de574fcf939abaaa5) along with several other changes related to PCIe, I'm wondering if you've tested with this branch and your device already? Did you end up needing the additional i870 errata and interrupt line fixes?
  • I haven't been able to get TI's fix for our USB host lock-up working in 4.9.41, although it works fine in 4.9.45. I'm troubleshooting it in 4.9.41 now: the PCI probe fails when the patch is applied.

    I haven't tested 4.9.67 yet. We have to have a test load ready for the holiday break, and 4.9.41 is currently the target.

    We need both the i870 and interrupt line fixes. The i870 fix was required to fix a problem we had with proper enumeration of PEX 8606 PCI switches we have under the PCI hosts. We needed the interrupt line fixes because we are using both PCI hosts on the AM5728 along with the PCI switches underneath them and hit an out of index error with the existing logic.

    You'll see the following error if you need the interrupt line fix:

    [ 1.855950] error: hwirq 0x4 is too large for dummy

    You'll see an incorrect class identifier for the PCI device if you hit the i870 error:

    [ 0.919376] pci 0001:01:00.0: [10b5:8606] type 00 class 0x060400
    [ 0.919393] pci 0001:01:00.0: ignoring class 0x060400 (doesn't match header type 00)
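
The "hwirq 0x4 is too large" message quoted above follows from how a linear IRQ domain is indexed: a domain created with size N accepts hardware IRQ numbers 0 to N-1, while the interrupt-map in the device tree numbers the legacy lines 1 to 4. A minimal sketch of that bounds check, using a hypothetical `ToyLinearDomain` class rather than the kernel's real data structures:

```python
class ToyLinearDomain:
    """Toy stand-in for a linear IRQ domain: valid hwirq values are 0..size-1."""

    def __init__(self, size):
        self.size = size
        self.mapped = []

    def map(self, hwirq):
        if hwirq >= self.size:
            raise ValueError("hwirq 0x%x is too large" % hwirq)
        self.mapped.append(hwirq)
        return hwirq


# INTA..INTD as numbered in the interrupt-map entries (1-based).
INTX_LINES = [1, 2, 3, 4]


def map_all(size):
    """Try to map all four legacy lines; return (mapped, failure messages)."""
    domain = ToyLinearDomain(size)
    failures = []
    for hwirq in INTX_LINES:
        try:
            domain.map(hwirq)
        except ValueError as exc:
            failures.append(str(exc))
    return domain.mapped, failures


if __name__ == "__main__":
    print("size 4:", map_all(4))
    print("size 5:", map_all(5))
```

With size 4 the fourth line (INTD) is rejected, which only shows up when a device actually uses it; with size 4 + 1 all four lines map, at the cost of an unused slot 0.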
  • Hi, Anna,

    I'll check internally on this 4.9.41 issue you and Chris are having.

    Hi, Chris,

    Just curious. What's the difference between your earlier test on 4.9.41 vs the failure you encounter now on the same version?

    Rex

  • It is very strange as I can find no difference in the PCI code between 4.9.45 and 4.9.41, but when I use the patch to correct the USB hub lock up in 4.9.41, PCI fails to initialize:

    root@arm:~# dmesg | grep -i pci
    [    0.711726] PCI: CLS 0 bytes, default 64
    [    0.733321] dra7-pcie 51000000.pcie: Linked as a consumer to phy-4a094000.pciephy.3
    [    0.733470] dra7-pcie 51000000.pcie: GPIO lookup for consumer (null)
    [    0.733477] dra7-pcie 51000000.pcie: using device tree for GPIO lookup
    [    0.733505] of_get_named_gpiod_flags: parsed 'gpios' property of node '/ocp/axi@0/pcie@51000000[0]' - status (0)
    [    0.733656] dra7-pcie 51000000.pcie: Dropping the link to phy-4a094000.pciephy.3
    [    0.733725] dra7-pcie: probe of 51000000.pcie failed with error -22

    Neither PCI host is seen.  I'm currently trying to isolate the specific code in the patch that is causing the initialization problem.

  • I've isolated the problem to the unaligned access patch. It is using incorrect values for the bit settings in the .dtsi file causing the PCI layer to fail to initialize.

    I'll post a corrected patch for the 4.9.41 kernel once I've completed testing.
  • 4.9.41-pci_fixes.diff
    diff --git a/arch/arm/boot/dts/dra7.dtsi b/arch/arm/boot/dts/dra7.dtsi
    index e3d6165..6d524eb 100644
    --- a/arch/arm/boot/dts/dra7.dtsi
    +++ b/arch/arm/boot/dts/dra7.dtsi
    @@ -318,6 +318,7 @@
     				num-lanes = <1>;
     				linux,pci-domain = <0>;
     				ti,hwmods = "pcie1";
    +				
     				phys = <&pcie1_phy>;
     				phy-names = "pcie-phy0";
     				interrupt-map-mask = <0 0 0 7>;
    @@ -325,6 +326,7 @@
     						<0 0 0 2 &pcie1_intc 2>,
     						<0 0 0 3 &pcie1_intc 3>,
     						<0 0 0 4 &pcie1_intc 4>;
    +        			ti,syscon-unaligned-access = <&scm_conf1 0x14 3>;
     				status = "disabled";
     				pcie1_intc: interrupt-controller {
     					interrupt-controller;
    @@ -342,9 +344,10 @@
     				num-ib-windows = <4>;
     				num-ob-windows = <16>;
     				ti,hwmods = "pcie1";
    +                                
     				phys = <&pcie1_phy>;
     				phy-names = "pcie-phy0";
    -				syscon-legacy-mode = <&scm_conf1 0x14 2>;
    +				ti,syscon-unaligned-access = <&scm_conf1 0x14 3>;
     				status = "disabled";
     			};
     		};
    @@ -355,8 +358,9 @@
     			#address-cells = <1>;
     			ranges = <0x51800000 0x51800000 0x3000
     				  0x0	     0x30000000 0x10000000>;
    -			status = "disabled";
    -			pcie@51800000 {
    +/*			status = "disabled"; */
    +/*			pcie@51800000 { */
    +			pcie2_rc: pcie@51800000 { 
     				compatible = "ti,dra7-pcie";
     				reg = <0x51800000 0x2000>, <0x51802000 0x14c>, <0x1000 0x2000>;
     				reg-names = "rc_dbics", "ti_conf", "config";
    @@ -377,6 +381,7 @@
     						<0 0 0 2 &pcie2_intc 2>,
     						<0 0 0 3 &pcie2_intc 3>,
     						<0 0 0 4 &pcie2_intc 4>;
    +        			ti,syscon-unaligned-access = <&scm_conf1 0x14 3>;
     				pcie2_intc: interrupt-controller {
     					interrupt-controller;
     					#address-cells = <0>;
    diff --git a/drivers/pci/dwc/pci-dra7xx.c b/drivers/pci/dwc/pci-dra7xx.c
    index defa272..cb37129 100644
    --- a/drivers/pci/dwc/pci-dra7xx.c
    +++ b/drivers/pci/dwc/pci-dra7xx.c
    @@ -238,8 +238,8 @@ static int dra7xx_pcie_init_irq_domain(struct pcie_port *pp)
     		dev_err(dev, "No PCIe Intc node found\n");
     		return -ENODEV;
     	}
    -
    -	dra7xx->irq_domain = irq_domain_add_linear(pcie_intc_node, 4,
    +        // PCI interrupt lines start at 1 not zero so need to add 1
    +	dra7xx->irq_domain = irq_domain_add_linear(pcie_intc_node, 4 + 1,
     						   &intx_domain_ops, pp);
     	if (!dra7xx->irq_domain) {
     		dev_err(dev, "Failed to get a INTx IRQ domain\n");
    @@ -255,12 +255,29 @@ static irqreturn_t dra7xx_pcie_msi_irq_handler(int irq, void *arg)
     	struct dw_pcie *pci = dra7xx->pci;
     	struct pcie_port *pp = &pci->pp;
     	u32 reg;
    +	int count = 0;
     
     	reg = dra7xx_pcie_readl(dra7xx, PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI);
    +	dra7xx_pcie_writel(dra7xx, PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI, reg);
     
     	switch (reg) {
     	case MSI:
    -		dw_handle_msi_irq(pp);
    +		/*
    +		 * Need to make sure no MSI IRQs are pending before
    +		 * exiting handler, else the wrapper will not catch new
    +		 * IRQs. So loop around till dw_handle_msi_irq() returns
    +		 * IRQ_NONE
    +		 */
    +		while (dw_handle_msi_irq(pp) != IRQ_NONE && count < 1000)
    +			count++;
    +
    +		if (count == 1000) {
    +			dev_err(pci->dev, "too much work in msi irq\n");
    +			dra7xx_pcie_writel(dra7xx,
    +					   PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI,
    +					   reg);
    +			return IRQ_HANDLED;
    +		}
     		break;
     	case INTA:
     	case INTB:
    @@ -271,8 +288,6 @@ static irqreturn_t dra7xx_pcie_msi_irq_handler(int irq, void *arg)
     		break;
     	}
     
    -	dra7xx_pcie_writel(dra7xx, PCIECTRL_DRA7XX_CONF_IRQSTATUS_MSI, reg);
    -
     	return IRQ_HANDLED;
     }
     
    @@ -548,7 +563,7 @@ static const struct of_device_id of_dra7xx_pcie_match[] = {
     };
     
     /*
    - * dra7xx_pcie_ep_legacy_mode: workaround for AM572x/AM571x Errata i870
    + * dra7xx_pcie_unaligned_memaccess: workaround for AM572x/AM571x Errata i870
      * @dra7xx: the dra7xx device where the workaround should be applied
      *
      * Access to the PCIe slave port that are not 32-bit aligned will result
    @@ -558,7 +573,7 @@ static const struct of_device_id of_dra7xx_pcie_match[] = {
      *
      * To avoid this issue set PCIE_SS1_AXI2OCP_LEGACY_MODE_ENABLE to 1.
      */
    -static int dra7xx_pcie_ep_legacy_mode(struct device *dev)
    +static int dra7xx_pcie_unaligned_memaccess(struct device *dev)
     {
     	int ret;
     	struct device_node *np = dev->of_node;
    @@ -566,25 +581,25 @@ static int dra7xx_pcie_ep_legacy_mode(struct device *dev)
     	unsigned int reg;
     	unsigned int field;
     
    -	regmap = syscon_regmap_lookup_by_phandle(np, "syscon-legacy-mode");
    +	regmap = syscon_regmap_lookup_by_phandle(np, "ti,syscon-unaligned-access");
     	if (IS_ERR(regmap)) {
    -		dev_dbg(dev, "can't get syscon-legacy-mode\n");
    +		dev_dbg(dev, "can't get syscon-unaligned-access\n");
     		return -EINVAL;
     	}
     
    -	if (of_property_read_u32_index(np, "syscon-legacy-mode", 1, &reg)) {
    -		dev_err(dev, "couldn't get legacy mode register offset\n");
    +	if (of_property_read_u32_index(np, "ti,syscon-unaligned-access", 1, &reg)) {
    +		dev_err(dev, "couldn't get unaligned access register offset\n");
     		return -EINVAL;
     	}
     
    -	if (of_property_read_u32_index(np, "syscon-legacy-mode", 2, &field)) {
    -		dev_err(dev, "can't get bit field for setting legacy mode\n");
    +	if (of_property_read_u32_index(np, "ti,syscon-unaligned-access", 2, &field)) {
    +		dev_err(dev, "can't get bit field for setting unaligned access mode\n");
     		return -EINVAL;
     	}
     
     	ret = regmap_update_bits(regmap, reg, field, field);
     	if (ret)
    -		dev_err(dev, "failed to set legacy mode\n");
    +		dev_err(dev, "failed to set unaligned access mode\n");
     
     	return ret;
     }
    @@ -701,6 +716,11 @@ static int __init dra7xx_pcie_probe(struct platform_device *pdev)
     	if (dra7xx->link_gen < 0 || dra7xx->link_gen > 2)
     		dra7xx->link_gen = 2;
     
    +	// Errata i870 applies to RC as well as EP
    +	ret = dra7xx_pcie_unaligned_memaccess(dev);
    +	if (ret)
    +		goto err_gpio;
    +
     	switch (mode) {
     	case DW_PCIE_RC_TYPE:
     		dra7xx_pcie_writel(dra7xx, PCIECTRL_TI_CONF_DEVICE_TYPE,
    @@ -714,10 +734,6 @@ static int __init dra7xx_pcie_probe(struct platform_device *pdev)
     		dra7xx_pcie_writel(dra7xx, PCIECTRL_TI_CONF_DEVICE_TYPE,
     				   DEVICE_TYPE_EP);
     
    -		ret = dra7xx_pcie_ep_legacy_mode(dev);
    -		if (ret)
    -			goto err_gpio;
    -
     		ret = dra7xx_add_pcie_ep(dra7xx, pdev);
     		if (ret < 0)
     			goto err_gpio;
    

    The attached diff contains the changes required for 4.9.41 (the kernel used in SDK 04.01.00.06) to fix the following PCI issues:

    1) Off-by-one error with legacy interrupts

    2) Unaligned access failure

    3) Hung PCIe USB host, and potentially other devices

    #2 and #3 follow the changes that TI is posting to correct the problems. #1 is a quick fix, as the proposed TI patch requires a much later 4.9 kernel.

    I haven't had a chance to soak test the changes, but initial testing has been good so far.
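
The diff above passes the third cell of the device-tree property to regmap_update_bits() as both the mask and the value, so the difference between a cell of 2 and a cell of 3 is simply which register bits get set. A minimal sketch of that arithmetic follows; `update_bits` mirrors the read-modify-write that regmap_update_bits() performs, and `bits_set` is a hypothetical helper. Which bit belongs to which PCIe controller is not stated in this thread and is not assumed here.

```python
def update_bits(old, mask, val):
    """Read-modify-write as in regmap_update_bits(): only bits in mask change."""
    return (old & ~mask) | (val & mask)


def bits_set(field, old=0):
    """Bit positions set after using field as both mask and value."""
    new = update_bits(old, field, field)
    return [bit for bit in range(32) if new & (1 << bit)]


if __name__ == "__main__":
    for field in (2, 3):
        print("field", field, "sets bits", bits_set(field))
```

A cell of 2 sets only bit 1 of the register at offset 0x14, while a cell of 3 sets bits 0 and 1 together.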