When attempting to read data from a PCIe device (via either mmap or file I/O), an unhandled fault occurs

Hello,

I am trying to port and cross-compile a PCIe driver for a Xilinx FPGA endpoint device to our DM8148 EVM board (TMDXEVM8148 from Mistral), but for some reason the error below occurs whenever a PCI read is performed. This driver runs well on Ubuntu 10.04, but on Arago it fails on every read.

Unhandled fault: Precise External Abort on non-linefetch (0x1018)

Writes can be executed without issue, but when a read is attempted, regardless of whether the interface is file I/O  or mmap, the error listed above results. In this particular kernel configuration, I deactivated MSI, but the error will occur when MSI is activated as well.
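
For anyone trying to reproduce the mmap read path without the vendor driver, here is a minimal sketch. It assumes the BAR is reached through the standard Linux sysfs resource file (for example /sys/bus/pci/devices/0000:01:00.0/resource0) rather than through the driver's own device node, and the helper name bar_read32 is made up for illustration.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read one 32-bit word at byte offset `off` from the file at `path`
 * by mapping the page that contains it.
 * Returns 0 on success, -1 on error.
 * Assumption: `path` is a sysfs BAR file such as
 * /sys/bus/pci/devices/0000:01:00.0/resource0, not the driver's node.
 */
int bar_read32(const char *path, off_t off, uint32_t *value)
{
    long page = sysconf(_SC_PAGESIZE);
    off_t base;
    void *map;
    int fd;

    if (page <= 0 || off < 0 || (off & 3) != 0)
        return -1;

    /* mmap offsets must be page aligned */
    base = off & ~((off_t)page - 1);

    fd = open(path, O_RDONLY | O_SYNC);
    if (fd < 0)
        return -1;

    map = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, base);
    close(fd); /* the mapping stays valid after close() */
    if (map == MAP_FAILED)
        return -1;

    /* volatile forces a single 32-bit load from the mapped window */
    *value = *(volatile uint32_t *)((char *)map + (off - base));

    munmap(map, (size_t)page);
    return 0;
}
```

On the failing board the abort would be expected at the volatile load, since that is the point where the CPU issues the actual read to the PCIe window.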

I've activated PCI debugging in the kernel and when the error occurs,   the following debug data is provided:

ti81xx_pcie: Data abort: address = 0x4056d000 fsr = 0x1018 PC =0x00008efc LR = 0x402c99e4
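
The fsr value in that line can be decoded by hand. The sketch below assumes the Cortex-A8 in the DM8148 reports it in the ARMv7 short-descriptor DFSR layout; the struct and function names are hypothetical and do not come from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields of the ARMv7 short-descriptor Data Fault Status Register. */
struct dfsr_fields {
    unsigned fs;     /* 5-bit fault status: bit 10 joined with bits [3:0] */
    unsigned domain; /* bits [7:4] */
    bool is_write;   /* bit 11 (WnR): set when a write caused the abort */
    bool ext;        /* bit 12 (ExT): implementation-defined abort type */
};

struct dfsr_fields dfsr_decode(uint32_t fsr)
{
    struct dfsr_fields f;

    f.fs = (((fsr >> 10) & 1u) << 4) | (fsr & 0xfu);
    f.domain = (fsr >> 4) & 0xfu;
    f.is_write = ((fsr >> 11) & 1u) != 0;
    f.ext = ((fsr >> 12) & 1u) != 0;
    return f;
}

/* FS == 0b01000: synchronous (precise) external abort, not on a table walk. */
bool dfsr_is_sync_external_abort(uint32_t fsr)
{
    return dfsr_decode(fsr).fs == 0x08u;
}
```

For 0x1018 this gives FS = 0x08, domain 1 and WnR = 0, that is, a precise external abort caused by a read, which matches the observation that only reads fail.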

There's a lot of data printed to the screen when PCI debugging is activated, so in order to provide the output from "lspci -v"  minus the debug print statements,  I rebuilt the kernel with PCI DEBUG deactivated, and rebooted the DM8148EVM Board.

root@dm814x-evm:~/Linux_Driver# ./lspci -v
00:00.0 Class 0604: Device 104c:8888 (rev 01)
        Flags: bus master, fast devsel, latency 0
        Memory at <ignored> (32-bit, non-prefetchable)
        Memory at <ignored> (32-bit, prefetchable)
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        Memory behind bridge: 20000000-200fffff
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [70] Express Root Port (Slot-), MSI 00
        Capabilities: [100] Advanced Error Reporting

01:00.0 Class 0580: Device 10ee:0505
        Flags: fast devsel, IRQ 48
        Memory at 20000000 (32-bit, non-prefetchable) [size=64K]
        Memory at 20010000 (32-bit, non-prefetchable) [size=64K]
        Capabilities: [40] Power Management version 3
        Capabilities: [48] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [60] Express Endpoint, MSI 00
        Capabilities: [100] Device Serial Number 00-00-00-00-00-00-00-00

root@dm814x-evm:~/Linux_Driver#

Any help with this issue would be greatly appreciated,

Thanks,

Dave

  • Dave,

    Can you provide the output of "lspci -vv" (extra 'v') *after* you get the above error?

    Also, is there any way on the FPGA side to check whether the read access is reaching the FPGA?

    Thanks.

       Hemant

  • Hi Hemant,

    Thanks for the quick reply!  

    On Monday,  we can check to make sure that the FPGA is receiving the read access and once that test is performed, I'll update the ticket with our results. 

    Here's the lspci -vv output that is produced after our read test  fails:

    root@dm814x-evm:~/Linux_Driver# ./lspci -vv
    00:00.0 Class 0604: Device 104c:8888 (rev 01)
            Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
            Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
            Latency: 0, Cache Line Size: 64 bytes
            Region 0: Memory at <ignored> (32-bit, non-prefetchable)
            Region 1: Memory at <ignored> (32-bit, prefetchable)
            Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
            Memory behind bridge: 20000000-200fffff
            Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
            BridgeCtl: Parity+ SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
                    PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
            Capabilities: [40] Power Management version 3
                    Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                    Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
            Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
                    Address: 0000000000000000  Data: 0000
            Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
                    DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                            ExtTag- RBE+ FLReset-
                    DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                            MaxPayload 128 bytes, MaxReadReq 512 bytes
                    DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                    LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s, Latency L0 <2us, L1 <64us
                            ClockPM- Surprise- LLActRep+ BwNot-
                    LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- Retrain- CommClk-
                            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                    LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
                    RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
                    RootCap: CRSVisible-
                    RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                    DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ ARIFwd-
                    DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- ARIFwd-
                    LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
                             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                             Compliance De-emphasis: -6dB
                    LnkSta2: Current De-emphasis Level: -6dB
            Capabilities: [100 v1] Advanced Error Reporting
                    UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                    UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                    UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                    CESta:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                    CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                    AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-

    01:00.0 Class 0580: Device 10ee:0505
            Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
            Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
            Interrupt: pin A routed to IRQ 48
            Region 0: Memory at 20000000 (32-bit, non-prefetchable) [size=64K]
            Region 1: Memory at 20010000 (32-bit, non-prefetchable) [size=64K]
            Capabilities: [40] Power Management version 3
                    Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                    Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
            Capabilities: [48] MSI: Enable- Count=1/1 Maskable- 64bit+
                    Address: 0000000000000000  Data: 0000
            Capabilities: [60] Express (v2) Endpoint, MSI 00
                    DevCap: MaxPayload 256 bytes, PhantFunc 1, Latency L0s <64ns, L1 <1us
                            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                    DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                            MaxPayload 128 bytes, MaxReadReq 512 bytes
                    DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                    LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s, Latency L0 unlimited, L1 unlimited
                            ClockPM- Surprise- LLActRep- BwNot-
                    LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
                            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                    LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                    DevCap2: Completion Timeout: Not Supported, TimeoutDis-
                    DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                    LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
                             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                             Compliance De-emphasis: -6dB
                    LnkSta2: Current De-emphasis Level: -6dB
            Capabilities: [100 v1] Device Serial Number 00-00-00-00-00-00-00-00

    Thanks again for the help!

    -Dave

  • Hello Hemant,

    Confirming whether the read access request reaches the FPGA is a bit more difficult than I thought - we can do it, but it's going to require some time on our end.

    We've been going through the RC driver code and we came across something that looks suspicious in the  set_inbound_trans() routine.  The comments don't line up with the code. It appears to us that BAR1 should be configured here, but in the first few raw_write calls, BAR0 is being referenced and not BAR1.   Could there be an issue with this code?

    /* Configure BAR1 only if inbound window is specified */
        if (ram_base != ram_end) {
            /*
             * Set Inbound translation. Skip BAR0 as it will have h/w
             * default set to open application register space.
             *
             * The value programmed in IB_STARTXX registers must be same as
             * the one set in corresponding BAR from PCI config space.
             *
             * We use translation 'offset' value to yield 1:1 mapping so as
             * to have physical address on RC side = Inbound PCIe link
             * address. This also ensures no overlapping with base/limit
             * regions (outbound).
             */
            __raw_writel(ram_base, reg_virt + IB_START0_LO);
            __raw_writel(0, reg_virt + IB_START0_HI);
            __raw_writel(1, reg_virt + IB_BAR0);
            __raw_writel(ram_base, reg_virt + IB_OFFSET0);

            /*
             * Set BAR1 mask to accomodate inbound window
             */

            set_dbi_mode();

            __raw_writel(1, reg_virt + SPACE0_LOCAL_CFG_OFFSET +
                    PCI_BASE_ADDRESS_1);

            __raw_writel(ram_end - ram_base, reg_virt +
                SPACE0_LOCAL_CFG_OFFSET + PCI_BASE_ADDRESS_1);

            clear_dbi_mode();

            /* Set BAR1 attributes and value in config space */
            __raw_writel(ram_base | PCI_BASE_ADDRESS_MEM_PREFETCH,
                    reg_virt + SPACE0_LOCAL_CFG_OFFSET
                    + PCI_BASE_ADDRESS_1);

            /*
             * Enable IB translation only if BAR1 is set. BAR0 doesn't
             * require enabling IB translation as it is set up in h/w
             */
            __raw_writel(IB_XLAT_EN_VAL | __raw_readl(reg_virt +
                        CMD_STATUS), reg_virt + CMD_STATUS);
        }

    Once again, thank you for any help that you can give us with our problem,

    Dave

  • Dave,

    BAR0 doesn't require inbound translation as it is hardwired.

    For the rest of the BARs, we need to set up inbound translation. Thus, for BAR1, we use the inbound translation registers.

    As you can see in the code above, we set 1 in IB_BAR0 register to indicate that it corresponds to BAR1.
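
A small software model may make the window-to-BAR relationship easier to see. This is only a sketch: the struct, the function name and the translation rule (local = PCIe address - start + offset) are assumptions for illustration, inferred from the driver comment about a 1:1 mapping, and are not taken from the TI driver or the hardware manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* One inbound window: hypothetical model, not the driver's data structure. */
struct ib_window {
    uint32_t start;  /* IB_STARTn_LO: PCIe-side base, same as the BAR value */
    uint32_t offset; /* IB_OFFSETn: local (RC-side) base address */
    uint32_t size;   /* window size in bytes */
    unsigned bar;    /* IB_BARn: BAR number this window is matched against */
};

/*
 * Translate a PCIe address that hit BAR `bar_hit`.
 * Assumed rule: local = pcie_addr - start + offset.
 * Returns false when the access does not belong to this window.
 */
bool ib_translate(const struct ib_window *w, unsigned bar_hit,
                  uint32_t pcie_addr, uint32_t *local)
{
    if (bar_hit != w->bar)
        return false;
    if (pcie_addr < w->start || pcie_addr - w->start >= w->size)
        return false;
    *local = pcie_addr - w->start + w->offset;
    return true;
}
```

With start and offset both set to ram_base and bar set to 1, as in the quoted code, an address that hits BAR1 comes out unchanged, while an access that hits BAR0 never uses this window.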

       Hemant

  • Hi Hemant,

    Thanks for the clarification with IB_BAR0 corresponding to BAR1.   We're still working on the read issue that we are having here and once we have some more data points, I'll update the ticket again. If you think of anything in the meantime, please let us know.  Thanks again for your help, we appreciate it!

    -Dave

  • Hi Hemant,

    We've worked through our read problem, which turned out to be a couple of separate issues.  It seems that lseek wasn't working properly for us at first, so we implemented our own lseek. When we lseek'ed, our file descriptor's offset was always set to 0 instead of the location we wanted to seek to.  Our write tests were writing to location 0, so it wasn't an issue there.
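
A userspace check along these lines can catch that symptom early. It is only a sketch with a made-up helper name, and it assumes the driver's llseek handler returns the position it actually set, so that a stuck offset shows up in the lseek() return value.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Seek to `off`, then read `len` bytes.
 * Returns the number of bytes read, -1 on an I/O error, or -2 when
 * lseek() reports a position other than the one requested (the symptom
 * described above, where the offset stayed at 0).
 */
ssize_t read_at_checked(int fd, off_t off, void *buf, size_t len)
{
    off_t pos = lseek(fd, off, SEEK_SET);

    if (pos == (off_t)-1)
        return -1;
    if (pos != off)
        return -2;
    return read(fd, buf, len);
}
```

Running the write tests at a non-zero offset through the same kind of check would have exposed the problem on the write side as well.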

    However, we've hit another roadblock and we're wondering if you could recommend a solution.  When we boot up, we have to reload the FPGA with a new image and then re-enumerate/rescan the PCI bus, and for some reason we can't seem to do that.   We're looking into ways of doing it, but if you have any suggestions we'd appreciate it.

    Thanks,

    Dave

  • Dave,

    I am not getting the exact use case / sequence you use, so tried putting it as per my understanding below. Please correct / update missing points:

    1) FPGA is connected as EP and on power on, the PCIe link is established, FPGA code (already burned) here takes care of setting up PCIe on FPGA side

    2) RC s/w downloads the complete firmware over PCIe and indicates restart/reload on FPGA

    3) PCIe is reset on FPGA side

    4) Now force re-enumeration from RC - here FPGA is not detected.

    Do the above steps roughly capture what you are doing? I actually have concerns about step 3, or anything you may be doing as part of that step, and hence need clarification before commenting.

       Hemant 

  • The sequence is accurate.  

    We will start with a default FPGA image that has a valid PCIe endpoint at power-on, and then will want to reconfigure the FPGA with a new bitstream via SW from the ARM.  During this time the PCIe bus will go down (tri-state). However, I have not been able to figure out how to force the RC (DM8148) to re-enumerate the bus.  I have enabled CONFIG_HOTPLUG and attempted to use sysfs to remove the PCIe EP (during FPGA reload) and rescan the PCI RC, but it does not seem to rescan or re-enumerate the bus.

  • Dave,

    DM8148 PCIe doesn't support hotplug in h/w. Also, a simple 'rescan' wouldn't work, as even the PCIe interface on the RC side needs to be re-initialized if the peer (the FPGA) has put the link into tristate.

    One question: can you not manage without resetting PCIe interface on FPGA?

       Hemant 

  • Hi Hemant,

    Unfortunately, our only other option is to leverage the Xilinx FPGA ability to do partial reconfiguration of the FPGA which would leave the PCIe interface up and running.  However, we don’t have the schedule for this option and the HW group has not signed up to do it.  The worst case is to reconfigure the FPGA and then do a full kernel reboot, but our boot time requirement makes that the least desirable option.  Is it possible to just do a full restart of the PCIe subsystem OR to hold off initializing the PCIe Subsystem until after we download the new FPGA image from Ethernet and reload it?  We only have the one end-point so I don’t care if we bring the entire PCIe bus down and back up.  I know PCI is somewhat integral to the kernel and perhaps is not possible without a full reboot.

    -Dave

  • David,

    Since partial loading is not an option for you, the only way is to modify the initialization sequence in the Root Complex driver. Also, since your use case has only one endpoint (and is perhaps also an integrated system), you can use a simpler solution, such as invoking most of the PCIe initialization code from the RC driver when you do echo 1 > /sys/bus/pci/rescan.

    I can give you the exact steps, but let me know if this is an acceptable solution for you.

       Hemant 

  • Hi Hemant,

    I think your solution will work for us. Could you please send us the steps to make it happen?

    Thanks again for all your help,

    -Dave

  • Hi Hemant,

    If you could provide us with the steps to implement your solution to our problem, we'd greatly  appreciate it!

    Thanks,

    Dave

  • Dave,

    I am summarizing the steps below. Please note that this is untested and I have yet to verify it, so I haven't put it in the RC driver guide:

    A) Reorganize PCIe Setup Code

    1) We need to apply a local reset to the PCIe module. This means the various PCIe configurations performed before the peer reset will be lost and need to be repeated.
    Thus, split the ti816x_pcie_setup() routine into two groups:
    - One-time setup
    - Hardware configuration

    2) Create a new function ti816x_pcie_hw_setup().
    Move the complete code block that follows these lines in ti816x_pcie_setup():

    if (clk_enable(pcie_ck))
        goto err_clken;

    Leave only the platform hookup code, the abort handler and similar one-time code in ti816x_pcie_setup().
    Copy the h/w setup code into ti816x_pcie_hw_setup().

    3) Ensure the local and global variables needed inside this function are provided.
    Toggle PRCM LRST (0x48180b10 bit 7) for PCIe from 1 to 0 (set it to 1, then set it to 0) before proceeding to program the PCIe registers.

    Add PMR Interrupt Handler (optional)

    Add an interrupt handler for handling the reset interrupt.
    Register it from ti816x_pcie_setup().
    The interrupt handler should schedule a tasklet/work to call ti816x_pcie_hw_setup().
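
The LRST toggle described in A) could look roughly like the sketch below. The helper and macro names are hypothetical, the register pointer is assumed to come from a mapping of the address quoted above (for example via ioremap() in the driver), and the thread does not say whether a delay is needed between the two writes.

```c
#include <stdint.h>

#define PRCM_PCIE_LRST_ADDR 0x48180b10u /* register address quoted above */
#define PRCM_PCIE_LRST_BIT  (1u << 7)   /* PCIe local reset bit quoted above */

/*
 * Pulse the PCIe local reset: set bit 7, then clear it.
 * All other bits of the register are preserved.
 * `reg` must point at a mapping of PRCM_PCIE_LRST_ADDR.
 * Hypothetical helper, not part of the TI driver.
 */
void pcie_lrst_pulse(volatile uint32_t *reg)
{
    uint32_t v = *reg;

    *reg = v | PRCM_PCIE_LRST_BIT;  /* set the reset bit to 1 */
    /* real hardware may need a short delay here; none is given in the thread */
    *reg = v & ~PRCM_PCIE_LRST_BIT; /* set the reset bit back to 0 */
}
```

Keeping the pulse in its own helper makes it easy to call as the first step of ti816x_pcie_hw_setup(), before any PCIe register is programmed.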

    B) Re-enumeration

    Call ti816x_pcie_hw_setup() from ti816x_pcie_scan() just before pci_scan_bus() call.

    This call must be made from inside "if (nr == 0)" conditional block. 

    Force Re-enumeration

    From root prompt:

    echo 1 > /sys/bus/pci/rescan

    Note: The rescan may also be triggered on reset interrupt by adding interrupt handler for PCIe PMR (Power management & Reset) interrupt (IRQ=50). This will avoid the need for rescan from shell.
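
If the rescan has to be triggered from an application instead of the shell, the same sysfs write can be done from C. A minimal sketch, with a made-up helper name; the path is the one used in the echo command above.

```c
#include <fcntl.h>
#include <unistd.h>

#define PCI_RESCAN_PATH "/sys/bus/pci/rescan"

/*
 * Write "1" to a sysfs trigger file, as `echo 1 > path` would.
 * Returns 0 on success, -1 on error.
 */
int sysfs_trigger(const char *path)
{
    int fd = open(path, O_WRONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = write(fd, "1\n", 2);
    if (close(fd) != 0)
        return -1;
    return n == 2 ? 0 : -1;
}
```

Calling sysfs_trigger(PCI_RESCAN_PATH) as root after the FPGA reload is then equivalent to the shell command.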

       Hemant

  • Hi Hemant

    Thank you again for your help - we've got things working here and we're continuing on with our development efforts.

    Once again,  Thanks!

    -Dave

  • Dear Hemant,

    I tried to follow your recommendations for EZSDK 5.03.01.15 / DM8148. We have the following problem: after booting the DM8148, we have to configure an Altera FPGA (using the DM8148 with the PS configuration scheme) as the PCIe endpoint device, because we would like to omit a dedicated FPGA configuration device. When we do so, the PCIe endpoint is never recognized by the root complex, not even if we cycle the power for the DM8148. The only way to get it recognized is to have the FPGA already configured before the DM8148 is powered up for the first time. I hoped to solve the issue by re-scanning the PCIe bus according to your proposal, but it seems the ti81xx_pcie_scan() function is not triggered by invoking "echo 1 > /sys/bus/pci/rescan". At least there is no message like "Starting PCI scan .." in the dmesg log. In contrast, such a message is found in the log during the boot procedure. Additionally, I don't yet understand when exactly I have to toggle the PRCM_LRST bit: before or after re-scanning the bus? Inside or outside pcie_ti81xx.c? If you like, I can provide my modifications to pcie_ti81xx.c. Will there be any support for PCIe rescanning in a future EZSDK? This is important to know because some hardware design decisions we have to make right now depend on it.

    best regards

    Christian

  • Christian,

    Most likely the FPGA is sending reset command to DM8148 which is why it is not detected when initialized after DM8148.

    As I mentioned in earlier post on this forum, you can hook up reset interrupt handler and carry out reset of PCIe on DM8148 and trigger re-enumeration.

    The catch here is, the FPGA should be detected (link should be established) *after* carrying PCIe reset on DM8148 and *before* doing re-enumeration.

    You can check the DEBUG0 register @0x51001728 to see if the LS 5 bits are 0x11 after resetting DM8148 PCIe (PRCM reset + start link training).
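
The DEBUG0 check can be wrapped in a small helper. This is a sketch only: the macro and function names are made up, and the register pointer is assumed to come from a mapping of the address given above.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCIE_DEBUG0_ADDR       0x51001728u /* address quoted above */
#define PCIE_DEBUG0_STATE_MASK 0x1fu       /* least-significant 5 bits */
#define PCIE_DEBUG0_LINK_UP    0x11u       /* value to expect, per the post */

/* True when the low 5 bits of a DEBUG0 value read 0x11. */
bool pcie_link_is_up(uint32_t debug0)
{
    return (debug0 & PCIE_DEBUG0_STATE_MASK) == PCIE_DEBUG0_LINK_UP;
}

/*
 * Poll a mapped DEBUG0 register up to `max_polls` times.
 * Returns true as soon as the link reports up, false otherwise.
 */
bool pcie_wait_link_up(const volatile uint32_t *debug0_reg, unsigned max_polls)
{
    while (max_polls--) {
        if (pcie_link_is_up(*debug0_reg))
            return true;
    }
    return false;
}
```

Re-enumeration would then be started only after pcie_wait_link_up() returns true, which matches the ordering constraint described above.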

  • Dear Hemant, Christian and David,

    We are having the same problem (the FPGA EP comes up after power-up, so we need to rescan).
    Did you succeed in solving the issue?
    Any help would be appreciated.

    Thanks
    Best regards
    Omar

  • I have a suggestion.
    The rescan from the shell calls, in pci_sysfs.c:

    static ssize_t bus_rescan_store(struct bus_type *bus, const char *buf,
    size_t count)

    which calls

    pci_rescan_bus(b);

    and not the scan routine in the driver.
    So I think we need to change bus_rescan_store() to do:

    sys_data = b->sysdata;
    sys_data->hw->scan(0, sys_data);

    so that the driver's scan will be called.
    But I think the scan should know whether it is being called for a rescan or for a first scan, so that it can call pci_rescan_bus() instead of pci_scan_bus():

    if (ti81xx_pcie_rescan == 0) {
        pr_info(DRIVER_NAME ": scan...\n");
        bus = pci_scan_bus(0, &ti81xx_pci_ops, sys);
        ti81xx_pcie_rescan = 1;
    } else {
        pr_info(DRIVER_NAME ": rescan scan...\n");
        pci_rescan_bus(sys->bus);
    }

    I will try to do that...
    Let me know what you think.

    Omar

  • Close to working...

    What happens is that, during a rescan with the reset set and then cleared in the ti816x_pcie_hw_setup() function, the system hangs.
    With the reset removed it doesn't hang, but of course that is not what we need. In fact, without the reset in ti816x_pcie_hw_setup(), the rescan seems to work sometimes, and at other times I still get the Data abort error.
    Maybe we need to mask interrupts during the reset operations?
    (The interrupts are already enabled during the first scan.)

    Omar