AM5716: Very sporadic PCIe bus read errors

Part Number: AM5716

On products with the AM57xx processor we see sporadic read errors when we address peripheral device registers that are mapped into the PCIe address space.

The values read back are always 0xFFFFFFFF.

The errors may be related to a particular load pattern: QSPI NOR flash accesses and NAND accesses occurring at the same time as PCIe register poll reads.

The operating system we use is VxWorks 6.9. We have a custom design with either one or two PCIe lanes in use.

  • Are any PCIe errors generated?

    Thanks for the register logs.  We will review and get back to you.

  • PCIECTRL_TI_CONF_IRQSTATUS_RAW_MAIN says ERR_AXI

    and

    PCIECTRL_RC_DBICS2_DEV_CAS reports NFT_DET

  • Hello Martin,

    -0-

    We reviewed the inputs in this thread and the before/after register dumps sent by Karim. A few things do stand out. At a high level, it looks like some event happened which caused some hiccups. It is not clear which block was the aggressor and which were the victims of this event. You should review the registers and external bus states of each party to the issue, and focus your experiments on these.

    The below comments are based on register reviews. There wasn't any hardware/board context discussed, so possible issues on that side have not been considered.

    -1-

    On the A15 side, the VxWorks 6.9 configuration of the ACTLR register has a poor value (0x00C00001). Please change it to 0x1E000040 and then retest the system. The SMP bit should be set because this is an MPCore chassis, even if there is only one CPU; for this CPU, that was fleshed out during the initial Linux bring-up. The 0x1Exxxxxx portion deals with some rare (but seen in some production) A15-specific errata. While this change seems to have a low chance of fixing your issue, it is very easy to try, and if it made a difference it could short-circuit a lot of investigation. I've not seen an internal review of VxWorks on this CPU, so it's hard to gauge what other kinds of issues might need examination.

    -2-

    The registers in the dump show a number of L3 errors both before and after the event. You should clear all of these error statuses (in L3_FLAGMUX_REGERR0) before running your test so as to exclude them from contributing to the issue. Some of these logged errors may have happened when a bootloader first turned on some clocks without taking all considerations into account, and may not matter... or they may have contributed to the issue. For example, your system has logged errors associated with the 2D engine, including an L4 timeout. If the 2D engine was free-running and corrupting memory, it might mess up memory which PCIe or QSPI is relying on.

    -3-

    Across the event, new L3 errors show up in the QSPI target (caused by some bad ARM access), and additionally a PCIE1 L3 timeout is flagged. The PCIe errors you report in this thread map to a non-fatal AXI error. That AXI flag is generated by an overflow of the slave response table. This is not an expected error, and it carries only a generic description about the need to flow-control the input/output queues. My guess is that its output was frozen while new inputs were pushed, and the error happened. This would be the source of the PCIE1 timeout.

    Given your information, and the three errors apparently close in time... the story which comes to mind is that one of the masters does something wrong, and this results in a disruption of PCIe processing for more than 65K L3 clock cycles. Often this kind of issue results when a central resource like the DDR is having a problem. Users then see secondary effects (timeouts, over/underflows, ...).

    If, say, your external PCIe device's clock stops while it is in the middle of servicing a transaction to DDR, this would be recorded as a PCIe L3 timeout. If some DMA was at the same time pulling data from QSPI and its processing was disrupted, the ARM might try to do some inappropriate accesses while the DMA was still in flight (causing the QSPI error).

    Or, say, the ARM did something bad to the QSPI device while some DMA was pulling data from it into DDR in parallel, and that resulted in a path stall. The PCIe might then not be served for too long, resulting in a timeout and the non-fatal error. The L3 error at the QSPI is of the 'standard' type, meaning it occurred on the request phase of the transaction. It might be some VxWorks mis-setup of an MMU resource causing some kind of illegal burst or the like to the control registers.

    In the above two stories, it's not clear who is the aggressor and who is the victim. "Most" of the errors of this nature I've seen were the fault of the external PCIe device disrupting the memory path.

    You should examine the registers of the QSPI space also.  If some DMA channel is linked to it for service, I'd disable it as a test.  I'd also look at using the bad read as a trigger source to stop bus analyzers on PCIe and maybe QSPI.  You might see some anomaly (which could be externally induced).

    Hopefully, this context will help in running down the issue.

    Regards,

    Richard W.

  • Hello Richard,

    First of all, I want to thank you, and the team, for the prompt response.

    For the ACTLR settings:
    we found that some ARM_ERRATA_799270 code is clearing the SMP bit in our BSP.

    We wonder whether this errata workaround is needed and how to deal with it.

    Nevertheless, we changed the value to 0x1E000040 as requested, but saw no change.

    The register clearing in -2- also was not successful,

    and we are still busy inspecting all of -3-.

    But we have now found a trigger for the JTAG debugger when the PCIe endpoint signals parity errors on the bus. After the correct reception of the interrupt register, the connection seems to break, and even in the interrupt service routine itself we read register contents of 0xFFFF, so we have a before state and an after state, both of which can be seen with the debugger in the call stack.

    So the plan would be to trigger the oscilloscope from the interrupt routine via a GPIO toggle and try to capture the data on the bus.

    So it seems that first the TX breaks and then the RX (from the CPU's perspective).

    Are there any known conditions under which parity errors could possibly occur?

     We will come back with new findings next week, tomorrow will be a public holiday.

    Thank you very much so far.

    regards,

    Martin

  • Hello Richard,

    an update on -2-

    We cleared all error bits before the error condition,

    then the only differences are (old values in parentheses):

    L3_FLAGMUX_CLK1MERGE_REGERR0 = 2 (0)

    L3_FLAGMUX_CLK1MERGE_REGERR1 = 2  (0)

    L3_FLAGMUX_REGERR0 = 0x20  (0)

    L3_FLAGMUX_REGERR1 = 0x20  (0)

    and on the PCIe we see the ERR_AXI bit set

    Another question would be whether we could somehow use the PCIe debug registers

    PCIECTRL_TI_CONF_DEBUG_CFG and PCIECTRL_TI_CONF_DEBUG_DATA,

    but we have no clue how to use them; we do see some values change if we change the CFG register.

    regards

  • Hello Martin,

    From your notes, I understand you cleared the existing L3 errors, then ran across the error and only saw 0x20 in REGERR0 and REGERR1. The name REGERR0 is not unique. Based on the previous register diff, I assume it is CLK1_FLAGMUX_CLK1_2.L3_FLAGMUX_REGERR0 = 0x20. Is this correct? If so, that bit maps to a 'replicator timeout'. To decode this further, it is necessary to look at the FLAGMUX TIMEOUT values. Previously, these showed a PCIe timeout; probably that is the case again.

    In your first note, you mention trying the ACTLR value and not seeing a difference. This was mostly expected. The ARM errata that talks about manipulation of the SMP bit is not applicable to the AM5716 configuration. It is a single-core device with no ACP port, so the run-time case around L2 shutdown with a DVM conflict should not happen. I suspect that code was copied and pasted from some other platform. Anyway, it's not central to your main issue.

    Your finding of a parity error and the idea to trigger on the bus seem good. Have you been able to get that capture in place? After seeing the parity error, my guess tends toward some kind of signaling issue. I am not an expert in this area and will ask someone else to add further comments. I suspect they might inquire about board layout, signal levels as seen on the scope, and PDN topics. I did some searches on the TI PCI documents for this IP and found the reporting of the error documented, but not much about avoidance or recovery. A Google search did pull up some useful reads (https://www.eetimes.com/pesky-pci-x-parity-errors/ and https://buttersideup.com/edac-ukuug-2006-talk/slides/foil06.html); their contents match the types of things that might create this error on the links. I assume you are digging into these signaling issues in parallel with trying to understand the SoC side of it.

    Regarding PCIECTRL_TI_CONF_DEBUG_CFG and PCIECTRL_TI_CONF_DEBUG_DATA: these allow selecting and routing out some internal IP state-machine signals for observation on a logic analyzer. In the documents I scanned, I can only see which blocks can route signals to them, not the specific signals. It doesn't seem likely these internal signals will be of much direct use to you in debugging.

    Does this error only happen in a single use case or directed test, or does it happen across generic usage? It may be worth mapping out everything that is active (and not active) during the issue to try to narrow down what is triggering the apparent signaling issue. If you are triggering this with something like a 'spark' ESD test, then it's a different type of process.

    Regards,

    Richard W.

  • Regarding PCIe issue, some common causes are:

    - Refclk issues: Is a common clock used between RC and EP? Can you describe the topology and confirm the SW configuration bits for the Refclk?

    - PHY config issues: Can you confirm that the "PCIe PHY Subsystem Low-Level Programming Sequence" in the TRM is being followed, along with the "Preferred PCIe_PHY_RX SCP Register Settings"

    - PCB signal integrity issues: Can you confirm that the routing rules in app note SPRAAR7H "High-Speed Interface Layout Guidelines" are being followed?

  • Hi Richard,

    we will share new logs and handle this issue offline.

    regards

    Karim

    TI embedded processing

    Senior Member Technical Staff

  • Please obtain dumps of the below iATU registers for all 16 outbound iATU regions.  16 separate dumps are needed, updating the PCIECTRL_PL_IATU_INDEX register between each dump to select the viewport into the desired outbound region.

     

    PCIECTRL_PL_IATU_INDEX

    PCIECTRL_PL_IATU_REG_CTRL_1

    PCIECTRL_PL_IATU_REG_CTRL_2

    PCIECTRL_PL_IATU_REG_LOWER_BASE

    PCIECTRL_PL_IATU_REG_UPPER_BASE

    PCIECTRL_PL_IATU_REG_LIMIT

    PCIECTRL_PL_IATU_REG_LOWER_TARGET

    PCIECTRL_PL_IATU_REG_UPPER_TARGET

    PCIECTRL_PL_IATU_REG_CTRL_3