This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

Linux/AM5728: Unidentified PCIe transactions

Part Number: AM5728
Other Parts Discussed in Thread: SYSCONFIG

Tool/software: Linux

We are troubleshooting an issue that involves unidentified PCIe transactions that originate from the am5728. 

What makes these transactions stand apart from the normal traffic we recognize, is the transaction length and the target address.

1) The target address is typically an invalid address with respect to our custom hardware.  This causes timeouts and other issues to occur.

2) The transaction length is either 16 DW or 2 DW reads and appear as 1 x 16 DW or 8 x 2 DW

Example transactions are pasted below.

We suspect these transactions may be associated with caching, but have verified that caching is disabled on this space.  Are there other blocks in the am5728 that generate transactions with these characteristics?

Is the "firewall" in the L3 interconnect something we could use to protect this space and get a clue about the originator?

We have attempted to use the MMU to protect this space, but the unidentified transactions still occur with no seg faults.  Intentionally accessing the invalid addresses in this region confirms the MMU protection does cause seg faults.  This result seems to point away from our software application having a bad pointer or something.

Our kernel driver does not need to access this space, so we disabled the code that iomaps the BAR4 region associated with the addresses below.  We think this rules out a bad pointer in the driver.

We have also injected purposeful PCIe transactions with known payload to check for correlation between application code, driver ISR, driver ioctl, etc. and the unidentified transactions do not appear to correlate with the software we have instrumented.

Are there any tools, techniques or clues we can use to identify the originator of these transactions?

1 x 16 DW:

Timestamp      

Link              

Payload            

Address[31:0]      

PacketType        

STP_SeqNum        

Fmt                

Length            

'517.255 858 us'

SA 1_Dn

0x23A0B3E4

MRd(32)

0x07AE

0x0

1

'518.934 962 us'

SA 1_Up

0008 0053

CplD

0x0B71

0x2

1

'519.535 579 us'

SA 1_Dn

0000 0002

0x23A0B3E4

MWr(32)

0x07AF

0x2

1

'523.707 103 us'

SA 1_Dn

0x23A0C3E4

MRd(32)

0x07B0

0x0

1

'525.388 081 us'

SA 1_Up

0008 0000

CplD

0x0B72

0x2

1

'525.980 265 us'

SA 1_Dn

0000 0002

0x23A0C3E4

MWr(32)

0x07B1

0x2

1

'548.397 053 us'

SA 1_Dn

0x23A010D0

MRd(32)

0x07B2

0x0

1

'550.066 787 us'

SA 1_Up

0000 0000

CplD

0x0B73

0x2

1

'550.786 403 us'

SA 1_Dn

0000 0000

0x23A010D0

MWr(32)

0x07B3

0x2

1

'6.041 890 314 ms'

SA 1_Dn

0x22ED1640

MRd(32)

0x07B4

0x0

16

'6.085 532 026 ms'

SA 1_Up

0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0006 0000 0000 0000 0000 0000 0000 FFFF FFFF 0000 0000 FFFF FFFF 0000 0000 0000 0000 0000 0000

 

CplD

0x0B74

0x2

16

'72.100 189 796 ms'

SA 1_Dn

0x22A26014

MRd(32)

0x07B5

0x0

1

'72.101 878 270 ms'

SA 1_Up

0000 0000

CplD

0x0B75

0x2

1

'72.103 840 348 ms'

SA 1_Dn

00DF FFFF

0x22A26014

MWr(32)

0x07B6

0x2

1

'72.107 631 450 ms'

SA 1_Dn

0x23A26014

MRd(32)

0x07B7

0x0

1

'72.109 325 546 ms'

SA 1_Up

0000 0000

CplD

0x0B76

0x2

1

'72.109 981 446 ms'

SA 1_Dn

00DF FFFF

0x23A26014

MWr(32)

0x07B8

0x2

1

'72.118 503 461 ms'

SA 1_Dn

0x22A26044

MRd(32)

0x07B9

0x0

1

'72.120 211 612 ms'

SA 1_Up

0000 0000

CplD

0x0B77

0x2

1

'72.120 878 756 ms'

SA 1_Dn

00DF FFFF

0x22A26044

MWr(32)

0x07BA

0x2

1

'72.123 342 129 ms'

SA 1_Dn

0x23A26044

MRd(32)

0x07BB

0x0

1

8 x 2 DW:

Timestamp      

Link              

Payload            

Address[31:0]      

PacketType        

STP_SeqNum        

Fmt                

Length            

'47.429 512 949 ms'

SA 1_Dn

0000 0100

0x23A053E4

MWr(32)

0x0924

0x2

1

'47.430 232 565 ms'

SA 1_Dn

0x23A053E0

MRd(32)

0x0925

0x0

1

'47.431 940 716 ms'

SA 1_Up

0000 0001

CplD

0x0F21

0x2

1

'47.432 901 141 ms'

SA 1_Dn

0x23A053E4

MRd(32)

0x0926

0x0

1

'47.434 607 418 ms'

SA 1_Up

0000 0003

CplD

0x0F22

0x2

1

'47.436 982 713 ms'

SA 1_Dn

0x23A053E4

MRd(32)

0x0927

0x0

1

'47.438 653 384 ms'

SA 1_Up

0000 0003

CplD

0x0F23

0x2

1

'47.439 301 788 ms'

SA 1_Dn

0000 0200

0x23A053E4

MWr(32)

0x0928

0x2

1

'47.440 001 727 ms'

SA 1_Dn

0x23A053E0

MRd(32)

0x0929

0x0

1

'47.441 702 382 ms'

SA 1_Up

0000 0001

CplD

0x0F24

0x2

1

'67.402 187 047 ms'

SA 1_Dn

0x22ED54D0

MRd(32)

0x092A

0x0

2

'67.402 203 913 ms'

SA 1_Dn

0x22ED54D8

MRd(32)

0x092B

0x0

2

'67.402 228 275 ms'

SA 1_Dn

0x22ED54E0

MRd(32)

0x092C

0x0

2

'67.402 245 141 ms'

SA 1_Dn

0x22ED54E8

MRd(32)

0x092D

0x0

2

'67.402 269 503 ms'

SA 1_Dn

0x22ED54F0

MRd(32)

0x092E

0x0

2

'67.402 286 369 ms'

SA 1_Dn

0x22ED54F8

MRd(32)

0x092F

0x0

2

'67.402 306 983 ms'

SA 1_Dn

0x22ED54C0

MRd(32)

0x0930

0x0

2

'67.402 327 597 ms'

SA 1_Dn

0x22ED54C8

MRd(32)

0x0931

0x0

2

'67.445 874 672 ms'

SA 1_Up

0000 0000 0000 0000

CplD

0x0F25

0x2

2

'67.488 474 440 ms'

SA 1_Up

0000 0000 0000 0000

CplD

0x0F26

0x2

2

'67.531 137 924 ms'

SA 1_Up

0000 0000 0000 0000

CplD

0x0F27

0x2

2

'67.573 802 345 ms'

SA 1_Up

0000 0000 0000 0000

CplD

0x0F28

0x2

2

'67.616 435 845 ms'

SA 1_Up

0000 0000 0000 0000

CplD

0x0F29

0x2

2

'67.659 099 329 ms'

SA 1_Up

0000 0000 0000 0000

CplD

0x0F2A

0x2

2

'67.701 729 081 ms'

SA 1_Up

0000 0000 0000 0000

CplD

0x0F2B

0x2

2

'67.744 434 730 ms'

SA 1_Up

0000 0000 0000 0000

CplD

0x0F2C

0x2

2

'77.187 984 545 ms'

SA 1_Dn

0x22A26014

MRd(32)

0x0932

0x0

1

'77.189 687 074 ms'

SA 1_Up

0000 0000

CplD

0x0F2D

0x2

1

'77.191 280 911 ms'

SA 1_Dn

00DF FFFF

0x22A26014

MWr(32)

0x0933

0x2

1

'77.195 169 461 ms'

SA 1_Dn

0x23A26014

MRd(32)

0x0934

0x0

1

'77.196 879 486 ms'

SA 1_Up

0000 0000

CplD

0x0F2E

0x2

1

'77.197 496 032 ms'

SA 1_Dn

00DF FFFF

0x23A26014

MWr(32)

0x0935

0x2

1

'77.204 953 615 ms'

SA 1_Dn

0x22A26044

MRd(32)

0x0936

0x0

1

'77.206 607 420 ms'

SA 1_Up

0000 0000

CplD

0x0F2F

0x2

1

'77.207 280 186 ms'

SA 1_Dn

00DF FFFF

0x22A26044

MWr(32)

0x0937

0x2

1

'77.209 571 151 ms'

SA 1_Dn

0x23A26044

MRd(32)

0x0938

0x0

1

'77.211 264 310 ms'

SA 1_Up

0000 0000

CplD

0x0F30

0x2

1

'77.211 864 927 ms'

SA 1_Dn

00DF FFFF

0x23A26044

MWr(32)

0x0939

0x2

1


$ lspci -vv


00:00.0 PCI bridge: Texas Instruments Device 8888 (rev 01) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Region 0: Memory at 20100000 (32-bit, non-prefetchable) [size=1M]
Region 1: Memory at 20020000 (32-bit, non-prefetchable) [size=64K]
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
Prefetchable memory behind bridge: 21000000-23ffffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity+ SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 0000000092e0f000 Data: 0000
Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag- RBE+ FLReset-
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM L0s L1, Latency L0 <512ns, L1 <64us
ClockPM- Surprise- LLActRep+ BwNot+
LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x2, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
RootCap: CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Kernel driver in use: pcieport

01:00.0 Unassigned class [ff00]: PLD APPLICATIONS Device 1100 (rev 01)
Subsystem: Tektronix Device 108c
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 468
Region 0: Memory at 21400000 (64-bit, prefetchable) [size=16K]
Region 2: Memory at 21000000 (64-bit, prefetchable) [size=4M]
Region 4: Memory at 22000000 (64-bit, prefetchable) [size=32M]
Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [78] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [80] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #1, Speed 5GT/s, Width x4, ASPM unknown, Latency L0 <4us, L1 <1us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x2, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
Status: NegoPending- InProgress-
Capabilities: [200 v1] Vendor Specific Information: ID=a000 Rev=0 Len=044 <?>
Kernel driver in use: plda

  • Hi, Shane,

    You may have received from other channel that we suspect it is the DMA doing the read. The behavior matches the DMA burst size.
    Could you try to reconfigure the link for MaxReadRequestSize and MaxPayload = 128B and see of the issue persists?

    Rex
  • We are working on recompiling the FPGA with the parameters reconfigured as you requested. Should have some results soon.

    Meanwhile, we are noticing a pattern in the sequence of 8 - 2DW transactions, that you can see in the data attached above. Notice how the sequence is out of order:

    0x22ED54D0
    0x22ED54D8
    0x22ED54E0
    0x22ED54E8
    0x22ED54F0
    0x22ED54F8
    0x22ED54C0
    0x22ED54C8

    Does the out of order sequence implicate or rule out any possible originator of these transactions?
  • Hi, Shane,

    PCIe supports relaxed ordering. It should be ok with the out of sequence transaction.
    Does your system have DMA enabled? Is it possible to disable it to see if that stops the unidentified transactions?

    Rex
  • The PLDA core in our FPGA has a DMA engine we typically use to move large data blocks.  In the interest of diagnosing this problem, we have disabled that DMA feature and the problem still happens.

    We are starting to focus on the fact the PLDA bridge declares the regions we are having trouble with as "prefetchable."  We found some posts on TI E2E and stack overflow that describe a problem that seems similar to what we are experiencing:

    PCI prefetching even without the prefetchable flag

    Linux PCI Driver, mmap prefetching

    As we investigated the PLDA modifications required to adjust the max payload size and max read request size, we also looked at configuring these regions as non-prefetchable.  However, it does not appear to be possible to make this configuration change to the core itself.

    Currently, we are experimenting with the work around described in the links above.  If the work around is effective, we will be left with some open questions about how these unidentified transaction are happening.

    In particular, these transactions do not appear in conjunction with transactions addressed to locations that are valid.  It would be more intuitive to associate prefetch with this problem, if the unidentified transactions were for locations adjacent to valid addresses.  However, these transactions are usually very distant from valid addresses with respect to our hardware.

    Could you comment on how prefetch might explain the symptoms described in the linked articles, which seem to match the symptoms we are experiencing?  Can you confirm this could explain our symptoms?

  • Hi Rex,

    I have an update on the suggestion to MaxReadRequestSize and MaxPayload = 128B. We seem to be unable to adjust those parameters for our PLDA core. As you will notice in lspci output above, MaxPayload is already 128B. However, the PLDA core configuration was set to 256B, so this parameter seems to be hardwired. We did modify MaxReadRequest in the core configuration, but it remains at 512B in the lspci results. It appears both of these parameters are not SW or HW adjustable.

    However, I have good news. The SW workarounds -- see E2E and stack overflow links above -- for disabling prefetch appear to be working. We have tested the workaround for 24 hours with multiple test cases and devices and the unidentified 64-byte transactions are not happening.

    Our confidence in this work around would benefit from any insight you might have about where these transactions were coming from and how the work around is preventing them. The original post described incompatibility with 64-byte transactions, but that was not our main issue. Our main issue was the transactions were targeting random and often invalid addresses.

    Any thoughts on these latest results?

    Thank you!
  • Hi, Shane,

    Glad to hear that the workaround seems to stop the unidentified transactions. I had started an internal thread yesterday to pull in resources for this issue and trying to explain the prefetch behavior. I'll update you the progress on this effort. Currently, we don't have answers yet.

    Rex
  • Thank you, Rex.

    I have a couple of charts to share that might help with brainstorming about where the transactions where coming from. These are the “before” charts, but I am not including a chart post workaround, because it would blank!

    The horizontal axis is in seconds, so the timespan represented below is about 16 hours. The vertical axis is the target address of the transactions. Blue transactions are 8x2DW and orange transactions are 1x16 DW. The green bands are regions of our custom hardware that are known to be disrupted by these transactions.

    Some patterns to note are 1) the concentration just below 0x23000000 move to other locations on reboot, 2) the transactions away from the concentrated location cluster in vertical stripes that are uniformly separated, 3) there are no transactions outside of the regions associated with BARs of our device (see lspci output above).

    Below is a different test run over about 2 hours, following a reboot. Notice that the concentration moved from near 0x23000000 to nearby 0x22400000. This shift appears to be random between reboots. The most puzzling thing is the consistently spaced vertical stripes, which exist in the previous chart, but were not obvious on first glance.

     

    The best idea I have to narrow down the originator of these transactions is to revert the workaround and instrument the TLB miss handler in the kernel. I think we could log the program counter value from the TLB miss exception when the physical page number matches the green locations above. Maybe the log data could be written to mailbox locations that we capture with the PCIe bus analyzer.  Since the workaround manipulates the Linux page tables, it seems plausible this TLB miss experiment would turn up a clue.

    With a workaround in hand, it may be some time before this next experiment is conducted. Thank you for continuing to think on an explanation.

  • Hi, Shane,

    Interesting diagrams. Does each blue dot stand for a 8x2 DW read or just a single 2DW read and should group 8 of them as one transaction?
    If it is 8x2DW read, it will mean in that sec, there are multiple transactions spread out accessing to different addresses.

    I have involved our silicon IP engineer and PCIe developer in the discussion. After seeing this diagram, I am thinking to involve our ARM expert to see if he has some insight to this or anything may be cache line related.

    I'll send the diagrams for internal discussion. Thanks for the info.

    Rex
  • Hi, Shane,

    I read your description again. The blue dot is 8x2DW transaction(period).

    Rex
  • Yes, if we zoom in on the blue dots we see groups of eight:

  • Hi, Shane,

    Setting the PCIe memory to be non-prefetchable/non-cacheable is correct configuration. TRM mentions the following note in Section "PCIe Controller Slave Port" that PCIe addresses cannot be cacheable. "Because only incremental burst mode is supported, the PCIe addresses can not be in a cacheable memory space. Subsequently, cache coherence protocol is not involved, and the PCIe controller is not expected to receive coherent PCIe TLPs (NS=0). For safety purpose, the PCIe controller coherent feature must be permanently disabled (and this applies for the device by default) by keeping PCIECTRL_TI_CONF_SYSCONFIG[16] MCOHERENT_EN at value 0b0. In that case, inbound coherent PCIe TLPs ( that means with NS=0) proceed normally, but coherence will not be guaranteed by the device."

    We don't have exact reason for the unidentified transactions.

    Rex
  • Hi Rex,

    Thank you for following up with additional information on this. I think we are in good shape now, so I will mark this issue resolved.

    I did look into instrumenting the TLB miss handler, but I learned that TLB reload is implemented fully in hardware, so there is no software TLB miss handler to instrument. That was the only experiment I was still contemplating, so I am moving on to other things now.

    We appreciate your help!
  • Update: Last night we tested a modified FPGA build that has prefetch disabled at the HW level. We were able to remove the software workarounds and the unidentified transactions were still resolved.
  • Hi, Shane,

    Thanks for the update. It's good news.

    Rex