Apparent lock up issue with PCI2060 bridge in asynchronous clocking mode with write combining enabled.

I want to begin by saying that I called technical support and was told that "TI is migrating support to our E2E (Engineer to Engineer) Community Forums", so it's not obvious whether someone at TI is already working on this. In case they are, the ID is: [SR THREAD ID: 1-RGDCD1]

The issue only shows up after some time, so I suspect it has to do with synchronizer failure. In a nutshell, here's what I have found:

The primary bus can be running at 33.33 MHz or 66.66 MHz and the issue occurs either way. In all cases I run the secondary bus at 33.00 MHz using the SEC_ASYNC_CLK as described in this earlier thread. On the secondary side I have an FPGA doing DMA to the host memory. This consists of a long sequence of 16-word (one cache line) memory write bursts to successive addresses. In the FPGA I was able to see that there was nothing unusual about the transactions occurring just before the lock-up. I also had a PCI bus analyzer on the primary bus and saw nothing out of the ordinary just before the lock-up. Transactions just before the lock-up were coming through in a handful of clock cycles, indicating that the bridge's internal FIFO was empty or nearly empty. At lock-up the bridge accepted an additional 64 words (256 bytes) of data from the FPGA that did not get passed through to the primary bus. All subsequent attempts by the FPGA to write were terminated with retry. Once the lock-up occurs, only resetting the bridge allows further transactions in either direction.
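To keep the sizes above straight, here is the arithmetic as a short sketch. It assumes a "word" is one 32-bit PCI data phase (4 bytes), which is consistent with 64 words being 256 bytes; the variable names are only illustrative.

```python
# Sizes quoted in the post; a "word" is assumed to be one 32-bit data phase.
WORD_BYTES = 4
BURST_WORDS = 16        # one cache line per memory write burst
STRANDED_WORDS = 64     # accepted by the bridge at lock-up, never forwarded

burst_bytes = BURST_WORDS * WORD_BYTES             # bytes per burst
stranded_bytes = STRANDED_WORDS * WORD_BYTES       # bytes stuck in the bridge
stranded_bursts = STRANDED_WORDS // BURST_WORDS    # whole bursts stuck

print(f"burst size:       {burst_bytes} bytes")
print(f"stranded data:    {stranded_bytes} bytes")
print(f"stranded bursts:  {stranded_bursts}")
```

In other words, the bridge took exactly four more cache-line bursts before it began retrying, which would fit a posted-write buffer that fills and never drains. The buffer depth itself is inferred from this observation, not taken from the data sheet.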

As noted in the thread title, this only happens when the clocks are asynchronous and write combining is enabled in the bridge.  I can work around the issue by either turning off write combining or using the primary clock as the clock source.  As turning off write combining is the simplest solution for my system, and write combining itself doesn't appear to improve throughput in these conditions, I have used this as the preferred workaround.

My question is whether there is any known errata for the PCI2060 that might explain this behavior.  I did not see any errata for this device on the website.

  • Hello,

    There are no errata for this device.
    Can you attach a PCI register dump?
    This could also be a clock issue. Are you able to scope the clock signals in synchronous vs. asynchronous mode?

    Regards
  • This is a dump using PCI Tree before the lock-up.  The PC doesn't completely lock up, but attempting to "Rescan PCI Bus" from PCI Tree after the lock-up freezes the machine.  It's possible that this re-scan will complete at some point, and if so I'll post the registers after the lock-up.


    As for the clocks, it may take some time to set up the synchronous mode again. I'm not sure where you're going with the clock signals, though. When I turn off write combining, even in asynchronous clock mode, I don't get the lock-up. So it's hard to see how this would be a "clock issue" at the board level.

    :  :  :  3.01.0    3->5 (5)  PCI/PCI; Bridge Device  104c ac2c [TI] no device name found: no device description found // SubIDs 0000 0000   --------------:

    AC2C 104C <00 : DID VID
    02B0 0006 <04 : Stat Cmd
    0604 0000 <08 : BaseClass SubClass PgmIF RevID
    0001 2010 <0C : BIST Header LatTimer CacheLSize
    0000 0000 <10 : BAR 0
    0000 0000 <14 : BAR 1
    2005 0503 <18 : sLatTimer subBNr secBNr priBNr
    22A0 01F1 <1C : secStat  IOLimit IOBase
    F620 F620 <20 : MemLimit MemBase
    0001 FFF1 <24 : prefMemLimit prefMemBase
    0000 0000 <28 : prefBaseUpper32
    0000 0000 <2C : prefBaseUpper32
    0000 0000 <30 : IOLimitUpper16 IOBaseUpper16
    0000 00DC <34 : reserved
    0000 0000 <38 : Exp_ROM_BAR
    0000 0000 <3C : BrigeControl IntPin IntLine
    0200 0000 <40 : < dev.specific
    0000 0000 <44 : < dev.specific
    0000 0000 <48 : < dev.specific
    0000 0000 <4C : < dev.specific
    0000 0000 <50 : < dev.specific
    0000 0000 <54 : < dev.specific
    0000 0000 <58 : < dev.specific
    0000 0000 <5C : < dev.specific
    0000 0000 <60 : < dev.specific
    0000 0000 <64 : < dev.specific
    0000 0000 <68 : < dev.specific
    0000 0000 <6C : < dev.specific
    0000 0000 <70 : < dev.specific
    0000 0000 <74 : < dev.specific
    0000 0000 <78 : < dev.specific
    0000 0000 <7C : < dev.specific
    0000 0000 <80 : < dev.specific
    0000 0000 <84 : < dev.specific
    0000 0000 <88 : < dev.specific
    0000 0000 <8C : < dev.specific
    0000 0000 <90 : < dev.specific
    0000 0000 <94 : < dev.specific
    0000 0000 <98 : < dev.specific
    0000 0000 <9C : < dev.specific
    0000 0000 <A0 : < dev.specific
    0000 0000 <A4 : < dev.specific
    0000 0000 <A8 : < dev.specific
    0000 0000 <AC : < dev.specific
    0000 0000 <B0 : < dev.specific
    0000 0000 <B4 : < dev.specific
    0000 0000 <B8 : < dev.specific
    0000 0000 <BC : < dev.specific
    0000 0000 <C0 : < dev.specific
    0000 0000 <C4 : < dev.specific
    0000 0000 <C8 : < dev.specific
    0000 0000 <CC : < dev.specific
    0000 0000 <D0 : < dev.specific
    0000 0000 <D4 : < dev.specific
    0000 0000 <D8 : < dev.specific
    0001 0001 <DC : < dev.specific
    0000 0000 <E0 : < dev.specific
    0000 0000 <E4 : < dev.specific
    0000 0000 <E8 : < dev.specific
    0000 0000 <EC : < dev.specific
    0000 0000 <F0 : < dev.specific
    0000 0000 <F4 : < dev.specific
    0000 0000 <F8 : < dev.specific
    0000 0000 <FC : < dev.specific
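For readers who want to check the dump against the standard PCI-to-PCI bridge (type 1) header, the following is a minimal sketch that parses the first few lines and decodes the standard fields. It only uses the layout of the generic type 1 header; it does not attempt to decode the device-specific registers at 0x40 and above, and the function names are illustrative rather than part of any tool.

```python
import re

# A subset of the PCI Tree dump above: "<high word> <low word> <offset".
DUMP = """\
AC2C 104C <00 : DID VID
02B0 0006 <04 : Stat Cmd
0604 0000 <08 : BaseClass SubClass PgmIF RevID
0001 2010 <0C : BIST Header LatTimer CacheLSize
2005 0503 <18 : sLatTimer subBNr secBNr priBNr
22A0 01F1 <1C : secStat  IOLimit IOBase
F620 F620 <20 : MemLimit MemBase
0001 FFF1 <24 : prefMemLimit prefMemBase
"""

LINE = re.compile(r"^\s*([0-9A-Fa-f]{4}) ([0-9A-Fa-f]{4}) <([0-9A-Fa-f]{2})")

def parse_dump(text):
    """Return {offset: 32-bit register value} for every dump line found."""
    regs = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            hi, lo, off = (int(g, 16) for g in m.groups())
            regs[off] = (hi << 16) | lo
    return regs

def window(base16, limit16):
    """Decode a memory base/limit pair (1 MB granularity).

    Returns None when base > limit, i.e. the window is closed. The upper
    32 bits of a 64-bit prefetchable window are ignored here; they are
    all zero in this dump.
    """
    base = (base16 & 0xFFF0) << 16
    limit = ((limit16 & 0xFFF0) << 16) | 0xFFFFF
    return (base, limit) if base <= limit else None

def decode(r):
    return {
        "vendor_id": r[0x00] & 0xFFFF,
        "device_id": r[0x00] >> 16,
        "mem_space_enabled": bool(r[0x04] & 0x2),
        "bus_master_enabled": bool(r[0x04] & 0x4),
        "cache_line_dwords": r[0x0C] & 0xFF,
        "latency_timer": (r[0x0C] >> 8) & 0xFF,
        "header_type": (r[0x0C] >> 16) & 0x7F,
        "primary_bus": r[0x18] & 0xFF,
        "secondary_bus": (r[0x18] >> 8) & 0xFF,
        "subordinate_bus": (r[0x18] >> 16) & 0xFF,
        "secondary_latency_timer": r[0x18] >> 24,
        "mem_window": window(r[0x20] & 0xFFFF, r[0x20] >> 16),
        "pref_window": window(r[0x24] & 0xFFFF, r[0x24] >> 16),
    }

info = decode(parse_dump(DUMP))
for name, value in info.items():
    print(f"{name:24s} {value}")
```

Decoded this way, the cache line size register holds 0x10 (16 dwords, 64 bytes), which matches the 16-word bursts the FPGA generates. The non-prefetchable window is the single 1 MB range starting at 0xF6200000, and the prefetchable window is closed because its base is above its limit.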

  • After leaving the system overnight, it is still frozen so I am not able to post a register dump after lock-up.  I have put a scope on the clock signal that loops from SCLKO9 to SCLKI (secondary side loopback clock).  I don't see any significant difference between synchronous and asynchronous clock modes.  In these images, the overshoot/undershoot is most likely the result of using a fairly long ground lead on the scope probe rather than a real issue on the board (async clock is shown first):

    ASYNC_CK.TIF

    SYNC_CLK.TIF

  • This is odd,

    Is the issue dependent on the burst size?

    Is it dependent on the Host? Can you try a different host?

    I would think this could be a synchronization issue indeed but the fact that disabling write combining fixes the issue is odd.

    It would be interesting to see whether the Retry termination is coming from the bridge or directly from the host. Are you able to tell?

    Regards

  • Elias Villegas said:

    This is odd,

    Is the issue dependent on the burst size?

    That would be hard to check. The cycles are generated by an FPGA that is programmed to break up transfers on cache line boundaries. This means that a typical transfer consists of a series of 16-word bursts (64 bytes), and almost all of them would be potential candidates for write combining.

    Is it dependent on the Host? Can you try a different host?

    I have tried different hosts.  The issue appears more often on a newer machine, but all hosts I tried would eventually show the problem given a long enough transfer.

    I would think this could be a synchronization issue indeed but the fact that disabling write combining fixes the issue is odd.

    My only thought was that there could be an issue with the bridge's internal FIFO flags with respect to combining writes.  My observation was that the FIFO in the bridge is generally nearly empty because the secondary clock runs slightly slower than the primary side clock, so perhaps some combination of detecting a sequential write in the secondary clock domain as the FIFO was just becoming empty in the host clock domain could cause a failure.
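The rate argument above can be put in rough numbers. This is a minimal sketch, assuming a 32-bit data path on both sides (consistent with 16 words being 64 bytes) and ignoring address phases, turnaround cycles and arbitration, so the figures are upper bounds and not measured throughput. The function name is illustrative.

```python
def peak_rate_mb_s(clock_mhz, bus_bytes=4):
    """Peak burst rate in MB/s: one data phase of bus_bytes per clock."""
    return clock_mhz * bus_bytes

secondary_mb_s = peak_rate_mb_s(33.00)   # fill side (FPGA writes into the bridge)
primary_33_mb_s = peak_rate_mb_s(33.33)  # drain side, 33.33 MHz primary
primary_66_mb_s = peak_rate_mb_s(66.66)  # drain side, 66.66 MHz primary

# With 33.33 MHz against 33.00 MHz, the two clock edges slide past each
# other at the difference frequency, so every relative phase alignment
# recurs once per slip period.
slip_freq_mhz = 33.33 - 33.00
slip_period_us = 1.0 / slip_freq_mhz

print(f"secondary peak: {secondary_mb_s:.2f} MB/s")
print(f"primary peak:   {primary_33_mb_s:.2f} MB/s at 33.33 MHz, "
      f"{primary_66_mb_s:.2f} MB/s at 66.66 MHz")
print(f"phase slip period at 33.33 vs 33.00 MHz: {slip_period_us:.2f} us")
```

In both primary clock configurations the drain side is at least as fast as the fill side, which fits the observation that the FIFO sits nearly empty. Because the relative clock phase sweeps through every alignment roughly every 3 µs in the 33.33 MHz case, a narrow vulnerable window in the FIFO flag logic, if one exists, would be hit eventually during a long transfer. That is a plausibility argument only, not a statement about how the bridge is built.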

    It would be interesting to see whether the Retry termination is coming from the bridge or directly from the host. Are you able to tell?

    It comes from the bridge.  What I saw was that the last successful transfer went through the bridge almost immediately (very low latency) and then the host saw no more cycles from the bridge.  Up to that point there were very few retry terminations by the host, usually only at page boundaries.  The last successful transaction was not retried, and did not end at a page boundary.  Then on the secondary side, the bridge accepted another 256 bytes of data (without retry) and finally started infinite retries on the following transaction.  None of these transactions made it through to the host.  So it looks like the FIFO just filled up inside the bridge but no further host transactions were started.  In fact the bridge was not even requesting the host bus.

    Regards