AM6548: OSPI PHY configuration for SDR QSPI flash device

Dominic Rath

Part Number: AM6548

Dear TI Team,

following the advice we were given in https://e2e.ti.com/support/processors/f/791/t/878332 we are trying to use the PHY of the OSPI controller to achieve data rates above 50MHz.

We are using a Winbond W25Q128FW SDR QSPI flash device connected to OSPI0 of the AM65x. The device is operated in SPI Mode 0 and 1-1-4 Mode. We use a Fast Read Quad Output (6bh) command with 8 dummy cycles to read the training pattern that should work at up to 80 MHz, which is the frequency we're trying to achieve.
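As a rough sanity check on what 80 MHz buys for this command, here is a minimal sketch that counts the clock cycles of one 1-1-4 read. It assumes an 8-bit opcode and a 24-bit address, both sent on a single line, the 8 dummy cycles mentioned above, and 2 clocks per data byte on four lines. The function names are made up for illustration, and chip-select overhead is ignored.

```python
def read_clock_cycles(n_bytes, dummy_cycles=8, addr_bits=24):
    """Clock cycles for one 1-1-4 read (assumed frame layout, see lead-in)."""
    opcode_cycles = 8          # opcode shifted out on one line
    data_cycles = 2 * n_bytes  # two nibbles per byte on four lines
    return opcode_cycles + addr_bits + dummy_cycles + data_cycles


def effective_mbytes_per_s(n_bytes, freq_mhz, dummy_cycles=8):
    """Payload rate in 10^6 bytes/s, ignoring chip-select gaps."""
    return n_bytes * freq_mhz / read_clock_cycles(n_bytes, dummy_cycles)


for freq in (40, 50, 80):
    print(freq, "MHz:", round(effective_mbytes_per_s(256, freq), 1), "MB/s")
```

For short reads the 40 cycles of opcode, address and dummy phase dominate, so the gain from a higher clock shows mostly on longer bursts.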

We are using master DLL bypass mode (see below for why we're not using master operational mode).

We are using the internal pad loopback clock, because on our current hardware design we don't have the external board loopback clock yet. Our driver operates in indirect mode so we disabled the PHY PIPELINE mode.

We've managed to achieve reasonable tuning results at 80 MHz, but we're seeing behaviour that we can't explain. We're using workarounds like a reduced number of dummy cycles and reading more data than what we actually need, and we have no idea if these workarounds are really necessary, or if we're missing some important setting.

There is a document called "OSPI Controller PHY Tuning Algorithm" (SPRACT2), but unfortunately that document states that frequencies other than 166 MHz and single-data-rate will be addressed in a later version.

Is there already an updated version of SPRACT2 available? If not, when does TI plan to release the updated version of the document?

We tried to implement our code using the TI Linux and TI-RTOS drivers as an example. We noticed that the TI code reduces the number of dummy cycles configured in the controller by 1 when using the PHY, e.g. the OSPI controller is configured for 7 dummy cycles even though the device requires 8 dummy cycles.

The TRM states: "Number of Dummy cycles should be set as specified in the documentation of the device or more when because of additional read paths delays of actual systems data is predicted to be flopped by PHY module with delay excesses actual cycle of SPI clock generated by the controller". For how many cycles is the external SPI clock supplied if the number of dummy clocks is reduced? In SPI mode 0 it shouldn't be a problem for the flash device if there is no clock cycle for the final nibble in a transfer, since the flash device presented the data on the falling edge of the last-but-one clock, but we're seeing issues with that last nibble not being seen by the OSPI controller. We noticed that if we don't reduce the number of dummy cycles we don't read the correct data when using frequencies and TX/RX delays that should easily work (e.g. at 40 MHz, RX and TX delay = 0 causes the first nibble to be missing), but the code and the documentation don't contain any explanation WHY the number of dummy cycles needs to be reduced.

With the reduced number of dummy cycles we're seeing issues towards higher RX delays that cause the final nibble in a transfer to be missing. In SPI mode 0 the reason for this should probably be the host OSPI controller and not the device (see above). If we simply read more data than what we actually need, the last nibble is transferred just fine even with higher RX delay settings; only the extra data is possibly corrupted.

Why is a correction of the dummy cycles required and how large should it be?

We're using the master DLL bypass mode since for our frequencies we can't achieve a full cycle lock, and the TI code uses the master DLL bypass mode if it can't achieve a full cycle lock.

What advantages does the master operational mode have over the bypass mode? Is there a reason why the TI code uses master operational mode only with a full cycle lock, and not with a half cycle lock?
- Instead of using a calculated initial master delay the available TI code uses a magic number of either 4 (previous versions) or 16 (current version). What value should be used for the initial master delay and how can it be calculated?
- We saw that the master DLL achieves a lock with the correct number of DLL elements no matter what initial delay is configured (as long as the initial delay is > ~4)
- We noticed that the master initial delay values had an effect on the DLL observable register decoder values later on, so we're not 100% sure what significance the master initial delay has
  - We "assume" that the "RX/TX DLL decoder" values should be independent of the initial master delay, but we found that the master initial delay had an influence on these values for identical "RX/TX DLL delay" settings.
We noticed that the DLL observable registers only briefly report a lock state
- Should the DLL observable registers continuously report a lock? Should software trigger a locking sequence again if lock is lost?
What is the meaning of the "RX/TX DLL delay" settings depending on the master DLL modes:
- We assume that the delay is specified as an absolute number of delay elements in master DLL bypass mode. The code mentions 80ps worst case delay, but another post mentions 50ps (https://e2e.ti.com/support/processors/f/791/p/878287/3254470#3254470). Our own measurements show 72ps.

- We assume that the delay is a fraction of the (full or half) reference clock in master operational mode, but we're not certain about the behaviour if the DLL isn't locked anymore (see above). Is it possible that there's a fallback to the master initial delay as the reference for the fractional delay if the DLL lock is lost?

Is there any further information available regarding the PHY sampling and the interaction between PHY and OSPI controller beside the TRM and SPRACT2?
- How does the OSPI_RD_DATA_CAPTURE_REG Sample edge selection affect the sampling in PHY mode? We haven't seen any effect yet, but the documentation doesn't tell us whether this is intended, or if we're just missing something?
- When is the data sampled by PHY shifted into the RX FIFO?
- Does the PHY sample at the rising edge of the loopback clock when RX DLL is configured to 0 delay elements in master DLL bypass mode?
  - We've seen issues with a 40 MHz RCLK and 0 cycles RX and TX delay that we couldn't explain
- The TRM mentions requirements regarding the relationship of HCLK and RCLK for high frequencies and direct mode (HCLK needs to be high enough relative to RCLK), but for our use case we're seeing issues with "low" RCLKs as well.

Best Regards,

Dominic

over 5 years ago

0 z over 5 years ago

TI__Expert 4015 points

Is there already an updated version of SPRACT2 available? If not, when does TI plan to release the updated version of the document?

No, there is not a more recent version of SPRACT2. I can't provide a schedule for updated information at this time, but I will do my best to answer your questions.

Why is a correction of the dummy cycles required and how large should it be?

The dummy cycle correction is actually there to prevent a data collision at the end of the dummy cycle phase and the beginning of the read data phase. This is only necessary in DDR mode. Since you are operating in SDR mode, I recommend you use the same number of dummy cycles as the device requires.

What advantages does the master operational mode have over the bypass mode? Is there a reason why the TI code uses master operational mode only with a full cycle lock, and not with a half cycle lock?

Your assumption is correct. In bypass mode, the delay of the PDLs is set by the absolute number of delay elements in the TX Delay and RX Delay register fields. In Master mode, it is equal to a fraction of a clock cycle formed by the number in the register field divided by 128.

In 80MHz SDR mode, you can only lock on half of a clock cycle. Locking on half of a clock cycle means that your longest RX delay is half of a clock cycle. This is fine in DDR mode, but in SDR mode there can be valid RX delay values beyond a half cycle. For your use case, I recommend using bypass mode. In bypass mode, there is no need to lock the master DLL.
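A minimal sketch of the two delay interpretations described above may help when comparing settings. It assumes the per-element delays quoted in this thread (50 ps, 72 ps measured, 80 ps worst case in the code) and treats the master-mode field as field/128 of one clock period. The function names are hypothetical, and how the fraction behaves under a half-cycle lock is not modelled.

```python
def period_ps(freq_mhz):
    """Clock period in picoseconds."""
    return 1e6 / freq_mhz


def bypass_delay_ps(elements, element_ps):
    """Bypass mode: delay is an absolute number of delay elements."""
    return elements * element_ps


def master_delay_ps(field, freq_mhz):
    """Master mode: delay is field/128 of one clock period."""
    return field / 128 * period_ps(freq_mhz)


# Largest bypass-mode delay (127 elements) against the 80 MHz period.
for element_ps in (50, 72, 80):
    span = bypass_delay_ps(127, element_ps)
    print(f"{element_ps} ps/element: {span} ps of {period_ps(80)} ps")
```

With 72 ps per element, 127 elements give about 9.1 ns, which is more than half of the 12.5 ns period at 80 MHz. That is the range a half-cycle lock would not be able to reach.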

Is there any further information available regarding the PHY sampling and the interaction between PHY and OSPI controller beside the TRM and SPRACT2?

The TRM and SPRACT2 are all that is available at the moment.

How does the OSPI_RD_DATA_CAPTURE_REG Sample edge selection affect the sampling in PHY mode? We haven't seen any effect yet, but the documentation doesn't tell us whether this is intended, or if we're just missing something?

The sample edge selection does not apply to PHY mode.

When is the data sampled by PHY shifted into the RX FIFO?

On the rising edge of ref_clock.

The TRM mentions requirements regarding the relationship of HCLK and RCLK for high frequencies and direct mode (HCLK needs to be high enough relative to RCLK), but for our use case we're seeing issues with "low" RCLKs as well.

If you are referring to the relationship between the ref_clock and the bus clock, this only applies in non PHY mode. In PHY mode, the bus clock is the same frequency as the ref_clock.

Does the PHY sample at the rising edge of the loopback clock when RX DLL is configured to 0 delay elements in master DLL bypass mode? We've seen issues with a 40 MHz RCLK and 0 cycles RX and TX delay that we couldn't explain.

I'm not sure what your tuning procedure is, but I recommend the following steps:

1) Set the dummy cycles to equal the number required by the flash device. Set the READ DELAY field to 0.

2) Select a TX delay which is greater than the input setup time of the target flash device, and fix the TX delay to that value.

3) Search the RX delay from 0 to 128, reading the test pattern at each value and recording which values result in a good read.

4) Increment the READ DELAY field, and search RX delay again. Continue incrementing READ DELAY and searching RX until you get no passing RX values.

You can try this in multiple clocking modes (pad loop back and ref_clock sampling).
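The steps above can be sketched as a small search loop. This is only an illustration, not driver code: `read_ok` is a hypothetical callback that programs the given TX delay, RX delay and READ DELAY, reads the test pattern, and returns True on a match. The 0..127 RX range and the READ DELAY limit of 15 are assumptions about the field widths. The loop also keeps going past leading READ DELAY values with no passes, which is a small deviation from step 4 as written.

```python
from typing import Callable, Dict, List

RX_DELAY_MAX = 127    # assumption: 7-bit RX delay field
READ_DELAY_MAX = 15   # assumption: 4-bit READ DELAY field


def search_rx_windows(read_ok: Callable[[int, int, int], bool],
                      tx_delay: int) -> Dict[int, List[int]]:
    """Return the passing RX delays for each READ DELAY value."""
    windows: Dict[int, List[int]] = {}
    for read_delay in range(READ_DELAY_MAX + 1):
        passing = [rx for rx in range(RX_DELAY_MAX + 1)
                   if read_ok(tx_delay, rx, read_delay)]
        if passing:
            windows[read_delay] = passing
        elif windows:
            break  # passes were seen earlier and have now stopped
    return windows


def fake_read_ok(tx: int, rx: int, read_delay: int) -> bool:
    """Simulated flash read with invented pass windows, for the demo only."""
    if read_delay == 0:
        return rx >= 83
    if read_delay == 1:
        return rx < 60
    return False


windows = search_rx_windows(fake_read_ok, tx_delay=50)
for read_delay, passing in windows.items():
    print(read_delay, passing[0], passing[-1])
```

On real hardware `fake_read_ok` would be replaced by a function that writes the PHY registers and compares the data read back against the known pattern.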

Regards,

Zack

0 Dominic Rath over 5 years ago in reply to z

Mastermind 6975 points

Hello Zack,

thanks for your replies. Following your recommendation, we implemented the following algorithm:

We set the correct number of dummy cycles.
We used master bypass mode and set the TX-Delay to 50 delay elements (>2ns setup time / 50ps per delay element = 40 delay elements).
Valid reads start from a RX Delay of 83 with READ DELAY set to 0 for 40MHz clock. Reads with a smaller RX delay appear to miss the first nibble of data. Normally we would have expected to see valid data without any RX Delay.
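The TX delay arithmetic above can be written as a short sketch. It assumes the 2 ns setup time and the per-element delays mentioned in this thread; `min_tx_elements` is a name made up for illustration.

```python
import math


def min_tx_elements(setup_ps, element_ps):
    """Smallest element count whose total delay covers the setup time."""
    return math.ceil(setup_ps / element_ps)


for element_ps in (50, 72, 80):
    print(element_ps, "ps/element ->", min_tx_elements(2000, element_ps))
```

The shortest per-element delay gives the largest count, so 50 ps is the conservative figure to use when the real element delay is uncertain.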

We believe we're getting useful tuning results for a large number of RX/TX delay combinations, but there are some "surprising" results, e.g. we're seeing much better results (more valid RX delay settings) at 40 MHz with a higher number of TX delay elements than what we would need for your "TX delay which is greater than the input setup time" recommendation. Also, for SPI mode 0, we wouldn't expect to have to add a TX delay that meets the setup time at all (see Q3).

We think that we're seeing two effects for which we have some follow-up questions:

target reference cycle 0 is "late" (later than expected) at low clock frequencies (40 MHz, to some extent also at 80 MHz)
TX clock / data appear to be shifted compared to SPI mode 0

Q1) At which point of the loopback clock does the PHY sample for a RX Delay of 0?

Sampling with an RX Delay of 0 worked with a reduced number of dummy cycles configured in the controller. We suspect this works since with a reduced number of dummy cycles the first target reference cycle (READ DELAY 0) is one clock cycle earlier, i.e. we're not seeing an issue due to setup/hold but rather with the target reference cycle.

Q2) Is the first target reference cycle (READ DELAY 0) for a regular number of dummy cycles so late that a theoretical READ DELAY of -1 would be required for a part of the valid RX Delays (with slow frequencies like 40MHz)?

The following is a snapshot showing the first 10 bits of a read JEDEC ID command (9F) being transmitted by the OSPI controller using the PHY in SPI mode 0. The TX delay is configured to 0 and a 40 MHz reference clock is used. The red signal is the CLK and the blue signal is D0 at the pins of the flash device:

On an untuned SPI Mode 0 snapshot we would expect the data signal to change on the falling edge of the clock. In the snapshot we see the data signal changing on (close to) the rising edge of the clock. This means the data signal appears to be driven about half a period later than expected.

Q3) Could you confirm that the TX clock is intentionally starting that early / the data is delayed that much?

The behaviour of the TX data with regard to the TX clock would explain why we're able to configure TX delays between 0 and 127 delay elements (up to ~9.2ns according to our measurements) at 80/100 MHz and still get a valid response from the flash device. At SPI mode 0, where data is expected to change on the falling clock edge, we wouldn't have expected this behaviour.

Regards,

Dominic

0 z over 5 years ago in reply to Dominic Rath

TI__Expert 4015 points

I'll start by addressing the surprising results you have observed:

"we're seeing much better results (more valid RX delay settings) at 40 MHz with a higher number of TX delay elements than what we would need for your "TX delay which is greater than the input setup time" recommendation."

"Also, for SPI mode 0, we wouldn't expect to have to add a TX delay that meets the setup time at all (see Q3)."

The first point about a higher TX delay resulting in a bigger RX passing range is most likely caused by the higher TX delay pushing more of the valid data window into the target ref_clock cycle. As for your second point, see my answer to question 3 below.

Q1) At which point of the loopback clock does the PHY sample for a RX Delay of 0?

Sampling occurs on the falling edge of the (delayed) sampling clock.

The first target reference cycle should begin immediately after the last dummy cycle of the ref_clock, which is within the controller. Meanwhile, data has to make the "round trip" from clock out, to the flash device, to data in. So I'm not sure how the read delay window could be late.

Q3) Could you confirm that the TX clock is intentionally starting that early / the data is delayed that much?

In PHY mode (intended for high speeds), data transitions are aligned to the ref_clock, with the expectation that TX delay would be 50%. However, at very low speeds, this is not possible. So in short: yes, what you are seeing is expected.

Also, were you able to find passing TX/RX combinations at Read Delay values other than 0? Have you tried using the gated ref_clock rather than the pad loopback for sampling? I have found that in SDR mode, *most* of the TX/RX space will have a Read Delay value that results in a pass. However, I was using the gated ref_clock and not loopback. I have also found it helpful to test every combination of TX, RX, and read delay, and graph the results. This should only take a few minutes to do.
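The exhaustive test suggested above could look roughly like the following sketch, which records pass/fail for every TX, RX and READ DELAY combination and prints one text map per READ DELAY. `read_ok` is again a hypothetical callback that configures the PHY and checks the test pattern. The pass region used here is invented purely to make the example runnable.

```python
def sweep(read_ok, tx_values, rx_values, read_delays):
    """Try every combination and record whether the pattern read passed."""
    return {(tx, rx, rd): read_ok(tx, rx, rd)
            for rd in read_delays
            for tx in tx_values
            for rx in rx_values}


def render_map(results, tx_values, rx_values, read_delay):
    """One text row per TX delay; '#' marks a pass, '.' a fail."""
    lines = []
    for tx in tx_values:
        row = "".join("#" if results[(tx, rx, read_delay)] else "."
                      for rx in rx_values)
        lines.append(f"tx={tx:3d} {row}")
    return "\n".join(lines)


def fake_read_ok(tx, rx, read_delay):
    """Simulated result with an invented pass region, for the demo only."""
    return read_delay == 0 and tx + rx >= 100


tx_values = list(range(0, 128, 32))
rx_values = list(range(0, 128, 16))
results = sweep(fake_read_ok, tx_values, rx_values, read_delays=[0, 1])
print(render_map(results, tx_values, rx_values, read_delay=0))
```

A coarse step keeps the map readable in a terminal. Stepping every value and plotting the result as an image would show the pass regions in more detail.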

-Zack
