AM6548: U-Boot SPL could not load u-boot proper: QSPI: QSPI is still busy after poll for 10000 times.

Part Number: AM6548

Dear,

Sometimes our AM6548-based device fails to boot from U-Boot SPL into U-Boot proper:

NOTICE: BL31: v2.6(release):
NOTICE: BL31: Built : 08:03:39, Oct 20 2022
I/TC:
I/TC: OP-TEE version: 3.16.0 (gcc version 10.2.1 20210110 (Debian 10.2.1-6)) #1 Thu Oct 20 08:03:39 UTC 2022 aarch64
I/TC: Primary CPU initializing
I/TC: Primary CPU switching to normal world boot

U-Boot SPL 2022.01-V01.03.01.01-0-gffc3caf (Oct 20 2022 - 08:10:44 +0000)
SYSFW ABI: 3.1 (firmware rev 0x0015 '21.9.1--v2021.09a (Terrific Lla')
Trying to boot from SPI
QSPI: QSPI is still busy after poll for 10000 times.
SPI probe failed.
SPL: failed to boot from all boot devices

ERROR ### Please RESET the board

Any suggestions? Please see github.com/.../440 for details.
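
For readers unfamiliar with this kind of message: the wording of the error line suggests a bounded polling loop that gives up after 10000 attempts. The following is a minimal sketch of that general pattern only, not the actual driver code. The names `wait_until_idle` and `make_controller` are hypothetical, and the simulated controller merely stands in for a hardware status read.

```python
from typing import Callable


def wait_until_idle(is_idle: Callable[[], bool], max_polls: int = 10000) -> bool:
    """Poll is_idle() up to max_polls times; return True as soon as it reports idle."""
    for _ in range(max_polls):
        if is_idle():
            return True
    # Retry budget exhausted: report it in the same style as the log above.
    print(f"QSPI: QSPI is still busy after poll for {max_polls} times.")
    return False


def make_controller(busy_polls: int) -> Callable[[], bool]:
    """Hypothetical stand-in for a status read: busy for the first busy_polls reads."""
    state = {"reads": 0}

    def is_idle() -> bool:
        state["reads"] += 1
        return state["reads"] > busy_polls

    return is_idle


if __name__ == "__main__":
    # A controller that becomes idle after a few reads passes the check.
    print(wait_until_idle(make_controller(5)))
    # A controller that stays busy for the whole budget triggers the message.
    print(wait_until_idle(make_controller(10000)))
```

A loop bounded by iteration count rather than elapsed time gives up sooner on a faster CPU, which is one reason such limits are worth checking when a timeout appears only rarely.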

  • Hi Baocheng,

    from the linked Github thread I understand this is related to the w25q128 Flash device that's used as part of "SIMATIC IOT2050 Advanced", correct? 

    With the new firmware V1.3.1 this does not work.

    Does this "new firmware" use a different U-Boot build than other, previous firmwares?

    I also see in the referenced github.com link that the project uses "U-Boot SPL 2022.01", which is not a TI-official U-Boot release.

    We may need to debug this to look for differences between previous SW releases and this new v1.3.1 release, specifically as they relate to U-Boot.

    Regards, Andreas

  • Yep, you are right regarding the product model and flash IC model.

    This issue has been reproduced across several U-Boot versions: the GitHub report uses 2022.01, and another user reports the exact same issue on 2022.10.

    Is there any SPI related difference between upstream version and TI branch?

    BRs/Baocheng

  • We have reports of the same issue with U-Boot SPL based on 2021.04 (V01.02.01.03-0-g7e29ca7) and 2022.01 (V01.03.01.01-0-gffc3caf). We have not yet tried a recent U-Boot, but we aim at 2023.10 upstream for the next release.


    The goal of this request is to identify potential known issues from the past or from the vendor U-Boot so that we can try to update / patch and start new tests. Given how long reproduction takes (30-90 days), we should invest a bit into thinking about potential reasons before trying out instrumentations or even fixes.

  • Jan, Baocheng,

    thanks for the additional background.

    Given how long reproduction takes (30-90 days), we should invest a bit into thinking about potential reasons before trying out instrumentations or even fixes

    For sure. Let me ask the internal R&D team if anything rings any bells. 

    We have not yet tried a recent U-Boot, but we aim at 2023.10 upstream for the next release.

    FYI, the next official TI SDK update for AM65x will be SDK v9.1, planned for release on Oct 15th. It will be based on U-Boot 2023.04. I'm not aware of any significant differences between the TI SDK U-Boot and upstream U-Boot when it comes to the OSPI driver stack, but I'll have a look to see if anything sticks out.

    Note that the issue may not necessarily be U-Boot related but could also be related to System Firmware (SYSFW, a.k.a. "FW running on the DMSC"), as this is the central piece that controls all the peripherals, including but not limited to the OSPI peripheral, from a low-level POV (clocks, reset, power domains, etc.).

    How is your Flash chip connected to the AM65x? I see there are variants of that device available with a RESET pin, so I wonder if this is connected/can be used as one potential way to improve behavior.

    The good thing is though in your case ROM always seems to be able to boot at least into the second stage of boot (U-Boot SPL on A53) from that QSPI chip. So it seems like we should be able to take control of this situation.

    The long testing cycle is going to be a challenge here.

    Regards, Andreas

  • This issue has been reproduced across several U-Boot versions: the GitHub report uses 2022.01, and another user reports the exact same issue on 2022.10.

    Is there any SPI related difference between upstream version and TI branch?

    I had a quick look; there have in fact been some recent improvements to the Cadence QSPI driver in U-Boot that are not part of 2022.01 or 2022.10. In particular, one commit sticks out that _could_ be related...

    spi: cadence-quadspi: Reset CMD_CTRL Reg on cmd r/w completion
    https://gitlab.com/u-boot/u-boot/-/commit/08b3098eadc7f826c3e6fb9d184cf6d82f5028fe

    ...however, other improvements have been made since, all of which can be found in the current TI SDK https://git.ti.com/cgit/ti-u-boot/ti-u-boot/log/?h=ti-u-boot-2023.04 tree. Ideally we would do some A/B testing, but the testing cycle is too long for that.

    Perhaps there's a way to accelerate the failure? Maybe it fails earlier than 30-90 days, but it wasn't noticed yet?

    Regards, Andreas

  • I've seen that commit, but it seems to be more related to https://gitlab.com/u-boot/u-boot/-/commit/8077d296adff235e13c1478f92ef42c08e17ec33.

    Is there anything else in the BSP that was not yet submitted upstream? We never used the BSP for the firmware, and we don't plan to do this anymore (firmware was very early in good shape in upstream, thus no need for that step back).

  • I've seen that commit, but it seems to be more related to https://gitlab.com/u-boot/u-boot/-/commit/8077d296adff235e13c1478f92ef42c08e17ec33.

    The commit message of "spi: cadence-quadspi: Reset CMD_CTRL Reg on cmd r/w completion" suggests it was found during "STIG mode", but I'm not sure it is necessarily limited to that, at least from having a quick look at the diffs it appears that way.

    Is there anything else in the BSP that was not yet submitted upstream?

    Doing a quick check...

    $ git log --oneline --no-merges ti-u-boot-2023.04 ^origin/master drivers/spi/cadence_qspi_apb.c
    107f490c0c spi: cadence_qspi_apb: Disable rising edge sampling
    edf468ce5d spi: cadence-qpsi: Disable Auto-HW polling
    d8ddedea23 spi: cadence-qspi: Do not use DMA for small reads
    fa50866e65 spi: cadence-qspi: Tune PHY to allow running at higher frequencies
    bb065dbbf2 spi: cadence-qspi: Use PHY for DAC reads if possible
    385bc2b935 spi: cadence-quadspi: Reset CMD_CTRL Reg on cmd r/w completion       <== This is upstream, ignore
    c7877a7eab spi: cadence-quadspi: Use STIG mode for all ops with small payload   <== This is upstream, ignore
    858102e65d spi: cadence-quadspi: Fix check condition for DTR ops                <== This is upstream, ignore
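
Commits that were cherry-picked carry different hashes in the two trees, which is why the list above still contains entries that are already upstream and had to be marked by hand. A minimal sketch of automating that filtering by commit subject follows. It assumes both inputs are plain `git log --oneline` output, the function name `downstream_only` is hypothetical, and matching on subject alone would miss commits that were retitled upstream.

```python
from typing import List, Set


def downstream_only(downstream_log: List[str], upstream_subjects: Set[str]) -> List[str]:
    """Return the '<hash> <subject>' lines whose subject is not known upstream."""
    remaining = []
    for line in downstream_log:
        line = line.strip()
        if not line:
            continue
        # --oneline format: abbreviated hash, one space, then the subject.
        _commit_hash, _sep, subject = line.partition(" ")
        if subject not in upstream_subjects:
            remaining.append(line)
    return remaining


if __name__ == "__main__":
    ti_log = [
        "107f490c0c spi: cadence_qspi_apb: Disable rising edge sampling",
        "edf468ce5d spi: cadence-qpsi: Disable Auto-HW polling",
        "385bc2b935 spi: cadence-quadspi: Reset CMD_CTRL Reg on cmd r/w completion",
    ]
    upstream = {"spi: cadence-quadspi: Reset CMD_CTRL Reg on cmd r/w completion"}
    for entry in downstream_only(ti_log, upstream):
        print(entry)
```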

    ...the only one that may sound "suspicious" is "spi: cadence-qpsi: Disable Auto-HW polling", however when looking at that commit it seems to apply only to write mode.

    We never used the BSP for the firmware, and we don't plan to do this anymore (firmware was very early in good shape in upstream, thus no need for that step back).

    Understood about your project; it would still be good to use for testing purposes though, if we had a quicker test cycle, that is. Also, the commit diff listed above will converge to 0 over time, as some of the commits probably were already posted upstream but may not have gotten merged, or will get posted soon, as per our upstream strategy.

    Regards, Andreas

  • Did you hear any bells ringing internally by now regarding our error pattern?

    Meanwhile, I'm waiting for our hardware colleagues to come back from vacation to answer your question regarding the reset pin.

  • Hi Jan,

    yes, while nobody was able to pinpoint a root cause from the limited information we got, I did get a couple of questions/comments from different folks internally, as follows:

    1. Are you writing to the SPI Flash at runtime?
      My comment: I think you are only using it for boot purposes, and do not mount or use any filesystem etc. on it, nor really access it at runtime. Plus, the fact that the broken boot can be fixed by doing a hard reset speaks against some type of issue (corruption) with the SPI flash contents themselves.
    2. We really need a way to speed up the test cycle to run experiments/gather data. What happens doing an endless boot loop (1,000 reboots, in a loop)? Has this been tried on a failing unit?
    3. We'd need a dump of all OSPI registers when the issue happens, plus we need to pinpoint exactly which read/write function call in the driver times out.
    4. (Another question I had myself), is there any correlation between the failure behavior, and a particular board? There may not be enough data, but either way any potential accelerated testing might need to be done on a board that has shown the failure before, to be sure.
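
For the reboot-loop experiment in item 2, a captured serial console log could be scanned automatically so that a rare failure is not overlooked. Below is a minimal sketch under stated assumptions: each boot attempt prints a line starting with `U-Boot SPL ` and a failed attempt contains the error text from the original report. The function name `summarize_console_log` is hypothetical, and the capture of the log itself is out of scope here.

```python
SPL_BANNER = "U-Boot SPL "
FAILURE_SIGNATURE = "QSPI is still busy after poll"


def summarize_console_log(log_text: str) -> dict:
    """Count boot attempts and record which attempts (1-based) hit the QSPI timeout."""
    boots = 0
    failed_boots = []
    for line in log_text.splitlines():
        if line.startswith(SPL_BANNER):
            boots += 1
        elif FAILURE_SIGNATURE in line and boots > 0:
            # Record each failing boot only once, even if the message repeats.
            if not failed_boots or failed_boots[-1] != boots:
                failed_boots.append(boots)
    return {"boots": boots, "failed_boots": failed_boots}


if __name__ == "__main__":
    sample = "\n".join([
        "U-Boot SPL 2022.01-V01.03.01.01-0-gffc3caf (Oct 20 2022 - 08:10:44 +0000)",
        "Trying to boot from SPI",
        "QSPI: QSPI is still busy after poll for 10000 times.",
        "SPI probe failed.",
    ])
    print(summarize_console_log(sample))
```

Recording the index of each failing boot, rather than only a total, makes it possible to see later whether failures cluster after a particular number of cycles or on a particular board.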

    Regards, Andreas

  • 1. Yes, it is only used at boot time, and since the device recovers after a hard reset, the flash content is not corrupted.

    2. We have always run reboot tests in our test lab, with more than 1000 reboots per run. For example, during power testing we ran a reboot loop for weeks; at 60 s per reboot cycle, one week is 24*60*7 = 10,080 (about 10k) reboots. We noticed this issue only once in our test lab, but it has been reported by several customers in their setups.

    3. That requires some debugging code added around the issue point, although that is not a big deal in this case.

    4. As far as I can tell, it doesn't seem to be a single-board issue; however, we will keep an eye on it and also ask our customers for more information.
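
The reboot arithmetic in item 2 can be written out as a short calculation, together with a rough way to judge how many reboots a loop would need before a rare failure becomes likely to show up. This is a sketch only: the per-boot failure probability used below is a made-up illustrative value, not a measured one, and the model assumes every boot fails independently with the same probability.

```python
def reboots_per_week(seconds_per_cycle: float) -> int:
    """Number of complete reboot cycles that fit into one week."""
    seconds_per_week = 7 * 24 * 60 * 60
    return int(seconds_per_week // seconds_per_cycle)


def prob_at_least_one_failure(p_fail_per_boot: float, boots: int) -> float:
    """Chance of seeing one or more failures in `boots` independent boots."""
    return 1.0 - (1.0 - p_fail_per_boot) ** boots


if __name__ == "__main__":
    n = reboots_per_week(60)  # 60 s per cycle, as in item 2
    print("reboots per week:", n)
    # Hypothetical failure rate, chosen purely for illustration.
    print("P(at least one failure):", prob_at_least_one_failure(1e-4, n))
```

Under this model, shortening the reboot cycle or running several boards in parallel raises the number of boots per week directly, which is the most straightforward lever for shortening the test cycle.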

  • Regarding the RESET pin, just confirmed with our hardware colleague, it is not connected to any GPIOs, IOW, it's not software controllable.

  • Hi Baocheng,

    Andreas is out of office today. Please expect a delayed response.