AM6548: U-Boot SPL could not load u-boot proper: QSPI: QSPI is still busy after poll for 10000 times.

Part Number: AM6548

Dear,

Sometimes our AM6548 based device could not boot from u-boot SPL to u-boot proper:

NOTICE: BL31: v2.6(release):
NOTICE: BL31: Built : 08:03:39, Oct 20 2022
I/TC:
I/TC: OP-TEE version: 3.16.0 (gcc version 10.2.1 20210110 (Debian 10.2.1-6)) #1 Thu Oct 20 08:03:39 UTC 2022 aarch64
I/TC: Primary CPU initializing
I/TC: Primary CPU switching to normal world boot

U-Boot SPL 2022.01-V01.03.01.01-0-gffc3caf (Oct 20 2022 - 08:10:44 +0000)
SYSFW ABI: 3.1 (firmware rev 0x0015 '21.9.1--v2021.09a (Terrific Lla')
Trying to boot from SPI
QSPI: QSPI is still busy after poll for 10000 times.
SPI probe failed.
SPL: failed to boot from all boot devices

ERROR ### Please RESET the board

Any suggestions? Please see github.com/.../440 for details.

  • In my latest reproduced case, it went with the following patch. 

    I filtered the registers' value of abnormal case and the registers'value of normal case as follows:

    ------------------------------------------------------
    CQSPI_REG_CONFIG:
     normal registers: 
      Source: cadence_qspi_apb_command_read
       CQSPI_REG_CONFIG:              80083881
      Source: cadence_qspi_apb_read_execute
       CQSPI_REG_CONFIG:              80783881
     abnormal register: 
      Source: cadence_qspi_apb_command_read
       CQSPI_REG_CONFIG:              80083881
    
    ------------------------------------------------------
    CQSPI_REG_RD_INSTR:
     normal registers: 
      Source: cadence_qspi_apb_command_read
       CQSPI_REG_RD_INSTR:            00000000
      Source: cadence_qspi_apb_read_execute
       CQSPI_REG_RD_INSTR:            0800005a
       CQSPI_REG_RD_INSTR:            00000003
     abnormal register: 
      Source: cadence_qspi_apb_command_read
       CQSPI_REG_RD_INSTR:            00000000
    
    ------------------------------------------------------
    CQSPI_REG_DELAY:
     normal registers: 
      Source: cadence_qspi_apb_command_read
       CQSPI_REG_DELAY:               05090309
      Source: cadence_qspi_apb_read_execute
       CQSPI_REG_DELAY:               09090909
     abnormal register: 
      Source: cadence_qspi_apb_command_read
       CQSPI_REG_DELAY:               05090309
    
    ------------------------------------------------------
    CQSPI_REG_INDIRECTRDSTARTADDR:
     normal registers: 
      Source: cadence_qspi_apb_read_execute
       CQSPI_REG_INDIRECTRDSTARTADDR: 00000000
      Source: cadence_qspi_apb_read_execute
       CQSPI_REG_INDIRECTRDSTARTADDR: 00000080
       CQSPI_REG_INDIRECTRDSTARTADDR: 00380000
     abnormal register: 
      Source: cadence_qspi_apb_command_read
       CQSPI_REG_INDIRECTRDSTARTADDR: 00000000
    
    ------------------------------------------------------
    CQSPI_REG_CMDCTRL:
     normal registers: 
      Source: cadence_qspi_apb_command_read
       CQSPI_REG_CMDCTRL:             9fd00000
      Source: cadence_qspi_apb_read_execute
       CQSPI_REG_CMDCTRL:             00000000
      Source: cadence_qspi_apb_command_write
       CQSPI_REG_CMDCTRL:             06000000
       CQSPI_REG_CMDCTRL:             01008000
      Source: cadence_qspi_apb_command_read
       CQSPI_REG_CMDCTRL:             05800000
     abnormal register: 
      Source: cadence_qspi_apb_command_read
       CQSPI_REG_CMDCTRL:             9fd00000
    
    ------------------------------------------------------
    CQSPI_REG_CMDREADDATALOWER:
     normal registers: 
      Source: cadence_qspi_apb_read_execute
       CQSPI_REG_CMDREADDATALOWER:    001840ef
      Source: cadence_qspi_apb_command_write
       CQSPI_REG_CMDREADDATALOWER:    00000000
     abnormal register: 
      Source: cadence_qspi_apb_command_read
       CQSPI_REG_CMDREADDATALOWER:    001840ef

    abnormal register: 
      Source: cadence_qspi_apb_command_read
       CQSPI_REG_CONFIG:              80083881

    A bit strange is noted: bit 31 should be 0 in the error case because it is CQSPI_REG_CONFIG_IDLE_LSB, and that is the condition that CQSPI_REG_IS_IDLE is testing for.

    Here is the log of abnormal case:

    Trying to boot from SPI
    QSPI: QSPI is still busy after poll for 10000 times.
    Source: cadence_qspi_apb_command_read
    CQSPI_REG_CONFIG:              80083881
    CQSPI_REG_RD_INSTR:            00000000
    CQSPI_REG_WR_INSTR:            00000002
    CQSPI_REG_DELAY:               05090309
    CQSPI_REG_RD_DATA_CAPTURE:     00000005
    CQSPI_REG_SIZE:                00101002
    CQSPI_REG_SRAMPARTITION:       00000080
    CQSPI_REG_INDIRECTTRIGGER:     00000000
    CQSPI_REG_REMAP:               00000000
    CQSPI_REG_MODE_BIT:            00000200
    CQSPI_REG_SDRAMLEVEL:          00000000
    CQSPI_REG_WR_COMPLETION_CTRL:  00010005
    CQSPI_REG_IRQSTATUS:           00000000
    CQSPI_REG_IRQMASK:             00000000
    CQSPI_REG_INDIRECTRD:          00000000
    CQSPI_REG_INDIRECTRDWATERMARK: 00000000
    CQSPI_REG_INDIRECTRDSTARTADDR: 00000000
    CQSPI_REG_INDIRECTRDBYTES:     00000000
    CQSPI_REG_CMDCTRL:             9fd00000
    CQSPI_REG_INDIRECTWR:          00000000
    CQSPI_REG_INDIRECTWRWATERMARK: ffffffff
    CQSPI_REG_INDIRECTWRSTARTADDR: 00000000
    CQSPI_REG_INDIRECTWRBYTES:     00000000
    CQSPI_REG_CMDADDRESS:          00000000
    CQSPI_REG_CMDREADDATALOWER:    001840ef
    CQSPI_REG_CMDREADDATAUPPER:    00000000
    CQSPI_REG_CMDWRITEDATALOWER:   00000000
    CQSPI_REG_CMDWRITEDATAUPPER:   00000000
    CQSPI_REG_OP_EXT_LOWER:        13edfa00
    CQSPI_REG_PHY_CONFIG:          00000000

  • Hi Andy,

    This looks like a duplicate of issue e2e.ti.com/.../am6548-u-boot-spl-could-not-load-u-boot-proper-qspi-qspi-is-still-busy-after-poll-for-10000-times , please confirm. There was already a lengthy discussion on this.

    Also I know you are using your own/upstream sources for U-Boot etc., but can you please create a minimal system based on our just-released AM65x Linux SDK v9.1 (https://www.ti.com/tool/PROCESSOR-SDK-AM65X). It'll be a good reference point for comparison whether the same issue manifests itself or not. Do you have a way to re-create this issue by accelerated cycling/loop testing, or is it still something that only happens after 90(?) days on occasion?

    As for your register values, do you know there is valid QSPI communication happening on the U-Boot SPL A53 stage before the busy timeout? Meaning are there QSPI operations that are successful and return meaningful register values, before the issue happens? Just trying to gauge if in failure case the QSPI is is behaving abnormally right from the beginning of U-Boot SPL A53, or starts doing so only sometime during execution.

    Regards, Andreas

  • Hi Andreas,

    Yes, it's the same issue. The original issue has been locked, so I created a related question.

    Do you have a way to re-create this issue by accelerated cycling/loop testing, or is it still something that only happens after 90(?) days on occasion?

    Unfortunately, no for now. From my experiences, it requires about 50 days to reproduce it.

    As for your register values, do you know there is valid QSPI communication happening on the U-Boot SPL A53 stage before the busy timeout? Meaning are there QSPI operations that are successful and return meaningful register values, before the issue happens? Just trying to gauge if in failure case the QSPI is is behaving abnormally right from the beginning of U-Boot SPL A53, or starts doing so only sometime during execution.

    From the log and my current understanding, no valid QSPI communication happened before the busy timeout in the abnormal case.

    However, could you suggest me some debug ways to locate the root cause? Such as a patch to show more information when it is reproduced again.

    And any workaround can be considered? I tried `do_reset()` but it didn't work, the same issues happen on next boot; hard reset works for this problem, but no way to trigger it in software side.

    Thanks.

  • From the log and my current understanding, no valid QSPI communication happened before the busy timeout in the abnormal case.

    Sorry, correct me here, there are some valid QSPI communications happed before the busy timeout, which is in the earlier phase. In the early seboot phase, it reads some contents from the Flash through QSPI communication.

  • In the early seboot phase

    What is the "seboot" phase?

    I'm specifically interested if U-Boot SPL A53 can do some valid transactions before the failure. Because each stage of U-Boot re-initializes (probes) the peripheral, so even if U-Boot SPL R5 was able to load U-Boot SPL A53 from QSPI, this by itself doesn't mean QSPI was ever working properly in U-Boot SPL A53 itself, hence my question.

    I'm on business travel right now but I'll try to spend some time next week to re-familiarize myself with the init flow to perhpaps make a more targeted suggestion to debug. In the meantime please consider my other comments.

    Regards, Andreas

  • I'm on business travel right now but I'll try to spend some time next week to re-familiarize myself with the init flow to perhpaps make a more targeted suggestion to debug. In the meantime please consider my other comments.

    Thanks for your supporting. I did considered your other comments, anyway AM65x Linux SDK v9.1 is not easy for us to archive. Let's see what I can do about this.

    What is the "seboot" phase?

    I'm specifically interested if U-Boot SPL A53 can do some valid transactions before the failure.

    By seboot I meant the R5 part, U-boot SPL A53 didn't do any valid transactions before the failure.

    Thanks.

  • Since there is no activity for more than 6 months, please let us know whether this is still a problem.

    Br, Tommy

  • I had a similar issue on AM62x and were able to found the reason. One (your)  issue in the cadence_qspi driver and a similar one in mmc (not tested). I send two patches to u-boot@lists.denx.de but they are on hold waiting for manager approval since I'm not on the list.

    The actual issue is that the timer on 64bit core is based on the ARM system timer which is reset only with a power-on-reset and not with a warm start. So after a bit more than 49 days values will overflow the 32bit start variable (which should actually be an unsigned long) causing an immediate report of a timeout.

    diff --git a/u-boot/drivers/spi/cadence_qspi_apb.c b/u-boot/drivers/spi/cadence_qspi_apb.c
    index e0ceb44957c..d7cee2c3932 100644
    --- a/u-boot/drivers/spi/cadence_qspi_apb.c
    +++ b/u-boot/drivers/spi/cadence_qspi_apb.c
    @@ -143,9 +143,9 @@ static int cadence_qspi_set_protocol(struct cadence_spi_priv *priv,
     /* Return 1 if idle, otherwise return 0 (busy). */
     static unsigned int cadence_qspi_wait_idle(void *reg_base)
     {
    -       unsigned int start, count = 0;
    +       unsigned long start, count = 0;
            /* timeout in unit of ms */
    -       unsigned int timeout = 5000;
    +       unsigned long timeout = 5000;
     
            start = get_timer(0);
            for ( ; get_timer(start) < timeout ; ) {
    @@ -162,7 +162,7 @@ static unsigned int cadence_qspi_wait_idle(void *reg_base)
            }
     
            /* Timeout, still in busy mode. */
    -       printf("QSPI: QSPI is still busy after poll for %d ms.\n", timeout);
    +       printf("QSPI: QSPI is still busy after poll for %lu ms.\n", timeout);
            return 0;
     }
    
    

    diff --git a/u-boot/drivers/mmc/mmc.c b/u-boot/drivers/mmc/mmc.c
    index 210703ea46b..0c570e2be3b 100644
    --- a/u-boot/drivers/mmc/mmc.c
    +++ b/u-boot/drivers/mmc/mmc.c
    @@ -696,7 +696,7 @@ static int mmc_send_op_cond(struct mmc *mmc)
     {
            int err, i;
            int timeout = 1000;
    -       uint start;
    +       ulong start;
     
            /* Some cards seem to need this */
            mmc_go_idle(mmc);
    @@ -791,7 +791,8 @@ int mmc_send_ext_csd(struct mmc *mmc, u8 *ext_csd)
     static int __mmc_switch(struct mmc *mmc, u8 set, u8 index, u8 value,
                            bool send_status)
     {
    -       unsigned int status, start;
    +       ulong start;
    +       unsigned int status;
            struct mmc_cmd cmd;
            int timeout_ms = DEFAULT_CMD6_TIMEOUT_MS;
            bool is_part_switch = (set == EXT_CSD_CMD_SET_NORMAL) &&
    

  • Patches are pushed to the mailing list now:

    Ronald thank you for chiming in here; I'll also notify our dev team to make sure they are aware. We'll need to look more closely to make sure there are no other instances of this,

    Andreas

  • Yes, there might be some other instances too. I have only checked the ones the code we care about. Maybe a bit off topic here but there is a similar issue in the ipc_rpmsg_echo_linux example from the MCUplus SDK in ipc_rpmsg_echo_main():

    /* This API MUST be called by applications when its ready to talk to Linux */
    status = RPMessage_waitForLinuxReady(SystemP_WAIT_FOREVER);
    DebugP_assert(status==SystemP_SUCCESS);

    If there is no peer on Linux side then this will return after 49.x days with an error and run into the assert bringing the code to halt. Here it is a simple 32bit overflow of the tick counter. TI advertises this example as default DM firmware (but it's likely not true - I guess its the ipc_rpmsg_echo example that is used as default DM firmware). Anyway there are probably also multiple occurrences of similar issues here. We build our own DM firmware from the echo linux example based on this advice and learned the hard way that this was no good idea. My suggestion would be to build the DM firmware without any example code at all. Initially it was even confusing why a firmware has a name like ipc_echo_testb.

  • Hi Ronald,

    /* This API MUST be called by applications when its ready to talk to Linux */
    status = RPMessage_waitForLinuxReady(SystemP_WAIT_FOREVER);
    DebugP_assert(status==SystemP_SUCCESS);

    Having a quick look this WAIT_FOREVER definition is translated to a similar FreeRTOS definition, for example here..

    source/kernel/freertos/dpl/common/MailboxP_freertos.c:        if (timeout == SystemP_WAIT_FOREVER)
    source/kernel/freertos/dpl/common/MailboxP_freertos.c-        {
    source/kernel/freertos/dpl/common/MailboxP_freertos.c-            timeout = portMAX_DELAY;
    source/kernel/freertos/dpl/common/MailboxP_freertos.c-        }

    ...and I'd expect FreeRTOS handle portMAX_DELAY appropriately (I did not verify this statement).

    We build our own DM firmware from the echo linux example based on this advice and learned the hard way that this was no good idea. My suggestion would be to build the DM firmware without any example code at all. Initially it was even confusing why a firmware has a name like ipc_echo_testb

    I 100% agree that this is confusing esp. to folks new to those devices and should be simplified, it's actually a battle I've been fighting for a while. Most Linux folks want the DM FW to run exclusively on the R5 core for simplicity and maximum robustness, and in this case it should really have a more distinct and intuitive name, and not containing any extra example-type functionality

    Regards, Andreas

  • The problem is in TI code (source/drivers/ipc_rpmsg/ipc_rpmsg.c):

    int32_t  RPMessage_waitForLinuxReady(uint32_t timeout)
    {
        ...
                    elaspedTicks = ClockP_getTicks() - startTicks;
                    if( elaspedTicks >= timeout)
                    {
                        /* timeout, linux did not init the resource table in user specific timeout time */
                        status = SystemP_TIMEOUT;
                    }

    If the tick counter returned by ClockP_getTicks() overflows after ~49 days and becomes startTicks-1 the conditional becomes true because startTicks-1-startTicks is 0xffffffff aka SystemP_WAIT_FOREVER.

  • Hi Ronald,

    If the tick counter returned by ClockP_getTicks() overflows after ~49 days and becomes startTicks-1 the conditional becomes true because startTicks-1-startTicks is 0xffffffff aka SystemP_WAIT_FOREVER.

    thanks, I can see the concern now. I've opened a ticket internally for the team to investigate further (ref: SITSW-6185).

    If there is no peer on Linux side then this will return after 49.x days with an error and run into the assert bringing the code to halt.

    Due to the debug assert function disabling HWIs it would appear the entire DM FW operation would come to a halt in this case.

    TI advertises this example as default DM firmware (but it's likely not true - I guess its the ipc_rpmsg_echo example that is used as default DM firmware).

    Well, when I look at the production DM FW there's definitely a strong indication that this is the code, as the strings correlate with what's in ./examples/drivers/ipc/ipc_rpmsg_echo_linux/ipc_rpmsg_echo.c....

    a0797059@ula0797059:~/ti/ti-processor-sdk-linux-am62xx-evm-10.00.07.04/board-support/prebuilt-images/am62xx-evm
    $ strings ti-dm/am62xx/ipc_echo_testb_mcu1_0_release_strip.xer5f | grep 'IPC RPMSG ECHO'
    [IPC RPMSG ECHO] All echoed messages received by main core from %d remote cores !!!
    [IPC RPMSG ECHO] Remote Core waiting for messages at end point %d ... !!!
    [IPC RPMSG ECHO] Message exchange started with RTOS cores !!!
    [IPC RPMSG ECHO] Closing all drivers and going to WFI ... !!!
    [IPC RPMSG ECHO] Total execution time = %lld usecs
    [IPC RPMSG ECHO] One way message latency = %d nsec
    [IPC RPMSG ECHO] Messages sent to each core = %d 
    [IPC RPMSG ECHO] Number of remote cores = %d 
    [IPC RPMSG ECHO] Version: %s (%s %s): 

    Regards, Andreas

  • Hi Andreas,

    thanks for creating the internal ticket.

    ... and you are right: The "ipc echo linux" example seems to be indeed the default DM firmware. Looking into it with "readelf -a" I can see the ".resource_table" section that is only present in the linux example. So I think this issue is more critical then. In the end it depends how many DM requests happen in practise. With our own system we see issues with DMA requests for the UARTs (but fallback to pio) and mcspi. Both DMAs are not enabled in default device trees. A soft reboot will need the DM firmware as well but - if enabled - has the rti watchdog as fallback. So it is likely that many people will not actually detect this malfunctioning.

    - ron

  • IPC guy chiming in here. Thanks for highlighting this Ronald.

    Yes, in our default filesystems on AM62x/AM62Px we have the Linux IPC Echo example running on the DM R5F core alongside the DM task, as a part of the out-of-the-box demo flow: https://dev.ti.com/tirex/explore/node?node=A__AXINfJJ0T8V7CR5pTK41ww__AM62-ACADEMY__uiYMDcq__LATEST

    I was unaware of this potential bug, but it could definitely be problematic, especially for Linux-only customers who are not even programming the M4F / DM R5F cores. I'll also add a note to update the multicore academy for all devices after the team finishes looking at the MCU+ behavior and working out a fix.

    e.g. in this page: https://dev.ti.com/tirex/explore/node?node=A__ASVmm1hNWx7CjUJCy91Aig__AM62-ACADEMY__uiYMDcq__LATEST, add something like

    "code running on a remote core should not call RPMessage_waitForLinuxReady if the core does not actually use RPMsg with Linux. For example, the DM R5F core should NOT use the default ipc_echo_linux firmware for production. If no task is programmed to run on the DM R5F alongside the device management task, it is suggested to rebuild the DM R5F firmware with the empty example, in the MCU+ SDK under examples/empty. For more information about building DM R5F firmware, please refer to https://dev.ti.com/tirex/explore/node?node=A__AZNhqJdyJ3LM.YBw-Z2UAw__AM62-ACADEMY__uiYMDcq__LATEST "

    Regards,

    Nick

  • Ronald, just to confirm:

    Did you see a full R5F crash after the timeout happened? Or was it just the IPC task that crashed, while the device management task (DM task) continued running fine?

    Thanks,

    Nick

  • Nick,

    the R5F came to halt. Otherwise I wouldn't have noticed it since I was not using any IPC. I actually got errors from the Linux kernel indicating DM firmware is not responding.

    - Ronald

  • the R5F came to halt.

    Thanks for this confirmation. I've positioned the associated ticket (SITSW-6185) accordingly internally. It would seem that most of the public DM FW would be affected (https://git.ti.com/cgit/processor-firmware/ti-linux-firmware/tree/ti-dm?h=ti-linux-firmware), if exposed to a scenario where the RPMSG functionality isn't exercised.

    Regards, Andreas

  • [PATCH] spi: cadence-quadspi: fix potential malfunction after ~49 days uptime fixed my problem. If anyone else encounters the same issue, please refer to it.