AM6412: Boot stops at U-boot

Part Number: AM6412

Hi,

My customer reported boot issues on their custom boards.
Five of the 30 boards show the same behavior.
TI Linux SDK 8.6 is used.

Issue description:
- Boot stops in U-Boot. There are two failure patterns; please see the two logs (NG1.txt, NG2.txt).
- The boot mode is SD card boot, but the same issue is observed with SPI flash boot.
- The failing boards do not fail every time. Sometimes they boot correctly, and once Linux has booted they work fine.
- There seems to be a temperature dependency:
Room temp: 5 of 30 boards failed.
60C: 7 of 30 boards failed.
-10C: 2 of 30 boards failed.

 

NG1.txt:

U-Boot SPL 2021.01 (Jul 31 2023 - 11:31:58 +0900)
Resetting on cold boot to workaround ErrataID:i2331
resetting ...

U-Boot SPL 2021.01 (Jul 31 2023 - 11:31:58 +0900)
SYSFW ABI: 3.1 (firmware rev 0x0008 '8.4.7--v08.04.07 (Jolly Jellyfi')
SPL initial stack usage: 13424 bytes

NG2.txt:

U-Boot SPL 2021.01 (Jul 31 2023 - 11:31:58 +0900)
Resetting on cold boot to workaround ErrataID:i2331
resetting ...

U-Boot SPL 2021.01 (Jul 31 2023 - 11:31:58 +0900)
SYSFW ABI: 3.1 (firmware rev 0x0008 '8.4.7--v08.04.07 (Jolly Jellyfi')
SPL initial stack usage: 13424 bytes  
Trying to boot from MMC2
mmc fail to send stop cmd
spl_load_image_fat: error reading image tispl.bin, err - -2
SPL: failed to boot from all boot devices
### ERROR ### Please RESET the board ###


What could cause this behavior?
Do the failing logs suggest anything to you?

Thanks and regards,
Koichiro Tashiro

  • Hi,

    Did you copy the schematic for MMC on AM64x SK or AM64x EVM? If so, that device is missing some resistors and SD card boot may fail with different SD cards. Please provide device schematic for further review.

    Have you looked at the following collateral: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1185502/faq-am6442-am6441-am6422-am6421-am6412-am6411-custom-board-hardware-design-collaterals-to-get-started

    In particular, please look at the:
    hardware design guide: https://www.ti.com/lit/an/sprad67a/sprad67a.pdf?ts=1693248917539&ref_url=https%253A%252F%252Fe2e.ti.com%252F

    and schematic review checklist: https://www.ti.com/lit/an/spracu5b/spracu5b.pdf?ts=1672314742629&ref_url=https%253A%252F%252Fwww.ti.com%252Fproduct%252FAM6421

    " An approximately 50 KΩ Pull-down resistor is recommended for SD-CARD implementations on the
    MMC1_CLK pin. Pull-ups of approximately 50 KΩ should be used for each of the CMD and DAT signals
    as per the SD Card specification. "

    I will forward this thread to our hardware team.

    ~ Judith

  • Did your customer follow the PCB Connectivity Requirements defined in the MMC1 Timing Conditions table found in the AM64x datasheet?

    Do they have any signal integrity issues on the MMC1 signals? 

    Regards,
    Paul

  • Hi Paul, Judith,

    Did you copy the schematic for MMC on AM64x SK or AM64x EVM?

    Yes, the customer copied the schematics and some resistors were missing, but they have fixed that on the current boards.
    Do you mean the log shows something wrong with SD card access?
    As I mentioned, the issue is also observed with SPI boot, so it does not seem to depend on the MMC1 interface.

    Thanks and regards,
    Koichiro Tashiro

  • I'm not convinced this is an MMC or SPI hardware issue based on the customer's observations. I can only help answer hardware-related questions associated with the electrical characteristics and timing of the signals. So, the customer will need to isolate the issue to a specific hardware function before I can help.

    I'm going to reassign to Judith in hopes she can help decode the logs and isolate the issue.

    Regards,
    Paul

  • Hi Paul,

    I see.

    Hi Judith,

    I have a couple of questions about the logs.
    Q1) At the point where the log stops, has DDR initialization already finished?
    Q2) The NG2.txt log contains a few error messages. What exactly do they mean?

    mmc fail to send stop cmd
    spl_load_image_fat: error reading image tispl.bin, err - -2
    SPL: failed to boot from all boot devices

    Thanks and regards,
    Koichiro Tashiro

  • Sorry for the delay; I am out of office and Monday is a holiday. Will respond as soon as I can.

    ~ Judith

  • Based on my support experience, the NG1.txt log points to a DDR issue.

  • Hi Judith,

    Will respond as soon as I can.

    Any updates on my questions?

    Hi Tony,

    Based on my support experience, the NG1.txt log points to a DDR issue.

    Could you explain a bit more why you think this is a DDR issue?
    As I mentioned earlier, if the boot finishes successfully, no issue occurs afterwards.
    So the root cause seems to lie in the boot process.

    Thanks and regards,
    Koichiro Tashiro 

  • Hi Koichiro,

    Is this the exact same log that is printed if you try SPI boot? If there is any difference, please post that boot log.

    mmc fail to send stop cmd
    spl_load_image_fat: error reading image tispl.bin, err - -2
    SPL: failed to boot from all boot devices
    ### ERROR ### Please RESET the board ###

    Based on:

    mmc fail to send stop cmd

    It seems that the mmc_read/write_blocks function call failed.


    Here are a few hints for debugging this issue further.

    • The most efficient and powerful approach to board bring-up is to access the SoC via a JTAG debugger and use a tool such as TI’s Code Composer Studio (CCS) or Lauterbach (T32) to inspect the device and code.

    • Use basic printf()-style debugging to understand where the code stops.

    • Rebuild u-boot with debug option:
       CONFIG_LOG_MAX_LEVEL=7

    ~ Judith

  • Hi Judith,

    Is this the exact same log that is printed if you try SPI boot? If there is any difference, please post that boot log.

    Below is the SPI flash boot log. As you can see, it is the same as NG1.txt.

    U-Boot SPL 2021.01 (Jul 18 2023 - 14:43:39 +0900)
    Resetting on cold boot to workaround ErrataID:i2331
    resetting ...
    
    U-Boot SPL 2021.01 (Jul 18 2023 - 14:43:39 +0900)
    SYSFW ABI: 3.1 (firmware rev 0x0008 '8.4.7--v08.04.07 (Jolly Jellyfi')
    SPL initial stack usage: 13448 bytes
    
    


    Thanks for the hints. I will have the customer try them.

    Thanks and regards,
    Koichiro Tashiro

  • Hi Judith,

    The customer tried the following:

    Rebuild u-boot with debug option:
     CONFIG_LOG_MAX_LEVEL=7

    It did not produce any additional log output.

    Use basic printf()-style debugging to understand where the code stops.

    The customer added some debug prints to the code (dlmalloc.c) and got the log below.

    U-Boot SPL 2021.01 (Sep 06 2023 - 11:40:39 +0900)
    Resetting on cold boot to workaround ErrataID:i2331
    resetting ...
    
    U-Boot SPL 2021.01 (Sep 06 2023 - 11:40:39 +0900)
    SYSFW ABI: 3.1 (firmware rev 0x0008 '8.4.7--v08.04.07 (Jolly Jellyfi')
    DDR: LPDDR4 1600 MT/s
    DDRSS_PI_83_DATA 0x27c0a000, DDRSS_CTL_342_DATA 0x00000000
    SPL initial stack usage: 13424 bytes
    mem_malloc_init 2.
    mem_malloc_init 3. mem_malloc_start = 0x84000000 size = 0x1000000
    

    The changes made to dlmalloc.c are below:
    --- a/common/dlmalloc.c	2022-12-06 11:26:57.000000000 +0900
    +++ b/common/dlmalloc.c	2023-09-05 17:58:21.916249216 +0900
    @@ -611,15 +611,20 @@
     	mem_malloc_brk = start;
     
     #ifdef CONFIG_SYS_MALLOC_DEFAULT_TO_INIT
    +	printf("mem_malloc_init 1.\n");
     	malloc_init();
     #endif
     
    +	printf("mem_malloc_init 2.\n");
     	debug("using memory %#lx-%#lx for malloc()\n", mem_malloc_start,
     	      mem_malloc_end);
     #ifdef CONFIG_SYS_MALLOC_CLEAR_ON_INIT
    +	printf("mem_malloc_init 3. mem_malloc_start = 0x%lx size = 0x%lx\n",mem_malloc_start,size);
     	memset((void *)mem_malloc_start, 0x0, size);
     #endif
    +	printf("mem_malloc_init 4.\n");
     	malloc_bin_reloc();
    +	printf("mem_malloc_init 5.\n");
     }
     
     /* field-extraction macros */
    

    Since "mem_malloc_init 4." is never printed, the code appears to stop somewhere inside
    memset((void *)mem_malloc_start, 0x0, size);

    Do you have any debugging ideas to narrow down the root cause?

    Thanks and regards,
    Koichiro Tashiro
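
A minimal host-side sketch of one way to narrow down a hang like the one in memset(): clear the region in small chunks, print progress before each chunk, and read every word back. The helper name clear_and_verify() and the chunk size are hypothetical and not part of U-Boot, and the sketch assumes printf() is still usable at that point in SPL. If the hang is a stalled bus access rather than bad data, the last progress line printed marks the chunk where it happened.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Clear `words` 32-bit words starting at `base` in chunks of `chunk_words`,
 * printing progress before each chunk and reading every word back.
 * Returns the index of the first word that does not read back as zero,
 * or `words` if the whole region verified.
 * Hypothetical helper for illustration; not part of U-Boot.
 */
size_t clear_and_verify(volatile uint32_t *base, size_t words, size_t chunk_words)
{
	assert(chunk_words > 0);

	for (size_t start = 0; start < words; start += chunk_words) {
		size_t end = start + chunk_words;

		if (end > words)
			end = words;

		/* The last line printed before a hang identifies the failing chunk. */
		printf("clearing words %zu..%zu\n", start, end - 1);

		for (size_t i = start; i < end; i++)
			base[i] = 0;

		for (size_t i = start; i < end; i++) {
			if (base[i] != 0)
				return i;
		}
	}
	return words;
}
```

On the target, this would stand in for the memset() call, using mem_malloc_start as the base and size / 4 as the word count; that wiring is an assumption and has not been tried on AM64x hardware.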

  • Could you explain a bit more why you think this is a DDR issue?
    As I mentioned earlier, if the boot finishes successfully, no issue occurs afterwards.
    So the root cause seems to lie in the boot process.

    I have had two customers using AM62x whose boot stopped at: SPL initial stack usage: 13424 bytes.

    One was a DDR configuration issue and the other was a layout issue.

    They were not on AM64; this is just for your reference.

  • Hi Koichiro,

    Looking at the logs, the error could be anything. But we can focus on DDR first to make sure that it is not failing.

    I recommend taking a working board with the same DDR configuration, booting to the Linux kernel, and stress testing DDR using memtester: https://linux.die.net/man/8/memtester.

    Here are a few posts that can help in understanding how to use the tool if you require some examples:

    - e2e.ti.com/.../sk-am62-crash-when-performing-linux-memtester
    - e2e.ti.com/.../linux-am5748-test-of-asynchronous-memory
    - e2e.ti.com/.../am4378-ddr-r-w-fails-after-many-iterations-of-testing-using-memtester-command

    Please let me know the result of running memtester. If there is indeed an issue with DDR, then we can bring in a DDR expert to look over the DDR configuration in U-Boot.

    ~ Judith

  • Hi Judith,

    The customer ran a DDR stress test using memtester and confirmed there were no errors, so DDR-related causes have been set aside for now.

    Instead, the customer found that asserting MCU_PORz always resolves the issue.
    Here is the scenario.
    1) The system powers up and MCU_PORz is released.
    2) The software warm reset (SW_MCU_WARMRSTz) is asserted for the Errata i2331 workaround, as you can see in the boot log.
    3) If the issue happens, U-Boot stops here; otherwise U-Boot runs and Linux boots properly.
    4) If the issue happens and U-Boot stops, MCU_PORz is asserted by driving the pin low externally, and then MCU_PORz is released.
    5) The boot process starts again.
    6) The software warm reset (SW_MCU_WARMRSTz) is asserted for the Errata i2331 workaround.
    7) U-Boot never stops in this case. Linux always boots properly.

    There seems to be some difference in conditions between Step 1 and Step 4 above.
    The customer checked the power supply sequence and clock status, and both are OK.

    Thanks and regards,
    Koichiro Tashiro

  • Koichiro, in step 4, can the customer connect with JTAG and observe a memory window in DDR address space 0x80000000?  I just want to see if DDR memory is stable at this point.  They can either enable continuous refresh or try to peek/poke values.  

    Regards,

    James

  • Hi James,

    in step 4, can the customer connect with JTAG and observe a memory window in DDR address space 0x80000000? 

    Yes. The customer can read/write DDR memory at 0x8000:0000.

    They ran some tests with various reset sources and found the following.
    a) After the issue happens, asserting MCU_PORz recovers the system completely. No U-Boot hang happens after MCU_PORz. (I reported this yesterday.)

    b) They tried adding SW resets *before* the issue happens, and it worked. The issue never happens if the changes below are made in /common/spl/spl.c.

    --- a/common/spl/spl.c	2022-12-06 11:26:57.000000000 +0900
    +++ b/common/spl/spl.c	2023-09-12 11:03:56.491393473 +0900
    @@ -42,6 +42,16 @@
     #define CONFIG_SYS_MONITOR_LEN	(200 * 1024)
     #endif
     
    +#define COLD_BOOT                              0
    +#define MCU_CTRL_MMR0_BASE                      0x04500000
    +#define CTRL_MMR0_BASE                          0x43000000
    +#define CTRLMMR_MCU_RST_CTRL                    0x04518170
    +#define CTRLMMR_RST_CTRL_PHY0                   0x43018170
    +#define CTRLMMR_RST_CTRL_PHY1                   0x4301A170
    +#define CTRLMMR_MCU_RST_SRC                    (MCU_CTRL_MMR0_BASE + 0x18178)
    +#define CTRLMMR_RST_SRC_PHY0                   (CTRL_MMR0_BASE + 0x18178)
    +#define CTRLMMR_RST_SRC_PHY1                   (CTRL_MMR0_BASE + 0x1A178)
    +
     u32 *boot_params_ptr = NULL;
     
     /* See spl.h for information about this */
    @@ -803,10 +813,26 @@
     #ifdef CONFIG_SPL_STACK_R
     	gd_t *new_gd;
     	ulong ptr = CONFIG_SPL_STACK_R_ADDR;
    +	u32 stat_test, statphy0_test, statphy1_test;
    +	int rst_src;
     
     	if (CONFIG_IS_ENABLED(SYS_REPORT_STACK_F_USAGE))
     		spl_relocate_stack_check();
     
    +	rst_src = readl(CTRLMMR_MCU_RST_SRC);
    +	printf("For second reset read rst_src = %x\n", rst_src);
    +	if (rst_src == COLD_BOOT || !(rst_src & 0x1000000)) {
    +		stat_test = readl(CTRLMMR_MCU_RST_CTRL);
    +		stat_test &= 0xFFFFF666;
    +/*		statphy0_test = readl(CTRLMMR_RST_CTRL_PHY0); /* */
    +/*		statphy0_test &= 0xFFFFFF66; /* */
    +/*		statphy1_test = readl(CTRLMMR_RST_CTRL_PHY1); /* */
    +/*		statphy1_test &= 0xFFFFFF66; /* */
    +/*		writel(statphy0_test, CTRLMMR_RST_CTRL_PHY0); /* */
    +/*		writel(statphy1_test, CTRLMMR_RST_CTRL_PHY1); /* */
    +		writel(stat_test, CTRLMMR_MCU_RST_CTRL);
    +	}
    +
     #if defined(CONFIG_SPL_SYS_MALLOC_SIMPLE) && CONFIG_VAL(SYS_MALLOC_F_LEN)
     	if (CONFIG_SPL_STACK_R_MALLOC_SIMPLE_LEN) {
     		debug("SPL malloc() before relocation used 0x%lx bytes (%ld KB)\n",
    

    As you can see, the patch applies three SW resets (SW_MCU_WARMRST, SW_MAIN_POR and SW_MAIN_WARMRST).
    But they confirmed that asserting only SW_MAIN_WARMRST is enough.

    As you know, our SDK already includes a MAIN domain SW warm reset (SW_MCU_WARMRST) for the i2331 workaround.
    The i2331 workaround is done in root/arch/arm/mach-k3/am642_init.c. It may sound strange, but the i2331 workaround does not prevent this U-Boot hang: U-Boot hangs after the i2331 reset, as you can see in the log I posted at the top of this thread.
    	rst_src = readl(CTRLMMR_MCU_RST_SRC);
    	if (rst_src == COLD_BOOT || rst_src & (SW_POR_MCU | SW_POR_MAIN)) {
    		printf("Resetting on cold boot to workaround ErrataID:i2331\n");
    		do_reset(NULL, 0, 0, NULL);
    	}


    Questions:
    Q1) Do you have any potential root causes in mind?
    It seems something is wrong in the MAIN domain and it can be cleared by a warm reset, but the warm reset needs to be asserted somewhere *after* the i2331 workaround is done.

    Q2) Could adding SW_MAIN_WARMRST in spl.c cause any harm?
    I guess not, but please confirm.

    Q3) Please let me know if you have any further debugging approaches for this issue.

    Thanks and regards,
    Koichiro Tashiro
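
The arithmetic in the spl.c patch can be checked on a host. This sketch only mirrors the condition and the mask from the patch; which reset each nibble of CTRLMMR_MCU_RST_CTRL controls is taken from the description in this thread and has not been verified against the TRM, and the function and macro names other than COLD_BOOT are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define COLD_BOOT        0u
#define RST_SRC_TEST_BIT 0x01000000u /* bit 24, the bit tested by the patch */
#define RST_CTRL_MASK    0xFFFFF666u /* mask applied by the patch */

/* Mirrors the patch's condition for issuing the extra reset. */
bool needs_extra_reset(uint32_t rst_src)
{
	return rst_src == COLD_BOOT || !(rst_src & RST_SRC_TEST_BIT);
}

/*
 * Mirrors the patch's read-modify-write of the reset control register:
 * only the low three nibbles can change, bits 31..12 are preserved.
 */
uint32_t masked_rst_ctrl(uint32_t current)
{
	uint32_t result = current & RST_CTRL_MASK;

	/* Postcondition: the upper 20 bits are never modified by the mask. */
	assert((result & 0xFFFFF000u) == (current & 0xFFFFF000u));
	return result;
}
```

One observation that follows from the code: a reset source value of 0 already has bit 24 clear, so the `rst_src == COLD_BOOT` comparison is covered by the second half of the condition.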


  • Hi Koichiro, 

    Ensure they double-check all power rails on power-up in relation to the timing of MCU_PORz, and even beyond the rising edge of MCU_PORz, to ensure stable power supplies during boot.  Since they can get things working with a subsequent reset, it may be possible that power supplies are taking some time to stabilize.

    Also, ensure all reset inputs are stable after MCU_PORz rising edge.  How are RESETz and MCU_RESETz connected on the schematic?

    Regards,

    James

  • Hi James,

    Since they can get things working with a subsequent reset, it may be possible that power supplies are taking some time to stabilize.

    The customer considered that point and tried the tests below.
    c) Added a delay loop (~30 sec) just before the i2331 reset, so that SW_MCU_WARMRST is asserted late enough for any power supplies or clock sources to have stabilized. It did not help.
    d) Added a delay loop (~30 sec) just before the i2331 reset, then asserted MCU_PORz from an external pin. It did not help.

    Also, ensure all reset inputs are stable after MCU_PORz rising edge.

    They will check them and let us know.

    How are RESETz and MCU_RESETz connected on the schematic?

    Both MCU_RESETz and RESET_REQz are pulled up to 3.3V on the board.
    I will send you the schematics offline.

    Thanks and regards,
    Koichiro Tashiro

  • Greetings Koichiro,

    Please keep in mind James is on timebank, it will take some time for his response.

    Sincerely,

    Lucas

  • Here are some updates.

    1) The customer checked the behavior of the reset signals. Please see the waveforms below.
    In summary, there is no suspicious behavior. The failing board and the working board both behave the same, as expected.
    reset.xlsx

    2) The customer tried to pinpoint where SW_MAIN_WARMRST has to be placed for the workaround to take effect.
    They found that if the reset is asserted *before* the code below, it does not help.
    But if the reset is asserted *after* the code below, it works and no U-Boot hang is observed.
    arch/arm/mach-k3/am642_init.c, inside board_init_f():

    #if defined(CONFIG_K3_AM64_DDRSS)
    
    /* adding SW_MAIN_WARMRST here or before, it does not help */
    
    	ret = uclass_get_device(UCLASS_RAM, 0, &dev);
    
    /* adding SW_MAIN_WARMRST here or after, it works as workaround */	
    	
    	if (ret)
    		panic("DRAM init failed: %d\n", ret);
    #endif


    What exactly does the code below do? It appears to be DDR initialization.
    ret = uclass_get_device(UCLASS_RAM, 0, &dev);

    Thanks and regards,
    Koichiro Tashiro

  • Greetings Koichiro,

    It is good to see they've gotten around their U-boot failure. I am not sure what that function does exactly, though it does seem likely it's related to DDR init. I will see if I can get the right U-boot expert to look at this.

    Sincerely,

    Lucas

  • Yes, DDR configuration is done by the function "uclass_get_device(UCLASS_RAM, 0, &dev)" in R5-SPL.

  • Hi Hong,

    So, something is wrong in DDR initialization.
    The "ret" value seems to indicate success, because no panic message is shown in the log.
    How does the function check that DDR initialization completed properly?

    Thanks and regards,
    Koichiro Tashiro

  • Hi Tashiro-san,

    So, something is wrong in DDR initialization.
    The "ret" value seems to indicate success, because no panic message is shown in the log.
    How does the function check that DDR initialization completed properly?

    There's another possibility. Perhaps the DDR init actually worked (as indicated by the lack of error messages), but the U-Boot R5 SPL hangs/crashes for other reasons, which I have seen happen before.

    One valuable experiment they can try in this context is to free up some SRAM by disabling CONFIG options they don't need for the first stage bootloader (R5 SPL, also known as "tiboot3.bin"). This stage runs from internal SRAM and space is very VERY tight. And you do not necessarily get a build error if you use too much memory, so it can cause hidden side effects, such as crashes or hidden DDR init failures (I've seen the DDR device tree data getting overwritten by the stack in some cases, leading to incorrect DDR init values getting used).

    For example, one good experiment is to disable SPI flash support, by turning off all of the below CONFIG options in your configs/am64x_evm_r5_defconfig equivalent file. This should free quite a bit of memory. Even if they do want SPI flash support later, I do suggest this as an experiment to help narrow things down.

    • CONFIG_SYS_SPI_U_BOOT_OFFS
    • CONFIG_SPL_DM_SPI
    • CONFIG_SPL_SPI_FLASH_SUPPORT
    • CONFIG_SPL_SPI_SUPPORT
    • CONFIG_SPL_DM_SPI_FLASH
    • CONFIG_SPL_SPI_FLASH_SFDP_SUPPORT
    • CONFIG_SPL_SPI_LOAD

    The one thing that goes against this theory I think is the temperature dependency that was reported earlier. Nevertheless I'd like to see the result of this experiment.

    Regards, Andreas
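
The stack-overwrite scenario described above can be checked with a generic stack-painting technique: fill the stack region with a known byte before use, then later count how much of the pattern survived. This is a minimal host-side sketch; the function names and the 0xAA pattern are illustrative assumptions, and whether the "SPL initial stack usage" figure in the logs is derived this way is not stated in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAINT_BYTE 0xAAu

/* Fill the whole stack region with a known pattern before it is used. */
void stack_paint(uint8_t *region, size_t size)
{
	assert(region != NULL);
	memset(region, PAINT_BYTE, size);
}

/*
 * Estimate how many bytes of the region have been used, assuming a stack
 * that grows downward from region + size toward region. Scans upward from
 * the lowest address until the first byte that no longer holds the pattern.
 * A used byte that happens to equal PAINT_BYTE at the boundary would make
 * the estimate slightly too small.
 */
size_t stack_bytes_used(const uint8_t *region, size_t size)
{
	size_t untouched = 0;

	assert(region != NULL);
	while (untouched < size && region[untouched] == PAINT_BYTE)
		untouched++;
	return size - untouched;
}
```

If the reported usage approaches the size reserved for the stack, data placed directly below it is at risk of being overwritten, which is the failure mode described in the post above.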

  • Hi Andreas,

    I will ask the customer to check this point, but would that internal SRAM problem show board dependency?
    According to your explanation, that issue seems to be purely a software problem, so all boards would show the same behavior.
    The issue the customer observes is board dependent.

    Also, the customer's issue is recovered by a warm reset. Is the other case (internal SRAM issue) also recovered by a warm reset?

    Thanks and regards,
    Koichiro Tashiro

  • Hi Tashiro-san,

    Also, the customer's issue is recovered by a warm reset. Is the other case (internal SRAM issue) also recovered by a warm reset?

    If you mean by "internal SRAM issue" the potential memory/code corruption I referred to, I'm not sure. I would think the memory layout would always be the same during U-Boot SPL on R5 execution, as there is no operating system running, and no asynchronous events happening, and so on, so the boot process should be very predictable during that early stage I think. But perhaps a memory corruption could affect the DDR tuning algorithm in a way that it brings up the DDR configuration in a less-robust manner, making it more prone to failure. It is a theory, probably not very likely, but something we should check.

    Also, in the cases where the customer was able to boot into Linux, have they run the `memtester` utility part of our standard SDK? If not, they should run this for hours to help uncover any general issues there might be with DDR.

    Regards, Andreas

  • Hi Andreas,

    If you mean by "internal SRAM issue" the potential memory/code corruption I referred to

    Yes. This is what I meant. Anyway, the customer will check it.

    Also, in the cases where the customer was able to boot into Linux, have they run the `memtester` utility part of our standard SDK? If not, they should run this for hours to help uncover any general issues there might be with DDR.

    I have one customer failing board and tested it. The memtester works fine.

    Koichiro, in step 4, can the customer connect with JTAG and observe a memory window in DDR address space 0x80000000? 

    I have one correction to my answer to the above question from James. I previously answered that DDR was accessible, based on the customer's answer, but that was not correct.
    I confirmed that DDR access is corrupted after the issue happens.
    I used the "AM64 DDRSS Debug" GELs and found that DDR initialization does not seem to have completed.
    Both PI Initialization and CTL Initialization were triggered but did not complete.
    I also captured CTL_PI_PHY_RegDump and SS_RegDump. Logs are attached. I think James can provide some feedback here.
    DDR_RegDump_bad.txt
    If DDRSS is reset by GEL (Power Sleep Controller -> Reset by Power Domain Controllers -> PD_DDRSS_Reset)
    and DDR is then initialized again by GEL (AM64 DDR Initialization -> AM64_DDR_Initialization_ECC_Disabled),
    DDR can be accessed. The RegDump and log are attached for reference.
    DDR_RegDump_good.txt

    So DDR initialization was failing in fact. We do not know the reason, yet.

    Thanks and regards,
    Koichiro Tashiro
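
Since PI and CTL initialization were triggered but did not complete, one generic defensive pattern is to poll a completion bit with an upper bound and report a timeout instead of waiting indefinitely. The sketch below is host-side C with a fake register; the names, the callback interface and the poll limit are hypothetical and do not reflect how the k3-ddrss driver actually checks completion, which is not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Callback that returns the current value of a status register. */
typedef uint32_t (*reg_read_fn)(void *ctx);

/*
 * Poll a status register until all bits in `mask` are set or `max_polls`
 * reads have been made. Returns true on completion, false on timeout.
 */
bool poll_until_set(reg_read_fn read_reg, void *ctx, uint32_t mask,
		    unsigned int max_polls)
{
	assert(read_reg != NULL);

	for (unsigned int i = 0; i < max_polls; i++) {
		if ((read_reg(ctx) & mask) == mask)
			return true;
	}
	return false;
}

/* Fake register for host-side testing: reports done after N reads. */
struct fake_reg {
	unsigned int reads_until_done;
	uint32_t done_mask;
};

uint32_t fake_reg_read(void *ctx)
{
	struct fake_reg *reg = ctx;

	if (reg->reads_until_done > 0) {
		reg->reads_until_done--;
		return 0;
	}
	return reg->done_mask;
}
```

A bounded poll only helps if the register read itself returns. If the access stalls the bus, the loop never reaches its timeout, so this pattern distinguishes "init never completes" from "register access hangs" but does not fix either.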

  • I have one customer failing board and tested it. The memtester works fine.

    Which command line did you use (parameters?) How long did you run the test for?

    But this sounds great!!

    So DDR initialization was failing in fact. We do not know the reason, yet.

    Ok. The experiment I suggested (reduce U-Boot R5 SPL memory footprint) is one possibility, so I'm curious as to the result. I can't think of other reasons at the moment, and will need to consult with DDR expert James once he's back on Thursday. Until then I'll also look at the DDR driver some more to see if anything sticks out and let you know if that's the case.

    Regards, Andreas

  • Hi Andreas,

    Which command line did you use (parameters?) How long did you run the test for?

    I got a memtester *.out file that runs on the A53 from James and used it for the test. The code runs the following tests in a loop (log output below).
    I ran it for an hour.

    Loop: 1
    (00)  Stuck Address       : ok
    (01)  Random Value        : ok
    (02)  Compare XOR         : ok
    (03)  Compare SUB         : ok
    (04)  Compare MUL         : ok
    (05)  Compare DIV         : ok
    (06)  Compare OR          : ok
    (07)  Compare AND         : ok
    (08)  Seq Increment       : ok
    (09)  Solid Bits          : ok
    (10)  Blk Sequential      : ok
    (11)  Checkerboard        : ok
    (12)  Bit Spread          : ok
    (13)  Bit Flip            : ok
    (14)  Walking Ones        : ok
    (15)  Walking Zeroes      : ok
    (20)  Algorithmic ISI/SSO Patterns: ok
    Loop FINISHED
        FAILCNT Stuck Address                  : 0 (total:0 fails for 1 completed loops)
        FAILCNT Random Value                   : 0 (total:0 fails for 1 completed loops)
        FAILCNT Compare XOR                    : 0 (total:0 fails for 1 completed loops)
        FAILCNT Compare SUB                    : 0 (total:0 fails for 1 completed loops)
        FAILCNT Compare MUL                    : 0 (total:0 fails for 1 completed loops)
        FAILCNT Compare DIV                    : 0 (total:0 fails for 1 completed loops)
        FAILCNT Compare OR                     : 0 (total:0 fails for 1 completed loops)
        FAILCNT Compare AND                    : 0 (total:0 fails for 1 completed loops)
        FAILCNT Seq Increment                  : 0 (total:0 fails for 1 completed loops)
        FAILCNT Solid Bits                     : 0 (total:0 fails for 1 completed loops)
        FAILCNT Blk Sequential                 : 0 (total:0 fails for 1 completed loops)
        FAILCNT Checkerboard                   : 0 (total:0 fails for 1 completed loops)
        FAILCNT Bit Spread                     : 0 (total:0 fails for 1 completed loops)
        FAILCNT Bit Flip                       : 0 (total:0 fails for 1 completed loops)
        FAILCNT Walking Ones                   : 0 (total:0 fails for 1 completed loops)
        FAILCNT Walking Zeroes                 : 0 (total:0 fails for 1 completed loops)
        FAILCNT Algorithmic ISI/SSO Patterns   : 0 (total:0 fails for 1 completed loops)



    Until then I'll also look at the DDR driver some more to see if anything sticks out and let you know if that's the case.

    The customer narrowed down where SW_MAIN_WARMRST has to be placed for the workaround to take effect.
    Inside drivers/ram/k3-ddrss/lpddr4.c, there are the functions lpddr4_startsequencecontroller() and lpddr4_readreg().
    If SW_MAIN_WARMRST is done *before* the line:
    CPS_REG_WRITE(&(ctlregbase->LPDDR4__START__REG), regval);
    which is inside lpddr4_startsequencecontroller(), the issue still occurs after system reboot.

    If SW_MAIN_WARMRST is done *after* the line:
    *regvalue = CPS_REG_READ(lpddr4_addoffset(&(ctlregbase->DENALI_CTL_0), regoffset));
    which is inside lpddr4_readreg(), the issue never occurs after system reboot. So it can serve as a workaround.

    It may sound strange, but if SW_MAIN_WARMRST is done anywhere between the above two lines, the system locks up.
    ("between" refers to the code execution sequence, not just line numbers in a text editor)
    This lock-up occurs on all boards, so I guess we cannot assert SW_MAIN_WARMRST there.

    Thanks and regards,
    Koichiro Tashiro 

  • One correction.

    The customer narrowed down where SW_MAIN_WARMRST has to be placed for the workaround to take effect.

    The reset source used is SW_MCU_WARMRST.

    Thanks and regards,
    Koichiro Tashiro

  • Hi Tashiro-san,

    That's some good additional input. James will need to comment, probably tomorrow or Friday; I'm not familiar with the DDR controller internals and how/why this would make a difference.

    Regards, Andreas

  • One minor question regarding that U-Boot function: is it blocking? That is, if you call do_reset(), will it return to the calling function, or will it spin in while(1) and never return?

  • Hi Lucas,

    I am not sure I understand your question correctly.
    do_reset() issues SW_MCU_WARMRST, so the device reboots.
    I summarized the code sequence for each scenario in the attachment.
    sequence details.pptx
    Thanks and regards,
    Koichiro Tashiro

  • Koichiro, can you apply the following patch to the DDR driver  https://lore.kernel.org/u-boot/20230717221525.3693897-2-bb@ti.com/

    This fixes an issue in which the controller hangs when accessing controller registers during training.  This may be why the reset is needed in some situations to get through u-boot.

    Regards,

    James

  • Hi Tashiro-san,

    I summarized the code sequence for each scenario in the attachment.
    sequence details.pptx

    Thank you again for this summary, I was just trying to better understand the background here and that made this process much easier.

    Koichiro, can you apply the following patch to the DDR driver  https://lore.kernel.org/u-boot/20230717221525.3693897-2-bb@ti.com/

    Yes please test the patch that James provided. I think the background of his workaround seems slightly different from what you are seeing (as the potential hang that is being resolved is during training), but we definitely should try it.

    Ok. The experiment I suggested (reduce U-Boot R5 SPL memory footprint) is one possibility, so I'm curious as to the result

    Did they get a chance to run this experiment yet?

    Regards, Andreas

  • Hi James, Andreas,

    The customer will try the patch and the reduced U-Boot SPL footprint tomorrow.
    I will let you know the results.

    Thanks and regards,
    Koichiro Tashiro 

  • Hi James,

    can you apply the following patch to the DDR driver  https://lore.kernel.org/u-boot/20230717221525.3693897-2-bb@ti.com/

    The customer applied the patch, but the issue still happens.

    Hi Andreas,

    Ok. The experiment I suggested (reduce U-Boot R5 SPL memory footprint) is one possibility, so I'm curious as to the result

    The customer also tried this, but the issue still happens.

    Can you find anything suspicious in the RegDump?

    Thanks and regards,
    Koichiro Tashiro

  • Hi James, Andreas,

    I captured a RegDump when U-Boot stops, with JTAG connected to the R5.
    DDR_RegDump_no_resetWA.txt
    I also captured a RegDump for the case where the i2331 WA is moved. This is the slide #3 case in "sequence details.pptx".
    I stopped U-Boot and connected JTAG to the A53 to get the dump.
    DDR_RegDump_resetWA.txt

    Below is a summary of the differences. Do you see anything in the list?
    RegValue comparison.xlsx

    Thanks and regards,
    Koichiro Tashiro

  • Hi Tashiro-san,

    looks like an important clue is also in one of the slides you prepared...

    ...I wonder why the additional access to ctlregbase->DENALI_CTL_0 which is done in the function lpddr4_startsequencecontroller() would have an impact here in terms of working/not working.

    Will need James' input on that; it's a very specific topic, and the TRM doesn't have the needed level of detail here to understand the logs as well.

    Regards, Andreas

  • Koichiro, I compared the two regdumps, and I only see slight differences in training results (which is expected each time the training sequence is run).  So both regdumps show acceptable training results, and the DDR memory should be accessible.

    Did you say that in a failed case, the LPDDR4 shows all zeros?  If so can you run that experiment again:

    -Boot to failure

    -connect to A53 with CCS and perform a regdump

    -open Memory Browser and view memory at 0x80000000

    -if still showing zeros, perform a CPU reset on the A53 that you are connected to and check Memory Browser again

    -if still showing zeros, connect to another core (like R5) and check Memory Browser again

    Regards,

    James

  • Hi James,

    I compared the two regdumps, and I only see slight differences in training results (which is expected each time the training sequence is run).  So both regdumps show acceptable training results, and the DDR memory should be accessible.

    I see. Thanks.

    Did you say that in a failed case, the LPDDR4 shows all zeros?  If so can you run that experiment again:

    No. The LPDDR4 shows random values. I will confirm it again tomorrow.

    -connect to A53 with CCS and perform a regdump

    I thought the A53 was not accessible in a failed case. Anyway, I will check it.

    Thanks and regards,
    Koichiro Tashiro

  • Hi James,

    I checked DDR memory access in a failed case.
    I could not connect to the A53; CCS reported the following error:

    Error connecting to the target:
    (Error -2081 - (0:0:0))
    Device functional clock appears to be off. Power-cycle the board. If error persists, confirm configuration and/or try more reliable JTAG settings (e.g. lower TCLK).
    (Emulation package 9.12.0.00150)

    So I connected to R5_0 and checked DDR memory.
    DDR memory showed random values.
    These values changed each time the memory window was refreshed.


    Thanks and regards,
    Koichiro Tashiro

  • Hi Tashiro-san,

    can you please also have a look at the state of DDR right after the DDR initialization, to see if you can also re-create a failure case?

    Basically, looking at this function here...

    arch/arm/mach-k3/am642_init.c:

    void board_init_f(ulong dummy)
    {
    #if defined(CONFIG_K3_LOAD_SYSFW) || defined(CONFIG_K3_AM64_DDRSS) || defined(CONFIG_ESM_K3)
            struct udevice *dev;
            int ret;
            int rst_src;
    #endif

    <...snip...>

    #if defined(CONFIG_K3_AM64_DDRSS)
            ret = uclass_get_device(UCLASS_RAM, 0, &dev);
            if (ret)
                    panic("DRAM init failed: %d\n", ret);
    #endif
            if (IS_ENABLED(CONFIG_SPL_ETH_SUPPORT) && IS_ENABLED(CONFIG_TI_AM65_CPSW_NUSS) &&
                spl_boot_device() == BOOT_DEVICE_ETHERNET) {
                    struct udevice *cpswdev;

                    if (uclass_get_device_by_driver(UCLASS_MISC, DM_GET_DRIVER(am65_cpsw_nuss), &cpswdev))
                            printf("Failed to probe am65_cpsw_nuss driver\n");
            }
            spl_enable_dcache();
    }

    ...can you insert a while (1) loop (or something similar) right after the call to uclass_get_device(UCLASS_RAM, 0, &dev); to trap the program execution, and then use CCS/JTAG to inspect and verify functionality of DDR? Will it also result in failing cases?

    This will check/validate to see if anything goes "wrong" during the brief time after DDR init, but before U-Boot crashes.
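
    For reference, a minimal sketch of such a trap is shown here. This is only an illustration, not code from the SDK: the names ddr_debug_release and ddr_debug_trap() are hypothetical, and it assumes the flag lives in on-chip RAM (not DDR) so it stays readable even when DDR is not working. Making the flag volatile lets you release the loop from CCS by writing a non-zero value to it, and the optional spin limit keeps an instrumented build from hanging forever when no debugger is attached.

```c
#include <stdint.h>

/* Hypothetical debug flag: stays 0 until a debugger writes a non-zero value. */
volatile uint32_t ddr_debug_release;

/*
 * Spin until *release becomes non-zero, or until max_spins polls have passed.
 * max_spins == 0 means "wait forever", which is the plain while (1) case.
 * Returns 1 if the flag was set, 0 if the spin limit was reached first.
 */
int ddr_debug_trap(volatile uint32_t *release, uint32_t max_spins)
{
	uint32_t spins = 0;

	while (*release == 0) {
		if (max_spins != 0 && ++spins >= max_spins)
			return 0;
	}
	return 1;
}
```

    In board_init_f() the call would go right after the uclass_get_device(UCLASS_RAM, 0, &dev) check, for example ddr_debug_trap(&ddr_debug_release, 0). Once DDR has been inspected, writing 1 to ddr_debug_release from the debugger lets SPL continue.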

    Thanks, Andreas

  • Hi Andreas,

    can you please also have a look at the state of DDR right after the DDR initialization, to see if you can also re-create a failure case?

    I connected CCS just after "ret = uclass_get_device(UCLASS_RAM, 0, &dev);" and checked DDR access.
    DDR accesses are failing here.
    I also tried resetting the DDRSS and then initializing LPDDR4 with the GEL files. After that, DDR accesses worked fine.

    This will check/validate to see if anything goes "wrong" during the brief time after DDR init, but before U-Boot crashes.

    I think the problem happens during the DDR initialization process.

    Do you have any updates from your side?

    Thanks and regards,
    Koichiro Tashiro

  • I connected CCS just after "ret = uclass_get_device(UCLASS_RAM, 0, &dev);" and checked DDR access.
    DDR accesses are failing here.
    I also tried resetting the DDRSS and then initializing LPDDR4 with the GEL files. After that, DDR accesses worked fine.

    Good to know.

    I think the problem happens during the DDR initialization process.

    Do you have any updates from your side?

    Yes, that seems to be the case. I'm not sure how the DDR init function can run through seemingly successfully yet leave the DDR in a non-functional state until the initialization is redone. I don't have any other suggestions from a U-Boot point of view at this time; we'll need to discuss this further with our DDR expert.
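
    One way to make this state visible from SPL itself would be a small read-back check right after DDR init. The sketch below is hypothetical: ddr_pattern_check() and ddr_read_stability() are not existing U-Boot functions. It also assumes the data cache is still off at that point, which matches the function quoted earlier, where spl_enable_dcache() is only called after the DDR init. The first helper writes an address-derived pattern and counts mismatches on read-back. The second re-reads a single word several times, mirroring the observation that values changed on every Memory Browser refresh.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Write an index-derived pattern to `words` 32-bit words starting at `base`,
 * then read everything back. Returns the number of mismatching words.
 * Note: this overwrites the tested region.
 */
size_t ddr_pattern_check(volatile uint32_t *base, size_t words)
{
	size_t i, errors = 0;

	for (i = 0; i < words; i++)
		base[i] = 0xA5A5A5A5u ^ (uint32_t)i;

	for (i = 0; i < words; i++) {
		if (base[i] != (0xA5A5A5A5u ^ (uint32_t)i))
			errors++;
	}
	return errors;
}

/*
 * Read the same word `reads` times without writing to it.
 * Returns how many of the later reads differ from the first one;
 * working memory should always give 0.
 */
size_t ddr_read_stability(const volatile uint32_t *addr, size_t reads)
{
	uint32_t first = *addr;
	size_t i, changed = 0;

	for (i = 1; i < reads; i++) {
		if (*addr != first)
			changed++;
	}
	return changed;
}
```

    For example, ddr_pattern_check((volatile uint32_t *)0x80000000, 1024) would test the first 4 KB at the DDR address used in the Memory Browser steps earlier in this thread. A non-zero result could then be used to log the failure or retry the initialization, but whether a retry is safe on this DDRSS is an open question for the DDR expert.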

    Regards, Andreas

  • Is there further offline discussion about this?

  • Yes, we had an on-site debug session 2 days ago. There were new findings, but the issue is still open. Once we close this issue I'll post the solution here so that everybody can benefit from it.

  • Just adding a comment to keep this thread active.

  • Hi Koichiro, responded via email.