AM3352: NAND boot fails

Part Number: AM3352

Dear Champs,

My customer's custom board intermittently fails to boot, and we found that NAND communication makes no progress when the failure occurs.

After a boot failure, the board boots successfully following a reset or power cycle.

We checked the power rails and found no issues there, but we did see something strange on GPMC_D0, which is connected to the NAND.

The customer's boot mode is NAND boot (SYSBOOT[4:0] = 10011b).

When we probed GPMC_D0 with an oscilloscope during a failed boot, there was no further data communication with the NAND after the first transaction, as shown below.

Could you please suggest how we can debug this issue further?

The NAND flash is Micron's MT29F1G08ABAEAWP.

In the picture below:

Z1(yellow) - GPMC_D0,

Z2(red) - GPMC_ALE,

Z3(blue) - GPMC_WAIT0,

Z4(green) - GPMC_CLE

* Normal case (boot success):

* Boot failure:

The interval between the first high and the first low is the same in both cases: 164 ms.

Thanks and Best Regards,

SI.

  • Please check the NAND R/B signal. It must be connected to GPMC_WAIT0 with external pullup.

  • Hi Biser,

    Yes, it is connected to GPMC_WAIT0 with an external pullup.

    I also found a similar issue with a Micron NAND in the E2E thread below.

    https://e2e.ti.com/support/processors/f/791/t/831456?tisearch=e2e-sitesearch&keymatch=NAND%252520AND%252520boot%252520AND%252520fail

    We are trying to connect over JTAG, but the connection fails (error -1170, cannot access the DAP).

    Please see the customer's schematic below for the GPMC_WAIT0 connection.

    Thanks and Best Regards,

    SI.

  • Dear Team,

    There is one more thing I would like to check with you.

    The current boot mode is SYSBOOT[4:0] = 10011b (NAND -> NANDI2C -> MMC0 -> UART0).

    To check for a timing issue, we changed the boot mode to SYSBOOT[4:0] = 00010b (UART0 -> SPI0 -> NAND -> NANDI2C). We saw 'CCC' on UART0, then the ROM moved on to NAND, the third boot device, and the boot failed.
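For reference, the two boot orders quoted above can be kept in a small lookup table. This is only a sketch in Python: it holds just the two SYSBOOT[4:0] values mentioned in this thread, and the names `BOOT_ORDER`, `boot_order` and `position` are made up for illustration. The complete table is in the AM335x TRM.

```python
# Boot orders quoted in this thread only. Other SYSBOOT[4:0] values are
# deliberately left out rather than guessed.
BOOT_ORDER = {
    0b10011: ["NAND", "NANDI2C", "MMC0", "UART0"],
    0b00010: ["UART0", "SPI0", "NAND", "NANDI2C"],
}


def boot_order(sysboot_4_0):
    """Return the ROM boot device order for a SYSBOOT[4:0] value."""
    key = sysboot_4_0 & 0x1F
    if key not in BOOT_ORDER:
        raise ValueError("SYSBOOT[4:0] = {:05b}b is not in this table".format(key))
    return BOOT_ORDER[key]


def position(sysboot_4_0, device):
    """Return the 1-based position of a device in the boot order."""
    return boot_order(sysboot_4_0).index(device) + 1


if __name__ == "__main__":
    print(position(0b10011, "UART0"))  # where UART0 sits in the original setting
    print(position(0b00010, "NAND"))   # where NAND sits in the test setting
```

With the original setting, UART0 is the last device tried, so 'CCC' would only appear after NAND, NANDI2C and MMC0 have all been given up on.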

    I expected the ROM to time out and move on to the next boot mode when there is a communication problem with the NAND or no response from it, but it appears to halt when NAND boot fails.

    If a time-out had occurred during NAND boot, 'CCC' should have appeared on UART0, because UART0 is the fourth boot device in the current setting (10011b).

    Do you have any idea why no time-out occurred during NAND boot?

    I also found an EEPROM connected to I2C0 on the customer's board. It is used for another purpose and holds no NAND information. Is it plausible that the ROM hung in NANDI2C boot mode because it read wrong information from this EEPROM?

    Thanks and Best Regards,

    SI.

  • SI,

    -how was the NAND flash programmed in the first place?  

    -did these boards work before and now they are failing?  

    -in the first post, you say "When boot fail was occurred, booting was successful after reset or power recycling."  Can you elaborate?  What is different about the failing case vs the passing case?  Is the failure random (sometimes it boots, and sometimes it doesn't)?

    -does the failure occur on many boards?  

    -can you identify the problem with the JTAG connection?  Can you connect to any of the cores?  You should be able to connect even after a boot failure.

    Regards,

    James

  • Hi James,

    The customer's board normally works well, and the boot failure occurs only occasionally. For example, the board boots normally several times and then fails to boot after 8 to 9 power cycles.

    When a boot failure occurs, it is resolved by the next power cycle. For example, if the board fails to boot on the 9th power cycle, it boots normally on the 10th.

    The failure occurs on many boards, but not on all of them.

    As I mentioned, I am currently unable to connect over JTAG.

     

    Thanks and Best Regards,

    SI.

  • Sung-IL said:
    We changed boot mode to SYSBOOT[4:0] = 00010b(UART0->SPI0->NAND->NANDI2C) to check timing issue, and found 'CCC' in the UART0 and try 3rd boot of NAND and boot fail.

    It looks to me like MLO is being successfully loaded by the ROM, but a crash occurs in MLO which prevents u-boot.img from being loaded.

    What SDK version are you using?  Is this u-boot/Linux?

    If you halt at u-boot on a good power up, are you able to connect JTAG?  Or can this board never connect over JTAG?  I think we need to fix the JTAG in order to get the necessary information to fix the main issue.

    How is sysboot9 configured?  One thought might relate to ECC errors causing issues if you have sysboot9 incorrectly configured.

    Once we have the JTAG operational, there's a DSS script that can decode most of what's happening in the chip to give us some hints as to what's happening.

    The general process for running these scripts is described here:

    http://git.ti.com/sitara-dss-files/am335x-dss-files/blobs/master/README

    In your particular instance, you should run this script:

    http://git.ti.com/sitara-dss-files/am335x-dss-files/blobs/raw/master/am335x-boot.dss

    Best regards,
    Brad

  • Hi Brad,

    I'm not sure whether MLO was successfully loaded by the ROM. The SW SDK is Linux PSDK v4.0.

    The JTAG connection problem is a HW issue: only a few JTAG pins are brought out on the customer's board, and we are trying to connect to them.

    Thanks and Best Regards,

    SI.

  • Dear,

    I mistakenly left out the following information for the capture image above.

    In the captured picture above:

    Z1(yellow) - GPMC_D0,

    Z2(red) - GPMC_ALE,

    Z3(blue) - GPMC_WAIT0,

    Z4(green) - GPMC_CLE

    Thanks and Best Regards,

    SI.

  • I found that the NAND flash had been changed to the MT29F2G08ABAEA, and this issue occurs with the MT29F2G08ABAEA.

    I'm sorry for the confusion.

    Thanks and Best Regards,

    SI.

  • Dear Champs,

    We found that this intermittent boot failure also occurs during reset cycling tests, e.g. 1 to 2 boot failures in 10 resets.

    As you can see above, the NAND memory has no connection to the AM3352 reset pin.

    Thanks and Best Regards,

    SI.

  • SI, please try to get JTAG operational so you can run the scripts Brad pointed to. It will help a lot to debug this issue.  Are all the SYSBOOT signals tied off appropriately?  

    Regards,

    James

  • James,

    There is still an issue with the JTAG connection.

    As you can see in the scope captures I shared earlier, the GPMC signals start the same way in the normal case and in the boot-fail case, but in the boot-fail case nothing further happens. So I assumed the SYSBOOT signals are still working correctly even in the boot-fail case.

    Do you still think it is worthwhile to check that the SYSBOOT signals are tied off appropriately?

    Thanks and Best Regards,

    SI.

  • Dear Champs,

    I finally succeeded in connecting JTAG and got the output below when I ran the am335x-boot.dss script Brad pointed to.

    1. Could you please explain what the output means?

    am335x-boot-analysis_2019-09-23_153516_test1.txt
    CONTROL: device_id = 0x2b94402e
      * AM335x family
      * Silicon Revision 2.1
    
    PRM_DEVICE: PRM_RSTST = 0x00000001
      * Bit 0 : GLOBAL_COLD_RST
    
    CONTROL: control_status = 0x00400333
      * SYSBOOT[15:14] = 01b (24 MHz)
      * SYSBOOT[11:10] = 00b No GPMC CS0 addr/data muxing
      * SYSBOOT[9] = 0 GPMC CS0 Ignore WAIT input
      * SYSBOOT[8] = 0 GPMC CS0 8-bit data bus
      * Device Type = General Purpose (GP)
      * SYSBOOT[7:6] = 00b MII (EMAC boot modes only)
      * SYSBOOT[5] = 1 CLKOUT1 enabled
      * Boot Sequence : NAND -> NANDI2C -> MMC0 -> UART0
    
    ROM: Current tracing vector, word 1 = 0x0010009e
      * Bit 1  : [General] Entered main function
      * Bit 2  : [General] Running after the cold reset
      * Bit 3  : [Boot] Main booting routine entered
      * Bit 4  : [Memory Boot] Memory booting started
      * Bit 7  : [Boot] GP header found
      * Bit 20 : [Configuration Header] CHSETTINGS found
    
    ROM: Current tracing vector, word 2 = 0x00018000
      * Bit 15 : [Memory Boot] Memory booting trial 3
      * Bit 16 : [Memory Boot] Execute GP image
    
    ROM: Current tracing vector, word 3 = 0x00000020
      * Bit 5  : [Memory Boot] Memory booting device MMCSD0
    
    ROM: Current copy of PRM_RSTST = 0x00000000
    
    ROM: Cold reset tracing vector, word 1 = 0x00000000
    
    ROM: Cold reset tracing vector, word 2 = 0x00000000
    
    ROM: Cold reset tracing vector, word 3 = 0x00000001
      * Bit 0  : [Memory Boot] Memory booting device NULL
    
    Cortex A8 Program Counter = 0x402f35a8
    
    ROM Exception Vectors
      * 0x4030CE04 Undefined
      * 0x4030CE08 SWI
      * 0x4030CE0C Pre-fetch abort
      * 0x4030CE10 Data abort
      * 0x4030CE14 Unused
      * 0x4030CE18 IRQ
      * 0x4030CE1C FIQ
    
    ROM Dead Loops
      * 0x00020080 Undefined exception default handler
      * 0x00020084 SWI exception default handler
      * 0x00020088 Pre-fetch abort exception default handler
      * 0x0002008C Data exception default handler
      * 0x00020090 Unused exception default handler
      * 0x00020094 IRQ exception default handler
      * 0x00020098 FIQ exception default handler
      * 0x0002009C Validation test PASS
      * 0x000200A0 Validation test FAIL
      * 0x000200A4 Reserved
      * 0x000200A8 Image not executed or returned
      * 0x000200AC Reserved
      * 0x000200B0 Reserved
      * 0x000200B4 Reserved
      * 0x000200B8 Reserved
      * 0x000200BC Reserved
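As a cross-check of the control_status lines in the output above, the raw value 0x00400333 can be split into fields by hand. The sketch below is in Python, and the bit positions are my own reading of how the script's SYSBOOT lines map onto the raw value, not something copied from the TRM, so please verify them there before relying on this. The function name `decode_control_status` is hypothetical.

```python
def decode_control_status(value):
    """Split control_status into the fields reported by am335x-boot.dss.

    Bit positions are assumptions inferred from the script output in this
    thread; check them against the AM335x TRM.
    """
    return {
        "sysboot_15_14": (value >> 22) & 0x3,  # crystal frequency select
        "sysboot_11_10": (value >> 18) & 0x3,  # GPMC CS0 addr/data muxing
        "sysboot_9": (value >> 17) & 0x1,      # GPMC CS0 WAIT input handling
        "sysboot_8": (value >> 16) & 0x1,      # GPMC CS0 data bus width
        "devtype": (value >> 8) & 0x7,         # the script reports 3 as GP
        "sysboot_7_6": (value >> 6) & 0x3,     # EMAC interface select
        "sysboot_5": (value >> 5) & 0x1,       # CLKOUT1 enable
        "sysboot_4_0": value & 0x1F,           # boot sequence select
    }


if __name__ == "__main__":
    for name, field in decode_control_status(0x00400333).items():
        print("{:14s} = {:b}b".format(name, field))
```

With this layout the value decodes to SYSBOOT[4:0] = 10011b and SYSBOOT[9] = 0, which agrees with what the script printed.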
    

    2. Before running the am335x-boot.dss script, I checked the disassembly window, and judging by the address, the PC appears to have stopped in internal SRAM.

    The PC appears to have stopped at 0x402F35a8.

    Does this mean the issue is in SPL, not in RBL?

    Thanks and Best Regards,

    SI.

  • Sung-IL said:

    it seemed the address PC was stopped is 0x402F35a8.

    Does this meant there was an issue in SPL, not RBL?

    The boot ROM appears to have loaded code and jumped to it.  I'm not sure yet if the issue is in SPL, as there are some other peculiarities in the output file you shared.  Here's an excerpt:


    ROM: Current tracing vector, word 2 = 0x00018000
    * Bit 15 : [Memory Boot] Memory booting trial 3
    * Bit 16 : [Memory Boot] Execute GP image

    ROM: Current tracing vector, word 3 = 0x00000020
    * Bit 5 : [Memory Boot] Memory booting device MMCSD0

    A couple things that don't look right to me:

    1. It shows memory booting trial 3, but it's missing trials 0, 1, 2.
    2. It shows memory booting device MMCSD0, but it doesn't show memory booting device NAND.

    Based on the above observations, I suspect that these trace vectors have been corrupted/overwritten.  They don't seem consistent.
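To reproduce the bit lists the script printed for the tracing vectors, a few lines of Python are enough. This is just a sketch: it lists which bits are set in each word quoted above, and the helper name `set_bits` is made up. The meaning of each bit still has to come from the DSS script or the TRM.

```python
def set_bits(word):
    """Return the positions of all bits set in a 32-bit trace word."""
    return [bit for bit in range(32) if (word >> bit) & 1]


# Current tracing vector words as reported in the output file above.
trace = {
    "word 1": 0x0010009E,
    "word 2": 0x00018000,
    "word 3": 0x00000020,
}

if __name__ == "__main__":
    for name, value in trace.items():
        print(name, set_bits(value))
```

Word 2 has only bits 15 and 16 set, which is the inconsistency pointed out above: trial 3 is flagged without any of the earlier trial bits.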

    Here are a few suggestions to better understand what's happening:

    1. Can you capture a similar file for a good boot for comparison?
    2. Can you remove u-boot.img such that only SPL loads/runs.  I'd like you to look at address 0x402f35a8 and see if it is the same code.  This will help us to understand whether somehow something else entirely is being loaded (e.g. from another boot device) or if it is something wrong with the SPL code itself.
    3. If you stop a bad boot several times, is it always at the exact same location?  If you single step a few times, is it stuck in some kind of loop?  It would probably be useful to load symbols (make sure you choose "load symbols" and not "load program") and see what this corresponds to in the source code.  That might give us a clue as to the issue.
    4. If you look at the EMIF registers during a failed boot, does it look like they have been programmed?  If so, can you open up a memory window to address 0x80000000 and try poking in a few values (0x00c0ffee, 0xdeadbeef, etc.) and refresh several times to see if they stick?  Perhaps the most common early boot failure is DDR configuration issues.

    Best regards,
    Brad

  • Hi Brad,

    Thanks for your answer. My answers are below.

    1. Can you capture a similar file for a good boot for comparison?

      Ans) Because Linux runs after booting, it is hard to capture a similar file for a good boot. Could you please tell me how I can connect JTAG before or after SPL on a good boot?

    2. Can you remove u-boot.img such that only SPL loads/runs.  I'd like you to look at address 0x402f35a8 and see if it is the same code.  This will help us to understand whether somehow something else entirely is being loaded (e.g. from another boot device) or if it is something wrong with the SPL code itself.

      Ans) I'll try, but I'm not sure how I can tell a good boot from a failed boot when only SPL is present.

    3. If you stop a bad boot several times, is it always at the exact same location?  If you single step a few times, is it stuck in some kind of loop?  It would probably be useful to load symbols (make sure you choose "load symbols" and not "load program") and see what this corresponds to in the source code.  That might give us a clue as to the issue.

      Ans) Yes, it always stopped at exactly the same location in the 5 to 6 boot failures I observed.

    Thanks and Best Regards,

    SI.

  • Also, there is no MMC device on the custom board, and the MMC0 pins are unconnected.

    So I am also wondering why the boot got stuck at MMC0 boot.

    Also, what does the 'Cold reset tracing vector' below mean?

    ROM: Cold reset tracing vector, word 1 = 0x00000000

    ROM: Cold reset tracing vector, word 2 = 0x00000000

    ROM: Cold reset tracing vector, word 3 = 0x00000001
    * Bit 0 : [Memory Boot] Memory booting device NULL

    Also, what does 'Execute GP image' below mean?

    Does it mean the boot image was run?

    * Bit 15 : [Memory Boot] Memory booting trial 3
    * Bit 16 : [Memory Boot] Execute GP image

    Also, when you have seen boot failures caused by DDR configuration issues, did the boot always fail, or only occasionally?

    Thanks and Best Regards,

    SI.

  • Sung-IL said:
    Can you capture a similar file for a good boot for comparison?

    Ans) As linux OS will be run after booting, it is hard to capture similar file for a good boot. could you please how I can connect JTAG after or before SPL when a good boot?

    Just press a key to halt at u-boot and then re-run the DSS script.  Normally this works fine.

    Sung-IL said:
    • Can you remove u-boot.img such that only SPL loads/runs.  I'd like you to look at address 0x402f35a8 and see if it is the same code.  This will help us to understand whether somehow something else entirely is being loaded (e.g. from another boot device) or if it is something wrong with the SPL code itself.

      Ans) I'll try, but I'm afraid how I can confident if it is good boot or boot fail when there is only SPL.

    Good point.  Once we see the DSS output from a good boot (i.e. as taken from u-boot), you might be able to tell them apart that way.

    Sung-IL said:
    If you stop a bad boot several times, is it always at the exact same location?  If you single step a few times, is it stuck in some kind of loop?  It would probably be useful to load symbols (make sure you choose "load symbols" and not "load program") and see what this corresponds to in the source code.  That might give us a clue as to the issue.

    Ans) YES, there was always stopped at the exact same location while I have observed 5~6 times of boot fails.

    Please try to get symbolic debug functioning in CCS so that you can provide more context as to what's happening, e.g. it is likely stuck in a loop somewhere.  What is that loop checking?  That might give us a clue as to what is going wrong.

  • Sung-IL said:

    And, there is no MMC device connected in their custom board, and MMC0 pins are unconnected.

    So, I'm also wondering why the boot was stuck at MMC0 boot.

    And,

    What is the meaning of below  'Cold reset tracing vector'?

    ROM: Cold reset tracing vector, word 1 = 0x00000000

    ROM: Cold reset tracing vector, word 2 = 0x00000000

    ROM: Cold reset tracing vector, word 3 = 0x00000001
    * Bit 0 : [Memory Boot] Memory booting device NULL

    And,

    What is the meaning of below - 'Execute GP image'?

    Did this mean booting image run?

    * Bit 15 : [Memory Boot] Memory booting trial 3
    * Bit 16 : [Memory Boot] Execute GP image

    I think this is likely meaningless.  It looks to me like it was overwritten.

  • Thanks for your immediate response.

    As you can see in the disassembly window I shared earlier, the PC seems to be stuck on the instruction below.

    When I looked it up, 'svc' appears to be the instruction for making a system call, i.e. it invokes an exception handler.

    I'm afraid the core was already stuck in an exception, and I could not find the corresponding code when I ran SPL on a good boot.

    Thanks and Best Regards,

    SI.

  • Please use symbolic debug to understand where that is in the code.  The disassembly you showed doesn't look like "real" code.  It could just be bad data that happens to coincide with valid opcodes.

    If you want to figure this out more quickly, I recommend putting a spin loop at the start of SPL.  You can then power-up the board, connect, force the PC past the spin loop, and then step through the code to see what it's doing.  It may take several iterations to see "good" vs "bad", but I think this will be more productive.

    Have you looked at the power supplies, e.g. VDD_CORE and VDD_MPU during the bad boot?  What are the voltages?

    Here are a few other scripts to run after a failed boot:

    http://git.ti.com/sitara-dss-files/am335x-dss-files/blobs/raw/master/am335x-ctt.dss

    http://git.ti.com/sitara-dss-files/am335x-dss-files/blobs/raw/master/am335x-ddr-analysis.dss

    http://git.ti.com/sitara-dss-files/am335x-dss-files/blobs/raw/master/padconf/am335x-padconf.dss

    I still think JTAG debug is the best path forward, but the outputs from these scripts might contain some clues to better understand the failure.

  • SI, I think you may be misinterpreting the disassembly.  I don't think the ARM core is in ARM mode at this point.

    Please try to get symbolic debugging working.  The symbol file for SPL is u-boot-spl in your build directory.  Load symbols after a failure and try to work backwards to determine where the code is failing.  

    A lot of times, the reason for the failure is a DDR configuration issue, that is why Brad suggested to open up a memory window in DDR space to check if the DDR memory is stable and functional.

    Regards,

    James  

  • Thanks for your recommendation.

    There is no difference in the power supplies (VDD_CORE and VDD_MPU) or in the boot sequence between a good boot and a bad boot.

    Regarding your suggestion to use a spin loop, I'm not sure how I can reproduce a bad boot, since it occurs only once per 100 to 200 trials. I'm afraid the boot-fail condition may not occur while single-stepping in the debugger.

    Thanks and Best Regards,

    SI.

  • SI,

    Please continue to try to get symbolic debugging working when the board fails.  When the board fails, you should be able to connect JTAG and load the symbol file.  You should also be able to check the DDR memory at 0x80000000 as we suggested before.

    Regards,

    James

  • Hi Brad and James,

    Could you please tell me how to build SPL for symbol loading? When I tried to load the spl.bin file as symbols, I got a binary type error.

    After a boot failure, I checked DDR memory at 0x80000000. All values were '0', and they did not change when I wrote new values in the CCS memory window, as shown below.

    I also ran am335x-ddr-analysis.dss after the boot failure, and the result is below.

    am335x-ddr-analysis_2019-09-24_152932_test2.txt
    Switched to DAP_DebugSS
    Read value of 2b94402e from Device_ID register.
    CONTROL: device_id = 0x2b94402e
      * AM335x family
      * Silicon Revision 2.1
    
    CONTROL: control_status = 0x00400333
      * SYSBOOT[15:14] = 01b (24 MHz)
    CM_CLKSEL_DPLL_DDR = 0x00000000
      * DPLL_MULT = 0 (x0)
      * DPLL_DIV = 0 (/1)
    CM_DIV_M2_DPLL_DDR = 0x00000201
      * CLKST = 1: M2 output clock enabled
      * DIVHS = 1 (/1)
    
    DPLL_DDR Summary
     -> F_input = 24 MHz
     -> CLKOUT_M2 = DDR_PLL_CLKOUT = 0 MHz
    
    EMIF: SDRAM_CONFIG = 0x4104bab2
      * Bits 31:29 (reg_sdram_type) set for DDR2
      * Bits 28:27 (reg_ibank_pos) set to 0
      * Bits 26:24 (reg_ddr_term) set for 75 Ohm (001b)
      * Bit  23    (reg_ddr2_ddqs) set to single ended DQS.
      * Bits 19:18 (reg_sdram_drive) set for weak drive (01b)
      * Bits 15:14 (reg_narrow_mode) set to ILLEGAL VALUE
      * Bits 13:10 (reg_cl) set to 14 -> CL = ILLEGAL VALUE
      * Bits 09:07 (reg_rowsize) set to 5 -> 14 row bits
      * Bits 06:04 (reg_ibank) set to 3 -> 8 banks
      * Bits 02:00 (reg_pagesize) set to 2 -> 10 column bits
    
    EMIF: PWR_MGMT_CTRL = 0x00000000
     * Bits 10:8 reg_lp_mode set to 0, auto power management disabled
     * Warning: Bits 7:4 (reg_sr_tim) are in violation of Maximum Self-Refresh Command Limit
       -> Please see the silicon errata (DDR3: JEDEC Compliance for Maximum Self-Refresh Command Limit) for more details.
       -> This is only an issue if used in conjunction with reg_lp_mode=2.
    
    DDR PHY: DDR_PHY_CTRL_1 = 0x00000000
      * WARNING: reg_phy_enable_dynamic_pwrdn disabled.
      * Bits 9:8 (reg_phy_rd_local_odt) to 0 -> no termination
      * Bits 4:0 (reg_read_latency) set to 0 -> ERROR: TOO SMALL
    
    *********************
    *** Register Dump ***
    *********************
    
    *(0x4c000000) = 0x40443403
    *(0x4c000004) = 0x40000000
    *(0x4c000008) = 0x4104bab2
    *(0x4c00000c) = 0x00000000
    *(0x4c000010) = 0x80001388
    *(0x4c000014) = 0x00001388
    *(0x4c000018) = 0x08891599
    *(0x4c00001c) = 0x08891599
    *(0x4c000020) = 0x148b31ca
    *(0x4c000024) = 0x148b31ca
    *(0x4c000028) = 0x00ffe82f
    *(0x4c00002c) = 0x00ffe82f
    *(0x4c000038) = 0x00000000
    *(0x4c00003c) = 0x00000000
    *(0x4c000054) = 0x00ffffff
    *(0x4c000058) = 0x8000140a
    *(0x4c00005c) = 0x00021616
    *(0x4c000080) = 0x00000000
    *(0x4c000084) = 0x00000000
    *(0x4c000088) = 0x00010000
    *(0x4c00008c) = 0x00000000
    *(0x4c000090) = 0x09da0441
    *(0x4c000098) = 0x00050000
    *(0x4c00009c) = 0x00050000
    *(0x4c0000a4) = 0x00000000
    *(0x4c0000ac) = 0x00000000
    *(0x4c0000b4) = 0x00000000
    *(0x4c0000bc) = 0x00000000
    *(0x4c0000c8) = 0x00000000
    *(0x4c0000d4) = 0x00000000
    *(0x4c0000d8) = 0x00000000
    *(0x4c0000dc) = 0x00000000
    *(0x4c0000e4) = 0x00000000
    *(0x4c0000e8) = 0x00000000
    *(0x4c000100) = 0x00000000
    *(0x4c000104) = 0x00000000
    *(0x4c000108) = 0x00000000
    *(0x4c000120) = 0x00000305
    
    ************************
    *** IOCTRL Registers ***
    ************************
    
    CONTROL: DDR_CMD0_IOCTRL = 0x00000004
      * ddr_ba2 Pullup/Pulldown disabled
      * ddr_wen Pullup/Pulldown disabled
      * ddr_ba0 Pullup/Pulldown disabled
      * ddr_a5 Pullup/Pulldown disabled
      * ddr_ck Pullup/Pulldown disabled
      * ddr_ckn Pullup/Pulldown disabled
      * ddr_a3 Pullup/Pulldown disabled
      * ddr_a4 Pullup/Pulldown disabled
      * ddr_a8 Pullup/Pulldown disabled
      * ddr_a9 Pullup/Pulldown disabled
      * ddr_a6 Pullup/Pulldown disabled
      * Bits 9:5 control ddr_ck and ddr_ckn
        - Slew fastest
        - Drive Strength 5 mA
      * Bits 4:0 control ddr_ba0, ddr_ba2, ddr_wen, ddr_a[9:8], ddr_a[6:3]
        - Slew fastest
        - Drive Strength 9 mA
    CONTROL: DDR_CMD1_IOCTRL = 0x00000004
      * ddr_a15 Pullup/Pulldown disabled
      * ddr_a2 Pullup/Pulldown disabled
      * ddr_a12 Pullup/Pulldown disabled
      * ddr_a7 Pullup/Pulldown disabled
      * ddr_ba1 Pullup/Pulldown disabled
      * ddr_a10 Pullup/Pulldown disabled
      * ddr_a0 Pullup/Pulldown disabled
      * ddr_a11 Pullup/Pulldown disabled
      * ddr_casn Pullup/Pulldown disabled
      * ddr_rasn Pullup/Pulldown disabled
      * Bits 4:0 control ddr_15, ddr_a[12:10], ddr_a7, ddr_a2, ddr_a0, ddr_ba1, ddr_casn, ddr_rasn
        - Slew fastest
        - Drive Strength 9 mA
    CONTROL: DDR_CMD2_IOCTRL = 0x00000004
      * ddr_cke Pullup/Pulldown disabled
      * ddr_resetn Pullup/Pulldown disabled
      * ddr_odt Pullup/Pulldown disabled
      * ddr_a14 Pullup/Pulldown disabled
      * ddr_a13 Pullup/Pulldown disabled
      * ddr_csn0 Pullup/Pulldown disabled
      * ddr_a1 Pullup/Pulldown disabled
      * Bits 4:0 control ddr_cke, ddr_resetn, ddr_odt, ddr_csn0, ddr_[a14:13], ddr_a1
        - Slew fastest
        - Drive Strength 9 mA
    CONTROL: DDR_DATA0_IOCTRL = 0x00000004
      * ddr_d8 Pullup/Pulldown disabled
      * ddr_d9 Pullup/Pulldown disabled
      * ddr_d10 Pullup/Pulldown disabled
      * ddr_d11 Pullup/Pulldown disabled
      * ddr_d12 Pullup/Pulldown disabled
      * ddr_d13 Pullup/Pulldown disabled
      * ddr_d14 Pullup/Pulldown disabled
      * ddr_d15 Pullup/Pulldown disabled
      * ddr_dqm1 Pullup/Pulldown disabled
      * ddr_dqs1 and ddr_dqsn1 Pullup/Pulldown disabled
      * Bits 9:5 control ddr_dqs1, ddr_dqsn1
        - Slew fastest
        - Drive Strength 5 mA
      * Bits 4:0 control ddr_d[15:8], ddr_dqm1
        - Slew fastest
        - Drive Strength 9 mA
    CONTROL: DDR_DATA1_IOCTRL = 0x00000004
      * ddr_d0 Pullup/Pulldown disabled
      * ddr_d1 Pullup/Pulldown disabled
      * ddr_d2 Pullup/Pulldown disabled
      * ddr_d3 Pullup/Pulldown disabled
      * ddr_d4 Pullup/Pulldown disabled
      * ddr_d5 Pullup/Pulldown disabled
      * ddr_d6 Pullup/Pulldown disabled
      * ddr_d7 Pullup/Pulldown disabled
      * ddr_dqm0 Pullup/Pulldown disabled
      * ddr_dqs0 and ddr_dqsn0 Pullup/Pulldown disabled
      * Bits 9:5 control ddr_dqs0, ddr_dqsn0
        - Slew fastest
        - Drive Strength 5 mA
      * Bits 4:0 control ddr_d[7:0], dqm0
        - Slew fastest
        - Drive Strength 9 mA
    CONTROL: DDR_IO_CTRL = 0x00000000
      * Bit 31: DDR_RESETn controlled by EMIF.
      * Bit 28 (mddr_sel) configured for SSTL, i.e. DDR2/DDR3/DDR3L operation.
    CONTROL: VTP_CTRL = 0x00010107
      * VTP not disabled (expected in normal operation, but not DS0).
    CONTROL: VREF_CTRL = 0x00000000
      * VREF supplied externally (typical).
    CONTROL: DDR_CKE_CTRL = 0x00000000
      * CKE gated (forces pin low).
    

    I also captured the EMIF register window, shown below.

    I added a spin loop to SPL and connected JTAG on a good boot (boot success), and found that the disassembled instructions at 0x402f35a8 are not the same.

    I also found that the GPMC values differ between good boot and bad boot, as shown below, but I need to check where my customer added the spin loop.

    Bad boot:

    Good boot:

    I ran the other DSS scripts on a bad boot (boot fail) and attached the output below.

    am335x-cttpadconf.zip

    Thanks and Best Regards,

    SI.

  • SI, you need to use u-boot-spl as your symbol file (found in the spl directory of your build directory).  Load this symbol file when the board fails.  Can you also send your u-boot-spl.map?

    It looks like SPL is failing very early.  The DDR PLL isn't even configured and locked. 
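The "not even configured" conclusion can be read straight off the two DDR DPLL registers in the script output (CM_CLKSEL_DPLL_DDR = 0x00000000, CM_DIV_M2_DPLL_DDR = 0x00000201). Here is a minimal Python sketch of that decoding. The field positions (DPLL_MULT in bits 18:8, DPLL_DIV in bits 6:0, M2 divider in bits 4:0) are assumptions on my part that happen to match the script's printout, and the function names are hypothetical.

```python
def decode_clksel_dpll(value):
    """Return (DPLL_MULT, DPLL_DIV) from a CM_CLKSEL_DPLL_x value.

    Assumed layout: DPLL_MULT in bits 18:8, DPLL_DIV in bits 6:0.
    """
    dpll_mult = (value >> 8) & 0x7FF
    dpll_div = value & 0x7F
    return dpll_mult, dpll_div


def m2_divider(value):
    """Return the M2 divider from a CM_DIV_M2_DPLL_x value (assumed bits 4:0)."""
    return value & 0x1F


def looks_unprogrammed(clksel_value):
    """A multiplier of zero means no usable output frequency was ever set."""
    mult, _ = decode_clksel_dpll(clksel_value)
    return mult == 0


if __name__ == "__main__":
    print(decode_clksel_dpll(0x00000000), m2_divider(0x00000201))
    print(looks_unprogrammed(0x00000000))
```

A multiplier of 0 matches the "DPLL_MULT = 0 (x0)" and "0 MHz" lines in the analysis file, which is why the failure must happen before SPL reaches DDR setup.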

    One more thing I'd like to see is whether the downloaded SPL is the same between a good boot and a bad boot.  Can you perform the following after a good boot and after a bad boot:

    -connect with JTAG

    -Tools->Save Memory.  Choose a file name and File Type: Binary.  Next.

    -Start Address: 0x402F0400, Length In Words: 0x7F00

    This will copy the entire contents of internal memory to a file.  Then we can compare to see if there is a problem in the SPL that is copied from NAND.
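Once both dumps are saved, they can be compared offline. The following Python sketch assumes the dumps are raw binaries starting at 0x402F0400 and that a CCS "word" here is 32 bits; the file names in the usage note are hypothetical. It also checks that the failing PC (0x402f35a8) falls inside the saved region.

```python
SRAM_START = 0x402F0400   # start address used for Tools->Save Memory
LENGTH_WORDS = 0x7F00     # length in words used for the dump
WORD_BYTES = 4            # assumption: 32-bit words


def dump_end(start=SRAM_START, words=LENGTH_WORDS):
    """Return the first address past the end of the saved region."""
    return start + words * WORD_BYTES


def in_dump(address):
    """Return True if the address lies inside the saved region."""
    return SRAM_START <= address < dump_end()


def first_differences(good, bad, limit=10):
    """Return up to `limit` (address, good_byte, bad_byte) tuples that differ."""
    diffs = []
    for offset, (g, b) in enumerate(zip(good, bad)):
        if g != b:
            diffs.append((SRAM_START + offset, g, b))
            if len(diffs) == limit:
                break
    return diffs


if __name__ == "__main__":
    print(hex(dump_end()), in_dump(0x402F35A8))
```

To use it on real dumps, read each file with `open("good_sram.bin", "rb").read()` (and likewise for the bad boot) and pass the two byte strings to `first_differences`. Differences that only cover a build timestamp string can be ignored.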

    Regards,

    James

  • Hi James,

    I succeeded in loading the symbols and found that the PC was stopped in the hang() function when the boot failed, as shown below.

    I have attached the spl.map file below.

    7711.u-boot-spl.zip

    Below is the memory dump of internal SRAM (0x402F0400) from a bad boot (boot fail), as you requested.

    bad_sram2.zip

    Below is the memory dump of SRAM from a good boot.

    good_sram2.zip

    As you can see, there are several differences between good boot and bad boot, but the good-boot build contains additional code for the spin loop.

    When I compared the internal SRAM dump from a bad boot with the original binary file, there was no meaningful difference, as you can see in the captured image below (the few differences come from the build time string). Unfortunately I could not obtain the original SPL file because of a security restriction at Samsung, but the SPL code in internal SRAM at bad boot showed no problem.

    I also dumped the internal SRAM region 5 times, and the SPL code was identical in all 5 trials.

    Could you please advise what the next debug step should be?

    Is it possible to check the GPMC register values at bad boot?

    When I checked, there was no difference in the GPMC registers between bad boot and good boot. The NAND is Micron's 29F2G08ABAEA.

    I captured the GPMC values at bad boot below.

    Thanks and Best Regards,

    SI.

  • Regarding the hang() function in SPL:

    1. Is it possible to reset the system from within hang()?

    2. What is the general recommendation for implementing hang() in a real product?

    3. Is there any way to find out where hang() was called from?

    Thanks and Best Regards,

    SI.

  • I can see from your screenshot that it hung in do_setup_dpll.  It looks like the PLL never locked.  I'm not sure yet why that is.  Can you figure out which DPLL is not locking, e.g. by stepping through code or perhaps outputting the CTT rd1?  There's likely a very fundamental issue here, e.g. boot pins configured for wrong crystal frequency, DPLL configuration was updated and has values that are out-of-spec, noise issue on a clock pin, power issue, etc.

  • Hi Brad,

    Thanks for your response.

    Could you please check the CTT rd1 file from a bad boot, attached below?

    am335x-ctt_2019-09-24_154602_test1.zip

    Thanks and Best Regards,

    SI.

  • What about the overshoot on 3.3V during the power-up sequence?

    Could this cause the DPLL lock failure?

    Their power-up sequence is shown below, and there is overshoot on 3.3V before it ramps up. The 3.3V rail is the red trace.

    The green trace is 1.5V. There is no trace for 1.8V in the screenshot, and I have asked them to measure 1.8V as well.

    I have heard that 1.8V ramps up at about the same time as the 1.5V rail (green).

    Thanks and Best Regards,

    SI.

  • What SDK version are you using?

  • It is  Linux PSDK 4.0.

  • The order in which the DPLLs get configured is CORE -> MPU -> PER.

    Based on your rd1 file, it looks like DPLL_MPU is having an issue.  Is it correct that the customer is trying to configure the Cortex A8 for 300 MHz operation?  I can see that CM_CLKMODE_DPLL_MPU[2:0] DPLL_EN = 0x7 (DPLL enabled in lock mode), but CM_IDLEST_DPLL_MPU = 0 (bypass).  I recommend looking very carefully at the rail VDDS_PLL_MPU.  Is there noise on that rail?  It might be intermittent, e.g. when noise is present the DPLL fails to lock.
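The check described above (lock requested, but status still showing bypass) can be written down as a tiny decoder for values read over JTAG. This is only a sketch in Python. DPLL_EN = 0x7 in CM_CLKMODE_DPLL_MPU[2:0] is taken from the post above, while treating bit 0 of CM_IDLEST_DPLL_MPU as the "locked" flag is my assumption and should be checked against the TRM. The function names are hypothetical.

```python
DPLL_EN_LOCK = 0x7  # CM_CLKMODE_DPLL_MPU[2:0] value for "enabled in lock mode"


def lock_requested(cm_clkmode):
    """Return True if software asked the DPLL to lock."""
    return (cm_clkmode & 0x7) == DPLL_EN_LOCK


def is_locked(cm_idlest):
    """Return True if the DPLL reports lock (assumed to be bit 0)."""
    return bool(cm_idlest & 0x1)


def diagnose(cm_clkmode, cm_idlest):
    """Summarise the DPLL state from the two register values."""
    if is_locked(cm_idlest):
        return "locked"
    if lock_requested(cm_clkmode):
        return "lock requested but DPLL not locked"
    return "lock not requested"


if __name__ == "__main__":
    # Values reported for DPLL_MPU in the failing case.
    print(diagnose(0x7, 0x0))
```

The "lock requested but DPLL not locked" state is the one SPL would wait on indefinitely, which fits the hang seen in do_setup_dpll.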

  • You might also get better lock performance by adjusting the configuration of your PLL.  

    Right now, it looks like you have this configuration:

    M = 300

    N = 23

    M2 = 1

    which gives you an MPU clock of 300 MHz.  Can you try the following configuration (which also results in an MPU clock of 300 MHz):

    M = 25

    N = 0

    M2 = 2

    This is a more optimal configuration.  Ultimately, though, Brad's recommendation is correct.  There is most likely noise on the PLL supply voltage which is causing lock issues.  Monitor this voltage with a scope while the PLL is attempting to lock in the bootloader.
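As a quick sanity check that both settings really give the same MPU clock, here is a small Python sketch. It assumes the usual relationship CLKOUT = CLKIN x M / (N + 1) / M2 and the 24 MHz input reported by the boot script; the function name `dpll_out_mhz` is made up for illustration.

```python
CLKIN_MHZ = 24  # from SYSBOOT[15:14] = 01b (24 MHz) in the script output


def dpll_out_mhz(m, n, m2, clkin_mhz=CLKIN_MHZ):
    """Return the DPLL output in MHz, assuming CLKIN * M / (N + 1) / M2."""
    return clkin_mhz * m / (n + 1) / m2


old = dpll_out_mhz(m=300, n=23, m2=1)
new = dpll_out_mhz(m=25, n=0, m2=2)

if __name__ == "__main__":
    print("old:", old, "MHz")
    print("new:", new, "MHz")
```

Both settings come out at 300 MHz under that assumption, so the change affects how the DPLL gets to the target frequency, not the frequency itself.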

    Regards,

    James

  • Hi James and Brad,

    Thanks for your fruitful response.

    The new PLL values seem to work well; we have not seen any boot failure with them so far.

    Could you please explain the difference between the old and new PLL values?

    Do you have an idea why the new PLL values boot without failures?

    Can the board be expected to be stable with the new PLL values, or is there still an issue on the board?

    We need to explain to the customer what the new PLL values improve, so they can be confident that the boot failure is gone.

    As you requested,

    I measured the VDDS_PLL_MPU pin while the PLL was attempting to lock in the bootloader, and I think the peak-to-peak ripple is too large. Do you agree?

    It should be under 50mV, right?

    The capture is below; the scope was set to AC coupling to check the ripple.

    The peak-to-peak ripple is 86mV or more.

    Thanks and Best Regards,

    SI.

  • Sung-IL said:

    The new PLL values seem to work well; we have not seen any boot failure with them so far.

    Could you please explain the difference between the old and new PLL values?

    Reducing the pre-divider N results in a faster loop frequency, i.e. the DPLL can react more quickly and correspondingly lock more quickly.

    Sung-IL said:

    I measured the VDDS_PLL_MPU pin while the PLL was attempting to lock in the bootloader, and I think the peak-to-peak ripple is too large. Do you agree?

    It should be under 50mV, right?

    The capture is below; the scope was set to AC coupling to check the ripple.

    The peak-to-peak ripple is 86mV or more.

    Yes, the rail is too noisy.  The proper solution is to make improvements to bring the noise within the required 50 mV peak-to-peak.  However, if there are units already in the field, that might not be practical. In that case your best option might be a software update with improved DPLL configuration.

  • Hi Brad,

    Thanks for your response.

    Do you mean that the new PLL values let the DPLL react and lock more quickly, and that this helps with the noisy power supply?

    In other words, does faster reaction and locking improve lock reliability with a noisy power supply?

    Is applying the new PLL values the best option for my customer?

    Or should they look for even better PLL values?

    I would like to reduce the ripple on their power rails, but I'm afraid most of their rails have ripple because they use a discrete power solution.

    Is it enough to minimize the ripple on the power rails below only, or are there other rails that need it?

    Thanks and Best Regards,

    SI.

  • Sung-IL said:
    Do you mean that the new PLL values let the DPLL react and lock more quickly, and that this helps with the noisy power supply?

    Yes.  The reason the new value is better is because it has a smaller pre-divider. 

    Sung-IL said:
    Or should they look for even better PLL values?

    Since the DPLL_MPU pre-divider is now /1, that's the best you can do.  Or are you referring to updating the DPLL values for the other DPLLs?

    Sung-IL said:
    Is it enough to minimize the ripple on the power rails below only, or are there other rails that need it?

    Those are the only rails with the 50 mV peak-to-peak requirement.

  • The new PLL configuration also brings the internal DCO frequency to the middle of the valid frequency range (20MHz-2GHz), which also helps with lock performance.

    Ultimately, as you show in the scope plot, the supplied voltage violates the maximum peak-to-peak noise margin defined in table 6-1 of the data manual.  Typically, designers use a ferrite bead to filter noise from the rest of the 1.8V rail on the board, especially when it is supplied by a DC-DC converter or also supplies other noisy digital circuits.  An example of this can be found in the BeagleBone Black schematic.

    Regards,

    James

  • Sung-IL said:
    1. is it possible to 'reset' system in the hang() function?

    You could reset the system in the hang() function, or another option might be to reset the system inside do_setup_dpll(), i.e. instead of calling hang(), perform a reset.  You can force a cold reset by writing PRM_RSTCTRL = 2.

    Here's what I recommend:

    1. Revert your DPLL configuration (temporarily) back to the values where you saw issues.
    2. Update either do_setup_dpll() or hang() function to reset the device in case of issues.
    3. Check to see if this allows for robust recovery.  I suspect a cold reset will be sufficient, but it's possible that a full power cycle is required.  You would need to test it to know.

    Ideally I hope that adding this reset will show consistent recovery.  If that is the case, then I would recommend applying BOTH fixes (i.e. extra reset plus optimized DPLL configuration).  That would give you an extra layer of recovery if there is a corner case condition where the updated DPLL value is not working.

    You may want to optimize the configurations of the other DPLLs as well to give you additional confidence.  Again I want to emphasize that the official and proper fix is to bring the supply ripple on these rails to within the 50mV-pp requirement.  However, if you're trying to get the best possible software workaround, I expect a combination of all the above would be robust.
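
    The recovery logic recommended above (give up waiting for lock after a bounded number of polls, then force a cold reset instead of hanging) can be modeled as follows. This is a logic sketch only, not bootloader code: the function name, the injected `read_idlest` and `write_rstctrl` callables, the poll limit, and the use of bit 0 as the lock status are all assumptions for illustration. The value 2 for PRM_RSTCTRL is the one quoted in the reply.

```python
from typing import Callable

# Value quoted in the reply above: writing 2 to PRM_RSTCTRL forces a cold reset.
PRM_RSTCTRL_COLD_RESET = 0x2
# Assumption: bit 0 of the DPLL idle-status register is set once locked.
ST_DPLL_CLK_LOCKED = 0x1


def wait_for_lock_or_reset(
    read_idlest: Callable[[], int],
    write_rstctrl: Callable[[int], None],
    max_polls: int = 1000,  # hypothetical timeout, to be tuned on hardware
) -> bool:
    """Poll for DPLL lock; request a cold reset if it never locks."""
    for _ in range(max_polls):
        if read_idlest() & ST_DPLL_CLK_LOCKED:
            return True
    # Lock never arrived: reset the device instead of spinning forever.
    write_rstctrl(PRM_RSTCTRL_COLD_RESET)
    return False


# Simulated failure case: the status register always reads 0 (never locks).
reset_writes = []
locked = wait_for_lock_or_reset(lambda: 0x0, reset_writes.append, max_polls=10)
print(locked, reset_writes)
```

    In the simulated failure case the function returns False after recording a single reset request; on real hardware the reset would restart the boot sequence, and testing would show whether a cold reset is enough or a full power cycle is needed.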

  • Thanks for your response.

    Could the noisy power supply cause other problems besides the PLL failing to lock?

    Is it possible for the MPU PLL to lose lock after it has locked because of the noisy supply?

    Thanks and Best Regards,

    SI.

  • SI, yes, it is possible that the PLL can lose lock with the noisy supply, especially if other factors in the system vary (e.g., temperature).

    One idea to catch this is to monitor the PRM_IRQSTATUS_MPU register.  This register has status bits that identify when a recalibration is necessary for certain PLLs.  It can be set up to interrupt the processor, after which software can perform the recalibration and relock using the CM_CLKMODE_DPLL_xxx.DPLL_DRIFTGUARD_EN bit.

    Regards,

    James  

  • Hi James and Brad,

    Thanks again for your valuable responses. So far there have been no boot failures on the customer's board with the new PLL values.

    I'm pushing the customer to change their hardware to reduce the noise on the power rails, since that is the root cause, and also to implement a software reset in the hang() function and monitor the PRM_IRQSTATUS_MPU register.

    Thanks and Best Regards,

    SI.

  • Hi James and Brad,

    Could you please also recommend optimum PLL values for the 600MHz AM3352?

    Is it OK to use the values below for their 600MHz AM3352 board?

    M=25

    N=0

    M2 = 1

    They also saw the boot failure on their 600MHz AM3352 board and asked for our recommended PLL values as a workaround.

    Thanks and Best Regards,

    SI.

  • SI, the settings you quoted in your post are the optimal settings for 600MHz.  Are they still seeing failures with those settings?

    Regards,

    James

  • Hi James,

    Thanks for confirming that these are the optimum values for 600MHz.

    Sorry for the confusion. They have not yet tested their board with these PLL settings; I just wanted to check with you before they test.

    Thanks and Best Regards,

    SI.