Linux/AM3352: Slow boot continues through WDT

Part Number: AM3352
Other Parts Discussed in Thread: SEGGER

Tool/software: Linux

We've had a design in the field for 3 years that exhibits a 'slow' boot condition during which normal execution occurs at maybe 1/20 of the expected speed.  It occurs on less than 1/10 of a percent of fielded units.  It's slow enough that the WDT fires and creates a WDT reset event.  However, the slow boot condition persists through the reset and the cycle repeats indefinitely, until POR.  Our workaround has been to check the reset reason in u-boot and issue a reset from the PMIC if the reset reason is WDT.  This resolves the issue and the device is able to boot normally after initially failing.  However, the overall boot time is creating problems, and we need to understand the root cause and what more can be done to avoid the situation.  In the one unit on which we've been able to capture the condition, it seems to be related to CKE on the SDRAM.  During a normal boot CKE shows continuous clocking; in the 'slow boot' condition, however, CKE operates in a bursting manner.  See below.  This design was reviewed 3 years ago by TI, and a Hyperlynx simulation was performed on the SDRAM to ensure proper timing.  This issue doesn't seem to be related to any environmental condition or any unusual electrical condition such as EMI or ESD.

   

  • James Miller39 said:
    We've had a design in the field for 3 years that exhibits a 'slow' boot condition during which normal execution occurs at maybe 1/20 of the expected speed

    The first thought that comes to mind based on your description is that perhaps the PLL has not properly locked.  If you were in bypass mode then you might experience this sort of issue.  Something like that might happen if there's excessive noise on the VDDS_PLL_* rails.

    James Miller39 said:
    Our workaround has been to check the reset reason in u-boot and issue a reset from the PMIC if the reset reason is WDT.  This resolves the issue and the device is able to boot normally after initially failing.  However, the overall boot time is creating problems

    How much time is this workaround adding?  I would have expected it to be very small, e.g. <1sec.  What is your boot time requirement?

  • Can you verify that it is CKE that is toggling? CKE should not be toggling like that. It should only go low when you intend to go into self-refresh for low power, otherwise it should stay high.

    Do you know when you are executing slow? For example, is the ROM executing slowly, or only after your bootloader has executed? Or only when executing out of DDR?

    If you have JTAG access (not sure if you do since it is in a product), we can help you narrow down where the problem is. There are a couple of scripts to run via JTAG that can help find configuration issues.

    Brad's idea is a good one. You can check lock of the ARM PLL using register CM_IDLEST_DPLL_MPU (0x44E0_0420).

    Regards,
    James
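
For readers following along, here is a minimal sketch of how a raw CM_IDLEST_DPLL_MPU value could be interpreted once it has been read over JTAG or from u-boot. The bit positions (bit 0 = ST_DPLL_CLK, bit 8 = ST_MN_BYPASS) are my reading of the AM335x TRM and should be checked against the TRM revision for your silicon; the function names are hypothetical and not part of any TI tool.

```python
# Bit positions are an assumption taken from the AM335x TRM description of
# CM_IDLEST_DPLL_MPU; verify them before relying on this.
ST_DPLL_CLK = 1 << 0    # 1 = DPLL reports lock
ST_MN_BYPASS = 1 << 8   # 1 = DPLL is in MN bypass


def decode_dpll_idlest(value):
    """Return (locked, bypassed) flags for a CM_IDLEST_DPLL_* value."""
    locked = bool(value & ST_DPLL_CLK)
    bypassed = bool(value & ST_MN_BYPASS)
    return locked, bypassed


def describe(value):
    """Summarize the DPLL state as a short string."""
    locked, bypassed = decode_dpll_idlest(value)
    if locked and not bypassed:
        return "locked"
    if bypassed and not locked:
        return "bypass"
    return "transitional or unexpected"


if __name__ == "__main__":
    for value in (0x00000001, 0x00000100):
        print(f"0x{value:08x}: {describe(value)}")
```

The register dump posted later in this thread shows 0x00000001 at 0x44e00420, which this sketch would classify as locked.
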
  • Yes, it is DDR_CKE, ball G3 on the ZCZ package, that is toggling.  The only indications that we're running in 'slow mode' are the rate at which messages are produced on the console and the activation of the WDT reset, which, oddly, fails to restore normal operation.  We typically use reset reason == WDT to indicate that we should issue a PMIC reset to recover.  This is known to be effective.  We're currently using u-boot as our bootloader, which checks that the PLL is locked before continuing the boot process, and this appears to be working.  If the PLL failed to lock, u-boot would hang until the WDT reset, but what we see is that messages continue to be output on the serial port.  Our JTAG port is available; what scripts can you suggest we try to verify the configuration?

    Thanks,

    Jim

  • I have some scripts for Code Composer Studio that can help us. The general process for running these scripts is described here:

    http://git.ti.com/sitara-dss-files/am335x-dss-files/blobs/master/README

    In your particular instance, there are several scripts which you should run:

    http://git.ti.com/sitara-dss-files/am335x-dss-files/blobs/raw/master/am335x-boot.dss

    http://git.ti.com/sitara-dss-files/am335x-dss-files/blobs/raw/master/am335x-ctt.dss

    http://git.ti.com/sitara-dss-files/am335x-dss-files/blobs/raw/master/am335x-ddr-analysis.dss

    Please zip up and attach the output files.

  • Thank you for the scripts.  We'll get this up and running with the output back to you as soon as possible.

    Another interesting thing we've found is that a watchdog timer reset on a normally running device causes this slow running condition approximately 30% of the time after the reset.  We see this by running our application and turning off the petting of the watchdog.  Is there any particular reason why a WDT reset wouldn't return the CPU to a stable state the way POR does?

    Also, we're  running an older version of the TI kernel 3.12.10.  Are you aware of any issues with locking of the PLL by the kernel?  We've ruled out failure of u-boot to lock the PLL as u-boot waits specifically for the PLL to lock and halts if the PLL doesn't lock.  We don't see a halt, the kernel loads as normal but everything runs s l o w l y.

    Thanks.

  • <soapbox>I cannot implore strongly enough for you to update to newer software. I run into issues a lot where people say "but we've already invested so much into hardening this software. We can't migrate now." Most of those designs never made it to production... Please just make the leap. In most cases it has taken customers even without much Linux experience about a month to migrate to the latest, even when they're jumping many years of kernel revisions. You will be amazed at how many issues disappear by migrating... </soapbox>

    Since you mentioned that 3.12 kernel, make sure you apply this patch:

    git.ti.com/.../diffs

    It has some very similar characteristics to what you describe. If we are lucky it might be THE issue. If not though, you would likely bump into this issue eventually, so be sure you apply it if you're going to stay on 3.12.
  • I like the idea of updating to resolve these sorts of issues, but sometimes you trade old issues for new, unknown issues.  But, moving from 3.12, what kernel might you recommend?

  • James Miller39 said:
    I like the idea of updating to resolve these sorts of issues, but sometimes you trade old issues for new, unknown issues.

    I've not had a single instance where a customer regretted moving forward.  Sure, a few issues will pop up during the migration.  That's why it takes a month to migrate.  Usually the issues that pop up are things that other people are also encountering where solutions are readily available.

    If you're close to production, you might use SDK 5.03 which is our final release on the 4.14 kernel.  However, if you still have a year or more of development work, you might consider moving all the way to SDK 6.00 when it is released in July.  It will be on the 4.19 kernel.  If you move to SDK 6.00, you should budget for one more small migration to a later release such as 6.03 which will be the final 4.19 release and out by this time a year from now.  It will be a rather small effort migrating from SDK 6.00 to 6.0x, likely just a few days of effort as it will mostly contain bug fixes and not a lot of huge feature additions.

    Also, keep in mind that the kernels being chosen for the TI SDK releases are all LTS kernels.  So there will be much better community support on those kernels if you run into an issue a couple years from now.

  • Hi Brad, I work with James and am trying to make this work. I have a Segger JLink connected to our device. It can connect just fine on its own using JLink Commander:

    SEGGER J-Link Commander V6.34f (Compiled Sep  5 2018 13:25:56)
    DLL version V6.34f, compiled Sep  5 2018 13:25:28

    Connecting to J-Link via USB...O.K.
    Firmware: J-Link V9 compiled Oct 25 2018 11:46:07
    Hardware version: V9.20
    S/N: 59200642
    License(s): GDB
    VTref=3.316V


    Type "connect" to establish a target connection, '?' for help
    J-Link>connect
    Please specify device / core. <Default>: AM3352
    Type '?' for selection dialog
    Device>?
    Please specify target interface:
      J) JTAG (Default)
      S) SWD
      T) cJTAG
    TIF>J
    Device position in JTAG chain (IRPre,DRPre) <Default>: -1,-1 => Auto-detect
    JTAGConf>-1
    ERROR while parsing value for DRPre. Using default: -1.
    Specify target interface speed [kHz]. <Default>: 4000 kHz
    Speed>4000
    Device "AM3352" selected.


    Connecting to target via JTAG
    TotalIRLen = 6, IRPrint = 0x01
    TotalIRLen = 10, IRPrint = 0x0011
    JTAG chain detection found 2 devices:
     #0 Id: 0x3BA00477, IRLen: 04, CoreSight JTAG-DP
     #1 Id: 0x2B94402F, IRLen: 06, TI ICEPick
    AP map detection skipped. Manually configured AP map found.
    AP[0]: AHB-AP (IDR: Not set)
    AP[1]: APB-AP (IDR: Not set)
    AP[2]: JTAG-AP (IDR: Not set)
    Iterating through AP map to find AHB-AP to use
    AP[0]: Skipped. Not an APB-AP
    AP[1]: APB-AP found
    Found Cortex-A8 r3p2
    6 code breakpoints, 2 data breakpoints
    Debug architecture ARMv7.0
    Data endian: little
    Main ID register: 0x413FC082
    I-Cache L1: 32 KB, 128 Sets, 64 Bytes/Line, 4-Way
    D-Cache L1: 32 KB, 128 Sets, 64 Bytes/Line, 4-Way
    Unified-Cache L2: 256 KB, 512 Sets, 64 Bytes/Line, 8-Way
    System control register:
      Instruction endian: little
      Level-1 instruction cache enabled
      Level-1 data cache enabled
      MMU enabled
      Branch prediction enabled
    Memory zones:
      [0]: Default (Default access mode)
      [1]: AHB-AP (AP0) (DMA like acc. in AP0 addr. space)
      [2]: APB-AP (AP1) (DMA like acc. in AP1 addr. space)
    Cortex-A8 identified.
    J-Link>

    This took a couple of tries (I deleted the first connect failure lines as irrelevant).

    Then, following the instructions in the README, CCS says this in its Debug tab:


    As per the instructions, I did NOT try to connect to the unit in any way (after I saw that the JTAG was working all the way through to the target, I disconnected and re-connected it physically and then re-invoked CCS to get the above screen shot.) But, when trying to execute the script as per the README instructions, I get this:
    js:> loadJSFile c:\workspace\am335x-dss-files\am335x-boot.dss
    Could not open session. No matching devices found. (c:\workspace\am335x-dss-files\am335x-boot.dss#60)
    js:>

    Does #60 mean "line 60" in that file?
    Line 60 is this:

    debugSessionDAP = ds.openSession("*","CS_DAP_DebugSS");


    Thanks in advance.

  • Looks like your Segger can connect to the Cortex A8 but not the DAP. The scripts all use the DAP since it avoids the MMU. Do you have any TI XDS debuggers? If not you might consider getting an XDS110 for this purpose. It is only $99.

    You might be able to make the script work by changing CS_DAP_DebugSS to CortxA8.  If the MMU is not yet enabled, that will work OK. Once Linux enables the MMU, that won't work.

  • Finally got it to run and produce results by changing all places referencing _DAP_ to CortxA8. It seems to be impossible to simply attach files to this message, as it inserts them into the message itself, so I must cut and paste the text from these here, in order.
    ------------------------
    am335x-boot-analysis_2019-05-10_081917.txt:
    CONTROL: device_id = 0x2b94402e
    * AM335x family
    * Silicon Revision 2.1

    PRM_DEVICE: PRM_RSTST = 0x00000200

    CONTROL: control_status = 0x00400312
    * SYSBOOT[15:14] = 01b (24 MHz)
    * SYSBOOT[11:10] = 00b No GPMC CS0 addr/data muxing
    * SYSBOOT[9] = 0 GPMC CS0 Ignore WAIT input
    * SYSBOOT[8] = 0 GPMC CS0 8-bit data bus
    * Device Type = General Purpose (GP)
    * SYSBOOT[7:6] = 00b MII (EMAC boot modes only)
    * SYSBOOT[5] = 0 CLKOUT1 disabled
    * Boot Sequence : NAND -> NANDI2C -> USB0 -> UART0

    ROM: Current tracing vector, word 1 = 0x0010009a
    * Bit 1 : [General] Entered main function
    * Bit 3 : [Boot] Main booting routine entered
    * Bit 4 : [Memory Boot] Memory booting started
    * Bit 7 : [Boot] GP header found
    * Bit 20 : [Configuration Header] CHSETTINGS found

    ROM: Current tracing vector, word 2 = 0x00018000
    * Bit 15 : [Memory Boot] Memory booting trial 3
    * Bit 16 : [Memory Boot] Execute GP image

    ROM: Current tracing vector, word 3 = 0x00000020
    * Bit 5 : [Memory Boot] Memory booting device MMCSD0

    ROM: Current copy of PRM_RSTST = 0x00000000

    ROM: Cold reset tracing vector, word 1 = 0x00000000

    ROM: Cold reset tracing vector, word 2 = 0x00000000

    ROM: Cold reset tracing vector, word 3 = 0x00000200
    * Bit 9 : Reserved

    Cortex A8 Program Counter = 0x402f3500

    ROM Exception Vectors
    * 0x4030CE04 Undefined
    * 0x4030CE08 SWI
    * 0x4030CE0C Pre-fetch abort
    * 0x4030CE10 Data abort
    * 0x4030CE14 Unused
    * 0x4030CE18 IRQ
    * 0x4030CE1C FIQ

    ROM Dead Loops
    * 0x00020080 Undefined exception default handler
    * 0x00020084 SWI exception default handler
    * 0x00020088 Pre-fetch abort exception default handler
    * 0x0002008C Data exception default handler
    * 0x00020090 Unused exception default handler
    * 0x00020094 IRQ exception default handler
    * 0x00020098 FIQ exception default handler
    * 0x0002009C Validation test PASS
    * 0x000200A0 Validation test FAIL
    * 0x000200A4 Reserved
    * 0x000200A8 Image not executed or returned
    * 0x000200AC Reserved
    * 0x000200B0 Reserved
    * 0x000200B4 Reserved
    * 0x000200B8 Reserved
    * 0x000200BC Reserved

    -----------------------------------
    am335x-ctt_2019-05-10_082154.rd1:
    DeviceName AM335x_SR2.x_SR1.0
    0x44e00000 0x01004502
    0x44e00004 0x0000000a
    0x44e00008 0x00000102
    0x44e0000c 0x00000016
    0x44e00010 0x00070000
    0x44e00014 0x00070000
    0x44e00018 0x00070000
    0x44e0001c 0x00000002
    0x44e00020 0x00070000
    0x44e00024 0x00030000
    0x44e00028 0x00000002
    0x44e0002c 0x00000002
    0x44e00030 0x00000002
    0x44e00034 0x00030000
    0x44e00038 0x00000002
    0x44e0003c 0x00030000
    0x44e00040 0x00000002
    0x44e00044 0x00030000
    0x44e00048 0x00000002
    0x44e0004c 0x00030000
    0x44e00050 0x00030000
    0x44e00054 0x00030000
    0x44e00058 0x00030000
    0x44e00060 0x00000002
    0x44e00064 0x00000002
    0x44e00068 0x00030000
    0x44e0006c 0x00000002
    0x44e00070 0x00000002
    0x44e00074 0x00000002
    0x44e00078 0x00000002
    0x44e0007c 0x00030000
    0x44e00080 0x00000002
    0x44e00084 0x00030000
    0x44e00088 0x00030000
    0x44e0008c 0x00030000
    0x44e00090 0x00030000
    0x44e00094 0x00030000
    0x44e00098 0x00030000
    0x44e0009c 0x00030000
    0x44e000a0 0x00030000
    0x44e000a4 0x00030000
    0x44e000a8 0x00030000
    0x44e000ac 0x00000002
    0x44e000b0 0x00000002
    0x44e000b4 0x00000002
    0x44e000b8 0x00030000
    0x44e000bc 0x00030000
    0x44e000c0 0x00030000
    0x44e000c4 0x00030000
    0x44e000c8 0x00030000
    0x44e000cc 0x00030000
    0x44e000d0 0x00000002
    0x44e000d4 0x00030000
    0x44e000d8 0x00030000
    0x44e000dc 0x00000002
    0x44e000e0 0x00000002
    0x44e000e4 0x00040002
    0x44e000e8 0x00070000
    0x44e000ec 0x00030000
    0x44e000f0 0x00030000
    0x44e000f4 0x00030000
    0x44e000f8 0x00030000
    0x44e000fc 0x00030000
    0x44e00100 0x00030000
    0x44e00104 0x00030000
    0x44e0010c 0x00030000
    0x44e00110 0x00030000
    0x44e0011c 0x0000000a
    0x44e00120 0x00000002
    0x44e00124 0x00070000
    0x44e00128 0x00030000
    0x44e0012c 0x00000012
    0x44e00130 0x00040002
    0x44e00134 0x00030000
    0x44e00138 0x00030000
    0x44e0013c 0x00030000
    0x44e00140 0x00000002
    0x44e00144 0x00000002
    0x44e00148 0x00000002
    0x44e0014c 0x00000002
    0x44e00150 0x00000012
    0x44e00400 0x00001e16
    0x44e00404 0x00000002
    0x44e00408 0x00000002
    0x44e0040c 0x00000002
    0x44e00410 0x00000002
    0x44e00414 0x52580002
    0x44e00418 0x0000001e
    0x44e0041c 0x00000000
    0x44e00420 0x00000001
    0x44e00424 0x00000000
    0x44e00428 0x00000000
    0x44e0042c 0x00025817
    0x44e00430 0x00000000
    0x44e00434 0x00000001
    0x44e00438 0x00000000
    0x44e0043c 0x00000000
    0x44e00440 0x00010a17
    0x44e00444 0x00000000
    0x44e00448 0x00000100
    0x44e0044c 0x00000000
    0x44e00450 0x00000000
    0x44e00454 0x00000000
    0x44e00458 0x00000000
    0x44e0045c 0x00000001
    0x44e00460 0x00000000
    0x44e00464 0x00000000
    0x44e00468 0x0003e817
    0x44e0046c 0x00000000
    0x44e00470 0x00000001
    0x44e00474 0x00000000
    0x44e00478 0x00000000
    0x44e0047c 0x00000300
    0x44e00480 0x0000022a
    0x44e00484 0x00000028
    0x44e00488 0x00000007
    0x44e0048c 0x00000007
    0x44e00490 0x00000007
    0x44e00494 0x00000007
    0x44e00498 0x00000004
    0x44e0049c 0x0403c017
    0x44e004a0 0x00000201
    0x44e004a4 0x00000001
    0x44e004a8 0x00000201
    0x44e004ac 0x00000285
    0x44e004b0 0x00040002
    0x44e004b4 0x00000002
    0x44e004b8 0x00000002
    0x44e004bc 0x00030000
    0x44e004c0 0x00030000
    0x44e004c4 0x00030000
    0x44e004c8 0x00030000
    0x44e004cc 0x00000006
    0x44e004d0 0x00000002
    0x44e004d4 0x00000002
    0x44e004d8 0x00000004
    0x44e00504 0x00000001
    0x44e00508 0x00000001
    0x44e0050c 0x00000000
    0x44e00510 0x00000001
    0x44e00514 0x00000004
    0x44e00518 0x00000001
    0x44e0051c 0x00000000
    0x44e00520 0x00000000
    0x44e00528 0x00000000
    0x44e0052c 0x00000000
    0x44e00530 0x00000000
    0x44e00534 0x00000000
    0x44e00538 0x00000000
    0x44e0053c 0x00000000
    0x44e00600 0x00000006
    0x44e00604 0x00000002
    0x44e00700 0x00000000
    0x44e00800 0x00000002
    0x44e00804 0x00000302
    0x44e00900 0x00000002
    0x44e00904 0x00070000
    0x44e00908 0x00070000
    0x44e0090c 0x00000002
    0x44e00910 0x00030000
    0x44e00914 0x00030000
    0x44e00a00 0x00000002
    0x44e00a20 0x00030000
    0x44e00b00 0x00000000
    0x44e00b04 0x00000500
    0x44e00b08 0x00000000
    0x44e00b0c 0x00000100
    0x44e00b10 0x00000000
    0x44e00c00 0x00000003
    0x44e00c04 0x00000000
    0x44e00c08 0x01e60007
    0x44e00c0c 0xee0000eb
    0x44e00d00 0x00000008
    0x44e00d04 0x00000008
    0x44e00d08 0x00000000
    0x44e00d0c 0x00000020
    0x44e00e00 0x01ff0007
    0x44e00e04 0x000003f7
    0x44e00e08 0x00000000
    0x44e00f00 0x00000000
    0x44e00f04 0x00001006
    0x44e00f08 0x00000200
    0x44e00f0c 0x78000017
    0x44e00f10 0x00000003
    0x44e00f14 0x00000000
    0x44e00f18 0x00000003
    0x44e00f1c 0x00000000
    0x44e01000 0x00000004
    0x44e01004 0x00000000
    0x44e01100 0x00060044
    0x44e01104 0x00000001
    0x44e01110 0x00000037
    0x44e01114 0x00000000
    0x44e01200 0x00000000
    0x44e01204 0x00000007
    0x44e10040 0x00400312

    ----------------------------------
    am335x-ddr-analysis_2019-05-10_082502.txt:
    CONTROL: device_id = 0x2b94402e
    * AM335x family
    * Silicon Revision 2.1

    CONTROL: control_status = 0x00400312
    * SYSBOOT[15:14] = 01b (24 MHz)
    CM_CLKSEL_DPLL_DDR = 0x00010a17
    * DPLL_MULT = 266 (x266)
    * DPLL_DIV = 23 (/24)
    CM_DIV_M2_DPLL_DDR = 0x00000201
    * CLKST = 1: M2 output clock enabled
    * DIVHS = 1 (/1)

    DPLL_DDR Summary
    -> F_input = 24 MHz
    -> CLKOUT_M2 = DDR_PLL_CLKOUT = 266 MHz

    EMIF: SDRAM_CONFIG = 0x41805232
    * Bits 31:29 (reg_sdram_type) set for DDR2
    * Bits 28:27 (reg_ibank_pos) set to 0
    * Bits 26:24 (reg_ddr_term) set for 75 Ohm (001b)
    * Bit 23 (reg_ddr2_ddqs) set to differential DQS.
    * Bits 19:18 (reg_sdram_drive) set for normal drive (00b)
    * Bits 15:14 (reg_narrow_mode) set to 1 -> 16-bit EMIF interface
    * Bits 13:10 (reg_cl) set to 4 -> CL = 4
    * Bits 09:07 (reg_rowsize) set to 4 -> 13 row bits
    * Bits 06:04 (reg_ibank) set to 3 -> 8 banks
    * Bits 02:00 (reg_pagesize) set to 2 -> 10 column bits

    EMIF: PWR_MGMT_CTRL = 0x00000000
    * ERROR: Bits 7:4 (reg_sr_tim) are in violation of Maximum Self-Refresh Command Limit
    * Please see the silicon errata for more details.

    DDR PHY: DDR_PHY_CTRL_1 = 0x00000005
    * WARNING: reg_phy_enable_dynamic_pwrdn disabled.
    * Bits 9:8 (reg_phy_rd_local_odt) to 0 -> no termination
    * Bits 4:0 (reg_read_latency) set to 5
    -> If PHY_INVERT_CLKOUT=0, this is an appropriate value.
    -> If PHY_INVERT_CLKOUT=1, this is too small.
    -> PHY_INVERT_CLKOUT is a write-only register, so this needs to be
    -> inspected closely in the code and RatioSeed spreadsheet.

    *********************
    *** Register Dump ***
    *********************

    *(0x4c000000) = 0x40443403
    *(0x4c000004) = 0x40000000
    *(0x4c000008) = 0x41805232
    *(0x4c00000c) = 0x00000000
    *(0x4c000010) = 0x0000041d
    *(0x4c000014) = 0x0000041d
    *(0x4c000018) = 0x0666a391
    *(0x4c00001c) = 0x0666a391
    *(0x4c000020) = 0x242431ca
    *(0x4c000024) = 0x242431ca
    *(0x4c000028) = 0x0000021f
    *(0x4c00002c) = 0x0000021f
    *(0x4c000038) = 0x00000000
    *(0x4c00003c) = 0x00000000
    *(0x4c000054) = 0x00ffffff
    *(0x4c000058) = 0x8000140a
    *(0x4c00005c) = 0x00021616
    *(0x4c000080) = 0x129a206a
    *(0x4c000084) = 0x05ff9908
    *(0x4c000088) = 0x00010000
    *(0x4c00008c) = 0x00000000
    *(0x4c000090) = 0x64a8105e
    *(0x4c000098) = 0x00050000
    *(0x4c00009c) = 0x00050000
    *(0x4c0000a4) = 0x00000000
    *(0x4c0000ac) = 0x00000000
    *(0x4c0000b4) = 0x00000000
    *(0x4c0000bc) = 0x00000000
    *(0x4c0000c8) = 0x00000000
    *(0x4c0000d4) = 0x00000000
    *(0x4c0000d8) = 0x00000000
    *(0x4c0000dc) = 0x00000000
    *(0x4c0000e4) = 0x00000005
    *(0x4c0000e8) = 0x00000005
    *(0x4c000100) = 0x00000000
    *(0x4c000104) = 0x00000000
    *(0x4c000108) = 0x00000000
    *(0x4c000120) = 0x00000305

    ************************
    *** IOCTRL Registers ***
    ************************

    CONTROL: DDR_CMD0_IOCTRL = 0x0000018b
    * ddr_ba2 Pullup/Pulldown disabled
    * ddr_wen Pullup/Pulldown disabled
    * ddr_ba0 Pullup/Pulldown disabled
    * ddr_a5 Pullup/Pulldown disabled
    * ddr_ck Pullup/Pulldown disabled
    * ddr_ckn Pullup/Pulldown disabled
    * ddr_a3 Pullup/Pulldown disabled
    * ddr_a4 Pullup/Pulldown disabled
    * ddr_a8 Pullup/Pulldown disabled
    * ddr_a9 Pullup/Pulldown disabled
    * ddr_a6 Pullup/Pulldown disabled
    * Bits 9:5 control ddr_ck and ddr_ckn
    - Slew slow
    - Drive Strength 9 mA
    * Bits 4:0 control ddr_ba0, ddr_ba2, ddr_wen, ddr_a[9:8], ddr_a[6:3]
    - Slew slow
    - Drive Strength 8 mA
    CONTROL: DDR_CMD1_IOCTRL = 0x0000018b
    * ddr_a15 Pullup/Pulldown disabled
    * ddr_a2 Pullup/Pulldown disabled
    * ddr_a12 Pullup/Pulldown disabled
    * ddr_a7 Pullup/Pulldown disabled
    * ddr_ba1 Pullup/Pulldown disabled
    * ddr_a10 Pullup/Pulldown disabled
    * ddr_a0 Pullup/Pulldown disabled
    * ddr_a11 Pullup/Pulldown disabled
    * ddr_casn Pullup/Pulldown disabled
    * ddr_rasn Pullup/Pulldown disabled
    * Bits 4:0 control ddr_15, ddr_a[12:10], ddr_a7, ddr_a2, ddr_a0, ddr_ba1, ddr_casn, ddr_rasn
    - Slew slow
    - Drive Strength 8 mA
    CONTROL: DDR_CMD2_IOCTRL = 0x0000018b
    * ddr_cke Pullup/Pulldown disabled
    * ddr_resetn Pullup/Pulldown disabled
    * ddr_odt Pullup/Pulldown disabled
    * ddr_a14 Pullup/Pulldown disabled
    * ddr_a13 Pullup/Pulldown disabled
    * ddr_csn0 Pullup/Pulldown disabled
    * ddr_a1 Pullup/Pulldown disabled
    * Bits 4:0 control ddr_cke, ddr_resetn, ddr_odt, ddr_csn0, ddr_[a14:13], ddr_a1
    - Slew slow
    - Drive Strength 8 mA
    CONTROL: DDR_DATA0_IOCTRL = 0x0000018b
    * ddr_d8 Pullup/Pulldown disabled
    * ddr_d9 Pullup/Pulldown disabled
    * ddr_d10 Pullup/Pulldown disabled
    * ddr_d11 Pullup/Pulldown disabled
    * ddr_d12 Pullup/Pulldown disabled
    * ddr_d13 Pullup/Pulldown disabled
    * ddr_d14 Pullup/Pulldown disabled
    * ddr_d15 Pullup/Pulldown disabled
    * ddr_dqm1 Pullup/Pulldown disabled
    * ddr_dqs1 and ddr_dqsn1 Pullup/Pulldown disabled
    * Bits 9:5 control ddr_dqs1, ddr_dqsn1
    - Slew slow
    - Drive Strength 9 mA
    * Bits 4:0 control ddr_d[15:8], ddr_dqm1
    - Slew slow
    - Drive Strength 8 mA
    CONTROL: DDR_DATA1_IOCTRL = 0x0000018b
    * ddr_d0 Pullup/Pulldown disabled
    * ddr_d1 Pullup/Pulldown disabled
    * ddr_d2 Pullup/Pulldown disabled
    * ddr_d3 Pullup/Pulldown disabled
    * ddr_d4 Pullup/Pulldown disabled
    * ddr_d5 Pullup/Pulldown disabled
    * ddr_d6 Pullup/Pulldown disabled
    * ddr_d7 Pullup/Pulldown disabled
    * ddr_dqm0 Pullup/Pulldown disabled
    * ddr_dqs0 and ddr_dqsn0 Pullup/Pulldown disabled
    * Bits 9:5 control ddr_dqs0, ddr_dqsn0
    - Slew slow
    - Drive Strength 9 mA
    * Bits 4:0 control ddr_d[7:0], dqm0
    - Slew slow
    - Drive Strength 8 mA
    CONTROL: DDR_IO_CTRL = 0x00000000
    * Bit 31: DDR_RESETn controlled by EMIF.
    * Bit 28 (mddr_sel) configured for SSTL, i.e. DDR2/DDR3/DDR3L operation.
    CONTROL: VTP_CTRL = 0x00010167
    * VTP not disabled (expected in normal operation, but not DS0).
    CONTROL: VREF_CTRL = 0x00000000
    * VREF supplied externally (typical).
    CONTROL: DDR_CKE_CTRL = 0x00000001
    * CKE controlled by EMIF (normal/ungated operation).
  • Please zip them into one file and attach.
  • So were those files taken during a "slow" period? I just imported the ctt.rd1 into Clock Tree Tool, and things look pretty normal at first glance. It shows MPU @ 600 MHz, DDR @ 266 MHz. The DPLL_CORE frequencies are at their typical values.
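
The 600 MHz and 266 MHz figures can be cross-checked by hand from the raw register values in the dump. The sketch below assumes the usual AM335x DPLL relationship CLKOUT = CLKINP x M / (N + 1) / M2, with DPLL_MULT in bits 18:8 and DPLL_DIV in bits 6:0 of CM_CLKSEL_DPLL_x, and the M2 divider in bits 4:0 of CM_DIV_M2_DPLL_x. The DDR values come straight from the ddr-analysis output above; mapping 0x44e0042c and 0x44e004a8 to the MPU DPLL registers is my reading of the TRM, and the function name is hypothetical.

```python
def dpll_clkout_mhz(clkinp_mhz, clksel, div_m2):
    """Compute a DPLL CLKOUT frequency in MHz from raw register values.

    clksel  -- CM_CLKSEL_DPLL_x value (DPLL_MULT bits 18:8, DPLL_DIV bits 6:0)
    div_m2  -- CM_DIV_M2_DPLL_x value (M2 divider in bits 4:0)
    """
    mult = (clksel >> 8) & 0x7FF   # M
    div = clksel & 0x7F            # N, the actual divider is N + 1
    m2 = div_m2 & 0x1F             # M2 post-divider
    if m2 == 0:
        raise ValueError("M2 divider of 0 is not a valid setting")
    return clkinp_mhz * mult / (div + 1) / m2


if __name__ == "__main__":
    # Values taken from the dump posted in this thread (24 MHz input clock).
    print("DDR:", dpll_clkout_mhz(24, 0x00010A17, 0x00000201), "MHz")
    print("MPU:", dpll_clkout_mhz(24, 0x00025817, 0x00000201), "MHz")
```
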
  • Looking at your boot log, I see PRM_RSTST = 0x200 which indicates an ICEPICK reset. The ICEPICK is part of the JTAG logic, so the act of connecting that Segger debugger may have performed a reset. It's hard to know if that might possibly be changing the behavior. Hopefully you have an XDS110 on order as I think we would get more reliable results.

    As a sanity check, your device boots from 8-bit NAND right?

    Are you sure CKE is toggling? Normally you would see that toggling to indicate the DDR entering/exiting self-refresh mode. However, I wouldn't expect that to happen at all while the board is booting.

    Do you have an EMIF spreadsheet that you filled out that you can attach?
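
As a rough illustration of the PRM_RSTST reading above, the sketch below turns a raw value into reset-source names. Bit 9 (ICEPICK) and bit 4 (WDT1_RST) are the ones that appear in this thread; the remaining names are my reading of the AM335x TRM and should be treated as assumptions. `is_watchdog_reset` is a hypothetical helper that mirrors the "reset reason == WDT" check described for the u-boot workaround.

```python
# Bits 4 and 9 are confirmed by the script output in this thread; the other
# names are assumptions taken from the AM335x TRM.
PRM_RSTST_BITS = {
    0: "GLOBAL_COLD_RST",
    1: "GLOBAL_WARM_SW_RST",
    4: "WDT1_RST",
    5: "EXTERNAL_WARM_RST",
    9: "ICEPICK_RST",
}


def decode_prm_rstst(value):
    """Return the names of all reset sources flagged in PRM_RSTST."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(PRM_RSTST_BITS.get(bit, f"bit{bit}"))
    return names


def is_watchdog_reset(value):
    """True if the WDT1 reset flag (bit 4) is set."""
    return bool(value & (1 << 4))


if __name__ == "__main__":
    for value in (0x00000200, 0x00000210):
        print(f"0x{value:08x}:", ", ".join(decode_prm_rstst(value)))
```

The value 0x210 is the one reported by the slow-mode capture posted further down, where both the ICEPICK and WDT1 flags are set.
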
  • No, this was a properly running card. I just induced "slow" operation and ran the scripts again (multiple resets but still in slow mode!). Please see attached zip: am335x-scripts-results-slow.zip

  • Here are the distilled differences between the two results files (done using cygwin diff):

    $ diff -ud am335x-normal-jtag-scripts-results/am335x-boot-analysis_2019-05-10_081917.txt am335x-slow-jtag-scripts-results/am335x-boot-analysis_2019-05-10_093602.txt
    --- am335x-normal-jtag-scripts-results/am335x-boot-analysis_2019-05-10_081917.txt 2019-05-10 08:19:18.946506300 -0700
    +++ am335x-slow-jtag-scripts-results/am335x-boot-analysis_2019-05-10_093602.txt 2019-05-10 09:36:04.986923800 -0700
    @@ -2,7 +2,8 @@
    * AM335x family
    * Silicon Revision 2.1

    -PRM_DEVICE: PRM_RSTST = 0x00000200
    +PRM_DEVICE: PRM_RSTST = 0x00000210
    + * Bit 4 : WDT1_RST

    CONTROL: control_status = 0x00400312
    * SYSBOOT[15:14] = 01b (24 MHz)
    @@ -34,7 +35,8 @@

    ROM: Cold reset tracing vector, word 2 = 0x00000000

    -ROM: Cold reset tracing vector, word 3 = 0x00000200
    +ROM: Cold reset tracing vector, word 3 = 0x00000210
    + * Bit 4 : [Memory Boot] Reserved
    * Bit 9 : Reserved

    Cortex A8 Program Counter = 0x402f3500

    $ diff -ud am335x-normal-jtag-scripts-results/am335x-ctt_2019-05-10_082154.rd1 am335x-slow-jtag-scripts-results/am335x-ctt_2019-05-10_093617.rd1
    --- am335x-normal-jtag-scripts-results/am335x-ctt_2019-05-10_082154.rd1 2019-05-10 08:21:55.035548800 -0700
    +++ am335x-slow-jtag-scripts-results/am335x-ctt_2019-05-10_093617.rd1 2019-05-10 09:36:17.730568600 -0700
    @@ -180,7 +180,7 @@
    0x44e00e08 0x00000000
    0x44e00f00 0x00000000
    0x44e00f04 0x00001006
    -0x44e00f08 0x00000200
    +0x44e00f08 0x00000210
    0x44e00f0c 0x78000017
    0x44e00f10 0x00000003
    0x44e00f14 0x00000000

    $ diff -ud am335x-normal-jtag-scripts-results/am335x-ddr-analysis_2019-05-10_082502.txt am335x-slow-jtag-scripts-results/am335x-ddr-analysis_2019-05-10_093629.txt
    --- am335x-normal-jtag-scripts-results/am335x-ddr-analysis_2019-05-10_082502.txt 2019-05-10 08:25:04.949467300 -0700
    +++ am335x-slow-jtag-scripts-results/am335x-ddr-analysis_2019-05-10_093629.txt 2019-05-10 09:36:32.993552800 -0700
    @@ -27,9 +27,7 @@
    * Bits 06:04 (reg_ibank) set to 3 -> 8 banks
    * Bits 02:00 (reg_pagesize) set to 2 -> 10 column bits

    -EMIF: PWR_MGMT_CTRL = 0x00000000
    - * ERROR: Bits 7:4 (reg_sr_tim) are in violation of Maximum Self-Refresh Command Limit
    - * Please see the silicon errata for more details.
    +EMIF: PWR_MGMT_CTRL = 0x000002a0

    DDR PHY: DDR_PHY_CTRL_1 = 0x00000005
    * WARNING: reg_phy_enable_dynamic_pwrdn disabled.
    @@ -56,16 +54,16 @@
    *(0x4c000024) = 0x242431ca
    *(0x4c000028) = 0x0000021f
    *(0x4c00002c) = 0x0000021f
    -*(0x4c000038) = 0x00000000
    -*(0x4c00003c) = 0x00000000
    +*(0x4c000038) = 0x000002a0
    +*(0x4c00003c) = 0x000000a0
    *(0x4c000054) = 0x00ffffff
    *(0x4c000058) = 0x8000140a
    *(0x4c00005c) = 0x00021616
    -*(0x4c000080) = 0x129a206a
    -*(0x4c000084) = 0x05ff9908
    +*(0x4c000080) = 0xcf15a100
    +*(0x4c000084) = 0x4a0a6016
    *(0x4c000088) = 0x00010000
    *(0x4c00008c) = 0x00000000
    -*(0x4c000090) = 0x64a8105e
    +*(0x4c000090) = 0x173eb76e
    *(0x4c000098) = 0x00050000
    *(0x4c00009c) = 0x00050000
    *(0x4c0000a4) = 0x00000000

    -----------------------------
  • Ah ha! I see a critical clue... The PWR_MGMT_CTRL register is important. And I think I clearly need to add a little more decoding of this register. In particular take a look at bits 10:8 (reg_lp_mode). In the "slow" case these bits are set to a value of 2, which indicates you're trying to put the DDR into self-refresh.

    Any idea how/where that register is being configured? It might also be interesting if you zero out that register via u-boot or JTAG to see if you can "fix" a board that is in a problematic state.
  • Yes, I saw this too. As far as I know, this is the same boot code being executed after a watchdog reset. Is there something in the processor that sets this register to self-refresh on watchdog reset? If so, then u-boot must reset this. That would be a good explanation for the slow behavior.

    Also, there is this (in the normal running):
    EMIF: PWR_MGMT_CTRL = 0x00000000
    * ERROR: Bits 7:4 (reg_sr_tim) are in violation of Maximum Self-Refresh Command Limit
    * Please see the silicon errata for more details.

    What should this value be? Would you be able to point us to the relevant errata?
  • FYI, I just pushed an update that decodes that reg_lp_mode field and warns if it is enabled:

    git.ti.com/.../am335x-ddr-analysis.dss

    For your particular issue, it might not matter at this point since we're already lasered onto that register, but at least now it will be more obvious in future issues.
  • According to the Sitara AM335x reference manual, PWR_MGMT_CTRL == 0x2a0 decodes as follows:
    reg_sr_tim is 0xa (8192 clocks)
    reg_lp_mode is 2 (self refresh)

    reg_lp_mode can also be 1 (clock stop), where nothing runs at all. Is it possible that we are even in this mode?

    reg_lp_mode can also be 4 (power down)
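
The field breakdown above can be sketched as a small decoder. It uses only the two fields discussed in this thread: reg_lp_mode in bits 10:8 and reg_sr_tim in bits 7:4. The clock-count formula (16 clocks for a value of 1, doubling with each step, 0 meaning immediate entry) is my reading of the TRM and is consistent with the 0xa = 8192 clocks figure quoted here, but treat it as an assumption. Labeling any other reg_lp_mode value as "disabled/other" is also an assumption.

```python
# Mode names for the reg_lp_mode values mentioned in this thread.
LP_MODES = {1: "clock stop", 2: "self-refresh", 4: "power-down"}


def decode_pwr_mgmt_ctrl(value):
    """Decode reg_lp_mode and reg_sr_tim from an EMIF PWR_MGMT_CTRL value."""
    lp_mode = (value >> 8) & 0x7   # bits 10:8
    sr_tim = (value >> 4) & 0xF    # bits 7:4
    # Assumed encoding: 0 = enter immediately, n >= 1 = 16 * 2**(n - 1) clocks.
    sr_idle_clocks = 0 if sr_tim == 0 else 16 << (sr_tim - 1)
    return {
        "reg_lp_mode": lp_mode,
        "lp_mode_name": LP_MODES.get(lp_mode, "disabled/other"),
        "reg_sr_tim": sr_tim,
        "sr_idle_clocks": sr_idle_clocks,
    }


if __name__ == "__main__":
    for value in (0x00000000, 0x000002A0):
        print(f"0x{value:08x}:", decode_pwr_mgmt_ctrl(value))
```
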
  • Stephen Biggs said:
    Also, there is this (in the normal running):
    EMIF: PWR_MGMT_CTRL = 0x00000000
    * ERROR: Bits 7:4 (reg_sr_tim) are in violation of Maximum Self-Refresh Command Limit
    * Please see the silicon errata for more details.

    It relates to Section 3.1.2 DDR3: JEDEC Compliance for Maximum Self-Refresh Command Limit.  Having the lower byte of the register set as 0xA0 is appropriate.  That's why there's no warning in the case of your slow boot, i.e. because the register is configured as 0x2A0.  That's actually a properly configured value for configuring self-refresh.  The only issue is you shouldn't be doing self-refresh!!!

    Stephen Biggs said:
    Yes, I saw this too. As far as I know, this is the same boot code being executed after a watchdog reset.

    I wonder whether explicitly zeroing out that register in u-boot would resolve your issue.

  • The question is: why is this being put into self-refresh in the first place?
  • I'm looking at the SDK 7.00 u-boot code. It seems to me that PWR_MGMT_CTRL gets written in the following location:

    arch/arm/cpu/armv7/am33xx/emif4d5.c
    do_sdram_init()

    Unfortunately, back in SDK 7.00, I don't think that function is getting called as part of the initialization flow. In newer SDKs there is a slightly different flow and this register definitely gets initialized. In other words, if you were on a newer SDK, I don't expect we'd be having this conversation... This is highly typical for these old SDKs. You spend hours or days (or worse, months) hunting for issues that have been solved for years...

    I suggest adding it somewhere in arch/arm/cpu/armv7/am33xx/ddr.c. Perhaps in config_sdram() or set_sdram_timings().
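
A minimal sketch of what such an addition might look like, written as plain C so it can be tried off-target. The base address 0x4C000000 and offset 0x38 are assumptions about the AM335x memory map that should be verified against the TRM, and emif_set_pmcr() is a hypothetical helper, not an existing u-boot function. Real u-boot code would normally go through the writel() accessor rather than a raw pointer store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMED addresses for AM335x; verify against the TRM before use. */
#define EMIF4_0_BASE              0x4C000000u
#define EMIF_PWR_MGMT_CTRL_OFFSET 0x38u

/* reg_lp_mode occupies bits 10:8 according to this thread. */
#define PMCR_LP_MODE_MASK (0x7u << 8)

/* Hypothetical helper: force PWR_MGMT_CTRL to a defined value that keeps
 * low-power mode disabled (for example 0x0 or 0xA0). */
static inline void emif_set_pmcr(volatile uint32_t *pmcr, uint32_t value)
{
    assert(pmcr != NULL);
    assert((value & PMCR_LP_MODE_MASK) == 0u); /* refuse to enable an LP mode */
    *pmcr = value;
}
```

On the target, the call would be along the lines of emif_set_pmcr((volatile uint32_t *)(uintptr_t)(EMIF4_0_BASE + EMIF_PWR_MGMT_CTRL_OFFSET), 0xA0u), placed in the DDR init path so it runs on every boot, including after a watchdog reset.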
  • Stephen Biggs said:
    The question is: why is this being put into self-refresh in the first place?

    I think for starters you should check to see if adding in code to u-boot that forces that to a defined value (either 0 or 0xA0) fixes the issue.  In other words, before we get overly obsessed with this register, it would be good to have some further confirmation that this truly has a big impact on the issue.  I think it will, but let's verify.

    The next question in my mind would be WHEN is this register taking on this value of 0x2A0.  You might need to add some additional prints into u-boot and/or Linux to understand that one.  Given that this issue happens "after some time", I expect it's happening in Linux, and then never getting cleared out by u-boot.  So I think having u-boot initialize this register is good.

  • Your script flags an error:

    EMIF: PWR_MGMT_CTRL = 0x00000000
     * ERROR: Bits 7:4 (reg_sr_tim) are in violation of Maximum Self-Refresh Command Limit
     * Please see the silicon errata for more details.

    The errata says:

    When using DDR3 EMIF Self-Refresh, it is possible to violate the maximum refresh command requirement specified in the JEDEC standard DDR3 SDRAM Specification (JESD79-3E, July 2010). This requirement states that the DDR3 EMIF controller should issue no more than 16 refresh commands within any 15.6-μs interval.

    To avoid this requirement violation, when using the DDR3 EMIF and Self-Refresh (setting LP_MODE = 0x2 field in the PMCR), the SR_TIM value in the PMCR must to be programmed to a value greater than or equal to 0x9.

    But with a 0 in reg_lp_mode, self refresh is not enabled, so your script is wrong to flag this, IMO.

  • PS. If the issue is in Linux, my guess is that it's power management related. Are you using DeepSleep0 at any time? The value 0x2A0 is what I would expect to see as the device is entering DeepSleep0. It should be restored as you exit DeepSleep0. I'll note that power management (especially around DeepSleep0) is a place where LOTS of work has been done between SDK 7.00 and present day. You would really need to update to the latest kernel to get a proper and complete fix.
  • Responding to your latest post, which I saw by email, re: DeepSleep0.

    All of a sudden I have been put back into "need moderation" mode before my posts appear. Can you please disable this so I can post correctly?

    We are not using DeepSleep0 at this time. We wanted to try this and may still investigate the feasibility of this, but we found it to be simpler to just cut power to the processor itself. This causes us a lag time when we restart as it must do a full reboot, but then we manage why we restarted.
  • Missing a post that got sent to moderation where I questioned the validity of your script flagging an errata issue with the EMIF register when we weren't doing self refresh.
  • Stephen Biggs said:
    Missing a post that got sent to moderation where I questioned the validity of your script flagging an errata issue with the EMIF register when we weren't doing self refresh.

    It's true that it's only an actual issue when you actually enable self-refresh.  There are a lot of different software environments, and in some cases this PWR_MGMT_CTRL register may be changed using a read-modify-write that expects the proper timing to already be there and is only switching between self-refresh and normal mode.  It's best to just have it configured appropriately all of the time.
  • yes, this is understood. What I was discussing (and was in my post that got waylaid somewhere, so must repeat) was only the way your script works. Your script flags this error when reg_lp_mode is 0, no self refresh:
    EMIF: PWR_MGMT_CTRL = 0x00000000
    * ERROR: Bits 7:4 (reg_sr_tim) are in violation of Maximum Self-Refresh Command Limit
    * Please see the silicon errata for more details.

    The errata says:
    When using DDR3 EMIF Self-Refresh, it is possible to violate the maximum refresh command requirement specified in the JEDEC standard DDR3 SDRAM Specification (JESD79-3E, July 2010). This requirement states that the DDR3 EMIF controller should issue no more than 16 refresh commands within any 15.6-μs interval.
    To avoid this requirement violation, when using the DDR3 EMIF and Self-Refresh (setting LP_MODE = 0x2 field in the PMCR), the SR_TIM value in the PMCR must to be programmed to a value greater than or equal to 0x9.

    But since LP_MODE is 0, we are not doing self-refresh so this error is not correct and should not be flagged unless we are doing self-refresh (LP_MODE == 2) and the reg_sr_tim is actually out of range.
  • Stephen Biggs said:
    But since LP_MODE is 0, we are not doing self-refresh so this error is not correct and should not be flagged unless we are doing self-refresh (LP_MODE == 2) and the reg_sr_tim is actually out of range.

    I've made it more intelligent now.  It only flags an error if you're using a combination of reg_lp_mode=2 with an invalid reg_sr_tim.  It gives a warning otherwise saying that the value being used could be a problem if it was ever used with reg_lp_mode=2.
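
The rule described here can be sketched in C as follows. This is not the actual script (which is a .dss file), only an illustration of the logic, assuming the field positions given earlier in the thread and the errata threshold of SR_TIM >= 0x9. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum pmcr_check { PMCR_OK, PMCR_WARNING, PMCR_ERROR };

#define LP_MODE_SELF_REFRESH 0x2u
#define ERRATA_MIN_SR_TIM    0x9u /* minimum SR_TIM quoted from the errata */

static_assert(ERRATA_MIN_SR_TIM <= 0xFu,
              "threshold must fit in the 4-bit reg_sr_tim field");

/* Classify a PWR_MGMT_CTRL value:
 *  - OK:      reg_sr_tim meets the errata minimum
 *  - ERROR:   reg_sr_tim too small AND self-refresh is enabled
 *  - WARNING: reg_sr_tim too small, but self-refresh is not enabled,
 *             so it would only matter if reg_lp_mode were later set to 2 */
static inline enum pmcr_check pmcr_check_sr_tim(uint32_t pmcr)
{
    uint32_t lp_mode = (pmcr >> 8) & 0x7u; /* bits 10:8 */
    uint32_t sr_tim  = (pmcr >> 4) & 0xFu; /* bits 7:4  */

    if (sr_tim >= ERRATA_MIN_SR_TIM)
        return PMCR_OK;
    return (lp_mode == LP_MODE_SELF_REFRESH) ? PMCR_ERROR : PMCR_WARNING;
}
```

Under this rule the normal-boot value 0x00000000 produces only a warning, while the slow-boot value 0x2A0 passes the timing check, since its only problem is that self-refresh is enabled at all.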

  • Thank you for your help on this.  We can detect the issue in u-boot and clear the self refresh register and it prevents the case of continuously slow operation through a software watchdog reset event.  My question now is why this is occurring.  I would think this register would be set to a known value during reset, but perhaps this isn't happening because it's persistent through software watchdog reset.  Is this a possibility?  You also mentioned read-modify-write.  Would this refresh setting persist through a software reset and r-m-w?  We know POR typically always resolves the issue.  

  • James Miller39 said:
    I would think this register would be set to a known value during reset, but perhaps this isn't happening because it's persistent through software watchdog reset.  Is this a possibility?

    It certainly seems that this EMIF_PWR_MGMT register is persistent through a watchdog reset.  Take a look in the Technical Reference Manual at Table 8-26. Reset Sources.  The watchdog reset does not reset the PLLs and also puts the DDR into self-refresh.  A cold reset on the other hand will reset everything.

    James Miller39 said:
    You also mentioned read-modify-write.

    That was specific to Linux suspend operation.

    James Miller39 said:
    We know POR typically always resolves the issue. 

    Yes, a POR will clear this register.  However, ensuring the DDR initialization code writes to this register (which is what is done in the current u-boot) is something you should be doing.

  • I'm looking in the Technical Reference Manual rev P, Table 8-26, and I see where the PLLs are reset differently, but you mention that watchdog reset puts the DDR into self refresh.  I don't see that in the table; where is this indicated?  I'm also looking for where EMIF_PWR_MGMT is shown to be persistent through watchdog reset and I'm not seeing that. Can you help point that out for me?  Thanks

  • Hi James, it is mentioned in figures 8-22 and 8-23 (Rev O of the TRM). This is admittedly subtle, but the point is that any warm reset (which includes watchdog reset) will put the DDR into self-refresh. As a consequence, the DDR controller and PHY will maintain their configuration so that the DDR can successfully exit from self-refresh.

    As stated earlier, if you don't ever intend to use power management, then PWR_MGMT register should always remain 0x0. Whenever you have it set to 0x2A0, the EMIF will put the DDR into self refresh after a certain amount of idle time (sr_tim clock cycles). This can be good in some use cases, but for your case, going in and out of self-refresh multiple times (which is why the CKE signal was toggling) seems to be causing too much overhead and slowing down execution.

    With no intention of power management, CKE signal should remain high all the time.


    Regards,
    James
  • Thank you for the explanation and pointing me to the right information.  One final question, are there any warm reset conditions under which self-refresh mode would not be entered, i.e. PWR_MGMT = 0x00 on reset?  I'm not sure that we're seeing evidence of self-refresh after every warm reset.  Certainly, we do see lp_mode set on most warm resets but not on every warm reset.

    Jim  

  • Hi James, self-refresh should be entered on any warm reset. You should see CKE signal go low indicating self refresh. And the DDR configuration registers maintain their state.
    I experimented with this a little on our EVM, and the PWR_MGMT register will be loaded with the value of PWR_MGMT_SHDW after a warm reset.
    It seems like every time you boot, the PWR_MGMT and PWR_MGMT_SHDW registers aren't consistent. Maybe that's why you are seeing different results?

    Regards,
    James
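
One way to act on the shadow-register observation is to program both registers with the same value during DDR init, so that a reload from PWR_MGMT_SHDW after a warm reset cannot bring back a stale setting. The following is a rough sketch only. The helper names are hypothetical, and whether the shadow register implements every field of the main register is not stated in this thread, so the TRM should be checked before relying on a full-word comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* reg_lp_mode occupies bits 10:8 according to this thread. */
#define PMCR_LP_MODE_MASK (0x7u << 8)

/* Report whether the main and shadow values agree (raw comparison,
 * which ASSUMES both registers share the same layout). */
static inline int emif_pmcr_matches_shadow(uint32_t pmcr, uint32_t pmcr_shdw)
{
    return pmcr == pmcr_shdw;
}

/* Hypothetical helper: write one defined value, with low-power mode
 * disabled, to both the shadow and the main register. */
static inline void emif_sync_pmcr(volatile uint32_t *pmcr,
                                  volatile uint32_t *pmcr_shdw,
                                  uint32_t value)
{
    assert(pmcr != NULL && pmcr_shdw != NULL);
    assert((value & PMCR_LP_MODE_MASK) == 0u); /* keep LP mode off */
    *pmcr_shdw = value;
    *pmcr = value;
}
```

Logging the result of emif_pmcr_matches_shadow() early in u-boot, before any write, could also help show on which warm resets the two registers actually differ.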
  • James, is everything closed out here? Haven't heard if there was a final resolution.

    Regards,
    James