AM3358: cold start problem

Bart

Part Number: AM3358

Team,

This is a continuation of the existing thread (linked here).

We are facing a problem at boot. So far the issue has been found on only one card, but we would like to understand its cause.

I will try to describe the problem:

When the card boots from the TFTP server through the KSZ8463RL switch, the program counter gets stuck at 0x402f0440 and the card hangs at this point (a picture is attached below). I have also attached the register values in this state.

I thought the problem was related to the DDR, but if I load the same images (SPL and U-Boot) over the serial port, the card boots.

The problem seems to appear only at the first start, while the card is at about room temperature. The second boot is OK. If I wait an hour for the card to cool down, the problem appears again.

I have just checked the voltages and clocks of the processor and the DDR, and they seem to be OK. After reset, the SYSBOOT[15:0] value is 1000 0000 0101 0000 (0x8050).
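
For cross-checking a SYSBOOT bit pattern read off the pins against a hexadecimal value, a small host-side helper can do the conversion. This is only a minimal sketch; the function name `sysboot_bits_to_u16` is hypothetical and is not part of any TI tool.

```c
#include <stdint.h>

/* Convert a string of '0'/'1' characters (spaces are ignored) into a
 * 16-bit value, most significant bit first.
 * Returns 0 on success, -1 if the string contains another character
 * or does not hold exactly 16 bits. */
int sysboot_bits_to_u16(const char *bits, uint16_t *out)
{
    uint16_t value = 0;
    int count = 0;

    for (const char *p = bits; *p != '\0'; ++p) {
        if (*p == ' ')
            continue;               /* allow grouping such as "1000 0000" */
        if (*p != '0' && *p != '1')
            return -1;              /* not a binary digit */
        if (count == 16)
            return -1;              /* more than 16 bits supplied */
        value = (uint16_t)((value << 1) | (uint16_t)(*p - '0'));
        ++count;
    }
    if (count != 16)
        return -1;                  /* fewer than 16 bits supplied */

    *out = value;
    return 0;
}
```

Passing the string "1000 0000 0101 0000" yields 0x8050.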

Can you please help solve this issue?

Thank you
TI Customer

cold-start-registers.txt

over 6 years ago

0 Biser Gatchev-XID over 6 years ago

TI__Guru**** 393215 points

Bartosz,

Please describe the hardware environment in detail. If possible, post the schematic.

0 Biser Gatchev-XID over 6 years ago in reply to Biser Gatchev-XID

TI__Guru**** 393215 points

Another question: Is this failure seen on a single board, or across multiple boards?

0 Bart over 6 years ago in reply to Biser Gatchev-XID

TI__Expert 6460 points

Hi Biser,

I've sent you the description of the setup and schematic offline (confidential).

So far the issue has been observed with one board.

Kind regards,
Bartosz

0 Biser Gatchev-XID over 6 years ago in reply to Bart

TI__Guru**** 393215 points

Is this only 1 board out of many? If so, I would suggest you check the board for manufacturing defects, esp. bad soldering.

0 Biser Gatchev-XID over 6 years ago in reply to Biser Gatchev-XID

TI__Guru**** 393215 points

By the way, the files you have sent are of no use. I was asking about the schematic in PDF format.

0 Bart over 6 years ago in reply to Biser Gatchev-XID

TI__Expert 6460 points

Biser,

We don't see manufacturing defects on the board.

The schematic in pdf has been sent to you now via e-mail.

Kind regards
Bartosz

0 Biser Gatchev-XID over 6 years ago in reply to Bart

TI__Guru**** 393215 points

Is this only 1 board out of many, or do you see the issue on multiple boards?

0 JJD over 6 years ago in reply to Biser Gatchev-XID

TI__Guru* 87040 points

Bartosz, as Biser alludes to, knowing how many boards have this issue is important for determining what kind of problem it is.

Based on your screenshot, it looks like you are getting a data abort in MLO. A lot of times, this is a result of a mis-configured DDR. Since it is intermittent or maybe temp related, the configuration may be marginal.

A couple of things you can try:
-On a failing board, try opening up a memory window in the DDR memory to see if the values are still stable (hit the continuous refresh button)
-Since you fail in MLO, you can put breadcrumbs in your code to determine how far you execute. Find a memory location in internal SRAM and just increment the value in different places in code. My suspicion is that the problem is happening during the loading of u-boot, when a lot of data is being transferred to DDR.
-In your screenshot, I notice a couple of ARM core registers holding addresses such as 0x81FFFEE8 and 0x81FFFF20. These look like a stack located in DDR. Immediately after a fail, open up a memory window to see if these values are stable.
-In your screenshot, your SP is 0x00000000, which doesn't seem correct. May want to debug to try to understand why the stack pointer is being set to 0x0
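
The breadcrumb suggestion in the list above can be sketched as follows. This is a minimal illustration, not code from U-Boot or from the customer's MLO: the macro `BREADCRUMB_ADDR`, the function names, and the host-side stand-in variable are all hypothetical. On the target, the pointer would be aimed at an unused word of internal SRAM (the address is board- and build-specific and is not given in this thread), so that the count survives a DDR fault and can be read in a debugger memory window.

```c
#include <stdint.h>

/* Assumption: on a real target, BREADCRUMB_ADDR is defined as the address
 * of an unused 32-bit word in internal SRAM. When it is not defined (for
 * example in a host build), an ordinary variable stands in so that the
 * sketch still compiles and runs. */
#ifdef BREADCRUMB_ADDR
#define BREADCRUMB ((volatile uint32_t *)(uintptr_t)(BREADCRUMB_ADDR))
#else
static volatile uint32_t breadcrumb_storage;
#define BREADCRUMB (&breadcrumb_storage)
#endif

/* Clear the counter at the very start of the boot code. */
static inline void breadcrumb_reset(void) { *BREADCRUMB = 0u; }

/* Increment the counter; call this at each point of interest. */
static inline void breadcrumb_drop(void) { *BREADCRUMB = *BREADCRUMB + 1u; }

/* Read the counter back (on the target, a debugger would do this). */
static inline uint32_t breadcrumb_read(void) { return *BREADCRUMB; }

/* Hypothetical boot flow showing where the breadcrumbs would go. */
void example_boot_flow(void)
{
    breadcrumb_reset();
    /* ... clock and pin setup ... */
    breadcrumb_drop();              /* 1: basic setup finished */
    /* ... DDR initialization ... */
    breadcrumb_drop();              /* 2: DDR initialization finished */
    /* ... load the next boot stage into DDR ... */
    breadcrumb_drop();              /* 3: image load finished */
}
```

After a hang, the last value written shows which stage was reached; adding more `breadcrumb_drop()` calls inside the failing stage narrows the location further.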

Regards,
James

0 Bart over 6 years ago in reply to JJD

TI__Expert 6460 points

Hi James,

Thanks for your feedback. Please find the comments below. We are looking forward to your help.

Kind regards,
Bartosz

Customer: I attach the trace vector as seen with the debugger. When the card boots from the Ethernet source (SYSBOOT[15:0]=0x8050), the lowest trace vector value is 0xD03E; when the card boots from eMMC (SYSBOOT[15:0]=0x8057), the trace vector value is 0x10009E.

Note: The problem remains even if the warm reset, the cold reset, or the PMIC PWR_EN signal is pulled low. I also checked the core and MPU voltages for the OPP100 boot mode.

JJD said:

Based on your screenshot, it looks like you are getting a data abort in MLO. A lot of times, this is a result of a mis-configured DDR. Since it is intermittent or maybe temp related, the configuration may be marginal.

Could we configure the DDR in the MLO?

JJD said:

A couple of things you can try:
-On a failing board, try opening up a memory window in the DDR memory to see if the values are still stable (hit the continuous refresh button)

It looks stable (picture attached in the zip). I cooled the card with cold spray.

JJD said:

-Since you fail in MLO, you can put breadcrumbs in your code to determine how far you execute. Find a memory location in internal SRAM and just increment the value in different places in code. My suspicion is that the problem is happening during the loading of u-boot, when a lot of data is being transferred to DDR.

It is hard to reproduce the problem. The "data abort exception" is the only thing we have found so far.

JJD said:

-In your screenshot, I notice a couple of ARM core registers holding addresses such as 0x81FFFEE8 and 0x81FFFF20. These look like a stack located in DDR. Immediately after a fail, open up a memory window to see if these values are stable.

It also looks stable (picture attached).

JJD said:

-In your screenshot, your SP is 0x00000000, which doesn't seem correct. May want to debug to try to understand why the stack pointer is being set to 0x0

If I click Run, the SP is reloaded with the correct value (picture attached).

traces.zip

0 Bart over 6 years ago in reply to Bart

TI__Expert 6460 points

Hi James,

would you share your thoughts?

Thank you,
Bartosz

0 JJD over 6 years ago in reply to Bart

TI__Guru* 87040 points

Hi Bartosz, one important point to make is that when you try to boot using MLO and then inspect the results using CCS, ensure that you have all of the GEL files disabled or removed from your target configuration. In several of your screenshots, it looks like a GEL script has run, which will render any results you see in CCS useless.

Can you retry some of the experiments above without the GELs? See if the memory window in DDR is stable after a crash. Can you also dump the EMIF registers after a boot from Ethernet and after a boot from the serial port, and compare the two?
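
Comparing two register dumps by eye is error-prone, so a small host-side check can help. The sketch below assumes that both dumps have already been parsed into address/value pairs listed in the same order; the struct and function names are hypothetical, and no real EMIF addresses or values are implied.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One entry of a register dump: where it was read and what was read. */
struct reg_sample {
    uint32_t addr;
    uint32_t value;
};

/* Compare two dumps of n entries each, taken at the same addresses in the
 * same order. Every mismatch is printed; the number of mismatching
 * entries is returned. */
size_t compare_reg_dumps(const struct reg_sample *a,
                         const struct reg_sample *b, size_t n)
{
    size_t mismatches = 0;

    for (size_t i = 0; i < n; ++i) {
        if (a[i].addr != b[i].addr) {
            /* The dumps are not aligned; report it rather than guess. */
            printf("entry %zu: address 0x%08" PRIx32 " vs 0x%08" PRIx32 "\n",
                   i, a[i].addr, b[i].addr);
            ++mismatches;
        } else if (a[i].value != b[i].value) {
            printf("0x%08" PRIx32 ": 0x%08" PRIx32 " vs 0x%08" PRIx32 "\n",
                   a[i].addr, a[i].value, b[i].value);
            ++mismatches;
        }
    }
    return mismatches;
}
```

A return value of 0 means the two boots left the dumped registers identical; any printed line points at a register worth checking against the intended DDR configuration.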

Regards,
James

0 JJD over 6 years ago in reply to JJD

TI__Guru* 87040 points

Bartosz, for further DDR analysis, can you use the DSS scripts described here: processors.wiki.ti.com/.../How_to_use_the_AM335x_IBIS_Models

This will give us more information beyond just a register dump.

Regards,
James

0 Bart over 6 years ago in reply to JJD

TI__Expert 6460 points

Hi James,

please see feedback from the engineer:

1. I created a new target configuration as described in the link. Then I waited an hour so that the boot would fail in the MLO; the card tried to boot from MMC. I attached the results of the DDR analysis and the boot script (link); the file names begin with 1.

2. Then I connected to the card with the debugger to check that the DDR is stable (screenshot attached, file name begins with 2.) and exported the EMIF register values (txt file beginning with 2.).

3. I set the SYSBOOT pins to the Ethernet boot configuration and again waited an hour so that the boot would fail in the MLO. I attached the results of the DDR analysis and the boot script (file names begin with 3.).

4. Then I connected to the card with the debugger again to check that the DDR is stable (screenshot attached, file name begins with 4.) and exported the EMIF register values (txt file beginning with 4.).

5. I attached the results of the DDR analysis and the boot script after a normal boot from MMC (file names begin with 5.).

6. I attached the results of the DDR analysis and the boot script after a normal boot from Ethernet (file names begin with 6.).

Looking forward to your feedback.

Thank you,

Kind regards,
TI Customer

3704.files.zip

0 JJD over 6 years ago in reply to Bart

TI__Guru* 87040 points

Hi Bartosz, thanks for the information. The DDR analysis in steps 1 and 3 shows that the EMIF clock is disabled and that some of the DDR initialization didn't execute, yet in steps 2 and 4, when you connected the debugger, you could read the EMIF registers and everything looked fine. It still seems like the GEL scripts are running automatically when he connects the debugger. Can you confirm that no GEL scripts are loaded in the debugger?

Also, when he can get the board to crash after an hour or so, can he continually get it to crash with successive reboots? If he can get it to fail consistently, then he may be able to step through his code with the debugger to determine where the fail occurs. It may take successive tries to narrow in on where the failure occurs.
Another question: what is the source of his MLO? What codebase did they start with, and what modifications did they make?

Regards,
James

0 Bart over 6 years ago in reply to JJD

TI__Expert 6460 points

Hi James,

It seems we have found the problem using breadcrumbs. In one of our own functions in the MLO, we stored a variable in the DDR and used it without initializing it. The variable was used to address the upper section of the DDR memory. It seems that the DDR retained the last value of the variable for up to an hour. After an hour, the variable sometimes changed to another, "bigger" value, and the resulting memory address was wrong.
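
The general shape of a fix for this class of bug can be sketched as below: set the DDR-resident variable explicitly before its first use, and range-check it before turning it into an address. This is a hedged illustration only, not the customer's code. The struct, the function names, and the assumed DDR size are hypothetical; the base address 0x80000000 matches the DDR addresses seen earlier in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

#define DDR_BASE 0x80000000u   /* start of DDR in the AM335x memory map */
#define DDR_SIZE 0x10000000u   /* assumption for illustration: 256 MiB */

/* Hypothetical state that the boot code keeps in DDR. DDR contents are
 * undefined after a cold power-up, so nothing here may be read before
 * load_state_init() has run. */
struct load_state {
    uint32_t offset;           /* offset into DDR used as a target */
};

/* Give the state a known value instead of relying on whatever the
 * DDR cells happen to hold. */
void load_state_init(struct load_state *s)
{
    s->offset = 0u;
}

/* Turn the stored offset into an address for a transfer of len bytes,
 * but only if the whole transfer stays inside DDR. */
bool load_state_target(const struct load_state *s, uint32_t len,
                       uint32_t *addr_out)
{
    if (len > DDR_SIZE || s->offset > DDR_SIZE - len)
        return false;          /* corrupted or out-of-range offset */
    *addr_out = DDR_BASE + s->offset;
    return true;
}
```

With the range check in place, a stale or decayed value is rejected at the point of use instead of producing a data abort later.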

Thank you for your great support!

0 JJD over 6 years ago in reply to Bart

TI__Guru* 87040 points

Thanks for following up with this.

James
