
AM3352: MPU hangs

Part Number: AM3352

 

Hi Sitara support team,

My customer is seeing resets of unknown origin on their mass-produced board that uses the AM3352.

(Failure state details)

 The AM3352 CPU hangs up.

 When the HW watchdog is enabled, the board is reset by a watchdog timeout.

Here is an ETB (Embedded Trace Buffer) trace log, acquired via JTAG from the AM3352 just before the failure state described above.

Trace_log_20171204.txt

I would like to know whether the cause of this failure can be estimated from the attached trace log.

■ JTAG debugging environment
The JTAG debugger is connected to the custom system board.
The ARM ETM (Embedded Trace Macrocell) function is enabled, and it performs real-time tracing of ARM instructions.
[custom system board] --- <JTAG connection> --- [TRACE32 Lauterbach (ARM-ETM Trace)] ----- [PC]

■ Trace log result summary
The trace stops just before the failure state, with the following processing sequence:
(1) Undefined instruction exception (VFP)
  ↓
(2) Processing of a userland process
  ↓
(3) Data abort exception
Trace logs were acquired for a total of 4 failure occurrences, and every log stops at the same processing.

Please also advise me on an effective way to investigate this failure.

Best regards,
Kanae

 

  • Hi,

    What software is this? Which version? Please provide more details.
  • Hi Biser,

     

     

    Thank you for the quick reply.

    Customer's system software is "Linux OS; Base Distribution, Debian Linux Kernel 3.13.4".

    If you need other information, please let me know.

     

    Best regards,

    Kanae

  • Hi Biser,

    Here is the additional information.
    * Temperature at the time of failure: approximately 20-30 degrees
    * CPU frequency: 1 GHz (fixed)
    The customer has confirmed that the same failure occurs even with the frequency fixed at 300 MHz or 600 MHz.
    * Total units showing the failure: 3 units
    I am currently checking the total number of production units and the lot information.

    My customer needs any available clues to investigate this failure.
    I would appreciate the Sitara support team's help, based on your experience with such issues.

    Best regards,
    Kanae
  • I am sorry, TI does not support Debian Linux. Please advise your customer to check if the issue exists with AM335x Processor SDK: software-dl.ti.com/.../index_FDS.html
  • Part Number: AM3357

    Tool/software: TI-RTOS

    Dear Sitara support Team,

    Regarding the MPU hang my customer is facing on their mass-production board, he would like to ask about the AM335x program at the ARM assembly level.

    Q1.
    Checking the last processing in every trace log (log_file) acquired at the failure state,
    a value is written to a CP15 system control register, and this appears to put the MPU into the hung state.
    Is it possible that this processing puts the MPU into the hung state?

    | ldr r0,0xC05E33E0
    | ldr r0,[r0]
    | mcr p15,0x0,r0,c1,c0,0x0 ; p15,0,r0,c1,c0,0 (system control)
    --------------------

    Note: The MPU hung state is defined as a state in which the MPU cannot be controlled at all by the JTAG debugger,
    and its registers and memory can no longer be read.
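    For reference, here is a minimal C sketch of what this read-then-write of the CP15 System
    Control Register (SCTLR) corresponds to. This is my own illustration, not the customer's
    kernel code; the helper names read_sctlr/write_sctlr are hypothetical, and it assumes a GCC
    toolchain targeting the Cortex-A8 with the code running in a privileged mode:

        /* Illustration only: read-modify-write of the CP15 SCTLR (c1, c0, 0),
           matching the mrc/mcr encoding seen in the trace. */
        static inline unsigned long read_sctlr(void)
        {
            unsigned long val;
            asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(val));
            return val;
        }

        static inline void write_sctlr(unsigned long val)
        {
            asm volatile("mcr p15, 0, %0, c1, c0, 0" : : "r"(val) : "memory");
            /* The ARM architecture requires an ISB after an SCTLR write so that
               changes (e.g. MMU/cache enable bits) affect the following code. */
            asm volatile("isb" ::: "memory");
        }

    Because an SCTLR write can change the MMU and cache enables, the instructions immediately
    after it execute in a different translation/caching context, which is why the surrounding
    code and barriers matter here.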

    Q2.
    If the possibility in Q1 exists, could you provide any advice on the conditions under which it could occur?

    Q3.
    During processing like this, what kinds of operations should be avoided?
    For example, operations the CPU does not expect, or functions that must not run at the same time.

    Q4.
    Could you suggest what kinds of factors could make the MPU software stop suddenly
    at the same processing each time, given the state described above?

    Q5.
    My customer also suspects a hang inside the VFP coprocessor.
    However, he does not know how to verify this.
    Do you have any ideas on how to check for a hang inside the VFP coprocessor?
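    As one possible starting point (a hedged sketch of my own, not a confirmed procedure, and
    the helper name dump_vfp_state is hypothetical), the VFP state can be inspected from
    privileged kernel code by reading FPEXC and FPSCR, assuming the kernel and toolchain are
    built with VFP support (e.g. -mfpu=vfpv3):

        #include <linux/printk.h>

        /* Illustration only: dump basic VFP state from privileged (kernel) code.
           FPEXC is not readable from user mode; FPSCR traps if the VFP is disabled. */
        static void dump_vfp_state(void)
        {
            unsigned long fpexc, fpscr = 0;

            asm volatile("vmrs %0, fpexc" : "=r"(fpexc));
            if (fpexc & (1UL << 30))                        /* EN bit: VFP enabled */
                asm volatile("vmrs %0, fpscr" : "=r"(fpscr));

            /* FPEXC bit 30 (EN) = VFP enabled, bit 31 (EX) = exceptional state pending;
               FPSCR carries the cumulative floating-point exception flags. */
            pr_info("FPEXC=%08lx EN=%lu EX=%lu FPSCR=%08lx\n",
                    fpexc, (fpexc >> 30) & 1UL, (fpexc >> 31) & 1UL, fpscr);
        }

    If FPEXC.EX remains set, the kernel's undefined-instruction (VFP bounce) handling has not
    completed, which may correlate with the undefined instruction exception seen at the start
    of the trace.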

    Best regards,
    Kanae

  • I have asked the factory team to look at this. They will respond here.
  • The attached log seems to have captured several loops of an abort handler, so I don't think it shows anything meaningful. Can you give us more context around the failure:

    -you say 3 boards fail, out of how many boards?
    -when the failure occurs, what is the application doing? Is it booting, running a certain application, etc?
    -do the 3 boards fail all the time? Is the failure repeatable? Do the 3 boards work for a certain period of time, and then fail?
    -Can you disable the Watchdog reset and break before the reset occurs (maybe in the abort handler)?

    -James

  • Hi JJD,
    Thank you for your reply.
    Here are answers from my customer to your questions.

    -you say 3 boards fail, out of how many boards?
    >MPU hang-ups have occurred on 13 units out of 220 units as of Dec-13-2017.
    The customer has corrected the number of failing units from 3 to 13 in their reports.

    >The failure has occurred a total of 23 times; the number of occurrences per unit
    ranges from once to 5 times.

    >There are 3 AM3352 manufacturing lots.
    All three lots are represented in the 220 total units, and also in the 13 failing units.

    -when the failure occurs, what is the application doing? Is it booting, running a certain application, etc?
    >All 220 units run the same application.
    An application program performing floating-point operations is executed periodically.

    -do the 3 boards fail all the time? Is the failure repeatable? Do the 3 boards work for a certain period of time, and then fail?
    >The timing of the failure depends on the unit.
    Some units failed about 2000 hours after system start,
    while the earliest failure was about 24 hours after system start.

    -Can you disable the Watchdog reset break before the reset occurs (maybe in the abort handler)?
    >The HW watchdog is disabled for debugging.

    If you need any other information to solve this issue, please let me know.
    I appreciate your support in advance.

    Best regards,
    Kanae

  • Hi Kanae, I'm thinking this may be some sort of power delivery issue, especially for the VDD_MPU rail. Can the customer monitor the VDD_MPU voltage and check for droops or other noise issues on the rail, especially around the failure point.
    -What power solution are they using (ie, PMIC or discrete solution)?
    -You said that they are operating at a fixed 1GHz. They need to ensure they are operating at 1.35V for the VDD_MPU rail and that the voltage remains within 4% tolerance.
    -as an experiment, they should try to increase the VDD_MPU rail slightly (maybe to 1.4V) to see if they observe a change in behavior.
    -they need to review the board layout, especially with respect to the VDD_MPU and VDD_CORE power planes, and ensure adequate routing, decoupling and bulk caps, and return paths. More info can be found here: processors.wiki.ti.com/.../Sitara_Layout_Checklist

    Regards,
    James
  • Hi James,

    Thank you for your prompt reply.
    Here are the customer's replies to your comments.

     

    [-What power solution are they using (ie, PMIC or discrete solution)?]
    ⇒ The customer's power solution is a PMIC: TPS65217CRSL.

    [Can the customer monitor the VDD_MPU voltage and check for droops or other noise issues on the rail,
    especially around the failure point.]
    ⇒ The VDD_MPU voltage is set to 1.325V following the data sheet,
     section 5.5 Recommended Operating Conditions:
     VDD_MPU (supply voltage range for MPU domain, Nitro) MIN 1.272V, NOM 1.325V, MAX 1.378V.
     Is 1.35V really the right setting for the VDD_MPU voltage, as you said?

     These CPU hang failures have occurred at fixed 600MHz and 300MHz as well as at fixed 1GHz.
     The customer has captured traces from the 4 failing units with the JTAG debugger,
     and the trace stops at the same "mcr p15" instruction each time.
     Based on the trace results, the customer thinks the failure does not depend on a voltage drop or on the clock frequency.

    [as an experiment, they should try to increase the VDD_MPU rail slightly (maybe to 1.4V) to see
     if they observe a change in behavior.]
    ⇒ The customer has tried changing the VDD_MPU rail from 1.325V to 1.375V and to 1.275V,
     operating for 72 hours in each case. However, there was no change in behavior.

    [they need to review the board layout, especially with respect to the VDD_MPU and VDD_CORE power planes, and ensure adequate routing,
     decoupling and bulk caps, and return paths. More info can be found here:
     processors.wiki.ti.com/.../Sitara_Layout_Checklist]
    ⇒ The customer checked the layout again, but found no points of concern.

     

    If there are any other particular points to check for this failure, please point them out.

    Best regards,

    Kanae

  • Hi James,

    I have posted my customer's comments to your check items.
    Are there any further points to check for this failure based on their comments?

    If you need the other data of my customer design, please let me know.

    Best regards,
    Kanae

  • Hi James,

    Could you please advise on my customer's comments?
    If you need any other information, please let me know.

    Best regards,
    Kanae

  • Kanae,

    In my opinion, based on your results, VDD_MPU is not the issue.  If there was a power issue with this rail, then the reliability would have improved when you raised the voltage and/or when you lowered the CPU clock.

    Now that said, one other critical voltage rail that I recommend checking is VDD_CORE.  This can have a big impact on reliably reading data from DDR.  A few recommendations:

    1. Monitor VDD_CORE with a scope.  For example, if you put a trigger at 1.06V, are you ever triggering?
    2. Can you raise VDD_CORE to 1.2V to see if it has a significant impact on failure rates?
    3. Have you run memtester at high and low temperature to make sure you can reliably read from your DDR?

    On a related note, this could be due to the DDR configuration or DDR layout just as easily.  The memtester experiment (#3) should help expose any such issues.
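    To make the intent of item 3 concrete, here is a rough user-space sketch of the kind of
    checks such a test performs (address-in-address and walking-ones data patterns). It is an
    illustration only, not the memtester source, and the real tool (which also locks the region
    with mlock and runs many more patterns) should still be used:

        #include <stdio.h>
        #include <stdlib.h>
        #include <stdint.h>

        #define WORDS (64UL * 1024 * 1024 / sizeof(uintptr_t))  /* 64 MB test region */

        int main(void)
        {
            uintptr_t *buf = malloc(WORDS * sizeof(*buf));
            size_t i, errors = 0;
            unsigned int bit;

            if (!buf)
                return 1;

            /* Address-in-address: store each word's own address, then verify. */
            for (i = 0; i < WORDS; i++)
                buf[i] = (uintptr_t)&buf[i];
            for (i = 0; i < WORDS; i++)
                if (buf[i] != (uintptr_t)&buf[i])
                    errors++;

            /* Walking ones: exercise each data bit across the whole region. */
            for (bit = 0; bit < 8 * sizeof(uintptr_t); bit++) {
                uintptr_t pattern = (uintptr_t)1 << bit;
                for (i = 0; i < WORDS; i++)
                    buf[i] = pattern;
                for (i = 0; i < WORDS; i++)
                    if (buf[i] != pattern)
                        errors++;
            }

            printf("%zu mismatches\n", errors);
            free(buf);
            return errors ? 1 : 0;
        }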

    Brad

  • Hi Brad,
    Thank you for your quick reply.
    Here are my customer's comments to your recommendations.

    ****************************************************************
    1. My customer has already done this. With the trigger set at 1.05V on VDD_CORE,
    no voltage drop was observed.

    2. My customer has already tried raising VDD_CORE to 1.15V (not 1.2V, though).
    It was operated in this state for 72 hours, and there was no impact on failure rates.

    3. My customer has already run memtester at +60℃ and -20℃ for 96 hours.
    The results showed no problems.

    As I posted, the failing units stop at "mcr p15" in the trace,
    so my customer does not think a voltage drop is related to this issue.

    =Trace result=
    | ldr r0,0xC05E33E0
    | ldr r0,[r0]
    | mcr p15,0x0,r0,c1,c0,0x0 ; p15,0,r0,c1,c0,0 (system control)

    ****************************************************************
    What else should he check out?

    Best regards,
    Kanae
  • Kanae, at this point i would recommend a device swap:

    -Take a processor from a working board and solder it to a non-working board, and test again
    -Take a processor from a non-working board and solder it to a working board, and test again

    This would expose possible device defects (if a bad processor fails on a good board), board issues (if a good processor fails on a bad board), or possibly assembly issues (if both work)

    Regards,
    James
  • Hi James,

    Thank you for your reply.
    My customer's comments on your recommendation are below.

    **********
    It isn't realistic to swap the device and test again,
    because there is a high possibility that the device would be damaged by rework, which would influence the test result.
    When the MPU hang (core locked) occurred, it stopped at the same instruction
    "mcr p15" on all 4 boards.
    If the MPU hang (core locked) were caused by the board or by assembly,
    the stop point would be random.
    **********

    Best regards,
    Kanae

  • Kanae,

    Generally issues that only impact certain boards are hardware-related. The "swap test" that James suggested is not easy.  You need to have the device reballed before you can put it onto another board.  You likely will need to pay an outside company to do that work.  However, there is much to be learned from this type of test and I highly recommend you do it.  I agree with James that it's the next logical step and would be needed to further narrow down and understand your issue.

    With respect to your software, Debian Linux Kernel 3.13.4, this is not a kernel version that was ever officially supported by TI.  Furthermore, I see many changes have been introduced in the specific file where you are hitting this issue (arch/arm/kernel/entry-armv.S).  From a software perspective, I would encourage you to use TI Processor SDK Linux 4.02 for best support.  That's the current release.

    Brad

    Kanae, how do they trigger and fill the trace buffer? Are they sure that the trace buffer is not just full, and the last mcr instruction is just the last instruction in the buffer? The processor should not just stop, it should at least go to an exception handler and loop there. As I stated, it looks like the log is just looping on an exception, so the actual error is buried somewhere earlier in the log, or was not captured because the trace buffer is too small.

    Regards,
    James

  • Hi Brad and James,
    Thank you for your replies.
    Here are comments from my customer to each reply.

    Regarding the swap test:
    The number of boards showing the MPU hang (core locked) is gradually increasing as time goes on.
    In other words, the boards that have not faced the MPU hang yet can be expected to hang
    eventually as well.

    Regarding the software:
    My customer can easily adopt the following kernels at this time:
     1) the "ti-linux-4.14.y" branch of the "git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git" repository
     2) the "processor-sdk-linux-01.00.01" branch of the "git://git.ti.com/processor-sdk/processor-sdk-linux.git" repository
    Which kernel is better to use?

    Regarding the trace buffer:
    When the on-chip trace buffer (ETB) becomes full, the oldest data is overwritten;
    in other words, it is a circular buffer. This is per the CoreSight ETMv3 specification.
    The core was unresponsive to the JTAG debugger, so my customer judged it to be an MPU hang.

    Best regards,
    Kanae

  • Kanae said:
    1) "git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git" repository's "ti-linux-4.14.y" Project

    This is a development branch as of now.  In June we'll be releasing Proc SDK Linux 5.00 which will be based on this branch.  We are not yet officially supporting this branch.

    Our current release is based on kernel 4.9.  You may want to look at the corresponding ti-linux-4.9.y branch

  • Hi Brad,
    Thank you for the quick reply!

    So my customer cannot select option 1), the "ti-linux-4.14.y" branch, at this time.
    Can they instead use option 2),
     the "processor-sdk-linux-01.00.01" branch of the "git://git.ti.com/processor-sdk/processor-sdk-linux.git" repository?
    If you have any concerns about using it, please let me know.

    Best regards,
    Kanae
  • Proc SDK 1.00 was a 3.14 kernel. As I mentioned the currently supported TI kernel is 4.9. We will migrate to 4.14 in June.

  • Hi Brad,
    Thank you for your reply.

    So you recommend neither of the kernels that my customer can easily select now;
    the only software you recommend is the currently supported TI 4.9 kernel.
    Is my understanding correct?

    Best regards,
    Kanae
  • Hi Brad,
    Thank you for your reply!

    Here are comments from my customer.
    The reason my customer tried to use those two kernels is as follows.

    1) Custom boards in another project that use the AM3352 with the "ti-linux-4.14.y" branch are already working.
    2) Porting to the "processor-sdk-linux-01.00.01" branch is easy, because it differs only slightly from kernel 3.13.4.

    My customer will try to work with the "ti-linux-4.9.y" branch, although that will take time.
    By the way, the "git://git.ti.com/processor-sdk/processor-sdk-linux.git" repository has both
    a "ti-linux-4.9.y" branch and a "processor-sdk-linux-04.02.00" branch.
    Which branch should be used?

    Best regards,
    Kanae

  • Kanae,

    The ti-linux-4.9.y branch would be the best option since it's more regularly updated.

    Brad
  • Part Number: AM3352

    Hi Sitara support team,

    My customer has an additional report and question.

    **********************************************************************************

    About the "AM3352 CPU hang-up" problem, our verification results show that the CPU hang occurs
    when the Linux kernel HIGHMEM option is enabled.

    [HIGHMEM verification results, Linux kernel version 3.13.4]

    (1) DRAM 1 GB, HIGHMEM enabled    ---> CPU hang occurs
    (2) DRAM 1 GB, HIGHMEM disabled   ---> no occurrence
    (3) DRAM 512 MB                   ---> no occurrence

    When 1 GB of DRAM is fitted, the area beyond 740 MB (LOWMEM) becomes the HIGHMEM area,
    and the Linux memory management method for it differs from that of the LOWMEM area.

    If the Linux kernel HIGHMEM option is enabled in order to use this HIGHMEM area, the CPU hang occurs.
    If HIGHMEM is disabled, the CPU hang does not occur.
    Also, with DRAM 512 MB it does not occur, because the HIGHMEM area is not used.

    From these results, it seems that the Linux memory management function, including HIGHMEM,
    is involved in the CPU hang issue.

    - Cortex-A8 processor revision: r3p2 (0x413fc082)

    The JTAG trace log at the point of the CPU hang always ends at a read or write instruction
    to a coprocessor register.

    It seems there may be a problem involving the MMU and L1/L2 caches of the AM3352 Cortex-A8
    that affects the coprocessor accesses.

    We also confirmed that the CPU hang occurs with processor-sdk-linux-01.00.01 (kernel 3.14.43).

    [Question]
    Assuming that the Linux kernel memory management function, including HIGHMEM,
    is causing the AM3352 hang issue, could you tell us the possible causes?

    **************************************************************************************************
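    For background on what differs when HIGHMEM is used, here is a minimal kernel-module sketch.
    It is my own illustration, not the customer's code (the highmem_demo_* names are hypothetical),
    but kmap()/kunmap() and alloc_page() are the standard kernel interfaces involved:

        #include <linux/module.h>
        #include <linux/gfp.h>
        #include <linux/highmem.h>
        #include <linux/string.h>

        static int __init highmem_demo_init(void)
        {
            /* GFP_HIGHUSER allows the page to come from the HIGHMEM zone. */
            struct page *page = alloc_page(GFP_HIGHUSER);
            void *vaddr;

            if (!page)
                return -ENOMEM;

            /* A HIGHMEM page has no permanent kernel mapping, so kmap() must create a
               temporary one (writing a kernel PTE) before the kernel can touch the page;
               LOWMEM pages skip this step entirely. */
            vaddr = kmap(page);
            memset(vaddr, 0xA5, PAGE_SIZE);
            kunmap(page);

            __free_page(page);
            return 0;
        }

        static void __exit highmem_demo_exit(void)
        {
        }

        module_init(highmem_demo_init);
        module_exit(highmem_demo_exit);
        MODULE_LICENSE("GPL");

    The temporary-mapping path updates kernel page tables and performs the related cache/TLB
    maintenance, which is why the customer suspects the MMU and L1/L2 cache behaviour on this
    code path.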

     

    Best regards,

    Kanae

  • Hi Brad,

    Could you please give me your comments on the following question?
    If I should make a new post for this, please let me know.

    [Question]
    Assuming that the Linux kernel memory management function, including HIGHMEM,
    is causing the AM3352 hang issue, could you tell us the possible causes?

    Best regards,
    Kanae
  • I don’t have expertise in the ARM HIGHMEM implementation. I recommend testing to see if the issue can be reproduced on TI hardware using Processor SDK 1.00 (since that is similar to your current software) and then check if moving to the latest SDK 5.00 (kernel 4.14) fixes the issue.

    If it fixes the issue you could do some testing to narrow down what exactly fixed it and then backport to your older kernel. Or perhaps better yet you could update to the latest on your board too.

    If the issue is still present on the TI board in the latest kernel then you have a great test case for TI to better support you.

    Kanae, please start a new thread and include a link back to this one. I think Brad has some good suggestions to try, and a new thread will help get more priority on this issue.
    Thanks,
    James
  • Hi Brad and James,

    Thanks for your support!
    I have just started a new thread.
    e2e.ti.com/.../725328

    My customer has new questions related this issue.
    Could you please continue to give us your support?

    Best regards,
    Kanae
  • Kanae,

    Could you try this testing with the Filesystem provided with the SDK? This would help isolate the issue to something in their custom filesystem.

    Thank you,
    Ron
  • Hi Ron,

    Thank you for your reply!

    I will move to the new thread.

    Best regards,
    Kanae