
AM3352: MPU hangs

Part Number: AM3352

 

Hi Sitara support team,

My customer is seeing resets of unknown origin on their mass-produced board that uses the AM3352.

(Failure state details)

 The AM3352 CPU hangs up.

 When the HW watchdog is enabled, the board is reset by a watchdog timeout.

Here is an ETB (Embedded Trace Buffer) trace log, acquired via JTAG from the AM3352 just before the failure state described above.

Trace_log_20171204.txt

I would like to know whether the cause of this failure can be estimated from the attached trace log.

■ JTAG debugging environment
The JTAG debugger is connected to the custom system board.
The ARM ETM (Embedded Trace Macrocell) function is enabled, and it performs real-time tracing of ARM instructions.
[custom system board] --- <JTAG connection> --- [TRACE32 Lauterbach (ARM-ETM Trace)] ----- [PC]

■ Trace log result summary
The trace stops just before the failure state, with the following processing sequence:
(1) Undefined instruction exception (VFP)
  ↓
(2) Processing of a userland process
  ↓
(3) Data abort exception
Trace logs were acquired for a total of 4 failure occurrences, and every log stops at the same processing.

Please also advise me on an effective way to investigate this failure.

Best regards,
Kanae

 

  • Hi,

    What software is this? Which version? Please provide more details.
  • Hi Biser,

     

     

    Thank you for the quick reply.

    Customer's system software is "Linux OS; Base Distribution, Debian Linux Kernel 3.13.4".

    If you need other information, please let me know.

     

    Best regards,

    Kanae

  • Hi Biser,

    Here is the additional information.
    * Temperature at the time of failure: approximately 20-30 degrees
    * CPU frequency: 1 GHz (fixed)
    The customer has confirmed that the same failure occurs even with the frequency fixed at 300 MHz or 600 MHz.
    * Total units showing the failure: 3 units
    I am currently checking the total number of production units and the lot information.

    My customer needs any available clues to investigate this failure.
    I would appreciate the Sitara support team's help, based on your experience with such issues.

    Best regards,
    Kanae
  • I am sorry, TI does not support Debian Linux. Please advise your customer to check if the issue exists with AM335x Processor SDK: software-dl.ti.com/.../index_FDS.html
  • Part Number: AM3357

    Tool/software: TI-RTOS

    Dear Sitara support Team,

    Regarding the MPU hang my customer is facing on their mass-production board, he would like to ask about the AM335x program at the ARM assembly level.

    Q1.
    Checking the last processing in every trace log (log_file) acquired at the failure state,
    a value is written to a CP15 system control register, and this appears to put the MPU into the hung state.
    Is it possible that this processing puts the MPU into the hung state?

    | ldr r0,0xC05E33E0
    | ldr r0,[r0]
    | mcr p15,0x0,r0,c1,c0,0x0 ; p15,0,r0,c1,c0,0 (system control)
    --------------------

    Note: The MPU hung state is defined as a state in which the MPU cannot be controlled at all by the JTAG debugger,
    and its registers and memory can no longer be read.
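    For reference, here is a minimal C sketch of what this read-then-write of the CP15 System
    Control Register (SCTLR) corresponds to. This is my own illustration, not the customer's
    kernel code; the helper names read_sctlr/write_sctlr are hypothetical, and it assumes a GCC
    toolchain targeting the Cortex-A8 with the code running in a privileged mode:

        /* Illustration only: read-modify-write of the CP15 SCTLR (c1, c0, 0),
           matching the mrc/mcr encoding seen in the trace. */
        static inline unsigned long read_sctlr(void)
        {
            unsigned long val;
            asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(val));
            return val;
        }

        static inline void write_sctlr(unsigned long val)
        {
            asm volatile("mcr p15, 0, %0, c1, c0, 0" : : "r"(val) : "memory");
            /* The ARM architecture requires an ISB after an SCTLR write so that
               changes (e.g. MMU/cache enable bits) affect the following code. */
            asm volatile("isb" ::: "memory");
        }

    Because an SCTLR write can change the MMU and cache enables, the instructions immediately
    after it execute in a different translation/caching context, which is why the surrounding
    code and barriers matter here.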

    Q2.
    If the possibility in Q1 exists, could you provide any advice on the conditions under which it could occur?

    Q3.
    During processing like this, what kinds of operations should be avoided?
    For example, operations the CPU does not expect, or functions that must not run at the same time.

    Q4.
    Could you suggest what kinds of factors could make the MPU software stop suddenly
    at the same processing each time, given the state described above?

    Q5.
    My customer also suspects a hang inside the VFP coprocessor.
    However, he does not know how to verify this.
    Do you have any ideas on how to check for a hang inside the VFP coprocessor?
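    As one possible starting point (a hedged sketch of my own, not a confirmed procedure, and
    the helper name dump_vfp_state is hypothetical), the VFP state can be inspected from
    privileged kernel code by reading FPEXC and FPSCR, assuming the kernel and toolchain are
    built with VFP support (e.g. -mfpu=vfpv3):

        #include <linux/printk.h>

        /* Illustration only: dump basic VFP state from privileged (kernel) code.
           FPEXC is not readable from user mode; FPSCR traps if the VFP is disabled. */
        static void dump_vfp_state(void)
        {
            unsigned long fpexc, fpscr = 0;

            asm volatile("vmrs %0, fpexc" : "=r"(fpexc));
            if (fpexc & (1UL << 30))                        /* EN bit: VFP enabled */
                asm volatile("vmrs %0, fpscr" : "=r"(fpscr));

            /* FPEXC bit 30 (EN) = VFP enabled, bit 31 (EX) = exceptional state pending;
               FPSCR carries the cumulative floating-point exception flags. */
            pr_info("FPEXC=%08lx EN=%lu EX=%lu FPSCR=%08lx\n",
                    fpexc, (fpexc >> 30) & 1UL, (fpexc >> 31) & 1UL, fpscr);
        }

    If FPEXC.EX remains set, the kernel's undefined-instruction (VFP bounce) handling has not
    completed, which may correlate with the undefined instruction exception seen at the start
    of the trace.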

    Best regards,
    Kanae

  • I have asked the factory team to look at this. They will respond here.
  • The attached log seems to have captured several loops of an abort handler, so I don't think it shows anything meaningful. Can you give us more context around the failure:

    -you say 3 boards fail, out of how many boards?
    -when the failure occurs, what is the application doing? Is it booting, running a certain application, etc?
    -do the 3 boards fail all the time? Is the failure repeatable? Do the 3 boards work for a certain period of time, and then fail?
    -Can you disable the Watchdog reset and break before the reset occurs (maybe in the abort handler)?

    -James

  • Hi JJD,
    Thank you for your reply.
    Here are answers from my customer to your questions.

    -you say 3 boards fail, out of how many boards?
    >MPU hang-ups have occurred on 13 units out of 220 units as of Dec-13-2017.
    The customer has corrected the number of failing units from 3 to 13 in their reports.

    >The failure has occurred a total of 23 times; the number of occurrences per unit
    ranges from once to 5 times.

    >There are 3 AM3352 manufacturing lots.
    All three lots are represented in the 220 total units, and also in the 13 failing units.

    -when the failure occurs, what is the application doing? Is it booting, running a certain application, etc?
    >All 220 units run the same application.
    An application program performing floating-point operations is executed periodically.

    -do the 3 boards fail all the time? Is the failure repeatable? Do the 3 boards work for a certain period of time, and then fail?
    >The timing of the failure depends on the unit.
    Some units failed about 2000 hours after system start,
    while the earliest failure was about 24 hours after system start.

    -Can you disable the Watchdog reset break before the reset occurs (maybe in the abort handler)?
    >The HW watchdog is disabled for debugging.

    If you need any other information to solve this issue, please let me know.
    I appreciate your support in advance.

    Best regards,
    Kanae

  • Hi Kanae, I'm thinking this may be some sort of power delivery issue, especially for the VDD_MPU rail. Can the customer monitor the VDD_MPU voltage and check for droops or other noise issues on the rail, especially around the failure point.
    -What power solution are they using (ie, PMIC or discrete solution)?
    -You said that they are operating at a fixed 1GHz. They need to ensure they are operating at 1.35V for the VDD_MPU rail and that the voltage remains within 4% tolerance.
    -as an experiment, they should try to increase the VDD_MPU rail slightly (maybe to 1.4V) to see if they observe a change in behavior.
    -they need to review the board layout, especially with respect to the VDD_MPU and VDD_CORE power planes, and ensure adequate routing, decoupling and bulk caps, and return paths. More info can be found here: processors.wiki.ti.com/.../Sitara_Layout_Checklist

    Regards,
    James
  • Hi James,

    Thank you for your prompt reply.
    Here are the customer's replies to your comments.

     

    [-What power solution are they using (ie, PMIC or discrete solution)?]
    ⇒ The customer's power solution is a PMIC: TPS65217CRSL.

    [Can the customer monitor the VDD_MPU voltage and check for droops or other noise issues on the rail,
    especially around the failure point.]
    ⇒ The VDD_MPU voltage is set to 1.325V following the data sheet,
     section 5.5 Recommended Operating Conditions:
     VDD_MPU (supply voltage range for MPU domain, Nitro) MIN 1.272V, NOM 1.325V, MAX 1.378V.
     Is 1.35V really the right setting for the VDD_MPU voltage, as you said?

     These CPU hang failures have occurred at fixed 600MHz and 300MHz as well as at fixed 1GHz.
     The customer has captured traces from the 4 failing units with the JTAG debugger,
     and the trace stops at the same "mcr p15" instruction each time.
     Based on the trace results, the customer thinks the failure does not depend on a voltage drop or on the clock frequency.

    [as an experiment, they should try to increase the VDD_MPU rail slightly (maybe to 1.4V) to see
     if they observe a change in behavior.]
    ⇒ The customer has tried changing the VDD_MPU rail from 1.325V to 1.375V and to 1.275V,
     operating for 72 hours in each case. However, there was no change in behavior.

    [they need to review the board layout, especially with respect to the VDD_MPU and VDD_CORE power planes, and ensure adequate routing,
     decoupling and bulk caps, and return paths. More info can be found here:
     processors.wiki.ti.com/.../Sitara_Layout_Checklist]
    ⇒ The customer checked the layout again, but found no points of concern.

     

    If there are any other particular points to check for this failure, please point them out.

    Best regards,

    Kanae

  • Hi James,

    I have posted my customer's comments to your check items.
    Are there any further points to check for this failure based on their comments?

    If you need the other data of my customer design, please let me know.

    Best regards,
    Kanae

  • Hi James,

    Could you please advise on my customer's comments?
    If you need any other information, please let me know.

    Best regards,
    Kanae

  • Kanae,

    In my opinion, based on your results, VDD_MPU is not the issue.  If there was a power issue with this rail, then the reliability would have improved when you raised the voltage and/or when you lowered the CPU clock.

    Now that said, one other critical voltage rail that I recommend checking is VDD_CORE.  This can have a big impact on reliably reading data from DDR.  A few recommendations:

    1. Monitor VDD_CORE with a scope.  For example, if you put a trigger at 1.06V, are you ever triggering?
    2. Can you raise VDD_CORE to 1.2V to see if it has a significant impact on failure rates?
    3. Have you run memtester at high and low temperature to make sure you can reliably read from your DDR?

    On a related note, this could be due to the DDR configuration or DDR layout just as easily.  The memtester experiment (#3) should help expose any such issues.
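    To make the intent of item 3 concrete, here is a rough user-space sketch of the kind of
    checks such a test performs (address-in-address and walking-ones data patterns). It is an
    illustration only, not the memtester source, and the real tool (which also locks the region
    with mlock and runs many more patterns) should still be used:

        #include <stdio.h>
        #include <stdlib.h>
        #include <stdint.h>

        #define WORDS (64UL * 1024 * 1024 / sizeof(uintptr_t))  /* 64 MB test region */

        int main(void)
        {
            uintptr_t *buf = malloc(WORDS * sizeof(*buf));
            size_t i, errors = 0;
            unsigned int bit;

            if (!buf)
                return 1;

            /* Address-in-address: store each word's own address, then verify. */
            for (i = 0; i < WORDS; i++)
                buf[i] = (uintptr_t)&buf[i];
            for (i = 0; i < WORDS; i++)
                if (buf[i] != (uintptr_t)&buf[i])
                    errors++;

            /* Walking ones: exercise each data bit across the whole region. */
            for (bit = 0; bit < 8 * sizeof(uintptr_t); bit++) {
                uintptr_t pattern = (uintptr_t)1 << bit;
                for (i = 0; i < WORDS; i++)
                    buf[i] = pattern;
                for (i = 0; i < WORDS; i++)
                    if (buf[i] != pattern)
                        errors++;
            }

            printf("%zu mismatches\n", errors);
            free(buf);
            return errors ? 1 : 0;
        }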

    Brad

  • Hi Brad,
    Thank you for your quick reply.
    Here are my customer's comments to your recommendations.

    ****************************************************************
    1. My customer has already done this. With the trigger set at 1.05V on VDD_CORE,
    no voltage drop was observed.

    2. My customer has already tried raising VDD_CORE to 1.15V (not 1.2V, though).
    It was operated in this state for 72 hours, and there was no impact on failure rates.

    3. My customer has already run memtester at +60℃ and -20℃ for 96 hours.
    The results showed no problems.

    As I posted, the failing units stop at "mcr p15" in the trace,
    so my customer does not think a voltage drop is related to this issue.

    =Trace result=
    | ldr r0,0xC05E33E0
    | ldr r0,[r0]
    | mcr p15,0x0,r0,c1,c0,0x0 ; p15,0,r0,c1,c0,0 (system control)

    ****************************************************************
    What else should he check out?

    Best regards,
    Kanae
  • Kanae, at this point i would recommend a device swap:

    -Take a processor from a working board and solder it to a non-working board, and test again
    -Take a processor from a non-working board and solder it to a working board, and test again

    This would expose possible device defects (if a bad processor fails on a good board), board issues (if a good processor fails on a bad board), or possibly assembly issues (if both work)

    Regards,
    James
  • Hi James,

    Thank you for your reply.
    My customer's comments on your recommendation are below.

    **********
    It isn't realistic to swap the device and test again,
    because there is a high possibility that the device would be damaged by rework, which would influence the test result.
    When the MPU hang (core locked) occurred, it stopped at the same instruction
    "mcr p15" on all 4 boards.
    If the MPU hang (core locked) were caused by the board or by assembly,
    the stop point would be random.
    **********

    Best regards,
    Kanae

  • Kanae,

    Generally issues that only impact certain boards are hardware-related. The "swap test" that James suggested is not easy.  You need to have the device reballed before you can put it onto another board.  You likely will need to pay an outside company to do that work.  However, there is much to be learned from this type of test and I highly recommend you do it.  I agree with James that it's the next logical step and would be needed to further narrow down and understand your issue.

    With respect to your software, Debian Linux Kernel 3.13.4, this is not a kernel version that was ever officially supported by TI.  Furthermore, I see many changes have been introduced in the specific file where you are hitting this issue (arch/arm/kernel/entry-armv.S).  From a software perspective, I would encourage you to use TI Processor SDK Linux 4.02 for best support.  That's the current release.

    Brad

    Kanae, how do they trigger and fill the trace buffer? Are they sure that the trace buffer is not just full, and the last mcr instruction is just the last instruction in the buffer? The processor should not just stop, it should at least go to an exception handler and loop there. As I stated, it looks like the log is just looping on an exception, so the actual error is buried somewhere earlier in the log, or was not captured because the trace buffer is too small.

    Regards,
    James

  • Hi Brad and James,
    Thank you for your replies.
    Here are comments from my customer to each reply.

    Regarding the swap test:
    The number of boards showing the MPU hang (core locked) is gradually increasing as time goes on.
    In other words, the boards that have not faced the MPU hang yet can be expected to hang
    eventually as well.

    Regarding the software:
    My customer can easily adopt the following kernels at this time:
     1) the "ti-linux-4.14.y" branch of the "git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git" repository
     2) the "processor-sdk-linux-01.00.01" branch of the "git://git.ti.com/processor-sdk/processor-sdk-linux.git" repository
    Which kernel is better to use?

    Regarding the trace buffer:
    When the on-chip trace buffer (ETB) becomes full, the oldest data is overwritten;
    in other words, it is a circular buffer. This is per the CoreSight ETMv3 specification.
    The core was unresponsive to the JTAG debugger, so my customer judged it to be an MPU hang.

    Best regards,
    Kanae

  • Kanae said:
    1) "git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git" repository's "ti-linux-4.14.y" Project

    This is a development branch as of now.  In June we'll be releasing Proc SDK Linux 5.00 which will be based on this branch.  We are not yet officially supporting this branch.

    Our current release is based on kernel 4.9.  You may want to look at the corresponding ti-linux-4.9.y branch

  • Hi Brad,
    Thank you for the quick reply!

    So my customer cannot select option 1), the "ti-linux-4.14.y" branch, at this time.
    Can they instead use option 2),
     the "processor-sdk-linux-01.00.01" branch of the "git://git.ti.com/processor-sdk/processor-sdk-linux.git" repository?
    If you have any concerns about using it, please let me know.

    Best regards,
    Kanae
  • Proc SDK 1.00 was a 3.14 kernel. As I mentioned the currently supported TI kernel is 4.9. We will migrate to 4.14 in June.

  • Hi Brad,
    Thank you for your reply.

    So you recommend neither of the kernels that my customer can easily select now;
    the only software you recommend is the currently supported TI 4.9 kernel.
    Is my understanding correct?

    Best regards,
    Kanae
  • Hi Brad,
    Thank you for your reply!

    Here are comments from my customer.
    The reason my customer tried to use those two kernels is as follows.

    1) Custom boards in another project that use the AM3352 with the "ti-linux-4.14.y" branch are already working.
    2) Porting to the "processor-sdk-linux-01.00.01" branch is easy, because it differs only slightly from kernel 3.13.4.

    My customer will try to work with the "ti-linux-4.9.y" branch, although that will take time.
    By the way, the "git://git.ti.com/processor-sdk/processor-sdk-linux.git" repository has both
    a "ti-linux-4.9.y" branch and a "processor-sdk-linux-04.02.00" branch.
    Which branch should be used?

    Best regards,
    Kanae

  • Kanae,

    The ti-linux-4.9.y branch would be the best option since it's more regularly updated.

    Brad
  • Part Number: AM3352

    Hi Sitara support team,

    My customer has an additional report and question.

    **********************************************************************************

    About the "AM3352 CPU hang-up" problem, our verification results show that the CPU hang occurs
    when the Linux kernel HIGHMEM option is enabled.

    [HIGHMEM verification results, Linux kernel version 3.13.4]

    (1) DRAM 1 GB, HIGHMEM enabled    ---> CPU hang occurs
    (2) DRAM 1 GB, HIGHMEM disabled   ---> no occurrence
    (3) DRAM 512 MB                   ---> no occurrence

    When 1 GB of DRAM is fitted, the area beyond 740 MB (LOWMEM) becomes the HIGHMEM area,
    and the Linux memory management method for it differs from that of the LOWMEM area.

    If the Linux kernel HIGHMEM option is enabled in order to use this HIGHMEM area, the CPU hang occurs.
    If HIGHMEM is disabled, the CPU hang does not occur.
    Also, with DRAM 512 MB it does not occur, because the HIGHMEM area is not used.

    From these results, it seems that the Linux memory management function, including HIGHMEM,
    is involved in the CPU hang issue.

    - Cortex-A8 processor revision: r3p2 (0x413fc082)

    The JTAG trace log at the point of the CPU hang always ends at a read or write instruction
    to a coprocessor register.

    It seems there may be a problem involving the MMU and L1/L2 caches of the AM3352 Cortex-A8
    that affects the coprocessor accesses.

    We also confirmed that the CPU hang occurs with processor-sdk-linux-01.00.01 (kernel 3.14.43).

    [Question]
    Assuming that the Linux kernel memory management function, including HIGHMEM,
    is causing the AM3352 hang issue, could you tell us the possible causes?

    **************************************************************************************************
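    For background on what differs when HIGHMEM is used, here is a minimal kernel-module sketch.
    It is my own illustration, not the customer's code (the highmem_demo_* names are hypothetical),
    but kmap()/kunmap() and alloc_page() are the standard kernel interfaces involved:

        #include <linux/module.h>
        #include <linux/gfp.h>
        #include <linux/highmem.h>
        #include <linux/string.h>

        static int __init highmem_demo_init(void)
        {
            /* GFP_HIGHUSER allows the page to come from the HIGHMEM zone. */
            struct page *page = alloc_page(GFP_HIGHUSER);
            void *vaddr;

            if (!page)
                return -ENOMEM;

            /* A HIGHMEM page has no permanent kernel mapping, so kmap() must create a
               temporary one (writing a kernel PTE) before the kernel can touch the page;
               LOWMEM pages skip this step entirely. */
            vaddr = kmap(page);
            memset(vaddr, 0xA5, PAGE_SIZE);
            kunmap(page);

            __free_page(page);
            return 0;
        }

        static void __exit highmem_demo_exit(void)
        {
        }

        module_init(highmem_demo_init);
        module_exit(highmem_demo_exit);
        MODULE_LICENSE("GPL");

    The temporary-mapping path updates kernel page tables and performs the related cache/TLB
    maintenance, which is why the customer suspects the MMU and L1/L2 cache behaviour on this
    code path.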

     

    Best regards,

    Kanae

  • Hi Brad,

    Could you please give me your comments on the following question?
    If I should make a new post for this, please let me know.

    [Question]
    Assuming that the Linux kernel memory management function, including HIGHMEM,
    is causing the AM3352 hang issue, could you tell us the possible causes?

    Best regards,
    Kanae
  • I don’t have expertise in the ARM HIGHMEM implementation. I recommend testing to see if the issue can be reproduced on TI hardware using Processor SDK 1.00 (since that is similar to your current software) and then check if moving to the latest SDK 5.00 (kernel 4.14) fixes the issue.

    If it fixes the issue you could do some testing to narrow down what exactly fixed it and then backport to your older kernel. Or perhaps better yet you could update to the latest on your board too.

    If the issue is still present on the TI board in the latest kernel then you have a great test case for TI to better support you.

    Kanae, please start a new thread and include a link back to this one. I think Brad has some good suggestions to try, and a new thread will help get more priority on this issue.
    Thanks,
    James
  • Hi Brad and James,

    Thanks for your support!
    I have just started a new thread.
    e2e.ti.com/.../725328

    My customer has new questions related this issue.
    Could you please continue to give us your support?

    Best regards,
    Kanae
  • Kanae,

    Could you try this testing with the Filesystem provided with the SDK? This would help isolate the issue to something in their custom filesystem.

    Thank you,
    Ron
  • Hi Ron,

    Thank you for your reply!

    I will move to the new thread.

    Best regards,
    Kanae