AM623: thermal issues

Part Number: AM623

Hi,
During qualification, the customer observed thermal issues on 6 of 20 devices. Two of the six fail at 60°C ambient temperature, four at 80°C. The other 14 good devices reach up to 100°C and then perform a controlled shutdown due to overtemperature.
The failure appears as an abrupt system crash, without any message from Linux. The module then attempts a restart, but fails due to the missing reset signal for the e.MMC (design error on our part). The module starts normally again after a power cycle.

Failure analysis:
- During failure 3 signals are set at the MPU:
  * RESETSTATZ (WARM_RST) set to LOW
  * MCU_ERROR set to LOW
  * DDR0_RESET0_N (DDR_RST) set to LOW
1.jpg

- RESETSTATZ is driven by the processor. The peripheral circuitry does not perform a reset (PORZ_OUT, RESET_REQZ and MCU_PORZ remain high).
- The supply voltages are all present and do not show a dip or similar.
- The time of failure can be shifted by the voltage applied to VDDA_PLL2
  * Reducing the voltage (only on this pin) from 1.80 V to 1.70 V moves the failure to a significantly higher temperature (about 30°C more): the failure then occurs at an ambient temperature of about 90-95°C instead of 60°C. Conversely, raising the voltage makes the failure occur earlier: at 1.85 V, the module fails well before 60°C ambient temperature.

The current schematic is attached.

Regards, Holger

  • Hello Holger,

    Thank you for the schematic review request.

    Can you clarify whether the SOM sits on a carrier board?

    Can you help us understand how the SOM is mounted on the carrier board, with some pictures?

    Regards,

    Sreenivasa 

  • Hello Holger,

    I have a few queries from the expert:

    Did the customer do thermal analysis of their system?  If so what die temp was expected for these ambient temps?  Ambient temp alone does not provide any information.  Customer may need to measure package temp and power consumption to get a good estimate of die temperature.

    Regards,

    Sreenivasa

  • Hi,
    I will ask for this information.

    Is there any SW (Linux) or HW (AM623) thermal shutdown process?

    Regards, Holger

  • Hi Sreenivasa,
    I got feedback. The package temperature wasn't measured. They read the temperature in Linux, and it was always 82-83°C when the failure happened.

  • Hello Holger,

    Thank you.

    > They read the temperature in Linux and they were always 82-83°C when the failure happens.

    I assume these are for the failed boards.

    Can you please ask the customer to read the temperature on the boards that did not fail?

    The temperature sensor accuracy is +/-5°C.

    Regards,

    Sreenivasa

  • Hello Holger,

    > Is there any SW (Linux) or HW (AM623) thermal shutdown process?

    After power-up customer will be able to set the thermal alert temperature.

    Not sure if this is the question.

    Regards,

    Sreenivasa

  • Hi Sreenivasa,
    the question was whether there is a default SW thermal shutdown in Linux.

    Regards, Holger

  • Hello Holger,

    If the customer is using Linux, a thermal shutdown is enabled after power-up.

    The value is set depending on the temperature grade of the processor used.

    Can you please check whether the customer is modifying the temperature alert threshold?

    Regards,

    Sreenivasa

  • Hello Holger,

    I consolidated inputs received from different experts:

    Did the customer do thermal analysis of their system? If so what die temp was expected for these ambient temps? Ambient temp alone is not a useful datapoint. They may need to measure package temp and power consumption to get a good estimate of die temperature.

    Some additional questions regarding the scope of thermal analysis done on their system:
    1. What was the thermal model used for AM623?
    2. What was the model used for other components in their system besides AM623?


    Ambient temperature alone doesn't tell us much. Do you know how the customer is configuring the temperature thresholds? Configuring the thresholds in the device tree (DT) might not be sufficient because the Linux driver hardcodes the reset temperature at a different level. This means a software patch is sometimes needed to adjust the thermal shutdown thresholds. For this kind of thermal analysis, in addition to the temperature threshold configuration, I would also ask the customer to provide the following:
    • Schematic and layout (if available in .brd format)
    • Copy of the power estimation tool filled out with their application use case/conditions.
    • Power measurements. We need the actual measured power consumption of the AM62x supplies and the total PCB power consumption.
    • Temperature logs until the SoC shut down (cat /sys/class/thermal/thermal_zone*/temp)
    • Thermal/IR image of the PCB.
    • Thermal/Simulation results (if any).
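
The temperature log requested above could be captured with a small script along these lines. This is a minimal sketch, not a TI tool: it assumes the standard Linux sysfs layout (`/sys/class/thermal/thermal_zone*/temp`, values in millidegrees Celsius), and the function names and the default one-second interval are illustrative choices. Each sample is flushed and synced right away, because the failure is an abrupt crash and buffered lines would otherwise be lost.

```python
import glob
import os
import time
from pathlib import Path


def read_zone_temps(base="/sys/class/thermal"):
    """Return {zone_name: temperature in degC} for each thermal zone under base."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(base, "thermal_zone*", "temp"))):
        zone = Path(path).parent.name
        try:
            # sysfs reports millidegrees Celsius, e.g. "83000" for 83 degC
            temps[zone] = int(Path(path).read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # skip zones that cannot be read right now
    return temps


def format_sample(timestamp, temps):
    """Format one log line: '<seconds> zone=temp zone=temp ...'."""
    fields = [f"{timestamp:.1f}"]
    fields += [f"{zone}={temp:.1f}" for zone, temp in sorted(temps.items())]
    return " ".join(fields)


def log_temps(logfile, samples, interval_s=1.0, base="/sys/class/thermal"):
    """Append `samples` readings to logfile, syncing after every line."""
    with open(logfile, "a") as f:
        for _ in range(samples):
            f.write(format_sample(time.time(), read_zone_temps(base)) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the line to storage before a possible crash
            time.sleep(interval_s)
```

For example, `log_temps("tj.log", samples=3600)` would record one hour at one-second resolution; the last lines before the crash give the Tj at which the reset occurred. The log file has to live on storage that survives the power cycle, or the lines can be printed to the serial console instead.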

    Regards,

    Sreenivasa

  • Hi Sreenivasa,
    in which file do I find the temperature alert?

    Regards, Holger

  • Hello Holger,

    Thank you.

    I assume this is a software query.

    Can you please check with customer regarding the software development environment they are using.

    Regards,

    Sreenivasa

  • Hi,
    the SDK is equivalent to 11.01. The TI kernel is as of 11.02.08 (cicd.scarthgap.202512041635).
    They also tested on the test board with our Linux kernel from NXP. This kernel is based on NXP's LF-6.12.49-2.2.0, which they patched to 6.12.77. However, there is no difference in the error pattern.
    The error does not occur only if the module is stopped in the boot loader (and no Linux is running). Conversely, this means that the trigger occurs only when Linux is running, whether it's the TI kernel or our patched NXP kernel.

    Regards, Holger

  • Hello Holger,

    Thank you.

    Let me assign the query to our software expert.

    Regards,

    Sreenivasa

  • Hello Holger,

    Checking if you have had a chance to review the below:

    Some additional questions regarding the scope of thermal analysis done on their system:
    1. What was the thermal model used for AM623?
    2. What was the model used for other components in their system besides AM623?

    Regards,

    Sreenivasa

  • Hello Sreenivasa,
    I will ask for the thermal model. Please ask the SW experts in which file customer should look for the shutdown.

    Regards, Holger

  • Hi Holger,

    I'll provide a response by the end of day today.

    Thanks,

    Anshu

  • Hi Holger,

    There are two mechanisms that can trigger a thermal shutdown.

    Linux Based Shutdown

    https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/arch/arm64/boot/dts/ti/k3-am62-thermal.dtsi?h=ti-linux-6.12.y-cicd

    We can set trip points at different temperatures. When Tj approaches a high temperature, the software can react accordingly. In this case, it will suspend the OS when Tj reaches 105°C.

    What happens when we reach the shutdown temperature in this case? The running drivers and processes will be suspended. Linux will then be off, but the SoC as a whole will not be shut down. Linux logs in dmesg will show this process.

    HW Based Shutdown

    https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/drivers/thermal/k3_j72xx_bandgap.c?h=ti-linux-6.12.y-cicd#n357

    This Linux driver enables the thermal shutdown mechanism for the SoC by programming the register. When Tj reaches the thermal shutdown point, it will shut down the whole SoC until it cools down.

    What happens when we reach the shutdown temperature in this case? The SoC will stop abruptly. There will not be Linux logs that show its shutdown. Instead, you can check the Reset Reason register to check what caused the reset which should show the reason as thermal shutdown.

    These two mechanisms usually work in tandem. So the Linux shutdown should trigger first to suspend the OS, then the HW shutdown should trigger for the rest of the SoC.


    Best Regards,

    Anshu

  • Hi Anshu,

    they set following thermal trip points in Linux:

    Thermal Zone 0:

    • Trip_point_0_hyst    —> 2000
    • Trip_point_0_temp   —> 95000
    • Trip_point_0_type    —> passive

    • Trip_point_1_hyst    —> 2000
    • Trip_point_1_temp   —> 125000
    • Trip_point_1_type    —> critical

    Thermal Zone 1:

    • Trip_point_0_hyst    —> 2000
    • Trip_point_0_temp   —> 95000
    • Trip_point_0_type    —> passive

     

    • Trip_point_1_hyst    —> 2000
    • Trip_point_1_temp   —> 125000
    • Trip_point_1_type    —> critical
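
To put these values next to the observed crash temperature: sysfs trip points are in millidegrees Celsius, so 95000 is 95°C and 125000 is 125°C. The sketch below computes the margin between a measured Tj and each trip point. The 83°C figure is the Linux reading reported earlier in this thread, and the function name is only illustrative.

```python
def trip_margins(tj_degc, trips_mdegc):
    """Return {trip_type: margin in degC} between each trip point and tj_degc."""
    return {name: mdegc / 1000.0 - tj_degc for name, mdegc in trips_mdegc.items()}


# Trip points as listed above (identical for both thermal zones)
TRIPS = {"passive": 95000, "critical": 125000}

margins = trip_margins(83.0, TRIPS)
for name, margin in margins.items():
    print(f"{name}: {margin:.1f} degC above the observed Tj")
```

With these settings the crash at about 83°C happens 12°C below the passive trip and 42°C below the critical trip, so neither Linux trip point has been reached when the system resets.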

    Further, which register holds the reason for the reset ("...Instead, you can check the Reset Reason register to check what caused the reset which should show the reason as thermal shutdown.")?
    Which register is the Reset Reason register?
    And the next question based on this is: Can the contents of the register be read after the system crashes? Would the contents of the register still be valid after a warm start? Currently, we cannot perform a warm start due to the missing reset on the e.MMC. But if there is a chance that the register values after a warm start could bring us closer to a valid statement about the cause of the crash, we would build an intermediate design that fixes the warm start problem.

    Regards, Holger

  • Hi Holger,

    Let me check the reset architecture and get back to you.

    Thanks,

    Anshu

  • Hi Holger, 

    For register info, refer to the AM62x TRM: https://www.ti.com/lit/pdf/spruiv7

    You can refer to WKUP_MMR0_RST_SRC and MCU_MMR0_RST_SRC Registers which will update when a reset occurs and will change the bit based on the source of the reset.

    For the overall reset architecture, refer to Figure 6-16. SoC Reset Hardware Logic Diagram.

    > And the next question based on this is: Can the contents of the register be read after the system crashes?

    In theory, if the SoC's power supply is still on and DDR didn't crash, then it's possible that the contents of the register are still there, but because this is an indeterminate state, it's unknown. I'm assuming you're checking this through a debugger.

    > Would the contents of the register still be valid after a warm start?

    After a warm restart, it should be updated since that's one of the valid reset sources.

    Best Regards,

    Anshu

  • Hi Anshu,

    they see a follow-up problem after the crash. They cannot evaluate the registers after the crash, because the RAM is also reset when the crash occurs and therefore the data is no longer available after a warm start.
    Is it possible to change some register values in advance that might influence the crash?
    They have already tried disabling the watchdogs in the SoC to see whether this prevents an event that initiates the crash, but without success.

    What else can they do to find the problem?

    Regards, Holger

  • Holger, 

    Do they have any JTAG debug interface or could they possibly check the value of WKUP_MMR0_RST_SRC on a subsequent boot? 

    It is expected with warm reset the RAM will be reset, but the value in this MMR should only be reset on PORz or if intentionally cleared via SW by writing to MCU_CTRL_MMR_CFG0_RST_SRC. 

    This is critical to confirm that thermal shutdown initiated by the VTM is actually the cause of the reset event. 

    Can you explain the failure scenario a bit more? I assume this is some sort of thermal chamber testing:

    • Is this a test that is started at room temp and then ambient is ramped up?
    • Do the boards failing at 60deg and 80deg consistently fail once these thresholds are reached or is the behavior random?
    • Is the VTM temperature measurement being monitored during this temperature ramp to get a Tj measurement to get the actual junction temp (this is what would trigger a thermal reset)?
    • Are all other components on the system rated for operation up to 100C (i.e. could this be a failure elsewhere that is causing a system warm reset). 

    Based on the information shared and assuming the VTM configuration is accurate you would expect a linux log indicating thermal issues prior to going straight to a VTM initiated thermal shutdown. The fact this is not happening makes understanding the reset source critical. If it is confirmed to be thermal, it would be good to monitor Tj up to the point the reset occurs -  then we can understand if it is genuine or if we need to suspect some sort of misconfiguration.

    Thanks,

    Chris 

  • Hi Chris,

    > Do they have any JTAG debug interface or could they possibly check the value of WKUP_MMR0_RST_SRC on a subsequent boot? 
    > It is expected with warm reset the RAM will be reset, but the value in this MMR should only be reset on PORz or if intentionally cleared via SW by writing to MCU_CTRL_MMR_CFG0_RST_SRC. 
    > This is critical to confirm that thermal shutdown initiated by the VTM is actually the cause of the reset event. 

    Unfortunately, a warm reset is not possible due to the missing reset signal for the e.MMC. To make this work, we need to order and produce new boards, which would take several weeks. But if I understand you correctly, we have a very good chance of obtaining the data from the WKUP_MMR0_RST_SRC register after a warm reset, right? In that case we would consider new boards including reset support for the e.MMC.
    The JTAG debug option is still being clarified. My colleague from software development also has another idea for capturing the data from the register. We will try to prepare this for the next step.

    > Can you explain the failure scenario a bit more? I assume this is some sort of thermal chamber testing:

    Yes, the very first test setup took place in a climate chamber. But we can reproduce the problem also outside the climate chamber with a simple hot air gun. But more on that below.

    > Is this a test that is started at room temp and then ambient is ramped up?

    Yes, we started at room temperature and ramped up to 85°C (our target ambient temperature)

    > Do the boards failing at 60deg and 80deg consistently fail once these thresholds are reached or is the behavior random?

    It seems to be very consistent. The failure point moves within a window of around +/- 3 K.

    > Is the VTM temperature measurement being monitored during this temperature ramp to get a Tj measurement to get the actual junction temp (this is what would trigger a thermal reset)?

    Yes, we logged the junction temperature of both thermal zones of the processor during the heating phase. The threshold is around 83°C (Tj). I tried to attach a small video. Don't know if IT policy is blocking the content. Maybe we have to share it via other tools.

    • The system freezes at 2:36. After a couple of seconds, the board tries to restart, but fails due to the missing reset signal for the e.MMC.

    > Are all other components on the system rated for operation up to 100C (i.e. could this be a failure elsewhere that is causing a system warm reset). 
    > Based on the information shared and assuming the VTM configuration is accurate you would expect a linux log indicating thermal issues prior to going straight to a VTM initiated thermal shutdown. The fact this is not happening makes understanding the reset source critical.
    > If it is confirmed to be thermal, it would be good to monitor Tj up to the point the reset occurs -  then we can understand if it is genuine or if we need to suspect some sort of misconfiguration.

    To narrow down the problem, we heated the individual components on our board with the hot air gun. DRAM / PMIC / uP /... 
    The results showed that if we heat the processor from the back, the system reliably crashes at nearly the same point. All other components appear to be stable even above 100 °C, or do not cause a crash. Please see pictures from the test setup in attachment

    Regards, Holger

  • Hello Holger, 

    Just letting you know that Anshu will be out of office until May 11th so there will be delayed responses from him. Looks like Chris is helping out here

    -Daolin

  • Holger, 

    Thank you for the info. I can see the video. To make sure I understand correctly: the above reset at ~84°C occurs while trip points 0 and 1 of thermal zones 0 and 1 are set to 95°C and 125°C respectively?

    Is the VTM MAXTEMP_OUTRANGE_ALERT also being configured as shown below? This would cause a HW initiated reset which may look like above. I'm not sure why there would be device to device variance assuming the configuration is the same. 

    At this point I think it would be helpful if we can dump the VTM register space for a device resetting ~60C ambient as well as a device that can survive to 100C ambient so we can compare. 

    We can discuss further on the call tomorrow. 

    Thanks,

    Chris 

  • Hi,

    they have now executed a warm start and could read the registers:

    It seems that the ESM reported an error, but nothing was found in the ESMs themselves.

    Regards, Holger

  • Focus areas

    1) Share schematics (done offline)

    2) Gauge the behavior of a good board, to see whether its pass/fail point also depends on the voltage of the 1.8V rail, as reported for the failing boards

    3) Remove the VTM driver completely so that no errors are generated, to see if the crash goes away

    4) Provide current /power reading for good vs bad board if possible.

    5) Try to identify the source of reset, from the ESM Module - to see if this provides any further clues.

  • Hi,

    > 3) Remove the VTM driver completely so that no errors are generated, to see if the crash goes away

    Let's hold on this action, and please apply the following kernel patch to see if the ESM reset issue still happens. 

    diff --git a/drivers/thermal/k3_j72xx_bandgap.c b/drivers/thermal/k3_j72xx_bandgap.c
    index 9bc279ac131a..faaec371acef 100644
    --- a/drivers/thermal/k3_j72xx_bandgap.c
    +++ b/drivers/thermal/k3_j72xx_bandgap.c
    @@ -352,6 +352,7 @@ static void k3_j72xx_bandgap_init_hw(struct k3_j72xx_bandgap *bgp)
                            K3_VTM_TMPSENS_CTRL_SOC |
                            K3_VTM_TMPSENS_CTRL_CLRZ | BIT(4));
                    writel(val, bgp->cfg2_base + data->ctrl_offset);
    +               udelay(1);
            }
     
            /*
    @@ -365,7 +366,9 @@ static void k3_j72xx_bandgap_init_hw(struct k3_j72xx_bandgap *bgp)
            low_temp = k3_j72xx_bandgap_temp_to_adc_code(COOL_DOWN_TEMP);
     
            writel((low_temp << 16) | high_max, bgp->cfg2_base + K3_VTM_MISC_CTRL2_OFFSET);
    +       udelay(1);
            writel(K3_VTM_ANYMAXT_OUTRG_ALERT_EN, bgp->cfg2_base + K3_VTM_MISC_CTRL_OFFSET);
     }

  • Current measurement is completed. Excel sheet is updated with the last measurement. However, compared to the other measurements, no significant deviations are apparent.

    Hi,

    I have some results now.

    3) --> VTM driver removed --> shows no effect --> The trigger point remains unchanged. The same applies to the VTM driver with the delay patch.

    4) --> Excel sheet is attached. I've measured three "bad" PCBs and three "good" PCBs under two different temperature conditions. 25°C and 55°C. One measurement is missing. I have to make that up next week.

    Next steps: Completing 3) ; Starting investigations for 2)

    Best regards,

    Jens

    2605.CurrentMeasurement.xlsx

  • Hi Jens,

    Thanks for the update.

    Can you please confirm if you tested with the kernel patch above which adds 1us delay in the bandgap driver? Does it change the behavior?

  • Hi Jens,

    > 3) --> VTM driver removed --> shows no effect --> The trigger point remains unchanged.

    Since you have tested with the VTM driver removed (assuming it is the driver from drivers/thermal/k3_j72xx_bandgap.c), the test with the patch above no longer makes sense.

    Can you please trigger the ESM reset again, then dump the following ESM registers after the board rebooted into U-Boot?

    => md.l <address> 1

    WKUP ESM MMRs:

    04100028
    0410002c
    04100400
    04100420
    04100440

    MAIN ESM MMRs:

    00420028
    0042002c
    00420400
    00420420
    00420440
    00420460
    00420480
    004200a0
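
The dump requested above can be scripted. The sketch below only builds the `md.l` command lines for the addresses listed in this post and parses the reply lines. It assumes the usual U-Boot `md` output of the form `<address>: <value> ...`, and the helper names are hypothetical.

```python
import re

# Register addresses exactly as listed in this post
WKUP_ESM_MMRS = [0x04100028, 0x0410002C, 0x04100400, 0x04100420, 0x04100440]
MAIN_ESM_MMRS = [0x00420028, 0x0042002C, 0x00420400, 0x00420420,
                 0x00420440, 0x00420460, 0x00420480, 0x004200A0]


def md_commands(addresses):
    """Build one 'md.l <address> 1' U-Boot command per register address."""
    return [f"md.l {addr:08x} 1" for addr in addresses]


# Assumed output format: '04100028: 00000000    ....'
_MD_LINE = re.compile(r"^\s*([0-9a-fA-F]{8}):\s+([0-9a-fA-F]{8})\b")


def parse_md_output(text):
    """Return {address: value} for every line that looks like md.l output."""
    values = {}
    for line in text.splitlines():
        match = _MD_LINE.match(line)
        if match:
            values[int(match.group(1), 16)] = int(match.group(2), 16)
    return values


def nonzero_registers(values):
    """Return only the registers whose value is not zero."""
    return {addr: val for addr, val in values.items() if val != 0}
```

Pasting the captured console output into `parse_md_output()` and then `nonzero_registers()` shows at a glance whether any ESM register holds a value other than zero.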

  • Hi,

    We also tried out the delay patch but without any discernible change

  • Thanks for confirming.

    Did you get the ESM MMR readings that I asked in my previous response?

  • Hi,

    here's the register dump after reboot (Warm Reset)

    WKUPESM MMRs


    MAIN ESM MMRs

    The last register is incorrect, as I see. Should be 004200a0.

    I made a second register dump for 004200a0, but the result is also "0".

    > Since you have tested with the VTM driver removed (assuming it is the driver from drivers/thermal/k3_j72xx_bandgap.c)

    I will double check that with our Software development

  • For 2)

    Good boards do not seem to be affected by changing the 1V8A (Range: 1.70V to 1.90V)

  • All the ESM MMR are '0', but register 0x43018178 is still 0xc0000000 after warm reset?
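
As a quick check of which bit positions are set in that value, the sketch below lists them. It deliberately does not name the bits: which reset source each position of the register at 0x43018178 represents has to be taken from the register description in the TRM.

```python
def set_bit_positions(value, width=32):
    """Return the positions of all set bits, most significant first."""
    return [bit for bit in range(width - 1, -1, -1) if (value >> bit) & 1]


print(set_bit_positions(0xC0000000))  # -> [31, 30]
```

So 0xc0000000 means that exactly the two most significant bits, 31 and 30, are set.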

  • Hi Jens,

    It is probably expected that all the ESM MMRs read 0 in U-Boot, because the ESM driver in U-Boot performs a module soft reset during initialization.

    So we might have to read those MMRs in the U-Boot ESM driver before it does the soft reset. Let me try this on the EVM and provide you a patch to dump the ESM MMRs.