AM5728: CPU reset during high temperature testing

Part Number: AM5728

Hi,

My customer is testing their AM5728 board at high temperature (up to a 55C environment).
During the test, the CPU resets repeatedly at around 50C.
The customer suspects that the reset is generated by Linux thermal management.

The SDK is:
PROCESSOR-SDK-LINUX-RT-AM57X 06_01_00_08

Q1) How can we confirm that thermal management triggers the reset?

Q2) How can we optimize/configure the threshold that triggers the reset?
The system needs to work in an environment of up to 55C.

Thanks and regards,
Koichiro Tashiro

  • Hi Koichiro Tashiro,

    You can read the temperature of the SoC from Linux using:

    cat /sys/class/thermal/thermal_zone*/temp

    You can see the names of various zones using:

    cat /sys/class/thermal/thermal_zone*/type
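
    The two commands above can be combined so that each zone name appears next to its reading. This is only a minimal sketch, assuming a POSIX shell with awk on the target. The function name print_zone_temps is made up for illustration, and its optional argument exists only so that it can be pointed at a different directory.

```shell
# Print "<zone type>: <temperature> C" for every thermal zone.
# $1 is the sysfs base directory (assumed default: /sys/class/thermal).
print_zone_temps() {
    base=${1:-/sys/class/thermal}
    for zone in "$base"/thermal_zone*; do
        [ -d "$zone" ] || continue
        name=$(cat "$zone/type" 2>/dev/null)
        # Reading temp can fail (for example with "Invalid argument") when
        # the sensor driver is not working, so report it and keep going.
        if mc=$(cat "$zone/temp" 2>/dev/null) && [ -n "$mc" ]; then
            # sysfs reports millidegrees Celsius; convert to degrees.
            deg=$(LC_ALL=C awk -v mc="$mc" 'BEGIN { printf "%.1f", mc / 1000 }')
            printf '%s: %s C\n' "$name" "$deg"
        else
            printf '%s: read failed\n' "$name"
        fi
    done
}
```

    Calling print_zone_temps with no argument reads /sys/class/thermal. A zone whose temp file cannot be read is reported as "read failed" instead of stopping the loop.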

    55C is far too low to trigger a thermal reset.

    If you look at the thresholds here: arch/arm/boot/dts/am57xx-industrial-grade.dtsi

    &cpu_crit {
    temperature = <105000>; /* milliCelsius */
    };

    So the reset should happen in the vicinity of 105C.
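
    The trip points that the kernel actually registered can also be listed at run time through the standard thermal sysfs files trip_point_N_type and trip_point_N_temp. The following is a rough sketch, not a TI-provided tool; the function name list_trip_points is hypothetical, and it assumes the thermal driver has probed successfully so that these files exist.

```shell
# Print "<zone type> <trip type> <trip temperature in millicelsius>"
# for every trip point of one thermal zone directory ($1).
list_trip_points() {
    zone=$1
    for type_file in "$zone"/trip_point_*_type; do
        [ -e "$type_file" ] || continue
        # trip_point_N_type and trip_point_N_temp share the same prefix.
        temp_file=${type_file%_type}_temp
        printf '%s %s %s\n' \
            "$(cat "$zone/type")" "$(cat "$type_file")" "$(cat "$temp_file")"
    done
}
```

    For example, list_trip_points /sys/class/thermal/thermal_zone0 prints one line per trip point of that zone. If the zone has no trip point files, nothing is printed.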

    Can you please check the temperatures at the time you see the resets, using the commands
    I suggested?

    Best Regards,
    Keerthy

  • Hi Keerthy,

    The customer tried these commands but did not get the expected output.

    cat /sys/class/thermal/thermal_zone*/temp
    cat: /sys/class/thermal/thermal_zone0/temp: Invalid argument
    cat: /sys/class/thermal/thermal_zone1/temp: Invalid argument
    cat: /sys/class/thermal/thermal_zone2/temp: Invalid argument
    cat: /sys/class/thermal/thermal_zone3/temp: Invalid argument
    cat: /sys/class/thermal/thermal_zone4/temp: Invalid argument
    

    cat /sys/class/thermal/thermal_zone*/type
    cpu_thermal
    gpu_thermal
    core_thermal
    dspeve_thermal
    iva_thermal

    It seems something is different.
    What should be checked?

    Thanks and regards,
    Koichiro Tashiro

  • Hi Koichiro,

    It looks like the thermal driver probe is failing. I believe I know the reason: RT
    Linux disables CPUFREQ, i.e. dynamic voltage and frequency scaling (DVFS) for the A15. CPUFREQ
    must be enabled for thermal monitoring to work. Can you try the same on the non-RT version?

    Latest SDK: software-dl.ti.com/.../index_FDS.html

    Best Regards,
    Keerthy

  • Hi Keerthy,

    The customer checked the configuration manually.
    The Alert trip point is set to 90C.
    The Crit trip point is set to 105C.

    --------------------------------------------------
    root@HX-SDC:/proc/device-tree# for file in `find -name temperature`; do echo $file; hexdump -Cv $file; done
    ./thermal-zones/dspeve_thermal/trips/dspeve_crit/temperature
    00000000  00 01 9a 28                                       |...(|
    00000004
    ./thermal-zones/gpu_thermal/trips/gpu_crit/temperature
    00000000  00 01 9a 28                                       |...(|
    00000004
    ./thermal-zones/cpu_thermal/trips/cpu_alert/temperature
    00000000  00 01 5f 90                                       |.._.|
    00000004
    ./thermal-zones/cpu_thermal/trips/cpu_crit/temperature
    00000000  00 01 9a 28                                       |...(|
    00000004
    ./thermal-zones/iva_thermal/trips/iva_crit/temperature
    00000000  00 01 9a 28                                       |...(|
    00000004
    ./thermal-zones/core_thermal/trips/core_crit/temperature
    00000000  00 01 9a 28                                       |...(|
    00000004
    root@HX-SDC:/proc/device-tree# echo $((0x19a28))
    105000
    root@HX-SDC:/proc/device-tree# echo $((0x15f90))
    90000
    --------------------------------------------------
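
    The manual hexdump-and-convert step in the listing above can be wrapped in a small helper. Device tree cells are stored as 32-bit big-endian values, so joining the four bytes in order gives the hex number directly. This is a sketch only: the function names dt_cell_to_int and dt_file_to_int are invented for illustration, and the second one assumes od is available on the target.

```shell
# Convert the hex bytes of one device tree cell, e.g. "00 01 9a 28",
# into a decimal integer. The bytes are big-endian, so they can simply
# be concatenated in the order they appear.
dt_cell_to_int() {
    hex=$(printf '%s' "$1" | tr -d ' \t\n')
    printf '%d\n' "0x$hex"
}

# Read a 4-byte property file (such as .../trips/cpu_crit/temperature)
# and print its value as a decimal integer.
dt_file_to_int() {
    dt_cell_to_int "$(od -An -tx1 "$1")"
}
```

    dt_cell_to_int "00 01 9a 28" prints 105000 and dt_cell_to_int "00 01 5f 90" prints 90000, matching the manual conversion in the listing.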

    The customer also read the temperature register values manually.
    Just before the reset, the temperature values were:
    MPU:120.8
    GPU:114.0
    CORE:114.4
    The register dump is also attached (register dump.xlsx).
    I thought the reset would be triggered at 105C, but the internal temperature seems to be much higher.
    Is this expected?

    Thanks and regards,
    Koichiro Tashiro

  • Hi Koichiro Tashiro,

    I assume we are still talking about RT Linux here, right? With RT Linux there is no thermal
    support because the cooling agent, CPUFREQ, is disabled. Could you please check whether CPUFREQ
    support is in place?

    Best Regards,
    Keerthy

  • Hi Keerthy,

    Yes, we are talking about RT Linux, as changing the environment to non-RT would require a lot of work for the customer.
    Could you explain a bit more about CPUFREQ support?
    - What is CPUFREQ?
    - How can we check whether CPUFREQ is enabled or disabled?
    - Is any documentation available?

    Thanks and regards,
    Koichiro Tashiro

  • Hi Koichiro Tashiro,

    CPUFREQ is the DVFS (Dynamic Voltage and Frequency Scaling) framework for the CPU. The CPU has multiple OPPs (Operating Performance Points). An OPP is
    simply a voltage/frequency pair. Depending on the system load, one can use either a high OPP or a lower OPP that still provides the
    performance required to sustain the load.
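
    At run time, the presence of CPUFREQ and the frequencies of the available OPPs can be inspected through the generic cpufreq sysfs directory. The sketch below assumes the usual location /sys/devices/system/cpu/cpu0/cpufreq and that the driver exposes scaling_available_frequencies; the function name cpufreq_summary is made up for illustration.

```shell
# Summarize the cpufreq state of one CPU.
# $1 is the cpufreq sysfs directory (assumed default: cpu0).
cpufreq_summary() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpufreq}
    if [ ! -d "$dir" ]; then
        # No directory means no cpufreq driver is bound to this CPU.
        echo "cpufreq: not available"
        return 1
    fi
    echo "governor: $(cat "$dir/scaling_governor")"
    echo "current kHz: $(cat "$dir/scaling_cur_freq")"
    echo "available kHz: $(cat "$dir/scaling_available_frequencies")"
}
```

    On a kernel built without CPUFREQ the directory is expected to be missing, in which case the function prints "cpufreq: not available" and returns a non-zero status.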

    CONFIG_ARM_TI_CPUFREQ=y

    The above option tells you whether CPUFREQ for TI SoCs such as the AM57xx is enabled.

    Documentation: Documentation/devicetree/bindings/cpufreq/ti-cpufreq.txt

    CONFIG_ARM_TI_CPUFREQ should be disabled on the RT kernel, because frequency changes make it
    difficult to preserve the real-time behavior of the kernel.
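
    Whether the option is set can be checked against the kernel configuration file. This is a minimal sketch: the function name kconfig_value is hypothetical, and the path passed to it is an assumption (for example the .config in the kernel build tree, or /proc/config.gz, which exists only when the kernel was built with CONFIG_IKCONFIG_PROC).

```shell
# Print the value of a kernel config option ("y", "m", ...) found in a
# config file, or "n" if the option is not set there.
# $1 = option name, $2 = config file (plain text or gzip-compressed).
kconfig_value() {
    opt=$1
    file=$2
    case $file in
        *.gz) val=$(gzip -dc "$file" | sed -n "s/^${opt}=//p") ;;
        *)    val=$(sed -n "s/^${opt}=//p" "$file") ;;
    esac
    echo "${val:-n}"
}
```

    For example, kconfig_value CONFIG_ARM_TI_CPUFREQ .config prints y when the driver is built in. A line such as "# CONFIG_ARM_TI_CPUFREQ is not set" does not match, so the function prints n.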

    Let me know if all of your questions are answered.

    Best Regards,
    Keerthy

  • Hi Keerthy,

    Thanks for your explanation. I understand the following:
    - With CPUFREQ enabled, the kernel can select OPPs based on the load.
    - With CPUFREQ enabled, the kernel can switch to lower OPPs to reduce power if the internal temperature rises above the Alert or Crit temperature.
    - Without CPUFREQ, the above features are disabled, so the temperature keeps rising and results in a thermal reset.
    - On RT Linux, CPUFREQ cannot be used, in order to preserve the real-time behavior.

    Coming back to my customer's original problem: the internal temperature appears to be higher than the Crit temperature just before the reset.
    I think this is because no OPP management is done in the RT kernel without CPUFREQ.
    Is it possible to enable CPUFREQ in the RT kernel just to check whether it works around the thermal reset the customer observes?

    And what is the practical solution for avoiding the thermal reset in an RT kernel environment?

    Thanks and regards,
    Koichiro Tashiro

  • Koichiro Tashiro said:
    Is it possible to enable CPUFREQ in the RT kernel just to check whether it works around the thermal reset the customer observes?

    Can you try the defconfig from the non-RT kernel, which has CPUFREQ enabled?

    Koichiro Tashiro said:
    And what is the practical solution for avoiding the thermal reset in an RT kernel environment?

    This will need some external cooling. I can loop in hardware experts to comment on this. Let me know.

    Best Regards,
    Keerthy

  • Hi Keerthy,

    I have a question about the thermal management registers.
    TRM (SPRUHZ6L) section 18.4.6.2.3 “Thermal Shutdown Comparators”
    says:
    "The values of the five low and high TSHUT thresholds are fix and can be neither overridden nor read by
    software. The value for the high TSHUT threshold is 123°C (assuming ±2°C temperature sensor accuracy)
    and for the low TSHUT threshold it is 105°C."

    On the other hand, the CTRL_CORE_BANDGAP_TSHUT_xxx register description (e.g. Table 18-167) says TSHUT_HOT and TSHUT_COLD can be overridden by software.
    Which description is correct?

    Thanks and regards,
    Koichiro Tashiro

  • Hi Koichiro Tashiro,

    "The values of the five low and high TSHUT thresholds are fix and can be neither overridden nor read by software. The value for the high TSHUT threshold is 123°C (assuming ±2°C temperature sensor accuracy) and for the low TSHUT threshold it is 105°C."

    This is from https://www.ti.com/lit/ug/spruhz6l/spruhz6l.pdf

    However, if you do need to set them, you can try the CTRL_CORE_BANDGAP_TSHUT_xxx registers: the register description (e.g. Table 18-167) says TSHUT_HOT and TSHUT_COLD can be overridden by software. My recommendation is to first program values close to room temperature, say 45C as the high threshold and 40C as the low threshold,
    verify that the shutdown works at those temperatures, and then carefully program the desired values, such as 105C in your case.

    Please mark this issue as resolved if you have no further questions.
    Best Regards,
    Keerthy

  • Hi Keerthy,

    Thanks for your reply.
    Customer tested and confirmed CTRL_CORE_BANDGAP_TSHUT_xxx override works.

    Thanks and regards,
    Koichiro Tashiro

  • Hello Koichiro-san,

    I am glad it worked. Closing this issue.

    Best Regards,
    Keerthy