We're running Android (TI release 4AI.1.7) on our board, and we have some questions about the CPU zones defined in the Linux kernel in the file:
drivers/staging/thermal_framework/governor/omap_die_governor.c
So the temperature zones are defined near the top of this file. A little further down, the omap_thermal_zones array is defined with the five zones going from 'safe' to 'fatal'. This is also where the cooling levels are set for the various zones.
We were running some CPU benchmark programs, and managed to heat the board up so much that the system quickly entered the 'fatal' temperature zone and then shut down (as it is supposed to, to prevent permanent damage). While running the benchmark programs, the system went from 'safe' to 'monitor', and shortly afterwards went into 'alert'. But the temperature rose quickly enough that the 'panic' zone was skipped, and it went from 'alert' directly to 'fatal'.
Looking at this more closely, only the 'panic' zone has a cooling level set above 0. Since that zone was skipped, no cooling was ever applied: each time, the system would start to run the benchmarks and then just reboot, without giving the cooling a chance to work.
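To illustrate what I think is happening, here is a small stand-alone sketch (plain user-space C, not the governor code). The struct, the function names, and all of the temperatures are made up for illustration and are not the values from omap_die_governor.c. The only thing taken from the real table is that 'panic' is the sole zone with a non-zero cooling level; the level of 1 used here is a placeholder.

```c
#include <stddef.h>

/* Hypothetical zone entry; NOT the struct used by the kernel governor. */
struct zone {
    const char *name;
    int min_temp_mc;    /* lower bound of the zone in millidegrees C (illustrative) */
    int cooling_level;  /* 0 = no throttling */
};

/* Placeholder temperatures; only 'panic' requests any cooling. */
static const struct zone stock_zones[] = {
    { "safe",         0, 0 },
    { "monitor",  60000, 0 },
    { "alert",    80000, 0 },
    { "panic",   100000, 1 },
    { "fatal",   120000, 0 },  /* system shuts down here */
};

/* Return the highest zone whose lower bound is at or below temp_mc. */
static const struct zone *zone_for_temp(const struct zone *zones, size_t n,
                                        int temp_mc)
{
    const struct zone *z = &zones[0];
    for (size_t i = 0; i < n; i++)
        if (temp_mc >= zones[i].min_temp_mc)
            z = &zones[i];
    return z;
}

/* Highest cooling level requested over a series of temperature samples. */
static int max_cooling_seen(const struct zone *zones, size_t n,
                            const int *samples, size_t count)
{
    int max_level = 0;
    for (size_t i = 0; i < count; i++) {
        const struct zone *z = zone_for_temp(zones, n, samples[i]);
        if (z->cooling_level > max_level)
            max_level = z->cooling_level;
    }
    return max_level;
}
```

With samples such as 55000, 65000, 85000, 125000 (no reading ever lands inside 'panic'), max_cooling_seen() returns 0, i.e. the system reaches 'fatal' without any throttling having been requested.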
I've since adjusted down the temperatures for 'alert' and 'panic', and also set 'alert' to use level-1 cooling (OPP_TURBO), and 'panic' to level-2 cooling (OPP100).
Obviously, this will result in reduced performance at higher temperatures, but now the system doesn't just reset itself when it gets too hot, because there is enough time for it to down-clock the CPU cores, which prevents the system from reaching the 'fatal' zone.
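Roughly, the adjusted setup looks like the sketch below. Again this is stand-alone illustrative C rather than the actual kernel table: the names and temperatures are placeholders, and the assumption that level 0 means "no cap" and that anything above level 2 maps to OPP50 is mine. Only the level-1 = OPP_TURBO and level-2 = OPP100 mapping reflects what I actually changed.

```c
#include <stddef.h>

/* Hypothetical zone entry; NOT the struct used by the kernel governor. */
struct zone {
    const char *name;
    int min_temp_mc;    /* lower bound in millidegrees C (illustrative) */
    int cooling_level;  /* 0 = no throttling */
};

/* 'alert' and 'panic' start earlier and both request cooling now. */
static const struct zone adjusted_zones[] = {
    { "safe",         0, 0 },
    { "monitor",  60000, 0 },
    { "alert",    75000, 1 },  /* level 1: cap at OPP_TURBO */
    { "panic",    90000, 2 },  /* level 2: cap at OPP100 */
    { "fatal",   120000, 0 },  /* system shuts down here */
};

/* Cooling level requested for a given temperature. */
static int cooling_level_for_temp(int temp_mc)
{
    size_t n = sizeof(adjusted_zones) / sizeof(adjusted_zones[0]);
    int level = adjusted_zones[0].cooling_level;
    for (size_t i = 0; i < n; i++)
        if (temp_mc >= adjusted_zones[i].min_temp_mc)
            level = adjusted_zones[i].cooling_level;
    return level;
}

/* Map a cooling level to the highest OPP the CPU may still use. */
static const char *opp_cap_for_level(int level)
{
    switch (level) {
    case 0:  return "unrestricted";  /* assumed: no cap applied */
    case 1:  return "OPP_TURBO";
    case 2:  return "OPP100";
    default: return "OPP50";         /* assumed: highest cooling level */
    }
}
```

With this table, a reading in the 'alert' range already caps the CPU at OPP_TURBO, so throttling starts well before 'panic' is reached.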
So my question is: how were the existing zone temperatures and cooling levels picked?
It seems to me that at the 'panic' level, it might be advisable to use the highest cooling level (down-clocking the CPU to OPP50). I'm also considering implementing more cooling zones in between 'alert' and 'panic' which use the intermediate cooling levels.