AM623: RTI Watchdog support for non-windowed (100% open) operation

Part Number: AM623

Tool/software:

According to the AM62x technical reference manual, it is possible to configure the RTIx_WWDSIZECTRL register to a value of 0x00000005 (the default), in which case the watchdog functions as a standard timeout digital watchdog.  However, the largest open window the Linux watchdog driver for the K3 RTI module can configure is 50%.  Is this a limitation of the driver implementation or an actual hardware limitation?

I have tried making changes to the driver to support a 100% window and have noticed that the watchdog fails to reset the system when it expires.  This is despite the RTIx_WDSTATUS register containing the value 0x32, which indicates a timing violation on the end-time.

  • Hello Aaron,

    This is a limitation of the driver, not of the hardware as far as I am aware.

    During our previous discussion, when we could not pet the watchdog, we did check the 50%, 25%, 12.5%, 6.25%, and 3.125% windows:
    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1338070/am625-how-to-control-the-watchdog/5185069#5185069

    I am not sure if 100% window was tested or not - I'll check with the developer.

    Regards,

    Nick

  • Hello Aaron,

    I double-checked with the developer; we did NOT try the 100% window, since a 100% open window is not currently supported by the driver.

    Can you tell us a little more about your use case where the 50% window does not meet the design needs?

    Regards,

    Nick

  • Hello Nick,

    I'm currently experiencing an issue with the watchdog resetting the board when it is serviced by systemd.  A reset usually occurs right around the 30-second mark when referencing /sys/class/watchdog/watchdog0/timeleft.  I have systemd configured with the setting `RuntimeWatchdogSec=45`, and I've also tested other values.  This is with the V3 patch applied, as we're still on SDK 9.01.

    The interesting thing is that the watchdog behaves as expected when I disable the systemd watchdog feature and use a user-space application to pet it.  I tested the watchdog with a script petting it and in an application using the watchdog IOCTL API.  Both worked as expected.
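
    For reference, the ioctl-based test follows the standard Linux watchdog pattern; a simplified sketch is below (this is not the exact code from my application, and the device node and pet interval are just illustrative):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/watchdog.h>

    int main(void)
    {
            int timeout = 0;
            int fd = open("/dev/watchdog0", O_WRONLY);

            if (fd < 0) {
                    perror("open /dev/watchdog0");
                    return 1;
            }

            /* read back the timeout the driver is actually using */
            ioctl(fd, WDIOC_GETTIMEOUT, &timeout);
            printf("timeout: %d s\n", timeout);

            for (;;) {
                    /* pet at 3/4 of the timeout, well inside the 50% open window */
                    sleep(timeout * 3 / 4);
                    ioctl(fd, WDIOC_KEEPALIVE, 0);
            }
    }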

    I can get the system to stay up with systemd servicing the watchdog if I increase the MAX_HW_ERROR definition in the driver from 250 to 2000 (an arbitrary choice).  This is with the 60-second default timeout and the watchdog module clocked by an external 32 kHz clock source.

    One interesting finding is that the value from the systemd configuration parameter `RuntimeWatchdogSec` doesn't appear to be used.  Inspecting the systemd watchdog source code, it looks like systemd falls back to the driver's default value (60 s) when it detects that setting the timeout is an unsupported feature.  systemd will then attempt to pet the watchdog at half the timeout interval, i.e. at the 30-second mark, although I have seen it pet sooner.  This could place the pet right at the edge of the closed window.

    It looks to me like I'm facing a timing issue: systemd services the watchdog much sooner than my user-space application does, right on the edge of the closed/open window.

    Thanks,
    Aaron

    Oops, I forgot to follow up on my use case.  For the reason above, I was attempting to update the watchdog driver so that it could function as a traditional watchdog with a 100% open window for the whole timeout.  This would remove the uncertainty about when to pet the watchdog; the only requirement would be that it is serviced before the timeout expires.  (A rough sketch of the shape of the change is at the end of this post.)

    As mentioned above, it fails to reset the board upon a timeout.  The WDSTATUS register indicates a timing violation on the end-time.  /sys/class/watchdog/watchdog0/timeleft stops counting and reports 0.  I tried changing WWDRXNCTRL (watchdog reaction control) from NMI (the driver default) to reset, but it didn't change the behaviour.
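
    For context, the shape of the change I've been experimenting with is roughly the following (RTIWWDSIZE_100P is a placeholder name I'm using here; 0x5 is the TRM default value for the window-size register):

    /* placeholder name; 0x5 selects the 100% (fully open) window per the TRM */
    #define RTIWWDSIZE_100P         0x5

            /* 100% open window instead of the driver's 50% window */
            writel_relaxed(RTIWWDSIZE_100P, wdt->base + RTIWWDSIZECTRL);

            /* with a fully open window there is no "too early" to pet */
            wdt->wdd.min_hw_heartbeat_ms = 0;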

  • Hello Aaron,

    Interesting. Could I get you to share the patches that you are applying to set a 100% window?

    What is MAX_HW_ERROR doing?

    We realized that the 13 least significant bits of the watchdog timeout value are rounded up in the watchdog's hardware. That means there can be up to a 0.25 sec difference between the timeout programmed in software and the timeout programmed in hardware. If we only allow whole seconds to be programmed, there is ALWAYS a 0.25 second difference between the hardware countdown value and the software countdown value.

    Depending on the size of the window, that means the SW driver would allow software to pet the watchdog 0.125 to 0.25 seconds before the hardware's window actually opened. So that's why we added MAX_HW_ERROR as a 0.25 sec buffer, to make sure that the SW driver would never allow the watchdog to be pet until AFTER the hardware window opened.
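
    To put rough numbers on it (assuming the watchdog counter is clocked at 32768 Hz): dropping the low 13 bits is a granularity of 2^13 = 8192 ticks, and 8192 / 32768 Hz = 0.25 sec, which is where the 250 ms value of MAX_HW_ERROR comes from.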

    Here's a bit of a messy graphic I drew showing what is going on:

    To confirm: the system is resetting with SystemD petting the watchdog, unless you increase MAX_HW_ERROR to a larger value? 

    If SystemD only pets the system once, at exactly 50% of the timeout value, then I would expect the watchdog driver to block the pet.

    The other source of error between the watchdog's hardware counter and the software driver's counter would be a difference in clock frequency. The watchdog driver adds 2% of buffer, but if the 32kHz hardware clock is incrementing at more than 2% of difference with the software driver's PLL clock, then the driver could still allow the watchdog to be pet before the hardware window opens.
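
    As a rough illustration of that second case: with a 60 second timeout and a 50% window, the hardware's window opens after 30 seconds of hardware time. If the 32 kHz source were running, say, 3% slow relative to the clock the driver uses for its bookkeeping, that point would not arrive until roughly 30.9 seconds of software time, while the driver (with its 2% allowance) could approve a pet at around 30.6 seconds, which is still inside the closed window.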

    Regards,

    Nick

  • Hi Nick,

    I've sent the associated patches to Michael and you should hopefully receive them soon.

    Thanks for the explanation of MAX_HW_ERROR.  It does clarify the requirement for this extra padding.  As you may recall from other discussions, the custom board I'm developing on uses an external RTC (MCP7940) to provide the 32 kHz clock, and this requires initialization via I2C.  I moved the initialization to U-Boot to ensure a stable clock source is provided before Linux starts.  I'm uncertain if this has any effect on the RTI module, as I believe the AM62 EVM always provides a 32 kHz clock via a dedicated crystal.

    I have a question related to the watchdog for the AM62 and how it monitors the bootloader process. Would you like me to open a separate post for it?


  • Hello Aaron,

    100% open window

    I have received the patches. Running out of time this weekend, so I'll have to take a look on Monday. I cannot make any promises about getting the 100% open code working, but I will at least double-check your changes to see if everything makes sense.

    50% open window

    I am still trying to think through why we would see the behavior you describe (where increasing MAX_HW_ERROR prevents the system from rebooting itself after 30 seconds):
    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1411019/am623-rti-watchdog-support-for-non-windowed-100-open-operation/5407264#5407264

    The most likely explanations I can think of are that either the patches are not getting applied to the code that is actually running, or the hardware clock source is running more than 2% slower than the software's clock expects.

    I'm not sure if SystemD's timeout can be set independently of the hardware watchdog's, but if so, perhaps setting the SystemD timeout to a longer value would also improve behavior? (e.g., SystemD timeout at 60 secs, driver timeout at 50 secs; then the 50% hardware window opens at 25 seconds and there is a 5 second buffer before SystemD tries to pet)

    Regards,

    Nick

  • Hello Aaron,

    In general, your code changes look reasonable to me.

    Behavior when watchdog expires

    One thing that popped out to me while reviewing the code is that by default, the driver configures the watchdog to generate a non-maskable interrupt (NMI) when the watchdog expires, instead of just directly resetting the processor:

            /* Generate NMI when wdt expires */
            writel_relaxed(RTIWWDRX_NMI, wdt->base + RTIWWDRXCTRL);
    

    I am not sure if there is a particular reason the code is written like that, but it is another potential knob you could turn: set that field to 0x5 (the default bitfield value, which triggers a reset) instead of 0xA, and see if the processor starts resetting as expected.
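
    In driver terms that would be roughly the following (RTIWWDRX_RESET is just a name I'm using for the 0x5 value; the driver only defines RTIWWDRX_NMI):

            #define RTIWWDRX_RESET  0x5

            /* Generate a reset instead of an NMI when wdt expires */
            writel_relaxed(RTIWWDRX_RESET, wdt->base + RTIWWDRXCTRL);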

    Double-checking the zero value in min_hw_heartbeat_ms

    At least when looking at the SDK 10.0 code, it looks like this value is only used in watchdog_dev.c > __watchdog_ping().

    The value is only used in an addition, so I would not expect a zero value to cause math-related problems (e.g., no divide-by-zero or anything like that).  With min_hw_heartbeat_ms at zero, earliest_keepalive is simply last_hw_keepalive, so the ktime_after() check is false and the ping is passed straight through to the hardware:

    static int __watchdog_ping(struct watchdog_device *wdd)
    {
            struct watchdog_core_data *wd_data = wdd->wd_data;
            ktime_t earliest_keepalive, now;
            int err;
    
            earliest_keepalive = ktime_add(wd_data->last_hw_keepalive,
                                           ms_to_ktime(wdd->min_hw_heartbeat_ms));
            now = ktime_get();
    
            if (ktime_after(earliest_keepalive, now)) {
                    hrtimer_start(&wd_data->timer,
                                  ktime_sub(earliest_keepalive, now),
                                  HRTIMER_MODE_REL_HARD);
                    return 0;
            }
    

    Regards,

    Nick

  • Hi Nick,

    Sorry, I've been tied up with a few items.  I've already attempted the suggested change to directly reset the processor on a watchdog timing violation instead of generating an NMI, but the results are the same.  The WDSTATUS register indicates a timing violation on the end-time, but a reset never occurs.

    Would you be able to see if you can reproduce this issue on an AM62x EVM using similar patches?

    Thanks,
    Aaron

  • Hello Aaron,

    What is your timeframe of need? I can definitely see if I can replicate your SystemD observations, but I'm handling a lot of escalations right now so you'd need to get your field representative to escalate up my management chain if you wanted it tested in the month of September.

    Regards,

    Nick

  • Hello Aaron,

    Apologies for the delays here. I am now working on replicating your observations. Will provide another update tomorrow.

    This is currently the only watchdog thread from your AM62x project that I am tracking. Please let me know if there are any other threads I should have on my radar.

    Regards,

    Nick

  • Hello Aaron,

    SDK 10.0 tests - I am unable to replicate results

    I am starting by seeing if I can replicate your observations on AM62x SDK 10.0 / kernel 6.6, since that is the first official SDK version where the watchdog should work out-of-the-box. I will then work backwards towards kernel 6.1.

    First off, the watchdog is able to reset the board if I start it and do not service it:

    root@am62xx-evm:~# uname -a
    Linux am62xx-evm 6.6.32-g6de6e418c80e-dirty #1 SMP PREEMPT Thu Oct 24 17:56:25 CDT 2024 aarch64 GNU/Linux
    
    // processor resets after about a minute
    root@am62xx-evm:~# echo 1 > /dev/watchdog
    [   65.047224] watchdog: watchdog0: nowayout prevents watchdog being stopped!
    [   65.054138] watchdog: watchdog0: watchdog did not stop!
    ...
    U-Boot SPL 2024.04-ti-gfda88f8bcea3 (Jul 26 2024 - 11:00:12 +0000)
    SYSFW ABI: 4.0 (firmware rev 0x000a '10.0.8--v10.00.08 (Fiery Fox)')
    SPL initial stack usage: 13392 bytes
    Trying to boot from MMC2
    Authentication passed
    Authentication passed
    

    SystemD is able to keep the system running (tested for 5 minutes; not sure if you had to wait longer to observe the behavior):

    root@am62xx-evm:~# uname -a
    Linux am62xx-evm 6.6.32-g6de6e418c80e-dirty #1 SMP PREEMPT Thu Oct 24 17:56:25 CDT 2024 aarch64 GNU/Linux
    
    // set up SystemD and reboot
    root@am62xx-evm:~# vi /etc/systemd/system.conf
    
    //file looks like this:
    RuntimeWatchdogSec=45
    #RuntimeWatchdogPreSec=off
    #RuntimeWatchdogPreGovernor=
    RebootWatchdogSec=60
    #KExecWatchdogSec=off
    #WatchdogDevice=
    
    // reboot
    
    // watchdog is in use
    root@am62xx-evm:~# echo 1 > /dev/watchdog
    -sh: /dev/watchdog: Device or resource busy
    

    I forgot to enable additional visibility into the watchdog with this kernel config option, but I will enable it for the tests on SDK 9.x: the status and timeout attributes under /sys/class/watchdog/watchdogN/ should become visible after building with
    CONFIG_WATCHDOG_SYSFS=y

    Miscellaneous notes 

    It looks like a specific service is in charge of petting the watchdog? By default that service would only try to pet the watchdog halfway through the countdown, but it seems like the service might be able to be modified to send notification messages more frequently. If that is true, perhaps the service that is handling watchdog petting could be rewritten, e.g., to send a notification message every 1/4 of the timeout value? (A rough sketch of that pattern follows after the link below.)

    https://manpages.debian.org/testing/libelogind-dev-doc/sd_watchdog_enabled.3.en.html
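
    Something along these lines, as a minimal sketch using the sd_watchdog_enabled()/sd_notify() API from the man page above (I have not tried this myself, the 1/4 divisor is just the example value, and it assumes the petting service uses the sd_notify protocol):

    #include <stdint.h>
    #include <unistd.h>
    #include <systemd/sd-daemon.h>   /* link with -lsystemd */

    int main(void)
    {
            uint64_t usec = 0;

            /* > 0 means the manager expects WATCHDOG=1 keep-alive notifications */
            if (sd_watchdog_enabled(0, &usec) <= 0)
                    return 0;

            for (;;) {
                    /* pet via the notification socket... */
                    sd_notify(0, "WATCHDOG=1");
                    /* ...every 1/4 of the WATCHDOG_USEC timeout instead of 1/2 */
                    sleep((unsigned int)(usec / 4 / 1000000));
            }
    }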

    Regards,

    Nick

  • Hello Aaron,

    Apologies for the delayed responses here. Are you still running into issues? Are there any updates I should be aware of before trying to test on SDK 9.x?

    Regards,

    Nick