AM625: How to control the watchdog

Part Number: AM625

Dear Sir,

We tried running the userspace sample application from the Linux kernel tree at samples/watchdog/watchdog-simple.c, but it only enables the watchdog, and the board always reboots after 30 seconds.

Unfortunately, in our tests the watchdog API calls below do not work:

ioctl(fd, WDIOC_SETTIMEOUT, &timeout);    /* set the timeout in seconds */
ioctl(fd, WDIOC_GETTIMEOUT, &timeout);    /* get the timeout in seconds */
ioctl(fd, WDIOC_GETTIMELEFT, &timeleft);  /* get the seconds left */

or

echo 'V' > /dev/watchdog   # to send a heartbeat

Could you please provide sample code that enables/disables the watchdog and gets/sets the timeout in seconds?

Regards, Jason


  • Hello Jason,

    It will take me a couple of days to circle back to your thread. Please ping the thread if I do not have a response for you by midweek.

    Regards,

    Nick

  • Hi Nick,

    Do you have any updates regarding WDT control? 

    Regards, Jason

  • Hi Nick,

    Do you have any updates?

    Regards, Jason

  • Hello Jason,

    There is a known watchdog issue that we have been working on for a month or so now, and that issue could prevent you from petting the watchdog (i.e., if you start the watchdog, the system would just reboot after the watchdog timeout even if you try to pet it). We are making progress - you can see updates here: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1311045/am623-wdt_rti-control-via-systemd

    I did not have time to look into your other questions, and I am now on vacation for the rest of March. Please ping the thread in the first week of April to make sure your thread gets attention.

    Regards,

    Nick

  • Nick, 

    The customer just tried the latest SDK 9.2 and added the patch you provided in another E2E thread, but found that the WDT still cannot be controlled: the system reboots after 40 seconds no matter which value they set.

    Is the patch below the one you mentioned that will be migrated into SDK 10?

    Could you check whether the customer's current problem is the same as the known watchdog issue, or a different one?

    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1311045/am623-wdt_rti-control-via-systemd

    https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/5277.v2_2D00_0001_2D00_watchdog_2D00_rti_5F00_wdt_2D00_Set_2D00_min_5F00_hw_5F00_heartbeat_5F00_ms_2D00_to_2D00_55_2D00_of.patch

    BR, Rich

  • Hello Rich,

    Thanks for the ping. No, that patch that you attached is NOT the "final" patch - we are still working with different designs to figure out which one makes sense.

    I've started moving my setup so that I can try out a different patch to see if I can get the customer's usecase working. I'll run tests early next week, please ping the thread if you don't have a response by Wednesday.

    Regards,

    Nick

  • Hello,

    I have started getting things set up on my side, but I have not had the time to run a bunch of tests before the weekend. Please treat the information below as my "work in progress" notes - I'll provide additional details as I do additional tests.

    Step 1: apply the kernel patch to fix the driver 

    The latest patch is from here: https://lore.kernel.org/lkml/20240417205700.3947408-1-jm@ti.com/

    Step 2: enable the watchdog in the kernel configs 

    I followed the steps here to apply the default kernel configuration. By default, the K3 RTI Watchdog timer is set as a module.
    https://software-dl.ti.com/processor-sdk-linux/esd/AM62X/09_01_00_08/exports/docs/linux/Foundational_Components_Kernel_Users_Guide.html 

    TODO: Better to set as a module, or build into the kernel?

    TODO: add CONFIG_WATCHDOG_SYSFS=y as a documented option to allow for more visibility into watchdog and to allow testing by directly writing to the sysfs interface, as per 
    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1311045/am623-wdt_rti-control-via-systemd/4998285#4998285

    Step 3: rebuild the kernel & modules 

    Follow the steps here to rebuild the kernel & modules:
    https://software-dl.ti.com/processor-sdk-linux/esd/AM62X/09_01_00_08/exports/docs/linux/Foundational_Components_Kernel_Users_Guide.html

    Note that you can run the build with parallel jobs to compile faster. For example, to use 8 jobs, add -j8 to the make command.

    Different ways to test 

    Try using the sysfs interface directly as discussed at
    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1311045/am623-wdt_rti-control-via-systemd/5000403#5000403
    Does it grab timeout values from timeout-sec passed through the devicetree? Or is that totally overridden by the driver, like I suspect?

    Try insmod-ing the driver module to see if we can pass different timeout values, as per rti_wdt.c:

    module_param(heartbeat, int, 0);
    MODULE_PARM_DESC(heartbeat,
            "Watchdog heartbeat period in seconds from 1 to "
            __MODULE_STRING(MAX_HEARTBEAT) ", default "
            __MODULE_STRING(DEFAULT_HEARTBEAT));

    TODO: if insmod-ing a driver, how to clarify which watchdog instance you want to use the driver with?

    Try using the samples/watchdog/watchdog-simple.c

    Regards,

    Nick

  • Test: can you use timeout-sec in the Linux devicetree to change the watchdog timeout value? 

    Expected results: no. It looks like the rti_wdt driver overrides the variable that other drivers would use to grab the timeout-sec information.

    Actual results: no

    test setup:

    enable the kernel fix, enable the watchdog as a module in the kernel configs, rebuild the kernel & modules 

    Apply the following patch to the devicetree file, rebuild the devicetree, copy the new kernel, devicetree, modules into the filesystem

    diff --git a/arch/arm64/boot/dts/ti/k3-am625-sk.dts b/arch/arm64/boot/dts/ti/k3-am625-sk.dts
    index f9b7fa2e8156..296860163cb2 100644
    --- a/arch/arm64/boot/dts/ti/k3-am625-sk.dts
    +++ b/arch/arm64/boot/dts/ti/k3-am625-sk.dts
    @@ -366,3 +366,16 @@ K3_TS_OFFSET(12, 17)
                            >;
            };
     };
    +
    +&main_rti1 {
    +       timeout-sec = <1>;
    +};
    +
    +&main_rti2 {
    +       timeout-sec = <10>;
    +};
    +
    +&main_rti3 {
    +       timeout-sec = <30>;
    +};
    

    These are the results

    root@am62xx-evm:~# ls /dev/watchdog
    watchdog   watchdog0  watchdog1  watchdog2  watchdog3  watchdog4
    root@am62xx-evm:~# echo 1 > /dev/watchdog4
    [  207.853772] watchdog: watchdog4: nowayout prevents watchdog being stopped!
    [  207.860811] watchdog: watchdog4: watchdog did not stop!
    // board reboots in 60 seconds for all watchdog instances

    For all watchdog instances, it rebooted after 60 seconds, ignoring the devicetree file changes.
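
    One way to check which timeout a watchdog instance ended up with, without starting it and waiting for a reset, is to read its sysfs attribute. The sketch below assumes the kernel was built with CONFIG_WATCHDOG_SYSFS=y (mentioned in the TODO earlier in this thread) so that /sys/class/watchdog/watchdogN/timeout exists. The helper name wdt_read_sysfs_attr is hypothetical, and this was not part of the tests above.

```c
#include <stdio.h>

/*
 * Read one integer attribute of a watchdog instance from sysfs.
 * Assumption: CONFIG_WATCHDOG_SYSFS=y, which exposes attributes such as
 * /sys/class/watchdog/watchdog4/timeout.
 * Returns 0 on success and stores the value in *value; returns -1 if the
 * file cannot be opened or does not start with an integer.
 * Reading sysfs does not open /dev/watchdogN, so it does not start the timer.
 */
int wdt_read_sysfs_attr(const char *path, long *value)
{
        FILE *f = fopen(path, "r");
        int ok;

        if (!f)
                return -1;
        ok = (fscanf(f, "%ld", value) == 1);
        fclose(f);
        return ok ? 0 : -1;
}
```

    For example, wdt_read_sysfs_attr("/sys/class/watchdog/watchdog4/timeout", &secs) would be expected to report the driver's value rather than the timeout-sec value from the devicetree, if the observation above holds.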

  • edited April 25 2024

    Test: Can you set the watchdog timeout value by passing in kernel module parameters? 

    Expected results: yes

    Actual results: yes

    test setup:

    enable the kernel fix, enable the watchdog as a module in the kernel configs, rebuild the kernel & modules 

    Pass in the module parameters to set the new watchdog heartbeat period

    // module has already been loaded, so unload it
    root@am62xx-evm:~# rmmod rti_wdt
    
    // set the new heartbeat value, let's say 10 seconds
    root@am62xx-evm:~# insmod /lib/modules/6.1.82-00002-g19ed1c2b777d/kernel/drivers/watchdog/rti_wdt.ko heartbeat=10
    
    // start whichever watchdog you want to use
    root@am62xx-evm:~# echo 1 > /dev/watchdog2
    [  336.558216] watchdog: watchdog2: nowayout prevents watchdog being stopped!
    [  336.565290] watchdog: watchdog2: watchdog did not stop!
    // board resets after 10 seconds

    You can also use modprobe, which is probably the preferred method:

    root@am62xx-evm:~# rmmod rti_wdt
    
    root@am62xx-evm:~# modprobe rti_wdt heartbeat=10
    root@am62xx-evm:~# echo 1 > /dev/watchdog
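
    The timeout can also be read back from a C application. The sketch below is only an illustration and was not run on the AM62x: the helper names wdt_query_timeout and wdt_pet are hypothetical, while the ioctls are the standard ones from <linux/watchdog.h>. Since the tests in this thread only change the timeout through the heartbeat module parameter, the WDIOC_SETTIMEOUT attempt is treated as something that may fail on this driver.

```c
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/*
 * Query, and optionally try to change, the timeout of an already-open
 * watchdog device using the standard Linux watchdog ioctls.
 *
 * fd:        descriptor from open("/dev/watchdogN", O_WRONLY). Opening the
 *            device starts the watchdog, so the caller must keep petting it.
 * request_s: timeout to request in seconds, or 0 to skip the set attempt.
 * timeout_s: receives the timeout the driver reports afterwards.
 *
 * Returns 0 on success, 1 if only the set attempt failed (a driver may not
 * support WDIOC_SETTIMEOUT), and -1 if the timeout could not be read.
 */
int wdt_query_timeout(int fd, int request_s, int *timeout_s)
{
        int set_failed = 0;

        if (request_s > 0) {
                int t = request_s;

                if (ioctl(fd, WDIOC_SETTIMEOUT, &t) < 0)
                        set_failed = 1;
        }
        if (ioctl(fd, WDIOC_GETTIMEOUT, timeout_s) < 0)
                return -1;
        return set_failed;
}

/* Pet the watchdog once; returns 0 on success, -1 on error. */
int wdt_pet(int fd)
{
        return ioctl(fd, WDIOC_KEEPALIVE, 0) < 0 ? -1 : 0;
}
```

    A caller would open /dev/watchdogN with O_WRONLY, call wdt_query_timeout(fd, 0, &t) to read the current value, and then call wdt_pet(fd) from its main loop at an interval shorter than the reported timeout.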

  • Test: Can I use SystemD to control the watchdog? 

    Expected results: yes

    Actual results: yes 

    Documentation is here:
    https://www.freedesktop.org/software/systemd/man/latest/systemd-system.conf.html

    test setup: 

    First, check to see that /dev/watchdog (i.e., /dev/watchdog0) is not running by default after Linux boot:

    //If I wait for a couple of minutes here, the system will NOT reboot
    // now let's check if the watchdog is running
    
    // if systemD is already controlling the watchdog,
    // then the echo will return "device or resource busy"
    
    root@am62xx-evm:~# echo 1 > /dev/watchdog
    [  143.886043] watchdog: watchdog: nowayout prevents watchdog being stopped!
    [  143.893101] watchdog: watchdog: watchdog did not stop!
    
    // the echo has now started the watchdog
    // however, there is no service to pet the watchdog
    // the system will reboot after the default 60 second timeout

    Next, update /etc/systemd/system.conf to enable the watchdog and set a timeout.

    root@am62xx-evm:~# vi /etc/systemd/system.conf
    
    // uncomment RuntimeWatchdogSec and provide the desired timeout period
    // for safety, I am not setting this less than 45 seconds
    // If the watchdog is not properly pet, then
    // that gives me time to open this file, edit it, and save it before
    // a reboot is forced 
    
    RuntimeWatchdogSec=45
    #RebootWatchdogSec=10min
    #KExecWatchdogSec=off
    #WatchdogDevice=
    

    After a reboot, so that SystemD picks up the new configuration, we should see "Device or resource busy" when we try to interact with watchdog0. This tells us that SystemD is running and controlling it:

    // Let's wait for a couple of minutes before doing anything
    // since the processor does not reboot, the watchdog is either running
    // and getting pet properly, or it is not running
    
    // this confirms that the watchdog is running
    root@am62xx-evm:~# echo 1 > /dev/watchdog
    -sh: /dev/watchdog: Device or resource busy
    
    // let's double check by starting another watchdog
    // this should force the processor to reboot after the default 60 seconds
    root@am62xx-evm:~# echo 1 > /dev/watchdog2
    [  143.886043] watchdog: watchdog2: nowayout prevents watchdog being stopped!
    [  143.893101] watchdog: watchdog2: watchdog did not stop!
    

    Let's see if we can select a different watchdog counter:

    root@am62xx-evm:~# vi /etc/systemd/system.conf
    
    RuntimeWatchdogSec=45
    #RebootWatchdogSec=10min
    #KExecWatchdogSec=off
    WatchdogDevice=/dev/watchdog2

    Now /dev/watchdog2 should be the busy one, and /dev/watchdog (or /dev/watchdog0) should not be running

    root@am62xx-evm:~# echo 1 > /dev/watchdog2
    -sh: /dev/watchdog2: Device or resource busy
    root@am62xx-evm:~# echo 1 > /dev/watchdog
    [   44.130020] watchdog: watchdog0: nowayout prevents watchdog being stopped!
    [   44.137091] watchdog: watchdog0: watchdog did not stop!
    

    Regards,

    Nick

  • Testing a variation on an older patch: patch v2, with the conditional statement removed.

    That is, this patch: https://lore.kernel.org/linux-watchdog/e1d1aad3-0635-45e1-9470-6398a04820d0@ti.com/

    with the following code section removed:

    -       /*
    -        * If watchdog is running at 32k clock, it is not accurate.
    -        * Adjust frequency down in this case so that we don't pet
    -        * the watchdog too often.
    -        */
    -       if (wdt->freq < 32768)
    -               wdt->freq = wdt->freq * 9 / 10;
    

    Expected results: patch should work fine with larger timeout values, but stop working on smaller timeouts

    Actual results: patch works for timeout values of 7 seconds or larger, fails for 6 seconds or shorter

    Test methodology:

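    Modify samples/watchdog/watchdog-simple.c to pet every 5 milliseconds instead of every 10 seconds:

    // SPDX-License-Identifier: GPL-2.0
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <time.h>

    int main(void)
    {
            /*
             * 5 ms between pets. sleep() only takes whole seconds, so
             * sleep(.005) would truncate to sleep(0) and not wait at all.
             */
            const struct timespec pet_interval = { .tv_sec = 0, .tv_nsec = 5 * 1000 * 1000 };
            /* opening the device starts the watchdog */
            int fd = open("/dev/watchdog", O_WRONLY);
            int ret = 0;

            if (fd == -1) {
                    perror("watchdog");
                    exit(EXIT_FAILURE);
            }
            while (1) {
                    /* any write to the device pets the watchdog */
                    ret = write(fd, "\0", 1);
                    if (ret != 1) {
                            ret = -1;
                            break;
                    }
                    printf("sleep 5ms\n");
                    nanosleep(&pet_interval, NULL);
            }
            close(fd);
            return ret;
    }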
    

    Rebuild the userspace code. I did it with gcc-arm-9.2-2019.12, but other cross compilers should also work:

    $ rm a.out
    $ ~/ti/gcc-arm-9.2-2019.12-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-gcc samples/watchdog/watchdog-simple.c
    $ ls
    a.out  block  COPYING  crypto         drivers  include  io_uring  Kbuild   kernel  LICENSES     Makefile  net     rust     scripts   sound  usr
    arch   certs  CREDITS  Documentation  fs       init     ipc       Kconfig  lib     MAINTAINERS  mm        README  samples  security  tools  virt
    

    Copy the new watchdog-simple code into the filesystem.

    Set the watchdog timer to the desired timeout value (e.g., 10 seconds), then run the watchdog-simple code. If the driver is working properly, the system should not reset.

    am62xx-evm login: root
    root@am62xx-evm:~# rmmod rti_wdt
    root@am62xx-evm:~# modprobe rti_wdt heartbeat=10
    root@am62xx-evm:~# ./a.out
    

  • Edited May 8 2024

    Testing patch v3

    Expected results: if we assume Linux will take <200ms to pet the watchdog, the patch should work for timeouts above about 0.94 sec.

    Actual results: patch v3 works with a timeout of 1 sec and a pet interval of 5ms. The 1 second timeout test with a 50% open window ran successfully for 24 hours. The 1 second timeout test ALSO ran successfully with a 3.125% open window, which was surprising. That result is explained by the math in the next response, from May 8.

    What is the math for patch v3? 

    This is... kinda complex. For now, I will just provide a very basic equation, and if needed later I can provide additional background.

    Note: This is a conservative equation that assumes that the timeout can be set to a fraction of a second. If we can guarantee that ONLY whole numbers are programmed as the timeout value, then
    sw_timeout < hw_timeout < sw_timeout + 0.25 seconds
    becomes
    hw_timeout = sw_timeout + [ (2^13 clocks) / (32,768 Hz)] = sw_timeout + 0.25 sec

    I'll cover that variation of the equation in the next response.

    open_window_time = timeout x open_window_percentage > safety_margin + max_service_time

    safety_margin

    1) The hw_open_window_time can be up to 0.25 sec longer than the sw_open_window_time, depending on the timeout and the open_window_percentage.

    2) Also, since the watchdog hardware and the Linux watchdog driver can use different clock sources, there can be clock drift between them. Driver patch v3 builds in a 2% error margin for clock drift.

    Thus, safety_margin = 0.25sec + 2% x timeout

    max_service_time 

    How long is it going to take Linux to pet the watchdog after the software window opens up? This will depend on actual usecase, and you may not be able to guarantee that Linux will ALWAYS respond within this timeframe (for example, I would NOT rely on Linux always petting the watchdog within 1ms).

    open_window_percentage 

    Currently, the Linux watchdog driver only supports setting the window to 50%. However, if the watchdog is initialized by another source (like U-Boot), then the window can be set to other values. The smallest open_window_percentage the hardware supports is 3.125%.

    So the equation simplifies to 

    timeout x open_window_percentage > safety_margin + max_service_time

    timeout x open_window_percentage > [0.25sec + 2% x timeout] + max_service_time

    timeout x (open_window_percentage - 0.02) > 0.25sec + max_service_time

    timeout > (0.25sec + max_service_time) / (open_window_percentage - 0.02)

    What is the minimum allowable timeout value? (assuming timeout can be programmed to fractions of a second)

    Case 1: 50% window, assume 200ms max_service_time 

    timeout > (0.25sec + 0.2sec) / (0.50 - 0.02) = 0.45 sec / 0.48 = 0.9375 sec, or about 0.94 sec

    Case 2: 3.125% window, assume 200ms max_service_time 

    timeout > (0.25sec + 0.2sec) / (0.03125 - 0.02) = 0.45 sec / 0.01125 = 40sec

    Case 3: 50% window, assume 10,200ms max_service_time (userspace app tries to pet watchdog every 10 sec + Linux overhead) 

    timeout > (0.25sec + 10.2sec) / (0.50 - 0.02) = 10.45 sec / 0.48 = 21.8 sec
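
    The three cases above can be reproduced with a small helper. This is only a sketch of the conservative equation as written in this post; the function name wdt_min_timeout_s is hypothetical, and the 0.25 sec and 2% constants are the ones stated above.

```c
/*
 * Minimum timeout in seconds from the conservative equation:
 *   timeout > (0.25 + max_service_time) / (open_window - 0.02)
 *
 * open_window:   open window as a fraction (0.5 for 50%, 0.03125 for 3.125%)
 * max_service_s: worst-case time for Linux to pet the watchdog, in seconds
 *
 * Returns a negative value when open_window <= 0.02: the 2% clock-drift
 * margin then uses up the whole window and no timeout satisfies the
 * inequality.
 */
double wdt_min_timeout_s(double open_window, double max_service_s)
{
        const double hw_sw_gap_s = 0.25;  /* 2^13 clocks / 32,768 Hz */
        const double drift = 0.02;        /* margin built into patch v3 */

        if (open_window <= drift)
                return -1.0;
        return (hw_sw_gap_s + max_service_s) / (open_window - drift);
}
```

    For example, wdt_min_timeout_s(0.5, 0.2) corresponds to Case 1 and wdt_min_timeout_s(0.03125, 0.2) to Case 2.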

  • The math presented for v3 above assumes that there is an unknown difference between the hw_open_window_time and the sw_open_window_time. That is true if the timeout value can be programmed to a fraction of a second (for example, 6.5 seconds).

    However, if you can guarantee that the timeout value can ONLY be programmed to whole numbers, then

    1) we guarantee hw_timeout = sw_timeout + [ (2^13 clocks) / (32,768 Hz)] = sw_timeout + 0.25 sec

    2) we guarantee a known difference between hw_open_window_time and sw_open_window_time that only varies based on the open window percentage, 0.25sec x (1 - open_window_percentage)

    So the equation if we can guarantee whole numbers for the timeout simplifies to 

    sw_timeout x open_window_percentage + (hw_timeout - sw_timeout) > safety_margin + max_service_time

    sw_timeout x open_window_percentage + 0.25sec > [0.25sec x (1 - open_window_percentage) + 2% x sw_timeout] + max_service_time

    sw_timeout x (open_window_percentage - 0.02) + 0.25sec > 0.25sec - 0.25sec x open_window_percentage + max_service_time

    sw_timeout > (- 0.25sec x open_window_percentage + max_service_time) / (open_window_percentage - 0.02)

    What is the minimum allowable timeout value? (assuming timeout CANNOT be programmed to fractions of a second)

    Case 1: 50% window, assume 200ms max_service_time 

    timeout > (- 0.25 x 0.5 + 0.2) / (0.5 - 0.02) = (- 0.125 + 0.2) / (0.48) = 0.15625 sec

    Case 2: 3.125% window, assume 200ms max_service_time 

    timeout > (- 0.25 x 0.03125 + 0.2) / (0.03125 - 0.02) = (- 0.0078125 + 0.2) / (0.01125) = 17.08 sec

    Case 3: 50% window, assume 10,200ms max_service_time (userspace app tries to pet watchdog every 10 sec + Linux overhead) 

    timeout > (- 0.25 x 0.5 + 10.2) / (0.5 - 0.02) = (- 0.125 + 10.2) / (0.48) = 20.99 sec

    Case 4: 3.125% window, assume 10ms max_service_time 

    timeout > (- 0.25 x 0.03125 + 0.01) / (0.03125 - 0.02) = (- 0.0078125 + 0.01) / (0.01125) = 0.19 sec
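
    The whole-number variant of the equation can be sketched the same way. The function name wdt_min_whole_timeout_s is hypothetical, and the code only restates the inequality derived in this post.

```c
/*
 * Lower bound on sw_timeout in seconds when only whole-number timeouts
 * are programmed:
 *   sw_timeout > (max_service_time - 0.25 x open_window)
 *                / (open_window - 0.02)
 *
 * open_window:   open window as a fraction (0.5 for 50%, 0.03125 for 3.125%)
 * max_service_s: worst-case time for Linux to pet the watchdog, in seconds
 *
 * Returns a negative value when open_window <= 0.02 (no timeout satisfies
 * the inequality), and 0 when the bound itself would be negative (any
 * positive timeout satisfies it).
 */
double wdt_min_whole_timeout_s(double open_window, double max_service_s)
{
        const double hw_extra_s = 0.25;  /* hw_timeout - sw_timeout */
        const double drift = 0.02;       /* margin built into patch v3 */
        double bound;

        if (open_window <= drift)
                return -1.0;
        bound = (max_service_s - hw_extra_s * open_window) / (open_window - drift);
        return bound > 0.0 ? bound : 0.0;
}
```

    For example, wdt_min_whole_timeout_s(0.03125, 0.2) corresponds to Case 2 and wdt_min_whole_timeout_s(0.03125, 0.01) to Case 4.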

    Does this math hold up? 

    Yes. As expected, a 1 second timeout value with a 3.125% open window allowed the processor to run overnight when the pet interval was set to 5ms, which guaranteed the "service_time" < 10ms. However, adding even a couple of print statements to the watchdog-simple script was enough to push the service time over 10ms, cause a watchdog timeout, and force a processor reset.

    Regards,

    Nick

  • Just checking in, is there any additional information you need from our side?

    Regards,

    Nick

  • For future readers, this thread is about the watchdog issue where the watchdog could not be pet, resulting in a processor reset. We are now working through the opposite issue: some customers are observing that the processor will never reset, regardless of when the watchdog is pet. You can see updates on that discussion here:
    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1370422/am62p-am62p 

    and here:
    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1377565/re-am623-watchdog-will-not-reset-processor

    Regards,

    Nick