This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

AM623: WDT_RTI control via systemd

Part Number: AM623
Other Parts Discussed in Thread: AM62P

Champs,

customer would like to control WD via systemd daemon. In order to do so they set RuntimeWatchdogSec to 60 in /etc/systemd/system.conf . the expectation is that the systemd daemon will pat the dog as long as the system is operating. Instead, the board reboots ~every 60s. I have also observed the same behavior  on SK EVM running SDK 9.01 default image. 

Question: has wdt_rti been tested with systemd? is it supposed to work?

Also, I noticed that the wdt_rti driver is built as a module and is somewhere inserted during Linux boot. When tried using it as built-in (CONFIG_K3_RTI_WATCHDOG=y) Linux produces the following error message ~every 100s:

vmap allocation for size 479232 failed: use vmalloc=<size> to increase size
modprobe: vmalloc error: size 475136, vm_struct allocation failed, mode

it looks like it still somehow tries to install the module despite the fact that it is compiled in. 

Questions:

1. Where wdt_rti module is installed (no trace in dmesg)?

2. is it possible to use it as a built-in driver?

thank you

Michael

thank you

Michael

  • Hello Michael,
    Here's one recent e2e post on RTI_WDT in AM64x Linux SDK for your reference.
    e2e.ti.com/.../am6442-watchdog-timer-issue-on-sr2-0-silicon
    Best,
    -Hong

  • Hi Hong,

    Yes, I've seen this and other discussions concerning rti_wdt. it does not apply to my question though since the driver in general does work. I was able to run

    ./runltp -P am62xx-sk -f ddt/wdt_test -s WDT_S_FUNC_GETSTATUS

    and it completed successfully restarting the board in 60 sec. 

    What customer and I are struggling with is operating the WD using systemd framework. Quoting from the description (https://www.freedesktop.org/software/systemd/man/latest/systemd-system.conf.html):

    If RuntimeWatchdogSec= is set to a non-zero value, the watchdog hardware (/dev/watchdog0 or the path specified with WatchdogDevice= or the kernel option systemd.watchdog-device=) will be programmed to automatically reboot the system if it is not contacted within the specified timeout interval. The system manager will ensure to contact it at least once in half the specified timeout interval. This feature requires a hardware watchdog device to be present, as it is commonly the case in embedded and server systems. Not all hardware watchdogs allow configuration of all possible reboot timeout values, in which case the closest available timeout is picked.

    the above does not happen. Instead, the board keeps rebooting every RuntimeWatchdogSec interval. We do have systemd running on the default SDK image, but its control of the rti_wdt seems to be broken. Could you please check with the driver owner to see what could be wrong?

    thank you

    Michael

  • Hello Michael,
    thanks for clarifications on the user case. I'll pass it to my colleague to follow up...
    Best,
    -Hong

  • Hello Michael,

    Apologies for the delayed response here. I want to replicate your observations on my end, but I ran out of time this week. Monday is a holiday. Please keep me honest and ping the thread if I have not responded by Wednesday.

    Regards,

    Nick

  • Hello Michael,

    Thank you for your patience here. First, I do not see any reported bugs in the system.

    I am still figuring out how to tell if the watchdog is actually being used, and by whom. Just changing the settings in /etc/systemd/system.conf and rebooting the board does not seem to impact anything.

    I tested by changing the settings like this:

     RuntimeWatchdogSec=17
     RebootWatchdogSec=10min
     KExecWatchdogSec=off
     WatchdogDevice=
    

    And nothing seemed to change. No timeouts, the board just keeps running.

    If I tried to see the status of the watchdog with wdctl, then a countdown DOES start. If I just enter wdctl once, then do nothing, the board will reset after 60 seconds. typing out wdctl again when the "timeleft" value is less than half of the Timeout value leads to the watchdog getting pet and the counter restarting. Either way, the 17 second setting in the system.conf file is not applied.

    root@am62xx-evm:~# wdctl
    
    [   42.989666] watchdog: watchdog0: nowayout prevents watchdog being stopped!
    
    [   42.996623] watchdog: watchdog0: watchdog did not stop!
    
    Device:        /dev/watchdog0
    
    Identity:      K3 RTI Watchdog [version 0]
    
    Timeout:       60 seconds
    
    Pre-timeout:    0 seconds
    
    Timeleft:      60 seconds
    
    FLAG           DESCRIPTION           STATUS BOOT-STATUS
    
    KEEPALIVEPING  Keep alive ping reply      1           0
    

    I'll check with the developers to see if anyone is familiar with using systemd with the watchdog timer.

    Regards,

    Nick

  • Nick,

    it looks like /etc/systemd/system.conf is only checked upon boot. I change only 

    RuntimeWatchdogSec=40

    then restart the board and it starts going through resets continuously. Dont make shorted than 40 sec or you won't have enough time to change it back between resets Slight smile

    let me know if you see that

    thank you

    Michael

  • Hello Michael,

    The first couple people I talked to are unfamiliar with the watchdog timer. I'll keep asking around, but we might not have any local expertise around watchdog at this point.

    Ahh, I see what I was doing wrong - I wasn't uncommenting the watchdog settings in the /etc/systemd/system.conf file. Now that I've deleted the # I can replicate your observations.

    With that said, at first I was getting a timeout that was actually 60 seconds, and now I'm getting a timeout after 30 seconds (even though the setting is still RuntimeWatchdogSec=60 - not sure what changed).

    I am running out of time to do more testing today, so I'll add some potentially useful (but untested) notes from looking around online:

    /sys/class/watchdog/watchdogN/ status and timeout might be visible after changing kernel parameter 
    CONFIG_WATCHDOG_SYSFS=y

    systemctl status watchdog 
    I get the same output, regardless of whether the watchdog has been enabled in the systemd file or not:

    root@am62xx-evm:~# systemctl status watchdog
    
    
    * watchdog.service - watchdog daemon
    
         Loaded: loaded (8;;file://am62xx-evm/lib/systemd/system/watchdog.service/lib/systemd/system/watchdog.service8;;; disabled; vendor preset: disabled)8;;
    
         Active: inactive (dead)
    

    I saw a comment here https://unix.stackexchange.com/questions/684671/how-to-check-systemd-hardware-watchdog-state about how at some point systemd got native hardware watchdog support. This might be something we would want to dig into?

    Regards,

    Nick

  • Nick,

    CONFIG_WATCHDOG_SYSFS=y does indeed expose number of different parameters:

    root@am62xx-evm:~# ls /sys/class/watchdog/watchdog0
    bootstatus dev device identity max_timeout min_timeout nowayout power state status subsystem timeleft timeout uevent

    I can activate the WD by writing 'active' to "status" or via "systemctl start watchdog" or via "wdctl /dev/watchdog0" but in all cases the WD just starts off and reboots the system upon expiration.

    The timeouts are different though: systemctl or status cause reboot after 30sec. wdctl after 60sec. 

    when activating WD via wdctl:

    root@am62xx-evm:~#
    root@am62xx-evm:~# wdctl
    [ 198.408194] watchdog: watchdog0: nowayout prevents watchdog being stopped!
    [ 198.415116] watchdog: watchdog0: watchdog did not stop!
    Device: /dev/watchdog0
    Identity: K3 RTI Watchdog [version 0]
    Timeout: 60 seconds
    Pre-timeout: 0 seconds
    Timeleft: 60 seconds
    FLAG DESCRIPTION STATUS BOOT-STATUS
    KEEPALIVEPING Keep alive ping reply 1 0

    when activating via systemd:

    root@am62xx-evm:~# systemctl start watchdog
    root@am62xx-evm:~# systemctl status watchdog
    * watchdog.service - watchdog daemon
    Loaded: loaded (/lib/systemd/system/watchdog.service; disabled; vendor preset: disabled)
    Active: active (running) since Thu 1970-01-01 00:01:59 UTC; 8s ago
    Process: 1613 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modprobe $watchdog_module (code=exited, status=0/SUCCESS)
    Process: 1615 ExecStart=/bin/sh -c [ x$run_watchdog != x1 ] || exec /usr/sbin/watchdog $watchdog_options (code=exited, status=0/SUCCESS)
    Main PID: 1617 (watchdog)
    Tasks: 1 (limit: 2154)
    Memory: 684.0K
    CGroup: /system.slice/watchdog.service
    `- 1617 /usr/sbin/watchdog

    Jan 01 00:01:59 am62xx-evm watchdog[1617]: interface: no interface to check
    Jan 01 00:01:59 am62xx-evm watchdog[1617]: temperature: no sensors to check
    Jan 01 00:01:59 am62xx-evm watchdog[1617]: no test binary files
    Jan 01 00:01:59 am62xx-evm watchdog[1617]: no repair binary files
    Jan 01 00:01:59 am62xx-evm watchdog[1617]: error retry time-out = 60 seconds
    Jan 01 00:01:59 am62xx-evm watchdog[1617]: repair attempts = 1
    Jan 01 00:01:59 am62xx-evm watchdog[1617]: alive=/dev/watchdog heartbeat=[none] to=root no_act=no force=no
    Jan 01 00:01:59 am62xx-evm watchdog[1617]: cannot set timeout 60 (errno = 95 = 'Operation not supported')
    Jan 01 00:01:59 am62xx-evm watchdog[1617]: hardware watchdog identity: K3 RTI Watchdog
    Jan 01 00:01:59 am62xx-evm systemd[1]: Started watchdog daemon.

    So it looks like systemd watchdog service is activated and running but fails to activate the timeout and pet the dog.  I think at this point we need support from the driver's maintainers. 

    thank you

    Michael

  • Hello Michael,

    some guidance for us on interpreting the output of systemctl status watchdog:
    https://trstringer.com/systemctl-status-output-explained/

    We are slowly making progress...

    when I enable the watchdog in systemd, I see the following in the boot log which matches your log output:

    root@am62xx-evm:~# dmesg | grep watchdog
    [    2.325550] systemd[1]: Using hardware watchdog 'K3 RTI Watchdog', version 0, device /dev/watchdog0
    [    2.334741] systemd[1]: Modifying watchdog timeout is not supported, reusing the programmed timeout.
    

    Which explains why my board only gives me about 35 seconds from boot time before the board resets, regardless of what value I put into the systemd file. 

    Also interesting that the status of watchdog.daemon is "inactive" if enabling through that system.conf file instead of systemctl start watchdog. I am not sure if there is a separate place where we would need to explicitly call the watchdog daemon during the boot process?

    root@am62xx-evm:~# systemctl status watchdog
    
    * watchdog.service - watchdog daemon
         Loaded: loaded (8;;file://am62xx-evm/lib/systemd/system/watchdog.service/lib/systemd/system/watchdog.servi
    ce8;;; disabled; vendor preset: disabled)8;;
         Active: inactive (dead)
    

    I'll go ahead and pass your observations on to the wider team. The only feedback I got on Friday was "you need a daemon to pet the watchdog" but clearly your output "Started watchdog daemon" shows that things are not working as expected.

    Regards,

    Nick

  • Hello Michael,

    I apologize for the slow responses here - I haven't had time to do any more tests on my side yet.

    So far I have not been able to find anyone with expertise on this area. The only additional feedback I've gotten is this:

    Jan 01 00:01:59 am62xx-evm watchdog[1617]: cannot set timeout 60 (errno = 95 = 'Operation not supported')

    Perhaps our timeout is fixed (non-configurable), and with this we need to not try to set it to 60, _or_ set it to exactly what is supported. I’d look into this.

    Once the watchdog timeout has been set once, my understanding is that the timeout value cannot be changed until the processor reboots. But I'm not sure if that would prevent the "petting" function of the systemD code from working. The terminal output during bootup from my previous response seems to indicate that systemd is flexible enough to reuse the currently programmed timeout. (though it still timed out on me)

    I've got one more person locally I can try to talk to, I'll try to hunt them down this week.

    Regards,

    Nick

  • Good evening, I have been fighting the same issue, and have at least one other data point to add to this discussion. Upon a suggestion by on the Beagleplay Discord Channel, I cross compiled and ran the kernel's own watchdog-test.c utility. On a stock eMMC Debian distribution (downloaded from a mirror, but nevertheless official).

    BTW, I can confirm all the same behaviors (w.r.t `wdctl` and `systemd`) that has reported here.

    Here are the outputs from my invocation of this very simple test utility:

    root@BeaglePlay:/home/debian# ./watchdog-test -p 5
    Watchdog ping rate set to 5 seconds.
    Watchdog Ticking Away!
    ......
    U-Boot SPL 2023.04-g43791d94 (Dec 28 2023 - 17:37:03 +0000)
    SYSFW ABI: 3.1 (firmware rev 0x0009 '9.1.8--v09.01.08 (Kool Koala)')
    ...

    The system faithfully reboots after about 30 seconds.

    On another window, I called `strace` on the pid of the watchdog-test process, and here is what that showed:

    root@BeaglePlay:/home/debian# strace -p 1305
    strace: Process 1305 attached
    restart_syscall(<... resuming interrupted io_setup ...>) = 0
    ioctl(3, WDIOC_KEEPALIVE)               = 0
    write(1, ".", 1)                        = 1
    clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=5, tv_nsec=0}, 0xffffec52a3b0) = 0
    ioctl(3, WDIOC_KEEPALIVE)               = 0
    write(1, ".", 1)                        = 1
    clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=5, tv_nsec=0}, 0xffffec52a3b0) = 0
    ioctl(3, WDIOC_KEEPALIVE)               = 0
    write(1, ".", 1)                        = 1
    clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=5, tv_nsec=0}, 0xffffec52a3b0) = 0
    ioctl(3, WDIOC_KEEPALIVE)               = 0
    write(1, ".", 1)                        = 1
    clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=5, tv_nsec=0},
    

    So clearly, the 'petting' IOCTL calls every 5 seconds are succeeding (witness the return value of 0), which would indicate a silent failure on the part of the hardware watchdog driver, IMO.

    Regards,

    Sidd

  • I tested the watchdog-simple application on SDK 9.0 (kernel 6.1) as well today under samples/watchdog/watchdog-simple.c, and it also lead to a reboot instead of perpetually running. So at this point it looks like a driver issue instead of an issue specifically with systemD.

    I haven't tested on SDKs 8.x yet to see if the behavior also exists on kernel 5.10. That will be the next thing I test.

    Regards,

    Nick

  • Hello Michael & Sidd,

    I did one more test today where it does seem like the watchdog timer is definitely broken:

    root@am62xx-evm:/opt/ltp# ./runltp -P am62xx-sk -f ddt/wdt_test -s WDT_M_FUNC_KEEPALIVE
    
    INFO: Filtering testscenarios based on am62xx-sk capabilities
    awk: cmd. line:1: warning: regexp escape sequence `\&' is not a known regexp operator
                                                                                                                                                              [1709/6939]
    Checking for required user/group ids
    
    
    'root' user id and group found.
    'nobody' user id and group found.
    'bin' user id and group found.
    'daemon' user id and group found.
    Users group found.
    Sys group found.
    Required users/groups exist.
    If some fields are empty or look unusual you may have an old version.
    Compare to the current minimal requirements in Documentation/Changes.
    
    
    /etc/os-release
    ID=arago
    NAME="Arago"
    VERSION="2023.04"
    VERSION_ID=2023.04
    PRETTY_NAME="Arago 2023.04"
    
    uname:
    Linux am62xx-evm 6.1.33-00001-ge4769b223d98 #1 SMP PREEMPT Sun Nov 12 10:19:18 CST 2023 aarch64 aarch64 aarch64 GNU/Linux
    
    /proc/cmdline                                                                                                                                                     [1664/6939]
    console=ttyS2,115200n8 earlycon=ns16550a,mmio32,0x02800000 mtdparts=spi-nand0:512k(ospi.tiboot3),2m(ospi.tispl),4m(ospi.u-boot),256k(ospi.env),256k(ospi.env.backup),98048k@3
    2m(ospi.rootfs),256k@130816k(ospi.phypattern);omap2-nand.0:2m(NAND.tiboot3),2m(NAND.tispl),2m(NAND.tiboot3.backup),4m(NAND.u-boot),256k(NAND.u-boot-env),256k(NAND.u-boot-env
    .backup),-(NAND.file-system) root=PARTUUID=b23245c5-02 rw rootfstype=ext4 rootwait
    
    Gnu C                  gcc (GCC) 11.3.0
    Clang
    Gnu make               4.3
    util-linux             2.37.4
    mount                  linux 2.37.4 (libmount 2.37.4: btrfs, namespaces, assert, debug)
    modutils               29
    e2fsprogs              1.46.5
    Linux C Library        x 1 root root 1630088 Mar  9  2018 /lib/libc.so.6
    Dynamic linker (ldd)   2.35
    Linux C++ Library      6.0.29
    Procps                 3.3.17-dirty
    Net-tools              2.10
    iproute2               iproute2-5.17.0
    iputils                'V'
    ethtool                5.16
    Sh-utils               9.0
    
    Modules Loaded         xt_conntrack xt_addrtype iptable_filter br_netfilter bridge stp llc iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc
    32c ip_tables x_tables overlay xhci_plat_hcd rpmsg_ctrl rpmsg_char cdns_csi2rx dwc3 snd_soc_hdmi_codec v4l2_fwnode pru_rproc irq_pruss_intc wlcore_sdio wl18xx wlcore mac802$
    1 libarc4 cfg80211 rfkill crct10dif_ce snd_soc_simple_card snd_soc_simple_card_utils ti_k3_r5_remoteproc virtio_rpmsg_bus display_connector rtc_ti_k3 rpmsg_ns dwc3_am62 j721
    e_csi2rx videobuf2_dma_contig videobuf2_memops ti_k3_m4_remoteproc ti_k3_common videobuf2_v4l2 tidss drm_dma_helper sii902x videobuf2_common drm_kms_helper v4l2_async cfbfil
    lrect videodev syscopyarea snd_soc_davinci_mcasp mc cfbimgblt sysfillrect sysimgblt snd_soc_ti_udma snd_soc_ti_edma fb_sys_fops cdns_dphy_rx cfbcopyarea snd_soc_ti_sdma prus
    s mcrc sa2ul tps6598x snd_soc_tlv320aic3x_i2c snd_soc_tlv320aic3x typec fuse drm drm_panel_orientation_quirks ipv6                                                [1619/6939]
    
    
    free reports:
                   total        used        free      shared  buff/cache   available
    Mem:         1972872      553340      709128       73028      710404     1271316
    Swap:              0           0           0
    
    cpuinfo:
    Architecture:            aarch64
      CPU op-mode(s):        32-bit, 64-bit
      Byte Order:            Little Endian
    CPU(s):                  4
      On-line CPU(s) list:   0-3
    Vendor ID:               ARM
      Model name:            Cortex-A53
        Model:               4
        Thread(s) per core:  1
        Core(s) per cluster: 4
        Socket(s):           -
        Cluster(s):          1
        Stepping:            r0p4
        CPU max MHz:         1400.0000
        CPU min MHz:         200.0000
        BogoMIPS:            400.00
        Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
    Caches (sum of all):
      L1d:                   128 KiB (4 instances)
      L1i:                   128 KiB (4 instances)
      L2:                    512 KiB (1 instance)
    Vulnerabilities:
      Itlb multihit:         Not affected
      L1tf:                  Not affected
      Mds:                   Not affected
      Meltdown:              Not affected
      Mmio stale data:       Not affected
      Retbleed:              Not affected
      Spec store bypass:     Not affected
      Spectre v1:            Mitigation; __user pointer sanitization
      Spectre v2:            Not affected
      Srbds:                 Not affected
      Tsx async abort:       Not affected
    
    available filesystems:
    9p autofs bdev bpf cgroup cgroup2 configfs cpuset debugfs devpts devtmpfs ext2 ext3 ext4 fuse fuseblk fusectl hugetlbfs mqueue nfs nfs4 overlay pipefs proc pstore ramfs rpc$
    pipefs securityfs sockfs squashfs sysfs tmpfs vfat
    
    mounted filesystems (/proc/mounts):
    /dev/root / ext4 rw,relatime 0 0
    devtmpfs /dev devtmpfs rw,relatime,size=919844k,nr_inodes=229961,mode=755 0 0
    proc /proc proc rw,relatime 0 0
    sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
    securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
    tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
    devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
    tmpfs /run tmpfs rw,nosuid,nodev,size=394576k,nr_inodes=819200,mode=755 0 0
    tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755 0 0
    cgroup2 /sys/fs/cgroup/unified cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0
    cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,name=systemd 0 0
    pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
    bpf /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0
    cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
    cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
    cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
    cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
    cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
    cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
    cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
    cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
    cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0                                                                                        [1484/6939]
    cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
    hugetlbfs /dev/hugepages hugetlbfs rw,relatime,pagesize=2M 0 0
    mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
    debugfs /sys/kernel/debug debugfs rw,nosuid,nodev,noexec,relatime 0 0
    tmpfs /tmp tmpfs rw,nosuid,nodev,nr_inodes=1048576 0 0
    fusectl /sys/fs/fuse/connections fusectl rw,nosuid,nodev,noexec,relatime 0 0
    configfs /sys/kernel/config configfs rw,nosuid,nodev,noexec,relatime 0 0
    tmpfs /media/ram tmpfs rw,relatime,size=16384k 0 0
    tmpfs /var/volatile tmpfs rw,relatime,size=51200k 0 0
    /dev/mmcblk1p1 /media/mmcblk1p1 vfat rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro 0 0
    /dev/mmcblk1p1 /run/media/boot-mmcblk1p1 vfat rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro 0 0
    /dev/mmcblk0p1 /run/media/mmcblk0p1 vfat rw,relatime,gid=6,fmask=0007,dmask=0007,allow_utime=0020,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro 0 0
    tmpfs /run/user/1000 tmpfs rw,nosuid,nodev,relatime,size=197284k,nr_inodes=49321,mode=700,uid=1000,gid=1000 0 0
    tmpfs /run/user/0 tmpfs rw,nosuid,nodev,relatime,size=197284k,nr_inodes=49321,mode=700 0 0
    
    mounted filesystems (df):
    Filesystem     Type     [82064.736102] LTP: starting WDT_M_FUNC_KEEPALIVE (source 'common.sh'; do_cmd install_modules.sh 'wdt' ; DEV_NODE=`get_devnode.sh "wdt"` || die "erro
    r getting devnode for wdt"; do_cmd wdt_tests -device $DEV_NODE -ioctl keepalive -loop 120)
    
     Size  Used Avail Use% Mounted on
    /dev/root      ext4       15G  6.0G  7.5G  45% /
    devtmpfs       devtmpfs  899M  4.0K  899M   1% /dev
    tmpfs          tmpfs     964M     0  964M   0% /dev/shm
    tmpfs          tmpfs     386M   69M  317M  18% /run                                                                                                               [1439/6939]
    tmpfs          tmpfs     4.0M     0  4.0M   0% /sys/fs/cgroup
    tmpfs          tmpfs     964M   16K  964M   1% /tmp
    tmpfs          tmpfs      16M     0   16M   0% /media/ram
    tmpfs          tmpfs      50M  8.0K   50M   1% /var/volatile
    /dev/mmcblk1p1 vfat      128M   24M  105M  19% /media/mmcblk1p1
    /dev/mmcblk0p1 vfat       16M  2.0K   16M   1% /run/media/mmcblk0p1
    tmpfs          tmpfs     193M     0  193M   0% /run/user/1000
    tmpfs          tmpfs     193M     0  193M   0% /run/user/0
    
    AppArmor disabled
    
    SELinux mode: unknown
    no big block device was specified on commandline.
    Tests which require a big block device are disabled.
    You can specify it with option -z
    COMMAND:    /opt/ltp/bin/ltp-pan   -e -S   -a 166128     -n 166128 -p -f /tmp/ltp-AcXmiGnBbx/alltests -l /tmp/tmp.QwsZNRFH2o  -C /opt/ltp/output/LTP_RUN_ON-tmp.QwsZNRFH2o.fa
    iled -T /opt/ltp/output/LTP_RUN_ON-tmp.QwsZNRFH2o.tconf
    
    INFO: Restricted to WDT_M_FUNC_KEEPALIVE
    
    LOG File: /tmp/tmp.QwsZNRFH2o
    
    FAILED COMMAND File: /opt/ltp/output/LTP_RUN_ON-tmp.QwsZNRFH2o.failed
    
    TCONF COMMAND File: /opt/ltp/output/LTP_RUN_ON-tmp.QwsZNRFH2o.tconf
    
    Running tests.......
    
    <<<test_start>>>                                                                                                                                                  [1394/6939]
    
    tag=WDT_M_FUNC_KEEPALIVE stime=82071
    
    cmdline="source 'common.sh'; do_cmd install_modules.sh 'wdt' ; DEV_NODE=`get_devnode.sh "wdt"` || die "error getting devnode for wdt"; do_cmd wdt_tests -device $DEV_NODE -io
    ctl keepalive -loop 120"
    
    contacts=""
    
    analysis=exit
    
    <<<test_output>>>
    incrementing stop
    |TRACE LOG|Inside do_cmd:CMD=install_modules.sh wdt|
    |TRACE LOG|Inside do_cmd:CMD=wdt_tests -device /dev/watchdog -ioctl keepalive -loop 120|
    
    |TEST START|wdt_tests|
    |TRACE LOG|******** WDT Testcase  parameters  ******** |
    |TRACE LOG|Device         : /dev/watchdog|
    |TRACE LOG|Loop Count     : 120|
    |TRACE LOG|Operation      : Ioctl|
    |TRACE LOG|Ioctl Name     : WDIOC_KEEPALIVE|
    |TRACE LOG|Ioctl Arg      : 0 |
    |TRACE LOG| ************* End of Test params ************* |
    
    U-Boot SPL 2023.04-g24098ea90d (Jul 06 2023 - 12:59:40 +0000)
    ...
    

    I think the next steps here are to start putting prints into the watchdog timer driver to get more visibility into what is going on. Didn't have time to try any of that today. If either of you are able to dig into it I'd be interested to see what you find.

    Regards,

    Nick

  • Partial update: The dev team has said they'll take a look sometime in the near future. At this point they have not committed to a specific fix timeframe. Feel free to ping me for another update in a week.

    Regards,

    Nick

  • Hello Michael,

    I have looked at this issue a bit and compared wdctl (working case) with watchdog_simple (failing case). It seem like the reason wdctl is working and watchdog_simple is not is because wdctl is using open() and close() repeatedly while watchdog_simple is using only write().

    This is still under debug but here is a quick hack of the watchdog_simple application so you can verify on your end that open() is actually petting the watchdog.

    https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/test_2D00_wdgsimple

    Let me know your findings when you can.

    ~ Judith

  • Hi Judith,

    Not sure what's supposed to happen, below is the console output. Could you clarify the purpose of the example? We know that the WD can be patted manually (if that's what you are doing in this example) . The issue us that systemd isn't doing it.

    thank you

    Michael

    root@am62xx-evm:~# ./test-wdgsimple
    Debug, open pass
    [1226353.699953] watchdog: watchdog0: nowayout prevents watchdog being stopped!
    [1226353.707103] watchdog: watchdog0: watchdog did not stop!
    Debug, open pass
    [1226365.713129] watchdog: watchdog0: nowayout prevents watchdog being stopped!
    [1226365.720241] watchdog: watchdog0: watchdog did not stop!
    Debug, open pass
    [1226377.726470] watchdog: watchdog0: nowayout prevents watchdog being stopped!
    [1226377.733689] watchdog: watchdog0: watchdog did not stop!
    Debug, open pass
    [1

  • Hello Michael,

    Heads-up, we'll be off on Monday, but back again later in the week. I'm not sure if Judith is taking additional time off or not.

    We're still debugging exactly what is going on. I don't have access to Judith's source code, but if I recall our verbal discussion, what she gave you is a modified version of samples/watchdog/watchdog-simple.c. She added print statements within the watchdog driver, and then moved the open() and close() commands inside the while loop. So (without actually testing the code today), I think it looks something like this:

    int main(void)
    {
    //        int fd = open("/dev/watchdog", O_WRONLY);
    //        int ret = 0;
    //        if (fd == -1) {
    //                perror("watchdog");
    //                exit(EXIT_FAILURE);
    //        }
            while (1) {
            
                    int fd = open("/dev/watchdog", O_WRONLY);
                    int ret = 0;
                    if (fd == -1) {
                            perror("watchdog");
                            exit(EXIT_FAILURE);
                    }
                    ret = write(fd, "\0", 1);
                    if (ret != 1) {
                            ret = -1;
                            break;
                    }
                    close(fd);
                    
                    sleep(10);
            }
    //        close(fd);
            return ret;
    }
    

    So if you are trying to follow the logic of the print statements she posted, I think it should look like this:

    Debug, open pass
    //countdown at 60 sec (fails to reset watchdog counter)
    // not sure if the close() command is triggering these print statements?
    [1226353.699953] watchdog: watchdog0: nowayout prevents watchdog being stopped!
    [1226353.707103] watchdog: watchdog0: watchdog did not stop!
    Debug, open pass
    //countdown at 48 sec (fails to reset watchdog counter)
    [1226365.713129] watchdog: watchdog0: nowayout prevents watchdog being stopped!
    [1226365.720241] watchdog: watchdog0: watchdog did not stop!
    Debug, open pass
    //countdown at 36 sec (fails to reset watchdog counter)
    [1226377.726470] watchdog: watchdog0: nowayout prevents watchdog being stopped!
    [1226377.733689] watchdog: watchdog0: watchdog did not stop!
    Debug, open pass
    // after another sleep(10), countdown should be <30 sec
    // At this point I would expect to see the watchdog get reset
    // the processor should not force a reboot

    I don't want to say too much here, just because we don't fully understand the behavior yet. But it looks like
    1) the watchdog can be pet, but
    2) the API we would expect to use to pet the watchdog does not seem to be working as expected

    To set expectations: If we find that the watchdog driver needs to be modified, that patch will NOT make it into the SDK 9.2 release. We are continuing to work on this, and we can talk more about the best way to distribute code after we make additional progress.

    Thanks as always for your patience, and keeping us accountable Michael. Feel free to ping the thread if we don't have another update for you within a few business days.

    Regards,

    Nick

  • Hi Michael,

    Adding to Nick's comment.

    I am not using systemd to test yet,

    Using wdctl, and watchdog-simple.c that is found in the Linux kernel source, and a few other debugging tools, I have discovered a few things.

    Watchdog_simple is using a syscall write() call that the watchdog subsystem uses to pet the watchdog. This sample program works on AM64x but not on AM62x. The path of software execution is the same.  In watchdog_dev.c we go through various functions and end up calling rti_wdt_ping in rti_wdg.c which is the driver for watchdog on K3 devices. For AM64x, we successfully write the two step key sequence that discharges the watchdog capacitor and essentially reloads the DWD down counter which is the action that pets the watchdog. This second write seems to be failing for AM62x platform.

    However, we do manage to 'pet the watchdog' on one specific case for AM62x, when you call syscalls open() and close() functions repeatedly. That is why I sent a sample program that implements this. Using open() and close(), the watchdog subsystem uses a different path but we also end up calling rti_wdt_ping in rti_wdg.c and both writes pass, reloading the DWD down counter.

    So I am currently looking into why we cannot execute the two writes to reload the DWD down counter for AM62x, using the write() call.

    ~ Judith

  • Hi all,

    Could someone try the following change in Linux kernel and let me know if this changes the behavior of watchdog:

    diff --git a/arch/arm64/boot/dts/ti/k3-am62-main.dtsi b/arch/arm64/boot/dts/ti/k3-am62-main.dtsi
    index a9ccf7bb2ecc..06460243cdcd 100644
    --- a/arch/arm64/boot/dts/ti/k3-am62-main.dtsi
    +++ b/arch/arm64/boot/dts/ti/k3-am62-main.dtsi
    @@ -891,7 +891,7 @@ main_rti0: watchdog@e000000 {
                    clocks = <&k3_clks 125 0>;
                    power-domains = <&k3_pds 125 TI_SCI_PD_EXCLUSIVE>;
                    assigned-clocks = <&k3_clks 125 0>;
    -               assigned-clock-parents = <&k3_clks 125 2>;
    +               assigned-clock-parents = <&k3_clks 125 4>;
            };
     
            main_rti1: watchdog@e010000 {
    

    ~ Judith

  • Hi all,

    Another update here.

    It seems like the clock parent for RTI WWDT (DEV_RTI0_RTI_CLK_PARENT_CLK_32K_RC_SEL_OUT0) is not working as expected. If we change to:

    DEV_RTI0_RTI_CLK_PARENT_MAIN_WWDTCLKN_SEL_OUT0_DIV_CLKOUT then the watchdog works as expected, but this is not a true 32K signal.

    The patch sent above should allow to pet the watchdog, it would be great to see if it now works using systemd.

    Pending to check why the LFXOSC0 is not working.

    ~ Judith

  • where the patch Judith provided above swaps between 
    DEV_RTI0_RTI_CLK_PARENT_CLK_32K_RC_SEL_OUT0 (Clock ID 2)
    and
    DEV_RTI0_RTI_CLK_PARENT_MAIN_WWDTCLKN_SEL_OUT0_DIV_CLKOUT (Clock ID 4)

    as per
    https://software-dl.ti.com/tisci/esd/latest/5_soc_doc/am62x/clocks.html#clocks-for-rti0-device 

    Regards,

    Nick

  • Hello, I was able to reproduce the issue on this thread with systemd, setting up my RuntimeWatchdogSec to 45.

    Then I applied the patch mentioned above (root clock ID 2 to root clock ID 4), and now the system is not rebooting anymore.

    Would this change be considered a solution to this problem or another hint on what could be going wrong?

    Regards,

    Rafael

  • Hello everyone, I (re)built the system after updating the k3-am62-main.dtsi file, and re-ran the `watchdog-test` experiment from a few weeks ago.

    Unfortunately, that *did not* prevent a reboot of the system.

    root@2b6e5b7:/tmp# ./watchdog-test -p 5  
    Watchdog ping rate set to 5 seconds.
    Watchdog Ticking Away!
    ......SSH session disconnected
    SSH reconnecting...

    I'll try with the systemd watchdog runtimewatchdogsec values set to something other than 0 tomorrow... But this is another data point for the folks here.

  • Hello Rafael & Sidd,

    Thank you both for testing things out on your side!

    We are still digging into exactly what is going on. Our current hypothesis is that the behavior is caused by these lines of code in the driver file
    drivers/watchdog/rti_wdt.c

            /*
             * If watchdog is running at 32k clock, it is not accurate.
             * Adjust frequency down in this case so that we don't pet
             * the watchdog too often.
             */
            if (wdt->freq < 32768)
                    wdt->freq = wdt->freq * 9 / 10;
    

    It seems like with the default clock, wdt->freq = 32768, so the driver does not enter the IF statement, wdt->freq stays at 32768, and the watchdog does not work properly. However, with the other clock, wdt->freq < 32768, the driver enters the IF statement, wdt->freq is reduced by 10%, and the watchdog works.

    Assuming that is the case, we are still investigating exactly why reducing wdt->freq changes the behavior, and the logic for why that line of code was added to begin with. Once we understand the original intent, we can do other evaluation (e.g., if the "right fix" is to remove the IF statement and always set wdt->freq = wdt->freq * 0.9, if this is covering up some logic that needs to be changed somewhere else in the code, etc).

    Regards,

    Nick

  • HI Sidd,

    What does the following command show for you?

    root@am62pxx-evm:~# k3conf dump parent_clock 125 0
    |------------------------------------------------------------------------------|
    | VERSION INFO                                                                 |
    |------------------------------------------------------------------------------|
    | K3CONF | (version 0.3-nogit built Tue Feb 13 15:54:05 UTC 2024)              |
    | SoC    | AM62Px SR1.0                                                        |
    | SYSFW  | ABI: 3.1 (firmware version 0x0009 '9.2.5--v09.02.05 (Kool Koala))') |
    |------------------------------------------------------------------------------|
    
    |-----------------------------------------------------------------------------|
    | Clock information                                                           |
    |-----------------------------------------------------------------------------|
    | Device ID | Clock ID | Clock Name       | Status          | Clock Frequency |
    |-----------------------------------------------------------------------------|
    |   125     |     0    | DEV_RTI0_RTI_CLK | CLK_STATE_READY | 32768           |
    |-----------------------------------------------------------------------------|
    
    |--------------------------------------------------------------------------------------------------------------------|
    | Clock Parent information                                                                                           |
    |--------------------------------------------------------------------------------------------------------------------|
    | Selected | Clock ID | Clock Name                                               | Status          | Clock Frequency |
    |--------------------------------------------------------------------------------------------------------------------|
    |          |     1    | DEV_RTI0_RTI_CLK_PARENT_GLUELOGIC_HFOSC0_CLKOUT          | CLK_STATE_READY | 25000000        |
    |   ==>    |     2    | DEV_RTI0_RTI_CLK_PARENT_CLK_32K_RC_SEL_OUT0              | CLK_STATE_READY | 32768           |
    |          |     3    | DEV_RTI0_RTI_CLK_PARENT_GLUELOGIC_RCOSC_CLKOUT           | CLK_STATE_READY | 12500000        |
    |          |     4    | DEV_RTI0_RTI_CLK_PARENT_GLUELOGIC_RCOSC_CLK_1P0V_97P65K3 | CLK_STATE_READY | 32552           |
    |--------------------------------------------------------------------------------------------------------------------|
    
    root@am62pxx-evm:~# 
    

    Ignore the result on my end since I gave an example using AM62p, but I am curious what k3conf displays for you.

    ~ Judith

  • Hi Rafael,

    This patch changes the clock parent for rti_clk, since the frequency in the driver is changed, this affects the timer margin and this affects when we pet the watchdog. It seems on AM62x, we are violating the valid time window to pet the watchdog, it seems like we are petting too early. We could fix our issue by extending the hack that is currently in the driver, but long term fix is still pending. We need to understand why the 32K clock is causing us to pet the watchdog too early if this is a true 32K clock signal.

    ~ Judith

  • Dear Judith,

    I built and ran the k3conf command like you asked, and here is the output it produced:

    root@2b6e5b7:/opt/install# ./k3conf dump parent_clock 125 0
    |------------------------------------------------------------------------------|
    | VERSION INFO                                                                 |
    |------------------------------------------------------------------------------|
    | K3CONF | (version v0.2-53-g81581af built Thu Mar 7 11:09:21 PM UTC 2024)     |
    | SoC    | AM62X SR1.0                                                         |
    | SYSFW  | ABI: 3.1 (firmware version 0x0009 '9.1.8--v09.01.08 (Kool Koala))') |
    |------------------------------------------------------------------------------|
    
    |-----------------------------------------------------------------------------|
    | Clock information                                                           |
    |-----------------------------------------------------------------------------|
    | Device ID | Clock ID | Clock Name       | Status          | Clock Frequency |
    |-----------------------------------------------------------------------------|
    |   125     |     0    | DEV_RTI0_RTI_CLK | CLK_STATE_READY | 32552           |
    |-----------------------------------------------------------------------------|
    
    |---------------------------------------------------------------------------------------------------------------------|
    | Clock Parent information                                                                                            |
    |---------------------------------------------------------------------------------------------------------------------|
    | Selected | Clock ID | Clock Name                                                | Status          | Clock Frequency |
    |---------------------------------------------------------------------------------------------------------------------|
    |          |     1    | DEV_RTI0_RTI_CLK_PARENT_GLUELOGIC_HFOSC0_CLKOUT           | CLK_STATE_READY | 25000000        |
    |          |     2    | DEV_RTI0_RTI_CLK_PARENT_CLK_32K_RC_SEL_OUT0               | CLK_STATE_READY | 32768           |
    |          |     3    | DEV_RTI0_RTI_CLK_PARENT_GLUELOGIC_RCOSC_CLKOUT            | CLK_STATE_READY | 12500000        |
    |   ==>    |     4    | DEV_RTI0_RTI_CLK_PARENT_MAIN_WWDTCLKN_SEL_OUT0_DIV_CLKOUT | CLK_STATE_READY | 32552           |
    |---------------------------------------------------------------------------------------------------------------------|
    

    (Hope this helps)

    Regards,

    Sidd

  • Hi Sid,

    Thanks for the update. The clock parent is correct. What board are you using? Is it TI EVM? Also, what are the results of your test using systemd?

    ~ Judith

  • Hello, and TI experts,

    What's your plan and expectation for getting this issue solved?

  • Hi fd,

    So far we have done the initial investigation. The hope is to have a fix for this issue in time for 10.0 release.

    ~ Judith

  • Hi all,

    We now have a patch ready with the fix for this issue. On AM62x, it seems like the watchdog is pet before the valid timing window opens, please try the following patch and let us know if this fixes the issue on your end.

    https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/0002_2D00_watchdog_2D00_rti_5F00_wdt_2D00_Fix_2D00_min_5F00_hw_5F00_heartbeat_5F00_ms.patch

    ~ Judith

  • Thanks for this .. Just to be clear, can I revert the following change that you had proposed earlier before applying this one ? Am assuming this is no longer needed.

    Regards,

    Sidd

    diff --git a/arch/arm64/boot/dts/ti/k3-am62-main.dtsi b/arch/arm64/boot/dts/ti/k3-am62-main.dtsi
    index f1494e0e9816..33d7ba16da9a 100644
    --- a/arch/arm64/boot/dts/ti/k3-am62-main.dtsi
    +++ b/arch/arm64/boot/dts/ti/k3-am62-main.dtsi
    @@ -834,7 +834,7 @@ main_rti0: watchdog@e000000 {
                    clocks = <&k3_clks 125 0>;
                    power-domains = <&k3_pds 125 TI_SCI_PD_EXCLUSIVE>;
                    assigned-clocks = <&k3_clks 125 0>;
    -               assigned-clock-parents = <&k3_clks 125 2>;
    +               assigned-clock-parents = <&k3_clks 125 4>;
            };
     
            main_rti1: watchdog@e010000 {

  • Hi Sidd,

    The patch that you referenced is no longer needed. The only patch necessary is: 0002-watchdog-rti_wdt-Fix-min_hw_heartbeat_ms.patch.

    ~ Judith

  • Hi Sidd,

    As per review comments, it is a safer pet to pet the watchdog +5% of the timeout value after the valid window opens, so here is the corresponding patch:

    https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/v2_2D00_0001_2D00_watchdog_2D00_rti_5F00_wdt_2D00_Set_2D00_min_5F00_hw_5F00_heartbeat_5F00_ms_2D00_to_2D00_55_2D00_of.patch

    Let us know if this fix is helpful.

    regards,

    Judith

  • Hello 

    the patch you sent to the LKML, https://lore.kernel.org/all/20240403212426.582727-1-jm@ti.com/, seems different. Can you explain the reasons? 

  • Hi fd,

    Yes I can explain.

    1. I omitted removing the hack that was included in the introduction of the driver since in our case, we do not touch that specific piece of code by default on AM62x, therefore, someone could argue that the change is irrelevant.

    2. The rti_wdt_setup_hw_hb function comes into play when watchdog has been started prior to Linux kernel starting the watchdog, We must update min_hw_heartbeat_ms here as well accommodate a 5% safety margin.

    ~ judith