For a couple of years I’ve been running Proxmox on a couple of NUCs to host all kinds of self-hosted services. One of them is this personal site, but there is also Home Assistant, Jellyfin and Immich, among others.
Home Assistant handles a lot of automations in our house, so it’s quite inconvenient that Proxmox crashes once or twice a day. This started last week and I have yet to find out the reason. When this Proxmox server crashes, it takes Home Assistant down with it. On a couple of mornings this past week the alarm clock didn’t fire, or the lights did not turn on when someone entered a room.
I haven’t found a solution yet, so I resorted to a workaround. I read somewhere that almost all Intel chipsets of the last twenty years include a hardware watchdog timer. The way this works is that a program or the kernel writes something to the /dev/watchdog device every few seconds. When this hasn’t happened for ten seconds, the watchdog reboots the server. This should work better than a software-based watchdog. A software watchdog was already running inside Proxmox, but it failed to work this week: the server crashed, but there was no automatic reboot.
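To make the mechanism a bit more concrete, here is a minimal sketch of such a keepalive loop. The function name pet_watchdog and its three arguments are made up for this example. As far as I know Proxmox runs its own daemon that holds /dev/watchdog, so this is an illustration only and not something to run against the real device on a Proxmox host.

```shell
#!/bin/sh
# Hypothetical helper: write COUNT keepalives to DEVICE, INTERVAL seconds apart.
# On a real watchdog device, opening the file arms the timer and every write
# resets the countdown.
pet_watchdog() {
    device=$1
    interval=$2
    count=$3
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            printf '.' >&3          # any write counts as a keepalive
            i=$((i + 1))
            sleep "$interval"
        done
        # 'V' is the kernel's "magic close" character: drivers that support it
        # disarm the timer when the device is closed right after this write.
        # Drivers loaded with nowayout ignore it and keep counting down.
        printf 'V' >&3
    } 3>"$device"                   # fd 3 stays open for the whole loop
}

# Example, pointed at a scratch file rather than a real device:
#   pet_watchdog /tmp/fake-watchdog 1 5
```

If the loop stops without the final 'V', for instance because the whole machine hangs, the timer runs out and the hardware resets the server. That is exactly the behaviour I’m after.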
When you run Proxmox, activating the hardware watchdog timer works somewhat differently compared to default Debian or another Linux distro. But it’s not hard.
First you have to find out which watchdog hardware comes with your server. In my case this is iTCO_wdt. This one is quite common, if I’m correctly informed.
The next step is to edit /etc/default/pve-ha-manager and set the watchdog module on the WATCHDOG_MODULE line (remove the leading hash if necessary), like this:
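One way to look for hints about the watchdog driver is to filter the loaded modules and the kernel log. This is a rough sketch: watchdog_hints is a made-up helper name, and an empty result does not prove there is no hardware watchdog, since the driver may simply not be loaded yet.

```shell
#!/bin/sh
# Hypothetical helper: keep only lines that mention a watchdog driver.
# It reads text on stdin, so it works on the output of lsmod, dmesg or similar.
watchdog_hints() {
    grep -i -E 'wdt|watchdog|softdog'
}

# Usage on a real host (left as comments so the block has no side effects):
#   lsmod | watchdog_hints    # modules that are currently loaded
#   dmesg | watchdog_hints    # kernel messages since boot (may need root)
```

Once you have a candidate name, modinfo followed by the module name shows whether that module is available for your kernel.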
WATCHDOG_MODULE=iTCO_wdt
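If you would rather script this edit than open an editor, something like the following should do. It is a sketch under a few assumptions: set_watchdog_module is a made-up function name, it relies on GNU sed (which Debian ships), and it expects the file to already exist. Make a backup of the file first.

```shell
#!/bin/sh
# Hypothetical helper: set WATCHDOG_MODULE in FILE to MODULE.
# An existing line, commented out or not, is rewritten in place;
# if there is no such line, one is appended.
set_watchdog_module() {
    file=$1
    module=$2
    if grep -q -E '^#?[[:space:]]*WATCHDOG_MODULE=' "$file"; then
        # -i edits the file in place (GNU sed), -E enables extended regexes
        sed -i -E "s/^#?[[:space:]]*WATCHDOG_MODULE=.*/WATCHDOG_MODULE=$module/" "$file"
    else
        printf 'WATCHDOG_MODULE=%s\n' "$module" >> "$file"
    fi
}

# Example (as root):
#   cp /etc/default/pve-ha-manager /etc/default/pve-ha-manager.bak
#   set_watchdog_module /etc/default/pve-ha-manager iTCO_wdt
```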
Save the file, leave your editor and reboot the Proxmox server. When you log in to your shell again, run wdctl. Its output should now list the hardware watchdog (iTCO_wdt in my case).
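As a cross-check that does not depend on wdctl, the kernel can also expose the active watchdog under /sys/class/watchdog. The sketch below assumes your kernel was built with watchdog sysfs support. The function name watchdog_identities is made up, and the optional argument exists only so the function can be pointed at another directory.

```shell
#!/bin/sh
# Hypothetical helper: print the identity string of every registered watchdog.
# BASE defaults to the kernel's sysfs directory for watchdog devices.
watchdog_identities() {
    base=${1:-/sys/class/watchdog}
    for dir in "$base"/watchdog*; do
        # Skip entries without a readable identity file; this also covers
        # the case where the glob matched nothing at all.
        [ -r "$dir/identity" ] || continue
        printf '%s: %s\n' "$(basename "$dir")" "$(cat "$dir/identity")"
    done
}

# Example:
#   watchdog_identities
```

If this prints nothing, either no watchdog is registered or the sysfs attributes are not available on your kernel.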
Hopefully this workaround helps. I should know by tomorrow; I’ll let you know…