You can do things to try and prevent this from happening, as mazs mentions in his answer. But to have the box recover from any situation, even the ones you don't expect, you want a watchdog.
In Linux, a watchdog setup is a process that periodically sends a "ping" to a watchdog device, and if that "ping" isn't received, the system resets. When I say "ping", I don't mean a network ICMP echo request, just a "hey, I'm still alive" message. The ping can be sent to a physical watchdog, which might do something like power cycle the host to perform the reset, but it can also go to a software watchdog within the Linux kernel. The latter is what you'll most likely want, as hardware watchdog devices are typically only found in enterprise-grade equipment.
Anyway, to get started, you first need the software watchdog enabled in your kernel. This is likely to be the biggest obstacle. I have no idea if Ubuntu has the watchdog enabled or not. See if you have /dev/watchdog available. If not, try modprobe softdog. If neither works, and you still want to pursue this, you're going to need to recompile your kernel with the SOFT_WATCHDOG option (it shows up as CONFIG_SOFT_WATCHDOG in the kernel .config).
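The device check above can be sketched as a few lines of shell. This is only a rough illustration: watchdog_present is a hypothetical helper name, and its optional path argument exists just so the check can be pointed at something other than /dev/watchdog.

```shell
#!/bin/sh
# watchdog_present is a hypothetical helper, not part of any package.
# It succeeds if the given path (default /dev/watchdog) is a character device.
watchdog_present() {
    [ -c "${1:-/dev/watchdog}" ]
}

if watchdog_present; then
    echo "watchdog device found"
else
    # Loading the module needs root, so this only prints the suggestion.
    echo "no watchdog device; try: sudo modprobe softdog"
fi
```

If the module loads, running the check again should report the device as found.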
Now, assuming you have the watchdog enabled, you need to install the watchdog package.
Once it's installed, put the script you want to use for your health check in /etc/watchdog.d and make it executable (you can also point test-binary in watchdog.conf at it instead). If you wanted to make sure ssh is functional, you could do something like:
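#!/bin/bash
# Health check run by the watchdog daemon with "test" or "repair" as $1.
case "$1" in
    test)
        # BatchMode makes ssh fail instead of waiting at a password prompt.
        ssh -o BatchMode=yes -o ConnectTimeout=10 testuser@localhost /bin/true
        ;;
    repair)
        service ssh restart
        ;;
    *)
        false
        ;;
esac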
(To do exactly as above, you'll need to create testuser and set up public key auth, but that's outside the scope of this answer.)
The watchdog will invoke the script with test to do the health check, and will try repair if it fails. If the repair also fails, the system is reset.
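Before letting the daemon loose on a script that can reboot the box, it may be worth walking through the same test-then-repair sequence by hand. The sketch below is only an approximation of what the daemon does: simulate_watchdog is a hypothetical helper, and passing the failing exit status as a second argument to repair is my assumption about the daemon's behaviour, so check the man page for your version.

```shell
#!/bin/sh
# simulate_watchdog is a hypothetical helper, not part of the watchdog package.
# It mimics the test-then-repair sequence for a single check script.
simulate_watchdog() {
    script=$1
    rc=0
    "$script" test || rc=$?    # run the health check and keep its exit status
    if [ "$rc" -eq 0 ]; then
        echo "healthy"
        return 0
    fi
    # Assumption: the failing status is handed to repair as a second argument.
    if "$script" repair "$rc"; then
        echo "repaired"
        return 0
    fi
    echo "repair failed: the real daemon would reset the system here"
    return 1
}
```

You would call it as simulate_watchdog /etc/watchdog.d/ssh-check, where ssh-check is just a placeholder for whatever you named your script. Note that the repair step really runs, so in the ssh example it will restart sshd.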
See the watchdog documentation for further details.