Keep system accessible in case of freeze

Question

We run a small server (Ubuntu 14.04) solely for computing purposes. From time to time, a user will manage to consume enough memory to freeze the system. Last time, the culprit was a process that spawned 30 memory eating sub-processes. The result is that I cannot log into the machine to fix it - ssh and local login both just time out. The OOM Killer did not seem to do anything. egrep -i 'killed process' /var/log/* returned nothing.

Is there a way to keep/gain command line access in such circumstances?

magor · Answer 1 · 2016-06-07T13:31:13.003

There is a way of limiting the use of system resources.
Check the ulimit command and its usage. Persistent limits go in the conf file /etc/security/limits.conf, where you can specify how much of each resource a user or group may use. For example, if you specify in the conf file:

@developers        soft    nproc          20
@developers        hard    nproc          30

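then each user in the developers group can run at most 30 processes. The soft limit of 20 is the one enforced by default; a user can raise it with ulimit -u, but never above the hard limit of 30. You can also set the limit for the current shell and everything started from it with ulimit -u 10; in that case the user can run only 10 processes from that shell.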

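Use ulimit -a to display the current limits. To limit memory, try ulimit -v, which caps the virtual memory (in kilobytes) available to each process. Because that cap is per process, combine it with a process limit to bound the total.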

user@localhost:~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256646
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 32768
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 32768
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
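
As a minimal sketch of the ulimit -v suggestion, the limit can be set inside a subshell so that it applies to one job only and not to the login shell. The helper name run_capped and the 2 GiB figure are illustrative assumptions, not something the limits above prescribe.

```shell
#!/bin/bash
# run_capped LIMIT_KB COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND with a virtual-memory cap of LIMIT_KB kilobytes.
run_capped() {
  local limit_kb=$1
  shift
  (
    # The parentheses start a subshell, so the cap disappears when it exits.
    ulimit -v "$limit_kb" || exit 1
    "$@"
  )
}

# Example: the child shell reports the cap it inherited (2097152 KB, about 2 GiB).
run_capped 2097152 bash -c 'ulimit -v'
```

Every process started by the capped command inherits the same per-process cap, so a job that spawns many sub-processes can still use many times that amount in total.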

score 0 · Answer 2 · answered Jun 08 '16 at 06:19

You can do things to try and prevent this from happening, as mazs mentions in his answer. But to have the box recover from any situation, even the ones you don't expect, you want a watchdog.

In Linux, a watchdog is a process which sends a "ping" periodically, and if that "ping" isn't received, the system resets. When I say "ping", I don't mean a network ICMP echo request, just a "hey, I'm still alive" message. The ping can be sent to a physical watchdog, which might do something like power cycle the host to perform the reset, but it can also be sent to something within the Linux kernel. The latter is what you'll most likely want, as hardware watchdog devices are typically only found in enterprise-grade equipment.

Anyway, to get started, you first need the software watchdog enabled in your kernel. This is likely to be the biggest obstacle. I have no idea whether Ubuntu has the watchdog enabled or not. See if you have /dev/watchdog available. If not, try modprobe softdog. If neither works, and you still want to pursue this, you're going to need to recompile your kernel with the CONFIG_SOFT_WATCHDOG option.
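
The check described above could be scripted roughly as follows. This is only a sketch: the helper name has_watchdog_device is made up for illustration, and modprobe will only succeed when run as root on a kernel that ships the softdog module.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the given path (default /dev/watchdog)
# exists and is a character device.
has_watchdog_device() {
  [ -c "${1:-/dev/watchdog}" ]
}

if has_watchdog_device; then
  echo "watchdog device present"
elif modprobe softdog 2>/dev/null && has_watchdog_device; then
  echo "softdog module loaded"
else
  echo "no watchdog device; the kernel may lack CONFIG_SOFT_WATCHDOG"
fi
```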

Now, assuming you have the watchdog enabled, you need to install the watchdog package.

Once it's installed, put a script in /etc/watchdog.d that you want to use to perform your health check (can also use test-binary in watchdog.conf), and make it executable. If you wanted to make sure ssh is functional, you could do something like:

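#!/bin/bash
# The watchdog daemon calls this script with "test" or "repair" as the first argument.
case "$1" in
test)
  # Health check: fail fast instead of hanging on a password prompt or a dead sshd.
  ssh -o BatchMode=yes -o ConnectTimeout=10 testuser@localhost /bin/true
  ;;
repair)
  # Try to recover by restarting sshd; a non-zero exit tells the watchdog it failed.
  service ssh restart
  ;;
*)
  # Unknown argument: report failure.
  false
  ;;
esac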

(To do exactly as above, you'll need to create testuser and set up public key auth, but that's outside the scope of this answer.)

The watchdog will invoke the script with test to do the health check, and will try repair if it fails. If the repair also fails, the system is reset.
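
Since the original problem is memory exhaustion, the same test/repair convention could also drive a memory check. The following is a sketch under stated assumptions, not a tested configuration: the 100 MB threshold, the MIN_FREE_KB variable and the function names are all hypothetical, and kernels that do not report MemAvailable fall back to MemFree plus Cached as a rough estimate.

```shell
#!/bin/bash
# Hypothetical /etc/watchdog.d/ script: fail the health check when memory runs low.
MIN_FREE_KB=${MIN_FREE_KB:-102400}   # assumed threshold: 100 MB

# Print usable memory in KB from a meminfo-style file (default /proc/meminfo).
free_kb() {
  awk '/^MemAvailable:/ {avail = $2; found = 1}
       /^MemFree:/      {memfree = $2}
       /^Cached:/       {cached = $2}
       END {print (found ? avail : (memfree + cached))}' "${1:-/proc/meminfo}"
}

check() {
  case "$1" in
  test)
    # Succeeds only while at least MIN_FREE_KB is still usable.
    [ "$(free_kb)" -ge "$MIN_FREE_KB" ]
    ;;
  *)
    # No automatic repair is attempted here, so "repair" reports failure.
    return 1
    ;;
  esac
}

# Only act when called with an argument, as the watchdog daemon does.
if [ "$#" -gt 0 ]; then
  check "$@"
fi
```

Because repair always fails in this sketch, a low-memory condition leads straight to a reset, so the threshold would need to be chosen with care.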

See the watchdog documentation for further details.
