3

I have a java process that runs continually that sometimes, for reasons I have yet to fully debug, craps out. So, I also have a cron job that looks for the process every 5 minutes and if it finds the process isn't running, it calls a script to restart it.

The problem is that every once in a while the check-up script gets a false negative: it thinks the process isn't running when in fact it is. I haven't seen any pattern to when this happens. But I do need a completely foolproof way to check whether the process is running.

What I'm doing currently is this:

if ! pgrep -f '/path/to/XML2DB.jar -n' > /dev/null
then
    nice -n 19 java -Xmx2024M -jar /path/to/XML2DB.jar -n > /dev/null 2>/dev/null &
    echo "" | mail -s "$HOST: xml2db was found not running, is being started" support@mycompany.com
fi

Before pgrep, we were using ! ps ax | grep -v grep | grep "XML2DB.jar -n" > /dev/null, but this also sometimes wrongly reported the process as not running.

Linux version is Scientific Linux SL release 3.0.9 (SL) and LSB Version is 1.3.

Thanks in advance!

Sabrina S

2 Answers

4

The ps ax | grep -v grep | grep "XML2DB.jar -n" technique has a race condition: the grep instances may or may not get started in time for ps to see them, so you get inaccurate counts: see here and here. You aren't the first one to get in trouble using it.

I did an strace pgrep somepattern on a RHEL box to find out what pgrep is doing. It stats every process directory in /proc, and opens /proc/<PID>/cmdline for some PIDs, and reads the contents, presumably to match against the pattern given on the pgrep command line. I'm waving my hands here, but I bet there's a race condition there as well.
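To see the string that pgrep -f has to match against, you can read the cmdline file by hand. This is only a small sketch for Linux, and show_cmdline is a name I made up for illustration. The arguments in that file are separated by NUL bytes, so the sketch turns them into spaces.

```shell
# Hypothetical helper: print the command line of the given PID as one string.
# Assumes a Linux-style /proc; arguments in cmdline are NUL-separated.
show_cmdline() {
    # The redirection fails if the process has already exited, which is
    # the same kind of window any /proc-scanning tool has to cope with.
    tr '\0' ' ' < "/proc/$1/cmdline"
}
```

For example, show_cmdline $$ prints the command line of the current shell.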

The only foolproof solution to this is to have the Java process try to create a "lock directory". Directory creation is atomic for user processes. If the lock directory already exists, exit with an error message; otherwise, start up. After creating the lock directory, it should write its PID to a file in the lock directory.

You can use the PID in the file to check whether the Java program is running with kill -0 $(cat /whatever/lockdir/PIDfile) - if the process doesn't exist, kill will exit with a non-zero status.
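A minimal sketch of that check, wrapped in a function so the cron script can call it. The name is_running is hypothetical, and the PID file path is whatever you chose for your lock directory. Two caveats I would keep in mind: kill -0 also fails when the process exists but belongs to another user, and a stale PID file can point at an unrelated process if the PID has been reused.

```shell
# Hypothetical helper: succeeds if the PID recorded in the given file is alive.
is_running() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1       # no readable PID file: not running
    pid=$(cat "$pidfile") || return 1
    case $pid in
        ''|*[!0-9]*) return 1 ;;        # empty or non-numeric content
    esac
    kill -0 "$pid" 2>/dev/null          # signal 0 only tests for existence
}
```

The cron job would then use something like: if ! is_running /whatever/lockdir/PIDfile; then restart the program; fi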

The trick is to pass the PID to the Java program on its command line:

exec java blah blah -mypid $$

You still have to be very careful about errors or exceptions surrounding lock directory creation, and in interpreting the kill -0, and when removing the PID file and lock directory, but you will have problems with any other method.
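Here is a sketch of one variation on the above, where the wrapper script (rather than the Java program itself) takes the lock and writes the PID before it execs. The function name start_locked and the file name PIDfile are assumptions for illustration. Because exec keeps the same PID, the number written by the shell is also the PID of the program it turns into.

```shell
# Hypothetical wrapper: take the lock, record our PID, then become the program.
# Usage: start_locked LOCKDIR COMMAND [ARGS...]
# Meant to be called from the top level of a script, where $$ is the PID
# that the exec'd command will inherit.
start_locked() {
    lockdir=$1
    shift
    # mkdir fails if the directory already exists, so only one caller wins.
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "lock $lockdir is already held" >&2
        return 1
    fi
    echo "$$" > "$lockdir/PIDfile"
    exec "$@"    # replaces this shell; nothing after this line runs
}
```

Note that nothing in this sketch removes the lock directory when the program exits, so whatever restarts the program has to decide when a lock whose PID is dead counts as stale, and remove it carefully.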

1

There is no way to reliably and usefully check that an unrelated process is running: a race condition is always possible. Even if you find a way to detect whether the process you're interested in is running, it might be killed immediately after you've seen it, or conversely it might get started immediately after you missed it.

If you control the program or the way it runs, you can make it reserve a unique resource such as a file lock. However, if you control the way the program is invoked, there's a simpler way to keep track of it: monitor it from its parent. A process is informed when its child dies.

The simplest way to ensure that a process is always running is to restart it in a loop.

# sleep 1 avoids a tight loop if the process systematically fails to start
while sleep 1; do
  nice …
  ret=$?
  if [ "$ret" -le 127 ]; then
    msg="… exited with status $ret"
  else
    msg="… exited on signal $((ret-128))"
  fi
  mail -s "$msg" "$USER"
done

There is more robust and more powerful monitoring software. See "How to set proper monitoring of my services in a automated way? So that if one crash it auto on the fly restarts?"