8

Usually when a cron job crashes, it leaves some error messages in the log.

We run shell scripts and some Java programs as cron jobs. Recently we found something odd in the logs. The program had evidently either crashed or been killed, because the lock it sets when it initializes was never released. We suspect it was killed, because the program's log never showed the finish message.

Who could have killed the job, and how can I get notified by email when a cron job dies?

EDIT: I don't want to rely on crontab's own email, because it just mails everything written to standard output. In my case there is a lot of other output from different programs, because some of them don't use log4j or their messages are echoed by shell scripts. Because there are many users on the system, we can't require all of them to manage their programs' standard output.

2 Answers

7

Check your system logs. Which logs to check depends on your installation; on Debian with the default setup, you get:

  • in /var/log/auth.log, notices from when the cron job started and finished, because the job involved a PAM session;
  • in /var/log/syslog, a notice that grandchild #32283 failed with exit status 1;
  • an additional notice in /var/log/kern.log if your process was terminated by the OOM killer.
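As a rough sketch of how those logs might be searched: the default paths below are the Debian ones listed above, while the function names and the search patterns are assumptions of mine, and the exact wording of the kernel's message varies between versions.

```shell
# Hypothetical helpers; default paths are the Debian ones listed above.
# Reading these files usually requires root or membership in the adm group.

# Print cron-related lines (job start, failed grandchild, ...).
find_cron_entries() {
    grep -i 'cron' "${1:-/var/log/syslog}"
}

# Print OOM-killer victims. Assumption: the kernel line contains
# "Killed process <pid> (<name>)"; wording can differ between versions.
find_oom_kills() {
    grep -i 'killed process' "${1:-/var/log/kern.log}"
}
```

Called without an argument, each function reads its default log; pass another path as the first argument if your installation logs elsewhere.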

You will get email from cron if your cron job produces any output on its standard output or standard error (unless your local mail delivery system isn't set up properly). You won't get a mail if it quietly returns a nonzero status (which includes the case of being killed by a signal). If you want a notice, arrange a shell wrapper that's noisy in case of error, e.g.

42 1 * * * /path/to/real/job || echo $?

If you want to log more information about processes and how they die (and how they're born, but here you already know), see "Is there a log of past threads that are now closed?"

  • @sourcejedi I'm not sure I understand your comment. If you're wondering when you might not get a message from a shell if a program from a cron job is killed, then here are two common cases: 1. The shell itself gets killed before it has time to print anything. 2. The cron job just runs an executable, and the shell execs it, so there's no shell to print anything if the executable's process gets killed. – Gilles 'SO- stop being evil' Oct 16 '18 at 18:23
  • I'm saying my understanding is that for sh-compatible shells you would never need || echo $? to detect crashes / kills as per the question, only to detect non-signal exits which are silent but do not return EXIT_SUCCESS (0). The latter are unusual, because they also wouldn't show up if you had run the command from an interactive shell. (Might be useful to check for weird java programs or something though). – sourcejedi Oct 16 '18 at 20:36
  • @sourcejedi If you just have /path/to/myprogram as the job, and the shell implements the common exec-the-last-command optimization, and the program dies from a signal, then nothing will print any message. The only trace of the death will be the notice from cron containing the program's exit status. – Gilles 'SO- stop being evil' Oct 16 '18 at 20:41
  • pain. I was effectively testing echo "sleep 10" | bash. It turns out this behaves differently from both bash -c 'sleep "10"' - which is what I should have tested - and e.g. echo "sleep 10" | dash, which I also should have tested. – sourcejedi Oct 16 '18 at 21:04
6

To debug this you can put

set -e -u

at the top of your shell script - it then ends with an error exit status when a command fails or an undefined variable is used.
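A small illustration of both options, run in a child sh so it can't affect your own shell (the function names and the variable name are made up for the demonstration):

```shell
# strict_demo: with set -e, the script stops at the first failing command.
strict_demo() {
    sh -c '
        set -e -u
        echo "step 1"
        false            # fails, so the script exits here with status 1
        echo "step 2"    # never reached
    '
}

# unset_demo: with set -u, expanding an unset variable aborts the script
# with a nonzero status (the exact value depends on the shell).
unset_demo() {
    sh -c '
        set -e -u
        unset no_such_variable_xyz
        echo "value: $no_such_variable_xyz"
    ' 2>/dev/null
}
```

strict_demo prints only "step 1" and returns 1; unset_demo prints nothing and returns a nonzero status.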

Then, from the cron job, you can call a wrapper script that runs the main script like this:

sh -x main_script.sh || echo Failed with exit status: $?

With -x every line is printed out before it is executed. The output is mailed by the cron daemon to you.

You can also use a temp file when the output is too big:

sh -x main_script.sh > "$TEMPFILE" 2>&1
status=$?   # save it now: inside the 'if', $? would be the status of the [ test
if [ "$status" -ne 0 ]; then echo "Failed with exit status $status - see $TEMPFILE"; fi

If the exit status is > 128, the command was terminated by a signal - e.g. someone 'killed' it, a segmentation fault occurred, or there was an out-of-memory situation (see "how to get the signal from the exit status").
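For example, the signal number can be recovered by subtracting 128, and kill -l turns a number into a name. The sketch below assumes the usual sh convention of 128 + signal number (some shells, such as ksh93, report it differently, and a program may also choose to exit with a status above 128 itself); the function name describe_status is made up.

```shell
# describe_status: explain an exit status as reported in $? by a shell.
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # Convention: death by signal N is reported as 128 + N.
        signum=$((status - 128))
        # kill -l NUMBER prints the signal's name without the SIG prefix.
        echo "killed by signal $signum (SIG$(kill -l "$signum"))"
    elif [ "$status" -ne 0 ]; then
        echo "exited with status $status"
    else
        echo "succeeded"
    fi
}
```

With this, describe_status 137 prints "killed by signal 9 (SIGKILL)", which is what you would see after a kill -9 and typically after the OOM killer as well.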

maxschlepzig