
In the past I used pid files to guarantee race-condition-safe execution of scripts. The downside was that the pid file was not deleted if the kernel killed the script, so manual intervention was sometimes needed (removing the pid file, killing leftover subprocesses).

I finally came up with a solution that uses pgrep to check whether the script is already running and kills very old instances that shouldn't exist anymore:

while read -r pid cmd; do
  # check if pid or parent pid belong to current script execution
  if [[ $pid == "$$" ]] || [[ $(ps -o ppid= -p "$pid" | xargs) == "$$" ]]; then
    continue
  fi
  # avoid re-execution of script within 12 hours (43200 seconds)
  if [[ $(date +%s --d="now - $( stat -c%X "/proc/$pid" ) seconds") -lt 43200 ]]; then
    echo "Error: Script is already running!"
    exit
  fi
  # kill outdated script executions
  if kill -9 "$pid"; then
    echo "Warning: Outdated script execution ($pid) has been killed!"
  fi
done < <(pgrep -af "/bin/bash $(basename "$(readlink -f "$0")")")
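
The age check in the loop packs the whole calculation into one date call. As a minimal sketch, the same value (current time minus the access time of /proc/<pid>) can be written with shell arithmetic. The helper name proc_age is hypothetical and not part of the original script; GNU stat and a Linux /proc are assumed.

```shell
#!/usr/bin/env bash
# Sketch: age of a process in seconds, measured the same way as in the loop
# above (access time of /proc/<pid>, as reported by GNU stat).
# proc_age is a hypothetical helper name, not part of the original script.
proc_age() {
  local started now
  # stat fails if the process has already exited; report that to the caller
  started=$(stat -c %X "/proc/$1" 2>/dev/null) || return 1
  now=$(date +%s)
  printf '%s\n' "$(( now - started ))"
}

max_age=43200   # 12 hours, the threshold used in the script

if age=$(proc_age "$$"); then
  if (( age < max_age )); then
    echo "pid $$ counts as still running (${age}s old)"
  else
    echo "pid $$ counts as outdated (${age}s old)"
  fi
fi
```

Returning a non-zero status when /proc/<pid> is missing lets the caller tell "process already gone" apart from "process is young".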

This works on hundreds of servers and across thousands of script executions without any issues, but in very rare cases it returned an "already running" error although no other instance of the script was running.

As I was not able to find a bug in my code, I added more debug output:

echo "pid: $pid"
echo "cmd: $cmd"
echo "lsof:"
lsof -p "$pid"
echo "/proc/$pid/cmdline:"
xargs -0 <"/proc/$pid/cmdline"
echo "pstree:"
pstree -asp "$pid"

After several weeks it returned this:

2023-04-02 06:10:01 Error: Script is already running!
2023-04-02 06:10:01 pid: 3902
2023-04-02 06:10:01 cmd: /bin/bash /usr/local/bin/script.sh
2023-04-02 06:10:01 lsof:
2023-04-02 06:10:01 COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
2023-04-02 06:10:01 sed     3902 root  cwd    DIR    8,3     4096  6291457 /root
2023-04-02 06:10:01 sed     3902 root  rtd    DIR    8,3     4096        2 /
2023-04-02 06:10:01 sed     3902 root  txt    REG    8,3    77664  2886809 /usr/bin/sed
2023-04-02 06:10:01 sed     3902 root  mem    REG    8,3  1936776  5243076 /lib64/libc-2.22.so
2023-04-02 06:10:01 sed     3902 root  mem    REG    8,3    10344  2892257 /usr/lib64/coreutils/libstdbuf.so
2023-04-02 06:10:01 sed     3902 root  mem    REG    8,3   164144  5242936 /lib64/ld-2.22.so
2023-04-02 06:10:01 sed     3902 root    0r  FIFO   0,12      0t0 96041317 pipe
2023-04-02 06:10:01 sed     3902 root    1w  FIFO   0,12      0t0 96037120 pipe
2023-04-02 06:10:01 sed     3902 root    2w  FIFO   0,12      0t0 96021482 pipe
2023-04-02 06:10:01 /proc/3902/cmdline:
2023-04-02 06:10:01 sed s/%/%%/g
2023-04-02 06:10:01 pstree:
2023-04-02 06:10:01 systemd,1 --switched-root --system --deserialize 23
2023-04-02 06:10:01   `-cron,3599 -n
2023-04-02 06:10:01       `-cron,3885 -n
2023-04-02 06:10:01           `-fcc-monitor-and,3891 /usr/local/bin/script.sh
2023-04-02 06:10:01               `-fcc-monitor-and,3901 /usr/local/bin/script.sh
2023-04-02 06:10:01                   `-sed,3902 s/%/%%/g

As you can see, pgrep returned 3902, which belongs to a subshell process of the script. I would have expected 3891 or 3901 instead (as in all other executions). Is this a bug in pgrep, or why isn't it 100% reliable?

Update: For those who are interested, here is the final solution:

# obtain script vars
script_pid="$$"
script_path=$(readlink -f "$0")

# make script safe against race conditions (can't be executed twice)
while read -r pid; do

  # obtain all parent and child pids
  pid_list=$(pstree --show-parents --show-pids "$pid")

  # ignore list as it was created through current script execution
  if echo "$pid_list" | grep -F "($script_pid)" >/dev/null; then
    continue
  fi

  # verify that pid is part of the list ("pstree --show-parents" returns list, even pid does not exist https://github.com/acg/psmisc/issues/5)
  if ! echo "$pid_list" | grep -F "($pid)" >/dev/null; then
    continue # process does not exist anymore
  fi

  # obtain age of pid
  pid_time=$( stat -c%X "/proc/$pid" 2>/dev/null )
  if [[ ! $pid_time ]]; then
    continue # process does not exist anymore
  fi

  # kill outdated script executions (older than 12 hours / 43200 seconds)
  if [[ $(date +%s --d="now - $pid_time seconds") -gt 43200 ]]; then
    if kill -9 "$pid"; then
      echo "Warning: Outdated script execution ($pid) has been killed!"
    fi

  # we are facing a race condition
  else
    echo "Error: Script is already running!"
    exit 1
  fi

done < <(pgrep -f "^/bin/bash $script_path") # obtain pids that belong to this script
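
The "belongs to the current execution" test in the loop parses pstree output. As a rough alternative sketch (not what the script above does), the ancestor direction of that test can also be done by walking the PPid chain in /proc/<pid>/status and comparing pids as whole values. The helper names parent_of and has_ancestor are hypothetical. Unlike the pstree call, this does not look at the children of a pid, and it assumes a Linux /proc.

```shell
#!/usr/bin/env bash
# parent_of and has_ancestor are hypothetical helper names (sketch only).

# Print the parent pid of process $1, read from /proc/<pid>/status (Linux only)
parent_of() {
  local key value
  [[ -r /proc/$1/status ]] || return 1   # process does not exist anymore
  while read -r key value; do
    if [[ $key == "PPid:" ]]; then
      printf '%s\n' "$value"
      return 0
    fi
  done <"/proc/$1/status"
  return 1
}

# Return 0 if $2 is process $1 itself or one of its ancestors
has_ancestor() {
  local pid=$1 ancestor=$2
  while [[ -n $pid && $pid -gt 0 ]]; do
    [[ $pid == "$ancestor" ]] && return 0
    pid=$(parent_of "$pid") || return 1
  done
  return 1
}

script_pid=$$

# A subshell forked by this script has the script's pid in its ancestor chain
(
  if has_ancestor "$BASHPID" "$script_pid"; then
    echo "subshell $BASHPID belongs to script $script_pid"
  fi
)
```

Inside the loop, a pid returned by pgrep could then be skipped with something like `has_ancestor "$pid" "$script_pid" && continue`.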

mgutt
  • 1
    Seems like a race condition. For an extremely brief period, between when the script process forked out to start sed and actually exec'd sed, that child process has the same command line as the parent process, so it got picked up by pgrep. But I don't see how this is a problem for your use case, since it would be picking up the child process only when the main script is already running, right? – muru Apr 04 '23 at 00:33
  • 1
    The -a option to pgrep is explained in the man page as, "include process ancestors in the match list". So the parent process will be included, even when that parent process doesn't match the regular expression. This sounds like it can be a source of "the wrong pid". – Sotto Voce Apr 04 '23 at 00:47
  • systemd and tmpfs have ways of dealing with this to prevent the races. – user10489 Apr 04 '23 at 01:08
  • @muru My problem was in the "several checks" part. I only checked [[ $pid == "$$" ]] || [[ $(ps -o ppid= -p "$pid" | xargs) == "$$" ]]. Instead I should have checked all parent and child pids with pstree -sp "$pid" | grep -oP "(?<=\()[0-9]+(?=\))" | grep -q "$$". That way the sed process is skipped, too. Do you have any references for your statement "extremely brief period ... has the same command"? If so, feel free to post it as an answer. – mgutt Apr 04 '23 at 07:42
  • @SottoVoce I'm not using FreeBSD. In other versions of pgrep, -a means List the full command line as well as the process ID. – mgutt Apr 04 '23 at 08:37
  • I found the same question (https://unix.stackexchange.com/questions/281758/command-substitution-spawned-process-name-is-identical-to-the-parent-one/281761) and voted for closing mine. – mgutt Apr 04 '23 at 09:13
  • Your reasons for avoiding pidfiles might be based on a false premise - perhaps you have been relying on the mere presence of the pidfile rather than actually inspecting the process it references? – Toby Speight Apr 04 '23 at 10:08
  • @TobySpeight Yes, I could use the pid stored in the pidfile and do the checks and cleanup with that, but the pgrep approach does not need to create pidfiles or read their content at all. You also don't need to think about traps that delete pidfiles on script interruption, and you never face the problem that someone deletes the pidfile while forgetting to kill a stale script. Conclusion: less code and more fail-proof. – mgutt Apr 04 '23 at 10:25
  • @muru Only a small update: | grep -q "$$" was a bad idea as $$ creates another race condition as it sometimes seems to have the pid of the invoking pipe command. That's why I now set script_pid="$$" at the beginning of the script and use grep -qF "($script_pid)" instead (full code in my question if someone is interested). – mgutt Apr 05 '23 at 12:25
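
The explanation from the first comment (a forked child keeps the parent's command line until it calls exec) can be reproduced on Linux with a small sketch. The helper name cmdline_of is hypothetical. It reads /proc/<pid>/cmdline with bash builtins only, so the subshell never execs another program while it inspects itself.

```shell
#!/usr/bin/env bash
# Sketch (Linux only): a forked subshell has its own pid but, until it calls
# exec, the same /proc/<pid>/cmdline as the script that forked it.
# cmdline_of is a hypothetical helper that uses only bash builtins.
cmdline_of() {
  local arg out=""
  # arguments in /proc/<pid>/cmdline are separated by NUL bytes
  while IFS= read -r -d '' arg; do
    out+="${out:+ }$arg"
  done <"/proc/$1/cmdline"
  printf '%s\n' "$out"
}

parent_cmdline=$(cmdline_of "$$")

# ( ... ) forks a subshell; $BASHPID is the pid of that child, while $$ stays
# the pid of the main script
child_report=$( ( printf '%s\n' "$BASHPID"; cmdline_of "$BASHPID" ) )
child_pid=${child_report%%$'\n'*}
child_cmdline=${child_report#*$'\n'}

echo "parent $$: $parent_cmdline"
echo "child  $child_pid: $child_cmdline"
```

Both lines print the same command line under different pids, which is the state pgrep -f can observe in the short window before a child such as sed has been exec'd.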
