In the past I used pid files to protect scripts against concurrent execution. The downside was that the pid file was not deleted if the kernel killed the script, so manual intervention was sometimes needed (removing the pid file, killing leftover subprocesses).
Finally I came up with a solution that uses `pgrep` to check whether the script is already running, and that kills really old processes that shouldn't exist anymore:
    while read -r pid cmd; do
        # check if pid or parent pid belong to current script execution
        if [[ $pid == "$$" ]] || [[ $(ps -o ppid= -p "$pid" | xargs) == "$$" ]]; then
            continue
        fi
        # avoid re-execution of script within 12 hours (43200 seconds)
        if [[ $(date +%s --d="now - $( stat -c%X "/proc/$pid" ) seconds") -lt 43200 ]]; then
            echo "Error: Script is already running!"
            exit
        fi
        # kill outdated script executions
        if kill -9 "$pid"; then
            echo "Warning: Outdated script execution ($pid) has been killed!"
        fi
    done < <(pgrep -af "/bin/bash $(basename "$(readlink -f "$0")")")
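The age check above packs a subtraction into a `date` call. As a minimal sketch, the same comparison can be written as plain shell arithmetic, which may be easier to read. It assumes GNU `stat` and a Linux `/proc`, and it keeps the script's own assumption that the access time of `/proc/PID` is a usable stand-in for the start time of the process. The current shell's pid is used here only as a stand-in for a pid returned by `pgrep`.

```shell
#!/usr/bin/env bash
# Sketch only: compute the "age" of a process the same way the script does,
# but with explicit arithmetic instead of date --d="now - N seconds".

pid=$$                                  # stand-in for a pid found by pgrep
started=$(stat -c %X "/proc/$pid")      # access time of /proc/PID in epoch seconds
now=$(date +%s)                         # current time in epoch seconds
age=$(( now - started ))                # seconds since that timestamp

max_age=43200                           # 12 hours, as in the script
if (( age < max_age )); then
    echo "pid $pid is younger than 12 hours (age: ${age}s)"
else
    echo "pid $pid is older than 12 hours (age: ${age}s)"
fi
```

Both forms should give the same number; the arithmetic version just makes the two timestamps visible.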
This works on hundreds of servers and across thousands of script executions without any issues, but in very rare cases it returned an "already running" error although no other instance of the script was running.
As I was not able to find a bug in my code, I added more debug output:
    echo "pid: $pid"
    echo "cmd: $cmd"
    echo "lsof:"
    lsof -p "$pid"
    echo "/proc/$pid/cmdline:"
    xargs -0 <"/proc/$pid/cmdline"
    echo "pstree:"
    pstree -asp "$pid"
And after several weeks it returned this:
    2023-04-02 06:10:01 Error: Script is already running!
    2023-04-02 06:10:01 pid: 3902
    2023-04-02 06:10:01 cmd: /bin/bash /usr/local/bin/script.sh
    2023-04-02 06:10:01 lsof:
    2023-04-02 06:10:01 COMMAND  PID USER  FD TYPE DEVICE SIZE/OFF     NODE NAME
    2023-04-02 06:10:01 sed     3902 root cwd  DIR    8,3     4096  6291457 /root
    2023-04-02 06:10:01 sed     3902 root rtd  DIR    8,3     4096        2 /
    2023-04-02 06:10:01 sed     3902 root txt  REG    8,3    77664  2886809 /usr/bin/sed
    2023-04-02 06:10:01 sed     3902 root mem  REG    8,3  1936776  5243076 /lib64/libc-2.22.so
    2023-04-02 06:10:01 sed     3902 root mem  REG    8,3    10344  2892257 /usr/lib64/coreutils/libstdbuf.so
    2023-04-02 06:10:01 sed     3902 root mem  REG    8,3   164144  5242936 /lib64/ld-2.22.so
    2023-04-02 06:10:01 sed     3902 root  0r FIFO   0,12      0t0 96041317 pipe
    2023-04-02 06:10:01 sed     3902 root  1w FIFO   0,12      0t0 96037120 pipe
    2023-04-02 06:10:01 sed     3902 root  2w FIFO   0,12      0t0 96021482 pipe
    2023-04-02 06:10:01 /proc/3902/cmdline:
    2023-04-02 06:10:01 sed s/%/%%/g
    2023-04-02 06:10:01 pstree:
    2023-04-02 06:10:01 systemd,1 --switched-root --system --deserialize 23
    2023-04-02 06:10:01   `-cron,3599 -n
    2023-04-02 06:10:01       `-cron,3885 -n
    2023-04-02 06:10:01           `-fcc-monitor-and,3891 /usr/local/bin/script.sh
    2023-04-02 06:10:01               `-fcc-monitor-and,3901 /usr/local/bin/script.sh
    2023-04-02 06:10:01                   `-sed,3902 s/%/%%/g
As you can see, `pgrep` returned `3902`, which is used by a subshell process of the script. Instead I would expect `3891` or `3901` (as in all other executions). Is this a bug in `pgrep`, or why isn't it 100% reliable?
Update: For those who are interested, here is the final solution:
    # obtain script vars
    script_pid="$$"
    script_path=$(readlink -f "$0")
    # make script safe against race conditions (can't be executed twice)
    while read -r pid; do
        # obtain all parent and child pids
        pid_list=$(pstree --show-parents --show-pids "$pid")
        # ignore list as it was created through current script execution
        if echo "$pid_list" | grep -F "($script_pid)" >/dev/null; then
            continue
        fi
        # verify that pid is part of the list ("pstree --show-parents" returns a list even if the pid does not exist: https://github.com/acg/psmisc/issues/5)
        if ! echo "$pid_list" | grep -F "($pid)" >/dev/null; then
            continue # process does not exist anymore
        fi
        # obtain age of pid
        pid_time=$( stat -c%X "/proc/$pid" 2>/dev/null )
        if [[ ! $pid_time ]]; then
            continue # process does not exist anymore
        fi
        # kill outdated script executions (older than 12 hours / 43200 seconds)
        if [[ $(date +%s --d="now - $pid_time seconds") -gt 43200 ]]; then
            if kill -9 "$pid"; then
                echo "Warning: Outdated script execution ($pid) has been killed!"
            fi
        # we are facing a race condition
        else
            echo "Error: Script is already running!"
            exit 1
        fi
    done < <(pgrep -f "^/bin/bash $script_path") # obtain pids that belong to this script
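One detail of the final solution that is easy to overlook is the parenthesised fixed-string match, `grep -F "($script_pid)"`. The sketch below shows why matching the bare number can give a false positive: a short pid can be a substring of a longer one. The `pid_list` string is a made-up sample in the style of `pstree --show-pids` output, and the pid `390` is hypothetical. This is offered as one plausible benefit of the pattern, not as the author's stated reason for it.

```shell
#!/usr/bin/env bash
# Sketch only: compare a bare-number match with a "(pid)" fixed-string match.

# Made-up sample resembling one line of `pstree --show-pids` output.
pid_list='cron(3885)---script.sh(3891)---script.sh(3901)---sed(3902)'

script_pid=390    # hypothetical pid that is NOT in the list

# Bare number: "390" is a substring of "3901" and "3902", so this matches.
if grep -q "$script_pid" <<<"$pid_list"; then loose_match=yes; else loose_match=no; fi

# Fixed string with parentheses: only an exact "(390)" would match.
if grep -qF "($script_pid)" <<<"$pid_list"; then strict_match=yes; else strict_match=no; fi

# A pid that really is in the list still matches with the strict pattern.
if grep -qF "(3891)" <<<"$pid_list"; then strict_real=yes; else strict_real=no; fi

echo "loose match for $script_pid:  $loose_match"
echo "strict match for $script_pid: $strict_match"
echo "strict match for 3891: $strict_real"
```

With the bare number the script would wrongly treat an unrelated process tree as its own; the parenthesised form avoids that.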
[...] `sed` and actually `exec`'d `sed`, that child process has the same command line as the parent process, so it got picked up by `pgrep`. But I don't see how this is a problem for your use case, since it would be picking up the child process only when the main script is already running, right? – muru Apr 04 '23 at 00:33

[...] `-a` option to `pgrep` is explained in the man page as, "include process ancestors in the match list". So the parent process will be included, even when that parent process doesn't match the regular expression. This sounds like it can be a source of "the wrong pid". – Sotto Voce Apr 04 '23 at 00:47

[...] `[[ $pid == "$$" ]] || [[ $(ps -o ppid= -p "$pid" | xargs) == "$$" ]]`. Instead I should have checked all parent and child pids with `pstree -sp "$pid" | grep -oP "(?<=\()[0-9]+(?=\))" | grep -q "$$"`. That way the `sed` process will be skipped, too. Do you have any references for your statement "extremely brief period ... has the same command"? Then feel free to post it as an answer. – mgutt Apr 04 '23 at 07:42

[...] `-a` means "List the full command line as well as the process ID". – mgutt Apr 04 '23 at 08:37

[...] `pgrep` way does not need to create pidfiles and read their content at all. And you don't even need to think about `trap`s that delete pidfiles on script interruption. And you never face the problem that someone deletes the pidfile while forgetting to kill a stale script. Conclusion: less code and more fail-proof. – mgutt Apr 04 '23 at 10:25

[...] `| grep -q "$$"` was a bad idea, as `$$` creates another race condition: it sometimes seems to have the pid of the invoking pipe command. That's why I now set `script_pid="$$"` at the beginning of the script and use `grep -qF "($script_pid)"` instead (full code in my question if someone is interested). – mgutt Apr 05 '23 at 12:25
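The effect muru describes (a forked child that has not yet exec'd still carries the parent's command line) is hard to catch in the wild, but a similar state can be observed on purpose with a bash subshell, which is a forked child that never execs. The following is a minimal sketch assuming Linux with `/proc` mounted and bash; `read_cmdline` is a hypothetical helper written for this example, and it uses only builtins so that reading the file does not itself trigger an exec.

```shell
#!/usr/bin/env bash
# Sketch only: a forked bash child reports the same /proc cmdline as its parent.

# Hypothetical helper: print /proc/PID/cmdline with NUL separators turned
# into spaces, using shell builtins only.
read_cmdline() {
    local arg out=""
    while IFS= read -r -d '' arg; do
        out+="${out:+ }$arg"
    done < "/proc/$1/cmdline"
    printf '%s\n' "$out"
}

parent_pid=$$
parent_cmdline=$(read_cmdline "$parent_pid")

# Command substitution forks a subshell. Inside it, $BASHPID is the child's
# own pid, while $$ would still report the main script's pid.
child_info=$(printf '%s ' "$BASHPID"; read_cmdline "$BASHPID")
child_pid=${child_info%% *}
child_cmdline=${child_info#* }

echo "parent $parent_pid: $parent_cmdline"
echo "child  $child_pid: $child_cmdline"
```

The two pids differ while the command lines are identical, which is why a pattern passed to `pgrep -f` can match such a child as well as the script itself.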