3

Sometimes I realize I need to run another command once a process completes. If it's in the same shell (and I can Ctrl-Z it), then the question "Can I somehow add a "&& prog2" to an already running prog1?" provides many good solutions. However, those only work if the process is a child of the current shell.

There is the fairly standard idiom, at least for processes sharing the same user-id, which looks like this in shell:

while kill -s 0 $pid 2> /dev/null; do sleep 1; done

On systems with /proc (like Linux), you can use [ -d /proc/$pid ] instead of kill to dispense with the same user-id requirement.

But both of those have a race condition: the process could exit during the sleep and then a new process could obtain the same PID. The loop would thus continue, unaware the process it was interested in actually exited.

Is there a way to eliminate the race condition in a shell script?

derobert
  • 109,670

4 Answers

3

Possibly Linux-specific, other /procs might behave differently, but on Linux you can do this:

(   # subshell to preserve CWD
    cd /proc/$pid || exit
    [ "$expected_path" = "$(readlink exe)" ] || exit # optional, and can do more checks
    while [ -d . ]; do sleep 1; done
)

The working directory will cease to be valid (and this . will no longer exist) when the process exits. However, if a new process takes that pid, it does not become valid again. If the process exits (and the pid remains unused) before the cd, then cd will fail and the subshell exits. If the pid has been re-used before the cd, then that will succeed but the "optional" checks (thanks LJKims) will hopefully catch it. So, as long as the checks catch that very short possible race, there is no race condition. To test, you can use something like:

# shell 1
$ sleep 10 &
[1] 26453
$ cd /proc/26453

# shell 2 (must be bash for BASHPID)
for ((i=0; i<200000; ++i)) do ( if [ 26453 -eq $BASHPID ]; then echo "I have the pid - sleeping 60"; sleep 60; echo "done sleeping"; fi ); done
I have the pid - sleeping 60

# back to shell 1
$ [ -d . ]; echo $?
1

So the check still notices that the sleep it was watching has exited, even though PID 26453 has been reused by a new process.
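For repeated use, the subshell can be wrapped in a function. This is only a sketch under the same Linux /proc assumptions; the name wait_for_pid and its return codes are made up for illustration, and the exe comparison is still the only thing guarding against PID reuse before the cd.

```shell
# wait_for_pid PID [EXPECTED_EXE]   (hypothetical helper, Linux /proc assumed)
# Returns 0 once the process has gone away, 1 if it could not be found
# or did not pass the optional executable check.
wait_for_pid() {
    (   # subshell so the caller's working directory is left alone
        cd "/proc/$1" 2>/dev/null || exit 1
        if [ -n "${2-}" ]; then
            # optional identity check, as in the snippet above
            [ "$2" = "$(readlink exe)" ] || exit 1
        fi
        # "." stops being a valid directory once the process is gone,
        # and does not become valid again if the PID is reused
        while [ -d . ]; do sleep 1; done
    )
}
```

Usage would then be something like wait_for_pid "$pid" /usr/bin/rsync && prog2, where the path is whatever you expect readlink /proc/$pid/exe to print.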

derobert
  • 109,670
  • you probably want cd /proc/$pid && ... to avoid yet another race condition should $pid exit before the cd gets run – thrig Jun 02 '17 at 20:27
  • 1
    @thrig true, though that still leaves a race before the cd. That's pretty much unavoidable, I think. – derobert Jun 02 '17 at 21:32
  • No, if the pid exits before the cd runs, the cd fails, the error goes unchecked, and the while loop loops forever in whatever directory that subshell started out in. – thrig Jun 03 '17 at 14:27
  • @thrig yep. Sorry about that, I didn't have the chance to edit the answer until now. I've fixed it, and added in a check to catch the race before cd. – derobert Jun 03 '17 at 18:17
2

Supposing you can stop the process (e.g., via ctrl+z as you mention) this is a non-racy method to simulate && prog2 (actually, more accurately ; prog2, since the exit status won't be checked but that could be added).

Find the PID. Open another terminal and execute

strace -f -e trace=exit_group -p $PID && prog2

Then, resume the process.

Mark Wagner
  • 1,911
  • I think you can even use trace=none there. Though that's still adding a bit of overhead, as ptrace returns for each syscall, the filtering of what to display is done in userspace (you can see this by doing, e.g., strace strace -e trace=none perl -e 'sleep(1) while 1'). BTW: processes can exit without calling exit_group, e.g., due to SIGKILL. – derobert Jun 03 '17 at 18:28
1

Presumably you don't want to monitor a process based on its PID alone. That is, likely you have an idea of what the process is. Why don't you monitor the contents of /proc/<pid>/cmdline and once it either ceases to exist or no longer matches what you expect, you can then execute your command/job?
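A rough sketch of that idea, assuming Linux's /proc layout. The function names read_cmdline and wait_while_cmdline are invented for this example. Since cmdline separates arguments with NUL bytes, they are turned into spaces before comparing, and the expected string should be captured with the same helper so that the trailing space matches.

```shell
# read_cmdline PID: print the command line with NULs turned into spaces.
# Prints nothing if the process is gone (or is a zombie).
read_cmdline() {
    { tr '\0' ' ' < "/proc/$1/cmdline"; } 2>/dev/null
}

# wait_while_cmdline PID EXPECTED: poll until the PID's command line
# is missing or no longer equals EXPECTED.
wait_while_cmdline() {
    while [ "$(read_cmdline "$1")" = "$2" ]; do
        sleep 1
    done
}
```

You would capture expected=$(read_cmdline "$pid") once, while you are sure the PID is the right process, and then run wait_while_cmdline "$pid" "$expected"; prog2. This narrows the race rather than removing it: a new process with an identical command line landing on the same PID would still fool it.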

LJKims
  • 429
  • 1
    That could catch another instance of the same program, among other, less likely race conditions. – Gilles 'SO- stop being evil' Jun 02 '17 at 22:17
  • The fact that he can't use wait tells me that he has either exited the shell that launched the PID to be watched or that he is watching a process that some other user launched. Yet somehow he knows the PID of that process. He seems to be worried about the program ending during a short sleep cycle or even while simply typing 'cd'. If this process is so short running there is little chance to ever catch it. Regardless, in order to know the PID of a process you either had to be present when it was launched or know what is running to search for its PID. – LJKims Jun 02 '17 at 23:46
  • There is no way to avoid a race condition for a process that you did not launch and are trying to monitor later. What I have described is probably about the best option. Could another instance of the EXACT same process get launched with the exact same PID within an instant? Highly unlikely. However, if you are ONLY looking at the PID it is much more likely that ANY other process might get that same PID. I'd say no solution is perfect but looking at the PID and the process cmdline is much safer than the PID alone. – LJKims Jun 02 '17 at 23:48
  • It could also have been launched via a shell I've already closed (e.g., started via ssh, then logged out). Or from cron or at. Anyway, checking cmdline (which is a tiny bit non-trivial since it contains nulls) or exe is a good idea; I've swiped it and put it as a check in my answer... – derobert Jun 03 '17 at 18:36
0

Maybe a lock file is a good approach. You could even embed a string passed to the sub-process and put it into the lock file / part of the lock file name.

The lock file could be deleted as part of the process ending, and the PID could be stored inside it. You could then look for the lock file by name or by its contents: if the lock file is there, the process is still running and you can read the PID from the file; if the lock file is gone, the process has finished.
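As a sketch of what this might look like (it assumes you can start the process through a wrapper in the first place; run_with_lock and wait_for_lock are hypothetical names, not existing tools):

```shell
# run_with_lock LOCKFILE CMD [ARGS...]
# Starts CMD, records its PID in LOCKFILE, and removes LOCKFILE when CMD ends.
run_with_lock() {
    (
        lock=$1
        shift
        "$@" &
        child=$!
        echo "$child" > "$lock"
        wait "$child"
        status=$?
        rm -f "$lock"
        exit "$status"
    )
}

# wait_for_lock LOCKFILE: poll until the lock file has disappeared.
wait_for_lock() {
    while [ -e "$1" ]; do sleep 1; done
}
```

From another shell, wait_for_lock /path/to/lockfile && prog2 would then run prog2 once the wrapped command has finished. If the wrapper itself is killed, the lock file is left behind, so this is not robust against that case.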

dij8al
  • 1
  • 1
    Welcome to the site, and thank you for your contribution. However, please note that in its current form your answer sounds somewhat vague. You may want to edit your post to include some code examples showing how to implement this. – AdminBee May 18 '22 at 06:34
  • I believe that you have misunderstood the question.  I believe that it is implicit in the question that you realize what you want to do after you have already started the first process, so it’s too late to modify it to incorporate any sort of lock file or other termination signaling into its flow. – G-Man Says 'Reinstate Monica' May 21 '22 at 19:54