47

Suppose, for example, you have a shell script similar to:

longrunningthing &
p=$!
echo Killing longrunningthing on PID $p in 24 hours
sleep 86400
echo Time up!
kill $p

Should do the trick, shouldn't it? Except that the process may have terminated early and its PID may have been recycled, meaning some innocent job gets a bomb in its signal queue instead. In practice this possibly doesn't matter, but it's worrying me nonetheless. Hacking longrunningthing to drop dead by itself, or keeping/removing its PID on the FS, would do, but I'm thinking of the generic situation here.

FJL
  • 3
    You make sure that if your targeted process dies, it kills its killer in the process. – mikeserv Jan 25 '15 at 18:24
  • 3
    Use killall which matches on name, so at least you are only killing a process with the same name as longrunningthing. Assuming you will only have one of these running at a time. – LawrenceC Jan 25 '15 at 23:29
  • 9
    You can save the start time of the original process and, before killing, check that the start time of the process with that pid matches what you saved. The pair pid, start-time is a unique identifier for the processes in Linux. – Bakuriu Jan 26 '15 at 10:43
  • 1
    May I ask why you need this? What is the fundamental thing you are trying to achieve? (something that runs continuously but is reset every 24h?) – Olivier Dulac Jan 26 '15 at 17:09
  • 2
    @mikeserv A process can't guarantee that something will happen in the event of its own death. – kasperd Jan 26 '15 at 22:34
  • @kasperd - that depends on what launches what. There are various ways to ensure that it can - stephane demos the process groups below as they are associated with controlling terminals, and on linux systems there are pid namespaces - which can be assured even after a process dies as easily as mount --bind /proc/"$pid"/ns/pid /else/where. – mikeserv Jan 26 '15 at 22:41
  • From what I can gather, Linux doesn't reuse PIDs until it cycles back around through those 32,000 integers at its disposal. New processes are given new PIDs in sequential order. Out of curiosity I ran the following script on one of my servers for a day and it does seem that new processes get higher-number PIDs; old unused PIDs are not reused: while true; do ps -eo "%p" | sort -n > /tmp/sort.out; sleep 300; diff <(ps -eo "%p" | sort -n) /tmp/sort.out; done – Michael Martinez Jan 28 '15 at 19:44

10 Answers

33

Best would be to use the timeout command, if you have it, which is meant for that:

timeout 86400 cmd

The current (8.23) GNU implementation at least works by using alarm() or equivalent while waiting for the child process. It does not seem to be guarding against the SIGALRM being delivered in between waitpid() returning and timeout exiting (effectively cancelling that alarm). During that small window, timeout may even write messages on stderr (for instance if the child dumped a core) which would further enlarge that race window (indefinitely if stderr is a full pipe for instance).

I personally can live with that limitation (which probably will be fixed in a future version). timeout will also take extra care to report the correct exit status, handle other corner cases (like SIGALRM blocked/ignored on startup, handle other signals...) better than you'd probably manage to do by hand.
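
For instance, GNU timeout exits with status 124 when the time limit is hit, so the caller can tell a timeout apart from the command failing on its own. A minimal sketch, assuming the GNU implementation:

timeout 86400 cmd
status=$?
if [ "$status" -eq 124 ]; then
    # note: a command that itself exits with 124 would be indistinguishable here
    echo 'cmd timed out after 24 hours' >&2
else
    echo "cmd exited on its own with status $status" >&2
fi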

As an approximation, you could write it in perl like:

perl -MPOSIX -e '
  $p = fork();
  die "fork: $!\n" unless defined($p);
  if ($p) {
    $SIG{ALRM} = sub {
      kill "TERM", $p;
      exit 124;
    };
    alarm(86400);
    wait;
    exit (WIFSIGNALED($?) ? WTERMSIG($?)+128 : WEXITSTATUS($?))
  } else {exec @ARGV}' cmd

There's a timelimit command at http://devel.ringlet.net/sysutils/timelimit/ (predates GNU timeout by a few months).

 timelimit -t 86400 cmd

That one uses an alarm()-like mechanism but installs a handler on SIGCHLD (ignoring stopped children) to detect the child dying. It also cancels the alarm before running waitpid() (that doesn't cancel the delivery of SIGALRM if it was pending, but the way it's written, I can't see it being a problem) and kills before calling waitpid() (so can't kill a reused pid).

netpipes also has a timelimit command. That one predates all the other ones by decades, takes yet another approach, but doesn't work properly for stopped commands and returns a 1 exit status upon timeout.

As a more direct answer to your question, you could do something like:

if [ "$(ps -o ppid= -p "$p")" -eq "$$" ]; then
  kill "$p"
fi

That is, check that the process is still a child of ours. Again, there's a small race window (in between ps retrieving the status of that process and kill killing it) during which the process could die and its pid be reused by another process.

With some shells (zsh, bash, mksh), you can pass job specs instead of pids.

cmd &
sleep 86400
kill %
wait "$!" # to retrieve the exit status

That only works if you spawn only one background job (otherwise getting the right jobspec is not always possible reliably).

If that's an issue, just start a new shell instance:

bash -c '"$@" & sleep 86400; kill %; wait "$!"' sh cmd

That works because the shell removes the job from the job table upon the child dying. Here, there should not be any race window since by the time the shell calls kill(), either the SIGCHLD signal has not been handled and the pid can't be reused (since it has not been waited for), or it has been handled and the job has been removed from the job table (and kill would report an error). bash's kill at least blocks SIGCHLD before it accesses its job table to expand the % and unblocks it after the kill().

Another option to avoid having that sleep process hanging around even after cmd has died, with bash or ksh93 is to use a pipe with read -t instead of sleep:

{
  {
    cmd 4>&1 >&3 3>&- &
    printf '%d\n.' "$!"
  } | {
    read p
    read -t 86400 || kill "$p"
  }
} 3>&1

That one still has race conditions, and you lose the command's exit status. It also assumes cmd doesn't close its fd 4.

You could try implementing a race-free solution in perl like:

perl -MPOSIX -e '
   $p = fork();
   die "fork: $!\n" unless defined($p);
   if ($p) {
     $SIG{CHLD} = sub {
       $ss = POSIX::SigSet->new(SIGALRM); $oss = POSIX::SigSet->new;
       sigprocmask(SIG_BLOCK, $ss, $oss);
       waitpid($p,WNOHANG);
       exit (WIFSIGNALED($?) ? WTERMSIG($?)+128 : WEXITSTATUS($?))
           unless $? == -1;
       sigprocmask(SIG_UNBLOCK, $oss);
     };
     $SIG{ALRM} = sub {
       kill "TERM", $p;
       exit 124;
     };
     alarm(86400);
     pause while 1;
   } else {exec @ARGV}' cmd args...

(though it would need to be improved to handle other types of corner cases).

Another race-free method could be using process groups:

set -m
((sleep 86400; kill 0) & exec cmd)

However note that using process groups can have side-effects if there's I/O to a terminal device involved. It does have the additional benefit, though, of killing all the other extra processes spawned by cmd.

  • 5
    Why not mention the best method first? – deltab Jan 26 '15 at 03:43
  • 2
    @deltab: timeout is not portable, the answer mentioned a portable solution first. – cuonglm Jan 26 '15 at 04:51
  • 1
    @deltab: it gives insight into how things work and especially how a "common sense" approach can fail (Stephane prefers to teach one to fish first, which I like). One is expected to read the whole answer – Olivier Dulac Jan 26 '15 at 13:10
  • @Stephane: for the "getting the right jobspec is not always possible reliably" : can't you first count the output of jobs and then know that (as it's your own shell, in which you have control on what happens next) the next backgrounded job will be N+1 ? [then you can save N, and later kill %N+1]) – Olivier Dulac Jan 26 '15 at 13:13
  • 1
    @OlivierDulac, that would assume no past job has terminated by the time you start a new one (shells reuse job numbers). – Stéphane Chazelas Jan 26 '15 at 13:44
  • luit can be a pretty cheap means of getting a pty - cheaper than ssh or screen anyway - and is usually available anywhere there is an xterm. It also offers the sometimes handy -[oi]log and/or -argv0 switches. If you can do setsid within its child then it can go a long way toward isolating the terminal i/o you mention in the last bit. – mikeserv Jan 26 '15 at 15:30
  • @StéphaneChazelas thx, I thought there was something, hence my question ^^ (Maybe you should add this next to the %job bit? but don't waste an edit just for this) – Olivier Dulac Jan 26 '15 at 16:55
  • I think bash uses job numbers higher than the highest currently running job, so that cmd & sleep 864000 & sleep 86400; random; commands; involving; jobs; kill %1 should kill the cmd, (assuming that the first sleep is still alive). – muru Jan 27 '15 at 01:06
  • @muru, unless cmd has already died by the time those sleep jobs are started, or some of the later jobs are started after all of those cmd and sleeps have died. And that assumes cmd is the first job you run. If not, getting the job number is racy. (But it's true it's slightly easier in bash than in other shells) – Stéphane Chazelas Jan 27 '15 at 07:21
  • I like your thorough answer, but in your very last idea ("...using process groups:"), if the main process ("cmd") finishes before the timeout, won't the background process get re-parented to the init process (PID 1), and therefore kill any processes whose parent is the init process? If you like, see my answer for an alternative based upon this strategy. – Sean May 13 '21 at 03:11
31

In general, you can't. All of the answers given so far are buggy heuristics. There is only one case in which you can safely use the pid to send signals: when the target process is a direct child of the process that will be sending the signal, and the parent has not yet waited on it. In this case, even if it has exited, the pid is reserved (this is what a "zombie process" is) until the parent waits on it. I'm not aware of any way to do that cleanly with the shell.

An alternative safe way to kill processes is to start them with a controlling tty set to a pseudo-terminal for which you own the master side. You can then send signals via the terminal, e.g. by writing the interrupt or quit character (delivering SIGINT or SIGQUIT) over the pty.

Yet another way that's more convenient with scripting is to use a named screen session and send commands to the screen session to end it. This process takes place over a pipe or unix socket named according to the screen session, which won't automatically be reused if you choose a safe unique name.
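
A minimal sketch of that approach, where the session name longrun.$$ is just an illustrative attempt at a unique name:

screen -dmS "longrun.$$" longrunningthing   # start detached under a chosen session name
sleep 86400
screen -S "longrun.$$" -X quit              # tell that session to terminate; harmless if it is already gone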

  • 1
    Thank you. When I started reading the answers I already went *facepalm* until I encountered yours. All other methods are best effort, but not a sure shot. Technically the best answer, IMO. – 0xC0000022L Jan 26 '15 at 11:18
  • 4
    I can't see why that couldn't be done in shells. I've given several solutions. – Stéphane Chazelas Jan 26 '15 at 12:40
  • 4
    Could you please give an explanation and some sort of quantitative discussion of race windows and other drawbacks? Without that, "All of the answers given so far are buggy heuristics" is just unnecessarily confrontational without any benefit. – peterph Jan 26 '15 at 13:18
  • 6
    @peterph: In general, any use of a pid is a TOCTOU race -- no matter how you check whether it still refers to the same process you expect it to refer to, it could cease to refer to that process and refer to some new process in the interval before you use it (sending the signal). The only way to prevent this is to be able to block the freeing/reuse of the pid, and the only process that can do this is the direct parent. – R.. GitHub STOP HELPING ICE Jan 26 '15 at 18:17
  • 3
    @StéphaneChazelas: How do you prevent the shell from waiting on a pid of a background process that has exited? If you can do that, the problem is easily solvable in the case OP needs. – R.. GitHub STOP HELPING ICE Jan 26 '15 at 18:18
  • @R.. direct parent - check my answer. Generally the race window is small compared to how quickly the PID would get reused (under non-extreme conditions on Linux; I can't say much about other kernels). – peterph Jan 26 '15 at 18:25
  • 8
    @peterph: "The race window is small" is not a solution. And the rarity of the race is dependent on sequential pid assignment. Bugs that cause something very bad to happen once in a year are much worse than bugs that happen all the time because they're virtually impossible to diagnose and fix. – R.. GitHub STOP HELPING ICE Jan 26 '15 at 19:10
  • No doubt about the general statement, but I would rather leave it for the OP to decide whether the race is acceptable or not in his particular case - it's certainly good to point that out, though. – peterph Jan 26 '15 at 19:48
  • Again, see my answer for various race-free approaches. And shells are tools to run other commands, so if it can be done in any way (by any command, written in any language), then it can be done in shells. – Stéphane Chazelas Jan 26 '15 at 20:16
  • I think this answer explains the best how the desired result can be achieved by using the right code in the parent process, and what are the technical drawbacks of methods not involving the parent process. However, I don't think it is entirely impossible to do something reliable without being the parent of the target process. Once you have checked that the target process still exists, you can attach to it with ptrace and then repeat the check. I would expect that attaching to a process with ptrace does prevent the pid from being reused. – kasperd Jan 26 '15 at 22:49
  • 2
    @kasperd: Indeed ptrace works, but it's not a good idea. It's not portable, it precludes debugging/strace (since there can usually only be one tracing process), and hardened systems sometimes have the ptrace syscall disabled entirely because of a history of vulns. – R.. GitHub STOP HELPING ICE Jan 27 '15 at 00:17
  • 1
    @R.. I agree with all of those points, but it seems to be the only way to send a signal to a process which isn't your child without a race condition. – kasperd Jan 27 '15 at 07:07
  • Extract the pid, time process started after system boot, and address of the start of the stack from /proc/[pid]/stat. This guarantees uniqueness of the process. – Michael Martinez Jan 28 '15 at 18:36
  • 2
    @MichaelMartinez: That does not help. There's still a TOCTOU race: first you check /proc/[pid]/stat, then you use the pid to kill. – R.. GitHub STOP HELPING ICE Jan 28 '15 at 19:26
  • @R.. yes assuming the kernel reuses that same PID within the second or less it takes to check the stats and issue the kill. Which I suppose is theoretically possible but unlikely since Linux assigns new PIDs sequentially, and I don't think there's a system in the world that goes through 32,000 PIDs in one second. – Michael Martinez Jan 28 '15 at 19:33
  • 2
    The fact that you think it's unlikely to happen does not make it any less of a race. The potential amount of time between the stat read and the kill is unbounded; the code may lie in different pages, where one of the pages is resident and the other has been discarded and needs to be reloaded from a disk that's under high IO load. Also, hardened systems should assign pids randomly, not sequentially, and it's pretty easy for a malicious load source to run through the default pid space of 32k in a matter of seconds. – R.. GitHub STOP HELPING ICE Jan 28 '15 at 19:40
  • @R.. you're absolutely right. there is a race. I'm curious however as to which Linux systems assign pids randomly, do you have an example? – Michael Martinez Jan 28 '15 at 19:47
  • 2
    @MichaelMartinez: grsec has an option for it. See http://grsecurity.net/confighelp.php/ – R.. GitHub STOP HELPING ICE Jan 28 '15 at 19:55
  • 1
    "In general, you can't" seems like a rather bold and risky assertion. Are you sure? What about mikeserv's suggestion of using a pid namespace? And even if that one doesn't pan out... are you still sure there isn't a method that you might not have thought of or know about? This is a good answer but I think it would be even better without that initial assertion. – Don Hatch May 04 '15 at 20:48
12
  1. When launching the process save its start time:

    longrunningthing &
    p=$!
    stime=$(TZ=UTC0 ps -p "$p" -o lstart=)
    
    echo "Killing longrunningthing on PID $p in 24 hours"
    sleep 86400
    echo Time up!
    
  2. Before trying to kill the process, stop it (this isn't truly essential, but it's a way to avoid race conditions: if you stop the process, its pid cannot be reused)

    kill -s STOP "$p"
    
  3. Check that the process with that PID has the same start time and if yes, kill it, otherwise let the process continue:

    cur=$(TZ=UTC0 ps -p "$p" -o lstart=)
    
    if [ "$cur" = "$stime" ]
    then
        # Okay, we can kill that process
        kill "$p"
    else
        # PID was reused. Better unblock the process!
        echo "long running task already completed!"
        kill -s CONT "$p"
    fi
    

This works because there can be only one process with the same PID and start time on a given OS.

Stopping the process during the check makes race conditions a non-issue. Obviously this has the problem that some random process may be stopped for a few milliseconds. Depending on the type of process this may or may not be an issue.


Personally I'd simply use python and psutil which handles PID reuse automatically:

import time

import psutil

# note: it would be better if you were able to avoid using
#       shell=True here.
proc = psutil.Popen('longrunningtask', shell=True)
time.sleep(86400)

# PID reuse handled by the library, no need to worry.
proc.terminate()   # or: proc.kill()
Bakuriu
  • Python rules in UNIX... I'm not sure why more answers don't start there as I'm sure most systems do not prohibit the use of it. – Mr. Mascaro Jan 26 '15 at 18:03
  • I've used a similar scheme (using the start time) before, but your sh scripting skills are neater than mine! Thanks. – FJL Jan 26 '15 at 20:14
  • That means you're potentially stopping the wrong process. Note that ps -o start= format changes from 18:12 to Jan26 after a while. Beware of DST changes as well. If on Linux, you'll probably prefer TZ=UTC0 ps -o lstart=. – Stéphane Chazelas Jan 26 '15 at 20:30
  • @StéphaneChazelas Yes, but you let it continue afterwards. I explicitly said: depending on the type of task that process is performing, you may have some trouble from stopping it for some milliseconds. Thanks for the tip on lstart, I'll edit it in. – Bakuriu Jan 26 '15 at 20:33
  • 2
    Note that (unless your system limits the number of processes per user), it's easy for anyone to fill the process table with zombies. Once there are only 3 available pids left, it's easy for anyone to start hundreds of different processes with the same pid within a single second. So, strictly speaking, your "there can be only one process with the same PID and start time on a given OS" is not necessarily true. – Stéphane Chazelas Jan 26 '15 at 21:19
9

On a Linux system you can ensure that a pid will not be reused by keeping its pid namespace alive. This can be done via the /proc/$pid/ns/pid file.

  • man namespaces -

    Bind mounting (see mount(2)) one of the files in this directory to somewhere else in the filesystem keeps the corresponding namespace of the process specified by pid alive even if all processes currently in the namespace terminate.

    Opening one of the files in this directory (or a file that is bind mounted to one of these files) returns a file handle for the corresponding namespace of the process specified by pid. As long as this file descriptor remains open, the namespace will remain alive, even if all processes in the namespace terminate. The file descriptor can be passed to setns(2).
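
As a rough sketch of the bind-mount mechanism described in that quote (root privileges assumed; /tmp/pidns-keep is a hypothetical target file and $p the PID of a process in the namespace of interest):

: > /tmp/pidns-keep                              # create an empty file to bind over
mount --bind "/proc/$p/ns/pid" /tmp/pidns-keep   # the pid namespace of $p now stays alive
# ... later, when the namespace no longer needs to be kept around:
umount /tmp/pidns-keep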

You can isolate a group of processes - basically any number of processes - by namespacing their init.

  • man pid_namespaces -

    The first process created in a new namespace (i.e., the process created using clone(2) with the CLONE_NEWPID flag, or the first child created by a process after a call to unshare(2) using the CLONE_NEWPID flag) has the PID 1, and is the init process for the namespace (see init(1)). A child process that is orphaned within the namespace will be reparented to this process rather than init(1) (unless one of the ancestors of the child in the same PID namespace employed the prctl(2) PR_SET_CHILD_SUBREAPER command to mark itself as the reaper of orphaned descendant processes).

    If the init process of a PID namespace terminates, the kernel terminates all of the processes in the namespace via a SIGKILL signal. This behavior reflects the fact that the init process is essential for the correct operation of a PID namespace.

The util-linux package provides many useful tools for manipulating namespaces. For example, there is unshare, though, if you have not already arranged for its rights in a user namespace, it will require superuser rights:
unshare -fp sh -c 'n=
    echo "PID = $$"
    until   [ "$((n+=1))" -gt 5 ]
    do      while   sleep 1
            do      date
            done    >>log 2>/dev/null   &
    done;   sleep 5' >log
cat log; sleep 2
echo 2 secs later...
tail -n1 log

If you have not arranged for a user namespace, then you can still safely execute arbitrary commands by immediately dropping privileges. The runuser command is another (non setuid) binary provided by the util-linux package and incorporating it might look like:

sudo unshare -fp runuser -u "$USER" -- sh -c '...'

...and so on.

In the above example two switches are passed to unshare(1): the --fork flag, which makes the invoked sh -c process the first child created and ensures its init status, and the --pid flag, which instructs unshare(1) to create a pid namespace.

The sh -c process spawns five backgrounded child shells - each an infinite while loop that will continue to append the output of date to the end of log for as long as sleep 1 returns true. After spawning these processes sh calls sleep for an additional 5 seconds then terminates.

It may be worth noting that if the -f flag were not used, none of the backgrounded while loops would terminate, but with it...

OUTPUT:

PID = 1
Mon Jan 26 19:17:45 PST 2015
Mon Jan 26 19:17:45 PST 2015
Mon Jan 26 19:17:45 PST 2015
Mon Jan 26 19:17:45 PST 2015
Mon Jan 26 19:17:45 PST 2015
Mon Jan 26 19:17:46 PST 2015
Mon Jan 26 19:17:46 PST 2015
Mon Jan 26 19:17:46 PST 2015
Mon Jan 26 19:17:46 PST 2015
Mon Jan 26 19:17:46 PST 2015
Mon Jan 26 19:17:47 PST 2015
Mon Jan 26 19:17:47 PST 2015
Mon Jan 26 19:17:47 PST 2015
Mon Jan 26 19:17:47 PST 2015
Mon Jan 26 19:17:47 PST 2015
Mon Jan 26 19:17:48 PST 2015
Mon Jan 26 19:17:48 PST 2015
Mon Jan 26 19:17:48 PST 2015
Mon Jan 26 19:17:48 PST 2015
Mon Jan 26 19:17:48 PST 2015
2 secs later...
Mon Jan 26 19:17:48 PST 2015
mikeserv
  • Interesting answer that seems to be robust. Probably a bit overkill for basic usage, but worth a thought. – Uriel Sep 12 '15 at 08:42
  • 1
    I do not see how or why keeping a PID namespace alive prevents reuse of a PID. The very manpage you quote - As long as this file descriptor remains open, the namespace will remain alive, even if all processes in the namespace terminate - suggests that the processes may still terminate (and thus presumably have their process ID recycled). What does keeping the PID namespace alive have to do with preventing the PID itself from being re-used by another process? – davmac Jun 24 '16 at 14:05
5

Consider making your longrunningthing behave a bit better, a little bit more daemon-like. For example you may make it create a pidfile that will allow at least some limited control of the process. There are several ways of doing this without modifying the original binary, all involving a wrapper. For example:

  1. a simple wrapper script that will start the required job in the background (with optional output redirection), write the PID of this process into a file, then wait for the process to finish (using wait) and remove the file. If during the wait the process gets killed, e.g. by something like

    kill $(cat pidfile)
    

    the wrapper will just make sure the pidfile is removed (a sketch of such a wrapper is shown after this list).

  2. a monitor wrapper that will put its own PID somewhere and catch (and respond to) signals sent to it. Simple example:

    #!/bin/bash
    p=0
    trap killit USR1

    killit () {
        printf "USR1 caught, killing %s\n" "$p"
        kill -9 $p
    }

    printf "monitor $$ is waiting\n"
    therealstuff &
    p=%1
    wait $p
    printf "monitor exiting\n"

Now, as @R.. and @StéphaneChazelas pointed out, these approaches often have a race condition somewhere or impose a restriction on the number of processes you can spawn. In addition, they don't handle the case where longrunningthing may fork and its children get detached (which is probably not the problem in the original question).

With recent (read a couple of years old) Linux kernels, this can be nicely treated by using cgroups, namely the freezer - which, I suppose, is what some modern Linux init systems use.

peterph
  • Thanks, and to all. I'm reading everything now. The point about longrunningthing is that you have no control over what it is. I also gave a shell script example because it explained the problem. I like yours, and all the other creative solutions here, but if you're using Linux/bash there's a "timeout" builtin for that. I suppose I ought to get the source to that and see how it does it! – FJL Jan 26 '15 at 20:11
  • @FJL, timeout is not a shell builtin. There have been various implementations of a timeout command for Linux, one was recently (2008) added to GNU coreutils (so not Linux specific), and that's what most Linux distributions use nowadays. – Stéphane Chazelas Jan 26 '15 at 20:21
  • @Stéphane - Thanks - I subsequently found reference to GNU coreutils. They may be portable, but unless it's in the base system it can't be relied on. I'm more interested in knowing how it works, although I note your comment elsewhere suggesting it's not 100% reliable. Given the way this thread has gone, I'm not surprised! – FJL Jan 27 '15 at 22:39
1

If you are running on Linux (and a few other *nixes), you can check that the process you intend to kill is still running and that its command line matches your long-running process. Something like:

echo Time up!
grep -q longrunningthing /proc/$p/cmdline 2>/dev/null
if [ $? -eq 0 ]
then
  kill $p
fi

An alternative can be to check how long the process you intend to kill has been running, with something like ps -p $p -o etime=. You could do it yourself by extracting this information from /proc/$p/stat, but that is tricky (the start time there is measured in clock ticks since boot, so you will also need the boot time from /proc/stat or the uptime from /proc/uptime).
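
A rough sketch of that computation, assuming the usual Linux /proc layout (the sed strips the comm field, which may itself contain spaces):

ticks=$(getconf CLK_TCK)                                      # clock ticks per second
start=$(sed 's/^.*) //' "/proc/$p/stat" | awk '{print $20}')  # starttime, field 22 of the full line
uptime=$(awk '{print int($1)}' /proc/uptime)                  # seconds since boot
elapsed=$(( uptime - start / ticks ))                         # approximate seconds the process has run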

Anyway, you usually cannot ensure that the process is not replaced after your check and before you kill it.

Uriel
  • That's still not correct because it doesn't get rid of the race condition. – strcat Sep 10 '15 at 22:12
  • @strcat Indeed, no warranty of success, but most scripts do not even bother to do such a check and just bluntly kill a cat pidfile result. I cannot remember a clean way to do it in shell only. The proposed namespace answer seems an interesting one, however... – Uriel Sep 12 '15 at 08:39
0

Another way would be to check the age of the process before killing it. That way, you can make sure you are only killing a process that has been running for at least 24 hours, not a freshly spawned one that happens to have the same PID. You can add an if condition based on that before killing the process.

if [[ $(ps -p $p -o etime=) == *-* ]] ; then
    kill $p
fi

This if condition checks whether the elapsed time of the process with PID $p includes a days field, i.e. whether it has been running for at least 24 hours (86400 seconds).

P.S.: The output of ps -p $p -o etime= has the format <no. of days>-HH:MM:SS; the days field only appears once the process has been running for at least a day.

Sreeraj
0

This is actually a very good question.

The way to determine process uniqueness is to look at (a) where it is in memory and (b) what that memory contains. To be specific, we want to know where in memory the program text for the initial invocation is, because we know that the text area of each invocation will occupy a different location in memory. If the process dies and another one is launched with the same pid, the program text for the new process will not occupy the same place in memory and will not contain the same information.

So, immediately after launching your process, do md5sum /proc/[pid]/maps and save the result. Later when you want to kill the process, do another md5sum and compare it. If it matches, then kill the pid. If not, don't.
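
A minimal sketch of that check, with longrunningthing standing in for the real command:

longrunningthing &
p=$!
sig=$(md5sum "/proc/$p/maps")       # taken right after launch
sleep 86400
if [ "$(md5sum "/proc/$p/maps" 2>/dev/null)" = "$sig" ]; then
    kill "$p"                       # same maps content, so assume it is still our process
fi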

To see this for yourself, launch two identical bash shells. Examine /proc/[pid]/maps for each and you will find that they are different. Why? Because even though it is the same program, they occupy different locations in memory and the addresses of their stacks are different. So, if your process dies and its PID is reused, even by the same command being relaunched with the same arguments, the "maps" file will be different and you will know that you are not dealing with the original process.

See: proc man page for details.

Note that the file /proc/[pid]/stat already contains all of the information that other posters have mentioned in their answers: age of the process, parent pid, etc. This file contains both static information and dynamic information, so if you prefer to use this file as a base of comparison, then upon launching your longrunningthing, you need to extract the following static fields from the stat file and save them for comparison later:

pid, filename, pid of parent, process group id, controlling terminal, time process started after system boot, resident set size, the address of the start of the stack,

Taken together, the above uniquely identify the process, so this represents another way to go. Actually you could get away with nothing more than "pid" and "time process started after system boot" with a high degree of confidence. Simply extract these fields from the stat file and save them somewhere upon launching your process. Later, before killing it, extract them again and compare. If they match, you are assured you are looking at the original process.
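
For that lighter variant, a hypothetical sketch that keeps just the start time (field 22 of /proc/$p/stat, extracted after stripping the comm field, which may contain spaces):

longrunningthing &
p=$!
saved=$(sed 's/^.*) //' "/proc/$p/stat" | awk '{print $20}')   # start time, saved at launch
sleep 86400
now=$(sed 's/^.*) //' "/proc/$p/stat" 2>/dev/null | awk '{print $20}')
[ "$now" = "$saved" ] && kill "$p"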

Michael Martinez
  • 1
    That generally won't work as /proc/[pid]/maps changes over time as extra memory is allocated or the stack grows or new files are mmapped... And what does immediately after launching mean? After all the libraries have been mmapped? How do you determine that? – Stéphane Chazelas Jan 27 '15 at 11:26
  • I'm doing a test now on my system with two processes, one a java app and the other a cfengine server. Every 15 minutes I do md5sum on their maps files. I'll let it run for a day or two and report back here with the results. – Michael Martinez Jan 28 '15 at 00:13
  • @StéphaneChazelas: I've been checking my two processes for 16 hours now, and there's been no change in the md5sum – Michael Martinez Jan 28 '15 at 18:12
0

UPDATED: Enabled use with pipes

You've gotten one good answer regarding using timeout, if it is available on your system. On mine (Mac OS X Catalina), it isn't, and as a proof-by-example, that means that this isn't a universal solution. I don't believe that the namespaces answer is universal, either. I challenge the "in general, you can't" answer, but agree that a lot of the other answers are buggy. For example, recording the start time, and then checking that, then killing... does not eliminate the race-condition where the process terminates and its PID gets recycled. Same with the proc/..../cmdline answer. Neither atomically checks that the process is the correct process, then kills it, all in one step. Same with the other answers.

Anyways, since this was the #1 post regarding timeouts and race conditions, I'll share my answer here. I'll share one preferred answer and two alternatives. Usage / test cases for the below methods are:

timeOut 1 sleep 20; ps -j
timeOut 20 sleep 1; ps -j
timeOut 1 sleep 1; ps -j
timeOut 1 exec sleep 20; ps -j
timeOut 1 sh -c 'sleep 20& sleep 30'; ps -j
timeOut 1 sleep 20 | cat; ps -j

All of the above commands will return after 1 second.

The preferred method is:

timeOut() {
    checkArgs() { [ $(( ${1} )) -gt 0 -a "${*:2}" ]; }
    jobControlEnabled() { expr "${-}" : '.*m' >/dev/null; }
    terminalFDs() { [ -t 0 -a -t 1 ]; }
    groupLeader() { sh -c 'expr `ps -o pgid= ${PPID}` : "${PPID}" >/dev/null;'; }
    timeOutImpl() {
        groupLeader || { echo "Job control error - not group leader!"; return -1; }
        KILL_SUB="kill -- -`sh -c 'echo ${PPID}'`"
        { sleep ${1}; ${KILL_SUB}; } &
        "${@:2}"; ${KILL_SUB}
    }
    checkArgs "${@}" || { echo "Usage: timeOut <delay> <command>"; return -1; }
    if jobControlEnabled && terminalFDs; then
        ( timeOutImpl "${@}"; )
    else
        ( set -m; ( timeOutImpl "${@}"; ); )
    fi
}

The above script works even if you pass a command beginning with "exec" as an argument, therefore providing support for commands which cannot be run in a subshell (as long as you aren't trying to pipe such commands! -- that would be pointless, since such programs wouldn't run in a pipe even without this utility). The only minor consequence of using exec is that the "watchdog" process will live on (only until its pre-programmed timeout expires) if it does not "fire" before the command exits. It will only harmlessly kill itself after its timeout. It is my understanding that there is no concern of PID recycling occurring (and the watchdog killing some other process) even when exec is used, because the watchdog itself still retains the PGID of the group it will kill (even though its PPID will have been changed to 1 -- the init process). So: while the watchdog lives, no PID recycling can result in another process with that same PGID being created. Somebody correct me if I am mistaken, but I believe this to be correct.

I avoid race conditions by making a subshell where all commands issued inside of it (including the watchdog) get assigned the same PGID. Then, I kill all processes with that PGID in one fell swoop, regardless of whether the command finished successfully or if the watchdog kicked in.

I made this version work with both exec (for commands which can't be run in a subshell) and pipes (for commands which can be run in a subshell, obviously) by detecting whether STDIN and STDOUT are connected to a terminal. If either one is not connected to a terminal, I enable job control in the first subshell, then run the command in a second subshell.

It also works if you want to execute multiple commands using sh -c 'cmd1; cmd2'.

The only non-POSIX portion of the above is in groupLeader(), which is just an extra error-check that makes sure that job control worked as-expected. If your version of ps does not support the "-o" option, either delete that line, or replace it with however you get a process' PGID on your distribution. I would recommend keeping some form of error-checking enabled, or else the utility will kill all jobs started by your main terminal, if somehow job control doesn't work. I don't think this should ever happen, but if you're doing something odd with job control, don't say that I didn't warn you.


Only read further if you have a solid reason not to use the first (preferred) option listed above.

If, for some reason, you cannot have job control enabled in your main terminal, and you're one of those people who simply finds it to be sacrilege to enable job control in a subshell, use this next version. Since we can't rely on PGIDs in this method, it uses a FIFO as a way to terminate the watchdog, rather than killing it directly. Killing the watchdog directly would involve a race condition if the watchdog was about to terminate on its own (meaning that its PID could have been recycled). It requires pkill in order to kill by PPID rather than PGID.

timeOut() {
    [ $(( ${1} )) -gt 0 ] && [ "${*:2}" ] ||
        { echo "Usage: timeOut <delay> <command>"; return -1; }
    fifo="`mktemp -dt $$`/fifo"; mkfifo -m 0600 "${fifo}"; exec 9<>"${fifo}"
    ( exec sh -c "read -t ${1} <&9; pkill -P \${PPID}" & "${@:2}"; echo >&9; wait; )
}

The above version should NOT be used with exec. Doing so will result in the watchdog killing all processes belonging to the init process (PID 1) if it does not fire before the main process ends. It also will not work with sh -c 'cmd1; cmd2'. This version does appear to work with pipes without any special logic like the first version requires.

As a final version, if your processes always work fine in a subshell (you don't need the exec capabilities of the "preferred solution"), the below is very concise and does work with pipes, too:

timeOut() {
    (
        set -m; trap "sh -c 'pkill -P \${PPID}'" SIGCHLD
        sleep ${1} & "${@:2}"; wait
    )
}
Sean
-3

What I do is, after having killed the process, do it again. Every time I do that the answer comes back, "no such process"

allenb   12084  5473  0 08:12 pts/4    00:00:00 man man
allenb@allenb-P7812 ~ $ kill -9 12084
allenb@allenb-P7812 ~ $ kill -9 12084
bash: kill: (12084) - No such process
allenb@allenb-P7812 ~ $ 

Couldn't be simpler and I've been doing this for years without any problems.

Allen