
I'm writing a Perl script that parses logfiles to collect PIDs and then checks whether that PID is running. I am trying to think of the best way to make that check. Obviously, I could do something like:

system("ps $pid > /dev/null") && print "Not running\n";

However, I'd prefer to avoid the system call if possible. I therefore thought I could use the /proc filesystem (portability isn't a concern, this will always be running on a Linux system). For example:

if(! -d "/proc/$pid"){
    print "Not running\n";
}

Is that safe? Can I always assume that if there's no /proc/$pid/ directory, the associated PID is not running? I expect so, since AFAIK ps itself gets its information from /proc anyway, but since this is for production code, I want to be sure.

So, can there be cases where a running process has no /proc/PID directory or where a /proc/PID directory exists and the process is not running? Is there any reason to prefer parsing ps over checking for the existence of the directory?

terdon
  • There's also the perl kill function using signal 0, which does no killing but says whether you could do so (i.e. you need permission to signal that process). – meuh Jul 10 '16 at 17:48
  • '/proc/$PID' should be just fine if you're doing this in Linux. – likewhoa Jul 10 '16 at 17:58
  • Even kernel threads are properly kept in /proc (see /proc/2 for example). I'm confident (80% sure) that /proc reflects the process table inside the kernel. And if a process is not in the process table, it will not be picked up by the scheduler and therefore will never run (so it cannot be a running process). – grochmal Jul 10 '16 at 17:58
  • @terdon Note that whatever method you use (and kill -0 is the best one), this only tells you whether there is a running process with the given PID. It doesn't tell you whether the process will still be running one millisecond later, and it doesn't tell you whether the process is the one you're interested in or an unrelated process that got assigned the same PID after the interesting process died. It is almost always a mistake to test whether a given PID is running: there are very few circumstances where this isn't prone to race conditions. – Gilles 'SO- stop being evil' Jul 10 '16 at 19:50
  • @Gilles indeed, I should also check the process name. However, in this case, I don't care if there's a status change a millisecond later. I have processes listed as active in a DB and a corresponding file with the PIDs. I just need to check whether anything the DB thinks is running isn't actually running, and I have enough control over the setup to know that it can't start up again randomly. – terdon Jul 10 '16 at 20:27
  • This design is atrociously misguided and, beyond being an uptime timebomb, is most likely enabling a class of security risks too. I would be embarrassed to have pursued a design so horribly flawed so far as to ask this question. Please listen to @Gilles. – MickLH Jul 11 '16 at 00:25
  • @MickLH wow, that's me told. That would actually have been a useful comment instead of a celebration of your own brilliance had you bothered to explain what is so bad and dangerous. I don't doubt you're right, I've never claimed to be a programmer, which is why I asked the question, but you managed to be both insulting and unhelpful. – terdon Jul 11 '16 at 07:16
  • “Is this PID running” is probably the wrong question. This looks like an XY problem. The most common situation where people check whether a PID is running is when they should be doing it from the parent of that process: the parent can reliably know when a particular child dies. In your situation, maybe you should be checking whether that PID has a connection to the database, but I can't tell without knowing what problem you're really trying to solve. You should ask a question about the problem you're actually trying to solve, probably on SO, instead of the misguided attempt here. – Gilles 'SO- stop being evil' Jul 11 '16 at 11:49
  • @Gilles the problem is precisely as stated. I have a list of PIDs that should be running in a text file and want to check if they are actually running. – terdon Jul 11 '16 at 11:50
  • @terdon Yes, but the answer to that question is not useful. We are continuing this discussion in chat. – Gilles 'SO- stop being evil' Jul 11 '16 at 12:01
  • @terdon maybe I'm harsh sometimes... even extreme to an embarrassing degree, but when I read "production code" I take things a bit seriously and try to think of it as a co-worker... when I see a co-worker rejecting code review from a brought-in expert, I just already know they are fired if they don't clean up, so I try to wake them up before it's too late. – MickLH Jul 11 '16 at 13:57
  • @MickLH In future, I suggest you do so by focusing on the mistake you see and not on how you'd characterize it. I have no idea what made you think I was rejecting Gilles's advice but when you see a user doing something wrong, either explain the mistake or walk away shaking your head at the poor fool's idiocy. Pointing out the idiocy but not explaining the mistake is less than useless. – terdon Jul 11 '16 at 14:25
  • At the deepest level, by polling the "PID Alive" signal on some interval, you're imposing a Nyquist limit on your reconstruction of the actual signal you want. This is vulnerable not just to race conditions, but to denial of service attacks. – MickLH Jul 11 '16 at 14:35
  • Simple example: Say I know your server needs to be online for some authentication service, but if it's down for 30 seconds there's a fallback to a weaker authentication. Now let's say your script is trusted to always catch it in 5 seconds and a human can always respond in 20. Safe, right? No: I can attack your server by forcing the OOM killer to drop the process while simultaneously sending a large number of long-bodied requests to nearly guarantee that the PID is used again before your script catches it. – MickLH Jul 11 '16 at 14:35
  • @MickLH I see, thank you. I'm not polling though. This is just one of various tests run manually to check the consistency of our information; it won't be run automatically and its results will never trigger an automatic reaction. All my script will do is print a warning if a PID should be there and isn't. A human will have to choose to act on it and they can do so three days later. So, in any case, the human will have to evaluate the current situation. The script will simply point the human in the right direction. – terdon Jul 11 '16 at 16:37
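On the PID-reuse concern raised in the comments above: one way to reduce (not eliminate) that risk is to compare the command name currently recorded for the PID with the name you expect. The following is only a sketch under stated assumptions. The helper names pid_comm and pid_is_named are hypothetical, the PID 1234 and the name mydaemon are placeholders, and it assumes a Linux /proc that provides /proc/PID/comm. The race between the check and whatever happens afterwards still exists.

```perl
use strict;
use warnings;

# Hypothetical helper: return the short command name the kernel records
# for $pid in /proc/PID/comm, or undef if it cannot be read (for example
# because no such process exists or /proc is not mounted).
sub pid_comm {
    my ($pid) = @_;
    open(my $fh, '<', "/proc/$pid/comm") or return undef;
    my $name = <$fh>;
    close($fh);
    return undef unless defined $name;
    chomp $name;
    return $name;
}

# Hypothetical helper: true only if $pid has a readable /proc entry AND
# its command name equals the name we expect.
sub pid_is_named {
    my ($pid, $expected) = @_;
    my $name = pid_comm($pid);
    return defined($name) && $name eq $expected;
}

# Placeholder values: is PID 1234 still the "mydaemon" we recorded?
print "PID 1234 is not mydaemon\n" unless pid_is_named(1234, 'mydaemon');
```

Note that the kernel truncates the name in /proc/PID/comm to 15 characters, so the expected name has to be compared in its truncated form.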

2 Answers

22

The perl function kill(0,$pid) can be used.

If the return code is 1 then the PID exists and you're allowed to send a signal to it.

If the return code is 0 then you need to check $!. It may be EPERM (permission denied), which means the process exists but you are not allowed to signal it, or ESRCH, in which case the process doesn't exist.

If your checking code is running as root then you can simplify this to just checking the return code of kill: 0 => no such process, 1 => process exists.

For example:

% perl -d -e 0

Loading DB routines from perl5db.pl version 1.37
Editor support available.

Enter h or 'h h' for help, or 'man perldebug' for more help.

main::(-e:1):   0
  DB<1> print kill(0,500)
0
  DB<2> print $!
No such process
  DB<3> print kill(0,1)
0
  DB<4> print $!
Operation not permitted
  DB<5> print kill(0,$$)
1

This can be made into a simple function:

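use Errno;

# Returns true if NO process with the given PID exists.
# Note the inverted sense: a true result means "not present".
sub test_pid($)
{
  my ($pid)=@_;

  # Signal 0 sends nothing; kill only performs its existence and
  # permission checks. ESRCH means there is no such process. EPERM
  # means the process exists but we may not signal it, so that case
  # is deliberately not counted as "not present".
  my $not_present=(!kill(0,$pid) && $! == Errno::ESRCH);

  return($not_present);
}

# Try an arbitrary PID, init (PID 1), and this script's own PID.
print "PID 500 not present\n" if test_pid(500);
print "PID 1 not present\n" if test_pid(1);
print "PID $$ not present\n" if test_pid($$);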
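If you want to tell the three outcomes described above apart rather than collapse them into one boolean, something like the following might do. It is a sketch, not part of the original answer: pid_status and its return strings are names invented here, and it assumes $pid is a positive integer (zero or negative values make kill address process groups instead).

```perl
use strict;
use warnings;
use Errno qw(ESRCH EPERM);

# Hypothetical helper: classify a PID using signal 0.
# Returns 'signalable', 'exists_no_permission', 'absent' or 'unknown'.
sub pid_status {
    my ($pid) = @_;

    # kill returns the number of processes successfully signalled,
    # so 1 here means the process exists and we may signal it.
    return 'signalable' if kill(0, $pid);

    # kill failed: $! says why.
    return 'absent'               if $! == ESRCH;   # no such process
    return 'exists_no_permission' if $! == EPERM;   # exists, not ours
    return 'unknown';                                # any other errno
}

# Report on this script's own PID and on PID 1.
printf "%d: %s\n", $_, pid_status($_) for ($$, 1);
```

What is printed for PID 1 depends on who runs the script: as root it should be signalable, as an ordinary user the EPERM branch is the likely one.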
  • FWIW, I added a simple function showing how you can do this. – Stephen Harris Jul 10 '16 at 18:19
  • Yeah, I think I'll go with this. I deleted my comment since I realized that all I needed was if (!kill(0,$pid) && $! =~ /No such process/){ exit; } or similar. I like your Errno solution better though, thanks. While I'll probably go with this, I'll wait a while in case anyone can answer the underlying Linux question. – terdon Jul 10 '16 at 18:21
  • If /proc is mounted then every PID visible in the namespace will be present, so your -d /proc/$pid test would work... but it involves going out to the filesystem rather than using native system calls. – Stephen Harris Jul 10 '16 at 18:23
  • Which is precisely what I wanted to avoid, yes. – terdon Jul 10 '16 at 18:28
  • @terdon: I'm confused, are you trying to avoid the system call or are you trying to avoid reading the file system? You say one thing in your question and the opposite in your comment... – user541686 Jul 11 '16 at 08:45
  • @Mehrdad I'm trying to minimize the overhead. I hadn't thought of this so I suggested reading the filesystem as a better approach than calling an external program. However, using kill is even more efficient. – terdon Jul 11 '16 at 08:47
  • @terdon: I just realized my confusion came from the fact that by "system call" you actually meant "system call" -- i.e., a call to the system function itself, not a "system call". The latter you can't avoid, but the former you certainly can. Makes sense now! – user541686 Jul 11 '16 at 09:43
6
  • I’m 99.9% sure that checking whether /proc/PID exists (and is a directory) is 98% as reliable as the kill 0 technique.  The reason why the 98% isn’t 100% is a point that Stephen Harris touched on (and bounced off) in a comment: namely, the /proc filesystem might not be mounted.  It may be valid to claim that a Linux system without /proc is a damaged, degraded system (after all, things like ps, top, and lsof probably won’t work), and so this might not be an issue for a production system.  But it is (theoretically) possible for it never to have been mounted (although this might prevent the system from coming up to a normal state), it is definitely possible for it to be unmounted (I’ve tested it¹), and I believe that there isn’t any guarantee that it will exist (i.e., it’s not required by POSIX).  And, unless the system is completely hosed, kill will work.
  • Stephen’s comment talks about “going out to the filesystem” and “using native system calls”.  I believe that this is largely a red herring.
    • Yes, any attempt to access /proc requires reading the root directory to find the /proc filesystem.  This is true for any attempt to access any file by an absolute pathname, including things in /bin, /etc, and /dev.  This happens so often that the root directory is surely cached in memory for the entire lifetime (uptime) of the system, so this step can be done without any disk I/O.  And, once you have the inode of /proc, everything else that happens is in memory.
    • How do you access /proc?  With stat, open, readdir, etc., which are native system calls every bit as much as kill.
  • The question talks about a process running.  This is a slippery phrase.  If you actually want to test whether the process is running (i.e., in the run queue; maybe the current process on some CPU; not sleeping, waiting, or stopped), you may need to do a ps PID and read the output, or look at /proc/PID/stat.  But I see no hint in your question or comments that you are concerned with this.

    The elephant in the room, however, is that a zombie² process can be difficult to distinguish from a process that is alive and well.  kill 0 works on a zombie, and /proc/PID exists.  You can identify zombies with the techniques listed in the previous paragraph (doing ps PID and reading the output, or looking at /proc/PID/stat).  My very quick & casual (i.e., not very thorough) testing suggests that you can also do this by doing a readlink or lstat on /proc/PID/cwd, /proc/PID/root, or /proc/PID/exe; these will fail on zombies.  (However, they will also fail on processes you don’t own.)
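To illustrate the /proc/PID/stat approach mentioned above, here is a minimal sketch that reads the one-letter process state and so can tell a zombie (Z) from a live process. The helper name proc_state is hypothetical, and the code assumes a Linux /proc with the usual "pid (comm) state ..." layout of the stat file. An undef result cannot distinguish "no such process" from "/proc not mounted".

```perl
use strict;
use warnings;

# Hypothetical helper: return the one-letter state from /proc/PID/stat
# (for example R running, S sleeping, D disk sleep, T stopped, Z zombie),
# or undef if the file cannot be read or parsed.
sub proc_state {
    my ($pid) = @_;
    open(my $fh, '<', "/proc/$pid/stat") or return undef;
    my $line = <$fh>;
    close($fh);
    return undef unless defined $line;

    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so take the field after the LAST closing parenthesis.
    return $line =~ /^.*\)\s+(\S)/s ? $1 : undef;
}

# Report on this script's own PID.
my $state = proc_state($$);
if    (!defined $state) { print "no readable /proc entry for $$\n" }
elsif ($state eq 'Z')   { print "$$ is a zombie\n" }
else                    { print "$$ exists, state $state\n" }
```

Unlike the readlink test, reading the stat file does not normally require owning the process, though restrictive /proc mount options could change that.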

____________
¹ If umount’s -f (force) option doesn’t work, try -l (lazy).
² I.e., a process that has exited/died/terminated, but whose parent has not yet done a wait.

  • Thanks for your comment on my answer, which I have deleted because it was wrong. I don't believe the kill(2) manpage indicates directly the behavior you pointed out, but the perlfunc manpage does. I will email Michael Kerrisk to see what he has to say about the system manpage. – jrw32982 Jul 15 '16 at 02:26
  • I filed a bug report to clarify the man page for kill(2) regarding permissions to "send" signal 0. – jrw32982 Jul 15 '16 at 02:52
  • @jrw32982: Well, the man page says things like, “The sig argument …may be 0, in which case error checking is performed …” and “ERRORS — The kill() system call will fail and no signal will be sent if: … The sending process is not the super-user and its effective user id does not match the effective user-id of the receiving process.  …”  Now that you mention it, I suppose that can be interpreted in more than one way, but, regrettably, many Unix man pages are written in this style, requiring you to read between the lines.  Michael may believe that it’s clear enough as it is. – G-Man Says 'Reinstate Monica' Jul 16 '16 at 00:58
  • Michael made a change to the kill(2) manpage (I'm not seeing it yet online): "If sig is 0, then no signal is sent, but existence and permission checks are still performed; this can be used to check for the existence of a process ID or process group ID that the caller is permitted to signal." – jrw32982 Jul 16 '16 at 13:17