Why doesn't SIGKILL terminate a stopped program (yes)?

Question

I'm using Ubuntu 14.04 and I'm experiencing this behavior I can't seem to understand:

1. Run the yes command (in the default shell: Bash)
2. Type Ctrl+Z to stop yes
3. Run jobs. Output:
   [1]+ Stopped yes
4. Run kill -9 %1 to stop yes. Output:
   [1]+ Stopped yes
5. Run jobs. Output:
   [1]+ Stopped yes

This is on Ubuntu (kernel 3.16.0-30-generic) running in a Parallels virtual machine.

Why didn't my kill -9 command terminate the yes command? I thought SIGKILL can't be caught or ignored? And how can I terminate the yes command?

That's interesting. SIGKILL should work and it does on my Linux Mint 17. For any other signal, you'd normally need to send it SIGCONT afterwards to make sure the signal gets received by the stopped target. — Petr Skocik, Jun 12 '15 at 18:34
Does bash really print "Stopped" for a process that is suspended? — edmz, Jun 12 '15 at 18:45
Linux ubuntu 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:43:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux. I'm running Ubuntu in Parallels Desktop. — s1m0n, Jun 12 '15 at 18:59
@black most shells say "Stopped". tcsh says "Suspended" and zsh says "suspended". A cosmetic difference. Of somewhat more importance is the fact that bash prints an identical message for STOP and TSTP, where all other shells annotate the STOP message with (signal) so you can tell the difference. — , Jun 12 '15 at 20:21

score 10 · Answer 1 · lcd047 · 2015-06-12T20:45:56.060

Signals are blocked for suspended processes. In a terminal:

$ yes
...
y
y
^Zy

[1]+  Stopped                 yes

In a second terminal:

$ killall yes

In the first terminal:

$ jobs
[1]+  Stopped                 yes

$ fg
yes
Terminated

However SIGKILL can't be blocked. Doing the same thing with killall -9 yes from the second terminal immediately gives this in the yes terminal:

[1]+  Killed                  yes

Consequently, if kill -9 %1 doesn't terminate the process right away, then either bash isn't actually sending the signal until you fg the process, or you have uncovered a bug in the kernel.
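The difference between the two signals can also be reproduced from a script, without job control. This is a minimal sketch, assuming Linux, a POSIX shell and the procps version of ps; the variable names are illustrative only. It stops a sleep process, sends it SIGTERM, records that the process is still stopped, and only then continues it.

```shell
# Sketch: a stopped process keeps SIGTERM pending until it is continued.
sleep 9999 &
pid=$!

kill -s STOP "$pid"

# Stopping is asynchronous, so poll until ps reports state T (stopped).
tries=0
until ps -o stat= -p "$pid" | grep -q T; do
    tries=$((tries + 1))
    [ "$tries" -lt 50 ] || break
    sleep 0.1
done

kill -s TERM "$pid"     # queued, but not acted on while the process is stopped
sleep 0.5
state_after_term=$(ps -o stat= -p "$pid" | tr -d ' ' | cut -c1)

kill -s CONT "$pid"     # once resumed, the pending SIGTERM takes effect
term_status=0
wait "$pid" || term_status=$?

echo "state after SIGTERM: $state_after_term"
echo "wait status: $term_status"
```

On Linux the state is expected to still be T after the SIGTERM, and the wait status to be 143 (128 + 15), which matches the "Terminated" message that only appears after fg in the transcript above.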

edited Jun 12 '15 at 20:45

answered Jun 12 '15 at 19:07

lcd047


Some background details: When issuing Ctrl+Z in your terminal bash sends a SIGTSTP (which is the blockable version of SIGSTOP) to the active process. This puts the process in a frozen state where the kernel won't schedule it. That also inhibits signal processing (except for the SIGCONT signal which unfreezes the process) and therefore prevents the process from being killed right away. – mreithub Jun 12 '15 at 19:24
@mreithub Yes, that's exactly what happens. Well, SIGSTOP is actually sent in very different conditions. – lcd047 Jun 12 '15 at 19:35
SIGKILL, unlike other signals, is not blocked for suspended processes. Sending the KILL signal to a suspended process kills it — asynchronously, but in practice basically immediately. – Gilles 'SO- stop being evil' Jun 12 '15 at 22:30
@Gilles That's what I was trying to illustrate above: SIGTERM is blocked, but SIGKILL isn't. Anyway, according to a comment from OP, the problem seems to be that jobs doesn't detect that the process has died, not the process not being killed by kill -9 %1. – lcd047 Jun 12 '15 at 22:32
But I can reproduce s1m0n's behavior on my system (Debian, amd64, bash 4.3.30). – Gilles 'SO- stop being evil' Jun 12 '15 at 22:34
While SIGKILL cannot be blocked, there is no guarantee that it will be delivered within any meaningful time. If a process is suspended pending blocking I/O, for example, SIGKILL will not arrive until the process wakes. This could potentially be never, if no I/O occurs. – sapi Jun 13 '15 at 01:22
@Gilles'SO-stopbeingevil', "suspended process kills it — asynchronously". In that case, what is exit status of a kill command? 0 or 1? – Alex Martian Oct 21 '23 at 02:59
@AlexMartian kill returns 0 if it could deliver the signal, which basically means that the PID is valid and the calling process has the appropriate permission. What happens next (signal ignored, signal handler runs, signal kills the process, signal reaction delayed because the target process is in a non-interruptible system call, …) doesn't delay kill returning and doesn't affect its return status even if it hasn't returned yet. – Gilles 'SO- stop being evil' Oct 23 '23 at 09:35
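The point about kill's exit status in the last comment can be checked with a short script. This is a sketch under the assumption of a POSIX shell on Linux; the variable names are made up for illustration. It compares the exit status of kill itself with the status that wait later reports for the child.

```shell
# Sketch: kill's exit status says whether the signal could be sent,
# not what eventually happened to the target.
sleep 9999 &
pid=$!
kill -s STOP "$pid"

kill_rc=0
kill -s KILL "$pid" || kill_rc=$?      # 0: PID valid and permission granted

child_status=0
wait "$pid" || child_status=$?         # 137 = 128 + 9: the child died from SIGKILL

stale_rc=0
kill -s KILL "$pid" 2>/dev/null || stale_rc=$?   # non-zero: the PID no longer exists

echo "kill: $kill_rc, child: $child_status, kill after reaping: $stale_rc"
```

The first kill succeeds even though the target is stopped. Only the second kill, sent after the child has been reaped, fails.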

score 8 · Answer 2 · answered Jun 12 '15 at 20:57

Don't panic.

There's nothing funky going on. There's no kernel bug here. This is perfectly normal behaviour from the Bourne Again shell and a multitasking operating system.

The thing to remember is that a process kills itself, even in response to SIGKILL. What's happening here is that the Bourne Again shell is getting around to things before the process that it just told to kill itself gets around to killing itself.

Consider what happens from the point where yes has been stopped with SIGTSTP and you've just executed the kill command with the Bourne Again shell:

The shell sends SIGKILL to the yes process.
In parallel:
1. The yes process is scheduled to run and immediately kills itself.
2. The Bourne Again shell continues, issuing another prompt.

The reason that you are seeing one thing and other people are seeing another is a simple race between two ready to run processes, the winner of which is entirely down to things that vary both from machine to machine and over time. System load makes a difference, as does the fact that your CPU is virtual.

In the interesting case, the detail of step #2 is this:

1. The Bourne Again shell continues.
2. As part of the internals of the built-in kill command it marks the entry in its job table as needing a notification message printed at the next available point.
3. It finishes the kill command, and just before printing the prompt again checks to see whether it should print notification messages about any jobs.
4. The yes process hasn't had the chance to kill itself yet, so as far as the shell is concerned the job is still in the stopped state. So the shell prints a "Stopped" job status line for that job, and resets its notification pending flag.
5. The yes process gets scheduled and kills itself.
6. The kernel informs the shell, which is busy running its command line editor, that the process has killed itself. The shell notes the change in status and flags the job as notification pending again.
7. Simply pressing enter to cycle through the prompt printing again gives the shell the chance to print the new job status.

The important points are:

Processes kill themselves. SIGKILL isn't magical. Processes check for pending signals when returning to application mode from kernel mode, which happens at the ends of page faults, (non-nested) interrupts, and system calls. The only special thing is that the kernel doesn't allow the action in response to SIGKILL to be anything other than immediate and unconditional suicide, with no return to application mode. Importantly, processes need to both be making kernel-to-application-mode transitions and be scheduled to run in order to respond to signals.
A virtual CPU is just a thread on a host operating system. There's no guarantee that the host has scheduled the virtual CPU to run. Host operating systems are not magical, either.
Notification messages aren't printed when the job state changes happen (unless you use set -o notify). They are printed when next the shell reaches a point in its execution cycle that it checks to see whether any notifications are pending.
The notification pending flag is being set twice, once by kill and once by the SIGCHLD signal handler. This means that one can see two messages if the shell is running ahead of the yes process being rescheduled to kill itself; one a "Stopped" message and one a "Killed" message.
Obviously, the /bin/kill program doesn't have any access to the shell's internal jobs table; so you won't see such behaviour with /bin/kill. The notification pending flag is only set the once, by the SIGCHLD handler.
For the same reason, you won't see this behaviour if you kill the yes process from another shell.

That's an interesting theory, but the OP gets to type jobs and the shell still sees the process as alive. That would be one unusually long scheduling race condition. :) — lcd047, Jun 12 '15 at 21:15
First of all, thanks for your elaborate answer! It certainly makes sense and clears up quite a few things. But as stated above, I can run multiple jobs commands after the kill which all still indicate the process is just stopped. You however inspired me to keep experimenting and I discovered this: the message [1]+ Terminated yes is printed as soon as I run another external command (not a shell builtin like echo or jobs). So I can run jobs as much as I like and it keeps printing [1]+ Stopped yes. But as soon as I run ls for example, Bash prints [1]+ Terminated yes — s1m0n, Jun 12 '15 at 21:29
lcd047 didn't read your comment to the question; which was important and should have been edited into the start of the question, properly. It's easy to overload a host operating system such that guests appear to schedule very strangely, from within. Just like this, and more besides. (I once managed to cause quite odd scheduling with a runaway Bing Desktop consuming most of the host CPU time.) — JdeBP, Jun 12 '15 at 22:18
No, that's not quite it. kill -9 %1 sends the KILL signal immediately. The job may or may not be killed before the next prompt (there's a race condition here), but it'll definitely be killed before a human can type or even paste jobs, so it should show as killed by that time. — Gilles 'SO- stop being evil', Jun 12 '15 at 22:33
@Gilles The problem seems to be that jobs doesn't notice that the process has actually died... Not sure what to make about the status being updated by running another command though. — lcd047, Jun 12 '15 at 22:41
Even Gilles didn't see the comment. This is why you should put this sort of important stuff in the question, not bury it in a comment. Gilles, the answer clearly talks about delays in delivering a signal, not delays in sending it. You've mixed them up. Also, read the questioner's comment (and indeed the bullet point that it is given here) and see the very important wrong fundamental assumption that you are making. Virtual processors do not necessarily run in lockstep, and are not magically capable of always running at full speed. — JdeBP, Jun 13 '15 at 00:28
Maybe a note or two on set -b would be worthwhile here? Oh, my mistake, set -o notify is mentioned. — mikeserv, Jun 13 '15 at 00:53

score 3 · Answer 3 · answered Jun 13 '15 at 00:45

What you're observing is a bug in this version of bash.

kill -9 %1 does kill the job immediately. You can observe that with ps. You can trace the bash process to see when the kill system call is called, and trace the subprocess to see when it receives and processes the signals. More interestingly, you can go and see what's happening to the process.

bash-4.3$ sleep 9999
^Z
[1]+  Stopped                 sleep 9999
bash-4.3$ kill -9 %1

[1]+  Stopped                 sleep 9999
bash-4.3$ jobs
[1]+  Stopped                 sleep 9999
bash-4.3$ jobs -l
[1]+  3083 Stopped                 sleep 9999
bash-4.3$

In another terminal:

% ps 3083
  PID TTY      STAT   TIME COMMAND
 3083 pts/4    Z      0:00 [sleep] <defunct>

The subprocess is a zombie. It's dead: all that's left of it is an entry in the process table (but no memory, code, open files, etc.). The entry is left around until its parent takes notice and retrieves its exit status by calling the wait system call or one of its siblings.
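The ps check from the second terminal can be wrapped in a small helper, so that the real process state can be compared with what jobs prints. This is only a sketch: proc_state is a hypothetical function name, not part of bash or procps, and it assumes a ps that accepts -o stat= and -p (as procps does).

```shell
# proc_state PID (hypothetical helper): print the first letter of the STAT
# column reported by ps (for example T = stopped, Z = zombie), or "gone"
# when no process with that PID exists.
proc_state() {
    # ps prints nothing and exits non-zero when the PID does not exist.
    state=$(ps -o stat= -p "$1" 2>/dev/null | tr -d ' ' | cut -c1) || true
    echo "${state:-gone}"
}

# The shell running this script certainly exists.
self_state=$(proc_state "$$")

# A child that has exited and been waited for is gone from the process table.
true &
done_pid=$!
wait "$done_pid"
reaped_state=$(proc_state "$done_pid")

echo "self: $self_state, reaped child: $reaped_state"
```

In an interactive session, proc_state "$(jobs -p %1)" would report Z for the situation shown above, even while jobs still prints "Stopped". For a single-command job, jobs -p prints the PID of that command.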

An interactive shell is supposed to check for dead children and reap them before printing a prompt (unless configured otherwise). This version of bash fails to do it in some circumstances:

bash-4.3$ jobs -l
[1]+  3083 Stopped                 sleep 9999
bash-4.3$ true
bash-4.3$ /bin/true
[1]+  Killed                  sleep 9999

You might expect bash to report “Killed” as soon as it's printing the prompt after the kill command, but that isn't guaranteed, because there's a race condition. Signals are delivered asynchronously: the kill system call returns as soon as the kernel has figured out which process(es) to deliver the signal to, without waiting for it to be actually delivered. It's possible, and it does happen in practice, that bash has time to check on the status of its subprocess, find that it's still not dead (wait4 doesn't report any child death), and print that the process is still stopped. What is wrong is that before the next prompt, the signal has been delivered (ps reports that the process is dead), yet bash still hasn't called wait4 (we can see that not only because it still reports the job as “Stopped”, but because the zombie is still present in the process table). In fact, bash only reaps the zombie the next time it needs to call wait4, when it's run some other external command.

The bug is intermittent and I couldn't reproduce it while bash is traced (presumably because it's a race condition where bash needs to react fast). If the signal is delivered before bash checks, everything happens as expected.

score 2 · Answer 4 · answered Jun 12 '15 at 17:33

Something funky may be happening on your system; on mine your recipe works nicely both with and without the -9:

> yes
...
^Z
[1]+  Stopped                 yes
> jobs
[1]+  Stopped                 yes
> kill %1
[1]+  Killed                  yes
> jobs
>

Get the pid with jobs -p and try to kill it as root.

Dan Cornilescu

May I ask what distribution/kernel/bash version you're using? Maybe your bash's internal kill command goes the extra mile and checks if the job is frozen (you might want to try finding out the PID of the job and killing it using env kill <pid>). That way you'll be using the actual kill command and not the bash builtin. – mreithub Jun 12 '15 at 19:51
bash-4.2-75.3.1.x86_64 on opensuse 13.2. The kill cmd is not the internal one: which kill /usr/bin/kill – Dan Cornilescu Jun 13 '15 at 01:55
which is not a bash-builtin, so which <anything> will always give you the path to the actual command. But try comparing kill --help vs. /usr/bin/kill --help. – mreithub Jun 13 '15 at 19:43
Ah, right. Indeed, it's the builtin kill. – Dan Cornilescu Jun 14 '15 at 02:42
