When I used killall -9 name to kill a program, its state became zombie. A few minutes later, it actually went away.
So, what's happening during those minutes?
1 Answer
The program actually never receives the SIGKILL signal, as SIGKILL is completely handled by the operating system/kernel.
When SIGKILL for a specific process is sent, the kernel's scheduler immediately stops giving that process any more CPU time for running user-space code. If the process has any threads executing user-space code on other CPUs/cores at the time the scheduler makes this decision, those threads will be stopped too. (In single-core systems this used to be much simpler: if the only CPU core in the system was running the scheduler, it by definition wasn't running the process at the same time!)
If the process/thread is executing kernel code (e.g. a system call, or an I/O operation associated with a memory-mapped file) at the time of SIGKILL, it gets a bit trickier: only some system calls are interruptible, so the kernel internally marks the process as being in a special "dying" state until the system calls or I/O operations are resolved. CPU time to resolve those will be scheduled as usual. Interruptible system calls or I/O operations will check at any suitable stopping point whether the process that called them is dying, and will exit early in that case. Uninterruptible operations will run to completion, and will check for the "dying" state just before returning to user-space code.
Once any in-process kernel routines are resolved, the process state is changed from "dying" to "dead" and the kernel begins cleaning it up, similar to when a program exits normally. Once the clean-up is complete, a greater-than-128 result code will be assigned (to indicate that the process was killed by a signal; see this answer for the messy details), and the process will transition into "zombie" state. The parent of the killed process will be notified with a SIGCHLD signal.
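To make the "killed by a signal" result code concrete, here is a minimal Python sketch. It assumes a Unix-like system with a sleep command on the PATH; the helper name kill_and_report is made up for this illustration. Python reports death by signal N as a return code of -N, while shells present the same event as 128 + N.

```python
import signal
import subprocess


def kill_and_report(cmd):
    """Start cmd, SIGKILL it, and return (signal_number, shell_style_status).

    Hypothetical helper, for illustration only.
    """
    proc = subprocess.Popen(cmd)
    proc.send_signal(signal.SIGKILL)  # the same signal that `kill -9` sends
    proc.wait()                       # reap the child so it does not linger as a zombie
    # Python encodes "terminated by signal N" as returncode == -N.
    signum = -proc.returncode
    # Shells conventionally show the same event as 128 + N.
    return signum, 128 + signum


if __name__ == "__main__":
    print(kill_and_report(["sleep", "60"]))
```

On systems where SIGKILL is signal 9, the shell-style value works out to 137.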
As a result, the process itself will never get the chance to actually process the information that it has received a SIGKILL.
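One way to see this from user space: the operating system refuses to let a process register a handler for SIGKILL at all. The sketch below is only an illustration (can_install_handler is a hypothetical helper name) and assumes it is run from the main thread of a Python program on a Unix-like system.

```python
import signal


def can_install_handler(signum):
    """Return True if the OS lets this process register a handler for signum.

    Hypothetical helper, for illustration only.
    """
    try:
        previous = signal.signal(signum, lambda s, frame: None)
    except OSError:
        # The kernel rejects handlers for SIGKILL (and SIGSTOP).
        return False
    if previous is not None:
        signal.signal(signum, previous)  # put the original handler back
    return True


if __name__ == "__main__":
    print("SIGUSR1:", can_install_handler(signal.SIGUSR1))
    print("SIGKILL:", can_install_handler(signal.SIGKILL))
```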
When a process is in a "zombie" state it means the process is already dead, but its parent process has not yet acknowledged this by reading the exit code of the dead process using the wait(2) system call. Basically the only resource a zombie process is consuming any more is a slot in the process table that holds its PID, the exit code and some other "vital statistics" of the process at the time of its death.
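The zombie state can be reproduced on purpose. The following sketch assumes Linux (it reads the one-letter state from /proc/<pid>/stat); the function names are made up for this example. The parent forks a child that exits at once, holds off on waiting until the child shows up as Z, and then reaps it.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name sits in parentheses and may contain spaces,
    # so take the first field after the last ')'.
    return data.rsplit(")", 1)[1].split()[0]


def observe_zombie():
    """Create a short-lived zombie; return (state_before_wait, gone_after_wait)."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: die immediately
    # Parent: deliberately do not wait yet; poll until the child is a zombie.
    deadline = time.monotonic() + 5
    state = proc_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)
    os.waitpid(pid, 0)  # reading the exit status frees the process-table slot
    gone = not os.path.exists(f"/proc/{pid}")
    return state, gone


if __name__ == "__main__":
    print(observe_zombie())
```

The zombie disappears as soon as the parent calls waitpid, which is why a zombie that stays around for minutes points at the parent.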
If the parent process dies before its children, the orphaned child processes are automatically adopted by PID #1, which has a special duty to keep calling wait(2) so that any orphaned processes won't stick around as zombies.
If it takes several minutes for a zombie process to clear, it suggests that the parent process of the zombie is struggling or not doing its job properly.
There is a tongue-in-cheek description of what to do in case of zombie problems in Unix-like operating systems: "You cannot do anything for the zombies themselves, as they are already dead. Instead, kill the evil zombie master!" (i.e. the parent process of the troublesome zombies).
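To find the "zombie master" in practice, you need each zombie's parent PID. Here is a rough sketch that assumes Linux and scans /proc directly; zombies_with_parents is a hypothetical name, and ps can show the same information.

```python
import os


def zombies_with_parents():
    """Return a list of (zombie_pid, parent_pid) pairs found in /proc (Linux only)."""
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry}/stat") as f:
                data = f.read()
        except OSError:
            continue  # the process vanished while we were scanning
        # After the parenthesised command name come: state, ppid, ...
        fields = data.rsplit(")", 1)[1].split()
        state, ppid = fields[0], int(fields[1])
        if state == "Z":
            found.append((int(entry), ppid))
    return found


if __name__ == "__main__":
    for pid, ppid in zombies_with_parents():
        print(f"zombie {pid} is waiting on parent {ppid}")
```

The parent PIDs it prints are the processes to investigate (or restart), not the zombies themselves.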

ps: 'S' is for I/O waits that the kernel can cancel in order to deliver a signal, and 'D' for those it can't. – zwol Dec 03 '18 at 15:47