Possible Duplicate:
What if ‘kill -9’ does not work?
I guess it's a bit late to ask this, but for future reference:
I was called to look at a server today after a customer was reporting that connecting with ssh was slow and executing commands was also slow (with some not working at all).
After logging in I could type promptly, so I didn't think it was a network issue like delay or bandwidth saturation (as I find these tend to show up directly in your ssh experience). I first tried to run top; after a minute of nothing happening I cancelled it with CTRL+C. The prompt was hanging, waiting for top to start up.
free -m also just hung at the prompt for a minute or longer before I cancelled it.
df -h did execute, and showed me that 60% of disk space was free (I was wondering if some application had gone bananas and filled up the disks with logs).
dmesg wouldn't execute either.
I executed tail -n 50 /var/log/messages and sadly I no longer have the output, but it looked like there had been a serious problem: lots of memory locations printed in hex and, presumably, their contents (incomprehensible ramblings) on the right. It was very similar to the output in this log I found on Google while trying to find a similar example, except that in the right-hand column most of the lines contained "ext4". Perhaps there was a file system error?
Running tail -n 50 /var/log/syslog, I saw in the middle of all the memory madness (which was repeated here) a couple of lines that said words to the effect of Info procname:pid blocked for more than 120 seconds.
I executed ps aux and looked through the output until I found one process with 299% CPU usage:
ps aux | grep procname
procuser 8279 299 0.0 479064 41916 pts/6 Sl+ 08:05 548:31 /path/to/procname procbox 6390 6394 6395 0
So this process has gone bonkers, it seems, but I couldn't execute any command (with or without sudo) related to memory, for example free -m or top. I could cat /proc/meminfo and see that about 5GB out of 40GB of RAM was free.
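Since cat /proc/meminfo still worked when free -m did not, here is a minimal sketch of pulling the two headline figures out of that file using only shell builtins. The function name meminfo_mib is made up for illustration, and nothing here guarantees it would have run on a machine in that state; it only avoids starting extra tools.

```shell
# Print MemTotal and MemFree in MiB using only shell builtins.
# The optional argument is the file to read (default: /proc/meminfo),
# so the function can also be tried on a saved copy of the file.
meminfo_mib() {
    file="${1:-/proc/meminfo}"
    while read -r key value unit; do
        case "$key" in
            MemTotal:|MemFree:)
                # /proc/meminfo reports these values in kB
                echo "${key%:} $((value / 1024)) MiB"
                ;;
        esac
    done < "$file"
}
```

Calling meminfo_mib with no argument reads the live /proc/meminfo; passing a path reads that file instead.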
I tried kill PID, but after a couple of minutes of hanging I gave up. I tried kill -9 PID, but again, same thing. I can only assume this process was so busy that it couldn't answer kill messages from the kernel? I then tried renice 19 PID followed by kill -9 PID, but this didn't work either; renice would start but just hang.
In the end a hard reboot was required, which was not ideal: files are now corrupt, etc., due to the specialist applications on the server. What other options did I have?
Is there no way to simply cease a process? Rather than sending a SIGTERM, just flat out cease the processing of code, or similar?
… (state R), kill -9 kills it immediately; there's no such thing as “too busy”. If the process is inside a system call (state D), kill -9 kills it (there's nothing stronger), but it can sometimes take a tiny amount of time for the system call to reach a cancellation point. If kill -9 takes a noticeable amount of time or doesn't work at all, it's a kernel bug, possibly triggered by a hardware failure. – Gilles 'SO- stop being evil' Nov 11 '12 at 00:58
… kswapd was at 100% CPU or not. Because in 2012 there was a kernel bug in 2.6-type kernels which, when the race condition was met, made kswapd go to 100% CPU and let processes hang, unkillable, at 100% CPU in the kernel too. An example script which was able to detect this state is at https://gist.github.com/hilbix/5264057 – Tino Nov 07 '14 at 16:44
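The comments above distinguish between process states R and D. As a rough sketch, the state letter can be read from the third field of /proc/PID/stat without running ps; the function name proc_state is hypothetical, and on a machine as wedged as the one described, reading /proc might block as well.

```shell
# Print the one-letter state (R, S, D, Z, ...) of a process.
# $1 is the path to a stat file, e.g. /proc/8279/stat.
proc_state() {
    IFS= read -r line < "$1" || [ -n "$line" ] || return 1
    # The line looks like "pid (comm) state ...". The command name may
    # itself contain spaces or ")", so cut at the last ") " in the line.
    rest=${line##*") "}
    echo "${rest%% *}"
}
```

For the process in the question this would be called as proc_state /proc/8279/stat. The Sl+ in the ps output shows the same state letter (S) with extra flags appended by ps.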