- Assumptions
- Instructions
- A dumb hack (which assumes your tasks are hung on filesystem/disk access)
1. Assumptions
1.1) By default, the Linux kernel has code that reports various types of crash or hang.
All of them can show the immediate problem and print a call chain on the "local console". It might not show the root cause, and this code can never be 100% reliable. But usually you get something, and it is much better than nothing.
So you should double-check that you are able to see these kernel log messages on the console! Details are in the next section.
1.2) Since the kernel itself is still responding to your key-presses and network packets, I would really expect the hung task detector to work here.
It sounds like kernel threads and interrupts are still functioning, but userspace processes are hanging. The symptoms sound consistent with a hang when processes try to access the physical filesystem. When a process hangs for several minutes, the kernel prints a "hung task" message and call chain.
1.3) Alternatively, maybe user processes are not entirely hung, but they are very slow, and you are not waiting "long enough" to see them make progress.
You might be familiar with this story, if you have experience using a Linux PC with a mechanical HDD :-). But since this is not a PC on your desk, you will not notice the hard disk being very noisy or the disk activity light being permanently lit :-).
I am not experienced at managing servers. But I think you should be using monitoring software to try and detect issues like this. Ideally even before they create user-visible problems :-).
To give one example, if you monitor your system memory usage, it can show if you have a gradual "memory leak" and the system starts swapping itself to death. I expect you do not have that problem. E.g. if `login` had been swapped out, then the console login would have been slow to even prompt for your password.
Perhaps you would detect a rise in disk IO in the seconds before the observed failure, if you had sufficiently fine-grained monitoring.
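As a minimal sketch of the memory-monitoring idea, the raw numbers are already in `/proc/meminfo` (the `memsnap` helper name here is made up; the field names are standard kernel ones):

```shell
# Log available memory and free swap once; run it from cron or a loop to
# see a gradual leak before the machine swaps itself to death.
memsnap() {
    mem=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    swp=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
    echo "$(date '+%F %T') MemAvailable=${mem}kB SwapFree=${swp}kB"
}
memsnap
```

For a crude history you could run something like `while :; do memsnap; sleep 60; done >> /var/log/memsnap.log` — real monitoring software does this better, but this needs nothing installed.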
2. Instructions
2.1) Is the "local console" logged or at least persistent, so that you would have noticed if a kernel crash was printed on it? It really should be, but I am not familiar with how vSphere etc. would behave if you used an emulated serial console. If you are just using an emulated video display, then it will be persistent already.
This VMWare article appears to rely on the same assumption.
2.2) Make sure you have not disabled console logging. Run this command:

```shell
sudo sh -c "echo '<3>test' >/dev/kmsg"
```

It should show "test" on the console. See also below, where I talk about stack traces.
If this is an emulated video display, part of a crash message might scroll off the top of the screen. And if the kernel has crashed, then you cannot use shift+PageUp to scroll up. In principle, it can be more useful to have an emulated serial console which implements scrollback.
For kernel crashes, there are a couple of other crash dump suggestions in the VMWare link above.
2.3) The hang after you enter a password sounds like the disk has become non-responsive. I think Linux SCSI will time out operations after a while, and the timeouts will be logged as kernel errors, so Linux would print them on the console. Are your filesystems mounted using the SCSI protocol, or something else?
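If the disks go through the Linux SCSI layer (which includes SATA and VMware's paravirtual SCSI), each one exposes its command timeout in sysfs. A quick sketch to check:

```shell
# For each block device, show the SCSI command timeout (in seconds).
# Devices that do not use the SCSI layer (e.g. virtio-blk, NVMe) simply
# lack this sysfs file and are skipped.
for dev in /sys/block/*; do
    t="$dev/device/timeout"
    [ -r "$t" ] || continue
    printf '%s: timeout=%ss\n' "${dev##*/}" "$(cat "$t")"
done
```

If this prints nothing, your disks are not SCSI-attached, and the timeout/error behaviour will depend on whichever driver they do use.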
2.4) Also, by default the kernel detects hung tasks and prints a message like `task bash:999 blocked for more than 120 seconds`, followed by a call chain ("stack trace"). However, I think the call chain part is logged at the kernel's "default log level", which most often means level 4 (warning). If you want to see the call chain part of the hung task message, you probably need to raise the console log level above 4, e.g. `dmesg -n 5`.
To check you have not disabled hung task messages: `cat /proc/sys/kernel/hung_task_timeout_secs` should show a positive number, e.g. `120`.
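Putting those two checks together, a small sketch (both files are standard on kernels built with the hung task detector):

```shell
# Check the hung task detector is armed.
t=$(cat /proc/sys/kernel/hung_task_timeout_secs 2>/dev/null || echo 0)
echo "hung_task_timeout_secs=$t (0 or missing means no hung task messages)"

# First field of /proc/sys/kernel/printk is the current console log level;
# it must be above 4 for the level-4 call chain to reach the console.
lvl=$(awk '{print $1}' /proc/sys/kernel/printk)
echo "console_loglevel=$lvl (raise with: sudo dmesg -n 5)"
```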
Hung task messages are not printed for network filesystem hangs. They are only printed when the hung task is both "uninterruptible" and "unkillable", and processes hung on NFS can be killed. If you were using a network filesystem that could have caused this hang, you would likely already have thought about this (as well as testing connecting to the NFS server somehow, not just testing the hung VM with `ping`, and then you would have mentioned all this in the question :-). If an NFS server appears responsive from other VMs, and yet you do not see a hung task message on this VM, I guess you could try to investigate using `sysrq`+T (see below).
Hung task messages are enabled by default on Debian Linux builds. (For some reason, my Fedora Linux kernel is built without the hung task detector at all, even though it appears to be included in RHEL and SLES kernels.)
When I searched for the hung task messages, I noticed both hung servers and IO error messages seemed to be a common theme :-).
There is also Linux `sysrq`. If you have a serial console, but you can only capture output printed after you have connected to it, you could e.g. try looking for hung tasks using `sysrq`+T (see below). This dumps info about every task on the system, so it generates a lot of console output, and it might not be so useful when your console is a video display. And you should not test this on a working production system! Some distributions disable `sysrq` by default for physical security reasons; Debian leaves it enabled. Of course you might have used a security checklist that told you to disable `sysrq`.
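To see whether `sysrq` would even work on your VM before you need it, check the sysctl value (a sketch; the meaning of the bitmask is documented in the kernel's sysrq admin guide):

```shell
# kernel.sysrq is 0 (disabled), 1 (all functions allowed), or a bitmask;
# the 8 bit enables "debugging dumps of processes", which covers sysrq+T.
val=$(cat /proc/sys/kernel/sysrq)
case "$val" in
    0) echo "sysrq disabled" ;;
    1) echo "sysrq fully enabled" ;;
    *) echo "sysrq bitmask=$val (needs the 8 bit for task dumps)" ;;
esac
```

Without a keyboard on the console, root can trigger the same dump with `echo t > /proc/sysrq-trigger`; the output goes to the kernel log.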
2.5) The original question did not quote any quantitative monitoring of "responsiveness", either immediately before the observed failure, or to show that the system is not frequently overloaded (this failure might just be the ultimate extension of ordinary overload).
Consider the value of quantitative monitoring of service "responsiveness", for various services - this could include logging in to the ssh server. Also system utilization levels, latencies, and network packets per second.
P.S. Both disk busy% and "CPU iowait" may be cursed. I would want to monitor the current disk latencies and IOPS as well. (Current Debian 9.x kernels should be relatively sensible about disk busy%, though.)
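If you cannot install a monitoring agent, the raw counters behind latency and IOPS are in `/proc/diskstats`; here is a sketch that turns the cumulative counters into average latencies since boot (field numbers follow the kernel's iostats documentation):

```shell
# /proc/diskstats: $3=name, $4=reads completed, $7=ms spent reading,
# $8=writes completed, $11=ms spent writing. These are totals since boot;
# real monitoring should diff two samples over a short interval instead.
awk '$4 + $8 > 0 {
    printf "%-10s avg_read=%.2fms avg_write=%.2fms\n",
           $3, ($4 ? $7 / $4 : 0), ($8 ? $11 / $8 : 0)
}' /proc/diskstats
```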
The above answer and the VMware link describe some of the standard tools which you should learn about, or at least be aware of.
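As one concrete way to build up that quantitative "responsiveness" history, here is a hypothetical cron-able helper (the `probe` name and the example probe commands are mine, not from any standard tool):

```shell
# Time an arbitrary probe command, log its exit code and latency in ms.
# Example probes run from ANOTHER machine: "ssh -o BatchMode=yes host true",
# or a curl against a health-check URL.
probe() {
    start=$(date +%s%N)            # nanoseconds (GNU date)
    "$@" >/dev/null 2>&1
    rc=$?
    end=$(date +%s%N)
    echo "$(date '+%F %T') rc=$rc latency_ms=$(( (end - start) / 1000000 )) cmd=$*"
    return $rc
}
probe true
```

Appending its output to a log from cron gives you exactly the "immediately before the failure" record that was missing here.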
3. A dumb hack (which assumes your tasks are hung on filesystem/disk access)
The detail below is a dumb hack. Sometimes a dumb hack is what you need. I'm just saying, if you have to resort to this, it might suggest some deficits in your way of working that you need to sort out :-P.
If you have some shell tests that you would like to run while the system is "quasi-unresponsive", you could try running an mlock()'ed `busybox` shell. E.g. run a non-statically-linked busybox using this LD_PRELOAD mlock hack. Then run busybox commands using e.g. `(exec -a ls /proc/self/exe /)`. It is probably safest to do:
```shell
# prevent you running any normal program by mistake!
OLDPATH="$PATH"
PATH=

# run a busybox builtin
b() (
    local command="$1"
    shift
    exec -a "$command" /proc/self/exe "$@"
)

# run a normal program in the background, in case it hangs
p() {
    local PATH="$OLDPATH"
    exec "$@" &
}
```
This should let you run `b dmesg`, without needing to read any uncached file :-).
(This breaks down if someone has both 1) contrived to mount a hung filesystem and 2) mounted it over `/` or `/proc`, so that you cannot even access `/proc` without hanging. I think that is not very likely, and it would be even more of a pain to defend against.)
`b ps -o stat,pid,args` will show process states; `D` means "uninterruptible sleep", normally waiting for the disk or a network filesystem. Then `b cat /proc/999/stack` will show where PID 999 is waiting in the kernel.
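To go straight from the `ps` output to the kernel stacks, a sketch that dumps `/proc/<pid>/stack` for every task currently in state `D` (reading another process's kernel stack usually needs root):

```shell
# List tasks in uninterruptible sleep and print where each is stuck in
# the kernel. On a healthy idle system this usually prints nothing.
for pid in $(ps -eo stat=,pid= | awk '$1 ~ /^D/ {print $2}'); do
    echo "=== PID $pid ==="
    cat "/proc/$pid/stack" 2>/dev/null || echo "(need root to read stack)"
done
```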
`cd /sys/class/block/ && b grep -H . */inflight` will show a count of inflight reads and writes for each disk.