
The impacted computer is a VM running in vSphere, and it's a production server, so troubleshooting time is generally nonexistent once the issue rears its ugly head. The problem goes away after a reboot, and the system appears to run stable for about a week afterward. I'm looking for some ideas on what to do or look for when it happens again (probably around this Thursday or Friday, if the current pattern holds).

The VM responds normally to pings and is still listening for — but not responding to — http/s requests (apache2). It also listens for SSH, but never prompts for authentication before closing the session.

The "local" console hangs immediately after submitting a command. Up until that point, you can type whatever you want; only after you try to ask the system to do something does it stop working. This includes trying to use tab-completion for filenames, &c. I can switch to one of the other virtual terminals and enter a username and password...but then the system hangs again.

No information about the crash is present in any logs in /var/log (any pointers on someplace else to look?). The last message in any of the log files is written long before the actual problem occurs.

Additional info:

There is never anything printed on the VM's "local console" when this problem occurs. This VM has a single 1TB volume (thick provisioned, lazy zeroed) attached via LSI Logic Parallel vSCSI. The datastore itself is a large NAS which also services a few other ESXi hosts, and none of the other guests are impacted when this happens.

vCenter/vSphere does not show abnormally high CPU or memory utilization when this problem occurs.

On at least one occasion, the hang has gone on for >8 hours before being noticed by someone trying to access the server via SFTP.

Per sourcejedi's recommendation, I have now lowered the threshold for console logging to 5, and confirmed that I can see messages sent to /dev/kmsg on the VM's local console. These messages were not displayed before I made the change, so it's possible the kernel was trying to say something, but I never saw it.
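
In case it is useful to anyone else, this is roughly what I ran; a minimal sketch, assuming a distribution that reads /etc/sysctl.d (the file name below is my own choice):

# set the console loglevel now (the first field of kernel.printk)
sudo dmesg -n 5

# make it survive reboots; the other three fields are the usual defaults
echo 'kernel.printk = 5 4 1 7' | sudo tee /etc/sysctl.d/90-console-loglevel.conf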

Since I have resources to spare on the ESXi host, I've also cloned the VM and put it on a separate, isolated network. If this issue pops up there, I will have more time to troubleshoot it without having to worry about production services being down during the process.

I'll update as I have more info, but thanks to everyone who has helped so far!

  • There are whole books about the theme. Do you have any more specific data about both VMs and logs? Specific questions? I would advise reading our FAQ. – Rui F Ribeiro May 12 '19 at 16:09
  • @RuiFRibeiro I thought this was quite specific :-P. (original question, paraphrase): "It looks as if all my userspace programs are freezing, perhaps when they try to use the filesystem. Except that I don't see any hung task messages on the console! How can I gather more information about this when it happens? Would really like to have those sweet sweet kernel backtraces." If you think your comment still applies, perhaps you could clarify it by giving an example of a whole book about this theme. – sourcejedi May 13 '19 at 10:37
  • @sourcejedi The question is anything but specific; it does not even mention the stack/software used, or any kind of logs. https://www.amazon.com/Systems-Performance-Enterprise-Brendan-Gregg/dp/0133390098 How many more do you want? Please do not mess with me. :) Cheers and keep up the good work. – Rui F Ribeiro May 13 '19 at 14:17
  • @RuiFRibeiro It says there are no console logs :-). Honestly, that's a nice source and I think it helps disambiguate your comment. I.e. this is not necessarily an indefinite deadlock waiting for something that never happens. The observer in the question might not be waiting "long enough" / the latency for individual requests might be below the hung task timeout but tasks are needing multiple requests to make visible progress. It also points out that the question doesn't quote any quantitative monitoring of "responsiveness", either leading up to the event, or to look for smaller excursions. – sourcejedi May 13 '19 at 15:30
  • @sourcejedi btw, I have that book. Much better than the previous one from the same guy about Dtrace. – Rui F Ribeiro May 13 '19 at 15:41
  • @RuiFRibeiro sourcejedi's paraphrasing of my question is more or less what I was trying to get at. We are quantitatively monitoring performance on this guest to the extent that ESXi logs resource usage at regular intervals. I should have mentioned this in my original question, but we do not see abnormal CPU or memory utilization when this happens; on sourcejedi's recommendation, I will also be checking IOPS when this happens again to see if something is slamming the disk.

    I'll update my question with this and a few more pieces of info.

    – deletesystem32 May 13 '19 at 22:06

1 Answer

  1. Assumptions
  2. Instructions
  3. A dumb hack (which assumes your tasks are hung on filesystem/disk access)

1. Assumptions

1.1) By default, the Linux kernel has code that reports various types of crash or hang.

All of them can report the immediate problem and print a call chain on the "local console". The report might not show the root cause, and this code can never be 100% reliable, but you usually get something, and it is much better than nothing.

So you should double-check that you are able to see these kernel log messages on the console! Details are in the next section.

1.2) Since the kernel itself is still responding to your key-presses and network packets, I would really expect the hung task detector to work here.

It sounds like kernel threads and interrupts are still functioning, but userspace processes are hanging. The symptoms sound consistent with a hang when processes try to access the physical filesystem. When a process hangs for several minutes, the kernel prints a "hung task" message and call chain.

1.3) Alternatively, maybe user processes are not entirely hung, but they are very slow, and you are not waiting "long enough" to see them make progress.

You might be familiar with this story, if you have experience using a Linux PC with a mechanical HDD :-). But since this is not a PC on your desk, you will not notice the hard disk being very noisy or the disk activity light being permanently lit :-).

I am not experienced at managing servers. But I think you should be using monitoring software to try and detect issues like this. Ideally even before they create user-visible problems :-).

To give one example, if you monitor your system memory usage, it can show if you have a gradual "memory leak" and the system starts swapping itself to death. I expect you do not have that problem. E.g. if login had been swapped out, then the console login would have been slow to even prompt for your password.
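
As a sketch, vmstat (from the procps package, installed by default on Debian) shows swap activity directly; a system swapping itself to death has persistently nonzero si/so columns:

# report memory and swap activity every 5 seconds; watch the si/so columns
vmstat 5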

Perhaps you would detect a rise in disk IO in the seconds before the observed failure, if you had sufficiently fine-grained monitoring.
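
For example, sar from the sysstat package can sample disk activity at a fixed interval (and, with its cron-based collector enabled, keep a history you can consult after the fact):

# print one-minute samples of per-device disk IO until interrupted
sar -d 60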

2. Instructions

2.1) Is the "local console" logged or at least persistent, so that you would have noticed if a kernel crash was printed on it? It really should be, but I am not sure how vSphere etc. would behave if you used an emulated serial console. If you are just using an emulated video display, then it is persistent already.

This VMware article appears to rely on the same assumption.

2.2) Make sure you have not disabled console logging. Run this command:

sudo sh -c "echo '<3>test' >/dev/kmsg"

It should show "test" on the console. See also below, where I talk about stack traces.

If this is an emulated video display, part of a crash message might scroll off the top of the screen. And if the kernel has crashed, then you cannot use shift+PageUp to scroll up. In principle, it can be more useful to have an emulated serial console which implements scrollback.

For kernel crashes, there are a couple of other crash dump suggestions in the VMware link above.

2.3) The hang after you enter a password sounds like the disk has become non-responsive. I think Linux SCSI will time out operations after a while, and the timeouts will be logged as kernel errors, so Linux would print them on the console. Are your filesystems mounted over the SCSI protocol, or something else?
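
If it is SCSI (the LSI Logic Parallel controller mentioned in the question presents the disk as SCSI), you can check the kernel's command timeout for the disk; a sketch, assuming the disk is sda:

# SCSI command timeout in seconds (the default is usually 30)
cat /sys/block/sda/device/timeout

# any SCSI errors the kernel has already logged?
dmesg | grep -iE 'scsi|sd[a-z]' | tail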

2.4) Also, by default the kernel detects hung tasks and prints a message: task bash:999 blocked for more than 120 seconds. It is followed by a call chain ("stack trace"). Although, I think the call chain part is logged using the kernel's "default log level", which most often means level 4 (warning).

If you want to see the call chain part of the hung task message, you probably need to increase the console log level above level 4, e.g. dmesg -n 5.
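
To double-check, the current console loglevel is the first of the four numbers in kernel.printk, and you can send a test message at a chosen level:

# first field = current console loglevel
cat /proc/sys/kernel/printk

# with console loglevel 5, this warning-level (4) message should appear on the console
sudo sh -c "echo '<4>loglevel test' >/dev/kmsg"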

To check you have not disabled hung task messages: cat /proc/sys/kernel/hung_task_timeout_secs should show a positive number e.g. 120.
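
If you want faster reports while you are testing, the timeout is writable; a sketch (this does not persist across reboots):

# report hung tasks after 60 seconds instead of the default 120
echo 60 | sudo tee /proc/sys/kernel/hung_task_timeout_secs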

Hung task messages are not printed for network filesystem hangs: they are only printed when the hung task is both "uninterruptible" and "unkillable", and processes hung on NFS can be killed. If you were using a network filesystem that could have caused this hang, you would likely have thought of it already (and tested connectivity to the NFS server itself, rather than just pinging the hung VM, and then you would have mentioned all this in the question :-). If an NFS server appears responsive from other VMs, and yet you do not see a hung task message on this VM, you could try to investigate using sysrq+T (see below).
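
A quick way to rule network filesystems in or out is to check the mount table:

# list any network filesystems currently mounted
grep -E 'nfs|cifs|9p' /proc/mounts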

Hung task messages are enabled by default on Debian Linux builds. (For some reason, my Fedora Linux kernel is built without the hung task detector at all, even though it appears to be included in RHEL and SLES kernels.)

When I searched for the hung task messages, I noticed both hung servers and IO error messages seemed to be a common theme :-).

There is also Linux sysrq. If you have a serial console, even one where you can only capture output printed after you connect, you could try looking for hung tasks using sysrq+T. This dumps info about every task on the system, so it generates a lot of console output, which makes it less useful when your console is a video display. And you should not test this on a working production system! Some distributions disable sysrq by default for physical security reasons; Debian leaves it enabled, although you might have used a security checklist that told you to disable it.
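
If you can get a root shell (e.g. on the non-production clone), you do not need the keyboard combination at all; I believe writing to /proc/sysrq-trigger works for root even where the key combination is disabled:

# check whether the keyboard combination is enabled (1 = everything; other values are a bitmask)
cat /proc/sys/kernel/sysrq

# dump the state of every task to the kernel log (the equivalent of sysrq+T)
echo t | sudo tee /proc/sysrq-trigger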

2.5) The original question did not quote any quantitative monitoring of "responsiveness", either immediately before the observed failure, or to show that the system is not frequently overloaded (the total hang might just be the extreme end of smaller, unnoticed excursions).

Consider the value of quantitative monitoring of service "responsiveness", for various services - this could include logging in to the ssh server. Also system utilization levels, latencies, and network packets per second.
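
As a minimal sketch of such a probe (run from a different machine; the hostname and log file are placeholders), logging how long an HTTP request takes once a minute:

# hypothetical probe: time an HTTP request to the VM once a minute
while sleep 60; do
  printf '%s %s\n' "$(date -Is)" \
    "$(curl -s -o /dev/null -w '%{time_total}' --max-time 30 http://vm.example.com/)"
done >>probe.log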

P.S. Both disk busy% and "CPU iowait" can be misleading, so I would want to monitor current disk latencies and IOPS as well. (Current Debian 9.x kernels should be relatively sensible about disk busy%, though.)
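
For latency and IOPS, iostat (also from the sysstat package) reports both per device:

# extended per-device stats every 5 seconds: r/s and w/s are IOPS;
# await (or r_await/w_await) is the average request latency in ms
iostat -x 5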

The answer above and the VMware link describe some of the standard tools which you should learn about, or at least be aware of.

3. A dumb hack (which assumes your tasks are hung on filesystem/disk access)

The detail below is a dumb hack. Sometimes a dumb hack is what you need. I'm just saying, if you have to resort to this, it might suggest some deficits in your way of working that you need to sort out :-P.

If you have some shell tests that you would like to run while the system is "quasi-unresponsive", you could try running an mlock()'ed busybox shell. E.g. run a dynamically linked busybox using this LD_PRELOAD mlock hack (LD_PRELOAD has no effect on statically linked binaries). Then run busybox commands using e.g. (exec -a ls /proc/self/exe /). Probably safest to do:

# prevent you running any normal program by mistake!
OLDPATH="$PATH"
PATH=

# run a busybox applet (assumes the current shell is the mlock()'ed busybox,
# so /proc/self/exe points at the busybox binary)
b() (
  local command="$1"
  shift
  exec -a "$command" /proc/self/exe "$@"
)

# run normal program in the background, in case it hangs
p() {
  local PATH="$OLDPATH"
  exec "$@" &
}

This should let you run b dmesg, without needing to read any uncached file :-).

(This breaks down if someone has both 1) contrived to mount a hung filesystem 2) mounted it over / or /proc, so that you cannot even access /proc without hanging. I think that is not very likely, and it would be even more of a pain to defend against.)

b ps -o stat,pid,args will show process states; D means "uninterruptible sleep" - normally waiting for the disk or network filesystem. Then b cat /proc/999/stack will show where PID 999 is waiting in the kernel.

cd /sys/class/block/ && b grep -H . */inflight will show a count of inflight reads and writes for each disk.
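
One more thing worth looking at in the same situation, since it only reads /proc: how much dirty data is waiting to be written back. A number here that grows and never drains suggests the disk has stopped accepting writes:

b grep -iE 'dirty|writeback' /proc/meminfo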

sourcejedi
  • I would've sworn that I have seen system/kernel errors printed in the local console before, but your first suggested test did not print anything for me. Fixing that is currently my top priority. There is never anything printed on the VM's "local console" when this problem occurs, and that has been a big source of frustration.

    This VM has a single 1TB volume (thick provisioned, lazy zeroed) attached via LSI Logic Parallel vSCSI. The datastore itself is a large NAS which also services a few other ESXi hosts, and none of the other guests are impacted when this happens.

    – deletesystem32 May 12 '19 at 17:18
  • @deletesystem32 it might only be logging errors. I think echo'ing to kmsg defaults to loglevel "warn". I guess try lowering the loglevel with dmesg -n warn or dmesg -n info . – sourcejedi May 12 '19 at 17:21
  • I've just seen all your edits to this answer. TBQH, a lot of it is going straight over my head, as I am a novice *nix user, but this is some very good info. Lots of rabbit holes to go down. I can monitor IOPS per VM via vSphere, so I'll check there the next time this happens to see if something has gone crazy. In the meantime, I've cloned the VM and put it on an isolated network segment; if the hang happens there, I'll have unlimited time to troubleshoot. If not, it at least builds more evidence that someone connected to the VM is doing something it doesn't like. Thanks! – deletesystem32 May 13 '19 at 21:57
  • @deletesystem32 I've fixed the test message command. The problem was that I tested it on my Linux PC, and while a lot of systems are booted using the quiet option (equivalent to booting with the option loglevel=4 ?), mine is not. I hope my mistake did not alarm you too much :-). I think console loglevel=4 is supposed to show the hung task message, but not the helpful call chain. I think the call chain is written at loglevel 4, and this means you are supposed to see the call chain as well when you set console loglevel=5. – sourcejedi May 14 '19 at 09:54
  • @deletesystem32 TL;DR - it's good to hear you have set the console loglevel to 5. Just remember that if you used dmesg -n 5, that setting will not survive a reboot. – sourcejedi May 14 '19 at 09:56