On a system with N CPUs, you can run N userspace threads using 100% of a CPU each. However, when you try to use ssh, the kernel would still give ssh a "fair" share of CPU time and allow you to log in. So when you see 100% CPU usage in VMware and ssh is not responding, you might have a busy-loop inside the kernel.
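If you want to convince yourself of the userspace case, here is a minimal sketch (try it on a test machine): saturate every CPU with busy loops and check that you can still log in over ssh.

# start one userspace busy loop per CPU
for i in $(seq "$(nproc)"); do ( while :; do :; done ) & done
# ssh logins from another machine should still succeed, because the
# scheduler still gives sshd a fair share of CPU time; clean up with:
jobs -p | xargs kill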
Make sure your server is not running a graphical interface on its local console. You want to be in text mode, so you can see any messages printed by the kernel.
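On a systemd-based distro, a minimal sketch of how to check and switch (adapt to your init system):

# see whether the machine boots into a graphical session
systemctl get-default
# switch to text mode now, and make that the default for future boots
sudo systemctl isolate multi-user.target
sudo systemctl set-default multi-user.target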
Now read the docs: they say the hard lockup detector, a.k.a. the NMI watchdog, is enabled by default. However, it is disabled when the kernel runs in a VM, so in that case the default is to detect soft lockups only.
/*
* Hard lockup detection is enabled by default. Disable it, as guests
* can get false positives too easily, for example if the host is
* overcommitted.
*/
hardlockup_detector_disable();
-- arch/x86/kernel/kvm.c: kvm_guest_init()
I am confused by the rationale and history of this. I don't know why the comment treats the soft-lockup detector as "safer" than the hard-lockup detector. Also, the original change was justified with a different reason: "The guest PMU is still flushing out bugs. The idea is once the KVM PMU is stable enough, the default switches to hardlockup enabled by default". Lastly, it is mentioned that on some hypervisor versions it might not be possible to enable the NMI watchdog at all.
Assuming you are not massively over-committing CPUs in your hypervisor, you could see if it is possible to enable the NMI watchdog as well. You can use the sysctl linked above, or, as the documentation for that sysctl says, the kernel boot option nmi_watchdog=1.
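For example, a minimal sketch (whether runtime enabling actually works depends on the kernel and the hypervisor):

# current state: 1 = enabled, 0 = disabled
cat /proc/sys/kernel/nmi_watchdog
# try to enable it at runtime (may fail or be a no-op under some hypervisors)
sudo sysctl kernel.nmi_watchdog=1
# or enable it for every boot by adding this to the kernel command line:
#   nmi_watchdog=1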
Then test that you can see messages printed by the kernel:
Is the "local console" logged or at least persistent, so that you
would have noticed if a kernel crash was printed on it? It really
should be, but I am not familiar with how vSphere etc. would work
if you used an emulated serial console. If you are just using an
emulated video display, then it will be persistent already.
This VMware article appears
to rely on the same assumption.
Make sure you have not disabled console logging. Run this
command:
sudo sh -c "echo '<3>test' >/dev/kmsg"
It should show "test" on the console.
If this is an emulated video display, part of a crash message might scroll off the top of the screen. And if the kernel has crashed, then you cannot use shift+PageUp to scroll up. In principle, it can be more useful to have an emulated serial console which implements scrollback.
For kernel crashes, there are a couple of other crash dump suggestions in the VMware link above.
-- Debian Stretch VM becomes quasi-unresponsive every few days
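If the test above prints nothing, one more thing worth checking is the console log level. The '<3>' prefix marks the message as KERN_ERR, so it only reaches the console if the console loglevel is greater than 3. A quick check:

# the first number is the current console loglevel
cat /proc/sys/kernel/printk
# raise it so that most kernel messages reach the console
sudo dmesg -n 7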
Most of the other instructions in the linked answer are about hung task messages. You will not necessarily see those if you have a lockup.
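(If you do want hung task messages, note they are only generated when the kernel was built with CONFIG_DETECT_HUNG_TASK; a quick way to check is that the corresponding sysctls exist:)

sysctl kernel.hung_task_timeout_secs kernel.hung_task_warnings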
That said, it does also mention sysrq. sysrq+L might be an alternative way to get useful information if you have a soft lockup. That will generate a kernel backtrace for each CPU. But the root cause might only be visible on one of the CPUs, so you really need to be able to capture quite a lot of messages. It would be best if you have a serial console. If you have a video console, shift+PageUp might just work, assuming you don't have too many CPUs.
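A minimal sketch of using it, assuming the kernel was built with CONFIG_MAGIC_SYSRQ:

# allow all sysrq functions (0 = disabled, 1 = everything enabled)
sudo sysctl kernel.sysrq=1
# on the console keyboard: Alt+SysRq+L; or, from a shell that still works:
echo l | sudo tee /proc/sysrq-trigger
# the per-CPU backtraces go to the kernel log / console, not to stdout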
ssh. You should edit this question to confirm that the server responds to ping (or not). Or if you insist on configuring a horrible firewall which blocks all ping probes, you must test ARP / ND is still responding. (E.g. arping, or run ping and then inspect the ARP/ND tables on the connected router). – sourcejedi May 20 '19 at 09:04
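(A minimal sketch of that check, run from another machine on the same network segment; the interface name and address below are placeholders:)

# ARP probes bypass ICMP filtering; eth0 and 192.0.2.10 are placeholders
sudo arping -I eth0 192.0.2.10
# or ping the server and then look for a neighbour table entry
ping -c 3 192.0.2.10; ip neigh show | grep 192.0.2.10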
ssh was completely not working, in fact I did ssh -vvv and the verbose output got stuck at "Connecting to host port 22" and the shell never returns. It seems to be waiting for the connection to be established. Regarding the ping, our IT engineers block ICMP traffic to the servers hence I cannot verify ping operation. – 3bdalla May 20 '19 at 09:09

tty number shown above the login prompt should change, or the prompt might disappear and be replaced with a black screen, depending on the system. https://unix.stackexchange.com/questions/518554/debian-stretch-vm-becomes-quasi-unresponsive-every-few-days/ – sourcejedi May 20 '19 at 09:19

for (int i=0; i<20; i++) fork(); generates over a million processes not just 20. – Philip Couling May 20 '19 at 10:18

ssh protocol response. The initial TCP handshake would usually be processed in a softirq. I might be missing something though. (If there is too much time spent in the TCP softirq as a whole - due to heavy network load - this work can get punted to the ksoftirqd thread. At that point, I guess it could be starved of CPU time by other normal threads). – sourcejedi May 20 '19 at 11:46
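(As an aside on the ksoftirqd point, a hypothetical quick check for whether those threads are spinning or starved:)

# list the per-CPU ksoftirqd threads with their CPU usage and state
ps -eo pid,comm,pcpu,stat | grep '[k]softirqd'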