1

We are facing a weird issue with a CentOS 7.4 virtual server (VMware ESXi): the server suddenly hits 100% CPU usage within a few seconds, after which ssh no longer works and a specific program we run stops responding. The only solution is to force-restart the server through vSphere. We can't figure out the source of such high usage, so my question is: how do we diagnose this kind of sudden high usage? Is there a way to log process information that can be investigated after the restart?

EDIT: ssh was completely not working; in fact I ran ssh -vvv and the verbose output got stuck at "Connecting to host port 22" and the shell never returned. It seems to be waiting for the connection to be established. Regarding ping, our IT engineers block ICMP traffic to the servers, so I cannot verify whether ping works.

sourcejedi
  • 50,249
3bdalla
  • 441
  • Hmm, 100% CPU doesn't necessarily break ssh. You should edit this question to confirm whether the server responds to ping. Or, if you insist on configuring a horrible firewall which blocks all ping probes, you must test that ARP / ND is still responding (e.g. with arping, or by running ping and then inspecting the ARP/ND tables on the connected router). – sourcejedi May 20 '19 at 09:04
  • @sourcejedi ssh was completely not working; in fact I ran ssh -vvv and the verbose output got stuck at "Connecting to host port 22" and the shell never returned. It seems to be waiting for the connection to be established. Regarding ping, our IT engineers block ICMP traffic to the servers, so I cannot verify ping operation. – 3bdalla May 20 '19 at 09:09
  • I would offer my comment to your IT engineers, and if necessary create a written record for future reference. I.e. not being able to check for a network response from the kernel is making it slower to diagnose and fix problems when they arise. They should be able to allow for an automatic network monitoring system somewhere, which is allowed to ping the kernel. – sourcejedi May 20 '19 at 09:12
  • If that cannot be resolved in time, the next best test result is to see whether the kernel responds to ctrl+alt+f2 / ctrl+alt+f1 on the emulated local console. This assumes you have text login prompts. The tty number shown above the login prompt should change, or the prompt might disappear and be replaced with a black screen, depending on the system. https://unix.stackexchange.com/questions/518554/debian-stretch-vm-becomes-quasi-unresponsive-every-few-days/ – sourcejedi May 20 '19 at 09:19
  • 1
    This smells like a fork bomb. They can be the result of someone making a simple mistake with the fork() function in their code. E.g. in C, for (int i=0; i<20; i++) fork(); generates over a million processes, not just 20. – Philip Couling May 20 '19 at 10:18
  • @PhilipCouling It's impossible to tell anything without knowing what the issue is. I had the exact same problem once and it was an infinite, heavy PHP loop caused by an indexing bot. It's impossible to speculate realistically about this issue without knowing what is running, how it's configured, and so on... – Kiwy May 20 '19 at 11:33
  • To be accurate, I think some system process is causing this behavior. The vSphere performance graph, as well as the archived CPU load data, shows moderate CPU spikes every 2-3 minutes reaching 20-30% usage, while nothing is running except our application, which uses less than 1% CPU. I am starting to suspect something to do with the CF Engine might be the cause, because other servers reported similar behavior, but nothing is verified yet. – 3bdalla May 20 '19 at 11:40
  • 1
    @PhilipCouling thinking about it, stuck in "Connecting to host port 22" suggests even the kernel TCP stack doesn't get enough cycles to send TCP ACK, never mind the initial ssh protocol response. The initial TCP handshake would usually be processed in a softirq. I might be missing something though. (If there is too much time spent in the TCP softirq as a whole - due to heavy network load - this work can get punted to the ksoftirqd thread. At that point, I guess it could be starved of CPU time by other normal threads). – sourcejedi May 20 '19 at 11:46
  • 1
    @sourcejedi yes it's strange. The reason I mention a fork bomb is that Linux is usually good at throttling out-of-control single threads. Simple CPU demand doesn't generally explain this type of symptom. – Philip Couling May 20 '19 at 11:52
  • 1
    By the way, blocking ping internally is usually not really useful and more of a hassle than a real security feature, and blocking ping on a public IP is also quite useless nowadays, as there are sites indexing the open ports of any online server. – Kiwy May 20 '19 at 15:02

2 Answers

2

When I faced a similar issue I created a small script like this (every second it writes the date and the list of running processes with their CPU and RAM usage):

#!/bin/sh
# Every second, append a timestamp and a full process listing
# (including per-process CPU and RAM usage) to the log file.
while true
do
    date
    ps faux
    sleep 1
done >> /a/log/file

and I let it run as a background program. This should help you understand which process goes crazy, and when.
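For example, a minimal way to launch it in the background so it survives your logout (the script path here is just a placeholder; the log path is the one hard-coded in the script):

# assuming the script above was saved as /usr/local/bin/ps-logger.sh (hypothetical path)
chmod +x /usr/local/bin/ps-logger.sh
nohup /usr/local/bin/ps-logger.sh >/dev/null 2>&1 &
# the process listings keep accumulating in /a/log/file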

After that you will have to look at /var/log/messages and other logs (maybe the one from the misbehaving program) to identify your issue precisely.
You could also install atsar, which provides nice binary logs with lots of stats such as I/O, network activity, CPU and so on...
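If you prefer the stock CentOS tooling, the sysstat package records comparable binary history (this is an alternative suggestion, not the atsar setup above; by default it samples every 10 minutes via cron):

sudo yum install -y sysstat
sudo systemctl enable sysstat
sudo systemctl start sysstat
# after the next incident, read back that day's history:
sar -u -f /var/log/sa/sa$(date +%d)        # CPU usage
sar -b -f /var/log/sa/sa$(date +%d)        # disk I/O
sar -n DEV -f /var/log/sa/sa$(date +%d)    # network activity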

/!\ Warning:
If you run this script for long enough, the log can become big. Place the log file somewhere with enough disk space, or this could itself become a major problem.
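One way to keep it bounded is a logrotate rule; copytruncate matters here because the script keeps the file open through its shell redirection, so the file has to be truncated in place rather than moved aside. A sketch using the example path from the script:

sudo tee /etc/logrotate.d/ps-logger <<'EOF'
/a/log/file {
    daily
    rotate 3
    compress
    missingok
    copytruncate
}
EOF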

/!\ Warning 2:
I have no idea what your ESXi setup looks like. However, if for any reason the disks become laggy ESXi-wide, this could cause major latency and dramatic CPU usage on the virtual machine if it relies on I/O.

Edit 2:

As mentioned by @sourcejedi, you could add sync to the script to make sure the log is written to disk in case of a hard reboot (I'm not sure it's necessary, but two safety nets are better than one):

#!/bin/sh
# Same logger as above, but flush the log to disk after every sample
# so the last entries survive a forced reboot.
LOGFILE="/a/log/file"
echo "" > "$LOGFILE"    # start with a fresh log
while true
do
    date
    ps faux
    sync "$LOGFILE"     # older sync versions ignore the argument and sync everything
    sleep 1
done >> "$LOGFILE"
Kiwy
  • 9,534
  • Since the server becomes totally unresponsive, will this script still be executed at the time of the spike? Maybe it's not totally unresponsive, because vSphere is still able to read live CPU usage. Would /var/log/messages contain OS messages about which process is causing this behavior? Up to this point we have no idea which program is responsible (either the specific program we are running or some CF Engine process, because another server reported similar but less severe behavior where the offending process was something related to CF Engine). – 3bdalla May 20 '19 at 08:38
  • 1
    It will list, every second, all your processes with their CPU and RAM usage. I bet the problem does not develop in less than a second; a huge jump in CPU usually doesn't happen that fast, it usually takes a few seconds to really freeze your server. The log would give you a timestamp of when the problem occurred and the list of processes running with their CPU/RAM consumption. If the problem is with one process, this could be enough to identify your issue. – Kiwy May 20 '19 at 08:59
  • When you restart, you will have to look at the end of the generated log file; you should be able to identify the precise moment something went wrong and cross-check it against syslog/messages/application logs to find a pattern or a problem. – Kiwy May 20 '19 at 09:03
  • The precise timestamp can be determined. I have run your script now; I guess we'll have to wait for another spike and then start investigating. Will come back with feedback. Thanks! – 3bdalla May 20 '19 at 09:05
  • @3bdalla Good luck. The precise timestamp is not when the application becomes unresponsive but probably a few seconds/minutes before that. – Kiwy May 20 '19 at 09:07
  • It might be worth considering sync /a/log/file in the loop, e.g. in case you force-reboot the VM soon after you see the problem. – sourcejedi May 20 '19 at 11:01
  • @sourcejedi Indeed. I never had an issue with that, but it's certainly worth mentioning. I've added it to the answer. – Kiwy May 20 '19 at 11:29
0

On a system with N CPUs, you can run N userspace threads using 100% of a CPU each. However, when you try to use ssh, the kernel should still give ssh a "fair" share of CPU time and allow you to log in.

When you see 100% CPU usage in VMware, and ssh is not responding, you might have a busy-loop inside the kernel.

Make sure your server is not running a graphical interface on its local console. You want to be in text mode, so you can see any messages printed by the kernel. Now read:

The docs say the hard lockup detector, aka NMI watchdog, is enabled by default. Except it is disabled when the kernel runs in a VM. So in this case, the default is to only detect soft lockups.

    /*
     * Hard lockup detection is enabled by default. Disable it, as guests
     * can get false positives too easily, for example if the host is
     * overcommitted.
     */
    hardlockup_detector_disable();

-- arch/x86/kernel/kvm.c: kvm_guest_init()

I am confused by the rationale and history of this. I don't know why it considers the soft-lockup detector to be "safer" than the hard-lockup detector. Also the original change was justified with a different reason. "The guest PMU is still flushing out bugs. The idea is once the KVM PMU is stable enough, the default switches to hardlockup enabled by default". Lastly, it is mentioned that on some versions of hypervisors, it might not be possible to enable the NMI watchdog at all.

Assuming you are not massively over-committing CPUs in your hypervisor, you could see if it's possible to enable the NMI watchdog as well. You can use the sysctl linked above, or the doc for the sysctl says you could also use the kernel boot option nmi_watchdog=1.
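For example, checking and enabling it at runtime looks like this (the sysctl is the standard interface; whether the hypervisor actually delivers the NMIs is a separate question):

# 1 = hard lockup detector (NMI watchdog) enabled, 0 = disabled
cat /proc/sys/kernel/nmi_watchdog

# enable it for the running kernel
sudo sysctl kernel.nmi_watchdog=1

# or make it the default at every boot, e.g. on CentOS 7:
#   add nmi_watchdog=1 to GRUB_CMDLINE_LINUX in /etc/default/grub,
#   then run: grub2-mkconfig -o /boot/grub2/grub.cfg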

Then test that you can see messages printed by the kernel:

Is the "local console" logged or at least persistent, so that you would have noticed if a kernel crash was printed on it? It really should be, but I am not sure familiar with how vSphere etc. would work if you used an emulated serial console. If you are just using an emulated video display, then it will be persistent already.

This VMWare article appears to rely on the same assumption.

Make sure you have not disabled console logging. Run this command:

sudo sh -c "echo '<3>test' >/dev/kmsg"

It should show "test" on the console.

If this is an emulated video display, part of a crash message might scroll off the top of the screen. And if the kernel has crashed, then you cannot use shift+PageUp to scroll up. In principle, it can be more useful to have an emulated serial console which implements scrollback.

For kernel crashes, there are a couple of other crash dump suggestions in the VMWare link above.

-- Debian Stretch VM becomes quasi-unresponsive every few days
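If the "test" line from that check never shows up, the console log level may be filtering kernel messages; a quick way to inspect and raise it (standard procfs / dmesg interface):

# first value = current console loglevel; messages with a lower
# priority number than this are printed on the console
cat /proc/sys/kernel/printk

# raise it so everything up to info level reaches the console
sudo dmesg -n 7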

Most of the other instructions in the linked answer are about hung task messages. You will not necessarily see those if you have a lockup.

That said, it does also mention sysrq. sysrq+L might be an alternative way to get useful information, if you have a soft lockup. That will generate a kernel backtrace for each CPU. But the root cause might only be visible for one of the CPUs, so you really need to be able to capture quite a lot of messages there. It would be best if you have a serial console. If you have a video console, shift+PageUp might just work, assuming you don't have too many CPUs.
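Note that the magic sysrq key has to be enabled for sysrq+L to do anything; a sketch of the usual knobs (kernel.sysrq=1 enables all sysrq functions):

# check whether magic sysrq is enabled (1 means all functions allowed)
cat /proc/sys/kernel/sysrq

# enable it now and persist the setting across reboots
sudo sysctl kernel.sysrq=1
echo 'kernel.sysrq = 1' | sudo tee /etc/sysctl.d/99-sysrq.conf

# from a working shell you can also trigger the per-CPU backtraces
# by hand, to confirm they reach the console:
echo l | sudo tee /proc/sysrq-trigger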

sourcejedi
  • 50,249