Find PID for processes that are sleeping for more than 2 hours and have CPU usage greater than 90%

Question

I have a machine with an 8-core CPU, but each day all the CPU cores get blocked by sleeping processes, which then have to be killed manually to free the CPU.

I want to automate the killing of processes that have been sleeping for more than 2 hours and have CPU usage greater than 90%.

How can I find the PID of such processes and kill them?

[Image: htop output showing the process with PID 134425 in the sleeping state and blocking a CPU core]

Note:

This gives PIDs for all sleeping processes but doesn't take CPU usage into consideration

awk '/sleeping/{ $0=FILENAME; gsub(/[^0-9]/, ""); print $0 }' /proc/[0-9]*/status

Sleeping processes by definition don’t use CPU. Are you sure that’s what you’re looking for? — Stephen Kitt, Aug 18 '22 at 05:30
Yes, I think so. A process goes to sleep and keeps on occupying the CPU for phusion passenger if the number of active requests increases the threshold. — red-devil, Aug 18 '22 at 06:13
If you are having problems with certain processes, then the correct solution is to fix the underlying problem, not to restart them. — Bib, Aug 18 '22 at 09:02
I have added the output from htop that shows process with PID: 134425 in the sleeping state and blocking a CPU core. I don't want to restart the process, but want to kill it. The issue occurs when there is extra load on the server. — red-devil, Aug 18 '22 at 11:06
Is the "S" state stable over time in htop for the bad process? What about PID 134435, which is also eating up CPU? Is that bad as well? Are there any logfile indications that might show when these processes have gone bad? — thrig, Aug 18 '22 at 13:17
Yes, it stays as "S".
On killing the process with PID 134425, PID 134435 also disappeared, though I haven't seen an "R" process linked to an "S" process this way before. Also, htop only shows one CPU core being almost 100% consumed, which I think is consumed by the "S" process.

I have seen multiple "S" processes piling up over a period of 24 hours while their state stays stable, and this happens till it reaches a number of "8" that is all CPUs are consumed. At this point, the server doesn't take any further requests and throws an error like: "too many requests, limit reached". — red-devil, Aug 18 '22 at 13:36
Also, can't find any log-file indications that show a process has been stuck this way. — red-devil, Aug 18 '22 at 13:42
I blame the parents. Seriously. Dump a ps -ef snapshot, open it in an editor, and backtrack through the parent pids until you find the common parent. Then figure whether it has a bug (like not processing SIGCHLD), or a bad config (like allowing 20,000 worker threads). — Paul_Pedant, Aug 18 '22 at 14:51
@StephenKitt a process with two threads, running sleep(9999); in the main thread and a while (1) x++; in the other thread shows as state "S" but >90% CPU in top on a Centos 7 virt. https://github.com/thrig/scripts/blob/master/signal/busythread.c — thrig, Aug 18 '22 at 15:25
@thrig aargh, I should have said “threads”. If you press H you’ll see the sleeping thread separately from the looping thread. — Stephen Kitt, Aug 18 '22 at 15:31

score 2 · Answer 1 · answered Aug 18 '22 at 16:25

It might be better to find the root cause of the processes going out of control, but to find any process in state S but using some amount of CPU one could use a script along the lines of:

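#!/usr/bin/env bash
# List every process (a, x) without a header line (h): PID, state, %CPU.
ps ahxo pid,state,%cpu |
while read -r pid state cpu; do
   # S = interruptible sleep
   if [[ "$state" = S ]]; then
      # Drop the fractional part so the integer comparison works.
      if [[ "${cpu%%.*}" -gt 90 ]]; then
         echo "woe betide pid $pid"
      fi
   fi
done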

with a kill instead of the echo once you are sure the script is not going to destroy your system. This could be made less dangerous by restricting the match based on the command field, or by first filtering for suitable PIDs with the pgrep command.

"More than 2 hours" is difficult to capture from the ps output given that the usual time columns are humanized (procps-ng also has an etimes column in plain seconds, but that is the age of the process, not how long it has been stuck), and parsing that information out of /proc/$pid/stat can be dangerous if whitespace ever shows up in the process name. Alternatively, the monitoring script could be made more complicated and maintain a state counter for how many times it has seen a PID with the required conditions. Maybe see if the above script works first, before going for a more complicated "more than 2 hours" check?
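One rough way to approximate the "more than 2 hours" condition is sketched below, assuming a procps-ng ps that supports the etimes column (elapsed seconds since the process started). This measures process age, not time spent sleeping, and the %cpu reported by ps is CPU time divided by run time over the whole life of the process, so it is only an approximation. The names filter_candidates, MIN_AGE and MIN_CPU are made up for this illustration.

```shell
#!/usr/bin/env bash
# Sketch only: print the PIDs of processes in state S that are older than
# MIN_AGE seconds and above MIN_CPU percent CPU.
# Reads lines of "pid state %cpu etimes" on stdin.
# filter_candidates, MIN_AGE and MIN_CPU are illustrative names (assumptions).
filter_candidates() {
   awk -v min_age="${MIN_AGE:-7200}" -v min_cpu="${MIN_CPU:-90}" '
      # $1 = pid, $2 = state, $3 = %cpu, $4 = elapsed seconds since start
      $2 == "S" && $3 + 0 > min_cpu && $4 + 0 > min_age { print $1 }
   '
}
```

It could be fed with something like ps ahxo pid,state,%cpu,etimes | filter_candidates, and the resulting PIDs reviewed by hand before anything is wired up to kill.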

Another method might be to have a monitoring system that makes periodic requests to the service, and to have that issue a service restart if the requests start timing out or taking too long. But such band-aids aren't really a good long term solution for something that needs to be whacked with a broom now and then.
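The periodic-request idea could be sketched roughly as follows, assuming curl is installed. The function names probe and restart_if_unhealthy, the TIMEOUT variable, and any URL or restart command passed in are placeholders, not something the service in question is known to provide.

```shell
#!/usr/bin/env bash
# Sketch only: probe a URL and run a restart command if the probe fails.
# probe, restart_if_unhealthy and TIMEOUT are hypothetical names.

# Succeeds only if the URL answers without an HTTP error within TIMEOUT seconds.
probe() {
   curl --silent --fail --max-time "${TIMEOUT:-10}" --output /dev/null "$1"
}

# Usage: restart_if_unhealthy URL RESTART_COMMAND [ARGS...]
restart_if_unhealthy() {
   url=$1
   shift
   if probe "$url"; then
      echo "healthy: $url"
   else
      echo "unhealthy: $url, running: $*"
      "$@"   # run the restart command exactly as given
   fi
}
```

A cron job might then call restart_if_unhealthy with the service's URL and whatever command restarts it; a single failed probe triggers the restart here, so a real setup would probably want to require several failures in a row.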

If the service has some sort of log file indication that things are broken, that can be another method to automatically detect faults and "turn it off and back on again."

Of course, such restart automation can go terribly wrong if it kills the wrong process, or kills the process when it should not. That risk might be used as a rationale for finding and fixing the problem or, in the event management is not willing to spend the necessary resources on a fix, for retiring the software. I've had */5 * * * * /reboot-service crontab jobs because something management wouldn't allow to be fixed was leaking that much memory. You will probably also want to document the auto-restart script somewhere, remember to disable it when doing updates, and so on. More work.
