It would be better to find the root cause of the processes going out of control, but to find any process that is in state S yet using a lot of CPU, one could use a script along the lines of:
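    #!/usr/bin/env bash
    # a = all users, h = no header, x = include processes without a tty
    ps ahxo pid,state,%cpu |
    while read -r pid state cpu; do
        # only sleeping processes are of interest
        if [[ "$state" = S ]]; then
            # compare the integer part of %cpu against the threshold
            if [[ "${cpu%%.*}" -gt 90 ]]; then
                echo "woe betide pid $pid"
            fi
        fi
    done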
with a kill instead of the echo once you are sure the script is not going to destroy your system. This could be made less dangerous by restricting the match based on the command field, or by first filtering for suitable PIDs with the pgrep command.
"More than 2 hours" is difficult to capture from the ps output given that the time field is humanized; parsing that information out of /proc/$pid/stat can be dangerous if whitespace ever shows up in the process name. Alternatively, the monitoring script could be made more complicated and maintain a state counter for how many times it has seen a PID with the required conditions. Maybe see if the above script works first, before going for a more complicated "more than 2 hours" check?
Another method might be to have a monitoring system that makes periodic requests to the service, and to have that issue a service restart if the requests start timing out or taking too long. But such band-aids aren't really a good long term solution for something that needs to be whacked with a broom now and then.
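A rough sketch of such a request-based check, assuming bash and curl are available. The function names probe and watchdog, the URL and the service name in the usage comment are all placeholders, not anything taken from the actual setup.

```bash
#!/usr/bin/env bash
# Sketch only: probe and watchdog are hypothetical helper names.

# probe URL SECONDS: succeed only if the URL answers with a non-error
# HTTP status within the time limit.
probe() {
    curl --silent --fail --output /dev/null --max-time "$2" "$1"
}

# watchdog ATTEMPTS COMMAND...: run COMMAND up to ATTEMPTS times; return 0
# as soon as one run succeeds, 1 if every attempt fails.
watchdog() {
    local attempts=$1 i
    shift
    for ((i = 0; i < attempts; i++)); do
        "$@" && return 0
    done
    return 1
}

# Possible use (placeholder URL and unit name):
#
#   watchdog 3 probe http://localhost:8080/health 10 ||
#       systemctl restart example.service
```

Requiring several failed attempts before restarting is one way to avoid bouncing the service over a single slow request.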
If the service has some sort of log file indication that things are broken, that can be another method to automatically detect faults and "turn it off and back on again."
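For the log-file approach, something like this might do as a starting point. It is only a sketch: log_shows_fault is a made-up name, and the log path, pattern and service name in the usage comment are placeholders for whatever the service actually writes when it breaks.

```bash
#!/usr/bin/env bash
# Sketch only: log_shows_fault is a hypothetical helper name.

# log_shows_fault FILE PATTERN [LINES]: succeed if the fixed string PATTERN
# occurs in the last LINES lines of FILE (default 200). A missing or
# unreadable file counts as "no fault seen".
log_shows_fault() {
    local file=$1 pattern=$2 lines=${3:-200}
    tail -n "$lines" "$file" 2>/dev/null | grep -q -F -- "$pattern"
}

# Possible use (placeholder path, pattern and unit name):
#
#   if log_shows_fault /var/log/example/error.log "limit reached"; then
#       systemctl restart example.service
#   fi
```

Note that the same old log line will keep matching until it scrolls out of the tail window, so a real version would likely need to remember what it has already acted on.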
Of course such restart automation can go terribly wrong if it kills the wrong process, or kills the process when it should not; this might be used as a rationale for finding and fixing the problem, or retiring the software, in the event management is not willing to spend the necessary resources to fix the problem. I've had */5 * * * * /reboot-service crontab jobs because something management wouldn't allow to be fixed was leaking that much memory. Oh yeah, you probably also want to document the auto-restart script somewhere, remember to disable it when doing updates, and so on. More work.
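One way to make the "disable it when doing updates" part less error-prone is a flag file that the restart job checks first. This is just a sketch; MAINT_FLAG, its default path and guarded_restart are invented for the example.

```bash
#!/usr/bin/env bash
# Sketch only: MAINT_FLAG and guarded_restart are hypothetical names.

MAINT_FLAG=${MAINT_FLAG:-/etc/reboot-service.disabled}

# guarded_restart COMMAND...: run COMMAND unless the flag file exists, in
# which case say so on stderr and do nothing.
guarded_restart() {
    if [[ -e $MAINT_FLAG ]]; then
        echo "auto-restart disabled by $MAINT_FLAG; skipping" >&2
        return 0
    fi
    "$@"
}

# Possible use inside the cron-driven script:
#
#   guarded_restart /reboot-service
#
# with "touch /etc/reboot-service.disabled" before an update and
# "rm /etc/reboot-service.disabled" afterwards.
```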
... phusion passenger if the number of active requests increases the threshold. – red-devil Aug 18 '22 at 06:13
... htop that shows process with PID: 134425 in the sleeping state and blocking a CPU core. I don't want to restart the process, but want to kill it. The issue occurs when there is extra load on the server. – red-devil Aug 18 '22 at 11:06
On killing process with PID 134425; PID 134435 also disappeared. Though I haven't before seen an "R" process linked to "S" this way. Also, it only shows one CPU core being almost 100% consumed, which I think is consumed by the "S" process. I have seen multiple "S" processes piling up over a period of 24 hours while their state stays stable, and this happens till it reaches a number of "8", that is, all CPUs are consumed. At this point, the server doesn't take any further requests and throws an error like: "too many requests, limit reached". – red-devil Aug 18 '22 at 13:36
... ps -ef snapshot, open it in an editor, and backtrack through the parent pids until you find the common parent. Then figure whether it has a bug (like not processing SIGCHLD), or a bad config (like allowing 20,000 worker threads). – Paul_Pedant Aug 18 '22 at 14:51
... sleep(9999); in the main thread and a while (1) x++; in the other thread shows as state "S" but >90% CPU in top on a CentOS 7 virt. https://github.com/thrig/scripts/blob/master/signal/busythread.c – thrig Aug 18 '22 at 15:25