
From one of my scripts, I called the find command as a normal user (not root).
It never returned, so I killed the script, but find is still running.
In htop I see it constantly using 100% of one core (the machine has 4 cores).
The core at 100% changes from time to time, by the way.
htop shows its state as 'R' (running), and that does not change after any of the kill signals below.

I have tried SIGKILL, SIGSTOP, SIGTERM (also by number, 15), SIGABRT and SIGHUP; none of them works.
Using sudo makes no difference.

I also tried every signal that kill -l lists:

pid=2315444  # the stuck find process
# collect every signal number printed by `kill -l`
sigs=($(kill -l | grep -oE '[0-9]+\)' | tr -d ')'))
for sig in "${sigs[@]}"; do
  echo "======== $sig"
  kill "-$sig" "$pid"
  ps -o pid,stat,status,state,pcpu,cmd -p "$pid"
  sleep 1
done

but after each signal the result is always the same:

PID STAT STATUS S %CPU CMD
2315444 RN        - R 99.5 find
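One more thing that can be checked without ptrace is whether the signals are at least being queued for the process. The sketch below is only a rough aid: `show_signal_state` is a helper name made up here, and it just prints the state and signal bitmask lines from /proc/PID/status. As far as I understand proc(5), a SIGKILL sent with kill that is still pending would show up as bit 9 (hex 100) in ShdPnd.

```shell
# show_signal_state PID (hypothetical helper name):
# print the State line and the signal bitmasks from /proc/PID/status
show_signal_state() {
  grep -E '^(State|SigQ|SigPnd|ShdPnd|SigBlk|SigIgn|SigCgt):' "/proc/$1/status"
}

# $$ (this shell) is used so the example runs anywhere;
# replace it with the stuck PID, e.g. 2315444
show_signal_state $$
```

If ShdPnd keeps showing the signal while the state stays R, the process is probably looping inside the kernel and never gets back to the point where pending signals are delivered.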

AppArmor is running, but find is not listed in it (I checked), and stopping AppArmor did not help either. SELinux is not running, and I have not yet found a way to check for other LSMs here.

Thinking about this, I tried to forcefully unmount the partition find was running on (which would cause no problem); after doing so, find was still running.

What else can I try, other than rebooting?
There is nothing special in dmesg either. Could it be a hardware failure, or a kernel bug?

I think it could have happened to any other process, but I am not sure. Maybe it is related to processes that do hard drive I/O?

OS: Ubuntu 16.04

Jeff Schaller
  • Have you tried kill -9? – jesse_b Aug 17 '17 at 19:31
  • @Jesse_b that's -SIGKILL in kill -l :) – Aquarius Power Aug 17 '17 at 19:33
  • Is it running in its own PGID or is it still under the PGID of your script? If the latter, have you tried to kill -9 the whole process group? – jesse_b Aug 17 '17 at 19:37
  • Can you strace it to maybe see what it's doing? – thrig Aug 17 '17 at 19:41
  • Interesting: I ended the top script, but a sub-script is still active as its parent, so its parent is not PID 1. I will try killing the parent PID to see what happens. I don't know what a PGID is yet; should I try that first, as a test? – Aquarius Power Aug 17 '17 at 19:41
  • This is a bit over my head but in my limited experience the PGID should be the same as the PID of the parent process. You should be able to modify your ps command to: ps -o pgid,pid,stat,status,state,pcpu,cmd – jesse_b Aug 17 '17 at 19:42
  • @thrig strace printed nothing after saying it was attached. After a long wait for output, I had to sudo SIGKILL the strace process, as Ctrl+C could not stop it. – Aquarius Power Aug 17 '17 at 19:45
  • @Jesse_b https://stackoverflow.com/a/41498519/1422630, killing the PPID did not work, it is just reparented now; I am not sure if I can kill a PGID yet tho, or how to do it and if it will affect other applications? researching.. – Aquarius Power Aug 17 '17 at 19:51
  • You can kill a pgid with the kill command the same you would a pid. Not entirely sure about the rest though, but this may help? https://unix.stackexchange.com/questions/385105/can-kill-in-bash-send-a-signal-only-to-a-single-process-whose-process-group-ha/385108#385108 – jesse_b Aug 17 '17 at 19:52
  • @Jesse_b it seems PGID is not accessible as a PID thru kill, so I guess we need some other command to deal with PGID(s), I got a bit lost on that answer btw, still trying to understand it :), I will try to keep my machine as long as possible running b4 rebooting, as I wont be able to try anything else after it; My guess is it can be a kernel bug? – Aquarius Power Aug 17 '17 at 19:55
  • @Jesse_b no I cant use kill with it: bash: kill: (2309051) - No such process – Aquarius Power Aug 17 '17 at 19:58
  • That is very strange. I assume that the signal does not reach the process. Is AppArmor, SELinux or another LSM active? You may run kill through strace and have a look at the output of dmesg. – Hauke Laging Aug 17 '17 at 19:58
  • @HaukeLaging if I have to active these (AppArmor, SELinux, LSM) on ubuntu 16.04, then I guess they are not cause I dont remember activating them in the past. But I guess I should make it sure in some way, will research about them. Kill thru strace (like I could use it as a kind of shell?) I will research about it. And indeed, the signal not reaching the process seems a great guess, will look more on it too thx! – Aquarius Power Aug 17 '17 at 20:03
  • @HaukeLaging AppArmor is running, but find is not listed in it. SELinux is not running. I have found no way yet to determine whether another LSM is active. I tried SIGSTOP on the strace PID, but it didn't affect the find process; I am not sure that is what you meant, since killing the strace process only kills strace. Nothing changes in dmesg either, other than: ptrace of pid 2315444 was attempted by: strace (pid 153868) – Aquarius Power Aug 17 '17 at 20:21
  • Then it should be very simple: Turn off AppArmor and see if the problem disappears. There need not be a configuration for find anyway: AppArmor permissions and restrictions are inherited. – Hauke Laging Aug 17 '17 at 20:47
  • @HaukeLaging I stopped AppArmor and tried the STOP, KILL and ABRT signals on the find PID, but none worked. I tried to cat many files under /proc/2315444 (syscall says "running"), even with sudo, but the most I get in dmesg is lines like: ptrace of pid 2315444 was attempted by: cat (pid 446803) – Aquarius Power Aug 17 '17 at 21:06
  • Is this an epidemic? If many people are having the same problem all of a sudden, maybe a recent kernel upgrade is buggy? (Unlikely though to have the same bug on RHEL and Ubuntu as they use different kernel versions.) – Gilles 'SO- stop being evil' Aug 17 '17 at 23:25
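On the PGID discussion in the comments: kill treats a plain positive number as a PID, which would explain the "No such process" error once the group leader has exited. To signal a whole process group, the PGID has to be passed as a negative number after `--`. A minimal sketch follows; `kill_group` is a name made up for this example, and it refuses to act on the calling shell's own group so that nothing unrelated is hit.

```shell
# kill_group PID [SIGNAL] (hypothetical helper):
# send SIGNAL (default KILL) to every process in PID's process group
kill_group() {
  local pid=$1 sig=${2:-KILL} pgid own
  pgid=$(ps -o pgid= -p "$pid" | tr -d ' ')
  own=$(ps -o pgid= -p "$$" | tr -d ' ')
  [ -n "$pgid" ] || { echo "no such process: $pid" >&2; return 1; }
  # safety guard: never signal the group this shell itself belongs to
  [ "$pgid" != "$own" ] || { echo "refusing to signal own group $pgid" >&2; return 1; }
  # a negative argument after -- addresses the whole group
  kill -s "$sig" -- "-$pgid"
}
```

Usage would be `kill_group 2315444`. This only changes how the signal is addressed; if SIGKILL sent directly to the PID has no effect, there is no particular reason to expect the group variant to behave differently.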

1 Answer


I was able to avoid rebooting by using the commands below:

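sudo cgcreate -g cpu:/cpulimited             # create a cgroup under the cpu controller
sudo cgclassify -g cpu:cpulimited 2315444    # move the `find` PID into it
cd /sys/fs/cgroup/cpu/cpulimited
echo 1000000 | sudo tee cpu.cfs_period_us    # accounting period: 1 s
echo 1000 | sudo tee cpu.cfs_quota_us        # CPU time allowed per period: 1 ms; cannot be less than 1000, as I tested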

Read the full explanation of cpu.cfs_quota_us here (found via this tip).
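To make the two numbers concrete: the quota divided by the period is the share of one CPU that the group may use. The small sketch below only does that arithmetic; `cpu_share_percent` is a helper name invented for this example.

```shell
# cpu_share_percent QUOTA_US PERIOD_US (hypothetical helper):
# print the share of one CPU, in percent, that quota/period allows
cpu_share_percent() {
  LC_ALL=C awk -v q="$1" -v p="$2" 'BEGIN { printf "%.3f\n", 100 * q / p }'
}

cpu_share_percent 1000 1000000   # prints 0.100
```

So with the values above the group gets about 0.1% of one core. On a cgroup v1 setup like this one, the cpu.stat file in the same directory should also show nr_throttled growing while the limit is being enforced.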

The cgroup magic works even on such an unkillable process!

Although ps still shows pcpu as 98% (ps reports CPU time averaged over the whole lifetime of the process, so that figure drops only slowly), all other system monitors, such as htop, top and the "System Monitor" application, show the process using almost no CPU.
So the machine runs smoothly again; that single process, always at 100%, had been making it stall for about a second at intermittent intervals.

An answer describing other ways than kill to end such a process would still be better, though.

Thank you all for the tips!