I have a task running that blocks pm-hibernate (on Linux 4.0.7-2). When I run pm-hibernate, it fails with the error message "Freezing of tasks failed after 20.002 seconds (1 tasks refusing to freeze, wq_busy=0):" followed by the offending task.

The process is a dead one that was killed hours earlier. Why can root not just remove it from the kernel? It feels like being on Windows!

I have seen related questions such as "How to kill a process which can't be killed without rebooting?", but none of them seems to have a satisfactory answer.

Some info (31207 is the pid):

# cat /proc/31207/syscall
11 0x7fe482a47000 0x25fce 0x7fe481d4eb78 0x1 0x7fe482a6e700 0x25f2d30 0x7ffca8d8c278 0x7fe481a95ae7
# ps -l -p 31207
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 D  1001 31207     1  0  80   0 -  5035 lock_e pts/9    00:00:00 a.out
# ps -lnp 31207
F S   UID   PID  PPID  C PRI  NI ADDR SZ  WCHAN TTY        TIME CMD
0 D  1001 31207     1  0  80   0 -  5035 ffffff pts/9      0:00 /tmp/a.out
# ps opid,wchan:42,cmd -p 31207
 PID WCHAN                                      CMD
31207 lock_extent_bits                           /tmp/a.out
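
For readers decoding the /proc/31207/syscall line above, here is a minimal Python sketch. It relies on the field layout documented in proc(5): syscall number, six argument registers, stack pointer, program counter. The function name parse_syscall_line and the small name table are illustrative, and the table assumes an x86_64 kernel, which the question does not state explicitly.

```python
# Sketch: decode one line of /proc/<pid>/syscall (layout per proc(5)).
# ASSUMPTION: x86_64 syscall numbering; the table below is deliberately tiny.
X86_64_NAMES = {9: "mmap", 10: "mprotect", 11: "munmap"}


def parse_syscall_line(line):
    """Return a dict describing what the task is blocked in."""
    fields = line.split()
    if fields == ["running"]:
        # The task is not blocked at all.
        return {"state": "running"}
    number = int(fields[0])
    if number == -1:
        # Blocked, but not inside a system call: only SP and PC follow.
        return {"number": -1, "args": [],
                "sp": int(fields[1], 16), "pc": int(fields[2], 16)}
    return {
        "number": number,
        "name": X86_64_NAMES.get(number, "unknown"),
        "args": [int(f, 16) for f in fields[1:7]],  # six argument registers
        "sp": int(fields[7], 16),                    # stack pointer
        "pc": int(fields[8], 16),                    # program counter
    }


line = ("11 0x7fe482a47000 0x25fce 0x7fe481d4eb78 0x1 "
        "0x7fe482a6e700 0x25f2d30 0x7ffca8d8c278 0x7fe481a95ae7")
info = parse_syscall_line(line)
print(info["number"], info["name"], [hex(a) for a in info["args"][:2]])
```

Under the x86_64 assumption, syscall 11 is munmap, so the first two arguments would be the address and length of the mapping being released.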

So why can I not just stop it? Suspending it would suffice!

I am using no network FS and the task was a simple one accessing the network. If you can read this, the network is still up.

Ned64
  • How about adding the output of ps -l -p 31207 to see what the "WCHAN" column holds? Beyond that, the question you reference has the answer: if you can't kill a process, it's waiting on a network filesystem, there's a kernel bug, or there's a hardware problem. –  Jul 09 '15 at 21:25
  • So, it's lock_e. What does that tell us? There is no network filesystem, no mounts from remote servers, nothing. So - a kernel bug? – Ned64 Jul 09 '15 at 21:54
  • PS: It's actually the function lock_extent_bits which points to a possible (!) cause: https://bugzilla.kernel.org/show_bug.cgi?id=76421 – Ned64 Jul 09 '15 at 22:25
  • See whether you can catch it with kill -SEGV pid. – ott-- Sep 12 '15 at 21:27

1 Answer

Processes in state D (uninterruptible sleep) cannot be killed while they are in this state: a signal sent to them stays pending until the kernel operation they are waiting on completes. NFS was notorious for this, but there are other ways to get a process stuck. A broken device driver that does not return control to the calling process can also cause this behavior. One would need to reset the driver, but in general there is no way to do this. I hate to say it, but there is nothing but a reboot to get out of this.
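
To see which tasks are currently in state D and what they are waiting on, something like the following Python sketch could be used. It reads /proc/<pid>/stat and /proc/<pid>/wchan, both described in proc(5); the function names parse_stat and uninterruptible_tasks are illustrative, and on some kernels or without sufficient privileges the wchan file may only contain 0.

```python
# Sketch: list tasks in uninterruptible sleep (state D) on a Linux system.
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting on spaces.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    comm = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return a list of (pid, comm, wchan) for every task in state D."""
    found = []
    if not os.path.isdir(proc_root):
        return found  # not a Linux system, or /proc is not mounted
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat"), errors="replace") as f:
                pid, comm, state = parse_stat(f.read())
            with open(os.path.join(base, "wchan"), errors="replace") as f:
                wchan = f.read().strip()
        except (OSError, ValueError):
            continue  # the process exited meanwhile, or is not readable
        if state == "D":
            found.append((pid, comm, wchan))
    return found


if __name__ == "__main__":
    for pid, comm, wchan in uninterruptible_tasks():
        print(pid, comm, wchan)
```

For the process in the question, this would be expected to report pid 31207, a.out and the wait channel lock_extent_bits, matching the ps output shown there.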

countermode
  • There is nothing but reboot or figuring out what's bothering the kernel and unblocking it. Of course the latter isn't always possible. – Gilles 'SO- stop being evil' Jul 09 '15 at 23:18
  • OK, in most cases there is nothing but reboot. – countermode Jul 09 '15 at 23:37
  • While experimenting with commands similar to find . -path '*some_directory/*some_string*' -exec realpath {} +, I twice ended up with a system that was not frozen, yet unusable, gradually consuming all memory. No kill -9 was possible. Eventually, I killed the terminal from which this command was run, but the process still kept running. I was forced to reboot. What is special or wrong about the above command combination (find with -path + -exec realpath)? – Nikos Alexandris Mar 28 '19 at 05:55
  • So, the first time, killing the terminal from which this command was run did not kill the find process too! Forced to reboot. The second time led to the same extreme slowness. When I managed to log off at some point, I recall reading messages about failing to establish a WWAN connection. Is it potentially a case of this answer? – Nikos Alexandris Mar 28 '19 at 06:05
  • @NikosAlexandris Unfortunately, it is impossible to diagnose this remotely unless you are willing to give remote access to the system. There could be many causes; it has something to do with some peculiarity of your system. – countermode Mar 28 '19 at 10:20
  • @countermode So, the combination of find and -path and -exec realpath is a valid one, then? – Nikos Alexandris Mar 28 '19 at 15:07