
I have a process which appears to be hung.

When I try to stop the service, I get a timeout:

service logstash_server stop
timeout: run: logstash_server: (pid 11797) 839061s, want down, got TERM

I have tried running tail -f on the logs, which unfortunately show nothing. I've also tried a kill -15 on the process, but it is still hung. top does not show it as a zombie process.

I'd like to figure out 'why' this process is in this state since it is the 3rd time this has happened in the past month.

I checked the file descriptors and the syslog, but don't see anything noticeable.

file descriptors => http://pastebin.com/90rDHhT4
syslog output => http://pastebin.com/xBaMaL9Z

output of lsof | grep logstash => http://pastebin.com/gsSdPyg5

I tried running an strace on the process, and it just shows FUTEX_WAIT

strace -p 11797
Process 11797 attached
futex(0x7f6d95d8e9d0, FUTEX_WAIT, 11811, NULL

Is there anything else I can do before I issue a kill -9 ?

Update

Opened ticket with developers. Issue continues about once per week.

https://github.com/elastic/logstash/issues/2992

spuder
  • You can use strace to see what the process is doing. However, as I remember, this is a Java application with embedded Ruby, which makes it big enough that strace is better left alone. You can check ls -l /proc/11797/fd/ to see which file descriptors it has open. Maybe you'll find something interesting there. – pawel7318 Apr 13 '15 at 17:22
  • You could check the syslog to see if there are any clues there. If this is a Java process, kill -3 <PID> will write out a thread dump, which may help. – rahul Apr 13 '15 at 17:23
  • A related question is https://unix.stackexchange.com/q/5642/5132 . – JdeBP Jan 02 '20 at 10:16

3 Answers


There are several tools for diagnosing things like this:

  1. lsof. Lists the open files, so you can see whether, e.g., one is on a hung network share, or whether the process is waiting on a TCP connection.
  2. strace. See what syscall it's hanging in, or if it's actually doing something.
  3. Any debug logging options the daemon has. Typically, you'd have to turn these on before it crashes (often when starting it), though.
  4. Software debugging tools (like the thread dumps rahul mentions, gdb, jdb, or whatever else might be relevant). You're into software debugging now, but ultimately that may be required to find out why.

lsof and strace are mainly there to double-check that nothing is broken in your system or config. Beyond that, you really need the help of a software developer.
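
As a rough sketch of those first-pass checks, the shell function below reads a few facts about a pid straight from /proc. The name inspect_pid is made up for illustration, and it assumes Linux with /proc mounted; it is not part of logstash or of any standard tool.

```shell
# Hypothetical helper (name is illustrative): first-pass look at a possibly hung pid.
# Assumes Linux with /proc mounted; run as root or as the owner of the process.
inspect_pid() {
    pid=$1
    if [ ! -d "/proc/$pid" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi
    # Kernel's view of the process: R running, S sleeping, D uninterruptible sleep, Z zombie.
    grep '^State:' "/proc/$pid/status"
    # Thread count; strace -p <pid> without -f only follows the one thread you attach to.
    grep '^Threads:' "/proc/$pid/status"
    # Count of open file descriptors (ls -l /proc/<pid>/fd shows where each one points).
    echo "OpenFDs: $(ls "/proc/$pid/fd" 2>/dev/null | wc -l)"
}
```

For the process in the question that would be inspect_pid 11797. If State is S and Threads is well above 1, running strace -f -p 11797 to follow every thread, or sending kill -3 as rahul suggests, is likely to say more than the single FUTEX_WAIT line does.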

edit: From your updates, you most likely need to report a bug or request assistance from the author(s), unless you have a developer around who can have a look at it.

derobert

A process that can't be killed (even with SIGKILL, even by root) is most likely in process state D ("uninterruptible sleep"). If this condition persists for no obvious reason, then your program has almost certainly triggered a bug in some I/O driver. There's virtually nothing you can do but reboot.
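
Whether the process really is in state D can be checked before rebooting. The sketch below assumes Linux /proc and uses a made-up function name, thread_states. It prints the state and kernel wait channel of every thread, since in a multi-threaded program one thread can be stuck in D while the main thread sits in an ordinary futex wait.

```shell
# Hypothetical helper (name is illustrative): one line per thread, "<tid> <state> <wchan>".
# Assumes Linux with /proc mounted.
thread_states() {
    pid=$1
    for t in /proc/"$pid"/task/*; do
        tid=${t##*/}
        # Second field of the State: line is the one-letter state (R, S, D, Z, T, ...).
        state=$(awk '/^State:/ {print $2}' "$t/status")
        # Kernel function the thread is sleeping in; may be 0 or unreadable on some systems.
        wchan=$(cat "$t/wchan" 2>/dev/null)
        echo "$tid $state ${wchan:-?}"
    done
}
```

For example, thread_states 11797 | grep ' D ' would list only the threads in uninterruptible sleep. A thread shown as D with a wait channel inside a filesystem or driver function points at the kernel side. If every thread is S, the hang is more likely in the application itself.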

countermode

This might be of some assistance - http://man7.org/linux/man-pages/man2/futex.2.html

Agreed, this is likely a bug and should probably go to the developers.