
I have a process that has stopped responding several times now and appears to lock up completely. It doesn't respond to any attempt to attach strace or inspect it with gdb (gdb just hangs in a wait4() syscall). The process is runnable, and is neither waiting on a syscall (/proc/X/syscall: running) nor in uninterruptible sleep (/proc/X/status: State: R (running)).

What state is this process in exactly? Is this possibly a kernel bug of some type?

The process is redis, and this has happened a few times now. The only thing that can kill the process is a reboot, it seems. The OS is CentOS 7.

Edit: Kernel version is 3.10.0-123.13.2.el7.x86_64. Trying an update to 3.10.0-229.11.1.el7 to see if that makes any difference.

Braiam
  • 35,991
alienth
  • 2,197
  • What version of GDB is it using? According to http://stackoverflow.com/questions/8978777/why-would-gdb-hang#comment11407877_8984115 a newer version might work better. – Greg Bray Aug 04 '15 at 01:18
  • Currently it looks like the investigation is more on the kernel side because of the peculiar way it hangs, but if you don't mind, could you add some Redis-specific information, such as what the process is doing when it blocks? I got some info from Nick Craver via Twitter: apparently Redis is loading a large data set when this happens. Is the data set loaded just by restarting the process, or in some other way (for example via DEBUG RELOAD, or by pipelining large amounts of data)? Thanks. –  Aug 05 '15 at 06:27
  • @antirez The data set is being loaded by an rdb copy from another redis instance. The lockups occur after redis starts up and reads in the giant rdb. Notably, it doesn't always lock up during this, just sometimes. – alienth Aug 06 '15 at 00:00
  • That's pretty odd, since if the RDB were corrupted or the like, the lockup should be observable consistently (hopefully...). Have you tried spinning up a different Redis instance on EC2 and loading the same RDB file to see if this is reproducible on a different system? Thanks. –  Aug 06 '15 at 06:17
  • 1
    I have only had this sort of issue when there were I/O errors. Could you please tell us what the dmesg output shows? – Ho1 Aug 07 '15 at 21:14
  • dmesg might also show if "hung task detector" triggers (if that's enabled it's supposed to show when tasks are stuck inside the kernel for too long). – sourcejedi Aug 07 '15 at 21:15
  • hmm I wonder what /proc/X/syscall is supposed to show while a page fault is being serviced (e.g. reading pages of a file through mmap() memory). – sourcejedi Aug 07 '15 at 21:17
  • @Ho1 @sourcejedi dmesg contains nothing related to this task, and nothing out of the ordinary. no 'hung task' detection. – alienth Aug 08 '15 at 00:24
  • 3
    What does /proc/<pid>/stack (and /proc/<pid>/task/*/stack) contain? Has that process got several threads? – Stéphane Chazelas Aug 09 '15 at 20:49

1 Answer


wait4 is a syscall with which a process waits for one of its children to change state, typically to terminate. This may point to an issue with signal handling.

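A bit brutal, but you can try to kill the app's whole process group: kill -15 -$YourRedisPID. The - before the PID makes kill target the process group with that ID, which normally means the process and the children it spawned (this only works if the process is a group leader). As it seems to be waiting for a child to terminate, this may unblock it.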
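
The group kill above can be sketched in Python, assuming Linux and only the standard os and signal modules. The helper name terminate_group_of is hypothetical; it looks up which process group a PID really belongs to before signalling it, and by default it sends nothing.

```python
import os
import signal

def terminate_group_of(pid, sig=signal.SIGTERM, dry_run=True):
    """Signal the process group that `pid` belongs to and return its ID.

    Hypothetical helper for illustration only.
    With dry_run=True (the default) no signal is sent.
    """
    pgid = os.getpgid(pid)      # the group the process actually belongs to
    if not dry_run:
        os.killpg(pgid, sig)    # same effect as: kill -<sig> -- -<pgid>
    return pgid

if __name__ == "__main__":
    me = os.getpid()
    print("pid", me, "is in process group", terminate_group_of(me))
```

If the returned group ID differs from the PID, the process is not a group leader, and a negative PID on the kill command line would address a different group or none at all.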

If that doesn't work, let's dig deeper: check the process's signal status with grep ^Sig /proc/$YourRedisPID/status

You'll see something like:

SigQ:   8/62777
SigPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000080
SigCgt: 0000000180004023

As defined in "fs/proc/array.c" of the kernel source, "SigQ" is the number of currently queued signals (counted for the real user ID of the process) / the limit on queued signals.

If the number of queued signals is very high, it may indicate that your "SIGKILL" is not being handled at all. I'm still reading the "kernel/signal.c" file to understand how these special signals are managed.
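
To compare the two SigQ numbers without reading them by eye, here is a minimal Python sketch. It parses the sample line shown above; in real use the text would come from /proc/$YourRedisPID/status. The function name parse_sigq is made up for this example.

```python
def parse_sigq(status_text):
    """Return (queued, limit) from the SigQ line of /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("SigQ:"):
            queued, limit = line.split(":", 1)[1].strip().split("/")
            return int(queued), int(limit)
    raise ValueError("no SigQ line found")

# Sample taken from the output quoted in this answer, not from the hung process.
sample = "SigQ:   8/62777\nSigPnd: 0000000000000000\n"
queued, limit = parse_sigq(sample)
print(f"{queued} of {limit} queued-signal slots in use")
```

A queued count close to the limit would be the case worth investigating; the sample here is nowhere near it.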

For a more readable view of the signal masks, try this one-liner, which prints each mask in binary: awk 'BEGIN{print "ibase=16;obase=2;"} /^Sig...:/{ print toupper($2)}' /proc/$YourRedisPID/status | BC_LINE_LENGTH=0 bc

For me, this outputs:

0
0
10000000
110000000000000000100000000100011
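
The binary strings can also be turned into signal numbers directly. This is a minimal Python sketch, assuming the Linux convention that bit n-1 of a mask (counting from the least significant bit) stands for signal n. The helper names are hypothetical, and the names printed depend on the platform's signal numbering.

```python
import signal

def decode_sigmask(hex_mask):
    """Return the signal numbers set in a hex mask from /proc/<pid>/status.

    Bit n-1, counting from the least significant bit, stands for signal n.
    """
    value = int(hex_mask, 16)
    return [bit + 1 for bit in range(value.bit_length()) if value >> bit & 1]

def signal_name(number):
    """Best-effort name; some signal numbers have no entry in signal.Signals."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"signal {number}"

# Masks taken from the sample output in this answer.
for field, mask in [("SigIgn", "0000000000000080"), ("SigCgt", "0000000180004023")]:
    print(field, [signal_name(n) for n in decode_sigmask(mask)])
```

Run against the hung process's own masks, this shows at a glance which signals are blocked, ignored, or caught.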

Start by sending us this output, and I'll update the post as required.

Adrien M.
  • 3,566
  • The process isn't in wait4(), gdb is hung on a wait4() when trying to access the process. The process itself isn't in any syscall. Also, the hung process has no children. Unfortunately I had to reboot the box. I'll gather the data you requested once the issue reoccurs. – alienth Aug 13 '15 at 17:15
  • Output here: https://gist.githubusercontent.com/alienth/23685ad2ea46a7eade56/raw/d461d608df5eb1efb81d136abab76fd09767e986/gistfile1.txt

    Once again, proc is ignoring SIGKILL. It isn't in a syscall. Proc also ignores SIGTERM.

    – alienth Aug 24 '15 at 20:35