
For a few days in a row, one of our processes could not be killed by "kill -9". For the past 2+ years, kill -9 had killed this process reliably on a daily basis. The process tries to connect to a remote server (which was down late at night) via getaddrinfo and connect once every second.

When I noticed that kill -9 had failed to kill it (a few minutes later), "ps aux" showed the process in state "S". So, strictly speaking, it could have been in state "D" at the moment "kill -9" ran.

So here is my question: is it possible that either the getaddrinfo or the connect call put this process into "D" state? getaddrinfo involves a DNS query, which could take some time.
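Since "ps aux" only gives a single snapshot, one way to check whether the process ever enters "D" is to sample its state repeatedly around the time of the kill. The following is a minimal sketch, assuming a Linux /proc filesystem; the function names `parse_state` and `sample_state`, the sample count, and the interval are illustrative choices, not part of any existing tool.

```python
import sys
import time


def parse_state(stat_line: str) -> str:
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST ')' before reading the state.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def sample_state(pid: int, samples: int = 50, interval: float = 0.02) -> list:
    """Read the process state `samples` times, `interval` seconds apart."""
    states = []
    for _ in range(samples):
        with open(f"/proc/{pid}/stat") as f:
            states.append(parse_state(f.read()))
        time.sleep(interval)
    return states


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python3 sample_state.py 3634009
    print("".join(sample_state(int(sys.argv[1]))))
```

A run that prints only "S" and "R" does not prove the process never enters "D", because a short uninterruptible sleep can fall between two samples; it only makes a long "D" period unlikely.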

Here is the strace output (I replaced the IP address with a bogus one); the following sequence repeats once every second:

$ strace -p 3634009
Process 3634009 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
write(2, "Tue Dec  5 23:01:46 2017 INFO  f"..., 74) = 74
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
socket(PF_NETLINK, SOCK_RAW, 0)         = 6
bind(6, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
getsockname(6, {sa_family=AF_NETLINK, pid=3634009, groups=00000000}, [12]) = 0
sendto(6, "\24\0\0\0\26\0\1\3\252k'Z\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
recvmsg(6, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"0\0\0\0\24\0\2\0\252k'ZYs7\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 108
recvmsg(6, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\252k'ZYs7\0\0\0\0\0\1\0\0\0\10\0\1\0\177\0\0\1"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
close(6)                                = 0
connect(4, {sa_family=AF_INET, sin_port=htons(20458), sin_addr=inet_addr("161.161.161.161")}, 16) = -1 ECONNREFUSED (Connection refused)
write(2, "Tue Dec  5 23:01:46 2017 INFO  ."..., 126) = 126
close(4)                                = 0
write(2, "Tue Dec  5 23:01:46 2017 INFO  f"..., 146) = 146
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7fffe56288c0)       = 0
**** start of next cycle ****
write(2, "Tue Dec  5 23:01:47 2017 INFO  f"..., 74) = 74
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
socket(PF_NETLINK, SOCK_RAW, 0)         = 6
....
  • Killing a ‘D’ process will kill it when it becomes interruptible again, so you won’t see a killed process return to ‘S’. Something else is going on... (See this question.) – Stephen Kitt Dec 05 '17 at 20:25
  • Have a look at /proc/[pid]/syscall and /proc/[pid]/wchan for information on where the process was stuck; see man procfs for details. You can also try to attach gdb and look at the current PC to figure out where it is. I don't know if getaddrinfo/connect can lead to a D state, but I would be somewhat surprised if they could. – dirkt Dec 06 '17 at 09:15
  • @StephenKitt, thanks for your reply. If the link (in your response) is correct that kernel code, i.e. a system call, will "queue" SIGKILL, then I should have seen it in SigQ under /proc/{pid}/status afterwards. What I saw last night (yes, it happened again last night) was SigQ: 0/1033162. – GoodToLearn Dec 06 '17 at 19:43
  • @dirkt, thanks for your reply. There is no gdb on our prod servers :). But I did run strace on the process (after the stop script failed to kill it). Basically, the following sequence of actions repeats once every second. – GoodToLearn Dec 06 '17 at 19:47
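
The checks suggested in the comments can be done without gdb. The sketch below, assuming a Linux /proc filesystem and enough permission to read the target's files, prints /proc/[pid]/wchan and /proc/[pid]/syscall and decodes the pending-signal masks from /proc/[pid]/status. The first number in SigQ counts queued signal entries for the real user ID, so the SigPnd (per-thread) and ShdPnd (process-wide) hex masks may be the more direct place to look for a pending SIGKILL. The helper names are illustrative.

```python
import signal
import sys


def parse_status(text: str) -> dict:
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def signals_in_mask(hex_mask: str) -> list:
    """Decode a hex signal mask: bit (n - 1) set means signal n is in it."""
    mask = int(hex_mask, 16)
    return [n for n in range(1, 65) if mask & (1 << (n - 1))]


def read_proc_file(pid: int, name: str) -> str:
    """Read /proc/<pid>/<name>, returning a marker if it is not readable."""
    try:
        with open(f"/proc/{pid}/{name}") as f:
            return f.read().strip()
    except OSError as exc:
        return f"<unavailable: {exc.strerror}>"


def report(pid: int) -> None:
    fields = parse_status(read_proc_file(pid, "status"))
    print("State: ", fields.get("State"))
    print("SigQ:  ", fields.get("SigQ"))
    for key in ("SigPnd", "ShdPnd"):
        pending = signals_in_mask(fields.get(key, "0"))
        print(f"{key}: {pending} (SIGKILL pending: {int(signal.SIGKILL) in pending})")
    print("wchan: ", read_proc_file(pid, "wchan"))
    print("syscall:", read_proc_file(pid, "syscall"))


if __name__ == "__main__" and len(sys.argv) > 1:
    report(int(sys.argv[1]))
```

Running this right after the failed kill -9, rather than minutes later, gives the best chance of seeing where the process is blocked and whether SIGKILL is still pending.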
