For a few days in a row, one of our processes could not be killed with "kill -9". For the past 2+ years, kill -9 had reliably killed this process on a daily basis. The process had been trying to connect to a remote server (which was down late at night) via getaddrinfo and connect once every second.
When I checked a few minutes after kill -9 failed, "ps aux" showed the process in state "S". I did not see its state at the moment "kill -9" ran, so strictly speaking it could have been in "D" state then.
So here is my question: is it possible that either the getaddrinfo or the connect call put this process into "D" state? getaddrinfo involves a DNS query, which could take some time.
Here is the strace output (I replaced the IP address with a bogus one); the following sequence repeats once every second:
$ strace -p 3634009
Process 3634009 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
write(2, "Tue Dec 5 23:01:46 2017 INFO f"..., 74) = 74
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
socket(PF_NETLINK, SOCK_RAW, 0) = 6
bind(6, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
getsockname(6, {sa_family=AF_NETLINK, pid=3634009, groups=00000000}, [12]) = 0
sendto(6, "\24\0\0\0\26\0\1\3\252k'Z\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
recvmsg(6, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"0\0\0\0\24\0\2\0\252k'ZYs7\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 108
recvmsg(6, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\252k'ZYs7\0\0\0\0\0\1\0\0\0\10\0\1\0\177\0\0\1"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
close(6) = 0
connect(4, {sa_family=AF_INET, sin_port=htons(20458), sin_addr=inet_addr("161.161.161.161")}, 16) = -1 ECONNREFUSED (Connection refused)
write(2, "Tue Dec 5 23:01:46 2017 INFO ."..., 126) = 126
close(4) = 0
write(2, "Tue Dec 5 23:01:46 2017 INFO f"..., 146) = 146
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7fffe56288c0) = 0
**** start of next cycle ****
write(2, "Tue Dec 5 23:01:47 2017 INFO f"..., 74) = 74
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
socket(PF_NETLINK, SOCK_RAW, 0) = 6
....
Check /proc/[pid]/syscall and /proc/[pid]/wchan for information on where the process was stuck; see man procfs for details. You can also try to attach gdb and have a look at the current PC to figure out where it is. I don't know if getaddrinfo/connect can lead to a D state, but I would be somewhat surprised if they could. – dirkt Dec 06 '17 at 09:15
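
To capture the state at the moment the kill is attempted, rather than a few minutes later, the /proc files mentioned in the comment can be read from a small script. The following is only a sketch, assuming a Linux /proc layout; the helper names `parse_stat_state` and `snapshot` are made up for illustration, and reading /proc/[pid]/syscall may be refused without sufficient privileges.

```python
from pathlib import Path


def parse_stat_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The second field (comm) is wrapped in parentheses and may itself contain
    # spaces or ')', so split only what follows the LAST ')'.
    after_comm = stat_line[stat_line.rindex(")") + 1:].split()
    return after_comm[0]  # e.g. "S" (sleeping), "D" (uninterruptible), "R"


def snapshot(pid: int) -> dict:
    """Collect state, wchan and syscall for a pid (hypothetical helper)."""
    base = Path("/proc") / str(pid)
    info = {"state": parse_stat_state((base / "stat").read_text())}
    for name in ("wchan", "syscall"):
        try:
            info[name] = (base / name).read_text().strip()
        except OSError as exc:
            # Typically a permission error for another user's process.
            info[name] = f"unavailable ({exc.strerror})"
    return info
```

Calling `snapshot(3634009)` immediately before and after sending kill -9 would record whether the process was in "D" at that moment and, through wchan, which kernel function it was waiting in.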