2

My process is mysteriously getting a SIGKILL on AIX.

Running trace on the system showed that the process sent the signal to itself via the kill(pid, SIGNAL) system call. I checked my code and it has no explicit call that does this. I'm beginning to wonder whether a system call I made, or a call into Oracle, could result in a SIGKILL coming back. Since the process made the call, if the SIGKILL arrives while the call is in progress, it would appear that the process issued the SIGKILL to itself.

What can cause the SIGKILL in this case?

Timo
  • 6,332
James
  • 21
  • Dunno about AIX, but on Linux, out of memory conditions can trigger a SIGKILL. – phemmer Jan 28 '14 at 03:21
  • Thanks for this comment. I couldn't find much on the internet about what else could trigger a SIGKILL. Let me track memory usage internally for my process. I suspected this might be the case, but the system logs didn't show any red flags memory-wise. – James Jan 28 '14 at 17:10
  • linux OOM also causes kernel messages. (I don't know what AIX does these days) – Ricky Jan 28 '14 at 18:58
  • Check errpt - depending on the trigger, you may see some related entry in there. – EightBitTony Sep 01 '14 at 14:43

2 Answers

2

If paging space usage keeps growing, then once a specified threshold is reached AIX will start to kill the most recently spawned processes, which happens quite often. That threshold can be set via vmo -o npskill; see e.g. http://www-01.ibm.com/support/docview.wss?uid=isg3T1012693 :

The formula to determine the default value of npskill is as follows:
npskill = number_of_paging_space_pages/128
The npskill value must be greater than zero and less than the total number of paging space pages on the system. This parameter can be changed by using the vmo -o command.
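To get a feel for how small the default threshold is, here is a rough sketch of the quoted formula in Python. It assumes 4 KB paging space pages and integer division, neither of which the quoted text states. The helper names are made up for illustration.

```python
# Sketch of the quoted default: npskill = number_of_paging_space_pages / 128.
# Assumptions (not from the quoted text): 4 KB pages, integer division.

PAGE_SIZE_BYTES = 4096  # assumed paging space page size


def pages_from_gib(gib: int) -> int:
    """Convert a paging space size in GiB to a page count (4 KB pages assumed)."""
    return gib * 1024**3 // PAGE_SIZE_BYTES


def default_npskill(paging_space_pages: int) -> int:
    """Default npskill according to the quoted formula."""
    if paging_space_pages <= 0:
        raise ValueError("paging_space_pages must be positive")
    return paging_space_pages // 128


if __name__ == "__main__":
    pages = pages_from_gib(4)  # hypothetical system with 4 GiB of paging space
    threshold = default_npskill(pages)
    print(f"{pages} pages -> default npskill = {threshold} pages "
          f"({threshold * PAGE_SIZE_BYTES // 1024**2} MiB)")
```

Under those assumptions, 4 GiB of paging space is 1,048,576 pages, so the default npskill would be 8,192 pages, or about 32 MiB. The actual value on a given system is best read from vmo itself.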

You will see an errpt entry for that, with the label PGSP_KILL, and it will look like this:

C5C09FFA DATE P S SYSVMM            SOFTWARE PROGRAM ABNORMALLY TERMINATED

If you look in the details via errpt -a you will also see the PID, name of the process and date.
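If the problem only shows up once every few days, it may be easier to scan saved errpt summary output than to read it by eye. Below is a minimal sketch that picks out lines whose first field is the identifier shown above (C5C09FFA). The function name and the sample text are hypothetical. "DATE" is the placeholder from the answer, not a real timestamp, and the 00000000 line is an invented non-matching entry.

```python
from typing import List

PGSP_KILL_ID = "C5C09FFA"  # identifier from the errpt line shown above


def pgsp_kill_lines(errpt_summary: str) -> List[str]:
    """Return the summary lines whose first field is the PGSP_KILL identifier."""
    return [
        line
        for line in errpt_summary.splitlines()
        if line.split()[:1] == [PGSP_KILL_ID]
    ]


if __name__ == "__main__":
    # Hypothetical sample; real input would be saved output of the errpt command.
    sample = (
        "IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION\n"
        "C5C09FFA DATE P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED\n"
        "00000000 DATE T C OTHER PLACEHOLDER ENTRY\n"
    )
    for line in pgsp_kill_lines(sample):
        print(line)
```

Any match is a hint to run errpt -a and compare the PID and process name in the details against the process that died.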

doktor5000
  • 2,699
0

Run it through a debugger (gdb?) and set a breakpoint on the kill system call. Then hope a backtrace can tell you what in your code triggered it. (Don't hold out much hope of seeing into any external libraries.)

Ricky
  • 1,377
  • Not quite an ideal thing to do, since it's only kill with SIGKILL that I want to intercept; kill with other signals is OK. The problem also happens only once every so many days. If the code doesn't have a kill with SIGKILL in it, I'm wondering what could cause a kill with SIGKILL to show up in the trace. – James Jan 28 '14 at 17:02
  • debugging is never ideal. – Ricky Jan 28 '14 at 19:00