Below is my code snippet.

idql -n $REPOSITORY_NAME.$cs -Udmadmin -P"" -R$DM_SCRIPTS/test.api > /dev/null 2>&1
    if [ $? != 0 ]; then
      echo "   \c"
      echo "ERROR: Cannot connect to: $REPOSITORY_NAME.$cs on $HOST"
    else
      echo "   Successfully connected to: $REPOSITORY_NAME.$cs"
    fi

This is from the main logic that we use for monitoring our service. However, the service often hangs, and when it does, the idql command on the first line of the snippet hangs too, so the script never proceeds past it. As a result, we cannot detect this 'service hung' condition.

Most importantly, we have to retain the checks for the existing conditions (the if-else statements above) and additionally check for the 'hung' state. If the idql command takes more than 5 seconds, we can assume that it is hung.

Rui F Ribeiro
Vishnu

3 Answers

I think you want the timeout command, which is part of coreutils and should be available on your system.

To kill the command after 5 seconds, change to:

timeout 5 idql -n $REPOSITORY_NAME.$cs ...
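
To keep the existing if/else checks, the exit status of timeout can be inspected: GNU timeout exits with status 124 when the command times out, and otherwise passes through the command's own status. A minimal sketch, assuming GNU timeout; check_service is a hypothetical helper name, and the three labels it prints are illustrative only:

```shell
# check_service is a hypothetical helper: it runs any command under a
# time limit and reports which of the three cases occurred.
check_service() {
    limit=$1
    shift
    timeout "$limit" "$@" > /dev/null 2>&1
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "HUNG"     # GNU timeout returns 124 when the limit is hit
    elif [ "$status" -ne 0 ]; then
        echo "ERROR"    # the command itself failed
    else
        echo "OK"
    fi
}
```

For the command in the question, the call would look like: check_service 5 idql -n "$REPOSITORY_NAME.$cs" -Udmadmin -P"" -R"$DM_SCRIPTS/test.api"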

If you don't have coreutils, you can download, build and install it from here: http://www.gnu.org/software/coreutils/

See also: https://stackoverflow.com/questions/687948/timeout-a-command-in-bash-without-unnecessary-delay

ckhan

I was able to modify the solution in http://h30499.www3.hp.com/t5/System-Administration/Capturing-hung-command-in-a-script/td-p/5662103 to match my requirement.

I tested it and it works perfectly for me. I appreciate all your help.

#!/bin/ksh

WAITTIME=5

# run the idql command in the background, discarding any output
idql -n $REPOSITORY_NAME -Udmadmin -P"" -R"$DM_SCRIPTS/test.api" >/dev/null 2>&1 &
IDQL_PID=$!

# set up a timeout that will kill the idql command once
# $WAITTIME seconds have passed, unless it has completed before that.
(sleep $WAITTIME; kill $IDQL_PID 2>/dev/null) &
TIMEOUT_PID=$!

# wait for the idql command to either complete or get killed; read its return status
wait $IDQL_PID
RESULT=$?

# if the timeout is still running, stop it (ignore any errors)
kill $TIMEOUT_PID 2>/dev/null

# read the return status of the timeout process (we don't need it 
# but running the wait function prevents it from remaining as a 
# zombie process)
wait $TIMEOUT_PID

if [ "$RESULT" -eq 1 ]; then
    echo "something is wrong with $REPOSITORY_NAME, It seems to be down. Result - $RESULT"
# 143 means idql was killed by the timeout's kill (SIGTERM)
elif [ "$RESULT" -eq 143 ]; then
    echo "Attention!!! ***$REPOSITORY_NAME seems to be HUNG*** Result - $RESULT"
else
    echo "$REPOSITORY_NAME seems to be OK. Result - $RESULT"
fi
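
The value 143 tested above is 128 plus 15, the number of SIGTERM, which is the signal kill sends by default. A small sketch to confirm this on a given system, using sleep as a stand-in for a hung idql. Shells such as bash and dash report 128 plus the signal number; other shells may use a different convention, so it is worth checking under the shell that actually runs the script:

```shell
# Start a long-running stand-in for the hung command.
sleep 30 &
pid=$!

# Terminate it with the default signal (SIGTERM, number 15).
kill "$pid"

# wait returns the status of the killed job.
wait "$pid"
term_status=$?
echo "status after SIGTERM: $term_status"
```
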
Vishnu
  • This mostly works, but note that there's a small race condition, because wait in the shell reaps all terminated children, not just the one passed as argument. So if both jobs (idql and sleep) terminate around the same time, it's possible that wait $IDQL_PID reaps both. If that happens, when kill $TIMEOUT_PID is executed, the process ID is likely not to exist (which only causes a harmless error message from kill), but it is possible that the process ID has been reassigned (in which case you'll end up killing some random unrelated process). – Gilles 'SO- stop being evil' Sep 10 '12 at 01:24

If idql uses CPU time in a loop when it gets hung, you can put a limit on its total CPU time:

( ulimit -t 5;
  idql -n $REPOSITORY_NAME.$cs -Udmadmin -P"" -R$DM_SCRIPTS/test.api > /dev/null 2>&1 )
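
To keep the existing if/else checks with this approach, the status of the subshell can be tested: a command that exceeds its CPU limit is killed by a signal, and shells such as bash and dash then report a status above 128. A minimal sketch, where run_with_cpu_limit is a hypothetical helper and the busy loop stands in for an idql that spins:

```shell
# run_with_cpu_limit is a hypothetical helper: first argument is the CPU
# time limit in seconds, the rest is the command to run.
run_with_cpu_limit() {
    cpu_seconds=$1
    shift
    # The limit is set inside a subshell so it does not affect the caller.
    ( ulimit -t "$cpu_seconds"; "$@" > /dev/null 2>&1 )
}

# Stand-in for a command that burns CPU forever.
run_with_cpu_limit 1 sh -c 'while :; do :; done'
cpu_status=$?
if [ "$cpu_status" -gt 128 ]; then
    echo "killed by a signal (status $cpu_status)"
fi
```

The exact status depends on which signal the system delivers when the limit is reached, so the sketch only tests for a value above 128.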

If idql blocks for some other reason (e.g. a deadlock), the timeout has to be on wall-clock time instead. Here's a solution due to Stéphane Gimenez, lightly adapted to obtain the exit status of the idql command.

ret=$(sh -ic '{ { idql -n "$REPOSITORY_NAME.$cs" -Udmadmin -P"" -R"$DM_SCRIPTS/test.api" > /dev/null 2>&1;
                  echo $? >&3;
                  kill 0; } |
                { sleep 5; kill 0; } }' </dev/null 3>&1 2>/dev/null)
if [ -z "$ret" ]; then
  echo "timed out"
elif [ "$ret" -ne 0 ]; then
  echo "error $ret"
else
  echo "ok"
fi
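
The file descriptor 3 redirection in the command above can be tried in isolation. In this sketch, false stands in for idql and cat stands in for the right-hand side of the pipeline; there is no timeout and no kill 0, only the status capture:

```shell
# The exit status is written to fd 3, which bypasses the pipe into cat.
# The outer 3>&1 points fd 3 at the command substitution's output,
# so the status ends up in $ret.
ret=$( { { false; echo $? >&3; } | cat > /dev/null; } 3>&1 )
echo "captured status: $ret"
```

Because false exits with status 1, this prints "captured status: 1", even though nothing the left-hand command wrote to its standard output would reach the command substitution.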

Explanation:

  • Start an interactive shell (sh -i). Since this shell is interactive, it is in its own process group.
  • The subshell runs two commands piped together. This allows both commands to be executed in parallel inside the same process group.
  • Both commands end with kill 0, which kills every process inside the process group. Whichever command ends first (idql or sleep) will thus kill the other one.
  • Print the return status of idql to file descriptor 3, so that it doesn't go through the pipe. File descriptor 3 is redirected to file descriptor 1 in the outer shell, so that the output on that fd is captured by the command substitution.
  • Redirect standard error of the interactive shell to /dev/null so as not to display any “Terminated” message from the inner shell. If you wanted to see the error output from idql, you would need to redirect it (idql 2>&4 instead of idql 2>/dev/null, and add 4>&2 before the 2>/dev/null of sh -i).
  • Redirect standard input of the interactive shell from /dev/null so that it doesn't end up reading commands from the terminal if you press Ctrl+C.
  • I am just getting the below output. Terminated Terminated – Vishnu Sep 08 '12 at 05:36
  • The server hangs because of hung threads in the Oracle DB, so I am following the latter option. The output I am getting is "Terminated Terminated". I read Stéphane's explanation too, but didn't quite understand the logic (mainly the usage of |). Can you please explain for a beginner? :) – Vishnu Sep 08 '12 at 05:49
  • @Vishnu I made a mistake in my original answer (same problem Stéphane had initially, in fact). The point of the pipe is to run idql and sleep in the same process group: a pipeline is the only way to do that, so we have to use a pipeline even though we aren't trying to pipe idql's output into sleep. – Gilles 'SO- stop being evil' Sep 08 '12 at 11:35
  • I tried running your modified script. This time I am getting 'Terminated' only once. Still, the script as a whole is sleeping and getting terminated. Meanwhile, please check out another solution that I have posted on this thread. – Vishnu Sep 09 '12 at 11:44
  • @Vishnu The solution in your answer should work, but I've gone and fixed the multiple typos in mine and added explanations. (It's complicated, but it works and it's reliable if you get it right.) – Gilles 'SO- stop being evil' Sep 10 '12 at 01:17