
I wrote a cron job, which uses ssh to run a script on a server. I just tried running the script, and now I am unhappy.

client# ssh server.local /usr/local/bin/script
client#

server# /usr/local/bin/script
Segmentation fault (core dumped)
server#

client# ssh server.local /usr/local/bin/script
client# echo $?
255

I can confirm the crash is in the script interpreter, /bin/sh (a symlink to /bin/dash). For example, when I ran script & on the server, the shell told me the background job had PID 30860, and that is the next PID that showed up in the list of crashes in coredumpctl. I will need to solve the crash, but this question is only about how to detect such crashes.

cron supports error reporting by "sending mail" when a job prints any message. But it does not send mail on a non-zero exit status, so my current cron job would not mail me about this error. (And if it did, I would really like a more helpful pointer for troubleshooting than "Exited with code 255".)

cron is relying on the Unix convention, that "no news is good news". But that convention is being broken here.

I interpret this as a limitation of SSH. If I wanted to always notice segmentation faults in remote commands, what rule could I follow, to work around this SSH limitation?

(I'm also interested if there's a "good reason" for this limitation. I think I know more-or-less how it could happen, at the implementation level).

sourcejedi
  • Hmmm... What is it that prints the Segmentation fault string to the terminal? Is it the shell that runs the command, or is it printed by the thing that has the fault? – Kusalananda Aug 08 '18 at 11:23
  • @Kusalananda edited. I have confirmed a segfault occurs in the binary interpreter of script. Then "Segmentation fault" is almost certainly printed by the main shell. – sourcejedi Aug 08 '18 at 12:05

1 Answer

% cat segfault.c
#include <stdio.h>
int main()
{
    char *s = "hello world";
    *s = 'H';
    printf("%s\n", s);
}
% CFLAGS=-g make segfault
gcc -g    segfault.c   -o segfault

The error message comes from whichever process performs the waitpid call on the crashed program, typically the shell:

% ./segfault
zsh: bus error  ./segfault

Here zsh has returned from waitpid and followed a code path that checks WIFSIGNALED. (macOS reports a bus error where Linux would report a segmentation fault, a rose by any other name, and the exact wording of the message varies by shell.)

OpenSSH portable (as of commit ed7bd5d93fe14c7bd90febd29b858ea985d14d45) does make various WIFSIGNALED(status) calls, notably in misc.c, session.c, and sshd.c. Some of these return -1, which could easily turn into the observed 255 exit status, though I'd need to add debugging or trace sshd to see exactly how this case happens (ssh -v -v -v does not help, nor do the default sshd logs).
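
The step from -1 to 255 is the usual truncation of an exit status to its low 8 bits. A minimal illustration in POSIX sh arithmetic follows; it shows only the truncation and is not taken from the OpenSSH code, and as_exit_status is a hypothetical helper name:

```shell
# as_exit_status (hypothetical helper): print the value a parent would see
# in the exit status if a process called exit() with the given integer.
as_exit_status() {
    echo $(( $1 & 255 ))
}

as_exit_status -1    # prints 255
as_exit_status 256   # prints 0
```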

As a kluge, you can force a waitpid to happen. This requires tricking the shell into not performing a simple exec to replace itself, which it will do as an optimization if it can:

% ssh localhost 'sh -c ./segfault'
% ssh localhost ':; ./segfault'
% ssh localhost 'sh -c ":; ./segfault"'
sh: line 1:  9068 Bus error: 10           ./segfault
% 

:; ... is complex enough that sh cannot simply exec the command in its own place. Instead it forks, ends up in waitpid on segfault, and then reports the signal. Note that error reporting will vary depending on the exact flavor of sh.

% ssh localhost '/usr/local/bin/sh -c ":; ./segfault"'
Bus Error

Now, if sh itself is causing the segfault (or whatever unexpected signal) you will need to inspect the exit code of SSH. Another option would be to call a small wrapper that does the waitpid without shell tricks:

#include <sys/wait.h>    
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
    int status;
    pid_t pid;
    if (argc < 2) {
        fprintf(stderr, "Usage: waiter command [args ..]\n");
        exit(1);
    }
    pid = fork();
    if (pid < 0) {
        err(1, "could not fork");
    } else if (pid == 0) {      /* child */
        argv++;
        execvp(*argv, argv);
        err(1, "could not exec");
    } else {                    /* parent */
        if (waitpid(pid, &status, 0) < 0)
            err(1, "could not waitpid");
        if (WIFEXITED(status)) {
            exit(WEXITSTATUS(status));
        } else if (WIFSIGNALED(status)) {
            warnx("child exited with signal %d", WTERMSIG(status));
            exit(128 + WTERMSIG(status));
        } else {
            errx(1, "unknown waitpid condition?? status=%d", status);
        }
    }
    exit(1);
}

... though this wrapper also could segfault (or whatever signal) especially if the server has hardware problems, bad memory, etc.

% ssh localhost ./waiter ./segfault
waiter: child exited with signal 10

However, the wrapper has much less code than a typical sh (~50 lines versus ~10,000 for the Heirloom Bourne shell), so it should be less likely to itself cause an exit-by-signal condition. (You did check that exit code, right?)
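
On the cron side, checking that exit code might look like the following sketch. It assumes a POSIX sh, and run_checked is a hypothetical helper name. Treating a status above 128 as "killed by signal status-128" is the convention used by shells and by the waiter wrapper above, not something ssh guarantees. Status 255 is reported separately because ssh documents it as its own error status.

```shell
#!/bin/sh
# run_checked (hypothetical helper): run a command, stay silent on success,
# and print a diagnostic on stderr otherwise so that cron has something to mail.
run_checked() {
    "$@"
    status=$?
    if [ "$status" -eq 0 ]; then
        return 0
    elif [ "$status" -eq 255 ]; then
        # ssh exits with 255 when it hits an error of its own.
        echo "run_checked: '$*' returned 255 (ssh error or abnormal remote failure)" >&2
    elif [ "$status" -gt 128 ]; then
        # 128+N: convention used by shells and by the waiter wrapper.
        echo "run_checked: '$*' was killed by signal $((status - 128)) (status $status)" >&2
    else
        echo "run_checked: '$*' exited with status $status" >&2
    fi
    return "$status"
}

# Example use from the cron job (host and path as in the question):
# run_checked ssh server.local /usr/local/bin/script
```

Because the function prints only on failure, it keeps the "no news is good news" behavior that cron relies on, while making sure a silent non-zero status still produces mail.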

thrig
  • It turned out my problem was a dash segfault triggered by some sort of buggy script with infinite recursion. But a bus error could also occur due to an I/O error when demand-paging a binary executable. I was hoping to be able to catch that. I don't think I can use a wrapper to detect a bus error inside the wrapper itself. Seems I need to start with a rule like "openssh cannot be trusted to print an error, you must check its exit status and print an error yourself". To be fair, when I can see the error, it is a fairly clear second step to start testing from an interactive shell on the server. – sourcejedi Aug 08 '18 at 17:40
  • a small wrapper would be a lot simpler than even a sh shell though yes it could itself fall over and die if something is horribly awry on a server – thrig Aug 08 '18 at 18:20