Yesterday I made an interesting observation. I work on my local office computer (Ubuntu 16.04) and have several remote jobs (long CFD simulations) running on our cluster (CentOS 7). All jobs were started as background jobs (program OPTIONS > LOGFILE &), and bash is configured (huponexit off) so that SIGHUP is not sent to the jobs when the shell exits.
Consequently, if I start a simulation and log out, the job continues to run. This can easily be checked later by looking at the log file.
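For reference, the huponexit setting mentioned above can be inspected from within bash. This is only a minimal sketch: the function name huponexit_state is made up for illustration, and it reports the option for the shell it runs in, which is not necessarily the login shell that started the jobs.

```bash
#!/usr/bin/env bash
# Hypothetical helper (the name is an assumption, not part of bash):
# prints "on" if the huponexit shell option is set, "off" otherwise.
huponexit_state() {
    # shopt -q prints nothing and reports the option via its exit status
    if shopt -q huponexit; then
        echo on
    else
        echo off
    fi
}

# Usage, typed in the login shell on the cluster:
#   huponexit_state
```

The option only covers the case where an interactive login shell exits on its own, so it says nothing by itself about what happens when the connection is cut from the outside.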
Out of laziness, when I start a job I simply keep the terminal open for checking the progress of the simulation.
Yesterday, something on my local machine went wrong and it froze (I guess the GUI hung, as I was still able to start a terminal session and call reboot). After rebooting the local machine I noticed that all my remote jobs on the cluster had stopped.
This also happens when the GUI on my local workstation freezes and I restart the display manager from a terminal.
I know I can most probably prevent this by using screen; nevertheless, I am curious why this happened. What is different when I forcibly reboot my local machine compared to a controlled log-out and reboot, which does nothing to the remote jobs?
strace or sysdig on the cluster and log in particular the signals that hit the programs for a normal logout versus a crash? – thrig Aug 26 '16 at 13:45
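If strace or sysdig is not at hand, a rough way to log signals in the spirit of this comment is a small wrapper around the job. The sketch below is an assumption-laden illustration, not a tested diagnostic for this cluster: log_signals is a made-up name, it only records HUP, INT and TERM, and it records signals delivered to the wrapper itself. It assumes that the simulation, running in the same job, is hit by the same signals.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run a command and append a line to LOGFILE each time
# the wrapper receives SIGHUP, SIGINT or SIGTERM.
# Usage: log_signals SIGNAL_LOGFILE COMMAND [ARGS...]
log_signals() {
    local logfile=$1
    shift
    local sig
    for sig in HUP INT TERM; do
        # The handler only logs; it does not forward the signal to the child.
        trap "echo \"\$(date '+%F %T') received SIG$sig\" >> '$logfile'" "$sig"
    done
    "$@" &
    local child=$!
    # wait returns early when a trapped signal arrives, so keep waiting
    # until the child process is really gone.
    while kill -0 "$child" 2>/dev/null; do
        wait "$child"
    done
}

# Usage on the cluster (program, OPTIONS and LOGFILE as in the question):
#   log_signals signals.log program OPTIONS > LOGFILE &
```

Comparing signals.log after a normal logout with the one left after a crash of the local machine should show whether a SIGHUP arrives only in the second case.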