
Yesterday I made an interesting observation. I work on my local office computer (Ubuntu 16.04) and have several remote jobs (long CFD simulations) running on our cluster (CentOS 7). All jobs were started as background jobs (program OPTIONS > LOGFILE &), and bash is configured (huponexit off) so that SIGHUP is not sent to the jobs when the shell exits. Consequently, if I start a simulation and log out, the job continues to run. This can easily be verified later by checking the log file.
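For reference, the setup can be sketched roughly as follows. This is only a minimal sketch, not my actual command line: the short `sh -c` job is a stand-in for the real solver, and the temporary file is a stand-in for LOGFILE.

```shell
# Placeholder for LOGFILE (hypothetical; the real jobs write to a fixed log file).
logfile=$(mktemp)

# Query the bash option. "shopt NAME" returns non-zero when the option is off,
# hence the "|| true" so the query itself never counts as a failure.
hup_setting=$(bash -c 'shopt huponexit' || true)
echo "$hup_setting"

# Start the stand-in job in the background with its output redirected to the log,
# mirroring "program OPTIONS > LOGFILE &".
sh -c 'echo started; sleep 1; echo finished' > "$logfile" 2>&1 &
job_pid=$!

# In the real workflow I would log out here; in this sketch we just wait
# for the job and then inspect the log.
wait "$job_pid"
cat "$logfile"
```

On the cluster, the `shopt huponexit` query reports the option as off, which matches the behaviour that jobs survive a normal log-out.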

Out of laziness, when I start a job I simply keep the terminal open for checking the progress of the simulation.

Yesterday, something on my local machine went wrong and it froze (I guess some hang-up with the GUI, as I was still able to start a terminal session and call reboot). After rebooting the local machine, I noticed that all my remote jobs on the cluster had stopped.

This also happens when the GUI on my local workstation freezes and I restart the display manager from a terminal.

I know I can most probably prevent this by using screen. Nevertheless, I am curious why this happened. What is different when I forcibly reboot my local machine compared to a controlled log-out and reboot, which does not affect the remote jobs?
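Besides screen, a workaround I could try is `nohup`. The sketch below rests on an assumption I have not confirmed, namely that the jobs are killed by a SIGHUP: it starts a stand-in job (`sleep 30` instead of the real solver) under `nohup`, sends it SIGHUP by hand, and checks that it is still alive.

```shell
# Assumption (unconfirmed): the remote jobs die because they receive SIGHUP.
# "sleep 30" is a stand-in for the real solver.
nohup sleep 30 > /dev/null 2>&1 &
pid=$!

# Give nohup a moment to set SIGHUP to "ignore" and exec the job.
sleep 1

# Simulate the hang-up by sending SIGHUP directly to the job.
kill -HUP "$pid"
sleep 1

# "kill -0" sends no signal; it only tests whether the process still exists.
if kill -0 "$pid" 2>/dev/null; then
    result="alive"
else
    result="dead"
fi
echo "job is $result after SIGHUP"

# Clean up the stand-in job (SIGTERM is not ignored).
kill "$pid" 2>/dev/null || true
wait "$pid" 2>/dev/null || true
```

If the real jobs survive under `nohup` but not without it, that would at least point towards SIGHUP as the signal involved.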

Dohn Joe
  • Can you run strace or sysdig on the cluster and log in particular the signals that hit the programs for a normal logout versus a crash? – thrig Aug 26 '16 at 13:45
  • That would require me to provoke a crash of my local machine. How would I do that? What I could do is compare a normal logout with a forced reboot from a parallel terminal session. However, I will try this later, once my current jobs have finished, just in case. – Dohn Joe Aug 26 '16 at 13:54
