I am 'administering' a small cluster (4 nodes) based on Rocks Cluster. After a recent restart, the slave nodes all appear to have spontaneously reinstalled their operating systems, wiping their whole configuration, InfiniBand support, installed software, etc.

I cannot fathom why the system might have done this, and it is quite unhelpful. Has anyone had this happen before? What has caused it?

And for the kicker, since I'm probably resigned to rebuilding the nodes to the spec they should have had: how does one back up the slaves once they're in a working state?

Additional info:

The head node also seems largely unable to reach the internet, based on attempted pings. It cannot ping the local DNS address (192.168.0.1) either.

Rui F Ribeiro
J Collins

1 Answer

It turns out that, at least in some cases, Rocks defaults to reinstalling the slave nodes at every boot (1). Presumably the assumption is that a cluster is always on, so a restart probably means changes were made that would benefit from a reinstall. For a casually used system this is not appropriate, because the post-install scripts are unlikely to be configured to reproduce the full installation. To avoid the reinstall, execute:

rocks run host compute "chkconfig rocks-grub off"

This runs the command on all slave nodes in the 'compute' group, disabling the reinstall-on-boot behaviour.
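To confirm that the change took effect, the service state can be queried in the same way. The following is a minimal sketch, assuming the nodes use SysV-style chkconfig (as the command above implies). The run_on_compute helper and the DRY_RUN variable are hypothetical names introduced here for illustration; by default the helper only prints the command it would run.

```shell
# Hypothetical helper (not part of Rocks): print, or run, a command on
# every node in the 'compute' group. DRY_RUN defaults to 1 (print only).
run_on_compute() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf 'rocks run host compute "%s"\n' "$1"
    else
        rocks run host compute "$1"
    fi
}

# List the runlevel settings of rocks-grub on each node; after the change
# above, every runlevel should be reported as off.
run_on_compute "chkconfig --list rocks-grub"
```

Setting DRY_RUN=0 before the call runs the query for real from the head node.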

In my case the slave nodes were set to boot from the local drive first, which normally avoids the automatic reinstall. I believe the trigger was a forced power-off that corrupted the local disk: on the next boot the corrupted disk failed to boot, so the node fell back to PXE boot and received a reinstall instruction from the head node. The forced power-off was needed because something unknown interrupted shutdown now on the slaves, and only a physical power-off would shut them down. I now use shutdown -h now, which seems to overcome whatever interrupted the plain shutdown.
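The halt command can be sent to all slaves at once with the same rocks run host mechanism used earlier. This is only a sketch under that assumption: halt_compute and CONFIRM are hypothetical names, not Rocks features, and the function merely prints the command unless CONFIRM=yes is set, so pasting it does not power anything off by accident.

```shell
# Hypothetical helper (not a Rocks command): halt every node in the
# 'compute' group with 'shutdown -h now'.
halt_compute() {
    cmd='shutdown -h now'
    if [ "${CONFIRM:-no}" = "yes" ]; then
        # Really halt the nodes; run this from the head node.
        rocks run host compute "$cmd"
    else
        # Default: show what would be executed, without executing it.
        printf 'would run: rocks run host compute "%s"\n' "$cmd"
    fi
}

halt_compute
```

Running CONFIRM=yes halt_compute performs the actual shutdown.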

J Collins