1

I run measurements on a cluster of 32 nodes/machines. I do not need all of them, only 4, for example. The problem is that most of the time the nodes are busy with other people's heavy jobs. So, to find idle nodes and get good results, I run the top command on each machine, starting from the first, until I find 4 free ones.

Is there a way to check the CPU load/utilisation on multiple machines at once and, if possible, list the machines that are least busy?

vis
  • 1,017

5 Answers

3

While @wnoise's answer is the nicer solution, it might not be possible for you to implement it (e.g. do you administer the cluster?), so why not have a look at

  1. one of the 'cluster SSH' solutions @Chaleb mentioned here (pssh, pdsh, clusterssh, clusterit) or
  2. Fabric (also mentioned in this thread, by @Crankyadmin)

to gather usage statistics.

Add a little scripting (a) to evaluate the statistics you gathered on each host and you should be good to go.

(a) Depending on your preferences, one or another of the mentioned tools could be handier, e.g. Fabric is a Python framework, so if you'd like to do the evaluation in Python, it might be well suited (though Perl, Bash, or any other scripting language is just as good).
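
As a rough illustration of the "gather, then evaluate" idea, here is a minimal Python sketch that uses plain ssh rather than any of the tools named above. It assumes key-based (passwordless) SSH login, Linux nodes that expose /proc/loadavg, and hypothetical host names node01 to node32; none of these details come from the question, so adjust them to your cluster.

```python
import subprocess

# Hypothetical host names (assumption); replace with the cluster's real node names.
HOSTS = [f"node{i:02d}" for i in range(1, 33)]


def parse_loadavg(text):
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    return float(text.split()[0])


def fetch_load(host, timeout=5):
    """Read /proc/loadavg on `host` over SSH; return None if that fails."""
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", "-o", f"ConnectTimeout={timeout}",
             host, "cat", "/proc/loadavg"],
            capture_output=True, text=True, timeout=timeout + 5, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return None
    return parse_loadavg(result.stdout)


def least_loaded(loads, count=4):
    """Return the `count` (host, load) pairs with the lowest load, skipping failures."""
    reachable = [(host, load) for host, load in loads.items() if load is not None]
    return sorted(reachable, key=lambda pair: pair[1])[:count]


if __name__ == "__main__":
    # Hosts are polled one after another, which keeps the sketch short.
    loads = {host: fetch_load(host) for host in HOSTS}
    for host, load in least_loaded(loads):
        print(f"{host}\t{load:.2f}")
```

The load average is only a proxy for "idle", and with 32 nodes the sequential loop may take a while if some hosts are down; a parallel tool such as pssh or pdsh would gather the same data faster.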

sr_
  • 15,384
3

The rup command from the rstatclient package will poll all the machines on your subnet for information, including their load averages. The machines must be running rstatd to serve up that information, and I would put it behind TCP wrappers so that it only responds to your admin desktops. You can also specify individual machines to collect data from. With rstatd running on the remote machines, you can also bring up xmeter to visually monitor their historical load average.
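
If you want to rank the hosts that rup reports, a small parser is enough. The sketch below assumes rup prints one line per host in the usual uptime-like form ("host up ..., load average: a, b, c"); the exact layout may differ between implementations, and the function names and sample host names are made up for illustration.

```python
import re
import subprocess

# Matches lines such as:
#   node01    up  3 days, 12:01,    load average: 0.08, 0.03, 0.01
# The layout is an assumption about rup's output, not a guaranteed format.
LOAD_RE = re.compile(r"^\s*(\S+)\s+up\b.*load average:\s*([\d.]+)")


def parse_rup(output):
    """Map each host name to its first (1-minute) load average."""
    loads = {}
    for line in output.splitlines():
        match = LOAD_RE.match(line)
        if match:
            loads[match.group(1)] = float(match.group(2))
    return loads


def idlest(loads, count=4):
    """Return the names of the `count` hosts with the lowest load."""
    return sorted(loads, key=loads.get)[:count]


if __name__ == "__main__":
    # With no host arguments, rup polls the local network.
    result = subprocess.run(["rup"], capture_output=True, text=True)
    print("\n".join(idlest(parse_rup(result.stdout))))
```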

Jodie C
  • 1,879
2

There are many "batch systems" that are designed to handle this sort of problem. One specifically tailored to "cycle stealing" from otherwise unoccupied systems is Condor, a long-running research project at the University of Wisconsin.

wnoise
  • 1,961
  • I was looking for something like a simple command or a script to do this, but maybe there isn't one. – vis Mar 08 '12 at 16:52
2

If snmpd is running, you can query the load values of these machines with a simple snmpget. If you script that, you can sort the machines by load value and output the four lowest.
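
One possible way to script this, sketched in Python around the Net-SNMP snmpget command. It assumes SNMP v2c with the community string "public", that each agent exposes the UCD-SNMP-MIB load table (laLoad.1 is the 1-minute load average), and hypothetical host names; all of these are assumptions to adapt to your setup.

```python
import subprocess

# UCD-SNMP-MIB::laLoad.1, the 1-minute load average.
LA_LOAD_1MIN = ".1.3.6.1.4.1.2021.10.1.3.1"

# Hypothetical host names (assumption); replace with the real node names.
HOSTS = [f"node{i:02d}" for i in range(1, 33)]


def parse_snmp_value(text):
    """Convert snmpget's value-only output (optionally quoted) to a float."""
    return float(text.strip().strip('"'))


def snmp_load(host, community="public"):
    """Query one host with snmpget; return None on any failure."""
    try:
        result = subprocess.run(
            ["snmpget", "-v2c", "-c", community, "-Oqv", host, LA_LOAD_1MIN],
            capture_output=True, text=True, timeout=10, check=True,
        )
        return parse_snmp_value(result.stdout)
    except (subprocess.SubprocessError, OSError, ValueError):
        return None


def lowest(loads, count=4):
    """Return the `count` (host, load) pairs with the smallest load values."""
    known = [(host, load) for host, load in loads.items() if load is not None]
    return sorted(known, key=lambda pair: pair[1])[:count]


if __name__ == "__main__":
    loads = {host: snmp_load(host) for host in HOSTS}
    for host, load in lowest(loads):
        print(f"{host}\t{load:.2f}")
```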

Nils
  • 18,492
-1

You should use the command mdiag -n to check whether the nodes are idle or busy.

Anthon
  • 79,293
Pankaj
  • 1