
Ten years ago, there was a question here: Automatically run commands over SSH on many servers.

I have basically the same question, but I need to run commands/scripts with (potentially different) parameters on each machine, and I need to be able to stop long-running tasks. I would also prefer a modern monitor (e.g. a web UI, or at least Elastic output) so I can see which scripts are running or finished, and with what results. It would also be nice to queue tasks or give them time limits. Finally, I can't add my public key to all of the computers, but I may (or may let someone) install some software on them.

This is mainly intended for AI training processes, but also for many other tasks, such as upgrading frameworks and, occasionally, sending (or downloading) new scripts and data.

In the question linked above, people suggested Ansible. I think automation is the modern way to go, but are there any other options?

A friend also suggested CI/CD (GitLab CI), but that seems like overkill, and it is meant for other purposes, such as code testing. I also got a tip for AutoML, but that is a complete AI framework, which I don't need, since I also have to run many different commands/scripts with various parameters.

P3k

3 Answers


Put the parameters into a dictionary.

  • For example, let's start with the default
shell> cat group_vars/all/scripts.yml
scripts:
  default:
    script: /root/bin/default.sh
    params: p1 p2 p3
    timeout: 30
    retries: 10
    delay: 3
    log: /tmp/ansible_script.log

Given the scripts at the controller

shell> tree files
files
├── default.sh
├── script_A.sh
├── script_B.sh
└── script_C.sh
shell> cat files/default.sh 
#!/bin/sh
echo $1 $2 $3
echo finished > /tmp/ansible_script.log
exit 0

The playbook below

shell> cat pb.yml
- hosts: all

  vars:

    my_script: "{{ scripts[inventory_hostname]|d(scripts['default']) }}"
    _script: "{{ my_script.script|d(scripts.default.script) }}"
    _params: "{{ my_script.params|d(scripts.default.params) }}"
    _timeout: "{{ my_script.timeout|d(scripts.default.timeout) }}"
    _retries: "{{ my_script.retries|d(scripts.default.retries) }}"
    _delay: "{{ my_script.delay|d(scripts.default.delay) }}"
    _log: "{{ my_script.log|d(scripts.default.log) }}"

  tasks:

    - debug:
        msg: |-
          _script: {{ _script }}
          _params: {{ _params }}
          _timeout: {{ _timeout }}
          _retries: {{ _retries }}
          _delay: {{ _delay }}
          _log: {{ _log }}
      when: debug|d(false)|bool

    - name: Copy script
      block:
        - file:
            state: directory
            path: "{{ _script|dirname }}"
            mode: 0750
        - copy:
            src: "{{ _script|basename }}"
            dest: "{{ _script }}"
            mode: 0550
      when: copy_script|d(false)|bool

    - name: Run script
      block:
        - command:
            cmd: "{{ _script }} {{ _params }}"
          async: "{{ _timeout }}"
          poll: 0
          register: cmd_async
        - debug:
            var: cmd_async.ansible_job_id
          when: debug|d(false)|bool

    - name: Read log until finished
      block:
        - command:
            cmd: "cat {{ _log }}"
          register: cmd_log
          until: cmd_log.stdout == 'finished'
          retries: "{{ _retries }}"
          delay: "{{ _delay }}"
        - debug:
            var: cmd_log.stdout
          when: debug|d(false)|bool
      when: read_log_fin|d(false)|bool

    - name: Check async script
      block:
        - async_status:
            jid: "{{ cmd_async.ansible_job_id }}"
          register: job_result
          until: job_result.finished
          retries: "{{ _retries }}"
          delay: "{{ _delay }}"
        - debug:
            msg: >-
              {{ job_result.start }}
              {{ job_result.end }}
              rc: {{ job_result.rc }}
          when: debug|d(false)|bool

gives

shell> ansible-playbook pb.yml -e debug=true -e copy_script=true -e read_log_fin=true

PLAY [all] ***********************************************************************************

TASK [debug] *********************************************************************************
ok: [test_11] =>
  msg: |-
    _script: /root/bin/default.sh
    _params: p1 p2 p3
    _timeout: 30
    _retries: 10
    _delay: 3
    _log: /tmp/ansible_script.log
ok: [test_12] =>
  msg: |-
    _script: /root/bin/default.sh
    _params: p1 p2 p3
    _timeout: 30
    _retries: 10
    _delay: 3
    _log: /tmp/ansible_script.log
ok: [test_13] =>
  msg: |-
    _script: /root/bin/default.sh
    _params: p1 p2 p3
    _timeout: 30
    _retries: 10
    _delay: 3
    _log: /tmp/ansible_script.log

TASK [file] **********************************************************************************
ok: [test_13]
ok: [test_12]
ok: [test_11]

TASK [copy] **********************************************************************************
ok: [test_12]
ok: [test_11]
ok: [test_13]

TASK [command] *******************************************************************************
changed: [test_12]
changed: [test_11]
changed: [test_13]

TASK [debug] *********************************************************************************
ok: [test_11] =>
  cmd_async.ansible_job_id: '754707567219.90860'
ok: [test_12] =>
  cmd_async.ansible_job_id: '148176661548.90862'
ok: [test_13] =>
  cmd_async.ansible_job_id: '688240445475.90861'

TASK [command] *******************************************************************************
changed: [test_13]
changed: [test_11]
changed: [test_12]

TASK [debug] *********************************************************************************
ok: [test_11] =>
  cmd_log.stdout: finished
ok: [test_12] =>
  cmd_log.stdout: finished
ok: [test_13] =>
  cmd_log.stdout: finished

TASK [async_status] **************************************************************************
changed: [test_12]
changed: [test_13]
changed: [test_11]

TASK [debug] *********************************************************************************
ok: [test_11] =>
  msg: '2022-08-01 16:02:50.287027 2022-08-01 16:02:50.320177 rc: 0'
ok: [test_12] =>
  msg: '2022-08-01 16:02:49.770331 2022-08-01 16:02:49.801347 rc: 0'
ok: [test_13] =>
  msg: '2022-08-01 16:02:50.189800 2022-08-01 16:02:50.343773 rc: 0'

PLAY RECAP ***********************************************************************************
test_11: ok=9 changed=3 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
test_12: ok=9 changed=3 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
test_13: ok=9 changed=3 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0


  • It's not possible to display the intermediate log during the iteration. The callback plugin displays all results together after the iteration completes. You have to go outside Ansible if you want to observe the log while the script runs. For example, fetch the log files
    - name: Fetch log until finished
      fetch:
        dest: /tmp/ansible/
        src: "{{ _log }}"
      until: lookup('file', my_logfile) == 'finished'
      retries: "{{ _retries }}"
      delay: "{{ _delay }}"
      vars:
        my_logfile: "/tmp/ansible/{{ inventory_hostname }}/tmp/ansible_script.log"
      when: fetch_log_fin|d(false)|bool

This creates periodically updated files at the controller

shell> tree /tmp/ansible/
/tmp/ansible/
├── test_11
│   └── tmp
│       └── ansible_script.log
├── test_12
│   └── tmp
│       └── ansible_script.log
└── test_13
    └── tmp
        └── ansible_script.log

Display the files at the controller. For example, use watch

shell> watch cat /tmp/ansible/test_11/tmp/ansible_script.log

To test it, the script below writes to the log $1 times, at intervals of $2 seconds

shell> cat files/script_A.sh
#!/bin/sh
for i in $(seq 1 $1); do
    echo step $i  > /tmp/ansible_script.log
    sleep $2
done
echo finished > /tmp/ansible_script.log
exit 0

Update the dictionary and let the host test_11 run the script

shell> cat group_vars/all/scripts.yml
scripts:
  default:
    script: /root/bin/default.sh
    params: p1 p2 p3
    timeout: 30
    retries: 10
    delay: 3
    log: /tmp/ansible_script.log
  test_11:
    script: /root/bin/script_A.sh
    params: 7 3

The playbook gives (abridged). Delete the fetched files before you run the playbook again; otherwise, the until condition is already satisfied by the files left over from the previous run.

shell> ansible-playbook pb.yml -e debug=true -e fetch_log_fin=true
...
TASK [Fetch log until finished] **************************************************************
ok: [test_12]
FAILED - RETRYING: [test_11]: Fetch log until finished (10 retries left).
ok: [test_13]
FAILED - RETRYING: [test_11]: Fetch log until finished (9 retries left).
FAILED - RETRYING: [test_11]: Fetch log until finished (8 retries left).
FAILED - RETRYING: [test_11]: Fetch log until finished (7 retries left).
FAILED - RETRYING: [test_11]: Fetch log until finished (6 retries left).
changed: [test_11]

TASK [async_status] **************************************************************************
changed: [test_13]
changed: [test_12]
changed: [test_11]

TASK [debug] *********************************************************************************
ok: [test_11] =>
  msg: '2022-08-01 18:00:13.304133 2022-08-01 18:00:34.768385 rc: 0'
ok: [test_12] =>
  msg: '2022-08-01 18:00:13.413492 2022-08-01 18:00:13.480142 rc: 0'
ok: [test_13] =>
  msg: '2022-08-01 18:00:13.537767 2022-08-01 18:00:13.731926 rc: 0'

PLAY RECAP ***********************************************************************************
test_11: ok=6 changed=3 unreachable=0 failed=0 skipped=4 rescued=0 ignored=0
test_12: ok=6 changed=2 unreachable=0 failed=0 skipped=4 rescued=0 ignored=0
test_13: ok=6 changed=2 unreachable=0 failed=0 skipped=4 rescued=0 ignored=0

The output will be approximately the same if you read the log

shell> ansible-playbook pb.yml -e debug=true -e read_log_fin=true
...
TASK [Read log until finished] ***************************************************************
FAILED - RETRYING: [test_11]: Read log until finished (10 retries left).
changed: [test_12]
changed: [test_13]
FAILED - RETRYING: [test_11]: Read log until finished (9 retries left).
FAILED - RETRYING: [test_11]: Read log until finished (8 retries left).
FAILED - RETRYING: [test_11]: Read log until finished (7 retries left).
FAILED - RETRYING: [test_11]: Read log until finished (6 retries left).
changed: [test_11]

TASK [debug] *********************************************************************************
ok: [test_11] =>
  cmd_log.stdout: finished
ok: [test_12] =>
  cmd_log.stdout: finished
ok: [test_13] =>
  cmd_log.stdout: finished

Link to complete playbook pb.yml
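
The OP also asked about stopping long-running tasks, which the playbook above does not cover. A minimal sketch of such an extension (hypothetical, not part of the original playbook; the stop_script flag and the pkill pattern are placeholders), reusing the variables and the registered cmd_async from above, might be

    - name: Stop script and clean up async job
      block:
        # Kill the running script on the target. Matching on the script's
        # basename is a simplification; a stricter pattern may be needed.
        - command:
            cmd: "pkill -f {{ _script|basename }}"
          register: cmd_kill
          # pkill exits 1 when no process matched; treat that as already stopped
          failed_when: cmd_kill.rc > 1
        # Remove the async job's cache file on the target.
        - async_status:
            jid: "{{ cmd_async.ansible_job_id }}"
            mode: cleanup
      when: stop_script|d(false)|bool

Run it like the other optional blocks, e.g. ansible-playbook pb.yml -e stop_script=true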


Your question seems to have two parts:

  1. How do I stop/start/manage scripts/processes running on a (large) list of servers?
  2. How do I track or monitor the status of the long-running scripts/processes on that same list of servers?

Ansible and other configuration-management software (Puppet, Chef, Saltstack, etc.) can be good candidates for question #1, but they don't offer tracking/monitoring of the processes they start and stop, especially if you are envisioning progress graphs or alerts when a script errors out. That class of status monitoring and reporting is (IMO) better served by a monitoring system.

Back to question #1: Ansible is well known for its ability to run commands/scripts on multiple servers, both pre-configured ones and "ad hoc" ones you type on the command line on the spot. Saltstack has a similar feature, though it may not handle ad hoc commands as complex as Ansible can. The other config-management packages have less direct ways of executing commands on target servers and may not be as easily configured to solve the problem you describe here.
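
For example, a minimal ad hoc run might look like this (a sketch; uptime is just a stand-in command, and the password prompt illustrates that key-based authentication is not required, which matters since the OP can't distribute public keys):

# run one command on every inventory host, prompting for the SSH password
# instead of relying on distributed public keys
shell> ansible all -m command -a "uptime" --ask-pass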

On question #2, there are a number of factors that go into selecting a monitoring system, such as whether one already exists for the basics of the servers (CPU, memory, disk) and how easily it can be extended to track your scripts/processes. It also matters whether these servers are dedicated to running your scripts/processes or have other things running on them as well. This is a big enough topic that it probably deserves its own question, with details of what you want to monitor, how you'd like to view the status, and what kinds of alerts you might want.

Sotto Voce

Puppet Bolt accomplishes what you've described with relative ease. We've used it for what I'd describe as complex workflows and tied it into other tools like Terraform (provisioning) and ServiceNow (CR validation) in a short amount of time. Your use case may not require integrations, but it's good to know they exist in case your needs evolve.

The OP mentioned not placing a public key on the target nodes… I'm not sure I fully understand that constraint, but taking it at face value, it's possible to use username and password authentication with Bolt.
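
For example, a rough sketch (host names, user name, and the script with its parameters are placeholders, not taken from the OP's setup):

# run a local script with two parameters on several targets over SSH,
# authenticating with a user name and an interactive password prompt
shell> bolt script run ./train.sh 10 0.01 \
         --targets host1,host2 --user admin --password-prompt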

A CI tool (like the one your friend suggested) would be an elegant way to get a web-enabled view into everything.