I am trying to run a parallelised job on a specific node and have run into the following issue.


Let me start by explaining the structure of my problem.

I have two very simple Matlab scripts:

1) main.m

clear
rng default
P=2;
grid=randn(4,3);
jobs=1;

2) f.m

sgetasknum_grid=grid(jobs*(str2double(getenv('SGE_TASK_ID'))-1)+1: str2double(getenv('SGE_TASK_ID'))*jobs,:); %jobsx3

result=sgetasknum_grid+1; 

filename = sprintf('result.%d.mat', ID);
save(filename, 'result')

exit

What I want to do is:

  • Run main.m;

  • then run f.m 4 times, with at most 2 tasks executing in parallel at any one time;

  • Everything should be executed on node A


Here's how I implement the steps above

1) I save main.m and f.m into a folder named My_folder

2) I create the script td.sh as below and save it into the folder My_folder

#!/bin/bash -l
#$ -S /bin/bash
#$ -l h_vmem=5G
#$ -l tmem=5G
#$ -l h_rt=480:0:0
#$ -cwd
#$ -j y


#$ -N try

date
hostname

J=4 #number tasks

N=2 #number tasks executed in parallel

export SGE_TASK_ID


SGE_TASK_ID=1
n=0
while [ "$SGE_TASK_ID" -le "$J" ]; do
    if [ "$n" -eq "$N" ]; then
        wait -n  # as soon as one task is done, refill it with another
        n=$(( n - 1 ))
    fi

    printf 'Task ID is %d\n' "$SGE_TASK_ID"

    /share/.../matlab -nodisplay -nodesktop -nojvm -nosplash -r "main; ID=$SGE_TASK_ID; f; exit" &

    SGE_TASK_ID=$(( SGE_TASK_ID + 1 ))
    n=$(( n + 1 ))
done

wait

3) I go into the terminal and type ssh username@A, then cd /.../My_folder, then bash td.sh


Problem: I get the following error

td.sh: line 26: wait: -n: invalid option
wait: usage: wait [id]

As noted in the comments below, the issue is that the version of bash on node A is old (the -n option was added to the wait builtin in 4.3) and the sysadmin can't update it. The newest version available is bash 4.1.

Thus, could you suggest a way to replace wait -n?

Star
  • Perhaps the version of bash on the remote node is older? IIRC the -n option was added to the wait builtin in 4.3 – steeldriver Dec 07 '18 at 13:11
  • OK. What can I do? (1) contact sysadmin, (2) replace wait -n with? – Star Dec 07 '18 at 13:15
  • Before making the code more complicated, see if bash 4.3 or newer is installed elsewhere on the system, or if an admin is happy to upgrade the existing bash to a later version. – Kusalananda Dec 07 '18 at 13:17
  • Thanks. The sysadmin said that the latest version possible is bash 4.1 and that in order to get a newer version he would need to upgrade the entire OS. What can I do to solve the issue? – Star Dec 10 '18 at 09:52
  • If you remove all the stuff about matlab, and ask the question at the top, then you may get an answer. At present people are starting to read, saying to themselves “I don't know about matlab”, and moving on. Many of these people may know the answer, but never get to the question. – ctrl-alt-delor Dec 11 '18 at 13:18
  • Note that the first line of your script, #!…, is magic. If you do chmod +x td.sh then you can run it as a normal executable, e.g. ./td.sh. And remove the file extension, so that if you later have to re-implement it in python, you don't have to rename it. – ctrl-alt-delor Dec 11 '18 at 13:21
  • Is switching to another shell (like zsh, dash, mksh, ksh93) where it's easier an option? – Stéphane Chazelas Dec 11 '18 at 13:24
  • @StéphaneChazelas I have no idea, if you tell me how to do it I can try. – Star Dec 11 '18 at 18:07

7 Answers

2

What your script does is better done with GNU parallel, or with make and its -j option. Alternatively, you can rewrite it in Python (or another language).

Look at:

  • parallel: A tool for use in bash (the easiest of the 3 to learn; it only does one thing).
  • make: A bit more advanced, and it has its own language. It is used to create files, e.g. to make A.b you will need A.a and g.f; when you have these, do x;y;z. You can also add rules on how to make A.a and g.f. It will work out what depends on what, and build things in the correct order. If it can, it will do things in parallel (if asked to).
  • python: A programming language. It can do what your script is trying to do, and it can do what matlab does.

You will also have to consider which of these are/can-be installed. Do this to find out:

type parallel
type make
type python

Note: type here is not an instruction for you to type something; it is itself the command that you enter. It reports what kind of command each name is and, for an executable, where it is located.
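As a rough illustration of the GNU parallel route, here is a minimal sketch. It assumes GNU parallel is installed (it checks for it first), and the echo command with its parallel_result.N.txt output files is a hypothetical stand-in for the real matlab call, which is not reproduced here.

```shell
#!/bin/bash
J=4   # number of tasks
N=2   # number of tasks run at the same time

# Proceed only if the installed `parallel` is GNU parallel
# (other tools with the same name use a different syntax).
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # seq feeds the task IDs 1..J on stdin; {} is replaced by each ID.
    # The echo is a hypothetical placeholder for the real per-task command.
    seq 1 "$J" | parallel -j "$N" 'echo "task {} done" > parallel_result.{}.txt'
else
    echo "GNU parallel not found" >&2
fi
```

With -j "$N", parallel starts a new task as soon as one of the N running tasks finishes, which is the behaviour the wait -n loop was aiming for.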

1

What about not using wait at all, in the while loop?

while [ "$SGE_TASK_ID" -le "$J" ]; do

    # count this user's matlab processes; the [m] keeps grep from counting itself
    n=$(ps ux | grep -c "[m]atlab")

    ##  if [ "$n" -le "$N" ]; then
    if [ "$n" -ge "$N" ]; then
        # sleep 1 sec if already max processes started
        sleep 1
        ##  wait -n  # as soon as one task is done, refill it with another
        ##  n=$(( n - 1 ))
    else
        # start another process
        printf 'Task ID is %d\n' "$SGE_TASK_ID"

        /share/.../matlab -nodisplay -nodesktop -nojvm -nosplash -r "main; ID=$SGE_TASK_ID; f; exit" &

        SGE_TASK_ID=$(( SGE_TASK_ID + 1 ))

    fi
    ##  n=$(( n + 1 ))
done

The string to grep for may of course have to differ, depending on what you have running (e.g. give f.m a more distinctive name, and grep for that).
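A variation on the same polling idea, sketched under assumptions rather than taken from the original script: instead of grepping the output of ps, the loop can ask bash for its own running background jobs with jobs -rp, so unrelated matlab processes are never counted. The run_task function and the result.N.txt files are hypothetical stand-ins for the matlab call.

```shell
#!/bin/bash
J=4   # number of tasks
N=2   # maximum number of tasks running at once

# Hypothetical stand-in for the matlab call in td.sh.
run_task() {
    sleep 1
    echo "task $1 done" > "result.$1.txt"
}

SGE_TASK_ID=1
while [ "$SGE_TASK_ID" -le "$J" ]; do
    # `jobs -rp` prints one PID per running background job of this shell;
    # poll once a second until fewer than N are running.
    while [ "$(jobs -rp | wc -l)" -ge "$N" ]; do
        sleep 1
    done

    printf 'Task ID is %d\n' "$SGE_TASK_ID"
    run_task "$SGE_TASK_ID" &

    SGE_TASK_ID=$(( SGE_TASK_ID + 1 ))
done

# Wait for the tasks that are still running after the loop ends.
wait
```

This relies only on the jobs builtin and plain wait, so it does not need wait -n. The cost is the polling delay of up to one second before a free slot is refilled.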

Jaleks
  • Thanks. (1) Sorry, but I don't understand the terminology: what do you mean by "The string to grep for may of course have to differ, depending of what you have running (e.g. give f.m some more special name, and grep for that.)"? (2) Should SGE_TASK_ID=1 n=0 be inserted before while? (3) Should I put the wait after the final done? – Star Dec 11 '18 at 12:08
  • n = $(ps ux | grep -c "matlab") counts the Matlab processes used by other users in the node. Why are you counting just the Matlab processes and not more generally any process? Is that line aiming to check how many processes are available in the node out of the N maximum number of processes? In that case, wouldn't something like n=$(cut -d. -f1 /proc/loadavg) be better (which counts how many processes are used by me or other users)? – Star Dec 11 '18 at 18:02
  • What exactly is sleep 1 doing? Is it "waiting one second and then recomputing n"? – Star Dec 11 '18 at 18:06
  • 4): The parameter 'u' should tell ps to only look for your own processes, so it counts your "matlab" processes. 1): You might have more/other matlab instances already running, so the string "matlab" might need adjustment to find the right ones, e.g. a matlab process also containing your script name. 2)/3): it's a replacement for your original while loop, keep all the rest for it to work together. 5): yes, the sleep tells it to wait one second before again counting the "matlab" processes. 6): it might not be the nicest approach like that, but it should work and also needs minimal adjustments – Jaleks Dec 12 '18 at 22:53