I am trying to run a demo example from a tutorial on running code on a cluster computer. The example is below, but I do not understand most of its statements:

#BSUB -L /bin/bash
#BSUB -J "MNIST_DDL"
#BSUB -o "MNIST_DDL.%J"
#BSUB -n 12
#BSUB -R "span[ptile=4]"
#BSUB -gpu "num=2"
#BSUB -q "normal"
#BSUB -W 00:10

ml wml_anaconda3
conda activate <your environment>

# Workaround for GPU selection issue
cat > launch.sh << EoF_l
#! /bin/sh
export CUDA_VISIBLE_DEVICES=0,1
exec \$*
EoF_l
chmod +x launch.sh

# Run the program
export PAMI_IBV_ADAPTER_AFFINITY=0
ddlrun ./launch.sh python /path/to/your_program.py

# Clean up
/bin/rm -f launch.sh

I understand the initial #BSUB lines: they describe the resource allocation and the metadata for the job. But I do not understand the following lines:

  # Workaround for GPU selection issue
cat > launch.sh << EoF_l
#! /bin/sh
export CUDA_VISIBLE_DEVICES=0,1
exec \$*
EoF_l
chmod +x launch.sh

Thank You.

Beginner
  • 2
    The code cat > launch.sh << EoF_l... creates a file launch.sh containing the 3 lines of code before EoF_l and makes it executable. (I don't know how it is related to a "GPU selection issue".) – Bodo Feb 22 '21 at 16:48
  • 3
  • 1
    That's an awfully roundabout way of setting CUDA_VISIBLE_DEVICES=0,1 in the environment for the Python code. You could probably just do ddlrun env CUDA_VISIBLE_DEVICES=0,1 python /path/to/your_program.py and delete that here-document, or export it just before, as is done for PAMI_IBV_ADAPTER_AFFINITY (unless ddlrun cleans the environment). – Kusalananda Feb 22 '21 at 17:06
  • 2
    Your script is using a heredoc to create a text file called launch.sh containing 3 lines of commands and then making it an executable shell script with chmod +x launch.sh – fpmurphy Feb 22 '21 at 17:07
  • Thanks, everyone, for your input. I have another question: where will launch.sh be created, in the same directory as the above file? If I remove the rm -f launch.sh line, it is created in the parent directory, where the above file is stored. – Beginner Feb 22 '21 at 17:42
  • @Kusalananda Could you please elaborate on why it is not a good way? Also, I didn't understand how my CUDA devices were being set before. – Beginner Feb 22 '21 at 17:43
  • 1
    The launch.sh file will be created (or attempted to) in the current directory where you run this script from. Kusalananda is saying that creating launch.sh, running it, then deleting it, is unneeded complexity. Why not just perform the task of the launch.sh script directly in the parent script? – spuck Feb 22 '21 at 17:48
  • 1
    Why are all these answer-comments comments and not answers? – DopeGhoti Feb 22 '21 at 18:28
  • 1
    Does this answer your question? How does cat > file << "END" work? – AdminBee Feb 23 '21 at 13:14
  • @AdminBee Thank you for this. I have a doubt, and I am quoting from the above answer: "stop interpreting the stream data as commands and pass them on to the stdin of the command you are going to execute". In my case, though, we turn the heredoc into an executable script by running chmod +x launch.sh. How is that different from not using the heredoc at all and simply writing export CUDA_VISIBLE_DEVICES=0,1 in the above script? According to my understanding, the above script is treated as commands, and commands are executable. Let me know if my thinking is correct. – Beginner Feb 25 '21 at 18:07
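
To make the points in the comments concrete, here is a minimal sketch that can be run on any machine, no GPU or LSF needed. It rests on a few assumptions that are not part of the tutorial: it works in a scratch directory from mktemp, and it uses printenv as a stand-in for "python /path/to/your_program.py". The here-document only writes text into launch.sh. None of those lines run until launch.sh itself is executed. Because the delimiter EoF_l is unquoted, the shell expands $ inside the here-document, which is why the tutorial writes \$* to get a literal $* into the file.

```shell
#!/bin/sh
# Scratch directory for the demo (an assumption; the tutorial job writes
# launch.sh into whatever directory the job script runs from).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Unquoted delimiter: $... would be expanded while the file is written,
# so the backslash keeps $* literal in launch.sh.
cat > launch.sh << EoF_l
#! /bin/sh
export CUDA_VISIBLE_DEVICES=0,1
exec \$*
EoF_l
chmod +x launch.sh

# Show what was actually written; the last line reads: exec $*
written=$(cat launch.sh)
printf '%s\n' "$written"

# The wrapper exports the variable, then exec replaces the wrapper with the
# command passed as arguments, which inherits that environment.
via_wrapper=$(./launch.sh printenv CUDA_VISIBLE_DEVICES)
echo "via wrapper: $via_wrapper"

# The shorter form suggested in the comments: env sets the variable for
# exactly one command, with no temporary file.
via_env=$(env CUDA_VISIBLE_DEVICES=0,1 printenv CUDA_VISIBLE_DEVICES)
echo "via env: $via_env"

# Clean up
cd / && rm -rf "$workdir"
```

Both calls print 0,1, so in this sketch the wrapper and the env form have the same effect on the child process. Whether ddlrun passes an exported variable through to every process it starts is not something this demo can show. That may be the reason the tutorial uses a wrapper script, but this is a guess.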

0 Answers