I am working on a Linux cluster.
I have a shell script that uses mpirun
to submit my jobs to the cluster. In that same script, I can choose the number of nodes that will be assigned to the job. So far, so good.
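For context, my submission script is along the following lines (this is an illustrative sketch, assuming a PBS/Torque scheduler; the job name, node counts, and program name are placeholders, not my exact script):

```shell
#!/bin/sh
# Illustrative PBS/Torque job script -- all names and counts are placeholders.
#PBS -N my_job
#PBS -l nodes=4:ppn=8        # the number of nodes is chosen here
#PBS -l walltime=01:00:00

cd "${PBS_O_WORKDIR:-.}"     # run from the submission directory
mpirun -np 32 ./my_program   # 32 = nodes * ppn
```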
My issue arises afterwards: when I submit only a few jobs, everything works well; however, when I fill the nodes to capacity, some of the submitted jobs are never completed. I therefore suspect that the memory available on the cluster is not sufficient to handle all of my jobs at the same time.
This is why I want to monitor the memory usage of each job over time. For that I use the qstat -f
command, but it displays a lot of fields, most of which I cannot interpret.
So here is my question: in the sample output of the qstat -f
command below, there are two kinds of memory: mem
and vmem
. What is the difference between the two, and which one reflects the real amount of memory used?
resources_used.cput = 00:21:04
resources_used.mem = 2099860kb
resources_used.vmem = 40505676kb
resources_used.walltime = 00:21:08
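To track these two fields over time, I currently poll qstat -f and extract just the mem/vmem lines, roughly as follows (a minimal sketch; the job id 12345 and the 60 s interval are placeholders):

```shell
#!/bin/sh
# Poll qstat -f for one job and append its mem/vmem values to a log file.
# JOBID and the sleep interval are placeholders; adjust to your setup.
JOBID=12345
while qstat -f "$JOBID" > "/tmp/qstat.$$" 2>/dev/null; do
    # Each matching line looks like: "resources_used.mem = 2099860kb",
    # so the last field ($NF) is the value we want.
    awk '/resources_used.(mem|vmem)/ { printf "%s ", $NF }
         END { print "" }' "/tmp/qstat.$$" >> "mem_usage.$JOBID.log"
    sleep 60
done
rm -f "/tmp/qstat.$$"
```

The loop exits on its own once the job leaves the queue and qstat stops reporting it.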
Additionally, I would appreciate any reference where the output of this command is documented. I tried man qstat, but it does not go into the details of each returned field.