
Here is a simple experiment I tried:

Given an executable called "sim.exe" that runs a model simulation, I use MPI to launch x copies of "sim.exe" simultaneously on one node (a shared-memory system). I tried four runs with different values of x (e.g., 1, 4, 8, 16) and then checked the memory usage reported by PBS ("memory used" and "Vmem used"). For all of these runs the reported values stayed the same: "mem" = 8,432 KB and "vmem" = 489,716 KB.

My understanding, based on the posts About mem and vmem and Actual memory usage of a process, is that "mem" and "vmem" report the memory resources used by the job. So why do "mem" and "vmem" stay the same even though the number of tasks in the job increases x-fold?

All of these jobs are submitted through the PBS job scheduler. For each job, all of the node's cores and RAM are requested at submission with #PBS -l select=1:ncpus=24:mem=96GB.

Update:

I have tested Python threading as a replacement for MPI to launch x copies of "sim.exe" simultaneously: I start x threads, and each thread uses subprocess to call the "sim.exe" simulation. I again ran four experiments with x = 1, 4, 8, 16. This time the "mem" and "vmem" reported for the job increase roughly linearly with x, which is close to what I would expect.

So, is it possible that PBS does not count "mem" and "vmem" correctly in the MPI case? It looks as if PBS only counts the memory usage of a single instance.

1 Answer


I'm not sure, but as far as I remember MPI launches only one instance per node and then, after some initialisation, forks the process into the requested x copies. This means that if sim.exe does not allocate any additional memory after the fork, all memory is shared between the x copies (copy-on-write), so x has no influence on memory usage apart from a tiny overhead in the operating system to keep track of the processes.

If you load some data or allocate some memory after the fork you should see a correlation between memory usage and x.

In the threaded sub-process scenario, all x processes perform their own initialisation and therefore do not share as much memory as in the MPI scenario. (They will still share memory for libraries and similar memory-mapped I/O.)

To fully understand the behaviour, I suggest you write a small MPI program replacing sim.exe that has both a few MB of static data (e.g. a static array of some type) and some dynamically allocated memory, and then experiment with the sizes and the number of instances. My guess is that the static data is shared between parallel MPI instances on the same node, while the dynamic data (allocated after MPI forked the instances) is not. Something like the sketch below could serve as a starting point.
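Here is a minimal sketch of such a test program in C. This is my own illustration, not the asker's sim.exe; the sizes STATIC_MB and DYNAMIC_MB are arbitrary placeholders meant to be varied. It declares a static array, then allocates and touches a heap buffer after MPI_Init, and sleeps long enough for the batch system to sample the job's memory:

    /* mem_test.c -- sketch of the suggested experiment (sizes are assumptions).
     * Build:  mpicc -O2 mem_test.c -o mem_test
     * Run:    mpirun -np <x> ./mem_test   (inside the PBS job)
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define STATIC_MB  64   /* size of the static array: vary this */
    #define DYNAMIC_MB 64   /* size of the heap allocation: vary this */

    /* Static data: lives in the executable's data/bss segment. If the MPI
     * instances really are forked from one process, these pages can be
     * shared (copy-on-write) as long as they are never written. */
    static char static_data[STATIC_MB * 1024 * 1024];

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Read the static data so the pages are mapped, but do not write
         * to it (writing would break any copy-on-write sharing). */
        volatile char sink = 0;
        for (size_t i = 0; i < sizeof(static_data); i += 4096)
            sink ^= static_data[i];

        /* Dynamic data: allocated and written after MPI start-up, so each
         * rank should end up with its own private pages. */
        size_t nbytes = (size_t)DYNAMIC_MB * 1024 * 1024;
        char *dynamic_data = malloc(nbytes);
        if (dynamic_data == NULL) {
            fprintf(stderr, "rank %d: malloc failed\n", rank);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        memset(dynamic_data, rank + 1, nbytes);  /* touch every page */

        printf("rank %d of %d: %d MB static, %d MB dynamic\n",
               rank, size, STATIC_MB, DYNAMIC_MB);

        /* Sleep so PBS has time to sample the job's memory usage. */
        sleep(120);

        free(dynamic_data);
        MPI_Finalize();
        return 0;
    }

Comparing the PBS-reported "mem" and "vmem" for different combinations of x, STATIC_MB and DYNAMIC_MB should reveal which parts, if any, are shared between the instances.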

Further reading: How does copy-on-write in fork() handle multiple fork?