Parallelization Recommendation

Abdulrahman_Allangawi
Newbie
Posts: 4
Joined: Wed May 07, 2025 10:19 am

Parallelization Recommendation

#1 Post by Abdulrahman_Allangawi » Tue May 20, 2025 7:39 am

Hello all,

I am working on an HPC system with large CPU nodes: each node has two sockets, each with 96 physical cores, for a total of 192 physical cores per node.

I am struggling to find the best parallelization settings for my projects, especially for gamma-point-only calculations with 100+ atoms. As there are many variables to change, I am unsure what the usual recommended recipes are. Of course, I know that it depends on the system, but what are some general tips to follow when trying to find the optimal settings?

Below I have attached my current SLURM script. From my initial testing I found that these settings run well with "NCORE = 24; KPAR = 1". However, I am unsure whether I am utilizing all the processors in my nodes efficiently.

Also, whichever settings I find best for VASP should also be fine for VASPsol, right?


henrique_miranda
Global Moderator
Posts: 518
Joined: Mon Nov 04, 2019 12:41 pm

Re: Parallelization Recommendation

#2 Post by henrique_miranda » Tue May 20, 2025 6:51 pm

Finding the best parallelization is not always an easy task.

The performance depends not only on the hardware but also on the type of calculation you are running.
Assuming that you are running a ground-state calculation, gamma-only, and that you stick to MPI parallelism, there is really only one relevant INCAR tag: NCORE.
How to set it? If you are running the same kind of ground-state calculation multiple times (for example, a molecular dynamics run), then it is probably worth trying a few different values and timing the code. For that you can look for "LOOP: cpu time" in the OUTCAR.
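For example, a plain grep pulls these timings out of the OUTCAR (and, if I remember correctly, the "LOOP+: cpu time" lines give the totals per ionic step):

Code:

grep "LOOP" OUTCAR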
If you don't want to test it, there are a few heuristics that tend to be reasonable:

  1. choose NCORE approximately equal to the number of atoms in the system
  2. don't choose NCORE larger than the number of cores per socket

In your case you have 96 physical cores, maybe over two sockets (?); that would mean NCORE=48, for example. But I think NCORE=24 is pretty reasonable too.
Only by looking at the timings can you know for sure.
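If you do decide to test, a quick loop along these lines is one way to do it (only a sketch: the launcher, binary name, and NCORE values are placeholders for whatever your setup uses, and it assumes the INCAR already contains an NCORE line):

Code:

# time a few NCORE values and compare the last LOOP+ entry of each run
for n in 12 24 48 96; do
    sed -i "s/^NCORE.*/NCORE = $n/" INCAR
    mpirun -np 96 vasp_gam > stdout.ncore_$n
    cp OUTCAR OUTCAR.ncore_$n
    grep "LOOP+" OUTCAR.ncore_$n | tail -1
done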

About the SLURM configuration, I would actually try:

Code:

#SBATCH --ntasks-per-node=96
#SBATCH --cpus-per-task=2
export OMP_NUM_THREADS=1

In a sense this is just to avoid using multithreading. If you want to try multithreading, then you should probably use

Code:

#SBATCH --ntasks-per-node=96
#SBATCH --cpus-per-task=2
export OMP_NUM_THREADS=2
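Put together, a minimal job script would look something like this (a sketch only; module loads and the exact binary name depend on your installation):

Code:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=96
#SBATCH --cpus-per-task=2
export OMP_NUM_THREADS=1   # pure MPI; set to 2 to try multithreading
srun vasp_gam              # gamma-only binary; NCORE is read from the INCAR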

You can find more information on this page on our wiki:
https://www.vasp.at/wiki/index.php/Cate ... lelization

in particular:
https://www.vasp.at/wiki/index.php/Opti ... lelization


Abdulrahman_Allangawi
Newbie
Posts: 4
Joined: Wed May 07, 2025 10:19 am

Re: Parallelization Recommendation

#3 Post by Abdulrahman_Allangawi » Tue May 20, 2025 7:49 pm

Dear Henrique,

Thanks for the reply.

Most of my systems have 100+ atoms, and most of my calculations are ground-state optimizations, MD, and frequency calculations (IBRION = 5). I assume these follow the same basic parallelization concept, since most of the time is spent on similar SCF calculations; is this correct?

As for NCORE, is it fine if NCORE * KPAR (where KPAR is 1 for gamma-point calculations) does not equal the total number of cores in a socket?

The value of NCORE = 24 that I used is based on some experimentation with different NCORE values.

My main issue is the interpretation of the "--ntasks-per-node=X" and "--cpus-per-task=Y" options and how they relate to VASP. Is there a rule such that tasks per node * cpus per task should equal the total number of physical processors in a node? Additionally, how can one correlate adequate values of these options with the NCORE value? Do they even affect one another?

Lastly, to what extent does hyperthreading matter? Should one consider utilizing it? And are there recommended ranges, or can even large degrees of hyperthreading sometimes be applicable?

I apologize for the long list of questions, but I have gone through the linked pages and am still confused about these points. Any help is much appreciated.


henrique_miranda
Global Moderator
Posts: 518
Joined: Mon Nov 04, 2019 12:41 pm

Re: Parallelization Recommendation

#4 Post by henrique_miranda » Tue May 20, 2025 9:07 pm

Most of my systems have 100+ atoms, and most of my calculations are ground-state optimizations, MD, and frequency calculations (IBRION = 5). I assume these follow the same basic parallelization concept, since most of the time is spent on similar SCF calculations; is this correct?

Yes, exactly

As for NCORE, is it fine if NCORE * KPAR (where KPAR is 1 for gamma-point calculations) does not equal the total number of cores in a socket?

Yes. I would just try to avoid using a value larger than the number of cores in a socket, because the MPI ranks in an NCORE group need to communicate often, and if they are running on CPUs with high latency to each other's memory, performance will suffer. A particularly extreme case would be MPI ranks located on different machines, but different sockets might already deteriorate performance significantly.
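By the way, if you are unsure about the socket layout of your nodes, you can check it directly on a compute node, for example with lscpu:

Code:

# prints, among other things, "Socket(s):" and "Core(s) per socket:"
lscpu | grep -E "Socket|Core"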

My main issue is the interpretation of the "--ntasks-per-node=X" and "--cpus-per-task=Y" options and how they relate to VASP.

This question is mostly related to SLURM, so I am not really well equipped to answer, but I will try: "--ntasks-per-node=X" is the number of MPI ranks that will be spawned per node, and "--cpus-per-task=Y" is the number of CPU cores assigned to each task. They relate to VASP in the sense that the number of SLURM tasks equals the number of MPI ranks that VASP will use, which is reported in the OUTCAR and in the very first line of stdout.
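As a concrete illustration of how the two options combine (the numbers are only an example):

Code:

#SBATCH --nodes=2              # 2 nodes
#SBATCH --ntasks-per-node=96   # 96 MPI ranks per node
#SBATCH --cpus-per-task=2      # 2 cores reserved per rank
# => 2 x 96 = 192 MPI ranks in total (384 cores reserved);
#    192 is the rank count VASP reports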

Is there a rule such that tasks per node * cpus per task should equal the total number of physical processors in a node?

No, you don't need to do that. I was only assuming that you wanted to use the full node, but you don't have to.

Additionally, how can one correlate adequate values of these options with the NCORE value? Do they even affect one another?

They are only loosely correlated, through the second rule that I mentioned in the previous post and again above: NCORE should not be so large that it leads to excessive communication between MPI ranks placed on physical CPU cores that don't share the same memory.
A practical example: if your machine has 96 physical cores and 2 sockets (?), then it has 48 physical cores per socket. If you set NCORE to 96, then all the MPI ranks have to communicate at each FFT, which might be slow, especially between MPI ranks placed on different sockets.
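In numbers (assuming the ranks are placed consecutively, so that ranks 0-47 sit on one socket and ranks 48-95 on the other):

Code:

# 96 MPI ranks in total, 48 physical cores per socket
NCORE = 48   # 96/48 = 2 groups of 48 ranks; each FFT stays within one socket
#NCORE = 96  # one group of 96 ranks; every FFT crosses the socket boundary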

