Parallelization Recommendation

Abdulrahman_Allangawi
Newbie
Posts: 4
Joined: Wed May 07, 2025 10:19 am

Parallelization Recommendation

#1 Post by Abdulrahman_Allangawi » Tue May 20, 2025 7:39 am

Hello all,

I am working on an HPC system with large CPU nodes: each node has two sockets, each with 96 physical cores, for a total of 192 physical cores per node.

I am struggling to find the best parallelization settings for my projects, especially for gamma-point-only calculations with 100+ atoms. As there are many variables to change, I am unsure what the usual recommended recipes are. Of course, I know that it depends on the system, but what are some general tips to follow when trying to find the optimal settings?

Below I have attached my current SLURM script. From my initial testing I found that these settings run well with "NCORE = 24; KPAR = 1". However, I am unsure whether I am utilizing all the processors in my nodes efficiently.

Also, whichever settings I find best for VASP should also be fine for VASPsol, right?


henrique_miranda
Global Moderator
Posts: 518
Joined: Mon Nov 04, 2019 12:41 pm

Re: Parallelization Recommendation

#2 Post by henrique_miranda » Tue May 20, 2025 6:51 pm

Finding the best parallelization is not always an easy task.

The performance depends not only on the hardware but also on the type of calculation you are running.
Assuming that you are running a ground-state calculation, gamma-only, and that you stick to MPI parallelism, there is really only one relevant INCAR tag: NCORE.
How to set it? If you are running the same kind of ground-state calculation multiple times (for example, a molecular dynamics run), then it is probably worth trying a few different values and timing the code. For that you can look for "LOOP: cpu time" in the OUTCAR.
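For example, a plain grep pulls these timings out of the OUTCAR (and, if I remember correctly, the "LOOP+: cpu time" lines give the totals per ionic step):

Code:

grep "LOOP" OUTCAR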
If you don't want to test it, there are a few heuristics that tend to be reasonable:

  1. choose NCORE approximately equal to the number of atoms in the system
  2. don't choose NCORE larger than the number of cores per socket

In your case you have 96 physical cores, maybe over two sockets (?); that would mean NCORE=48, for example. But I think NCORE=24 is pretty reasonable too.
Only by looking at the timings can you know for sure.
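If you do decide to test, a quick loop along these lines is one way to do it (only a sketch: the launcher, binary name, and NCORE values are placeholders for whatever your setup uses, and it assumes the INCAR already contains an NCORE line):

Code:

# time a few NCORE values and compare the last LOOP+ entry of each run
for n in 12 24 48 96; do
    sed -i "s/^NCORE.*/NCORE = $n/" INCAR
    mpirun -np 96 vasp_gam > stdout.ncore_$n
    cp OUTCAR OUTCAR.ncore_$n
    grep "LOOP+" OUTCAR.ncore_$n | tail -1
done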

About the SLURM configuration, I would actually try:

Code:

#SBATCH --ntasks-per-node=96
#SBATCH --cpus-per-task=2
export OMP_NUM_THREADS=1

In a sense this is just to avoid using multithreading. If you want to try multithreading, then you should probably use

Code:

#SBATCH --ntasks-per-node=96
#SBATCH --cpus-per-task=2
export OMP_NUM_THREADS=2
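Put together, a minimal job script would look something like this (a sketch only; module loads and the exact binary name depend on your installation):

Code:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=96
#SBATCH --cpus-per-task=2
export OMP_NUM_THREADS=1   # pure MPI; set to 2 to try multithreading
srun vasp_gam              # gamma-only binary; NCORE is read from the INCAR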

You can find more information on this page on our wiki:
https://www.vasp.at/wiki/index.php/Cate ... lelization

in particular:
https://www.vasp.at/wiki/index.php/Opti ... lelization


Abdulrahman_Allangawi
Newbie
Posts: 4
Joined: Wed May 07, 2025 10:19 am

Re: Parallelization Recommendation

#3 Post by Abdulrahman_Allangawi » Tue May 20, 2025 7:49 pm

Dear Henrique,

Thanks for the reply.

Most of my systems have 100+ atoms, and most of my calculations are ground-state optimizations, MD, and frequency calculations (IBRION = 5). I assume these follow the same basic parallelization concept, since most of the time is spent on similar SCF calculations; is this correct?

As for NCORE, is it fine if NCORE * KPAR (where KPAR is 1 for gamma-point calculations) does not equal the total number of cores in a socket?

The value of NCORE = 24 that I used is based on some experimentation with different NCORE values.

My main issue is the interpretation of the "--ntasks-per-node=X" and "--cpus-per-task=Y" options and how they relate to VASP. Is there a rule such that tasks per node * cpus per task should equal the total number of physical processors in a node? Additionally, how can one correlate adequate values of these options with the NCORE value? Do they even affect one another?

Lastly, to what extent does hyperthreading matter? Should one consider utilizing it? And are there recommended ranges, or can even large degrees of hyperthreading sometimes be applicable?

I apologize for the long list of questions, but I have gone through the linked pages and am still confused about these points. Any help is much appreciated.


henrique_miranda
Global Moderator
Posts: 518
Joined: Mon Nov 04, 2019 12:41 pm

Re: Parallelization Recommendation

#4 Post by henrique_miranda » Tue May 20, 2025 9:07 pm

Most of my systems have 100+ atoms, and most of my calculations are ground-state optimizations, MD, and frequency calculations (IBRION = 5). I assume these follow the same basic parallelization concept, since most of the time is spent on similar SCF calculations; is this correct?

Yes, exactly

As for NCORE, is it fine if NCORE * KPAR (where KPAR is 1 for gamma-point calculations) does not equal the total number of cores in a socket?

Yes. I would just try to avoid using a value larger than the number of cores in a socket, because the MPI ranks in an NCORE group need to communicate often, and if they are running on CPUs with high latency to each other's memory, performance will suffer. A particularly extreme case would be MPI ranks located on different machines, but different sockets might already deteriorate performance significantly.
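By the way, if you are unsure about the socket layout of your nodes, you can check it directly on a compute node, for example with lscpu:

Code:

# prints, among other things, "Socket(s):" and "Core(s) per socket:"
lscpu | grep -E "Socket|Core"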

My main issue is the interpretation of the "--ntasks-per-node=X" and "--cpus-per-task=Y" options and how they relate to VASP.

This question is mostly related to SLURM, so I am not really well equipped to answer, but I will try: "--ntasks-per-node=X" is the number of MPI ranks that will be spawned per node, and "--cpus-per-task=Y" is the number of CPU cores assigned to each task. They relate to VASP in the sense that the number of SLURM tasks equals the number of MPI ranks that VASP will use, which is reported in the OUTCAR and in the very first line of stdout.
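As a concrete illustration of how the two options combine (the numbers are only an example):

Code:

#SBATCH --nodes=2              # 2 nodes
#SBATCH --ntasks-per-node=96   # 96 MPI ranks per node
#SBATCH --cpus-per-task=2      # 2 cores reserved per rank
# => 2 x 96 = 192 MPI ranks in total (384 cores reserved);
#    192 is the rank count VASP reports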

Is there a rule such that tasks per node * cpus per task should equal the total number of physical processors in a node?

No, you don't need to do that. I was only assuming that you wanted to use the full node, but you don't have to.

Additionally, how can one correlate adequate values of these options with the NCORE value? Do they even affect one another?

They are only loosely correlated, through the second rule that I mentioned in the previous post and again above: NCORE should not be so large that it leads to excessive communication between MPI ranks placed on physical CPU cores that don't share the same memory.
A practical example: if your machine has 96 physical cores and 2 sockets (?), then it has 48 physical cores per socket. If you set NCORE to 96, then all the MPI ranks have to communicate at each FFT, which might be slow, especially between MPI ranks placed on different sockets.
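In numbers (assuming the ranks are placed consecutively, so that ranks 0-47 sit on one socket and ranks 48-95 on the other):

Code:

# 96 MPI ranks in total, 48 physical cores per socket
NCORE = 48   # 96/48 = 2 groups of 48 ranks; each FFT stays within one socket
#NCORE = 96  # one group of 96 ranks; every FFT crosses the socket boundary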

