Differences in TOTEN and band gap depending on parallelisation flag

#1 Post by **sophie_weber** » Sat Jul 17, 2021 11:32 am

I am writing on behalf of myself and some other people in my old group who have run across a disconcerting feature of VASP that seems to be material-specific. For a few compounds that we have studied, the converged total energy and band gap can change by a few tenths of an eV depending on the parallelisation flags (i.e., NPAR and KPAR, as well as the total number of cores used for the calculation), even if all other inputs are identical. I spoke with an IT person for the cluster we use regarding the second material, which I have attached as an example. He said that VASP is not robust to "over-parallelisation": if you use many more cores than you have atoms, or you distribute your k-points across too many cores, you can get unreliable results for total energies and other outputs (the file FeNb3S6_parallel2.tar is the "overparallelised" case, whereas FeNb3S6_parallel1.tar uses few cores and is, presumably, more reliable in its results). However, the particularly confusing thing is that (1) there is no obvious parallelisation cutoff at which this change in energy occurs, and (2) for certain systems or calculations, the band gap and total energy remain identical no matter how many cores you use.

For example, in the third file I've attached (NPAR_inputs.zip), another person in the group found a similar discrepancy for a different system depending on whether he used NPAR=16 or NPAR=32 in an HSE calculation. However, when he runs the exact same material with PBE+U instead of HSE, keeping everything else identical, the band gaps and total energies are identical regardless of which NPAR he uses.

I wanted to ask whether this is a known feature of VASP (strong sensitivity of results to parallelisation), and whether anyone has any insight into why it depends on the material and the calculation details. It would be extremely unfortunate, and a major deficiency, if for every material studied with VASP one had to check convergence with respect to the number of cores and the parallelisation flags in the INCAR, in addition to all the other convergence tests.

#2 Post by **henrique_miranda** » Mon Jul 19, 2021 9:44 am

You posted input files for two different cases so I will answer separately for each of them:

1. The FeNb3S6_parallel1 and FeNb3S6_parallel2 files seem to have been executed with two different compilations of VASP (look for 'executed on' at the top of the OUTCAR file):

grep "executed on" FeNb3S6_parallel1/OUTCAR FeNb3S6_parallel2/OUTCAR
FeNb3S6_parallel1/OUTCAR: executed on LinuxIFC date 2021.03.25 18:19:36
FeNb3S6_parallel2/OUTCAR: executed on IFC18.0.1.163_CrayMP date 2021.03.26 05:57:08

I don't know the details of how these two versions were compiled (i.e. what libraries they are linked against, or whether they correspond to the exact same version of the code). It also looks like these two runs were done on different hardware, which can also play a role in the final results (due to different compiler optimizations).
Could you maybe re-run these two input files on the exact same compilation of VASP?

I will wait for these results and then we can discuss them further.
I should already mention that in this example you are trying to obtain the ground state of a non-collinear spin calculation.
These are known to have multiple local energy minima (related to different magnetic alignments of the spins) close in energy.
Even small numeric differences during the minimization procedure can direct your solution one way or another. These differences can come from the libraries or from the order in which operations are performed in the code (in floating-point arithmetic, addition is not associative, so summing the same terms in a different order can give a slightly different result).
You can check in your OUTCAR that you get different magnetic configurations in the end.
Now, this does not always have to be like this, i.e. we can try to find what the source of the numeric error is, keep it under control, and obtain reliable results independently of the parallelization.
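To illustrate the point about operation order, here is a minimal Python sketch (not VASP code). It sums the same four numbers once sequentially and once as two partial sums that are then combined, which is roughly what a parallel reduction does. The helper names `ordered_sum` and `chunked_sum` and the input values are made up for the illustration; an explicit loop is used instead of the built-in `sum` because recent Python versions compensate for rounding errors inside `sum`.

```python
import math


def ordered_sum(values):
    """Add the values strictly left to right in plain double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Split the values into contiguous chunks, sum each chunk separately,
    then add the partial sums (a toy model of a parallel reduction)."""
    size = math.ceil(len(values) / n_chunks)
    partials = [ordered_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return ordered_sum(partials)


# Hypothetical input chosen to make the rounding effect obvious.
values = [1e16, 1.0, -1e16, 1.0]

print(ordered_sum(values))     # one "core": 1.0
print(chunked_sum(values, 2))  # two "cores": 0.0

# The same effect with everyday numbers: the grouping changes the result.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False
print(a + b == b + a)              # True: addition is still commutative
```

The differences in a real calculation are far smaller than in this contrived input, but in a system with several nearby local minima they can be enough to steer the minimization towards a different one.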

2. Regarding the "npar_inputs" files: because of how the parallelization over bands and plane waves works in VASP, the number of bands must be divisible by the number of bands treated in parallel (NPAR). When this is not the case, VASP automatically changes NBANDS so that this condition is fulfilled.
In the two calculations you've sent, the number of bands is different (npar16->NBANDS=432 and npar32->NBANDS=448). Now, this should not play a role in the total energy provided that there are enough empty states, but maybe this is not the case in this calculation. Could you try re-running the calculation with NBANDS set explicitly, for example NBANDS=512?

The diagonalization is done with an iterative procedure, and as such the highest bands are not converged to the same precision as the lower-lying ones. If these highest bands are even residually occupied (which can happen in a metallic system), it might be enough to lead to slightly different total energies in the end.
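The NBANDS adjustment described above can be sketched in a few lines of Python. This is a simplified model that assumes NBANDS is rounded up to the next multiple of NPAR; the actual logic inside VASP may take further settings into account. The function name `adjust_nbands` and the starting value of 420 are hypothetical: the thread does not state the band count before adjustment, and 420 is simply one value that reproduces the 432 and 448 seen in the two OUTCAR files.

```python
def adjust_nbands(nbands_requested, npar):
    """Round nbands_requested up to the nearest multiple of npar.

    Simplified model of the automatic adjustment (assumption, not VASP code).
    """
    # Ceiling division using integer arithmetic only.
    n_groups = -(-nbands_requested // npar)
    return n_groups * npar


requested = 420  # hypothetical band count before adjustment

for npar in (16, 32):
    print(npar, adjust_nbands(requested, npar))
# 16 432
# 32 448

# An explicit value that is a multiple of both NPAR settings is left untouched,
# so both runs would use the same number of bands.
for npar in (16, 32):
    print(npar, adjust_nbands(512, npar))
# 16 512
# 32 512
```

This is why NBANDS=512 is a convenient test value: it is divisible by both 16 and 32, so neither run has its band count changed.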

#3 Post by **henrique_miranda** » Mon Jul 19, 2021 1:51 pm

I was searching for an unrelated topic in the forum and stumbled upon this thread, which is somewhat related to your questions:
forum/viewtopic.php?f=4&t=2184

#4 Post by **sophie_weber** » Tue Jul 20, 2021 8:49 am

Thank you both so much for your help, I really appreciate it.

As for the first input files I sent (FeNb3S6), I should have clarified, I'm sorry. You are right that the two calculations I sent are from different clusters. However, they are the same VASP version (5.4.4) and same build, and moreover I reran the calculations with fewer cores on the same cluster as the calculation with more cores, using the same parallelisation settings, and got the exact same results. So it is not an issue with different VASP compilations.

I think you might be correct that the energy differences come from very small differences in the final magnetization. Indeed, in the FeNb3S6_parallel1 case a very slight non-collinear canting develops, which is not the case for FeNb3S6_parallel2. However, this canting is significantly less than 1 degree, so it was shocking to me that this could truly be the source of a >100 meV energy difference. Also, interestingly enough, when I turn off symmetries (ISYM=-1), this canting does not develop and the energy is very close to that of FeNb3S6_parallel2. I would have thought that if the energy difference truly came from local minima in the magnetization state, then with all symmetries turned off I should naturally converge to the global minimum (the one with the <1 degree canting). But this did not seem to be the case, and as such I thought the energy difference was perhaps not physical. But it sounds like maybe I was wrong.
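For anyone who wants to quantify a canting like this, a minimal Python sketch is given below. It computes the angle between two magnetic moment vectors, for example the same site in two different runs. The function name `canting_angle_deg` and the moment values are hypothetical and are not taken from the attached OUTCAR files.

```python
import math


def canting_angle_deg(m1, m2):
    """Angle in degrees between two 3-component moment vectors.

    Uses atan2(|m1 x m2|, m1 . m2), which stays accurate for very small
    angles where acos of the normalized dot product loses precision.
    """
    cross = (
        m1[1] * m2[2] - m1[2] * m2[1],
        m1[2] * m2[0] - m1[0] * m2[2],
        m1[0] * m2[1] - m1[1] * m2[0],
    )
    cross_norm = math.sqrt(sum(c * c for c in cross))
    dot = sum(a * b for a, b in zip(m1, m2))
    return math.degrees(math.atan2(cross_norm, dot))


# Hypothetical moments in Bohr magnetons: one collinear, one slightly canted.
collinear = (0.0, 0.0, 3.0)
canted = (0.03, 0.0, 3.0)

print(round(canting_angle_deg(collinear, canted), 3))  # 0.573
```

In this made-up example a transverse component of only 1% of the moment already corresponds to a canting of roughly half a degree.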

Thank you very much for the hint about NBANDS. I'll ask my colleague to try rerunning his calculations with the number of bands fixed. However, if I recall correctly, his system is insulating, not metallic. Have you ever run into a case where the number of bands makes a difference to the energy of an insulating system?

Thank you again, and I'll update you once I have more information about the second calculation.

#5 Post by **henrique_miranda** » Wed Jul 21, 2021 11:36 am

FeNb3S6_parallel1 and FeNb3S6_parallel2:
Just to be sure I understand correctly: after running the same system with the same version of VASP and changing only the parallelization, you still reach different final magnetic states. Then, if you set ISYM=-1 (to not enforce symmetries), you get the same final result regardless of the parallelization. Is that so?
If this is the case then it warrants some investigation from my side.
I will try to reproduce this behavior.
Could you share the output files for these two runs (on the same hardware and compilation)? This is just to try to ensure that I replicate the same conditions.

"npar_inputs":
This is a good point indeed.
If the system is insulating I do not expect to see any difference depending on the number of bands.
Looking again at the OUTCAR, I see that there is indeed a clear transition between fully occupied and empty states, so a lack of empty bands is probably not the reason why you get different results in this situation.
Before starting to speculate I would like to see the results with NBANDS=512.
If those lead to the same final result I might have an idea of what is going on.
An additional test you can do is to restart the calculations (for npar16 and npar32) from the same pre-existing WAVECAR.
