BUG when running supercell SCF calculation



jbackman
Newbie
Posts: 24
Joined: Thu Nov 26, 2020 10:27 am

BUG when running supercell SCF calculation

#1 Post by jbackman » Tue Nov 08, 2022 3:55 pm

I'm using Phonopy to calculate the phonon properties of WS2. As part of this I run a set of SCF calculations on displaced supercells. Some of these calculations converge and finish, but others freeze: VASP keeps running and does not crash, yet nothing happens. This is the case when I use ALGO=Normal.
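
For reference, the displaced supercells come from the standard Phonopy finite-displacement workflow, roughly like this (directory names are only placeholders; the 9x9x1 dimension corresponds to my supercell):

phonopy -d --dim="9 9 1"   # generates POSCAR-001, POSCAR-002, ... with displaced atoms
# one VASP SCF run per displaced supercell, then collect the forces:
phonopy -f disp-001/vasprun.xml disp-002/vasprun.xml ...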

If I instead use ALGO=Fast, the calculation finishes, but I get warnings like "EDDRMM: call to ZHEGV failed, returncode = 6 3 9". I know this can happen with ALGO=Fast, which is why I use ALGO=Normal.

To try to avoid both problems I switched to ALGO=All. However, with this setting I instead get a BUG report from VASP:

!BUG!
internal error in: rot.F at line: 801
EDWAV: internal error, the gradient is not orthogonal 2 1 -5.223e-4
If you are not a developer, you should not encounter this problem.
Please submit a bug report.

I have attached my input and output files (data.tar.gz). The error can be seen in VASP.err.
Best,
Jonathan Backman

henrique_miranda
Global Moderator
Posts: 506
Joined: Mon Nov 04, 2019 12:41 pm

Re: BUG when running supercell SCF calculation

#2 Post by henrique_miranda » Wed Nov 09, 2022 8:50 am

Ok, this calculation is rather expensive because of the way it is set up.
It is hard for me to reproduce this issue and track it down.
In fact, the hangup might simply be caused by the calculation running out of memory on your machine at some point, or starting to swap.

I would recommend that you review your setup based on the recommendations here https://www.vasp.at/wiki/index.php/Phon ... ical_hints and check if the issue persists.
A few points that stand out to me:
- You are setting ADDGRID=.TRUE., which is detrimental to performance and brings little benefit.
- Note that KPAR does not distribute memory, which means that you are doubling your memory requirements.
- In output_all you are starting from a previously existing WAVECAR file, which might explain the BUG in rot.F if this file is somehow corrupt.
- You might be able to reduce ENCUT: converge the phonon frequencies at Gamma w.r.t. ENCUT.
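
To make these points concrete, an INCAR along these lines is what I have in mind (the numbers are only placeholders; test them against your own convergence checks):

ADDGRID = .FALSE.   ! the default; only enable after testing that it actually helps
KPAR    = 1         ! increase only if memory allows, since KPAR does not distribute memory
ENCUT   = 400       ! placeholder value: converge the Gamma-point phonon frequencies w.r.t. ENCUT
ISTART  = 0         ! start from scratch instead of reading a possibly corrupt WAVECAR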

Hope this helps!

jbackman
Newbie
Posts: 24
Joined: Thu Nov 26, 2020 10:27 am

Re: BUG when running supercell SCF calculation

#3 Post by jbackman » Wed Nov 09, 2022 10:09 am

I use ADDGRID = .TRUE. since, as far as I know, this is recommended for phonon calculations. Are you saying this makes no difference and should not be used?

I use KPAR following the recommendations for the OpenACC GPU port: https://www.vasp.at/wiki/index.php/Open ... rt_of_VASP

Yes, I restart from the ALGO = Fast run. That run finished, but with the warnings I mentioned.

However, the original problem is that the calculation freezes with ALGO = Normal. As I said, this only happens for certain displacements. Other calculations with the same number of irreducible k-points and the same memory requirements have no problems. It also only happens for WS2; it does not happen for MoS2 with the exact same settings.

henrique_miranda
Global Moderator
Posts: 506
Joined: Mon Nov 04, 2019 12:41 pm

Re: BUG when running supercell SCF calculation

#4 Post by henrique_miranda » Wed Nov 09, 2022 2:33 pm

The suggestion related to ADDGRID is that users should do their own testing instead of enabling it by default (this recommendation was changed relatively recently):
https://www.vasp.at/wiki/index.php/ADDGRID
Setting ADDGRID=.TRUE. increases the computational effort, but the quality of the results might not improve significantly.

No problem with using KPAR, of course; my comment was for the case that you were running out of memory, but now I see that this is unlikely to be the problem :D

You are also using a very tight value for EDIFF
https://www.vasp.at/wiki/index.php/EDIFF
EDIFF=1e-8 should be more than sufficient.

Last but not least: it seems you are using 80 MPI ranks and only 1 GPU.
Was the code compiled with NCCL?
In that case you should use 1 MPI rank per GPU.
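
For example, on a node with 4 GPUs a launch along these lines is what I have in mind (the numbers are purely illustrative and the exact launcher syntax depends on your machine):

export OMP_NUM_THREADS=8   # fill the remaining CPU cores with OpenMP threads
mpirun -np 4 vasp_std      # one MPI rank per GPU on a 4-GPU node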

jbackman
Newbie
Posts: 24
Joined: Thu Nov 26, 2020 10:27 am

Re: BUG when running supercell SCF calculation

#5 Post by jbackman » Wed Nov 09, 2022 2:58 pm

Thanks for your comment.
I use EDIFF = 1e-10 since that is what is needed to obtain low forces when relaxing the unit cell. However, since I now have a 9x9 supercell, I could loosen this to 9*9*1e-10 = 8.1e-9, which is close to the 1e-8 you suggested.

It should be compiled with NCCL. I have also noticed that it always says "1 GPUs detected"; I assumed that was just how the code worked. I try to follow the recommendations for the OpenACC GPU port, where it says to set KPAR to the number of GPUs. However, since I have a large system with few k-points (5 irreducible in this case), I set KPAR = 5 and then use a set of nodes (GPUs) for each k-point: in this case 16 nodes per k-point, for a total of 80 nodes. If I try with 1 or only a few nodes (GPUs) per k-point I run out of memory.
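
For clarity, the job is launched roughly like this (the scheduler options here are only indicative of the layout, not my exact script):

srun -N 80 --ntasks-per-node=1 --gpus-per-node=1 vasp_std   # 80 nodes, 1 GPU each
# with KPAR = 5 in the INCAR: 5 k-point groups x 16 GPUs per group = 80 GPUs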

Should the code show "80 GPUs detected"? That is the behavior of the NCCL-compiled version of VASP we have access to.

Lastly, I tried to rerun the calculation without restarting from an old WAVECAR and I still get the same BUG message. I also tried another displacement, which has 3 irreducible k-points (so 48 nodes), and the problem is the same: ALGO = All gives the BUG and ALGO = Normal freezes.

henrique_miranda
Global Moderator
Posts: 506
Joined: Mon Nov 04, 2019 12:41 pm

Re: BUG when running supercell SCF calculation

#6 Post by henrique_miranda » Mon Nov 14, 2022 2:07 pm

A further update regarding your question.

You are using a hybrid OpenMP and OpenACC version. In that case, it can happen that the number of GPUs detected is only reported for the first MPI rank.
This will be changed in the next VASP release so that all the GPUs are counted.
So the fact that only 1 GPU is reported as detected is not related to the problem you are observing.

The hangup you observe with ALGO=Normal and the BUG message with ALGO=All are probably both related to the overly tight EDIFF, which can cause problems in the iterative diagonalization.
You might solve the problem by increasing EDIFF to 1e-8 or 1e-7 without big consequences for the quality of the final results.
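
Concretely, something to test (the values are suggestions, not a verified fix):

EDIFF = 1E-8    ! or 1E-7; loosened from your current 1E-10
ALGO  = Normal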

Could you also try running the INCAR with EDIFF=1e-10 and ALGO=Normal on CPUs only?
