VASP 6.2.1 sporadically hangs for hybrid calculations using multiple nodes
- Newbie
- Posts: 14
- Joined: Mon Feb 15, 2021 9:42 am
VASP 6.2.1 sporadically hangs for hybrid calculations using multiple nodes
Dear VASP developers,
Here is a minimal calculation that causes VASP 6.2.1 to sporadically hang in the middle of an SCF iteration. The more nodes we use, the more likely the calculation is to hang (with 4 nodes and a total of 96 MPI processes it hangs about 80% of the time); in the cases where it does not hang, it converges nicely. The calculation is a 2x2x2 supercell of GaAs with a k-grid consisting of just the Gamma point, using HSE and reading in a PBE WAVECAR as a starting point.
We think this is related to the hybrid functional, because we do not see the problem when using PBE. We have tried various Intel compilers (including Intel 2019); they change the percentage of calculations that hang but never fully remove the problem. Our makefile.include is included in the attached zip file. We have also run the test suite and found that all calculations passed successfully.
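For reference, the setup described above corresponds to input files roughly along these lines (a sketch with assumed values; the actual files are in the attached zip and may differ):

    INCAR (hybrid step, restarting from the PBE WAVECAR):
      ISTART   = 1          ! read the existing WAVECAR
      LHFCALC  = .TRUE.     ! switch on exact exchange (hybrid functional)
      HFSCREEN = 0.2        ! HSE06 screening parameter
      ALGO     = Damped     ! a common SCF algorithm choice for hybrids
      EDIFF    = 1E-6

    KPOINTS (Gamma point only):
      Gamma point only
      0
      Gamma
      1 1 1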
We appreciate any help identifying the source of this problem.
Sincerely,
Guy
- Global Moderator
- Posts: 506
- Joined: Mon Nov 04, 2019 12:41 pm
Re: VASP 6.2.1 sporadically hangs for hybrid calculations using multiple nodes
Thank you for the bug report.
We are trying to reproduce this issue on our side, but it is unlikely that we will see it.
In the meantime, there are a couple of things you could try that might help us narrow down where the problem lies:
1. Try compiling the code with "-g -traceback -debug extended"; run it, kill it when it hangs, and post the traceback here (a makefile.include sketch for adding these flags is given below).
2. Try compiling with Open MPI and see whether the problem persists.
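For point 1, a minimal way to add the debug flags, assuming an Intel-style makefile.include that collects the compiler options in FFLAGS (the variable layout in your own file may differ):

    # makefile.include excerpt (hypothetical; adapt to your own file)
    FFLAGS  = -assume byterecl -w
    # keep debug symbols and enable tracebacks so the hang location is reported
    FFLAGS += -g -traceback -debug extended

A full rebuild is needed afterwards so that every object file picks up the new flags.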
- Newbie
- Posts: 14
- Joined: Mon Feb 15, 2021 9:42 am
Re: VASP 6.2.1 sporadically hangs for hybrid calculations using multiple nodes
Hi Henrique,
I compiled with Open MPI and I still get the same problem. Attached are the tracebacks for both the Open MPI version and the MPI-only version; as you can see, they hang in the same location. Are you able to reproduce the error?
Best,
Guy
- Global Moderator
- Posts: 506
- Joined: Mon Nov 04, 2019 12:41 pm
Re: VASP 6.2.1 sporadically hangs for hybrid calculations using multiple nodes
Thank you for the traceback.
This makes it clearer where the problem might be.
We have encountered issues with non-blocking communication when using some MPI versions.
There is a toggle in mpi.F that you can try to uncomment to check whether it solves your problem:
!#define MPI_avoid_bcast
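To spell that out: "uncommenting" means removing the leading "!" so that the preprocessor actually sees the directive, after which the code has to be recompiled (a sketch; the exact position of this line inside mpi.F may differ):

    ! before: the leading "!" hides the directive from the preprocessor
    !#define MPI_avoid_bcast
    ! after: macro enabled
    #define MPI_avoid_bcast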
- Newbie
- Posts: 14
- Joined: Mon Feb 15, 2021 9:42 am
Re: VASP 6.2.1 sporadically hangs for hybrid calculations using multiple nodes
Hi Henrique,
I uncommented #define MPI_avoid_bcast; however, VASP still hangs 80% of the time. The traceback indicates that the calculation gets stuck at the same location in the code. What is the next thing we can try?
Thanks,
Guy
- Global Moderator
- Posts: 236
- Joined: Mon Apr 26, 2021 7:40 am
Re: VASP 6.2.1 sporadically hangs for hybrid calculations using multiple nodes
Hello Guy,
I could reproduce the hang-ups with VASP 6.2.1 on a machine with 44 cores and the latest Intel compiler (2021.4). The problem seems to come from non-blocking MPI communication, for which we have had similar issues before. In the past the "culprit" was MPI_Ibcast, and we could even write a little reproducer code snippet that strongly indicates a problem with the Intel compiler/MPI. In your case there seems to be a similar issue with MPI_Ireduce. In any case, the upcoming VASP version will avoid these calls by default (usually without loss of performance), and we will re-evaluate at a later time whether non-blocking global communication calls work reliably.
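For illustration, a reproducer of this kind (a sketch of the pattern involved, not the actual snippet mentioned above) simply issues many non-blocking collectives and waits on each one; on the affected compiler/MPI combinations such loops have been observed to get stuck:

    program ireduce_hang_sketch
      use mpi
      implicit none
      integer :: ierr, rank, i, req
      real(8) :: sendbuf(1024), recvbuf(1024)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      sendbuf = real(rank, 8)

      ! repeatedly reduce onto rank 0 using the non-blocking variant
      do i = 1, 100000
         call MPI_Ireduce(sendbuf, recvbuf, size(sendbuf), MPI_DOUBLE_PRECISION, &
                          MPI_SUM, 0, MPI_COMM_WORLD, req, ierr)
         call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)
      end do

      if (rank == 0) print *, 'finished without hanging'
      call MPI_Finalize(ierr)
    end program ireduce_hang_sketch

Built with the Intel toolchain (e.g. mpiifort) and run across several nodes, such a test either finishes quickly or, on the affected setups, hangs inside the non-blocking collective.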
So, at this point I have two potential solutions for you:
(1) Either you wait a few more days until the upcoming release and try directly with the newest version of VASP,
(2) or you copy the attached mpi.F into your VASP 6.2.1 src directory and recompile the whole code with the Intel compiler.
In my case both options worked; I hope one of them solves the issue for you too! Could you please test it and report back? Thanks!
All the best,
Andreas Singraber
- Newbie
- Posts: 14
- Joined: Mon Feb 15, 2021 9:42 am
Re: VASP 6.2.1 sporadically hangs for hybrid calculations using multiple nodes
Thank you very much, Andreas! The new mpi.F file fixed the problem, and we even notice a ~10% speed-up.
Best,
Guy
- Global Moderator
- Posts: 236
- Joined: Mon Apr 26, 2021 7:40 am
Re: VASP 6.2.1 sporadically hangs for hybrid calculations using multiple nodes
Hi!
Great :-), thank you for reporting back!
Best,
Andreas