strange behavior of PAW double counting energy
-
- Newbie
- Posts: 4
- Joined: Sat Nov 16, 2019 8:10 pm
strange behavior of PAW double counting energy
I recently noticed a very strange behavior. I submitted the same job (all input files exactly the same) to one node, two nodes, and three nodes. The cluster I am running on has 64 cores per node. The results for one and two nodes agree apart from small numerical differences, but the total energy from three nodes came out as NaN. I looked at the OUTPUT and realized that this is because the PAW double counting energy (PAWAE) is NaN. All the other energy terms are the same as in the one- and two-node runs. This is the only difference; everything else, including forces, temperature, etc., is also the same.
It is really weird and I do not know why. Could you please advise? Thanks a lot. All the files are attached.
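For anyone who wants to locate such a term without reading the whole output by hand, here is a minimal Python sketch that reports every line containing a NaN or Inf value. The function name `find_nonfinite` and the sample lines are made up for illustration; they are not taken from VASP or from the attached files.

```python
import math

def find_nonfinite(lines):
    """Return (line_number, text) for every line holding a NaN or Inf token."""
    hits = []
    for number, text in enumerate(lines, start=1):
        # Treat "=" as a separator so "label=NaN" is also caught.
        for token in text.replace("=", " ").split():
            try:
                value = float(token)
            except ValueError:
                continue  # not a number, e.g. part of a label
            if not math.isfinite(value):
                hits.append((number, text.rstrip()))
                break  # one report per line is enough
    return hits

# Made-up sample lines, not copied from a real run.
sample = [
    "term A = -12.5",
    "term B = NaN",
    "total = NaN",
]
for number, text in find_nonfinite(sample):
    print(number, text)
```

To scan a real file, the same function can be fed an open file handle, e.g. `find_nonfinite(open(path))`, where `path` is whatever output file you want to check.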
-
- Global Moderator
- Posts: 506
- Joined: Mon Nov 04, 2019 12:41 pm
- Contact:
Re: strange behavior of PAW double counting energy
This is indeed very strange.
Unfortunately, I was not able to reproduce this issue locally using the latest version of VASP compiled with a gcc toolchain.
Could you give me some more information about your compilation?
I would like to see the 'makefile.include' as well as which compiler and libraries were used if possible.
It is very difficult to track down the issue if I cannot reproduce it locally.
-
- Newbie
- Posts: 4
- Joined: Sat Nov 16, 2019 8:10 pm
Re: strange behavior of PAW double counting energy
Thank you. Please see the makefile.include below. Please note that I'm using the Cray LibSci math library and the Cray ftn compiler, which link the LAPACK, BLAS, and FFTW libraries automatically. Furthermore, I'm using Open MPI 4.1.1, built with the Cray compilers.
Code:
# Precompiler options
CPP_OPTIONS= -DHOST=\"LinuxGNU\" \
-DMPI -DMPI_BLOCK=65536 -Duse_collective \
-DCACHE_SIZE=8000 \
-Davoidalloc \
-Dvasp6 \
-Duse_bse_te \
-Dtbdyn \
-Dfock_dblbuf
CPP = gcc -E -P -C -w $*$(FUFFIX) >$*$(SUFFIX) $(CPP_OPTIONS)
#CPP = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)
#CPP = cpp --traditional -P $(CPP_OPTIONS) $*$(FUFFIX) $*$(SUFFIX)
#FC = /data/users/yingma/dir.src/openmpi/bin/mpif90
#FCL = /data/users/yingma/dir.src/openmpi/bin/mpif90
FC = mpif90
FCL = mpif90
FREE = -ffree
FFLAGS = -dC -rmo -emEb -hnoomp -N1023
OFLAG = -O3
OFLAG_IN = $(OFLAG)
DEBUG = -O0
BLAS =
LAPACK =
BLACS =
SCALAPACK =
LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS)
#FFTW = /opt/cray/pe/fftw/3.3.8.8/x86_rome
#LLIBS += -L$(FFTW)/lib -lfftw3
#INCS = -I$(FFTW)/include
OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = cc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB = $(FREE)
OBJECTS_LIB= linpack_double.o getshmem.o
# For the parser library
CXX_PARS = CC
LLIBS += -lstdc++
# Normally no need to change this
SRCDIR = ../../src
BINDIR = ../../bin
#================================================
# GPU Stuff
CPP_GPU = -DCUDA_GPU -DRPROMU_CPROJ_OVERLAP -DCUFFT_MIN=28 -UscaLAPACK -Ufock_dblbuf # -DUSE_PINNED_MEMORY
OBJECTS_GPU= fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d_gpu.o fftmpiw_gpu.o
CC = cc
CXX = CC
CFLAGS = -fPIC -DADD_ -openmp -DMAGMA_WITH_MKL -DMAGMA_SETAFFINITY -DGPUSHMEM=300 -DHAVE_CUBLAS
# Minimal requirement is CUDA >= 10.X. For "sm_80" you need CUDA >= 11.X.
CUDA_ROOT ?= /usr/local/cuda
NVCC := $(CUDA_ROOT)/bin/nvcc
CUDA_LIB := -L$(CUDA_ROOT)/lib64 -lnvToolsExt -lcudart -lcuda -lcufft -lcublas
GENCODE_ARCH := -gencode=arch=compute_60,code=\"sm_60,compute_60\" \
-gencode=arch=compute_70,code=\"sm_70,compute_70\" \
-gencode=arch=compute_80,code=\"sm_80,compute_80\"
MPI_INC = /opt/gnu/ompi-3.1.4-GNU-5.4.0/include
-
- Global Moderator
- Posts: 506
- Joined: Mon Nov 04, 2019 12:41 pm
- Contact:
Re: strange behavior of PAW double counting energy
Thanks for the makefile.include.
If I remember correctly, we encountered some issues when compiling the latest distributed version of VASP with the latest Cray compilers.
We are working to integrate the necessary changes in the next release.
Which version of the Cray compiler are you using?
Code:
mpif90 -V
Did you modify any part of the VASP source code in order to compile it with the Cray compilers?
If so, what modifications did you make?
-
- Newbie
- Posts: 4
- Joined: Sat Nov 16, 2019 8:10 pm
Re: strange behavior of PAW double counting energy
We are using Cray compiler version 11.0.3. Yes, I did make some (minor) changes to the source code, including:
in scpc.F, access=append needs to be changed to position=append (in a few places)
in minimax.F, remove the comma in WRITE(*,1), on lines 2435 and 2607.
in pade_fit.F, SUBROUTINE PADE_SVD_EVAL, the variable name Q(:) needs to be changed
in addition, fast_aug.F does not compile at the -O2 optimization level, but is fine with -O1.
If building hybrid MPI/OpenMP, the OpenMP parallel do on line 1154 and 115 needs to be commented out in hamil_lrf.F. Not sure why.
-
- Global Moderator
- Posts: 506
- Joined: Mon Nov 04, 2019 12:41 pm
- Contact:
Re: strange behavior of PAW double counting energy
OK, all these changes sound familiar.
Unfortunately, I was not able to reproduce this issue on our local machines.
We have no access to Cray hardware and compilers, so it is difficult to reproduce this issue (in particular for a job running on 192 cores).
There are a few steps that I would take to track down this issue (in this order):
1. Recompile the code with progressively lower optimization levels (-O3 is not recommended and is normally not much better than -O2; if -O2 still produces errors, try -O0).
2. Run the VASP test suite and report any failed tests.
3. Try trapping floating-point exceptions (search "trap=" in https://www.nersc.gov/assets/Documentat ... ayftn.html).
4. Try compiling the same source (after the modifications you made for Cray) with a different compiler.
5. Try to reproduce this issue on a smaller system with fewer MPI ranks.
If none of these steps helps, the best option for tracking down this issue would be for you to somehow grant us access to the machine where you observed the problem.
-
- Newbie
- Posts: 4
- Joined: Sat Nov 16, 2019 8:10 pm
Re: strange behavior of PAW double counting energy
It looks like the problem is related to variables that are never given an initial value. I recompiled VASP with the -e0 option, which initializes all undefined variables to zero, and the strange behavior is gone. Some other compilers set undefined variables to zero by default; the Cray compiler does not.
That said, I don't know which variables lead to this behavior.
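To illustrate how an unset variable could produce a NaN only at a certain rank count, here is a small Python sketch. It is purely hypothetical and not a diagnosis of the VASP code: the function `rank_partials`, the round-robin work distribution, and the choice of 128 work items are all assumptions made up for the example. It models a per-rank energy that is assigned only when the rank has work, followed by a sum over ranks.

```python
import math

def rank_partials(n_ranks, n_items, uninitialized):
    """Per-rank energy contributions under a round-robin distribution.

    `uninitialized` stands in for whatever the variable holds before it
    is assigned: NaN models garbage memory, 0.0 models zero-initialization.
    """
    partials = []
    for rank in range(n_ranks):
        my_items = range(rank, n_items, n_ranks)
        energy = uninitialized             # value before any assignment
        if len(my_items) > 0:
            energy = float(len(my_items))  # assigned only when the rank has work
        partials.append(energy)
    return partials

# Hypothetical setup: 128 work items on 64, 128 and 192 ranks.
for n_ranks in (64, 128, 192):
    garbage = sum(rank_partials(n_ranks, 128, float("nan")))
    zeroed = sum(rank_partials(n_ranks, 128, 0.0))
    print(n_ranks, garbage, zeroed)
```

In this toy model the sum only breaks once there are more ranks than work items, because the idle ranks contribute their unset value. Zero-initialization hides the problem rather than fixing the missing assignment.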
-
- Global Moderator
- Posts: 506
- Joined: Mon Nov 04, 2019 12:41 pm
- Contact:
Re: strange behavior of PAW double counting energy
Thanks a lot for this information!
This gives us a good hint to try to identify what causes this problem and fix it in the code.
In the meantime, it might be a good idea to keep initializing the values to zero through the compiler option.