MLFF Rocks!

rwindiks
Newbie
Posts: 9
Joined: Tue Oct 25, 2016 9:55 am
Location: Switzerland

MLFF Rocks!

#1 Post by rwindiks » Fri Aug 11, 2023 7:13 am

Dear VASP developers,

I want to congratulate the VASP team on MLFF - it is spectacular - it is a pleasure and an honour to partner with the world's leading innovators in density functional theory and related methodology! MLFF is a major step forward both for simple MD simulations and for advanced MD techniques such as metadynamics, the blue-moon ensemble approach, and constrained MD simulations. With MLFF it is possible to perform MD simulations that capture mechanisms and processes on time scales of up to 100 ps, and even ns, for compounds with up to 8 (eight!) elements at the accuracy of DFT methods!

Thanks a lot and please keep up your excellent work!

Rene
www.materialsdesign.com

alex
Hero Member
Posts: 586
Joined: Tue Nov 16, 2004 2:21 pm
License Nr.: 5-67
Location: Germany

Re: MLFF Rocks!

#2 Post by alex » Fri Aug 11, 2023 11:30 am

Hi Rene,

I just wonder how much memory you have in your machine for that kind of training. Would you share this information, please?

Thanks & best regards,

alex

rwindiks
Newbie
Posts: 9
Joined: Tue Oct 25, 2016 9:55 am
Location: Switzerland

Re: MLFF Rocks!

#3 Post by rwindiks » Fri Aug 11, 2023 4:43 pm

Hi alex,

for on-the-fly training of a structure containing 6 different chemical elements, the total memory demand is approx. 100-120 GB, with at most 3000 different local reference configurations considered per element.
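
Just for context, a minimal INCAR sketch of such an on-the-fly training setup might look as follows (my own illustration, not Rene's actual input; tag names as in VASP 6.4, and the MD settings are placeholders):

    ML_LMLFF = .TRUE.   ! switch on machine-learned force fields
    ML_MODE  = train    ! on-the-fly training during ab-initio MD
    ML_MB    = 3000     ! max. local reference configurations per element
    IBRION   = 0        ! molecular dynamics
    NSW      = 10000    ! number of MD steps (placeholder)
    POTIM    = 1.5      ! time step in fs (placeholder)
    EDIFF    = 1E-6     ! tight SCF convergence for the ab-initio steps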

Thanks.

Rene

thanh-nam_huynh
Newbie
Posts: 4
Joined: Wed Mar 23, 2022 2:54 pm

Re: MLFF Rocks!

#4 Post by thanh-nam_huynh » Sat Aug 12, 2023 2:10 am

Hi Rene,

I would appreciate it if you could share how you deal with such a large number of local configurations (~18000 in total, if I calculated it right) using only 100-120 GB of memory.
In my system, I have 3 elements and approximately 6000 local configurations, and it already requires ~240 GB of memory.

Thanks,
Nam

rwindiks
Newbie
Posts: 9
Joined: Tue Oct 25, 2016 9:55 am
Location: Switzerland

Re: MLFF Rocks!

#5 Post by rwindiks » Mon Aug 14, 2023 9:06 am

Hi Nam,

why do you need 6000 configurations per element, especially if your system has only 3 elements? Surely you can create an MLFF with RMSEs for the energies and forces of <0.1 meV/atom and <0.1 eV/Ang, respectively, with around 3000 different local configurations per element! However, that presumes that all the SCF (DFT) steps in the MLFF on-the-fly training procedure converged with an energy criterion of 0.000001 eV (1e-6 eV)!

For your orientation, attached is a snippet of a representative ML_LOGFILE that summarizes the MLFF options I usually use for the on-the-fly training procedure.

Thanks.

Rene

ferenc_karsai
Global Moderator
Posts: 460
Joined: Mon Nov 04, 2019 12:44 pm

Re: MLFF Rocks!

#6 Post by ferenc_karsai » Mon Aug 14, 2023 9:17 am

A remark:
The design matrix, which is by far the most memory-consuming part, depends not only on the number of local reference configurations (more precisely, at most ML_MB x number of element types) but also on the number of atoms per training structure (times 3, because of the three components of the forces). Fortunately, this second dimension doesn't depend on the number of element types.

I guess thanh-nam_huynh has a larger number of atoms in the training structures.

PS: Don't forget that if you have the resources you can always run on multiple nodes, which decreases the required memory per core, since the design matrix is block-cyclically distributed.
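
As a rough back-of-the-envelope estimate of the design-matrix size (my own sketch, assuming each training structure contributes one energy row, 3 x N_atoms force rows, and 6 stress rows, stored as 8-byte reals):

    rows    ~ N_structures x (1 + 3 x N_atoms + 6)
    columns ~ ML_MB x N_element_types
    memory  ~ rows x columns x 8 bytes

    e.g. 1000 structures of 50 atoms:  rows    ~ 157,000
         ML_MB = 3000 and 6 elements:  columns = 18,000
         => ~ 157,000 x 18,000 x 8 B  ~ 23 GB for the design matrix alone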

rwindiks
Newbie
Posts: 9
Joined: Tue Oct 25, 2016 9:55 am
Location: Switzerland

Re: MLFF Rocks!

#7 Post by rwindiks » Mon Aug 14, 2023 9:40 am

Then I should add that the structures used for the MLFF training procedure have between 40 and 60 atoms, which is more than sufficient to generate a reasonably accurate MLFF for all kinds of different local environments. In the case of dense solids, structures with even 20-30 atoms are sufficient. The generated MLFF can then be applied to structures with 1000 atoms or more, provided the fast descriptors are used.
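
For illustration, the corresponding force-field-only production run could be set up roughly like this (again a sketch, not Rene's actual input; ML_MODE = run selects the fast prediction mode in VASP 6.4 and reads the force field from ML_FF, and the MD settings are placeholders):

    ML_LMLFF = .TRUE.    ! use the machine-learned force field (read from ML_FF)
    ML_MODE  = run       ! prediction only, fast descriptors, no ab-initio steps
    IBRION   = 0         ! molecular dynamics
    NSW      = 1000000   ! MD steps; ns time scales become affordable (placeholder)
    POTIM    = 1.5       ! time step in fs (placeholder)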

thanh-nam_huynh
Newbie
Posts: 4
Joined: Wed Mar 23, 2022 2:54 pm

Re: MLFF Rocks!

#8 Post by thanh-nam_huynh » Mon Aug 14, 2023 1:37 pm

Hi Rene and Ferenc,

Thanks for all the helpful information.

To make it clear, I have a system with 40 atoms and about 2000 local reference configurations per element. With this, I am able to obtain an MLFF with the RMSEs that Rene mentioned (and, of course, an EDIFF of 1e-6 eV was used). I am happy with the RMSE in energy, but am still a bit greedy to improve the accuracy of the forces.

Regarding using multiple nodes to reduce the memory per core, I made some tests with different numbers of nodes on two VASP binaries (compiled with and without -Duse_shmem). What I observed is that with the shared-memory binary the required memory decreases almost linearly with the number of nodes, while in the other case the decrease is slight: only the memory needed for "FMAT for basis" and "SVD matrices" decreases proportionally. I'm aware that the use of shared memory is recommended in some other topics in this forum.
However, when using the shared-memory binary, I faced an issue where the MLFF running on 2 nodes gave wrong predictions for energies and forces, while the MLFF on a single node performed well. (The settings were completely identical except for the number of nodes and cores used.)
So I wonder whether this could potentially be a bug or whether there is something wrong with my VASP binary. I would also like to know whether Rene is using a shared-memory binary as well and has had similar issues.
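
For reference, -Duse_shmem is one of the precompiler options set in makefile.include; a minimal sketch of the relevant line (the other options depend on your toolchain and are omitted here):

    # in makefile.include (keep the remaining options from your template)
    CPP_OPTIONS = -DMPI -Duse_shmem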

I also attached here the snippets of all the tests and issues I mentioned.

Thanks,
Nam

rwindiks
Newbie
Posts: 9
Joined: Tue Oct 25, 2016 9:55 am
Location: Switzerland

Re: MLFF Rocks!

#9 Post by rwindiks » Tue Aug 15, 2023 11:54 am

The Linux VASP executables of Materials Design are compiled with the mentioned option and are working well, without any observed MLFF issues. However, so far I have performed VASP MLFF calculations on single nodes only, i.e., not across two or more nodes. Hence, I have not observed any wrong predictions.

@Nam: I wonder: what are your criteria for determining whether MLFF predictions are wrong or correct? How do you distinguish between predicted atomic configurations that are simply not captured by the MLFF training set and those that are in fact wrongly predicted? I assume that is very difficult to assess, as once the MLFF leaves the "charted" configurational space it can predict anything, i.e., nonsense!

Thanks.

Rene

thanh-nam_huynh
Newbie
Posts: 4
Joined: Wed Mar 23, 2022 2:54 pm

Re: MLFF Rocks!

#10 Post by thanh-nam_huynh » Tue Aug 15, 2023 9:49 pm

Hi Rene,

Good questions. Personally, I usually look at the BEEF values (the Bayesian error estimates of the forces) to assess whether the MLFF is extrapolating. I would assume that in the production run the BEEF values should be comparable with those from the last training steps. If you observe values that are an order of magnitude higher, it is likely that the MLFF describes this part of configuration space poorly and more training is needed.
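
For example, one simple way to follow this during a run (a sketch, assuming the per-step error estimates in the ML_LOGFILE are tagged BEEF, as described in the header of the file itself):

    grep BEEF ML_LOGFILE | tail -n 20   # Bayesian error estimates of the last MD steps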

I don't really have an answer to the second question, for cases where the MLFF gives wrong predictions. From my limited experience, two scenarios can occur. The most apparent signal that something is wrong is the system "exploding" (the geometry becoming unphysical, or "nonsense") while the MLFF is still predicting "reasonable" energies (but usually not correct forces). This can easily be spotted by keeping track of the geometries, energies, and forces during the run. There is also a more subtle scenario that can happen when the run is long enough: it might be a new phase or a new state of the system that looks reasonable. In such cases, I guess the ideal solution (though almost impractical) is to test it with a DFT run. One might also invoke some physical/chemical reasoning to decide whether such behaviour could happen given the conditions of the MD run. Other than that, I don't have solutions.

Let's hope that some expert will jump in and share some opinions. :D
Best,
Nam

ferenc_karsai
Global Moderator
Posts: 460
Joined: Mon Nov 04, 2019 12:44 pm

Re: MLFF Rocks!

#11 Post by ferenc_karsai » Wed Aug 16, 2023 9:08 am

thanh-nam_huynh wrote: Mon Aug 14, 2023 1:37 pm [...]

Ok, in the calculation on two nodes with shared memory something goes terribly wrong in the first step (full of NaNs). Could you please test this on another system, something very simple like silicon? If it does the same for the other system, then something is clearly wrong with your shared-memory setup and you need to fix that. If it works for the other system, we may have found a pathological system on which a possible bug comes out. In that case I would ask you to send me the POSCAR, POTCAR, KPOINTS, ML_AB, INCAR, OUTCAR, stdout, and ML_LOGFILE of the calculation.

We have run the code on multiple nodes successfully several hundred times for various systems, but strange bugs that only show up in special constellations can always occur.
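
For such a quick cross-check, a minimal silicon test cell could look like this (a standard 2-atom diamond-structure primitive cell with the experimental lattice constant of 5.43 Angstrom; any simple Si setup will do):

    Si primitive cell (diamond structure)
    5.43
      0.0  0.5  0.5
      0.5  0.0  0.5
      0.5  0.5  0.0
    Si
    2
    Direct
      0.00  0.00  0.00
      0.25  0.25  0.25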

rwindiks
Newbie
Posts: 9
Joined: Tue Oct 25, 2016 9:55 am
Location: Switzerland

Re: MLFF Rocks!

#12 Post by rwindiks » Wed Aug 16, 2023 1:47 pm

Hi Nam,

to avoid explosions, single atoms flying around, and physically unreasonable structures in applications of MLFFs, make sure that these scenarios are somehow part of the MLFF training set or occur during the on-the-fly training procedure. How else should an MLFF, or any other machine-learned potential (MLP), know that physically unreasonable structures are in fact physically unreasonable?

I can recommend two measures for the training and the refitting, especially in the case of structures that consist of two or more rather different elements:

+ use the INCAR tags ML_ISCALE_TOTEN = 1 and ML_EATOM_REF
+ perform short on-the-fly MLFF training MD simulations with physically unreasonable, i.e., high-energy, configurations

To calculate the energies of the isolated atoms for ML_EATOM_REF, use the same DFT method that is used in the on-the-fly training procedure.
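
A minimal sketch of how these two tags combine (my own illustration; the numbers are placeholders and have to be replaced by the energies from your own isolated-atom calculations, one value per species in POTCAR order):

    ML_ISCALE_TOTEN = 1               ! shift total energies by atomic reference energies
    ML_EATOM_REF    = -1.234 -5.678   ! isolated-atom energies in eV (placeholder values)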

And again: Make sure that all (!) SCF cycles in the on-the-fly MLFF training MD simulations are converged!

Thanks.

Rene

lukas_volkmer
Newbie
Posts: 1
Joined: Mon Nov 13, 2023 1:27 pm

Re: MLFF Rocks!

#13 Post by lukas_volkmer » Fri Jun 21, 2024 12:45 pm

Hi Rene,

sorry for reviving this topic, but I have some questions regarding your last answer.

1) What do you consider a short on-the-fly run? How many ps? Is the reason behind this the reinitialization of the PAW basis? Does this hint still apply if I train in the NVT ensemble?
2) How can I choose physically unreasonable configurations? I guess you mean, for instance, unrelaxed structures? Is there another way to get those configurations? In my current MLFF studies, I have always chosen structures that were more or less equilibrated in the NpT ensemble.

Best,
Lukas
