I compiled VASP 6.5.0 with the Python plugins option, using two different compilers (gcc and fpp); see the attached makefile.include.
I run VASP on a node with two AMD EPYC™ Milan 7763 processors (64 cores, 2.45 GHz, 256 MB cache).
The plugin is only used to change the atomic positions at every step, through Python code that runs Langevin dynamics using an integrator from ASE adapted for the plugin.
Depending on the ML_MODE and ML_LMLFF tags in the INCAR, I obtain two types of errors, with both the gcc and the fpp builds.
With ML_LMLFF=.FALSE., the dynamics (through the plugin) runs for 4500 steps (out of 10 000) and then I obtain the following error:
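To make the per-step update concrete, here is a minimal sketch of a single Langevin step in plain Python. It is not the ASE integrator used in my plugin: the function name `langevin_step`, the simple Euler-Maruyama scheme and the unit convention (masses in eV·fs²/Å², so that force/mass is an acceleration in Å/fs²) are assumptions chosen only for illustration.

```python
import math
import random

KB = 8.617333262e-5  # Boltzmann constant in eV/K


def langevin_step(positions, velocities, forces, masses,
                  dt, friction, temperature, rng=random):
    """Advance positions and velocities by one Euler-Maruyama Langevin step.

    positions, velocities, forces: lists of [x, y, z] per atom.
    masses: one mass per atom (hypothetical units: eV*fs^2/A^2).
    friction: friction coefficient in 1/fs; temperature in K.
    """
    new_positions, new_velocities = [], []
    for r, v, f, m in zip(positions, velocities, forces, masses):
        # Width of the random kick for this atom (fluctuation-dissipation).
        sigma = math.sqrt(2.0 * friction * KB * temperature * dt / m)
        v_new = [
            vi + dt * (fi / m - friction * vi) + sigma * rng.gauss(0.0, 1.0)
            for vi, fi in zip(v, f)
        ]
        # Positions are moved with the updated velocities.
        r_new = [ri + dt * vn for ri, vn in zip(r, v_new)]
        new_positions.append(r_new)
        new_velocities.append(v_new)
    return new_positions, new_velocities
```

In the real setup the forces come from VASP at each step and the new positions are handed back through the plugin; the sketch only shows the arithmetic of one update.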
slurmstepd-topaze1701: error: Detected 1 oom_kill event in StepId=7485027.0. Some of the step tasks have been OOM Killed.
srun: error: topaze1701: task 64: Out Of Memory
slurmstepd-topaze1701: error: mpi/pmix_v4: _errhandler: topaze1701 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -61, source = [slurm.pmix.7485027.0:64]
slurmstepd-topaze1701: error: *** STEP 7485027.0 ON topaze1701 CANCELLED AT 2025-02-19T23:19:21 ***
srun: Job step aborted: Waiting up to 302 seconds for job step to finish.
slurmstepd-topaze1701: error: mpi/pmix_v4: _errhandler: topaze1701 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -61, source = [slurm.pmix.7485027.0:0]
+ exit 0
With ML_LMLFF=.TRUE. and ML_MODE = train, I do not obtain any error and can run the dynamics (through the plugin) for 10 000 steps.
In that particular case, ML_CTIFOR was set to a high value so that there are no DFT calls, only force-field evaluations.
With ML_LMLFF=.TRUE. and ML_MODE = run, the VASP execution stops before calling the Python interface but after writing the first energy and forces to the OUTCAR.
I obtain the following error (many times):
[topaze1150:3973629:0:3973629] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:3973629) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x00000000005aa353 rc_add_() ???:0
2 0x00000000004cd81b plugins_mp_plugins_structure_() ???:0
3 0x0000000001eff5f1 MAIN__() ???:0
4 0x000000000041fba2 main() ???:0
5 0x000000000003ad85 __libc_start_main() ???:0
6 0x000000000041faae _start() ???:0
=================================
[topaze1150:3973585:0:3973585] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:3973585) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000000584695 map_forward_() ???:0
2 0x000000000058921d fftbrc_plan_mpi_() ???:0
3 0x000000000058d33b fft3d_mpi_() ???:0
4 0x0000000000590e98 fft3d_() ???:0
5 0x00000000004cd838 plugins_mp_plugins_structure_() ???:0
6 0x0000000001eff5f1 MAIN__() ???:0
7 0x000000000041fba2 main() ???:0
8 0x000000000003ad85 __libc_start_main() ???:0
9 0x000000000041faae _start() ???:0
=================================
[topaze1150:3973645:0:3973645] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1268000007f)
==== backtrace (tid:3973645) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000000584695 map_forward_() ???:0
2 0x000000000058921d fftbrc_plan_mpi_() ???:0
3 0x000000000058d33b fft3d_mpi_() ???:0
4 0x0000000000590e98 fft3d_() ???:0
5 0x00000000004cd838 plugins_mp_plugins_structure_() ???:0
6 0x0000000001eff5f1 MAIN__() ???:0
7 0x000000000041fba2 main() ???:0
8 0x000000000003ad85 __libc_start_main() ???:0
9 0x000000000041faae _start() ???:0
=================================