Dear all,
I wanted to compare the performance of a GPU run against a CPU run. I tried a single SCF loop, but I have not seen any improvement with the GPU: it is 3 times slower than the CPU run (which is calculated as NBANDS*KPOINTS/8/NCORE), despite NSIM=128 and 1 MPI rank per GPU. I have probably made a mistake somewhere and would be happy if you could help. The simulation files are attached.
Regards,
Burak
VASP_GPU Performance Issue
Re: VASP_GPU Performance Issue
Hi,
Thank you for running performance tests and sharing them here!
It is true that for the comparison of time to solution for GPU vs. CPU, the CPU version has the upper hand by a factor of 3 for this particular calculation.
Code:
OUTCAR_CPU: LOOP+: cpu time 670.8752: real time 672.0153
OUTCAR_GPU: LOOP+: cpu time 1941.4516: real time 1905.4101
There are a couple of aspects to note:
- Most of the time is lost when computing the VdW forces.
That is because this part of the code has not been ported to GPU. Thus, at this point, we cannot recommend running calculations that include VdW forces on GPU.
Code:
OUTCAR_CPU: FORVDW: cpu time 13.9918: real time 14.0008
OUTCAR_GPU: FORVDW: cpu time 959.5451: real time 951.2572
- The GPU run takes 2 more iteration steps to reach convergence.
Code:
OUTCAR_CPU: LOOP: cpu time 31.0964: real time 31.3084
OUTCAR_CPU: LOOP: cpu time 37.0171: real time 37.1010
OUTCAR_CPU: LOOP: cpu time 36.1667: real time 36.2138
OUTCAR_CPU: LOOP: cpu time 35.8931: real time 35.9381
OUTCAR_CPU: LOOP: cpu time 36.2547: real time 36.3063
OUTCAR_CPU: LOOP: cpu time 29.3839: real time 29.4306
OUTCAR_CPU: LOOP: cpu time 32.8986: real time 32.9410
OUTCAR_CPU: LOOP: cpu time 28.9229: real time 28.9653
OUTCAR_CPU: LOOP: cpu time 32.5930: real time 32.6356
OUTCAR_CPU: LOOP: cpu time 33.8198: real time 33.8665
OUTCAR_CPU: LOOP: cpu time 32.8474: real time 32.8873
OUTCAR_CPU: LOOP: cpu time 34.5196: real time 34.5708
OUTCAR_CPU: LOOP: cpu time 40.0197: real time 40.0728
OUTCAR_CPU: LOOP: cpu time 34.0037: real time 34.0559
OUTCAR_CPU: LOOP: cpu time 33.3978: real time 33.4419
OUTCAR_CPU: LOOP: cpu time 35.9634: real time 36.0134
OUTCAR_CPU: LOOP: cpu time 38.1054: real time 38.1565
OUTCAR_CPU: LOOP: cpu time 25.3977: real time 25.4350
OUTCAR_CPU: LOOP+: cpu time 670.8752: real time 672.0153
Code:
OUTCAR_GPU: LOOP: cpu time 40.1964: real time 39.7315
OUTCAR_GPU: LOOP: cpu time 46.5786: real time 46.2402
OUTCAR_GPU: LOOP: cpu time 48.1026: real time 47.7735
OUTCAR_GPU: LOOP: cpu time 54.6248: real time 54.3239
OUTCAR_GPU: LOOP: cpu time 57.1046: real time 55.8257
OUTCAR_GPU: LOOP: cpu time 38.0119: real time 39.7700
OUTCAR_GPU: LOOP: cpu time 42.5045: real time 41.0562
OUTCAR_GPU: LOOP: cpu time 37.5500: real time 36.0945
OUTCAR_GPU: LOOP: cpu time 42.1764: real time 40.6554
OUTCAR_GPU: LOOP: cpu time 44.2746: real time 42.7324
OUTCAR_GPU: LOOP: cpu time 42.5493: real time 41.0302
OUTCAR_GPU: LOOP: cpu time 46.4575: real time 44.9322
OUTCAR_GPU: LOOP: cpu time 50.7778: real time 49.2683
OUTCAR_GPU: LOOP: cpu time 43.5063: real time 41.9760
OUTCAR_GPU: LOOP: cpu time 42.6561: real time 41.1782
OUTCAR_GPU: LOOP: cpu time 45.6093: real time 44.1545
OUTCAR_GPU: LOOP: cpu time 48.8931: real time 47.4580
OUTCAR_GPU: LOOP: cpu time 33.3592: real time 31.8699
OUTCAR_GPU: LOOP: cpu time 35.0750: real time 33.5765
OUTCAR_GPU: LOOP: cpu time 29.2735: real time 28.8527
OUTCAR_GPU: LOOP+: cpu time 1941.4516: real time 1905.4101
This could just as well be the other way around. So, there is no fundamental conclusion we can draw from this observation.
- Time to solution vs. power per iteration step: I understand the interest in comparing time to solution, but alternatively, one can look at the time per iteration step to judge the performance. If we do that and subtract the contribution from the VdW forces, we still observe that the GPU run takes about 25% more time per iteration step. Additionally, you could consider power consumption and the availability of resources. Depending on your hardware, the GPU may have the upper hand after all (for calculations without VdW forces) when considering power per iteration step.
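For anyone who wants to redo this comparison on their own runs, here is a minimal Python sketch of the arithmetic behind the points above. It only uses the timings quoted in this thread. The helper name `real_times` and the regular expression are illustrative and not part of VASP, and the script assumes that the FORVDW time is included in LOOP+.

```python
import re
from statistics import mean

# Matches timing lines such as "LOOP: cpu time 31.0964: real time 31.3084".
TIMING = re.compile(r"(\w+\+?):\s*cpu time\s+([\d.]+):\s*real time\s+([\d.]+)")

def real_times(text, label):
    """Return the real (wall-clock) times of every timing line with this label."""
    return [float(real) for name, _cpu, real in TIMING.findall(text) if name == label]

# Summary lines quoted in this thread.
cpu_summary = """
LOOP+: cpu time 670.8752: real time 672.0153
FORVDW: cpu time 13.9918: real time 14.0008
"""
gpu_summary = """
LOOP+: cpu time 1941.4516: real time 1905.4101
FORVDW: cpu time 959.5451: real time 951.2572
"""

# Real times of the individual LOOP lines quoted in this thread.
cpu_steps = [31.3084, 37.1010, 36.2138, 35.9381, 36.3063, 29.4306,
             32.9410, 28.9653, 32.6356, 33.8665, 32.8873, 34.5708,
             40.0728, 34.0559, 33.4419, 36.0134, 38.1565, 25.4350]
gpu_steps = [39.7315, 46.2402, 47.7735, 54.3239, 55.8257, 39.7700,
             41.0562, 36.0945, 40.6554, 42.7324, 41.0302, 44.9322,
             49.2683, 41.9760, 41.1782, 44.1545, 47.4580, 31.8699,
             33.5765, 28.8527]

# Time to solution with the vdW force timing subtracted
# (assumption: FORVDW is contained in LOOP+).
cpu_no_vdw = real_times(cpu_summary, "LOOP+")[0] - real_times(cpu_summary, "FORVDW")[0]
gpu_no_vdw = real_times(gpu_summary, "LOOP+")[0] - real_times(gpu_summary, "FORVDW")[0]

# GPU/CPU ratios: mean time per iteration step, and total time without vdW forces.
step_ratio = mean(gpu_steps) / mean(cpu_steps)
total_ratio = gpu_no_vdw / cpu_no_vdw

print(f"iteration steps: CPU {len(cpu_steps)}, GPU {len(gpu_steps)}")
print(f"mean time per step, GPU/CPU: {step_ratio:.2f}")
print(f"LOOP+ minus FORVDW, GPU/CPU: {total_ratio:.2f}")
```

With the numbers above, the per-step ratio comes out at roughly 1.25, which is where the "about 25%" figure comes from. Under the stated assumption, the ratio of the totals without FORVDW is larger than that, partly because the GPU run needed two more iteration steps. To apply this to a full OUTCAR, you could pass the file contents to `real_times` instead of the short excerpts.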
Cheers,
Marie-Therese