
sctk: Repository Summary

Main repository for the Superconducting Toolkit


Latest Commits

Rev. Time Author Message
13ada28 2022-06-28 12:44:21 Mitsuaki Kawamura master change version number
5a7ca7c 2022-06-28 11:33:35 Mitsuaki Kawamura Update reference for SCTK-example
60476f1 2022-05-20 20:07:23 Mitsuaki Kawamura Merge branch 'develop' of github.com:mitsuaki1987/DFPT-te...
edcb8d7 2022-05-20 20:01:01 Mitsuaki Kawamura Bugfix
d43fa39 2022-05-09 01:51:24 Mitsuaki Kawamura Bugfix: sign
bd0a173 2022-05-08 23:52:53 Mitsuaki Kawamura SCTK/src/sctk_spinfluc.f90: Waste calculation was done. t...
2d2fba6 2022-04-24 23:38:12 Mitsuaki Kawamura [BugFix] sctk_tetra.f90 : indices of sort is not used for...
49ed595 2022-04-24 17:17:09 Mitsuaki Kawamura sctk_invert : Fix typo in comment sctk_coulomb : avoid to...
691dd7c 2022-04-24 11:16:52 Mitsuaki Kawamura OpenMP for matrix hermite-conjugate and xc kernel
913ad74 2022-04-23 18:37:12 Mitsuaki Kawamura Add OpenMP parallel into kel

Recently edited Tags

Name Rev. Time Author
esm-rism_ver.1.0 c95c5e4 2018-04-18 10:03:06 S.Nishihara of AdvanceSoft
qe-5.1.0 a0ee5d7 2018-03-19 19:32:50 giannozz
qe-5.1.1 9c8547e 2018-03-15 17:09:22 giannozz
qe-5.2.0 5beef24 2018-03-14 16:53:39 spigafi
qe-5.1.2 778b905 2018-03-14 16:45:41 giannozz
qe-5.2.1 27884f7 2018-03-14 04:24:07 spigafi
qe-5.3 1bb16ac 2018-03-14 04:20:03 giannozz
qe-5.4 999aca3 2018-03-14 04:14:33 spigafi
qe-6.0.0 84eb939 2018-03-14 04:08:54 spigafi
qe-6.1.0 03ed522 2018-03-14 04:02:40 spigafi
qe-6.2.0 827380e 2018-03-14 03:56:11 giannozz
PW-1.3.1 4e747a2 2018-03-14 03:50:41 sbraccia
PW-1.3.0 d070622 2018-03-14 03:48:02 giannozz
PW-1.2.0 04f073f 2018-03-14 03:39:09 giannozz
qe-6.2.1 ab3ad02 2017-12-11 23:59:33 giannozz
v6.2 a5ff9a4 2017-10-24 02:26:35 giannozz
v6.2+d a5ff9a4 2017-10-24 02:26:35 giannozz
v6.2-beta 62e31bb 2017-09-01 00:20:44 giannozz
v6.2b 585716d 2017-08-31 17:24:49 paulatto
pretag 585716d 2017-08-31 17:08:32 paulatto
gc-scf_v6.1 61684b9 2017-07-26 10:49:42 nisihara1
stable ab20adb 2017-07-19 17:07:21 nisihara1
fcp-rism_v6.1 87157a6 2017-06-30 15:47:23 nisihara1
esm-rism_v6.1 0125a36 2017-06-29 16:38:35 nisihara1
esm-stress_v6.1 b4a8f94 2017-05-10 17:01:49 nisihara1
v6.1-nisihara 351a651 2017-03-26 08:33:02 nisihara1
rmm-diis_v6.1 8cc8bdd 2017-03-26 07:47:09 nisihara1
dfpttetra5.2.1 97c9440 2017-03-22 15:22:12 maitsuaki
dfpttetra6.0 f2f3986 2017-03-22 15:20:05 Mitsuaki Kawamura
dfpttetra6.1 51ed5af 2017-03-18 18:42:27 mitsuaki1987
rev-bfgs_v6.1 8f10042 2017-03-14 23:07:18 nisihara1
v6.1 b490894 2017-03-03 21:30:34 spigafi
v6.0 aed7172 2016-10-05 02:33:59 spigafi
v5.4.0 3fe69fe 2016-04-25 06:05:19 spigafi

Branches

Name Rev. Time Author Message
master 13ada28 2022-06-28 12:44:21 Mitsuaki Kawamura change version number

README_GPU.md

Quantum ESPRESSO GPU

GPL v2

This repository also contains the GPU-accelerated version of Quantum ESPRESSO.

Installation

This version is tested against PGI (now nvfortran) compilers v. >= 17.4. The configure script checks for the presence of a PGI compiler and of a few CUDA libraries. For this reason, the path pointing to the cudatoolkit must be present in LD_LIBRARY_PATH.

A template for the configure command is:

./configure --with-cuda=XX --with-cuda-runtime=YY --with-cuda-cc=ZZ --enable-openmp [ --with-scalapack=no ]

where XX is the location of the CUDA Toolkit (in HPC environments it is generally $CUDA_HOME), YY is the version of the CUDA toolkit, and ZZ is the compute capability of the card. If you do not know these values, you can try the automatic tool get_device_props.py. An example using Slurm is:

$ module load cuda
$ cd dev-tools
$ salloc -n1 -t1
[...]
salloc: Granted job allocation xxxx
$ srun python get_device_props.py
[...]
Compute capabilities for dev 0: 6.0
Compute capabilities for dev 1: 6.0
Compute capabilities for dev 2: 6.0
Compute capabilities for dev 3: 6.0

If all compute capabilities match, configure QE with:

./configure --with-cuda=$CUDA_HOME --with-cuda-cc=60 --with-cuda-runtime=9.2

It is generally a good idea to disable Scalapack when running small test cases since the serial GPU eigensolver can outperform the parallel CPU eigensolver in many circumstances.

From time to time PGI links to the wrong CUDA libraries and fails, reporting a problem in cusolver related to a missing GOmp (GNU OpenMP). The solution to this problem is removing the cudatoolkit from LD_LIBRARY_PATH before compiling.

Serial compilation is also supported.

Execution

By default, GPU support is active. The following message will appear at the beginning of the output:

     GPU acceleration is ACTIVE.

GPU acceleration can be switched off by setting the following environment variable:

$ export USEGPU=no

Testing

The current GPU version passes all 186 tests with both parallel and serial compilation. The testing suite should only be used to check the correctness of pw.x. Therefore only make run-tests-pw-parallel and make run-tests-pw-serial should be used.

Naming conventions

Variables allocated on the device must end with _d. Subroutines and functions replicating an algorithm on the GPU must end with _gpu. Modules must end with _gpum. Files with duplicated source code must end with _gpu.f90.
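
As a hedged illustration only (the module, file, and variable names below are hypothetical, not taken from the actual sources), the conventions combine as follows:

MODULE wavefun_gpum                      ! module holding duplicated data ends with "_gpum"
#if defined(__CUDA)
  USE cudafor
#endif
  IMPLICIT NONE
  REAL(8), ALLOCATABLE :: psi_d(:,:)     ! device duplicate of a CPU array ends with "_d"
#if defined(__CUDA)
  attributes(DEVICE) :: psi_d
#endif
CONTAINS
  SUBROUTINE normalize_gpu(norm)         ! GPU version of a CPU routine ends with "_gpu"
    REAL(8), INTENT(IN) :: norm
    ! ... device code operating on psi_d ...
  END SUBROUTINE normalize_gpu
END MODULE wavefun_gpum

Such a module would typically live in a duplicated source file named, e.g., wavefun_gpu.f90.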

Porting functionalities

PW functionalities are ported to the GPU by duplicating the subroutines and functions that operate on CPU variables. The number of arguments should not change, but input and output data may refer to device variables where applicable.
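
For instance, a minimal sketch of such a duplicated pair (the routine and argument names are illustrative, not actual QE code): the GPU version keeps the same argument list but receives device-resident data:

SUBROUTINE scale_vec(n, alpha, v)
  IMPLICIT NONE
  INTEGER, INTENT(IN)    :: n
  REAL(8), INTENT(IN)    :: alpha
  REAL(8), INTENT(INOUT) :: v(n)
  v(1:n) = alpha * v(1:n)
END SUBROUTINE scale_vec

SUBROUTINE scale_vec_gpu(n, alpha, v_d)
  USE cudafor
  IMPLICIT NONE
  INTEGER, INTENT(IN)    :: n
  REAL(8), INTENT(IN)    :: alpha
  REAL(8), INTENT(INOUT) :: v_d(n)
  attributes(DEVICE)     :: v_d
  INTEGER :: i
  ! same argument count as the CPU routine, but v_d lives on the device
  !$cuf kernel do(1) <<<*,*>>>
  DO i = 1, n
     v_d(i) = alpha * v_d(i)
  END DO
END SUBROUTINE scale_vec_gpu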

Bifurcations in code flow happen at runtime with commands similar to

use control_flags, only : use_gpu
[...]
if (use_gpu) then
   call subroutine_gpu(arg_d)   ! device-resident data goes to the accelerated version
else
   call subroutine(arg)         ! host data goes to the original CPU version
end if

At each bifurcation point it should be possible to remove the call to the accelerated routine without breaking the code. Note, however, that calling both the CPU and the GPU version of a subroutine in the same place may break the code execution.

Memory management

[ DISCLAIMER STARTS ] What is described below is not the method that will be integrated in the final release. Nonetheless, it happens to be a good approach for:

1) simplifying the alignment of this fork with the main repository, 2) debugging, 3) tracing the evolution of memory paths as the CPU version evolves, 4) (in the future) reporting on the set of global variables that should be kept to guarantee a certain speedup.

For example, this simplified the integration of the changes that took place to modernize the I/O. [ DISCLAIMER ENDS ]

Global GPU data are tightly linked to global CPU data. One cannot allocate global variables on the GPU manually: the global GPU variables follow the allocation and deallocation of the CPU ones, an automatic mechanism enforced by the managed memory system. In what follows, I will refer to a duplicated GPU variable as the "duplicated variable" and to the equivalent CPU variable as the "parent variable".

Global variables in modules are synchronized through calls to subroutines named using_xxx and using_xxx_d, with xxx being the name of the module variable that is accessed globally by multiple subroutines. These subroutines accept one argument that replicates the role of the intent attribute.

Acceptable values are:

0: the variable will only be read (equivalent to intent(in))
1: the variable will be read and written (equivalent to intent(inout))
2: the variable will only be (entirely) updated (equivalent to intent(out)).

Function and subroutine calls that take global variables among their arguments should be guarded by calls to using_xxx with the appropriate argument. Obviously, calls with argument 0 or 1 must always precede the guarded call.

The actual allocation of a duplicated variable happens when using_xxx_d is called and the parent variable is allocated. Deallocation happens when using_xxx_d(2) is called and the CPU variable is not allocated. Data synchronization (done with synchronous copies, i.e. overloaded cudamemcpy) happens when either the CPU or the GPU memory is found to be flagged "out of date" by a previous call to using_xxx(1) or using_xxx(2) or using_xxx_d(1) or using_xxx_d(2).
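
A hedged sketch of this protocol (the module, variable, and helper names below, e.g. vkb, are only illustrative of the pattern, not a verbatim excerpt from the sources):

SUBROUTINE read_vkb_gpu()
  USE uspp_gpum, ONLY : vkb_d, using_vkb_d
  IMPLICIT NONE
  ! 0 = read only: allocates vkb_d if the parent vkb is allocated and
  ! copies host -> device if the device copy is flagged out of date.
  CALL using_vkb_d(0)
  CALL consume_vkb_gpu(vkb_d)   ! placeholder for actual device work
END SUBROUTINE read_vkb_gpu

SUBROUTINE rewrite_vkb_cpu()
  USE uspp,      ONLY : vkb
  USE uspp_gpum, ONLY : using_vkb
  IMPLICIT NONE
  ! 2 = entirely updated: the CPU copy becomes authoritative and the
  ! device copy will be refreshed by the next using_vkb_d call.
  CALL using_vkb(2)
  CALL recompute_vkb(vkb)       ! placeholder for the actual update
END SUBROUTINE rewrite_vkb_cpu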

Calls to using_xxx_d should only happen in GPU functions/subroutines. This rule can be circumvented if the call is protected by ifdefs, which is useful if you are lazy and a global variable is updated only a few times. An example of this is the g vectors, which are set in a few places (at initialization, after a scaling of the Hamiltonian, etc.) and are used everywhere in the code.
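
For example, a hedged sketch of this ifdef escape hatch (again with illustrative names): the CPU routine that rewrites the g vectors refreshes the device copy on the spot, so GPU code elsewhere can rely on it being current:

SUBROUTINE reset_g_vectors()
  USE gvect,      ONLY : g
  USE gvect_gpum, ONLY : using_g, using_g_d
  IMPLICIT NONE
  CALL using_g(2)              ! the CPU copy will be entirely rewritten
  CALL compute_g_vectors(g)    ! placeholder for the actual update
#if defined(__CUDA)
  CALL using_g_d(0)            ! refresh the device duplicate right away
#endif
END SUBROUTINE reset_g_vectors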

Finally, there are global variables that are only updated by subroutines residing inside the same module. The allocation and update of the duplicated counterpart become trivial and are simply done at the same time as for the CPU variable. At the time of writing this constitutes an exception to the general rule, but it is actually the result of the efforts made over the last year to modularize the code, and it is probably the correct way to deal with duplicated data in the code.
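
A hedged sketch of this last, module-internal case (hypothetical names): the duplicated array is allocated and filled together with its parent inside the module itself:

MODULE interp_table
#if defined(__CUDA)
  USE cudafor
#endif
  IMPLICIT NONE
  REAL(8), ALLOCATABLE :: tab(:), tab_d(:)
#if defined(__CUDA)
  attributes(DEVICE) :: tab_d
#endif
CONTAINS
  SUBROUTINE init_table(n)
    INTEGER, INTENT(IN) :: n
    INTEGER :: i
    ALLOCATE(tab(n))
    DO i = 1, n
       tab(i) = DBLE(i)         ! placeholder for the actual table values
    END DO
#if defined(__CUDA)
    ALLOCATE(tab_d(n))
    tab_d = tab                 ! implicit host-to-device copy
#endif
  END SUBROUTINE init_table
END MODULE interp_table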
