Generate Training Data
Overview before starting
We are going to
- Generate a dataset using FHI-aims.
- Record atomic basis information.
- Benchmark the training dataset.
Related starting files
file or dir name | description |
---|---|
README.rst | README file for your reference |
inp.yaml | SALTED input file, consists of file paths and hyperparameters |
control.in | FHI-aims control file, for generating dataset with FHI-aims |
run-aims.sbatch | Example script for generating dataset with FHI-aims (only an example, everything can be run locally) |
water_monomers_100.xyz | xyz file, training dataset. Units: Angstrom (Å) |
No MPI interface?
Don't worry! Just change the parameter parallel in inp.yaml to False.
For all commands in this tutorial that launch Python through MPI, e.g.
mpirun -np $ntasks python -m salted.aims.move_data
just run them without mpirun:
python -m salted.aims.move_data
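The switch between the two invocations can be sketched in a few lines of Python. The helper name build_command and its arguments are hypothetical and only illustrate the rule above; they are not part of SALTED.

```python
import shlex


def build_command(module, parallel, ntasks=1):
    """Return the argument list for running a SALTED module.

    Hypothetical helper: with parallel=True the command is prefixed by
    mpirun, otherwise plain python is used.
    """
    cmd = ["python", "-m", module]
    if parallel:
        cmd = ["mpirun", "-np", str(ntasks)] + cmd
    return cmd


# With MPI (4 tasks) and without MPI
print(shlex.join(build_command("salted.aims.move_data", True, 4)))
print(shlex.join(build_command("salted.aims.move_data", False)))
```

The value passed as parallel should mirror the parallel entry in inp.yaml, so that the launch command and the SALTED settings agree.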
Input geometry
For the whole project, we will stay in the working root dir example/water_monomer_AIMS, where inp.yaml is located.
cd $path_to_salted_examples/water_monomer_AIMS
Before we start, check the filename entry in inp.yaml and make sure it is water_monomers_100.xyz.
This is the training dataset.
About inp.yaml
The inp.yaml file consists of file paths, hyperparameters for machine-learning models, and control arguments.
Parameters are referred to as inp.[parameter] in the following text.
For a detailed description of each parameter, please refer to the Appendix.
FHI-aims needs a control.in file and a geometry.in file to start a calculation. The control.in file is the same for every structure, while the geometry.in files are generated by salted.aims.make_geoms from water_monomers_100.xyz:
python -m salted.aims.make_geoms
This generates the input files for all 100 water monomers at [inp.qm.path2qm]/data/geoms/[n].in (here qmdata/data/geoms/[n].in), with [n] running from 1 to 100. Later, these files will be moved into each FHI-aims calculation dir and renamed to geometry.in.
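For orientation, a minimal sketch of what such a conversion involves is given below. It is not the salted.aims.make_geoms implementation; the function name and the example coordinates are illustrative assumptions. It relies only on the FHI-aims geometry.in convention of one "atom x y z species" line per atom, with coordinates in Å.

```python
def xyz_frame_to_geometry_in(symbols, positions):
    """Format one structure as the text of an FHI-aims geometry.in file."""
    lines = []
    for symbol, (x, y, z) in zip(symbols, positions):
        # One "atom" line per atom: Cartesian coordinates, then the species
        lines.append(f"atom {x:16.8f} {y:16.8f} {z:16.8f} {symbol}")
    return "\n".join(lines) + "\n"


# Illustrative water monomer (coordinates in Angstrom, made up for the example)
symbols = ["O", "H", "H"]
positions = [(0.0, 0.0, 0.0), (0.757, 0.586, 0.0), (-0.757, 0.586, 0.0)]
print(xyz_frame_to_geometry_in(symbols, positions), end="")
```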
Control file
The provided control.in file contains 3 special tags for SALTED:
ri_density_restart write
ri_full_output .True.
ri_output_density_only .True.
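Before launching many calculations, it can be useful to confirm that a control.in file really carries these three tags. The following is a minimal sketch; the function name is hypothetical, the example control text is made up, and it assumes one keyword per line with # starting a comment.

```python
REQUIRED_TAGS = {
    "ri_density_restart": "write",
    "ri_full_output": ".true.",
    "ri_output_density_only": ".true.",
}


def missing_salted_tags(control_text):
    """Return the required tags that are absent or set to another value."""
    found = {}
    for raw in control_text.splitlines():
        tokens = raw.split("#", 1)[0].split()  # drop comments, then tokenize
        if len(tokens) >= 2:
            found[tokens[0]] = tokens[1].lower()
    return [tag for tag, value in REQUIRED_TAGS.items() if found.get(tag) != value]


# Made-up control.in content that lacks one of the three tags
example = """
xc pbe
ri_density_restart write
ri_full_output .True.
"""
print(missing_salted_tags(example))  # ['ri_output_density_only']
```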
Control tags ri_xxx and option details
You can find more details about these tags and options in the FHI-aims PDF documentation.
ri_density_restart [task] [value]
- If [task] is write
  - [value] is an optional positive float which defines a cutoff radius for calculating the overlap of the product basis functions. Default [value] is 1.5.
  - FHI-aims will write restart coefficients to a file ri_restart_coeffs.out, which should be renamed to ri_restart_coeffs_df.out manually and will be used by SALTED later.
- If [task] is read
  - [value] is an optional non-negative integer specifying the maximum number of SCF steps to be performed after reading the density. Default [value] is the value of tag sc_iter_limit, which is by default 1000.
- If [task] is read_and_write
  - FHI-aims will do both read and write.
- Default: do nothing.
ri_full_output [boolean]
- If [boolean] is .True. and ri_density_restart is write
  - FHI-aims will write (by subroutine output_idx_info())
    - overlap matrix to file ri_ovlp.out, and density projection to file ri_projections.out
    - information about the product basis needed to interface with the SALTED framework to files prodbas_condon_shotley_list.out, idx_prodbas.out, basis_info.out, idx_prodbas_details.out
    - density and partition table on the internal real-space grid used by FHI-aims to file partition_tab.out.
- If [boolean] is .False. (default)
  - FHI-aims will not write the above files.
ri_output_density_only [boolean]
- If [boolean] is .True.
  - FHI-aims will write the density to file rho_scf.out (at scf_solver.f90/write_rho()).
- If [boolean] is .False. (default)
  - FHI-aims will not write the above file.
If you need help with FHI-aims
For basics of using FHI-aims, please refer to this tutorial: Basics of Running FHI-aims.
For FHI-aims full user guide, please download the latest PDF documentation here, or search for a compatible version for your FHI-aims. You can literally find everything about FHI-aims in this PDF.
Run FHI-aims
To run the FHI-aims calculations, write your own script for your cluster or PC, using run-aims.sbatch as a reference (it is merely an example),
and then run the script. If you are using an HPC cluster (not necessary for this exercise), the command would be:
sbatch run-aims.sbatch
otherwise, something like
bash run-aims.sh
will suffice. If you struggle to adapt this script to your needs, contact one of the people responsible for this tutorial.
For each FHI-aims calculation, the script should
- Copy control.in (from working root dir) and each geometry.in (from [inp.qm.path2qm]/data/geoms, i.e. qmdata/data/geoms) to the working dir.
- Run FHI-aims.
- Rename rho_rebuilt_ri.out to rho_df.out, and rename ri_restart_coeffs.out to ri_restart_coeffs_df.out.
Why rename outputs?
This is for the sake of distinguishing training and prediction.
There will be two kinds of rebuilt densities / DF coefficients: either from FHI-aims for training, or from SALTED for prediction (see Part 3).
- The current ri_restart_coeffs.out contains the DF coefficients, which are used for training SALTED, so it is suffixed with _df.
- The current rho_rebuilt_ri.out is the real-space density rebuilt from the DF coefficients mentioned above, so it is also suffixed with _df. This file will later be used for benchmarking the DF accuracy.
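The three steps each calculation script should perform (copy inputs, run FHI-aims, rename outputs) can be sketched in Python as follows. This is not the provided run-aims.sbatch; the function names, the binary name aims.x and the task count are assumptions to be adapted to your installation. The launcher is passed in as a function so that the bookkeeping can be checked without FHI-aims.

```python
import shutil
import subprocess
from pathlib import Path


def run_one(idx, root, qm_dir, run_aims):
    """Prepare, run and post-process the FHI-aims calculation for one structure."""
    workdir = Path(qm_dir) / "data" / str(idx)
    workdir.mkdir(parents=True, exist_ok=True)
    # 1. Copy the shared control file and this structure's geometry
    shutil.copy(Path(root) / "control.in", workdir / "control.in")
    shutil.copy(Path(qm_dir) / "data" / "geoms" / f"{idx}.in", workdir / "geometry.in")
    # 2. Run FHI-aims inside the working dir
    run_aims(workdir)
    # 3. Rename the outputs that are used for training
    (workdir / "rho_rebuilt_ri.out").rename(workdir / "rho_df.out")
    (workdir / "ri_restart_coeffs.out").rename(workdir / "ri_restart_coeffs_df.out")
    return workdir


def run_aims_with_mpi(workdir, aims_binary="aims.x", ntasks=4):
    """Assumed launcher: binary name and task count depend on your installation."""
    with open(Path(workdir) / "aims.out", "w") as out:
        subprocess.run(["mpirun", "-np", str(ntasks), aims_binary],
                       cwd=workdir, stdout=out, check=True)


# Typical use from the working root dir (commented out: it needs FHI-aims):
# for idx in range(1, 101):
#     run_one(idx, ".", "qmdata", run_aims_with_mpi)
```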
Output files
After the FHI-aims calculations, each FHI-aims working dir ([inp.qm.path2qm]/data/[idx], here qmdata/data/[idx]) contains
file name | description | FHI-aims tag |
---|---|---|
FHI-aims basic files | ||
control.in | moved in by script, from control.in in working root dir | --- |
geometry.in | moved in by script, from qmdata/data/geoms/[idx].in | --- |
aims.out | FHI-aims output | --- |
real-space density related outputs | ||
partition_tab.out | partition table on the real-space integration grid of FHI-aims | ri_full_output .True. |
rho_df.out | renamed by script from rho_rebuilt_ri.out, df stands for density fitting | --- |
(rho_rebuilt_ri.out) | real-space electron density reconstructed by RI/DF, columns=(x,y,z,density), renamed by script to rho_df.out | ri_full_output .True. |
rho_scf.out | real-space electron density, columns=(x,y,z,density) | ri_output_density_only .True. -> ri_output_density = .True. |
RI / DF related outputs | (Notice: \(\text{RI} \Leftrightarrow \text{DF}\) refer to the same procedure) | |
ri_ovlp.out | overlap matrix of DF basis \(\mathbf{S}_{NN}\) | ri_full_output .True. |
ri_restart_coeffs_df.out | renamed by script from ri_restart_coeffs.out, df stands for density fitting | --- |
(ri_restart_coeffs.out) | DF coefficients (\(\mathbf{c}_{N}^{DF}\)), renamed by script to ri_restart_coeffs_df.out | ri_density_restart write |
ri_projections.out | \(\mathbf{S}_{NN} \mathbf{c}_{N}^{DF} = \left\langle \phi_{i,\sigma} (\mathbf{0}) \middle\| \rho^{QM} \right\rangle\), projected electron density | ri_full_output .True. |
product basis related outputs | ||
basis_info.out | general information about the product basis | ri_full_output .True. |
Collecting AIMS outputs
The RI outputs from each AIMS calculation need to be collected in one place and converted into NumPy format, which speeds up reading these quantities in later steps. For that, we will run:
mpirun -np $ntasks python -m salted.aims.move_data
where $ntasks should be replaced by the number of MPI tasks you wish to use on your machine.
Three directories, namely overlaps, projections, and coefficients, are generated in the working root dir. They contain the data collected from the FHI-aims output files.
data name | physical quantity | from file (in dir qmdata/data/[n]) | to file (at working root) |
---|---|---|---|
overlap matrices | \(\mathbf{S}_{NN}\) | ri_ovlp.out | overlaps/overlap_conf[n].npy |
DF projections | \(\mathbf{S}_{NN} \mathbf{c}_{N}^{DF}\) | ri_projections.out | projections/projections_conf[n].npy |
DF coefficients | \(\mathbf{c}_{N}^{DF}\) | ri_restart_coeffs_df.out | coefficients/coefficients_conf[n].npy |
Notice that n ranges from 1 to 100 across the training dataset, and that the overlap files are named overlap_conf[n].npy (singular overlap, unlike the directory name overlaps).
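A quick way to confirm that the collection step produced every expected file is sketched below. The helper name is hypothetical; the directory and file names are the ones listed in the table above, and the default of 100 configurations matches this tutorial's dataset.

```python
from pathlib import Path

# Directory name -> file name pattern, as listed in the table above
COLLECTED = {
    "overlaps": "overlap_conf{n}.npy",
    "projections": "projections_conf{n}.npy",
    "coefficients": "coefficients_conf{n}.npy",
}


def missing_collected_files(root, n_conf=100):
    """List expected .npy files that are not present under the working root."""
    missing = []
    for directory, pattern in COLLECTED.items():
        for n in range(1, n_conf + 1):
            path = Path(root) / directory / pattern.format(n=n)
            if not path.is_file():
                missing.append(path)
    return missing


# Run from the working root dir; an empty list means all files are in place
print(missing_collected_files(".")[:5])
```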
Reordering coefficients (AIMS version < 240403)
Due to different spherical harmonics conventions, the overlap matrix, DF projections, and DF coefficients must be reordered, and the Condon-Shortley convention must be applied (i.e. the Condon-Shortley phase factor \((-1)^m\) is included in the definition of the spherical harmonics) before they are used in SALTED. In newer versions of AIMS, this is done before the files are written; for older versions, it is done by salted.aims.move_data, based on the additional output files idx_prodbas.out and prodbas_condon_shotley_list.out. The sequence after reordering is described in idx_prodbas_details.out (see the table below for column names), and the reordering sorts by (m_num, l_num, atom_idx) (listed in order of increasing sorting importance), each in ascending order.
file name | description | FHI-aims tag |
---|---|---|
idx_prodbas.out | reordering index for spherical harmonics (see above) | ri_full_output .True. |
idx_prodbas_details.out | basis info, columns = (reordering_idx, atom_idx, l_num, bas_idx, m_num), not used later | ri_full_output .True. |
prodbas_condon_shotley_list.out | Condon-Shortley \((-1)^{m}\) phase factor index | ri_full_output .True. |
Note that the script will automatically determine the version of AIMS used; the same SALTED command should be used regardless of the version of AIMS.
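As a toy illustration of the sort keys stated above (atom_idx most significant, then l_num, then m_num, all ascending), consider the sketch below. It does not read idx_prodbas.out, does not apply the phase factor, and is not the salted.aims.move_data implementation. How the radial index bas_idx enters the ordering is not stated here, so it is left out; the function name and the input records are made up.

```python
def reorder_index(records):
    """Return the permutation that sorts (atom_idx, l_num, m_num) records.

    Tuples compare element by element, so atom_idx is the most significant
    key, then l_num, then m_num. Python's sort is stable, so records with
    identical keys keep their input order.
    """
    return sorted(range(len(records)), key=lambda i: records[i])


# Toy input: (atom_idx, l_num, m_num) in an arbitrary starting order
records = [(1, 1, 1), (1, 0, 0), (2, 0, 0), (1, 1, -1), (1, 1, 0)]
order = reorder_index(records)
print(order)                        # [1, 3, 4, 0, 2]
print([records[i] for i in order])
```

Applying the same permutation to a coefficient vector, and to both the rows and the columns of an overlap matrix, keeps the three quantities consistent with each other.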
Also, the product basis information is essential for SALTED (it is needed for generating \(\lambda\)-SOAP kernels), so we transfer it from basis_info.out to the SALTED basis database basis_data.yaml, which is stored with your local installation of SALTED.
This is achieved by running
python -m salted.get_basis_info
Benchmark dataset
Because the RI/density fitting procedure is not exact, we wish to check the accuracy of the fitted density. This is achieved by running
python -m salted.aims.get_df_err
Real-space densities read from rho_df.out (DF) and rho_scf.out (SCF density matrix) are compared on the dense real-space grid,
and the mean absolute error (in percent) is written to df_maes in the working root dir.
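The kind of comparison involved can be sketched as follows. This is not the salted.aims.get_df_err implementation: the function names are hypothetical, and both the normalization by the SCF density and the optional grid weights are assumptions made for illustration. The reader only relies on the density being the last of the (x,y,z,density) columns.

```python
def read_last_column(path):
    """Read the last column of a whitespace-separated file, skipping blank lines."""
    values = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:
                values.append(float(fields[-1]))
    return values


def df_error_percent(rho_df, rho_scf, weights=None):
    """Weighted absolute density error, in percent of the SCF density.

    Assumed definition: 100 * sum(w * |rho_df - rho_scf|) / sum(w * |rho_scf|),
    with uniform weights when none are given.
    """
    if weights is None:
        weights = [1.0] * len(rho_scf)
    num = sum(w * abs(d - s) for d, s, w in zip(rho_df, rho_scf, weights))
    den = sum(w * abs(s) for s, w in zip(rho_scf, weights))
    return 100.0 * num / den


# Three made-up grid points: two differ by 0.1, the third agrees exactly
print(df_error_percent([1.1, 1.9, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 2.0]))
```

For these made-up numbers the error is about 4 percent. With real data, the two density lists would come from read_last_column applied to rho_df.out and rho_scf.out in one working dir.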