Generate Training Data
Overview before starting
We are going to
- Generate a dataset using FHI-aims.
- Record atomic basis information.
- Benchmark training dataset.
 
Related starting files
| file or dir name | description |
|---|---|
| README.rst | README file for your reference |
| inp.yaml | SALTED input file, consists of file paths and hyperparameters |
| control.in | FHI-aims control file, for generating dataset with FHI-aims |
| run-aims.sbatch | Example script for generating dataset with FHI-aims (only an example, everything can be run locally) |
| water_monomers_100.xyz | xyz file, training dataset. Units: Angstrom (Å) |
No MPI interface?
Don't worry!
Just change the parameter parallel in inp.yaml to False.
For all MPI \(\otimes\) Python commands in this tutorial, e.g.
mpirun -np $ntasks python -m salted.aims.move_data
just run without mpirun:
python -m salted.aims.move_data
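The rule above can be captured in a small helper. This is a minimal sketch and not part of SALTED: the function name salted_command and its arguments are hypothetical, and it only assembles the two command forms shown in this tutorial.

```python
import shlex


def salted_command(module, parallel, ntasks=1):
    """Build the command line for a SALTED module (hypothetical helper).

    `module` is a module path such as "salted.aims.move_data";
    `parallel` mirrors the `parallel` parameter in inp.yaml.
    """
    base = ["python", "-m", module]
    if parallel:
        # With MPI: prefix the same Python command with mpirun.
        return ["mpirun", "-np", str(ntasks)] + base
    # Without MPI: run the plain Python command.
    return base


if __name__ == "__main__":
    print(shlex.join(salted_command("salted.aims.move_data", parallel=True, ntasks=4)))
    print(shlex.join(salted_command("salted.aims.move_data", parallel=False)))
```

Returning a list of arguments rather than a single string keeps the result directly usable with subprocess.run, should you want to launch the step from Python.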
Input geometry
For the whole project, we will stay in the working root dir example/water_monomer_AIMS,
where inp.yaml is located.
cd $path_to_salted_examples/water_monomer_AIMS
Before we start, check the filename entry in inp.yaml and make sure it is water_monomers_100.xyz.
This is the training dataset.
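If you prefer to check the entry programmatically, a plain line scan is enough. The sketch below is an illustration under assumptions: the helper find_filename_entry is hypothetical, it is not a YAML parser, and the nesting shown in the example excerpt (a system block) is assumed rather than taken from the real inp.yaml.

```python
def find_filename_entry(yaml_text):
    """Return the value of the first `filename:` key in the text, or None.

    Assumes the entry sits on a single line of the form `filename: value`.
    """
    for line in yaml_text.splitlines():
        stripped = line.split("#", 1)[0].strip()  # drop trailing comments
        if stripped.startswith("filename:"):
            value = stripped[len("filename:"):].strip()
            return value.strip("'\"")  # remove optional quotes
    return None


# Hypothetical excerpt; the real inp.yaml contains many more entries.
example = """
system:
  filename: ./water_monomers_100.xyz
"""
entry = find_filename_entry(example)
is_expected = entry is not None and entry.endswith("water_monomers_100.xyz")
```

For real use, pass the contents of inp.yaml (for example via pathlib.Path("inp.yaml").read_text()) instead of the example string.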
About inp.yaml
The inp.yaml consists of file paths, hyperparameters for machine-learning models, and controlling arguments.
The parameters will be indicated by inp.[parameter] in the following text.
For a detailed description of each parameter, please refer to the Appendix.
FHI-aims needs a control.in file and a geometry.in file to start a calculation. The control.in file is always the same, and the geometry.in file is generated by salted.aims.make_geoms from water_monomers_100.xyz:
python -m salted.aims.make_geoms
This will generate all 100 water monomers' input files in "[inp.qm.path2qm]/data/geoms/[n].in"="qmdata/data/geoms/[n].in" with [n] running from 1 to 100. Later, these files will be moved into each FHI-aims calculation dir and renamed to geometry.in.
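To make the conversion concrete, here is a minimal sketch of what turning a multi-frame xyz file into per-structure geometry files involves. It is not the implementation of salted.aims.make_geoms; the function names and the coordinates in the example are made up for illustration. It relies only on the standard xyz layout (atom count, comment line, then one atom per line) and on the FHI-aims geometry.in line format atom x y z species.

```python
def xyz_frames(text):
    """Yield one list of (symbol, x, y, z) tuples per frame of a multi-frame xyz string."""
    lines = text.strip().splitlines()
    i = 0
    while i < len(lines):
        natoms = int(lines[i].split()[0])           # first line of a frame: atom count
        atom_lines = lines[i + 2 : i + 2 + natoms]  # second line is a comment, skipped
        frame = []
        for atom_line in atom_lines:
            symbol, x, y, z = atom_line.split()[:4]
            frame.append((symbol, float(x), float(y), float(z)))
        yield frame
        i += 2 + natoms


def to_geometry_in(frame):
    """Format one frame as FHI-aims `atom x y z species` lines (coordinates in Angstrom)."""
    return "".join(f"atom {x:.8f} {y:.8f} {z:.8f} {s}\n" for s, x, y, z in frame)


# Two made-up water geometries, for illustration only.
example = """3
water 1
O 0.0 0.0 0.0
H 0.96 0.0 0.0
H -0.24 0.93 0.0
3
water 2
O 0.0 0.0 0.1
H 0.96 0.0 0.1
H -0.24 0.93 0.1
"""
frames = list(xyz_frames(example))
# File names follow the [n].in convention, with [n] starting from 1.
geoms = {f"{n}.in": to_geometry_in(frame) for n, frame in enumerate(frames, start=1)}
```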
Control file
In the provided control.in file, there are 3 special tags for SALTED:
ri_density_restart write
ri_full_output .True.
ri_output_density_only .True.
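If you adapt control.in for your own system, it is easy to drop one of these tags by accident. The following sketch checks for them; the helper missing_salted_tags is hypothetical, and comparing values case-insensitively is an assumption made here for convenience.

```python
REQUIRED_TAGS = {
    "ri_density_restart": "write",
    "ri_full_output": ".true.",
    "ri_output_density_only": ".true.",
}


def missing_salted_tags(control_text):
    """Return the required tags that are absent or set to a different value."""
    found = {}
    for line in control_text.splitlines():
        tokens = line.split("#", 1)[0].split()  # ignore comments
        if len(tokens) >= 2:
            found[tokens[0]] = tokens[1].lower()
    return sorted(tag for tag, value in REQUIRED_TAGS.items() if found.get(tag) != value)
```

An empty list means all three tags are present with the values shown above.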
Control tags ri_xxx and options details
You can find more details about these tags and options in the FHI-aims PDF documentation.
- ri_density_restart [task] [value]
  - If [task] is write
    - [value] is an optional positive float which defines a cutoff radius for calculating the overlap of the product basis functions. Default [value] is 1.5.
    - FHI-aims will write restart coefficients to a file ri_restart_coeffs.out, which should be renamed to ri_restart_coeffs_df.out manually and will be used by SALTED later.
  - If [task] is read
    - [value] is an optional non-negative integer specifying the maximum number of SCF steps to be performed after reading the density. Default [value] is the value of tag sc_iter_limit, which is by default 1000.
  - If [task] is read_and_write
    - FHI-aims will do both read and write.
  - Default: do nothing.
- ri_full_output [boolean]
  - If [boolean] is .True. and ri_density_restart is write
    - FHI-aims will write (by subroutine output_idx_info())
      - overlap matrix to file ri_ovlp.out, and density projection to file ri_projections.out
      - information about the product basis needed to interface with the SALTED framework to files prodbas_condon_shotley_list.out, idx_prodbas.out, basis_info.out, idx_prodbas_details.out
      - density and partition table on the internal real-space grid used by FHI-aims to file partition_tab.out.
  - If [boolean] is .False. (default)
    - FHI-aims will not write the above files.
- ri_output_density_only [boolean]
  - If [boolean] is .True.
    - FHI-aims will write the density to file rho_scf.out (at scf_solver.f90/write_rho()).
  - If [boolean] is .False. (default)
    - FHI-aims will not write the above file.
If you need help with FHI-aims
For basics of using FHI-aims, please refer to this tutorial: Basics of Running FHI-aims.
For the full FHI-aims user guide, please download the latest PDF documentation here, or look for the version that matches your FHI-aims installation. The PDF covers FHI-aims comprehensively.
Run FHI-aims
To run the FHI-aims calculations, write your own script for your cluster or PC,
using run-aims.sbatch as a reference (it is merely an example),
and then run the script. If you are using an HPC system (not necessary for this exercise), the command would be:
sbatch run-aims.sbatch
otherwise, something like
bash run-aims.sh
will suffice. If you struggle to adapt this script to your needs, contact one of the people responsible for this tutorial.
For each FHI-aims calculation, the script should
- Copy control.in (from working root dir) and each geometry.in (from [inp.qm.path2qm]/data/geoms="qmdata/data/geoms") to the working dir.
- Run FHI-aims
- Rename rho_rebuilt_ri.out to rho_df.out, rename ri_restart_coeffs.out to ri_restart_coeffs_df.out
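The three steps above can be sketched in Python as follows. This is an illustration, not a replacement for run-aims.sbatch: the function run_one and its run_aims hook are hypothetical, and the working dir layout [inp.qm.path2qm]/data/[n] is taken from the tables later in this section. How FHI-aims is actually launched (executable name, MPI launcher, modules) depends on your machine, so that part is left to the caller.

```python
import shutil
from pathlib import Path


def run_one(root, qm_path, n, run_aims):
    """Prepare, run, and post-process calculation number `n`.

    `run_aims` is a caller-supplied function that starts FHI-aims inside the
    given working dir (hypothetical hook).
    """
    root, data = Path(root), Path(qm_path) / "data"
    workdir = data / str(n)
    workdir.mkdir(parents=True, exist_ok=True)

    # 1. Copy the shared control.in and this structure's geometry file.
    shutil.copy(root / "control.in", workdir / "control.in")
    shutil.copy(data / "geoms" / f"{n}.in", workdir / "geometry.in")

    # 2. Run FHI-aims in the working dir.
    run_aims(workdir)

    # 3. Rename the outputs so that they carry the _df suffix.
    (workdir / "rho_rebuilt_ri.out").rename(workdir / "rho_df.out")
    (workdir / "ri_restart_coeffs.out").rename(workdir / "ri_restart_coeffs_df.out")
    return workdir
```

A caller would loop n from 1 to 100 and pass a run_aims function that wraps the launch command used on the local machine, for example through subprocess.run with cwd set to the working dir.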
Why rename outputs?
This is for the sake of distinguishing training and prediction.
There will be two kinds of rebuilt densities / DF coefficients: either from FHI-aims for training, or from SALTED for prediction (see Part 3).
- The current ri_restart_coeffs.out contains the DF coefficients, which are used for training SALTED, so it is suffixed with _df.
- The current rho_rebuilt_ri.out is the real-space density rebuilt from the DF coefficients mentioned above, so it also gets the _df suffix. This file will be used later for benchmarking the DF accuracy.
Output files
After the FHI-aims calculations, each FHI-aims working dir ([inp.qm.path2qm]/data/[idx]="qmdata/data/[idx]") contains
| file name | description | FHI-aims tag |
|---|---|---|
| FHI-aims basic files | | |
| control.in | moved in by script, from control.in in working root dir | --- |
| geometry.in | moved in by script, from qmdata/data/geoms/[idx] | --- |
| aims.out | FHI-aims output | --- |
| real-space density related outputs | | |
| partition_tab.out | space grid in FHI-aims integral | ri_full_output .True. |
| rho_df.out | renamed by script from rho_rebuilt_ri.out, df for density fitting | --- |
| (rho_rebuilt_ri.out) | reconstructed real-space electron density by RI/DF, columns=(x,y,z,density), renamed by script to rho_df.out | ri_full_output .True. |
| rho_scf.out | real-space electron density, columns=(x,y,z,density) | ri_output_density_only .True. -> ri_output_density = .True. |
| RI / DF related outputs | (Notice: \(\text{RI} \Leftrightarrow \text{DF}\) refer to the same procedure) | |
| ri_ovlp.out | overlap matrix of DF basis \(\mathbf{S}_{NN}\) | ri_full_output .True. |
| ri_restart_coeffs_df.out | renamed by script from ri_restart_coeffs.out, df for DF | --- |
| (ri_restart_coeffs.out) | DF coefficients (\(\mathbf{c}_{N}^{DF}\)), renamed by script to ri_restart_coeffs_df.out | ri_density_restart write |
| ri_projections.out | \(\mathbf{S}_{NN} \mathbf{c}_{N}^{DF} = \left\langle \phi_{i,\sigma} (\mathbf{0}) \middle\vert \rho^{QM} \right\rangle\), projected electron density | ri_full_output .True. |
| product basis related outputs | | |
| basis_info.out | general information about the product basis | ri_full_output .True. |
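Before moving on, it can be useful to confirm that a working dir actually holds the files listed above. The sketch below does that; the helper missing_outputs is hypothetical, and the list contains the file names expected after the renaming step (so rho_df.out and ri_restart_coeffs_df.out rather than their original names).

```python
from pathlib import Path

# File names from the table above, after the renaming step.
EXPECTED_OUTPUTS = [
    "aims.out",
    "partition_tab.out",
    "rho_df.out",
    "rho_scf.out",
    "ri_ovlp.out",
    "ri_restart_coeffs_df.out",
    "ri_projections.out",
    "basis_info.out",
]


def missing_outputs(workdir):
    """Return the expected output files that are not present in `workdir`."""
    workdir = Path(workdir)
    return [name for name in EXPECTED_OUTPUTS if not (workdir / name).is_file()]
```

Calling missing_outputs on each of qmdata/data/1 to qmdata/data/100 gives a quick overview of calculations that failed or were not post-processed.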
Collecting AIMS outputs
The RI outputs from each AIMS calculation need to be collected into a single folder and converted to NumPy format, which speeds up reading these quantities in later steps. For that, we will run:
mpirun -np $ntasks python -m salted.aims.move_data
where $ntasks needs to be replaced by the number of tasks you wish to use on your machine.
Three directories, namely overlaps, projections, and coefficients, are generated at the working root dir; they contain the data collected from the FHI-aims output files.
| data name | physical quantity | from file (in dir qmdata/data/[n]) | to file (at working root) |
|---|---|---|---|
| overlap matrices | \(\mathbf{S}_{NN}\) | ri_ovlp.out | overlaps/overlap_conf[n].npy |
| DF projections | \(\mathbf{S}_{NN} \mathbf{c}_{N}^{DF}\) | ri_projections.out | projections/projections_conf[n].npy |
| DF coefficients | \(\mathbf{c}_{N}^{DF}\) | ri_restart_coeffs_df.out | coefficients/coefficients_conf[n].npy |
Notice that n ranges from 1 to 100 across the training dataset, and overlap naming convention is overlap_conf[n].npy.
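The file mapping in the table can be written down explicitly. The sketch below only builds the list of source and target paths; it does not read or convert any data and is not the code of salted.aims.move_data. The names COLLECTED and collection_plan are hypothetical.

```python
from pathlib import Path

# (source file in qmdata/data/[n], target directory, target file prefix)
COLLECTED = [
    ("ri_ovlp.out", "overlaps", "overlap"),
    ("ri_projections.out", "projections", "projections"),
    ("ri_restart_coeffs_df.out", "coefficients", "coefficients"),
]


def collection_plan(qm_path, root, n_structures):
    """List (source, target) path pairs for structures 1..n_structures."""
    plan = []
    for n in range(1, n_structures + 1):
        for source, target_dir, prefix in COLLECTED:
            src = Path(qm_path) / "data" / str(n) / source
            dst = Path(root) / target_dir / f"{prefix}_conf{n}.npy"
            plan.append((src, dst))
    return plan


plan = collection_plan("qmdata", ".", 100)
```

Note the singular overlap in overlap_conf[n].npy, in contrast to the plural prefixes of the other two file types.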
Reordering coefficients (AIMS version < 240403)
Due to different spherical harmonics conventions, the overlap matrix, DF projections, and DF coefficients must be reordered, and the Condon-Shortley convention must be applied (that is, the Condon-Shortley phase factor \((-1)^m\) is included in the definition of the spherical harmonics) before they are used in SALTED. Newer versions of AIMS do this before writing the outputs; for older versions it is done by salted.aims.move_data, based on the additional output files idx_prodbas.out and prodbas_condon_shotley_list.out. The sequence after reordering is described in idx_prodbas_details.out (see the table below for column names); the reordering sorts by (m_num, l_num, atom_idx), listed in order of increasing sorting importance, in ascending order.
| file name | description | FHI-aims tag |
|---|---|---|
| idx_prodbas.out | reordering index for spherical harmonics (see below) | ri_full_output .True. |
| idx_prodbas_details.out | basis info, columns = (reordering_idx, atom_idx, l_num, bas_idx, m_num), not used later | ri_full_output .True. |
| prodbas_condon_shotley_list.out | Condon-Shortley \((-1)^{m}\) phase factor index | ri_full_output .True. |
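The sorting rule is easiest to see on a toy example. In the sketch below, atom_idx is the most important key, followed by l_num and then m_num, all ascending. The basis functions are made up, the names BasisFunction, sort_key, and condon_shortley_phase are hypothetical, and bas_idx is left out of the key because the text does not specify its role in the ordering (Python's stable sort keeps its input order). Which functions actually receive the phase factor is recorded by FHI-aims in prodbas_condon_shotley_list.out and is not reproduced here; the helper only evaluates \((-1)^m\).

```python
from collections import namedtuple

BasisFunction = namedtuple("BasisFunction", "atom_idx l_num bas_idx m_num")


def sort_key(bf):
    # Most important key first: atom_idx, then l_num, then m_num.
    return (bf.atom_idx, bf.l_num, bf.m_num)


def condon_shortley_phase(m):
    """Return (-1)^m for an integer m (also valid for negative m)."""
    return -1 if m % 2 else 1


# Made-up product basis functions in arbitrary order.
functions = [
    BasisFunction(2, 0, 1, 0),
    BasisFunction(1, 1, 1, 1),
    BasisFunction(1, 0, 1, 0),
    BasisFunction(1, 1, 1, -1),
    BasisFunction(1, 1, 1, 0),
]
ordered = sorted(functions, key=sort_key)
phases = [condon_shortley_phase(bf.m_num) for bf in ordered]
```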
Note that the script will automatically determine the version of AIMS used; the same SALTED command should be used regardless of the version of AIMS.
Also, the product basis information is really important for SALTED (for generating \(\lambda\)-SOAP kernels), and we should transfer such information from basis_info.out to the SALTED basis database basis_data.yaml, which is stored with your local installation of SALTED.
This is achieved by running
python -m salted.get_basis_info
Benchmark dataset
Because the RI/density fitting procedure is not exact, we wish to check the accuracy of the fitted density. This is achieved by running
python -m salted.aims.get_df_err
Real-space densities read from rho_df.out (DF) and rho_scf.out (SCF density matrix) are compared as an average error over a dense real-space grid,
and the mean absolute error (in percent) is written to df_maes at working root dir.
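As a rough illustration of such an error measure, the sketch below computes a grid-weighted mean absolute error of the DF density relative to the integrated SCF density, in percent. This is an assumed definition for illustration and may differ from what salted.aims.get_df_err computes; the function name density_error_percent and the optional weights argument (standing in for integration weights on the real-space grid) are hypothetical.

```python
def density_error_percent(rho_df, rho_scf, weights=None):
    """Weighted mean absolute error of rho_df against rho_scf, in percent.

    Computes 100 * sum_i w_i |rho_df_i - rho_scf_i| / sum_i w_i rho_scf_i.
    All sequences must list values on the same grid points, in the same order.
    """
    if weights is None:
        weights = [1.0] * len(rho_scf)  # uniform weights if none are given
    numerator = sum(w * abs(a - b) for w, a, b in zip(weights, rho_df, rho_scf))
    denominator = sum(w * b for w, b in zip(weights, rho_scf))
    return 100.0 * numerator / denominator


# Made-up density values on three grid points.
error = density_error_percent([1.0, 2.1, 2.9], [1.0, 2.0, 3.0])
```

Normalising by the integrated reference density makes the number comparable across structures with different numbers of electrons.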