Generate Training Data

Overview before starting

We are going to

Generate a dataset using FHI-aims.
Record atomic basis information.
Benchmark the training dataset.

Related starting files

file or dir name	description
`README.rst`	README file for your reference
`inp.yaml`	SALTED input file, consists of file paths and hyperparameters
`control.in`	FHI-aims control file, for generating dataset with FHI-aims
`run-aims.sbatch`	Example script for generating dataset with FHI-aims (only an example, everything can be run locally)
`water_monomers_100.xyz`	`xyz` file, training dataset. Units: Angstrom Å

No MPI interface?

Don't worry! Just change the parameter parallel in inp.yaml to False. For all MPI $\otimes$ Python commands in this tutorial, e.g.

mpirun -np $ntasks python -m salted.aims.move_data

just run without mpirun:

python -m salted.aims.move_data
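
If you drive the tutorial from a script, the switch between the two forms can be captured in a small helper. This is only a sketch: the function name `salted_command` is hypothetical and not part of SALTED, and the `parallel` argument is assumed to mirror the value you set in inp.yaml.

```python
def salted_command(module, parallel, ntasks=1):
    """Return the argv list for running a SALTED module, with or without MPI."""
    base = ["python", "-m", module]
    if parallel:
        # Mirrors `mpirun -np $ntasks python -m ...` from the tutorial.
        return ["mpirun", "-np", str(ntasks)] + base
    # Serial fallback: the same module, without the mpirun prefix.
    return base


print(" ".join(salted_command("salted.aims.move_data", parallel=True, ntasks=4)))
print(" ".join(salted_command("salted.aims.move_data", parallel=False)))
```

The returned list can be passed directly to `subprocess.run`.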

Input geometry

For the whole project, we will stay in the working root dir example/water_monomer_AIMS, where inp.yaml is located.

cd $path_to_salted_examples/water_monomer_AIMS

Before we start, check the filename entry in inp.yaml and make sure it is water_monomers_100.xyz. This is the training dataset.

About inp.yaml

The inp.yaml consists of file paths, hyperparameters for machine-learning models, and controlling arguments. The parameters will be indicated by inp.[parameter] in the following text. For a detailed description of each parameter, please refer to the Appendix.

FHI-aims needs a control.in file and a geometry.in file to start a calculation. The control.in file is always the same, and the geometry.in file is generated by salted.aims.make_geoms from water_monomers_100.xyz:

python -m salted.aims.make_geoms

This will generate all 100 water monomers' input files in "[inp.qm.path2qm]/data/geoms/[n].in"="qmdata/data/geoms/[n].in" with [n] running from 1 to 100. Later, these files will be moved into each FHI-aims calculation dir and renamed to geometry.in.
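
To make the conversion concrete, here is a minimal sketch of turning frames of a multi-frame xyz file into FHI-aims geometry blocks. It is not the code of salted.aims.make_geoms; the function names `read_xyz_frames` and `to_geometry_in` are hypothetical, and the sketch assumes a plain xyz layout (atom count, comment line, then `symbol x y z` rows in Angstrom).

```python
def read_xyz_frames(text):
    """Split a multi-frame xyz string into lists of (symbol, x, y, z)."""
    lines = text.splitlines()
    frames, i = [], 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # skip blank lines between frames
            continue
        natoms = int(lines[i].split()[0])
        atoms = []
        # lines[i + 1] is the comment line; the atom rows follow it.
        for row in lines[i + 2 : i + 2 + natoms]:
            symbol, x, y, z = row.split()[:4]
            atoms.append((symbol, float(x), float(y), float(z)))
        frames.append(atoms)
        i += 2 + natoms
    return frames


def to_geometry_in(atoms):
    """Format one frame as FHI-aims `atom x y z species` lines (Angstrom)."""
    return "".join(
        f"atom {x:16.8f} {y:16.8f} {z:16.8f} {symbol}\n" for symbol, x, y, z in atoms
    )


# Illustrative coordinates only, not taken from water_monomers_100.xyz.
example = """3
water 1
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""
for n, atoms in enumerate(read_xyz_frames(example), start=1):
    print(f"# would be written to qmdata/data/geoms/{n}.in")
    print(to_geometry_in(atoms), end="")
```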

Control file

In the provided control.in file, there are 3 special tags for SALTED:

ri_density_restart write
ri_full_output .True.
ri_output_density_only .True.
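
Before launching 100 calculations, it can be worth checking that these three tags really are present in your control.in. The following is a small sketch of such a check; `missing_salted_tags` is a hypothetical helper, it assumes `#` starts a comment in control.in, and it compares values case-insensitively as a convenience.

```python
REQUIRED_TAGS = {
    "ri_density_restart": "write",
    "ri_full_output": ".true.",
    "ri_output_density_only": ".true.",
}


def missing_salted_tags(control_text):
    """Return the required tags that are absent or set to another value."""
    found = {}
    for line in control_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        if parts[0] in REQUIRED_TAGS and len(parts) > 1:
            found[parts[0]] = parts[1].lower()
    return [tag for tag, value in REQUIRED_TAGS.items() if found.get(tag) != value]


sample = """xc pbe
ri_density_restart write
ri_full_output .True.
# ri_output_density_only .True.
"""
print(missing_salted_tags(sample))
```

In the sample, the third tag is commented out, so it is reported as missing.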

Control tags ri_xxx and options details

You can find more details about these tags and options in the FHI-aims PDF documentation.

ri_density_restart [task] [value]
- If [task] is write
  - [value] is an optional positive float which defines a cutoff radius for calculating the overlap of the product basis functions. Default [value] is 1.5.
  - FHI-aims will write restart coefficients to a file ri_restart_coeffs.out, which should be renamed to ri_restart_coeffs_df.out manually and will be used by SALTED later.
- If [task] is read
  - [value] is an optional non-negative integer specifying the maximum number of SCF steps to be performed after reading the density. Default [value] is the value of tag sc_iter_limit, which is by default 1000.
- If [task] is read_and_write
  - FHI-aims will do both read and write.
- Default: do nothing.
ri_full_output [boolean]
- If [boolean] is .True. and ri_density_restart is write
  - FHI-aims will write (by subroutine output_idx_info())
    - overlap matrix to file ri_ovlp.out, and density projection to files ri_projections.out
    - information about the product basis needed to interface with the SALTED framework to file prodbas_condon_shotley_list.out, idx_prodbas.out, basis_info.out, idx_prodbas_details.out
    - density and partition table on the internal real-space grid used by FHI-aims to file partition_tab.out.
- If [boolean] is .False. (default)
  - FHI-aims will not write the above files.
ri_output_density_only [boolean]
- If [boolean] is .True.
  - FHI-aims will write the density to file rho_scf.out (at scf_solver.f90/write_rho()).
- If [boolean] is .False. (default)
  - FHI-aims will not write the above file.

If you need help with FHI-aims

For basics of using FHI-aims, please refer to this tutorial: Basics of Running FHI-aims.

For the full FHI-aims user guide, please download the latest PDF documentation here, or search for a version compatible with your FHI-aims. You can literally find everything about FHI-aims in this PDF.

Run FHI-aims

To run the FHI-aims calculations, write your own script for your cluster/PC, using run-aims.sbatch as a reference (it is merely an example), and then run the script. If you are using an HPC (not necessary for this exercise), the command would be:

sbatch run-aims.sbatch

otherwise, something like

bash run-aims.sh

will suffice. If you struggle to adapt this script to your needs, contact one of the people responsible for this tutorial.

For each FHI-aims calculation, the script should

Copy control.in (from working root dir) and each geometry.in (from [inp.qm.path2qm]/data/geoms="qmdata/data/geoms") to the working dir.
Run FHI-aims
Rename rho_rebuilt_ri.out to rho_df.out, rename ri_restart_coeffs.out to ri_restart_coeffs_df.out
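
The three steps above can be sketched in Python for a local run. This is an illustration rather than a translation of run-aims.sbatch: the function names are hypothetical, `aims.x` is a placeholder for whatever FHI-aims executable (and MPI launcher) you use, and the work dir layout [inp.qm.path2qm]/data/[idx] is taken from the rest of this tutorial.

```python
import shutil
import subprocess
from pathlib import Path


def run_one(idx, root, qm_dir, run_aims):
    """Run one FHI-aims calculation for structure `idx` and rename its outputs."""
    workdir = Path(qm_dir) / "data" / str(idx)
    workdir.mkdir(parents=True, exist_ok=True)
    # Step 1: copy control.in and this structure's geometry into the work dir.
    shutil.copy(Path(root) / "control.in", workdir / "control.in")
    shutil.copy(Path(qm_dir) / "data" / "geoms" / f"{idx}.in", workdir / "geometry.in")
    # Step 2: run FHI-aims inside the work dir.
    run_aims(workdir)
    # Step 3: rename the RI outputs so that they carry the _df suffix.
    (workdir / "rho_rebuilt_ri.out").rename(workdir / "rho_df.out")
    (workdir / "ri_restart_coeffs.out").rename(workdir / "ri_restart_coeffs_df.out")
    return workdir


def run_aims_locally(workdir, aims_cmd=("aims.x",)):
    """Hypothetical launcher: `aims.x` stands in for your FHI-aims command."""
    with open(Path(workdir) / "aims.out", "w") as log:
        subprocess.run(list(aims_cmd), cwd=workdir, stdout=log, check=True)
```

A full local run would then be a loop such as `for idx in range(1, 101): run_one(idx, ".", "qmdata", run_aims_locally)`, executed from the working root dir.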

Why rename outputs?

This is for the sake of distinguishing training and prediction.

There will be two kinds of rebuilt densities / DF coefficients: either from FHI-aims for training, or from SALTED for prediction (see Part 3).

The current ri_restart_coeffs.out contains the DF coefficients, which are used for training SALTED, so it is suffixed by _df.
The current rho_rebuilt_ri.out is the rebuilt real-space density from the DF mentioned above, so it is also re-suffixed by _df. This file will be later used for benchmarking DF accuracy.

Output files

After the FHI-aims calculations, each FHI-aims working dir ([inp.qm.path2qm]/data/[idx]="qmdata/data/[idx]") contains

file name	description	FHI-aims tag
*FHI-aims basic files*
`control.in`	moved in by script, from `control.in` in working root dir	---
`geometry.in`	moved in by script, from `qmdata/data/geoms/[idx].in`	---
`aims.out`	FHI-aims output	---
*real-space density related outputs*
`partition_tab.out`	space grid in FHI-aims integral	`ri_full_output .True.`
`rho_df.out`	renamed by script from `rho_rebuilt_ri.out`, `df` for density fitting	---
(`rho_rebuilt_ri.out`)	reconstructed real-space electron density by RI/DF, columns=`(x,y,z,density)`, renamed by script to `rho_df.out`	`ri_full_output .True.`
`rho_scf.out`	real-space electron density, columns=`(x,y,z,density)`	`ri_output_density_only .True.` -> `ri_output_density = .True.`
*RI / DF related outputs*	(Notice: $\text{RI} \Leftrightarrow \text{DF}$ refer to the same procedure)
`ri_ovlp.out`	overlap matrix of DF basis $\mathbf{S}_{NN}$	`ri_full_output .True.`
`ri_restart_coeffs_df.out`	renamed by script from `ri_restart_coeffs.out`, `df` for DF	---
(`ri_restart_coeffs.out`)	DF coefficients ($\mathbf{c}_{N}^{DF}$), renamed by script to `ri_restart_coeffs_df.out`	`ri_density_restart write`
`ri_projections.out`	$\mathbf{S}_{NN} \mathbf{c}_{N}^{DF} = \left\langle \phi_{i,\sigma} (\mathbf{0}) \middle| \rho^{QM} \right\rangle$, projected electron density	`ri_full_output .True.`
*product basis related outputs*
`basis_info.out`	general information about the product basis	`ri_full_output .True.`
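
A quick way to spot failed or incomplete runs is to check each working dir against the file names in the table above. The sketch below does this; the helper names are hypothetical, and the list only contains the files the table says should remain after the renaming step.

```python
from pathlib import Path

EXPECTED_OUTPUTS = [
    "control.in", "geometry.in", "aims.out",
    "partition_tab.out", "rho_df.out", "rho_scf.out",
    "ri_ovlp.out", "ri_restart_coeffs_df.out", "ri_projections.out",
    "basis_info.out",
]


def missing_outputs(workdir):
    """Return the expected files that are not present in one working dir."""
    workdir = Path(workdir)
    return [name for name in EXPECTED_OUTPUTS if not (workdir / name).is_file()]


def incomplete_runs(qm_dir, n_structures):
    """Map each structure index (1-based) to its missing files, if any."""
    report = {}
    for idx in range(1, n_structures + 1):
        missing = missing_outputs(Path(qm_dir) / "data" / str(idx))
        if missing:
            report[idx] = missing
    return report
```

For this tutorial, `incomplete_runs("qmdata", 100)` should return an empty dict once all calculations have finished and the outputs have been renamed.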

Collecting AIMS outputs

The RI outputs from each AIMS calculation need to be collected in one place and converted to NumPy format, which speeds up reading these quantities in later steps. For that, we will run:

mpirun -np $ntasks python -m salted.aims.move_data

where $ntasks needs to be replaced by the number of tasks you wish to use on your machine. Three directories, namely overlaps, projections, and coefficients, are generated in the working root dir, containing the data collected from the FHI-aims output files.

data name	physics quantity	from file (in dir `qmdata/data/[n]`)	to file (at working root)
overlap matrices	$\mathbf{S}_{NN}$	`ri_ovlp.out`	`overlaps/overlap_conf[n].npy`
DF projections	$\mathbf{S}_{NN} \mathbf{c}_{N}^{DF}$	`ri_projections.out`	`projections/projections_conf[n].npy`
DF coefficients	$\mathbf{c}_{N}^{DF}$	`ri_restart_coeffs_df.out`	`coefficients/coefficients_conf[n].npy`

Notice that n ranges from 1 to 100 across the training dataset, and overlap naming convention is overlap_conf[n].npy.
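
Since the table states that the DF projections equal $\mathbf{S}_{NN} \mathbf{c}_{N}^{DF}$, the three collected quantities can be checked against each other. The sketch below does this in plain Python on toy numbers (not real FHI-aims data); the function names are hypothetical, and it assumes a square overlap matrix with coefficient and projection vectors of matching length.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def max_projection_residual(overlap, coefficients, projections):
    """Largest absolute entry of S c - w, where w are the stored projections."""
    predicted = matvec(overlap, coefficients)
    return max(abs(p - w) for p, w in zip(predicted, projections))


# Toy 2x2 example, chosen so that the projections equal S c.
overlap = [[1.0, 0.2], [0.2, 1.0]]
coefficients = [0.5, -0.25]
projections = [0.45, -0.15]
print(max_projection_residual(overlap, coefficients, projections))
```

With the real files, the same check could be done on arrays read with `numpy.load`; a residual far from zero would suggest that the files of different structures were mixed up or that the reordering was applied inconsistently.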

Reordering coefficients (AIMS version < 240403)

Due to different spherical harmonics conventions, the overlap matrix / DF projections / DF coefficients must be reordered, and the Condon-Shortley convention must be applied (it includes the Condon-Shortley phase factor $(-1)^m$ in the definition of the spherical harmonics), before using them in SALTED. In newer versions of AIMS, this is done before the files are written; in older versions, it is done by salted.aims.move_data, based on the additional output files idx_prodbas.out and prodbas_condon_shotley_list.out. The sequence after reordering is described in idx_prodbas_details.out (see the table below for column names): entries are sorted in ascending order by (m_num, l_num, atom_idx) with increasing sorting importance, i.e. by atom_idx first, then l_num, then m_num.

file name	description	FHI-aims tag
`idx_prodbas.out`	reordering index for spherical harmonics (see below)	`ri_full_output .True.`
`idx_prodbas_details.out`	basis info, columns = `(reordering_idx, atom_idx, l_num, bas_idx, m_num)`, not used later	`ri_full_output .True.`
`prodbas_condon_shotley_list.out`	Condon-Shortley $(-1)^{m}$ phase factor index	`ri_full_output .True.`
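
The sorting rule and the phase factor can be illustrated on a toy basis. This sketch is not the code of salted.aims.move_data, and it makes several assumptions: the on-disk layout of idx_prodbas.out and prodbas_condon_shotley_list.out is not reproduced here, `flip` is a hypothetical list of 0-based positions (in the reordered sequence) that receive a sign change, and applying the sign after the permutation is a choice made for the example only.

```python
def reorder_index(basis):
    """Return the positions of `basis` sorted by atom_idx, then l_num, then m_num."""
    # `basis` holds (atom_idx, l_num, m_num) tuples in their original order.
    # Python's sort is stable, so entries with equal keys keep their order.
    return sorted(range(len(basis)), key=lambda i: basis[i])


def apply_reordering(coeffs, order, flip=()):
    """Permute `coeffs` by `order`, then negate the reordered positions in `flip`."""
    flip = set(flip)
    return [-coeffs[old] if new in flip else coeffs[old] for new, old in enumerate(order)]


# Toy basis on one atom: one l=0 function and three l=1 functions, unsorted.
basis = [(1, 1, 0), (1, 0, 0), (1, 1, -1), (1, 1, 1)]
order = reorder_index(basis)
print(order)
print(apply_reordering([10.0, 20.0, 30.0, 40.0], order, flip=[3]))
```

For a matrix such as the overlap, the same permutation and signs would have to be applied to both rows and columns.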

Note that the script will automatically determine the version of AIMS used; the same SALTED command should be used regardless of the version of AIMS.

Also, the product basis information is really important for SALTED (for generating $\lambda$-SOAP kernels), and we should transfer such information from basis_info.out to the SALTED basis database basis_data.yaml, which is stored with your local installation of SALTED. This is achieved by running

python -m salted.get_basis_info

Benchmark dataset

Because the RI/density fitting procedure is not exact, we wish to check the accuracy of the fitted density. This is achieved by running

python -m salted.aims.get_df_err

Real-space densities read from rho_df.out (DF) and rho_scf.out (SCF density matrix) are compared as an average error over a dense real-space grid, and the mean absolute error (in percent) is written to df_maes at working root dir.
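
As an illustration of what such an error measure looks like, the sketch below computes a percentage error between two densities given on the same grid. It is one common definition, not necessarily the exact formula used by salted.aims.get_df_err: the normalization by the summed SCF density and the optional per-point `weights` (for example integration weights, if you have them) are assumptions, and the function names are hypothetical.

```python
def read_density(text):
    """Parse `x y z density` rows and return the density column."""
    return [float(line.split()[3]) for line in text.splitlines() if line.strip()]


def df_error_percent(rho_df, rho_scf, weights=None):
    """Summed absolute density error relative to the SCF density, in percent."""
    if weights is None:
        weights = [1.0] * len(rho_scf)  # assumption: equal weight per grid point
    abs_err = sum(w * abs(a - b) for w, a, b in zip(weights, rho_df, rho_scf))
    norm = sum(w * abs(b) for w, b in zip(weights, rho_scf))
    return 100.0 * abs_err / norm


# Toy data in the (x, y, z, density) column layout of rho_scf.out / rho_df.out.
scf_text = "0 0 0 1.0\n0 0 1 2.0\n0 1 0 1.0\n"
df_text = "0 0 0 1.0\n0 0 1 1.9\n0 1 0 1.1\n"
print(df_error_percent(read_density(df_text), read_density(scf_text)))
```

In this toy case the absolute errors sum to about 0.2 against a total density of 4, so the result is close to 5 percent.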