# Learn the electron density with SALTED

## Overview before starting

Related starting files

file or dir name | description |
---|---|

`README.rst` |
README file for your reference |

`inp.yaml` |
SALTED input file, consists of file paths and hyperparameters |

`run-ml.sbatch` |
Example script for training and validating SALTED |

`coefficients/*` |
Density fitting coefficients |

`projections/*` |
Density projection (the \(\mathbf{S}_{NN} \mathbf{c}_{N}^{DF}\)) |

`overlaps/*` |
Overlap matrix |

We are going to

- Sparsify the symmetry-adapted descriptors using a subset of training set (optional but often necessary for larger scale applications).
- Sparsify the atomic environments.
- Calculate the sparse descriptors for each reference atomic environment.
- Optimize GPR weights by direct inversion or conjugate gradient method.
- Validate the SALTED model.

## Generate descriptors (with optional feature sparsification)

Currently, we have obtained the DF coefficients in dir `coefficients/`

and the DF projection vectors in dir `projections/`

.
To conduct the GPR, we need to generate \(\lambda\)-SOAP power spectra (which can be used to construct kernels) for the training dataset.
This is achieved in three steps. We first run

```
mpirun -np $ntasks python -m salted.initialize
```

`wigners/wigner_lam-[lam]_lmax1-[nang1]_lmax2-[nang2].dat`

- The wigner \(3j\) symbols.
- In our example, the angular momentum \(\lambda\) ranges from \(0\) to \(5\), i.e.
`lam=[0..5]`

.`inp.descriptor.rep1.nang`

and`inp.descriptor.rep2.nang`

are from`inp.yaml`

.

`equirepr_[inp.saltedname]/FEAT-0.h5`

- The power spectrum for \(\lambda=0\) is stored in an HDF5-formatted files that allow parallel read/write.

Optionally, if a `sparsify`

subsection is added to the `inp.descriptor`

section in `inp.yaml`

, then the power spectrum for every value of \(\lambda\) is FPS-sparsified along the feature axis (keeping all atoms, but selecting features). The number of retained features is specified by `inp.descriptor.sparsify.ncut`

, and the number of structures used to sparsify the feature space is given by `inp.sparsify.nsamples`

. It is recommended to use a fairly small fraction of your total dataset to carry out this sparsification. In this case, the files `equirepr_[inp.saltedname]/fps[inp.descriptor.sparsify.ncut]-[lam].npy`

are generated which store this sparsification information.

In this example, we will not sparsify the features.

## Sparsify atomic environments

The atomic environments need to be sparsified to keep the dimensionality of the problem fixed. In this case we select `[inp.gpr.Menv]`

atomic environments by running

```
python -m salted.sparse_selection
```

This sparsification is based on the (dense) features of angular momentum \(\lambda = 0\).

The new file created is:

`equirepr_[inp.salted.saltedname]/sparse_set_[inp.gpr.Menv].txt`

- This file has two columns,
`(feature_index, atomic_species)`

.

- This file has two columns,

## Calculate sparse descriptors

Having sparsified the atomic environments, and optionally the feature space, we then calculate the descriptors for each of the sparse environemnts, at each required value of \(\lambda\), by calling

```
mpirun -np $ntasks python -m salted.sparse_descriptor
```

This produces the following files:

`equirepr_[inp.salted.saltedname]/FEAT-[lam]_M-[inp.gpr.Menv].h5`

- Features of each \(\lambda\) are sliced by sparsification results above.

## Reproducing Kernel Hilbert Space (RKHS) Vectors

Having calculated the descriptors for the sparse environments, we can now calculate the kernels between each of these environments for each \(\lambda\), \(k_{MM}^{\lambda}\), and from those kernels find the projector matrices over the RKHS:

```
python -m salted.rkhs_projector
```

This produces the file `equirepr_[inp.salted.saltedname]/projector_M[inp.gpr.Menv]_zeta[inp.gpr.z].h5`

which contains each of these projectors.

We then calculate the kernels across the full dataset, \(k_{NM}^{\lambda}\), and project them onto the RKHS to determine the input RKHS vectors:

```
mpirun -np $ntasks python -m salted.rkhs_vector
```

These input vectors are stored in the files: `rkhs-vectors_[inp.salted.saltedname]/M[inp.gpr.Menv]_zeta[inp.gpr.z]/psi-nm_conf[n].npz`

, where `i`

runs over each structure in the training dataset.

## Regression

There are two options for us to derive the weights in GPR model: direct inversion and loss function minimization.

### Direct inversion

Direct inversion is not recommended for large systems and datasets, where the dimension is greater than \(10^5\)! ❌

In order to perform the solution of the problem by inversion, the following commands need to be run:

```
mpirun -np $ntasks python -m salted.hessian_matrix # builds the matrices to be inverted
python -m salted.solve_regression # carry out matrix inversion
```

The GPR weights are stored at `regrdir_[inp.salted.saltedname]/M[inp.gpr.Menv]_zeta[inp.gpr.z]/weights_N[ntrain]_reg[log10(inp.gpr.regul)].npy`

. Here `[ntrain]`

is determined by `inp.gpr.Ntrain * inp.gpr.trainfrac`

, and `inp.gpr.regul`

is the regularization parameter.

Note that in order to parallelise the matrix inversion, we split matrices into mini-batches of size `inp.gpr.blocksize`

, which must be an exact divisor of the training set size `ntrain`

. `inp.gpr.blocksize`

should not be included if running serially.

### Conjugate gradient method

In order to solve the problem by CG minimization, run

```
mpirun -np $ntasks python -m salted.minimize_loss
```

The minimization might take relatively long, and checkpoint files are stored at `regrdir_[inp.salted.saltedname]/M[inp.gpr.Menv]_zeta[inp.gpr.z]`

.
The final GPR weights are stored in `regrdir_[inp.salted.saltedname]/M[inp.gpr.Menv]_zeta[inp.gpr.z]/weights_N[ntrain]_reg[log10(inp.gpr.regul)].npy`

.
Part of the training dataset is sampled for training, and their indexes are written to file `regrdir_[inp.salted.saltedname]/training_set_N[inp.gpr.Ntrain].txt`

.

## Validation

After obtaining the regression weights, we should validate the model by a fixed fraction of the dataset, which is not included the training structures. This is achieved by running

```
mpirun -np $ntasks python -m salted.validation
```

and results are written to `validations_[inp.salted.saltedname]/M[inp.gpr.Menv]_zeta[inp.gpr.z]/N[ntrain]_ref[log10(inp.gpr.regul)]/errors.dat`

,
in which the density mean absolute error is in percentage, and the average error for our example should be around 0.5%.

We can check the model performance when we vary the number of training samples (learning curves) by training several models with different values of the hyperparameter `inp.gpr.trainfrac`

\(\in [0.0, 1.0]\). Note that the validation set should be kept unchanged, i.e. do not change the hyperparameter `inp.gpr.Ntrain`

.