`starter_code`

Created on Thu Jun 2 10:48:30 2022

@author: fhu14

Module Contents

Functions

`build_XX_matrix`(→ Array)	Builds a holder matrix for the residuals between energy targets
`fit_linear_ref_ener`(→ Array)	Fits a linear reference energy model between the DFTB+ method and some energy target
`get_ani1data_cached`(→ List[Dict])	Loads the ani1 data file and returns the molecules in the file
`calc_resid`(→ Union[Dict, pandas.DataFrame])	Calculates residuals of the ani1 data set
`create_heatmap`(target, title[, data_matrix, ...])	Creates a heatmap of the MAE between methods.
`filter_outliers`(→ Any)	Filters outliers from each element in the dataset
`is_outlier`(→ Union[pandas.DataFrame, pandas.Series])
`num_heavy_atoms`(→ int)	Determines the number of heavy atoms in a molecule based on its empirical formula
`get_residuals_by_num_heavy_atoms`(→ Dict)	Calculates residuals by the number of heavy atoms
`bonds_from_coordinates`(→ List)	Calculate min and max bond lengths from a set of coordinates
`get_residuals_by_num_bonds`(→ Dict)	Calculate residuals by the number of bonds
`get_residuals_by_num_atoms`(→ Dict)	Determine residual by the number of atoms
`rmse`(→ float)	Calculate the root mean squared error between y and y_pred.
`mae`(→ float)	Calculate the mean absolute error between y and y_pred.
`compute_rmse_by_num_heavy_atoms`(→ pandas.DataFrame)	Calculates the heavy-atom conditional RMSE for each method-method combination.
`plot_rmse_by_num_heavy_atoms`(→ None)	Plots the RMSE conditional on heavy atoms for each method-method combination.
`compute_rmse_by_num_bonds`(→ pandas.DataFrame)	Calculates the bond-count conditional RMSE for each method-method combination.
`plot_rmse_by_num_bonds`(→ None)	Plots the RMSE conditional on bond count for each method-method combination.
`compute_rmse_by_num_atoms`(→ pandas.DataFrame)	Calculates the atom-count conditional RMSE for each method-method combination.
`plot_rmse_by_num_atoms`(→ None)	Plots the RMSE conditional on atom count for each method-method combination.
`isin_tuple_series`(→ pandas.Series)
`create_boxplot`(boxplot_data, title[, method, plot_args])	Create a boxplot
`create_histogram`(data, xlabel[, plot_args])	Creates a histogram from filtered data
`unnest_dictionary`(→ Optional[dict])	Insert the keys of a sub-dictionary into data dictionary.
`convert_ani1_data_to_dataframe`(→ pandas.DataFrame)	Converts ANI1 data to a dataframe.
`load_ani1_data`(→ List[Dict])	Loads molecules from the ANI-1 Dataset

Attributes

`Array`
`ani1_config`
`ATOM_PAIR_TO_BOND_ANGSTROM`

starter_code.Array

starter_code.ani1_config

starter_code.ATOM_PAIR_TO_BOND_ANGSTROM

starter_code.build_XX_matrix(dataset: List[Dict], allowed_Zs: List[int]) → Array

Builds a holder matrix for the residuals between energy targets

Parameters

dataset (List[Dict]) – The list of molecule dictionaries that have had the DFTB+ results added to them.
allowed_Zs (List[int]) – The allowed atoms in the molecules

Returns

Per-molecule atomic frequency matrix

Return type

XX (Array)
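
As a rough illustration of what a per-molecule atomic frequency matrix can look like, the sketch below counts how often each allowed atomic number occurs in each molecule. It assumes every molecule dictionary carries an 'atomic_numbers' list; the name `count_matrix_sketch` is hypothetical, it returns plain lists rather than an Array, and the real function may order or extend the columns differently (for example with a constant column).

```python
from collections import Counter
from typing import Dict, List


def count_matrix_sketch(dataset: List[Dict], allowed_Zs: List[int]) -> List[List[int]]:
    """Return one row per molecule and one column per allowed atomic number."""
    rows = []
    for mol in dataset:
        counts = Counter(mol["atomic_numbers"])  # assumed key name
        rows.append([counts.get(z, 0) for z in allowed_Zs])
    return rows


# Methane (CH4) and water (H2O), with allowed elements H, C, N, O
demo = [
    {"atomic_numbers": [6, 1, 1, 1, 1]},
    {"atomic_numbers": [8, 1, 1]},
]
xx_demo = count_matrix_sketch(demo, allowed_Zs=[1, 6, 7, 8])
```

Here the methane row reads 4 hydrogens and 1 carbon, and the water row reads 2 hydrogens and 1 oxygen.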

starter_code.fit_linear_ref_ener(dataset: List[Dict], target1: str, target2: str, allowed_Zs: List[int], XX: Optional[Array] = None) → Array

Fits a linear reference energy model between the DFTB+ method and some energy target

Parameters

dataset (List[Dict]) – The list of molecule dictionaries that have had the DFTB+ results added to them.
target1 (str) – The starting point energy target
target2 (str) – The second energy target that you need to correct for
allowed_Zs (List[int]) – The allowed atoms in the molecules

Returns

The coefficients of the reference energy
XX (Array): Per-molecule atomic frequency matrix, a 2D matrix in the number of atoms
method1_mat (Array): The reference energy of the DFTB+ method
method2_mat (Array): The reference energy of the target

Return type

coefs (Array)

Notes: The reference energy corrects the magnitude between two methods in the following way:

E_2 = E_1 + sum_z N_z * C_z + C_0

where N_z is the number of times atom z occurs within the molecule, C_z is the coefficient for atom z, and C_0 is a constant offset. The coefficients are obtained by solving a least squares problem.

The reference energy vector is generated through a matrix multiply. Suppose that E_1 is the vector of energies for the molecules in the given dataset. The corrected energies, E_corrected, are generated as follows:

E_corrected = E_1 + (XX @ coefs)

where XX and coefs are the output of this function.
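
The least squares fit described in these notes can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the module's implementation: the toy counts and energies are made up, and appending a column of ones so that the last coefficient acts as C_0 is an assumption about how XX is laid out.

```python
import numpy as np

# Hypothetical toy data: 4 molecules, allowed elements H and C.
# Each row of `counts` holds N_z for one molecule.
counts = np.array([[4.0, 1.0], [6.0, 2.0], [8.0, 3.0], [2.0, 2.0]])
# Append a column of ones so the last coefficient plays the role of C_0
# (whether the real XX includes this column is an assumption).
XX = np.hstack([counts, np.ones((counts.shape[0], 1))])

true_coefs = np.array([-0.5, -37.8, 0.1])  # made-up C_H, C_C, C_0
E_1 = np.array([-1.0, -2.0, -3.0, -1.5])   # made-up energies from method 1
E_2 = E_1 + XX @ true_coefs                # synthetic energies from method 2

# Solve min || XX @ coefs - (E_2 - E_1) ||^2
coefs, *_ = np.linalg.lstsq(XX, E_2 - E_1, rcond=None)
E_corrected = E_1 + XX @ coefs
```

Because the synthetic E_2 was built from known coefficients and XX has full column rank, the fit recovers them and E_corrected matches E_2 up to floating-point error. With real data the residual E_2 - E_corrected is generally nonzero.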

starter_code.get_ani1data_cached(ani1_path: str, molecules_path: str, allowed_Z: List[int], heavy_atoms: List[int], max_config: int, target: Dict[str, str], **kwargs) → List[Dict]

Loads the ani1 data file and returns the molecules in the file

Parameters

ani1_path (str) – The path to the ani1 data file
molecules_path (str) – The path to the pickled molecules file
allowed_Z (List[int]) – Include only molecules whose elements are in this list
heavy_atoms (List[int]) – Include only molecules for which the number of heavy atoms is in this list
max_config (int) – Maximum number of configurations included for each molecule.
target (Dict[str, str]) – Entries specify the targets to extract. Each key is the target_name assigned to the target; each value is the key that the ANI-1 file assigns to this target.

Returns

The list of molecule dictionaries

Return type

molecules (List[Dict])
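
The function name and the molecules_path parameter suggest a load-or-build cache around a pickle file. The sketch below shows that general pattern only; `load_cached_sketch` and its `loader` argument are hypothetical stand-ins, and the real function may decide differently when to reuse or rebuild the cache.

```python
import os
import pickle
import tempfile
from typing import Callable, Dict, List


def load_cached_sketch(molecules_path: str, loader: Callable[[], List[Dict]]) -> List[Dict]:
    """Return pickled molecules if present; otherwise build and cache them."""
    if os.path.exists(molecules_path):
        with open(molecules_path, "rb") as handle:
            return pickle.load(handle)
    molecules = loader()  # stand-in for parsing the ANI-1 data file
    with open(molecules_path, "wb") as handle:
        pickle.dump(molecules, handle)
    return molecules


# Demonstration with a temporary cache file and a fake loader
calls = []


def fake_loader() -> List[Dict]:
    calls.append(1)  # record how often the expensive path runs
    return [{"name": "CH4", "iconfig": 0}]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "molecules.pkl")
    first = load_cached_sketch(cache_file, fake_loader)   # builds and writes the cache
    second = load_cached_sketch(cache_file, fake_loader)  # reads the cache
```

The second call returns the same molecules without invoking the loader again.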

starter_code.calc_resid(molecules: List[Dict], target: str = ani1_config['target'], allowed_Z: List[int] = ani1_config['allowed_Z'], show_progress: bool = True, XX: Optional[Array] = None, as_dataframe: bool = False) → Union[Dict, pandas.DataFrame]

Calculates residuals of the ani1 data set

Parameters

molecules (List[Dict]) – From ANI-1 dataset
allowed_Z (List[int]) – The allowed atoms in the molecules
target (str) – energy targets
show_progress (bool) – Show TQDM progress bar
XX (Optional[Array]) – precomputed array to replace molecules

Returns

matrix of the residuals between two methods

Return type

resid_matrix (Dict)

Notes

Result is converted to hartrees

starter_code.create_heatmap(target: str, title: str, data_matrix: Optional[List[Dict]] = None, dataframe: Optional[Union[pandas.DataFrame, pandas.Series]] = None, molecules: Optional[List[Dict]] = None, allowed_Z: Optional[List[int]] = None, plot_args: Optional[Dict] = None, show_progress: bool = False, XX: Optional[Array] = None)

Creates a heatmap of the MAE between methods.

Parameters

target (str) – List of method IDs to compare
title (str) – Title of heatmap
data_matrix (Optional[List[Dict]]) – residual matrix
dataframe (Optional[Union[DataFrame, Series]]) – residual dataframe
molecules (Optional[List[Dict]]) – From ANI-1 dataset
allowed_Z (Optional[List[int]]) – The allowed atoms in the molecules
plot_args (Optional[Dict]) – Arguments to pass to seaborn heatmap
show_progress (bool) – Show TQDM progress bar
XX (Optional[Array]) – precomputed array to replace molecules

Returns

Matplotlib axes object

Return type

plt.Axes

Notes

Refactored to take in the residual matrix by default

starter_code.filter_outliers(filter_type: str = 'SD', data_matrix: Dict[Tuple[str, str], Array] = None, dataframe: Union[pandas.DataFrame, pandas.Series] = None, q_lower: float = 0.25, q_upper: float = 0.75, n_sd: int = 20) → Any

Filters outliers from each element in the dataset

Parameters

n_sd (int) – the number of standard deviations
filter_type (str) – “SD” for the standard deviation method, “IQR” for the interquartile range method
data_matrix (Optional[Dict]) – dictionary with the mean absolute error
dataframe (Optional[Union[DataFrame, Series]]) – dataframe from molecules
q_lower (float) – lower quantile
q_upper (float) – upper quantile

Returns

Matrix with no outliers
dataframe (Union[DataFrame, Series]): dataframe with reference energies replaced by booleans indicating whether each value was an outlier

Return type

filtered_dict (Dict)
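
The two filter types can be sketched as boolean masks over a residual vector. This is an illustration under assumptions rather than the module's code: the function names are hypothetical, the IQR fence multiplier `k=1.5` is the conventional choice and is not stated in the documentation, and the population standard deviation is used for the SD rule.

```python
import numpy as np


def iqr_outlier_mask(x, q_lower=0.25, q_upper=0.75, k=1.5):
    """True where a value lies outside [Q_lower - k*IQR, Q_upper + k*IQR]."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.quantile(x, [q_lower, q_upper])
    spread = hi - lo  # interquantile range
    return (x < lo - k * spread) | (x > hi + k * spread)


def sd_outlier_mask(x, n_sd=20):
    """True where a value is more than n_sd standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - x.mean()) > n_sd * x.std()


# Made-up residuals with one obvious outlier
values = np.array([0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 50.0])
iqr_mask = iqr_outlier_mask(values)
sd_mask = sd_outlier_mask(values, n_sd=2)
kept = values[~iqr_mask]
```

Both masks flag only the last value in this toy example. Note that the documented default of n_sd = 20 is a very permissive threshold, so the demonstration uses n_sd=2 instead.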

starter_code.is_outlier(x: Union[pandas.DataFrame, pandas.Series], q_lower: float = 0.25, q_upper: float = 0.75) → Union[pandas.DataFrame, pandas.Series]

starter_code.num_heavy_atoms(name: str) → int

Determines the number of heavy atoms in a molecule based on its empirical formula

Parameters: name (str) – molecule name
Returns: number of heavy atoms
Return type: num_heavy (int)
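
One way to count heavy atoms from a formula string is to parse element symbols and their counts with a regular expression and skip hydrogen. The sketch assumes molecule names look like 'C2H6O'; `count_heavy_atoms_sketch` is a hypothetical name and the real function may parse names differently.

```python
import re


def count_heavy_atoms_sketch(formula: str) -> int:
    """Count non-hydrogen atoms in a formula string such as 'C2H6O'."""
    total = 0
    # Each match is an element symbol followed by an optional count
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol != "H":
            total += int(count) if count else 1
    return total


n_ethanol = count_heavy_atoms_sketch("C2H6O")
n_methane = count_heavy_atoms_sketch("CH4")
```

For ethanol this gives two carbons plus one oxygen, and for methane a single carbon.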

starter_code.get_residuals_by_num_heavy_atoms(molecules: List[Dict], residuals: Array, heavy_atoms: list[int]) → Dict

Calculates residuals by the number of heavy atoms

Parameters

heavy_atoms (List[int]) – list of heavy atoms to include in molecules
residuals (Array) – calculated residuals, presumably from calc_resid
molecules (List[Dict]) – from ANI-1 Dataset

Returns

Dictionary of the residuals keyed by num heavy atoms

Return type

molecules_by_heavy_atoms (Dict)

starter_code.bonds_from_coordinates(coordinates: Array, atomic_numbers: Array) → List

Calculate min and max bond lengths from a set of coordinates

Parameters

coordinates (Array) – Expected distance to differentiate bonds
atomic_numbers (Array) – atoms to analyze

Returns

whether a distance is a bond or not

Return type

bonds (List)
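
A common way to decide whether two atoms are bonded is to compare their distance with an element-pair cutoff. The sketch below illustrates that idea only: the cutoff table is hypothetical and does not reproduce ATOM_PAIR_TO_BOND_ANGSTROM, the name `bonded_pairs_sketch` is invented, and the real function may return bond lengths rather than index pairs.

```python
import math
from itertools import combinations

# Hypothetical cutoffs in angstroms, keyed by sorted atomic-number pair.
# These numbers are illustrative and are not the module's actual table.
CUTOFF_ANGSTROM_SKETCH = {(1, 6): 1.3, (1, 8): 1.2, (6, 6): 1.7, (6, 8): 1.6}


def bonded_pairs_sketch(coordinates, atomic_numbers, cutoffs=CUTOFF_ANGSTROM_SKETCH):
    """Return index pairs (i, j) whose distance is within the pair's cutoff."""
    bonds = []
    for i, j in combinations(range(len(atomic_numbers)), 2):
        pair = tuple(sorted((atomic_numbers[i], atomic_numbers[j])))
        limit = cutoffs.get(pair)  # pairs without a cutoff are never bonded
        if limit is not None and math.dist(coordinates[i], coordinates[j]) <= limit:
            bonds.append((i, j))
    return bonds


# Water: O at the origin, two H atoms roughly 0.96 angstrom away
water_xyz = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
water_bonds = bonded_pairs_sketch(water_xyz, [8, 1, 1])
```

In this toy geometry both O-H pairs fall under the cutoff, while the H-H pair has no table entry and is skipped.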

starter_code.get_residuals_by_num_bonds(molecules: List[Dict], residuals: Array) → Dict

Calculate residuals by the number of bonds

Parameters

molecules (List[Dict]) – from ANI-1 Dataset
residuals (Array) – calculated residuals from calc_resid

Returns

residuals by the number of bonds

Return type

molecules_by_num_bonds (Dict)

starter_code.get_residuals_by_num_atoms(molecules: List[Dict], residuals: Array) → Dict

Determine residual by the number of atoms

Parameters

molecules (List[Dict]) – from ANI-1 Dataset
residuals (Array) – calculated residuals from calc_resid

Returns

residuals by the number of atoms

Return type

molecules_by_num_atoms (Dict)

starter_code.rmse(y: Array, y_pred: Optional[Array] = None) → float

Calculate the root mean squared error between y and y_pred.

Parameters

y (Array) – exp values
y_pred (Optional[Array]) – target values

Returns

root mean square error

Return type

rmse (float)

Notes: If y_pred is not provided, y is treated as the residual vector.

starter_code.mae(y: Array, y_pred: Optional[Array] = None) → float

Calculate the mean absolute error between y and y_pred.

Notes: If y_pred is not provided, y is treated as the residual vector.
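
The behavior described for rmse and mae, including the fallback when y_pred is omitted, can be sketched in plain Python. The `_sketch` names are hypothetical and the real functions operate on Array inputs.

```python
import math
from typing import Optional, Sequence


def _residuals(y: Sequence[float], y_pred: Optional[Sequence[float]]) -> list:
    # Without y_pred, y is taken to be the residual vector already
    return list(y) if y_pred is None else [a - b for a, b in zip(y, y_pred)]


def rmse_sketch(y: Sequence[float], y_pred: Optional[Sequence[float]] = None) -> float:
    resid = _residuals(y, y_pred)
    return math.sqrt(sum(r * r for r in resid) / len(resid))


def mae_sketch(y: Sequence[float], y_pred: Optional[Sequence[float]] = None) -> float:
    resid = _residuals(y, y_pred)
    return sum(abs(r) for r in resid) / len(resid)


r1 = rmse_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # residuals 0, 0, -2
m1 = mae_sketch([3.0, -4.0])                          # treated as residuals
```

The first call yields the square root of 4/3, and the second averages the absolute values 3 and 4.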

starter_code.compute_rmse_by_num_heavy_atoms(molecules: List[Dict], resid: Dict[Tuple[str, str], Array], heavy_atoms: list[int], show_progress: bool = True) → pandas.DataFrame

Calculates the heavy-atom conditional RMSE for each method-method combination.

Parameters

molecules (List[Dict]) – List of molecules dictionaries from ANI-1 data.
resid (Dict) – Dictionary of residual vectors for each method-method combination.
heavy_atoms (list[int]) – List of allowed heavy atom numbers.
show_progress (bool) – Whether to display the TQDM progress bar.

Returns

Dataframe with the RMSE conditional on heavy atoms for each method-method combination. Also includes STD, which is the standard deviation of the residual vector, and n, which is the number of residuals used in the calculation.

Return type

pd.DataFrame
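
The conditional statistics described here amount to grouping residuals by a label (such as the heavy-atom count) and summarizing each group. The sketch below handles a single residual vector and returns a plain dictionary instead of a DataFrame; `rmse_by_group_sketch` is a hypothetical name, and the use of the population standard deviation is an assumption.

```python
import math
from collections import defaultdict
from statistics import pstdev


def rmse_by_group_sketch(group_labels, residuals):
    """Group residuals by label and report RMSE, STD, and n per group."""
    grouped = defaultdict(list)
    for label, r in zip(group_labels, residuals):
        grouped[label].append(r)
    summary = {}
    for label, values in sorted(grouped.items()):
        summary[label] = {
            "RMSE": math.sqrt(sum(v * v for v in values) / len(values)),
            "STD": pstdev(values),  # population standard deviation (an assumption)
            "n": len(values),
        }
    return summary


# Hypothetical heavy-atom counts and residuals for five structures
heavy_counts = [1, 1, 2, 2, 2]
resid_demo = [0.5, -0.5, 1.0, 2.0, 3.0]
table = rmse_by_group_sketch(heavy_counts, resid_demo)
```

RMSE and STD differ whenever a group's mean residual is nonzero: for the group with two heavy atoms the RMSE is the square root of 14/3 while the STD is the square root of 2/3.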

starter_code.plot_rmse_by_num_heavy_atoms(rmse_df: pandas.DataFrame, method_id_to_name: Optional[Dict[str, str]] = None) → None

Plots the RMSE conditional on heavy atoms for each method-method combination.

Parameters

rmse_df (pd.DataFrame) – Dataframe with the RMSE conditional on heavy atoms for each method-method combination.

starter_code.compute_rmse_by_num_bonds(molecules: List[Dict], resid: Dict[Tuple[str, str], Array], show_progress: bool = True) → pandas.DataFrame

Calculates the bond-count conditional RMSE for each method-method combination.

Parameters

molecules (List[Dict]) – List of molecules dictionaries from ANI-1 data.
resid (Dict) – Dictionary of residual vectors for each method-method combination.
show_progress (bool) – Whether to display the TQDM progress bar.

Returns

Dataframe with the RMSE conditional on the number of bonds for each method-method combination. Also includes STD, which is the standard deviation of the residual vector, and n, which is the number of residuals used in the calculation.

Return type

pd.DataFrame

starter_code.plot_rmse_by_num_bonds(rmse_df: pandas.DataFrame, method_id_to_name: Optional[Dict[str, str]] = None) → None

Plots the RMSE conditional on bond count for each method-method combination.

Parameters

rmse_df (pd.DataFrame) – Dataframe with the RMSE conditional on bond count for each method-method combination.

starter_code.compute_rmse_by_num_atoms(molecules: List[Dict], resid: Dict[Tuple[str, str], Array], show_progress: bool = True) → pandas.DataFrame

Calculates the atom-count conditional RMSE for each method-method combination.

Parameters

molecules (List[Dict]) – List of molecules dictionaries from ANI-1 data.
resid (Dict) – Dictionary of residual vectors for each method-method combination.
show_progress (bool) – Whether to display the TQDM progress bar.

Returns

Dataframe with the RMSE conditional on the number of atoms for each method-method combination. Also includes STD, which is the standard deviation of the residual vector, and n, which is the number of residuals used in the calculation.

Return type

pd.DataFrame

starter_code.plot_rmse_by_num_atoms(rmse_df: pandas.DataFrame, method_id_to_name: Optional[Dict[str, str]] = None) → None

Plots the RMSE conditional on atom count for each method-method combination.

Parameters

rmse_df (pd.DataFrame) – Dataframe with the RMSE conditional on atom count for each method-method combination.

starter_code.isin_tuple_series(values: Any, tuple_col: pandas.Series) → pandas.Series

starter_code.create_boxplot(boxplot_data: Dict, title: str, method: Optional[str] = None, plot_args: Optional[Dict] = None)

Create a boxplot

Parameters

boxplot_data (Dict) – input from calc_resid
title (str) – boxplot title
method (Optional[str]) – specify which target energy to plot
plot_args (Optional[Dict]) – other plot args

Returns

Nothing

starter_code.create_histogram(data: pandas.DataFrame, xlabel: str, plot_args: Optional[Dict] = None)

Creates a histogram from filtered data

Parameters

data (DataFrame) – filtered dataframe; the number of outliers must already be counted
plot_args (Optional[Dict]) – additional args for the histogram

Returns

Nothing

starter_code.unnest_dictionary(data: dict, key: str, prefix: str = '', inplace: bool = False) → Optional[dict]

Insert the keys of a sub-dictionary into data dictionary.

Parameters

data (dict) – Main dictionary to unnest.
key (str) – The key of the sub-dictionary to unnest.
prefix (str, optional) – String value to prefix the new keys with. Defaults to “”.
inplace (bool, optional) – Modify the dictionary in place if True, else return a copy. Defaults to False.

Returns

The modified dictionary if inplace is False, else None.

Return type

Optional[dict]
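
The unnesting described above can be sketched as follows. The name `unnest_sketch` is hypothetical, the example record and its target names are made up, and removing the nested entry after lifting its keys is an assumption that the documentation does not confirm.

```python
import copy
from typing import Optional


def unnest_sketch(data: dict, key: str, prefix: str = "", inplace: bool = False) -> Optional[dict]:
    """Lift the keys of data[key] into data, optionally prefixing them."""
    target = data if inplace else copy.deepcopy(data)
    sub = target.pop(key)  # assumption: the nested entry itself is removed
    for sub_key, value in sub.items():
        target[f"{prefix}{sub_key}"] = value
    return None if inplace else target


# Made-up record with two hypothetical target names
record = {"name": "CH4", "targets": {"dt": -40.5, "cc": -40.4}}
flat = unnest_sketch(record, "targets", prefix="target_")

# In-place variant: the input is modified and None is returned
record_copy = copy.deepcopy(record)
result_inplace = unnest_sketch(record_copy, "targets", inplace=True)
```

With inplace left at False the original record keeps its nested 'targets' entry, which makes the copy-returning form the safer default when the same records are reused elsewhere.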

starter_code.convert_ani1_data_to_dataframe(data: List[Dict]) → pandas.DataFrame

Converts ANI1 data to a dataframe.

Parameters

data (List[Dict]) –

List of dictionaries containing ANI1 data.

‘name’: str with the name ANI1 assigns to this molecule type
‘iconfig’: int with the number ANI1 assigns to this structure
‘atomic_numbers’: List of Zs
‘coordinates’: numpy array (:,3) with cartesian coordinates
‘targets’: Dict whose keys are the target_names in the target argument and whose values are numpy arrays with the ANI-1 data

Returns

A dataframe with the columns ‘name’, ‘iconfig’, ‘atomic_numbers’, and ‘coordinates’ from the input data. For each target in the input data, a column with the target name is added to the dataframe with the prefix ‘target_’.

Return type

pd.DataFrame

starter_code.load_ani1_data(config: Dict = ani1_config, ani1_path: str = './ANI-1ccx_clean_fullentry.h5', as_dataframe=False) → List[Dict]

Loads molecules from the ANI-1 Dataset

Parameters

config (Dict) – data to grab from ANI-1
ani1_path (str) – ANI-1 dataset
as_dataframe (bool) – return as dataframe or not

Returns

molecules from ANI-1 dataset

Return type

molecules (List[Dict])
