starter_code

Created on Thu Jun 2 10:48:30 2022

@author: fhu14

Module Contents

Functions

build_XX_matrix(→ Array)

Builds a holder matrix for the residuals between energy targets

fit_linear_ref_ener(→ Array)

Fits a linear reference energy model between the DFTB+ method and some energy target

get_ani1data_cached(→ List[Dict])

Loads the ani1 data file and returns the molecules in the file

calc_resid(→ Union[Dict, pandas.DataFrame])

Calculates residuals of the ANI-1 data set

create_heatmap(target, title[, data_matrix, ...])

Creates a heatmap of the MAE between methods.

filter_outliers(→ Any)

Filters outliers from each element in the dataset

is_outlier(→ Union[pandas.DataFrame, pandas.Series])

num_heavy_atoms(→ int)

Determines the number of heavy atoms in a molecule based on its empirical formula

get_residuals_by_num_heavy_atoms(→ Dict)

Calculates residuals by the number of heavy atoms

bonds_from_coordinates(→ List)

Calculate min and max bond lengths from a set of coordinates

get_residuals_by_num_bonds(→ Dict)

Calculate residuals by the number of bonds

get_residuals_by_num_atoms(→ Dict)

Determine residual by the number of atoms

rmse(→ float)

Calculate the root mean squared error between y and y_pred.

mae(→ float)

Calculate the mean absolute error between y and y_pred.

compute_rmse_by_num_heavy_atoms(→ pandas.DataFrame)

Calculates the heavy-atom conditional RMSE for each method-method combination.

plot_rmse_by_num_heavy_atoms(→ None)

Plots the RMSE conditional on heavy atoms for each method-method combination.

compute_rmse_by_num_bonds(→ pandas.DataFrame)

Calculates the bond-count conditional RMSE for each method-method combination.

plot_rmse_by_num_bonds(→ None)

Plots the RMSE conditional on bond count for each method-method combination.

compute_rmse_by_num_atoms(→ pandas.DataFrame)

Calculates the atom-count conditional RMSE for each method-method combination.

plot_rmse_by_num_atoms(→ None)

Plots the RMSE conditional on atom count for each method-method combination.

isin_tuple_series(→ pandas.Series)

create_boxplot(boxplot_data, title[, method, plot_args])

Create a boxplot

create_histogram(data, xlabel[, plot_args])

Creates a histogram from the filtered data

unnest_dictionary(→ Optional[dict])

Insert the keys of a sub-dictionary into data dictionary.

convert_ani1_data_to_dataframe(→ pandas.DataFrame)

Converts ANI1 data to a dataframe.

load_ani1_data(→ List[Dict])

Loads molecules from the ANI-1 Dataset

Attributes

Array

ani1_config

ATOM_PAIR_TO_BOND_ANGSTROM

starter_code.Array
starter_code.ani1_config
starter_code.ATOM_PAIR_TO_BOND_ANGSTROM
starter_code.build_XX_matrix(dataset: List[Dict], allowed_Zs: List[int]) Array

Builds a holder matrix for the residuals between energy targets

Parameters
  • dataset (List[Dict]) – The list of molecule dictionaries that have had the DFTB+ results added to them.

  • allowed_Zs (List[int]) – The allowed atoms in the molecules

Returns

Per-molecule atomic frequency matrix

Return type

XX (Array)
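
As a rough illustration of a per-molecule atomic frequency matrix, the sketch below counts how often each allowed element occurs in each molecule. It assumes every molecule dictionary carries an 'atomic_numbers' list, and that a trailing column of ones is appended for a constant term. The name count_matrix_sketch and the column layout are assumptions for illustration, not the module's confirmed implementation.

```python
from collections import Counter
from typing import Dict, List

import numpy as np


def count_matrix_sketch(dataset: List[Dict], allowed_Zs: List[int]) -> np.ndarray:
    """Hypothetical helper: one row per molecule, one column per allowed element."""
    column = {z: i for i, z in enumerate(allowed_Zs)}
    # Assumed layout: element counts first, then a constant column of ones.
    XX = np.zeros((len(dataset), len(allowed_Zs) + 1))
    for row, molecule in enumerate(dataset):
        for z, n in Counter(int(z) for z in molecule["atomic_numbers"]).items():
            XX[row, column[z]] = n
        XX[row, -1] = 1.0
    return XX


# Toy input: methane and water, with allowed elements H, C, O.
toy_dataset = [{"atomic_numbers": [6, 1, 1, 1, 1]}, {"atomic_numbers": [8, 1, 1]}]
XX_toy = count_matrix_sketch(toy_dataset, [1, 6, 8])
```

In this toy case the methane row holds four H, one C, zero O and the constant 1.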

starter_code.fit_linear_ref_ener(dataset: List[Dict], target1: str, target2: str, allowed_Zs: List[int], XX: Optional[Array] = None) Array
Fits a linear reference energy model between the DFTB+ method and some energy target

Parameters
  • dataset (List[Dict]) – The list of molecule dictionaries that have had the DFTB+ results added to them.

  • target1 (str) – The starting point energy target

  • target2 (str) – The second energy target that you need to correct for

  • allowed_Zs (List[int]) – The allowed atoms in the molecules

Returns

The coefficients of the reference energy

  • XX (Array) – 2D matrix in the number of atoms

  • method1_mat (Array) – The reference energy of the DFTB+ method

  • method2_mat (Array) – The reference energy of the target

  • XX (Array) – Per-molecule atomic frequency matrix

Return type

coefs (Array)

Notes: The reference energy corrects the magnitude between two methods in the following way:

E_2 = E_1 + sum_z N_z * C_z + C_0

where N_z is the number of times atom z occurs within the molecule, C_z is the coefficient for atom z, and C_0 is a constant offset. The coefficients are obtained by solving a least squares problem.

The reference energy vector is generated through a matrix multiply. Suppose that E_1 is the vector of energies for the molecules in the given dataset. The corrected energies, E_corrected, are generated as follows:

E_corrected = E_1 + (XX @ coefs)

where XX and coefs are the output of this function.
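
The least squares fit and the correction step described in these notes can be sketched with NumPy. The frequency matrix, energies, and coefficients below are made-up toy values, and solving with np.linalg.lstsq is an assumption about how such a fit could be done, not a statement of what fit_linear_ref_ener does internally.

```python
import numpy as np

# Toy frequency matrix: counts of H, C, O per molecule plus a constant column.
XX = np.array([
    [4.0, 1.0, 0.0, 1.0],
    [2.0, 0.0, 1.0, 1.0],
    [6.0, 2.0, 0.0, 1.0],
    [4.0, 1.0, 1.0, 1.0],
    [2.0, 2.0, 0.0, 1.0],
])

# Made-up energies for method 1 and made-up per-element offsets (C_z..., C_0).
E_1 = np.array([-40.1, -76.0, -79.2, -115.0, -77.0])
toy_coefs = np.array([-0.5, -37.8, -75.0, 0.1])
E_2 = E_1 + XX @ toy_coefs

# Solve min || XX @ coefs - (E_2 - E_1) ||^2 for the reference coefficients.
coefs, *_ = np.linalg.lstsq(XX, E_2 - E_1, rcond=None)

# Apply the correction exactly as in the notes above.
E_corrected = E_1 + (XX @ coefs)
```

Because the toy E_2 was built from an exact linear model and XX has full column rank, the fit recovers toy_coefs and E_corrected matches E_2. Real data would leave a nonzero residual.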

starter_code.get_ani1data_cached(ani1_path: str, molecules_path: str, allowed_Z: List[int], heavy_atoms: List[int], max_config: int, target: Dict[str, str], **kwargs) List[Dict]

Loads the ani1 data file and returns the molecules in the file

Parameters
  • ani1_path (str) – The path to the ani1 data file

  • molecules_path (str) – The path to the pickled molecules file

  • allowed_Z (List[int]) – Include only molecules whose elements are in this list

  • heavy_atoms (List[int]) – Include only molecules for which the number of heavy atoms is in this list

  • max_config (int) – Maximum number of configurations included for each molecule.

  • target (Dict[str, str]) – Entries specify the targets to extract. Key: the target_name assigned to the target. Value: the key that the ANI-1 file assigns to this target.

Returns

The list of molecule dictionaries

Return type

molecules (List[Dict])

starter_code.calc_resid(molecules: List[Dict], target: str = ani1_config['target'], allowed_Z: List[int] = ani1_config['allowed_Z'], show_progress: bool = True, XX: Optional[Array] = None, as_dataframe: bool = False) Union[Dict, pandas.DataFrame]

Calculates residuals of the ANI-1 data set

Parameters
  • molecules (List[Dict]) – From ANI-1 dataset

  • allowed_Z (List[int]) – The allowed atoms in the molecules

  • target (str) – energy targets

  • show_progress (bool) – Show TQDM progress bar

  • XX (Optional[Array]) – precomputed array to replace molecules

Returns

matrix of the residuals between two methods

Return type

resid_matrix (Dict)

Notes

Result is converted to hartrees

starter_code.create_heatmap(target: str, title: str, data_matrix: Optional[List[Dict]] = None, dataframe: Optional[Union[pandas.DataFrame, pandas.Series]] = None, molecules: Optional[List[Dict]] = None, allowed_Z: Optional[List[int]] = None, plot_args: Optional[Dict] = None, show_progress: bool = False, XX: Optional[Array] = None)

Creates a heatmap of the MAE between methods.

Parameters
  • target (str) – List of method IDs to compare

  • title (str) – Title of heatmap

  • data_matrix (Optional[List[Dict]]) – residual matrix

  • dataframe (Optional[Union[DataFrame, Series]]) – residual dataframe

  • molecules (Optional[List[Dict]]) – From ANI-1 dataset

  • allowed_Z (Optional[List[int]]) – The allowed atoms in the molecules

  • plot_args (Optional[Dict]) – Arguments to pass to seaborn heatmap

  • show_progress (bool) – Show TQDM progress bar

  • XX (Optional[Array]) – precomputed array to replace molecules

Returns

Matplotlib axes object

Return type

plt.Axes

Notes

Refactored to take in the residual matrix by default

starter_code.filter_outliers(filter_type: str = 'SD', data_matrix: Dict[Tuple[str, str], Array] = None, dataframe: Union[pandas.DataFrame, pandas.Series] = None, q_lower: float = 0.25, q_upper: float = 0.75, n_sd: int = 20) Any

Filters outliers from each element in the dataset

Parameters
  • n_sd (int) – the number of standard deviations

  • filter_type (str) – “SD” for the standard deviation method, “IQR” for the interquartile range method

  • data_matrix (Optional[Dict]) – dictionary with the mean absolute error

  • dataframe (Optional[Union[DataFrame, Series]]) – dataframe from molecules

  • q_lower (float) – lower quantile

  • q_upper (float) – upper quantile

Returns

matrix with no outliers

  • dataframe (Union[DataFrame, Series]) – dataframe with reference energies replaced by a bool indicating whether each value was an outlier

Return type

filtered_dict (Dict)
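
The two filter types can be illustrated with a small mask function. This is a sketch under assumptions: the SD rule flags values more than n_sd standard deviations from the mean, and the IQR rule uses the common 1.5 × IQR fences around the chosen quantiles. The fence factor and the name outlier_mask_sketch are not taken from the module.

```python
import numpy as np


def outlier_mask_sketch(x, filter_type="SD", n_sd=20, q_lower=0.25, q_upper=0.75):
    """Hypothetical helper: return a boolean array, True where a value is an outlier."""
    x = np.asarray(x, dtype=float)
    if filter_type == "SD":
        # Flag values further than n_sd population standard deviations from the mean.
        return np.abs(x - x.mean()) > n_sd * x.std()
    if filter_type == "IQR":
        low, high = np.quantile(x, [q_lower, q_upper])
        iqr = high - low
        # Assumed fences: 1.5 * IQR beyond the lower and upper quantiles.
        return (x < low - 1.5 * iqr) | (x > high + 1.5 * iqr)
    raise ValueError(f"Unknown filter_type: {filter_type!r}")


values = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
mask_iqr = outlier_mask_sketch(values, filter_type="IQR")
mask_sd_loose = outlier_mask_sketch(values, filter_type="SD")         # n_sd=20
mask_sd_tight = outlier_mask_sketch(values, filter_type="SD", n_sd=2)
```

With the default n_sd of 20 the SD rule is very permissive, so in this toy data only the IQR rule and the tighter n_sd=2 setting flag the value 100.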

starter_code.is_outlier(x: Union[pandas.DataFrame, pandas.Series], q_lower: float = 0.25, q_upper: float = 0.75) Union[pandas.DataFrame, pandas.Series]
starter_code.num_heavy_atoms(name: str) int

Determines the number of heavy atoms in a molecule based on its empirical formula

Parameters

name (str) – molecule name

Returns

number of heavy atoms

Return type

num_heavy (int)
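
One way to count heavy atoms from an empirical formula is to parse element symbols and their counts and skip hydrogen. The sketch below assumes the name is a plain formula such as C2H6O. Real ANI-1 molecule names may carry extra prefixes or suffixes, and count_heavy_atoms_sketch is a hypothetical name.

```python
import re


def count_heavy_atoms_sketch(formula: str) -> int:
    """Hypothetical helper: count all non-hydrogen atoms in a formula like 'C2H6O'."""
    total = 0
    # Each match is (element symbol, optional count), e.g. ('C', '2') or ('O', '').
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element != "H":
            total += int(count) if count else 1
    return total


ethanol_heavy = count_heavy_atoms_sketch("C2H6O")
methane_heavy = count_heavy_atoms_sketch("CH4")
```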

starter_code.get_residuals_by_num_heavy_atoms(molecules: List[Dict], residuals: Array, heavy_atoms: list[int]) Dict

Calculates residuals by the number of heavy atoms

Parameters
  • heavy_atoms (List[int]) – list of heavy-atom counts to include

  • residuals (Array) – calculated residuals from calc_resid

  • molecules (List[Dict]) – from ANI-1 Dataset

Returns

Dictionary of the residuals keyed by num heavy atoms

Return type

molecules_by_heavy_atoms (Dict)

starter_code.bonds_from_coordinates(coordinates: Array, atomic_numbers: Array) List

Calculate min and max bond lengths from a set of coordinates

Parameters
  • coordinates (Array) – Cartesian coordinates of the atoms

  • atomic_numbers (Array) – atomic numbers of the atoms to analyze

Returns

whether a distance is a bond or not

Return type

bonds (List)
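
A distance-based bond test can be sketched as follows: compute each pairwise distance and compare it with a maximum bond length for that element pair. The cutoff table here is hypothetical and only stands in for whatever ATOM_PAIR_TO_BOND_ANGSTROM contains. The function name and the returned list of index pairs are likewise assumptions.

```python
import itertools

import numpy as np

# Hypothetical cutoffs in angstrom, keyed by sorted atomic-number pairs.
MAX_BOND_ANGSTROM_SKETCH = {(1, 6): 1.3, (1, 8): 1.2, (6, 6): 1.7, (6, 8): 1.6}


def find_bonds_sketch(coordinates, atomic_numbers):
    """Hypothetical helper: return index pairs (i, j) whose distance is within the cutoff."""
    coordinates = np.asarray(coordinates, dtype=float)
    bonds = []
    for i, j in itertools.combinations(range(len(atomic_numbers)), 2):
        pair = tuple(sorted((int(atomic_numbers[i]), int(atomic_numbers[j]))))
        cutoff = MAX_BOND_ANGSTROM_SKETCH.get(pair)
        if cutoff is None:
            continue  # no cutoff known for this element pair, so never a bond here
        if np.linalg.norm(coordinates[i] - coordinates[j]) <= cutoff:
            bonds.append((i, j))
    return bonds


# Water-like toy geometry: O at the origin with two H atoms about 0.96 angstrom away.
water_xyz = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
water_bonds = find_bonds_sketch(water_xyz, [8, 1, 1])
```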

starter_code.get_residuals_by_num_bonds(molecules: List[Dict], residuals: Array) Dict

Calculate residuals by the number of bonds

Parameters
  • molecules (List[Dict]) – from ANI-1 Dataset

  • residuals (Array) – calculated residuals from calc_resid

Returns

residuals by the number of bonds

Return type

molecules_by_num_bonds (Dict)

starter_code.get_residuals_by_num_atoms(molecules: List[Dict], residuals: Array) Dict

Determine residual by the number of atoms

Parameters
  • molecules (List[Dict]) – from ANI-1 Dataset

  • residuals (Array) – calculated residuals from calc_resid

Returns

residuals grouped by the number of atoms

Return type

molecules_by_num_atoms (Dict)

starter_code.rmse(y: Array, y_pred: Optional[Array] = None) float

Calculate the root mean squared error between y and y_pred.

Parameters
  • y (Array) – expected values, or the residual vector if y_pred is not provided

  • y_pred (Optional[Array]) – target values

Returns

root mean square error

Return type

rmse (float)

Notes: If y_pred is not provided, y is treated as the residual vector.

starter_code.mae(y: Array, y_pred: Optional[Array] = None) float

Calculate the mean absolute error between y and y_pred.

Notes: If y_pred is not provided, y is treated as the residual vector.
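
The optional y_pred behaviour shared by rmse and mae can be sketched as below. The function names with the _sketch suffix are hypothetical stand-ins, written only from the documented behaviour.

```python
import numpy as np


def _residual(y, y_pred=None):
    # With no y_pred, y is already the residual vector.
    y = np.asarray(y, dtype=float)
    return y if y_pred is None else y - np.asarray(y_pred, dtype=float)


def rmse_sketch(y, y_pred=None) -> float:
    """Root mean squared error of the residual vector."""
    return float(np.sqrt(np.mean(_residual(y, y_pred) ** 2)))


def mae_sketch(y, y_pred=None) -> float:
    """Mean absolute error of the residual vector."""
    return float(np.mean(np.abs(_residual(y, y_pred))))


rmse_from_resid = rmse_sketch([3.0, 4.0])                     # sqrt((9 + 16) / 2)
mae_from_pair = mae_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # (0 + 0 + 2) / 3
```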

starter_code.compute_rmse_by_num_heavy_atoms(molecules: List[Dict], resid: Dict[Tuple[str, str], Array], heavy_atoms: list[int], show_progress: bool = True) pandas.DataFrame

Calculates the heavy-atom conditional RMSE for each method-method combination.

Parameters
  • molecules (List[Dict]) – List of molecules dictionaries from ANI-1 data.

  • resid (Dict) – Dictionary of residual vectors for each method-method combination.

  • heavy_atoms (list[int]) – List of allowed heavy atom numbers.

  • show_progress (bool) – Whether to display the TQDM progress bar.

Returns

Dataframe with the RMSE conditional on heavy atoms for each method-method combination. Also includes STD, which is the standard deviation of the residual vector, and n, which is the number of residuals used in the calculation.

Return type

pd.DataFrame
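
A conditional RMSE table of this kind can be sketched with a pandas groupby over a single residual vector. The column names, the toy residuals, and the use of the population standard deviation are assumptions. The real function works over every method-method combination and may lay out its dataframe differently.

```python
import numpy as np
import pandas as pd


def _rmse(r: pd.Series) -> float:
    return float(np.sqrt(np.mean(r.to_numpy() ** 2)))


def _std(r: pd.Series) -> float:
    # Population standard deviation (ddof=0); the module's choice is not documented.
    return float(np.std(r.to_numpy()))


# Toy residuals with the heavy-atom count of the molecule each one belongs to.
toy = pd.DataFrame({
    "n_heavy": [1, 1, 2, 2, 2],
    "resid": [0.1, -0.1, 0.3, -0.3, 0.2],
})

# One row per heavy-atom count, with RMSE, STD and the number of residuals n.
summary = (
    toy.groupby("n_heavy")["resid"]
    .agg(rmse=_rmse, std=_std, n="count")
    .reset_index()
)
```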

starter_code.plot_rmse_by_num_heavy_atoms(rmse_df: pandas.DataFrame, method_id_to_name: Optional[Dict[str, str]] = None) None

Plots the RMSE conditional on heavy atoms for each method-method combination.

Parameters
  • rmse_df (pd.DataFrame) – Dataframe with the RMSE conditional on heavy atoms for each method-method combination.

starter_code.compute_rmse_by_num_bonds(molecules: List[Dict], resid: Dict[Tuple[str, str], Array], show_progress: bool = True) pandas.DataFrame

Calculates the bond-count conditional RMSE for each method-method combination.

Parameters
  • molecules (List[Dict]) – List of molecules dictionaries from ANI-1 data.

  • resid (Dict) – Dictionary of residual vectors for each method-method combination.

  • show_progress (bool) – Whether to display the TQDM progress bar.

Returns

Dataframe with the RMSE conditional on the number of bonds for each method-method combination. Also includes STD, which is the standard deviation of the residual vector, and n, which is the number of residuals used in the calculation.

Return type

pd.DataFrame

starter_code.plot_rmse_by_num_bonds(rmse_df: pandas.DataFrame, method_id_to_name: Optional[Dict[str, str]] = None) None

Plots the RMSE conditional on bond count for each method-method combination.

Parameters
  • rmse_df (pd.DataFrame) – Dataframe with the RMSE conditional on bond count for each method-method combination.

starter_code.compute_rmse_by_num_atoms(molecules: List[Dict], resid: Dict[Tuple[str, str], Array], show_progress: bool = True) pandas.DataFrame

Calculates the atom-count conditional RMSE for each method-method combination.

Parameters
  • molecules (List[Dict]) – List of molecules dictionaries from ANI-1 data.

  • resid (Dict) – Dictionary of residual vectors for each method-method combination.

  • show_progress (bool) – Whether to display the TQDM progress bar.

Returns

Dataframe with the RMSE conditional on the number of atoms for each method-method combination. Also includes STD, which is the standard deviation of the residual vector, and n, which is the number of residuals used in the calculation.

Return type

pd.DataFrame

starter_code.plot_rmse_by_num_atoms(rmse_df: pandas.DataFrame, method_id_to_name: Optional[Dict[str, str]] = None) None

Plots the RMSE conditional on atom count for each method-method combination.

Parameters
  • rmse_df (pd.DataFrame) – Dataframe with the RMSE conditional on atom count for each method-method combination.

starter_code.isin_tuple_series(values: Any, tuple_col: pandas.Series) pandas.Series
starter_code.create_boxplot(boxplot_data: Dict, title: str, method: Optional[str] = None, plot_args: Optional[Dict] = None)

Create a boxplot

Parameters
  • boxplot_data (Dict) – input from calc_resid

  • title (str) – boxplot title

  • method (Optional[str]) – specify which target energy to plot

  • plot_args (Optional[Dict]) – other plot args

Returns

Nothing

starter_code.create_histogram(data: pandas.DataFrame, xlabel: str, plot_args: Optional[Dict] = None)

Creates a histogram from the filtered data

Parameters
  • data (DataFrame) – FILTERED dataframe; it must already count the number of outliers

  • xlabel (str) – label for the x-axis

  • plot_args (Optional[Dict]) – additional args for the histogram

Returns

Nothing

starter_code.unnest_dictionary(data: dict, key: str, prefix: str = '', inplace: bool = False) Optional[dict]

Insert the keys of a sub-dictionary into data dictionary.

Parameters
  • data (dict) – Main dictionary to unnest.

  • key (str) – The key of the sub-dictionary to unnest.

  • prefix (str, optional) – String value to prefix the new keys with. Defaults to “”.

  • inplace (bool, optional) – Modify the dictionary in place if True, else return a copy. Defaults to False.

Returns

The modified dictionary if inplace is False, else None.

Return type

Optional[dict]
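
The unnesting behaviour described above can be sketched in a few lines. The sketch assumes the nested key is removed once its entries are lifted to the top level, which the documentation does not state. The name unnest_sketch is hypothetical.

```python
import copy
from typing import Optional


def unnest_sketch(data: dict, key: str, prefix: str = "", inplace: bool = False) -> Optional[dict]:
    """Hypothetical helper: lift the entries of data[key] into data itself."""
    target = data if inplace else copy.deepcopy(data)
    # Assumption: the nested dictionary is dropped after its keys are copied up.
    nested = target.pop(key)
    for sub_key, value in nested.items():
        target[prefix + sub_key] = value
    return None if inplace else target


molecule = {"name": "CH4", "targets": {"dt": 1.0, "cc": 2.0}}
flat = unnest_sketch(molecule, "targets", prefix="target_")
```

Here flat holds the keys 'name', 'target_dt' and 'target_cc', while molecule is left untouched because inplace defaults to False.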

starter_code.convert_ani1_data_to_dataframe(data: List[Dict]) pandas.DataFrame

Converts ANI1 data to a dataframe.

Parameters

data (List[Dict]) – List of dictionaries containing ANI1 data. Each dictionary has the keys:

  • ‘name’ – str with the name ANI1 assigns to this molecule type

  • ‘iconfig’ – int with the number ANI1 assigns to this structure

  • ‘atomic_numbers’ – List of Zs

  • ‘coordinates’ – numpy array (:,3) with cartesian coordinates

  • ‘targets’ – Dict whose keys are the target_names in the target argument and whose values are numpy arrays with the ANI-1 data

Returns

A dataframe with the columns ‘name’, ‘iconfig’, ‘atomic_numbers’, and ‘coordinates’ from the input data. For each target in the input data, a column with the target name is added to the dataframe with the prefix ‘target_’.

Return type

pd.DataFrame

starter_code.load_ani1_data(config: Dict = ani1_config, ani1_path: str = './ANI-1ccx_clean_fullentry.h5', as_dataframe=False) List[Dict]

Loads molecules from the ANI-1 Dataset

Parameters
  • config (Dict) – data to grab from ANI-1

  • ani1_path (str) – ANI-1 dataset

  • as_dataframe (bool) – return as dataframe or not

Returns

molecules from ANI-1 dataset

Return type

molecules (List[Dict])