starter_code
Created on Thu Jun 2 10:48:30 2022
@author: fhu14
Module Contents
Functions
| Function | Description |
| --- | --- |
| build_XX_matrix | Builds a holder matrix for the residuals between energy targets |
| fit_linear_ref_ener | Fits a linear reference energy model between the DFTB+ method and some energy target |
| get_ani1data_cached | Loads the ani1 data file and returns the molecules in the file |
| calc_resid | Calculates residuals of the ANI-1 data set |
| create_heatmap | Creates a heatmap of the MAE between methods. |
| filter_outliers | Filters outliers from each element in the dataset |
| is_outlier | |
| num_heavy_atoms | Determines the number of heavy atoms in a molecule based on its empirical formula |
| get_residuals_by_num_heavy_atoms | Calculates residuals by the number of heavy atoms |
| bonds_from_coordinates | Calculate min and max bond lengths from a set of coordinates |
| get_residuals_by_num_bonds | Calculate residuals by the number of bonds |
| get_residuals_by_num_atoms | Determine residual by the number of atoms |
| rmse | Calculate the root mean squared error between y and y_pred. |
| mae | Calculate the mean absolute error between y and y_pred. |
| compute_rmse_by_num_heavy_atoms | Calculates the heavy-atom conditional RMSE for each method-method combination. |
| plot_rmse_by_num_heavy_atoms | Plots the RMSE conditional on heavy atoms for each method-method combination. |
| compute_rmse_by_num_bonds | Calculates the bond-count conditional RMSE for each method-method combination. |
| plot_rmse_by_num_bonds | Plots the RMSE conditional on bond count for each method-method combination. |
| compute_rmse_by_num_atoms | Calculates the atom-count conditional RMSE for each method-method combination. |
| plot_rmse_by_num_atoms | Plots the RMSE conditional on atom count for each method-method combination. |
| isin_tuple_series | |
| create_boxplot | Create a boxplot |
| create_histogram | Creates a histogram from a filtered dataframe |
| unnest_dictionary | Insert the keys of a sub-dictionary into data dictionary. |
| convert_ani1_data_to_dataframe | Converts ANI1 data to a dataframe. |
| load_ani1_data | Loads molecules from the ANI-1 Dataset |
Attributes
- starter_code.Array
- starter_code.ani1_config
- starter_code.ATOM_PAIR_TO_BOND_ANGSTROM
- starter_code.build_XX_matrix(dataset: List[Dict], allowed_Zs: List[int]) Array
Builds a holder matrix for the residuals between energy targets
- Parameters
dataset (List[Dict]) – The list of molecule dictionaries that have had the DFTB+ results added to them.
allowed_Zs (List[int]) – The allowed atoms in the molecules
- Returns
Per-molecule atomic frequency matrix
- Return type
XX (Array)
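A per-molecule atomic frequency matrix can be pictured as one row per molecule and one column per allowed element, each entry counting how often that element occurs. The following is a minimal sketch, not the module's implementation: the helper name `build_count_matrix` is hypothetical, and the trailing column of ones (for a constant offset) is an assumption based on the C_0 term in the notes for fit_linear_ref_ener. The 'atomic_numbers' key follows the molecule dictionary layout described under convert_ani1_data_to_dataframe.

```python
import numpy as np

def build_count_matrix(dataset, allowed_zs):
    """Return an (n_molecules, len(allowed_zs) + 1) matrix of element counts.

    Hypothetical helper. The trailing column of ones is an assumption of
    this sketch; it lets a constant offset be fit with the element terms.
    """
    xx = np.zeros((len(dataset), len(allowed_zs) + 1))
    for i, mol in enumerate(dataset):
        zs = np.asarray(mol["atomic_numbers"])
        for j, z in enumerate(allowed_zs):
            # Number of atoms in molecule i with atomic number z
            xx[i, j] = np.count_nonzero(zs == z)
        xx[i, -1] = 1.0
    return xx

# Toy data: methane and water, with allowed elements H, C, N, O
toy_dataset = [
    {"atomic_numbers": [6, 1, 1, 1, 1]},
    {"atomic_numbers": [8, 1, 1]},
]
xx_toy = build_count_matrix(toy_dataset, [1, 6, 7, 8])
```

Each row of `xx_toy` holds the H, C, N, and O counts of one molecule followed by the constant 1.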
- starter_code.fit_linear_ref_ener(dataset: List[Dict], target1: str, target2: str, allowed_Zs: List[int], XX: Optional[Array] = None) Array
Fits a linear reference energy model between the DFTB+ method and some energy target
- Parameters
dataset (List[Dict]) – The list of molecule dictionaries that have had the DFTB+ results added to them.
target1 (str) – The starting point energy target
target2 (str) – The second energy target that you need to correct for
allowed_Zs (List[int]) – The allowed atoms in the molecules
- Returns
The coefficients of the reference energy
XX (Array): 2D matrix in the number of atoms
method1_mat (Array): The reference energy of the DFTB+ method
method2_mat (Array): The reference energy of the target
XX (Array): Per-molecule atomic frequency matrix
- Return type
coefs (Array)
- Notes: The reference energy corrects the magnitude between two methods in the following way:
E_2 = E_1 + sum_z N_z * C_z + C_0
where N_z is the number of times atom z occurs within the molecule, C_z is the coefficient for atom z, and C_0 is a constant offset. The coefficients are found by solving a least squares problem.
The reference energy vector is generated through a matrix multiply. Suppose that E_1 is the vector of energies for the molecules in the given dataset. The corrected energies, E_corrected, are generated as follows:
E_corrected = E_1 + (XX @ coefs)
where XX and coefs are the output of this function.
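The least squares step in the notes above can be illustrated with synthetic data. This is a sketch under stated assumptions, not the function's actual code: the energies and coefficients are made up, and the count matrix is assumed to carry a trailing column of ones so that C_0 is fit together with the per-element coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic element counts for 50 molecules over 4 elements, plus a ones column
counts = rng.integers(0, 6, size=(50, 4)).astype(float)
xx = np.hstack([counts, np.ones((50, 1))])

# Synthetic energies: method 2 differs from method 1 by a known linear shift
true_coefs = np.array([-0.5, -37.8, -54.5, -75.0, 0.1])
e_1 = rng.normal(size=50)
e_2 = e_1 + xx @ true_coefs

# Least squares solve of  xx @ coefs ~= e_2 - e_1
coefs, *_ = np.linalg.lstsq(xx, e_2 - e_1, rcond=None)

# Apply the correction as in the notes: E_corrected = E_1 + (XX @ coefs)
e_corrected = e_1 + xx @ coefs
```

Because the synthetic shift is exactly linear in the counts, the fit recovers `true_coefs` and `e_corrected` matches `e_2` up to floating-point error. On real data a residual remains, which is what the rest of this module analyzes.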
- starter_code.get_ani1data_cached(ani1_path: str, molecules_path: str, allowed_Z: List[int], heavy_atoms: List[int], max_config: int, target: Dict[str, str], **kwargs) List[Dict]
Loads the ani1 data file and returns the molecules in the file
- Parameters
ani1_path (str) – The path to the ani1 data file
molecules_path (str) – The path to the pickled molecules file
allowed_Z (List[int]) – Include only molecules whose elements are in this list
heavy_atoms (List[int]) – Include only molecules for which the number of heavy atoms is in this list
max_config (int) – Maximum number of configurations included for each molecule.
target (Dict[str,str]) – Entries specify the targets to extract. Key: the target_name assigned to the target. Value: the key that the ANI-1 file assigns to this target.
- Returns
The list of molecule dictionaries
- Return type
molecules (List[Dict])
- starter_code.calc_resid(molecules: List[Dict], target: str = ani1_config['target'], allowed_Z: List[int] = ani1_config['allowed_Z'], show_progress: bool = True, XX: Optional[Array] = None, as_dataframe: bool = False) Union[Dict, pandas.DataFrame]
Calculates residuals of the ANI-1 data set
- Parameters
molecules (List[Dict]) – From ANI-1 dataset
allowed_Z (List[int]) – The allowed atoms in the molecules
target (str) – energy targets
show_progress (bool) – Show TQDM progress bar
XX (Optional[Array]) – precomputed array to replace molecules
- Returns
matrix of the residuals between two methods
- Return type
resid_matrix (Dict)
Notes
Result is converted to hartrees
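As a rough illustration of a residual matrix between methods, the sketch below builds a dictionary keyed by (method, method) tuples, matching the `Dict[Tuple[str, str], Array]` type used elsewhere in this module. It is an assumption-laden simplification: the helper `pairwise_residuals` and the target names "dt" and "cc" are hypothetical, and the reference energy correction and unit conversion performed by calc_resid are left out.

```python
from itertools import combinations

import numpy as np

def pairwise_residuals(molecules, target_names):
    """Map each (method_a, method_b) pair to the vector energy_a - energy_b.

    Hypothetical helper; no reference energy correction is applied here.
    """
    energies = {
        name: np.array([mol["targets"][name] for mol in molecules], dtype=float)
        for name in target_names
    }
    return {
        (a, b): energies[a] - energies[b]
        for a, b in combinations(target_names, 2)
    }

# Toy molecules with two made-up energy targets
toy_molecules = [
    {"targets": {"dt": -40.5, "cc": -40.4}},
    {"targets": {"dt": -76.4, "cc": -76.2}},
]
toy_resid = pairwise_residuals(toy_molecules, ["dt", "cc"])
```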
- starter_code.create_heatmap(target: str, title: str, data_matrix: Optional[List[Dict]] = None, dataframe: Optional[Union[pandas.DataFrame, pandas.Series]] = None, molecules: Optional[List[Dict]] = None, allowed_Z: Optional[List[int]] = None, plot_args: Optional[Dict] = None, show_progress: bool = False, XX: Optional[Array] = None)
Creates a heatmap of the MAE between methods.
- Parameters
target (str) – List of method IDs to compare
title (str) – Title of heatmap
data_matrix (Optional[List[Dict]]) – residual matrix
dataframe (Optional[Union[DataFrame, Series]]) – residual dataframe
molecules (Optional[List[Dict]]) – From ANI-1 dataset
allowed_Z (Optional[List[int]]) – The allowed atoms in the molecules
plot_args (Optional[Dict]) – Arguments to pass to seaborn heatmap
show_progress (bool) – Show TQDM progress bar
XX (Optional[Array]) – precomputed array to replace molecules
- Returns
Matplotlib axes object
- Return type
plt.Axes
Notes
Refactored to take in the residual matrix by default
- starter_code.filter_outliers(filter_type: str = 'SD', data_matrix: Dict[Tuple[str, str], Array] = None, dataframe: Union[pandas.DataFrame, pandas.Series] = None, q_lower: float = 0.25, q_upper: float = 0.75, n_sd: int = 20) Any
Filters outliers from each element in the dataset
- Parameters
n_sd (int) – the number of standard deviations
filter_type (str) – “SD” for the standard deviation method, “IQR” for the interquartile range method
data_matrix (Optional[Dict]) – dictionary with the mean absolute error
dataframe (Optional[Union[DataFrame, Series]]) – dataframe from molecules
q_lower (float) – lower quantile
q_upper (float) – upper quantile
- Returns
matrix with no outliers
dataframe (Union[DataFrame, Series]): dataframe with ref energies replaced with bool of whether it was an outlier or not
- Return type
filtered_dict (Dict)
- starter_code.is_outlier(x: Union[pandas.DataFrame, pandas.Series], q_lower: float = 0.25, q_upper: float = 0.75) Union[pandas.DataFrame, pandas.Series]
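The page gives no description for is_outlier. One common reading of the q_lower and q_upper arguments is an interquartile range test, sketched below. The helper name `flag_outliers_iqr` and the fence multiplier `k=1.5` are assumptions for illustration; the module may use a different rule.

```python
import pandas as pd

def flag_outliers_iqr(x, q_lower=0.25, q_upper=0.75, k=1.5):
    """Return a boolean Series that is True where a value lies outside the fences.

    Hypothetical helper. The multiplier k=1.5 is the conventional Tukey
    fence and is an assumption, not taken from the module.
    """
    low = x.quantile(q_lower)
    high = x.quantile(q_upper)
    iqr = high - low
    return (x < low - k * iqr) | (x > high + k * iqr)

values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
flags = flag_outliers_iqr(values)
```

For this toy series the quartiles are 2.0 and 4.0, so the fences sit at -1.0 and 7.0 and only the last value is flagged.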
- starter_code.num_heavy_atoms(name: str) int
Determines the number of heavy atoms in a molecule based on its empirical formula
- Parameters
name (str) – molecule name
- Returns
number of heavy atoms
- Return type
num_heavy (int)
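Counting heavy atoms from an empirical formula amounts to summing the counts of every element other than hydrogen. The sketch below assumes the molecule name is a plain formula string such as "C2H6O"; the exact naming scheme ANI-1 uses is not stated here, and `count_heavy_atoms` is a hypothetical stand-in for num_heavy_atoms.

```python
import re

def count_heavy_atoms(formula):
    """Count non-hydrogen atoms in a formula string such as 'C2H6O'.

    Hypothetical helper; assumes element symbols followed by optional counts.
    """
    total = 0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol != "H":
            # A missing count means the element occurs once
            total += int(count) if count else 1
    return total
```

For example, `count_heavy_atoms("C2H6O")` counts two carbons and one oxygen and skips the six hydrogens.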
- starter_code.get_residuals_by_num_heavy_atoms(molecules: List[Dict], residuals: Array, heavy_atoms: list[int]) Dict
Calculates residuals by the number of heavy atoms
- Parameters
heavy_atoms (List[int]) – list of heavy atom counts to include
residuals (Array) – calculated residuals from calc_resid
molecules (List[Dict]) – from ANI-1 Dataset
- Returns
Dictionary of the residuals keyed by num heavy atoms
- Return type
molecules_by_heavy_atoms (Dict)
- starter_code.bonds_from_coordinates(coordinates: Array, atomic_numbers: Array) List
Calculate min and max bond lengths from a set of coordinates
- Parameters
coordinates (Array) – Cartesian coordinates of the atoms, compared against expected bond distances to differentiate bonds
atomic_numbers (Array) – atoms to analyze
- Returns
whether a distance is a bond or not
- Return type
bonds (List)
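A distance-based bond test can be sketched as follows: two atoms count as bonded when their separation is at or below a cutoff for that element pair. The cutoff table `MAX_BOND_LENGTH` and the helper `find_bonds` are hypothetical; the layout of the module's ATOM_PAIR_TO_BOND_ANGSTROM attribute and the return format of bonds_from_coordinates are not documented here, so this only illustrates the general idea.

```python
import numpy as np

# Hypothetical cutoffs in angstroms, keyed by sorted atomic-number pairs
MAX_BOND_LENGTH = {(1, 1): 0.9, (1, 6): 1.3, (1, 8): 1.2, (6, 6): 1.7, (6, 8): 1.6}

def find_bonds(coordinates, atomic_numbers):
    """Return index pairs (i, j) of atoms closer than their pair cutoff."""
    coords = np.asarray(coordinates, dtype=float)
    bonds = []
    n = len(atomic_numbers)
    for i in range(n):
        for j in range(i + 1, n):
            pair = tuple(sorted((int(atomic_numbers[i]), int(atomic_numbers[j]))))
            cutoff = MAX_BOND_LENGTH.get(pair)
            if cutoff is None:
                continue  # no cutoff known for this element pair
            if np.linalg.norm(coords[i] - coords[j]) <= cutoff:
                bonds.append((i, j))
    return bonds

# Approximate water geometry: O at the origin, two H atoms about 0.96 angstrom away
water_z = [8, 1, 1]
water_xyz = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
water_bonds = find_bonds(water_xyz, water_z)
```

Here both O-H pairs fall under the 1.2 angstrom cutoff, while the H-H separation of about 1.5 angstrom does not count as a bond.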
- starter_code.get_residuals_by_num_bonds(molecules: List[Dict], residuals: Array) Dict
Calculate residuals by the number of bonds
- Parameters
molecules (List[Dict]) – from ANI-1 Dataset
residuals (Array) – calculated residuals from calc_resid
- Returns
residuals by the number of bonds
- Return type
molecules_by_num_bonds (Dict)
- starter_code.get_residuals_by_num_atoms(molecules: List[Dict], residuals: Array) Dict
Determine residual by the number of atoms
- Parameters
molecules (List[Dict]) – from ANI-1 Dataset
residuals (Array) – calculated residuals from calc_resid
- Returns
residuals keyed by the number of atoms
- Return type
molecules_by_num_atoms (Dict)
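The get_residuals_by_* functions all return a dictionary that buckets residuals by some per-molecule count. A minimal sketch of that grouping for the atom count is shown below; the helper name is hypothetical and the residual vector is assumed to be aligned with the molecule list, one entry per molecule.

```python
from collections import defaultdict

def group_residuals_by_num_atoms(molecules, residuals):
    """Bucket residuals by the number of atoms in the matching molecule.

    Hypothetical helper; assumes residuals[i] belongs to molecules[i].
    """
    grouped = defaultdict(list)
    for mol, resid in zip(molecules, residuals):
        grouped[len(mol["atomic_numbers"])].append(resid)
    return dict(grouped)

toy_mols = [
    {"atomic_numbers": [8, 1, 1]},
    {"atomic_numbers": [6, 1, 1, 1, 1]},
    {"atomic_numbers": [8, 1, 1]},
]
by_atoms = group_residuals_by_num_atoms(toy_mols, [0.1, 0.3, -0.2])
```

Grouping by heavy atoms or by bonds would follow the same pattern with a different key function.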
- starter_code.rmse(y: Array, y_pred: Optional[Array] = None) float
Calculate the root mean squared error between y and y_pred.
- Parameters
y (Array) – exp values
y_pred (Optional(Array)) – target values
- Returns
root mean square error
- Return type
rmse (float)
Notes: If y_pred is not provided, y is treated as the residual vector.
- starter_code.mae(y: Array, y_pred: Optional[Array] = None) float
Calculate the mean absolute error between y and y_pred.
Notes: If y_pred is not provided, y is treated as the residual vector.
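The two error metrics, including the convention that a lone `y` is treated as the residual vector, can be sketched with NumPy. The names `rmse_sketch` and `mae_sketch` are hypothetical stand-ins so they are not mistaken for the module's own functions.

```python
import numpy as np

def _residual(y, y_pred=None):
    """Return y - y_pred, or y itself when y_pred is omitted."""
    resid = np.asarray(y, dtype=float)
    if y_pred is not None:
        resid = resid - np.asarray(y_pred, dtype=float)
    return resid

def rmse_sketch(y, y_pred=None):
    """Root mean squared error: sqrt(mean(residual ** 2))."""
    return float(np.sqrt(np.mean(_residual(y, y_pred) ** 2)))

def mae_sketch(y, y_pred=None):
    """Mean absolute error: mean(|residual|)."""
    return float(np.mean(np.abs(_residual(y, y_pred))))
```

For the residual vector `[3, -4]`, the MAE is 3.5 and the RMSE is the square root of 12.5; RMSE weights the larger error more heavily.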
- starter_code.compute_rmse_by_num_heavy_atoms(molecules: List[Dict], resid: Dict[Tuple[str, str], Array], heavy_atoms: list[int], show_progress: bool = True) pandas.DataFrame
Calculates the heavy-atom conditional RMSE for each method-method combination.
- Parameters
molecules (List[Dict]) – List of molecules dictionaries from ANI-1 data.
resid (Dict) – Dictionary of residual vectors for each method-method combination.
heavy_atoms (list[int]) – List of allowed heavy atom numbers.
show_progress (bool) – Whether to display the TQDM progress bar.
- Returns
Dataframe with the RMSE conditional on heavy atoms for each method-method combination. Also includes STD, which is the standard deviation of the residual vector, and n, which is the number of residuals used in the calculation.
- Return type
pd.DataFrame
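A conditional RMSE table of this kind can be sketched with a pandas group-by: for each method pair, residuals are grouped by a per-molecule count and summarized. The helper `rmse_by_group` and the column names `method_1`, `method_2`, and `group` are assumptions; only the STD and n columns are named in the description above, and using the population standard deviation (ddof=0) is also an assumption.

```python
import numpy as np
import pandas as pd

def rmse_by_group(resid, group_sizes):
    """Summarize each residual vector per group as RMSE, STD, and n.

    Hypothetical helper; group_sizes[i] is the count (e.g. heavy atoms)
    for the molecule behind the i-th residual.
    """
    rows = []
    for (method_a, method_b), vec in resid.items():
        df = pd.DataFrame({"group": group_sizes, "resid": np.asarray(vec, dtype=float)})
        for group, sub in df.groupby("group"):
            r = sub["resid"].to_numpy()
            rows.append({
                "method_1": method_a,
                "method_2": method_b,
                "group": group,
                "RMSE": float(np.sqrt(np.mean(r ** 2))),
                "STD": float(np.std(r)),  # population standard deviation
                "n": len(r),
            })
    return pd.DataFrame(rows)

toy = {("dt", "cc"): [3.0, -4.0, 1.0]}
table = rmse_by_group(toy, [1, 1, 2])
```

The same shape applies when conditioning on bond count or atom count; only the grouping values change.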
- starter_code.plot_rmse_by_num_heavy_atoms(rmse_df: pandas.DataFrame, method_id_to_name: Optional[Dict[str, str]] = None) None
Plots the RMSE conditional on heavy atoms for each method-method combination.
- Parameters
rmse_df (pd.DataFrame) – Dataframe with the RMSE conditional on heavy atoms for each method-method combination.
- starter_code.compute_rmse_by_num_bonds(molecules: List[Dict], resid: Dict[Tuple[str, str], Array], show_progress: bool = True) pandas.DataFrame
Calculates the bond-count conditional RMSE for each method-method combination.
- Parameters
molecules (List[Dict]) – List of molecules dictionaries from ANI-1 data.
resid (Dict) – Dictionary of residual vectors for each method-method combination.
show_progress (bool) – Whether to display the TQDM progress bar.
- Returns
Dataframe with the RMSE conditional on the number of bonds for each method-method combination. Also includes STD, which is the standard deviation of the residual vector, and n, which is the number of residuals used in the calculation.
- Return type
pd.DataFrame
- starter_code.plot_rmse_by_num_bonds(rmse_df: pandas.DataFrame, method_id_to_name: Optional[Dict[str, str]] = None) None
Plots the RMSE conditional on bond count for each method-method combination.
- Parameters
rmse_df (pd.DataFrame) – Dataframe with the RMSE conditional on bond count for each method-method combination.
- starter_code.compute_rmse_by_num_atoms(molecules: List[Dict], resid: Dict[Tuple[str, str], Array], show_progress: bool = True) pandas.DataFrame
Calculates the atom-count conditional RMSE for each method-method combination.
- Parameters
molecules (List[Dict]) – List of molecules dictionaries from ANI-1 data.
resid (Dict) – Dictionary of residual vectors for each method-method combination.
show_progress (bool) – Whether to display the TQDM progress bar.
- Returns
Dataframe with the RMSE conditional on the number of atoms for each method-method combination. Also includes STD, which is the standard deviation of the residual vector, and n, which is the number of residuals used in the calculation.
- Return type
pd.DataFrame
- starter_code.plot_rmse_by_num_atoms(rmse_df: pandas.DataFrame, method_id_to_name: Optional[Dict[str, str]] = None) None
Plots the RMSE conditional on atom count for each method-method combination.
- Parameters
rmse_df (pd.DataFrame) – Dataframe with the RMSE conditional on atom count for each method-method combination.
- starter_code.isin_tuple_series(values: Any, tuple_col: pandas.Series) pandas.Series
- starter_code.create_boxplot(boxplot_data: Dict, title: str, method: Optional[str] = None, plot_args: Optional[Dict] = None)
Create a boxplot
- Parameters
boxplot_data (Dict) – input from calc_resid
title (str) – boxplot title
method (Optional[str]) – specify which target energy to plot
plot_args (Optional[Dict]) – other plot args
- Returns
Nothing
- starter_code.create_histogram(data: pandas.DataFrame, xlabel: str, plot_args: Optional[Dict] = None)
Creates a histogram from a filtered dataframe
- Parameters
data (DataFrame) – Filtered dataframe; must already count the number of outliers
xlabel (str) – label for the x-axis
plot_args (Optional[Dict]) – additional args for the histogram
- Returns
Nothing
- starter_code.unnest_dictionary(data: dict, key: str, prefix: str = '', inplace: bool = False) Optional[dict]
Insert the keys of a sub-dictionary into data dictionary.
- Parameters
data (dict) – Main dictionary to unnest.
key (str) – The key of the sub-dictionary to unnest.
prefix (str, optional) – String value to prefix the new keys with. Defaults to “”.
inplace (bool, optional) – Modify the dictionary in place if True, else return a copy. Defaults to False.
- Returns
The modified dictionary if inplace is False, else None.
- Return type
Optional[dict]
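The unnesting behavior described above can be sketched in a few lines. This is an illustration, not the module's code: the name `unnest_sketch` is hypothetical, and removing the nested key after its entries are lifted is an assumption, since the description does not say whether the sub-dictionary is kept.

```python
import copy

def unnest_sketch(data, key, prefix="", inplace=False):
    """Lift the entries of data[key] into data, prefixing each new key.

    Hypothetical helper. Assumption: the nested key itself is removed.
    """
    target = data if inplace else copy.deepcopy(data)
    sub = target.pop(key)
    for sub_key, value in sub.items():
        target[prefix + sub_key] = value
    return None if inplace else target

record = {"name": "CH4", "targets": {"dt": -40.5, "cc": -40.4}}
flat = unnest_sketch(record, "targets", prefix="target_")
```

With `inplace=False` the original `record` is left untouched and a flattened copy is returned; with `inplace=True` the input is modified and `None` is returned, mirroring the return type above.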
- starter_code.convert_ani1_data_to_dataframe(data: List[Dict]) pandas.DataFrame
Converts ANI1 data to a dataframe.
- Parameters
data (List[Dict]) – List of dictionaries containing ANI1 data. Each dictionary has the keys:
'name': str with the name ANI1 assigns to this molecule type
'iconfig': int with the number ANI1 assigns to this structure
'atomic_numbers': List of Zs
'coordinates': numpy array (:,3) with cartesian coordinates
'targets': Dict whose keys are the target_names in the target argument and whose values are numpy arrays with the ANI-1 data
- Returns
A dataframe with the columns 'name', 'iconfig', 'atomic_numbers', and 'coordinates' from the input data. For each target in the input data, a column with the target name is added to the dataframe with the prefix 'target_'.
- Return type
pd.DataFrame
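The conversion described above can be sketched by flattening each molecule dictionary into one row. The helper `to_dataframe_sketch` is hypothetical, the target names "dt" and "cc" are made up, and scalar target values are used for brevity even though the description mentions numpy arrays.

```python
import pandas as pd

def to_dataframe_sketch(data):
    """Build one row per molecule, adding a 'target_'-prefixed column per target."""
    rows = []
    for mol in data:
        row = {k: mol[k] for k in ("name", "iconfig", "atomic_numbers", "coordinates")}
        for target_name, value in mol["targets"].items():
            row["target_" + target_name] = value
        rows.append(row)
    return pd.DataFrame(rows)

toy_data = [
    {
        "name": "H2O",
        "iconfig": 0,
        "atomic_numbers": [8, 1, 1],
        "coordinates": [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]],
        "targets": {"dt": -76.4, "cc": -76.2},
    }
]
toy_df = to_dataframe_sketch(toy_data)
```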
- starter_code.load_ani1_data(config: Dict = ani1_config, ani1_path: str = './ANI-1ccx_clean_fullentry.h5', as_dataframe=False) List[Dict]
Loads molecules from the ANI-1 Dataset
- Parameters
config (Dict) – data to grab from ANI-1
ani1_path (str) – ANI-1 dataset
as_dataframe (bool) – return as dataframe or not
- Returns
molecules from ANI-1 dataset
- Return type
molecules (List[Dict])