2. Machine Learning Basics#
2.1. 💎 Crystal hardness#
Are you excited to tackle a regression problem?
Today’s dataset consists of the bulk modulus for more than 10,000 inorganic crystals. The exercise aims to develop an understanding of how to approach supervised learning problems.
The energy of a crystal varies with the volume of the unit cell. The equilibrium volume is found at the minimum in the potential energy surface. The shape of this curve can be described by an equation of state, where energy is a function of volume or pressure, i.e. \(E(V)\) or \(E(P)\). The curvature is related to the bulk modulus \(B\), which can be defined as:
\( B = -V \frac{\partial P}{\partial V} = V \frac{\partial^2 E}{\partial V^2} \)
The typical unit of \(B\) is GPa. For example, diamond has a measured bulk modulus of \(B\) = 443 GPa at T = 4 K. The bulk modulus is a useful quantity in models of materials bonding, thermodynamics, and mechanics. For instance, the inverse of the bulk modulus is the compressibility of a crystal (\(\kappa = \frac{1}{B}\)).
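To make the definition concrete, the short sketch below fits a quadratic to a hypothetical set of \(E(V)\) points near the minimum and estimates \(B\) from the curvature. The data values (and the resulting modulus) are invented for illustration only.
import numpy as np
# Hypothetical E(V) data near the equilibrium volume (invented for illustration)
V = np.array([10.4, 10.8, 11.2, 11.6, 12.0])            # unit-cell volume in Angstrom^3
E = np.array([-7.964, -7.991, -8.000, -7.991, -7.964])  # total energy in eV
# Fit a quadratic E = a*V^2 + b*V + c around the minimum
a, b, c = np.polyfit(V, E, 2)
V0 = -b / (2 * a)   # equilibrium volume at the minimum
B = V0 * (2 * a)    # bulk modulus from B = V d^2E/dV^2, in eV/Angstrom^3
print(f'B ≈ {B * 160.2:.0f} GPa')  # 1 eV/Angstrom^3 ≈ 160.2 GPa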
We will use the Python package matminer
(https://matminer.readthedocs.io) to access the materials dataset and featurise the data in a form that is suitable for statistical analysis and building machine learning models. We will also use the computational materials science package pymatgen
(https://pymatgen.org), which powers the Materials Project. There are many new concepts that will be explored in future lectures, so don’t worry about grasping everything now.
# Installation of libraries
!pip install matminer --quiet
# Basic utilities
import pprint # Pretty print data structures
import warnings # Warning control
import numpy as np # Numerical operations
from numpy import ComplexWarning # Warning for complex numbers
# Data handling
import pandas as pd # Data manipulation with DataFrames
from monty.serialization import loadfn # Load serialised data
# Materials science
from pymatgen.core import Structure # Materials analysis for crystal structures
import matminer # Materials informatics
from matminer.datasets.dataset_retrieval import load_dataset # Load materials datasets
# Visualisation
import matplotlib.pyplot as plt # Plotting
import seaborn as sns # Statistical visualisation
plt.style.use('ggplot') # Set Matplotlib style to 'ggplot'
# Warning management
warnings.filterwarnings("ignore", category=ComplexWarning) # Ignore ComplexWarning
# Performance adjustments
teaching_mode = True # To make models run faster
Colab error solution
If running the import module cell fails with an "AttributeError", click `Runtime` -> `Restart Session` and then simply rerun the cell.
2.2. Bulk moduli dataset#
From matminer
, we can check what datasets are available using the datasets.get_available_datasets()
method.
# Print the available datasets
matminer.datasets.get_available_datasets(print_format='low')
We can use the get_all_dataset_info
function from the matminer.datasets.dataset_retrieval
module to output a detailed description of a matminer dataset. Let’s check the information for the matbench_log_kvrh
dataset.
Here “K” relates to the bulk modulus (which we called \(B\)), and “VRH” relates to the Voigt-Reuss-Hill average, which is one approach to define a single value for each material.
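For reference, the Hill value is simply the arithmetic mean of the Voigt (upper) and Reuss (lower) bounds on the bulk modulus obtained from the elastic tensor:
\( K_{VRH} = \frac{1}{2}\left(K_V + K_R\right) \)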
print(matminer.datasets.dataset_retrieval.get_all_dataset_info('matbench_log_kvrh'))
We can then load a dataset using the load_dataset
method.
# Use matminer to download the dataset
df = load_dataset('matbench_log_kvrh')
print(f'The full dataset contains {df.shape[0]} entries. \n')
if teaching_mode:
    # Store the original DataFrame as a copy
    full_dataset_df = df.copy()
    # Create a subset of the original DataFrame for demonstration purposes
    df = df.sample(n=1500, random_state=41)
    print(f'For teaching purposes we will only work with {df.shape[0]} entries from the DataFrame to make the model training and testing faster. \n')
print('The DataFrame is shown below:')
df.head(10)
2.2.1. Visualise the target variable#
We can use df.describe()
to produce summary statistics of the numerical columns. The importance of this is to check whether the data for our target variable, log10(K_VRH)
, is reasonable. Negative values for the bulk modulus are considered unphysical and forbidden by crystal thermodynamics. You can think about why from the definition.
As we are working with log10
of the bulk modulus, it should not be possible for there to be negative values in our target variable column as the logarithm of a negative number is undefined. This also gives us a quick check for the input data.
df.describe()
From the summary statistics, the minimum value for log10(K_VRH)
is zero, so it appears that there are no glaring issues with the target variable.
For a better understanding, let’s make a histogram to visualise the distribution. This is best practice when you encounter any new dataset.
# Plot a histogram
fig, ax = plt.subplots(figsize=(5,3))
ax.hist(df22['log10(K_VRH)'])
ax.set_xlabel(r'$log_{10}K_{VRH}$ [$log_{10}GPa$]')
ax.set_ylabel('Counts')
plt.show()
Code hint
Your dataframe is not called df22!
2.3. Features of materials#
As you may notice from the dataset, we only have one input feature, the crystal structure. This is not a numerical feature that we can use for a regression model. For supervised machine learning, we must represent each material by a vector that can be used as an input to the model, e.g. \(\textbf{x} = [x_1, x_2, x_3, x_4]\) would be a four-dimensional representation.
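As a toy illustration of such a vector (not the features we will actually use), one could hand-pick four descriptors for a single material. The values below for rock salt (NaCl) are approximate and for demonstration only.
import numpy as np
# Hypothetical four-dimensional representation of rock salt (NaCl); values are approximate
x_nacl = np.array([
    2.17,   # density (g/cm^3)
    22.4,   # volume per atom (Angstrom^3)
    29.2,   # mean atomic mass (u)
    2.05,   # mean Pauling electronegativity
])
print(x_nacl.shape)  # (4,)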
For now we will use some pre-selected features from matminer
for this regression task. Materials representations will be covered in Lecture 4.
2.3.1. Composition-based features#
To use the ElementProperty
featuriser, we first need to add a pymatgen.core.composition.Composition
object to our DataFrame. There are several ways to do this but we will proceed using the composition
property of the pymatgen Structure
class.
from matminer.featurizers.composition.composite import ElementProperty
from matminer.featurizers.structure.order import DensityFeatures
# Add a composition column to df using the composition property of the Structure class and a lambda function
df['composition'] = df.structure.apply(lambda x: x.composition)
df.head()
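Each entry in the new column is a pymatgen Composition object, which we can inspect directly (the exact output depends on which rows were sampled):
# Look at the composition of the first entry in our (sampled) DataFrame
example_composition = df['composition'].iloc[0]
print(example_composition)                    # chemical formula
print(example_composition.get_el_amt_dict())  # element -> amount mapping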
The new composition column contains both the elements and the amount of each element in the composition. Let’s use the ElementProperty
featuriser to add some composition-based features to our dataset.
# Create the ElementProperty featuriser
el_prop_featuriser = ElementProperty.from_preset(preset_name='magpie')
# By default multiprocessing is enabled, however this has been known to slow performance on some systems, so we disable it
el_prop_featuriser.set_n_jobs(1)
# Apply the ElementProperty featuriser
df = el_prop_featuriser.featurize_dataframe(df, col_id='composition')
# Print the shape of the DataFrame
print(df.shape)
df.head()
There are now a lot more columns in the DataFrame. We can check the reference for a property featuriser using the .citations()
method as shown below.
el_prop_featuriser.citations()
2.3.2. Structure-based features#
Within matminer
, there are many featurisers which operate on crystal structures. We will add some simple features based on the density of the structures using DensityFeatures
. We will return to these later in the module.
# Crystal structure to vector
density_featuriser = DensityFeatures()
density_featuriser.set_n_jobs(1)
df = density_featuriser.fit_featurize_dataframe(df, col_id='structure')
df.head()
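To see exactly which structure-based columns were added, every matminer featuriser exposes a feature_labels() method:
# List the names of the features produced by the density featuriser
print(density_featuriser.feature_labels())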
2.4. Bulk modulus regression#
With regression tasks, we want to fit a model that maps our input feature \(x\) to our target variable \(y\), i.e. \(y=f(x)\). Here, \(x\) and \(y\) are vectors of dimensions \(M\) and \(N\), respectively, such that \(f: \mathbb{R}^M\rightarrow\mathbb{R}^N\).
Supervised machine learning problems generally take the following form:
Select a form for the model \(f\)
Determine an error/loss function that is used to evaluate model performance
Optimise the parameters of the model to minimise the error
The error, \(L(\hat{y},y)\), is a function of the predicted target variable \(\hat{\textbf{y}}=f(\textbf{x})\) and the true target variable, \(\textbf{y}\). We want our model to minimise \(L\).
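As a toy example of such an error function, the mean absolute error (MAE) and mean squared error (MSE) for a pair of true/predicted vectors can be computed by hand (values invented for illustration):
import numpy as np
# Invented true and predicted values for illustration
y_true = np.array([1.8, 2.1, 2.4])
y_hat = np.array([1.7, 2.3, 2.2])
mae = np.mean(np.abs(y_hat - y_true))   # mean absolute error
mse = np.mean((y_hat - y_true) ** 2)    # mean squared error
print(f'MAE = {mae:.3f}, MSE = {mse:.3f}')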
For our problem, the target variable is log10(K_VRH)
, which we want to predict from knowledge of the composition and structure (represented by the set of chosen features).
We can make extensive use of scikit-learn for these tasks.
2.4.1. Data preparation#
To start, we need to split our dataset into the target variable log10(K_VRH)
and the input features. For the input features, we must remove any non-numerical data.
# Define the columns to exclude, then keep the remaining columns as features
features_to_drop = ['structure','composition','log10(K_VRH)']
feature_cols = [col for col in list(df.columns) if col not in features_to_drop]
# Get an array of the features
X = df[feature_cols].values
# Get an array of the target variable
y = df['log10(K_VRH)'].values
print(f'Shape of X: {X.shape}')
print(f'Shape of y: {y.shape}')
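Before fitting, it can be worth a quick sanity check that the feature matrix contains only finite numerical values (a minimal optional check):
# Check for missing or non-finite values in the feature matrix
print(f'Non-finite entries in X: {np.count_nonzero(~np.isfinite(X))}')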
We can also check the names of the features used for our model.
print(f'We have {len(feature_cols)} features in our dataset.')
print(features)
Code hint
Check your print statement!
2.4.2. Baseline linear regression model#
A simple model is the linear regressor. For a univariate linear regressor represented by \(\hat{y}=mx+c\), the task is to find the values of \(m\) and \(c\) that minimise the model error.
If we were to consider multivariate linear regression, then our equation transforms to \(\hat{y}=\beta_0 + \sum_{i=1}^{n}\beta_ix_i\), where \(\beta_i\) are the weights of the model and \(x_i\) are the input features.
# Import linear regression model
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Fit the model to the data
lr = LinearRegression()
lr.fit(X, y)
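# Relate the fitted model back to the multivariate equation above:
# lr.intercept_ is beta_0 and lr.coef_ holds one weight beta_i per input feature
# (standard scikit-learn attributes; shown here purely for inspection)
print(f'Intercept (beta_0): {lr.intercept_:.3f}')
print(f'Number of weights (beta_i): {lr.coef_.shape[0]}')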
def make_prediction_plot(X, y, model, label):
    """
    Plot and return predictions for the given model and data.
    Parameters:
        X : Input features.
        y : Actual target values.
        model : Fitted model.
        label : Descriptor for the axes labels.
    Returns:
        Predicted values.
    """
    y_pred = model.predict(X)
    fig, ax = plt.subplots(figsize=(5, 3))
    ax.scatter(y, y_pred, c=y, cmap='viridis')
    ax.plot(y, y, 'r-')
    ax.set_xlabel(f'{label} true')
    ax.set_ylabel(f'{label} predicted')
    plt.show()
    return y_pred
# Make predictions using the fitted model
y_pred = make_prediction_plot(X, y, lr, 'log10(K_VRH)')
from sklearn import metrics
# Mean absolute error
print (f'The training MAE = {metrics.mean_absolute_error(y,y_pred):.3f} log10GPa')
# Root mean squared error
print(f'The training RMSE = {metrics.root_mean_squared_error(y,y_pred):.3f} log10GPa')
# $r^2$ - coefficient of determination
print(f'The training r^2 = {lr.score(X,y):.3f}')
Based on your analysis, is this a useful model?
2.4.3. Random forest regressor#
We can do better with a non-linear model. Let’s try a machine learning regressor. Random forest is an ensemble machine learning algorithm that combines multiple decision trees to improve predictive accuracy.
Random forest can be applied to both classification and regression tasks. The prediction is made by taking a majority vote (for classification) or averaging (for regression) of the predictions from individual trees. Mathematically, it can be represented as:
\( \hat{y}_{RF} = \frac{1}{n_{trees}} \sum_{i=1}^{n_{trees}} f_i(x) \)
where:
\(\hat{y}_{RF}\) is the random forest prediction.
\(n_{trees}\) is the number of decision trees in the forest.
\(f_i(x)\) represents the prediction of the \(i\)-th tree.
2.4.3.1. 1. Create the regressor#
In sklearn
, the random forest regressor is created by:
RandomForestRegressor(n_estimators=<int>, criterion=<str>, max_depth=<int>, min_samples_split=<int>, min_samples_leaf=<int>)
The hyperparameters that need to be set are:
n_estimators: number of decision trees in the random forest model.
criterion: loss function to be minimised. The default value is 'squared_error', which is the MSE.
max_depth: maximum depth of each tree.
min_samples_split: minimum number of samples required to split an internal node.
min_samples_leaf: minimum number of samples required to be at a leaf node.
from sklearn.ensemble import RandomForestRegressor
# Define the model
rf = RandomForestRegressor(n_estimators=100, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, random_state=42)
# Fit the model
rf.fit(X,y)
Why is the random state set to 42?
Most random number generators start with an initial seed value and then produce a sequence of numbers that appears random. Since the algorithms are deterministic, providing the same seed will result in the same sequence of "random" numbers. 42 is simply a science fiction reference.
You just trained a machine learning model 🎉.
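As an optional check that the ensemble behaves as the equation above describes, we can average the predictions of the individual trees (exposed via the standard estimators_ attribute) and compare with the forest prediction:
# The forest prediction should equal the mean over its individual trees
tree_predictions = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])
print(np.allclose(tree_predictions.mean(axis=0), rf.predict(X[:5])))  # expected: True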
We can now make predictions and plot the results. We will use the plotting function make_prediction_plot()
that we defined earlier to make the plots.
print("Linear regression")
y_pred_lr = make_prediction_plot(X,y,lr,'log10(K_VRH)')
print("Random Forest model")
y_pred = make_prediction_plot(X,y,rf,'log10(K_VRH)')
Now let’s quantify the performance of the random forest model:
# Print the metrics
print(f'The training MAE = {metrics.mean_absolute_error(y,y_pred):.3f} log10GPa')
print(f'The training RMSE = {metrics.root_mean_squared_error(y,y_pred):.3f} log10GPa')
print(f'The training r^2 = {rf.score(X,y):.3f}')
The coefficient of determination, \(r^2\), as well as the low RMSE, suggests that this model performs well. However, it is also likely that the model is simply overfitted to reproduce the training data. This means that it will not generalise to other materials (unseen data), which is necessary for a meaningful machine learning model.
2.4.3.2. 2. Cross validation#
To better determine the quality of our model, we can perform cross-validation (CV). CV enables us to evaluate the out-of-sample goodness-of-fit of a regressor. We will use \(k\)-fold CV, which splits the training set into \(k\) subsets. Each subset is used in turn as a validation set to evaluate the performance, with the model being trained on the remaining \(k-1\) subsets. Don’t worry, we’ll cover this in later lectures.
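To see what the splitting looks like in practice, here is a small standalone sketch that prints the train/validation indices of a 5-fold split over ten dummy samples (illustrative only):
from sklearn.model_selection import KFold
import numpy as np
# Ten dummy samples, split into 5 folds
toy_data = np.arange(10)
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5).split(toy_data)):
    print(f'Fold {fold}: train={train_idx}, validation={val_idx}')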
Let’s perform 5-fold CV:
from sklearn.model_selection import KFold, cross_val_score, cross_validate
# Define the number of splits for cross-validation
n_splits = 5 if teaching_mode else 10
# Compute the cross-validation score
cv = KFold(
    n_splits=n_splits,
    shuffle=True,
    random_state=42
)
scores = cross_val_score(rf, X, y, cv=cv, scoring='neg_mean_absolute_error')
r2_scores = cross_val_score(rf, X, y, cv=cv, scoring='r2')
print('From our cross-validation, we have obtained the following results:')
print(f'mean MAE = {np.mean(np.abs(scores)):.3f} log10GPa')
print(f'mean r^2 = {np.mean(np.abs(r2_scores)):.3f}')
# Show the training scores for each k-fold
fig, ax = plt.subplots(2, 1, figsize=(5, 4))
ax[0].scatter([i for i in range(len(scores))], np.abs(scores), c=scores, cmap='viridis')
ax[1].scatter([i for i in range(len(r2_scores))], np.abs(r2_scores), c=r2_scores, cmap='viridis')
ax[0].set_xlabel('Training fold')
ax[0].set_ylabel('MAE')
ax[0].set_ylim(0, 0.14)
ax[0].set_xticks(range(len(scores)))
ax[1].set_xticks(range(len(r2_scores)))
ax[1].set_xlabel('Training fold')
ax[1].set_ylabel('r$^2$')
ax[1].set_ylim(0, 1.0)
# Display the plot
plt.show()
There is an increase in the error (decrease in performance) for the CV model. However, the MAE is still reasonable. Let’s visualise the result of the final model:
from sklearn.model_selection import cross_val_predict
# Plot the original and predicted data against each other
fig, ax = plt.subplots(figsize=(5, 3))
# Scatter plot with color
ax.scatter(y, cross_val_predict(rf, X, y, cv=cv), c=y, cmap='viridis', label='Predicted', alpha=0.6)
# Red line representing a perfect prediction (y = x)
ax.plot(y, y, 'r-', label='Perfect prediction')
# Set labels and legend
ax.set_xlabel('K_VRH true')
ax.set_ylabel('K_VRH predicted')
ax.legend()
plt.show()
2.5. Feature importance#
We fed in many materials features, but which were most useful? Knowing this improves our understanding (the interpretability) of the model.
We can see how particular features contribute to a Random Forest model by looking at the RandomForestRegressor().feature_importances_
attribute. Some features are significant, whereas others offer very little contribution.
# Get the feature importances
importances = rf.feature_importances_
# Get the indices that would sort the importances array from largest to smallest
indices = np.argsort(importances)[::-1]
# Create a figure and axis for the plot
fig, ax = plt.subplots(figsize=(5, 3))
# Create a bar plot of the feature importance
ax.bar(range(X.shape[1]), importances[indices], color="r", align="center")
# Set the labels
ax.set_xlabel("Feature index")
ax.set_ylabel("Importance")
plt.show()
There is a rapid drop-off in the feature importance, with only a few features offering a significant contribution to the model.
Below we will only plot the importance for the top-\(N\) features. Try a value of 5. My guess is that the top feature will be vpa (volume per atom).
# Visualise the top N features
N =
# Get the names of the top N important features
top_feature_names = df[feature_cols].columns.values[np.argsort(importances)[::-1][:N]]
# Set up the figure and axis
fig, ax = plt.subplots(figsize=(5, 3))
# Create a bar plot of the top N feature importances
ax.bar(x=top_feature_names, height=importances[np.argsort(importances)[::-1][:N]])
# Set the labels and title
ax.set_xlabel("Feature")
ax.set_ylabel("Importance")
# Rotate x-axis labels for better readability
ax.set_xticklabels(top_feature_names, rotation=45, ha='right', rotation_mode='anchor')
plt.show()
# Print them too
print(f"Top {N} Features:")
for feat in range(N):
    print(f" {feat+1}. {feature_cols[indices[feat]]} ({importances[indices[feat]]:.3f})")
Code hint
Remember to set N!
2.6. 🚨 Exercise 2#
2.6.1. Your details#
import numpy as np
# Insert your values
Name = "No Name" # Replace with your name
CID = 123446 # Replace with your College ID (as a numeric value with no leading 0s)
# Set a random seed using the CID value
CID = int(CID)
np.random.seed(CID)
# Print the message
print("This is the work of " + Name + " [CID: " + str(CID) + "]")
2.6.2. Problem#
In machine learning, reducing the number of features can lead to simpler models, reduce the risk of overfitting, and improve generalisation. Understanding which features are necessary and which can be excluded is crucial for developing efficient and interpretable models.
A task will be given in class focusing on feature selection and model performance analysis.
#Empty block for your answers
#Empty block for your answers
Save your completed notebook as an .ipynb file. The completed file should be uploaded to Blackboard under assignments for MATE70026.
2.7. 🌊 Dive deeper#
Level 1: Tackle Chapter 14 on Tree-Based Learners in Machine Learning Refined.
Level 2: A collection of videos from the Materials Project Workshop on advanced Python.
Level 3: Read more about the scikit-learn package and what it can do.