7. Building a Model from Scratch#
7.1. 🦾 Crystal hardness revisited#
We first tackled the bulk modulus of inorganic crystals in Lecture 2. However our model development was not thorough back then.
Let’s revisit this problem using the new knowledge and tricks we have picked up. We will follow the same initial steps, making use of matminer to access the materials dataset and featurise the data.
# Installation of libraries
!pip install matminer --quiet
!pip install xgboost --quiet
# Downgrade scikit to avoid a conflict with xgboost
# Note: Ignore the error message
!pip uninstall -y scikit-learn --quiet
!pip install scikit-learn==1.3.1 --quiet
# Import of modules
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pprint
import seaborn as sns
plt.style.use('ggplot')
# Advanced
from pymatgen.core import Structure
import matminer
from matminer.datasets.dataset_retrieval import load_dataset
from monty.serialization import loadfn
# To make the model run faster
teaching_mode = True
Colab error solution
If running the import module cell fails with an "AttributeError", click `Runtime` -> `Restart Session` and then simply rerun the cell.7.2. Data preparation#
The steps to load and featurise the bulk modulus data were introduced in Notebook 2, so we can jump straight in.
# Use matminer to load the dataset
df = load_dataset('matbench_log_kvrh')
print(f'The full dataset contains {df.shape[0]} entries. \n')
if teaching_mode:
# Store the original DataFrame as a copy
full_dataset_df = df.copy()
# Create a subset of the original DataFrame for demonstration purposes
df = df.sample(n=1000, random_state=33)
print(f'For teaching purposes we will only work with {df.shape[0]} entries from the dataframe to make the model training and testing faster. \n')
print('The DataFrame is shown below:')
df.head(10)
# Plot a histogram of values
fig, ax = plt.subplots(figsize=(5, 4))
ax.hist(df['log10(K_VRH)'])
ax.set_xlabel(r'$log_{10}K_{VRH}$ [$log_{10}GPa$]' )
ax.set_ylabel('Counts')
plt.show()
# Use matminer to featurise the dataset
from matminer.featurizers.composition.composite import ElementProperty
from matminer.featurizers.structure.order import DensityFeatures
# Add a composition column to df using the composition property of the Structure class
df['composition'] = df.structure.apply(lambda x: x.composition )
# Create the ElementProperty featuriser
el_prop_featuriser = ElementProperty.from_preset(preset_name='magpie')
# By default multiprocessing is enabled, however, this can slow performance, so we disable it
el_prop_featuriser.set_n_jobs(1)
# Featurise using the ElementProperty featuriser
df = el_prop_featuriser.featurize_dataframe(df, col_id='composition')
# Add structure features
density_featuriser = DensityFeatures()
density_featuriser.set_n_jobs(1)
df=density_featuriser.fit_featurize_dataframe(df, col_id='structure')
# Print the shape of the DataFrame
print(df.shape)
df.head()
Let’s understand the feature space a little better.
# Extract the feature columns (excluding the first three)
feature_columns = df.columns[3:]
# Create a unique colour for each feature
colors = [plt.cm.jet(i / float(len(feature_columns))) for i in range(len(feature_columns))]
# Plot the distribution of feature values with different colours
plt.figure(figsize=(5, 4))
for i, column in enumerate(feature_columns):
df[column].plot(kind='hist', bins=0, alpha=0.5, color=colors[i], label=column)
plt.title('Feature Distributions')
plt.show()
Code hint
Add some bins to your histogram. 10-20 should be sufficient.Some dimensions have very different ranges, as you can see from the spread on the x-axis. We can standardise these.
MinMaxScaler
is a data scaling technique to transform numerical features within the range [0, 1]. It linearly scales data, preserving relationships between values, making it suitable for algorithms sensitive to feature magnitudes.
from sklearn.preprocessing import MinMaxScaler
scaled_df = df.copy()
# Step 1: Standardise the feature columns
scaler = MinMaxScaler()
scaled_df[feature_columns] = scaler.fit_transform(scaled_df[feature_columns])
# Step 2: Plot the standardised feature distributions
plt.figure(figsize=(5, 4))
for column in feature_columns:
scaled_df[column].plot(kind='hist', bins=20, alpha=0.5, label=column)
plt.title('Standardised Feature Distributions')
plt.show()
Finally, let’s prepare the data for model training. We need to split the dataset into the target variable log10(K_VRH)
and the input features. For the input features, we must remove any non-numerical data to avoid getting errors later in our workflow.
# Define the features we want
features_to_drop = ['structure','composition','log10(K_VRH)']
feature_cols = [col for col in list(df.columns) if col not in features_to_drop]
# Get an array of the features
X = df[feature_cols].values
scaled_X = scaled_df[feature_cols].values
# Get an array of the target variable
y = df['log10(K_VRH)'].values
print(f'Shape of X: {X.shape}')
print(f'Shape of y: {y.shape}')
7.3. Model choice#
We are dealing with a supervised regression problem, so should choose a suitable machine learning model. We can start by rebuilding a random forest. Are you curious if the feature scaling has an effect? I am.
# Random forest - original features
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
# Define the model
rf = RandomForestRegressor(n_estimators=100, criterion='squared_error', max_depth=3, min_samples_split=2, min_samples_leaf=1, random_state=42)
# Fit the model
rf.fit(X,y)
# Wrap the lines of code for later sections
def make_prediction_plot(X, y, model, label):
y_pred = model.predict(X) # Calculate predictions here
fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(y, y_pred, c=y, cmap='viridis')
ax.plot(y, y, 'r-')
ax.set_xlabel(f'{label} True')
ax.set_ylabel(f'{label} Predicted')
plt.show()
return y_pred # Return y_pred
# Performance
y_pred = make_prediction_plot(X, y, rf, 'log10(K_VRH)')
print(f'The training MAE = {metrics.mean_absolute_error(y,y_pred):.3f} log10GPa')
print(f'The training RMSE = {np.sqrt(metrics.mean_squared_error(y,y_pred)):.3f} log10GPa')
print(f'The training r^2 = {rf.score(X,y):.3f}')
# Random forest - scaled features
# Define the model
rf2 = RandomForestRegressor(n_estimators=100, criterion='squared_error', max_depth=3, min_samples_split=2, min_samples_leaf=1, random_state=42)
# Fit the model
rf2.fit(scaled_X, y)
# Performance
y_pred = make_prediction_plot(scaled_X, y, rf2, 'log10(K_VRH)')
print(f'The training MAE = {metrics.mean_absolute_error(y, y_pred):.3f} log10GPa')
print(f'The training RMSE = {np.sqrt(metrics.mean_squared_error(y, y_pred)):.3f} log10GPa')
print(f'The training r^2 = {rf2.score(scaled_X, y):.3f}')
We can see that Random Forest is not sensitive to feature scaling. Recall that this model works by averaging over multiple decision trees, and the decision boundaries are determined by feature thresholds, not their absolute values.
We have time to try one more model. Let’s go with the popular XGBoost. Like Random Forest, it is an ensemble learning method. XGBoost uses a gradient-boosting framework and often achieves higher predictive accuracy by optimising for both bias and variance in the model.
# XGBoost model
import xgboost as xgb
# Define the model
xgb_model = xgb.XGBRegressor(n_estimators=100, max_depth=3, random_state=42, objective='reg:squarederror')
# Fit the model
xgb_model.fit(scaled_X, y)
# Performance
y_pred = make_prediction_plot(scaled_X, y, xgb_model, 'log10(K_VRH)')
print(f'The training MAE = {metrics.mean_absolute_error(y, y_pred):.3f} log10GPa')
print(f'The training RMSE = {np.sqrt(metrics.mean_squared_error(y, y_pred)):.3f} log10GPa')
print(f'The training r^2 = {xgb_model.score(scaled_X, y):.3f}')
XGBoost does a better job, but wait…
We haven’t performed proper training and testing yet 😱. These models are likely to be overfit and unable to make useful predictions for new inputs. On to the next stage!
7.4. Training and testing#
7.4.1. Train-test split#
We are ready to build a real model now. Let’s separate the training data from the unseen test set used to assess model performance.
from slearn.model_selection import train_test_split
# Split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(scaled_X, y, test_size=0.2, random_state=42)
# Print the sizes of the arrays
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")
Code hint
The library is "sklearn"!7.4.2. Cross-validation#
Using the 80% training set, we can train a model by making use of cross-validation in an attempt to avoid overfitting. Note that this step may take a minute to run as 10 models are being trained (i.e. 5-fold cross-validation x 2 models).
Recap of cross-validation
Cross-validation partitions data into multiple subsets, training the model on some and validating it on others, ensuring robust evaluation.Key types include:
k-Fold Cross-Validation: Data is split into k folds; each fold is used as a validation set once while training on the remaining k-1 folds.
Leave-One-Out Cross-Validation (LOOCV): Each data point is used as a validation set once, with the rest for training.
Stratified k-Fold: Preserves class proportions in each fold, useful for imbalanced datasets.
Time Series Cross-Validation: Ensures training always precedes validation, preserving temporal structure.
Typical workflow:
Split Data: Divide the dataset into k folds.
Train and Validate: Train the model on k-1 folds, validate on the remaining fold.
Repeat: Cycle through all folds, ensuring each serves as a validation set.
Aggregate Results: Compute performance metrics across all iterations.
Train Final Model: Fit the model using the full training dataset based on cross-validation insights.
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
# Define models
xgb_model = XGBRegressor(n_estimators=100, max_depth=3, random_state=42, objective='reg:squarederror')
rf_model = RandomForestRegressor(n_estimators=100, max_depth=3, random_state=42)
# Perform cross-validation for XGBoost
xgb_cv_scores = cross_val_score(xgb_model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
xgb_rmse = np.sqrt(-xgb_cv_scores) # Convert to RMSE
# Perform cross-validation for Random Forest
rf_cv_scores = cross_val_score(rf_model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
rf_rmse = np.sqrt(-rf_cv_scores) # Convert to RMSE
# Print results
# Compare the results
print("XGBoost Cross-Validation Results")
print(f" Mean RMSE: {xgb_rmse.mean():.3f}")
print(f" Standard Deviation of RMSE: {xgb_rmse.std():.3f}")
print("\nRandom Forest Cross-Validation Results")
print(f" Mean RMSE: {rf_rmse.mean():.3f}")
print(f" Standard Deviation of RMSE: {rf_rmse.std():.3f}")
• Mean RMSE: Mean error across the cross-validation folds (smaller = better).
• Standard Deviation of RMSE: Variability in error across the folds (smaller = more consistent).
7.4.3. Hyperparamater optimisation#
XGBoost is in the lead! So far, we have not adjusted the models themselves. It is possible to improve performance by tuning the hyperparameters. Manually tuning would be laborious. We can use GridSearchCV
to automate the search.
Note that this step will be even more computationally expensive as we are performing cross-validation as a function of model hyperparameters for two separate models. You can see how computational cost quickly escalates and this is where powerful GPUs can become essential for machine learning!
from sklearn.model_selection import GridSearchCV
# Hyperparameter grid for XGBoost
xgb_param_grid = {
'n_estimators': [100, 200],
'max_depth': [3, 6],
'learning_rate': [0.1, 0.2]
}
xgb_grid_search = GridSearchCV(XGBRegressor(random_state=42), xgb_param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
xgb_grid_search.fit(X_train, y_train)
best_xgb_params = xgb_grid_search.best_params_
best_xgb_model = xgb_grid_search.best_estimator_
# Hyperparameter grid for Random Forest
rf_param_grid = {
'n_estimators': [100, 200],
'max_depth': [3, 6],
'min_samples_split': [2, 4]
}
rf_grid_search = GridSearchCV(RandomForestRegressor(random_state=42), rf_param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
rf_grid_search.fit(X_train, y_train)
best_rf_params = rf_grid_search.best_params_
best_rf_model = rf_grid_search.best_estimator_
# Evaluate the best models
xgb_cv_scores = -cross_val_score(best_xgb_model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
xgb_rmse = np.sqrt(xgb_cv_scores)
rf_cv_scores = -cross_val_score(best_rf_model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
rf_rmse = np.sqrt(rf_cv_scores)
# Compare the results of the best models
print("Best XGBoost Hyperparameters:", best_xgb_params)
print("Best XGBoost Cross-Validation Results")
print(f" Mean RMSE: {xgb_rmse.mean():.3f}")
print(f" Standard Deviation of RMSE: {xgb_rmse.std():.3f}")
print("\nBest Random Forest Hyperparameters:", best_rf_params)
print("Best Random Forest Cross-Validation Results")
print(f" Mean RMSE: {rf_rmse.mean():.3f}")
print(f" Standard Deviation of RMSE: {rf_rmse.std():.3f}")
Was it worth the effort? There should be improvements in the RMSE for both models. Note the optimal hyperparameters found.
7.4.4. Model assessment#
Now that we have our best trained models, let’s see how they perform on unseen test data. Comparing test performance to training performance will help us determine if the model generalises well or shows signs of overfitting or underfitting.
from sklearn.metrics import mean_squared_error, r2_score
# Test the best XGBoost model
xgb_test_preds = best_xgb_model.predict(X_test)
xgb_test_rmse = np.sqrt(mean_squared_error(y_test, xgb_test_preds))
xgb_test_r2 = r2_score(y_test, xgb_test_preds)
# Test the best Random Forest model
rf_test_preds = best_rf_model.predict(X_test)
rf_test_rmse = np.sqrt(mean_squared_error(y_test, rf_test_preds))
rf_test_r2 = r2_score(y_test, rf_test_preds)
# Print test results
print("XGBoost test results:")
print(f"RMSE: {xgb_test_rmse:.3f}")
print(f"R²: {xgb_test_r2:.3f}")
print("\nRandom Forest test results:")
print(f"RMSE: {rf_test_rmse:.3f}")
print(f"R²: {rf_test_r2:.3f}")
# Create a scatter plot with both models in different colors
plt.figure(figsize=(5, 4))
plt.scatter(y_test, xgb_test_preds, c='blue', label=f'XGBoost (R²={xgb_test_r2:.2f})', alpha=0.5)
plt.scatter(y_test, rf_test_preds, c='green', label=f'Random Forest (R²={rf_test_r2:.2f})', alpha=0.5)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2) # Reference line (y=x)
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.title("Test set performance")
plt.legend()
plt.show()
XGBoost outperforms Random Forest in both cross-validation and test performance for this task, with the slight increase in RMSE from train to test suggesting both models generalise reasonably well.
7.4.5. Model speed#
The speed of a model may also be important, e.g. a use case involving millions of predictions. Several factors can influence the computational performance, including the dataset size, model complexity, and hardware. We can perform a simple comparison of our two models using time
.
import time
# Measure the training time for XGBoost
start_time = time.time()
xgb_model.fit(X_train, y_train)
xgb_training_time = time.time() - start_time
# Measure the training time for Random Forest
start_time = time.time()
rf_model.fit(X_train, y_train)
rf_training_time = time.time() - start_time
# Measure the prediction time for XGBoost
start_time = time.time()
xgb_test_preds = xgb_model.predict(X_test)
xgb_prediction_time = time.time() - start_time
# Measure the prediction time for Random Forest
start_time = time.time()
rf_test_preds = rf_model.predict(X_test)
rf_prediction_time = time.time() - start_time
print(f"XGBoost training time: {xgb_training_time:.4f} seconds")
print(f"Random Forest training time: {rf_training_time:.4f} seconds")
print(f"\nXGBoost prediction time: {xgb_prediction_time:.4f} seconds")
print(f"Random Forest prediction time: {rf_prediction_time:.4f} seconds")
It is clear that the XGBoost library has been well optimised to run quickly.
7.5. 🚨 Exercise 7#
7.5.1. Your details#
import numpy as np
# Insert your values
Name = "No Name" # Replace with your name
CID = 123446 # Replace with your College ID (as a numeric value with no leading 0s)
# Set a random seed using the CID value
CID = int(CID)
np.random.seed(CID)
# Print the message
print("This is the work of " + Name + " [CID: " + str(CID) + "]")
7.5.2. Problem#
Selecting the most appropriate ML model for a given purpose is important for achieving predictive performance. Your job will be to assess additional models (e.g. Nearest Neighbours and Support Vector Machines) for the hardness regression task. The tasks will be given in class.
#Empty block for your answers
#Empty block for your answers
Task hint
You can perform cross-validation following the same procedure as the random forest model in the main notebook..ipynb
. The completed file should be uploaded to Blackboard under assignments for MATE70026.
7.6. 🌊 Dive deeper#
Level 1: Tackle Chapter 14 on Tree-Based Learners in Machine Learning Refined.
Level 2: Explore the XGBoost tutorials, e.g. predicting multiple properties with multi-output regression.
Level 3: Find the best model (subject to time constraints) with Automatminer based on TPOT.