9. Research Challenge#
9.1. Build your own model#
We have just completed a session on generative AI (see Lecture slides), but it is time to go back to supervised machine learning problems.
You have been assigned one dataset from MatBench as introduced in the Challenge slides. You are free to choose and tune any machine-learning model, with any Python library, but it should be appropriate for the problem. For instance, XGBoost could be a good starting point to build a regression model. You can refer back to earlier notebooks and repurpose code as needed.
You may reach the limits of the processing power available on Google Colab. Building a useful model with limited resources is a real-world skill. You are welcome to use an alternative free service, or to run on your own computer. A model tracker such as wandb could be helpful for advanced users. If you want to try a brute-force approach, a library such as Automatminer may be of interest.
This notebook should be used for keeping a record of your model development, submission, and even your presentation. You are free to add, remove, or rearrange the cells as you see fit.
9.1.1. Your details#
import numpy as np
# Insert your values
Name = "No Name" # Replace with your name
CID = 123446 # Replace with your College ID (as a numeric value with no leading 0s)
# Set a random seed using the CID value
CID = int(CID)
np.random.seed(CID)
# Print the message
print(f"This is the work of {Name} [CID: {CID}]")
9.2. Problem statement#
You have been assigned one dataset from the list on MatBench. You should state what problem you are trying to solve and comment on the best-performing model in the benchmark.
# Spare cell
9.3. Data preparation#
Check the data distribution and apply appropriate pre-processing steps as required.
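A quick way to check the target distribution is with summary statistics and skewness. The sketch below uses a synthetic DataFrame with a placeholder column named "target" standing in for your assigned dataset; a heavily skewed regression target is one case where a pre-processing transform can help.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame standing in for your assigned MatBench dataset;
# "target" is a placeholder column name, not one from a real dataset
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({"target": rng.lognormal(mean=0.0, sigma=1.0, size=500)})

# Inspect summary statistics and skewness before modelling
print(df_demo["target"].describe())
print(f"Skewness: {df_demo['target'].skew():.2f}")

# A heavily right-skewed target can benefit from a log transform
df_demo["log_target"] = np.log1p(df_demo["target"])
print(f"Skewness after log1p: {df_demo['log_target'].skew():.2f}")
```

If you do transform the target, remember to invert the transform (here `np.expm1`) when reporting predictions.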
# Installation of libraries
!pip install matminer # Datasets and featurisation
# Get dataset info from matminer
from matminer.datasets import get_all_dataset_info
from matminer.datasets import load_dataset
# Uncomment the info line for your assigned challenge
# A (GTAs - Xia, Kinga)
#info = get_all_dataset_info("matbench_dielectric")
# B (GTAs - Irea, Pan)
#info = get_all_dataset_info("matbench_expt_gap")
# C (GTAs - Yifan, Fintan)
#info = get_all_dataset_info("matbench_glass")
# Check the dataset information
print(info)
# Load your dataset into a pandas DataFrame
df = load_dataset(" ")
print(df)
Choose relevant features, which may be based on composition or structure, depending on your problem. matminer is a good place to start.
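To illustrate the idea behind composition-based features, the toy sketch below computes a single hand-rolled descriptor (stoichiometry-weighted mean atomic number) from a formula string. In practice, matminer's featurizers compute many such elemental statistics for you; this example only shows what one such feature looks like, using a small hard-coded element table.

```python
import re

# Small hard-coded subset of atomic numbers, just for this demo
ATOMIC_NUMBER = {"H": 1, "O": 8, "Si": 14, "Fe": 26}

def mean_atomic_number(formula: str) -> float:
    """Stoichiometry-weighted mean atomic number of a simple formula."""
    total_z, total_n = 0.0, 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if not element:
            continue  # skip the empty match at the end of the string
        n = int(count) if count else 1
        total_z += ATOMIC_NUMBER[element] * n
        total_n += n
    return total_z / total_n

print(mean_atomic_number("SiO2"))   # (14 + 2*8) / 3 = 10.0
print(mean_atomic_number("Fe2O3"))  # (2*26 + 3*8) / 5 = 15.2
```

Featurizers such as matminer's ElementProperty generalise this pattern to many elemental properties and statistics at once.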
9.4. Model selection, testing and training#
Define your model and justify your choice based on the problem and available data. You can look back at earlier notebooks and investigate other examples online including in scikit-learn.
# Spare cell
Train, validate and test your model. Make sure to do proper data splits and to consider the hyperparameters of your model.
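A minimal sketch of a proper split plus hyperparameter tuning, using synthetic data in place of your featurised dataset and a scikit-learn random forest as a stand-in model. The grid values are illustrative, not recommendations; the key points are holding out a test set that tuning never touches, and using cross-validation to pick hyperparameters.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for your featurised dataset
X, y = make_regression(n_samples=300, n_features=10, noise=0.5, random_state=0)

# Hold out a test set that is never used during tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Cross-validated search over a small, illustrative hyperparameter grid
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
grid.fit(X_train, y_train)

print("Best hyperparameters:", grid.best_params_)
print(f"Test-set MAE: {-grid.score(X_test, y_test):.2f}")
```

The same pattern works with XGBoost or any other scikit-learn-compatible estimator; only the estimator and `param_grid` change.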
Note on the ROC-AUC classification metric
There is one metric we didn't cover but that is used in MatBench. In binary classification models, the ROC-AUC (Receiver Operating Characteristic - Area Under the Curve) score can be used to evaluate performance. It quantifies the ability of the model to distinguish between positive and negative instances across different decision thresholds. A higher ROC-AUC score (ranging from 0.5 to 1) indicates better performance, with 1 representing a perfect classifier and 0.5 indicating performance no better than random chance. There is a more detailed discussion here: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc. The metric can be calculated using the roc_auc_score function from the sklearn.metrics module, e.g.
from sklearn.metrics import roc_auc_score
# Example values; replace with your true labels (y_true) and
# predicted probabilities (y_pred_prob)
y_true = [0, 0, 1, 1]
y_pred_prob = [0.1, 0.4, 0.35, 0.8]
# Calculate ROC-AUC
roc_auc = roc_auc_score(y_true, y_pred_prob)
# Display the result
print(f'ROC-AUC Score: {roc_auc:.4f}')
# Spare cell
9.5. Model analysis and discussion#
How well does your final model perform? Think of metrics and plots that are useful to dig a little deeper.
Compare against the best-performing model on the MatBench leaderboard. With limited resources, don't expect to match this performance, but you should do better than a baseline model.
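One concrete way to show you beat a baseline is to score your model alongside scikit-learn's DummyRegressor, which always predicts the training-set mean. The sketch below uses synthetic data and a random forest as stand-ins for your dataset and model.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for your featurised dataset
X, y = make_regression(n_samples=300, n_features=10, noise=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

# Report the same metrics for both, so the comparison is direct
for name, est in [("model", model), ("baseline", baseline)]:
    y_pred = est.predict(X_test)
    print(f"{name}: MAE = {mean_absolute_error(y_test, y_pred):.2f}, "
          f"R2 = {r2_score(y_test, y_pred):.2f}")
```

A parity plot of predicted versus true values on the test set is also a useful diagnostic, as systematic deviations from the diagonal reveal where the model struggles.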
# Spare cell
9.6. Large Language Model (LLM) usage declaration#
Acknowledge use of a generative model during your assignment. Points to consider:
State which LLM (e.g. GPT-4, Gemini, Copilot)
Specify tasks (e.g. summarising research or code snippets)
Were any limitations/biases noted?
How did you ensure ethical use?
# Spare cell
9.7. Final word#
Good luck building your own model! We hope that you enjoyed the course and exercises. Dive deeper into the aspects that caught your interest. A useful starting point may be the Resources page.
Remember that submission is on Blackboard and you should upload both the completed Jupyter Notebook (.ipynb file), as well as your recorded narrated presentation (maximum 5 minutes; see guides on using Zoom or PowerPoint for this purpose).