3. Materials Data#
3.1. 🚀 Data-driven thermoelectrics#
The goal today is to access, filter, and visualise materials data. There is a growing number of open computational materials science databases, including the Materials Project, NOMAD, OQMD, and AFLOW.
We will access these through an application programming interface (API) in Python. The Open Databases Integration for Materials Design (OPTIMADE) specification provides access to more than 20 databases and over 20 million structures through a single interface.
# Installation of libraries
!pip install optimade --quiet
!pip install matminer --quiet
!pip install elementembeddings --quiet
!pip install pymatviz --quiet
!pip install plotly --quiet
# Import of modules
import pandas as pd # Data manipulation with DataFrames
import numpy as np # Numerical operations
import matplotlib.pyplot as plt # Plotting
import pprint # Pretty print data structures
import os # Operating system functions
Colab error solution
If running the import module cell fails with an "AttributeError", click `Runtime` -> `Restart Session` and then simply rerun the cell.
3.2. Database queries#
In the following Python code, we use OPTIMADE to query a database and retrieve structures that contain a specified number of elements.
The steps include:
Initialise the OPTIMADE client: specify the database provider.
Define a filter: set the criteria for filtering the materials data.
Retrieve and process data: Query the database using the filter and process the retrieved data.
from optimade.client import OptimadeClient
# Initialise with specified provider
client = OptimadeClient(include_providers={"mp"})
# Other providers include "alexandria", "oqmd", "jarvis", and "aflow"
# Define a filter to find structures with exactly n elements
filters = ['nelements=8']
# Process each filter to fetch and display relevant data
for f in filters:
    # Retrieve data using the client
    result = client.get(f)
    if 'structures' in result and f in result['structures']:
        provider_data = next(iter(result['structures'][f].values()))
        structures_data = provider_data['data']
        # Print the count of structures found
        print(f"Count for filter '{f}': {len(structures_data)}")
        # Print a summary of each structure's ID and formula
        print("Summary of Compounds:")
        for structure in structures_data:
            structure_id = structure.get('id', 'No ID provided')
            formula = structure.get('attributes', {}).get('chemical_formula_descriptive', 'No formula provided')
            print(f"ID: {structure_id}, Formula: {formula}")
    else:
        print("No data found for this filter.")
They are complex compounds!
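The same client accepts richer filter strings. A few illustrative examples that follow the OPTIMADE filter grammar (sketches you could substitute into the `filters` list above, not an exhaustive list):
# Illustrative OPTIMADE filter strings
filters = [
    'elements HAS ALL "Bi","Te"',         # must contain both Bi and Te
    'nelements>=2 AND nelements<=3',      # binary and ternary compounds only
    'chemical_formula_anonymous="A3B2"',  # 3:2 stoichiometry, e.g. Bi2Te3
]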
We won’t spend too much time delving into the specifics of each API since they each come with their own syntax and capabilities. They can also be slow for complex searches. However, it’s important to grasp the overall logic behind constructing queries.
For instance, consider the following example using the Materials Project API. This query is designed to search for materials that contain Li, have a band gap between 0.5 and 1.5 eV, and consist of either two or three elements:
from mp_api.client import MPRester  # Materials Project API client

# ACCESS_KEY is your personal Materials Project API key
with MPRester(ACCESS_KEY, use_document_model=False) as mpr:
    docs = mpr.materials.summary.search(
        elements=["Li"],
        band_gap=(0.5, 1.5),
        num_elements=(2, 3),
        fields=['material_id', 'formula_pretty', 'band_gap', 'is_stable', 'theoretical']
    )
print("Number of binary and ternary Li containing compounds with a band gap between 0.5 and 1.5 eV: ", len(docs))
Running this requires getting your own access key. We had problems last year as too many students connected at once and Imperial College got temporarily blocked! For this activity we will move forward and “cheat” by using a pre-built dataset.
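If you do obtain a key later, a convenient follow-up (a sketch, assuming the `docs` list returned by the query above) is to collect the results into a DataFrame for further filtering:
# Sketch: with use_document_model=False each entry in docs is a plain dictionary
li_df = pd.DataFrame(docs)
print(li_df[['formula_pretty', 'band_gap', 'is_stable']].head())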
3.3. Chemical space of thermoelectric materials#
In the Lecture 2 activity, we used a matminer dataset on crystal hardness. Today we will turn to thermoelectric materials.
Thermoelectric devices convert temperature differences directly into electrical voltage, enabling applications in power generation and refrigeration. Their efficiency is characterised by the dimensionless figure of merit (zT), which depends on electrical conductivity, Seebeck coefficient, and thermal conductivity (read more here). Let’s explore the diverse compositions that give rise to these properties.
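For reference, the figure of merit is defined as \( zT = \frac{S^2 \sigma T}{\kappa} \), where \(S\) is the Seebeck coefficient, \(\sigma\) the electrical conductivity, \(\kappa\) the total thermal conductivity, and \(T\) the absolute temperature. A minimal sketch of this relation in Python (the function name and example values are illustrative, not taken from the dataset):
# Illustrative only: dimensionless figure of merit zT = S^2 * sigma * T / kappa
def figure_of_merit(seebeck, conductivity, thermal_conductivity, temperature):
    """S in V/K, sigma in S/m, kappa in W/(m K), T in K."""
    return seebeck**2 * conductivity * temperature / thermal_conductivity

# Example values representative of a good thermoelectric: S = 200 µV/K, σ = 1e5 S/m, κ = 1.5 W/(m K), T = 600 K
print(f"zT ≈ {figure_of_merit(200e-6, 1e5, 1.5, 600):.2f}")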
import matminer # Materials informatics
from matminer.datasets.dataset_retrieval import load_dataset # Load materials datasets
print(matminer.datasets.dataset_retrieval.get_all_dataset_info('ucsb_thermoelectrics'))
The dataset is a reasonable size, so we can load it all without downsampling.
# Use matminer to download the dataset
df = load_dataset('ucsb_thermoelectrics')
print(f'The full dataset contains {df.shape[0]} entries. \n')
print('The DataFrame is shown below:')
df.head(10)
Let’s perform a deeper check:
# Display the columns of the dataset
print("Columns in the dataset:")
print(df2.columns)
# Check for any missing values in the dataset
print("\nChecking for missing values:")
print(df2.isnull().sum())
Code hint
Check the DataFrame name
First, let’s visualise the distribution of the zT performance metric in the dataset. This will help us understand the range and common values.
import matplotlib.pyplot as plt
import seaborn as sns
# Plot the distribution of zT values
plt.figure(figsize=(5, 4))
sns.histplot(df['zT'], bins=, kde=True)
plt.title('Thermoelectric figure of merit (zT)')
plt.xlabel('zT (unitless)')
plt.ylabel('Frequency')
plt.show()
Code hint
Set the number of bins for the histogram. 50?
Next, we will analyse the distribution of elements within our thermoelectrics dataset by creating a heatmap over the periodic table. This would take a lot of time from scratch, but we can use the pymatviz package.
import re
from collections import Counter
from pymatviz import ptable_heatmap
# Function to extract elements using a regex pattern
def extract_elements(composition):
    # Regex to match chemical elements: one uppercase letter optionally followed by a lowercase letter
    pattern = r"[A-Z][a-z]?"
    return re.findall(pattern, composition)
# Extract all unique elements from the composition column
elements = []
for composition in df['composition']:
    elements += extract_elements(composition)
# Count the frequency of each element
element_counts = Counter(elements)
# Convert the Counter object to a dictionary for the heatmap
element_counts_dict = dict(element_counts)
# Create the periodic table heatmap
fig = ptable_heatmap(
    element_counts_dict,
    colormap="viridis",
    cbar_title="Frequency of Elements",
    log=True,  # Use logarithmic scale for better visualisation
    value_kwargs={"fontsize": 10},  # Adjust font size for element labels
    return_type="figure"  # Return the figure for further customisation
)
# Add a title
fig.suptitle("Element Frequency in Thermoelectric Dataset", fontsize=16, fontweight="bold")
# Display the plot
plt.show()
3.4. Unsupervised machine learning#
We have a set of materials with different compositions. We can use machine learning to visualise these in two or three dimensions. You can think of this like a materials map.
We refer to methods that reduce high-dimensional data to lower dimensions as dimensionality reduction techniques. These allow us to visualise complex data; in this example, we will make use of Principal Component Analysis (PCA).
Overview of PCA
PCA is a popular technique for dimensionality reduction and data preprocessing, enabling the simplification of complex datasets. High-dimensional data is transformed into a new coordinate system where the axes align with the directions of maximum variance in the original data. These new axes, termed "principal components," are orthogonal. The first principal component captures the highest variance, the second captures the second highest, etc.
Key use cases include:
Dimensionality Reduction: Identifying and eliminating less informative dimensions, reducing noise and computational complexity.
Data Visualisation: Facilitating easier interpretation while preserving essential patterns.
Noise Reduction: Filtering out noise or unimportant variations by focusing on significant variance.
Feature Engineering: A preprocessing step to transform data before applying machine learning algorithms, potentially improving performance.
PCA workflow:
Center the Data: Subtract the mean from each feature to center the data around the origin.
\( X_{\text{centered}} = X - \bar{X} \)
Calculate Covariance Matrix: Compute the covariance matrix to understand feature relationships and their covariance.
\( \text{Cov}(X) = \frac{1}{n}X_{\text{centered}}^T X_{\text{centered}} \)
Compute Eigenvalues and Eigenvectors: Calculate eigenvalues (\(\lambda_i\)) and eigenvectors (\(\mathbf{v}_i\)) of the covariance matrix. Eigenvectors represent maximum variance directions, and eigenvalues quantify the variance magnitude.
\( \text{Cov}(X) \mathbf{v}_i = \lambda_i \mathbf{v}_i \)
Sort Eigenvalues: Sort eigenvalues in descending order, rearranging corresponding eigenvectors accordingly.
Select Principal Components: Choose a subset of eigenvectors (principal components) based on eigenvalues, explaining the most variance in the data.
Project Data: Project original data onto selected principal components, yielding a lower-dimensional representation.
Note that PCA analysis is limited by its reliance on linear transformations of the data. In cases where non-linear structures are prominent, alternative techniques such as t-distributed Stochastic Neighbor Embedding (t-SNE) can be used.
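To make the workflow concrete, here is a minimal NumPy sketch of those steps (for illustration only; below we use the scikit-learn implementation):
import numpy as np

def pca_from_scratch(X, n_components=2):
    # 1. Center the data
    X_centered = X - X.mean(axis=0)
    # 2. Covariance matrix of the features
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigenvalues and eigenvectors (eigh, since the covariance matrix is symmetric)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort by descending eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # 5. Select the leading components and 6. project the data onto them
    return X_centered @ eigenvectors[:, :n_components], eigenvalues

# Quick check on random 10-dimensional data
X_demo = np.random.rand(100, 10)
projected, eigvals = pca_from_scratch(X_demo, n_components=2)
print(projected.shape)  # (100, 2)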
# Perform PCA analysis
from sklearn.decomposition import PCA
plt.rcdefaults() # Reset the matplotlib style
# Random 10D vectors for demonstration purposes
np.random.seed(42)
num_samples = 1000
dimensionality = 10
random_state = 42
data = np.random.rand(num_samples, dimensionality)
# Perform PCA to reduce the dimensionality to 2D
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(data)
# Create a color map based on the original data points
# Use the first dimension of the original data as the color value
color_map = data[:, 0]
# Plot the 2D projection
plt.figure(figsize=(5, 3))
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=color_map, cmap='viridis')
plt.colorbar(label='Colour based on first dimension')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('2D projection using PCA')
plt.show()
Code hint
Set the number of dimensions (components) to 2
3.4.1. Featurise compositions#
Using our dataset, we can try different featurisation schemes and analyse the resulting visualisations. Here we will perform a Magpie encoding of chemical compositions using the ElementEmbeddings package.
from elementembeddings.composition import composition_featuriser
magpie_df = composition_featuriser(df, formula_column="composition", embedding='magpie')
magpie_df.head()
Note all of the extra columns that have been added containing the features. We need to do a little cleanup of the DataFrame and can then proceed with our PCA analysis.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Select only the new features by dropping the original columns and handling NaNs
new_features_df = magpie_df.drop(columns=df.columns).dropna()
# Extract feature values
X = new_features_df.values
X_standardised = StandardScaler().fit_transform(X)
# Perform PCA to reduce dimensionality to 2D
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(X_standardised)
# Create a single plot for visualisation
fig, ax = plt.subplots(figsize=(6, 4))
# Select a label for coloring (assuming zT column exists in the original df)
color_map = df["zT"].values # Use original DataFrame for the label (e.g., zT)
# Scatter plot in 2D space
scatter = ax.scatter(reduced_data[:, 0], reduced_data[:, 1], c=color_map, cmap='viridis', alpha=0.25, s=100)
# Add labels and title
ax.set_xlabel("Component 1")
ax.set_ylabel("Component 2")
ax.set_title("2D PCA of Magpie Features")
# Add colour bar
fig.colorbar(scatter, ax=ax, label="zT")
# Show the plot
plt.show()
Let’s check how much variance is captured by the first two principal components. Capturing above 50% of the variance with two dimensions is generally acceptable for initial exploration. However, for practical applications, we might need to increase the number of components to explain a higher proportion of the variance (e.g. 80%-90%) to ensure sufficient representation of the data.
# Check explained variance
explained_variance = pca.explained_variance_ratio_
cumulative_variance = explained_variance.cumsum()
print(f"Explained variance by Component 1: {explained_variance[0]:.2f}")
print(f"Explained variance by Component 2: {explained_variance[1]:.2f}")
print(f"Cumulative variance explained by 2 components: {cumulative_variance[1]:.2f}")
Now, wouldn’t it be nice if we could see which material each data point refers to? Plotly can help with that.
import plotly.express as px
# Add the zT column and composition column from the original DataFrame for coloring and hover info
plot_df = pd.DataFrame(reduced_data, columns=['Component 1', 'Component 2'])
plot_df['zT'] = df['zT'].values
plot_df['composition'] = df['composition'].values # Assuming this is the column with chemical formulas
# Create an interactive scatter plot using Plotly
fig = px.scatter(
    plot_df,
    x='Component 1',
    y='Component 2',
    color='zT',
    color_continuous_scale='Viridis',
    opacity=0.6,
    hover_name='composition',  # Show chemical formula on hover
    hover_data={'Component 1': ':.2f', 'Component 2': ':.2f', 'zT': ':.2f'},  # Additional hover data
    labels={'zT': 'zT Value'}
)
fig.update_traces(marker=dict(size=10)) # Adjust the size value as needed
fig.update_layout(
    width=6*96,   # 6 inches converted to pixels
    height=4*96   # 4 inches converted to pixels
)
# Show the interactive plot
fig.show()
Finally, let’s see how a 3D projection looks.
from mpl_toolkits.mplot3d import Axes3D
# Perform PCA to reduce dimensionality to 3D
pca = PCA(n_components=3)
reduced_data = pca.fit_transform(X_standardised)
# Create a 3D plot for visualisation
fig = plt.figure(figsize=(6, 5))
# Create a 3D plot
ax = fig.add_subplot(111, projection='3d')
color_map = df["zT"].values
scatter = ax.scatter(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2], c=color_map, cmap='viridis', alpha=0.5)
# Add labels and title
ax.set_xlabel("Component 1")
ax.set_ylabel("Component 2")
ax.set_zlabel("Component 3")
ax.set_title("3D PCA of Magpie Features")
# Show the plot
plt.show()
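As noted in the PCA overview, t-SNE offers a non-linear alternative. A minimal sketch using scikit-learn on the same standardised Magpie features (the perplexity of 30 is just a starting point and worth tuning) is:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Sketch: non-linear 2D projection of the standardised features
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
tsne_data = tsne.fit_transform(X_standardised)

plt.figure(figsize=(6, 4))
plt.scatter(tsne_data[:, 0], tsne_data[:, 1], c=df["zT"].values, cmap='viridis', alpha=0.5)
plt.colorbar(label='zT')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.title('2D t-SNE of Magpie features')
plt.show()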
Code hint
You can change the featurisation scheme to see the impact on the resulting visualisation and distribution of compositions.
3.5. 🚨 Exercise 3#
3.5.1. Your details#
import numpy as np
# Insert your values
Name = "No Name" # Replace with your name
CID = 123446 # Replace with your College ID (as a numeric value with no leading 0s)
# Set a random seed using the CID value
CID = int(CID)
np.random.seed(CID)
# Print the message
print("This is the work of " + Name + " [CID: " + str(CID) + "]")
3.5.2. Problem#
Unsupervised machine learning techniques, such as dimensionality reduction, allow us to uncover patterns and relationships in complex datasets.
A set of tasks will be assigned in which you apply different dimensionality reduction techniques to visualise clustering and explore the properties of thermoelectric materials.
#Empty block for your answers
#Empty block for your answers
Download your completed notebook as an .ipynb file. The completed file should be uploaded to Blackboard under assignments for MATE70026.
3.6. 🌊 Dive deeper#
Level 1: Tackle Chapter 8 on Linear Unsupervised Learning in Machine Learning Refined.
Level 2: Read about our attempt to screen all inorganic materials (with caveats) in the journal Chem.
Level 3: Watch a seminar by quantum chemist Anatole von Lilienfeld on chemical space.