1. Introduction#

💡 Ada Lovelace: The more I study, the more insatiable do I feel my genius for it to be.

Lecture slides

1.1. 👋 Getting started#

Welcome to our first practical session!

This is a Jupyter Notebook loaded inside a Jupyter Book. They are part of Project Jupyter, a suite of open-source tools. A Jupyter Notebook also allows you to run and easily share computer code. This combination makes Jupyter notebooks a useful tool for analysing data.

Unlike spreadsheets or combinations of separate data analysis codes, you can collect descriptions and notes for individual experiments, links to the raw data collected, the computer code that performs any necessary data analysis, and the final figures generated with these data, ready for use in a report or published paper.

There are a few components to be aware of:

1.1.1. Python#

A working knowledge of the Python programming language is assumed for this course. If you are rusty, Chapters 1-4 of Datacamp cover the base concepts, as do many other online resources including Imperial’s Introduction to Python course.

1.1.2. Markdown#

Markdown is a markup language that allows easy formatting of text. It is widely used for creating and formatting online content. It is easier to read and write than html. A guide to the syntax can be found here.

# Heading
## Smaller heading
### Even smaller heading

1.1.3. Github#

GitHub is a platform for writing and sharing code. There are many materials science projects hosted there, which enable researchers from around the world to contribute to their development. These notebooks are hosted on GitHub too. If you find an error, you can raise an issue or even better fix it yourself with a pull request.

1.1.4. Live coding#

The weekly notebooks are designed to be run online directly in your browser. You can activate the server by clicking the rocket icon on the top right and selecting Live Code. There is an option to open in Binder or Google Colab, which you may prefer if you are an advanced user, but the formatting won’t be as nice. Those services will be used for the research challenge later in the course. You can opt to install Python on your own computer with Anaconda and run the notebooks locally, but we do not offer support if things go wrong.

1.2. Analyse data with code#

By programming a series of instructions, researchers can consistently obtain the same results from a given dataset. This approach enables us to share datasets and code, allowing other scientists to review, repeat and reuse the analysis. The transparency and reproducibility of code-based analysis enhances research integrity and credibility, while minimising errors. It also enables efficient handling of large datasets and complex calculations, accelerating the exploration of different techniques.

1.2.1. Running code#

Different programming languages can be used in Jupyter notebooks. We will be using Python 3. The large scientific community for Python means that well-developed resources exist for data processing and specific prewritten tools for manipulating and plotting data.

Any code typed into a code cell can be run (executed) by pressing the run button. You can also run the selected code block using Shift-Enter combination on your keyboard.

2+3 # run this cell
print("Beware of 妖精") # anything after '#' is a comment and ignored
12*2.40*3737*12 # you get the idea
2**1000 - 2 # a big number
import math as m # import a math module
m.pi
20*m.atan(1/7)+8*m.atan(3/79) # Euler's approximation

1.2.2. Plotting with Matplotlib#

Let’s import the package Matplotlib, which we will be using a lot for data visualisation.

# Imports
import matplotlib.pyplot as plt  # Plotting
import numpy as np  # Numerical operations
%matplotlib inline

x = np.arange(0, 10, 0.001) # x = 0 to 10 in steps of 0.001
y = np.sin(x*x) # define your function
plt.figure(figsize=(5, 3)) # create a new figure (5x3 inches)
plt.plot(,y) # plot x against y
Code hint You need to plot x vs y. Fix the plot command to (x,y).

1.2.3. Using a DataFrame#

A DataFrame organises data into a 2-dimensional table of rows and columns, much like a spreadsheet. They are useful tools to store, access, and modify large sets of data.

In this module, we’ll make use of Pandas to process input and output data for our machine learning models.

import pandas as pd  # Data manipulation using DataFrames

df = pd.DataFrame() # This instantiates an empty pandas DataFrame

data = {
    "Element" : ['C', 'O', 'Fe', 'Mg', 'Xe'],
    "Atomic Number" : [6, 8, 26, 12, 54],
    "Atomic Mass" : [12, 16, 56, 24, 131]
}

# Let's try loading data into DataFrame df
df = pd.DataFrame(data)

# We can make the 'Element' column the index using the set_index function
df = df.set_index("Element")

# Printing the values in the 'Atomic Number' column
print(df["Atom Number"])
Code hint Check you are printing the correct column name. Try out some of the other options.
# Add a new column
df["Energy (eV)"] = [5.47, 5.14, 0.12, 4.34, 7.01]

print(df["Energy (eV)"])
# Print a row from the DataFrame

# The df.loc[index] function to print the entry "C"
print(df.loc[''])

print('-----')

# The df.iloc[index] function to print the first entry (counting starts at 0...)
print(df.iloc[0])
Code hint You need to tell `df.loc` what to look for. Put an element name in between the quotes.

1.2.4. Write an equation#

This equation is written in LaTeX format. It’s easy to learn and useful for complex expressions, e.g. \frac{x}{y} writes x/y as a fraction \(\dfrac{x}{y}\).

$-\frac{\hslash^2}{2m} \, \frac{\partial^2 \psi}{\partial x^2}$

renders as

\(-\dfrac{\hslash^2}{2m} \, \dfrac{\partial^2 \psi}{\partial x^2}\)

1.3. Computational science#

1.3.1. Thermally-actived diffusion#

Ion transport in crystals is a fundamental process that underpins various technological applications, from batteries to semiconductor devices. Understanding the kinetics of ion movement within and between materials is crucial for optimising device performance.

Like many chemical processes, solid-state diffusion transport is thermally activated. We can describe ion motion in a crystal using a familiar Arrhenius relationship.

The diffusion coefficient of a species is given by \(D_{ion} = D_0 \cdot e^{-(\frac{\Delta E_a}{k_BT})}\), where:

  • \(D_{ion}\) is the diffusion coefficient for a particular ion,

  • \(D_0\) is the temperature-independent prefactor (containing an attempt frequency),

  • \(\Delta E_a\) is the activation energy for diffusion,

  • \(k_B\) is the Boltzmann constant, and

  • \(T\) is the temperature.

Let’s write a function for it, which will take advantage of the wonderful NumPy package. It also uses the physical constants in scipy, and explains the function with a docstring.

import numpy as np
from scipy.constants import physical_constants

# Define constants
k_B = physical_constants['Boltzmann constant in eV/K'][0]

# Arrhenius function
def arrhenius(activation_energy, temperature, D0=1):
    """
    Calculates the rate using the Arrhenius equation.
    
    Parameters:
    activation_energy (float): the activation energy in eV.
    temperature (float): the temperature in K (must be > 0).
    D0 (float): the pre-exponential factor (default is 1).
    
    Returns:
    float: the rate of the reaction.
    """
    if np.any(temperature <= 0):
        raise ValueError("Temperature must be greater than 0 K.")
    return D0 * np.exp(-activation_energy / (k_B * temperature))

This function takes activation_enery (eV) and temperature (K) as inputs and returns the corresponding diffusion coefficient. Recall that the units of the exponential term have to cancel out, so \(D_{ion}\) takes the same units as \(D_0\). Now let’s use it:

arrhenius(0.12, 1000) # Calls the function for Ea = 0.12 eV; T = 1000 K

This value tells us the likelihood that each attempt has of overcoming the thermodynamic barrier for ionic diffusion. Decrease the temperature to 100 K and see the difference.

Now let’s take advantage of the function to make a plot. We will use the numpy function linspace, which is documented over here. It simply generates 100 numbers evenly spaced between 100 and 1000 to represent the temperature range.

import matplotlib.pyplot as plt

# Pre-exponential term in cm^2/s
D0 = 0.5

# Temperature range in K
T = np.linspace(0, 1000, 100)

# Example activation energy in eV
activation_energy = 0.83

# Calculate rates
rates = arrhenius(activation_energy, T, D0)

# Plotting
plt.figure(figsize=(5, 3))
plt.plot(T, rates, label=f'Activation Energy = {activation_energy} eV')
plt.xlabel('Temperature (K)')
plt.ylabel('$D_{ion}$ (cm$^2$/s)')  # Adding units to y-axis
plt.title('Thermally Activated Transport')
plt.legend()
plt.grid(True)
plt.show()
Code hint Start the temperature range from 100 instead of 0 K.
# Pre-exponential term in cm2/s
D0 = 0.5

# Range of activation energies in eV
activation_energies = np.linspace(0.1, 1.0, 4) # Range from 0.1 to 1.0 eV in n steps

# Temperature range in K
T = np.linspace(100, 1000, 0)

# Calculate rates and plot curves
plt.figure(figsize=(5, 3)) 

for activation_energy in activation_energies:
    rates = arrhenius(activation_energy, T, D0)
    plt.plot(T, rates, label=f'{activation_energy:.1f} eV')

plt.xlabel('Temperature (K)')
plt.ylabel('$D_{ion}$ (cm$^2$/s)') 
plt.title('Diffusion with varying activation energies')
plt.legend()
plt.grid(True)
plt.show()
Code hint 'np.linspace' requires three arguments (start, stop, number of points). 0 points won't work. Try changing it to 10.

To better visualise the trends, we can make an Arrhenius plot by plotting the logarithm of D versus the inverse temperature, 1/T. We use 1000/T to give a nicer range on the \(x\)-axis.

# Plotting ln(R) vs 1000/T
plt.figure(figsize=(5, 3)) 

for activation_energy in activation_energies:
    rates = arrhenius(activation_energy, T, D0)
    plt.plot(1000/T, np.log(rates), label=f'{activation_energy:.1f} eV')

plt.xlabel('1000 / Temperature (1/K)')
plt.ylabel('ln($D_{ion}$)')
plt.title('Arrhenius plot')
plt.legend()
plt.grid(True)
plt.show()

The last technique to picked up in this class is data fitting. Later in the module, this will be expanded to more complex functions in high dimensions, but we’ll start with linear regression in just 2D. There is no need to code this by hand as we can use a function in the machine learning package scikit-learn. The real power of Python is the quality and quantity of available libraries such as this one.

import numpy as np
import pandas as pd

num_points =  # Number of data points to generate

# Generate random x-y data points
x_data = np.random.uniform(0, 10, num_points)  # Adjust the range as needed
y_data = np.random.uniform(0, 10, num_points)

# Create a DataFrame
data = {'X': x_data, 'Y': y_data}
df = pd.DataFrame(data)

# Print the DataFrame
print(df)
Code hint Again you need to choose the number of points. 50 should be fine, but you have the power to decide.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Perform linear regression
X = df['X'].values.reshape(-1, 1)  # Reshape X for compatibility with sklearn
y = df['Y'].values
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Calculate error bars
residuals = y - y_pred
error_bars = np.abs(residuals)

# Plot the linear regression line
plt.figure(figsize=(5, 3)) 
plt.errorbar(df['X'], df['Y'], yerr=error_bars, fmt='o', color='skyblue', label='Prediction errors')
plt.scatter(df['X'], df['Y'])
plt.plot(df['X'], y_pred, color='red', label='Regression line')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Linear regression')
plt.legend()
plt.show()

There are a number of useful analysis tools built into sklearn, which we can use to probe the model properties.

# Print the model parameters and performance with enhanced formatting
try:
    print(f'Slope: {model2.coef_[0]:.2f}')  # Assuming model.coef_ might be an array for multidimensional X
    print(f'Intercept: {model2.intercept_:.2f}')
    print(f'R^2 Score: {r2_score(y, y_pred):.3f}')  # R^2 - coefficient of determination
    print(f'RMSE: {np.sqrt(mean_squared_error(y, y_pred)):.3f}')  # Root Mean Squared Error
except Exception as e:
    print("Error in calculating model parameters or performance metrics:", e)
Code hint Your model is not called `model2`. Try changing the name.

1.4. 🚨 Exercise 1#

Coding exercises

The exercises are designed to apply what you have learned with room for creativity. It is fine to discuss solutions with your classmates, but the actual code should not be directly copied.

The completed notebooks are to be submitted at the end of class, but you can revist later, experiment with the code, and follow the further reading suggestions.

1.4.1. Your details#

import numpy as np

# Insert your values
Name = "No Name" # Replace with your name
CID = 123446 # Replace with your College ID (as a numeric value with no leading 0s)

# Set a random seed using the CID value
CID = int(CID)
np.random.seed(CID)

# Print the message
print("This is the work of " + Name + " [CID: " + str(CID) + "]")

1.4.2. Tasks#

#Code block
#Comment block
#Code block
#Comment block

Submission

When your notebook is complete, click on the download icon on the top right, select .ipynb, save the file and upload it to Blackboard. If you are using Google Colab, you have to File -> Download and choose .ipynb.

1.5. 🌊 Dive deeper#