1. Introduction#

💡 Ada Lovelace: The more I study, the more insatiable do I feel my genius for it to be.

Lecture slides

1.1. 👋 Getting started#

Welcome to our first practical session!

This is a Jupyter Notebook loaded inside a Jupyter Book. Both are part of Project Jupyter, a suite of open-source tools. A Jupyter Notebook combines formatted text with computer code that you can run and easily share, which makes it a useful tool for analysing data.

Unlike spreadsheets or collections of separate analysis scripts, a notebook lets you collect descriptions and notes for individual experiments, links to the raw data collected, the computer code that performs any necessary data analysis, and the final figures generated with these data, ready for use in a report or published paper.

There are a few components to be aware of:

1.1.1. Python#

A working knowledge of the Python programming language is assumed for this course. If you are rusty, Chapters 1-4 of the DataCamp course cover the core concepts, as do many other online resources, including Imperial’s Introduction to Python course.

1.1.2. Markdown#

Markdown is a markup language that allows easy formatting of text. It is widely used for creating and formatting online content, and it is easier to read and write than HTML. A guide to the syntax can be found here.

# Heading
## Smaller heading
### Even smaller heading
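
Beyond headings, a few other common elements are shown below (a short illustrative sample; the link target is just a placeholder):

**bold text** and *italic text*
- a bullet point
1. a numbered item
`inline code`
[a link](https://example.com)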

1.1.3. GitHub#

GitHub is a platform for writing and sharing code. There are many materials science projects hosted there, which enable researchers from around the world to contribute to their development. These notebooks are hosted on GitHub too. If you find an error, you can raise an issue or even better fix it yourself with a pull request.

1.1.4. Live coding#

The weekly notebooks are designed to be run online directly in your browser. You can activate the server by clicking the rocket icon on the top right and selecting Live Code. There is an option to open in Binder or Google Colab. Colab is more powerful, but the formatting won’t be as nice. You can opt to install Python on your own computer with Anaconda and run the notebooks locally, but we do not offer support if things go wrong.

1.2. Analyse data with code#

By programming a series of instructions, researchers can consistently obtain the same results from a given dataset. This approach enables us to share datasets and code, allowing other scientists to review, repeat and reuse the analysis. The transparency and reproducibility of code-based analysis enhances research integrity and credibility, while minimising errors. It also enables efficient handling of large datasets and complex calculations, accelerating the exploration of different techniques.

1.2.1. Running code#

Different programming languages can be used in Jupyter notebooks. We will be using Python 3. Python’s large scientific community means that well-developed resources exist for data processing, including prewritten tools for manipulating and plotting data.

Any code typed into a code cell can be run (executed) by pressing the run button. You can also run the selected code block with the Shift-Enter key combination on your keyboard.

2+3 # run this cell
print("Beware of 小妖精") # anything after '#' is a comment and ignored
12*2.40*3737*12 # you get the idea
2**1000 - 2 # a big number
import math as m # import a math module
m.pi
20*m.atan(1/7)+8*m.atan(3/79) # Euler's formula for π (compare with m.pi above)

1.2.2. Plotting with Matplotlib#

Let’s import the package Matplotlib, which we will be using a lot for data visualisation.

# Imports
import matplotlib.pyplot as plt  # Plotting
import numpy as np  # Numerical operations
%matplotlib inline

x = np.arange(0, 10, 0.001) # x = 0 to 10 in steps of 0.001
y = np.sin(x*x) # define your function
plt.figure(figsize=(5, 3)) # create a new figure (5x3 inches)
plt.plot(,y) # plot x against y
Code hint: You need to plot x vs y. Fix the plot command to (x, y).
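
Once the plot command is fixed, it is good practice to label the axes. Here is a minimal sketch using the same x and y as above (the label and title text are just suggestions):

plt.figure(figsize=(5, 3))
plt.plot(x, y)                  # plot x against y
plt.xlabel('x')                 # label the horizontal axis
plt.ylabel('sin(x$^2$)')        # label the vertical axis
plt.title('A quick Matplotlib example')
plt.show()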

1.2.3. Using a DataFrame#

A DataFrame organises data into a 2-dimensional table of rows and columns, much like a spreadsheet. It is a useful tool to store, access, and modify large sets of data.

In this module, we’ll make use of Pandas to process input and output data for our machine learning models.

import pandas as pd  # Data manipulation using DataFrames

df = pd.DataFrame() # This instantiates an empty pandas DataFrame

data = {
    "Element" : ['C', 'O', 'Fe', 'Mg', 'Xe'],
    "Atomic Number" : [6, 8, 26, 12, 54],
    "Atomic Mass" : [12, 16, 56, 24, 131]
}

# Let's try loading data into DataFrame df
df = pd.DataFrame(data)

# We can make the 'Element' column the index using the set_index function
df = df.set_index("Element")

# Printing the values in the 'Atomic Number' column
print(df["Atom Number"])
Code hint: Check you are printing the correct column name. Try out some of the other options.
# Add a new column
df["Energy (eV)"] = [5.47, 5.14, 0.12, 4.34, 7.01]

print(df["Energy (eV)"])
# Print a row from the DataFrame

# Use df.loc[index] to print the entry "C"
print(df.loc[''])

print('-----')

# Use df.iloc[index] to print the first entry (counting starts at 0)
print(df.iloc[0])
Code hint: You need to tell `df.loc` what to look for. Put an element name in between the quotes.
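
A DataFrame can also be filtered or sorted. The short sketch below uses the same df and the column names defined above:

# Select rows where a condition is met (boolean filtering)
print(df[df["Atomic Number"] > 10])

# Sort the table by a column, heaviest element first
print(df.sort_values("Atomic Mass", ascending=False))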

1.2.4. Write an equation#

The equation below is written in LaTeX format. It’s easy to learn and useful for complex expressions, e.g. \frac{x}{y} writes x/y as a fraction \(\dfrac{x}{y}\).

$-\frac{\hslash^2}{2m} \, \frac{\partial^2 \psi}{\partial x^2}$

renders as

\(-\dfrac{\hslash^2}{2m} \, \dfrac{\partial^2 \psi}{\partial x^2}\)
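
As a further worked example, the Arrhenius relation used in the next section can be written as

$D_{ion} = D_0 \exp\left(-\frac{\Delta E_a}{k_B T}\right)$

which renders as

\(D_{ion} = D_0 \exp\left(-\dfrac{\Delta E_a}{k_B T}\right)\)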

1.3. Computational science#

1.3.1. Thermally-activated diffusion#

Ion transport in crystals is a fundamental process that underpins various technological applications, from batteries to semiconductor devices. Understanding the kinetics of ion movement within and between materials is crucial for optimising device performance.

Like many chemical processes, solid-state diffusion is thermally activated. We can describe ion motion in a crystal using a familiar Arrhenius relationship.

The diffusion coefficient of a species is given by \(D_{ion} = D_0 \cdot e^{-(\frac{\Delta E_a}{k_BT})}\), where:

  • \(D_{ion}\) is the diffusion coefficient for a particular ion,

  • \(D_0\) is the temperature-independent prefactor (containing an attempt frequency),

  • \(\Delta E_a\) is the activation energy for diffusion,

  • \(k_B\) is the Boltzmann constant, and

  • \(T\) is the temperature.

Let’s write a function for it, which will take advantage of the wonderful NumPy package. It also uses the physical constants in SciPy and includes a docstring explaining what the function does.

import numpy as np
from scipy.constants import physical_constants

# Define constants
k_B = physical_constants['Boltzmann constant in eV/K'][0]

# Arrhenius function
def arrhenius(activation_energy, temperature, D0=1):
    """
    Calculates the diffusion coefficient using the Arrhenius equation.
    
    Parameters:
    activation_energy (float): the activation energy in eV.
    temperature (float): the temperature in K (must be > 0).
    D0 (float): the pre-exponential factor (default is 1).
    
    Returns:
    float: the diffusion coefficient (same units as D0).
    """
    if np.any(temperature <= 0):
        raise ValueError("Temperature must be greater than 0 K")
    return D0 * np.exp(-activation_energy / (k_B * temperature))

This function takes activation_energy (eV) and temperature (K) as inputs and returns the corresponding diffusion coefficient. Recall that the units of the exponential term cancel out, so \(D_{ion}\) takes the same units as \(D_0\). Now let’s use the function:

 # Call the function for Ea = 0.12 eV; T = 1000 K
arrhenius(0.12, 1000) 

This value is the fraction of attempts that have enough thermal energy to overcome the activation barrier for ionic diffusion. Decrease the temperature to 100 K and see the difference.
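
For example, calling the same function at the two temperatures (a quick check; the values in the comments are approximate):

print(arrhenius(0.12, 1000))  # ~0.25: about a quarter of attempts succeed
print(arrhenius(0.12, 100))   # ~1e-6: diffusion is effectively frozen out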

Now let’s take advantage of the function to make a plot. We will use the numpy function linspace, which is documented over here. It is used here to generate 100 numbers evenly spaced between 100 and 5000 that represent the temperature range of our “experiments”.

import matplotlib.pyplot as plt

# Pre-exponential term in cm^2/s
D0 = 0.5

# Range of activation energies in eV
activation_energies = np.linspace(0.1, 1, 0) # Range from 0.1 to 1 eV in n steps

# Temperature range in K
T = np.linspace(100, 5000, 100)

# Calculate rates and plot curves
plt.figure(figsize=(5, 3)) 

for activation_energy in activation_energies:
    rates = arrhenius(activation_energy, T, D0)
    plt.plot(T, rates, label=f'{activation_energy:.1f} eV')

plt.xlabel('Temperature (K)')
plt.ylabel('$D_{ion}$ (cm$^2$/s)') 
plt.title('Varying activation energy')
plt.legend()
plt.grid(True)
plt.show()
Code hint: `np.linspace` requires three arguments (start, stop, number of points). 0 points won't work. Try changing it to 5.

To better visualise the trends, we can make an Arrhenius plot by plotting the natural logarithm of \(D\) versus the inverse temperature, 1/T. We use 1000/T to give a nicer range on the \(x\)-axis.

# Plotting ln(D) vs 1000/T
plt.figure(figsize=(5, 3)) 

for activation_energy in activation_energies:
    rates = arrhenius(activation_energy, T, D0)
    plt.plot(1000/T, np.log(rates), label=f'{activation_energy:.1f} eV')

plt.xlabel('1000 / Temperature (1/K)')
plt.ylabel('ln($D_{ion}$)')
plt.title('Arrhenius plot')
plt.legend()
plt.grid(True)
plt.show()
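
As a preview of data fitting, the slope of an Arrhenius plot can be used to recover the activation energy. The sketch below is not part of the worked example; it fits a single line with np.polyfit, assuming the arrhenius function, T, D0, and k_B defined above:

# Recover the activation energy from the slope of ln(D) vs 1/T
Ea_true = 0.5                                      # hypothetical value in eV
D = arrhenius(Ea_true, T, D0)                      # diffusion coefficients over the T range
slope, intercept = np.polyfit(1/T, np.log(D), 1)   # linear fit of ln(D) against 1/T
print(f"Fitted activation energy: {-slope * k_B:.2f} eV")  # slope = -Ea/k_B, so this should return ~0.50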

The last technique to pick up in this class is data fitting. Later in the module, we will use more complex functions in high dimensions, but let’s start with linear regression. There is no need to code this by hand as we can use a function in the machine learning package scikit-learn. The real power of Python is the quality and quantity of available libraries such as this one.

import numpy as np
import pandas as pd

num_points =  # Number of data points to generate

# Generate random x-y data points
x_data = np.random.uniform(0, 10, num_points)  # Adjust the range as needed
y_data = np.random.uniform(0, 10, num_points)

# Create a DataFrame
data = {'X': x_data, 'Y': y_data}
df = pd.DataFrame(data)

# Print the DataFrame
print(df)
Code hint: Again you need to choose the number of points. 50 should be fine, but you have the power to decide.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Perform linear regression
X = df['X'].values.reshape(-1, 1)  # Reshape X for compatibility with sklearn
y = df['Y'].values
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Calculate error bars
residuals = y - y_pred
error_bars = np.abs(residuals)

# Plot the linear regression line
plt.figure(figsize=(5, 3)) 
plt.errorbar(df['X'], df['Y'], yerr=error_bars, fmt='o', color='skyblue', label='Prediction errors')
plt.scatter(df['X'], df['Y'])
plt.plot(df['X'], y_pred, color='red', label='Regression line')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Linear regression')
plt.legend()
plt.show()

There are a number of useful analysis tools built into sklearn, which we can use to probe the model properties.

# Print the model parameters and performance
try:
    print(f'Slope: {model2.coef_[0]:.2f}')  # Assuming model.coef_ might be an array for multidimensional X
    print(f'Intercept: {model2.intercept_:.2f}')
    print(f'R^2 Score: {r2_score(y, y_pred):.3f}')  # R^2 - coefficient of determination
    print(f'RMSE: {np.sqrt(mean_squared_error(y, y_pred)):.3f}')  # Root Mean Squared Error
except Exception as e:
    print("Error in calculating model parameters or performance metrics:", e)
Code hint: Your model is not called `model2`. Try changing the name.
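
A fitted model can also make predictions for inputs it has not seen. A minimal sketch (the new x values below are arbitrary):

# Predict y for new x values with the fitted model
x_new = np.array([[2.5], [7.5]])  # sklearn expects shape (n_samples, n_features)
print(model.predict(x_new))       # points on the fitted regression line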

1.4. 🚨 Exercise 1#

💡 Coding exercises: The exercises are designed to apply what you have learned with room for creativity. It is fine to discuss solutions with your classmates, but the actual code should not be directly copied.

1.4.1. Your details#

import numpy as np

# Insert your values
Name = "No Name" # Replace with your name
CID = 123446 # Replace with your College ID (as a numeric value with no leading 0s)

# Set a random seed using the CID value
CID = int(CID)
np.random.seed(CID)

# Print the message
print("This is the work of " + Name + " [CID: " + str(CID) + "]")

1.4.2. Problem#

Due to their importance in the electronics industry, the diffusion of atoms in semiconductors has been well studied for decades. Below is a set of data for impurity diffusion in crystalline Si [Source: Casey and Pearson (1975)]. It has been arranged into a DataFrame for your convenience.

import pandas as pd

data = {
    'Impurity': ['B', 'Al', 'Ga', 'In', 'P', 'As', 'Sb', 'Bi'],
    'Mass': [10.81, 26.98, 69.72, 114.82, 30.97, 74.92, 121.76, 208.98], # atomic mass in g/mol
    'D0': [5.1, 8.0, 3.6, 16.5, 10.5, 60.0, 12.9, 1.03E3],  # cm^2/s
    'Eact': [3.70, 3.47, 3.51, 3.91, 3.69, 4.20, 3.98, 4.63]  # eV
}

df = pd.DataFrame(data)
print(df)

Two tasks will be given in class.

#Empty block for your answers
#Empty block for your answers
📓 Submission: When your notebook is complete in Google Colab, go to File > Download and choose .ipynb. The completed file should be uploaded to Blackboard under assignments for MATE70026.

1.5. 🌊 Dive deeper#