Practical 1: describing and representing data

Starting with the ‘Basics’

This week is focussed on ensuring that you’re able to access the teaching materials and to run Jupyter notebooks locally, as well as describing a dataset in Python.

Learning Outcomes

  • You have familiarised yourself with how to access the lecture notes and Python notebook of this module.
  • You have familiarised yourself with running the Python notebooks locally.
  • You have familiarised yourself with describing a dataset in Python.

Starting the Practical

The process for every week will be the same: download the notebook to your QM folder, switch over to JupyterLab (which will be running in Podman/Docker) and get to work. If you want to save the completed notebook to your Github repo, you can add, commit, and push the notebook in Git after you download it. When you’re done for the day, save your changes to the file (This is very important!), then add, commit, and push your work to save the completed notebook.

Note

Suggestions for a Better Learning Experience:

  • Set your operating system and software language to English: this will make it easier to follow tutorials, search for solutions online, and understand error messages.

  • Save all files to a cloud storage service: use platforms like Google Drive, OneDrive, Dropbox, or Git to ensure your work is backed up and can be restored easily when the laptop gets stolen or broken.

  • Avoid whitespace in file names and column names in datasets

Set up the tools

Please follow the Setup page of CASA0013 to install and configure the computing platform, and this page to get started on using the container & JupyterLab.

Download the Notebook

So for this week, visit the Week 1 of QM page, you’ll see that there is a ‘preview’ link and a a ‘download’ link. If you click the preview link you will be taken to the GitHub page for the notebook where it has been ‘rendered’ as a web page, which is not editable. To make the notebook useable on your computer, you need to download the IPYNB file.

So now:

  1. Click on the Download link.
  2. The file should download automatically, but if you see a page of raw code, select File then Save Page As....
  3. Make sure you know where to find the file (e.g. Downloads or Desktop).
  4. Move the file to your Git repository folder (e.g. ~/Documents/CASA/QM/)
  5. Check to see if your browser has added .txt to the file name:
    • If no, then you can move to adding the file.
    • If yes, then you can either fix the name in the Finder/Windows Explore, or you can do this in the Terminal using mv <name_of_practical>.ipynb.txt <name_of_practical>.ipynb (you can even do this in JupyterLab’s terminal if it’s already running).

Running notebooks on JupyterLab

I am assuming that most of you are already running JupyterLab via Podman using the command.

If you are a bit confused with container, JupyterLab, terminal, or Git, please feel free to ask any questions.

Loading data

We are going to describe the population of local authorities in the UK.

The data is sourced from Office for National Statistics and is donwloadable here.

We have saved a copy of this dataset to the Github repo, in case that the dataset is removed from the website.

import pandas as pd

# Read CSV file, skipping first 5 rows, using row 6 as header, and handling comma as thousands separator
df_pop = pd.read_csv(
    'https://raw.githubusercontent.com/huanfachen/QM/refs/heads/main/sessions/L1_data/UK_census_population.csv',
    skiprows=5,        # Skip first 5 rows. Wnhy?
    thousands=',',     # Interpret commas as thousands separators
    header=0           # After skipping, the first row becomes the header
)

print(df_pop.head())
   Area code          Area name Area type  Population 2011  Population 2021  \
0  K04000001  England and Wales  National       56075912.0       59597542.0   
1  E92000001            England   Country       53012456.0       56490048.0   
2  W92000004              Wales   Country        3063456.0        3107494.0   
3  E12000001         North East    Region        2596886.0        2647013.0   
4  E12000002         North West    Region        7052177.0        7417397.0   

   Percentage change  
0                6.3  
1                6.6  
2                1.4  
3                1.9  
4                5.2  

You might wonder why skipping the first 5 rows and setting thousands=‘,’. I knew about this after opening this csv file in a text editor and lots of trial-and-errors.

Then, we check the first few rows of this dataset using df_pop.head().

Describing the dataframe

Which columns are included?

list(df_pop.columns)
['Area code',
 'Area name',
 'Area type',
 'Population 2011',
 'Population 2021',
 'Percentage change']

It is a pain to deal with whitespaces in a column, so good practice is to replace the whitespaces (eg tabs, multiple spaces) within column names with underscore.

df_pop.columns = df_pop.columns.str.replace(r'\s+', '_', regex=True)
print(list(df_pop.columns)) # check again
['Area_code', 'Area_name', 'Area_type', 'Population_2011', 'Population_2021', 'Percentage_change']

How many rows & cols are included?

rows, cols = df_pop.shape
print(f"Rows: {rows}, Columns: {cols}")
Rows: 369, Columns: 6

Geography matters

This dataset contains multiple geographies of UK and different geographies are incomparable. We can check the Area_type column:

print(df_pop.Area_type.value_counts())
Area_type
Local Authority    355
Region               9
Country              2
National             1
Name: count, dtype: int64

So there are 355 records of Local Authority, 9 records of Region, 2 of Country, and 1 of ‘National’. For an introduction to these terms, see this article on ONS.

We will focus on the local authorities, so we apply a filter:

df_pop_la = df_pop[df_pop['Area_type'] == 'Local Authority']

Overview of the columns

There are two pandas functions that give overview of a dataframe. - info(): shows column data types, non‑null counts, and memory usage. - describe(): shows summary statistics for numeric data (count, mean, std, min, quartiles, max) - describe(include='all'): for both numeric data and non‑numeric data (count, unique, top value, frequency).

print(df_pop_la.info())
print(df_pop_la.describe())
<class 'pandas.core.frame.DataFrame'>
Index: 355 entries, 12 to 366
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Area_code          355 non-null    object 
 1   Area_name          355 non-null    object 
 2   Area_type          355 non-null    object 
 3   Population_2011    355 non-null    float64
 4   Population_2021    355 non-null    float64
 5   Percentage_change  355 non-null    float64
dtypes: float64(3), object(3)
memory usage: 19.4+ KB
None
       Population_2011  Population_2021  Percentage_change
count     3.550000e+02     3.550000e+02         355.000000
mean      2.132867e+05     2.268876e+05           6.070423
std       2.099628e+05     2.245442e+05           4.608338
min       2.203000e+03     2.054000e+03          -9.600000
25%       1.000530e+05     1.055705e+05           2.950000
50%       1.382650e+05     1.477760e+05           5.800000
75%       2.487865e+05     2.628895e+05           9.000000
max       1.463740e+06     1.576069e+06          22.100000

Describing census 2021 population

Now, we focus on describing the local authority population from census 2021. The first question is, what data type is this variable - nominal, ordinal, interval, or ratio

Note

The data type of a variable is different from how it’s stored in a computer. For example, the Area_type variable can be encoded for convenience as 0 (“national”), 1 (“country”), and 2 (“local authority”). Although these are stored as numbers, Area_type is not truly numeric data — it’s an nominal variable.

Does it make sense to say ‘the population of LA AAA is twice of LA BBB’? Yes. So, this variable is of ratio type.

max and min

What is the maximum population size in census 2021?

print("Max population: ", df_pop_la['Population_2021'].max(skipna=True))

Which LAs have the maximum population size? The code below is a bit complicated.

print("{} have the maximum population of {}".format(
    ", ".join(df_pop_la.loc[df_pop_la['Population_2021'] == df_pop_la['Population_2021'].max(skipna=True), 'Area_name']), 
    df_pop_la['Population_2021'].max(skipna=True))
    )
Kent have the maximum population of 1576069.0

What it does:

  • Finds the max population while ignoring NaNs. It is always safe to use skipna=True, even though there is no NA values.
  • Selects all rows with that population.
  • Joins their Area_name values into a comma-separated string.

Two new Python functions here:

  • format(): Inserts variables into a string by replacing {} placeholders in order with provided arguments.

  • join(): Combines the elements of an iterable into one string using the given separator before .join().

Which LAs have the minimum population?

print("{} have the minimum population of {}".format(
    ", ".join(df_pop_la.loc[df_pop_la['Population_2021'] == df_pop_la['Population_2021'].min(skipna=True), 'Area_name']), 
    df_pop_la['Population_2021'].min(skipna=True))
    )
Isles of Scilly have the minimum population of 2054.0

Standard deviation

The result from df_pop_la.describe() indicates that the standard deviation of Population_2021 is 2.245442e+05.

Another way to calculate this standard deviation and to reformat it is:

std_dev = df_pop_la['Population_2021'].std()
# plain notation
print("The standard deviation of Population_2021 is: {}".format(std_dev)) 
# scientific notation
print("Using scientific notation: {:.3e}".format(std_dev)) 
# thousands separator notation + 2 decimal places
print("Using thousands separator notation: {:,.2f}".format(std_dev))
The standard deviation of Population_2021 is: 224544.20636612535
Using scientific notation: 2.245e+05
Using thousands separator notation: 224,544.21

There are several ways to represent numbers, and which one you choose depends on the situation.

Equally important is to ensure the numbers are meaningful, or to use proper significant figures. For example, reporting a population’s standard deviation with 10 decimal places does not make sense.

Null value and outliers?

Are there Null values or outliers in this variable? From results of info(), there are no NA values.

To detect outliers, we will implement the Tukey Fences method using pandas function, as pandas does not provide a built-in function for this method.

# Calculate Q1, Q3, and IQR
Q1 = df_pop_la['Population_2021'].quantile(0.25)
Q3 = df_pop_la['Population_2021'].quantile(0.75)
IQR = Q3 - Q1

# Tukey's fences
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detect outliers
outliers = df_pop_la[
    (df_pop_la['Population_2021'] < lower_bound) |
    (df_pop_la['Population_2021'] > upper_bound)
]

print("Lower bound:", lower_bound)
print("Upper bound:", upper_bound)
print("How many outliers?", outliers.shape[0])
print("Outliers:\n", outliers)
Lower bound: -130408.0
Upper bound: 498868.0
How many outliers? 33
Outliers:
      Area_code        Area_name        Area_type  Population_2011  \
56   E06000047    County Durham  Local Authority         513242.0   
60   E06000052         Cornwall  Local Authority         532273.0   
62   E06000054        Wiltshire  Local Authority         470981.0   
68   E06000060  Buckinghamshire  Local Authority         505283.0   
254  E08000003       Manchester  Local Authority         503127.0   
270  E08000019        Sheffield  Local Authority         552698.0   
275  E08000025       Birmingham  Local Authority        1073045.0   
282  E08000032         Bradford  Local Authority         522452.0   
285  E08000035            Leeds  Local Authority         751485.0   
321  E10000003   Cambridgeshire  Local Authority         621210.0   
322  E10000006          Cumbria  Local Authority         499858.0   
323  E10000007       Derbyshire  Local Authority         769686.0   
324  E10000008            Devon  Local Authority         746399.0   
325  E10000011      East Sussex  Local Authority         526671.0   
326  E10000012            Essex  Local Authority        1393587.0   
327  E10000013  Gloucestershire  Local Authority         596984.0   
328  E10000014        Hampshire  Local Authority        1317788.0   
329  E10000015    Hertfordshire  Local Authority        1116062.0   
330  E10000016             Kent  Local Authority        1463740.0   
331  E10000017       Lancashire  Local Authority        1171339.0   
332  E10000018   Leicestershire  Local Authority         650489.0   
333  E10000019     Lincolnshire  Local Authority         713653.0   
334  E10000020          Norfolk  Local Authority         857888.0   
335  E10000023  North Yorkshire  Local Authority         598376.0   
336  E10000024  Nottinghamshire  Local Authority         785802.0   
337  E10000025      Oxfordshire  Local Authority         653798.0   
338  E10000027         Somerset  Local Authority         529972.0   
339  E10000028    Staffordshire  Local Authority         848489.0   
340  E10000029          Suffolk  Local Authority         728163.0   
341  E10000030           Surrey  Local Authority        1132390.0   
342  E10000031     Warwickshire  Local Authority         545474.0   
343  E10000032      West Sussex  Local Authority         806892.0   
344  E10000034   Worcestershire  Local Authority         566169.0   

     Population_2021  Percentage_change  
56          522068.0                1.7  
60          570305.0                7.1  
62          510330.0                8.4  
68          553078.0                9.5  
254         551938.0                9.7  
270         556521.0                0.7  
275        1144919.0                6.7  
282         546412.0                4.6  
285         811953.0                8.0  
321         678849.0                9.3  
322         499846.0                0.0  
323         794636.0                3.2  
324         811640.0                8.7  
325         545847.0                3.6  
326        1503521.0                7.9  
327         645076.0                8.1  
328        1400899.0                6.3  
329        1198798.0                7.4  
330        1576069.0                7.7  
331        1235354.0                5.5  
332         712366.0                9.5  
333         768364.0                7.7  
334         916120.0                6.8  
335         615491.0                2.9  
336         824822.0                5.0  
337         725291.0               10.9  
338         571547.0                7.8  
339         876104.0                3.3  
340         760688.0                4.5  
341        1203108.0                6.2  
342         596773.0                9.4  
343         882676.0                9.4  
344         603676.0                6.6  

There are 33 outliers in this dataset. Think about the three types of outliers that we discussed. Which type do these 33 outliers beloong to?

  • Error Outlier
  • Irregular Pattern Outlier
  • Influential Outlier

Boxplot

To create a boxplot of this variable:

import matplotlib.pyplot as plt

# Create boxplot
df_pop_la['Population_2021'].plot(kind='box', title='LA Population 2021 Boxplot')

plt.ylabel('Population')
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()

What do you observe from this boxplot? There are lots of values above the

Exploring Percentage_change

Now, we turn to explore the variable Percentage_change, which represents the relative change from the 2011 census to 2021 census.

Try completing the code below on your own. Practice makes perfect!

Which LAs experienced the largest population percentage change? To what extent?

print("{} have the largest population percentage change of {}%".format(
    ", ".join(df_pop_la.loc[??['??'] == ??['??'].max(skipna=True), 'Area_name']), 
    df_pop_la['Population_2021'].??(skipna=True))
    )
print("{} have the largest population percentage change of {}%".format(
    ", ".join(df_pop_la.loc[df_pop_la['Percentage_change'] == df_pop_la['Percentage_change'].max(skipna=True), 'Area_name']), 
    df_pop_la['Percentage_change'].max(skipna=True))
    )
Tower Hamlets have the largest population percentage change of 22.1%

Which LAs experienced the smallest population percentage change? To what extent?

print("{} have the smallest population percentage change of {}%".format(
    ", ".??(df_pop_la.loc[df_pop_la[??] == ??['Percentage_change'].??(skipna=True), 'Area_name']), 
    df_pop_la['Percentage_change'].??(skipna=True))
    )
print("{} have the smallest population percentage change of {}%".format(
    ", ".join(df_pop_la.loc[df_pop_la['Percentage_change'] == df_pop_la['Percentage_change'].min(skipna=True), 'Area_name']), 
    df_pop_la['Percentage_change'].min(skipna=True))
    )
Kensington and Chelsea have the smallest population percentage change of -9.6%

Making a boxplot of Percentage_change

import matplotlib.pyplot as plt

# Create boxplot
df_pop_la[??].plot(kind=??, title='LA Population Percentage Change Boxplot')

plt.??('Percentage change')
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()
import matplotlib.pyplot as plt

# Create boxplot
df_pop_la['Percentage_change'].plot(kind='box', title='LA Population Percentage Change Boxplot')

plt.ylabel('Percentage change')
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()

You’re Done!

Congratulations on completing the first QM practical session! If you are still working on it, take you time.

Don’t worry about understanding every detail of the Python code — what matters most is knowing which functions to use for a specific task, like checking minimum and maximum values or generating boxplots, and knowing how to debug when it goes wrong. Remember, practice makes perfect.