Practical 3: Supervised learning workflow

This week will introduce the supervised learning framework and key metrics for evaluating supervised learning models using the London Fire Brigade dataset.

Learning Outcomes

  • You have familiarised yourself with the key concepts of the supervised machine learning workflow, including the train-test split, cross-validation, and hyperparameter tuning.
  • You are able to explain the differences between these workflows, including their pros and cons.

Starting the Practical

The process for every week will be the same: download the notebook to your DSSS folder (or wherever you keep your course materials), switch over to JupyterLab (which will be running in Podman/Docker) and get to work.

If you want to save the completed notebook to your GitHub repo, add, commit, and push the notebook in Git after you download it. When you’re done for the day, save your changes to the file (this is very important!), then add, commit, and push your work so the completed notebook is backed up.

Note

Suggestions for a Better Learning Experience:

  • Set your operating system and software language to English: this will make it easier to follow tutorials, search for solutions online, and understand error messages.

  • Save all files to a cloud storage service: use platforms like Google Drive, OneDrive, Dropbox, or Git to ensure your work is backed up and can be restored easily if your laptop is lost, stolen, or damaged.

  • Avoid whitespace in file names and in column names in datasets.
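If a dataset does arrive with whitespace in its column names, one quick fix in pandas is to replace the spaces with underscores. The sketch below uses a small hypothetical DataFrame, not the LFB data:

```python
import pandas as pd

# hypothetical DataFrame with whitespace in its column names
df = pd.DataFrame({"Incident Count": [3, 5],
                   "Date Of Call": ["2023-01-01", "2023-01-02"]})

# replace spaces with underscores so columns are easy to reference in code
df.columns = df.columns.str.replace(" ", "_")
print(df.columns.tolist())  # ['Incident_Count', 'Date_Of_Call']
```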

Revisiting London Fire Brigade Dataset

This week, we will continue using the London Fire Brigade (LFB) dataset for supervised learning tasks. For the context of the LFB data and the two learning tasks, please refer to the Week 2 practical notebook. Briefly, we formulated two supervised learning tasks using the LFB dataset and obtained some initial results:

  1. Regression: predicting daily LFB callouts in Greater London, using weather and temporal features.
  2. Classification: predicting whether a fire incident is a false alarm given the information available at the time of the callout, which includes the time of day, day of week, and building type (dwelling or commercial).

Predicting daily LFB callouts

Recall that to predict daily LFB callouts, we used a random forest model and a random train-test split:

# import data from https://raw.githubusercontent.com/huanfachen/DSSS_2025/refs/heads/main/data/LFB_2023_daily_data.csv
import pandas as pd
df_lfb_daily = pd.read_csv("https://raw.githubusercontent.com/huanfachen/DSSS_2025/refs/heads/main/data/LFB_2023_daily_data.csv")

# using Random Forest to predict IncidentCount using weather, weekday, weekend, and bank holiday info
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

# prepare data for modeling
feature_cols = ['TX', 'TN', 'TG', 'SS', 'SD', 'RR', 'QQ', 'PP', 'HU', 'CC', 'IsWeekend', 'IsBankHoliday', 'weekday']
X = df_lfb_daily[feature_cols]
y = df_lfb_daily['IncidentCount']

# one-hot encode the 'weekday' column
X = pd.get_dummies(X, columns=['weekday'], drop_first=True)

# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train Random Forest model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# evaluate model performance on training and testing sets
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# compute R-squared on training and testing data
r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)

print(f'Train R-squared: {r2_train:.3f}')
print(f'Test R-squared: {r2_test:.3f}')
Train R-squared: 0.918
Test R-squared: 0.180

The model is clearly overfitting the training data: the \(R^2\) is around 0.92 on the training data but only 0.18 on the testing data. Such a model is of little use in practice, as it doesn’t generalise well to unseen data.

To mitigate this issue, we can use cross-validation to tune the hyperparameters of the random forest model and reduce overfitting. The hyperparameters to tune include the following (see link for details):

  • max_depth: the maximum depth of each tree (default None, meaning nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples)
  • min_samples_leaf: the minimum number of samples required at a leaf node (default 1)
  • max_features: the number of features to consider when looking for the best split (default 1.0, meaning all features are considered)

We haven’t introduced the random forest algorithm yet; that is the content of Week 4. For now, please take it on trust that these hyperparameters control the complexity of the random forest model, and that tuning them can help reduce overfitting.
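As a sketch of what this tuning might look like, the snippet below runs GridSearchCV over a small grid of the three hyperparameters. The grid values and cv=5 are illustrative assumptions, not recommended settings:

```python
# Sketch: tuning the random forest with GridSearchCV.
# The grid values below are illustrative, not tuned recommendations.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

url = ("https://raw.githubusercontent.com/huanfachen/DSSS_2025/"
       "refs/heads/main/data/LFB_2023_daily_data.csv")
df_lfb_daily = pd.read_csv(url)

feature_cols = ['TX', 'TN', 'TG', 'SS', 'SD', 'RR', 'QQ', 'PP', 'HU', 'CC',
                'IsWeekend', 'IsBankHoliday', 'weekday']
X = pd.get_dummies(df_lfb_daily[feature_cols], columns=['weekday'], drop_first=True)
y = df_lfb_daily['IncidentCount']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# a small illustrative grid over the three hyperparameters described above
param_grid = {
    'max_depth': [5, None],
    'min_samples_leaf': [1, 5],
    'max_features': [0.5, 1.0],
}
grid = GridSearchCV(RandomForestRegressor(random_state=42),
                    param_grid, cv=5, scoring='r2', n_jobs=-1)
grid.fit(X_train, y_train)

print("Best hyperparameters:", grid.best_params_)
print(f"Best CV R-squared: {grid.best_score_:.3f}")
```

By default GridSearchCV refits the best model on the full training set, so `grid.best_estimator_` can then be evaluated on the held-out test set.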

A real prediction task

We now have a trained model that can predict daily LFB callouts. However, this model was not trained only on past data: we used a random train-test split and random cross-validation folds, so the model may have seen "future" data during training. This can overestimate the model's performance.

In a real prediction task, we should use a temporal train-test split, where the training data come from the past and the testing data from the future. In the next part, we will use the first 80% of the data (sorted by date) for training and the remaining 20% for testing.

Please note that the analysis below does not achieve high predictive accuracy. Rather, it serves as an example of how to implement a temporal train-test split and temporal cross-validation in practice.

Temporal train-test split

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# load and sort by date to respect time order
df_lfb_daily = pd.read_csv("https://raw.githubusercontent.com/huanfachen/DSSS_2025/refs/heads/main/data/LFB_2023_daily_data.csv")
df_lfb_daily['DateOfCall'] = pd.to_datetime(df_lfb_daily['DateOfCall'])

# sort by date
df_lfb_daily = df_lfb_daily.sort_values('DateOfCall')

feature_cols = ['TX', 'TN', 'TG', 'SS', 'SD', 'RR', 'QQ', 'PP', 'HU', 'CC', 'IsWeekend', 'IsBankHoliday', 'weekday']
X = df_lfb_daily[feature_cols]
y = df_lfb_daily['IncidentCount']
X = pd.get_dummies(X, columns=['weekday'], drop_first=True)

# temporal split: first 80% dates for training, remaining 20% for testing
split_idx = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

train_r2 = r2_score(y_train, rf.predict(X_train))
test_r2 = r2_score(y_test, rf.predict(X_test))

print(f"Temporal split — Train R-squared: {train_r2:.3f}")
print(f"Temporal split — Test R-squared: {test_r2:.3f}")
Temporal split — Train R-squared: 0.925
Temporal split — Test R-squared: -2.673

Not surprisingly, the model overfits the training data and does poorly on the testing data, with \(R^2\) of 0.925 and -2.673 on the training and testing data, respectively. This is partly because we included only one year of data, so the training and testing data come from different seasons; with multiple years of data, the model would likely perform much better on the testing data.
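For the temporal cross-validation mentioned earlier, scikit-learn provides TimeSeriesSplit, which always trains on earlier observations and validates on later ones. A minimal sketch on the date-sorted data (n_splits=5 is an arbitrary choice) might look like this:

```python
# Sketch: temporal cross-validation with TimeSeriesSplit on date-sorted data.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import TimeSeriesSplit

url = ("https://raw.githubusercontent.com/huanfachen/DSSS_2025/"
       "refs/heads/main/data/LFB_2023_daily_data.csv")
df_lfb_daily = pd.read_csv(url)
df_lfb_daily['DateOfCall'] = pd.to_datetime(df_lfb_daily['DateOfCall'])
df_lfb_daily = df_lfb_daily.sort_values('DateOfCall')

feature_cols = ['TX', 'TN', 'TG', 'SS', 'SD', 'RR', 'QQ', 'PP', 'HU', 'CC',
                'IsWeekend', 'IsBankHoliday', 'weekday']
X = pd.get_dummies(df_lfb_daily[feature_cols], columns=['weekday'], drop_first=True)
y = df_lfb_daily['IncidentCount']

# each fold trains on the past and validates on the immediately following period
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    rf = RandomForestRegressor(random_state=42)
    rf.fit(X.iloc[train_idx], y.iloc[train_idx])
    r2 = r2_score(y.iloc[test_idx], rf.predict(X.iloc[test_idx]))
    print(f"Fold {fold}: train size {len(train_idx)}, test R-squared = {r2:.3f}")
```

Unlike ordinary k-fold cross-validation, every validation fold here lies strictly after its training fold in time, which avoids leaking future information into training.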

Future directions

That’s mostly it for this practical. We have demonstrated how to predict LFB daily callouts with cross-validation and hyperparameter tuning via grid search, under both random and temporal data splits. A similar workflow can be applied to the classification task of predicting false alarms in fire incidents, and we will demonstrate this in later practicals.

References and recommendations:

  1. There is not much (geospatial) machine learning research on London Fire Brigade datasets in academia. The blog by GTH Consulting provides some interesting articles on fire service data analysis in the UK, which have attracted many comments on LinkedIn (e.g. this).