Supervised learning

Framework & evaluation metrics

Huanfa Chen - huanfa.chen@ucl.ac.uk

13 December 2025

Supervised learning

Framework

\[ \begin{aligned} (x_i, y_i) &\sim p(x, y) \text{ i.i.d.} \\ x_i &\in \mathbb{R}^n \\ y_i &\in \begin{cases} \mathbb{R}, & \text{(regression)} \\ \mathcal{Y} \text{ (finite set)}, & \text{(classification)} \end{cases} \\ \text{learn } f(x_i) &\approx y_i \\ \text{such that } f(x) &\approx y \end{aligned} \]

Regression vs. classification

|                    | Regression                               | Classification                                                               |
|--------------------|------------------------------------------|------------------------------------------------------------------------------|
| Target variable    | Continuous value \(y\)                   | Discrete label \(y\) (finite set)                                            |
| Task               | Predict “how much” / “how many”          | Predict “which class” / “which category”                                     |
| Intuition          | Find a ‘line’ close to all points        | Find a ‘boundary’ between classes                                            |
| Making predictions | Directly predict the target variable     | Predict class probabilities and assign the class with the highest probability |
| Example            | Predicting house prices from features    | Predicting spam vs. not spam for an email                                    |

Example - linear regression

from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import pandas as pd

data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
MedInc: 0.45
HouseAge: 0.01
AveRooms: -0.12
AveBedrms: 0.78
Population: -0.00
AveOccup: -0.00
Latitude: -0.42
Longitude: -0.43
Intercept: -37.02

Terms

| Term           | Definition                                                         | Example                                          |
|----------------|--------------------------------------------------------------------|--------------------------------------------------|
| Algorithm      | procedure that runs on data to create a model                      | linear regression                                |
| Model          | the output of an algorithm applied to data                         | \(\hat{y}_i = \sum_k \beta_k x_{ik} + \beta_0\)  |
| Metric         | measure used to evaluate model performance                         | squared error                                    |
| Hyperparameter | algorithm settings, predefined by the user rather than learned from data | none                                       |
| Parameter      | model components learned from data                                 | coefficients & intercept                         |
| Model training | process of estimating parameters from data                         | maximum likelihood estimation                    |

Common challenges of regression/classification

  1. To select evaluation metrics (so that the model solves the right problem)
  2. To design workflow (so that the model generalises well and avoids overfitting)

Metrics for regression

| Metric | Formula | Unit | Notes |
|--------|---------|------|-------|
| RMSE | \(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\) | Same unit as target \(y\) | Penalises large errors more; sensitive to outliers |
| R² | \(1 - \dfrac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}\) | Dimensionless (between 0 and 1 in most cases) | Proportion of variance explained by the model; can be negative if the model is worse than \(y_i=\bar{y}\) |
| MAE | \(\frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert\) | Same unit as target \(y\) | More robust to outliers than RMSE; interpretable as the average absolute error |
| MAPE | \(\frac{100}{n}\sum_{i=1}^{n}\left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert\) | Percent (%) | Relative error; sensitive to very small \(y_i\) |
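
All four metrics are available in sklearn.metrics; a minimal sketch on toy numbers (not from the lecture dataset). Note that sklearn's `mean_absolute_percentage_error` returns a fraction, not a percentage:

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))      # same unit as y
mae = mean_absolute_error(y_true, y_pred)               # average absolute error
mape = mean_absolute_percentage_error(y_true, y_pred)   # a fraction, not percent
r2 = r2_score(y_true, y_pred)                           # proportion of variance explained
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  MAPE={mape:.1%}  R2={r2:.3f}")
```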

Metrics for regression

  • Use RMSE or squared error in most cases
  • By default, regressors in sklearn and XGBoost use squared error as the loss function
  • Generally, R2 for regression can be negative. A negative R2 means the model is worse than simply predicting the mean of y.
  • A special case is that when using OLS regression with an intercept term and evaluating on the training data, R2 will always be between 0 and 1.
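
A quick illustration of a negative R² on toy numbers (not from the lecture dataset): a constant prediction far from the data does much worse than simply predicting the mean.

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]
y_pred = [5.0, 5.0, 5.0]  # far worse than predicting the mean (2.0)

# SS_res = 16 + 9 + 4 = 29; SS_tot = 1 + 0 + 1 = 2; R2 = 1 - 29/2 = -13.5
print(r2_score(y_true, y_pred))  # -13.5
```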

Misuse of R2 metric

  • Question: which R2 is correct?
# fit a neural network model using sklearn, following the same workflow as the previous cell
from sklearn.metrics import r2_score
from sklearn.neural_network import MLPRegressor

nn = MLPRegressor(random_state=10, max_iter=1000)
nn.fit(X_train, y_train)
y_pred_nn = nn.predict(X_test)
print(f"R2 (sklearn): {r2_score(y_test, y_pred_nn):.3f}")

lr_nn = LinearRegression()
lr_nn.fit(y_test.reshape(-1, 1), y_pred_nn)
print(f"R2 (linear regression between y_test and y_pred): {lr_nn.score(y_test.reshape(-1, 1), y_pred_nn):.3f}")
R2 (sklearn): -2.616
R2 (linear regression between y_test and y_pred): 0.156

Misuse of R2 metric

  • R2 (sklearn) is correct, as it compares predictions from the NN model against the true values.
  • R2 (linear regression between y_test and y_pred) is incorrect.
    • It fits a new linear regression of the NN outputs on the true values.
    • It then reports the R2 of this auxiliary regression, which only measures how linearly correlated the predictions are with the truth; it ignores systematic bias and scale errors, so it is not the same as evaluating the NN model.

Metrics for classification (binary)

[Figure: confusion matrix with TP, TN, FP, FN. Image credit: COMS4995-s20]

\(Accuracy=\frac{TP+TN}{TP+TN+FP+FN}\)

  • Choice of positive label: often the minority class

Example using Breast Cancer dataset

class malignant: 212, 37.258%
class benign: 357, 62.742%
Predictive accuracy: 0.930
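
The class counts above presumably come from a workflow like the following sketch; the exact accuracy depends on the model and split, which the slide does not specify (the random forest and 75/25 split here are illustrative assumptions). Note that in sklearn's encoding, label 0 is malignant and label 1 is benign.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target

# class distribution (0 = malignant, 1 = benign)
for label, name in enumerate(data.target_names):
    count = np.sum(y == label)
    print(f"class {name}: {count}, {100 * count / len(y):.3f}%")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(f"Predictive accuracy: {rf.score(X_test, y_test):.3f}")
```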

Limitation of accuracy (Accuracy paradox)

  • Scenario: Data with 90% negatives (imbalanced data)
  • A majority strategy that predicts all as negative gets 90% accuracy, but this is useless.
  • Different models can have the same accuracy (0.9) but make very different types of errors.
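
The majority strategy in the scenario above can be reproduced with sklearn's `DummyClassifier` on synthetic imbalanced data (the 90/10 split here mirrors the scenario):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# synthetic imbalanced data: 90% negatives, 10% positives
rng = np.random.default_rng(0)
y = rng.permutation([0] * 900 + [1] * 100)
X = rng.normal(size=(1000, 3))  # features are irrelevant to this baseline

majority = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = majority.predict(X)
print(f"accuracy: {accuracy_score(y, y_pred):.2f}")  # 0.90
print(f"recall:   {recall_score(y, y_pred):.2f}")    # 0.00 -- the model is useless
```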

Precision, Recall, F1-score, AUC

  • Precision (Positive Predictive Value): \(\frac{TP}{TP+FP}\). Among predicted positives, how many are actually positive.
  • Recall (Sensitivity, True Positive Rate): \(\frac{TP}{TP+FN}\). Among actual positives, how many are correctly predicted.
  • F1-score (Harmonic mean of precision & recall): \(F=2\frac{\text{precision}\cdot \text{recall}}{\text{precision}+\text{recall}}\)
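
A minimal sketch of these three metrics on toy labels with TP=3, FP=1, FN=2, TN=4:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4 = 0.75
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/5 = 0.60
f1 = f1_score(y_true, y_pred)                # 2 * 0.75 * 0.6 / (0.75 + 0.6) = 2/3
print(precision, recall, f1)
```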

Another illustration

precision vs recall
Image Credit: wikipedia.org

Example

Addressing accuracy paradox

  • To address this issue with imbalanced data, use precision, recall, or F1 as the metric.

Trade-off between precision and recall

  • Generally, when precision increases, recall decreases, and vice versa.
  • To generate this plot, vary a threshold T:
    1. Predict p(label=1) for each record.
    2. For a given T, predict label=1 if p(label=1) > T; otherwise predict label=0, then compute precision & recall.
    3. Vary T from 0 to 1, computing precision and recall at each threshold. Link the dots to get the curve.
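
These steps are exactly what `sklearn.metrics.precision_recall_curve` performs; a minimal sketch (the dataset, scaler, and classifier are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

p_pos = clf.predict_proba(X_test)[:, 1]  # step 1: p(label=1) for each record
# steps 2-3: precision and recall at every distinct threshold
precision, recall, thresholds = precision_recall_curve(y_test, p_pos)

# plotting recall (x) against precision (y) links the dots into the PR curve
print(f"{len(thresholds)} thresholds evaluated")
```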

Example: trade-off between precision and recall

  • Edge case: when T >= 1, all records are predicted as negative.
  • Then recall = 0, and precision = TP/(TP+FP) = 0/0 is undefined; by convention it is set to 1.

Precision–Recall curve for LR and Random Forest


Usage of Precision-Recall curve

  • To visualise the precision-recall trade-off
  • To select the optimal threshold
    • Need high precision? Choose a point in high-precision, low-recall region.
    • Need high recall? Choose a point in high-recall, low-precision region.
    • Need a balance? Look for the “elbow” of the curve.
  • To compare different models

ROC Curve

  • Receiver Operating Characteristic (ROC) curve plots True Positive Rate=TP/(TP+FN) vs. False Positive Rate=FP/(FP+TN).
  • The identity line y = x represents a random classifier (e.g. tossing a coin).

AUC (Area under ROC Curve)

  • The area under the ROC curve (the integral of the curve)
  • AUC is 0.5 for random predictions (in expectation)
  • The maximum AUC is 1.0, achieved by perfect predictions
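
Both extremes can be checked directly with `roc_auc_score` (the synthetic labels and scores here are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=10_000)

perfect = y.astype(float)            # scores that rank every positive above every negative
random_scores = rng.random(10_000)   # coin-toss scores, independent of y

print(roc_auc_score(y, perfect))         # 1.0
print(roc_auc_score(y, random_scores))   # close to 0.5
```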

Aggregating metrics across classes

  • Macro average: \(\frac{1}{|L|}\sum_{l\in L}R(y_{l},\hat{y}_{l})\)
    Unweighted mean of per-class scores. Each class contributes equally, regardless of sample size.

  • Micro average: \(\frac{1}{n}\sum_{i=1}^{n}R(y_{i},\hat{y}_{i})\)
    Sum individual TP, FP, FN, TN across all classes, then compute metric. Equivalent to accuracy for multiclass.

  • Weighted average: \(\frac{1}{n}\sum_{l\in L}n_{l}R(y_{l},\hat{y}_{l})\)
    Weighted by support (number of samples in each class). Balances class sizes in the final score.
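
The three averaging modes map to the `average` parameter of sklearn's metric functions; a sketch on a toy 3-class problem with unequal class sizes:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]

for avg in ("macro", "micro", "weighted"):
    print(avg, f1_score(y_true, y_pred, average=avg))

# micro-averaged F1 equals accuracy for single-label multiclass problems
print(accuracy_score(y_true, y_pred))  # 0.8
```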

Classification report

from sklearn.metrics import classification_report
print(classification_report(y_test, rf.predict(X_test), target_names=data.target_names))
              precision    recall  f1-score   support

   malignant       0.92      0.92      0.92        53
      benign       0.96      0.96      0.96        90

    accuracy                           0.94       143
   macro avg       0.94      0.94      0.94       143
weighted avg       0.94      0.94      0.94       143

Picking a metric

  • Real-world problems are rarely balanced.
  • Accuracy is rarely what you want.
  • Find the right criterion for the specific task.
  • Decide between emphasis on recall or precision.
  • Identify which classes are important.

Metric for breast cancer detection

  • “1” indicates malignant/cancer, “0” indicates benign/no cancer.
  • Missing a cancer (FN) is much worse than a false alarm (FP)
  • So, we care more about recall than precision or accuracy.
  • A model with high recall is preferred, even if it has lower precision.

Generalisation to multi-class

  • Most metrics can be generalised to multi-class using macro, micro, or weighted averaging.
  • ROC curve and AUC can be computed using one-vs-rest approach for each class and then averaged.
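
A sketch of the one-vs-rest approach via `roc_auc_score(multi_class="ovr")`; the iris dataset and logistic regression here are illustrative choices, not from the lecture:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 3 classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # one probability column per class

# one-vs-rest AUC per class, macro-averaged across classes
auc_ovr = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
print(f"macro OvR AUC: {auc_ovr:.3f}")
```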


Next question: what if the algorithm doesn’t directly optimise for the metric we want?

  • RandomForestClassifier (and other sklearn classifiers) optimise their own training loss, and their default score is accuracy; they have no direct way to optimise for recall.
  • We have several workarounds (topics for later weeks)
    1. Use recall metric during hyperparameter tuning and cross-validation
    2. Adjust classification threshold after training
    3. Use class weights to penalise misclassifications of the positive class more heavily during training
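
A sketch of workarounds 1 and 3 on the breast cancer data (the parameter grid and weight values are illustrative choices). Note that sklearn's built-in `"recall"` scorer treats label 1 as positive, which in this dataset is benign; to score recall for malignant you would use `make_scorer(recall_score, pos_label=0)`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# workaround 1: tune hyperparameters against recall instead of accuracy
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [2, 5]},   # illustrative grid
    scoring="recall",                   # built-in scorer, positive class = 1
    cv=3,
)
search.fit(X, y)
print(search.best_params_, f"best CV recall: {search.best_score_:.3f}")

# workaround 3: penalise positive-class errors more heavily during training
rf_weighted = RandomForestClassifier(class_weight={0: 1, 1: 5}, random_state=0)
rf_weighted.fit(X, y)
```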

Overview

We’ve covered:

  • Framework of supervised learning
  • Evaluation metrics for regression (default: RMSE or squared error)
  • Evaluation metrics for classification (accuracy, precision, recall, F1-score, AUC)
  • Mind the accuracy paradox: accuracy is often not what you want for imbalanced data
  • How to pick the right metric for your problem