Supervised learning

Framework & evaluation metrics

Huanfa Chen - huanfa.chen@ucl.ac.uk

13 December 2025

Supervised learning

Framework

\[ \begin{aligned} (x_i, y_i) &\sim p(x, y) \text{ i.i.d.} \\ x_i &\in \mathbb{R}^n \\ y_i &\in \begin{cases} \mathbb{R}, & \text{(regression)} \\ \mathcal{Y} \text{ (finite set)}, & \text{(classification)} \end{cases} \\ \text{learn } f(x_i) &\approx y_i \\ \text{such that } f(x) &\approx y \end{aligned} \]

Regression vs. classification

|                    | Regression                               | Classification                                                               |
|--------------------|------------------------------------------|------------------------------------------------------------------------------|
| Target variable    | Continuous value \(y\)                   | Discrete label \(y\) (finite set)                                            |
| Task               | Predict “how much” / “how many”          | Predict “which class” / “which category”                                     |
| Intuition          | Find a ‘line’ close to all points        | Find a ‘boundary’ between classes                                            |
| Making predictions | Directly predict the target variable     | Predict class probabilities and assign the class with the highest probability |
| Example            | Predicting house prices from features    | Predicting spam vs. not spam for an email                                    |

Example - linear regression

from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import pandas as pd

data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
MedInc: 0.45
HouseAge: 0.01
AveRooms: -0.12
AveBedrms: 0.78
Population: -0.00
AveOccup: -0.00
Latitude: -0.42
Longitude: -0.43
Intercept: -37.02

Terms

| Term           | Definition                                                         | Example                                          |
|----------------|--------------------------------------------------------------------|--------------------------------------------------|
| Algorithm      | procedure that runs on data to create a model                      | linear regression                                |
| Model          | the output of an algorithm applied to data                         | \(\hat{y}_i = \sum_k \beta_k x_{ik} + \beta_0\)  |
| Metric         | measure used to evaluate model performance                         | squared error                                    |
| Hyperparameter | algorithm settings, predefined by the user rather than learned from data | none                                       |
| Parameter      | model components learned from data                                 | coefficients & intercept                         |
| Model training | process of estimating parameters from data                         | maximum likelihood estimation                    |

Common challenges of regression/classification

  1. To select evaluation metrics (so that the model solves the right problem)
  2. To design workflow (so that the model generalises well and avoids overfitting)

Metrics for regression

| Metric | Formula | Unit | Notes |
|--------|---------|------|-------|
| RMSE | \(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\) | Same unit as target \(y\) | Penalises large errors more; sensitive to outliers |
| R² | \(1 - \dfrac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}\) | Dimensionless (between 0 and 1 in most cases) | Proportion of variance explained by the model; can be negative if the model is worse than \(y_i=\bar{y}\) |
| MAE | \(\frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert\) | Same unit as target \(y\) | More robust to outliers than RMSE; interpretable as the average absolute error |
| MAPE | \(\frac{100}{n}\sum_{i=1}^{n}\left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert\) | Percent (%) | Relative error; sensitive to very small \(y_i\) |
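
All four metrics are available in sklearn.metrics; a minimal sketch on toy numbers (not from the lecture dataset). Note that sklearn's `mean_absolute_percentage_error` returns a fraction, not a percentage:

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))      # same unit as y
mae = mean_absolute_error(y_true, y_pred)               # average absolute error
mape = mean_absolute_percentage_error(y_true, y_pred)   # a fraction, not percent
r2 = r2_score(y_true, y_pred)                           # proportion of variance explained
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  MAPE={mape:.1%}  R2={r2:.3f}")
```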

Metrics for regression

  • Use RMSE or squared error in most cases
  • By default, regressors in sklearn and XGBoost use squared error as the loss function
  • Generally, R2 for regression can be negative. A negative R2 means the model is worse than simply predicting the mean of y.
  • A special case is that when using OLS regression with an intercept term and evaluating on the training data, R2 will always be between 0 and 1.
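
A quick illustration of a negative R² on toy numbers (not from the lecture dataset): a constant prediction far from the data does much worse than simply predicting the mean.

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]
y_pred = [5.0, 5.0, 5.0]  # far worse than predicting the mean (2.0)

# SS_res = 16 + 9 + 4 = 29; SS_tot = 1 + 0 + 1 = 2; R2 = 1 - 29/2 = -13.5
print(r2_score(y_true, y_pred))  # -13.5
```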

Misuse of R2 metric

  • Question: which R2 is correct?
# fit a neural network model using sklearn, following the same workflow as the previous cell
from sklearn.metrics import r2_score
from sklearn.neural_network import MLPRegressor

nn = MLPRegressor(random_state=10, max_iter=1000)
nn.fit(X_train, y_train)
y_pred_nn = nn.predict(X_test)
print(f"R2 (sklearn): {r2_score(y_test, y_pred_nn):.3f}")

lr_nn = LinearRegression()
lr_nn.fit(y_test.reshape(-1, 1), y_pred_nn)
print(f"R2 (linear regression between y_test and y_pred): {lr_nn.score(y_test.reshape(-1, 1), y_pred_nn):.3f}")
R2 (sklearn): -2.616
R2 (linear regression between y_test and y_pred): 0.156

Misuse of R2 metric

  • R2 (sklearn) is correct, as it compares predictions from the NN model against the true values.
  • R2 (linear regression between y_test and y_pred) is incorrect.
    • It fits a new linear regression of the NN outputs on the true values.
    • It then reports the R2 of this auxiliary regression, which only measures how linearly correlated the predictions are with the truth; it ignores systematic bias and scale errors, so it is not the same as evaluating the NN model.

Metrics for classification (binary)

[Figure: confusion matrix with TP, TN, FP, FN. Image credit: COMS4995-s20]

\(Accuracy=\frac{TP+TN}{TP+TN+FP+FN}\)

  • Choice of positive label: often the minority class

Example using Breast Cancer dataset

class malignant: 212, 37.258%
class benign: 357, 62.742%
Predictive accuracy: 0.930
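
The class counts above presumably come from a workflow like the following sketch; the exact accuracy depends on the model and split, which the slide does not specify (the random forest and 75/25 split here are illustrative assumptions). Note that in sklearn's encoding, label 0 is malignant and label 1 is benign.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target

# class distribution (0 = malignant, 1 = benign)
for label, name in enumerate(data.target_names):
    count = np.sum(y == label)
    print(f"class {name}: {count}, {100 * count / len(y):.3f}%")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(f"Predictive accuracy: {rf.score(X_test, y_test):.3f}")
```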

Limitation of accuracy (Accuracy paradox)

  • Scenario: Data with 90% negatives (imbalanced data)
  • A majority strategy that predicts all as negative gets 90% accuracy, but this is useless.
  • Different models can have the same accuracy (0.9) but make very different types of errors.
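
The majority strategy in the scenario above can be reproduced with sklearn's `DummyClassifier` on synthetic imbalanced data (the 90/10 split here mirrors the scenario):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# synthetic imbalanced data: 90% negatives, 10% positives
rng = np.random.default_rng(0)
y = rng.permutation([0] * 900 + [1] * 100)
X = rng.normal(size=(1000, 3))  # features are irrelevant to this baseline

majority = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = majority.predict(X)
print(f"accuracy: {accuracy_score(y, y_pred):.2f}")  # 0.90
print(f"recall:   {recall_score(y, y_pred):.2f}")    # 0.00 -- the model is useless
```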

Precision, Recall, F1-score, AUC

  • Precision (Positive Predictive Value): \(\frac{TP}{TP+FP}\). Among predicted positives, how many are actually positive.
  • Recall (Sensitivity, True Positive Rate): \(\frac{TP}{TP+FN}\). Among actual positives, how many are correctly predicted.
  • F1-score (Harmonic mean of precision & recall): \(F=2\frac{\text{precision}\cdot \text{recall}}{\text{precision}+\text{recall}}\)
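
A minimal sketch of these three metrics on toy labels with TP=3, FP=1, FN=2, TN=4:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4 = 0.75
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/5 = 0.60
f1 = f1_score(y_true, y_pred)                # 2 * 0.75 * 0.6 / (0.75 + 0.6) = 2/3
print(precision, recall, f1)
```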

Another illustration

precision vs recall
Image Credit: wikipedia.org

Example

Addressing accuracy paradox

  • To address this issue with imbalanced data, use precision, recall, or F1 as the metric.

Trade-off between precision and recall

  • Generally, when precision increases, recall decreases, and vice versa.
  • To generate this plot, vary a threshold T:
    1. Predict p(label=1) for each record.
    2. For a given T, predict label=1 if p(label=1) > T; otherwise predict label=0, then compute precision & recall.
    3. Vary T from 0 to 1, computing precision and recall at each threshold. Link the dots to get the curve.
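
These steps are exactly what `sklearn.metrics.precision_recall_curve` performs; a minimal sketch (the dataset, scaler, and classifier are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

p_pos = clf.predict_proba(X_test)[:, 1]  # step 1: p(label=1) for each record
# steps 2-3: precision and recall at every distinct threshold
precision, recall, thresholds = precision_recall_curve(y_test, p_pos)

# plotting recall (x) against precision (y) links the dots into the PR curve
print(f"{len(thresholds)} thresholds evaluated")
```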

Example: trade-off between precision and recall

  • Edge case: when T >= 1, all records are predicted as negative.
  • Then recall = 0, and precision = TP/(TP+FP) = 0/0 is undefined; by convention it is set to 1.

Precision–Recall curve for LR and Random Forest


Usage of Precision-Recall curve

  • To visualise the precision-recall trade-off
  • To select the optimal threshold
    • Need high precision? Choose a point in high-precision, low-recall region.
    • Need high recall? Choose a point in high-recall, low-precision region.
    • Need a balance? Look for the “elbow” of the curve.
  • To compare different models

ROC Curve

  • Receiver Operating Characteristic (ROC) curve plots True Positive Rate=TP/(TP+FN) vs. False Positive Rate=FP/(FP+TN).
  • The identity line y = x represents a random classifier (e.g. tossing a coin).

AUC (Area under ROC Curve)

  • The area under the ROC curve (the integral of the curve)
  • AUC is 0.5 for random predictions (in expectation)
  • The maximum AUC is 1.0, achieved by perfect predictions
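
Both extremes can be checked directly with `roc_auc_score` (the synthetic labels and scores here are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=10_000)

perfect = y.astype(float)            # scores that rank every positive above every negative
random_scores = rng.random(10_000)   # coin-toss scores, independent of y

print(roc_auc_score(y, perfect))         # 1.0
print(roc_auc_score(y, random_scores))   # close to 0.5
```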

Aggregating metrics across classes

  • Macro average: \(\frac{1}{|L|}\sum_{l\in L}R(y_{l},\hat{y}_{l})\)
    Unweighted mean of per-class scores. Each class contributes equally, regardless of sample size.

  • Micro average: \(\frac{1}{n}\sum_{i=1}^{n}R(y_{i},\hat{y}_{i})\)
    Sum individual TP, FP, FN, TN across all classes, then compute metric. Equivalent to accuracy for multiclass.

  • Weighted average: \(\frac{1}{n}\sum_{l\in L}n_{l}R(y_{l},\hat{y}_{l})\)
    Weighted by support (number of samples in each class). Balances class sizes in the final score.
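
The three averaging modes map to the `average` parameter of sklearn's metric functions; a sketch on a toy 3-class problem with unequal class sizes:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]

for avg in ("macro", "micro", "weighted"):
    print(avg, f1_score(y_true, y_pred, average=avg))

# micro-averaged F1 equals accuracy for single-label multiclass problems
print(accuracy_score(y_true, y_pred))  # 0.8
```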

Classification report

from sklearn.metrics import classification_report
print(classification_report(y_test, rf.predict(X_test), target_names=data.target_names))
              precision    recall  f1-score   support

   malignant       0.92      0.92      0.92        53
      benign       0.96      0.96      0.96        90

    accuracy                           0.94       143
   macro avg       0.94      0.94      0.94       143
weighted avg       0.94      0.94      0.94       143

Picking a metric

  • Real-world problems are rarely balanced.
  • Accuracy is rarely what you want.
  • Find the right criterion for the specific task.
  • Decide between emphasis on recall or precision.
  • Identify which classes are important.

Metric for breast cancer detection

  • “1” indicates malignant/cancer, “0” indicates benign/no cancer.
  • Missing a cancer (FN) is much worse than a false alarm (FP)
  • So, we care more about recall than precision or accuracy.
  • A model with high recall is preferred, even if it has lower precision.

Generalisation to multi-class

  • Most metrics can be generalised to multi-class using macro, micro, or weighted averaging.
  • ROC curve and AUC can be computed using one-vs-rest approach for each class and then averaged.
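
A sketch of the one-vs-rest approach via `roc_auc_score(multi_class="ovr")`; the iris dataset and logistic regression here are illustrative choices, not from the lecture:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 3 classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # one probability column per class

# one-vs-rest AUC per class, macro-averaged across classes
auc_ovr = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
print(f"macro OvR AUC: {auc_ovr:.3f}")
```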


Next question: what if the algorithm doesn’t directly optimise for the metric we want?

  • RandomForestClassifier (and other sklearn classifiers) optimise their own training loss, and their default score is accuracy; they have no direct way to optimise for recall.
  • We have several workarounds (topics for later weeks)
    1. Use recall metric during hyperparameter tuning and cross-validation
    2. Adjust classification threshold after training
    3. Use class weights to penalise misclassifications of the positive class more heavily during training
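
A sketch of workarounds 1 and 3 on the breast cancer data (the parameter grid and weight values are illustrative choices). Note that sklearn's built-in `"recall"` scorer treats label 1 as positive, which in this dataset is benign; to score recall for malignant you would use `make_scorer(recall_score, pos_label=0)`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# workaround 1: tune hyperparameters against recall instead of accuracy
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [2, 5]},   # illustrative grid
    scoring="recall",                   # built-in scorer, positive class = 1
    cv=3,
)
search.fit(X, y)
print(search.best_params_, f"best CV recall: {search.best_score_:.3f}")

# workaround 3: penalise positive-class errors more heavily during training
rf_weighted = RandomForestClassifier(class_weight={0: 1, 1: 5}, random_state=0)
rf_weighted.fit(X, y)
```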

Overview

We’ve covered:

  • Framework of supervised learning
  • Evaluation metrics for regression (default: RMSE or squared error)
  • Evaluation metrics for classification (accuracy, precision, recall, F1-score, AUC)
  • Mind the accuracy paradox: accuracy is often not what you want for imbalanced data
  • How to pick the right metric for your problem