Imbalanced Data

Huanfa Chen - huanfa.chen@ucl.ac.uk

13/03/2026

Recap on W2 Supervised Learning Metrics

Limitation of accuracy (Accuracy paradox)

Scenario: Data with 90% negatives (imbalanced data)
A majority strategy that predicts all as negative gets 90% accuracy, but this is useless.
Different models can have the same accuracy (0.9) but make very different types of errors.

Precision, Recall, and their trade-off

Precision: \(\frac{TP}{TP+FP}\). Among predicted positives, how many are actually positive.
Recall (Sensitivity): \(\frac{TP}{TP+FN}\). Among actual positives, how many are correctly predicted.

Picking a metric

Real-world problems are rarely balanced.
Accuracy is rarely what we want.
Find the right criterion; decide between emphasis on recall or precision.
Identify which classes are important.

Metric for breast cancer detection

“1”: malignant/cancer (37.3%)
“0”: benign/no cancer (62.7%)
Missing a cancer (FN) is much worse than a false alarm (FP)
So, we care more about recall than precision or accuracy.
A model with high recall is preferred, even if it has lower precision.

Imbalanced data is common

Classification often has asymmetric costs or data imbalance
Need metrics and models that respect imbalance

This week

Objectives

Understand the sources of imbalanced data
Learn methods for handling imbalanced data

Two Sources of Imbalance

Asymmetric data prevalence
Asymmetric cost between errors
- Example: In medical diagnosis, missing a disease (FN) is much worse than a false alarm (FP)
- … even if class prevalence is balanced, the cost of errors is not symmetric

Methods for selecting imbalanced data

Select evaluation metrics (What do you want to optimise?)
Adjust decision thresholds
Change class-weights
Resample data

Select evaluation metrics

Accuracy paradox: accuracy is misleadingfor imbalanced data
Use precision or recall

Adjust thresholds

Most models output probabilities (0-1). Default threshold T = 0.5
If p(class=1) > T, predict Class 1; else predict 0
Tuning T shifts the balance between precision and recall

Precision-Recall Curve with varying thresholds

Tuning Threshold with Cross-Validation

RF classifiers don’t support threshold tuning, so we use TunedThresholdClassifierCV from sklearn to tune threshold with CV

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.model_selection import TunedThresholdClassifierCV
import numpy as np

# Load data: 1 = malignant, 0 = benign
data = load_breast_cancer(as_frame=True)
X = data.data
y = (data.target == 0).astype(int)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=42, stratify=y
)

# Cross-validation with threshold tuning
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
thresholds = np.arange(0.1, 1.0, 0.1)

# Tune threshold to maximise recall using CV
rf = RandomForestClassifier(n_estimators=100, random_state=42)
tuned_rf = TunedThresholdClassifierCV(
  estimator=rf,
  scoring="recall",
  thresholds=thresholds,
  cv=cv
)
tuned_rf.fit(X_train, y_train)

y_pred = tuned_rf.predict(X_test)
test_recall = recall_score(y_test, y_pred, pos_label=1, zero_division=0)
optimal_thresh = tuned_rf.best_threshold_

print(f"Thresholds tested: {thresholds.min():.1f} to {thresholds.max():.1f}, step length = 0.1")
print(f"Optimal threshold: {optimal_thresh:.1f}")
print(f"Test recall at optimal threshold: {test_recall:.4f}")

Thresholds tested: 0.1 to 0.9, step length = 0.1
Optimal threshold: 0.1
Test recall at optimal threshold: 1.0000

Change class-weights

Many classifcation algorithms use a weighted loss function in model training (or similar mechanisms) to handle imbalanced data
\(L = \frac{1}{n} \sum_{i=1}^{n} w_{y_i} \cdot \ell(y_i, \hat{y}_i)\)
Weight indicates importance. The weight can be set for a class or individually for each sample.
If weight for a class is set higher, model training will penalise misclassifying that class more heavily.
Can use cross-validation to tune class weights

Class-Weights in tree-based methods

Resampling

Resample data

Resampling modifies the training data to balance classes.
Random undersampling: drop majority samples until balanced
Random oversampling: repeat minority samples until balanced
SMOTE: create synthetic minority samples by interpolating between existing ones
Ensemble resampling: train multiple models on different balanced subsets and aggregate predictions

Basic Approaches

Change the training procedure
Modify data via sampling

Random Undersampling

Drop majority samples until balanced
Very fast; dataset shrinks to ~2x minority class size
Problem: can lose majority samples; unstable for small datasets

Random Oversampling

Randomly pick up a minority sample, duplicate it, until balanced
Dataset grows; slower training

ros = RandomOverSampler()
X_over, y_over = ros.fit_sample(X_train, y_train)

SMOTE (Synthetic Minority Oversampling Technique)

Steps (repeated until balanced)
1. Randomly pick up a minority sample A, find its k nearest minority neighbors based on feature distance
2. Randomly select one neighbor B
3. Create a synthetic sample by interpolating between A and B

Advantages of SMOTE (compared to random oversampling)

Linear interpolation: synthetic samples are created on the line between two existing minority samples
No duplication: it generates new data rather than just duplicating existing ones
Better generalisation*: synthetic samples are created in the feature space

Variants of SMOTE

Borderline-SMOTE: only create synthetic samples near the decision boundary
ADASYN: adaptively create more synthetic samples in harder-to-learn regions
SMOTE-NC: handles mixed numerical and categorical features
SMOTE-Tomek: combines SMOTE with under-sampling techniques to clean up noise and overlapping samples

Resampling in practice: when should we apply resampling?

Resampling should be applied after train-test split, and only on the training data
Resampling is only applied to training data, not test data
Test data should reflect real-world distribution; resampling test data would give an unrealistic evaluation of model performance

Scikit-learn not support resampling

Sklearn doesn’t support resampling in its API; sklearn’s pipelines transform X only and cannot resample y
Need to do it manually or use imbalaned-learn extension

Imbalance-Learn

Library: http://imbalanced-learn.org
To install: pip install -U imbalanced-learn
Extends sklearn API with samplers and pipelines

sklearn pipelines (no resampling)

# train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# sklearn Pipeline: imputation → normalisation → classifier (no resampling)
sklearn_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

sklearn_pipeline.fit(X_train, y_train)
y_pred = sklearn_pipeline.predict(X_test)
print(f"Recall (malignant=1): {recall_score(y_test, y_pred, pos_label=1):.4f}")

sklearn pipelines with resampling

Sampler only runs during fit, NOT at prediction
This is achieved by from imblearn.pipeline import Pipeline

from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.under_sampling import RandomUnderSampler

# imblearn Pipeline: imputation → normalisation → undersampling → classifier
imb_pipeline = ImbPipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('sampler', RandomUnderSampler(random_state=42)),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Sampler only runs during fit, NOT at prediction time
imb_pipeline.fit(X_train, y_train)
y_pred = imb_pipeline.predict(X_test)
print(f"Recall (malignant=1): {recall_score(y_test, y_pred, pos_label=1):.4f}")

Ensemble resampling

Random resampling separately per estimator in ensemble
Example: Balanced bagging or balanced random forest
Easy with imblearn BalancedBaggingClassifier (sklearn API compatible)

Easy Ensemble with imblearn

Bag of Boosted Learners; by default using AdaBoostClassifier as base estimator
Trains each tree on a different under-sampled dataset
As cheap as undersampling, but much more powerful than undersampling alone, as it prevents overfitting

from imblearn.ensemble import EasyEnsembleClassifier

ee = EasyEnsembleClassifier(n_estimators=10, random_state=42)
ee.fit(X_train, y_train)

y_pred = ee.predict(X_test)

Summary

Imbalanced classification is common; accuracy is often misleading
Methods for handling imbalance: adjust metrics, thresholds, class weights, resample data
Resampling should only be applied to training data, not test data
SMOTE and Ensemble resampling are more powerful than random undersampling or oversampling.