Tree-Based Methods

Tree, Forest, and Gradient Boosting

Huanfa Chen - huanfa.chen@ucl.ac.uk

13/12/2025

Last week

  • Analysis workflow of supervised learning
  • Model evaluation methods: train-test split, cross validation (and extensions)

Objectives of this week

  1. Understand decision trees and tree-based ensemble methods (random forests, gradient boosting)
  2. Can apply tree-based methods to real-world problems

Decision Trees

Decision Trees (DT)

  • Many types of decision trees: classification and regression trees (CART), C4.5, ID3, CHAID, etc.
  • Most trees (except CART) are rarely used these days, and CART is still popular.
  • Focus on CART: binary trees for classification and regression
  • A CART is a regression tree when the target variable Y is continous, and a classification tree when Y is categorical.
  • It is not dependent on type of features (X).

Why DT?

  • Non-linear, strong models for classification and regression
  • Minimal preprocessing: scale and distributions matter little
  • Good interpretation: small trees can be explained

Decision Trees for Classification

  • It starts with a root node
  • A split with a threshold on a single X feature divides data into two parts
  • Leave Nodes are the endpoints that give predictions.

Tree illustration

Another (synthetic) classification tree

  • A dataset with two numertical features (X[0], [1]), to predict Y with two classes (RED and BLUE)
  • Two visuals: tree structure, decision boundary
  • The leaf node and region are one-to-one mapped.
  • a node with counts=[40, 30] means this node contains 40 RED samples, 30 BLUE samples.

tree_demonstration_depth_2

Making predictions with classification tree

  • Given a new sample, start from the root node, traverse the tree according to split, until reaching a leaf node.
  • Leaf node D has counts=[1, 10], so it predicts BLUE class AND probability distribution of [1/11, 10/11] over two classes.

tree_demonstration_depth_2_making_prediction

How to build a classification tree

  • A tree is built by a split from the root node, then recursively splitting child nodes.
  • A key question: How to choose a split?
  • Note:
    • A split is a X feature and a threshold.
    • Can’t split on Y, as Y is unknown for new datasets.
    • Can’t split on multiple X variables.
  • A split is chosen to best separate Y classes, by measuring impurity reduction before & after split

To measure impurity of a node

  • Gini: \(H_{\text{gini}}(X_m) = \sum_{k \in \mathcal{Y}} p_{mk}(1-p_{mk})\)
    • \(p_{mk}\) is class \(k\) proportion in node \(m\)
  • The larger Gini, the more impure the node
  • When Gini=0, the node is pure (all samples in one class)
  • Example: a node with 10 samples in three classes [2,5,3].
  • Gini = 0.2 × 0.8 + 0.5 × 0.5 + 0.3 × 0.7 = 0.66

To measure impurity reduction by a split

  • A good split will put similar samples together, or generate two child nodes with lower impurity
  • Impurity reduction by a split = impurity(parent) - weighted impurity(children)

tree_depth_2_impurity_reduction

To choose a split

  • All splits will be evaluated
  • The split with the largest impurity reduction will be chosen

tree_depth_2_impurity_reduction_comparison

A regression tree

  • Similar to classification tree, but a leaf node stores a range of Y values and will predict the mean of Y
  • To measure impurity of a node in regresiion tree: Mean Square Error
  • \(H(X_m) = \frac{1}{N_m}\sum_{i \in N_m}(y_i-\bar{y}_m)^2\)
  • MSE measures the deviation of Y values from its mean, which is similar to the idea of impurity.
  • If all Y values are the same, MSE=0, meaning the node is pure.

Summary of impurity measures of decision trees

  • For classification:
    • Gini (by default)
    • Cross-Entropy
  • For regression
    • Mean Square Error (by default)
    • Mean Absolute Error

Visualising Trees (sklearn)

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

Visualising Trees (plot_tree)

from sklearn.tree import plot_tree
tree_dot = plot_tree(tree, feature_names=cancer.feature_names)

Plot tree

Parameter Tuning

  • Pre-pruning (limit growth)
    • max_depth
    • max_leaf_nodes
    • min_samples_split
    • min_impurity_decrease
  • Post-pruning (cost-complexity) after full growth

No Pruning

No pruning

max_depth = 4

Max depth 4

max_leaf_nodes = 8

Max leaf nodes 8

min_samples_split = 50

Min samples split 50

Grid Search: max_depth

from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth': range(1, 7)}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid=param_grid, cv=10)
grid.fit(X_train, y_train)

Grid max depth

Grid Search: max_leaf_nodes

param_grid = {'max_leaf_nodes': range(2, 20)}

Grid max leaf nodes

Cost Complexity Pruning

  • Objective: \(R_\alpha(T) = R(T) + \alpha |T|\)
  • \(R(T)\) = total leaf impurity; \(|T|\) = number of leaves; tune \(\alpha\)

Grid ccp alpha

Efficient Pruning Path

clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

Pruning alphas

Post- vs Pre-Pruning

  • Cost-complexity pruning result

    Pruned tree

  • max_leaf_nodes search result

    Max leaf nodes

Feature Importance

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=0)
tree = DecisionTreeClassifier(max_leaf_nodes=6).fit(X_train, y_train)
tree.feature_importances_

Tree importances

  • Sum of impurity decreases per feature; magnitude only (no sign)
  • Unstable with correlated features or different splits

Categorical Data

  • Trees can split categories into subsets; many possible splits
  • Exact search costly; efficient for binary classification + Gini
  • In sklearn, one-hot encoding is needed today as sklearn can’t handle categorical features natively

Predicting Probabilities

  • Leaf probability = class fraction in leaf
  • e.g. for three-class leaf with samples of (AA BBB CCCCC): \(P(A)=0.2\), \(P(B)=0.3\), \(P(C)=0.5\)
  • Deep, unpruned trees give overconfident (100%) probabilities
  • Pruning helps but calibration may still be poor

Limitations of Trees

  • Instability & overfitting: small data changes can alter splits.
  • (Emsembles help)

Instability

Tree instability 1

Tree instability 2

  • Small data changes can alter splits

Emsemble Methods

Poor man’s ensemble

  • Combine multiple models to reduce variance / improve accuracy
  • Train several models with different seeds; average predictions
  • Owen Zhang (long time Kaggle 1st): build XGBoosting models with different random seeds.
  • Works across model families (e.g., tree + linear + RF + NN)
  • Key to success: diversity among models
  • sklearn: VotingClassifier
    • soft: average the probabilities of all models and take the class with largest average probability
    • hard: let each model make a prediction and take majority vote

VotingClassifier Example

from sklearn.ensemble import VotingClassifier
voting = VotingClassifier(
    [('logreg', LogisticRegression(C=100)),
     ('tree', DecisionTreeClassifier(max_depth=3, random_state=0))],
    voting='soft')
voting.fit(X_train, y_train)

Voting classifier

Tree ensemble: two types

  • Bagging (Bootstrap Aggregation): random forests
  • Boosting: gradient boosting machines (GBM), XGBoost, LightGBM, CatBoost, etc.

Bagging (Bootstrap Aggregation)

  • Sample with replacement (same size as dataset)
  • Train a model on each bootstrap sample
  • Average predictions to cut variance

Bootstrap sample

Bias and Variance

Bias vs variance

  • Aim for low bias + low variance
  • Averaging high-variance models can lower variance

Ensemble: Bias vs Variance

  • Generalisation improves with strong base learners and low correlation
  • Diversifying models (or data/features) helps more than sheer count

Random Forests

Random forest

  • Bagging + feature subsampling at each split

Randomise in Two Ways

  • For each tree: bootstrap sample of rows
  • For each split: sample features without replacement
  • More trees → lower variance (diminishing returns)

Bootstrap rows

Feature sample

Tuning Random Forests

  • max_features: ~\(\sqrt{p}\) for classification, ~\(p\) for regression
  • n_estimators: use 100+; more reduces variance
  • Pre-pruning (max_depth, max_leaf_nodes, min_samples_split) can cut size/time

Extremely Randomized Trees

  • Add randomness: draw split thresholds at random per feature
  • Often no bootstrap; faster (no threshold search)
  • Can yield smoother boundaries; less common than standard RF

Warm-Starts

rf = RandomForestClassifier(warm_start=True)
for n in range(1, 100, 5):
    rf.n_estimators = n
    rf.fit(X_train, y_train)

Warm start forest

  • Increase trees incrementally; stop when scores stabilise

Out-of-Bag Estimates (optional)

  • Each tree trains on ~66% of data; predict remaining ~34%
  • Average OOB predictions as a free validation score
rf = RandomForestClassifier(max_features=m, oob_score=True,
                            n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
oob_scores.append(rf.oob_score_)

OOB estimates

Variable Importance (RF)

rf = RandomForestClassifier().fit(X_train, y_train)
rf.feature_importances_

Forest importances

  • More stable than single-tree importances; still magnitude-only

Trees: Takeaways

  • Non-linear without heavy preprocessing
  • Single small trees interpretable; forests robust baselines
  • Beware extrapolation limits and instability

Gradient Boosting

Gradient Descent

  • Optimise \(\arg\min_w F(w)\) by stepping along \(-\nabla F(w)\)
  • Update: \(w_{i+1} = w_i - \eta_i \nabla F(w_i)\)
  • Converges to a local minimum (global for convex losses)

Gradient 1D

Stochastic Gradient Descent

  • Logistic regression objective: log-loss + regularizer
  • SGD uses one (or a mini-batch of) example(s) per step to approximate the gradient
  • Faster on large data; noisier updates
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier().fit(X_train, y_train)

Gradient Boosting Idea

  • Train a small tree to predict \(y\)
  • Train next tree on residuals of previous model (or on points poorly predicted)
  • Repeat; sum scaled predictions: \(f(x)=\sum \gamma g_k(x)\) with learning rate \(\gamma\)

Gradient Boosting Algorithm

\[ \begin{aligned} f_{1}(x) &\approx y \\ f_{2}(x) &\approx y - \gamma f_{1}(x) \\ f_{3}(x) &\approx y - \gamma f_{1}(x) - \gamma f_{2}(x) \end{aligned} \]

  • Each new tree fixes remaining error
  • Small \(\gamma\) (e.g., 0.1) for smoother learning

Gradient Boosting as Gradient Descent

  • Linear regression minimizes \(\sum (y - w^T x - b)^2\)
  • Gradient descent updates weights
  • Gradient boosting minimises same loss over predictions \(\hat{y}\)
  • Update \(\hat{y}\) by adding trees along negative gradient

Regression Example

GBR steps

  • Shallow trees fit residuals sequentially until residuals shrink

Classification Example

GBC depth 2

  • Probability surfaces become sharper as trees accumulate
  • Multiclass: one regression tree per class per step

Early Stopping

  • More trees can overfit
  • Stop when validation metric stops improving
  • Either fix \(n_{\text{estimators}}\) and tune \(\gamma\), or fix \(\gamma\) and stop early

Tuning Gradient Boosting

  • Use shallow trees (strong pruning via max_depth)
  • Tune learning rate, n_estimators, subsampling of rows/columns
  • Optional regularisation (min_samples_split, min_impurity_decrease)

Ensuring uncorrelated trees: Aggressive Subsampling

  • Row and column subsampling reduce correlation and overfitting
  • Common in XGBoost/LightGBM

XGBoost

from xgboost import XGBClassifier
xgb = XGBClassifier().fit(X_train, y_train)
  • Fast, supports missing values, GPU/cluster training
  • L1/L2 on leaves; fast approximate splits; sparse data friendly

LightGBM

from lightgbm.sklearn import LGBMClassifier
lgbm = LGBMClassifier().fit(X_train, y_train)
  • Native categorical handling; missing values; GPU/cluster support

CatBoost

from catboost.sklearn import CatBoostClassifier
catb = CatBoostClassifier().fit(X_train, y_train)
  • Strong on categorical data; symmetric trees; target-encoding tricks; GPU

Advantages of Gradient Boosting

  • Often more accurate than random forests with tuning
  • Small models; fast prediction
  • Hist/XGB/LightGBM implementations are very fast

Summary: when to use a tree or tree ensemble

  • For tabular data
  • Can model non-linear relationships and require minimal preprocessing
  • Single small tree for interpretability, but a single tree often underfits and doesn’t generalise well
  • Random forests are very robust and good benchmark
  • Gradient boosting generate best accuracy when tuned

Questions