Huanfa Chen - huanfa.chen@ucl.ac.uk
13/12/2025





from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)
tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
Pre-pruning parameters: max_depth, max_leaf_nodes, min_samples_split, min_impurity_decrease
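A minimal sketch of what these pre-pruning parameters do, using the same breast cancer split as above (the specific values max_depth=4 and min_samples_split=10 are illustrative choices, not recommendations):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

# Unpruned tree: grows until every leaf is pure, memorising the training set
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruned tree: growth stops early, trading training accuracy for generalisation
pruned = DecisionTreeClassifier(max_depth=4, min_samples_split=10,
                                random_state=0).fit(X_train, y_train)

print("unpruned:", full.score(X_train, y_train), full.score(X_test, y_test))
print("pruned:  ", pruned.score(X_train, y_train), pruned.score(X_test, y_test))
```

The unpruned tree scores 100% on the training data; comparing the two test scores shows whether pruning helped generalisation.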









from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=0)
tree = DecisionTreeClassifier(max_leaf_nodes=6).fit(X_train, y_train)
tree.feature_importances_
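The raw feature_importances_ array is easier to read when paired with the feature names. A small sketch of this (the sorting and formatting are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=0)
tree = DecisionTreeClassifier(max_leaf_nodes=6, random_state=0).fit(X_train, y_train)

# Importances sum to 1; features the tree never splits on get importance 0
for name, imp in sorted(zip(iris.feature_names, tree.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name}: {imp:.3f}")
```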


Ensembles work best when there is diversity among models; scikit-learn's VotingClassifier combines several different classifiers into a single predictor.
soft: average the predicted probabilities of all models and take the class with the largest average probability
hard: let each model make a prediction and take a majority vote
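A minimal VotingClassifier sketch on the breast cancer data; the choice of three dissimilar base models (logistic regression, a shallow tree, k-NN) is illustrative, chosen to supply the diversity the ensemble needs:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

# voting='soft' averages predicted probabilities;
# voting='hard' would take a majority vote of class labels instead
vote = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=5000)),
                ('dt', DecisionTreeClassifier(max_depth=4, random_state=0)),
                ('knn', KNeighborsClassifier())],
    voting='soft')
vote.fit(X_train, y_train)
print("ensemble test accuracy:", vote.score(X_test, y_test))
```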
Examples of ensemble methods: random forests; gradient boosting libraries such as XGBoost, LightGBM, CatBoost, etc.




max_features: ~\(\sqrt{p}\) for classification, ~\(p\) for regression
n_estimators: use 100+; more trees reduces variance
Pruning parameters (max_depth, max_leaf_nodes, min_samples_split) can cut model size/training time
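These settings map directly onto RandomForestClassifier arguments; a brief sketch on the breast cancer data (max_features='sqrt' is scikit-learn's way of expressing \(\sqrt{p}\) features per split, and is also the classification default):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

# 100+ trees: averaging over many de-correlated trees reduces variance;
# max_features='sqrt' considers ~sqrt(p) candidate features at each split
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                random_state=0).fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```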



\[ \begin{aligned} f_{1}(x) &\approx y \\ f_{2}(x) &\approx y - \gamma f_{1}(x) \\ f_{3}(x) &\approx y - \gamma f_{1}(x) - \gamma f_{2}(x) \end{aligned} \]
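The staged equations above can be sketched directly: each new shallow tree is fit to the residual left by the shrunken sum of the previous trees. This is a minimal hand-rolled illustration on synthetic data (the sine-curve dataset, \(\gamma = 0.5\), and max_depth=2 are all illustrative assumptions, not part of the original):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

gamma = 0.5                        # shrinkage / learning rate
ensemble_pred = np.zeros_like(y)   # running gamma * (f_1 + ... + f_k) on X
mses = []
for stage in range(3):             # f_1, f_2, f_3 as in the equations above
    residual = y - ensemble_pred   # y - gamma*f_1(x) - ... : what is left to explain
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    ensemble_pred += gamma * tree.predict(X)
    mses.append(np.mean((y - ensemble_pred) ** 2))
    print(f"stage {stage + 1}: training MSE = {mses[-1]:.4f}")
```

The training MSE shrinks at each stage, since every tree corrects part of the error the previous trees left behind.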


Key gradient boosting hyperparameters: tree depth (max_depth); number of stages (n_estimators); subsampling of rows/columns; pruning parameters (min_samples_split, min_impurity_decrease)
© CASA | ucl.ac.uk/bartlett/casa