Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
1. Deepak George
Senior Data Scientist – Machine Learning
Decision Tree Ensembles
Bagging, Random Forest & Gradient Boosting Machines
December 2015
2. About Me
Education
Computer Science Engineering – College Of Engineering Trivandrum
Business Analytics & Intelligence – Indian Institute Of Management Bangalore
Career
Mu Sigma
Accenture Analytics
Data Science
1st Prize Best Data Science Project (BAI 5) – IIM Bangalore
Top 10% finish (out of 1100) in the Kaggle Coupon Purchase Prediction competition (Recommender System)
SAS Certified Statistical Business Analyst: Regression and Modeling Credentials
Statistical Learning – Stanford University
Passion
Photography, Football, Data Science, Machine Learning
Contact
Deepak.george14@iimb.ernet.in
linkedin.com/in/deepakgeorge7
3. Bias-Variance Tradeoff
Expected test MSE
Bias
Error that is introduced by approximating a complicated relationship by a much simpler model.
Difference between the truth and what you
expect to learn
Underfitting
Variance
Amount by which the model would change if we estimated it using a different training set.
If a model has high variance then small
changes in the training data can result in
large changes in the model.
Overfitting
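For reference, the standard decomposition behind this slide (ISLR form; the notation below is added here, not from the deck):
Expected test MSE at a point x0: E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε)
Var(ε) is the irreducible error; more flexible models lower the bias but raise the variance.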
6. Bootstrap sampling
A bootstrap sample should have the same sample size as the original sample.
Sampling with replacement results in repetition of values.
A bootstrap sample on average uses only about 2/3 of the data in the original sample.
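A small R sketch (not from the deck) illustrating sampling with replacement and the ~2/3 figure:
set.seed(1861)
n <- 10000
original <- 1:n                                      # indices of the original sample
boot <- sample(original, size = n, replace = TRUE)   # bootstrap sample: same size, with replacement
table(table(boot))[1:3]                              # how many values were drawn 1, 2, 3 times (repetition)
mean(original %in% boot)                             # fraction of original data used; ~0.632, i.e. about 2/3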
7. Random Forest
Problem: Bagging still has relatively high variance because the bagged trees are highly correlated
Goal: Reduce variance of Bagging
Solution: Along with sampling of data in Bagging, take samples of features also!
In other words, in building a random forest, at each split in the tree,
use only a random subset of the features instead of all the features.
This de-correlates the trees.
Empirically, √(number of predictors) is a good approximate value for the predictor subset size
(mtry/max_features) in classification; p/3 is a common default for regression.
Evaluation: A bootstrap sample uses only approximately 2/3 of the observations of the original sample.
The remaining training data (out-of-bag, OOB) are used to estimate the error and variable importance.
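A minimal randomForest sketch (assuming the Boston data used on the next slides) showing the OOB error and importance estimates:
library(randomForest)
library(MASS)                                # Boston data frame
set.seed(1861)
# For regression, randomForest defaults mtry to p/3; sqrt(p) is the usual classification default
rf <- randomForest(medv ~ ., data = Boston, ntree = 500, importance = TRUE)
print(rf)                                    # reports the OOB error (mean of squared residuals, % variance explained)
importance(rf)                               # OOB/permutation-based variable importance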
8. Random Forest – Key Hyperparameters
Hyperparameters are knobs to control the bias-variance tradeoff of any machine learning algorithm.
Key hyperparameters
Max Features – De-correlates the trees
Number of Trees in the forest – A higher number reduces variance further
9. Random Forest – R Implementation
library(randomForest)
library(MASS) #Contains Boston dataframe
library(caret)
View(Boston)
#Cross validation setup
cv.ctrl <- trainControl(method = "repeatedcv", repeats = 2, number = 5, allowParallel = TRUE)
#Grid search over mtry (Boston has 13 predictors)
rf.grid <- expand.grid(mtry = 2:13)
set.seed(1861) ## make reproducible here, but not if generating many random samples
#Hyperparameter tuning
rf_tune <- train(medv ~ .,
                 data = Boston,
                 method = "rf",
                 trControl = cv.ctrl,
                 tuneGrid = rf.grid,
                 ntree = 1000,
                 importance = TRUE)
#Cross Validation results
rf_tune
plot(rf_tune)
#Variable Importance
varImp(rf_tune)
plot(varImp(rf_tune), top = 10)
10. Boosting
Intuition: Ensemble many “weak” classifiers (typically decision trees) to
produce a final “strong” classifier
Weak classifier: error rate is only slightly better than random guessing.
Boosting is a Forward Stagewise Additive model
Boosting sequentially applies the weak classifiers, one by one, to repeatedly
reweighted versions of the data.
Each new weak learner in the sequence tries to correct the
misclassification/error made by the previous weak learners.
Initially all of the weights are set to Wi = 1/N
For each successive step the observation weights are individually
modified and a new weak learner is fitted on the reweighted
observations.
At step m, those observations that were misclassified by the
classifier Gm−1(x) induced at the previous step have their weights
increased, whereas the weights are decreased for those that were
classified correctly.
Final “strong” classifier is based on weighted vote of weak classifiers
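For reference, the AdaBoost.M1 updates behind this description (standard form from Hastie et al.; written here to match the Wi and Gm(x) notation above):
errm = Σi Wi · I(yi ≠ Gm(xi)) / Σi Wi          (weighted error of the mth weak learner)
αm = log((1 − errm) / errm)                    (vote weight of the mth weak learner)
Wi ← Wi · exp(αm · I(yi ≠ Gm(xi)))             (misclassified observations get their weights increased)
Final classifier: G(x) = sign(Σm αm · Gm(x))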
11. AdaBoost – Illustration
[Figure: AdaBoost illustration on features X1 and X2; Step 1 shows the input data]
Initially all observations are
assigned equal weight (1/N)
Observations that are misclassified in the ith iteration are given higher weights in the (i+1)th iteration
Observations that are correctly classified in the ith iteration are given lower weights in the (i+1)th iteration
15. Gradient Boosting Machines
Generalization of AdaBoost to work with arbitrary loss functions resulted in GBM.
Gradient Boosting = Gradient Descent + Boosting
GBM uses the gradient descent algorithm, which can optimize any differentiable loss function.
In Adaboost, ‘shortcomings’ are identified by high-weight data points.
In Gradient Boosting, “shortcomings” are identified by negative gradients (also
called pseudo-residuals).
In GBM, instead of the reweighting used in AdaBoost, each new tree is fit to the
negative gradients (pseudo-residuals) of the current model.
Each tree in GBM is a successive gradient descent step.
AdaBoost is equivalent to forward stagewise additive modeling using the
exponential loss function.
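A toy R sketch of this idea (my own illustration, not from the deck), using squared-error loss, where the negative gradient is simply the residual:
library(rpart)
library(MASS)                                   # Boston data frame
set.seed(1861)
X    <- Boston[, setdiff(names(Boston), "medv")]
y    <- Boston$medv
nu   <- 0.1                                     # learning rate (shrinkage)
pred <- rep(mean(y), length(y))                 # initial model: predict the mean
for (m in 1:2) {
  resid <- y - pred                             # negative gradient of squared-error loss = residual
  fit   <- rpart(resid ~ ., data = cbind(X, resid = resid), maxdepth = 3)  # small tree fit to residuals
  pred  <- pred + nu * predict(fit, X)          # one gradient-descent step in function space
}
mean((y - pred)^2)                              # training MSE after two shrunken trees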
17. GBM – Key Hyperparameters
GBM has 3 types of hyperparameters:
Tree Structure
Max depth of the trees – Controls the degree of feature interactions
Min samples leaf – Minimum number of samples in a leaf node.
Number of Trees
Shrinkage
Learning rate - Slows learning by shrinking tree predictions.
Unlike fitting a single large decision tree to the data, which amounts
to fitting the data hard and potentially overfitting, the boosting
approach instead learns slowly.
Stochastic Gradient Boosting
SubSample: Select a random subset of the training set for fitting each tree rather than using the complete training data.
Max features: Select random subset of features for each tree.
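A caret tuning sketch analogous to the earlier randomForest example (my own, assuming the gbm package and the same Boston data; the grid values are illustrative, not tuned):
library(caret)
library(gbm)
library(MASS)                     # Boston data frame
set.seed(1861)
cv.ctrl <- trainControl(method = "repeatedcv", repeats = 2, number = 5)
gbm.grid <- expand.grid(n.trees = c(500, 1000, 2000),        # number of trees
                        interaction.depth = c(1, 3, 5),      # max depth: degree of feature interactions
                        shrinkage = c(0.01, 0.1),            # learning rate
                        n.minobsinnode = 10)                 # min samples per leaf node
gbm_tune <- train(medv ~ .,
                  data = Boston,
                  method = "gbm",
                  trControl = cv.ctrl,
                  tuneGrid = gbm.grid,
                  bag.fraction = 0.5,           # stochastic gradient boosting: subsample rows for each tree
                  verbose = FALSE)
gbm_tune
plot(gbm_tune)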