1. To bag, or to boost?
A question of balance
Alex Henderson
University of Manchester & SurfaceSpectra Ltd.
alexhenderson.info @AlexHenderson00
2. Acknowledgements
University of Gothenburg, Sweden
• Kelly Dimovska Nilsson
• John Fletcher
University of Manchester, UK
• Nick Lockyer
• UK Engineering and Physical Sciences Research Council
http://tiny.cc/join-sims-data
Slides and code will be made available on SlideShare and Bitbucket
3. Why machine learning?
Traditional multivariate statistics
• Largely linear separation
(single source of variance)
• Difficult to interpret loadings
from multi-class data sets
• New classes require the model to
be re-built
• Lots of experience in community
Machine learning
• Non-linear methods
• Largely binary models, but
multiclass varieties readily
available
• Can be extended when new
classes are added
• No ‘perfect’ classifier
• Relatively new – considered to
be ‘complicated’
4. Ensemble methods in machine learning
Ensemble learning: a collection (committee) of weak learners
5. Learners: The weak versus the strong
One strong learner
• Difficult to build
• Need lots of information
• Specialised to problem
• Can overfit
Many weak learners
• Easy to build
• Each learner is barely better
than guessing
• Generality
6. Learners: The weak versus the strong
(Same comparison, illustrated with stills from The Incredible Hulk / Avengers: Endgame for the one strong learner, and V for Vendetta for the many weak learners)
7. However: Blind scientists and an elephant…
Ancient Buddhist parable (adapted)
Image: https://imgbin.com/png/F7pvuyHE/blind-men-and-an-elephant-parable-point-of-view-fable-png
8. Ensemble strategies
Three pillars of ensemble systems:
Diversity
• How do we select the data each learner will use?
• Need different responses from each learner
Training model
• Which weak learning model should we apply?
Combination
• How should we combine the responses of the weak learners to form an overall
judgement?
9. Aside: Sampling methods
Sampling without replacement
1. Start with a collection C (size N)
2. Select one element from C, record its identity, and set it aside (do not return it to C)
3. Repeat step 2, M times (M < N)
Outcome is a subset of C: it contains M elements of C, with no duplicates
Sampling with replacement (bootstrap)
1. Start with a collection C (size N)
2. Select one element from C and record its identity
3. Return the selected element to C
4. Repeat steps 2 and 3, N times
Outcome is the same size as C, but contains duplicates
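A minimal MATLAB sketch of the two sampling schemes above, using randsample from the Statistics and Machine Learning Toolbox; the collection C is simply represented by its indices 1..N and the variable names are illustrative:

% Collection C of N items, represented here by the indices 1..N
N = 10;
M = 6;                                     % subset size for sampling without replacement

% Sampling without replacement: M distinct indices, no duplicates
idxWithout = randsample(N, M);

% Sampling with replacement (bootstrap): N indices, duplicates likely,
% and some members of C will typically be missing altogether
idxBootstrap = randsample(N, N, true);

% Number of distinct members picked up by the bootstrap sample
% (on average roughly 63% of N for large N)
numel(unique(idxBootstrap))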
10. Bagging (from bootstrap aggregation)
Diversity
• Sample with replacement
• Each weak learner gets a different version of the data set, with duplicates
Training model
• Decision tree
Combination
• Majority vote
• Outcome is the one most decision trees voted for
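A minimal, illustrative MATLAB sketch of the bagging recipe above (bootstrap samples, one decision tree per sample, majority vote). It assumes a spectra matrix X (nSpectra by nChannels), a numeric class label vector y, and a test matrix Xtest already exist; none of these names come from the talk itself:

nTrees = 50;
n = size(X, 1);
trees = cell(nTrees, 1);
for t = 1:nTrees
    idx = randsample(n, n, true);              % bootstrap sample: with replacement
    trees{t} = fitctree(X(idx,:), y(idx));     % one decision tree per bootstrap sample
end

% Combine the weak learners by majority vote
votes = zeros(size(Xtest,1), nTrees);
for t = 1:nTrees
    votes(:,t) = predict(trees{t}, Xtest);     % numeric class labels assumed
end
yPredicted = mode(votes, 2);                   % the class most decision trees voted for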
11. Boosting
Diversity
• Sample M from C without replacement
• Each weak learner gets a different subset of the data, without duplicates
Training model
• Decision tree (decision stumps: single split)
• Three steps, gradually improving classification, with data weights modified at each iteration
Combination
• Weighted majority vote
• Each decision tree's weight is calculated from its ability to handle difficult cases
• Outcome is the class receiving the largest weighted vote
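The weighted majority vote can be illustrated with a small MATLAB fragment: given the predictions of a few weak learners on one spectrum and a weight per learner (higher weight means it coped better with the difficult cases), the winning class is the one with the largest total weight. All numbers and names here are purely illustrative:

classes = [1 2 3];                  % possible class labels
preds   = [2 2 3 2 3];              % prediction from each of 5 weak learners
alpha   = [0.2 0.4 0.9 0.3 0.8];    % weight of each weak learner

% Sum the weights voting for each class, then pick the heaviest
weightedVotes = arrayfun(@(c) sum(alpha(preds == c)), classes);
[~, winner] = max(weightedVotes);
yPredicted = classes(winner);       % class 3 wins (0.9 + 0.8 = 1.7), despite fewer raw votes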
12. Random forest (Breiman, 2001)
• An example of bagging approach
• Many decision trees (~200-500)
• Sampling with replacement (bootstrap), but…
Random subspace approach
• Variables (mass values) are also randomly subsampled
• Each decision tree therefore gets its own bootstrapped version of the collection, and not necessarily the same set of peaks
• Helps prevent dominant features ‘hijacking’ the model
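In MATLAB the Statistics and Machine Learning Toolbox provides bagged decision trees through TreeBagger; a hedged sketch of a random forest along the lines described above (note that TreeBagger draws its random subset of predictors at each split, and the variable names and parameter values here are illustrative, not those behind the results shown later):

nTrees = 200;                                   % somewhere in the ~200-500 range
rf = TreeBagger(nTrees, Xtrain, ytrain, ...
        'Method', 'classification', ...
        'NumPredictorsToSample', round(sqrt(size(Xtrain,2))), ...  % random subspace of peaks per split
        'OOBPrediction', 'on');                 % keep out-of-bag predictions

oobErr = oobError(rf);                          % error estimate without a separate test set
yPredicted = predict(rf, Xtest);                % cell array of predicted class labels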
13. AdaBoost (Freund and Schapire, 1997)
• Name is a contraction of Adaptive Boosting
• Modification of original boosting approach
• Iterative boosting
• Subsequent iterations have misclassified spectra weighted more
highly
• A new learner is trained at each iteration to accommodate the updated
weights, so the ensemble is built sequentially
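A corresponding hedged MATLAB sketch of AdaBoost using fitcensemble from the Statistics and Machine Learning Toolbox; AdaBoostM2 is the multiclass variant, and the variable names and parameter values are illustrative only:

stump = templateTree('MaxNumSplits', 1);        % decision stump: a single split
ab = fitcensemble(Xtrain, ytrain, ...
        'Method', 'AdaBoostM2', ...             % multiclass AdaBoost
        'NumLearningCycles', 200, ...           % number of boosting iterations
        'Learners', stump);

yPredicted = predict(ab, Xtest);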
14. Pros and cons
Random forest
• Low bias (higher accuracy)
• Low variance (higher precision)
• Relatively stable
• Good with small training sets
• Amenable to parallel processing
• Interrogation possible
AdaBoost
• Model training is iterative
• Weights make it difficult to
interrogate
• Outliers can be difficult to
classify
Both are supervised classifiers, so both require labelled data
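Two of the random-forest advantages listed above (parallel processing and interrogation) map onto TreeBagger options; a hedged sketch, noting that the exact importance property name varies between MATLAB releases and that all variable names are illustrative:

opts = statset('UseParallel', true);            % grow trees on multiple cores (needs Parallel Computing Toolbox)
rf = TreeBagger(200, Xtrain, ytrain, ...
        'Method', 'classification', ...
        'OOBPredictorImportance', 'on', ...     % enables interrogation of the model
        'Options', opts);

% Permuted out-of-bag importance: which mass channels drive the classification?
% (older releases call this property OOBPermutedVarDeltaError)
importance = rf.OOBPermutedPredictorDeltaError;
[~, topPeaks] = sort(importance, 'descend');    % indices of the most influential peaks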
16. • Bacterial colonies, spotted on silicon wafer
• Data acquisition using Ionoptika J105 instrument
• Data exported in HDF5 format, an open standard, so easy to read
• Example of Data Sharing
Anal. Chem. 2019, 91, 11355−11361
https://doi.org/10.1021/acs.analchem.9b02533
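Because the data are shared in the open HDF5 format, they can be read with base MATLAB; a minimal sketch, where the file name and dataset path are placeholders (inspect the real layout with h5disp first):

h5disp('ecoli_slide.h5');                                   % list the file's groups and datasets (placeholder file name)
spectra = h5read('ecoli_slide.h5', '/path/to/spectra');     % placeholder dataset path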
17. Data analysis toolchain
• MATLAB (R2018a)
• Image Processing Toolbox
• Statistics and Machine Learning Toolbox
• ChiToolbox
• Open source (GPL 3.0)
• https://bitbucket.org/AlexHenderson/chitoolbox/
(Machine learning algorithms also available in R, Python, Java etc.)
18. E. coli mutant strains spotted on silicon
Each column represents a technical replicate
Image from: Anal. Chem. 2019, 91, 11355−11361
19. Statistics
• 320 × 480 = 153600 pixels
• 100 – 2000 amu
• 16278 mass channels
• ≈2.5 billion data points overall (153600 pixels × 16278 mass channels)
• 19.5 GB in memory
• Spectra downsampled to 8 ns
(Figures: raw data; total ion image)
20. Extract spectra from spot locations
(Figures: total ion image; edge detection to identify spots, coloured by spot id, not biological strain or SIMS)
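A hedged MATLAB sketch of the spot-finding step using the Image Processing Toolbox; this is one plausible recipe rather than the exact one used here, and the names, the size threshold and the assumption that the spectra matrix rows follow MATLAB's column-major pixel order are all illustrative:

% totalIonImage is the 320 x 480 total ion image
bw     = edge(mat2gray(totalIonImage), 'Canny');   % detect spot edges
bw     = imfill(bw, 'holes');                      % close the outlines into solid spots
bw     = bwareaopen(bw, 50);                       % drop small specks (threshold is illustrative)
spotId = bwlabel(bw);                              % label each spot with its own id

% Pull the spectra belonging to one spot out of the (pixels x channels) data matrix
pix = find(spotId == 1);
spotSpectra = spectra(pix, :);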
21. Extract spectra from spot locations
• Spectra from the top row of spots become the training set
• Some spectra from substrate pixels are also added to the training set
• All spectra are used as an independent test set
(Figure: edge detection to identify spots, coloured by spot id, not biological strain or SIMS)
22. Training and test sets
• We have 7 classes: 6 biological strains + substrate
• A holdout sample was taken, with an 80:20 ratio of stratified classes
• Therefore 80% of the spectra in each spot in the first row (and the substrate)
were used to train the model
• Roughly 1750 spectra from each class in the training set
• The remaining 20% were used to test the model (the 'inside same spot' test)
• The entire slide was then predicted (the 'whole slide' test)
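The stratified 80:20 holdout can be expressed with cvpartition from the Statistics and Machine Learning Toolbox; a minimal sketch, assuming X holds the candidate training spectra and labels their 7-class labels (names are illustrative):

c = cvpartition(labels, 'HoldOut', 0.2);    % stratified by class: keeps the 80:20 ratio within each class

Xtrain = X(training(c), :);                 % ~80% of each class used to build the model
ytrain = labels(training(c));

Xtest  = X(test(c), :);                     % remaining 20%: the 'inside same spot' test
ytest  = labels(test(c));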
24. Classification of all pixels in image
Random forest: total ion image alongside the correctly classified pixels (shown in white); orange border indicates the limits of the SIMS image
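Predicting every pixel and building the 'correctly classified in white' image can be sketched as follows; rf is a trained model (for example the TreeBagger sketched earlier), allSpectra the full pixels-by-channels matrix, trueLabel the per-pixel ground-truth label, and imageRows/imageCols the image dimensions. All of these names are illustrative:

predicted  = categorical(predict(rf, allSpectra));      % classify every pixel; TreeBagger returns cell labels
correct    = (predicted == categorical(trueLabel));     % logical: was each pixel classified correctly?
correctMap = reshape(correct, imageRows, imageCols);    % back to image geometry (320 x 480 pixels)
imshow(correctMap);                                     % correctly classified pixels appear white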
25. Classification of all pixels in image
Random forest: correctly classified pixels shown in white; orange border indicates the limits of the SIMS image
• 'Circles' around spots:
• Artefact of edge detection?
• Mislabelled pixels?
• Coffee-ring effect of spotting?
• Column 5 is badly misclassified
26. Mean spectra from column 5
(Figures: spot at column 5, row 1 (the exemplar); spot at column 5, row 3)
The spectra appear to be quite different, so the 'misclassification' may in fact be correct
28. Classification of all pixels in image
AdaBoost: total ion image alongside the correctly classified pixels (shown in white); orange border indicates the limits of the SIMS image
29. Classification of all pixels in image
Random forest (88.1% correctly classified)   AdaBoost (88.9% correctly classified)
Subtle differences, but largely the same outcome
30. Comparison (full mass resolution)
Metric                                     Random forest    AdaBoost
Model building time                        1 hr 20 min      2 hr 20 min
Prediction time (inside same spot test)    7 min 10 sec     7 min 25 sec
Classification (inside same spot test)     98.4%            99.0%
Prediction time (whole slide test)         40 min 20 sec    38 min 2 sec
Classification (whole slide test)          88.1%            88.9%
32. Comparison (500 peaks)
Metric                                     Random forest    AdaBoost
Model building time                        5 min 31 sec     13 min 29 sec
Prediction time (inside same spot test)    23.1 sec         22.7 sec
Classification (inside same spot test)     99.39%           99.4%
Prediction time (whole slide test)         2 min 3 sec      1 min 35 sec
Classification (whole slide test)          88.1%            88.9%
35. Drawbacks
• Need to decide on number of trees (RF), or iterations (AB)
• Possible to calculate an appropriate number retrospectively
• Ideally should have balanced classes (numbers of spectra)
• Some classes may be under-represented
• Works best with many spectra
• Outliers can be mis-classified or difficult to manage in model building
• Not perfectly repeatable due to random sampling
• If working on a computer cluster, take care with random number
seeds
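Because both methods rely on random sampling, fixing the random number generator seed before training makes a run repeatable on a single machine; a minimal sketch with illustrative names (on a cluster each parallel worker has its own random stream, so more care is needed):

rng(1234, 'twister');            % fix the seed so the bootstrap samples are reproducible
rf = TreeBagger(200, Xtrain, ytrain, 'Method', 'classification');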
36. Prediction issues
• Prediction data must have the same (number of) variables as the training data
• The number of mass peaks (variables) must match
• Each variable must correspond to the same mass
• Mass calibration can cause problems
• Peak detection limits must be the same for the training model and the prediction
data
• These are hard classifiers, so outliers are always put into a class, even if 'none of the
above' should apply (this can be mitigated using the probability of classification)
Applies to all types of prediction including traditional statistics approaches
(PLS-DA, CVA-QDA)
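The 'probability of classification' mitigation mentioned above can be sketched for TreeBagger, whose second output gives a per-class score (the fraction of trees voting for each class); pixels where no class wins convincingly can then be flagged rather than forced into a class. The 0.5 threshold and the 'unassigned' label are illustrative choices:

[labels, scores] = predict(rf, Xtest);      % scores: per-class vote fraction for each spectrum

maxScore  = max(scores, [], 2);
uncertain = maxScore < 0.5;                 % no class reached 50% of the vote

labels(uncertain) = {'unassigned'};         % treat these as 'none of the above' rather than trusting the hard label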
37. Extensions
Many versions of bagging/boosting algorithms.
Some tuned to specific scenarios:
• Regression (fitting data)
• Missing data
• Adding new classes without rebuilding model
• Incremental update to model without rebuilding (streaming data)
• Combining different types of data: categorical and continuous
• Model of models (mixture of experts, MoE)
38. Summary
• Ensemble machine learning brings additional data analysis tools to
assist the analyst
• Both AdaBoost and Random forest can perform regression in addition
to classification
• Both AdaBoost and Random forest produce high classification rates
• AdaBoost is slightly more accurate, but is somewhat slower
• Random forest can be interrogated to identify which spectral features
drive the classification/regression
• Random forest can take advantage of modern multi-core computers
Why not try it out? What do you have to lose?