To bag, or to boost?
A question of balance
Alex Henderson
University of Manchester & SurfaceSpectra Ltd.
alexhenderson.info @AlexHenderson00
Acknowledgements
University of Gothenburg, Sweden
• Kelly Dimovska Nilsson
• John Fletcher
University of Manchester, UK
• Nick Lockyer
• UK Engineering and Physical Sciences Research Council
http://tiny.cc/join-sims-data
Slides and code will be made available on SlideShare and Bitbucket
Why machine learning?
Traditional multivariate statistics
• Largely linear separation
(single source of variance)
• Difficult to interpret loadings
from multi-class data sets
• New classes require the model to
be rebuilt
• Lots of experience in community
Machine learning
• Non-linear methods
• Largely binary models, but
multiclass varieties readily
available
• Can be extended when new
classes are added
• No ‘perfect’ classifier
• Relatively new – considered to
be ‘complicated’
Ensemble methods in machine learning
Ensemble: a collection (committee) of weak learners
Learners: The weak versus the strong
One strong learner
• Difficult to build
• Need lots of information
• Specialised to problem
• Can overfit
Many weak learners
• Easy to build
• Each learner is barely better
than guessing
• Generality
Images: The Incredible Hulk (Avengers: Endgame) | V for Vendetta
However: Blind scientists and an elephant…
Ancient Buddhist parable (adapted)
Image: https://imgbin.com/png/F7pvuyHE/blind-men-and-an-elephant-parable-point-of-view-fable-png
Ensemble strategies
Three pillars of ensemble systems:
Diversity
• How do we select the data each learner will use?
• Need different responses from each learner
Training model
• Which weak learning model should we apply?
Combination
• How should we combine the responses of the weak learners to form an overall
judgement?
Aside: Sampling methods
Sampling without replacement
1. Start with a collection C (size N)
2. Select one element from C, record its identity and set it aside
3. Repeat step 2 until M elements have been selected (M < N)
Outcome is a subset of C:
contains M elements of C, with no duplicates
Sampling with replacement (bootstrap)
1. Start with a collection C (size N)
2. Select one element from C and record its identity
3. Return the selected element to C
4. Repeat steps 2 and 3 until N selections have been made
Outcome is the same size as C, but usually contains duplicates
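The two sampling schemes can be sketched with Python's standard library. This is an illustration only; the function names and the toy collection are assumptions, not part of the talk's code.

```python
import random

def sample_without_replacement(collection, m, rng):
    """Draw m distinct elements (m < len(collection)); no duplicates."""
    return rng.sample(collection, m)

def bootstrap_sample(collection, rng):
    """Draw len(collection) elements with replacement; duplicates are likely."""
    return rng.choices(collection, k=len(collection))

rng = random.Random(0)      # fixed seed only so the run is repeatable
C = list(range(10))         # toy collection of size N = 10
subset = sample_without_replacement(C, 6, rng)
boot = bootstrap_sample(C, rng)
print("without replacement:", subset)
print("bootstrap:          ", boot)
```

The first result always has M distinct members; the second always has N members, some of which usually repeat while others are left out.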
Bagging (from bootstrap aggregation)
Diversity
• Sample with replacement
• Each weak learner gets a different version of the data set, with duplicates
Training model
• Decision tree
Combination
• Majority vote
• Outcome is the one most decision trees voted for
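A minimal sketch of bagging, assuming single-split decision trees (stumps) as the weak learners and a tiny made-up data set. All names here are hypothetical; the talk's own analysis used MATLAB, so this only illustrates the idea.

```python
import random
from collections import Counter

def train_stump(X, y):
    """Fit a single-split tree: keep the (feature, threshold) with fewest errors."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    return best[1:]

def predict_stump(stump, row):
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label

def bagging_fit(X, y, n_learners, rng):
    """Diversity: every learner is trained on its own bootstrap sample."""
    n = len(X)
    learners = []
    for _ in range(n_learners):
        idx = rng.choices(range(n), k=n)          # sample with replacement
        learners.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return learners

def bagging_predict(learners, row):
    """Combination: unweighted majority vote over all learners."""
    votes = Counter(predict_stump(s, row) for s in learners)
    return votes.most_common(1)[0][0]

# Toy data: two features, classes separable on the first feature only
X = [[0.1, 5], [0.2, 3], [0.3, 4], [0.4, 6], [0.9, 5], [1.0, 3], [1.1, 4], [1.2, 6]]
y = ["a"] * 4 + ["b"] * 4
learners = bagging_fit(X, y, n_learners=25, rng=random.Random(0))
print(bagging_predict(learners, [0.1, 5]), bagging_predict(learners, [1.2, 5]))
```

Each learner sees a slightly different version of the data, so individual stumps may disagree; the vote smooths those differences out.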
Boosting
Diversity
• Sample M from C without replacement
• Each weak learner gets a different subset of the data, without duplicates
Training model
• Decision tree (decision stumps: single split)
• Three steps, gradually improving classification, with data weights modified at each iteration
Combination
• Weighted majority vote
• Decision tree weights are calculated based on their ability to handle difficult cases
• Outcome is the class with the largest weighted vote
Random forest (Breiman, 2001)
• An example of the bagging approach
• Many decision trees (~200-500)
• Sampling with replacement (bootstrap), but…
Random subspace approach
• Variables (mass values) are also randomly sampled (a random subset, usually drawn without replacement)
Each decision tree gets a version of the collection, but not necessarily the
same peaks
• Helps prevent dominant features ‘hijacking’ the model
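The random subspace idea can be sketched as follows. Two details are assumptions rather than statements from the slides: the variable subset is drawn without replacement, and its size is the square root of the number of variables, both of which are common defaults in random forest implementations.

```python
import math
import random

def draw_tree_inputs(n_spectra, n_variables, rng):
    """Choose the spectra (rows) and mass channels (columns) one tree will see."""
    rows = rng.choices(range(n_spectra), k=n_spectra)    # bootstrap of spectra
    k = max(1, int(math.sqrt(n_variables)))              # assumed subset size
    cols = sorted(rng.sample(range(n_variables), k))     # distinct variables
    return rows, cols

rng = random.Random(1)
for tree in range(3):
    rows, cols = draw_tree_inputs(n_spectra=8, n_variables=16, rng=rng)
    print(f"tree {tree}: spectra {rows}, variables {cols}")
```

Because each tree works from a different handful of peaks, one very strong peak cannot appear in every tree and dominate the vote.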
AdaBoost (Freund and Schapire, 1997)
• Name is a contraction of Adaptive Boosting
• Modification of original boosting approach
• Iterative boosting
• Subsequent iterations have misclassified spectra weighted more
highly
• Learners need to be rebuilt on each iteration to accommodate new
weights
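A compact sketch of the AdaBoost loop for the two-class case, assuming labels coded as +1 and -1, decision stumps as weak learners and the standard learner weight 0.5 * ln((1 - err) / err). The function names and toy data are hypothetical; multiclass variants differ in detail.

```python
import math

def train_weighted_stump(X, y, w):
    """Pick the (feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[f] <= t else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, polarity)
    return best

def stump_predict(f, t, polarity, row):
    return polarity if row[f] <= t else -polarity

def adaboost_fit(X, y, n_rounds):
    n = len(X)
    w = [1.0 / n] * n                              # start with equal weights
    ensemble = []
    for _ in range(n_rounds):
        # The learner is rebuilt on every iteration using the current weights
        err, f, t, polarity = train_weighted_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)      # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # weight of this learner
        ensemble.append((alpha, f, t, polarity))
        # Misclassified spectra are weighted up, correct ones weighted down
        w = [wi * math.exp(-alpha * yi * stump_predict(f, t, polarity, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    """Weighted majority vote of all stumps."""
    score = sum(a * stump_predict(f, t, p, row) for a, f, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy problem that no single stump can solve: the -1 class sits in the middle
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(X, y, n_rounds=3)
print([adaboost_predict(model, row) for row in X])
```

On this toy problem the best single stump misclassifies two of the six points, whereas three reweighted stumps combined classify all six correctly.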
Pros and cons
Random forest
• Low bias (higher accuracy)
• Low variance (higher precision)
• Relatively stable
• Good with small training sets
• Amenable to parallel processing
• Interrogation possible
AdaBoost
• Model training is iterative
• Weights make it difficult to
interrogate
• Outliers can be difficult to
classify
Supervised classification, so both require labelled data
Example
• Bacterial colonies, spotted on silicon wafer
• Data acquisition using Ionoptika J105 instrument
• Data exported in HDF5 format, an open standard, so easy to read
• Example of Data Sharing
Anal. Chem. 2019, 91, 11355−11361
https://doi.org/10.1021/acs.analchem.9b02533
Data analysis toolchain
• MATLAB (R2018a)
• Image Processing Toolbox
• Statistics and Machine Learning Toolbox
• ChiToolbox
• Open source (GPL 3.0)
• https://bitbucket.org/AlexHenderson/chitoolbox/
(Machine learning algorithms also available in R, Python, Java etc.)
E. coli mutant strains spotted on silicon
Each column represents a technical replicate
Image from: Anal. Chem. 2019, 91, 11355−11361
Statistics
• 320 × 480 = 153600 pixels
• 100 – 2000 amu
• 16278 mass channels
• ≈ 2.5 billion data points overall (pixels × mass channels)
• 19.5 GB in memory
• Spectra downsampled to 8 ns
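The size figures follow from simple arithmetic. The sketch below assumes 8 bytes per value (double precision), which is not stated on the slide; that assumption gives about 20 GB, the same order as the 19.5 GB quoted, and the exact figure depends on how the data are actually stored.

```python
pixels = 320 * 480            # image dimensions from the slide
mass_channels = 16278         # channels per spectrum
values = pixels * mass_channels
bytes_per_value = 8           # assumption: double precision storage

print(pixels)                 # 153600
print(values)                 # 2500300800
print(f"{values * bytes_per_value / 1e9:.1f} GB (decimal)")
print(f"{values * bytes_per_value / 2**30:.1f} GiB (binary)")
```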
Raw data
Total ion image
Extract spectra from spot locations
Total ion image | Edge detection to identify spots
Coloured by spot id, not biological strain or SIMS
• Spectra from top row of spots
become the training set
• Some spectra from substrate
pixels also added to training set
• All spectra used as independent
test set
Training and test sets
• Have 7 classes: 6 biological strains + substrate
• Holdout sample taken, 80:20 ratio of stratified classes
• Therefore 80% of spectra in each spot in first row (and substrate)
used to train model
• Roughly 1750 spectra from each class in training set
• Remaining 20% used to test the model (inside same spot test)
• Next predict the entire slide (whole slide test)
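A stratified 80:20 holdout can be sketched as below: the split is made separately within each class so every class keeps the same ratio. The function name and the toy labels are assumptions; the talk's own split was made in MATLAB.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, train_fraction, rng):
    """Return (train_indices, test_indices), split class by class."""
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                           # random order within the class
        cut = round(train_fraction * len(idx))     # 80% of this class
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)

# Toy labels: two strains and the substrate, deliberately unequal in size
labels = ["strain_A"] * 10 + ["strain_B"] * 10 + ["substrate"] * 5
train_idx, test_idx = stratified_holdout(labels, 0.8, random.Random(42))
print(len(train_idx), len(test_idx))               # 20 5
```

Without stratification a small class such as the substrate could, by chance, end up almost entirely in one of the two sets.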
Random forest
500 trees
Classification of all pixels in image
Total ion image | Correctly classified in white
Orange border to indicate limits of SIMS image
• ‘Circles’ around spots:
• Artefact of edge detection?
• Mislabelled pixels?
• Coffee-ring effect of spotting?
• Column 5 is badly misclassified
Random forest
Mean spectra from column 5
Spot at column 5, row 1 (exemplar) Spot at column 5, row 3
Spectra appear to be quite different | Misclassification may be correct
AdaBoost
500 iterations
Classification of all pixels in image
Total ion image | Correctly classified in white
Orange border to indicate limits of SIMS image
AdaBoost versus random forest
Classification of all pixels in image
AdaBoost (88.9% correctly classified) | Random forest (88.1% correctly classified)
Subtle differences, but largely the same outcome
Comparison (full mass resolution)
                                          Random forest    AdaBoost
Model building time                       1 hr 20 min      2 hr 20 min
Prediction time (inside same spot test)   7 min 10 sec     7 min 25 sec
Classification (inside same spot test)    98.4%            99.0%
Prediction time (whole slide test)        40 min 20 sec    38 min 2 sec
Classification (whole slide test)         88.1%            88.9%
Peak detection results
500 most intense spectral features
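The slides do not describe the peak detection method. As a deliberately simplified stand-in, the sketch below ranks mass channels by their mean intensity and keeps the top k; a real peak picker would look for local maxima rather than individual channels. All names are hypothetical.

```python
def mean_spectrum(spectra):
    """Average intensity in each mass channel across all spectra."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]

def top_k_channels(spectra, k):
    """Indices of the k channels with the highest mean intensity."""
    mean = mean_spectrum(spectra)
    ranked = sorted(range(len(mean)), key=lambda i: mean[i], reverse=True)
    return sorted(ranked[:k])

def reduce_spectra(spectra, channels):
    """Keep only the selected channels, in the same order for every spectrum."""
    return [[row[c] for c in channels] for row in spectra]

spectra = [[0, 5, 1, 9, 2],
           [1, 7, 0, 8, 2]]
keep = top_k_channels(spectra, 2)
reduced = reduce_spectra(spectra, keep)
print(keep, reduced)
```

The same channel list has to be applied to the training data and to any data predicted later, so that every variable refers to the same mass.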
Comparison (500 peaks)
                                          Random forest    AdaBoost
Model building time                       5 min 31 sec     13 min 29 sec
Prediction time (inside same spot test)   23.1 sec         22.7 sec
Classification (inside same spot test)    99.39%           99.4%
Prediction time (whole slide test)        2 min 3 sec      1 min 35 sec
Classification (whole slide test)         88.1%            88.9%
Which peaks contributed most?
Mean spectrum Variable importance
Drawbacks
• Need to decide on number of trees (RF), or iterations (AB)
• Possible to calculate an appropriate number retrospectively
• Ideally should have balanced classes (numbers of spectra)
• Some classes may be under-represented
• Works best with many spectra
• Outliers can be mis-classified or difficult to manage in model building
• Not perfectly repeatable due to random sampling
• If working on a computer cluster, take care with random number
seeds
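One way to keep random sampling repeatable across several workers is to give each worker its own generator, seeded from a shared base seed plus the worker's identifier. This is a sketch of that idea only; the names and the seed format are assumptions, and cluster frameworks usually offer their own mechanisms.

```python
import random

def worker_rng(base_seed, worker_id):
    """Independent, reproducible generator for one worker."""
    return random.Random(f"{base_seed}:{worker_id}")   # string seeds are allowed

def bootstrap_indices(n, rng):
    return rng.choices(range(n), k=n)

# Two runs with the same base seed give identical samples for every worker
run_a = [bootstrap_indices(5, worker_rng(2019, w)) for w in range(3)]
run_b = [bootstrap_indices(5, worker_rng(2019, w)) for w in range(3)]
print(run_a == run_b)
```

If every worker simply reused the same seed, all of them would draw the same bootstrap sample and the ensemble would lose its diversity.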
Prediction issues
• Data must have same (number of) variables
• Number of mass peaks (variables)
• Each variable must correspond to same mass
• Mass calibration can cause problems
• Peak detection limits must be the same for training model and prediction
data
• Hard classifiers, so outliers always put into a class, even if ‘none of the
above’ should apply. (Can be mitigated using probability of classification)
Applies to all types of prediction including traditional statistics approaches
(PLS-DA, CVA-QDA)
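The mitigation mentioned above can be sketched by treating the fraction of learners voting for the winning class as a rough probability, and refusing to assign a class when it falls below a threshold. The threshold of 0.6 and the "unassigned" label are illustrative assumptions, not values from the talk.

```python
from collections import Counter

def classify_with_reject(votes, min_fraction=0.6, reject_label="unassigned"):
    """Return (label, vote fraction); reject when the ensemble is not confident."""
    label, count = Counter(votes).most_common(1)[0]
    fraction = count / len(votes)
    if fraction < min_fraction:
        return reject_label, fraction
    return label, fraction

confident = classify_with_reject(["A"] * 9 + ["B"])
uncertain = classify_with_reject(["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"])
print(confident)   # ('A', 0.9)
print(uncertain)   # ('unassigned', 0.4)
```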
Extensions
Many versions of bagging/boosting algorithms.
Some tuned to specific scenarios:
• Regression (fitting data)
• Missing data
• Adding new classes without rebuilding model
• Incremental update to model without rebuilding (streaming data)
• Combining different types of data: categorical and continuous
• Model of models (mixture of experts, MoE)
Summary
• Ensemble machine learning brings additional data analysis tools to
assist the analyst
• Both AdaBoost and Random forest can perform regression in addition
to classification
• Both AdaBoost and Random forest produce high classification rates
• AdaBoost is slightly more accurate, but takes longer to build the model
• Random forest can be interrogated to identify which spectral features
drive the classification/regression
• Random forest can take advantage of modern multi-core computers
Why not try it out? What do you have to lose?

 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 

To bag, or to boost? A question of balance

  • 1. To bag, or to boost? A question of balance Alex Henderson University of Manchester & SurfaceSpectra Ltd. alexhenderson.info @AlexHenderson00
  • 2. Acknowledgements University of Gothenburg, Sweden • Kelly Dimovska Nilsson • John Fletcher University of Manchester, UK • Nick Lockyer • UK Engineering and Physical Sciences Research Council http://tiny.cc/join-sims-data Slides and code will be made available on SlideShare and Bitbucket
  • 3. Why machine learning? Traditional multivariate statistics • Largely linear separation (single source of variance) • Difficult to interpret loadings from multi-class data sets • New classes require the model to be rebuilt • Lots of experience in community Machine learning • Non-linear methods • Largely binary models, but multiclass varieties readily available • Can be extended when new classes are added • No ‘perfect’ classifier • Relatively new – considered to be ‘complicated’
  • 4. Ensemble methods in machine learning Machine learning: Collection (committee) of weak learners
  • 5. Learners: The weak versus the strong One strong learner • Difficult to build • Need lots of information • Specialised to problem • Can overfit Many weak learners • Easy to build • Each learner is barely better than guessing • Generality
  • 6. Learners: The weak versus the strong One strong learner • Difficult to build • Need lots of information • Specialised to problem • Can overfit Many weak learners • Easy to build • Each learner is barely better than guessing • Generality The Incredible Hulk. Avengers: Endgame V For Vendetta
  • 7. However: Blind scientists and an elephant… Ancient Buddhist parable (adapted) Image: https://imgbin.com/png/F7pvuyHE/blind-men-and-an-elephant-parable-point-of-view-fable-png
  • 8. Ensemble strategies Three pillars of ensemble systems: Diversity • How do we select the data each learner will use? • Need different responses from each learner Training model • Which weak learning model should we apply? Combination • How should we combine the responses of the weak learners to form an overall judgement?
  • 9. Aside: Sampling methods Sampling without replacement 1. Start with a collection C (size N) 2. Select one element from C and record its identity 3. Repeat step 2 until M elements have been selected (M < N) Outcome is a subset of C: contains M elements of C, with no duplicates Sampling with replacement (bootstrap) 1. Start with a collection C (size N) 2. Select one element from C and record its identity 3. Return the selected element to C 4. Repeat steps 2 and 3 until N elements have been recorded Outcome is the same size as C, but can contain duplicates
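The two sampling schemes on this slide can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the toy collection are assumptions made for the example, not part of the original talk.

```python
import random

def sample_without_replacement(collection, m, rng):
    """Draw m distinct elements: a subset of the collection with no duplicates."""
    return rng.sample(collection, m)

def bootstrap_sample(collection, rng):
    """Draw len(collection) elements, returning each to the pool after it is recorded."""
    return [rng.choice(collection) for _ in collection]

rng = random.Random(0)        # fixed seed so the draws are repeatable
C = list(range(10))           # toy collection of size N = 10
subset = sample_without_replacement(C, 6, rng)   # M = 6 < N
boot = bootstrap_sample(C, rng)                  # same size as C, duplicates allowed
```

The subset always holds six distinct elements, whereas the bootstrap sample has ten entries and will usually repeat some elements while omitting others.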
  • 10. Bagging (from bootstrap aggregation) Diversity • Sample with replacement • Each weak learner gets a different version of the data set, with duplicates Training model • Decision tree Combination • Majority vote • Outcome is the one most decision trees voted for
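A minimal sketch of bagging follows, assuming a one-variable toy data set and a single-split decision tree (a stump) as the weak learner. The helper names `fit_stump`, `bagging_fit` and `bagging_predict` are hypothetical, and a real analysis would use full decision trees on many variables.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Weak learner: one threshold on one variable, chosen to minimise training errors."""
    best = None
    for t in sorted(set(xs)):
        for low, high in ((0, 1), (1, 0)):
            errors = sum((low if x <= t else high) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, low, high)
    _, t, low, high = best
    return lambda x: low if x <= t else high

def bagging_fit(xs, ys, n_learners, rng):
    """Diversity: each learner is trained on its own bootstrap sample of the data."""
    n = len(xs)
    learners = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        learners.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return learners

def bagging_predict(learners, x):
    """Combination: unweighted majority vote over all learners."""
    votes = Counter(h(x) for h in learners)
    return votes.most_common(1)[0][0]

xs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1.0]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
learners = bagging_fit(xs, ys, n_learners=25, rng=random.Random(0))
low_class = bagging_predict(learners, 0.0)
high_class = bagging_predict(learners, 1.0)
```

Each stump sees a slightly different data set, so the thresholds differ from learner to learner; the vote smooths out those differences.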
  • 11. Boosting Diversity • Sample M from C without replacement • Each weak learner gets a different subset of the data, without duplicates Training model • Decision tree (decision stumps: single split) • Three steps, gradually improving classification, with data weights modified at each iteration Combination • Weighted majority vote • Decision tree weights are calculated based on their ability to handle difficult cases • Outcome is the class with the largest weighted vote
  • 12. Random forest (Breiman, 2001) • An example of the bagging approach • Many decision trees (~200-500) • Sampling with replacement (bootstrap), but… Random subspace approach • Variables (mass values) are also sampled at random (a subset of the variables, typically redrawn at each split) Each decision tree gets a version of the collection, but not necessarily the same peaks • Helps prevent dominant features ‘hijacking’ the model
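The random subspace idea can be sketched as follows. This is an illustration under stated assumptions: the variable subset is drawn once per tree, without replacement, and its size is the square root of the number of variables. Many random forest implementations instead redraw the subset at every split. The function name `draw_tree_view` is hypothetical.

```python
import math
import random

def draw_tree_view(n_spectra, n_variables, rng):
    """Return the (row indices, column indices) one tree would be trained on."""
    rows = [rng.randrange(n_spectra) for _ in range(n_spectra)]   # bootstrap of spectra
    k = max(1, round(math.sqrt(n_variables)))                     # assumed subset size
    cols = sorted(rng.sample(range(n_variables), k))              # random subset of mass channels
    return rows, cols

rng = random.Random(1)
views = [draw_tree_view(n_spectra=8, n_variables=16, rng=rng) for _ in range(3)]
```

Because each tree sees only a few of the mass channels, a single very intense peak cannot dominate every tree in the forest.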
  • 13. AdaBoost (Freund and Schapire, 1997) • Name is a contraction of Adaptive Boosting • Modification of original boosting approach • Iterative boosting • Subsequent iterations have misclassified spectra weighted more highly • Learners need to be rebuilt on each iteration to accommodate new weights
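The reweighting loop can be sketched with decision stumps on a one-variable toy problem. This is a minimal illustration of the standard two-class AdaBoost update with labels coded as -1/+1, not the code used for the results in this talk; all function names and the toy data are assumptions.

```python
import math

def fit_weighted_stump(xs, ys, w):
    """Stump with the lowest weighted error: predicts sign if x > t, else -sign."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if (sign if x > t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost_fit(xs, ys, n_rounds):
    n = len(xs)
    w = [1.0 / n] * n                                  # start with equal weights
    model = []
    for _ in range(n_rounds):
        err, t, sign = fit_weighted_stump(xs, ys, w)   # learner rebuilt with current weights
        err = min(max(err, 1e-10), 1 - 1e-10)          # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)        # vote weight of this learner
        model.append((alpha, t, sign))
        # Misclassified points are weighted up, correct ones down, then renormalised
        w = [wi * math.exp(-alpha * y * (sign if x > t else -sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_predict(model, x):
    """Weighted majority vote of all stumps."""
    score = sum(alpha * (sign if x > t else -sign) for alpha, t, sign in model)
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]                 # no single stump separates this
model = adaboost_fit(xs, ys, n_rounds=3)
predictions = [adaboost_predict(model, x) for x in xs]
```

No single stump gets fewer than three of these ten points wrong, but the weighted vote of three stumps classifies all of them correctly.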
  • 14. Pros and cons Random forest • Low bias (higher accuracy) • Low variance (higher precision) • Relatively stable • Good with small training sets • Amenable to parallel processing • Interrogation possible AdaBoost • Model training is iterative • Weights make it difficult to interrogate • Outliers can be difficult to classify Supervised classification so both require labelled data
  • 16. • Bacterial colonies, spotted on silicon wafer • Data acquisition using Ionoptika J105 instrument • Data exported in HDF5 format, an open standard, so easy to read • Example of Data Sharing Anal. Chem. 2019, 91, 11355−11361 https://doi.org/10.1021/acs.analchem.9b02533
  • 17. Data analysis toolchain • MATLAB (R2018a) • Image Processing Toolbox • Statistics and Machine Learning Toolbox • ChiToolbox • Open source (GPL 3.0) • https://bitbucket.org/AlexHenderson/chitoolbox/ (Machine learning algorithms also available in R, Python, Java etc.)
  • 18. E. coli mutant strains spotted on silicon Each column represents a technical replicate Image from: Anal. Chem. 2019, 91, 11355−11361
  • 19. Statistics • 320 × 480 = 153600 pixels • 100 – 2000 amu • 16278 mass channels • 2.5 B channels overall • 19.5 GB in memory • Spectra downsampled to 8 ns Raw data Total ion image
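The size figures on this slide follow from simple arithmetic. The sketch below assumes each value is stored as an 8-byte double, which is not stated in the slides; under that assumption the total comes to roughly 20 GB in decimal units (about 18.6 GiB), in the same region as the 19.5 GB quoted.

```python
pixels = 320 * 480                 # image dimensions from the slide
mass_channels = 16278              # channels per spectrum
values = pixels * mass_channels    # total number of stored intensities

bytes_per_value = 8                # assumption: double precision storage
gigabytes = values * bytes_per_value / 1e9
gibibytes = values * bytes_per_value / 2**30
```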
  • 20. Extract spectra from spot locations Total ion image Edge detection to identify spots Coloured by spot id, not biological strain or SIMS
  • 21. • Spectra from the top row of spots become the training set • Some spectra from substrate pixels also added to training set • All spectra used as independent test set Extract spectra from spot locations Edge detection to identify spots Coloured by spot id, not biological strain or SIMS
  • 22. Training and test sets • Have 7 classes: 6 biological strains + substrate • Holdout sample taken, 80:20 ratio of stratified classes • Therefore 80% of spectra in each spot in first row (and substrate) used to train model • Roughly 1750 spectra from each class in training set • Remaining 20% used to test the model (inside same spot test) • Next predict the entire slide (whole slide test)
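A stratified 80:20 holdout can be sketched as below: the split is made separately within each class so that every class keeps the same train/test ratio. The function name, class labels and class sizes are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, train_fraction, rng):
    """Split indices into train/test so each class keeps the same ratio."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                        # random order within the class
        cut = round(train_fraction * len(indices))  # 80% of this class
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test

labels = ["strain_A"] * 10 + ["strain_B"] * 20 + ["substrate"] * 5
train, test = stratified_holdout(labels, 0.8, random.Random(42))
```

Here 28 of the 35 toy spectra go to training (8, 16 and 4 from the three classes) and the remaining 7 are held out.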
  • 24. Classification of all pixels in image Total ion image Correctly classified in white Orange border to indicate limits of SIMS image Random forest
  • 25. Classification of all pixels in image Correctly classified in white • ‘Circles’ around spots: • Artefact of edge detection? • Mislabelled pixels? • Coffee-ring effect of spotting? • Column 5 is badly misclassified Orange border to indicate limits of SIMS image Random forest
  • 26. Mean spectra from column 5 Spot at column 5, row 1 (exemplar) Spot at column 5, row 3 Spectra appear to be quite different | Misclassification may be correct
  • 28. Classification of all pixels in image Total ion image Correctly classified in white Orange border to indicate limits of SIMS image AdaBoost
  • 29. Classification of all pixels in image AdaBoost (88.9% cc) Random forest (88.1% cc) Subtle differences, but largely the same outcome
  • 30. Comparison (full mass resolution)
      Measure                                    Random forest    AdaBoost
      Model building time                        1 hr 20 min      2 hr 20 min
      Prediction time (inside same spot test)    7 min 10 sec     7 min 25 sec
      Classification (inside same spot test)     98.4%            99.0%
      Prediction time (whole slide test)         40 min 20 sec    38 min 2 sec
      Classification (whole slide test)          88.1%            88.9%
  • 31. Peak detection results 500 most intense spectral features
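Reducing each spectrum to its most intense features can be sketched as a top-k selection on the mean spectrum. This is a simplification: it ranks individual channels by intensity and is not a true peak detector, and the talk does not say which method was used. The function name and the toy spectrum are assumptions.

```python
def top_k_channels(mean_spectrum, k):
    """Indices of the k most intense channels, returned in channel order."""
    order = sorted(range(len(mean_spectrum)),
                   key=lambda i: mean_spectrum[i], reverse=True)
    return sorted(order[:k])

mean_spectrum = [0.1, 5.0, 0.3, 9.2, 0.0, 7.7]    # toy mean spectrum
kept = top_k_channels(mean_spectrum, 3)           # would be k = 500 for real data
reduced = [mean_spectrum[i] for i in kept]
```

The same list of kept channels must then be applied to every spectrum, in both the training and the prediction data.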
  • 32. Comparison (500 peaks)
      Measure                                    Random forest    AdaBoost
      Model building time                        5 min 31 sec     13 min 29 sec
      Prediction time (inside same spot test)    23.1 sec         22.7 sec
      Classification (inside same spot test)     99.39%           99.4%
      Prediction time (whole slide test)         2 min 3 sec      1 min 35 sec
      Classification (whole slide test)          88.1%            88.9%
  • 33. Which peaks contributed most? Mean spectrum Variable importance
  • 34. Which peaks contributed most? Mean spectrum Variable importance
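One common way to estimate variable importance is permutation importance: shuffle one variable at a time and record how much the classification accuracy drops. The sketch below assumes this approach and a hypothetical fixed classifier; the slides do not state which importance measure produced the plots.

```python
import random

def accuracy(predict, X, y):
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, rng):
    """Drop in accuracy when each variable is shuffled in turn."""
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)                                    # break the link to the labels
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        importances.append(baseline - accuracy(predict, X_perm, y))
    return importances

# Toy data: the class depends only on variable 0
X = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [0.8, 2.0], [0.3, 3.0], [0.7, 0.5]]
y = [0, 0, 1, 1, 0, 1]
predict = lambda row: int(row[0] > 0.5)                        # stand-in for a trained model
imp = permutation_importance(predict, X, y, random.Random(0))
```

Variable 1 is never used by the classifier, so shuffling it cannot change the accuracy and its importance is exactly zero.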
  • 35. Drawbacks • Need to decide on number of trees (RF), or iterations (AB) • Possible to calculate an appropriate number retrospectively • Ideally should have balanced classes (numbers of spectra) • Some classes may be under-represented • Works best with many spectra • Outliers can be mis-classified or difficult to manage in model building • Not perfectly repeatable due to random sampling • If working on a computer cluster, take care with random number seeds
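The point about random number seeds can be illustrated as below. Fixing the seed makes a bootstrap draw repeatable, and giving each cluster worker its own distinct seed avoids every worker drawing the same sample. The seed values and the function name are arbitrary choices for this sketch.

```python
import random

def bootstrap_indices(n, seed):
    """Bootstrap sample of row indices from a generator with a known seed."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]

run_a = bootstrap_indices(10, seed=123)
run_b = bootstrap_indices(10, seed=123)     # same seed, same sample

# On a cluster: derive a distinct seed for each worker from a base seed
worker_seeds = [1000 + worker_id for worker_id in range(4)]
worker_samples = [bootstrap_indices(10, s) for s in worker_seeds]
```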
  • 36. Prediction issues • Data must have same (number of) variables • Number of mass peaks (variables) • Each variable must correspond to same mass • Mass calibration can cause problems • Peak detection limits must be the same for training model and prediction data • Hard classifiers, so outliers always put into a class, even if ‘none of the above’ should apply. (Can be mitigated using probability of classification) Applies to all types of prediction including traditional statistics approaches (PLS-DA, CVA-QDA)
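The mitigation mentioned for hard classifiers can be sketched by treating the fraction of learners voting for the winning class as a classification probability and refusing to assign a class when it is too low. The threshold of 0.6 and the label "unassigned" are assumptions for this example.

```python
from collections import Counter

def classify_with_reject(votes, threshold=0.6, reject_label="unassigned"):
    """Return (label, confidence); reject when the winning vote share is below threshold."""
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    confidence = n / len(votes)
    return (label if confidence >= threshold else reject_label), confidence

clear_case = classify_with_reject(["A"] * 9 + ["B"])
unclear_case = classify_with_reject(["A", "B", "C", "A", "B", "C"])
```

In the second case the votes are split evenly between three classes, so the spectrum is left unassigned rather than forced into one of them.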
  • 37. Extensions Many versions of bagging/boosting algorithms. Some tuned to specific scenarios: • Regression (fitting data) • Missing data • Adding new classes without rebuilding model • Incremental update to model without rebuilding (streaming data) • Combining different types of data: categorical and continuous • Model of models (mixture of experts MoE)
  • 38. Summary • Ensemble machine learning brings additional data analysis tools to assist the analyst • Both AdaBoost and Random forest can perform regression in addition to classification • Both AdaBoost and Random forest produce high classification rates • AdaBoost is slightly more accurate, but is somewhat slower • Random forest can be interrogated to identify which spectral features drive the classification/regression • Random forest can take advantage of modern multi-core computers Why not try it out? What do you have to lose?