@benhamner
Photo by mikebaird, www.flickr.com/photos/mikebaird
Lessons from ML Competitions
Ben Hamner
ben.hamner@kaggle.com
November 13, 2015
Kaggle runs machine learning competitions
We release challenging machine learning problems to our community of 410,000 data scientists
[Chart: submissions per month, Sep-10 through Sep-15; vertical axis from 0 to 160,000]
Our community makes 100k submissions per month on these competitions
Examples of Machine Learning Competitions
Automatically grading student-written essays
• 197 entrants, 155 teams
• 2,499 submissions over 80 days
• $100,000 in prizes
• 21,000+ essays
• Human-level performance
• www.kaggle.com/c/asap-aes
Predicting a compound’s toxicity given its molecular structure
• 796 entrants, 703 teams
• 8,841 submissions over 91 days
• $20,000 in prizes
• 25.6% improvement over previous accuracy benchmark
• www.kaggle.com/c/BioResponse
Personalizing web search results
• 261 entrants, 194 teams
• 3,570 submissions over 91 days
• $9,000 in prizes
• 167,000,000+ logs
• www.kaggle.com/c/yandex-personalized-web-search-challenge
Detecting diabetic retinopathy
• 854 entrants, 661 teams
• 6,999 submissions over 160 days
• $100,000 in prizes
• 88,000+ retina images
• www.kaggle.com/c/diabetic-retinopathy-detection
• 85% agreement with a human rater (quadratic weighted kappa)
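Quadratic weighted kappa, the metric named on this slide, can be sketched in a few lines. This is a minimal illustration of the standard definition, assuming integer ratings from 0 to num_ratings - 1; the function name and the sample ratings are made up for the example and are not taken from the competition.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, num_ratings):
    """Agreement between two lists of integer ratings in 0..num_ratings-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    if num_ratings < 2:
        raise ValueError("need at least two rating categories")
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # counts of (a, b) rating pairs
    hist_a = Counter(rater_a)
    hist_b = Counter(rater_b)

    numerator = 0.0    # weighted disagreement actually observed
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(num_ratings):
        for j in range(num_ratings):
            weight = (i - j) ** 2 / (num_ratings - 1) ** 2
            numerator += weight * observed[(i, j)]
            denominator += weight * hist_a[i] * hist_b[j] / n
    if denominator == 0:
        return 1.0  # both raters gave every item the same single rating
    return 1.0 - numerator / denominator


# Hypothetical ratings on a 0-4 scale, for illustration only.
human = [0, 1, 2, 3, 4, 2]
model = [0, 1, 2, 3, 3, 2]
print(quadratic_weighted_kappa(human, model, num_ratings=5))
```

A score of 1 means perfect agreement, 0 means agreement no better than chance, and large disagreements are penalized more heavily than near misses because the weight grows with the squared distance between ratings.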
How do machine learning competitions work?
We take a dataset with a target variable – something we’re trying to predict
SalePrice  SquareFeet  Type  LotAcres  Beds  Baths
$88k       719         HOME  1.64      1     1
$164k      2017        APT             3     2
$72k       697         APT             1     1
$85k       948         HOME  1.02      2     3
$271k      3375        APT             3     4
$482k      3968        APT             4     4
$88k       790         APT             1     2
$128k      1341        HOME  0.66      3     3
$235k      2379        APT             3     3
$309k      2495        HOME  0.21      3     4
$163k      1356        APT             1     1
$375k      3361        HOME  1.64      3     4
$98k       1060        HOME  0.05      1     1
$50k       582         HOME  0.61      1     1
$145k      1640        APT             2     3
$394k      3546        HOME  0.4       4     4
$82k       903         APT             2     2
$105k      1096        HOME  0.04      3     4
$129k      1280        HOME  0.15      2     2
$106k      1139        APT             1     1
Predicting the sale price of a home
Split the data into two sets, a training set and a test set
[Figure: the housing table again, with its rows labeled Training and Test; the SalePrice values of the Test rows are labeled Solution (“Ground Truth”)]
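The split described on this slide can be sketched as follows. This is a minimal illustration, not Kaggle's actual tooling: the function name, the test fraction, and the random shuffle are assumptions, and the rows are a handful of entries from the housing table with SalePrice in thousands of dollars.

```python
import random


def split_for_competition(rows, target, test_fraction=1 / 3, seed=0):
    """Return (training rows, test rows without the target, hidden solution)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    test_full, train = shuffled[:n_test], shuffled[n_test:]
    # Participants receive the test features but not the target column.
    test_public = [
        {key: value for key, value in row.items() if key != target}
        for row in test_full
    ]
    # The organizer keeps the held-back target values as the ground truth.
    solution = [row[target] for row in test_full]
    return train, test_public, solution


homes = [
    {"SalePrice": 88, "SquareFeet": 719, "Type": "HOME"},
    {"SalePrice": 164, "SquareFeet": 2017, "Type": "APT"},
    {"SalePrice": 72, "SquareFeet": 697, "Type": "APT"},
    {"SalePrice": 85, "SquareFeet": 948, "Type": "HOME"},
    {"SalePrice": 271, "SquareFeet": 3375, "Type": "APT"},
    {"SalePrice": 482, "SquareFeet": 3968, "Type": "APT"},
]

train, test_public, solution = split_for_competition(homes, target="SalePrice")
print(len(train), "training rows;", len(test_public), "test rows with SalePrice hidden")
```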
Our community gets everything but the solution on the test set
Set       SalePrice  SquareFeet  Type  LotAcres  Beds  Baths
Training  $88k       719         HOME  1.64      1     1
Training  $164k      2017        APT             3     2
Training  $72k       697         APT             1     1
Training  $85k       948         HOME  1.02      2     3
Training  $271k      3375        APT             3     4
Training  $482k      3968        APT             4     4
Training  $88k       790         APT             1     2
Training  $128k      1341        HOME  0.66      3     3
Training  $235k      2379        APT             3     3
Training  $309k      2495        HOME  0.21      3     4
Training  $163k      1356        APT             1     1
Training  $375k      3361        HOME  1.64      3     4
Training  $98k       1060        HOME  0.05      1     1
Test      ???        582         HOME  0.61      1     1
Test      ???        1640        APT             2     3
Test      ???        3546        HOME  0.4       4     4
Test      ???        903         APT             2     2
Test      ???        1096        HOME  0.04      3     4
Test      ???        1280        HOME  0.15      2     2
Test      ???        1139        APT             1     1
Solution (“Ground Truth”): the SalePrice values hidden behind ??? in the Test rows
Competition participants use the training set to learn the relation between the data and the target
Competition participants apply their models to make predictions on the test set
Submission (predicted SalePrice for each Test row):
SquareFeet  Type  LotAcres  Beds  Baths  Predicted
582         HOME  0.61      1     1      $41k
1640        APT             2     3      $165k
3546        HOME  0.4       4     4      $380k
903         APT             2     2      $76k
1096        HOME  0.04      3     4      $128k
1280        HOME  0.15      2     2      $115k
1139        APT             1     1      $94k
Kaggle compares the submission to the ground truth
Test rows only:
SalePrice  Predicted  Delta
$50k       $41k       -$9k
$145k      $165k      $20k
$394k      $380k      -$14k
$82k       $76k       -$6k
$105k      $128k      $13k
$129k      $115k      -$14k
$106k      $94k       -$12k
Kaggle calculates two scores, one for the public leaderboard and one for the private leaderboard
Test rows only:
SalePrice  Predicted  Delta
$50k       $41k       -$9k
$145k      $165k      $20k
$394k      $380k      -$14k
$82k       $76k       -$6k
$105k      $128k      $13k
$129k      $115k      -$14k
$106k      $94k       -$12k
MeanError
Public Leaderboard   $14k
Private Leaderboard  $15k
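The two-score idea can be sketched as below. This is a rough illustration under stated assumptions: the metric is taken to be mean absolute error, and the choice of which test rows count toward the public score (public_rows) is hypothetical, so the printed numbers need not match the slide. Prices are the test-row values in thousands of dollars.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the prediction error, ignoring its sign."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def score_submission(actual, predicted, public_rows):
    """Score the public rows and the remaining (private) rows separately."""
    public_rows = set(public_rows)
    public = [i for i in range(len(actual)) if i in public_rows]
    private = [i for i in range(len(actual)) if i not in public_rows]
    public_score = mean_absolute_error(
        [actual[i] for i in public], [predicted[i] for i in public]
    )
    private_score = mean_absolute_error(
        [actual[i] for i in private], [predicted[i] for i in private]
    )
    return public_score, private_score


ground_truth = [50, 145, 394, 82, 105, 129, 106]  # hidden SalePrice, in $k
submission = [41, 165, 380, 76, 128, 115, 94]     # predicted SalePrice, in $k

# Hypothetical split: the first three test rows feed the public leaderboard.
public_score, private_score = score_submission(
    ground_truth, submission, public_rows=[0, 1, 2]
)
print(f"public: {public_score:.1f}k, private: {private_score:.1f}k")
```

Keeping the private rows out of the score participants see during the competition limits how much they can tune their models to the leaderboard itself.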
The participant immediately sees their public score on the public leaderboard
Participants explore the problem and iterate on their models to improve them
At the end, the participant with the best score on the private leaderboard wins
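The difference between the public and the final ranking can be sketched with a toy leaderboard. Everything here is an assumption made for illustration: lower scores are treated as better, each team is judged on the submission with its best public score (a hypothetical selection rule), and the team names and scores are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    team: str
    public_error: float   # visible during the competition
    private_error: float  # revealed only at the end


def best_per_team(submissions):
    """Keep, for each team, the submission with the lowest public error."""
    best = {}
    for sub in submissions:
        if sub.team not in best or sub.public_error < best[sub.team].public_error:
            best[sub.team] = sub
    return list(best.values())


def public_leaderboard(submissions):
    return sorted(best_per_team(submissions), key=lambda sub: sub.public_error)


def final_standings(submissions):
    return sorted(best_per_team(submissions), key=lambda sub: sub.private_error)


submissions = [
    Submission("team_a", public_error=14.0, private_error=15.0),
    Submission("team_a", public_error=13.0, private_error=16.5),
    Submission("team_b", public_error=13.5, private_error=14.8),
]

print("public leader:", public_leaderboard(submissions)[0].team)
print("final winner:", final_standings(submissions)[0].team)
```

In this made-up example team_a leads the public leaderboard, but its best public submission generalizes worse, so team_b wins on the private score.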
Competition leaderboards
The leaderboard is a powerful mechanism to drive competition
The leaderboard is objective and meritocratic
The leaderboard encourages leapfrogging
The leaderboard encourages iterative improvements over many submissions
This causes the competition to approach the frontier of what’s possible given the data
Many competitions quickly approach a frontier; the most challenging ones take longer
Some applied ML research looks like competitions running over years instead of months
www.kaggle.com/c/BioResponse/leaderboard
yann.lecun.com/exdb/mnist/
One long-running research competition is ImageNet (not hosted on Kaggle)
www.image-net.org
We see a similar progression in ImageNet performance over time as we do in Kaggle competitions
www.image-net.org
Can we do better than competition results?
Looking holistically across all the competitions
At Kaggle, we’ve run hundreds of public machine learning competitions
And over 600 in-class competitions for university students
These competitions have generated over 2,000,000 submissions from around the world
Most of the competitions we’ve run have involved supervised classification or regression
Doing well in competitions
Set up your environment to enable rapid iteration and experimentation
Extract and Select Features
Train Models
Evaluate and Visualize Results
Identify & Handle Data Oddities
Data Preprocessing
As an example, here’s a dashboard one user created to evaluate Diabetic Retinopathy models
http://jeffreydf.github.io/diabetic-retinopathy-detection/
Successful users invest time, thought, and creativity in problem structure and feature extraction
Random Forests / GBMs work very well for many common classification and regression tasks
(Verikas et al. 2011)
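To make the GBM half of this slide concrete, here is a minimal sketch of gradient boosting for squared loss, using one-split decision stumps on a single feature. It is a teaching illustration only (it does not cover Random Forests); the function names, the number of rounds, and the learning rate are assumptions, and the data is a few SquareFeet and SalePrice pairs from the housing table. Real competition entries would use a library implementation rather than code like this.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of the residuals under squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the current residuals and adds a damped copy."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


square_feet = [719, 2017, 697, 948, 3375, 3968, 790, 1341]
sale_price_k = [88, 164, 72, 85, 271, 482, 88, 128]

model = fit_boosted_stumps(square_feet, sale_price_k)
mean_price = sum(sale_price_k) / len(sale_price_k)
baseline_mse = mean_squared_error(sale_price_k, [mean_price] * len(sale_price_k))
boosted_mse = mean_squared_error(sale_price_k, [predict(model, x) for x in square_feet])
print(f"training MSE: mean baseline {baseline_mse:.0f}, boosted stumps {boosted_mse:.0f}")
```

The comparison is on the training data only, so it shows the fitting mechanics rather than how well the model generalizes.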
Deep learning has been very effective in computer vision competitions we’ve hosted
Caffe, Theano, Torch7, and Keras are four popular open-source libraries that facilitate this
XGBoost and Keras — two ML libraries with great power:effort ratios
Competition           Type            Winning ML Algorithm
Liberty Mutual        Regression      XGBoost
Caterpillar Tubes     Regression      Keras + XGBoost + Reg. Forest
Diabetic Retinopathy  Image           SparseConvNet + RF
Avito                 CTR             XGBoost
Taxi Trajectory 2     Geostats        Classic neural net
Grasp and Lift        EEG             Keras + XGBoost + other CNN
Otto Group            Classification  Stacked ensemble of 35 models
Facebook IV           Classification  sklearn GBM
The Boruta feature selection algorithm is robust and reliable
• Wrapper method around Random Forest and its calculated variable importance
• Iteratively trains RFs and runs statistical tests to identify features as important or not important
• Widely used in competition-winning models to select a small subset of features for use in training more complex models
• library(Boruta) in R
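The shadow-feature idea behind Boruta can be sketched in a heavily simplified form. This is not the Boruta algorithm itself: absolute correlation with the target is swapped in for Random Forest importance, and a fixed hit-rate cutoff replaces the statistical test. The function names, parameters, and toy data are all assumptions made for illustration.

```python
import random


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; a stand-in for RF variable importance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return abs(cov) / (var_x * var_y) ** 0.5 if var_x and var_y else 0.0


def shadow_feature_selection(features, target, n_iter=50, min_hit_rate=0.9, seed=0):
    """Keep features that usually beat the best shuffled ("shadow") feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        # Shadows are shuffled copies, so any real link to the target is destroyed.
        shadow_scores = []
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)
            shadow_scores.append(abs_correlation(shadow, target))
        bar = max(shadow_scores)
        for name, values in features.items():
            if abs_correlation(values, target) > bar:
                hits[name] += 1
    return [name for name in features if hits[name] / n_iter >= min_hit_rate]


# Toy data: "signal" drives the target, "noise" is an unrelated repeating pattern.
signal = list(range(30))
noise = [(2 * i) % 5 for i in range(30)]
target = [2 * x + 1 for x in signal]

selected = shadow_feature_selection({"signal": signal, "noise": noise}, target)
print(selected)
```

The point of the shadows is to give each real feature a data-driven bar to clear: a feature only counts as important if it consistently scores higher than features known to carry no information.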
Model ensembling usually results in marginal but significant performance gains
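The simplest form of ensembling, averaging the predictions of several models, can be sketched as follows. The two sets of predictions are invented so that their errors partly cancel; real gains are usually smaller and depend on how correlated the models' errors are.

```python
def mean_absolute_error(actual, predicted):
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def average_ensemble(*prediction_lists):
    """Row-by-row mean of the predictions from several models."""
    return [sum(values) / len(values) for values in zip(*prediction_lists)]


actual = [50, 145, 394, 82]    # SalePrice in $k, from the housing table
model_a = [60, 155, 404, 92]   # hypothetical model that predicts too high
model_b = [44, 139, 388, 76]   # hypothetical model that predicts too low

blended = average_ensemble(model_a, model_b)
for name, predictions in [("model_a", model_a), ("model_b", model_b), ("ensemble", blended)]:
    print(name, mean_absolute_error(actual, predictions))
```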
Data leakage is our (and our users’) #1 challenge
http://www.navy.mil/view_image.asp?id=12495
We’ve also seen some things that competitions aren’t effective at
Competitions don’t typically yield simple and theoretically elegant solutions
*exception – Factorization Machines in KDD Cup 2012
Competitions don’t typically yield production code
http://ora-00001.blogspot.ru/2011/07/mythbusters-stored-procedures-edition.html
Competitions don’t always yield computationally efficient solutions
• Rewards performance without computational and complexity constraints
http://iinustechtips.com/main/topic/193045-need-help-underclocking-d/
Competitions tend to be highly effective at
Optimizing a quantifiable evaluation metric by exploring an enormously broad range of approaches
Fairly and consistently evaluating a variety of approaches on the same problem
• Implementation details matter, which can make it tough to reproduce results in other settings where data and/or code is not open source
• “A quick, simple way to apply machine learning successfully? In your domain, find the stupid baseline that new methods consistently claim to beat. Implement that stupid baseline”
Identifying data quality and leakage issues
Check that ID column isn’t informative
“Deemed ‘one of the top ten data mining mistakes’, leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from.”
- “Leakage in Data Mining: formulation, detection, and avoidance” S Kaufman et al
Time series are tricky
Essay: “This essay got good marks, but as far as I can tell, it's gibberish.”
Human Scores: 5/5, 4/5
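The ID-column check mentioned on this slide can be sketched as a simple correlation test. This is only one possible check, with an arbitrary threshold chosen for illustration: linear correlation catches cases such as rows being sorted by label before IDs were assigned, but it misses non-monotone patterns. The function names and the toy data are assumptions.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5 if var_x and var_y else 0.0


def id_looks_informative(ids, targets, threshold=0.3):
    """Flag an ID column whose values track the target (threshold is arbitrary)."""
    return abs(pearson(ids, targets)) >= threshold


row_ids = [1, 2, 3, 4, 5, 6, 7, 8]
sorted_labels = [0, 0, 0, 0, 1, 1, 1, 1]   # rows were sorted by label: leaky
mixed_labels = [0, 1, 0, 1, 0, 1, 0, 1]    # labels unrelated to row order

print("sorted:", id_looks_informative(row_ids, sorted_labels))
print("mixed:", id_looks_informative(row_ids, mixed_labels))
```

A stronger version of the same check would train a model on the ID column alone and see whether it beats a constant prediction.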
Exposing a specific domain problem to many new communities around the world
Where Kaggle’s going
Kaggle’s mission is to help the world learn from data
http://data-arts.appspot.com/globe/
We’re building a public platform for collaborating on data and analytics results
People
Code
Data
An early alpha version of this is released as Kaggle Scripts
It enables users to immediately access R/Python/Julia environments with data preloaded
Everything created on Kaggle Scripts is published as soon as it’s run
www.kaggle.com/scripts
Reproducing and building on another’s work is simply a click away
We’re starting to enable users to do this on non-competition datasets
Soon, any user will be able to publish data through Kaggle for analysis
Thank you!
Head to www.kaggle.com/scripts to check out code, visualizations, and results from our community

Ben Hamner, Co-founder and CTO, Kaggle at MLconf SF - 11/13/15

Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI World
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to code
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better Software
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime Changes
 

Recently uploaded

Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 

Recently uploaded (20)

Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 

Ben Hamner, Co-founder and CTO, Kaggle at MLconf SF - 11/13/15

  • 1. Lessons from ML Competitions. Ben Hamner (@benhamner), ben.hamner@kaggle.com. November 13, 2015. Photo by mikebaird, www.flickr.com/photos/mikebaird
  • 2. Kaggle runs machine learning competitions
  • 3. We release challenging machine learning problems to our community of 410,000 data scientists
  • 4. Our community makes 100k submissions per month on these competitions (chart: y-axis 0 to 160,000; x-axis Sep-10 to Sep-15)
  • 6. Automatically grading student-written essays: 21,000+ essays; 197 entrants; 155 teams; 2,499 submissions over 80 days; $100,000 in prizes; human-level performance. www.kaggle.com/c/asap-aes
  • 7. Predicting a compound’s toxicity given its molecular structure: 796 entrants; 703 teams; 8,841 submissions over 91 days; $20,000 in prizes; 25.6% improvement over previous accuracy benchmark. www.kaggle.com/c/BioResponse
  • 8. Personalizing web search results: 167,000,000+ logs; 261 entrants; 194 teams; 3,570 submissions over 91 days; $9,000 in prizes. www.kaggle.com/c/yandex-personalized-web-search-challenge
  • 9. Detecting diabetic retinopathy: 88,000+ retina images; 854 entrants; 661 teams; 6,999 submissions over 160 days; $100,000 in prizes; 85% agreement with a human rater (quadratic weighted kappa). www.kaggle.com/c/diabetic-retinopathy-detection
  • 10. How do machine learning competitions work?
  • 11. We take a dataset with a target variable, something we’re trying to predict. Example: predicting the sale price of a home.

        SalePrice  SquareFeet  Type  LotAcres  Beds  Baths
        $88k       719         HOME  1.64      1     1
        $164k      2017        APT   -         3     2
        $72k       697         APT   -         1     1
        $85k       948         HOME  1.02      2     3
        $271k      3375        APT   -         3     4
        $482k      3968        APT   -         4     4
        $88k       790         APT   -         1     2
        $128k      1341        HOME  0.66      3     3
        $235k      2379        APT   -         3     3
        $309k      2495        HOME  0.21      3     4
        $163k      1356        APT   -         1     1
        $375k      3361        HOME  1.64      3     4
        $98k       1060        HOME  0.05      1     1
        $50k       582         HOME  0.61      1     1
        $145k      1640        APT   -         2     3
        $394k      3546        HOME  0.4       4     4
        $82k       903         APT   -         2     2
        $105k      1096        HOME  0.04      3     4
        $129k      1280        HOME  0.15      2     2
        $106k      1139        APT   -         1     1
  • 12. Split the data into two sets, a training set and a test set. The SalePrice values of the test rows are the solution (“ground truth”).

        Set       SalePrice  SquareFeet  Type  LotAcres  Beds  Baths
        Training  $88k       719         HOME  1.64      1     1
        Training  $164k      2017        APT   -         3     2
        Training  $72k       697         APT   -         1     1
        Training  $85k       948         HOME  1.02      2     3
        Training  $271k      3375        APT   -         3     4
        Training  $482k      3968        APT   -         4     4
        Training  $88k       790         APT   -         1     2
        Training  $128k      1341        HOME  0.66      3     3
        Training  $235k      2379        APT   -         3     3
        Training  $309k      2495        HOME  0.21      3     4
        Training  $163k      1356        APT   -         1     1
        Training  $375k      3361        HOME  1.64      3     4
        Training  $98k       1060        HOME  0.05      1     1
        Test      $50k       582         HOME  0.61      1     1
        Test      $145k      1640        APT   -         2     3
        Test      $394k      3546        HOME  0.4       4     4
        Test      $82k       903         APT   -         2     2
        Test      $105k      1096        HOME  0.04      3     4
        Test      $129k      1280        HOME  0.15      2     2
        Test      $106k      1139        APT   -         1     1
  • 13. Our community gets everything but the solution (“ground truth”) on the test set

        Set       SalePrice  SquareFeet  Type  LotAcres  Beds  Baths
        Training  $88k       719         HOME  1.64      1     1
        Training  $164k      2017        APT   -         3     2
        Training  $72k       697         APT   -         1     1
        Training  $85k       948         HOME  1.02      2     3
        Training  $271k      3375        APT   -         3     4
        Training  $482k      3968        APT   -         4     4
        Training  $88k       790         APT   -         1     2
        Training  $128k      1341        HOME  0.66      3     3
        Training  $235k      2379        APT   -         3     3
        Training  $309k      2495        HOME  0.21      3     4
        Training  $163k      1356        APT   -         1     1
        Training  $375k      3361        HOME  1.64      3     4
        Training  $98k       1060        HOME  0.05      1     1
        Test      ???        582         HOME  0.61      1     1
        Test      ???        1640        APT   -         2     3
        Test      ???        3546        HOME  0.4       4     4
        Test      ???        903         APT   -         2     2
        Test      ???        1096        HOME  0.04      3     4
        Test      ???        1280        HOME  0.15      2     2
        Test      ???        1139        APT   -         1     1

  • 14. Competition participants use the training set to learn the relation between the data and the target
  • 15. Competition participants apply their models to make predictions on the test set. The Predicted column is the submission.

        Set       SalePrice  SquareFeet  Type  LotAcres  Beds  Baths  Predicted
        Training  $88k       719         HOME  1.64      1     1
        Training  $164k      2017        APT   -         3     2
        Training  $72k       697         APT   -         1     1
        Training  $85k       948         HOME  1.02      2     3
        Training  $271k      3375        APT   -         3     4
        Training  $482k      3968        APT   -         4     4
        Training  $88k       790         APT   -         1     2
        Training  $128k      1341        HOME  0.66      3     3
        Training  $235k      2379        APT   -         3     3
        Training  $309k      2495        HOME  0.21      3     4
        Training  $163k      1356        APT   -         1     1
        Training  $375k      3361        HOME  1.64      3     4
        Training  $98k       1060        HOME  0.05      1     1
        Test      ???        582         HOME  0.61      1     1      $41k
        Test      ???        1640        APT   -         2     3      $165k
        Test      ???        3546        HOME  0.4       4     4      $380k
        Test      ???        903         APT   -         2     2      $76k
        Test      ???        1096        HOME  0.04      3     4      $128k
        Test      ???        1280        HOME  0.15      2     2      $115k
        Test      ???        1139        APT   -         1     1      $94k
  • 16. Kaggle compares the submission to the ground truth

        Set       SalePrice  SquareFeet  Type  LotAcres  Beds  Baths  Predicted  Delta
        Training  $88k       719         HOME  1.64      1     1
        Training  $164k      2017        APT   -         3     2
        Training  $72k       697         APT   -         1     1
        Training  $85k       948         HOME  1.02      2     3
        Training  $271k      3375        APT   -         3     4
        Training  $482k      3968        APT   -         4     4
        Training  $88k       790         APT   -         1     2
        Training  $128k      1341        HOME  0.66      3     3
        Training  $235k      2379        APT   -         3     3
        Training  $309k      2495        HOME  0.21      3     4
        Training  $163k      1356        APT   -         1     1
        Training  $375k      3361        HOME  1.64      3     4
        Training  $98k       1060        HOME  0.05      1     1
        Test      $50k       582         HOME  0.61      1     1      $41k       -$9k
        Test      $145k      1640        APT   -         2     3      $165k      $20k
        Test      $394k      3546        HOME  0.4       4     4      $380k      -$14k
        Test      $82k       903         APT   -         2     2      $76k       -$6k
        Test      $105k      1096        HOME  0.04      3     4      $128k      $13k
        Test      $129k      1280        HOME  0.15      2     2      $115k      -$14k
        Test      $106k      1139        APT   -         1     1      $94k       -$12k
  • 17. Kaggle calculates two scores, one for the public leaderboard and one for the private leaderboard. MeanError: Public Leaderboard $14k, Private Leaderboard $15k.

        Set       SalePrice  SquareFeet  Type  LotAcres  Beds  Baths  Predicted  Delta
        Training  $88k       719         HOME  1.64      1     1
        Training  $164k      2017        APT   -         3     2
        Training  $72k       697         APT   -         1     1
        Training  $85k       948         HOME  1.02      2     3
        Training  $271k      3375        APT   -         3     4
        Training  $482k      3968        APT   -         4     4
        Training  $88k       790         APT   -         1     2
        Training  $128k      1341        HOME  0.66      3     3
        Training  $235k      2379        APT   -         3     3
        Training  $309k      2495        HOME  0.21      3     4
        Training  $163k      1356        APT   -         1     1
        Training  $375k      3361        HOME  1.64      3     4
        Training  $98k       1060        HOME  0.05      1     1
        Test      $50k       582         HOME  0.61      1     1      $41k       -$9k
        Test      $145k      1640        APT   -         2     3      $165k      $20k
        Test      $394k      3546        HOME  0.4       4     4      $380k      -$14k
        Test      $82k       903         APT   -         2     2      $76k       -$6k
        Test      $105k      1096        HOME  0.04      3     4      $128k      $13k
        Test      $129k      1280        HOME  0.15      2     2      $115k      -$14k
        Test      $106k      1139        APT   -         1     1      $94k       -$12k
  • 18. The participant immediately sees their public score on the public leaderboard
  • 19. Participants explore the problem and iterate on their models to improve them
  • 20. At the end, the participant with the best score on the private leaderboard wins
  • 22. The leaderboard is a powerful mechanism to drive competition
  • 23. The leaderboard is objective and meritocratic
  • 25. The leaderboard encourages iterative improvements over many submissions
  • 26. This causes the competition to approach the frontier of what’s possible given the data
  • 27. Many competitions quickly approach a frontier; the most challenging ones take longer
  • 28. Some applied ML research looks like competitions running over years instead of months. www.kaggle.com/c/BioResponse/leaderboard, yann.lecun.com/exdb/mnist/
  • 29. One long-running research competition is ImageNet (not hosted on Kaggle). www.image-net.org
  • 30. We see a similar progression in ImageNet performance over time as we do in Kaggle competitions. www.image-net.org
  • 31. Can we do better than competition results?
  • 33. At Kaggle, we’ve run hundreds of public machine learning competitions
  • 34. And over 600 in-class competitions for university students
  • 35. These competitions have generated over 2,000,000 submissions from around the world
  • 36. Most of the competitions we’ve run have involved supervised classification or regression
  • 38. Set up your environment to enable rapid iteration and experimentation. Stages shown: Extract and Select Features; Train Models; Evaluate and Visualize Results; Identify & Handle Data Oddities; Data Preprocessing
  • 39. As an example, here’s a dashboard one user created to evaluate Diabetic Retinopathy models. http://jeffreydf.github.io/diabetic-retinopathy-detection/
  • 40. Successful users invest time, thought, and creativity in problem structure and feature extraction
  • 41. Random Forests / GBMs work very well for many common classification and regression tasks (Verikas et al. 2011)
  • 42. Deep learning has been very effective in computer vision competitions we’ve hosted. Caffe, Theano, Torch7, and Keras are four popular open source libraries that facilitate this
  • 43. XGBoost and Keras: two ML libraries with great power:effort ratios

        Competition           Type            Winning ML Algorithm
        Liberty Mutual        Regression      XGBoost
        Caterpillar Tubes     Regression      Keras + XGBoost + Reg. Forest
        Diabetic Retinopathy  Image           SparseConvNet + RF
        Avito                 CTR             XGBoost
        Taxi Trajectory 2     Geostats        Classic neural net
        Grasp and Lift        EEG             Keras + XGBoost + other CNN
        Otto Group            Classification  Stacked ensemble of 35 models
        Facebook IV           Classification  sklearn GBM
  • 45. The Boruta feature selection algorithm is robust and reliable
      • Wrapper method around Random Forest and its calculated variable importance
      • Iteratively trains RFs and runs statistical tests to identify features as important or not important
      • Widely used in competition-winning models to select a small subset of features for use in training more complex models
      • library(Boruta) in R
  • 46. Model ensembling usually results in marginal but significant performance gains
  • 47. Data leakage is our (and our users’) #1 challenge. Image: http://www.navy.mil/view_image.asp?id=12495
  • 48. We’ve also seen some things that competitions aren’t effective at
  • 49. Competitions don’t typically yield simple and theoretically elegant solutions (exception: Factorization Machines in KDD Cup 2012)
  • 50. Competitions don’t typically yield production code. Image: http://ora-00001.blogspot.ru/2011/07/mythbusters-stored-procedures-edition.html
  • 51. Competitions don’t always yield computationally efficient solutions: they reward performance without computational and complexity constraints. Image: http://iinustechtips.com/main/topic/193045-need-help-underclocking-d/
  • 53. Optimizing a quantifiable evaluation metric by exploring an enormously broad range of approaches
  • 54. Fairly and consistently evaluating a variety of approaches on the same problem
      • Implementation details matter, which can make it tough to reproduce results in other settings where data and/or code is not open source
      • “A quick, simple way to apply machine learning successfully? In your domain, find the stupid baseline that new methods consistently claim to beat. Implement that stupid baseline.”
  • 55. Identifying data quality and leakage issues
      • Check that ID column isn’t informative
      • Time series are tricky
      • “Deemed ‘one of the top ten data mining mistakes’, leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from.” (“Leakage in Data Mining: formulation, detection, and avoidance”, S Kaufman et al)
      • Essay: “This essay got good marks, but as far as I can tell, it's gibberish.” Human Scores: 5/5, 4/5
  • 56. Exposing a specific domain problem to many new communities around the world
  • 58. Kaggle’s mission is to help the world learn from data. http://data-arts.appspot.com/globe/
  • 59. We’re building a public platform for collaborating on data and analytics results: People, Code, Data
  • 60. An early alpha version of this is released as Kaggle Scripts
  • 61. It enables users to immediately access R/Python/Julia environments with data preloaded
  • 62. Everything created on Kaggle Scripts is published as soon as it’s run. www.kaggle.com/scripts
  • 63. Reproducing and building on another’s work is simply a click away
  • 64. We’re starting to enable users to do this on non-competition datasets
  • 65. Soon, any user will be able to publish data through Kaggle for analysis
  • 66. Thank you! Head to www.kaggle.com/scripts to check out code, visualizations, and results from our community
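
The scoring step on slides 16 and 17 can be sketched in a few lines of Python. This is a minimal illustration, not Kaggle's implementation. It assumes "MeanError" means mean absolute error, and it assumes the first three test rows feed the public leaderboard and the remaining four the private one; the slides do not say which rows go where, so the scores it produces need not match the figures shown on slide 17. The names `leaderboard_scores` and `public_rows` are hypothetical. The truth and predicted values are the seven test rows from slide 17, in thousands of dollars.

```python
from statistics import mean

# Test-set rows from slide 17, in thousands of dollars.
truth = [50, 145, 394, 82, 105, 129, 106]
predicted = [41, 165, 380, 76, 128, 115, 94]


def leaderboard_scores(truth, predicted, public_rows):
    """Return (public, private) mean absolute error.

    public_rows is the set of row indices scored on the public
    leaderboard; every other row counts toward the private one.
    """
    public, private = [], []
    for i, (t, p) in enumerate(zip(truth, predicted)):
        bucket = public if i in public_rows else private
        bucket.append(abs(p - t))
    return mean(public), mean(private)


# Assumption: rows 0-2 are public, rows 3-6 are private.
public_score, private_score = leaderboard_scores(truth, predicted, {0, 1, 2})
print(f"public: ${public_score:.1f}k, private: ${private_score:.1f}k")
```

Participants see only the first number while the competition runs, which is why iterating against the public score alone can overfit to the public rows.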
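
Slide 55's advice to check that the ID column isn't informative can be turned into a quick sanity check. The sketch below uses hypothetical helper names, toy data, and an arbitrary threshold of 0.3; none of these come from the talk. It measures the linear correlation between row IDs and the target. A strong correlation suggests the IDs encode something about the target, for example because rows were sorted by label before IDs were assigned. A plain correlation only catches simple monotone leaks, so a pass is weak evidence at best.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (var_x * var_y) ** 0.5


def id_looks_informative(ids, target, threshold=0.3):
    """Flag an ID column whose correlation with the target is suspicious.

    The threshold is an arbitrary illustrative value, not a standard.
    """
    return abs(pearson(ids, target)) >= threshold


ids = [1, 2, 3, 4, 5, 6, 7, 8]
sorted_labels = [0, 0, 0, 0, 1, 1, 1, 1]    # toy data: IDs assigned after sorting
shuffled_labels = [0, 1, 1, 0, 1, 0, 0, 1]  # toy data: no relation to ID

print(id_looks_informative(ids, sorted_labels))
print(id_looks_informative(ids, shuffled_labels))
```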