Lessons Learned from Running Hundreds of Kaggle Competitions: At Kaggle, we've run hundreds of machine learning competitions and seen over 80,000 data scientists make submissions. One thing is clear: winning competitions isn't random. We've learned that certain tools and methodologies work consistently well across different types of problems. Many participants make common mistakes (such as overfitting) that can be actively avoided, and competition hosts have their own set of pitfalls (such as data leakage).
In this talk, I'll share what goes into a winning competition toolkit, along with some war stories on what to avoid. I'll also share what we're seeing on the collaborative side of competitions: our community is collaborating more and more in developing machine learning models and analytic solutions. I'll showcase examples of this and discuss how this kind of collaboration will improve how data science is learned and applied.
6. @benhamner
Automatically grading student-written essays
• 197 entrants, 155 teams
• 2,499 submissions over 80 days
• $100,000 in prizes
• 21,000+ essays
• Human-level performance
• www.kaggle.com/c/asap-aes
7. @benhamner
Predicting a compound's toxicity given its molecular structure
• 796 entrants, 703 teams
• 8,841 submissions over 91 days
• $20,000 in prizes
• 25.6% improvement over the previous accuracy benchmark
• www.kaggle.com/c/BioResponse
8. @benhamner
Personalizing web search results
• 261 entrants, 194 teams
• 3,570 submissions over 91 days
• $9,000 in prizes
• 167,000,000+ logs
• www.kaggle.com/c/yandex-personalized-web-search-challenge
28. @benhamner
Some applied ML research looks like competitions running over years instead of months
www.kaggle.com/c/BioResponse/leaderboard
yann.lecun.com/exdb/mnist/
38. @benhamner
Set up your environment to enable rapid iteration and experimentation
[Pipeline diagram: Data Preprocessing → Identify & Handle Data Oddities → Extract and Select Features → Train Models → Evaluate and Visualize Results → repeat]
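As a rough illustration of that loop, here's a minimal Python sketch, assuming scikit-learn and pandas; the file name "train.csv" and the "target" column are hypothetical.

    # Minimal iteration loop: preprocess, handle oddities, build features,
    # train, evaluate. Keep each pass cheap so experiments stay fast.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("train.csv")                      # hypothetical file

    # Identify & handle data oddities: e.g., impute missing numerics.
    df = df.fillna(df.median(numeric_only=True))

    # Extract and select features.
    y = df.pop("target")                               # hypothetical column
    X = df.select_dtypes(include="number")

    # Train a model and evaluate the results.
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"CV AUC: {scores.mean():.4f} +/- {scores.std():.4f}")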
39. @benhamner
As an example, here’s a dashboard one user created to evaluate Diabetic Retinopathy models
http://jeffreydf.github.io/diabetic-retinopathy-detection/
41. @benhamner
Random Forests and GBMs work very well for many common classification and regression tasks
(Verikas et al. 2011)
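As an illustration (mine, not from the slides), a quick scikit-learn comparison of the two on a synthetic task:

    # Compare a Random Forest and a GBM with 5-fold cross-validation
    # on synthetic data; parameters are illustrative defaults.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import (GradientBoostingClassifier,
                                  RandomForestClassifier)
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    for model in (RandomForestClassifier(n_estimators=300, random_state=0),
                  GradientBoostingClassifier(random_state=0)):
        scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        print(type(model).__name__, round(scores.mean(), 4))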
42. @benhamner
Deep learning has been very effective in computer vision competitions we've hosted
Caffe, Theano, Torch7, and Keras are four popular open-source libraries that facilitate this
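For flavor, a minimal Keras convnet sketch; this assumes a current tf.keras (the 2015-era API differed slightly), and the input shape and class count are hypothetical.

    # Small image-classification convnet in Keras (tf.keras assumed).
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(64, 64, 3)),            # hypothetical RGB inputs
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(10, activation="softmax"),    # hypothetical 10 classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(X_train, y_train, epochs=5, validation_split=0.1)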
43. @benhamner
XGBoost and Keras — two ML libraries with great power:effort ratios
Competition            Type            Winning ML Algorithm
Liberty Mutual         Regression      XGBoost
Caterpillar Tubes      Regression      Keras + XGBoost + Reg. Forest
Diabetic Retinopathy   Image           SparseConvNet + RF
Avito                  CTR             XGBoost
Taxi Trajectory 2      Geostats        Classic neural net
Grasp and Lift         EEG             Keras + XGBoost + other CNN
Otto Group             Classification  Stacked ensemble of 35 models
Facebook IV            Classification  sklearn GBM
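To show why XGBoost earns its spot in that table, here's a hedged sketch using its scikit-learn-style wrapper; the data is synthetic and the parameters are illustrative, not a winning recipe.

    # XGBoost regression via the scikit-learn-style wrapper.
    import xgboost as xgb
    from sklearn.datasets import make_regression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=5000, n_features=30, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6)
    model.fit(X_tr, y_tr)
    print("test MSE:", mean_squared_error(y_te, model.predict(X_te)))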
45. @benhamner
The Boruta feature selection algorithm is robust and reliable
• Wrapper method around Random Forest and its calculated variable importance
• Iteratively trains RFs and runs statistical tests to identify features as important or not important
• Widely used in competition-winning models to select a small subset of features for use in training more complex models
• library(Boruta) in R; the core shadow-feature idea is sketched below
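A hedged Python sketch of one Boruta-style iteration (the R package is the reference implementation; this only illustrates the shadow-feature comparison on synthetic data):

    # One Boruta-style iteration: compare each real feature's importance
    # against shuffled "shadow" copies that carry no real signal.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    X_shadow = np.apply_along_axis(rng.permutation, 0, X)  # shuffle columns
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(np.hstack([X, X_shadow]), y)

    imp = rf.feature_importances_
    threshold = imp[X.shape[1]:].max()       # best shadow importance
    # Boruta repeats this many times and applies a statistical test per
    # feature; a single pass just flags features that beat every shadow.
    print("tentatively important:", np.where(imp[:X.shape[1]] > threshold)[0])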
54. @benhamner
Fairly and consistently evaluating a variety of approaches on the same problem
• Implementation details matter, which can make it tough to reproduce results in other settings where data and/or code is not open source
• “A quick, simple way to apply machine learning successfully? In your domain, find the stupid baseline that new methods consistently claim to beat. Implement that stupid baseline.”
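In scikit-learn terms, that stupid baseline can be a one-liner; a hedged example (mine, not from the talk) on synthetic, imbalanced data:

    # Majority-class baseline: any fancy model must clearly beat this.
    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                               random_state=0)
    baseline = DummyClassifier(strategy="most_frequent")
    print("baseline accuracy:",
          cross_val_score(baseline, X, y, cv=5).mean())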
55. @benhamner
Identifying data quality and leakage issues
Check that the ID column isn't informative
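One hedged way to run that check: train a model on the ID column alone and confirm it scores at chance ("train.csv", "id", and "target" are hypothetical names).

    # Leakage smoke test: a model given only the row ID should score at
    # chance; an AUC well above 0.5 means the IDs encode the target.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("train.csv")            # hypothetical file
    X_id = df[["id"]].astype(float)          # assumes numeric-ish IDs
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    auc = cross_val_score(model, X_id, df["target"],
                          cv=5, scoring="roc_auc").mean()
    print("AUC from ID alone:", auc)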
“Deemed ‘one of the top ten data mining mistakes’, leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from.”
- “Leakage in Data Mining: Formulation, Detection, and Avoidance”, S. Kaufman et al.
Time series are tricky
Essay: “This essay got good marks, but as far as I can tell, it's gibberish.”
Human Scores: 5/5, 4/5