The Hitchhiker’s Guide to Kaggle
1. The Hitchhiker’s Guide to Kaggle
July 27, 2011
ksankar42@gmail.com [doubleclix.wordpress.com]
anthony.goldbloom@kaggle.com
2.
3. The Amateur Data Scientist
[Word-cloud slide: Analytics Competitions; Algorithms (CART, randomForest); Tools; DataSets; Competitions — Titanic, Churn, Ford, HHP (in-flight)]
4. Encounters
— 1st
◦ This Workshop
— 2nd
◦ Do the hands-on walkthrough
◦ I will post the walkthrough scripts in ~10 days
— 3rd
◦ Participate in HHP & other competitions
5. Goals Of This Workshop
1. Introduction to Analytics Competitions from a Data, Algorithms & Tools perspective
2. End-To-End Flow of a Kaggle Competition – Ford
3. Introduction to the Heritage Health Prize Competition
4. Materials for you to explore further
◦ Lots more slides
◦ Walkthrough – will post in 10 days
6. Agenda
— Algorithms for the Amateur Data Scientist [25 min]
◦ Algorithms, Tools & frameworks in perspective
— The Art of Analytics Competitions [10 min]
◦ The Kaggle challenges
— How the RTA & FORD were won – Anatomy of a competition [15 min]
◦ Predicting FORD using Trees
◦ Submit an Entry
— Competition in flight – The Heritage Health Prize [30 min]
◦ Walkthrough: Introduction, Dataset Organization, Analytics Walkthrough
◦ Submit our entry
— Conclusion [5 min]
7. ALGORITHMS FOR THE
AMATEUR DATA SCIENTIST
Algorithms! The most massively useful thing an Amateur Data Scientist can have …
“A towel is about the most massively useful thing an
interstellar hitchhiker can have … any man who can hitch
the length and breadth of the Galaxy, rough it … win
through, and still know where his towel is, is clearly a
man to be reckoned with.”
- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.
Published by Harmony Books in 1979
8. The Amateur Data Scientist
— I am not a quant or an ML expert
— School of Amazon, Springer & YouTube
— For the rest of us
— References I used (refs also in the respective slides):
◦ The Elements Of Statistical Learning (a.k.a. ESLII)
– By Hastie, Tibshirani & Friedman
◦ Statistical Learning From a Regression Perspective
– By Richard Berk
— As Jeremy says, you can dig into it as needed
◦ You need not be an expert in the R toolbox
9. Jeremy’s Axioms
— Iteratively explore data
— Tools
◦ Excel Format, Perl, Perl Book
— Get your head around data
◦ Pivot Table
— Don’t over-complicate
— If people give you data, don’t assume that you need to use all of it
— Look at pictures!
— Keep a history of your submissions
— Don’t be afraid to submit simple solutions
◦ We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
10. Summary
1 Don’t throw away any data! (Big data to smart data)
2 Be ready for different ways of organizing the data
11. Users apply different techniques
• Support Vector Machines
• AdaBoost
• Bayesian Networks
• Decision Trees
• Ensemble Methods
• Random Forests
• Logistic Regression
• Neural Networks
• Genetic Algorithms
• Monte Carlo Methods
• Principal Component Analysis
• Kalman Filters
• Evolutionary Fuzzy Modelling
Quora: http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
Ref: Anthony’s Kaggle Presentation
12. — Let us take a 15 min overview of the algorithms
◦ Relevant in the context of this workshop
◦ From the perspective of the datasets we plan to use
— More qualitative than mathematical
— To get a feel for the how & the why
13. [Concept map: Bias vs. Variance, Model Complexity, Over-fitting; Continuous Variables → Linear Regression; Categorical Variables → Classifiers: Decision Trees (CART), k-NN (Nearest Neighbors), Bagging, Boosting]
14. Datasets
— Titanic Passenger Metadata: small; 3 predictors (Class, Sex, Age); target: Survived?
— Customer Churn: 17 predictors
— Kaggle Competition – Stay Alert Ford Challenge: simple dataset; competition class
— Heritage Health Prize Data: complex; competition in flight
http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic
http://www.homestoriesatoz.com/2011/06/blogger-to-wordpress-a-fish-out-of-water.html
15. Titanic Dataset
— Taken from the passenger manifest
— Good candidate for a Decision Tree
— CART [Classification & Regression Tree]
◦ Greedy, top-down, binary, recursive partitioning that divides feature space into sets of disjoint rectangular regions
— CART in R
Ref: http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
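The greedy partitioning step can be sketched in a few lines: at each node, CART tries every candidate split and keeps the one with the lowest weighted impurity. A minimal Python sketch — the toy passenger rows and the Gini criterion here are illustrative assumptions, not the workshop's R code or the real manifest:

```python
# Toy Titanic-like rows: (sex, pclass, survived) -- hypothetical data,
# just to illustrate one greedy CART-style split.
rows = [("male", 3, 0), ("male", 1, 0), ("female", 3, 1),
        ("female", 1, 1), ("male", 3, 0), ("female", 2, 1),
        ("male", 2, 0), ("female", 3, 0)]

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_quality(rows, col, value):
    """Weighted Gini after splitting on rows[col] == value (lower is better)."""
    left  = [r[2] for r in rows if r[col] == value]
    right = [r[2] for r in rows if r[col] != value]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Greedy step: try every (column, value) split and keep the purest one.
candidates = {(col, val) for r in rows for col, val in [(0, r[0]), (1, r[1])]}
best = min(candidates, key=lambda cv: split_quality(rows, *cv))
print(best)  # the root split; on this toy data it separates by sex
```

CART would now recurse into each side of the split, which is exactly the "top-down, binary, recursive" behaviour described above.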
16. Titanic Dataset – R walkthrough
— Load libraries
— Load data
— Model CART
— Model rattle()
— Tree
— Discussion
[Decision-tree diagram: root split on Male?, then 3rd?, Adult?, 3rd?]
17. CART
[Decision-tree diagram: Male? → Female branch; 3rd?; Adult? → Child branch; 3rd?]
18. CART
1 Do not over-fit
2 All predictors are not needed
3 All data rows are not needed
4 Tuning the algorithms will give different results
21. Challenges
— Model Complexity
◦ A complex model increases the training-data fit
◦ But then over-fits and doesn't perform as well with real data
— Bias vs. Variance
◦ Classical diagram (Prediction Error vs. Training Error)
◦ From ESLII, by Hastie, Tibshirani & Friedman
22. Solution #1
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
— Partition Data!
◦ Training (60%)
◦ Validation (20%) &
◦ “Vault” Test (20%) data sets
— k-fold Cross-Validation
◦ Split data into k equal parts
◦ Fit model to k-1 parts & calculate prediction error on the kth part
◦ Non-overlapping datasets
— But the fundamental problem still exists!
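The k-fold procedure above can be sketched directly: hold out each non-overlapping kth part in turn, fit on the rest, and pool the hold-out errors. A minimal Python sketch — the toy data and the mean-predictor "model" are assumptions for illustration, not a real fitted model:

```python
# Minimal k-fold cross-validation: fit on k-1 parts, score on the kth.
def kfold_error(xs, ys, k, fit, loss):
    n = len(xs)
    errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # non-overlapping kth part
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        model = fit(train_x, train_y)
        errors.extend(loss(model(xs[i]), ys[i]) for i in sorted(test_idx))
    return sum(errors) / len(errors)

# Toy "model": always predict the training mean; loss: squared error.
fit = lambda xs, ys: (lambda x: sum(ys) / len(ys))
loss = lambda pred, actual: (pred - actual) ** 2
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cv_mse = kfold_error(ys, ys, 3, fit, loss)
print(round(cv_mse, 3))  # pooled hold-out MSE: 3.75
```

Every point is scored exactly once while held out, which is what makes the estimate honest.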
23. Solution #2
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
— Bootstrap
◦ Draw datasets (with replacement) and fit a model for each dataset
– Remember: Data Partitioning (#1) & Cross-Validation (#2) are without replacement
— Bagging (Bootstrap aggregation)
◦ Average the prediction over a collection of bootstrapped samples, thus reducing variance
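The two moves — resample with replacement, then average — fit in a few lines of Python. The toy data and the sample-mean "model" are assumptions for illustration; a real bagger would fit trees on each resample:

```python
import random

# Bagging sketch: draw bootstrap samples (WITH replacement), fit a model
# to each, and average the predictions to reduce variance.
random.seed(42)
data = [2.0, 4.0, 6.0, 8.0, 10.0]

def bootstrap_sample(xs):
    # Same size as the original, drawn with replacement
    return [random.choice(xs) for _ in xs]

def bagged_predict(xs, n_models=200):
    # Each "model" here is just the mean of one bootstrap sample
    preds = [sum(s) / len(s)
             for s in (bootstrap_sample(xs) for _ in range(n_models))]
    return sum(preds) / len(preds)

print(round(bagged_predict(data), 2))  # hovers near the plain mean, 6.0
```

The averaged prediction is far more stable than any single bootstrap model, which is the whole point of the technique.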
24. Solution #3
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
— Boosting
◦ “Output of weak classifiers into a powerful committee”
◦ Final prediction = weighted majority vote
◦ Later classifiers get the misclassified points with higher weight, so they are forced to concentrate on them
◦ AdaBoost (Adaptive Boosting)
◦ Boosting vs. Bagging
– Bagging – independent trees
– Boosting – successively weighted
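The "later classifiers get the misclassified points with higher weight" step is the heart of AdaBoost. A single round of that reweighting, sketched in Python on made-up 1-D points (the data and the sign(x) weak classifier are assumptions for illustration):

```python
import math

# One AdaBoost reweighting round: points the weak classifier gets wrong
# receive HIGHER weight, forcing the next classifier to focus on them.
points = [(-2, -1), (-1, -1), (1, 1), (2, 1), (0.5, -1)]  # (x, label)
weights = [1 / len(points)] * len(points)

weak = lambda x: 1 if x > 0 else -1           # a weak classifier: sign(x)
miss = [weak(x) != y for x, y in points]      # it misclassifies (0.5, -1)

err = sum(w for w, m in zip(weights, miss) if m)   # weighted error rate
alpha = 0.5 * math.log((1 - err) / err)            # this classifier's vote
weights = [w * math.exp(alpha if m else -alpha)    # up-weight the misses
           for w, m in zip(weights, miss)]
total = sum(weights)
weights = [w / total for w in weights]             # renormalize to sum 1

print([round(w, 3) for w in weights])  # [0.125, 0.125, 0.125, 0.125, 0.5]
```

After one round the single missed point carries half the total weight; alpha is also the weight of this classifier's vote in the final committee.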
25. Solution #4
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
— Random Forests+
◦ Builds a large collection of de-correlated trees & averages them
◦ Improves Bagging by selecting i.i.d.* random variables for splitting
◦ Simpler to train & tune
◦ “Do remarkably well, with very little tuning required” – ESLII
◦ Less susceptible to overfitting (than boosting)
◦ Many RF implementations
– Original version – Fortran-77! By Breiman/Cutler
– R, Mahout, Weka, Milk (ML toolkit for Python), Matlab
* i.i.d. – independent, identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
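The de-correlation trick is that each tree may split only on a random subset of roughly √k of the predictors. A sketch of that selection step in Python — the feature names are invented for illustration; a real forest would also bootstrap the rows and grow a tree per subset:

```python
import math
import random

# Random-Forest flavour: each tree gets a fresh random subset of ~sqrt(k)
# predictors, which de-correlates the trees before their votes are averaged.
random.seed(7)
features = ["age", "sex", "class", "fare", "port",
            "deck", "cabin", "title", "family"]   # k = 9 predictors

def features_for_tree(all_features):
    m = round(math.sqrt(len(all_features)))       # typically sqrt(k)
    return random.sample(all_features, m)         # fresh subset per tree

forest_views = [features_for_tree(features) for _ in range(5)]
for view in forest_views:
    print(view)  # each tree may split only on its own 3 predictors
```

Because no two trees see the same predictors, their errors are less correlated, so averaging them cuts variance more than plain bagging does.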
26. Solution – General
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
— Ensemble methods
◦ Two steps
– Develop a set of learners
– Combine the results to develop a composite predictor
◦ Ensemble methods can take the form of:
– Using different algorithms
– Using the same algorithm with different settings
– Assigning different parts of the dataset to different classifiers
◦ Bagging & Random Forests are examples of ensemble methods
Ref: Machine Learning In Action
27. Random Forests
— While Boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
— Simpler because it requires only two parameters – no. of predictors (typically √k) & no. of trees (500 for large datasets, 150 for smaller)
— Error prediction
◦ For each iteration, predict for the data that is not in the sample (OOB data)
◦ Aggregate the OOB predictions
◦ Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
– Can use this to search for the optimal # of predictors
◦ We will see how close this is to the actual error in the Heritage Health Prize
— Assumes equal cost for mis-prediction; can add a cost function
— Proximity matrix & applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective – Berk
A Brief Overview of RF by Dan Steinberg
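The OOB idea works because a bootstrap sample of n rows leaves roughly a third of the rows out, and those out-of-bag rows form a free per-tree test set. A small Python sketch of just that counting argument (the sizes are arbitrary assumptions):

```python
import random

# Each bootstrap draw of n rows (with replacement) never touches ~1/e of
# the rows; those out-of-bag rows give each tree a "free" test set.
random.seed(1)
n = 1000
oob_fractions = []
for _ in range(50):                                 # 50 simulated trees
    in_bag = {random.randrange(n) for _ in range(n)}  # rows actually drawn
    oob = n - len(in_bag)                             # rows never drawn
    oob_fractions.append(oob / n)

avg_oob = sum(oob_fractions) / len(oob_fractions)
print(round(avg_oob, 3))  # close to 1/e ~ 0.368
```

Aggregating each row's predictions from only the trees that held it out yields the OOB error estimate the slide describes, with no separate validation set.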
28. Lots more to explore (Homework!)
— Loss matrix
◦ E.g. telecom churn – better to give incentives to false positives (who are not actually leaving) than to miss the false negatives (who are leaving)
— Missing values
— Additive Models
— Bayesian Models
— Gradient Boosting
Ref: http://www.louisaslett.com/Courses/Data_Mining_09-10/ST4003-Lab4-New_Tree_Data_Set_and_Loss_Matrices.pdf
37. [Survey charts: popularity of tools — R, Matlab, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata — comparing R on Kaggle, among academics, and among Americans]
Ref: Anthony’s Kaggle Presentation
38. Mapping Dark Matter is an image analysis competition whose aim is to encourage the development of new algorithms that can be applied to the challenge of measuring the tiny distortions in galaxy images caused by dark matter.
~25% successful grant applications
NASA tried, now it’s our turn
39. “The world’s brightest physicists have been working for decades on solving one of the great unifying problems of our universe”
“In less than a week, Martin O’Leary, a PhD student in glaciology, outperformed the state-of-the-art algorithms”
52. How the Ford Competition was won
— How I Did It Blogs
— http://blog.kaggle.com/2011/03/25/inference-on-winning-the-ford-stay-alert-competition/
— http://blog.kaggle.com/2011/04/20/mick-wagner-on-finishing-second-in-the-ford-challenge/
— http://blog.kaggle.com/2011/03/16/junpei-komiyama-on-finishing-4th-in-the-ford-competition/
53. How the Ford Competition was won
— Junpei Komiyama (#4)
◦ To solve this problem, I constructed a Support Vector Machine (SVM), which is one of the best tools for classification and regression analysis, using the libSVM package.
◦ This approach took more than 3 hours to complete
◦ I found some data (P3-P6) were characterized by strong noise... Also, many environmental and vehicular data showed discrete values continuously increased and decreased. These suggested the necessity of pre-processing the observation data before SVM analysis for better performance
54. How the Ford Competition was won
— Junpei Komiyama (#4)
◦ Averaging – improved score and processing time
◦ Averaged 7 data points
– Reduced processing by 86% &
– Increased score by 0.01
◦ Tools
– Python processing of CSV
– libSVM
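The "average 7 data points" reduction is easy to picture: collapse each block of 7 consecutive samples to its mean, shrinking the dataset to about 1/7 (an ~86% reduction). A sketch with an invented toy signal, not Komiyama's actual pipeline:

```python
# Collapse each block of `width` consecutive samples to its mean.
def average_blocks(xs, width=7):
    return [sum(xs[i:i + width]) / len(xs[i:i + width])
            for i in range(0, len(xs), width)]

signal = list(range(70))          # 70 raw samples (toy data)
reduced = average_blocks(signal)  # 10 averaged samples, ~86% fewer rows
print(len(reduced), reduced[:3])  # 10 [3.0, 10.0, 17.0]
```

Besides cutting the SVM's training time, the block means also smooth the strong noise he reported in channels P3-P6.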
55. How the Ford Competition was won
— Mick Wagner (#2)
◦ Tools
– Excel, SQL Server
◦ I spent the majority of my time analyzing the data. I input the data into Excel and started examining it, taking note of discrete and continuous values, category-based parameters, and simple statistics (mean, median, variance, coefficient of variation). I also looked for extreme outliers.
◦ I made the first 150 trials (~30%) my test data and the remainder my training dataset (~70%). This single factor had the largest impact on the accuracy of my final model.
◦ I was concerned that using the entire data set would create too much noise and lead to inaccuracies in the model … so I focused on data with state changes
56. How the Ford Competition was won
— Mick Wagner (#2)
◦ After testing the Decision Tree and Neural Network algorithms against each other and submitting models to Kaggle, I found the Neural Network model to be more accurate
◦ Only used E4, E5, E6, E7, E8, E9, E10, P6, V4, V6, V10, and V11
57. How the Ford Competition was won
— Inference (#1)
◦ Very interesting
◦ “Our first observation is that trials are not homogeneous – so calculated mean, sd et al”
◦ “Training set & test set are not from the same population” – a good fit for training will result in a low score
◦ Lucky Model (Regression)
– −410.6073·sd(E5) + 0.1494·V11 + 4.4185·E9
◦ (Remember – data had P1-P8, E1-E11, V1-V11)
58. HOW THE RTA WAS WON
“This competition requires participants to predict travel time on
Sydney's M4 freeway from past travel time observations.”
59. — Thanks to
◦ François GUILLEM &
◦ Andrzej Janusz
— They both used R
— They share their code & algorithms
60. How the RTA was won
— I effectively used R for the RTA competition. For my best submission, I just used simple techniques (OLS and means) but in a clever way
– François GUILLEM (#14)
— I used a simple k-NN approach, but the idea was to process the data first & to compute some summaries of the time series in consecutive timestamps using some standard indicators from technical analysis
– Andrzej Janusz (#17)
61. How the RTA was won
— #1 used Random Forests
◦ Time, Date & Week as predictors
– José P. González-Brenes and Matías Cortés
— Regression models for data segments (total ~600!)
— Tools:
◦ Java/Weka
◦ 4 processors, 12 GB RAM
◦ 48 hours of computation
– Marcin Pionnier (#5)
Ref: http://blog.kaggle.com/2011/02/17/marcin-pionnier-on-finishing-5th-in-the-rta-competition/
Ref: http://blog.kaggle.com/2011/03/25/jose-p-gonzalez-brenes-and-matias-cortes-on-winning-the-rta-challenge/
63. Lessons from Kaggle Winners
1 Don’t over-fit
2 All predictors are not needed
3 All data rows are not needed, either
4 Tuning the algorithms will give different results
5 Reduce the dataset (average, select transition data, …)
6 Test set & training set can differ
7 Iteratively explore & get your head around the data
8 Don’t be afraid to submit simple solutions
9 Keep a history of your submissions
64. The Competition
“The goal of the prize is to develop a predictive
algorithm that can identify patients who will be
admitted to the hospital within the next year,
using historical claims data”
66. Data Organization
— Members (113,000 entries; missing values)
◦ MemberID, Age at 1st Claim, Sex
— Claims (2,668,990 entries; missing values; different coding; PayDelay capped at “162+”)
◦ MemberID, ProvID, Vendor, PCP, Year, Speciality, PlaceOfSvc, PayDelay, LengthOfStay, DaysSinceFirstClaimThatYear, PrimaryConditionGroup, CharlsonIndex, ProcedureGroup, SupLOS
◦ SupLOS – Length of stay is suppressed during the de-identification process for some entries
— DaysInHospital – Y2 (76,039 entries), Y3 (71,436 entries), Y4 (70,943 entries; the target)
◦ MemberID, ClaimsTruncated, DaysInHospital
— LabCount (361,485 entries; fairly consistent coding, 10+)
◦ MemberID, Year, DSFS, LabCount
— DrugCount (818,242 entries; fairly consistent coding, 10+; lots of zeros)
◦ MemberID, Year, DSFS, DrugCount
68. Calculation & Prizes
— Prediction Error Rate
— Deadlines:
◦ Aug 31, 2011 06:59:59 UTC
◦ Feb 13, 2012
◦ Sep 04, 2012
◦ Apr 04, 2013 (final)
70. POA
— Load data into SQLite
— Use SQL to de-normalize & pick out datasets
— Load them into R for analytics
— Total/distinct counts
◦ Claims = 2,668,991/113,001
◦ Members = 113,001
◦ Drug = 818,242/75,999 <- unique = 141,532/75,999 (test)
◦ Lab = 361,485/86,640 <- unique = 154,935/86,640 (test)
◦ dih_y2 = 76,039 distinct / 11,770 with dih > 0
◦ dih_y3 = 71,436 distinct / 10,730 with dih > 0
◦ dih_y4 = 70,943 distinct
71. Idea #1
— dih_Y2 = β0 + β1·dih_Y1 + β2·DC + β3·LC
— dih_Y3 = β0 + β1·dih_Y2 + β2·DC + β3·LC
— dih_Y4 = β0 + β1·dih_Y3 + β2·DC + β3·LC
— select count(*) from dih_y2 join dih_y3 on dih_y2.member_id = dih_y3.member_id;
— Y2-Y3 = 51,967 (8,339 dih_y2 > 0) / Y3-Y4 = 49,683 (7,699 dih_y3 > 0)
— The data is not straightforward to get into this form
◦ Summarize drug and lab by member, year
◦ Split by year to get DC & LC by year
◦ Add to the dih_Yx tables
◦ Linear Regression
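To make the final regression step concrete, here is a single-predictor simplification of Idea #1 (dih_Y3 against dih_Y2 only, dropping the DC and LC terms) with invented numbers — a Python sketch of the fitting mechanics, not the real HHP fit, which the plan above does in R:

```python
# Ordinary least squares for y = b0 + b1*x, from the closed-form solution.
def ols(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1  # (intercept b0, slope b1)

dih_y2 = [0, 1, 2, 3, 4, 5]   # hypothetical days-in-hospital, year 2
dih_y3 = [0, 1, 1, 2, 3, 3]   # hypothetical days-in-hospital, year 3
b0, b1 = ols(dih_y2, dih_y3)
print(round(b0, 3), round(b1, 3))  # 0.095 0.629
```

The full Idea #1 model just adds the β2·DC and β3·LC terms, which in R is a one-line lm() call once the per-member, per-year totals are joined in.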
72. Some SQL for Idea #1
— create table drug_tot as select member_id, year, total(drug_count) from drug_count group by member_id, year order by member_id, year; <- total drug count per year for each member
— Same for lab_tot
— create table drug_tot_y1 as select * from drug_tot where year = 'Y1'
— … for Y2, Y3, and Y1, Y2, Y3 for lab_tot
— … join with the dih_yx tables
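The same statements can be tried end-to-end in a runnable miniature, using Python's sqlite3 with an in-memory database and a few invented rows (the real HHP tables are of course much larger):

```python
import sqlite3

# In-memory miniature of the drug_tot aggregation above (toy rows).
con = sqlite3.connect(":memory:")
con.execute("create table drug_count (member_id text, year text, drug_count int)")
con.executemany("insert into drug_count values (?, ?, ?)",
                [("A", "Y1", 2), ("A", "Y1", 3), ("A", "Y2", 1), ("B", "Y1", 7)])

# Same shape as the slide's statement: total drug count per member per year.
con.execute("""create table drug_tot as
               select member_id, year, total(drug_count) as tot
               from drug_count group by member_id, year
               order by member_id, year""")
con.execute("create table drug_tot_y1 as select * from drug_tot where year = 'Y1'")

print(con.execute("select * from drug_tot_y1").fetchall())
# [('A', 'Y1', 5.0), ('B', 'Y1', 7.0)]
```

Note SQLite's total() returns a float and treats NULLs as 0, which is convenient given the missing values in the claims data; the per-year tables then join against dih_yx on member_id exactly as sketched.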
73. Idea #2
— Add claims at Yx to the Idea #1 equations
— dih_Yn = β0 + β1·dih_Yn-1 + β2·DCn-1 + β3·LCn-1 + β4·Claimn-1
— Then we will have to define the criteria for Claimn-1 from the claim predictors, viz. PrimaryConditionGroup, CharlsonIndex and ProcedureGroup
74. The Beginning As the End
— We started with a set of goals
— Homework
◦ For me:
– To finish the hands-on walkthrough & post it in ~10 days
◦ For you:
– Go through the slides
– Do the walkthrough
– Submit entries to Kaggle
75. I enjoyed preparing the materials a lot … hope you enjoyed attending even more …
Questions?!
IDE <- RStudio
R_Packages <- c(plyr, rattle, rpart, randomForest)
R_Search <- http://www.rseek.org/, powered=google