The Hitchhiker’s Guide to Kaggle

July 27, 2011
ksankar42@gmail.com [doubleclix.wordpress.com]
anthony.goldbloom@kaggle.com
The Amateur Data Scientist

[Slide: concept map – Analytics Competitions seen from three perspectives: Algorithms (CART, randomForest), Tools, and DataSets (Titanic, Churn, Ford); old competitions and competitions in flight such as HHP]
Encounters
—  1st
   ◦  This Workshop
—  2nd
   ◦  Do Hands-on Walkthrough
   ◦  I will post the walkthrough scripts in ~10 days
—  3rd
   ◦  Participate in HHP & Other competitions
Goals Of This Workshop
1.  Introduction to Analytics Competitions from a Data, Algorithms & Tools perspective
2.  End-To-End Flow of a Kaggle Competition – Ford
3.  Introduction to the Heritage Health Prize Competition
4.  Materials for you to explore further
   ◦  Lots more slides
   ◦  Walkthrough – will post in ~10 days
Agenda
—  Algorithms for the Amateur Data Scientist [25Min]
   ◦  Algorithms, Tools & frameworks in perspective
—  The Art of Analytics Competitions [10Min]
   ◦  The Kaggle challenges
—  How the RTA & FORD were won – Anatomy of a competition [15Min]
   ◦  Predicting FORD using Trees
   ◦  Submit an Entry
—  Competition in flight – The Heritage Health Prize [30Min]
   ◦  Walkthrough
      –  Introduction
      –  Dataset Organization
      –  Analytics Walkthrough
   ◦  Submit our entry
—  Conclusion [5Min]
ALGORITHMS FOR THE AMATEUR DATA SCIENTIST

Algorithms! The most massively useful thing an Amateur Data Scientist can have …

“A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.”
   – From The Hitchhiker's Guide to the Galaxy, by Douglas Adams. Published by Harmony Books in 1979
The Amateur Data Scientist
—  Am not a quant or an ML expert
—  School of Amz, Springer & UTube
—  For the Rest of us
—  References I used (refs also in the respective slides):
   ◦  The Elements of Statistical Learning (a.k.a. ESLII)
      –  By Hastie, Tibshirani & Friedman
   ◦  Statistical Learning From a Regression Perspective
      –  By Richard Berk
—  As Jeremy says, you can dig into it as needed
   ◦  Not necessarily be an expert in the R toolbox
Jeremy’s Axioms
—  Iteratively explore data
—  Tools
   ◦  Excel Format, Perl, Perl Book
—  Get your head around data
   ◦  Pivot Table
—  Don’t over-complicate
—  If people give you data, don’t assume that you need to use all of it
—  Look at pictures!
—  Keep a tab on the history of your submissions
—  Don’t be afraid to submit simple solutions
   ◦  We will do this during this workshop

Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
1  Don’t throw away any data!
      Big data to smart data
2  Be ready for different ways of organizing the data
      —  summary
Users apply different techniques

•  Support Vector Machine
•  adaBoost
•  Bayesian Networks
•  Decision Trees
•  Ensemble Methods
•  Random Forest
•  Logistic Regression
•  Genetic Algorithms
•  Monte Carlo Methods
•  Principal Component Analysis
•  Kalman Filter
•  Evolutionary Fuzzy Modelling
•  Neural Networks

Quora
•  http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms

Ref: Anthony’s Kaggle Presentation
—  Let us take a 15 min overview of the algorithms
   ◦  Relevant in the context of this workshop
   ◦  From the perspective of the datasets we plan to use
—  More of a qualitative than mathematical treatment
—  To get a feel for the how & the why
[Slide: concept map – Linear Regression for continuous variables; Classifiers for categorical variables; recurring themes: Bias vs. Variance, Model Complexity, Over-fitting; classifier families: Bagging, Boosting, Decision Trees (CART), k-NN (Nearest Neighbors)]
Titanic Passenger Metadata
•  Small
•  3 Predictors
   •  Class
   •  Sex
   •  Age
   •  Survived?

Customer Churn
•  17 Predictors

Kaggle Competition – Stay Alert Ford Challenge
•  Simple Dataset
•  Competition Class

Heritage Health Prize Data
•  Complex
•  Competition in Flight

http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic
http://www.homestoriesatoz.com/2011/06/blogger-to-wordpress-a-fish-out-of-water.html
Titanic Dataset
—  Taken from the passenger manifest
—  Good candidate for a Decision Tree
—  CART [Classification & Regression Tree]
   ◦  Greedy, top-down, binary, recursive partitioning that divides feature space into sets of disjoint rectangular regions
—  CART in R

http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
Titanic Dataset – R walk-through
—  Load libraries
—  Load data
—  Model CART
—  Model rattle()
—  Tree
—  Discussion

[Slide: the fitted tree – splits on Male?, 3rd class?, and Adult?]
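A minimal R sketch of this walk-through, assuming a local titanic.csv with columns pclass, sex, age and survived (hypothetical names; the workshop's Titanic file may differ):

    library(rpart)

    # Assumed file/column names -- adjust to the actual data file
    titanic <- read.csv("titanic.csv")
    titanic$survived <- factor(titanic$survived)

    # CART: greedy, top-down, binary recursive partitioning
    fit <- rpart(survived ~ pclass + sex + age, data = titanic, method = "class")

    printcp(fit)              # complexity / cross-validated error table
    plot(fit); text(fit)      # quick base-graphics view of the tree
    # library(rattle); fancyRpartPlot(fit)   # the rattle() route to a nicer plot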
CART

[Slide: the Titanic CART tree – root split on Male?, with further splits on 3rd class? and Adult?; the Female and Child branches are labeled]
CART

[Slide: the same tree, annotated with the lessons:]
1  Do not over-fit
2  All predictors are not needed
3  All data rows are not needed
4  Tuning the algorithms will give different results
Churn Data

—  Predict churn
—  Based on
   ◦  Service calls, v-mail and so forth
CART Tree
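As a sketch of how such a tree could be grown, assuming the churn data shipped with the C50 package stands in for the workshop's churn file (an assumption):

    library(rpart)
    library(C50)        # assumption: churn data from the C50 package
    data(churn)         # loads churnTrain / churnTest

    # One tree on all predictors (service calls, v-mail plan, etc.)
    fit  <- rpart(churn ~ ., data = churnTrain, method = "class")
    pred <- predict(fit, churnTest, type = "class")
    table(pred, churnTest$churn)    # confusion matrix on held-out data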
Challenges
—  Model Complexity
   ◦  A complex model increases the training data fit
   ◦  But then over-fits and doesn't perform as well with real data
—  Bias vs. Variance
   ◦  Classical diagram
   ◦  From ESLII, by Hastie, Tibshirani & Friedman

[Slide: the classical curves – Training Error falls while Prediction Error rises as Model Complexity grows]
Solution #1
—  Goal
   ◦  Model Complexity (-)
   ◦  Variance (-)
   ◦  Prediction Accuracy (+)

Partition Data!
   ◦  Training (60%)
   ◦  Validation (20%) &
   ◦  “Vault” Test (20%) data sets

k-fold Cross-Validation
   ◦  Split data into k equal parts
   ◦  Fit model to k-1 parts & calculate prediction error on the k-th part
   ◦  Non-overlapping datasets

But the fundamental problem still exists!
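A minimal R sketch of both ideas, with df and a numeric response y as placeholders:

    library(rpart)
    set.seed(42)                       # df / y below are placeholders

    n   <- nrow(df)
    idx <- sample(n)
    train <- df[idx[1:floor(0.6 * n)], ]                       # Training (60%)
    valid <- df[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]    # Validation (20%)
    vault <- df[idx[(floor(0.8 * n) + 1):n], ]                 # "Vault" test (20%)

    # k-fold CV: fit on k-1 parts, measure prediction error on the k-th
    k     <- 10
    folds <- sample(rep(1:k, length.out = n))      # non-overlapping parts
    errs  <- sapply(1:k, function(i) {
      fit  <- rpart(y ~ ., data = df[folds != i, ])
      pred <- predict(fit, df[folds == i, ])
      mean((pred - df$y[folds == i])^2)            # error on the held-out part
    })
    mean(errs)                                     # CV estimate of prediction error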
Solution #2
—  Goal
   ◦  Model Complexity (-)
   ◦  Variance (-)
   ◦  Prediction Accuracy (+)

Bootstrap
   ◦  Draw datasets (with replacement) and fit the model for each dataset
      –  Remember: Data Partitioning (#1) & Cross-Validation (#2) are without replacement

Bagging (Bootstrap aggregation)
   ◦  Average prediction over a collection of bootstrapped samples, thus reducing variance
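A hand-rolled bagging sketch (df with a numeric response y, both placeholders): fit one tree per bootstrap sample, then average:

    library(rpart)
    set.seed(42)

    B <- 100
    n <- nrow(df)
    preds <- replicate(B, {
      boot <- df[sample(n, n, replace = TRUE), ]   # bootstrap: WITH replacement
      predict(rpart(y ~ ., data = boot), newdata = df)
    })
    bagged <- rowMeans(preds)    # averaging over bootstrap models reduces variance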
Solution #3
—  Goal
   ◦  Model Complexity (-)
   ◦  Variance (-)
   ◦  Prediction Accuracy (+)

Boosting
   ◦  “Output of weak classifiers into a powerful committee”
   ◦  Final Prediction = weighted majority vote
   ◦  Later classifiers get the misclassified points with higher weight, so they are forced to concentrate on them
   ◦  AdaBoost (Adaptive Boosting)
   ◦  Boosting vs Bagging
      –  Bagging – independent trees
      –  Boosting – successively weighted
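A toy version of that re-weighting loop (a sketch, not a production AdaBoost; x is a predictor frame and y a label vector coded -1 / +1, both placeholders):

    library(rpart)

    adaboost_fit <- function(x, y, M = 20) {
      n <- nrow(x); w <- rep(1 / n, n)
      fits <- vector("list", M); alpha <- numeric(M)
      d <- cbind(x, .y = factor(y))
      for (m in 1:M) {
        fits[[m]] <- rpart(.y ~ ., data = d, weights = w,
                           control = rpart.control(maxdepth = 1))  # a stump
        pred <- as.numeric(as.character(predict(fits[[m]], d, type = "class")))
        miss <- as.numeric(pred != y)
        err  <- sum(w * miss) / sum(w)
        alpha[m] <- log((1 - err) / err)    # this learner's vote in the committee
        w <- w * exp(alpha[m] * miss)       # up-weight the misclassified points
        w <- w / sum(w)
      }
      list(fits = fits, alpha = alpha)      # final prediction: sign of the
    }                                       # alpha-weighted sum of stump votes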
Solution #4
—  Goal
   ◦  Model Complexity (-)
   ◦  Variance (-)
   ◦  Prediction Accuracy (+)

Random Forests+
   ◦  Builds a large collection of de-correlated trees & averages them
   ◦  Improves Bagging by selecting i.i.d* random variables for splitting
   ◦  Simpler to train & tune
   ◦  “Do remarkably well, with very little tuning required” – ESLII
   ◦  Less susceptible to over-fitting (than boosting)
   ◦  Many RF implementations
      –  Original version – Fortran-77! By Breiman/Cutler
      –  R, Mahout, Weka, Milk (ML toolkit for py), matlab

* i.i.d – independent, identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Solution – General
—  Goal
   ◦  Model Complexity (-)
   ◦  Variance (-)
   ◦  Prediction Accuracy (+)

Ensemble methods
   ◦  Two steps
      –  Develop a set of learners
      –  Combine the results to develop a composite predictor
   ◦  Ensemble methods can take the form of:
      –  Using different algorithms,
      –  Using the same algorithm with different settings
      –  Assigning different parts of the dataset to different classifiers
   ◦  Bagging & Random Forests are examples of ensemble methods

Ref: Machine Learning In Action
Random Forests
—  While Boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
—  Simpler because it requires only two tuning parameters – the number of predictors tried at each split (typically √k) & the number of trees (500 for a large dataset, 150 for a smaller one)
—  Error prediction
   ◦  For each iteration, predict for the data that is not in the sample (the OOB data)
   ◦  Aggregate the OOB predictions
   ◦  Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
      –  Can use this to search for the optimal # of predictors
   ◦  We will see how close this is to the actual error in the Heritage Health Prize
—  Assumes equal cost for mis-prediction. Can add a cost function
—  Proximity matrix & applications like adding missing data, dropping outliers

Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective: Berk
A Brief Overview of RF by Dan Steinberg
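A sketch of reading the OOB error and searching over the number of predictors with the randomForest package (x: predictor frame, y: factor response, both placeholders):

    library(randomForest)
    set.seed(42)

    fit <- randomForest(x, y, ntree = 500)
    fit$err.rate[fit$ntree, "OOB"]    # the OOB estimate of the error rate

    # Search over the number of predictors tried per split (default ~ sqrt(k))
    tuneRF(x, y, ntreeTry = 150, stepFactor = 2, improve = 0.01)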
Lots more to explore (Homework!)

—  Loss matrix
   ◦  E.g. Telecom churn – better to waste an incentive on a false positive (someone who is not actually leaving) than to miss a false negative (someone who is leaving)
—  Missing values
—  Additive Models
—  Bayesian Models
—  Gradient Boosting

Ref: http://www.louisaslett.com/Courses/Data_Mining_09-10/ST4003-Lab4-New_Tree_Data_Set_and_Loss_Matrices.pdf
Churn Data w/ randomForest
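A hedged sketch of this slide's randomForest run, again assuming the C50 churn data stands in for the workshop file:

    library(randomForest)
    library(C50)                 # assumption: churn data from the C50 package
    data(churn)
    set.seed(42)

    rf <- randomForest(churn ~ ., data = churnTrain,
                       ntree = 500, importance = TRUE)
    print(rf)                    # shows the OOB error & confusion matrix
    varImpPlot(rf)               # service calls etc. ranked by importance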
KAGGLE COMPETITIONS

“I keep saying the sexy job in the next ten years will be statisticians.”
   – Hal Varian, Google Chief Economist, 2009
Crowdsourcing

Mismatch between those with data and those with the skills to analyse it
Tourism Forecasting Competition

[Chart: Forecast Error (MASE) over the competition – Aug 9 launch, 2 weeks later, 1 month later, competition end – with the existing model shown as the benchmark]
Chess Ratings Competition

[Chart: Error Rate (RMSE) over the competition – Aug 4 launch, 1 month later, 2 months later, today – with the existing model (ELO) shown as the benchmark]
12,500 “Amateur” Data Scientists with different backgrounds

[Charts: tool usage – R, Matlab, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata – on Kaggle, among academics, and among Americans]

Ref: Anthony’s Kaggle Presentation
Mapping Dark Matter is an image analysis competition whose aim is to encourage the development of new algorithms that can be applied to the challenge of measuring the tiny distortions in galaxy images caused by dark matter.

~25% Successful grant applications

NASA tried, now it's our turn
“The world’s brightest physicists have been working for decades on solving one of the great unifying problems of our universe”

“In less than a week, Martin O’Leary, a PhD student in glaciology, outperformed the state-of-the-art algorithms”
Who to hire?
Why Participants Compete

1  Clean, real-world data
2  Professional reputation & experience
3  Interactions with experts in related fields
4  Prizes
Competition Mechanics

—  Use the wizard to post a competition
—  Participants make their entries
—  Competitions are judged based on predictive accuracy
—  Competitions are judged on objective criteria
The Anatomy of a KAGGLE COMPETITION

THE FORD COMPETITION
Ford Challenge – DataSet
—  Goal:
   ◦  Predict Driver Alertness
—  Predictors:
   ◦  Physiological – P1 .. P8
   ◦  Environmental – E1 .. E11
   ◦  Vehicular – V1 .. V11
   ◦  IsAlert?
—  Data statistics are meaningless outside the IsAlert context
Ford Challenge – DataSet Files
—  Three files
   ◦  ford_train
      –  510 trials, ~1,200 observations each, spaced 0.1 sec apart -> 604,330 rows
   ◦  ford_test
      –  100 trials, ~1,200 observations/trial, 120,841 rows
   ◦  example_submission.csv
A Plan

[Slide: flow diagram of the planned models – glm]

Submission & Results
—  Raw, all variables, rpart
—  Raw, selected variables, rpart
—  All variables, glm
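An end-to-end sketch of the "raw, all variables, rpart" entry; the file and column names (TrialID, ObsNum, IsAlert) follow the competition's data page and should be adjusted if your copies differ:

    library(rpart)

    train <- read.csv("ford_train.csv")
    test  <- read.csv("ford_test.csv")
    train$IsAlert <- factor(train$IsAlert)

    # One tree on everything except the identifiers
    fit  <- rpart(IsAlert ~ . - TrialID - ObsNum,
                  data = train, method = "class")
    prob <- predict(fit, test)[, "1"]     # P(IsAlert = 1) for each test row

    sub <- data.frame(TrialID = test$TrialID, ObsNum = test$ObsNum,
                      Prediction = prob)
    write.csv(sub, "ford_submission.csv", row.names = FALSE)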
How the Ford Competition was won
—  “How I Did It” blogs
—  http://blog.kaggle.com/2011/03/25/inference-on-winning-the-ford-stay-alert-competition/
—  http://blog.kaggle.com/2011/04/20/mick-wagner-on-finishing-second-in-the-ford-challenge/
—  http://blog.kaggle.com/2011/03/16/junpei-komiyama-on-finishing-4th-in-the-ford-competition/
How the Ford Competition was won
—  Junpei Komiyama (#4)
   ◦  “To solve this problem, I constructed a Support Vector Machine (SVM), which is one of the best tools for classification and regression analysis, using the libSVM package.”
   ◦  This approach took more than 3 hours to complete
   ◦  “I found some data (P3-P6) were characterized by strong noise... Also, many environmental and vehicular data showed discrete values continuously increased and decreased. These suggested the necessity of pre-processing the observation data before SVM analysis for better performance”
How the Ford Competition was won
—  Junpei Komiyama (#4)
   ◦  Averaging – improved score and processing time
   ◦  Average 7 data points
      –  Reduced processing by 86% &
      –  Increased score by 0.01
   ◦  Tools
      –  Python processing of csv
      –  libSVM
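An illustrative take on that averaging step (not Junpei's actual code; assumes the raw ford_train.csv from the sketch above): average every 7 consecutive observations within each trial.

    train <- read.csv("ford_train.csv")
    avg7  <- function(d) {
      grp <- (seq_len(nrow(d)) - 1) %/% 7          # groups of 7 consecutive rows
      aggregate(d, by = list(grp), FUN = mean)[, -1]
    }
    train_small <- do.call(rbind, lapply(split(train, train$TrialID), avg7))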
How the Ford Competition was won
—  Mick Wagner (#2)
   ◦  Tools
      –  Excel, SQL Server
   ◦  “I spent the majority of my time analyzing the data. I inputted the data into Excel and started examining the data, taking note of discrete and continuous values, category-based parameters, and simple statistics (mean, median, variance, coefficient of variance). I also looked for extreme outliers.”
   ◦  “I made the first 150 trials (~30%) be my test data and the remainder be my training dataset (~70%). This single factor had the largest impact on the accuracy of my final model.”
   ◦  Was concerned that using the entire data set would create too much noise and lead to inaccuracies in the model … so focused on data with state change
How the Ford Competition was won
—  Mick Wagner (#2)
   ◦  After testing the Decision Tree and Neural Network algorithms against each other and submitting models to Kaggle, I found the Neural Network model to be more accurate
   ◦  Only used E4, E5, E6, E7, E8, E9, E10, P6, V4, V6, V10, and V11
How the Ford Competition was won
—  Inference (#1)
   ◦  Very interesting
   ◦  “Our first observation is that trials are not homogeneous – so calculated mean, sd et al”
   ◦  “Training set & test set are not from the same population” – a good fit on training will result in a low score
   ◦  Lucky Model (Regression)
      –  -410.6073·sd(E5) + 0.1494·V11 + 4.4185·E9
   ◦  (Remember – data had P1-P8, E1-E11, V1-V11)
HOW THE RTA WAS WON

“This competition requires participants to predict travel time on Sydney's M4 freeway from past travel time observations.”
—  Thanks to
   ◦  François GUILLEM &
   ◦  Andrzej Janusz
—  They both used R
—  They share their code & algorithms
How the RTA was won
—  “I effectively used R for the RTA competition. For my best submission, I just used simple techniques (OLS and means) but in a clever way”
         – François GUILLEM (#14)
—  “I used a simple k-NN approach, but the idea was to process data first & to compute some summaries of time series in consecutive timestamps using some standard indicators from technical analysis”
         – Andrzej Janusz (#17)
How the RTA was won
—  #1 used Random Forests
   ◦  Time, Date & Week as predictors
         – José P. González-Brenes and Matías Cortés
—  Regression models for data segments (total ~600!)
—  Tools:
   ◦  Java/Weka
   ◦  4 processors, 12 GB RAM
   ◦  48 hours of computations
         – Marcin Pionnier (#5)

Ref: http://blog.kaggle.com/2011/02/17/marcin-pionnier-on-finishing-5th-in-the-rta-competition/
Ref: http://blog.kaggle.com/2011/03/25/jose-p-gonzalez-brenes-and-matias-cortes-on-winning-the-rta-challenge/
THE HHP

TimeCheck: Should be ~2:40!!
Lessons from Kaggle Winners
1  Don't over-fit
2  All predictors are not needed
3  All data rows are not needed, either
4  Tuning the algorithms will give different results
5  Reduce the dataset (average, select transition data, …)
6  Test set & training set can differ
7  Iteratively explore & get your head around the data
8  Don't be afraid to submit simple solutions
9  Keep a tab & a history of your submissions
The Competition
“The goal of the prize is to develop a predictive
algorithm that can identify patients who will be
admitted to the hospital within the next year,
using historical claims data”
TimeLine
Data Organization

Members (113,000 entries; missing values)
   ◦  MemberID
   ◦  Age at 1st Claim
   ◦  Sex

Claims (2,668,990 entries; missing values; different coding; PayDelay capped at 162+)
   ◦  MemberID, Prov ID, Vendor, PCP
   ◦  Year, Speciality, PlaceOfSvc, PayDelay
   ◦  LengthOfStay, DaysSinceFirstClaimThatYear
   ◦  PrimaryConditionGroup, CharlsonIndex, ProcedureGroup
   ◦  SupLOS – Length of stay is suppressed during the de-identification process for some entries

LabCount (361,485 entries; fairly consistent coding (10+))
   ◦  MemberID, Year, DSFS, LabCount

DrugCount (818,242 entries; fairly consistent coding (10+))
   ◦  MemberID, Year, DSFS, DrugCount

Days In Hospital Y2 (76,039 entries; lots of zeros)
Days In Hospital Y3 (71,436 entries; lots of zeros)
Days In Hospital Y4 (70,943 entries; the Target)
   ◦  MemberID, ClaimsTruncated, DaysInHospital
Calculation & Prizes

[Slide: Prediction Error Rate plotted against the milestone deadlines]

—  Milestone deadlines:
   ◦  Aug 31, 2011 06:59:59 UTC
   ◦  Feb 13, 2012
   ◦  Sep 04, 2012
—  Final deadline: Apr 04, 2013
Now it is our turn …

HHP ANALYTICS
POA
—  Load data into SQLite
—  Use SQL to de-normalize & pick out datasets
—  Load them into R for analytics
—  Total/Distinct counts
   ◦  Claims = 2,668,991 / 113,001
   ◦  Members = 113,001
   ◦  Drug = 818,242 / 75,999 <- unique = 141,532 / 75,999 (test)
   ◦  Lab = 361,485 / 86,640 <- unique = 154,935 / 86,640 (test)
   ◦  dih_y2 = 76,039 distinct / 11,770 with dih > 0
   ◦  dih_y3 = 71,436 distinct / 10,730 with dih > 0
   ◦  dih_y4 = 70,943 distinct
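A sketch of this plan with RSQLite; the database, table, and CSV names are assumptions based on the HHP release:

    library(RSQLite)

    con <- dbConnect(SQLite(), dbname = "hhp.db")
    dbWriteTable(con, "claims",  read.csv("Claims.csv"))     # assumed file names
    dbWriteTable(con, "members", read.csv("Members.csv"))

    # Sanity-check the total / distinct counts above straight from SQL
    dbGetQuery(con, "select count(*), count(distinct MemberID) from claims")

    claims <- dbGetQuery(con, "select * from claims")        # pull into R
    dbDisconnect(con)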
Idea #1
—  dih_Y2 = β0 + β1·dih_Y1 + β2·DC + β3·LC
—  dih_Y3 = β0 + β1·dih_Y2 + β2·DC + β3·LC
—  dih_Y4 = β0 + β1·dih_Y3 + β2·DC + β3·LC
—  select count(*) from dih_y2 join dih_y3 on dih_y2.member_id = dih_y3.member_id;
—  Y2-Y3 = 51,967 (8,339 with dih_y2 > 0) / Y3-Y4 = 49,683 (7,699 with dih_y3 > 0)

—  The data to do this is not straightforward to get
   ◦  Summarize drug and lab by member, year
   ◦  Split by year to get DC & LC by year
   ◦  Add to the dih_Yx tables
   ◦  Linear Regression
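A sketch of Idea #1 as a linear regression, assuming a hypothetical joined frame d23 (one row per member with dih_y3, dih_y2 and the Y2 drug/lab totals) and a matching d34 for the year-forward prediction:

    # d23 / d34 are hypothetical joined frames built from the SQL below
    fit <- lm(dih_y3 ~ dih_y2 + dc_y2 + lc_y2, data = d23)
    summary(fit)                    # the beta estimates for the equations above

    # Re-use the fitted coefficients one year forward for the Y4 prediction
    pred_y4 <- predict(fit, newdata = data.frame(dih_y2 = d34$dih_y3,
                                                 dc_y2  = d34$dc_y3,
                                                 lc_y2  = d34$lc_y3))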
Some SQL for Idea #1
—  create table drug_tot as select member_id, year, total(drug_count) from drug_count group by member_id, year order by member_id, year;  <- total drug per year for each member
—  Same for lab_tot
—  create table drug_tot_y1 as select * from drug_tot where year = "Y1"
—  … for y2, y3 and y1, y2, y3 for lab_tot
—  … join with the dih_yx tables
Idea #2
—  Add claims at Yx to the Idea #1 equations
—  dih_Yn = β0 + β1·dih_Yn-1 + β2·DCn-1 + β3·LCn-1 + β4·Claimn-1
—  Then we will have to define the criteria for Claimn-1 from the claim predictors viz. PrimaryConditionGroup, CharlsonIndex and ProcedureGroup
The Beginning As the End
—  We started with a set of goals
—  Homework
   ◦  For me:
      –  To finish the hands-on walkthrough & post it in ~10 days
   ◦  For you:
      –  Go through the slides
      –  Do the walkthrough
      –  Submit entries to Kaggle
I enjoyed preparing the materials a lot … hope you enjoyed attending even more …

Questions?!

IDE <- RStudio
R_Packages <- c(plyr, rattle, rpart, randomForest)
R_Search <- http://www.rseek.org/, powered=google

More Related Content

What's hot

Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentationHJ van Veen
 
GBM package in r
GBM package in rGBM package in r
GBM package in rmark_landry
 
Feature engineering pipelines
Feature engineering pipelinesFeature engineering pipelines
Feature engineering pipelinesRamesh Sampath
 
Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniquesmark_landry
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine LearningCarol McDonald
 
Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Hayim Makabee
 
How to win data science competitions with Deep Learning
How to win data science competitions with Deep LearningHow to win data science competitions with Deep Learning
How to win data science competitions with Deep LearningSri Ambati
 
Introduction of Feature Hashing
Introduction of Feature HashingIntroduction of Feature Hashing
Introduction of Feature HashingWush Wu
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...Sebastian Raschka
 
Ensembling & Boosting 概念介紹
Ensembling & Boosting  概念介紹Ensembling & Boosting  概念介紹
Ensembling & Boosting 概念介紹Wayne Chen
 
Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models ananth
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16MLconf
 
Featurizing log data before XGBoost
Featurizing log data before XGBoostFeaturizing log data before XGBoost
Featurizing log data before XGBoostDataRobot
 
Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Sri Ambati
 
Methods for meta learning in AutoML
Methods for meta learning in AutoMLMethods for meta learning in AutoML
Methods for meta learning in AutoMLMohamed Maher
 
Bringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to MahoutBringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to Mahoutsscdotopen
 
QCon Rio - Machine Learning for Everyone
QCon Rio - Machine Learning for EveryoneQCon Rio - Machine Learning for Everyone
QCon Rio - Machine Learning for EveryoneDhiana Deva
 

What's hot (20)

Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentation
 
GBM package in r
GBM package in rGBM package in r
GBM package in r
 
Feature engineering pipelines
Feature engineering pipelinesFeature engineering pipelines
Feature engineering pipelines
 
Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniques
 
Ppt shuai
Ppt shuaiPpt shuai
Ppt shuai
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine Learning
 
Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)
 
How to win data science competitions with Deep Learning
How to win data science competitions with Deep LearningHow to win data science competitions with Deep Learning
How to win data science competitions with Deep Learning
 
Introduction of Feature Hashing
Introduction of Feature HashingIntroduction of Feature Hashing
Introduction of Feature Hashing
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
 
Ensembling & Boosting 概念介紹
Ensembling & Boosting  概念介紹Ensembling & Boosting  概念介紹
Ensembling & Boosting 概念介紹
 
Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models
 
AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
 
Featurizing log data before XGBoost
Featurizing log data before XGBoostFeaturizing log data before XGBoost
Featurizing log data before XGBoost
 
Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013
 
Methods for meta learning in AutoML
Methods for meta learning in AutoMLMethods for meta learning in AutoML
Methods for meta learning in AutoML
 
Bringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to MahoutBringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to Mahout
 
QCon Rio - Machine Learning for Everyone
QCon Rio - Machine Learning for EveryoneQCon Rio - Machine Learning for Everyone
QCon Rio - Machine Learning for Everyone
 

Viewers also liked

Machine learning from disaster
Machine learning from disasterMachine learning from disaster
Machine learning from disasterPhillip Trelford
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingTed Xiao
 
Manage services presentation
Manage services presentationManage services presentation
Manage services presentationLen Moncrieffe
 
Managed Services is not a product, it's a business model!
Managed Services is not a product, it's a business model!Managed Services is not a product, it's a business model!
Managed Services is not a product, it's a business model!Stuart Selbst Consulting
 
NYAI - Interactive Machine Learning by Daniel Hsu
NYAI - Interactive Machine Learning by Daniel HsuNYAI - Interactive Machine Learning by Daniel Hsu
NYAI - Interactive Machine Learning by Daniel HsuRizwan Habib
 
Managed Services Presentation
Managed Services PresentationManaged Services Presentation
Managed Services PresentationScott Gombar
 
How to become a data scientist in 6 months
How to become a data scientist in 6 monthsHow to become a data scientist in 6 months
How to become a data scientist in 6 monthsTetiana Ivanova
 
カスタマーサポートのことは嫌いでも、カスタマーサクセスは嫌いにならないでください
カスタマーサポートのことは嫌いでも、カスタマーサクセスは嫌いにならないでくださいカスタマーサポートのことは嫌いでも、カスタマーサクセスは嫌いにならないでください
カスタマーサポートのことは嫌いでも、カスタマーサクセスは嫌いにならないでくださいTakaaki Umada
 
Tips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsTips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsDarius Barušauskas
 

Viewers also liked (11)

Machine learning from disaster
Machine learning from disasterMachine learning from disaster
Machine learning from disaster
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to Stacking
 
Manage services presentation
Manage services presentationManage services presentation
Manage services presentation
 
Managed Services is not a product, it's a business model!
Managed Services is not a product, it's a business model!Managed Services is not a product, it's a business model!
Managed Services is not a product, it's a business model!
 
Kaggle: Crowd Sourcing for Data Analytics
Kaggle: Crowd Sourcing for Data AnalyticsKaggle: Crowd Sourcing for Data Analytics
Kaggle: Crowd Sourcing for Data Analytics
 
NYAI - Interactive Machine Learning by Daniel Hsu
NYAI - Interactive Machine Learning by Daniel HsuNYAI - Interactive Machine Learning by Daniel Hsu
NYAI - Interactive Machine Learning by Daniel Hsu
 
Managed Services Presentation
Managed Services PresentationManaged Services Presentation
Managed Services Presentation
 
How to become a data scientist in 6 months
How to become a data scientist in 6 monthsHow to become a data scientist in 6 months
How to become a data scientist in 6 months
 
Final pink panthers_03_31
Final pink panthers_03_31Final pink panthers_03_31
Final pink panthers_03_31
 
カスタマーサポートのことは嫌いでも、カスタマーサクセスは嫌いにならないでください
カスタマーサポートのことは嫌いでも、カスタマーサクセスは嫌いにならないでくださいカスタマーサポートのことは嫌いでも、カスタマーサクセスは嫌いにならないでください
カスタマーサポートのことは嫌いでも、カスタマーサクセスは嫌いにならないでください
 
Tips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsTips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitions
 

Similar to The Hitchhiker’s Guide to Kaggle

R, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science CompetitionsR, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science CompetitionsKrishna Sankar
 
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538Krishna Sankar
 
Slide3.ppt
Slide3.pptSlide3.ppt
Slide3.pptbutest
 
Titanic LinkedIn Presentation - 20022015
Titanic LinkedIn Presentation - 20022015Titanic LinkedIn Presentation - 20022015
Titanic LinkedIn Presentation - 20022015Carlos Hernandez
 
How to Effectively Combine Numerical Features and Categorical Features
How to Effectively Combine Numerical Features and Categorical FeaturesHow to Effectively Combine Numerical Features and Categorical Features
How to Effectively Combine Numerical Features and Categorical FeaturesDomino Data Lab
 
How to Feed a Data Hungry Organization – by Traveloka Data Team
How to Feed a Data Hungry Organization – by Traveloka Data TeamHow to Feed a Data Hungry Organization – by Traveloka Data Team
How to Feed a Data Hungry Organization – by Traveloka Data TeamTraveloka
 
Titanic survivor prediction ppt (5)
Titanic survivor prediction ppt (5)Titanic survivor prediction ppt (5)
Titanic survivor prediction ppt (5)GLA University
 
On Entities and Evaluation
On Entities and EvaluationOn Entities and Evaluation
On Entities and Evaluationkrisztianbalog
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk KnowledgeKrishna Sankar
 
Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?Fabricio Quintanilla
 
Skew-symmetric matrix completion for rank aggregation
Skew-symmetric matrix completion for rank aggregationSkew-symmetric matrix completion for rank aggregation
Skew-symmetric matrix completion for rank aggregationDavid Gleich
 
A General Overview of Machine Learning
A General Overview of Machine LearningA General Overview of Machine Learning
A General Overview of Machine LearningAshish Sharma
 
Computer Vision for Beginners
Computer Vision for BeginnersComputer Vision for Beginners
Computer Vision for BeginnersSanghamitra Deb
 

Similar to The Hitchhiker’s Guide to Kaggle (20)

R, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science CompetitionsR, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science Competitions
 
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
 
Slide3.ppt
Slide3.pptSlide3.ppt
Slide3.ppt
 
Why am I doing this???
Why am I doing this???Why am I doing this???
Why am I doing this???
 
Titanic LinkedIn Presentation - 20022015
Titanic LinkedIn Presentation - 20022015Titanic LinkedIn Presentation - 20022015
Titanic LinkedIn Presentation - 20022015
 
Big Data Workshop
Big Data WorkshopBig Data Workshop
Big Data Workshop
 
BAS 250 Lecture 8
BAS 250 Lecture 8BAS 250 Lecture 8
BAS 250 Lecture 8
 
How to Effectively Combine Numerical Features and Categorical Features
How to Effectively Combine Numerical Features and Categorical FeaturesHow to Effectively Combine Numerical Features and Categorical Features
How to Effectively Combine Numerical Features and Categorical Features
 
How to Feed a Data Hungry Organization – by Traveloka Data Team
How to Feed a Data Hungry Organization – by Traveloka Data TeamHow to Feed a Data Hungry Organization – by Traveloka Data Team
How to Feed a Data Hungry Organization – by Traveloka Data Team
 
Titanic survivor prediction ppt (5)
Titanic survivor prediction ppt (5)Titanic survivor prediction ppt (5)
Titanic survivor prediction ppt (5)
 
On Entities and Evaluation
On Entities and EvaluationOn Entities and Evaluation
On Entities and Evaluation
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
 
Kx for wine tasting
Kx for wine tastingKx for wine tasting
Kx for wine tasting
 
Weka bike rental
Weka bike rentalWeka bike rental
Weka bike rental
 
Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?
 
Skew-symmetric matrix completion for rank aggregation
Skew-symmetric matrix completion for rank aggregationSkew-symmetric matrix completion for rank aggregation
Skew-symmetric matrix completion for rank aggregation
 
A General Overview of Machine Learning
A General Overview of Machine LearningA General Overview of Machine Learning
A General Overview of Machine Learning
 
Seminar nov2017
Seminar nov2017Seminar nov2017
Seminar nov2017
 
Computer Vision for Beginners
Computer Vision for BeginnersComputer Vision for Beginners
Computer Vision for Beginners
 

More from Krishna Sankar

Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data ScienceKrishna Sankar
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXKrishna Sankar
 
An excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkAn excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkKrishna Sankar
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesKrishna Sankar
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01Krishna Sankar
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Krishna Sankar
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkKrishna Sankar
 
Bayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesBayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesKrishna Sankar
 
AWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOpsAWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOpsKrishna Sankar
 
The Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & PythonThe Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & PythonKrishna Sankar
 
Big Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 PragmaticsBig Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 PragmaticsKrishna Sankar
 
Scrum debrief to team
Scrum debrief to team Scrum debrief to team
Scrum debrief to team Krishna Sankar
 
Precision Time Synchronization
Precision Time SynchronizationPrecision Time Synchronization
Precision Time SynchronizationKrishna Sankar
 
Nosql hands on handout 04
Nosql hands on handout 04Nosql hands on handout 04
Nosql hands on handout 04Krishna Sankar
 
Cloud Interoperability Demo at OGF29
Cloud Interoperability Demo at OGF29Cloud Interoperability Demo at OGF29
Cloud Interoperability Demo at OGF29Krishna Sankar
 
A Hitchhiker's Guide to NOSQL v1.0
A Hitchhiker's Guide to NOSQL v1.0A Hitchhiker's Guide to NOSQL v1.0
A Hitchhiker's Guide to NOSQL v1.0Krishna Sankar
 

More from Krishna Sankar (18)

Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data Science
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
An excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkAn excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache Spark
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
Bayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesBayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive Bayes
 
AWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOpsAWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOps
 
The Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & PythonThe Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & Python
 
Big Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 PragmaticsBig Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 Pragmatics
 
Scrum debrief to team
Scrum debrief to team Scrum debrief to team
Scrum debrief to team
 
The Art of Big Data
The Art of Big DataThe Art of Big Data
The Art of Big Data
 
Precision Time Synchronization
Precision Time SynchronizationPrecision Time Synchronization
Precision Time Synchronization
 
Nosql hands on handout 04
Nosql hands on handout 04Nosql hands on handout 04
Nosql hands on handout 04
 
Cloud Interoperability Demo at OGF29
Cloud Interoperability Demo at OGF29Cloud Interoperability Demo at OGF29
Cloud Interoperability Demo at OGF29
 
A Hitchhiker's Guide to NOSQL v1.0
A Hitchhiker's Guide to NOSQL v1.0A Hitchhiker's Guide to NOSQL v1.0
A Hitchhiker's Guide to NOSQL v1.0
 

Recently uploaded

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 

Recently uploaded (20)

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL

The Hitchhiker’s Guide to Kaggle

  • 9. Jeremy’s Axioms (final axiom): don’t be afraid to submit simple solutions ◦ We will do this during this workshop. Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
  • 10. Summary:
  1. Don’t throw away any data! Big data to smart data.
  2. Be ready for different ways of organizing the data.
  • 11. Users apply different techniques: Support Vector Machines, Genetic Algorithms, adaBoost, Monte Carlo Methods, Bayesian Networks, Principal Component Analysis, Decision Trees, Ensemble Methods, Kalman Filters, Random Forests, Evolutionary Fuzzy Modelling, Logistic Regression, Neural Networks.
  Quora: http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
  Ref: Anthony’s Kaggle Presentation
  • 12. Let us take a 15-minute overview of the algorithms:
  ◦ Relevant in the context of this workshop
  ◦ From the perspective of the datasets we plan to use
  ◦ More qualitative than mathematical, to get a feel for the how & the why
  • 13. [Concept map] Continuous variables lead to linear regression (with bias, variance, model complexity & over-fitting as the central ideas); categorical variables lead to classifiers: k-NN (nearest neighbors), decision trees (CART), bagging, boosting.
  • 14. The datasets:
  ◦ Titanic Passenger Metadata: small, 3 predictors (Class, Sex, Age) plus the Survived? label
  ◦ Customer Churn: 17 predictors
  ◦ Kaggle competition, the Stay Alert! Ford Challenge: a simple competition dataset
  ◦ Heritage Health Prize data: complex, competition in flight
  • 15. Titanic Dataset
  — Taken from the passenger manifest
  — A good candidate for a decision tree
  — CART [Classification & Regression Tree]
  ◦ Greedy, top-down, binary, recursive partitioning that divides the feature space into sets of disjoint rectangular regions
  — CART in R
  Ref: http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
  • 16. Titanic Dataset: R walkthrough
  — Load libraries
  — Load data
  — Model CART
  — Model rattle()
  — Tree
  — Discussion
  [The slide shows the fitted tree: Male? then 3rd class? then Adult? then 3rd class? again]
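  A minimal sketch of that walkthrough, assuming a CSV named titanic.csv with columns Class, Sex, Age and Survived (the file and column names are placeholders, not the workshop's actual script):

      library(rpart)    # CART implementation
      library(rattle)   # rattle() GUI; also provides fancyRpartPlot()

      titanic <- read.csv("titanic.csv")             # hypothetical file name
      titanic$Survived <- factor(titanic$Survived)   # make the label a factor
      fit <- rpart(Survived ~ Class + Sex + Age,     # assumed column names
                   data = titanic, method = "class") # classification tree
      fancyRpartPlot(fit)                            # draw the tree
      pred <- predict(fit, titanic, type = "class")
      table(pred, titanic$Survived)                  # training confusion matrix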
  • 17. [CART tree diagram] The fitted Titanic tree splits on Male?, 3rd class?, and Adult?, separating out the Female and Child branches.
  • 18. CART lessons:
  1. Do not over-fit
  2. All predictors are not needed
  3. All data rows are not needed
  4. Tuning the algorithms will give different results
  • 19. Churn Data
  — Predict churn
  — Based on service calls, v-mail and so forth
  • 21. Challenges
  — Model complexity
  ◦ A complex model increases the training-data fit
  ◦ But then it over-fits and doesn't perform as well with real data
  — Bias vs. variance
  ◦ [Classical diagram from ESLII, by Hastie, Tibshirani & Friedman: prediction error and training error as functions of model complexity]
  • 22. Goal: model complexity (-), variance (-), prediction accuracy (+)
  Solution #1: Partition the data!
  ◦ Training (60%), Validation (20%) & “Vault” Test (20%) data sets
  k-fold Cross-Validation
  ◦ Split the data into k equal parts
  ◦ Fit the model to k-1 parts & calculate the prediction error on the kth part
  ◦ Non-overlapping datasets
  But the fundamental problem still exists! (A sketch of both ideas follows.)
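  As a sketch, the 60/20/20 partition and the k-fold indices might look like this in R (the data frame name df and the seed are arbitrary):

      set.seed(1)
      n   <- nrow(df)
      idx <- sample(n)                          # shuffled row indices
      n1  <- floor(0.6 * n); n2 <- floor(0.8 * n)
      train <- df[idx[1:n1], ]                  # 60% training
      valid <- df[idx[(n1 + 1):n2], ]           # 20% validation
      vault <- df[idx[(n2 + 1):n], ]            # 20% "vault" test

      k    <- 10                                # k-fold cross-validation
      fold <- sample(rep(1:k, length.out = n))  # non-overlapping fold labels
      # for each i in 1..k: fit on df[fold != i, ], score on df[fold == i, ]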
  • 23. Goal: model complexity (-), variance (-), prediction accuracy (+)
  Solution #2: Bootstrap
  ◦ Draw datasets (with replacement) and fit a model to each one
  – Remember: data partitioning (#1) & cross-validation (#2) are without replacement
  Bagging (Bootstrap aggregation)
  ◦ Average the prediction over a collection of bootstrapped samples, thus reducing variance (sketch below)
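  A toy illustration of bagging, assuming data frames train and test with a two-level factor outcome y (an illustration of the idea, not the workshop's code):

      library(rpart)
      B <- 25                                                 # bootstrap samples
      preds <- sapply(1:B, function(b) {
        boot <- train[sample(nrow(train), replace = TRUE), ]  # draw with replacement
        fit  <- rpart(y ~ ., data = boot, method = "class")
        predict(fit, newdata = test, type = "prob")[, 2]      # P(second class)
      })
      bagged <- rowMeans(preds)  # averaging over the B models reduces variance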
  • 24. Goal: model complexity (-), variance (-), prediction accuracy (+)
  Solution #3: Boosting
  ◦ “Output of weak classifiers into a powerful committee”
  ◦ Final prediction = weighted majority vote
  ◦ Later classifiers get the misclassified points with higher weight, so they are forced to concentrate on them
  ◦ AdaBoost (Adaptive Boosting)
  ◦ Boosting vs. bagging
  – Bagging: independent trees
  – Boosting: successively weighted
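  One hedged way to try boosting in R is the gbm package, which implements gradient boosting and offers an AdaBoost-style exponential loss; the package choice and settings here are ours, not the slide's:

      library(gbm)
      # y must be coded 0/1 for the "adaboost" distribution
      fit <- gbm(y ~ ., data = train,
                 distribution = "adaboost",   # exponential (AdaBoost-like) loss
                 n.trees = 500, interaction.depth = 2, shrinkage = 0.05)
      p <- predict(fit, newdata = test, n.trees = 500, type = "response")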
  • 25. Goal: model complexity (-), variance (-), prediction accuracy (+)
  Solution #4: Random Forests+
  ◦ Builds a large collection of de-correlated trees & averages them
  ◦ Improves on bagging by selecting i.i.d.* random variables for splitting
  ◦ Simpler to train & tune
  ◦ “Do remarkably well, with very little tuning required” – ESLII
  ◦ Less susceptible to over-fitting (than boosting)
  ◦ Many RF implementations
  – Original version: Fortran-77! By Breiman/Cutler
  – R, Mahout, Weka, Milk (ML toolkit for py), matlab
  * i.i.d.: independent, identically distributed
  + http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
  • 26. Goal: model complexity (-), variance (-), prediction accuracy (+)
  Solution, in general: Ensemble methods
  ◦ Two steps
  – Develop a set of learners
  – Combine the results to develop a composite predictor
  ◦ Ensemble methods can take the form of:
  – Using different algorithms
  – Using the same algorithm with different settings
  – Assigning different parts of the dataset to different classifiers
  ◦ Bagging & random forests are examples of ensemble methods
  Ref: Machine Learning In Action
  • 27. Random Forests
  — While boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
  — Simpler because it requires only two tuning parameters: the number of predictors tried at each split (typically √k) & the number of trees (500 for a large dataset, 150 for a smaller one)
  — Error prediction
  ◦ For each iteration, predict for the data that is not in the sample (the OOB data)
  ◦ Aggregate the OOB predictions
  ◦ Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
  – Can use this to search for the optimal number of predictors (see the sketch below)
  ◦ We will see how close this is to the actual error in the Heritage Health Prize
  — Assumes equal cost for mis-prediction; can add a cost function
  — Proximity matrix & applications like adding missing data, dropping outliers
  Ref: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective, Berk; A Brief Overview of RF by Dan Steinberg
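  A sketch of those two knobs and the OOB machinery in R's randomForest package (the data frame train and the outcome churn are placeholders):

      library(randomForest)
      k   <- ncol(train) - 1                        # number of predictors
      fit <- randomForest(churn ~ ., data = train,
                          ntree = 500,              # ~500 for a large dataset
                          mtry  = floor(sqrt(k)))   # ~sqrt(k) predictors per split
      fit$err.rate[fit$ntree, "OOB"]                # OOB estimate of the error rate
      # search for an optimal mtry using the OOB error:
      tuneRF(train[, names(train) != "churn"], train$churn,
             ntreeTry = 150, stepFactor = 1.5, improve = 0.01)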
  • 28. Lots more to explore (homework!)
  — Loss matrix
  ◦ E.g. telecom churn: better to give incentives to false positives (who are not leaving) than to optimize incentives for false negatives (who are leaving)
  — Missing values
  — Additive models
  — Bayesian models
  — Gradient boosting
  Ref: http://www.louisaslett.com/Courses/Data_Mining_09-10/ST4003-Lab4-New_Tree_Data_Set_and_Loss_Matrices.pdf
  • 29. Churn Data w/ randomForest (the slide’s code did not survive the transcript; a stand-in sketch follows)
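  A minimal stand-in, assuming the churn data sits in a data frame churnData with a factor column churn:

      library(randomForest)
      set.seed(107)
      fit <- randomForest(churn ~ ., data = churnData,
                          ntree = 150, importance = TRUE)
      fit$confusion    # OOB confusion matrix
      varImpPlot(fit)  # variable importance (service calls, v-mail, ...)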
  • 31. “I keep saying the sexy job in the next ten years will be statisticians.” Hal Varian Google Chief Economist 2009
  • 32.
  • 33. Crowdsourcing: the mismatch between those with data and those with the skills to analyse it
  • 34. Tourism Forecasting Competition: [chart of forecast error (MASE) over time, with the existing model as the benchmark and ticks at Aug 9, 2 weeks later, 1 month later, and competition end]
  • 35. Chess Ratings Competition: [chart of error rate (RMSE) over time, benchmarked against the existing model (ELO), with ticks at Aug 4, 1 month later, and 2 months later]
  • 36. 12,500 “Amateur” Data Scientists with different backgrounds
  • 37. [Bar charts of tool popularity: R, Matlab, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata, compared on Kaggle, among academics, and among Americans] Ref: Anthony’s Kaggle Presentation
  • 38. Mapping Dark Matter is an image-analysis competition whose aim is to encourage the development of new algorithms that can be applied to the challenge of measuring the tiny distortions in galaxy images caused by dark matter. ~25% of grant applications are successful. NASA tried, now it’s our turn.
  • 39. “The world’s brightest physicists have been working for decades on solving one of the great unifying problems of our universe.” “In less than a week, Martin O’Leary, a PhD student in glaciology, outperformed the state-of-the-art algorithms.”
  • 41. Why participants compete:
  1. Clean, real-world data
  2. Professional reputation & experience
  3. Interactions with experts in related fields
  4. Prizes
  • 42. Use the wizard to post a competition
  • 44. Competitions are judged based on predictive accuracy
  • 45. Competition Mechanics: competitions are judged on objective criteria
  • 46. The Anatomy of a Kaggle Competition: the Ford Competition
  • 47. Ford Challenge: the dataset
  — Goal: predict driver alertness
  — Predictors:
  ◦ Psychology: P1 .. P8
  ◦ Environment: E1 .. E11
  ◦ Vehicle: V1 .. V11
  ◦ IsAlert?
  — Data statistics are meaningless outside the IsAlert context
  • 48. Ford Challenge: dataset files
  — Three files:
  ◦ ford_train: 510 trials, ~1,200 observations each, spaced 0.1 sec apart -> 604,330 rows
  ◦ ford_test: 100 trials, ~1,200 observations/trial, 120,841 rows
  ◦ example_submission.csv
  • 50. glm (the slide’s R code did not survive the transcript; a stand-in sketch follows)
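  A plausible stand-in, regressing IsAlert on all predictors with logistic regression; the file names and the identifier columns TrialID/ObsNum are assumptions based on the dataset slides:

      ford <- read.csv("ford_train.csv")
      fit  <- glm(IsAlert ~ . - TrialID - ObsNum,  # drop assumed ID columns
                  data = ford, family = binomial)  # logistic regression
      test <- read.csv("ford_test.csv")
      p    <- predict(fit, newdata = test, type = "response")  # P(IsAlert = 1)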
  • 51. Submission & results:
  ◦ Raw data, all variables, rpart
  ◦ Raw data, selected variables, rpart
  ◦ All variables, glm
  • 52. How the Ford Competition was won: the “How I Did It” blogs
  ◦ http://blog.kaggle.com/2011/03/25/inference-on-winning-the-ford-stay-alert-competition/
  ◦ http://blog.kaggle.com/2011/04/20/mick-wagner-on-finishing-second-in-the-ford-challenge/
  ◦ http://blog.kaggle.com/2011/03/16/junpei-komiyama-on-finishing-4th-in-the-ford-competition/
  • 53. How the Ford Competition was won: Junpei Komiyama (#4)
  ◦ “To solve this problem, I constructed a Support Vector Machine (SVM), which is one of the best tools for classification and regression analysis, using the libSVM package.”
  ◦ This approach took more than 3 hours to complete
  ◦ “I found some data (P3–P6) were characterized by strong noise... Also, many environmental and vehicular data showed discrete values continuously increased and decreased. These suggested the necessity of pre-processing the observation data before SVM analysis for better performance.”
  • 54. How the Ford Competition was won: Junpei Komiyama (#4)
  ◦ Averaging improved both score and processing time
  ◦ Averaging 7 data points reduced processing by 86% & increased the score by 0.01
  ◦ Tools: Python processing of csv; libSVM
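  Junpei worked in Python, but the same 7-point averaging idea can be sketched in R (df stands for one trial's numeric observations; the grouping scheme is our reconstruction):

      grp <- rep(seq_len(ceiling(nrow(df) / 7)), each = 7)[seq_len(nrow(df))]
      df7 <- aggregate(df, by = list(grp), FUN = mean)  # one row per 7 raw rows
      # label columns come out as fractions; round or threshold them afterwards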
  • 55. How the Ford Competition was won: Mick Wagner (#2)
  ◦ Tools: Excel, SQL Server
  ◦ “I spent the majority of my time analyzing the data. I inputted the data into Excel and started examining the data, taking note of discrete and continuous values, category-based parameters, and simple statistics (mean, median, variance, coefficient of variation). I also looked for extreme outliers.”
  ◦ “I made the first 150 trials (~30%) be my test data and the remainder be my training dataset (~70%). This single factor had the largest impact on the accuracy of my final model.”
  ◦ He was concerned that using the entire dataset would create too much noise and lead to inaccuracies in the model, so he focused on data with state changes
  • 56. How the Ford Competition was won: Mick Wagner (#2)
  ◦ “After testing the Decision Tree and Neural Network algorithms against each other and submitting models to Kaggle, I found the Neural Network model to be more accurate.”
  ◦ Only used E4, E5, E6, E7, E8, E9, E10, P6, V4, V6, V10, and V11
  • 57. How the Ford Competition was won: Inference (#1)
  ◦ Very interesting
  ◦ “Our first observation is that trials are not homogeneous”, so they calculated the mean, sd, et al.
  ◦ “The training set & test set are not from the same population”: a good fit on training will result in a low score
  ◦ The lucky model (regression): -410.6073·sd(E5) + 0.1494·V11 + 4.4185·E9
  ◦ (Remember: the data had P1-P8, E1-E11, V1-V11)
  • 58. HOW THE RTA WAS WON: “This competition requires participants to predict travel time on Sydney's M4 freeway from past travel time observations.”
  • 59. Thanks to François GUILLEM & Andrzej Janusz; they both used R and shared their code & algorithms
  • 60. How the RTA was won
  — “I effectively used R for the RTA competition. For my best submission, I just used simple techniques (OLS and means), but in a clever way.” - François GUILLEM (#14)
  — “I used a simple k-NN approach, but the idea was to process the data first & to compute some summaries of the time series in consecutive timestamps, using some standard indicators from technical analysis.” - Andrzej Janusz (#17)
  • 61. How the RTA was won
  — #1 used random forests, with time, date & week as predictors - José P. González-Brenes and Matías Cortés
  — Regression models for data segments (~600 in total!); tools: Java/Weka, 4 processors, 12 GB RAM, 48 hours of computations - Marcin Pionnier (#5)
  Ref: http://blog.kaggle.com/2011/02/17/marcin-pionnier-on-finishing-5th-in-the-rta-competition/
  Ref: http://blog.kaggle.com/2011/03/25/jose-p-gonzalez-brenes-and-matias-cortes-on-winning-the-rta-challenge/
  • 62. THE HHP (time check: should be ~2:40!)
  • 63. Lessons from Kaggle winners:
  1. Don’t over-fit
  2. All predictors are not needed
  3. All data rows are not needed, either
  4. Tuning the algorithms will give different results
  5. Reduce the dataset (average, select transition data, …)
  6. The test set & training set can differ
  7. Iteratively explore & get your head around the data
  8. Don’t be afraid to submit simple solutions
  9. Keep a tab on & a history of your submissions
  • 64. The Competition “The goal of the prize is to develop a predictive algorithm that can identify patients who will be admitted to the hospital within the next year, using historical claims data”
  • 66. Data organization:
  ◦ Members: 113,000 entries; MemberID, Age at 1st Claim (missing values), Sex
  ◦ Claims: 2,668,990 entries; MemberID, ProvID, Vendor, PCP, Year, Speciality, PlaceOfSvc, PayDelay (capped at 162+), LengthOfStay, DaysSinceFirstClaimThatYear, PrimaryConditionGroup, CharlsonIndex, ProcedureGroup, SupLOS; missing values, different coding, claims truncated; SupLOS means the length of stay is suppressed during the de-identification process for some entries
  ◦ Days In Hospital Y2: 76,039 entries; Y3: 71,436 entries; Y4 (target): 70,943 entries; MemberID, DaysInHospital
  ◦ LabCount: 361,485 entries; MemberID, Year, DSFS, LabCount; fairly consistent coding (10+), lots of zeros
  ◦ DrugCount: 818,242 entries; MemberID, Year, DSFS, DrugCount; fairly consistent coding (10+)
  • 67.
  • 68. Calculation & prizes: judged on prediction error rate. Milestone deadlines: Aug 31, 2011 06:59:59 UTC; Feb 13, 2012; Sep 04, 2012; final deadline Apr 04, 2013.
  • 69. Now it is our turn … HHP ANALYTICS
  • 70. POA
  — Load the data into SQLite
  — Use SQL to de-normalize & pick out datasets
  — Load them into R for analytics (see the sketch below)
  — Total/distinct counts:
  ◦ Claims = 2,668,991/113,001
  ◦ Members = 113,001
  ◦ Drug = 818,242/75,999 <- unique = 141,532/75,999 (test)
  ◦ Lab = 361,485/86,640 <- unique = 154,935/86,640 (test)
  ◦ dih_y2 = 76,039; 11,770 distinct with dih > 0
  ◦ dih_y3 = 71,436; 10,730 distinct with dih > 0
  ◦ dih_y4 = 70,943 distinct
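  A hedged sketch of that pipeline from the R side, using DBI/RSQLite; the database name, table name, and CSV file are our assumptions:

      library(DBI)
      library(RSQLite)
      con <- dbConnect(SQLite(), "hhp.db")                   # on-disk database
      dbWriteTable(con, "claims", read.csv("Claims.csv"),    # load a raw CSV
                   overwrite = TRUE)
      counts <- dbGetQuery(con,                              # total vs. distinct
        "select count(*) as total, count(distinct MemberID) as members
           from claims")
      dbDisconnect(con)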
  • 71. Idea #1
  — dih_Y2 = β0 + β1·dih_Y1 + β2·DC + β3·LC
  — dih_Y3 = β0 + β1·dih_Y2 + β2·DC + β3·LC
  — dih_Y4 = β0 + β1·dih_Y3 + β2·DC + β3·LC
  — select count(*) from dih_y2 join dih_y3 on dih_y2.member_id = dih_y3.member_id;
  — Y2-Y3 = 51,967 (8,339 with dih_y2 > 0) / Y3-Y4 = 49,683 (7,699 with dih_y3 > 0)
  — The data is not straightforward to get into this shape:
  ◦ Summarize drug and lab by member and year
  ◦ Split by year to get DC & LC per year
  ◦ Add them to the dih_Yx tables
  ◦ Run the linear regression
  • 72. Some SQL for idea #1
  — create table drug_tot as select member_id, year, total(drug_count) as drug_total from drug_count group by member_id, year order by member_id, year; <- total drug (and likewise lab) per year for each member
  — The same for lab_tot
  — create table drug_tot_y1 as select * from drug_tot where year = 'Y1';
  — … likewise for Y2, Y3, and for Y1, Y2, Y3 of lab_tot
  — … then join with the dih_yx tables
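  Once the totals are joined onto the dih tables, idea #1 reduces to lm() in R; the joined data frame y23 and its column names are assumptions following the SQL above:

      # y23: one row per member present in both years, holding dih_y2, dih_y3,
      # and the Y2 drug/lab totals as DC and LC
      fit <- lm(dih_y3 ~ dih_y2 + DC + LC, data = y23)
      summary(fit)    # inspect the beta coefficients
      # the same fit, shifted one year forward, scores the Y4 target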
  • 73. Idea #2
  — Add the claims at year n-1 to the idea #1 equations:
  dih_Yn = β0 + β1·dih_Y(n-1) + β2·DC(n-1) + β3·LC(n-1) + β4·Claim(n-1)
  — Then we will have to define the criteria for Claim(n-1) from the claim predictors, viz. PrimaryConditionGroup, CharlsonIndex and ProcedureGroup
  • 74. The Beginning As the End
  — We started with a set of goals
  — Homework:
  ◦ For me: to finish the hands-on walkthrough & post it in ~10 days
  ◦ For you: go through the slides, do the walkthrough, submit entries to Kaggle
  • 75. I enjoyed preparing the materials a lot … hope you enjoyed attending even more … Questions?
  IDE <- "RStudio"
  R_Packages <- c("plyr", "rattle", "rpart", "randomForest")
  R_Search <- "http://www.rseek.org/ (powered by Google)"