Former world #1 Kaggle Grandmaster and Research Data Scientist at H2O.ai, Marios Michailidis, will delve into the competitive edge that Driverless AI brings out of the box.
Driverless AI can score in the top 5% of popular data science challenges against thousands of participants, in a matter of minutes and with limited processing power.
Apart from the predictions themselves, one can use Driverless AI's data munging and the knowledge it derives from the data to build even more powerful models.
This webinar discusses how Driverless AI achieves competitive scores in popular Kaggle challenges. Marios will also explain the concepts of hyper-parameter tuning and stacking, and how they help produce stronger predictions.
Bio:
Former world no. 1 Kaggle Grandmaster, Marios Michailidis, is now a Research Data Scientist at H2O.ai. He is finishing his PhD in machine learning at University College London (UCL) with a focus on ensemble modeling; his previous education includes a B.Sc. in Accounting and Finance from the University of Macedonia in Greece and an M.Sc. in Risk Management from the University of Southampton. He has gained exposure to the marketing and credit sectors in the UK market and has successfully led multiple analytics projects on a wide array of themes.
Before H2O.ai, Marios was a Senior Personalization Data Scientist at dunnhumby, where his main role was to improve existing algorithms, research the benefits of advanced machine learning methods, and provide data insights. He created a matrix factorization library in Java along with a demo version of a personalized search capability. Prior to dunnhumby, Marios held positions at iQor, Capita, British Pearl, and Ey-Zein.
On a personal level, he is the creator and administrator of KazAnova, a freeware GUI for quick credit scoring and data mining written entirely in Java. He is also the creator of the StackNet Meta-Modelling Framework.
3. How to perceive Driverless AI
• It is an AI that creates AI
• Creates machine learning models given:
Some input data
A target variable (e.g., "Will there be a default?")
An objective (e.g., minimize prediction error)
Some allocated computing power (e.g., 6 CPU cores)
• Outputs: predictions, feature engineering, model interpretability, insight
4. How does DAI become competitive?
• Mostly through exhaustive feature engineering
• Using (and tuning) XGBoost models
• Ensembling
5. Tuning XGBoost
• Initialize XGBoost with modest parameters and a small learning rate, but 10,000 potential trees.
• Cross-validation is used to find the optimal maximum depth of the trees.
• Early stopping is then used to fix the number of trees.
• Commence feature engineering.
• Revisit the parameters at the end (see the sketch below).
[Diagram of the tuning loop: find best maximum depth → best number of trees → feature engineering → revisit parameters]
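A minimal sketch of this recipe on synthetic data; the depth grid, learning rate, and 50-round early-stopping patience are illustrative assumptions, not DAI's actual internals:

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# synthetic stand-in for the competition data
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

best_depth, best_score = None, float("inf")
for depth in [4, 6, 8, 10]:                        # cross-validate maximum depth
    params = {"objective": "binary:logistic",
              "eta": 0.05,                         # small learning rate
              "max_depth": depth,
              "eval_metric": "logloss"}
    cv = xgb.cv(params, dtrain, num_boost_round=10000,
                nfold=5, early_stopping_rounds=50, seed=0)
    if cv["test-logloss-mean"].min() < best_score:
        best_depth, best_score = depth, cv["test-logloss-mean"].min()

# early stopping on a holdout then fixes the number of trees
params["max_depth"] = best_depth
booster = xgb.train(params, dtrain, num_boost_round=10000,
                    evals=[(dval, "val")], early_stopping_rounds=50)
print(best_depth, booster.best_iteration)
```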
6. Ensembling
• After feature engineering, ensembling takes place, based on the resources allocated and the accuracy achieved.
• Up to 40 different XGBoost models are built
• Different combinations of:
• Maximum depths
• Tree-growing policies (loss- or depth-wise)
• Maximum leaves
• A simple average of all models (see the sketch below)
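The exact grid is not shown on the slide, so the values below are assumptions; the sketch just illustrates the shape of the approach: enumerate 40 XGBoost configurations over depth, growing policy, and leaf count, then average the predictions.

```python
import itertools
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

# synthetic stand-ins; in practice dtest would be the real test set
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
dtrain = xgb.DMatrix(X, label=y)
dtest = xgb.DMatrix(X)

# 4 depths x 2 growing policies x 5 leaf limits = 40 configurations
grid = itertools.product([4, 6, 8, 10],
                         ["depthwise", "lossguide"],
                         [16, 32, 64, 128, 256])
preds = []
for depth, policy, max_leaves in grid:
    params = {"objective": "binary:logistic",
              "eta": 0.05,
              "max_depth": depth,
              "grow_policy": policy,
              "max_leaves": max_leaves if policy == "lossguide" else 0,
              "tree_method": "hist"}      # lossguide requires the hist builder
    booster = xgb.train(params, dtrain, num_boost_round=200)
    preds.append(booster.predict(dtest))

ensemble = np.mean(preds, axis=0)         # simple average of all 40 models
```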
7. Why Ensembling (1) - Data
• Around 3,000 teams
• 133 anonymized columns, numerical or categorical
• 115K rows, binary target (accelerate approval)
• DAI scores in the top 2%
• It had taken my team almost 3 weeks to get there (we eventually finished 3rd)
8. Why Ensembling (2) - Impact
[Screenshot of the DAI experiment page: after-model options, best features found, performance through time, ensemble impact]
9. Why Ensembling (3) - Results
[Leaderboard chart: top 2% with the ensemble, around top 4% without]
10. Empowering DAI (1) - Data
• Popular competition (around 1,700 teams) in 2013
• Only 9 columns (8 unique)
• High cardinality: thousands of unique values
• 90K rows combined for train and test
• Scope: determine an employee's access needs
• Metric to maximize: AUC (Area Under the ROC Curve)
22. Empowering DAI (6) – Get features
• Download the feature engineering output of DAI
• 55 features derived (out of the initial 9)
• The target column in the training data
23. Empowering DAI (7) – Value of FE
• The initial set of features is not very predictive without transformations
• The features derived by DAI are very predictive
Initial features      AUC      Gini
RESOURCE              0.501    0.26%
MGR_ID                0.460   -8.09%
ROLE_ROLLUP_1         0.445  -10.97%
ROLE_ROLLUP_2         0.515    3.04%
ROLE_DEPTNAME         0.534    6.84%
ROLE_TITLE            0.521    4.18%
ROLE_FAMILY_DESC      0.528    5.66%
ROLE_FAMILY           0.495   -0.98%

DAI features          AUC      Gini
37_CV_TE_MGR_ID…      0.840    67.9%
18_CV_TE_MGR_ID…      0.819    63.9%
13_CV_TE_MGR_ID…      0.805    61.1%
9_CV_TE_MGR_ID_…      0.796    59.2%
50_WoE_ROLE_DEP…      0.779    55.8%
49_WoE_MGR_ID_R…      0.779    55.7%
45_WoE_MGR_ID_R…      0.774    54.7%
0_CV_TE_MGR_ID_…      0.766    53.2%
8_WoE_MGR_ID_RO…      0.765    53.1%
43_WoE_MGR_ID_R…      0.765    53.0%
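The feature names hint at what DAI built: "CV_TE" reads as cross-validated (out-of-fold) target encoding and "WoE" as weight of evidence, and the Gini column is simply 2·AUC − 1. A minimal sketch of out-of-fold target encoding and the single-column AUC/Gini scoring used in the tables, on synthetic stand-in data:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# toy stand-in for the competition data: one categorical id column + target
rng = np.random.default_rng(0)
df = pd.DataFrame({"MGR_ID": rng.integers(0, 500, 10000)})
df["target"] = (rng.random(10000) < 0.1 + (df["MGR_ID"] % 7) / 20).astype(int)

def cv_target_encode(frame, col, target, n_splits=5, seed=0):
    """Encode a category by out-of-fold target means to avoid leakage."""
    encoded = pd.Series(np.nan, index=frame.index)
    for tr_idx, val_idx in KFold(n_splits, shuffle=True, random_state=seed).split(frame):
        means = frame.iloc[tr_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = frame[col].iloc[val_idx].map(means).values
    return encoded.fillna(frame[target].mean())   # unseen levels get the prior

df["te_mgr_id"] = cv_target_encode(df, "MGR_ID", "target")
auc = roc_auc_score(df["target"], df["te_mgr_id"])
print(f"AUC={auc:.3f}  Gini={2 * auc - 1:.1%}")   # Gini = 2*AUC - 1
```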
24. Empowering DAI (8) - Stacking
Models built on DAI FE     Test LB
LightGBM with gbdt         0.909
LightGBM with dart         0.909
Extra Trees                0.910
Random Forest              0.907
Logistic Regression        0.898
LightGBM RMSE              0.906
LightGBM Huber             0.900
XGBoost                    0.908
DAI                        0.909

Stacking on the DAI-derived data: from 0.90933 to 0.91045 (see the sketch below)
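A minimal sketch of the stacking step: base models produce out-of-fold predictions, and a level-2 model is trained on those predictions. Two scikit-learn models stand in for the nine models in the table, and the logistic meta-learner is an assumption; the slide does not name the level-2 model.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# synthetic stand-ins for the train and test sets
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_test, _ = make_classification(n_samples=500, n_features=20, random_state=1)

base_models = [ExtraTreesClassifier(n_estimators=300, random_state=0),
               RandomForestClassifier(n_estimators=300, random_state=0)]

kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros((len(X), len(base_models)))          # out-of-fold meta-features
test_meta = np.zeros((len(X_test), len(base_models)))

for j, model in enumerate(base_models):
    for tr_idx, val_idx in kf.split(X):
        fitted = clone(model).fit(X[tr_idx], y[tr_idx])
        oof[val_idx, j] = fitted.predict_proba(X[val_idx])[:, 1]
    test_meta[:, j] = clone(model).fit(X, y).predict_proba(X_test)[:, 1]

# level-2 model trained on out-of-fold predictions only
meta = LogisticRegression().fit(oof, y)
stacked = meta.predict_proba(test_meta)[:, 1]
```

Training the meta-learner only on out-of-fold predictions is what keeps the level-2 model from rewarding base models that merely memorized their training rows.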
25. Empowering DAI (9) – Plus counts
Stacking on the DAI-derived data: from 0.91045 to 0.914
• DAI is production-ready: it ignores information about the test data during learning… Kagglers don't!
• Knowing the distribution of the test data helps make better predictions, for example how frequent a category is.
New row in the model table above (count-encoding sketch below):
LightGBM plus counts       0.913
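A minimal sketch of the count trick, on toy stand-in frames: category frequencies are computed over train and test combined, which is exactly the use of test-distribution knowledge described above.

```python
import pandas as pd

# toy train/test frames standing in for the competition data
train = pd.DataFrame({"RESOURCE": [101, 101, 202, 303],
                      "MGR_ID":   [7, 7, 7, 9]})
test = pd.DataFrame({"RESOURCE": [101, 404],
                     "MGR_ID":   [9, 9]})

# frequencies over train AND test together; a production system like DAI
# avoids this by default because the test distribution is not known upfront
for col in ["RESOURCE", "MGR_ID"]:
    counts = pd.concat([train[col], test[col]]).value_counts()
    train[f"{col}_count"] = train[col].map(counts)
    test[f"{col}_count"] = test[col].map(counts)
```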
26. Empowering DAI (10) – Plus OHE
New row in the model table above:
Logistic plus dummies      0.907

Stacking on the DAI-derived data: from 0.914 to 0.9158
• The logistic model does not perform as well, because the best features were found using tree methods.
• Dummy variables (one-hot encoding) can improve results for linear models (see the sketch below).
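A minimal sketch of the dummy-variable idea, on synthetic stand-in columns; handle_unknown="ignore" keeps category levels that appear only in the test set from breaking the transform.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# synthetic stand-ins for two of the categorical columns
rng = np.random.default_rng(0)
train = pd.DataFrame({"RESOURCE": rng.integers(0, 50, 1000),
                      "MGR_ID": rng.integers(0, 30, 1000)})
y = rng.integers(0, 2, 1000)
test = pd.DataFrame({"RESOURCE": rng.integers(0, 50, 200),
                     "MGR_ID": rng.integers(0, 30, 200)})

# one sparse dummy column per category level
ohe = OneHotEncoder(handle_unknown="ignore")
X_tr = ohe.fit_transform(train)
X_te = ohe.transform(test)

logit = LogisticRegression(max_iter=1000).fit(X_tr, y)
pred = logit.predict_proba(X_te)[:, 1]
```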
27. Empowering DAI (11) - Timeline
Step                      Time      Rank   Test AUC
Predictions from DAI      2 hours     79   0.9093
Plus stacking 9 models    4 hours     73   0.91045
Test counts               5 hours     38   0.9139
Dummies                   6 hours     20   0.9158

[Chart: AUC in test data versus time]
28. Further Improvement
• Let it run for more time.
• Generate more DAI datasets; the genetic algorithm may come up with (slightly) different features every time.
• Check the predictions and search for areas where DAI might not have done as well as you.
• Add deep learning models or other algorithmic families.
• Add your own features.
• Add your own models and do stacking using the K-fold paradigm.
29. Final words
• Can DAI beat me in predictive modelling competitions?
  • In time, (probably) yes
  • In depth and creativity, (probably) no
• Can I improve my score with DAI?
  • Yes, I can use the features in my models
  • Yes, I can use the predictions for stacking
  • Yes, I can use the interpretability module or other tools to get insight about potential additions and pitfalls
  • Yes, while DAI is running I can focus on other things, like checking visualizations and exploring the data