Presented at #H2OWorld 2017 in Mountain View, CA.
Enjoy the video: https://youtu.be/42Oo8TOl85I.
Learn more about H2O.ai: https://www.h2o.ai/.
Follow @h2oai: https://twitter.com/h2oai.
- - -
Abstract:
In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts. Although H2O has made it easier for practitioners to train and deploy machine learning models at scale, there is still a fair bit of knowledge and background in data science that is required to produce high-performing machine learning models. Deep Neural Networks, in particular, are notoriously difficult for a non-expert to tune properly. In this presentation, we provide an overview of the field of "Automatic Machine Learning" and introduce the new AutoML functionality in H2O. H2O's AutoML provides an easy-to-use interface that automates the process of training a large, comprehensive selection of candidate models and a stacked ensemble model which, in most cases, will be the top-performing model in the AutoML Leaderboard. H2O AutoML is available in all the H2O interfaces, including the h2o R package, Python module and the Flow web GUI. We will also provide simple code examples to get you started using AutoML.
Erin's Bio:
Erin is a Statistician and Machine Learning Scientist at H2O.ai. She is the main author of H2O Ensemble. Before joining H2O, she was the Principal Data Scientist at Wise.io and Marvin Mobile Security (acquired by Veracode in 2012) and the founder of DataScientific, Inc. Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing. She also holds a B.S. and M.A. in Mathematics.
8. Data Prep
• Imputation of missing data
• Standardization of numeric features
• One-hot encoding of categorical features
• Count/Label/Target encoding of categorical features
• Feature selection and/or feature extraction (e.g. PCA)
• Feature engineering
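As a rough illustration of the first three data prep steps listed above, here is a minimal sketch in plain Python. The helper names (`impute_mean`, `standardize`, `one_hot`) and the toy columns are hypothetical and chosen only for this example; they are not H2O functions, and H2O's own preprocessing may differ in detail.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Rescale a numeric column to zero mean and unit (population) std dev."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """Encode a categorical column as 0/1 indicators, one per level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Toy columns (made up for illustration)
ages = [20.0, None, 40.0, 30.0]
colors = ["red", "blue", "red", "green"]

ages_imputed = impute_mean(ages)          # the None becomes the observed mean
ages_scaled = standardize(ages_imputed)   # centered and scaled
levels, colors_encoded = one_hot(colors)  # columns ordered as in `levels`
```

Mean imputation and population standard deviation are just one common choice for each step; median imputation or other scalers would fit the same pattern.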
9. Model Generation
• Cartesian grid search
• Random grid search
• Tune individual models via Early Stopping
• Bayesian Hyperparameter Optimization
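To make the difference between the first two search strategies concrete, the sketch below enumerates a full Cartesian grid and then draws a random subset of it. The hyperparameter names and values are hypothetical, and `cartesian_grid` / `random_grid` are illustrative helpers, not part of any library.

```python
import itertools
import random

# Hypothetical hyperparameter space for a tree-based model
space = {
    "max_depth": [3, 5, 7, 9],
    "learn_rate": [0.01, 0.05, 0.1],
    "sample_rate": [0.6, 0.8, 1.0],
}

def cartesian_grid(space):
    """Every combination of hyperparameter values (exhaustive search)."""
    names = list(space)
    return [dict(zip(names, combo))
            for combo in itertools.product(*space.values())]

def random_grid(space, max_models, seed=1):
    """A random sample of distinct combinations, capped at max_models."""
    rng = random.Random(seed)
    grid = cartesian_grid(space)
    return rng.sample(grid, min(max_models, len(grid)))

full = cartesian_grid(space)                 # 4 * 3 * 3 = 36 candidates
subset = random_grid(space, max_models=10)   # train only 10 of them
```

The Cartesian grid grows multiplicatively with every added hyperparameter, whereas the random grid lets you fix the training budget up front.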
12. Stacked Ensembles
• Specify L base learners (with model params).
• Specify a metalearner (just another algo).
• Perform k-fold cross-validation on the base learners.
13. Stacked Ensembles
• Collect cross-validated predicted values from base learners.
• Train a second-level metalearning algorithm to find the optimal combination of base learners.
• Metalearner requires only a small amount of compute on top of the cross-validation process (it’s cheap).
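The stacking steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not H2O's implementation: the two base learners (a straight-line fit and a 1-nearest-neighbour lookup), the unconstrained least-squares metalearner, and the synthetic data are all chosen here for brevity.

```python
import random

def fit_line(xs, ys):
    """Base learner 1: ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_1nn(xs, ys):
    """Base learner 2: predict the response of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def cv_predictions(fit, xs, ys, k):
    """k-fold CV: each point is predicted by a model that never saw it."""
    n = len(xs)
    preds = [None] * n
    for fold in range(k):
        train_idx = [i for i in range(n) if i % k != fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in range(fold, n, k):
            preds[i] = model(xs[i])
    return preds

def fit_metalearner(z1, z2, ys):
    """Least-squares weights for y ~ w1*z1 + w2*z2 (2x2 normal equations)."""
    a11 = sum(p * p for p in z1)
    a12 = sum(p * q for p, q in zip(z1, z2))
    a22 = sum(q * q for q in z2)
    b1 = sum(p * y for p, y in zip(z1, ys))
    b2 = sum(q * y for q, y in zip(z2, ys))
    det = a11 * a22 - a12 * a12
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# Synthetic data (made up): a noisy quadratic
rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [x * x + rng.gauss(0, 0.1) for x in xs]

# Level-one data: cross-validated predictions from each base learner
z_line = cv_predictions(fit_line, xs, ys, k=5)
z_nn = cv_predictions(fit_1nn, xs, ys, k=5)

# Metalearner: find the best linear combination of the base learners
w_line, w_nn = fit_metalearner(z_line, z_nn, ys)
stacked_cv = [w_line * p + w_nn * q for p, q in zip(z_line, z_nn)]

mse_line, mse_nn = mse(z_line, ys), mse(z_nn, ys)
mse_stack = mse(stacked_cv, ys)

# Final ensemble: base learners refit on all data, combined with the weights
line_model, nn_model = fit_line(xs, ys), fit_1nn(xs, ys)

def ensemble(x):
    return w_line * line_model(x) + w_nn * nn_model(x)
```

Because the metalearner minimizes squared error over the cross-validated predictions, `mse_stack` cannot exceed either base learner's cross-validated MSE on this level-one data. Note that the expensive part is the k-fold training of the base learners; the metalearner itself is a single small linear solve.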
14. Random Grid Search + Stacking
• Random Grid Search combined with Stacked Ensembles is a powerful combination.
• Stacked Ensembles perform particularly well if the models they are based on (1) are individually strong, and (2) make uncorrelated errors.
• Random Grid Search is an excellent way to create a diverse set of models for the ensemble.
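A small simulation can show why uncorrelated errors matter. The sketch below is an illustration with made-up Gaussian errors, not a result from the talk: it compares averaging two models whose errors are independent against averaging two models whose errors are identical.

```python
import random

rng = random.Random(0)
n = 10_000

# Simulated prediction errors of two equally strong models (unit variance)
err_a = [rng.gauss(0, 1) for _ in range(n)]
err_b = [rng.gauss(0, 1) for _ in range(n)]  # independent of err_a

def mse_of_average(e1, e2):
    """MSE of an ensemble that averages two models with the given errors."""
    return sum(((a + b) / 2) ** 2 for a, b in zip(e1, e2)) / len(e1)

mse_single = sum(e * e for e in err_a) / n
mse_uncorrelated = mse_of_average(err_a, err_b)  # errors partly cancel
mse_correlated = mse_of_average(err_a, err_a)    # identical errors: no gain
```

With independent errors of equal variance, the averaged error has roughly half the variance of a single model; with perfectly correlated errors, averaging changes nothing. Diverse base models sit somewhere between these two extremes.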
16. H2O Machine Learning Platform
• Distributed (multi-core + multi-node) implementations of cutting edge ML algorithms.
• Core algorithms written in high performance Java.
• APIs available in R, Python, Scala & web GUI.
• Works on Hadoop, Spark, EC2, your laptop, etc.
• Easily deploy models to production as pure Java code.
17. H2O AutoML (first cut)
• Imputation, one-hot encoding, standardization.
• Random Grid Search over a custom hyperparameter space, defined by expert data scientists.
• Early stopping of individual models and random grids.
• GBMs, Random Forests, Deep Neural Nets, GLMs
• Multiple Stacked Ensembles of models.
• Leaderboard for ranking.
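The early stopping idea mentioned above can be sketched generically as follows. This is a simplified rule written for illustration, not H2O's exact stopping criterion; the function name and the `patience` and `min_delta` parameters are hypothetical.

```python
def train_with_early_stopping(validation_losses, patience=3, min_delta=1e-3):
    """Stop once the loss has failed to improve by min_delta for
    `patience` consecutive rounds. Returns (rounds_trained, best_loss)."""
    best = float("inf")
    rounds_without_improvement = 0
    round_number = 0
    for round_number, loss in enumerate(validation_losses, start=1):
        if loss < best - min_delta:
            best = loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
        if rounds_without_improvement >= patience:
            break  # no meaningful progress: stop spending compute here
    return round_number, best

# Made-up validation losses, one per training round
losses = [0.9, 0.7, 0.6, 0.59, 0.595, 0.592, 0.591, 0.5, 0.4]
rounds_trained, best_loss = train_with_early_stopping(losses)
```

In this example training halts after round 7, so the later rounds are never run. The same rule can be applied one level up, to stop a random grid when new models no longer improve on the best one found so far.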
18. H2O AutoML in R
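library(h2o)
h2o.init()  # start or connect to a local H2O cluster
train <- h2o.importFile("train.csv")
# Run AutoML for at most 600 seconds (10 minutes)
aml <- h2o.automl(y = "response_colname",
                  training_frame = train,
                  max_runtime_secs = 600)
lb <- aml@leaderboard  # candidate models ranked by performance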
19. H2O AutoML in Python
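import h2o
from h2o.automl import H2OAutoML
h2o.init()  # start or connect to a local H2O cluster
train = h2o.import_file("train.csv")
# Run AutoML for at most 600 seconds (10 minutes)
aml = H2OAutoML(max_runtime_secs=600)
aml.train(y="response_colname",
          training_frame=train)
lb = aml.leaderboard  # candidate models ranked by performance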
26. First-time Qwiklab Account Setup
• Go to http://h2oai.qwiklab.com
• Click on “JOIN”
• Create a new account with a valid email address
• You will receive a confirmation email
• Click on the link in the confirmation email
• Go back to http://h2oai.qwiklab.com and log in
• Go to the Catalog on the left bar
• Choose “Introduction to AutoML in H2O”
• Wait for instructions