[db analytics showcase Sapporo 2018] B33 H2O4GPU and GoAI: harnessing the power of GPUs.
1. H2O4GPU and GoAI: harnessing the power of GPUs.
Mateusz Dymczyk
Senior Software Engineer
H2O.ai
@mdymczyk
2. Agenda
• About me
• About H2O.ai
• A bit of history: H2O-3
• Moving forward: feature engineering & Driverless AI
• The need for GPUs
• GPU overview
• Machine Learning + GPUs = why? how?
• About GoAI
• About H2O4GPU
• Q&A
3. About me
• M.Sc. in Computer Science @ AGH UST in Poland
• Ph.D. dropout (machine learning)
• Previously NLP/ML @ Fujitsu Laboratories, Kanagawa
• Currently Lead/Senior Machine Learning Engineer @
H2O.ai (remotely from Tokyo)
• Conference speaker (Strata Beijing/NY/Singapore,
Hadoop World Tokyo etc.)
4. About H2O.ai
FOUNDED 2012, SERIES C IN NOV, 2017
PRODUCTS • DRIVERLESS AI – AUTOMATED MACHINE LEARNING
• H2O OPEN SOURCE MACHINE LEARNING
• SPARKLING WATER
• H2O4GPU OS ML GPU LIBRARY
MISSION DEMOCRATIZE AI
TEAM • ~100 EMPLOYEES
• SEVERAL KAGGLE GRANDMASTERS
• DISTRIBUTED SYSTEMS ENGINEERS DOING MACHINE LEARNING
• WORLD-CLASS VISUALIZATION DESIGNERS
OFFICES MOUNTAIN VIEW, LONDON, PRAGUE
8. H2O-3 Overview
• Distributed implementations of cutting edge ML algorithms.
• Core algorithms written in high performance Java.
• APIs available in R, Python, Scala, REST/JSON.
• Interactive Web GUI called H2O Flow.
• Easily deploy models to production with H2O Steam.
9. H2O-3 Distributed Computing
• Multi-node cluster with shared memory model.
• All computations in memory.
• Each node sees only some rows of the data.
• No limit on cluster size.
• Distributed data frames (collection of vectors).
• Columns are distributed (across nodes) arrays.
• Works just like R’s data.frame or Python Pandas DataFrame
H2O Frame
H2O Cluster
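The row-sharded, map-reduce style of computation described above can be sketched in a few lines of plain Python. This is a toy stand-in, not H2O-3's actual implementation: a column's rows are split into per-node chunks, each "node" computes a partial result, and the partials are reduced into a global statistic.

```python
# Toy sketch of the distributed-frame idea (NOT the real H2O-3 internals):
# rows are sharded across nodes, and a global statistic is computed
# map-reduce style over the per-node chunks.

def split_into_chunks(column, n_nodes):
    """Shard a column's rows across n_nodes (round-robin for simplicity)."""
    chunks = [[] for _ in range(n_nodes)]
    for i, value in enumerate(column):
        chunks[i % n_nodes].append(value)
    return chunks

def distributed_mean(chunks):
    """Map: each node computes (sum, count). Reduce: combine the partials."""
    partials = [(sum(chunk), len(chunk)) for chunk in chunks]  # map step
    total, count = map(sum, zip(*partials))                    # reduce step
    return total / count

column = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chunks = split_into_chunks(column, n_nodes=3)
mean = distributed_mean(chunks)  # each node saw only 2 of the 6 rows
```

Each node only ever touches its own rows, which is why the cluster can hold frames far larger than any single node's memory.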
10. H2O-3 Algorithms
Supervised Learning
• Statistical Analysis
  • Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson and Tweedie
  • Naïve Bayes
• Ensembles
  • Distributed Random Forest: classification or regression models
  • Gradient Boosting Machine: produces an ensemble of decision trees with increasingly refined approximations
• Deep Neural Networks
  • Deep Learning: creates multi-layer feed-forward neural networks, starting with an input layer followed by multiple layers of nonlinear transformations
Unsupervised Learning
• Clustering
  • K-means: partitions observations into k clusters/groups of the same spatial size; automatically detects the optimal k
• Dimensionality Reduction
  • Principal Component Analysis: linearly transforms correlated variables into independent components
  • Generalized Low Rank Models: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing values
• Anomaly Detection
  • Autoencoders: find outliers using nonlinear dimensionality reduction with deep learning
12. The Need for Automation
“The United States alone faces a shortage of 140,000 to
190,000 people with analytical expertise and 1.5 million
managers and analysts”
–McKinsey Prediction for 2018
13. Recipe for Success
Auto feature generation: Kaggle Grandmaster know-how out of the box
• Automatic Text Handling
• Frequency Encoding
• Cross-Validation Target Encoding
• Truncated SVD
• Clustering, and more
Feature transformations turn the original features into generated features.
18. Moore’s Law
[Figure: "40 Years of Microprocessor Trend Data" (1980-2020, log scale) — transistor counts (thousands) keep climbing, while single-threaded performance growth slowed from 1.5X per year to 1.1X per year. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010-2015 collected by K. Rupp.]
19. GPU
[Figure: the same microprocessor trend data overlaid with GPU-computing performance, which grows 1.5X per year versus 1.1X per year for single-threaded performance — on track for 1000X by 2025. The gains come from the whole stack: applications, systems, algorithms, CUDA, and architecture.]
25. GPU Architecture
Low latency vs. high throughput
GPU
• Optimized for data-parallel,
throughput computation
• Architecture tolerant of
memory latency
• More transistors dedicated to
computation
CPU
• Optimized for low-latency
access to cached data sets
• Control logic for out-of-order
and speculative execution
37. H2O4GPU
• Open-Source: https://github.com/h2oai/h2o4gpu
• Collection of important ML algorithms ported to the GPU (with CPU fallback option):
• Gradient Boosted Machines
• GLM
• Truncated SVD
• PCA
• KMeans
• (soon) Field Aware Factorization Machines
• Performance optimized, multi-GPU support (certain algorithms)
• Used within our own Driverless AI Product to boost performance 30X
• Scikit-Learn compatible Python API (and now R API)
39. Gradient Boosting Machines
• Based upon XGBoost
• Raw floating-point data is binned into quantiles
• Quantiles are stored compressed instead of as floats
• Compressed quantiles are efficiently transferred to the GPU
• Sparsity is handled directly, with high GPU efficiency
• Multi-GPU via sharding rows using NVIDIA NCCL AllReduce
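The quantile-binning step above can be sketched in plain Python: raw floats are mapped to a small number of quantile bins, so each value fits in a single byte (with 256 bins) and far less data has to be shipped to the GPU. This is a toy illustration; XGBoost's real histogram/sketch algorithm is considerably more elaborate.

```python
# Sketch of quantile binning: compute edges so roughly equal numbers of
# values fall into each bin, then store small bin indices instead of floats.
# Illustrative only, not XGBoost's actual sketch algorithm.

def quantile_bins(values, n_bins):
    """Bin edges placed at evenly spaced ranks of the sorted data."""
    ordered = sorted(values)
    step = len(ordered) / n_bins
    return [ordered[int(i * step)] for i in range(1, n_bins)]

def bin_value(x, edges):
    """Return the bin index (0..n_bins-1) that x falls into."""
    return sum(1 for e in edges if x >= e)

values = [0.1, 5.2, 3.3, 9.9, 4.4, 2.2, 7.7, 8.8]
edges = quantile_bins(values, n_bins=4)
binned = [bin_value(x, edges) for x in values]  # small ints instead of floats
```

With 256 bins a column shrinks from 4 or 8 bytes per value to 1, which is exactly what makes the host-to-GPU transfer cheap.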
41. KMeans
• Significantly faster than Scikit-learn implementation (up to 50x)
• Significantly faster than other GPU implementations (5x-10x)
• Supports kmeans|| initialization
• Supports multiple GPUs by sharding the dataset
• Supports batching the data if it exceeds GPU memory
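The multi-GPU sharding described above follows the same pattern as the NCCL AllReduce mentioned for GBM: each shard ("device") computes partial per-cluster sums and counts, and an all-reduce combines them into new global centroids. Below is a pure-Python stand-in for one such K-means step, not the GPU implementation itself.

```python
# Sketch of one sharded K-means update: per-shard partial sums/counts,
# then an AllReduce-style combine. Pure-Python stand-in for the GPU/NCCL
# version in h2o4gpu.

def assign(point, centroids):
    """Index of the nearest centroid (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))

def local_stats(shard, centroids):
    """Per-shard partial sums and counts for each cluster."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in shard:
        c = assign(point, centroids)
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += point[d]
    return sums, counts

def allreduce_update(shards, centroids):
    """Combine every shard's partials (the AllReduce) and update centroids."""
    k, dim = len(centroids), len(centroids[0])
    tot_sums = [[0.0] * dim for _ in range(k)]
    tot_counts = [0] * k
    for shard in shards:
        sums, counts = local_stats(shard, centroids)
        for c in range(k):
            tot_counts[c] += counts[c]
            for d in range(dim):
                tot_sums[c][d] += sums[c][d]
    return [[s / max(tot_counts[c], 1) for s in tot_sums[c]] for c in range(k)]

shards = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 10.0), (11.0, 10.0)]]  # 2 "devices"
centroids = allreduce_update(shards, [(0.0, 0.0), (10.0, 10.0)])
```

Only the small (k × dim) partial sums cross device boundaries, never the raw rows, which is why sharding scales to datasets larger than a single GPU's memory.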
43. Truncated SVD & PCA
• Matrix decomposition
• Popular for text processing
and dimensionality reduction
• GPUs accelerate the underlying linear algebra operations
44. Truncated SVD & PCA
• The intrinsic dimensionality of certain datasets is much lower than the
original (e.g. here 4096 vs. actual ~200)
• PCA can reduce the dimensionality and preserve most of the explained
variance at the same time
• Better input for further modeling: subsequent training takes less time
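The intrinsic-dimensionality claim above is easy to demonstrate: if 50-dimensional data is really generated from 2 latent factors, the top 2 principal components capture essentially all of the explained variance. The sketch below uses NumPy's SVD as an illustrative stand-in for the GPU-accelerated solver.

```python
# Demonstration of low intrinsic dimensionality: 50 observed columns that
# are linear mixtures of only 2 latent factors. NumPy SVD stands in for
# the GPU solver; the math is the same.
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))   # intrinsic dimensionality: 2
mixing = rng.normal(size=(2, 50))
X = latent @ mixing                  # observed in 50 dimensions
Xc = X - X.mean(axis=0)              # PCA works on centered data

_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()
top2 = explained[:2].sum()           # ~1.0: two components keep all variance
```

Projecting onto those two components gives a 500 × 2 input for downstream models instead of 500 × 50, with essentially no information lost.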
46. Field Aware Factorization Machines
* under development
• Click-Through Rate (CTR):
  • One of the most important tasks in computational advertising
  • The percentage of users who actually click on ads
• Until recently solved with logistic regression, which is bad at finding feature conjunctions (it learns the effect of each variable or feature individually)
Clicked | Publisher (P) | Advertiser (A) | Gender (G)
Yes     | ESPN          | Nike           | Male
No      | NBC           | Adidas         | Male
47. Field Aware Factorization Machines
* under development
• Separates the data into fields (Publisher, Advertiser, Gender) and features (ESPN, NBC, Adidas, Nike, Male, Female)
• Uses a latent vector for each feature-field pair to generate the model
• Used to win first prize in three CTR competitions (hosted by Criteo, Avazu, and Outbrain), and third prize in the RecSys Challenge 2015
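The "latent vector per feature-field pair" idea scores a row by summing, over every pair of active features, the dot product of each feature's vector for the *other* feature's field. The sketch below uses hypothetical toy weights on the ad-table row above; it illustrates the FFM scoring rule, not the library under development.

```python
# Sketch of FFM scoring: each feature j keeps one latent vector per field,
# and a pair (j1, j2) contributes <w[j1, field(j2)], w[j2, field(j1)]>.
# The latent vectors below are hypothetical toy values.
from itertools import combinations

def ffm_score(active, latent):
    """active: list of (field, feature); latent: {(feature, field): vector}."""
    score = 0.0
    for (f1, j1), (f2, j2) in combinations(active, 2):
        v1 = latent[(j1, f2)]   # j1's vector for j2's field
        v2 = latent[(j2, f1)]   # j2's vector for j1's field
        score += sum(a * b for a, b in zip(v1, v2))
    return score

# One row of the table: Publisher=ESPN, Advertiser=Nike, Gender=Male
active = [("P", "ESPN"), ("A", "Nike"), ("G", "Male")]
latent = {
    ("ESPN", "A"): [0.5, 0.1], ("ESPN", "G"): [0.2, 0.0],
    ("Nike", "P"): [0.4, 0.3], ("Nike", "G"): [0.1, 0.1],
    ("Male", "P"): [0.3, 0.2], ("Male", "A"): [0.0, 0.2],
}
score = ffm_score(active, latent)
```

Because the interaction weight depends on both the feature and the field of its partner, the model can learn, for example, that ESPN pairs well with Nike specifically, which a plain logistic regression over individual features cannot express.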