This talk covers four deep learning configurations that address different types of application needs, along with strategies for speed-up and real-time scoring.
2. Deep Learning - Outline
• Big Picture of 2016 Technology
• Neural Net Basics
• Deep Network Configurations for Practical Applications
– Auto-Encoder (i.e. data compression or Principal Components Analysis)
– Convolutional (shift invariance in time or space for voice, image or IoT)
– Real Time Scoring and Lambda Architecture
– Deep Net libraries and tools (R, H2O, DL4J, TensorFlow, Gorila, Kamanja)
– Reinforcement Learning, Q-Learning (i.e. beat people at Atari games, IoT)
– Continuous Space Word Models (i.e. word2vec)
8. Think of Separating Land vs. Water
[Figure: land/water decision boundaries drawn by different models]
• Regression: 1 line (more errors)
• Neural network: 5 hidden nodes
• Decision tree: 12 splits (more elements, less computation)
Different algorithms use different basis functions:
• One line
• Many horizontal & vertical lines
• Many diagonal lines
• Circles
Q) What is too detailed? “Memorizing the high tide boundary” and applying it at all times
9. Deep Learning - Outline
• Big Picture of 2016 Technology
• Neural Net Basics
• Deep Network Configurations for Practical Applications
– Auto-Encoder (i.e. data compression or Principal Components Analysis)
– Convolutional (shift invariance in time or space for voice, image or IoT)
– Real Time Scoring and Lambda Architecture
– Deep Net libraries and tools (R, H2O, DL4J, TensorFlow, Gorila, Kamanja)
– Reinforcement Learning, Q-Learning (i.e. beat people at Atari games, IoT)
– Continuous Space Word Models (i.e. word2vec)
http://deeplearning.net/
http://www.kdnuggets.com/
http://www.analyticbridge.com/
10. Leading up to an Auto Encoder
• Supervised Learning
– Regression, Tree or Net: 50 inputs → 1 output
– Possible nets:
• 256 → 120 → 1
• 256 → 120 → 5 (trees, regressions and most others are limited to 1 output)
• 256 → 120 → 60 → 1
• 256 → 180 → 120 → 60 → 1 (start getting into training stability problems with old processes)
• Unsupervised Learning
– Clustering (traditional unsupervised):
• 60 inputs (no target); produce 1-2 new fields (cluster ID & distance)
11. Auto Encoder (like data compression)
Relate input to output through a compressed middle layer
• Supervised Learning
– Regression, Tree or Net: 50 inputs → 1 output
– Possible nets:
• 256 → 120 → 1
• 256 → 120 → 5 (trees, regressions, SVD and most others are limited to 1 output)
• 256 → 120 → 60 → 1
• 256 → 180 → 120 → 60 → 1
• Unsupervised Learning
– Clustering (traditional unsupervised):
• 60 inputs (no target); produce 1-2 new fields (cluster ID & distance)
– Unsupervised training of a net: assign (target record == input record), i.e. AUTO-ENCODING
– Train the net in stages (see the sketch below):
• Stage 1: 256 → 180 → 256
• Next stage: insert a narrower middle layer, → 120 →, and train it the same way
• Because of symmetry, the mirrored weights only need to be updated once
• Add a supervised layer at the end to forecast 10 target categories: → 10
(Training all layers at once gets long training times to stabilize, or may not finish; stage-wise training is the BREAKTHROUGH provided by DEEP LEARNING)
4 hidden layers w/ unsupervised training, 1 layer at end w/ supervised training
https://en.wikipedia.org/wiki/Deep_learning
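As a concrete illustration of the mirrored-weight idea, here is a minimal NumPy sketch (my illustration, not the talk's code) of one auto-encoding stage with tied weights: the decoder reuses the transposed encoder matrix, so the shared weights are updated once per step. Layer sizes follow the 256 → 180 → 256 stage above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_tied_autoencoder(X, n_hid=180, lr=0.1, epochs=10, seed=0):
    """One auto-encoding stage with tied weights: the decoder reuses the
    transposed encoder matrix, so mirrored weights are updated once."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.01, (n_in, n_hid))
    b_h, b_o = np.zeros(n_hid), np.zeros(n_in)
    for _ in range(epochs):
        for x in X:
            h = sigmoid(x @ W + b_h)        # encode: 256 -> 180
            y = sigmoid(h @ W.T + b_o)      # decode: 180 -> 256
            d_y = (y - x) * y * (1.0 - y)   # target record == input record
            d_h = (d_y @ W) * h * (1.0 - h)
            # encoder and decoder gradients land on the SAME matrix
            W -= lr * (np.outer(x, d_h) + np.outer(d_y, h))
            b_h -= lr * d_h
            b_o -= lr * d_y
    return W, b_h, b_o

# usage: W, b_h, b_o = train_tied_autoencoder(np.random.rand(1000, 256))
```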
14. Deep Learning - Outline
• Big Picture of 2016 Technology
• Neural Net Basics
• Deep Network Configurations for Practical Applications
– Auto-Encoder (i.e. data compression or Principal Components Analysis)
– Convolutional (shift invariance in time or space for voice, image or IoT)
– Real Time Scoring and Lambda Architecture
– Deep Net libraries and tools (R, H2O, DL4J, TensorFlow, Gorila, Kamanja)
– Reinforcement Learning, Q-Learning (i.e. beat people at Atari games, IoT)
– Continuous Space Word Models (i.e. word2vec)
17. Convolutional Neural Net (CNN)
Enables detecting shift-invariant patterns.
In speech and image applications, patterns vary by size and can be shifted right or left.
Challenge: finding a bounding box for a pattern is almost as hard as detecting the pattern itself.
Neural nets can be explicitly trained to provide an FFT (Fast Fourier Transform) to convert data from the time domain to the frequency domain, but typically an explicit FFT is used (a sketch of this step follows below).
[Figure: Internet of Things signal data]
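For the time-to-frequency conversion mentioned above, a minimal NumPy sketch of the explicit FFT preprocessing step (the frame length and signal are placeholders):

```python
import numpy as np

# Explicit FFT preprocessing: convert one frame of a time-domain signal
# (e.g. an audio or IoT sensor window) into a magnitude spectrum.
frame = np.random.randn(400)            # placeholder frame of samples
spectrum = np.abs(np.fft.rfft(frame))   # frequency-domain features for the net
```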
18. Convolutional Neural Net (CNN)
Enables detecting shift-invariant patterns.
In speech and image applications, patterns vary by size and can be shifted right or left.
Challenge: finding a bounding box for a pattern is almost as hard as detecting the pattern itself.
Solution: use a sliding convolution to detect the pattern, as in the sketch below.
CNNs can use very long observational windows, up to 400 ms of context.
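A minimal NumPy sketch (my illustration, not the talk's code) of the sliding-convolution idea: one learned pattern detector is slid across the whole signal, so the pattern is found wherever it is shifted, with no bounding box needed.

```python
import numpy as np

def detect(signal, kernel):
    """Slide one pattern detector across the signal; the peak response
    locates the pattern at whatever shift it occurs."""
    k = len(kernel)
    responses = np.array([signal[i:i + k] @ kernel
                          for i in range(len(signal) - k + 1)])
    return responses.argmax(), responses.max()

sig = np.zeros(100)
sig[40:44] = [1.0, 2.0, 2.0, 1.0]                   # pattern hidden at offset 40
print(detect(sig, np.array([1.0, 2.0, 2.0, 1.0])))  # -> (40, 10.0)
```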
22. Convolutional Neural Net (CNN)
• How is a CNN trained differently than a typical back propagation (BP) network?
– Parts of the training that are the same:
• Present an input record
• Forward pass through the network
• Back propagate the error (i.e. per epoch)
– Parts that are different:
• Some connections are CONSTRAINED to the same value
– The connections for the same pattern, sliding over all of the input space
• Error updates are averaged and applied equally to the one set of weight values (see the sketch after the reference below)
• You end up with the same pattern detector feeding many nodes at the next level
http://www.cs.toronto.edu/~rgrosse/icml09-cdbn.pdf
Lee et al., “Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations,” 2009
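To make the constrained-weights point concrete, here is a minimal NumPy sketch (an illustration, not code from the cited paper) of how every window's error update is accumulated, averaged, and applied once to the single shared kernel:

```python
import numpy as np

def shared_kernel_grad(x, d_out, k):
    """Backprop with weight sharing: each sliding window contributes an
    update to the ONE shared kernel; updates are averaged, applied once."""
    g = np.zeros(k)
    for i, d in enumerate(d_out):       # d_out[i]: backpropagated error at
        g += d * x[i:i + k]             # the node fed by input window i
    return g / len(d_out)

# usage (hypothetical names): kernel -= lr * shared_kernel_grad(x, d_out, len(kernel))
```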
28. Deep Learning - Outline
• Big Picture of 2016 Technology
• Neural Net Basics
• Deep Network Configurations for Practical Applications
– Auto-Encoder (i.e. data compression or Principal Components Analysis)
– Convolutional (shift invariance in time or space for voice, image or IoT)
– Real Time Scoring and Lambda Architecture
– Deep Net libraries and tools (R, H2O, DL4J, TensorFlow, Gorila, Kamanja)
– Reinforcement Learning, Q-Learning (i.e. beat people at Atari games, IoT)
– Continuous Space Word Models (i.e. word2vec)
29. Real Time Scoring Optimizations
• Auto-encoding nets
– Can grow to millions of connections and become computationally expensive
– Can reduce connections by 5% to 25+% with pruning & retraining (see the sketch below):
• Train with increased regularization settings
• Drop connections with near-zero weights, then retrain
• Drop nodes whose fan-in connections are rarely used downstream, such as in your predictive problem
• Perform sensitivity analysis and delete candidate input fields
• Convolutional neural nets
– With large enough data, can even skip the FFT preprocessing step
– Can use wider than 10 ms audio sampling windows for speed-up
• Implement other preprocessing as lookup tables (i.e. Bayesian priors)
• Use cloud computing; do not limit yourself to on-device computing
• Large models don't fit on one machine → use model or data parallelism to train
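A minimal sketch of the prune-and-retrain step (my illustration; the threshold is a placeholder): zero out near-zero weights and keep a mask so retraining cannot revive them.

```python
import numpy as np

def prune_small_weights(W, threshold=0.01):
    """Drop connections with near-zero weights; the returned mask keeps
    them at zero during retraining."""
    mask = np.abs(W) >= threshold
    return W * mask, mask

# during retraining, re-apply the mask after each gradient step:
#   W = (W - lr * grad) * mask
```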
34. Deep Learning - Outline
• Big Picture of 2016 Technology
• Neural Net Basics
• Deep Network Configurations for Practical Applications
– Auto-Encoder (i.e. data compression or Principal Components Analysis)
– Convolutional (shift invariance in time or space for voice, image or IoT)
– Real Time Scoring and Lambda Architecture
– Deep Net libraries and tools (R, H2O, DL4J, TensorFlow, Gorila, Kamanja)
– Reinforcement Learning, Q-Learning (i.e. beat people at Atari games, IoT)
– Continuous Space Word Models (i.e. word2vec)
39. Deep Learning - Outline
• Big Picture of 2016 Technology
• Neural Net Basics
• Deep Network Configurations for Practical Applications
– Auto-Encoder (i.e. data compression or Principal Components Analysis)
– Convolutional (shift invariance in time or space for voice, image or IoT)
– Real Time Scoring
– Deep Net libraries and tools (R, H2O, DL4J, TensorFlow, Gorila, Kamanja)
– Reinforcement Learning, Q-Learning (i.e. beat people at Atari games, IoT)
– Continuous Space Word Models (i.e. word2vec)
41. Continuous Space Word Models (word2vec)
• Before (a predictive “Bag of Words” model):
– One row per document, paragraph or web page
– Binary word space: 10k to 200k columns, one per word or phrase
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 …. “This word space model is ….”
– The “bag of words” model relates an input record to a target category
• New:
– One row per word (word2vec), possibly per sentence (sent2vec)
– Continuous word space: 100 to 300 columns, continuous values
.01 .05 .02 .00 .00 .68 .01 .01 .35 ... .00 → “King”
.00 .00 .05 .01 .49 .52 .00 .11 .84 ... .01 → “Queen”
– The deep net training resulted in an emergent property:
• Location in the numeric geometry corresponds to concept space
• “King” – “man” + “woman” = “Queen” (math to change the gender relation)
• “Washington DC” – “USA” + “England” = “London” (math for the capital relation)
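The analogy arithmetic can be checked with plain cosine similarity; a minimal sketch, assuming `vecs` is a dict mapping words to their learned vectors (a hypothetical stand-in for a trained model):

```python
import numpy as np

def nearest(vecs, query, exclude=()):
    """Word whose vector is most cosine-similar to the query vector."""
    q = query / np.linalg.norm(query)
    best, best_sim = None, -1.0
    for word, v in vecs.items():
        if word in exclude:
            continue
        sim = (v / np.linalg.norm(v)) @ q
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# "King" - "man" + "woman" should land nearest to "Queen":
# nearest(vecs, vecs["king"] - vecs["man"] + vecs["woman"], exclude={"king"})
```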
44. Training Continuous Space Word Models
• How to Train These Models?
– Raw data: “This example sentence shows the word2vec model training.”
– Training data (the bracketed middle word is the target; the other words are input):
“This example [sentence] shows word2vec” (prune “the”)
“example sentence [shows] word2vec model”
“sentence shows [word2vec] model training”
– The context of the 2 to 5 prior and following words predicts the middle word (see the sketch below)
– Deep net model architecture: data compression to 300 continuous nodes
• 50k binary word input vector → ... → 300 → ... → 50k word target vector
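A minimal Python sketch of building those training windows (the stop-word pruning and window size follow the slide's example):

```python
def training_windows(words, half=2, stop_words=("the",)):
    """Build (context, target) pairs: the `half` prior and following
    words predict the middle word."""
    words = [w for w in words if w.lower() not in stop_words]  # prune "the"
    return [(words[i - half:i] + words[i + 1:i + half + 1], words[i])
            for i in range(half, len(words) - half)]

sent = "This example sentence shows the word2vec model training".split()
for ctx, tgt in training_windows(sent):
    print(ctx, "->", tgt)
# -> (['This', 'example', 'shows', 'word2vec'], 'sentence'), etc.
```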
45. Training Continuous Space Word Models
• How to Train These Models?
– Raw data: “This example sentence shows the word2vec model training.”
– Training data (the bracketed middle word is the target; the other words are input):
“This example [sentence] shows word2vec” (prune “the”)
“example sentence [shows] word2vec model”
“sentence shows [word2vec] model training”
– The context of the 2 to 5 prior and following words predicts the middle word
– Deep net model architecture: data compression to 300 continuous nodes
• 50k binary word input vector → ... → 300 → ... → 50k word target vector
• Use Pre-Trained Models: https://code.google.com/p/word2vec/
– Trained on 100 billion words from Google News
– 300-dimensional vectors for 3 million words and phrases
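One common route to using the pre-trained vectors above (gensim is my assumption; the deck only links the original word2vec project):

```python
from gensim.models import KeyedVectors

# Load the pre-trained GoogleNews vectors (300 dims, 3M words/phrases);
# the file name is the archive's published name.
vecs = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman ~= queen
print(vecs.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```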
48. “Greg's Guts” on Deep Learning
• Some claim the need for preprocessing and knowledge representation has ended
– For most signal processing applications → yes, simplify
– I am VERY READY TO COMPETE in other applications, continuing to:
• express explicit domain knowledge
• optimize business value calculations
• Deep Learning gets big advantages from big data
– Why? Better populating of high-dimensional combination subsets of the space
– Unsupervised feature extraction reduces the need for large labeled data
• However, “regular sized data” gets a big boost as well
– Watch the ratio of free parameters (i.e. neurons) to training set records (a worked example follows below)
– For regressions or regular nets, you want 5-10 times as many records as parameters
– Regularization and weight dropout reduce this pressure
– Especially when only training “the next auto-encoding layer”
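A back-of-the-envelope check of that ratio, counting connection weights as the free parameters (a simplifying assumption; the net shape is one of the examples from earlier in the talk):

```python
# 256 -> 120 -> 1 net: count free parameters, then the records wanted
weights = 256 * 120 + 120 * 1      # connection weights per layer
biases = 120 + 1                   # one bias per non-input node
params = weights + biases          # 30,961 free parameters
print(params, "parameters ->", 5 * params, "to", 10 * params, "records")
```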