This talk covers four deep learning configurations that address different types of application needs, along with strategies for speed-up and real-time scoring.
2. Deep Learning - Outline
• Big Picture of 2016 Technology
• Neural Net Basics
• Deep Network Configurations for Practical Applications
– Auto-Encoder (i.e. data compression or Principal Components Analysis)
– Convolutional (shift invariance in time or space for voice, image or IoT)
– Real Time Scoring and Lambda Architecture
– Deep Net libraries and tools (R, H2O, DL4J, TensorFlow, Gorila, Kamanja)
– Reinforcement Learning, Q-Learning (i.e. beat people at Atari games, IoT)
– Continuous Space Word Models (i.e. word2vec)
8. Think of Separating Land vs. Water
[Figure: land/water decision boundaries drawn by different models]
• Regression: 1 line (more errors)
• Neural network: 5 hidden nodes
• Decision tree: 12 splits (more elements, less computation)
Different algorithms use different basis functions:
• One line
• Many horizontal & vertical lines
• Many diagonal lines
• Circles
Q) What is too detailed? “Memorizing the high tide boundary” and applying it at all times
9. Deep Learning - Outline
• Big Picture of 2016 Technology
• Neural Net Basics
• Deep Network Configurations for Practical Applications
– Auto-Encoder (i.e. data compression or Principal Components Analysis)
– Convolutional (shift invariance in time or space for voice, image or IoT)
– Real Time Scoring and Lambda Architecture
– Deep Net libraries and tools (R, H2O, DL4J, TensorFlow, Gorila, Kamanja)
– Reinforcement Learning, Q-Learning (i.e. beat people at Atari games, IoT)
– Continuous Space Word Models (i.e. word2vec)
http://deeplearning.net/
http://www.kdnuggets.com/
http://www.analyticbridge.com/
10. Leading up to an Auto Encoder
• Supervised Learning
– Regression, Tree or Net: 50 inputs → 1 output
– Possible nets:
• 256 → 120 → 1
• 256 → 120 → 5 (trees, regressions and most others are limited to 1 output)
• 256 → 120 → 60 → 1
• 256 → 180 → 120 → 60 → 1 (start getting into training stability problems with old processes)
• Unsupervised Learning
– Clustering (traditional unsupervised):
• 60 inputs (no target); produce 1-2 new fields (cluster ID & distance)
11. Auto Encoder (like data compression)
Relate input to output through a compressed middle layer
• Supervised Learning
– Regression, Tree or Net: 50 inputs → 1 output
– Possible nets:
• 256 → 120 → 1
• 256 → 120 → 5 (trees, regressions, SVD and most others are limited to 1 output)
• 256 → 120 → 60 → 1
• 256 → 180 → 120 → 60 → 1
• Unsupervised Learning
– Clustering (traditional unsupervised):
• 60 inputs (no target); produce 1-2 new fields (cluster ID & distance)
– Unsupervised training of a net: assign (target record == input record), i.e. AUTO-ENCODING
– Train the net in stages (see the sketch below):
• Stage 1: 256 → 180 → 256
• Next stage: insert a narrower middle layer, → 120 →, and train it the same way
• Because of symmetry, the mirrored weights only need to be updated once
• Add a supervised layer at the end to forecast 10 target categories: → 10
(Training all layers at once gets long training times to stabilize, or may not finish; stage-wise training is the BREAKTHROUGH provided by DEEP LEARNING)
4 hidden layers w/ unsupervised training, 1 layer at end w/ supervised training
https://en.wikipedia.org/wiki/Deep_learning
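As a concrete illustration of the mirrored-weight idea, here is a minimal NumPy sketch (my illustration, not the talk's code) of one auto-encoding stage with tied weights: the decoder reuses the transposed encoder matrix, so the shared weights are updated once per step. Layer sizes follow the 256 → 180 → 256 stage above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_tied_autoencoder(X, n_hid=180, lr=0.1, epochs=10, seed=0):
    """One auto-encoding stage with tied weights: the decoder reuses the
    transposed encoder matrix, so mirrored weights are updated once."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.01, (n_in, n_hid))
    b_h, b_o = np.zeros(n_hid), np.zeros(n_in)
    for _ in range(epochs):
        for x in X:
            h = sigmoid(x @ W + b_h)        # encode: 256 -> 180
            y = sigmoid(h @ W.T + b_o)      # decode: 180 -> 256
            d_y = (y - x) * y * (1.0 - y)   # target record == input record
            d_h = (d_y @ W) * h * (1.0 - h)
            # encoder and decoder gradients land on the SAME matrix
            W -= lr * (np.outer(x, d_h) + np.outer(d_y, h))
            b_h -= lr * d_h
            b_o -= lr * d_y
    return W, b_h, b_o

# usage: W, b_h, b_o = train_tied_autoencoder(np.random.rand(1000, 256))
```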
14. Deep Learning - Outline
• Big Picture of 2016 Technology
• Neural Net Basics
• Deep Network Configurations for Practical Applications
– Auto-Encoder (i.e. data compression or Principal Components Analysis)
– Convolutional (shift invariance in time or space for voice, image or IoT)
– Real Time Scoring and Lambda Architecture
– Deep Net libraries and tools (R, H2O, DL4J, TensorFlow, Gorila, Kamanja)
– Reinforcement Learning, Q-Learning (i.e. beat people at Atari games, IoT)
– Continuous Space Word Models (i.e. word2vec)
17. Convolutional Neural Net (CNN)
Enables detecting shift-invariant patterns.
In speech and image applications, patterns vary by size and can be shifted right or left.
Challenge: finding a bounding box for a pattern is almost as hard as detecting the pattern itself.
Neural nets can be explicitly trained to provide an FFT (Fast Fourier Transform) to convert data from the time domain to the frequency domain, but typically an explicit FFT is used (a sketch of this step follows below).
[Figure: Internet of Things signal data]
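For the time-to-frequency conversion mentioned above, a minimal NumPy sketch of the explicit FFT preprocessing step (the frame length and signal are placeholders):

```python
import numpy as np

# Explicit FFT preprocessing: convert one frame of a time-domain signal
# (e.g. an audio or IoT sensor window) into a magnitude spectrum.
frame = np.random.randn(400)            # placeholder frame of samples
spectrum = np.abs(np.fft.rfft(frame))   # frequency-domain features for the net
```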
18. Convolutional Neural Net (CNN)
Enables detecting shift-invariant patterns.
In speech and image applications, patterns vary by size and can be shifted right or left.
Challenge: finding a bounding box for a pattern is almost as hard as detecting the pattern itself.
Solution: use a sliding convolution to detect the pattern, as in the sketch below.
CNNs can use very long observational windows, up to 400 ms of context.
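A minimal NumPy sketch (my illustration, not the talk's code) of the sliding-convolution idea: one learned pattern detector is slid across the whole signal, so the pattern is found wherever it is shifted, with no bounding box needed.

```python
import numpy as np

def detect(signal, kernel):
    """Slide one pattern detector across the signal; the peak response
    locates the pattern at whatever shift it occurs."""
    k = len(kernel)
    responses = np.array([signal[i:i + k] @ kernel
                          for i in range(len(signal) - k + 1)])
    return responses.argmax(), responses.max()

sig = np.zeros(100)
sig[40:44] = [1.0, 2.0, 2.0, 1.0]                   # pattern hidden at offset 40
print(detect(sig, np.array([1.0, 2.0, 2.0, 1.0])))  # -> (40, 10.0)
```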
22. Convolutional Neural Net (CNN)
• How is a CNN trained differently than a typical back propagation (BP) network?
– Parts of the training that are the same:
• Present an input record
• Forward pass through the network
• Back propagate the error (i.e. per epoch)
– Parts that are different:
• Some connections are CONSTRAINED to the same value
– The connections for the same pattern, sliding over all of the input space
• Error updates are averaged and applied equally to the one set of weight values (see the sketch after the reference below)
• You end up with the same pattern detector feeding many nodes at the next level
http://www.cs.toronto.edu/~rgrosse/icml09-cdbn.pdf
Lee et al., “Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations,” 2009
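To make the constrained-weights point concrete, here is a minimal NumPy sketch (an illustration, not code from the cited paper) of how every window's error update is accumulated, averaged, and applied once to the single shared kernel:

```python
import numpy as np

def shared_kernel_grad(x, d_out, k):
    """Backprop with weight sharing: each sliding window contributes an
    update to the ONE shared kernel; updates are averaged, applied once."""
    g = np.zeros(k)
    for i, d in enumerate(d_out):       # d_out[i]: backpropagated error at
        g += d * x[i:i + k]             # the node fed by input window i
    return g / len(d_out)

# usage (hypothetical names): kernel -= lr * shared_kernel_grad(x, d_out, len(kernel))
```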
28. Deep Learning - Outline
• Big Picture of 2016 Technology
• Neural Net Basics
• Deep Network Configurations for Practical Applications
– Auto-Encoder (i.e. data compression or Principal Components Analysis)
– Convolutional (shift invariance in time or space for voice, image or IoT)
– Real Time Scoring and Lambda Architecture
– Deep Net libraries and tools (R, H2O, DL4J, TensorFlow, Gorila, Kamanja)
– Reinforcement Learning, Q-Learning (i.e. beat people at Atari games, IoT)
– Continuous Space Word Models (i.e. word2vec)
29. Real Time Scoring Optimizations
• Auto-encoding nets
– Can grow to millions of connections and become computationally expensive
– Can reduce connections by 5% to 25+% with pruning & retraining (see the sketch below):
• Train with increased regularization settings
• Drop connections with near-zero weights, then retrain
• Drop nodes whose fan-in connections are rarely used downstream, such as in your predictive problem
• Perform sensitivity analysis and delete candidate input fields
• Convolutional neural nets
– With large enough data, can even skip the FFT preprocessing step
– Can use wider than 10 ms audio sampling windows for speed-up
• Implement other preprocessing as lookup tables (i.e. Bayesian priors)
• Use cloud computing; do not limit yourself to on-device computing
• Large models don't fit on one machine → use model or data parallelism to train
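A minimal sketch of the prune-and-retrain step (my illustration; the threshold is a placeholder): zero out near-zero weights and keep a mask so retraining cannot revive them.

```python
import numpy as np

def prune_small_weights(W, threshold=0.01):
    """Drop connections with near-zero weights; the returned mask keeps
    them at zero during retraining."""
    mask = np.abs(W) >= threshold
    return W * mask, mask

# during retraining, re-apply the mask after each gradient step:
#   W = (W - lr * grad) * mask
```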
34. Deep Learning - Outline
• Big Picture of 2016 Technology
• Neural Net Basics
• Deep Network Configurations for Practical Applications
– Auto-Encoder (i.e. data compression or Principal Components Analysis)
– Convolutional (shift invariance in time or space for voice, image or IoT)
– Real Time Scoring and Lambda Architecture
– Deep Net libraries and tools (R, H2O, DL4J, TensorFlow, Gorila, Kamanja)
– Reinforcement Learning, Q-Learning (i.e. beat people at Atari games, IoT)
– Continuous Space Word Models (i.e. word2vec)
39. Deep Learning - Outline
• Big Picture of 2016 Technology
• Neural Net Basics
• Deep Network Configurations for Practical Applications
– Auto-Encoder (i.e. data compression or Principal Components Analysis)
– Convolutional (shift invariance in time or space for voice, image or IoT)
– Real Time Scoring
– Deep Net libraries and tools (R, H2O, DL4J, TensorFlow, Gorila, Kamanja)
– Reinforcement Learning, Q-Learning (i.e. beat people at Atari games, IoT)
– Continuous Space Word Models (i.e. word2vec)
41. Continuous Space Word Models (word2vec)
• Before (a predictive “Bag of Words” model):
– One row per document, paragraph or web page
– Binary word space: 10k to 200k columns, one per word or phrase
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 …. “This word space model is ….”
– The “bag of words” model relates an input record to a target category
• New:
– One row per word (word2vec), possibly per sentence (sent2vec)
– Continuous word space: 100 to 300 columns, continuous values
.01 .05 .02 .00 .00 .68 .01 .01 .35 ... .00 → “King”
.00 .00 .05 .01 .49 .52 .00 .11 .84 ... .01 → “Queen”
– The deep net training resulted in an emergent property:
• Location in the numeric geometry corresponds to concept space
• “King” – “man” + “woman” = “Queen” (math to change the gender relation)
• “Washington DC” – “USA” + “England” = “London” (math for the capital relation)
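The analogy arithmetic can be checked with plain cosine similarity; a minimal sketch, assuming `vecs` is a dict mapping words to their learned vectors (a hypothetical stand-in for a trained model):

```python
import numpy as np

def nearest(vecs, query, exclude=()):
    """Word whose vector is most cosine-similar to the query vector."""
    q = query / np.linalg.norm(query)
    best, best_sim = None, -1.0
    for word, v in vecs.items():
        if word in exclude:
            continue
        sim = (v / np.linalg.norm(v)) @ q
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# "King" - "man" + "woman" should land nearest to "Queen":
# nearest(vecs, vecs["king"] - vecs["man"] + vecs["woman"], exclude={"king"})
```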
44. Training Continuous Space Word Models
• How to Train These Models?
– Raw data: “This example sentence shows the word2vec model training.”
– Training data (the bracketed middle word is the target; the other words are input):
“This example [sentence] shows word2vec” (prune “the”)
“example sentence [shows] word2vec model”
“sentence shows [word2vec] model training”
– The context of the 2 to 5 prior and following words predicts the middle word (see the sketch below)
– Deep net model architecture: data compression to 300 continuous nodes
• 50k binary word input vector → ... → 300 → ... → 50k word target vector
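A minimal Python sketch of building those training windows (the stop-word pruning and window size follow the slide's example):

```python
def training_windows(words, half=2, stop_words=("the",)):
    """Build (context, target) pairs: the `half` prior and following
    words predict the middle word."""
    words = [w for w in words if w.lower() not in stop_words]  # prune "the"
    return [(words[i - half:i] + words[i + 1:i + half + 1], words[i])
            for i in range(half, len(words) - half)]

sent = "This example sentence shows the word2vec model training".split()
for ctx, tgt in training_windows(sent):
    print(ctx, "->", tgt)
# -> (['This', 'example', 'shows', 'word2vec'], 'sentence'), etc.
```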
45. Training Continuous Space Word Models
• How to Train These Models?
– Raw data: “This example sentence shows the word2vec model training.”
– Training data (the bracketed middle word is the target; the other words are input):
“This example [sentence] shows word2vec” (prune “the”)
“example sentence [shows] word2vec model”
“sentence shows [word2vec] model training”
– The context of the 2 to 5 prior and following words predicts the middle word
– Deep net model architecture: data compression to 300 continuous nodes
• 50k binary word input vector → ... → 300 → ... → 50k word target vector
• Use Pre-Trained Models: https://code.google.com/p/word2vec/
– Trained on 100 billion words from Google News
– 300-dimensional vectors for 3 million words and phrases
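One common route to using the pre-trained vectors above (gensim is my assumption; the deck only links the original word2vec project):

```python
from gensim.models import KeyedVectors

# Load the pre-trained GoogleNews vectors (300 dims, 3M words/phrases);
# the file name is the archive's published name.
vecs = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman ~= queen
print(vecs.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```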
48. “Greg's Guts” on Deep Learning
• Some claim the need for preprocessing and knowledge representation has ended
– For most signal processing applications → yes, simplify
– I am VERY READY TO COMPETE in other applications, continuing to:
• express explicit domain knowledge
• optimize business value calculations
• Deep Learning gets big advantages from big data
– Why? Better populating of high-dimensional combination subsets of the space
– Unsupervised feature extraction reduces the need for large labeled data
• However, “regular sized data” gets a big boost as well
– Watch the ratio of free parameters (i.e. neurons) to training set records (a worked example follows below)
– For regressions or regular nets, you want 5-10 times as many records as parameters
– Regularization and weight dropout reduce this pressure
– Especially when only training “the next auto-encoding layer”
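A back-of-the-envelope check of that ratio, counting connection weights as the free parameters (a simplifying assumption; the net shape is one of the examples from earlier in the talk):

```python
# 256 -> 120 -> 1 net: count free parameters, then the records wanted
weights = 256 * 120 + 120 * 1      # connection weights per layer
biases = 120 + 1                   # one bias per non-input node
params = weights + biases          # 30,961 free parameters
print(params, "parameters ->", 5 * params, "to", 10 * params, "records")
```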