Deep Learning for Personalized
Search and Recommender
Systems
Ganesh Venkataraman
Airbnb
Nadia Fawaz, Saurabh Kataria, Benjamin Le, Liang Zhang
LinkedIn
1
Tutorial Outline
• Part I (45min) Deep Learning Key concepts
• Part II (45min) Deep learning for Search and Recommendations at Scale
• Coffee break (30 min)
• Deep Learning Case Studies
• Part III (45min) Jobs You May Be Interested In (JYMBII) at LinkedIn
• Part IV (45min) Job Search at LinkedIn
Q&A at the end of each part
2
Motivation – Why Recommender Systems?
• Recommendation systems are everywhere. Some examples of impact:
• “Netflix values recommendations at half a billion dollars to the company”
[netflix recsys]
• “LinkedIn job matching algorithms improve performance by 50%” [San Jose
Mercury News]
• “Instagram switches to using algorithmic feed” [Instagram blog]
3
Motivation – Why Search?
PERSONALIZED SEARCH
4
Query = “things to do in halifax”
Search view – this is a classic IR problem
Recommendations view – For this query,
what are the recommended results?
Why Deep Learning? Why now?
• Many of the fundamental algorithmic techniques have existed since
the 80s or before
2.5 Exabytes of data produced per day
Or 530,000,000 songs
150,000,000 iPhones 5
Why Deep Learning?
Image classification
eCommerce fraud
Search
Recommendations
NLP
Deep learning is eating the world
6
Why Deep Learning and Recommender
Systems?
• Features
• Semantic understanding of words/sentences possible with embeddings
• Better classification of images (identifying cats in YouTube videos)
• Modeling
• Can we cast matching problems into a deep (and possibly wide) net and learn a
family of functions?
7
Part I – Representation Learning and Deep
Learning: Key Concepts
8
Deep Learning and AI
http://www.deeplearningbook.org/contents/intro.html 9
Part I Outline
• Shallow Models for Embedding Learning
• Word2Vec
• Deep Architectures
• FF, CNN, RNN
• Training Deep Neural Networks
• SGD, Backpropagation, Learning Rate Schedule, Regularization, Pre-Training
10
Learning Embeddings
11
Representation learning for automated feature generation
• Natural Language Processing
• Word embedding: word2vec, GloVe
• Sequence modeling using RNNs and LSTMs
• Graph Inputs
• Deep Walk
• Multiple Hierarchy of features for varying granularities for semantic meaning
with deep networks
12
Example Application of Representation
Learning - Understanding Text
• One of the keys to any content based recommender system
is understanding text
• What does “understanding” mean?
• How similar/dissimilar are any two words?
• What does the word represent? (Named Entity
Recognition)
• “Abraham Lincoln, the 16th President ...”
• “My cousin drives a Lincoln”
13
How to represent a word?
• Vocabulary – run, jog, math
• Simple representation:
• [1, 0, 0], [0, 1, 0], [0, 0, 1]
• No representation of meaning
• Cooccurrence in a word/document matrix
14
How to represent a word?
• Trouble with cooccurrence matrix
• Large dimension, lots of memory
• Dimensionality reduction using SVD
• High computational cost: SVD on an n×m matrix takes O(mn^2)
• Adding new word => redo everything
15
Word embeddings that take context into account
• Key Conjecture
• Context matters.
• Words that convey a certain context occur together
• “Abraham Lincoln was the 16th President of the United States”
• Bigram model
• P (“Lincoln”|”Abraham”)
• Skip Gram Model
• Consider all words within context and ignore position
• P(Context|Word)
16
Word2vec
17
Word2Vec: Skip Gram Model
• Basic notations:
• w represents a word, C(w) represents all the context around a word
• 𝜃 represents the parameter space
• D represents all the (w, c) pairs
• $p(c \mid w; \theta)$ represents the probability of context c given word w, parametrized by $\theta$
• The probability of all the context appearing given a word is given by:
• $\prod_{c \in C(w)} p(c \mid w; \theta)$
• The objective (to maximize) then becomes:
• $\arg\max_\theta \prod_{(w,c) \in D} p(c \mid w; \theta)$
18
Word2vec details
• Let $v_w$ and $v_c$ represent the current word and context. Note that $v_c$ and $v_w$ are parameters we want to learn
• $p(c \mid w; \theta) = \dfrac{e^{v_c \cdot v_w}}{\sum_{d \in C} e^{v_d \cdot v_w}}$
• C represents the set of all available contexts
19
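To make the softmax above concrete, here is a minimal NumPy sketch (illustrative only; the vocabulary size, dimensions, and variable names are made up, not the tutorial's code):

```python
# Sketch of the skip-gram softmax p(c|w) = exp(v_c . v_w) / sum_d exp(v_d . v_w)
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4                          # toy vocabulary, 4-d embeddings
V_word = rng.normal(size=(vocab_size, dim))      # word vectors v_w
V_ctx = rng.normal(size=(vocab_size, dim))       # context vectors v_c

def p_context_given_word(w_idx):
    """Softmax over all contexts d for a given center word w."""
    scores = V_ctx @ V_word[w_idx]               # v_d . v_w for every d
    scores -= scores.max()                       # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = p_context_given_word(w_idx=3)
print(probs.sum())  # ~1.0: a proper distribution over contexts
```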
Negative Sampling – basic intuition
$p(c \mid w; \theta) = \dfrac{e^{v_c \cdot v_w}}{\sum_{d \in C} e^{v_d \cdot v_w}}$
• Sample from unigram distribution instead of taking all contexts into
account
• Word2vec itself is a shallow model and can be used to initialize a
deep model
20
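A hedged sketch of the negative-sampling idea for one (word, context) pair, assuming k negatives drawn from a smoothed unigram distribution; all sizes and names are placeholders:

```python
# Maximize log sigmoid(v_c . v_w) + sum over k negatives of log sigmoid(-v_n . v_w)
import numpy as np

rng = np.random.default_rng(1)
vocab_size, dim, k = 10, 4, 3
V_word = rng.normal(scale=0.1, size=(vocab_size, dim))
V_ctx = rng.normal(scale=0.1, size=(vocab_size, dim))
unigram_counts = rng.integers(1, 100, size=vocab_size)
noise_dist = unigram_counts ** 0.75
noise_dist = noise_dist / noise_dist.sum()       # smoothed unigram distribution

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(w_idx, c_idx):
    neg_idx = rng.choice(vocab_size, size=k, p=noise_dist)   # sampled negatives
    pos = np.log(sigmoid(V_ctx[c_idx] @ V_word[w_idx]))
    neg = np.log(sigmoid(-(V_ctx[neg_idx] @ V_word[w_idx]))).sum()
    return -(pos + neg)                          # minimize negative log-likelihood

print(neg_sampling_loss(w_idx=3, c_idx=7))
```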
Deep Architectures
FF, CNN, RNN
21
Neuron: Computational Unit
• Input vector: x = [x1, x2 ,… ,xn]
• Neuron
• Weight vector: W
• Bias: b
• Activation function: f
• Output
a = f(Wᵀx + b)
[Diagram: inputs x1…x4 → neuron (weight vector W, bias b, activation f) → output a = f(Wᵀx + b)]
22
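A minimal sketch of the single-neuron computation, with arbitrary example values:

```python
# a = f(W^T x + b) for one neuron
import numpy as np

x = np.array([0.5, -1.0, 2.0, 0.1])   # input vector x = [x1, x2, x3, x4]
W = np.array([0.2, 0.4, -0.3, 0.8])   # weight vector
b = 0.1                               # bias

def relu(z):                          # activation function f
    return np.maximum(0.0, z)

a = relu(W @ x + b)                   # output
print(a)
```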
Activation Functions
• Tanh: ℝ → (-1,1)
$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
• Sigmoid: ℝ → (0,1)
$\sigma(x) = \dfrac{1}{1 + e^{-x}}$
• ReLU: ℝ → [0, +∞)
$f(x) = \max(0, x) = x^{+}$
http://ufldl.stanford.edu/tutorial/supervised/MultiLayerNeuralNetworks/
23
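The three activation functions above, as a small NumPy sketch:

```python
import numpy as np

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))   # range (-1, 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                              # range (0, 1)

def relu(x):
    return np.maximum(0.0, x)                                    # range [0, +inf)

z = np.linspace(-3, 3, 7)
print(tanh(z), sigmoid(z), relu(z), sep="\n")
```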
Layer
• Layer l: nl neurons
• weight matrix: W = [W1,…, Wnl]
• bias vector: b = [b1,…, bnl]
• activation function: f
• output vector
• a = f(WT x + b)
[Diagram: inputs x1…x4 feed a layer of neurons (W1, b1), (W2, b2), (W3, b3) with activation f; outputs a1 = f(W1ᵀx + b1), a2 = f(W2ᵀx + b2), a3 = f(W3ᵀx + b3)]
24
Layer: Matrix Notation
• Layer l: nl neurons
• weight matrix: W
• bias vector: b
• activation function: f
• output vector
• a = f(WT x + b)
• more compact notation
• fast-linear algebra routines for
quick computations in network
[Diagram: inputs x1…x4 feed a layer with (W, b, f); output a = f(Wᵀx + b)]
25
Feed Forward Network
• Depth L layers
• Activation at layer l+1
a(l+1) = f(W(l)T a(l) + b(l) )
• Output: prediction in
supervised learning
• goal: approximate y = F(x)
[Diagram: depth L = 4 network. Input layer 1 (x1…x4) → hidden layer 2 via (W(1), b(1), f(1)) giving a(2) → hidden layer 3 via (W(2), b(2), f(2)) giving a(3) → output layer 4 (prediction layer) via (W(3), b(3), f(3)) giving a(L)]
26
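A sketch of the depth-4 forward pass a(l+1) = f(W(l)ᵀ a(l) + b(l)); layer sizes and parameters are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [4, 5, 3, 1]                       # input, two hidden layers, output
params = [(rng.normal(size=(m, n)), rng.normal(size=n))   # (W(l), b(l))
          for m, n in zip(sizes[:-1], sizes[1:])]

def relu(z):
    return np.maximum(0.0, z)

def forward(x):
    a = x
    for i, (W, b) in enumerate(params):
        z = W.T @ a + b
        # hidden layers use ReLU; the last (prediction) layer is left linear here
        a = relu(z) if i < len(params) - 1 else z
    return a

print(forward(np.array([0.5, -1.0, 2.0, 0.1])))   # a(L): the network's prediction
```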
Why CNN: Convolutional Neural Networks?
• Large size grid structured data
• 1D: time series
• 2D: image
• Convolution to extract features from image (e.g. edges, texture)
• Local connectivity
• Parameter sharing
• Equivariance to translation: small translations in input do not affect output
[Figures: convolution examples with edge-detect and sharpen 3x3 kernels (https://docs.gimp.org/en/plug-in-convmatrix.html); 2D convolution of an input matrix with a 2x2 kernel [W1 W2; W3 W4] (http://ufldl.stanford.edu/tutorial/supervised/FeatureExtractionUsingConvolution/)]
29
• Fully connected
• hidden unit connected to all input units
• computationally expensive
• Large image NxN pixels and Hidden layer K features
• Number of parameters: ~KN2
• Locally connected
• hidden unit connected to some contiguous input
units
• no parameter sharing
• Convolution
• locally connected
• kernel: parameter sharing
• 1D Kernel vector [W1, W2]
• 1D Toeplitz weight matrix W
• Scaling to large input, images
• Equivariance to translation
30
[Diagram: weight matrices for a 4-unit input and 3 hidden units]
Fully connected weight matrix W:
[W11 W12 W13 W14; W21 W22 W23 W24; W31 W32 W33 W34]
Locally connected (no parameter sharing):
[W11 W12 0 0; 0 W22 W23 0; 0 0 W33 W34]
Convolution with kernel vector [W1, W2] (Toeplitz weight matrix W):
[W1 W2 0 0; 0 W1 W2 0; 0 0 W1 W2]
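A small NumPy sketch checking that 1D convolution with kernel [W1, W2] is the same computation as multiplying by the Toeplitz weight matrix above (values are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
W1, W2 = 0.5, -1.0
kernel = np.array([W1, W2])

# Toeplitz weight matrix with shared parameters
W = np.array([[W1, W2, 0,  0 ],
              [0,  W1, W2, 0 ],
              [0,  0,  W1, W2]])

toeplitz_out = W @ x
conv_out = np.array([kernel @ x[i:i + 2] for i in range(3)])  # sliding window
print(np.allclose(toeplitz_out, conv_out))  # True: same output, far fewer parameters
```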
Pooling
• Summary statistics
• Aggregate over region
• Reduce size
• Less overfitting
• Translation invariance
• Max, mean
http://ufldl.stanford.edu/tutorial/supervised/Pooling/
31
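A sketch of non-overlapping 2x2 max pooling on a toy 4x4 feature map:

```python
import numpy as np

fmap = np.arange(16, dtype=float).reshape(4, 4)

def max_pool_2x2(x):
    h, w = x.shape
    # group into 2x2 blocks, then take the max (or mean) over each block
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

print(max_pool_2x2(fmap))   # 2x2 summary: smaller map, small translation invariance
```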
CNN: Convolutional Neural Network
Combination
• Convolutional layers
• Pooling layers
• Fully connected layers
http://colah.github.io/posts/2014-07-Conv-Nets-Modular/
32
[LeCun et al., 1998]
CNN example for image recognition: ImageNet [Krizhevsky et al., 2012]
Pictures courtesy of [Krizhevsky et al., 2012], http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
33
[Figure: filters learned by the first CNN layer, split across the 1st and 2nd GPU]
Why RNN: Recurrent Neural Network?
• Sequential data processing
• ex: predict next word in sentence: “I was born in France. I can speak…”
• RNN
• Persist information through feedback loop
• loop passes information from one step to the next
• Parameter sharing across time indexes
• output unit depends on previous output units through same
update rule.
[Diagram: RNN cell taking input xt and previous state ht-1, producing ht via a feedback loop]
Unfolded RNN
• Copies of NN passing feedback to one another
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
35
LSTM: Long Short Term Memory [Hochreiter et al., 1997]
• Avoid vanishing or exploding gradient
• Cell state updates regulated by gates
• Forget: how much info from cell state to let
through
• Input: which cell state components to update
• Tanh: values to add to cell state
• Output: select component values to output
picture courtesy of http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Cell state
• Long term dependencies
• large gap between relevant information and
where it is needed
• Cell state: long-term memory
• Can remember relevant information over long
period of time
36
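A sketch of one LSTM step with the forget/input/output gates described above; weight shapes and values are placeholders, not a tuned model:

```python
import numpy as np

rng = np.random.default_rng(3)
x_dim, h_dim = 3, 4
Wf, Wi, Wg, Wo = (rng.normal(scale=0.1, size=(h_dim, x_dim + h_dim)) for _ in range(4))
bf, bi, bg, bo = (np.zeros(h_dim) for _ in range(4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z + bf)          # forget gate: how much old cell state to keep
    i = sigmoid(Wi @ z + bi)          # input gate: which components to update
    g = np.tanh(Wg @ z + bg)          # candidate values to add to the cell state
    c_t = f * c_prev + i * g          # new cell state (long-term memory)
    o = sigmoid(Wo @ z + bo)          # output gate
    h_t = o * np.tanh(c_t)            # new hidden state
    return h_t, c_t

h, c = np.zeros(h_dim), np.zeros(h_dim)
for x_t in rng.normal(size=(5, x_dim)):   # run over a length-5 toy sequence
    h, c = lstm_step(x_t, h, c)
print(h)
```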
Examples of RNN application
• Speech recognition [Graves et al., 2013]
• Language modeling [Mikolov, 2012]
• Machine translation [Kalchbrenner et al., 2013][Sutskever et al., 2014]
• Image captioning [Vinyals et al., 2014]
37
Training a Deep Neural Network
38
Cost Function
• m training samples (feature vector, label)
$(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$
• Per sample cost: error between label and output from prediction layer
$J(W, b; x^{(i)}, y^{(i)}) = \| a^{(L)}(x^{(i)}) - y^{(i)} \|^2$
• Minimize cost function over parameters: weights W and biases b
$J(W, b) = \underbrace{\tfrac{1}{m} \sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)})}_{\text{Average error}} + \underbrace{\tfrac{\lambda}{2} \sum_{l=1}^{L} \| W^{(l)} \|_F^2}_{\text{Regularization}}$
39
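A sketch of this regularized cost on a toy two-layer network; the architecture and data are placeholders:

```python
import numpy as np

rng = np.random.default_rng(4)
params = [(rng.normal(size=(4, 3)), np.zeros(3)), (rng.normal(size=(3, 1)), np.zeros(1))]
lam = 0.01                                   # regularization strength lambda

def forward(x):
    a = x
    for W, b in params:
        a = np.tanh(W.T @ a + b)
    return a

def cost(X, Y):
    avg_error = np.mean([np.sum((forward(x) - y) ** 2) for x, y in zip(X, Y)])
    reg = (lam / 2) * sum(np.sum(W ** 2) for W, _ in params)   # sum of ||W(l)||_F^2
    return avg_error + reg

X, Y = rng.normal(size=(8, 4)), rng.normal(size=(8, 1))
print(cost(X, Y))
```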
Gradient Descent
• Random parameter initialization: symmetry breaking
• Gradient descent step: update for every parameter $W_{ij}^{(l)}$ and $b_i^{(l)}$
$\theta = \theta - \alpha \nabla_\theta \mathbb{E}[J(\theta)]$
• Gradient computed by Backpropagation
• High cost of backpropagation over full training set
40
Stochastic Gradient Descent (SGD)
• SGD: follow negative gradient after
• single sample
$\theta = \theta - \alpha \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$
• a few samples: mini-batch (256)
• Epoch: full pass through training set
• Randomly shuffle data prior to each training epoch
41
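A sketch of the mini-batch SGD loop (shuffle each epoch, then step on each batch); the quadratic objective is a stand-in so the gradient has a closed form:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

theta, alpha, batch_size = np.zeros(3), 0.05, 256
for epoch in range(5):
    order = rng.permutation(len(X))              # randomly shuffle before each epoch
    for start in range(0, len(X), batch_size):   # one epoch = full pass over data
        idx = order[start:start + batch_size]
        grad = 2 * X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
        theta -= alpha * grad                    # follow the negative gradient
print(theta)   # close to [1, -2, 0.5]
```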
Backpropagation [Rumelhart et al., 1986]
Goal: Compute the gradient efficiently
Recursively apply chain rule for derivative of composition of functions
Let $y = g(x)$ and $z = f(y) = f(g(x))$, then
$\dfrac{\partial z}{\partial x} = \dfrac{\partial z}{\partial y} \, \dfrac{\partial y}{\partial x} = f'(g(x)) \, g'(x)$
Backpropagation steps
1. Feedforward pass: compute all activations
2. Output error: measures node contribution to output error
3. Backpropagate error through all layers
4. Compute partial derivatives
42
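A sketch of the four steps on a one-hidden-layer network with squared error; shapes and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
W1, b1 = rng.normal(scale=0.5, size=(3, 4)), np.zeros(4)   # layer 1 -> 2
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)   # layer 2 -> 3
x, y = rng.normal(size=3), np.array([1.0])

# 1. Feedforward pass: compute all activations
z2 = W1.T @ x + b1
a2 = np.tanh(z2)
z3 = W2.T @ a2 + b2
a3 = z3                                  # linear output layer

# 2. Output error (for J = ||a3 - y||^2)
delta3 = 2 * (a3 - y)

# 3. Backpropagate the error through the hidden layer (chain rule)
delta2 = (W2 @ delta3) * (1 - np.tanh(z2) ** 2)

# 4. Partial derivatives with respect to each parameter
grad_W2, grad_b2 = np.outer(a2, delta3), delta3
grad_W1, grad_b1 = np.outer(x, delta2), delta2
print(grad_W1.shape, grad_W2.shape)
```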
Training optimization
• Learning Rate Schedule
• Changing learning rate as learning progresses
• Pre-training
• Goal: training simple model on simple task before training desired model to perform desired task
• Greedy supervised pre-training: pre-train for task on subset of layers as initialization for final network
• Regularization to curb overfitting
• Goal: reduce generalization error
• Penalize parameter norm: L2, L1
• Augment dataset: train on more data
• Early stopping: return parameter set at point in time with lowest validation error
• Dropout [Srivastava, 2013]: train ensemble of all subnetworks formed by removing non-output units
• Gradient clipping to avoid exploding gradient
• norm clipping
• element wise clipping
43
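A sketch of two of these tricks, norm-based gradient clipping and a simple step-decay learning rate schedule; the threshold and decay values are made up:

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """Rescale the gradient if its norm exceeds max_norm (avoids exploding gradients)."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

def step_decay_lr(base_lr, epoch, drop=0.5, every=10):
    """Halve the learning rate every `every` epochs."""
    return base_lr * (drop ** (epoch // every))

g = np.array([30.0, -40.0])
print(clip_by_norm(g))                           # rescaled to norm 5
print([step_decay_lr(0.1, e) for e in (0, 10, 20)])
```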
Part II – Deep Learning for Personalized
Recommender Systems at Scale
44
Examples of Personalized Recommender Systems
45
Examples of Personalized Recommender Systems
Job Search
46
Examples of Personalized Recommender Systems
47
[Diagram] User i visits, with <user features, query (optional)> (e.g., industry, behavioral features, demographic features, …)
Algorithm selects item j from a set of candidates
(i, j): response y_ij (action or not, e.g. click, like, share, apply…)
Which item(s) should we recommend to the user?
• The item(s) with the best expected utility
• Utility examples:
• CTR, Revenue, Job Apply rates, Ads conversion rates, …
• Can be a combination of the above for trade-offs
Personalized Recommender Systems
48
An Example Architecture of
Personalized Recommender
Systems
49
[Architecture diagram "An example of Recommender System Architecture", split into Offline System and Online System (steps 1-5): User, User Interaction Logs, Offline Modeling Workflow + User / Item derived features, User Feature Store, Item Store + Features, Ranking Model Store, Recommendation Ranking, Additional Re-ranking Steps]
50
[Architecture diagram "An example of Personalized Search System Architecture", split into Offline System and Online System (steps 1-7): User, User Interaction Logs, Offline Modeling Workflow + User / Item derived features, Query Construction, Search-based Candidate Selection & Retrieval, User Feature Store, Search Index of Items, Ranking Model Store, Recommendation Ranking, Additional Re-ranking Steps]
51
Key Components – Offline Modeling
• Train the model offline (e.g. Hadoop)
• Push model to online ranking model store
• Pre-generate user / item derived features for online systems
to consume
• E.g. user / item embeddings from word2vec / DNNs based
on the raw features
52
Key Components – Candidate Selection
• Personalized Search (With user query):
• Form a query to the index based on user query annotation [Arya et al., 2016]
• Example: Panda Express Sunnyvale +restaurant:panda express
+location:sunnyvale
• Recommender system (Optional):
• Can help dramatically reduce the number of items to score in ranking steps
[Cheng, et al., 2016, Borisyuk et al. 2016]
• Form a query based on the user features
• Goal: Fetch only the items with at least some match with user feature
• Example: a user with the title software engineer -> +title:software engineer for
job recommendations
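A hedged sketch of this kind of query construction; the "+field:value" syntax and field names follow the examples above but are illustrative, not a real search API:

```python
def build_candidate_query(user):
    """Map a few user features to retrieval clauses for candidate selection."""
    clauses = []
    if user.get("title"):
        clauses.append(f'+title:{user["title"]}')
    for skill in user.get("skills", [])[:3]:     # cap expansion to keep the query focused
        clauses.append(f'skill:{skill}')
    return " ".join(clauses)

user = {"title": "software engineer", "skills": ["java", "search", "machine learning"]}
print(build_candidate_query(user))
# +title:software engineer skill:java skill:search skill:machine learning
```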
53
Key Components - Ranking
• Recommendation Ranking
• The main ML model that ranks items retrieved by candidate selection based
on the expected utility
• Additional Re-ranking Steps
• Often for user experience optimization related to business rules, e.g.
• Diversification of the ranking results
• Recency boost
• Impression discounting
• …
54
Integration of Deep Learning Models
into Personalized Recommender
Systems at Scale
55
Literature: Deep Learning for Recommendation Systems
• RBM for Collaborative Filtering [Salakhutdinov et al., 2007]
• Deep Belief Networks [Hinton et al., 2006]
• Neural Autoregressive Distribution Estimator (NADE) [Zheng, 2016]
• Neural Collaborative Filtering [He, et al., 2017]
• Siamese networks for user item matching [Huang et al., 2013]
• Deep Belief Networks with Pre-training [Hinton et al., 2006]
• Collaborative Deep Learning [Wang et al., 2015]
56
[Architecture diagram repeated from "An example of Personalized Search System Architecture" (steps 1-7), split into Offline System and Online System]
57
Offline Modeling + User / Item Embeddings
[Diagram: User Features → User Embedding Vector and Item Features → Item Embedding Vector, compared via Sim(U, I); embeddings are pushed to the User Feature Store and the Item Store / Index with Features]
58
Query Formulation & Candidate Selection
• Issues of using raw text: Noisy or incorrect query tagging due to
• Failure to capture semantic meaning
• Ex. Query: Apple watch -> +food:apple +product:watch or +product:apple watch?
• Multilingual text
• Query: 熊猫快餐 -> +restaurant:panda express
• Cross-domain understanding
• People search vs job search
59
Query Formulation & Candidate Selection
• Represent Query as an
embedding
• Expand query to similar
queries in a semantic
space
• KNN search in dense
feature space with
Inverted Index [Cheng,
et al., 2016]
[Example: query Q = “Apple Watch” retrieves nearby items D = “iphone”, D = “Orange Swatch”, D = “ipad” in the semantic space]
60
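A sketch of KNN retrieval in a dense embedding space using cosine similarity; the documents and vectors are toy placeholders:

```python
import numpy as np

rng = np.random.default_rng(7)
docs = ["iphone", "ipad", "orange swatch", "panda express", "fracking jobs"]
doc_vecs = rng.normal(size=(len(docs), 8))
query_vec = doc_vecs[0] + 0.1 * rng.normal(size=8)   # pretend the query embeds near "iphone"

def knn(query, matrix, k=3):
    sims = (matrix @ query) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)[:k]                     # indices of the k most similar docs

print([docs[i] for i in knn(query_vec, doc_vecs)])
```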
Recommendation Ranking Models
• Wide and Deep Models to capture all possible signals [Cheng, et
al., 2016]
https://arxiv.org/pdf/1606.07792.pdf
61
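A minimal sketch of wide-and-deep scoring, a linear part over sparse cross features plus a small dense tower, combined through a logistic output; sizes and weights are placeholders (see [Cheng et al., 2016] for the actual model):

```python
import numpy as np

rng = np.random.default_rng(8)
x_wide = np.zeros(20); x_wide[[3, 11]] = 1.0           # one-hot cross features
x_deep = rng.normal(size=16)                           # concatenated dense embeddings
w_wide = rng.normal(scale=0.1, size=20)
W1, W2 = rng.normal(scale=0.1, size=(16, 8)), rng.normal(scale=0.1, size=8)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

deep_out = np.maximum(0.0, x_deep @ W1) @ W2           # small ReLU tower
score = sigmoid(x_wide @ w_wide + deep_out)            # joint logistic prediction
print(score)
```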
Challenges & Open Problems for Deep
Learning in Recommender Systems
• Distributed training on very large data
• Tensorflow on Spark (https://github.com/yahoo/TensorFlowOnSpark)
• CNTK (https://github.com/Microsoft/CNTK)
• MXNet (http://mxnet.io/)
• Caffe (http://caffe.berkeleyvision.org/)
• …
• Latency Issues from Online Scoring
• Pre-generation of user / item embeddings
• Multi-layer scoring (simple models => complex)
• Batch vs online training
62
Part III – Case Study: Jobs You May Be
Interested In (JYMBII)
63
Outline
• Introduction
• Generating Embeddings via Word2vec
• Generating Embeddings via Deep Networks
• Tree Feature Transforms in Deep + Wide Framework
64
Introduction: JYMBII
65
Introduction: Problem Formulation
• Rank jobs by $P(\text{User } u \text{ applies to Job } j \mid u, j)$
• Model response given:
[User side: Career History, Skills, Education, Connections | Job side: Job Title, Description, Location, Company]
66
Introduction: JYMBII Modeling- Generalization
Recommend
• Model should learn general rules to predict which
jobs to recommend to a member.
• Learn generalizations based on similarity in title, skill,
location, etc between profile and job posting
67
Introduction: JYMBII Modeling - Memorization
Applies to
68
• Model should memorize exceptions to the rules
• Learn exceptions based on frequent co-occurrence of features
Introduction: Baseline Features
• Dense BoW Similarity Features for Generalization
• e.g.: Similarity in title text is a good predictor of response
• Sparse Two-Depth Cross Features for Memorization
• e.g.: Memorize that computer science students will transition to entry-level engineering roles
Vector BoW Similarity Feature
Sim(User Title BoW,
Job Title BoW)
Sparse Cross Feature
AND(user = Comp Sci. Student,
job = Software Engineer)
Sparse Cross Feature
AND(user = In Silicon Valley,
job = In Austin, TX)
Sparse Cross Feature
AND(user = ML Engineer,
job = UX Designer)
69
Introduction: Issues
• BoW Features don’t capture semantic similarity between user/job
• Cosine Similarity between Application Developer and Software Engineer is 0
• Generating three-depth, four-depth cross features won’t scale
• e.g. Memorizing that factory workers from Detroit are applying to fracking jobs in Pennsylvania
• Hand-engineered features are time consuming and will have low coverage
• Permutations of three-depth, four-depth cross features grow exponentially
70
Introduction: Deep + Wide for JYMBII
• BoW Features don’t capture semantic similarity between user/job
• Generate embeddings to capture Generalization through semantic similarity
• Deep + Wide model for JYMBII [Cheng et al., 2016]
Semantic Similarity Feature
Sim(User Embedding,
Job Embedding)
Global Model Cross Feature
AND(user = Comp Sci. Student,
job = Software Engineer)
User Model Cross Feature
AND(user = User 2,
job = Job Latent Feature 1 )
Job Model Cross Feature
AND(user = User Latent Feature,
job = Job 1)
71
Sparse Cross Feature
AND(user = Comp Sci. Student,
job = Software Engineer)
Sparse Cross Feature
AND(user = In Silicon Valley,
job = In Austin, TX)
Sparse Cross Feature
AND(user = ML Engineer,
job = UX Designer)
Vector BoW Similarity Feature
Sim(User Title BoW,
Job Title BoW)
Generating Embeddings via Word2vec:
Training Word Vectors
• Key Ideas
• Same users (context) apply to similar jobs (target)
• Similar users (target) will apply to the same jobs (context)
Application Developer => Software Engineer
• Train word vectors via word2vec skip-gram architecture
• Concatenate user’s current title and the applied job’s title as input
User Title Applied Job Title
72
Generating Embeddings via Word2vec:
Model Structure
[Diagram: tokenized titles (user: “Application, Developer”; job: “Software, Engineer”) → word embedding lookup from pre-trained word vectors → entity embeddings via average pooling → cosine similarity between user and job embeddings → response prediction (logistic regression)]
73
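A sketch of this structure: average-pool pre-trained word vectors into user and job title embeddings, then feed their cosine similarity to a logistic regression; vectors and weights are toy values:

```python
import numpy as np

rng = np.random.default_rng(9)
word_vecs = {w: rng.normal(size=8) for w in
             ["application", "developer", "software", "engineer"]}

def title_embedding(tokens):
    return np.mean([word_vecs[t] for t in tokens if t in word_vecs], axis=0)

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = title_embedding(["application", "developer"])
j = title_embedding(["software", "engineer"])
feature = cosine(u, j)                             # similarity feature for the ranker

w, b = 2.0, -0.5                                   # toy logistic-regression weights
p_apply = 1.0 / (1.0 + np.exp(-(w * feature + b)))
print(feature, p_apply)
```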
Generating Embeddings via Word2vec:
Results and Next Steps
• Receiver Operating Characteristic – Area Under Curve for evaluation
• Response prediction is binary classification: Apply or don’t Apply
• Highly skewed data: Low CTR for Apply Action
• Good metric for ranking quality: Focus on discriminatory ability of model
• Marginal 0.87% ROC AUC Gain
• How to improve quality of embeddings?
• Optimize embeddings for prediction task with supervised training
• Leverage richer context about user and job
74
Generating Embeddings via Deep Networks:
Model Structure
[Diagram: for both user and job, sparse features (title, skill, company) → embedding layer → hidden layer → entity embedding; the two entity embeddings are combined via a Hadamard product (elementwise product) and fed to response prediction (logistic regression)]
75
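A sketch of the deep side of this model under toy sizes: sparse ids → embedding lookup → hidden layer → entity embedding, with the user and job embeddings combined by a Hadamard product:

```python
import numpy as np

rng = np.random.default_rng(10)
vocab, emb_dim, hid_dim = 50, 16, 8
E = rng.normal(scale=0.1, size=(vocab, emb_dim))          # embedding table
Wh = rng.normal(scale=0.1, size=(emb_dim, hid_dim))       # hidden layer
w_out = rng.normal(scale=0.1, size=hid_dim)               # logistic-regression weights

def entity_embedding(sparse_ids):
    pooled = E[sparse_ids].sum(axis=0)                    # embedding-layer lookup + sum
    return np.maximum(0.0, pooled @ Wh)                   # hidden layer (ReLU)

user_ids, job_ids = [1, 7, 23], [7, 30]                   # sparse feature indices
interaction = entity_embedding(user_ids) * entity_embedding(job_ids)  # Hadamard product
p_apply = 1.0 / (1.0 + np.exp(-(interaction @ w_out)))
print(p_apply)
```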
Generating Embeddings via Deep Networks:
Hyper Parameters, Lots of Knobs!
• Optimizer Used
• SGD w/ Momentum and exponential decay vs. Adam [Kingma et al., 2015] (Adam)
• Learning Rate
• $10^{-5}$ to $10^{-3}$ ($10^{-4}$)
• Embedding Layer Size
• 50 to 200 (100)
• Dropout
• 0% to 50% dropout (0% dropout)
• Sharing Parameter Space for both user/job embeddings
• Assumes commutative property of recommendations (a + b = b + a) (No shared parameter space)
• Hidden Layer Sizes
• 0 to 2 Hidden Layers (200 -> 200 Hidden Layer Size)
• Activation Function
• ReLU vs. Tanh (ReLU)
76
Generating Embeddings via Deep Networks:
Training Challenges
• Millions of rows of training data, impossible to store all in memory
• Stream data incrementally directly from files into a fixed size example pool
• Add shuffling by randomly sampling from example pool for training batches
• Extreme dimensionality of company sparse feature
• Reduce dimensionality of company feature from millions -> tens of thousands
• Perform feature selection by frequency in training set
• Hyper parameter tuning
• Distribute grid search through parallel modeling in single driver Spark jobs
77
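A sketch of the fixed-size example-pool shuffle described above; pool size and data are illustrative:

```python
import random

def pool_shuffle(stream, pool_size=10_000, seed=42):
    """Approximate shuffling of a data stream without loading it all in memory."""
    rng = random.Random(seed)
    pool = []
    for example in stream:
        if len(pool) < pool_size:
            pool.append(example)                 # fill the pool first
        else:
            i = rng.randrange(pool_size)
            yield pool[i]                        # emit a random pooled example
            pool[i] = example                    # replace it with the new one
    rng.shuffle(pool)
    yield from pool                              # drain what is left

shuffled = list(pool_shuffle(range(100), pool_size=16))
print(shuffled[:10])
```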
Generating Embeddings via Deep Networks:
Results
Model ROC AUC
Baseline Model 0.753
Deep + Wide Model 0.790 (+4.91%***)
*** For reference, a previous major JYMBII
modeling improvement with a 20% lift in ROC
AUC resulted in a 30% lift in Job Applications
78
Response Prediction (Logistic Regression)
The Current Deep + Wide Model
Deep Embedding Features (Feed Forward NN)
• Generating three-depth, four-depth cross features won’t scale
• Smart feature selection required
Wide Sparse Cross Features (Two-Depth)
79
Tree Feature Transforms: Feature Selection via
Gradient Boosted Decision Trees
Each tree outputs a path from root to leaf encoding
a combination of feature crosses [He et al., 2014]
GBDTs select the most useful combinations of
feature crosses for memorization
[Diagram: two example GBDT trees. One splits on Member Seniority: Vice President, Member Industry: Banking, Member Location: Silicon Valley, and Member Skill: Statistics; the other splits on Job Seniority: CXO and Job Title: ML Engineer. Each root-to-leaf path (Yes/No decisions) encodes a combination of feature crosses.]
80
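A hedged sketch of the GBDT feature transform from [He et al., 2014] using scikit-learn (assumed available): each sample is encoded by the leaf it reaches in every tree, then one-hot encoded as cross features; the data is synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(11)
X = rng.normal(size=(500, 6))
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # a crossed signal to memorize

gbdt = GradientBoostingClassifier(n_estimators=20, max_depth=3).fit(X, y)
leaves = gbdt.apply(X)[:, :, 0]                  # (n_samples, n_trees) leaf indices
cross_features = OneHotEncoder().fit_transform(leaves)   # sparse "leaf path" features
print(cross_features.shape)                      # feed these to the wide/logistic part
```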
Response Prediction (Logistic Regression)
Tree Feature Transforms: The Full Picture
How to train both the NN model and GBDT model
jointly with each other?
Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT)
81
Tree Feature Transforms: Joint Training via
Block-wise Cyclic Coordinate Descent
• Treat NN model and GBDT model as separate block-wise coordinates
• Implemented by
1. Training the NN until convergence
2. Training GBDT w/ fixed NN embeddings
3. Training the regression layer weights w/ generated cross features from GBDT
4. Training the NN until convergence w/ fixed cross features
5. Cycle step 2-4 until global convergence criteria
82
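A structural sketch of this schedule with stub trainers (placeholders only, not real training code), just to show the alternation between blocks:

```python
def train_nn(nn, cross_features):          # steps 1 and 4: NN with cross features fixed
    return nn + 1                          # placeholder "training"

def train_gbdt(nn):                        # step 2: GBDT with the NN output as initial margin
    return f"gbdt_given_nn_{nn}"

def train_lr_weights(nn, gbdt):            # step 3: regression layer on NN + GBDT features
    return (nn, gbdt)

nn, gbdt = 0, None
nn = train_nn(nn, cross_features=None)     # 1. train NN until convergence
for cycle in range(3):                     # 5. cycle 2-4 until a convergence criterion
    gbdt = train_gbdt(nn)                  # 2.
    lr = train_lr_weights(nn, gbdt)        # 3.
    nn = train_nn(nn, cross_features=gbdt) # 4.
print(nn, gbdt, lr)
```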
Response Prediction (Logistic Regression)
Tree Feature Transforms: Train NN Until
Convergence
Initially no trees are in our forest
Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT)
83
Response Prediction (Logistic Regression)
Tree Feature Transforms: Train GBDT w/ NN
Section as Initial Margin
Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT)
84
Response Prediction (Logistic Regression)
Tree Feature Transforms: Train GBDT w/ NN
Section as Initial Margin
Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT)
85
Response Prediction (Logistic Regression)
Tree Feature Transforms: Train Regression
Layer Weights
Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT)
86
Response Prediction (Logistic Regression)
Tree Feature Transforms: Train NN w/ GBDT
Section as Initial Margin
Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT)
87
Tree Feature Transforms: Block-wise
Coordinate Descent Results
Model ROC AUC
Baseline Model 0.753
Deep + Wide Model 0.790 (+4.91%)
Deep + Wide Model w/ GBDT Iteration 1 0.792 (+5.18%)
Deep + Wide Model w/ GBDT Iteration 2 0.794 (+5.44%)
Deep + Wide Model w/ GBDT Iteration 3 0.795 (+5.57%)
Deep + Wide Model w/ GBDT Iteration 4 0.796 (+5.71%)
88
JYMBII Deep + Wide: Future Direction
• Generating Embeddings w/ LSTM Networks
• Leverage sequential career history data
• Promising results in NEMO: Next Career Move Prediction with Contextual
Embedding [Li et al., 2017]
• Semi-Supervised Training
• Leverage pre-trained title, skill, and company embeddings on profile data
• Replace the Hadamard product as the entity embedding similarity function
• Deep Crossing [Shan et al., 2016]
• Add even richer context
• i.e. Location, Education, and Network features
89
Part IV – Case Study: Deep Learning Networks
for Job Search
90
Outline
• Introduction
• Representations via Word2vec
• Robust Representations via DSSM
91
Introduction: Job Search
92
Introduction: Search Architecture
[Diagram: User Query → Query Understanding → Top-K retrieval from the Index (built by the Indexer) → Result Ranking, using a model from Offline Training → Results]
93
Introduction: Query Understanding -
Segmentation and Tagging
• First divide the search query into
segments
• Tag query segments based on
recognized entity tags
Query Segmentations: [Oracle] [Java] [Application Developer] or [Oracle] [Java Application Developer]
Query Tagging: COMPANY = Oracle, SKILL = Java, TITLE = Application Developer or COMPANY = Oracle, TITLE = Java Application Developer
94
Introduction: Query Understanding –
Expansion
• Task of adding additional
synonyms/related entities to the
query to improve recall
• Current Approach: Curated dictionary
for common synonyms and related
entities
COMPANY = Oracle OR NetSuite OR
Taleo OR Sun Microsystems OR …
SKILL = Java OR Java EE OR J2EE
OR JVM OR JRE OR JDK …
TITLE = Application Developer OR
Software Engineer OR
Software Developer OR
Programmer …
Green – Synonyms
Blue – Related Entities
95
Introduction: Query Understanding - Retrieval
and Ranking
COMPANY = Oracle OR NetSuite OR Taleo OR
Sun Microsystems OR …
SKILL = Java OR Java EE OR J2EE OR JVM
OR JRE OR JDK …
TITLE = Application Developer OR
Software Engineer OR
Software Developer OR
Programmer …
Title
Title
Skills
Company
96
Introduction: Issues – Retrieval and Ranking
• Term retrieval has limitations
• Cross language retrieval
• Softwareentwickler → Software developer
• Word Inflections
• Engineering Management → Engineering Manager
• Query expansion via curated dictionary of synonyms is not scalable
• Expensive to refresh and store synonyms for all possible entities
• Heavy reliance on query tagging is not robust enough
• Novel title, skill, and company entities will not be tagged correctly
• Errors upstream propagate to poor retrieval and ranking
97
Introduction: Solution – Deep Learning for
Query and Document Representations
• Query and document representations
• Map queries and document text to vectors in semantic space
• Robust to Handle Out of Vocabulary words
• Term retrieval has limitations
• Query expansion via curated dictionary of synonyms is not scalable
• Map synonyms, translations and inflections to similar vectors in semantic space
• Term retrieval on cluster id or KNN based retrieval
• Heavy reliance on query tagging is not robust enough
• Complement structured query representations with semantic representations
98
Representations via Word2vec:
Leverage JYMBII Work
• Key Ideas
• Similar users (context) apply to the same job (target)
• The same user (target) will apply to similar jobs (context)
Application Developer => Software Engineer
• Train word vectors via word2vec skip-gram architecture
• Concatenate user’s current title and the applied job’s title as input
User Title Applied Job Title
99
Representations via Word2vec:
Word2vec in Ranking
[Diagram: tokenized text (query: “Application, Developer”; job: “Software, Engineer”) → word embedding lookup from pre-trained word vectors → entity embeddings via average pooling → cosine similarity between query and job embeddings → learning to rank model (NDCG loss)]
100
Representations via Word2vec:
Ranking Model Results
Model | NDCG@5 (Normalized Discounted Cumulative Gain@5) | CTR@5 Lift (%)
Baseline Model | 0.582 | +0.0%
Baseline Model + Word2Vec Feature | 0.595 (+2.2%) | +1.6%
101
Representations via Word2vec:
Optimize Embeddings for Job Search Use Case
• Leverage apply and click feedback to guide learning of embeddings
• Fine tune embeddings for task using supervised feedback
• Handle out of vocabulary words and scale to query vocabulary size
• Compared to JYMBII, query vocabulary is much larger and less well-formed
• Misspellings
• Word Inflections
• Free text search
• Need to make representations more robust for these free text queries
102
Robust Representations via DSSM:
Deep Structured Semantic Model [Huang et al., 2013]
[Diagram: raw text for the query (“Application Developer”), the applied job (positive, “Software Engineer”), and a randomly sampled applied job (negative, “Hairdresser”) → tri-letter hashing (#Ap, App, ppl…; #So, Sof, oft…; #Ha, Hai, air…) → hidden layers 1-3 → cosine similarity between query and job embeddings → softmax w/ cross-entropy loss]
103
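A sketch of the DSSM objective with shared-weight towers, cosine scoring against one positive and sampled negatives, and softmax cross-entropy; all tensors, sizes, and the smoothing factor are placeholders:

```python
import numpy as np

rng = np.random.default_rng(12)
in_dim, h, out = 50, 32, 16
W1, W2, W3 = (rng.normal(scale=0.1, size=s) for s in [(in_dim, h), (h, h), (h, out)])

def tower(x):                               # same parameters for query and documents
    a = np.tanh(x @ W1)
    a = np.tanh(a @ W2)
    return np.tanh(a @ W3)

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

query = rng.random(in_dim)                  # stand-ins for tri-letter count vectors
pos_job = rng.random(in_dim)
neg_jobs = rng.random((4, in_dim))          # randomly sampled negatives

q = tower(query)
sims = np.array([cosine(q, tower(d)) for d in np.vstack([pos_job[None, :], neg_jobs])])
logits = 10.0 * sims                        # smoothing factor on the cosine scores
probs = np.exp(logits - logits.max()); probs /= probs.sum()
loss = -np.log(probs[0])                    # cross-entropy: the positive is index 0
print(loss)
```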
Robust Representations via DSSM:
Tri-letter Hashing
• Tri-letter Hashing Example
• Engineer -> #en, eng, ngi, gin, ine, nee, eer, er#
• Benefits of Tri-letter Hashing
• More compact Bag of Tri-letters vs. Bag of Words representation
• 700K Word Vocabulary -> 75K Tri-letters
• Can generalize for out of vocabulary words
• Tri-letter hashing robust to minor misspellings and inflections of words
• Engneer -> #en, eng, ngn, gne, nee, eer, er#
104
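A sketch of tri-letter hashing matching the example above:

```python
def tri_letters(word):
    """Add word boundary markers, then slide a 3-character window."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(tri_letters("Engineer"))   # ['#en', 'eng', 'ngi', 'gin', 'ine', 'nee', 'eer', 'er#']
print(tri_letters("Engneer"))    # robust to the misspelling: most trigrams overlap
```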
Robust Representations via DSSM:
Training Details
105
• Parameter Sharing Helps
• Better and faster convergence
• Model size is reduced
• Regularization
• L2 performs better than dropout
• Toolkit Comparisons (CNTK vs TensorFlow)
• CNTK: Faster convergence and better model quality
• TensorFlow: Easy to implement and better community support; comparable model quality
[Figure: training performance with/without parameter sharing]
Robust Representations via DSSM:
Lessons in Production Environment
106
[Figure: +100%, +70%, +40%]
• Bottlenecks in Production
Environment
• Latency due to extra computation
• Latency due to GC activity
• Fat Jars in JVM environment
• Practical Lessons
• Avoid JVM Heap while serving the
model
• Caching most accessed entities’
embedding
Robust Representations via DSSM:
DSSM Qualitative Results
Query: Software Engineer | Data Mining | LinkedIn | Softwareentwickler
Similar queries: Engineer Software | Data Miner | Google | Software
Similar queries: Software Engineers | Machine Learning Engineer | Software Engineers | Software Engineer
Similar queries: Software Engineering | Microsoft Research | Software Engineer | Engineer Software
For qualitative results, only top head queries are taken to analyze similarity to each other
107
Robust Representations via DSSM:
DSSM Metric Results
Model | NDCG@5 (Normalized Discounted Cumulative Gain@5) | CTR@5 Lift (%)
Baseline Model | 0.582 | +0.0%
Baseline Model + Word2Vec Feature | 0.595 (+2.2%) | +1.6%
Baseline Model + DSSM Feature | 0.602 (+3.4%) | +3.2%
108
Robust Representations via DSSM:
DSSM Future Direction
• Leverage Current Query Understanding Into DSSM Model
• Query tag entity information for richer context embeddings
• Query segmentation structure can be considered into the network design
• Deep Crossing for Similarity Layer [Shan et al., 2016]
• Convolutional DSSM [Shen et al., 2014]
109
Conclusion
• Recommender Systems and personalized search are very similar
problems
• Deep Learning is here to stay and can have significant impact on both
• Understanding and constructing queries
• Ranking
• Deep learning and more traditional techniques are *not* mutually
exclusive (hint: Deep + Wide)
110
Appendix – Backup slides
111
Back up – Part I
112
Difference between parameter sharing in 1-D
convolution and RNN?
• CNN Kernel: output unit depends on small number of neighboring input units
through same kernel
• RNN update rule: output unit depends on previous output units through same
update rule. Deeper computational graph.
More Related Content

What's hot

NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers Arvind Devaraj
 
Context-aware Recommendation: A Quick View
Context-aware Recommendation: A Quick ViewContext-aware Recommendation: A Quick View
Context-aware Recommendation: A Quick ViewYONG ZHENG
 
Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019Anoop Deoras
 
A Multi-Armed Bandit Framework For Recommendations at Netflix
A Multi-Armed Bandit Framework For Recommendations at NetflixA Multi-Armed Bandit Framework For Recommendations at Netflix
A Multi-Armed Bandit Framework For Recommendations at NetflixJaya Kawale
 
Matrix factorization
Matrix factorizationMatrix factorization
Matrix factorizationLuis Serrano
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender SystemsJustin Basilico
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Jeong-Gwan Lee
 
Introduction to Visual transformers
Introduction to Visual transformers Introduction to Visual transformers
Introduction to Visual transformers leopauly
 
Simple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in MahoutSimple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in MahoutData Science London
 
Tutorial: Context In Recommender Systems
Tutorial: Context In Recommender SystemsTutorial: Context In Recommender Systems
Tutorial: Context In Recommender SystemsYONG ZHENG
 
Personalized Job Recommendation System at LinkedIn: Practical Challenges and ...
Personalized Job Recommendation System at LinkedIn: Practical Challenges and ...Personalized Job Recommendation System at LinkedIn: Practical Challenges and ...
Personalized Job Recommendation System at LinkedIn: Practical Challenges and ...Benjamin Le
 
Sequential Decision Making in Recommendations
Sequential Decision Making in RecommendationsSequential Decision Making in Recommendations
Sequential Decision Making in RecommendationsJaya Kawale
 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Balázs Hidasi
 
Recommender systems using collaborative filtering
Recommender systems using collaborative filteringRecommender systems using collaborative filtering
Recommender systems using collaborative filteringD Yogendra Rao
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsYONG ZHENG
 
Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019Faisal Siddiqi
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introductionLiang Xiang
 
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Xavier Amatriain
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksChristian Perone
 

What's hot (20)

NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers
 
Context-aware Recommendation: A Quick View
Context-aware Recommendation: A Quick ViewContext-aware Recommendation: A Quick View
Context-aware Recommendation: A Quick View
 
Introduction to Transformer Model
Introduction to Transformer ModelIntroduction to Transformer Model
Introduction to Transformer Model
 
Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019
 
A Multi-Armed Bandit Framework For Recommendations at Netflix
A Multi-Armed Bandit Framework For Recommendations at NetflixA Multi-Armed Bandit Framework For Recommendations at Netflix
A Multi-Armed Bandit Framework For Recommendations at Netflix
 
Matrix factorization
Matrix factorizationMatrix factorization
Matrix factorization
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)
 
Introduction to Visual transformers
Introduction to Visual transformers Introduction to Visual transformers
Introduction to Visual transformers
 
Simple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in MahoutSimple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in Mahout
 
Tutorial: Context In Recommender Systems
Tutorial: Context In Recommender SystemsTutorial: Context In Recommender Systems
Tutorial: Context In Recommender Systems
 
Personalized Job Recommendation System at LinkedIn: Practical Challenges and ...
Personalized Job Recommendation System at LinkedIn: Practical Challenges and ...Personalized Job Recommendation System at LinkedIn: Practical Challenges and ...
Personalized Job Recommendation System at LinkedIn: Practical Challenges and ...
 
Sequential Decision Making in Recommendations
Sequential Decision Making in RecommendationsSequential Decision Making in Recommendations
Sequential Decision Making in Recommendations
 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017
 
Recommender systems using collaborative filtering
Recommender systems using collaborative filteringRecommender systems using collaborative filtering
Recommender systems using collaborative filtering
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender Systems
 
Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introduction
 
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural Networks
 

Similar to Deep Learning for Personalized Search and Recommender Systems

Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRUananth
 
5_RNN_LSTM.pdf
5_RNN_LSTM.pdf5_RNN_LSTM.pdf
5_RNN_LSTM.pdfFEG
 
Artificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningArtificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningSujit Pal
 
Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Satyam Saxena
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningCastLabKAIST
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learningVishwas Lele
 
Deep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudDataDeep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudDataWeCloudData
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Roelof Pieters
 
Machine Learning, Deep Learning and Data Analysis Introduction
Machine Learning, Deep Learning and Data Analysis IntroductionMachine Learning, Deep Learning and Data Analysis Introduction
Machine Learning, Deep Learning and Data Analysis IntroductionTe-Yen Liu
 
Deep learning with TensorFlow
Deep learning with TensorFlowDeep learning with TensorFlow
Deep learning with TensorFlowBarbara Fusinska
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You NeedDaiki Tanaka
 
Artificial Intelligence and Deep Learning in Azure, CNTK and Tensorflow
Artificial Intelligence and Deep Learning in Azure, CNTK and TensorflowArtificial Intelligence and Deep Learning in Azure, CNTK and Tensorflow
Artificial Intelligence and Deep Learning in Azure, CNTK and TensorflowJen Stirrup
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Gaurav Mittal
 
Chapter10.pptx
Chapter10.pptxChapter10.pptx
Chapter10.pptxadnansbp
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Turi, Inc.
 
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakLearn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakPyData
 
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...MLconf
 
Keynote at IWLS 2017
Keynote at IWLS 2017Keynote at IWLS 2017
Keynote at IWLS 2017Manish Pandey
 
Deep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep FeaturesDeep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep FeaturesTuri, Inc.
 

Similar to Deep Learning for Personalized Search and Recommender Systems (20)

Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
 
5_RNN_LSTM.pdf
5_RNN_LSTM.pdf5_RNN_LSTM.pdf
5_RNN_LSTM.pdf
 
Artificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningArtificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep Learning
 
DLBLR talk
DLBLR talkDLBLR talk
DLBLR talk
 
Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Deep Learning Bangalore meet up
Deep Learning Bangalore meet up
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine Learning
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learning
 
Deep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudDataDeep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudData
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!
 
Machine Learning, Deep Learning and Data Analysis Introduction
Machine Learning, Deep Learning and Data Analysis IntroductionMachine Learning, Deep Learning and Data Analysis Introduction
Machine Learning, Deep Learning and Data Analysis Introduction
 
Deep learning with TensorFlow
Deep learning with TensorFlowDeep learning with TensorFlow
Deep learning with TensorFlow
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
Artificial Intelligence and Deep Learning in Azure, CNTK and Tensorflow
Artificial Intelligence and Deep Learning in Azure, CNTK and TensorflowArtificial Intelligence and Deep Learning in Azure, CNTK and Tensorflow
Artificial Intelligence and Deep Learning in Azure, CNTK and Tensorflow
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)
 
Chapter10.pptx
Chapter10.pptxChapter10.pptx
Chapter10.pptx
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
 
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakLearn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
 
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
 
Keynote at IWLS 2017
Keynote at IWLS 2017Keynote at IWLS 2017
Keynote at IWLS 2017
 
Deep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep FeaturesDeep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep Features
 

Recently uploaded

Landsman converter for power factor improvement
Landsman converter for power factor improvementLandsman converter for power factor improvement
Landsman converter for power factor improvementVijayMuni2
 
Transforming Process Safety Management: Challenges, Benefits, and Transition ...
Transforming Process Safety Management: Challenges, Benefits, and Transition ...Transforming Process Safety Management: Challenges, Benefits, and Transition ...
Transforming Process Safety Management: Challenges, Benefits, and Transition ...soginsider
 
Nodal seismic construction requirements.pptx
Nodal seismic construction requirements.pptxNodal seismic construction requirements.pptx
Nodal seismic construction requirements.pptxwendy cai
 
Clutches and brkesSelect any 3 position random motion out of real world and d...
Clutches and brkesSelect any 3 position random motion out of real world and d...Clutches and brkesSelect any 3 position random motion out of real world and d...
Clutches and brkesSelect any 3 position random motion out of real world and d...sahb78428
 
The relationship between iot and communication technology
The relationship between iot and communication technologyThe relationship between iot and communication technology
The relationship between iot and communication technologyabdulkadirmukarram03
 
Summer training report on BUILDING CONSTRUCTION for DIPLOMA Students.pdf
Summer training report on BUILDING CONSTRUCTION for DIPLOMA Students.pdfSummer training report on BUILDING CONSTRUCTION for DIPLOMA Students.pdf
Summer training report on BUILDING CONSTRUCTION for DIPLOMA Students.pdfNaveenVerma126
 
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS Bahzad5
 
Multicomponent Spiral Wound Membrane Separation Model.pdf
Multicomponent Spiral Wound Membrane Separation Model.pdfMulticomponent Spiral Wound Membrane Separation Model.pdf
Multicomponent Spiral Wound Membrane Separation Model.pdfGiovanaGhasary1
 
A Seminar on Electric Vehicle Software Simulation
A Seminar on Electric Vehicle Software SimulationA Seminar on Electric Vehicle Software Simulation
A Seminar on Electric Vehicle Software SimulationMohsinKhanA
 
Design of Clutches and Brakes in Design of Machine Elements.pptx
Design of Clutches and Brakes in Design of Machine Elements.pptxDesign of Clutches and Brakes in Design of Machine Elements.pptx
Design of Clutches and Brakes in Design of Machine Elements.pptxYogeshKumarKJMIT
 
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...Amil baba
 
Modelling Guide for Timber Structures - FPInnovations
Modelling Guide for Timber Structures - FPInnovationsModelling Guide for Timber Structures - FPInnovations
Modelling Guide for Timber Structures - FPInnovationsYusuf Yıldız
 
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdf
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdfRenewable Energy & Entrepreneurship Workshop_21Feb2024.pdf
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdfodunowoeminence2019
 
IT3401-WEB ESSENTIALS PRESENTATIONS.pptx
IT3401-WEB ESSENTIALS PRESENTATIONS.pptxIT3401-WEB ESSENTIALS PRESENTATIONS.pptx
IT3401-WEB ESSENTIALS PRESENTATIONS.pptxSAJITHABANUS
 
UNIT4_ESD_wfffffggggggggggggith_ARM.pptx
UNIT4_ESD_wfffffggggggggggggith_ARM.pptxUNIT4_ESD_wfffffggggggggggggith_ARM.pptx
UNIT4_ESD_wfffffggggggggggggith_ARM.pptxrealme6igamerr
 
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...amrabdallah9
 
Phase noise transfer functions.pptx
Phase noise transfer      functions.pptxPhase noise transfer      functions.pptx
Phase noise transfer functions.pptxSaiGouthamSunkara
 

Recently uploaded (20)

Landsman converter for power factor improvement
Landsman converter for power factor improvementLandsman converter for power factor improvement
Landsman converter for power factor improvement
 
Transforming Process Safety Management: Challenges, Benefits, and Transition ...
Transforming Process Safety Management: Challenges, Benefits, and Transition ...Transforming Process Safety Management: Challenges, Benefits, and Transition ...
Transforming Process Safety Management: Challenges, Benefits, and Transition ...
 
Nodal seismic construction requirements.pptx
Nodal seismic construction requirements.pptxNodal seismic construction requirements.pptx
Nodal seismic construction requirements.pptx
 
Clutches and brkesSelect any 3 position random motion out of real world and d...
Clutches and brkesSelect any 3 position random motion out of real world and d...Clutches and brkesSelect any 3 position random motion out of real world and d...
Clutches and brkesSelect any 3 position random motion out of real world and d...
 
The relationship between iot and communication technology
The relationship between iot and communication technologyThe relationship between iot and communication technology
The relationship between iot and communication technology
 
Summer training report on BUILDING CONSTRUCTION for DIPLOMA Students.pdf
Summer training report on BUILDING CONSTRUCTION for DIPLOMA Students.pdfSummer training report on BUILDING CONSTRUCTION for DIPLOMA Students.pdf
Summer training report on BUILDING CONSTRUCTION for DIPLOMA Students.pdf
 
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS
 
Multicomponent Spiral Wound Membrane Separation Model.pdf
Multicomponent Spiral Wound Membrane Separation Model.pdfMulticomponent Spiral Wound Membrane Separation Model.pdf
Multicomponent Spiral Wound Membrane Separation Model.pdf
 
A Seminar on Electric Vehicle Software Simulation
A Seminar on Electric Vehicle Software SimulationA Seminar on Electric Vehicle Software Simulation
A Seminar on Electric Vehicle Software Simulation
 
Design of Clutches and Brakes in Design of Machine Elements.pptx
Design of Clutches and Brakes in Design of Machine Elements.pptxDesign of Clutches and Brakes in Design of Machine Elements.pptx
Design of Clutches and Brakes in Design of Machine Elements.pptx
 
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...
 
Modelling Guide for Timber Structures - FPInnovations
Modelling Guide for Timber Structures - FPInnovationsModelling Guide for Timber Structures - FPInnovations
Modelling Guide for Timber Structures - FPInnovations
 
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdf
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdfRenewable Energy & Entrepreneurship Workshop_21Feb2024.pdf
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdf
 
Lecture 4 .pdf
Lecture 4                              .pdfLecture 4                              .pdf
Lecture 4 .pdf
 
Lecture 2 .pptx
Lecture 2                            .pptxLecture 2                            .pptx
Lecture 2 .pptx
 
IT3401-WEB ESSENTIALS PRESENTATIONS.pptx
IT3401-WEB ESSENTIALS PRESENTATIONS.pptxIT3401-WEB ESSENTIALS PRESENTATIONS.pptx
IT3401-WEB ESSENTIALS PRESENTATIONS.pptx
 
UNIT4_ESD_wfffffggggggggggggith_ARM.pptx
UNIT4_ESD_wfffffggggggggggggith_ARM.pptxUNIT4_ESD_wfffffggggggggggggith_ARM.pptx
UNIT4_ESD_wfffffggggggggggggith_ARM.pptx
 
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
 
Phase noise transfer functions.pptx
Phase noise transfer      functions.pptxPhase noise transfer      functions.pptx
Phase noise transfer functions.pptx
 
Présentation IIRB 2024 Chloe Dufrane.pdf
Présentation IIRB 2024 Chloe Dufrane.pdfPrésentation IIRB 2024 Chloe Dufrane.pdf
Présentation IIRB 2024 Chloe Dufrane.pdf
 

Deep Learning for Personalized Search and Recommender Systems

• $\theta$ represents the parameter space and D is the set of all (w, c) pairs
• $p(c \mid w; \theta)$ denotes the probability of context c given word w, parametrized by $\theta$
• The probability of all the context appearing given a word is $\prod_{c \in C(w)} p(c \mid w; \theta)$
• The training objective then becomes $\arg\max_{\theta} \prod_{(w,c) \in D} p(c \mid w; \theta)$
18
• 19. Word2vec details
• Let $v_w$ and $v_c$ represent the vectors of the current word and context. Note that $v_c$ and $v_w$ are the parameters we want to learn
• $p(c \mid w; \theta) = \dfrac{e^{v_c \cdot v_w}}{\sum_{d \in C} e^{v_d \cdot v_w}}$, where C is the set of all available contexts
19
• 20. Negative Sampling – basic intuition
• The normalizer of $p(c \mid w; \theta) = \dfrac{e^{v_c \cdot v_w}}{\sum_{d \in C} e^{v_d \cdot v_w}}$ is expensive to compute over all contexts
• Instead, sample negative contexts from the unigram distribution rather than taking all contexts into account
• Word2vec itself is a shallow model and can be used to initialize a deep model
20
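To make the skip-gram-with-negative-sampling objective above concrete, here is a minimal NumPy sketch. It is not the production word2vec implementation: the corpus, vocabulary and hyperparameters are made up, and negatives are drawn uniformly for simplicity (word2vec samples from a smoothed unigram distribution).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus and vocabulary (illustrative only).
corpus = [["abraham", "lincoln", "was", "the", "16th", "president"],
          ["my", "cousin", "drives", "a", "lincoln"]]
vocab = sorted({w for sent in corpus for w in sent})
w2i = {w: i for i, w in enumerate(vocab)}
V, dim, window, k_neg, lr = len(vocab), 16, 2, 3, 0.05

# Two embedding tables: one for target words, one for context words.
W = 0.01 * rng.standard_normal((V, dim))   # target (word) vectors v_w
C = 0.01 * rng.standard_normal((V, dim))   # context vectors v_c

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(50):
    for sent in corpus:
        ids = [w2i[w] for w in sent]
        for pos, w in enumerate(ids):
            for off in range(-window, window + 1):
                ctx_pos = pos + off
                if off == 0 or ctx_pos < 0 or ctx_pos >= len(ids):
                    continue
                c = ids[ctx_pos]
                # One true (w, c) pair plus k sampled negatives
                # (uniform here; may occasionally collide with c).
                negs = rng.integers(0, V, size=k_neg)
                for target, label in [(c, 1.0)] + [(n, 0.0) for n in negs]:
                    score = sigmoid(W[w] @ C[target])
                    grad = score - label            # d(log loss)/d(score input)
                    c_old = C[target].copy()
                    C[target] -= lr * grad * W[w]
                    W[w]      -= lr * grad * c_old

# Words that share contexts end up with similar vectors.
print(W[w2i["lincoln"]] @ W[w2i["president"]])
```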
• 22. Neuron: Computational Unit
• Input vector: x = [x1, x2, …, xn]
• Neuron: weight vector W, bias b, activation function f
• Output: $a = f(W^T x + b)$
22
• 23. Activation Functions
• Tanh: ℝ → (-1, 1), $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
• Sigmoid: ℝ → (0, 1), $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
• ReLU: ℝ → [0, +∞), $f(x) = \max(0, x) = x^{+}$
http://ufldl.stanford.edu/tutorial/supervised/MultiLayerNeuralNetworks/
23
• 24. Layer
• Layer l has nl neurons, with weight matrix W = [W1, …, Wnl], bias vector b = [b1, …, bnl], and activation function f
• Each neuron i outputs $a_i = f(W_i^T x + b_i)$; the layer output vector is $a = f(W^T x + b)$
24
• 25. Layer: Matrix Notation
• Layer l: nl neurons, weight matrix W, bias vector b, activation function f
• Output vector: $a = f(W^T x + b)$
• The matrix notation is more compact and allows fast linear-algebra routines for quick computations in the network
25
• 26. Feed Forward Network
• Depth L layers: input layer, hidden layers, output prediction layer
• Activation at layer l+1: $a^{(l+1)} = f^{(l)}(W^{(l)T} a^{(l)} + b^{(l)})$
• Output: prediction in supervised learning; goal is to approximate y = F(x)
26
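A minimal NumPy sketch of the feed-forward pass just described; the layer sizes, activations and input are arbitrary placeholders, not values from the tutorial.

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def relu(x):    return np.maximum(0.0, x)

rng = np.random.default_rng(0)

# A depth-4 network: input -> two hidden layers -> prediction layer.
layer_sizes = [4, 5, 3, 1]              # illustrative sizes
activations = [relu, relu, sigmoid]     # f(1), f(2), f(3)
params = [(0.1 * rng.standard_normal((n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    """Compute a(l+1) = f(W(l)^T a(l) + b(l)) layer by layer."""
    a = x
    for (W, b), f in zip(params, activations):
        a = f(a @ W + b)                # row-vector convention: a @ W == W^T a
    return a

x = rng.standard_normal(4)              # a single input vector a(1) = x
print(forward(x))                       # prediction a(L)
```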
  • 27. Why CNN: Convolutional Neural Networks? • Large size grid structured data • 1D: time series • 2D: image • Convolution to extract features from image (e.g. edges, texture) • Local connectivity • Parameter sharing • Equivariance to translation: small translations in input do not affect output
• 30.
• Fully connected: each hidden unit is connected to all input units; computationally expensive. For a large image of NxN pixels and a hidden layer with K features, the number of parameters is ~KN²
• Locally connected: each hidden unit is connected to a few contiguous input units; no parameter sharing
• Convolution: locally connected with parameter sharing through the kernel. A 1D kernel vector [W1, W2] corresponds to a Toeplitz weight matrix
$W = \begin{bmatrix} W_1 & W_2 & 0 & 0 \\ 0 & W_1 & W_2 & 0 \\ 0 & 0 & W_1 & W_2 \end{bmatrix}$
• Scales to large inputs and images; equivariance to translation
30
  • 31. Pooling • Summary statistics • Aggregate over region • Reduce size • Less overfitting • Translation invariance • Max, mean http://ufldl.stanford.edu/tutorial/supervised/Pooling/ 31
  • 32. CNN: Convolutional Neural Network Combination • Convolutional layers • Pooling layers • Fully connected layers http://colah.github.io/posts/2014-07-Conv-Nets-Modular/ 32 [LeCun et al., 1998]
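A small NumPy sketch of the two building blocks above: a 1D convolution (the same kernel shared across positions, i.e. the Toeplitz structure) followed by max pooling. The input signal and kernel are made-up examples.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """1-D convolution (cross-correlation) with parameter sharing:
    the same kernel [W1, W2, ...] slides over every input position."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

def max_pool1d(x, size=2):
    """Non-overlapping max pooling: a summary statistic over each region."""
    n = (len(x) // size) * size
    return x[:n].reshape(-1, size).max(axis=1)

x = np.array([0., 0., 1., 1., 1., 0., 0., 2., 2., 0.])
kernel = np.array([1., -1.])            # a tiny edge-detector-like kernel
feature_map = conv1d_valid(x, kernel)   # equivariant to translations of x
pooled = max_pool1d(feature_map, 2)     # smaller, more translation-invariant
print(feature_map, pooled)
```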
  • 33. CNN example for image recognition: ImageNet [Krizhevsky et al., 2012] Pictures courtesy of [Krizhevsky et al., 2012], http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf 33 1st GPU 2nd GPU filters learned by first CNN layer
• 34. Why RNN: Recurrent Neural Network? • Sequential data processing • ex: predict next word in sentence: “I was born in France. I can speak…” • RNN • Persist information through feedback loop • loop passes information from one step to the next • Parameter sharing across time indexes • output unit depends on previous output units through same update rule.
  • 35. Unfolded RNN • Copies of NN passing feedback to one another http://colah.github.io/posts/2015-08-Understanding-LSTMs/ 35
  • 36. LSTM: Long Short Term Memory [Hochreiter et al., 1997] • Avoid vanishing or exploding gradient • Cell state updates regulated by gates • Forget: how much info from cell state to let through • Input: which cell state components to update • Tanh: values to add to cell state • Output: select component values to output picture courtesy of http://colah.github.io/posts/2015-08-Understanding-LSTMs/ Cell state • Long term dependencies • large gap between relevant information and where it is needed • Cell state: long-term memory • Can remember relevant information over long period of time 36
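A minimal NumPy sketch of a single LSTM step following the standard gate equations described above (forget, input, candidate, output). Weight shapes and the input sequence are illustrative placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: gates regulate what the cell state forgets,
    stores, and exposes."""
    Wf, Wi, Wc, Wo, bf, bi, bc, bo = params
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f = sigmoid(Wf @ z + bf)                   # forget gate
    i = sigmoid(Wi @ z + bi)                   # input gate
    g = np.tanh(Wc @ z + bc)                   # candidate values
    o = sigmoid(Wo @ z + bo)                   # output gate
    c = f * c_prev + i * g                     # new cell state (long-term memory)
    h = o * np.tanh(c)                         # new hidden state / output
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 8, 16                              # illustrative sizes
params = [0.1 * rng.standard_normal((d_h, d_h + d_in)) for _ in range(4)] + \
         [np.zeros(d_h) for _ in range(4)]

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((5, d_in)):     # a length-5 input sequence
    h, c = lstm_step(x_t, h, c, params)
print(h.shape)
```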
• 37. Examples of RNN application • Speech recognition [Graves et al., 2013] • Language modeling [Mikolov, 2012] • Machine translation [Kalchbrenner et al., 2013][Sutskever et al., 2014] • Image captioning [Vinyals et al., 2014] 37
  • 38. Training a Deep Neural Network 38
• 39. Cost Function
• m training samples (feature vector, label): $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$
• Per-sample cost: error between label and output from prediction layer, $J(W, b; x^{(i)}, y^{(i)}) = \| a^{(L)}(x^{(i)}) - y^{(i)} \|^2$
• Minimize the cost function over the parameters (weights W and biases b): $J(W, b) = \frac{1}{m} \sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)}) + \frac{\lambda}{2} \sum_{l=1}^{L} \| W^{(l)} \|_F^2$ (average error + regularization)
39
• 40. Gradient Descent
• Random parameter initialization: symmetry breaking
• Gradient descent step: update every parameter $W_{ij}^{(l)}$ and $b_i^{(l)}$ via $\theta = \theta - \alpha \nabla_{\theta} \mathbb{E}[J(\theta)]$
• Gradient computed by backpropagation
• High cost of backpropagation over the full training set
40
• 41. Stochastic Gradient Descent (SGD)
• SGD: follow the negative gradient after a single sample, $\theta = \theta - \alpha \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})$, or after a few samples: mini-batch (e.g. 256)
• Epoch: full pass through the training set
• Randomly shuffle data prior to each training epoch
41
• 42. Backpropagation [Rumelhart et al., 1986]
• Goal: compute the gradient of the cost with respect to every parameter efficiently
• Recursively apply the chain rule for the derivative of a composition of functions: let $y = g(x)$ and $z = f(y) = f(g(x))$, then $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x} = f'(g(x))\, g'(x)$
• Backpropagation steps: 1. Feedforward pass: compute all activations 2. Output error: measures node contribution to output error 3. Backpropagate error through all layers 4. Compute partial derivatives
42
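The four backprop steps plus mini-batch SGD in one minimal NumPy sketch, on a one-hidden-layer network with squared loss and L2 regularization. The data, layer sizes, and hyperparameters are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 4))                   # synthetic features
y = (X[:, :1] - 0.5 * X[:, 1:2] > 0).astype(float)  # synthetic labels

W1, b1 = 0.1 * rng.standard_normal((4, 8)), np.zeros(8)
W2, b2 = 0.1 * rng.standard_normal((8, 1)), np.zeros(1)
alpha, lam, batch = 0.1, 1e-4, 32                   # learning rate, L2, mini-batch

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(20):
    perm = rng.permutation(len(X))                  # shuffle before each epoch
    for s in range(0, len(X), batch):
        idx = perm[s:s + batch]
        xb, yb = X[idx], y[idx]

        # 1. Feedforward pass: compute all activations.
        a1 = np.tanh(xb @ W1 + b1)
        a2 = sigmoid(a1 @ W2 + b2)

        # 2. Output error for squared loss J = ||a2 - y||^2.
        d2 = 2.0 * (a2 - yb) * a2 * (1.0 - a2)      # chain rule through sigmoid

        # 3. Backpropagate the error through the hidden layer.
        d1 = (d2 @ W2.T) * (1.0 - a1 ** 2)          # chain rule through tanh

        # 4. Partial derivatives, including the L2 regularization term.
        gW2 = a1.T @ d2 / len(xb) + lam * W2
        gW1 = xb.T @ d1 / len(xb) + lam * W1

        # Mini-batch SGD update.
        W2 -= alpha * gW2; b2 -= alpha * d2.mean(axis=0)
        W1 -= alpha * gW1; b1 -= alpha * d1.mean(axis=0)
```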
• 43. Training optimization • Learning Rate Schedule • Changing learning rate as learning progresses • Pre-training • Goal: training simple model on simple task before training desired model to perform desired task • Greedy supervised pre-training: pre-train for task on subset of layers as initialization for final network • Regularization to curb overfitting • Goal: reduce generalization error • Penalize parameter norm: L2, L1 • Augment dataset: train on more data • Early stopping: return parameter set at point in time with lowest validation error • Dropout [Srivastava, 2013]: train ensemble of all subnetworks formed by removing non-output units • Gradient clipping to avoid exploding gradient • norm clipping • element wise clipping 43
  • 44. Part II – Deep Learning for Personalized Recommender Systems at Scale 44
  • 45. Examples of Personalized Recommender Systems 45
  • 46. Examples of Personalized Recommender Systems Job Search 46
  • 47. Examples of Personalized Recommender Systems 47
• 48. Personalized Recommender Systems (diagram): User i, with <user features, query (optional)> (e.g., industry, behavioral features, demographic features, …), visits; the algorithm selects item j from a set of candidates; the pair (i, j) yields a response y_ij (action or not, e.g. click, like, share, apply, …) • Which item(s) should we recommend to the user? • The item(s) with the best expected utility • Utility examples: CTR, revenue, job apply rates, ads conversion rates, … • Can be a combination of the above for trade-offs 48
  • 49. An Example Architecture of Personalized Recommender Systems 49
• 50. An example of Recommender System Architecture (diagram): an offline system runs the offline modeling workflow over user interaction logs and publishes user/item derived features and models; online, the system fetches user features from the user feature store and items with their features from the item store, applies recommendation ranking using the ranking model store, and runs additional re-ranking steps before returning results 50
• 51. An example of Personalized Search System Architecture (diagram): as in the recommender architecture, the offline modeling workflow over user interaction logs produces user/item derived features and ranking models; online, a user query goes through query construction (using the user feature store), search-based candidate selection and retrieval against a search index of items, recommendation ranking using the ranking model store, and additional re-ranking steps 51
  • 52. Key Components – Offline Modeling • Train the model offline (e.g. Hadoop) • Push model to online ranking model store • Pre-generate user / item derived features for online systems to consume • E.g. user / item embeddings from word2vec / DNNs based on the raw features 52
  • 53. Key Components – Candidate Selection • Personalized Search (With user query): • Form a query to the index based on user query annotation [Arya et al., 2016] • Example: Panda Express Sunnyvale +restaurant:panda express +location:sunnyvale • Recommender system (Optional): • Can help dramatically reduce the number of items to score in ranking steps [Cheng, et al., 2016, Borisyuk et al. 2016] • Form a query based on the user features • Goal: Fetch only the items with at least some match with user feature • Example: a user with title software engineer -> +title:software engineer for jobs recommendation 53
  • 54. Key Components - Ranking • Recommendation Ranking • The main ML model that ranks items retrieved by candidate selection based on the expected utility • Additional Re-ranking Steps • Often for user experience optimization related to business rules, e.g. • Diversification of the ranking results • Recency boost • Impression discounting • … 54
  • 55. Integration of Deep Learning Models into Personalized Recommender Systems at Scale 55
  • 56. Literature: Deep Learning for Recommendation Systems • RBM for Collaborative Filtering [Salakhutdinov et al., 2007] • Deep Belief Networks [Hinton et al., 2006] • Neural Autoregressive Distribution Estimator (NADE) [Zheng, 2016] • Neural Collaborative Filtering [He, et al., 2017] • Siamese networks for user item matching [Huang et al., 2013] • Deep Belief Networks with Pre-training [Hinton et al., 2006] • Collaborative Deep Learning [Wang et al., 2015] 56
• 57. (Recap of the personalized search system architecture diagram from slide 51) 57
• 58. Offline Modeling + User / Item Embeddings (diagram): the offline workflow maps user features to a user embedding vector and item features to an item embedding vector, computes Sim(U, I) between them, and publishes the embeddings to the user feature store and the item store / index with features 58
  • 59. Query Formulation & Candidate Selection • Issues of using raw text: Noisy or incorrect query tagging due to • Failure to capture semantic meaning • Ex. Query: Apple watch -> +food:apple +product:watch or +product:apple watch? • Multilingual text • Query: 熊猫快餐 -> +restaurant:panda express • Cross-domain understanding • People search vs job search 59
  • 60. Query Formulation & Candidate Selection • Represent Query as an embedding • Expand query to similar queries in a semantic space • KNN search in dense feature space with Inverted Index [Cheng, et al., 2016] Q = “Apple Watch” D = “iphone” D = “Orange Swatch” D = “ipad” 60
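A small NumPy sketch of the idea above: score a query embedding against an index of item embeddings and keep the K nearest neighbors by cosine similarity. This is brute-force search over random placeholder embeddings; the cited systems use inverted-index or approximate nearest-neighbor structures instead of a full scan.

```python
import numpy as np

rng = np.random.default_rng(0)
item_emb = rng.standard_normal((10000, 64))          # placeholder item embeddings
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)

def knn(query_emb, k=10):
    """Brute-force cosine k-nearest-neighbour search over the item index."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = item_emb @ q                            # cosine similarity
    top = np.argpartition(-scores, k)[:k]            # unsorted top-k candidates
    order = np.argsort(-scores[top])                 # sort just those k
    return top[order], scores[top[order]]

query_emb = rng.standard_normal(64)                  # e.g. embedding of "Apple Watch"
ids, sims = knn(query_emb, k=5)
print(ids, sims)
```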
  • 61. Recommendation Ranking Models • Wide and Deep Models to capture all possible signals [Cheng, et al., 2016] https://arxiv.org/pdf/1606.07792.pdf 61
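A minimal tf.keras sketch of the wide & deep idea from [Cheng et al., 2016]: a deep part over dense/embedding features for generalization and a wide part over sparse cross features for memorization, joined in one logistic output. Feature dimensions and layer sizes here are placeholders, not the production configuration.

```python
import tensorflow as tf

n_cross, n_dense = 10000, 128          # placeholder feature dimensions

wide_in = tf.keras.Input(shape=(n_cross,), name="sparse_cross_features")
deep_in = tf.keras.Input(shape=(n_dense,), name="dense_features")

# Deep part: learns generalization via dense representations.
h = tf.keras.layers.Dense(256, activation="relu")(deep_in)
h = tf.keras.layers.Dense(64, activation="relu")(h)

# Wide part: sparse cross features feed straight into the output unit,
# so the model can memorize exceptions.
joined = tf.keras.layers.Concatenate()([wide_in, h])
out = tf.keras.layers.Dense(1, activation="sigmoid")(joined)

model = tf.keras.Model(inputs=[wide_in, deep_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
model.summary()
```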
  • 62. Challenges & Open Problems for Deep Learning at Recommender Systems • Distributed training on very large data • Tensorflow on Spark (https://github.com/yahoo/TensorFlowOnSpark) • CNTK (https://github.com/Microsoft/CNTK) • MXNet (http://mxnet.io/) • Caffe (http://caffe.berkeleyvision.org/) • … • Latency Issues from Online Scoring • Pre-generation of user / item embeddings • Multi-layer scoring (simple models => complex) • Batch vs online training 62
  • 63. Part III – Case Study: Jobs You May Be Interested In (JYMBII) 63
  • 64. Outline • Introduction • Generating Embeddings via Word2vec • Generating Embeddings via Deep Networks • Tree Feature Transforms in Deep + Wide Framework 64
• 66. Introduction: Problem Formulation • Rank jobs by $P(\text{User } u \text{ applies to Job } j \mid u, j)$ • Model the response given: careers history, skills, education, connections (user side); job title, description, location, company (job side) 66
• 67. Introduction: JYMBII Modeling - Generalization • Model should learn general rules to predict which jobs to recommend to a member • Learn generalizations based on similarity in title, skill, location, etc. between profile and job posting 67
• 68. Introduction: JYMBII Modeling - Memorization • Model should memorize exceptions to the rules • Learn exceptions based on frequent co-occurrence of features 68
  • 69. Introduction: Baseline Features • Dense BoW Similarity Features for Generalization • i.e: Similarity in title text good predictor of response • Sparse Two-Depth Cross Features for Memorization • i.e: Memorize that computer science students will transition to entry engineering roles Vector BoW Similarity Feature Sim(User Title BoW, Job Title BoW) Sparse Cross Feature AND(user = Comp Sci. Student, job = Software Engineer) Sparse Cross Feature AND(user = In Silicon Valley, job = In Austin, TX) Sparse Cross Feature AND(user = ML Engineer, job = UX Designer) 69
  • 70. Introduction: Issues • BoW Features don’t capture semantic similarity between user/job • Cosine Similarity between Application Developer and Software Engineer is 0 • Generating three-depth, four-depth cross features won’t scale • i.e. Memorizing that Factory Workers from Detroit are applying to Fracking jobs in Pennsylvania • Hand-engineered features time consuming and will have low coverage • Permutations of three-depth, four-depth cross features grows exponentially 70
  • 71. Introduction: Deep + Wide for JYMBII • BoW Features don’t capture semantic similarity between user/job • Generate embeddings to capture Generalization through semantic similarity • Deep + Wide model for JYMBII [Cheng et al., 2016] Semantic Similarity Feature Sim(User Embedding, Job Embedding) Global Model Cross Feature AND(user = Comp Sci. Student, job = Software Engineer) User Model Cross Feature AND(user = User 2, job = Job Latent Feature 1 ) Job Model Cross Feature AND(user = User Latent Feature, job = Job 1) 71 Sparse Cross Feature AND(user = Comp Sci. Student, job = Software Engineer) Sparse Cross Feature AND(user = In Silicon Valley, job = In Austin, TX) Sparse Cross Feature AND(user = ML Engineer, job = UX Designer) Vector BoW Similarity Feature Sim(User Title BoW, Job Title BoW)
  • 72. Generating Embeddings via Word2vec: Training Word Vectors • Key Ideas • Same users (context) apply to similar jobs (target) • Similar users (target) will apply to the same jobs (context) Application Developer => Software Engineer • Train word vectors via word2vec skip-gram architecture • Concatenate user’s current title and the applied job’s title as input User Title Applied Job Title 72
• 73. Generating Embeddings via Word2vec: Model Structure • Tokenized Titles: [Application, Developer] (user) and [Software, Engineer] (job) • Word Embedding Lookup over pre-trained word vectors • Entity Embeddings via average pooling of the word vectors • Cosine Similarity between user and job embeddings • Response Prediction (Logistic Regression) 73
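A small sketch of that structure: average-pool pre-trained word vectors into entity embeddings and compute a cosine-similarity feature for the response model. The word-vector table here is a random stand-in for the actual pre-trained vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100
# Stand-in for pre-trained word2vec vectors keyed by token.
word_vectors = {w: rng.standard_normal(dim)
                for w in ["application", "developer", "software", "engineer"]}

def title_embedding(tokens):
    """Entity embedding = average pooling over the title's word vectors."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

user_emb = title_embedding(["application", "developer"])
job_emb = title_embedding(["software", "engineer"])

# This similarity is then fed as a feature into the logistic-regression ranker.
print(cosine(user_emb, job_emb))
```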
  • 74. Generating Embeddings via Word2vec: Results and Next Steps • Receiver Operating Characteristic – Area Under Curve for evaluation • Response prediction is binary classification: Apply or don’t Apply • Highly skewed data: Low CTR for Apply Action • Good metric for ranking quality: Focus on discriminatory ability of model • Marginal 0.87% ROC AUC Gain • How to improve quality of embeddings? • Optimize embeddings for prediction task with supervised training • Leverage richer context about user and job 74
  • 75. Generating Embeddings via Deep Networks: Model Structure User Job Response Prediction (Logistic Regression) Sparse Features (Title, Skill, Company) Embedding Layer Hidden Layer Entity Embedding Hadamard Product (Elementwise Product) 75
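A hedged tf.keras sketch of the two-tower structure on this slide: sparse title/skill/company features go through an embedding layer and a hidden layer per side, the two entity embeddings are combined with an elementwise (Hadamard) product, and a logistic layer predicts the response. Dimensions are placeholders, and this is an illustration rather than the actual JYMBII network.

```python
import tensorflow as tf

n_sparse = 60000          # placeholder: |titles| + |skills| + |companies|
emb_dim, hidden = 100, 200

def tower(name):
    """One entity tower: multi-hot sparse features -> embedding -> hidden layer."""
    x = tf.keras.Input(shape=(n_sparse,), name=f"{name}_sparse")
    e = tf.keras.layers.Dense(emb_dim, use_bias=False, name=f"{name}_embedding")(x)
    h = tf.keras.layers.Dense(hidden, activation="relu", name=f"{name}_hidden")(e)
    return x, h

user_in, user_emb = tower("user")
job_in, job_emb = tower("job")

# Hadamard (elementwise) product of the two entity embeddings.
interaction = tf.keras.layers.Multiply()([user_emb, job_emb])
prob_apply = tf.keras.layers.Dense(1, activation="sigmoid")(interaction)

model = tf.keras.Model(inputs=[user_in, job_in], outputs=prob_apply)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="binary_crossentropy")
```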
• 76. Generating Embeddings via Deep Networks: Hyper Parameters, Lots of Knobs! (chosen values in parentheses) • Optimizer Used • SGD w/ momentum and exponential decay vs. Adam [Kingma et al., 2015] (Adam) • Learning Rate • 10^-5 to 10^-3 (10^-4) • Embedding Layer Size • 50 to 200 (100) • Dropout • 0% to 50% dropout (0% dropout) • Sharing Parameter Space for both user/job embeddings • Assumes commutative property of recommendations (a + b = b + a) (no shared parameter space) • Hidden Layer Sizes • 0 to 2 Hidden Layers (200 -> 200 hidden layer size) • Activation Function • ReLU vs. Tanh (ReLU) 76
  • 77. Generating Embeddings via Deep Networks: Training Challenges • Millions of rows of training data impossible to store all in memory • Stream data incrementally directly from files into a fixed size example pool • Add shuffling by randomly sampling from example pool for training batches • Extreme dimensionality of company sparse feature • Reduce dimensionality of company feature from millions -> tens of thousands • Perform feature selection by frequency in training set • Hyper parameter tuning • Distribute grid search through parallel modeling in single driver Spark jobs 77
• 78. Generating Embeddings via Deep Networks: Results
Model | ROC AUC
Baseline Model | 0.753
Deep + Wide Model | 0.790 (+4.91%***)
*** For reference, a previous major JYMBII modeling improvement with a 20% lift in ROC AUC resulted in a 30% lift in Job Applications
78
  • 79. Response Prediction (Logistic Regression) The Current Deep + Wide Model Deep Embedding Features (Feed Forward NN) • Generating three-depth, four-depth cross features won’t scale • Smart feature selection required Wide Sparse Cross Features (Two-Depth) 79
• 80. Tree Feature Transforms: Feature Selection via Gradient Boosted Decision Trees • Each tree outputs a path from root to leaf encoding a combination of feature crosses [He et al., 2014] • GBDTs select the most useful combinations of feature crosses for memorization • (Example tree splits from the diagram: Member Seniority = Vice President, Member Industry = Banking, Member Location = Silicon Valley, Member Skill = Statistics, Job Seniority = CXO, Job Title = ML Engineer) 80
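An illustrative scikit-learn version of the leaf-encoding idea from [He et al., 2014]: each sample is encoded by the leaf it falls into in every boosted tree, and those one-hot leaf indicators become sparse cross features for a downstream logistic regression. The data here is synthetic and this is not the actual JYMBII pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 20))                 # synthetic dense features
y = (X[:, 0] * X[:, 1] > 0).astype(int)             # synthetic response

# 1. Fit the GBDT; each root-to-leaf path encodes a combination of feature crosses.
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3)
gbdt.fit(X, y)

# 2. Leaf indices per tree become categorical "cross" features.
leaves = gbdt.apply(X).reshape(len(X), -1)          # (n_samples, n_trees)
encoder = OneHotEncoder(handle_unknown="ignore")
wide_features = encoder.fit_transform(leaves)       # sparse one-hot leaf features

# 3. The (wide) logistic-regression layer is trained on the generated crosses.
lr = LogisticRegression(max_iter=1000)
lr.fit(wide_features, y)
print(lr.score(wide_features, y))
```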
  • 81. Response Prediction (Logistic Regression) Tree Feature Transforms: The Full Picture How to train both the NN model and GBDT model jointly with each other? Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT) 81
  • 82. Tree Feature Transforms: Joint Training via Block-wise Cyclic Coordinate Descent • Treat NN model and GBDT model as separate block-wise coordinates • Implemented by 1. Training the NN until convergence 2. Training GBDT w/ fixed NN embeddings 3. Training the regression layer weights w/ generated cross features from GBDT 4. Training the NN until convergence w/ fixed cross features 5. Cycle step 2-4 until global convergence criteria 82
• 83. Response Prediction (Logistic Regression) Tree Feature Transforms: Train NN Until Convergence Initially no trees are in our forest Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT) 83
• 84. Response Prediction (Logistic Regression) Tree Feature Transforms: Train GBDT w/ NN Section as Initial Margin Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT) 84
• 85. Response Prediction (Logistic Regression) Tree Feature Transforms: Train GBDT w/ NN Section as Initial Margin Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT) 85
• 86. Response Prediction (Logistic Regression) Tree Feature Transforms: Train Regression Layer Weights Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT) 86
• 87. Response Prediction (Logistic Regression) Tree Feature Transforms: Train NN w/ GBDT Section as Initial Margin Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT) 87
• 88. Tree Feature Transforms: Block-wise Coordinate Descent Results
Model | ROC AUC
Baseline Model | 0.753
Deep + Wide Model | 0.790 (+4.91%)
Deep + Wide Model w/ GBDT Iteration 1 | 0.792 (+5.18%)
Deep + Wide Model w/ GBDT Iteration 2 | 0.794 (+5.44%)
Deep + Wide Model w/ GBDT Iteration 3 | 0.795 (+5.57%)
Deep + Wide Model w/ GBDT Iteration 4 | 0.796 (+5.71%)
88
  • 89. JYMBII Deep + Wide: Future Direction • Generating Embeddings w/ LSTM Networks • Leverage sequential career history data • Promising results in NEMO: Next Career Move Prediction with Contextual Embedding [Li et al., 2017] • Semi-Supervised Training • Leverage pre-trained title, skill, and company embeddings on profile data • Replace Hadamard Product for entity embedding similarity function • Deep Crossing [Shan et al., 2016] • Add even richer context • i.e. Location, Education, and Network features 89
  • 90. Part IV – Case Study: Deep Learning Networks for Job Search 90
  • 91. Outline • Introduction • Representations via Word2vec • Robust Representations via DSSM 91
• 93. Introduction: Search Architecture (diagram): a user query goes through query understanding, top-K retrieval against the index (built by the indexer), and result ranking to produce results; offline training produces the ranking model 93
  • 94. Introduction: Query Understanding - Segmentation and Tagging • First divide the search query into segments • Tag query segments based on recognized entity tags Oracle Java Application Developer Oracle Java Application Developer Query Segmentations COMPANY = Oracle SKILL = Java TITLE = Application Developer COMPANY = Oracle TITLE = Java Application Developer Query Tagging 94
  • 95. Introduction: Query Understanding – Expansion • Task of adding additional synonyms/related entities to the query to improve recall • Current Approach: Curated dictionary for common synonyms and related entities COMPANY = Oracle OR NetSuite OR Taleo OR Sun Microsystems OR … SKILL = Java OR Java EE OR J2EE OR JVM OR JRE OR JDK … TITLE = Application Developer OR Software Engineer OR Software Developer OR Programmer … Green – Synonyms Blue – Related Entities 95
  • 96. Introduction: Query Understanding - Retrieval and Ranking COMPANY = Oracle OR NetSuite OR Taleo OR Sun Microsystems OR … SKILL = Java OR Java EE OR J2EE OR JVM OR JRE OR JDK … TITLE = Application Developer OR Software Engineer OR Software Developer OR Programmer … Title Title Skills Company 96
• 97. Introduction: Issues – Retrieval and Ranking • Term retrieval has limitations • Cross language retrieval • Softwareentwickler -> Software developer • Word Inflections • Engineering Management -> Engineering Manager • Query expansion via curated dictionary of synonyms is not scalable • Expensive to refresh and store synonyms for all possible entities • Heavy reliance on query tagging is not robust enough • Novel title, skill, and company entities will not be tagged correctly • Errors upstream propagate to poor retrieval and ranking 97
• 98. Introduction: Solution – Deep Learning for Query and Document Representations • Query and document representations • Map queries and document text to vectors in semantic space • Robust handling of out-of-vocabulary words • Term retrieval has limitations • Query expansion via curated dictionary of synonyms is not scalable • Map synonyms, translations and inflections to similar vectors in semantic space • Term retrieval on cluster id or KNN based retrieval • Heavy reliance on query tagging is not robust enough • Complement structured query representations with semantic representations 98
  • 99. Representations via Word2vec: Leverage JYMBII Work • Key Ideas • Similar users (context) apply to the same job (target) • The same user (target) will apply to similar jobs (context) Application Developer => Software Engineer • Train word vectors via word2vec skip-gram architecture • Concatenate user’s current title and the applied job’s title as input User Title Applied Job Title 99
• 100. Representations via Word2vec: Word2vec in Ranking • Tokenized Text: [Application, Developer] (query) and [Software, Engineer] (job) • Word Embedding Lookup over pre-trained word vectors • Entity Embeddings via average pooling of the word vectors • Cosine Similarity between query and job embeddings • Learning to Rank Model (NDCG Loss) 100
• 101. Representations via Word2vec: Ranking Model Results
Model | Normalized Discounted Cumulative Gain@5 (NDCG@5) | CTR@5 (%)
Baseline Model | 0.582 | +0.0%
Baseline Model + Word2Vec Feature | 0.595 (+2.2%) | +1.6%
101
  • 102. Representations via Word2vec: Optimize Embeddings for Job Search Use Case • Leverage apply and click feedback to guide learning of embeddings • Fine tune embeddings for task using supervised feedback • Handle out of vocabulary words and scale to query vocabulary size • Compared to JYMBII, query vocabulary is much larger and less well-formed • Misspellings • Word Inflections • Free text search • Need to make representations more robust for these free text queries 102
• 103. Robust Representations via DSSM: Deep Structured Semantic Model [Huang et al., 2013] (diagram) • Raw Text: Query “Application Developer”, Applied Job (Positive) “Software Engineer”, Randomly Sampled Applied Job (Negative) “Hairdresser” • Tri-letter Hashing of each text (#Ap, App, ppl…; #So, Sof, oft…; #Ha, Hai, air…) • Hidden Layers 1–3 per tower • Cosine Similarity between query and job representations • Softmax w/ Cross Entropy Loss over the positive and sampled negative jobs 103
  • 104. Robust Representations via DSSM: Tri-letter Hashing • Tri-letter Hashing Example • Engineer -> #en, eng, ngi, gin, ine, nee, eer, er# • Benefits of Tri-letter Hashing • More compact Bag of Tri-letters vs. Bag of Words representation • 700K Word Vocabulary -> 75K Tri-letters • Can generalize for out of vocabulary words • Tri-letter hashing robust to minor misspellings and inflections of words • Engneer -> #en, eng, ngn, gne, nee, eer, er# 104
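A minimal sketch of the letter-trigram ("tri-letter") hashing described above; the function name is just for illustration.

```python
def tri_letter_hash(word):
    """Bag of letter trigrams for a single token, with '#' boundary markers.
    e.g. 'engineer' -> ['#en', 'eng', 'ngi', 'gin', 'ine', 'nee', 'eer', 'er#']"""
    padded = f"#{word.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(tri_letter_hash("engineer"))
# The misspelling still shares most trigrams with the correct word,
# which is what makes the representation robust to minor typos.
print(tri_letter_hash("engneer"))
```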
  • 105. Robust Representations via DSSM: Training Details 105 • Parameter Sharing Helps • Better and faster convergence • Model size is reduced • Regularization • L2 performs better than dropout • Toolkit Comparisons (CNTK vs TensorFlow) • CNTK: Faster convergence and better model quality • TensorFlow: Easy to implement and better community support. Comparative model quality Training performance with/o parameter sharing
• 106. Robust Representations via DSSM: Lessons in Production Environment • Bottlenecks in Production Environment • Latency due to extra computation • Latency due to GC activity • Fat Jars in JVM environment • Practical Lessons • Avoid the JVM heap while serving the model • Cache the most accessed entities’ embeddings 106
  • 107. Robust Representations via DSSM: DSSM Qualitative Results Software Engineer Data Mining LinkedIn Softwareentwickler Engineer Software Data Miner Google Software Software Engineers Machine Learning Engineer Software Engineers Software Engineer Software Engineering Microsoft Research Software Engineer Engineer Software For qualitative results, only top head queries are taken to analyze similarity to each other 107
• 108. Robust Representations via DSSM: DSSM Metric Results
Model | Normalized Discounted Cumulative Gain@5 (NDCG@5) | CTR@5 Lift (%)
Baseline Model | 0.582 | +0.0%
Baseline Model + Word2Vec Feature | 0.595 (+2.2%) | +1.6%
Baseline Model + DSSM Feature | 0.602 (+3.4%) | +3.2%
108
  • 109. Robust Representations via DSSM: DSSM Future Direction • Leverage Current Query Understanding Into DSSM Model • Query tag entity information for richer context embeddings • Query segmentation structure can be considered into the network design • Deep Crossing for Similarity Layer [Shan et al., 2016] • Convolutional DSSM [Shen et al., 2014] 109
  • 110. Conclusion • Recommender Systems and personalized search are very similar problems • Deep Learning is here to stay and can have significant impact on both • Understanding and constructing queries • Ranking • Deep learning and more traditional techniques are *not* mutually exclusive (hint: Deep + Wide) 110
  • 111. Appendix – Backup slides 111
  • 112. Back up – Part I 112
  • 113. Difference between parameter sharing in 1-D convolution and RNN? • CNN Kernel: output unit depends on small number of neighboring input units through same kernel • RNN update rule: output unit depends on previous output units through same update rule. Deeper computational graph.
  • 114. Back up – Part III 114
• 115. Introduction: JYMBII Modeling - Personalization • Recommend jobs that are similar to jobs the user has previously applied to 115
• 116. Introduction: JYMBII Modeling - Collaboration • Recommend jobs that similar users have previously applied to 116
  • 117. Introduction: Generalized Linear Mixed Models • Mixture of linear models into an additive model [Zhang et al., 2016] • Fixed Effect – Population Average Model • Random Effects – Subject Specific Models Response Prediction (Logistic Regression) User 1 Random Effect Model User 2 Random Effect Model Personalization Job 2 Random Effect Model Job 1 Random Effect Model Collaboration Global Fixed Effect Model Content-Based Similarity 117
  • 118. Introduction: Features • Dense Vector BoW Similarity Features in global model for Generalization • i.e: Similarity in title text good predictor of response • Sparse Cross Features in global,user, and job model for Memorization • i.e: Memorize that computer science students will transition to entry engineering roles Vector BoW Similarity Feature Sim(User Title BoW, Job Title BoW) Global Model Cross Feature AND(user = Comp Sci. Student, job = Software Engineer) User Model Cross Feature AND(user = User 2, job = Software Engineer) Job Model Cross Feature AND(user = Comp Sci. Student, job = Job 1) 118
• 119. Introduction: GLMM Formulation • Generalized Linear Mixed Model with a fixed effect (user-job affinity), a per-user random effect, and a per-job random effect:
$P(\text{Apply} \mid m, j) = \sigma\big( B_{fixed}^{T} [X_{\cos(m,j)}, X_m, X_j, X_{m,j}] + B_m^{T} [X_{\cos(m,j)}, X_j] + B_j^{T} [X_{\cos(m,j)}, X_m] \big)$
Notation | Meaning
Xcos(m,j) | Dense cosine similarity features for sample pair (m, j), e.g. cosine similarity between title BoW
Xm | Sparse features of user m, e.g. user is a software engineer
Xj | Sparse features of job j, e.g. job is from company LinkedIn
Xm,j | Sparse cross-product feature transformation of Xm and Xj, e.g. user is a software engineer AND job is from company LinkedIn
Bfixed | Fixed effect model coefficients (user-job affinity)
Bm | Per-user model coefficients for user m
Bj | Per-job model coefficients for job j
119
  • 120. Introduction: Issues • BoW Features don’t capture semantic similarity between user/job • Cosine Similarity between Application Developer and Software Engineer is 0 • Sparse features in user and job specific models make it difficult to fit • Current linear model is unable to share learning across similar titles • Large number of sparse features does not scale to infrastructure • Billions of User and Job Model Cross features 120
  • 121. Introduction: Proposed Solution • Learn dense semantic embeddings of user and job entities • Integrate embeddings into GLMM model as a set of latent features Semantic Similarity Feature Sim(User Embedding, Job Embedding) Global Model Cross Feature AND(user = Comp Sci. Student, job = Software Engineer) User Model Cross Feature AND(user = User 2, job = Job Latent Feature 1 ) Job Model Cross Feature AND(user = User Latent Feature, job = Job 1) 121
  • 122. Generating Embeddings Via Deep Networks: Data and Features • 4 weeks (Training) and 1 week (Test) of Click Log Data • 15M Applies • 2.5M Dismisses • 30M Skips • 15M Random Negatives • 5M Users • 1M Jobs • Input Features • Sparse Standardized Title - |Title Taxonomy| = 25K • Sparse Standardized Skills - |Skill Taxonomy| = 35K • Sparse Standardized Company - |Companies| = 1M+ 122
• 123. Generating Embeddings via Deep Networks: GLMM Integration Results • Improved model performance w/ 10x reduction in number of model parameters
Model | ROC AUC
Baseline GLMM Model w/ Sparse Features | 0.800
GLMM Model w/ Dense Embedding Features | 0.811 (+1.38%)
123
• 124. Generating Embeddings Via Deep Networks: Proposed Infrastructure Design 124
  • 125. Robust Representations via DSSM: DSSM Training Details • Training Data: 6 months of apply data from job search • Training Tuples: (query, applied job) • Network Size: (3 Layers:- 300, 300, 300) • Parameter Sharing: Shared Parameters among query and job • Activation Function: Tanh 125
• 126. References • [Rumelhart et al., 1986] Learning representations by back-propagating errors, Nature 1986 • [Hochreiter et al., 1997] Long short-term memory, Neural computation 1997 • [LeCun et al., 1998] Gradient-based learning applied to document recognition, Proceedings of the IEEE 1998 • [Krizhevsky et al., 2012] Imagenet classification with deep convolutional neural networks, NIPS 2012 • [Graves et al., 2013] Speech recognition with deep recurrent neural networks, ICASSP 2013 • [Mikolov, 2012] Statistical language models based on neural networks, PhD Thesis, Brno University of Technology, 2012 • [Kalchbrenner et al., 2013] Recurrent continuous translation models, EMNLP 2013 • [Srivastava, 2013] Improving neural networks with dropout, PhD Thesis, University of Toronto, 2013 • [Sutskever et al., 2014] Sequence to sequence learning with neural networks, NIPS 2014 • [Vinyals et al., 2014] Show and tell: a neural image caption generator, Arxiv 2014 • [Zaremba et al., 2015] Recurrent Neural Network Regularization, ICLR 2015 126
• 127. References (continued) • [Arya et al., 2016] Personalized Federated Search at LinkedIn, CIKM 2016 • [Cheng et al., 2016] Wide & Deep Learning for Recommender Systems, DLRS 2016 • [He et al., 2014] Practical Lessons from Predicting Clicks on Ads at Facebook, ADKDD 2014 • [Kingma et al., 2015] Adam: A Method for Stochastic Optimization, ICLR 2015 • [Huang et al., 2013] Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, CIKM 2013 • [Li et al., 2017] NEMO: Next Career Move Prediction with Contextual Embedding, WWW 2017 • [Shan et al., 2016] Deep Crossing: Web-scale modeling without manually crafted combinatorial features, KDD 2016 • [Zhang et al., 2016] GLMix: Generalized Linear Mixed Models For Large-Scale Response Prediction, KDD 2016 • [Salakhutdinov et al., 2007] Restricted Boltzmann Machines for Collaborative Filtering, ICML 2007 • [Zheng, 2016] http://tech.hulu.com/blog/2016/08/01/cfnade.html • [Hinton et al., 2006] A fast learning algorithm for deep belief nets, Neural Computations 2006 • [Wang et al., 2015] Collaborative Deep Learning for Recommender Systems, KDD 2015 • [He et al., 2017] Neural Collaborative Filtering, WWW 2017 • [Borisyuk et al., 2016] CaSMoS: A Framework for Learning Candidate Selection Models over Structured Queries and Documents, KDD 2016 127
  • 128. References (continued) • [netflix recsys] http://nordic.businessinsider.com/netflix-recommendation-engine-worth-1-billion-per- year-2016-6/ • [San Jose Mercury News] http://www.mercurynews.com/2017/01/06/at-linkedin-artificial-intelligence-is- like-oxygen/ • [Instagram blog] http://blog.instagram.com/post/145322772067/160602-news 128

Editor's Notes

1. Welcome to our tutorial on Deep Learning for Personalized Search and Recommender Systems. I am Ganesh and I work for Airbnb. I will be presenting this with my former colleagues from LinkedIn – Nadia, Ben and Liang. Saurabh could not make it. Between the 5 of us we have worked on various aspects of search, recommendations, machine learning and deep learning. We will be releasing these slides online. We want this to be an interactive tutorial, so please feel free to interrupt any time. But, just to get a sense of the audience, a show of hands: how many of you are from academia? How many from industry? How many have worked on some sort of production deep learning system before?
2. It is a fairly long tutorial and we want to make sure we have your attention. As we have noticed, we have a fairly diverse audience and we want to ensure that we cater to all your interests, which may be a challenge. We start off with some foundational topics in deep learning. One may consider these like understanding the lego pieces. We then move on to explaining deep learning for search and recommendations at scale, followed by case studies at LinkedIn TODO – add logistics here
3. At RecSys 2015, Netflix mentioned that recommender systems add about half a billion dollars to the company. It is almost a given these days that you open up a site or an app and it presents an experience that is tailor made for you, knowing your behavior and your preferences whether or not you made them explicit
4. There’s always been this friendly turf war between search and recommendations. Take a look at this query. A search person would treat this as a classic IR problem; a recommendation person would treat this as a set of recommendations returned by the search engine for your query. Well, these two worlds aren’t that different, and we believe they actually got married via personalized search. Personalized search is where each result is tailored to your own preferences and not necessarily the same for everyone. There are of course various levels of personalization
5. That brings us to the next logical question: why deep learning? If you talk with a mathematician, they would point out that a lot of the foundational techniques have been around since the 80’s or even before. And they would be right. So, why now? Two major things need to happen for deep learning to really work: computing cost and availability of data – lots of it. TODO – add GPU
  6. In several domains, deep learning has already made a dent. These include self driving cars and machine translation and a few others just to name some
  7. deep learning is a particular case of representation learning. It learns additional layers of more abstract features.
  8. This slide should motivate why understanding text is critical Understanding text is critical to any recommender system that works with underlying text data (example – news recommendations, jobs recommendations etc) What are the components of “understanding”? -- Similar or dissimilar words -- concept of similarity can range from true synonyms to more ‘fuzzy’ types -- Entity the word represents (Named Entity Recognition) -- “Abraham Lincoln, the 16th President”, “My cousin drives a Lincoln”
  9. Explain why the function isn’t scalable Note on negative sampling shallow models to learn embeddings, such as word2vec, are often used as initialization in deeper network architectures.
  10. sigmoid used to convert real to (0,1), for example into a probability. Used in output layer with log loss (cross entropy) in optimization, rather than squared loss, to avoid vanishing gradient issue when it saturates discouraged use in hidden layer because of saturation tanh resembles identity close to 0, can be trained easily as long as activation remains small. otherwise, saturation makes gradient-based learning difficult. ReLU, also know as positive part piecewise linear, easy to optimize does not saturate on the positive side. Initialize ReLU with small positive bias to make it likely for the ReLU to be active for most inputs in the training set
  11. FF also known as Multi Layer Perceptron MLP goal of the network: approximate a function y = F(x) - non-linearities output prediction layer: linear unit for linear regression sigmoid for logistic regression softmax for multinoulli multi-class classification
  12. CNN automates filters and feature extraction from images that used to be highly tailored and manual
  13. Equivariance to translation: small translations in input do not affect output Large image NxN pixels: N^2 input units Hidden layer: K features Number of parameters: ~KN^2
14. 2 GPUs to train a large deep neural net (layers split between the 2 GPUs). 5 CNN layers with pooling + 3 fully connected layers, ReLU, DropOut against overfitting. Large dataset with 1000 image categories (image, label, classification). Top 4: correctly classified, bottom 4: misclassified. 63% accuracy; the correct label was in the top 5 predicted 85% of the time
  15. RNN update rule: output unit depends on previous output units through same update rule. Deeper computational graph. 1-D CNN Kernel: output unit depends on small number of neighboring input units through same kernel
  16. chain structure with repeating module (parameter sharing across time)
  17. cell state: conveyor belt that runs through the chain and stores long term memory. cell state has a linear self-loop that allows product of derivatives close to one, weights of the self-loop are control by gates. Problem with gradient descent for standard RNNs: error gradients vanish exponentially quickly with the size of the time lag between important events. gradients propagated over many stages tend to either vanish (often) or explode sigmoid gate lives in [0,1]: controls amount of information that it lets through from each component of cell state tanh: create a vector of candidate values to update the cell state or from the cell state to update the output
18. speech recognition: acoustic modeling for mapping from the acoustic signal to sequences of words (phonetic states); machine translation: LSTM trained on the concatenation of source sentences and their translation [Sutskever et al., 2014]
  19. maximum likelihood: negative log likelihood = cross entropy between data distribution p(x,y) and model distribution p(y|x) if model distribution is gaussian p(y|x) = N(y; a^L(x), I), then cross entropy boils down to squared loss.
  20. alpha: learning rate
  21. Dataset augmentation for image: translate, rotate, scale images to generate new samples. Dropout: - Dropout operator corrupts the information carried by the units, forcing them to perform their immediate computations more robustly for LSTM, apply dropout only the non-recurrent connections, to maintain memorization ability see [Zaremba et al., 2015] Gradient norm clipping maintains the direction of the gradient.
  22. During time interval t User visits to the webpage Some item inventory at our disposal Decision problem: for each visit, select an item to display each display generates a response Goal: choose items to maximize expected clicks
  23. - Introduce My Self
24. - Passive Job Seekers: Allow them to discover what is available in the marketplace. - Not a lot of data for passives; show them jobs that make the most sense to them given their current career history and experience - Active Job Seekers: Reinforce their job seeking experience. Show them similar jobs that they applied to that they may have missed. Make sure they don't miss opportunities - Powers a lot of modules including jobs home, feed, email, ads
  25. - Goal: Get people hired - Confirmed Hires - Time Lag on signal - Proxy in Total Job Applies - Metric: Total Job Applies - Optimize probability of Apply. Not View. Showing users popular/attractive jobs not as important as showing them actual good matches - User, Job, Activity
  26. - Leverage our rich context on a user and job - Generalize rules that similarity in title is a good thing
  27. - Memorize exceptions - I.E. for people in ML who have replaced our jobs w/ ML itself, we need to move to design
  28. - Example of features in each of the model. Generalization, on most of the time to share learning across examples in a linear model - Sparse Features for memorization - Need to choose good features to memorize - Random effect sparse features model personal affinity
29. - No semantic similarity, yet they do well for the most part - Difficult fitting because there is no shared learning - Don't scale to infrastructure: too many parameter updates via SGD, costly to run without much ROI. A lot of sparse features need to be computed online, which makes for expensive serving
  30. - Dimensionality reduction - Learn dense embeddings for job and user entities. Semantic embeddings
  31. - Key things are to defined context and target. You shall know a word by the company it keeps  - Leverage word2vec existing libraries which take sentences as input
32. - Random effects are not shown here - Cross features are not shown here - Word vectors trained separately
  33. - ROC is good - Accuracy won't be as good metric - Marginal Gain from ROC AUC over BoW - Leverage our large amount of labeled data
  34. - Two Towers - 3 Layers - One tower for embedding member - One for Job - One similarity layer - Non Linear Interactions. Important Skills for a Job
35. - Adam is an adaptive learning rate method with momentum - Momentum is good for smoothing out a noisy gradient - Decay is good so that we don't overshoot the optimum - Dropout wasn't good for us; the model wasn't complex enough given our amount of data - Sharing the parameter space doesn't make sense for us
  36. - Similar to Tensorflow batch queue implementation - Millions of parameters * 100 in embedding size - No distributed training so just train multiple models in parallel. Tensorflow supports distributed training but not on YARN
  37. - Do well just knowing the member's profile - Deep model does better due to all the non linear interactions introduced - Gets close to the Baseline GLMM model, despite there being no random effects in the deep model to handle personalization and collaboration - Can see how this translates
  38. - Visual representation - Model Driven way like for deep model? - Random effects not shown
39. - The output of each tree is a path used as a categorical feature - GBDT learns to select the most useful combinations of feature crosses for memorization - Example features: Vice Presidents in Banking are not a normal VP; their seniority level is lower than other VPs in comparison - Data Engineers who don't know programming but know statistics are still very good matches for data engineering jobs
  40. - Can't run SGD through graph - SGD to train network - Greedy split on learning decision rules
41. - Coordinate descent is single-coordinate optimization (iterative one-dimensional optimization fixing the other coordinates) - Train the deep model first because we want to learn generalizations before exceptions. Can't have exceptions if we don't have rules first
  42. - Deep Model Initialization. Train Deep Model first since need to learn the generalizations first before learning the exceptions
43. - The logit gets used as an initial prediction for the GBDT to start boosting from - The trees learn to generate features that help classify cases where the NN is wrong
  44. -
  45. - Train Regression Layer weights on the features generated by tree layer transformation
46. - GBDT generates better features than sparse ones. Iterative training improves slightly - AUC might not be the best way to visualize how tree features help, given that they won't move the needle as much since they handle exceptions
  47. - Profile Data. Word2vec type of networks w/ career history - Hadamard product is easy to deploy in production. Only need to do a vector operations rather than a full matrix multiply - Location good to learn interactions between industry and location. - Education good for students
  48. - Search Jobs active job searchers - Personalized Component to Job Search - We will focus on the query to job matching though
  49. - Extracting structured information from a user's raw query - Query Understanding most important part - Talk about indexer first, then talk about user query
  50. - Reference LinkedIn Economic graph and taxonomy - Add structure to query
  51. -- Exact terms matching for retrieval will miss very relevant documents containing similar terms - Expansion allows us to retrieve those very relevant documents - Curate dictionary through domain knowledge or knowledge base - Can also mine synonyms through query logs - Give enough example for it Google -> LinkedIn -> Apple
  52. - Leverage query understanding for matching phrases rather than terms. Phrase match with distance - Leverage query tags to match w/ appropriate fields for higher precision (Job Application -> Application Developer) - BoW features (Title, query title sim)
  53. - Might not be handled in synonyms - Curating is not scalable given all the possible inflections and languages around. Large dictionary will be expensive to store all the mappings - If query tagging fails, then our retrieval and ranking will not be as good - Pinterest as a Skill, Hedge Fund as a company -
54. - Good representations that map to a dense semantic space - Robust to OOV words - Robust to issues w/ current segmentation/query tagging - Map on both the job and query side for embeddings - Complement them. Query tagging gives us good information about the phrases and handles ambiguity for us, but the taxonomy might not be refreshed often enough to keep up w/ the professional world - Embeddings learned from query log data should reflect new entities automatically as we see more new data
  55. Same slide as before
  56. - Show differences
57. NDCG: Normalized Discounted Cumulative Gain - Describe NDCG - Relative ranking - Penalizes lower rankings for relevant results - Normalized across queries - CTR@5 goes up
58. Emphasize OOV words. The job search vocabulary is a lot different from the JYMBII vocab - skills, companies, titles; users can search for anything
  59. - Talk about tri letter hashing later. BoTrigrams vs. BoW - Extra Negatives. 10 Negatives - 3 layers - Softmax layer, learn positives and negatives at the same time
  60. - Triletter hashing achieves goal of robustness at OOV
  61. - Parameter Sharing for Full Siamese - Reuse more - Some Regularization used. Didn't really do too much though. Parameter sharing already adds some form of regularization - Tensorflow for the flexibility
  62. - Latency, model size, model delivery was issue - Store model off heap to reduce GC activity - Cache most access entities for speed
  63. - Emphasize head queries
  64. - Even better lift
  65. - Also try to use for candidate selection
66. Recommendations and Search have more in common than what people may realize. The online phase of any recommender or search system consists of two very broad stages, and deep learning has an impact on both