JW Player is the world’s largest network-independent video platform, representing 5 percent of global internet video. One of the core services it offers video publishers is turn-key recommendations that can drive higher engagement among their viewers. This talk focuses on the challenges of building and improving recommendation algorithms at JW Player’s scale.
4. About JW Player
● Open source video player + video platform
● 5% of all video plays on the web
● Per month:
○ 40Bn plays
○ 100 TB events
● 15K Customers
5. Data Has Become Core to JW Player’s Strategy
[Timeline, from PLAYER to PLATFORM]
● The fastest online video player (2008)
● Video Management and Delivery (2011)
● Dashboards, Audience Measurement (2014)
● Data-driven products (e.g. Recommendations) (2016)
8. MVP Focused on Product Reqs and Scalability
● 20K requests per second
● Support legacy endpoints
○ Non-recommendations playlists
● Business rule features (e.g. sunrise, sunset, geo block; a filtering sketch follows this list)
● Include video metadata in response (conversions, manifest, etc.)
● Pass product “sniff test”
● Rudimentary A/B testing using click-through rates
○ Beat random
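To make the business-rule bullet concrete, here is a minimal sketch of the kind of candidate filtering it implies, assuming per-video sunrise/sunset publishing dates and a geo-block list; the field and function names are illustrative, not JW Player’s schema.

```python
# Sketch: business-rule filtering of candidate recommendations.
# Field names (sunrise, sunset, geo_blocked) are illustrative assumptions.
from datetime import datetime, timezone

def passes_business_rules(video, viewer_country, now=None):
    """Keep a candidate only if it is inside its publishing window and not geo-blocked."""
    now = now or datetime.now(timezone.utc)
    if video.get("sunrise") and now < video["sunrise"]:
        return False  # not yet published (before sunrise date)
    if video.get("sunset") and now > video["sunset"]:
        return False  # expired (after sunset date)
    if viewer_country in video.get("geo_blocked", set()):
        return False  # blocked in the viewer's country
    return True

candidates = [
    {"id": "a", "geo_blocked": {"DE"}},
    {"id": "b", "sunrise": datetime(2030, 1, 1, tzinfo=timezone.utc)},
]
print([v["id"] for v in candidates if passes_business_rules(v, "US")])  # ['a']
```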
9. Data Types At Our Disposal for Recommending
● Association-based recommendations (& Trending videos)
● Content-based recommendations
○ e.g. Title: “Top ten Snowboarding Destinations in Colorado”; description, keywords
10. We Layered Classic Algorithms That Were Easy to Implement
● Association → Association Rule Mining
○ Viewers who watched X also watched Y
● Content → BM25 (think tf-idf)
○ Elasticsearch
● Trending
○ Exp. weighted moving avg of plays (a sketch follows this list)
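As a rough illustration of the trending layer, here is a minimal sketch of an exponentially weighted moving average over daily play counts; the smoothing factor and data layout are assumptions, not JW Player’s actual values.

```python
# Sketch: exponentially weighted moving average (EWMA) of plays per video.
# The smoothing factor alpha and the daily bucketing are illustrative assumptions.

def trending_score(daily_plays, alpha=0.4):
    """EWMA over daily play counts, oldest day first; recent days weigh more."""
    score = float(daily_plays[0])
    for plays in daily_plays[1:]:
        score = alpha * plays + (1 - alpha) * score
    return score

# A video spiking today outranks one whose spike is three days old.
print(trending_score([100, 100, 5000]))   # recent spike -> high score
print(trending_score([5000, 100, 100]))   # old spike    -> lower score
```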
11. Example Recommendations
For the video “Top ten Snowboarding Destinations in Colorado, 2018”:
● Rec 1: “Best hotels in Boulder”
● Rec 2: “Amazing 1080”
● Rec 3: “Best ski slopes in Colorado”
● Rec 4: “Snowboarding is fun!”
● Rec 5: “Top Snowboarding schools”
● Rec 6: “Kardashian Katastrophe!”
● Rec 7: “Cats on Skis”
(Layers: similar titles, highly co-watched, trending)
13. Results: We Met Goals :-)
✓ 20K requests per second
✓ Support legacy endpoints
✓ Business rule features (e.g. sunset, sunrise, geo block)
✓ Include video metadata in response (conversions, manifest, etc.)
○ Use log-based architecture to sync from various sources
✓ Pass product “sniff test”
✓ Rudimentary A/B testing
○ Beat random when looking at Overlay Click-Through Rate
○ Bested competitors in customer-led A/B tests
14. Beyond the MVP
How can we drive more value to customers?
How can we continue to grow competitive advantage?
16. Wait, What Exactly Are We Improving?
● Click-Through Rate
● Completion Rate
● Ad Impressions
● Viewer Time
17. Viewer Time, the Unit of Online Currency
● Americans spend 2+ hrs on social media
● Our publishers are fighting for time
● Recommendations can drive viewer time by either:
○ More time per session
○ More sessions (higher retention)
18. First, We Need Ability to Run Experiments
● Keep viewers in a consistent variant (see the bucketing sketch below) to measure:
○ Time/session
○ Viewer retention
● A/B results (JW model vs random):
○ 50% more time per session on recommended content
○ 10% higher viewer retention (D1, D7)
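A minimal sketch of one common way to keep a viewer in a consistent variant: deterministic bucketing by hashing a stable viewer ID together with an experiment name. The function name and experiment label are illustrative assumptions, not JW Player’s implementation.

```python
# Sketch: deterministic A/B bucketing so a viewer always sees the same variant.
# Names (assign_variant, experiment label) are illustrative assumptions.
import hashlib

def assign_variant(viewer_id, experiment, variants=("control", "treatment")):
    """Hash viewer_id + experiment into a stable bucket in [0, 1] and pick a variant."""
    digest = hashlib.sha256(f"{experiment}:{viewer_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return variants[0] if bucket < 0.5 else variants[1]

# The same viewer always lands in the same variant for a given experiment,
# so per-viewer metrics like time/session and D1/D7 retention stay consistent.
print(assign_variant("viewer-123", "jw-model-vs-random"))
```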
19. The Natural Itch to Test Stuff
We can now run experiments and understand their impact on viewer time.
Hypothesis: “If we boost recently produced content, recs will be more relevant”
Experiment: What happens to time spent?
20. Some of the Initial Tests That Were Tried
Recommendation algorithm (hypothesis) → time to get an experiment result:
● Swap in Word2Vec title similarity instead of tf-idf → 2 weeks
● Boost recent content → 3 weeks
● Try trending only → 1 week
● Try different ordering of layers → 2 weeks
21. Offline Testing = Faster Iteration
Fast iteration cycle: recommendation algorithm (hypothesis) → build signals and training data → build model → predict (model output) → evaluation, validation → improve features, model, data → run experiment
22. Choosing An Offline Performance Metric
● Time spent in a session aggregates behavior over a sequence of recommendations
○ Predicting that directly is hard
● Pick a closely related metric to measure the effectiveness of a single recommendation
○ Time watched, percent watched?
○ Probability of an “engaged watch”
23. Pairwise Empirical Engagement Rate (PEER Score)
For a pair (Video 1 → Video 2):
PEER Score = Wilson Score(% of Video 2 watches ≥ 30 seconds)
Metric for a list of recommended videos V:
nDCG(V), where PEER is the relevance metric
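A minimal sketch of how such a metric could be computed, assuming “Wilson Score” means the lower bound of the Wilson confidence interval for the engaged-watch proportion; the 30-second threshold comes from the slide, while the function names and example counts are illustrative.

```python
# Sketch: PEER-style relevance plus nDCG over a recommended list.
# Assumes "Wilson Score" = lower bound of the Wilson confidence interval
# for the proportion of engaged watches (>= 30 seconds); names are illustrative.
import math

def wilson_lower_bound(engaged, total, z=1.96):
    """Lower bound of the Wilson interval for the engaged/total proportion."""
    if total == 0:
        return 0.0
    p = engaged / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    spread = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - spread) / denom

def ndcg(relevances):
    """nDCG of a ranked list whose relevance scores are PEER values."""
    def dcg(scores):
        return sum(s / math.log2(i + 2) for i, s in enumerate(scores))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# PEER for each recommended video in ranked order, then nDCG of the list.
peer_scores = [wilson_lower_bound(80, 100), wilson_lower_bound(10, 100), wilson_lower_bound(40, 100)]
print(ndcg(peer_scores))
```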
26. A/B Testing Learnings: Publishers Matter
● Algorithm performance
○ Association vs Content
○ Optimal Training Window
● Publishers with viral events that affect results
○ Test results change with such events
● Publisher quirks
○ Player, Recommendations implementation
28. Deep Learning Makes Sense For Us
● Algorithmic Perspective
○ More Context
○ Personalization
○ Progress in deep learning for recs
● Implementation / Maintainability
○ Single Unified Model (for widely varying publishers)
○ Flexible inputs (Anything2Vec)
29. We’ve Taken Some Good Initial Steps
● Built and A/B tested a Tensorflow model that performs on par with our current algorithms
● Same context, unpersonalized
● AWS SageMaker used for training on GPUs, serving the model via Tensorflow Serving
● Trained using triplet loss to learn video embeddings (anchor, positive example, negative example; a sketch follows below)
[Diagram: triplet loss, from “FaceNet: A Unified Embedding for Face Recognition and Clustering” (2015)]
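A minimal sketch of a triplet loss over video embeddings in TensorFlow, in the spirit of the FaceNet-style training mentioned above; the encoder, embedding size, margin, and how anchor/positive/negative triplets are chosen are assumptions, not JW Player’s model.

```python
# Sketch: learning video embeddings with a triplet loss in TensorFlow.
# Embedding size, margin, and the toy encoder are illustrative assumptions.
import tensorflow as tf

EMBED_DIM = 64
MARGIN = 0.2

# A toy encoder mapping precomputed video features to a unit-norm embedding.
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(EMBED_DIM),
    tf.keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1)),
])

def triplet_loss(anchor, positive, negative, margin=MARGIN):
    """Pull the anchor toward the positive video, push it away from the negative."""
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=1)
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))

optimizer = tf.keras.optimizers.Adam(1e-3)

@tf.function
def train_step(anchor_feats, pos_feats, neg_feats):
    """One training step on a batch of (anchor, positive, negative) feature triplets."""
    with tf.GradientTape() as tape:
        loss = triplet_loss(encoder(anchor_feats), encoder(pos_feats), encoder(neg_feats))
    grads = tape.gradient(loss, encoder.trainable_variables)
    optimizer.apply_gradients(zip(grads, encoder.trainable_variables))
    return loss
```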
30. Next Challenges
● Modeling
○ Score individual videos vs. learn to rank
○ How to choose positive & negative training samples?
○ Relevance metric for hyperparameter tuning
● Architecture
○ API traffic
○ Viewer profile service
○ Tensorflow is free, but scaling it is not
31. Takeaways
● “Just build” can work great for MVP recommender
● Offline testing critical for algorithmic improvement
● Finding the right offline metric is key