The 100k Question:
Leveraging Large Scale
Datasets to Make The Most of
In-Game Decisions
Dylan Rogerson
Senior Data Scientist
Facing the Starkness of Reality…
OMG Vlad with our new environment we can build models off
of BILLIONS of records!!!
…ok, but why? Almost every model I build barely needs more
than 100k records.
But…but…I…BILLIONS!!! <long uncomfortable sullen silence>
Alright I’ll do some research.
Table of Contents
 Are we actually leveraging our big data?
 We need data to address rare behavior.
 Big data can build confidence / Beyond Accuracy
 Complex interactions & weird configurations.
 Case Study: Predicting Spawn Behavior / When you fundamentally
need big data.
Are we actually leveraging our big data?
 Here’s a churn model for one of our titles.
Are we actually leveraging our big data?
 Here’s a completely different model (predicting survey responses).
We Can Address Rare Behavior
 To capture and predict rare
events we need ‘enough’
training data.
 Examples: Fraud (Boosting)
Detection, Outlier Detection,
Spawn Outliers…
Example of boosting.
Focusing on Fraud Detection
 Many different types (regular boosting, reverse boosting,
challenge boosting, leaderboard boosting).
 Not just rare: Difficult to detect because their features are not
innately obvious.
 Ok I cheated! You need big data to make the dataset but not to
train the model.
Big Data Can Build Confidence
 Let’s go back to the churn model. Start off with lovingly crafted,
intuition-driven variables.
 Most predictive power from player lifetime variables. Boss: Can
we brute force lifetime data?
 Approach: Dataset with number of hours played per day for each
day in the observation window. Create ‘Complex Model’
[Chart: Player Hours Per Day (Example Purposes Only). X-axis: Day of Observation Period (1 to 30). Y-axis: Hours Played (0 to 3).]
Big Data Can Build Confidence
 Result: We can’t implement it. The ‘complex’ model is bulky and
requires a lot of data, but it is very accurate and sets a ‘high bar’.
 Inspecting variable importance in the complex model led to a new feature:
number of days played in the preceding week.
 New feature in simple model -> Nearly the same AUC as complex
model!
 We can be confident that our ‘simple’ models (built on small data)
are good enough by comparing to feature rich ‘complex’ models
(which require more data).
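The derived feature can be sketched as follows. This assumes each player's history is a list of hours played per day, oldest day first; the function name `days_played_in_last_week`, the seven-day default and the sample history are illustrative assumptions, not the code behind the talk.

```python
from typing import Sequence


def days_played_in_last_week(hours_per_day: Sequence[float], window: int = 7) -> int:
    """Count the days with any play time in the final `window` days.

    `hours_per_day` is assumed to hold one value per day of the
    observation period, oldest day first.
    """
    recent = hours_per_day[-window:]
    return sum(1 for hours in recent if hours > 0)


# Hypothetical 30-day history: active early, then tapering off.
history = [2.0] * 20 + [0.0, 1.5, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 1.0, 0.0]
print(days_played_in_last_week(history))
```

One column like this replaces thirty per-day columns, which is the kind of reduction that lets a simple model stay lean.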
Beyond Accuracy
 Execution time, complexity, interpretability may force simpler
models. In-game models have to be lean.
 Rare behavior: False positives
 We also need to understand how our models develop over
time.
 Digging into any of these requires more data.
Complex Interactions & Configurations
 What’s the best match composition? One with a low quit rate.
 Teams vs. Solo Players: The Eternal Struggle
 Many permutations. Solution: dummy encode
composition.
 Approach will take more data.
 You’ll need to pay attention to big parties since
they’re more engaged players. Big parties are rare
events (more data).
For Example Purposes Only
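A minimal sketch of dummy encoding a composition, assuming a team is described by the sizes of the parties in it (for example, two duos and two solo players). The helper names and the sample matches are hypothetical; the slides do not show the actual encoding.

```python
from typing import Dict, List, Sequence, Tuple

Composition = Tuple[int, ...]


def canonical(party_sizes: Sequence[int]) -> Composition:
    """Sort party sizes so orderings of the same mix collapse to one key."""
    return tuple(sorted(party_sizes, reverse=True))


def build_vocabulary(matches: Sequence[Sequence[int]]) -> Dict[Composition, int]:
    """Assign a column index to each distinct composition seen in the data."""
    vocab: Dict[Composition, int] = {}
    for sizes in matches:
        vocab.setdefault(canonical(sizes), len(vocab))
    return vocab


def dummy_encode(party_sizes: Sequence[int], vocab: Dict[Composition, int]) -> List[int]:
    """Return a 0/1 row with a single 1 in the column for this composition.

    A composition that was never seen when building `vocab` encodes as all zeros.
    """
    row = [0] * len(vocab)
    idx = vocab.get(canonical(party_sizes))
    if idx is not None:
        row[idx] = 1
    return row


# Hypothetical six-player teams, described by party sizes.
matches = [[1, 1, 1, 1, 1, 1], [2, 1, 1, 2], [1, 2, 2, 1], [6]]
vocab = build_vocabulary(matches)
print(dummy_encode([1, 2, 1, 2], vocab))
```

Every distinct composition adds a column, so the feature count grows with the number of party mixes. That is one way to see why this approach takes more data, and why rare compositions such as big parties are hard to cover.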
Case Study: Predicting Spawn Behavior
 A fun side project.
 Design wants to be able to control the spawn experience with
even greater accuracy.
 Can we predict short spawns (< 3 seconds) or long spawns (> 30
seconds)?
 Data shown in this section is for an older title. May not be
representative of current games.
Case Study: Predicting Spawn Behavior
[Diagram: two example spawn layouts (Example 1, Example 2). Legend: Player, Ally, Enemy, Spawn.]
Case Study: Predicting Spawn Behavior
 First stab at the data: 30MM observations, of which fewer than 1MM
were targeted spawns.
 Initial data was all positional: Team and Enemy coordinates.
 Built 2 Models and compared predictive power.
 First chance to leverage our big data architecture (WOOO)!
Case Study: Predicting Spawn Behavior
 We created models in Spark using a Zeppelin notebook written
in either Scala or PySpark.
 Different models were tried: Logistic Regression, Random
Forest and GBM
 Language Hierarchy: Scala > PySpark > SparkR. Learn Scala
Case Study: Predicting Spawn Behavior
 Gridded Model: Made a grid with 200 x 200 units and counted
how many enemies or allies were in a grid square.
 Allows for complex positional interactions (cover, team spacing,
high ground).
 Requires a significant amount of data.
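A rough sketch of the gridded features, reading "200 x 200 units" as the size of each grid square (an assumption; the slide could be read other ways). The function name, the coordinate convention and the sample positions are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

CELL_SIZE = 200  # side length of one grid square, in map units (assumed reading)


def grid_counts(positions: Iterable[Tuple[float, float]],
                cell_size: float = CELL_SIZE) -> Counter:
    """Count how many positions fall in each (column, row) grid square."""
    counts: Counter = Counter()
    for x, y in positions:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts


# Hypothetical positions at the moment of a spawn.
enemies = [(50, 50), (150, 199), (450, 20)]
allies = [(210, 410)]

print(grid_counts(enemies))
print(grid_counts(allies))
```

Each grid square becomes two features (enemy count and ally count), so even a modest map produces many sparse columns. That fits the slide's point that this model requires a significant amount of data.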
Case Study: Predicting Spawn Behavior
 Distance Model: Bucket enemy and ally distances by 200 units
and count.
 Very simple approach: Just checking to see how far away
enemies and allies are.
 Requires much less data.
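The distance features might look like the sketch below. It assumes distances are measured from the candidate spawn point, which the slide does not state; the function name and sample coordinates are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

Point = Tuple[float, float]


def distance_bucket_counts(spawn: Point, others: Iterable[Point],
                           bucket_size: float = 200.0) -> Counter:
    """Count players per distance bucket, each bucket being `bucket_size` units wide.

    Bucket 0 covers distances in [0, 200), bucket 1 covers [200, 400), and so on.
    """
    counts: Counter = Counter()
    for x, y in others:
        distance = math.hypot(x - spawn[0], y - spawn[1])
        counts[int(distance // bucket_size)] += 1
    return counts


# Hypothetical spawn point and enemy positions.
spawn = (0.0, 0.0)
enemies = [(30.0, 40.0), (300.0, 400.0), (0.0, 250.0)]
print(distance_bucket_counts(spawn, enemies))
```

Collapsing two coordinates into one distance discards direction, so there are far fewer columns to fill than in the grid. That is consistent with this model needing much less data.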
Case Study: Predicting Spawn Behavior
 End result: Distance Model > Gridded Model
 Why? Distance Model hit peak accuracy early. Gridded had
more to go.
 Verdict: For Gridded (big data) approach we don’t have enough
data.
 This was only for 1 map. ~500MM data points for all maps.
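One simple way to check whether a model has "hit peak accuracy" is to score it at increasing training-set sizes and test whether the latest gain is negligible. The sketch below uses made-up scores purely for illustration; they are not results from the talk, and the `tolerance` threshold is an arbitrary assumption.

```python
from typing import Sequence


def has_plateaued(scores: Sequence[float], tolerance: float = 0.005) -> bool:
    """Return True when the most recent step gained less than `tolerance`.

    `scores` is assumed to hold validation scores ordered by increasing
    training-set size.
    """
    if len(scores) < 2:
        return False
    return (scores[-1] - scores[-2]) < tolerance


# Made-up validation scores, for illustration only.
distance_curve = [0.70, 0.74, 0.75, 0.751]
gridded_curve = [0.55, 0.62, 0.68, 0.72]

print(has_plateaued(distance_curve))
print(has_plateaued(gridded_curve))
```

A curve that is still climbing at the largest available sample suggests the model is data-limited, which matches the verdict above for the gridded approach.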
Final Thoughts
 We need big data to build confidence in simple models:
Complex models to compare to our simple assumptions.
 Rare events require a lot of data to understand (and model).
 Complex Interactions: Combinatorial, Acoustic and Spatial
problems require a lot of data.
 Even so, thoughtful exploration of the data / feature creation
can make up for small datasets (you just won’t know it).
 Investigate how much information your model actually needs.

Leveraging Large Scale Datasets to Make the Most of In-game Decisions by Dylan Rogerson

  • 1. The 100k Question: Leveraging Large Scale Datasets to Make The Most of In-Game Decisions Dylan Rogerson Senior Data Scientist
  • 2. Facing the Starkness of Reality… OMG Vlad with our new environment we can build models off of BILLIONS of records!!! …ok, but why? Almost every model I build barely needs more than 100k records. But…but…I…BILLIONS!!! <long uncomfortable sullen silence> Alright I’ll do some research.
  • 3. Table of Contents  Are we actually leveraging our big data?  We need data to address rare behavior.  Big data can build confidence / Beyond Accuracy  Complex interactions & weird configurations.  Case Study: Predicting Spawn Behavior/ When you fundamentally need big data.
  • 4. Are we actually leveraging our big data?  Here’s a churn model for one of our titles.
  • 5. Are we actually leveraging our big data?  Here’s a completely different model (predicting survey responses).
  • 6. We Can Address Rare Behavior  To capture and predict rare events we need ‘enough’ training data.  Examples: Fraud (Boosting) Detection, Outlier Detection, Spawns Outliers… Example of boosting.
  • 7. Focusing on Fraud Detection  Many different types (regular boosting, reverse boosting, challenge boosting, leaderboard boosting).  Not just rare: Difficult to detect because their features are not innately obvious.  Ok I cheated! You need big data to make the dataset but not to train the model.
  • 8. Big Data Can Build Confidence  Let’s go back to the churn model. Start off with loving crafted intuition driven variables.  Most predictive power from player lifetime variables. Boss: Can we brute force lifetime data?  Approach: Dataset with number of hours played per day for each day in the observation window. Create ‘Complex Model’ 0 1 2 3 1 2 3 4 5 6 7 8 9 101112131415161718192021222324252627282930 HoursPlayed Day of Observation Period Player Hours Per Day (Example Purposes Only) Hours
  • 9. Big Data Can Build Confidence  Result: We can’t implement it. ‘Complex’ model is bulky and requires a lot of data, but very accurate ‘high bar’.  Inspecting Var importance in complex model led to new feature: # Days played in preceding week.  New feature in simple model -> Nearly the same AUC as complex model!  We can be confident that our ‘simple’ models (built on small data) are good enough by comparing to feature rich ‘complex’ models (which require more data).
  • 10. Beyond Accuracy
      - Execution time, complexity, and interpretability may force simpler models. In-game models have to be lean.
      - Rare behavior: False positives
      - We also need to understand how our models develop over time.
      - Digging into any of these requires more data.
  • 11. Complex Interactions & Configurations
      - What’s the best match composition (lowest quit rate)?
      - Teams vs. Solo Players: The Eternal Struggle
      - Many permutations. Solution: dummy encode composition.
      - This approach will take more data.
      - You’ll need to pay attention to big parties since they’re more engaged players. Big parties are rare events (more data).
      - [Figure: For Example Purposes Only]
  • 12. Case Study: Predicting Spawn Behavior
      - A fun side project.
      - Design wants to be able to control the spawn experience with even greater accuracy.
      - Can we predict short spawns (< 3 seconds) or long spawns (> 30 seconds)?
      - Data shown in this section is for an older title. May not be representative of current games.
  • 13. Case Study: Predicting Spawn Behavior
      - [Figure: two spawn scenarios, Example 1 and Example 2. Legend: Player, Ally, Enemy, Spawn.]
  • 14. Case Study: Predicting Spawn Behavior
      - First stab at the data: 30MM observations, of which fewer than 1MM were targeted spawns.
      - Initial data was all positional: Team and Enemy coordinates.
      - Built 2 models and compared predictive power.
      - First chance to leverage our big data architecture (WOOO)!
  • 15. Case Study: Predicting Spawn Behavior
      - We created models in Spark using Zeppelin notebooks written in either Scala or PySpark.
      - Different models were tried: Logistic Regression, Random Forest, and GBM.
      - Language Hierarchy: Scala > PySpark > SparkR. Learn Scala.
  • 16. Case Study: Predicting Spawn Behavior
      - Gridded Model: Made a grid with 200 x 200 units and counted how many enemies or allies were in a grid square.
      - Allows for complex positional interactions (cover, team spacing, high ground).
      - Requires a significant amount of data.
  • 17. Case Study: Predicting Spawn Behavior
      - Distance Model: Bucket enemy and ally distances by 200 units and count.
      - Very simple approach: Just checking to see how far away enemies and allies are.
      - Requires much less data.
  • 18. Case Study: Predicting Spawn Behavior
      - End result: Distance Model > Gridded Model
      - Why? The Distance Model hit peak accuracy early. The Gridded Model had more to go.
      - Verdict: For the Gridded (big data) approach we don’t have enough data.
      - This was only for 1 map. ~500MM data points for all maps.
  • 19. Final Thoughts
      - We need big data to build confidence in simple models: Complex models to compare to our simple assumptions.
      - Rare events require a lot of data to understand (and model).
      - Complex Interactions: Combinatorial, Acoustic and Spatial problems require a lot of data.
      - Even so, thoughtful exploration of the data / feature creation can make up for small datasets (you just won’t know it).
      - Investigate how much information your model actually needs.
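
The two feature sets from the spawn case study (slides 16 and 17) can be sketched roughly as below. This is only an illustration under stated assumptions: the slides do not show the actual Spark code, so the function names, the 2-D coordinates, the 1000-unit square map, and the reading of "200 x 200 units" as grid squares 200 units on a side are all hypothetical.

```python
import math

CELL = 200       # assumed grid-square side and distance-bucket width, in map units
MAP_SIZE = 1000  # hypothetical square map extent; not taken from the slides


def gridded_features(positions, map_size=MAP_SIZE, cell=CELL):
    """Count players per grid square; returns a flat, row-major list of counts.

    Assumes non-negative coordinates inside the map.
    """
    n = map_size // cell
    counts = [0] * (n * n)
    for x, y in positions:
        # Clamp so points on the far edge fall into the last square.
        i = min(int(x // cell), n - 1)
        j = min(int(y // cell), n - 1)
        counts[i * n + j] += 1
    return counts


def distance_features(spawn, positions, max_dist=MAP_SIZE, bucket=CELL):
    """Count players per distance band around the spawn point."""
    n = max_dist // bucket
    counts = [0] * n
    for x, y in positions:
        d = math.hypot(x - spawn[0], y - spawn[1])
        # Anything beyond max_dist is lumped into the last band.
        counts[min(int(d // bucket), n - 1)] += 1
    return counts


spawn = (100.0, 100.0)
enemies = [(150.0, 120.0), (900.0, 900.0), (450.0, 100.0)]

grid = gridded_features(enemies)
dist = distance_features(spawn, enemies)
print(len(grid), "gridded features vs", len(dist), "distance features")
```

Even on this toy map the gridded vector has 25 entries against 5 for the distance vector, and the gap grows with the square of the map size. That is one way to see why the gridded approach needs far more observations before every feature is populated often enough to learn from.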

Editor's Notes

  1. We begin our story a few years ago, when I was tasked with working on our new model development pipeline (in Spark). After getting super hyped about all the data we could process, I turned to our most expert data scientist and this conversation transpired…
  2. We’re collecting a ton of data, but are we actually leveraging it to build better models and make fundamentally better decisions? This is a churn model from one of our older titles, but it’s a story I’ve seen time and time again. Here we see the learning curve for the model, showing that a dataset only needs to be so big for the model to reach a plateau in accuracy. For this logistic regression that happens around 10k observations. To be safe we can say 100k.
  3. Same story, different model. This GBM needs around 10k (maybe 50k - 100k) observations to build a competent initial model. So when do you need a lot of data?
  4. I cheated. You need big data to make the dataset but you still may only need 100k observations to train the model.
  5. Here’s where we at Activision find a good deal of value in big data. On the bottom is an example of what the previously mentioned dataset might look like for an individual player.
  6. For us, in boosting detection, false positive rates are very important. You don’t want to falsely accuse someone of cheating in your game! As for how models change over time, we’ve had to change churn models throughout the year to account for different behavior.
  7. Now for some weird considerations, stuff that might only be answerable with large coverage from a vast amount of data.
  8. Here we see two examples of spawns. Red triangles are enemy players and green triangles are ally players. The player in question is a white triangle. Circles are some possible spawn points. Can you easily tell which one is good or bad? Example 2 might be better simply because teammates are closer. Additionally the alley viewing the enemy team is narrower and it’s easier to hop to cover. Positioning data is subtle.
  9. Oh btw…all of those calculations were only for a single map, so feel free to multiply your problem by 12, unless you include DLC maps, or get mode specific, or… (~500MM data points for full models).
  10. And now I guess I’ll turn it over to you. Please do look into how much data your model actually needs and ask yourself if you’re making the most of the data you have available. The results might surprise you.
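
One way to check how much data a model actually needs is to plot a learning curve like the ones described in notes 2 and 3: train on growing subsets and score each fit on the same holdout set. The sketch below is a minimal, self-contained illustration on synthetic data with a nearest-centroid classifier standing in for the model. The data, the classifier, and all function names are assumptions for illustration, not the churn or survey models from the talk.

```python
import random


def make_data(n, rng):
    """Synthetic two-class data: two features, class means 0.0 vs 1.5."""
    rows = []
    for i in range(n):
        label = i % 2  # alternate labels so every prefix is balanced
        mean = 1.5 * label
        rows.append(([rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)], label))
    return rows


def fit_centroids(rows):
    """Stand-in 'model': the per-class mean of each feature."""
    sums = {0: [0.0, 0.0], 1: [0.0, 0.0]}
    counts = {0: 0, 1: 0}
    for x, y in rows:
        counts[y] += 1
        sums[y][0] += x[0]
        sums[y][1] += x[1]
    return {c: [s / max(counts[c], 1) for s in sums[c]] for c in sums}


def accuracy(centroids, rows):
    """Fraction of rows whose nearest centroid matches the true label."""
    correct = 0
    for x, y in rows:
        pred = min(
            centroids,
            key=lambda c: (x[0] - centroids[c][0]) ** 2 + (x[1] - centroids[c][1]) ** 2,
        )
        correct += pred == y
    return correct / len(rows)


def learning_curve(train, test, sizes):
    """Fit on the first n training rows for each n; score on a fixed holdout."""
    return [(n, accuracy(fit_centroids(train[:n]), test)) for n in sizes]


rng = random.Random(0)
train = make_data(10_000, rng)
test = make_data(2_000, rng)
curve = learning_curve(train, test, [10, 100, 1_000, 10_000])
for n, acc in curve:
    print(f"{n:>6} rows -> holdout accuracy {acc:.3f}")
```

If holdout accuracy stops improving well before the largest training size, extra rows are not buying anything for that model and feature set. The same loop works with any real model in place of the centroid stand-in.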