Video and slides synchronized; mp3 and slide download available at http://bit.ly/UkwXZ2.
Randy Shoup describes KIXEYE's analytics infrastructure, from Kafka queues through Hadoop 2 to Hive and Redshift, built for flexibility, experimentation, iteration, componentization, testability, reliability, and replayability. Filmed at qconnewyork.com.
As CTO, Randy brings 25 years of engineering leadership to KIXEYE, ranging from tiny startups to Internet-scale companies. Prior to KIXEYE, he was Director of Engineering at Google, leading several teams building Google App Engine, the world's largest Platform as a Service.
Transcript:
The Game of Big Data: Scalable, Reliable Analytics Infrastructure at KIXEYE
1. The Game of Big Data!
Analytics Infrastructure at KIXEYE
Randy Shoup
@randyshoup
linkedin.com/in/randyshoup
QCon New York, June 13, 2014
2. InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/kixeye-big-data
3. Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
4. Free-to-Play Real-time Strategy Games
• Web and mobile
• Strategy and tactics
• Really real-time :-)
• Deep relationships with players
• Constantly evolving gameplay, feature set, economy, balance
• >500 employees worldwide
5. Intro: Analytics at KIXEYE
User Acquisition
Game Analytics
Retention and Monetization
Analytic Requirements
6. User Acquisition
Goal: ELTV > acquisition cost
• A user's estimated lifetime value must exceed what it costs to acquire that user (see sketch below)
Mechanisms
• Publisher Campaigns
• On-Platform Recommendations
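To make the goal concrete, here is a minimal back-of-the-envelope sketch; the retention-style model, the numbers, and all names are illustrative assumptions, not KIXEYE's actual ELTV model:

```python
# Hypothetical ELTV check: a campaign is worth running only if
# estimated lifetime value exceeds the cost per acquired user.

def estimated_lifetime_value(arpdau, expected_days_played):
    """Naive ELTV: average revenue per daily active user times
    the expected number of days the player stays active."""
    return arpdau * expected_days_played

cost_per_install = 2.50   # hypothetical publisher price per acquired user
eltv = estimated_lifetime_value(arpdau=0.15, expected_days_played=30)

print(f"ELTV = ${eltv:.2f}, CPI = ${cost_per_install:.2f}")
print("Profitable" if eltv > cost_per_install else "Loses money")
```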
7. Game Analytics
Goal: Measure and Optimize “Fun”
• Difficult to define
• Includes gameplay, feature set, performance, bugs
• All metrics are just proxies for fun (!)
Mechanisms
• Game balance
• Match balance
• Economy management
• Player typology
8. Retention and Monetization
Goal: Sustainable Business
• Monetization drivers
• Revenue recognition
Mechanisms
• Pricing and Bundling
• Tournament (“Event”) Design
• Recommendations
9. Analytic Requirements
• Data Integrity and Availability
• Cohorting
• Controlled experiments
• Deep ad-hoc analysis
12. V1 Analytic System
Grew Organically
• Built originally for user acquisition
• Progressively grew to handle much more
Idiosyncratic mix of languages, systems, tools
• Log files -> Chukwa -> Hadoop -> Hive -> MySQL
• PHP for reports and ETL
• Single massive table with everything
13. V1 Analytic System
Many Issues
• Very slow to query
• No data standardization or validation
• Very difficult to add a new game, report, ETL
• Extremely difficult to backfill on error or outage
• Difficult for analysts to use; impossible for PMs, designers, etc.
… but we survived (!)
15. Goals of Deep Thought
Independent Scalability
• Logically separate, independently scalable tiers
Stability and Outage Recovery
• Tiers can completely fail with no data loss
• Every step idempotent and replayable
Standardization
• Standardized event types, fields, queries, reports
16. Goals of Deep Thought
In-Stream Event Processing
• Sessionalization, Dimensionalization, Cohorting
Queryability
• Structures are simple to reason about
• Simple things are simple
• Analysts, Data Scientists, PMs, Game Designers, etc.
Extensibility
• Easy to add new games, events, fields, reports
19. Sessionalization
All events are part of a “session”
• Explicit start event, optional stop event
• Game-defined semantics
Event Batching
• Events arrive in batches, associated with a session
• Pipeline computes batch-level metrics, disaggregates events
• Can optionally attach batch-level metrics to each event (see sketch below)
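A minimal sketch of this batching flow; the batch shape and field names are assumptions for illustration, not KIXEYE's actual schema:

```python
# Hypothetical shape: a client sends one batch per flush, tagged with a
# session id; the pipeline fans it out into individual events and can
# attach batch-level metrics to each one.

def disaggregate(batch, attach_batch_metrics=True):
    batch_metrics = {
        "batch_event_count": len(batch["events"]),
        "batch_span_ms": batch["sent_at_ms"] - batch["events"][0]["ts_ms"],
    }
    for event in batch["events"]:
        record = dict(event, session_id=batch["session_id"])
        if attach_batch_metrics:
            record.update(batch_metrics)
        yield record

batch = {
    "session_id": "s-42",
    "sent_at_ms": 1_402_000_000_500,
    "events": [
        {"type": "attack", "ts_ms": 1_402_000_000_000},
        {"type": "collect_resource", "ts_ms": 1_402_000_000_250},
    ],
}
for rec in disaggregate(batch):
    print(rec)
```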
20. Sessionalization
Time-Series Aggregations
• Configurable metrics
• 1-day X, 7-day X, lifetime X
• Total attacks, total time played
• Accumulated in-stream
• V1 aggregate + batch delta
• Faster to calculate in-stream than via Map-Reduce (see sketch below)
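A sketch of the in-stream accumulation idea, reading the slide's "V1 aggregate + batch delta" as prior value plus delta: the lifetime total is updated incrementally, and windowed values come from small per-day buckets instead of a Map-Reduce rescan. Class and field names are hypothetical:

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical in-stream accumulator for one metric (e.g. total attacks).
class MetricAccumulator:
    def __init__(self):
        self.lifetime = 0
        self.daily = defaultdict(int)          # day -> total for that day

    def add_batch(self, day, delta):
        self.lifetime += delta                 # prior aggregate + batch delta
        self.daily[day] += delta

    def windowed(self, as_of, days):
        window = [as_of - timedelta(d) for d in range(days)]
        return sum(self.daily[d] for d in window)

acc = MetricAccumulator()
acc.add_batch(date(2014, 6, 10), 3)
acc.add_batch(date(2014, 6, 13), 5)
today = date(2014, 6, 13)
print(acc.windowed(today, 1), acc.windowed(today, 7), acc.lifetime)  # 5 8 8
```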
21. Dimensionalization
Pipeline assigns unique numeric id to string enums
• E.g., “twigs” -> resource id 1234
Automatic mapping and assignment
• Games log strings
• Pipeline generates and maps ids
• No configuration necessary
Fast dimensional queries
• Join on integers, not strings (see sketch below)
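A minimal sketch of such a dimensionalizer, with hypothetical names; the case-folding anticipates the value merging described on the next slide:

```python
# Hypothetical dimensionalizer: games log raw strings, the pipeline hands
# out numeric ids on first sight, and case variants collapse to one id.

class Dimensionalizer:
    def __init__(self):
        self._ids = {}        # normalized string -> numeric id
        self._names = {}      # numeric id -> canonical display string

    def id_for(self, raw):
        key = raw.strip().lower()
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1   # auto-assign, no config needed
            self._names[self._ids[key]] = raw
        return self._ids[key]

dims = Dimensionalizer()
print(dims.id_for("twigs"), dims.id_for("TWIGS"), dims.id_for("pitch"))
# -> 1 1 2 : downstream joins happen on small integers, not strings
```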
22. Dimensionalization
Metadata enumeration and manipulation
• Easily enumerate all values for a field
• Merge multiple values
• “TWIGS” == “Twigs” == “twigs”
Metadata tagging
• Can assign arbitrary tags to metadata
• E.g., “Panzer 05” is {tank, mechanized infantry, event prize}
• Enables custom views
23. Cohorting
Group players along any dimension / metric
• Well beyond classic age-based cohorts
Core analytical building block
• Experiment groups
• User acquisition campaign tracking
• Prospective modeling
• Retrospective analysis
24. Cohorting
Set-based
• Overlapping groups: >100, >200, etc.
• Exclusive groups: (100-200), (200-500), etc.
Time-based
• E.g., people who played in last 3 days
• E.g., “whale” == ($$ > X) in last N days
• Auto-expire from a group without explicit intervention (see sketch below)
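A sketch of both cohort styles, with hypothetical thresholds and field names; note the time-based rule expires members simply by being re-evaluated over a sliding window, with no explicit "remove from cohort" step:

```python
from datetime import datetime, timedelta

def spend_tier(lifetime_spend):
    """Exclusive, set-based cohorts by lifetime spend."""
    if lifetime_spend > 500: return "spend_500_plus"
    if lifetime_spend > 200: return "spend_200_500"
    if lifetime_spend > 100: return "spend_100_200"
    return "spend_under_100"

def is_whale(purchases, now, threshold=100.0, days=30):
    """Time-based cohort: $$ > threshold within the last N days."""
    cutoff = now - timedelta(days=days)
    recent = sum(amount for ts, amount in purchases if ts >= cutoff)
    return recent > threshold

now = datetime(2014, 6, 13)
purchases = [(datetime(2014, 6, 1), 80.0), (datetime(2014, 3, 1), 900.0)]
print(spend_tier(sum(a for _, a in purchases)))   # set-based tier
print(is_whale(purchases, now))                   # False: big spend aged out
```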
26. Implementation of Pipeline
• Ingestion
• Event Log
• Transformation
• Data Storage
• Analysis and Visualization
Pipeline: Logging Service -> Kafka -> Importer / Session Store -> Hadoop 2 -> Hive / Redshift
27. Ingestion: Logging Service
HTTP / JSON Endpoint
• Play framework
• Non-blocking, event-driven
Responsibilities
• Message integrity via checksums
• Durability via local disk persistence
• Async batch writes to Kafka topics: {valid, invalid, unauth} (see sketch below)
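A minimal sketch of the routing decision described above. The three topic names come from the slide; the MD5 checksum and API-key check are assumptions about how integrity and auth might be verified:

```python
import hashlib, json

def route_message(body_bytes, claimed_md5, api_key, known_keys):
    if api_key not in known_keys:
        return "unauth"
    if hashlib.md5(body_bytes).hexdigest() != claimed_md5:
        return "invalid"                      # corrupted in transit
    try:
        json.loads(body_bytes)                # must at least be well-formed JSON
    except ValueError:
        return "invalid"
    return "valid"

body = json.dumps({"type": "attack", "session_id": "s-42"}).encode()
topic = route_message(body, hashlib.md5(body).hexdigest(),
                      "game-1-key", {"game-1-key"})
print(topic)  # -> "valid"; the service then batch-writes to this topic async
```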
28. Event Log: Kafka
Persistent, replayable pipe of events
• Events stored for 7 days
Responsibilities
• Durability via replication and local disk streaming
• Replayability via commit log (see sketch below)
• Scalability via partitioned brokers
• Segment data for different types of processing
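Replayability in practice: because Kafka retains 7 days of the commit log, a consumer can be restarted from the earliest retained offset and reprocess everything. A minimal sketch assuming the kafka-python client, with a placeholder topic and broker address:

```python
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "events.valid",                       # hypothetical topic name
    bootstrap_servers="localhost:9092",   # placeholder broker
    auto_offset_reset="earliest",         # start from oldest retained message
    enable_auto_commit=False,             # replay: don't advance committed offsets
)
for message in consumer:
    print(message.partition, message.offset, message.value[:80])
```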
30. Transformation: Importer
Responsibilities (cont.)
• Sessionalization
• Assign event to session
• Calculate time-series aggregates
• Dimensionalization
• String enum -> numeric id
• Merge / coalesce different string representations into single id
• Player metadata
• Join player metadata from session store
31. Transformation: Importer
Responsibilities (cont.)
• Cohorting
• Process enter-cohort, exit-cohort events
• Process A/B testing events
• Evaluate cohort rules (e.g., spend thresholds)
• Decorate events with cohort tags
32. Transformation: Session Store
Key-value store (Couchbase)
• Fast, constant-time access to sessions, players
Responsibilities
• Store Sessions, Players, Dimensions, Config
• Lookup
• Idempotent update (see sketch below)
• Store accumulated session-level metrics
• Store player history
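A sketch of the idempotent-update property, with a plain dict standing in for Couchbase; batch ids and the document shape are hypothetical. A batch applied twice, e.g. during replay, changes nothing, so session metrics stay correct:

```python
store = {}   # key -> session document (stand-in for the KV store)

def apply_batch(session_id, batch_id, metrics_delta):
    doc = store.setdefault(session_id, {"applied_batches": set(), "metrics": {}})
    if batch_id in doc["applied_batches"]:
        return doc                         # already seen: no-op, hence idempotent
    for name, delta in metrics_delta.items():
        doc["metrics"][name] = doc["metrics"].get(name, 0) + delta
    doc["applied_batches"].add(batch_id)
    return doc

apply_batch("s-42", "b-1", {"attacks": 3})
apply_batch("s-42", "b-1", {"attacks": 3})   # replayed batch: ignored
print(store["s-42"]["metrics"])              # -> {'attacks': 3}
```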
33. Storage: Hadoop 2
Camus MR
• Kafka -> HDFS every 3 minutes
append_events table
• Append-only log of events
• Each event has a session-version for deduplication
34. Storage: Hadoop 2
append_events -> base_events MR
• Logical update of base_events
• Update events with new metadata
• Swap old partition for new partition
• Replayable from beginning without duplication (see sketch below)
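A rough sketch tying the last two slides together, with hypothetical field names: deduplicating on the session-version from append_events means the rebuild can replay the log from the beginning, and swapping the rebuilt partition in makes the update logical rather than in-place:

```python
# Rebuild a day's base_events partition from the append-only log,
# keeping only the highest session-version per event, then swap it in.

def rebuild_partition(append_events):
    best = {}
    for ev in append_events:
        key = ev["event_id"]
        if key not in best or ev["session_version"] > best[key]["session_version"]:
            best[key] = ev                 # later version wins -> dedup on replay
    return list(best.values())

partitions = {"2014-06-13": []}            # stands in for a Hive partition

append_log = [
    {"event_id": "e1", "session_version": 1, "type": "attack"},
    {"event_id": "e1", "session_version": 2, "type": "attack"},  # re-imported
    {"event_id": "e2", "session_version": 1, "type": "collect"},
]
partitions["2014-06-13"] = rebuild_partition(append_log)  # the "partition swap"
print(len(partitions["2014-06-13"]))       # -> 2 events, no duplicates
```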
35. Storage: Hadoop 2
base_events table
• Denormalized table of all events
• Stores original JSON + decoration
• Custom SerDes to query / extract JSON fields without materializing entire rows (see sketch below)
• Standardized event types -> lots of functionality for free
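The custom SerDes are Hive-side, but a rough Python analogue of the idea, with hypothetical field names, is streaming records and keeping only the requested field rather than building full row objects:

```python
import json

def extract_field(json_lines, field):
    for line in json_lines:
        # Parse each record and surface only the requested field.
        yield json.loads(line).get(field)

rows = [
    '{"type": "attack", "player_id": 7, "damage": 120}',
    '{"type": "collect_resource", "player_id": 9, "resource": "twigs"}',
]
print(list(extract_field(rows, "type")))    # -> ['attack', 'collect_resource']
```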
36. Analysis and Visualization
Hive Warehouse
• Normalized event-specific, game-specific stores
• Aggregate metric data for reporting, analysis
• Maintained through custom ETL
• MR
• Hive queries
37. Analysis and Visualization
Amazon Redshift
• Fast ad-hoc querying
Tableau
• Simple, powerful reporting
38. Come Join Us!
KIXEYE is hiring in SF, Seattle, Victoria, Brisbane, Amsterdam
rshoup@kixeye.com
@randyshoup
Deep Thought Team:
• Mark Weaver
• Josh McDonald
• Ben Speakmon
• Snehal Nagmote
• Mark Roberts
• Kevin Lee
• Woo Chan Kim
• Tay Carpenter
• Tim Ellis
• Kazue Watanabe
• Erica Chan
• Jessica Cox
• Casey DeWitt
• Steve Morin
• Lih Chen
• Neha Kumari
39. Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/kixeye-big-data