Lambda Architecture:
How We Merged Batch and Real-Time
Sewook Wee, Senior Engineering Lead
Sotos Matzanas, Tech Lead
June 27 2016
Our Goal
Our goal at Trulia is to
give consumers an easy
and enjoyable way to find
their next home by
providing data and
insights to help them
make the best decision.
Personalization Team
• Who we are
• What we do
• How we do it
Examples of User Trait
• user-device linkage
• activity summary
• inferred insights
User ID: 30942342
Browser Cookies: [1411087c06c6b530c155b933fdee44e2ef8905]
SearchHistory: [
{time: 2016-01-21T23:38:31Z,
query: “/for_sale/San_Francisco,CA/2p_beds/1p_baths”}
]
lastVisited: 2016-01-21T23:52:55Z
locationPreference: {
“San Francisco, CA” : 0.52,
“Oakland, CA”: 0.39,
“Los Angeles, CA” : 0.09
}
User Type: { buyer: 0.87, renter : 0.13 }
Recommended properties: [
{ propertyId: 3223394214, score: 0.22},
{ propertyId: 3223518578, score: 0.09}
]
Why Lambda Architecture
We needed a way to…
• Recalculate the full User Trait from the full event body at scale
• Read it back fast
• Add new metrics to old aggregates
• Refresh in near real-time to catch up on the delta
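A minimal sketch of the requirements above, assuming a toy trait (search counts per query) and hypothetical function names. It illustrates the general Lambda pattern and is not Trulia's actual code: the batch function rebuilds traits from the full event history, the real-time function folds one event into a delta, and the read path merges the two.

```python
from collections import Counter


def batch_recompute(events):
    """Batch layer: rebuild every user's trait from the full event history."""
    traits = {}
    for event in events:
        traits.setdefault(event["user_id"], Counter())[event["query"]] += 1
    return traits


def apply_realtime(delta, event):
    """Speed layer: fold one new event into the per-user delta."""
    delta.setdefault(event["user_id"], Counter())[event["query"]] += 1


def read_trait(user_id, batch_traits, delta):
    """Serving layer: batch view plus whatever arrived since the batch ran."""
    return batch_traits.get(user_id, Counter()) + delta.get(user_id, Counter())


# Hypothetical sample data, for illustration only.
history = [
    {"user_id": "u1", "query": "/for_sale/San_Francisco,CA"},
    {"user_id": "u1", "query": "/for_sale/Oakland,CA"},
]
batch_traits = batch_recompute(history)

delta = {}
apply_realtime(delta, {"user_id": "u1", "query": "/for_sale/San_Francisco,CA"})

merged = read_trait("u1", batch_traits, delta)
print(merged)
```

In this sketch, adding a new metric means changing `batch_recompute` and rerunning it over the full history, which is what lets a new metric apply to old data.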
Enter: Lambda Architecture
[Diagram. Components shown: Event API, Kafka, Event Master Dataset, Batch Processing, Real-Time Processing, User Trait (Batch), User Trait (Real-Time), Serving Layer, User Trait API. Clients shown: Desktop, Mobile Web, Mobile Apps, Email & Push.]
Our User Model
• We support both registered and unregistered users
• Registered users: user id + secondary id(s) (mobile, Web, email)
• User login: link and merge all known activity on all devices
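The user model above can be sketched as a lookup from secondary id to user id. This is an illustration under assumptions: the id strings, the `logins` pairs, and the function names are all hypothetical, and the real linkage discovery is not described in the slides.

```python
from collections import defaultdict


def build_linkage(logins):
    """Map each secondary id seen at login to the registered user id."""
    return {secondary_id: user_id for user_id, secondary_id in logins}


def merge_activity(events, linkage):
    """Group events under the user id when linked, else under the secondary id."""
    merged = defaultdict(list)
    for event in events:
        owner = linkage.get(event["id"], event["id"])
        merged[owner].append(event["action"])
    return dict(merged)


# Hypothetical sample data: one registered user on two devices, one anonymous visitor.
logins = [("user-30942342", "cookie-1411"), ("user-30942342", "mobile-77")]
events = [
    {"id": "cookie-1411", "action": "search"},
    {"id": "mobile-77", "action": "view"},
    {"id": "cookie-9999", "action": "search"},  # unregistered visitor
]
merged = merge_activity(events, build_linkage(logins))
print(merged)
```

Unregistered visitors keep a trait keyed by their secondary id, so nothing is lost if they log in later and a linkage appears.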
Our Real-Time Complications
• User linkage can change while a new batch is being calculated
• New user linkage can appear during the day and is not reflected in the
batch calculation
• We needed to plan for these cases and make sure the eventual User Trait
reflects the state of a user as of right now
Simple Real-Time Case
[Diagram. Labels: Event (x3), Parse, Linkage Lookup, HBase, Store, Redis, user id. Legend: Transfers, Writes, Reads.]
Current Real-Time Design
[Diagram. Labels: Event (x3), Kafka Spout, Parse, Linkage Lookup, Yesterday’s Lookup (HBase), Today’s Lookup (HBase), Rebalance, user id, secondary id, Redis, Store, Send as Control Events, Control Event Spout, Control Bolt, Lookup Change (x3). Legend: Transfers, Writes, Reads.]
Rebalance Time @ Batch Completion Time: get all user ids + secondary ids for today
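A rough sketch of the rebalance idea named in the diagram labels (yesterday's lookup, today's lookup, lookup changes sent as control events). The details are assumptions made for illustration: lookups are plain dicts from secondary id to owner, real-time state is a list of actions keyed by owner, and the old owner's whole state moves to the new owner. The slides do not specify any of this, and all names are hypothetical.

```python
def lookup_changes(yesterday, today):
    """Yield (secondary_id, old_owner, new_owner) for each linkage that differs."""
    for secondary_id, new_owner in today.items():
        # An id with no linkage yesterday owned its own state.
        old_owner = yesterday.get(secondary_id, secondary_id)
        if old_owner != new_owner:
            yield secondary_id, old_owner, new_owner


def apply_control_event(store, change):
    """Transfer real-time state from the old owner key to the new owner key."""
    _, old_owner, new_owner = change
    moved = store.pop(old_owner, [])
    store.setdefault(new_owner, []).extend(moved)


# Hypothetical sample data: cookie-B became linked to user-1 today.
yesterday = {"cookie-A": "user-1"}
today = {"cookie-A": "user-1", "cookie-B": "user-1"}
store = {"user-1": ["search"], "cookie-B": ["view"]}

for change in lookup_changes(yesterday, today):
    apply_control_event(store, change)
print(store)
```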
Transition to a New Epoch
• When rebalance of all ids is complete
• Completion of rebalance: no new user id has been rebalanced for 30
seconds
• Redis keys with a TTL act as a heartbeat that disappears when no new
control events arrive
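The completion rule above can be sketched with a key that expires after 30 seconds unless a control event refreshes it. This is a sketch under assumptions: `TtlStore` is an in-memory stand-in rather than a Redis client, and the key name and function names are hypothetical. In Redis itself, a comparable refresh could be a `SET` with the `EX` option.

```python
QUIET_PERIOD_SECONDS = 30  # from the slide: no rebalanced user id for 30 seconds
HEARTBEAT_KEY = "rebalance:heartbeat"  # hypothetical key name


class TtlStore:
    """In-memory stand-in for keys with a TTL (assumption: not a real Redis client)."""

    def __init__(self, clock):
        self._clock = clock  # callable returning the current time in seconds
        self._expires = {}

    def set_with_ttl(self, key, ttl_seconds):
        self._expires[key] = self._clock() + ttl_seconds

    def exists(self, key):
        expiry = self._expires.get(key)
        return expiry is not None and self._clock() < expiry


def on_control_event(store):
    """Each rebalanced id refreshes the heartbeat, pushing expiry out again."""
    store.set_with_ttl(HEARTBEAT_KEY, QUIET_PERIOD_SECONDS)


def rebalance_complete(store, rebalance_started):
    """Done once the rebalance has started and the heartbeat key has expired."""
    return rebalance_started and not store.exists(HEARTBEAT_KEY)


# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
store = TtlStore(lambda: now[0])
on_control_event(store)   # heartbeat set at t=0, expires at t=30
now[0] = 10.0
on_control_event(store)   # refreshed at t=10, expires at t=40
now[0] = 39.0
print(rebalance_complete(store, True))
now[0] = 40.0
print(rebalance_complete(store, True))
```

Passing the clock in keeps the timing logic testable. The `rebalance_started` flag guards against declaring completion before any control event has been seen.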
Epoch Transitions
[Diagram along a timeline. Markers: Midnight, Batch Layer N Done, Rebalance Done for N, Midnight, Batch Layer N + 1 Done, Rebalance Done for N + 1. Bars: Batch Layer Events for Epoch N, Speed Layer Events Epoch N, Batch Layer Events for N + 1, Speed Layer Events for N + 1, Batch Layer Epoch N, Real-Time Layer Epoch N, Rebalance for N, Batch Layer Epoch N + 1, Real-Time Layer Epoch N + 1, Serving Batch N + Speed N, Serve N + 1. Legend: Event Processing Epoch, Serving From, Batch Process.]
Our Input and its Size
• Hundreds of millions of events per day
• Billions of events per month
• 12 TB of events, and growing
• Hundreds of millions of User Traits calculated daily
• Millions calculated in real-time
As a Result
• Continuously add new features to build data-driven products
• Retroactively apply new features on old data
• A virtuous cycle of learning more, personalizing more, and
learning again
• Delivery of data and insights to help consumers make the
best decision
swee@trulia.com
smatzanas@trulia.com

Editor's Notes

  1. Introductions
  2. Sewook: Engage the audience with a show of hands: how many of you have heard of Trulia before? Trulia’s goal is to simplify the crazy experience of finding a home by providing data and insights to help you make a better decision. It’s not just about finding the best home in town, but finding the best home for you. To do that, we need to know our users, so we formed a personalization team.
  3. Sewook: The personalization team works to understand what our users are looking for. We have built a personalization platform based on the Lambda Architecture, where we track users’ activity in real time, process it, and build a digital signature or profile of each user. We call this profile a user trait, which I’ll explain a bit more on the next slide.
  4. Sewook: This slide explains how we’ve built our user traits. Essentially, we take the repository of consumer activity events, process it, and generate the user trait. The simplest approach to processing the data is batch processing, but that takes time, and between batch cycles the user trait becomes stale. At the other extreme is full event-by-event real-time processing, which is cool, but historically it runs into other issues, such as full data reprocessing. That is why we landed on Lambda Architecture.
  5. Sewook: We like Lambda because it has both batch and real-time benefits. Through Lambda, we can recalculate the full trait from the event body in each batch cycle at scale. Whenever we need to change business rules or cleanse old data, we can do it very easily. We can read back each individual trait quickly. In addition, we have a real-time layer where we can catch up on the delta. Hand the presentation to Sotos: Sotos will explain exactly how our personalization platform looks.
  6. Sotos: I’m going to share how we implement Lambda Architecture, our specific needs and complications, and what our current architecture looks like. I will mostly focus on the real-time process, but first I will walk you through the batch part of our Lambda Architecture and how we built it.
  7. Sotos: Before I dig into the complications of our real-time layer, let me explain our user model a bit. No matter what avenue a user comes from, we always build a user trait, even when we have only a secondary ID. In cases where the user is registered, we also build a unique user trait. We discover linkages through our batch workflow and store all discovered linkages in a unique table per batch run.
  8. Sotos: The issue is that linkages may change from day to day, or we might discover a new linkage. We need to account for all these cases. Our real-time platform needs to properly marry all activity for a user.
  9. Sotos
  10. Sotos
  11. Sotos: Explain what an epoch is.
  12. (1) Batch layer events for Epoch N. (2) Midnight line. (3) Batch layer Epoch N, blue. (4) Speed layer events Epoch N, orange. (5) Real-time layer Epoch N, not rebalance N. (6) Batch layer N done line. (7) Rebalance for N. (8) Rebalance done for N line. (9) Serving Batch N.