Copyright © 2014 Improve Digital - All Rights Reserved
Approximation algorithms for
stream and batch processing
Gabriele Modena
Data Scientist Improve Digital

E: g.modena@improvedigital.com
Real Time Advertisement Technology
Media Owners Advertisers
Adtech 101
<150 msec
• Geographically distributed adserver fleet
• 200+ billion events / month
• Hundreds of TB in a Hadoop cluster
Improve Digital data platform
• Reporting & BI
– How much revenue did publisher X generate last month? Which are the top advertisers?
• Trend analysis
– Is the day-to-day traffic on site Y increasing or decreasing?
• Fraud detection
– Is the traffic legit or coming from a botnet?
• Predictive modelling
– How likely is this impression to generate a click or a conversion?
• Pattern recognition
– How are advertisers bidding and buying on inventory? Who is our audience?
Historically
• Batch pipelines
• Incremental processing
• Realtime pipelines
• Monitoring and trend analysis
Batch dataset != Realtime dataset
Batch models != Realtime models
Goals
• Write jobs once
• Unify models and analytics codebase
• Dataset semantics
• Experimentation
Analytics Architecture
[Diagram: real-time log collection; brokerage (Kafka + Samza); processing (YARN + Spark + MapReduce); stores: database, HDFS, Redis; arrows labelled Push, Expose, Publish]
Kafka and Samza
• Kafka (http://kafka.apache.org) as a
distributed message queue
• Topic-based
• Producers write, consumers read
• Messages are persistently stored – topics
can be re-read
• We use Samza for coordinating ingestion, ETL
and distributed stream processing
Apache Spark
• Spark (Zaharia et al. 2010)
• “Iterative” computing
• Generalization of MapReduce (Isard 2007)
• Runs atop Hadoop (YARN)

• Spark Streaming
• Break data into batches and pass it to
Spark engine (same API & data structures)
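
The micro-batching idea can be sketched without Spark itself. In this minimal illustration, plain Scala collections stand in for Spark's distributed datasets, and the names MicroBatch, countByKey, merge and streamCounts are hypothetical. The point is only that one per-batch function serves both a finite dataset and slices of a stream.

```scala
object MicroBatch {
  // Per-batch logic, written once: count occurrences of each key in a finite batch.
  def countByKey(batch: Seq[String]): Map[String, Int] =
    batch.groupBy(identity).map { case (key, hits) => key -> hits.size }

  // Combine two partial results; this is what lets per-batch outputs accumulate.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (key, n)) => acc.updated(key, acc.getOrElse(key, 0) + n) }

  // "Streaming": slice the incoming iterator into small batches and reuse countByKey.
  def streamCounts(events: Iterator[String], batchSize: Int): Map[String, Int] =
    events.grouped(batchSize)
      .map(batch => countByKey(batch))
      .foldLeft(Map.empty[String, Int])(merge)
}
```

Calling MicroBatch.countByKey on the whole dataset and MicroBatch.streamCounts on the same events in batches of any size should give the same totals.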
Challenges
• Conceptually everything is a stream
• Satisfy a tradeoff between
• Latency
• Memory
• Accuracy

• On infinitely expanding datasets
Make big data small
Samples, sketches and summaries
Reservoir Sampling (Vitter, 1985)
• Hard to parallelize
• How to use samples to answer certain queries?
Count distinct? TopK?
• From an infinitely expanding dataset
• With constant memory and in a single pass
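
A minimal sketch of reservoir sampling in plain Scala, assuming the classic single-pass scheme with a reservoir of fixed size k. The object and method names are hypothetical, and this is an illustration rather than the code used in the deck.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object ReservoirSampling {
  /** Keep a uniform random sample of at most k items from a stream of unknown length. */
  def sample[A](stream: Iterator[A], k: Int, rng: Random = new Random()): Vector[A] = {
    val reservoir = ArrayBuffer.empty[A]
    var seen = 0L
    stream.foreach { item =>
      seen += 1
      if (reservoir.size < k) {
        reservoir += item // fill phase: keep the first k items
      } else {
        // pick a slot in [0, seen); the item is kept only if the slot falls
        // inside the reservoir, which happens with probability k / seen
        val j = (rng.nextDouble() * seen).toLong
        if (j < k) reservoir(j.toInt) = item
      }
    }
    reservoir.toVector
  }
}
```

For example, ReservoirSampling.sample(Iterator.range(0, 1000000), 10) returns ten of the inputs while holding only ten items in memory at any time.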
Cardinality estimation (count distinct)
How many users are visiting a site?
Claim
The cardinality of a multiset of
uniformly-distributed random
numbers can be estimated by
calculating the maximum number
of leading zeros in the binary
representation of each number in
the set.
HyperLogLog (Flajolet et al. 2008)
Intuitively

1. Apply a hash function to each element and take the binary representation of the output
2. If the maximum number of leading zeros observed is n, an estimate for the number of distinct elements in the set is 2^n
3. Account for variance by averaging over subsets
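
The three steps above can be sketched in plain Scala. This is a simplified illustration, not Algebird's implementation: the class name SimpleHLL is hypothetical, the hash is the standard library's 32-bit MurmurHash3, and only the small-range correction from the HyperLogLog paper is applied.

```scala
import scala.util.hashing.MurmurHash3

/** Simplified HyperLogLog over 32-bit hashes; p index bits give m = 2^p registers. */
final class SimpleHLL(val p: Int) {
  require(p >= 4 && p <= 16, "p must be between 4 and 16")
  private val m = 1 << p
  private val registers = new Array[Byte](m)

  def add(item: String): Unit = {
    val h = MurmurHash3.stringHash(item)
    val idx = h >>> (32 - p) // the first p bits pick a register (the "subset")
    val rest = h << p        // the remaining bits, left-aligned
    // rank = leading zeros in the remaining bits, plus one
    val rank = math.min(Integer.numberOfLeadingZeros(rest), 32 - p) + 1
    if (rank > registers(idx)) registers(idx) = rank.toByte
  }

  /** Register-wise max: partial sketches from different partitions can be combined. */
  def merge(other: SimpleHLL): SimpleHLL = {
    require(other.p == p, "sketches must use the same p")
    val out = new SimpleHLL(p)
    var i = 0
    while (i < m) {
      out.registers(i) = math.max(registers(i), other.registers(i)).toByte
      i += 1
    }
    out
  }

  def estimate: Double = {
    val alpha = m match {
      case 16 => 0.673
      case 32 => 0.697
      case 64 => 0.709
      case _  => 0.7213 / (1.0 + 1.079 / m)
    }
    // harmonic mean of 2^rank across registers, scaled by alpha * m^2
    val harmonic = registers.map(r => math.pow(2.0, -r.toDouble)).sum
    val raw = alpha * m.toDouble * m.toDouble / harmonic
    val zeros = registers.count(_ == 0)
    // small-range correction (linear counting) while many registers are still empty
    if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
  }
}
```

With p = 12 the sketch holds 4096 one-byte registers regardless of how many items are added, and duplicates never change the estimate.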
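HyperLogLog (with Spark + Algebird)
import com.twitter.algebird.HyperLogLogMonoid

// 12 bits of precision, i.e. 2^12 registers
val hll = new HyperLogLogMonoid(12)

// users is a stream of user ids: one sketch per id, merged into one sketch per batch
val approxUsers = users.mapPartitions(user => user.map(uuid => hll(uuid.getBytes))).reduce(_ + _)

// running sketch, merged across batches
var h = hll.zero
approxUsers.foreach(rdd => { // later Spark releases name this foreachRDD
  if (rdd.count() != 0) {
    val partial = rdd.first()
    h += partial
  }
})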
HyperLogLog (< 2% error rate in 15kB)
[Charts: count and memory, exact vs. approximate]
Frequency estimation
Top 10 most visited sites (out of a few million)?
Count Min Sketch
(Cormode and Muthukrishnan, 2005)
It’s the hashing trick!
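
A minimal Count-Min Sketch in plain Scala shows the idea. The class name SimpleCMS is hypothetical, and seeding the standard library's MurmurHash3 with the row index is an assumption made for brevity: it stands in for the pairwise-independent hash family of the original paper and is not Algebird's implementation.

```scala
import scala.util.hashing.MurmurHash3

/** Count-Min Sketch: depth rows of width counters, one hash function per row. */
final class SimpleCMS(val eps: Double, val delta: Double) {
  val width: Int = math.ceil(math.E / eps).toInt           // overestimate <= eps * total count
  val depth: Int = math.ceil(math.log(1.0 / delta)).toInt  // with probability >= 1 - delta
  private val table = Array.ofDim[Long](depth, width)

  private def bucket(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row) // the row index seeds the hash
    ((h % width) + width) % width             // non-negative column index
  }

  def add(item: String, count: Long = 1L): Unit = {
    var r = 0
    while (r < depth) {
      table(r)(bucket(item, r)) += count
      r += 1
    }
  }

  /** Collisions can only inflate a counter, so the minimum over rows never underestimates. */
  def estimate(item: String): Long =
    (0 until depth).map(r => table(r)(bucket(item, r))).min
}
```

With eps = 0.01 and delta = 1E-3 the table is 7 rows by 272 columns, independent of how many distinct items are counted.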
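CMS (with Spark + Algebird)
import com.twitter.algebird.CountMinSketchMonoid

val eps = 0.01     // error bound, as a fraction of the total count
val delta = 1E-3   // probability of exceeding that bound
val seed = 1
val perc = 0.003   // heavy-hitter threshold, as a fraction of the total count

// CountMinSketchMonoid takes (eps, delta, seed, heavyHittersPct)
val approxImpressions = publishers.mapPartitions(publisher => {
  val cms = new CountMinSketchMonoid(eps, delta, seed, perc)
  publisher.map(publisher_id => cms.create(publisher_id.toLong))
}).reduce(_ ++ _)

var globalCMS = new CountMinSketchMonoid(eps, delta, seed, perc).zero
// fold each batch's sketch into the running global sketch
approxImpressions.foreach(rdd => {
  if (rdd.count() != 0) {
    val partial = rdd.first()
    globalCMS ++= partial
    // five most frequent heavy hitters so far, highest estimate first
    val globalTopK = globalCMS.heavyHitters.map(id => (id, globalCMS.frequency(id).estimate)).toSeq.sortBy(_._2).reverse.slice(0, 5)
  }
})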
CMS results
[Charts: exact vs. approximate]
Learning from data
Iterative methods are hard to
scale in MapReduce
Using sketches
• Linear Regression
– OLS + SGD on batches of data
– Recursive Least Squares with Forgetting (Vahidi et al. 2005)

• Streaming k-means (Ailon et al. 2009, Shindler et al. 2011, Ostrovsky et al. 2012)
– Single iteration-to-convergence
– Use sketches to reduce dimensionality (k log N centroids)
– Mini-batch updates + forgetfulness
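
As an illustration of the regression bullet, here is a minimal sketch of recursive least squares with an exponential forgetting factor, in plain Scala. It uses the textbook update equations. The class name ForgettingRLS and the default values for lambda and delta are assumptions for the example, not taken from the deck or from the cited paper.

```scala
/** Recursive least squares with forgetting factor lambda in (0, 1]. */
final class ForgettingRLS(dim: Int, lambda: Double = 0.99, delta: Double = 1000.0) {
  val w: Array[Double] = new Array[Double](dim) // current weights, start at zero
  // P tracks the inverse of the discounted input covariance; start at delta * I
  private val p: Array[Array[Double]] =
    Array.tabulate(dim, dim)((i, j) => if (i == j) delta else 0.0)

  def predict(x: Array[Double]): Double =
    (0 until dim).map(i => w(i) * x(i)).sum

  /** One observation at a time, constant memory: no past data is stored. */
  def update(x: Array[Double], y: Double): Unit = {
    val px = Array.tabulate(dim)(i => (0 until dim).map(j => p(i)(j) * x(j)).sum) // P x
    val denom = lambda + (0 until dim).map(i => x(i) * px(i)).sum                 // lambda + x' P x
    val gain = px.map(_ / denom)
    val err = y - predict(x)
    for (i <- 0 until dim) w(i) += gain(i) * err
    // P <- (P - gain (P x)') / lambda; P is symmetric, so x' P equals (P x)'
    for (i <- 0 until dim; j <- 0 until dim)
      p(i)(j) = (p(i)(j) - gain(i) * px(j)) / lambda
  }
}
```

A lambda below 1 discounts old observations geometrically, so the weights can follow a relationship that drifts over time. With lambda = 1 the recursion reduces to ordinary recursive least squares.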
Conclusion
• Streaming is part of the broader system
• Approximation can help us scale both streaming and batch loads
– Make “big data” small
– Unification
• Data collection and distribution is key
– Publishing results follows
• Large scale analytics = Architecture + Algos + Data Structures
More Related Content

What's hot

Titan and Cassandra at WellAware
Titan and Cassandra at WellAwareTitan and Cassandra at WellAware
Titan and Cassandra at WellAwaretwilmes
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Spark Summit
 
On-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsOn-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsDatabricks
 
Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUsCarol McDonald
 
Taste Java In The Clouds
Taste Java In The CloudsTaste Java In The Clouds
Taste Java In The CloudsJacky Chu
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabImpetus Technologies
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataMathieu Dumoulin
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...Aditya Yadav
 
Hadoop ensma poitiers
Hadoop ensma poitiersHadoop ensma poitiers
Hadoop ensma poitiersRim Moussa
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Mathieu Dumoulin
 
MapR and Machine Learning Primer
MapR and Machine Learning PrimerMapR and Machine Learning Primer
MapR and Machine Learning PrimerMathieu Dumoulin
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabVijay Srinivas Agneeswaran, Ph.D
 
Pivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache HadoopPivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache Hadoopmarklpollack
 
Special Purpose Quantum Annealing Quantum Computer v1.0
Special Purpose Quantum Annealing Quantum Computer v1.0Special Purpose Quantum Annealing Quantum Computer v1.0
Special Purpose Quantum Annealing Quantum Computer v1.0Aditya Yadav
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Spark Summit
 
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...Big Data Spain
 
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...Mathieu Dumoulin
 

What's hot (20)

Titan and Cassandra at WellAware
Titan and Cassandra at WellAwareTitan and Cassandra at WellAware
Titan and Cassandra at WellAware
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
On-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsOn-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy Models
 
Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUs
 
Taste Java In The Clouds
Taste Java In The CloudsTaste Java In The Clouds
Taste Java In The Clouds
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
Scaling hadoopapplications
Scaling hadoopapplicationsScaling hadoopapplications
Scaling hadoopapplications
 
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...
 
Hadoop ensma poitiers
Hadoop ensma poitiersHadoop ensma poitiers
Hadoop ensma poitiers
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
 
MapR and Machine Learning Primer
MapR and Machine Learning PrimerMapR and Machine Learning Primer
MapR and Machine Learning Primer
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
 
Pivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache HadoopPivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache Hadoop
 
Special Purpose Quantum Annealing Quantum Computer v1.0
Special Purpose Quantum Annealing Quantum Computer v1.0Special Purpose Quantum Annealing Quantum Computer v1.0
Special Purpose Quantum Annealing Quantum Computer v1.0
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
 
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
 
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
 

Similar to Approximation algorithms for stream and batch processing

Criteo TektosData Meetup
Criteo TektosData MeetupCriteo TektosData Meetup
Criteo TektosData MeetupOlivier Koch
 
Powering the "As it Happens" Business
Powering the "As it Happens" BusinessPowering the "As it Happens" Business
Powering the "As it Happens" BusinessMapR Technologies
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewAbhishek Roy
 
REX Hadoop et R
REX Hadoop et RREX Hadoop et R
REX Hadoop et Rpkernevez
 
Getting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixGetting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixNicolas Morales
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Cécile Poyet
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Hortonworks
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Cécile Poyet
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...MSAdvAnalytics
 
Hello Streams Overview
Hello Streams OverviewHello Streams Overview
Hello Streams Overviewpsanet
 
High Performance Computing on NYC Yellow Taxi Data Set
High Performance Computing on NYC Yellow Taxi Data SetHigh Performance Computing on NYC Yellow Taxi Data Set
High Performance Computing on NYC Yellow Taxi Data SetParag Ahire
 
Harnessing Big Data_UCLA
Harnessing Big Data_UCLAHarnessing Big Data_UCLA
Harnessing Big Data_UCLAPaul Barsch
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNATomas Cervenka
 
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Carol McDonald
 
Hadoop and the Future of SQL: Using BI Tools with Big Data
Hadoop and the Future of SQL: Using BI Tools with Big DataHadoop and the Future of SQL: Using BI Tools with Big Data
Hadoop and the Future of SQL: Using BI Tools with Big DataSenturus
 
Big Data Pipelines and Machine Learning at Uber
Big Data Pipelines and Machine Learning at UberBig Data Pipelines and Machine Learning at Uber
Big Data Pipelines and Machine Learning at UberSudhir Tonse
 
SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...
SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...
SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...Cubic Corporation
 
S2DS London 2015 - Hadoop Real World
S2DS London 2015 - Hadoop Real WorldS2DS London 2015 - Hadoop Real World
S2DS London 2015 - Hadoop Real WorldSean Roberts
 
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Sarah Aerni
 
Spark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWXSpark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWXKirk Haslbeck
 

Similar to Approximation algorithms for stream and batch processing (20)

Criteo TektosData Meetup
Criteo TektosData MeetupCriteo TektosData Meetup
Criteo TektosData Meetup
 
Powering the "As it Happens" Business
Powering the "As it Happens" BusinessPowering the "As it Happens" Business
Powering the "As it Happens" Business
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
REX Hadoop et R
REX Hadoop et RREX Hadoop et R
REX Hadoop et R
 
Getting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixGetting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with Bluemix
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
 
Hello Streams Overview
Hello Streams OverviewHello Streams Overview
Hello Streams Overview
 
High Performance Computing on NYC Yellow Taxi Data Set
High Performance Computing on NYC Yellow Taxi Data SetHigh Performance Computing on NYC Yellow Taxi Data Set
High Performance Computing on NYC Yellow Taxi Data Set
 
Harnessing Big Data_UCLA
Harnessing Big Data_UCLAHarnessing Big Data_UCLA
Harnessing Big Data_UCLA
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
 
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
 
Hadoop and the Future of SQL: Using BI Tools with Big Data
Hadoop and the Future of SQL: Using BI Tools with Big DataHadoop and the Future of SQL: Using BI Tools with Big Data
Hadoop and the Future of SQL: Using BI Tools with Big Data
 
Big Data Pipelines and Machine Learning at Uber
Big Data Pipelines and Machine Learning at UberBig Data Pipelines and Machine Learning at Uber
Big Data Pipelines and Machine Learning at Uber
 
SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...
SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...
SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...
 
S2DS London 2015 - Hadoop Real World
S2DS London 2015 - Hadoop Real WorldS2DS London 2015 - Hadoop Real World
S2DS London 2015 - Hadoop Real World
 
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
 
Spark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWXSpark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWX
 

Recently uploaded

Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 

Recently uploaded (20)

Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service

Approximation algorithms for stream and batch processing

  • 1. Copyright © 2014 Improve Digital - All Rights Reserved. Approximation algorithms for stream and batch processing. Gabriele Modena, Data Scientist, Improve Digital. E: g.modena@improvedigital.com
  • 2. Real Time Advertisement Technology: Media Owners, Advertisers
  • 3. Adtech 101
      – <150 msec
      – Geographically distributed adserver fleet
      – 200+ billion events / month
      – Hundreds of TB in a Hadoop cluster
  • 4. Improve Digital data platform
      – Reporting & BI: How much revenue did publisher X generate last month? Which are the top advertisers?
      – Trend analysis: Is the day-to-day traffic on site Y increasing or decreasing?
      – Fraud detection: Is the traffic legitimate or coming from a botnet?
      – Predictive modelling: How likely is this impression to generate a click or a conversion?
      – Pattern recognition: How are advertisers bidding and buying on inventory? Who is our audience?
  • 5. Historically
      – Batch pipelines
      – Incremental processing
      – Realtime pipelines
      – Monitoring and trend analysis
      – Batch dataset != Realtime dataset
      – Batch models != Realtime models
  • 6. Goals
      – Write jobs once
      – Unify models and
      – Analytics codebase
      – Dataset semantics
      – Experimentation
  • 7. Analytics Architecture [diagram: real-time log collection; brokerage (Kafka + Samza); processing (YARN + Spark + MapReduce); results published to a database, HDFS and Redis; push / expose]
  • 8. Kafka and Samza
      – Kafka (http://kafka.apache.org) as a distributed message queue
      – Topic-based
      – Producers write, consumers read
      – Messages are persistently stored; topics can be re-read
      – We use Samza for coordinating ingestion, ETL and distributed stream processing
  • 9. Apache Spark
      – Spark (Zaharia et al. 2010)
      – “Iterative” computing
      – Generalization of MapReduce (Isard 2007)
      – Runs atop Hadoop (YARN)
      – Spark Streaming
      – Break data into batches and pass it to Spark engine (same API & data structures)
  • 10. Challenges
      – Conceptually everything is a stream
      – Satisfy a tradeoff between
          – Latency
          – Memory
          – Accuracy
      – On infinitely expanding datasets
  • 11. Make big data small: samples, sketches and summaries
  • 12. Reservoir Sampling (Vitter, 1985)
      – From an infinitely expanding dataset
      – With constant memory and in a single pass
      – Hard to parallelize
      – How to use samples to answer certain queries? Count distinct? TopK?
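The sampling step on this slide can be sketched in a few lines of Scala. This is a minimal illustration of the classic single-pass reservoir scheme (often called Algorithm R), not code from the talk; the object name `ReservoirSampling` and the `sample` signature are assumptions made for the example.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object ReservoirSampling {
  // Keeps a uniform random sample of at most k items from a stream of unknown length,
  // using O(k) memory and a single pass over the data.
  def sample[T](stream: Iterator[T], k: Int, rng: Random = new Random()): Vector[T] = {
    require(k > 0, "sample size must be positive")
    val reservoir = ArrayBuffer.empty[T]
    var seen = 0L
    stream.foreach { item =>
      seen += 1
      if (reservoir.size < k) {
        // Fill phase: the first k items are always kept.
        reservoir += item
      } else {
        // Draw j uniformly from [0, seen); the new item replaces slot j when j < k,
        // which happens with probability k / seen.
        val j = (rng.nextDouble() * seen).toLong
        if (j < k) reservoir(j.toInt) = item
      }
    }
    reservoir.toVector
  }
}
```

A call such as `ReservoirSampling.sample(events.iterator, 1000)` would keep 1000 items regardless of how long `events` is (`events` is a placeholder name). The mutable counter and buffer are what make the naive version awkward to parallelize, as the slide notes.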
  • 13. Cardinality estimation (count distinct): How many users are visiting a site?
  • 14. Claim: The cardinality of a multiset of uniformly-distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number in the set.
  • 15. Intuitively: HyperLogLog (Flajolet et al. 2008)
      1. Apply a hash function to each element and take the binary representation of the output
      2. If the maximum number of leading zeros observed is n, an estimate for the number of distinct elements in the set is 2^n
      3. Account for variance by averaging over subsets
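The three steps above can be sketched without any external library. The class below is a simplified illustration, not the Algebird implementation used in the talk: it assumes a 32-bit hash from the Scala standard library (`MurmurHash3.stringHash`), uses the first `p` bits to pick a subset (register), averages the registers with a harmonic mean, and applies only the small-range correction. The name `SimpleHll` is hypothetical.

```scala
import scala.util.hashing.MurmurHash3

final class SimpleHll(p: Int) {
  require(p >= 4 && p <= 16, "p must be between 4 and 16")
  private val m = 1 << p                      // number of registers (subsets)
  private val registers = new Array[Int](m)
  // Bias-correction constant from the HyperLogLog paper.
  private val alpha = m match {
    case 16 => 0.673
    case 32 => 0.697
    case 64 => 0.709
    case _  => 0.7213 / (1.0 + 1.079 / m)
  }

  def add(item: String): Unit = {
    val h = MurmurHash3.stringHash(item)      // step 1: hash the element
    val idx = h >>> (32 - p)                  // first p bits select the register
    val rest = h << p                         // remaining bits, left-aligned
    // step 2: leading zeros of the remaining bits, plus one (capped if all bits are zero)
    val rank = math.min(Integer.numberOfLeadingZeros(rest), 32 - p) + 1
    if (rank > registers(idx)) registers(idx) = rank
  }

  // step 3: combine the per-register maxima with a harmonic mean
  def estimate: Double = {
    val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
    val zeros = registers.count(_ == 0)
    // Small-range correction (linear counting) while many registers are still empty.
    if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
  }
}
```

With `p = 12` the sketch holds 4096 small registers no matter how many elements are added, and adding the same element twice never changes the state, which is what makes it a count-distinct estimator.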
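  • 16. HyperLogLog (with Spark + Algebird)
        import com.twitter.algebird.HyperLogLogMonoid

        // 12 bits => 2^12 registers
        val hll = new HyperLogLogMonoid(12)

        // users is assumed to be a DStream of user ids: sketch each id, then merge the sketches of each batch
        val approxUsers = users.mapPartitions(ids => ids.map(uuid => hll(uuid.getBytes))).reduce(_ + _)

        // Running global sketch, merged batch by batch (the slide's undefined globalHll is replaced by hll)
        var h = hll.zero
        approxUsers.foreach(rdd => {
          if (rdd.count() != 0) {
            val partial = rdd.first()
            h += partial
          }
        })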
  • 17. HyperLogLog (< 2% error rate in 15 kB) [chart: exact vs. approximate count, memory]
  • 18. Frequency estimation: Top 10 most visited sites (out of a few million)?
  • 19. Count-Min Sketch (Cormode and Muthukrishnan, 2005): It’s the hashing trick!
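To make the "hashing trick" concrete, here is a minimal Count-Min Sketch in plain Scala. It is an illustrative sketch under stated assumptions, not the Algebird class used in the talk: rows are hashed with `MurmurHash3.stringHash` seeded by the row index (a convenient stand-in for a pairwise-independent hash family), and the name `SimpleCountMinSketch` is hypothetical.

```scala
import scala.util.hashing.MurmurHash3

final class SimpleCountMinSketch(eps: Double, delta: Double) {
  // Standard sizing: width = ceil(e / eps), depth = ceil(ln(1 / delta)).
  val width: Int = math.ceil(math.E / eps).toInt
  val depth: Int = math.ceil(math.log(1.0 / delta)).toInt
  private val table = Array.ofDim[Long](depth, width)
  private var total = 0L

  // One hash function per row, obtained by seeding MurmurHash3 with the row index.
  private def column(item: String, row: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(item, row), width)

  // Increment one counter per row.
  def add(item: String, count: Long = 1L): Unit = {
    for (row <- 0 until depth) table(row)(column(item, row)) += count
    total += count
  }

  // Collisions can only inflate a counter, so the minimum over rows is the tightest estimate.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(column(item, row))).min

  def totalCount: Long = total
}
```

With `eps = 0.01` and `delta = 1E-3` the table is 7 rows by 272 columns. The estimate never undercounts, and with ideal hash functions it overcounts by at most `eps * totalCount` with probability at least `1 - delta`.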
  • 20. CMS (with Spark + Algebird)
        import com.twitter.algebird.CountMinSketchMonoid

        val eps = 0.01
        val delta = 1E-3
        val seed = 1
        val perc = 0.003

        // publishers is assumed to be a DStream of publisher ids: one sketch per id, merged per batch
        val approxImpressions = publishers.mapPartitions(ids => {
          val cms = new CountMinSketchMonoid(delta, eps, seed, perc)
          ids.map(publisher_id => cms.create(publisher_id.toLong))
        }).reduce(_ ++ _)

        // Running global sketch (the slide's undefined approxTopUsers is replaced by approxImpressions)
        var globalCMS = new CountMinSketchMonoid(delta, eps, seed, perc).zero
        approxImpressions.foreach(rdd => {
          if (rdd.count() != 0) {
            val partial = rdd.first()
            globalCMS ++= partial
            // Top 5 heavy hitters by estimated frequency
            val globalTopK = globalCMS.heavyHitters.map(id => (id, globalCMS.frequency(id).estimate)).toSeq.sortBy(_._2).reverse.slice(0, 5)
          }
        })
  • 21. CMS results [chart: exact vs. approximate]
  • 22. Learning from data
  • 23. Iterative methods are hard to scale in MapReduce
  • 24. Using sketches
      – Linear regression
          – OLS + SGD on batches of data
          – Recursive Least Squares with Forgetting (Vahidi et al. 2005)
      – Streaming k-means (Ailon et al. 2009, Shindler et al. 2011, Ostrovsky et al. 2012)
          – Single iteration-to-convergence
          – Use sketches to reduce dimensionality (k log N centroids)
          – Mini-batch updates + forgetfulness
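As an illustration of the regression bullet, the textbook recursive least squares update with a forgetting factor can be written as below. This is a generic sketch, not necessarily the exact variant from Vahidi et al.; the class name, the default forgetting factor `lambda = 0.99` and the initial covariance scale `delta = 1000.0` are assumptions chosen for the example.

```scala
final class RecursiveLeastSquares(dim: Int, lambda: Double = 0.99, delta: Double = 1000.0) {
  require(lambda > 0.0 && lambda <= 1.0, "forgetting factor must be in (0, 1]")
  // Current coefficient estimate.
  val theta: Array[Double] = Array.fill(dim)(0.0)
  // Inverse covariance estimate P, initialised to delta * I (a weak prior).
  private val p: Array[Array[Double]] =
    Array.tabulate(dim, dim)((i, j) => if (i == j) delta else 0.0)

  def predict(x: Array[Double]): Double =
    x.indices.map(i => x(i) * theta(i)).sum

  // One observation at a time: O(dim^2) work and memory, no pass over past data.
  def update(x: Array[Double], y: Double): Unit = {
    val px = Array.tabulate(dim)(i => (0 until dim).map(j => p(i)(j) * x(j)).sum) // P x
    val denom = lambda + (0 until dim).map(i => x(i) * px(i)).sum                 // lambda + x' P x
    val gain = px.map(_ / denom)                                                  // K = P x / denom
    val err = y - predict(x)
    for (i <- 0 until dim) theta(i) += gain(i) * err                              // theta += K * err
    // P = (P - K x' P) / lambda; x' P equals (P x)' because P is symmetric.
    for (i <- 0 until dim; j <- 0 until dim)
      p(i)(j) = (p(i)(j) - gain(i) * px(j)) / lambda
  }
}
```

Calling `update(Array(1.0, x), y)` for each incoming pair fits an intercept and a slope; a `lambda` below 1 down-weights old observations geometrically, which is the "forgetting" part.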
  • 25. Conclusion
      – Streaming is part of the broader system
      – Approximation can help us scale both streaming and batch loads
          – Make “big data” small
          – Unification
      – Data collection and distribution is key
          ▪ Publishing results follows
      – Large scale analytics = Architecture + Algos + Data Structures