SlideShare a Scribd company logo
1 of 22
Download to read offline
How Conviva used Spark to Speed
Up Video Analytics by 30x
Dilip Antony Joseph (@DilipAntony)
Conviva monitors and optimizes online
video for premium content providers
What happens if you don’t use Conviva?
65 million video streams a day
Conviva data processing architecture

Monitoring	
  Dashboard	
  
Live	
  data	
  
processing	
  
Reports	
  

Video	
  Player	
  
Hadoop	
  for	
  
historical	
  data	
  

Spark	
  

Ad-­‐hoc	
  
analysis	
  
Group By queries dominate 
our Reporting workload
SELECT	
  videoName,	
  COUNT(1)	
  
FROM	
  summaries	
  
WHERE	
  date='2012_08_22'	
  AND	
  customer='XYZ’	
  
GROUP	
  BY	
  videoName;	
  

10s	
  of	
  metrics,	
  10s	
  of	
  group	
  bys	
  
Group By query in Spark
val	
  sessions	
  =	
  sparkContext.sequenceFile[SessionSummary,	
  NullWritable](	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  pathToSessionSummaryOnHdfs,	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
	
  	
  	
  	
  	
  	
  	
   	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  classOf[SessionSummary],	
  classOf[NullWritable])	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  .flatMap	
  {	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  case	
  (key,	
  val)	
  =>	
  val.fieldsOfInterest	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  }	
  
	
  
val	
  cachedSessions	
  =	
  sessions.filter(	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  whereCondifonToFilterSessionsForTheDesiredDay)	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  .cache	
  
val	
  mapFn	
  :	
  SessionSummary	
  =>	
  (String,	
  Long)	
  =	
  {	
  s	
  =>	
  (s.videoName,	
  1)	
  }	
  
	
  
val	
  reduceFn	
  :	
  (Long,	
  Long)	
  =>	
  Long	
  =	
  {	
  (a,b)	
  =>	
  a+b	
  }	
  
	
  
val	
  results	
  =	
  cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap	
  
Spark is 30x faster than Hive
45	
  minutes	
  versus	
  24	
  hours	
  	
  
for	
  weekly	
  Conviva	
  Geo	
  Report	
  
How much data are we talking about?
•  150	
  GB	
  /	
  week	
  of	
  compressed	
  summary	
  data	
  
•  Compressed	
  ~	
  3:1	
  
•  Stored	
  as	
  Hadoop	
  Sequence	
  Files	
  
	
  
•  Custom	
  binary	
  serializafon	
  format	
  
Spark is faster because it avoids
reading data from disk multiple times
Group	
  By	
  Country	
  
Read	
  from	
  HDFS	
  
Decompress	
  
Deserialize	
  
Group	
  By	
  State	
  
Read	
  from	
  HDFS	
  
Decompress	
  
Deserialize	
  
Group	
  By	
  Video	
  
Read	
  from	
  HDFS	
  
Decompress	
  
Deserialize	
  
10s	
  of	
  Group	
  Bys	
  …	
  

Cache	
  only	
  
columns	
  of	
  
interest	
  

Hive/
MapReduce	
  
startup	
  
overhead	
  
Overhead	
  of	
  
flushing	
  
intermediate	
  
data	
  to	
  disk	
  

Spark	
  
Group	
  By	
  Country	
  
Read	
  from	
  HDFS	
  
Decompress	
  
Deserialize	
  
Cache	
  data	
  in	
  memory	
  
Group	
  By	
  State	
  
Read	
  data	
  from	
  memory	
  
Group	
  By	
  Video	
  
Read	
  data	
  from	
  memory	
  
10s	
  of	
  Group	
  Bys	
  …	
  
Why not <some other big data system>?
•  Hive	
  
•  Mysql	
  
•  Oracle/SAP	
  HANA	
  
•  Column	
  oriented	
  dbs	
  
Spark just worked for us

• 
• 
• 
• 
• 

30x	
  speed-­‐up	
  
Great	
  fit	
  for	
  our	
  report	
  computafon	
  model	
  
Open-­‐source	
  
Flexible	
  
Scalable	
  
30% of our reports use Spark
We are working on more projects
that use Spark
•  Streaming	
  Spark	
  for	
  unifying	
  batch	
  and	
  live	
  
computafon	
  
•  SHadoop	
  –	
  Run	
  exisfng	
  Hadoop	
  jobs	
  on	
  Spark	
  
	
  
Problems with Spark

Monitoring	
  Dashboard	
  
Live	
  data	
  
processing	
  
Reports	
  

Video	
  Player	
  
Hadoop	
  for	
  
historical	
  data	
  

No	
  Logo	
  

Spark	
  

Ad-­‐hoc	
  
analysis	
  
Monitoring	
  Dashboard	
  
Live	
  data	
  
processing	
  
Reports	
  

Video	
  Player	
  
Hadoop	
  for	
  
historical	
  data	
  

Spark	
  

Ad-­‐hoc	
  
analysis	
  
Spark queries are not very succinct
SELECT	
  videoName,	
  COUNT(1)	
  
FROM	
  summaries	
  
WHERE	
  date='2012_08_22'	
  AND	
  customer='XYZ’	
  
GROUP	
  BY	
  videoName;	
  

Spark	
  

val	
  sessions	
  =	
  sparkContext.sequenceFile[SessionSummary,	
  NullWritable](	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  pathToSessionSummaryOnHdfs,	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
	
  	
  	
  	
  	
  	
  	
   	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  classOf[SessionSummary],	
  classOf[NullWritable])	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  .flatMap	
  {	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  case	
  (key,	
  val)	
  =>	
  val.fieldsOfInterest	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  }	
  
	
  
val	
  cachedSessions	
  =	
  sessions.filter(	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  whereCondifonToFilterSessionsForTheDesiredDay)	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  .cache	
  
val	
  mapFn	
  :	
  SessionSummary	
  =>	
  (String,	
  Long)	
  =	
  {	
  s	
  =>	
  (s.videoName,	
  1)	
  }	
  
	
  
val	
  reduceFn	
  :	
  (Long,	
  Long)	
  =>	
  Long	
  =	
  {	
  (a,b)	
  =>	
  a+b	
  }	
  
	
  
val	
  results	
  =	
  cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap	
  
There is a learning curve associated
with Scala, but ...
•  Type-­‐safety	
  offered	
  by	
  Scala	
  is	
  a	
  great	
  boon	
  
–  Code	
  complefon	
  via	
  Eclipse	
  Scala	
  plugin	
  

•  Complex	
  queries	
  are	
  easier	
  in	
  Scala	
  than	
  in	
  Hive	
  
–  Cascading	
  IF()s	
  in	
  Hive	
  

•  No	
  need	
  to	
  write	
  Scala	
  in	
  the	
  future	
  
–  Shark	
  
–  Java,	
  Python	
  APIs	
  
Additional Hiccups
•  Always	
  on	
  the	
  bleeding	
  edge	
  –	
  geung	
  
dependencies	
  right	
  
•  More	
  maintenance/debugging	
  tools	
  required	
  
Spark has been working great for
Conviva for over a year
We are Hiring
jobs@conviva.com	
  
hwp://www.conviva.com/blog/engineering/
using-­‐spark-­‐and-­‐hive-­‐to-­‐process-­‐bigdata-­‐at-­‐
conviva	
  	
  

More Related Content

What's hot

Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Spark Summit
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudDatabricks
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Spark Summit
 
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Summit
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeDatabricks
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid DataWorks Summit
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningSwiss Big Data User Group
 
Evolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainEvolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainMapR Technologies
 
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Spark Summit
 
Cloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ NetflixCloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ NetflixJerome Boulon
 
Realtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIORealtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIOJozo Kovac
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaDatabricks
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsJen Aman
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveDataWorks Summit/Hadoop Summit
 
Spark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed AwanSpark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed AwanSpark Summit
 

What's hot (20)

Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike Freedman
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta Lake
 
Deep Learning at Scale
Deep Learning at ScaleDeep Learning at Scale
Deep Learning at Scale
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid
 
Distributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop ClustersDistributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop Clusters
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
 
Evolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainEvolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and Rain
 
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
 
Cloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ NetflixCloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ Netflix
 
Realtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIORealtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIO
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
 
Spark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed AwanSpark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed Awan
 

Similar to How Conviva Used Spark to Speed Up Video Analytics 30x

Getting Started With Spark Structured Streaming With Dustin Vannoy | Current ...
Getting Started With Spark Structured Streaming With Dustin Vannoy | Current ...Getting Started With Spark Structured Streaming With Dustin Vannoy | Current ...
Getting Started With Spark Structured Streaming With Dustin Vannoy | Current ...HostedbyConfluent
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupGwen (Chen) Shapira
 
Getting Started with Spark Structured Streaming - Current 22
Getting Started with Spark Structured Streaming - Current 22Getting Started with Spark Structured Streaming - Current 22
Getting Started with Spark Structured Streaming - Current 22Dustin Vannoy
 
Copper: A high performance workflow engine
Copper: A high performance workflow engineCopper: A high performance workflow engine
Copper: A high performance workflow enginedmoebius
 
Docker & ECS: Secure Nearline Execution
Docker & ECS: Secure Nearline ExecutionDocker & ECS: Secure Nearline Execution
Docker & ECS: Secure Nearline ExecutionBrennan Saeta
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container EraSadayuki Furuhashi
 
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Codemotion
 
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Codemotion
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
Apache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning PlatformApache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning PlatformWangda Tan
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Onyx data processing the clojure way
Onyx   data processing  the clojure wayOnyx   data processing  the clojure way
Onyx data processing the clojure wayBahadir Cambel
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Spark streaming with kafka
Spark streaming with kafkaSpark streaming with kafka
Spark streaming with kafkaDori Waldman
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka Dori Waldman
 
Is your profiler speaking the same language as you? -- Docklands JUG
Is your profiler speaking the same language as you? -- Docklands JUGIs your profiler speaking the same language as you? -- Docklands JUG
Is your profiler speaking the same language as you? -- Docklands JUGSimon Maple
 
Spring Boot Revisited with KoFu and JaFu
Spring Boot Revisited with KoFu and JaFuSpring Boot Revisited with KoFu and JaFu
Spring Boot Revisited with KoFu and JaFuVMware Tanzu
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your EyesDemi Ben-Ari
 

Similar to How Conviva Used Spark to Speed Up Video Analytics 30x (20)

Getting Started With Spark Structured Streaming With Dustin Vannoy | Current ...
Getting Started With Spark Structured Streaming With Dustin Vannoy | Current ...Getting Started With Spark Structured Streaming With Dustin Vannoy | Current ...
Getting Started With Spark Structured Streaming With Dustin Vannoy | Current ...
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data Meetup
 
Getting Started with Spark Structured Streaming - Current 22
Getting Started with Spark Structured Streaming - Current 22Getting Started with Spark Structured Streaming - Current 22
Getting Started with Spark Structured Streaming - Current 22
 
Copper: A high performance workflow engine
Copper: A high performance workflow engineCopper: A high performance workflow engine
Copper: A high performance workflow engine
 
Docker & ECS: Secure Nearline Execution
Docker & ECS: Secure Nearline ExecutionDocker & ECS: Secure Nearline Execution
Docker & ECS: Secure Nearline Execution
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
 
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Apache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning PlatformApache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning Platform
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
 
Onyx data processing the clojure way
Onyx   data processing  the clojure wayOnyx   data processing  the clojure way
Onyx data processing the clojure way
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Spark streaming with kafka
Spark streaming with kafkaSpark streaming with kafka
Spark streaming with kafka
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka
 
Is your profiler speaking the same language as you? -- Docklands JUG
Is your profiler speaking the same language as you? -- Docklands JUGIs your profiler speaking the same language as you? -- Docklands JUG
Is your profiler speaking the same language as you? -- Docklands JUG
 
Spring Boot Revisited with KoFu and JaFu
Spring Boot Revisited with KoFu and JaFuSpring Boot Revisited with KoFu and JaFu
Spring Boot Revisited with KoFu and JaFu
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
 

Recently uploaded

Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 

Recently uploaded (20)

Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 

How Conviva Used Spark to Speed Up Video Analytics 30x

  • 1. How Conviva used Spark to Speed Up Video Analytics by 30x Dilip Antony Joseph (@DilipAntony)
  • 2. Conviva monitors and optimizes online video for premium content providers
  • 3. What happens if you don’t use Conviva?
  • 4. 65 million video streams a day
  • 5. Conviva data processing architecture Monitoring  Dashboard   Live  data   processing   Reports   Video  Player   Hadoop  for   historical  data   Spark   Ad-­‐hoc   analysis  
  • 6. Group By queries dominate our Reporting workload SELECT  videoName,  COUNT(1)   FROM  summaries   WHERE  date='2012_08_22'  AND  customer='XYZ’   GROUP  BY  videoName;   10s  of  metrics,  10s  of  group  bys  
  • 7. Group By query in Spark val  sessions  =  sparkContext.sequenceFile[SessionSummary,  NullWritable](                                                              pathToSessionSummaryOnHdfs,                                                                                                    classOf[SessionSummary],  classOf[NullWritable])                                                  .flatMap  {                                                                                                      case  (key,  val)  =>  val.fieldsOfInterest                                                                                }     val  cachedSessions  =  sessions.filter(                                                                                    whereCondifonToFilterSessionsForTheDesiredDay)                                                                              .cache   val  mapFn  :  SessionSummary  =>  (String,  Long)  =  {  s  =>  (s.videoName,  1)  }     val  reduceFn  :  (Long,  Long)  =>  Long  =  {  (a,b)  =>  a+b  }     val  results  =  cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap  
  • 8. Spark is 30x faster than Hive 45  minutes  versus  24  hours     for  weekly  Conviva  Geo  Report  
  • 9. How much data are we talking about? •  150  GB  /  week  of  compressed  summary  data   •  Compressed  ~  3:1   •  Stored  as  Hadoop  Sequence  Files     •  Custom  binary  serializafon  format  
  • 10. Spark is faster because it avoids reading data from disk multiple times
  • 11. Group  By  Country   Read  from  HDFS   Decompress   Deserialize   Group  By  State   Read  from  HDFS   Decompress   Deserialize   Group  By  Video   Read  from  HDFS   Decompress   Deserialize   10s  of  Group  Bys  …   Cache  only   columns  of   interest   Hive/ MapReduce   startup   overhead   Overhead  of   flushing   intermediate   data  to  disk   Spark   Group  By  Country   Read  from  HDFS   Decompress   Deserialize   Cache  data  in  memory   Group  By  State   Read  data  from  memory   Group  By  Video   Read  data  from  memory   10s  of  Group  Bys  …  
  • 12. Why not <some other big data system>? •  Hive   •  Mysql   •  Oracle/SAP  HANA   •  Column  oriented  dbs  
  • 13. Spark just worked for us •  •  •  •  •  30x  speed-­‐up   Great  fit  for  our  report  computafon  model   Open-­‐source   Flexible   Scalable  
  • 14. 30% of our reports use Spark
  • 15. We are working on more projects that use Spark •  Streaming  Spark  for  unifying  batch  and  live   computafon   •  SHadoop  –  Run  exisfng  Hadoop  jobs  on  Spark    
  • 16. Problems with Spark Monitoring  Dashboard   Live  data   processing   Reports   Video  Player   Hadoop  for   historical  data   No  Logo   Spark   Ad-­‐hoc   analysis  
  • 17. Monitoring  Dashboard   Live  data   processing   Reports   Video  Player   Hadoop  for   historical  data   Spark   Ad-­‐hoc   analysis  
  • 18. Spark queries are not very succinct SELECT  videoName,  COUNT(1)   FROM  summaries   WHERE  date='2012_08_22'  AND  customer='XYZ’   GROUP  BY  videoName;   Spark   val  sessions  =  sparkContext.sequenceFile[SessionSummary,  NullWritable](                                                              pathToSessionSummaryOnHdfs,                                                                                                    classOf[SessionSummary],  classOf[NullWritable])                                                  .flatMap  {                                                                                                      case  (key,  val)  =>  val.fieldsOfInterest                                                                                }     val  cachedSessions  =  sessions.filter(                                                                                    whereCondifonToFilterSessionsForTheDesiredDay)                                                                              .cache   val  mapFn  :  SessionSummary  =>  (String,  Long)  =  {  s  =>  (s.videoName,  1)  }     val  reduceFn  :  (Long,  Long)  =>  Long  =  {  (a,b)  =>  a+b  }     val  results  =  cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap  
  • 19. There is a learning curve associated with Scala, but ... •  Type-­‐safety  offered  by  Scala  is  a  great  boon   –  Code  complefon  via  Eclipse  Scala  plugin   •  Complex  queries  are  easier  in  Scala  than  in  Hive   –  Cascading  IF()s  in  Hive   •  No  need  to  write  Scala  in  the  future   –  Shark   –  Java,  Python  APIs  
  • 20. Additional Hiccups •  Always  on  the  bleeding  edge  –  geung   dependencies  right   •  More  maintenance/debugging  tools  required  
  • 21. Spark has been working great for Conviva for over a year
  • 22. We are Hiring jobs@conviva.com   hwp://www.conviva.com/blog/engineering/ using-­‐spark-­‐and-­‐hive-­‐to-­‐process-­‐bigdata-­‐at-­‐ conviva