Data infrastructure for a world of music
Lars Albertsson, Data Engineer @Spotify
Focus on challenges & needs
1. Clients generate data
2. ???
3. Make profit
Users create data
Why data?
Reporting to partners, from day 1
Record labels, ad buyers, marketing
Analytics
KPIs, Ads, Business insights: growth, retention, funnels
Features
Recommendations, search, top lists, notifications
Product development
A/B testing
Operations
Root cause analysis, latency, planning
Customer support
Legal
Data purpose
Different needs: speed vs quality
Reporting to partners, from day 1
Record labels, ad buyers, marketing (daily + monthly)
Analytics
KPIs, Ads, Business insights: growth, retention, funnels
Features
Recommendations, search, top lists, notifications
Product development
A/B testing
Operations
Root cause analysis, latency, planning
Customer support
Legal
Data purpose
Most user actions
Played songs
Playlist modifications
Web navigation
UI navigation
Service state changes
User
Notifications
Incoming
Content
Social integration
Data purpose
What data?
26M monthly active users
6M subscribers
55 markets
20M songs, 20K new / day
1.5B playlists
4 data centres
10 TB from users / day
400 GB from services / day
61 TB generated in Hadoop / day
600 Hadoop nodes
6500 MapReduce jobs / day
18PB in HDFS
Data purpose
Much data?
Data purpose
Data is true
Get raw data
Refine
Make it useful
Data infrastructure
2008:
> for h in all_hosts; do rsync ${h}:/var/log/syslog /incoming/$h/$date; done
> echo '0 * * * * run_all_hourly_jobs.sh' | crontab
Dump to Postgres, make graph
Still living with some of this…
Data infrastructure
It all started very basic
Data infrastructure
Collect, crunch, use/display
[Diagram: Gateway and Playlist service (service DB) logs flow over the Kafka message bus (Kafka@lon) into HDFS; MapReduce produces SQL reports and Cassandra recommendations]
Data infrastructure
Fault scenarios
[Same diagram, with fault scenarios marked at the gateways, Playlist service, service DB, Kafka message bus, cross-site Kafka@lon routing, HDFS, MapReduce, SQL reports and Cassandra recommendations]
Most datasets are produced daily
Consumers want data after morning coffee
For each line, bottom level represents a good day
Destabilisation is the norm
Delay factors all over the infrastructure - client to display
Producers are not stakeholders
Data infrastructure
Shit happens
Get raw data from clients through GWs
GWs
Service logs
Service databases
To HDFS
Data collection
Data collection
Sources of truth
[Diagram: Gateway, Playlist service, service DB, logs, Kafka message bus (Kafka@lon), HDFS; MapReduce?]
Need to wait for “all” data for a time slot (hour)
What is all?
Can we get all?
Most consumers want 9x% quickly.
Reruns are complex.
1. Rsync from hosts. Get list from hosts DB.
- Rsync fragile, frequent network issues.
- DB info often stale
- Often waiting for dead host or omitting host
2. Push logs over Kafka. Wait for hosts according to hosts DB.
+ Kafka better. Application level cross-site routing.
- Kafka unreliable by design. Implement end-to-end acking.
3. Use Kafka as in #2. Determine active hosts by snooping metrics.
+ Reliable? host metric.
- End-to-end stability and host enumeration not scalable (see the completeness-check sketch below).
Data collection
Log collection evolution
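Below is a minimal sketch of the completeness check implied by approaches #2 and #3: compare the hosts expected to deliver logs for an hourly bucket (from the hosts DB or snooped metrics) against the hosts whose logs have actually landed. The function, threshold and host names are illustrative assumptions, not Spotify's actual tooling.

from datetime import datetime

def bucket_complete(expected_hosts, delivered_hosts, required_ratio=0.99):
    # expected_hosts: hosts believed active during the hour (hosts DB or metrics)
    # delivered_hosts: hosts whose logs have actually landed in HDFS
    expected = set(expected_hosts)
    delivered = set(delivered_hosts)
    missing = sorted(expected - delivered)
    ratio = len(expected & delivered) / float(len(expected)) if expected else 1.0
    return ratio >= required_ratio, missing, ratio

# Example: 3 of 4 expected gateways have delivered -> 75%, keep the bucket open.
ok, missing, ratio = bucket_complete(["gw1", "gw2", "gw3", "gw4"],
                                     ["gw1", "gw2", "gw4"])
print(datetime.utcnow().strftime("%Y-%m-%dT%H"), ok, missing, round(ratio, 2))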
Single solution cannot fit all needs.
Choose reliability or low latency.
Reliable path with store and forward
Service hosts must not store state.
Synchronous handoff to HA Kafka with large replay buffer (producer sketch below)
Best effort path similar
No acks, asynchronous handoff
Message producers know appropriate semantics
For critical data: handoff failure -> stop serving users
Measuring loss is essential
Data collection
Log collection future
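A sketch of the two handoff paths described above, written against the modern kafka-python client rather than the producer Spotify ran at the time; the topic names and the stop-serving reaction are assumptions for illustration.

from kafka import KafkaProducer  # pip install kafka-python

# Reliable path: synchronous handoff, wait for acknowledgement from all
# in-sync replicas before considering the message delivered.
reliable = KafkaProducer(bootstrap_servers="kafka:9092", acks="all", retries=5)

# Best-effort path: no acks, asynchronous fire-and-forget handoff.
best_effort = KafkaProducer(bootstrap_servers="kafka:9092", acks=0)

def log_critical(event_bytes):
    try:
        # Block until the broker acks; raises on failure.
        reliable.send("critical-events", event_bytes).get(timeout=10)
    except Exception:
        # For critical data: handoff failure -> stop serving users.
        raise RuntimeError("Kafka handoff failed, refusing to serve")

def log_best_effort(event_bytes):
    best_effort.send("metrics-events", event_bytes)  # loss possible, measured elsewhere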
~1% loss is ok, assuming that it is measured (see the measurement sketch below)
Few % time slippage is ok, if unbiased
Biased slippage is not ok
Timestamp to use for bucketing: client, GW, HDFS?
Some components are HA (Cassandra, ZooKeeper). Most are unreliable. Client devices are very unreliable.
Buffers in “stateless” components cause loss.
Crunching delay is inconvenient. Crunching wrong data is expensive.
Data crunching
Data is false?
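A small sketch of the measurement the bullets above call essential: loss rate (produced vs landed records) and timestamp slippage out of the hourly bucket. The record layout is hypothetical; only the ~1% figure comes from the slide.

def bucket_quality(produced_count, landed_records, bucket_start_epoch, bucket_seconds=3600):
    # Return loss ratio and timestamp-slippage ratio for one hourly bucket.
    landed_count = len(landed_records)
    loss = 1.0 - (landed_count / float(produced_count)) if produced_count else 0.0

    # Slippage: client timestamp falls outside the bucket the record landed in.
    slipped = 0
    for rec in landed_records:
        ts = rec["client_timestamp"]
        if ts < bucket_start_epoch or ts >= bucket_start_epoch + bucket_seconds:
            slipped += 1
    slipped_ratio = slipped / float(landed_count) if landed_count else 0.0
    return {"loss": loss, "slipped_ratio": slipped_ratio}

# 1000 events produced at the gateways, 992 landed -> 0.8% loss, acceptable if measured.
stats = bucket_quality(
    produced_count=1000,
    landed_records=[{"client_timestamp": 3600 * 10 + i} for i in range(992)],
    bucket_start_epoch=3600 * 10)
assert stats["loss"] <= 0.01, "loss above 1%, investigate before crunching"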
Core databases dumped daily (user x 2, playlist, metadata)
Determinism required - delays inevitable
Slave replication issues common
No good solution:
Sqoop live - non-deterministic
Postgres commit log replay - not scalable
Cassandra full dumps - resource heavy
Solution - convert to event processing?
Experimenting with Netflix Aegisthus for Cassandra -> HDFS
Facebook has MySQL commit log -> event conversion
Data collection
Database dumping
We have raw data, sorted by host and hour
We want e.g. active users by country and product over the last month
Data crunching
Data crunching
End goal example - business insights
1. Split by message type, per hour
2. Combine multiple sources for similar data, per day - a core dataset.
3. Join activity datasets, e.g. tracks played or user activity, with ornament datasets, e.g. track metadata, user demographics (a pipeline sketch follows below).
4a. Make reports for partners, e.g. labels, advertisers.
4b. Aggregate into SQL or add metadata for Hive exploration.
4c. Build indexes (search, top lists), denormalise, and put in Cassandra.
4d. Run machine learning (recommendations) and put in Cassandra.
4e. Make notification decisions and send out.
...
Data crunching
Typical data crunching
[Diagram: pipeline steps annotated with MR (MapReduce) and C* (Cassandra)]
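To make the steps above concrete, here is a minimal Luigi sketch (Luigi is Spotify's own open-source workflow tool) that turns a daily core dataset of plays into the "active users by country and product" aggregate mentioned earlier. The paths, tab-separated layout and field names are illustrative assumptions, not the real pipeline.

import csv
import luigi

class DailyPlays(luigi.ExternalTask):
    # Daily core dataset of plays: user_id, country, product (assumed layout).
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("data/plays/%s.tsv" % self.date.isoformat())

class ActiveUsersByCountryProduct(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return DailyPlays(self.date)

    def output(self):
        return luigi.LocalTarget("data/active_users/%s.tsv" % self.date.isoformat())

    def run(self):
        seen = {}  # (country, product) -> set of user_ids
        with self.input().open("r") as plays:
            for user_id, country, product in csv.reader(plays, delimiter="\t"):
                seen.setdefault((country, product), set()).add(user_id)
        with self.output().open("w") as out:
            for (country, product), users in sorted(seen.items()):
                out.write("%s\t%s\t%d\n" % (country, product, len(users)))

if __name__ == "__main__":
    luigi.run()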
Data crunching
Core dataset example: users
Generate - organic
Transfer - Kafka
Process - Python MapReduce. Bad idea.
Big data ecosystem is 99% JVM -> moving to Crunch
Test - in production.
Not acceptable. Working on it. No available tools.
Deploy - CI + Debian packages.
Low isolation. Looking at containers (Docker).
Monitor - organic
Cycle time for code-test-debug: 21 days
Data crunching
Data processing platform
Online storage: Cassandra, Postgres
Offline storage: HDFS
Transfer: Kafka, Sqoop
Processing engine: Hadoop MapReduce on YARN
Processing languages: Luigi Python MapReduce, Crunch, Pig
Mining: Hive, Postgres, Qlikview
Real-time processing: Storm (mostly experimental)
Trying out:
Spark - better for iterative algorithms (ML), future of MapReduce?
Giraph and other graph tools
More stable infrastructure: Docker, Azkaban
Data crunching
Technology stack
def mapper(self, items):
    for item in items:
        if item.type == 'EndSong':
            yield (item.track_id, 1, item)
        else:  # Track metadata: 0 sorts before 1, so metadata reaches the reducer first
            yield (item.track_id, 0, item)

def reducer(self, key, values):
    for item in values:
        if item.type != 'EndSong':
            meta = item
        else:
            yield add_meta(meta, item)
Data crunching
Crunching tools - four joins
select * from tracks inner join metadata on tracks.track_id = metadata.track_id;

join tracks by track_id, metadata by track_id;

PTable<String, Pair<EndSong, TrackMeta>> joined = Join.innerJoin(endSongTable, metaTable);
Vanilla MapReduce - fragile
SQL / Hive - exploration & display
Pig - deprecated
Crunch - future for processing pipelines
Lots of opportunities in PBs of data.
Opportunities to get lost.
Organising data
Mostly organic - frequent discrepancies
Agile feature dev -> easy schema change
Currently requires client lib release
Avro meta format in backend (example schema sketched below)
Good Hadoop integration
Not best option in client
Some clients are hard to upgrade, e.g. old phones, hifi, cars.
Utopic (aka Google): client schema change -> automatic Hive/SQL/dashboard/report change
Data crunching
Schemas
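For illustration, a hypothetical Avro schema and writer for an EndSong-like message using the fastavro library; the field set is invented for the example and is not Spotify's actual schema. Note the default on the added field, which is what keeps old readers working.

from fastavro import parse_schema, writer

# Hypothetical EndSong schema; the real field set is not shown in the talk.
end_song_schema = parse_schema({
    "type": "record",
    "name": "EndSong",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "track_id", "type": "string"},
        {"name": "ms_played", "type": "long"},
        {"name": "client_timestamp", "type": "long"},
        # New fields need a default to stay compatible with older readers.
        {"name": "product", "type": "string", "default": "free"},
    ],
})

records = [{"user_id": "u1", "track_id": "t1", "ms_played": 215000,
            "client_timestamp": 1400000000, "product": "premium"}]

with open("end_song.avro", "wb") as out:
    writer(out, end_song_schema, records)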
Today:
from datetime import datetime

if date < datetime(2012, 10, 17):
    ...  # use old format
else:
    ...  # use new format
Not scalable
Few tools available. HCatalog?
Solution(?): Encapsulate each dataset in a library. Owners decide compatibility vs reformat strategy. Version the interface. (Twitter) - sketched below
Data crunching
Data evolution
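A sketch of the "encapsulate each dataset in a library" idea: the dataset owner hides the format cut-over behind a versioned reader, so consumers never branch on dates themselves. The field layout and cut-over details are hypothetical.

from datetime import date

FORMAT_CHANGE = date(2012, 10, 17)  # the cut-over only the dataset owner knows about

def read_end_song_v1(line, day):
    # Versioned reader: consumers call this instead of writing date comparisons.
    fields = line.rstrip("\n").split("\t")
    if day < FORMAT_CHANGE:
        # Old format: user_id, track_id (hypothetical)
        user_id, track_id = fields
        ms_played = None
    else:
        # New format adds ms_played
        user_id, track_id, ms_played = fields
        ms_played = int(ms_played)
    return {"user_id": user_id, "track_id": track_id, "ms_played": ms_played}

# Consumers depend on read_end_song_v1; a breaking change becomes read_end_song_v2.
print(read_end_song_v1("u1\tt1\t215000", date(2014, 5, 1)))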
Many redundant calculations
Data discovery
Home-grown tool
Retention policy
Save the raw data (S3)
Be brutal and delete
Data crunching
What is out there?
Technology is easy to change, humans are hard
Our most difficult challenges are cultural
Organising yourself
Failing jobs, dead jobs
Dead data
Data growth
Reruns
Isolation
Configuration, memory, disk, Hadoop resources
Technical debt
Testing, deployment, monitoring, remediations
Cost
Be stringent with software engineering practices or suffer. Most data organisations suffer.
Data crunching
Staying in control
History:
Data service department
Core data + platform department
Data platform department
Self-service spurs data usage
Data producers and consumers have domain knowledge
Data infrastructure engineers do not
Data producers prioritise online services over offline
Producing and consuming is closely tied, yet often organisationally separated
Data crunching
Who owns what?
Dos:
Solve domain-specific or unsolved things
Use stuff from leaders (Kafka)
Monitor aggressively
Have 50+% backend engineers
Focus on the data feature developer's needs
Separate raw and generated data
Hadoop was a good bet, Spark even better?
Data crunching
Things learnt in the fire
Don’ts:
Choose your own path (Python)
Use ad-hoc formats
Build stuff with < 3 years horizon
Accumulate debt
Use SQL in data pipelines
Have SPOFs - no excuse anymore
Rely on host configurations
Collect data with pull
Vanilla MapReduce
“Data is special” - no SW practices
Innovation originates at Google (~10^7 data dedicated machines)
MapReduce, GFS, Dapper, Pregel, Flume
Open source variants by the big dozen (10^5 - 10^6)
Yahoo, Netflix, Twitter, LinkedIn, Amazon, Facebook. US only
Hadoop, HDFS, ZooKeeper, Giraph, Crunch.
Cassandra
Improved by serious players (10^3 - 10^4)
Spotify, AirBnB, FourSquare, Prezi, King. Mostly US
Used by beginners (10^1 - 10^2)
Big Data innovation
Innovation in Big Data - four tiers
Not much in infrastructure:
Supercomputing legacy
MPI still in use
Berkeley: Spark, Mesos
Cooperation with Yahoo and Twitter
Containers
Xen, VMware
Data processing theory:
Bloom filters, stream processing (e.g. Count-Min Sketch, sketched below)
Machine learning
Big Data innovation
Innovation from academia
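As a concrete instance of the stream-processing theory mentioned above, a tiny Count-Min Sketch: approximate per-key counts (e.g. plays per track) in bounded memory. The width, depth and hashing scheme are chosen for brevity, not tuned.

import hashlib

class CountMinSketch:
    # Approximate counting: estimates never undercount, may overcount slightly.

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        digest = hashlib.md5(("%d:%s" % (row, key)).encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key):
        return min(self.table[row][self._index(row, key)] for row in range(self.depth))

cms = CountMinSketch()
for track in ["t1", "t1", "t2", "t1"]:
    cms.add(track)
print(cms.estimate("t1"), cms.estimate("t2"))  # 3 1 (possibly overestimated)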
Fluid architectures / private clouds
Large pools of machines
Services and jobs are independent of hosts
Mesos, Curator are scratching at the problem
Google Borg = Utopia
LAMP stack for Big Data
End to end developer testing
Client modification to insights SQL change
Running on developer machine, in IDE (see the test sketch below)
Scale is not an issue - efficiency & productivity is
Big Data innovation
Innovation is needed, examples
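A hint of what end-to-end developer testing on a laptop could look like: keep pipeline logic in plain functions and exercise them with in-memory records in an ordinary unit test, no cluster required. The function and test are hypothetical, reusing the aggregate from the earlier pipeline sketch.

def active_users_by_country_product(plays):
    # plays: iterable of (user_id, country, product) tuples
    seen = {}
    for user_id, country, product in plays:
        seen.setdefault((country, product), set()).add(user_id)
    return {key: len(users) for key, users in seen.items()}

def test_active_users_counts_each_user_once():
    plays = [("u1", "SE", "premium"), ("u1", "SE", "premium"), ("u2", "SE", "free")]
    counts = active_users_by_country_product(plays)
    assert counts[("SE", "premium")] == 1
    assert counts[("SE", "free")] == 1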