The millions of people who use Spotify each day generate a lot of data, roughly a few terabytes per day. What does it take to handle datasets at that scale, and what can be done with them? I will briefly cover how Spotify uses data to provide a better music listening experience and to strengthen its business. Most of the talk will be spent on our data processing architecture and how we leverage state-of-the-art data processing and storage tools, such as Hadoop, Cassandra, Kafka, Storm, Hive, and Crunch. Lastly, I'll present observations and thoughts on innovation in the data processing, aka Big Data, field.
3. Why data?
Reporting to partners, from day 1
Record labels, ad buyers, marketing
Analytics
KPIs, Ads, Business insights: growth, retention, funnels
Features
Recommendations, search, top lists, notifications
Product development
A/B testing
Operations
Root cause analysis, latency, planning
Customer support
Legal
Data purpose
4. Different needs: speed vs quality
Reporting to partners, from day 1
Record labels, ad buyers, marketing (daily + monthly)
Analytics
KPIs, Ads, Business insights: growth, retention, funnels
Features
Recommendations, search, top lists, notifications
Product development
A/B testing
Operations
Root cause analysis, latency, planning
Customer support
Legal
5. Most user actions
Played songs
Playlist modifications
Web navigation
UI navigation
Service state changes
User
Notifications
Incoming
Content
Social integration
What data?
6. 26M monthly active users
6M subscribers
55 markets
20M songs, 20K new / day
1.5B playlists
4 data centres
10 TB from users / day
400 GB from services / day
61 TB generated in Hadoop / day
600 Hadoop nodes
6500 MapReduce jobs / day
18PB in HDFS
Much data?
10. 2008:
> for h in $(all_hosts); do rsync $h:/var/log/syslog /incoming/$h/$date; done
> echo '0 * * * * run_all_hourly_jobs.sh' | crontab
Dump to Postgres, make graph
Still living with some of this…
Data infrastructure
It all started very basic
11. Data infrastructure
Collect, crunch, use/display
[Diagram: clients hit the Gateway; the Gateway, Playlist service, and service DBs feed logs into the Kafka message bus (Kafka@lon for cross-site routing); logs land in HDFS, where MapReduce jobs produce SQL reports, Cassandra-stored recommendations, and more.]
13. Most datasets are produced daily
Consumers want data after morning coffee
For each line, bottom level represents a good day
Destabilisation is the norm
Delay factors all over the infrastructure, from client to display
Producers are not stakeholders
Shit happens
14. Get raw data from clients through GWs
GWs
Service logs
Service databases
To HDFS
Data collection
16. 1. Rsync from hosts. Get list from hosts DB.
- Rsync fragile, frequent network issues.
- DB info often stale
- Often waiting for dead host or omitting host
2. Push logs over Kafka. Wait for hosts according to hosts DB.
+ Kafka better. Application level cross-site routing.
- Kafka unreliable by design. Implement end-to-end acking.
3. Use Kafka as in #2. Determine active hosts by snooping metrics.
+ Reliable? host metric.
- End-to-end stability and host enumeration not scalable.
Log collection evolution
17. Single solution cannot fit all needs.
Choose reliability or low latency.
Reliable path with store and forward
Service hosts must not store state.
Synchronous handoff to HA Kafka with large replay buffer
Best effort path similar
No acks, asynchronous handoff
Message producers know appropriate semantics
For critical data: handoff failure -> stop serving users
Measuring loss is essential
Log collection future
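The two handoff semantics above can be sketched in a few lines of Python. This is a minimal illustration, not a real Kafka API: the `client` object and its `send`/`send_async` methods are hypothetical stand-ins for the producer library.

```python
class HandoffError(RuntimeError):
    """Raised when a critical message cannot be acked."""

def reliable_handoff(client, message, timeout=5.0):
    """Reliable path: synchronous handoff, block for the ack.

    For critical data, a failed handoff stops serving users
    rather than silently dropping messages."""
    if not client.send(message, timeout=timeout):
        raise HandoffError("no ack from HA Kafka tier")

def best_effort_handoff(client, message):
    """Best-effort path: asynchronous handoff, no acks."""
    client.send_async(message)
```

The key design point is that the producer, not the transport, chooses the semantics: a payment event goes through `reliable_handoff`, a UI click through `best_effort_handoff`.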
18. ~1% loss is ok, assuming that it is measured
Few % time slippage is ok, if unbiased
Biased slippage is not ok
Timestamp to use for bucketing: client, GW, HDFS?
Some components are HA (Cassandra, ZooKeeper). Most are
unreliable. Client devices are very unreliable.
Buffers in “stateless” components cause loss.
Crunching delay is inconvenient. Crunching wrong data is expensive.
Data crunching
Data is false?
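"~1% loss is ok, assuming that it is measured" implies counters on both ends: messages acked at handoff versus messages that actually landed in HDFS. A minimal sketch, assuming per-hour bucket keys and pre-collected counts:

```python
def loss_fraction(sent, received):
    """Measured loss per hour bucket.

    `sent`: messages acked at handoff per bucket.
    `received`: messages found in HDFS for the same bucket."""
    loss = {}
    for bucket, n_sent in sent.items():
        n_got = received.get(bucket, 0)
        loss[bucket] = 1.0 - (n_got / n_sent) if n_sent else 0.0
    return loss
```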
19. Core databases dumped daily (user x 2, playlist, metadata)
Determinism required - delays inevitable
Slave replication issues common
No good solution:
Sqoop live - non-deterministic
Postgres commit log replay - not scalable
Cassandra full dumps - resource heavy
Solution - convert to event processing?
Experimenting with Netflix Aegisthus for Cassandra -> HDFS
Facebook has MySQL commit log -> event conversion
Database dumping
20. We have raw data, sorted by host and hour
We want e.g. active users by country and product over the last month
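At its core, that question is a distinct-count grouped by two dimensions. A plain-Python sketch of the aggregation (the record fields `user_id`, `country`, `product` are assumed for illustration; in practice this runs as a MapReduce job over a month of data):

```python
from collections import defaultdict

def monthly_active_users(records):
    """Distinct active users per (country, product) over the input records."""
    seen = defaultdict(set)
    for r in records:
        seen[(r["country"], r["product"])].add(r["user_id"])
    return {key: len(users) for key, users in seen.items()}
```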
22. 1. Split by message type, per hour
2. Combine multiple sources for similar data, per day - a core dataset.
3. Join activity datasets, e.g. tracks played or user activity, with
ornament dataset, e.g. track metadata, user demographics.
4a. Make reports for partners, e.g. labels, advertisers.
4b. Aggregate into SQL or add metadata for Hive exploration.
4c. Build indexes (search, top lists), denormalise, and put in Cassandra.
4d. Run machine learning (recommendations) and put in Cassandra.
4e. Make notification decisions and send out.
...
Typical data crunching
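Steps 1 and 3 of the pipeline above can be sketched in plain Python. This is illustrative only: the message dicts and field names are assumptions, and the real jobs are MapReduce over HDFS, not in-memory lists.

```python
from collections import defaultdict

def split_by_type(messages):
    """Step 1: split raw messages by message type (per hour)."""
    buckets = defaultdict(list)
    for msg in messages:
        buckets[msg["type"]].append(msg)
    return dict(buckets)

def join_ornament(activity, ornament, key):
    """Step 3: join an activity dataset (e.g. tracks played) with an
    ornament dataset (e.g. track metadata) on a shared key."""
    lookup = {row[key]: row for row in ornament}
    return [dict(a, **lookup[a[key]]) for a in activity if a[key] in lookup]
```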
24. Generate - organic
Transfer - Kafka
Process - Python MapReduce. Bad idea.
Big data ecosystem is 99% JVM -> moving to Crunch
Test - in production.
Not acceptable. Working on it. No available tools.
Deploy - CI + Debian packages.
Low isolation. Looking at containers (Docker).
Monitor - organic
Cycle time for code-test-debug: 21 days
Data processing platform
25. Online storage: Cassandra, Postgres
Offline storage: HDFS
Transfer: Kafka, Sqoop
Processing engine: Hadoop MapReduce in Yarn
Processing languages: Luigi Python MapReduce, Crunch, Pig
Mining: Hive, Postgres, Qlikview
Real-time processing: Storm (mostly experimental)
Trying out:
Spark - better for iterative algorithms (ML), future of MapReduce?
Giraph and other graph tools
More stable infrastructure: Docker, Azkaban
Technology stack
26. def mapper(self, items):
        for item in items:
            if item.type == 'EndSong':
                yield (item.track_id, 1, item)
            else:  # Track metadata
                yield (item.track_id, 0, item)

    def reducer(self, key, values):
        meta = None
        for item in values:  # secondary sort key puts metadata (0) first
            if item.type != 'EndSong':
                meta = item
            else:
                yield add_meta(meta, item)
Crunching tools - four joins
select * from tracks inner join metadata
    on tracks.track_id = metadata.track_id;

joined = JOIN tracks BY track_id, metadata BY track_id;

PTable<String, Pair<EndSong, TrackMeta>> joined =
    Join.innerJoin(endSongTable, metaTable);

Vanilla MapReduce - fragile
SQL / Hive - exploration & display
Pig - deprecated
Crunch - future for processing pipelines
28. Mostly organic - frequent discrepancies
Agile feature dev -> easy schema change
Currently requires client lib release
Avro meta format in backend
Good Hadoop integration
Not best option in client
Some clients are hard to upgrade, e.g. old phones, hifi, cars.
Utopic (aka Google): client schema change -> automatic Hive/SQL/dashboard/report change
Schemas
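For concreteness, here is what an Avro record schema for an EndSong-style message could look like. The field names are illustrative, not Spotify's actual schema; the nullable union with a default is the standard Avro trick for backward-compatible evolution.

```python
import json

# Hypothetical Avro schema for an EndSong message.
ENDSONG_SCHEMA = json.dumps({
    "type": "record",
    "name": "EndSong",
    "namespace": "com.example.logs",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "track_id", "type": "string"},
        # union with null + default lets old readers skip a new field
        {"name": "ms_played", "type": ["null", "int"], "default": None},
    ],
})
```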
29. Today:
if date < datetime(2012, 10, 17):
    # Use old format
    ...
else:
    ...
Not scalable
Few tools available. HCatalog?
Solution(?): Encapsulate each dataset in a library. Owners decide compatibility vs reformat strategy. Version the interface. (Twitter)
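The encapsulation idea can be sketched as follows. The two line formats, field names, and the `read_endsong` interface are hypothetical; the point is that the version dispatch lives inside the dataset library, so callers never see the cutover date again.

```python
from datetime import datetime

CUTOVER = datetime(2012, 10, 17)

def _parse_v1(line):
    # old format: "user_id,track_id"
    user_id, track_id = line.split(",")
    return {"user_id": user_id, "track_id": track_id, "ms_played": None}

def _parse_v2(line):
    # new format: "user_id,track_id,ms_played"
    user_id, track_id, ms = line.split(",")
    return {"user_id": user_id, "track_id": track_id, "ms_played": int(ms)}

def read_endsong(date, line):
    """Versioned dataset interface: format changes are hidden here."""
    return _parse_v1(line) if date < CUTOVER else _parse_v2(line)
```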
Data evolution
30. Many redundant calculations
Data discovery
Home-grown tool
Retention policy
Save the raw data (S3)
Be brutal and delete
What is out there?
31. Technology is easy to change, humans are hard
Our most difficult challenges are cultural
Organising yourself
32. Failing jobs, dead jobs
Dead data
Data growth
Reruns
Isolation
Configuration, memory, disk, Hadoop resources
Technical debt
Testing, deployment, monitoring, remediations
Cost
Be stringent with software engineering practices or suffer. Most data organisations suffer.
Staying in control
33. History:
Data service department
Core data + platform department
Data platform department
Self-service spurs data usage
Data producers and consumers have domain knowledge
Data infrastructure engineers do not
Data producers prioritise online services over offline
Producing and consuming are closely tied, yet often organisationally separated
Who owns what?
34. Dos:
Solve domain-specific or unsolved things
Use stuff from leaders (Kafka)
Monitor aggressively
Have 50+% backend engineers
Focus on the data feature developer's needs
Separate raw and generated data
Hadoop was a good bet, Spark even better?
Things learnt in the fire
Don’ts:
Choose your own path (Python)
Use ad-hoc formats
Build stuff with < 3 years horizon
Accumulate debt
Use SQL in data pipelines
Have SPOFs - no excuse anymore
Rely on host configurations
Collect data with pull
Vanilla MapReduce
“Data is special” - no SW practices
35. Innovation originates at Google (~10^7 data dedicated machines)
MapReduce, GFS, Dapper, Pregel, Flume
Open source variants by the big dozen (10^5 - 10^6)
Yahoo, Netflix, Twitter, LinkedIn, Amazon, Facebook. US only
Hadoop, HDFS, ZooKeeper, Giraph, Crunch.
Cassandra
Improved by serious players (10^3 - 10^4)
Spotify, AirBnB, FourSquare, Prezi, King. Mostly US
Used by beginners (10^1 - 10^2)
Big Data innovation
Innovation in Big Data - four tiers
36. Not much in infrastructure:
Supercomputing legacy
MPI still in use
Berkeley: Spark, Mesos
Cooperation with Yahoo and Twitter
Containers
Xen, VMware
Data processing theory:
Bloom filters, stream processing (e.g. Count-Min Sketch)
Machine learning
Innovation from academia
37. Fluid architectures / private clouds
Large pools of machines
Services and jobs are independent of hosts
Mesos, Curator are scratching at the problem
Google Borg = Utopia
LAMP stack for Big Data
End to end developer testing
Client modification to insights SQL change
Running on developer machine, in IDE
Scale is not an issue - efficiency & productivity is
Innovation is needed, examples