The millions of people who use Spotify each day generate a lot of data, roughly a few terabytes per day. What does it take to handle datasets at that scale, and what can be done with them? I will briefly cover how Spotify uses data to provide a better music listening experience and to strengthen its business. Most of the talk will be spent on our data processing architecture and how we leverage state-of-the-art data processing and storage tools, such as Hadoop, Cassandra, Kafka, Storm, Hive, and Crunch. Lastly, I'll present observations and thoughts on innovation in the data processing, aka Big Data, field.
3. Why data?
Reporting to partners, from day 1
Record labels, ad buyers, marketing
Analytics
KPIs, Ads, Business insights: growth, retention, funnels
Features
Recommendations, search, top lists, notifications
Product development
A/B testing
Operations
Root cause analysis, latency, planning
Customer support
Legal
Data purpose
4. Different needs: speed vs quality
Reporting to partners, from day 1
Record labels, ad buyers, marketing (daily + monthly)
Analytics
KPIs, Ads, Business insights: growth, retention, funnels
Features
Recommendations, search, top lists, notifications
Product development
A/B testing
Operations
Root cause analysis, latency, planning
Customer support
Legal
5. Most user actions
Played songs
Playlist modifications
Web navigation
UI navigation
Service state changes
User
Notifications
Incoming
Content
Social integration
What data?
6. 26M monthly active users
6M subscribers
55 markets
20M songs, 20K new / day
1.5B playlists
4 data centres
10 TB from users / day
400 GB from services / day
61 TB generated in Hadoop / day
600 Hadoop nodes
6500 MapReduce jobs / day
18PB in HDFS
Much data?
10. 2008:
> for h in $(all_hosts); do rsync $h:/var/log/syslog /incoming/$h/$date; done
> echo '0 * * * * run_all_hourly_jobs.sh' | crontab
Dump to Postgres, make graph
Still living with some of this…
Data infrastructure
It all started very basic
11. Data infrastructure
Collect, crunch, use/display
[Diagram: clients hit the Gateway; the Gateway, Playlist service, and service DBs feed logs into the Kafka message bus (Kafka@lon for cross-site routing); logs land in HDFS, where MapReduce jobs produce SQL reports, Cassandra-stored recommendations, and more.]
13. Most datasets are produced daily
Consumers want data after morning coffee
For each line, bottom level represents a good day
Destabilisation is the norm
Delay factors all over the infrastructure, from client to display
Producers are not stakeholders
Shit happens
14. Get raw data from clients through GWs
GWs
Service logs
Service databases
To HDFS
Data collection
16. 1. Rsync from hosts. Get list from hosts DB.
- Rsync fragile, frequent network issues.
- DB info often stale
- Often waiting for dead host or omitting host
2. Push logs over Kafka. Wait for hosts according to hosts DB.
+ Kafka better. Application level cross-site routing.
- Kafka unreliable by design. Implement end-to-end acking.
3. Use Kafka as in #2. Determine active hosts by snooping metrics.
+ Reliable? host metric.
- End-to-end stability and host enumeration not scalable.
Log collection evolution
17. Single solution cannot fit all needs.
Choose reliability or low latency.
Reliable path with store and forward
Service hosts must not store state.
Synchronous handoff to HA Kafka with large replay buffer
Best effort path similar
No acks, asynchronous handoff
Message producers know appropriate semantics
For critical data: handoff failure -> stop serving users
Measuring loss is essential
Log collection future
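The two handoff semantics above can be sketched in a few lines of Python. This is a minimal illustration, not a real Kafka API: the `client` object and its `send`/`send_async` methods are hypothetical stand-ins for the producer library.

```python
class HandoffError(RuntimeError):
    """Raised when a critical message cannot be acked."""

def reliable_handoff(client, message, timeout=5.0):
    """Reliable path: synchronous handoff, block for the ack.

    For critical data, a failed handoff stops serving users
    rather than silently dropping messages."""
    if not client.send(message, timeout=timeout):
        raise HandoffError("no ack from HA Kafka tier")

def best_effort_handoff(client, message):
    """Best-effort path: asynchronous handoff, no acks."""
    client.send_async(message)
```

The key design point is that the producer, not the transport, chooses the semantics: a payment event goes through `reliable_handoff`, a UI click through `best_effort_handoff`.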
18. ~1% loss is ok, assuming that it is measured
Few % time slippage is ok, if unbiased
Biased slippage is not ok
Timestamp to use for bucketing: client, GW, HDFS?
Some components are HA (Cassandra, ZooKeeper). Most are
unreliable. Client devices are very unreliable.
Buffers in “stateless” components cause loss.
Crunching delay is inconvenient. Crunching wrong data is expensive.
Data crunching
Data is false?
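"~1% loss is ok, assuming that it is measured" implies counters on both ends: messages acked at handoff versus messages that actually landed in HDFS. A minimal sketch, assuming per-hour bucket keys and pre-collected counts:

```python
def loss_fraction(sent, received):
    """Measured loss per hour bucket.

    `sent`: messages acked at handoff per bucket.
    `received`: messages found in HDFS for the same bucket."""
    loss = {}
    for bucket, n_sent in sent.items():
        n_got = received.get(bucket, 0)
        loss[bucket] = 1.0 - (n_got / n_sent) if n_sent else 0.0
    return loss
```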
19. Core databases dumped daily (user x 2, playlist, metadata)
Determinism required - delays inevitable
Slave replication issues common
No good solution:
Sqoop live - non-deterministic
Postgres commit log replay - not scalable
Cassandra full dumps - resource heavy
Solution - convert to event processing?
Experimenting with Netflix Aegisthus for Cassandra -> HDFS
Facebook has MySQL commit log -> event conversion
Database dumping
20. We have raw data, sorted by host and hour
We want e.g. active users by country and product over the last month
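At its core, that question is a distinct-count grouped by two dimensions. A plain-Python sketch of the aggregation (the record fields `user_id`, `country`, `product` are assumed for illustration; in practice this runs as a MapReduce job over a month of data):

```python
from collections import defaultdict

def monthly_active_users(records):
    """Distinct active users per (country, product) over the input records."""
    seen = defaultdict(set)
    for r in records:
        seen[(r["country"], r["product"])].add(r["user_id"])
    return {key: len(users) for key, users in seen.items()}
```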
22. 1. Split by message type, per hour
2. Combine multiple sources for similar data, per day - a core dataset.
3. Join activity datasets, e.g. tracks played or user activity, with
ornament dataset, e.g. track metadata, user demographics.
4a. Make reports for partners, e.g. labels, advertisers.
4b. Aggregate into SQL or add metadata for Hive exploration.
4c. Build indexes (search, top lists), denormalise, and put in Cassandra.
4d. Run machine learning (recommendations) and put in Cassandra.
4e. Make notification decisions and send out.
...
Typical data crunching
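Steps 1 and 3 of the pipeline above can be sketched in plain Python. This is illustrative only: the message dicts and field names are assumptions, and the real jobs are MapReduce over HDFS, not in-memory lists.

```python
from collections import defaultdict

def split_by_type(messages):
    """Step 1: split raw messages by message type (per hour)."""
    buckets = defaultdict(list)
    for msg in messages:
        buckets[msg["type"]].append(msg)
    return dict(buckets)

def join_ornament(activity, ornament, key):
    """Step 3: join an activity dataset (e.g. tracks played) with an
    ornament dataset (e.g. track metadata) on a shared key."""
    lookup = {row[key]: row for row in ornament}
    return [dict(a, **lookup[a[key]]) for a in activity if a[key] in lookup]
```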
24. Generate - organic
Transfer - Kafka
Process - Python MapReduce. Bad idea.
Big data ecosystem is 99% JVM -> moving to Crunch
Test - in production.
Not acceptable. Working on it. No available tools.
Deploy - CI + Debian packages.
Low isolation. Looking at containers (Docker).
Monitor - organic
Cycle time for code-test-debug: 21 days
Data processing platform
25. Online storage: Cassandra, Postgres
Offline storage: HDFS
Transfer: Kafka, Sqoop
Processing engine: Hadoop MapReduce in Yarn
Processing languages: Luigi Python MapReduce, Crunch, Pig
Mining: Hive, Postgres, Qlikview
Real-time processing: Storm (mostly experimental)
Trying out:
Spark - better for iterative algorithms (ML), future of MapReduce?
Giraph and other graph tools
More stable infrastructure: Docker, Azkaban
Technology stack
26. def mapper(self, items):
        for item in items:
            if item.type == 'EndSong':
                yield (item.track_id, 1, item)
            else:  # Track metadata
                yield (item.track_id, 0, item)

    def reducer(self, key, values):
        meta = None
        for item in values:  # secondary sort key puts metadata (0) first
            if item.type != 'EndSong':
                meta = item
            else:
                yield add_meta(meta, item)
Crunching tools - four joins
select * from tracks inner join metadata
    on tracks.track_id = metadata.track_id;

joined = JOIN tracks BY track_id, metadata BY track_id;

PTable<String, Pair<EndSong, TrackMeta>> joined =
    Join.innerJoin(endSongTable, metaTable);

Vanilla MapReduce - fragile
SQL / Hive - exploration & display
Pig - deprecated
Crunch - future for processing pipelines
28. Mostly organic - frequent discrepancies
Agile feature dev -> easy schema change
Currently requires client lib release
Avro meta format in backend
Good Hadoop integration
Not best option in client
Some clients are hard to upgrade, e.g. old phones, hifi, cars.
Utopic (aka Google): client schema change -> automatic Hive/SQL/dashboard/report change
Schemas
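For concreteness, here is what an Avro record schema for an EndSong-style message could look like. The field names are illustrative, not Spotify's actual schema; the nullable union with a default is the standard Avro trick for backward-compatible evolution.

```python
import json

# Hypothetical Avro schema for an EndSong message.
ENDSONG_SCHEMA = json.dumps({
    "type": "record",
    "name": "EndSong",
    "namespace": "com.example.logs",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "track_id", "type": "string"},
        # union with null + default lets old readers skip a new field
        {"name": "ms_played", "type": ["null", "int"], "default": None},
    ],
})
```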
29. Today:
if date < datetime(2012, 10, 17):
    # Use old format
    ...
else:
    ...
Not scalable
Few tools available. HCatalog?
Solution(?): Encapsulate each dataset in a library. Owners decide compatibility vs reformat strategy. Version the interface. (Twitter)
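The encapsulation idea can be sketched as follows. The two line formats, field names, and the `read_endsong` interface are hypothetical; the point is that the version dispatch lives inside the dataset library, so callers never see the cutover date again.

```python
from datetime import datetime

CUTOVER = datetime(2012, 10, 17)

def _parse_v1(line):
    # old format: "user_id,track_id"
    user_id, track_id = line.split(",")
    return {"user_id": user_id, "track_id": track_id, "ms_played": None}

def _parse_v2(line):
    # new format: "user_id,track_id,ms_played"
    user_id, track_id, ms = line.split(",")
    return {"user_id": user_id, "track_id": track_id, "ms_played": int(ms)}

def read_endsong(date, line):
    """Versioned dataset interface: format changes are hidden here."""
    return _parse_v1(line) if date < CUTOVER else _parse_v2(line)
```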
Data evolution
30. Many redundant calculations
Data discovery
Home-grown tool
Retention policy
Save the raw data (S3)
Be brutal and delete
What is out there?
31. Technology is easy to change, humans are hard
Our most difficult challenges are cultural
Organising yourself
32. Failing jobs, dead jobs
Dead data
Data growth
Reruns
Isolation
Configuration, memory, disk, Hadoop resources
Technical debt
Testing, deployment, monitoring, remediations
Cost
Be stringent with software engineering practices or suffer. Most data organisations suffer.
Staying in control
33. History:
Data service department
Core data + platform department
Data platform department
Self-service spurs data usage
Data producers and consumers have domain knowledge
Data infrastructure engineers do not
Data producers prioritise online services over offline
Producing and consuming are closely tied, yet often organisationally separated
Who owns what?
34. Dos:
Solve domain-specific or unsolved things
Use stuff from leaders (Kafka)
Monitor aggressively
Have 50+% backend engineers
Focus on the data feature developer's needs
Separate raw and generated data
Hadoop was a good bet, Spark even better?
Things learnt in the fire
Don’ts:
Choose your own path (Python)
Use ad-hoc formats
Build stuff with < 3 years horizon
Accumulate debt
Use SQL in data pipelines
Have SPOFs - no excuse anymore
Rely on host configurations
Collect data with pull
Vanilla MapReduce
“Data is special” - no SW practices
35. Innovation originates at Google (~10^7 data dedicated machines)
MapReduce, GFS, Dapper, Pregel, Flume
Open source variants by the big dozen (10^5 - 10^6)
Yahoo, Netflix, Twitter, LinkedIn, Amazon, Facebook. US only
Hadoop, HDFS, ZooKeeper, Giraph, Crunch.
Cassandra
Improved by serious players (10^3 - 10^4)
Spotify, AirBnB, FourSquare, Prezi, King. Mostly US
Used by beginners (10^1 - 10^2)
Big Data innovation
Innovation in Big Data - four tiers
36. Not much in infrastructure:
Supercomputing legacy
MPI still in use
Berkeley: Spark, Mesos
Cooperation with Yahoo and Twitter
Containers
Xen, VMware
Data processing theory:
Bloom filters, stream processing (e.g. Count-Min Sketch)
Machine learning
Innovation from academia
37. Fluid architectures / private clouds
Large pools of machines
Services and jobs are independent of hosts
Mesos, Curator are scratching at the problem
Google Borg = Utopia
LAMP stack for Big Data
End to end developer testing
Client modification to insights SQL change
Running on developer machine, in IDE
Scale is not an issue - efficiency & productivity is
Innovation is needed, examples