Big data and APIs for PHP developers - SXSW 2011
1. Big Data and APIs
for PHP Developers
SXSW Interactive 2011
Austin, Texas
#BigDataAPIs
2.
Text REDCROSS to 90999 to make a $10 donation and
support the American Red Cross' disaster relief efforts to help
those affected by the earthquake in Japan and tsunami
throughout the Pacific.
3. Topics & Goal
Topics:
o Introductions
o Definition of Big Data
o Working with Big Data
o APIs
o Visualization
o MapReduce
Goal:
To provide an interesting discussion revolving around all
aspects of big data & to spark your imagination on the subject.
12. Tell Us About You
• Who is currently working on a Big Data problem?
• Have you integrated with an existing API?
• Do you publish your own API?
• How many of you are PHP developers?
• Tell us what you hope to learn here today.
14. Different Types of Big Data
Large chunks of data
o Massive XML files, Images, Video, log files
Massive amounts of small data points
o Typical Web Data, Survey votes, Tweets
Requests/Traffic
o Serving lots of data, regardless of set size
Processing vs storing
o Different concerns if only processing versus storing
16. CAP Theorem
Consistency
All clients will see a consistent view of the data.
Availability
Clients will have access to read and write data.
Partition Tolerance
The system won't fail if individual nodes can't communicate.
You can't have all three—pick two!
http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
17. Common Tools
Hadoop and HBase
Cassandra
Riak
MongoDB
CouchDB
Amazon Web Services
18.
19. Scaling Data vs. Scaling Requests
Highly related... or are they?
At Digg, both were related, as more requests meant more data,
and more data made it harder to handle more requests.
At TripAdvisor, the challenge was handling pass-through big data: it
was never stored, but had to be processed quickly.
Mozilla's Socorro has a small number of webapp users (100?),
but catches 3 million crashes a day via POST, median size
150k, and must store, process, and analyze these.
20. OLTP vs. OLAP
OLTP: low latency, high volume
OLAP: long running, CPU intensive, low volume (?)
Trying to do both with the same system happens more often
than you might think (and will make you cry)
One bad MapReduce can bring a system to its knees
21. Online vs offline solutions
Keep only recent data in online system
Replicate / snapshot data to secondary system
Run expensive jobs on secondary
Cache output of expensive jobs in primary
Make snapshot/window data available to end users for ad hoc
processing
27. Scaling up
A different type of scaling:
Typical webapp: scale to millions of users without degradation
of response time
Socorro: less than a hundred users, terabytes of data.
Basic law of scale still applies:
The bigger you get, the more spectacularly you fail
28. Some numbers
At peak we receive 3000 crashes per minute
3 million per day
Median crash size 100k -> 150k
30TB stored in HBase and growing every day
29. What can we do?
Does betaN have more (null signature) crashes than other
betas?
Analyze differences between Flash versions x and y crashes
Detect duplicate crashes
Detect explosive crashes
Email victims of a malware-related crash
Find frankeninstalls
30. War stories
HBase
Low latency writes
Fast to retrieve one record or a range; anything else needs MapReduce
Stability problems and solutions
Big clusters: network is key
Need spare capacity, secondary systems
Need great instrumentation
Redundant layers (disk, RDBMS, cache)
Next: Universal API
Big data is hard to move
32. Sources for public data
• Data.gov
• DataSF: http://datasf.org/
• Public Data Sets on AWS:
http://aws.amazon.com/publicdatasets/
• UNData: http://data.un.org/
• OECD: http://stats.oecd.org/
33. Infochimps
Over 10,000 data sets listed at Infochimps
Over 1,800 data sets available through our API
APIs allow easy access to terabyte scale data
34. Data is the lifeblood of your product
Any great application consists of:
• Awesome code
• The right data
And, as we've seen with sites like CNET, NPR, Netflix & Twitter,
it has become a best practice to build applications against APIs
to access that data.
36. How we make the Screenname
Autocomplete API
• Flying Monkey Scraper
o continually crisscrosses the user graph to discover new users
o 1B objects to insert a day, 8 nodes
• Hadoop
o To do the processing, 15+ nodes
o Pig, Wukong
o Precalculate 100M usernames -> prefixes
-> a few hundred million rows
o Sorted by Trstrank
• Apeyeye
o load balanced cluster of 10 nodes, across 2 data centers
37. Infochimps Yahoo Stock API
GET http://api.infochimps.com/economics/finance/stocks/
y_historical/price_range
{
  "results":[{
    "open":19.1,
    "adj_close":9.55,
    "close":19.09,
    "high":19.33,
    "symbol":"AAPL",
    "date":20010813,
    "exchange":"NASDAQ",
    "volume":5285600,
    "low":18.76
  },
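A response like the one above can be consumed with any JSON parser. A minimal sketch in Python (the record below is completed from the truncated sample; the field values are taken from the slide):

```python
import json

# One record from the y_historical/price_range response, completed
# from the truncated snippet above.
response = '''
{"results": [{"open": 19.1, "adj_close": 9.55, "close": 19.09,
              "high": 19.33, "symbol": "AAPL", "date": 20010813,
              "exchange": "NASDAQ", "volume": 5285600, "low": 18.76}]}
'''

data = json.loads(response)
for quote in data["results"]:
    # Daily trading range as a percentage of the opening price.
    swing = (quote["high"] - quote["low"]) / quote["open"] * 100
    print(f'{quote["symbol"]} {quote["date"]}: close {quote["close"]}, '
          f'range {swing:.1f}%')
```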
38. How we make the Stock API
Changes every day...
You can get Yahoo stock data every day too, in CSV form.
• Hackbox:
o little piece of armored sausage that takes the data and
munges it up to be useful
• Troop:
o Publishes it into the datastore
o Writes the API docs
o Puts the API endpoint into the code
o Stages for deploy
39. Build these APIs yourself... if you want.
You can build these APIs yourself.
Check out http://infochimps.com/labs to see our open
sourced codebase.
Or you can let us do it: you focus on writing awesome code for your
application, and we do the monkeying with the data.
40. And actually....
These two APIs illustrate great reasons why you *wouldn't*
want to build them yourself:
• The data involved is too large to practically handle
• The data involved updates frequently, so it's a hassle
41. Examples of data available through the
Infochimps API
• Trstrank
o How trustworthy is a Twitter user?
• Conversations
o Given two users, get a summary of their interactions
• Twitter Name Autocomplete
o Given a string, find twitter names that complete it
• Geo Name Autocomplete
o Given a string, find place names that complete it
• Qwerly
o maps from Twitter handle to other social networks
• Daily Historical Stock Quotes
• IP to Census
o Given an IP address, map to Geo, then map to US Census info
42. And many more APIs...
Many more APIs available:
• Word Frequencies
• Word Lists
• Freebase
• DBPedia
• AggData
• Material Safety Data Sheets (MSDS)
• Internet "Weather" from Cedexis
And... many more to come.
Relocate your subroutine to the cloud.
http://infochimps.com/apis
44. Rules of thumb
Design APIs that give incremental access; don't shove more data at a
user than they need.
For performance reasons, don't allow the user to request too
much data in one request, and throttle requests.
Consider building API against secondary system.
Asynchronous APIs: request/queue data, pick it up later.
Caches for big data don't necessarily help.
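The asynchronous rule above amounts to a job ticket: the client requests data, gets a token back immediately, and picks up the result later. A toy in-memory sketch in Python (a real implementation would back this with a queue and a datastore, likely on a secondary system):

```python
import uuid

class AsyncAPI:
    """Toy request/queue/pick-up-later API: submit() returns a ticket
    immediately, the expensive job runs out of band, the client polls."""

    def __init__(self):
        self.jobs = {}       # ticket -> queued job spec
        self.results = {}    # ticket -> finished result

    def submit(self, query):
        ticket = str(uuid.uuid4())
        self.jobs[ticket] = query      # would go onto a real queue
        return ticket                  # client gets this instantly

    def work(self):
        # Worker drains the queue (stands in for the secondary system).
        for ticket, query in list(self.jobs.items()):
            self.results[ticket] = f"result of {query}"
            del self.jobs[ticket]

    def poll(self, ticket):
        # None means "not ready yet, come back later".
        return self.results.get(ticket)

api = AsyncAPI()
t = api.submit("top URIs by bytes, last 30 days")
print(api.poll(t))   # None: still queued
api.work()
print(api.poll(t))
```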
46. Exploring vs. Explaining
L. fineartamerica.com/featured/exploring-archimedes-david-robinson.html
R. sgp.undp.org/web/projects/10771/environmental_awareness_and_familiarity_with_animals.html
47. Structure has a purpose.
fiveless.deviantart.com/art/Periodic-Table-of-the-Elements-147350318
52. Chart junk makes baby Jesus cry.
L. www.flickr.com/photos/santo_cuervo/3693877386/
R. www.gereports.com/a-good-look-at-the-cost-of-chronic-diseases/
55. It deosn't mttaer waht oredr the ltteers in a wrod
are, the olny iprmoetnt tihng is taht the frist and
lsat ltteres are at the rghit pclae. The rset can be a
tatol mses and you can sitll raed it wouthit a
porbelm. Tihs is bcuseae we do not raed ervey
lteter by it slef but the wrod as a wlohe.
IT DEOSN'T MTTAER WAHT OREDR THE LTTEERS IN
A WROD ARE, THE OLNY IPRMOETNT TIHNG IS TAHT
THE FRIST AND LSAT LTTERES ARE AT THE RGHIT
PCLAE. THE RSET CAN BE A TATOL MSES AND YOU
CAN SITLL RAED IT WOUTHIT A PORBELM. TIHS IS
BCUSEAE WE DO NOT RAED ERVEY LTETER BY IT
SLEF BUT THE WROD AS A WLOHE.
73. Some Numbers
Facebook, new data per day:
• 03/2008: 200 GB
• 04/2009: 2 TB
• 10/2009: 4 TB
• 03/2010: 12 TB
Google's processing jobs:
• 400 PB per month (in 2007!)
• Average job size is 180 GB
106. Basic Principle: The Mapper
Mapper reads records and emits key and value pairs.
Take an Apache web server log file as an example:
• Each line is a record.
• Mapper extracts request URI and number of bytes sent.
• Mapper emits the URI as the key and the bytes as the value.
Parallelize by having log files per hour, splitting up the files into
even smaller chunks (by line) and so forth.
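The mapper on this slide can be sketched as a Hadoop Streaming-style script (Python here for brevity; in a real job it would read records from sys.stdin, and the log format is assumed to be Apache common format):

```python
import re

# Matches the request URI and byte count in an Apache common-format line,
# e.g.: 127.0.0.1 - - [10/Mar/2011:13:55:36 -0600] "GET /foobar HTTP/1.1" 200 2326
LINE = re.compile(r'"[A-Z]+ (?P<uri>\S+) [^"]*" \d{3} (?P<bytes>\d+)')

def map_line(line):
    """Emit (request URI, bytes sent) for one log record, or None."""
    m = LINE.search(line)
    return (m.group("uri"), int(m.group("bytes"))) if m else None

# Sample records; a Streaming mapper would iterate sys.stdin instead.
sample = [
    '127.0.0.1 - - [10/Mar/2011:13:55:36 -0600] "GET /foobar HTTP/1.1" 200 2326',
    '127.0.0.1 - - [10/Mar/2011:13:55:37 -0600] "GET /index HTTP/1.1" 200 512',
]
for line in sample:
    kv = map_line(line)
    if kv:
        print(f"{kv[0]}\t{kv[1]}")   # key <TAB> value, one pair per record
```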
112. Basic Principle: The Reducer
All values (from all nodes) for the same key are sent to the
same reducer.
Keys get sorted, and in case of a simple count or sum, you can
do a first reduce run on each mapper node once it's finished to
cut down on I/O (that's the combiner).
Apache web server log example to the rescue again:
• Reducer is invoked for a URI like "/foobar" and a list of all
number of bytes.
• Sum up the bytes, and we have the total traffic per URI!
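The reducer side, in the same Streaming style: because Hadoop delivers keys sorted, the reducer only has to watch for the key to change while summing values. A sketch, assuming tab-separated mapper output:

```python
def reduce_stream(lines):
    """Sum values per key from sorted 'key<TAB>value' lines."""
    current, total = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield current, total   # key changed: emit the finished sum
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield current, total

# Mapper output after Hadoop's sort phase (all "/foobar" values adjacent);
# a Streaming reducer would iterate sys.stdin instead.
sorted_pairs = ["/foobar\t2326", "/foobar\t512", "/index\t1024"]
for uri, traffic in reduce_stream(sorted_pairs):
    print(f"{uri}\t{traffic}")   # total bytes served per URI
```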
129. Big Hadoop Installations
Facebook:
• Mostly used with Hive
• 8400 cores, 13 PB total storage capacity
  o 8 cores, 32 GB RAM, 12 TB disk per node
  o 1 GbE per node
• 4 GbE between racks
Yahoo:
• 40% of jobs use Pig
• > 100,000 CPU cores in > 25,000 servers
• Largest cluster: 4000 nodes
  o 2 x 4 CPU cores and 16 GB RAM per node
140. Hadoop at Facebook
Daily usage:
• 25 TB logged by Scribe
• 135 TB compressed data scanned
• 7500+ Hive jobs
• ~80k compute hours
Data per day growth:
• I/08: 200 GB
• II/09: 2 TB compressed
• III/09: 4 TB compressed
• I/10: 12 TB compressed
148. HDFS Overview
• Designed for very large data sets, transparent compression,
block-based storage (64 MB block size by default)
• Designed for streaming rather than random reads
• Write-once, read-many (although there is a way to append)
• Stores data redundantly (three replicas by default), is aware
of your network topology
• Namenode has metadata and knows where blocks reside
• Datanodes hold the data
157. Job Processing
• Input Formats split up your data into individual records
• Mappers do their work, then a partitioner partitions & sorts
• Combiner can perform local pre-reduce on each mapper
• Reducers perform reduction for each key
• Mapper, Combiner and Reducer can be an external process
o Called Hadoop Streaming, uses STDIN & STDOUT
Shameless plug: http://github.com/dzuelke/hadoophp
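Putting the whole thing together: the map, shuffle/sort, and reduce steps can be simulated in a few lines of plain Python, which is also a handy way to sanity-check job logic before running it on a cluster. The log lines below are made up for illustration:

```python
import re
from itertools import groupby

# Simulate map -> shuffle/sort -> reduce on a few Apache log lines.
LOG = [
    '1.2.3.4 - - [12/Mar/2011:10:00:01 -0600] "GET /foobar HTTP/1.1" 200 500',
    '1.2.3.4 - - [12/Mar/2011:10:00:02 -0600] "GET /index HTTP/1.1" 200 1200',
    '5.6.7.8 - - [12/Mar/2011:10:00:03 -0600] "GET /foobar HTTP/1.1" 200 700',
]

pattern = re.compile(r'"[A-Z]+ (\S+) [^"]*" \d{3} (\d+)')

# Map: one (uri, bytes) pair per record.
pairs = [(m.group(1), int(m.group(2)))
         for m in map(pattern.search, LOG) if m]

# Shuffle/sort: Hadoop groups all values for a key together.
pairs.sort()

# Reduce: sum bytes per URI.
totals = {uri: sum(b for _, b in group)
          for uri, group in groupby(pairs, key=lambda kv: kv[0])}
print(totals)   # {'/foobar': 1200, '/index': 1200}
```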