Big data and APIs for PHP developers - SXSW 2011
1. Big Data and APIs
for PHP Developers
SXSW Interactive 2011
Austin, Texas
#BigDataAPIs
2.
Text REDCROSS to 90999 to make a $10 donation and
support the American Red Cross' disaster relief efforts to help
those affected by the earthquake in Japan and tsunami
throughout the Pacific.
3. Topics & Goal
Topics:
o Introductions
o Definition of Big Data
o Working with Big Data
o APIs
o Visualization
o MapReduce
Goal:
To provide an interesting discussion revolving around all
aspects of big data & to spark your imagination on the subject.
12. Tell Us About You
• Who is currently working on a Big Data problem?
• Have you integrated with an existing API?
• Do you publish your own API?
• How many of you are PHP developers?
• Tell us what you hope to learn here today.
14. Different Types of Big Data
Large chunks of data
o Massive XML files, Images, Video, log files
Massive amounts of small data points
o Typical Web Data, Survey votes, Tweets
Requests/Traffic
o Serving lots of data, regardless of set size
Processing vs storing
o Different concerns if only processing versus storing
16. CAP Theorem
Consistency
All clients will see a consistent view of the data.
Availability
Clients will have access to read and write data.
Partition Tolerance
The system won't fail if individual nodes can't communicate.
You can't have all three—pick two!
http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
17. Common Tools
Hadoop and HBase
Cassandra
Riak
MongoDB
CouchDB
Amazon Web Services
18.
19. Scaling Data vs. Scaling Requests
Highly related... or are they?
At Digg, both were related, as more requests meant more data,
and more data made it harder to handle more requests.
At TripAdvisor, the challenge was handling pass-through big data: it
was never stored, but had to be processed quickly.
Mozilla's Socorro has a small number of webapp users (100?),
but catches 3 million crashes a day via POST, median size
150k, and must store, process, and analyze these.
20. OLTP vs. OLAP
OLTP: low latency, high volume
OLAP: long running, CPU intensive, low volume (?)
Trying to do both with the same system happens more often
than you might think (and will make you cry)
One bad MapReduce can bring a system to its knees
21. Online vs offline solutions
Keep only recent data in online system
Replicate / snapshot data to secondary system
Run expensive jobs on secondary
Cache output of expensive jobs in primary
Make snapshot/window data available to end users for ad hoc
processing
27. Scaling up
A different type of scaling:
Typical webapp: scale to millions of users without degradation
of response time
Socorro: less than a hundred users, terabytes of data.
Basic law of scale still applies:
The bigger you get, the more spectacularly you fail
28. Some numbers
At peak we receive 3000 crashes per minute
3 million per day
Median crash size 100k -> 150k
30TB stored in HBase and growing every day
29. What can we do?
Does betaN have more (null signature) crashes than other
betas?
Analyze differences between Flash versions x and y crashes
Detect duplicate crashes
Detect explosive crashes
Email victims of a malware-related crash
Find frankeninstalls
30. War stories
HBase
Low latency writes
Fast to retrieve one record or a range; anything else needs MapReduce
Stability problems and solutions
Big clusters: network is key
Need spare capacity, secondary systems
Need great instrumentation
Redundant layers (disk, RDBMS, cache)
Next: Universal API
Big data is hard to move
32. Sources for public data
• Data.gov
• DataSF: http://datasf.org/
• Public Data Sets on AWS:
http://aws.amazon.com/publicdatasets/
• UNData: http://data.un.org/
• OECD: http://stats.oecd.org/
33. Infochimps
Over 10,000 data sets listed at Infochimps
Over 1,800 data sets available through our API
APIs allow easy access to terabyte scale data
34. Data is the lifeblood of your product
Any great application consists of:
• Awesome code
• The right data
And, as we've seen with sites like CNET, NPR, Netflix & Twitter,
it has become a best practice to build applications against APIs
to access that data.
36. How we make the Screenname
Autocomplete API
• Flying Monkey Scraper
o continually crisscrosses the user graph to discover new users
o 1B objects to insert a day, 8 nodes
• Hadoop
o To do the processing, 15+ nodes
o Pig, Wukong
o Precalculate 100M usernames -> prefixes
-> a few hundred million rows
o Sorted by Trstrank
• Apeyeye
o load balanced cluster of 10 nodes, across 2 data centers
37. Infochimps Yahoo Stock API
GET http://api.infochimps.com/economics/finance/stocks/
y_historical/price_range
{
  "results":[{
    "open":19.1,
    "adj_close":9.55,
    "close":19.09,
    "high":19.33,
    "symbol":"AAPL",
    "date":20010813,
    "exchange":"NASDAQ",
    "volume":5285600,
    "low":18.76
  },
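A response like the one above can be consumed with any JSON parser. A minimal sketch in Python (the record below is completed from the truncated sample; the field values are taken from the slide):

```python
import json

# One record from the y_historical/price_range response, completed
# from the truncated snippet above.
response = '''
{"results": [{"open": 19.1, "adj_close": 9.55, "close": 19.09,
              "high": 19.33, "symbol": "AAPL", "date": 20010813,
              "exchange": "NASDAQ", "volume": 5285600, "low": 18.76}]}
'''

data = json.loads(response)
for quote in data["results"]:
    # Daily trading range as a percentage of the opening price.
    swing = (quote["high"] - quote["low"]) / quote["open"] * 100
    print(f'{quote["symbol"]} {quote["date"]}: close {quote["close"]}, '
          f'range {swing:.1f}%')
```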
38. How we make the Stock API
Changes every day...
You can get Yahoo stock data every day too, in CSV form.
• Hackbox:
o little piece of armored sausage that takes the data and
munges it up to be useful
• Troop:
o Publishes it into the datastore
o Writes the API docs
o Puts the API endpoint into the code
o Stages for deploy
39. Build these APIs yourself... if you want.
You can build these APIs yourself.
Check out http://infochimps.com/labs to see our open
sourced codebase.
Or you can let us do it: you focus on writing awesome code for your
application, and we do the monkeying with the data.
40. And actually....
These two APIs illustrate great reasons why you *wouldn't*
want to build them yourself:
• The data involved is too large to practically handle
• The data involved updates frequently, so it's a hassle
41. Examples of data available through the
Infochimps API
• Trstrank
o How trustworthy is a Twitter user?
• Conversations
o Given two users, get a summary of their interactions
• Twitter Name Autocomplete
o Given a string, find twitter names that complete it
• Geo Name Autocomplete
o Given a string, find place names that complete it
• Qwerly
o maps from Twitter handle to other social networks
• Daily Historical Stock Quotes
• IP to Census
o Given an IP address, map to Geo, then map to US Census info
42. And many more APIs...
Many more APIs available:
• Word Frequencies
• Word Lists
• Freebase
• DBPedia
• AggData
• Material Safety Data Sheets (MSDS)
• Internet "Weather" from Cedexis
And... many more to come.
Relocate your subroutine to the cloud.
http://infochimps.com/apis
44. Rules of thumb
Design APIs that give incremental access; don't shove more data at a
user than they need.
For performance reasons, don't allow the user to request too
much data in one request, and throttle requests.
Consider building API against secondary system.
Asynchronous APIs: request/queue data, pick it up later.
Caches for big data don't necessarily help.
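The asynchronous rule above amounts to a job ticket: the client requests data, gets a token back immediately, and picks up the result later. A toy in-memory sketch in Python (a real implementation would back this with a queue and a datastore, likely on a secondary system):

```python
import uuid

class AsyncAPI:
    """Toy request/queue/pick-up-later API: submit() returns a ticket
    immediately, the expensive job runs out of band, the client polls."""

    def __init__(self):
        self.jobs = {}       # ticket -> queued job spec
        self.results = {}    # ticket -> finished result

    def submit(self, query):
        ticket = str(uuid.uuid4())
        self.jobs[ticket] = query      # would go onto a real queue
        return ticket                  # client gets this instantly

    def work(self):
        # Worker drains the queue (stands in for the secondary system).
        for ticket, query in list(self.jobs.items()):
            self.results[ticket] = f"result of {query}"
            del self.jobs[ticket]

    def poll(self, ticket):
        # None means "not ready yet, come back later".
        return self.results.get(ticket)

api = AsyncAPI()
t = api.submit("top URIs by bytes, last 30 days")
print(api.poll(t))   # None: still queued
api.work()
print(api.poll(t))
```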
46. Exploring vs. Explaining
L. fineartamerica.com/featured/exploring-archimedes-david-robinson.html
R. sgp.undp.org/web/projects/10771/environmental_awareness_and_familiarity_with_animals.html
47. Structure has a purpose.
fiveless.deviantart.com/art/Periodic-Table-of-the-Elements-147350318
52. Chart junk makes baby Jesus cry.
L. www.flickr.com/photos/santo_cuervo/3693877386/
R. www.gereports.com/a-good-look-at-the-cost-of-chronic-diseases/
55. It deosn't mttaer waht oredr the ltteers in a wrod
are, the olny iprmoetnt tihng is taht the frist and
lsat ltteres are at the rghit pclae. The rset can be a
tatol mses and you can sitll raed it wouthit a
porbelm. Tihs is bcuseae we do not raed ervey
lteter by it slef but the wrod as a wlohe.
IT DEOSN'T MTTAER WAHT OREDR THE LTTEERS IN
A WROD ARE, THE OLNY IPRMOETNT TIHNG IS TAHT
THE FRIST AND LSAT LTTERES ARE AT THE RGHIT
PCLAE. THE RSET CAN BE A TATOL MSES AND YOU
CAN SITLL RAED IT WOUTHIT A PORBELM. TIHS IS
BCUSEAE WE DO NOT RAED ERVEY LTETER BY IT
SLEF BUT THE WROD AS A WLOHE.
73. Some Numbers
Facebook, new data per day:
• 03/2008: 200 GB
• 04/2009: 2 TB
• 10/2009: 4 TB
• 03/2010: 12 TB
Google's processing jobs:
• 400 PB per month (in 2007!)
• Average job size is 180 GB
106. Basic Principle: The Mapper
Mapper reads records and emits key and value pairs.
Take an Apache web server log file as an example:
• Each line is a record.
• Mapper extracts request URI and number of bytes sent.
• Mapper emits the URI as the key and the bytes as the value.
Parallelize by having log files per hour, splitting up the files into
even smaller chunks (by line) and so forth.
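The mapper on this slide can be sketched as a Hadoop Streaming-style script (Python here for brevity; in a real job it would read records from sys.stdin, and the log format is assumed to be Apache common format):

```python
import re

# Matches the request URI and byte count in an Apache common-format line,
# e.g.: 127.0.0.1 - - [10/Mar/2011:13:55:36 -0600] "GET /foobar HTTP/1.1" 200 2326
LINE = re.compile(r'"[A-Z]+ (?P<uri>\S+) [^"]*" \d{3} (?P<bytes>\d+)')

def map_line(line):
    """Emit (request URI, bytes sent) for one log record, or None."""
    m = LINE.search(line)
    return (m.group("uri"), int(m.group("bytes"))) if m else None

# Sample records; a Streaming mapper would iterate sys.stdin instead.
sample = [
    '127.0.0.1 - - [10/Mar/2011:13:55:36 -0600] "GET /foobar HTTP/1.1" 200 2326',
    '127.0.0.1 - - [10/Mar/2011:13:55:37 -0600] "GET /index HTTP/1.1" 200 512',
]
for line in sample:
    kv = map_line(line)
    if kv:
        print(f"{kv[0]}\t{kv[1]}")   # key <TAB> value, one pair per record
```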
112. Basic Principle: The Reducer
All values (from all nodes) for the same key are sent to the
same reducer.
Keys get sorted, and in case of a simple count or sum, you can
do a first reduce run on each mapper node once it's finished to
cut down on I/O (that's the combiner).
Apache web server log example to the rescue again:
• Reducer is invoked for a URI like "/foobar" and a list of all
number of bytes.
• Sum up the bytes, and we have the total traffic per URI!
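The reducer side, in the same Streaming style: because Hadoop delivers keys sorted, the reducer only has to watch for the key to change while summing values. A sketch, assuming tab-separated mapper output:

```python
def reduce_stream(lines):
    """Sum values per key from sorted 'key<TAB>value' lines."""
    current, total = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield current, total   # key changed: emit the finished sum
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield current, total

# Mapper output after Hadoop's sort phase (all "/foobar" values adjacent);
# a Streaming reducer would iterate sys.stdin instead.
sorted_pairs = ["/foobar\t2326", "/foobar\t512", "/index\t1024"]
for uri, traffic in reduce_stream(sorted_pairs):
    print(f"{uri}\t{traffic}")   # total bytes served per URI
```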
129. Big Hadoop Installations
Facebook:
• Mostly used with Hive
• 8400 cores, 13 PB total storage capacity
  o 8 cores, 32 GB RAM, 12 TB disk per node
  o 1 GbE per node
• 4 GbE between racks
Yahoo:
• 40% of jobs use Pig
• > 100,000 CPU cores in > 25,000 servers
• Largest cluster: 4000 nodes
  o 2 x 4 CPU cores and 16 GB RAM per node
140. Hadoop at Facebook
Daily usage:
• 25 TB logged by Scribe
• 135 TB compressed data scanned
• 7500+ Hive jobs
• ~80k compute hours
Data per day growth:
• I/08: 200 GB
• II/09: 2 TB compressed
• III/09: 4 TB compressed
• I/10: 12 TB compressed
148. HDFS Overview
• Designed for very large data sets, transparent compression,
block-based storage (64 MB block size by default)
• Designed for streaming rather than random reads
• Write-once, read-many (although there is a way to append)
• Stores data redundantly (three replicas by default), is aware
of your network topology
• Namenode has metadata and knows where blocks reside
• Datanodes hold the data
157. Job Processing
• Input Formats split up your data into individual records
• Mappers do their work, then a partitioner partitions & sorts
• Combiner can perform local pre-reduce on each mapper
• Reducers perform reduction for each key
• Mapper, Combiner and Reducer can be an external process
o Called Hadoop Streaming, uses STDIN & STDOUT
Shameless plug: http://github.com/dzuelke/hadoophp
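Putting the whole thing together: the map, shuffle/sort, and reduce steps can be simulated in a few lines of plain Python, which is also a handy way to sanity-check job logic before running it on a cluster. The log lines below are made up for illustration:

```python
import re
from itertools import groupby

# Simulate map -> shuffle/sort -> reduce on a few Apache log lines.
LOG = [
    '1.2.3.4 - - [12/Mar/2011:10:00:01 -0600] "GET /foobar HTTP/1.1" 200 500',
    '1.2.3.4 - - [12/Mar/2011:10:00:02 -0600] "GET /index HTTP/1.1" 200 1200',
    '5.6.7.8 - - [12/Mar/2011:10:00:03 -0600] "GET /foobar HTTP/1.1" 200 700',
]

pattern = re.compile(r'"[A-Z]+ (\S+) [^"]*" \d{3} (\d+)')

# Map: one (uri, bytes) pair per record.
pairs = [(m.group(1), int(m.group(2)))
         for m in map(pattern.search, LOG) if m]

# Shuffle/sort: Hadoop groups all values for a key together.
pairs.sort()

# Reduce: sum bytes per URI.
totals = {uri: sum(b for _, b in group)
          for uri, group in groupby(pairs, key=lambda kv: kv[0])}
print(totals)   # {'/foobar': 1200, '/index': 1200}
```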