Lucky Oyster
                        dive deep - discover pearls




                $100 Worth of Priceless
Leveraging Common Crawl and Spot Instances to Data Mine The Web


                Lisa Green, Common Crawl
                Matthew Berk, Lucky Oyster
Common Crawl Data
•   ~8 Billion web pages
•   ~120 TB
•   2008-2012
•   ARC files, JSON metadata, text files
•   Available to anyone on Amazon’s Public Data Sets
What Does $100 Buy You?
•   2 nosebleed seats at an NFL game
•   1/10 the cost of an entry-level Dell PowerEdge
•   80 minutes of a mid-level engineer’s time
•   Omakase for 1 at Shiro’s Sushi in Seattle

or…
$100 + 14 hours + 300 lines of Ruby =

3.4 billion Web pages processed, data mined, and indexed for
                    search and research.

                   Even a few years ago,
             this would have been unthinkable.
The Experiment
• Process most recent (2012) Web crawl from Common Crawl
• Determine extent and nature of hardcoded references to
  Facebook
• Extract structured metadata (Open Graph and Schema.org)
• Store, analyze and index entity metadata and link structure
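A minimal sketch of the extraction step, assuming a regex-based scan rather than a full HTML parser. The slides do not show the actual code, so the names `open_graph_tags` and `references_facebook?` are hypothetical, and treating any absolute `facebook.com` URL as a "hardcoded reference" is an assumption.

```ruby
# Matches a whole <meta ...> tag, and individual quoted attributes inside it.
META_TAG = /<meta\b[^>]*>/i
ATTRIBUTE = /([a-zA-Z:_-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/

# Assumption: a "hardcoded reference" is any absolute URL on facebook.com
# or one of its subdomains.
FACEBOOK_REF = %r{https?://(?:[a-z0-9-]+\.)*facebook\.com\b}i

# Returns a hash of Open Graph properties, e.g. { "og:type" => "movie" }.
def open_graph_tags(html)
  html.scan(META_TAG).each_with_object({}) do |tag, og|
    # Collect the tag's attributes regardless of the order they appear in.
    attrs = tag.scan(ATTRIBUTE).to_h { |name, dq, sq| [name.downcase, dq || sq] }
    property = attrs["property"]
    next unless property&.start_with?("og:") && attrs["content"]
    og[property] = attrs["content"]
  end
end

# True when the page contains at least one absolute facebook.com URL.
def references_facebook?(html)
  html.match?(FACEBOOK_REF)
end
```

A regex scan like this can miss unquoted attributes or tags split in unusual ways; a real HTML parser would be more robust at some cost in throughput.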
Components
• AWS Spot Instances
  – Peak of ~200 nodes
  – ~5,000 hours of compute time
  – Average cost of $0.02 per hour
• Custom Ruby code for extraction and analysis
• Beanstalkd, Apache httpd, Sinatra
• Some sysadmin elbow grease
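The cost figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the inputs are the approximate numbers quoted on the slides, and the method name `spot_run_summary` is hypothetical.

```ruby
# Approximate figures quoted on the slides.
COMPUTE_HOURS   = 5_000          # ~5,000 instance hours
CENTS_PER_HOUR  = 2              # average spot price of $0.02 per hour
PAGES_PROCESSED = 3_400_000_000  # 3.4 billion Web pages

# Integer cents are used to avoid floating-point rounding.
def spot_run_summary
  total_dollars = COMPUTE_HOURS * CENTS_PER_HOUR / 100
  {
    total_dollars: total_dollars,
    pages_per_dollar: PAGES_PROCESSED / total_dollars,
    pages_per_instance_hour: PAGES_PROCESSED / COMPUTE_HOURS
  }
end

spot_run_summary.each { |name, value| puts "#{name}: #{value}" }
```

Under these figures the run costs about $100, which works out to roughly 34 million pages per dollar.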
Architecture
• Master instance (m2.4xlarge)
   – Queue for Common Crawl S3 paths
   – Data collection and node control service
   – Indexers and Solr instances

• Worker nodes (c1.medium)
   – Spot instances with worker AMI
   – Consume S3 paths; decompress and stream ARC files
   – Extract and analyze

• Goals were simplicity, interruption tolerance, and high throughput
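The worker loop described above might look roughly like the sketch below. Everything here is an assumption for illustration: `PathQueue` is an in-memory stand-in loosely modeled on the reserve/delete/release commands of the Beanstalkd protocol, `fetch` is a hypothetical callable that returns an IO for a given S3 path, and counting lines stands in for the real extraction and analysis.

```ruby
require "zlib"
require "stringio"

# In-memory stand-in for the master's queue of S3 paths (hypothetical).
class PathQueue
  def initialize(paths)
    @ready = paths.dup
    @reserved = {}
    @next_id = 0
  end

  # Hand out one path; it stays reserved until deleted or released.
  def reserve
    path = @ready.shift
    return nil if path.nil?
    @next_id += 1
    @reserved[@next_id] = path
    [@next_id, path]
  end

  # The job finished successfully: forget it.
  def delete(id)
    @reserved.delete(id)
  end

  # The job did not finish: make the path available to another worker.
  def release(id)
    path = @reserved.delete(id)
    @ready.push(path) if path
  end

  def pending
    @ready.size + @reserved.size
  end
end

# Decompress a gzip stream and yield it line by line, so the whole
# file never has to be held in memory.
def each_decompressed_line(io)
  gz = Zlib::GzipReader.new(io)
  gz.each_line { |line| yield line }
ensure
  gz&.close
end

# Consume paths until the queue is empty. A path is deleted only after it
# has been fully processed, so an interrupted worker loses no work items.
def run_worker(queue, fetch, results)
  while (job = queue.reserve)
    id, path = job
    begin
      lines = 0
      each_decompressed_line(fetch.call(path)) { |_line| lines += 1 }
      results[path] = lines
      queue.delete(id)
    rescue StandardError
      queue.release(id)
      raise
    end
  end
  results
end
```

Deleting a job only after it succeeds is what gives interruption tolerance: if a spot instance disappears mid-file, the path returns to the queue and another worker picks it up.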
Findings / Output
• Lucky Oyster Study (see appendix or
  http://blog.luckyoyster.com)
• Utility computing = major cost savings
• Reusable framework for low-complexity, Web-scale crawl
  processing
• Indexes of 400+ million structured entities for R&D
Thank you. Questions?

            matthew@luckyoyster.com



            lisa@commoncrawl.org
Appendix
The Lucky Oyster Study
• Based on 3.4 billion URLs from Common Crawl
• 22% of pages reference Facebook directly
• 8% of pages implement Open Graph tags
• Top Open Graph types: hotels, movies, activities, songs, games,
  books
• Study of shift in locus (away from the open Web) and nature
  (towards entities) of content
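To put the percentages in absolute terms, here is a small worked calculation. It assumes both percentages are taken over the full 3.4 billion URLs, which the slide implies but does not state outright; `share_of_pages` is a hypothetical helper name.

```ruby
URLS_ANALYZED = 3_400_000_000  # 3.4 billion URLs from Common Crawl

# Convert a whole-number percentage into an approximate page count.
def share_of_pages(percent, total = URLS_ANALYZED)
  total * percent / 100
end

puts "Pages referencing Facebook: #{share_of_pages(22)}"
puts "Pages with Open Graph tags: #{share_of_pages(8)}"
```

Under that assumption, roughly 748 million pages reference Facebook directly and roughly 272 million implement Open Graph tags.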
We are sincerely eager to
 hear your feedback on this
presentation and on re:Invent.

 Please fill out an evaluation
   form when you have a
            chance.

BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
