Submit Search
Upload
Scalable Machine Learning with Hadoop
•
Download as PPTX, PDF
•
8 likes
•
2,884 views
Grant Ingersoll
Follow
My intro to machine learning talk at
Read less
Read more
Report
Share
Report
Share
1 of 21
Download now
Recommended
SeaJUG May 2012 mybatis
SeaJUG May 2012 mybatis
Will Iverson
2010 05-21, object-relational mapping using hibernate v2
2010 05-21, object-relational mapping using hibernate v2
alvaro alcocer sotil
Spark forspringdevs springone_final
Spark forspringdevs springone_final
sdeeg
S2DS London 2015 - Hadoop Real World
S2DS London 2015 - Hadoop Real World
Sean Roberts
Shannon McFarland OpenStack/Cisco Intro
Shannon McFarland OpenStack/Cisco Intro
Shannon McFarland
SDEC2011 Essentials of Mahout
SDEC2011 Essentials of Mahout
Korea Sdec
9 25-12 DuraSpace Hot Topics, Slides, Introduction to Hydra
9 25-12 DuraSpace Hot Topics, Slides, Introduction to Hydra
DuraSpace
Machine Learning and Hadoop
Machine Learning and Hadoop
Josh Patterson
Recommended
SeaJUG May 2012 mybatis
SeaJUG May 2012 mybatis
Will Iverson
2010 05-21, object-relational mapping using hibernate v2
2010 05-21, object-relational mapping using hibernate v2
alvaro alcocer sotil
Spark forspringdevs springone_final
Spark forspringdevs springone_final
sdeeg
S2DS London 2015 - Hadoop Real World
S2DS London 2015 - Hadoop Real World
Sean Roberts
Shannon McFarland OpenStack/Cisco Intro
Shannon McFarland OpenStack/Cisco Intro
Shannon McFarland
SDEC2011 Essentials of Mahout
SDEC2011 Essentials of Mahout
Korea Sdec
9 25-12 DuraSpace Hot Topics, Slides, Introduction to Hydra
9 25-12 DuraSpace Hot Topics, Slides, Introduction to Hydra
DuraSpace
Machine Learning and Hadoop
Machine Learning and Hadoop
Josh Patterson
Stockage des données : quel système pour quel usage ?
Stockage des données : quel système pour quel usage ?
Zouheir Cadi
Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and Future
Data Science London
Toronto jaspersoft meetup
Toronto jaspersoft meetup
Patrick McFadin
Hortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics Applications
russell_jurney
Thousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/O
George Cao
Recommendations in Drupal (Drupal DevDays Barcelona 2012)
Recommendations in Drupal (Drupal DevDays Barcelona 2012)
Klokie Grossfeld
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
Andrew Brust
Leveraging Solr and Mahout
Leveraging Solr and Mahout
Grant Ingersoll
Architecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Adaryl "Bob" Wakefield, MBA
Model-Driven Cloud Data Storage
Model-Driven Cloud Data Storage
jccastrejon
50 Shades of SQL
50 Shades of SQL
DataWorks Summit
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
Satish Mohan
Large scale computing
Large scale computing
Bhupesh Bansal
Introduction to Hadoop
Introduction to Hadoop
Dr. C.V. Suresh Babu
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R Users
Adaryl "Bob" Wakefield, MBA
Hadoop and Machine Learning
Hadoop and Machine Learning
joshwills
Machine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and future
Cloudera, Inc.
Using mruby in the nosql database Avocadodb
Using mruby in the nosql database Avocadodb
avocadodb
Lviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQL
zenyk
Java scalability considerations yogesh deshpande
Java scalability considerations yogesh deshpande
IndicThreads
Cloud patterns
Cloud patterns
Nicolas De Loof
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Yahoo Developer Network
More Related Content
What's hot
Stockage des données : quel système pour quel usage ?
Stockage des données : quel système pour quel usage ?
Zouheir Cadi
Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and Future
Data Science London
Toronto jaspersoft meetup
Toronto jaspersoft meetup
Patrick McFadin
Hortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics Applications
russell_jurney
Thousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/O
George Cao
Recommendations in Drupal (Drupal DevDays Barcelona 2012)
Recommendations in Drupal (Drupal DevDays Barcelona 2012)
Klokie Grossfeld
What's hot
(6)
Stockage des données : quel système pour quel usage ?
Stockage des données : quel système pour quel usage ?
Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and Future
Toronto jaspersoft meetup
Toronto jaspersoft meetup
Hortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics Applications
Thousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/O
Recommendations in Drupal (Drupal DevDays Barcelona 2012)
Recommendations in Drupal (Drupal DevDays Barcelona 2012)
Similar to Scalable Machine Learning with Hadoop
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
Andrew Brust
Leveraging Solr and Mahout
Leveraging Solr and Mahout
Grant Ingersoll
Architecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Adaryl "Bob" Wakefield, MBA
Model-Driven Cloud Data Storage
Model-Driven Cloud Data Storage
jccastrejon
50 Shades of SQL
50 Shades of SQL
DataWorks Summit
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
Satish Mohan
Large scale computing
Large scale computing
Bhupesh Bansal
Introduction to Hadoop
Introduction to Hadoop
Dr. C.V. Suresh Babu
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R Users
Adaryl "Bob" Wakefield, MBA
Hadoop and Machine Learning
Hadoop and Machine Learning
joshwills
Machine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and future
Cloudera, Inc.
Using mruby in the nosql database Avocadodb
Using mruby in the nosql database Avocadodb
avocadodb
Lviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQL
zenyk
Java scalability considerations yogesh deshpande
Java scalability considerations yogesh deshpande
IndicThreads
Cloud patterns
Cloud patterns
Nicolas De Loof
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Yahoo Developer Network
Big Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQL
Tugdual Grall
Apache Cayenne: a Java ORM Alternative
Apache Cayenne: a Java ORM Alternative
Andrus Adamchik
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Dataconomy Media
Similar to Scalable Machine Learning with Hadoop
(20)
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
Leveraging Solr and Mahout
Leveraging Solr and Mahout
Architecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Model-Driven Cloud Data Storage
Model-Driven Cloud Data Storage
50 Shades of SQL
50 Shades of SQL
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
Large scale computing
Large scale computing
Introduction to Hadoop
Introduction to Hadoop
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R Users
Hadoop and Machine Learning
Hadoop and Machine Learning
Machine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and future
Using mruby in the nosql database Avocadodb
Using mruby in the nosql database Avocadodb
Lviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQL
Java scalability considerations yogesh deshpande
Java scalability considerations yogesh deshpande
Cloud patterns
Cloud patterns
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Big Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQL
Apache Cayenne: a Java ORM Alternative
Apache Cayenne: a Java ORM Alternative
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
More from Grant Ingersoll
Solr for Data Science
Solr for Data Science
Grant Ingersoll
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search Engine
Grant Ingersoll
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4
Grant Ingersoll
Intro to Search
Intro to Search
Grant Ingersoll
Open Source Search FTW
Open Source Search FTW
Grant Ingersoll
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Grant Ingersoll
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.x
Grant Ingersoll
Taming Text
Taming Text
Grant Ingersoll
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in Action
Grant Ingersoll
Apache Lucene 4
Apache Lucene 4
Grant Ingersoll
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene Ecosystem
Grant Ingersoll
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Grant Ingersoll
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Grant Ingersoll
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...
Grant Ingersoll
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
Grant Ingersoll
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
Grant Ingersoll
Intro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
Grant Ingersoll
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow Elephant
Grant Ingersoll
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Grant Ingersoll
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
Grant Ingersoll
More from Grant Ingersoll
(20)
Solr for Data Science
Solr for Data Science
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search Engine
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4
Intro to Search
Intro to Search
Open Source Search FTW
Open Source Search FTW
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and Hadoop
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.x
Taming Text
Taming Text
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in Action
Apache Lucene 4
Apache Lucene 4
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene Ecosystem
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
Intro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow Elephant
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
Scalable Machine Learning with Hadoop
1.
Scalable Machine Learning
with Hadoop (most of the time) Grant Ingersoll Chief Scientist October 2, 2012 © Copyright 2012
2.
Anyone Here Use
Machine Learning? • Any users of: • Google? • Search • Translation • Priority Inbox • Facebook? Google Translate • Twitter? • LinkedIn? Proprietary © 2012 LucidWorks
3.
Topics
• What is scalable machine learning? • Use Cases • Approaches • Hadoop-based • Alternatives • What is Apache Mahout? Proprietary 3 © 2012 LucidWorks
4.
Machine Learning
• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” • Intro. To Machine Learning by E. Alpaydin • Lots of related fields: • Information Retrieval • Stats • Biology • Linear algebra • Many more Proprietary © 2012 LucidWorks
5.
What does scalable
mean for us? • As data grows linearly, either scale linearly in time or in machines • 2X data requires 2X time or 2X machines (or less!) • Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm • Some algorithms won’t scale to massive machine clusters • Others fit logically on a Map Reduce framework like Apache Hadoop • Still others will need different distributed programming models • Be pragmatic Proprietary © 2012 LucidWorks
6.
Common Use Cases
http://www.readwriteweb.com/archives/linkedin_plots_your_profession al_network_with_inma.php Proprietary © 2012 LucidWorks
7.
My Use Cases
Search Relevance Recommendations Related Items Content/User Classification Phrases Topics Analytics Discovery Proprietary 7 © 2012 LucidWorks
8.
Scalable Approaches
• Mind the Gap • Algorithms are the fun stuff, but you’ll spend more time on ETL, feature selection and post-processing • Simpler is usually better at scale 1. Scale Data Pipeline -> Sample -> Sequential 2. Hadoop 3. Ensemble (distribute many sequential models) 4. Spark, MPI & BSP, Others Proprietary 8 © 2012 LucidWorks
9.
Open Source Machine
Learning Libraries • Apache Mahout • Vowpal Wabbit • R Stats Project • Weka • LibSVM, SVMLight • Many, many more Proprietary 9 © 2012 LucidWorks
10.
Apache Mahout • An
Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License • http://mahout.apache.org • Why Mahout? • Many Open Source ML libraries are either: • Lack Community • Lack Documentation and Examples • Lack Scalability • Lack the Apache License • Or are research-oriented http://dictionary.reference.com/browse/mahout Proprietary © 2012 LucidWorks
11.
Who uses Mahout?
https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout Proprietary © 2012 LucidWorks
12.
What Can I
do with Mahout Right Now? 3 “C”s + Extras Proprietary © 2012 LucidWorks
13.
Collaborative Filtering • Recommender
Approaches • User based • Item based • Online and Offline support • Offline can utilize Hadoop • Many different Similarity measures • Cosine, LLR, Tanimoto, Pearson, others Proprietary © 2012 LucidWorks
14.
Hadoop Recommenders
• Alternating Least Squares • Iterative, but scales well • Deals well with sparseness • “Large-scale Parallel Collaborative Filtering for the Netflix Prize” by Zhou et. al • https://cwiki.apache.org/MAHOUT/collaborative-filtering-with- als-wr.html • Slope One • Simple yet effective • Pseudo • Distribute sequential approach across Hadoop nodes Proprietary 14 © 2012 LucidWorks
15.
Clustering • Document level
• Group documents based on a notion of similarity • K-Means, Fuzzy K- Means, Dirichlet, Canopy, Mean-Shift, Spectral, Top- Down • Topic Modeling • Pluggable Distance • Cluster words across Measures documents to identify topics • Latent Dirichlet Allocation • Using Collapsed Variational Bayes Proprietary http://carrotsearch.com/foamtree-overview.html © 2012 LucidWorks
16.
Clustering In Hadoop •
Many people start with K-Means, but others can be more effective • Challenges • Iterative nature of many clustering algorithms can be slow • Distance measures and other factors can have dramatic impact on performance and quality • When in doubt, experiment Proprietary © 2012 LucidWorks
17.
Classification
“This gives a raw classification rate requirement of tens of millions of • Place new items into classifications per predefined categories second, which is, as they say in the old • Online and Offline country, a lot.” supported “Mahout in Action” http://awe.sm/5FyNe • Hadoop • Sequential • Naïve Bayes • Logistic Regression • Complementary Naïve Bayes • Stochastic Grad. • Decision Forests Descent • Clustering-based • Hidden Markov Model • Winnow/Perceptron Proprietary © 2012 LucidWorks
18.
Scaling Mahout Classification
Client (Train/Test/Classify) LucidWorks Big Data Classifier Model Model ... Node 1 A N Zookeeper ... Classifier Model Model ... Node N C Q Model Model Model HDFS Model Offline (Map/Reduce?) Training Proprietary © 2012 LucidWorks
19.
Other Mahout Features •
Apache Licensed: • Primitive Collections! • Extensive Math library • Vectors, Matrices, Statistics, etc. • Vector Encoding options • Singular Value Decomposition • Frequent Pattern Mining • Collocations (statistically interesting phrases) • I/O: Lucene, Cassandra, MongoDB and others Proprietary © 2012 LucidWorks
20.
What’s Next for
Mahout? • Streaming K-Means • Map/Reduce Training for HMM? • Clean Up towards 1.0 release • 1.0? Proprietary 20 © 2012 LucidWorks
21.
Resources
• http://www.lucidworks.com • grant@lucidworks.com • @gsingers Proprietary 21 © 2012 LucidWorks
Editor's Notes
A few things come to mind
Download now