Training on a pluggable machine learning platform
Machine Learning on Hadoop at Huffington Post | AOL
A Little Bit about Us
Core Services Team at HPMG | AOL
- Thu Kyaw (thu.kyaw@teamaol.com), Principal Software Engineer. Worked on machine learning, data mining, and natural language processing.
- Sang Chul Song, Ph.D. (sangchul.song@teamaol.com), Senior Software Engineer. Worked on data-intensive computing: data archiving / information retrieval.
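Machine Learning: Supervised Classification
1. Learning Phase: documents with known labels ("Business", "Entertainment", "Politics") go through Train to produce a Model.
2. Classifying Phase: the Model takes new text ("capital gains to be taxed …") through Classify and returns a Result label.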
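The two phases can be sketched in a few lines of Python. This is only an illustration, assuming a tiny multinomial Naive Bayes classifier with add-one smoothing; the function names `train` and `classify` and the sample documents are hypothetical and are not taken from the platform described in these slides.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Learning phase: count words per label and return a model (plain dict)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for label, text in labeled_docs:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts, "label_counts": label_counts, "vocab": vocab}

def classify(model, text):
    """Classifying phase: return the label with the highest log-probability."""
    total_docs = sum(model["label_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, doc_count in model["label_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(doc_count / total_docs)  # log prior
        for token in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training documents, for illustration only.
training_data = [
    ("Business", "investments are taxed as capital gains"),
    ("Business", "banks were overleveraged and underregulated"),
    ("Entertainment", "the movie premiere drew famous actors"),
    ("Politics", "the senate vote on the election bill"),
]
model = train(training_data)
print(classify(model, "capital gains to be taxed"))
```

With this toy data the query shares the words "capital", "gains" and "taxed" with a Business document, so Business gets the highest score.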
Two Machine Learning Use Cases at HuffPost | AOL
Comment Moderation
- Evaluate all new HuffPost user comments every day
- Identify abusive / aggressive comments
- Auto delete / publish ~25% of comments every day
Article Classification
- Tag articles for advertising
- E.g.: scary, salacious, …
Our Classification Tasks
[Figure: comments labeled abusive / non-abusive (Comment Moderation) and articles labeled e.g. scary, sexy (Article Classification)]
In Order to Meet Our Needs, We Require…
- Support for important algorithms, including
  - SVM
  - Perceptron / Winnow
  - Bayesian
  - Decision Tree
  - AdaBoost
  - …
- Ability to build tons of models on a regular basis, and pick the best
  - Because, in general, it's difficult to know in advance which algorithm / parameter set will work best
However, with N algorithms, K parameters each, and L values for each parameter, there are N x L^K combinations, which is often too many to deal with sequentially. For example, N=5, K=5, L=10 gives 5 x 10^5 = 500K.
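The count above can be checked with a short Python sketch. The helper names `count_combinations` and `enumerate_grid` and the small example grid are made up for illustration; the slides do not say how the parameter sets were actually generated.

```python
from itertools import product

def count_combinations(n_algorithms, k_parameters, l_values):
    """N algorithms, each with K parameters taking L values: N * L**K."""
    return n_algorithms * l_values ** k_parameters

def enumerate_grid(param_values):
    """Yield every parameter set for one algorithm as a dict."""
    names = sorted(param_values)
    for combo in product(*(param_values[name] for name in names)):
        yield dict(zip(names, combo))

total = count_combinations(5, 5, 10)
print(total)  # 500000

# A much smaller, hypothetical grid: 3 values x 2 values = 6 parameter sets.
small_grid = list(enumerate_grid({"c": [0.1, 1, 10], "ngram": [1, 2]}))
print(len(small_grid))  # 6
```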
So, We Parallelize on Hadoop
Good news:
- Mahout, a parallel machine learning tool, is already available.
- There are Mallet, libsvm, Weka, … that support the necessary algorithms.
Bad news:
- Mahout doesn't support the necessary algorithms yet.
- The other tools do not run natively on Hadoop.
Therefore, We…
- Build a flexible ML platform running on Hadoop that supports a wide range of algorithms, leveraging publicly available implementations.
- On top of our platform, generate / test hundreds of thousands of models, and choose the best.
- Use Pig for the Hadoop implementation.
Our Approach
[Figure: CONVENTIONAL trains sequentially on the training data. OUR APPROACH takes a train request, trains 1000s of models (one for each parameter set), selects the best model, and returns it.]
- More algorithms (thus a better model), and faster parallel processing
- AdaBoost, SVM, Decision Tree, Bayesian, and many others
What Parallelization?
[Figure: several Training Task boxes shown side by side]
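Because each training task depends only on its own parameter set, the tasks can run side by side and the best result can be picked afterwards. The sketch below imitates that idea on a single machine with Python's `concurrent.futures`; it is a stand-in for Hadoop tasks, not the platform's code, and the toy data, `training_task`, and the threshold "model" are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy labeled data: (feature value, label). Illustrative only.
DATA = [(0.1, 0), (0.3, 0), (0.45, 0), (0.6, 1), (0.8, 1), (0.9, 1)]

def training_task(threshold):
    """One independent task: score a threshold 'model', return (accuracy, threshold)."""
    correct = sum(1 for x, y in DATA if (x >= threshold) == bool(y))
    return correct / len(DATA), threshold

def run_tasks_in_parallel(thresholds, max_workers=4):
    """Run one task per parameter value; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(training_task, thresholds))

results = run_tasks_in_parallel([0.2, 0.5, 0.7])
best_accuracy, best_threshold = max(results)  # select the best model
print(best_threshold, best_accuracy)
```

Threads are used here only to keep the example self-contained; on a cluster each task would be a separate Hadoop task.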
General Processing Flow
Training Docs -> Preprocess -> Vectorized Docs -> Train -> Model
- Preprocess Parameters: stopword use, n-gram size, stemming, etc.
- Train Parameters: algorithm and algorithm-specific parameters (e.g. SVM, C, ε, and other kernel parameters)
Our Parallel Processing Flow
[Figure: Training Docs fan out into several sets of Vectorized Docs, and each set of Vectorized Docs fans out into several Models]
Preprocessing on Hadoop
Training Data (label, then text):
    business	Investments are taxed as capital gains.....
    business	It was the overleveraged and underregulated banks …
    none	I am afraid we may be headed for …
    none	In the famous words of Homer Simpson, "it takes 2 to lie …"
    …
Preprocessing Request (a parameter set per line):
    279	68	ngram_stem_stopword	1	snowball	true
    279	68	ngram_stem_stopword	2	snowball	true
    279	68	ngram_stem_stopword	3	snowball	true
    279	68	ngram_stem_stopword	1	porter	true
    279	68	ngram_stem_stopword	2	porter	true
    279	68	ngram_stem_stopword	3	none	false
    …
Preprocessing on Hadoop (see next slide) turns each parameter set into one set of vectors: Vector 1, Vector 2, …, Vector k.
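A request file with one parameter set per line can be generated mechanically. The Python sketch below reproduces the tab-separated layout of the snowball and porter lines shown on the slide. The slide does not explain the first two columns (279, 68), so they are passed through as an opaque prefix, and the function name `build_requests` is hypothetical.

```python
from itertools import product

def build_requests(prefix, ngram_sizes, stemmers, stopword_flags):
    """Return one tab-separated request line per parameter combination."""
    lines = []
    for stemmer, n, flag in product(stemmers, ngram_sizes, stopword_flags):
        fields = list(prefix) + ["ngram_stem_stopword", str(n), stemmer, str(flag).lower()]
        lines.append("\t".join(fields))
    return lines

# The prefix values are copied from the slide; their meaning is not stated there.
requests = build_requests(("279", "68"), [1, 2, 3], ["snowball", "porter"], [True])
for line in requests:
    print(line)
```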
Preprocessing on Hadoop: Big Picture
Pig script (each parameter set goes through a UDF call):
    par = LOAD param_file AS par1, par2, …;
    run = FOREACH par GENERATE RunPreprocess(par1, par2, …);
    STORE run ..;
The RunPreprocess() UDF runs the Preprocessors (Pluggable Pipes): Stemmer, Tokenizer, StopwordFilter, Vectorizer, FeatureSelector, …
Output: Vector 1, Vector 2, …, Vector k
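The "pluggable pipes" idea can be sketched as a list of small functions applied in order, so that a parameter set simply decides which pipes go into the list. This is a minimal Python illustration under stated assumptions: the stopword list is made up, `toy_stem` is a crude plural stripper and not the Porter or Snowball algorithm, and none of these names come from the actual RunPreprocess() UDF.

```python
from collections import Counter

STOPWORDS = {"the", "a", "are", "as", "to", "be", "is", "of"}  # illustrative list

def tokenize(text):
    """Split lower-cased text on whitespace."""
    return text.lower().split()

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def toy_stem(tokens):
    # Crude plural stripping; NOT the Porter or Snowball algorithm.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def make_ngram_vectorizer(n):
    """Build a pipe that turns tokens into a bag of n-grams."""
    def vectorize(tokens):
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        return Counter(grams)
    return vectorize

def run_preprocess(text, pipes):
    """Feed the output of each pipe into the next one."""
    data = text
    for pipe in pipes:
        data = pipe(data)
    return data

pipes = [tokenize, remove_stopwords, toy_stem, make_ngram_vectorizer(1)]
vector = run_preprocess("Investments are taxed as capital gains", pipes)
print(vector)
```

Swapping `make_ngram_vectorizer(1)` for `make_ngram_vectorizer(2)`, or dropping `toy_stem`, yields a different vector from the same document, which is what one request line per parameter set amounts to.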
Training on Hadoop
Vectors: [Figure: rows of binary digits standing for the vectorized documents]
Train Request (a parameter set per line):
    73	923	balanced_winnow	5	1	10	…
    73	923	balanced_winnow	5	2	10	…
    73	923	balanced_winnow	5	3	10	…
    73	923	balanced_winnow	5	1	20	…
    73	923	balanced_winnow	5	2	20	…
    73	923	balanced_winnow	5	3	20	…
    …
Training on Hadoop (see next slide) runs Mahout, Weka, Mallet or libsvm on the vectors and produces one model per parameter set: Model 1, Model 2, …, Model k.
Training on Hadoop: Big Picture
Pig script (each parameter set goes through a UDF call):
    par = LOAD param_file AS par1, par2, …;
    run = FOREACH par GENERATE RunTrainer(par1, par2, …);
    STORE run ..;
The RunTrainer() UDF calls into a trainer library:
- Mallet: Bagging, Balanced Winnow, C45, Decision Tree, …
- Mahout
Output: Model 1, Model 2, …
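One way to picture a pluggable trainer is a registry that maps the algorithm name in a request line to a training function. The Python sketch below is a guess at the shape of such a dispatcher, not the RunTrainer() UDF itself. It assumes the third tab-separated field names the algorithm (as in the balanced_winnow lines) and treats the remaining fields as uninterpreted parameters; the `majority_class` trainer is a toy stand-in for a real Mallet or Mahout call.

```python
TRAINERS = {}

def register(name):
    """Decorator that plugs a trainer function into the registry under `name`."""
    def decorator(fn):
        TRAINERS[name] = fn
        return fn
    return decorator

@register("majority_class")
def train_majority(labels, params):
    # Toy trainer: always predicts the most frequent label; ignores params.
    most_common = max(set(labels), key=labels.count)
    return {"algorithm": "majority_class", "predict": most_common, "params": params}

def run_trainer(request_line, labels):
    """Parse one request line and dispatch to the registered trainer."""
    fields = request_line.rstrip("\n").split("\t")
    algorithm, params = fields[2], fields[3:]  # assumed field layout
    if algorithm not in TRAINERS:
        raise KeyError(f"no trainer registered for {algorithm!r}")
    return TRAINERS[algorithm](labels, params)

model = run_trainer("73\t923\tmajority_class\t5\t1\t10",
                    ["abusive", "non-abusive", "non-abusive"])
print(model["predict"])
```

Adding another algorithm then means registering one more function, with no change to the dispatch code.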

Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 


Machine Learning with Hadoop

  • 1. Training on a pluggable machine learning platform: Machine Learning on Hadoop at Huffington Post | AOL
  • 2. A Little Bit about Us: Core Services Team at HPMG | AOL. Thu Kyaw (thu.kyaw@teamaol.com), Principal Software Engineer; worked on machine learning, data mining, and natural language processing. Sang Chul Song, Ph.D. (sangchul.song@teamaol.com), Senior Software Engineer; worked on data intensive computing: data archiving / information retrieval.
  • 3. Machine Learning: Supervised Classification. 1. Learning Phase: labeled examples (“Business”, “Entertainment”, “Politics”) → Train → Model. 2. Classifying Phase: new text (“capital gains to be taxed …”) → Classify with the Model → Result.
  • 4. Two Machine Learning Use Cases at HuffPost | AOL. Comment Moderation: evaluate all new HuffPost user comments every day; identify abusive / aggressive comments; auto delete / publish ~25% of comments every day. Article Classification: tag articles for advertising, e.g. scary, salacious, …
  • 5. Our Classification Tasks. Comment Moderation: abusive / non-abusive. Article Classification: scary, sexy.
  • 6. In Order to Meet Our Needs, We Require… Support for important algorithms, including SVM, Perceptron / Winnow, Bayesian, Decision Tree, AdaBoost, … The ability to build tons of models on a regular basis and pick the best, because, in general, it’s difficult to know in advance which algorithm / parameter set will work best.
  • 7. However, with N algorithms, K parameters each, and L values for each parameter, there are N × L^K combinations, which is often too many to deal with sequentially. For example, N=5, K=5, L=10 → 500K.
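The arithmetic on this slide can be checked with a short sketch. It assumes, as the slide does, that every algorithm has the same number of parameters and every parameter the same number of candidate values; the function names and the small example grid are hypothetical and do not come from the talk.

```python
from itertools import product


def count_combinations(n_algorithms, k_parameters, l_values):
    """Number of (algorithm, parameter set) pairs: N * L**K."""
    return n_algorithms * l_values ** k_parameters


def enumerate_grid(algorithms, parameter_values):
    """Yield (algorithm, params) for every combination in the grid.

    parameter_values maps a parameter name to its list of candidate values.
    """
    names = sorted(parameter_values)
    for algorithm in algorithms:
        for combo in product(*(parameter_values[name] for name in names)):
            yield algorithm, dict(zip(names, combo))


# The slide's example: N=5, K=5, L=10.
total = count_combinations(5, 5, 10)
print(total)  # 500000

# A tiny illustrative grid (hypothetical names and values).
small = list(enumerate_grid(["svm", "winnow"],
                            {"c": [1, 10], "eps": [0.1, 0.01, 0.001]}))
print(len(small))  # 2 algorithms * 2 * 3 values = 12
```

Each yielded pair corresponds to one model to train, which is why the count grows too quickly for sequential training.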
  • 8. So, we parallelize on Hadoop. Good news: Mahout, a parallel machine learning tool, is already available. There are also Mallet, libsvm, Weka, … that support the necessary algorithms. Bad news: Mahout doesn’t support the necessary algorithms yet. The other tools do not run natively on Hadoop.
  • 9. Therefore, we do… We build a flexible ML platform running on Hadoop that supports a wide range of algorithms, leveraging publicly available implementations. On top of our platform, we generate / test hundreds of thousands of models and choose the best. We use Pig for the Hadoop implementation.
  • 10. Our Approach. CONVENTIONAL: Training Data → Train (sequential) → Model. OUR APPROACH: Train Request → 1000s of Models (one for each param set) → Select → Return Best Model. More algorithms (thus a better model) and faster parallel processing: AdaBoost, SVM, Decision Tree, Bayesian, and a lot of others.
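The "train many models, then select the best" idea can be sketched locally. This is only an illustration under stated assumptions: a thread pool stands in for Hadoop tasks, and `train_and_score` is a hypothetical placeholder with a toy scoring rule, not the trainers used in the talk.

```python
from concurrent.futures import ThreadPoolExecutor


def train_and_score(param_set):
    """Hypothetical stand-in for training one model and measuring its quality."""
    algorithm, c = param_set
    # Toy score that peaks at c == 10; a real task would train and evaluate.
    score = 1.0 / (1.0 + abs(c - 10))
    return score, param_set


def select_best(param_sets, max_workers=4):
    """Train one model per parameter set in parallel and keep the best."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(train_and_score, param_sets))
    return max(results, key=lambda result: result[0])


param_sets = [("balanced_winnow", c) for c in (1, 5, 10, 20)]
best_score, best_params = select_best(param_sets)
print(best_params)  # ('balanced_winnow', 10)
```

Because each parameter set is trained independently, the work divides cleanly across workers; only the final selection needs all the scores.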
  • 11. What Parallelization? [Figure: several Training Tasks shown side by side.]
  • 12. General Processing Flow: Training Docs → Preprocess → Vectorized Docs → Train → Model. Preprocess Parameters: stopword use, n-gram size, stemming, etc. Train Parameters: algorithm and algorithm-specific parameters (e.g. SVM: C, Ɛ, and other kernel parameters).
  • 13. Our Parallel Processing Flow [Figure: Training Docs fan out into several sets of Vectorized Docs, and each set of Vectorized Docs fans out into several Models.]
  • 14. Preprocessing on Hadoop (see next slide). Training Data (label, then text): business | Investments are taxed as capital gains.....; business | It was the overleveraged and underregulated banks …; none | I am afraid we may be headed for …; none | In the famous words of Homer Simpson, “it takes 2 to lie …” … Preprocessing Request (a parameter set per line): 279 68 ngram_stem_stopword 1 snowball true; 279 68 ngram_stem_stopword 2 snowball true; 279 68 ngram_stem_stopword 3 snowball true; 279 68 ngram_stem_stopword 1 porter true; 279 68 ngram_stem_stopword 2 porter true; 279 68 ngram_stem_stopword 3 none false; … Output: Vector 1, Vector 2, Vector 3, Vector 4, Vector 5, …, Vector k.
  • 15. Preprocessing on Hadoop: Big Picture. Pig script: par = LOAD param_file AS (par1, par2, …); run = FOREACH par GENERATE RunPreprocess(par1, par2, …); STORE run ..; Through the UDF call, RunPreprocess() runs the Preprocessors (Pluggable Pipes): Stemmer, Tokenizer, StopwordFilter, Vectorizer, FeatureSelector, producing Vector 1, Vector 2, …, Vector k.
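One way to picture the "pluggable pipes" idea is as a list of interchangeable functions applied in order. The sketch below is an assumption about the general shape only: the stage names echo the slide, but the implementations, the bigram stage, and the `run_preprocess` helper are hypothetical, and the stemmer and feature selector are omitted for brevity.

```python
import re
from collections import Counter


def tokenizer(text):
    """Split lowercased text into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def make_stopword_filter(stopwords):
    """Build a pipe that drops the given stopwords."""
    def stopword_filter(tokens):
        return [token for token in tokens if token not in stopwords]
    return stopword_filter


def make_ngrammer(n):
    """Build a pipe that joins every run of n consecutive tokens."""
    def ngrammer(tokens):
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return ngrammer


def vectorizer(terms):
    """Turn a list of terms into a sparse term-count vector."""
    return Counter(terms)


def run_preprocess(text, pipes):
    """Feed the output of each pipe into the next one."""
    data = text
    for pipe in pipes:
        data = pipe(data)
    return data


# One parameter set = one choice of pipes (here: stopwords on, n-gram size 2).
pipes = [tokenizer, make_stopword_filter({"are", "as"}),
         make_ngrammer(2), vectorizer]
vector = run_preprocess("Investments are taxed as capital gains", pipes)
print(vector)
```

Swapping a pipe (for example a different n-gram size) changes the resulting vector without touching the other stages, which is what lets each request line map to its own pipeline.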
  • 16. Training on Hadoop (see next slide). Input: Vectors (shown as bit strings on the slide). Train Request (a parameter set per line): 73 923 balanced_winnow 5 1 10 …; 73 923 balanced_winnow 5 2 10 …; 73 923 balanced_winnow 5 3 10 …; 73 923 balanced_winnow 5 1 20 …; 73 923 balanced_winnow 5 2 20 …; 73 923 balanced_winnow 5 3 20 …; … Trainers: Mahout, Weka, Mallet or libsvm. Output: Model 1, Model 2, Model 3, Model 4, Model 5, …, Model k.
  • 20. C45
  • 28. Training on Hadoop: Trick #2. We call ML functions from a UDF. Some functions can take too long to return, and Hadoop will kill the job if they do. So RunTrainer() runs in the Main Thread while a separate “Pig Heartbeat” Thread keeps reporting that the task is still alive.
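The heartbeat trick can be sketched with plain threads. This is a minimal illustration, not the talk's implementation: `run_with_heartbeat` and `report_progress` are hypothetical names, and in a real UDF the callback would be whatever progress-reporting call the framework provides.

```python
import threading
import time


def run_with_heartbeat(task, report_progress, interval_seconds=1.0):
    """Run task() in the calling thread while a helper thread reports progress."""
    stop = threading.Event()

    def beat():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_seconds):
            report_progress()

    heartbeat = threading.Thread(target=beat, daemon=True)
    heartbeat.start()
    try:
        return task()
    finally:
        # Always stop the heartbeat, even if the task raises.
        stop.set()
        heartbeat.join()


beats = []


def slow_training():
    """Stand-in for a long-running trainer call."""
    time.sleep(0.3)
    return "model"


result = run_with_heartbeat(slow_training,
                            lambda: beats.append(time.time()),
                            interval_seconds=0.05)
print(result, len(beats))
```

Keeping the trainer in the main thread means its return value and exceptions propagate normally; the helper thread only signals liveness and stops as soon as the trainer finishes.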
  • 29. As a result, we now see… We are now able to build tens of thousands of models within an hour and choose the best. Previously, the same task took us days. As we can generate more models more frequently, we become more adaptive to the fast-changing Internet community, catching up with newly-coined terms, etc.
  • 30. Useful Resources. Mahout: http://mahout.apache.org/; Mallet: http://mallet.cs.umass.edu/; Weka: http://www.cs.waikato.ac.nz/ml/weka/; libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/; OpenNLP: http://incubator.apache.org/opennlp/; Pig: http://pig.apache.org/