Sangchul Song and Thu Kyaw discuss machine learning at AOL, and the challenges and solutions they encountered when trying to train a large number of machine learning models using Hadoop. Algorithms including SVM and packages like Mahout are discussed. Finally, they discuss their analytics pipeline, which includes some custom components used to interoperate with a range of machine learning libraries, as well as integration with the query language Pig.
1. Machine Learning on Hadoop at Huffington Post | AOL: Training on a Pluggable Machine Learning Platform
2. A Little Bit about Us. Core Services Team at HPMG | AOL. Thu Kyaw (thu.kyaw@teamaol.com), Principal Software Engineer: has worked on machine learning, data mining, and natural language processing. Sang Chul Song, Ph.D. (sangchul.song@teamaol.com), Senior Software Engineer: has worked on data-intensive computing (data archiving / information retrieval).
3. Machine Learning: Supervised Classification. (1) Learning phase: labeled training documents ("Business", "Entertainment", "Politics", ...) are fed to Train, which produces a Model. (2) Classifying phase: the Model classifies a new document (e.g. "capital gains to be taxed ...") and returns a result.
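The two phases on the slide can be sketched in a few lines. This is a deliberately tiny stand-in (word-count scoring over a made-up corpus), not the classifiers discussed in the talk; the labels and texts are illustrative only.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; labels and texts are illustrative only.
training = [
    ("business", "capital gains are taxed as income"),
    ("business", "banks report quarterly earnings"),
    ("politics", "the senate votes on the new bill"),
    ("politics", "election campaign enters final week"),
]

# Learning phase: count how often each word appears under each label.
counts = defaultdict(Counter)
for label, text in training:
    counts[label].update(text.split())

# Classifying phase: score a new document by word overlap with each label.
def classify(text):
    words = text.split()
    return max(counts, key=lambda label: sum(counts[label][w] for w in words))

print(classify("capital gains to be taxed"))  # business
```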
4. Two Machine Learning Use Cases at HuffPost | AOL. Comment moderation: evaluate all new HuffPost user comments every day, identify abusive / aggressive comments, and auto-delete / publish ~25% of comments every day. Article classification: tag articles for advertising, e.g. scary, salacious, ...
6. In Order to Meet Our Needs, We Require... Support for important algorithms, including SVM, Perceptron / Winnow, Bayesian, Decision Tree, AdaBoost, ... The ability to build a large number of models on a regular basis and pick the best, because, in general, it is difficult to know in advance which algorithm / parameter set will work best.
7. However, with N algorithms, K parameters each, and L values per parameter, there are N x L^K combinations, which is often too many to deal with sequentially. For example, N=5, K=5, L=10 gives 500K combinations.
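The count above can be checked in a couple of lines:

```python
# N algorithms, K parameters each, L values per parameter:
# each algorithm contributes L**K parameter combinations.
N, K, L = 5, 5, 10           # the numbers from the slide
combinations = N * L**K
print(combinations)          # 500000, i.e. 500K
```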
8. So, we parallelize on Hadoop. Good news: Mahout, a parallel machine learning tool, is already available, and there are Mallet, libsvm, Weka, ... that support the necessary algorithms. Bad news: Mahout doesn't support the necessary algorithms yet, and the other libraries do not run natively on Hadoop.
9. Therefore, we... build a flexible ML platform running on Hadoop that supports a wide range of algorithms, leveraging publicly available implementations. On top of our platform, we generate / test hundreds of thousands of models and choose the best. We use Pig for the Hadoop implementation.
10. Our Approach. Conventional: a train request on the training data trains a single model sequentially and returns it. Our approach: more algorithms (and thus better models) and faster parallel processing; AdaBoost, SVM, Decision Tree, Bayesian, and a lot of others; 1000s of models (one for each parameter set) are trained in parallel, and the best model is selected.
12. General Processing Flow: TrainingDocs -> Preprocess -> VectorizedDocs -> Train -> Model. Preprocess parameters: stopword use, n-gram size, stemming, etc. Train parameters: algorithm and algorithm-specific parameters (e.g. for SVM: C, Ɛ, and other kernel parameters).
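Enumerating every parameter set for this flow is a cross product over the value grids. A minimal sketch, assuming an illustrative preprocessing grid (the real platform reads these from a request file, and the parameter names here are made up):

```python
from itertools import product

# Hypothetical value grids for the preprocessing parameters on the slide.
preprocess_grid = {
    "ngram":    [1, 2, 3],
    "stemmer":  ["porter", "snowball", "none"],
    "stopword": [True, False],
}

# One request line per combination, mirroring the
# "a parameter set per line" request files.
def request_lines(grid):
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield " ".join(f"{k}={v}" for k, v in zip(keys, values))

lines = list(request_lines(preprocess_grid))
print(len(lines))   # 3 * 3 * 2 = 18
print(lines[0])     # ngram=1 stemmer=porter stopword=True
```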
13. Our Parallel Processing Flow: the TrainingDocs are preprocessed into multiple sets of VectorizedDocs (one per preprocessing parameter set), and each set of VectorizedDocs is then trained into multiple models (one per training parameter set).
14. Preprocessing on Hadoop. Training data (label, document):
business | Investments are taxed as capital gains ...
business | It was the overleveraged and underregulated banks ...
none | I am afraid we may be headed for ...
none | In the famous words of Homer Simpson, "it takes 2 to lie ..."
...
A preprocessing request (a parameter set per line), e.g.:
279 68 ngram_stem_stopword 1 snowball true
279 68 ngram_stem_stopword 2 snowball true
279 68 ngram_stem_stopword 3 snowball true
279 68 ngram_stem_stopword 1 porter true
279 68 ngram_stem_stopword 2 porter true
279 68 ngram_stem_stopword 3 none false
...
Preprocessing on Hadoop (see next slide) turns these into Vector 1, Vector 2, Vector 3, Vector 4, Vector 5, ..., Vector k.
15. Preprocessing on Hadoop: Big Picture. Vector 1 through Vector k are produced through a UDF call:
par = LOAD 'param_file' AS (par1, par2, ...);
run = FOREACH par GENERATE RunPreprocess(par1, par2, ...);
STORE run ...;
RunPreprocess() invokes the preprocessors (pluggable pipes): Tokenizer, StopwordFilter, Stemmer, Vectorizer, FeatureSelector.
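The "pluggable pipes" idea can be sketched as a chain of token-stream functions whose composition is itself a parameter. This is a minimal illustration, not AOL's code: the stage names mirror the slide, the stopword list and crude suffix-stripping "stemmer" are placeholders for real Porter/Snowball implementations.

```python
from collections import Counter

# Placeholder stopword list; a real pipeline would load a proper one.
STOPWORDS = {"the", "a", "to", "be", "of", "are", "as"}

def tokenize(text):
    return text.lower().split()

def stopword_filter(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(tokens):
    # Crude suffix stripping, standing in for a Porter/Snowball stemmer.
    return [t[:-1] if t.endswith("s") else t for t in tokens]

def vectorize(tokens):
    # Bag-of-words counts as the document vector.
    return dict(Counter(tokens))

def run_preprocess(text, pipes):
    # Each pipe maps a token stream to a token stream; order is configurable.
    tokens = tokenize(text)
    for pipe in pipes:
        tokens = pipe(tokens)
    return vectorize(tokens)

vec = run_preprocess("Investments are taxed as capital gains",
                     [stopword_filter, stem])
print(vec)  # {'investment': 1, 'taxed': 1, 'capital': 1, 'gain': 1}
```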
16. Training on Hadoop. The vectorized documents (rows of binary feature vectors), together with a train request (a parameter set per line), go through training on Hadoop (see next slide), which calls Mahout, Weka, Mallet, or libsvm and produces Model 1, Model 2, ..., Model k. Example request lines:
73 923 balanced_winnow 5 1 10 ...
73 923 balanced_winnow 5 2 10 ...
73 923 balanced_winnow 5 3 10 ...
73 923 balanced_winnow 5 1 20 ...
73 923 balanced_winnow 5 2 20 ...
73 923 balanced_winnow 5 3 20 ...
...
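One way to picture how a request line fans out to the different libraries is a small parse-and-dispatch step. The field layout below (two counts, then the algorithm name, then algorithm-specific parameters) is an assumption inferred from the slide's example rows, and the trainer stubs merely stand in for the Mahout / Weka / Mallet / libsvm wrappers.

```python
# Assumed layout of one train-request line: inferred from the slide's rows,
# e.g. "73 923 balanced_winnow 5 1 10", and not confirmed by the talk.
def parse_request(line):
    fields = line.split()
    return {
        "n_features": int(fields[0]),
        "n_docs": int(fields[1]),
        "algorithm": fields[2],
        "params": fields[3:],
    }

# Stub trainers standing in for the real library wrappers.
def train_balanced_winnow(params):
    return ("mallet", params)

def train_svm(params):
    return ("libsvm", params)

DISPATCH = {
    "balanced_winnow": train_balanced_winnow,
    "svm": train_svm,
}

req = parse_request("73 923 balanced_winnow 5 1 10")
library, params = DISPATCH[req["algorithm"]](req["params"])
print(library, params)  # mallet ['5', '1', '10']
```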
28. Training on Hadoop: Trick #2. We call ML functions from a UDF. Some functions can take too long to return, and Hadoop will kill the task if no progress is reported in the meantime. So RunTrainer() runs the ML call in the main thread while a separate "Pig heartbeat" thread keeps reporting progress.
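The heartbeat trick can be sketched as follows. This is an illustration in Python rather than the actual Java UDF: `report_progress` is a placeholder for whatever progress callback the framework exposes, and the daemon thread signals liveness while the main thread blocks inside the long-running ML call.

```python
import threading
import time

def with_heartbeat(long_call, report_progress, interval=1.0):
    """Run long_call() while a background thread keeps reporting progress."""
    done = threading.Event()

    def beat():
        while not done.is_set():
            report_progress()      # tell the framework we are still alive
            done.wait(interval)

    heartbeat = threading.Thread(target=beat, daemon=True)
    heartbeat.start()
    try:
        return long_call()         # e.g. a blocking trainer call
    finally:
        done.set()
        heartbeat.join()

# Toy demonstration: the "trainer" sleeps, the heartbeat fires meanwhile.
beats = []
result = with_heartbeat(lambda: (time.sleep(0.25), "model")[1],
                        lambda: beats.append(1), interval=0.1)
print(result, len(beats) >= 1)  # model True
```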
29. As a result, we now see... We are able to build tens of thousands of models within an hour and choose the best; previously, the same task took us days. Because we can generate more models more frequently, we are more adaptive to the fast-changing Internet community, catching up with newly coined terms, etc.