Sangchul Song and Thu Kyaw discuss machine learning at AOL, and the challenges and solutions they encountered when trying to train a large number of machine learning models using Hadoop. Algorithms including SVM and packages like Mahout are discussed. Finally, they discuss their analytics pipeline, which includes some custom components used to interoperate with a range of machine learning libraries, as well as integration with the query language Pig.
1. Machine Learning on Hadoop at Huffington Post | AOL: Training on a Pluggable Machine Learning Platform
2. A Little Bit about Us. Core Services Team at HPMG | AOL. Thu Kyaw (thu.kyaw@teamaol.com), Principal Software Engineer: has worked on machine learning, data mining, and natural language processing. Sang Chul Song, Ph.D. (sangchul.song@teamaol.com), Senior Software Engineer: has worked on data-intensive computing (data archiving / information retrieval).
3. Machine Learning: Supervised Classification. (1) Learning phase: labeled training documents ("Business", "Entertainment", "Politics", ...) are fed to Train, which produces a Model. (2) Classifying phase: the Model classifies a new document (e.g. "capital gains to be taxed ...") and returns a result.
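The two phases on the slide can be sketched in a few lines. This is a deliberately tiny stand-in (word-count scoring over a made-up corpus), not the classifiers discussed in the talk; the labels and texts are illustrative only.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; labels and texts are illustrative only.
training = [
    ("business", "capital gains are taxed as income"),
    ("business", "banks report quarterly earnings"),
    ("politics", "the senate votes on the new bill"),
    ("politics", "election campaign enters final week"),
]

# Learning phase: count how often each word appears under each label.
counts = defaultdict(Counter)
for label, text in training:
    counts[label].update(text.split())

# Classifying phase: score a new document by word overlap with each label.
def classify(text):
    words = text.split()
    return max(counts, key=lambda label: sum(counts[label][w] for w in words))

print(classify("capital gains to be taxed"))  # business
```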
4. Two Machine Learning Use Cases at HuffPost | AOL. Comment moderation: evaluate all new HuffPost user comments every day, identify abusive / aggressive comments, and auto-delete / publish ~25% of comments every day. Article classification: tag articles for advertising, e.g. scary, salacious, ...
6. In Order to Meet Our Needs, We Require... Support for important algorithms, including SVM, Perceptron / Winnow, Bayesian, Decision Tree, AdaBoost, ... The ability to build a large number of models on a regular basis and pick the best, because, in general, it is difficult to know in advance which algorithm / parameter set will work best.
7. However, with N algorithms, K parameters each, and L values per parameter, there are N x L^K combinations, which is often too many to deal with sequentially. For example, N=5, K=5, L=10 gives 500K combinations.
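The count above can be checked in a couple of lines:

```python
# N algorithms, K parameters each, L values per parameter:
# each algorithm contributes L**K parameter combinations.
N, K, L = 5, 5, 10           # the numbers from the slide
combinations = N * L**K
print(combinations)          # 500000, i.e. 500K
```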
8. So, we parallelize on Hadoop. Good news: Mahout, a parallel machine learning tool, is already available, and there are Mallet, libsvm, Weka, ... that support the necessary algorithms. Bad news: Mahout doesn't support the necessary algorithms yet, and the other libraries do not run natively on Hadoop.
9. Therefore, we... build a flexible ML platform running on Hadoop that supports a wide range of algorithms, leveraging publicly available implementations. On top of our platform, we generate / test hundreds of thousands of models and choose the best. We use Pig for the Hadoop implementation.
10. Our Approach. Conventional: a train request on the training data trains a single model sequentially and returns it. Our approach: more algorithms (and thus better models) and faster parallel processing; AdaBoost, SVM, Decision Tree, Bayesian, and a lot of others; 1000s of models (one for each parameter set) are trained in parallel, and the best model is selected.
12. General Processing Flow: TrainingDocs -> Preprocess -> VectorizedDocs -> Train -> Model. Preprocess parameters: stopword use, n-gram size, stemming, etc. Train parameters: algorithm and algorithm-specific parameters (e.g. for SVM: C, Ɛ, and other kernel parameters).
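Enumerating every parameter set for this flow is a cross product over the value grids. A minimal sketch, assuming an illustrative preprocessing grid (the real platform reads these from a request file, and the parameter names here are made up):

```python
from itertools import product

# Hypothetical value grids for the preprocessing parameters on the slide.
preprocess_grid = {
    "ngram":    [1, 2, 3],
    "stemmer":  ["porter", "snowball", "none"],
    "stopword": [True, False],
}

# One request line per combination, mirroring the
# "a parameter set per line" request files.
def request_lines(grid):
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield " ".join(f"{k}={v}" for k, v in zip(keys, values))

lines = list(request_lines(preprocess_grid))
print(len(lines))   # 3 * 3 * 2 = 18
print(lines[0])     # ngram=1 stemmer=porter stopword=True
```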
13. Our Parallel Processing Flow: the TrainingDocs are preprocessed into multiple sets of VectorizedDocs (one per preprocessing parameter set), and each set of VectorizedDocs is then trained into multiple models (one per training parameter set).
14. Preprocessing on Hadoop. Training data (label, document):
business | Investments are taxed as capital gains ...
business | It was the overleveraged and underregulated banks ...
none | I am afraid we may be headed for ...
none | In the famous words of Homer Simpson, "it takes 2 to lie ..."
...
A preprocessing request (a parameter set per line), e.g.:
279 68 ngram_stem_stopword 1 snowball true
279 68 ngram_stem_stopword 2 snowball true
279 68 ngram_stem_stopword 3 snowball true
279 68 ngram_stem_stopword 1 porter true
279 68 ngram_stem_stopword 2 porter true
279 68 ngram_stem_stopword 3 none false
...
Preprocessing on Hadoop (see next slide) turns these into Vector 1, Vector 2, Vector 3, Vector 4, Vector 5, ..., Vector k.
15. Preprocessing on Hadoop: Big Picture. Vector 1 through Vector k are produced through a UDF call:
par = LOAD 'param_file' AS (par1, par2, ...);
run = FOREACH par GENERATE RunPreprocess(par1, par2, ...);
STORE run ...;
RunPreprocess() invokes the preprocessors (pluggable pipes): Tokenizer, StopwordFilter, Stemmer, Vectorizer, FeatureSelector.
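The "pluggable pipes" idea can be sketched as a chain of token-stream functions whose composition is itself a parameter. This is a minimal illustration, not AOL's code: the stage names mirror the slide, the stopword list and crude suffix-stripping "stemmer" are placeholders for real Porter/Snowball implementations.

```python
from collections import Counter

# Placeholder stopword list; a real pipeline would load a proper one.
STOPWORDS = {"the", "a", "to", "be", "of", "are", "as"}

def tokenize(text):
    return text.lower().split()

def stopword_filter(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(tokens):
    # Crude suffix stripping, standing in for a Porter/Snowball stemmer.
    return [t[:-1] if t.endswith("s") else t for t in tokens]

def vectorize(tokens):
    # Bag-of-words counts as the document vector.
    return dict(Counter(tokens))

def run_preprocess(text, pipes):
    # Each pipe maps a token stream to a token stream; order is configurable.
    tokens = tokenize(text)
    for pipe in pipes:
        tokens = pipe(tokens)
    return vectorize(tokens)

vec = run_preprocess("Investments are taxed as capital gains",
                     [stopword_filter, stem])
print(vec)  # {'investment': 1, 'taxed': 1, 'capital': 1, 'gain': 1}
```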
16. Training on Hadoop. The vectorized documents (rows of binary feature vectors), together with a train request (a parameter set per line), go through training on Hadoop (see next slide), which calls Mahout, Weka, Mallet, or libsvm and produces Model 1, Model 2, ..., Model k. Example request lines:
73 923 balanced_winnow 5 1 10 ...
73 923 balanced_winnow 5 2 10 ...
73 923 balanced_winnow 5 3 10 ...
73 923 balanced_winnow 5 1 20 ...
73 923 balanced_winnow 5 2 20 ...
73 923 balanced_winnow 5 3 20 ...
...
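One way to picture how a request line fans out to the different libraries is a small parse-and-dispatch step. The field layout below (two counts, then the algorithm name, then algorithm-specific parameters) is an assumption inferred from the slide's example rows, and the trainer stubs merely stand in for the Mahout / Weka / Mallet / libsvm wrappers.

```python
# Assumed layout of one train-request line: inferred from the slide's rows,
# e.g. "73 923 balanced_winnow 5 1 10", and not confirmed by the talk.
def parse_request(line):
    fields = line.split()
    return {
        "n_features": int(fields[0]),
        "n_docs": int(fields[1]),
        "algorithm": fields[2],
        "params": fields[3:],
    }

# Stub trainers standing in for the real library wrappers.
def train_balanced_winnow(params):
    return ("mallet", params)

def train_svm(params):
    return ("libsvm", params)

DISPATCH = {
    "balanced_winnow": train_balanced_winnow,
    "svm": train_svm,
}

req = parse_request("73 923 balanced_winnow 5 1 10")
library, params = DISPATCH[req["algorithm"]](req["params"])
print(library, params)  # mallet ['5', '1', '10']
```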
28. Training on Hadoop: Trick #2. We call ML functions from a UDF. Some functions can take too long to return, and Hadoop will kill the task if no progress is reported in the meantime. So RunTrainer() runs the ML call in the main thread while a separate "Pig heartbeat" thread keeps reporting progress.
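The heartbeat trick can be sketched as follows. This is an illustration in Python rather than the actual Java UDF: `report_progress` is a placeholder for whatever progress callback the framework exposes, and the daemon thread signals liveness while the main thread blocks inside the long-running ML call.

```python
import threading
import time

def with_heartbeat(long_call, report_progress, interval=1.0):
    """Run long_call() while a background thread keeps reporting progress."""
    done = threading.Event()

    def beat():
        while not done.is_set():
            report_progress()      # tell the framework we are still alive
            done.wait(interval)

    heartbeat = threading.Thread(target=beat, daemon=True)
    heartbeat.start()
    try:
        return long_call()         # e.g. a blocking trainer call
    finally:
        done.set()
        heartbeat.join()

# Toy demonstration: the "trainer" sleeps, the heartbeat fires meanwhile.
beats = []
result = with_heartbeat(lambda: (time.sleep(0.25), "model")[1],
                        lambda: beats.append(1), interval=0.1)
print(result, len(beats) >= 1)  # model True
```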
29. As a result, we now see... We are able to build tens of thousands of models within an hour and choose the best; previously, the same task took us days. Because we can generate more models more frequently, we are more adaptive to the fast-changing Internet community, catching up with newly coined terms, etc.