Globally Scalable Web Document Classification
Using Word2Vec
Kohei Nakaji (SmartNews)
keyword: machine learning for discovery
SmartNews Demo
About SmartNews
Japan
Launched 2013
4M+ Monthly Active Users
50% DAU/MAU
100+ Publishers
2013 App of The Year
US
Launched Oct 2014
1M+ Monthly Active Users
Same engagement
80+ Publishers
Top News Category App
International
Launched Feb 2015
10M Downloads WW
Same engagement
English beta
Featured App
Funding: $50M
Outline of our algorithm
Signals on the Internet → URLs Found (10 million/day) → Structure Analysis → Semantics Analysis → Importance Estimation → Diversification → delivery (1000+/day)
Web Document Classification ⊂ Structure Analysis & Semantics Analysis
Web Document Classification
ENTERTAINMENT / SPORTS / TECHNOLOGY / LIFESTYLE / SCIENCE / WORLD / …
Task definition: when an arbitrary web document arrives, choose one category exclusively from a pre-determined category set.
Web Document Classification
There are roughly two steps:
① Main Content Extraction
② Text Classification
(① extract the main content, then ② classify it, e.g. as ENTERTAINMENT)
Main Content Extraction
Two approaches:
・Extract after rendering the whole page: easier, but takes time
・Extract from the HTML directly: more difficult, but fast ← Our Approach
Main Content Extraction from HTML
Example (main content vs. not main content):

<html>
<body>
<div>click <a>here</a> for </div>                      ← not main content
<div>
<a>tweet</a><a>share</a>                               ← not main content
<p>
Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.
</p>                                                   ← main content
<a>you also like this</a>                              ← not main content
<p>
So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?
</p>                                                   ← main content
</div>
</body>
</html>
Main Content Extraction from HTML
A rule-based extraction algorithm is possible.
English:
Rule 1: a div with text length > 200 and num of 'a' tags < 3 is main content.
Rule 2: a div with text length < 100 and num of 'p' tags > 4 is main content.
…
Rule N: …
But this is not scalable: the rule set keeps growing, and Japanese (and every other language) needs its own rules (…).
Main Content Extraction from HTML
We are using a machine learning approach; see Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)

① training: block separation & feature extraction → block1: (features, main), block2: (features, not main), block3: (features, main), … → train a decision tree
② live data: block separation & feature extraction → block1: (features), block2: (features), block3: (features), … → classify with the trained decision tree
Feature Extraction from HTML

<html>
<body>
<div>click <a>here</a> for </div>
<div>
<a>tweet</a><a>share</a>
<p>
Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.
</p>
<a>you also like this</a>
<p>
So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p></div>
</body>
</html>

Step 1: Separate the HTML into 'text block's.
Step 2: Extract local features for every text block.
  ex: word count = 36, num of <a> = 0
Step 3: Define the feature of each text block as a combination of local features (a sketch follows below).
  ex: word count (current block): 36, num of <a> (current block): 0,
      word count (previous block): 4, num of <a> (previous block): 1
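
As a rough illustration of Steps 1-3, here is a minimal Python sketch (the BLOCK_TAGS list, the leaf-block heuristic, and the BeautifulSoup-based splitting are assumptions for illustration, not SmartNews's actual block separator):

# Sketch: split HTML into leaf text blocks, compute local features,
# then combine each block's features with its previous block's (Step 3).
from bs4 import BeautifulSoup

BLOCK_TAGS = ["p", "div", "li", "h1", "h2", "h3"]  # assumed block-level tags

def local_features(block):
    text = block.get_text(" ", strip=True)
    return {"word_count": len(text.split()),
            "num_a": len(block.find_all("a"))}

def block_features(html):
    soup = BeautifulSoup(html, "html.parser")
    # keep only 'leaf' blocks: block-level tags with no block-level children
    blocks = [b for b in soup.find_all(BLOCK_TAGS) if b.find(BLOCK_TAGS) is None]
    feats = [local_features(b) for b in blocks]
    combined = []
    for i, f in enumerate(feats):
        prev = feats[i - 1] if i > 0 else {"word_count": 0, "num_a": 0}
        combined.append([f["word_count"], f["num_a"],
                         prev["word_count"], prev["num_a"]])
    return blocks, combined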
Making Main Content Using Decision Tree
block1: (features) → not main
block2: (features) → not main
block3: (features) → main
block4: (features) → not main
block5: (features) → main
Combining the blocks labeled 'main' gives the main content.
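
Below is a minimal sketch of this training/live flow with scikit-learn's decision tree; the feature layout follows the Step 3 example, and all numbers are made up for illustration (this is not SmartNews's trained model):

# Sketch: ① train a decision tree on labeled block features, ② label live blocks.
from sklearn.tree import DecisionTreeClassifier

# ① training: feature vectors from annotated pages, labeled main / not main
# columns: word_count_cur, num_a_cur, word_count_prev, num_a_prev
X_train = [[36, 0, 4, 1],
           [4, 1, 0, 0],
           [50, 0, 36, 0]]
y_train = ["main", "not_main", "main"]
tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)

# ② live data: classify each block, then concatenate the 'main' blocks
X_live = [[3, 2, 0, 0], [40, 0, 3, 2]]
labels = tree.predict(X_live)          # e.g. ['not_main', 'main']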
Web Document Classification (recap)
There are roughly two steps: ① Main Content Extraction (done) → ② Text Classification (next).
Text Classification
Ordinary text classification architecture:
① training: labeled texts → feature extraction → (features, entertainment), (features, sports), (features, entertainment), (features, politics), … → training algorithm → classifier
② live data: new text → feature extraction → (features) → classifier → predicted category, e.g. sports
Feature Extraction in Text Classification
'Bag-of-words' is commonly used as a feature vector, with some feature engineering.
'Will LeBron James deliver an NBA championship to Cleveland?'
→ {Will, LeBron, James, deliver, an, NBA, championship, to, Cleveland}
・stop words: drop function words such as 'Will', 'an', 'to'
・sports players dictionary: map 'LeBron James' → NBA_PLAYER
・tf-idf weighting
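
For reference, a minimal bag-of-words + tf-idf extractor with scikit-learn (>= 1.0); the dictionary replacement is shown as naive preprocessing, and the whole snippet is generic, not SmartNews's pipeline:

# Sketch: dictionary normalization, then tf-idf-weighted bag-of-words.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Will LeBron James deliver an NBA championship to Cleveland?"]
# a domain dictionary can be applied as preprocessing, e.g.:
docs = [d.replace("LeBron James", "NBA_PLAYER") for d in docs]

vec = TfidfVectorizer(stop_words="english", lowercase=True)
X = vec.fit_transform(docs)               # sparse document-term matrix
print(vec.get_feature_names_out())        # terms that survive stop-word removal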
Feature Extraction in Text Classification
The same approach is used in Japanese (after word segmentation):
私は中路です。よろしくお願いします。 ('I am Nakaji. Nice to meet you.')
→ 私 / は / 中路 / です / よろしく / お願い / し / ます
・stop words removal
・person dictionary: map 中路 → PERSON
・tf-idf weighting
Another Option: Paragraph Vector
Example:
私は中路です。よろしくお願いします。 → [0.2, 0.3, …, 0.2]
'Will LeBron James deliver an NBA championship to Cleveland?' → [0.1, 0.4, …, 0.1]
Each document becomes one Paragraph Vector (dimension ~ several hundred).
Outline of Distributed Representation
・word2vec: every word is mapped to a unique word vector. (https://code.google.com/p/word2vec/)
・paragraph vector: every document is mapped to a unique vector. (Quoc V. Le, Tomas Mikolov, http://arxiv.org/abs/1405.4053)
Word Vector in the word2vec Model
Every word is mapped to a unique word vector with good properties:
v_Germany = [0.1, 0.2, …, 0.2]
v_Berlin  = [0.1, 0.1, …, -0.1]
v_Paris   = [0.3, 0.4, …, 0]
v_France  = [0.3, 0.3, …, 0.3]
…
"v_Germany - v_Berlin = v_France - v_Paris"
Procedure to Create Word Vectors
Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf)

Prepare a set of documents and index every word: 'A cat sat on the street.' … 'I love cat very much.' (w_220, w_221, …) 'He comes from Japan.' …

Objective Function (CBOW case):
L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \dots, w_{t+c})

Model (sum case):
P(w_t \mid w_{t-c}, \dots, w_{t+c}) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}, \qquad v = \sum_{t' \neq t,\ |t'-t| \le c} v_{w_{t'}}

Procedure:
① Maximize L for u_w and v_w.
② v_w is the word vector for w.

Word vectors are trained so that they become good features for predicting surrounding words (e.g. predicting 'on' from 'cat', 'sat', 'the', 'street').
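
In practice this model is rarely implemented by hand; here is a minimal sketch with gensim (>= 4.0), where the toy corpus and hyperparameters are illustrative only:

# Sketch: train CBOW word2vec; with a real corpus you can test analogies.
from gensim.models import Word2Vec

sentences = [["a", "cat", "sat", "on", "the", "street"],
             ["i", "love", "cat", "very", "much"],
             ["he", "comes", "from", "japan"]]
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)  # sg=0: CBOW

v_cat = model.wv["cat"]                    # the word vector v_cat
# On a large corpus: v_Paris ≈ v_France - v_Germany + v_Berlin, i.e.
# model.wv.most_similar(positive=["france", "berlin"], negative=["germany"])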
Procedure to Create Paragraph Vectors
Quoc V. Le, Tomas Mikolov (http://arxiv.org/abs/1405.4053)
Add a vector to the model for each document.

Prepare a set of documents, label them doc_1, doc_2, …, and index every word:
doc_1: 'A cat sat on the street.' …  doc_2: 'I love cat very much.' (w_220) … 'He comes from Japan.' (w_221) …

Objective Function (PV-DM case):
L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \dots, w_{t+c}, \mathrm{doc}_i)

Model (sum case):
P(w_t \mid w_{t-c}, \dots, w_{t+c}, \mathrm{doc}_i) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}, \qquad v = \sum_{t' \neq t,\ |t'-t| \le c} v_{w_{t'}} + d_i,
where doc_i is the document in which w_t is included and d_i is its document vector.

Procedure:
① Maximize L for u_w, v_w, and d_i.
② Preserve u_w, v_w as ũ_w, ṽ_w.
Procedure to Create Paragraph Vector (live data)
After training, we can get a good paragraph vector as a feature for a new document.
New doc: 'We love SmartNews. … I love SmartNews very much.'

Objective Function (PV-DM case):
L_{doc} = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \dots, w_{t+c}, \mathrm{doc})

Model (sum case):
P(w_t \mid w_{t-c}, \dots, w_{t+c}, \mathrm{doc}) = \frac{\exp(\tilde{u}_{w_t} \cdot \tilde{v})}{\sum_{W} \exp(\tilde{u}_W \cdot \tilde{v})}, \qquad \tilde{v} = \sum_{t' \neq t,\ |t'-t| \le c} \tilde{v}_{w_{t'}} + d

Procedure (continued):
③ Maximize L_doc for d (ũ_w, ṽ_w stay fixed from training).
④ Use d as the paragraph vector.
Procedure to Create Paragraph Vector
[Figure: training corpus → maximize L → Feature Extractor (ũ_w, ṽ_w); new document → maximize L_doc → Paragraph Vector d, e.g. [0.2, 0.3, …, 0.2]]
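
The same two-phase flow (train the extractor, then infer a vector for an unseen document) is available in gensim's Doc2Vec; a sketch with illustrative settings (gensim >= 4.0):

# Sketch: train PV-DM on a tagged corpus, then infer d for a new document.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(["a", "cat", "sat", "on", "the", "street"], ["doc_1"]),
          TaggedDocument(["i", "love", "cat", "very", "much"], ["doc_2"])]
model = Doc2Vec(corpus, vector_size=100, window=2, min_count=1, dm=1, epochs=40)  # dm=1: PV-DM

# live data: ũ_w, ṽ_w stay fixed; only the new document's vector d is optimized
d = model.infer_vector(["we", "love", "smartnews"])   # the paragraph vector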
Text Classification
The same ordinary text classification architecture, now with Paragraph Vectors as features:
① training: ([0.1, 0.3, …], entertainment), ([0.2, -0.3, …], sports), ([0.1, 0.1, …], entertainment), ([0.1, -0.2, …], politics), … → training algorithm → classifier
② live data: ([0.1, -0.1, …]) → classifier → sports
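
The speaker notes say the training algorithm in this application is ordinary logistic regression; a minimal sketch wiring paragraph vectors into it (toy 2-dimensional vectors, made-up labels):

# Sketch: logistic regression over paragraph-vector features.
from sklearn.linear_model import LogisticRegression

X_train = [[0.1, 0.3], [0.2, -0.3], [0.1, 0.1], [0.1, -0.2]]
y_train = ["entertainment", "sports", "entertainment", "politics"]
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print(clf.predict([[0.2, -0.25]]))         # e.g. ['sports']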
Benefits of Using Paragraph Vector
Good:
・High precision in text classification: several percent better than Bag-of-Words with feature engineering on our Japanese/English data set (labeled: ~several 10,000 documents; unlabeled: ~100,000).
・High scalability: we don't need to work hard on feature engineering for each language.
Bad:
・Difficulty in analyzing errors: it is hard to understand the meaning of each component of a paragraph vector.
Benefits of Using Paragraph Vector
Importantly, the Paragraph Vector has a different nature than Bag-of-Words.
Why this matters: we can get a better classifier by combining two different types of classifiers.
Our Use Case
Bag-of-Words-based classifier vs. Paragraph Vector-based classifier:
・Validation: use one to validate the other.
・Combination: use the more reliable result of the two classifiers (one possible rule is sketched below).

Our Use Case (future)
In multilingual localization: use only the Paragraph Vector-based classifier, without any feature engineering.
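
The deck does not spell out the combination rule; one simple reading of "use the more reliable result" is to trust whichever classifier reports the higher confidence. A sketch of that heuristic (clf_bow and clf_pv are hypothetical scikit-learn-style classifiers exposing predict_proba):

# Sketch: pick the prediction of whichever classifier is more confident.
import numpy as np

def combine(clf_bow, clf_pv, x_bow, x_pv):
    p_bow = clf_bow.predict_proba([x_bow])[0]
    p_pv = clf_pv.predict_proba([x_pv])[0]
    if p_bow.max() >= p_pv.max():
        return clf_bow.classes_[np.argmax(p_bow)]
    return clf_pv.classes_[np.argmax(p_pv)]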
Web Document Classification (recap)
That covers both steps: ① Main Content Extraction → ② Text Classification (e.g. → ENTERTAINMENT).
The Challenge
News is uncertainty-seeking for long-term value: exploration, not only exploitation.
・What Big Data firms typically do (exploitation): preference estimation and risk quantification.
・What SmartNews does (exploration): uncertainty seeking, discovery.
What if parents don't feed vegetables to children who only like meat? What if you keep hearing only opinions that match yours?
The Challenge
We search for a form of exploration that is not optimal, but acceptable.
Why? Humans are not rational enough to simply accept the optimum, and without acceptance, users will never read SmartNews.
We are developing:
① for better feature vectors of users and articles: topic extraction, image extraction (user interests → feature vectors for 10 million users × real-time feature vectors for articles)
② for human-acceptable exploration: a multi-armed bandit based scoring model (a generic sketch follows below)
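
The slide names a multi-armed bandit based scoring model without further detail; purely for flavor, here is a generic UCB1 sketch that trades off a candidate's observed reward (exploitation) against an uncertainty bonus (exploration). It is illustrative only, not SmartNews's model:

# Sketch: UCB1: score each arm (e.g. an article) by mean reward + exploration bonus.
import math

def ucb1_score(mean_reward, pulls, total_pulls):
    if pulls == 0:
        return float("inf")        # unexplored arms are tried first
    return mean_reward + math.sqrt(2.0 * math.log(total_pulls) / pulls)

def choose(stats):                 # stats: list of (mean_reward, pulls)
    total = sum(p for _, p in stats) or 1
    scores = [ucb1_score(m, p, total) for m, p in stats]
    return scores.index(max(scores))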
We are building our engineering team in SF - please join us!
採用してます (we're hiring)
・ML/NLP Engineer
・Data Science Engineer
…
kohei.nakaji@smartnews.com
References
Main Content Extraction
・Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl, "Boilerplate Detection using Shallow Text Features"
・BoilerPipe (Google Code)
Text Classification
・Quoc V. Le, Tomas Mikolov, "Distributed Representations of Sentences and Documents"
・Word2Vec (Google Code)
References
Articles about SmartNews
・"Japan's SmartNews Raises Another $10M At A $320M Valuation To Expand In The U.S."
・"SmartNews, The Minimalist News App That's A Hit In Japan, Sets Its Sights On The U.S."
・"Japanese news app SmartNews nabs $10M bridge round, at pre-money valuation of $320M"
・"About our Company SmartNews"
Editor's Notes

1. Hello, I am Kohei Nakaji, an engineer at SmartNews Inc. I develop the news delivery algorithm at SmartNews, using machine learning and natural language processing in particular. My research background is not in ML but in particle physics theory: the beginning of the universe, dark matter, and so on, so if you are interested in physics I'd be happy to talk about that another day. Anyway, today I'm going to talk about this topic: 'Globally Scalable Web Document Classification Using Word2Vec'. Because this talk is based on technology at SmartNews, I'll start with a brief introduction of our company. We at SmartNews develop the iOS/Android application SmartNews.
2. How many of you use SmartNews? Very few. How many of you love machine learning? Great, then you will love SmartNews, because our app is built on machine learning. SmartNews is a news app for more than 100 countries, but we have no writers and no editors; the algorithm does everything. How many of you use a news app every day? Most news apps fail: some have great download numbers but annoy users and end up with a low engagement ratio. We at SmartNews have 10M downloads globally, and more than 50% of those users are active. We have a real chance of becoming the successful news app. So what makes SmartNews different?
3. The keyword is 'machine learning for discovery'. Some apps rely on human editors; they are not scalable and they can be biased. Other apps use machine learning in their delivery algorithms, but only for personalization. We use machine learning so that everyone on earth can discover and learn new things they might not otherwise have seen. This is our mission. We are developing an algorithm that lets users discover new things, and that is what keeps our engagement ratio high. Now let me show you a demo of our app.
4. Let me show you how it works. When you open the app, you see the top news right here: the latest important news chosen by our algorithm. Over here you have tabs for different categories, which are the most direct result of web document classification; in each category you see the latest important news, again chosen by our algorithm. You can imagine how precise our web document classification needs to be. One of the cool things: when you find an article you want to read, say this one right here, you have the Smart View option. You'll like it, because it looks very clean: no banners, no ads. Over here is the web view, the ordinary web browser, where you see a lot of things you don't want to read; Smart View is much simpler and cleaner. You can imagine how difficult it is to create Smart View from an arbitrary website; I will introduce some of that algorithm in this talk. Another cool thing about Smart View: it works offline, so you can read in the metro, on a plane, anywhere.
5. As I said, we have 10M downloads and more than 50% of users are active. There are three editions: the Japanese edition, the US edition, and the international edition. In the international edition users read English articles localized for more than 100 countries, yet there is no editor for any of those countries.
6. The UI is good and Smart View is cool, but as I said, what makes us different is the algorithm that finds articles through which users can discover new things. This is the outline of our algorithm for users' discovery. URLs are found from signals on the Internet by our crawler; the HTML structure is analyzed automatically, extracting for example the title, main text, and image; then the semantics of each article are analyzed: which category it belongs to, what subject it has, what is in the image, and so on. Using the signals and the semantics, an importance score is calculated for each article, for each category, in each country. We then diversify the topics of the delivery list and deliver the articles to users; the list is refreshed in real time. We crawl 10 million URLs per day and deliver only the top ~1000 articles, about 100 per category per day. There is a lot to say about this algorithm; in particular, how we do importance estimation, and whether we personalize or take another approach, is a key question because it is tied to our mission. I will talk about that later; now let's get into today's main topic.
7. Web document classification is part of our structure analysis and semantics analysis. I chose it as today's topic because, for one thing, it is important for our application, as you have already seen, and for another, classification of unstructured data is a common task in many applications, from a simple spam filter to category tagging on an e-commerce site.
8. The task definition is very simple: when an arbitrary web document arrives, choose one category exclusively from a pre-determined category set.
9. There are roughly two steps. 1. Main content extraction: we have to detect the main content of a news website. This is difficult because there are so many websites and different websites have different structures. 2. Text classification: we classify the main content into one category. First I will briefly show one of our algorithms for detecting the main content of a web document; then I will talk about text classification using a word2vec-extended model.
10. Let's start with main content extraction. I should add that in our app, main content extraction is also what powers the Smart View we have just seen.
11. For main content extraction there are two approaches; we actually use the second one. The first approach renders the whole page, loading all the CSS and JavaScript, and extracts the main content afterwards. It is relatively easy, because we can use the position, width, and height of each component, but it takes time because we have to render everything. The second approach extracts the main content directly from the HTML. It is more difficult, but it needs much less computing power than the first approach.
12. We use the second approach in our algorithm, because we have to process 10 million articles per day, roughly 100 articles per second.
13. This is an example of main content extraction from HTML. The task is to decide which parts are main content and which are not.
14. A rule-based extraction algorithm is of course possible, for example 'a div whose text is longer than 200 characters is main content'. But because there are so many websites, the number of rules tends to become large,
15. and if we do this in multiple languages, it becomes much harder.
16. So, as one of our algorithms for extracting main content, we use a machine learning approach based on the WSDM 2010 paper shown on the slide. In the training phase, we first prepare a set of HTML documents in which the main content is already labeled; in our case we aggregate articles with our crawler and annotators label the main content. Next, a block separator splits the HTML into text blocks, and a feature extractor produces a feature vector for each block.
17. Let's look at the block separation and feature extraction part.
18. In step one we separate the HTML into text blocks. In our case the definition of a 'text block' is, roughly, a run of text sandwiched between block-level tags.
19. In step two, local features are extracted for each block. We use, for example, the number of words and the number of <a> tags as local features.
20. In step three, we create the feature vector of each block as a combination of local features from different blocks. In this example, the feature vector of this text block contains the word count and the number of <a> tags for both the previous and the current block.
21. In the training phase, after block separation and feature extraction, we get a set of labeled feature vectors, where the label is the binary value main/not main. Using the labeled feature vectors, a decision tree is trained. When live data comes in, the HTML is separated into text blocks with features, and the already-trained decision tree produces the final result.
  22. Let’s get into this part.
  23. Feature vector in each block is classified into main/not main by using already trained decision tree. Then now, we know which text block is main content and which text block is not main content. By combining the result, we get the main text.
  24. This is the end of main content extraction. easy, simple, but not bad. If you want to know more about it. please see the link, and also there is the library which includes already trained model in English, please try. I will share the reference later.
  25. so let’s get into the text classification.
26. You probably know all of this already, but let me review the ordinary classification architecture. In the training phase we first prepare a set of labeled texts as training data. The feature extractor turns it into a set of labeled feature vectors, and then a training algorithm, such as SVM or logistic regression, trains a classifier. With a bag-of-words feature extractor, the set of words in each document is extracted as the feature vector, and after training we know, roughly speaking, which words tend to show up in which category. When live data comes in, its feature vector is extracted and the already-trained classifier determines the category.
27. The training algorithm itself is ordinary logistic regression in our application, and there are many materials about it, so today let's focus on the feature extraction part.
28. 'Bag-of-words' is commonly used as the feature vector. A bag of words is the set of words in the document; it does not care about word order. Very simple, but not bad for text classification.
29. If we want to improve the quality of the feature vector, we can, for example, create a stop-word dictionary to remove unnecessary words, create a specific dictionary to add a specific feature, or use tf-idf. But bag-of-words is still the starting point.
30. In Japanese we additionally need a technique to segment words, but bag-of-words with some feature engineering is still commonly used. Still, bag-of-words is clearly not a perfect feature vector for text: for example, it cannot capture word order, and it cannot express whether two words appear close to each other. We wondered whether we could easily get a better feature vector.
31. As a better feature vector, we use the Paragraph Vector, a word2vec-extended model. It is 'better' in terms of text classification precision.
32. Using the technique I will describe today, every document is mapped to one dense vector of a few hundred dimensions, called a paragraph vector.
33. Because the paragraph vector is a word2vec-extended model, I should start with word2vec. In the word2vec case, every word is mapped to a unique word vector; in the paragraph vector case, every document is mapped to a unique vector.
34. So let's get into word2vec.
35. Every word is mapped to a unique vector. In this example, France, Paris, Germany, and Berlin are each mapped to their own vector. What is surprising is relations like Germany - Berlin = France - Paris; from this we can tell that some semantics is embedded in the vectors.
36. This is a brief overview of training the word2vec model. First, prepare a set of documents and index each word as w_1, w_2, …; then maximize the objective function. The value of c is arbitrary; 2 or 3 is commonly used. From the shape of the objective function, you can see that maximizing it means maximizing the probability of predicting a word from its surrounding words. In the example in the figure, the model is updated so that the probability of predicting 'on' from the surrounding words 'cat', 'sat', 'the', 'street' becomes higher. The probability model is as shown: for each word, two vectors are defined, an output vector u and an input vector v. Roughly speaking, when training converges, the more often a pair of words shows up in the same sentence, the bigger the inner product of u and v for those two words becomes. After training we use v for each word as its word vector. Technically, training this model directly is very heavy because of the sum in the denominator, so two approximations, negative sampling and hierarchical softmax, are used; their details are beyond the scope of this talk. This is how we create word vectors with the word2vec model.
  37. Then let’s get into paragraph vector.
  38. As I told you, each document is mapped into one dense vector named paragraph vector.
  39. The procedure to create paragraph vector is similar to word2vec case. Prepare sets of document. and label each word like w1, w2, we also label each document like doc_1, doc_2. Then, maximize this objective function. The difference from word2vec model is that, the objective function includes document_id where the word is included. So maximizing this objective function means maximizing the probability to predict a word not only from surrounding words but also from the document where the word is included. The model of the probability function is also a little bit different. Same as word2vec case, for each word outer vector u and inner vector v are defined. In addition, for each document, vector d_i is also defined. When training converge, we get optimized u, v for each word and d_i for each document. The final result of vector d_i is paragraph vector for each document. But what we really want to do is extracting paragraph vector from new document. For doing it we need one more step.
  40. When new document comes, we label the words in the document, and maximize this objective function. In this time, T is the number of word in the document. We don’t need to maximize the objective function for u and v, we can use u and v which is already trained. All we have to do is just maximize objective function for d. After the objective function is maximized we get d as a paragraph vector for the document.
  41. It was a little bit confusing, so I show a simple figure. First, we train the feature extractor by putting the large set of documents, and when new document comes, by using the already trained feature extractor, paragraph vector is extracted. very simple right?
42. By just using the paragraph vector as the feature vector, we can run the ordinary text classification pipeline.
43. Compared with bag-of-words, the paragraph vector has two advantages. ① High precision: on our Japanese/English data set, the 10-fold validation result is several percent better than bag-of-words with feature engineering. ② High scalability: by just preparing a document set for each language, without feature engineering, we get good results. The bad side is the difficulty of error analysis: it is hard to understand the meaning of each component of a paragraph vector. Because of this trade-off, I cannot say which you should choose in your use case, even though paragraph vectors give several percent higher text classification precision.
44. But still, I think it is worth trying the paragraph vector. It has a different nature from bag-of-words, so the combination of a bag-of-words-based classifier and a paragraph-vector-based classifier can be a much better classifier.
45. In our app there are many kinds of classifiers, such as a sports classifier and an entertainment classifier, besides the main category classifier. Depending on the purpose of each classification, in some cases we use the more reliable result of the bag-of-words-based classifier and the paragraph-vector-based classifier; in other cases we validate the result of the bag-of-words-based classifier using the paragraph-vector-based classifier.
46. Also, in the near future, when we expand our business into many languages, say 100, it is quite possible that we will use only the paragraph-vector-based classifier, because of its high scalability and high precision.
  47. This is the end of todays’ topic web document classification.
  48. News is uncertainty seeking for long-term values. What other big data firms typically do is recommend what people have interest about, by using like matrix factorization. What we are doing is not simply suggest users what they like, but expand users’ interest by our algorithm.
  49. How to explorer users’ interest space and suggest something new to users, are very challenging problem. We are now brushing up, these two. For better understanding of the users’ interest space we are brushing up the topic or the subject extraction from article, brushing up users’ feature vector For doing the good exploration multi-arm bandit based scoring model, Technically, we have to create and operate the good and reasonable model which includes feature vector of 10 million users and real time feature vector of articles, it is really exciting. Actually the number of people tuckling on these problems is 5, including ML PhD., Theoretical Physics PhD, but we need much much much more people to tackle on this difficult problem.
  50. Then let’s get into paragraph vector.
  51. Then let’s get into paragraph vector.