Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Recognising Person Entities in Tweets

Passive-Aggressive Sequence Labeling with
Discriminative Post-Editing for
Recognising Person Entities in Tweets.
Leon Derczynski
Kalina Bontcheva

Problem
● Finding person NEs in tweets, a diverse genre
– Need to know participates in events / claims
● Twitter as the
D. Melanogaster of social media1
● Newswire: regulated
– “our most frequently-used corpora [..] written and edited predominantly by
working-age white men” 2
● Twitter: wild; many styles
– Headlines
– Conversations
– Colloquial
– Just “noise” (hashtags, URLs, mentions)
1. Tufekci, 2014. “Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls”
Proc. ICWSM; 2. Eisenstein, 2013. “What to do about bad language on the internet” Proc. NAACL; Image “Mr.checker”
Wikimedia Commons

Why person entities?
● There are many entity types and classification
schemes
– ACE (PER, GPE, ORG); maybe add PROD
– Freebase top-level (à la Ritter)
● Have a long tail, making them “resistant” to
gazetteer approaches
● Required to mine conversations and claims
● Unfortunately, they're difficult to find in tweets:
Stanford NER on CoNLL news: 92.29 F1
Stanford NER on Ritter tweets: 63.20 F1

Machine learning for twitter NER
● We know twitter's diverse & noisy, so let's add word
shape (Xxx) and lemma features
● Conventional approaches – sequence labelling
● Lots of dysfluency, differs from newswire
● What if we throw out whole-sequence idea and only
use local context?
Stanford 72.19 F1 (up from ~63)
SVM 75.89 F1
MaxEnt 76.76 F1
CRF 78.89 F1
● Looks like sequence labelling is useful

Two ML adaptations
● SVM/UM
– Hyperplane may lie between two unbalanced classes
– Move closer to minority class, to reflect prior distribution
● CRF-PA
– Passive: when example's hinge loss is zero, skip
updates
– Aggressive: when hinge loss >0, scale down example's
weight

Single-pass results
● Corpus: person entities from MSM2013, Ritter,
UMBC tweet datasets (86k toks, 1.7k ents)
P R F
Stanford 90.60 60.00 72.19
Ritter 77.23 80.18 78.68
SVM/UM 81.16 74.97 77.94
CRF-PA 86.85 74.71 80.32
● Honourable mention: MaxEnt, precision 91.10
● Ritter: good recall, possibly from huge bootstrapped
integrated resource
● How can we improve recall without this?

Recall problems
● Typical missed entities:
– “Under Obama 's tax plan , ...”
– “delighted for you & Dave !”
– “Strategies for selling in a slow market : by Denise
Calaman”
● Looks like things we'd find in a gazetteer
● How can we include these without reducing precision?
● Post-editing can be effective in fixing up MT output

Post-editing
● Formulate as binary discriminative problem
– Is a given non-entity text actually a person?
● Narrow search space:
– Does a token in an out-of-entity sequence begin a
with known person name?
● Confine window to two tokens
● Given a set of triggers, are tokens in a bigram
beginning with a trigger, a person?
Best Ann Coulter quotes
Under Obama 's tax plan

Evaluation
● Baselines: no editing, gazetteer term, gazetter term+1
● Goal is to improve recall: use cost-sensitive SVM
Missed entity F1 Overall
No editing 0.00 80.32
Term only 5.82 82.58
Term+1 6.05 81.67
SVM Cost 0.1 (P) 78.26 83.07
SVM Cost 1.5 (R) 92.73 83.83
Ritter - 78.68

Error analysis
● False positives:
– Other-class entities (Huff Post, Exodus Porter)
– Descriptive titles (Millionaire Rob Ford)
– Names in non-name senses (Marie Claire)
– Polysemous names (Mark)
● False negatives:
– Capitalisation (charlie gibson, KANYE WEST)
– Spelling errors (Russel Crowe)
– Common nouns (Jack Straw)
– Uncommon names (Spicy Pickle Jr.)

Conclusion
● PA adaptation of CRF helps NER in diverse domain
● Automatic post-editing improves recall
● SVM using context much better than gazetteer
● Only external resource is first name lists

Thank you for your time!
Do you have any questions?
Research partially supported by the European Union/EU under the Information and Communication Technologies
(ICT) theme of the 7th Framework Programme for R&D (FP7), grant PHEME (611233).

Entities in tweets
News Tweets
PER Politicians, business
leaders, journalists,
celebrities
Sportsmen, actors, TV
personalities,
celebrities, names of
friends
LOC Countries, cities,
rivers, and other
places related to
current affairs
Restaurants, bars, local
landmarks/areas, cities,
rarely countries
ORG Public and private
companies,
government
organisations
Bands, internet
companies, sports
clubs

Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Recognising Person Entities in Tweets

Recommended

Recommended

More Related Content

More from Leon Derczynski

More from Leon Derczynski (20)

Recently uploaded

Recently uploaded (20)

Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Recognising Person Entities in Tweets