Presentation with audio: https://www.youtube.com/watch?v=heYj8sCmWCo
Finding person names in tweets is difficult. However, with a few simple modifications to handle the noise and variety in tweets, and an automatic post-editor to fix errors made by the automatic systems, it becomes easier.
Full paper: http://derczynski.com/sheffield/papers/person_tweets.pdf
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Recognising Person Entities in Tweets
1. Passive-Aggressive Sequence Labeling with
Discriminative Post-Editing for
Recognising Person Entities in Tweets.
Leon Derczynski
Kalina Bontcheva
2. Problem
● Finding person NEs in tweets, a diverse genre
– Need to know participants in events / claims
● Twitter as the D. melanogaster of social media1
● Newswire: regulated
– “our most frequently-used corpora [..] written and edited predominantly by
working-age white men” 2
● Twitter: wild; many styles
– Headlines
– Conversations
– Colloquial
– Just “noise” (hashtags, URLs, mentions)
1. Tufekci, 2014. “Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls”
Proc. ICWSM; 2. Eisenstein, 2013. “What to do about bad language on the internet” Proc. NAACL; Image “Mr.checker”
Wikimedia Commons
3. Why person entities?
● There are many entity types and classification
schemes
– ACE (PER, GPE, ORG); maybe add PROD
– Freebase top-level (à la Ritter)
● Person names have a long tail, making them “resistant” to gazetteer approaches
● Required to mine conversations and claims
● Unfortunately, they're difficult to find in tweets:
Stanford NER on CoNLL news: 92.29 F1
Stanford NER on Ritter tweets: 63.20 F1
4. Machine learning for twitter NER
● We know Twitter is diverse & noisy, so let's add word shape (Xxx) and lemma features
● Conventional approaches – sequence labelling
● Lots of dysfluency, differs from newswire
● What if we throw out the whole-sequence idea and only use local context?
Stanford 72.19 F1 (up from ~63)
SVM 75.89 F1
MaxEnt 76.76 F1
CRF 78.89 F1
● Looks like sequence labelling is useful
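The word-shape (Xxx) feature mentioned on this slide can be sketched as follows. This is an illustrative implementation, not necessarily the exact scheme used in the paper: each character is mapped to a coarse class and repeated classes are collapsed, so differently-spelled names with the same capitalisation pattern share a feature value.

```python
import re

def word_shape(token):
    """Map a token to a coarse shape: uppercase -> 'X',
    lowercase -> 'x', digit -> 'd', other characters kept.
    Runs of the same shape character are collapsed, so
    'Obama' -> 'Xx' and 'McDonald' -> 'XxXx'."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    # collapse runs: 'Xxxxx' -> 'Xx'
    return re.sub(r"(.)\1+", r"\1", "".join(shape))
```

Shape features like this let a model generalise from seen capitalised names to unseen ones, which matters in a genre where gazetteer coverage is poor.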
5. Two ML adaptations
● SVM/UM
– Hyperplane may lie between two unbalanced classes
– Move closer to minority class, to reflect prior distribution
● CRF-PA
– Passive: when an example's hinge loss is zero, skip the update
– Aggressive: when hinge loss > 0, scale down the example's weight
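The passive-aggressive idea on this slide follows the standard PA-I update of Crammer et al. Below is a minimal binary-classification sketch of that update; the paper's CRF-PA applies the same principle at the sequence level, so this is an illustration of the mechanism rather than the actual model.

```python
def pa_update(w, x, y, C=1.0):
    """One passive-aggressive (PA-I) update for a binary
    example (x, y) with y in {-1, +1}.
    Passive: hinge loss is zero -> weights unchanged.
    Aggressive: otherwise move w just far enough to satisfy
    the margin, with step size capped by C."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - y * score)
    if loss == 0.0:
        return w                      # passive: skip the update
    norm_sq = sum(xi * xi for xi in x)
    tau = min(C, loss / norm_sq)      # aggressive: scaled step
    return [wi + tau * y * xi for wi, xi in zip(w, x)]
```

The aggressiveness parameter C bounds how far any single (possibly noisy) example can move the weights, which is useful on noisy tweet data.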
6. Single-pass results
● Corpus: person entities from MSM2013, Ritter, UMBC tweet datasets (86k toks, 1.7k ents)
            P      R      F
Stanford  90.60  60.00  72.19
Ritter    77.23  80.18  78.68
SVM/UM    81.16  74.97  77.94
CRF-PA    86.85  74.71  80.32
● Honourable mention: MaxEnt, precision 91.10
● Ritter: good recall, possibly from huge bootstrapped integrated resource
● How can we improve recall without this?
7. Recall problems
● Typical missed entities:
– “Under Obama 's tax plan , ...”
– “delighted for you & Dave !”
– “Strategies for selling in a slow market : by Denise Calaman”
● Looks like things we'd find in a gazetteer
● How can we include these without reducing precision?
● Post-editing can be effective in fixing up MT output
8. Post-editing
● Formulate as binary discriminative problem
– Is a given non-entity text actually a person?
● Narrow search space:
– Does a token in an out-of-entity sequence begin with a known person name?
● Confine window to two tokens
● Given a set of triggers: are the tokens in a bigram beginning with a trigger a person?
Best Ann Coulter quotes
Under Obama 's tax plan
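The candidate-generation step described on this slide can be sketched as follows. This is illustrative only: `first_names` stands in for the first-name trigger lists, and a separate binary classifier would then decide whether each candidate bigram really is a person.

```python
def edit_candidates(tokens, labels, first_names):
    """Find post-editing candidates: two-token windows starting
    at an O-labelled token whose surface form is a known first
    name. Returns (index, bigram) pairs for a downstream binary
    classifier to judge."""
    candidates = []
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        if lab == "O" and tok in first_names:
            bigram = tokens[i:i + 2]   # confine window to two tokens
            candidates.append((i, bigram))
    return candidates
```

On the slide's example “Best Ann Coulter quotes”, with “Ann” in the trigger list, this yields the candidate bigram (“Ann”, “Coulter”) for the classifier to accept or reject.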
9. Evaluation
● Baselines: no editing, gazetteer term, gazetteer term+1
● Goal is to improve recall: use cost-sensitive SVM
                  Missed-entity F1   Overall F1
No editing              0.00           80.32
Term only               5.82           82.58
Term+1                  6.05           81.67
SVM Cost 0.1 (P)       78.26           83.07
SVM Cost 1.5 (R)       92.73           83.83
Ritter                   -             78.68
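The precision/recall trade-off in the table comes from a cost-sensitive SVM. As a minimal sketch of the mechanism (not the exact formulation used), a class-dependent cost can simply rescale the hinge loss, so errors on the positive (person) class are penalised more or less heavily:

```python
def weighted_hinge_loss(score, y, pos_cost=1.5):
    """Hinge loss with a class-dependent cost on the positive
    class. pos_cost > 1 makes missed persons more expensive,
    pushing the learner toward recall; pos_cost < 1 favours
    precision. Illustrative sketch only."""
    loss = max(0.0, 1.0 - y * score)
    return loss * (pos_cost if y == 1 else 1.0)
```

With cost 1.5 on the positive class, the classifier recovers far more missed entities (92.73 vs 78.26 missed-entity F1) at a small overall cost.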
11. Conclusion
● PA adaptation of CRF helps NER in a diverse domain
● Automatic post-editing improves recall
● SVM using context much better than gazetteer
● Only external resource is first name lists
12. Thank you for your time!
Do you have any questions?
Research partially supported by the European Union/EU under the Information and Communication Technologies
(ICT) theme of the 7th Framework Programme for R&D (FP7), grant PHEME (611233).
13. Entities in tweets
     News                               Tweets
PER  Politicians, business leaders,     Sportsmen, actors, TV personalities,
     journalists, celebrities           celebrities, names of friends
LOC  Countries, cities, rivers, and     Restaurants, bars, local landmarks/
     other places related to current    areas, cities, rarely countries
     affairs
ORG  Public and private companies,      Bands, internet companies, sports
     government organisations           clubs