NEM Summit – 28 Sept. 2011
Identifying Topics in Social Media Posts using DBpedia
Óscar Muñoz-García, Manuel de la Higuera Hernández, Carlos Navarro (Havas Media)
Andrés García-Silva, Óscar Corcho (Ontology Engineering Group - UPM)
Contents
• Introduction
• Related Work
• Description of the Method
• Evaluation
• Conclusions
Introduction
• Topic Identification
  • “The task of identifying the central ideas in a text” [Chin-Yew Lin, 1995]
• Applications of Topic Identification for Social Media
  • Automatically summarising the content published in a channel.
  • Mining the interests of a given user.
  • etc.
• Benefits for Advertising Companies
  • Focusing advertising actions on the appropriate channels.
  • Serving ads to users based on their interests.
• Difficulties of Topic Identification in Social Media
  • Different channels with heterogeneous texts
    • Different lengths
      • From short sentences on Twitter to medium-size articles in blogs
    • Misspellings
      • Posts written entirely in uppercase (or lowercase) letters
        • This makes proper nouns difficult to detect.
      • In Spanish, the absence or presence of an accent gives a word different meanings
        • “té” = “tea” (common noun)
        • “te” = “you” (personal pronoun)
    • Use of set phrases
      • E.g., “too many cooks spoil the broth” (if too many people try to take charge of a task, the end product might be ruined)
      • E.g., “rain cats and dogs” (to rain heavily)
  • It is important to take into account the context of the post
• Why DBpedia?
  • DBpedia is a structured Semantic Web representation of Wikipedia
    • Wikipedia is maintained by thousands of editors
    • Wikipedia evolves and adapts as knowledge changes [Syed et al, 2008]
  • Each topic identified is mapped to a DBpedia resource
    • E.g., the URI http://dbpedia.org/resource/Turin
      • Represents the city of Torino
      • Has about 45 attributes defined (population, area, latitude, longitude, etc.)
      • Has labels and definitions in 14 different languages
      • Is linked with many semantic entities
        • E.g., it is the birthplace of Amedeo Avogadro: http://dbpedia.org/resource/Amedeo_Avogadro
      • Is linked with its Wikipedia article: http://en.wikipedia.org/wiki/Torino
  • It is a nucleus for the Web of Data [Bizer et al, 2009]
    • Data published on the Web according to Tim Berners-Lee’s Linked Data principles
    • Several billion RDF triples (i.e., facts)
    • Multi-domain datasets (geographic information, people, companies, online communities, etc.)
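The multilingual labels of a DBpedia resource are exposed through the rdfs:label property, so they can be requested with a SPARQL query. The following is a minimal sketch that only builds the query text and does not send it anywhere. The function name `build_label_query` is hypothetical, and the choice of endpoint (for instance the public DBpedia SPARQL endpoint) is left to the reader as an assumption.

```python
# Sketch only: builds a SPARQL query string, performs no network access.
RDFS_PREFIX = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>"


def build_label_query(resource_uri, language):
    """Build a SPARQL query for the rdfs:label values of a resource in one language."""
    # Reject characters that would break out of the <...> IRI delimiters.
    if any(ch in resource_uri for ch in '<>" \n\t'):
        raise ValueError("resource_uri contains characters not allowed in an IRI")
    if not language.isalpha():
        raise ValueError("language must be a plain language tag such as 'it'")
    return (
        f"{RDFS_PREFIX}\n"
        "SELECT ?label WHERE {\n"
        f"  <{resource_uri}> rdfs:label ?label .\n"
        f'  FILTER (langMatches(lang(?label), "{language}"))\n'
        "}"
    )


query = build_label_query("http://dbpedia.org/resource/Turin", "it")
print(query)
```

The query could then be submitted with any SPARQL client; `langMatches` is used instead of a plain string comparison so that regional variants of a language tag also match.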
Related Work
• Wikipedia has been exploited for the following tasks:
  • Topic identification and text categorization
    • [Bodo et al, 2007], [Coursey et al, 2009], [Gabrilovich et al, 2006], [Syed et al, 2008], [Schonhofen, 2009]
  • Semantic relatedness between fragments of text
    • [Gabrilovich et al, 2007]
  • Keyword extraction
    • [Mihalcea et al, 2007]
  • Word sense disambiguation
    • [Mihalcea, 2007]
• Uses of the Wikipedia data structure:
  • Relating words in text to articles using article title information
    • [Schonhofen, 2009]
  • Exploiting anchor text in links
    • [Coursey et al, 2009] [Mihalcea et al, 2007] [Mihalcea, 2007]
  • Exploiting whole articles
    • [Syed et al, 2008] [Gabrilovich, 2007]
  • Exploiting categories to measure relatedness between articles
    • [Coursey et al, 2009] [Syed et al, 2008]
  • Exploiting disambiguation pages and redirection links to select candidate senses and alternative labels
    • [Medelyan et al, 2008]
• Supervised learning methods
  • [Bodo et al, 2007] [Gabrilovich et al, 2006] [Medelyan et al, 2008]
• Unsupervised techniques
  • Based on a vector space model
    • [Schonhofen, 2009]
  • Based on a graph
    • [Coursey et al, 2009] [Syed et al, 2008]
• Combined methods (supervised and unsupervised)
  • Based on a vector space model
    • [Mihalcea et al, 2007]
• Our approach
  • Exploits titles, disambiguation pages, redirection links and article text to select candidate senses and alternative labels
  • Uses an unsupervised method
  • Uses a vector space model
• Main benefit in comparison with previous approaches:
  • The interlinking of social media posts with the Web of Data through DBpedia resources
Description of the Method
Input post, processed in three steps:
1. Part-of-speech tagging
   • “torino”, “art”, “media”, “user”, “cloud”
2. Topic Recognition
   • http://dbpedia.org/resource/Turin
   • http://dbpedia.org/resource/Art
   • http://dbpedia.org/resource/User_(computing)
3. Language Filtering
   • “Torino”, “arte”, “utente”, “mezzo di comunicazione di massa”, ...
• Part-of-speech tagging
  • Wp = w1, w2, ..., wn: list of lexical units contained in the post p
  • lexcat(w): lexical category of the lexical unit w
  • lemma(w): lemma of w
  • L = {common noun, proper noun, acronym, …}: meaningful lexical categories that we consider
  • Stop words (lemmas excluded): {“RT”, “/cc”, “;)”, …}
  • Kp = k1, k2, …, km: list of keywords with meaning, i.e., the lemmas of the lexical units in Wp whose lexical category is in L and which are not stop words
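The definitions above can be read as a filter over tagged lexical units. The following is a minimal sketch and not the authors' implementation: it assumes the post has already been tagged and lemmatised by some external part-of-speech tagger, and the `LexicalUnit` type, the category names, the lowercasing and the removal of duplicates are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalUnit:
    """One tagged token: surface form, lemma and lexical category (hypothetical type)."""
    text: str
    lemma: str
    lexcat: str


# L: meaningful lexical categories (illustrative names, not a real tagset).
MEANINGFUL_CATEGORIES = {"common noun", "proper noun", "acronym"}

# Stop words: lemmas excluded even when their category is meaningful.
STOP_WORDS = {"rt", "/cc", ";)"}


def extract_keywords(units):
    """Return Kp: lemmas of units in a meaningful category that are not stop words."""
    keywords = []
    for unit in units:
        lemma = unit.lemma.lower()  # assumption: keywords are compared case-insensitively
        if unit.lexcat in MEANINGFUL_CATEGORIES and lemma not in STOP_WORDS:
            if lemma not in keywords:  # assumption: keep only the first occurrence
                keywords.append(lemma)
    return keywords


post = [
    LexicalUnit("RT", "RT", "proper noun"),
    LexicalUnit("The", "the", "determiner"),
    LexicalUnit("phones", "phone", "common noun"),
    LexicalUnit("use", "use", "verb"),
    LexicalUnit("trackballs", "trackball", "common noun"),
    LexicalUnit("phone", "phone", "common noun"),
]
keywords = extract_keywords(post)
print(keywords)
```

In this toy post, "RT" is dropped as a stop word even though it was tagged as a proper noun, and the determiner and verb are dropped because their categories are not in L.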
• Part-of-speech tagging example
  • Input: “But a hardware problem is more likely, especially if you use the phone a lot while eating. The Blackberry's tiny trackball could be suffering the same accumulation of gunk and grime that can plague a computer mouse that still uses a rubber ball on the underside to roll around the desk.”
  • Part-of-speech tagging output: Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk
• Topic Recognition (Sem4Tags [García-Silva et al, 2010])
  1. POS tagging
     • Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk
  2. Context Selection
     • Blackberry, {phone, hardware, trackball, mouse}
     • Computer, {hardware, mouse, problem, desk}
     • …
  3. Disambiguation
     • http://dbpedia.org/resource/BlackBerry
     • http://dbpedia.org/resource/Computer
• Context Selection
  • For each keyword, a set of up to 4 related keywords is selected to help disambiguate its meaning
    • 4 is the number of words above which the context does not add more resolving power to disambiguation [Kaplan, 1955]
  • We compute semantic relatedness (active context) taking into account the co-occurrence of words in web pages [Gracia et al, 2009]

Keyword      Relatedness    Keyword       Relatedness
phone        0.347          hardware      0.347
trackball    0.311          mouse         0.311
computer     0.288          desk          0.287
problem      0.246          rubber ball   0.246
grime        0.190          gunk          0.168

Active context selection for the “blackberry” keyword
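Once relatedness scores are available, active context selection reduces to keeping the highest-scoring keywords. The sketch below assumes the scores have already been computed by some relatedness measure (the web co-occurrence measure of [Gracia et al, 2009] is not reimplemented here); the function name `select_active_context` is hypothetical, and the scores are the ones listed for the “blackberry” keyword.

```python
def select_active_context(relatedness, max_size=4):
    """Return up to max_size keywords with the highest relatedness scores."""
    # sorted() is stable, so keywords with equal scores keep their input order.
    ranked = sorted(relatedness.items(), key=lambda item: item[1], reverse=True)
    return [keyword for keyword, _score in ranked[:max_size]]


# Relatedness of each co-occurring keyword to "blackberry".
blackberry_relatedness = {
    "phone": 0.347,
    "hardware": 0.347,
    "trackball": 0.311,
    "mouse": 0.311,
    "computer": 0.288,
    "desk": 0.287,
    "problem": 0.246,
    "rubber ball": 0.246,
    "grime": 0.190,
    "gunk": 0.168,
}

active_context = select_active_context(blackberry_relatedness)
print(active_context)
```

With these scores the selected context is phone, hardware, trackball and mouse, which matches the context shown for Blackberry in the topic recognition example.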
• Disambiguation Criteria
  • OPTION 1: Most frequent sense for the ambiguous word
    • Determined by Wikipedia editors (the first link in a disambiguation page)
  • OPTION 2: Vector space model
    1. A vector is built containing the keyword and its context
    2. A vector containing the top N terms is created from each candidate sense using TF-IDF (Term Frequency and Inverse Document Frequency)
    3. The cosine similarity is used to determine which vectorised sense is most similar to the vector associated with the keyword

DBpedia resource                                Definition                                   Similarity
BlackBerry                                      is a line of mobile e-mail and smartphone    0.224
Blackberry                                      is an edible fruit                           0.15
BlackBerry_(song)                               is a song by the Black Crowes                0.0
BlackBerry_Township,_Itasca_County,_Minnesota   is a township in … Itasca County             0.0
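The vector space option can be illustrated with a small self-contained sketch. It is not the Sem4Tags implementation: the sense texts are short hypothetical stand-ins for Wikipedia articles, tokenisation is a plain whitespace split, the TF-IDF weighting shown is one common variant (term frequency times log(N/df) + 1), and the function names are invented for illustration. The scores it produces are therefore not expected to reproduce the similarity values in the table.

```python
import math
from collections import Counter


def tfidf_vectors(documents, top_n=10):
    """Build one TF-IDF vector (term -> weight) per document, keeping the top_n terms."""
    tokenised = {name: text.lower().split() for name, text in documents.items()}
    doc_freq = Counter(term for tokens in tokenised.values() for term in set(tokens))
    n_docs = len(tokenised)
    vectors = {}
    for name, tokens in tokenised.items():
        if not tokens:
            vectors[name] = {}
            continue
        counts = Counter(tokens)
        weights = {
            term: (count / len(tokens)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        }
        top = sorted(weights.items(), key=lambda item: item[1], reverse=True)[:top_n]
        vectors[name] = dict(top)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(keyword, context, senses, top_n=10):
    """Return the sense most similar to the keyword-plus-context vector, with all scores."""
    query = {term.lower(): 1.0 for term in [keyword, *context]}
    sense_vectors = tfidf_vectors(senses, top_n)
    scores = {name: cosine(query, vector) for name, vector in sense_vectors.items()}
    return max(scores, key=scores.get), scores


# Hypothetical sense descriptions keyed by DBpedia resource name.
senses = {
    "BlackBerry": "blackberry is a line of mobile e-mail and smartphone devices "
                  "with a trackball and phone hardware",
    "Blackberry": "blackberry is an edible fruit produced by many species of bramble",
    "BlackBerry_(song)": "blackberry is a song by the black crowes",
}

best, scores = disambiguate("blackberry", ["phone", "hardware", "trackball", "mouse"], senses)
print(best)
```

Because the context words only occur in the description of the device, that sense receives the highest cosine similarity in this toy setting.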
• Language Filtering
  • Tp = t1, t2, ..., tn: set of topics identified
  • l: language to filter by
  • Labels(t): set of labels associated with a given topic t (values of the rdfs:label property)
  • lang(b): language of a given label b
  • T^l_p: set of topics with labels in language l, i.e., the topics t in Tp for which some label b in Labels(t) has lang(b) = l
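Language filtering, as defined above, keeps only the topics that have at least one label in the target language. The sketch below assumes the rdfs:label values have already been retrieved into a dictionary; the function name `filter_by_language`, the label data and the `example.org` resource are hypothetical and only serve the illustration.

```python
def filter_by_language(topics, labels, language):
    """Keep the topics that have at least one label in the given language."""
    return [
        topic
        for topic in topics
        if any(lang == language for _text, lang in labels.get(topic, []))
    ]


# Hypothetical label data: topic URI -> list of (label text, language tag).
labels = {
    "http://dbpedia.org/resource/Turin": [("Turin", "en"), ("Torino", "it")],
    "http://dbpedia.org/resource/Art": [("Art", "en"), ("Arte", "it")],
    "http://example.org/resource/OnlyEnglish": [("Only English", "en")],
}

topics = list(labels)
italian_topics = filter_by_language(topics, labels, "it")
print(italian_topics)
```

A topic with no label in the requested language is dropped, which is why coverage can only decrease after this step.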
Evaluation
• Evaluated with a corpus of 10,000 posts in Spanish extracted from
  • Blogs
  • Forums
  • Microblogs (e.g., Twitter)
  • Social networks (e.g., Facebook, MySpace, LinkedIn and Xing)
  • Review sites (e.g., Ciao and Dooyoo)
  • Audiovisual sites (e.g., YouTube and Flickr)
  • News publishing sites (e.g., elpais.com, elmundo.es)
  • Others (web pages not classified in the categories above)
• Variants evaluated
  1. Without considering any context
     • Default Wikipedia sense assigned for a given keyword
  2. Considering as context all the other keywords found in the same post
  3. Active context selection technique
     • Selecting the 4 most related keywords from the keywords in the same post
• Coverage
  • Part-of-speech tagging: nearly 100%
  • Topic recognition: over 90% in almost all cases
  • After language filtering, coverage drops by about 10 percentage points because not all DBpedia resources have a label defined for the Spanish language

                    Blogs   Forums  Microblogs  Social Networks  Others  Reviews  Audiovisual  News    Overall
POS Tagging         99.63%  96.64%  99.01%      98.14%           98.77%  98.20%   97.20%       99.62%  98.32%
Topic identification
  Without context   96.7%   87.68%  94.22%      93.54%           92.71%  88.81%   90.29%       96.67%  92.35%
  With context      96.64%  93.07%  95.54%      94.99%           95.13%  92.67%   97.41%       98.54%  95.02%
  Active context    99.24%  89.71%  94.43%      96.40%           94.75%  93.81%   92.23%       97.4%   94.72%
Topic identification after language filtering
  Without context   91.21%  79.04%  87.54%      82.64%           86.93%  70.15%   82.52%       90.71%  82.74%
  With context      88.43%  80.84%  86.31%      85.24%           88.72%  76.19%   89.66%       92.46%  84.85%
  Active context    89.69%  80.51%  86.51%      86.78%           89.78%  75.59%   80.58%       90.54%  84.73%
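The size of the coverage drop caused by language filtering can be checked directly from the Overall column of the coverage table. The short sketch below simply subtracts the figures after filtering from the figures before filtering; the variable and function names are illustrative.

```python
# Overall topic identification coverage (%) before and after language filtering.
overall_coverage = {
    "Without context": (92.35, 82.74),
    "With context": (95.02, 84.85),
    "Active context": (94.72, 84.73),
}


def coverage_drop(before, after):
    """Drop in coverage, in percentage points, rounded to two decimals."""
    return round(before - after, 2)


drops = {variant: coverage_drop(*pair) for variant, pair in overall_coverage.items()}
for variant, drop in drops.items():
    print(f"{variant}: {drop} percentage points")
```

All three variants lose between 9.6 and 10.2 percentage points, which is consistent with the reduction of about 10 points reported for the Spanish labels.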
• Precision
  • Evaluated on a random sample of 1,816 posts (18.16%)
  • 47 human evaluators
  • Each post and the topics identified for it were shown to 3 different evaluators
  • Evaluation options:
    1. The topic is not related to the post
    2. The topic is somehow related to the post
    3. The topic is closely related to the post
    4. The evaluator does not have enough information to make a decision
  • Fleiss’ kappa test
    • Strength of agreement for 2 evaluators = 0.826 (very good)
    • Strength of agreement for 3 evaluators = 0.493 (moderate)
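Fleiss' kappa measures agreement among a fixed number of raters who assign items to categories. The following is a minimal sketch of the standard formula, assuming every item was rated by the same number of evaluators; the function name and the rating counts are hypothetical and are not the evaluation data of this study.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same total."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of raters (at least 2)")
    n_categories = len(counts[0])

    # Observed agreement: mean over items of the proportion of agreeing rater pairs.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Agreement expected by chance, from the overall category proportions.
    p_categories = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in p_categories)

    if p_expected == 1.0:  # every rating falls in a single category
        return 1.0
    return (p_bar - p_expected) / (1.0 - p_expected)


# Hypothetical ratings: 3 evaluators per post, 4 evaluation options per topic.
ratings = [
    [3, 0, 0, 0],  # all three evaluators chose option 1
    [0, 2, 1, 0],  # two chose option 2, one chose option 3
    [0, 0, 3, 0],  # all three chose option 3
]
kappa = fleiss_kappa(ratings)
print(round(kappa, 3))
```

A value of 1 indicates perfect agreement, values near 0 indicate agreement no better than chance, and negative values indicate less agreement than chance would give.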
• Precision Results
  • Precision depends on the channel
    • From 59.19% for social networks
      • More misspellings
      • More common nouns
    • To 88.89% for review sites
      • Concrete products and brands
      • Proper nouns tend to have a Wikipedia entry
  • The best context selection criterion also depends on the channel
    • Active context selection is better for microblogs and review sites
    • Considering all the post keywords as context is better for blogs
    • Using no context is better in the remaining cases (almost all the channels)
  • Naïve default sense selection is effective
Conclusions
• We have achieved good coverage results
• Precision depends on the channel (best for review sites, worst for social networks)
• Regarding whether or not to consider context, no single variant provides the best results for all channels
• Future lines of work:
  • Improve Natural Language Processing
    • Dealing with slang
    • Detecting set phrases
    • Improving n-gram detection
    • Dealing with microblogs’ specifics (e.g., hashtag expansion)
  • Combine broad-domain topic identification with knowledge about specific domains
    • Use of domain ontologies in combination with the DBpedia ontology
Thank you!
oscar.munoz@havasmedia.com

OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 

Recently uploaded (20)

Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 

Identifying Topics in Social Media Posts using DBpedia

  • 1. Identifying Topics in Social Media Posts using DBpedia
      NEM Summit, 28 Sept. 2011
      Óscar Muñoz-García, Manuel de la Higuera Hernández, Carlos Navarro (Havas Media)
      Andrés García-Silva, Óscar Corcho (Ontology Engineering Group - UPM)
  • 2. Contents
      - Introduction
      - Related Work
      - Description of the Method
      - Evaluation
      - Conclusions
  • 3. Introduction
  • 4. Introduction
      - Topic identification
        - “The task of identifying the central ideas in a text” [Chin-Yew Lin, 1995]
      - Applications of topic identification for social media
        - Automatically summarising the content published in a channel
        - Mining the interests of a given user
        - etc.
      - Benefits for advertising companies
        - Focusing advertising actions on the appropriate channels
        - Serving ads to users based on their interests
  • 5. Introduction
      - Difficulties of topic identification in social media
        - Different channels with heterogeneous texts
          - Different lengths: from short sentences on Twitter to medium-size articles in blogs
          - Misspellings
            - Posts written completely in uppercase (or lowercase) letters, which makes proper nouns hard to detect
            - In Spanish, the presence or absence of an accent changes the meaning of a word
              - “té” = “tea” (common noun)
              - “te” = “you” (personal pronoun)
          - Use of set phrases
            - E.g., “too many cooks spoil the broth” (if too many people try to take charge of a task, the end product might be ruined)
            - E.g., “rain cats and dogs” (rain heavily)
        - It is important to take the context of the post into account
  • 6. Introduction
      - Why DBpedia?
        - DBpedia is a structured Semantic Web representation of Wikipedia
          - Wikipedia is maintained by thousands of editors
          - Wikipedia evolves and adapts as knowledge changes [Syed et al, 2008]
        - Each topic identified is mapped to a DBpedia resource
          - E.g., the URI http://dbpedia.org/resource/Turin
            - Represents the city of Torino
            - Has about 45 attributes defined (population, area, latitude, longitude, etc.)
            - Has labels and definitions in 14 different languages
            - Is linked with many semantic entities, e.g. it is the birth place of Amedeo Avogadro: http://dbpedia.org/resource/Amedeo_Avogadro
            - Is linked with its Wikipedia article: http://en.wikipedia.org/wiki/Torino
        - It is a nucleus for the Web of Data [Bizer et al, 2009]
          - Data published on the Web according to Tim Berners-Lee’s Linked Data principles
          - Several billion RDF triples (i.e. facts)
          - Multi-domain datasets (geographic information, people, companies, online communities, etc.)
  • 8. Related Work
  • 9. Related Work
      - Wikipedia has been exploited for the following tasks:
        - Topic identification and text categorization: [Bodo et al, 2007], [Coursey et al, 2009], [Gabrilovich et al., 2006], [Syed et al, 2008], [Schonhofen, 2009]
        - Semantic relatedness between fragments of text: [Gabrilovich et al, 2007]
        - Keyword extraction: [Mihalcea et al, 2007]
        - Word sense disambiguation: [Mihalcea, 2007]
  • 10. Related Work
      - Uses of the Wikipedia data structure:
        - Relating words in text with articles using article title information: [Schonhofen, 2009]
        - Exploiting anchor text in links: [Coursey et al, 2009], [Mihalcea et al, 2007], [Mihalcea, 2007]
        - Exploiting whole articles: [Syed et al, 2008], [Gabrilovich, 2007]
        - Exploiting categories to measure relatedness between articles: [Coursey et al, 2009], [Syed et al, 2008]
        - Exploiting disambiguation pages and redirection links to select candidate senses and alternative labels: [Medelyan et al, 2008]
  • 11. Related Work
      - Supervised learning methods: [Bodo et al, 2007], [Gabrilovich et al, 2006], [Medelyan et al, 2008]
      - Unsupervised techniques
        - Based on a vector space model: [Schonhofen, 2009]
        - Based on a graph: [Coursey et al, 2009], [Syed et al, 2008]
      - Combined methods (supervised and unsupervised)
        - Based on a vector space model: [Mihalcea et al, 2007]
  • 12. Related Work
      - Our approach
        - Exploits titles, disambiguation pages, redirection links and article text to select candidate senses and alternative labels
        - Uses an unsupervised method
        - Uses a vector space model
      - Main benefit in comparison with previous approaches:
        - The interlinking of social media posts with the Web of Data through DBpedia resources
  • 13. Description of the Method
  • 14. Description of the Method
      - Input: a social media post
      - Step 1, part-of-speech tagging. Output keywords:
        - “torino”, “art”, “media”, “user”, “cloud”
      - Step 2, topic recognition. Output DBpedia resources:
        - http://dbpedia.org/resource/Turin
        - http://dbpedia.org/resource/Art
        - http://dbpedia.org/resource/User_(computing)
      - Step 3, language filtering. Output labels in the target language:
        - “Torino”, “arte”, “utente”, “mezzo di comunicazione di massa”, ...
  • 15. Description of the Method
      - Part-of-speech tagging
        - W_p = w_1, w_2, ..., w_n: list of lexical units contained in the post
        - lexcat(w): lexical category of the lexical unit w
        - lemma(w): lemma of w
        - L = {common noun, proper noun, acronym, …}: meaningful lexical categories that we consider
        - Stop words (lemmas excluded) = {“RT”, “/cc”, “;)”, …}
        - K_p = k_1, k_2, …, k_n: list of keywords with meaning
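The definitions on this slide can be read as a filter over tagged tokens. The following is a minimal sketch, not the authors' implementation: it assumes the post has already been tagged (the `TAGGED_POST` triples are hypothetical sample data), that keywords are the lemmas of the retained units, and that a unit is retained when its category is in L and its lemma is not a stop word.

```python
# Hypothetical pre-tagged input: (surface form, lemma, lexical category).
# A real pipeline would obtain these triples from a part-of-speech tagger.
TAGGED_POST = [
    ("RT", "RT", "proper noun"),
    ("phones", "phone", "common noun"),
    ("are", "be", "verb"),
    ("Blackberry", "Blackberry", "proper noun"),
    (";)", ";)", "punctuation"),
]

# The set L of meaningful lexical categories named on the slide.
MEANINGFUL_CATEGORIES = {"common noun", "proper noun", "acronym"}

# Stop words listed on the slide (lemmas that are always excluded).
STOP_WORDS = {"RT", "/cc", ";)"}


def extract_keywords(tagged_post):
    """Return the lemmas whose category is in L and which are not stop words."""
    keywords = []
    for _word, lemma, category in tagged_post:
        if category in MEANINGFUL_CATEGORIES and lemma not in STOP_WORDS:
            keywords.append(lemma)
    return keywords


keywords = extract_keywords(TAGGED_POST)
print(keywords)
```

Note that “RT” is dropped even though the hypothetical tagger labelled it a proper noun, which is the role the stop-word list plays here.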
  • 16. Description of the Method
      - Part-of-speech tagging example
        - Input: “But a hardware problem is more likely, especially if you use the phone a lot while eating. The Blackberry's tiny trackball could be suffering the same accumulation of gunk and grime that can plague a computer mouse that still uses a rubber ball on the underside to roll around the desk.”
        - Output of part-of-speech tagging: Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk
  • 17. Description of the Method
      - Topic recognition (Sem4Tags [García-Silva et al, 2010])
        - POS tagging: Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk
        - Context selection:
          - Blackberry, {phone, hardware, trackball, mouse}
          - Computer, {hardware, mouse, problem, desk}
          - …
        - Disambiguation:
          - http://dbpedia.org/resource/BlackBerry
          - http://dbpedia.org/resource/Computer
  • 18. Description of the Method
      - Context selection
        - For each keyword, a set of up to 4 related keywords that help to disambiguate its meaning
        - 4 is the number of words above which the context does not add more resolving power to disambiguation [Kaplan, 1955]
        - We compute semantic relatedness (active context) taking into account the co-occurrence of words in web pages [Gracia et al, 2009]
      - Active context selection for the keyword “blackberry”:

        Keyword        Relatedness
        phone          0.347
        hardware       0.347
        trackball      0.311
        mouse          0.311
        computer       0.288
        desk           0.287
        problem        0.246
        rubber ball    0.246
        grime          0.190
        gunk           0.168
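Once relatedness scores are available, selecting the active context amounts to keeping the highest-scoring keywords. A minimal sketch follows, using the scores from the slide; the function name and the tie-breaking rule (input order) are assumptions, and the web-based relatedness measure itself is not reproduced here.

```python
# Relatedness of each candidate to the keyword "blackberry", copied from the slide.
RELATEDNESS = {
    "phone": 0.347,
    "hardware": 0.347,
    "trackball": 0.311,
    "mouse": 0.311,
    "computer": 0.288,
    "desk": 0.287,
    "problem": 0.246,
    "rubber ball": 0.246,
    "grime": 0.190,
    "gunk": 0.168,
}

# Context size limit stated on the slide [Kaplan, 1955].
MAX_CONTEXT_SIZE = 4


def select_active_context(relatedness, size=MAX_CONTEXT_SIZE):
    """Return up to `size` keywords with the highest relatedness.

    sorted() is stable, so keywords with equal scores keep their input order.
    """
    ranked = sorted(relatedness.items(), key=lambda item: item[1], reverse=True)
    return [keyword for keyword, _score in ranked[:size]]


active_context = select_active_context(RELATEDNESS)
print(active_context)
```

With these scores the selected context is phone, hardware, trackball and mouse, the same set shown for Blackberry on the topic recognition slide.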
  • 19. Description of the Method
      - Disambiguation criteria
        - OPTION 1: most frequent sense for the ambiguous word
          - Determined by Wikipedia editors (the first link in a disambiguation page)
        - OPTION 2: vector space model
          1. A vector is built containing the keyword and its context
          2. For each candidate sense, a vector containing its top N terms is created using TF-IDF (Term Frequency and Inverse Document Frequency)
          3. Cosine similarity is used to determine which vectorised sense is most similar to the vector associated with the keyword

        DBpedia resource                                 Definition                                  Similarity
        BlackBerry                                       is a line of mobile e-mail and smartphone   0.224
        Blackberry                                       is an edible fruit                          0.15
        BlackBerry_(song)                                is a song by the Black Crowes               0.0
        BlackBerry_Township,_Itasca_County,_Minnesota    is a township in … Itasca County            0.0
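The three steps of the vector space option can be sketched in a few lines. This is an illustration under stated assumptions, not the Sem4Tags implementation: the sense descriptions in `SENSES` are invented toy text rather than DBpedia abstracts, the keyword vector uses binary weights, and the TF-IDF variant (relative term frequency times a smoothed IDF) is one common choice among several.

```python
import math
from collections import Counter

# Toy sense descriptions, invented for illustration (not actual DBpedia abstracts).
SENSES = {
    "BlackBerry": "line of mobile e-mail devices and smartphone phone with trackball hardware".split(),
    "Blackberry": "edible fruit produced by species in the rubus genus".split(),
    "BlackBerry_(song)": "song by the black crowes".split(),
}


def tf_idf_vectors(docs, top_n=20):
    """Build one TF-IDF vector per document, keeping only its top_n terms."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        weights = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        top = sorted(weights.items(), key=lambda item: item[1], reverse=True)
        vectors[name] = dict(top[:top_n])
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Step 1: vector with the keyword and its active context (binary weights).
context_terms = ["blackberry", "phone", "hardware", "trackball", "mouse"]
keyword_vector = {term: 1.0 for term in context_terms}

# Step 2: one TF-IDF vector per candidate sense.
sense_vectors = tf_idf_vectors(SENSES)

# Step 3: pick the sense whose vector is closest to the keyword vector.
scores = {name: cosine(keyword_vector, vec) for name, vec in sense_vectors.items()}
best_sense = max(scores, key=scores.get)
print(best_sense, scores)
```

On this toy data only the device sense shares terms with the context, so it is the one selected; the scores will not match the similarity values in the slide's table, which come from real article text.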
  • 20. Description of the Method
      - Language filtering
        - T_p = t_1, t_2, ..., t_n: set of topics identified
        - l: language to filter
        - Labels(t): set of labels associated with a given topic (values of the rdfs:label property)
        - lang(b): language of a given label b
        - T_p^l: set of topics with labels in language l
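Read literally, these definitions keep a topic when at least one of its labels is in the requested language. A minimal sketch under that assumption follows; the label data is a hand-written toy sample (the second URI is hypothetical), whereas a real system would read rdfs:label values from DBpedia.

```python
# Toy label data: topic URI -> list of (label text, language tag).
# Hand-written for illustration; the example.org topic is hypothetical.
TOPIC_LABELS = {
    "http://dbpedia.org/resource/Turin": [
        ("Turin", "en"),
        ("Torino", "it"),
        ("Turín", "es"),
    ],
    "http://example.org/resource/Only_English": [
        ("Only English", "en"),
    ],
}


def filter_by_language(topics, labels, language):
    """Keep the topics that have at least one label in `language`."""
    return [
        topic
        for topic in topics
        if any(lang == language for _text, lang in labels.get(topic, []))
    ]


topics = list(TOPIC_LABELS)
spanish_topics = filter_by_language(topics, TOPIC_LABELS, "es")
print(spanish_topics)
```

Topics without a label in the target language are dropped, which is why coverage falls after this step.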
  • 21. Evaluation
  • 22. Evaluation
      - Evaluated with a corpus of 10,000 posts in Spanish extracted from
        - Blogs
        - Forums
        - Microblogs (e.g., Twitter)
        - Social networks (e.g., Facebook, MySpace, LinkedIn and Xing)
        - Review sites (e.g., Ciao and Dooyoo)
        - Audiovisual sites (e.g., YouTube and Flickr)
        - News publishing sites (e.g., elpais.com, elmundo.es)
        - Others (web pages not classified in the categories above)
      - Variants evaluated
        1. Without considering any context: the default Wikipedia sense is assigned to a given keyword
        2. Considering as context all the other keywords found in the same post
        3. Active context selection technique: selecting the 4 most relevant topics from the keywords in the same post
  • 23. Evaluation
      - Coverage
        - Part-of-speech tagging: nearly 100%
        - Topic recognition: over 90% in almost all cases
        - After language filtering, coverage is reduced by about 10% because not all DBpedia resources have a label defined for the Spanish language

                                 Blogs    Forums   Microblogs  Social networks  Others   Reviews  Audiovisual  News     Overall
        POS tagging              99.63%   96.64%   99.01%      98.14%           98.77%   98.20%   97.20%       99.62%   98.32%
        Topic identification
          Without context        96.7%    87.68%   94.22%      93.54%           92.71%   88.81%   90.29%       96.67%   92.35%
          With context           96.64%   93.07%   95.54%      94.99%           95.13%   92.67%   97.41%       98.54%   95.02%
          Active context         99.24%   89.71%   94.43%      96.40%           94.75%   93.81%   92.23%       97.4%    94.72%
        Topic identification after language filtering
          Without context        91.21%   79.04%   87.54%      82.64%           86.93%   70.15%   82.52%       90.71%   82.74%
          With context           88.43%   80.84%   86.31%      85.24%           88.72%   76.19%   89.66%       92.46%   84.85%
          Active context         89.69%   80.51%   86.51%      86.78%           89.78%   75.59%   80.58%       90.54%   84.73%
  • 24. Evaluation
      - Precision
        - Evaluated a random sample of 1,816 posts (18.16%)
        - 47 human evaluators
        - Each post and its identified topics were shown to 3 different evaluators
        - Evaluation options:
          1. The topic is not related to the post
          2. The topic is somehow related to the post
          3. The topic is closely related to the post
          4. The evaluator does not have enough information to make a decision
        - Fleiss’ kappa test
          - Strength of agreement for 2 evaluators = 0.826 (very good)
          - Strength of agreement for 3 evaluators = 0.493 (moderate)
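For readers unfamiliar with the statistic, the standard Fleiss' kappa formula can be sketched as follows. This is the textbook definition, not the authors' evaluation script; the `TOY_RATINGS` table is made-up data for 3 evaluators and the 4 evaluation options, and the sketch does not attempt to reproduce how the slide derives separate figures for 2 and 3 evaluators.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a table where table[i][j] is the number of raters
    who assigned item i to category j. Every item needs the same rater count."""
    n_items = len(table)
    n_raters = sum(table[0])
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(table[0])
    total = n_items * n_raters

    # Share of all assignments that fell into each category.
    p_cat = [sum(row[j] for row in table) / total for j in range(n_categories)]

    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]

    p_bar = sum(p_item) / n_items      # observed agreement
    p_exp = sum(p * p for p in p_cat)  # agreement expected by chance
    # Undefined (division by zero) when all ratings fall into a single category.
    return (p_bar - p_exp) / (1 - p_exp)


# Made-up ratings: 5 topics, 3 evaluators, 4 evaluation options.
TOY_RATINGS = [
    [0, 0, 3, 0],
    [3, 0, 0, 0],
    [0, 2, 1, 0],
    [0, 0, 2, 1],
    [0, 3, 0, 0],
]
print(round(fleiss_kappa(TOY_RATINGS), 3))
```

A value of 1 means perfect agreement and values near 0 mean agreement no better than chance.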
  • 25. Evaluation
  • 26. Evaluation
      - Precision results
        - Precision depends on the channel
          - From 59.19% for social networks
            - More misspellings
            - More common nouns
          - To 88.89% for review sites
            - Concrete products and brands
            - Proper nouns tend to have a Wikipedia entry
        - The best context selection criterion also depends on the channel
          - Active context selection is better for microblogs and review sites
          - Considering all the post keywords as context is better for blogs
          - Using no context is better for the rest of the cases (almost all the channels)
        - Naïve default sense selection is effective
  • 27. Conclusions
  • 28. Conclusions
      - We have achieved good coverage results
      - Precision depends on the channel (best for review sites, worst for social networks)
      - With respect to considering context or not, no single variant provides the best results for all the channels
      - Future lines of work:
        - Improve natural language processing
          - Dealing with slang
          - Detecting set phrases
          - Improving n-gram detection
          - Dealing with microblogs’ specifics (e.g., hashtag expansion)
        - Combine broad-domain topic identification with knowledge about specific domains
          - Use of domain ontologies in combination with the DBpedia ontology