SlideShare a Scribd company logo
1 of 29
Identifying Topics in Social Media
Posts using DBpedia
Óscar Muñoz-García, Manuel de la Higuera Hernández, Carlos Navarro (Havas Media)
Andrés García-Silva, Óscar Corcho (Ontology Engineering Group - UPM)
NEM Summit – 28 Sept. 2011
Contents




 Introduction
 Related Work
 Description of the Method
 Evaluation
 Conclusions




Identifying Topics in Social Media Posts using DBpedia 2
Identifying Topics in Social Media Posts using DBpedia


Introduction
Introduction




 Topic Identification
    “The task of identifying the central ideas in a text” [Chin-Yew Lin, 1995]
 Applications of Topic Identification for Social Media
    Automatically summarising the content published in a channel.
    Mining the interest of a given user.
    etc…
 Benefits for Advertising Companies
    To focus the advertisement actions to the appropriate channels.
    To serve ads to the users based in their interest.




Identifying Topics in Social Media Posts using DBpedia 4
Introduction




 Difficulties of Topic Identification in Social Media
    Different channels with heterogeneous texts
                 l    Different lengths
                             From short sentences on Twitter to medium-size articles in blogs
                 l    Misspellings
                             Posts completely written in uppercase (or lowercase) letters
                                      Makes difficult the detection of proper nouns.
                             In Spanish, absence /presence of an accent in a word  different meanings
                                      “té” = “tea” (common noun)
                                      “te” = “you” (personal pronoun)
                 l    Use of set phrases
                             E.g., “too many cooks spoil the broth” (if too many people try to take charge at
                              a task, the end product might be ruined)
                             E.g., “rain cats and dogs” (rains heavily)


 It is important to take into account the context of the post

Identifying Topics in Social Media Posts using DBpedia 5
Introduction




 Why DBpedia?
   DBpedia is a structured Semantic Web representation of Wikipedia
                 l    Wikipedia is maintained by thousands of editors
                 l    Wikipedia evolves and adapts as knowledge changes [Syed et al, 2008]
             Each topic identified is mapped with a DBpedia resource
                 l    E.g., The URI http://dbpedia.org/resource/Turin
                             Represents the city of Torino
                             Has about 45 attributes defined (population, area, latitude, longitude, etc.)
                             Has labels and definitions in 14 different languages.
                             It is linked with many semantic entities
                                      E.g. Birth place of Amedeo Avogadro: http://dbpedia.org/resource/Amedeo_Avogadro
                                      It is linked with its Wikipedia article: http://en.wikipedia.org/wiki/Torino

             It is a nucleus for the Web of Data [Bizer et al, 2009]
                 l    Data published on the Web according to Tim Berners-Lee’s Linked Data principles.
                 l    Several billion RDF triples (i.e. facts)
                 l    Multi-domain datasets (geographic information, people, companies, online
                      communities, etc…)
Identifying Topics in Social Media Posts using DBpedia 6
Introduction
Identifying Topics in Social Media Posts using DBpedia


Related Work
Related Work




 Wikipedia has been exploited for the following tasks:
             Topic identification and text categorization
                 l    [Bodo et al, 2007], [Coursey et al, 2009], [Gabrilovich et al., 2006], [Syed et
                      al, 2008], [Schonhofen, 2009]
             Semantic Relatedness between fragments of text
                 l    [Gabrilovich et al, 2007]
             Keyword Extraction
                 l    [Mihalcea et al, 2007]
             Word sense disambiguation
                 l    [Mihalcea, 2007]




Identifying Topics in Social Media Posts using DBpedia 9
Related Work




 Uses of Wikipedia data-structure:
             Relating words in text with articles using article title information
                 l    [Schonhofen, 2009]
             Exploiting anchor text in links
                 l    [Coursey et al, 2009] [Mihalcea et al, 2007] [Mihalcea, 2007]
             Exploiting the whole articles
                 l    [Syed et al, 2008] [Gabrilovich, 2007]
             Exploiting categories to measure relatedness between articles
                 l    [Coursey et al, 2009] [Syed et al, 2008]
             Exploiting disambiguation pages and redirection links to select
              candidate senses and alternative labels
                 l    [Mendelyan et al, 2008]




Identifying Topics in Social Media Posts using DBpedia 10
Related Work




 Supervised learning methods
                 l    [Bodo et al, 2007] [Gabrilovich et al, 2006] [Mendelyan et al, 2008]
 Unsupervised techniques
    Based on a Vector Space Model
                 l    [Schonhofen, 2009]
             Based in a Graph
                 l    [Coursey et al, 2009] [Syed et al, 2008]
 Combined methods (supervised and unsupervised)
    Based on a Vector Space Model
                 l    [Mihalcea et al, 2007]




Identifying Topics in Social Media Posts using DBpedia 11
Related Work




 Our approach
    Exploits titles, disambiguation pages, redirection links and article
     text to select candidate senses and alternative labels
    Uses an unsupervised method
    Uses a vector space model
 Main benefit in comparison with previous approaches:
    The interlinking of social media posts with the Web of data through
     DBpedia resources




Identifying Topics in Social Media Posts using DBpedia 12
Identifying Topics in Social Media Posts using DBpedia


Description of the Method
Description of the Method




  Input




  Part-of-      • “torino”, “art”, “media”, “user”, “cloud”
  speech
  tagging


            • http://dbpedia.org/resource/Turin
            • http://dbpedia.org/resource/Art
  Topic     • http://dbpedia.org/resource/User_(computing)
Recognition



                • “Torino”, “arte”, “utente”, “mezzo di comunicazione di massa”, ...
Language
 Filtering




   Identifying Topics in Social Media Posts using DBpedia 14
Description of the Method




 Part-of-speech tagging
    Wp = w1,w2, ..., wn  list of lexical units contained in the post
    lexcat(w)  lexical category of the lexical unit w
    lemma(w)  lemma of w
    L = {common noun, proper noun, acronym…}  meaningful lexical categories
      that we consider
     = {“RT”, “/cc”, “;)”, …}  stop words (lemmas excluded)
    Kp = k1,k2, …, kn  list of keywords with meaning




Identifying Topics in Social Media Posts using DBpedia 15
Description of the Method




 Part-of-speech tagging example

                                                  • But a hardware problem is more likely, especially if
                                                    you use the phone a lot while eating. The
                                                    Blackberry's tiny trackball could be suffering the
                                                    same accumulation of gunk and grime that can
                                                    plague a computer mouse that still uses a rubber
                             Input                  ball on the underside to roll around the desk.




                                    • Blackberry, phone, trackball, computer,
                                      problem, grime, hardware, mouse, desk,
                     Part-of-speech   rubber ball, gunk
                           tagging




Identifying Topics in Social Media Posts using DBpedia 16
Description of the Method




     Topic Recognition (Sem4Tags [García-Silva et al, 2010])


                   • Blackberry, phone, trackball, computer, problem, grime, hardware,
    POS              mouse, desk, rubber ball, gunk
  tagging


          • Blackberry, {phone, hardware, trackball, mouse}
          • Computer, {hardware, mouse, problem, desk}
 Context
Selection • …


                   • http://dbpedia.org/resource/BlackBerry
                   • http://dbpedia.org/resource/Computer
Disambiguation




    Identifying Topics in Social Media Posts using DBpedia 17
Description of the Method




 Context Selection
             For each keyword, a set of up to 4 related keywords that will help to
              disambiguate the its meaning
             4 is the number of words above which the context does not add more resolving
              power to disambiguation [Kaplan, 1955]
             We compute semantic relatedness (active context) taking into account the
              co-ocurrence of words in web pages [Gracia et al, 2009]


                                   Keyword                  Relatedness   Keyword       Relatedness
                                    phone                      0.347      hardware         0.347
                                    trackball                  0.311       mouse           0.311
                                    computer                   0.288         desk          0.287
                                    problem                    0.246      rubber ball      0.246
                                      grime                    0.190        gunk           0.168


                   Active context selection for blackberry keyword

Identifying Topics in Social Media Posts using DBpedia 18
Description of the Method




  Disambiguation Criteria
              OPTION 1: Most frequent sense for the ambiguous word
                 l        Determined by Wikipedia editors (the first link in a disambiguation page)
              OPTION 2: Vector space model
                     1.   A vector containing the keyword and its context
                     2.   A vector containing top N terms is created from each candidate sense is created using
                          TF-IDF (Term Frequency and Inverse Document Frequency)
                     3.   The cosine similarity is used to determine which vectorised sense is more similar to
                          the vector associated to the keyword

  DBpedia resource                           Definition           Similarity
                                Is a line of mobile e-mail and
BlackBerry                                                          0.224
                                smartphone
Blackberry                      is an edible fruit                  0.15
BlackBerry_(song)               is a song by the Black Crowes        0.0
BlackBerry_Township,
_Itasca_County,                 Is a towship in … Itasca County      0.0
_Minnesota



 Identifying Topics in Social Media Posts using DBpedia 19
Description of the Method




 Language Filtering
    Tp = t1,t2, ..., tn  set of topics identified
    l  language to filter
    Labels(t)  set of labels associated to a given topic (value of rdfs:label
     property)
    lang(b)  language of a given label
    Tlp  set of topics with labels in language l




Identifying Topics in Social Media Posts using DBpedia 20
Identifying Topics in Social Media Posts using DBpedia


Evaluation
Evaluation




 Evaluated with a corpora of 10,000 posts in Spanish extracted from
    Blogs
    Forums
    Microblogs (e.g., Twitter)
    Social networks (e.g., Facebook, MySpace, LinkedIn and Xing)
    Review sites (e.g. , Ciao and Dooyoo)
    Audiovisual sites (e.g., YouTube and Flickr)
    News publising sites (e.g., elpais.com, elmundo.es)
    Others (web pages not classified in the categories above)
 Variants evaluated
        1.     Without considering any context
                l      Default Wikipedia sense assigned for a given keyword
        2. Considering as context all the other keywords found in the same post
        3. Active context selection technique
                l      Selecting the 4 most relevant topics from the keywords in the same post
Identifying Topics in Social Media Posts using DBpedia 22
Evaluation




   Coverage
                     Blogs       Forums Microblogs Social Networks Others Reviews Audiovisual    News   Overall
POS Tagging          99.63%       96,64%            99.01%    98.14% 98.77%   98.20%   97.20% 99.62% 98.32%
Topic identification
Without context        96.7%      87.68%            94.22%    93.54% 92.71%   88.81%   90.29% 96.67% 92.35%
With context         96.64%       93.07%            95.54%    94.99% 95.13%   92.67%   97.41% 98.54% 95.02%
Active context       99.24%       89.71%            94.43%    96.40% 94.75%   93.81%   92.23%    97.4% 94.72%
Topic identification after language filtering
Without context      91.21%       79.04%            87.54%    82.64% 86.93%   70.15%   82.52% 90.71% 82.74%

With context         88.43%       80.84%            86.31%    85.24% 88.72%   76.19%   89.66% 92.46% 84.85%

Active context       89.69%       80.51%            86.51%    86.78% 89.78%   75.59%   80.58% 90.54% 84.73%

               Part-of-speech tagging: nearly 100%
               Topic recognition: over 90% for almost all the cases
               After language filtering coverage is reduced in about 10% because not all
                DBpedia resources have a label defined for Spanish language


  Identifying Topics in Social Media Posts using DBpedia 23
Evaluation




 Precision
    Evaluated a random sample of 1,816 posts (18,16%)
    47 human evaluator
    Each post and topics identified shown to 3 different evaluators
    Evaluation options:
                 1.     The topic is not related with the post
                 2.     The topic is somehow related with the post
                 3.     The topic is closely related with the post
                 4.     The evaluator has not enough information for taking a decision
               Fleiss’ kappa test
                 l      Strength of agreement for 2 evaluators = 0.826 (very good)
                 l      Strength of agreement for 3 evaluators = 0.493 (moderate)




Identifying Topics in Social Media Posts using DBpedia 24
Evaluation




Identifying Topics in Social Media Posts using DBpedia 25
Evaluation




 Precision Results




             Precision depends on the channel
                 l    From 59.19% for social networks
                             More misspellings
                             More common nouns
                 l    To 88.89% for review sites
                             Concrete products and brands
                             Proper nouns tend to have a Wikipedia entry
             Context selection criteria also depends on the channel
                 l    Active context selection better for microblogs and review sites
                 l    Considering all the post keywords as context better for blogs
                 l    Without context selection is better for the rest of the cases (almost all the channels)
                             Naïve default sense selection is effective

Identifying Topics in Social Media Posts using DBpedia 26
Identifying Topics in Social Media Posts using DBpedia


Conclusions
Conclusions




 We have achieved good results of coverage
 The precision depends on the channel (better for review sites,
  worst for social networks)
 With respect to considering context or not, there is not a variant that
  provide the best results for all the channels.
 Future lines of work:
             Improve Natural Language Processing
                 l    Dealing with slang
                 l    Detect set phrases
                 l    Improve n-gram detection
                 l    Dealing with microblogs’ specifics (e.g., hashtag expansion)
             Combine broad-domain topic identification with knowledge about
              specific domains
                 l    Use of domain ontologies in combination with DBpedia ontology

Identifying Topics in Social Media Posts using DBpedia 28
Thank you!
 oscar.munoz@havasmedia.com

More Related Content

Viewers also liked

Sinhvienthamdinh.com --nh125
Sinhvienthamdinh.com --nh125Sinhvienthamdinh.com --nh125
Sinhvienthamdinh.com --nh125vinhthanhdbk
 
Online Community = People
Online Community = PeopleOnline Community = People
Online Community = PeopleColleen Young
 
Rynek ksiazek szkolnych a niektóre reformy polskiej edukacji
Rynek ksiazek szkolnych a niektóre reformy polskiej edukacjiRynek ksiazek szkolnych a niektóre reformy polskiej edukacji
Rynek ksiazek szkolnych a niektóre reformy polskiej edukacjicentrumcyfrowe
 
Newsweek - June 24, 2015
Newsweek - June 24, 2015Newsweek - June 24, 2015
Newsweek - June 24, 2015Jason Wu
 
Social Media, Knowledge Translation and Palliative Care
Social Media, Knowledge Translation and Palliative CareSocial Media, Knowledge Translation and Palliative Care
Social Media, Knowledge Translation and Palliative CareColleen Young
 
Chls 100 toni_fall_12
Chls 100 toni_fall_12Chls 100 toni_fall_12
Chls 100 toni_fall_12susanluevano
 

Viewers also liked (16)

Sinhvienthamdinh.com --nh125
Sinhvienthamdinh.com --nh125Sinhvienthamdinh.com --nh125
Sinhvienthamdinh.com --nh125
 
Chls 335 spring12
Chls 335 spring12Chls 335 spring12
Chls 335 spring12
 
profe
profeprofe
profe
 
Salita 4 j
Salita 4 jSalita 4 j
Salita 4 j
 
Ingles
InglesIngles
Ingles
 
iizuu concept
iizuu conceptiizuu concept
iizuu concept
 
Online Community = People
Online Community = PeopleOnline Community = People
Online Community = People
 
Chls 335 i_14
Chls 335 i_14Chls 335 i_14
Chls 335 i_14
 
Rynek ksiazek szkolnych a niektóre reformy polskiej edukacji
Rynek ksiazek szkolnych a niektóre reformy polskiej edukacjiRynek ksiazek szkolnych a niektóre reformy polskiej edukacji
Rynek ksiazek szkolnych a niektóre reformy polskiej edukacji
 
Newsweek - June 24, 2015
Newsweek - June 24, 2015Newsweek - June 24, 2015
Newsweek - June 24, 2015
 
Social Media, Knowledge Translation and Palliative Care
Social Media, Knowledge Translation and Palliative CareSocial Media, Knowledge Translation and Palliative Care
Social Media, Knowledge Translation and Palliative Care
 
Softskill ppt(refi)
Softskill ppt(refi)Softskill ppt(refi)
Softskill ppt(refi)
 
profe
profeprofe
profe
 
Prezentacja o reuse
Prezentacja o reusePrezentacja o reuse
Prezentacja o reuse
 
Chls 100 toni_fall_12
Chls 100 toni_fall_12Chls 100 toni_fall_12
Chls 100 toni_fall_12
 
Hw anis
Hw anisHw anis
Hw anis
 

Similar to Identifying Topics in Social Media Posts

Identifying Topics in Social Media Posts using DBpedia
Identifying Topics in Social Media Posts using DBpediaIdentifying Topics in Social Media Posts using DBpedia
Identifying Topics in Social Media Posts using DBpediaÓscar Muñoz García
 
Speaking drupal: a cultural linguistic adventure
Speaking drupal: a cultural linguistic adventureSpeaking drupal: a cultural linguistic adventure
Speaking drupal: a cultural linguistic adventureJackie Wolf
 
Social media, Web 2.0 & language teaching (Foresite, Sèvres, July 2011)
Social media, Web 2.0 & language teaching (Foresite, Sèvres, July 2011)Social media, Web 2.0 & language teaching (Foresite, Sèvres, July 2011)
Social media, Web 2.0 & language teaching (Foresite, Sèvres, July 2011)Claudia Warth
 
The Evolution of E-learning
The Evolution of E-learningThe Evolution of E-learning
The Evolution of E-learningJennifer Lim
 
Introduction to Web 2.0 Tools-Multimedia Unit 2
Introduction to Web 2.0 Tools-Multimedia Unit 2Introduction to Web 2.0 Tools-Multimedia Unit 2
Introduction to Web 2.0 Tools-Multimedia Unit 2mrsbrown526
 
The civil rights movement ppt for itc 1 kj 7
The civil rights movement ppt for itc 1 kj 7The civil rights movement ppt for itc 1 kj 7
The civil rights movement ppt for itc 1 kj 7hollowaymm
 
Exploiting Wikipedia and Twitter for Text Mining Applications
Exploiting Wikipedia and Twitter for Text Mining ApplicationsExploiting Wikipedia and Twitter for Text Mining Applications
Exploiting Wikipedia and Twitter for Text Mining ApplicationsIRJET Journal
 
Temple University Digital Scholarship Center: Model of the Month Club: Septem...
Temple University Digital Scholarship Center: Model of the Month Club: Septem...Temple University Digital Scholarship Center: Model of the Month Club: Septem...
Temple University Digital Scholarship Center: Model of the Month Club: Septem...Liz Rodrigues
 
Freddy Limpens: From folksonomies to ontologies: a socio-technical solution.
Freddy Limpens: From folksonomies to ontologies: a socio-technical solution.Freddy Limpens: From folksonomies to ontologies: a socio-technical solution.
Freddy Limpens: From folksonomies to ontologies: a socio-technical solution.PhiloWeb
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisJonathan Stray
 
Harmony project - JISC Synthesis meeting 2001
Harmony project - JISC Synthesis meeting 2001Harmony project - JISC Synthesis meeting 2001
Harmony project - JISC Synthesis meeting 2001Dan Brickley
 
Web 2.0
Web 2.0Web 2.0
Web 2.0bjornh
 
Conversations in Context: A Twitter Case for Social Media Systems Design
Conversations in Context: A Twitter Case for Social Media Systems DesignConversations in Context: A Twitter Case for Social Media Systems Design
Conversations in Context: A Twitter Case for Social Media Systems DesignCommunitySense
 
EUROCALL Teacher Education SIG Workshop 2010 Presentation Carolin Fuchs
EUROCALL Teacher Education SIG Workshop 2010 Presentation Carolin FuchsEUROCALL Teacher Education SIG Workshop 2010 Presentation Carolin Fuchs
EUROCALL Teacher Education SIG Workshop 2010 Presentation Carolin FuchsThe Open University
 
G fuchs euro_call_teacher_ed_sig_presentation_cfuchs_may2010
G fuchs euro_call_teacher_ed_sig_presentation_cfuchs_may2010G fuchs euro_call_teacher_ed_sig_presentation_cfuchs_may2010
G fuchs euro_call_teacher_ed_sig_presentation_cfuchs_may2010nickyjohnson
 
The civil rights movement ppt for itc 1 kj 5
The civil rights movement ppt for itc 1 kj 5The civil rights movement ppt for itc 1 kj 5
The civil rights movement ppt for itc 1 kj 5hollowaymm
 
The civil rights movement ppt for itc 1 kj 6
The civil rights movement ppt for itc 1 kj 6The civil rights movement ppt for itc 1 kj 6
The civil rights movement ppt for itc 1 kj 6hollowaymm
 
Web 2.0
Web 2.0Web 2.0
Web 2.0bjornh
 

Similar to Identifying Topics in Social Media Posts (20)

Identifying Topics in Social Media Posts using DBpedia
Identifying Topics in Social Media Posts using DBpediaIdentifying Topics in Social Media Posts using DBpedia
Identifying Topics in Social Media Posts using DBpedia
 
Speaking drupal: a cultural linguistic adventure
Speaking drupal: a cultural linguistic adventureSpeaking drupal: a cultural linguistic adventure
Speaking drupal: a cultural linguistic adventure
 
Social media, Web 2.0 & language teaching (Foresite, Sèvres, July 2011)
Social media, Web 2.0 & language teaching (Foresite, Sèvres, July 2011)Social media, Web 2.0 & language teaching (Foresite, Sèvres, July 2011)
Social media, Web 2.0 & language teaching (Foresite, Sèvres, July 2011)
 
The Evolution of E-learning
The Evolution of E-learningThe Evolution of E-learning
The Evolution of E-learning
 
Affordances in Social Media for Education
Affordances in Social Media for EducationAffordances in Social Media for Education
Affordances in Social Media for Education
 
Introduction to Web 2.0 Tools-Multimedia Unit 2
Introduction to Web 2.0 Tools-Multimedia Unit 2Introduction to Web 2.0 Tools-Multimedia Unit 2
Introduction to Web 2.0 Tools-Multimedia Unit 2
 
The civil rights movement ppt for itc 1 kj 7
The civil rights movement ppt for itc 1 kj 7The civil rights movement ppt for itc 1 kj 7
The civil rights movement ppt for itc 1 kj 7
 
Exploiting Wikipedia and Twitter for Text Mining Applications
Exploiting Wikipedia and Twitter for Text Mining ApplicationsExploiting Wikipedia and Twitter for Text Mining Applications
Exploiting Wikipedia and Twitter for Text Mining Applications
 
Temple University Digital Scholarship Center: Model of the Month Club: Septem...
Temple University Digital Scholarship Center: Model of the Month Club: Septem...Temple University Digital Scholarship Center: Model of the Month Club: Septem...
Temple University Digital Scholarship Center: Model of the Month Club: Septem...
 
Freddy Limpens: From folksonomies to ontologies: a socio-technical solution.
Freddy Limpens: From folksonomies to ontologies: a socio-technical solution.Freddy Limpens: From folksonomies to ontologies: a socio-technical solution.
Freddy Limpens: From folksonomies to ontologies: a socio-technical solution.
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text Analysis
 
Harmony project - JISC Synthesis meeting 2001
Harmony project - JISC Synthesis meeting 2001Harmony project - JISC Synthesis meeting 2001
Harmony project - JISC Synthesis meeting 2001
 
Web 2.0
Web 2.0Web 2.0
Web 2.0
 
Conversations in Context: A Twitter Case for Social Media Systems Design
Conversations in Context: A Twitter Case for Social Media Systems DesignConversations in Context: A Twitter Case for Social Media Systems Design
Conversations in Context: A Twitter Case for Social Media Systems Design
 
EUROCALL Teacher Education SIG Workshop 2010 Presentation Carolin Fuchs
EUROCALL Teacher Education SIG Workshop 2010 Presentation Carolin FuchsEUROCALL Teacher Education SIG Workshop 2010 Presentation Carolin Fuchs
EUROCALL Teacher Education SIG Workshop 2010 Presentation Carolin Fuchs
 
G fuchs euro_call_teacher_ed_sig_presentation_cfuchs_may2010
G fuchs euro_call_teacher_ed_sig_presentation_cfuchs_may2010G fuchs euro_call_teacher_ed_sig_presentation_cfuchs_may2010
G fuchs euro_call_teacher_ed_sig_presentation_cfuchs_may2010
 
The civil rights movement ppt for itc 1 kj 5
The civil rights movement ppt for itc 1 kj 5The civil rights movement ppt for itc 1 kj 5
The civil rights movement ppt for itc 1 kj 5
 
The civil rights movement ppt for itc 1 kj 6
The civil rights movement ppt for itc 1 kj 6The civil rights movement ppt for itc 1 kj 6
The civil rights movement ppt for itc 1 kj 6
 
Web 2.0
Web 2.0Web 2.0
Web 2.0
 
Understanding Critical Elements of E-books: The Social Reading Experience of ...
Understanding Critical Elements of E-books: The Social Reading Experience of ...Understanding Critical Elements of E-books: The Social Reading Experience of ...
Understanding Critical Elements of E-books: The Social Reading Experience of ...
 

Recently uploaded

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Recently uploaded (20)

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

Identifying Topics in Social Media Posts

  • 1. Identifying Topics in Social Media Posts using DBpedia Óscar Muñoz-García, Manuel de la Higuera Hernández, Carlos Navarro (Havas Media) Andrés García-Silva, Óscar Corcho (Ontology Engineering Group - UPM) NEM Summit – 28 Sept. 2011
  • 2. Contents  Introduction  Related Work  Description of the Method  Evaluation  Conclusions Identifying Topics in Social Media Posts using DBpedia 2
  • 3. Identifying Topics in Social Media Posts using DBpedia Introduction
  • 4. Introduction  Topic Identification  “The task of identifying the central ideas in a text” [Chin-Yew Lin, 1995]  Applications of Topic Identification for Social Media  Automatically summarising the content published in a channel.  Mining the interest of a given user.  etc…  Benefits for Advertising Companies  To focus the advertisement actions to the appropriate channels.  To serve ads to the users based in their interest. Identifying Topics in Social Media Posts using DBpedia 4
  • 5. Introduction  Difficulties of Topic Identification in Social Media  Different channels with heterogeneous texts l Different lengths  From short sentences on Twitter to medium-size articles in blogs l Misspellings  Posts completely written in uppercase (or lowercase) letters  Makes difficult the detection of proper nouns.  In Spanish, absence /presence of an accent in a word  different meanings  “té” = “tea” (common noun)  “te” = “you” (personal pronoun) l Use of set phrases  E.g., “too many cooks spoil the broth” (if too many people try to take charge at a task, the end product might be ruined)  E.g., “rain cats and dogs” (rains heavily)  It is important to take into account the context of the post Identifying Topics in Social Media Posts using DBpedia 5
  • 6. Introduction  Why DBpedia?  DBpedia is a structured Semantic Web representation of Wikipedia l Wikipedia is maintained by thousands of editors l Wikipedia evolves and adapts as knowledge changes [Syed et al, 2008]  Each topic identified is mapped with a DBpedia resource l E.g., The URI http://dbpedia.org/resource/Turin  Represents the city of Torino  Has about 45 attributes defined (population, area, latitude, longitude, etc.)  Has labels and definitions in 14 different languages.  It is linked with many semantic entities  E.g. Birth place of Amedeo Avogadro: http://dbpedia.org/resource/Amedeo_Avogadro  It is linked with its Wikipedia article: http://en.wikipedia.org/wiki/Torino  It is a nucleus for the Web of Data [Bizer et al, 2009] l Data published on the Web according to Tim Berners-Lee’s Linked Data principles. l Several billion RDF triples (i.e. facts) l Multi-domain datasets (geographic information, people, companies, online communities, etc…) Identifying Topics in Social Media Posts using DBpedia 6
  • 8. Identifying Topics in Social Media Posts using DBpedia Related Work
  • 9. Related Work  Wikipedia has been exploited for the following tasks:  Topic identification and text categorization l [Bodo et al, 2007], [Coursey et al, 2009], [Gabrilovich et al., 2006], [Syed et al, 2008], [Schonhofen, 2009]  Semantic Relatedness between fragments of text l [Gabrilovich et al, 2007]  Keyword Extraction l [Mihalcea et al, 2007]  Word sense disambiguation l [Mihalcea, 2007] Identifying Topics in Social Media Posts using DBpedia 9
  • 10. Related Work  Uses of Wikipedia data-structure:  Relating words in text with articles using article title information l [Schonhofen, 2009]  Exploiting anchor text in links l [Coursey et al, 2009] [Mihalcea et al, 2007] [Mihalcea, 2007]  Exploiting the whole articles l [Syed et al, 2008] [Gabrilovich, 2007]  Exploiting categories to measure relatedness between articles l [Coursey et al, 2009] [Syed et al, 2008]  Exploiting disambiguation pages and redirection links to select candidate senses and alternative labels l [Mendelyan et al, 2008] Identifying Topics in Social Media Posts using DBpedia 10
  • 11. Related Work  Supervised learning methods l [Bodo et al, 2007] [Gabrilovich et al, 2006] [Mendelyan et al, 2008]  Unsupervised techniques  Based on a Vector Space Model l [Schonhofen, 2009]  Based in a Graph l [Coursey et al, 2009] [Syed et al, 2008]  Combined methods (supervised and unsupervised)  Based on a Vector Space Model l [Mihalcea et al, 2007] Identifying Topics in Social Media Posts using DBpedia 11
  • 12. Related Work  Our approach  Exploits titles, disambiguation pages, redirection links and article text to select candidate senses and alternative labels  Uses an unsupervised method  Uses a vector space model  Main benefit in comparison with previous approaches:  The interlinking of social media posts with the Web of data through DBpedia resources Identifying Topics in Social Media Posts using DBpedia 12
  • 13. Identifying Topics in Social Media Posts using DBpedia Description of the Method
  • 14. Description of the Method Input Part-of- • “torino”, “art”, “media”, “user”, “cloud” speech tagging • http://dbpedia.org/resource/Turin • http://dbpedia.org/resource/Art Topic • http://dbpedia.org/resource/User_(computing) Recognition • “Torino”, “arte”, “utente”, “mezzo di comunicazione di massa”, ... Language Filtering Identifying Topics in Social Media Posts using DBpedia 14
  • 15. Description of the Method  Part-of-speech tagging  Wp = w1,w2, ..., wn  list of lexical units contained in the post  lexcat(w)  lexical category of the lexical unit w  lemma(w)  lemma of w  L = {common noun, proper noun, acronym…}  meaningful lexical categories that we consider   = {“RT”, “/cc”, “;)”, …}  stop words (lemmas excluded)  Kp = k1,k2, …, kn  list of keywords with meaning Identifying Topics in Social Media Posts using DBpedia 15
  • 16. Description of the Method  Part-of-speech tagging example • But a hardware problem is more likely, especially if you use the phone a lot while eating. The Blackberry's tiny trackball could be suffering the same accumulation of gunk and grime that can plague a computer mouse that still uses a rubber Input ball on the underside to roll around the desk. • Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, Part-of-speech rubber ball, gunk tagging Identifying Topics in Social Media Posts using DBpedia 16
  • 17. Description of the Method  Topic Recognition (Sem4Tags [García-Silva et al, 2010]) • Blackberry, phone, trackball, computer, problem, grime, hardware, POS mouse, desk, rubber ball, gunk tagging • Blackberry, {phone, hardware, trackball, mouse} • Computer, {hardware, mouse, problem, desk} Context Selection • … • http://dbpedia.org/resource/BlackBerry • http://dbpedia.org/resource/Computer Disambiguation Identifying Topics in Social Media Posts using DBpedia 17
  • 18. Description of the Method  Context Selection  For each keyword, a set of up to 4 related keywords that will help to disambiguate the its meaning  4 is the number of words above which the context does not add more resolving power to disambiguation [Kaplan, 1955]  We compute semantic relatedness (active context) taking into account the co-ocurrence of words in web pages [Gracia et al, 2009] Keyword Relatedness Keyword Relatedness phone 0.347 hardware 0.347 trackball 0.311 mouse 0.311 computer 0.288 desk 0.287 problem 0.246 rubber ball 0.246 grime 0.190 gunk 0.168 Active context selection for blackberry keyword Identifying Topics in Social Media Posts using DBpedia 18
  • 19. Description of the Method  Disambiguation Criteria  OPTION 1: Most frequent sense for the ambiguous word l Determined by Wikipedia editors (the first link in a disambiguation page)  OPTION 2: Vector space model 1. A vector containing the keyword and its context 2. A vector containing top N terms is created from each candidate sense is created using TF-IDF (Term Frequency and Inverse Document Frequency) 3. The cosine similarity is used to determine which vectorised sense is more similar to the vector associated to the keyword DBpedia resource Definition Similarity Is a line of mobile e-mail and BlackBerry 0.224 smartphone Blackberry is an edible fruit 0.15 BlackBerry_(song) is a song by the Black Crowes 0.0 BlackBerry_Township, _Itasca_County, Is a towship in … Itasca County 0.0 _Minnesota Identifying Topics in Social Media Posts using DBpedia 19
  • 20. Description of the Method  Language Filtering  Tp = t1,t2, ..., tn  set of topics identified  l  language to filter  Labels(t)  set of labels associated to a given topic (value of rdfs:label property)  lang(b)  language of a given label  Tlp  set of topics with labels in language l Identifying Topics in Social Media Posts using DBpedia 20
  • 21. Identifying Topics in Social Media Posts using DBpedia Evaluation
  • 22. Evaluation  Evaluated with a corpora of 10,000 posts in Spanish extracted from  Blogs  Forums  Microblogs (e.g., Twitter)  Social networks (e.g., Facebook, MySpace, LinkedIn and Xing)  Review sites (e.g. , Ciao and Dooyoo)  Audiovisual sites (e.g., YouTube and Flickr)  News publising sites (e.g., elpais.com, elmundo.es)  Others (web pages not classified in the categories above)  Variants evaluated 1. Without considering any context l Default Wikipedia sense assigned for a given keyword 2. Considering as context all the other keywords found in the same post 3. Active context selection technique l Selecting the 4 most relevant topics from the keywords in the same post Identifying Topics in Social Media Posts using DBpedia 22
  • 23. Evaluation  Coverage Blogs Forums Microblogs Social Networks Others Reviews Audiovisual News Overall POS Tagging 99.63% 96,64% 99.01% 98.14% 98.77% 98.20% 97.20% 99.62% 98.32% Topic identification Without context 96.7% 87.68% 94.22% 93.54% 92.71% 88.81% 90.29% 96.67% 92.35% With context 96.64% 93.07% 95.54% 94.99% 95.13% 92.67% 97.41% 98.54% 95.02% Active context 99.24% 89.71% 94.43% 96.40% 94.75% 93.81% 92.23% 97.4% 94.72% Topic identification after language filtering Without context 91.21% 79.04% 87.54% 82.64% 86.93% 70.15% 82.52% 90.71% 82.74% With context 88.43% 80.84% 86.31% 85.24% 88.72% 76.19% 89.66% 92.46% 84.85% Active context 89.69% 80.51% 86.51% 86.78% 89.78% 75.59% 80.58% 90.54% 84.73%  Part-of-speech tagging: nearly 100%  Topic recognition: over 90% for almost all the cases  After language filtering coverage is reduced in about 10% because not all DBpedia resources have a label defined for Spanish language Identifying Topics in Social Media Posts using DBpedia 23
  • 24. Evaluation  Precision  Evaluated a random sample of 1,816 posts (18,16%)  47 human evaluator  Each post and topics identified shown to 3 different evaluators  Evaluation options: 1. The topic is not related with the post 2. The topic is somehow related with the post 3. The topic is closely related with the post 4. The evaluator has not enough information for taking a decision  Fleiss’ kappa test l Strength of agreement for 2 evaluators = 0.826 (very good) l Strength of agreement for 3 evaluators = 0.493 (moderate) Identifying Topics in Social Media Posts using DBpedia 24
  • 25. Evaluation Identifying Topics in Social Media Posts using DBpedia 25
  • 26. Evaluation  Precision Results  Precision depends on the channel l From 59.19% for social networks  More misspellings  More common nouns l To 88.89% for review sites  Concrete products and brands  Proper nouns tend to have a Wikipedia entry  Context selection criteria also depends on the channel l Active context selection better for microblogs and review sites l Considering all the post keywords as context better for blogs l Without context selection is better for the rest of the cases (almost all the channels)  Naïve default sense selection is effective Identifying Topics in Social Media Posts using DBpedia 26
  • 27. Identifying Topics in Social Media Posts using DBpedia Conclusions
  • 28. Conclusions  We have achieved good results of coverage  The precision depends on the channel (better for review sites, worst for social networks)  With respect to considering context or not, there is not a variant that provide the best results for all the channels.  Future lines of work:  Improve Natural Language Processing l Dealing with slang l Detect set phrases l Improve n-gram detection l Dealing with microblogs’ specifics (e.g., hashtag expansion)  Combine broad-domain topic identification with knowledge about specific domains l Use of domain ontologies in combination with DBpedia ontology Identifying Topics in Social Media Posts using DBpedia 28