Analysing Structured Scholarly Data
Embedded in Web Pages
Pracheta Sahoo, Ujwal Gadiraju, Ran Yu,
Sriparna Saha and Stefan Dietze
WWW 2016
April 11th, 2016
Montreal, Canada
OVERVIEW
❏ INTRODUCTION
❏ MOTIVATION
❏ RESEARCH QUESTIONS
❏ ANALYSES
❏ CONCLUSIONS
❏ FUTURE WORK
INTRODUCTION (1/3)
The Web: nearly 46 trillion
Web pages indexed by Google
VS
Linked Data: approx. 1000
datasets & 100 billion
statements
● different order of
magnitude w.r.t. scale &
dynamics
Are there other semantics (structured facts) on the Web?
INTRODUCTION (2/3)
● Web pages embed structured data
(microdata, microformats and RDFa)
○ Interpretation of web documents
(search & retrieval)
● Increase in prevalence of embedded
markup (2014 Google study of 12 bn
pages estimates an adoption of 26%)
● “Web Data Commons” (Meusel et al.
[ISWC’14])
○ Markup from Common Crawl (2.2 bn
pages)
○ 17 billion RDF quads
○ Markup in 26% of pages, 14% of PLDs
in 2013 (increase from 6% in 2011)
Other semantics
(structured facts) on
the Web!
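To make the notion of embedded markup concrete, the following is a minimal sketch of how microdata attributes could be read from a page using only the Python standard library. The sample HTML, the class name `MicrodataTypeCollector` and the helper `collect_markup` are illustrative assumptions, not part of the study; Web Data Commons uses its own extraction framework, and RDFa and microformats are not handled here.

```python
from html.parser import HTMLParser


class MicrodataTypeCollector(HTMLParser):
    """Collects itemtype URLs and itemprop names from microdata attributes."""

    def __init__(self):
        super().__init__()
        self.item_types = []
        self.item_props = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; valueless attributes
        # such as "itemscope" arrive with value None.
        attr_map = dict(attrs)
        if attr_map.get("itemtype"):
            self.item_types.append(attr_map["itemtype"])
        if attr_map.get("itemprop"):
            self.item_props.append(attr_map["itemprop"])


def collect_markup(html):
    """Return (itemtype URLs, itemprop names) in document order."""
    collector = MicrodataTypeCollector()
    collector.feed(html)
    return collector.item_types, collector.item_props


# Hypothetical page fragment, made up for illustration.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/ScholarlyArticle">
  <h1 itemprop="name">An Example Paper</h1>
  <span itemprop="author" itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">A. Author</span>
  </span>
</div>
"""

types, props = collect_markup(SAMPLE_HTML)
print(types)
print(props)
```

The sketch only lists types and property names; it does not rebuild the nesting of items, which a real extractor would need in order to emit RDF quads.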
INTRODUCTION (3/3)
Characteristics of Markup Data
MOTIVATION
● Embedded markup ⇒ sparsely
linked, large % of coreferences,
redundant statements
● Uptake and reuse of embedded
markup is hindered by the lack
of dynamics, scale
● Lack of understanding of the
adoption of markup for
scholarly resource metadata
WHAT WE BRING TO THE TABLE ...
● Study of scholarly data
extracted from embedded
annotations (Web Data
Commons)
● Shape & characteristics of
entity descriptions
● Level of adoption of terms
& types, distributions
across TLDs, PLDs, data
publishers
RESEARCH QUESTIONS
RQ1 What are frequently used
terms & types for scholarly data?
RQ2 How are statements about
bibliographic data distributed
across the web? Who are the key
providers of bibliographic markup?
RQ3 What are the frequent errors
that can be observed?
DATASET
● Web Data Commons (WDC) 2014 dataset
● Subset ⇒ all statements describing entities
of type s:ScholarlyArticle or co-occurring
on the same document with any
s:ScholarlyArticle instance
○ 6,793,764 quads
○ 1,184,623 entities
○ 83 distinct classes
○ 429 distinct predicates
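The subset definition above can be sketched as a filter over quads. This is an assumption-laden illustration, not the authors' pipeline: quads are represented as already-parsed `(subject, predicate, object, document_url)` tuples, the fourth element is assumed to identify the source page, and the function names `select_scholarly_subset` and `summarise` are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHOLARLY_ARTICLE = "http://schema.org/ScholarlyArticle"


def select_scholarly_subset(quads):
    """Keep every quad from a document that contains at least one
    s:ScholarlyArticle instance.

    Because article entities live on such documents themselves, filtering
    by document covers both parts of the subset definition.
    """
    quads = list(quads)
    article_docs = {
        doc for _s, p, o, doc in quads
        if p == RDF_TYPE and o == SCHOLARLY_ARTICLE
    }
    return [q for q in quads if q[3] in article_docs]


def summarise(quads):
    """Counts analogous to the figures reported for the dataset.

    Assumes subject identifiers are unique across documents.
    """
    return {
        "quads": len(quads),
        "entities": len({s for s, _p, _o, _d in quads}),
        "classes": len({o for _s, p, o, _d in quads if p == RDF_TYPE}),
        "predicates": len({p for _s, p, _o, _d in quads}),
    }


# Made-up sample data: one scholarly page and one unrelated page.
sample_quads = [
    ("_:a1", RDF_TYPE, SCHOLARLY_ARTICLE, "http://example.org/paper1"),
    ("_:a1", "http://schema.org/name", "An Example Paper", "http://example.org/paper1"),
    ("_:p1", RDF_TYPE, "http://schema.org/Person", "http://example.org/paper1"),
    ("_:p1", "http://schema.org/name", "A. Author", "http://example.org/paper1"),
    ("_:x1", RDF_TYPE, "http://schema.org/Product", "http://example.org/shop"),
    ("_:x1", "http://schema.org/name", "A Product", "http://example.org/shop"),
]

subset = select_scholarly_subset(sample_quads)
print(summarise(subset))
```

Whether the reported predicate count includes rdf:type is not stated in the slides; the sketch counts it.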
DATASET - Considerations
● s:ScholarlyArticle is the only type which
explicitly refers to scholarly articles
● We focus on schema.org, the most
widely used schema
● Types considered ⇒ s:ScholarlyArticle,
s:Person and s:Organization
○ 280,616 instances (s:ScholarlyArticle)
○ 847,417 instances (s:Person)
○ 3,798 instances (s:Organization)
SCHOLARLY TYPES & PREDICATES (1/2)
Cumulative dist. of predicates over instances across
extracted types
(plot annotations: 1 to 14, 1 to 9, 1 to 4)
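A cumulative distribution of predicates over instances, as named in the figure caption, could be computed roughly as follows. This is a sketch under stated assumptions: statements are given as `(subject, predicate)` pairs for a single type, and the function names and sample data are invented for illustration.

```python
from collections import Counter, defaultdict


def predicates_per_instance(statements):
    """Map each subject to its number of distinct predicates."""
    preds = defaultdict(set)
    for subject, predicate in statements:
        preds[subject].add(predicate)
    return {subject: len(ps) for subject, ps in preds.items()}


def cumulative_distribution(counts):
    """Return [(k, fraction of instances with at most k distinct predicates)]."""
    histogram = Counter(counts.values())
    total = sum(histogram.values())
    running = 0
    result = []
    for k in sorted(histogram):
        running += histogram[k]
        result.append((k, running / total))
    return result


# Made-up statements for four article instances.
sample_statements = [
    ("a", "name"), ("a", "author"), ("a", "datePublished"),
    ("b", "name"),
    ("c", "name"), ("c", "author"),
    ("d", "name"), ("d", "name"),  # a repeated predicate counts once
]

counts = predicates_per_instance(sample_statements)
print(cumulative_distribution(counts))
```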
SCHOLARLY TYPES & PREDICATES (2/2)
Top-10 Predicates for s:ScholarlyArticle
DOMAINS & DOCUMENTS (1/5)
Distribution of Entities & Statements across PLDs
DOMAINS & DOCUMENTS (2/5)
Top-10 PLDs (ranked by no. of entities)
DOMAINS & DOCUMENTS (3/5)
Distribution of Entities & Statements across TLDs
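As a rough illustration of how entities might be grouped by top-level domain, the sketch below takes the last label of each document's hostname. The input format `(entity_id, document_url)` and the function names are assumptions. Deriving pay-level domains (PLDs) correctly would additionally require a public suffix list, which is deliberately left out here.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_level_domain(url):
    """Return the last hostname label, or None if there is no dotted host.

    Note: a numeric IP host would yield its last octet; such URLs would
    need to be filtered separately.
    """
    host = urlsplit(url).hostname
    if not host or "." not in host:
        return None
    return host.rsplit(".", 1)[1]


def count_by_tld(entity_documents):
    """Count entities per TLD; each entity is assumed to appear once."""
    counts = Counter()
    for _entity, url in entity_documents:
        tld = top_level_domain(url)
        if tld is not None:
            counts[tld] += 1
    return counts


# Made-up sample data.
sample_entities = [
    ("e1", "http://example.org/paper1"),
    ("e2", "http://example.org/paper2"),
    ("e3", "https://www.example.fr/article"),
    ("e4", "http://example.com/item"),
]

tld_counts = count_by_tld(sample_entities)
print(tld_counts.most_common())
```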
DOMAINS & DOCUMENTS (4/5)
Distribution of Entities & Statements across HTML
Documents
DOMAINS & DOCUMENTS (5/5)
Top-10 Documents Ranked According to
Embedded Entities
TOPICS & PUBLICATION TYPES (1/4)
Distribution of Scholarly Articles across Publishers
TOPICS & PUBLICATION TYPES (2/4)
Top-10 Publishers and corresponding no. of
Publications
TOPICS & PUBLICATION TYPES (3/4)
Top-10 Publication Types (genres) across WDC
TOPICS & PUBLICATION TYPES (4/4)
Top-10 Article Titles (ranked by frequency of occurrence)
FREQUENT ERRORS - Schema Violations
Top-10 Misused Predicates
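One possible reading of a "misused predicate" is a predicate attached to an entity whose type does not expect it. The sketch below counts such cases against a small lookup table. The table `EXPECTED_PREDICATES` is a hypothetical, heavily abbreviated excerpt for illustration only; a real check would load the full schema.org vocabulary including inherited properties, and the slides do not specify how violations were actually detected.

```python
from collections import Counter

# Hypothetical, abbreviated excerpt; NOT the full schema.org definition.
EXPECTED_PREDICATES = {
    "ScholarlyArticle": {"name", "author", "datePublished", "publisher"},
    "Person": {"name", "affiliation"},
}


def misused_predicates(statements):
    """Count (type, predicate) pairs where the predicate is not expected.

    statements: iterable of (entity_type, predicate). Types missing from
    the lookup table are skipped rather than flagged.
    """
    misuse = Counter()
    for entity_type, predicate in statements:
        expected = EXPECTED_PREDICATES.get(entity_type)
        if expected is not None and predicate not in expected:
            misuse[(entity_type, predicate)] += 1
    return misuse


# Made-up sample statements.
sample_typed_statements = [
    ("ScholarlyArticle", "name"),
    ("ScholarlyArticle", "author"),
    ("ScholarlyArticle", "affiliation"),
    ("ScholarlyArticle", "affiliation"),
    ("Person", "name"),
    ("Person", "datePublished"),
    ("Event", "location"),
]

violations = misused_predicates(sample_typed_statements)
print(violations.most_common(10))
```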
CONCLUSIONS (1/2)
● First study on coverage & char. of
bibliographic metadata embedded
in web pages.
● Early adopters ⇒ publishers,
libraries, other providers of
bibliographic data.
● Usage of terms, types ⇒ dist.
across providers, domains and
topics follows a power law; few
providers & documents
contributing to majority of data.
CONCLUSIONS (2/2)
● Top-k genres & publishers indicate a
bias towards French, English data
providers.
● Article titles, PLDs & publishers ⇒
bias towards Computer Science and Life
Sciences.
● In this study we only consider entities
tagged explicitly as s:ScholarlyArticle;
a deeper analysis considering more
types (article, book, etc.) and other
creative works can shed light on the
true scale and potential of
embedded markup data.
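The concentration described in the conclusions (few providers and documents contributing the majority of the data) can be quantified with a simple top-k share. This is a minimal sketch with invented numbers; the function name `top_k_share` and the sample provider counts are assumptions and do not reproduce any figure from the study.

```python
def top_k_share(counts, k):
    """Fraction of all items contributed by the k largest providers."""
    values = sorted(counts.values(), reverse=True)
    total = sum(values)
    if total == 0:
        return 0.0
    return sum(values[:k]) / total


# Made-up entity counts per provider.
sample_provider_counts = {
    "a.example": 80,
    "b.example": 10,
    "c.example": 5,
    "d.example": 5,
}

print(top_k_share(sample_provider_counts, 1))
print(top_k_share(sample_provider_counts, 2))
```

A share close to 1 for small k indicates a heavily skewed distribution; confirming an actual power law would require fitting the full distribution rather than this summary.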
FUTURE WORK
● Targeted crawl of typical
providers of scholarly data
(publishers, academic
orgs., libraries, etc.)
● Consider implicitly typed
bibliographic or creative
work as scholarly data
Contact Details :
gadiraju@l3s.de
http://www.L3S.de
LIMITATIONS
● Our study is limited to
schema.org & the types of
s:ScholarlyArticle, s:Person,
s:Organization.
● We consider only explicitly
linked scholarly works.

Analysing Structured Scholarly Data Embedded in Web Pages

  • 1. Analysing Structured Scholarly Data Embedded in Web Pages Pracheta Sahoo, Ujwal Gadiraju, Ran Yu, Sriparna Saha and Stefan Dietze WWW 2016 April 11th , 2016 Montreal, Canada
  • 2. OVERVIEW ❏ INTRODUCTION ❏ MOTIVATION ❏ RESEARCH QUESTIONS ❏ ANALYSES ❏ CONCLUSIONS ❏ FUTURE WORK
  • 3. INTRODUCTION (1/3) The Web: nearly 46 trillion Web pages indexed by Google VS Linked Data: approx. 1000 datasets & 100 billion statements ● different order of magnitude w.r.t. scale & dynamics Are there other semantics (structured facts) on the Web?
  • 4. INTRODUCTION (2/3) ● Web pages embed structured data (microdata, microformats and RDFa) ○ Interpretation of web documents (search & retrieval) ● Increase in prevalence of embedded markup (2014 Google study of 12 bn pages estimates an adoption of 26%) ● “Web Data Commons” (Meusel et al. [ISWC’14]) ○ Markup from Common Crawl (2.2 bn pages) ○ 17 billion RDF quads ○ Markup in 26% of pages, 14% of PLDs in 2013 (increase from 6% in 2011)
  • 7. MOTIVATION ● Embedded markup ⇒ sparsely linked, large % of coreferences, redundant statements ● Uptake and reuse of embedded markup is hindered by the lack of dynamics, scale ● Lack of understanding of the adoption of markup for scholarly resource metadata
  • 8. WHAT WE BRING TO THE TABLE ... ● Study of scholarly data extracted from embedded annotations (Web Data Commons) ● Shape & characteristics of entity descriptions ● Level of adoption of terms & types, distributions across TLDs, PLDs, data publishers
  • 9. RESEARCH QUESTIONS RQ1 What are frequently used terms & types for scholarly data? RQ2 How are statements about bibliographic data distributed across the web? Who are the key providers of bibliographic markup? RQ3 What are the frequent errors that can be observed?
  • 10. DATASET ● Web Data Commons (WDC) 2014 dataset ● Subset ⇒ all statements describing entities of type s:ScholarlyArticle or co-occurring on the same document with any s:ScholarlyArticle instance ○ 6,793,764 quads ○ 1,184,623 entities ○ 83 distinct classes ○ 429 distinct predicates
  • 11. DATASET - Considerations ● s:ScholarlyArticle is the only type which explicitly refers to scholarly articles ● We focus on schema.org, the most widely used schema ● Types considered ⇒ s:ScholarlyArticle, s:Person and s:Organization ○ 280,616 instances (s:ScholarlyArticle) ○ 847,417 instances (s:Person) ○ 3,798 instances (s:Organization)
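
The subset rule on the DATASET slide (keep every statement from any document that contains at least one s:ScholarlyArticle instance) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes the input is N-Quads with the page URL in the graph position, that types use the `http://schema.org/` namespace, and it uses a simplified regular expression rather than a full N-Quads parser. The function names and sample quads are hypothetical.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHOLARLY_ARTICLE = "<http://schema.org/ScholarlyArticle>"  # assumed namespace

# Simplified N-Quads pattern: subject, <predicate>, object, <graph> .
# A real parser is needed for full N-Quads coverage.
QUAD = re.compile(r'^(\S+)\s+<([^>]+)>\s+(.+?)\s+<([^>]+)>\s*\.\s*$')


def parse_quad(line):
    """Return (subject, predicate, object, graph) or None if the line does not match."""
    match = QUAD.match(line)
    return match.groups() if match else None


def select_scholarly_subset(lines):
    """Keep all quads from documents that contain a ScholarlyArticle instance."""
    quads = [q for q in map(parse_quad, lines) if q is not None]
    # Pass 1: documents (graphs) with at least one ScholarlyArticle typing.
    docs = {g for _, p, o, g in quads if p == RDF_TYPE and o == SCHOLARLY_ARTICLE}
    # Pass 2: every statement co-occurring on those documents.
    return [q for q in quads if q[3] in docs]


# Hypothetical sample input, for illustration only.
sample = [
    '_:a <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/ScholarlyArticle> <http://example.org/p1> .',
    '_:a <http://schema.org/name> "A title" <http://example.org/p1> .',
    '_:b <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> <http://example.org/p1> .',
    '_:c <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> <http://example.org/p2> .',
]
subset = select_scholarly_subset(sample)
```

In the sample, the s:Person on `p1` is kept because it shares a document with a scholarly article, while the s:Person on `p2` is dropped.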
  • 12. SCHOLARLY TYPES & PREDICATES (1/2) Cumulative dist. of predicates over instances across extracted types (plot annotations: 1 to 14, 1 to 9, 1 to 4)
  • 13. SCHOLARLY TYPES & PREDICATES (2/2) Top-10 Predicates for s:ScholarlyArticle
  • 14. DOMAINS & DOCUMENTS (1/5) Distribution of Entities & Statements across PLDs
  • 15. DOMAINS & DOCUMENTS (2/5) Top-10 PLDs (ranked by no. of entities)
  • 16. DOMAINS & DOCUMENTS (3/5) Distribution of Entities & Statements across TLDs
  • 17. DOMAINS & DOCUMENTS (4/5) Distribution of Entities & Statements across HTML Documents
  • 18. DOMAINS & DOCUMENTS (5/5) Top-10 Documents Ranked According to Embedded Entities
  • 19. TOPICS & PUBLICATION TYPES (1/4) Distribution of Scholarly Articles across Publishers
  • 20. TOPICS & PUBLICATION TYPES (2/4) Top-10 Publishers and corresponding no. of Publications
  • 21. TOPICS & PUBLICATION TYPES (3/4) Top-10 Publication Types (genres) across WDC
  • 22. TOPICS & PUBLICATION TYPES (4/4) Top-10 Article Titles (ranked by frequency of occurrence)
  • 23. FREQUENT ERRORS - Schema Violations Top-10 Misused Predicates
  • 24. CONCLUSIONS (1/2) ● First study on coverage & char. of bibliographic metadata embedded in web pages. ● Early adopters ⇒ publishers, libraries, other providers of bibliographic data. ● Usage of terms, types ⇒ dist. across providers, domains and topics follows a power law; few providers & documents contributing to majority of data.
  • 25. CONCLUSIONS (2/2) ● Top-k genres & publishers indicate a bias towards French, English data providers. ● Article titles, PLDs & publishers ⇒ bias towards Computer Science and Life Sciences. ● In this study we only consider entities tagged explicitly as "scholarlyArticle"; a deeper analysis considering more types (article, book, etc.) and other creative works can shed light on the true scale and potential of embedded markup data.
  • 26. FUTURE WORK ● Targeted crawl of typical providers of scholarly data (publishers, academic orgs., libraries, etc.) ● Consider implicitly typed bibliographic or creative work as scholarly data
  • 28. LIMITATIONS ● Our study is limited to schema.org & the types of s:ScholarlyArticle, s:Person, s:Organization. ● We consider only explicitly linked scholarly works.
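
The concentration noted in the conclusions (few providers contributing the majority of data) could be quantified with a top-k share, as in the hedged sketch below. The helper names and the toy counts are hypothetical and are not figures from the study. Note that `tld_of` only takes the last host label; computing pay-level domains (PLDs) properly would require a public suffix list, which is not attempted here.

```python
from collections import Counter
from urllib.parse import urlparse


def tld_of(url):
    """Return the last label of the URL's host name (a rough TLD, not a PLD)."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else host


def count_by_tld(document_urls):
    """Count documents per (rough) top-level domain."""
    return Counter(tld_of(url) for url in document_urls)


def top_k_share(counts, k):
    """Fraction of the total contributed by the k largest entries."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(n for _, n in Counter(counts).most_common(k))
    return top / total


# Toy entity counts per provider, for illustration only.
toy_counts = {"a.example": 900, "b.example": 60, "c.example": 30, "d.example": 10}
share_top1 = top_k_share(toy_counts, 1)

# Toy document URLs, for illustration only.
toy_urls = ["http://www.example.fr/a", "http://example.fr/b", "http://example.com/c"]
tld_counts = count_by_tld(toy_urls)
```

A share close to 1 for small k indicates a heavily skewed distribution of the kind the slides describe; it does not by itself establish a power law.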