Analysing and Improving embedded Markup of
Learning Resources on the Web
Stefan Dietze, Davide Taibi, Ran Yu, Phil Barker, Mathieu d’Aquin
- WWW2017, Digital Learning Track -
05/04/17, Stefan Dietze
Open Data & Linked Data
Structured data about learning resources on the Web?
Resource metadata
• Standards: LOM, ADL SCORM, IMS LD, etc.
• Repositories: Open Courseware, Merlot, ARIADNE, etc.
Educational(ly relevant) linked data
• Vocabularies: BIBO, LOM/RDF, mEducator, etc.
• Datasets: e.g. LinkedUp Catalog (approx. 50 M resources)
  http://data.linkededucation.org/linkedup/catalog/
Web: approx. 46,000,000,000,000 (46 trillion) Web pages indexed by Google
• Embedded markup (RDFa, Microdata, Microformats) for interpretation of Web documents (search, retrieval)
• schema.org vocabulary used at scale (700 classes, 1000 predicates) and supported by Yahoo, Yandex, Bing, Google
• Adoption on the Web (2016):
  o 38% out of 3.2 bn pages
  o 44 bn statements/quads
  (see "Web Data Commons"; see Meusel & Paulheim [ISWC2014])
• Same order of magnitude as "the Web" (scale, dynamics)
Embedded markup data & schema.org
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
...
</div>
RDF statements
node1 actor _node-x
node1 actor Robin Wright
node1 genre Comedy
node2 actor T. Hanks
node2 distributed by Paramount Pic.
node3 actor Tom Cruise
node3 distributed by Paramount Pic.
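As a rough illustration of how a Microdata snippet like the Forrest Gump example turns into statements, the following Python sketch walks the HTML with the standard-library html.parser module and emits (subject, predicate, object) triples. The names MicrodataExtractor and extract_statements and the node1-style identifiers are hypothetical choices for this example. It only handles flat, non-nested items and is not the extractor used by Web Data Commons.

```python
from html.parser import HTMLParser


class MicrodataExtractor(HTMLParser):
    """Collect (subject, predicate, object) statements from flat Microdata.

    Simplification (assumption): items are not nested, and every itemprop
    element contains plain text only.
    """

    def __init__(self):
        super().__init__()
        self.statements = []   # extracted triples, in document order
        self._item_count = 0   # used to mint node identifiers
        self._subject = None   # identifier of the item currently open
        self._prop = None      # itemprop whose text is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # A new item starts: mint an identifier and record its type.
            self._item_count += 1
            self._subject = f"node{self._item_count}"
            if attrs.get("itemtype"):
                self.statements.append((self._subject, "type", attrs["itemtype"]))
        elif "itemprop" in attrs and self._subject:
            self._prop = attrs["itemprop"]
            self._text = []

    def handle_data(self, data):
        if self._prop:
            self._text.append(data)

    def handle_endtag(self, tag):
        # With flat, text-only properties the first closing tag seen while a
        # property is open is the property's own closing tag.
        if self._prop:
            value = "".join(self._text).strip()
            self.statements.append((self._subject, self._prop, value))
            self._prop = None


def extract_statements(html):
    parser = MicrodataExtractor()
    parser.feed(html)
    parser.close()
    return parser.statements


EXAMPLE = """
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
</div>
"""

if __name__ == "__main__":
    for statement in extract_statements(EXAMPLE):
        print(statement)
```

Note that all property values come out as plain strings, which is typical for markup of this kind.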
• schema.org extension providing vocabulary for annotation of learning resources
• Association of resources (s:CreativeWork, e.g. books, videos, etc.) with learning-related attributes (typical age, learning resource type, educational frameworks, etc.)
• Dublin Core Metadata Initiative task force on LRMI
Learning Resources Metadata Initiative (LRMI)
http://lrmi.dublincore.net/
Learning Resources Metadata Initiative: research questions
How is LRMI actually being used on the Web?
• RQ1) Adoption of LRMI terms/patterns and its evolution?
• RQ2) Distribution across the Web?
• RQ3) Quality (and how to improve/cleanse/interpret)?
Why is it important?
• Enable data reuse (KB construction, recommenders, search)
• Inform vocabulary design (LRMI, schema.org)
                2013                  2014                  2015
Documents (CC)  2,224,829,946         2,014,175,679         1,770,525,212
URLs (WDC)      585,792,337 (26.3%)   620,151,400 (30.7%)   541,514,775 (30.5%)
Quads (WDC)     17,241,313,916        20,484,755,485        24,377,132,352
URLs (LRMI)     83,791                430,861               779,260
URLs (LRMI’)    84,098                430,895               929,573
Quads (LRMI)    9,245,793             26,256,833            44,108,511
Quads (LRMI’)   9,251,553             26,258,524            69,932,849
• CC: Common Crawl, 2013-2015 (http://commoncrawl.org)
• WDC: Web Data Commons, 2013-2015: statements/quads extracted from CC (http://webdatacommons.org)
• LRMI: all quads extracted from WDC/CC which include or co-occur with an LRMI term (according to the LRMI spec)
• LRMI’: extracted from WDC/CC as above, but also considering "common errors" [Meusel et al. 2015]
Data extraction
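The LRMI selection rule ("include or co-occur with an LRMI term") can be sketched in Python as follows. Two assumptions are made here that the slides do not confirm: a quad is modelled as a (subject, predicate, object, document_url) tuple, and "co-occur" is read as "appears in the same document". The term set is a small illustrative subset of LRMI terms, not the list used in the study, and select_lrmi_quads is a hypothetical helper name.

```python
SCHEMA = "http://schema.org/"

# Illustrative subset of LRMI terms (assumption: not the study's full list).
LRMI_TERMS = {
    SCHEMA + name
    for name in (
        "learningResourceType",
        "typicalAgeRange",
        "educationalAlignment",
        "AlignmentObject",
    )
}


def select_lrmi_quads(quads):
    """Return all quads from documents that contain at least one LRMI term.

    Each quad is a (subject, predicate, object, document_url) tuple.
    """
    quads = list(quads)
    # Pass 1: find documents where an LRMI term occurs as predicate or object.
    lrmi_docs = {
        doc for (_subj, pred, obj, doc) in quads
        if pred in LRMI_TERMS or obj in LRMI_TERMS
    }
    # Pass 2: keep every quad of those documents, including non-LRMI ones.
    return [quad for quad in quads if quad[3] in lrmi_docs]
```

On a real crawl the first pass would be streamed rather than held in memory; the two-pass structure stays the same.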
• Power-law distribution across approx. 300 PLDs and 4000 subdomains (2015)
• Top 10% of contributors provide 98.4% of all quads (2015)
LRMI distribution across pay-level-domains (PLDs)
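A concentration figure such as "top 10% of contributors provide 98.4% of all quads" can be computed from per-PLD quad counts as in the sketch below. The function name top_share and the input shape (a dict from PLD to quad count) are assumptions for illustration, and the sample counts in the usage note are invented, not data from the study.

```python
def top_share(quads_per_pld, top_percent=10):
    """Fraction of all quads contributed by the top `top_percent` % of PLDs.

    `quads_per_pld` maps a pay-level domain to its number of quads.
    """
    counts = sorted(quads_per_pld.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Integer ceiling division avoids floating-point surprises; always
    # consider at least one PLD.
    k = max(1, -(-len(counts) * top_percent // 100))
    return sum(counts[:k]) / total
```

For example, with ten hypothetical PLDs where one contributes 910 quads and the other nine contribute 10 each, the top 10% (one PLD) account for 91% of the quads.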
Markup quality (1/2): addressing schema misuse
sunriseseniorliving.com
7xxxtube.com
1amateurporntube.com
virtualpornstars.com
simplyfinance.co.uk
menslifestyles.com
audiobooks.com
simplypsychology.org
helles-koepfchen.de
Clustering/classification of unintended uses of LRMI terms?
• Domain blacklist: recall 96%; roughly 10% of PLDs (0.5% of documents) affected
• Clustering of PLDs/resource types (XMeans)
• Variety of features, in particular related to term adoption
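To make "features related to term adoption" concrete, the sketch below builds a relative term-frequency vector per PLD and compares two PLDs by cosine similarity. This is only one plausible feature representation; the XMeans clustering step itself is not shown, and the function names and example domains are hypothetical.

```python
import math
from collections import Counter, defaultdict


def term_distributions(observations):
    """Map each PLD to the relative frequency of the terms it uses.

    `observations` is an iterable of (pld, term) pairs, one per quad.
    """
    counts = defaultdict(Counter)
    for pld, term in observations:
        counts[pld][term] += 1
    return {
        pld: {term: n / sum(c.values()) for term, n in c.items()}
        for pld, c in counts.items()
    }


def cosine_similarity(a, b):
    """Cosine similarity of two sparse term-frequency vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


# Invented example data: two education-like PLDs and one unrelated PLD.
OBSERVATIONS = [
    ("courses.example", "learningResourceType"),
    ("courses.example", "typicalAgeRange"),
    ("courses.example", "name"),
    ("lectures.example", "learningResourceType"),
    ("lectures.example", "name"),
    ("videos.example", "offer"),
    ("videos.example", "headline"),
]
```

PLDs that use similar terms end up close together under this measure, while a PLD with no terms in common scores 0, which is the kind of signal a clustering step could use to separate intended from unintended uses.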
Unintended schema use: term distribution as clustering feature?
[Figure: term co-occurrence within markup from top-ranked PLDs ("learning resources in the LRMI sense")]
[Figure: term co-occurrence within markup from filtered adult content PLDs]
Rank  Year  Type                  # Quads  # PLDs
1     2013  EducationalEvent      6004     1
1     2014  EducationalEvent      3047     1
1     2015  offer                 100516   1
2     2013  UserComment           20       1
2     2014  Therapist             25       1
2     2015  headline              6724     1
3     2013  CompetencyObject      4        1
3     2014  UserComment           23       1
3     2015  URL                   693      1
4     2013  Webpage               2        1
4     2014  learningResourceType  21       1
4     2015  webpage               360      1
5     2013  about                 1        1
5     2014  EducationalEvent      19       1
5     2015  musicrecording        296      1
• Heuristics for fixing frequent errors (see Meusel et al., ESWC2015)
  o Wrong namespaces (e.g. "htp:/schema.org"): 501,530 quads in 2015
  o Undefined types and properties: 1,172,893 quads in 2015
  o Object properties misused as datatype properties: 10,288,717 quads in 2015
• Errors fixed in most PLDs and documents
• But: lower error rate in LRMI corpus than in markup in general (WDC)
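Two of the error classes above, wrong namespaces and undefined types caused by wrong capitalisation, lend themselves to simple string heuristics. The Python sketch below is a simplified illustration in that spirit, not the implementation from Meusel et al. The regular expression, the tiny KNOWN_TERMS list and the helper name fix_term are assumptions, and http://schema.org/ is used as the canonical namespace as on the slides.

```python
import re

SCHEMA = "http://schema.org/"

# Loose pattern for common misspellings of the schema.org namespace, e.g.
# "htp:/schema.org", "https://www.schema.org/" or a bare "schema.org/".
NAMESPACE_RE = re.compile(
    r"^(?:h?t{1,2}ps?:/{0,3})?(?:www\.)?schema\.org/?",
    re.IGNORECASE,
)

# Tiny illustrative vocabulary (assumption); a real cleaner would load the
# full list of schema.org types and properties.
KNOWN_TERMS = {"WebPage", "MusicRecording", "Offer", "CreativeWork", "name"}
_BY_LOWERCASE = {term.lower(): term for term in KNOWN_TERMS}


def fix_term(uri):
    """Normalise the namespace and the capitalisation of a schema.org term."""
    uri = NAMESPACE_RE.sub(SCHEMA, uri.strip(), count=1)
    if uri.startswith(SCHEMA):
        local = uri[len(SCHEMA):]
        # Map e.g. "webpage" onto the defined term "WebPage" when known.
        local = _BY_LOWERCASE.get(local.lower(), local)
        uri = SCHEMA + local
    return uri
```

For instance, fix_term("htp:/schema.org/webpage") yields "http://schema.org/WebPage", while URIs from other namespaces pass through unchanged. Terms that match nothing in the vocabulary are left as they are, so genuinely undefined types stay visible for later inspection.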
Markup quality (2/2): heuristics for fixing frequent errors
Top-5 undefined types
“Strings, not things”
• Numbers from 2015:
  o 46 million "transversal" quads (i.e. non-hierarchical statements)
  o 64% datatype properties, yet 97% refer to literals (up from 70% in 2013)
• Issues
  o Lack of links and controlled vocabularies
  o Data reuse requires identity resolution
Fixed quads/documents/PLDs
          2013              2014                2015
# quads   520,815 (5.63%)   1,601,796 (6.10%)   6,179,097 (8.84%)
# docs    46,382 (55.15%)   369,772 (85.81%)    754,863 (81.21%)
# PLDs    75 (75.76%)       154 (67.54%)        291 (77.39%)
Key findings & implications
I. Significant growth, but biased term adoption.
• Growing adoption: 138 M statements in 2016, up from 48 M in 2015 (observable even in a general-purpose crawl such as CC)
• Bias towards simple datatype & generic properties
• Implications for data consumption & identity resolution
II. Power-law distribution of LRMI markup.
• Top 10% of contributors provide 98.4% of quads (2015)
• Efficient crawling/extraction of LRMI-specific data (e.g. for building an index or recommender) => focused crawling of most probable data providers
III. Frequent errors.
• Vast amounts of erroneous statements (80% of PLDs in 2015), yet fewer than in markup in general
• Steady increase (total and relative) of errors
• Need for data cleansing & fixing: heuristics and frequency-based approaches (e.g. erroneous terms usually occur in few PLDs only)
IV. Unintended use of vocabulary terms.
• Terms applied in a variety of contexts (e.g. adult content)
• Not necessarily a schema violation
• But: need for further processing (e.g. clustering/classification) when interpreting/using LRMI
Consumption, reuse & fusion of markup data
• Clustering for data cleansing and categorisation (features: e.g. term distribution, page-rank, etc.)
• Supervised data fusion for entity matching and fact verification: related work [ICDE2017, SWJ2017]
• Augmenting knowledge bases
Vocabulary design
• Feed findings into DCMI task force on LRMI
• Bootstrap patterns and terms (from actual usage)?
• Wider schema.org question: reflecting lack of acceptance of object-object relationships in vocabularies?
Future work
Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM: Query-Centric Data Fusion on Structured Web Markup, ICDE2017.
Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2017, under review.
Contact, data & stats
Data
http://lrmi.itd.cnr.it/
Contact
@stefandietze | http://stefandietze.net

More Related Content

What's hot

Linked Data at the Open University: From Technical Challenges to Organization...
Linked Data at the Open University: From Technical Challenges to Organization...Linked Data at the Open University: From Technical Challenges to Organization...
Linked Data at the Open University: From Technical Challenges to Organization...Mathieu d'Aquin
 
Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Extracting Relevant Questions to an RDF Dataset Using Formal Concept AnalysisExtracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Extracting Relevant Questions to an RDF Dataset Using Formal Concept AnalysisMathieu d'Aquin
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge GraphsJeff Z. Pan
 
Doing Clever Things with the Semantic Web
Doing Clever Things with the Semantic WebDoing Clever Things with the Semantic Web
Doing Clever Things with the Semantic WebMathieu d'Aquin
 
LUCERO - Building the Open University Web of Linked Data
LUCERO - Building the Open University Web of Linked DataLUCERO - Building the Open University Web of Linked Data
LUCERO - Building the Open University Web of Linked DataMathieu d'Aquin
 
Software Sustainability: Better Software Better Science
Software Sustainability: Better Software Better ScienceSoftware Sustainability: Better Software Better Science
Software Sustainability: Better Software Better ScienceCarole Goble
 
DataCite: the Perfect Complement to CrossRef
DataCite: the Perfect Complement to CrossRefDataCite: the Perfect Complement to CrossRef
DataCite: the Perfect Complement to CrossRefCrossref
 
Semantic Web / Linked Data Technologies
Semantic Web / Linked Data TechnologiesSemantic Web / Linked Data Technologies
Semantic Web / Linked Data TechnologiesMathieu d'Aquin
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked dataLaura Po
 
Exploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesExploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesLaura Po
 
Research Knowledge Graphs at GESIS & NFDI4DataScience
Research Knowledge Graphs at GESIS & NFDI4DataScienceResearch Knowledge Graphs at GESIS & NFDI4DataScience
Research Knowledge Graphs at GESIS & NFDI4DataScienceStefan Dietze
 
Data Management for Mountain Observatories Workshop
Data Management for Mountain Observatories WorkshopData Management for Mountain Observatories Workshop
Data Management for Mountain Observatories WorkshopCarly Strasser
 
Web Data Management in the RDF Age
Web Data Management in the RDF AgeWeb Data Management in the RDF Age
Web Data Management in the RDF AgeM. Tamer Özsu
 
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...Stefan Dietze
 
Make our Scientific Datasets Accessible and Interoperable on the Web
Make our Scientific Datasets Accessible and Interoperable on the WebMake our Scientific Datasets Accessible and Interoperable on the Web
Make our Scientific Datasets Accessible and Interoperable on the WebFranck Michel
 
Trustworthy AI and Open Science
Trustworthy AI and Open ScienceTrustworthy AI and Open Science
Trustworthy AI and Open ScienceBeth Plale
 
ESWC2015 opening ceremony
ESWC2015 opening ceremonyESWC2015 opening ceremony
ESWC2015 opening ceremonyFabien Gandon
 
It19 20140721 linked data personal perspective
It19 20140721 linked data personal perspectiveIt19 20140721 linked data personal perspective
It19 20140721 linked data personal perspectiveJanifer Gatenby
 

What's hot (20)

Linked Data at the Open University: From Technical Challenges to Organization...
Linked Data at the Open University: From Technical Challenges to Organization...Linked Data at the Open University: From Technical Challenges to Organization...
Linked Data at the Open University: From Technical Challenges to Organization...
 
Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Extracting Relevant Questions to an RDF Dataset Using Formal Concept AnalysisExtracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge Graphs
 
Doing Clever Things with the Semantic Web
Doing Clever Things with the Semantic WebDoing Clever Things with the Semantic Web
Doing Clever Things with the Semantic Web
 
LUCERO - Building the Open University Web of Linked Data
LUCERO - Building the Open University Web of Linked DataLUCERO - Building the Open University Web of Linked Data
LUCERO - Building the Open University Web of Linked Data
 
Software Sustainability: Better Software Better Science
Software Sustainability: Better Software Better ScienceSoftware Sustainability: Better Software Better Science
Software Sustainability: Better Software Better Science
 
DataCite: the Perfect Complement to CrossRef
DataCite: the Perfect Complement to CrossRefDataCite: the Perfect Complement to CrossRef
DataCite: the Perfect Complement to CrossRef
 
Semantic Web / Linked Data Technologies
Semantic Web / Linked Data TechnologiesSemantic Web / Linked Data Technologies
Semantic Web / Linked Data Technologies
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked data
 
Exploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesExploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sources
 
Research Knowledge Graphs at GESIS & NFDI4DataScience
Research Knowledge Graphs at GESIS & NFDI4DataScienceResearch Knowledge Graphs at GESIS & NFDI4DataScience
Research Knowledge Graphs at GESIS & NFDI4DataScience
 
Data Management for Mountain Observatories Workshop
Data Management for Mountain Observatories WorkshopData Management for Mountain Observatories Workshop
Data Management for Mountain Observatories Workshop
 
NISO/DCMI Webinar: Schema.org and Linked Data: Complementary Approaches to Pu...
NISO/DCMI Webinar: Schema.org and Linked Data: Complementary Approaches to Pu...NISO/DCMI Webinar: Schema.org and Linked Data: Complementary Approaches to Pu...
NISO/DCMI Webinar: Schema.org and Linked Data: Complementary Approaches to Pu...
 
Web Data Management in the RDF Age
Web Data Management in the RDF AgeWeb Data Management in the RDF Age
Web Data Management in the RDF Age
 
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
 
Make our Scientific Datasets Accessible and Interoperable on the Web
Make our Scientific Datasets Accessible and Interoperable on the WebMake our Scientific Datasets Accessible and Interoperable on the Web
Make our Scientific Datasets Accessible and Interoperable on the Web
 
Trustworthy AI and Open Science
Trustworthy AI and Open ScienceTrustworthy AI and Open Science
Trustworthy AI and Open Science
 
ESWC2015 opening ceremony
ESWC2015 opening ceremonyESWC2015 opening ceremony
ESWC2015 opening ceremony
 
It19 20140721 linked data personal perspective
It19 20140721 linked data personal perspectiveIt19 20140721 linked data personal perspective
It19 20140721 linked data personal perspective
 
Alamw15 VIVO
Alamw15 VIVOAlamw15 VIVO
Alamw15 VIVO
 

Similar to Analysing & Improving Learning Resources Markup on the Web

Evaluating Taxonomies
Evaluating TaxonomiesEvaluating Taxonomies
Evaluating TaxonomiesJoseph Busch
 
Opening up MOOCs for OER management on the Web of linked data
Opening up MOOCs for OER management on the Web of linked dataOpening up MOOCs for OER management on the Web of linked data
Opening up MOOCs for OER management on the Web of linked dataGilbert Paquette
 
Meeting the NSF DMP Requirement: March 7, 2012
Meeting the NSF DMP Requirement: March 7, 2012Meeting the NSF DMP Requirement: March 7, 2012
Meeting the NSF DMP Requirement: March 7, 2012IUPUI
 
Faceted Navigation (LACASIS Fall Workshop 2005)
Faceted Navigation (LACASIS Fall Workshop 2005)Faceted Navigation (LACASIS Fall Workshop 2005)
Faceted Navigation (LACASIS Fall Workshop 2005)Bradley Allen
 
ACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP ProjectACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP ProjectStuart Chalk
 
Missing pieces in_the_global_metadata_landscap
Missing pieces in_the_global_metadata_landscapMissing pieces in_the_global_metadata_landscap
Missing pieces in_the_global_metadata_landscapStuart Weibel
 
Metadata issues and challenges: Link Data
Metadata issues and challenges: Link DataMetadata issues and challenges: Link Data
Metadata issues and challenges: Link DataAmna Farzand Ali
 
IWMW 2002: The Value of Metadata and How to Realise It
IWMW 2002: The Value of Metadata and How to Realise ItIWMW 2002: The Value of Metadata and How to Realise It
IWMW 2002: The Value of Metadata and How to Realise ItIWMW
 
FAIR, standards and FAIRsharing - MAQC Society 2019
FAIR, standards and FAIRsharing - MAQC Society 2019FAIR, standards and FAIRsharing - MAQC Society 2019
FAIR, standards and FAIRsharing - MAQC Society 2019Susanna-Assunta Sansone
 
Perception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringPerception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringIRJET Journal
 
Role of metadata in transportation agency data programs
Role of metadata in transportation agency data programsRole of metadata in transportation agency data programs
Role of metadata in transportation agency data programsJoseph Busch
 
RDA for Original Catalogers
RDA for Original CatalogersRDA for Original Catalogers
RDA for Original CatalogersShana McDanold
 
Web 3.0 / Semantic Web: What it means for academic users, libraries and publi...
Web 3.0 / Semantic Web: What it means for academic users, libraries and publi...Web 3.0 / Semantic Web: What it means for academic users, libraries and publi...
Web 3.0 / Semantic Web: What it means for academic users, libraries and publi...Richard Wallis
 
Developing Linked Data and Semantic Web-based Applications (Expotec 2015)
Developing Linked Data and Semantic Web-based Applications (Expotec 2015)Developing Linked Data and Semantic Web-based Applications (Expotec 2015)
Developing Linked Data and Semantic Web-based Applications (Expotec 2015)Ig Bittencourt
 
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series DataSharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series DataStuart Chalk
 
How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?andrea huang
 
Linking Open Government Data at Scale
Linking Open Government Data at Scale Linking Open Government Data at Scale
Linking Open Government Data at Scale Bernadette Hyland-Wood
 
A Data Citation Roadmap for Scholarly Data Repositories
A Data Citation Roadmap for Scholarly Data RepositoriesA Data Citation Roadmap for Scholarly Data Repositories
A Data Citation Roadmap for Scholarly Data RepositoriesLIBER Europe
 

Similar to Analysing & Improving Learning Resources Markup on the Web (20)

Evaluating Taxonomies
Evaluating TaxonomiesEvaluating Taxonomies
Evaluating Taxonomies
 
Opening up MOOCs for OER management on the Web of linked data
Opening up MOOCs for OER management on the Web of linked dataOpening up MOOCs for OER management on the Web of linked data
Opening up MOOCs for OER management on the Web of linked data
 
Metadata : Concentrating on the data, not on the scheme
Metadata : Concentrating on the data, not on the schemeMetadata : Concentrating on the data, not on the scheme
Metadata : Concentrating on the data, not on the scheme
 
Meeting the NSF DMP Requirement: March 7, 2012
Meeting the NSF DMP Requirement: March 7, 2012Meeting the NSF DMP Requirement: March 7, 2012
Meeting the NSF DMP Requirement: March 7, 2012
 
Web Information Systems Introduction and Origin of World Wide Web
Web Information Systems Introduction and Origin of World Wide WebWeb Information Systems Introduction and Origin of World Wide Web
Web Information Systems Introduction and Origin of World Wide Web
 
Faceted Navigation (LACASIS Fall Workshop 2005)
Faceted Navigation (LACASIS Fall Workshop 2005)Faceted Navigation (LACASIS Fall Workshop 2005)
Faceted Navigation (LACASIS Fall Workshop 2005)
 
ACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP ProjectACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP Project
 
Missing pieces in_the_global_metadata_landscap
Missing pieces in_the_global_metadata_landscapMissing pieces in_the_global_metadata_landscap
Missing pieces in_the_global_metadata_landscap
 
Metadata issues and challenges: Link Data
Metadata issues and challenges: Link DataMetadata issues and challenges: Link Data
Metadata issues and challenges: Link Data
 
IWMW 2002: The Value of Metadata and How to Realise It
IWMW 2002: The Value of Metadata and How to Realise ItIWMW 2002: The Value of Metadata and How to Realise It
IWMW 2002: The Value of Metadata and How to Realise It
 
FAIR, standards and FAIRsharing - MAQC Society 2019
FAIR, standards and FAIRsharing - MAQC Society 2019FAIR, standards and FAIRsharing - MAQC Society 2019
FAIR, standards and FAIRsharing - MAQC Society 2019
 
Perception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringPerception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document Clustering
 
Role of metadata in transportation agency data programs
Role of metadata in transportation agency data programsRole of metadata in transportation agency data programs
Role of metadata in transportation agency data programs
 
RDA for Original Catalogers
RDA for Original CatalogersRDA for Original Catalogers
RDA for Original Catalogers
 
Web 3.0 / Semantic Web: What it means for academic users, libraries and publi...
Web 3.0 / Semantic Web: What it means for academic users, libraries and publi...Web 3.0 / Semantic Web: What it means for academic users, libraries and publi...
Web 3.0 / Semantic Web: What it means for academic users, libraries and publi...
 
Developing Linked Data and Semantic Web-based Applications (Expotec 2015)
Developing Linked Data and Semantic Web-based Applications (Expotec 2015)Developing Linked Data and Semantic Web-based Applications (Expotec 2015)
Developing Linked Data and Semantic Web-based Applications (Expotec 2015)
 
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series DataSharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
 
How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?
 
Linking Open Government Data at Scale
Linking Open Government Data at Scale Linking Open Government Data at Scale
Linking Open Government Data at Scale
 
A Data Citation Roadmap for Scholarly Data Repositories
A Data Citation Roadmap for Scholarly Data RepositoriesA Data Citation Roadmap for Scholarly Data Repositories
A Data Citation Roadmap for Scholarly Data Repositories
 

More from Stefan Dietze

AI in between online and offline discourse - and what has ChatGPT to do with ...
AI in between online and offline discourse - and what has ChatGPT to do with ...AI in between online and offline discourse - and what has ChatGPT to do with ...
AI in between online and offline discourse - and what has ChatGPT to do with ...Stefan Dietze
 
An interdisciplinary journey with the SAL spaceship – results and challenges ...
An interdisciplinary journey with the SAL spaceship – results and challenges ...An interdisciplinary journey with the SAL spaceship – results and challenges ...
An interdisciplinary journey with the SAL spaceship – results and challenges ...Stefan Dietze
 
Research Knowledge Graphs at NFDI4DS & GESIS
Research Knowledge Graphs at NFDI4DS & GESISResearch Knowledge Graphs at NFDI4DS & GESIS
Research Knowledge Graphs at NFDI4DS & GESISStefan Dietze
 
Human-in-the-Loop: das Web als Grundlage interdisziplinärer Data Science Meth...
Human-in-the-Loop: das Web als Grundlage interdisziplinärer Data Science Meth...Human-in-the-Loop: das Web als Grundlage interdisziplinärer Data Science Meth...
Human-in-the-Loop: das Web als Grundlage interdisziplinärer Data Science Meth...Stefan Dietze
 
Towards research data knowledge graphs
Towards research data knowledge graphsTowards research data knowledge graphs
Towards research data knowledge graphsStefan Dietze
 
Beyond research data infrastructures: exploiting artificial & crowd intellige...
Beyond research data infrastructures: exploiting artificial & crowd intellige...Beyond research data infrastructures: exploiting artificial & crowd intellige...
Beyond research data infrastructures: exploiting artificial & crowd intellige...Stefan Dietze
 
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...Stefan Dietze
 
Using AI to understand everyday learning on the Web
Using AI to understand everyday learning on the WebUsing AI to understand everyday learning on the Web
Using AI to understand everyday learning on the WebStefan Dietze
 
Analysing User Knowledge, Competence and Learning during Online Activities
Analysing User Knowledge, Competence and Learning during Online ActivitiesAnalysing User Knowledge, Competence and Learning during Online Activities
Analysing User Knowledge, Competence and Learning during Online ActivitiesStefan Dietze
 
Big Data in Learning Analytics - Analytics for Everyday Learning
Big Data in Learning Analytics - Analytics for Everyday LearningBig Data in Learning Analytics - Analytics for Everyday Learning
Big Data in Learning Analytics - Analytics for Everyday LearningStefan Dietze
 
Mining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebMining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebStefan Dietze
 
Towards embedded Markup of Learning Resources on the Web
Towards embedded Markup of Learning Resources on the WebTowards embedded Markup of Learning Resources on the Web
Towards embedded Markup of Learning Resources on the WebStefan Dietze
 
Semantic Linking & Retrieval for Digital Libraries
Semantic Linking & Retrieval for Digital LibrariesSemantic Linking & Retrieval for Digital Libraries
Semantic Linking & Retrieval for Digital LibrariesStefan Dietze
 
Linked Data for Architecture, Engineering and Construction (AEC)
Linked Data for Architecture, Engineering and Construction (AEC)Linked Data for Architecture, Engineering and Construction (AEC)
Linked Data for Architecture, Engineering and Construction (AEC)Stefan Dietze
 
Dietze linked data-vr-es
Dietze linked data-vr-esDietze linked data-vr-es
Dietze linked data-vr-esStefan Dietze
 
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat...Stefan Dietze
 
Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)Stefan Dietze
 
From Data to Knowledge - Profiling & Interlinking Web Datasets
From Data to Knowledge - Profiling & Interlinking Web DatasetsFrom Data to Knowledge - Profiling & Interlinking Web Datasets
From Data to Knowledge - Profiling & Interlinking Web DatasetsStefan Dietze
 
WWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
WWW2014 Tutorial: Online Learning & Linked Data - Lessons LearnedWWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
WWW2014 Tutorial: Online Learning & Linked Data - Lessons LearnedStefan Dietze
 
What's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsWhat's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsStefan Dietze
 

Analysing & Improving Learning Resources Markup on the Web

  • 1. Analysing and Improving Embedded Markup of Learning Resources on the Web
      Stefan Dietze, Davide Taibi, Ran Yu, Phil Barker, Mathieu d’Aquin
      WWW2017, Digital Learning Track, 05/04/17
  • 2. Structured data about learning resources on the Web?
      Open Data & Linked Data
      Resource metadata
        • Standards: LOM, ADL SCORM, IMS LD etc.
        • Repositories: Open Courseware, Merlot, ARIADNE etc.
      Educational(ly relevant) linked data
        • Vocabularies: BIBO, LOM/RDF, mEducator etc.
        • Datasets: e.g. LinkedUp Catalog (approx. 50 M resources), http://data.linkededucation.org/linkedup/catalog/
  • 3. Structured data about learning resources on the Web?
      Web: approx. 46,000,000,000,000 (46 trillion) Web pages indexed by Google
      Open Data & Linked Data
      Resource metadata
        • Standards: LOM, ADL SCORM, IMS LD etc.
        • Repositories: Open Courseware, Merlot, ARIADNE etc.
      Educational(ly relevant) linked data
        • Vocabularies: BIBO, LOM/RDF, mEducator etc.
        • Datasets: e.g. LinkedUp Catalog (approx. 50 M resources)
  • 4. Embedded markup data & schema.org
        • Embedded markup (RDFa, Microdata, Microformats) for interpretation of Web documents (search, retrieval)
        • schema.org vocabulary used at scale (700 classes, 1000 predicates) and supported by Yahoo, Yandex, Bing, Google
        • Adoption on the Web (2016):
            o 38% of 3.2 bn pages
            o 44 bn statements/quads
            (see "Web Data Commons"; see Meusel & Paulheim [ISWC2014])
        • Same order of magnitude as "the Web" (scale, dynamics)
      Example (Microdata):
        <div itemscope itemtype="http://schema.org/Movie">
          <h1 itemprop="name">Forrest Gump</h1>
          <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
          <span itemprop="genre">Drama</span>
          ...
        </div>
      RDF statements:
        node1  actor           _node-x
        node1  actor           Robin Wright
        node1  genre           Comedy
        node2  actor           T. Hanks
        node2  distributed by  Paramount Pic.
        node3  actor           Tom Cruise
        node3  distributed by  Paramount Pic.
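The step from the Microdata snippet on slide 4 to flat statements can be sketched in a few lines of Python. This is only an illustration using the standard-library `html.parser`; the class name `MicrodataSketch` and the helper `extract` are hypothetical, and the sketch handles a single, non-nested `itemscope` rather than the full Microdata extraction algorithm used by tools such as Web Data Commons.

```python
from html.parser import HTMLParser


class MicrodataSketch(HTMLParser):
    """Collect (item type, property, text) triples from one flat itemscope."""

    def __init__(self):
        super().__init__()
        self.item_type = None      # value of itemtype on the itemscope element
        self.current_prop = None   # itemprop waiting for its text value
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.item_type = attrs.get("itemtype")
        if "itemprop" in attrs:
            self.current_prop = attrs["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        # Only text that directly follows an itemprop element becomes a value.
        if self.current_prop and text:
            self.triples.append((self.item_type, self.current_prop, text))
            self.current_prop = None


HTML = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Forrest Gump</h1>
  <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
  <span itemprop="genre">Drama</span>
</div>
"""


def extract(html):
    """Return the list of triples found in the given HTML string."""
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    return parser.triples
```

Calling `extract(HTML)` yields one triple per `itemprop`, each carrying the item type and a plain string value. Those string values are exactly the "strings, not things" issue raised later in the slides.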
  • 5. Learning Resources Metadata Initiative (LRMI)
        • schema.org extension providing vocabulary for annotation of learning resources
        • Association of resources (s:CreativeWork, e.g. books, videos etc.) with learning-related attributes (typical age, learning resource type, educational frameworks etc.)
        • Dublin Core Metadata Initiative task force on LRMI
      http://lrmi.dublincore.net/
  • 6. Learning Resources Metadata Initiative: research questions
      How is LRMI actually being used on the Web?
        • RQ1) Adoption of LRMI terms / patterns and its evolution?
        • RQ2) Distribution across the Web?
        • RQ3) Quality (and how to improve/cleanse/interpret)?
      Why is it important?
        • Enable data reuse (KB construction, recommenders, search)
        • Inform vocabulary design (LRMI, schema.org)
  • 7. Data extraction
                        2013                  2014                  2015
      Documents (CC)    2,224,829,946         2,014,175,679         1,770,525,212
      URLs (WDC)        585,792,337 (26.3%)   620,151,400 (30.7%)   541,514,775 (30.5%)
      Quads (WDC)       17,241,313,916        20,484,755,485        24,377,132,352
      URLs (LRMI)       83,791                430,861               779,260
      URLs (LRMI')      84,098                430,895               929,573
      Quads (LRMI)      9,245,793             26,256,833            44,108,511
      Quads (LRMI')     9,251,553             26,258,524            69,932,849
        • CC: Common Crawl, 2013-2015 (http://commoncrawl.org)
        • WDC: Web Data Commons, 2013-2015: statements/quads extracted from CC (http://webdatacommons.org)
        • LRMI: all quads extracted from WDC/CC which include or co-occur with an LRMI term (according to LRMI spec)
        • LRMI': extracted from WDC/CC as above, but considering "common errors" [Meusel et al 2015]
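The LRMI subset described on slide 7 ("all quads which include or co-occur with an LRMI term") could be selected along the following lines. This is a minimal sketch under stated assumptions: quads are already parsed into `(subject, predicate, object, document_url)` tuples, "co-occur" is read as "appears in the same document", and `LRMI_TERMS` is only an illustrative subset of the terms in the LRMI specification. The function names are hypothetical.

```python
from collections import defaultdict

SCHEMA = "http://schema.org/"

# Assumption: an illustrative subset of LRMI terms, not the full specification.
LRMI_TERMS = {
    "educationalAlignment", "educationalUse", "interactivityType",
    "learningResourceType", "timeRequired", "typicalAgeRange",
    "AlignmentObject",
}


def is_lrmi_term(value):
    """True if value is a schema.org IRI whose local name is an LRMI term."""
    return value.startswith(SCHEMA) and value[len(SCHEMA):] in LRMI_TERMS


def lrmi_subset(quads):
    """Keep every quad of each document that mentions at least one LRMI term."""
    by_doc = defaultdict(list)
    for quad in quads:
        by_doc[quad[3]].append(quad)   # group by document URL (4th element)
    selected = []
    for doc_quads in by_doc.values():
        if any(is_lrmi_term(p) or is_lrmi_term(o) for _, p, o, _ in doc_quads):
            selected.extend(doc_quads)
    return selected


# Hypothetical sample data: two documents, only the first uses an LRMI term.
SAMPLE = [
    ("_:a", SCHEMA + "name", "Intro to Algebra", "http://a.example/course"),
    ("_:a", SCHEMA + "learningResourceType", "lecture", "http://a.example/course"),
    ("_:b", SCHEMA + "name", "Forrest Gump", "http://b.example/movie"),
]
```

With `SAMPLE`, `lrmi_subset` keeps both quads from `http://a.example/course`, including the plain `name` statement that merely co-occurs with an LRMI term, and drops the movie page.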
  • 9. LRMI distribution across pay-level-domains (PLDs)
        • Power law distribution across approx. 300 PLDs and 4000 subdomains (2015)
        • Top 10% of contributors provide 98.4% of all quads (2015)
      PLDs labelled in the figure: 7xxxtube.com, 1amateurporntube.com, virtualpornstars.com, sunriseseniorliving.com, simplyfinance.co.uk, menslifestyles.com, audiobooks.com, simplypsychology.org, helles-koepfchen.de
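The "top 10% of contributors provide 98.4% of all quads" figure on slide 9 is a concentration measure that is easy to recompute from per-PLD quad counts. The sketch below shows the arithmetic only; the function name `top_share` and the toy counts are hypothetical and are not taken from the LRMI corpus.

```python
import math


def top_share(quads_per_pld, fraction=0.10):
    """Share of all quads contributed by the largest `fraction` of PLDs."""
    counts = sorted(quads_per_pld.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Always include at least one PLD, rounding the cut-off upwards.
    k = max(1, math.ceil(len(counts) * fraction))
    return sum(counts[:k]) / total


# Hypothetical toy distribution: one dominant PLD among ten.
TOY = {"big.example": 9840}
TOY.update({f"small{i}.example": c for i, c in enumerate(
    [40, 30, 25, 20, 15, 12, 10, 5, 3])})
```

Here `top_share(TOY)` evaluates to 0.984, because the single largest of the ten PLDs holds 9,840 of the 10,000 quads. A heavily skewed distribution like this is what makes focused crawling of the most probable data providers attractive.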
  • 10. Markup quality (1/2): addressing schema misuse
      PLDs labelled in the figure: sunriseseniorliving.com, 7xxxtube.com, 1amateurporntube.com, virtualpornstars.com, simplyfinance.co.uk, menslifestyles.com, audiobooks.com, simplypsychology.org, helles-koepfchen.de
      Clustering/classification of unintended uses of LRMI terms?
        • Domain blacklist: recall 96%, roughly 10% of PLDs (0.5% of documents) affected
        • Clustering of PLDs/resource types (XMeans)
        • Variety of features, in particular related to term adoption
  • 11. Unintended schema use: term distribution as clustering feature?
      Figure: term co-occurrence within markup from top-ranked PLDs ("learning resources in the LRMI sense")
      Figure: term co-occurrence within markup from filtered adult content PLDs
  • 12. Markup quality (2/2): heuristics for fixing frequent errors
        • Heuristics for fixing frequent errors (see Meusel et al., ESWC2015)
            o Wrong namespaces (e.g. "htp:/schema.org"): 501,530 quads in 2015
            o Undefined types and properties: 1,172,893 quads in 2015
            o Object properties misused as data type property: 10,288,717 quads in 2015
        • Errors fixed in most PLDs and documents
        • But: lower error rate in LRMI corpus than markup in general (WDC)
      Top-5 undefined types
      Rank   Year   Type                   # Quads   # PLDs
      1      2013   EducationalEvent       6004      1
             2014   EducationalEvent       3047      1
             2015   offer                  100516    1
      2      2013   UserComment            20        1
             2014   Therapist              25        1
             2015   headline               6724      1
      3      2013   CompetencyObject       4         1
             2014   UserComment            23        1
             2015   URL                    693       1
      4      2013   Webpage                2         1
             2014   learningResourceType   21        1
             2015   webpage                360       1
      5      2013   about                  1         1
             2014   EducationalEvent       19        1
             2015   musicrecording         296       1
      Fixed quads/documents/PLDs
                 2013              2014                2015
      # quads    520,815 (5.63%)   1,601,796 (6.10%)   6,179,097 (8.84%)
      # docs     46,382 (55.15%)   369,772 (85.81%)    754,863 (81.21%)
      # PLDs     75 (75.76%)       154 (67.54%)        291 (77.39%)
      "Strings, not things"
        • Numbers from 2015:
            o 46 million "transversal" quads (i.e. non-hierarchical statements)
            o 64% datatype properties, yet 97% refer to literals (up from 70% in 2013)
        • Issues
            o Lack of links and controlled vocabularies
            o Data reuse requires identity resolution
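Two of the error classes on slide 12, wrong namespaces such as "htp:/schema.org" and wrongly cased terms such as "webpage", lend themselves to simple string heuristics. The sketch below is an assumption about how such fixes might look, not the actual rules of Meusel et al.; the names `fix_namespace`, `fix_term` and the small `KNOWN_TERMS` set are hypothetical.

```python
import re

NS = "http://schema.org/"

# Matches an optional (possibly misspelt) scheme, any number of slashes and an
# optional "www." in front of schema.org, followed by slashes or end of string.
_BROKEN_NS = re.compile(r"^(?:[a-z]+:)?/*(?:www\.)?schema\.org(?:/+|$)",
                        re.IGNORECASE)

# Assumption: a tiny excerpt of valid schema.org terms for case repair.
KNOWN_TERMS = {"WebPage", "MusicRecording", "Offer", "CreativeWork",
               "learningResourceType"}
_BY_LOWER = {term.lower(): term for term in KNOWN_TERMS}


def fix_namespace(iri):
    """Rewrite malformed schema.org prefixes to the canonical namespace."""
    return _BROKEN_NS.sub(NS, iri, count=1)


def fix_term(iri):
    """Fix the namespace, then restore the canonical casing of known terms."""
    iri = fix_namespace(iri)
    if iri.startswith(NS):
        local = iri[len(NS):]
        return NS + _BY_LOWER.get(local.lower(), local)
    return iri
```

For example, `fix_term("htp:/schema.org/webpage")` returns `http://schema.org/WebPage`, while IRIs outside schema.org are returned unchanged. Terms that are genuinely undefined (such as "Therapist" in the table above) are left as they are, since a case repair cannot guess the intended type.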
  • 13. Key findings & implications
      I. Significant growth, but biased term adoption.
        • Growing adoption: 138 M statements in 2016 (48 M in 2015), observable even in a general-purpose crawl (CC)
        • Bias towards simple data type & generic properties
        • Implications for data consumption & identity resolution
      II. Power-law distribution of LRMI markup.
        • Top 10% of contributors provide 98.4% of quads (2015)
        • Efficient crawling / extraction of LRMI-specific data (e.g. for building an index or recommender) => focused crawling of most probable data providers
      III. Frequent errors.
        • Vast amounts of erroneous statements (80% of PLDs in 2015), yet fewer than in markup in general
        • Steady increase (total and relative) of errors
        • Need for data cleansing & fixing: heuristics and frequency-based approaches (e.g. erroneous terms usually occur in few PLDs only)
      IV. Unintended use of vocabulary terms.
        • Terms applied in a variety of contexts (e.g. adult content)
        • Not necessarily a schema violation
        • But: need for further processing (e.g. clustering/classification) when interpreting/using LRMI
  • 14. Future work
      Consumption, reuse & fusion of markup data
        • Clustering for data cleansing and categorisation (features: e.g. term distribution, page-rank etc.)
        • Supervised data fusion for entity matching and fact verification: related work [ICDE2017, SWJ2017]
        • Augmenting knowledge bases
      Vocabulary design
        • Feed findings into DCMI task force on LRMI
        • Bootstrap patterns and terms (from actual usage)?
        • Wider schema.org question: reflecting lack of acceptance of object-object relationships in vocabularies?
      References
        • Yu, R., Fetahu, B., Gadiraju, U., Dietze, S., FuseM: Query-Centric Data Fusion on Structured Web Markup, ICDE2017.
        • Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2017, under review.
  • 15. Contact, data & stats
      Data: http://lrmi.itd.cnr.it/
      Contact: @stefandietze | http://stefandietze.net