Talk at WWW2017 on LRMI adoption, quality and usage. Full paper here: http://papers.www2017.com.au.s3-website-ap-southeast-2.amazonaws.com/companion/p283.pdf.
Analysing & Improving Learning Resources Markup on the Web
1. Analysing and Improving Embedded Markup of Learning Resources on the Web
Stefan Dietze, Davide Taibi, Ran Yu, Phil Barker, Mathieu d’Aquin
- WWW2017, Digital Learning Track -
05/04/17 | Stefan Dietze
2. Structured data about learning resources on the Web?
Open Data & Linked Data
Resource metadata
Standards: LOM, ADL SCORM, IMS LD, etc.
Repositories: OpenCourseWare, Merlot, ARIADNE, etc.
Educational(ly relevant) linked data
Vocabularies: BIBO, LOM/RDF, mEducator, etc.
Datasets: e.g. LinkedUp Catalog (approx. 50 M resources)
http://data.linkededucation.org/linkedup/catalog/
3. Structured data about learning resources on the Web?
Web: approx. 46,000,000,000,000 (46 trillion) Web pages indexed by Google
4. Embedded markup (RDFa, Microdata, Microformats) enables
interpretation of Web documents (search, retrieval)
schema.org vocabulary used at scale (approx. 700 classes,
1,000 predicates), supported by Yahoo, Yandex, Bing and Google
Adoption on the Web (2016):
o 38 % out of 3.2 bn pages
o 44 bn statements/quads
(see “Web Data Commons”; Meusel & Paulheim [ISWC2014])
Same order of magnitude as “the Web” (scale, dynamics)
Embedded markup data & schema.org
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Forrest Gump</h1>
  <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
  <span itemprop="genre">Drama</span>
  ...
</div>
RDF statements
node1 actor _node-x
node1 actor Robin Wright
node1 genre Comedy
node2 actor T. Hanks
node2 distributed by Paramount Pic.
node3 actor Tom Cruise
node3 distributed by Paramount Pic.
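A minimal sketch of how such Microdata is turned into statements (assuming a flat, single-item page; real extractors such as the WDC pipeline also handle nested items, `itemref`, typed values and more):

```python
# Minimal sketch: flat Microdata -> (node, predicate, object) statements.
# Assumption: one item per page, no nested itemscopes, no itemref.
from html.parser import HTMLParser

class MicrodataSketch(HTMLParser):
    """Collects (node, predicate, object) statements from flat Microdata."""
    def __init__(self):
        super().__init__()
        self.statements = []   # extracted statements
        self._node = None      # blank-node id of the current itemscope
        self._prop = None      # itemprop waiting for its text value
        self._count = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self._count += 1
            self._node = f"_:node{self._count}"
            if "itemtype" in attrs:
                self.statements.append((self._node, "rdf:type", attrs["itemtype"]))
        elif "itemprop" in attrs:
            self._prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._prop and data.strip():
            self.statements.append((self._node, self._prop, data.strip()))
            self._prop = None

html = """<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Forrest Gump</h1>
  <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
  <span itemprop="genre">Drama</span>
</div>"""

parser = MicrodataSketch()
parser.feed(html)
for statement in parser.statements:
    print(statement)
```

Grouping the resulting statements by node id yields exactly the kind of per-entity quad sets shown above.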
5. Learning Resources Metadata Initiative (LRMI)
schema.org extension providing vocabulary for the annotation of learning resources
Association of resources (s:CreativeWork, e.g. books, videos, etc.)
with learning-related attributes (typical age range, learning
resource type, educational frameworks, etc.)
Dublin Core Metadata Initiative (DCMI) task force on LRMI
http://lrmi.dublincore.net/
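A hypothetical page annotated with LRMI terms might look as follows (illustrative values; `learningResourceType`, `typicalAgeRange` and `educationalUse` are actual LRMI/schema.org properties):

```html
<div itemscope itemtype="http://schema.org/CreativeWork">
  <h1 itemprop="name">Introduction to Fractions</h1>
  <meta itemprop="learningResourceType" content="lesson plan">
  <meta itemprop="typicalAgeRange" content="8-10">
  <meta itemprop="educationalUse" content="group work">
</div>
```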
6. Learning Resources Metadata Initiative: research questions
How is LRMI actually being used on the Web?
RQ1) Adoption of LRMI terms / patterns and its evolution?
RQ2) Distribution across the Web?
RQ3) Quality (and how to improve/cleanse/interpret)?
Why is it important?
Enable data reuse (KB construction, recommenders, search)
Inform vocabulary design (LRMI, schema.org)
7. Data extraction

                 2013                 2014                 2015
Documents (CC)   2,224,829,946        2,014,175,679        1,770,525,212
URLs (WDC)       585,792,337 (26.3%)  620,151,400 (30.7%)  541,514,775 (30.5%)
Quads (WDC)      17,241,313,916       20,484,755,485       24,377,132,352
URLs (LRMI)      83,791               430,861              779,260
URLs (LRMI')     84,098               430,895              929,573
Quads (LRMI)     9,245,793            26,256,833           44,108,511
Quads (LRMI')    9,251,553            26,258,524           69,932,849

CC: Common Crawl, 2013-2015 (http://commoncrawl.org)
WDC: Web Data Commons, 2013-2015: statements/quads extracted from CC (http://webdatacommons.org)
LRMI: all quads extracted from WDC/CC which include or co-occur with an LRMI term (according to the LRMI spec)
LRMI': extracted from WDC/CC as above, but also considering "common errors" [Meusel et al. 2015]
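The LRMI extraction step can be sketched roughly as a filter over N-Quads lines (assumptions: `LRMI_TERMS` below is only an illustrative subset of the LRMI spec, and the substring test is a simplification of the paper's "include or co-occur with an LRMI term" criterion):

```python
# Sketch: filter an N-Quads dump for LRMI-related statements.
# LRMI_TERMS is an illustrative subset of the LRMI vocabulary,
# not the full term list used in the paper.
LRMI_TERMS = {
    "typicalAgeRange", "learningResourceType", "educationalUse",
    "educationalAlignment", "timeRequired", "educationalRole",
}

def is_lrmi_quad(line: str) -> bool:
    """True if the quad's predicate mentions an LRMI term."""
    parts = line.split(" ", 3)   # subject, predicate, rest
    if len(parts) < 3:
        return False
    predicate = parts[1]
    return any(term in predicate for term in LRMI_TERMS)

quads = [
    '<http://a.org/p> <http://schema.org/typicalAgeRange> "8-10" <http://a.org/> .',
    '<http://a.org/p> <http://schema.org/genre> "Drama" <http://a.org/> .',
]
print([is_lrmi_quad(q) for q in quads])
```

A co-occurrence-aware variant would additionally keep all quads sharing a subject with a matching quad, which is what makes the LRMI counts above larger than a pure predicate match.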
9. LRMI distribution across pay-level-domains (PLDs)
Power-law distribution across approx. 300 PLDs and 4,000 subdomains (2015)
Top 10% of contributors provide 98.4% of all quads (2015)
Top PLDs include:
7xxxtube.com
1amateurporntube.com
virtualpornstars.com
sunriseseniorliving.com
simplyfinance.co.uk
menslifestyles.com
audiobooks.com
simplypsychology.org
helles-koepfchen.de
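The skew behind the power law can be checked in a few lines (the quad counts below are hypothetical; the paper reports 98.4% for the top 10% of PLDs in 2015):

```python
# Sketch: share of all quads contributed by the top 10% of PLDs.
# The counts are hypothetical toy data, not the WDC/LRMI corpus.
def top_decile_share(quad_counts):
    """Fraction of quads coming from the top 10% of providers."""
    counts = sorted(quad_counts, reverse=True)
    k = max(1, len(counts) // 10)          # size of the top decile
    return sum(counts[:k]) / sum(counts)

counts = [100_000, 5_000, 300, 200, 100, 50, 40, 30, 20, 10]
print(f"{top_decile_share(counts):.1%}")   # -> 94.6% with these toy counts
```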
10. Markup quality (1/2): addressing schema misuse
[PLD examples as on the previous slide]
Clustering/classification of unintended uses of LRMI terms?
• Domain blacklist: recall 96%; roughly 10% of PLDs (0.5% of documents) affected
• Clustering of PLDs/resource types (X-Means)
• Variety of features, in particular related to term adoption
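One way term distribution works as a feature: PLDs with intended LRMI use share similar term-usage profiles, while unintended uses stand out. A pure-Python sketch using cosine similarity over hypothetical PLDs and counts (a stand-in for the X-Means setup, not the paper's actual feature set):

```python
# Sketch: term-usage profiles as features for separating intended from
# unintended LRMI use. PLD names and counts below are hypothetical.
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-count profiles."""
    terms = set(a) | set(b)
    dot = sum(a.get(t, 0) * b.get(t, 0) for t in terms)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

profiles = {
    "edu-site.example":  {"learningResourceType": 40, "typicalAgeRange": 35},
    "edu-site2.example": {"learningResourceType": 50, "typicalAgeRange": 20},
    "adult.example":     {"typicalAgeRange": 90},   # age markup only
}

ref = profiles["edu-site.example"]
for pld, prof in profiles.items():
    print(pld, round(cosine(ref, prof), 2))
```

With richer features (page rank, co-occurring types, etc.) such profiles feed directly into a clustering algorithm like X-Means.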
11. Unintended schema use: term distribution as clustering feature?
Term co-occurrence within markup from top-ranked PLDs
("learning resources in the LRMI sense") vs.
term co-occurrence within markup from filtered adult-content PLDs
12. Markup quality (2/2): heuristics for fixing frequent errors

Top-5 undefined types (per year):
Rank  Year  Type/Property         #Quads   #PLDs
1     2013  EducationalEvent        6004       1
1     2014  EducationalEvent        3047       1
1     2015  offer                 100516       1
2     2013  UserComment               20       1
2     2014  Therapist                 25       1
2     2015  headline                6724       1
3     2013  CompetencyObject           4       1
3     2014  UserComment               23       1
3     2015  URL                      693       1
4     2013  Webpage                    2       1
4     2014  learningResourceType      21       1
4     2015  webpage                  360       1
5     2013  about                      1       1
5     2014  EducationalEvent          19       1
5     2015  musicrecording           296       1

Heuristics for fixing frequent errors (see Meusel et al., ESWC2015):
o Wrong namespaces (e.g. "htp:/schema.org"): 501,530 quads in 2015
o Undefined types and properties: 1,172,893 quads in 2015
o Object properties misused as datatype properties: 10,288,717 quads in 2015
Errors fixed in most PLDs and documents
But: lower error rate in the LRMI corpus than in markup in general (WDC)

"Strings, not things"
Numbers from 2015:
o 46 million "transversal" quads (i.e. non-hierarchical statements)
o 64% datatype properties, yet 97% refer to literals (up from 70% in 2013)
Issues:
o Lack of links and controlled vocabularies
o Data reuse requires identity resolution

Fixed quads/documents/PLDs:
          2013               2014               2015
# quads   520,815 (5.63%)    1,601,796 (6.10%)  6,179,097 (8.84%)
# docs    46,382 (55.15%)    369,772 (85.81%)   754,863 (81.21%)
# PLDs    75 (75.76%)        154 (67.54%)       291 (77.39%)
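The wrong-namespace heuristic can be sketched as a single rewrite rule (a minimal sketch; the misspelling pattern below is illustrative, not the published error catalogue from Meusel et al.):

```python
# Sketch of one fixing heuristic: normalise misspelled schema.org
# namespaces. The regex covers illustrative variants such as
# "htp:/schema.org", "http:/schema.org" and "http://www.schema.org".
import re

NS_RE = re.compile(r"\bh?t?t?ps?:/{1,2}(?:www\.)?schema\.org/", re.IGNORECASE)

def fix_namespace(iri: str) -> str:
    """Rewrite common namespace misspellings to http://schema.org/."""
    return NS_RE.sub("http://schema.org/", iri)

print(fix_namespace("htp:/schema.org/name"))  # -> http://schema.org/name
```

Undefined types/properties and object-vs-datatype confusions need term-level lookup tables against the schema.org definitions rather than a plain regex.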
13. Key findings & implications
I. Significant growth, but biased term adoption.
Growing adoption: 138 M (48 M) statements in 2016 (2015), observable even in a general-purpose crawl (CC)
Bias towards simple datatype & generic properties
Implications for data consumption & identity resolution
II. Power-law distribution of LRMI markup.
Top 10% of contributors provide 98.4% of quads (2015)
Efficient crawling/extraction of LRMI-specific data (e.g. for building an index or recommender)
=> focused crawling of the most probable data providers
III. Frequent errors.
Vast amounts of erroneous statements (80% of PLDs in 2015), yet fewer than in markup in general
Steady increase (total and relative) of errors
Need for data cleansing & fixing: heuristics and frequency-based approaches
(e.g. erroneous terms usually in few PLDs only)
IV. Unintended use of vocabulary terms.
Terms applied in a variety of contexts (e.g. adult content)
Not necessarily a schema violation
But: need for further processing (e.g. clustering/classification) when interpreting/using LRMI
14. Future work
Consumption, reuse & fusion of markup data
o Clustering for data cleansing and categorisation (features: e.g. term distribution, PageRank, etc.)
o Supervised data fusion for entity matching and fact verification (related work: [ICDE2017, SWJ2017])
o Augmenting knowledge bases
Vocabulary design
o Feed findings into the DCMI task force on LRMI
o Bootstrap patterns and terms from actual usage?
o Wider schema.org question: reflecting lack of acceptance of object-object relationships in vocabularies?
Yu, R., Fetahu, B., Gadiraju, U., Dietze, S.: FuseM: Query-Centric Data Fusion on Structured Web Markup. ICDE 2017.
Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S.: KnowMore - Knowledge Base Augmentation with Structured Web Markup. Semantic Web Journal, 2017 (under review).
15. Contact, data & stats
Data
http://lrmi.itd.cnr.it/
Contact
@stefandietze | http://stefandietze.net