"Wikidata, a target for Europeana's semantic strategy"/ Presentation at the GLAM-Wiki conference with Valentine Charles, Hugo Manguinhas, Antoine Isaac, Vladimir Alexiev http://nl.wikimedia.org/wiki/GLAM-WIKI_2015/
1. Wikidata, a target for Europeana’s semantic strategy
Valentine Charles, Hugo Manguinhas, Antoine Isaac: Europeana
Vladimir Alexiev: Ontotext Corp
GLAM Wiki 2015, Den Haag
3. Europeana has many data challenges: diversity
Aggregates metadata from the cultural heritage sector in Europe
• Large number of references to places, agents, concepts, time
4. Europeana has many data challenges: diversity
Metadata in more than 30 languages
From all EU countries
5. Europeana’s priority 1: Improve data quality
Europeana Data Model (EDM), a framework for richer data
• Re-uses several existing Semantic Web-based models
Dublin Core, OAI-ORE, SKOS, CIDOC-CRM…
• EDM gives support for contextual resources (semantic layer)
Rely on vocabularies to solve a problem of data interlinking
• Encourage data providers to contribute their own vocabularies
and benefit from data links made at data providers’ level
8. Europeana performs automatic enrichment based on vocabularies
Goal: Contextualization which reaches outside the scope of a particular platform
9. Automatic enrichment process in Europeana
Analysis
• Selection of metadata fields in resource descriptions
• Selection of potential rules to match
Linking
• Matching the values of the metadata fields to values of the contextual resources
• Adding contextual links
Augmentation
• Selecting the values from the contextual resource
• Augmentation of the search index with the labels from the vocabulary
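The three stages (Analysis, Linking, Augmentation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Europeana's actual implementation: the vocabulary, the example.org URI, the record and the exact, case-insensitive matching rule are all hypothetical.

```python
# Hypothetical vocabulary: contextual resource URI -> labels per language.
VOCABULARY = {
    "http://example.org/agent/cranach": {
        "en": "Lucas Cranach the Elder",
        "de": "Lucas Cranach der Ältere",
    },
}

def analyse(record, fields):
    """Analysis: select the metadata fields (and their values) to enrich."""
    return [(field, value) for field in fields for value in record.get(field, [])]

def link(candidates, vocabulary):
    """Linking: match field values to vocabulary labels.

    Assumed rule: exact, case-insensitive label match.
    """
    index = {}
    for uri, labels in vocabulary.items():
        for label in labels.values():
            index.setdefault(label.casefold(), uri)
    return [
        (field, value, index[value.casefold()])
        for field, value in candidates
        if value.casefold() in index
    ]

def augment(links, vocabulary):
    """Augmentation: collect all labels of the linked resources for the search index."""
    terms = set()
    for _field, _value, uri in links:
        terms.update(vocabulary[uri].values())
    return sorted(terms)

# Hypothetical record using Dublin Core field names.
record = {"dc:creator": ["lucas cranach the elder"], "dc:subject": ["portrait"]}
candidates = analyse(record, ["dc:creator", "dc:subject"])
links = link(candidates, VOCABULARY)
search_terms = augment(links, VOCABULARY)
```

In this sketch a record that only names the creator in English also becomes findable through the German label, which is the point of augmenting the search index.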
10. Enrichment Types and Current Vocabularies
Enrichment Type | Target vocabulary | Source metadata fields
Places | GeoNames | dcterms:spatial, dc:coverage
Concepts | GEMET, DBpedia | dc:subject, dc:type
Agents | DBpedia | dc:creator, dc:contributor
Time | Semium Time | dc:date, dc:coverage, dcterms:temporal, edm:year
13. Wikipedia's Relevance for Cultural Heritage
Authority Lists and Thesauri have central importance in CH
Wikipedia being "the sum of all knowledge" has broader reach than
any institutional authority list
Only large-scale aggregations like VIAF (35 institutions) and LCSH
(about 10 libraries around LoC) are comparable
While some facts are inaccurate and disputable, Wikipedia has a
great role as a source of stable URLs on all kinds of topics
14. How Big is Wikidata?
Name data sources for semantic enrichment (Europeana Creative
D2.4) gives DBpedia and Wikidata stats
Wikidata: 3y old, 14M items, 209M edits
2.7M humans, 5k families, 22k literary characters
215k organizations
66k creative orgs (bands, radio/TV stations, newspapers…)
30k educational institutions
20k non-profit orgs
13k GLAM orgs: 0.5k galleries, 1k libraries, 0.2k archives, 9k museums
500k creative works
110k heritage sites and monuments
40k family names, 20k first names
15. Is this big enough?
Wikidata: 2.7M humans, 215k organizations, 800k places, 500k works
VIAF: 35M personal names, 5.4M orgs/conferences, 410k places,
1.7M works
GeoNames: 9M places
Only 1.1M persons are coreferenced, see
Authority Addicts: The New Frontier of Authority Control on Wikidata
VIAF is much bigger, but Wikidata is still very important for GLAM:
Wikidata is active in Authority Control and Coreferencing
(VIAF) Moving to Wikidata: will get 1M persons/orgs, and many
multilingual names (see next)
Authority Files have barely more than names & dates; Wikipedia
often has a lot more info
16. Wikidata Multilingual Coverage
Wikidata/DBpedia has huge multilingual coverage
Each entity is represented in 2.11 Wikipedias on average (see
Europeana food and drink classification scheme, EFD D2.2)
But popular entities are present in many more (up to 180); and
even in one Wikipedia there are many languages
E.g. Lucas Cranach in Wikidata: 57 lang tags, representing 44
languages and 13 language variants
Languages are consistently marked
Important for semantic enrichment (Named Entity Recognition)
Even though language labels in Europeana are not consistent
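As a rough illustration of how language tags like those counted for Lucas Cranach could be summarised, the Python sketch below splits a list of tags into plain language tags and variants (tags with a subtag). The tag list is hypothetical, and treating every tag with a hyphen as a "variant" is an assumption about how the slide's 44 + 13 = 57 count was made.

```python
from collections import defaultdict

def group_language_tags(tags):
    """Group tags by their base language, e.g. 'de-ch' goes under 'de'."""
    groups = defaultdict(list)
    for tag in tags:
        base = tag.split("-")[0].lower()
        groups[base].append(tag)
    return dict(groups)

def summarise(tags):
    """Return (total tags, plain language tags, variant tags).

    Assumption: a tag containing a hyphen counts as a language variant.
    """
    variants = [t for t in tags if "-" in t]
    plain = [t for t in tags if "-" not in t]
    return len(tags), len(plain), len(variants)

# Hypothetical sample, not the actual tags on the Wikidata item.
sample_tags = ["de", "de-at", "de-ch", "en", "en-gb", "fr"]
groups = group_language_tags(sample_tags)
total, plain_count, variant_count = summarise(sample_tags)
```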
17. Name Variants for Lucas Cranach
Wikidata and VIAF each have 70 variants and
dominate the "Wikipedia tradition" and
"Library tradition" datasets respectively (see
Name data sources for semantic enrichment)
Only 5 variants are in common (see
Interactive Venn diagram)
Excellent complementarity. VIAF has more
variants, Wikidata more multilingual names
VIAF's move to sync to Wikidata will narrow
the gap
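The complementarity can be quantified with plain set operations. With the slide's figures (70 variants in each source, 5 in common) the union holds 70 + 70 - 5 = 135 distinct variants. The Python sketch below shows the calculation; the name variants in it are hypothetical placeholders, not the actual Wikidata or VIAF lists.

```python
def overlap_report(a, b):
    """Compare two sets of name variants."""
    common = a & b
    union = a | b
    return {
        "common": sorted(common),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        # Share of all distinct variants that both sources have.
        "jaccard": len(common) / len(union) if union else 0.0,
    }

# Hypothetical variants for illustration only.
wikidata_names = {"Lucas Cranach", "Lucas Cranach der Ältere", "Lucas Cranach the Elder"}
viaf_names = {"Lucas Cranach", "Cranach, Lucas", "Cranach, Lucas, 1472-1553"}

report = overlap_report(wikidata_names, viaf_names)
```

A low overlap score is good news here: it means merging the two sources adds many names that neither has alone.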
18. Wikidata is connected to other vocabularies
Europeana prefers using pivot vocabularies
• that are connected to many other vocabularies
• It is key to avoid duplication and redundancy
Wikidata has a lot of coreferences to other vocabularies that
can be used to create extra links, and extract missing data
• https://www.wikidata.org/wiki/Wikidata:WikiProject_Authority_control
• https://twitter.com/hashtag/coreferencing: shots and news
• Please tweet!
19. VIAF-Wikidata Coreferences for Lucas Cranach
Can be leveraged to fill the gaps, e.g. bring RKDartists into VIAF
VIAF | id in VIAF | Wikidata | id in Wikidata
viafID | 49268177 | VIAF | 49268177
BAV | ADV10197613 | |
BNC | .a10853637 | |
BNE | XX907273 | |
BNF | cb12176451h | BNF | 12176451h
DNB | 118522582 | GND | 118522582
ISNI | 0000000121319721 | ISNI | 0000 0001 2131 9721
JPG | 500115364 | ULAN | 500115364
LC | n50020861 | LCCN | n50020861
LNB | LNC10-000002573 | |
NDL | 00436834 | |
NKC | jn20000700335 | |
NLA | 000035031951 | |
NLI | 000035532,001445575,001448179 | |
NLP | a16828161 | |
NTA | 068435312 | NTA PPN | 068435312
NUKAT | vtls000190728 | |
SELIBR | 182422 | |
SUDOC | 028710010 | |
WKP | Lucas_Cranach_the_Elder | Many Wikipedias |
IMAGINE | T7238,T267474 | Cantic | a10853637
 | | Commons Creator | Lucas Cranach (I)
 | | Commons category | Lucas Cranach d. Ä.
 | | Freebase | /m/0kqp0
 | | RKDartists | 18978
 | | SIMBAD | CRANACH, Lucas the Elder
 | | Your Paintings | lucas-the-elder-cranach
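A minimal Python sketch of how such coreferences could be compared to find gaps on either side. It uses a subset of the identifiers from the slide; the mapping between VIAF source codes and Wikidata identifier names is read off the slide's rows, and the normalisation rule (drop spaces, drop a leading "cb") is an assumption made for this example.

```python
# Subset of the identifiers shown on the slide for Lucas Cranach.
viaf_ids = {
    "viafID": "49268177", "BNF": "cb12176451h", "DNB": "118522582",
    "ISNI": "0000000121319721", "JPG": "500115364", "LC": "n50020861",
    "NTA": "068435312", "SELIBR": "182422", "SUDOC": "028710010",
}
wikidata_ids = {
    "VIAF": "49268177", "BNF": "12176451h", "GND": "118522582",
    "ISNI": "0000 0001 2131 9721", "ULAN": "500115364", "LCCN": "n50020861",
    "NTA PPN": "068435312", "RKDartists": "18978", "Freebase": "/m/0kqp0",
}
# Assumed correspondence between VIAF source codes and Wikidata names.
VIAF_TO_WIKIDATA = {
    "viafID": "VIAF", "BNF": "BNF", "DNB": "GND", "ISNI": "ISNI",
    "JPG": "ULAN", "LC": "LCCN", "NTA": "NTA PPN",
}

def normalise(value):
    """Assumed rule: ignore spaces and a leading 'cb' (as on the BNF id)."""
    compact = value.replace(" ", "")
    return compact[2:] if compact.startswith("cb") else compact

def compare(viaf, wikidata, mapping):
    """Return (agreeing keys, differing keys, VIAF-only keys, Wikidata-only keys)."""
    agree, differ = [], []
    for viaf_key, wd_key in mapping.items():
        if viaf_key in viaf and wd_key in wikidata:
            same = normalise(viaf[viaf_key]) == normalise(wikidata[wd_key])
            (agree if same else differ).append(viaf_key)
    only_viaf = sorted(k for k in viaf if k not in mapping)
    only_wikidata = sorted(k for k in wikidata if k not in mapping.values())
    return agree, differ, only_viaf, only_wikidata

agree, differ, only_viaf, only_wikidata = compare(viaf_ids, wikidata_ids, VIAF_TO_WIKIDATA)
```

The Wikidata-only keys (here RKDartists and Freebase) are the candidates for filling gaps in VIAF, and the VIAF-only keys are candidates for new Wikidata statements.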
20. Wikidata Coreferencing (1)
Excellent Mix-n-Match tool by Magnus Manske. 54 catalogs loaded!!
Decent auto-matching and excellent crowd-sourcing features
21. Wikidata Coreferencing (2)
Excellent Authority Control navbox in Wikipedia
E.g. matching British Museum person-institution thesaurus
(currently not coreferenced to anything: high value to BM)
22. Europeana Food and Drink
How do you define such a wide area as Food and Drink,
which is so pervasive in everyday life and culture?
Europeana food and drink classification scheme (EFD D2.2,
or presentation) studies ~20 datasets for relevance to FD
Concludes that Wikipedia is our playing ground, and we
should try to use Wikipedia Categories to delineate the topic
• AGROVOC has 32k concepts but on production/science
• Wikipedia/DBpedia has 6.6k proper Foods (with infoboxes and
ingredients)
• But I estimate 0.6-1.2M things relevant to FD in all Wikipedias
Background image: 2 levels of Food_and_drink cat hierarchy
23. Wikidata is Easily Accessible
It is important for Europeana to have the data
• Technically available:
• Data dump preferably as Linked Data (RDF)
• SPARQL end-point or other query mechanism (e.g. WDQ)
• Properly documented and structured
• Wikidata has an excellent Property Proposal process
• Wikidata integrity constraints are excellent
• In contrast, no Class creation process, so the classes are quite a
mess (16k of which 2/3 have fewer than 5 instances)
• Data templates should be made more visible and be used as
references
• Open access
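To make the "SPARQL end-point" point concrete, here is a hedged Python sketch that builds (but does not send) a lookup of a Wikidata item by its VIAF id. The endpoint URL is an assumption, since the slides do not name one; P214 is the Wikidata property for VIAF ID, and the function names are made up for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: the public Wikidata Query Service; the slides only say "SPARQL end-point".
ENDPOINT = "https://query.wikidata.org/sparql"

def viaf_lookup_query(viaf_id):
    """Build a SPARQL query for the item carrying this VIAF id (property P214)."""
    if not viaf_id.isdigit():
        raise ValueError("VIAF ids are expected to be numeric: %r" % viaf_id)
    return 'SELECT ?item WHERE { ?item wdt:P214 "%s" }' % viaf_id

def request_url(query):
    """Encode the query as a GET request URL asking for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})

# VIAF id for Lucas Cranach, taken from the coreference slide.
query = viaf_lookup_query("49268177")
url = request_url(query)
```

Building the URL separately from sending it keeps the sketch runnable offline; any HTTP client could then fetch the URL.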
24. Wikidata Property Integrity Constraints
E.g. ULAN id constraints help to find records to merge / split
E.g. Communist Party of the Russian Federation has 5 LCNAF id's,
what's up? Is it so popular with the Library of Congress?
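The idea behind such a check can be sketched in Python as a "single value" test: report every item that has more than one value for a property that is expected to be unique. This is an illustration of the principle only, not how Wikidata implements its constraints, and the item names and ids below are hypothetical placeholders.

```python
from collections import defaultdict

def single_value_violations(statements, prop):
    """Return items with more than one distinct value for `prop`.

    `statements` is an iterable of (item, property, value) triples.
    """
    values = defaultdict(set)
    for item, p, value in statements:
        if p == prop:
            values[item].add(value)
    return {item: sorted(v) for item, v in values.items() if len(v) > 1}

# Hypothetical statements; the ids are placeholders, not real LCNAF ids.
statements = [
    ("party", "LCNAF", "id-1"),
    ("party", "LCNAF", "id-2"),
    ("party", "LCNAF", "id-3"),
    ("painter", "LCNAF", "id-9"),
    ("painter", "ULAN", "id-7"),
]

violations = single_value_violations(statements, "LCNAF")
```

Each reported item is a candidate for review: either the authority file has duplicate records to merge, or the Wikidata item conflates several entities and should be split.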
25. How Wikidata will be used by Europeana
Semantic Enrichment of Europeana data with additional
information
• With a specific focus on entities such as persons and concepts
Linking Europeana objects with Wikidata
• Approach similar to
https://www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings
• But would be extended to the whole Europeana dataset
• Links would be added in the Europeana data
Structure (data template) for CH objects (e.g. paintings) still
not very rich on Wikidata, e.g. Measurements not there
• Improvements are made all the time, but see next
26. Wikidata Items as Linking Hubs
I don't expect Wikidata CH objects to ever be described in the full
richness & complexity of professional art research. E.g. see
British Museum Mapping to CIDOC CRM
Still, they're great as stable URLs
Providing the basic info (who, when, where, what)
And acting as coreferencing hubs
27. Wikidata and DBpedia
Wikidata and DBpedia are the two structured representations of
Wikipedia
Wikidata: initially populated from Wikipedia, manually curated, will
master structured data for Wikipedia. Synchronized through an
assortment of bots
Data is fairly accurate but data depth is still small
DBpedia: automatically extracted from Wikipedia, live update,
one-way extraction only.
Data reach is deep, but there are many problems in ontology and
individual mappings, especially for non-English. E.g. United
Nations is extracted as "Country". See DBpedia Ontology and
Mapping Problems.
Should they be together?
28. GLAMs should add to Wikipedia or Wikidata!
EFD project. Swiecenie Koszyczek, "blessing of the baskets", a
colorful Polish tradition
There's no article in pl.wikipedia.org, so we can't relate such
artifacts to anything
Content partner's museum staff have no time to make a proper
Wikipedia article
But adding a Wikidata item is quick & easy
Appropriate categories (Easter Traditions, Easter-related Foods) will put it in context
29. Thank you
Valentine Charles, valentine.charles@europeana.eu
Vladimir Alexiev, vladimir.alexiev@ontotext.com
Hugo Manguinhas, hugo.manguinhas@europeana.eu
Editor's Notes
Take advantage of this rich data to improve other types of services such as auto-completion
Two categories:
Global
Produced by projects
See list on the wiki
In the linked environment, enrichment often refers to adding new information at the semantic level to the data about certain resources.
It is the creation of new links between the enriched resources and another data resource, such as controlled vocabularies and authority files.
The goal is contextualization of metadata and embedding the resources in context outside the scope of the platform