1. ENTITIES FOR AUGMENTED
INTELLIGENCE
Krisztian Balog
University of Stavanger
@krisztianbalog
Keynote given at the 23rd International Conference on Theory and Practice of Digital Libraries (TPDL '19) | Oslo, Norway, September 2019
4. WHAT IS AN ENTITY?
An entity is a uniquely identifiable object or thing,
characterized by its name(s), type(s), attributes, and
relationships to other entities.
5. AN ENTITY
<dbr:Karen_Spark_Jones>
  <foaf:name> "Karen Spärck Jones"
  <dbo:abstract> "Karen Spärck Jones FBA (26 August 1935 – 4 April 2007) was a British computer scientist."
  <dbo:birthDate> "1935-08-26"
  <rdf:type> <dbo:Scientist>, <dbo:Person>, <dbo:Agent>, <owl:Thing>
  <dbo:spouse> <dbr:Roger_Needham>
  <dbp:almaMater> <University_of_Cambridge>
  <dbo:knownFor> <dbr:Natural_language_processing>
  <dct:subject> <dbc:Information_retrieval_researchers>, <dbc:British_women_computer_scientists>, <dbc:British_computer_scientists>, <dbc:British_women_scientists>
9. REPRESENTING ENTITIES
AND THEIR PROPERTIES
entity catalog entity ID*
name(s)*
knowledge repository type(s)*
descriptions
relationships (non-typed links)
knowledge base (KB) /
knowledge graph (KG)
attributes
relationships (typed links)
Meant for human consumption
Meant for machine consumption
10. WHY CARE ABOUT ENTITIES?
• From a user perspective,
entities ...
• are natural units for organizing
information
• enable a richer and more effective
user experience
• From a machine perspective,
entities ...
• allow for a better understanding of
queries, document content, and of
users
• help to bridge the gap between
unstructured and structured data
• enable search engines to be more
intelligent
12. ENTITY RETRIEVAL
• Task: Answer an information need (expressed, e.g., as a free text
query) with a ranked list of entities from some catalog of entities
Information need → ranked list of entities (e1, e2, …, en)
14. APPROACHES
• Term-based entity representations can be effectively ranked
using document-based retrieval models
• Semantically informed retrieval models utilize entity-specific
properties (attributes, types, and relationships)
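As a minimal sketch of the first approach, each entity can be treated as a short document of terms and ranked with a document retrieval model. The example below uses a plain BM25 implementation; the entity IDs, the term-based representations, and the parameter values (k1, b) are toy assumptions made up for illustration, not data from the talk.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (deliberately simplistic)."""
    return text.lower().split()


def bm25_rank(query, entity_docs, k1=1.2, b=0.75):
    """Rank entities by BM25 over their term-based representations."""
    tokenized = {eid: tokenize(text) for eid, text in entity_docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(terms) for terms in tokenized.values()) / n_docs

    # Document frequency: in how many entity representations a term occurs.
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    scores = {}
    for eid, terms in tokenized.items():
        tf = Counter(terms)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[eid] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy term-based entity representations (assumed for illustration).
entity_docs = {
    "Michael_Schumacher": "michael schumacher german racing driver formula one champion ferrari benetton",
    "Scuderia_Ferrari": "scuderia ferrari formula one constructor italy racing team",
    "Karen_Sparck_Jones": "karen sparck jones british computer scientist information retrieval",
}
ranking = bm25_rank("formula one racing driver", entity_docs)
```

Semantically informed models would add signals such as types and relationships on top of this term-based score.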
15. ENTITY LINKING
• Task: Recognize entity mentions in text and link them to the
corresponding entries in a knowledge repository
Michael Schumacher (born 3 January 1969) is a German retired racing driver. He
is a seven-time Formula One World Champion and is widely regarded as one of
the greatest Formula One drivers of all time. He won two titles with Benetton in
1994 and 1995 before moving to Ferrari where he drove for eleven years. His
time with Ferrari yielded five consecutive titles between 2000 and 2004.
Linked entities (with their types):
Michael Schumacher → Racing driver
Scuderia Ferrari → Formula One constructor
Benetton Formula → Formula One constructor
Formula One → Auto racing series
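A minimal sketch of the mention detection and linking steps, assuming a hand-made dictionary of surface forms. It uses greedy longest-match lookup and performs no disambiguation, so it is a simplification of what an entity linking system would do; the dictionary entries and entity IDs are illustrative assumptions.

```python
def link_entities(text, surface_forms, max_len=3):
    """Greedy longest-match lookup of known surface forms in the text.

    Returns a list of (mention, entity_id) pairs in order of appearance.
    """
    tokens = text.split()
    links = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest n-gram starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            mention = " ".join(tokens[i:i + n]).strip(".,()")
            if mention in surface_forms:
                links.append((mention, surface_forms[mention]))
                matched = n
                break
        i += matched if matched else 1
    return links


# Hypothetical surface form dictionary (surface form -> entity ID).
surface_forms = {
    "Michael Schumacher": "Michael_Schumacher",
    "Formula One": "Formula_One",
    "Benetton": "Benetton_Formula",
    "Ferrari": "Scuderia_Ferrari",
}
text = ("Michael Schumacher is a seven-time Formula One World Champion. "
        "He won two titles with Benetton before moving to Ferrari.")
links = link_entities(text, surface_forms)
```

Ambiguous surface forms (e.g., "Ferrari" as a constructor or a car maker) would need a separate disambiguation step that this sketch leaves out.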
17. SUMMARY OF PART I
• Established entity retrieval and entity linking techniques
provide a solid starting point
• Open issues
• Most work on entity retrieval has focused on keyword queries; there are
numerous other ways of expressing information needs
• Different types of input call for different entity linking techniques
• Noisy short texts (e.g., tweets, queries), structured data (e.g., tables), OCR'ed text, ...
• Long tail entities (with sparse representation)
19. EXAMPLE INFORMATION NEEDS
• Planning a road trip in California
• Creating a curriculum for a course (including recommended
literature and invited speakers)
• Finding out which anti-aircraft guns were used in ships during war
periods, what countries produced them, and if any working models
may be found (and where)
Answering complex information needs involves retrieving,
extracting, filtering, and aggregating information from
multiple sources
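22. THE ANATOMY OF A RELATIONAL
(ENTITY-FOCUSED) TABLE
Formula 1 constructors’ statistics 2016
Constructor | Engine | Country | Base
Ferrari | Ferrari | Italy | Italy
Force India | Mercedes | India | UK
Haas | Ferrari | US | US & UK
Manor | Mercedes | UK | UK
… | …
• Table caption: "Formula 1 constructors’ statistics 2016"
• Table entities (core/subject column): the Constructor column
• Heading column labels (table schema): Constructor, Engine, Country, Base
• Table data: the remaining cell values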
23. WHAT KIND OF ASSISTANCE CAN WE
PROVIDE FOR PEOPLE WORKING
WITH (RELATIONAL) TABLES?
27. ASSISTANCE #1
Row population: Suggesting entities to be added to the subject column of the table
Formula 1 constructors’ statistics 2016
Constructor | Engine | Country | Base
Ferrari | Ferrari | Italy | Italy
Force India | Mercedes | India | UK
Haas | Ferrari | US | US & UK
Manor | Mercedes | UK | UK
Add entity: 1. McLaren  2. Mercedes  3. Red Bull
28. ASSISTANCE #2
Column population: Suggesting column labels to be added as heading columns
Formula 1 constructors’ statistics 2016
Constructor | Engine | Country | Base
Ferrari | Ferrari | Italy | Italy
Force India | Mercedes | India | UK
Haas | Ferrari | US | US & UK
Manor | Mercedes | UK | UK
Add column: 1. Seasons  2. Races Entered
29. ASSISTANCE #3
Oscar Best Actor
Year | Actor | Film | Role(s)
2013 | Matthew McConaughey | Dallas Buyers Club | Ron Woodroof
2014 | Eddie Redmayne | The Theory of Everything | Stephen Hawking
2015 | Leonardo DiCaprio | The Revenant | Hugh Glass
2016 | Casey Affleck | Manchester by the Sea | Lee Chandler
2017 | Gary Oldman | |
Value finding: Suggesting values for specific table cells with supporting evidence
Suggestions for the empty Film cell (2017, Gary Oldman):
1. Darkest Hour
https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actor
(2 additional sources)
2. Tinker Tailor Soldier Spy
https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actor
(1 additional source)
3. Nil by Mouth
http://dbpedia.org/page/Gary_Oldman
Value checking: Checking whether existing cell values have supporting evidence
Candidates for the Role(s) cell (2016, Casey Affleck):
1. Lee Chandler
https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actor
https://en.wikipedia.org/wiki/Casey_Affleck
2. Ray Sybert
https://en.wikipedia.org/wiki/Casey_Affleck
30. ASSISTANCE #4
Table generation: Automatically generating an entire table in response to a keyword query
Query: economy of Singapore
Result 1: Singapore - Wikipedia, Economy Statistics (Recent Years)
https://en.wikipedia.org/wiki/Singapore
Year | GDP Nominal (Billion) | GDP Nominal Per Capita | GDP Real (Billion) | GNI Nominal (Billion) | GNI Nominal Per Capita
2011 | S$346.353 | S$66,816 | S$342.371 | S$338.452 | S$65,292
2012 | S$362.332 | S$68,205 | S$354.061 | S$351.765 | S$66,216
2013 | S$378.200 | S$70,047 | S$324.592 | S$366.618 | S$67,902
Show more (5 rows total)
Result 2: Singapore - Wikipedia, Language used most frequently at home
https://en.wikipedia.org/wiki/Singapore
31. EXPERIMENTAL SETTING
• Data sources
• Table corpus: 1.6M tables extracted from Wikipedia
• Knowledge base: DBpedia 2015-10 (4.6M entities)
• Evaluation measures
• Standard IR measures (MAP, MRR, NDCG)
32. #1 ROW POPULATION
• Task: Generate a ranked list of entities to be added to the core
column of a given seed table
S. Zhang and K. Balog. EntiTables: Smart Assistance for Entity-Focused Tables.
In: 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '17)
[Diagram: seed table with caption c, seed column labels L=(l1,…,lm), and seed entities E=(e1,…,en) in the core column; the entity to be suggested is en+1 (?)]
34. APPROACH: CANDIDATE SELECTION
• From knowledge base
• Entities that are of the same type(s) or belong to the same categories
• Ranking is based on the number of shared types/categories
• From table corpus
• Based on caption: indexing the table as a document and using a standard
document retrieval method (BM25)
• Based on entities: indexing only entities, using seed entities as the query
Seed table → (1) Candidate selection → (2) Entity ranking → Ranked list of suggestions (top-K entities)
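The knowledge base variant of candidate selection can be sketched as follows: collect the types of the seed entities, then rank all other entities by how many of those types they share. The type assignments below are made-up toy data, and the function name is hypothetical.

```python
def select_candidates(seed_entities, entity_types, top_k=10):
    """Rank non-seed entities by the number of types shared with the seeds."""
    seed_types = set()
    for entity in seed_entities:
        seed_types |= entity_types.get(entity, set())

    scored = []
    for entity, types in entity_types.items():
        if entity in seed_entities:
            continue
        shared = len(types & seed_types)
        if shared > 0:
            scored.append((entity, shared))

    # Most shared types first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Toy type assignments (assumed for illustration, not taken from DBpedia).
entity_types = {
    "Ferrari": {"Formula One constructor", "Racing team", "Italian company"},
    "Haas": {"Formula One constructor", "Racing team"},
    "McLaren": {"Formula One constructor", "Racing team"},
    "Fiat": {"Italian company"},
    "Oslo": {"City"},
}
candidates = select_candidates(["Ferrari", "Haas"], entity_types)
```

The same counting idea applies to categories; the table corpus variants would instead issue the caption or the seed entities as a query against an index.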
35. APPROACH: ENTITY RANKING
• Based on the similarity between the candidate entity and
various table elements
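Seed table → (1) Candidate selection → (2) Entity ranking → Ranked list of suggestions (top-K entities)
$P(e \mid E, L, c) = \cdots \propto P(e \mid E)\, P(L \mid e)\, P(c \mid e)$
where e is the candidate entity, $P(e \mid E)$ is the entity similarity, $P(L \mid e)$ is the column label similarity, and $P(c \mid e)$ is the caption similarity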
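A minimal sketch of how the three components could be combined for ranking, assuming each has already been estimated as a probability. The product is computed in log space; the component values and the small smoothing constant `eps` are illustrative assumptions.

```python
import math


def rank_candidates(components, eps=1e-9):
    """Rank candidates by log P(e|E) + log P(L|e) + log P(c|e).

    components maps each candidate entity to a tuple
    (entity similarity, column label similarity, caption similarity).
    """
    scores = {}
    for entity, (p_entity, p_labels, p_caption) in components.items():
        # eps avoids log(0) when a component has no evidence at all.
        scores[entity] = (math.log(p_entity + eps)
                          + math.log(p_labels + eps)
                          + math.log(p_caption + eps))
    return sorted(scores, key=scores.get, reverse=True)


# Made-up component estimates for three candidate entities.
components = {
    "McLaren": (0.40, 0.50, 0.60),
    "Red Bull": (0.30, 0.50, 0.50),
    "Fiat": (0.20, 0.10, 0.10),
}
ranking = rank_candidates(components)
```

Dropping a component from the sum corresponds to ranking with only a subset of the table elements.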
36. EXPERIMENTAL DESIGN
• Idea: Take existing tables and simulate the user
in an intermediate step during table completion
• Select a set of (1000) tables randomly
• Contain at least 6 rows and at least 3 columns (in
addition to the subject column)
• For any intermediate step (i rows completed)
• First i (1<=i<=5) rows are taken as the seed table
• Entities in the remaining rows are the ground truth
[Diagram: table with column labels l1, l2, …, lm; rows e1 … ei form the seed table, rows ei+1 … en are the ground truth]
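The simulation step can be sketched as a simple split of the core column: the first i entities become the seed, the rest the ground truth. The function name and the example column are assumptions for illustration.

```python
def simulate_row_population(core_column, i):
    """Split a table's core column into seed entities and ground truth."""
    if len(core_column) < 6:
        raise ValueError("table must contain at least 6 rows")
    if not 1 <= i <= 5:
        raise ValueError("i must satisfy 1 <= i <= 5")
    seed = core_column[:i]
    ground_truth = set(core_column[i:])
    return seed, ground_truth


# Toy core column with six entities.
core_column = ["Ferrari", "Force India", "Haas", "Manor", "McLaren", "Mercedes"]
seed, ground_truth = simulate_row_population(core_column, 2)
```

Running this for i = 1, …, 5 on each sampled table yields the five evaluation settings.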
37. EXPERIMENTAL RESULTS
Entity ranking performance in terms of Mean Average Precision (MAP), by number of seed entities
Method | 1 | 2 | 3 | 4 | 5
Baseline* | 0.307 | 0.327 | 0.340 | 0.342 | 0.340
Entity similarity | 0.490 | 0.542 | 0.561 | 0.566 | 0.560
+ column label similarity | 0.572 | 0.610 | 0.618 | 0.618 | 0.610
+ caption similarity | 0.592 | 0.626 | 0.633 | 0.634 | 0.631
* M. Bron, K. Balog, and M. de Rijke. Example Based Entity Search in the Web of Data.
In: 35th European Conference on Information Retrieval (ECIR ’13)
38. #2 COLUMN POPULATION
• Task: Generate a ranked list of column labels to be added to the
column headings of a given seed table
S. Zhang and K. Balog. EntiTables: Smart Assistance for Entity-Focused Tables.
In: 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '17)
[Diagram: seed table with caption c, entities e1, e2, …, en, and column labels l1, l2, …, lm; the label to be suggested is lm+1 (?)]
39. EXPERIMENTAL DESIGN
• Idea: Take existing tables and simulate the user
in an intermediate step during table completion
• Select a set of (1000) tables randomly
• Contain at least 6 rows and at least 4 columns
• For any intermediate step (j columns completed)
• First j (1<=j<=3) columns are taken as the seed table
• Labels in the remaining columns are the ground truth
[Diagram: table with entities e1, e2, …, en; column labels l1 … lj form the seed table, labels lj+1 … lm are the ground truth]
40. #3 CELL VALUE FINDING
• Task: Given an input relational table, find the value of a specific cell
(identified by the entity in the core column and the column heading
label) or (optionally) determine if the cell should be left empty
S. Zhang and K. Balog. Auto-completion for Data Cells in Relational Tables.
In: 28th ACM International Conference on Information and Knowledge Management (CIKM ’19)
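[Diagram: table with caption c; the cell to be filled (?) is identified by entity e in the core column and column heading label l]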
42. APPROACH:
CANDIDATE VALUE FINDING
• From knowledge base
• Heading-to-predicate matching
• E.g., "location" vs. <dbp:location>, <dbp:city>, <dbp:country>
• From table corpus
• Heading-to-heading matching
• Identify other table columns that have the same meaning
Input table (?) → (1) Candidate value finding → (2) Value ranking → Ranked list of suggestions (top-K values)
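As a rough sketch of heading-to-predicate matching, a heading can be compared with the local name of each predicate using string similarity. This is only a lexical stand-in (using Python's `difflib`), not the matching method from the talk; the function names are hypothetical.

```python
from difflib import SequenceMatcher


def local_name(predicate):
    """'<dbp:location>' -> 'location'."""
    return predicate.strip("<>").split(":", 1)[-1].lower()


def match_heading(heading, predicates):
    """Score each predicate by string similarity to the column heading."""
    scored = []
    for predicate in predicates:
        similarity = SequenceMatcher(
            None, heading.lower(), local_name(predicate)).ratio()
        scored.append((predicate, round(similarity, 2)))
    return sorted(scored, key=lambda pair: -pair[1])


matches = match_heading(
    "location", ["<dbp:location>", "<dbp:city>", "<dbp:country>"])
```

Purely lexical matching finds `<dbp:location>` but scores `<dbp:city>` and `<dbp:country>` low, which illustrates why the example on the slide calls for a matching signal that goes beyond surface strings.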
43. APPROACH: VALUE RANKING
• Combine evidence in a feature-based approach
• Features I: Degree of support for the given value across the
different evidence sources
• Features II: Empty value prediction
• Features III: Semantic relatedness between the input table and
candidate tables (where the value originates from)
Input table (?) → (1) Candidate value finding → (2) Value ranking → Ranked list of suggestions (top-K values)
44. EXPERIMENTAL DESIGN
• Idea: Conceal cell values from existing
tables
• Randomly select an existing table
• Pick a table column
• Remove n cells randomly from this column
• Evaluate using crowdsourcing
• Given the input table, the value, and a
source document, does this appear as the
correct value for the missing cell?
45. EXPERIMENTAL RESULTS
Value finding performance in terms of NDCG@5
Method | Empty values excluded | Empty values included
Baseline | 0.585 | 0.518
Features I | 0.664 | 0.576
Features I+II | 0.684 | 0.590
Features I+II+III | 0.757 | 0.671
46. #4 ON-THE-FLY TABLE GENERATION
• Task: Answer a free text query with a relational table, where
• the core column lists all relevant entities
• columns correspond to attributes of those entities
• cells contain the values of the corresponding entity attributes
[Diagram: keyword query q → table with core column entities E, column labels L, and cell values V]
S. Zhang and K. Balog. On-the-fly Table Generation.
In: 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '18)
47. APPROACH
Core column entity ranking and schema determination could potentially mutually reinforce each other.
[Diagram: Query (q) → core column entity ranking (E) ⇄ schema determination (S) → value lookup (V)]
49. MAIN RANKING SIGNALS
[Diagram: Query (q) → core column entity ranking (E) ⇄ schema determination (S) → value lookup (V)]
Core column entity ranking:
• Query-only
• Term-based matching
• Semantic matching
• Query + schema
• Entity-schema matching
• Entity-schema compatibility
Schema determination:
• Query-only
• Column population (q)
• Semantic matching
• Query + entities
• Column population (q, E)
• Attribute retrieval
• Entity-schema compatibility
50. EXPERIMENTAL DESIGN
• QS-1: List queries from the DBpedia-Entity v2 collection¹ (119 queries)
• Relevance judgments obtained via crowdsourcing
• "all cars that are produced in Germany"
• "permanent members of the UN Security Council"
• "Airlines that currently use Boeing 747 planes"
• QS-2: Entity-relationship queries from the RELink Query Collection² (600 queries)
• Queries and relevance judgments obtained automatically from Wikipedia lists that contain
relational tables
• "find peaks above 6000m in the mountains of Peru"
• "Which countries and cities have accredited Armenian ambassadors?"
• "Which anti-aircraft guns were used in ships during war periods and what country produced them?"
¹ Hasibi et al. DBpedia-Entity v2: A Test Collection for Entity Search. In: SIGIR ’17.
² Saleiro et al. RELink: A Research Framework and Test Collection for Entity-Relationship Retrieval. In: SIGIR ’17.
51. EXPERIMENTAL RESULTS
(QS-1)
[Charts: two panels. Core column entity ranking: without schema information (query only), with ground truth schema, with automatic schema determination. Schema determination: without entity information (query only), with ground truth entities, with automatic core column entity ranking]
52. SUMMARY OF PART II
• Tables are a universal tool for collecting and manipulating data
• A selection of smart assistance functionalities for relational tables
• Open issues
• Moving from homogeneous Wikipedia tables to heterogeneous Web tables
and to other (non-relational) table types
• Tapping into unstructured data sources
• Additional operations, e.g., filtering ("above 6000m") and sorting ("by
population")
• User-centric evaluation in the context of a larger work task
54. TRENDS IN THE IR LITERATURE
[Chart: yearly numbers for 2000 to 2016 (vertical axis from 0 to 40), one series per title query: "entity OR entities", "Wikipedia", "knowledge base", "knowledge graph"]
Numbers are based on boolean queries on paper titles from SIGIR, ECIR, CIKM, WSDM, and WWW
55. TRENDS IN THE IR LITERATURE
[Chart: yearly numbers for 2000 to 2016 (vertical axis from 0 to 40), two series: "entity OR entities" and 'Wikipedia OR "knowledge base" OR "knowledge graph"']
Numbers are based on boolean queries on paper titles from SIGIR, ECIR, CIKM, WSDM, and WWW
57. KNOWLEDGE BASES LAG BEHIND
• Many intelligent information access tasks are enabled by
knowledge bases
• Increasingly difficult to keep up with changes and ensure that
knowledge bases are up-to-date and reliable
• Work that needs to be performed by human editors
Can we help human editors to maintain and expand
knowledge bases?
58. KNOWLEDGE BASE ACCELERATION
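[Diagram: a content stream (over time) feeds the KBA system, which consists of entity-centric document filtering (output: ranked list of documents) and entity attribute extraction (output: entity facts); the human editor uses these to make edits to the entity's KB entry in the knowledge base]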
59. KNOWLEDGE BASE ACCELERATION
• Task (entity-centric document filtering): Analyze a stream of documents and assign a score to
each document based on how relevant it is to a given target entity
60. ENVISAGED TOOL
K. Balog, H. Ramampiaro, and K. Nørvåg. KBAAA: A Web-based Toolkit for the Assessment and Analysis of Knowledge Base Acceleration
Systems. In: 10th Conference on Open Research Areas in Information Retrieval (OAIR ’13)
62. APPROACH: MENTION DETECTION
Document → (1) Mention detection → (2) Document scoring → Relevance score (e.g., 0.86)
K. Balog, N. Takhirov, H. Ramampiaro, and K. Nørvåg. Multi-step Classification Approaches to Cumulative Citation Recommendation.
In: 10th Conference on Open Research Areas in Information Retrieval (OAIR ’13)
• Objectives
• High recall while keeping the false positive rate low
• Efficiency (needs to be performed on all documents)
• Based on known surface forms of the entity
• No entity disambiguation performed
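A minimal sketch of mention detection from known surface forms, assuming a small hand-written list of name variants. Longer forms are tried first so that a full name is not split into a partial match; as on the slide, no disambiguation is attempted. The example document is made up.

```python
import re


def detect_mentions(document, surface_forms):
    """Return (offset, matched text) for every known surface form found."""
    # Longest forms first, so the full name wins over a partial one.
    ordered = sorted(surface_forms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(form) for form in ordered) + r")\b"
    return [(m.start(), m.group(0))
            for m in re.finditer(pattern, document, flags=re.IGNORECASE)]


# Toy document and assumed surface forms for the target entity.
document = "Aharon Barak gave a lecture today. Barak later met students."
surface_forms = ["Barak", "Aharon Barak"]
mentions = detect_mentions(document, surface_forms)
has_mention = len(mentions) > 0  # documents without mentions are filtered out
```

A single compiled pattern per entity keeps this step cheap, which matters because it has to run over every document in the stream.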
63. APPROACH: DOCUMENT SCORING
Document → (1) Mention detection → (2) Document scoring → Relevance score (e.g., 0.86)
K. Balog, N. Takhirov, H. Ramampiaro, and K. Nørvåg. Multi-step Classification Approaches to Cumulative Citation Recommendation.
In: 10th Conference on Open Research Areas in Information Retrieval (OAIR ’13)
• Document features
• Entity features
• Document-entity features
• E.g., occurrences and spread of entity and related entities in the document
• Temporal features
• E.g., bursts in document stream or in entity profile views in KB
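To illustrate the document-entity feature group, the sketch below computes the number of entity occurrences, the relative position of the first occurrence, and the spread of occurrences in a tokenized document. The exact feature definitions are assumptions (in particular, "spread" is taken here as the distance between first and last occurrence, normalized by document length), not the definitions used in the cited work.

```python
def mention_features(tokens, entity_tokens):
    """Occurrence count, first position, and spread of an entity in a document."""
    n = len(entity_tokens)
    positions = [i for i in range(len(tokens) - n + 1)
                 if tokens[i:i + n] == entity_tokens]
    if not positions:
        return {"count": 0, "first_pos": None, "spread": 0.0}
    return {
        "count": len(positions),
        # Relative position of the first occurrence (0.0 = document start).
        "first_pos": positions[0] / len(tokens),
        # Assumed definition: normalized distance between first and last mention.
        "spread": (positions[-1] - positions[0]) / len(tokens),
    }


# Toy document (made up for illustration).
tokens = "aharon barak spoke today and later aharon barak answered questions".split()
features = mention_features(tokens, ["aharon", "barak"])
```

Feature dictionaries like this one would be concatenated with document, entity, and temporal features before being passed to a classifier or ranker.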
64. EXPERIMENTAL SETUP
• TREC Knowledge Base Acceleration
track (2012 edition)
• KBA stream corpus
• Oct 2011 to Apr 2012
• Three sources: news, social, linking
• Raw data: 8.7 TB
• Target entities are from Wikipedia
• Precision and recall measured as a
function of cutoff
[Figure: example system output for the target entity Aharon Barak; 13 documents (columns: stream_id, urlname, score) are listed with scores from 1000 down to 263, and a cutoff on the score separates positive from negative documents]
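Measuring precision and recall as a function of the cutoff can be sketched as follows: documents scoring at or above the cutoff count as positive, and the result is compared with the set of relevant documents. The document IDs, scores, and relevance labels are made-up toy data, and treating the cutoff as inclusive is an assumption.

```python
def precision_recall_at_cutoff(scored_docs, relevant, cutoff):
    """Precision and recall when documents with score >= cutoff are positive."""
    retrieved = {doc for doc, score in scored_docs if score >= cutoff}
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Toy run for one target entity: (document ID, score).
scored_docs = [("d1", 1000), ("d2", 500), ("d3", 480), ("d4", 430), ("d5", 263)]
relevant = {"d1", "d3", "d5"}

# Sweep the cutoff to trace out precision and recall.
curve = {cutoff: precision_recall_at_cutoff(scored_docs, relevant, cutoff)
         for cutoff in (0, 400, 500)}
```

Raising the cutoff shrinks the positive set, which typically trades recall for precision.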
65. RESULTS
• Features that worked well
• #related entities, stream volume, Wikipedia pageviews
• Similarity between the doc and the entity’s Wikipedia page
• #entity mentions and spread in the document body
• Features that didn't work that well
• Temporal features
• Separating 'relevant' and 'vitally relevant' is difficult!
66. KNOWLEDGE BASE ACCELERATION
• Task (entity attribute extraction): Extract the corresponding values
for a pre-defined set of predicates, for a
given target entity, from a previously
identified set of documents
67. ENVISAGED TOOL
J. Benetka, K. Balog, and K. Nørvåg. Towards Building a Knowledge Base of Monetary Transactions from a News Collection.
In: 17th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL ’17)
[Screenshot: search interface for financial events. Input: subject entity "Oracle" and predicate filter (financial event) "acquisition", with a "Find events" button. Output: object entities (counterparts: Hyperion Solutions, Siebel Systems, Retek, PeopleSoft) with extracted event attributes (year, value, confidence), supporting snippets such as "…Oracle finally acquired PeopleSoft for…", links to the NYT source articles, and an "Insert" action]
Example source text ("A Boom in Merger Activity"): In December 2004, after a battle for control that grew nasty, Oracle finally acquired PeopleSoft for about $10.3 billion, becoming the second-largest maker of business-management software.
68. APPROACH
• Event representation: generate all possible event interpretations (quintuples)
• Semantic annotation of sentences: monetary value recognition, economic event recognition, entity recognition, date extraction, semantic role labeling
• Clustering events: grouping sentences that discuss the same economic event
• Supervised learning: assigning a confidence score to each interpretation
[Diagram: four numbered steps. Sentences s#1 to s#5 are annotated; s#1, s#2, and s#5 (entities A, B) are grouped into event e#2: [A] <rel> [B], and s#3 and s#4 (entities C, D) into event e#1: [C] <rel> [D]; each interpretation receives a confidence score (e.g., 0.85, 0.65, 0.91)]
69. EXPERIMENTAL SETUP
• New York Times Annotated Corpus
• 20 years, 1.8M articles
• Entity repository constructed from three sources
• DBpedia, Freebase, and CrunchBase
• Test set comprises 30 companies
• 132 ground truth events in total
71. SUMMARY OF PART III
• Techniques for identifying documents that could potentially
trigger updates to the entry of an entity in a knowledge base
• Domain-specific adaptation of an NLP+ML pipeline for attribute
extraction
• Open issues
• Novel entity discovery
• Attributes of interest
• Facts vs. claims
• Generic vs. domain specific techniques
72. SUMMARY
• Complex information needs will continue to require human intelligence,
but there is a growing array of tools to assist people in addressing them
• Entity-oriented perspective on information access
• Equipping spreadsheet programs with smart assistance capabilities
• Tool support for knowledge editors for maintaining and expanding knowledge bases
• Open issues
• Pipeline approaches vs. end-to-end learning
• Techniques for long-tail and emerging entities
• Domain-specific adaptations
• User-centric evaluation in an actual task context
73. JOINT WORK WITH
• Jan Benetka, Faegheh Hasibi, Kjetil Nørvåg, Heri Ramampiaro,
Naimdjon Takhirov, Shuo Zhang