Slides for my keynote at the PAN-DL Workshop (Pattern-based Approaches to NLP in the Age of Deep Learning) at EMNLP'2023 (December 6, 2023).
In this talk, I share our initial learnings from constructing, growing, and serving large knowledge graphs.
The Role of Patterns in the Era of Large Language Models
1. Yunyao Li
PAN-DL@EMNLP’23 | Adobe | December, 2023
The Role of Patterns in the Era of Large Language Models
Initial Learnings from Constructing, Growing and Serving Large Knowledge Graphs*
* Work done at IBM Research and Apple
yunyaol@adobe.com
@yunyao_li
3. Example: Financial Content Knowledge Base
Overall Architecture: A Simplified View
Inputs: Financial Reports, Ontology, XML
KG Construction: Knowledge Extraction, Transforming, Linking, Fusion
Financial Content KB: >31,000 companies; 439 industries; ~170,000 insiders; ~100 million financial metrics; ~22,000 industry KPIs
KG Services: QA, APIs
[VLDB’2017] Creation and Interaction with Large-scale Domain-Specific Knowledge Bases.
5. Key Components of KG Construction, Growth, and Services
KG
Construction & Growth: Extraction, Integration, Inference, Introspection
Services: QA, Linking, Embedding, …
8. Linking and Fusion
Source A: name “Connor McDavid”; type PERSON; dob “97/01/13”; place of birth → CITY with name “Richmond Hill”
Source B: name “Connor McDavid”; type HOCKEY_PLAYER; bday “Jan 13”; goals “43”
After Linking and Fusion (ID1): name “Connor McDavid”; type PERSON, HOCKEY_PLAYER; dob “January 13, 1997”; place of birth → CITY with name “Richmond Hill”; goals “43”
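The linking and fusion step above can be sketched in a few lines. This is a minimal illustration, not the system described in the talk: the matching rule (normalized name equality), the `SUBSUMED_BY` table, and the "first source wins" conflict policy are all assumptions made for the example, and value normalization (for instance of the date) is left out.

```python
# Hypothetical records mirroring Source A and Source B on the slide.
SOURCE_A = {
    "name": "Connor McDavid",
    "types": ["PERSON"],
    "dob": "97/01/13",
    "place_of_birth": "Richmond Hill",
}
SOURCE_B = {
    "name": "Connor McDavid",
    "types": ["HOCKEY_PLAYER"],
    "bday": "Jan 13",
    "goals": "43",
}

# Assumption: a property on the left is dropped when the property on the
# right (which carries strictly more information) is present.
SUBSUMED_BY = {"bday": "dob"}


def normalize_name(name):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def same_entity(record_a, record_b):
    """Toy linking rule (assumption): records match when normalized names match."""
    return normalize_name(record_a["name"]) == normalize_name(record_b["name"])


def fuse(entity_id, record_a, record_b):
    """Merge two linked records into one entity; the first source wins on conflicts."""
    fused = {
        "id": entity_id,
        "name": record_a["name"],
        "types": sorted(set(record_a["types"]) | set(record_b["types"])),
    }
    for record in (record_a, record_b):
        for key, value in record.items():
            if key not in ("name", "types"):
                fused.setdefault(key, value)
    for weaker, stronger in SUBSUMED_BY.items():
        if weaker in fused and stronger in fused:
            del fused[weaker]
    return fused
```

Calling `fuse("ID1", SOURCE_A, SOURCE_B)` after `same_entity` returns true yields one entity that carries both types and the `goals` property from Source B.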
9. Entity Normalization & Variant Generation
Learning: Structured Representations Capture Entity Semantic Structure
[COLING’2018] Exploiting Structure in Representation of Named Entities using Active Learning.
[ICDE’2018] LUSTRE: An Interactive System for Entity Structured Representation and Variant Generation.
Generated normalizers for Watson Discovery
[AAAI’2020] PARTNER: Human-in-the-Loop Entity Name Understanding with Deep Learning.
[EMNLP’2020] Learning Structured Representations of Entity Names using Active Learning and Weak Supervision.
“Bank of America N.A.” ↔ “Bank of America National Association”
Pattern-Based: Synthesizing Normalization and Variant Generation Functions
“97/01/13” → “January 13, 1997”
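To make the two examples above concrete, here is a hand-written stand-in for the kind of normalization and variant generation functions the slide refers to. The cited systems synthesize such functions; this sketch does not. The `YY/MM/DD` reading of the date, the `century_pivot` parameter, and the `SUFFIX_EXPANSIONS` table are assumptions chosen to reproduce the slide's examples.

```python
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Assumed expansion table; a real system would learn or synthesize these.
SUFFIX_EXPANSIONS = {"N.A.": "National Association"}


def normalize_date(value, century_pivot=30):
    """Normalize a 'YY/MM/DD' string to 'Month D, YYYY'; return None if it does not fit."""
    match = re.fullmatch(r"(\d{2})/(\d{2})/(\d{2})", value)
    if match is None:
        return None
    yy, mm, dd = (int(group) for group in match.groups())
    if not (1 <= mm <= 12 and 1 <= dd <= 31):
        return None
    # Assumption: two-digit years at or above the pivot belong to the 1900s.
    year = 1900 + yy if yy >= century_pivot else 2000 + yy
    return f"{MONTHS[mm - 1]} {dd}, {year}"


def name_variants(name):
    """Return the name together with its abbreviated or expanded suffix variant."""
    variants = {name}
    for short, long in SUFFIX_EXPANSIONS.items():
        if name.endswith(" " + short):
            variants.add(name[: -len(short)] + long)
        elif name.endswith(" " + long):
            variants.add(name[: -len(long)] + short)
    return variants
```

With these definitions, `normalize_date("97/01/13")` gives the normalized date from the slide, and `name_variants` maps either form of the bank name to the pair of variants.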
11. Graph Completion via Ontology Inference
KG + Ontology Inference Rules → Updated KG
A has_mother B → B has_child A
A has_father B → B has_child A
A has_spouse B → B has_spouse A
A contains B → B is_part_of A
A has_child B ∧ A has_child C → B has_sibling C ∧ C has_sibling B
… …
12. Example Inference
Who’s Kylian Mbappé’s mother?
Source: https://en.wikipedia.org/wiki/Kylian_Mbappé
No information
about his mother
13. Example Inference
Who’s Kylian Mbappé’s mother?
Source: https://www.wikidata.org/wiki/Q45094361
A has_child B ∧ A is_a female → B has_mother A
Fayza Lamri has_child Kylian Mbappé
Fayza Lamri is_a female
Kylian Mbappé has_mother Fayza Lamri
Infer high-quality facts at scale
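The inference rules on the preceding slides can be applied with a simple forward-chaining loop. The sketch below is an illustration under stated assumptions, not the production inference engine: rules are hard-coded rather than read from an ontology, and the sibling rule adds a guard (B ≠ C) that the slide does not spell out.

```python
# Rules of the form "A p B → B q A".
INVERSE = {
    "has_mother": "has_child",
    "has_father": "has_child",
    "contains": "is_part_of",
}
# Rules of the form "A p B → B p A".
SYMMETRIC = {"has_spouse"}


def infer(triples):
    """Apply the rules repeatedly until no new (subject, predicate, object) triple appears."""
    facts = set(triples)
    while True:
        new = set()
        children = {}
        for s, p, o in facts:
            if p in INVERSE:
                new.add((o, INVERSE[p], s))
            if p in SYMMETRIC:
                new.add((o, p, s))
            if p == "has_child":
                children.setdefault(s, set()).add(o)
                # A has_child B ∧ A is_a female → B has_mother A
                if (s, "is_a", "female") in facts:
                    new.add((o, "has_mother", s))
        # A has_child B ∧ A has_child C → B has_sibling C ∧ C has_sibling B
        for kids in children.values():
            for b in kids:
                for c in kids:
                    if b != c:  # assumption: nobody is their own sibling
                        new.add((b, "has_sibling", c))
        if new <= facts:
            return facts
        facts |= new
```

Given the two Wikidata facts about Fayza Lamri above, `infer` adds the `has_mother` triple for Kylian Mbappé and then stops, because the inverse rule only regenerates a fact that is already present.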
14. Fact Editing for LLM
Ontology-Guided Evaluation
Source: Evaluating the Ripple Effects of Knowledge Editing in Language Models https://arxiv.org/pdf/2307.12976.pdf
17. Scale Fact Collection
Missing / stale facts
Baseline: Missing Facts → Query Synthesizer → QA System → candidate facts → New Facts
18. Scale Fact Collection
Missing / stale facts
Baseline: Missing Facts → Query Synthesizer → QA System → candidate facts → New Facts
Query-by-Committee: Missing Facts → Query Synthesizer → queries Q1 … Qn → QA System (one run per query) → AnswerSet1 … AnswerSetn → QbC Selector → candidate facts → New Facts
Success rate of fact collection: 25%
[EMNLP-DaSH’2022] Improving Human Annotation Effectiveness for Fact Collection by Identifying the Most Relevant Answers
19. Scale Fact Collection
Missing / stale facts
Baseline: Missing Facts → Query Synthesizer → QA System → candidate facts → New Facts
Open Domain Knowledge Extraction: Missing Facts → Query Synthesizer → Web Search → Knowledge Extractor → candidate facts w/ lower confidence → Fact Corroboration → New Facts
Throughput vs. manual fact collection: >100x
[SIGMOD’23] Growing and Serving Large Open-domain Knowledge Graphs.
20. Extraction: Pattern vs. LLM
* All details simplified for presentation
InfoBox Content:
Key | Value
Full name | José Carlos Moreira Varela
Date of birth | 15 September 1997 (age 26)
Place of birth | Praia, Cape Verde
Height | 1.68 m (5 ft 6 in)
… | …
Pattern-based Extractors: Key-Value Pair Extractor → Height Extractor
If entity.type = “Person” And tuple.key = “Height”
Return height = extract(tuple.value, “\d?.\d+\s*m”)
Height = 1.68 m
LLM-based Extractor: Prompt (Demonstrate Example + InfoBox Content) → LLM
Prompt:
You are an accurate information extraction system responsible to find answers to a set of questions solely from a given passage.
For example
Now please work on the following task:
Questions: height
Passage:
Title: José Varela
Infobox properties:
{“Full name": "José Carlos Moreira Varela”
“Date of birth”: “15 September 1997 (age 26)”
“Place of birth”: “Praia, Cape Verde”
“Height”: “1.68 m (5 ft 6 in)”
… …}
Height = 1.68 m
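The pattern-based height rule above can be written as a small runnable function. This is a sketch under assumptions: the infobox is modelled as a plain dictionary, the function name `extract_height` is made up for the example, and the regular expression escapes the decimal point and adds a word boundary, which the simplified slide pattern does not show.

```python
import re

# Assumed reading of the slide's pattern, with the decimal point escaped
# and a word boundary so that "m" is not matched inside a longer word.
HEIGHT_PATTERN = re.compile(r"\d?\.\d+\s*m\b")


def extract_height(entity_type, infobox):
    """Return the height string for a Person infobox, or None when no rule fires."""
    if entity_type != "Person":
        return None
    value = infobox.get("Height")
    if value is None:
        return None  # no Height row: the pattern-based extractor stays silent
    match = HEIGHT_PATTERN.search(value)
    return match.group(0) if match else None


# Key-value pairs as produced by a (hypothetical) key-value pair extractor.
varela_infobox = {
    "Full name": "José Carlos Moreira Varela",
    "Date of birth": "15 September 1997 (age 26)",
    "Place of birth": "Praia, Cape Verde",
    "Height": "1.68 m (5 ft 6 in)",
}
```

Because the function only fires when a Height row exists and matches, an infobox without that row yields `None` rather than a guessed value.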
21. Extraction: Pattern vs. LLM
* All details simplified for presentation purpose
InfoBox Content:
Key | Value
Born | 5 September 1808, Calcutta …
Died | 30 May 1869 (aged 60) ..
Political Party | Liberal Party.
Spouse | Annie Henrietta Templer …
… | …
Pattern-based Extractors: Key-Value Pair Extractor → Height Extractor
If entity.type = “Person” And tuple.key = “Height”
Return height = extract(tuple.value, “\d?.\d+\s*m”)
Height = null
LLM-based Extractors: Prompt (Demonstrate Example + InfoBox Content) → LLM
Prompt:
You are an accurate information extraction system responsible to find answers to a set of questions solely from a given passage.
For example
Now please work on the following task:
Questions: height
Passage:
Title: Sir Arthur William Buller
Infobox properties:
{“Born": “5 September 1808”
“Calcutta, British India”
… …}
Height = 1.80 m (hallucination)
22. Extraction: Pattern vs. LLM
* All details simplified for presentation purpose
InfoBox Content:
Key | Value
Born | Jacques Haussmann, …
Died | October 31, 1988 (aged 86) …
Citizenship | American
Education | Clifton College
… | …
Pattern-based Extractors: Key-Value Pair Extractor → Spouse Extractor
If entity.type = “Person” And tuple.key = “Spouses”
Return spouse = extract(tuple.value, PersonNameRegex), start time = extract(tuple.value, StartTimeRegex), end time = extract(tuple.value, EndTimeRegex)
Spouse = Zita Johann
Start time = 1929
End time = 1933
Spouse = Joan Courtney
Start time = 1952
End time = 1988
LLM-based Extractors: Prompt (Demonstrate Example + InfoBox Content) → LLM
Prompt:
You are an accurate information extraction system responsible to find answers to a set of questions solely from a given passage.
For example
Now please work on the following task:
Questions: spouse
Passage:
Title: John Houseman
Infobox properties:
{“Born”: “Jacques Haussmann”
“September 22, 1902”
… …}
Spouse = Zita Johann
Start time = 1929
End time = 1933
Incomplete
23. Extraction: Pattern vs. LLM
A Side-by-Side Comparison
Criterion | Pattern-based | LLM-based
Throughput | High | Low
Quality of Results, Simple Cases | High | High
Quality of Results, Complex Cases | High | Medium
Development Effort, Simple Cases | Medium | Medium
Development Effort, Complex Cases | High | Low
24. Extraction: Pattern vs. LLM
Opportunity to Get the Best of Both Worlds
Source: Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes https://arxiv.org/pdf/2304.09433.pdf
A recent example
Additional reading: Large Language Model Is Not a Good Few-shot Information Extractor, but a Good Reranker for Hard Samples!. https://arxiv.org/abs/2303.08559
25. Multilingual Coverage of KG
Coverage of entity names (Wikidata), on a 0% to 100% scale; the two values per language sum to 100:
Language | AR | DE | ES | FR | IT | JA | KO | RU | ZH
Series 1 (%) | 36 | 40 | 63 | 36 | 34 | 21 | 24 | 27 | 55
Series 2 (%) | 64 | 60 | 37 | 64 | 66 | 79 | 76 | 73 | 45
Major gap exists
[EMNLP’23] Increasing Coverage and Precision of Textual Information in Multilingual Knowledge Graphs
26. Multilingual Knowledge Graph Enrichment
Before: Existing KG → M-NTA → After: Multilingually-enriched KG
Increasing multilingual coverage of locale-specific facts.
[EMNLP’23] Increasing Coverage and Precision of Textual Information in Multilingual Knowledge Graphs
27. M-NTA | Multi-source Naturalization, Translation, and Alignment
Leverages complementary knowledge across locales and tools
KG → Naturalization (triple-to-text) → Machine Translation, Web Search, LLMs → Alignment (text-to-triple) → Ensemblement (Triple Selection)
KG triples: ⟨Apple, is_a, fruit of the apple tree⟩, ⟨Apple, is_a, American multinational technology company⟩, …
Naturalization: “Apple is a fruit of the apple tree”, “Apple is an American multinational technology company”, …
Translation: リンゴはリンゴの木の実です, りんごはりんごの木の実です, …
Alignment: ⟨Apple, is_a, fruit of the apple tree⟩ aligned with リンゴはリンゴの木の実です, りんごはりんごの木の実です, …
Candidate counts: リンゴ 6, りんご 4, 果実 1
Triple Selection: リンゴ 6, りんご 4
[EMNLP’23] Increasing Coverage and Precision of Textual Information in Multilingual Knowledge Graphs
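The final selection step, where candidates with counts 6 and 4 are kept and the candidate with count 1 is dropped, can be sketched as a vote over the outputs of the different sources. The threshold `min_votes` and the function name are assumptions for illustration; the slide does not state the actual selection criterion.

```python
from collections import Counter


def select_translations(candidates, min_votes=2):
    """Keep candidate names proposed at least `min_votes` times, most frequent first.

    `candidates` is the flat list of aligned names produced by all sources
    (machine translation, web search, LLMs) for one entity and one language.
    """
    counts = Counter(candidates)
    return [(text, n) for text, n in counts.most_common() if n >= min_votes]


# Hypothetical aligned outputs reproducing the counts shown on the slide.
aligned = ["リンゴ"] * 6 + ["りんご"] * 4 + ["果実"]
selected = select_translations(aligned)
```

A higher `min_votes` trades coverage for precision: fewer names survive, but each is backed by more independent sources.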
28. Improve Question Answering
Reduce the number of unanswerable queries
DE: +12.1%
ES: +14.4%
FR: +13.4%
ZH: +26.9%
JA: +18.1%
[EMNLP’23] Increasing Coverage and Precision of Textual Information in Multilingual Knowledge Graphs
MKQA Dataset 2
Dec. 9. Poster Session 4
Daniel Lee
Simone Conia
32. Introspection
Constraint Violation Detection
Source: https://en.wikipedia.org/wiki/Plato
Extracted facts:
- Date of birth: 428/427, 424/423 BC
- Date of death: 348 BC
Potential error:
- Two dates of birth
Actual error:
- Extracted date of birth is later than date of death (428/427 vs 348 BC)
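The two checks above, a cardinality constraint and an ordering constraint, can be sketched as follows. The representation is an assumption: years are signed integers (negative for BC), and the first extracted birth year is positive because its "BC" marker is taken to have been lost during extraction.

```python
def check_person(facts):
    """Return constraint violations for one person.

    `facts` maps a property name to a list of signed years (negative = BC).
    """
    issues = []
    births = facts.get("date_of_birth", [])
    deaths = facts.get("date_of_death", [])
    # Cardinality constraint: a person normally has a single date of birth.
    if len(births) > 1:
        issues.append("potential error: multiple dates of birth")
    # Ordering constraint: birth must not come after death.
    if any(birth > death for birth in births for death in deaths):
        issues.append("actual error: date of birth later than date of death")
    return issues


# Hypothetical extraction result for the Plato example: "428/427" lost its
# BC marker (read as year 428), "424/423 BC" and "348 BC" kept theirs.
plato = {"date_of_birth": [428, -424], "date_of_death": [-348]}
```

Flagging the cardinality case as "potential" and the ordering case as "actual" mirrors the slide: two birth dates may reflect genuine uncertainty in the source, while a birth after death cannot be correct.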
34. [EMNLP’23] FLEEK: Factual Error Detection and Correction with Evidence Retrieved from External Knowledge
FLEEK
Factual Error Detection and Correction with Evidence Retrieved from External Knowledge
36. FLEEK
Factual Error Detection and Correction with Evidence Retrieved from External Knowledge
Input Text → Fact Extraction (text-to-triple) → Question Generation (triple-to-question) → Evidence Retrieval → Verification → Revision → Final Correction
[EMNLP’23] FLEEK: Factual Error Detection and Correction with Evidence Retrieved from External Knowledge
38. Configurable Attributes
User Experience Level
- Voice Interaction
- Search Interaction
Metadata Level
- Popularity Scores - Long Tail Entities
- Timestamps
Conversation Level
- Topic Exploration
- Extend to Related Entities & Neighbors
39. Voice Assistant Questions
More well-formed questions, with a small mix of queries
Disfluencies - yes
Deixis - yes
Web Search Queries
Often short queries, mimic search engine interactions but with follow-ups
Disfluencies - no
Deixis - yes
Typos - yes
40. Voice Assistant Questions
Disfluencies & Deixis
Question: Hmm, which languages does Karl Wolff use
Answer: German
Question: Could you please, um, inform me about his military branch
Answer: Waffen-SS
Question: Do you know which wars he was a part of
Answer: ['Italian campaign', 'World War II', 'World War I']
Question: Do you know his military ranks
Answer: ['Obergruppenführer', 'general']
Question: Do you know his date of birth
Answer: +1900-05-13
Question: Where was he born
Answer: Darmstadt
Question: Can you, uh, tell me when this military person died
Answer: +1984-07-15
41. Voice Assistant Questions with Related Entities
Question: Do you know any languages that Karl Wolff speaks
Answer: German
Question: Which military branch is he a part of
Answer: Waffen-SS
Question: Could you please, um, inform me about the wars he was involved in
Answer: ['Italian campaign', 'World War II', 'World War I']
Question: What about Sepp Dietrich
Answer: World War I
Question: Can you tell me, um, Karl's military rank
Answer: ['Obergruppenführer', 'general']
Question: How about Sep
Answer: SS-Oberst-Gruppenführer
Question: Can you, uh, tell me the birthplace of Karl
Answer: Darmstadt
Question: Ermm, what about Sepp
Answer: Hawangen
Primary Entity Related Entity
42. Web Search Queries — Short & Keyword-esque
Question: Karl Wolff country of citizenship
Answer: Germany
Question: wars involving him
Answer: ['Italian campaign', 'World War II', 'World War I']
Question: Also for Sepp Dietrich
Answer: World War I
Question: Karl place of birth
Answer: Darmstadt
Question: Answer for Sepp
Answer: Hawangen
Question: Karl died in
Answer: Rosenheim
Question: For Sepp
Answer: Ludwigsburg
Question: Karl military rank
Answer: ['Obergruppenführer', 'general']
43. Web Search Queries + Typos
Question: Kerl Wilff contry of citizenship
Answer: Germany
Question: wars involvng him
Answer: ['Italian campaign', 'World War II', 'World War I']
Question: Also fr Sepp Dietrich
Answer: World War I
Question: KJarl place of birth
Answer: Darmstadt
Question: Answer for Sep
Answer: Hawangen
Question: Karl died in
Answer: Rosenheim
Question: For Sepp
Answer: Ludwigsburg
Question: Karl mlitary rsnk
Answer: ['Obergruppenführer', 'general']
44. Dataset Statistics - Exhaustive & Evergrowing
Dataset | # Entities (# Conversations) | # Facts | # Questions Per Fact | # Unique Types | # Unique Predicates
General Set | 29M | 196M | 12 (Web) + 12 (Voice) | 274 | 1252
Related Entities Set | 210K | 6.1M | 24 [+ 30 (RE Follow-Up)] | 95 | 265
45. Evaluation - Effectiveness of LLMs on these conversations
General Subset
Model | Question Type / Experience | Accuracy
GPT-3.5 | Voice Assistant | 25.9
GPT-4 | Voice Assistant | 32.4
GPT-3.5 | Web Search | 28.6
GPT-4 | Web Search | 35.7
Related Entity Subset
Model | Question Type / Experience | Accuracy
GPT-3.5 | Voice Assistant | 37.7
GPT-4 | Voice Assistant | 44.4
GPT-3.5 | Web Search | 38.7
GPT-4 | Web Search | 46.7
46. Direct Triple Retrieval
Triple Retrieval: Direct Retrieval without Entity Linking
Query → Triple Index → top triples ([S1, R1, O1], [S2, R2, O2], …, [S100, R100, O100]) → LLM (Prompting for Answer Generation) → Answer
Retrieval model: Query → BERT → h_q; Triple → BERT → h_t; sim(q, t) = h_q^T h_t
Prompt:
You are a question answering agent.
You will always provide short concise answers.
Based on the following evidence:
Fact 1: …..
Fact 2: …..
…
Fact N: …..
Answer the question using only the evidence above:
Query
Can only work well for simple questions!
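The retrieve-then-prompt flow above can be sketched end to end with a stand-in encoder. The key assumption is the encoder itself: instead of BERT, `encode` builds a sparse bag-of-words vector so the example stays self-contained, while `sim` remains the inner product h_q^T h_t from the slide. The function names and the example triples are hypothetical.

```python
from collections import Counter


def encode(text):
    """Stand-in encoder (assumption): sparse bag-of-words counts instead of BERT."""
    return Counter(text.lower().split())


def sim(h_q, h_t):
    """Inner product of two sparse vectors, i.e. sim(q, t) = h_q^T h_t."""
    return sum(weight * h_t[token] for token, weight in h_q.items())


def retrieve(query, triples, k=100):
    """Score every triple against the query directly, with no entity linking step."""
    h_q = encode(query)
    ranked = sorted(triples, key=lambda t: sim(h_q, encode(" ".join(t))), reverse=True)
    return ranked[:k]


def build_prompt(query, facts):
    """Assemble the answer-generation prompt shown on the slide."""
    lines = [
        "You are a question answering agent.",
        "You will always provide short concise answers.",
        "Based on the following evidence:",
    ]
    lines += [f"Fact {i}: {' '.join(fact)}" for i, fact in enumerate(facts, start=1)]
    lines += ["Answer the question using only the evidence above:", query]
    return "\n".join(lines)


# Hypothetical triple index.
TRIPLES = [
    ("Connor McDavid", "goals", "43"),
    ("Connor McDavid", "place of birth", "Richmond Hill"),
    ("Richmond Hill", "type", "CITY"),
]
```

The resulting prompt string would then be sent to an LLM. Scoring single triples independently is also why this setup struggles beyond simple questions: a multi-hop query needs facts that no single triple matches well.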
47. Subgraph + Triple Retrieval (Ours)
• We consider two types of subgraphs:
• Cliques: Subgraph containing predicates of an entity.
• 2-hop subgraphs: Subgraph containing predicates of one and two-hop entities together
Original Graph → Cliques, 2-hop Subgraphs
Direct Retrieval without Entity Linking
Query → Subgraph Index (Subgraph Retrieval) → Triple Selection → top triples ([S1, R1, O1], [S2, R2, O2], …, [S100, R100, O100]) → LLM (Prompting for Answer Generation) → Answer
48. Evaluation
Public Benchmark
System | Accuracy
System A | 15.4
Direct Triple Retrieval | 53.9
Subgraph + Triple Retrieval (Ours) | 56.3
Internal Benchmark
System | Dataset 1 | Dataset 2
System A | 17.2 | 21.1
Ours | 24.1 | 26.3
49. Wins
Overcoming Intent Detection Errors
Query: how old was Ronald Reagan when he was
inaugurated president
System A Answer
Ronald Reagan died June 5,
2004 at age 93 in X.
Our Answer
Ronald Reagan was 69 years old
when he was inaugurated president.
Query: what movies were Bill Cosby and Sidney Poitier in?
System A Answer
<empty>
Our Answer
Bill Cosby and Sidney Poitier have been
in several movies together, including:
Uptown Saturday Night , Let's Do It
Again
Query: who is the female lead in the movie music man?
System A Answer
<empty>
Our Answer
The female lead in the movie Music
Man is Shirley Jones.
Handling Multi-Hop Queries
Query: The drummer for Nirvana was born in what city?
System A Answer
<empty>
Our Answer
The drummer for Nirvana, Dave Grohl, was born in Warren, Ohio.
53. Offline Entity Linking
Enable Better Ranking and Search
“They bought tickets for Beyoncé; she’s performing at T-Mobile Park”
KG:123
KG Info
KG:345
KG Info
0.914
0.312
Entity Importance
Entity Importance
Entity Embedding Index
related entities (approximate nearest neighbor search)
"entity_name": "Beyonce Knowles",
"entity_types": [
"artist",
"human",
"writer"],
…
"entity_name": "T-Mobile Park",
"entity_types": [
"stadium",
"POI",
"location"],
…
54. Custom Configuration
Specify what types should be included and what should not be present
Improve linking quality
Include:
City, Natural Place, Landmark, National Park, …
Exclude:
Company, Hospitals, Person, …
Example:
For “weather” use cases
55. Custom Tag Configuration
Example Use Case: Weather
Weather in Obama
Source: duckduckgo.com
Obama [Person]: won’t be considered with the configuration
Obama [City]
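The include/exclude configuration for the weather use case can be sketched as a filter over entity linking candidates. The candidate structure, the function name, and the rule that an excluded type always wins over an included one are assumptions for illustration.

```python
# Type lists for the "weather" use case, as listed on the configuration slide.
WEATHER_INCLUDE = {"City", "Natural Place", "Landmark", "National Park"}
WEATHER_EXCLUDE = {"Company", "Hospitals", "Person"}


def filter_candidates(candidates, include, exclude):
    """Keep candidates that have an included type and no excluded type."""
    kept = []
    for candidate in candidates:
        types = set(candidate["types"])
        if types & exclude:
            continue  # assumption: exclusion takes precedence
        if include and not (types & include):
            continue
        kept.append(candidate)
    return kept


# Hypothetical candidates for the mention "Obama" in "Weather in Obama".
obama_candidates = [
    {"name": "Obama", "types": ["Person"]},
    {"name": "Obama", "types": ["City"]},
]
weather_candidates = filter_candidates(obama_candidates, WEATHER_INCLUDE, WEATHER_EXCLUDE)
```

Filtering before ranking means the person, however popular, never competes with the city for this use case.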
57. Fact Ranking & Related Entities
Embed entities / relations / queries in embedding space
Query processing = nearest neighbor search
Lady Gaga, occupation, ?
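Answering a query such as (Lady Gaga, occupation, ?) by nearest neighbor search can be sketched with toy vectors. Everything numeric here is made up: the two-dimensional embeddings are hand-picked, and the translation-style scoring (subject vector plus relation vector should land near the object vector) is one common embedding scheme assumed for the example, not necessarily the model used in the talk.

```python
import math

# Hand-picked toy embeddings (assumption); real ones are learned from the KG.
ENTITY = {
    "Lady Gaga": (0.0, 0.0),
    "musician": (1.0, 0.0),
    "actress": (1.0, 0.6),
    "stadium": (-1.0, 2.0),
}
RELATION = {"occupation": (1.0, 0.1)}


def rank_objects(subject, relation, k=2):
    """Return the k entities closest to subject + relation in embedding space."""
    sx, sy = ENTITY[subject]
    rx, ry = RELATION[relation]
    target = (sx + rx, sy + ry)
    scored = [
        (math.dist(target, vector), name)
        for name, vector in ENTITY.items()
        if name != subject
    ]
    return [name for _, name in sorted(scored)[:k]]


occupations = rank_objects("Lady Gaga", "occupation")
```

The distance ordering doubles as a fact ranking: closer objects are treated as more salient answers. At scale, the exhaustive loop would be replaced by an approximate nearest neighbor index.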
58. Related Entities
LLM
Entity Descriptions
Embeddings
Search
Query Logs
Entity Co-occurrence
Reranking Related Entities KV Store
KG
59. Example Use Case
Fact Ranking and Related Entities
Lady Gaga
Song: Shallow, Dance, Telephone, …
Album: The Fame, Artpop, Chromatica, …
Related Entities: Ariana Grande, Beyoncé, Bradley Cooper, …
Movie: House of Gucci, A Star Is Born, Sin City: A Dame to Kill For, …
Fact Ranking: Lady Gaga is first a musician then an actress
60. Example Use Case
Fact Ranking and Related Entities
Relatedness: Based on KG + query log
62. LLMs vs. KGs
Source: Shirui Pan, et al. Unifying Large Language Models and Knowledge Graphs: A Roadmap. https://arxiv.org/abs/2306.08302
63. Thanks!
IBM (including interns):
Shivakumar Vaithyanathan
Sriram Raghavan
Rajasekar Krishnamurthy
Lucian Popa
Ron Fagin
Fred Reiss
Laura Chiticariu
Mauricio Hernandez
Eser Kandogan
Huaiyu Zhu
Kun Qian
Dakuo Wang
Maeda Hanafi
Many amazing collaborators and interns …
Apple (including interns):
Ihab Ilyas
Theodoros Rekatsinas
Umar Farooq Minhas
Ali Mousavi
Jefferey Pound
Anil Pacaci
Hongyu Ren
Kun Qian
Fei Wu
Simone Conia
Sha (Zoey) Li
Azadeh Nikfarjam
Yisi Sang
Saloni Potdar
Farima Fatahi Bayat … …
Universities:
Azza Abouzeid (NYU-Abu Dhabi)
H. V. Jagadish (U. Of Michigan)
Fei Xia (U. Of Washington)
Kevin Chen-Chuan Chang (UIUC)
ChengXiang Zhai (UIUC)
Domenico Lembo (Sapienza University of Rome)
Dragomir R. Radev (Yale)
Jonathan K. Kummerfeld (U. of Michigan)
Toby Li (U. of Notre Dame)
Rishabh Iyer (UT Dallas)
Eduard C. Dragut (Temple Univ.) … ….
Douglas Burdick
Alan Akbik
Nancy Wang
Prithiviraj Sen
Marina Danilevsky
Poornima Chozhiyath Raman
Sudarshan Rangarajan
Ramiya Venkatachalam
Kiran Kate
Chenguang Wang
Ishan Jindal
Yiwei Yang
Nikita Bhutani … ….