Natural languages generally exhibit phenomena such as hyponymy/hypernymy, synonymy, and heteronymy. These linguistic phenomena lead to problems in the IR field, e.g. term mismatch. In this talk, I show how these phenomena and their related problems make the direct word-based document-query comparison used in most classical IR models insufficient. More precisely, I discuss using concepts instead of words to represent the content of documents and queries, so that matching takes place in a more informative space than the word space, one where the effects of some linguistic phenomena are limited. I focus on two problems that appear when moving to the conceptual space: document length deformation and the quantification of inter-concept relations.
This presentation is mainly based on my talk at University of Lugano (Università della Svizzera italiana).
Beyond Classical Information Retrieval (IR): Conceptual IR
1. Karam Abdulahhad
GESIS - Cologne
karam.abdulahhad@gesis.org
karam.abdulahhad@gmail.com
Beyond Classical Information Retrieval (IR)
Conceptual IR
2. Linguistic phenomena & IR problems
20-12-2018 | GESIS - K.Abdulahhad
How have “fiddles” changed over time
Violins
Like most technological breakthroughs, today's
violin is an evolutionary product. So far as we
know, there were no violins in 1500. A century
later, there were several types and probably
thousands of specimens north and south of the
Alps, and from England to Poland. A marvel of
craftsmanship and acoustical engineering, the
violin produced more sound than any stringed
instrument to date. Almost immediately,
composers, players and collectors liked what
they heard and saw. Italian and non-Italian
makers proliferated.
……….
3. Linguistic phenomena & IR problems
Historical information about “sugar river bank”
History and Mission Statement
…………
The Bank continues to grow at a healthy pace.
We have continued to do well and be a leader
in our industry. Our main branch was expanded
in 1982 and we now have branches in Sunapee,
New London, Warner, Grantham and Concord.
We at Sugar River Bank are proud of our
history and growth. It is the responsibility of
each and every member of our Bank's family to
insure continued growth in the future.
…………
www.sugarriverbank.com
4. Linguistic phenomena & IR problems
Historical information about “sugar river bank”
The Life-Sustaining Sugar River
…………
The west branch of the Sugar River historically
supported a native trout population, but had
suffered from sedimentation, overgrazing of its
banks and warming water. “Restoration efforts
in the Dane County portion of the watershed
reduced nonpoint source pollution, installed
riverbank vegetative filter strips, improved in-stream
habitat, restricted cattle access to
streams, and improved management of animal
waste from barnyards,” says Hansis.
…………
northwestquarterly.com
5. Linguistic phenomena & IR problems
Part-Whole
Hand Body
Heteronyms
Bank(com) Bank(geo)
Hyponym / Hypernym
B-cell Lymphocyte
Synonyms
Violin Fiddle
Co-hyponym
Cat Dog
6. Observations
1. Inadequacy of the term-independence assumption,
which leads to the term-mismatch problem
7. Observations
1. Inadequacy of the term-independence assumption,
which leads to the term-mismatch problem
2. The retrieval process has an inferential nature, for which
the classical word-based document-query comparison
paradigm is insufficient
10. Conceptual approach
Concepts are categories encompassing all synonymous
terms
Atrial fibrillation
Auricular fibrillation
C0004238
Ticker
Watch
S04563183
Cancer
Malignant neoplastic disease
S14263400
WordNet
Snake
Serpent
Ophidian
S01729333
UMLS
Skin cancer
Melanoma
Malignant neoplasm of skin
C0004238
11. Conceptual approach
Concepts are categories encompassing all synonymous
terms
Using concept IDs instead of terms
Atrial fibrillation
Auricular fibrillation
C0004238
Ticker
Watch
S04563183
Cancer
Malignant neoplastic disease
S14263400
WordNet
Snake
Serpent
Ophidian
S01729333
UMLS
Skin cancer
Melanoma
Malignant neoplasm of skin
C0004238
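The idea of the two slides above can be sketched in a few lines: synonymous terms collapse onto one concept ID, so documents and queries are matched in concept space rather than word space. The IDs reuse the WordNet examples from the slides; the violin/fiddle ID is hypothetical (the slides give none for that pair), and a real system would map text to concepts with a resource such as WordNet or UMLS (e.g. via MetaMap).

```python
# Synonym -> concept ID dictionary; snake/serpent/ophidian and ticker/watch
# IDs are taken from the slides, S_VIOLIN is a hypothetical placeholder ID.
TERM_TO_CONCEPT = {
    "snake": "S01729333", "serpent": "S01729333", "ophidian": "S01729333",
    "ticker": "S04563183", "watch": "S04563183",
    "violin": "S_VIOLIN", "fiddle": "S_VIOLIN",
}

def to_concepts(text):
    """Word-level mapping of known terms to concept IDs (multiword terms
    would need a phrase-aware mapper such as MetaMap)."""
    return [TERM_TO_CONCEPT[w] for w in text.lower().split() if w in TERM_TO_CONCEPT]

# A query about a "fiddle" now matches a document about a "violin":
assert set(to_concepts("a fiddle concert")) & set(to_concepts("the violin maker"))
```

In this representation the term-mismatch examples from the earlier slides disappear, since both surface forms index to the same identifier.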
12. Part I: Relative Concept Frequency
[1] K. Abdulahhad et al., Revisiting the Term Frequency in Concept-Based IR Models. DEXA 2013
[2] K. Abdulahhad et al., MRIM at ImageCLEF2012. From Words to Concepts: A New Counting Approach. CLEF 2012
17. Relative Concept Frequency (idea)
Use all concepts while maintaining the word-based document length
Structure-based redistribution of the word-based document length over concepts
19. Relative Concept Frequency (how)
Computing relative frequency
Hypothesis 1: concepts of a larger phrase receive a larger count (more specific meaning)
20. Relative Concept Frequency (how)
Computing relative frequency
Hypothesis 1: concepts of a larger phrase receive a larger count (more specific meaning)
Hypothesis 2: the larger the set of concepts of a phrase, the smaller the count each of its concepts receives (ambiguity)
21. Relative Concept Frequency (how)
Computing relative frequency
Hypothesis 1: concepts of
larger phrase receive larger
count (more specific meaning)
Hypothesis 2: the bigger the
set of concepts is for a phrase,
the less important count its
concepts receive (ambiguity)
Hypothesis 3: maintaining
word-based 𝑑
24. Computing Relative Concept Frequency
(Step 3)
Step 3: compute relative frequency 𝑟𝑓𝑖
Breadth-first search
The relative frequency 𝑟𝑓𝑖 of 𝑐 ∈ 𝐶𝑖 must be proportional to |𝑇𝑖| (Hypothesis 1), and inversely proportional to |𝐶𝑖| (Hypothesis 2)
Maintaining |𝑑| by distributing it over the concepts of 𝑑 (Hypothesis 3).
Sub-phrase 𝑇𝑖 (|𝑇𝑖|, |𝐶𝑖|) → Concepts (each receiving 𝑟𝑓𝑖)
𝑇1: ‘lobar pneumonia’ (|𝑇1| = 2, |𝐶1| = 2) → C0032300, C0155862
𝑇2: ‘pneumonia x-ray’ (|𝑇2| = 2, |𝐶2| = 1) → C0581647
𝑇3: ‘lobar’ (|𝑇3| = 1, |𝐶3| = 3) → C1511010, C1428707, C0796494
𝑇4: ‘pneumonia’ (|𝑇4| = 1, |𝐶4| = 5) → C0024109, C1278908, C0032285, C2707265, C2709248
𝑇5: ‘x-ray’ (|𝑇5| = 1, |𝐶5| = 6) → C0034571, C0043299, C0043309, C1306645, C1714805, C1962945
[Figure: sub-phrase tree with root R over nodes (T1, C1), (T2, C2) and leaves (T3, C3), (T4, C4), (T5, C5)]
25. Computing Relative Concept Frequency
(Step 3)
We distribute the |𝑑| = 3 of the phrase ‘lobar pneumonia x-ray’ over its concepts
[Figure: the root R distributes |𝑑| = 3 down the tree of nodes (T1, C1) … (T5, C5)]
26. Computing Relative Concept Frequency
(Step 3)
Step 3: computing relative weight
For each node (𝑇𝑖, 𝐶𝑖) we compute three values:
𝛼𝑖, the amount to be distributed over the concepts of the current node (𝑇𝑖, 𝐶𝑖) and its children:
  𝛼𝑖 = Σ_{p ∈ parents(i)} 𝛿𝑝 × |𝑇𝑖|
𝛿𝑖, the portion of one single word of the input amount 𝛼𝑖:
  𝛿𝑖 = 𝛼𝑖 / (|𝑇𝑖| + Σ_{child ∈ children(i)} |𝑇child|)
𝛽𝑖, or equivalently 𝑟𝑓𝑖, the relative frequency of each concept 𝑐 ∈ 𝐶𝑖:
  𝛽𝑖 = (𝛿𝑖 × |𝑇𝑖|) / |𝐶𝑖|
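Step 3 can be made concrete with a short runnable sketch, assuming the sub-phrase tree implied by the example: the root R holds |𝑑| = 3 for ‘lobar pneumonia x-ray’, T1 = ‘lobar pneumonia’ and T2 = ‘pneumonia x-ray’ are its children, and the single words T3..T5 hang below them (T4 = ‘pneumonia’ has both T1 and T2 as parents). Each node keeps δ_i × |T_i| and passes δ_i × |T_child| down to each child.

```python
# node -> (|T_i|, |C_i|, children); structure inferred from the slide example.
dag = {
    "R":  (0, 0, ["T1", "T2"]),
    "T1": (2, 2, ["T3", "T4"]),   # 'lobar pneumonia'
    "T2": (2, 1, ["T4", "T5"]),   # 'pneumonia x-ray'
    "T3": (1, 3, []),             # 'lobar'
    "T4": (1, 5, []),             # 'pneumonia'
    "T5": (1, 6, []),             # 'x-ray'
}

def relative_frequencies(dag, doc_len):
    alpha = {n: 0.0 for n in dag}             # amount flowing into each node
    alpha["R"] = doc_len
    beta = {}
    for node in ["R", "T1", "T2", "T3", "T4", "T5"]:   # breadth-first order
        t, c, children = dag[node]
        denom = t + sum(dag[ch][0] for ch in children)
        delta = alpha[node] / denom           # per-word portion
        for ch in children:                   # pass shares down the tree
            alpha[ch] += delta * dag[ch][0]
        if c:                                 # split the kept amount over the
            beta[node] = delta * t / c        # node's concepts (Hypothesis 2)
    return beta

rf = relative_frequencies(dag, doc_len=3)
# rf["T2"] == 0.75: the unambiguous two-word phrase gets the largest share
```

The word-based document length is preserved (Hypothesis 3): summing 𝑟𝑓𝑖 × |𝐶𝑖| over all nodes gives back exactly |𝑑| = 3.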
41. Relative Concept Frequency (conclusion)
Dealing with the document length deformation
Encouraging results
Increases recall
Maintains or even increases precision
Can be used with classical IR models by changing only the TF component
48. Concept embedding (idea)
Concepts as vectors
Still using concepts to reduce mismatch effect
Avoiding the complexities of relation-based inter-concept similarity
49. Concept embedding (idea)
Concepts as vectors
Still using concepts to reduce mismatch effect
Avoiding the complexities of relation-based inter-concept similarity
Check adaptability of concept-embedding-based
similarity to IR
Goal
53. Concept embedding (experiments)
Experiments consist of two parts
Generating concept embedding vectors
Testing a vector-based concept similarity for ad-hoc IR
54. Concept embedding (experiments)
1. Generating concept embedding vectors
Word embedding
PubMed Central collection (vocabulary of 1,177,879 words)
Word2Vec
Vector size 500
Continuous bag of words
Window size 8
Negative sampling 25
55. Concept embedding (experiments)
1. Generating concept embedding vectors
Concept embedding
UMLS2017 concepts (only English content)
For each concept, we build the corresponding set of words
Flat embedding: replace the aggregation function F by avg
Hierarchical embedding: replace F by avg
Weighted embedding: replace F by weighted-avg
The weight 𝛼𝑤 of a word w is: 𝛼𝑤 = ln((N + 1) / n)
N is the number of documents in PubMed Central
n is the document frequency of w in PubMed Central
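The weighted embedding can be sketched as follows, under the assumed interpretation that a concept vector is the idf-weighted average of its words' vectors, with 𝛼𝑤 = ln((N + 1) / n). Toy 3-d vectors and document frequencies stand in for the 500-d word2vec vectors trained on PubMed Central.

```python
import math

N = 1000                                 # toy collection size
word_vec = {                             # toy word embeddings (3-d, not 500-d)
    "cancer": [0.9, 0.1, 0.0],
    "malignant": [0.7, 0.2, 0.1],
    "neoplastic": [0.6, 0.3, 0.1],
}
doc_freq = {"cancer": 400, "malignant": 50, "neoplastic": 20}

def concept_vector(words):
    """Weighted average of word vectors with weight alpha_w = ln((N+1)/n_w)."""
    weights = [math.log((N + 1) / doc_freq[w]) for w in words]
    total = sum(weights)
    dim = len(next(iter(word_vec.values())))
    return [sum(a * word_vec[w][d] for a, w in zip(weights, words)) / total
            for d in range(dim)]

vec = concept_vector(["cancer", "malignant", "neoplastic"])
```

Rare words (low n) receive higher weight, so they pull the concept vector more strongly than common, less discriminative words.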
56. Concept embedding (experiments)
1. Generating concept embedding vectors
Concept embedding (missing words)
Fixed random vectors
Several experiments for weighting missing words
The word is too popular: n = N (low idf)
The word is too rare: n = 1 (high idf)
Or in between: n = N/2
57. Concept embedding (experiments)
2. Testing a vector-based concept similarity for ad-hoc IR
Corpora
CLEF11 & CLEF12
Text to concepts mapping
MetaMap
UMLS concepts
58. Concept embedding (experiments)
2. Testing a vector-based concept similarity for ad-hoc IR
IR model and concept similarity
𝑅𝑆𝑉(𝑑, 𝑞) = Σ_{𝑐 ∈ 𝑞} 𝑤𝑒𝑖𝑔ℎ𝑡_𝑞(𝑐) × 𝑠𝑖𝑚(𝑐, 𝑐∗) × 𝑤𝑒𝑖𝑔ℎ𝑡_𝑑(𝑐∗), where 𝑐∗ ∈ 𝑑
59. Concept embedding (experiments)
2. Testing a vector-based concept similarity for ad-hoc IR
IR model and concept similarity
Weight(c): BM25 and Pivoted Normalization
𝑅𝑆𝑉(𝑑, 𝑞) = Σ_{𝑐 ∈ 𝑞} 𝑤𝑒𝑖𝑔ℎ𝑡_𝑞(𝑐) × 𝑠𝑖𝑚(𝑐, 𝑐∗) × 𝑤𝑒𝑖𝑔ℎ𝑡_𝑑(𝑐∗), where 𝑐∗ ∈ 𝑑
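The document-side weight can be sketched with the standard BM25 formula, here applied to concept frequencies instead of term frequencies. The parameter values k1 = 1.2 and b = 0.75 are the usual defaults, not values taken from the slides.

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 weight of a concept occurring tf times in a document of length
    doc_len, where df is the concept's document frequency in the collection."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / norm

w = bm25_weight(tf=3, df=50, n_docs=10000, doc_len=120, avg_doc_len=100)
```

With the Relative Concept Frequency from Part I, the fractional 𝑟𝑓𝑖 values would simply replace the integer tf here; BM25 accepts real-valued frequencies without change.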
60. Concept embedding (experiments)
2. Testing a vector-based concept similarity for ad-hoc IR
IR model and concept similarity
Weight(c): BM25 and Pivoted Normalization
Concept similarity
𝑅𝑆𝑉(𝑑, 𝑞) = Σ_{𝑐 ∈ 𝑞} 𝑤𝑒𝑖𝑔ℎ𝑡_𝑞(𝑐) × 𝑠𝑖𝑚(𝑐, 𝑐∗) × 𝑤𝑒𝑖𝑔ℎ𝑡_𝑑(𝑐∗), where 𝑐∗ ∈ 𝑑
𝑠𝑖𝑚(𝑐𝑖, 𝑐𝑗) = 0 if cos 𝜃 ≤ 0; 𝛽 × (cos 𝜃)² otherwise
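The retrieval formula above can be sketched end to end. One assumption is made explicit here: each query concept is paired with its best-matching document concept c*, a choice the slides leave implicit. The similarity clips negative cosines to 0 and squares the rest, scaled by β (default 1 below).

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def sim(ci, cj, beta=1.0):
    """sim(c_i, c_j) = 0 if cos(theta) <= 0, else beta * cos(theta)^2."""
    c = cosine(ci, cj)
    return beta * c * c if c > 0 else 0.0

def rsv(q_weights, d_weights, vectors):
    """q_weights/d_weights: concept -> weight (e.g. BM25 or Pivoted
    Normalization); vectors: concept -> embedding vector."""
    score = 0.0
    for c, wq in q_weights.items():
        # pair c with the best-matching document concept c* (assumption)
        c_star = max(d_weights, key=lambda cd: sim(vectors[c], vectors[cd]))
        score += wq * sim(vectors[c], vectors[c_star]) * d_weights[c_star]
    return score

vectors = {"C1": [1.0, 0.0], "C2": [0.8, 0.6], "C3": [-1.0, 0.0]}
score = rsv({"C1": 2.0}, {"C2": 1.5, "C3": 1.0}, vectors)
# C1 pairs with C2 (cos = 0.8): score = 2.0 * 0.8**2 * 1.5 = 1.92
```

Because sim(c, c) = β, the formula degenerates gracefully to classical exact-match scoring when query and document share the same concepts.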
61. Concept embedding (experiments)
2. Testing a vector-based concept similarity for ad-hoc IR
IR model and concept similarity
Weight(c): BM25 and Pivoted Normalization
Concept similarity
For comparison (Leacock)
𝑅𝑆𝑉(𝑑, 𝑞) = Σ_{𝑐 ∈ 𝑞} 𝑤𝑒𝑖𝑔ℎ𝑡_𝑞(𝑐) × 𝑠𝑖𝑚(𝑐, 𝑐∗) × 𝑤𝑒𝑖𝑔ℎ𝑡_𝑑(𝑐∗), where 𝑐∗ ∈ 𝑑
𝑠𝑖𝑚(𝑐𝑖, 𝑐𝑗) = 0 if cos 𝜃 ≤ 0; 𝛽 × (cos 𝜃)² otherwise
62. Concept embedding (experiments)
2. Testing a vector-based concept similarity for ad-hoc IR
Results
(*) indicates a statistically significant (𝛼 < 0.05) improvement w.r.t. “NoEmb-NoSim”
(†) indicates a statistically significant (𝛼 < 0.05) improvement w.r.t. “NoEmb-Leacock”
63. Concept embedding (conclusion)
Three approaches to build concept vectors based on word embeddings
Promising results for vector-based concept representation and similarity
Concepts and words are represented in the same vector space, so they are directly comparable
This could improve approaches like MetaMap
65. Conclusion
Dealing with the two observations
Inadequacy of the term-independence assumption
The retrieval process has an inferential nature
Conceptual IR
Document length deformation
Quantification of inter-concept relations