Improving Correlation with Human Judgments by Integrating Second-Order Vectors with Semantic Similarity
1. Improving Correlation with Human Judgments by Integrating Second-Order Vectors with Semantic Similarity
Bridget T. McInnes, PhD
Virginia Commonwealth University
Ted Pedersen, PhD
University of Minnesota, Duluth
tpederse@d.umn.edu
http://www.d.umn.edu/~tpederse
2. Measuring Similarity & Relatedness
● Similarity != Relatedness (!!!)
● Assign scores to pairs of concepts
● Compare to scores decided on by humans
● Measure correlation
– Often by rank, because scales differ
– Spearman's rank correlation coefficient
● Think about ways to do better
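The evaluation loop above (score pairs, rank them, correlate with human ranks) can be sketched in a few lines. This is a minimal illustration with invented scores, using the tie-free form of Spearman's rank correlation coefficient:

```python
# Sketch: comparing a measure's scores to human judgments with
# Spearman's rank correlation (pair scores below are invented).

def ranks(values):
    """Rank values from 1..n (no tie handling, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman's rho via the rank-difference formula (distinct values):
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))"""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

human   = [4.0, 3.5, 2.0, 1.0]   # human judgments, one per concept pair
measure = [0.9, 0.7, 0.4, 0.1]   # a measure's scores for the same pairs

print(spearman(human, measure))  # perfect rank agreement -> 1.0
```

Ranking before correlating is what makes the different score scales (e.g. 0–1 cosines vs. 1–5 human ratings) comparable.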
3. Contribution of this Work?
● We show that integrating a similarity measure into a second-order measure of relatedness improves correlation with human judgments
– Compare impact of various similarity measures
– Compare to other methods including word2vec
● Focus is on UMLS and medical concepts
although ideas apply more generally
4. Similar or Related?
● Similarity based on is-a relations
– How much is X like Y?
– Share ancestor in is-a hierarchy
● LCS : least common subsumer
● The closer / deeper the ancestor, the more similar
● Tetanus and strep_throat are similar
– both are kinds-of bacterial infections
6. Measures of Similarity
● Path based
– Is-a hierarchy
● Path + Depth
– Is-a hierarchy
● Feature
– Is-a hierarchy
● Information Content
– Is-a hierarchy + corpus
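The path-based and information-content families above can be sketched over a toy is-a hierarchy. Everything below (the hierarchy, the corpus probabilities) is invented for illustration; Lin's measure is used as one representative information-content measure, with IC(c) = -log p(c):

```python
import math

# Toy is-a hierarchy: child -> parent (invented for illustration)
parent = {
    "tetanus": "bacterial_infection",
    "strep_throat": "bacterial_infection",
    "bacterial_infection": "infection",
    "infection": "disease",
}

def ancestors(c):
    """Path from a concept up to the root, inclusive."""
    path = [c]
    while c in parent:
        c = parent[c]
        path.append(c)
    return path

def lcs(a, b):
    """Least common subsumer: closest shared ancestor on the is-a paths."""
    anc_b = set(ancestors(b))
    for node in ancestors(a):     # walks upward from a, so first hit is closest
        if node in anc_b:
            return node
    return None

def path_sim(a, b):
    """Path-based similarity: inverse of the node count on the path via the LCS."""
    l = lcs(a, b)
    return 1 / (ancestors(a).index(l) + ancestors(b).index(l) + 1)

# Information content from (made-up) corpus probabilities: IC(c) = -log p(c)
prob = {"tetanus": 0.01, "strep_throat": 0.02,
        "bacterial_infection": 0.05, "infection": 0.2, "disease": 0.9}
IC = {c: -math.log(p) for c, p in prob.items()}

def lin(a, b):
    """Lin (1998): 2 * IC(LCS) / (IC(a) + IC(b))."""
    return 2 * IC[lcs(a, b)] / (IC[a] + IC[b])

print(lcs("tetanus", "strep_throat"))  # bacterial_infection
```

The corpus only enters through the probabilities, which is why the information-content measures need both a hierarchy and a corpus.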
7. Similar or Related?
● Relatedness more general
– How much is X related to Y?
– Many ways to be related
● is-a, part-of, treats, affects, symptom-of, …
● Tetanus and puncture_wound are related but
they really aren't similar
– (puncture wounds can cause tetanus)
● All similar concepts are related, but not all
related concepts are similar
9. Definition Based Relatedness
● Related concepts defined using many of the
same terms
● Concepts don't need to be connected via relations or paths to measure their relatedness
– Lesk, 1986
– Adapted Lesk, Banerjee & Pedersen, 2003
10. BUT! ...
● Definitions are brief, potentially inconsistent
– Alopecia : … a result of cancer_treatment
– Thrush : … a side_effect of chemotherapy
● Lesk matching won't recognize the similarity
between result and side_effect, or between
cancer_treatment and chemotherapy
– Will find alopecia and thrush totally unrelated
11. Gloss Vector Measure
● Rely on co-occurrences of terms
● Allows for a fuzzier notion of matching
● Exploits second order co-occurrences
– Friend of a friend relation
– Suppose cancer_treatment and chemotherapy
don't occur in text with each other. But,
suppose that “survival” occurs with each.
– cancer_treatment and chemotherapy are
second order co-occurrences via “survival”
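The "friend of a friend" idea on this slide is easy to show concretely. In this minimal sketch (all counts invented), the two terms never co-occur directly, but their first-order co-occurrence sets overlap:

```python
# Second-order co-occurrence: cancer_treatment and chemotherapy never
# appear together in the (invented) corpus, but both co-occur with
# "survival", linking them at second order.

cooc = {
    "cancer_treatment": {"survival": 12, "hospital": 5},
    "chemotherapy":     {"survival": 9,  "nausea": 7},
}

# First order: do the two terms co-occur with each other directly?
direct = "chemotherapy" in cooc["cancer_treatment"]          # False

# Second order: do they share any co-occurring term?
shared = set(cooc["cancer_treatment"]) & set(cooc["chemotherapy"])

print(direct, shared)  # False {'survival'}
```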
12. Gloss Vector Measure
● Replace words or terms in definitions with
vector of co-occurrence counts from corpus
● Represent defined concept by the average of
all the vectors of the words in its definition
● Measure relatedness of concepts via cosine
between their respective vectors
● Patwardhan and Pedersen, 2006 (vector)
– Schütze, 1998
– Latent Semantic Analysis
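The three steps on this slide (replace words with co-occurrence vectors, average, compare with cosine) can be sketched as follows. The vocabulary, counts, and definition words are invented stand-ins, not data from the actual measure:

```python
import math

# Gloss Vector sketch: each definition word maps to a co-occurrence
# vector over a fixed vocabulary ["survival", "nausea", "hospital"];
# a concept is the average of its definition words' vectors.

cooc = {  # word -> co-occurrence counts over the vocabulary (invented)
    "cancer_treatment": [12, 3, 5],
    "chemotherapy":     [9, 7, 2],
    "hair":             [0, 1, 0],
    "loss":             [1, 2, 1],
}

def gloss_vector(definition_words):
    """Average the co-occurrence vectors of the words in a definition."""
    vecs = [cooc[w] for w in definition_words if w in cooc]
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]

def cosine(u, v):
    """Relatedness of two concepts via the cosine of their vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

alopecia = gloss_vector(["hair", "loss", "cancer_treatment"])
thrush   = gloss_vector(["chemotherapy"])
print(cosine(alopecia, thrush))
```

Note how the two definitions share no words at all, yet the cosine is high because their words co-occur with the same things; this is the fuzzier matching that plain Lesk overlap misses.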
13. Can We Improve Gloss Vector?
● Instead of constructing second order vectors
using frequency counts or measures of
association...
● Use semantic similarity measures!
– Not all pairs of concepts will have similarity
values, but some do!
– Weight co-occurrences based on how similar
they are...
14. Integrated Second-Order Vector
● Construct a co-occurrence matrix from an
external corpus
– NLM Medline Bigram data
– https://mbr.nlm.nih.gov
– Bigram counts from 2014 Medline baseline
● 44 million bigrams
● Replace co-occurrence counts in matrix with
similarity measure scores
– UMLS::Similarity
– http://umls-similarity.sourceforge.net
15. Integrated Second-Order Vector
● Build second order vector for each concept
● Obtain definitions of concept (from UMLS)
– Augment with definitions of parents (PAR),
children (CHD), broader than (RB), and
narrower than (RN) relations
– Look up vector for each word in definition
– Average these vectors together
– Resulting averaged vector represents the
concept
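Slides 13–15 together describe one pipeline: a matrix whose cells hold similarity scores rather than co-occurrence counts, and a concept vector built by averaging the rows for its definition words. A minimal sketch, with invented scores standing in for what UMLS::Similarity would supply and invented words standing in for the UMLS definitions:

```python
# Integrated second-order vector sketch. Rows give the similarity of a
# word's concept to each concept in the vocabulary ["treatment", "drug",
# "symptom"]; 0.0 where no similarity value exists. All values invented.

sim_matrix = {
    "chemotherapy": [0.8, 0.6, 0.0],
    "infection":    [0.3, 0.0, 0.7],
    "hair":         [0.0, 0.0, 0.2],
}

def concept_vector(definition_words):
    """Second-order vector for a concept: average the similarity rows of
    its definition words (words contributed by PAR/CHD/RB/RN-related
    definitions would simply be appended to this list)."""
    rows = [sim_matrix[w] for w in definition_words if w in sim_matrix]
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

vec = concept_vector(["hair", "chemotherapy"])
print(vec)  # [0.4, 0.3, 0.1]
```

Relatedness between two concepts would then be the cosine between their resulting vectors, exactly as in the original Gloss Vector measure.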
18. Reference Standards
● UMNSRS : 587 pairs ranked for relatedness,
566 for similarity, both by medical residents
– We used subsets of 430 and 401 pairs
– ICC > .7
● MayoSRS : 101 pairs ranked by physicians
and (separately) by medical coders
– MiniMayoSRS – 30 pair subset
● http://www.people.vcu.edu/~btmcinnes/
20. Thresholds
● Remove all similarity scores less than a given
threshold
● Experiments with thresholds using the res and faith measures showed that results improved significantly at some threshold settings
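The thresholding step itself is a simple filter applied to each row of similarity scores before the second-order vectors are built; a minimal sketch with invented values:

```python
# Thresholding sketch: similarity scores below the cutoff are zeroed out,
# so only confidently-similar co-occurrences contribute to the vectors.

def apply_threshold(row, threshold):
    """Keep only similarity scores at or above the threshold."""
    return [s if s >= threshold else 0.0 for s in row]

scores = [0.05, 0.42, 0.91, 0.10]
print(apply_threshold(scores, 0.25))  # [0.0, 0.42, 0.91, 0.0]
```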
22. Discussion
● Information content measures fared well!
● Clear advantage to filtering low similarity
scores, but ...
– How low, and how do we set?
– Values of thresholds vary with measures
● With reference standard?
● With corpora used for co-occurrences?
23. Related Work
● Various studies using word2vec
– UMNSRS, MayoSRS, and MiniMayoSRS
– CBOW and / or skip-gram models with various kinds of corpora
● Vector retrofitting (Yu et al., 2016) very related!
– Map terms to MeSH terms, build vectors
based on documents assigned those terms
– Include semantically related words from UMLS
– MiniMayoSRS
25. Future Work
● Why more improvement on similarity results?
● Regularize comparisons (!!!)
● Which corpora for co-occurrences?
● Which definitions to represent concepts?
● How can threshold be automatically set?
● What about WordNet and general English?