An Empirical Study of Vocabulary Relatedness and Its Application to Recommender Systems
Gong Cheng, Saisai Gong, Yuzhong Qu
State Key Laboratory for Novel Software Technology, Nanjing University, China
gcheng@nju.edu.cn
Presented at ISWC2011
2. Measuring term similarity
Figure (layer: vocabulary matching): term pairs from two vocabularies annotated with similarity scores: FacultyMember and Faculty (0.9), FullProfessor and Professor (0.8), AssistantProfessor and AssistantProfessor (1.0).
Gong Cheng (程龚) gcheng@nju.edu.cn 2 of 36
3. Measuring vocabulary similarity
Figure (layers: vocabulary matching, vocabulary distance): pairwise scores 0.8, 0.5, 0.6, 0.02, and 0.5 between the vocabularies Semantic Web for Research Communities (SWRC), eBiquity Person, Foundational Model of Anatomy (FMA), GALEN, and NCBI organismal classification (NCBITaxon).
4. Measuring vocabulary relatedness
Figure (layers: vocabulary matching, vocabulary distance, vocabulary relatedness): a vocabulary containing FacultyMember, FullProfessor, and AssistantProfessor beside a vocabulary containing Postgraduate-Research-Degree, PhD, and EngD.
Not that similar, but somewhat related
5. Contributions
How can vocabulary relatedness be measured?
6 measures, from 4 aspects
How does vocabulary relatedness behave in real-life cases?
Empirical analysis of 2,996 vocabularies and 4 billion other RDF triples
Where can vocabulary relatedness be applied?
Post-selection vocabulary recommendation in vocabulary search
6. Outline
Data set
Vocabulary relatedness
Post-selection vocabulary recommendation
Conclusions
7. Data set statistics
Crawled from February 2010 to May 2011 by
8. Data set distributions
RDF documents over pay-level domains
9. Data set distributions
Vocabularies over top-level domains
10. Outline
Data set
Vocabulary relatedness
Post-selection vocabulary recommendation
Conclusions
11. Vocabulary relatedness
6 numerical measures, from 4 aspects
Semantic relatedness
Explicit
Implicit
Hybrid
Content similarity
Expressivity closeness
Distributional relatedness
Comparison
12. Measure 1: explicit semantic relatedness
$R_S^{E}(v_i, v_j) = \dfrac{1}{\text{weight of a shortest path between } v_i \text{ and } v_j \text{ in } G_E}$
Figure: example graph $G_E$ over vocabularies $v_1$, $v_2$, $v_3$ with edge weights 1 and 2, built from explicit relations between vocabularies (owl:imports, owl:priorVersion, rdfs:seeAlso).
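The shortest-path definition on this slide can be tried out with a small sketch. It assumes the graph is undirected and stored as a nested dictionary, and it returns 0 for unreachable or identical vocabularies; those choices, the function names, and the wiring of the example graph (v1 to v2 with weight 1, v2 to v3 with weight 2) are illustrative assumptions, not details confirmed by the slides.

```python
import heapq


def shortest_path_weight(graph, source, target):
    """Dijkstra's algorithm; returns the total weight, or None if unreachable."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best[node]:
            continue  # stale queue entry, a shorter path was already found
        for neighbour, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return None


def path_relatedness(graph, vi, vj):
    """Reciprocal of the weight of a shortest path between vi and vj."""
    weight = shortest_path_weight(graph, vi, vj)
    if not weight:
        # Assumption: unreachable pairs (None) and vi == vj (weight 0) score 0.
        return 0.0
    return 1.0 / weight


# Hypothetical example graph: undirected, so each edge is stored in both directions.
example_graph = {
    "v1": {"v2": 1},
    "v2": {"v1": 1, "v3": 2},
    "v3": {"v2": 2},
    "v4": {},  # an isolated vocabulary
}

print(path_relatedness(example_graph, "v1", "v3"))
```

The same function applies unchanged to any weighted vocabulary graph, so the implicit and hybrid variants would differ only in the graph passed in.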
13. Measure 2: implicit semantic relatedness
$R_S^{I}(v_i, v_j) = \dfrac{1}{\text{weight of a shortest path between } v_i \text{ and } v_j \text{ in } G_I}$
Figure: example graph $G_I$ over vocabularies $v_2$, $v_3$, $v_4$ with edge weights 1 and 2, built from relations (owl:inverseOf, rdfs:subClassOf) between their terms $t_2$, $t_3$, $t_4$.
14. Measure 3: hybrid semantic relatedness
$R_S^{E+I}(v_i, v_j) = \dfrac{1}{\text{weight of a shortest path between } v_i \text{ and } v_j \text{ in } G_{E+I}}$
Figure: example graph $G_{E+I}$ over vocabularies $v_1$, $v_2$, $v_3$, $v_4$ with edge weights 1, 1, and 2.
15. Empirical analysis (1)
Statistical properties of $G_E$, $G_I$, and $G_{E+I}$
16. Empirical analysis (2)
Explicit relations between vocabularies
17. Measure 4: content similarity
Harmonic mean
Maximum similarity between their labels
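The slide names only two ingredients, a maximum over label similarities and a harmonic mean. One possible reading is sketched here: each term is matched to its best counterpart in the other vocabulary, the best-match scores are averaged per direction, and the two directions are combined by a harmonic mean. The averaging step, the use of `difflib.SequenceMatcher` as the string similarity, and the data layout (term name mapped to a list of labels) are assumptions made for illustration and may differ from the authors' formula.

```python
from difflib import SequenceMatcher


def label_similarity(a, b):
    """String similarity in [0, 1]; SequenceMatcher is a stand-in metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def term_similarity(labels_a, labels_b):
    """Maximum similarity between any label of one term and any label of the other."""
    return max(
        (label_similarity(a, b) for a in labels_a for b in labels_b),
        default=0.0,
    )


def directed_score(vocab_a, vocab_b):
    """Average, over the terms of vocab_a, of the best-matching term in vocab_b."""
    if not vocab_a or not vocab_b:
        return 0.0
    total = 0.0
    for labels_a in vocab_a.values():
        total += max(term_similarity(labels_a, labels_b) for labels_b in vocab_b.values())
    return total / len(vocab_a)


def content_similarity(vocab_a, vocab_b):
    """Harmonic mean of the two directed scores (assumed combination)."""
    ab = directed_score(vocab_a, vocab_b)
    ba = directed_score(vocab_b, vocab_a)
    if ab + ba == 0:
        return 0.0
    return 2 * ab * ba / (ab + ba)


# Hypothetical vocabularies: term name -> labels (local names split into words).
staff_a = {"FacultyMember": ["Faculty Member"], "FullProfessor": ["Full Professor"]}
staff_b = {"Faculty": ["Faculty"], "Professor": ["Professor"]}

print(content_similarity(staff_a, staff_b))
```

The harmonic mean keeps the score low when only one vocabulary is well covered by the other, which is the usual reason for preferring it over an arithmetic mean.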
18. Empirical analysis (3)
86 label-like properties
rdfs:label, dc:title, and their subproperties (e.g. skos:prefLabel)
and local name
Figure: two pie charts, each split into with (w/) and without (w/o) labels. Terms and their labels: 36.33% and 63.67%. Vocabulary distribution: 36.21% and 63.79%.
20. Empirical analysis (4)
4,978 meta-level terms, 469 (9.42%) in >1 vocabulary
Most popular meta-level terms
1. rdf:type
2. rdfs:domain
3. rdfs:range
4. …
and after excluding language constructs
10.13 meta-level terms per vocabulary on average
≤20 meta-level terms in 92.96% of vocabularies
but hundreds in Cyc
21. Measure 6: distributional relatedness
Distributional profile
$DP(v) = \big(p(v_1 \mid v),\ p(v_2 \mid v),\ \ldots,\ p(v_n \mid v)\big)^{\top}$
$R_D(v_i, v_j) = \cos\big(DP(v_i), DP(v_j)\big)$
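A small sketch of the distributional profile and its cosine follows. The slide does not say how p(v_k | v) is estimated; here it is assumed to be the fraction of data sources instantiating v that also instantiate v_k. The source names, vocabulary prefixes, and function names are hypothetical.

```python
import math


def distributional_profile(usage, vocabularies, v):
    """Vector of p(vk | v) for every vk, estimated from co-instantiation.

    usage maps a data source to the set of vocabularies it instantiates.
    Assumption: p(vk | v) = share of sources using v that also use vk.
    """
    sources_with_v = [used for used in usage.values() if v in used]
    if not sources_with_v:
        return [0.0] * len(vocabularies)  # v is never instantiated
    return [
        sum(1 for used in sources_with_v if vk in used) / len(sources_with_v)
        for vk in vocabularies
    ]


def cosine(x, y):
    """Cosine of the angle between two vectors; 0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def distributional_relatedness(usage, vocabularies, vi, vj):
    """Cosine between the distributional profiles of vi and vj."""
    return cosine(
        distributional_profile(usage, vocabularies, vi),
        distributional_profile(usage, vocabularies, vj),
    )


# Hypothetical instantiation data.
usage = {
    "a.example.org": {"foaf", "dc"},
    "b.example.org": {"foaf", "dc", "sioc"},
    "c.example.org": {"geo"},
}
vocabularies = ["foaf", "dc", "sioc", "geo"]

print(distributional_relatedness(usage, vocabularies, "foaf", "sioc"))
```

Vocabularies that are never instantiated get an all-zero profile under this sketch, so their relatedness to everything is 0.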
22. Empirical analysis (5)
Instantiation found for 1,874 (62.55%) vocabularies
Most popular vocabularies (excluding languages)
23. Empirical analysis (6)
Co-instantiation found for 9,763 pairs of vocabularies
Most popular vocabulary co-instantiation (excluding languages)
24. Vocabulary relatedness
6 numerical measures, from 4 aspects
Semantic relatedness
Explicit
Implicit
Hybrid
Content similarity
Expressivity closeness
Distributional relatedness
Comparison
25. Agreement between measures
Spearman’s rank correlation coefficient (ρ∈[-1,1])
Single-link hierarchical clustering
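To illustrate how agreement between two measures could be quantified, the sketch below computes Spearman's ρ as the Pearson correlation of average ranks, which handles ties. It assumes each measure has scored the same list of vocabulary pairs; the function names and sample scores are made up, and returning 0 for a constant score list is a convention chosen here.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; 0.0 if either input is constant (assumed convention)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def spearman_rho(scores_a, scores_b):
    """Spearman's rank correlation between two equally long score lists."""
    return pearson(average_ranks(scores_a), average_ranks(scores_b))


# Hypothetical scores given by two measures to the same four vocabulary pairs.
measure_a = [0.9, 0.1, 0.5, 0.5]
measure_b = [0.8, 0.2, 0.4, 0.6]

print(spearman_rho(measure_a, measure_b))
```

A matrix of 1 − ρ values between all measures could then serve as the distance input to a single-link hierarchical clustering.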
26. Outline
Data set
Vocabulary relatedness
Post-selection vocabulary recommendation
Conclusions
27. Relatedness-based ranking
Ranking by single measure:
Ranking by multiple measures:
28. Popularity-based re-ranking
Degree of influence of popularity
Number of pay-level domains instantiating vi
29. Evaluation settings
20 “selections” drawn at random from 1,302 moderate-sized vocabularies
Depth-10 pooling with
2 experts
Ratings
Closely related: 2
Somewhat related: 1
Unrelated: 0
Metric: NDCG
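For readers unfamiliar with the metric, here is a minimal sketch of NDCG over the graded ratings listed above (2, 1, 0). The slides do not state which gain and discount variant was used; this version assumes the rating itself as gain and a log2(rank + 1) discount, and the function names are illustrative.

```python
import math


def dcg(ratings):
    """Discounted cumulative gain: rating at rank r is divided by log2(r + 1)."""
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(ratings, start=1))


def ndcg_at_k(ranked_ratings, all_ratings, k):
    """DCG of the top-k of a ranking, normalised by the best achievable DCG.

    ranked_ratings: ratings of the recommended vocabularies, in ranked order.
    all_ratings: ratings of every assessed candidate for the same selection.
    """
    ideal = sorted(all_ratings, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0  # no related candidate exists for this selection
    return dcg(ranked_ratings[:k]) / ideal_dcg


# Hypothetical ranking: a somewhat related vocabulary ranked above a closely related one.
ranking = [1, 2, 0]
print(ndcg_at_k(ranking, ranking, k=1))
print(ndcg_at_k(ranking, ranking, k=3))
```

NDCG@1, used later in the deck, only looks at the top recommendation, so in this example it is the rating of the first result divided by the best available rating.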
30. Gold standard
739 assessments
Figure: pie chart of assessments with shares 7.85%, 10.55%, and 81.60%; legend: Closely related, Somewhat related, Unrelated
Agreement between experts: 80%, or 91% when “closely related” and “somewhat related” are merged into a single “related” rating
31. Evaluation results --- individual measures
56.88% of vocabularies are isolated in $G_E$; 37.45% of vocabularies are uninstantiated
32. Evaluation results --- combinations of measures
33. Relatedness vs. popularity
NDCG@1 vs. number of pay-level domains instantiating the vocabulary
34. Outline
Data set
Vocabulary relatedness
Post-selection vocabulary recommendation
Conclusions
36. Take away
Vocabulary meta-descriptions are incomplete.
Terms lack labels.
Co-instantiated ∝ explicitly related
http://ws.nju.edu.cn/falcons/ontologysearch/