Semantic web technologies offer a potential mechanism for the representation and integration of thousands of biomedical databases. Many of these databases offer cross-references to other data sources, but these are generally incomplete and prone to error. In this paper, we conduct an empirical analysis of the link structure of life science Linked Data, obtained from the Bio2RDF project. Three different link graphs for datasets, entities and terms are characterized by degree, connectivity, and clustering metrics, and their correlation is measured as well. Furthermore, we utilize the symmetry and transitivity of entity links to build a benchmark and evaluate several popular entity matching approaches. Our findings indicate that the life science data network can help find hidden links, can be used to validate links, and may offer a mechanism to integrate a wider set of resources to support biomedical knowledge discovery.
1. Link Analysis of Life Science Linked Data
Wei Hu (1), Honglei Qiu (1), and Michel Dumontier (2)
(1) State Key Laboratory for Novel Software Technology, Nanjing University, China
(2) Center for Biomedical Informatics Research, Stanford University
@micheldumontier::ISWC 2015
2. Linked Data offers links between datasets, but they are often incomplete and may contain errors.
3. Network Analysis
• Network analysis has long been used to study link structures
– The structure of the Web
– Network medicine: cellular networks and implications
A power-law degree distribution is scale-free.
A graph demonstrates the small-world phenomenon if its clustering coefficient is significantly higher than that of a random graph on the same node set, and if the graph has a shorter average distance.
BTC2010
The clustering coefficient of a node quantifies how close its neighbors are to being a clique. The average distance is the average shortest path length between all pairs of nodes in the graph.
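The two small-world metrics defined above can be sketched in a few lines of Python. This is a minimal illustration on a made-up toy graph, not the code used in the study; the function names and the adjacency-dictionary representation are assumptions, and the graph is treated as undirected.

```python
from collections import deque


def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbors of `node` that are linked to each other."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the per-node clustering coefficients."""
    return sum(clustering_coefficient(adj, n) for n in adj) / len(adj)


def average_distance(adj):
    """Mean shortest path length over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy graph: a triangle a-b-c with one extra node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
avg_cc = average_clustering(toy)  # 7/12
avg_dist = average_distance(toy)  # 4/3
```

A small-world check would compare these two numbers against the same metrics computed on a random graph over the same node set.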
4. Dataset link analysis (using RDF data model)
Entity link analysis (using cross-references)
Term link analysis (using ontology matching)
5. Linked Data for the Life Sciences
Bio2RDF is an open source project to unify the
representation and interlinking of biological data using RDF.
chemicals/drugs/formulations,
genomes/genes/proteins, domains
Interactions, complexes & pathways
animal models and phenotypes
Disease, genetic markers, treatments
Terminologies & publications
• Release 3 (June 2014)
• 35 datasets
• 11B RDF triples
• 1B entities
• 2K classes
• 4K properties
6. Dataset Links
Network Properties
1. Well linked
2. Hubs and authorities
3. Small-world phenomenon
Average distance = 2.77 vs 6
Clustering coefficient = 0.22 vs 0.13
4. Robust on systematic removal of nodes
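One common way to probe robustness is to remove nodes in a fixed order and track the size of the largest connected component. The sketch below assumes removal in descending degree order; the slides do not state which removal strategy was used, so both the strategy and the function names are illustrative assumptions.

```python
def largest_component_size(adj, removed=frozenset()):
    """Size of the largest connected component, ignoring `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:  # depth-first traversal of one component
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def robustness_profile(adj, steps):
    """Largest component size after removing 0, 1, ..., `steps` nodes.

    Assumption: nodes are removed in descending order of their degree in
    the original graph (ties broken by node name).
    """
    order = sorted(adj, key=lambda n: (-len(adj[n]), n))
    removed = set()
    profile = [largest_component_size(adj)]
    for node in order[:steps]:
        removed.add(node)
        profile.append(largest_component_size(adj, removed))
    return profile


# Hypothetical toy graph: hub h linked to a, b, c, d, plus links a-b and c-d.
toy_hub = {
    "h": {"a", "b", "c", "d"},
    "a": {"h", "b"},
    "b": {"h", "a"},
    "c": {"h", "d"},
    "d": {"h", "c"},
}
profile = robustness_profile(toy_hub, steps=1)  # [5, 2]
```

A robust network is one whose profile shrinks slowly; the toy graph above is deliberately fragile, since removing the single hub splits it in two.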
7. Entity Link Analysis
How well do entities link to each other?
• 76% of entity links involve a special kind of RDF triple
– e.g. <kegg:D03455, kegg:x-drugbank, drugbank:DB00002>
– x-relations have under-specified semantics
• May be truly identical, may refer to another related entity …
• Degree distribution
– Some do not follow power law
• Exponent is too large (close to 5)
BTC2010
8. Symmetry of entity links varies between different pairs of datasets
• Over 99% of links are reciprocated in DrugBank-PharmGKB and
OMIM-HGNC
– Suggests link sharing and synchronization
• Only 58% of links in DrugBank-KEGG and 51% of OMIM-Orphanet
links are reciprocal
– Suggests incomplete mapping
• 28% of OMIM-Orphanet links are malposed
– Suggests variation in model (omim:Phenotype to orphanet:Disorder)
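The reciprocity figures above can be illustrated with a short sketch that counts how many links from one dataset to another have a link pointing back. The function name, the prefix-based dataset test, and the identifiers in the example are all hypothetical; the study's exact definition of a reciprocal link may differ.

```python
def reciprocity(links, source_prefix, target_prefix):
    """Share of source->target links that also exist as target->source.

    `links` is an iterable of (subject, object) pairs; an entity is assumed
    to belong to a dataset when its identifier starts with that dataset's
    prefix.
    """
    link_set = set(links)
    forward = [
        (s, t)
        for s, t in link_set
        if s.startswith(source_prefix) and t.startswith(target_prefix)
    ]
    if not forward:
        return 0.0
    reciprocated = sum(1 for s, t in forward if (t, s) in link_set)
    return reciprocated / len(forward)


# Made-up identifiers, for illustration only.
example_links = [
    ("drugbank:A", "kegg:1"),
    ("kegg:1", "drugbank:A"),  # reciprocated
    ("drugbank:B", "kegg:2"),  # one-way
    ("drugbank:C", "kegg:3"),
    ("kegg:3", "drugbank:C"),  # reciprocated
    ("drugbank:D", "kegg:4"),  # one-way
]
ratio = reciprocity(example_links, "drugbank:", "kegg:")  # 0.5
```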
10. Evaluation of Entity Matching
How accurate are current entity matching approaches?
• Built a benchmark from the reciprocal links between similarly-typed
entities
• Evaluated several entity matching approaches
– Label similarity: Levenshtein, Jaro-Winkler, N-gram, Jaccard
– Machine learning: Linear regression, logistic regression with 5 properties
• Many-to-one links are difficult to discover
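Two of the label similarity measures named above, Levenshtein and Jaccard, can be sketched as follows. The normalization of the edit distance to a 0..1 score and the whitespace tokenization for Jaccard are assumptions made for this illustration; the slides do not specify how the measures were configured.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical labels."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a, b):
    """Overlap of lowercase whitespace-separated tokens of the two labels."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


distance = levenshtein("kitten", "sitting")  # 3
overlap = jaccard_similarity("Acetylsalicylic acid", "salicylic acid")  # 1/3
```

A matcher would typically accept a candidate pair when its score exceeds a chosen threshold, which is one reason label-only matching can miss entities whose labels differ but whose other properties agree.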
11. Term Link Analysis
How similar are the topics in the data network?
• Use ontology matching to generate term link graph
– Falcon-AO (linguistic matchers + structural matcher + synonyms)
• Created 83K class mappings, 1.5K object property mappings, and 858 data
property mappings
– Similarity threshold = 0.9
– Top-5 popular labels for classes and properties
• Significant overlap in topics; does not follow a power law as in the broader Semantic Web
12. Correlation of Link Graphs
To what degree are the three link graphs correlated?
• Spearman’s rank correlation coefficient:
– Entity link graph dataset pairs: entity links / entities
– Term link graph dataset pairs: term mappings / terms
– Dataset link graph dataset pairs: shortest path length
• All positively correlated
– Closer datasets in distance have more linked entities and terms
– Number of linked entities contributes little to overlap of topics
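Spearman's rank correlation coefficient is the Pearson correlation of the rank-transformed values. The sketch below is a plain-Python illustration with average ranks for ties; the function names and the sample values are made up and do not come from the study.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up per-dataset-pair measurements, for illustration only.
entity_link_ratio = [1, 2, 3, 4, 5]
term_mapping_ratio = [2, 1, 4, 3, 5]
rho = spearman(entity_link_ratio, term_mapping_ratio)  # 0.8
```

In the setting above, each position in the two lists would correspond to one dataset pair, with the values taken from two of the three link graphs.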
13. Summary of Findings
• Dataset, entity and term link graphs do not necessarily share the same
characteristics with the Hypertext / Semantic Web
– Degree distribution of entity links does not follow power law
– Data hubs
• A significant number of entities have been linked using x-relations, but
their intended semantics differs
– Classes are identical or equivalent entity links represent logical equivalence
• Symmetric and transitive entity links do exist, but their utility is weakened
due to their small number
– Meanings of entity links may shift during transitive closure
• Only matching the labels of entities may fail, while combining different
properties and using simple learning algorithms achieves good accuracy
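Treating entity links as symmetric and transitive amounts to grouping entities into connected clusters, which a union-find structure does efficiently. This is a minimal sketch under that assumption, with hypothetical names and identifiers; as noted above, link meanings may shift during transitive closure, so clusters built this way are candidates rather than confirmed equivalences.

```python
def link_clusters(links):
    """Group entities connected by links, read as symmetric and transitive."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# Made-up identifiers, for illustration only.
groups = link_clusters([("x:1", "y:1"), ("y:1", "z:1"), ("x:2", "y:2")])
# [["x:1", "y:1", "z:1"], ["x:2", "y:2"]]
```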