This talk gives an introduction to entity linking for biomedical data. It describes the problem as a three-stage task and points to state-of-the-art approaches for each stage.
Talk held at the Hamburg Data Science Meetup, Hamburg's largest data event.
2. About me
● PhD in machine learning & natural language processing
from University of Bonn & Fraunhofer IAIS
● Now in industry: AI and data-driven products, since 2016
mostly in the medical and healthcare domain
● Main interests: NLP, especially German; information
retrieval; recommender systems
@anja_pilz
aplz
3. Outline
● Motivation: why do we need entity linking?
○ Ambiguity, use cases
● Entity linking in the biomedical domain
○ Data and ontologies, main challenges
● Technical problem: 3-stage task
○ Approaches for each of these stages (sketches & references)
● Short preview of challenges with German data
4. Language is ambiguous
“with steroid induced diabetes, I lost a stone
in three days, it was grim”
Type II diabetes
Type I diabetes
Gestational
diabetes
Steroid diabetes
gallstones
kidney stones
the stone
(unit)
a random
stone
grim protein, Drosophila
variants of
steroids
5. Why do we care?
Need to resolve ambiguity to
● avoid mistakes in patient-doctor communication
○ specialist vs layman vocabulary
● automatically retrieve important information
○ side effects of drugs discussed in online patient fora
● enrich electronic health records (EHR)
○ links to newest research, treatment guidelines or other LOD resources
And many more reasons…
Entity linking resolves ambiguity by assigning each mention its underlying “sense”.
6. Headache
Cephalgia
Entities: entries in (curated) medical ontologies
Mentions: textual references to medical terms like
diagnoses, treatments, body parts, drugs, ...
Biomedical Entity Linking
Migraine
Head Pain
Cranial Pain
Headache
(D006261)
layman
terms
EHR
specialist
vocabulary
7. Example: excerpt from a PubMed abstract linked to UMLS (Unified Medical Language
System)
Biomedical Entity Linking
Mohan & Li, MedMentions: A Large Biomedical Corpus
Annotated with UMLS Concepts, AKBC 2019
The technique does not
require contrast material, so
it can safely be used in
patients with renal failure.
8. Why is that hard?
● Notion of uniqueness: a disease is
rendered unique by the person it affects
(and the stage)
● Uniqueness heavily affects linkability:
which stage of renal failure is meant?
○ candidates “look” super similar
○ might even need additional resources (lab)
Acute renal failure: Her baseline Cr is
1.8. On presentation the Cr had
increased to 7.7 secondary to the
bilateral hydronephrosis.
https://icd.who.int/browse11/l-m/en
Johnson et al., MIMIC-III, a freely accessible
critical care database. Scientific Data 2016
9. Given some text document, find all spans of words m that mention some entity e and
assign each span to a unique identifier (entry in a KB).
Technical Problem
Entity Recognition: detect spans to be linked
(Sequence Tagging)
Candidate Retrieval: find all relevant candidates in a KB
(Information Retrieval)
Candidate Ranking: decide on the best candidate
(Ranking Task)
Error propagation
10. Step 1: Entity Recognition
Goal: detect diagnoses, measurements, procedures in the text of the EHR
● supervised: train a sequence tagging model
○ pick a model: lots of literature, but mostly some variant of a Bi-LSTM-CRF
○ (manually) annotate data
● pro: domain adaptation & custom features
● con: requires training data & medical expertise
Roller et al., Detecting Named Entities and Relations
in German Clinical Reports, GSCL 2017
Murty et al., Hierarchical Losses and New
Resources for Fine-grained Entity Typing and
Linking, ACL 2018
Lample et al., Neural Architectures for Named Entity
Recognition, NAACL-HLT 2016
Indication: Acute hypoxia. Relapsed AML,
GVHD, and renal failure with new hypoxia with
clear chest x-ray.
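A sequence tagger typically emits one tag per token, which then has to be turned into spans to be linked. A minimal sketch of that decoding step, assuming the common BIO tagging scheme and a made-up label DIAG (neither is stated on the slide). The tags below are written by hand for illustration; they are not model output.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (span text, label) pairs from per-token BIO tags."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span starts; close the one that is still open, if any.
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # Continuation of the open span.
            current.append(token)
        else:
            # "O" or an inconsistent I- tag closes the open span.
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans


# Hand-written tags for part of the example report (illustrative only).
tokens = ["Relapsed", "AML", ",", "GVHD", ",", "and", "renal", "failure"]
tags = ["O", "B-DIAG", "O", "B-DIAG", "O", "O", "B-DIAG", "I-DIAG"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Each resulting span is what the next stage, candidate retrieval, takes as input.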
11. Step 1: Entity Recognition
Goal: detect diagnoses, measurements, procedures in the text of the EHR
● weakly labeled: keyword matching
○ walk over text and lookup every span in a dictionary
○ keep all spans that have at least one entity candidate
● pro: no need to annotate data
● con: noise, type and recall issues
Murty et al., Hierarchical Losses and New
Resources for Fine-grained Entity Typing and
Linking, ACL 2018
Kolitsas et al., End-to-End Neural Entity Linking,
CoNLL 2018
Wiatrak, Iso-Sipilä. Simple Hierarchical Multi-Task
Neural End-To-End Entity Linking for Biomedical
Text. LOUHI@EMNLP 2020
Indication: Acute hypoxia. Relapsed AML,
GVHD, and renal failure with new hypoxia with
clear chest x-ray.
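The keyword matching idea can be sketched in a few lines. This is a toy version under stated assumptions: the dictionary, the entity ids (E1 to E5) and the maximum span length of 3 tokens are all made up for illustration, and the regex tokenizer is a simplification.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: surface form -> candidate entity ids (made up).
DICTIONARY: Dict[str, List[str]] = {
    "hypoxia": ["E1"],
    "acute hypoxia": ["E1"],
    "aml": ["E2", "E3"],
    "renal failure": ["E4", "E5"],
}


def match_spans(
    text: str, dictionary: Dict[str, List[str]], max_len: int = 3
) -> List[Tuple[int, str, List[str]]]:
    """Look up every token span of up to max_len tokens in the dictionary."""
    tokens = re.findall(r"\w+", text.lower())
    matches = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length > len(tokens):
                break
            span = " ".join(tokens[start:start + length])
            # Keep only spans that have at least one entity candidate.
            if span in dictionary:
                matches.append((start, span, dictionary[span]))
    return matches


text = ("Indication: Acute hypoxia. Relapsed AML, GVHD, and renal failure "
        "with new hypoxia with clear chest x-ray.")
matches = match_spans(text, DICTIONARY)
for start, span, candidates in matches:
    print(start, span, candidates)
```

Both drawbacks from the slide show up even in this toy run: "acute hypoxia" and "hypoxia" overlap (noise), and "GVHD" is missed because it is not in the dictionary (recall).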
12. Step 2: Candidate Retrieval
Goal: fetch all relevant candidate entities from the ontology
● upper bound on performance: you can’t link what you
don’t find
Go-to solution: inverted index (Lucene) over entity descriptions
● make use of the analyzers that come with Lucene for
tokenization, stemming, etc.
● craft search query from the mention context
● keep top 5, 10, 100 hits as candidates
Pilz & Paaß, Collective Search for Concept
Disambiguation, COLING 2012
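To make the inverted index idea concrete, here is a pure-Python stand-in. It is not Lucene and does not use its API: there is no stemming and no relevance model, only a count of shared terms. The entity ids and descriptions are hypothetical.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Set

# Hypothetical entity ids and descriptions (not taken from a real ontology).
ENTITIES: Dict[str, str] = {
    "E1": "acute renal failure sudden loss of kidney function",
    "E2": "chronic renal failure gradual loss of kidney function",
    "E3": "hypoxia low oxygen supply to tissue",
}


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def build_index(entities: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every term to the set of entities whose description contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for entity_id, description in entities.items():
        for term in tokenize(description):
            index[term].add(entity_id)
    return index


def retrieve(query: str, index: Dict[str, Set[str]], top_k: int = 10) -> List[str]:
    """Return up to top_k entity ids, ordered by number of shared query terms."""
    hits: Counter = Counter()
    for term in set(tokenize(query)):
        for entity_id in index.get(term, ()):
            hits[entity_id] += 1
    # Sort by hit count (descending), then by id to keep ties deterministic.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [entity_id for entity_id, _ in ranked[:top_k]]


index = build_index(ENTITIES)
# Query crafted from the mention and its context.
candidates = retrieve("renal failure with new hypoxia", index)
print(candidates)
```

Note that E1 and E2 get the same score here: retrieval finds both, but telling them apart is left to the ranking stage.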
13. Step 3: Candidate Ranking
Goal: decide on the best candidate as target entity
Rank by context similarity
● compare text representations of mention context and
entity description (word2vec, topic distributions, etc)
● but: medical ontologies often do not provide extensive
descriptions
Pilz & Paaß, From names to entities using thematic
context distance, CIKM 2011
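A minimal sketch of ranking by context similarity. It uses plain bag-of-words vectors and cosine similarity as a stand-in for the richer representations named above (word2vec, topic distributions). The mention context, the entity ids and the descriptions are invented for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def bag_of_words(text: str) -> Counter:
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_by_context(context: str, descriptions: Dict[str, str]) -> List[Tuple[str, float]]:
    """Order candidates by similarity between mention context and description."""
    context_vec = bag_of_words(context)
    scored = [
        (entity_id, cosine(context_vec, bag_of_words(description)))
        for entity_id, description in descriptions.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical mention context and candidate descriptions.
context = "creatinine increased suddenly on presentation"
descriptions = {
    "E_chronic": "gradual loss of kidney function over months or years",
    "E_acute": "sudden increase of creatinine over hours or days",
}
ranked = rank_by_context(context, descriptions)
print(ranked)
```

Only "creatinine" matches here; without stemming, "suddenly" and "sudden" count as different terms. With short or missing descriptions the signal gets even weaker, which is the limitation noted on the slide.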
14. Step 3: Candidate Ranking
Goal: decide on the best candidate as target entity
Add type similarity from hierarchies
● Wikipedia: categories assigned to entities
● UMLS: use semantic types
○ distinguish a disease from the gene it is caused by
○ LATTE: reports a boost in linking performance when adding type
encodings learned from UMLS types
Zhu et al., LATTE: Latent Type Modeling for
Biomedical Entity Linking, AAAI 2020
UMLS® Reference Manual
15. Step 3: Candidate Ranking
Goal: decide on the best candidate as target entity
In a nutshell
● find expressive vector representations of mention-candidate pairs
● plug vectors into some function to rank them
○ Ranking SVM, specific loss functions in NN, …
● the information in the vector is more important than the algorithm!
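The nutshell view can be sketched as a feature vector per mention-candidate pair plus a scoring function. The sketch below uses a plain linear score with hand-set weights as a placeholder for a learned ranker such as a Ranking SVM. The feature names, weights, feature values and entity ids are all assumptions made for illustration.

```python
from typing import Dict, List

Features = Dict[str, float]

# Hand-set weights for illustration only; in practice they would be learned.
WEIGHTS: Features = {"context_sim": 1.0, "type_match": 0.5}


def score(features: Features, weights: Features) -> float:
    """Linear score: weighted sum over the features of one mention-candidate pair."""
    return sum(weights[name] * value for name, value in features.items())


def rank(candidates: Dict[str, Features], weights: Features = WEIGHTS) -> List[str]:
    """Return candidate ids ordered from best to worst score."""
    return sorted(
        candidates, key=lambda entity_id: score(candidates[entity_id], weights), reverse=True
    )


# Hypothetical feature values for two candidates of the same mention.
candidates = {
    "E_gene": {"context_sim": 0.45, "type_match": 0.0},
    "E_disease": {"context_sim": 0.40, "type_match": 1.0},
}

with_types = rank(candidates)
without_types = rank(candidates, {"context_sim": 1.0, "type_match": 0.0})
print(with_types)
print(without_types)
```

With the same scoring function, adding the type feature flips the ranking in this toy case, which illustrates the last bullet: what goes into the vector matters more than the algorithm on top.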
16. Challenges with German data
● Data is scarce, nothing comparable to MIMIC-III or MedMentions exists
● Ontologies like UMLS are only available in English
● NLP for German is a tad harder
○ Common nouns look like named entities (upper case)
● … the notorious compound words
○ sound perception disorder (sensorineural hearing loss): Schallempfindungsstörung
○ occlusion of the central retinal artery: Netzhautarterienverschluss
Ideas?
Let’s discuss!