Natural language is highly ambiguous and the sense of a word heavily depends on the context it appears in. While slight uncertainties are acceptable for the texts you read on a daily basis, they can lead to fatalities in medical contexts. This talk gives an introduction to the underlying problem, word sense ambiguity, and to entity linking, the technical approach that aims to resolve it. Using practical examples, we highlight the crucial challenges that we need to overcome when dealing with German data and show how we integrate those solutions in our product: damedic code.
Talk held at ML Conference 2021, online.
2. Dr. Anja Pilz, ML Conference 2021
About me
@anja_pilz
aplz
● PhD in machine learning & natural language processing from University of Bonn & Fraunhofer IAIS
● Now in industry: AI and data-driven products, since 2016 mostly in the medical and healthcare domain
● Main interests: NLP, especially German; information retrieval; recommender systems
3.
Doctors spend more time on documentation than on actual treatment
● 70% of work hours are dedicated to tasks not performed on the patient (organization & documentation)
Documentation is important: it covers symptoms, risk factors, intolerances, treatments, …
● each piece of information is vital for the patient, but can be buried somewhere
It is not only complex cases that quickly become “unscannable”
● Use NLP for information extraction: automatically search, analyze, and add structure to these unstructured texts
Swiss Medical Journal, 2016;97(1):6–8
Motivation
4.
Support doctors’ daily work
● create warnings from automatically detected
risks and contraindications
● summarize suspected and excluded
diagnoses (differential diagnosis)
● add hints to treatment guidelines
And much more!
Motivation
5.
Support billing process
● the billing process is very complex and needs to be watertight
● help medical controllers find relevant information
● automatically find mentions of diseases and treatments
● align them with entries from the catalogs used for billing (e.g. ICD-10)
Motivation
Image: damedic code
6.
NLP Tasks for Medical Data
Entity Recognition (NER): detect all relevant mentions
● diagnoses
● procedures
● body parts
● drugs
● measurements
● negations...
Entity Linking (NEL/NED): link to unique concepts
● entries in (curated) medical ontologies or catalogs
● normalization used for documentation, summarization, & billing
Entity Filtering: filter relevant entities (clinical, billing)
7.
Challenges: Medical Domain is not News
Typical medical texts are very different from common NLP data
● super condensed and short, sometimes like an enumeration
● full of abbreviations, acronyms and technical terms
● ambiguity is often resolved through sheer knowledge, not necessarily by the local
context
Indication: Acute hypoxia. Relapsed AML,
GVHD, and renal failure with new hypoxia with
clear chest x-ray.
8.
Abbreviations are used for convenience
● ambiguous ones may cause miscommunication
● potentially jeopardise patient care
Entity Linking needs to expand acronyms but must not rely on priors
Challenges: Ambiguity
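English examples:
TMZ: temazepam / temozolomide
LFT: liver function test / lung function test
BCa: bladder cancer / breast cancer
German examples:
HWI: Harnwegsinfekt (urinary tract infection) / Hinterwandinfarkt (posterior wall myocardial infarction)
VF: Vorhofflimmern (atrial fibrillation) / Vorhofflattern (atrial flutter)
MS: Magensonde (gastric tube) / Mitralstenose (mitral stenosis)
Holper et al., Ambiguous medical abbreviation study: challenges and opportunities, Intern Med J. 2020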
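A minimal sketch of context-based abbreviation expansion, under stated assumptions: the sense inventory `SENSES`, its cue words, and the helper `expand_abbreviation` are hypothetical and not taken from damedic code. The idea is to pick the expansion whose cue words overlap most with the surrounding text and to return nothing when the context gives no evidence, instead of falling back to the most frequent sense (a prior).

```python
import re

# Hypothetical sense inventory: abbreviation -> {expansion: cue words}.
# The cue words are illustrative assumptions, not curated medical knowledge.
SENSES = {
    "HWI": {
        "Harnwegsinfekt": {"urin", "dysurie", "antibiose", "blase"},
        "Hinterwandinfarkt": {"ekg", "troponin", "koronar", "stent"},
    },
    "LFT": {
        "liver function test": {"bilirubin", "alt", "ast", "hepatic"},
        "lung function test": {"spirometry", "fev1", "dyspnea", "copd"},
    },
}


def expand_abbreviation(abbreviation, context):
    """Return the expansion best supported by the context, or None."""
    tokens = set(re.findall(r"\w+", context.lower()))
    scores = {
        expansion: len(cues & tokens)
        for expansion, cues in SENSES.get(abbreviation, {}).items()
    }
    if not scores:
        return None  # unknown abbreviation
    ranked = sorted(scores.values(), reverse=True)
    # No evidence at all, or a tie between the top senses: refuse to guess.
    if ranked[0] == 0 or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return max(scores, key=scores.get)
```

For example, `expand_abbreviation("HWI", "HWI, Troponin erhöht, EKG mit ST-Hebung")` selects the cardiac reading, while a bare "HWI bekannt" stays unresolved.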
9.
Challenges: German
Latin origin vs German spelling results in a bunch of variations
● Carcinom, Karcinom, Carzinom, Karzinom, Ca, CA
The notorious compound words
● sound perception disorder (sensorineural hearing impairment): Schallempfindungsstörung
● retinal artery occlusion: Netzhautarterienverschluss
● detection of tuberculosis: Tuberkulosenachweis
Decompounding is non-trivial and requires profound linguistic knowledge
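To make both problems concrete, here is a deliberately naive sketch. It assumes a tiny hand-made lexicon (`LEXICON`), a fixed set of linking elements (`LINKERS`), and two hypothetical helpers, `normalize_spelling` and `decompound`. Real decompounding needs far more linguistic knowledge than this dictionary lookup, and abbreviations such as "Ca" are not covered by the spelling rule.

```python
import re

# Tiny illustrative lexicon (assumption); a real system needs a large one.
LEXICON = {
    "netzhaut", "arterie", "verschluss",
    "tuberkulose", "nachweis",
    "schall", "empfindung", "störung",
}
# Linking elements tried between compound parts (assumption, not exhaustive).
LINKERS = ("", "s", "n", "en", "es")


def normalize_spelling(word):
    """Map Latin-style 'c' spellings onto the German 'k'/'z' spelling."""
    word = word.lower()
    word = re.sub(r"c(?=[eiyäö])", "z", word)   # Carcinom -> carzinom
    return re.sub(r"c(?![hk])", "k", word)      # carzinom -> karzinom


def decompound(word, lexicon=LEXICON):
    """Split a compound into lexicon entries; None if no full split exists."""
    word = word.lower()
    if not word:
        return []
    # Try longer heads first, then every linking element after the head.
    for end in range(len(word), 0, -1):
        head, rest = word[:end], word[end:]
        if head not in lexicon:
            continue
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            tail = decompound(rest[len(linker):], lexicon)
            if tail is not None:
                return [head] + tail
    return None
```

With this lexicon, the four long spellings of "Karzinom" collapse to one form, and "Netzhautarterienverschluss" splits into "netzhaut", "arterie", "verschluss" (with the linking "n" dropped). Words outside the lexicon return `None`, which is exactly where the non-trivial part of the problem begins.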
10.
● data is available, e.g. BC5CDR (1500 PubMed articles with annotated chemicals,
diseases & their interactions)
● trained models are available
● not “solved” but at a pretty good state of the art
Entity Recognition (EN)
https://scispacy.apps.allenai.org/
11.
● typical off-the-shelf models are not useful for the medical domain
● need to train domain models here
Entity Recognition (DE)
12.
Data?
Real patient data
● resides in hospitals and medical practices
● not publicly available
Public data
● netdoktor != Dr. B. Oss
● data in layman language does not compare well to real medical texts
● may still help
Patient: “Ich habe im Moment keine Blutdruckprobleme” (“I don't have any blood pressure problems at the moment”)
Doctor: “RR gut eingestellt” (“BP well controlled”)
13.
Entity Recognition
Get data. Start annotating.
● entities are all concepts of interest:
drugs, medical conditions, procedures,
body parts, …
● annotation usually requires medical
expert knowledge
● super specific vocabulary with lots of
abbreviations & acronyms
● good to go after ~1k documents
14.
Train your own model
Entity Recognition
+ data
15.
Most work in research: link entity mentions to concepts in the medical thesaurus UMLS
● higher level metadata enrichment
● index new publications by topic & keywords
● a hot topic with many publications
Why not?
● no German version (yet)
● concepts are sometimes not specific enough
Entity Linking
Murty et al., Hierarchical Losses and New Resources for
Fine-grained Entity Typing and Linking, ACL 2018
Kolitsas et al., End-to-End Neural Entity Linking, CoNLL 2018
Mohan & Li, MedMentions: A Large Biomedical Corpus
Annotated with UMLS Concepts, AKBC 2019
16.
ICD-10 Linking
ICD: International Statistical Classification of
Diseases and Related Health Problems
● catalogs mental and physical disorders in
most specific and precise form
● global standard for clinical
documentation and billing
● published yearly by the WHO
https://icd.who.int/browse10/2019/en
17.
ICD-10 Linking
● ICD-10 also comes with a German modification: ICD-10-GM (BfArM)
https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm
18.
Higher clinical relevance
● support doctors: can’t get much more specific than with a
diagnosis code
● support medical controllers: ICD codes are the items used in
billing, not UMLS concepts
Requires entity filtering to avoid false positives
● excluded or suspected diagnoses
● “status post” conditions (Z.n., Zustand nach): clinically relevant but not billing relevant
ICD-10 Linking
EHR
Kein Hinweis auf intrazerebrale Blutung. (No evidence of intracerebral hemorrhage.)
Z.n. Hysterektomie, 2006 (status post hysterectomy, 2006)
19.
Most mentions may be clinically relevant, but not coding relevant.
Relation extraction approaches are needed here.
Entity Filtering for primary coding
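Prostatacarcinom in der Vorgeschichte (history of prostate carcinoma)
Vorbekannte Osteochondrose (previously known osteochondrosis)
Z.n. mehrfachem Apoplexen, zuletzt 2006 (status post multiple strokes, most recently in 2006)
Mamma-Ca wurde ausgeschlossen. (Breast cancer was ruled out.)
Kein Hinweis auf intrazerebrale Blutung. (No evidence of intracerebral hemorrhage.)
Die BWK 9-Fraktur zeigte sich mit fehlender knöcherner Durchbauung im Sinne einer Pseudarthrose. (The fracture of thoracic vertebra 9 showed no bony consolidation, consistent with a pseudarthrosis.)
Intrazerebrale Blutung konnte nicht bestätigt werden. (Intracerebral hemorrhage could not be confirmed.)
Verdacht auf arterielle Hypertonie. (Suspected arterial hypertension.)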
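The slide calls for relation extraction. As a much simpler baseline, the sketch below uses trigger phrases in the spirit of rule-based negation detection. The trigger lists and the function `classify_mention` are hypothetical and only cover the example sentences above; they are an illustration, not the approach used in the product.

```python
import re

# Hypothetical trigger phrases per status; checked in this order.
TRIGGERS = {
    "excluded": [
        r"\bkein(e|en)? hinweis auf\b",
        r"\bausgeschlossen\b",
        r"\bnicht bestätigt\b",
    ],
    "suspected": [r"\bverdacht auf\b", r"\bv\.\s?a\."],
    "history": [r"\bz\.\s?n\.", r"\bin der vorgeschichte\b", r"\bvorbekannt\w*"],
}


def classify_mention(sentence):
    """Label a sentence as excluded, suspected, history, or affirmed."""
    text = sentence.lower()
    for label, patterns in TRIGGERS.items():
        if any(re.search(pattern, text) for pattern in patterns):
            return label
    # No trigger found: treat the mention as an affirmed, current finding.
    return "affirmed"
```

Under this sketch's assumption, only mentions labeled `affirmed` (here, the pseudarthrosis sentence) would be passed on as candidates for primary coding. Trigger rules break quickly on longer sentences with several entities, which is why relation extraction is the more robust direction.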
20.
Toy example. Typical
cases are much
more complex.
21.
To be really useful, the link must be super specific
● “some renal failure” (N17*) is not good enough
Specificity relates to the stage of the disease
● hugely affects treatment complexity and care
intensity
● treatment complexity directly corresponds to the hospital’s bill sent to the insurance company
ICD-10 Linking
https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm
22.
Specificity
To describe a disease in a certain stage or
manifestation, the catalog is super specific
● 40 entries for different instances of
Diabetes Mellitus, Type 1 and Type 2 each
● there are even more forms of Diabetes...
Difference is sometimes only one word
● “nicht” or “mit/ohne”: usual stopwords are
dangerous here!
https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm
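A small sketch of why generic stopword lists are dangerous here. The two description strings are modeled on the catalog's wording but are illustrative, not verbatim entries, and `GENERIC_STOPWORDS` is an assumed sample of a typical German stopword list. With the generic list, both descriptions reduce to the same tokens; protecting the meaning-bearing words keeps them apart.

```python
import re

# Small sample of a generic German stopword list (assumption).
GENERIC_STOPWORDS = {"nicht", "mit", "ohne", "als", "und", "der", "die", "das"}
# Words that flip the meaning of a catalog entry and must survive filtering.
PROTECTED = {"nicht", "mit", "ohne"}

# Illustrative descriptions that differ in a single word.
ENTRY_A = "Diabetes mellitus, Typ 2, nicht als entgleist bezeichnet"
ENTRY_B = "Diabetes mellitus, Typ 2, als entgleist bezeichnet"


def tokens(description, stopwords):
    """Lowercase, tokenize, and drop the given stopwords."""
    return [t for t in re.findall(r"\w+", description.lower()) if t not in stopwords]


# Generic filtering makes the two entries indistinguishable.
collapsed = tokens(ENTRY_A, GENERIC_STOPWORDS) == tokens(ENTRY_B, GENERIC_STOPWORDS)
# Keeping the protected words preserves the difference.
safe_stopwords = GENERIC_STOPWORDS - PROTECTED
distinct = tokens(ENTRY_A, safe_stopwords) != tokens(ENTRY_B, safe_stopwords)
```

Here `collapsed` and `distinct` are both true: the only word separating the two entries is one that a default analyzer would discard.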
23.
Precision vs Context
ICD is completely different from Wikipedia
● catalog entries are precise descriptions without further context
● descriptions are not the most commonly used names
● descriptions tend to be very long: median number of words is 5, maximum is 28
● typically not used in this form by the doctors: low character overlap, low similarity
... RR 150/90... (a blood pressure reading)
... rezidiv. Bluthochdruck mit Schwächegefühl... (“recurrent high blood pressure with a feeling of weakness”)
24.
About Context..
Disambiguating information need not be located in the discharge letter
● can even be in a completely different
data format, e.g. lab measurements
● N18* (chronic kidney disease): requires multiple measurements of a specific lab value (creatinine)
● no longer an NLP task; time series analysis?
https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm
25.
Entity Linking in Practice
Go-to solution for candidate retrieval: inverted index over catalog descriptions
● basically a vector space model with cosine similarity over (query, entry)
● make use of the analyzers that come with Lucene for tokenization, stemming, etc.
Secret sauce
● add medical knowledge and extend the descriptions (e.g. synonyms)
● handcraft the search query from the mention context
Gist: aim for high recall, you can’t link what you don’t find...
Pilz & Paaß, Collective Search for Concept
Disambiguation, COLING 2012
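The retrieval idea can be sketched in a few lines of plain Python. This is not Lucene and not the production setup: the class `CatalogIndex`, the TF-IDF weighting, and the three shortened catalog entries with appended synonyms are assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"\w+", text.lower())


class CatalogIndex:
    """Inverted index with TF-IDF weights and cosine similarity."""

    def __init__(self, entries):
        # entries: code -> description, already extended with synonyms
        self.docs = {code: Counter(tokenize(text)) for code, text in entries.items()}
        self.postings = defaultdict(set)
        for code, counts in self.docs.items():
            for term in counts:
                self.postings[term].add(code)
        n = len(self.docs)
        self.idf = {t: math.log(1 + n / len(c)) for t, c in self.postings.items()}
        self.norms = {code: self._norm(self._weights(c)) for code, c in self.docs.items()}

    def _weights(self, counts):
        # Terms unknown to the catalog are ignored.
        return {t: tf * self.idf[t] for t, tf in counts.items() if t in self.idf}

    @staticmethod
    def _norm(weights):
        return math.sqrt(sum(w * w for w in weights.values()))

    def search(self, query, k=5):
        q = self._weights(Counter(tokenize(query)))
        q_norm = self._norm(q)
        if q_norm == 0:
            return []
        # Only entries sharing at least one term with the query are scored.
        candidates = set().union(*(self.postings[t] for t in q))
        scored = []
        for code in candidates:
            d = self._weights(self.docs[code])
            dot = sum(w * d.get(t, 0.0) for t, w in q.items())
            scored.append((dot / (q_norm * self.norms[code]), code))
        return [code for _, code in sorted(scored, reverse=True)[:k]]


# Shortened, illustrative descriptions; the last word of the first two
# entries is a synonym or acronym added by hand (assumption).
ENTRIES = {
    "I10": "Essentielle Hypertonie Bluthochdruck",
    "N17": "Akutes Nierenversagen ANV",
    "E11": "Diabetes mellitus Typ 2",
}
index = CatalogIndex(ENTRIES)
```

A query built from the mention context, such as `index.search("rezidiv. Bluthochdruck mit Schwächegefühl")`, only finds the hypertension entry because the layman synonym was added to its description. That is the recall argument of the slide in miniature.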
26.
Can handle typos and spelling variations.
Query “diabetes meltus” fetches all codes for Diabetes mellitus.
Demo
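The slides do not say how the demo tolerates typos. One simple possibility, shown here purely as an assumption, is to correct each query token against the index vocabulary before searching, using `difflib.get_close_matches` from the Python standard library. The function `correct_query` and the small `VOCABULARY` are hypothetical.

```python
import difflib
import re

# Illustrative index vocabulary (assumption).
VOCABULARY = ["diabetes", "mellitus", "typ", "nierenversagen", "hypertonie"]


def correct_query(query, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace each unknown token by its closest vocabulary term, if any."""
    corrected = []
    for token in re.findall(r"\w+", query.lower()):
        if token in vocabulary:
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        # Keep the original token when nothing is similar enough.
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)
```

With this vocabulary, the query “diabetes meltus” from the slide is rewritten to “diabetes mellitus” before it reaches the index. The cutoff trades recall against false corrections and would need tuning on real queries.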
27.
Can handle alternative names like synonyms or acronyms.
Query “ANV 3” fetches all “Akutes Nierenversagen ... Stadium 3” codes (acute renal failure, stage 3).
But which one is it? The index alone cannot decide on the best candidate...
Demo
28.
Best Candidate?
Recipe: rank by context similarity to decide on best candidate
● find expressive vector representations of mention-candidate pairs
○ word2vec
○ topic distributions (LDA)
○ graphical similarity …
● plug vectors into some ranking model
○ Ranking SVM
○ specific loss functions in Neural Networks (Hamming)
But we have seen: catalog does not provide extensive descriptions, so... Next time!
Pilz & Paaß, From names to entities using thematic
context distance, CIKM 2011
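The recipe can be illustrated with the simplest possible representation. The sketch below swaps the word2vec or LDA vectors and the ranking model named on the slide for plain bag-of-words vectors and cosine similarity. The candidate ids `cand_a` and `cand_b`, their descriptions, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words vector as a term -> count mapping."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(context, candidates):
    """Order candidate ids by similarity of their description to the context."""
    ctx = bow(context)
    return sorted(
        candidates,
        key=lambda cid: cosine(ctx, bow(candidates[cid])),
        reverse=True,
    )


# Hypothetical candidates as returned by a retrieval step.
CANDIDATES = {
    "cand_a": "Akutes Nierenversagen Stadium 3",
    "cand_b": "Akutes Nierenversagen Stadium 1",
}
ranking = rank_candidates(
    "ANV, akutes Nierenversagen im Stadium 3 nach Sepsis", CANDIDATES
)
```

Here `ranking` puts `cand_a` first because the context mentions stage 3 explicitly. When the distinguishing information is missing from both the text and the short catalog descriptions, any similarity-based ranking of this kind has nothing to work with, which is the limitation the slide points out.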
29.
Thanks!
Questions?
Say Hi!