Natural Language Processing for Medical Data

Dr. Anja Pilz, ML Conference 2021
About me
@anja_pilz
aplz
● PhD in machine learning & natural language processing from University of Bonn & Fraunhofer IAIS
● Now in industry: AI- and data-driven products, since 2016 mostly in the medical and healthcare domain
● Main interests: NLP, especially German; information retrieval; recommender systems

Motivation
Doctors spend more time documenting what they do than actually treating patients
● 70% of work hours are dedicated to tasks not performed on the patient (organization & documentation)
Documentation is important, as it covers symptoms, risk factors, intolerances, treatments, …
● each piece of information is vital for the patient, but can be buried somewhere
It is not only complex cases that quickly become “unscannable”
● Use NLP for Information Extraction: automatically search, analyze, and add structure to these unstructured texts
Swiss Medical Journal, 2016;97(1):6–8

Motivation
Support doctors’ daily work
● create warnings from automatically detected risks and contraindications
● summarize suspected and excluded diagnoses (differential diagnosis)
● add hints to treatment guidelines
And much more!

Motivation
Support billing process
● billing process is super complex and needs to be watertight
● help medical controllers to find relevant information
● automatically find mentions of diseases and treatments
● align with entries from catalogs used for billing (e.g. ICD-10)
(Image: damedic code)

NLP Tasks for Medical Data
Entity Recognition (NER)
detect all relevant mentions:
● diagnoses
● procedures
● body parts
● drugs
● measurements
● negations...
Entity Linking (NEL/NED)
link to unique concepts:
● entries in (curated) medical ontologies or catalogs
● normalization used for documentation, summarization, & billing
Entity Filtering
filter relevant entities (clinical, billing)

Challenges: Medical Domain is not News
Typical medical texts are very different from common NLP data
● super condensed and short, sometimes like an enumeration
● full of abbreviations, acronyms and technical terms
● ambiguity is often resolved through sheer knowledge, not necessarily by the local context
Example: “Indication: Acute hypoxia. Relapsed AML, GVHD, and renal failure with new hypoxia with clear chest x-ray.”

Challenges: Ambiguity
Abbreviations are used for convenience
● ambiguous ones may cause miscommunication
● potentially jeopardise patient care
Entity Linking needs to expand acronyms but must not rely on priors
Examples of ambiguous abbreviations:
● TMZ: temazepam / temozolomide
● LFT: liver function test / lung function test
● HWI: Harnwegsinfekt (urinary tract infection) / Hinterwandinfarkt (posterior wall myocardial infarction)
● BCa: bladder cancer / breast cancer
● VF: Vorhofflimmern (atrial fibrillation) / Vorhofflattern (atrial flutter)
● MS: Magensonde (gastric tube) / Mitralstenose (mitral stenosis)
Holper et al., Ambiguous medical abbreviation study: challenges and opportunities, Intern Med J. 2020
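The point about not relying on priors can be illustrated with a minimal sketch. It assumes a small sense inventory with hand-picked cue words (the `SENSES` table and all cue words are hypothetical, not taken from the talk or the cited study). The expansion is chosen only from context evidence, and the function abstains instead of falling back to a "most frequent" sense.

```python
import re

# Hypothetical sense inventory: abbreviation -> {expansion: context cue words}.
SENSES = {
    "TMZ": {
        "temazepam": {"insomnia", "sleep", "sedation", "benzodiazepine"},
        "temozolomide": {"glioblastoma", "chemotherapy", "tumor", "oncology"},
    },
    "LFT": {
        "liver function test": {"alt", "ast", "bilirubin", "hepatitis", "liver"},
        "lung function test": {"spirometry", "fev1", "copd", "asthma", "lung"},
    },
}


def expand(abbreviation, context):
    """Return the expansion supported by the context, or None to abstain."""
    context_words = set(re.findall(r"\w+", context.lower()))
    senses = SENSES.get(abbreviation, {})
    scores = {sense: len(cues & context_words) for sense, cues in senses.items()}
    if not scores:
        return None  # unknown abbreviation
    best_score = max(scores.values())
    winners = [sense for sense, s in scores.items() if s == best_score]
    # No prior: without context evidence, or on a tie, do not guess.
    if best_score == 0 or len(winners) > 1:
        return None
    return winners[0]


print(expand("TMZ", "Glioblastoma, started chemotherapy with TMZ"))
print(expand("TMZ", "continue TMZ"))
```

Abstaining keeps an ambiguous mention visible for review rather than silently linking it to the wrong drug or test.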

Challenges: German
Latin origin vs German spelling results in a bunch of variations
● Carcinom, Karcinom, Carzinom, Karzinom, Ca, CA
The notorious compound words
● sound perception disorder (sensorineural hearing loss): Schallempfindungsstörung
● occlusion of the central retinal artery: Netzhautarterienverschluss
● detection of tuberculosis: Tuberkulosenachweis
Decompounding is non-trivial and requires profound linguistic knowledge
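One way to fold the spelling variations listed above into a single form is a small normalization step. The following is a rough heuristic sketch, not a complete treatment of German orthography: the `ABBREVIATIONS` table is hypothetical, and the two substitution rules (Latin "c" to "z" before front vowels, otherwise to "k") will misfire on some words.

```python
import re

# Hypothetical abbreviation table; a real system would need a curated list.
ABBREVIATIONS = {"ca": "karzinom"}


def normalize_spelling(term):
    """Map Latin/German c/k/z spelling variants to one lowercase form."""
    key = term.lower()
    if key in ABBREVIATIONS:
        return ABBREVIATIONS[key]
    # Latin "c" before e/i/y/ä/ö is usually spelled "z" in German.
    key = re.sub(r"c(?=[eiyäö])", "z", key)
    # Any remaining "c" becomes "k", except in the digraphs "ch" and "ck".
    key = re.sub(r"c(?![hk])", "k", key)
    return key


variants = ["Carcinom", "Karcinom", "Carzinom", "Karzinom", "Ca", "CA"]
print({v: normalize_spelling(v) for v in variants})
```

Applying the same function to both catalog descriptions and mentions would let them meet in the normalized form.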

Entity Recognition (EN)
● data is available, e.g. BC5CDR (1500 PubMed articles with annotated chemicals, diseases & their interactions)
● trained models are available
● not “solved”, but at a pretty good state of the art
https://scispacy.apps.allenai.org/

Entity Recognition (DE)
● typical off-the-shelf models are not useful for the medical domain
● need to train domain models here

Data?
Real patient data
● resides in hospitals and medical practices
● not publicly available
Public data
● netdoktor != Dr. B. Oss
● data in layman language does not compare well to real medical texts
● may still help
Patient: “Ich habe im Moment keine Blutdruckprobleme” (“I have no blood pressure problems at the moment”)
Doctor: “RR gut eingestellt” (“blood pressure well controlled”)

Entity Recognition
Get data. Start annotating.
● entities are all concepts of interest: drugs, medical conditions, procedures, body parts, …
● annotation usually requires medical expert knowledge
● super specific vocabulary with lots of abbreviations & acronyms
● good to go after ~1k documents
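To make the annotation step concrete, here is a minimal sketch of how character-span annotations could be converted into token-level BIO tags, a common input format for training NER models. The label name `DISEASE`, the regex tokenizer, and the helper names are assumptions for illustration; the talk does not specify an annotation format.

```python
import re


def find_span(text, phrase, label):
    """Locate the first occurrence of phrase and return (start, end, label)."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def to_bio(text, spans):
    """Tag each token with B-/I-<label> if it lies inside a span, else O."""
    tagged = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        tagged.append((match.group(), tag))
    return tagged


text = "Relapsed AML, GVHD, and renal failure with new hypoxia"
spans = [
    find_span(text, "AML", "DISEASE"),
    find_span(text, "GVHD", "DISEASE"),
    find_span(text, "renal failure", "DISEASE"),
    find_span(text, "hypoxia", "DISEASE"),
]
for token, tag in to_bio(text, spans):
    print(f"{token}\t{tag}")
```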

Entity Recognition
Train your own model (+ data)

Entity Linking
Most work in research: link entity mentions to concepts in the medical thesaurus UMLS
● higher level metadata enrichment
● index new publications by topic & keywords
● hot topic, and a bunch of publications exist
Why not use UMLS here?
● no German version (yet)
● concepts are sometimes not specific enough
Murty et al., Hierarchical Losses and New Resources for Fine-grained Entity Typing and Linking, ACL 2018
Kolitsas et al., End-to-End Neural Entity Linking, CoNLL 2018
Mohan & Li, MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts, AKBC 2019

ICD-10 Linking
ICD: International Statistical Classification of Diseases and Related Health Problems
● catalogs mental and physical disorders in most specific and precise form
● global standard for clinical documentation and billing
● published yearly by the WHO
● … comes with German modification ICD-10-GM (BfArM)
https://icd.who.int/browse10/2019/en
https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm

ICD-10 Linking
Higher clinical relevance
● support doctors: can’t get much more specific than with a diagnosis code
● support medical controllers: ICD codes are the items used in billing, not UMLS concepts
Requires entity filtering to avoid false positives
● excluded or suspected diagnoses
● “state after” diseases: clinically relevant, but not billing relevant
EHR examples:
● “Keine Hinweis auf intrazerebrale Blutung.” (no evidence of intracerebral hemorrhage)
● “Z.n. Hysterektomie, 2006” (status post hysterectomy, 2006)

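Entity Filtering for primary coding
Most mentions may be clinically relevant, but not coding relevant.
Need relation extraction approaches here.
Examples:
● “Prostatacarcinom in der Vorgeschichte” (history of prostate carcinoma)
● “Vorbekannte Osteochondrose” (previously known osteochondrosis)
● “Z.n. mehrfachem Apoplexen, zuletzt 2006” (status post multiple strokes, most recently 2006)
● “Mamma-Ca wurde ausgeschlossen.” (breast cancer was ruled out)
● “Keine Hinweis auf intrazerebrale Blutung.” (no evidence of intracerebral hemorrhage)
● “Die BWK 9-Fraktur zeigte sich mit fehlender knöcherner Durchbauung im Sinne einer Pseudarthrose.” (the fracture of thoracic vertebra 9 showed no bony consolidation, consistent with pseudarthrosis)
● “Intrazerebrale Blutung konnte nicht bestätigt werden.” (intracerebral hemorrhage could not be confirmed)
● “Verdacht auf arterielle Hypertonie.” (suspected arterial hypertension)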
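As a baseline for the filtering problem in these sentences, a trigger-phrase matcher in the style of NegEx can be sketched as follows. This is deliberately simpler than the relation extraction the slide calls for and is not the speaker's method; the trigger lists and status names are assumptions chosen to cover the toy sentences only.

```python
import re

# Hypothetical trigger patterns, checked in this order.
TRIGGERS = {
    "negated": [r"\bkein(e|en)? hinweis auf\b", r"\bausgeschlossen\b",
                r"\bnicht bestätigt\b"],
    "suspected": [r"\bverdacht auf\b", r"\bv\.\s?a\."],
    "history": [r"\bz\.\s?n\.", r"\bin der vorgeschichte\b", r"\bvorbekannt"],
}


def assertion_status(sentence):
    """Return 'negated', 'suspected', 'history', or 'affirmed'."""
    lowered = sentence.lower()
    for status, patterns in TRIGGERS.items():
        if any(re.search(pattern, lowered) for pattern in patterns):
            return status
    return "affirmed"


sentences = [
    "Prostatacarcinom in der Vorgeschichte",
    "Mamma-Ca wurde ausgeschlossen.",
    "Verdacht auf arterielle Hypertonie.",
    "Die BWK 9-Fraktur zeigte sich mit fehlender knöcherner Durchbauung "
    "im Sinne einer Pseudarthrose.",
]
for sentence in sentences:
    print(assertion_status(sentence), "|", sentence)
```

Sentence-level triggers break down as soon as one sentence holds several diagnoses with different statuses, which is one reason to move to relation extraction.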

Toy example. Typical cases are much more complex.

ICD-10 Linking
To be really useful, the link must be super specific
● “some renal failure” (N17*) is not good enough
Specificity relates to the stage of the disease
● hugely affects treatment complexity and care intensity
● treatment complexity directly corresponds to the hospital’s bill sent to the insurance company

Specificity
To describe a disease in a certain stage or manifestation, the catalog is super specific
● 40 entries for different instances of Diabetes Mellitus, Type 1 and Type 2 each
● there are even more forms of Diabetes...
Difference is sometimes only one word
● “nicht” (not) or “mit/ohne” (with/without): usual stopwords are dangerous here!
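The stopword warning can be demonstrated in a few lines. In this sketch the two descriptions are made-up strings in the style of catalog entries (not quoted from ICD-10-GM), and `GERMAN_STOPWORDS` is a small excerpt of the kind of list that ships with standard NLP toolkits.

```python
import re

# Excerpt of a typical German stopword list (assumption for illustration).
GERMAN_STOPWORDS = frozenset({"mit", "ohne", "nicht", "als", "und", "der", "die", "das"})


def tokenize(text, stopwords=frozenset()):
    """Lowercase word tokens, optionally dropping stopwords."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in stopwords]


# Illustrative descriptions that differ only in "ohne"/"mit" and "nicht".
entry_a = "Diabetes mellitus Typ 2 ohne Komplikationen, nicht als entgleist bezeichnet"
entry_b = "Diabetes mellitus Typ 2 mit Komplikationen, als entgleist bezeichnet"

print(tokenize(entry_a) == tokenize(entry_b))
print(tokenize(entry_a, GERMAN_STOPWORDS) == tokenize(entry_b, GERMAN_STOPWORDS))
```

With the stopword filter in place, the two opposite entries become indistinguishable, so an index built this way cannot separate them.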

Precision vs Context
ICD is completely different from Wikipedia
● catalog entries are precise descriptions without further context
● descriptions are not the most commonly used names
● descriptions tend to be very long: median number of words is 5, maximum is 28
● typically not used in this form by the doctors: low character overlap, low similarity
Typical mentions:
● “... RR 150/90...”
● “... rezidiv. Bluthochdruck mit Schwächegefühl...” (recurrent high blood pressure with a feeling of weakness)
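A quick way to see the low character overlap is to compare character trigram sets of a mention and a catalog-style term. This is a minimal sketch; the term "Essentielle Hypertonie" stands in for a catalog description and is an assumption, as is the choice of trigram Jaccard as the similarity measure.

```python
def char_ngrams(text, n=3):
    """Set of lowercase character n-grams of the text."""
    lowered = text.lower()
    return {lowered[i:i + n] for i in range(len(lowered) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity of the character n-gram sets of a and b."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


print(jaccard("Bluthochdruck", "Hypertonie"))
print(jaccard("RR 150/90", "Essentielle Hypertonie"))
print(jaccard("Hypertonie", "Hypertonie"))
```

Both mentions refer to blood pressure, yet they share no trigram with the catalog-style term, so purely string-based matching has nothing to work with.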

About Context..
Disambiguating information need not be located in the discharge letter
● can even be in a completely different data format, e.g. lab measurements
● N18*: multiple measurements of a specific lab value (Creatinine)
● not an NLP task anymore, time series analysis?

Entity Linking in Practice
Go-to solution for candidate retrieval: inverted index over catalog descriptions
● basically a vector space model with cosine similarity over (query, entry)
● make use of the analyzers coming with Lucene for tokenization, stemming, etc.
Secret sauce
● add medical knowledge and extend the descriptions (e.g. synonyms)
● handcraft search query from the mention context
Gist: aim for high recall, you can’t link what you don’t find...
Pilz & Paaß, Collective Search for Concept Disambiguation, COLING 2012
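The retrieval idea can be sketched in pure Python as a stand-in for a Lucene index. The sketch assumes character trigrams as terms (which also gives some tolerance to typos), TF-IDF weights, and cosine scoring; the `CatalogIndex` class, the `toy-*` identifiers, and the toy entries and synonyms are all hypothetical and are not real catalog codes.

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    """Character trigrams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class CatalogIndex:
    def __init__(self, entries, synonyms=None):
        synonyms = synonyms or {}
        self.postings = defaultdict(dict)  # term -> {code: term frequency}
        doc_terms = {}
        for code, description in entries.items():
            # "Secret sauce": extend each description with known synonyms.
            text = " ".join([description] + synonyms.get(code, []))
            doc_terms[code] = Counter(trigrams(text))
            for term, tf in doc_terms[code].items():
                self.postings[term][code] = tf
        n = len(entries)
        self.idf = {t: math.log(1 + n / len(p)) for t, p in self.postings.items()}
        self.norms = {
            code: math.sqrt(sum((tf * self.idf[t]) ** 2 for t, tf in counts.items()))
            for code, counts in doc_terms.items()
        }

    def search(self, query, k=5):
        """Return up to k (code, cosine score) pairs, best first."""
        weights = {t: tf * self.idf[t]
                   for t, tf in Counter(trigrams(query)).items() if t in self.idf}
        query_norm = math.sqrt(sum(w * w for w in weights.values()))
        if query_norm == 0:
            return []
        dot = defaultdict(float)
        for term, weight in weights.items():
            for code, tf in self.postings[term].items():
                dot[code] += weight * tf * self.idf[term]
        ranked = sorted(((s / (query_norm * self.norms[c]), c) for c, s in dot.items()),
                        reverse=True)
        return [(code, score) for score, code in ranked[:k]]


entries = {
    "toy-1": "Diabetes mellitus Typ 1",
    "toy-2": "Diabetes mellitus Typ 2",
    "toy-3": "Akutes Nierenversagen Stadium 3",
    "toy-4": "Essentielle Hypertonie",
}
index = CatalogIndex(entries, synonyms={"toy-3": ["ANV"], "toy-4": ["Bluthochdruck"]})
print(index.search("diabetes meltus", k=2))
print(index.search("ANV 3", k=1))
```

Returning several candidates instead of a single best hit follows the high-recall goal: the final decision is left to a later ranking step.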

Demo
Can handle typos and spelling variations.
Query “diabetes meltus” fetches all codes for Diabetes mellitus.

Demo
Can handle alternative names like synonyms or acronyms.
Query “ANV 3” fetches all “Akutes Nierenversagen ... Stadium 3” (acute renal failure ... stage 3) codes
But which one is it? Cannot decide on the best candidate...

Best Candidate?
Recipe: rank by context similarity to decide on best candidate
● find expressive vector representations of mention-candidate pairs
○ word2vec
○ topic distributions (LDA)
○ graphical similarity …
● plug vectors into some ranking model
○ Ranking SVM
○ specific loss functions in Neural Networks (Hamming)
But we have seen: catalog does not provide extensive descriptions, so... Next time!
Pilz & Paaß, From names to entities using thematic context distance, CIKM 2011
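The recipe above can be sketched with a tiny pairwise ranker in the spirit of a Ranking SVM: each (mention, candidate) pair becomes a feature vector, and a linear model is trained with a hinge loss so that the correct candidate scores higher than a wrong one. The two Jaccard features are simple stand-ins for word2vec or LDA representations, and the training pairs, hyperparameters, and function names are assumptions for illustration.

```python
import re


def words(text):
    return set(re.findall(r"\w+", text.lower()))


def trigrams(text):
    padded = f" {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def features(mention, candidate):
    """Feature vector of a mention-candidate pair: word and trigram overlap."""
    return [jaccard(words(mention), words(candidate)),
            jaccard(trigrams(mention), trigrams(candidate))]


def score(w, mention, candidate):
    return sum(wi * fi for wi, fi in zip(w, features(mention, candidate)))


def train(pairs, epochs=50, lr=0.1, lam=0.01):
    """SGD on the pairwise hinge loss with L2 weight decay."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for mention, good, bad in pairs:
            diff = [g - b for g, b in
                    zip(features(mention, good), features(mention, bad))]
            margin = sum(wi * di for wi, di in zip(w, diff))
            w = [wi * (1 - lr * lam) for wi in w]
            if margin < 1:  # correct candidate not yet ahead by the margin
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def rank(w, mention, candidates):
    return sorted(candidates, key=lambda c: score(w, mention, c), reverse=True)


# Toy training triples: (mention, correct description, wrong description).
TRAINING_PAIRS = [
    ("akutes nierenversagen stadium 3",
     "Akutes Nierenversagen Stadium 3", "Akutes Nierenversagen Stadium 1"),
    ("diabetes typ 2 ohne komplikationen",
     "Diabetes mellitus Typ 2 ohne Komplikationen",
     "Diabetes mellitus Typ 2 mit Komplikationen"),
]
weights = train(TRAINING_PAIRS)
print(rank(weights, "diabetes typ 1",
           ["Diabetes mellitus Typ 2", "Diabetes mellitus Typ 1"]))
```

As the slide notes, short catalog descriptions give such features little to work with, so richer context representations would be needed in practice.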

Thanks!
Questions?
Say Hi!
