Mention detection, normalization & classification of species, pathogens, humans and food in clinical documents: Overview of the LivingNER shared task and resources (talk at IberLEF @ SEPLN 2022)

Barcelona Supercomputing Center (BSC):
• Antonio Miranda-Escalada
• Luis Gascó
• Salvador Lima-López
• Eulàlia Farré-Maduell
• Darryl Estrada
• Martin Krallinger
Martin Krallinger
Head of Text Mining Unit, BSC
<mkrallin@bsc.es>
IberLEF @ SEPLN 2022
LivingNER corpus: doi.org/10.5281/zenodo.6376662

IberLEF - LivingNER: recognition, normalization & classification of species, pathogens and food - krallinger.martin@gmail.com; antoniomiresc@gmail.com
Importance of species information extraction
Image credit: By Allice Hunter - File:Hispanophone global world map language.png, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=69323596
National Center for Biotechnology Information (NCBI) Taxonomy

Knowledge gap
How many species inhabit the Earth?
How many species do we know?
Quantification of global species richness
Taxonomic classification of species
Number of species in a taxonomic group
Validation against well-known taxa
250 years of taxonomic classification
1.2 million species catalogued in a central database
86% of species on Earth and 91% of species in the ocean still await description

Challenges
- Large collection of species that changes over time, with hierarchical relation types
- Homonymy with commonly used words, e.g. “Spot” (Leiostomus xanthurus) and “Permit” (Trachinotus falcatus)
- Homonymy with other medical entities (the word “goat” can refer to proteins found in human, zebrafish, rat and mouse)
- Abbreviations are ambiguous, e.g. HBV can stand for both “Hepatitis B virus” and “Hepatitis B vaccine”
- Vernacular forms (common names)
- Incorrect case or misspellings (e.g. Bacterium coli, Bacillus coli and Escheria coli for Escherichia coli)
- Coordinations and nested expressions: “human immunodeficiency viruses types 1 and 2” refers to two distinct species names, “HIV type 1” and “HIV type 2”
- Role names (e.g. athletes, responders)
- Human mentions in the form of family members, etc.
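The abbreviation, synonym and misspelling challenges listed on this slide can be illustrated with a small lookup sketch. This is not the LivingNER method or any participant system: the lexicon `LEXICON`, the function `candidates` and the similarity cutoff are all hypothetical, chosen only to show why a surface form may map to zero, one or several readings.

```python
import difflib
from typing import Dict, List

# Hypothetical toy lexicon (an assumption for illustration only):
# lower-cased surface form -> candidate readings taken from the slide.
LEXICON: Dict[str, List[str]] = {
    "escherichia coli": ["Escherichia coli"],
    "bacterium coli": ["Escherichia coli"],   # historical name variant
    "bacillus coli": ["Escherichia coli"],    # historical name variant
    "hbv": ["Hepatitis B virus", "Hepatitis B vaccine"],  # ambiguous abbreviation
    "leiostomus xanthurus": ["Leiostomus xanthurus"],
    "trachinotus falcatus": ["Trachinotus falcatus"],
}


def candidates(mention: str, cutoff: float = 0.85) -> List[str]:
    """Return candidate readings for a mention (possibly several, possibly none)."""
    # Normalise case and whitespace before lookup.
    key = " ".join(mention.lower().split())
    if key in LEXICON:
        return LEXICON[key]
    # Fall back to fuzzy string matching to catch misspellings.
    close = difflib.get_close_matches(key, list(LEXICON), n=1, cutoff=cutoff)
    return LEXICON[close[0]] if close else []


if __name__ == "__main__":
    for text in ["Escheria coli", "Bacillus coli", "HBV"]:
        print(text, "->", candidates(text))
```

In this sketch `candidates("HBV")` returns both readings, so a real system would still need context to choose between the virus and the vaccine; string matching alone cannot resolve that ambiguity.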

Previous SPECIES extraction and normalization efforts
Timeline periods: < 2000, 2000-2010, 2010-2021, 2022
● ITIS (Integrated Taxonomic Information System) [Federal effort to provide consistent biological taxonomies] [1996]
● NCBI taxonomy [Terminological resource] [Federhen, 2012] [1997]
● The Catalogue of Life [Index of the world's species] [Bánki et al., 2022] [2001]
● LINNAEUS [Species mention and normalisation to NCBI taxonomy corpus and tool] [Gerner et al., 2010] [2010]
● Infectious Diseases (ID) task of BioNLP [Corpus and shared task] [Pyysalo et al., 2011] [2011]
● SPECIES [Species mention and normalisation to NCBI taxonomy corpus and tool] [Pafilis et al., 2013] [2014]
● Global Names Architecture database [organizes and cross-links electronic information about organisms] [Pyle et al., 2016] [2016]
● LivingNER [2022]

LivingNER overview

LivingNER resources
LivingNER corpus: doi.org/10.5281/zenodo.6376662
LivingNER annotation guidelines: doi.org/10.5281/zenodo.6385162
LivingNER Multilingual Silver Standard: doi.org/10.5281/zenodo.6376662
LivingNER terminology: doi.org/10.5281/zenodo.6390506
LivingNER Silver Standard:
LivingNER evaluation library: github.com/tonifuc3m/livingner-evaluation-library
LivingNER participant systems: temu.bsc.es/livingner/participant-systems/
LivingNER YouTube playlist: https://www.youtube.com/channel/UCDsmS1pCCO8TW312wJq8aCQ/playlists

LivingNER Corpus: documents, format and annotation

LivingNER Corpus - Overview
● Diversity: primary care, dermatology, internal medicine, tropical medicine, endocrinology, neurology, ophthalmology, psychiatry, radiology, emergency medicine, cardiology, pediatrics, oncology, dentistry, etc.
● Manual entity annotations, NCBI taxonomy mapping and application classification
● Inter-Annotator Agreement (IAA): 94.2
● Random training, validation and test split
[Figure: Most common SPECIES mentions]

LivingNER Multilingual Silver Standard

LivingNER Multilingual Silver Standard
[Figure: side-by-side view of the Spanish Gold Standard and the English Silver Standard; the example annotations carry NCBI Tax ID: 11103 and NCBI Tax ID: 1311 in both panels]
Online visualiser:
https://temu.bsc.es/mLivingNER/diff.xhtml#/translations/en/annotation_transfer/train/caso_clinico_radiologia942?diff=/gold-standard/train/

LivingNER participating teams
● Registrations: 56
● SPECIES NER track: 20 participating teams, 41 submissions
● SPECIES Norm track: 8 teams, 14 submissions
● Clinical Impact track: 5 teams, 6 submissions

LivingNER participant results
● MiF: micro-averaged F-score (main metric)
● MiP: micro-avg. Precision
● MiR: micro-avg. Recall
[Results tables: SPECIES NER track and SPECIES Norm track]
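As a reminder of how the three metrics relate, here is a minimal sketch of micro-averaged precision, recall and F-score over exact-match annotations pooled across all documents. The tuple layout `(document id, start, end, label)` and the function name `micro_prf` are assumptions made for illustration; the official LivingNER evaluation library may define matches differently.

```python
from typing import Iterable, Tuple

# Assumed annotation layout: (document id, start offset, end offset, label).
Annotation = Tuple[str, int, int, str]


def micro_prf(gold: Iterable[Annotation],
              pred: Iterable[Annotation]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F-score with exact matching."""
    gold_set, pred_set = set(gold), set(pred)
    # True positives are counted once over the whole collection,
    # not per document, which is what makes the average "micro".
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    gold = [("doc1", 0, 5, "SPECIES"), ("doc1", 10, 18, "HUMAN"),
            ("doc2", 3, 9, "SPECIES"), ("doc2", 20, 27, "SPECIES")]
    pred = [("doc1", 0, 5, "SPECIES"), ("doc2", 3, 9, "SPECIES"),
            ("doc2", 30, 35, "HUMAN")]
    print(micro_prf(gold, pred))
```

With the toy data above, two of the three predictions match two of the four gold annotations, giving a precision of 2/3, a recall of 1/2 and an F-score of 4/7.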

LivingNER participant results - Clinical Impact track

Conclusions
• Increasing interest in Spanish clinical NLP tasks
• LivingNER Resources
○ LivingNER Corpus: Species entity Gold Standard corpus mapped to NCBI Taxonomy.
○ LivingNER Multilingual Silver Standard Corpus: Species entity corpora normalised to NCBI Taxonomy in several languages.
○ LivingNER Spanish Silver Standard (from participants’ predictions)

Future directions
• Correct the LivingNER Multilingual Silver Standard to generate a Gold Standard subset for each language, creating high-quality benchmarks in the seven languages.
• The Clinical Impact track lacked sufficient training and test data; we plan to correct this in the future.
• Generate more granular annotations for the HUMAN mentions that are needed for real-world applications.
[Figure: Actual examples of annotated species mentions and automatically recognized profession mentions.]

Acknowledgements
LivingNER Participants &
LivingNER Scientific Committee
IberLEF organisers
● Manuel
● Julio
● and all others
SEPLN organisers
Funding:
• Plan de Tecnologías del Lenguaje
• AI4PROFHEALTH (PID2020-119266RA-I00)
• BioMATDB Horizon Europe Grant Agreement No 101058779
BSC Text Mining Unit

LivingNER resources
LivingNER corpus: doi.org/10.5281/zenodo.6376662
LivingNER annotation guidelines: doi.org/10.5281/zenodo.6385162
LivingNER Multilingual Silver Standard: doi.org/10.5281/zenodo.6376662
LivingNER terminology: doi.org/10.5281/zenodo.6390506
LivingNER Silver Standard:
LivingNER evaluation library:
LivingNER participant systems: temu.bsc.es/livingner/participant-systems/
LivingNER YouTube playlist: https://youtube.com/playlist?list=PL5uSCzf1azhA_gMLC3DBZe6NvmMJiggTg

Questions?
