Mention detection, normalization & classification of species, pathogens, humans and food in clinical documents: Overview of the LivingNER shared task and resources (talk at IberLEF @ SEPLN 2022)
Overview talk of the LivingNER shared task at IberLEF/SEPLN
There is a pressing need for tools that find mentions of species, pathogens, or food in medical texts. To promote the development of such tools we organized the LivingNER task. LivingNER relied on a large Gold Standard corpus of 2000 carefully selected clinical cases in Spanish covering diverse specialties. The corpus was manually annotated with species mentions, which were also carefully mapped to their corresponding NCBI Taxonomy identifiers. In addition, we generated Silver Standard versions of LivingNER for 7 languages: English, Portuguese, Galician, Catalan, Italian, French, and Romanian. LivingNER had three subtasks: LivingNER-Species NER (species mention detection), LivingNER-Species Norm (species mention detection and normalization to NCBI Taxonomy IDs), and LivingNER-Clinical IMPACT (a document classification task on the detection of pets, animals causing injuries, food, and nosocomial entities). We received and evaluated 62 systems from 20 teams from 11 countries worldwide, obtaining highly competitive results. Successful approaches typically modified pre-trained transformer-like language models (BERT, BETO, RoBERTa, etc.) and employed embedding distance metrics for entity linking. LivingNER corpus: doi.org/10.5281/zenodo.6376662
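As a rough illustration of what "embedding distance metrics for entity linking" can mean, the sketch below links a mention to the closest entry of a tiny lexicon by cosine similarity. It is not any participant's system: the `embed` function is a toy bag of character trigrams standing in for a transformer encoder, and the three-entry `LEXICON` and the `link` helper are hypothetical names chosen for this example.

```python
import math
from collections import Counter

# Hypothetical mini-lexicon (NCBI Taxonomy ID -> scientific name), for illustration only.
LEXICON = {
    "562": "Escherichia coli",
    "9606": "Homo sapiens",
    "1280": "Staphylococcus aureus",
}

def embed(text, n=3):
    """Toy embedding: counts of character n-grams (a stand-in for a transformer encoder)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def link(mention):
    """Return the lexicon ID whose name is closest to the mention in embedding space."""
    vec = embed(mention)
    return max(LEXICON, key=lambda tax_id: cosine(vec, embed(LEXICON[tax_id])))

if __name__ == "__main__":
    print(link("Escheria coli"))
```

In a real system the toy `embed` would be replaced by a sentence or token encoder and the lexicon by the full NCBI Taxonomy name list; the nearest-neighbour step stays the same.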
1. Barcelona Supercomputing Center (BSC):
• Antonio Miranda-Escalada
• Luis Gascó
• Salvador Lima-López
• Eulàlia Farré-Maduell
• Darryl Estrada
• Martin Krallinger
Mention detection, normalization & classification of species, pathogens, humans and food in clinical documents: Overview of the LivingNER shared task and resources
Martin Krallinger
Head of Text Mining Unit, BSC
<mkrallin@bsc.es>
IberLEF @ SEPLN 2022
LivingNER corpus: doi.org/10.5281/zenodo.6376662
2. IberLEF - LivingNER: recognition, normalization & classification of species, pathogens and food - krallinger.martin@gmail.com; antoniomiresc@gmail.com
Importance of species information extraction
Image credit: by Allice Hunter, File:Hispanophone global world map language.png, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=69323596
National Center for Biotechnology Information (NCBI) Taxonomy
3. How many species inhabit the Earth?
How many species do we know?
Quantification of global species richness
Taxonomic classification of species
Number of species in a taxonomic group
Validation against well-known taxa
250 years of taxonomic classification
1.2 million species catalogued in a central database
86% of species on Earth and 91% of species in the ocean still await description
Knowledge gap
4. Challenges
- Large collection of species that changes over time, with hierarchical relation types
- Homonymy with commonly used words, e.g. “Spot” (Leiostomus xanthurus) and “Permit” (Trachinotus falcatus)
- Homonymy with other medical entities (e.g. the word “goat” can refer to proteins found in human, zebrafish, rat and mouse)
- Ambiguous abbreviations, e.g. HBV can stand for both “Hepatitis B virus” and “Hepatitis B vaccine”
- Vernacular forms (common names)
- Incorrect case or misspellings (e.g. Bacterium coli, Bacillus coli and Escheria coli for Escherichia coli)
- Coordinations and nested expressions: “human immunodeficiency viruses types 1 and 2” refers to two distinct species names, “HIV type 1” and “HIV type 2”
- Role names (e.g. athletes, responders)
- Human mentions in the form of family members, etc.
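To make the misspelling challenge concrete, here is a minimal sketch of approximate string matching with Python's standard `difflib` module. The `KNOWN_NAMES` list and the `suggest` helper are hypothetical and only hold the species named on this slide; this is not the method used by the organizers or by any participant.

```python
import difflib

# Hypothetical list of canonical names, taken from the examples on this slide.
KNOWN_NAMES = [
    "Escherichia coli",
    "Leiostomus xanthurus",
    "Trachinotus falcatus",
]

def suggest(name, cutoff=0.8):
    """Return the closest known name above the similarity cutoff, or None."""
    matches = difflib.get_close_matches(name, KNOWN_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None

if __name__ == "__main__":
    print(suggest("Escheria coli"))
```

String similarity of this kind can recover near-miss spellings such as "Escheria coli", but variants like "Bacterium coli" share little surface form with the current name and would more plausibly need a synonym dictionary. It also does nothing for homonyms and ambiguous abbreviations, which require context.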
5. Previous SPECIES extraction and normalization efforts
Timeline (< 2000, 2000-2010, 2010-2021, 2022):
● ITIS (Integrated Taxonomic Information System) [Federal effort to provide consistent biological taxonomies] [1996]
● NCBI taxonomy [Terminological resource] [Federhen, 2012] [1997]
● The Catalogue of Life [Index of the world's species] [Bánki et al., 2022] [2001]
● LINNAEUS [Species mention and normalisation to NCBI taxonomy corpus and tool] [Gerner et al., 2010] [2010]
● Infectious Diseases (ID) task of BioNLP [Corpus and shared task] [Pyysalo et al., 2011] [2011]
● SPECIES [Species mention and normalisation to NCBI taxonomy corpus and tool] [Pafilis et al., 2013] [2014]
● Global Names Architecture database [organizes and cross-links electronic information about organisms] [Pyle et al., 2016] [2016]
● LivingNER [2022]
6. LivingNER overview
16. Conclusions
• Increasing interest in Spanish clinical NLP tasks
• LivingNER Resources
○ LivingNER Corpus: Species entity Gold Standard corpus mapped to NCBI Taxonomy.
○ LivingNER Multilingual Silver Standard Corpus: Species entity corpora normalised to NCBI Taxonomy in several languages.
○ LivingNER Spanish Silver Standard (from participants’ predictions)
17. Future directions
• Correct the LivingNER Multilingual Silver Standard to generate a Gold Standard subset of each language, creating high-quality benchmarks in the seven languages.
• The Clinical Impact track lacked enough training and test data; we plan to correct this issue in the future.
● Generate more granular annotations for the HUMAN mentions that are needed for real-world applications.
[Figure: actual examples of annotated species mentions and automatically recognized profession mentions.]
18. Acknowledgements
LivingNER Participants & LivingNER Scientific Committee
IberLEF organisers
● Manuel
● Julio
● and all others
SEPLN organisers
Funding:
• Plan de Tecnologías del Lenguaje
• AI4PROFHEALTH (PID2020-119266RA-I00)
• BioMATDB Horizon Europe Grant Agreement No 101058779
BSC Text Mining Unit