Reference Domain Ontologies and Large
Medical Language Models
Chimezie Ogbuji
Chief Medical Informatics Officer / Amara Home Care
Owner / Metacognition
Semantic Web
● Semantic Web
○ Goal: build a framework for intelligent machine understanding on top of the standards, ubiquity,
and connectedness of the World Wide Web (WWW)
○ 2006 to 2010: Most exciting time. Peak of inflated expectations
○ Driven by standardization efforts at the WWW Consortium (W3C)
○ Equal parts hype and robust infrastructure for modern applications
A cautionary tale for Large Language Models (LLM)
The Semantic Web: Where is it now? Rashif Ray Rahman / Oct 3, 2018
https://medium.com/@schivmeister/the-semantic-web-where-is-it-now-f4773f3097e3
Semantic Web Layers
How are things identified and
retrieved?
Rules, Logic, & Ontologies: How
can machines reason about the data?
How can machines ask
questions of the data?
How can data be exchanged in a
knowledge graph format?
Gartner's Hype Cycle
Gartner's 2016 Hype Cycle for Emerging Technologies
Natural Language (NL) question
answering was in the trough of
disillusionment a year before
transformers, a key technology
underlying today’s language models,
were in their infancy
A Cautionary Tale?
● SemanticDB work at the Cleveland Clinic Foundation
○ A reconceived implementation of an existing, 30-year-old registry of heart
surgery and cardiovascular intervention cases
■ 500 variables, ~ 200K patient records, 100 heart and vascular
research publications per year
○ Addressing shortcomings of conventional data warehouse functionality
■ Domain-specific criteria conceived by researchers who work with DB
administrators
○ Internally-funded project in conjunction with CCF Innovations Department to partner
with Cycorp, Inc
■ Cyc: a powerful reasoning system and knowledge base with built-in capability
for natural language.
The Semantic Research
Assistant (SRA). Query for
patients who had
a coronary artery bypass graft
(CABG) between 2008 and
2010 (inclusive) and after a
percutaneous coronary
intervention (PCI)
Pierce, C. D., Booth, D., Ogbuji, C., Deaton, C., Blackstone, E., & Lenat, D. (2012). SemanticDB: A semantic web
infrastructure for clinical research and quality reporting. Current Bioinformatics, 7(3), 267-277.
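The SRA example query above combines a date window with a temporal ordering between two procedures. A minimal Python sketch of that logic follows. It is not the SemanticDB implementation (which dispatched SPARQL against an RDF registry); the Procedure record, the cabg_after_pci helper, and the sample data are all hypothetical, and "after a PCI" is assumed to mean after the patient's earliest recorded PCI.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Procedure:
    # Hypothetical record shape, not the registry's actual schema.
    patient_id: str
    kind: str          # "CABG" or "PCI"
    performed: date


def cabg_after_pci(procedures, start, end):
    """Return sorted IDs of patients with a CABG in [start, end] that followed a PCI."""
    earliest_pci = {}
    for p in procedures:
        if p.kind == "PCI":
            previous = earliest_pci.get(p.patient_id)
            if previous is None or p.performed < previous:
                earliest_pci[p.patient_id] = p.performed

    matches = set()
    for p in procedures:
        if p.kind != "CABG" or not (start <= p.performed <= end):
            continue
        first_pci = earliest_pci.get(p.patient_id)
        if first_pci is not None and first_pci < p.performed:
            matches.add(p.patient_id)
    return sorted(matches)


# Invented sample data, for illustration only.
records = [
    Procedure("p1", "PCI", date(2007, 5, 1)),
    Procedure("p1", "CABG", date(2009, 3, 15)),
    Procedure("p2", "CABG", date(2009, 6, 1)),    # no prior PCI
    Procedure("p3", "PCI", date(2010, 2, 1)),
    Procedure("p3", "CABG", date(2011, 1, 10)),   # CABG outside the window
    Procedure("p4", "CABG", date(2008, 1, 1)),
    Procedure("p4", "PCI", date(2008, 6, 1)),     # PCI came after the CABG
]
result = cabg_after_pci(records, date(2008, 1, 1), date(2010, 12, 31))
```

Only patient p1 satisfies both the inclusive 2008 to 2010 window and the ordering constraint in this sample.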
● Primary challenges at the time
○ Developing representational models (ontologies) that can cover the domain
of a 200K+ patient dataset to facilitate machine reasoning
○ Resolving natural language query fragments to concepts in these models
(the purpose of the SRA and challenge of Natural Language Processing at
the time)
○ Dispatching Semantic Web queries (SPARQL) to the RDF patient
registry
○ Evaluating the queries efficiently
Parrot-like software that uses sophisticated analysis of the patterns and relationships
underlying language to simulate intelligent natural language
“How GPT3 Works - Visualizations and Animations” - Jay Alammar
What are Large Language Models?
https://jalammar.github.io/how-gpt3-works-visualizations-animations/
Peak Hype?
Semantic Web
vs. LLM
● Semantic Web
○ The basis for the value proposition was well-
understood
○ Driven mainly by the development of industry
standards
○ Adoption significantly lagged behind the research
○ Its applicability was not well defined
● Large Language Models
○ The basis for the value proposition is not fully
understood (the mechanism is a mystery to us)
○ No standardization (driven by community use)
○ Lightspeed community use keeping pace with
lightspeed research
○ Its applicability is well defined
● Artificial Neural Networks (ANN): a branch of machine learning inspired by biological neural
networks in animal brains
● Natural language processing (NLP): interdisciplinary subfield of computer science and
linguistics concerned with the ability of computers to support and manipulate human language
● LLMs: probabilistic models of natural language using ANN and trained on large textual data
● Fine tuning: A subsequent, task-specific training performed on a model to refine it for a
specific use case
● Instruction tuning: fine-tuning that improves a model's ability to follow instructions
● Transfer learning: a technique where knowledge learned from a task is re-used to boost
performance on a related task
● Unsupervised Learning: learning patterns without being told what’s right/wrong
Terminology
Sindhu, et al. "An empirical science research on bioinformatics in machine learning." 2020
What is the state of the
art of the use of LLMs in
the domain of medicine?
● MedAlpaca (4/2023)
○ Trained on Q/A pairs from online forums (52K), medical curriculum flashcards (34K),
Q/A pairs from WikiDoc (68K), and data from open NLP datasets and benchmarks.
Evaluated on United States Medical Licensing Examination (USMLE) self-assessment
datasets (119)
● Med-PaLM 2 (5/2023)
○ Trained on a multiple-choice question dataset for solving medical problems, collected
from professional examinations (183K), several standard multiple-choice benchmark
training datasets that were also used for evaluation (10K), and common consumer
questions (60). Evaluated on standard multiple-choice datasets
Recent Medical LLMs
● MEDITRON (11/2023)
○ Trained on a dataset of clinical practice guidelines (46K), PubMed Papers (5M) &
abstracts (16M), and standard benchmark training datasets (10,178). Evaluated on
standard multiple-choice datasets and Q/A based on PubMed abstracts
● MedPrompt (11/2023)
○ A study of how prompting alone can unlock GPT-4's capabilities on medical
challenge problems without additional training
● BioMistral (2/2024)
○ Trained on PMC Open Access Subset of medical research papers (~1.47M documents).
Evaluated on standard multiple-choice datasets and Q/A based on PubMed abstracts
Training & Evaluation
● Mix of training on open Q/As, multiple-choice
questions, and raw text (domain
expertise or research publications)
● Most were evaluated on medical
reasoning benchmarks that provided
training data for the models before
evaluating them
● Suffer from the same current and more
general issue of how to objectively
evaluate LLMs
What are Ontologies and
Description Logic (DL)?
● Rigorously-specified
conceptualizations of a domain
as mathematical logic
● Usually expressed as
hierarchies of classes,
restrictions on relationships,
etc.
● Meant for automated processing
by logical reasoning tools
Mondal, Sutapa, Vijaya Raghava Mutharaju, and Sumit Bhatia. Embeddings for
the EL++ description logic. Diss. IIIT-Delhi, 2020.
Angioedema ⊑ Edema
Angioedema ⊑ ∃ morphology . angioedema
A_ACE ⊑ ∃ morphology . (Angioedema ⊓ ∃ caused_by kallidin i )
Essential hypertension ⊑ Hypertensive disorder, systemic arterial
Essential hypertension ⊑ (∃ located-in . (systemic circulatory system structure))
Internationally standardized medical terminology system with over
360K medical concepts, 1.25M relationships between them, and 9.6K
textual definitions created by domain-experts. Uses a DL that facilitates
automation and machine reasoning.
● Released in US English, UK English, Australian English, Spanish, Danish, Dutch,
Lithuanian, Swedish, and Canadian French
What is SNOMED-CT?
Nested matryoshka dolls are a good analogy for
visualizing DL concept inclusion (⊑)
All instances of
Essential hypertension
are within the set of all
things that stand in a
located-in relation
with a systemic
circulatory system
structure
∃ - existential role restrictions
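The nested-doll reading of ⊑ and ∃ can be made concrete with plain Python sets. The sketch below is only an illustration of the set-theoretic semantics over one invented toy interpretation; the individuals (case1, vessels_a, and so on) and the helper names exists and subsumed_by are hypothetical. Checking a single interpretation like this is not the same as the entailment checking a DL reasoner performs over all interpretations.

```python
# One toy interpretation: each concept is the set of individuals that belong to it.
essential_hypertension = {"case1", "case2"}
hypertensive_disorder = {"case1", "case2", "case3"}
circulatory_system_structure = {"vessels_a", "vessels_b"}

# A role is a set of (subject, object) pairs.
located_in = {
    ("case1", "vessels_a"),
    ("case2", "vessels_b"),
    ("case3", "vessels_a"),
}


def exists(role, filler):
    """Interpret ∃ role . filler: everything related by `role` to at least one member of `filler`."""
    return {subject for subject, obj in role if obj in filler}


def subsumed_by(sub, sup):
    """Interpret sub ⊑ sup as set inclusion (the inner doll fits inside the outer one)."""
    return sub <= sup


# Essential hypertension ⊑ Hypertensive disorder
holds_named = subsumed_by(essential_hypertension, hypertensive_disorder)

# Essential hypertension ⊑ ∃ located-in . (systemic circulatory system structure)
holds_restriction = subsumed_by(
    essential_hypertension, exists(located_in, circulatory_system_structure)
)
```

Both inclusions hold in this toy interpretation, while the reverse inclusion (Hypertensive disorder ⊑ Essential hypertension) does not, because case3 lies outside the inner set.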
Angioedema caused by angiotensin-converting-enzyme inhibitor
(A_ACE)
⊓ - intersection of concepts
∃ - existential role restriction
named concept
DLs are designed for computer processing and not easily read by non-
mathematicians
What are Controlled Natural
Languages (CNL)?
Since CNLs are based on natural languages, their grammars use the same
syntactic structures: sentences, noun phrases, verb phrases, and relative clauses
CNLs were originally designed for use by domain experts to encode knowledge
without working directly in DL
Kuhn, Tobias. "The understandability of OWL statements in controlled
English." (2013): 101-115
“Every A_ACE morphology an Angioedema caused_by Kallidin i”
● Adopt a CNL for use
with SNOMED-CT
○ Using phraseology
appropriate for the domain
(pathophysiology)
● The CNL phrases generated can
be used as training data for
LLMs
“Every A_ACE is characterized in form by an Angioedema caused by Kallidin i”
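The step from a DL axiom to a domain-appropriate CNL phrase can be sketched as a small rendering function. This is a minimal sketch, not the tooling used in the talk: the Exists class, the ROLE_PHRASES table, the countable flag (used to drop the article before substance names such as Kallidin i), and the naive a/an rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumed mapping from role names to pathophysiology-style phrasing.
ROLE_PHRASES = {
    "morphology": "is characterized in form by",
    "caused_by": "caused by",
}


@dataclass
class Exists:
    """An existential restriction ∃ role . filler, optionally with nested restrictions on the filler."""
    role: str
    filler: str
    nested: list = field(default_factory=list)
    countable: bool = True  # False for substances that take no article


def article(noun):
    # Naive rule based on the first letter; real English needs more care.
    return "an" if noun[0].lower() in "aeiou" else "a"


def render_restriction(restriction):
    noun = restriction.filler
    if restriction.countable:
        noun = f"{article(noun)} {noun}"
    parts = [ROLE_PHRASES[restriction.role], noun]
    parts += [render_restriction(n) for n in restriction.nested]
    return " ".join(parts)


def render_subclass(sub, sup):
    """Render sub ⊑ sup, where sup is a named concept (str) or an Exists restriction."""
    if isinstance(sup, Exists):
        return f"Every {sub} {render_restriction(sup)}"
    return f"Every {sub} is {article(sup)} {sup}"


a_ace_phrase = render_subclass(
    "A_ACE",
    Exists("morphology", "Angioedema",
           nested=[Exists("caused_by", "Kallidin i", countable=False)]),
)
named_phrase = render_subclass("Angioedema", "Edema")
```

With these assumptions, a_ace_phrase reproduces the domain-phrased sentence on the slide, and named_phrase renders the simple inclusion Angioedema ⊑ Edema as "Every Angioedema is an Edema".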
● SNOMED-CT includes text definitions
○ “[..] applied to some SNOMED CT concepts that provides additional information about the
intended meaning or usage of the concept.”
● These can be used in addition to SNOMED-CT CNL phrases to train LLMs
Text Definitions
Angioedema ⊑ Non-allergic hypersensitivity reaction
Non-allergic hypersensitivity reaction ⊑ Non-allergic hypersensitivity process
“Every Angioedema is a Non-allergic hypersensitivity reaction”
“Every Non-allergic hypersensitivity reaction is a Non-allergic hypersensitivity process”
Non-allergic hypersensitivity process (SNOMED-CT’s Text definition)
“A pathological nonimmune process generally directed towards a foreign substance, which
results in tissue injury, which is usually transient. It is the realization of the pseudoallergic
disposition. A variety of mechanisms such as direct histamine release, complement
activation, cyclooxygenase activation and bradykinin generation may be involved.”
Combining CNL and Text
Definitions
● Create a dataset of instructions to define SNOMED-CT terms
○ Create full definition instructions and instructions on the component parts of each
definition (granular instructions)
● Train an existing model on the text definitions
● Perform instruction fine tuning on the resulting model
○ Use transfer learning to boost learning of SNOMED-CT definitions from the training
on textual definitions
○ Definition Rendering Instruction Fine Tuning (DRIFT)
Training an LLM on SNOMED-CT CNL phrases and Text Definitions
Putting into Practice
What is Angiotensin converting enzyme inhibitor-aggravated angioedema?
Angiotensin converting enzyme inhibitor-aggravated angioedema is defined in SNOMED-CT
as a disorder (a clinical diagnosis). Angiotensin converting enzyme inhibitor-aggravated
angioedema is an angioedema caused by angiotensin-converting-enzyme inhibitor, a drug-
aggravated angioedema-urticaria, [..] It is a hypersensitivity process characterized in form
by an angioedema caused by a substance with angiotensin-converting enzyme inhibitor
mechanism of action and is characterized in form by an angioedema caused by kallidin i.
What causes an Angiotensin converting enzyme inhibitor-aggravated angioedema?
It is caused by kallidin i
What is the morphology of an Angiotensin converting enzyme inhibitor-aggravated
angioedema?
It is characterized in form by an angioedema
Full / Granular Definition
Instructions
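The full and granular instructions shown above can be thought of as records generated from one concept definition. The sketch below is hypothetical: the concept dictionary layout, the record fields (kind, instruction, response), and the build_instructions helper are assumptions, and the full response is a simplified version of the slide's example rather than the exact DRIFT output.

```python
def article(noun):
    # Naive a/an choice based on the first letter.
    return "an" if noun[0].lower() in "aeiou" else "a"


def build_instructions(concept):
    """Return one full-definition record followed by granular records for each component part."""
    name = concept["name"]
    morphology = concept["morphology"]
    cause = concept["caused_by"]
    parents = ", ".join(f"{article(p)} {p}" for p in concept["parents"])

    full = {
        "kind": "full",
        "instruction": f"What is {name}?",
        "response": (
            f"{name} is {parents}. It is characterized in form by "
            f"{article(morphology)} {morphology} caused by {cause}."
        ),
    }
    granular = [
        {
            "kind": "granular",
            "instruction": f"What causes {article(name)} {name}?",
            "response": f"It is caused by {cause}",
        },
        {
            "kind": "granular",
            "instruction": f"What is the morphology of {article(name)} {name}?",
            "response": f"It is characterized in form by {article(morphology)} {morphology}",
        },
    ]
    return [full] + granular


# Concept content taken from the slide; the dictionary layout is an assumption.
concept = {
    "name": "Angiotensin converting enzyme inhibitor-aggravated angioedema",
    "parents": [
        "angioedema caused by angiotensin-converting-enzyme inhibitor",
        "drug-aggravated angioedema-urticaria",
    ],
    "morphology": "angioedema",
    "caused_by": "kallidin i",
}
instructions = build_instructions(concept)
```

Keeping the full and granular records in one list, tagged by kind, makes it straightforward to hold out part of the granular set for validation later.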
● SNOMED CT concepts are organised into 19
distinct hierarchies, covering different aspects
of healthcare
● Generate definitions from a subset of hierarchies
dealing with medical problems
○ Clinical finding (includes findings and disorders)
○ Subset of Body structure: Morphological abnormalities, which
physically characterize disorders
○ Situation with explicit context (situation)
Subset of SNOMED-CT
Experiment 626
● Began with OpenHermes-2.5-Mistral-7B model
● Performed unsupervised training on (7,694) SNOMED-CT text definitions
○ Using September 23rd 2023 release of SNOMED CT United States
Edition
● Performed DRIFT on medical problem hierarchies
○ Used full instruction definitions (130K)
○ Added 80% of granular instructions from each category (204K)
○ Validated training using remaining 20% (102K)
○ Used QLoRA fine tuning on Apple Mac Studio M1 Ultra with 128GB
RAM (OoriData servers)
● Runtime of 1 day 17 hours
● Software used: mlx, mlx-tuning-fork, Ogbuji-PT, and django-snomed-ct
https://huggingface.co/cogbuji/Mr-Grammatology-clinical-problems-Mistral-7B-0.5
Training
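The 80% / 20% division of granular instructions "from each category" is a per-category (stratified) split. A minimal sketch under stated assumptions follows: the category field, the split_by_category helper, the fixed seed, and the toy records are all hypothetical and do not reflect the actual mlx-tuning-fork data pipeline.

```python
import random
from collections import defaultdict


def split_by_category(records, train_fraction=0.8, seed=0):
    """Shuffle each category independently and split it into train and validation parts."""
    by_category = defaultdict(list)
    for record in records:
        by_category[record["category"]].append(record)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, validation = [], []
    for category in sorted(by_category):
        items = list(by_category[category])
        rng.shuffle(items)
        cut = int(len(items) * train_fraction)
        train.extend(items[:cut])
        validation.extend(items[cut:])
    return train, validation


# Toy granular instructions: 10 per category, invented for illustration.
toy_records = [
    {"category": category, "instruction": f"{category} question {i}"}
    for category in ("causes", "morphology")
    for i in range(10)
]
train_set, validation_set = split_by_category(toy_records)
```

Splitting inside each category keeps every kind of granular instruction represented in both the training and validation sets, which a single global shuffle would not guarantee.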
Conclusion (future considerations)
● Investigate how logical reasoning enabled by DL and term synonyms can be
leveraged to further generate gold-standard text for medical language model
training
○ Transitivity of relations, ACE vs. “Angiotensin converting enzyme”
● Use other LLMs (Medical LLMs, larger LLMs, etc.)
● Train on other (or all) SNOMED-CT categories
● Try other foundational biomedical ontologies (widely adopted and include
textual definitions):
○ Foundational Model of Anatomy (FMA): 120K classes and > 2.1M relationships
○ Gene Ontology (GO): 42K classes
● Evaluate against standard medical reasoning benchmarks (with and without
training against their data)
● Investigate prompting strategies in depth (Chain of biological thought, etc.)
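One possible reading of the first consideration above is that inclusions which follow by transitivity of ⊑ could yield additional CNL training phrases. The sketch below illustrates that idea only; it is not something the talk reports having done, the inferred_subsumptions helper is hypothetical, and a real DL reasoner would derive far more than this simple closure over asserted pairs.

```python
def inferred_subsumptions(asserted):
    """Return (sub, sup) pairs that follow by transitivity but were not directly asserted."""
    parents = {}
    for sub, sup in asserted:
        parents.setdefault(sub, set()).add(sup)

    inferred = set()
    for start, direct in parents.items():
        seen = set()
        stack = list(direct)
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(parents.get(current, ()))
        # Keep only ancestors that are new information.
        inferred |= {(start, a) for a in seen if a not in direct and a != start}
    return inferred


def article(noun):
    # Naive a/an choice based on the first letter.
    return "an" if noun[0].lower() in "aeiou" else "a"


# Asserted inclusions taken from the Text Definitions slide.
asserted = [
    ("Angioedema", "Non-allergic hypersensitivity reaction"),
    ("Non-allergic hypersensitivity reaction", "Non-allergic hypersensitivity process"),
]
extra_phrases = sorted(
    f"Every {sub} is {article(sup)} {sup}"
    for sub, sup in inferred_subsumptions(asserted)
)
```

Here the only new phrase is "Every Angioedema is a Non-allergic hypersensitivity process", which is implied by the two asserted inclusions but stated by neither.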
Questions?
https://linkr.bio/chimezie
https://www.researchgate.net/profile/Chimezie-Ogbuji
https://www.linkedin.com/in/chimezie/
https://chimezie.medium.com/
https://huggingface.co/cogbuji
https://github.com/chimezie
https://github.com/OoriData

SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Reference Domain Ontologies and Large Medical Language Models.pptx

  • 1. Reference Domain Ontologies and Large Medical Language Models. Chimezie Ogbuji, Chief Medical Informatics Officer / Amara Home Care; Owner / Metacognition
  • 2. Semantic Web
    ● Semantic Web
      ○ Goal: build a framework for intelligent machine understanding on the standards, ubiquity, and connectedness of the World Wide Web (WWW)
      ○ 2006 to 2010: the most exciting time; the peak of inflated expectations
      ○ Driven by standardization efforts at the World Wide Web Consortium (W3C)
      ○ Equal parts hype and robust infrastructure for modern applications
    A cautionary tale for Large Language Models (LLMs)
    The Semantic Web: Where is it now? Rashif Ray Rahman / Oct 3, 2018
    https://medium.com/@schivmeister/the-semantic-web-where-is-it-now-f4773f3097e3
  • 3. Semantic Web Layers How are things identified and retrieved? Rules, Logic, & Ontologies: How can machines reason about the data? How can machines ask questions of the data? How can data be exchanged in a knowledge graph format?
  • 5. Gartner's 2016 Hype Cycle for Emerging Technologies. Natural Language (NL) question answering was in the trough of disillusionment in 2016, a year before transformers, a key technology underlying today’s language models, were introduced. A Cautionary Tale?
  • 6.
    ● SemanticDB work at the Cleveland Clinic Foundation (CCF)
      ○ A reconceived implementation of an existing, 30-year-old registry of heart surgery and cardiovascular intervention cases
        ■ 500 variables, ~200K patient records, 100 heart and vascular research publications per year
      ○ Addressing shortcomings of conventional data warehouse functionality
        ■ Domain-specific criteria conceived by researchers who work with DB administrators
      ○ Internally funded project in conjunction with the CCF Innovations Department to partner with Cycorp, Inc.
        ■ Cyc: a powerful reasoning system and knowledge base with built-in capability for natural language
  • 7. The Semantic Research Assistant (SRA). Query for patients who had a coronary artery bypass graft (CABG) between 2008 and 2010 (inclusive) and after a percutaneous coronary intervention (PCI).
    Pierce, C. D., Booth, D., Ogbuji, C., Deaton, C., Blackstone, E., & Lenat, D. (2012). SemanticDB: A semantic web infrastructure for clinical research and quality reporting. Current Bioinformatics, 7(3), 267-277.
  • 8.
    ● Primary challenges at the time
      ○ Developing representational models (ontologies) that can cover the domain of a 200K+ patient dataset to facilitate machine reasoning
      ○ Resolving natural language query fragments to concepts in these models (the purpose of the SRA and a challenge for Natural Language Processing at the time)
      ○ Dispatching Semantic Web queries (SPARQL) to the RDF patient registry
      ○ Evaluating the queries efficiently
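
As a rough illustration of the "dispatching SPARQL" step, the sketch below builds a query for the SRA example (CABG between 2008 and 2010 inclusive, after a PCI). The `ex:` namespace and every class and property name in it are hypothetical placeholders, not the SemanticDB ontology, and the function only produces query text; sending it to an RDF store is left out.

```python
from string import Template

# Hypothetical vocabulary: the ex: namespace and all class/property names
# are illustrative assumptions, not the actual SemanticDB ontology.
CABG_AFTER_PCI = Template("""\
PREFIX ex:  <http://example.org/registry#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?patient WHERE {
  ?patient ex:hadProcedure ?cabg, ?pci .
  ?cabg a ex:CoronaryArteryBypassGraft ; ex:performedOn ?cabgDate .
  ?pci  a ex:PercutaneousCoronaryIntervention ; ex:performedOn ?pciDate .
  FILTER (?cabgDate >= "$start"^^xsd:dateTime && ?cabgDate <= "$end"^^xsd:dateTime)
  FILTER (?cabgDate > ?pciDate)
}
""")


def build_query(start_year: int, end_year: int) -> str:
    """Fill in an inclusive year range for the CABG date."""
    return CABG_AFTER_PCI.substitute(
        start=f"{start_year}-01-01T00:00:00",
        end=f"{end_year}-12-31T23:59:59",
    )


query = build_query(2008, 2010)
```

The hard part named on the slide, resolving a phrase such as "coronary artery bypass graft" to the right ontology concept, is exactly what this sketch assumes has already been done.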
  • 9. What are Large Language Models? Parrot-like software that uses sophisticated analysis of the patterns and relationships underlying language to simulate intelligent, natural language.
    “How GPT3 Works - Visualizations and Animations” - Jay Alammar
    https://jalammar.github.io/how-gpt3-works-visualizations-animations/
  • 11. Semantic Web vs. LLM
    ● Semantic Web
      ○ The basis for the value proposition was well understood
      ○ Driven mainly by the development of industry standards
      ○ Adoption significantly lagged behind the research
      ○ Its applicability was not well defined
    ● Large Language Models
      ○ The basis for the value proposition is not fully understood (the mechanism is a mystery to us)
      ○ No standardization (driven by community use)
      ○ Lightspeed community use keeping pace with lightspeed research
      ○ Its applicability is well defined
  • 12. Terminology
    ● Artificial Neural Networks (ANN): a branch of machine learning, and the basis of deep learning, inspired by biological neural networks in animal brains
    ● Natural language processing (NLP): interdisciplinary subfield of computer science and linguistics concerned with the ability of computers to support and manipulate human language
    ● LLMs: probabilistic models of natural language built using ANNs and trained on large amounts of text
    ● Fine tuning: a subsequent, task-specific training performed on a model to refine it for a specific use case
    ● Instruction tuning: fine tuning that improves a model's ability to follow instructions
    ● Transfer learning: a technique where knowledge learned from one task is re-used to boost performance on a related task
    ● Unsupervised learning: learning patterns without being told what’s right or wrong
    Sindhu et al. "An empirical science research on bioinformatics in machine learning." 2020
  • 13. What is the state of the art of the use of LLMs in the domain of medicine?
  • 14. Recent Medical LLMs
    ● MedAlpaca (4/2023)
      ○ Trained on Q/A pairs from online forums (52K), medical curriculum flashcards (34K), Q/A pairs from WikiDoc (68K), and data from open NLP datasets and benchmarks. Evaluated on United States Medical Licensing Examination (USMLE) self-assessment datasets (119)
    ● Med-PaLM 2 (5/2023)
      ○ Trained on a multiple-choice question dataset for solving medical problems, collected from professional examinations (183K), the training sets of several standard multiple-choice benchmarks used for evaluation (10K), and common consumer questions (60). Evaluated on standard multiple-choice datasets
  • 15.
    ● MEDITRON (11/2023)
      ○ Trained on a dataset of clinical practice guidelines (46K), PubMed papers (5M) and abstracts (16M), and standard benchmark training datasets (10,178). Evaluated on standard multiple-choice datasets and Q/A based on PubMed abstracts
    ● MedPrompt (11/2023)
      ○ A study of how prompting alone can unlock GPT-4's capabilities on medical challenge problems, without additional training
    ● BioMistral (2/2024)
      ○ Trained on the PMC Open Access Subset of medical research papers (~1.47M documents). Evaluated on standard multiple-choice datasets and Q/A based on PubMed abstracts
  • 16. Training & Evaluation
    ● Mix of training on open Q/As, multiple-choice questions, and raw text (domain expertise or research publications)
    ● Most were evaluated on medical reasoning benchmarks whose training data had also been used to train the models
    ● All suffer from the same, more general, open issue of how to objectively evaluate LLMs
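
One way to make the concern about benchmark data leaking into training concrete is to measure how many evaluation items also occur in the training set. The sketch below is a minimal exact-match check after light normalization; the function names and sample questions are illustrative assumptions, and a real audit would also need to catch paraphrases.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation and whitespace so trivially different copies match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def contamination_rate(train_items, eval_items) -> float:
    """Fraction of evaluation items whose normalized text also appears in the training items."""
    if not eval_items:
        return 0.0
    seen = {normalize(item) for item in train_items}
    hits = sum(1 for item in eval_items if normalize(item) in seen)
    return hits / len(eval_items)


# Illustrative data only.
train = ["What causes angioedema?", "Define essential hypertension."]
evaluation = ["what causes  angioedema", "What is the morphology of angioedema?"]
rate = contamination_rate(train, evaluation)
```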
  • 17. What are Ontologies and Description Logic (DL)?
  • 18.
    ● Rigorously specified conceptualizations of a domain as mathematical logic
    ● Usually expressed as hierarchies of classes, restrictions on relationships, etc.
    ● Meant for automated processing by logical reasoning tools
  • 19. Mondal, Sutapa, Vijaya Raghava Mutharaju, and Sumit Bhatia. Embeddings for the EL++ description logic. Diss. IIIT-Delhi, 2020.
  • 20. What is SNOMED-CT?
    Internationally standardized medical terminology system with over 360K medical concepts, 1.25M relationships between them, and 9.6K textual definitions created by domain experts. Uses a DL that facilitates automation and machine reasoning.
    ● Released in US English, UK English, UK Australian, Spanish, Danish, Dutch, Lithuanian, Swedish, and Canadian French
    Angioedema ⊑ Edema
    Angioedema ⊑ ∃ morphology . angioedema
    A_ACE ⊑ ∃ morphology . (Angioedema ⊓ ∃ caused_by . kallidin i)
    Essential hypertension ⊑ Hypertensive disorder, systemic arterial
    Essential hypertension ⊑ (∃ located-in . (systemic circulatory system structure))
  • 21. Nested matryoshka dolls are a good analogy for visualizing DL concept inclusion (⊑). All instances of Essential hypertension are within the set of all things that stand in a located-in relation with a systemic circulatory system structure. ∃: existential role restriction
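
The matryoshka picture corresponds to the standard set-theoretic semantics of description logics. For an interpretation $\mathcal{I}$ with domain $\Delta^{\mathcal{I}}$ (the symbols $C$, $D$, $r$ below are generic concept and role names, not SNOMED-CT identifiers), the constructors used on these slides are read as follows:

```latex
\begin{align*}
\mathcal{I} \models C \sqsubseteq D \quad &\iff \quad C^{\mathcal{I}} \subseteq D^{\mathcal{I}} \\
(\exists r . C)^{\mathcal{I}} \quad &= \quad \{\, x \in \Delta^{\mathcal{I}} \mid \exists y : (x, y) \in r^{\mathcal{I}} \wedge y \in C^{\mathcal{I}} \,\} \\
(C \sqcap D)^{\mathcal{I}} \quad &= \quad C^{\mathcal{I}} \cap D^{\mathcal{I}}
\end{align*}
```

Read this way, "Essential hypertension ⊑ ∃ located-in . (systemic circulatory system structure)" says that the set of essential hypertension instances is contained in the set of things with at least one located-in link to a systemic circulatory system structure.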
  • 22. Angioedema caused by angiotensin-converting-enzyme inhibitor (A_ACE) ⊓ - intersection of concepts ∃ - existential role restriction named concept
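
To see how ⊓ and ∃ combine in the A_ACE axiom, the following sketch evaluates it over a tiny hand-made interpretation using Python sets. The individuals (`d1`, `m1`, `k1`, ...) and the helper `exists` are invented for illustration; a real reasoner works on the axioms themselves rather than on enumerated instances.

```python
# Toy interpretation: every individual name here is an illustrative assumption.
angioedema_morphology = {"m1"}            # instances of the morphology concept
kallidin_i = {"k1"}                       # instances of the substance concept
caused_by = {("m1", "k1")}                # role: (subject, object) pairs
morphology = {("d1", "m1"), ("d2", "m2")}


def exists(role, filler):
    """Extension of (∃ role . filler): subjects with at least one role link into filler."""
    return {x for (x, y) in role if y in filler}


# Angioedema ⊓ ∃ caused_by . kallidin i
inner = angioedema_morphology & exists(caused_by, kallidin_i)

# ∃ morphology . (Angioedema ⊓ ∃ caused_by . kallidin i)
right_hand_side = exists(morphology, inner)

# A_ACE ⊑ right-hand side holds in this interpretation if it is a subset.
a_ace = {"d1"}
axiom_holds = a_ace <= right_hand_side
```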
  • 23. What are Controlled Natural Languages (CNL)? DLs are designed for computer processing and are not easily read by non-mathematicians. Since CNLs are based on natural languages, their grammars use the same syntactic structures: sentences, noun phrases, verb phrases, and relative clauses. CNLs were originally designed for use by domain experts to encode knowledge without working directly in DL.
  • 24. Kuhn, Tobias. "The understandability of OWL statements in controlled English." (2013): 101-115
  • 25. “Every A_ACE morphology an Angioedema caused_by Kallidin i”
  • 26.
    ● Adopt a CNL for use with SNOMED-CT
      ○ Using phraseology appropriate for the domain (pathophysiology)
    ● The CNL phrases generated can be used as training data for LLMs
    “Every A_ACE is characterized in form by an Angioedema caused by Kallidin i”
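
A minimal sketch of how such a phrase could be generated from an axiom is shown below. The `Some` class, the `ROLE_PHRASES` table, and the `countable` flag are assumptions made for this example; they are not the data model of the software used in the talk.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed mapping from role names to domain-appropriate phraseology.
ROLE_PHRASES = {
    "morphology": "is characterized in form by",
    "caused_by": "caused by",
}


@dataclass(frozen=True)
class Some:
    """An existential restriction (∃ role . filler), optionally qualified by a nested one."""
    role: str
    filler: str
    countable: bool = True          # substances such as "Kallidin i" take no article
    nested: Optional["Some"] = None


def article(noun: str) -> str:
    return "an" if noun[0].lower() in "aeiou" else "a"


def render_restriction(r: Some) -> str:
    noun = f"{article(r.filler)} {r.filler}" if r.countable else r.filler
    text = f"{ROLE_PHRASES[r.role]} {noun}"
    if r.nested is not None:
        text += " " + render_restriction(r.nested)
    return text


def render_axiom(concept: str, r: Some) -> str:
    """Verbalize (concept ⊑ restriction) as a CNL sentence."""
    return f"Every {concept} {render_restriction(r)}"


axiom = Some("morphology", "Angioedema",
             nested=Some("caused_by", "Kallidin i", countable=False))
phrase = render_axiom("A_ACE", axiom)
```

Changing only the `ROLE_PHRASES` entries is enough to move between the terse wording ("morphology an Angioedema") and the pathophysiology-oriented wording shown on this slide.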
  • 27. Text Definitions
    ● SNOMED-CT includes text definitions
      ○ “[..] applied to some SNOMED CT concepts that provides additional information about the intended meaning or usage of the concept.”
    ● These can be used in addition to SNOMED-CT CNL phrases to train LLMs
  • 28. Combining CNL and Text Definitions
    Angioedema ⊑ Non-allergic hypersensitivity reaction
    Non-allergic hypersensitivity reaction ⊑ Non-allergic hypersensitivity process
    “Every Angioedema is a Non-allergic hypersensitivity reaction”
    “Every Non-allergic hypersensitivity reaction is a Non-allergic hypersensitivity process”
    Non-allergic hypersensitivity process (SNOMED-CT’s text definition): “A pathological nonimmune process generally directed towards a foreign substance, which results in tissue injury, which is usually transient. It is the realization of the pseudoallergic disposition. A variety of mechanisms such as direct histamine release, complement activation, cyclooxygenase activation and bradykinin generation may be involved.”
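
The combination on this slide can be sketched as a walk up the hierarchy that emits one CNL sentence per step and appends a text definition wherever one exists. The dictionaries below hold only the two axioms and the first sentence of the definition quoted above; assuming a single parent per concept is a simplification made for the example.

```python
# Stated is-a axioms from the slide (single parent per concept assumed here).
IS_A = {
    "Angioedema": "Non-allergic hypersensitivity reaction",
    "Non-allergic hypersensitivity reaction": "Non-allergic hypersensitivity process",
}

# First sentence of the text definition quoted on the slide.
TEXT_DEFINITIONS = {
    "Non-allergic hypersensitivity process": (
        "A pathological nonimmune process generally directed towards a foreign "
        "substance, which results in tissue injury, which is usually transient."
    ),
}


def training_passage(concept: str) -> str:
    """Interleave CNL sentences with any available text definitions, walking up the hierarchy."""
    lines = []
    current = concept
    while current in IS_A:
        parent = IS_A[current]
        lines.append(f"Every {current} is a {parent}.")
        if parent in TEXT_DEFINITIONS:
            lines.append(f"{parent}: {TEXT_DEFINITIONS[parent]}")
        current = parent
    return "\n".join(lines)


passage = training_passage("Angioedema")
```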
  • 29. Putting into Practice: Training an LLM on SNOMED-CT CNL phrases and Text Definitions
    ● Create a dataset of instructions to define SNOMED-CT terms
      ○ Create full definition instructions and instructions on the component parts of each definition (granular instructions)
    ● Train an existing model on the text definitions
    ● Perform instruction fine tuning on the resulting model
      ○ Use transfer learning to boost learning of SNOMED-CT definitions from the training on textual definitions
      ○ Definition Rendering Instruction Fine Tuning (DRIFT)
  • 30. Full / Granular Definition Instructions
    Q: What is Angiotensin converting enzyme inhibitor-aggravated angioedema?
    A: Angiotensin converting enzyme inhibitor-aggravated angioedema is defined in SNOMED-CT as a disorder (a clinical diagnosis). Angiotensin converting enzyme inhibitor-aggravated angioedema is an angioedema caused by angiotensin-converting-enzyme inhibitor, a drug-aggravated angioedema-urticaria, [..] It is a hypersensitivity process characterized in form by an angioedema caused by a substance with angiotensin-converting enzyme inhibitor mechanism of action and is characterized in form by an angioedema caused by kallidin i.
    Q: What causes an Angiotensin converting enzyme inhibitor-aggravated angioedema?
    A: It is caused by kallidin i
    Q: What is the morphology of an Angiotensin converting enzyme inhibitor-aggravated angioedema?
    A: It is characterized in form by an angioedema
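
The granular instructions follow a regular question and answer pattern per relationship, which suggests they can be templated. The sketch below reproduces the two granular examples from this slide; the template table, the `instruction`/`response`/`category` record keys, and the function name are assumptions for illustration rather than the format used by the DRIFT tooling.

```python
# Assumed question/answer templates, keyed by relationship name.
GRANULAR_TEMPLATES = {
    "caused_by": ("What causes {np}?", "It is caused by {value}"),
    "morphology": ("What is the morphology of {np}?", "It is characterized in form by {value}"),
}


def granular_instructions(noun_phrase: str, relationships: dict) -> list:
    """Build one instruction record per relationship that has a template."""
    records = []
    for role, value in relationships.items():
        if role not in GRANULAR_TEMPLATES:
            continue
        question, answer = GRANULAR_TEMPLATES[role]
        records.append({
            "category": role,
            "instruction": question.format(np=noun_phrase),
            "response": answer.format(value=value),
        })
    return records


records = granular_instructions(
    "an Angiotensin converting enzyme inhibitor-aggravated angioedema",
    {"caused_by": "kallidin i", "morphology": "an angioedema"},
)
```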
  • 31. Subset of SNOMED-CT
    ● SNOMED CT concepts are organised into 19 distinct hierarchies, covering different aspects of healthcare
    ● Generate definitions from the subset of hierarchies dealing with medical problems
      ○ Clinical finding (includes findings and disorders)
      ○ Subset of Body structure: Morphological abnormalities, which physically characterize disorders
      ○ Situation with explicit context (situation)
  • 33.
    ● Began with the OpenHermes-2.5-Mistral-7B model
    ● Performed unsupervised training on 7,694 SNOMED-CT text definitions
      ○ Using the September 23rd, 2023 release of the SNOMED CT United States Edition
    ● Performed DRIFT on the medical problem hierarchies
      ○ Used full instruction definitions (130K)
      ○ Added 80% of granular instructions from each category (204K)
      ○ Validated training using the remaining 20% (102K)
      ○ Used QLoRA fine tuning on an Apple Mac Studio M1 Ultra with 128GB RAM (OoriData servers)
    ● Runtime of 1 day 17 hours
    ● Software used: mlx, mlx-tuning-fork, Ogbuji-PT, and django-snomed-ct
    https://huggingface.co/cogbuji/Mr-Grammatology-clinical-problems-Mistral-7B-0.5
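
The per-category 80/20 division of granular instructions can be sketched as a stratified split. The record layout (a `category` key per record), the fixed seed, and the function name are assumptions for this example; the sketch does not reproduce the actual dataset sizes or the tooling listed above.

```python
import random


def split_by_category(records, train_fraction=0.8, seed=0):
    """Shuffle within each category, then send train_fraction to training and the rest to validation."""
    rng = random.Random(seed)
    by_category = {}
    for record in records:
        by_category.setdefault(record["category"], []).append(record)

    train, validation = [], []
    for category in sorted(by_category):
        items = by_category[category][:]
        rng.shuffle(items)
        cut = int(len(items) * train_fraction)
        train.extend(items[:cut])
        validation.extend(items[cut:])
    return train, validation


# Illustrative records: 10 per category.
records = [{"category": c, "id": i} for c in ("caused_by", "morphology") for i in range(10)]
train, validation = split_by_category(records)
```

Splitting inside each category, rather than over the pooled records, keeps rarer relationship types represented in both the training and validation sets.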
  • 35. Conclusion (future considerations)
    ● Investigate how logical reasoning enabled by DL and term synonyms can be leveraged to further generate gold standard text for medical language model training
      ○ Transitivity of relations, ACE vs. “Angiotensin converting enzyme”
    ● Use other LLMs (medical LLMs, larger LLMs, etc.)
    ● Train on other (or all) SNOMED-CT categories
    ● Try other foundational biomedical ontologies (widely adopted and including textual definitions):
      ○ Foundational Model of Anatomy (FMA): 120K classes and > 2.1M relationships
      ○ Gene Ontology (GO): 42K classes
    ● Evaluate against standard medical reasoning benchmarks (with and without training against their data)
    ● Investigate prompting strategies in depth (chain of biological thought, etc.)
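
As one possible reading of the first item, the sketch below uses the transitivity of concept inclusion to derive sentences that are implied but not stated, which could then be added to the training text. The `PARENTS` table reuses axioms shown earlier in the deck; the function names are assumptions, and a DL reasoner would be needed for inferences beyond simple hierarchy traversal.

```python
# Stated parents, taken from axioms shown on earlier slides.
PARENTS = {
    "Angioedema": ["Edema", "Non-allergic hypersensitivity reaction"],
    "Non-allergic hypersensitivity reaction": ["Non-allergic hypersensitivity process"],
}


def ancestors(concept, parents):
    """All concepts reachable by following is-a links upward (breadth-first, no repeats)."""
    found = []
    queue = list(parents.get(concept, []))
    while queue:
        candidate = queue.pop(0)
        if candidate not in found:
            found.append(candidate)
            queue.extend(parents.get(candidate, []))
    return found


def inferred_sentences(concept, parents):
    """CNL sentences for ancestors that are implied by transitivity rather than stated directly."""
    stated = set(parents.get(concept, []))
    return [f"Every {concept} is a {a}" for a in ancestors(concept, parents) if a not in stated]


inferred = inferred_sentences("Angioedema", PARENTS)
```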