Large Language Models (LLMs) have exploded into the modern research and development consciousness and triggered an artificial intelligence revolution. They are well positioned to have a major impact on Medical Informatics. However, much of the data used to train these models is general-purpose and, in some cases, synthetically generated by other LLMs. Ontologies are shared, agreed-upon conceptualizations of a domain that facilitate computational reasoning. They have become important tools in biomedicine, supporting critical aspects of healthcare and biomedical research, and are integral to science. In this talk, we will delve into ontologies, their representational and reasoning power, and how terminology systems such as SNOMED-CT, an international master terminology providing comprehensive coverage of the entire domain of medicine, can be used with Controlled Natural Languages (CNL) to advance how LLMs are used and trained.
1. Reference Domain Ontologies and Large
Medical Language Models
Chimezie Ogbuji
Chief Medical Informatics Officer / Amara Home Care
Owner / Metacognition
2. Semantic Web
● Semantic Web
○ Goal: build a framework for intelligent machine understanding on top of the standards, ubiquity,
and connectedness of the World Wide Web (WWW)
○ 2006 to 2010: Most exciting time. Peak of inflated expectations
○ Driven by standardization efforts at the WWW Consortium (W3C)
○ Equal parts hype and robust infrastructure for modern applications
A cautionary tale for Large Language Models (LLM)
The Semantic Web: Where is it now? Rashif Ray Rahman / Oct 3, 2018
https://medium.com/@schivmeister/the-semantic-web-where-is-it-now-f4773f3097e3
3. Semantic Web Layers
● How are things identified and retrieved?
● Rules, Logic, & Ontologies: How can machines reason about the data?
● How can machines ask questions of the data?
● How can data be exchanged in a knowledge graph format?
5. Gartner's 2016 Hype Cycle for Emerging Technologies
In 2016, Natural Language (NL) question
answering was in the trough of
disillusionment, a year before the
transformer, a key technology underlying
today’s language models, was introduced
A Cautionary Tale?
6. ● SemanticDB work at the Cleveland Clinic Foundation
○ A reconceived implementation of an existing, 30-year-old registry of heart
surgery and cardiovascular intervention cases
■ 500 variables, ~200K patient records, 100 heart and vascular
research publications per year
○ Addressing shortcomings of conventional data warehouse functionality
■ Domain-specific criteria conceived by researchers who work with DB
administrators
○ Internally funded project, in conjunction with the CCF Innovations Department, in partnership
with Cycorp, Inc.
■ Cyc: a powerful reasoning system and knowledge base with built-in capability
for natural language
7. The Semantic Research Assistant (SRA)
Example query: patients who had a coronary artery bypass graft (CABG)
between 2008 and 2010 (inclusive), after a percutaneous coronary
intervention (PCI)
Pierce, C. D., Booth, D., Ogbuji, C., Deaton, C., Blackstone, E., & Lenat, D. (2012). SemanticDB: A semantic web
infrastructure for clinical research and quality reporting. Current Bioinformatics, 7(3), 267-277.
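The example query combines a date-range constraint with a temporal-ordering constraint. A minimal Python sketch of that logic over in-memory records follows. The `Procedure` record, its field names, and the sample patients are hypothetical and are not the SemanticDB schema or the SRA implementation, which resolved natural language to ontology concepts and dispatched SPARQL against an RDF registry.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Procedure:
    """Hypothetical record: one procedure performed on one patient."""
    patient_id: str
    kind: str            # e.g. "CABG" or "PCI"
    performed_on: date


def cabg_after_pci(procedures, start_year=2008, end_year=2010):
    """Return ids of patients with a CABG in [start_year, end_year]
    that took place after at least one of their PCIs."""
    by_patient = {}
    for proc in procedures:
        by_patient.setdefault(proc.patient_id, []).append(proc)

    matches = set()
    for patient_id, procs in by_patient.items():
        pci_dates = [p.performed_on for p in procs if p.kind == "PCI"]
        for p in procs:
            in_window = start_year <= p.performed_on.year <= end_year
            after_pci = any(d < p.performed_on for d in pci_dates)
            if p.kind == "CABG" and in_window and after_pci:
                matches.add(patient_id)
    return sorted(matches)


# Made-up sample data, for illustration only.
sample = [
    Procedure("p1", "PCI", date(2007, 5, 1)),
    Procedure("p1", "CABG", date(2009, 3, 15)),  # in window, after a PCI
    Procedure("p2", "CABG", date(2009, 6, 1)),   # in window, but no prior PCI
    Procedure("p3", "PCI", date(2010, 1, 1)),
    Procedure("p3", "CABG", date(2011, 2, 2)),   # after a PCI, but outside window
]
print(cabg_after_pci(sample))
```

Only patient `p1` satisfies both constraints in this toy data.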
8. ● Primary challenges at the time
○ Developing representational models (ontologies) that can cover the domain
of a 200K+ patient dataset to facilitate machine reasoning
○ Resolving natural language query fragments to concepts in these models
(the purpose of the SRA and challenge of Natural Language Processing at
the time)
○ Dispatching Semantic Web queries (SPARQL) to the RDF patient
registry
○ Evaluating the queries efficiently
9. Parrot-like software that uses sophisticated analysis of the patterns and relationships
underlying language to simulate intelligent natural language
“How GPT3 Works - Visualizations and Animations” - Jay Alammar
What are Large Language Models?
https://jalammar.github.io/how-gpt3-works-visualizations-animations/
11. Semantic Web
vs. LLM
● Semantic Web
○ The basis for the value proposition was well understood
○ Driven mainly by the development of industry
standards
○ Adoption significantly lagged behind the research
○ Its applicability was not well defined
● Large Language Models
○ The basis for the value proposition is not fully
understood (the mechanism is a mystery to us)
○ No standardization (driven by community use)
○ Lightspeed community use keeping pace with
lightspeed research
○ Its applicability is well defined
12. ● Artificial Neural Networks (ANN): machine learning models inspired by biological neural
networks in animal brains; deep learning uses ANNs with many layers
● Natural language processing (NLP): interdisciplinary subfield of computer science and
linguistics concerned with the ability of computers to support and manipulate human language
● LLMs: probabilistic models of natural language built on ANNs and trained on large amounts of text
● Fine-tuning: subsequent, task-specific training performed on a model to refine it for a
specific use case
● Instruction tuning: fine-tuning that improves a model's ability to follow instructions
● Transfer learning: a technique where knowledge learned from one task is reused to boost
performance on a related task
● Unsupervised learning: learning patterns from data without labels indicating what is right or wrong
Terminology
Sindhu et al. "An empirical science research on bioinformatics in machine learning." 2020
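To make "probabilistic model of natural language" concrete, here is a deliberately tiny sketch: a bigram model that estimates the probability of the next word from counts. This is an illustration only. LLMs estimate such probabilities with neural networks over far longer contexts, and the three-sentence corpus below is made up.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count word-to-next-word transitions in a list of sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, word):
    """Relative-frequency estimate of P(next word | word)."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}


# Made-up toy corpus.
corpus = [
    "angioedema is an edema",
    "angioedema is a disorder",
    "edema is a finding",
]
model = train_bigram(corpus)
print(next_word_probs(model, "is"))
```

In this corpus "is" is followed by "a" twice and "an" once, so the model assigns them probabilities of 2/3 and 1/3.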
13. What is the state of the
art in the use of LLMs in
the domain of medicine?
14. ● MedAlpaca (4/2023)
○ Trained on Q/A pairs from online forums (52K), medical curriculum flashcards (34K),
Q/A pairs from WikiDoc (68K), and data from open NLP datasets and benchmarks.
Evaluated on United States Medical Licensing Examination (USMLE) self-assessment
datasets (119)
● Med-PaLM 2 (5/2023)
○ Trained on a multiple-choice question dataset for solving medical problems, collected
from professional examinations (183K), the training data of several standard
multiple-choice benchmark datasets also used for evaluation (10K), and common consumer
questions (60). Evaluated on standard multiple-choice datasets
Recent Medical LLMs
15. ● MEDITRON (11/2023)
○ Trained on a dataset of clinical practice guidelines (46K), PubMed Papers (5M) &
abstracts (16M), and standard benchmark training datasets (10,178). Evaluated on
standard multiple-choice datasets and Q/A based on PubMed abstracts
● MedPrompt (11/2023)
○ A study of how prompting alone can unlock GPT-4's capabilities on medical
challenge problems, without additional training
● BioMistral (2/2024)
○ Trained on PMC Open Access Subset of medical research papers (~1.47M documents).
Evaluated on standard multiple-choice datasets and Q/A based on PubMed abstracts
16. Training & Evaluation
● Mix of training on open Q/As, multiple choice questions, and raw text (domain
expertise or research publications)
● Most were evaluated on medical reasoning benchmarks whose training data was
also used to train the models
● All suffer from the same, more general open problem of how to objectively
evaluate LLMs
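One simple check that bears on the evaluation concern above is to measure how many evaluation items also appear, near-verbatim, in the training data. The sketch below is an assumption-laden illustration, not a method used by any of the models listed: the function names are hypothetical and the matching is exact after light normalization, so paraphrased overlap would go undetected.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation and whitespace so trivial edits don't hide a match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def contamination_rate(training_items, evaluation_items):
    """Fraction of evaluation items whose normalized text also appears in training."""
    if not evaluation_items:
        return 0.0
    seen = {normalize(t) for t in training_items}
    hits = sum(1 for e in evaluation_items if normalize(e) in seen)
    return hits / len(evaluation_items)


# Made-up items, for illustration only.
train = ["What causes angioedema?", "Define edema."]
evaluation = ["what causes  Angioedema", "What is hypertension?"]
print(contamination_rate(train, evaluation))
```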
18. ● Rigorously specified
conceptualizations of a domain,
expressed in mathematical logic
● Usually expressed as
hierarchies of classes,
restrictions on relationships,
etc.
● Meant for automated processing
by logical reasoning tools
19. Mondal, Sutapa, Vijaya Raghava Mutharaju, and Sumit Bhatia. Embeddings for
the EL++ description logic. Diss. IIIT-Delhi, 2020.
20. Angioedema ⊑ Edema
Angioedema ⊑ ∃ morphology . angioedema
A_ACE ⊑ ∃ morphology . (Angioedema ⊓ ∃ caused_by kallidin i )
Essential hypertension ⊑ Hypertensive disorder, systemic arterial
Essential hypertension ⊑ (∃ located-in . (systemic circulatory system structure))
Internationally standardized medical terminology system with over
360K medical concepts, 1.25M relationships between them, and 9.6K
textual definitions created by domain experts. Uses a description logic (DL) that
facilitates automation and machine reasoning.
● Released in US English, UK English, UK Australian, Spanish, Danish, Dutch,
Lithuanian, Swedish, and Canadian French
What is SNOMED-CT?
21. Nested matryoshka dolls are a good analogy for
visualizing DL concept inclusion (⊑)
All instances of
Essential hypertension
are within the set of all
things that stand in a
located-in relation
with a systemic
circulatory system
structure
∃ - existential role restrictions
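The nested-doll picture corresponds to the set-based reading of DL: a concept denotes a set of individuals, ∃ role . C denotes everything related by that role to some member of C, and ⊑ is set inclusion. A minimal sketch under that reading follows. The individuals (`case1`, `circ1`, and so on) are made up, and real DL reasoners work on the axioms rather than enumerating instances.

```python
def exists(role, filler, role_pairs):
    """∃ role . filler: all x that stand in `role` to at least one member of `filler`."""
    return {x for (x, y) in role_pairs.get(role, set()) if y in filler}


def is_subsumed(sub, sup):
    """C ⊑ D holds in an interpretation when C's instances are a subset of D's."""
    return sub <= sup


# Hypothetical toy interpretation with made-up individuals.
essential_hypertension = {"case1", "case2"}
systemic_circulatory_structure = {"circ1"}
role_pairs = {
    "located-in": {("case1", "circ1"), ("case2", "circ1"), ("case3", "circ1")},
}

# The "outer doll": everything located in some systemic circulatory system structure.
outer_doll = exists("located-in", systemic_circulatory_structure, role_pairs)
print(is_subsumed(essential_hypertension, outer_doll))
```

The inclusion holds in one direction only: `case3` is in the outer set without being an instance of Essential hypertension.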
22. Angioedema caused by angiotensin-converting-enzyme inhibitor
(A_ACE)
⊓ - intersection of concepts
∃ - existential role restriction
named concept
23. DLs are designed for computer processing and not easily read by
non-mathematicians
What are Controlled Natural
Languages (CNL)?
Since CNLs are based on natural languages, their grammars use the same
syntactic structures: sentences, noun phrases, verb phrases, and relative clauses
CNLs were originally designed for use by domain experts to encode knowledge
without working directly in DL
24. Kuhn, Tobias. "The understandability of OWL statements in controlled
English." (2013): 101-115
26. ● Adopt a CNL for use
with SNOMED-CT
○ Using phraseology
appropriate for the domain
(pathophysiology)
● The CNL phrases generated can
be used as training data for
LLMs
“Every A_ACE is characterized in form by an Angioedema caused by Kallidin i”
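A rough sketch of how DL axioms might be rendered as CNL phrases like the one above. The role-to-phrase table, the function names, and the simple a/an rule are assumptions made for illustration. They are not the talk's actual CNL grammar or tooling.

```python
# Hypothetical mapping from DL role names to domain-appropriate phrasing.
ROLE_PHRASES = {
    "morphology": "is characterized in form by",
    "caused_by": "caused by",
}


def article(noun):
    """Naive indefinite article choice based on the first letter."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def verbalize_subsumption(sub, sup):
    """Render `sub ⊑ sup` as a CNL sentence."""
    return f"Every {sub} is {article(sup)} {sup}"


def verbalize_restriction(sub, role, filler, qualifiers=()):
    """Render `sub ⊑ ∃ role . (filler ⊓ ∃ q_role . q_value ...)`.

    `qualifiers` is a sequence of (role, value) pairs narrowing the filler."""
    phrase = f"Every {sub} {ROLE_PHRASES[role]} {article(filler)} {filler}"
    for q_role, q_value in qualifiers:
        phrase += f" {ROLE_PHRASES[q_role]} {q_value}"
    return phrase


print(verbalize_subsumption("Angioedema", "Edema"))
print(verbalize_restriction(
    "A_ACE", "morphology", "Angioedema", [("caused_by", "Kallidin i")]
))
```

The second call reproduces the example phrase on this slide from the A_ACE axiom.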
27. ● SNOMED-CT includes text definitions
○ “[..] applied to some SNOMED CT concepts that provides additional information about the
intended meaning or usage of the concept.”
● These can be used in addition to SNOMED-CT CNL phrases to train LLMs
Text Definitions
28. Angioedema ⊑ Non-allergic hypersensitivity reaction
Non-allergic hypersensitivity reaction ⊑ Non-allergic hypersensitivity process
“Every Angioedema is a Non-allergic hypersensitivity reaction”
“Every Non-allergic hypersensitivity reaction is a Non-allergic hypersensitivity process”
Non-allergic hypersensitivity process (SNOMED-CT’s Text definition)
“A pathological nonimmune process generally directed towards a foreign substance, which
results in tissue injury, which is usually transient. It is the realization of the pseudoallergic
disposition. A variety of mechanisms such as direct histamine release, complement
activation, cyclooxygenase activation and bradykinin generation may be involved.”
Combining CNL and Text
Definitions
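The combination shown on this slide can be sketched as a walk up the subsumption hierarchy that emits a CNL sentence for each step and attaches a text definition wherever one exists. The dictionaries and the function name below are hypothetical stand-ins for data that would come from a SNOMED-CT release. Only the first sentence of the quoted definition is included.

```python
# Hypothetical in-memory stand-ins for SNOMED-CT content.
IS_A = {
    "Angioedema": ["Non-allergic hypersensitivity reaction"],
    "Non-allergic hypersensitivity reaction": ["Non-allergic hypersensitivity process"],
}
TEXT_DEFINITIONS = {
    # First sentence of the definition quoted on the slide.
    "Non-allergic hypersensitivity process": (
        "A pathological nonimmune process generally directed towards a foreign "
        "substance, which results in tissue injury, which is usually transient."
    ),
}


def article(noun):
    """Naive indefinite article choice based on the first letter."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def training_passage(concept):
    """Collect CNL sentences and text definitions for a concept and its ancestors."""
    lines, queue, seen = [], [concept], set()
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.add(current)
        if current in TEXT_DEFINITIONS:
            lines.append(f"{current}: {TEXT_DEFINITIONS[current]}")
        for parent in IS_A.get(current, []):
            lines.append(f"Every {current} is {article(parent)} {parent}")
            queue.append(parent)
    return lines


for line in training_passage("Angioedema"):
    print(line)
```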
29. ● Create a dataset of instructions to define SNOMED-CT terms
○ Create full definition instructions and instructions on the component parts of each
definition (granular instructions)
● Train an existing model on the text definitions
● Perform instruction fine tuning on the resulting model
○ Use transfer learning to boost learning of SNOMED-CT definitions from the training
on textual definitions
○ Definition Rendering Instruction Fine Tuning (DRIFT)
Training an LLM on SNOMED-CT CNL phrases and Text Definitions
Putting into Practice
30. What is Angiotensin converting enzyme inhibitor-aggravated angioedema?
Angiotensin converting enzyme inhibitor-aggravated angioedema is defined in SNOMED-CT
as a disorder (a clinical diagnosis). Angiotensin converting enzyme inhibitor-aggravated
angioedema is an angioedema caused by angiotensin-converting-enzyme inhibitor, a drug-
aggravated angioedema-urticaria, [..] It is a hypersensitivity process characterized in form
by an angioedema caused by a substance with angiotensin-converting enzyme inhibitor
mechanism of action and is characterized in form by an angioedema caused by kallidin i.
What causes an Angiotensin converting enzyme inhibitor-aggravated angioedema?
It is caused by kallidin i
What is the morphology of an Angiotensin converting enzyme inhibitor-aggravated
angioedema?
It is characterized in form by an angioedema
Full / Granular Definition
Instructions
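A sketch of how full and granular instruction records like these might be assembled. The record layout (`instruction` / `response` keys), the function names, and the question templates are assumptions modeled on the examples on this slide. They are not the actual DRIFT dataset format.

```python
def article(noun):
    """Naive indefinite article choice based on the first letter."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def full_instruction(name, definition_text):
    """One record asking for the complete definition of a concept."""
    return {"instruction": f"What is {name}?", "response": definition_text}


def granular_instructions(name, caused_by=None, morphology=None):
    """One record per component of the definition that is present."""
    records = []
    if caused_by:
        records.append({
            "instruction": f"What causes {article(name)} {name}?",
            "response": f"It is caused by {caused_by}",
        })
    if morphology:
        records.append({
            "instruction": f"What is the morphology of {article(name)} {name}?",
            "response": f"It is characterized in form by {article(morphology)} {morphology}",
        })
    return records


concept = "Angiotensin converting enzyme inhibitor-aggravated angioedema"
for record in granular_instructions(concept, caused_by="kallidin i", morphology="angioedema"):
    print(record["instruction"], "->", record["response"])
```

Splitting a definition into per-component records gives the model many short, focused question/answer pairs in addition to the single long definition.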
31. ● SNOMED CT concepts are organised into 19
distinct hierarchies, covering different aspects
of healthcare
● Generate definitions from subset of hierarchies
dealing with medical problems
○ Clinical finding (includes findings and disorders)
○ Subset of Body structure: Morphological abnormalities, which
physically characterize disorders
○ Situation with explicit context (situation)
Subset of SNOMED-CT
33. ● Began with OpenHermes-2.5-Mistral-7B model
● Performed unsupervised training on 7,694 SNOMED-CT text definitions
○ Using September 23rd 2023 release of SNOMED CT United States
Edition
● Performed DRIFT on medical problem hierarchies
○ Used full instruction definitions (130K)
○ Added 80% of granular instructions from each category (204K)
○ Validated training using remaining 20% (102K)
○ Used QLoRA fine tuning on Apple Mac Studio M1 Ultra with 128GB
RAM (OoriData servers)
● Runtime of 1 day 17 hours
● Software used: mlx, mlx-tuning-fork, Ogbuji-PT, and django-snomed-ct
https://huggingface.co/cogbuji/Mr-Grammatology-clinical-problems-Mistral-7B-0.5
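The per-category 80/20 division of granular instructions described above could be sketched as follows. The function name, the category labels, and the fixed seed are illustrative assumptions. The actual training used the software listed on this slide, whose APIs are not reproduced here.

```python
import random


def split_by_category(items_by_category, train_fraction=0.8, seed=0):
    """Shuffle each category independently and split it into train/validation parts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, validation = [], []
    for category in sorted(items_by_category):
        items = list(items_by_category[category])
        rng.shuffle(items)
        cut = int(len(items) * train_fraction)
        train.extend(items[:cut])
        validation.extend(items[cut:])
    return train, validation


# Made-up placeholder records grouped by hypothetical category names.
granular = {
    "causes": [f"cause-{i}" for i in range(10)],
    "morphology": [f"morph-{i}" for i in range(5)],
}
train_set, validation_set = split_by_category(granular)
print(len(train_set), len(validation_set))
```

Splitting within each category, rather than over the pooled records, keeps every instruction type represented in both the training and validation sets.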
35. Conclusion (future considerations)
● Investigate how logical reasoning enabled by DL and term synonymy can be
leveraged to further generate gold-standard text for medical language model
training
○ e.g., transitivity of relations; synonyms such as ACE vs. “Angiotensin converting enzyme”
● Use other LLMs (Medical LLMs, larger LLMs, etc.)
● Train on other (or all) SNOMED-CT categories
● Try other foundational biomedical ontologies (widely adopted and including
textual definitions):
○ Foundational Model of Anatomy (FMA): 120K classes and > 2.1M relationships
○ Gene Ontology (GO): 42K classes
● Evaluate against standard medical reasoning benchmarks (with and without
training against their data)
● Investigate prompting strategies in depth (Chain of biological thought, etc.)