Automatic generation of summaries that capture the salient aspects of a search resultset
(i.e., automatic summarization) has become an important task in biomedical research. Automatic
summarization offers an avenue for overcoming the information overload problem
prevalent in large online digital libraries. However, across many of the knowledge-driven
approaches for automatic summarization, it is not always clear which features most strongly
influence the quality of a summary. Instead, there has been considerable focus on
utilizing schema knowledge to facilitate browsing and exploration of generated summaries
a posteriori. Informative features should not be ignored, since they could be utilized to
help optimize the models that generate these semantic summaries in the first place.
In this research, we adopt a leave-one-out approach to assess the impact of various
features on the quality of automatically generated summaries that contain structured background
knowledge. We first create the gold standard summaries, using information-theoretic
methods, by extraction and validation; the semantic summaries are then transformed into
an equivalent textual format. Finally, various similarity metrics, such as cosine similarity,
Euclidean distance, and Jensen-Shannon divergence, are computed under different feature
combinations to assess summary quality against the textual gold standard. We report on
the relative importance of the various features used to automatically generate the semantic
summaries in a biomedical application. Our evaluation suggests that the proposed approach
is an effective automatic
ResQu: A Framework for Automatic Evaluation of Knowledge-Driven Automatic Summarization
1. RESQU: A FRAMEWORK FOR AUTOMATIC EVALUATION OF
KNOWLEDGE-DRIVEN AUTOMATIC SUMMARIZATION
MASTERS THESIS DEFENSE
NISHITA JAYKUMAR
MAY 26, 2016
MASTERS COMMITTEE
AMIT P. SHETH (ADVISOR)
THOMAS C. RINDFLESCH (NIH)
DELROY CAMERON (APPLE INC.)
KRISHNAPRASAD THIRUNARAYAN
6. • What is an effective summary?
- Saliency
- Compressed format
• Approaches to Automatic Summarization
- Extractive
- Abstractive
Automatic Summarization
Extractive summary
- A randomized, placebo-controlled trial of acetaminophen for treatment of migraine headache.
- Long-term evaluation of sumatriptan and naproxen sodium for the acute treatment of migraine in adolescents.
…………….
- Mapping from disease-specific measures to health-state utility values in individuals with migraine.
Abstractive summary
- Acetaminophen TREATS Migraine Disorders
- Sumatriptan TREATS Migraine Disorders
…………….
- Migraine Disorders PROCESS_OF Individuals
8. • Evaluating Summary Quality
Intrinsic evaluation:
- Compared to a human-curated gold standard.
- Using document similarity measures.
Extrinsic evaluation:
- Based on a secondary task.
- Through a discrete scoring system.
Evaluating Summaries
9. Intrinsic Evaluation of Extractive Summarization
• Pyramid Approach [Nenkova et al., 2004]
- Summary Content Units (SCU)
• Louis et al. [2009]
• Distribution of terms
• Kullback-Leibler
• Jensen-Shannon
Nenkova, Ani, and Rebecca Passonneau. "Evaluating content selection in summarization: The pyramid method." (2004).
Louis, Annie, and Ani Nenkova. "Automatic summary evaluation without human models." Notebook Papers and Results, Text Analysis Conference (TAC-2008), Gaithersburg, Maryland (USA). 2008.
10. • Information Misalignment
• Semantic summary – structured background knowledge.
• Gold standard – textual.
• Proposed Solution
• Summary transformation: predications to text.
• Semantic similarity computation.
Intrinsic Evaluation of Abstractive Summarization
11. Approach: ResQu
Word Co-occurrence: We can use the words that co-occur with the semantic predications in a summary to represent the meaning of the semantic predications, based on distributional semantics.
Leave-one-Out: By generating multiple summaries with features held out, we can effectively evaluate the impact of each feature.
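The leave-one-out idea above can be sketched in a few lines of Python. This is only an illustration: the four feature names are the ones used later in the talk (relevancy, connectivity, novelty, saliency), while the function name `leave_one_out` is a hypothetical helper, not part of ResQu.

```python
# Features applied by the summarizer, as named in the talk.
FEATURES = ("relevancy", "connectivity", "novelty", "saliency")


def leave_one_out(features):
    """Yield (held_out, remaining) pairs, one configuration per feature."""
    for held_out in features:
        remaining = tuple(f for f in features if f != held_out)
        yield held_out, remaining


# One summarizer configuration per held-out feature.
configs = dict(leave_one_out(FEATURES))
for held_out, remaining in configs.items():
    print(f"leave-out-{held_out}: summarize with {', '.join(remaining)}")
```

Each configuration would produce one summary, which is then compared against the gold standard.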
12. A semantic summary can be understood and potentially improved by leveraging distributional
statistics between the structured knowledge that comprises the semantic summary and the
words with which these structured constructs co-occur, across the corpus.
Thesis Statement
14. • Similarity between the semantic summary (SS) and the gold standard (GS)
- Cosine similarity, Euclidean distance, Jensen-Shannon divergence
• Root Mean-Squared Error
- For each summary generated with a feature held out
Measuring Similarity
The summary that is least similar to the gold standard identifies the most important feature: the one that was held out to generate it.
15. Assertional Knowledge
Definitional Knowledge
Complementary, Disjoint
65 Attributes: 62 Provenance Metadata, 3 Semantic Attributes
MEDLINE (1865 – 2015): Largest Biomedical Knowledgebase, >25 million abstracts, PubMed, PMC
Semantic Predications
Medical Subject Headings (MeSH): 15 Unique Trees, Max Depth – 15, ~27,000 Terms
SPECIALIST Lexicon
Semantic Network
Metathesaurus
>300k concepts, >100 Vocabularies, 9 million triples
134 Types, 15 Groups, 54 predicates
Unified Medical Language System (UMLS)
MeSH Indexing
d1, d2, d3, …, dn
Resource-Rich Biomedical Knowledge
16. ResQu System Architecture
- User Query Processor
- Document Selector
- MEDLINE
- Predication Extractor (SemRep)
- Graph Generator
- Summarizer (Schema Summarizer)
- Concept Mapper
- Predication Mapper
- Vectorizer
- ResQu Summary Vectors
- Jericho Crawler
- Gold standard creation module
- Gold standard Vectors
- Similarity Computation Module
17. User Query
• l: label of an entity (or concept) in the UMLS,
- Migraine Disorders: C0149931
• c1: Humans[MH] and c2: Clinical Trial[PTYP]
• dt: the date range of documents
• ub: the upper bound on the number of documents (default = 5000)
q = (l, c1, c2, dt, ub)
18. q = (Migraine Disorders[MH] AND Humans[MH] AND Clinical Trial[PTYP] AND 1860/01:2014/08[DCOM])
User Query Instance
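As an illustration of how the tuple q = (l, c1, c2, dt, ub) maps onto the query instance above, here is a minimal Python sketch. The class name `UserQuery`, its field names, and the method `to_search_term` are assumptions made for this example; the transcript does not show how ResQu builds the string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserQuery:
    """Hypothetical container for the 5-element query tuple."""
    label: str                # l: UMLS concept label, used as a MeSH heading
    c1: str                   # first filter, e.g. "Humans[MH]"
    c2: str                   # second filter, e.g. "Clinical Trial[PTYP]"
    date_range: str           # dt, e.g. "1860/01:2014/08"
    upper_bound: int = 5000   # ub: cap on the number of retrieved documents

    def to_search_term(self) -> str:
        """Join the parts with AND, tagging the label and the date range."""
        parts = [
            f"{self.label}[MH]",
            self.c1,
            self.c2,
            f"{self.date_range}[DCOM]",
        ]
        return " AND ".join(parts)


q = UserQuery("Migraine Disorders", "Humans[MH]",
              "Clinical Trial[PTYP]", "1860/01:2014/08")
print(q.to_search_term())
```

The upper bound is kept as a separate field because the query string on the slide does not include it.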
19. • Receives the query from the User Query Processor.
• Retrieves the set of MEDLINE documents.
• D = {d1, d2, ..., dn}
• Uses the MEDLINE Entrez Search API.
Document Selection
21. Semantic Predications Extractor
A randomized, placebo-controlled trial of acetaminophen for treatment of migraine disorders
Acetaminophen TREATS Migraine Disorders
22. Automatic Summarizer
Inflammation mediated by the immune system is known to be important in carcinogenesis and, specifically, T helper 17 cells have been reported to play a role in tumor
progression by promoting neo-angiogenesis. The aim of this study was to investigate whether inflammatory cytokines and vascular endothelial growth factor (VEGF) levels
in exhaled breath condensate (EBC) and in serum were related to tumor size in patients with non-small cell lung cancer (NSCLC). IL-6, IL-17, TNF-α and VEGF levels were
measured in EBC and serum of 15 patients with stage I-IIA NSCLC and in 30 healthy controls by immunoassay. The tumor size was measured by a CT scan. The
concentrations of IL-6, IL-17 and VEGF were significantly higher in EBC of patients with lung cancer, compared with controls, while only serum IL-6 concentration was
higher in patients compared to controls. A significant correlation (r = 0.78, p = 0.001) was observed between EBC levels of IL-6 and IL-17; IL-17 was also correlated to EBC
levels of the VEGF (r = 0.83, p < 0.001) and TNF-α (r = 0.62, p = 0.014). The tumor diameter was significantly correlated with EBC concentrations of VEGF (r = 0.58, p =
0.039), IL-6 (r = 0.67, p = 0.013) and IL-17 (r = 0.66, p = 0.017). Our results show a significant relationship between inflammatory and angiogenic markers, measured in
EBC by a non-invasive method, and tumor mass. To assess whether polymorphisms of the interleukin-23 receptor (IL23R) gene are associated with bladder transitional cell
carcinoma because chronic inflammation contributes to bladder cancer and the IL23R is known to be critically involved in the carcinogenesis of various malignant tumors.
226 patients with bladder cancer and 270 age-matched controls were involved in the study. Polymerase chain reaction-restriction fragment length polymorphism was used
for genotyping. Genotype distribution and allelic frequencies between patients and controls were compared. In all three single nucleotide polymorphisms of IL23R studied,
the distribution of genotype and allele frequencies of rs10889677 differed significantly between patients and controls. The frequency of allele C of rs10889677 was
significantly increased in cases compared with controls (0.2898 vs. 0.1833, odds ratio 1.818, 95 % confidence interval 1.349-2.449). The result indicates that IL23R may
play an important role in the susceptibility of bladder cancer in Chinese population. For over a century, inactivated or attenuated bacteria have been employed in the clinic
as immunotherapies to treat cancer, starting with the Coley's vaccines in the 19th century and leading to the currently approved bacillus Calmette-Guérin vaccine for
bladder cancer. While effective, the inflammation induced by these therapies is transient and not designed to induce long-lasting tumor-specific cytolytic T lymphocyte
(CTL) responses that have proven so adept at eradicating tumors. Therefore, in order to maintain the benefits of bacteria-induced acute inflammation but gain long-lasting
anti-tumor immunity, many groups have constructed recombinant bacteria expressing tumor-associated antigens (TAAs) for the purpose of activating tumor-specific CTLs.
One bacterium has proven particularly adept at inducing powerful anti-tumor immunity, Listeria monocytogenes (Lm). Lm is a gram-positive bacterium that selectively
infects antigen-presenting cells wherein it is able to efficiently deliver tumor antigens to both the MHC Class I and II antigen presentation pathways for activation of tumor-
targeting CTL-mediated immunity. Lm is a versatile bacterial vector as evidenced by its ability to induce therapeutic immunity against a wide-array of TAAs and specifically
infect and kill tumor cells directly. It is for these reasons, among others, that Lm-based immunotherapies have delivered impressive therapeutic efficacy in preclinical
models of cancer for two decades and are now showing promise clinically.
[Figure: semantic predication graph summarizing the search results. Concepts: Ibuprofen, Topiramate, Headache, Acetaminophen, Vestibule, Pain, Migraine Disorders. Predicates: TREATS, PREVENTS, ISA, LOCATION_OF.]
24. Step 1: get all documents for each concept in the semantic summary.
Step 2: create a bag-of-words for each concept (term frequency).
Step 3: aggregate the bag-of-words of each concept across the entire semantic summary.
Step 4: use the idf of each word in the corpus to create the tf-idf vector for the given semantic summary.
Summary Transformation
$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \log \frac{N}{n_t}$
where N is the number of documents in the corpus D and n_t is the number of documents that contain the term t.
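The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ResQu implementation: the function name `summary_tfidf`, the toy corpus, and the concept-to-document mapping are all made up for the example, and documents are assumed to be pre-tokenized.

```python
import math
from collections import Counter


def summary_tfidf(concept_docs, corpus):
    """Build a tf-idf vector for a semantic summary.

    concept_docs: {concept: [tokenized documents mentioning it]}  (Step 1)
    corpus:       list of all tokenized documents
    """
    n_docs = len(corpus)  # N in the formula
    # n_t: number of corpus documents that contain term t
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    # Steps 2-3: term frequencies per concept, aggregated over the summary
    tf = Counter()
    for docs in concept_docs.values():
        for doc in docs:
            tf.update(doc)
    # Step 4: weight each aggregated term frequency by log(N / n_t)
    return {t: tf[t] * math.log(n_docs / doc_freq[t])
            for t in tf if doc_freq[t]}


# Toy data (hypothetical, for illustration only)
corpus = [
    ["ibuprofen", "treats", "migraine"],
    ["acetaminophen", "treats", "migraine"],
    ["migraine", "headache"],
    ["aspirin", "headache"],
]
concept_docs = {
    "Ibuprofen": [corpus[0]],
    "Migraine Disorders": corpus[:3],
}
vec = summary_tfidf(concept_docs, corpus)
```

Note that a document mentioning two summary concepts contributes its words twice in this sketch; whether ResQu deduplicates such documents is not stated in the slides.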
25. Bag-of-words Model
We used hemofiltration to treat a patient with digoxin overdose that was complicated by refractory hyperkalemia.
bow = [(we,1), (used,1), . . ., (hyperkalemia,1)]
bow_sparse_vector = [(678,1), (2,1), . . ., (999,1)]
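A small Python sketch of the same example, assuming simple lowercasing and whitespace tokenization (the slides do not say which tokenizer was used). The ids here are assigned in first-seen order within this one sentence, so they differ from the corpus-wide ids (678, 2, 999) shown on the slide.

```python
from collections import Counter

sentence = ("We used hemofiltration to treat a patient with digoxin overdose "
            "that was complicated by refractory hyperkalemia.")

# Assumed tokenization: lowercase, drop the final period, split on spaces.
tokens = sentence.lower().rstrip(".").split()

# Bag-of-words: (word, frequency) pairs in first-seen order.
bow = list(Counter(tokens).items())

# Hypothetical word ids, assigned in first-seen order for this sentence only.
token2id = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}

# Sparse vector: replace each word by its id.
bow_sparse_vector = [(token2id[tok], count) for tok, count in bow]

print(bow[:2], bow[-1])
print(bow_sparse_vector[:2], bow_sparse_vector[-1])
```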
26. Dictionary Creation
Term        Index   Document ids
ibuprofen   0       1,3,…,3000
…
migraine    5       5,6,…,475
Documents
ibuprofen is …. migraine
Ibuprofen is effective in treating Migraine
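The dictionary on this slide can be sketched as a token-to-index map plus, for each term, the ids of the documents it occurs in. The function name `build_dictionary` and the second toy document are assumptions for illustration; only the first document comes from the slide.

```python
from collections import defaultdict


def build_dictionary(documents):
    """documents: {doc_id: text}. Returns (token2id, postings)."""
    token2id = {}                  # term -> unique index position
    postings = defaultdict(set)    # term -> ids of documents containing it
    for doc_id, text in documents.items():
        for token in text.lower().split():
            # Assign the next free index the first time a token is seen.
            token2id.setdefault(token, len(token2id))
            postings[token].add(doc_id)
    return token2id, dict(postings)


docs = {
    1: "Ibuprofen is effective in treating Migraine",  # from the slide
    5: "migraine with aura",                           # toy document
}
token2id, postings = build_dictionary(docs)

# Document frequency of a term is the size of its postings set.
doc_freq = {term: len(ids) for term, ids in postings.items()}
```

With the slide's sentence as the first document, "ibuprofen" gets index 0 and "migraine" gets index 5, matching the table.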
28. Gold Standard Vectorization
Step 1: iterate over each document in the gold standard.
Step 2: tokenize each sentence.
Step 3: create the bag-of-words model.
Step 4: use the idf of each word from the dictionary to create the tf-idf vector for the gold standard.
Problem: data sparsity.
29. Gold Standard Vectorization Enhancement
Step 1: run MetaMap on the gold standard document.
Step 2: create a bag-of-words for each concept (term frequency).
Step 3: aggregate the bag-of-words of each concept into a single bag-of-words for the summary.
Step 4: use the idf of each word from the dictionary to create the tf-idf vector for the gold standard.
Solution: enhance with context clues from corpus.
30. Step 1: select 20 diseases as topics for an information need.
Step 2: use each query to generate a semantic summary.
Step 3: transform each semantic summary into a semantic summary vector.
Step 4: transform each gold standard into a gold standard tf-idf vector.
Step 5: compute the similarity between a semantic summary vector and its associated gold standard vector under different features.
Step 6: determine the features that generate the most informative summary in each scenario.
Evaluation: Overall Approach
37. Evaluation
Method                   Cosine-RMSE   Euclidean-RMSE   JS-RMSE
Leave-out-relevancy      0.263         0.315            0.187
Leave-out-connectivity   0.263         0.335            0.143
Leave-out-novelty        0.254         0.329            0.252
Leave-out-saliency       0.237         0.333            0.281
Saliency is the most important feature.
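To make the reading of this table explicit, the sketch below picks, for each metric, the held-out feature whose summaries were least similar to the gold standard. The numbers are copied from the table; the selection rule (lowest value for cosine, highest for the two distance-like metrics) follows the explanation in the speaker notes, and the names `rmse`, `LOWER_IS_WORSE`, and `most_important` are hypothetical.

```python
# RMSE values from the table, keyed by held-out feature.
rmse = {
    "relevancy":    {"cosine": 0.263, "euclidean": 0.315, "js": 0.187},
    "connectivity": {"cosine": 0.263, "euclidean": 0.335, "js": 0.143},
    "novelty":      {"cosine": 0.254, "euclidean": 0.329, "js": 0.252},
    "saliency":     {"cosine": 0.237, "euclidean": 0.333, "js": 0.281},
}

# Cosine is a similarity (lower = less similar to the gold standard);
# Euclidean and Jensen-Shannon are distances (higher = less similar).
LOWER_IS_WORSE = {"cosine": True, "euclidean": False, "js": False}


def most_important(metric):
    """Return the held-out feature that hurt similarity the most."""
    pick = min if LOWER_IS_WORSE[metric] else max
    return pick(rmse, key=lambda feature: rmse[feature][metric])


winners = {metric: most_important(metric) for metric in LOWER_IS_WORSE}
print(winners)
```

Saliency wins under cosine and Jensen-Shannon and is a close second under Euclidean distance, which is the basis for the slide's conclusion.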
38. • We propose a method for intrinsic evaluation of abstractive summarization.
• We transform semantic summaries into an equivalent textual representation.
• We adopt a leave-one-out strategy to identify and evaluate the features that impact automatically generated semantic summaries.
• We evaluate the impact of these features using several similarity metrics.
Contributions
39. Limitations and Future Work
1. Query diversity
- 20 disease treatments
2. Concept-based bag-of-words
3. Gold standard impurities
- Diluted quality based on co-occurrence
Use machine learning and a larger query set
Involve more domain experts and consider other gold standard creation techniques
Use facts instead of concepts
40. THANK YOU!
Prof. Amit P. Sheth
(Advisor)
Prof. Krishnaprasad
Thirunarayan
Thomas C. Rindflesch Delroy Cameron
Acknowledgements
Editor's Notes
Hello everyone, good morning, thank you for gathering here,
Today I am going to talk about my work titled: “”
This is the work that I started as a part of my internship at NLM with Dr. Rindflesch and his team
I am sure all of us are aware..
For those of us who are not…PubMed is the search service that queries the MEDLINE database to retrieve relevant documents for a user’s information need.
MEDLINE itself…
So what is the problem with PubMed? Well, the problem with PubMed is that it presents the information as a list.
So if a user wanted to find information on migraine disorders, and he constructed this specific query and presented it to PubMed,
it would retrieve 2171 results,
then he would have to search and sift through this entire collection to find relevant answers.
Though the information is contained in the result set, it is not directly accessible.
To alleviate this problem, Tom and his team developed a tool called Sem. Med.,
which is used for automatically summarizing biomedical literature.
So as we see here, for the same user query, in addition to presenting the information as a list,
Sem. Med. extracts the salient information from the search result set as facts and represents them as a graph.
These facts are called semantic predications, or triples.
This provides more direct access to the information, and from this graph the user can understand the following facts on migraine disorders, among many others.
The motivation for this work is very specific to Semantic MEDLINE and in particular, we want to automatically evaluate the summaries
So this is the outline for the remainder of the talk
We will discuss automatic summarization and its types,
then automatic summarization evaluation and its types,
and later summarization in Sem. Med. and ResQu.
Then we briefly discuss the different datasets used for this work,
then we move on to the core of the approach.
Finally, we talk about the experimental evaluation.
In this general scope of automatic summarization, an interesting question is what an effective summary is in the first place, and how one might quantify that in order to evaluate it.
So, what is an effective summary? An effective summary is one that conveys the most important, or salient, information from the search result set in a compressed and concise format.
Extractive: the most important information from the source is added to the summary in an unaltered format,
whereas
Abstractive: the summary is a condensed, abstract representation of the source; the content is usually rephrased or paraphrased. Semantic MEDLINE performs abstractive summarization.
Saliency: conveys the most important information from the source.
So, how does summarization take place in Semantic MEDLINE, in the broad sense?
Well, first we have ………..
SemRep is a program that extracts semantic predications (subject-relation-object triples) from biomedical free text.
Then a series of 4 features is applied in the reduction step to produce a summary:
Relevancy: a knowledge-based feature derived by selecting semantic predications that address the user-selected seed topic for the summary.
Connectivity: a feature that ensures the summary also includes "useful" additional predications, based on the connectedness of relevant concepts.
Novelty: a knowledge-based feature that uses the hierarchical structure of the Metathesaurus to eliminate predications with generic (and hence uninformative) arguments.
Saliency: a feature that assigns bias to semantic predications that occur frequently.
One of the critical limitations of Sem. Med. is that it is difficult to evaluate the quality of the automatically generated summary.
Now, how might one actually evaluate the quality of an automatic summary, or of summaries in general? Well…
Some of the popular work in this area is the work by Nenkova et al. on the pyramid approach, where they focus on creating summaries using SCUs, which are extracts. They evaluate using an intrinsic
Similarly, in another work, Louis et al. perform intrinsic evaluation of extractive summaries by comparing the distribution of terms in the summary to that of the input using different …
so we propose a possible solution to this problem which involves summary transformation
Specifically, we approach this problem with two broad ideas
Here we have a semantic summary which is a list of predications
We take each summary produced by Semantic MEDLINE and, for each of the facts in it, we express them as a distribution of the terms with which the predications co-occur.
Then we aggregate these words to create a bag-of-words model; this way we are able to represent the semantic summary as a vector on which we can then perform similarity scoring.
Once we have the semantic summary vector and the gold standard vector, we evaluate them using different similarity scoring techniques, as suggested by the literature.
Further, to understand which feature is influential in the quality of the summary, we perform the RMSE computation using a leave-one-out approach, and we state that
The Medical Subject Headings (MeSH) is a controlled vocabulary and thesaurus of biomedical terms, organized in a hierarchical structure. Subject headings in MeSH are often used as search terms in PubMed to retrieve relevant documents.
In terms of organization, the Semantic Network is comparable to an ontology schema, while the Metathesaurus is comparable to the instances in the ontology.
The Metathesaurus is the biggest component of the UMLS. It is a large biomedical thesaurus.
The SPECIALIST Lexicon is a large syntactic lexicon of biomedical and general English terms, designed to provide the information needed for information extraction by various tools and natural language processing systems.
So here is the overall system architecture for ResQu, that we have developed:
First we have the User Query Processor (UQP): this module constructs the query based on the user's input, then passes it on to the Document Selector (DS).
The DS is responsible for retrieving all documents that match the query; for this we use the MEDLINE Entrez API.
This set of PubMed articles is then sent to the Predication Extractor; in this module we use the SemRep API to extract the semantic predications, or facts, from the articles.
Then we use the Summarizer, which is responsible for applying the features we previously mentioned (relevancy, novelty, connectivity, and saliency) to create a focused summary.
The Concept Mapper and the Predication Mapper are responsible for transforming the semantic summaries into their textual representation.
These components create the initial model, which is fed into the Vectorizer, which vectorizes the summary to create the ResQu summary vectors.
So what is a user query in ResQu?
Well, we represent a user query as a tuple with 5 elements.
Migraine is mapped to Migraine Disorders: C0149931, Migraine Disorders[MH]
c1 and c2 are MeSH filters;
these are 2 MeSH (Medical Subject Headings) indexing terms. Citations in PubMed are indexed using these MeSH terms, which are assigned by humans.
From the search results, the PubMed identifier (or PMID) of each article in D is then passed to the Semantic Predication Extractor.
In ResQu we evaluate the summaries using 20 scenarios; this is a carefully chosen list of diseases that contains both well-known and rarely occurring diseases,
with an upper bound of 5000 documents per disease. For this study we were mainly interested in understanding the drug treatments for these diseases.
For example, in this sentence, the semantic predication extractor will first try to do this… then use the indicator rules and..
The predications graph is then delivered as input to the Summarizer, which applies various features to filter out non-informative semantic predications and create a more concise semantic summary reflective of the salient aspects of the search result set,
by the application of the reduction transformation rules
here we see some of the predications that it accepts, while these are some of the predications it rejects
finally we end up with this semantic summary
However….
as we have noted earlier, it is challenging to evaluate such a summary; hence, here are the steps for summary transformation.
We implement the summary transformation in the following 4 steps:
First
1) then aggregate the bag-of-words for each concept, in the entire summary
2) then we use the idf, the inverse document frequency, of each word in the corpus to create the tf-idf vector for a summary.
3) to create both the bow_sparse_vector and the idfs, we create a dictionary for the corpus
4) at this stage we have a semantic summary represented as a transformed
Here is a simple bag-of-words model for this snippet of text
For every document, a bag-of-words model simply creates a list of tuples, each holding a word and its frequency.
The bag-of-words model can be used as a sparse vector for a document by simply replacing each word with the id of the word in a feature space.
Step 1: Iterate over each document in the corpus
Step 2: Tokenize each sentence
Step 3: Add each token to the dictionary with a unique id (index position)
Step 4: Keep track of the document frequency for each token (the ids of the documents in which the term occurs)
at this point we have a semantic summary transformed as a summary vector
At this point we have a semantic summary, represented as a vector
1) These were the 3 resources that were considered for the creation of the gold standard.
3) These resources were selected by domain experts from NLM as authoritative sources of drug treatments for diseases.
4) We use the Jericho crawler to extract the text present in structured and unstructured formats in these resources.
We found that the gold standard vectors were sparse.
To overcome this data sparsity problem, we enhance the gold standard vectors using contextual clues from the corpus
and repeat steps 2, 3, and 4 as before.
using the RMSE, which we will discuss in the next section
If we let the semantic summary be S-prime and the gold standard summary be T, then the cosine similarity between the GS-vector and an SS-vector is computed as shown in equation 1,
which is nothing but the dot product of the 2 vectors divided by the product of their magnitudes, where each magnitude is the square root of the sum of the squares of the vector's weights.
For Euclidean distance, the distance is computed as the square root of the sum of the squared differences between each of the corresponding points in the two vectors.
JS-divergence is a bit more complicated; it is computed as a symmetric function of the Kullback-Leibler divergence.
KL: taking the SS-vector as S-prime and the GS-vector as T, the KL-divergence is the sum, over each word w-i, of the probability of the word in the semantic summary times the log of the probability of the word in the SS divided by the probability of the word in the GS.
Given the KL-divergence, we can then compute JS as follows.
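The three measures described in these notes can be sketched in plain Python. This is an illustration using the textbook definitions, not the ResQu code: Jensen-Shannon is computed here as the average of the two KL divergences to the midpoint distribution, with natural logarithms and no smoothing, and the equations on the slides may differ in such details.

```python
import math


def cosine_similarity(s, t):
    """Dot product divided by the product of the two vector magnitudes."""
    dot = sum(a * b for a, b in zip(s, t))
    norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
    return dot / norm if norm else 0.0


def euclidean_distance(s, t):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))


def kl_divergence(p, q):
    """KL(p || q) for probability vectors; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(s, t):
    """Jensen-Shannon divergence between two non-negative weight vectors."""
    p = [a / sum(s) for a in s]   # normalize weights to probabilities
    q = [b / sum(t) for b in t]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]   # midpoint distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy vectors (hypothetical weights over a shared 3-word vocabulary)
ss_vector = [2.0, 1.0, 0.0]
gs_vector = [1.0, 1.0, 1.0]
print(cosine_similarity(ss_vector, gs_vector))
print(euclidean_distance(ss_vector, gs_vector))
print(js_divergence(ss_vector, gs_vector))
```

Because the midpoint distribution is positive wherever either input is, this form of JS never divides by zero, unlike a direct KL between a sparse summary vector and a sparse gold standard vector.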
First we have the baseline summary, which is a summary with no feature left out and which is compared to the gold standard.
Then we compute the semantic similarity with the relevancy feature removed, which is in green, and similarly for all the other features.
It is difficult to discern which of the features is important by just looking at these graphs.
So, in order to assess more quantitatively which of these features is important,
we instead compute the RMSE for each of the distributions.
So, just to put things into perspective, let's take the baseline dataset.
We have 20 queries, so for the baseline we generate 20 semantic summaries, and we also have 20 gold standard summary vectors,
giving 20 cosine similarity values, 20 Euclidean distance values, and 20 Jensen-Shannon values.
The RMSE is the square root of the mean of the squares of these similarity scores.
Now when we compute the RMSE for just the baseline, that number in isolation is not very informative; however, if we compute the RMSE for each of the held-out features, then we are able to estimate the importance of each feature.
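A minimal sketch of this aggregation step, under an explicit assumption: the notes suggest the RMSE is taken over the per-query scores themselves (that is, relative to a target of 0), so that lower cosine scores give a lower RMSE. The function name `rmse`, its `target` parameter, and the toy scores are hypothetical and are not taken from the thesis.

```python
import math


def rmse(scores, target=0.0):
    """Root of the mean squared difference between scores and a target.

    With target=0.0 (an assumption here) this reduces to the root mean
    square of the scores themselves.
    """
    return math.sqrt(sum((s - target) ** 2 for s in scores) / len(scores))


# Toy per-query cosine similarities (made up; the study used 20 queries).
baseline_scores = [0.30, 0.25, 0.35, 0.28]
held_out_scores = [0.20, 0.18, 0.26, 0.21]   # one feature removed

print(rmse(baseline_scores))
print(rmse(held_out_scores))
```

Under this reading, a drop in the cosine RMSE after removing a feature indicates that the feature mattered for matching the gold standard.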
So what we would like to see for cosine similarity is that, when the most important feature has been removed and we compute the cosine similarity across all 20 queries, the RMSE value becomes very low, which is what we see. For the Euclidean distance we expect to see the opposite.
So having done this across the 20 queries and these different metrics, what we see is that saliency is the most important feature for cosine similarity and JS; for the Euclidean distance, connectivity is the highest and saliency is the second highest. This leads us to conclude that saliency is the most important feature for generating semantic summaries.