Objectives are an understanding of:
▶ Homology search tools
▶ E-values
▶ how BLAST works
▶ how profile HMMs (hmmer) work
▶ which is the right tool for different questions
It includes the information related to a bioinformatics tool BLAST (Basic Local Alignment Search Tool), BLAST is in-silico hybridisation to find regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance. This presentation too contains the input - output format, Blast process and its types .
Visualization Approaches for Biomedical Omics Data: Putting It All TogetherNils Gehlenborg
Keynote Talk presented at the 1st Annual BiVi Community Annual Meeting (17 December 2014)
http://bivi.co/page/bivi-annual-meeting-16-17th-december-2014
Visualization Approaches for Biomedical Omics Data: Putting It All Together
The rapid proliferation of high quality, low cost genome-wide measurement technologies such as whole-genome and transcriptome sequencing, as well as advances in epigenomics and proteomics, are enabling researchers to perform studies that generate heterogeneous datasets for cohorts of thousands of individuals. A common feature of these studies is that a collection of genome-wide, molecular data types and phenotypic or clinical characterizations are available for each individual. These data can be used to identify the molecular basis of diseases and to characterize and describe the variations that are relevant for improved diagnosis, prognosis and targeted treatment of patients. An example for a study in which this approach has been successfully applied is The Cancer Genome Atlas project (http://cancergenome.nih.gov).
In my talk I will discuss how visualization approaches can be applied to enable exploration and support analysis of data generated by such studies. Specifically, I will review techniques and tools for visual exploration of individual omics data types, their ability to scale to large numbers of individuals or samples, and emerging techniques that integrate multiple omics data types for interactive visual analysis. I will also examine technical and legal challenges that developers of such visualization tools are facing. To conclude my talk, I will outline research opportunities for the biological data visualization community that address major challenges in this domain.
It includes the information related to a bioinformatics tool BLAST (Basic Local Alignment Search Tool), BLAST is in-silico hybridisation to find regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance. This presentation too contains the input - output format, Blast process and its types .
Visualization Approaches for Biomedical Omics Data: Putting It All TogetherNils Gehlenborg
Keynote Talk presented at the 1st Annual BiVi Community Annual Meeting (17 December 2014)
http://bivi.co/page/bivi-annual-meeting-16-17th-december-2014
Visualization Approaches for Biomedical Omics Data: Putting It All Together
The rapid proliferation of high quality, low cost genome-wide measurement technologies such as whole-genome and transcriptome sequencing, as well as advances in epigenomics and proteomics, are enabling researchers to perform studies that generate heterogeneous datasets for cohorts of thousands of individuals. A common feature of these studies is that a collection of genome-wide, molecular data types and phenotypic or clinical characterizations are available for each individual. These data can be used to identify the molecular basis of diseases and to characterize and describe the variations that are relevant for improved diagnosis, prognosis and targeted treatment of patients. An example for a study in which this approach has been successfully applied is The Cancer Genome Atlas project (http://cancergenome.nih.gov).
In my talk I will discuss how visualization approaches can be applied to enable exploration and support analysis of data generated by such studies. Specifically, I will review techniques and tools for visual exploration of individual omics data types, their ability to scale to large numbers of individuals or samples, and emerging techniques that integrate multiple omics data types for interactive visual analysis. I will also examine technical and legal challenges that developers of such visualization tools are facing. To conclude my talk, I will outline research opportunities for the biological data visualization community that address major challenges in this domain.
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...Elia Brodsky
This workshop will address critical issues related to Transcriptomics data:
Processing raw Next Generation Sequencing (NGS) data:
1. Next Generation Sequencing data preprocessing:
Trimming technical sequences
Removing PCR duplicates
2. RNA-seq based quantification of expression levels:
Conventional pipelines (looking at known transcripts)
Identification of novel isoforms
Analysis of Expression Data Using Machine Learning:
3. Unsupervised analysis of expression data:
Principal Component Analysis
Clustering
4. Supervised analysis:
Differential expression analysis
Classification, gene signature construction
5. Gene set enrichment analysis
The workshop will include hands-on exercises utilizing public domain datasets:
breast cancer cell lines transcriptomic profiles (https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-10-r110),
patient-derived xenograft (PDX) mouse model of tumor and stroma transcriptomic profiles (http://www.oncotarget.com/index.php?journal=oncotarget&page=article&op=view&path[]=8014&path[]=23533), and
processed data from The Cancer Genome Atlas samples (https://cancergenome.nih.gov/).
Team: The workshops are designed by the researchers at the Tauber Bioinformatics Research Center at University of Haifa, Israel in collaboration with academic centers across the US. Technical support for the workshops is provided by the Pine Biotech team. https://edu.t-bio.info/a-critical-approach-to-transcriptomic-data-analysis/
As increasing numbers of people choose to have their genomes sequenced and made available for research, more genomic data is available for analysis by machine learning approaches. Single Nucleotide Polymorphisms (SNPs) are known to be a major factor influencing many physical traits, diseases and other phenotypes. Using publicly available data and tools we predict phenotype from genotype using SNP data (1 to 2 million SNPs). We utilize data analysis and machine learning approaches only, no domain knowledge, so that our automated approach may be generally used to predict different phenotypes from genotype. In the first application of our method we predicted eye color with 87% accuracy.
Contents:
What does sequence mean?
Examples of sequences
Sequence Homology
Sequence Alignment
What is the use of sequence alignment?
Alignment methods
Tools for Sequence Alignment
FASTA Format
BLAST
Principle of BLAST
Variants of BLAST Program
BLAST input
BLAST output
Multiple sequence alignment
What is the use of multiple alignments?
Multiple Alignment Method
Tool for multiple alignments
ClustalW input
ClustalW output
E (Expectation) value
Demerits of progressive alignment
Integrative analysis of transcriptomics and proteomics data with ArrayMining ...Natalio Krasnogor
These slides are part of a presentation I gave on March 2010 at the BioInformatics and Genome Research Open Club at the Weizmann Institute of Science, Israel.
In these slides my student and I describe two web-applications for microarray and gene/protein set analysis,
ArrayMining.net and TopoGSA. These use ensemble and consensus methods as well as the
possibility of modular combinations of different analysis techniques for an integrative view of
(microarray-based) gene sets, interlinking transcriptomics with proteomics data sources. This integrative process uses tools from different fields, e.g. statistics, optimisation and network
topological studies. As an example for these integrative techniques, we use a microarray
consensus-clustering approach based on Simulated Annealing, which is part of the ArrayMining.net
Class Discovery Analysis module, and show how this approach can be combined in a modular
fashion with a prior gene set analysis. The results reveal that improved cluster validity indices can be obtained by merging the two methods, and provide pointers to distinct sub-classes within pre-defined tumour categories for a breast cancer dataset by the Nottingham Queens Medical Centre.
In the second part of the talk, I show how results from a supervised
microarray feature selection analysis on ArrayMining.net can be investigated in further detail with
TopoGSA, a new web-tool for network topological analysis of gene/protein sets mapped on a
comprehensive human protein-protein interaction network. I discuss results from a TopoGSA
analysis of the complete set of genes currently known to be mutated in cancer.
The objectives are an understanding of:
▶ How “function” in a genomic context can be defined. This is a
surprisingly philosophical problem.
▶ Some strategies for determining if a genomic region is likely to
be functional or not.
Sequence alignment & comparative genomics
▶ What’s the difference between homology and analogy?
▶ How homology is estimated?
▶ Where do sequence similarity scores come from?
▶ What are the BLOSUM protein scoring matrices?
▶ How are insertions and deletions scored?
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...Elia Brodsky
This workshop will address critical issues related to Transcriptomics data:
Processing raw Next Generation Sequencing (NGS) data:
1. Next Generation Sequencing data preprocessing:
Trimming technical sequences
Removing PCR duplicates
2. RNA-seq based quantification of expression levels:
Conventional pipelines (looking at known transcripts)
Identification of novel isoforms
Analysis of Expression Data Using Machine Learning:
3. Unsupervised analysis of expression data:
Principal Component Analysis
Clustering
4. Supervised analysis:
Differential expression analysis
Classification, gene signature construction
5. Gene set enrichment analysis
The workshop will include hands-on exercises utilizing public domain datasets:
breast cancer cell lines transcriptomic profiles (https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-10-r110),
patient-derived xenograft (PDX) mouse model of tumor and stroma transcriptomic profiles (http://www.oncotarget.com/index.php?journal=oncotarget&page=article&op=view&path[]=8014&path[]=23533), and
processed data from The Cancer Genome Atlas samples (https://cancergenome.nih.gov/).
Team: The workshops are designed by the researchers at the Tauber Bioinformatics Research Center at University of Haifa, Israel in collaboration with academic centers across the US. Technical support for the workshops is provided by the Pine Biotech team. https://edu.t-bio.info/a-critical-approach-to-transcriptomic-data-analysis/
As increasing numbers of people choose to have their genomes sequenced and made available for research, more genomic data is available for analysis by machine learning approaches. Single Nucleotide Polymorphisms (SNPs) are known to be a major factor influencing many physical traits, diseases and other phenotypes. Using publicly available data and tools we predict phenotype from genotype using SNP data (1 to 2 million SNPs). We utilize data analysis and machine learning approaches only, no domain knowledge, so that our automated approach may be generally used to predict different phenotypes from genotype. In the first application of our method we predicted eye color with 87% accuracy.
Contents:
What does sequence mean?
Examples of sequences
Sequence Homology
Sequence Alignment
What is the use of sequence alignment?
Alignment methods
Tools for Sequence Alignment
FASTA Format
BLAST
Principle of BLAST
Variants of BLAST Program
BLAST input
BLAST output
Multiple sequence alignment
What is the use of multiple alignments?
Multiple Alignment Method
Tool for multiple alignments
ClustalW input
ClustalW output
E (Expectation) value
Demerits of progressive alignment
Integrative analysis of transcriptomics and proteomics data with ArrayMining ...Natalio Krasnogor
These slides are part of a presentation I gave on March 2010 at the BioInformatics and Genome Research Open Club at the Weizmann Institute of Science, Israel.
In these slides my student and I describe two web-applications for microarray and gene/protein set analysis,
ArrayMining.net and TopoGSA. These use ensemble and consensus methods as well as the
possibility of modular combinations of different analysis techniques for an integrative view of
(microarray-based) gene sets, interlinking transcriptomics with proteomics data sources. This integrative process uses tools from different fields, e.g. statistics, optimisation and network
topological studies. As an example for these integrative techniques, we use a microarray
consensus-clustering approach based on Simulated Annealing, which is part of the ArrayMining.net
Class Discovery Analysis module, and show how this approach can be combined in a modular
fashion with a prior gene set analysis. The results reveal that improved cluster validity indices can be obtained by merging the two methods, and provide pointers to distinct sub-classes within pre-defined tumour categories for a breast cancer dataset by the Nottingham Queens Medical Centre.
In the second part of the talk, I show how results from a supervised
microarray feature selection analysis on ArrayMining.net can be investigated in further detail with
TopoGSA, a new web-tool for network topological analysis of gene/protein sets mapped on a
comprehensive human protein-protein interaction network. I discuss results from a TopoGSA
analysis of the complete set of genes currently known to be mutated in cancer.
The objectives are an understanding of:
▶ How “function” in a genomic context can be defined. This is a
surprisingly philosophical problem.
▶ Some strategies for determining if a genomic region is likely to
be functional or not.
Sequence alignment & comparative genomics
▶ What’s the difference between homology and analogy?
▶ How homology is estimated?
▶ Where do sequence similarity scores come from?
▶ What are the BLOSUM protein scoring matrices?
▶ How are insertions and deletions scored?
Genome annotation & comparative genomics
An appreciation for:
▶ An overview of some techniques and methods are used for
comparative genomics
▶ An understanding of genome annotation methods, particularly
the advantages and disadvantages of the different methods:
▶ Sequence analysis (ORF finding)
▶ Comparative sequence analysis
▶ Experimental methods (RNAseq & mass-spectroscopy)
Random RNA interactions control protein expression in prokaryotesPaul Gardner
Presented at the NZSBMB/NZMS Conference in Christchurch 2016
CustomScience Award
A core assumption of gene expression analysis is that mRNA abundances broadly correlate with protein abundance, but these two can be imperfectly correlated. Some of the discrepancy can be accounted for by two important mRNA features: codon usage and mRNA secondary structure. We present a new global factor, called mRNA:ncRNA avoidance, and provide evidence that avoidance increases translational efficiency. We demonstrate a strong selection for the avoidance of stochastic mRNA:ncRNA interactions across prokaryotes, and that these have a greater impact on protein abundance than mRNA structure or codon usage. By generating synonymously variant green fluorescent protein (GFP) mRNAs with different potential for mRNA:ncRNA interactions, we demonstrate that GFP levels correlate well with interaction avoidance. Therefore, taking stochastic mRNA:ncRNA interactions into account enables precise modulation of protein abundance.
Avoidance of stochastic RNA interactions can be harnessed to control protein ...Paul Gardner
Presented at the Computational RNA Biology conference in Hinxton, 17-19th October, 2016.
https://coursesandconferences.wellcomegenomecampus.org/events/item.aspx?e=584
This pdf is about the Schizophrenia.
For more details visit on YouTube; @SELF-EXPLANATORY;
https://www.youtube.com/channel/UCAiarMZDNhe1A3Rnpr_WkzA/videos
Thanks...!
Nutraceutical market, scope and growth: Herbal drug technologyLokesh Patil
As consumer awareness of health and wellness rises, the nutraceutical market—which includes goods like functional meals, drinks, and dietary supplements that provide health advantages beyond basic nutrition—is growing significantly. As healthcare expenses rise, the population ages, and people want natural and preventative health solutions more and more, this industry is increasing quickly. Further driving market expansion are product formulation innovations and the use of cutting-edge technology for customized nutrition. With its worldwide reach, the nutraceutical industry is expected to keep growing and provide significant chances for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
Cancer cell metabolism: special Reference to Lactate PathwayAADYARAJPANDEY1
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy we need to function.
Energy is stored in the bonds of glucose and when glucose is broken down, much of that energy is released.
Cell utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules - a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Kreb's cycle. The Kreb's cycle allows cells to “burn” the pyruvates made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis - Kreb's - oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELL:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
introduction to WARBERG PHENOMENA:
WARBURG EFFECT Usually, cancer cells are highly glycolytic (glucose addiction) and take up more glucose than do normal cells from outside.
Otto Heinrich Warburg (; 8 October 1883 – 1 August 1970) In 1931 was awarded the Nobel Prize in Physiology for his "discovery of the nature and mode of action of the respiratory enzyme.
WARNBURG EFFECT : cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg made the observation that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
Seminar of U.V. Spectroscopy by SAMIR PANDASAMIR PANDA
Spectroscopy is a branch of science dealing the study of interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflect spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light received by the analyte.
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Often maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
Richard's entangled aventures in wonderlandRichard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...Scintica Instrumentation
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows for the ultra-fast high-resolution imaging of cellular processes over time and space and were studied in its natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provide insights into the progression of disease, response to treatments or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system’s unique features and user-friendly software enables researchers to probe fast dynamic biological processes such as immune cell tracking, cell-cell interaction as well as vascularization and tumor metastasis with exceptional detail. This webinar will also give an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo and allows for the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancements of novel therapeutic strategies.
Multi-source connectivity as the driver of solar wind variability in the heli...Sérgio Sacani
The ambient solar wind that flls the heliosphere originates from multiple
sources in the solar corona and is highly structured. It is often described
as high-speed, relatively homogeneous, plasma streams from coronal
holes and slow-speed, highly variable, streams whose source regions are
under debate. A key goal of ESA/NASA’s Solar Orbiter mission is to identify
solar wind sources and understand what drives the complexity seen in the
heliosphere. By combining magnetic feld modelling and spectroscopic
techniques with high-resolution observations and measurements, we show
that the solar wind variability detected in situ by Solar Orbiter in March
2022 is driven by spatio-temporal changes in the magnetic connectivity to
multiple sources in the solar atmosphere. The magnetic feld footpoints
connected to the spacecraft moved from the boundaries of a coronal hole
to one active region (12961) and then across to another region (12957). This
is refected in the in situ measurements, which show the transition from fast
to highly Alfvénic then to slow solar wind that is disrupted by the arrival of
a coronal mass ejection. Our results describe solar wind variability at 0.5 au
but are applicable to near-Earth observatories.
FAIRSpectra - Towards a common data file format for SIMS imagesAlex Henderson
Presentation from the 101st IUVSTA Workshop on High performance SIMS instrumentation and machine learning / artificial intelligence methods for complex data.
This presentation describes the issues relating to storing and sharing data from Secondary Ion Mass Spectrometry experiments, and some potential solutions.
2. Objectives for lecture 06
▶ An understanding of:
▶ Homology search tools
▶ E-values
▶ how BLAST works
▶ how profile HMMs (hmmer) work
▶ which is the right tool for different questions
5. BLAST comes in several different flavours
▶ BLASTN: nucleotide query, find close nucleotide matches
▶ BLASTP: protein query, find close protein matches
▶ BLASTX: nucleotide query, find protein matches (useful for
annotation)
▶ TBLASTN: protein query, find matches in 6-frame translated
nucleotide
https://blast.ncbi.nlm.nih.gov/Blast.cgi
6. BLAST outline: speed comes from “word” matching
▶ Identify all (high-scoring) shared “words” (a.k.a. k-mers,
seeds) of at least W long
▶ Find any words on the same diagonal of an alignment matrix
▶ Trigger a full alignment, and score that region
Basic idea: identify near-identical sub-sequences first → align any
hits in full
7. BLAST: k-mer scores
List all 3-mers for each sequence, and the top scoring similar
k-mers (≥ 11 bits). High scoring 3-mers, the same distance apart
on 2 sequences imply a good match.
Query 1 HAGHRLYGIAINPNRVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYY 60
LYG 17 EEL 14
IYG 15 EEI 12
MYG 15 EEM 12
VYG 14 DEL 11
AYG 12 ------
KYG 11 EEF 10
------ Neighborhood score
EYG 10 threshold (T=11)
HYG 10
|<---------Distance--------->| Botulinum neurotoxin
Query 1 HAGHRLYGIAINPNRVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYY 60
H H LYG+ ++ + + Y + + +S EEL TFGG DA I +N+
Sbjct 1 HVLHGLYGMQVSSHEIIPSKQEIYMQHTYYPISAEELFTFGGQDANLISIDIKNDLYEKT 60
|<---------Distance--------->| Tetanus toxin
High-scoring segment pair (HSSP)
8. Homology search: a dot-plot & heuristics
“15 hits with score at least 13 are indicated by plus signs. An
additional 22 non-overlapping hits with score at least 11 are
indicated by dots.”
Altschul et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Research.
9. Homology search: path through a matrix
Altschul et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Research.
10. What does that E-value (Expect) mean?
Score Expect Identities Positives Gaps
40.0 bits 4e-11 24/76(32%) 38/76(50%) 1/76(1%)
Query 1 HAGHRLYGIAINPNRVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYY 60
H H LYG+ ++ + + Y + + +S EEL TFGG DA I +N+
Sbjct 1 HVLHGLYGMQVSSHEIIPSKQEIYMQHT-YPISAEELFTFGGQDANLISIDIKNDLYEKT 59
Query 61 YNKFKDIASTLNKAKS 76
N +K IA+ L++ S
Sbjct 60 LNDYKAIANKLSQVTS 75
Is an E-value of 4 × 10−11 good?
11. How can we evaluate the significance of a score?
▶ Note that a bit-score of 40.0 by itself is not that useful.
▶ It depends on the sequence & database size & composition.
▶ To counter this we can compute an Expect-value (E-value).
▶ This is the number of hits one can “expect” to see by
chance with the observed score for the given query and
database sizes.
▶ Related to P-values/probabilies, but are not the same!
0 100 200 300 400 500 600 700
0
2000
4000
6000
8000
10000
True vs false sequence matches
Score (bits)
Num.
matches
Random sequences/Neg. controls
True homologs/Pos. controls
Threshold?
False neg.
True pos.
False pos.
True neg.
12. How can we evaluate the significance of a score?
0 100 200 300 400 500 600 700
0
2000
4000
6000
8000
10000
True vs false sequence matches
Score (bits)
Num.
matches
Random sequences/Neg. controls
True homologs/Pos. controls
Threshold?
False neg.
True pos.
False pos.
True neg.
Fit the following curve to
a distribution of scores for
random sequences:
E = κMN2−λx
E: E-value
M&N: query & database size
κ&λ: fitting parameters
NB. the blue line represents a null hypothesis or negative
control.
13. What does that E-value (Expect) mean?
Score Expect Identities Positives Gaps
40.0 bits 4e-11 24/76(32%) 38/76(50%) 1/76(1%)
Query 1 HAGHRLYGIAINPNRVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYY 60
H H LYG+ ++ + + Y + + +S EEL TFGG DA I +N+
Sbjct 1 HVLHGLYGMQVSSHEIIPSKQEIYMQHT-YPISAEELFTFGGQDANLISIDIKNDLYEKT 59
Query 61 YNKFKDIASTLNKAKS 76
N +K IA+ L++ S
Sbjct 60 LNDYKAIANKLSQVTS 75
Is an E-value of 4 × 10−11 good? YES! We would have to run approximately 25
billion (1/(4 × 10−11)) searches to see 1 match with a score this high by chance.
14. Are these 2 sequences homologous?
What do the bit score and and expect value tell you?
>gb|CP001191.1| Rhizobium leguminosarum bv. trifolii WSM2304, complete genome
Length=4537948
Features in this part of subject sequence:
cold-shock DNA-binding domain protein
Score = 57.2 bits (62), Expect = 2e-05
Identities = 78/106 (74%), Gaps = 6/106 (6%)
Strand=Plus/Plus
Query 1 CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC
|| |||||||| ||||||||| |||||| | | | || |||| |||| ||||
Sbjct 828507 CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC
Query 57 TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG
| | || |||||| ||| ||||||||||| |||||| ||| |||
Sbjct 828567 ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG
15. BLAST is not the only, or best tool for the job!
17. Profile HMMs: a powerful (protein) homology search tool
▶ Some residues and sequence blocks are very highly conserved,
others are free to vary.
▶ Gaps (insertions & deletions) do not occur randomnly in
sequences.
▶ BLAST uses uniform match, mis-match and gap penalties,
whereas profile HMMs will penalise mis-matches and gaps in
conserved blocks more harshly than those in regions where
INDELs are frequent.
18. Sequence logos
Shine-Dalgarno sequence from Escherichia coli str. K-12 substr. MG1655
Distance from start codon (nucs)
WebLogo 3.1
0.0
1.0
2.0
bits
-30
G
C
A
U
U
C
A
C
U
-25
A
C
U
A
C
U
C
A
U
C
A
U
G
C
A
U
-20
C
A
U
C
U
A
G
C
U
A
-15
U
A
C
G
U
A
C
U
A
U
C
G
A
C
G
A
-10
A
GA
GU
G
A
U
G
A
C
G
U
A
-5
C
G
U
A
G
U
C
A
C
U
G
AA
C
U
U
A
C
0
AUGU
C
G
AU
G
C
A
5
G
C
U
A
G
C
AC
U
AU
A
19. Profile HMMs: a powerful (protein) homology search tool
Image provided by Sean Eddy.
20. Profile-based homology search
Krogh, A. et al. (1994) Hidden Markov models in computational biology. Applications to protein modeling. J Mol
Biol.
Image provided by Eric Nawrocki.
22. Profile HMM are slightly more complicated than that...
▶ A tree-weighting scheme takes care of unbalanced
alignments
▶ Dirichlet-mixture priors are used to incorporate information
about amino-acid biochemistry
▶ Effective sequence number is used to down-weight priors
when many sequences are available
▶ Transition probabilities to Insert & Delete states are
estimated from the alignment
23. Why not just use BLAST?
▶ ACCURACY!
▶ Every benchmark of homology search tools has shown that
profile methods are more accurate than single-sequence
methods.
Eddy (2011) Accelerated Profile HMM Searches. PLoS
Computational Biology.
24. Why not just use BLAST?
▶ SPEED! To search a single query vs a database of all proteins:
▶ BLAST: searches 462 million UniProt sequences
▶ HMMER: searches 20 thousand Pfam profiles
▶ The search space is ∼ 20, 000x smaller for profiles
▶ Save Planet Earth, use HMMER3
Eddy (2011) Accelerated Profile HMM Searches. PLoS
Computational Biology.
25. Profile HMMs: different flavours
▶ phmmer: protein query, find homologous proteins
▶ hmmscan: protein query, find homologous protein families
▶ hmmsearch: alignment/HMM query, find homologous proteins
▶ jackhmmer: protein query, iterative protein search
▶ nhmmer: nucleotide query, find homologous nucleotides.
Command-line only – or in RNACentral for ncRNAs.
https://www.ebi.ac.uk/Tools/hmmer/search/phmmer
26. Pfam, Rfam, Dfam, EggNOG, etc. all use profile-based
tools
What is a Pfam-A Entry?
hmmsearch
hmmbuild
hmmalign
SEED
HMM
OUTOUT
ALIGN
DESC
Image provided by Rob Finn.
27. WARNING!
Interpret matches between sequences very carefully – try searching
short random strings, “your name”, “genome”, “bioinformatics”,
... Are any of these matches significant?
Many attempted analyses essentially amount to over-interpreting
insignificant BLAST results (e.g. early coronavirus analysis papers).
28. Bioinformatic tools are not evil, but they do stupid things
▶ False positives & negatives, database contamination,
inefficient, ...
▶ Use controls to increase your confidence in predictions!
29. The main points
▶ Homology estimated by sequence similarity and bit-scores.
▶ BLAST is very handy for quick searches, limited accuracy for
divergent sequences if these are needed.
▶ HMMER is more accurate, but can be slightly slower
▶ BLAST is useful if you are interested in very similar
sequences, HMMER is more useful if you are interested in
divergent sequences.
▶ The significance of the alignments from both BLAST and
HMMER can be determined with E-values.
30. Self-evaluation exercises
▶ Briefly summarise how the BLAST algorithm works.
▶ Present the aim of Blast software: (a) Describe how the
algorithm works. (b) Describe the output of a Blast search.
(c) What is a bit-score? (d) What is an E-value?
▶ Briefly outline how a profile HMM can be used for homology
search. What are the main differences between profile HMMs
and BLAST?
31. Self-evaluation exercises
Find the most remote (based upon taxonomy) and significant
(E < 0.01) homolog of the below sequence that you can using
blastp, tblastn, psi-blast, phmmer or jackhmmer.
>Q9UN81|LORF1_HUMAN LINE-1 retrotransposable element ORF1 protein
MGKKQNRKTGNSKTQSASPPPKERSSSPATEQSWMENDFDELREEGFRRSNYSELREDIQ
TKGKEVENFEKNLEECITRITNTEKCLKELMELKTKARELREECRSLRSRCDQLEERVSA
MEDEMNEMKREGKFREKRIKRNEQSLQEIWDYVKRPNLRLIGVPESDVENGTKLENTLQD
IIQENFPNLARQANVQIQEIQRTPQRYSSRRATPRHIIVRFTKVEMKEKMLRAAREKGRV
TLKGKPIRLTADLSAETLQARREWGPIFNILKEKNFQPRISYPAKLSFISEGEIKYFIDK
QMLRDFVTTRPALKELLKEALNMERNNRYQPLQNHAKM
32. Self-evaluation exercises
b) Based upon the bitscores and E-values for the above zebrafish and cone snail venom
and insulin sequences in Figure 1A, are these matches likely to be due to chance?
Justify your answer.
c) Using the multiple sequence alignment in Figure 1B, compute a probability of the
Figure 2: (A) Pairwise sequence alignment between similar regions of the cone snail and
zebrafish insulin proteins. (B) BLOSUM62 amino acid similarity score matrix.
d) Using the alignment and BLOSUM62 score matrix in Figure 2, compute a score for
the alignment between regions of the zebrafish and cone snail insulin sequences
using -9/
-1 gap open and gap extension penalty. Show your working.
e) You are computing your own score matrix and observe that isoleucine (I) and valine
(V) replace one another 336 times in an alignment of 10,000 residues with 62%
sequence identity. The frequencies of isoleucine and valine were 0.06 and 0.07
respectively. What is the log-odds score for isoleucine and valine replacements? Is
this expected based upon the BLOSUM62 matrix? Show your working.
sequence "RGM" from frequencies derived from columns 1, 2 and 3 of the alignment.
a) Using the results in Figure 1, discuss the relationship between the concepts of
"sequence similarity", "homology", "paralogy" and "analogy" of protein sequences.
The fish-hunting cone snail (Conus geographus) uses a specialized insulin in the venom
to induce hypoglycemic shock in its prey. C. geographus has a very toxic sting, a LD70
value (in humans) of 0.001 to 0.003 mg/
kg, making the cone snail one of the most
venomous animals in the world.
The cone snail genome encodes two insulin-like proteins, one forms a component of its
venom, the other is involved in regulating glucose levels. Comparisons using BLASTP
have been made between the cone snail venom, and insulin sequences from a fish,
human, the venomous cone snail, and a non-venomous sea snail.
Figure 1: comparisons of similar insulin and venom proteins from the molluscs: cone snail and sea snail; and
vertebrates: zebrafish and human. (A) a table of BLASTP results for matches between the zebrafish insulin
sequence, and similar cone snail, sea snail and human sequences. (B) a multiple sequence alignment of the
cone snail venom, and similar insulin sequences. (C) a sequence logo corresponding to a profile HMM
generated from the alignment in (B). (D) a tree illustrating the sequence relationships between the insulin and
venom sequences.
33. Further reading
▶ Blast Bitscores & E-values:
https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
▶ Eddy SR (2004) What is a hidden Markov model? Nature
Biotechnology.
Post questions on the FAQ GoogleDoc:
https://docs.google.com/document/d/1PQd dp7C 0cXA8SwUv-
qrkTOj8c8fUAt-U Z5dg2yc8/edit?usp=sharing