GENE315: Homology search & comparative genomics
Paul Gardner
March 14, 2023
Objectives for lecture 06
▶ An understanding of:
▶ Homology search tools
▶ E-values
▶ how BLAST works
▶ how profile HMMs (hmmer) work
▶ which is the right tool for different questions
Homology search
▶ In a huge collection of biological sequences how can you
locate similar sequences?
▶ With heuristic, super fast, local sequence alignment methods!
[Figure: Prokaryotic genome submissions to NCBI. x-axis: Year (1995 to 2015); y-axis: No. Genomes (0 to 150,000); legend (Species): Streptococcus, Escherichia, Staphylococcus, Mycobacterium, Klebsiella, Pseudomonas, Other, Non-pathogenic.]
Image provided by Bethany Jose ©in perpetuity.
BLAST: Basic Local Alignment Search Tool
BLAST comes in several different flavours
▶ BLASTN: nucleotide query, find close nucleotide matches
▶ BLASTP: protein query, find close protein matches
▶ BLASTX: nucleotide query, find protein matches (useful for
annotation)
▶ TBLASTN: protein query, find matches in 6-frame translated
nucleotide
https://blast.ncbi.nlm.nih.gov/Blast.cgi
BLAST outline: speed comes from “word” matching
▶ Identify all (high-scoring) shared “words” (a.k.a. k-mers,
seeds) at least W long
▶ Find any words on the same diagonal of an alignment matrix
▶ Trigger a full alignment, and score that region
Basic idea: identify near-identical sub-sequences first → align any
hits in full
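The word-matching idea above can be sketched in a few lines of Python. This is a simplified illustration, not BLAST itself: it looks only for exact shared words (real BLAST also uses high-scoring similar words), and the function names `find_seed_hits` and `group_by_diagonal` are made up for this example. The two sequences are the botulinum and tetanus toxin fragments used on the following slides.

```python
from collections import defaultdict

def find_seed_hits(query, subject, w=3):
    """Return exact shared words as (query_pos, subject_pos, word) tuples."""
    # Index every word of length w in the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    # Scan the subject and look each of its words up in the index.
    hits = []
    for j in range(len(subject) - w + 1):
        word = subject[j:j + w]
        for i in index.get(word, []):
            hits.append((i, j, word))
    return hits

def group_by_diagonal(hits):
    """Group hits by diagonal (subject_pos - query_pos) of the alignment matrix."""
    diagonals = defaultdict(list)
    for i, j, word in hits:
        diagonals[j - i].append((i, j, word))
    return dict(diagonals)

query = "HAGHRLYGIAINPNRVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYY"
subject = "HVLHGLYGMQVSSHEIIPSKQEIYMQHTYYPISAEELFTFGGQDANLISIDIKNDLYEKT"

hits = find_seed_hits(query, subject)
diagonals = group_by_diagonal(hits)
for diagonal, members in sorted(diagonals.items()):
    # Several hits on one diagonal would trigger a full alignment of that region.
    print(diagonal, [word for _, _, word in members])
```

Hits that share a diagonal are the same distance apart in both sequences, which is the signal used to decide where a full alignment is worth computing.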
BLAST: k-mer scores
List all 3-mers for each sequence, and the top scoring similar
k-mers (≥ 11 bits). High scoring 3-mers, the same distance apart
on 2 sequences imply a good match.
Query 1 HAGHRLYGIAINPNRVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYY 60
LYG 17 EEL 14
IYG 15 EEI 12
MYG 15 EEM 12
VYG 14 DEL 11
AYG 12 ------
KYG 11 EEF 10
------ Neighborhood score
EYG 10 threshold (T=11)
HYG 10
|<---------Distance--------->| Botulinum neurotoxin
Query 1 HAGHRLYGIAINPNRVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYY 60
H H LYG+ ++ + + Y + + +S EEL TFGG DA I +N+
Sbjct 1 HVLHGLYGMQVSSHEIIPSKQEIYMQHTYYPISAEELFTFGGQDANLISIDIKNDLYEKT 60
|<---------Distance--------->| Tetanus toxin
High-scoring segment pair (HSP)
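The word scores on this slide can be reproduced by summing substitution scores position by position. The sketch below assumes the scores come from BLOSUM62 and hard-codes only the handful of BLOSUM62 entries needed for the words shown; the helper names (`pair_score`, `word_score`, `neighbourhood`) are illustrative, not part of any BLAST API.

```python
# Hand-copied subset of BLOSUM62: only the pairs needed for the slide's words.
BLOSUM62_SUBSET = {
    ("L", "L"): 4, ("L", "I"): 2, ("L", "M"): 2, ("L", "V"): 1,
    ("L", "A"): -1, ("L", "K"): -2, ("L", "E"): -3, ("L", "H"): -3,
    ("L", "F"): 0, ("Y", "Y"): 7, ("G", "G"): 6,
    ("E", "E"): 5, ("E", "D"): 2,
}

def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def word_score(query_word, candidate):
    """Sum the substitution scores of the aligned residues."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))

def neighbourhood(query_word, candidates, threshold=11):
    """Keep candidate words scoring at least the threshold T."""
    scored = [(c, word_score(query_word, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]

lyg = neighbourhood("LYG", ["LYG", "IYG", "MYG", "VYG", "AYG", "KYG", "EYG", "HYG"])
eel = neighbourhood("EEL", ["EEL", "EEI", "EEM", "DEL", "EEF"])
print(lyg)  # [('LYG', 17), ('IYG', 15), ('MYG', 15), ('VYG', 14), ('AYG', 12), ('KYG', 11)]
print(eel)  # [('EEL', 14), ('EEI', 12), ('EEM', 12), ('DEL', 11)]
```

EYG, HYG and EEF all score 10, so they fall below T=11 and are dropped, matching the threshold line drawn on the slide.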
Homology search: a dot-plot & heuristics
“15 hits with score at least 13 are indicated by plus signs. An
additional 22 non-overlapping hits with score at least 11 are
indicated by dots.”
Altschul et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Research.
Homology search: path through a matrix
Altschul et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Research.
What does that E-value (Expect) mean?
Score Expect Identities Positives Gaps
40.0 bits 4e-11 24/76(32%) 38/76(50%) 1/76(1%)
Query 1 HAGHRLYGIAINPNRVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYY 60
H H LYG+ ++ + + Y + + +S EEL TFGG DA I +N+
Sbjct 1 HVLHGLYGMQVSSHEIIPSKQEIYMQHT-YPISAEELFTFGGQDANLISIDIKNDLYEKT 59
Query 61 YNKFKDIASTLNKAKS 76
N +K IA+ L++ S
Sbjct 60 LNDYKAIANKLSQVTS 75
Is an E-value of $4 \times 10^{-11}$ good?
How can we evaluate the significance of a score?
▶ Note that a bit-score of 40.0 by itself is not that useful.
▶ It depends on the sequence & database size & composition.
▶ To counter this we can compute an Expect-value (E-value).
▶ This is the number of hits one can “expect” to see by
chance with the observed score for the given query and
database sizes.
▶ Related to P-values/probabilities, but not the same!
[Figure: True vs false sequence matches. x-axis: Score (bits), 0 to 700; y-axis: Num. matches, 0 to 10,000. Two distributions: random sequences/negative controls and true homologs/positive controls. A candidate score threshold ("Threshold?") divides the matches into true negatives, false negatives, false positives and true positives.]
Fit the following curve to a distribution of scores for random sequences:
$E = \kappa M N \, 2^{-\lambda x}$
E: E-value
x: score (bits)
M & N: query & database size
κ & λ: fitting parameters
NB. the blue line represents a null hypothesis or negative
control.
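As a rough illustration of how the curve behaves, the sketch below evaluates E = κ·M·N·2^(−λx) for a few scores. The defaults κ = λ = 1 are placeholders rather than fitted values (in practice both are fitted to scores of random sequences), and the sequence and database sizes are invented. The conversion P = 1 − exp(−E) assumes chance hits follow a Poisson distribution, which is the usual link between E-values and P-values.

```python
import math

def e_value(bit_score, query_len, db_len, kappa=1.0, lam=1.0):
    """E = kappa * M * N * 2^(-lambda * x), the curve from the slide."""
    return kappa * query_len * db_len * 2.0 ** (-lam * bit_score)

def p_value(e):
    """Probability of at least one chance hit, assuming Poisson hits: 1 - exp(-E)."""
    return -math.expm1(-e)

# Hypothetical sizes: a 100-residue query against a 1,000,000-residue database.
for bits in (20, 40, 60):
    e = e_value(bits, query_len=100, db_len=1_000_000)
    print(f"{bits} bits: E = {e:.2e}, P = {p_value(e):.2e}")
```

Two properties are worth noticing: each extra bit of score halves the E-value, and doubling the database size doubles it. For small E the P-value is almost identical to E, while for large E the P-value saturates at 1, which is why the two are related but not the same.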
Is an E-value of $4 \times 10^{-11}$ good? YES! We would have to run approximately 25
billion ($1/(4 \times 10^{-11})$) searches to see 1 match with a score this high by chance.
Are these 2 sequences homologous?
What do the bit score and expect value tell you?
>gb|CP001191.1| Rhizobium leguminosarum bv. trifolii WSM2304, complete genome
Length=4537948
Features in this part of subject sequence:
cold-shock DNA-binding domain protein
Score = 57.2 bits (62), Expect = 2e-05
Identities = 78/106 (74%), Gaps = 6/106 (6%)
Strand=Plus/Plus
Query 1 CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC
|| |||||||| ||||||||| |||||| | | | || |||| |||| ||||
Sbjct 828507 CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC
Query 57 TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG
| | || |||||| ||| ||||||||||| |||||| ||| |||
Sbjct 828567 ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG
BLAST is not the only, or best tool for the job!
Profile HMMs: a powerful (protein) homology search tool
▶ Some residues and sequence blocks are very highly conserved,
others are free to vary.
▶ Gaps (insertions & deletions) do not occur randomly in
sequences.
▶ BLAST uses uniform match, mis-match and gap penalties,
whereas profile HMMs will penalise mis-matches and gaps in
conserved blocks more harshly than those in regions where
INDELs are frequent.
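A minimal sketch of position-specific scoring, assuming a gap-free toy alignment: each column gets its own log-odds scores, so replacing a residue costs more in a conserved column than in a variable one. The alignment, the flat pseudocount and the uniform background are all simplifying assumptions made for illustration; HMMER's actual models also include insert and delete states and use more sophisticated priors.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def column_log_odds(column, pseudocount=1.0):
    """Log2-odds of each residue in one column versus a uniform background."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(ALPHABET)
    background = 1.0 / len(ALPHABET)
    return {
        aa: math.log2(((counts[aa] + pseudocount) / total) / background)
        for aa in ALPHABET
    }

def build_profile(alignment):
    """One score table per alignment column (match states only)."""
    return [column_log_odds(column) for column in zip(*alignment)]

def score(profile, sequence):
    """Sum the position-specific scores of a sequence as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, sequence))

# Hypothetical alignment: column 1 is fully conserved (G), column 3 is variable.
toy_alignment = ["GKA", "GRS", "GKT", "GHV"]
profile = build_profile(toy_alignment)

print(score(profile, "GKA"))  # consensus-like sequence
print(score(profile, "LKA"))  # substitution in the conserved column
print(score(profile, "GKL"))  # substitution in the variable column
```

Swapping the conserved G for L loses more score than swapping the variable A for L, which is the behaviour a single uniform substitution matrix cannot capture.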
Sequence logos
[Figure: sequence logo (WebLogo 3.1) of the Shine-Dalgarno sequence from Escherichia coli str. K-12 substr. MG1655. x-axis: distance from start codon (nucs), -30 to 5; y-axis: bits, 0.0 to 2.0.]
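The bits axis of a nucleotide logo runs from 0 to 2 because the height of each column is its information content: log2(4) minus the Shannon entropy of the column. The sketch below computes that quantity for a single alignment column; it omits the small-sample correction that logo software may apply, and the example columns are invented.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=4):
    """Information content (bits) of one alignment column: log2(size) - entropy."""
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy

# Hypothetical columns, from fully conserved to fully random.
for column in ("GGGG", "GGAA", "ACGU"):
    print(column, information_content(column))
```

A perfectly conserved column reaches the full 2 bits, a column split evenly between two nucleotides gives 1 bit, and a column with all four nucleotides equally frequent gives 0 bits.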
Profile HMMs: a powerful (protein) homology search tool
Image provided by Sean Eddy.
Profile-based homology search
Krogh, A. et al. (1994) Hidden Markov models in computational biology. Applications to protein modeling. J Mol
Biol.
Image provided by Eric Nawrocki.
Profile-based homology search – scoring sequences
Image provided by Eric Nawrocki.
Profile HMMs are slightly more complicated than that...
▶ A tree-weighting scheme takes care of unbalanced
alignments
▶ Dirichlet-mixture priors are used to incorporate information
about amino-acid biochemistry
▶ Effective sequence number is used to down-weight priors
when many sequences are available
▶ Transition probabilities to Insert & Delete states are
estimated from the alignment
Why not just use BLAST?
▶ ACCURACY!
▶ Every benchmark of homology search tools has shown that
profile methods are more accurate than single-sequence
methods.
Eddy (2011) Accelerated Profile HMM Searches. PLoS
Computational Biology.
Why not just use BLAST?
▶ SPEED! To search a single query vs a database of all proteins:
▶ BLAST: searches 462 million UniProt sequences
▶ HMMER: searches 20 thousand Pfam profiles
▶ The search space is ∼20,000× smaller for profiles
▶ Save Planet Earth, use HMMER3
Eddy (2011) Accelerated Profile HMM Searches. PLoS
Computational Biology.
Profile HMMs: different flavours
▶ phmmer: protein query, find homologous proteins
▶ hmmscan: protein query, find homologous protein families
▶ hmmsearch: alignment/HMM query, find homologous proteins
▶ jackhmmer: protein query, iterative protein search
▶ nhmmer: nucleotide query, find homologous nucleotides.
Command-line only – or in RNACentral for ncRNAs.
https://www.ebi.ac.uk/Tools/hmmer/search/phmmer
Pfam, Rfam, Dfam, EggNOG, etc. all use profile-based
tools
What is a Pfam-A Entry?
[Figure: the components of a Pfam-A entry. Files: SEED, HMM, OUTPUT, ALIGN, DESC; tools linking them: hmmbuild, hmmsearch, hmmalign.]
Image provided by Rob Finn.
WARNING!
Interpret matches between sequences very carefully – try searching
short random strings, “your name”, “genome”, “bioinformatics”,
... Are any of these matches significant?
Many attempted analyses essentially amount to over-interpreting
insignificant BLAST results (e.g. early coronavirus analysis papers).
Bioinformatic tools are not evil, but they do stupid things
▶ False positives & negatives, database contamination,
inefficiency, ...
▶ Use controls to increase your confidence in predictions!
The main points
▶ Homology is inferred from sequence similarity and bit-scores.
▶ BLAST is very handy for quick searches, but has limited accuracy
for divergent sequences.
▶ HMMER is more accurate, but can be slightly slower
▶ BLAST is useful if you are interested in very similar
sequences, HMMER is more useful if you are interested in
divergent sequences.
▶ The significance of the alignments from both BLAST and
HMMER can be determined with E-values.
Self-evaluation exercises
▶ Briefly summarise how the BLAST algorithm works.
▶ Present the aim of the BLAST software: (a) Describe how the
algorithm works. (b) Describe the output of a BLAST search.
(c) What is a bit-score? (d) What is an E-value?
▶ Briefly outline how a profile HMM can be used for homology
search. What are the main differences between profile HMMs
and BLAST?
Self-evaluation exercises
Find the most remote (based upon taxonomy) and significant
(E < 0.01) homolog of the below sequence that you can using
blastp, tblastn, psi-blast, phmmer or jackhmmer.
>Q9UN81|LORF1_HUMAN LINE-1 retrotransposable element ORF1 protein
MGKKQNRKTGNSKTQSASPPPKERSSSPATEQSWMENDFDELREEGFRRSNYSELREDIQ
TKGKEVENFEKNLEECITRITNTEKCLKELMELKTKARELREECRSLRSRCDQLEERVSA
MEDEMNEMKREGKFREKRIKRNEQSLQEIWDYVKRPNLRLIGVPESDVENGTKLENTLQD
IIQENFPNLARQANVQIQEIQRTPQRYSSRRATPRHIIVRFTKVEMKEKMLRAAREKGRV
TLKGKPIRLTADLSAETLQARREWGPIFNILKEKNFQPRISYPAKLSFISEGEIKYFIDK
QMLRDFVTTRPALKELLKEALNMERNNRYQPLQNHAKM
Self-evaluation exercises
The fish-hunting cone snail (Conus geographus) uses a specialized insulin in the venom
to induce hypoglycemic shock in its prey. C. geographus has a very toxic sting, an LD70
value (in humans) of 0.001 to 0.003 mg/kg, making the cone snail one of the most
venomous animals in the world.
The cone snail genome encodes two insulin-like proteins, one forms a component of its
venom, the other is involved in regulating glucose levels. Comparisons using BLASTP
have been made between the cone snail venom, and insulin sequences from a fish,
human, the venomous cone snail, and a non-venomous sea snail.
Figure 1: comparisons of similar insulin and venom proteins from the molluscs: cone snail and sea snail; and
vertebrates: zebrafish and human. (A) a table of BLASTP results for matches between the zebrafish insulin
sequence, and similar cone snail, sea snail and human sequences. (B) a multiple sequence alignment of the
cone snail venom, and similar insulin sequences. (C) a sequence logo corresponding to a profile HMM
generated from the alignment in (B). (D) a tree illustrating the sequence relationships between the insulin and
venom sequences.
a) Using the results in Figure 1, discuss the relationship between the concepts of
"sequence similarity", "homology", "paralogy" and "analogy" of protein sequences.
b) Based upon the bitscores and E-values for the above zebrafish and cone snail venom
and insulin sequences in Figure 1A, are these matches likely to be due to chance?
Justify your answer.
c) Using the multiple sequence alignment in Figure 1B, compute a probability of the
sequence "RGM" from frequencies derived from columns 1, 2 and 3 of the alignment.
Figure 2: (A) Pairwise sequence alignment between similar regions of the cone snail and
zebrafish insulin proteins. (B) BLOSUM62 amino acid similarity score matrix.
d) Using the alignment and BLOSUM62 score matrix in Figure 2, compute a score for
the alignment between regions of the zebrafish and cone snail insulin sequences
using -9/-1 gap open and gap extension penalties. Show your working.
e) You are computing your own score matrix and observe that isoleucine (I) and valine
(V) replace one another 336 times in an alignment of 10,000 residues with 62%
sequence identity. The frequencies of isoleucine and valine were 0.06 and 0.07
respectively. What is the log-odds score for isoleucine and valine replacements? Is
this expected based upon the BLOSUM62 matrix? Show your working.
Further reading
▶ Blast Bitscores & E-values:
https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
▶ Eddy SR (2004) What is a hidden Markov model? Nature
Biotechnology.
Post questions on the FAQ GoogleDoc:
https://docs.google.com/document/d/1PQd dp7C 0cXA8SwUv-
qrkTOj8c8fUAt-U Z5dg2yc8/edit?usp=sharing
Matanaka, Waikouaiti, 4 February 2023.

 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
FAIRSpectra - Towards a common data file format for SIMS images
FAIRSpectra - Towards a common data file format for SIMS imagesFAIRSpectra - Towards a common data file format for SIMS images
FAIRSpectra - Towards a common data file format for SIMS images
 

ppgardner-lecture06-homologysearch.pdf

  • 1. GENE315: Homology search & comparative genomics Paul Gardner March 14, 2023
  • 2. Objectives for lecture 06 ▶ An understanding of: ▶ Homology search tools ▶ E-values ▶ how BLAST works ▶ how profile HMMs (hmmer) work ▶ which is the right tool for different questions
  • 3. Homology search ▶ In a huge collection of biological sequences how can you locate similar sequences? ▶ With heuristic, super fast, local sequence alignment methods! [Figure: Prokaryotic genome submissions to NCBI. Number of genomes (0 to 150,000) by year (1995 to 2015), grouped by species: Streptococcus, Escherichia, Staphylococcus, Mycobacterium, Klebsiella, Pseudomonas, Other, Non-pathogenic. Image provided by Bethany Jose © in perpetuity.]
  • 4. BLAST: Basic Local Alignment Search Tool
  • 5. BLAST comes in several different flavours ▶ BLASTN: nucleotide query, find close nucleotide matches ▶ BLASTP: protein query, find close protein matches ▶ BLASTX: nucleotide query, find protein matches (useful for annotation) ▶ TBLASTN: protein query, find matches in 6-frame translated nucleotide https://blast.ncbi.nlm.nih.gov/Blast.cgi
  • 6. BLAST outline: speed comes from “word” matching ▶ Identify all (high-scoring) shared “words” (a.k.a. k-mers, seeds) at least W residues long ▶ Find any words on the same diagonal of an alignment matrix ▶ Trigger a full alignment, and score that region Basic idea: identify near-identical sub-sequences first → align any hits in full
  • 7. BLAST: k-mer scores List all 3-mers for each sequence, and the top-scoring similar k-mers (≥ 11 bits). High-scoring 3-mers that are the same distance apart on 2 sequences imply a good match.
      Similar 3-mers for the query words LYG and EEL, with the neighborhood score threshold (T=11) marked by dashes:
      LYG 17    EEL 14
      IYG 15    EEI 12
      MYG 15    EEM 12
      VYG 14    DEL 11
      AYG 12    ------
      KYG 11    EEF 10
      ------
      EYG 10
      HYG 10
      Query: botulinum neurotoxin. Sbjct: tetanus toxin.
                     |<----------Distance---------->|
      Query  1  HAGHRLYGIAINPNRVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYY  60
                H  H LYG+ ++ + +       Y + +   +S EEL TFGG DA  I    +N+
      Sbjct  1  HVLHGLYGMQVSSHEIIPSKQEIYMQHTYYPISAEELFTFGGQDANLISIDIKNDLYEKT  60
                     |<----------Distance---------->|
      High-scoring segment pair (HSP)
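
The neighborhood-word list on this slide can be reproduced with a few lines of code. The following is a minimal sketch, not BLAST's actual implementation: it assumes a hand-copied subset of BLOSUM62 scores (only the pairs needed here, chosen to agree with the scores shown on the slide), and the function names are hypothetical. It varies only the first residue of the query word LYG and keeps candidates scoring at least T = 11.

```python
# Subset of BLOSUM62 substitution scores (assumption: only the pairs
# needed for this example are included; they agree with the slide).
BLOSUM62_SUBSET = {
    ("L", "L"): 4, ("L", "I"): 2, ("L", "M"): 2, ("L", "V"): 1,
    ("L", "A"): -1, ("L", "K"): -2, ("L", "E"): -3, ("L", "H"): -3,
    ("Y", "Y"): 7, ("G", "G"): 6,
}


def word_score(query_word, candidate):
    """Sum the substitution scores of two equal-length words, position by position."""
    return sum(BLOSUM62_SUBSET[(q, c)] for q, c in zip(query_word, candidate))


def neighborhood(query_word, first_residues, threshold=11):
    """Return (word, score) pairs scoring >= threshold, best first.

    Only the first residue is varied, to keep the sketch small;
    a real search considers every possible word.
    """
    scored = []
    for residue in first_residues:
        candidate = residue + query_word[1:]
        scored.append((candidate, word_score(query_word, candidate)))
    kept = [(word, score) for word, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: -pair[1])


hits = neighborhood("LYG", "LIMVAKEH")
for word, score in hits:
    print(word, score)
```

EYG and HYG both score 10 and so fall below the threshold, matching the words listed under the dashed line on the slide.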
  • 8. Homology search: a dot-plot & heuristics “15 hits with score at least 13 are indicated by plus signs. An additional 22 non-overlapping hits with score at least 11 are indicated by dots.” Altschul et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research.
  • 9. Homology search: path through a matrix Altschul et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research.
  • 10. What does that E-value (Expect) mean?
      Score      Expect  Identities   Positives    Gaps
      40.0 bits  4e-11   24/76 (32%)  38/76 (50%)  1/76 (1%)

      Query  1   HAGHRLYGIAINPNRVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYY  60
                 H  H LYG+ ++ + +       Y + +   +S EEL TFGG DA  I    +N+
      Sbjct  1   HVLHGLYGMQVSSHEIIPSKQEIYMQHT-YPISAEELFTFGGQDANLISIDIKNDLYEKT  59

      Query  61  YNKFKDIASTLNKAKS  76
                  N +K IA+ L++  S
      Sbjct  60  LNDYKAIANKLSQVTS  75
      Is an E-value of $4 \times 10^{-11}$ good?
  • 11. How can we evaluate the significance of a score? ▶ Note that a bit-score of 40.0 by itself is not that useful. ▶ Its significance depends on the sequence and database size and composition. ▶ To counter this we can compute an Expect-value (E-value). ▶ This is the number of hits one can “expect” to see by chance with at least the observed score, for the given query and database sizes. ▶ E-values are related to P-values/probabilities, but are not the same! [Figure: True vs false sequence matches. Number of matches against score (bits) for random sequences (negative controls) and true homologs (positive controls); a score threshold divides the matches into true and false positives and true and false negatives.]
  • 12. How can we evaluate the significance of a score? [Figure: True vs false sequence matches. Number of matches against score (bits) for random sequences (negative controls) and true homologs (positive controls); a score threshold divides the matches into true and false positives and true and false negatives.] Fit the following curve to a distribution of scores for random sequences: $E = \kappa M N 2^{-\lambda x}$, where E is the E-value, M and N are the query and database sizes, κ and λ are fitting parameters, and x is the score. NB. the blue line represents a null hypothesis or negative control.
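
To make the behaviour of the fitted curve concrete, here is a minimal sketch that evaluates E = κMN2^(−λx) directly. The function name and all numeric inputs are hypothetical, chosen only to show how E responds to database size and score; they are not fitted parameters from any real search.

```python
import math


def expected_hits(score, query_size, database_size, kappa, lam):
    """Evaluate E = kappa * M * N * 2**(-lam * x), the curve from the slide."""
    return kappa * query_size * database_size * 2.0 ** (-lam * score)


# Hypothetical inputs: kappa = lam = 1 is an illustrative choice only.
base = expected_hits(40.0, 100, 1_000_000, kappa=1.0, lam=1.0)
bigger_db = expected_hits(40.0, 100, 2_000_000, kappa=1.0, lam=1.0)
one_more_bit = expected_hits(41.0, 100, 1_000_000, kappa=1.0, lam=1.0)

print(f"baseline E-value:         {base:.3g}")
print(f"database twice as large:  {bigger_db:.3g}")
print(f"score one unit higher:    {one_more_bit:.3g}")
```

Two properties follow from the formula: E grows linearly with the search space (doubling N doubles E), and it shrinks exponentially with the score (with λ = 1, each extra unit of score halves E). This is why the same score can be significant in a small database and insignificant in a large one.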
  • 13. What does that E-value (Expect) mean?
      Score      Expect  Identities   Positives    Gaps
      40.0 bits  4e-11   24/76 (32%)  38/76 (50%)  1/76 (1%)

      Query  1   HAGHRLYGIAINPNRVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYY  60
                 H  H LYG+ ++ + +       Y + +   +S EEL TFGG DA  I    +N+
      Sbjct  1   HVLHGLYGMQVSSHEIIPSKQEIYMQHT-YPISAEELFTFGGQDANLISIDIKNDLYEKT  59

      Query  61  YNKFKDIASTLNKAKS  76
                  N +K IA+ L++  S
      Sbjct  60  LNDYKAIANKLSQVTS  75
      Is an E-value of $4 \times 10^{-11}$ good? YES! We would have to run approximately 25 billion ($1/(4 \times 10^{-11})$) searches to see 1 match with a score this high by chance.
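
The "25 billion searches" figure, and the earlier remark that E-values are related to but not the same as P-values, can both be checked numerically. This sketch assumes the standard Poisson relationship P = 1 − e^(−E) between the expected number of chance hits and the probability of seeing at least one; the function names are hypothetical.

```python
import math


def searches_per_chance_hit(e_value):
    """Number of equivalent searches needed to expect one chance hit."""
    return 1.0 / e_value


def p_value_from_e_value(e_value):
    """P(at least one chance hit), assuming chance hits are Poisson distributed."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


e = 4e-11
print(f"searches per chance hit: {searches_per_chance_hit(e):.3g}")
print(f"P-value for E = 4e-11:   {p_value_from_e_value(e):.3g}")
print(f"P-value for E = 1:       {p_value_from_e_value(1.0):.3f}")
print(f"P-value for E = 10:      {p_value_from_e_value(10.0):.5f}")
```

For small E the two quantities are practically identical, but they diverge as E grows: an E-value can exceed 1, whereas a P-value cannot.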
  • 14. Are these 2 sequences homologous? What do the bit score and expect value tell you?
      >gb|CP001191.1| Rhizobium leguminosarum bv. trifolii WSM2304, complete genome
      Length=4537948
      Features in this part of subject sequence: cold-shock DNA-binding domain protein
      Score = 57.2 bits (62), Expect = 2e-05
      Identities = 78/106 (74%), Gaps = 6/106 (6%)
      Strand=Plus/Plus

      Query  1       CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC
                     || |||||||| ||||||||| ||||||    |  | | || |||| ||||     ||||
      Sbjct  828507  CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC

      Query  57      TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG
                      | | || |||||| ||| ||||||||||| ||||||  ||| |||
      Sbjct  828567  ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG
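
The identity and gap counts reported for this alignment can be recomputed from the aligned strings themselves. This is a small sketch with a hypothetical helper function; the two strings are the Query and Sbjct rows from the slide, joined across both alignment blocks.

```python
# Aligned rows copied from the slide; "-" marks a gap.
QUERY = (
    "CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC"
    "TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG"
)
SBJCT = (
    "CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC"
    "ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG"
)


def alignment_stats(row_a, row_b):
    """Return (identities, gap_columns, alignment_length) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(row_a, row_b))
    identities = sum(a == b and a != "-" for a, b in columns)
    gap_columns = sum(a == "-" or b == "-" for a, b in columns)
    return identities, gap_columns, len(columns)


identities, gaps, length = alignment_stats(QUERY, SBJCT)
print(f"Identities = {identities}/{length} ({round(100 * identities / length)}%)")
print(f"Gaps = {gaps}/{length} ({round(100 * gaps / length)}%)")
```

Note that 74% identity over roughly 100 nucleotides still yields an E-value of only 2e-05 against a genome of this size, so percent identity alone does not settle the question posed on the slide.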
  • 15. BLAST is not the only, or best tool for the job!
  • 16. Profile HMMs: a powerful (protein) homology search tool
  • 17. Profile HMMs: a powerful (protein) homology search tool ▶ Some residues and sequence blocks are very highly conserved, others are free to vary. ▶ Gaps (insertions & deletions) do not occur randomly in sequences. ▶ BLAST uses uniform match, mis-match and gap penalties, whereas profile HMMs penalise mis-matches and gaps in conserved blocks more harshly than those in regions where INDELs are frequent.
  • 18. Sequence logos [Figure: sequence logo (WebLogo 3.1) of the Shine-Dalgarno sequence from Escherichia coli str. K-12 substr. MG1655. x-axis: distance from start codon (nucs), -30 to 5; y-axis: bits, 0.0 to 2.0.]
  • 19. Profile HMMs: a powerful (protein) homology search tool Image provided by Sean Eddy.
  • 20. Profile-based homology search Krogh, A. et al. (1994) Hidden Markov models in computational biology. Applications to protein modeling. J Mol Biol. Image provided by Eric Nawrocki.
  • 21. Profile-based homology search – scoring sequences Image provided by Eric Nawrocki.
  • 22. Profile HMMs are slightly more complicated than that... ▶ A tree-weighting scheme takes care of unbalanced alignments ▶ Dirichlet-mixture priors are used to incorporate information about amino-acid biochemistry ▶ Effective sequence number is used to down-weight priors when many sequences are available ▶ Transition probabilities to Insert & Delete states are estimated from the alignment
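
Stripped of the refinements listed above, the core of profile scoring is a per-column log-odds table. The sketch below is a deliberately simplified stand-in, not HMMER's method: it assumes a hypothetical four-sequence toy alignment, a uniform background frequency, add-one pseudocounts in place of Dirichlet-mixture priors, no sequence weighting, and no insert or delete states.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies


def build_profile(alignment, pseudocount=1.0):
    """Build per-column log2-odds scores from an ungapped alignment."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return profile


def score_sequence(profile, sequence):
    """Sum the column scores for an ungapped sequence of the profile's length."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match the profile length")
    return sum(column[aa] for column, aa in zip(profile, sequence))


# Hypothetical toy alignment: column 3 is fully conserved, columns 1 and 2 vary.
toy_alignment = ["LYG", "LYG", "IYG", "LFG"]
profile = build_profile(toy_alignment)

for candidate in ("LYG", "IYG", "LYA", "WWW"):
    print(candidate, round(score_sequence(profile, candidate), 2))
```

Because the third column is fully conserved, replacing its G costs more than replacing the L in the more variable first column. This is the position-specific behaviour that a single substitution matrix, as used by BLAST, cannot express.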
  • 23. Why not just use BLAST? ▶ ACCURACY! ▶ Every benchmark of homology search tools has shown that profile methods are more accurate than single-sequence methods. Eddy (2011) Accelerated Profile HMM Searches. PLoS Computational Biology.
  • 24. Why not just use BLAST? ▶ SPEED! To search a single query vs a database of all proteins: ▶ BLAST: searches 462 million UniProt sequences ▶ HMMER: searches 20 thousand Pfam profiles ▶ The search space is ∼20,000x smaller for profiles ▶ Save Planet Earth, use HMMER3 Eddy (2011) Accelerated Profile HMM Searches. PLoS Computational Biology.
  • 25. Profile HMMs: different flavours ▶ phmmer: protein query, find homologous proteins ▶ hmmscan: protein query, find homologous protein families ▶ hmmsearch: alignment/HMM query, find homologous proteins ▶ jackhmmer: protein query, iterative protein search ▶ nhmmer: nucleotide query, find homologous nucleotides. Command-line only – or in RNACentral for ncRNAs. https://www.ebi.ac.uk/Tools/hmmer/search/phmmer
  • 26. Pfam, Rfam, Dfam, EggNOG, etc. all use profile-based tools [Figure: What is a Pfam-A Entry? The files SEED, HMM, OUTPUT, ALIGN and DESC, connected by hmmbuild, hmmsearch and hmmalign. Image provided by Rob Finn.]
  • 27. WARNING! Interpret matches between sequences very carefully – try searching short random strings, “your name”, “genome”, “bioinformatics”, ... Are any of these matches significant? Many attempted analyses essentially amount to over-interpreting insignificant BLAST results (e.g. early coronavirus analysis papers).
  • 28. Bioinformatic tools are not evil, but they do stupid things ▶ False positives & negatives, database contamination, inefficient, ... ▶ Use controls to increase your confidence in predictions!
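
One simple control of the kind recommended above is to search with shuffled copies of the query: they keep its length and residue composition but destroy its order, so any "hits" they produce show what chance matches look like. This is a minimal sketch with a hypothetical function name; the example query is the botulinum neurotoxin fragment from the earlier slides.

```python
import random


def shuffled_controls(sequence, n=3, seed=0):
    """Return n shuffled copies of a sequence to use as negative-control queries."""
    rng = random.Random(seed)  # fixed seed so the controls are reproducible
    controls = []
    for _ in range(n):
        residues = list(sequence)
        rng.shuffle(residues)
        controls.append("".join(residues))
    return controls


QUERY = "HAGHRLYGIAINPNRVFKVNTNAYYEMSGLEVSFEELRTFGGHDAKFIDSLQENEFRLYY"
controls = shuffled_controls(QUERY, n=3, seed=1)
for i, control in enumerate(controls, start=1):
    print(f">shuffled_control_{i}")
    print(control)
```

If the scores obtained with the real query are no better than those obtained with the shuffled controls, the match should not be interpreted as evidence of homology.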
  • 29. The main points ▶ Homology is estimated from sequence similarity and bit-scores. ▶ BLAST is very handy for quick searches, but has limited accuracy for divergent sequences. ▶ HMMER is more accurate, but can be slightly slower. ▶ BLAST is useful if you are interested in very similar sequences; HMMER is more useful if you are interested in divergent sequences. ▶ The significance of the alignments from both BLAST and HMMER can be determined with E-values.
  • 30. Self-evaluation exercises ▶ Briefly summarise how the BLAST algorithm works. ▶ Present the aim of the BLAST software: (a) Describe how the algorithm works. (b) Describe the output of a BLAST search. (c) What is a bit-score? (d) What is an E-value? ▶ Briefly outline how a profile HMM can be used for homology search. What are the main differences between profile HMMs and BLAST?
  • 31. Self-evaluation exercises Find the most remote (based upon taxonomy) and significant (E < 0.01) homolog of the sequence below that you can, using blastp, tblastn, psi-blast, phmmer or jackhmmer.
      >Q9UN81|LORF1_HUMAN LINE-1 retrotransposable element ORF1 protein
      MGKKQNRKTGNSKTQSASPPPKERSSSPATEQSWMENDFDELREEGFRRSNYSELREDIQ
      TKGKEVENFEKNLEECITRITNTEKCLKELMELKTKARELREECRSLRSRCDQLEERVSA
      MEDEMNEMKREGKFREKRIKRNEQSLQEIWDYVKRPNLRLIGVPESDVENGTKLENTLQD
      IIQENFPNLARQANVQIQEIQRTPQRYSSRRATPRHIIVRFTKVEMKEKMLRAAREKGRV
      TLKGKPIRLTADLSAETLQARREWGPIFNILKEKNFQPRISYPAKLSFISEGEIKYFIDK
      QMLRDFVTTRPALKELLKEALNMERNNRYQPLQNHAKM
  • 32. Self-evaluation exercises The fish-hunting cone snail (Conus geographus) uses a specialized insulin in the venom to induce hypoglycemic shock in its prey. C. geographus has a very toxic sting, with an LD70 value (in humans) of 0.001 to 0.003 mg/kg, making the cone snail one of the most venomous animals in the world. The cone snail genome encodes two insulin-like proteins: one forms a component of its venom, the other is involved in regulating glucose levels. Comparisons using BLASTP have been made between the cone snail venom, and insulin sequences from a fish, human, the venomous cone snail, and a non-venomous sea snail.
      Figure 1: comparisons of similar insulin and venom proteins from the molluscs: cone snail and sea snail; and vertebrates: zebrafish and human. (A) a table of BLASTP results for matches between the zebrafish insulin sequence, and similar cone snail, sea snail and human sequences. (B) a multiple sequence alignment of the cone snail venom, and similar insulin sequences. (C) a sequence logo corresponding to a profile HMM generated from the alignment in (B). (D) a tree illustrating the sequence relationships between the insulin and venom sequences.
      a) Using the results in Figure 1, discuss the relationship between the concepts of "sequence similarity", "homology", "paralogy" and "analogy" of protein sequences.
      b) Based upon the bitscores and E-values for the above zebrafish and cone snail venom and insulin sequences in Figure 1A, are these matches likely to be due to chance? Justify your answer.
      c) Using the multiple sequence alignment in Figure 1B, compute a probability of the sequence "RGM" from frequencies derived from columns 1, 2 and 3 of the alignment.
      Figure 2: (A) Pairwise sequence alignment between similar regions of the cone snail and zebrafish insulin proteins. (B) BLOSUM62 amino acid similarity score matrix.
      d) Using the alignment and BLOSUM62 score matrix in Figure 2, compute a score for the alignment between regions of the zebrafish and cone snail insulin sequences using gap open and gap extension penalties of -9 and -1. Show your working.
      e) You are computing your own score matrix and observe that isoleucine (I) and valine (V) replace one another 336 times in an alignment of 10,000 residues with 62% sequence identity. The frequencies of isoleucine and valine were 0.06 and 0.07 respectively. What is the log-odds score for isoleucine and valine replacements? Is this expected based upon the BLOSUM62 matrix? Show your working.
  • 33. Further reading ▶ Blast Bitscores & E-values: https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html ▶ Eddy SR (2004) What is a hidden Markov model? Nature Biotechnology. Post questions on the FAQ GoogleDoc: https://docs.google.com/document/d/1PQd dp7C 0cXA8SwUv- qrkTOj8c8fUAt-U Z5dg2yc8/edit?usp=sharing
  • 34. Matanaka, Waikouaiti, 4 February 2023.