1. The document discusses bioinformatics concepts and tools, including sequence alignment, BLAST, substitution matrices, and open reading frames. Sequence alignment compares sequences to find similar regions and can be local or global. BLAST finds similar sequences in a database by searching for exact word matches and extending them into high-scoring alignments. Substitution matrices such as BLOSUM and PAM assign scores to amino acid substitutions observed in protein evolution. Open reading frames refer to the six possible frames (three on each strand) for translating a nucleic acid sequence into a protein.
4. Sequence alignment
Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical or similar characters in the sequences.
- Similarity: residues with the same physicochemical properties.
- Identity: identical residues.
MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
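The identity line in the example above can be reproduced with a few lines of Python. This is a minimal sketch: the function names `match_line` and `percent_identity` are hypothetical, and it only counts identical residues (similarity would additionally need a grouping of residues by physicochemical property, which the slide does not give).

```python
def match_line(a: str, b: str) -> str:
    """Return a string with '|' under identical residues and ' ' elsewhere."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return "".join("|" if x == y and x != "-" else " " for x, y in zip(a, b))


def percent_identity(a: str, b: str) -> float:
    """Percentage of alignment columns holding identical residues."""
    return 100.0 * match_line(a, b).count("|") / len(a)


top = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
bottom = "MVHLTPEEKTAVNALWGKVNVDAVGGE"

print(top)
print(match_line(top, bottom))
print(bottom)
print(f"identity: {percent_identity(top, bottom):.1f}%")
```

For the two sequences on the slide, 18 of the 27 columns are identical.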
5. Sequence alignment: why?
• Proteins and genes can be compared by sequence similarity because they are related by evolution; they have a common ancestor.
• Random mutations accumulate in the sequences over time, so proteins or genes whose common ancestor lies far back in time are less similar than proteins or genes that diverged from each other more recently.
6. Alignment
• A way of arranging objects or characters to find the similarities and differences between them.
• In bioinformatics, it is the arrangement of sequences (DNA, RNA or protein) to find the regions of similarity and difference, from which homology can be inferred.
9. Why perform pairwise sequence alignment?
Finding homology between two sequences.
Example: protein prediction (sequence or structure).
Similar sequence (or structure) → similar function
10. Local vs. Global
• Global alignment compares the sequences over their whole length and gives the best overall alignment, but it may miss local regions of similarity, which are exactly the regions that carry domain and motif information.
• Local alignment finds regions with a high level of similarity. It is best for finding motifs even when the two sequences are otherwise different.
11. Local vs. Global
Local alignment – finds regions of high similarity in parts of the sequences
Global alignment – finds the best alignment across the entire two sequences
12. Evolutionary changes in sequences
Three types of nucleotide changes:
1. Substitution – a replacement of one (or more) sequence characters by another, e.g. AAGA → AACA
2. Insertion – an insertion of one (or more) sequence characters
3. Deletion – a deletion of one (or more) sequence characters
Insertion + Deletion = Indel
13. Choosing an alignment:
• Many different alignments between two sequences are possible:
AAGCTGAATTCGAA
AGGCTCATTTCTGA

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
. . .
How can one determine which is the best alignment?
14. Exercise
Compute the score of each of the following alignments:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

Scoring scheme:
• Match: +1
• Mismatch: -2
• Indel: -1

Substitution matrix:
    A   C   G   T
A   1  -2  -2  -2
C  -2   1  -2  -2
G  -2  -2   1  -2
T  -2  -2  -2   1
Gap penalty (opening = extending): -1
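The exercise can be checked with a short script. The sketch below assumes the scoring scheme stated on the slide (match +1, mismatch -2, indel -1, with gap opening equal to gap extension); the function name `score_alignment` is hypothetical.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 1, mismatch: int = -2, indel: int = -1) -> int:
    """Sum the column scores of a pairwise alignment written with '-' for gaps."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += indel      # a gap in either sequence
        elif x == y:
            total += match      # identical characters
        else:
            total += mismatch   # substitution
    return total


first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first, second)  # prints "2 -1"
```

Under this scheme the first alignment (10 matches, 2 mismatches, 4 indels) scores 2 and the second (9 matches, 2 mismatches, 6 indels) scores -1, so the first is preferred.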
15. Open Reading Frames (ORFs)
• 6 possible ORFs
– frames 1, 2, and 3 in the 5’ to 3’ direction of the given strand
– frames 1, 2, and 3 in the 5’ to 3’ direction of the complementary strand
The different reading frames give entirely different proteins.
Each gene uses a single reading frame, so once the ribosome gets started, it just has to count off groups of 3 bases to produce the proper protein.
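To make the six frames concrete, the following sketch splits a DNA string into codons for each frame. The names `reverse_complement` and `six_frames` and the frame labels (`+1` … `-3`) are illustrative choices, not part of the slides, and the example sequence is made up. Translation into amino acids is left out because it would need a full codon table.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse, to read the other strand 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna: str) -> dict:
    """Map each frame label to its list of complete codons."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Step through the sequence three bases at a time; drop a trailing partial codon.
            codons = [seq[k:k + 3] for k in range(offset, len(seq) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


for label, codons in six_frames("ATGAAATAG").items():
    print(label, " ".join(codons))
```

Only frame `+1` of this toy sequence starts with ATG and ends with a stop codon (TAG), which is what makes it an open reading frame.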
16. PAM matrices
• Family of matrices: PAM 80, PAM 120, PAM 250, …
• The number with a PAM matrix (the n in PAMn) represents the evolutionary distance between the sequences on which the matrix is based
• The (i, j) cell of a PAMn matrix is derived from the probability that amino acid i will be replaced by amino acid j over evolutionary time n: $P_{i \to j,\, n}$
• Greater n denotes greater evolutionary distance
17. BLOSUM matrices
• Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped, manually created local alignments)
• BLOSUMn is based on BLOCKS in which sequences that share at least n percent identity are clustered together
• The (i, j) cell of a BLOSUM matrix is the log of the odds of the observed frequency over the expected frequency of amino acids i and j occurring in the same position in the data: $\log\left(p_{ij} / (q_i q_j)\right)$
• Higher n denotes higher identity between the sequences on which the matrix is based
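The log-odds formula can be turned into a small calculation. This sketch follows the slide's expression log(p_ij / (q_i q_j)); the scaling to half-bit units (2 × log2, rounded to an integer) is an assumption taken from the usual BLOSUM62 convention, and the frequencies in the example are made-up numbers, not values from the BLOCKS database.

```python
import math


def log_odds_score(p_ij: float, q_i: float, q_j: float, scale: float = 2.0) -> int:
    """Rounded log-odds score: scale * log2(observed / expected).

    p_ij      : observed frequency of the pair (i, j) in aligned columns
    q_i, q_j  : background frequencies of the two amino acids
    scale     : 2.0 gives half-bit units (assumed convention)
    """
    return round(scale * math.log2(p_ij / (q_i * q_j)))


# Pair seen four times more often than chance predicts: positive score.
print(log_odds_score(0.01, 0.05, 0.05))
# Pair seen four times less often than chance predicts: negative score.
print(log_odds_score(0.001, 0.05, 0.08))
```

A positive score means the substitution is observed more often than expected by chance; a negative score means it is observed less often.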
18. BLAST
(Basic Local Alignment Search Tool)
• The BLAST program was designed by Eugene Myers, Stephen Altschul, Warren Gish, David J. Lipman and Webb Miller at the NIH and was published in J. Mol. Biol. in 1990.
• OBJECTIVE: find high-scoring ungapped segments among related sequences
• One of the most widely used bioinformatics programs, because the algorithm emphasizes speed over sensitivity.
19. • An algorithm for comparing primary biological sequence information to find the similarity between sequences.
• Emphasizes regions of local alignment to detect relationships among sequences that share only isolated regions of similarity.
• Not only a tool for visualizing alignments, but also a way to compare structure and function.
20. Steps for BLAST
• BLAST searches for exact matches of a small fixed length between the query sequence and the database; each such match is called a seed.
• BLAST tries to extend the match in both directions, starting at the seed. The resulting ungapped alignment is a High-Scoring Segment Pair (HSP).
• The highest-scoring HSPs are presented in the final report. They are called maximal segment pairs (MSPs).
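The seed-and-extend idea can be sketched in a few lines. This is a teaching simplification, not the BLAST implementation: it uses exact word seeds, a hypothetical match/mismatch scoring of +1/-1, and stops extending once the running score falls more than `xdrop` below the best score seen. All function names, parameter values and the two example sequences are illustrative.

```python
def find_seeds(query: str, subject: str, w: int) -> list:
    """Return (query_pos, subject_pos) for every exact word match of length w."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]


def _extend_one_way(query, subject, qi, sj, step, match, mismatch, xdrop):
    """Best ungapped gain and its length when walking from (qi, sj) in one direction."""
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > xdrop:
            break  # the score dropped too far below the best: give up
        qi += step
        sj += step
    return best, best_len


def extend_seed(query, subject, i, j, w, match=1, mismatch=-1, xdrop=2):
    """Extend a seed in both directions; return (score, query_start, query_end)."""
    right_gain, right_len = _extend_one_way(query, subject, i + w, j + w, 1,
                                            match, mismatch, xdrop)
    left_gain, left_len = _extend_one_way(query, subject, i - 1, j - 1, -1,
                                          match, mismatch, xdrop)
    return w * match + left_gain + right_gain, i - left_len, i + w + right_len


query, subject = "TTGACGTCAT", "GGGACGTCAA"
seeds = find_seeds(query, subject, 4)
score, start, end = extend_seed(query, subject, 3, 3, 4)
print(seeds)
print(score, query[start:end])
```

Here the seed `ACGT` at position 3 grows into the ungapped segment `GACGTCA`, which plays the role of an HSP in this toy setting.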
21. BLAST performs a gapped alignment between the query sequence and the database sequence using a variation of the Smith–Waterman algorithm. Statistically significant alignments are then displayed to the user.
22. BLAST PROGRAMS
• BLASTP: protein query sequence against a protein
database, allowing for gaps.
• BLASTN: DNA query sequence against a DNA database,
allowing for gaps.
• BLASTX: DNA query sequence, translated into all six
reading frames, against a protein database, allowing for
gaps.
• TBLASTN: protein query sequence against a DNA
database, translated into all six reading frames, allowing
for gaps.
• TBLASTX: DNA query sequence, translated into all six
reading frames, against a DNA database, translated into
all six reading frames (No gaps allowed)
23. PSI-BLAST
(Position-Specific Iterated BLAST)
• Used to find distant relatives of a protein.
• First, a list of all closely related proteins is created. These proteins are combined into a general "profile" sequence (a position-specific scoring matrix).
• This profile is then used as the query, and the search is performed again to find more distantly related sequences.
• PSI-BLAST is much more sensitive in picking up distant evolutionary relationships than a standard protein-protein BLAST.
25. Matrix
• A key element in evaluating the quality of a
pairwise sequence alignment is the
"substitution matrix", which assigns a score for
aligning any possible pair of residues.
• BLAST includes BLOSUM and PAM matrices.
28. The Score Matrix
Sequence A: ACDEFGH
Sequence B: HICDYGH
Each cell X(i, j) of the score matrix holds the substitution score for pairing a residue of A (columns) with a residue of B (rows):
    A   C   D   E   F   G   H
H  -2  -3  -1   0  -3  -2   8
I  -1  -1  -3  -3   0  -4  -3
C   0   9  -3  -4  -2  -3  -3
D  -2  -3   6   2  -3  -1  -1
Y  -2  -2  -3  -2   3  -3   2
G   0  -3  -1  -2  -3   6  -2
H  -2  -3  -1   0  -3  -2   8
Example alignment:
-ACDEFGH
HICD-YGH
The figure labels the gaps, the similar positions (similarity), and the identical positions (identity) in this alignment.
29. Paths in the Score Matrix
-ACDEFGH
HICD-YGH
    A   C   D   E   F   G   H
H  -2  -3  -1   0  -3  -2   8
I  -1  -1  -3  -3   0  -4  -3
C   0   9  -3  -4  -2  -3  -3
D  -2  -3   6   2  -3  -1  -1
Y  -2  -2  -3  -2   3  -3   2
G   0  -3  -1  -2  -3   6  -2
H  -2  -3  -1   0  -3  -2   8
• Along a path through the score matrix, diagonal steps are matches; the other steps are insertions or deletions.
• Alignments are in a one-to-one correspondence with score matrix paths.
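The correspondence between paths and alignments is what dynamic programming exploits: the best path through the matrix is the best alignment. The sketch below is a minimal Needleman–Wunsch global alignment over the 7 × 7 score matrix from the slides. The linear gap cost of -2 is an assumption (the slides give no gap penalty for this example), and the names `X`, `GAP` and `needleman_wunsch` are illustrative.

```python
A = "ACDEFGH"  # columns of the score matrix
B = "HICDYGH"  # rows of the score matrix

# X[row][col]: score for pairing B[row] with A[col], copied from the slide.
X = [
    [-2, -3, -1,  0, -3, -2,  8],
    [-1, -1, -3, -3,  0, -4, -3],
    [ 0,  9, -3, -4, -2, -3, -3],
    [-2, -3,  6,  2, -3, -1, -1],
    [-2, -2, -3, -2,  3, -3,  2],
    [ 0, -3, -1, -2, -3,  6, -2],
    [-2, -3, -1,  0, -3, -2,  8],
]
GAP = -2  # assumed linear gap cost


def needleman_wunsch(a, b, sub, gap):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub[i - 1][j - 1],  # diagonal: pair residues
                          F[i - 1][j] + gap,                    # vertical: gap in a
                          F[i][j - 1] + gap)                    # horizontal: gap in b
    # Trace the path back from the bottom-right corner.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub[i - 1][j - 1]:
            top.append(a[j - 1]); bottom.append(b[i - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append("-"); bottom.append(b[i - 1]); i -= 1
        else:
            top.append(a[j - 1]); bottom.append("-"); j -= 1
    return F[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, aligned_a, aligned_b = needleman_wunsch(A, B, X, GAP)
print(score)
print(aligned_a)
print(aligned_b)
```

With this gap cost, the alignment shown on the slide scores 31 from its six paired columns plus two gaps at -2 each, i.e. 27, so the optimal score returned here is at least 27.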
30. Low Complexity Regions
• Amino acid or DNA sequence regions that offer very
low information due to their highly biased content
– histidine-rich domains in proteins
– poly-A tails in DNA sequences
– poly-G tails in nucleotides
– runs of purines
– runs of pyrimidines
– runs of a single amino acid, etc.
31. E-value
• Depends on database size
• Indicates the number of database matches expected as a result of random chance
• The lower the E-value, the more significant the match and the less likely it is the result of random chance
32. E = m × n × p
E = E-value
m = total number of residues in the database
n = number of residues in the query sequence
p = probability that a high-scoring pair is the result of random chance
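33. • E-value between 10^-50 and 0.01: the match can be considered a result of homology
• E-value between 0.01 and 10: not significant, but may hint at remote homology
• E-value > 10: unrelated, or too distantly related to detect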
34. Bit Score
• Measures sequence similarity independently of query sequence length and database size; it is normalized from the raw pairwise alignment score
• The higher the bit score, the more significant the match
• $S' = (\lambda S - \ln K) / \ln 2$
S' = bit score
S = raw alignment score
λ = Gumbel distribution constant
K = constant associated with the scoring matrix
(λ and K are two statistical parameters)
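The bit-score formula and the E-value can be connected in a short calculation. This sketch assumes the standard Karlin–Altschul relation E = m × n × 2^(−S'), which corresponds to taking the per-position chance probability p in E = m × n × p as 2^(−S'). The values of λ, K, the raw score and the database size below are illustrative numbers only, not parameters read from any BLAST run.

```python
import math


def bit_score(raw_score: float, lam: float, K: float) -> float:
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value(m: int, n: int, bits: float) -> float:
    """Expected number of chance hits: E = m * n * 2^(-S') (assumed relation)."""
    return m * n * 2.0 ** (-bits)


# Illustrative parameters (assumptions for this example).
lam, K = 0.267, 0.041
raw = 100                 # raw alignment score S
m, n = 50_000_000, 250    # database residues, query residues

bits = bit_score(raw, lam, K)
print(f"bit score = {bits:.1f}")
print(f"E-value   = {e_value(m, n, bits):.2e}")
```

Because E grows linearly with m, the same alignment becomes less significant when the search is run against a larger database, which is the dependence on database size noted above.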
35. Low Complexity Regions (LCR)
Masking:
(I) Hard masking: masked residues are replaced with N's (nucleotides) or X's (amino acids)
(II) Soft masking: masked residues are converted to lowercase
Programs for masking:
(i) SEG: regions with a high frequency of a few residues are declared LCRs
(ii) RepeatMasker: a sequence region scoring above a certain threshold is declared an LCR
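The two masking styles differ only in what they write into the masked region. The sketch below assumes the low-complexity regions have already been found (for example by a program such as SEG) and are passed in as half-open `(start, end)` intervals; the function names and the example sequence are hypothetical.

```python
def hard_mask(seq: str, regions, mask_char: str = "N") -> str:
    """Replace every residue inside the regions with mask_char ('N' for DNA, 'X' for protein)."""
    chars = list(seq)
    for start, end in regions:
        for k in range(start, end):
            chars[k] = mask_char
    return "".join(chars)


def soft_mask(seq: str, regions) -> str:
    """Keep the residues but write masked regions in lowercase."""
    chars = list(seq.upper())
    for start, end in regions:
        for k in range(start, end):
            chars[k] = chars[k].lower()
    return "".join(chars)


dna = "ACGTAAAAAAGGC"
low_complexity = [(4, 10)]  # the poly-A run, as a half-open interval

print(hard_mask(dna, low_complexity))
print(soft_mask(dna, low_complexity))
```

Hard masking discards the original residues, while soft masking keeps them readable, so a search program can ignore lowercase regions when seeding but still align through them.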
37. BLAST result page
• The BLAST result page is divided into 3 parts.
• Part 1 contains information on the version, the database used, the reference, and the length of the query sequence.
• Part 2 shows the conserved regions and a graphical representation of the alignments, where each line represents the alignment of the query sequence with one database sequence.
• The lines are shown in 5 different colors depending on the bit score.
• Part 3 contains the list of database sequences found to be similar in the search, and a detailed view of each alignment along with its bit score, E-value, identities, positives and gaps.
42. BLAST Preferred
• BLAST uses a substitution matrix to find matching words, while FASTA identifies identical matching words using a hashing procedure. By default FASTA scans smaller window sizes, so it gives more sensitive results than BLAST, with better coverage of homologs, but it is usually slower than BLAST.
43. • BLAST uses low-complexity masking, so it may have higher specificity than FASTA, which reduces false positives.
• BLAST sometimes gives multiple best-scoring alignments from the same sequence; FASTA returns only one final alignment.