Genome Wide Association Studies
(GWAS)
BY USING TASSEL 5.0
Dr. Amit Joshi
HOD-Department of Biochemistry
Kalinga University
GWAS INTRODUCTION
• In genomics, a genome-wide association study (GWA study, or
GWAS), is an observational study of a genome-wide set of
genetic variants in different individuals to see if any variant is
associated with a trait.
• GWA studies typically focus on associations between single-
nucleotide polymorphisms (SNPs) and traits like major human
diseases, but can equally be applied to any other genetic variants
and any other organisms.
• Software commonly used for conducting GWAS includes STRUCTURE, PLINK,
TASSEL, and R commands run in RStudio.
Download & Install Tassel 5.0
• https://tassel.bitbucket.io/
Webpage view
Webpage link
Preparing the Input files
A. Phenotype file
• Prepare the phenotype file as shown in the figure below.
• Note: If your data has covariates such as sex, age or treatment,
categorize them under the header name factor.
B. Genotype file
• TASSEL accepts various genotype file formats such as VCF (variant call format),
.hmp.txt, and PLINK. In this tutorial, I am using the hmp.txt version of the genotype
file. A screenshot of the hmp.txt genotype file is shown below.
Importing phenotype and genotype files
• Import the files by following the steps shown below. Tip! Both files
can be opened at the same time by holding CTRL and clicking the file names.
Phenotype Data
Phenotype distribution plot
Genotype summary analysis
• The next crucial step is to look at the genotype data by following
the steps shown. A couple of key things to look at are:
• Minor allele frequency distribution
• Missing genotypic data, to see if it needs to be imputed
• Proportion of heterozygous calls in the samples, to check for selfed samples
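The three checks listed above can also be reproduced outside TASSEL. The following is a minimal R sketch on a made-up genotype matrix; the 0/1/2 coding and every object name are assumptions for illustration, not TASSEL output.

```r
# Toy genotype matrix: rows = samples (taxa), columns = SNP sites.
# Assumed coding: 0 and 2 = the two homozygotes, 1 = heterozygote, NA = missing.
geno <- matrix(
  c(0, 1, 2, NA,
    0, 0, 2, 2,
    1, 0, NA, 2,
    0, 0, 2, 2,
    2, 0, 1, 2),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("taxon", 1:5), paste0("snp", 1:4))
)

# Frequency of the counted allele at each site, then minor allele frequency
allele_freq <- colMeans(geno, na.rm = TRUE) / 2
maf <- pmin(allele_freq, 1 - allele_freq)

# Proportion of missing calls at each site
site_missing <- colMeans(is.na(geno))

# Proportion of heterozygous calls per sample, among non-missing calls
taxa_het <- rowMeans(geno == 1, na.rm = TRUE)

print(round(maf, 3))
print(site_missing)
print(round(taxa_het, 3))
```

A site with MAF of 0 (snp4 here) is monomorphic among the called samples, and samples with unusually low heterozygosity are candidates for selfing.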
Filter genotypes based on call rate
• Steps to filter the genotypes based on call rate and heterozygosity
level are shown below:
• In the video, genotypes were filtered based on the listed parameters:
• 90% minimum sites present
• 5% minimum heterozygosity
• 100% maximum heterozygosity
Filter markers based on read depth, minor and
major allele frequency (MAF)
• Steps to filter markers based on read depth, minor and major allele
frequency (MAF) are shown below:
• In the video, markers were filtered based on the listed parameters:
• 100 minimum count of 545 sequences (the number of times a
particular allele was seen for that locus)
• 0.05 minor allele frequency (sets the filter threshold for rare alleles)
• 0.95 major allele frequency (sets the filter threshold to remove
monomorphic markers)
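The marker filter above can be sketched in R as follows. The genotype matrix, the 0/1/2 coding, and the small `min_count` value are assumptions chosen so the toy example stays readable; a minimum MAF of 0.05 is equivalent to a maximum major allele frequency of 0.95.

```r
# Toy genotype matrix (10 samples x 4 sites), coded 0/1/2 with NA for missing
geno <- cbind(
  snp1 = rep(0, 10),                        # monomorphic site
  snp2 = c(0, 2, rep(NA, 8)),               # only 2 called genotypes
  snp3 = c(0, 0, 0, 0, 1, 1, 2, 2, 0, 0),
  snp4 = c(2, 2, 2, 2, 2, 2, 2, 2, 1, 1)
)

min_count <- 5     # hypothetical minimum number of called genotypes per site
min_maf   <- 0.05  # minimum minor allele frequency

called      <- colSums(!is.na(geno))
allele_freq <- colMeans(geno, na.rm = TRUE) / 2
maf         <- pmin(allele_freq, 1 - allele_freq)

# Keep sites that are called often enough and are not rare or monomorphic
keep <- called >= min_count & maf >= min_maf
geno_filtered <- geno[, keep, drop = FALSE]

print(colnames(geno_filtered))
```

In this toy data snp1 is dropped for being monomorphic and snp2 for having too few called genotypes.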
Conduct GWAS analysis: Multidimensional scaling (MDS)
MDS output can be used as a covariate in the GLM or MLM
to correct for population structure. Please follow the steps
shown below:
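To illustrate what the MDS step produces, here is a small R sketch using classical MDS (`cmdscale`) on a toy genotype matrix. Euclidean distance on 0/1/2 codes is an assumed stand-in for the distance matrix TASSEL computes, and the data are invented.

```r
# Toy genotypes for two groups of three samples each (0/1/2 coding assumed)
geno <- rbind(
  a1 = c(0, 0, 0, 2, 2, 0),
  a2 = c(0, 0, 1, 2, 2, 0),
  a3 = c(0, 1, 0, 2, 2, 0),
  b1 = c(2, 2, 2, 0, 0, 2),
  b2 = c(2, 2, 1, 0, 0, 2),
  b3 = c(2, 1, 2, 0, 0, 2)
)

# Pairwise distances between samples, then classical MDS in 2 dimensions
d   <- dist(geno)
mds <- cmdscale(d, k = 2)
colnames(mds) <- c("MDS1", "MDS2")

print(round(mds, 2))
```

The first axis separates the two groups, which is the kind of structure the MDS covariates are meant to absorb in the GLM or MLM.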
Intersecting the files
• Intersect the genotype, phenotype and MDS files by following the
steps below:
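Conceptually, the intersect join keeps only the samples present in all three tables. A minimal R sketch, assuming each table has a shared `Taxa` column (the column and object names are hypothetical):

```r
pheno <- data.frame(Taxa = c("t1", "t2", "t3", "t4"),
                    height = c(10.2, 11.5, 9.8, 12.1))
geno  <- data.frame(Taxa = c("t2", "t3", "t4", "t5"),
                    snp1 = c(0, 1, 2, 0),
                    snp2 = c(2, 2, 0, 1))
mds   <- data.frame(Taxa = c("t1", "t2", "t3", "t4", "t5"),
                    MDS1 = c(-0.3, 0.1, 0.2, -0.1, 0.4))

# merge() performs an inner join by default, so chaining it over the three
# tables keeps only taxa found in every one of them
joined <- Reduce(function(x, y) merge(x, y, by = "Taxa"),
                 list(pheno, geno, mds))

print(joined)
```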
Running General Linear Model (GLM)
• Run the GLM analysis by selecting the intersected files following the
steps below:
The output of the GLM analysis is produced under the
Result node. The GLM association test can be evaluated
by plotting the Q-Q plot and the Manhattan plot as shown
below.
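The idea behind the GLM scan can be sketched in R as one linear regression per marker, with a structure covariate included. This is a simplified additive model on simulated data, not TASSEL's implementation; all names and the simulated effect size are assumptions.

```r
set.seed(42)
n <- 200

# Simulated 0/1/2 genotypes at 5 markers and one covariate (e.g. an MDS axis)
geno <- matrix(rbinom(n * 5, size = 2, prob = 0.3), nrow = n,
               dimnames = list(NULL, paste0("snp", 1:5)))
mds1 <- rnorm(n)

# Trait simulated so that only snp3 has a real effect
y <- 2 * geno[, "snp3"] + 0.5 * mds1 + rnorm(n)

# Fit trait ~ marker + covariate for each marker; keep the marker's p-value
p_values <- sapply(colnames(geno), function(snp) {
  fit <- lm(y ~ geno[, snp] + mds1)
  summary(fit)$coefficients[2, "Pr(>|t|)"]
})

print(signif(p_values, 3))
```

Only the marker that was given an effect should stand out with a very small p-value.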
Mixed Linear Model (MLM)
• Calculating Kinship matrix
• Follow the steps below to calculate the kinship matrix:
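For intuition about what a kinship matrix contains, the sketch below computes a centered, VanRaden-style relationship matrix in R. This formula is an assumption about what a "centered" kinship looks like and is not claimed to match TASSEL's computation exactly; the genotypes are invented.

```r
# Toy genotypes (rows = samples, columns = markers, 0/1/2 coding assumed)
geno <- rbind(
  a1 = c(0, 0, 0, 2, 2, 0),
  a2 = c(0, 0, 1, 2, 2, 0),
  a3 = c(0, 1, 0, 2, 2, 0),
  b1 = c(2, 2, 2, 0, 0, 2),
  b2 = c(2, 2, 1, 0, 0, 2),
  b3 = c(2, 1, 2, 0, 0, 2)
)

p <- colMeans(geno) / 2          # allele frequency per marker
W <- sweep(geno, 2, 2 * p)       # center each marker by twice its frequency
K <- tcrossprod(W) / (2 * sum(p * (1 - p)))

print(round(K, 2))
```

Samples from the same group get positive off-diagonal values and samples from different groups get negative ones.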
Running Mixed Linear Model (MLM)
• The MLM includes the PCA and the kinship matrix, i.e.
MLM(PCA+K).
• Therefore, once the kinship matrix has been calculated, the MLM can
be run by following the steps below:
Exporting results
• One may export the results in .txt format (Results - Table)
Determine GWAS Significance Threshold
• A Bonferroni threshold can be determined to identify markers
significantly associated with the trait, using the equation below:
Bonferroni threshold = α / N
• where α is the chosen significance level (commonly 0.05) and N is the
total number of markers tested in the association analysis. Similarly,
another way is to apply the FDR (False Discovery Rate) correction
method, which is less stringent than controlling the family-wise
error rate.
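As a small worked illustration of the two corrections, the R sketch below applies them to four hypothetical p-values with a significance level of 0.05. The marker names and p-values are invented.

```r
alpha <- 0.05
p <- c(snp1 = 2e-8, snp2 = 4e-4, snp3 = 0.03, snp4 = 0.6)  # hypothetical

# Bonferroni threshold: alpha divided by the number of markers tested
n_markers      <- length(p)
bonf_threshold <- alpha / n_markers
print(bonf_threshold)
print(-log10(bonf_threshold))   # the same threshold on the Manhattan-plot scale

# FDR (Benjamini-Hochberg) adjusted p-values
p_fdr <- p.adjust(p, method = "fdr")

print(names(p)[p < bonf_threshold])   # markers passing Bonferroni
print(names(p)[p_fdr < alpha])        # markers passing FDR
```

Here snp3 passes the FDR criterion but not the Bonferroni threshold, which shows in miniature why FDR is described as the less stringent of the two.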
Adjust P-Values For Multiple Comparisons:
Bonferroni and False Discovery Rate
• Given the output from the GLM and/or MLM analysis, one can calculate the
adjusted p-values using one of the frequently used multiple-comparison methods,
Bonferroni and False Discovery Rate (FDR), in R using the code below.
# Import the GLM or MLM stats file exported from TASSEL (tab-delimited).
glm_stats <- read.table("GLMstats.txt", header = TRUE, sep = "\t")
# Check data
head(glm_stats)
# Import R library
library(dplyr)
# Calculate Bonferroni correction and False Discovery Rate
adj_glm <- glm_stats %>%
  transmute(Marker, Chr, Pos, p,
            p_Bonferroni = p.adjust(p, "bonferroni"),
            p_FDR = p.adjust(p, "fdr")
  )
View(adj_glm)
# Save the result to a file (without R's row names as an extra column)
write.csv(adj_glm, file = "adj_p_GLM.csv", quote = TRUE, eol = "\n",
          na = "NA", row.names = FALSE)
# QQ plot
library(qqman)
# Import data
adj_glm_KRN_4 <- read.csv("adj_p_GLM.csv", header = TRUE)
# Plot QQ plots for GLM(PCA): raw, Bonferroni-adjusted and FDR-adjusted p-values
par(mfrow = c(1, 3))
qq(adj_glm_KRN_4$p, main = "non-adjusted P-value")
qq(adj_glm_KRN_4$p_Bonferroni, main = "Bonferroni")
qq(adj_glm_KRN_4$p_FDR, main = "FDR")
par(mfrow = c(1, 1))
The Hardy–Weinberg (HW) principle
• Allele and genotype frequencies in a population will remain
constant from generation to generation in the absence of other
evolutionary influences.
• These influences include non-random mating, mutation,
selection, genetic drift, gene flow and meiotic drive.
• Allele frequency: f(A)=p, f(a)=q
• Genotype frequency: f(AA)=p^2, f(aa)=q^2, f(Aa)=2pq
• When both allele and genotype frequencies remain unchanged, the
population is in Hardy-Weinberg equilibrium
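A short R sketch of how the HW expectations are used in practice: estimate p from observed genotype counts, derive the expected counts p^2, 2pq and q^2, and compare with a chi-square statistic. The genotype counts are hypothetical.

```r
# Hypothetical observed genotype counts at one biallelic locus
obs <- c(AA = 380, Aa = 440, aa = 180)
n   <- sum(obs)

# Allele frequencies estimated from the genotype counts
p <- unname((2 * obs["AA"] + obs["Aa"]) / (2 * n))
q <- 1 - p

# Expected counts under Hardy-Weinberg equilibrium
expected <- n * c(AA = p^2, Aa = 2 * p * q, aa = q^2)

# Chi-square goodness of fit with 1 degree of freedom
chi_sq  <- sum((obs - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

print(expected)
print(chi_sq)
print(p_value)
```

One degree of freedom is used because there are three genotype classes, one constraint from the total count, and one estimated parameter (p).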
HW principle for two loci
• First locus: A and a alleles; Second locus: B and b alleles
• Allele frequency: PA+Pa = 1, PB+Pb=1
• Haplotype frequency: PAB=PAPB, Pab=PaPb, and so on
• Haplotype frequency reaches the equilibrium stage with one generation of
random mating if the two loci are on different chromosomes
• It takes multiple generations to reach the equilibrium stage if the two
loci are on the same chromosome
• The lower the recombination rate between the two loci, the more
generations it takes to move out of the linkage disequilibrium stage
Linkage equilibrium and Disequilibrium
Linkage equilibrium: haplotype frequencies in a population
have the same value that they would have if the genes at each
locus were combined at random.
Linkage disequilibrium: Non-random association of alleles at
different loci in a given population
Linkage Disequilibrium Quantification
• Linkage equilibrium: PAB=PAPB
• D(ifference):
$$D = P_{AB} - P_A P_B = \begin{cases} 0 & \text{if equilibrium} \\ \text{non-zero} & \text{if disequilibrium} \end{cases}$$
Linkage Disequilibrium (LD)
Loci and allele          A      a      B      b
Frequency                0.6    0.4    0.7    0.3

Gametic type             AB     Ab     aB     ab
Observed                 0.5    0.1    0.2    0.2
Frequency equilibrium    0.42   0.18   0.28   0.12
Difference               0.08   -0.08  -0.08  0.08

D = PAB-PAPB = Pab-PaPb = -(PAb-PAPb) = -(PaB-PaPB)
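Lemma: D = PAB Pab - PAb PaB
Proof
(1): PAB Pab = (PAPB+D)(PaPb+D) = PAPB PaPb + PAPB D + PaPb D + D^2
(2): PAb PaB = (PAPb-D)(PaPB-D) = PAPb PaPB - PAPb D - PaPB D + D^2
Subtracting (2) from (1), the first terms cancel because PAPB PaPb = PAPb PaPB:
PAB Pab - PAb PaB = D(PAPB + PaPb + PAPb + PaPB) = D(PA+Pa)(PB+Pb) = D
Hence D = PAB Pab - PAb PaB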
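The table and the lemma can be checked numerically. The R sketch below uses the observed haplotype frequencies from the table (AB = 0.5, Ab = 0.1, aB = 0.2, ab = 0.2); the object names are arbitrary.

```r
# Observed haplotype (gametic type) frequencies from the table
hap <- c(AB = 0.5, Ab = 0.1, aB = 0.2, ab = 0.2)

# Allele frequencies implied by the haplotype frequencies
pA <- hap[["AB"]] + hap[["Ab"]]   # 0.6
pB <- hap[["AB"]] + hap[["aB"]]   # 0.7

# Haplotype frequencies expected at linkage equilibrium
expected <- c(AB = pA * pB,       Ab = pA * (1 - pB),
              aB = (1 - pA) * pB, ab = (1 - pA) * (1 - pB))

# D from the definition and from the lemma
D_definition <- hap[["AB"]] - pA * pB
D_lemma      <- hap[["AB"]] * hap[["ab"]] - hap[["Ab"]] * hap[["aB"]]

print(expected)
print(hap - expected)
print(c(definition = D_definition, lemma = D_lemma))
```

Both routes give D = 0.08, and the differences between observed and expected frequencies are +D or -D for every gametic type.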
D depends on allele frequency
• D varies with allele frequency even under complete LD
• PAb=PaB=0
• PAB=1-Pab=PA=PB
• D=PA-PAPA
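To see this dependency, the R sketch below evaluates D = PA - PA^2 under complete LD (PAb = PaB = 0) over a range of allele frequencies.

```r
# Allele frequency of A under complete LD, where PAB = PA = PB
pA <- seq(0.1, 0.9, by = 0.1)

# D = PAB - PA * PB = PA - PA^2
D_complete <- pA - pA^2

print(data.frame(pA = pA, D = D_complete))
```

D peaks at 0.25 when PA = 0.5 and shrinks toward 0 as the allele becomes rare or common, even though LD is complete throughout.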
Property of D
• Deviation between observed and expected
• Extreme values: -0.25 and 0.25
• Non LD (equilibrium): D=0
• Dependency on allele frequency
Modification of D: D’
• Lewontin (1964) proposed standardizing D to the
maximum possible value it can take:
• D’=D/Dmax
• Dmax:
$$D_{max} = \begin{cases} \max(-P_A P_B,\ -P_a P_b) & \text{if } D < 0 \\ \min(P_A P_b,\ P_a P_B) & \text{if } D > 0 \end{cases}$$
• Range of D’: 0 to 1
Example
Loci and allele          A      a      B      b
Frequency                0.6    0.4    0.7    0.3

Gametic type             AB     Ab     aB     ab
Observed                 0.5    0.1    0.2    0.2
Frequency equilibrium    0.42   0.18   0.28   0.12
Difference               0.08   -0.08  -0.08  0.08

• D = PAB-PAPB = 0.08
• Dmax = min(PAPb, PaPB) = min(0.6x0.3, 0.4x0.7) = 0.18
• D’ = D/Dmax = 0.08/0.18 = 0.44
Example (switch A and a)
Loci and allele          a      A      B      b
Frequency                0.6    0.4    0.7    0.3

Gametic type             aB     ab     AB     Ab
Observed                 0.5    0.1    0.2    0.2
Frequency equilibrium    0.42   0.18   0.28   0.12
Difference               0.08   -0.08  -0.08  0.08

• D = PAB-PAPB = 0.2 - 0.28 = -0.08
• Dmax = max(-PAPB, -PaPb) = max(-0.4x0.7, -0.6x0.3) = -0.18
• D’ = D/Dmax = -0.08/-0.18 = 0.44
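Both worked examples can be reproduced with a small R function that follows the Dmax definition given earlier. The function name `d_prime` is arbitrary.

```r
# D' = D / Dmax, with Dmax chosen according to the sign of D
d_prime <- function(pAB, pA, pB) {
  pa <- 1 - pA
  pb <- 1 - pB
  D  <- pAB - pA * pB
  if (D == 0) return(0)
  Dmax <- if (D < 0) max(-pA * pB, -pa * pb) else min(pA * pb, pa * pB)
  D / Dmax
}

# First example: PAB = 0.5, PA = 0.6, PB = 0.7 (D = 0.08)
print(d_prime(0.5, 0.6, 0.7))

# Second example, with the A and a labels switched: PAB = 0.2, PA = 0.4 (D = -0.08)
print(d_prime(0.2, 0.4, 0.7))
```

Relabelling the alleles flips the sign of D but leaves D' unchanged at about 0.44.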
R2
• Hill and Robertson (1968) proposed the following
measure of linkage disequilibrium:
• r^2 (Δ^2) = D^2/(PA PB Pa Pb)
• Squaring makes the value positive
• The product of allele frequencies in the denominator creates a penalty
for 50% allele frequency.
• Range: 0 to 1
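Applied to the running example (PAB = 0.5, PA = 0.6, PB = 0.7), the formula can be evaluated in R as follows.

```r
pAB <- 0.5
pA  <- 0.6
pB  <- 0.7

D  <- pAB - pA * pB                           # 0.08
r2 <- D^2 / (pA * pB * (1 - pA) * (1 - pB))   # D^2 / (PA PB Pa Pb)

print(round(r2, 3))
```

For the same haplotype frequencies r^2 is about 0.13 while D' is about 0.44, so the two statistics can rank the strength of LD quite differently.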
Summary of LD statistics
                        P values                     D                                D’        R2
Definition              Statistical test (e.g. χ2)   PAB-PAPB                         D/Dmax    D^2/(PA PB Pa Pb)
Value at equilibrium    1                            0                                0         0
Value at complete LD    0                            -0.25 or 0.25                    1         1
Disadvantage                                         Dependency on allele frequency             Penalty on neutral loci
Causes of LD
• Linkage
• Mutation
• Selection
• Inbreeding
• Genetic drift
• Gene flow/admixture
Spurious association
True association
Mutation and selection
[Figure: haplotypes A____q and A____Q shown across Generation 1, Generation 2 and Generation 3, with the labels "mutation" and "Selection"]
Change in D over time
• c: recombination rate
• Dt = D0(1-c)^t
• t = log(Dt/D0)/log(1-c)
• If c = 10%, it takes about 6.6 generations for D to be cut in half
• Assuming 1 Mb = 1 cM, two SNPs 100 kb apart have
c = 1% / 10 = 0.001
• It then takes about 693 generations for D to be cut in half
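The two half-life figures follow directly from t = log(Dt/D0)/log(1-c). A brief R sketch (the function and argument names are arbitrary):

```r
# Generations needed for D to fall to a given fraction of its starting value
generations_to_reach <- function(ratio, rec_rate) {
  log(ratio) / log(1 - rec_rate)
}

print(generations_to_reach(0.5, 0.10))    # recombination rate of 10%
print(generations_to_reach(0.5, 0.001))   # two SNPs 100 kb apart at 1 cM per Mb

# Decay curve Dt = D0 * (1 - c)^t over 50 generations, starting from D0 = 0.25
D0 <- 0.25
t  <- 0:50
Dt <- D0 * (1 - 0.10)^t
print(round(Dt[c(1, 11, 51)], 4))         # D at t = 0, 10 and 50
```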
[Figure: Dt plotted against t (0 to 50 generations) for c = 0.01, 0.05, 0.1 and 0.25, with Dt on an axis from 0.00 to 0.25]
Human out of Africa
https://arstechnica.com/science/2015/12/the-human-migration-out-of-africa-left-its-mark-in-mutations/
LD decay over distance
HW equilibrium, Linkage equilibrium and Linkage disequilibrium
• Single locus: HWE (PAA=P^2)
• Multiple loci: LE (PAB=PAPB) or LD (PAB!=PAPB)
[Diagram labels: same chromosome, different chromosome, LD decay, PAB=PAPB, association]
Estimation and plotting of LD decay over
distance from the LD results from TASSEL
Code to plot in R using GGplot:
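The slide's plotting code did not survive extraction. As a stand-in, the following R sketch simulates pairwise LD values and builds an LD-decay plot with ggplot2 when that package is installed. The column names `dist_bp` and `r2` are placeholders chosen for this sketch and are not claimed to be TASSEL's actual headers, and the data are simulated rather than read from an LD result file.

```r
set.seed(7)
n_pairs <- 500

# Simulated marker pairs: distance in bp and an r^2 value that decays with distance
ld <- data.frame(dist_bp = runif(n_pairs, min = 1, max = 5e5))
ld$r2 <- 1 / (1 + ld$dist_bp / 5e4) + rnorm(n_pairs, sd = 0.05)
ld$r2 <- pmin(1, pmax(0, ld$r2))   # keep r^2 within [0, 1]

# Mean r^2 within 50 kb distance bins
ld$bin <- cut(ld$dist_bp, breaks = seq(0, 5e5, by = 5e4))
decay  <- aggregate(r2 ~ bin, data = ld, FUN = mean)
print(decay)

# Scatter plus a smoothed decay curve; built only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p_ld <- ggplot2::ggplot(ld, ggplot2::aes(x = dist_bp / 1000, y = r2)) +
    ggplot2::geom_point(alpha = 0.3) +
    ggplot2::geom_smooth(method = "loess", se = FALSE) +
    ggplot2::labs(x = "Distance (kb)", y = "LD (r^2)")
  if (interactive()) print(p_ld)
}
```

With real data, `ld` would be replaced by the LD table exported from TASSEL, with its distance and r^2 columns renamed to match.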
Principal Components Analysis ( PCA)
• An exploratory technique used to reduce the
dimensionality of the data set to 2D or 3D
• Can be used to:
• Reduce number of dimensions in data
• Find patterns in high-dimensional data
• Visualize data of high dimensionality
• Example applications:
• Face recognition
• Image compression
• Gene expression analysis
Principal Components Analysis Ideas ( PCA)
• Does the data set ‘span’ the whole of d dimensional space?
• For a matrix of m samples x n genes, create a new covariance matrix
of size n x n.
• Transform some large number of variables into a smaller number of
uncorrelated variables called principal components (PCs).
• developed to capture as much of the variation in data as possible
Principal Component Analysis
[Figure: scatter of data points on the original axes X1 and X2, with the rotated axes Y1 and Y2 overlaid]
See online tutorials such as
http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Note: Y1 is the first eigenvector, Y2 is the second. Y2 is ignorable.
Key observation: the variance along Y1 is the largest!
Principal Component Analysis: one
attribute first
• Question: how much
spread is in the data
along the axis?
(distance to the mean)
• Variance=Standard
deviation^2
Temperature
42
40
24
30
15
18
15
30
15
30
35
30
40
30
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
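The sample variance of the Temperature column can be computed by hand in R and compared with the built-in `var()`, which uses the same n - 1 denominator.

```r
temperature <- c(42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30)
n <- length(temperature)

# Sum of squared distances to the mean, divided by n - 1
s2 <- sum((temperature - mean(temperature))^2) / (n - 1)

print(s2)
print(var(temperature))   # built-in sample variance
print(sqrt(s2))           # standard deviation
```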
Now consider two dimensions
X=Temperature Y=Humidity
40 90
40 90
40 90
30 90
15 70
15 70
15 70
30 90
15 70
30 70
30 70
30 90
40 70
$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
Covariance: measures how X and Y vary together
• cov(X,Y)=0: uncorrelated (no linear relationship)
• cov(X,Y)>0: X and Y move in the same direction
• cov(X,Y)<0: X and Y move in opposite directions
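Using the Temperature and Humidity columns from this slide, the covariance can be computed by hand in R and checked against the built-in `cov()`.

```r
temperature <- c(40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40)
humidity    <- c(90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70)
n <- length(temperature)

# Sum of products of deviations from the means, divided by n - 1
cov_xy <- sum((temperature - mean(temperature)) *
              (humidity - mean(humidity))) / (n - 1)

print(cov_xy)
print(cov(temperature, humidity))   # built-in equivalent
```

The result is positive, so in this data higher temperatures tend to go with higher humidity.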
More than two attributes: covariance
matrix
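• Contains covariance values between all possible
dimensions (= attributes):
$$C^{n \times n} = \left( c_{ij} \mid c_{ij} = \operatorname{cov}(Dim_i, Dim_j) \right)$$
• Example for three attributes (x, y, z):
$$C = \begin{pmatrix} \operatorname{cov}(x,x) & \operatorname{cov}(x,y) & \operatorname{cov}(x,z) \\ \operatorname{cov}(y,x) & \operatorname{cov}(y,y) & \operatorname{cov}(y,z) \\ \operatorname{cov}(z,x) & \operatorname{cov}(z,y) & \operatorname{cov}(z,z) \end{pmatrix}$$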
Eigenvalues & eigenvectors
• Vectors x having same direction as Ax are called
eigenvectors of A (A is an n by n matrix).
• In the equation Ax = λx, λ is called an eigenvalue of A.
$$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \times \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \times \begin{pmatrix} 3 \\ 2 \end{pmatrix}$$
Eigenvalues & eigenvectors
• Ax = λx ⇔ (A − λI)x = 0
• How to calculate x and λ:
• Calculate det(A − λI), which yields a polynomial (degree n)
• Determine the roots of det(A − λI) = 0; the roots are the eigenvalues λ
• Solve (A − λI)x = 0 for each λ to obtain the eigenvectors x
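These steps can be tried in R on a 2 x 2 matrix with rows (2, 3) and (2, 1), for which the vector (3, 2) is an eigenvector with eigenvalue 4. Base R's `eigen()` performs the root finding and solving.

```r
A <- matrix(c(2, 3,
              2, 1), nrow = 2, byrow = TRUE)
x <- c(3, 2)

# Direct check that A x equals 4 x
print(A %*% x)

# det(A - lambda I) = lambda^2 - 3 lambda - 4, with roots 4 and -1
e <- eigen(A)
print(e$values)

# Each column of e$vectors satisfies A v = lambda v
v1 <- e$vectors[, 1]
print(cbind(A %*% v1, e$values[1] * v1))
```

`eigen()` returns eigenvectors scaled to unit length, so the first column is proportional to (3, 2) rather than equal to it.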
Principal components
• 1. principal component (PC1)
• The eigenvalue with the largest absolute value will indicate
that the data have the largest variance along its
eigenvector, the direction along which there is greatest
variation
• 2. principal component (PC2)
• the direction with maximum variation left in data,
orthogonal to the 1. PC
• In general, only few directions manage to capture
most of the variability in the data.
Steps of PCA
• Let X̄ be the mean vector (taking the mean of all rows)
• Adjust the original data by the mean: X’ = X − X̄
• Compute the covariance matrix C of adjusted X
• Find the eigenvectors and eigenvalues of C.
• For matrix C, vectors e (= column vector) having the same direction as Ce:
• eigenvectors of C are the vectors e such that Ce = λe,
• λ is called an eigenvalue of C.
• Ce = λe ⇔ (C − λI)e = 0
• Most data mining packages do this for you.
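The steps above can be followed literally in R. The sketch below uses the temperature and humidity values from the earlier two-dimensional slide as example data and compares the result with R's built-in `prcomp()`.

```r
X <- cbind(
  temperature = c(40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40),
  humidity    = c(90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70)
)

x_bar <- colMeans(X)          # mean vector
X_adj <- sweep(X, 2, x_bar)   # adjust the data by the mean
C     <- cov(X_adj)           # covariance matrix of the adjusted data
eig   <- eigen(C)             # eigenvalues and eigenvectors of C

# Project the adjusted data onto the eigenvectors to get component scores
scores <- X_adj %*% eig$vectors

print(eig$values)
print(eig$vectors)
print(prcomp(X)$sdev^2)       # the packaged routine gives the same variances
```

The scores on different components are uncorrelated, and their variances equal the eigenvalues.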
Eigenvalues
• Calculate eigenvalues λ and eigenvectors x for the
covariance matrix:
• Eigenvalues λ_j are used for calculation of [% of total variance]
(V_j) for each component j:
$$V_j = 100 \cdot \frac{\lambda_j}{\sum_{x=1}^{n} \lambda_x}, \qquad \sum_{x=1}^{n} \lambda_x = n$$
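The percentage of variance per component is a one-line calculation once the eigenvalues are known. A short R sketch with hypothetical eigenvalues:

```r
# Hypothetical eigenvalues, already sorted from largest to smallest
eigenvalues <- c(4.2, 2.1, 0.9, 0.5, 0.3)

# Percentage of total variance per component, and the running total
V     <- 100 * eigenvalues / sum(eigenvalues)
V_cum <- cumsum(V)

print(round(V, 2))
print(round(V_cum, 2))
```

The cumulative percentages are what is usually inspected when deciding how many components to keep.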
Principal components - Variance
[Figure: bar chart of variance (%) explained by PC1 to PC10, on an axis from 0 to 25]
Transformed Data
• Eigenvalue λ_j corresponds to the variance on component j
• Thus, sort by λ_j
• Take the first p eigenvectors e_i, where p is the number of
top eigenvalues
• These are the directions with the largest variances
$$\begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{ip} \end{pmatrix} = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_p \end{pmatrix} \begin{pmatrix} x_{i1} - \bar{x}_1 \\ x_{i2} - \bar{x}_2 \\ \vdots \\ x_{in} - \bar{x}_n \end{pmatrix}$$
An Example
X1 X2 X1' X2'
19 63 -5.1 9.25
39 74 14.9 20.25
30 87 5.9 33.25
30 23 5.9 -30.75
15 35 -9.1 -18.75
15 43 -9.1 -10.75
15 32 -9.1 -21.75
[Figure: scatter plot of the original data, X2 against X1; Mean1=24.1, Mean2=53.8]
[Figure: scatter plot of the mean-adjusted data, X2' against X1']
Covariance Matrix
• C =
$$\begin{pmatrix} 75 & 106 \\ 106 & 482 \end{pmatrix}$$
• Using MATLAB, we find out:
• Eigenvectors:
• e1=(-0.98,-0.21), λ1=51.8
• e2=(0.21,-0.98), λ2=560.2
• Thus the second eigenvector is more important!
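For readers who want to redo this step without MATLAB, the eigen decomposition of the 2 x 2 covariance matrix with entries 75, 106, 106 and 482 can be recomputed in R with the sketch below. A quick consistency check on any result is that the eigenvalues must sum to the trace of C (75 + 482 = 557) and multiply to its determinant.

```r
C <- matrix(c(75, 106,
              106, 482), nrow = 2, byrow = TRUE)

eig <- eigen(C)

print(eig$values)    # sorted from largest to smallest
print(eig$vectors)   # columns are unit-length eigenvectors; signs are arbitrary

# Consistency checks: trace and determinant
print(sum(eig$values))
print(prod(eig$values))
```

Note that `eigen()` lists the largest eigenvalue first, so the most important direction appears in the first column.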
If we only keep one dimension: e2
• We keep the dimension
of e2=(0.21,-0.98)
• We can obtain the final
data as
$$y_i = \begin{pmatrix} 0.21 & -0.98 \end{pmatrix} \begin{pmatrix} x_{i1} \\ x_{i2} \end{pmatrix} = 0.21\, x_{i1} - 0.98\, x_{i2}$$
[Figure: the projected values y_i plotted along a single axis running from -40 to 40]
yi
-10.14
-16.72
-31.35
31.374
16.464
8.624
19.404
-17.63
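The projection can be reproduced in R from the mean-adjusted values X1' and X2' in the example table and the vector e2 = (0.21, -0.98). Only the seven rows that appear in the table are used here, so seven of the listed y values are reproduced.

```r
# Mean-adjusted data (X1', X2') as listed in the example table
X_adj <- cbind(
  X1 = c(-5.1, 14.9, 5.9, 5.9, -9.1, -9.1, -9.1),
  X2 = c(9.25, 20.25, 33.25, -30.75, -18.75, -10.75, -21.75)
)

e2 <- c(0.21, -0.98)

# y_i = 0.21 * x_i1 - 0.98 * x_i2 for every row
y <- as.vector(X_adj %*% e2)

print(round(y, 3))
```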
PCA –> Original Data
• Retrieving old data (e.g. in data compression)
• RetrievedRowData = (RowFeatureVector^T x
FinalData) + OriginalMean
• Yields the original data exactly if all components are kept, and an
approximation of it if only the chosen components are kept
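The retrieval formula can be sketched in R as below, reusing the temperature and humidity values from the earlier slide as example data. The helper name `reconstruct` is hypothetical.

```r
X <- cbind(
  temperature = c(40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40),
  humidity    = c(90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70)
)
x_bar <- colMeans(X)
X_adj <- sweep(X, 2, x_bar)
eig   <- eigen(cov(X_adj))

# Rebuild the data from the first k principal components
reconstruct <- function(k) {
  E <- eig$vectors[, seq_len(k), drop = FALSE]   # chosen eigenvectors
  final_data <- X_adj %*% E                      # transformed data (scores)
  sweep(final_data %*% t(E), 2, x_bar, "+")      # rotate back, add the mean
}

X_one <- reconstruct(1)   # approximation from PC1 only
X_all <- reconstruct(2)   # all components: exact recovery

print(round(head(X_one), 2))
print(max(abs(X_all - X)))
```

With one component, the squared reconstruction error equals (n - 1) times the discarded eigenvalue, which is the sense in which PCA compression loses only the low-variance directions.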
Principal components
• General about principal components
• summary variables
• linear combinations of the original variables
• uncorrelated with each other
• capture as much of the original variance as possible
Applications – Gene expression analysis
• Reference: Raychaudhuri et al. (2000)
• Purpose: Determine core set of conditions for useful
gene comparison
• Dimensions: conditions, observations: genes
• Yeast sporulation dataset (7 conditions, 6118 genes)
• Result: Two components capture most of variability (90%)
• Issues: uneven data intervals, data dependencies
• PCA is common prior to clustering
• Crisp clustering questioned : genes may correlate with
multiple clusters
• Alternative: determination of gene’s closest neighbours
Two Way (Angle) Data Analysis
• Sample space analysis: gene expression matrix of genes (10^3–10^4) by samples (10^1–10^2)
• Gene space analysis: gene expression matrix of conditions (10^1–10^2) by genes (10^3–10^4)
PCA - example
PCA on all Genes
Leukemia data, precursor B and T
Plot of 34 patients, dimension of 8973 genes
reduced to 2
PCA on 100 top significant genes
Leukemia data, precursor B and T
Plot of 34 patients, dimension of 100 genes
reduced to 2
PCA of genes (Leukemia data)
Plot of 8973 genes, dimension of 34 patients reduced to 2
PCA Analysis
For genotype & phenotype practice data
• https://avikarn.com/image/gwas/sample_Genotype-Chr-1_data_forTASSEL.hmp.txt
• https://avikarn.com/image/gwas/sample_Phenotype_data_forTASSEL.txt
• https://www.panzea.org/
• http://zzlab.net/GAPIT/GAPIT_Tutorial_Data.zip
Another way of running MLM:
Structure tool to Tassel tool
Generally the burn-in period is set to 10000, and MCMC reps to 100000
For which K value?
Inferred ancestry for Q matrix
Q-matrix from Structure
From sequence data we make the kinship
centered_IBS_Filtered file
We do an intersection join of genotype data +
phenotype data + Q-matrix
The resulting MLM statistics file can be analysed
for Manhattan plot and QQ plot
QQ Plot Interpretation
• This plot provides information on two main aspects
of GWAS data: whether the statistical testing is well
controlled for challenges such as population
stratification, and whether there is any association.
• A QQ plot compares the p-values expected when
testing for association under no effect with
those actually observed.
• Each dot represents a SNP
• X-axis shows: expected -log10(p)
• Y-axis shows: observed -log10(p)
• The red line shows the pattern of -log10(p) values if no
SNP has a significant genetic association with the
trait.
• When there are significant associations between
SNP markers and the trait, the SNP dots (blue
color) rise off the line
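The construction of a QQ plot can be sketched in base R as follows, using simulated p-values with no true associations. The uniform p-values stand in for a real GLM or MLM result column.

```r
set.seed(3)
p <- runif(1000)   # simulated p-values under the null (no association)

# Observed values: sorted p-values on the -log10 scale (largest first)
observed <- -log10(sort(p))

# Expected values: evenly spaced quantiles of a uniform distribution
expected <- -log10(ppoints(length(p)))

# Draw the plot only in an interactive session
if (interactive()) {
  plot(expected, observed, pch = 16, col = "blue",
       xlab = "Expected -log10(p)", ylab = "Observed -log10(p)")
  abline(0, 1, col = "red")   # the line expected when no SNP is associated
}

print(head(cbind(expected, observed)))
```

With null p-values the points follow the red line closely; real associations would lift the upper-right tail above it.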
Manhattan plot Interpretation
• A scatter plot used to show which
chromosomes carry significantly
associated SNPs, based
on their p-values.
• Genomic coordinates are
displayed on the x-axis and the negative
logarithm of the association p-value for
each SNP on the y-axis.
• Each dot signifies a SNP
• The strongest associations have the smallest
p-values (e.g. 10^-15)
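The coordinates behind a Manhattan plot can be prepared in base R as sketched below. The data frame mimics a results table with `Chr`, `Pos` and `p` columns, and all values are invented.

```r
# Hypothetical association results: 3 chromosomes, 4 SNPs each
res <- data.frame(
  Chr = rep(1:3, each = 4),
  Pos = rep(c(100, 200, 300, 400), times = 3),
  p   = c(0.5, 0.2, 0.8, 0.1, 0.3, 1e-9, 0.6, 0.05, 0.9, 0.4, 0.7, 0.02)
)

# Offset each chromosome by the total length of the chromosomes before it,
# so that positions run continuously along the x-axis
chr_len <- tapply(res$Pos, res$Chr, max)
offsets <- c(0, cumsum(chr_len)[-length(chr_len)])
res$x    <- res$Pos + offsets[res$Chr]
res$logp <- -log10(res$p)

# Draw the plot only in an interactive session
if (interactive()) {
  plot(res$x, res$logp, pch = 16, col = res$Chr,
       xlab = "Genomic position", ylab = "-log10(p)")
}

print(res)
```

The SNP with p = 1e-9 becomes the tallest point, at -log10(p) = 9.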
•Thank You

 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 

Genome wide association studies: in genomics, a genome-wide association study (GWA study, or GWAS) is an observational study of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait.

  • 1. Genome Wide Association Studies (GWAS) BY USING TASSEL 5.0 Dr. Amit Joshi HOD-Department of Biochemistry Kalinga University
  • 2. GWAS INTRODUCTION • In genomics, a genome-wide association study (GWA study, or GWAS) is an observational study of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait. • GWA studies typically focus on associations between single-nucleotide polymorphisms (SNPs) and traits like major human diseases, but can equally be applied to any other genetic variants and any other organisms. • Software available for conducting GWAS includes STRUCTURE, PLINK, TASSEL, and R commands run in RStudio, among others.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7. Download & Install Tassel 5.0 • https://tassel.bitbucket.io/ Webpage view Webpage link
  • 8. Preparing the Input files A. Phenotype file • Prepare the phenotype file as shown below in the figure • Note: If your data has covariates such as sex, age or treatment, categorize them under the header name factor.
  • 9. B. Genotype file • TASSEL accepts various genotype file formats such as VCF (variant call format), .hmp.txt, and PLINK. In this tutorial, I am using the hmp.txt version of the genotype file. Below is a screenshot of the hmp.txt genotype file.
  • 10. Importing phenotype and genotype files • Import the files by following the steps shown below. Tip! Both files can be opened at the same time by holding CTRL and clicking the file names.
  • 12.
  • 13.
  • 14.
  • 16.
  • 17. Genotype summary analysis • The next crucial step is to look at the genotype data by following the steps shown. A couple of key things to look at are: • Minor allele frequency distribution • Missing genotypic data, to see if it needs to be imputed • Proportion of heterozygous calls in the samples, to check for selfed samples
  • 18.
  • 19.
  • 20.
  • 21.
  • 22. Filter genotypes based on call rate • Steps to filter the genotypes based on call rate and heterozygosity level are shown below: • In the video, genotypes were filtered based on the listed parameters: • 90% minimum sites present • 5% minimum heterozygosity • 100% maximum heterozygosity
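The two quantities behind this filter can be sketched in R on a toy genotype matrix. This is only an illustration of the idea, not TASSEL's implementation: the data, the single-letter IUPAC coding (N for missing, R/Y/S/W/K/M for heterozygotes) and all object names are assumptions.

```r
# Toy genotypes: rows = taxa, columns = sites (hypothetical data).
geno <- matrix(c("A", "A", "N", "R",
                 "A", "N", "N", "N",
                 "A", "R", "G", "R"),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("t1", "t2", "t3"), paste0("s", 1:4)))

# Assumed IUPAC codes for heterozygous calls.
het_codes <- c("R", "Y", "S", "W", "K", "M")

# Call rate: proportion of sites with a non-missing genotype, per taxon.
call_rate <- rowMeans(geno != "N")

# Heterozygosity: proportion of heterozygous calls among the called sites.
het_rate <- apply(geno, 1, function(g) mean(g[g != "N"] %in% het_codes))

# Thresholds quoted on the slide: 90% sites present, 5% to 100% heterozygosity.
keep <- names(which(call_rate >= 0.90 & het_rate >= 0.05 & het_rate <= 1))
geno_filtered <- geno[keep, , drop = FALSE]
```

In this toy case only taxon t3 passes, because t1 and t2 have too many missing sites.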
  • 23.
  • 24.
  • 25.
  • 26. Filter markers based on read depth and minor and major allele frequency (MAF) • Steps to filter markers based on read depth and minor and major allele frequency are shown below: • In the video, markers were filtered based on the listed parameters: • 100 minimum count of 545 sequences (this is the number of times a particular allele was seen for that locus) • 0.05 minor allele frequency (sets the filter threshold for rare alleles) • 0.95 major allele frequency (sets the filter threshold to remove monomorphic markers)
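As a rough sketch of what a minor allele frequency filter does, the R snippet below computes MAF per marker on toy data. The data frame, the one-allele-per-taxon coding (as for inbred lines) and the minimum count of 9 are assumptions chosen to keep the example small; the slide's own thresholds apply to the real data set.

```r
# Toy allele calls: rows = taxa, columns = markers; NA = missing (hypothetical data).
alleles <- data.frame(
  m1 = c("A", "A", "A", "A", "G", "G", "A", "A", "A", "A"),
  m2 = c("C", "C", "C", "C", "C", "C", "C", "C", "C", "C"),
  m3 = c("T", "G", "T", "G", NA,  "T", "T", "G", "T", "T")
)

# Minor allele frequency: frequency of the less common allele among called
# genotypes; a monomorphic marker gets MAF 0.
maf <- sapply(alleles, function(x) {
  freq <- table(x) / sum(!is.na(x))
  if (length(freq) < 2) 0 else min(freq)
})

# Number of non-missing calls per marker.
n_called <- colSums(!is.na(alleles))

# For biallelic markers, MAF >= 0.05 is the same as major allele frequency <= 0.95.
keep <- names(maf)[maf >= 0.05 & n_called >= 9]
```

Here m2 is dropped as monomorphic, while m1 (MAF 0.2) and m3 (MAF 1/3 over 9 calls) are kept.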
  • 27.
  • 28.
  • 29. Conduct GWAS analysis: Multidimensional scaling (MDS) • MDS output can be used as a covariate in the GLM or MLM to correct for population structure. Please follow the steps shown below:
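For readers who want to see what MDS produces, here is a minimal sketch with base R's `cmdscale()`. The numeric 0/2 genotype coding, the toy taxa and the use of Euclidean distance are assumptions standing in for the distance matrix TASSEL computes internally.

```r
# Toy numeric genotypes (0 or 2 copies of an allele); rows = taxa (hypothetical data).
geno <- rbind(t1 = c(0, 0, 2, 2, 0),
              t2 = c(0, 0, 2, 2, 2),
              t3 = c(2, 2, 0, 0, 0),
              t4 = c(2, 2, 0, 0, 2))

# Pairwise distances between taxa (Euclidean distance is an assumption here).
d <- dist(geno)

# Classical MDS: one row of coordinates per taxon on the first two axes.
mds <- cmdscale(d, k = 2)
colnames(mds) <- c("MDS1", "MDS2")
mds
```

The first axis separates t1 and t2 from t3 and t4; coordinates like these are what gets joined to the phenotype table as covariates.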
  • 30.
  • 31.
  • 32.
  • 33.
  • 34. Intersecting the files • Intersect the genotype, phenotype and MDS files by following the steps below:
  • 35.
  • 36.
  • 37. Running the General Linear Model (GLM) • Run the GLM analysis by selecting the intersected files and following the steps below:
  • 38.
  • 39.
  • 40.
  • 41.
  • 42.
  • 43. The output of the GLM analysis is produced under the Result node. The GLM association test can be evaluated by plotting the Q-Q plot and the Manhattan plot as shown below.
  • 44. Mixed Linear Model (MLM) • Calculating the kinship matrix • Follow the steps below to calculate the kinship matrix:
  • 45.
  • 46.
  • 47. MLM
  • 48.
  • 49.
  • 50.
  • 51. Running the Mixed Linear Model (MLM) • The MLM model includes the PCA and the kinship matrix, i.e. MLM(PCA+K). • Therefore, once the kinship matrix has been calculated, MLM can be conducted by following the steps below:
  • 52.
  • 53.
  • 54.
  • 55. Exporting results • The results can be exported in .txt format: Results - Table
  • 56. Determine GWAS Significance Threshold • A Bonferroni threshold, $\alpha / N$ (where $\alpha$ is the chosen significance level and N is the total number of markers tested in the association analysis), can be used to identify the markers most significantly associated with the trait. • Another way is to apply the FDR (False Discovery Rate) correction method, which is less stringent than controlling the family-wise error rate.
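A Bonferroni cutoff is conventionally the significance level divided by the number of tests. The sketch below computes it in R and contrasts Bonferroni with FDR adjustment on a few made-up p-values; `alpha`, `n_markers` and the p-values are assumptions for illustration only.

```r
alpha <- 0.05        # family-wise significance level (assumed)
n_markers <- 3000    # hypothetical number of markers tested

# Bonferroni threshold on the p-value scale and on the -log10(p) scale.
bonferroni_threshold <- alpha / n_markers
bonferroni_line <- -log10(bonferroni_threshold)

# Hypothetical raw p-values for four markers.
p <- c(1e-6, 4e-4, 0.02, 0.5)
p_bonferroni <- p.adjust(p, method = "bonferroni")
p_fdr <- p.adjust(p, method = "fdr")

# FDR-adjusted values are never larger than Bonferroni-adjusted ones.
all(p_fdr <= p_bonferroni)
```

The value `bonferroni_line` is what would be drawn as a horizontal line on a Manhattan plot.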
  • 57. Adjust P-Values For Multiple Comparisons: Bonferroni and False Discovery Rate • Given the output from the GLM and/or MLM analysis, one can calculate adjusted p-values using one of the frequently used multiple-comparison methods, Bonferroni and False Discovery Rate (FDR), in R using the code below.
  • 58.
      # Import GLM or MLM stats file (tab-separated).
      glm_stats <- read.table("GLMstats.txt", header = TRUE, sep = "\t")
      # Check data
      head(glm_stats)
      # Import R library
      library(dplyr)
      # Calculate Bonferroni correction and False Discovery Rate
      adj_glm <- glm_stats %>%
        transmute(Marker, Chr, Pos, p,
                  p_Bonferroni = p.adjust(glm_stats$p, "bonferroni"),
                  p_FDR = p.adjust(glm_stats$p, "fdr"))
      View(adj_glm)
      # Save the result to a file
      write.csv(adj_glm, file = "adj_p_GLM.csv", quote = TRUE, eol = "\n", na = "NA")
      # QQ plot
      library(qqman)
      # Import data
      adj_glm_KRN_4 <- read.csv("adj_p_GLM.csv", header = TRUE)
      # Plot QQ plots for GLM(PCA): raw, Bonferroni-adjusted and FDR-adjusted p-values
      par(mfrow = c(1, 3))
      qq(adj_glm_KRN_4$p, main = "non-adjusted P-value")
      qq(adj_glm_KRN_4$p_Bonferroni, main = "Bonferroni")
      qq(adj_glm_KRN_4$p_FDR, main = "FDR")
      par(mfrow = c(1, 1))
  • 59.
  • 60. The Hardy–Weinberg (HW) principle • Allele and genotype frequencies in a population will remain constant from generation to generation in the absence of other evolutionary influences. • These influences include non-random mating, mutation, selection, genetic drift, gene flow and meiotic drive. • Allele frequency: $f(A) = p$, $f(a) = q$ • Genotype frequency: $f(AA) = p^2$, $f(aa) = q^2$, $f(Aa) = 2pq$ • Both allele and genotype frequencies remain unchanged: Hardy-Weinberg equilibrium
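The genotype frequencies above can be checked with a few lines of R. The allele frequency 0.6 is an arbitrary value chosen for illustration.

```r
p <- 0.6          # frequency of allele A (hypothetical)
q <- 1 - p        # frequency of allele a

# Expected genotype frequencies under Hardy-Weinberg equilibrium.
expected <- c(AA = p^2, Aa = 2 * p * q, aa = q^2)

# Frequency of A in the next generation: all of AA plus half of Aa.
p_next <- expected[["AA"]] + expected[["Aa"]] / 2

sum(expected)     # the three genotype frequencies add up to 1
p_next            # the allele frequency is unchanged
```

The last two lines illustrate the equilibrium property: the genotype frequencies sum to one and the allele frequency recovered from them equals the starting value.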
  • 61. HW principle for two loci • First locus: A and a alleles; second locus: B and b alleles • Allele frequency: $P_A + P_a = 1$, $P_B + P_b = 1$ • Haplotype frequency: $P_{AB} = P_A P_B$, $P_{ab} = P_a P_b$, and so on • Haplotype frequencies approach equilibrium fastest under random mating if the two loci are on different chromosomes (with $c = 0.5$, D halves in every generation) • It takes more generations to reach equilibrium if the two loci are on the same chromosome • The lower the recombination rate between the two loci, the more generations it takes to leave linkage disequilibrium
  • 62. Linkage equilibrium and disequilibrium • Linkage equilibrium: haplotype frequencies in a population have the same value that they would have if the genes at each locus were combined at random. • Linkage disequilibrium: non-random association of alleles at different loci in a given population
  • 63. Linkage Disequilibrium Quantification • Linkage equilibrium: $P_{AB} = P_A P_B$ • D(ifference): $D = P_{AB} - P_A P_B$, which is 0 at equilibrium and non-zero at disequilibrium
  • 64. Linkage Disequilibrium (LD)
      Loci and allele          A      a      B      b
      Frequency                0.6    0.4    0.7    0.3

      Gametic type             AB     Ab     aB     ab
      Observed                 0.5    0.1    0.2    0.2
      Frequency (equilibrium)  0.42   0.18   0.28   0.12
      Difference               0.08   -0.08  -0.08  0.08

      $D = P_{AB} - P_A P_B = P_{ab} - P_a P_b = -(P_{Ab} - P_A P_b) = -(P_{aB} - P_a P_B)$
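  • 65. Lemma: $D = P_{AB} P_{ab} - P_{Ab} P_{aB}$
      Proof
      (1) $P_{AB} P_{ab} = (P_A P_B + D)(P_a P_b + D) = P_A P_B P_a P_b + P_A P_B D + P_a P_b D + D^2$
      (2) $P_{Ab} P_{aB} = (P_A P_b - D)(P_a P_B - D) = P_A P_b P_a P_B - P_A P_b D - P_a P_B D + D^2$
      Subtracting (2) from (1): $P_{AB} P_{ab} - P_{Ab} P_{aB} = D (P_A P_B + P_a P_b + P_A P_b + P_a P_B) = D$, because the four products sum to $(P_A + P_a)(P_B + P_b) = 1$.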
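The numbers in the LD example and the identity from the lemma can be verified numerically. The sketch below uses the observed gametic frequencies from the example; the object names are arbitrary.

```r
# Observed gametic (haplotype) frequencies from the LD example.
hap <- c(AB = 0.5, Ab = 0.1, aB = 0.2, ab = 0.2)

# Allele frequencies implied by the haplotype frequencies.
p_A <- hap[["AB"]] + hap[["Ab"]]; p_a <- 1 - p_A
p_B <- hap[["AB"]] + hap[["aB"]]; p_b <- 1 - p_B

# Frequencies expected at linkage equilibrium and their difference from observed.
expected <- c(AB = p_A * p_B, Ab = p_A * p_b, aB = p_a * p_B, ab = p_a * p_b)
difference <- hap - expected

# D from its definition and from the lemma; both give the same value.
D <- hap[["AB"]] - p_A * p_B
D_lemma <- hap[["AB"]] * hap[["ab"]] - hap[["Ab"]] * hap[["aB"]]
```

Both `D` and `D_lemma` evaluate to 0.08, and `difference` reproduces the last row of the table.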
  • 66. D depends on allele frequency • D varies even with complete LD • $P_{Ab} = P_{aB} = 0$ • $P_{AB} = 1 - P_{ab} = P_A = P_B$ • $D = P_A - P_A P_A$
  • 67. Property of D • Deviation between observed and expected • Extreme values: -0.25 and 0.25 • Non LD (equilibrium): D=0 • Dependency on allele frequency
  • 68. Modification of D: D' • Lewontin (1964) proposed standardizing D to the maximum possible value it can take: • $D' = D / D_{max}$ • $D_{max} = \max(-P_A P_B, -P_a P_b)$ if $D < 0$; $D_{max} = \min(P_A P_b, P_a P_B)$ if $D > 0$ • Range of D': 0 to 1
  • 69. Example
      Loci and allele          A      a      B      b
      Frequency                0.6    0.4    0.7    0.3

      Gametic type             AB     Ab     aB     ab
      Observed                 0.5    0.1    0.2    0.2
      Frequency (equilibrium)  0.42   0.18   0.28   0.12
      Difference               0.08   -0.08  -0.08  0.08

      • $D = P_{AB} - P_A P_B = 0.08$ • $D_{max} = \min(P_A P_b, P_a P_B) = \min(0.6 \times 0.3, 0.4 \times 0.7) = 0.18$ • $D' = D / D_{max} = 0.08 / 0.18 = 0.44$
  • 70. Example (switch A and a)
      Loci and allele          a      A      B      b
      Frequency                0.6    0.4    0.7    0.3

      Gametic type             aB     ab     AB     Ab
      Observed                 0.5    0.1    0.2    0.2
      Frequency (equilibrium)  0.42   0.18   0.28   0.12
      Difference               0.08   -0.08  -0.08  0.08

      • $D = P_{AB} - P_A P_B = -0.08$ • $D_{max} = \max(-P_A P_B, -P_a P_b) = \max(-0.4 \times 0.7, -0.6 \times 0.3) = -0.18$ • $D' = D / D_{max} = -0.08 / -0.18 = 0.44$
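The two worked examples can be reproduced with a small R function that follows the definition of D' given above. The function name and argument names are illustrative choices.

```r
# D' as defined above: D_max is min(P_A*P_b, P_a*P_B) when D > 0 and
# max(-P_A*P_B, -P_a*P_b) when D < 0.
d_prime <- function(p_AB, p_A, p_B) {
  p_a <- 1 - p_A
  p_b <- 1 - p_B
  D <- p_AB - p_A * p_B
  if (isTRUE(all.equal(D, 0))) return(0)   # equilibrium: no standardization needed
  d_max <- if (D > 0) min(p_A * p_b, p_a * p_B) else max(-p_A * p_B, -p_a * p_b)
  D / d_max
}

original <- d_prime(p_AB = 0.5, p_A = 0.6, p_B = 0.7)   # first example
switched <- d_prime(p_AB = 0.2, p_A = 0.4, p_B = 0.7)   # A and a switched
```

Both calls return about 0.44, which shows that D' does not depend on which allele is labelled A, even though the sign of D flips.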
  • 71. $r^2$ • Hill and Robertson (1968) proposed the following measure of linkage disequilibrium: • $r^2 (\Delta^2) = D^2 / (P_A P_B P_a P_b)$ • Squaring makes the value positive • The product of allele frequencies in the denominator creates a penalty for 50% allele frequency • Range: 0 to 1
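A short R sketch of this measure, applied to the example frequencies used earlier in the deck (P_AB = 0.5, P_A = 0.6, P_B = 0.7). The function name is an arbitrary choice.

```r
# r^2 = D^2 / (P_A * P_a * P_B * P_b)
r_squared <- function(p_AB, p_A, p_B) {
  D <- p_AB - p_A * p_B
  D^2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
}

r2_example  <- r_squared(p_AB = 0.5, p_A = 0.6, p_B = 0.7)   # about 0.127
r2_complete <- r_squared(p_AB = 0.5, p_A = 0.5, p_B = 0.5)   # complete LD at 50% frequency
```

For the same haplotype frequencies, r² (about 0.13) is much smaller than D' (0.44), which is why the two statistics are usually reported side by side.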
  • 72. Summary of LD statistics
                              P values                      D                 D'           r²
      Definition              Statistical test (e.g. χ²)    P_AB - P_A P_B    D/D_max      D²/(P_A P_B P_a P_b)
      Value at equilibrium    1                             0                 0            0
      Value at complete LD    0                             -0.25 or 0.25     1            1
      Disadvantages listed: dependency on allele frequency; penalty on neutral loci
  • 73. Causes of LD • Linkage • Mutation • Selection • Inbreeding • Genetic drift • Gene flow/admixture Spurious association True association
  • 74. Mutation and selection [Figure: haplotypes A____q and A____Q shown across Generation 1, Generation 2 and Generation 3; a mutation creates a new haplotype and selection then changes the haplotype proportions]
  • 75. Change in D over time • c: recombination rate • $D_t = D_0 (1 - c)^t$ • $t = \log(D_t / D_0) / \log(1 - c)$ • If c = 10%, it takes about 6.6 generations for D to be cut in half • Assuming 1 Mb = 1 cM, two SNPs 100 kb apart have c = 1% / 10 = 0.001 • It then takes 693 generations for D to be cut in half
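The half-life figures follow directly from the decay formula. The R sketch below evaluates it; the function names are illustrative.

```r
# D after t generations of random mating with recombination rate rec_rate.
d_after <- function(d0, rec_rate, t) d0 * (1 - rec_rate)^t

# Generations needed for D to fall to a given fraction of its starting value.
generations_to_fraction <- function(rec_rate, fraction = 0.5) {
  log(fraction) / log(1 - rec_rate)
}

half_life_10pct  <- generations_to_fraction(0.10)    # c = 10%
half_life_100kb  <- generations_to_fraction(0.001)   # two SNPs 100 kb apart
half_life_unlinked <- generations_to_fraction(0.5)   # loci on different chromosomes
```

With c = 0.5 the half-life is exactly one generation, so even unlinked loci approach equilibrium gradually rather than instantly.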
  • 76. Change in D over time [Figure: $D_t$ (0.00 to 0.25) plotted against generation t (0 to 50) for c = 0.01, 0.05, 0.1 and 0.25]
  • 77. Human out of Africa https://arstechnica.com/science/2015/12/the-human-migration-out-of-africa-left-its-mark-in-mutations/
  • 78. LD decay over distance
  • 79. HW equilibrium, Linkage equilibrium and Linkage disequilibrium [Diagram: a single locus relates to HWE ($P_{AA} = P^2$); multiple loci relate to LE ($P_{AB} = P_A P_B$) and LD ($P_{AB} \neq P_A P_B$); further labels: same chromosome, different chromosome, LD decay, association]
  • 80. Estimation and plotting of LD decay over distance from the LD results from TASSEL
  • 81.
  • 82.
  • 83. Code to plot in R using ggplot2:
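The plotting code from this slide did not survive in the transcript. As a stand-in, here is a hedged sketch of one common way to summarise LD decay: average r² within distance bins and plot it against distance. It uses base R graphics rather than ggplot2 so that it has no package dependencies. The data are made up, and the column names `dist_bp` and `r2` are assumptions that should be mapped to whatever the exported TASSEL LD table uses.

```r
# Toy LD table (hypothetical values): pairwise distance in bp and r^2.
ld <- data.frame(
  dist_bp = c(500, 900, 4000, 6000, 12000, 18000, 45000, 60000),
  r2      = c(0.90, 0.80, 0.55, 0.45, 0.30, 0.20, 0.10, 0.06)
)

# Average r^2 within distance bins to smooth the pairwise scatter.
breaks <- c(0, 1000, 10000, 20000, 100000)
ld$bin <- cut(ld$dist_bp, breaks = breaks)
decay <- aggregate(r2 ~ bin, data = ld, FUN = mean)

# Scatter of all pairs, with the binned means drawn at the bin midpoints.
mid <- (head(breaks, -1) + tail(breaks, -1)) / 2
plot(ld$dist_bp, ld$r2, log = "x",
     xlab = "Distance (bp)", ylab = expression(r^2))
lines(mid, decay$r2, type = "b")
```

The bin boundaries are arbitrary here; in practice they would be chosen to suit the marker density of the data set.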
  • 84. Principal Components Analysis (PCA) • An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D • Can be used to: • Reduce number of dimensions in data • Find patterns in high-dimensional data • Visualize data of high dimensionality • Example applications: • Face recognition • Image compression • Gene expression analysis
  • 85. Principal Components Analysis Ideas (PCA) • Does the data set 'span' the whole of d-dimensional space? • For a matrix of m samples x n genes, create a new covariance matrix of size n x n. • Transform some large number of variables into a smaller number of uncorrelated variables called principal components (PCs). • PCs are developed to capture as much of the variation in the data as possible
  • 86. Principal Component Analysis • See online tutorials such as http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf • [Figure: scatter of points on axes X1 and X2, with eigenvector axes Y1 and Y2 drawn through the cloud] • Note: Y1 is the first eigenvector, Y2 is the second. Y2 is ignorable. • Key observation: variance = largest!
  • 87. Principal Component Analysis: one attribute first • Question: how much spread is in the data along the axis? (distance to the mean) • Variance = (standard deviation)^2: $s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$ • Temperature: 42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30
  • 88. Now consider two dimensions • Covariance measures how X and Y vary together: $\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$ • cov(X,Y) = 0: uncorrelated • cov(X,Y) > 0: move in the same direction • cov(X,Y) < 0: move in opposite directions
      X = Temperature   Y = Humidity
      40                90
      40                90
      40                90
      30                90
      15                70
      15                70
      15                70
      30                90
      15                70
      30                70
      30                70
      30                90
      40                70
  • 89. More than two attributes: covariance matrix • Contains covariance values between all possible dimensions (= attributes): $C^{n \times n} = (c_{ij} \mid c_{ij} = \mathrm{cov}(Dim_i, Dim_j))$ • Example for three attributes (x, y, z): $C = \begin{pmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\ \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z) \end{pmatrix}$
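The variance and covariance definitions can be tried out on the temperature and humidity pairs listed in the two-dimension example. This is a minimal sketch; the variable names are arbitrary.

```r
# Temperature and humidity pairs as listed in the two-dimension example.
temperature <- c(40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40)
humidity    <- c(90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70)

n <- length(temperature)

# Covariance written out from the definition (n - 1 in the denominator).
cov_manual <- sum((temperature - mean(temperature)) *
                  (humidity - mean(humidity))) / (n - 1)

# Built-in equivalents: var() for one attribute, cov() for the full matrix.
C <- cov(cbind(temperature, humidity))
```

The diagonal of `C` holds the variances, the off-diagonal entries hold the covariance, and the positive sign indicates that temperature and humidity tend to move in the same direction in this data.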
  • 90. Eigenvalues & eigenvectors • Vectors x having the same direction as Ax are called eigenvectors of A (A is an n by n matrix). • In the equation $Ax = \lambda x$, $\lambda$ is called an eigenvalue of A. • Example: $\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}$
  • 91. Eigenvalues & eigenvectors • $Ax = \lambda x \Leftrightarrow (A - \lambda I)x = 0$ • How to calculate x and $\lambda$: • Calculate $\det(A - \lambda I)$, which yields a polynomial of degree n • Determine the roots of $\det(A - \lambda I) = 0$; the roots are the eigenvalues $\lambda$ • Solve $(A - \lambda I)x = 0$ for each $\lambda$ to obtain the eigenvectors x
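The 2 x 2 matrix from the eigenvector example (rows 2, 3 and 2, 1, with the vector (3, 2)) makes a convenient check in R. `eigen()` is base R; the object names are arbitrary.

```r
A <- matrix(c(2, 3,
              2, 1), nrow = 2, byrow = TRUE)
x <- c(3, 2)

# A %*% x gives (12, 8), which is 4 * x, so x is an eigenvector with eigenvalue 4.
Ax <- as.vector(A %*% x)

# eigen() solves det(A - lambda * I) = 0 and returns the matching eigenvectors.
e <- eigen(A)
e$values

# The first eigenvector returned is a rescaled copy of x (same ratio 3:2).
ratio <- e$vectors[1, 1] / e$vectors[2, 1]
```

The characteristic polynomial here is λ² - 3λ - 4, whose roots 4 and -1 are the two eigenvalues returned.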
  • 92. Principal components • First principal component (PC1) • The eigenvalue with the largest absolute value indicates that the data have the largest variance along its eigenvector, the direction along which there is greatest variation • Second principal component (PC2) • The direction with maximum variation left in the data, orthogonal to PC1 • In general, only a few directions manage to capture most of the variability in the data.
  • 93. Steps of PCA • Let $\bar{X}$ be the mean vector (taking the mean of all rows) • Adjust the original data by the mean: $X' = X - \bar{X}$ • Compute the covariance matrix C of the adjusted X • Find the eigenvectors and eigenvalues of C • For matrix C, vectors e (= column vectors) having the same direction as Ce are its eigenvectors: e such that $Ce = \lambda e$, where $\lambda$ is called an eigenvalue of C • $Ce = \lambda e \Leftrightarrow (C - \lambda I)e = 0$ • Most data mining packages do this for you.
  • 94. Eigenvalues • Calculate eigenvalues $\lambda$ and eigenvectors x for the covariance matrix • Eigenvalues $\lambda_j$ are used to calculate the percentage of total variance ($V_j$) for each component j: $V_j = 100 \cdot \frac{\lambda_j}{\sum_{x=1}^{n} \lambda_x}$
  • 95. Principal components - Variance [Figure: bar chart of Variance (%) for PC1 to PC10, y-axis from 0 to 25]
  • 96. Transformed Data • Eigenvalue $\lambda_j$ corresponds to the variance on component j • Thus, sort by $\lambda_j$ • Take the first p eigenvectors $e_i$, where p is the number of top eigenvalues • These are the directions with the largest variances: $\begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{ip} \end{pmatrix} = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_p \end{pmatrix} \begin{pmatrix} x_{i1} - \bar{x}_1 \\ x_{i2} - \bar{x}_2 \\ \vdots \\ x_{in} - \bar{x}_n \end{pmatrix}$
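The PCA steps listed above (centre, covariance, eigen-decomposition, percentage of variance, projection) can be sketched in base R as follows. The data matrix is hypothetical, with samples in rows and variables in columns, and the comparison with `prcomp()` is included only as a sanity check.

```r
# Hypothetical data: 6 samples (rows) by 3 variables (columns).
X <- cbind(x1 = c(2, 4, 6, 8, 10, 12),
           x2 = c(1, 3, 2, 5, 4, 6),
           x3 = c(9, 7, 8, 4, 5, 3))

X_centered <- sweep(X, 2, colMeans(X))   # subtract each variable's mean
C <- cov(X_centered)                     # covariance matrix
e <- eigen(C)                            # eigenvalues come sorted, largest first

# Percentage of total variance captured by each component.
V <- 100 * e$values / sum(e$values)

# Transformed data: project the centred samples onto the first p eigenvectors.
p <- 2
scores <- X_centered %*% e$vectors[, 1:p]

# Sanity check against the built-in PCA (component signs may differ).
pc <- prcomp(X)
```

Because eigenvectors are only defined up to sign, the columns of `scores` can differ from `pc$x` by a factor of -1 while describing the same axes.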
  • 97. An Example
      X1    X2    X1'     X2'
      19    63    -5.1    9.25
      39    74    14.9    20.25
      30    87    5.9     33.25
      30    23    5.9     -30.75
      15    35    -9.1    -18.75
      15    43    -9.1    -10.75
      15    32    -9.1    -21.75
      Mean1 = 24.1, Mean2 = 53.8
      [Figures: scatter plots of the original data (X1, X2) and of the mean-adjusted data (X1', X2')]
  • 98. Covariance Matrix • $C = \begin{pmatrix} 75 & 106 \\ 106 & 482 \end{pmatrix}$ • Using MATLAB, we find out: • Eigenvectors: • $e_1 = (-0.98, -0.21)$, $\lambda_1 = 51.8$ • $e_2 = (0.21, -0.98)$, $\lambda_2 = 560.2$ • Thus the second eigenvector is more important!
  • 99. If we only keep one dimension: e2 • We keep the dimension of $e_2 = (0.21, -0.98)$ • We can obtain the final data as $y_i = \begin{pmatrix} 0.21 & -0.98 \end{pmatrix} \begin{pmatrix} x_{i1} \\ x_{i2} \end{pmatrix} = 0.21\,x_{i1} - 0.98\,x_{i2}$ • $y_i$: -10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63 • [Figure: the $y_i$ values plotted along a single axis]
  • 100.
  • 101.
  • 102.
  • 103. PCA –> Original Data • Retrieving old data (e.g. in data compression) • RetrievedRowData = (RowFeatureVector^T x FinalData) + OriginalMean • Yields original data using the chosen components
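The retrieval formula can be illustrated in R. Note one assumption about layout: the slide writes the formula for data stored with variables in rows, whereas the sketch below keeps samples in rows (the usual R convention), so the multiplication order is transposed. The data are hypothetical.

```r
# Hypothetical data: 6 samples (rows) by 2 variables (columns).
X <- cbind(x1 = c(2, 4, 6, 8, 10, 12),
           x2 = c(1, 3, 2, 5, 4, 6))
mu <- colMeans(X)
E <- eigen(cov(X))$vectors          # columns are the eigenvectors

# Keep all components: the original data are recovered exactly.
final_all <- sweep(X, 2, mu) %*% E
restored  <- sweep(final_all %*% t(E), 2, mu, "+")

# Keep only the first component: the result is an approximation.
E1 <- E[, 1, drop = FALSE]
final_one <- sweep(X, 2, mu) %*% E1
approx    <- sweep(final_one %*% t(E1), 2, mu, "+")

max_error <- max(abs(approx - X))
```

`max_error` measures what is lost by dropping the second component; it is zero only if the data lie exactly on a line.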
  • 104. Principal components • General about principal components • summary variables • linear combinations of the original variables • uncorrelated with each other • capture as much of the original variance as possible
  • 105. Applications – Gene expression analysis • Reference: Raychaudhuri et al. (2000) • Purpose: Determine core set of conditions for useful gene comparison • Dimensions: conditions, observations: genes • Yeast sporulation dataset (7 conditions, 6118 genes) • Result: Two components capture most of variability (90%) • Issues: uneven data intervals, data dependencies • PCA is common prior to clustering • Crisp clustering questioned: genes may correlate with multiple clusters • Alternative: determination of gene's closest neighbours
  • 106. Two Way (Angle) Data Analysis • Sample space analysis: gene expression matrix of $10^3$–$10^4$ genes by $10^1$–$10^2$ samples • Gene space analysis: gene expression matrix of $10^1$–$10^2$ conditions by $10^3$–$10^4$ genes
  • 108. PCA on all Genes • Leukemia data, precursor B and T • Plot of 34 patients, dimension of 8973 genes reduced to 2
  • 109. PCA on 100 top significant genes • Leukemia data, precursor B and T • Plot of 34 patients, dimension of 100 genes reduced to 2
  • 110. PCA of genes (Leukemia data) • Plot of 8973 genes, dimension of 34 patients reduced to 2
  • 112.
  • 113.
  • 114. For genotype & phenotype practice data • https://avikarn.com/image/gwas/sample_Genotype-Chr-1_data_forTASSEL.hmp.txt • https://avikarn.com/image/gwas/sample_Phenotype_data_forTASSEL.txt • https://www.panzea.org/ • http://zzlab.net/GAPIT/GAPIT_Tutorial_Data.zip
  • 115. Another way to run MLM: from the STRUCTURE tool to the TASSEL tool
  • 116.
  • 117.
  • 118.
  • 119.
  • 120. Generally, the burn-in period is set to 10000 and MCMC Reps to 100000
  • 121.
  • 122.
  • 123.
  • 124.
  • 125.
  • 127.
  • 128.
  • 129.
  • 130.
  • 133. From the sequence data we make the kinship file (centered_IBS_Filtered)
  • 134. We do an intersect join of the genotype data, phenotype data and Q-matrix
  • 135.
  • 136.
  • 137. The resulting MLM statistics file can be analysed with a Manhattan plot and a QQ plot
  • 138. QQ Plot Interpretation • This plot provides information on two main aspects of GWAS data: whether the statistical testing is well controlled for challenges such as population stratification, and whether there is any association. • QQ plots compare the p-values expected when testing for association with those actually observed. • Each dot represents a SNP • X-axis: expected –log10(p) • Y-axis: observed –log10(p) • The red line shows the pattern of –log10(p) values if no SNP has a significant genetic association with the trait. • When there are significant associations between SNP markers and the trait, the SNP dots (blue) rise off the line
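To make the two axes concrete, here is a base R sketch of how the expected and observed values of a QQ plot can be computed. The p-values are simulated (mostly null, plus five small values standing in for associated SNPs), so the data and object names are assumptions for illustration only.

```r
# Hypothetical p-values standing in for the p column of a GLM/MLM statistics file.
set.seed(1)
p <- c(runif(995), 1e-8, 5e-8, 1e-7, 5e-7, 1e-6)

# Y-axis: observed -log10(p), ordered from most to least significant.
observed <- -log10(sort(p))

# X-axis: -log10 of the p-values expected if no SNP were associated.
expected <- -log10(ppoints(length(p)))

plot(expected, observed,
     xlab = "Expected -log10(p)", ylab = "Observed -log10(p)")
abline(0, 1, col = "red")   # line the dots follow when there is no association
```

In this simulated case most dots sit on the red line, while the five small p-values rise above it at the right-hand end.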
  • 139. Manhattan plot Interpretation • A scatter plot used to show which chromosomes carry significantly associated SNPs, based on their p-values. • Genomic coordinates are displayed on the x-axis and the negative logarithm of the association p-value for each SNP on the y-axis. • Each dot signifies a SNP • The strongest associations have the smallest p-values (e.g. $10^{-15}$)
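The layout of a Manhattan plot can be sketched in base R by placing chromosomes end to end along the x-axis. The result table below is hypothetical; its column names follow the Marker/Chr/Pos/p convention used in the R code earlier in the deck, and the 0.05 significance level is an assumption.

```r
# Hypothetical association results: chromosome, position (bp) and p-value.
res <- data.frame(
  Chr = c(1, 1, 1, 2, 2, 3),
  Pos = c(100, 5000, 9000, 200, 7000, 300),
  p   = c(0.20, 1e-9, 0.03, 0.50, 0.004, 0.60)
)

# Offset each chromosome by the combined length of the chromosomes before it,
# so that all markers line up along one genome-wide x-axis.
chr_len <- tapply(res$Pos, res$Chr, max)
offset  <- c(0, cumsum(chr_len)[-length(chr_len)])
names(offset) <- names(chr_len)
res$x <- unname(res$Pos + offset[as.character(res$Chr)])

plot(res$x, -log10(res$p), col = res$Chr, pch = 19,
     xlab = "Genomic position", ylab = "-log10(p)")
abline(h = -log10(0.05 / nrow(res)), lty = 2)   # Bonferroni line, alpha = 0.05 assumed

# The strongest association is the marker with the smallest p-value.
top <- res[which.min(res$p), ]
```

Packages such as qqman provide a ready-made `manhattan()` function; the manual version is shown only to make the axis construction explicit.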