Genome-wide association studies (GWAS)
1. Genome-Wide Association Studies (GWAS) Using TASSEL 5.0
Dr. Amit Joshi
HOD-Department of Biochemistry
Kalinga University
2. GWAS INTRODUCTION
• In genomics, a genome-wide association study (GWA study, or
GWAS), is an observational study of a genome-wide set of
genetic variants in different individuals to see if any variant is
associated with a trait.
• GWA studies typically focus on associations between single-
nucleotide polymorphisms (SNPs) and traits like major human
diseases, but can equally be applied to any other genetic variants
and any other organisms.
• Software commonly used for conducting GWAS includes STRUCTURE, PLINK, TASSEL, and R commands (e.g. run from RStudio).
8. Preparing the Input files
A. Phenotype file
• Prepare the phenotype file as shown below in the figure
• Note: if your data include covariates such as sex, age, or treatment, categorize those columns under the header name factor.
9. B. Genotype file
• TASSEL accepts several genotype file formats, such as VCF (variant call format), .hmp.txt (HapMap), and PLINK. This tutorial uses the hmp.txt version of the genotype file; a screenshot of that file is shown below.
10. Importing phenotype and genotype files
• Import the files by following the steps shown below. Tip: both files can be opened at the same time by holding CTRL and clicking the file names.
17. Genotype summary analysis
• The next crucial step is to examine the genotype data by following the steps shown. A couple of key things to look at are:
• Minor allele frequency distribution
• Missing genotype data, to see whether imputation is required
• Proportion of heterozygous calls in the samples, to check for selfed samples
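The three summaries listed above can also be computed by hand. The following is a minimal R sketch, assuming genotypes are stored as a samples x markers matrix coded 0/1/2 (copies of the alternate allele) with NA for missing calls; the toy matrix and all object names are hypothetical and are not TASSEL output.

```r
# Toy genotype matrix: rows = samples, columns = markers,
# coded as copies of the alternate allele (0, 1, 2); NA = missing call.
geno <- matrix(c(0, 1,  2, NA,
                 0, 0,  2, 1,
                 1, 0,  2, 2,
                 0, NA, 2, 1),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("S", 1:4), paste0("M", 1:4)))

# Per-marker alternate-allele frequency and minor allele frequency (MAF)
alt_freq <- colMeans(geno, na.rm = TRUE) / 2
maf <- pmin(alt_freq, 1 - alt_freq)

# Per-marker proportion of missing calls
missing_rate <- colMeans(is.na(geno))

# Per-sample proportion of heterozygous calls among non-missing genotypes
het_rate <- rowMeans(geno == 1, na.rm = TRUE)
```

A histogram of `maf` gives the minor allele frequency distribution, and unusually low values of `het_rate` point to samples that may be selfed.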
22. Filter genotypes based on call rate
• Steps to filter the genotypes based on call rate and heterozygosity level are shown below:
• In the video, genotypes were filtered using the following parameters:
• 90% minimum sites present
• 5% minimum heterozygosity
• 100% maximum heterozygosity
26. Filter markers based on read depth and minor and major allele frequency (MAF)
• Steps to filter markers based on read depth and minor and major allele frequency (MAF) are shown below:
• In the video, markers were filtered using the following parameters:
• Minimum count of 100 out of 545 sequences (it is the number of times a particular allele was seen for that locus)
• 0.05 minor allele frequency (threshold set to filter out rare alleles)
• 0.95 major allele frequency (threshold set to remove monomorphic markers)
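The sample and marker filters described above can be sketched in R as follows. This is an illustration only, assuming a samples x markers matrix coded 0/1/2 with NA for missing calls; the toy data and the 0.5 sample call-rate threshold are hypothetical choices for this small example (the slides use 90%), and the code does not reproduce TASSEL's own filter dialogs.

```r
# Toy genotype matrix (samples x markers), coded 0/1/2 with NA for missing.
geno <- matrix(c(0,  1,  2, 1,
                 0,  NA, 2, 1,
                 1,  0,  2, NA,
                 NA, NA, 2, NA,
                 0,  1,  2, 2),
               nrow = 5, byrow = TRUE,
               dimnames = list(paste0("S", 1:5), paste0("M", 1:4)))

min_call_rate <- 0.5   # hypothetical sample threshold for this toy data
min_maf <- 0.05        # minor allele frequency threshold from the slide

# 1. Keep samples whose proportion of non-missing calls meets the threshold
call_rate <- rowMeans(!is.na(geno))
geno_s <- geno[call_rate >= min_call_rate, , drop = FALSE]

# 2. Keep markers whose minor allele frequency meets the threshold
alt_freq <- colMeans(geno_s, na.rm = TRUE) / 2
maf <- pmin(alt_freq, 1 - alt_freq)
geno_f <- geno_s[, maf >= min_maf, drop = FALSE]
```

In this toy data sample S4 is dropped for its low call rate and marker M3 is dropped because it is monomorphic.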
29. Conduct GWAS analysis: Multidimensional scaling (MDS)
• MDS output can be used as a covariate in the GLM or MLM to correct for population structure. Please follow the steps shown below:
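To show what the MDS step produces, here is a minimal R sketch using classical MDS from base R (`dist` and `cmdscale`). It assumes a 0/1/2 genotype matrix and Euclidean distances between samples; TASSEL computes its own distance matrix, so the numbers will not match TASSEL's output, and the toy data are hypothetical.

```r
# Toy genotype matrix (samples x markers), coded 0/1/2: two loose groups.
geno <- matrix(c(0, 0, 1, 2, 2,
                 0, 1, 0, 2, 2,
                 0, 0, 0, 2, 1,
                 2, 2, 1, 0, 0,
                 2, 1, 2, 0, 0,
                 2, 2, 2, 1, 0),
               nrow = 6, byrow = TRUE,
               dimnames = list(paste0("S", 1:6), paste0("M", 1:5)))

# Pairwise distances between samples (Euclidean here, as an assumption)
d <- dist(geno)

# Classical (metric) MDS; keep the first two axes as candidate covariates
mds <- cmdscale(d, k = 2)
colnames(mds) <- c("MDS1", "MDS2")
```

Samples from the same group receive similar coordinates on the first axis, which is why these axes can stand in for population structure in the association model.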
34. Intersecting the files
• Intersect the genotype, phenotype and MDS files by following the
steps below:
37. Running the General Linear Model (GLM)
• Run the GLM analysis by selecting the intersected files and following the steps below:
43. The output of the GLM analysis is produced under the Result node. The GLM association test can be evaluated by plotting the Q-Q plot and the Manhattan plot as shown below.
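Conceptually, a GLM scan tests one marker at a time in a linear model that also contains the structure covariates. The sketch below illustrates that idea with base R's `lm`, assuming an additive 0/1/2 marker coding and a single covariate axis; the data and object names are hypothetical, and TASSEL's internal implementation may differ in detail.

```r
# Toy data: 8 samples, one covariate axis (e.g. an MDS axis) and 3 markers.
pheno <- c(10.1, 11.9, 14.2, 10.4, 12.1, 13.8, 9.8, 14.1)
mds1  <- c(-0.5, 0.1, 0.4, -0.3, 0.2, 0.5, -0.6, 0.2)
geno  <- cbind(M1 = c(0, 1, 2, 0, 1, 2, 0, 2),
               M2 = c(1, 0, 1, 2, 0, 1, 2, 0),
               M3 = c(0, 0, 1, 1, 2, 2, 1, 0))

# Fit phenotype ~ covariate + marker for each marker; row 3 of the
# coefficient table is the marker term, so keep its p-value.
marker_p <- sapply(colnames(geno), function(m) {
  fit <- lm(pheno ~ mds1 + geno[, m])
  coef(summary(fit))[3, "Pr(>|t|)"]
})
```

The resulting vector of per-marker p-values is the quantity that the Q-Q and Manhattan plots display.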
44. Mixed Linear Model (MLM)
• Calculating the kinship matrix
• Follow the steps below to calculate the kinship matrix:
51. Running the Mixed Linear Model (MLM)
• The MLM includes the PCA and the kinship matrix, i.e. MLM(PCA+K).
• Therefore, once the kinship matrix has been calculated, the MLM can be run by following the steps below:
56. Determine GWAS Significance Threshold
• A Bonferroni threshold can be used to identify markers significantly associated with the trait:
• $p_{threshold} = \alpha / N$
• where $\alpha$ is the chosen family-wise significance level (commonly 0.05) and N is the total number of markers tested in the association analysis.
• Alternatively, a False Discovery Rate (FDR) correction can be applied, which is less stringent than controlling the family-wise error rate.
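As a quick worked example of the Bonferroni threshold, the R sketch below assumes a significance level of 0.05 and a hypothetical total of 50,000 tested markers; both numbers are placeholders to be replaced with the values from your own analysis.

```r
alpha <- 0.05       # family-wise significance level (a conventional choice)
n_markers <- 50000  # hypothetical number of markers tested

# Bonferroni threshold and its height on the -log10(p) scale
bonferroni_threshold <- alpha / n_markers
threshold_log10 <- -log10(bonferroni_threshold)

# Flag markers passing the threshold in a toy vector of p-values
p <- c(3e-9, 2e-6, 4e-4, 0.2)
significant <- p < bonferroni_threshold
```

Here the threshold is 1e-06 (6 on the -log10 scale), so only the first toy p-value is declared significant.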
57. Adjust P-Values for Multiple Comparisons: Bonferroni and False Discovery Rate
• Given the output from the GLM and/or MLM analysis, one can calculate adjusted p-values using one of the frequently used multiple-comparison methods, Bonferroni or False Discovery Rate (FDR), in R using the code below.
58.
# Import the GLM or MLM stats file (tab-delimited)
glm_stats <- read.table("GLMstats.txt", header = TRUE, sep = "\t")
# Check data
head(glm_stats)
# Load dplyr for the pipe (%>%) and transmute()
library(dplyr)
# Calculate Bonferroni-adjusted and FDR-adjusted p-values
adj_glm <- glm_stats %>%
  transmute(Marker, Chr, Pos, p,
            p_Bonferroni = p.adjust(p, method = "bonferroni"),
            p_FDR = p.adjust(p, method = "fdr"))
# Inspect the result (opens a viewer in interactive sessions)
View(adj_glm)
# Save the result to a file
write.csv(adj_glm, file = "adj_p_GLM.csv", quote = TRUE, eol = "\n", na = "NA")
# QQ plots
library(qqman)
# Import data
adj_glm_KRN_4 <- read.csv("adj_p_GLM.csv", header = TRUE)
# Plot QQ plots for GLM (PCA): three panels side by side
par(mfrow = c(1, 3))
qq(adj_glm_KRN_4$p, main = "non-adjusted P-value")
qq(adj_glm_KRN_4$p_Bonferroni, main = "Bonferroni")
qq(adj_glm_KRN_4$p_FDR, main = "FDR")
# Restore the single-panel layout
par(mfrow = c(1, 1))
60. The Hardy–Weinberg (HW) principle
• Allele and genotype frequencies in a population will remain constant from generation to generation in the absence of other evolutionary influences.
• These influences include non-random mating, mutation, selection, genetic drift, gene flow and meiotic drive.
• Allele frequency: $f(A) = p$, $f(a) = q$
• Genotype frequency: $f(AA) = p^2$, $f(aa) = q^2$, $f(Aa) = 2pq$
• When both allele and genotype frequencies remain unchanged, the population is in Hardy–Weinberg equilibrium
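The genotype frequencies above can be turned into a simple check of Hardy–Weinberg proportions. The following R sketch uses hypothetical genotype counts and a chi-square goodness-of-fit test with one degree of freedom; it is one common way to test HWE, not a step taken from the TASSEL workflow.

```r
# Observed genotype counts at one biallelic locus (hypothetical numbers)
obs <- c(AA = 360, Aa = 480, aa = 160)
n <- sum(obs)

# Allele frequencies: each AA carries two A alleles, each Aa carries one
p <- unname((2 * obs["AA"] + obs["Aa"]) / (2 * n))
q <- 1 - p

# Expected genotype counts under Hardy-Weinberg proportions
expected <- n * c(AA = p^2, Aa = 2 * p * q, aa = q^2)

# Chi-square goodness-of-fit statistic; 1 degree of freedom because
# one parameter (p) is estimated from the data
chi_sq <- sum((obs - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)
```

These toy counts sit exactly at HW proportions (p = 0.6), so the statistic is essentially zero; real data would show some deviation.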
61. HW principle for two loci
• First locus: A and a alleles; second locus: B and b alleles
• Allele frequency: $P_A + P_a = 1$, $P_B + P_b = 1$
• Haplotype frequency at equilibrium: $P_{AB} = P_A P_B$, $P_{ab} = P_a P_b$, and so on
• Haplotype frequencies approach equilibrium quickly under random mating if the two loci are on different chromosomes (with c = 0.5, D is halved every generation)
• It takes many more generations to reach equilibrium if the two loci are on the same chromosome
• The lower the recombination rate between the two loci, the more generations it takes to leave linkage disequilibrium
62. Linkage equilibrium and Disequilibrium
Linkage equilibrium: haplotype frequencies in a population
have the same value that they would have if the genes at each
locus were combined at random.
Linkage disequilibrium: Non-random association of alleles at
different loci in a given population
64. Linkage Disequilibrium (LD)
Locus and allele        A      a      B      b
Frequency               0.6    0.4    0.7    0.3

Gametic type            AB     Ab     aB     ab
Observed                0.50   0.10   0.20   0.20
Equilibrium frequency   0.42   0.18   0.28   0.12
Difference              0.08  -0.08  -0.08   0.08

$D = P_{AB} - P_A P_B = P_{ab} - P_a P_b = -(P_{Ab} - P_A P_b) = -(P_{aB} - P_a P_B)$
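The table can be reproduced with a few lines of R. This sketch starts from the observed haplotype frequencies on the slide; the object names are illustrative.

```r
# Observed haplotype (gametic) frequencies from the table
hap <- c(AB = 0.5, Ab = 0.1, aB = 0.2, ab = 0.2)

# Allele frequencies are marginal sums of the haplotype frequencies
pA <- unname(hap["AB"] + hap["Ab"]); pa <- 1 - pA
pB <- unname(hap["AB"] + hap["aB"]); pb <- 1 - pB

# Frequencies expected at linkage equilibrium, and the deviations from them
expected <- c(AB = pA * pB, Ab = pA * pb, aB = pa * pB, ab = pa * pb)
difference <- hap - expected

# D as defined on the slide
D <- unname(hap["AB"] - pA * pB)
# Equivalent form: product of coupling minus product of repulsion haplotypes
D_alt <- unname(hap["AB"] * hap["ab"] - hap["Ab"] * hap["aB"])
```

Both `D` and `D_alt` evaluate to 0.08, and `difference` reproduces the last row of the table.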
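65. Lemma
$D = P_{AB} P_{ab} - P_{Ab} P_{aB}$
Proof
(1): $P_{AB} P_{ab} = (P_A P_B + D)(P_a P_b + D) = P_A P_B P_a P_b + P_A P_B D + P_a P_b D + D^2$
(2): $P_{Ab} P_{aB} = (P_A P_b - D)(P_a P_B - D) = P_A P_b P_a P_B - P_A P_b D - P_a P_B D + D^2$
Subtracting (2) from (1), the first and last terms cancel: $P_{AB} P_{ab} - P_{Ab} P_{aB} = D (P_A P_B + P_a P_b + P_A P_b + P_a P_B) = D (P_A + P_a)(P_B + P_b) = D$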
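66. D depends on allele frequency
• D varies with allele frequency even under complete LD
• Complete LD: $P_{Ab} = P_{aB} = 0$
• Then $P_{AB} = 1 - P_{ab} = P_A = P_B$
• So $D = P_A - P_A P_A = P_A (1 - P_A)$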
67. Properties of D
• Deviation between observed and expected haplotype frequencies
• Extreme values: -0.25 and 0.25
• No LD (equilibrium): D = 0
• Depends on allele frequency
68. Modification of D: D'
• Lewontin (1964) proposed standardizing D by the maximum possible value it can take:
• $D' = D / D_{max}$
• $D_{max} = \begin{cases} \max(-P_A P_B,\ -P_a P_b) & \text{if } D < 0 \\ \min(P_A P_b,\ P_a P_B) & \text{if } D > 0 \end{cases}$
• Range of D': 0 to 1
69. Example
Locus and allele        A      a      B      b
Frequency               0.6    0.4    0.7    0.3

Gametic type            AB     Ab     aB     ab
Observed                0.50   0.10   0.20   0.20
Equilibrium frequency   0.42   0.18   0.28   0.12
Difference              0.08  -0.08  -0.08   0.08

• $D = P_{AB} - P_A P_B = 0.08$
• $D_{max} = \min(P_A P_b,\ P_a P_B) = \min(0.6 \times 0.3,\ 0.4 \times 0.7) = 0.18$
• $D' = D / D_{max} = 0.08 / 0.18 = 0.44$
70. Example (switch A and a)
Locus and allele        a      A      B      b
Frequency               0.6    0.4    0.7    0.3

Gametic type            aB     ab     AB     Ab
Observed                0.50   0.10   0.20   0.20
Equilibrium frequency   0.42   0.18   0.28   0.12
Difference              0.08  -0.08  -0.08   0.08

• $D = P_{AB} - P_A P_B = -0.08$
• $D_{max} = \max(-P_A P_B,\ -P_a P_b) = \max(-0.4 \times 0.7,\ -0.6 \times 0.3) = -0.18$
• $D' = D / D_{max} = -0.08 / -0.18 = 0.44$
71. r²
• Hill and Robertson (1968) proposed the following measure of linkage disequilibrium:
• $r^2 = \Delta^2 = D^2 / (P_A P_B P_a P_b)$
• Squaring makes the measure non-negative
• The product of allele frequencies in the denominator creates a penalty for allele frequencies near 50%
• Range: 0 to 1
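The two worked examples and the r² definition can be checked with a small helper. The function `ld_stats` below is a hypothetical name introduced for this sketch; it takes the AB haplotype frequency and the A and B allele frequencies and applies the formulas from the preceding slides.

```r
ld_stats <- function(pAB, pA, pB) {
  D <- pAB - pA * pB
  # Lewontin's D_max depends on the sign of D
  d_max <- if (D < 0) {
    max(-pA * pB, -(1 - pA) * (1 - pB))
  } else {
    min(pA * (1 - pB), (1 - pA) * pB)
  }
  d_prime <- if (D == 0) 0 else D / d_max
  r2 <- D^2 / (pA * (1 - pA) * pB * (1 - pB))
  c(D = D, D_prime = d_prime, r2 = r2)
}

original <- ld_stats(pAB = 0.5, pA = 0.6, pB = 0.7)  # labelling of slide 69
switched <- ld_stats(pAB = 0.2, pA = 0.4, pB = 0.7)  # A and a switched (slide 70)
```

Switching the allele labels flips the sign of D (0.08 versus -0.08) but leaves D' (about 0.44) and r² (about 0.13) unchanged.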
72. Summary of LD statistics
Statistic              P value                       D                                D'          r²
Definition             Statistical test (e.g. X²)    P_AB - P_A P_B                   D / D_max   D² / (P_A P_B P_a P_b)
Value at equilibrium   1                             0                                0           0
Value at complete LD   0                             -0.25 or 0.25                    1           1
Disadvantage                                         Dependency on allele frequency               Penalty on neutral loci
73. Causes of LD
• Linkage
• Mutation
• Selection
• Inbreeding
• Genetic drift
• Gene flow/admixture
Spurious association
True association
75. Change in D over time
• c: recombination rate
• $D_t = D_0 (1 - c)^t$
• $t = \log(D_t / D_0) / \log(1 - c)$
• If c = 10%, it takes about 6.6 generations for D to be cut in half
• Assuming 1 Mb = 1 cM, two SNPs 100 kb apart have c = 1% / 10 = 0.001
• It then takes about 693 generations for D to be cut in half
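The half-life figures follow directly from the decay formula. A minimal R sketch, with `generations_to_fraction` and `rec_rate` as hypothetical names and D0 = 0.25 as an arbitrary starting value:

```r
# Generations needed for D to fall to a given fraction of its starting value,
# from D_t = D_0 * (1 - rec_rate)^t
generations_to_fraction <- function(rec_rate, fraction = 0.5) {
  log(fraction) / log(1 - rec_rate)
}

half_life_10pct <- generations_to_fraction(0.10)   # about 6.6 generations
half_life_100kb <- generations_to_fraction(0.001)  # about 693 generations

# Decay curve over 50 generations for c = 0.1, starting from D0 = 0.25
t <- 0:50
D_t <- 0.25 * (1 - 0.10)^t
```

Plotting `D_t` against `t` for several recombination rates gives the family of decay curves shown on the next slide.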
76. Change in D over time
[Figure: $D_t$ (0 to 0.25) plotted against generation t (0 to 50) for c = 0.01, 0.05, 0.1 and 0.25]
77. Human out of Africa
https://arstechnica.com/science/2015/12/the-human-migration-out-of-africa-left-its-mark-in-mutations/
79. HW equilibrium, Linkage equilibrium and Linkage disequilibrium
[Diagram labels]
• Single locus: HWE, $P_{AA} = p^2$
• Multiple loci: LE, $P_{AB} = P_A P_B$
• Multiple loci: LD, $P_{AB} \neq P_A P_B$
• LD decay: back to $P_{AB} = P_A P_B$
• Same chromosome / different chromosome
• Association
84. Principal Components Analysis (PCA)
• An exploratory technique used to reduce the
dimensionality of the data set to 2D or 3D
• Can be used to:
• Reduce number of dimensions in data
• Find patterns in high-dimensional data
• Visualize data of high dimensionality
• Example applications:
• Face recognition
• Image compression
• Gene expression analysis
85. Principal Components Analysis Ideas (PCA)
• Does the data set 'span' the whole of d-dimensional space?
• For a matrix of m samples x n genes, create a new covariance matrix of size n x n.
• Transform some large number of variables into a smaller number of uncorrelated variables called principal components (PCs).
• PCs are constructed to capture as much of the variation in the data as possible
86. Principal Component Analysis
• See online tutorials such as http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
• Note: Y1 is the first eigenvector, Y2 is the second. Y2 is ignorable.
• Key observation: the variance along Y1 is the largest!
[Figure: scatter of data points on axes X1 and X2, with the principal axes Y1 and Y2 overlaid]
87. Principal Component Analysis: one attribute first
• Question: how much spread is in the data along the axis? (distance to the mean)
• Variance = (standard deviation)^2
• $s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
• Temperature: 42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30
88. Now consider two dimensions
X = Temperature   Y = Humidity
40   90
40   90
40   90
30   90
15   70
15   70
15   70
30   90
15   70
30   70
30   70
30   90
40   70
• $cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
• Covariance measures how X and Y vary together
• cov(X,Y) = 0: X and Y are uncorrelated (independent variables have zero covariance)
• cov(X,Y) > 0: X and Y move in the same direction
• cov(X,Y) < 0: X and Y move in opposite directions
89. More than two attributes: covariance matrix
• Contains covariance values between all possible dimensions (= attributes):
$C^{n \times n} = (c_{ij} \mid c_{ij} = cov(Dim_i, Dim_j))$
• Example for three attributes (x, y, z):
$C = \begin{pmatrix} cov(x,x) & cov(x,y) & cov(x,z) \\ cov(y,x) & cov(y,y) & cov(y,z) \\ cov(z,x) & cov(z,y) & cov(z,z) \end{pmatrix}$
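The covariance formula and the covariance matrix can be checked in R on the temperature and humidity values from the two-dimensional slide. This is a small sketch; the variable names are illustrative.

```r
temperature <- c(40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40)
humidity    <- c(90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70)
n <- length(temperature)

# Covariance from the definition (divisor n - 1) ...
cov_manual <- sum((temperature - mean(temperature)) *
                  (humidity - mean(humidity))) / (n - 1)
# ... and from the built-in function
cov_builtin <- cov(temperature, humidity)

# Covariance matrix of both attributes; the variances sit on the diagonal
C <- cov(cbind(temperature, humidity))
```

The covariance is positive here, meaning temperature and humidity tend to move in the same direction in this data set.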
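90. Eigenvalues & eigenvectors
• Vectors x having the same direction as Ax are called eigenvectors of A (A is an n by n matrix).
• In the equation $Ax = \lambda x$, $\lambda$ is called an eigenvalue of A.
$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}$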
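91. Eigenvalues & eigenvectors
• $Ax = \lambda x \Leftrightarrow (A - \lambda I)x = 0$
• How to calculate x and $\lambda$:
• Calculate $\det(A - \lambda I)$, which yields a polynomial of degree n
• Determine the roots of $\det(A - \lambda I) = 0$; the roots are the eigenvalues $\lambda$
• Solve $(A - \lambda I)x = 0$ for each $\lambda$ to obtain the eigenvectors x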
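In practice the eigen-decomposition is obtained numerically. The R sketch below uses base R's `eigen` on the 2 x 2 matrix with rows (2, 3) and (2, 1), chosen here as a small example, and checks that the first eigenpair satisfies Ax = λx.

```r
A <- matrix(c(2, 3,
              2, 1), nrow = 2, byrow = TRUE)

# Eigenvalues are the roots of det(A - lambda * I) = 0
eig <- eigen(A)
eig$values

# Check A x = lambda x for the first eigenpair
x <- eig$vectors[, 1]
lambda <- eig$values[1]
residual <- A %*% x - lambda * x
```

For this matrix the characteristic polynomial is λ² - 3λ - 4, so the eigenvalues are 4 and -1, and `residual` is zero up to rounding error.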
92. Principal components
• First principal component (PC1)
• The eigenvalue with the largest absolute value indicates that the data have the largest variance along its eigenvector, the direction along which there is greatest variation
• Second principal component (PC2)
• The direction with maximum variation left in the data, orthogonal to PC1
• In general, only a few directions manage to capture most of the variability in the data.
93. Steps of PCA
• Let $\bar{X}$ be the mean vector (taking the mean of all rows)
• Adjust the original data by the mean: $X' = X - \bar{X}$
• Compute the covariance matrix C of the adjusted X
• Find the eigenvectors and eigenvalues of C.
• For matrix C, vectors e (= column vectors) having the same direction as Ce:
• eigenvectors of C are vectors e such that $Ce = \lambda e$,
• $\lambda$ is called an eigenvalue of C.
• $Ce = \lambda e \Leftrightarrow (C - \lambda I)e = 0$
• Most data mining packages do this for you.
94. Eigenvalues
• Calculate eigenvalues $\lambda$ and eigenvectors x for the covariance matrix:
• Eigenvalues $\lambda_j$ are used to calculate the percentage of total variance ($V_j$) for each component j:
$V_j = 100 \cdot \frac{\lambda_j}{\sum_{x=1}^{n} \lambda_x}$
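96. Transformed Data
• Eigenvalue $\lambda_j$ corresponds to the variance on component j
• Thus, sort by $\lambda_j$
• Take the first p eigenvectors $e_i$, where p is the number of top eigenvalues
• These are the directions with the largest variances
$\begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{ip} \end{pmatrix} = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_p \end{pmatrix} \begin{pmatrix} x_{i1} - \bar{x}_1 \\ x_{i2} - \bar{x}_2 \\ \vdots \\ x_{in} - \bar{x}_n \end{pmatrix}$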
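The PCA steps, the percentage of variance per component and the projection onto the leading eigenvectors can be sketched in base R as follows. The example reuses the temperature and humidity values from the two-dimensional slide as toy data, with observations in rows; keeping p = 1 component is an arbitrary choice for illustration.

```r
# Toy data: rows = observations, columns = attributes
X <- cbind(temperature = c(40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40),
           humidity    = c(90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70))

# 1. Mean vector and mean-adjusted data
x_bar <- colMeans(X)
X_adj <- sweep(X, 2, x_bar)

# 2. Covariance matrix of the adjusted data
C <- cov(X_adj)

# 3. Eigenvectors (columns) and eigenvalues, sorted by decreasing eigenvalue
eig <- eigen(C)

# 4. Percentage of total variance carried by each component
pct_var <- 100 * eig$values / sum(eig$values)

# 5. Scores: project the adjusted data onto the first p eigenvectors
p <- 1
scores <- X_adj %*% eig$vectors[, 1:p, drop = FALSE]
```

The built-in `prcomp(X)` performs the same computation; its `sdev^2` values equal the eigenvalues obtained here.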
98. Covariance Matrix
• $C = \begin{pmatrix} 75 & 106 \\ 106 & 482 \end{pmatrix}$
• Using MATLAB, we find out:
• Eigenvectors:
• $e_1 = (-0.98, -0.21)$, $\lambda_1 = 51.8$
• $e_2 = (0.21, -0.98)$, $\lambda_2 = 560.2$
• Thus the second eigenvector is more important!
99. If we only keep one dimension: e2
• We keep the dimension of $e_2 = (0.21, -0.98)$
• We can obtain the final data as
$y_i = \begin{pmatrix} 0.21 & -0.98 \end{pmatrix} \begin{pmatrix} x_{i1} \\ x_{i2} \end{pmatrix} = 0.21 \cdot x_{i1} - 0.98 \cdot x_{i2}$
• $y_i$: -10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63
[Figure: the projected values $y_i$ plotted along a single axis running from -40 to 40]
103. PCA –> Original Data
• Retrieving old data (e.g. in data compression)
• RetrievedRowData = (RowFeatureVector^T x FinalData) + OriginalMean
• Yields original data using the chosen components
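The retrieval formula can be illustrated in R. The sketch below stores observations in rows, so the products appear transposed relative to the row-vector notation on the slide; the toy data and object names are hypothetical.

```r
X <- cbind(x1 = c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3),
           x2 = c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7))
x_bar <- colMeans(X)
X_adj <- sweep(X, 2, x_bar)
E <- eigen(cov(X_adj))$vectors   # eigenvectors as columns

# Keep only the first component: a lossy (compressed) reconstruction
E1 <- E[, 1, drop = FALSE]
final_1 <- X_adj %*% E1
approx <- sweep(final_1 %*% t(E1), 2, x_bar, "+")

# Keep all components: the original data are recovered exactly
final_all <- X_adj %*% E
restored <- sweep(final_all %*% t(E), 2, x_bar, "+")
```

With every component retained the reconstruction matches the input up to rounding error; with fewer components, only the variation along the discarded directions is lost.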
104. Principal components
• General about principal components
• summary variables
• linear combinations of the original variables
• uncorrelated with each other
• capture as much of the original variance as possible
105. Applications – Gene expression analysis
• Reference: Raychaudhuri et al. (2000)
• Purpose: Determine core set of conditions for useful
gene comparison
• Dimensions: conditions, observations: genes
• Yeast sporulation dataset (7 conditions, 6118 genes)
• Result: Two components capture most of variability (90%)
• Issues: uneven data intervals, data dependencies
• PCA is common prior to clustering
• Crisp clustering questioned : genes may correlate with
multiple clusters
• Alternative: determination of gene’s closest neighbours
106. Two Way (Angle) Data Analysis
• Sample space analysis: gene expression matrix of genes ($10^3$–$10^4$) by samples ($10^1$–$10^2$)
• Gene space analysis: gene expression matrix of conditions ($10^1$–$10^2$) by genes ($10^3$–$10^4$)
138. QQ Plot Interpretation
• This plot provides information on two main aspects of GWAS data: whether the statistical testing is well controlled for challenges such as population stratification, and whether there is any association.
• A QQ plot compares the p-values expected when testing for association with those actually observed.
• Each dot represents a SNP
• X-axis: expected –log10(p)
• Y-axis: observed –log10(p)
• The red line shows the pattern of –log10(p) values expected if no SNP has a significant genetic association with the trait.
• When there are significant associations between SNP markers and the trait, the SNP dots (blue) rise off the line
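The two axes of the QQ plot can be computed directly. This R sketch builds the expected and observed -log10(p) values from a toy p-value vector; the use of `ppoints` for the expected quantiles is an assumption about how such plots are usually drawn, and the p-values are made up.

```r
# Toy p-values: an evenly spread null-like set plus two very small values
p <- c(seq(0.01, 0.99, by = 0.01), 1e-6, 5e-8)
n <- length(p)

# Observed and expected -log10(p), both ordered from most to least significant
observed <- -log10(sort(p))
expected <- -log10(ppoints(n))

# Points lying above the diagonal are more significant than expected by chance
above_line <- observed > expected
```

Plotting `observed` against `expected` together with the line y = x reproduces the layout described above; the two smallest p-values are the dots that rise off the line.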
139. Manhattan plot Interpretation
• A scatter plot used to show which chromosomes carry significantly associated SNPs, based on their p-values.
• Genomic coordinates are displayed on the x-axis and the negative logarithm of the association p-value for each SNP on the y-axis.
• Each dot signifies a SNP
• The strongest associations have the smallest p-values (e.g. $10^{-15}$)
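The coordinates of a Manhattan plot can be prepared with base R. The sketch below assumes a results table with columns Chr, Pos and p (the column names used in the R code earlier in this tutorial) and lays chromosomes end to end using each chromosome's largest observed position as its length; the data are made up.

```r
gwas <- data.frame(
  Chr = c(1, 1, 1, 2, 2, 3),
  Pos = c(100, 500, 900, 200, 700, 300),
  p   = c(0.2, 1e-8, 0.03, 0.5, 4e-4, 0.9)
)

# Length of each chromosome, approximated by its largest marker position
chr_len <- sapply(split(gwas$Pos, gwas$Chr), max)

# Offset of each chromosome = summed lengths of the chromosomes before it
offset <- cumsum(c(0, chr_len[-length(chr_len)]))
names(offset) <- names(chr_len)

# x: cumulative genomic coordinate; y: -log10 of the association p-value
gwas$x <- gwas$Pos + unname(offset[as.character(gwas$Chr)])
gwas$y <- -log10(gwas$p)
```

A call such as `plot(gwas$x, gwas$y)` then draws one dot per SNP, and the tallest dot marks the strongest association.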