SlideShare a Scribd company logo
1 of 23
Download to read offline
1
A nearest neighbour approach by genic distance to the assignment of
1
individuals to geographic origin
2
3
Bernd Degen1
, CΓ©line Blanc-Jolivet1
, Katrin Stierand1
& Elizabeth Gillet2
4
(15/11/2016)
5
6
1
ThΓΌnen Institute of Forest Genetics, Sieker Landstrasse 2, 22927 Grosshansdorf, Germany,
7
E-mail: bernd.degen@thuenen.de
8
9
2
Forest Genetics and Forest Tree Breeding, Faculty of Forest Sciences and Forest Ecology,
10
Georg-August-University of GΓΆttingen, BΓΌsgenweg 2, 37077 GΓΆttingen, Germany
11
12
Abstract
13
During the past decade, the use of DNA for forensic applications has been extensively
14
implemented for plant and animal species, as well as in humans. Tracing back the
15
geographical origin of an individual usually requires genetic assignment analysis. These
16
approaches are based on reference samples that are grouped into populations or other
17
aggregates and intend to identify the most likely group of origin. Often this grouping does not
18
have a biological but rather a historical or political justification, such as β€œcountry of origin”.
19
In this paper, we present a new nearest neighbour approach to individual assignment
20
or classification within a given but potentially imperfect grouping of reference samples. This
21
method, which is based on the genic distance between individuals, functions better in many
22
cases than commonly used methods. We demonstrate the operation of our assignment method
23
using two data sets. One set is simulated for a large number of trees distributed in a 120 km
24
by 120 km landscape with individual genotypes at 150 SNPs, and the other set comprises
25
experimental data of 1221 individuals of the African tropical tree species Entandrophragma
26
cylindricum (Sapelli) genotyped at 61 SNPs. Judging by the level of correct self-assignment,
27
our approach outperformed the commonly used frequency and Bayesian approaches by 15%
28
for the simulated data set and by 5 to 7% for the Sapelli data set.
29
Our new approach is less sensitive to overlapping sources of genetic differentiation,
30
such as genic differences among closely-related species, phylogeographic lineages and
31
isolation by distance, and thus operates better even for suboptimal grouping of individuals.
32
33
Key words: Forensics, genetic assignment, genic distance, GeoAssign, geographic origin,
34
timber, trees
35
36
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
2
Introduction
37
In recent years, the application of forensic methods based on genetic markers to assign
38
individual plants and animals to their geographic origin (Johnson et al, 2014; Ogden and
39
Linacre, 2015) has gained importance for the control of trade regulations and consumer
40
protection. Whereas many animal and plant species are protected by the Convention on
41
International Trade in Endangered Species of Wild Fauna and Flora (CITES), the protection
42
level can depend on the country of origin (species listed in CITES annex 3). For the timber
43
trade, regulations such as the US Lacey Act and the EU timber regulation require the trader to
44
declare the geographic origin of the material. Therefore it is important to be able to determine
45
the true origin of the individuals that are traded (Johnson et al, 2014).
46
In different areas of the world, genetic reference data bases have been developed to aid
47
in the identification of the true geographic origin of timber (Degen et al, 2013; Ng et al,
48
2016). To this end, reference samples of individual trees are collected throughout the natural
49
distribution area of the species or in specific target regions. For each individual, the data base
50
records the geographic coordinates of the individual and its genotype at gene loci representing
51
different types of gene marker, especially molecular markers of the types nSSR, cp-DNA, mt-
52
DNA and more recently SNP (Deguilloux et al, 2003; Ishida et al, 2013; Jolivet and Degen,
53
2012; Puckett and Eggert, 2016). Usually, more than one individual is sampled at each
54
reference location, and reference locations and their sampled individuals are aggregated a
55
priori into groups (or classes) by non-genetic criteria such as country of origin. The objective
56
is to assign an individual of unknown origin, such as imported timber, to that group of
57
reference individuals to which its multilocus genotype conforms best by some criterion.
58
Examples for the assignment of individuals to place of origin have been published for timber
59
tree species, elephants, and salmon (Degen et al, 2006; Glover et al, 2008; Wasser et al, 2015;
60
Wasser et al, 2007).
61
Most of the previously applied approaches to the assignment of an individual of
62
unknown origin to its putative group are frequency-based (Paetkau et al, 1995) and/or
63
Bayesian (Rannala and Mountain, 1997). These approaches estimate the allele frequencies at
64
all loci in every group, compute the probability that the multilocus genotype of the individual
65
in question originates from each group, and assign the individual to that group for which the
66
probability of origin is highest (Ogden and Linacre, 2015). Application of these approaches is
67
based on assumptions that (a) the structure of the groups fits to Wright’s Island Model of drift
68
(Wright, 1931), (b) the alleles at each locus are in Hardy-Weinberg-Proportions (HWP) (i.e.,
69
no stochastic association between alleles at the same locus), and (c) that the loci are in linkage
70
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
3
equilibrium or are completely unlinked (i.e., no stochastic association between the genotypes
71
between loci). Unfortunately, these assumptions are often violated in real populations. For
72
example, molecular markers such as nSSRs often show deviation from HWP in at all spatial
73
scales (Mariette et al, 2001; Nurtjahjaningsih et al, 2005). Also the use of large SNP marker
74
sets derived from Next Generation Sequencing often leads to linkage disequilibrium (LD) in
75
the data, which would require pruning process of loci in the data (Duforet-Frebourg et al,
76
2015). To overcome this problem and to avoid loss in the Gain of Informativeness for
77
Assignment (GIA), other methods combine some loci into haplotypes (Duforet-Frebourg et al,
78
2015). Indeed, the use of haplotypes provides information on the ancestry and recombination
79
events, which are useful when differentiation among groups is low. However, these methods
80
are mostly interesting when dense marker sets are used (Gattepaille and Jakobsson, 2012), and
81
it is not clear how these methods handle missing data, which is unfortunately the main issue
82
with genotyping of timber or other material with degraded DNA.
83
The success of these assignment methods depends on the extent of genetic
84
differentiation among the groups (Ogden and Linacre, 2015). The sampling scheme and the
85
variability of the spatial genetic structure have an additional impact on the success of the
86
assignment methods, especially if the groups represent political units such as countries.
87
When pre-defined aggregation to groups does not reflect the genetic structure among
88
the reference individuals, genetic assignment approaches based on allele frequencies may fail
89
(Meirmans, 2012). Grouping according to political (e.g. country) borders instead of genetic
90
boundaries between populations as reproductive units is particularly critical and may lead to
91
confusing genetic mixtures within groups (Manel et al, 2007). In the case that reference
92
groups contain individuals of more than one reproductively isolated deme, such as different
93
regions or even cryptic sympatric species, the mean genetic difference among individuals of
94
different demes should be larger than the mean genetic difference among individuals of the
95
same deme (Konnert et al, 2011). The ability of markers to identify such demes differs,
96
however. For instance, the incomplete lineage sorting or chloroplast capture that can be
97
observed among closely-related species at chloroplast markers (Duminil et al, 2013) could
98
hinder attempts to recognize these species in the reference data. Also, the genetic
99
differentiation among the groups of reference samples is biased by the proportion of species
100
mixture or the mixing of individuals from different phylogeographic lineages (e.g. refugia)
101
within the groups or by the assignment of individuals from the same spatial genetic unit (e.g.
102
cross-border populations) to different groups.
103
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
4
The opposite approach to the a priori specification of groups is to attempt to partition
104
individuals into reference groups that show HWP and linkage equilibrium. Bayesian
105
clustering approaches, as implemented for example in the program STRUCTURE (Pritchard
106
et al, 2000), have become common in recent years. These aggregations are, however, also
107
based on estimates of allele frequencies that could be biased by unequal sample sizes among
108
genetic groups or violation of the assumption that the populations fit Wright’s Island Model
109
with relatively small, clearly differentiated populations (Meirmans, 2012). Another major
110
drawback of the partitioning of reference individuals into genetic groups by Bayesian
111
clustering methods is that the genetic groups they detect may be spread over more than one
112
country. This collides with existing legislation that requires declaration of the country of
113
origin, the political borders of which may cut through the middle of a genetic group, making it
114
difficult to issue a statement from genetic testing on how likely this declaration is.
115
This paper describes a nearest neighbour classification approach that assigns
116
unclassified individuals to predefined classes of reference individuals, such as by the country
117
of origin that is of relevance for the timber trade or any CITES listed species with different
118
country restrictions. The distance between individuals is measured by the genic distance
119
between their multilocus genotypes, defining the nearest neighbours of a specific individual as
120
those individuals with the smallest genic distance to it. An unclassified individual is assigned
121
to a particular class if this class has the highest representation among a limited set of nearest
122
neighbours by a new index πΌπ‘Ÿ and if this representation is statistically significant. Since
123
assignment is not based on estimation of allele frequencies within entire reference classes, the
124
approach avoids the problems of possible discrepancy between political and genetic
125
boundaries described above. We demonstrate its application using two data sets, a large set of
126
simulated data for a hypothetical tropical tree species and experimental data of an African tree
127
species. When applied to test whether individuals of known origin are correctly assigned to
128
their class, it turns out that the probability of correct self-assignment is better than for
129
conventional methods based on allele frequencies.
130
131
Material and methods
132
Nearest neighbour classification of individuals to origin
133
Consider the situation in which 𝑅 reference individuals (index 𝑗) have been subdivided
134
a priori into G different groups that correspond to their region (e.g. country) of origin. The
135
objective is to assign each of a set of T test individuals of unknown origin to that group to
136
which it is genetically most similar by the following procedure.
137
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
5
For a specified set of 𝑀 gene loci, genetic similarity is first calculated on a pairwise
138
basis between the test individual 𝑖 and each reference individual 𝑗 as the proportion of genes
139
they share over the loci, that is, as 1 βˆ’ 𝐷𝑖𝑗 , where
140
0 ≀ 𝐷𝑖𝑗 =
1
𝑀
βˆ‘
1
2
βˆ‘|π‘Žπ‘–π‘™π‘˜βˆ’π‘Žπ‘—π‘™π‘˜|
𝑁𝑙
π‘˜=1
𝑀
𝑙=1
≀ 1
is the proportion of non-shared alleles (Gregorius, 1978). The index π‘˜ denotes the alleles 1 to
141
𝑁𝑙 at locus 𝑙. π‘Žπ‘–π‘™π‘˜ is the frequency (or dosage) of the π‘˜-th allele at locus 𝑙 in the test individual
142
𝑖, and π‘Žπ‘—π‘™π‘˜ is the frequency of the same allele in reference individual 𝑗. For two diploid
143
individuals, the frequencies of alleles at any locus can equal 1 (homozygote for that allele),
144
0.5 (heterozygote with one copy of the allele), or 0 (the allele is absent). 𝐷𝑖𝑗 is a direct
145
measure of the genetic difference between two individuals. Compared to most other genetic
146
distances, 𝐷𝑖𝑗 has the additional advantages that (a) it is a metric distance, (b) it ranges
147
between 0 for the sharing of all alleles and 1 for the sharing of no alleles, and (c) it has a
148
linear geometric conception (Gregorius, 1984).
149
All reference individuals 𝑗 are then listed in ascending order of their genic distance 𝐷𝑖𝑗
150
to test individual 𝑖. If the number of loci is small, many reference individuals may have the
151
same distance from the test individual, but as the number of loci is increased, the number of
152
reference individuals with the same distance generally decreases until all distances are unique.
153
Each number in the interval [0,1] specifies a neighbourhood of the test individual 𝑖 as the set
154
of all reference individuals 𝑗 that have genic distance 𝐷𝑖𝑗 less than or equal to this number.
155
Members of one neighbourhood are also members of any larger neighbourhood. The π‘˜ nearest
156
neighbours to test individual 𝑖 by this measure are the π‘˜ reference individuals that have the π‘˜
157
smallest genic differences to test individual 𝑖.
158
In its most basic form, the π‘˜-nearest neighbour algorithm for classification would
159
assign 𝑖 to the group that has the highest number of reference individuals among its π‘˜ nearest
160
neighbors (β€œmajority vote”). Its application here requires solution of the following problems:
161
how to determine an β€œoptimal” π‘˜, how to deal with groups containing different numbers of
162
reference individuals, and how to test for significance of an assignment.
163
For consistency, the different numbers of reference individuals that have been sampled
164
for different species, say, could be accommodated by specifying π‘˜ as a recommended
165
percentage 𝑃 of all reference individuals. Alternatively, π‘˜ could be specified as the number of
166
reference individuals found within a neighbourhood of recommended size (critical genic
167
distance), and test individuals whose critical neighbourhood is empty (π‘˜ = 0) would not be
168
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
6
assigned to any group (outliers). Such a critical distance could also be chosen as the first
169
β€œjump” in the cumulative frequency distribution of 1 βˆ’ 𝐷𝑖𝑗 over the reference individuals 𝑗,
170
separating genically similar reference individuals from genically dissimilar ones. For now, we
171
will speak of the chosen individuals of highest genic similarity to the test individual as the β€œπ‘˜
172
nearest neighbours”.
173
Simple assignment of the test individual to the largest group among the nearest
174
neighbours (β€œmajority vote”) ignores the possibility that the original groups of reference
175
individuals can be of different sizes, giving groups with more reference individuals a bigger
176
chance to β€œwin”. In order to compensate for group size, we define an index πΌπ‘Ÿ that weights the
177
relative frequency π‘œπ‘Ÿ (i.e., proportion) of each group π‘Ÿ among the π‘˜ nearest neighbours by the
178
relative frequency π‘’π‘Ÿ of group π‘Ÿ among all reference individuals
179
βˆ’1 ≀ πΌπ‘Ÿ =
π‘œπ‘Ÿ βˆ’ π‘’π‘Ÿ
π‘’π‘Ÿ
< 0 π‘“π‘œπ‘Ÿ π‘œπ‘Ÿ < π‘’π‘Ÿ
πΌπ‘Ÿ = 0 π‘“π‘œπ‘Ÿ π‘œπ‘Ÿ = π‘’π‘Ÿ
0 < πΌπ‘Ÿ =
π‘œπ‘Ÿ βˆ’ π‘’π‘Ÿ
1 βˆ’ π‘’π‘Ÿ
< 1 π‘“π‘œπ‘Ÿ π‘œπ‘Ÿ > π‘’π‘Ÿ
It holds that πΌπ‘Ÿ = βˆ’1 whenever group π‘Ÿ is not represented among the nearest neighbours,
180
πΌπ‘Ÿ = 0 if the frequencies π‘œπ‘Ÿ and π‘’π‘Ÿ are equal, and πΌπ‘Ÿ = 1 whenever all nearest neighbours
181
belong to group π‘Ÿ. Unless the relative frequencies of all groups among the nearest neighbours
182
are exactly the same as among all reference individuals, in which case πΌπ‘Ÿ = 0 for all groups
183
(and 𝑁 is divisible by π‘˜), at least one group π‘Ÿ must have an index πΌπ‘Ÿ greater than 0 and
184
another group an index less than 0.
185
Selecting the group π‘Ÿ with the highest value of πΌπ‘Ÿ to be the major candidate for the
186
origin of the test individual corresponds to the majority rule weighted by group size. The
187
statistical significance of the higher relative frequency of group π‘Ÿ among nearest neighbours
188
compared to the set of all reference individuals can be tested by classifying reference
189
individuals into two classes (i.e., members of group π‘Ÿ and members of any other group) and
190
calculating the P-value as the probability of finding at least the observed number π‘˜π‘Ÿ = π‘˜ βˆ™ π‘œπ‘Ÿ
191
of individuals from group π‘Ÿ within a random sample of size π‘˜ drawn without replacement
192
from the set of all 𝑁 reference individuals containing π‘π‘Ÿ individuals of group π‘Ÿ, that is,
193
𝑃(𝑋 β‰₯ π‘˜π‘Ÿ) = βˆ‘
(
π‘π‘Ÿ
π‘š
) (
𝑁 βˆ’ π‘π‘Ÿ
π‘˜ βˆ’ π‘š
)
(
𝑁
π‘˜
)
π‘˜
π‘š=π‘˜π‘Ÿ
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
7
by the hypergeometric distribution, where (
π‘Ž
𝑏
) = (π‘Ž! 𝑏! (π‘Ž βˆ’ 𝑏)!
⁄ ) is the binomial coefficient.
194
A P-value 𝑃(𝑋 β‰₯ π‘˜π‘Ÿ) less than a chosen level of significance (commonly 0.05) is an
195
indication that the observed representation of group π‘Ÿ among the nearest neighbors is
196
significantly high. It cannot be ruled out that another group with a smaller positive value of
197
the index also shows significantly high representation among the nearest neighbours, making
198
it an additional candidate for the origin of the test individual. This could happen, for example,
199
if the test individual originates from a cross-border population. It must, however, be
200
considered that hypergeometric tests of different candidate groups are not independent of each
201
other.
202
By the same reasoning, a P-value 𝑃(𝑋 β‰₯ π‘˜π‘Ÿ) greater than 0.95 indicates that this
203
group has a significantly low representation among the nearest neighbours of the test
204
individual and should be excluded as a candidate for the origin.
205
206
Test of performance
207
We then compared the performance of this new individual genic distance assignment
208
method with the so far commonly used genetic assignment methods based on frequency
209
(Paetkau et al, 2004) and the Bayesian (Rannala and Mountain, 1997) approach criteria,
210
referred below as frequency and Bayesian approaches.
211
As an indicator of the performance, we computed the proportion of correctly assigned
212
individuals in self-assignments tests (Cornuet et al, 1999). Here the individuals of the
213
reference data were self-classified to the sampled groups using the leave-one-out approach
214
(Efron, 1983). For the frequency and the Bayesian approaches, we applied the self-assignment
215
routine implemented in the program GeneClass 2.0 (Piry et al, 2004).
216
217
Genetic diversity, Genetic differentiation and Hardy-Weinberg-Proportion
218
The success of genetic assignment depends on the level of genetic diversity, genetic
219
differentiation among the groups and the similarity of the genotype frequencies to Hardy-
220
Weinberg-Proportions.
221
As a measure for genetic diversity (effective number of alleles) we computed the mean
222
allelic diversity v of group π‘Ÿ (Gregorius, 1987):
223
1 ≀ π‘£π‘Ÿ =
1
𝑀
βŒˆβˆ‘ π‘π‘Ÿπ‘˜π‘–
2
π‘›π‘˜
𝑖=1
βŒ‰
βˆ’1
≀ π‘›π‘˜
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
8
With relative frequency π‘π‘Ÿπ‘˜π‘– of the 𝑖-th allele at the π‘˜-th locus in group π‘Ÿ and π‘›π‘˜ different
224
alleles at locus π‘˜.
225
226
To measure the difference between groups, we calculate the gene pool distance 𝑑0
227
between two groups πΊπ‘Ÿ and πΊπ‘Ÿβ€² as
228
𝑑0(πΊπ‘Ÿ, πΊπ‘Ÿβ€²) =
1
𝐿
βˆ‘
1
2
𝑙
βˆ‘|π‘π‘–π‘™π‘Ÿ βˆ’ π‘π‘–π‘™π‘Ÿβ€²|
𝑖
where the index 𝑙 runs through the 𝐿 loci, the index 𝑖 runs through the alleles at locus 𝑙, and
229
π‘π‘–π‘™π‘Ÿ and π‘π‘–π‘™π‘Ÿβ€² are the relative frequency of the 𝑖-th allele at locus 𝑙 in group πΊπ‘Ÿ and πΊπ‘Ÿβ€²,
230
respectively (Gregorius, 1984). 𝑑0 ranges between 0 and 1. 𝑑0 =0 holds when the relative
231
frequency of every allele at every locus is the same in both groups. 𝑑0 = 1 holds when the
232
groups have no allele in common at any locus. To visualize difference between groups, cluster
233
analysis based on the pairwise gene pool distances 𝑑0 between groups was performed using
234
the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) as implemented in the
235
software PAST (Hammer et al, 2001).
236
Based on the gene pool distance, the complementary compositional differentiation 𝛿𝑆𝐷
237
among the gene pools of the groups is then defined as the mean gene pool distance between
238
each group πΊπ‘Ÿ and its complement πΊπ‘Ÿ
Μ…Μ…Μ…, where πΊπ‘Ÿ
Μ…Μ…Μ… consists of all reference individuals that do
239
not belong to group π‘Ÿ (Gregorius, 1987), that is,
240
𝛿𝑆𝐷 =
1
𝑅
βˆ‘ 𝑑0(πΊπ‘Ÿ, πΊπ‘Ÿ
Μ…Μ…Μ…)
π‘Ÿ
where the index π‘Ÿ runs through the 𝑅 groups, and π‘π‘–π‘™π‘Ÿ
Μ…Μ…Μ…Μ…Μ… is the relative frequency of the 𝑖-th
241
allele in the complement πΊπ‘Ÿ
Μ…Μ…Μ…. 𝛿𝑆𝐷 ranges between 0 and 1. 𝛿𝑆𝐷 = 0 holds when the relative
242
frequency of every allele at every locus is the same in all groups. 𝛿𝑆𝐷 = 1 holds when the
243
groups have no allele in common at any locus.
244
For comparison, we also calculated the commonly used Wright’s FST (Wright, 1978)
245
which is a measure of fixation (monomorphism) and not of the difference among groups
246
(Gregorius et al, 2007).
247
248
As an indicator for the departure from Hardy-Weinberg proportions we computed the
249
inbreeding coefficient 𝐹𝑖𝑠 within each group π‘Ÿ for which π»π‘’π‘Ÿπ‘˜ > 0 holds (Wright, 1950) as
250
𝐹𝑖𝑠 =
1
𝑀
βˆ‘ (π»π‘’π‘Ÿπ‘˜ βˆ’ π»π‘œπ‘Ÿπ‘˜) π»π‘’π‘Ÿπ‘˜
⁄
𝑀
π‘˜=1
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
9
Here π»π‘’π‘Ÿπ‘˜ = 1 βˆ’ βˆ‘ π‘π‘Ÿπ‘˜π‘–
2
π‘›π‘˜
𝑖=1 is the expected heterozygosity in group π‘Ÿ at locus π‘˜ under the
251
hypothesis of random mating (i.e., the Simpson diversity) and π»π‘œπ‘Ÿπ‘˜ is the observed
252
heterozygosity (i.e., the relative frequency of heterozygotes).
253
254
Simulated data
255
The simulation model β€œEco-Gene Landscape” was used to create a baseline population
256
for a tropical tree species in an area of 120 km by 120 km and its spatial genetic structure at
257
150 nuclear Single Nucleotide Polymorphism = nSNPs after 700 years of re-colonisation. The
258
simulation model includes pollen and seed dispersal, mating, growth and mortality (Degen et
259
al, 2006). The 120 km x 120 km landscape was subdivided into grid cells of side lengths 1 km
260
x 1 km (100 ha). Each grid cell represented a different population that is connected via pollen
261
and seed dispersal with the surrounding populations. It was assumed that 60 % of the grid
262
cells could be potentially occupied by the target tree species. This equals to 8640 populations
263
in the simulated area. Simulation was initiated with 200 trees in each of only 10 grid cells to
264
represent the refugia before re-colonisation. A mean individual growth rate of 0.5 cm
265
diameter increment per year with a standard deviation of 0.2 was assumed. All trees with a
266
diameter of 35 cm were fertile. Pollen dispersal was simulated following two normal
267
distributions with means of 0 m and standard deviations of 500 m and 1500 m with a
268
probability of 0.7 and 0.3, respectively. The resulting pollen dispersal pattern is typical for an
269
insect pollinated species (Degen and Roubik, 2004). Random fertilization was assumed, with
270
the exception that seeds from self-fertilization were inviable. Seed dispersal was also
271
simulated by two normal distributions with means of 0 m and standard deviations of 100 m
272
and 1500 m with probabilities of 0.7 and 0.3, respectively. Using these parameters, the
273
majority of seeds were distributed within 300 m around the seed trees. Further, in accordance
274
with field inventories typical for tropical trees, a maximum density of trees per ha in different
275
diameter classes was considered. These maximum densities represented carrying capacities
276
and caused mortality if by seed production, migration or ingrowth the densities were
277
exceeded. The model thus created overlapping generations.
278
A total of 888,841 individuals were generated over all populations during 700
279
simulated years. At that time, 6212 grid cells contained populations. The number of
280
individuals in each population varied from 2 to 206 with an average of 156. We assumed that
281
every SNP locus has 2 alleles (A1, A2). The initial allele frequencies were randomly selected
282
with 0<A1<1 for all initial populations (i.e., uniform distribution). To create reference data,
283
we selected 10 areas, each with a size of 10 km by 10 km (Figure 1). In each of these areas we
284
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
10
randomly selected a reference data set of 100 individuals. In order to imitate the effect of
285
species mixture, in area 5 we replaced allele β€œ2” by an allele β€œ3” in all individuals, and in
286
areas 6 and 7 we did the same for 50% of the individuals. Thus the areas 1, 2, 3, 4, 8, 9, and
287
10 are genetically purely species A, area 5 is purely species B, and the areas 6 and 7 are
288
mixtures of 50% species A and 50% species B.
289
290
291
292
Figure 1: Simulated data set; red and brown points are grid cells covered by trees after 700
293
simulated years of re-colonisation. Blue points = location of initial refugial areas. Green
294
squares mark selected areas of species A, yellow squares mark selected areas of species B,
295
and yellow-green squares are selected areas with 50% mixtures of species A and B
296
297
Experimental data on Sapelli
298
Sapelli is the trade name of the botanical species Entandrophragma cylindicum of the
299
same family as mahogany (Meliaceae). Since decades, this high value timber tree is one of the
300
most harvested tree species in the natural forests of tropical Africa. In frame of a project of
301
the International Tropical Timber Organization (ITTO), we developed a genetic reference
302
data set in order to check claims on the geographic origin of the traded timber (Degen and
303
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
11
Bouda, 2015). A total of 1292 Sapelli trees were collected in seven African countries (Figure
304
2). Then, SNPs were developed and all individuals were screened at 61 SNPs. Two Sapelli
305
samples from Cameroon and DRC were used as reference for RAD (Restriction Site
306
Associated DNA) sequencing and SNP detection (Pujolar et al, 2014). A total of 140 putative
307
SNP loci were screened with a MassARRAY technology (Agena Biosciences, San Diego,
308
California, USA) on a set of 190 reference individuals chosen from different countries among
309
the reference samples (McKernan et al, 2002). Estimates of genetic differentiation and
310
correlations among genetic and geographic distances per locus were used together with
311
amplification success rates to select 61 loci for reference sample screening. Details are
312
presented elsewhere (Blanc-Jolivet and Degen, in preparation).
313
314
315
Figure 2: Location of the sampled trees of Entandrophragma sp. in Africa
316
317
Results
318
Simulated data
319
After 700 years of simulated re-colonisation, the allelic diversity (vr) in the selected
320
areas varied between 1.52 and 1.58 for the individuals belonging to species A and in the same
321
range for individuals of species B. The fixation index equalled FST = 0.053 and the
322
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
12
compositional differentiation 𝛿𝑆𝐷 = 0.082 for species A and FST = 0.045 and 𝛿𝑆𝐷 = 0.106 for
323
species B. For all individuals of both species together, the measures among the 10 selected
324
areas equalled FST = 0.171 and 𝛿𝑆𝐷 = 0.196. Both measures clearly indicated a stronger
325
fixation and differentiation between the species compared to the within species
326
differentiation.
327
Spatial-genetic analysis reveals a clear spatial structure within species A showing that
328
individuals up to a distance of 20 km are genetically more similar than expected for a random
329
distribution (Figure 3).
330
331
332
333
Figure 3: Spatial genetic structure of the simulated data set for species A. Dis_O is the
334
observed mean genic distance among individuals in the given spatial distance class. Dis_E is
335
the expected genic distance for a random distribution of individuals in space.
336
337
The dominating species signal is visible in the genetic differentiation among the areas
338
(Figure 4). The areas covered only by species A (green) form one group in the dendrogram
339
and the areas with pure species B (yellow) and species mixture (light green) form another
340
group.
341
342
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
13
343
Figure 4: Dendrogram based on a cluster analysis (UPGMA: Unweighted Pair Group Method
344
with Arithmetic Mean) using the genic distance matrix among the allele frequencies of the ten
345
selected areas in the simulated data set. The numbers are the proportions of the 1000 bootstrap
346
simulations for shifting the included SNP loci that find the same branch.
347
348
In addition, results for the self-assignment is quite stable over a range of percentiles of
349
nearest neighbours (P= 0.5% to 10%, Figure 5).
350
351
352
Figure 5: Proportion of correct self-assignments for different percentiles (P) of the genically
353
most similar reference individuals (nearest neighbours) in the simulated data set
354
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
14
The frequency and the Bayesian approach assigned 70% of all 1000 individuals to the
355
correct area, whereas our nearest neighbour approach assigned 85% of the individuals
356
correctly (Table 1). The largest differences were observed for the areas 6 and 7 with the
357
simulated species mixture of individuals. There, the frequency and the Bayesian approach did
358
not assign a single individual correctly but the nearest neighbour approach accurately
359
assigned 95% for area 6 and 82% for area 7. The frequency and Bayesian approach performed
360
slightly better than the nearest neighbour approach in areas (areas 2, 3, 9) with lower genetic
361
differentiation (lower values of 𝛿𝑆𝐷) and heterozygosity values close to Hardy-Weinberg-
362
Proportion (Fis values close to 0), whereas our nearest neighbour approach came up with
363
better results for areas with stronger genetic differentiation and stronger departure from
364
Hardy-Weinberg-Heterozygosity (areas 1, 6, 7, 8).
365
366
Area N 𝛿𝑆𝐷 Fis Nearest
neighbour
approach
(P1) %
Frequency
approach
%
Bayesian
approach %
Area 1 100 0.170 0.026 99 93 94
Area 2 100 0.142 0.040 67 77 77
Area 3 100 0.147 0.027 81 91 91
Area 4 100 0.174 0.035 93 95 96
Area 5 100 0.463 0.029 100 100 100
Area 6 100 0.228 0.340 95 0 0
Area 7 100 0.197 0.304 82 0 0
Area 8 100 0.156 0.057 87 84 85
Area 9 100 0.140 0.013 67 78 78
Area 10 100 0.143 0.014 74 79 79
Total / Mean 1000 85 70 70
367
Table 1: Results of the self-assignments for the simulated data set using the different
368
assignment approaches, N = sample size, 𝛿𝑆𝐷 = complementary compositional
369
differentiation, Fis = mean inbreeding coefficient over all loci
370
371
372
Experimental data on Sapelli
373
The samples from Ghana and Ivory Coast were merged in one group because of the
374
small sample sizes and the geographic neighbourhood. The mean allelic diversity (vr) varied
375
between 1.39 and 1.59 in the different countries. The mean fixation was FST = 0.092 and the
376
mean compositional differentiation 𝛿𝑆𝐷 = 0.126 (0.078 βˆ’ 0.194). There was a clear spatial
377
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
15
structure showing that individuals up to a distance of 500 km are genically more similar than
378
expected for a random distribution (data not shown). The analysis with STRUCTURE came
379
up with a clear indication that a significant proportion of individuals represent other species
380
within the genus Entandrophragma (Blanc-Jolivet and Degen, in preparation).
381
The frequency and the Bayesian approach assigned 68% and 66% of all individuals to
382
the correct area, whereas the nearest neighbour approach assigned 74% of the individuals
383
correctly (Table 2).
384
Large differences among the methods were observed for individuals from Cameroon.
385
Here, the nearest neighbour approach assigned 77% of the individuals correctly, while the
386
frequency and the Bayesian approach assigned only 65% and 61% correctly. This was the
387
group with the largest sample size. Also for samples from DRC with the second largest
388
sample size the nearest neighbour approach was better.
389
As for the simulated data, the nearest neighbour approach was better in groups with
390
high genetic differentiation and stronger departure from Hardy-Weinberg heterozygosity
391
(DRC, Ghana/ Ivory Coast) and the other approaches performed better in situations with
392
lower differentiation and F-values close to 0 (Gabon).
393
394
Country N 𝛿𝑆𝐷 Fis Nearest
neighbour
approach
(P=1) %
Frequency
approach %
Bayesian
approach %
Cameroon 430 0.094 0.234 77 65 61
Congo Brazzaville 140 0.096 0.269 44 51 50
DRC 419 0.170 0.439 86 81 78
Gabon 36 0.078 0.233 3 19 22
Ghana / Ivory Coast 89 0.194 0.312 74 71 67
Total/ Mean 1114 74 68 66
395
Table 2: Results of the self-assignments for the Sapelli data set using the different assignment
396
approaches, all data that had at least 50% of the loci were included. N = sample size, 𝛿𝑆𝐷 =
397
complementary compositional differentiation, Fis = mean inbreeding coefficient over all loci
398
399
Discussion
400
Realistic simulated data and typical large scale SNP data set of a tree species?
401
So far, not many data sets have been published on large scale genetic differentiation of
402
tree species at SNP gene markers, and most of them are looking for SNP loci linked to
403
selection (Csillery et al, 2014). The large scale genetic structure at the 150 nuclear SNPs in
404
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
16
the simulated data was the result of non-selective processes (gene flow, mating system) and
405
the imitated effect of a mixture of two closely related species. The level of genetic
406
differentiation and fixation we observed in the Sapelli data set (𝛿𝑆𝐷 = 0.123, FST= 0.092) lay
407
between the pure within-species differentiation and fixation (𝛿𝑆𝐷 = 0.083, FST= 0.053) and
408
the overall genetic differentiation and fixation of the simulated data set (𝛿𝑆𝐷 = 0.196, FST=
409
0.171). A similar level of fixation (FST=0.15) has been reported for a large SNP data sets of
410
Populus trichocarpa in North America (Slavov et al, 2012). The overlapping of different
411
genetic signals such as species hybridisation and mixture, phylogeographic pattern and
412
isolation-by-distance effects responsible for the genetic differentiation of tree populations has
413
been found in several recent studies (De La Torre et al, 2015; Jelic et al, 2015; Thomson et al,
414
2015). We imitated the species signal by replacing one of two alleles by a third allele, keeping
415
one allele at each locus in common among species. This might look rather β€œartificial”, but in
416
nature separation of related species of the same genus is visible by the same observation
417
(mixture of species-specific alleles and common alleles (see e. g. for different citrus species
418
(Curk et al, 2015) and for different frog species (Lemmon and Lemmon, 2012)).
419
420
Performance of assignment methods
421
We demonstrate by the simulated and the real data set of the African tree species
422
Sapelli that our nearest neighbour approach had a higher percentage of correctly assigned
423
individuals compared to the frequency and Bayesian methods. The reason for this better
424
performance was that our calculated index takes into account only the genically most similar
425
individuals to the reference data. Thus, it reduces the bias of overlapping genetic signals
426
because there is a hierarchy of genic similarity (Mori et al, 2015; Petit et al, 2005).
427
Individuals of the same species (or the same level of hybridisation), from the same
428
phylogeographic lineage and the same geographic region are the most similar. On the
429
opposite side, individuals from different species show the highest genic differences. Of
430
course, one could use Bayesian clustering analysis to obtain a genetic grouping of reference
431
data that minimizes departure from HWP within each group (Pritchard et al, 2000), thereby
432
limiting problems due to population sub-structure. But for many applications – especially as
433
part of law enforcement and forensic investigations – the investigator needs to come to a
434
conclusion for a given (non-biological) grouping of reference individuals. For example, the
435
question for application of the EU-timber regulation is to confirm or not that the timber is
436
from a Sapelli tree grown in Cameroon. In this case, the data must be aggregated by countries.
437
438
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
17
The better performance of the nearest neighbour approach was particularly visible for groups
439
with high sample sizes, higher genetic differentiation and an excess of homozygotes
440
compared to Hardy-Weinberg expectations. The excess of homozygotes could be caused by a
441
subpopulation structure within the groups (Wahlund effects). The power of our nearest
442
neighbour approach depends on a sufficient sample size from all of the different
443
subpopulations represented in the reference data. On the other hand, there were also groups
444
within these data sets for which the Bayesian and the frequency approach had higher success
445
rates for the self-assignments. These were groups with lower genetic differentiation and
446
heterozygosity close to Hardy-Weinberg-expectations.
447
448
In our individual distance approach, a decision is still needed on the percentile of
449
genically most similar individuals to be included in the calculation of the index. Thus we need
450
to decide on the right number of nearest neighbours. We have shown with the simulated data
451
set that the percentage of correctly assigned individuals does not change much for percentiles
452
from P=0.5% to P=5% (Figure 5). For all approaches, the success of the assignment depends
453
on the genetic differentiation among the reference individuals (Ogden and Linacre, 2015). If
454
there is only little genetic differentiation, then it is better to include more individuals in the
455
calculation of the index because the power of the statistical test increases with increasing
456
numbers of reference individuals included. On the other side, large percentiles will increase
457
the mixture of genic distances caused by different processes.
458
Cornuet et al (1999) compared the performance of different assignment methods using
459
simulated data sets. They found that the Bayesian and the frequency approaches outperformed
460
by far different genetic distance approaches. Our study found the opposite. The difference is
461
that Cornuet et al (1999) included all individuals of each population in the analysis, and their
462
simulated data set fulfilled the assumptions of Hardy-Weinberg and linkage equilibrium. Our
463
simulated data set and the Sapelli experimental SNP-data were characterised by an excess of
464
homozygotes in nearly all groups. This excess could be explained by Wahlund effects, thus by
465
a mixture of individuals from different populations in the same groups (Wahlund, 1928).
466
467
Forensic application in timber tracking
468
We have generated and are continuously developing genetic reference data for timber
469
tracking of several high-value tree species (Degen and Bouda, 2015; Degen et al, 2013). A
470
few years ago, we shifted from SSR markers to SNPs because their shortness is an advantage
471
for amplification in the typically degraded DNA of timber (Jardine et al, 2016; Pakull et al,
472
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
18
2016). These tools are developed to reduce illegal logging. As has been pointed out by Ogden
473
and Linacre (2015), wildlife forensics is faced with several drawbacks compared to the ideal
474
situations found in human forensics. It is often a big challenge to obtain a collection of
475
reference material that is well distributed over the geographic range of the target tree species.
476
It is difficult to obtain access to many remote areas and sometimes even dangerous because of
477
unstable political situations in countries where illegal logging is an issue. We often need to
478
collaborate with local groups that help with the collection of material. They receive training
479
but are usually not scientists. Yet even for the botanical specialist it is often impossible to
480
clearly distinguish between closely related tree species based on morphological traits. Thus it
481
is not uncommon for us to have reference samples that contain admixtures of closely related
482
species. This is either due to errors in the field work or the lack of an efficient biological
483
barrier between closely related species. This underlines the importance of assignment routines
484
that are robust against this kind of error and still yield reliable results.
485
486
Access to the program
487
We implemented the approach in a software application called GeoAssign and this is
488
available through a web service at https://geoassign.thuenen.de. The webserver architecture is
489
designed around a Ruby on Rails application that handles multiple jobs via a Sidekiq
490
(http://sidekiq.org) / Redis (http://redis.io) background processing framework. The GeoAssign
491
method itself is implemented in the Ruby programming language and included in the web
492
application as a gem.
493
494
495
Outlook
496
We are currently extending our nearest neighbour approach using a city-block distance
497
for metric traits such as stable isotopes. The idea is to have a single statistical assignment
498
approach for different data sources that should be combined to increase the performance of
499
forensic assignment approaches (Dormontt et al, 2015; Lowe et al, 2016). Applying the same
500
assignment routine would also enable us to better compare the performance of different
501
reference data sets (such as genetic, stable isotope or near infrared data).
502
As is the case for the other assignment approaches, the nearest neighbour approach
503
will always assign the test individual to a group. But it is possible that the true group of origin
504
is not represented by the reference individuals. To deal with this problem, other approaches
505
simulate genotypes based on allele frequencies in the different groups of reference data,
506
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
19
compute the distribution of likelihoods for assignment of these simulated genotypes to β€œtheir”
507
group, and compare the likelihood of test individuals to these distributions and interpret the
508
position in the distribution as an β€œexclusion probability” (Cornuet et al, 1999). Again this
509
approach suffers from the assumptions to be made on the use of allele frequencies to simulate
510
genotypes. To deal with these β€œalien individuals”, we are integrating a routine into our nearest
511
neighbour approach that compares the genetic distance of the test individuals to the
512
distribution of genetic distances among all reference individuals within each group. An outlier
513
would be identified by its relatively large genic distance compared to the distribution of
514
distances among members of the same group.
515
516
517
Acknowledgement
518
The work of CB-J was financially supported by the International Tropical Timber
519
Organization (ITTO) through the project PD 620/11 Rev.1 (M): β€œDevelopment and
520
implementation of species identification and timber tracking in Africa with DNA fingerprints
521
and stable isotopes”. The work of EMG was financially supported by grant ZI 662/7-1 of the
522
Deutsche Forschungsgemeinschaft. Further we thank Malte Mader for the setup and
523
configuration of the LINUX web server. Finally, we are thankful to the two reviewers for
524
their very useful comments and suggestions on a former version of the manuscript.
525
526
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
20
References
527
Cornuet J-M, Piry S, Luikart G, Estoup A, Solignac M (1999). New methods employing multilocus
528
genotypes to select or exclude populations as origins of individuals. Genetics 153(4): 1989-2000.
529
530
Csillery K, Lalague H, Vendramin GG, Gonzalez-Martinez SC, Fady B, Oddou-Muratorio S (2014).
531
Detecting short spatial scale local adaptation and epistatic selection in climate-related candidate
532
genes in European beech (Fagus sylvatica) populations. Molecular Ecology 23(19): 4696-4708.
533
534
Curk F, Ancillo G, Ollitrault F, Perrier X, Jacquemoud-Collet JP, Garcia-Lor A et al (2015). Nuclear
535
Species-Diagnostic SNP Markers Mined from 454 Amplicon Sequencing Reveal Admixture Genomic
536
Structure of Modern Citrus Varieties. Plos One 10(5): 25.
537
538
De La Torre A, Ingvarsson PK, Aitken SN (2015). Genetic architecture and genomic patterns of gene
539
flow between hybridizing species of Picea. Heredity 115(2): 153-164.
540
541
Degen B, Blanc L, Caron H, Maggia L, Kremer A, Gourlet-Fleury S (2006). Impact of selective logging
542
on genetic composition and demographic structure of four tropical tree species. Biological
543
Conservation 131(3): 386-401.
544
545
Degen B, Bouda H (2015). Verifying timber in Africa. ITTO Trop For Update 24(1): 8-10.
546
547
Degen B, Roubik DW (2004). Effects of animal pollination on pollen dispersal, selfing, and effective
548
population size of tropical trees: a simulation study. Biotropica 36(2): 165-179.
549
550
Degen B, Ward S, Lemes M, Navarro C, Cavers S, Sebbenn A (2013). Verifying the geographic origin of
551
mahogany (Swietenia macrophylla King) with DNA-fingerprints. Forensic Science International:
552
Genetics 7(1): 55-62.
553
554
Deguilloux MF, Pemonge MH, Bertel L, Kremer A, Petit RJ (2003). Checking the geographical origin of
555
oak wood: molecular and statistical tools. Molecular Ecology 12(6): 1629-1636.
556
557
Dormontt EE, Boner M, Braun B, Breulmann G, Degen B, Espinoza E et al (2015). Forensic timber
558
identification: It's time to integrate disciplines to combat illegal logging. Biological Conservation 191:
559
790-798.
560
561
Duforet-Frebourg N, Gattepaille LM, Blum MG, Jakobsson M (2015). HaploPOP: a software that
562
improves population assignment by combining markers into haplotypes. BMC bioinformatics 16(1): 1.
563
564
Duminil J, Brown RP, Ewedjs E, Mardulyn P, Doucet JL, Hardy OJ (2013). Large-scale pattern of genetic
565
differentiation within African rainforest trees: insights on the roles of ecological gradients and past
566
climate changes on the evolution of Erythrophleum spp (Fabaceae). Bmc Evolutionary Biology 13: 13.
567
568
Efron B (1983). Estimating the error rate of a prediction rule: improvement on cross-validation.
569
Journal of the American Statistical Association 78(382): 316-331.
570
571
Gattepaille LM, Jakobsson M (2012). Combining markers into haplotypes can improve population
572
structure inference. Genetics 190(1): 159-174.
573
574
Glover K, Skilbrei O, Skaala Ø (2008). Genetic assignment identifies farm of origin for Atlantic salmon
575
Salmo salar escapees in a Norwegian fjord. ICES Journal of Marine Science: Journal du Conseil 65(6):
576
912-920.
577
578
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
21
Gregorius H-R (1978). The concept of genetic diversity and its formal relationship to heterozygosity
579
and genetic distance. Mathematical Biosciences 41(3): 253-271.
580
581
Gregorius H-R (1987). The relationship between the concepts of genetic diversity and differentiation.
582
Theoretical and Applied Genetics 74(3): 397-401.
583
584
Gregorius H, Degen B, Konig A (2007). Problems in the Analysis of Genetic Differentiation Among
585
Populations--A Case Study in Quercus robur. Silvae Genetica 56(3-4): 190-199.
586
587
Gregorius HR (1984). A unique genetic distance. Biometrical Journal 26(1): 13-18.
588
589
Hammer Ø, Harper D, Ryan P (2001). PAST: Paleontological Statistics Software Package for education
590
and data analysis. Palaeontolia Electronica 4.
591
592
Ishida Y, Georgiadis NJ, Hondo T, Roca AL (2013). Triangulating the provenance of African elephants
593
using mitochondrial DNA. Evolutionary Applications 6(2): 253-265.
594
595
Jardine DI, Blanc-Jolivet C, Dixon RRM, Dormontt EE, Dunker B, Gerlach J et al (2016). Development
596
of SNP markers for Ayous (Triplochiton scleroxylon K. Schum) an economically important tree species
597
from tropical West and Central Africa. Conserv Genet Resour 8(2): 129-139.
598
599
Jelic M, Patenkovic A, Skoric M, Misic D, Novicic ZK, Bordacs S et al (2015). Indigenous forests of
600
European black poplar along the Danube River: genetic structure and reliable detection of
601
introgression. Tree Genetics & Genomes 11(5): 14.
602
603
Johnson RN, Wilson-Wilde L, Linacre A (2014). Current and future directions of DNA in wildlife
604
forensic science. Forensic Science International: Genetics 10: 1-11.
605
606
Jolivet C, Degen B (2012). Use of DNA fingerprints to control the origin of sapelli timber
607
(Entandrophragma cylindricum) at the forest concession level in Cameroon. Forensic Science
608
International-Genetics 6(4): 487-493.
609
610
Konnert M, Maurer W, Degen B, Katzel R (2011). Genetic monitoring in forests - early warning and
611
controlling system for ecosystemic changes. iForest 4: 77-81.
612
613
Lemmon AR, Lemmon EM (2012). High-Throughput Identification of Informative Nuclear Loci for
614
Shallow-Scale Phylogenetics and Phylogeography. Syst Biol 61(5): 745-761.
615
616
Lowe AJ, Dormontt EE, Bowie MJ, Degen B, Gardner S, Thomas D et al (2016). Opportunities for
617
Improved Transparency in the Timber Trade through Scientific Verification. BioScience 66(11): 990-
618
998.
619
620
Manel S, Berthoud F, Bellemain E, Gaudeul M, Luikart G, Swenson JE et al (2007). A new individual-
621
based spatial approach for identifying genetic discontinuities in natural populations. Molecular
622
Ecology 16(10): 2031-2043.
623
624
Mariette S, Chagne D, Decroocq S, Vendramin GG, Lalanne C, Madur D et al (2001). Microsatellite
625
markers for Pinus pinaster Ait. Ann For Sci 58(2): 203-206.
626
627
McKernan K, Fujii C, Ziauddin J, Malek J, McEwan P (2002). A high throughput and accurate method
628
for SNP genotyping using Sequenom MassARRAY (TM) system. AMERICAN JOURNAL OF HUMAN
629
GENETICS 71(4): 454-454.
630
631
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
22
Meirmans PG (2012). The trouble with isolation by distance. Molecular Ecology 21(12): 2839-2846.
632
633
Mori GM, Zucchi MI, Souza AP (2015). Multiple-geographic-scale genetic structure of two mangrove
634
tree species: the roles of mating system, hybridization, limited dispersal and extrinsic factors. PloS
635
one 10(2).
636
637
Ng KKS, Lee SL, Tnah LH, Nurul-Farhanah Z, Ng CH, Lee CT et al (2016). Forensic timber identification:
638
a case study of a CITES listed species, Gonystylus bancanus (Thymelaeaceae). Forensic Science
639
International-Genetics 23: 197-209.
640
641
Nurtjahjaningsih ILG, Saito Y, Lian CL, Tsuda Y, Ide Y (2005). Development and characteristics of
642
microsatellite markers in Pinus merkusii. Molecular Ecology Notes 5(3): 552-553.
643
644
Ogden R, Linacre A (2015). Wildlife forensic science: A review of genetic geographic origin
645
assignment. Forensic Science International: Genetics.
646
647
Paetkau D, Calvert W, Stirling I, Strobeck C (1995). Microsatellite analysis of population structure in
648
Canadian polar bears. Molecular Ecology 4(3): 347-354.
649
650
Paetkau D, Slade R, Burden M, Estoup A (2004). Genetic assignment methods for the direct, real‐time
651
estimation of migration rate: a simulation‐based exploration of accuracy and power. Molecular
652
ecology 13(1): 55-65.
653
654
Pakull B, Mader M, Kersten B, Ekue MRM, Dipelet UGB, Paulini M et al (2016). Development of
655
nuclear, chloroplast and mitochondrial SNP markers for Khaya sp. Conserv Genet Resour 8(3): 283-
656
297.
657
658
Petit RJ, Duminil J, Fineschi S, Hampe A, Salvini D, Vendramin GG (2005). Invited review: comparative
659
organization of chloroplast, mitochondrial and nuclear diversity in plant populations. Molecular
660
Ecology 14(3): 689-701.
661
662
Piry S, Alapetite A, Cornuet J-M, Paetkau D, Baudouin L, Estoup A (2004). GENECLASS2: a software for
663
genetic assignment and first-generation migrant detection. Journal of heredity 95(6): 536-539.
664
665
Pritchard JK, Stephens M, Donnelly P (2000). Inference of population structure using multilocus
666
genotype data. Genetics 155(2): 945-959.
667
668
Puckett EE, Eggert LS (2016). Comparison of SNP and microsatellite genotyping panels for spatial
669
assignment of individuals to natal range: A case study using the American black bear (Ursus
670
americanus). Biological Conservation 193: 86-93.
671
672
Pujolar JM, Jacobsen M, Als TD, Frydenberg J, Magnussen E, JΓ³nsson B et al (2014). Assessing
673
patterns of hybridization between North Atlantic eels using diagnostic single-nucleotide
674
polymorphisms. Heredity 112(6): 627-637.
675
676
Rannala B, Mountain JL (1997). Detecting immigration by using multilocus genotypes. Proceedings of
677
the National Academy of Sciences 94(17): 9197-9201.
678
679
Slavov GT, DiFazio SP, Martin J, Schackwitz W, Muchero W, Rodgers-Melnick E et al (2012). Genome
680
resequencing reveals multiscale geographic structure and extensive linkage disequilibrium in the
681
forest tree Populus trichocarpa. New Phytologist 196(3): 713-725.
682
683
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint
23
Thomson AM, Dick CW, Pascoini AL, Dayanandan S (2015). Despite introgressive hybridization, North
684
American birches (Betula spp.) maintain strong differentiation at nuclear microsatellite loci. Tree
685
Genetics & Genomes 11(5): 12.
686
687
Wahlund S (1928). Composition of populations and correlation appearances viewed in relation to the
688
studies of inheritance. Hereditas 11: 65-106.
689
690
Wasser SK, Brown L, Mailand C, Mondol S, Clark W, Laurie C et al (2015). Genetic assignment of large
691
seizures of elephant ivory reveals Africa's major poaching hotspots. Science 349(6243): 84-87.
692
693
Wasser SK, Mailand C, Booth R, Mutayoba B, Kisamo E, Clark B et al (2007). Using DNA to track the
694
origin of the largest ivory seizure since the 1989 trade ban. Proceedings of the National Academy of
695
Sciences of the United States of America 104(10): 4228-4233.
696
697
Wright S (1931). Evolution in Mendelian populations. Genetics 16(2): 97.
698
699
Wright S (1950). Genetical structure of populations. Nature 166: 247-249.
700
701
Wright S (1978). Evolution and the genetics of populations: a treatise in four volumes: Vol. 4:
702
variability within and among natural populations. University of Chicago Press.
703
704
705
was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which
this version posted November 15, 2016.
.
https://doi.org/10.1101/087833
doi:
bioRxiv preprint

More Related Content

Similar to A Nearest Neighbour Approach By Genic Distance To The Assignment Of Individuals To Geographic Origin

McAllister et al
McAllister et alMcAllister et al
McAllister et alHeidi Williams
Β 
Utility of transcriptome sequencing for phylogenetic
Utility of transcriptome sequencing for phylogeneticUtility of transcriptome sequencing for phylogenetic
Utility of transcriptome sequencing for phylogeneticEdizonJambormias2
Β 
Ransbotyn et al PUBLISHED (1)
Ransbotyn et al PUBLISHED (1)Ransbotyn et al PUBLISHED (1)
Ransbotyn et al PUBLISHED (1)Tania Acuna
Β 
Database Of Rose Varieties Eucarpia Leiden 2009
Database Of Rose Varieties Eucarpia Leiden 2009Database Of Rose Varieties Eucarpia Leiden 2009
Database Of Rose Varieties Eucarpia Leiden 2009renesmulders
Β 
Genotyping an invasive vine
Genotyping an invasive vineGenotyping an invasive vine
Genotyping an invasive vineRoy Moger-Reischer
Β 
Plantbiology 2-1009
Plantbiology 2-1009Plantbiology 2-1009
Plantbiology 2-1009henry prakash
Β 
Fuller_etal2015
Fuller_etal2015Fuller_etal2015
Fuller_etal2015Ryan Fuller
Β 
Nature GeNetics  VOLUME 46 NUMBER 10 OCTOBER 2014 1 0 8 9.docx
Nature GeNetics  VOLUME 46  NUMBER 10  OCTOBER 2014 1 0 8 9.docxNature GeNetics  VOLUME 46  NUMBER 10  OCTOBER 2014 1 0 8 9.docx
Nature GeNetics  VOLUME 46 NUMBER 10 OCTOBER 2014 1 0 8 9.docxgemaherd
Β 
Nature GeNetics  VOLUME 46 NUMBER 10 OCTOBER 2014 1 0 8 9.docx
Nature GeNetics  VOLUME 46  NUMBER 10  OCTOBER 2014 1 0 8 9.docxNature GeNetics  VOLUME 46  NUMBER 10  OCTOBER 2014 1 0 8 9.docx
Nature GeNetics  VOLUME 46 NUMBER 10 OCTOBER 2014 1 0 8 9.docxvannagoforth
Β 
D2 analysis & it's Interpretation
D2 analysis & it's InterpretationD2 analysis & it's Interpretation
D2 analysis & it's InterpretationPABOLU TEJASREE
Β 
Identification of Differentially Expressed Genes by unsupervised Learning Method
Identification of Differentially Expressed Genes by unsupervised Learning MethodIdentification of Differentially Expressed Genes by unsupervised Learning Method
Identification of Differentially Expressed Genes by unsupervised Learning Methodpraveena06
Β 
Report- Genome wide association studies.
Report- Genome wide association studies.Report- Genome wide association studies.
Report- Genome wide association studies.Varsha Gayatonde
Β 
Burns_et_al-2016-Molecular_Ecology_Resources
Burns_et_al-2016-Molecular_Ecology_ResourcesBurns_et_al-2016-Molecular_Ecology_Resources
Burns_et_al-2016-Molecular_Ecology_ResourcesMercedes Burns
Β 
6 55 E
6 55 E6 55 E
6 55 ESean Paul
Β 
Association mapping for improvement of agronomic traits in rice
Association mapping  for improvement of agronomic traits in riceAssociation mapping  for improvement of agronomic traits in rice
Association mapping for improvement of agronomic traits in riceSopan Zuge
Β 
Spatial transcriptomics in Plant (for Agriculture)
Spatial transcriptomics in Plant (for Agriculture)Spatial transcriptomics in Plant (for Agriculture)
Spatial transcriptomics in Plant (for Agriculture)EdizonJambormias2
Β 

Similar to A Nearest Neighbour Approach By Genic Distance To The Assignment Of Individuals To Geographic Origin (20)

McAllister et al
McAllister et alMcAllister et al
McAllister et al
Β 
Utility of transcriptome sequencing for phylogenetic
Utility of transcriptome sequencing for phylogeneticUtility of transcriptome sequencing for phylogenetic
Utility of transcriptome sequencing for phylogenetic
Β 
Ransbotyn et al PUBLISHED (1)
Ransbotyn et al PUBLISHED (1)Ransbotyn et al PUBLISHED (1)
Ransbotyn et al PUBLISHED (1)
Β 
Database Of Rose Varieties Eucarpia Leiden 2009
Database Of Rose Varieties Eucarpia Leiden 2009Database Of Rose Varieties Eucarpia Leiden 2009
Database Of Rose Varieties Eucarpia Leiden 2009
Β 
eg.poster
eg.postereg.poster
eg.poster
Β 
Genotyping an invasive vine
Genotyping an invasive vineGenotyping an invasive vine
Genotyping an invasive vine
Β 
Plantbiology 2-1009
Plantbiology 2-1009Plantbiology 2-1009
Plantbiology 2-1009
Β 
Fuller_etal2015
Fuller_etal2015Fuller_etal2015
Fuller_etal2015
Β 
Nature GeNetics  VOLUME 46 NUMBER 10 OCTOBER 2014 1 0 8 9.docx
Nature GeNetics  VOLUME 46  NUMBER 10  OCTOBER 2014 1 0 8 9.docxNature GeNetics  VOLUME 46  NUMBER 10  OCTOBER 2014 1 0 8 9.docx
Nature GeNetics  VOLUME 46 NUMBER 10 OCTOBER 2014 1 0 8 9.docx
Β 
Nature GeNetics  VOLUME 46 NUMBER 10 OCTOBER 2014 1 0 8 9.docx
Nature GeNetics  VOLUME 46  NUMBER 10  OCTOBER 2014 1 0 8 9.docxNature GeNetics  VOLUME 46  NUMBER 10  OCTOBER 2014 1 0 8 9.docx
Nature GeNetics  VOLUME 46 NUMBER 10 OCTOBER 2014 1 0 8 9.docx
Β 
D2 analysis & it's Interpretation
D2 analysis & it's InterpretationD2 analysis & it's Interpretation
D2 analysis & it's Interpretation
Β 
PGX Data Mining
PGX Data MiningPGX Data Mining
PGX Data Mining
Β 
Identification of Differentially Expressed Genes by unsupervised Learning Method
Identification of Differentially Expressed Genes by unsupervised Learning MethodIdentification of Differentially Expressed Genes by unsupervised Learning Method
Identification of Differentially Expressed Genes by unsupervised Learning Method
Β 
Landrace gap analysis - Methodology and bean results
Landrace gap analysis - Methodology and bean resultsLandrace gap analysis - Methodology and bean results
Landrace gap analysis - Methodology and bean results
Β 
Report- Genome wide association studies.
Report- Genome wide association studies.Report- Genome wide association studies.
Report- Genome wide association studies.
Β 
Burns_et_al-2016-Molecular_Ecology_Resources
Burns_et_al-2016-Molecular_Ecology_ResourcesBurns_et_al-2016-Molecular_Ecology_Resources
Burns_et_al-2016-Molecular_Ecology_Resources
Β 
phy prAC.pptx
phy prAC.pptxphy prAC.pptx
phy prAC.pptx
Β 
6 55 E
6 55 E6 55 E
6 55 E
Β 
Association mapping for improvement of agronomic traits in rice
Association mapping  for improvement of agronomic traits in riceAssociation mapping  for improvement of agronomic traits in rice
Association mapping for improvement of agronomic traits in rice
Β 
Spatial transcriptomics in Plant (for Agriculture)
Spatial transcriptomics in Plant (for Agriculture)Spatial transcriptomics in Plant (for Agriculture)
Spatial transcriptomics in Plant (for Agriculture)
Β 

More from Andrea Porter

Taking Notes With Note Cards
Taking Notes With Note CardsTaking Notes With Note Cards
Taking Notes With Note CardsAndrea Porter
Β 
How To Write A TOK Essay 15 Steps (With Pictures) - WikiHow
How To Write A TOK Essay 15 Steps (With Pictures) - WikiHowHow To Write A TOK Essay 15 Steps (With Pictures) - WikiHow
How To Write A TOK Essay 15 Steps (With Pictures) - WikiHowAndrea Porter
Β 
How To Write A Good Topic Sentence In Academic Writing
How To Write A Good Topic Sentence In Academic WritingHow To Write A Good Topic Sentence In Academic Writing
How To Write A Good Topic Sentence In Academic WritingAndrea Porter
Β 
Narrative Essay Argumentative Essay Thesis
Narrative Essay Argumentative Essay ThesisNarrative Essay Argumentative Essay Thesis
Narrative Essay Argumentative Essay ThesisAndrea Porter
Β 
A Good Introduction For A Research Paper. Writing A G
A Good Introduction For A Research Paper. Writing A GA Good Introduction For A Research Paper. Writing A G
A Good Introduction For A Research Paper. Writing A GAndrea Porter
Β 
How To Find The Best Paper Writing Company
How To Find The Best Paper Writing CompanyHow To Find The Best Paper Writing Company
How To Find The Best Paper Writing CompanyAndrea Porter
Β 
Primary Handwriting Paper - Paging Supermom - Free Printable Writing
Primary Handwriting Paper - Paging Supermom - Free Printable WritingPrimary Handwriting Paper - Paging Supermom - Free Printable Writing
Primary Handwriting Paper - Paging Supermom - Free Printable WritingAndrea Porter
Β 
Write An Essay On College Experience In EnglishDesc
Write An Essay On College Experience In EnglishDescWrite An Essay On College Experience In EnglishDesc
Write An Essay On College Experience In EnglishDescAndrea Porter
Β 
Ghost Writing KennyLeeHolmes.Com
Ghost Writing KennyLeeHolmes.ComGhost Writing KennyLeeHolmes.Com
Ghost Writing KennyLeeHolmes.ComAndrea Porter
Β 
How To Write A Good Concluding Observation Bo
How To Write A Good Concluding Observation BoHow To Write A Good Concluding Observation Bo
How To Write A Good Concluding Observation BoAndrea Porter
Β 
How To Write An Argumentative Essay
How To Write An Argumentative EssayHow To Write An Argumentative Essay
How To Write An Argumentative EssayAndrea Porter
Β 
7 Introduction Essay About Yourself - Introduction Letter
7 Introduction Essay About Yourself - Introduction Letter7 Introduction Essay About Yourself - Introduction Letter
7 Introduction Essay About Yourself - Introduction LetterAndrea Porter
Β 
Persuasive Essay Essaypro Writers
Persuasive Essay Essaypro WritersPersuasive Essay Essaypro Writers
Persuasive Essay Essaypro WritersAndrea Porter
Β 
Scholarship Personal Essays Templates At Allbusinesst
Scholarship Personal Essays Templates At AllbusinesstScholarship Personal Essays Templates At Allbusinesst
Scholarship Personal Essays Templates At AllbusinesstAndrea Porter
Β 
How To Create An Outline For An Essay
How To Create An Outline For An EssayHow To Create An Outline For An Essay
How To Create An Outline For An EssayAndrea Porter
Β 
ESSAY - Qualities Of A Good Teach
ESSAY - Qualities Of A Good TeachESSAY - Qualities Of A Good Teach
ESSAY - Qualities Of A Good TeachAndrea Porter
Β 
Fountain Pen Writing On Paper - Castle Rock Financial Planning
Fountain Pen Writing On Paper - Castle Rock Financial PlanningFountain Pen Writing On Paper - Castle Rock Financial Planning
Fountain Pen Writing On Paper - Castle Rock Financial PlanningAndrea Porter
Β 
Formatting A Research Paper E
Formatting A Research Paper EFormatting A Research Paper E
Formatting A Research Paper EAndrea Porter
Β 
Business Paper Examples Of Graduate School Admissio
Business Paper Examples Of Graduate School AdmissioBusiness Paper Examples Of Graduate School Admissio
Business Paper Examples Of Graduate School AdmissioAndrea Porter
Β 
Impact Of Poverty On The Society - Free Essay E
Impact Of Poverty On The Society - Free Essay EImpact Of Poverty On The Society - Free Essay E
Impact Of Poverty On The Society - Free Essay EAndrea Porter
Β 

More from Andrea Porter (20)

Taking Notes With Note Cards
Taking Notes With Note CardsTaking Notes With Note Cards
Taking Notes With Note Cards
Β 
How To Write A TOK Essay 15 Steps (With Pictures) - WikiHow
How To Write A TOK Essay 15 Steps (With Pictures) - WikiHowHow To Write A TOK Essay 15 Steps (With Pictures) - WikiHow
How To Write A TOK Essay 15 Steps (With Pictures) - WikiHow
Β 
How To Write A Good Topic Sentence In Academic Writing
How To Write A Good Topic Sentence In Academic WritingHow To Write A Good Topic Sentence In Academic Writing
How To Write A Good Topic Sentence In Academic Writing
Β 
Narrative Essay Argumentative Essay Thesis
Narrative Essay Argumentative Essay ThesisNarrative Essay Argumentative Essay Thesis
Narrative Essay Argumentative Essay Thesis
Β 
A Good Introduction For A Research Paper. Writing A G
A Good Introduction For A Research Paper. Writing A GA Good Introduction For A Research Paper. Writing A G
A Good Introduction For A Research Paper. Writing A G
Β 
How To Find The Best Paper Writing Company
How To Find The Best Paper Writing CompanyHow To Find The Best Paper Writing Company
How To Find The Best Paper Writing Company
Β 
Primary Handwriting Paper - Paging Supermom - Free Printable Writing
Primary Handwriting Paper - Paging Supermom - Free Printable WritingPrimary Handwriting Paper - Paging Supermom - Free Printable Writing
Primary Handwriting Paper - Paging Supermom - Free Printable Writing
Β 
Write An Essay On College Experience In EnglishDesc
Write An Essay On College Experience In EnglishDescWrite An Essay On College Experience In EnglishDesc
Write An Essay On College Experience In EnglishDesc
Β 
Ghost Writing KennyLeeHolmes.Com
Ghost Writing KennyLeeHolmes.ComGhost Writing KennyLeeHolmes.Com
Ghost Writing KennyLeeHolmes.Com
Β 
How To Write A Good Concluding Observation Bo
How To Write A Good Concluding Observation BoHow To Write A Good Concluding Observation Bo
How To Write A Good Concluding Observation Bo
Β 
How To Write An Argumentative Essay
How To Write An Argumentative EssayHow To Write An Argumentative Essay
How To Write An Argumentative Essay
Β 
7 Introduction Essay About Yourself - Introduction Letter
7 Introduction Essay About Yourself - Introduction Letter7 Introduction Essay About Yourself - Introduction Letter
7 Introduction Essay About Yourself - Introduction Letter
Β 
Persuasive Essay Essaypro Writers
Persuasive Essay Essaypro WritersPersuasive Essay Essaypro Writers
Persuasive Essay Essaypro Writers
Β 
Scholarship Personal Essays Templates At Allbusinesst
Scholarship Personal Essays Templates At AllbusinesstScholarship Personal Essays Templates At Allbusinesst
Scholarship Personal Essays Templates At Allbusinesst
Β 
How To Create An Outline For An Essay
How To Create An Outline For An EssayHow To Create An Outline For An Essay
How To Create An Outline For An Essay
Β 
ESSAY - Qualities Of A Good Teach
ESSAY - Qualities Of A Good TeachESSAY - Qualities Of A Good Teach
ESSAY - Qualities Of A Good Teach
Β 
Fountain Pen Writing On Paper - Castle Rock Financial Planning
Fountain Pen Writing On Paper - Castle Rock Financial PlanningFountain Pen Writing On Paper - Castle Rock Financial Planning
Fountain Pen Writing On Paper - Castle Rock Financial Planning
Β 
Formatting A Research Paper E
Formatting A Research Paper EFormatting A Research Paper E
Formatting A Research Paper E
Β 
Business Paper Examples Of Graduate School Admissio
Business Paper Examples Of Graduate School AdmissioBusiness Paper Examples Of Graduate School Admissio
Business Paper Examples Of Graduate School Admissio
Β 
Impact Of Poverty On The Society - Free Essay E
Impact Of Poverty On The Society - Free Essay EImpact Of Poverty On The Society - Free Essay E
Impact Of Poverty On The Society - Free Essay E
Β 

Recently uploaded

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
Β 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
Β 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
Β 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
Β 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
Β 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
Β 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
Β 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
Β 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
Β 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
Β 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
Β 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
Β 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
Β 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
Β 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Dr. Mazin Mohamed alkathiri
Β 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
Β 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
Β 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
Β 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
Β 

Recently uploaded (20)

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Β 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
Β 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
Β 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Β 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
Β 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
Β 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
Β 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
Β 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
Β 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
Β 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
Β 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
Β 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Β 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
Β 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
Β 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
Β 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Β 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
Β 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
Β 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
Β 

A Nearest Neighbour Approach By Genic Distance To The Assignment Of Individuals To Geographic Origin

  • 1. 1 A nearest neighbour approach by genic distance to the assignment of 1 individuals to geographic origin 2 3 Bernd Degen1 , CΓ©line Blanc-Jolivet1 , Katrin Stierand1 & Elizabeth Gillet2 4 (15/11/2016) 5 6 1 ThΓΌnen Institute of Forest Genetics, Sieker Landstrasse 2, 22927 Grosshansdorf, Germany, 7 E-mail: bernd.degen@thuenen.de 8 9 2 Forest Genetics and Forest Tree Breeding, Faculty of Forest Sciences and Forest Ecology, 10 Georg-August-University of GΓΆttingen, BΓΌsgenweg 2, 37077 GΓΆttingen, Germany 11 12 Abstract 13 During the past decade, the use of DNA for forensic applications has been extensively 14 implemented for plant and animal species, as well as in humans. Tracing back the 15 geographical origin of an individual usually requires genetic assignment analysis. These 16 approaches are based on reference samples that are grouped into populations or other 17 aggregates and intend to identify the most likely group of origin. Often this grouping does not 18 have a biological but rather a historical or political justification, such as β€œcountry of origin”. 19 In this paper, we present a new nearest neighbour approach to individual assignment 20 or classification within a given but potentially imperfect grouping of reference samples. This 21 method, which is based on the genic distance between individuals, functions better in many 22 cases than commonly used methods. We demonstrate the operation of our assignment method 23 using two data sets. One set is simulated for a large number of trees distributed in a 120 km 24 by 120 km landscape with individual genotypes at 150 SNPs, and the other set comprises 25 experimental data of 1221 individuals of the African tropical tree species Entandrophragma 26 cylindricum (Sapelli) genotyped at 61 SNPs. Judging by the level of correct self-assignment, 27 our approach outperformed the commonly used frequency and Bayesian approaches by 15% 28 for the simulated data set and by 5 to 7% for the Sapelli data set. 29 Our new approach is less sensitive to overlapping sources of genetic differentiation, 30 such as genic differences among closely-related species, phylogeographic lineages and 31 isolation by distance, and thus operates better even for suboptimal grouping of individuals. 32 33 Key words: Forensics, genetic assignment, genic distance, GeoAssign, geographic origin, 34 timber, trees 35 36 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 2. 2 Introduction 37 In recent years, the application of forensic methods based on genetic markers to assign 38 individual plants and animals to their geographic origin (Johnson et al, 2014; Ogden and 39 Linacre, 2015) has gained importance for the control of trade regulations and consumer 40 protection. Whereas many animal and plant species are protected by the Convention on 41 International Trade in Endangered Species of Wild Fauna and Flora (CITES), the protection 42 level can depend on the country of origin (species listed in CITES annex 3). For the timber 43 trade, regulations such as the US Lacey Act and the EU timber regulation require the trader to 44 declare the geographic origin of the material. Therefore it is important to be able to determine 45 the true origin of the individuals that are traded (Johnson et al, 2014). 46 In different areas of the world, genetic reference data bases have been developed to aid 47 in the identification of the true geographic origin of timber (Degen et al, 2013; Ng et al, 48 2016). To this end, reference samples of individual trees are collected throughout the natural 49 distribution area of the species or in specific target regions. For each individual, the data base 50 records the geographic coordinates of the individual and its genotype at gene loci representing 51 different types of gene marker, especially molecular markers of the types nSSR, cp-DNA, mt- 52 DNA and more recently SNP (Deguilloux et al, 2003; Ishida et al, 2013; Jolivet and Degen, 53 2012; Puckett and Eggert, 2016). Usually, more than one individual is sampled at each 54 reference location, and reference locations and their sampled individuals are aggregated a 55 priori into groups (or classes) by non-genetic criteria such as country of origin. The objective 56 is to assign an individual of unknown origin, such as imported timber, to that group of 57 reference individuals to which its multilocus genotype conforms best by some criterion. 58 Examples for the assignment of individuals to place of origin have been published for timber 59 tree species, elephants, and salmon (Degen et al, 2006; Glover et al, 2008; Wasser et al, 2015; 60 Wasser et al, 2007). 61 Most of the previously applied approaches to the assignment of an individual of 62 unknown origin to its putative group are frequency-based (Paetkau et al, 1995) and/or 63 Bayesian (Rannala and Mountain, 1997). These approaches estimate the allele frequencies at 64 all loci in every group, compute the probability that the multilocus genotype of the individual 65 in question originates from each group, and assign the individual to that group for which the 66 probability of origin is highest (Ogden and Linacre, 2015). Application of these approaches is 67 based on assumptions that (a) the structure of the groups fits to Wright’s Island Model of drift 68 (Wright, 1931), (b) the alleles at each locus are in Hardy-Weinberg-Proportions (HWP) (i.e., 69 no stochastic association between alleles at the same locus), and (c) that the loci are in linkage 70 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 3. 3 equilibrium or are completely unlinked (i.e., no stochastic association between the genotypes 71 between loci). Unfortunately, these assumptions are often violated in real populations. For 72 example, molecular markers such as nSSRs often show deviation from HWP in at all spatial 73 scales (Mariette et al, 2001; Nurtjahjaningsih et al, 2005). Also the use of large SNP marker 74 sets derived from Next Generation Sequencing often leads to linkage disequilibrium (LD) in 75 the data, which would require pruning process of loci in the data (Duforet-Frebourg et al, 76 2015). To overcome this problem and to avoid loss in the Gain of Informativeness for 77 Assignment (GIA), other methods combine some loci into haplotypes (Duforet-Frebourg et al, 78 2015). Indeed, the use of haplotypes provides information on the ancestry and recombination 79 events, which are useful when differentiation among groups is low. However, these methods 80 are mostly interesting when dense marker sets are used (Gattepaille and Jakobsson, 2012), and 81 it is not clear how these methods handle missing data, which is unfortunately the main issue 82 with genotyping of timber or other material with degraded DNA. 83 The success of these assignment methods depends on the extent of genetic 84 differentiation among the groups (Ogden and Linacre, 2015). The sampling scheme and the 85 variability of the spatial genetic structure have an additional impact on the success of the 86 assignment methods, especially if the groups represent political units such as countries. 87 When pre-defined aggregation to groups does not reflect the genetic structure among 88 the reference individuals, genetic assignment approaches based on allele frequencies may fail 89 (Meirmans, 2012). Grouping according to political (e.g. country) borders instead of genetic 90 boundaries between populations as reproductive units is particularly critical and may lead to 91 confusing genetic mixtures within groups (Manel et al, 2007). In the case that reference 92 groups contain individuals of more than one reproductively isolated deme, such as different 93 regions or even cryptic sympatric species, the mean genetic difference among individuals of 94 different demes should be larger than the mean genetic difference among individuals of the 95 same deme (Konnert et al, 2011). The ability of markers to identify such demes differs, 96 however. For instance, the incomplete lineage sorting or chloroplast capture that can be 97 observed among closely-related species at chloroplast markers (Duminil et al, 2013) could 98 hinder attempts to recognize these species in the reference data. Also, the genetic 99 differentiation among the groups of reference samples is biased by the proportion of species 100 mixture or the mixing of individuals from different phylogeographic lineages (e.g. refugia) 101 within the groups or by the assignment of individuals from the same spatial genetic unit (e.g. 102 cross-border populations) to different groups. 103 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 4. 4 The opposite approach to the a priori specification of groups is to attempt to partition 104 individuals into reference groups that show HWP and linkage equilibrium. Bayesian 105 clustering approaches, as implemented for example in the program STRUCTURE (Pritchard 106 et al, 2000), have become common in recent years. These aggregations are, however, also 107 based on estimates of allele frequencies that could be biased by unequal sample sizes among 108 genetic groups or violation of the assumption that the populations fit Wright’s Island Model 109 with relatively small, clearly differentiated populations (Meirmans, 2012). Another major 110 drawback of the partitioning of reference individuals into genetic groups by Bayesian 111 clustering methods is that the genetic groups they detect may be spread over more than one 112 country. This collides with existing legislation that requires declaration of the country of 113 origin, the political borders of which may cut through the middle of a genetic group, making it 114 difficult to issue a statement from genetic testing on how likely this declaration is. 115 This paper describes a nearest neighbour classification approach that assigns 116 unclassified individuals to predefined classes of reference individuals, such as by the country 117 of origin that is of relevance for the timber trade or any CITES listed species with different 118 country restrictions. The distance between individuals is measured by the genic distance 119 between their multilocus genotypes, defining the nearest neighbours of a specific individual as 120 those individuals with the smallest genic distance to it. An unclassified individual is assigned 121 to a particular class if this class has the highest representation among a limited set of nearest 122 neighbours by a new index πΌπ‘Ÿ and if this representation is statistically significant. Since 123 assignment is not based on estimation of allele frequencies within entire reference classes, the 124 approach avoids the problems of possible discrepancy between political and genetic 125 boundaries described above. We demonstrate its application using two data sets, a large set of 126 simulated data for a hypothetical tropical tree species and experimental data of an African tree 127 species. When applied to test whether individuals of known origin are correctly assigned to 128 their class, it turns out that the probability of correct self-assignment is better than for 129 conventional methods based on allele frequencies. 130 131 Material and methods 132 Nearest neighbour classification of individuals to origin 133 Consider the situation in which 𝑅 reference individuals (index 𝑗) have been subdivided 134 a priori into G different groups that correspond to their region (e.g. country) of origin. The 135 objective is to assign each of a set of T test individuals of unknown origin to that group to 136 which it is genetically most similar by the following procedure. 137 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 5. 5 For a specified set of 𝑀 gene loci, genetic similarity is first calculated on a pairwise 138 basis between the test individual 𝑖 and each reference individual 𝑗 as the proportion of genes 139 they share over the loci, that is, as 1 βˆ’ 𝐷𝑖𝑗 , where 140 0 ≀ 𝐷𝑖𝑗 = 1 𝑀 βˆ‘ 1 2 βˆ‘|π‘Žπ‘–π‘™π‘˜βˆ’π‘Žπ‘—π‘™π‘˜| 𝑁𝑙 π‘˜=1 𝑀 𝑙=1 ≀ 1 is the proportion of non-shared alleles (Gregorius, 1978). The index π‘˜ denotes the alleles 1 to 141 𝑁𝑙 at locus 𝑙. π‘Žπ‘–π‘™π‘˜ is the frequency (or dosage) of the π‘˜-th allele at locus 𝑙 in the test individual 142 𝑖, and π‘Žπ‘—π‘™π‘˜ is the frequency of the same allele in reference individual 𝑗. For two diploid 143 individuals, the frequencies of alleles at any locus can equal 1 (homozygote for that allele), 144 0.5 (heterozygote with one copy of the allele), or 0 (the allele is absent). 𝐷𝑖𝑗 is a direct 145 measure of the genetic difference between two individuals. Compared to most other genetic 146 distances, 𝐷𝑖𝑗 has the additional advantages that (a) it is a metric distance, (b) it ranges 147 between 0 for the sharing of all alleles and 1 for the sharing of no alleles, and (c) it has a 148 linear geometric conception (Gregorius, 1984). 149 All reference individuals 𝑗 are then listed in ascending order of their genic distance 𝐷𝑖𝑗 150 to test individual 𝑖. If the number of loci is small, many reference individuals may have the 151 same distance from the test individual, but as the number of loci is increased, the number of 152 reference individuals with the same distance generally decreases until all distances are unique. 153 Each number in the interval [0,1] specifies a neighbourhood of the test individual 𝑖 as the set 154 of all reference individuals 𝑗 that have genic distance 𝐷𝑖𝑗 less than or equal to this number. 155 Members of one neighbourhood are also members of any larger neighbourhood. The π‘˜ nearest 156 neighbours to test individual 𝑖 by this measure are the π‘˜ reference individuals that have the π‘˜ 157 smallest genic differences to test individual 𝑖. 158 In its most basic form, the π‘˜-nearest neighbour algorithm for classification would 159 assign 𝑖 to the group that has the highest number of reference individuals among its π‘˜ nearest 160 neighbors (β€œmajority vote”). Its application here requires solution of the following problems: 161 how to determine an β€œoptimal” π‘˜, how to deal with groups containing different numbers of 162 reference individuals, and how to test for significance of an assignment. 163 For consistency, the different numbers of reference individuals that have been sampled 164 for different species, say, could be accommodated by specifying π‘˜ as a recommended 165 percentage 𝑃 of all reference individuals. Alternatively, π‘˜ could be specified as the number of 166 reference individuals found within a neighbourhood of recommended size (critical genic 167 distance), and test individuals whose critical neighbourhood is empty (π‘˜ = 0) would not be 168 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 6. 6 assigned to any group (outliers). Such a critical distance could also be chosen as the first 169 β€œjump” in the cumulative frequency distribution of 1 βˆ’ 𝐷𝑖𝑗 over the reference individuals 𝑗, 170 separating genically similar reference individuals from genically dissimilar ones. For now, we 171 will speak of the chosen individuals of highest genic similarity to the test individual as the β€œπ‘˜ 172 nearest neighbours”. 173 Simple assignment of the test individual to the largest group among the nearest 174 neighbours (β€œmajority vote”) ignores the possibility that the original groups of reference 175 individuals can be of different sizes, giving groups with more reference individuals a bigger 176 chance to β€œwin”. In order to compensate for group size, we define an index πΌπ‘Ÿ that weights the 177 relative frequency π‘œπ‘Ÿ (i.e., proportion) of each group π‘Ÿ among the π‘˜ nearest neighbours by the 178 relative frequency π‘’π‘Ÿ of group π‘Ÿ among all reference individuals 179 βˆ’1 ≀ πΌπ‘Ÿ = π‘œπ‘Ÿ βˆ’ π‘’π‘Ÿ π‘’π‘Ÿ < 0 π‘“π‘œπ‘Ÿ π‘œπ‘Ÿ < π‘’π‘Ÿ πΌπ‘Ÿ = 0 π‘“π‘œπ‘Ÿ π‘œπ‘Ÿ = π‘’π‘Ÿ 0 < πΌπ‘Ÿ = π‘œπ‘Ÿ βˆ’ π‘’π‘Ÿ 1 βˆ’ π‘’π‘Ÿ < 1 π‘“π‘œπ‘Ÿ π‘œπ‘Ÿ > π‘’π‘Ÿ It holds that πΌπ‘Ÿ = βˆ’1 whenever group π‘Ÿ is not represented among the nearest neighbours, 180 πΌπ‘Ÿ = 0 if the frequencies π‘œπ‘Ÿ and π‘’π‘Ÿ are equal, and πΌπ‘Ÿ = 1 whenever all nearest neighbours 181 belong to group π‘Ÿ. Unless the relative frequencies of all groups among the nearest neighbours 182 are exactly the same as among all reference individuals, in which case πΌπ‘Ÿ = 0 for all groups 183 (and 𝑁 is divisible by π‘˜), at least one group π‘Ÿ must have an index πΌπ‘Ÿ greater than 0 and 184 another group an index less than 0. 185 Selecting the group π‘Ÿ with the highest value of πΌπ‘Ÿ to be the major candidate for the 186 origin of the test individual corresponds to the majority rule weighted by group size. The 187 statistical significance of the higher relative frequency of group π‘Ÿ among nearest neighbours 188 compared to the set of all reference individuals can be tested by classifying reference 189 individuals into two classes (i.e., members of group π‘Ÿ and members of any other group) and 190 calculating the P-value as the probability of finding at least the observed number π‘˜π‘Ÿ = π‘˜ βˆ™ π‘œπ‘Ÿ 191 of individuals from group π‘Ÿ within a random sample of size π‘˜ drawn without replacement 192 from the set of all 𝑁 reference individuals containing π‘π‘Ÿ individuals of group π‘Ÿ, that is, 193 𝑃(𝑋 β‰₯ π‘˜π‘Ÿ) = βˆ‘ ( π‘π‘Ÿ π‘š ) ( 𝑁 βˆ’ π‘π‘Ÿ π‘˜ βˆ’ π‘š ) ( 𝑁 π‘˜ ) π‘˜ π‘š=π‘˜π‘Ÿ was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 7. 7 by the hypergeometric distribution, where ( π‘Ž 𝑏 ) = (π‘Ž! 𝑏! (π‘Ž βˆ’ 𝑏)! ⁄ ) is the binomial coefficient. 194 A P-value 𝑃(𝑋 β‰₯ π‘˜π‘Ÿ) less than a chosen level of significance (commonly 0.05) is an 195 indication that the observed representation of group π‘Ÿ among the nearest neighbors is 196 significantly high. It cannot be ruled out that another group with a smaller positive value of 197 the index also shows significantly high representation among the nearest neighbours, making 198 it an additional candidate for the origin of the test individual. This could happen, for example, 199 if the test individual originates from a cross-border population. It must, however, be 200 considered that hypergeometric tests of different candidate groups are not independent of each 201 other. 202 By the same reasoning, a P-value 𝑃(𝑋 β‰₯ π‘˜π‘Ÿ) greater than 0.95 indicates that this 203 group has a significantly low representation among the nearest neighbours of the test 204 individual and should be excluded as a candidate for the origin. 205 206 Test of performance 207 We then compared the performance of this new individual genic distance assignment 208 method with the so far commonly used genetic assignment methods based on frequency 209 (Paetkau et al, 2004) and the Bayesian (Rannala and Mountain, 1997) approach criteria, 210 referred below as frequency and Bayesian approaches. 211 As an indicator of the performance, we computed the proportion of correctly assigned 212 individuals in self-assignments tests (Cornuet et al, 1999). Here the individuals of the 213 reference data were self-classified to the sampled groups using the leave-one-out approach 214 (Efron, 1983). For the frequency and the Bayesian approaches, we applied the self-assignment 215 routine implemented in the program GeneClass 2.0 (Piry et al, 2004). 216 217 Genetic diversity, Genetic differentiation and Hardy-Weinberg-Proportion 218 The success of genetic assignment depends on the level of genetic diversity, genetic 219 differentiation among the groups and the similarity of the genotype frequencies to Hardy- 220 Weinberg-Proportions. 221 As a measure for genetic diversity (effective number of alleles) we computed the mean 222 allelic diversity v of group π‘Ÿ (Gregorius, 1987): 223 1 ≀ π‘£π‘Ÿ = 1 𝑀 βŒˆβˆ‘ π‘π‘Ÿπ‘˜π‘– 2 π‘›π‘˜ 𝑖=1 βŒ‰ βˆ’1 ≀ π‘›π‘˜ was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 8. 8 With relative frequency π‘π‘Ÿπ‘˜π‘– of the 𝑖-th allele at the π‘˜-th locus in group π‘Ÿ and π‘›π‘˜ different 224 alleles at locus π‘˜. 225 226 To measure the difference between groups, we calculate the gene pool distance 𝑑0 227 between two groups πΊπ‘Ÿ and πΊπ‘Ÿβ€² as 228 𝑑0(πΊπ‘Ÿ, πΊπ‘Ÿβ€²) = 1 𝐿 βˆ‘ 1 2 𝑙 βˆ‘|π‘π‘–π‘™π‘Ÿ βˆ’ π‘π‘–π‘™π‘Ÿβ€²| 𝑖 where the index 𝑙 runs through the 𝐿 loci, the index 𝑖 runs through the alleles at locus 𝑙, and 229 π‘π‘–π‘™π‘Ÿ and π‘π‘–π‘™π‘Ÿβ€² are the relative frequency of the 𝑖-th allele at locus 𝑙 in group πΊπ‘Ÿ and πΊπ‘Ÿβ€², 230 respectively (Gregorius, 1984). 𝑑0 ranges between 0 and 1. 𝑑0 =0 holds when the relative 231 frequency of every allele at every locus is the same in both groups. 𝑑0 = 1 holds when the 232 groups have no allele in common at any locus. To visualize difference between groups, cluster 233 analysis based on the pairwise gene pool distances 𝑑0 between groups was performed using 234 the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) as implemented in the 235 software PAST (Hammer et al, 2001). 236 Based on the gene pool distance, the complementary compositional differentiation 𝛿𝑆𝐷 237 among the gene pools of the groups is then defined as the mean gene pool distance between 238 each group πΊπ‘Ÿ and its complement πΊπ‘Ÿ Μ…Μ…Μ…, where πΊπ‘Ÿ Μ…Μ…Μ… consists of all reference individuals that do 239 not belong to group π‘Ÿ (Gregorius, 1987), that is, 240 𝛿𝑆𝐷 = 1 𝑅 βˆ‘ 𝑑0(πΊπ‘Ÿ, πΊπ‘Ÿ Μ…Μ…Μ…) π‘Ÿ where the index π‘Ÿ runs through the 𝑅 groups, and π‘π‘–π‘™π‘Ÿ Μ…Μ…Μ…Μ…Μ… is the relative frequency of the 𝑖-th 241 allele in the complement πΊπ‘Ÿ Μ…Μ…Μ…. 𝛿𝑆𝐷 ranges between 0 and 1. 𝛿𝑆𝐷 = 0 holds when the relative 242 frequency of every allele at every locus is the same in all groups. 𝛿𝑆𝐷 = 1 holds when the 243 groups have no allele in common at any locus. 244 For comparison, we also calculated the commonly used Wright’s FST (Wright, 1978) 245 which is a measure of fixation (monomorphism) and not of the difference among groups 246 (Gregorius et al, 2007). 247 248 As an indicator for the departure from Hardy-Weinberg proportions we computed the 249 inbreeding coefficient 𝐹𝑖𝑠 within each group π‘Ÿ for which π»π‘’π‘Ÿπ‘˜ > 0 holds (Wright, 1950) as 250 𝐹𝑖𝑠 = 1 𝑀 βˆ‘ (π»π‘’π‘Ÿπ‘˜ βˆ’ π»π‘œπ‘Ÿπ‘˜) π»π‘’π‘Ÿπ‘˜ ⁄ 𝑀 π‘˜=1 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 9. 9 Here π»π‘’π‘Ÿπ‘˜ = 1 βˆ’ βˆ‘ π‘π‘Ÿπ‘˜π‘– 2 π‘›π‘˜ 𝑖=1 is the expected heterozygosity in group π‘Ÿ at locus π‘˜ under the 251 hypothesis of random mating (i.e., the Simpson diversity) and π»π‘œπ‘Ÿπ‘˜ is the observed 252 heterozygosity (i.e., the relative frequency of heterozygotes). 253 254 Simulated data 255 The simulation model β€œEco-Gene Landscape” was used to create a baseline population 256 for a tropical tree species in an area of 120 km by 120 km and its spatial genetic structure at 257 150 nuclear Single Nucleotide Polymorphism = nSNPs after 700 years of re-colonisation. The 258 simulation model includes pollen and seed dispersal, mating, growth and mortality (Degen et 259 al, 2006). The 120 km x 120 km landscape was subdivided into grid cells of side lengths 1 km 260 x 1 km (100 ha). Each grid cell represented a different population that is connected via pollen 261 and seed dispersal with the surrounding populations. It was assumed that 60 % of the grid 262 cells could be potentially occupied by the target tree species. This equals to 8640 populations 263 in the simulated area. Simulation was initiated with 200 trees in each of only 10 grid cells to 264 represent the refugia before re-colonisation. A mean individual growth rate of 0.5 cm 265 diameter increment per year with a standard deviation of 0.2 was assumed. All trees with a 266 diameter of 35 cm were fertile. Pollen dispersal was simulated following two normal 267 distributions with means of 0 m and standard deviations of 500 m and 1500 m with a 268 probability of 0.7 and 0.3, respectively. The resulting pollen dispersal pattern is typical for an 269 insect pollinated species (Degen and Roubik, 2004). Random fertilization was assumed, with 270 the exception that seeds from self-fertilization were inviable. Seed dispersal was also 271 simulated by two normal distributions with means of 0 m and standard deviations of 100 m 272 and 1500 m with probabilities of 0.7 and 0.3, respectively. Using these parameters, the 273 majority of seeds were distributed within 300 m around the seed trees. Further, in accordance 274 with field inventories typical for tropical trees, a maximum density of trees per ha in different 275 diameter classes was considered. These maximum densities represented carrying capacities 276 and caused mortality if by seed production, migration or ingrowth the densities were 277 exceeded. The model thus created overlapping generations. 278 A total of 888,841 individuals were generated over all populations during 700 279 simulated years. At that time, 6212 grid cells contained populations. The number of 280 individuals in each population varied from 2 to 206 with an average of 156. We assumed that 281 every SNP locus has 2 alleles (A1, A2). The initial allele frequencies were randomly selected 282 with 0<A1<1 for all initial populations (i.e., uniform distribution). To create reference data, 283 we selected 10 areas, each with a size of 10 km by 10 km (Figure 1). In each of these areas we 284 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 10. 10 randomly selected a reference data set of 100 individuals. In order to imitate the effect of 285 species mixture, in area 5 we replaced allele β€œ2” by an allele β€œ3” in all individuals, and in 286 areas 6 and 7 we did the same for 50% of the individuals. Thus the areas 1, 2, 3, 4, 8, 9, and 287 10 are genetically purely species A, area 5 is purely species B, and the areas 6 and 7 are 288 mixtures of 50% species A and 50% species B. 289 290 291 292 Figure 1: Simulated data set; red and brown points are grid cells covered by trees after 700 293 simulated years of re-colonisation. Blue points = location of initial refugial areas. Green 294 squares mark selected areas of species A, yellow squares mark selected areas of species B, 295 and yellow-green squares are selected areas with 50% mixtures of species A and B 296 297 Experimental data on Sapelli 298 Sapelli is the trade name of the botanical species Entandrophragma cylindicum of the 299 same family as mahogany (Meliaceae). Since decades, this high value timber tree is one of the 300 most harvested tree species in the natural forests of tropical Africa. In frame of a project of 301 the International Tropical Timber Organization (ITTO), we developed a genetic reference 302 data set in order to check claims on the geographic origin of the traded timber (Degen and 303 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 11. 11 Bouda, 2015). A total of 1292 Sapelli trees were collected in seven African countries (Figure 304 2). Then, SNPs were developed and all individuals were screened at 61 SNPs. Two Sapelli 305 samples from Cameroon and DRC were used as reference for RAD (Restriction Site 306 Associated DNA) sequencing and SNP detection (Pujolar et al, 2014). A total of 140 putative 307 SNP loci were screened with a MassARRAY technology (Agena Biosciences, San Diego, 308 California, USA) on a set of 190 reference individuals chosen from different countries among 309 the reference samples (McKernan et al, 2002). Estimates of genetic differentiation and 310 correlations among genetic and geographic distances per locus were used together with 311 amplification success rates to select 61 loci for reference sample screening. Details are 312 presented elsewhere (Blanc-Jolivet and Degen, in preparation). 313 314 315 Figure 2: Location of the sampled trees of Entandrophragma sp. in Africa 316 317 Results 318 Simulated data 319 After 700 years of simulated re-colonisation, the allelic diversity (vr) in the selected 320 areas varied between 1.52 and 1.58 for the individuals belonging to species A and in the same 321 range for individuals of species B. The fixation index equalled FST = 0.053 and the 322 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 12. 12 compositional differentiation 𝛿𝑆𝐷 = 0.082 for species A and FST = 0.045 and 𝛿𝑆𝐷 = 0.106 for 323 species B. For all individuals of both species together, the measures among the 10 selected 324 areas equalled FST = 0.171 and 𝛿𝑆𝐷 = 0.196. Both measures clearly indicated a stronger 325 fixation and differentiation between the species compared to the within species 326 differentiation. 327 Spatial-genetic analysis reveals a clear spatial structure within species A showing that 328 individuals up to a distance of 20 km are genetically more similar than expected for a random 329 distribution (Figure 3). 330 331 332 333 Figure 3: Spatial genetic structure of the simulated data set for species A. Dis_O is the 334 observed mean genic distance among individuals in the given spatial distance class. Dis_E is 335 the expected genic distance for a random distribution of individuals in space. 336 337 The dominating species signal is visible in the genetic differentiation among the areas 338 (Figure 4). The areas covered only by species A (green) form one group in the dendrogram 339 and the areas with pure species B (yellow) and species mixture (light green) form another 340 group. 341 342 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 13. 13 343 Figure 4: Dendrogram based on a cluster analysis (UPGMA: Unweighted Pair Group Method 344 with Arithmetic Mean) using the genic distance matrix among the allele frequencies of the ten 345 selected areas in the simulated data set. The numbers are the proportions of the 1000 bootstrap 346 simulations for shifting the included SNP loci that find the same branch. 347 348 In addition, results for the self-assignment is quite stable over a range of percentiles of 349 nearest neighbours (P= 0.5% to 10%, Figure 5). 350 351 352 Figure 5: Proportion of correct self-assignments for different percentiles (P) of the genically 353 most similar reference individuals (nearest neighbours) in the simulated data set 354 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 14. 14 The frequency and the Bayesian approach assigned 70% of all 1000 individuals to the 355 correct area, whereas our nearest neighbour approach assigned 85% of the individuals 356 correctly (Table 1). The largest differences were observed for the areas 6 and 7 with the 357 simulated species mixture of individuals. There, the frequency and the Bayesian approach did 358 not assign a single individual correctly but the nearest neighbour approach accurately 359 assigned 95% for area 6 and 82% for area 7. The frequency and Bayesian approach performed 360 slightly better than the nearest neighbour approach in areas (areas 2, 3, 9) with lower genetic 361 differentiation (lower values of 𝛿𝑆𝐷) and heterozygosity values close to Hardy-Weinberg- 362 Proportion (Fis values close to 0), whereas our nearest neighbour approach came up with 363 better results for areas with stronger genetic differentiation and stronger departure from 364 Hardy-Weinberg-Heterozygosity (areas 1, 6, 7, 8). 365 366 Area N 𝛿𝑆𝐷 Fis Nearest neighbour approach (P1) % Frequency approach % Bayesian approach % Area 1 100 0.170 0.026 99 93 94 Area 2 100 0.142 0.040 67 77 77 Area 3 100 0.147 0.027 81 91 91 Area 4 100 0.174 0.035 93 95 96 Area 5 100 0.463 0.029 100 100 100 Area 6 100 0.228 0.340 95 0 0 Area 7 100 0.197 0.304 82 0 0 Area 8 100 0.156 0.057 87 84 85 Area 9 100 0.140 0.013 67 78 78 Area 10 100 0.143 0.014 74 79 79 Total / Mean 1000 85 70 70 367 Table 1: Results of the self-assignments for the simulated data set using the different 368 assignment approaches, N = sample size, 𝛿𝑆𝐷 = complementary compositional 369 differentiation, Fis = mean inbreeding coefficient over all loci 370 371 372 Experimental data on Sapelli 373 The samples from Ghana and Ivory Coast were merged in one group because of the 374 small sample sizes and the geographic neighbourhood. The mean allelic diversity (vr) varied 375 between 1.39 and 1.59 in the different countries. The mean fixation was FST = 0.092 and the 376 mean compositional differentiation 𝛿𝑆𝐷 = 0.126 (0.078 βˆ’ 0.194). There was a clear spatial 377 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 15. 15 structure showing that individuals up to a distance of 500 km are genically more similar than 378 expected for a random distribution (data not shown). The analysis with STRUCTURE came 379 up with a clear indication that a significant proportion of individuals represent other species 380 within the genus Entandrophragma (Blanc-Jolivet and Degen, in preparation). 381 The frequency and the Bayesian approach assigned 68% and 66% of all individuals to 382 the correct area, whereas the nearest neighbour approach assigned 74% of the individuals 383 correctly (Table 2). 384 Large differences among the methods were observed for individuals from Cameroon. 385 Here, the nearest neighbour approach assigned 77% of the individuals correctly, while the 386 frequency and the Bayesian approach assigned only 65% and 61% correctly. This was the 387 group with the largest sample size. Also for samples from DRC with the second largest 388 sample size the nearest neighbour approach was better. 389 As for the simulated data, the nearest neighbour approach was better in groups with 390 high genetic differentiation and stronger departure from Hardy-Weinberg heterozygosity 391 (DRC, Ghana/ Ivory Coast) and the other approaches performed better in situations with 392 lower differentiation and F-values close to 0 (Gabon). 393 394 Country N 𝛿𝑆𝐷 Fis Nearest neighbour approach (P=1) % Frequency approach % Bayesian approach % Cameroon 430 0.094 0.234 77 65 61 Congo Brazzaville 140 0.096 0.269 44 51 50 DRC 419 0.170 0.439 86 81 78 Gabon 36 0.078 0.233 3 19 22 Ghana / Ivory Coast 89 0.194 0.312 74 71 67 Total/ Mean 1114 74 68 66 395 Table 2: Results of the self-assignments for the Sapelli data set using the different assignment 396 approaches, all data that had at least 50% of the loci were included. N = sample size, 𝛿𝑆𝐷 = 397 complementary compositional differentiation, Fis = mean inbreeding coefficient over all loci 398 399 Discussion 400 Realistic simulated data and typical large scale SNP data set of a tree species? 401 So far, not many data sets have been published on large scale genetic differentiation of 402 tree species at SNP gene markers, and most of them are looking for SNP loci linked to 403 selection (Csillery et al, 2014). The large scale genetic structure at the 150 nuclear SNPs in 404 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 16. 16 the simulated data was the result of non-selective processes (gene flow, mating system) and 405 the imitated effect of a mixture of two closely related species. The level of genetic 406 differentiation and fixation we observed in the Sapelli data set (𝛿𝑆𝐷 = 0.123, FST= 0.092) lay 407 between the pure within-species differentiation and fixation (𝛿𝑆𝐷 = 0.083, FST= 0.053) and 408 the overall genetic differentiation and fixation of the simulated data set (𝛿𝑆𝐷 = 0.196, FST= 409 0.171). A similar level of fixation (FST=0.15) has been reported for a large SNP data sets of 410 Populus trichocarpa in North America (Slavov et al, 2012). The overlapping of different 411 genetic signals such as species hybridisation and mixture, phylogeographic pattern and 412 isolation-by-distance effects responsible for the genetic differentiation of tree populations has 413 been found in several recent studies (De La Torre et al, 2015; Jelic et al, 2015; Thomson et al, 414 2015). We imitated the species signal by replacing one of two alleles by a third allele, keeping 415 one allele at each locus in common among species. This might look rather β€œartificial”, but in 416 nature separation of related species of the same genus is visible by the same observation 417 (mixture of species-specific alleles and common alleles (see e. g. for different citrus species 418 (Curk et al, 2015) and for different frog species (Lemmon and Lemmon, 2012)). 419 420 Performance of assignment methods 421 We demonstrate by the simulated and the real data set of the African tree species 422 Sapelli that our nearest neighbour approach had a higher percentage of correctly assigned 423 individuals compared to the frequency and Bayesian methods. The reason for this better 424 performance was that our calculated index takes into account only the genically most similar 425 individuals to the reference data. Thus, it reduces the bias of overlapping genetic signals 426 because there is a hierarchy of genic similarity (Mori et al, 2015; Petit et al, 2005). 427 Individuals of the same species (or the same level of hybridisation), from the same 428 phylogeographic lineage and the same geographic region are the most similar. On the 429 opposite side, individuals from different species show the highest genic differences. Of 430 course, one could use Bayesian clustering analysis to obtain a genetic grouping of reference 431 data that minimizes departure from HWP within each group (Pritchard et al, 2000), thereby 432 limiting problems due to population sub-structure. But for many applications – especially as 433 part of law enforcement and forensic investigations – the investigator needs to come to a 434 conclusion for a given (non-biological) grouping of reference individuals. For example, the 435 question for application of the EU-timber regulation is to confirm or not that the timber is 436 from a Sapelli tree grown in Cameroon. In this case, the data must be aggregated by countries. 437 438 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 17. 17 The better performance of the nearest neighbour approach was particularly visible for groups 439 with high sample sizes, higher genetic differentiation and an excess of homozygotes 440 compared to Hardy-Weinberg expectations. The excess of homozygotes could be caused by a 441 subpopulation structure within the groups (Wahlund effects). The power of our nearest 442 neighbour approach depends on a sufficient sample size from all of the different 443 subpopulations represented in the reference data. On the other hand, there were also groups 444 within these data sets for which the Bayesian and the frequency approach had higher success 445 rates for the self-assignments. These were groups with lower genetic differentiation and 446 heterozygosity close to Hardy-Weinberg-expectations. 447 448 In our individual distance approach, a decision is still needed on the percentile of 449 genically most similar individuals to be included in the calculation of the index. Thus we need 450 to decide on the right number of nearest neighbours. We have shown with the simulated data 451 set that the percentage of correctly assigned individuals does not change much for percentiles 452 from P=0.5% to P=5% (Figure 5). For all approaches, the success of the assignment depends 453 on the genetic differentiation among the reference individuals (Ogden and Linacre, 2015). If 454 there is only little genetic differentiation, then it is better to include more individuals in the 455 calculation of the index because the power of the statistical test increases with increasing 456 numbers of reference individuals included. On the other side, large percentiles will increase 457 the mixture of genic distances caused by different processes. 458 Cornuet et al (1999) compared the performance of different assignment methods using 459 simulated data sets. They found that the Bayesian and the frequency approaches outperformed 460 by far different genetic distance approaches. Our study found the opposite. The difference is 461 that Cornuet et al (1999) included all individuals of each population in the analysis, and their 462 simulated data set fulfilled the assumptions of Hardy-Weinberg and linkage equilibrium. Our 463 simulated data set and the Sapelli experimental SNP-data were characterised by an excess of 464 homozygotes in nearly all groups. This excess could be explained by Wahlund effects, thus by 465 a mixture of individuals from different populations in the same groups (Wahlund, 1928). 466 467 Forensic application in timber tracking 468 We have generated and are continuously developing genetic reference data for timber 469 tracking of several high-value tree species (Degen and Bouda, 2015; Degen et al, 2013). A 470 few years ago, we shifted from SSR markers to SNPs because their shortness is an advantage 471 for amplification in the typically degraded DNA of timber (Jardine et al, 2016; Pakull et al, 472 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 18. 18 2016). These tools are developed to reduce illegal logging. As has been pointed out by Ogden 473 and Linacre (2015), wildlife forensics is faced with several drawbacks compared to the ideal 474 situations found in human forensics. It is often a big challenge to obtain a collection of 475 reference material that is well distributed over the geographic range of the target tree species. 476 It is difficult to obtain access to many remote areas and sometimes even dangerous because of 477 unstable political situations in countries where illegal logging is an issue. We often need to 478 collaborate with local groups that help with the collection of material. They receive training 479 but are usually not scientists. Yet even for the botanical specialist it is often impossible to 480 clearly distinguish between closely related tree species based on morphological traits. Thus it 481 is not uncommon for us to have reference samples that contain admixtures of closely related 482 species. This is either due to errors in the field work or the lack of an efficient biological 483 barrier between closely related species. This underlines the importance of assignment routines 484 that are robust against this kind of error and still yield reliable results. 485 486 Access to the program 487 We implemented the approach in a software application called GeoAssign and this is 488 available through a web service at https://geoassign.thuenen.de. The webserver architecture is 489 designed around a Ruby on Rails application that handles multiple jobs via a Sidekiq 490 (http://sidekiq.org) / Redis (http://redis.io) background processing framework. The GeoAssign 491 method itself is implemented in the Ruby programming language and included in the web 492 application as a gem. 493 494 495 Outlook 496 We are currently extending our nearest neighbour approach using a city-block distance 497 for metric traits such as stable isotopes. The idea is to have a single statistical assignment 498 approach for different data sources that should be combined to increase the performance of 499 forensic assignment approaches (Dormontt et al, 2015; Lowe et al, 2016). Applying the same 500 assignment routine would also enable us to better compare the performance of different 501 reference data sets (such as genetic, stable isotope or near infrared data). 502 As is the case for the other assignment approaches, the nearest neighbour approach 503 will always assign the test individual to a group. But it is possible that the true group of origin 504 is not represented by the reference individuals. To deal with this problem, other approaches 505 simulate genotypes based on allele frequencies in the different groups of reference data, 506 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 19. 19 compute the distribution of likelihoods for assignment of these simulated genotypes to β€œtheir” 507 group, and compare the likelihood of test individuals to these distributions and interpret the 508 position in the distribution as an β€œexclusion probability” (Cornuet et al, 1999). Again this 509 approach suffers from the assumptions to be made on the use of allele frequencies to simulate 510 genotypes. To deal with these β€œalien individuals”, we are integrating a routine into our nearest 511 neighbour approach that compares the genetic distance of the test individuals to the 512 distribution of genetic distances among all reference individuals within each group. An outlier 513 would be identified by its relatively large genic distance compared to the distribution of 514 distances among members of the same group. 515 516 517 Acknowledgement 518 The work of CB-J was financially supported by the International Tropical Timber 519 Organization (ITTO) through the project PD 620/11 Rev.1 (M): β€œDevelopment and 520 implementation of species identification and timber tracking in Africa with DNA fingerprints 521 and stable isotopes”. The work of EMG was financially supported by grant ZI 662/7-1 of the 522 Deutsche Forschungsgemeinschaft. Further we thank Malte Mader for the setup and 523 configuration of the LINUX web server. Finally, we are thankful to the two reviewers for 524 their very useful comments and suggestions on a former version of the manuscript. 525 526 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 20. 20 References 527 Cornuet J-M, Piry S, Luikart G, Estoup A, Solignac M (1999). New methods employing multilocus 528 genotypes to select or exclude populations as origins of individuals. Genetics 153(4): 1989-2000. 529 530 Csillery K, Lalague H, Vendramin GG, Gonzalez-Martinez SC, Fady B, Oddou-Muratorio S (2014). 531 Detecting short spatial scale local adaptation and epistatic selection in climate-related candidate 532 genes in European beech (Fagus sylvatica) populations. Molecular Ecology 23(19): 4696-4708. 533 534 Curk F, Ancillo G, Ollitrault F, Perrier X, Jacquemoud-Collet JP, Garcia-Lor A et al (2015). Nuclear 535 Species-Diagnostic SNP Markers Mined from 454 Amplicon Sequencing Reveal Admixture Genomic 536 Structure of Modern Citrus Varieties. Plos One 10(5): 25. 537 538 De La Torre A, Ingvarsson PK, Aitken SN (2015). Genetic architecture and genomic patterns of gene 539 flow between hybridizing species of Picea. Heredity 115(2): 153-164. 540 541 Degen B, Blanc L, Caron H, Maggia L, Kremer A, Gourlet-Fleury S (2006). Impact of selective logging 542 on genetic composition and demographic structure of four tropical tree species. Biological 543 Conservation 131(3): 386-401. 544 545 Degen B, Bouda H (2015). Verifying timber in Africa. ITTO Trop For Update 24(1): 8-10. 546 547 Degen B, Roubik DW (2004). Effects of animal pollination on pollen dispersal, selfing, and effective 548 population size of tropical trees: a simulation study. Biotropica 36(2): 165-179. 549 550 Degen B, Ward S, Lemes M, Navarro C, Cavers S, Sebbenn A (2013). Verifying the geographic origin of 551 mahogany (Swietenia macrophylla King) with DNA-fingerprints. Forensic Science International: 552 Genetics 7(1): 55-62. 553 554 Deguilloux MF, Pemonge MH, Bertel L, Kremer A, Petit RJ (2003). Checking the geographical origin of 555 oak wood: molecular and statistical tools. Molecular Ecology 12(6): 1629-1636. 556 557 Dormontt EE, Boner M, Braun B, Breulmann G, Degen B, Espinoza E et al (2015). Forensic timber 558 identification: It's time to integrate disciplines to combat illegal logging. Biological Conservation 191: 559 790-798. 560 561 Duforet-Frebourg N, Gattepaille LM, Blum MG, Jakobsson M (2015). HaploPOP: a software that 562 improves population assignment by combining markers into haplotypes. BMC bioinformatics 16(1): 1. 563 564 Duminil J, Brown RP, Ewedjs E, Mardulyn P, Doucet JL, Hardy OJ (2013). Large-scale pattern of genetic 565 differentiation within African rainforest trees: insights on the roles of ecological gradients and past 566 climate changes on the evolution of Erythrophleum spp (Fabaceae). Bmc Evolutionary Biology 13: 13. 567 568 Efron B (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. 569 Journal of the American Statistical Association 78(382): 316-331. 570 571 Gattepaille LM, Jakobsson M (2012). Combining markers into haplotypes can improve population 572 structure inference. Genetics 190(1): 159-174. 573 574 Glover K, Skilbrei O, Skaala Ø (2008). Genetic assignment identifies farm of origin for Atlantic salmon 575 Salmo salar escapees in a Norwegian fjord. ICES Journal of Marine Science: Journal du Conseil 65(6): 576 912-920. 577 578 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 21. 21 Gregorius H-R (1978). The concept of genetic diversity and its formal relationship to heterozygosity 579 and genetic distance. Mathematical Biosciences 41(3): 253-271. 580 581 Gregorius H-R (1987). The relationship between the concepts of genetic diversity and differentiation. 582 Theoretical and Applied Genetics 74(3): 397-401. 583 584 Gregorius H, Degen B, Konig A (2007). Problems in the Analysis of Genetic Differentiation Among 585 Populations--A Case Study in Quercus robur. Silvae Genetica 56(3-4): 190-199. 586 587 Gregorius HR (1984). A unique genetic distance. Biometrical Journal 26(1): 13-18. 588 589 Hammer Ø, Harper D, Ryan P (2001). PAST: Paleontological Statistics Software Package for education 590 and data analysis. Palaeontolia Electronica 4. 591 592 Ishida Y, Georgiadis NJ, Hondo T, Roca AL (2013). Triangulating the provenance of African elephants 593 using mitochondrial DNA. Evolutionary Applications 6(2): 253-265. 594 595 Jardine DI, Blanc-Jolivet C, Dixon RRM, Dormontt EE, Dunker B, Gerlach J et al (2016). Development 596 of SNP markers for Ayous (Triplochiton scleroxylon K. Schum) an economically important tree species 597 from tropical West and Central Africa. Conserv Genet Resour 8(2): 129-139. 598 599 Jelic M, Patenkovic A, Skoric M, Misic D, Novicic ZK, Bordacs S et al (2015). Indigenous forests of 600 European black poplar along the Danube River: genetic structure and reliable detection of 601 introgression. Tree Genetics & Genomes 11(5): 14. 602 603 Johnson RN, Wilson-Wilde L, Linacre A (2014). Current and future directions of DNA in wildlife 604 forensic science. Forensic Science International: Genetics 10: 1-11. 605 606 Jolivet C, Degen B (2012). Use of DNA fingerprints to control the origin of sapelli timber 607 (Entandrophragma cylindricum) at the forest concession level in Cameroon. Forensic Science 608 International-Genetics 6(4): 487-493. 609 610 Konnert M, Maurer W, Degen B, Katzel R (2011). Genetic monitoring in forests - early warning and 611 controlling system for ecosystemic changes. iForest 4: 77-81. 612 613 Lemmon AR, Lemmon EM (2012). High-Throughput Identification of Informative Nuclear Loci for 614 Shallow-Scale Phylogenetics and Phylogeography. Syst Biol 61(5): 745-761. 615 616 Lowe AJ, Dormontt EE, Bowie MJ, Degen B, Gardner S, Thomas D et al (2016). Opportunities for 617 Improved Transparency in the Timber Trade through Scientific Verification. BioScience 66(11): 990- 618 998. 619 620 Manel S, Berthoud F, Bellemain E, Gaudeul M, Luikart G, Swenson JE et al (2007). A new individual- 621 based spatial approach for identifying genetic discontinuities in natural populations. Molecular 622 Ecology 16(10): 2031-2043. 623 624 Mariette S, Chagne D, Decroocq S, Vendramin GG, Lalanne C, Madur D et al (2001). Microsatellite 625 markers for Pinus pinaster Ait. Ann For Sci 58(2): 203-206. 626 627 McKernan K, Fujii C, Ziauddin J, Malek J, McEwan P (2002). A high throughput and accurate method 628 for SNP genotyping using Sequenom MassARRAY (TM) system. AMERICAN JOURNAL OF HUMAN 629 GENETICS 71(4): 454-454. 630 631 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 22. 22 Meirmans PG (2012). The trouble with isolation by distance. Molecular Ecology 21(12): 2839-2846. 632 633 Mori GM, Zucchi MI, Souza AP (2015). Multiple-geographic-scale genetic structure of two mangrove 634 tree species: the roles of mating system, hybridization, limited dispersal and extrinsic factors. PloS 635 one 10(2). 636 637 Ng KKS, Lee SL, Tnah LH, Nurul-Farhanah Z, Ng CH, Lee CT et al (2016). Forensic timber identification: 638 a case study of a CITES listed species, Gonystylus bancanus (Thymelaeaceae). Forensic Science 639 International-Genetics 23: 197-209. 640 641 Nurtjahjaningsih ILG, Saito Y, Lian CL, Tsuda Y, Ide Y (2005). Development and characteristics of 642 microsatellite markers in Pinus merkusii. Molecular Ecology Notes 5(3): 552-553. 643 644 Ogden R, Linacre A (2015). Wildlife forensic science: A review of genetic geographic origin 645 assignment. Forensic Science International: Genetics. 646 647 Paetkau D, Calvert W, Stirling I, Strobeck C (1995). Microsatellite analysis of population structure in 648 Canadian polar bears. Molecular Ecology 4(3): 347-354. 649 650 Paetkau D, Slade R, Burden M, Estoup A (2004). Genetic assignment methods for the direct, real‐time 651 estimation of migration rate: a simulation‐based exploration of accuracy and power. Molecular 652 ecology 13(1): 55-65. 653 654 Pakull B, Mader M, Kersten B, Ekue MRM, Dipelet UGB, Paulini M et al (2016). Development of 655 nuclear, chloroplast and mitochondrial SNP markers for Khaya sp. Conserv Genet Resour 8(3): 283- 656 297. 657 658 Petit RJ, Duminil J, Fineschi S, Hampe A, Salvini D, Vendramin GG (2005). Invited review: comparative 659 organization of chloroplast, mitochondrial and nuclear diversity in plant populations. Molecular 660 Ecology 14(3): 689-701. 661 662 Piry S, Alapetite A, Cornuet J-M, Paetkau D, Baudouin L, Estoup A (2004). GENECLASS2: a software for 663 genetic assignment and first-generation migrant detection. Journal of heredity 95(6): 536-539. 664 665 Pritchard JK, Stephens M, Donnelly P (2000). Inference of population structure using multilocus 666 genotype data. Genetics 155(2): 945-959. 667 668 Puckett EE, Eggert LS (2016). Comparison of SNP and microsatellite genotyping panels for spatial 669 assignment of individuals to natal range: A case study using the American black bear (Ursus 670 americanus). Biological Conservation 193: 86-93. 671 672 Pujolar JM, Jacobsen M, Als TD, Frydenberg J, Magnussen E, JΓ³nsson B et al (2014). Assessing 673 patterns of hybridization between North Atlantic eels using diagnostic single-nucleotide 674 polymorphisms. Heredity 112(6): 627-637. 675 676 Rannala B, Mountain JL (1997). Detecting immigration by using multilocus genotypes. Proceedings of 677 the National Academy of Sciences 94(17): 9197-9201. 678 679 Slavov GT, DiFazio SP, Martin J, Schackwitz W, Muchero W, Rodgers-Melnick E et al (2012). Genome 680 resequencing reveals multiscale geographic structure and extensive linkage disequilibrium in the 681 forest tree Populus trichocarpa. New Phytologist 196(3): 713-725. 682 683 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint
  • 23. 23 Thomson AM, Dick CW, Pascoini AL, Dayanandan S (2015). Despite introgressive hybridization, North 684 American birches (Betula spp.) maintain strong differentiation at nuclear microsatellite loci. Tree 685 Genetics & Genomes 11(5): 12. 686 687 Wahlund S (1928). Composition of populations and correlation appearances viewed in relation to the 688 studies of inheritance. Hereditas 11: 65-106. 689 690 Wasser SK, Brown L, Mailand C, Mondol S, Clark W, Laurie C et al (2015). Genetic assignment of large 691 seizures of elephant ivory reveals Africa's major poaching hotspots. Science 349(6243): 84-87. 692 693 Wasser SK, Mailand C, Booth R, Mutayoba B, Kisamo E, Clark B et al (2007). Using DNA to track the 694 origin of the largest ivory seizure since the 1989 trade ban. Proceedings of the National Academy of 695 Sciences of the United States of America 104(10): 4228-4233. 696 697 Wright S (1931). Evolution in Mendelian populations. Genetics 16(2): 97. 698 699 Wright S (1950). Genetical structure of populations. Nature 166: 247-249. 700 701 Wright S (1978). Evolution and the genetics of populations: a treatise in four volumes: Vol. 4: 702 variability within and among natural populations. University of Chicago Press. 703 704 705 was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which this version posted November 15, 2016. . https://doi.org/10.1101/087833 doi: bioRxiv preprint