Avoiding potential biases in ses.PD estimations with the Picante software package
Rafael Molina-Venegas¹
¹Institute of Plant Sciences, University of Bern, Altenbergrain 21, Bern 3013, Switzerland.
Contact: [email protected]
Running title: avoiding biases in ses.PD estimations

bioRxiv preprint doi: https://doi.org/10.1101/579300; this version posted March 15, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).
Abstract
1. Faith’s phylogenetic diversity (PD) is one of the most widely used indices of phylogenetic structure in the eco-phylogenetic literature. The metric became notably popular with the publication of the function pd as part of the Picante R package, which is nowadays a reference software for phylogenetic analyses.
2. Because PD is not statistically independent of species richness, the routine procedure is to standardize the observed PD values for unequal richness across samples. The function ses.pd, which is also implemented in the Picante R package, is the reference function to conduct such standardization.
3. Unfortunately, I have detected an anomaly in the function that may result in biased estimates of standardized PD values, particularly in communities with low species richness (i.e. fewer than four species) and unbalanced phylogenies.
4. I conduct a simple simulation exercise to illustrate the issue and propose two alternative, easy-to-implement solutions to work around the problem.

Keywords: Phylogenetic diversity; Picante; ses.pd; standardization.

Introduction
Faith’s phylogenetic diversity (PD; Faith, 1992), defined as the sum of all branch lengths connecting taxa in a sample, is one of the most widely used indices of phylogenetic structure in the eco-phylogenetic literature. The metric became notably popular with the publication of the function pd as part of the Picante software (Kembel et al., 2010), which is nowadays a reference R package for phylogenetic analyses. The pd function takes three arguments: (i) “samp”, a community data matrix (samples in rows and species in columns), (ii) “tree”, a phylo tree object including all the species in the community data matrix, and (iii) “include.root”, a logical argument (default = TRUE). If the latter is set to TRUE, the PD of every sample in the community data matrix includes the distance between the most recent common ancestor (MRCA) of the species in the sample and the root of the supplied phylogeny (hereafter “MRCA – root” distance). Otherwise, the MRCA – root distance is excluded from the computations.
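
The effect of the include.root argument can be illustrated with a minimal sketch on a toy tree. The code below is written in Python and does not call Picante; the tree, its tip names and the function faith_pd are hypothetical and serve only to show how the MRCA – root distance enters the sum of branch lengths. Returning zero for a single-species sample when the root is excluded is an assumption of this sketch, not a statement about how pd behaves.

```python
# Toy rooted tree: each node maps to (parent, length of the branch to the parent).
# Names are hypothetical; every root-to-tip distance equals 1.
TREE = {
    "A": ("n1", 0.25),
    "B": ("n1", 0.25),
    "C": ("n2", 0.5),
    "D": ("root", 1.0),
    "n1": ("n2", 0.25),
    "n2": ("root", 0.5),
}


def path_to_root(tip):
    """Return the branches (named by their child node) from a tip up to the root."""
    branches = []
    node = tip
    while node in TREE:
        branches.append(node)
        node = TREE[node][0]
    return branches


def faith_pd(sample, include_root=True):
    """Sum of branch lengths connecting the tips in `sample`.

    With include_root=True the branches between the sample's MRCA and the
    root are counted; with include_root=False they are dropped.
    """
    paths = [path_to_root(tip) for tip in sample]
    branches = set().union(*paths)
    if not include_root:
        # Branches shared by every tip lie between the MRCA and the root.
        shared = set(paths[0]).intersection(*paths[1:])
        branches -= shared
    return sum(TREE[b][1] for b in branches)


# A and B are sister tips far from the root: the two settings differ.
print(faith_pd(["A", "B"], include_root=True))
print(faith_pd(["A", "B"], include_root=False))
# A and D join at the root: the MRCA - root distance is zero.
print(faith_pd(["A", "D"], include_root=True))
print(faith_pd(["A", "D"], include_root=False))
```

For the sister pair A and B the two settings differ by the full MRCA – root distance, whereas for A and D, whose MRCA is the root itself, both settings return the same value.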
Importantly, PD is not statistically independent of species richness, and the routine procedure is to standardize the observed PD values for unequal richness across samples (see documentation for the pd function in the Picante R package, Kembel et al., 2010). Typically, the observed PD is compared to a null distribution of PD values generated by shuffling species names across the phylogenetic tips a large number of times (e.g. 999 times). The function ses.pd, which is also implemented in the Picante software, is the reference function to conduct such standardization. In addition to the arguments of pd, ses.pd includes multiple null models that can be used to generate null distributions (the default model is “taxa.labels”, which shuffles taxa labels across the phylogenetic tips). Since ses.pd calls pd internally, the user can specify whether the MRCA – root distance should be included in the calculation of the observed PD and the
corresponding null PD values. However, I have noted that ses.pd computes null PD values without including the MRCA – root distance regardless of the logical value that is specified in the include.root argument. That is, if include.root = TRUE (the default), the observed PD will include the MRCA – root distance, but the null PD values will not (see Supplementary Material). Unfortunately, this anomaly in the ses.pd function may result in biased estimates of standardized PD values (hereafter “ses.PD”), particularly in samples with low species richness. This is because the lower the species richness in the samples, the lower the probability that the phylogenetic branches connecting species in the null samples traverse the root node of the supplied phylogeny, and therefore the higher the impact of excluding the MRCA – root distance from the computations when it should be included (i.e. when include.root = TRUE). Here, I conducted a simple simulation exercise to illustrate this issue.

Materials and Methods
I simulated four different community data matrices with n = 50 samples (rows) and m = 25, 50, 100 and 200 species (columns), respectively. Then, I used the function pbtree implemented in the phytools R package (Revell, 2012) to simulate 500 pure-birth phylogenies (root-to-tip distance scaled to unity) of m = 25, 50, 100 and 200 tips, respectively (2000 phylogenies in total), representing the species in the community data matrices. Finally, I simulated four different community datasets per community data matrix (each community data matrix represents a different species pool) with fixed species richness within datasets (i.e. equal row sums). To do so, I assigned 2, 4, 8 and 16 species, respectively, to each of the samples in the community data matrices by randomly picking species from the corresponding pools. For each dataset (16 in total), I computed ses.PD values for the samples using the 500 simulated phylogenies and the ses.pd
function as implemented in Picante (Kembel et al., 2010; hereafter “ses.pd-Picante”). The argument include.root was set to TRUE, and null distributions were generated using the taxa.labels model and 999 randomizations. Then, I reanalyzed the data using a corrected version of the ses.pd-Picante function that actually includes the MRCA – root distance in all the computations if the argument include.root is set to TRUE. Both functions are identical in all other respects (see Supplementary Material). I used the same seed (random number generator) to analyze the data with both functions. Finally, I compared the ses.PD values derived from each function using cross-validation R-squared (R²CV) (Molina-Venegas, Moreno-Saiz, Castro, Davies, Peres-Neto & Rodríguez, 2018). R²CV = 1 indicates a perfect match between the ses.PD values obtained from both functions, and R²CV < 1 indicates an imperfect match; R²CV ranges from 1 to minus infinity. Since I used the same seed to analyze the data with both functions, the randomization pattern was preserved, and therefore R²CV equals 1 if both functions yield identical results. The R code to reproduce all the analyses along with the corrected version of the ses.pd-Picante function is provided as Supplementary Material. All analyses were conducted using Picante version 1.7 (latest version) and R version 3.4.3 (R Core Team, 2017), yet results were the same regardless of the version of the package (the ses.pd function was first implemented in Picante version 0.7-2 and released on the CRAN repository in July 2009).
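
As a rough illustration of the comparison metric, the Python sketch below computes a cross-validation R-squared under the assumption that it takes the common form 1 − SS_res / SS_tot, with the values from the corrected function treated as the reference. Both the formula and the function name r2_cv are assumptions of this sketch; the exact definition used in the analyses is the one given in Molina-Venegas et al. (2018) and in the Supplementary Material.

```python
def r2_cv(reference, predicted):
    """Cross-validation R-squared, assumed here to be 1 - SS_res / SS_tot.

    `reference` holds the values taken as the benchmark (here, the ses.PD
    values from the corrected function) and `predicted` the values being
    compared against them. The score is 1 for a perfect match and has no
    lower bound.
    """
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


# Identical vectors give a perfect score; a reversed vector gives a negative one.
print(r2_cv([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(r2_cv([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

Unlike a squared correlation, this score penalizes any departure from the 1:1 line, which is why it can fall below zero when the two sets of values disagree strongly.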

Results and Discussion
I found substantial mismatch between the ses.PD values derived from the ses.pd-Picante function and its corrected version, particularly at low species richness and regardless of the size of the species pool (Figs. 1 and S1 in Appendix 1). More specifically, the ses.pd-Picante function yielded higher ses.PD values than expected (i.e.
above the 1:1 line) in the negative side of the distribution (Figs. 2 and S2 in Appendix 1). However, results derived from both functions rapidly converged following an exponential trend as species richness increased (Figs. 1 and S1 in Appendix 1), suggesting that the ses.pd-Picante function will introduce biases only when species richness is very low (i.e. fewer than four species). Fortunately, such low richness levels are rare in natural communities, yet they are occasionally reported along with their ses.PD values (e.g. Mennes, Moerland, Rath, Smets & Merckx, 2015; Geedicke, Schultz, Rudolph & Oldeland, 2016; Nowakowski, Frishkoff, Thompson, Smith & Todd, 2018). In addition, some simulation analyses have reported ses.PD values for samples including only two species (e.g. Mazel et al., 2016), and diversity experiments often include plots with very few species (e.g. Symstad et al., 2003).
Figs. 3 and S3 show that mismatches were more likely to occur with highly unbalanced trees (i.e. those with internal nodes defining divergent lineages of unequal size). This is because the higher the imbalance of the phylogeny, the lower the probability that the phylogenetic branches connecting species in the null samples traverse the root node of the supplied phylogeny, and therefore the higher the impact of excluding the MRCA – root distance when it should be included. Given the unbalanced nature of most real phylogenies, I conclude that future studies can avoid potential biases in ses.PD estimations (particularly in communities with very low species richness) by either removing the MRCA – root distance from all the computations conducted by the ses.pd-Picante function (i.e. setting the include.root argument to FALSE) or using its corrected version if the MRCA – root distance is to be included (see Supplementary Material).
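
The reasoning about richness and imbalance can be made concrete with a short calculation. The Python sketch below is illustrative only and rests on stated assumptions: species in a null sample are drawn uniformly at random without replacement from the tips, the root splits the tips into two clades of given sizes, and the Colless' index is scaled by its maximum, (n − 1)(n − 2)/2, which may differ from the scaling used for Fig. 3. The function names and the nested-tuple tree encoding are hypothetical.

```python
from math import comb


def p_spans_root(left, right, k):
    """Probability that k tips drawn at random include both root clades.

    `left` and `right` are the numbers of tips on each side of the root.
    The sample misses the root only if all k tips fall in a single clade.
    """
    n = left + right
    return 1.0 - (comb(left, k) + comb(right, k)) / comb(n, k)


def colless(tree):
    """Return (number of tips, sum of |n_left - n_right| over internal nodes).

    A tree is a nested 2-tuple; anything that is not a tuple is a tip.
    """
    if not isinstance(tree, tuple):
        return 1, 0
    n_left, c_left = colless(tree[0])
    n_right, c_right = colless(tree[1])
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)


def colless_scaled(tree):
    """Colless' index divided by its maximum for n tips (requires n >= 3)."""
    n, total = colless(tree)
    return total / ((n - 1) * (n - 2) / 2)


# A 25-tip pool with a nearly even root split versus a very uneven one.
for k in (2, 4, 8, 16):
    print(k, p_spans_root(12, 13, k), p_spans_root(1, 24, k))

# Fully unbalanced (caterpillar) and fully balanced four-tip trees.
print(colless_scaled(((("A", "B"), "C"), "D")))
print(colless_scaled((("A", "B"), ("C", "D"))))
```

Under these assumptions the probability of traversing the root increases with species richness and decreases as the root split becomes more uneven, which is the pattern invoked above to explain where the mismatches concentrate.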

Acknowledgments
I thank the Scientific Computation Centre of Andalusia (CICA) for the computing services they provided.

Supporting Information
Appendix 1. R code used for the analyses.

REFERENCES
Faith, D. P. (1992). Conservation evaluation and phylogenetic diversity. Biological Conservation, 61, 1–10. doi:10.1016/0006-3207(92)91201-3
Geedicke, I., Schultz, M., Rudolph, B., & Oldeland, J. (2016). Phylogenetic clustering found in lichen but not in plant communities in European heathlands. Community Ecology, 17, 216–224. doi:10.1556/168.2016.17.2.10
Kembel, S. W., Cowan, P. D., Helmus, M. R., Cornwell, W. K., Morlon, H., Ackerly, D. D., … Webb, C. O. (2010). Picante: R tools for integrating phylogenies and ecology. Bioinformatics, 26, 1463–1464.
Mazel, F., Davies, T. J., Gallien, L., Renaud, J., Groussin, M., Münkemüller, T., & Thuiller, W. (2016). Influence of tree shape and evolutionary time-scale on phylogenetic diversity metrics. Ecography, 39, 913–920. doi:10.1111/ecog.01694
Mennes, C. B., Moerland, M. S., Rath, M., Smets, E. F., & Merckx, V. S. F. T. (2015). Evolution of mycoheterotrophy in Polygalaceae: The case of Epirixanthes. American Journal of Botany, 102, 598–608. doi:10.3732/ajb.1400549
Molina-Venegas, R., Moreno-Saiz, J. C., Castro Parga, I., Davies, T. J., Peres-Neto, P. R., & Rodríguez, M. Á. (2018). Assessing among-lineage variability in phylogenetic imputation of functional trait datasets. Ecography, 41, 1740–1749. doi:10.1111/ecog.03480
Nowakowski, A. J., Frishkoff, L. O., Thompson, M. E., Smith, T. M., & Todd, B. D. (2018). Phylogenetic homogenization of amphibian assemblages in human-altered habitats across the globe. Proceedings of the National Academy of Sciences, 115, E3454–E3462. doi:10.1073/pnas.1714891115
R Core Team (2017). R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Revell, L. J. (2012). phytools: an R package for phylogenetic comparative biology (and other things). Methods in Ecology and Evolution, 3, 217–223.
Symstad, A. J., Chapin, F. S., Wall, D. H., Gross, K. L., Huenneke, L. F., Mittelbach, G. G., … Tilman, D. (2003). Long-term and large-scale perspectives on the relationship between biodiversity and ecosystem functioning. BioScience, 53, 89–98. doi:10.1641/0006-3568(2003)053[0089:LTALSP]2.0.CO;2

Figure 1. Violin plot showing the cross-validation R-squared scores for the comparisons between ses.PD values derived from the function ses.pd-Picante (Kembel et al., 2010) and its corrected version. Analyses were conducted using datasets with species richness = 2, 4, 8 and 16, respectively, 500 simulated phylogenies and a species pool (community data matrix) of 25 species (see Fig. S1 in Appendix 1 for results derived from community data matrices with 50, 100 and 200 species).

Figure 2. Scatter plots showing the relationship between ses.PD values (25,000 per plot) derived from the function ses.pd-Picante (Kembel et al., 2010, y-axis) and its corrected version (x-axis). Analyses were conducted using datasets with species richness = 2, 4, 8 and 16, respectively, 500 simulated phylogenies and a species pool of 25 species (see Fig. S2 for results derived from species pools of 50, 100 and 200 species). The grey lines represent the expected 1:1 relationship.

Figure 3. Relationship between cross-validation R-squared scores (y-axis) and tree imbalance (i.e. Colless’ index, x-axis). Analyses were conducted using datasets with species richness = 2, 4, 8 and 16, respectively, 500 simulated phylogenies and a species pool of 25 species (see Fig. S3 for results derived from species pools of 50, 100 and 200 species). The higher the value of the Colless’ index, the higher the imbalance of the phylogeny (values scaled between 0 and 1).