New Developments in Plant Genomic
Prediction Models
J. Crossa1, G. de los Campos2, P. Perez-Rodriguez3, S. Perez-Elizalde3,
J. Cuevas4, O. Montesinos5, M. Lopez-Cruz1, and D. Gianola6
1 Biometrics and Statistics Unit (BSU), CIMMYT, México.
2 Department of Biostatistics, School of Public Health, University of Alabama
at Birmingham, AL, 35294, USA.
3 Colegio de Postgraduado, Montecillos, Mexico
4 Universidad de Quintana Roo, Mexico
5 Universidad de Colima, México.
6 Department of Animal Sciences, University of Wisconsin, Madison, WI 53706,
USA
Outline
Bayesian Inverse Regression
Genomic and Pedigree-based Reaction Norm
models
Marker x Environment interaction model
Threshold model for genome prediction of
ordinal categorical traits in plant breeding
The problem in genomic prediction
Basic Linear Regression function on markers
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$
Problems: $\mathbf{X}$ is over-parameterized and therefore ill-conditioned ($p \gg n$), with strong collinearity among predictors due to linkage disequilibrium; hence $\mathbf{X}$ does not have full column rank. Solutions are not unique and the solution process is unstable.
Solutions: Regression methods with $p \gg n$ generally require either shrinkage estimation or dimension reduction combined with shrinkage.
These challenges can be overcome using what is called the inverse problem approach.
Solution of the Bayesian Inverse Regression problem
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$
where $\mathbf{y}$ is the transformed data vector, $\mathbf{X}$ the linear operator, $\boldsymbol{\beta}$ the vector of regression parameters, and $\boldsymbol{\varepsilon}$ the vector of Gaussian errors.
A singular value decomposition of $\mathbf{X}$ is used such that the decay of the singular
values is mimicked in the prior distribution of $\boldsymbol{\beta}$. The inferential
solution to the inverse problem is based on the posterior
distribution of $\boldsymbol{\beta}$, which combines the variability of $\boldsymbol{\varepsilon}$ with prior beliefs about $\boldsymbol{\beta}$.
Objective: estimate the unknown transformed marker effects $\boldsymbol{\beta}$.
[Figure, panels a, b, c: decay of the singular values; increasing noise; "prior" mimicking the decay of the singular values]
The posterior mean of each transformed marker effect can be seen as the
product of a filter and the OLS estimate.
The penalty applied by the filter is related to the
decay of the singular values.
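The "filter x OLS" reading of the posterior mean can be illustrated numerically. The sketch below is not the model of Cuevas et al. (2014): it assumes simulated data, known variances (`sigma2_e`, `sigma2_b`), and a single common Gaussian prior variance for all transformed effects, which reduces to ridge regression. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 200                       # hypothetical sizes with p >> n
X = rng.standard_normal((n, p))      # simulated marker matrix
y = X @ (0.1 * rng.standard_normal(p)) + rng.standard_normal(n)

# Thin SVD: X = U diag(s) Vt, with singular values s in decreasing order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

d = U.T @ y                          # transformed data vector
b_ols = d / s                        # OLS estimate of the transformed effects

# Assumed known variances (in a Bayesian fit these would be sampled)
sigma2_e, sigma2_b = 1.0, 0.01

# Filter factor in (0, 1): close to 1 for large singular values,
# close to 0 for small (noise-dominated) singular values
f = sigma2_b * s**2 / (sigma2_e + sigma2_b * s**2)
b_post = f * b_ols                   # posterior mean = filter x OLS

# Back-transform to marker effects and compare with ridge regression
beta_post = Vt.T @ b_post
lam = sigma2_e / sigma2_b
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(beta_post, beta_ridge))
```

With a common prior variance the filtered solution coincides with ridge regression; letting the prior variances decay with the singular values, as described above, would change `f` component by component.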
Results (Cuevas et al., 2014, G3; 300 maize lines, 50,000 SNPs)
Trait-environment    Correlation                PMSE
combination          BRR     BIR 1   BIR 2      BRR     BIR 1   BIR 2
FFL-WW               0.818   0.842   0.847      0.260   0.201   0.196
FFL-SS               0.754   0.762   0.764      0.325   0.324   0.320
MFL-WW               0.822   0.841   0.850      0.263   0.203   0.198
MFL-SS               0.776   0.782   0.788      0.318   0.298   0.293
GY-SS                0.320   0.354   0.360      0.883   0.866   0.862
GY-WW                0.557   0.558   0.558      0.675   0.674   0.675
NCBL 1               0.649   0.693   0.697      0.592   0.527   0.523
NCBL 2               0.469   0.526   0.524      0.723   0.679   0.680
BRR = Bayes Ridge Regression; BIR = Bayes Inverse Regression.
Big Data: With modern genotyping and sequencing
technologies, molecular marker information has
become high-dimensional, with the number
of markers (p) far exceeding the number of
data points (n) available for model fitting.
Similarly, environmental information is
becoming high-dimensional owing to the
development of information systems that allow
continuous monitoring of environmental
conditions.
A reaction norm model for genomic prediction in multi-
environment trials (Jarquin et al., 2014, TAG)
Proposition
A class of random effects models is introduced in which
the main effects of markers and environmental covariables (ECs),
and their interactions, are modeled using covariance
structures that are functions of
marker genotypes and ECs.
This is an extension of GBLUP that uses a
random effects model on all the markers,
all the ECs, and all the interactions
between markers and ECs.
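MAIN EFFECT AND INTERACTION MODELS
BASELINE
$y_{ijk} = \mu + E_j + L_i + LE_{ij} + \varepsilon_{ijk}$
GENOMIC COVARIABLES
Replace $L_i$ with $g_i$:
$y_{ijk} = \mu + E_j + g_i + \varepsilon_{ijk}$
ENVIRONMENTAL COVARIABLES
Replace $E_j$ with $w_{ij}$:
$y_{ijk} = \mu + w_{ij} + g_i + \varepsilon_{ijk}$
INTERACTION MODEL
$y_{ijk} = \mu + w_{ij} + g_i + gw_{ij} + \varepsilon_{ijk}$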
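One way to build covariance structures that are functions of marker genotypes and ECs is sketched below. It assumes simulated data, a genomic relationship matrix of the form XX'/p, an EC-based matrix of the form WW'/q, and an interaction covariance given by their element-wise (Hadamard) product. The sizes and variable names are hypothetical, and the scaling conventions are assumptions rather than details taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n_lines, n_env, p, q = 5, 3, 50, 4    # hypothetical sizes

# Simulated, column-centered marker matrix (lines x markers)
X = rng.standard_normal((n_lines, p))
X -= X.mean(axis=0)

# One record per line-environment combination, sorted by line
line = np.repeat(np.arange(n_lines), n_env)
n_rec = line.size

# Simulated, column-centered ECs, one row per record
W = rng.standard_normal((n_rec, q))
W -= W.mean(axis=0)

Zg = np.eye(n_lines)[line]            # incidence matrix: records -> lines
G = X @ X.T / p                       # genomic relationships among lines
K_g = Zg @ G @ Zg.T                   # covariance of g across records
Omega = W @ W.T / q                   # covariance of w across records
K_gw = K_g * Omega                    # Hadamard product: covariance of gw

print(K_g.shape, Omega.shape, K_gw.shape)
```

The three matrices can then be supplied as kernels to any software that fits multi-kernel random effects models.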
How much of G×E is explained by the interaction of
genomic × environmental covariables?
Due to imperfect LD between alleles at markers
and alleles at QTLs, markers may not fully
account for genetic differences between lines.
ECs may not fully account for differences due to
environmental conditions.
Therefore, some proportion of the G×E may not
be fully captured by the genomic × EC
interaction.
Results: 700 CIMMYT wheat lines genotyped with 15,000 markers
and evaluated in 6 environments
MODEL                  CV1 (70% TRN, 30% TST)   CV2 (70% TRN, 30% TST)
E+G                    0.242                    0.299
E+G+W                  0.240                    0.299
E+G+W+(GxW)            0.413                    0.473
E+G+W+(GxW)+(GxE)      0.430                    0.517
Prediction accuracy of GY under well-watered (WW) and severe stress (SS) conditions using 955,690 GBS SNP
markers, with 4-5 environments in SS and WW. CV1 = 20% of entries unobserved in all environments; CV2 = some entries unobserved in some environments but observed in others (Zhang et al., 2014, Heredity)
Maize biparental   CV1 GY_WW        CV2 GY_WW        CV1 GY_SS        CV2 GY_SS
(200 F2)           E+G    E+G+GE    E+G    E+G+GE    E+G    E+G+GE    E+G    E+G+GE
1                  0.38   0.47      0.42   0.58      0.25   0.23      0.37   0.37
2                  0.41   0.48      0.38   0.45      0.31   0.38      0.15   0.18
3                  0.22   0.28      0.34   0.35      --     --        --     --
4                  0.48   0.48      0.50   0.53      --     --        --     --
5                  0.43   0.46      0.42   0.47      0.37   0.61      0.18   0.45
6                  0.29   0.33      0.29   0.39      --     --        --     --
7                  0.22   0.25      0.35   0.41      0.20   0.29      0.17   0.27
8                  0.42   0.45      0.43   0.46      0.20   0.18      0.35   0.39
9                  0.27   0.42      0.27   0.43      0.35   0.46      0.42   0.60
10                 0.32   0.43      0.32   0.41      0.49   0.58      0.44   0.64
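The two validation designs can be made concrete with a small sketch that generates the (line, environment) cells to be masked. The function name `cv_masks`, the 20% test fraction applied to CV2, and the choice of masking each selected CV2 line in exactly one environment are illustrative assumptions. The source only states that CV2 entries are unobserved in some environments and observed in others.

```python
import random

def cv_masks(n_lines, n_envs, frac_test=0.2, seed=0):
    """Return sets of (line, env) cells masked under CV1 and CV2."""
    rng = random.Random(seed)
    lines = list(range(n_lines))
    n_test = round(frac_test * n_lines)

    # CV1: the selected lines are unobserved in every environment
    cv1_lines = rng.sample(lines, n_test)
    cv1 = {(i, j) for i in cv1_lines for j in range(n_envs)}

    # CV2 (assumed variant): each selected line is masked in one
    # randomly chosen environment and stays observed in the others
    cv2_lines = rng.sample(lines, n_test)
    cv2 = {(i, rng.randrange(n_envs)) for i in cv2_lines}
    return cv1, cv2

cv1, cv2 = cv_masks(n_lines=10, n_envs=4)
print(sorted(cv1))
print(sorted(cv2))
```

Under CV1 a model can only borrow information from other lines, whereas under CV2 it can also use records of the same line in other environments.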
HarvestPlus (Govindan Velu): preliminary results
Correlations between observed and predicted values for grain Zn
330 CIMMYT wheat lines genotyped with the 90K Illumina SNP chip
            G        A        G+A+(GxE)+(AxE)
CV1
DWR_2013    0.3346   0.3390   0.3851
MCO_2012    0.5252   0.4690   0.5660
MCO_2013    0.4552   0.4421   0.4581
PAU_2013    0.2891   0.3433   0.4009
CV2
DWR_2013    0.5119   0.5117   0.5847
MCO_2012    0.6859   0.6408   0.6941
MCO_2013    0.6386   0.6069   0.5947
PAU_2013    0.3908   0.4019   0.5068
MCO = Mexico, Ciudad Obregon; DWR = Directorate of Wheat Research; PAU = Punjab Agricultural University
Correlations between observed and predicted values for grain Fe
            G        A        G+A+(GxE)+(AxE)
CV1
DWR_2013    0.4851   0.4412   0.4889
MCO_2012    0.6104   0.5642   0.6857
MCO_2013    0.4135   0.4770   0.5149
PAU_2013    0.4053   0.3908   0.4679
CV2
DWR_2013    0.5510   0.5264   0.5468
MCO_2012    0.6769   0.6519   0.7343
MCO_2013    0.4825   0.5242   0.5305
PAU_2013    0.4458   0.4340   0.5068
MCO = Mexico, Ciudad Obregon; DWR = Directorate of Wheat Research; PAU = Punjab Agricultural University
Marker x environment interaction model (Lopez-Cruz et al., 2015, G3)
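Single-environment model
$y_{ij} = \mu_j + \sum_{k=1}^{p} x_{ijk}\beta_{jk} + \varepsilon_{ij}$,
($i = 1, 2, \ldots, n$ individuals; $j = 1, 2, \ldots, s$ environments; $k = 1, 2, \ldots, p$ markers)
$\mathbf{y}_j = \mathbf{1}\mu_j + \mathbf{X}_j\boldsymbol{\beta}_j + \boldsymbol{\varepsilon}_j$
Setting $\mathbf{u}_j = \mathbf{X}_j\boldsymbol{\beta}_j$, the model described above can
also be represented as follows:
$\mathbf{y}_j = \mathbf{1}\mu_j + \mathbf{u}_j + \boldsymbol{\varepsilon}_j$
with $\mathbf{u}_j \sim N(\mathbf{0}, \mathbf{G}_j\sigma^2_{u_j})$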
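The equivalence between the marker-effects form and the genomic-value form can be checked numerically. The sketch below assumes simulated data for a single environment, known variances, a relationship matrix of the form XX'/p, and the sample mean as a stand-in for the intercept. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 100                         # hypothetical sizes
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                    # center marker columns
y = rng.standard_normal(n)
yc = y - y.mean()                      # sample mean used as intercept estimate

# Assumed known variances; beta_k ~ N(0, sigma2_beta) implies
# u = X beta ~ N(0, G * sigma2_u) with G = XX'/p and sigma2_u = p * sigma2_beta
sigma2_e, sigma2_beta = 1.0, 0.02
sigma2_u = p * sigma2_beta

# Genomic-value form: BLUP of u using the n x n matrix G
G = X @ X.T / p
u_gblup = G @ np.linalg.solve(G + (sigma2_e / sigma2_u) * np.eye(n), yc)

# Marker-effects form: ridge regression on the p markers
lam = sigma2_e / sigma2_beta
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ yc)
u_ridge = X @ beta_hat

print(np.allclose(u_gblup, u_ridge))
```

The genomic-value form only requires solving an n x n system, which is the practical reason for preferring it when p is much larger than n.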
Is it possible to have a model that predicts unobserved lines and is also useful for GWAS?
Marker x Environment Interaction model
Multiple-environment model
This model includes main effects and marker × environment (M×E)
interactions. Specifically, we set $\beta_{jk} = b_{0k} + b_{jk}$,
where $b_{0k}$ is the main effect of the kth marker, a
component assumed to be stable across
environments, and $b_{jk}$ is an interaction term
representing deviations from the main effect due to
marker × environment interaction.
Therefore, the equation for data from the jth
environment becomes
$y_{ij} = \mu_j + \sum_{k=1}^{p} x_{ijk}(b_{0k} + b_{jk}) + \varepsilon_{ij}$,
[Figure, two panels (CV2 and CV1): correlation between phenotypes and predictions (average over 50 TRN-TST partitions) by model (stratified, interaction, and across-environment analysis), validation design (CV1, CV2), data set, and environment. Vertical axis: Correlation (0 to 0.8); horizontal axis: Data set / Environment; series: Stratified, Interaction, Across Env.]
Conclusions on the MxE prediction model
The M×E model decomposes marker effects and
genomic values into components that are stable
across environments (main effects) and others
that are environment-specific (interactions).
This information is not provided by standard
multi-environment linear mixed models.
Therefore, in principle, the interaction model
could shed light on which variants have effects
that are stable across environments and which
ones are responsible for G×E.
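The decomposition into stable and environment-specific marker effects can be expressed with two design matrices: the per-environment marker matrices stacked vertically (main effects) and the same matrices arranged block-diagonally (interactions). The sketch below uses simulated data and hypothetical names, and only verifies that the stacked form reproduces the per-environment linear predictor. It does not fit the model.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 6                                   # hypothetical number of markers
n_per_env = [4, 5, 3]                   # hypothetical records per environment
X_env = [rng.standard_normal((n, p)) for n in n_per_env]
s = len(X_env)

# Main-effect design: environments stacked on top of each other
X0 = np.vstack(X_env)
N = X0.shape[0]

# Interaction design: block-diagonal, one block of p columns per environment
X_int = np.zeros((N, s * p))
row = 0
for j, Xj in enumerate(X_env):
    X_int[row:row + Xj.shape[0], j * p:(j + 1) * p] = Xj
    row += Xj.shape[0]

# Simulated effects: b0 is shared, b[j] is specific to environment j
b0 = rng.standard_normal(p)
b = rng.standard_normal((s, p))

eta_stacked = X0 @ b0 + X_int @ b.ravel()
eta_by_env = np.concatenate([Xj @ (b0 + b[j]) for j, Xj in enumerate(X_env)])
print(np.allclose(eta_stacked, eta_by_env))
```

Assigning separate priors (or variance components) to the columns of `X0` and to each block of `X_int` is what allows the main and interaction effects to be estimated separately.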
Threshold models for genome prediction of
ordinal traits in plant breeding (Montesinos et al., 2014 G3)
Threshold (cumulative probit) model (similar to Gianola, 1983)
The response variable (disease resistance), $y_{ijk}$, represents an assignment
into one of $C$ mutually exclusive categories.
Data: 300 maize lines evaluated at 3 sites. The $C = 5$ categories are ordered: 1 indicates no infection, 2 low
infection, 3 moderate infection, 4 high infection, and 5 complete infection.
In a GLMM framework, this model is described by defining the
distribution, the linear predictor, and the link function. Distribution: multinomial.
We use it with the reaction norm model previously described.
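A cumulative probit model turns a linear predictor into category probabilities through a set of ordered thresholds. The sketch below assumes the common parameterization P(y <= c) = Phi(gamma_c - eta). The threshold values and the function names are hypothetical and chosen only for illustration.

```python
from math import erf, sqrt, inf

def phi_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def category_probs(eta, thresholds):
    """Probabilities of the C ordered categories for linear predictor eta.

    thresholds: C - 1 increasing cut points (gamma_1 < ... < gamma_{C-1}).
    """
    cuts = [-inf] + list(thresholds) + [inf]
    cdf = [phi_cdf(c - eta) for c in cuts]
    # P(y = c) is the difference of consecutive cumulative probabilities
    return [cdf[c + 1] - cdf[c] for c in range(len(cuts) - 1)]

# C = 5 categories need 4 thresholds (hypothetical values)
gammas = [-1.5, -0.5, 0.5, 1.5]
for eta in (0.0, 2.0):
    probs = category_probs(eta, gammas)
    print(eta, [round(p, 3) for p in probs])
```

Under this parameterization, a larger linear predictor shifts probability mass toward the higher (more severe) categories.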
Genomic prediction with threshold models combined with the Reaction
Norm Model
(50,000 SNPs, 300 maize lines at 3 sites).
Brier scores for assessing prediction accuracy (the smaller the score, the better the prediction)
Model         Site 1   Site 2   Site 3
E+L           0.3924   0.3617   0.3507
E+G           0.3869   0.3611   0.3434
E+L+G         0.3856   0.3621   0.3431
E+G+GxE       0.3261   0.3337   0.3145
E+L+G+GxE     0.3249   0.3345   0.3189
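For reference, a multi-category Brier score can be computed as the average, over records, of the squared differences between predicted category probabilities and the 0/1 indicators of the observed category. The sketch below assumes this definition, which ranges from 0 (perfect) to 2. The exact scaling used for the table above is not stated in the slides, and the function name is hypothetical.

```python
def brier_score(probs, observed):
    """Multi-category Brier score.

    probs: list of per-record probability lists over categories 1..C.
    observed: list of observed categories (integers in 1..C).
    """
    total = 0.0
    for p_i, y_i in zip(probs, observed):
        for c, p_ic in enumerate(p_i, start=1):
            d_ic = 1.0 if c == y_i else 0.0   # indicator of observed category
            total += (p_ic - d_ic) ** 2
    return total / len(probs)

# Two hypothetical records with C = 5 categories
predicted = [
    [0.6, 0.2, 0.1, 0.05, 0.05],
    [0.1, 0.1, 0.5, 0.2, 0.1],
]
print(brier_score(predicted, observed=[1, 3]))
```

A prediction that spreads probability uniformly over five categories scores 0.8 under this definition, which gives a rough yardstick for reading scores on this scale.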
Threshold models for genome prediction of
ordinal traits in plant breeding incorporating GxE
Models with G×E captured a sizeable
proportion of the total variability, which indicates
the importance of introducing interaction to
improve prediction accuracy.
Models that included G×E gave gains in
prediction accuracy of 9-14% over
models that include only main effects.
BGLR R-package
We have recently extended the BLR package (Bayesian Lasso) by:
• Developing a programming approach that allows combining many types of
models in a single analysis.
• Implementing methods not implemented in BLR:
- P-BLUP and G-BLUP
- The Bayesian Alphabet
- Reproducing Kernel Hilbert Spaces Regressions
• Implementing multi-core computation
• Implementing methods for linear/binary and censored traits.
The software is available at R-Forge: http://bglr.r-forge.r-project.org
de los Campos, G., and P. Pérez-Rodriguez, 2014 Bayesian Generalized Linear Regression. R
package version 1.0.1. http://CRAN.R-project.org/package=BGLR
Acknowledgements
All researchers of the CIMMYT Global Maize Program
and Global Wheat Program in Africa, India, Turkey,
Ethiopia, China, and Mexico who generated the data used
in several genomic studies.
Researchers at Cornell University and Kansas University
involved in the Bill and Melinda Gates Foundation
projects on maize and wheat who generated the GBS
genotypic data used in several genomic studies.
Consistent support from all our colleagues in the Genetic
Resource Program and Seed of Discovery Project of
CIMMYT.
http://genomics.cimmyt.org
Thanks!!