Download - 2015. Jose Crossa. New developments in plant genomic prediction models

New Developments in Plant Genomic Prediction Models

J. Crossa1, G. de los Campos2, P. Perez-Rodriguez3, S. Perez-Elizalde3, J. Cuevas4, O. Montesinos5, M. Lopez-Cruz1, and D. Gianola6

1 Biometrics and Statistics Unit (BSU), CIMMYT, México.

2 Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, AL 35294, USA.

3 Colegio de Postgraduados, Montecillo, México.

4 Universidad de Quintana Roo, México.

5 Universidad de Colima, México.

6 Department of Animal Sciences, University of Wisconsin, Madison, WI 53706, USA.

Outline

Bayesian Inverse Regression

Genomic and Pedigree-based Reaction Norm models

Marker x Environment interaction model

Threshold model for genome prediction of ordinal categorical traits in plant breeding

The problem in genomic prediction

Basic Linear Regression function on markers

𝒚 = 𝑿𝜷 + 𝜺

Problems: 𝑿 is over-parameterized and therefore ill-conditioned (𝑝 >> 𝑛), with strong collinearity among predictors due to linkage disequilibrium; hence 𝑿 is not of full column rank. The solution is not unique, and the solution process is unstable.

Solutions: Regression methods with 𝑝 >> 𝑛 generally require shrinkage estimation, dimensionality reduction, or both.

These challenges can be overcome using what is called the inverse problem approach.

Solution of the Bayesian Inverse Regression problem

𝒚 = 𝑿𝜷 + 𝜺

where 𝒚 is the transformed data vector, 𝑿 is the linear operator, 𝜷 is the vector of regression parameters, and 𝜺 is the vector of Gaussian errors.

A singular value decomposition of the linear operator 𝑿 is used so that the decay of its singular values is mimicked in the prior distribution of 𝜷. The inferential solution to the inverse problem is based on the posterior distribution of 𝜷, which combines the variability of 𝜺 with prior beliefs about 𝜷.

Objective: estimate the unknown transformed marker effects 𝜷.

[Figure, panels a, b, c. Labels: decay of the singular values; increasing noise; 'prior' mimicking the decay of the singular values.]

The mean of 𝒃 can be seen as the product of a filter (𝑓ᵢ) and the OLS estimate (𝑏ᵢ*).

The penalty (𝑓ᵢ) is related to the decay of the singular values.
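The filter interpretation can be illustrated with a minimal sketch. It assumes the ridge-type filter 𝑓ᵢ = 𝑠ᵢ² / (𝑠ᵢ² + λ), which is the standard shrinkage form in the SVD basis and not necessarily the exact prior used in the Bayesian inverse regression. The singular values, the transformed data `d`, and the penalty `lam` are made-up illustration values.

```python
def ols_coefficients(singular_values, d):
    """OLS estimate in the SVD basis: b*_i = d_i / s_i."""
    return [di / si for si, di in zip(singular_values, d)]


def ridge_filter(singular_values, lam):
    """Assumed ridge-type filter f_i = s_i^2 / (s_i^2 + lam)."""
    return [si ** 2 / (si ** 2 + lam) for si in singular_values]


def filtered_coefficients(singular_values, d, lam):
    """Shrunken estimate: filter times OLS, b_i = f_i * b*_i."""
    b_ols = ols_coefficients(singular_values, d)
    f = ridge_filter(singular_values, lam)
    return [fi * bi for fi, bi in zip(f, b_ols)]


# Hypothetical decaying singular values and transformed data.
singular_values = [10.0, 3.0, 0.5, 0.01]
d = [5.0, 1.5, 0.4, 0.2]
lam = 1.0

b_ols = ols_coefficients(singular_values, d)
f = ridge_filter(singular_values, lam)
b_shrunk = filtered_coefficients(singular_values, d, lam)

for s, fi, bo, bs in zip(singular_values, f, b_ols, b_shrunk):
    print(f"s={s:6.2f}  filter={fi:.4f}  OLS={bo:8.3f}  shrunk={bs:8.4f}")
```

Directions with large singular values are left almost untouched, while directions with tiny singular values, where OLS amplifies noise, are shrunk strongly toward zero.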

Results (Cuevas et al., 2014, G3): 300 maize lines, 50,000 SNPs

BRR = Bayes Ridge Regression; BIR = Bayes Inverse Regression.

Trait-environment    Correlation               PMSE
combination          BRR     BIR 1   BIR 2     BRR     BIR 1   BIR 2
FFL-WW               0.818   0.842   0.847     0.260   0.201   0.196
FFL-SS               0.754   0.762   0.764     0.325   0.324   0.320
MFL-WW               0.822   0.841   0.850     0.263   0.203   0.198
MFL-SS               0.776   0.782   0.788     0.318   0.298   0.293
GY-SS                0.320   0.354   0.360     0.883   0.866   0.862
GY-WW                0.557   0.558   0.558     0.675   0.674   0.675
NCBL 1               0.649   0.693   0.697     0.592   0.527   0.523
NCBL 2               0.469   0.526   0.524     0.723   0.679   0.680

Big Data

With modern genotyping and sequencing technologies, molecular marker information has become highly dimensional, with the number of markers (p) far exceeding the number of data points (n) available for model fitting.

Similarly, environmental information is also becoming highly dimensional owing to the development of information systems that allow continuous monitoring of environmental conditions.

A reaction norm model for genomic prediction in multi-environment trials (Jarquin et al., 2014, TAG)

Proposition

A class of random effects models is introduced in which the main effects of markers and environmental covariables (ECs), and their interactions, are modeled using covariance structures that are functions of marker genotypes and ECs.

It is an extension of GBLUP that uses a random effects model on all the markers, all the ECs, and all the interactions between markers and ECs.

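MAIN EFFECT AND INTERACTION MODELS

BASELINE

$y_{ijk} = \mu + E_j + L_i + EL_{ij} + \varepsilon_{ijk}$

GENOMIC COVARIABLES

Replace $L_i$ with $g_i$:

$y_{ijk} = \mu + E_j + g_i + \varepsilon_{ijk}$

ENVIRONMENTAL COVARIABLES

Replace $E_j$ with $w_{ij}$:

$y_{ijk} = \mu + w_{ij} + g_i + \varepsilon_{ijk}$

INTERACTION MODEL

$y_{ijk} = \mu + w_{ij} + g_i + wg_{ij} + \varepsilon_{ijk}$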
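One common way to give the genomic × environmental covariable interaction term a covariance structure is the cell-by-cell (Hadamard) product of a marker-based and an EC-based similarity matrix. The sketch below illustrates that idea under simplifying assumptions: the marker codes `X`, the covariables `W` (one row per environment), and the record layout are made up, and the helper names are hypothetical.

```python
def cross_product_kernel(M):
    """Return M M' / (number of columns): a similarity matrix between rows of M."""
    n, p = len(M), len(M[0])
    return [[sum(M[a][k] * M[b][k] for k in range(p)) / p for b in range(n)]
            for a in range(n)]


def hadamard(A, B):
    """Cell-by-cell product of two equally sized matrices."""
    return [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


# Hypothetical centered marker codes: 2 lines x 3 markers.
X = [[1.0, -1.0, 0.0],
     [0.0, 1.0, 1.0]]
# Hypothetical centered environmental covariables: 2 environments x 2 ECs.
W = [[0.5, -0.5],
     [-0.5, 0.5]]

G_lines = cross_product_kernel(X)   # similarity between lines
O_envs = cross_product_kernel(W)    # similarity between environments

# Each record is a (line, environment) pair; expand both kernels to record level.
records = [(0, 0), (1, 0), (0, 1), (1, 1)]
K_g = [[G_lines[la][lb] for (lb, _) in records] for (la, _) in records]
K_w = [[O_envs[ea][eb] for (_, eb) in records] for (_, ea) in records]

# Assumed covariance structure of the interaction term (up to a variance scalar).
K_gw = hadamard(K_g, K_w)

for row in K_gw:
    print(["%.4f" % v for v in row])
```

Two records covary strongly through the interaction term only when their lines are genetically similar and their environments have similar covariables.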

How much of G×E is explained by the interaction of genomic x environmental covariables?

Due to imperfect LD between alleles at markers and alleles at QTLs, markers may not fully account for genetic differences between lines.

ECs may not fully account for differences due to environmental conditions.

Therefore, some proportion of the G×E may not be fully captured by the genomic x EC interaction.

Results: 700 CIMMYT wheat lines genotyped with 15,000 markers and evaluated in 6 environments

MODEL                  CV1                  CV2
                       70% TRN, 30% TST     70% TRN, 30% TST
E+G                    0.242                0.299
E+G+W                  0.240                0.299
E+G+W+(GxW)            0.413                0.473
E+G+W+(GxW)+(GxE)      0.430                0.517

Prediction accuracy of grain yield (GY) under well-watered (WW) and severe stress (SS) conditions using 955,690 GBS SNP markers, with 4-5 environments each in SS and WW. CV1 = 20% of entries unobserved in all environments; CV2 = some entries unobserved in some environments but observed in others (Zhang et al., Heredity, 2014).

Maize biparental   GY_WW CV1        GY_WW CV2        GY_SS CV1        GY_SS CV2
(200 F2)           E+G    E+G+GE    E+G    E+G+GE    E+G    E+G+GE    E+G    E+G+GE
1                  0.38   0.47      0.42   0.58      0.25   0.23      0.37   0.37
2                  0.41   0.48      0.38   0.45      0.31   0.38      0.15   0.18
3                  0.22   0.28      0.34   0.35      --     --        --     --
4                  0.48   0.48      0.50   0.53      --     --        --     --
5                  0.43   0.46      0.42   0.47      0.37   0.61      0.18   0.45
6                  0.29   0.33      0.29   0.39      --     --        --     --
7                  0.22   0.25      0.35   0.41      0.20   0.29      0.17   0.27
8                  0.42   0.45      0.43   0.46      0.20   0.18      0.35   0.39
9                  0.27   0.42      0.27   0.43      0.35   0.46      0.42   0.60
10                 0.32   0.43      0.32   0.41      0.49   0.58      0.44   0.64
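The two validation designs defined in the caption can be sketched as masking schemes over (line, environment) records. This is a minimal illustration, not the partitioning code used in the cited study: the function names are hypothetical, and CV2 is simplified so that each held-out line is masked in exactly one randomly chosen environment.

```python
import random


def cv1_partition(lines, envs, test_frac, rng):
    """CV1: a fraction of the lines is unobserved in every environment."""
    n_test = round(test_frac * len(lines))
    test_lines = rng.sample(lines, n_test)
    return {(line, env) for line in test_lines for env in envs}


def cv2_partition(lines, envs, test_frac, rng):
    """CV2 (simplified): each held-out line is masked in one environment
    and remains observed in all the others."""
    n_test = round(test_frac * len(lines))
    test_lines = rng.sample(lines, n_test)
    return {(line, rng.choice(envs)) for line in test_lines}


rng = random.Random(1)                 # fixed seed for reproducibility
lines = list(range(10))                # hypothetical line identifiers
envs = ["E1", "E2", "E3", "E4"]        # hypothetical environments

cv1_test = cv1_partition(lines, envs, 0.2, rng)
cv2_test = cv2_partition(lines, envs, 0.2, rng)

print("CV1 masked records:", sorted(cv1_test))
print("CV2 masked records:", sorted(cv2_test))
```

Under CV1 the model must predict lines with no phenotypic records at all, whereas under CV2 it can borrow information from the same line's records in other environments, which is consistent with the generally higher CV2 correlations.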

HarvestPlus (Govindan Velu): preliminary results

330 CIMMYT wheat lines genotyped with the 90k Illumina SNP chip

Correlations between observed and predicted values for grain Zn

Environment   G        A        G+A+(GxE)+(AxE)
CV1
DWR_2013      0.3346   0.3390   0.3851
MCO_2012      0.5252   0.4690   0.5660
MCO_2013      0.4552   0.4421   0.4581
PAU_2013      0.2891   0.3433   0.4009
CV2
DWR_2013      0.5119   0.5117   0.5847
MCO_2012      0.6859   0.6408   0.6941
MCO_2013      0.6386   0.6069   0.5947
PAU_2013      0.3908   0.4019   0.5068

MCO, Mexico Ciudad Obregon; DWR, Directorate of Wheat Research; PAU, Punjab Agricultural University.

Correlations between observed and predicted values for grain Fe

Environment   G        A        G+A+(GxE)+(AxE)
CV1
DWR_2013      0.4851   0.4412   0.4889
MCO_2012      0.6104   0.5642   0.6857
MCO_2013      0.4135   0.4770   0.5149
PAU_2013      0.4053   0.3908   0.4679
CV2
DWR_2013      0.5510   0.5264   0.5468
MCO_2012      0.6769   0.6519   0.7343
MCO_2013      0.4825   0.5242   0.5305
PAU_2013      0.4458   0.4340   0.5068

MCO, Mexico Ciudad Obregon; DWR, Directorate of Wheat Research; PAU, Punjab Agricultural University.

Marker x environment interaction model (Lopez-Cruz et al, 2015, G3)

Single-environment model

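$y_{ij} = \mu_j + \sum_{k=1}^{p} x_{ijk}\beta_{jk} + \varepsilon_{ij}$,

(i = 1, 2, …, n individuals; j = 1, 2, …, s environments; k = 1, 2, …, p markers)

In matrix notation:

$\boldsymbol{y}_j = \boldsymbol{1}\mu_j + \boldsymbol{X}_j\boldsymbol{\beta}_j + \boldsymbol{\varepsilon}_j$

Setting $\boldsymbol{u}_j = \boldsymbol{X}_j\boldsymbol{\beta}_j$, the model described above can also be represented as follows:

$\boldsymbol{y}_j = \boldsymbol{1}\mu_j + \boldsymbol{u}_j + \boldsymbol{\varepsilon}_j$

with $\boldsymbol{u}_j \sim N(\boldsymbol{0}, \boldsymbol{G}_j\sigma^2_{u_j})$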

Is it possible to have a model that both predicts unobserved lines and is useful for GWAS?

Marker x Environment Interaction model

Multiple-environment model

This model includes main effects and marker × environment (M×E) interactions. Specifically, we set $\beta_{jk} = b_{0k} + b_{jk}$, where $b_{0k}$ is the main effect of the kth marker, a component assumed to be stable across environments, and $b_{jk}$ is an interaction term representing deviations from the main effect due to marker × environment interaction.

Therefore, the equation for data from the jth environment becomes

$y_{ij} = \mu_j + \sum_{k=1}^{p} x_{ijk}(b_{0k} + b_{jk}) + \varepsilon_{ij}$,
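The decomposition of marker effects into a main effect plus an environment-specific deviation can be written as a single regression on an augmented design matrix: one block of columns shared by all environments, followed by one block per environment that is zero elsewhere. The sketch below checks that equivalence numerically; the marker matrices, effect values, and function names are made up for illustration.

```python
def interaction_design(X_by_env):
    """Stack per-environment marker matrices into [main | env 1 | ... | env s] form."""
    s = len(X_by_env)
    p = len(X_by_env[0][0])
    rows = []
    for j, Xj in enumerate(X_by_env):
        for x in Xj:
            env_blocks = []
            for e in range(s):
                # Markers enter the block of their own environment, zeros elsewhere.
                env_blocks.extend(x if e == j else [0.0] * p)
            rows.append(list(x) + env_blocks)
    return rows


def matvec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]


# Hypothetical marker codes: 2 environments, 2 lines each, 2 markers.
X_by_env = [
    [[1.0, 0.0], [0.0, 1.0]],    # environment 1
    [[1.0, 1.0], [-1.0, 0.0]],   # environment 2
]
b0 = [0.5, -0.2]                      # main marker effects b_0k
b_env = [[0.1, 0.0], [-0.1, 0.3]]     # deviations b_jk, one list per environment

design = interaction_design(X_by_env)
coef = b0 + b_env[0] + b_env[1]
fitted_augmented = matvec(design, coef)

# Direct evaluation of sum_k x_ijk * (b_0k + b_jk), environment by environment.
fitted_direct = []
for j, Xj in enumerate(X_by_env):
    beta_j = [m + d for m, d in zip(b0, b_env[j])]
    fitted_direct.extend(matvec(Xj, beta_j))

print(fitted_augmented)
print(fitted_direct)
```

Because main and interaction effects sit in separate column blocks, they can be given separate priors or variance components in any software that accepts several design matrices.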

[Figure, CV2: Correlation between phenotypes and predictions (average over 50 TRN-TST partitions) by model (stratified, interaction, and across-environment analysis). Horizontal axis: data set / environment; vertical axis: correlation, 0 to 0.8.]

[Figure, CV1: Correlation between phenotypes and predictions (average over 50 TRN-TST partitions) by model (stratified, interaction, and across-environment analysis). Horizontal axis: data set / environment; vertical axis: correlation, 0 to 0.8.]

Conclusions on the M×E prediction model

The M×E model decomposes marker effects and genomic values into components that are stable across environments (main effects) and others that are environment-specific (interactions).

This information is not provided by standard multi-environment linear mixed models.

Therefore, in principle, the interaction model could shed light on which variants have effects that are stable across environments and which ones are responsible for G×E.

Threshold models for genome prediction of ordinal traits in plant breeding (Montesinos et al., 2014, G3)

Threshold (cumulative probit) model (similar to Gianola, 1983)

The response variable (disease resistance), 𝑦𝑖𝑗𝑘, represents an assignment into one of 𝐶 mutually exclusive categories.

Data: 𝐶 = 5 ordered categories, where 1 indicates no infection, 2 low infection, 3 moderate infection, 4 high infection, and 5 complete infection. The data come from 3 sites and 300 maize lines.

In a GLMM framework, this model can be described by defining the distribution, the linear predictor, and the link function. Distribution: multinomial.

We use it with the Reaction Norm Model previously described.
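How a cumulative probit model turns a linear predictor into probabilities for the 𝐶 ordered categories can be sketched as follows. The sketch assumes the usual parameterization P(y = c) = Φ(γ_c − η) − Φ(γ_{c−1} − η), with γ_0 = −∞ and γ_C = +∞; the threshold values and linear predictors are made up, and the sign convention in the cited paper may differ.

```python
import math


def std_normal_cdf(z):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def category_probabilities(eta, thresholds):
    """Probabilities of the ordered categories for linear predictor eta.

    thresholds holds the C - 1 increasing cut points gamma_1 < ... < gamma_{C-1}.
    """
    cuts = [-math.inf] + list(thresholds) + [math.inf]
    cdf = [std_normal_cdf(c - eta) for c in cuts]
    return [cdf[c + 1] - cdf[c] for c in range(len(cuts) - 1)]


# Hypothetical cut points for C = 5 infection categories.
thresholds = [-1.5, -0.5, 0.5, 1.5]

probs_average = category_probabilities(0.0, thresholds)   # average line
probs_high = category_probabilities(2.0, thresholds)      # susceptible line

print(["%.3f" % p for p in probs_average])
print(["%.3f" % p for p in probs_high])
```

Under this convention a larger linear predictor moves probability mass toward the higher (more infected) categories, and the probabilities always sum to one.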

Genomic prediction of the threshold models with the Reaction Norm Model (50,000 SNPs, 300 maize lines in 3 sites)

Model         Site 1    Site 2    Site 3
E+L           0.3924    0.3617    0.3507
E+G           0.3869    0.3611    0.3434
E+L+G         0.3856    0.3621    0.3431
E+G+GxE       0.3261    0.3337    0.3145
E+L+G+GxE     0.3249    0.3345    0.3189

Brier scores for assessing prediction accuracy (the smaller the score, the better the prediction).
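For reference, a Brier score for categorical predictions can be computed as the mean squared difference between predicted category probabilities and the 0/1 indicator of the observed category. The sketch below assumes this common multi-category form, averaged over observations and ranging from 0 to 2; the scaling used for the table above may differ, and the example probabilities are invented.

```python
def brier_score(prob_rows, observed):
    """Mean over observations of sum_c (p_ic - o_ic)^2.

    prob_rows: one list of category probabilities per observation.
    observed:  observed category per observation, coded 1..C.
    """
    total = 0.0
    for probs, y in zip(prob_rows, observed):
        for c, p in enumerate(probs, start=1):
            indicator = 1.0 if c == y else 0.0
            total += (p - indicator) ** 2
    return total / len(prob_rows)


# Hypothetical predictions for three plants scored on a 5-category scale.
predicted = [
    [0.70, 0.20, 0.05, 0.03, 0.02],
    [0.10, 0.20, 0.40, 0.20, 0.10],
    [0.05, 0.05, 0.10, 0.30, 0.50],
]
observed = [1, 3, 4]

score = brier_score(predicted, observed)
uniform_score = brier_score([[0.2] * 5], [2])    # uninformative prediction
perfect_score = brier_score([[0.0, 1.0, 0.0, 0.0, 0.0]], [2])

print(round(score, 4), round(uniform_score, 4), perfect_score)
```

A perfect probabilistic prediction scores 0, so lower values in a comparison of models indicate sharper and better calibrated category probabilities.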

Threshold models for genome prediction of ordinal traits in plant breeding incorporating G×E

Models with G×E captured a sizeable proportion of the total variability, which indicates the importance of introducing interaction to improve prediction accuracy.

Models that included G×E gave gains in prediction accuracy of 9-14% over the models that include only main effects.

BGLR R-package

=> We have recently extended the BLR package (Bayesian Lasso) by:

• Developing a programming approach that allows combining many types of models in a single analysis.

• Implementing methods not implemented in BLR:

- P and G-BLUP

- The Bayesian Alphabet

- Reproducing Kernel Hilbert Spaces Regressions

• Implementing multi-core computation

• Implementing methods for linear/binary and censored traits.

The software is available at R-Forge: http://bglr.r-forge.r-project.org

de los Campos, G., and P. Pérez-Rodriguez, 2014. Bayesian Generalized Linear Regression. R package version 1.0.1. http://CRAN.R-project.org/package=BGLR


Acknowledgements

All researchers of the CIMMYT Global Maize Program and Global Wheat Program in Africa, India, Turkey, Ethiopia, China, and Mexico who generated the data used in several genomics studies.

Researchers at Cornell University and Kansas University involved in the Bill and Melinda Gates Foundation projects on maize and wheat that generated the GBS genotypic data used in several genomic studies.

Consistent support from all our colleagues in the Genetic Resource Program and Seed of Discovery Project of CIMMYT.

http://genomics.cimmyt.org

Thanks!!
