ridge regression for risk prediction - with - bayes-pharma

Ridge regression for risk prediction with applications to genetic data Erika Cule and Maria De Iorio Imperial College London Department of Epidemiology and Biostatistics School of Public Health May 2012

Post on 18-Mar-2022

Page 1: Ridge regression for risk prediction - with - Bayes-pharma

Ridge regression for risk prediction with applications to genetic data

Erika Cule and Maria De Iorio

Imperial College London

Department of Epidemiology and Biostatistics

School of Public Health

May 2012

Page 2: Ridge regression for risk prediction - with - Bayes-pharma

Outline

1 Risk Prediction using Genetic Data

2 Methods and challenges

3 Ridge Regression: Shrinkage parameter; Significance testing

4 Conclusions

Page 4: Ridge regression for risk prediction - with - Bayes-pharma

Risk Prediction using Genetic Data

...genome-wide association studies have identified thousands of genetic variants associated with hundreds of diseases and traits.

In the decade following the publication of the first draft of the Human Genome Sequence...

Page 6: Ridge regression for risk prediction - with - Bayes-pharma

Risk Prediction using Genetic Data

However, clinicians are getting impatient about the utility of these identified variants for risk prediction in complex diseases:

Page 8: Ridge regression for risk prediction - with - Bayes-pharma

Risk prediction using genetic data

• Recently, questions have been raised about the potential utility of genetic risk prediction for complex diseases (Clayton, 2009).

• The aim here is to make ridge regression possible for genetic data in a semi-automatic way.

• The framework that we propose allows for the simultaneous inclusion of all predictors genome-wide in a regression model.

• Our approach is appropriate where there are many predictors of small effect size, which is thought to be the case in genetic data.

Page 13: Ridge regression for risk prediction - with - Bayes-pharma

Univariate tests of association

[Figure 4 | Genome-wide scan for seven diseases. For each of seven diseases, −log10 of the trend test P value for quality-control-positive SNPs, excluding those in each disease that were excluded for having poor clustering after visual inspection, is plotted against position on each chromosome. Chromosomes are shown in alternating colours for clarity, with P values < 1 × 10^−5 highlighted in green. All panels are truncated at −log10(P value) = 15, although some markers (for example, in the MHC in T1D and RA) exceed this significance threshold.]

[Panels: Bipolar disorder; Coronary artery disease; Crohn's disease; Hypertension; Rheumatoid arthritis; Type 1 diabetes; Type 2 diabetes. Axes: Chromosome (1 to 22, X) against −log10(P), 0 to 15.]

[Source: Nature, Vol 447, 7 June 2007, p. 666.]

WTCCC (2007)

Page 16: Ridge regression for risk prediction - with - Bayes-pharma

Multivariate methods

• Consider all SNPs jointly

• Standard multivariate methods cannot be used with modern genetic data sets, which have p ≫ n.

• Typically, additional (non-genetic) covariates are included in the analysis, further increasing the dimensionality of the data.

• Penalized regression methods constrain the size of the maximum likelihood estimates of regression coefficients.

• Known as shrinkage methods: they "shrink" regression coefficients towards zero.

• A number of penalized regression approaches have been proposed in the literature: Lasso regression, HyperLasso, Elastic Net...

Ridge Regression

Page 21: Ridge regression for risk prediction - with - Bayes-pharma

Prior distributions in Lasso and Ridge Regression
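
The figure for this slide did not survive extraction. As a reminder of the standard correspondence it most likely illustrated (an assumption, since the slide text does not state it), ridge and Lasso estimates can be read as posterior modes under a Gaussian and a Laplace prior respectively. The symbols τ², b and λ below are introduced here for illustration only:

```latex
% Ridge: independent Gaussian priors on the coefficients
\beta_j \sim N(0, \tau^2)
\;\Rightarrow\;
-\log p(\beta) = \frac{1}{2\tau^2} \sum_{j=1}^{p} \beta_j^2 + \text{const},
\qquad k = \frac{\sigma^2}{\tau^2}

% Lasso: independent Laplace (double exponential) priors
p(\beta_j) = \frac{1}{2b} \exp\!\left( -\frac{|\beta_j|}{b} \right)
\;\Rightarrow\;
-\log p(\beta) = \frac{1}{b} \sum_{j=1}^{p} |\beta_j| + \text{const},
\qquad \lambda = \frac{2\sigma^2}{b}
```

Here k and λ are the penalty weights multiplying Σβ_j² and Σ|β_j| when the residual sum of squares is left unscaled and the errors are N(0, σ²).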

Page 23: Ridge regression for risk prediction - with - Bayes-pharma

Ridge regression

• Ridge regression (Hoerl & Kennard, 1970) is a penalized regression approach proposed to overcome the problems associated with multicollinearity among predictors in multiple regression.

• Among penalized regression approaches, ridge regression has been shown to offer very good predictive performance (Frank & Friedman, 1993).

• We applied ridge regression to the problem of risk prediction using genetic data obtained from genome-wide association studies.

• Ridge regression shrinks the squared length of the regression coefficient vector, which corresponds to a quadratic penalty on the coefficients.
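
The quadratic penalty in the last bullet can be illustrated with a minimal sketch. It assumes the usual closed-form ridge estimate (X'X + kI)⁻¹X'y on centred data; the function name `ridge_coefficients` and the simulated data are hypothetical and not taken from the talk:

```python
import numpy as np

def ridge_coefficients(X, y, k):
    """Closed-form ridge estimate (X'X + k I)^{-1} X'y for shrinkage parameter k > 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Simulated data with more predictors than observations (illustrative only)
rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)                      # centre the predictors
y = X[:, :5] @ np.ones(5) + rng.standard_normal(n)
y = y - y.mean()                            # centre the outcome

beta_small_k = ridge_coefficients(X, y, k=0.1)
beta_large_k = ridge_coefficients(X, y, k=100.0)

# Squared length of the coefficient vector under light and heavy shrinkage
sq_len_small = float(beta_small_k @ beta_small_k)
sq_len_large = float(beta_large_k @ beta_large_k)
```

Because k > 0 makes X'X + kI invertible, the estimate exists even though p > n here, and the squared length of the coefficient vector is smaller for the larger k.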

Page 25: Ridge regression for risk prediction - with - Bayes-pharma

Shrinkage parameter

• Controls the degree of shrinkage of the regression coefficients.

• A larger shrinkage parameter shrinks the coefficients further towards zero.

• Data-driven methods proposed in the literature cannot be applied when p ≫ n, because they depend on the ordinary least squares estimates.

• Ridge trace (graphical method)
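
A ridge trace is simply the set of coefficient estimates over a grid of shrinkage parameters, usually plotted against k. The following is a rough sketch of how such a trace could be computed through the singular value decomposition; the helper name `ridge_trace`, the grid `ks` and the simulated data are assumptions made for illustration:

```python
import numpy as np

def ridge_trace(X, y, ks):
    """Return an array of shape (len(ks), p): one ridge coefficient vector per k."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    uty = U.T @ y
    # For each k, beta_k = V diag(d / (d^2 + k)) U'y
    return np.array([Vt.T @ (d / (d**2 + k) * uty) for k in ks])

rng = np.random.default_rng(1)
n, p = 30, 100                              # p > n, so OLS is not available
X = rng.standard_normal((n, p))
y = X[:, 0] - X[:, 1] + rng.standard_normal(n)

ks = np.logspace(-2, 3, 6)                  # 0.01, 0.1, 1, 10, 100, 1000
trace = ridge_trace(X, y, ks)
norms = np.linalg.norm(trace, axis=1)       # coefficient length for each k
```

Plotting each column of `trace` against `ks` gives the graphical method named in the last bullet; `norms` shows the overall shrinkage as k grows.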

Page 27: Ridge regression for risk prediction - with - Bayes-pharma

Our starting point

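Linear model:

$$Y = X\beta + \varepsilon, \qquad \varepsilon \overset{\text{iid}}{\sim} N(0, \sigma^2)$$

Ridge regression:

$$\hat{\beta}_k = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + k \sum_{j=1}^{p} \beta_j^2 \right\}$$

Proposed by Hoerl, Kennard & Baldwin (1975):

$$k_{\mathrm{HKB}} = \frac{p\,\hat{\sigma}^2}{\hat{\beta}'\hat{\beta}}$$

$\hat{\sigma}^2$, $\hat{\beta}$ estimated from ordinary least squares (OLS).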
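
As a minimal sketch of the Hoerl, Kennard & Baldwin estimator, the code below computes k_HKB from OLS quantities. It assumes n > p so that OLS exists, and it assumes the residual variance is estimated with denominator n − p; the function name `k_hkb` and the simulated data are hypothetical:

```python
import numpy as np

def k_hkb(X, y):
    """HKB shrinkage parameter p * sigma2_hat / (beta_hat' beta_hat), with OLS estimates."""
    n, p = X.shape
    if n <= p:
        raise ValueError("OLS estimates are needed, so this sketch requires n > p")
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    sigma2_hat = resid @ resid / (n - p)    # assumed residual variance estimate
    return p * sigma2_hat / (beta_ols @ beta_ols)

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -0.5, 0.25, 0.0, 2.0]) + rng.standard_normal(n)

k_value = k_hkb(X, y)
```

The explicit n > p check marks the limitation that motivates the rest of the talk: with genome-wide data the OLS estimates are not available.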

Page 31: Ridge regression for risk prediction - with - Bayes-pharma

We observed

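Linear model:

$$Y = X\beta + \varepsilon = Z\alpha + \varepsilon, \qquad \varepsilon \overset{\text{iid}}{\sim} N(0, \sigma^2)$$

Proposed by Hoerl, Kennard & Baldwin (1975):

$$k_{\mathrm{HKB}} = \frac{p\,\hat{\sigma}^2}{\hat{\beta}'\hat{\beta}} = \frac{p\,\hat{\sigma}^2}{\hat{\alpha}'\hat{\alpha}}$$

α are the principal components regression (PCR) coefficients.

PCR coefficients are available when p ≫ n.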
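
The equality of the two denominators can be checked numerically. The sketch below assumes the usual canonical form, Z = XQ and α = Q'β with Q the orthogonal matrix of right singular vectors of X; the slides do not spell out this definition, so it is an assumption here. It uses n > p so that both sets of OLS estimates exist:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 6
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Q holds the right singular vectors of X (assumed definition of the rotation)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Q = Vt.T
Z = X @ Q                                   # principal component scores

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS on the predictors
alpha_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)   # OLS on the components

# Squared lengths of the two coefficient vectors
beta_sq = float(beta_hat @ beta_hat)
alpha_sq = float(alpha_hat @ alpha_hat)
```

Since Q is orthogonal, the rotation changes the coordinates of the coefficient vector but not its length, which is why k_HKB can be written in terms of either β or α.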

Page 36: Ridge regression for risk prediction - with - Bayes-pharma

We propose

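$$k_{\mathrm{HKB}} = \frac{p\,\hat{\sigma}^2}{\hat{\alpha}'\hat{\alpha}}$$

$$k_r = \frac{r\,\hat{\sigma}_r^2}{\hat{\alpha}_r'\hat{\alpha}_r}$$

Harmonic mean of the "ideal" shrinkage parameters of the PCR coefficients, with coefficients replaced by their ordinary least squares estimates.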
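
To see the harmonic mean, take the "ideal" parameter for component j to be σ²/α_j²; the harmonic mean of r such values is r / Σ(α_j²/σ²) = rσ²/α'α. The sketch below computes k_r from the first r principal components when p > n. The choice of Z_r = XV_r as the component scores and of n − r as the denominator for the residual variance are assumptions, and `k_r` is a hypothetical helper name:

```python
import numpy as np

def k_r(X, y, r):
    """Shrinkage parameter r * sigma2_r / (alpha_r' alpha_r) from the first r PCs."""
    n = X.shape[0]
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    Z_r = U[:, :r] * d[:r]                  # scores on the first r components (X V_r)
    alpha_r = (U[:, :r].T @ y) / d[:r]      # OLS of y on Z_r
    resid = y - Z_r @ alpha_r
    sigma2_r = resid @ resid / (n - r)      # assumed residual variance estimate
    return r * sigma2_r / (alpha_r @ alpha_r)

rng = np.random.default_rng(4)
n, p = 50, 500                              # p >> n, as in genome-wide data
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)
y = X[:, :10] @ rng.standard_normal(10) + rng.standard_normal(n)
y = y - y.mean()

k5 = k_r(X, y, r=5)
```

Only the first r singular vectors and values are needed, so the computation never requires the OLS estimates of the full model.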

How many components?

Page 39: Ridge regression for risk prediction - with - Bayes-pharma

How many components?

[Figure: % of replicates with larger MSE using k_HKB than using k_r. Axis and legend labels: number of PCs (r); signal to noise ratio; percent (0 to 100).]

Page 40: Ridge regression for risk prediction - with - Bayes-pharma

How many components?

Most of the variance in genetic data can be explained by the first few principal components.

Page 41: Ridge regression for risk prediction - with - Bayes-pharma

How many components?

$$\mathrm{PSE} = \left\{1 + \frac{\mathrm{tr}(HH')}{n}\right\}\sigma^2 + \frac{b'b}{n} = \text{variance} + \frac{\text{bias}^2}{n}$$

• H is the “hat matrix”: $\hat{Y} = HY$

• Degrees of freedom for variance = $\mathrm{tr}(HH')$ (Hastie & Tibshirani, 1990).
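To make the degrees of freedom for variance concrete, the sketch below computes tr(HH') for a ridge fit in two ways: directly from the hat matrix H = X(X'X + kI)^{-1}X', and from the singular values d_j of X as the sum of (d_j²/(d_j² + k))². It also confirms that PCR with r components gives exactly r. Function names and toy data are illustrative assumptions.

```python
import numpy as np

def ridge_hat_matrix(X, k):
    """Hat matrix of ridge regression, H = X (X'X + kI)^{-1} X'."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + k * np.eye(p), X.T)

def df_variance_ridge(X, k):
    """tr(HH') computed from the singular values of X."""
    d = np.linalg.svd(X, compute_uv=False)
    return float(np.sum((d**2 / (d**2 + k)) ** 2))

# Hypothetical toy data with p > n.
rng = np.random.default_rng(2)
n, p, k = 40, 100, 5.0
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)

H = ridge_hat_matrix(X, k)
df_trace = np.trace(H @ H.T)
df_svd = df_variance_ridge(X, k)

# PCR with r components: H = U_r U_r' is a projection, so tr(HH') = r.
r = 7
U = np.linalg.svd(X, full_matrices=False)[0]
H_pcr = U[:, :r] @ U[:, :r].T
df_pcr = np.trace(H_pcr @ H_pcr.T)

print(df_trace, df_svd, df_pcr)
```

The singular-value form avoids building the n × n hat matrix, which matters when n is large.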

Page 42: Ridge regression for risk prediction - with - Bayes-pharma

How many components?

• For given r, RR estimates have less bias than PCR estimates.

• PCR using r components has r degrees of freedom for variance.

• We fixed r such that degrees of freedom of the ridge model using r components equals r.

$$\mathrm{tr}(HH') = r$$
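One possible reading of this rule, sketched under assumptions: for each candidate r, compute k_r, compute tr(HH') for the ridge fit with that k_r, and keep the r for which the two are closest. The search over all r, the tie-breaking, and the n - r variance denominator are choices made here for illustration and are not taken from the slides.

```python
import numpy as np

def choose_r(X, y):
    """Pick r so that tr(HH') of the ridge fit with k_r is as close as possible to r."""
    n = X.shape[0]
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    keep = d > d[0] * 1e-10                 # drop numerically zero components
    U, d = U[:, keep], d[keep]
    uty = U.T @ y
    alpha = uty / d                         # OLS PCR coefficients, one per component
    max_r = min(len(d), n - 2)
    best = None
    for r in range(1, max_r + 1):
        rss = max(y @ y - np.sum(uty[:r] ** 2), 0.0)   # residual SS using r components
        sigma2_r = rss / (n - r)                       # assumed denominator n - r
        k = r * sigma2_r / np.sum(alpha[:r] ** 2)      # k_r for this r
        df = np.sum((d**2 / (d**2 + k)) ** 2)          # tr(HH') of the ridge fit with k_r
        gap = abs(df - r)
        if best is None or gap < best[0]:
            best = (gap, r, k, df)
    return best[1], best[2], best[3]

# Hypothetical toy data with p > n.
rng = np.random.default_rng(3)
n, p = 50, 200
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
y = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(n)
y -= y.mean()

r_opt, k_opt, df_opt = choose_r(X, y)
print(r_opt, k_opt, df_opt)
```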

Page 46: Ridge regression for risk prediction - with - Bayes-pharma

Simulation Study

[Figure panels: mean prediction squared error; p-value trace.]

Page 48: Ridge regression for risk prediction - with - Bayes-pharma

Simulation study

• Performance comparison:
    SNP ranking followed by multivariate regression
    HyperLasso

• Continuous and binary outcomes

                                  Univariate (% of SNPs ranked by univariate p-value)   HLasso   RR
                                  0.1%    0.5%    1%      3%      4%
Continuous outcomes (mean PSE)    1.51    1.55    1.54    2.21    3.93                  2.41     1.23
Binary outcomes (mean CE)         0.49    0.48    0.48    0.49    0.50                  0.50     0.46

Page 49: Ridge regression for risk prediction - with - Bayes-pharma

Bipolar Disorder Data

• Two GWAS of Bipolar Disorder: WTCCC and GAIN.

• Case-control studies - model extended to logistic ridge regression.

• SNPs typed on different platforms. Impute2 to obtain common SNPs.

• When determining shrinkage parameter, training data were thinned (1 SNP every 100 kb).

• Univariate model - which significance threshold?

• HyperLasso - cross-validation to choose the parameters is computationally intensive.

                              Univariate (p-value threshold)    HyperLasso   Ridge Regression
                              10^-5     10^-7     10^-10
Mean classification error     0.489     0.491     0.490         0.492        0.465

Page 52: Ridge regression for risk prediction - with - Bayes-pharma

Significance testing in ridge regression

• Ridge regression is not a variable selection method - the shrinkage penalty does not shrink any coefficient estimates to zero.

• A test of significance of ridge regression coefficients had been proposed (Halawa & El Bassiouni, 2000) and applied (Malo et al., 2008) but not evaluated.

• We extended the test to be applicable when p >> n and to be applied in logistic ridge regression, and evaluated its performance on simulated and real data sets.

Page 55: Ridge regression for risk prediction - with - Bayes-pharma

Significance test

Based on a Wald test:

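$$T_k = \frac{\hat{\beta}_k}{\mathrm{se}(\hat{\beta}_k)}, \qquad H_0: T_k \sim N(0, 1)$$

$\mathrm{se}(\hat{\beta}_k)$ from covariance matrix

$$\mathrm{Var}(\hat{\beta}_k) = \sigma^2 (X'X + kI)^{-1} X'X (X'X + kI)^{-1}$$

taking into account both correlation in predictors and amount of shrinkage.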

Page 56: Ridge regression for risk prediction - with - Bayes-pharma

Simulation study: Causal SNP

Ridge Regression with Applications to Genetic Data

Erika Cule (1), Paolo Vineis (1) and Maria De Iorio (2)

(1) Imperial College London and (2) University College London

Risk Prediction using Genetic Data

Recent technological developments have increased the availability of genetic data. These data can be used to investigate the association between genetic variants and disease risk. Genetic data present statistical and computational challenges due to their high dimensionality and correlation structure.

Ridge Regression

Ridge regression [2] is a penalized regression method that places a penalty on the squared length of the regression coefficients. Ordinary least squares (OLS) estimates of regression coefficients are replaced by their ridge counterparts. The amount of penalization is determined by a penalty parameter, k. In the standard regression model

$$Y = X\beta + \varepsilon \qquad (1)$$

Here $Y = (Y_1, \ldots, Y_n)$ are the observed phenotypes in n individuals and X is an n × p matrix comprising rows $x_i = (x_{i1}, \ldots, x_{ip})$ of genotypes at p loci. $\beta = (\beta_1, \ldots, \beta_p)$ is a vector of p regression coefficients to be estimated. $\varepsilon \overset{iid}{\sim} N(0, \sigma^2)$ is the measurement error. Ridge regression estimates are obtained as

$$\hat{\beta}_k = (X'X + kI_p)^{-1} X'Y$$

Significance Testing

The Wald statistic is obtained by dividing the estimated coefficient by its standard error:

$$T_k = \frac{\hat{\beta}_{jk}}{\mathrm{se}(\hat{\beta}_{jk})}$$

The standard error is obtained as the square root of the jth element of the diagonal of the variance matrix:

$$\mathrm{Var}(\hat{\beta}_k) = \sigma^2 (X'X + kI_p)^{-1} X'X (X'X + kI_p)^{-1}$$

$\sigma^2$ is replaced by its estimate, $\hat{\sigma}^2$:

$$\hat{\sigma}^2 = \frac{(Y - X\hat{\beta}_k)'(Y - X\hat{\beta}_k)}{n - \mathrm{tr}(2H - HH')}$$

H is the hat matrix:

$$H = X (X'X + kI_p)^{-1} X'$$

In the large sample sizes typical of GWAS, under $H_0: \beta_j = 0$, $T_k \sim N(0, 1)$.
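The test above can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the function name `ridge_wald_test`, the toy data, and the use of explicit matrix inverses (fine for small p, impractical genome-wide) are choices made here. The residual in the variance estimate is taken from the ridge fit.

```python
import numpy as np
from math import erfc, sqrt

def ridge_wald_test(X, y, k):
    """Wald statistics and two-sided normal p-values for ridge coefficients."""
    n, p = X.shape
    A = np.linalg.inv(X.T @ X + k * np.eye(p))       # (X'X + kI)^{-1}
    beta_k = A @ X.T @ y                             # ridge estimate
    H = X @ A @ X.T                                  # hat matrix
    resid = y - X @ beta_k
    resid_df = n - np.trace(2 * H - H @ H.T)         # n - tr(2H - HH')
    sigma2_hat = resid @ resid / resid_df
    cov = sigma2_hat * A @ (X.T @ X) @ A             # sandwich covariance of beta_k
    se = np.sqrt(np.diag(cov))
    T = beta_k / se
    pvals = np.array([erfc(abs(t) / sqrt(2)) for t in T])  # 2 * (1 - Phi(|t|))
    return beta_k, se, T, pvals

# Hypothetical toy data: only the first predictor affects the outcome.
rng = np.random.default_rng(4)
n, p = 200, 10
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
beta_true = np.zeros(p)
beta_true[0] = 1.0
y = X @ beta_true + rng.standard_normal(n)
y -= y.mean()

beta_k, se, T, pvals = ridge_wald_test(X, y, k=1.0)
print(np.round(pvals, 4))
```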

Logistic Ridge Regression

The logistic model is commonly used to model biomedical data, where the $Y_i$ represent, for example, cases ($Y_i = 1$) and controls ($Y_i = 0$). The above test was extended to coefficients estimated using ridge logistic regression.
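For orientation, a ridge-penalized logistic model can be fitted by Newton-Raphson as sketched below. The source does not give the fitting algorithm or the penalty scaling, so everything here is an assumption: the objective is the log-likelihood minus k times the squared length of the slopes (giving 2kI in the Newton step), the intercept is left unpenalized, and the function names and toy data are hypothetical.

```python
import numpy as np

def penalized_score(X, y, beta, k):
    """Gradient of the log-likelihood minus k * ||slopes||^2 (intercept unpenalized)."""
    X1 = np.column_stack([np.ones(X.shape[0]), X])
    mu = 1.0 / (1.0 + np.exp(-(X1 @ beta)))
    pen = 2.0 * k * beta
    pen[0] = 0.0
    return X1.T @ (y - mu) - pen

def logistic_ridge(X, y, k, max_iter=100, tol=1e-10):
    """Newton-Raphson fit of ridge logistic regression; returns [intercept, slopes]."""
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])
    penalty = 2.0 * k * np.eye(p + 1)
    penalty[0, 0] = 0.0                               # do not shrink the intercept
    beta = np.zeros(p + 1)
    for _ in range(max_iter):
        grad = penalized_score(X, y, beta, k)
        if np.max(np.abs(grad)) < tol:
            break
        mu = 1.0 / (1.0 + np.exp(-(X1 @ beta)))
        w = mu * (1.0 - mu)                           # logistic variance weights
        hess = X1.T @ (X1 * w[:, None]) + penalty     # X'WX + penalty matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Hypothetical case-control toy data.
rng = np.random.default_rng(5)
n, p = 300, 5
X = rng.standard_normal((n, p))
prob = 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, -1.0, 0.0, 0.0, 0.0]))))
y = (rng.random(n) < prob).astype(float)

b_weak = logistic_ridge(X, y, k=1.0)
b_strong = logistic_ridge(X, y, k=100.0)
print(np.linalg.norm(b_weak[1:]), np.linalg.norm(b_strong[1:]))
```

A larger k pulls the slope estimates towards zero, so the second printed norm is smaller than the first.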

Simulation study

[Figure: four panels. Histograms of the coefficient estimate (frequency), titled p = 0 and p = 0.496; densities of T (probability), titled p = 1.07e-08 and p = 0.322.]

Simulated genetic data were used to compare the performance of the proposed test (right column) with a permutation test (left column), which we view as a benchmark. Top, a SNP that was associated with phenotype; Bottom, a SNP not associated with phenotype.

Lung Cancer Data

3 SNPs that have previously been found to be associated with lung cancer disease status were found as such by our test (right), which performs well in comparison to a permutation test (left).

[Figure: p-value traces, -log p-value against shrinkage parameter (0 to 300). Panel a: approximate test. Panel b: permutation test. Legend: rs8034191, rs16969968, rs402710, other SNPs.]

Choice of Penalty Parameter

Several methods for choosing the penalty parameter k in ridge regression have been proposed, but no single method provides a universally optimum choice. Further, existing methods are not applicable when the number of predictors is greater than the number of observations, as is commonly the case for genetic data.

Proposed estimator

Taking the eigendecomposition $X'X = Q\Lambda Q'$, the model in (1) is written in canonical form:

$$Y = Z\alpha + \varepsilon$$

with $Z = XQ$ and $\alpha = Q'\beta$. Columns of Z are the principal components of X. The OLS estimator for α is

$$\hat{\alpha} = \Lambda^{-1} Z'Y$$

and the first r elements of $\hat{\alpha}$ are the first r coefficients in a principal components regression (PCR). We propose

$$k_r = \frac{r\hat{\sigma}_r^2}{\hat{\alpha}_r'\hat{\alpha}_r}$$

where $\hat{\sigma}_r^2$ is estimated using the first r columns of Z and r is chosen such that the degrees of freedom for variance is the same as that of a PCR using r components.

Prediction results

The estimator was evaluated by comparison to HyperLasso regression [3]. Data are from the WTCCC, Bipolar Disorder phenotype, and were split into training and test data sets. The HL shape parameter was fixed to 3.5; the penalty parameter was chosen by cross-validation.

References

[1] Erika Cule, Paolo Vineis, and Maria De Iorio, Significance testing in ridge regression for genetic data, BMC Bioinformatics 12 (2011), no. 1, 372.

[2] Arthur E Hoerl and RW Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970), no. 1, 55–67.

[3] Clive J Hoggart, JC Whittaker, M De Iorio, and David J Balding, Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies, PLoS Genet 4 (2008), no. 7, e1000130.

Page 57: Ridge regression for risk prediction - with - Bayes-pharma

Simulation study: Non-causal SNP

Page 58: Ridge regression for risk prediction - with - Bayes-pharma

Simulation study: True-positive and False-positive rates

                                    Shrinkage parameter
                         Approximate test                Permutation test
Individuals  SNPs         0.1    1      10     100       0.1    1      10     100
500          20     TPR   1.000  1.000  1.000  1.000     1.000  1.000  1.000  1.000
                    FPR   0.045  0.045  0.061  0.133     0.015  0.015  0.017  0.095
             100    TPR   1.000  1.000  1.000  1.000     1.000  1.000  1.000  1.000
                    FPR   0.056  0.054  0.071  0.141     0.015  0.018  0.024  0.074
             1000   TPR   0.100  0.500  0.900  1.000     0.000  0.200  0.800  1.000
                    FPR   0.038  0.045  0.049  0.080     0.007  0.006  0.010  0.029
             ALL    TPR   1.000  1.000  1.000  1.000     1.000  1.000  1.000  1.000
                    FPR   0.318  0.071  0.068  0.069     0.019  0.019  0.020  0.020
5000         20     TPR   1.000  1.000  1.000  1.000     1.000  1.000  1.000  1.000
                    FPR   0.048  0.048  0.048  0.113     0.006  0.006  0.006  0.053
             100    TPR   0.900  0.900  1.000  1.000     0.800  0.900  1.000  1.000
                    FPR   0.055  0.052  0.062  0.100     0.003  0.001  0.007  0.055
             1000   TPR   0.700  0.700  1.000  1.000     0.700  0.700  0.900  1.000
                    FPR   0.046  0.046  0.045  0.060     0.006  0.007  0.008  0.014
             ALL    TPR   0.400  0.500  0.900  1.000     0.300  0.900  0.900  1.000
                    FPR   0.026  0.027  0.029  0.042     0.007  0.007  0.007  0.009

Page 59: Ridge regression for risk prediction - with - Bayes-pharma

Lung Cancer Data

Page 61: Ridge regression for risk prediction - with - Bayes-pharma

Summary

• Prediction is a challenging problem!

• Ridge regression is a popular penalized regression approach that has been shown to perform well for prediction.

• We propose a semi-automatic method for choosing the shrinkage parameter in ridge regression, which can be applied when p >> n.

• We introduced a method for testing the significance of regression coefficients estimated using ridge regression.

• We have enabled ridge regression to be a feasible tool for genetic risk prediction on a genome-wide scale.

Page 66: Ridge regression for risk prediction - with - Bayes-pharma

R package ridge

• Fitting ridge regression models to data comprising hundreds of thousands of predictors presents computational challenges.

• We have written an R package, ridge, for fitting such models.

• For large data sets, C code is used (with a user-friendly R interface).

• Where available, multi-core or GPU computation speeds up matrix operations.

• Flexibility to include non-genetic covariates - penalized or not.

• Significance test is implemented.

• Graphical outputs - ridge and p-value traces.

• Option for user-specified shrinkage parameter, with our semi-automatic method as the default.

Page 74: Ridge regression for risk prediction - with - Bayes-pharma

Acknowledgements

• Maria De Iorio

• Colleagues in the Department of Epidemiology and Biostatistics, Imperial College London

• Colleagues in the Department of Statistical Science, University College London

• ILCO study nested within EPIC

• WTCCC and GAIN studies

Page 75: Ridge regression for risk prediction - with - Bayes-pharma

References

[1] D. G. Clayton. Prediction and interaction in complex disease genetics: experience in type 1 diabetes. PLoS Genetics, 2009.

[2] Erika Cule and Maria De Iorio. A semi-automatic method to guide the choice of ridge parameter in ridge regression. arXiv, stat.AP, May 2012.

[3] Erika Cule, Paolo Vineis, and Maria De Iorio. Significance testing in ridge regression for genetic data. BMC Bioinformatics, 12(1):372, 2011.

[4] Ildiko Frank and Jerome Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, May 1993.

[5] A M Halawa and M Y El Bassiouni. Tests of regression coefficients under ridge regression models. Journal of Statistical Computation and Simulation, 65(1):341–356, 2000.

[6] T Hastie and R Tibshirani. Generalized Additive Models. Chapman & Hall, 1990.

[7] Arthur E Hoerl and RW Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

[8] Clive J Hoggart, JC Whittaker, M De Iorio, and David J Balding. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet, 4(7):e1000130, 2008.

[9] Nathalie Malo, Ondrej Libiger, and Nicholas J Schork. Accommodating linkage disequilibrium in genetic-association analyses via ridge regression. Am J Hum Genet, 82(2):375–385, Feb 2008.

[10] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58:267–288, 1996.

[11] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B, 67(2):301–320, Jan 2005.