gene_identification_report

Identifying Genes with Prognostic DNA Methylation Rates for Breast Cancer Survival

Teun de Planque, Christopher Elamri

Department of Computer Science

Stanford University

Abstract: Breast cancer treatments using methylation inhibitors are an effective new therapeutic option for breast cancer patients [1]. We used four regression models (proportional hazards regression, elastic net regression, ridge regression, and lasso regression) to identify genes whose methylation rates are strongly correlated with breast cancer survival. With each regression model we identified genes for which high methylation rates are strongly favorably correlated with breast cancer survival, and genes for which high methylation rates are strongly adversely correlated with breast cancer survival. A better understanding of the relationship between DNA methylation rates and breast cancer survival can assist in the development of patient-tailored therapy strategies and the discovery of therapeutic targets.

I. INTRODUCTION

DNA methylation is an epigenetic process by which methyl groups are added to the cytosine (C) or adenine (A) nucleotides in the DNA molecule [1]. This addition of a methyl group to DNA is used to regulate gene expression and ensure stable gene silencing. Abnormal DNA methylation patterns have been associated with breast cancer development [1, 2, 3]. However, epigenetic processes are reversible, and inhibitors of DNA methylation can reactivate silenced tumor suppressor genes and restore normal gene function. Therapeutic applications of methylation inhibitors provide an effective new treatment option for breast cancer patients. Identifying the genes whose methylation rates correlate with breast cancer survival is challenging, however, because of the enormous number of human genes. In this paper, we use different regression models, including proportional hazards regression, elastic net regression, ridge regression, and lasso regression, to identify genes whose methylation rates strongly correlate with breast cancer survival.

II. TASK DEFINITION

Using the survival and methylation data of breast cancer patients as input, our goal is to output a set of genes whose methylation rates are strong predictors of breast cancer survival.

A. Datasets

We used two datasets from TCGA (The Cancer Genome Atlas): one contains survival data of cancer patients; the other contains genomic data, copy number variation (CNV) data, and methylation data of cancer patients. The survival dataset contains the type of cancer (11 different cancer types in total), the "time to last contact or event," and whether the event occurred (1: death) or not (0: no death at time of last contact) for 8,089 different patients [4]. Many of the survival times are censored, i.e. observation was cut off before death occurred; this indicates that the patient either was still alive at the end of the study or withdrew from the study before it ended. The dataset with genomic, CNV, and methylation data contains methylation data for more than 16,500 different genes of over 1,000 different cancer patients [4, 5].

B. Input and Output

• Input: survival data of breast cancer patients, including the "time to last contact or event" and whether the event occurred (1: death) or not (0: no death at time of last contact), and the methylation data of these breast cancer patients.

• Output: a set of genes whose methylation rates are strong predictors of breast cancer survival.

C. Evaluation Metric

We use 10-fold cross-validation to measure the success of our system by evaluating how well the survival of patients in our test set can be predicted based on the methylation rates of the genes chosen by our system. To do this, we compute the hazard of death of the patients in our test set given their methylation rates for the chosen genes, using the Cox proportional hazards model. The hazard of death at time t can be interpreted as the risk of dying at time t. Ideally, the computed hazard is much higher for patients in the test set who die than for patients in the test set who survive.
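The 10-fold split described above can be sketched as follows. This is a minimal illustration in plain Python, assuming the folds are formed by shuffling patient indices at random; the function names, the seed, and the use of 989 patients (the merged breast cancer cohort size reported in this paper) as the example size are illustrative choices, not the authors' implementation.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k shuffled, near-equal test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    return [indices[i::k] for i in range(k)]

def train_test_splits(n_samples, k=10, seed=0):
    """Yield one (train, test) pair of index lists per fold."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Every patient lands in exactly one test fold.
splits = list(train_test_splits(989, k=10))
```

Each model would be fitted on the `train` indices of a split and the hazards evaluated on the matching `test` indices.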

III. APPROACH

A. Baseline

Doctors do not yet use methylation data of breast cancer patients when selecting breast cancer therapy strategies or when predicting breast cancer survival. In other words, methylation rates currently do not affect hazard estimates, either for patients who will survive breast cancer or for patients who will die from it. We use this as our baseline: the average 'hazard ratio relative to the sample average' based on methylation rates is the same (1) for both patients who will survive breast cancer and patients who will die from breast cancer.

B. Oracle

Our oracle knows which genes are most correlated with breast cancer survival. Thus, it identifies the genes for which the average predicted 'hazard ratio relative to the sample average' of patients who will survive breast cancer is minimal, or for which the average predicted 'hazard ratio relative to the sample average' of patients who will die from breast cancer is maximal. We do not know what these genes are, so there is no way for us to implement the oracle; the purpose of this work is to identify those genes. Ideally, we could correctly predict survival for all patients in our test set based on the methylation rates of the genes selected by the oracle. This corresponds to an average predicted 'hazard ratio relative to the sample average' of 0 for patients who will survive breast cancer, and of ∞ for patients who will die from breast cancer.

C. Data Preprocessing

We merged the dataset containing genomic data, copy number variation (CNV) data, and methylation data with the survival dataset by processing all 9,074 patient IDs, putting them in the same format, and then finding the patient IDs contained in both datasets. We then created a matrix with the methylation data of the 16,020 different genes for all the breast cancer patients contained in both datasets (989 patients in total). For each of the 989 patients, we added their survival data, including the "time to last contact or event" and whether the event occurred (1: death) or not (0: no death at time of last contact), to this new matrix. Because of the enormous number of genes in this matrix (16,020), we reduced it by removing genes whose methylation rates correlate barely or not at all with breast cancer survival. We identified these genes using our regression models: for each model, we fitted the model to the survival and methylation data and then removed the genes with the lowest absolute weights.
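The two preprocessing steps above (matching patient IDs across datasets, then dropping genes with small absolute weights) can be sketched as follows. This is a hedged illustration only: the paper does not state the ID formats, so the normalization rule in `normalize_id` is a hypothetical stand-in, and all patient IDs, gene names, and numbers below are made-up toy values.

```python
def normalize_id(patient_id):
    """Hypothetical normalization: trim, upper-case, dots to dashes."""
    return patient_id.strip().upper().replace(".", "-")

def merge_on_patient(survival, methylation):
    """Keep only patients present in both datasets.

    survival:    {patient_id: (time, event)}
    methylation: {patient_id: {gene: methylation_rate}}
    """
    surv = {normalize_id(p): v for p, v in survival.items()}
    meth = {normalize_id(p): v for p, v in methylation.items()}
    shared = sorted(surv.keys() & meth.keys())
    return {p: {"time": surv[p][0], "event": surv[p][1], "genes": meth[p]}
            for p in shared}

def keep_top_genes(weights, n_keep):
    """Return the n_keep genes with the largest absolute fitted weight."""
    ranked = sorted(weights, key=lambda g: abs(weights[g]), reverse=True)
    return ranked[:n_keep]

# Toy inputs (not real data): IDs written in inconsistent formats.
survival = {"pt.001": (1200, 0), "PT-002": (340, 1), "PT-003": (90, 1)}
methylation = {"PT-001": {"GENE_A": 0.61}, " pt-002 ": {"GENE_A": 0.22}}
merged = merge_on_patient(survival, methylation)

# Toy fitted weights: GENE_B has the smallest |weight| and is dropped.
kept = keep_top_genes({"GENE_A": -0.8, "GENE_B": 0.05, "GENE_C": 0.3}, 2)
```

Ranking by absolute value matters here because strongly negative weights (favorable genes) are as informative as strongly positive ones (adverse genes).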

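D. Regularized Least-squares Regression Using Ridge

Ridge regression minimizes squared error while regularizing the L2-norm of the weights [6]:

$$J(w) = \lambda \|w\|_2^2 + \sum_i (w^T x_i - y_i)^2 \tag{1}$$

Then the stationary condition is

$$\frac{\partial J}{\partial w} = 2\lambda w + 2\sum_i (w^T x_i - y_i)\, x_i = 0 \tag{2}$$

Dividing by 2 and writing X for the matrix whose columns are the x_i gives

$$(X X^T + \lambda I)\, w = X y \tag{3}$$

$$w = (X X^T + \lambda I)^{-1} X y \tag{4}$$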

Ridge regression is well suited to problems with many predictors (here, the 16,020 genes in our dataset), all with non-zero coefficients drawn from a normal distribution. In particular, ridge regression performs well with predictors that have small effects, and prevents the coefficients of regression models with many correlated variables from being poorly determined and exhibiting high variance [7].
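As a small worked instance of the closed-form ridge solution, the sketch below solves (sum_i x_i x_i^T + λI) w = sum_i y_i x_i in plain Python. It assumes samples are stored one per row, so the sum of outer products plays the role of XX^T; the helper names `solve` and `ridge_fit` and the toy data are illustrative and not from the original analysis.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def ridge_fit(samples, y, lam):
    """Closed-form ridge weights for samples given one per row."""
    p = len(samples[0])
    # A = sum_i x_i x_i^T + lam * I
    A = [[sum(x[j] * x[k] for x in samples) + (lam if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    # rhs = sum_i y_i x_i
    rhs = [sum(x[j] * yi for x, yi in zip(samples, y)) for j in range(p)]
    return solve(A, rhs)

# Toy data generated exactly by w = [1, 2].
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
w_unregularized = ridge_fit(samples, y, lam=0.0)  # recovers [1, 2]
w_ridge = ridge_fit(samples, y, lam=1.0)          # shrunk toward zero
```

With λ = 1 the weights shrink to [0.875, 1.375], which shows the regularizer trading some fit for a smaller L2-norm.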

E. Regularized Least-squares Regression Using Lasso

Lasso regression methods are widely used in domains with massive datasets, such as genomics, for which efficient and fast algorithms are essential [7]. However, lasso regularization is not robust to high correlations among predictors: it will arbitrarily choose one predictor, ignore the others, and break down when all predictors are identical [8]. Moreover, the lasso penalty expects many coefficients to be close to zero and only a small subset to be significantly larger than zero. The lasso estimator uses the L1-norm penalized least squares criterion to obtain a sparse solution to the following optimization problem:

$$\hat{w} = \arg\min_w \; \|y - Xw\|_2^2 + \lambda \|w\|_1 \tag{5}$$

where $\|w\|_1 = \sum_{j=1}^{p} |w_j|$ is the L1-norm penalty on w, which induces sparsity in the solution, and λ ≥ 0 is a tuning parameter. Here X has one row per patient, i.e. it is the transpose of the X used in Equations (3) and (4). The L1-norm penalty enables the lasso method to simultaneously regularize the least squares fit and shrink some components of w to zero for a suitably chosen λ. However, the lasso method is unstable for high-dimensional data and, when p > n, cannot select more variables than the sample size n before it saturates [8].
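To illustrate how the L1 penalty sets coefficients exactly to zero, here is a minimal coordinate-descent sketch for the objective ||y - Xw||_2^2 + λ||w||_1 in plain Python. Coordinate descent with soft thresholding is one standard way to solve the lasso (it is the approach of [7]); whether it matches the solver used for this paper is an assumption, and the function names and toy data are illustrative.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_fit(samples, y, lam, n_sweeps=200):
    """Minimize sum_i (y_i - w.x_i)^2 + lam * sum_j |w_j|."""
    p = len(samples[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for x, yi in zip(samples, y):
                partial = yi - sum(w[k] * x[k] for k in range(p) if k != j)
                rho += x[j] * partial
                z += x[j] * x[j]
            # The exact one-dimensional minimizer is S(rho, lam/2) / z.
            w[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
    return w

# Toy data: feature 0 has a strong effect, feature 1 a weak one.
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
w_least_squares = lasso_fit(samples, y, lam=0.0)  # both weights non-zero
w_lasso = lasso_fit(samples, y, lam=1.0)          # weak weight becomes 0
```

In this toy case the penalty removes the weak feature entirely and shrinks the strong one from 2.0 to 1.75, which is the sparsity behavior described above.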

F. Regularized Least-squares Regression Using Elastic Net

The elastic net (ENET) is an extension of the lasso that is robust to high correlations among the predictors. It circumvents the instability of the lasso solution paths when predictors are highly correlated, as in our DNA methylation analysis, and can efficiently analyze high-dimensional data [9]. In particular, the ENET uses a mixture of the L1-norm (lasso) and L2-norm (ridge regression) penalties and can be formulated as:

$$\hat{w} = \left(1 + \frac{\lambda_2}{n}\right) \left\{ \arg\min_w \; \|y - Xw\|_2^2 + \lambda_2 \|w\|_2^2 + \lambda_1 \|w\|_1 \right\} \tag{6}$$

On setting $\alpha = \lambda_2 / (\lambda_1 + \lambda_2)$, the ENET estimator is seen to be equivalent to the minimizer of:

$$\hat{w} = \arg\min_w \; \|y - Xw\|_2^2 \tag{7}$$

subject to

$$P_\alpha(w) = (1 - \alpha) \|w\|_1 + \alpha \|w\|_2^2 \le s \quad \text{for some } s \tag{8}$$

where $P_\alpha(w)$ is the ENET penalty [9].

Thus, the ENET reduces to ridge regression when α = 1 and to the lasso when α = 0. The L1-norm part of the ENET does automatic variable selection, while the L2-norm part encourages grouped selection and stabilizes the solution paths with respect to random sampling, thereby improving prediction. By inducing a grouping effect during variable selection, such that a group of highly correlated variables tends to have coefficients of similar magnitude, the ENET can select groups of correlated features when the groups are not known in advance. Unlike the lasso, when p >> n the elastic net can select more than n variables [9].
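The limiting behavior of the ENET penalty can be checked numerically. The short sketch below evaluates P_α(w) = (1 - α)||w||_1 + α||w||_2^2 and the mixing weight α = λ2 / (λ1 + λ2); the function names and the example weight vector are illustrative assumptions.

```python
def enet_penalty(w, alpha):
    """Elastic net penalty: (1 - alpha) * ||w||_1 + alpha * ||w||_2^2."""
    l1 = sum(abs(wj) for wj in w)
    l2_squared = sum(wj * wj for wj in w)
    return (1.0 - alpha) * l1 + alpha * l2_squared

def mixing_alpha(lam1, lam2):
    """Mixing weight alpha = lam2 / (lam1 + lam2)."""
    return lam2 / (lam1 + lam2)

w = [0.5, -2.0, 0.0]                 # toy weight vector
lasso_limit = enet_penalty(w, 0.0)   # pure L1 penalty: 2.5
ridge_limit = enet_penalty(w, 1.0)   # pure squared L2 penalty: 4.25
halfway = enet_penalty(w, 0.5)       # equal mix: 3.375
alpha = mixing_alpha(1.0, 3.0)       # lam2 dominates, so alpha = 0.75
```

Setting λ1 = 0 gives α = 1 (ridge only), and setting λ2 = 0 gives α = 0 (lasso only), matching the two special cases stated above.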

G. Cox Proportional Hazards Model

The Cox model is a well-recognised statistical technique for analyzing the relationship between patient survival and explanatory variables [10]. The Cox regression model (also known as the proportional hazards regression model) allows us to isolate the effects of several explanatory variables and to deal with censored survival times. It models the survival times of the patients as a function of the gene methylation rates. Proportional hazards regression produces an equation for the hazard function of breast cancer patients given their DNA methylation rates. The hazard function is the probability that a breast cancer patient will die within a short time interval, given that the patient has survived up to the beginning of the interval. The hazard at time t can be interpreted as the risk that a breast cancer patient will die at time t.

The hazard function obtained using the Cox regression model is:

$$h(t) = h_0(t) \exp(\beta^T x) \tag{9}$$

where

• t: time after the start of the study

• h_0(t): the baseline hazard function

• β: vector of the regression coefficients

• x: vector of the values of the explanatory variables

The baseline hazard function represents the hazard of dying when all the methylation rates are zero. Based on the regression coefficients we can identify the genes most correlated with lower or higher survival rates. The regression coefficients with low (most negative) values correspond to genes whose methylation rates are favorably correlated with breast cancer survival, and the regression coefficients with high (most positive) values correspond to genes whose methylation rates are adversely correlated with breast cancer survival. A disadvantage of the Cox model is its proportional hazards (PH) assumption, under which the impact of each covariate on the hazard remains constant during the entire follow-up time. In our case, however, the genomic expression of a patient might change slightly during the study, thereby violating the PH assumption [10].
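Because the baseline hazard h_0(t) cancels in any ratio of two hazards, a patient's hazard relative to a reference patient depends only on β and x. The sketch below assumes that the 'hazard ratio relative to the sample average' used in this paper means exp(β^T(x - x̄)), where x̄ is the mean covariate vector of the sample; that reading, the function name, and the toy numbers are assumptions rather than the authors' code.

```python
import math

def relative_hazards(beta, patients):
    """Return exp(beta . (x - x_mean)) for each patient row x.

    The baseline hazard h0(t) cancels, so no time argument is needed.
    """
    p = len(beta)
    n = len(patients)
    mean = [sum(x[j] for x in patients) / n for j in range(p)]
    return [
        math.exp(sum(beta[j] * (x[j] - mean[j]) for j in range(p)))
        for x in patients
    ]

# Toy example: one gene with a positive (adverse) coefficient.
beta = [2.0]
patients = [[0.2], [0.8], [0.5]]  # made-up methylation rates, mean 0.5
ratios = relative_hazards(beta, patients)
```

The patient at the sample mean gets a ratio of 1, the patient with lower methylation a ratio below 1, and the patient with higher methylation a ratio above 1, as expected for a gene with a positive coefficient.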

IV. RESULTS

A. Error Analysis

We evaluated the different regression models using 10-fold cross-validation. We then used the Cox proportional hazards model to compute the 'hazard ratios relative to the sample average' for all patients in our test data based on the genes that affect breast cancer prognosis as selected by the different regression models. As visible in the bar graph with the average hazard ratios of patients in the test data, the average predicted 'hazard ratio relative to the sample average' based on the chosen genes is substantially larger for patients who die of breast cancer than for patients who are still alive at the time of last contact. In particular, the average hazard ratio based on the genes selected with elastic net regression is over 18.6 times higher for patients who die than for patients who were still alive at the time of last contact. The corresponding factors for the genes selected using ridge regression, lasso regression, and the Cox proportional hazards model are 10.0, 3.8, and 3.4, respectively. In other words, the genes selected using elastic net regression and ridge regression are particularly useful for predicting the risk of death of breast cancer patients within a certain time interval. The genes selected using lasso regression and the Cox proportional hazards model are also good predictors of the probability that a breast cancer patient will die within a certain time period, but the hazard predictions based on the genes selected using elastic net regression and ridge regression are more accurate.

[Figure: Kaplan-Meier survival curve for the GRHPR gene. x-axis: time (days); y-axis: cumulative survival percentage (%); curves: high vs. low GRHPR methylation rate.]

[Figure: Kaplan-Meier survival curve for the GADD45A gene. x-axis: time (days); y-axis: cumulative survival percentage (%); curves: high vs. low GADD45A methylation rate.]
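The comparison reported in this section reduces to simple arithmetic on the predicted hazard ratios. The sketch below assumes each reported factor is the mean hazard ratio of patients who died divided by the mean hazard ratio of patients alive at last contact; the function name and the toy numbers are illustrative and do not reproduce the paper's results.

```python
def mean_ratio_by_outcome(hazard_ratios, events):
    """Compare mean predicted hazard ratios of deceased vs. alive patients.

    events uses the paper's coding: 1 = death, 0 = alive at last contact.
    Returns (mean_died, mean_alive, mean_died / mean_alive).
    """
    died = [h for h, e in zip(hazard_ratios, events) if e == 1]
    alive = [h for h, e in zip(hazard_ratios, events) if e == 0]
    mean_died = sum(died) / len(died)
    mean_alive = sum(alive) / len(alive)
    return mean_died, mean_alive, mean_died / mean_alive

# Toy predictions for four test patients (made-up values).
hazard_ratios = [3.0, 0.5, 1.0, 0.5]
events = [1, 0, 1, 0]
mean_died, mean_alive, factor = mean_ratio_by_outcome(hazard_ratios, events)
```

A factor of 1 corresponds to the baseline, where methylation carries no information about outcome; larger factors indicate that the selected genes separate the two groups better.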

The Kaplan-Meier curves compare how long patients with high and low methylation rates of four of the genes selected using our methods survive [11]. As visible in the GRHPR Kaplan-Meier curve, breast cancer patients with a relatively high GRHPR methylation rate survive longer than patients with a relatively low GRHPR methylation rate. The Cox proportional hazards model, the lasso regression model, and the ridge regression model all suggest that a high GRHPR methylation rate favorably affects breast cancer survival. In fact, 5 years after the start of the survival study, 85% of the breast cancer patients (who did not withdraw from the study) with high GRHPR methylation rates were still alive, compared with 73% of those with low GRHPR methylation rates.

Lasso regression indicates that high GADD45A methylation rates are associated with high cancer survival rates. The GADD45A Kaplan-Meier curve shows that 83% of breast cancer patients (who did not withdraw from the study) with relatively high GADD45A methylation are still alive 5 years after the start of the survival study, compared with 76% of those with relatively low GADD45A methylation. Conversely, both elastic net regression and ridge regression suggest that a high ENOX2 methylation rate negatively affects breast cancer survival, and the Cox proportional hazards model indicates that a high ANKRD52 methylation rate adversely affects breast cancer survival. The curves for ENOX2 and ANKRD52 show that high ENOX2 and ANKRD52 methylation rates do indeed negatively affect breast cancer survival rates.

[Figure: Kaplan-Meier survival curve for the ENOX2 gene. x-axis: time (days); y-axis: cumulative survival percentage (%); curves: high vs. low ENOX2 methylation rate.]

[Figure: Kaplan-Meier survival curve for the ANKRD52 gene. x-axis: time (days); y-axis: cumulative survival percentage (%); curves: high vs. low ANKRD52 methylation rate.]

| Model | Top 3 favorably prognostic genes | Top 3 adversely prognostic genes |
|---|---|---|
| Cox Proportional Hazards Model | EEF1A1P9, GRHPR, CASP3 | COL6A2, ANKRD52, C12orf41 |
| Elastic Net Regression Model | CLEC2D, C9orf89, CASP3 | DHDDS, EXOC1, ENOX2 |
| Lasso Regression Model | GRHPR, FUZ, GADD45A | GGCX, GTPBP8, GRHL2 |
| Ridge Regression Model | ADH5, GRHPR, CASP3 | DNAJC8, ENOX2, EXOC1 |
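The Kaplan-Meier curves discussed above follow the product-limit estimator of [11]: at each observed death time, the running survival probability is multiplied by (1 - deaths / number at risk), while censored patients simply leave the risk set. A minimal plain-Python sketch is given below; the function name and toy data are illustrative, and how patients were divided into "high" and "low" methylation groups is not stated in the paper, so that step is left out.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time per patient (e.g. days)
    events: 1 = death observed, 0 = censored at that time
    Returns a list of (time, survival) pairs at each distinct death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set.
        at_risk -= removed
    return curve

# Toy cohort: the patient at t=2 is censored, the others die.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

To compare two groups as in the figures, the estimator would be run once per group and the two step curves plotted together.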

B. Literature Review

Several other projects have focused on applying machine learning techniques to extract valuable information from DNA methylation data. Previous projects mainly focused on evaluating different statistical methods for analyzing DNA methylation data [18], while others analyzed DNA methylation data for specific types of cancer, such as leukemia [19]. In that context, our project fits in the second framework, since we use different regression techniques for gene identification for breast cancer specifically, using DNA methylation data.

In terms of existing projects, our contribution is two-fold. First, we have compared different regression models (lasso, ridge, Cox proportional hazards, elastic net) to find potential genes highly correlated with breast cancer survival, which further highlights the importance of using different methods in gene identification (i.e. different genes can be found with different methods). Second, we have found genes whose methylation rates are highly correlated with breast cancer development (i.e., genes whose methylation has been shown to be linked to breast cancer survival), which may give additional directions for breast cancer research and breast cancer treatment development.

In fact, the favorably and adversely prognostic genes identified by our methods might be worth examining to further understand the biological mechanisms of breast cancer. Many of the genes we identified have been widely acknowledged in the medical literature as genes strongly correlated with breast cancer survival, such as CASP3 [12, 13], GADD45A [14], ENOX2 [15], GRHL2 [16], and COL6A2 [17]. Some of those genes were identified by only a subset of the methods, such as ENOX2 (among the top genes for elastic net and ridge regression only). This underscores the benefits of using distinct methods for gene identification. Moreover, given our success in identifying genes known to be highly correlated with breast cancer survival, the additional genes we found might be worth investigating to further understand breast cancer.

V. CONCLUSION

We have presented different regression techniques to identify genes that are highly correlated with breast cancer survival rates by analyzing the survival and DNA methylation data of 989 breast cancer patients [4]. Our results identify genes widely known in the medical literature to be involved in breast cancer development. The identified genes may prove helpful for the discovery of therapeutic targets and the development of patient-tailored therapy strategies.


ACKNOWLEDGMENT

This project would not have been possible without the help of the Gevaert Biomedical Informatics Lab, which provided both the datasets and ongoing support.

REFERENCES

[1] M. Szyf, 'DNA methylation signatures for breast cancer classification and prognosis', Genome Medicine, vol. 4, no. 3, p. 26, 2012.

[2] S. Baylin, 'Aberrant patterns of DNA methylation, chromatin formation and gene expression in cancer', Human Molecular Genetics, vol. 10, no. 7, pp. 687-692, 2001.

[3] K. Hansen, W. Timp, H. Bravo, S. Sabunciyan, B. Langmead, O. McDonald, B. Wen, H. Wu, Y. Liu, D. Diep, E. Briem, K. Zhang, R. Irizarry and A. Feinberg, 'Increased methylation variation in epigenetic domains across cancer types', Nature Genetics, vol. 43, no. 8, pp. 768-775, 2011.

[4] The Cancer Genome Atlas - National Cancer Institute, 'The Cancer Genome Atlas Home Page', 2015. [Online]. Available: http://cancergenome.nih.gov/. [Accessed: 20-Nov-2015].

[5] C. Creighton, 'SR2-3: Integrative Genomic Analyses of Breast Cancer from The Cancer Genome Atlas (TCGA).', Cancer Research, vol. 71, no. 24, pp. SR2-3-SR2-3, 2011.

[6] A. Hoerl and R. Kennard, 'Ridge Regression: Biased Estimation for Nonorthogonal Problems', Technometrics, vol. 42, no. 1.

[7] J. Friedman, T. Hastie and R. Tibshirani, 'Regularization Paths for Generalized Linear Models via Coordinate Descent', Journal of Statistical Software, vol. 33, no. 1, 2010.

[8] H. Zou, 'The Adaptive Lasso and Its Oracle Properties', Journal of the American Statistical Association, vol. 101, no. 476, pp. 1418-1429, 2006.

[9] J. Ogutu, T. Schulz-Streeck and H. Piepho, 'Genomic selection using regularized linear regression models: ridge regression, lasso, elastic net and their extensions', BMC Proc, vol. 6, no. 2, p. S10, 2012.

[10] M. Abrahamowicz, T. Schopflocher, K. Leffondre, R. du Berger and D. Krewski, 'Flexible Modeling of Exposure-Response Relationship between Long-Term Average Levels of Particulate Air Pollution and Mortality in the American Cancer Society Study', Journal of Toxicology and Environmental Health, Part A, vol. 66, no. 16-19, pp. 1625-1654, 2003.

[11] E. Kaplan and P. Meier, 'Nonparametric Estimation from Incomplete Observations', Journal of the American Statistical Association, vol. 53, no. 282, p. 457, 1958.

[12] N. O'Donovan, J. Crown, H. Stunell, A. D. Hill, E. McDermott, N. O'Higgins and M. J. Duffy, 'Caspase 3 in breast cancer', Clin Cancer Res, pp. 738-742, 2003.

[13] E. Devarajan, A. Sahin, J. Chen, R. Krishnamurthy, N. Aggarwal, A. Brun, A. Sapino, F. Zhang, D. Sharma, X. Yang, A. Tora and K. Mehta, 'Down-regulation of caspase 3 in breast cancer: a possible mechanism for chemoresistance', Oncogene, vol. 21, no. 57, pp. 8843-8851, 2002.

[14] J. Tront, Y. Huang, A. Fornace, B. Hoffman and D. Liebermann, 'Gadd45a Functions as a Promoter or Suppressor of Breast Cancer Dependent on the Oncogenic Stress', Cancer Research, vol. 70, no. 23, pp. 9671-9681, 2010.

[15] D. Morre and D. Morre, ECTO-NOX proteins. New York, NY: Springer, 2013.

[16] X. Xiang, Z. Deng, X. Zhuang, S. Ju, J. Mu, H. Jiang, L. Zhang, J. Yan, D. Miller and H. Zhang, 'Grhl2 Determines the Epithelial Phenotype of Breast Cancers and Promotes Tumor Progression', PLoS ONE, vol. 7, no. 12, p. e50781, 2012.

[17] E. Karousou, M. D'Angelo, K. Kouvidi, D. Vigetti, M. Viola, D. Nikitovic, G. De Luca and A. Passi, 'Collagen VI and Hyaluronan: The Common Role in Breast Cancer', BioMed Research International, vol. 2014, pp. 1-10, 2014.

[18] T. Wilhelm, 'Phenotype prediction based on genome-wide DNA methylation data', BMC Bioinformatics, vol. 15, no. 1, p. 193, 2014.

[19] J. Nordlund, C. Backlin, V. Zachariadis, et al., 'DNA methylation-based subtype prediction for pediatric acute lymphoblastic leukemia', Clin Epigenetics, vol. 7, no. 1, p. 11, 2015.
