Identifying Genes with Prognostic DNA Methylation Rates for Breast Cancer Survival
Teun de Planque, Christopher Elamri
Department of Computer Science
Stanford University
Abstract: Breast cancer treatments using methylation inhibitors are an effective new therapeutic option for breast cancer patients [1]. We used four regression models (proportional hazards regression, elastic net regression, ridge regression, and lasso regression) to identify genes whose methylation rates are strongly correlated with breast cancer survival. With each of the regression models we identified genes for which high methylation rates are strongly favorably correlated with breast cancer survival, and genes for which high methylation rates are strongly adversely correlated with breast cancer survival. A better understanding of the relationship between DNA methylation rates and breast cancer survival can assist in the development of patient-tailored therapy strategies and the discovery of therapeutic targets.
I. INTRODUCTION
DNA methylation is an epigenetic process by which methyl groups are added to the cytosine (C) or adenine (A) nucleotides in the DNA molecule [1]. This addition of a methyl group to DNA is used to regulate gene expression and assure stable gene silencing. Abnormal DNA methylation patterns have been associated with breast cancer development [1, 2, 3]. However, epigenetic processes are reversible, and inhibitors of DNA methylation can reactivate silenced tumor suppressor genes and restore normal gene function. Therapeutic applications of methylation inhibitors provide an effective new treatment option for breast cancer patients. Identifying the genes whose methylation rates correlate with breast cancer survival rates is, however, challenging because of the enormous number of human genes. In this paper, we use different regression models, including proportional hazards regression, elastic net regression, ridge regression, and lasso regression, to identify genes whose methylation rates strongly correlate with breast cancer survival rates.
II. TASK DEFINITION
Using the survival and methylation data of breast cancer patients as input, our goal is to output a set of genes whose methylation rates are strong predictors of breast cancer survival.
A. Datasets
We used two datasets from TCGA (The Cancer Genome Atlas); one contains survival data of cancer patients, the other contains genomic data, copy number variation (CNV) data, and methylation data of cancer patients. The survival dataset contains the type of cancer (11 different cancer types in total), the "time to last contact or event," and whether the event occurred (1: death) or not (0: no death at time of last contact) for 8,089 different patients [4]. Many of the survival times are censored, i.e., the time of observation was cut off before death occurred; this indicates that the patient either was still alive at the end of the study or withdrew from the study before it ended. The dataset with genomic data, copy number variation (CNV) data, and methylation data contains methylation data for more than 16,500 different genes of over 1,000 different cancer patients [4, 5].
B. Input and Output
• Input: survival data of breast cancer patients, including the "time to last contact or event," and whether the event occurred (1: death) or not (0: no death at time of last contact), and the methylation data of these breast cancer patients.
• Output: a set of genes whose methylation rates are strong predictors of breast cancer survival.
C. Evaluation Metric
We use 10-fold cross-validation to measure the success of our system by evaluating how well the survival of patients in our test set can be predicted based on the methylation rates of the genes chosen by our system. To do this, we compute the hazard of death of the patients in our test set given their methylation rates of the genes chosen by our system. We compute the hazard of death using the Cox proportional hazards model. The hazard of death at time t can be interpreted as the risk of dying at time t. Ideally, the computed hazard is much higher for patients in the test set who die than for patients in the test set who survive.
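The hazard comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that the "hazard ratio relative to the sample average" of a patient is exp(β·(x − x̄)), where x̄ is the mean methylation vector over the sample, which is a common convention for Cox risk scores. The function name and the toy numbers are hypothetical.

```python
import math

def relative_hazards(beta, rows):
    """Hazard ratio of each patient relative to the sample average.

    beta: Cox regression coefficients, one per selected gene.
    rows: per-patient methylation vectors (same gene order as beta).
    Assumption: the ratio is exp(beta . (x - x_mean)); the baseline
    hazard h0(t) cancels because it is shared by all patients.
    """
    n = len(rows)
    means = [sum(row[j] for row in rows) / n for j in range(len(beta))]
    return [
        math.exp(sum(b * (x - m) for b, x, m in zip(beta, row, means)))
        for row in rows
    ]

# Toy data (made up): one gene, two patients.
ratios = relative_hazards([1.0], [[0.0], [2.0]])
```

A patient whose methylation vector equals the sample mean gets a ratio of exactly 1 under this convention.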
III. APPROACH
A. Baseline
Doctors do not yet use methylation data of breast cancer patients when selecting breast cancer therapy strategies or when predicting breast cancer survival. In other words, methylation rates currently do not affect hazard estimates, either for patients who will survive breast cancer or for patients who will die from it. We use this as our baseline: the average 'hazard ratio relative to the sample average' based on methylation rates is the same (1) for patients who will survive breast cancer and for patients who will die from breast cancer.
B. Oracle
Our oracle knows which genes are most correlated with breast cancer survival. Thus, it identifies the genes for which the average predicted 'hazard ratio relative to the sample average' of patients who will survive breast cancer is minimal, or for which the average predicted 'hazard ratio relative to the sample average' of patients who will die from breast cancer is maximal. We do not know what these genes are, so there is no way for us to implement the oracle; the purpose of this work is to identify those genes. Ideally, we can correctly predict survival for all patients in our test set based on the methylation rates of the genes selected by the oracle. This corresponds to an average predicted 'hazard ratio relative to the sample average' of 0 for patients who will survive breast cancer, and of ∞ for patients who will die from breast cancer.
C. Data Preprocessing
We merged the dataset containing genomic data, copy number variation (CNV) data, and methylation data with the survival dataset by processing all 9,074 patient IDs, putting all of them in the same format, and then finding the patient IDs contained in both datasets. We then created a matrix with the methylation data of the 16,020 different genes for all the breast cancer patients contained in both datasets (989 patients in total). For each of the 989 patients, we added their survival data, including the "time to last contact or event," and whether the event occurred (1: death) or not (0: no death at time of last contact), to this new matrix. Because of the enormous number of genes included in this matrix (16,020), we reduced the number of genes by removing those whose methylation rates correlate barely or not at all with breast cancer survival. We identified these genes using our regression models: for each model, we fitted the model to the survival and methylation data and then removed the genes with the lowest absolute weights.
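The two preprocessing steps (matching patient IDs across datasets, then dropping low-weight genes) could look roughly like the sketch below. The ID normalization rule, the function names, and the example values are all hypothetical; the text does not specify the actual ID formats or how many genes were kept.

```python
def normalize_id(patient_id):
    # Hypothetical normalization rule: the source does not state the ID formats.
    return patient_id.strip().upper().replace(".", "-")

def shared_patients(survival_ids, methylation_ids):
    """Patient IDs present in both datasets after normalization."""
    a = {normalize_id(p) for p in survival_ids}
    b = {normalize_id(p) for p in methylation_ids}
    return sorted(a & b)

def keep_top_genes(weights, k):
    """Names of the k genes with the largest absolute fitted weight.

    weights: dict mapping gene name -> regression weight.
    """
    ranked = sorted(weights, key=lambda g: abs(weights[g]), reverse=True)
    return ranked[:k]

# Toy values (made up) to show the behaviour.
common = shared_patients(["p-01 ", "P-02"], ["P.01", "p-03"])
example_weights = {"GENE_A": 0.02, "GENE_B": -1.3, "GENE_C": 0.7, "GENE_D": -0.001}
selected = keep_top_genes(example_weights, k=2)
```

Ranking by absolute value keeps strongly negative weights (favorable genes) as well as strongly positive ones (adverse genes).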
D. Regularized Least-squares Regression Using Ridge
Ridge regression minimizes squared error while regularizing the L2-norm of the weights [6]:

$$J(w) = \lambda \|w\|_2^2 + \sum_i (w^T x_i - y_i)^2 \tag{1}$$

Then the stationary condition is

$$\frac{1}{2}\frac{\partial J}{\partial w} = \lambda w + \sum_i (w^T x_i - y_i)\, x_i = 0 \tag{2}$$

$$(XX^T + \lambda I)\, w = Xy \tag{3}$$

$$w = (XX^T + \lambda I)^{-1} Xy \tag{4}$$
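The step from the stationary condition (2) to the linear system (3) can be written out explicitly. The derivation below assumes the usual convention, which the text does not state, that X is the matrix whose i-th column is x_i and y is the vector of targets y_i:

$$\sum_i (w^T x_i - y_i)\, x_i = \Big(\sum_i x_i x_i^T\Big) w - \sum_i y_i x_i = XX^T w - Xy$$

Substituting into the stationary condition gives $\lambda w + XX^T w - Xy = 0$, i.e. $(XX^T + \lambda I)\, w = Xy$. For $\lambda > 0$ the matrix $XX^T + \lambda I$ is positive definite and therefore invertible, which is why the closed form (4) exists even when there are more predictors than patients.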
Ridge regression is ideal if there are many predictors (e.g., the 16,020 genes from our dataset) whose coefficients are all non-zero and drawn from a normal distribution. In particular, ridge regression performs well with predictors that have small effects, and it prevents the coefficients of regression models with many correlated variables from being poorly determined and exhibiting high variance [7].
E. Regularized Least-squares Regression Using Lasso
Lasso regression methods are widely used in domains with massive datasets, such as genomics, for which efficient and fast algorithms are essential [7]. However, lasso regularization is not robust to high correlations among predictors: it will arbitrarily choose one predictor, ignore the others, and break down when all predictors are identical [8]. Moreover, the lasso penalty expects many coefficients to be close to zero and only a small subset of coefficients to be significantly different from zero. The lasso estimator uses the L1-norm penalized least squares criterion to obtain a sparse solution to the following optimization problem:

$$\hat{w} = \arg\min_w \; \|y - Xw\|_2^2 + \lambda \|w\|_1 \tag{5}$$

where $\|w\|_1 = \sum_{j=1}^{p} |w_j|$ is the L1-norm penalty on w, which induces sparsity in the solution, and λ ≥ 0 is a tuning parameter. The L1-norm penalty enables the lasso method to simultaneously regularize the least squares fit and shrink some components of $\hat{w}$ to zero for a suitably chosen λ. However, the lasso method is unstable for high-dimensional data and, when the number of predictors p exceeds the sample size n, cannot select more than n variables before it saturates [8].
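To make the sparsity effect concrete, here is a minimal sketch of cyclic coordinate descent for the lasso objective ||y − Xw||² + λ||w||₁, the family of algorithms described in [7]. It is an illustrative pure-Python version under the assumption that X is given as a list of patient rows; it is not the solver used in this work, and the toy data are made up.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize ||y - Xw||_2^2 + lam * ||w||_1 by cyclic coordinate descent.

    X: list of n rows, each a list of p feature values; y: list of n targets.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            # Setting the subgradient of the objective in w_j to zero gives
            # a soft-threshold at lam / 2 (the squared loss has no 1/2 factor).
            w[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
    return w

# Toy orthogonal design (made up): the weak second feature is zeroed out.
w_hat = lasso_coordinate_descent([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.2], lam=1.0)
```

In this toy case the first coefficient is shrunk from 3.0 to 2.5 and the second is set exactly to zero, which is the variable-selection behaviour the L1 penalty is meant to produce.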
F. Regularized Least-squares Regression Using Elastic Net
The elastic net (ENET) is an extension of the lasso that is robust to high correlations among the predictors. In the context of our DNA methylation analysis, it circumvents the instability of the lasso solution paths when predictors are highly correlated, and it can efficiently analyze high-dimensional data [9]. In particular, the ENET uses a mixture of the L1-norm (lasso) and L2-norm (ridge regression) penalties and can be formulated as:

$$\hat{w} = \left(1 + \frac{\lambda_2}{n}\right) \arg\min_w \left\{ \|y - Xw\|_2^2 + \lambda_2 \|w\|_2^2 + \lambda_1 \|w\|_1 \right\} \tag{6}$$

On setting $\alpha = \lambda_2 / (\lambda_1 + \lambda_2)$, the ENET estimator is seen to be equivalent to the minimizer of:

$$\hat{w} = \arg\min_w \; \|y - Xw\|_2^2 \tag{7}$$

subject to

$$P_\alpha(w) = (1 - \alpha) \|w\|_1 + \alpha \|w\|_2^2 \le s \tag{8}$$

for some bound s, where $P_\alpha(w)$ is the ENET penalty [9].
Thus, the ENET simplifies to simple ridge regression when α = 1 and to the lasso when α = 0. The L1-norm part of the ENET does automatic variable selection, while the L2-norm part encourages grouped selection and stabilizes the solution paths with respect to random sampling, thereby improving prediction. By inducing a grouping effect during variable selection, such that a group of highly correlated variables tends to have coefficients of similar magnitude, the ENET can select groups of correlated features when the groups are not known in advance. Unlike the lasso, when p >> n the elastic net can select more than n variables [9].
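The two limiting cases of the ENET penalty can be checked numerically. The sketch below evaluates the penalty (1 − α)||w||₁ + α||w||₂² and the mixing weight α = λ₂ / (λ₁ + λ₂); the function names and the example vector are hypothetical.

```python
def enet_alpha(lam1, lam2):
    """Mixing weight alpha = lam2 / (lam1 + lam2)."""
    return lam2 / (lam1 + lam2)

def enet_penalty(w, alpha):
    """Elastic net penalty (1 - alpha) * ||w||_1 + alpha * ||w||_2^2."""
    l1 = sum(abs(v) for v in w)
    l2_squared = sum(v * v for v in w)
    return (1.0 - alpha) * l1 + alpha * l2_squared

w = [1.0, -2.0]                       # made-up weight vector
lasso_limit = enet_penalty(w, 0.0)    # pure L1 penalty: |1| + |-2|
ridge_limit = enet_penalty(w, 1.0)    # pure squared L2 penalty: 1 + 4
```

With λ₁ = 0 the mixing weight is 1 (ridge), and with λ₂ = 0 it is 0 (lasso); intermediate values blend the two penalties.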
G. Cox Proportional Hazards Model
The Cox model is a well-recognised statistical technique for analyzing the relationship between patient survival and explanatory variables [10]. The Cox regression model (also known as the proportional hazards regression model) allows us to isolate the effects of several explanatory variables and to deal with the censored survival times. It models the survival times of the patients as a function of the gene methylation rates. Proportional hazards regression produces an equation for the hazard function of breast cancer patients given their DNA methylation rates. The hazard function is the probability that a breast cancer patient will die within a short time interval, given that the patient has survived up to the beginning of the interval. The hazard at time t can be interpreted as the risk that a breast cancer patient will die at time t.
The hazard function obtained using the Cox regression model is:

$$h(t) = h_0(t) \exp(\beta^T x) \tag{9}$$

where
t: time after the start of the study
h_0(t): the baseline hazard function
β: vector of the regression coefficients
x: vector of the values of the explanatory variables
The baseline hazard function represents the hazard of dying when all the methylation rates are zero. Based on the regression coefficients we can identify the genes most correlated with lower or higher survival rates. The most negative regression coefficients correspond to genes whose methylation rates are favorably correlated with breast cancer survival, and the most positive regression coefficients correspond to genes whose methylation rates are adversely correlated with breast cancer survival. A disadvantage of the Cox model is its proportional hazards (PH) assumption: the impact of each covariate on the hazard is assumed to remain constant during the entire follow-up time. In our case, however, the genomic expression of a patient might change slightly during the study, thereby violating the PH assumption [10].
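The way the Cox model handles censored survival times is visible in its partial log-likelihood, the quantity maximized to estimate β. The sketch below is a plain-Python illustration using the Breslow form for tied times; it is an assumption that this matches the fitting procedure actually used, and the toy data are made up.

```python
import math

def cox_partial_log_likelihood(beta, times, events, rows):
    """Cox partial log-likelihood (Breslow form for tied times).

    times: time to last contact or event for each patient.
    events: 1 if the patient died at that time, 0 if censored.
    rows: per-patient covariate vectors in the same order as beta.
    """
    scores = [sum(b * x for b, x in zip(beta, row)) for row in rows]
    total = 0.0
    for i, (t_i, died) in enumerate(zip(times, events)):
        if not died:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_set = [math.exp(s) for s, t in zip(scores, times) if t >= t_i]
        total += scores[i] - math.log(sum(risk_set))
    return total

# Toy data (made up): three patients, the second one is censored.
value = cox_partial_log_likelihood([0.0], [1, 2, 3], [1, 0, 1], [[0.0], [1.0], [2.0]])
```

The baseline hazard h0(t) never appears: it cancels in each ratio of hazards, so β can be estimated without specifying it.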
IV. RESULTS
A. Error Analysis
We evaluated the different regression models using 10-fold cross-validation. We then used the Cox proportional hazards model to compute the 'hazard ratios relative to the sample average' for all patients in our test data based on the genes that affect breast cancer prognosis as selected by the different regression models. As visible in the bar graph with the average hazard ratios of patients in the test data, the average predicted 'hazard ratio relative to the sample average' based on the chosen genes is significantly larger for patients who die of breast cancer than for patients who are still alive at the time of last contact. In particular, the 'hazard ratio relative to the sample average' based on the genes selected with elastic net regression is over 18.6 times higher for patients who die than for patients who were still alive at the time of last contact. The 'hazard ratio relative to the sample average' based on the genes selected using ridge regression, lasso regression, and the Cox proportional hazards model is respectively 10.0, 3.8, and 3.4 times higher for patients who die than for patients who are still alive at the time of last contact. In other words, the genes selected using elastic net regression and ridge regression are particularly useful for predicting the risk of death of breast cancer patients within a certain time interval. The genes selected using lasso regression and the Cox proportional hazards model are also good predictors of the probability that a breast cancer patient will die within a certain time period, but the hazard predictions based on the genes selected using elastic net regression and ridge regression are more accurate.

[Figure: Kaplan-Meier Survival Curve for GRHPR Gene. x-axis: Time (days), 0 to 3000; y-axis: Cumulative Survival Percentage (%), 0.5 to 1.0; curves: high GRHPR methylation rate, low GRHPR methylation rate.]

[Figure: Kaplan-Meier Survival Curve for GADD45A Gene. x-axis: Time (days), 0 to 3000; y-axis: Cumulative Survival Percentage (%), 0.5 to 1.0; curves: high GADD45A methylation rate, low GADD45A methylation rate.]
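Figures such as "18.6 times higher" compare two group averages. One plausible reading, sketched below, is the ratio of the mean relative hazard among patients who died to the mean among patients still alive at last contact. This is an assumption about how the comparison was made, and the numbers in the example are made up rather than taken from the study.

```python
def mean_hazard_ratio_by_outcome(relative_hazards, events):
    """Compare average relative hazards between outcome groups.

    relative_hazards: per-patient hazard ratios relative to the sample average.
    events: 1 if the patient died, 0 if alive at time of last contact.
    Returns (mean among died, mean among alive, ratio of the two means).
    """
    died = [h for h, e in zip(relative_hazards, events) if e == 1]
    alive = [h for h, e in zip(relative_hazards, events) if e == 0]
    mean_died = sum(died) / len(died)
    mean_alive = sum(alive) / len(alive)
    return mean_died, mean_alive, mean_died / mean_alive

# Made-up test-fold values, for illustration only.
mean_died, mean_alive, fold = mean_hazard_ratio_by_outcome(
    [2.0, 4.0, 0.5, 1.0], [1, 1, 0, 0]
)
```

Under the baseline described earlier both means equal 1, so the ratio is 1; larger ratios indicate that the selected genes separate the two outcome groups better.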
The Kaplan-Meier curves compare how long patients with high and low methylation rates survive for four of the genes selected using our methods [11]. As visible in the GRHPR Kaplan-Meier curve, breast cancer patients with a relatively high GRHPR methylation rate survive longer than patients with a relatively low GRHPR methylation rate. The Cox proportional hazards model, the lasso regression model, and the ridge regression model all suggest that GRHPR is a gene for which a high methylation rate favorably affects breast cancer survival. In fact, 5 years after the start of the survival study, 85% of the breast cancer patients (who did not withdraw from the study) with high GRHPR methylation rates were still alive, compared with 73% of the breast cancer patients (who did not withdraw from the study) with low GRHPR methylation rates.

Lasso regression indicates that GADD45A is a gene for which high methylation rates are associated with high cancer survival rates. The GADD45A Kaplan-Meier curve shows that 83% of breast cancer patients (who did not withdraw from the study) with relatively high GADD45A methylation are still alive 5 years after the start of the survival study, compared with 76% of breast cancer patients (who did not withdraw from the study) with relatively low GADD45A methylation. Similarly, both elastic net regression and ridge regression suggest that a high ENOX2 methylation rate negatively affects breast cancer survival, and the Cox proportional hazards model indicates that a high ANKRD52 methylation rate adversely affects breast cancer survival. The curves for ENOX2 and ANKRD52 show that high ENOX2 and ANKRD52 methylation rates do indeed negatively affect breast cancer survival rates.

[Figure: Kaplan-Meier Survival Curve for ENOX2 Gene. x-axis: Time (days), 0 to 3000; y-axis: Cumulative Survival Percentage (%), 0.5 to 1.0; curves: high ENOX2 methylation rate, low ENOX2 methylation rate.]

[Figure: Kaplan-Meier Survival Curve for ANKRD52 Gene. x-axis: Time (days), 0 to 3000; y-axis: Cumulative Survival Percentage (%), 0.5 to 1.0; curves: high ANKRD52 methylation rate, low ANKRD52 methylation rate.]

| Model | Top 3 favorably prognostic genes | Top 3 adversely prognostic genes |
| Cox Proportional Hazards Model | EEF1A1P9, GRHPR, CASP3 | COL6A2, ANKRD52, C12orf41 |
| Elastic Net Regression Model | CLEC2D, C9orf89, CASP3 | DHDDS, EXOC1, ENOX2 |
| Lasso Regression Model | GRHPR, FUZ, GADD45A | GGCX, GTPBP8, GRHL2 |
| Ridge Regression Model | ADH5, GRHPR, CASP3 | DNAJC8, ENOX2, EXOC1 |
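For reference, the Kaplan-Meier estimator [11] behind such curves multiplies, at each observed death time, the fraction of at-risk patients who survive it. The following is a minimal pure-Python sketch with made-up data; how patients were divided into "high" and "low" methylation groups is not stated in the text, so the sketch covers a single group only.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate at each distinct death time.

    times: time to last contact or event for each patient.
    events: 1 if the patient died at that time, 0 if censored.
    Returns a list of (time, estimated survival probability) pairs.
    """
    curve = []
    survival = 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
    return curve

# Toy data (made up): four patients, the second one is censored at time 2.
km_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Censored patients lower the at-risk count for later times without producing a step in the curve, which is how withdrawals from the study are accounted for.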
B. Literature Review
Several other projects have applied machine learning techniques to extract valuable information from DNA methylation data. Some previous projects focused mainly on evaluating different statistical methods for analyzing DNA methylation data [18], while others analyzed DNA methylation data for specific types of cancer, such as leukemia [19]. Our project fits in the second framework, since we use different regression techniques for gene identification for breast cancer specifically, using DNA methylation data.
Relative to existing projects, our contribution is two-fold. First, we have compared different regression models (lasso, ridge, Cox proportional hazards, elastic net) for finding genes highly correlated with breast cancer survival, which highlights the importance of using different methods in gene identification (i.e., different genes can be found with different methods). Second, we have found genes whose methylation rates are highly correlated with breast cancer development (i.e., genes whose methylation is linked to breast cancer survival), which may give additional directions for breast cancer research and breast cancer treatment development.
In fact, the favorably and adversely prognostic genes identified by our methods might be worth examining in order to further understand the biological mechanisms of breast cancer. Many of the genes we identified have been widely acknowledged in the medical literature as genes strongly correlated with breast cancer survival, such as CASP3 [12, 13], GADD45A [14], ENOX2 [15], GRHL2 [16], and COL6A2 [17]. Some of those genes appear among the top-ranked genes of only one method, such as GADD45A (identified only by lasso regression). This underscores the benefits of using distinct methods for gene identification. Moreover, given our success in identifying genes known to be highly correlated with breast cancer survival, the additional genes we found might be worth investigating to further understand breast cancer.
V. CONCLUSION
We have presented different regression techniques to identify genes that are highly correlated with breast cancer survival rates by analyzing the survival and DNA methylation data of 989 breast cancer patients [4]. Our results identify genes widely known in the medical literature to be involved in breast cancer development. The identified genes may prove helpful for the discovery of therapeutic targets and the development of patient-tailored therapy strategies.
ACKNOWLEDGMENT
This project would not have been possible without the help of the Gevaert Biomedical Informatics Lab, which provided both the datasets and ongoing support.
REFERENCES
[1] M. Szyf, 'DNA methylation signatures for breast cancer classification and prognosis', Genome Medicine, vol. 4, no. 3, p. 26, 2012.
[2] S. Baylin, 'Aberrant patterns of DNA methylation, chromatin formation and gene expression in cancer', Human Molecular Genetics, vol. 10, no. 7, pp. 687-692, 2001.
[3] K. Hansen, W. Timp, H. Bravo, S. Sabunciyan, B. Langmead, O. McDonald, B. Wen, H. Wu, Y. Liu, D. Diep, E. Briem, K. Zhang, R. Irizarry and A. Feinberg, 'Increased methylation variation in epigenetic domains across cancer types', Nature Genetics, vol. 43, no. 8, pp. 768-775, 2011.
[4] The Cancer Genome Atlas - National Cancer Institute, 'The Cancer Genome Atlas Home Page', 2015. [Online]. Available: http://cancergenome.nih.gov/. [Accessed: 20-Nov-2015].
[5] C. Creighton, 'SR2-3: Integrative Genomic Analyses of Breast Cancer from The Cancer Genome Atlas (TCGA).', Cancer Research, vol. 71, no. 24, pp. SR2-3-SR2-3, 2011.
[6] A. Hoerl and R. Kennard, 'Ridge Regression: Biased Estimation for Nonorthogonal Problems', Technometrics, vol. 42, no. 1.
[7] J. Friedman, T. Hastie and R. Tibshirani, 'Regularization Paths for Generalized Linear Models via Coordinate Descent', Journal of Statistical Software, vol. 33, no. 1, 2010.
[8] H. Zou, 'The Adaptive Lasso and Its Oracle Properties', Journal of the American Statistical Association, vol. 101, no. 476, pp. 1418-1429, 2006.
[9] J. Ogutu, T. Schulz-Streeck and H. Piepho, 'Genomic selection using regularized linear regression models: ridge regression, lasso, elastic net and their extensions', BMC Proc, vol. 6, no. 2, p. S10, 2012.
[10] M. Abrahamowicz, T. Schopflocher, K. Leffondre, R. du Berger and D. Krewski, 'Flexible Modeling of Exposure-Response Relationship between Long-Term Average Levels of Particulate Air Pollution and Mortality in the American Cancer Society Study', Journal of Toxicology and Environmental Health, Part A, vol. 66, no. 16-19, pp. 1625-1654, 2003.
[11] E. Kaplan and P. Meier, 'Nonparametric Estimation from Incomplete Observations', Journal of the American Statistical Association, vol. 53, no. 282, p. 457, 1958.
[12] N. O'Donovan, J. Crown, H. Stunell, A. D. Hill, E. McDermott, N. O'Higgins and M. J. Duffy, 'Caspase 3 in breast cancer', Clin Cancer Res, pp. 738-742, 2003.
[13] E. Devarajan, A. Sahin, J. Chen, R. Krishnamurthy, N. Aggarwal, A. Brun, A. Sapino, F. Zhang, D. Sharma, X. Yang, A. Tora and K. Mehta, 'Down-regulation of caspase 3 in breast cancer: a possible mechanism for chemoresistance', Oncogene, vol. 21, no. 57, pp. 8843-8851, 2002.
[14] J. Tront, Y. Huang, A. Fornace, B. Hoffman and D. Liebermann, 'Gadd45a Functions as a Promoter or Suppressor of Breast Cancer Dependent on the Oncogenic Stress', Cancer Research, vol. 70, no. 23, pp. 9671-9681, 2010.
[15] D. Morre and D. Morre, ECTO-NOX proteins. New York, NY: Springer, 2013.
[16] X. Xiang, Z. Deng, X. Zhuang, S. Ju, J. Mu, H. Jiang, L. Zhang, J. Yan, D. Miller and H. Zhang, 'Grhl2 Determines the Epithelial Phenotype of Breast Cancers and Promotes Tumor Progression', PLoS ONE, vol. 7, no. 12, p. e50781, 2012.
[17] E. Karousou, M. D'Angelo, K. Kouvidi, D. Vigetti, M. Viola, D. Nikitovic, G. De Luca and A. Passi, 'Collagen VI and Hyaluronan: The Common Role in Breast Cancer', BioMed Research International, vol. 2014, pp. 1-10, 2014.
[18] T. Wilhelm, 'Phenotype prediction based on genome-wide DNA methylation data', BMC Bioinformatics, vol. 15, no. 1, p. 193, 2014.
[19] J. Nordlund, C. Backlin, V. Zachariadis, et al., 'DNA methylation-based subtype prediction for pediatric acute lymphoblastic leukemia', Clin Epigenetics, vol. 7, no. 1, p. 11, 2015.