Using Gene Expression Data to Predict Clinical Information in Seven Human Cancers
Nathan Abell
[email protected]
Department of Genetics
Stanford University School of Medicine
Dataset Overview
References and Acknowledgements
Future Directions
In this project, it quickly became clear that very low-dimensional sets of genes, forming coherent signatures, could represent the disease subtype in all studied tissues. Several other phenotypes, such as progesterone receptor status in breast cancers, were also easily predictable. Quantitative outcomes such as survival time or age of disease onset were much harder: predictions were rarely accurate to within years of their target. Variable reduction was clearly the crucial step; once it was done, many classification and regression algorithms performed similarly well (or poorly).
To proceed further, I would start by examining the selected genes to see whether they are shared across tissues or private to a single tissue. I would also include more tissues, attempt to incorporate matched normal tissue, and attempt to add further data types such as copy number variation.
[1] RG Verhaak, KA Hoadley, E Purdom, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 2010;17(1):98-110.
[2] KA Hoadley, C Yau, DM Wolf, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell. 2014;158(4):929-44.
[3] https://cancergenome.nih.gov
[4] J Friedman, T Hastie, R Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software. 2010;33(1):1-22.
[5] https://CRAN.R-project.org/package=e1071
[6] WN Venables, BD Ripley. Modern Applied Statistics with S. Fourth Edition. Springer, New York. 2002. ISBN 0-387-95457-0
[7] A Liaw, M Wiener. Classification and Regression by randomForest. R News. 2002;2(3):18-22.
Background
Statistical Approach
Predicting Clinical Attributes
Clinical Outcomes
Fig. 1: Ten human tissues with the indicated number of samples in the Genomic Data Commons
Fig. 2: Representative Pearson correlation heatmap between lung cancers revealing the extent of gene expression correlation
Fig. 3: Distributions of clinical outcomes in breast and kidney tumors
Fig. 4: Visual overview of the procedure applied to each tissue separately
Fig. 6A: ROC plots for two example predictions: left, breast cancer progesterone receptor; right, bladder cancer stage (early vs late)
Feature Selection Across Tissues
Fig. 5A: Principal component analysis before (above) and after (right) LASSO variable reduction for breast tumors, colored by histology
Fig. 5B: LASSO regularization path, with dashed lines showing estimates of the optimal value of lambda by misclassification rate
Fig. 5C: Fitted LASSO parameters for disease subtype in all tissues
Tissue Type   Sample Size
Bladder       414
Brain         667
Breast        1102
Kidney        891
Lung          1035
Prostate      495
Skin          103
[Fig. 4 flowchart labels: Normalization; Split 70/30; LASSO; LASSO, PCR, SVR]
[Fig. 5A panel: PCA scatter plot; x-axis PC1 (16.9% explained var.), y-axis PC2 (6.0% explained var.); groups: Infiltrating Ductal Carcinoma, Infiltrating Lobular Carcinoma, NA]
[Fig. 5B plot: x-axis log(Lambda), -5 to -1; y-axis Misclassification Error, 0.1 to 0.4; top-axis labels: 155 140 138 131 125 109 93 84 72 61 52 43 34 29 25 21 19 12 6 1]
[Fig. 5A panel: PCA scatter plot; x-axis PC1 (11.5% explained var.), y-axis PC2 (4.7% explained var.); groups: Infiltrating Ductal Carcinoma, Infiltrating Lobular Carcinoma, NA]
Tissue     λ        CV Accuracy
Bladder    0.0672   0.9035
Brain      0.0091   0.9310
Breast     0.0235   0.9811
Kidney     0.0234   0.9503
Lung       0.0444   0.9628
Prostate   0.0796   0.9651
Skin       0.1090   0.8961
[Fig. 3 panels (y-axis: count):
A. histological type: kidney (Kidney Clear Cell Renal Carcinoma, Kidney Papillary Renal Cell Carcinoma, Kidney Chromophobe); histological type: breast (Infiltrating Carcinoma NOS, Infiltrating Ductal Carcinoma, Infiltrating Lobular Carcinoma, Medullary Carcinoma, Metaplastic Carcinoma, Mixed Histology, Mucinous Carcinoma, Other, NA)
B. stage: kidney (Stage I, II, III, IV); stage: breast (Stage I, IA, IB, II, IIA, IIB, III, IIIA, IIIB, IIIC, IV, X, NA)
C. hemoglobin: kidney (Elevated, Low, Normal)
D. progesterone receptor: breast (Indeterminate, Negative, Positive, NA)
E. age_at_initial_pathologic_diagnosis histograms, with panel titles reading "stage: kidney" and "stage: breast"]
[Fig. 4 flowchart labels: Logistic, LDA, SVM, RF; Validation]
All tissues responded similarly to the LASSO, with very robust performance for classifiers (particularly disease subtype; Fig. 5C). In a multinomial context, the LASSO generally helps separate the desired groups. However, this did not extend to quantitative responses, which failed to show the normal regularization path (Fig. 5B).
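The variable-reduction idea can be illustrated with a minimal sketch of the LASSO fitted by cyclic coordinate descent, the approach described in [4]. This is not the glmnet multinomial fit used on the poster: it assumes a continuous response, a tiny made-up design matrix with orthogonal columns standing in for genes, and hand-picked penalty values. The function names `soft_threshold` and `lasso_coordinate_descent` are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant-zero column can never enter the model
            # Covariance of column j with the residual that excludes column j's own fit.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


# Hypothetical toy data: 8 samples, 4 "genes" with mutually orthogonal +/-1 profiles.
h1 = [1, -1, 1, -1, 1, -1, 1, -1]
h2 = [1, 1, -1, -1, 1, 1, -1, -1]
h3 = [1, -1, -1, 1, 1, -1, -1, 1]
h4 = [1, 1, 1, 1, -1, -1, -1, -1]
X = [[h1[i], h2[i], h3[i], h4[i]] for i in range(8)]
# The response depends strongly on gene 0 and only weakly on gene 1.
y = [3.0 * h1[i] + 0.1 * h2[i] for i in range(8)]

beta = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]

# Number of non-zero coefficients along a short, hand-picked path of penalties.
path_sizes = [
    sum(1 for b in lasso_coordinate_descent(X, y, lam) if b != 0.0)
    for lam in (0.05, 0.5, 3.5)
]
```

As the penalty grows, coefficients are driven to exactly zero, which is why a small gene signature can remain after selection. In practice the penalty would be chosen by cross-validation rather than fixed by hand.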
Despite their heterogeneity, known cancers share one key property: genetic and transcriptomic abnormalities. To this end, the Genomic Data Commons (GDC) has aggregated and standardized tens of thousands of experimental datasets from dozens of human cancers [1-3]. Here, we describe a pipeline for predicting specific clinical features (ranging from blood tests to pathological features and survival outcomes) from all available gene expression data for seven human cancers.
Each tissue consists of a set of samples, each with ~60000 expression measurements. Many of these measurements are highly correlated, for both biological and experimental reasons (Fig. 2). This presents an immediate problem, as many predictors are almost perfectly collinear. Reducing the large set of genes to a representative set of variables is therefore a crucial first task.
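To make the collinearity problem concrete, here is a small sketch of the Pearson correlation between pairs of expression profiles. The three gene vectors are invented for illustration and are not taken from the GDC data; `pearson` is a hypothetical helper name.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression values for three genes across five samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.1, 4.0, 6.2, 7.9, 10.1]  # roughly 2 * gene_a, so nearly collinear with it
gene_c = [3.0, 1.0, 4.0, 1.0, 5.0]   # no simple relationship to gene_a

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
```

A pair like `gene_a` and `gene_b` carries almost the same information, so a model gains little from keeping both. This is the motivation for reducing the gene set before fitting.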
Some samples in each tissue are annotated with clinical information, such as disease subtype or survival time. These annotations include continuous, binomial, and multinomial response variables (Fig. 3), with some variability between datasets. We therefore focus on subsets of these attributes.
I began by normalizing each dataset for factors such as sequencing depth and variance. Then I split each tissue's samples into training and validation sets (70/30) and used only the training set for model fitting. Using cross-validation, I obtained very small subsets of variables with non-zero LASSO coefficients for each tissue, and used them to train a variety of models (also by cross-validation within the training set), chosen according to whether the response was continuous or categorical. This was largely done using R packages, including glmnet, MASS, e1071, and randomForest [4-7].
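The splitting step can be sketched as follows. This is an illustration only: the poster's analysis was done in R, and the random seed, the number of folds, and the helper names `train_validation_split` and `k_folds` are assumptions made here. The sample count of 414 is the bladder entry from the dataset table.

```python
import random


def train_validation_split(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and split them into training and validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(train_fraction * n_samples))
    return indices[:n_train], indices[n_train:]


def k_folds(indices, k=5):
    """Partition a list of indices into k disjoint folds of near-equal size."""
    return [indices[i::k] for i in range(k)]


# 70/30 split of the 414 bladder samples.
train_idx, valid_idx = train_validation_split(414)

# Cross-validation folds are drawn from the training indices only, so the
# validation samples play no part in variable selection or model tuning.
folds = k_folds(train_idx, k=5)
```

Keeping the validation indices out of every fold is what makes the final accuracy estimate an honest one; selecting genes on the full dataset first would leak information into the validation step.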
For all disease subtype attributes (Fig. 5C), each model was over 0.9 accurate on the validation set. I therefore attempted other, more complex categorical responses; two are shown in Fig. 6A. Progesterone receptor status in breast cancer was very predictable from gene expression, while bladder tumor stage was much more difficult to predict. Classifier performance varied significantly across all built models, though it was always better than that of the regression-based predictions.
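For readers unfamiliar with ROC plots, the sketch below shows how the curve and its area can be computed from classifier scores for a binary response. The scores and labels are invented, and `roc_points` and `auc` are hypothetical helper names; this does not reproduce the curves in Fig. 6A.

```python
def roc_points(scores, labels):
    """Return (P(FP), P(TP)) points as the decision threshold is lowered.

    labels must be 0 or 1, with at least one of each; higher scores mean
    the classifier is more confident that the label is 1.
    """
    pairs = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical predicted probabilities and true labels for six validation samples.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
area = auc(curve)
```

An area near 1 corresponds to a curve hugging the top-left corner, as would be expected for an easily predicted attribute, while an area near 0.5 indicates performance close to chance.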
[Fig. 6A plots: two ROC panels; x-axis P(FP), y-axis P(TP), both from 0.0 to 1.0; curves in each panel: Multinomial LASSO, Linear Discriminant Analysis, Support Vector Machine (Gaussian Kernel), Random Forests]