geveart lab simr paper

Predicting Gene Expression By Copy Number Alterations and Methylation Data in Cancer

Abstract

Cancer cells harm the body through proliferation, caused by (epi)-genetic

mutations or factors, such as copy number alterations and methylation changes

influencing genetic activity through differential gene expression. Cancer is driven by

irregular expression of certain oncogenes that alter the cell cycle, cause proliferation, or

inhibit normal function.

To investigate these oncogenes, I created a bioinformatics tool to model

oncogene expression as a linear model predicted by copy number and methylation. The

statistical model identifies oncogenes and tumor-suppressor genes relevant to specific

cancers and which (epi)-genomic factors drive their expression. It extracts the latest live

data and selects only relevant copy number data and methylation clusters in analysis,

increasing accuracy of the expression model.

We used the R programming language to package the model, retrieving expression, copy number, and

methylation data from The Cancer Genome Atlas (TCGA), a cancer database.

Starting with glioblastoma and rapidly expanding to include all cancers in TCGA, I

developed a program that summarizes the disease by genes most instrumental to its

expression and offers a comprehensive individual analysis per gene. A ranking

algorithm sorts genes by relevance and displays influencing factors. Conclusions about

disease-specific gene expression can be quickly derived with this package.

1 INTRODUCTION

Cancer is a life-threatening disease that kills over 20,000 people every day. Multiple

types of cancers exist and they can strike at any part of the body. Cancer is caused by irregular

gene expression, and genes control all functions of the body [1]. Certain cancer-causing genes,

labeled oncogenes once mutation disrupts their normal function, can lead to unwanted cell

proliferation, cell-cycle irregularities, and tumor progression when activated. It is important to

identify these oncogenes; however, cancer is a general term for many distinct diseases,

and each cancer has its own set of oncogenes driving it. Our goal is to find a

way to identify a wide array of these oncogenes and their level of impact on a specific cancer. We

also aim to determine which genetic or epigenetic factors control the expression of these

oncogenes.

Throughout the field of cancer biology, researchers search for genes that control

cancerous mechanisms, such as cell cycle acceleration, proliferation, and malignant behavior.

Many advances have been made in the field with regard to finding these oncogenes, and targeted

drug delivery to inhibit the activity of these oncogenes has greatly increased life expectancy for

cancer patients [2]. The death rate due to cancer falls each year thanks to medical advances and the

discovery of these important oncogenes [3]. Bioinformaticians have studied cancer using the wide

variety of data available in online databases. Many databases, websites,

and key papers have investigated the role of specific genes in specific cancers, allowing

medical treatment and drug delivery to account for these genes and greatly improve

preventive treatment [4]. This data analysis is central to cancer research, because without targeted gene

analysis new cancer-causing pathways and mechanisms would not be discovered. In light of this

information, we decided to create a comprehensive tool that studies gene expression using

genetic or epigenetic predictors, ultimately providing a gene-level analysis that

pinpoints the most important oncogenes in any given cancer and explains their expression.

Further research can then focus on these high-importance genes to facilitate cancer

research and improve gene-targeted drug delivery to treat various forms of cancer. We aim to

create a product that can help to identify more of the many genes listed as cancer biomarkers [5].

A bioinformatics package was created to model oncogene expression in cancer. Gene

expression was modeled with a linear fit, using copy number alterations and methylation data as

predictors for the model. Copy number alterations refer to gains or losses in the number of copies of

a gene, which we use as a predictor of its expression. We retrieve these data through a GISTIC analysis of raw copy

number data [6]. Methylation is an epigenetic modification in which a methyl group is attached to

a CpG site, commonly located in the promoter region of the gene, which

compacts the DNA and generally inhibits transcription [7]. Raw human methylation data with

general consensus clustering are used to study methylation directly related to these oncogenes.

Methylation can also be modeled using MethylMix, an algorithm that uses differential

and functional methylation states to group methylation into clusters [8]. Gene expression can be

mapped using mRNA sequencing data (mRNAseq) or gene expression microarray data, both

standard ways to acquire gene expression data from patients.

The package acquires this wealth of data from The Cancer Genome Atlas (TCGA). The

Cancer Genome Atlas is a source that is home to data from over 30 cancers and compiles genetic

data retrieved from multiple patients. TCGA is where the methylation, copy number, and gene

expression data are retrieved. The use of the R programming language for statistics was essential

in the creation of this package. In the finished package, users will specify the cancer

they wish to study and the types of expression and methylation

data they want. The package allows users to quickly determine which oncogenes

drive which cancers, and whether copy number, methylation, or both predictors drive the

expression of these oncogenes. The R package, once completed, will be released on

Bioconductor [9], a repository where bioinformaticians release software, and will be available to any user

who wants to reach conclusions quickly about oncogenes in cancers.

For users, the package will run using the following general pipeline: The datasets, with

the specific types of data they desire for methylation and gene expression, will be downloaded

securely and locally to their hard drive, where a series of preprocessing steps will occur,

including batch correction, missing value estimation, and patient matching for multiple datasets.

A pool of genes of interest is extracted from the datasets, applying a variance filter to select a

percent of genes specified by the user. A linear model is applied to these genes, combining only

relevant and statistically significant parts of the model for each gene to ensure the most accurate

predictive model for the gene and its drivers. Copy number alterations are added to the model by

relevance and relevant methylation clusters are combined in a linear combination to produce the

final methylation coefficient. Only relevant components of the expression model of the gene,

whether copy number alterations, methylation clusters, or both, are included in the

individual gene model. In this manner, a separate model containing only significant terms is created for

each gene of interest for a given cancer.

We first ran our software on glioblastoma (GBM), a rare yet well-documented

brain cancer, to compare the results from our package with established results on the

disease. The package was then quickly expanded to apply the model to all

cancers available through TCGA. The code produces individual gene graphs, presents raw data,

and shows a full disease summary view, indicating the most important genes for specific cancers

and how they are driven. The disease summary lets users see the expression models

fitted for multiple genes of interest and shows how a specific cancer is being

driven as a whole, making it a useful tool for finding genes in cancers for further research.

Overall, the model was a success in finding genes relevant to the expression of GBM. It

is able to differentiate between genes driven only by copy number alterations, by methylation

clusters, or both. The package is still under development and it will be released soon. In future

updates to the software, more complex, higher dimensional models may be used. 3D graphs may

be used to display these complex gene models, and new predictors may accompany copy number

and methylation as drivers for the expression model. A ground-up, simple approach to creating

a gene overview tool for cancers is necessary and is now implemented using data from

TCGA and analysis in the R programming language.

2 MATERIALS AND METHODS

The expression model using copy number and methylation as predictors is implemented

as a user package rather than a one-time analysis of genes. As such, the methods outlined here can

be thought of as a pipeline that the package goes through upon its execution rather than a

traditional procedure.

2.1 Downloading Data

A designated R script is used to dynamically download data from the Broad Institute’s

TCGA server. I initially wrote the script using the Broad Institute’s Firehose tool to acquire the

data, but a graduate student updated the script to use curl and wget commands to retrieve data

from Broad Institute URLs; the script can retrieve all the files we need to process and analyze.

The user can select from two types of gene expression data: mRNAseq

data and microarray data. Selections for methylation data include raw clustered methylation and

MethylMix data. Copy number data used in the model are raw data run through a GISTIC

analysis. The user decides for which cancer to retrieve data, and by the end of the downloading

phase, expression, copy number, and methylation datasets are locally stored on the user’s

computer.

2.2 Preprocessing

Once the datasets are retrieved, several preprocessing steps must be completed to get the

datasets ready for analysis. The output of each step is stored in a separate dataset so the user

has access to the data at every step. A series of scripts written by my mentor processes the

data from one step to the next. The code begins by estimating missing values using the

k-nearest-neighbor method. A 10% threshold of missing values is used to determine whether to eliminate a

gene or patient from the dataset. Standard TCGA batch correction is done on the resulting dataset

to normalize patient data across the multiple facilities where data were collected. I also had to account

for mismatched patients and genes to ensure modeling is accurate across the multiple datasets.

The algorithm only tests patients with a pure form of the cancer, so I had to remove pan-cancer,

metastatic, and healthy control patients. After these steps, the data are ready for linear modeling

analysis.
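
As a rough illustration of the filtering and matching steps described above, the following Python sketch drops genes that exceed the missing-value threshold and restricts two datasets to their shared genes and patients. The package itself is written in R; the function names (`drop_sparse_genes`, `match_datasets`), the nested-dictionary layout, and the choice to treat exactly 10% missing as acceptable are assumptions made for this example. Removing patients over the threshold, k-nearest-neighbor imputation, and batch correction are not shown.

```python
def drop_sparse_genes(dataset, max_missing=0.10):
    """Keep genes whose fraction of missing (None) values is at most max_missing.

    dataset maps gene -> {patient_id: value or None} (hypothetical layout).
    """
    kept = {}
    for gene, values in dataset.items():
        missing = sum(1 for v in values.values() if v is None)
        if values and missing / len(values) <= max_missing:
            kept[gene] = values
    return kept


def match_datasets(*datasets):
    """Restrict every dataset to the genes and patients shared by all of them."""
    shared_genes = set(datasets[0])
    for data in datasets[1:]:
        shared_genes &= set(data)

    shared_patients = None
    for data in datasets:
        for gene in shared_genes:
            ids = set(data[gene])
            shared_patients = ids if shared_patients is None else shared_patients & ids
    patients = sorted(shared_patients or [])

    return [
        {gene: {p: data[gene][p] for p in patients} for gene in sorted(shared_genes)}
        for data in datasets
    ]


# Toy values for illustration only; these are not TCGA data.
expression = {
    "GENE_A": {"P1": 1.0, "P2": 2.0, "P3": None},
    "GENE_B": {"P1": 0.5, "P2": 0.7, "P3": 0.9},
}
copy_number = {
    "GENE_B": {"P2": 1, "P3": -1, "P4": 0},
    "GENE_C": {"P2": 0, "P3": 0, "P4": 2},
}

filtered = drop_sparse_genes(expression)
matched_expression, matched_copy_number = match_datasets(filtered, copy_number)
print(matched_expression)
print(matched_copy_number)
```

In this toy run, GENE_A is dropped because one of its three values is missing, and only GENE_B with patients P2 and P3 survives the matching step.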

2.3 Linear Modeling of Genomic and Epigenomic Data

After the data are ready, the program starts analysis by selecting a pool of genes, which

we label as the genes of interest, given a user-specified percentage of the total genes available.

To select genes of interest, a filter ranks all genes by a measure of dispersion, such as

variance, interquartile range, or mean absolute deviation. The code selects the top

percentage of genes as the genes of interest based on user input. Individual gene analysis is

performed on these genes. The variance filter is a basic and fast way to take a

percentage of genes for analysis.
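
The variance filter can be sketched in a few lines of Python. This is a minimal illustration rather than the package's R code: the function name `select_genes_of_interest`, the dictionary layout, and the decision to round the number of kept genes up are assumptions.

```python
import math
import statistics


def select_genes_of_interest(expression, percent):
    """Return the top `percent` of genes ranked by expression variance.

    expression maps gene -> list of expression values across patients
    (hypothetical layout); percent is a number between 0 and 100.
    """
    ranked = sorted(
        expression,
        key=lambda gene: statistics.pvariance(expression[gene]),
        reverse=True,
    )
    # Assumption: round up, and always keep at least one gene.
    n_keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:n_keep]


# Toy values for illustration only.
expression = {
    "GENE_A": [1, 1, 1, 1],
    "GENE_B": [0, 10, 0, 10],
    "GENE_C": [4, 5, 4, 5],
    "GENE_D": [0, 2, 0, 2],
}
genes_of_interest = select_genes_of_interest(expression, percent=50)
print(genes_of_interest)  # ['GENE_B', 'GENE_D']
```

Swapping `statistics.pvariance` for an interquartile-range or mean-absolute-deviation function would give the other dispersion filters mentioned above.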

The linear model for expression (EXP) using copy number (CN) and methylation (MET)

as predictors can be easily interpreted as a linear combination of beta values for copy number

alterations and methylation data for each gene, and can be stated as follows.

$$\mathrm{EXP}_{\mathrm{GENE}} = \beta_1 \cdot \mathrm{CN}_{\mathrm{GENE}} + \beta_2 \cdot \mathrm{MET}_{\mathrm{GENE}}.$$

The linear model executes by splitting itself up into its two-dimensional components and

combining the statistically significant beta values into the above equation. The first step in this

process is to model expression based on copy number and methylation separately:

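$$\mathrm{EXP}_{\mathrm{GENE}} = \beta_1 \cdot \mathrm{CN}_{\mathrm{GENE}}, \qquad \mathrm{EXP}_{\mathrm{GENE}} = \beta_2 \cdot \mathrm{MET}_{\mathrm{GENE}}.$$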

The copy number portion of the model is fitted separately, and if the beta value of

this separate regression is significant, it is included in the overall model for gene

expression. Methylation has an added nuance: its clusters. The separate model

for expression based on methylation can be modeled as a linear combination of the different n

methylation clusters for each gene as shown here:

$$\mathrm{EXP}_{\mathrm{GENE}} = \sum_{C=1}^{n} \beta_{2,\mathrm{Cluster}\,C} \cdot \mathrm{MET}_{\mathrm{GENE},\mathrm{Cluster}\,C}.$$

A linear regression is applied to the clustered methylation model, and only the clusters with significant

coefficients are carried into a second regression model for methylation.

This second regression is then fitted to obtain fitted values, which serve as the combined methylation predictor.

The linear coefficient for these fitted values is taken as the significant beta value for

methylation as a whole. The linear model can now be summed up in the following simple

equation based on significance:

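$$\mathrm{EXP}_{\mathrm{GENE}} = \beta_{1,\mathrm{signif}} \cdot \mathrm{CN}_{\mathrm{GENE}} + \beta_{2,\mathrm{signif}} \cdot \mathrm{MET}_{\mathrm{GENE}}.$$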
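
One way to write the two-stage methylation step formally is sketched here. The significance threshold $\alpha$, the index set $S$, the per-cluster p-value $p_C$, and the hat notation for refitted coefficients are introduced for illustration only. Reading the stage-two fitted values as the methylation predictor is an interpretation of the procedure described above, not a statement of the package's exact implementation.

```latex
% Stage 1: regress expression on all n clusters; keep clusters whose p-value passes the filter.
S = \{\, C \in \{1, \dots, n\} : p_C < \alpha \,\}

% Stage 2: refit on the clusters in S; the fitted values form the methylation predictor.
\mathrm{MET}_{\mathrm{GENE}} = \sum_{C \in S} \hat{\beta}_{2,\mathrm{Cluster}\,C} \cdot \mathrm{MET}_{\mathrm{GENE},\mathrm{Cluster}\,C}
```

If $S$ is empty, methylation contributes no term to the final model.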

If an entire driver, whether it be copy number or methylation, is insignificant, the final

model will simply be rendered without that driver. Sometimes both copy number and

methylation data are insignificant, which the model takes to mean that the gene has little to no role in the

cancer. In this way, we have created a tool that lists the most significant genes and how

they are expressed, creating a separate model for each gene.
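
The rule for assembling the final model can be sketched as follows. This Python snippet is illustrative only: the function name `build_gene_model`, its arguments, and the driver labels are hypothetical, and it assumes the separate copy number and methylation regressions have already produced a coefficient and a p-value each.

```python
def build_gene_model(cn_beta, cn_pvalue, met_beta, met_pvalue, threshold=0.05):
    """Keep only the significant drivers and label how the gene is driven."""
    terms = {}
    if cn_pvalue < threshold:
        terms["CN"] = cn_beta
    if met_pvalue < threshold:
        terms["MET"] = met_beta

    if len(terms) == 2:
        driver = "both"
    elif "CN" in terms:
        driver = "copy number"
    elif "MET" in terms:
        driver = "methylation"
    else:
        driver = "insignificant"
    return driver, terms


# Toy coefficients and p-values for illustration only.
print(build_gene_model(0.8, 1e-6, -0.1, 0.40))   # ('copy number', {'CN': 0.8})
print(build_gene_model(0.3, 0.20, -0.6, 0.001))  # ('methylation', {'MET': -0.6})
print(build_gene_model(0.5, 0.01, -0.4, 0.02))   # ('both', {'CN': 0.5, 'MET': -0.4})
print(build_gene_model(0.1, 0.50, -0.1, 0.70))   # ('insignificant', {})
```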

The beta values are filled in once the regression is applied. To find genes strongly associated

with a specific cancer, we look for low p-values and high r-squared values, statistics associated

with linear regression. Throughout the evaluation of statistical significance, we use a

p-value filter determined by the user to remove statistically insignificant parts of the model. The

p-value measures how likely a coefficient at least as large as the fitted one would be if the predictor had no

real effect on expression. The r-squared value refers to the coverage of the model, that is,

the proportion of variance in expression that the fit explains. We look for low p-values and high r-squared

values to determine statistical significance of a gene in a given cancer.
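
To make the two statistics concrete, the following Python sketch fits a single-predictor regression and reports the slope, r-squared, and a two-sided p-value for the slope. It is not the package's code, which would rely on R's regression routines. The function names are hypothetical, the fit includes an intercept (an assumption, since the equations above omit one), and the p-value is obtained by numerically integrating the Student's t density so that only the standard library is needed.

```python
import math


def two_sided_t_pvalue(t_stat, df, steps=2000):
    """Two-sided p-value for Student's t, via Simpson's rule on the density."""
    log_norm = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    norm = math.exp(log_norm) / math.sqrt(df * math.pi)

    def pdf(t):
        return norm * (1 + t * t / df) ** (-(df + 1) / 2)

    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * area)


def simple_regression(x, y):
    """Fit y = b0 + b1 * x; return (slope, r_squared, p_value of the slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    df = n - 2
    if r_squared >= 1.0:  # perfect fit: the t statistic is unbounded
        return slope, 1.0, 0.0
    t_stat = math.sqrt(r_squared * df / (1.0 - r_squared))
    return slope, r_squared, two_sided_t_pvalue(t_stat, df)


# Toy values for illustration only (think: copy number vs. expression).
copy_number = [1, 2, 3, 4, 5]
expression = [1, 3, 2, 5, 4]
slope, r_squared, p_value = simple_regression(copy_number, expression)
print(slope, r_squared, p_value)  # slope 0.8, r-squared 0.64, p-value roughly 0.10
```

With the default 0.05 filter, this toy predictor would be dropped from the model despite explaining 64% of the variance, because five samples are too few for the slope to reach significance.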

When interpreting these linear models, one would expect positive copy number

coefficients and negative methylation coefficients. Recalling that copy number refers to the

number of copies of the gene and that methylation generally inhibits transcription, one would

expect copy number and methylation to correlate positively and negatively, respectively, with gene

expression. However, because the effects of methylation are not fully understood, we

leave this part of the model up to interpretation.

2.4 Data Visualization

Users can see results instantaneously upon visualizing the data generated. Three ways to

visualize the data are available through this package, designed to work together to understand

gene expression in any given cancer.

All raw data are shown, and users can see whether genes are driven by copy number,

methylation, or both. The user can also see graphs of individual gene-by-gene analysis, which

shows the linear fit line across copy number and methylation and displays correlation statistics of

the gene. A disease summary allows the user to see how genes are being driven across the entire

disease. It also shows the genes with the highest statistical correlation in the dataset.

I created these data visualization tools using the R package ggplot2, a graphics engine

that allows for several types of data visualization.

3 RESULTS

All results displayed here are run on the brain cancer glioblastoma using microarray data

for gene expression, GISTIC data for copy number alterations, and MethylMix for methylation

clusters. However, the pipeline can be run on any disease, with options for types of data in

expression and methylation.

3.1 Individual Gene Analysis

The package plots the linear regression for individual genes. Plotting the regression is

the fastest way to visualize a gene. This example shows a model for

expression based only on methylation for the gene FABP5, which controls fatty acids and

regulates growth factors [10]. Without proper gene expression, the cell’s growth factor

communication may become dysregulated, resulting in cancerous tumor growth.

Figure 1: Modeling expression by methylation in FABP5. Noticing the 59% coverage as indicated by the r-squared

value, we can infer that this gene is being downregulated by methylation in glioblastoma patients.

EGFR is a growth factor receptor involved in cell division and proliferation, and its expression tends to have a high

correlation with copy number in many cancers, including glioblastoma [11].

Figure 2: Individual gene analysis of EGFR in glioblastoma. This gene holds a statistical correlation for copy

number and MethylMix methylation. As the two separate models (blue) are combined (red), we can see that this

gene is driven by copy number alteration.

PDLIM4 is a gene that is heavily methylated in several cancers, and its repression due to this

hypermethylation is listed as a biomarker for several cancers.

Figure 3: Individual gene analysis of PDLIM4, an actin protein coding gene, in glioblastoma. This gene holds a

statistical correlation for copy number and MethylMix methylation. As the two separate models (blue) are combined

(red), we can see that this gene is driven by methylation.

3.2 Raw Linear Model Data

The raw data show the accuracy and scope of the expression model in relation to copy

number alterations, methylation data, or both predictors. In this chart, the raw data have

been analyzed to see whether genes are driven by copy number, methylation, or both factors. All of these genes

have both copy number and methylation rendered as significant, that is, with p-values that pass the filter

(the default filter keeps p-values < 0.05).

Figure 4: Linear Model Data – Users are given all the raw data for genes of interest in glioblastoma. Combining

copy number and methylation into one model facilitates analysis of drivers for gene expression. (0* indicates a value

< 1e-10)

3.3 Disease Summary

The disease summary gives a holistic view of how the disease is being driven. A pie

chart displays the share of genes driven by copy number, methylation, or both factors, and the remaining

genes are labeled insignificant. Sorting the genes of interest by relevance can show the most

important genes for any cancer. Because only relevant clusters are selected, we can

eliminate a number of genes as insignificant and focus the model on the

expression of important genes, as shown in the disease summary.
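
The two parts of the disease summary, the driver shares behind the pie chart and the relevance ranking, can be sketched as follows. The function name `summarize_disease`, the record layout, and the gene names and values are hypothetical and are not results from the package. Ranking by r-squared follows the caption of the top-genes figure.

```python
from collections import Counter


def summarize_disease(gene_results, top_n=3):
    """Return driver shares (in percent) and the top genes ranked by r-squared.

    gene_results is a list of dicts with keys 'gene', 'driver', 'r_squared'
    (hypothetical layout).
    """
    counts = Counter(result["driver"] for result in gene_results)
    total = len(gene_results)
    shares = {driver: 100.0 * count / total for driver, count in counts.items()}

    significant = [r for r in gene_results if r["driver"] != "insignificant"]
    ranked = sorted(significant, key=lambda r: r["r_squared"], reverse=True)
    return shares, [r["gene"] for r in ranked[:top_n]]


# Toy records for illustration only; these are not results from the paper.
gene_results = [
    {"gene": "GENE_A", "driver": "both", "r_squared": 0.55},
    {"gene": "GENE_B", "driver": "copy number", "r_squared": 0.30},
    {"gene": "GENE_C", "driver": "insignificant", "r_squared": 0.01},
    {"gene": "GENE_D", "driver": "methylation", "r_squared": 0.59},
]
shares, top_genes = summarize_disease(gene_results)
print(shares)
print(top_genes)  # ['GENE_D', 'GENE_A', 'GENE_B']
```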

Figure 5: Pie chart showing drivers for significant genes of a pool of genes of interest in glioblastoma.

The model labels as many genes as possible as not significant by keeping only the most significant

coefficients, allowing for a wider spread of genes. 73% of genes have unique models, as shown here.

Figure 6: Top genes listed in a disease summary for glioblastoma. At a glance, users can see genes sorted by

relevance for any given cancer. Relevance is denoted using the r-squared value.

4 DISCUSSION

Through the three visualization tools of the linear model software, we can find the most

important genes in a cancer, visualize them separately, determine the dominant factor in these

oncogenes, and get a glimpse of how the disease is driven overall.

When the model includes only methylation or only copy number, we can easily see how the gene is

being driven. For example, methylation in FABP5 is downregulating gene expression, as expected

under the traditional view of methylation blocking transcription. We can then infer

that, because FABP5 has a large 59% coverage and a very low p-value, methylation in

FABP5 could be instrumental to the regulation of GBM and warrants further research.

EGFR and PDLIM4 in glioblastoma are examples of linear models that incorporate both

copy number alteration and methylation clusters. We can see that EGFR has a strong correlation

with both copy number and methylation, but its relevance is much higher with copy number than with

methylation. If EGFR were studied with just copy number or just methylation, we would not be able

to see copy number dominate the expression of EGFR as it does in the combined model. Similarly,

in PDLIM4, we see methylation as the ultimate driver for expression after the copy number and

methylation models are combined.

The separate copy number and methylation models can also work together and promote

synergy, as shown in the raw data for PDGFRA. We can see the relevance of the model increase

from 41% in methylation and 22% in copy number to 55% overall in the combined model. This

synergy shows an inherent strength in the model to combine components and increase accuracy

and relevance of the model.

Summarizing the disease based on its drivers allows us to see the overall expression of

the disease. It also shows the package’s efficacy in combining models, producing 25% of models

as combined models with copy number and methylation. By selecting only the relevant parts of

each model, the package also inherently selects the relevant genes, allowing the code to

set aside as many irrelevant genes as possible and permitting the user to focus on the genes that have

statistical relevance to the cancer.

5 CONCLUSIONS AND FUTURE WORK

Overall, the linear model package is a simple yet effective way to model gene expression

in cancer. It allows us to come to striking conclusions by analyzing data available on TCGA in

real time. The package provides a comprehensive view of genes with a strong correlation to other

specific factors, which allows users to find and continue research on unexplored genes that are

found using this tool. Mechanisms related to these unexplored genes with high correlation to

copy number, methylation, or both factors can be explored to study the causes of specific

cancers and, ultimately, to counteract those mechanisms and improve cancer treatment.

Simple and comprehensive as the linear model is, it has its faults. For example,

exponential, logarithmic, polynomial, and otherwise nonlinear relationships between gene

expression and copy number alterations or methylation data can go unnoticed. In the future, these

complex relationships can be added to find more genes that have strong correlations to cancer.

We intend to add 3D modeling to gene-by-gene graphs to visualize the linear fit as a

plane over the copy number and methylation axes. More predictors of

gene expression, such as miRNA, may be included in the model in the future to account for more

factors in gene expression and make the model more accurate [13]. These more complex

higher-dimensional models allow more places for strong statistical correlations between gene expression

and its predictors, and allow multiple statistically significant predictors to work together in

synergy.

Creating the software package from scratch was a great way to explore the field of

bioinformatics. By providing data analysis tools, I learned more about the biological and

statistical concepts of this field. My lab provided the introduction and the framework necessary

for my understanding of how to create this product, and through this background, I built a

product that can serve the bioinformatics field.

In summation, we modeled gene expression as a linear model using copy number and

methylation, and the result is a fast, elegant, and comprehensive way to find important

oncogenes and analyze their behavior and impact on any given cancer.
