using network clustering to predict copy number variations
TRANSCRIPT
USING NETWORK CLUSTERING TO PREDICT COPY NUMBER VARIATIONS
ASSOCIATED WITH HEALTH DISPARITIES
BY
Yi Jiang
Li Yang Farah Kandah Professor of Computer Science Assistant Professor of Computer Science (Chair) (Committee Member)
Katherine Winters Senior Lecturer of Computer Science (Committee Member)
ii
USING NETWORK CLUSTERING TO PREDICT COPY NUMBER VARIATIONS
ASSOCIATED WITH HEALTH DISPARITIES
BY
Yi Jiang
A Thesis Submitted to the Faculty of the University of Tennessee at Chattanooga in Partial Fulfillment of the Requirements of the Degree of Master:
Computer Science
The University of Tennessee at Chattanooga Chattanooga, Tennessee
December 2014
iii
ABSTRACT
Substantial health disparities exist between African Americans and Caucasians in
the United States. Copy number variations (CNVs) are one form of human genetic
variations that have been linked with complex diseases and often occur at different
frequencies among African Americans and Caucasian populations. In this study, we
aimed to investigate whether CNVs with differential population frequencies can
contribute to health disparities from the perspective of gene networks. We inferred
network clusters from two different human gene/protein networks. We then evaluated
each network cluster for the occurrences of known pathogenic genes and genes located in
CNVs with different population frequencies, and used false discovery rates (FDRs) to
rank network clusters. This approach let us identify five clusters enriched with known
pathogenic genes and with genes located in CNVs with different frequencies between
African Americans and Caucasians. These clustering patterns predict four candidate
causal population-specific CNVs that play potential roles in health disparities.
iv
ACKNOWLEDGEMENTS
I would like to express my deepest gratitude to my advisor, Prof. Li Yang, for her
thoughtful guidance, warm encouragement, great patience, and financial support during
the whole period of my research. I appreciate her vast knowledge and skills, and her
assistance in writing this thesis.
I would like to thank my thesis committee members, Prof. Farah Kandah and Ms.
Katherine Winters for their excellent advises and detailed review during the preparation
of this thesis.
I would also like to thank Prof. Hong Qin at Spelman College, Atlanta, GA, for
thoughtful guidance, insightful discussion, correction of my writing, and the help to
develop my background in computational biology and genetics.
v
TABLE OF CONTENTS
ABSTRACT iii ACKNOWLEDGEMENTS iv LIST OF TABLES v LIST OF FIGURES vi LIST OF ABBREVIATIONS vii CHAPTERS
I. INTRODUCTION 1
1.1 Objectives of the Study 1 1.2 Health Disparities 1 1.3 Genome-Wide Association Studies 4 1.4 Copy Number Variations 5 1.5 Protein-Protein Interaction Networks 6 1.6 Network-Based Analysis 8 1.7 Contributions of the Study 10
II. MATERIAL AND METHODS 12
2.1 Network Clustering 13 2.2 Gene Mapping of CNVs and SNPs 14 2.3 Cluster Analyses 16 2.4 Biological Significance Analyses 20
III. RESULTS AND DISCUSSION 22
3.1 Network Clustering Results 22 3.2 Gene Mapping Results 23 3.3 Cluster Analyses Results 24
3.3.1 Clusters enriched with genes located in African American CNVs 30
3.3.2 Clusters enriched with genes located in Caucasian CNVs 31
vi
3.4 Plausible inks of population-specific CNVs in the identified Network Clusters to Health Disparities 32 3.4.1 Duplication of HSPB1 and Health Disparities in
African Americans 33 3.4.2 Duplication of ATP2A1 and Health Disparities in Caucasians 35
IV. CONCLUSION 36
REFERENCES 37 APPENDIX
A. CODES FOR GENE MAPPING OF SNPS AND CNV COORDINATES 44
B. CODES FOR RIGHT-TAILED FISHER’S EXACT TEST 46
C. CODES FOR FALSE DISCOVERY RATE CALCULATION 49 VITA 51
vii
LIST OF TABLES
2.1 Contingency table for Fisher’s exact test on pathogenic genes 17 2.2 Contingency table for Fisher’s exact test on CNV genes 17 3.1 Summary of biological networks 23 3.2 Results of gene mapping of SNPs and CNV coordinates 24 3.3 Top-ranked clusters from HPRDNet 26 3.4 Top-ranked clusters from MultiNet 27 3.5 Cluster analysis results for HPRDNet and MultiNet 28 3.6 Selected genes with potential roles in health disparities and their located CNVs 29 3.7 Enriched GO terms with CNV-genes in the identified network clusters 32 3.8 Associated diseases of genes with enriched GO terms 33
viii
LIST OF FIGURES
1.1 Death rates of selected ethnicities for five causes of death in the United States 2 2.1 Overview of our approach to identify CNVs associated with health disparities 13 3.1 Graph representations of selected clusters for biological significance analysis 28 3.2 Graph representations of cluster AA1, AA2 and AA3 31
ix
LIST OF ABBREVIATIONS
HGV, Human genetic variation
CNV, Copy number variation
SNP, Single nucleotide polymorphism
GWAS, Genome wide association studies
GSA, Gene set analysis
PPIN, Protein-protein interaction network
HPRD, Human protein reference database
PPI, Protein-protein interaction
AA, African American
CA, Caucasian
MCL, Markov Cluster Algorithm
CHD, Coronary Heart Disease
PANOGA, pathway and network oriented GWAS analysis
RA, Rheumatoid Arthritis
FDR, false discovery rate
GO, Gene ontology
OMIM, Online Mendelian Inheritance in Man
dbSNP, Single Nucleotide Polymorphism Database
SERCA1, Sarco/endoplasmic reticulum Ca2+-ATPase 1
1
CHAPTER I
INTRODUCTION
1.1 Objectives of the Study
In this study, we aim to investigate the association of health disparities and
genetic variations with different population frequencies, to better understand health
disparities between African Americans and Caucasians,
Here, we propose a novel network clustering based approach to associate
population-specific copy number variations (CNVs) and health disparities. First, we
obtain human gene/protein interaction networks and partition them into gene clusters.
Second, we search pathogenic single nucleotide polymorphisms (SNPs) and population-
specific CNV loci in genome database to generate gene lists. Third, clusters are ranked
based on results of gene enrichment tests for pathogenic genes and CNV-genes. At last,
we investigate the biological significance of clusters that were ranked at first place. We
will use this approach to identify CNVs that may contribute to health disparities between
African Americans and Caucasians in diseases.
1.2 Health Disparities
Health disparities refer to differences in health status between people grouped by
social or demographic factors, such as race, gender, income or geographic region. The
differences could be in the presence of disease distribution, health outcomes, quality of
2
health care and access to health care services. In United States, health disparities between
African Americans and other racial and ethnic populations are found in life expectancy,
death rates, and health measures. Figure 1.1 shows the death rates of selected ethnicities
for five causes of death in the United States. The death rates are per 100,000 population
and age-adjusted to the 2000 census. AI, AN and PI refer to American Indian, Alaska
Native, and Pacific Islander, respectively. As we can see, the death rates of African
Americans are found higher than those of other populations in heart diseases, prostate
cancer (in male), breast cancer (in female), and diabetes (National Center for Health
Statistics 2007; National Center for Health Statistics 2013) (Figure 1.1). According to a
recent study, eliminating health disparities would have reduced direct medical care
expenditures by about $230 billion and indirect costs associated with illness and
premature death by more than $1 trillion for the years 2003 to 2006 (LaVeist et al. 2011).
Figure 1.1 Death rates of selected ethnicities for five causes of death in the United States.
0
50
100
150
200
250
Heart Disease Prostate
Cancer*
Breast Cancer* Liver Disease Diabetes
De
ath
Ra
te p
er
10
0,0
00
po
pu
lati
on
Cause of Death
Death rates of selected ethnicities in US
Whites
African American
AI or AN
Asian or PI
Hispanic
3
Many factors contribute to health disparities (American Public Health
Association). People with different socioeconomic status, such as income, education and
occupation, will have different opinions on health practice and different access to healthy
diet. People living in rural area and/or with low income will have trouble to obtain
essential health care. People in different culture will have different living style, such as
different living habit and diet, which will affect disease prevalence and health treatment
outcome.
In addition, human genetic variations (HGVs) play a significant role in health
disparities (Fine et al. 2005; Ramos & Rotimi 2009). Human genes contain information
that is required to build, regulate and maintain our bodies. HGVs are permanent changes
in human genes and may cause alterations in an individual's phenotype, from physical
properties to disease risk. According to “out of Africa” model, modern humans speciated
in Africa and then migrated to other continents of the world (Stringer & Andrews 1988).
During the migration, genetic variations occurred and kept due to random chance, natural
selection, and other genetic mechanisms, at different frequencies from region to region
(Tishkoff & Verrelli 2003). It is believed that genetic variations are the main reasons of
many diseases, and thus different occurrence frequencies can lead to differences in
disease susceptibility or resistance among various populations. Studies on associations of
genetic variations and diseases are essential to understand disease etiology and health
disparities, and are greatly advanced by the completion of the International HapMap
Project (Ramos & Rotimi 2009). With the aid of modern genome sequencing techniques,
the HapMap contains information of common human genetic variants, such as where
4
these variants occur in our DNA, and how they are distributed within and among
different population, which can be used by researchers to link genetic variants to diseases.
1.3 Genome-Wide Association Studies
Genome-wide association studies (GWAS) are currently an effective approach to
identify diseases-associated genetic variations (Hirschhorn & Daly 2005; Wang et al.
2005). In GWAS, a group of diseased individuals is compared to a group of healthy
individuals for a large number of Single Nucleotide Polymorphisms (SNPs) (Clarke et al.
2011). The frequency of each allele is compared between two groups and a statistical test
is performed with a null hypothesis that no association exists between disease and the
SNP. Usually tests of millions of SNPs are carried out, which requires multiple
hypothesis testing procedures to control false positives (Dudbridge & Gusnanto 2008).
Although GWAS have revealed many disease-associated SNPs, only a few of them are
associated with moderate or large increase in disease risk, and some well-known genetic
risk factors have been missed (Williams et al. 2007). One possible reason is that GWAS
focus only on individual genetic variations and do not address complex gene interactions
(Moore & Williams 2009). Another possible reason is that the current statistical analysis
is “unbiased”, since it ignores available knowledge of disease pathobiology (Moore et al.
2010).
Several approaches tried to incorporate GWAS results with known biological
knowledge. One of them is called Gene Set Analysis (GSA) (Cantor et al. 2010; Lehne et
al. 2011; Wang et al. 2007), which associates variations in an entire set of genes with a
phenotype. A gene set is defined as a set of genes that are involved in common biological
5
processes or pathways, or as a set of interacting proteins identified from protein-protein
interaction networks (Lehne et al. 2011). In GSA, enrichment tests are performed by
comparing the frequency of significantly associated SNPs in a particular set of genes with
that among all other genes not in the set. Gene sets containing significantly more
associated SNPs will have closer association with the corresponding phenotypes. The
advantage of GSA is that it detects associations of the phenotype with a gene set, not
individual SNPs. Therefore it does not ignore SNPs that have low p-values but still
contribute to phenotypes, and it reduces the number of statistical tests and requires less
stringent multiple testing correction (Lehne et al. 2011).
1.4 Copy Number Variations
Unlike SNPs, which affect only one single nucleotide base, copy number
variations (CNVs) are duplications or deletions of relatively large genomic segments that
can contain one or more genes (Feuk et al. 2006; Freeman et al. 2006). The widespread
presence of CNVs in normal individuals was first reported in 2004 (Iafrate et al. 2004;
Sebat et al. 2004). And to date, over 100,000 non-overlapping human CNVs have been
identified, with the size varying from 50 base pair to more than one million bases pairs,
and they cover about 70% of the whole genome (MacDonald et al. 2014).
In early genetic association studies, CNVs have been associated with various
complex diseases (Feuk et al. 2006; Ionita-Laza et al. 2009). Updates on CNVs’ roles in
some diseases, such as psychoses (Lee et al. 2012), autism (Wang et al. 2013),
autoimmunity (Olsson & Holmdahl 2012) and schizophrenia (Hosak et al. 2012), have
been reviewed recently. Computational tools and methods have been developed to help
6
address the potential roles of CNVs in human diseases. The CNVannotator was
developed to provide considerable capabilities for researchers to annotate specific CNVs
in a reliable and efficient manner (Zhao & Zhao 2013). The NETBAG+ algorithm was
developed to search for strongly cohesive gene clusters affected by CNVs, using a
likelihood network constructed based on a combination of various functional descriptors.
(Gilman et al. 2012). Recently, it is reported that CNVs can occur at different frequencies
between African Americans and Caucasians (McElroy et al. 2009), and naturally the
question about the potential roles of CNVs in health disparity is raised.
1.5 Protein-Protein Interaction Networks
Protein-protein interactions (PPIs) play diverse roles in biology. It is observed that
proteins seldom carry out their function in isolation, and usually proteins involved in the
same cellular processes interact with each other (von Mering et al. 2002). Advanced
high-throughput technologies, such as yeast-two-hybrid screening, mass spectrometry,
and protein microarray chip technologies, have generated huge data sets of protein-
protein interactions (von Mering et al. 2002).
Several databases have been constructed as repositories for experimentally
discovered protein interactions (Mathivanan et al. 2006). PPIs are incorporated into PPI
databases through curation from the literatures by biologists, or through direct deposit by
the investigators before their publication. For example, Human Protein Reference
Database (HPRD) is a joint project between the Institute of Bioinformatics in Bangalore,
India and the Pandey lab at Johns Hopkins University in Baltimore, USA. HPRD
contains annotations related to human proteins based on experimental evidence from the
7
literature (Mishra et al. 2006; Peri et al. 2004; Prasad et al. 2009). HPRD includes not
only PPIs, but also interactions between proteins and other small molecules, as well as
information about post-translational modifications, subcellular localization, protein
domain architecture, tissue expression and association with human diseases. PPIs in
HPRD are usually direct physical interactions. Pairwise interactions are often represented
by undirected links in a graph model of network. Some databases contain indirect genetic
or regulatory interactions, and some contain directional interactions such as those in
phosphorylation, metabolic, signaling and regulatory networks. In one study, various
interaction data have been put together to construct a unified global network named
MultiNet (Khurana et al. 2013).
Protein-protein interaction (PPI) data can be represented in the form of networks,
in which nodes are proteins and edges are interactions. The protein-protein interaction
network (PPIN) can help understand the basic scheme of cell functions by correlating the
components of the network with their cellular functions, which can be done by clustering
processes (Lin et al. 2007; Pizzuti et al. 2012; Wang et al. 2010). In PPIN, a cluster is a
set of genes that share a large number of interactions, and the clustering process is to
group genes into clusters which contain more interactions among genes in the same
cluster than in different clusters. Clustering process can identify both protein complexes
and functional modules (Lin et al. 2007; Pizzuti et al. 2012; Wang et al. 2010). Protein
complexes are groups of proteins that bind to each other at the same time and place,
while functional modules consist of proteins that participate in the same cellular process
through interactions between themselves at a different time and place. Usually the PPINs
do not contain information about when and where proteins interacts, therefore protein
8
complexes and functional modules are not treated differently in clustering processes. The
results of clustering process of PPINs can help to infer the principal function of each
cluster from the functions of its members, and suggest possible functions of cluster
members based on the functions of other members.
Many distance-based or graph-based clustering algorithms were developed to
cluster PPINs (Lin et al. 2007; Pizzuti et al. 2012; Wang et al. 2010). The Markov
Clustering (MCL) algorithm is a fast, scalable, and unsupervised clustering algorithm,
which simulates stochastic flows in graphs (van Dongen 2000). The algorithm simulates
random flows within a graph by alternation of two operations called expansion and
inflation. In expansion, the flow moves within the same dense regions or out to other
dense regions. The inflation operation strengthens the flow within the dense regions and
weakens the flow out of the dense regions. The expansion and inflation steps are repeated
until a steady state is reached. A recent study compared MCL with other three clustering
algorithms, restricted neighborhood search clustering (RNSC), super paramagnetic
clustering (SPC) and molecular complex detection (MCODE), on six PPINs to detect
previously annotated gene clusters (Brohee & van Helden 2006). The conclusion was that
MCL algorithm outperformed the other algorithms in the extraction of complexes from
interaction networks.
1.6 Network-Based Analysis
Gene/Protein interaction networks combined with GWAS data can help
understand complex biological activities and cellular mechanisms of complex diseases
(Barabasi et al. 2011; Halldorsson & Sharan 2013; Sharan et al. 2007; Vidal et al. 2011;
9
Wang et al. 2011). This is based on the assumption of “guilt by association”, which
means that genes associated with the same or related functions or diseases tend to interact
with each other and cluster together with high connectivity in networks (Altshuler et al.
2000; Oliver 2000).
One example of using PPIN to identify disease-associated genes was reported in a
study of incident Coronary Heart Disease (CHD) (Jensen et al. 2011). In this study, an
experiment-derived PPI database InWeb was used to produce unbiased protein complexes
and corresponding gene sets, which were then ranked based on results of enrichment tests
of CHD-associated genes. In the identified gene set, five out of 19 genes were involved in
abnormal cardiovascular system physiological features, and pathways related to blood
pressure regulation were significantly enriched.
Another methodology that utilizes the PPINs to discover disease associated
clusters is called the pathway and network oriented GWAS analysis (PANOGA), which
combines GWAS data with current knowledge of biochemical pathways, PPINs, and
functional and genetic information of selected SNPs (Bakir-Gungor & Sezerman 2011).
In their study, genes related to significant SNPs from GWAS data were identified and
were assigned with functional attributes, which were used in the process of identifying
clusters associated with the disease. Then, genes in one identified cluster were tested
whether they are part of important pathways. The application of this methodology on
Rheumatoid Arthritis (RA) dataset identified new RA-associated pathways, in addition to
pathways previously identified by GWAS analysis. The newly identified pathways were
found to include many genes that are known to be used as drug targets for the treatment
of RA. Moreover, new genes have been identified to be associated with RA.
10
Protein interactions could provide important clues to help illustrate SNP’s
functional association (Huang et al. 2010). Protein interaction network was combined
with other traditional hybrid features, such as sequence, structure and pathway properties,
and it was used to establish predictors using hundreds of those features. These predictors
can correctly identify around 80% of known disease-associated SNPs and is valuable to
predict undiscovered disease-associated SNPs.
In another approach, a genome-scale functional gene network, named HumanNet,
was constructed by incorporating gene expression, protein interaction, sequence and other
genomic data to prioritize candidate disease genes, which can facilitate both seed gene-
based and GWAS-based disease association studies (Lee et al. 2011). In seed gene-based
approach, gene connections in the network were assigned with weights, calculated by
using label propagation algorithms based on their distance to the seed genes. Genes
connected to seed genes with larger weights were be considered as more likely to be
associated with target diseases. Although for GWAS data there are no definite seed genes,
this approach can still boost the power of association analysis by using a different ranking
score (Lee et al. 2011). In the analysis of Crohn’s disease and type 2 diabetes, the
HumanNet not only boosted the identification of correct associations, but organized the
associated genes into processes, which arouse attentions to genes that were not
significantly identified in GWAS.
1.7 Contributions of the Study
Although genetic factors play a crucial role in health disparities, only a few
association studies have been reported in common complex diseases, such as breast
11
cancer (Long et al. 2013), prostate cancer (Bensen et al. 2014; Bensen et al. 2013; Xu et
al. 2011), type 2 diabetes (Ng et al. 2014) and vascular diseases (Wei et al. 2011). In
order to better understand health disparities between African Americans and Caucasians,
we aim to investigate the association of health disparities and genetic variations with
different population frequencies.
Here, we propose a novel network clustering based approach on CNVs for health
disparities. First, we choose to focus on CNVs. Although CNVs are one important type of
genetic variations, and can occur at different frequencies among African Americans and
Caucasians, no association studies have been reported so far to our best knowledge.
Therefore, this work is the first study on association of CNVs and health disparities.
Second, our approach is on gene level, and pathogenic SNPs and population specific
CNV loci are mapped to corresponding gene names. Current GWAS on health disparities
still focused on individual SNPs, but we choose to focus on genes, because only a small
fraction of the genetic heredity of most diseases can be explained by the SNPs, and a
gene-based approach can allow us using the information encoded by protein interaction
networks (Lee et al. 2011). Third, association analysis in our approach uses gene clusters
inferred from gene networks, which is based on the rationale that interacting genes often
have the same functions or participate in the same biological processes. In addition,
unlike common Gene Set Analysis (GSA) studies that compares significantly associated
SNPs (Cantor et al. 2010; Lehne et al. 2011; Wang et al. 2007), our novel approach
compares the frequency of pathogenic genes or population specific CNV related genes
between clusters to evaluated the relationship between clusters and diseases or CNVs.
12
CHAPTER II
MATERIALS AND METHODS
This chapter introduces materials (gene/protein networks and gene sets) we
collected and prepared for this study, and methods we used in clustering process, cluster
analyses and biological significance analysis.
Our overall work flow is shown in Figure 2.1. To identify potential CNVs
associated with health disparities, our basic idea is to identify gene clusters that are
enriched with both pathogenic genes and genes located in population-specific CNVs.
Health disparities in diseases associated with identified clusters could be considered as
results of the occurrence of corresponding CNVs. Specifically, we first obtained two
human gene/protein networks and partitioned them into gene clusters. We then identify
disease-associated genes and genes located in population-specific CNVs in those clusters.
Statistical tests were performed on each cluster to estimate its significances of containing
pathogenic genes and genes in population-specific CNVs. Finally, we ranked gene
clusters based on false discovery rates (FDRs). Top-ranked clusters were enriched both
for pathogenic genes and for genes in CNVs with differential frequencies between
African-Americans and Caucasians. These clusters were then searched for enriched Gene
Ontology (GO) terms and related disease phenotypes to identify corresponding biological
significance.
Figure 2.1 Overview of our approach to identify CNVs associated with health disparities
2.1 Network Clustering
We obtained two human
Reference Database (HPRD)
another from MultiNet (Khurana et al. 2013
HPRDNet) is one of the largest human gene/protein interaction networks, and
only physical protein-protein interactions (PPIs)
including PPI, phosphorylation,
These two networks share 8468 genes (89.6% of HPRDNet and 58.6% of MultiNet)
only 8769 interactions (23.8% of HPRDNet and 8% of MultiNet).
were both partitioned into gene clusters using the M
13
1 Overview of our approach to identify CNVs associated with health disparities
human gene/protein networks, one from Human Protein
Reference Database (HPRD) (Mishra et al. 2006; Peri et al. 2003; Prasad et al. 2009
Khurana et al. 2013). The HPRD network (referred to
is one of the largest human gene/protein interaction networks, and
protein interactions (PPIs). The MultiNet is a unified network
phosphorylation, metabolic, signaling, genetic and regulatory networks.
8468 genes (89.6% of HPRDNet and 58.6% of MultiNet)
only 8769 interactions (23.8% of HPRDNet and 8% of MultiNet). These two networks
into gene clusters using the Markov Cluster (MCL) Algorithm
1 Overview of our approach to identify CNVs associated with health disparities
otein
Prasad et al. 2009) and
to as
is one of the largest human gene/protein interaction networks, and contains
network
metabolic, signaling, genetic and regulatory networks.
8468 genes (89.6% of HPRDNet and 58.6% of MultiNet) but
These two networks
Algorithm (van
14
Dongen 2000). MCL (version10-201) was installed and run in an Ubuntu 11.10 system.
The following command was used to run clustering:
mcl <network> --abc -I <I value> --force-connected=y -o <result>
Option <network> refers to the input file that contains the gene/protein interaction
network. In the input file, each line contains a pair of gene names, separated by a tab
space, which represent a gene-gene interaction. Option <I value> refers to the inflation
parameter I, which was from 1.1 to 2.0 with a step of 0.1. Option <result> refers to the
output file to which the clustering results will be saved. In the output file, each line
contains a set of gene names separated by a tab space, which are member of a cluster.
2.2 Gene Mapping of CNVs and SNPs
Two sets of genes were used in this study: genes located in population-specific
CNVs and pathogenic genes. Currently, there are no ready-made sources containing
information of these genes. However, coordinates of disease-associated SNPs and
population-specific CNVs were available and were searched in the UCSC Genome
Database (Karolchik et al. 2014) through its MySQL API to obtain the corresponding
gene name sets. All works were done using codes in Java. The following Java statements
were used to establish the connection to UCSC Genome Database MySQL interface:
// setup connection to UCSC Genome Database String username = "genome";
String password = "";
String url = "jdbc:mysql://genome-mysql.cse.ucsc.edu/hg19"; Class.forName("com.mysql.jdbc.Driver"); Connection con = DriverManager.getConnection(url, username,
password);
15
CNV coordinates were obtained from a CNV map in African Americans and
Caucasians (McElroy et al. 2009). There are three types of CNVs in this map: (1) CNVs
only occur in African Americans, (2) CNVs only occur in Caucasians, and (3) CNVs
occurred in both African Americans and Caucasians. To simplify the analysis, we further
partitioned the last type: CNVs that occurred more than 50% in African Americans or in
Caucasians were combined with the first and second types of CNVs, respectively. This
repartition resulted in two modified CNV sets with differential population frequencies.
The coordinates of these CNVs were then searched in the UCSC Genome Database
through its MySQL API to obtain the corresponding gene sets. The SQL statements were
formed in Java codes (APPENDIX A) and submitted, in which begin, end, and chrom
refer to the starting positions, ending positions and chromosomes of the CNVs,
respectively:
// search related genes of CNVs
select distinct X.geneSymbol, G.chrom, G.txStart, G.txEnd from knownGene as G, kgXref as X where ((G.txStart>begin AND G.txEND<end or
(G.txStart>begin AND G.txStart<end or (G.txEND>begin AND G.txEND<end or
(G.txStart<begin AND G.txEnd>end AND
X.kgId = G.name AND G.chrom = chrom);
For simplicity, CNVs that occur more frequently in African Americans were called
African-American CNVs or CNV_AA; CNVs that occur more frequently in Caucasians
were called Caucasian CNVs or CNV_CA.
Disease-associated SNPs were retrieved from a file, OmimVarLocusIdSNP.bcp,
obtained from the FTP site of Single Nucleotide Polymorphism Database (dbSNP)
(Sherry et al. 2001). Coordinates of these SNPs were then queried against the MySQL
16
API of the UCSC Genome Database to identify genes in which those SNPs are located.
The following SQL statements were composed in Java codes (see APPENDIX A) and
submitted, in which SNP refers to the RSid of the SNPs:
//search related genes of SNPs. select distinct S.name, X.geneSymbol from snp138 as S, knownGene as G, kgXref as X
where S.name = SNP AND X.kgId = G.name AND S.chrom = G.chrom AND G.txStart <= S.chromStart AND G.txEnd >= S.chromEnd;
This identified gene set was termed as pathogenic genes.
2.3 Cluster Analyses
Clusters were obtained from both HPRDNet and MultiNet using MCL with a
range of ten inflation parameters. For each cluster, contingency tables were constructed
using the numbers of pathogenic genes and population-specific CNV-related genes.
Table 2.1 is for pathogenic significance test, and Table 2.2 is for tests of enrichment
significance of CNVrelated genes (CNV_AA or CNV_CA genes). Q and q are the
number of pathogenic genes in the whole networks and that in current cluster,
respectively. N and m are the number of genes in whole networks and that in current
cluster, respectively. S and s are the number of CNV_AA or CNV_CA genes in the
whole networks and that in current cluster, respectively.
In statistics, a contingency table is a type of table that displays the frequency
distribution of multiple variables, usually in a two by two format. The significance of the
difference between the two proportions (derived based on observed frequencies) can be
evaluated by various statistical tests, such as Pearson's chi-squared test and Fisher's exact
17
test. If the proportions in the different columns are significantly different between rows
(or vice versa), we say that there is a contingency between the two variables. In our case,
we want to know whether a cluster contains more pathogenic genes or population-
specific CNV-related genes than all of the other clusters. If it does, we say this cluster is
enriched with such genes and thus has stronger connections to diseases or is more likely
to be affected by population-specific CNVs. If a cluster is found to be significant in both
enrichment tests, it will be of most interest to us.
Table 2.1
Contingency table for Fisher’s exact test on pathogenic genes
Pathogenic Genes Non-pathogenic Genes Total
Genes in this cluster q m-q m
Genes in other clusters Q-q N-Q-m+q N-m
Total Q N-Q N
Table 2.2
Contingency table for Fisher’s exact test on CNV genes
CNV Genes Non-CNV Genes Total
Genes in this cluster s m-s m
Genes in other clusters S-s N-S-m+s N-m
Total S N-S N
In this study, right-tailed Fisher’s exact tests were applied to these contingency
tables to calculate enrichment significance of pathogenic genes, CNV_AA or CNV_CA
genes, respectively. Fisher's exact test is a statistical test used in the analysis of
18
contingency tables. It is valid for all sample sizes, but in practice it is employed mostly
when sample sizes are small, which is the main reason it is chosen in for our analyses.
Using Table 2.1 as an example, the null hypothesis H0 is that the proportion of
pathogenic genes in one cluster less than or equals to that in all other clusters, and the
alternative hypothesis H1 indicates that this pathogenic occurrence in this cluster is more
than random expectation:
�� : �
��
� �
�
�� : �
��
� �
�
in which Q and q are the number of pathogenic genes in the whole networks and that in
current cluster, respectively, and N and m are the number of genes in whole networks and
that in current cluster, respectively. In other words, if the null hypothesis is true, we
would have
� ��
× �
The question we ask is: what is the probability that the observed q value is larger than the
expected value in the null hypothesis? The right-tailed Fisher’s exact test will give the
answer: if the p-value is small (common cutoff values are 0.001, 0.01, 0.05, or 0.10), we
can say that in this cluster pathogenic genes were significantly enriched. The smaller the
p-values, the more significant the enrichment is. The right-tailed Fisher’s exact tests were
performed in the R statistical environment (R Development Core Team 2013), with the
following R statements composed in Java codes (see APPENDIX B) and submitted to the
R engine:
19
// pathogenic gene enrichment tests. cluster = c(q, m-q);
other = c(Q-q, N-Q-m+q); tb = rbind(cluster, other); fisher.test(tb, alternative="greater")$p.value;
// population-specific CNV-related gene enrichment tests. cluster = c(s, m-s);
other = c(S-s, N-S-m+s); tb = rbind(cluster, other); fisher.test(tb, alternative="greater")$p.value;
The right-tailed Fisher’s exact tests were performed to each cluster in each
clustering results, which led to the multiple comparisons problem. The multiple
comparisons problem, or multiple testing problem, occurs when a set of statistical tests
are performed simultaneously (Miller 1981), in which the null hypothesis are more likely
to be incorrectly rejected based on individual tests when all tests were considered as a
whole. The False discovery rate (FDR) control is a statistical method to correct for
multiple comparisons. In a list of tests, FDR procedures are designed to control the
proportion of incorrectly rejected null hypotheses, i.e. “false discoveries” (Benjamini &
Hochberg 1995). There are many statistical procedures that use p-values to estimate or
control the FDR, such as Benjamini-Hochberg procedure (Benjamini & Hochberg 1995)
and Q-value procedure (Storey et al. 2004). However, most of them assume that p-values
are continuously distributed and based on two-sided tests. Therefore, it is difficult to
reliably estimate the FDR when the p-values are discrete or based on one-sided tests.
Since p-values in our study are discrete and based on one-sided tests, a proper FDR
calculation procedure needs to be carefully chosen. The Robust FDR Routine does not
rely on the assumptions that tests are two-sided or yield continuously p-values, and was
shown to have excellent performance in a series of data (Pounds & Cheng 2006). Based
20
on obtained p-values from the right-tailed Fisher’s exact tests, we calculated FDRs using
the Robust FDR Routine in the R statistical environment. The following R statements
were composed in Java codes (see APPENDIX C) and submitted to the R engine:
// Run robust FDR routine. // p is the vector of all sorted p-values.
robust.fdr(p, discrete=T, use8=F)$fdr";
Ranking were applied to clusters with p-value<0.10 and FDR<0.20 in both
enrichment tests for pathogenic genes and population-preferred CNVs genes. Assuming
both tests are independent, the FDR values were multiplied and the products were used to
rank the network clusters.
The same cluster analysis procedure was applied to each clustering results with
different MCL inflation parameters. For clarity, we focused our biological significance
analyses on clusters that were consistently ranked at the first place with different MCL
inflation parameter values.
2.4 Biological Significance Analyses
To understand the biological significance of the clusters, their corresponding
Gene Ontology (GO) terms were analyzed. Gene ontology is a project to unify the
representation of gene and gene product attributes (Gene Ontology Consortium 2008).
More specifically, the project aims to develop and maintain the vocabulary of gene and
gene product attributes, annotate genes and gene products, and provide tools for data
access and functional interpretation. GO terms describe properties of genes and their
products and are grouped into three domains: cellular component (where they locate),
molecular function (what they do at the molecular level), and biological process (which
21
molecular events they participate). One of the main uses of the GO terms is to perform
enrichment analysis on gene sets by comparing one target gene set to the background
gene set. If a GO term appears more often in the target gene set than in the background
gene set, it is said to be enriched in this gene set, or this gene set is highly related to this
GO term, which could help to understand the underlying biological relevance.
In this study, biological relevance of selected network clusters were analyzed by
the Gene Ontology enrichment analysis and visualization tool, or GOrilla (Eden et al.
2009) to search for enriched GO terms. In GOrilla search, genes in the selected clusters
were target genes, and all other genes in the network were treated as background genes.
To investigate the possible links of population-specific CNVs to heath disparities, we
first identified significantly enriched GO terms that are associated with CNV_AA or
CNV_CA genes. We then focused on the pathogenic genes associated with the enriched
GO terms, and examined their related disease phenotypes in OMIM database (Online
Mendelian Inheritance in Man 2014). These disease phenotypes were searched for
reported health disparities between African American and Caucasian populations.
22
CHAPTER III
RESULTS AND DISCUSSION
This chapter shows our results in network clustering, gene mapping and cluster
analyses, and discusses the plausible links of population-specific CNVs in the identified
network clusters to health disparities.
3.1 Network Clustering Results
Two gene/protein interaction networks were chosen in this study: HPRDNet and
MultiNet. For HPRDNet, three types of interactions were removed before clustering
process: 1) interaction between same gene; 2) interaction involving unknown genes; 3)
and interactions with ambiguous genes. For MultiNet, all interactions were submitted to
clustering process. Clusters were inferred using MCL with ten different inflation values.
Descriptive statistics of the two networks and their clustering results are summarized in
Table 3.1. When inflation parameter I was increased, the cluster sizes decreased and the
cluster numbers increased. This is because higher I value increases the granularity or
tightness of clusters. The “--force-connected=” option is also proven to have an effect on
clustering results for both networks. When this option was set to “N”, we identified
several uninformative clusters with some genes do not connect to any other genes in the
same cluster.
23
Table 3.1
Summary of biological networks
Network Inflation
Value
Total
Clusters
Cluster Sizes
Maximum Minimum Median Mean
HPRDNet:
9451 genes
36880 interactions
1.1 111 9203 2 2 85 1.2 128 9025 1 2 74 1.3 363 3454 1 3 26 1.4 660 432 1 5 14 1.5 1036 260 1 4 9 1.6 1401 192 1 3 7 1.7 1704 152 1 3 6 1.8 1990 146 1 3 5 1.9 2222 119 1 3 4 2.0 2447 90 1 3 4
MultiNet:
14445 genes
109598 interactions
1.1 21 14399 2 2 450 1.2 22 14394 2 2 430 1.3 43 13614 1 2 220 1.4 226 5560 1 3 42 1.5 600 3524 1 2 16 1.6 923 2402 1 2 10 1.7 1387 1589 1 2 7 1.8 1660 1215 1 2 6 1.9 1887 996 1 2 5 2.0 2116 804 1 2 4
3.2 Gene Mapping Results
Chromosome coordinates of pathogenic SNPs and population-specific CNVs
were searched in the UCSC Genome Database. Totally 2810 pathogenic genes, 194
CNV_AA genes and 258 CNV_CA genes were obtained. Since this study focuses on
network-derived gene clusters, only genes that are listed in HPRDNet or MultiNet were
kept. Details of gene mapping results are shown in Table 3.2.
24
Table 3.2
Results of gene mapping of SNPs and CNV coordinates
Pathogenic Genes CNV_AA CNV_CA
Total 2810 194 258 in HPRDNet 1791 48 62 in MultiNet 2143 64 97
3.3 Cluster Analysis Results
We performed cluster analyses on each of the ten clustering results for both
HPRDNet and MultiNet. Ranking were applied only to clusters with p-value<0.10 and
FDR<0.20 in both enrichment tests for pathogenic genes and those for population-
preferred CNVs genes. The products of FDRs of both enrichment tests were used to rank
the network clusters.
The ranking results are listed in Table 3.3 and Table 3.4. The value of inflation
parameter I used in clustering process is from 1.1 to 2.0, increased by 0.1. Cluster NO is
the serial number of a cluster. CNV_AA and CNV_CA are CNV-related genes. Cluster
Names are names of clusters selected for biological significance analyses. p_CNV and
FDR_CNV, and p_OMIM and FDR_OMIM are p-values and FDR values from
enrichment tests for CNV-related genes and pathogenic genes, respectively. FDR product
is the multiplication result of FDR_CNV and FDR_OMIM and is used for ranking.
We focused on clusters that were consistently ranked at the first place with
different MCL inflation parameter values for further biological significance analyses. It is
worth to mention that clusters containing gene CYP2E1 were filtered out, although in
25
HPRDNet these clusters have been ranked three times at first place in cluster ranking for
CNV_AA gene enrichment tests, and three times at first place for CNV_CA gene
enrichment tests (Table 3.3). This is because gene CYP2E1 is affected by multiple CNVs
and these CNVs have different population preferences. CYP2E1’s location was compared
with those CNVs’ loci. It is found that CYP2E1 is located in the overlapped region
among those CNVs, which means it may not have CNV-caused population-preferred
variations, and may not contribute to health disparities. Therefore, clusters containing
gene CYP2E1 were not included in the following analysis, and the corresponding clusters
ranked at second place were selected instead.
Details of selected clusters are listed in Table 3.5, and their graph representations
are shown in Figure 3.1. In Figure 3.1, each rounded rectangle represents a gene and each
gray line represents a gene-gene interaction. Black rounded rectangles represent non-
pathogenic genes and orange rounded rectangles represent pathogenic genes. Genes
labeled with red or blue ovals are located in African American CNVs or in Caucasian
CNVs. Genes with Green lines share the same GO terms. In each cluster, different line
types represent the enrichment of different GO terms. Line types shown in different
clusters refer to the enrichment of different GO terms. The CNV_AA or CNV_CA genes
in the selected clusters and the details of their corresponding CNVs are listed in Table 3.6.
In Table 3.6, Chr represents chromosomes. CNV Regions are regions of CNVs identified
in more than a single individual. All CNVs listed have a type of Duplication, referring to
one copy increase. CNV Regions and Types are from the CNV map (McElroy et al.
2009). CNV Occurrence describes in CNVs occurrence in African American and
Caucasian populations.
26
Table 3.3
Top-ranked clusters from HPRDNet
I Cluster
No
Cluster
Name CNV_AA CNV_CA OMIM Size p_CNV FDR_CNV p_OMIM FDR_OMIM
FDR
product
1.1 - - - - - - - - - - 1.2 - - - - - - - - - - 1.3 - - - - - - - - - - 1.4 175 AA1 HSPB1 - 8 11 0.0534 0.1548 1.56E-4 3.78E-6 5.85E-07 1.5 190 AA1 HSPB1 - 8 11 0.0534 0.1032 1.56E-4 1.45E-5 1.50E-06 187 CYP2E1 - 5 11 0.0534 0.1032 0.0408 0.1772 1.83E-02 429 CA1 - ATP2A1 4 5 0.0324 0.1094 0.0055 1.80E-5 1.97E-06 271 - SLC4A4 5 8 0.0513 0.1241 0.0082 4.01E-4 4.98E-05 257 - CRX 5 9 0.0575 0.1260 0.0155 0.0155 1.95E-03 1.6 162 AA2 HSPB1 - 8 12 0.0581 0.0964 3.91E-4 4.08E-5 3.93E-06 181 CYP2E1 - 5 11 0.0534 0.0936 0.0408 0.1796 1.68E-02 470 CA1 - ATP2A1 4 5 0.0324 0.0896 0.0055 0.0369 3.31E-03 392 - SLC4A4 4 6 0.0387 0.1007 0.0139 0.1238 1.25E-02 1.7 142 AA2 HSPB1 - 8 12 0.0581 0.0842 3.91E-4 0.0454 3.82E-03 509 CA1 - ATP2A1 4 5 0.0324 0.0690 0.0055 0.1989 1.37E-02 1.8 444 CYP2E1 - 5 6 0.0295 0.0628 0.0012 0.1231 7.73E-03 109 AA3 HSPB1 - 8 13 0.0628 0.0889 8.48E-4 0.1010 8.98E-03 444 - CYP2E1 5 6 0.0387 0.0781 0.0012 0.1231 9.61E-03 516 CA1 - ATP2A1 4 5 0.0324 0.0756 0.0055 0.1910 1.44E-02 1.9 430 CYP2E1 - 5 6 0.0295 0.0246 0.0012 0.1171 2.88E-03 430 - CYP2E1 5 6 0.0387 0.0684 0.0012 0.1171 8.01E-03 2.0 412 CYP2E1 - 5 6 0.0295 0.0241 0.0012 0.1594 3.84E-03 412 CYP2E1 5 6 0.0387 0.0470 0.0012 0.1594 7.49E-03
27
Table 3.4
Top-ranked clusters from MultiNet
I Cluster
No.
Cluster
Name CNV_AA CNV_CA OMIM Size p_CNV FDR_CNV p_OMIM FDR_OMIM
FDR
product
1.1 - - - - - - - - - - 1.2 - - - - - - - - - - 1.3 - - - - - - - - - - 1.4 30 ANXA8, IGHG1 - 12 46 0.0176 0.1588 0.0326 0.1745 2.77E-02
85 CA1 - ATP2A1 4 5 0.0331 0.1501 0.0021 0.0289 4.34E-03 1.5 174 AA4 HSPB1 - 5 5 0.0220 0.0979 7.16E-5 0.0023 2.25E-04
188 CA1 - ATP2A1 4 5 0.0331 0.1237 0.0021 0.0345 4.27E-03 1.6 258 AA4 HSPB1 - 5 5 0.0220 0.0743 7.16E-5 0.0035 2.60E-04
277 CA1 - ATP2A1 4 5 0.0331 0.1708 0.0021 0.0384 6.56E-03 1.7 320 AA4 HSPB1 - 5 5 0.0220 0.0838 7.16E-5 0.0057 4.78E-04
351 CA1 - ATP2A1 4 5 0.0331 0.1814 0.0021 0.0467 8.47E-03 1.8 377 AA4 HSPB1 - 5 5 0.0220 0.0809 7.16E-5 0.0067 5.42E-04
412 CA1 - ATP2A1 4 5 0.0331 0.1347 0.0021 0.0509 6.86E-03 248 - CLN3 4 7 0.0461 0.1459 0.0116 0.1625 2.37E-02
1.9 410 AA4 HSPB1 - 5 5 0.0220 0.0747 7.16E-5 0.0078 5.83E-04 443 CA1 - ATP2A1 4 5 0.0331 0.1159 0.0021 0.0663 7.68E-03 316 - CLN3 4 6 0.0396 0.1190 0.0056 0.1118 1.33E-02
2.0 142 ECHS1 - 8 12 0.0519 0.1196 6.55E-5 0.0096 1.15E-03 479 CA1 - ATP2A1 4 5 0.0331 0.0988 0.0021 0.0728 7.19E-03
28
Table 3.5
Cluster analysis results for HPRDNet and MultiNet
Network Cluster
Name CNV_AA CNV_CA
Pathogenic
gene number
Cluster
Size
HPRDNet AA1 HSPB1 - 8 11 AA2 HSPB1 - 8 12 AA3 HSPB1 - 8 13 CA1 - ATP2A1 4 5 MultiNet AA4 HSPB1 - 5 5 CA1 - ATP2A1 4 5
Figure 3.1 Graph representations of selected clusters for biological significance analysis
29
Table 3.6
Selected genes with potential roles in health disparities and their located CNVs
Gene Chr Gene Coordinates CNV Region CNV Type CNV Occurrence
HSPB1 7 75,931,861-75,933,614 75,867,431-76,481,102 Duplication Only in African American
75,929,740-76,481,102 Duplication Only in African American
75,929,740-76,568,388 Duplication higher in African American than in Caucasian
ATP2A1 16 28,889,726-28,915,830 28,306,730-28,936,772 Duplication Only in Caucasian
30
3.3.1 Clusters enriched with genes located in African American CNVs
We found four similar clusters, (AA1, AA2, and AA3 in HPRDNet and AA4 in
MultiNet), that are enriched both for pathogenic genes and for genes located in African-
American CNVs (Table 3.5). In HPRDNet, cluster AA1, AA2 and AA3 together were
ranked at first place five times; and cluster AA4 were ranked at first place five times in
Multinet (Table 3.4). Cluster AA1 contains 11 genes, within which eight are pathogenic
genes (Figure 3.1A). Cluster AA2 and AA3 contain one and two more genes than cluster
AA1, respectively (Figure 3.2). In MultiNet, cluster AA4 contains five genes and can be
considered as a sub-cluster of cluster AA1, AA2 and AA3 (Figure 3.1B). All four clusters
contain gene HSPB1 (Table 3.5), which is mainly duplicated in African Americans
(Table 3.6). Since cluster AA1, AA2 and AA3 were selected from the same network and
are highly similar to each other, only cluster AA1 and AA4 were studied in biological
significance analyses.
31
Figure 3.2 Graph representations of cluster AA1, AA2 and AA3.
3.3.2 Clusters enriched with genes located in Caucasian CNVs
In both HPRDNet and MultiNet, the same cluster, named as CA1, was identified
to be enriched with both pathogenic genes and genes located in Caucasian CNVs (Table
3.5). Cluster CA1 was ranked at first place four times in HPRDNet (Table 3.3) and seven
times in MultiNet (Table 3.4). This cluster contains five genes, and four of them are
associated with diseases (Figure 3.1C). Cluster CA1 contains gene ATP2A1 which is
duplicated only in Caucasians (Table 3.6).
32
3.4 Plausible Links of population-specific CNVs in the identified Network Clusters
to Health Disparities
To investigate the possible links of population-specific CNVs to heath disparities,
we first identified significantly enriched GO terms that are associated with CNV_AA or
CNV_CA genes in the identified network clusters. We then focus on the pathogenic
genes with the enriched GO terms, and examined their associated disease phenotypes in
OMIM database. The results of GO term enrichment analysis are listed in Table 3.7, and
the associated diseases are listed in Table 3.8.
Table 3.7
Enriched GO terms with CNV-genes in the identified network clusters
Clusters Involved Genes GO Domain GO ID GO term
AA1 HSPB1, CRYAA,
CRYAB,
CRYBB2,
CRYBA1,
CRYBA2
Molecular Function
GO:0042802 Identical protein binding
AA4 HSPB1, CRYAA,
CRYAB
Biological Process
GO:0043086 negative regulation of catalytic activity
GO:0043066 negative regulation of apoptotic process
GO:0043069 negative regulation of programmed cell death
HSPB1, CRYAA,
CRYAB, CRYBB2
Molecular Function
GO:0042802 Identical protein binding
HSPB1, CRYAB Cellular Component
GO:0030018 Z disc
CA1 ATP2A1,
ATP2A2, PLN
Biological Process
GO:0048878 chemical homeostasis
ATP2A1, PLN Biological Process
GO:0006937 regulation of muscle contraction
GO:0008016 regulation of heart contraction
33
Table 3.8
Associated diseases of genes with enriched GO terms.
Cluster Gene Associated Disease
AA1 HSPB1 Axonal Charcot-Marie-Tooth disease type 2F and Distal hereditary motor neuronopathy type 2B AA4 CRYAA Multiple types of cataract 9 CRYAB Multiple types of cataract 16 Dilated cardiomyopathy-1II Myofibrillar myopathy-2 CRYAB-related fatal infantile hypertonic myofibrillar myopathy CRYBB2 Multiple types of Cataract 3
CA1 ATP2A1 Brody myopathy ATP2A2 Acrokeratosis verruciformis Darier disease PLN Dilated cardiomyopathy-1P Familial hypertrophic cardiomyopathy-18
3.4.1 Duplication of HSPB1 and Health Disparities in African Americans
Gene HSPB1 is located in genomic duplication regions occurring more frequently
in African Americans (Table 3.6), and is found in the cluster family of AA1, AA2, AA3,
and AA4 (Table 3.5). For cluster AA1, only one GO molecular function term related to
gene HSPB1 is significantly enriched (Cluster AA1 in Table 3.7). For cluster AA4, in
addition to the same enriched GO molecular functions term, three more GO biological
process terms and one GO cellular component term are found significantly enriched
(Cluster AA4 in Table 3.7). In the genes with the enriched GO terms, four of them are
known to be associated with diseases (Cluster AA1/AA4 in Table 3.8). Among these four
genes, three of them are implicated in health disparities of African Americans.
Specifically, gene CRYAB is related to dilated cardiomyopathy and myofibrillar
34
myopathy. African Americans were found at higher risk for idiopathic dilated
cardiomyopathy compared with Caucasian, and this could not be explained by income,
education, alcohol use, smoking, or history of some other diseases (Coughlin et al. 1993).
Moreover, gene CRYAA, CRYAB and CRYBB2 are all related to various types of cataract.
It was reported that age-specific blindness prevalence was higher for African Americans
compared with Caucasian, and cataract accounts for 36.8% of all blindness in African
American, but for only 8.7% in Caucasian (Congdon et al. 2004).
We would like to emphasize that gene HSPB1 is located in three CNVs that occur
frequently in African Americans. Our results predict that HSPB1 is a potential causal
gene in these CNVs for health disparities. How could HSPB1 duplication contribute to
health disparity? Based on the direct interaction between HSPB1 and CRYAB and the fact
that both genes are expressed in Z-disc (Table 3.7), it is plausible that HSPB1 may play
an unknown role in cardiomyopathy. Alternatively, HSPB1 might be involved in cataract,
because HSPB1, CRYAA and CRYAB interact with each other and all can negatively
regulate apoptotic process (Table 3.7). Studies suggested that lens epithelial cell
apoptosis may be a common cellular basis for initiation of non-congenital cataract
formation (Li et al. 1995), and inhibition of epithelial cell apoptosis may be one possible
mechanism that inhibits cataract development (Nahomi et al. 2013). Our results here
argue for further experimental studies to test the possible role of HSPB1 CNVs in
cardiomyopathy or cataract/blindness in African Americans.
35
3.4.2 Duplication of ATP2A1 and Health Disparities in Caucasians
Gene ATP2A1 in cluster CA1 is located in a genomic duplication region that
occurs only in Caucasians (Table 3.6). We found that three genes in cluster CA1 are
enriched with various GO biological process terms that involve ATP2A1 (Cluster CA1 in
Table 3.7). All of the three genes are related to diseases when they mutated (Cluster CA1
in Table 3.8). Our results here predict that APT2A1 is a candidate causal gene for health
disparity in this region.
How would ATP2A1 influence health disparity? Among the diseases related to the
pathogenic genes in cluster CA1, idiopathic dilated cardiomyopathy occurs less often in
Caucasians than in African Americans (Coughlin et al. 1993). One possibility is that
higher copies of ATP2A1 may offer some benefits to Caucasians. Studies have shown that
increased activity of sarco/endoplasmic reticulum Ca2+-ATPase 1 (SERCA1), which is
encoded by ATP2A1, can partially rescue the heart from •OH-induced injury
(Hiranandani et al. 2006), and protect the heart from ischemia-reperfusion (I/R) injury
(Talukder et al. 2007). Another possibility is that higher copies of ATP2A1 only lead to
moderate risk of cardiomyopathy in Caucasians, and this moderate effect is
overshadowed by other genetics factors not covered by our CNV dataset.
36
CHAPTER IV
CONCLUSION
In this study, we proposed a network clustering based approach to associate
CNVs to health disparities in African and American. Gene clusters were inferred from
two human gene/protein networks, HPRDNet and MultiNet, by MCL clustering
algorithm with different parameters. Each cluster was ranked based on products of FDR
values obtained from the right-tailed Fisher’s exact tests for enrichment of pathogenic
genes or CNV-genes. Five clusters were consistently found to be ranked at first place and
be enriched with both pathogenic genes and genes located in African-American or
Caucasian CNVs. In cluster AA1, AA2, AA3 and AA4, gene HSPB1 is located in
African-American-specific CNVs that are duplicated more frequently in African-
Americans. In clusters CA1, gene ATP2A1 is located in a Caucasian-specific CNV that is
duplicated only in Caucasians. All gene clusters are associated with certain diseases that
occur more often in one population than in the other, which could be considered as results
of the occurrence of corresponding CNVs. Although we only studied population-
differential CNVs and did not consider the roles of other genetic factors, our
computational studies have generated some interesting hypotheses for further
experimental studies to understand health disparities in these diseases.
37
REFERENCES
Altshuler D, Daly M, and Kruglyak L. 2000. Guilt by association. Nat Genet 26:135-137.
American Public Health Association. Health Disparities: The Basics. Available at
http://www.apha.org/NR/rdonlyres/54C4CC4D-E86D-479A-BABB-
5D42B3FDC8BD/0/HlthDisparty_Primer_FINAL.pdf2014).
Bakir-Gungor B, and Sezerman OU. 2011. A new methodology to associate SNPs with human
diseases according to their pathway related context. PLoS One 6:e26277.
Barabasi AL, Gulbahce N, and Loscalzo J. 2011. Network medicine: a network-based approach to
human disease. Nat Rev Genet 12:56-68.
Benjamini Y, and Hochberg Y. 1995. Controlling the False Discovery Rate: A Practical and
Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B
(Methodological) 57:289-300.
Bensen JT, Xu Z, McKeigue PM, Smith GJ, Fontham ET, Mohler JL, and Taylor JA. 2014. Admixture
mapping of prostate cancer in African Americans participating in the North Carolina-
Louisiana Prostate Cancer Project (PCaP). Prostate 74:1-9.
Bensen JT, Xu Z, Smith GJ, Mohler JL, Fontham ET, and Taylor JA. 2013. Genetic polymorphism
and prostate cancer aggressiveness: a case-only study of 1,536 GWAS and candidate
SNPs in African-Americans and European-Americans. Prostate 73:11-22.
Brohee S, and van Helden J. 2006. Evaluation of clustering algorithms for protein-protein
interaction networks. BMC Bioinformatics 7:488.
Cantor RM, Lange K, and Sinsheimer JS. 2010. Prioritizing GWAS results: A review of statistical
methods and recommendations for their application. Am J Hum Genet 86:6-22.
Clarke GM, Anderson CA, Pettersson FH, Cardon LR, Morris AP, and Zondervan KT. 2011. Basic
statistical analysis in genetic case-control studies. Nat Protoc 6:121-133.
Congdon N, O'Colmain B, Klaver CC, Klein R, Munoz B, Friedman DS, Kempen J, Taylor HR, and
Mitchell P. 2004. Causes and prevalence of visual impairment among adults in the
United States. Arch Ophthalmol 122:477-485.
38
Coughlin SS, Labenberg JR, and Tefft MC. 1993. Black-white differences in idiopathic dilated
cardiomyopathy: the Washington DC dilated Cardiomyopathy Study. Epidemiology
4:165-172.
Dudbridge F, and Gusnanto A. 2008. Estimation of significance thresholds for genomewide
association scans. Genet Epidemiol 32:227-234.
Eden E, Navon R, Steinfeld I, Lipson D, and Yakhini Z. 2009. GOrilla: a tool for discovery and
visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics 10:48.
Feuk L, Carson AR, and Scherer SW. 2006. Structural variation in the human genome. Nat Rev
Genet 7:85-97.
Fine MJ, Ibrahim SA, and Thomas SB. 2005. The role of race and genetics in health disparities
research. Am J Public Health 95:2125-2128.
Freeman JL, Perry GH, Feuk L, Redon R, McCarroll SA, Altshuler DM, Aburatani H, Jones KW,
Tyler-Smith C, Hurles ME, Carter NP, Scherer SW, and Lee C. 2006. Copy number
variation: new insights in genome diversity. Genome Res 16:949-961.
Gene Ontology Consortium. 2008. The Gene Ontology project in 2008. Nucleic Acids Res
36:D440-444.
Gilman SR, Chang J, Xu B, Bawa TS, Gogos JA, Karayiorgou M, and Vitkup D. 2012. Diverse types
of genetic variation converge on functional gene networks involved in schizophrenia.
Nat Neurosci 15:1723-1728.
Halldorsson BV, and Sharan R. 2013. Network-based interpretation of genomic variation data. J
Mol Biol 425:3964-3969.
Hiranandani N, Bupha-Intr T, and Janssen PM. 2006. SERCA overexpression reduces hydroxyl
radical injury in murine myocardium. Am J Physiol Heart Circ Physiol 291:H3130-3135.
Hirschhorn JN, and Daly MJ. 2005. Genome-wide association studies for common diseases and
complex traits. Nat Rev Genet 6:95-108.
Hosak L, Silhan P, and Hosakova J. 2012. Genomic copy number variations: A breakthrough in
our knowledge on schizophrenia etiology? Neuro Endocrinol Lett 33:183-190.
Huang T, Wang P, Ye ZQ, Xu H, He Z, Feng KY, Hu L, Cui W, Wang K, Dong X, Xie L, Kong X, Cai YD,
and Li Y. 2010. Prediction of deleterious non-synonymous SNPs based on protein
interaction network and hybrid properties. PLoS One 5:e11900.
Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, and Lee C. 2004.
Detection of large-scale variation in the human genome. Nat Genet 36:949-951.
39
Ionita-Laza I, Rogers AJ, Lange C, Raby BA, and Lee C. 2009. Genetic association analysis of copy-
number variation (CNV) in human disease pathogenesis. Genomics 93:22-26.
Jensen MK, Pers TH, Dworzynski P, Girman CJ, Brunak S, and Rimm EB. 2011. Protein interaction-
based genome-wide analysis of incident coronary heart disease. Circ Cardiovasc Genet
4:549-556.
Karolchik D, Barber GP, Casper J, Clawson H, Cline MS, Diekhans M, Dreszer TR, Fujita PA,
Guruvadoo L, Haeussler M, Harte RA, Heitner S, Hinrichs AS, Learned K, Lee BT, Li CH,
Raney BJ, Rhead B, Rosenbloom KR, Sloan CA, Speir ML, Zweig AS, Haussler D, Kuhn RM,
and Kent WJ. 2014. The UCSC Genome Browser database: 2014 update. Nucleic Acids
Res 42:D764-770.
Khurana E, Fu Y, Chen J, and Gerstein M. 2013. Interpretation of genomic variants using a unified
biological network approach. PLoS Comput Biol 9:e1002886.
LaVeist TA, Gaskin D, and Richard P. 2011. Estimating the economic burden of racial health
inequalities in the United States. Int J Health Serv 41:231-238.
Lee I, Blom UM, Wang PI, Shim JE, and Marcotte EM. 2011. Prioritizing candidate disease genes
by network-based boosting of genome-wide association data. Genome Res 21:1109-
1121.
Lee KW, Woon PS, Teo YY, and Sim K. 2012. Genome wide association studies (GWAS) and copy
number variation (CNV) studies of the major psychoses: what have we learnt? Neurosci
Biobehav Rev 36:556-571.
Lehne B, Lewis CM, and Schlitt T. 2011. From SNPs to genes: disease association at the gene
level. PLoS One 6:e20133.
Li WC, Kuszak JR, Dunn K, Wang RR, Ma W, Wang GM, Spector A, Leib M, Cotliar AM, Weiss M,
and et al. 1995. Lens epithelial cell apoptosis appears to be a common cellular basis for
non-congenital cataract development in humans and animals. J Cell Biol 130:169-181.
Lin C, Cho Y-R, Hwang W-C, Pei P, and Zhang A. 2007. Clustering Methods in a Protein–Protein
Interaction Network. Knowledge Discovery in Bioinformatics: John Wiley & Sons, Inc.,
319-355.
Long J, Zhang B, Signorello LB, Cai Q, Deming-Halverson S, Shrubsole MJ, Sanderson M, Dennis J,
Michailiou K, Easton DF, Shu XO, Blot WJ, and Zheng W. 2013. Evaluating genome-wide
association study-identified breast cancer risk variants in African-American women. PLoS
One 8:e58350.
MacDonald JR, Ziman R, Yuen RK, Feuk L, and Scherer SW. 2014. The Database of Genomic
Variants: a curated collection of structural variation in the human genome. Nucleic Acids
Res 42:D986-992.
40
Mathivanan S, Periaswamy B, Gandhi TK, Kandasamy K, Suresh S, Mohmood R, Ramachandra YL,
and Pandey A. 2006. An evaluation of human protein-protein interaction data in the
public domain. BMC Bioinformatics 7 Suppl 5:S19.
McElroy JP, Nelson MR, Caillier SJ, and Oksenberg JR. 2009. Copy number variation in African
Americans. BMC Genet 10:15.
Miller RG. 1981. Simultaneous statistical inference. New York: Springer-Verlag.
Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, Bala P, Shivakumar K, Anuradha N,
Reddy R, Raghavan TM, Menon S, Hanumanthu G, Gupta M, Upendran S, Gupta S,
Mahesh M, Jacob B, Mathew P, Chatterjee P, Arun KS, Sharma S, Chandrika KN,
Deshpande N, Palvankar K, Raghavnath R, Krishnakanth R, Karathia H, Rekha B, Nayak R,
Vishnupriya G, Kumar HG, Nagini M, Kumar GS, Jose R, Deepthi P, Mohan SS, Gandhi TK,
Harsha HC, Deshpande KS, Sarker M, Prasad TS, and Pandey A. 2006. Human protein
reference database--2006 update. Nucleic Acids Res 34:D411-414.
Moore JH, Asselbergs FW, and Williams SM. 2010. Bioinformatics challenges for genome-wide
association studies. Bioinformatics 26:445-455.
Moore JH, and Williams SM. 2009. Epistasis and its implications for personal genetics. Am J Hum
Genet 85:309-320.
Nahomi RB, Wang B, Raghavan CT, Voss O, Doseff AI, Santhoshkumar P, and Nagaraj RH. 2013.
Chaperone peptides of alpha-crystallin inhibit epithelial cell apoptosis, protein
insolubilization, and opacification in experimental cataracts. J Biol Chem 288:13022-
13035.
National Center for Health Statistics. 2007. Health, United States, 2007. Hyattsville, MD.
National Center for Health Statistics. 2013. Health, United States, 2012: With Special Feature on
Emergency Care. Hyattsville, MD.
Ng MC, Shriner D, Chen BH, Li J, Chen WM, Guo X, Liu J, Bielinski SJ, Yanek LR, Nalls MA, Comeau
ME, Rasmussen-Torvik LJ, Jensen RA, Evans DS, Sun YV, An P, Patel SR, Lu Y, Long J,
Armstrong LL, Wagenknecht L, Yang L, Snively BM, Palmer ND, Mudgal P, Langefeld CD,
Keene KL, Freedman BI, Mychaleckyj JC, Nayak U, Raffel LJ, Goodarzi MO, Chen YD,
Taylor HA, Jr., Correa A, Sims M, Couper D, Pankow JS, Boerwinkle E, Adeyemo A,
Doumatey A, Chen G, Mathias RA, Vaidya D, Singleton AB, Zonderman AB, Igo RP, Jr.,
Sedor JR, Kabagambe EK, Siscovick DS, McKnight B, Rice K, Liu Y, Hsueh WC, Zhao W,
Bielak LF, Kraja A, Province MA, Bottinger EP, Gottesman O, Cai Q, Zheng W, Blot WJ,
Lowe WL, Pacheco JA, Crawford DC, Grundberg E, Rich SS, Hayes MG, Shu XO, Loos RJ,
Borecki IB, Peyser PA, Cummings SR, Psaty BM, Fornage M, Iyengar SK, Evans MK,
Becker DM, Kao WH, Wilson JG, Rotter JI, Sale MM, Liu S, Rotimi CN, and Bowden DW.
2014. Meta-analysis of genome-wide association studies in African Americans provides
insights into the genetic architecture of type 2 diabetes. PLoS Genet 10:e1004517.
41
Oliver S. 2000. Guilt-by-association goes global. Nature 403:601-603.
Olsson LM, and Holmdahl R. 2012. Copy number variation in autoimmunity--importance hidden
in complexity? Eur J Immunol 42:1969-1976.
Online Mendelian Inheritance in Man O. 2014. Baltimore, MD: McKusick-Nathans Institute of
Genetic Medicine, Johns Hopkins University.
Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V,
Muthusamy B, Gandhi TK, Gronborg M, Ibarrola N, Deshpande N, Shanker K,
Shivashankar HN, Rashmi BP, Ramya MA, Zhao Z, Chandrika KN, Padma N, Harsha HC,
Yatish AJ, Kavitha MP, Menezes M, Choudhury DR, Suresh S, Ghosh N, Saravana R,
Chandran S, Krishna S, Joy M, Anand SK, Madavan V, Joseph A, Wong GW, Schiemann
WP, Constantinescu SN, Huang L, Khosravi-Far R, Steen H, Tewari M, Ghaffari S, Blobe
GC, Dang CV, Garcia JG, Pevsner J, Jensen ON, Roepstorff P, Deshpande KS, Chinnaiyan
AM, Hamosh A, Chakravarti A, and Pandey A. 2003. Development of human protein
reference database as an initial platform for approaching systems biology in humans.
Genome Res 13:2363-2371.
Peri S, Navarro JD, Kristiansen TZ, Amanchy R, Surendranath V, Muthusamy B, Gandhi TK,
Chandrika KN, Deshpande N, Suresh S, Rashmi BP, Shanker K, Padma N, Niranjan V,
Harsha HC, Talreja N, Vrushabendra BM, Ramya MA, Yatish AJ, Joy M, Shivashankar HN,
Kavitha MP, Menezes M, Choudhury DR, Ghosh N, Saravana R, Chandran S, Mohan S,
Jonnalagadda CK, Prasad CK, Kumar-Sinha C, Deshpande KS, and Pandey A. 2004. Human
protein reference database as a discovery resource for proteomics. Nucleic Acids Res
32:D497-501.
Pizzuti C, Rombo S, and Marchiori E. 2012. Complex Detection in Protein-Protein Interaction
Networks: A Compact Overview for Researchers and Practitioners. In: Giacobini M,
Vanneschi L, and Bush W, eds. Evolutionary Computation, Machine Learning and Data
Mining in Bioinformatics: Springer Berlin Heidelberg, 211-223.
Pounds S, and Cheng C. 2006. Robust estimation of the false discovery rate. Bioinformatics
22:1979-1987.
Prasad TSK, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R,
Shafreen B, Venugopal A, Balakrishnan L, Marimuthu A, Banerjee S, Somanathan DS,
Sebastian A, Rani S, Ray S, Harrys Kishore CJ, Kanth S, Ahmed M, Kashyap MK, Mohmood
R, Ramachandra YL, Krishna V, Rahiman BA, Mohan S, Ranganathan P, Ramabadran S,
Chaerkady R, and Pandey A. 2009. Human Protein Reference Database--2009 update.
Nucleic Acids Res 37:D767-772.
R Development Core Team. 2013. R: A language and environment for statistical computing.
Vienna, Austria: R Foundation for Statistical Computing.
42
Ramos E, and Rotimi C. 2009. The A's, G's, C's, and T's of health disparities. BMC Med Genomics
2:29.
Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M,
Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N,
Zetterberg A, and Wigler M. 2004. Large-scale copy number polymorphism in the human
genome. Science 305:525-528.
Sharan R, Ulitsky I, and Shamir R. 2007. Network-based prediction of protein function. Mol Syst
Biol 3:88.
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, and Sirotkin K. 2001. dbSNP:
the NCBI database of genetic variation. Nucleic Acids Res 29:4.
Storey JD, Taylor JE, and Siegmund D. 2004. Strong control, conservative point estimation, and
simultaneous conservative consistency of false discovery rates. Journal of the Royal
Statistical Society, Series B 66:187-205.
Stringer CB, and Andrews P. 1988. Genetic and fossil evidence for the origin of modern humans.
Science 239:1263-1268.
Talukder MA, Kalyanasundaram A, Zhao X, Zuo L, Bhupathy P, Babu GJ, Cardounel AJ, Periasamy
M, and Zweier JL. 2007. Expression of SERCA isoform with faster Ca2+ transport
properties improves postischemic cardiac function and Ca2+ handling and decreases
myocardial infarction. Am J Physiol Heart Circ Physiol 293:H2418-2428.
Tishkoff SA, and Verrelli BC. 2003. Patterns of human genetic diversity: implications for human
evolutionary history and disease. Annu Rev Genomics Hum Genet 4:293-340.
van Dongen S. 2000. Graph Clustering by Flow Simulation PhD. University of Utrecht.
Vidal M, Cusick ME, and Barabasi AL. 2011. Interactome networks and human disease. Cell
144:986-998.
von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, and Bork P. 2002. Comparative
assessment of large-scale data sets of protein-protein interactions. Nature 417:399-403.
Wang J, Li M, Deng Y, and Pan Y. 2010. Recent advances in clustering methods for protein
interaction networks. BMC Genomics 11 Suppl 3:S10.
Wang K, Li M, and Bucan M. 2007. Pathway-based approaches for analysis of genomewide
association studies. Am J Hum Genet 81:1278-1283.
Wang P, Carrion P, Qiao Y, Tyson C, Hrynchak M, Calli K, Lopez-Rangel E, Andrieux J, Delobel B,
Duban-Bedu B, Thuresson AC, Anneren G, Liu X, Rajcan-Separovic E, and Suzanne Lewis
43
ME. 2013. Genotype-phenotype analysis of 18q12.1-q12.2 copy number variation in
autism. Eur J Med Genet 56:420-425.
Wang WY, Barratt BJ, Clayton DG, and Todd JA. 2005. Genome-wide association studies:
theoretical and practical concerns. Nat Rev Genet 6:109-118.
Wang X, Gulbahce N, and Yu H. 2011. Network-based methods for human disease gene
prediction. Brief Funct Genomics 10:280-293.
Wei P, Milbauer LC, Enenstein J, Nguyen J, Pan W, and Hebbel RP. 2011. Differential endothelial
cell gene expression by African Americans versus Caucasian Americans: a possible
contribution to health disparity in vascular disease and cancer. BMC Med 9:2.
Williams SM, Canter JA, Crawford DC, Moore JH, Ritchie MD, and Haines JL. 2007. Problems with
genome-wide association studies. Science 316:1840-1842.
Xu Z, Bensen JT, Smith GJ, Mohler JL, and Taylor JA. 2011. GWAS SNP Replication among African
American and European American men in the North Carolina-Louisiana prostate cancer
project (PCaP). Prostate 71:881-891.
Zhao M, and Zhao Z. 2013. CNVannotator: a comprehensive annotation server for copy number
variation in the human genome. PLoS One 8:e80170.
45
// Setup connection to UCSC Genome Database
String username = "genome";
String password = "";
String url = "jdbc:mysql://genome-mysql.cse.ucsc.edu/hg19";
Class.forName("com.mysql.jdbc.Driver");
Connection con = DriverManager.getConnection(url, username, password);
Statement stmt = con.createStatement();
ResultSet rs = null;
// Search for gene names by SNP id
// Compose and submit SQL statements to UCSC Genome Database
rs = stmt.executeQuery(
"select distinct S.name, X.geneSymbol " +
"from snp138 as S, knownGene as G, kgXref as X " +
"where S.name = '" + "rs" + new_snp + "' AND X.kgId = G.name AND
S.chrom = G.chrom AND " + "G.txStart <= S.chromStart AND
G.txEnd >= S.chromEnd;");
// extract results and write to output file
if (rs.next()){
do {
for (int i=1; i<3; i++) { output.print(rs.getString(i)+"\t"); }
output.println();
}while (rs.next());
}
else { output.println("rs" + new_snp + "\t" + "gene_not_found"); }
// Search for gene names by CNV loci
// Compose and submit SQL statements to UCSC Genome Database
rs = stmt.executeQuery(
"select distinct X.geneSymbol, G.chrom, G.txStart, G.txEnd " +
"from knownGene as G, kgXref as X " +
"where ((G.txStart>" + begin + " AND G.txEND<" + end + ") or " +
"(G.txStart>" + begin + " AND G.txStart<" + end + ") or " +
"(G.txEND>" + begin + " AND G.txEND<" + end + ") or " +
"(G.txStart<" + begin + " AND G.txEnd>" + end + ")) AND " +
"X.kgId = G.name AND G.chrom = '" + chrom + "';");
// extract results and write to output file
if (rs.next()) {
do {
for (int l=1; l<5; l++) { output.print(rs.getString(l)+"\t"); }
output.println();
}while (rs.next());
}
else { output.println("not found"); }
47
//setup R engine
Rengine re = new Rengine (new String[] {}, false, null);
// Check if the session is working.
if (!re.waitForR()) { return; }
//Fisher's exact test for OMIM genes
for (int k=1; k<=nCluster; k++)
{
if(AA_GeneClusters[k].get_q() != 0)
{
// obtain values and create the contingency table
double cluster_q = AA_GeneClusters[k].get_q();
double cluster_m = AA_GeneClusters[k].get_m();
double cluster_nonOMIM = cluster_m - cluster_q;
double other_OMIM = Q - cluster_q;
double other_nonOMIM = N - Q - cluster_nonOMIM;
re.eval("cluster = c("+cluster_q+","+cluster_nonOMIM+")");
re.eval("other = c("+other_OMIM+","+other_nonOMIM+")");
re.eval("tb = rbind(cluster, other)");
// perform the Fisher’s exact test and obtain the p-value
double p_OMIM = re.eval("fisher.test(tb,
alternative=\"greater\")$p.value").asDouble();
}
}
//Fisher’s exact test for AA and CA genes
for (int k=1; k<=nCluster; k++)
{
double cluster_AA_s = AA_GeneClusters[k].get_s();
double cluster_CA_s = CA_GeneClusters[k].get_s();
double cluster_m = AA_GeneClusters[k].get_m();
double other_AA = S_AA - cluster_AA_s;
double other_CA = S_CA - cluster_CA_s;
// Fisher's exact test for AA only
if(AA_GeneClusters[k].get_s() != 0)
{
// obtain values and create the contingency table
double cluster_nonAA = cluster_m - cluster_AA_s;
double other_nonAA = N-cluster_m-S_AA+cluster_AA_s;
re.eval("cluster = c("+cluster_AA_s+","+cluster_nonAA+")");
re.eval("other = c("+other_AA+","+other_nonAA+")");
re.eval("tb = rbind(cluster, other)");
// perform the Fisher’s exact test and obtain the p-value
double p_AA = re.eval("fisher.test(tb,
alternative=\"greater\")$p.value").asDouble();
}
// Fisher's exact test for CA only
if(CA_GeneClusters[k].get_s() != 0)
{
48
// obtain values and create the contingency table
double cluster_nonCA = cluster_m - cluster_CA_s;
double other_nonCA = N-cluster_m-S_CA+cluster_CA_s;
re.eval("cluster = c("+cluster_CA_s+","+cluster_nonCA+")");
re.eval("other = c("+other_CA+","+other_nonCA+")");
re.eval("tb = rbind(cluster, other)");
// perform the Fisher’s exact test and obtain the p-value
double p_CA = re.eval("fisher.test(tb,
alternative=\"greater\")$p.value").asDouble();
}
}
50
// Calculate FDR values
//fp is the File object of input file
Scanner pScan = new Scanner(fp);
//RstatementA contains a R statement to create a tuple of all p-values.
String RstatementA = "p=c(";
while (pScan.hasNextLine())
{
String line = pScan.nextLine();
Scanner lineReader = new Scanner(line);
lineReader.next();
int tmp = (int) lineReader.nextDouble();
if (tmp != 0)
{
lineReader.next();
double p = lineReader.nextDouble();
RstatementA = RstatementA + p + ", ";
log.println("RstatementA"+RstatementA);
}
}
RstatementA = RstatementA.substring(0, RstatementA.length()-2) + ")";
// create a tuple named “p”
re.eval(RstatementA);
// load robust-fdr routine
re.eval("source(\"D:/eclipse/workspace/test/robust-fdr.R\")");
// RstatementD contains a R statement to run robust-fdr routine
String RstatementD = "robust.fdr(p, discrete=T, use8=F)$fdr";
// Obtain fdr values and save to an array
double[] fdr_disc = re.eval(RstatementD).asDoubleArray();
51
VITA
Yi Jiang was born in Suzhou, Jiangsu, China. He attended Nanjing University and
obtained the Bachelors of Science degree in Applied Chemistry in May 2001. Yi
completed the Master of Science degree in Pharmaceutical Sciences at University of
Florida before he entered the Computer Sciences Program at University of Tennessee at
Chattanooga. Yi worked as teaching and research assistants in Department of Computer
Science, and graduated with a Master of Science degree with thesis option in Computer
Science in December 2014. Yi’s research interests include data analysis, database
management and computer programming.