abstract - nc state repository


ABSTRACT

PADMANABHAN, KANCHANA. Hierarchical Modularity-driven Discovery and Characterization of Network Biomarkers: Application to Bioenergy Production and Prognostics of Neurological Disorders. (Under the direction of Dr. Nagiza F. Samatova.)

Biomarkers, short for biological markers, are ubiquitous and help distinguish organisms that exhibit certain characteristics. Characteristics could refer to bioenergy-related processes such as bioethanol production or biohydrogen production by microorganisms, or could refer to neurological disorders such as Alzheimer’s or schizophrenia. These characteristics are termed phenotypes in biological literature.

Biomarker discovery and characterization have gained a lot of attention in the recent past and have many uses, including improving the efficiency of bioengineering processes and increasing the skill of disease prognostics. Traditionally, these biomarkers are defined as individual genes or sets of genes correlated with the presence of a phenotype. However, genes do not function individually but cooperate with other genes to carry out various functions in an organism. These inter-gene relationships are captured using a network structure, where the nodes are genes and the edges represent functional associations between the genes (e.g., genetic interaction networks). Recently, the focus has been shifting towards utilizing these biological networks for unearthing biomarkers. Such biomarkers are termed network biomarkers.

In this thesis, we argue that the performance of biomarker identification could be improved by explicit consideration of hierarchical modularity, the inherent design and organization principle of cellular systems. According to this principle, individual genes combine in a hierarchical manner into larger, functionally less cohesive subsystems. The hierarchy can be broadly divided into three major levels: (1) the gene level, (2) the subsystem level, and (3) the subsystem crosstalk level.
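The three levels can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions, not the method developed in the thesis: the gene names, the edges, the subsystem memberships, and the rule that a crosstalk is any gene-level edge linking two subsystems are all hypothetical.

```python
from itertools import combinations

# Level 1 (genes): hypothetical gene-gene functional associations.
edges = {("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g4", "g5"), ("g5", "g6")}

# Level 2 (subsystems): hypothetical groupings of genes.
subsystems = {
    "S1": {"g1", "g2", "g3"},
    "S2": {"g4", "g5", "g6"},
}


def crosstalk_edges(edges, members_a, members_b):
    """Return the gene-level edges with one endpoint in each subsystem."""
    return {
        (u, v)
        for u, v in edges
        if (u in members_a and v in members_b) or (u in members_b and v in members_a)
    }


def crosstalk_map(edges, subsystems):
    """Level 3 (crosstalk): for each subsystem pair, collect the linking edges.

    Assumption for illustration: two subsystems "crosstalk" when at least
    one gene-level edge connects them.
    """
    result = {}
    for a, b in combinations(sorted(subsystems), 2):
        linking = crosstalk_edges(edges, subsystems[a], subsystems[b])
        if linking:
            result[(a, b)] = linking
    return result


print(crosstalk_map(edges, subsystems))
```

In this toy network the only edge crossing between the two subsystems is the one between g3 and g4, so the map contains a single crosstalk entry for the pair (S1, S2).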

The subsystem crosstalk level captures the interactions between the subsystems. The area of identifying phenotype-related network biomarker crosstalks is in its infancy. To the best of our knowledge, this thesis presents the first systematic approach to identify crosstalk biomarkers by utilizing the power of multiple existing clues. The method has been empirically validated using Alzheimer’s disease as a use-case. The method not only verified previously known Alzheimer’s-related crosstalks but also suggested hypotheses for further investigation. In addition, the identified crosstalks improve the accuracy of Alzheimer’s prognosis by 15.9% in comparison to the current state-of-the-art methodologies.

At the subsystem level, our in silico identification of phenotype-related network biomarker subsystems offers an efficient solution to the contrast-based comparative analysis of multiple biological networks. The method is robust to noise in, and incompleteness of, biological data and is also parameter-free. It has been validated against subsystem biomarkers reported in literature for phenotypes such as biohydrogen production, motility, respiration, and Gram stain.

At the gene level, the information about the functional roles of individual genes is utilized to assess the significance of the network biomarkers in a biological context. Ignoring hierarchical modularity leads to high false negative rates; that is, biologically significant results are termed insignificant. Explicitly incorporating the hierarchical modularity principle into the evaluation of the biological significance of network biomarkers improved the accuracy by 15% on average when compared to state-of-the-art methods.

Hierarchical Modularity-driven Discovery and Characterization of Network Biomarkers: Application to Bioenergy Production and Prognostics of Neurological Disorders

by Kanchana Padmanabhan

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Doctor of Philosophy

Computer Science

Raleigh, North Carolina

2013

APPROVED BY:

Dr. Steffen Heber

Dr. Donald L. Bitzer

Dr. Kemafor Ogan

Dr. Nagiza F. Samatova (Chair of Advisory Committee)

DEDICATION

To my parents for their unconditional love and to my husband for his unflinching support

BIOGRAPHY

Kanchana Padmanabhan received her Bachelor of Technology in Information Technology from Madras Institute of Technology, Anna University, India in 2007. She enrolled at North Carolina State University in Fall 2009 in pursuit of a Master’s degree in Computer Science. Towards the end of Spring 2010, she transferred from the Master’s to the Ph.D. program. Kanchana passed her Written Qualifier Examination in Spring 2011. She obtained her “en-route” Master’s degree in Spring 2012. She then spent Summer 2012 at the Lawrence Berkeley National Laboratory. Kanchana defended her Oral Preliminary Examination in Fall 2012. During the course of her graduate studies, she was inducted into The Honor Society of Phi Kappa Phi and Golden Key International Honor Society. She was also given the opportunity to co-edit a book titled “Practical Graph Mining with R,” which was published by CRC Press and released in July 2013.

Publications

Book

Editors. N. F. Samatova, W. Hendrix, J. Jenkins, K. Padmanabhan, and A. Chakraborty. Practical Graph Mining with R: Students Series. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, CRC Press, 2013

Book Chapters

1. K. Padmanabhan, W. Hendrix, and N. F. Samatova, Introduction, Practical Graph Mining with R. Editors. N. F. Samatova, W. Hendrix, J. Jenkins, K. Padmanabhan, and A. Chakraborty. Practical Graph Mining with R: Students Series. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, CRC Press, 2013

2. K. Padmanabhan, B. Harrison, K. Wilson, M. L. Warren, K. Bright, J. Mosiman, J. Kancherla, H. Phung, B. Miller, and S. Shamseldin, Cluster analysis, Practical Graph Mining with R. Editors. N. F. Samatova, W. Hendrix, J. Jenkins, K. Padmanabhan, and A. Chakraborty. Practical Graph Mining with R: Students Series. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, CRC Press, 2013

3. K. Padmanabhan, Z. Chen, Sriram Lakshminarasimhan, S. Shankar Ramaswamy, and B. T. Richardson, Graph-based anomaly detection, Practical Graph Mining with R. Editors. N. F. Samatova, W. Hendrix, J. Jenkins, K. Padmanabhan, and A. Chakraborty. Practical Graph Mining with R: Students Series. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, CRC Press, 2013

4. K. Padmanabhan and J. Jenkins, Performance metrics for graph mining tasks, Practical Graph Mining with R. Editors. N. F. Samatova, W. Hendrix, J. Jenkins, K. Padmanabhan, and A. Chakraborty. Practical Graph Mining with R: Students Series. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, CRC Press, 2013

5. K. Padmanabhan, S. Lakshminarasimhan, Z. Gong, J. Jenkins, N. Shah, E. Schendel, I. Arkatkar, R. Ross, S. Klasky, and N. F. Samatova, In situ analysis in support of exploratory scientific discovery in data intensive science, Data Intensive Science, Editors. Terence Critchlow and Kerstin Kleese Van Dam. Chapman & Hall/CRC Computational Science, CRC Press, 2013

Journal

1. K. Padmanabhan, K. Wang, and N. F. Samatova, Functional annotation of hierarchical modularity, 2012, PLoS ONE, 10.1371/journal.pone.0033744

2. K. Padmanabhan, K. Wilson, A. Rocha, J. R. Mihelcic, and N. F. Samatova, In situ identification of phenotype related modules, 2012, Proteome Science, 10(1):S2, 10.1186/1477-5956-10-S1-S2

3. G. Atluri, K. Padmanabhan, M. Steinbach, J. R. Petrella, K. Lim, A. McDonald III, N. F. Samatova, M. Doraiswamy, and V. Kumar, Complex biomarker discovery in neuroimaging data: Finding a needle in a haystack, NeuroImage: Clinical, 2013, 3:123-131

4. Z. Chen, K. Padmanabhan, A. Rocha, Y. Shpanskaya, J. R. Mihelcic, and N. F. Samatova, SPICE: Discovery of phenotype-determining component interplays, 2012, BMC Systems Biology, 6:40

5. M. C. Schmidt, A. Rocha, K. Padmanabhan, Z. Chen, K. Scott, J. R. Mihelcic, and N. F. Samatova, Efficient alpha,beta-motif finder for identification of phenotype-related functional modules, 2011, BMC Bioinformatics, 12:440

6. W. Hendrix*, A. Rocha*, K. Padmanabhan, A. Choudhary, K. Scott, J. R. Mihelcic, and N. F. Samatova, DENSE: Efficient and prior knowledge-driven discovery of phenotype-associated protein functional modules, 2011, BMC Systems Biology, 5:172

7. M. C. Schmidt*, A. Rocha*, K. Padmanabhan, Y. Shpanskaya, J. Banfield, K. Scott, J. R. Mihelcic, and N. F. Samatova, NIBBS-Search for fast and accurate prediction of phenotype-biased metabolic systems, 2012, PLoS Computational Biology, 10.1371/journal.pcbi.1002490

8. P. Varalakshmi, S. T. Selvi, K. Padmanabhan, and J. Ramesh, Fuzzy based resource management in broker architecture using trust and credentials, 2007, International Journal of Soft Computing, 2: 494

Conference

1. D. Gonzalez II, S. Pendse, K. Padmanabhan, M. Angus, I. Tetteh, S. Srinivas, A. Villanes, F. Semazzi, V. Kumar, and N. Samatova, Coupled Heterogeneous Association Rule Mining (CHARM): Application toward Inference of Modulatory Climate Relationships, 2013, IEEE International Conference on Data Mining, Dallas, TX, USA

2. P. Varalakshmi, S. T. Selvi, K. Padmanabhan, and S. Wazirah, Three-tier grid architecture with policy-based resource selection in grid, 2008, International Conference on Signal Processing, Communications, and Networking, Chennai, Tamil Nadu, India

3. P. Varalakshmi, S. T. Selvi, K. Padmanabhan, and S. Wazirah, A robust trust model with rated genuine feedbacks in a grid environment, 2007, International Conference on Computational Intelligence and Multimedia Applications, Sivakasi, Tamil Nadu, India

4. K. Wilson, A. Rocha, K. Padmanabhan, K. Wang, Z. Chen, Y. Jin, J. R. Mihelcic, and N. F. Samatova, Detecting pathway cross-talks by analyzing conserved functional modules across multiple phenotype-expressing organisms, 2011, IEEE International Conference on Bioinformatics and Biomedicine, Atlanta, GA, USA

Submission

• K. Padmanabhan, K. Holohan, G. B. Lander*, S. Harenberg*, L. Gjeltema*, G. Atluri, A. Saykin, M. Doraiswamy, V. Kumar and N. F. Samatova, Pathway Crosstalks as Biomarkers for Conversion from Mild Cognitive Impairment to Alzheimers Disease, 2013, PLoS Computational Biology

• S. Harenberg, K. Holohan, K. Padmanabhan, L. Gjeltema, G. B. Lander, A. Saykin, M. Doraiswamy, V. Kumar and N. F. Samatova, Inferring Complex Disease Relationships with Manually Curated Knowledge Priors, 2013, PLoS Computational Biology

* - contributed equally

ACKNOWLEDGEMENTS

It is my belief that a person needs excellent mentors, a good working environment, and a strong support system to successfully complete his/her doctoral dissertation. I was fortunate to have all of those things. I would like to take this opportunity to thank all the people without whom this journey would not have been possible.

First and foremost, I will always be grateful and deeply indebted to my advisor, mentor, and guide, Dr. Nagiza F. Samatova. I can safely say that I could not have completed this dissertation without her constant encouragement and belief in my abilities. There are no words sufficient to express the magnitude of her contribution to my growth both as a person and as a researcher. I feel truly blessed that she decided to take me on as her student. The opportunity to continue working under her was one of the main reasons that pushed me to switch from the Master’s to the Ph.D. program. She is a mentor who is always available to her students; I have had 4 AM conversations with her regarding my research. Her enthusiasm towards her work, be it research or teaching, never ceases to impress me. She backs it up with a tremendous amount of hard work, determination, ability to multi-task, and a drive to succeed. She set a benchmark that I constantly strive towards. She provided me opportunities to teach, work on proposals, be involved in meetings with collaborators, co-develop a course, and even edit a book. These experiences have helped transform me into a professional who can function independently and at the same time know what it means to be a team player. I am glad I have someone like her to count on for excellent professional guidance throughout my life.

I would like to thank my committee members Dr. Steffen Heber, Dr. Donald Bitzer, and Dr. Kemafor Ogan for offering their invaluable time, support, and guidance during the course of my doctoral studies. I am thankful to Dr. David Thuente for helping me with the transfer from the Master’s to the Ph.D. program. I also thank Dr. Douglas Reeves for helping me with all the administrative processes associated with the degree. I would also like to thank Ms. Margery Page, Ms. Carol Allen, Mr. Andrew Sleeth, and Mr. Ron Hartis, who have been of tremendous help in various capacities. I am thankful to several professors in the department, specifically Dr. Thomas Honeycutt, Dr. Rada Chirkova, Dr. Munindar Singh, and Dr. Ting Yu. I am thankful to Dr. Chirkova, Dr. Ogan, and Dr. Honeycutt for also graciously agreeing to act as my professional references. I am thankful to Dr. Heber for also giving me a recommendation to transfer from the Master’s to the Ph.D. program. I would also like to thank Dr. Anatoli Melechko for offering kind words of support on many occasions.

I have had the privilege of working in a lab where teamwork and camaraderie are part of the culture. Nobody hesitated to jump in and help out another. Many of the current and former lab members and some external student collaborators have contributed to the completion of my thesis. I would like to begin by thanking former student Dr. Andrea Rocha for her support in finishing several of my publications and for also agreeing to serve as one of my professional references. Specifically, I would like to thank her for the biological insights she provided for my second thesis component. I am grateful to (in no specific order) Lucia Gjeltema, Steven Harenberg, and Gonzalo Bello Lander, without whom the timely completion of my first thesis component would not have been possible. I would also like to thank Kelly Holohan (Indiana University) for the detailed biological analysis she provided for my first thesis component. I want to thank Yekaterina Shpanskaya for the biological analysis she provided for two of the publications I co-authored (NIBBS and SPICE). I am thankful to Kuangyu Wang, a former student, for the biological insights he provided for both my first and third thesis components. I am thankful to Dr. Mathew Schmidt and Dr. William Hendrix, who have acted both in the capacity of mentors and co-authors. I am grateful to John Jenkins for his comments and suggestions that tremendously improved the book chapters I was lead author for. I would also like to thank Gowtham Atluri (University of Minnesota), Dr. Zhengzhang Chen, Kevin Wilson, and Doel Gonzalez for giving me the opportunity to co-author with them.

I have been lucky to have external collaborators who, in spite of their hectic schedules, have never hesitated to offer help, advice, and suggestions. I would like to begin by thanking Dr. Murali Doraiswamy (Duke University) for providing me the Alzheimer’s disease use-case and access to the corresponding datasets. I am grateful to Dr. Andrew Saykin (Indiana University) in addition to Dr. Doraiswamy for reviewing and providing suggestions and insights that added incredible value to my first thesis component. I would also like to thank Dr. Saykin for connecting me to his student Kelly Holohan, who ended up being a valued co-author. I would like to thank Dr. Vipin Kumar (University of Minnesota) for giving me the opportunity to co-author a journal publication with his student Gowtham Atluri. I spent the summer of 2012 at the Lawrence Berkeley National Lab in Dr. Inna Dubchak’s group. I am thankful to her for taking me in. I am also deeply grateful to Dr. Pavel Novichkov and Dr. Alexey Kazakov for taking me under their wings and mentoring me for the duration of the internship.

Sriram Lakshminarasimhan has been my best friend, confidante, and sounding board for the past four years. He has been an integral part of the support system that has propelled me to successfully complete my dissertation. His confidence in me has on many occasions been overwhelming and provided inspiration to better myself. He has always offered a patient listening ear and his sound advice has often helped me navigate situations. Graduate school among other things has given me a friend for life.

I would also like to thank (in no particular order) my friends Veda, Vishnu, Sarah, Uthra, TC, Sai, and PK, for always being excited and happy for me in all my endeavors. I also thank Padmashree for getting me involved in the STARS USCRI Refugee Outreach project that gave me an opportunity to contribute to society in some tiny capacity.

The acknowledgments would be incomplete without thanking my family. My parents Usha and Padmanabhan have instilled in me the confidence to achieve all my goals. I would never be able to thank them enough for all the sacrifices they made to ensure a good life for me. I am lucky to have them as my parents. My sister Vidhya and brother-in-law Mukundh have been two of the biggest supporters of my intent to pursue this doctoral degree. I am indebted to them for their consistent backing. I am thankful to my sister Nandhini and brother-in-law Vishwanath, who have always been there for me, and my parents-in-law Bhanumathi and Vasudevan, who have always offered their support when needed.

I will always consider myself fortunate to have Srinivasan for my husband. My decision to go into graduate school resulted in us living many thousand miles apart for over four years but that never stopped him from supporting or encouraging my decision. I am unable to find the right words to express my gratitude except to say that he has been my rock through all these years of graduate school.

Last but not least, I would like to thank my funding sources, the U.S. Department of Energy (Office of Advanced Scientific Computing Research and Office of Biological and Environmental Research, Office of Science) and North Carolina State University, Department of Computer Science, for their financial support.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

Chapter 1 Introduction
  1.1 Crosstalk Level: Systematic Identification of Network Biomarker Crosstalks
  1.2 Subsystem Level: Discovery of Network Biomarker Subsystems
  1.3 Gene Level: Functional Annotation of Hierarchical Modularity of Network Biomarkers

Chapter 2 Crosstalk Level: Systematic Identification of Network Biomarker Crosstalks
  2.1 Approach
    2.1.1 Identification of Potential Pathway Crosstalks
    2.1.2 Identification of Patient-specific Pathway Crosstalks
    2.1.3 Identification of Significant Pathway Crosstalks
  2.2 Results
    2.2.1 Datasets
    2.2.2 Sample Characteristics
    2.2.3 SNP-enriched Pathways
    2.2.4 SNP-enriched Pathway Crosstalks
    2.2.5 Prognosis of MCI to AD Conversion
  2.3 Conclusion

Chapter 3 Subsystem Level: Discovery of Network Biomarker Subsystems
  3.1 Approach
    3.1.1 Network Model Construction
    3.1.2 Enumeration Stage
    3.1.3 Evaluation Stage
  3.2 Results
    3.2.1 Experimental Setup
    3.2.2 Hydrogen Production
    3.2.3 Motility
    3.2.4 Respiration
    3.2.5 Gram-positive and Gram-negative
  3.3 Conclusion

Chapter 4 Gene Level: Functional Annotation of Hierarchical Modularity of Network Biomarkers
  4.1 Approach
    4.1.1 Hierarchical Taxonomy of Functional Terms (HTFA)
    4.1.2 Hierarchical Gene Module (HGM)
    4.1.3 Assessing Statistical Significance
    4.1.4 Fuzzy Reconstruction of Hierarchical Modularity
  4.2 Results
    4.2.1 Benchmark Data and Tools
    4.2.2 Functional Coherency of Protein Functional Modules
    4.2.3 Functional Coherency of Protein Pairs
    4.2.4 Inferred Hierarchy of Functional Modules
    4.2.5 Effect of Fuzziness
    4.2.6 Choosing ωλ Value
  4.3 Conclusion

Chapter 5 Conclusion and Future Work
  5.1 Causative Network Biomarkers
  5.2 Dynamic Network Biomarkers
  5.3 Multi-phenotype Network Biomarkers
  5.4 Prediction of Missing Values

References

LIST OF TABLES

Table 2.1 Subset of genes from CTD [30] and their association to Alzheimer’s disease
Table 2.2 Baseline characteristics of MCI study sample
Table 2.3 List of literature-verified pathways found in the study [139]
Table 2.4 Performance of models with baseline clinical parameters
Table 2.5 Performance of models with baseline clinical parameters with ApoE
Table 2.6 Performance of models with baseline clinical parameters and all imaging (MRI, PET) and bio-specimen (CSF) biomarkers
Table 2.7 Performance of models with baseline clinical parameters and PET biomarkers
Table 2.8 Performance of models with baseline clinical parameters, PET, and CSF biomarkers
Table 2.9 Performance of models with baseline clinical parameters and ApoE with all imaging (MRI, PET) and bio-specimen (CSF) biomarkers
Table 2.10 Performance of models with clinical parameters and ApoE with PET biomarker
Table 2.11 Performance of models with clinical parameters and ApoE with PET and CSF biomarkers
Table 2.12 Performance of the Shaffer et al. [158] model with 97 patients in comparison to our model with 91 patients
Table 2.13 Performance of the Shaffer et al. [158] model with 91 patients in comparison to our model with 91 patients
Table 2.14 Performance of models with randomized pathway crosstalk features

Table 3.1 Subsystem associated with light fermentation
Table 3.2 Subsystem associated with dark fermentation
Table 3.3 Subsystem associated with motility
Table 3.4 Subsystem associated with aerobic respiration
Table 3.5 Subsystem associated with anaerobic respiration
Table 3.6 Subsystem associated with gram-negativity
Table 3.7 Subsystem associated with gram-positivity

Table 4.1 Statistical significance of protein pairs’ functional coherence in Saccharomyces cerevisiae
Table 4.2 Functional modules evaluated using existing enrichment analysis tools in comparison with HMS. The first two rows show two homogeneous functional modules and the next two rows of the table show heterogeneous functional modules that have coherent submodules. Functional modules have been obtained from Chen and Yuan [22] of the Saccharomyces cerevisiae PPI network
Table 4.3 Percentage of significant (p-value ≤ 0.05) and highly significant (p-value ≤ 0.001) functionally coherent modules from Chen and Yuan [22] and CYS2008 [137]
Table 4.4 HMS results for some KEGG metabolic pathways and MIPS protein complexes [21] classified as insignificant by Chagoyen et al. [21]
Table 4.5 Comparison of pair-wise semantic similarity metrics using functionally-associated and non-functional protein pairs
Table 4.6 Skill metrics for Saccharomyces cerevisiae KEGG experiments
Table 4.7 Skill metrics for Saccharomyces cerevisiae MIPS experiments
Table 4.8 Consistency of multi-pathway genes across clusters that enrich the corresponding pathways

LIST OF FIGURES

Figure 1.1 Hierarchical modularity of phenotype-related network biomarkers . . . . . 2

Figure 2.1 The generic pipeline to identify potential network biomarker crosstalks; (A) given biological networks that represent evidences, and a set of biological subsystems, the method first constructs an evidence-specific subsystem crosstalk network for each evidence, (B) these individual evidence-specific crosstalk networks are further combined into a single weighted crosstalk reference map that represents the plethora of all possible subsystem crosstalks captured by the combined evidence information, and (C) the reference map is analyzed in the context of some organism specific information to identify phenotype-related network biomarker crosstalks 11

Figure 2.2 The pipeline to identify potential pathway crosstalks. The analysis involves (1) using individual clues to score each pathway pair to quantify crosstalk likelihood, (2) obtaining a combined score using information fusion, and (3) building the reference crosstalk map . . . . . . . . . . . . . 12

Figure 2.3 The pipeline to identify SNP enriched pathways and pathway crosstalks. The methodology has three steps: (1) the SNPs are mapped to genes and in turn to pathways using the SNP and gene location information, (2) a genetic model is chosen and a patient specific pathway SNP enrichment score is calculated using SNPs mapped to pathways and using patient allele information, and (3) the pathway enrichment scores are overlaid on the reference crosstalk map to build patient-specific crosstalk maps . . . . 13

Figure 2.4 Pie chart showing the distribution of types of SNP-enriched pathways identified in the study and comparison to KEGG pathway distribution . . 21

Figure 2.5 Alzheimer’s disease pathway crosstalk: pathways found to have significant crosstalk with the AD pathway fall into KEGG categories including Human Diseases, Environmental Information Processing, Genetic Information Processing, Cellular Processes, Metabolism, and Organismal Systems. Specific KEGG pathway types are listed below each category, and the number of occurrences is listed in parentheses. . . . . . . . . . . . . . 27

Figure 3.1 Methodology overview to identify and evaluate phenotype-related network biomarker subsystems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

Figure 3.2 Three step building of the orthologous-group organism bipartite network . 49

Figure 3.3 Enumeration of conserved edge-sets using maximal biclique subgraph model 50

Figure 3.4 Parameter-free biclique enumerator. (A) An example graph, (B) Representative selection, and (C) The search tree for the example graph . . . . 53

Figure 3.5 Extracting the potential phenotype-related modules from the enumerated maximal bicliques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

Figure 4.1 Overview of the methodology to assess functional coherence and assign annotation to hierarchical functional modules . . . . . . . . . . . . . . . . 68

Figure 4.2 Overview of fuzzy reconstruction of hierarchical modularity . . . . . . . . 69


Figure 4.3 Hierarchical functional annotation of a gene module M for a gene set G = {g1, g2, g3, g4} given the taxonomy T in Figure 4.4. (A) Functional annotation of genes in G by unrelated term sets in T, (B) A hierarchical gene module M, and (C) The resulting annotation of the internal (non-leaf) nodes in V(M) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

Figure 4.4 An illustration of a hierarchical taxonomy T over the set of functional annotation terms A = {t0, t1, t2, t3, t4, t5}. (A) A DAG view and (B) A level set view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

Figure 4.5 Illustration of penalization factor calculation. (A) Hierarchical annotation of the functional module defined in Figure 7 and (B) Dissimilarity score ψ(p, c) for a parent p = v and a child c = g1 . . . . . . . . . . . . . . . . . 73

Figure 4.6 Comparison between the three penalization factor functions considered . . 74

Figure 4.7 Functionally coherent modules from the Chen and Yuan [22] study. (A) Module ID M1 and (B) Module ID M3 . . . . . . . . . . . . . . . . . . . . 77

Figure 4.8 Functional coherence analysis of protein complexes from MIPS-curated [52], Ho [62], Gavin [41], and Krogan [91] as well as metabolic pathways from KEGG. Comparison between our HMS scoring, cosine similarity with different -value methods from [21], Jaccard similarity with different value methods from [21] and GS2 [146] methods. (A) Significant Modules (p-value ≤ 0.05) and (B) Highly Significant Modules (p-value ≤ 0.001) . . . 80

Figure 4.9 Effect of different values of ωλ on the SHMS() score . . . . . . . . . . . . . 87


Chapter 1

Introduction

Biomarker discovery and characterization have been an important area of research in the re-

cent past. Biomarkers help differentiate organisms that possess certain characteristics from the

organisms that do not exhibit the characteristics. Examples of characteristics include bioenergy

related characteristics, such as bioethanol production, biohydrogen production, and acid toler-

ance. Examples of characteristics also include human diseases, such as Alzheimer’s, cancer, or

diabetes. The term for these biological characteristics is phenotypes.

Biomarker discovery is extremely important in many areas of research. Biomarkers related to

phenotypes such as bioethanol production and biohydrogen production are useful in improving

the efficiency of many industrial processes. Similarly, biomarkers related to uranium reduction

phenotype can assist in uranium decontamination of soil and water, because uranium takes

hundreds of millions of years to decay radioactively. Biomarkers can also improve the accuracy

of disease prognostics and support development of more effective drugs.

Traditionally, biomarker discovery has revolved around searching for individual genes or

gene sets that are linked to phenotype presence [46, 72, 89, 163, 172] (see also [4], our review

of biomarker discovery for neurological disorder phenotypes). However, genes do not function

individually but work together with other genes to carry out various functions in an organism.

Biological networks (e.g., genetic interaction networks) capture inter-gene relationships,

where the nodes are genes (or gene sets) and the edges represent some functional associa-

tion between the genes (or gene sets). Hence, the biomarker discovery space is moving towards

utilizing these networks for unearthing biomarkers. These biomarkers are known as network

biomarkers.

Network biomarkers are discovered in an unsupervised fashion, where the problem is to

discover discriminatory subnetworks between biological networks for organisms that exhibit the

target phenotype and those biological networks for organisms that do not exhibit the phenotype

[26, 46, 47, 60, 61, 72, 89, 97, 152, 153, 163, 172, 198]. Such signature subnetworks are assumed


to be functionally coherent, or forming functional modules, with respect to the function that

is critical to the target phenotype. Moreover, these functionally coherent modules combine in

a hierarchical manner into larger, less cohesive subsystems, thus revealing one of the essential

design principles of system-level cellular organization and function—hierarchical modularity

[58, 108, 140, 164].

Figure 1.1 illustrates this concept of hierarchical modularity in cellular systems, using three

major levels of abstraction: (1) the gene level, (2) the subsystem level, and (3) the subsystem

crosstalk level. Such hierarchical modularity of complex systems has several benefits especially

from the evolutionary adaptation point of view [108]. Examples of such hierarchical organization

with respect to biohydrogen production phenotype include functionally coherent modules de-

rived from transcriptome data related to electron transport (fixX, fixC, fixB, fixA, ferN, fer1 ),

co-factor synthesis (nifB, nifV, nifQ, nifN, nifE, nifX ), assembly or stability (nifW, nifS2,

nifU ), and regulation (nifA) in Rhodopseudomonas palustris [142]. Likewise, the CD4+ T-cell

modules involved in human immune protection and regulation are hierarchically organized and

are made up of polarizing cues, lineage-specifying transcription factors, homing receptors, and

effector molecules [148].

[Figure 1.1 (diagram): genes g1 to g6 shown at three levels. Individual Gene Level: Network Biomarker Functional Annotation (using individual genes & their functional roles). Subsystem Level: Network Biomarker Subsystems (functionally associated genes). Crosstalk Level: Network Biomarker Crosstalks (interacting subsystems).]

Figure 1.1: Hierarchical modularity of phenotype-related network biomarkers

With this perspective of phenotype-centric hierarchical modularity, the central hypothesis of

this thesis is the following:

Phenotype-centric hierarchical-modularity driven network biomarker discovery and analysis

will enhance the performance of phenotype-related biomarker identification and improve

phenotype diagnostics and prognostics using such biomarkers


Motivated by this hypothesis we focus each component of this thesis towards one level of the

hierarchical modularity, depicted in Fig 1.1.

1.1 Crosstalk Level: Systematic Identification of

Network Biomarker Crosstalks

Phenotype-expression in an organism is extremely complex and it is likely that not a single

cellular subsystem such as the ones identified at the subsystem level, but a group of subsystems

interact to help express a particular phenotype. This inter-subsystem interaction has been

coined as subsystem crosstalk. Crosstalks useful in improving our understanding of phenotype

expression are termed crosstalk biomarkers. For example, the human neurological disorders are

complex and characterized by crosstalks among multiple molecular pathways contributing to

disease initiation and progression. One example is the crosstalk among AD pathway, oxidative

phosphorylation, p53 signaling pathway, and apoptosis [7, 93, 156]. Drug developers are now

focusing on understanding the interactions between the multiple key pathways that are affected

as a result of these diseases [101], because the previous one-target, one-drug approach is not

successful for diseases, such as Alzheimer’s and cancer.

To the best of our knowledge, there has been no systematic computational study of crosstalk

biomarkers. There are several different “clues” for crosstalk prediction including synthetic lethal-

ity, phosphorylation, genetic interaction, genes common between subsystems, etc. Incorporating

different clues will likely improve the confidence of crosstalk predictions. However, the computa-

tional methodologies developed so far [102, 122] make use of only one kind of “clue” to identify

crosstalks between phenotype related subsystems identified at the subsystem level. A method-

ology that includes multiple clues simultaneously will likely help identify crosstalks that are

biologically more meaningful.

Besides contributing to mechanistic understanding, crosstalks could also be important features for phenotype prediction. However, to the best of our knowledge, the use of crosstalk biomarkers as features for phenotype prediction, especially for disease prognosis, has not been investigated in depth.

Approach

We propose a methodology to identify subsystem crosstalk biomarkers and show their utility in

phenotype prediction. Given biological networks that represent evidences such as protein inter-

action and genetic interaction, and a set of biological subsystems, the method first constructs

an evidence-specific subsystem crosstalk network for each evidence. These individual evidence-

specific crosstalk networks are then combined into a single weighted crosstalk reference map


that represents the plethora of all possible subsystem crosstalks captured by the combined

evidence information.
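The two steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the edge weight (the fraction of cross-subsystem gene pairs that are linked in an evidence network) and the fusion rule (an unweighted mean over evidences) are hypothetical placeholders rather than the scoring and fusion functions developed in this thesis, and the names `evidence_crosstalk_network` and `fuse_evidence` are invented for the example.

```python
from itertools import combinations


def evidence_crosstalk_network(edges, subsystems):
    """Build one evidence-specific crosstalk network.

    edges: iterable of (gene, gene) pairs from a single evidence network.
    subsystems: dict mapping subsystem name -> set of genes.
    Returns {(name_a, name_b): weight}; the weight is the fraction of
    cross-subsystem gene pairs linked in this evidence (assumed scoring).
    """
    linked = {frozenset(edge) for edge in edges}
    network = {}
    for (name_a, genes_a), (name_b, genes_b) in combinations(sorted(subsystems.items()), 2):
        pairs = [(a, b) for a in genes_a for b in genes_b if a != b]
        if not pairs:
            continue
        hits = sum(frozenset(pair) in linked for pair in pairs)
        if hits:
            network[(name_a, name_b)] = hits / len(pairs)
    return network


def fuse_evidence(networks):
    """Combine evidence-specific networks into one weighted reference map.

    Missing edges count as weight 0; the mean is an assumed fusion rule.
    """
    keys = set().union(*networks)
    return {key: sum(net.get(key, 0.0) for net in networks) / len(networks) for key in keys}


# Hypothetical toy input: three subsystems and two evidence networks.
subsystems = {"A": {"g1", "g2"}, "B": {"g3", "g4"}, "C": {"g5"}}
protein_interaction = [("g1", "g3"), ("g2", "g4"), ("g1", "g2")]
genetic_interaction = [("g1", "g3"), ("g4", "g5")]

reference_map = fuse_evidence([
    evidence_crosstalk_network(protein_interaction, subsystems),
    evidence_crosstalk_network(genetic_interaction, subsystems),
])
# The A-B crosstalk is supported by both evidences, B-C only by the genetic one.
```

Keeping the per-evidence networks separate until the fusion step makes it easy to swap in a different fusion rule without recomputing the individual scores.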

In order to identify phenotype related crosstalks, these reference maps have to be analyzed in

the context of some organism specific information. This is done by choosing a target phenotype

and a set of organisms and instantiating the crosstalk reference map with information specific

to each organism (e.g., gene expression datasets, single nucleotide polymorphism datasets).

These instantiated organism-specific crosstalk maps are then comparatively analyzed to identify

crosstalk network biomarkers. These biomarkers are then further utilized as additional features

in phenotype prediction or disease prognosis.

Results and Impact

In this study, we utilize biological subsystems that model biological pathways, i.e., the nodes of

the crosstalk reference map are pathways. We have developed a methodology to build a pathway

crosstalk reference map using the combined power of several protein/gene level evidences. We

also identified pathway crosstalks that are potential features for phenotype prediction.

Our methodology was tested on the human disease phenotype, Alzheimer’s disease. Specifically, the problem of interest is to predict which of the patients with the condition known as mild cognitive impairment (MCI) will progress to Alzheimer’s disease (AD).

The method captured the well-documented crosstalks associated with the oxidative damage and insulin resistance that cause metabolic dysfunction in AD. A temporal three-part sequence occurs

with changes in glycolysis, the tricarboxylic acid cycle (TCA) and oxidative phosphorylation

pathways [160]. The method also found several neural cell adhesion pathways supported by

previous research indicating that these pathways might be involved in this disease process [139].

Since AD is a neurodegenerative disease, we also observed nervous system related pathways; in particular, the enrichment of long-term potentiation and depression pathways and of neurotransmitter signaling pathways supports our approach, given that these pathways have been previously associated with cognitive impairment and AD [24, 37, 96, 171].

Incorporating pathway crosstalks as additional features for predicting MCI-to-AD conversion improved the accuracy by 15.9% in comparison to the best predictive model of the current state of the art [158].

1.2 Subsystem Level: Discovery of Network Biomarker Subsys-

tems

Information about the cellular system of an organism can be represented as a network. For

example, gene functional association networks model the cellular system using genes as nodes


and edges representing a functional association between the genes. These organismal networks

are comparatively analyzed to identify subsystems that could be potentially associated with a

phenotype, also known as subsystem network biomarkers or simply subsystem biomarkers.

The current state of the art α, β−motif finder [152] algorithm identifies subsystem biomark-

ers modeled as cliques (completely connected subgraphs) that are present in at least α-phenotype

expressing organisms and no more than β phenotype non-expressing organisms (where α is cho-

sen to be greater than β). However, due to the noise and missing information inherent to real

world networks, cellular subsystems are not always cliques (cliques have a density of 1.0, where density is the ratio of the number of edges in the subnetwork to the total number of edges the subnetwork would have if every pair of vertices in the subnetwork were connected) [54].
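The density definition above amounts to a short computation. The following Python sketch is illustrative only; the function name `density` and the edge-list representation are choices made for this example.

```python
def density(nodes, edges):
    """Ratio of edges present among `nodes` to the n*(n-1)/2 possible edges."""
    n = len(nodes)
    if n < 2:
        return 0.0
    # Keep each undirected edge once; ignore self-loops and outside nodes.
    present = {
        frozenset((u, v))
        for u, v in edges
        if u != v and u in nodes and v in nodes
    }
    return len(present) / (n * (n - 1) / 2)


# A triangle is a clique (density 1.0); removing one edge lowers the density.
triangle = density({"g1", "g2", "g3"}, [("g1", "g2"), ("g2", "g3"), ("g1", "g3")])
path = density({"g1", "g2", "g3"}, [("g1", "g2"), ("g2", "g3")])
```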

Additionally, selecting α and β parameters that will produce biologically meaningful results

could be a hard problem to solve. Hence, several runs of the algorithm with varying parameters

would be required to identify all the potential phenotype-related subsystem biomarkers.
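The sensitivity to α and β can be seen in a small sketch of the selection criterion. This is a hedged illustration, not the α, β-motif finder itself: it only checks the stated presence condition for one candidate clique, it assumes gene identifiers are comparable across organisms, and the names `as_network`, `clique_edges`, and `passes_alpha_beta` are hypothetical.

```python
from itertools import combinations


def as_network(edge_list):
    """Represent an organism network as a set of undirected edges."""
    return {frozenset(edge) for edge in edge_list}


def clique_edges(genes):
    """All gene pairs that a clique on `genes` must contain."""
    return {frozenset(pair) for pair in combinations(sorted(genes), 2)}


def passes_alpha_beta(genes, expressing, non_expressing, alpha, beta):
    """True if the clique occurs in at least `alpha` phenotype-expressing
    networks and in no more than `beta` non-expressing networks."""
    needed = clique_edges(genes)
    in_expressing = sum(needed <= network for network in expressing)
    in_non_expressing = sum(needed <= network for network in non_expressing)
    return in_expressing >= alpha and in_non_expressing <= beta


# Hypothetical toy data: the clique is complete in two of three expressing
# networks and in one of two non-expressing networks.
full = as_network([("g1", "g2"), ("g2", "g3"), ("g1", "g3")])
partial = as_network([("g1", "g2"), ("g2", "g3")])
expressing = [full, full, partial]
non_expressing = [full, set()]

accepted = passes_alpha_beta({"g1", "g2", "g3"}, expressing, non_expressing, alpha=2, beta=1)
rejected = passes_alpha_beta({"g1", "g2", "g3"}, expressing, non_expressing, alpha=3, beta=1)
```

In this toy case the same candidate is accepted for (α, β) = (2, 1) and rejected for (3, 1), which is why several parameter settings would have to be tried.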

Comparative analysis methods to identify phenotype-related subsystem biomarkers typically

work under the assumption that subsystems more likely to be present in phenotype-expressing

organisms are responsible for phenotype-expression. However, it is possible that the absence

of a subsystem causes the phenotype. For example, it was shown that the “couch potato” syndrome could potentially be caused by a person missing some key genes related to the AMP-activated protein kinase (AMPK) [179]. For the α, β-motif finder to identify this kind of information, the algorithm would have to be rerun with the phenotype-expressing and non-expressing organisms reversed, and once again the choice of parameters would come into play.

A straightforward approach of enumerating subsystem biomarkers of all densities (0.1−1.0)

present in all subsets of organisms and determining which of those are specifically present or ab-

sent in the networks of phenotype-expressing organisms would be computationally intractable

due to the potential number of subsystems in each network. The goal of this study is to enu-

merate phenotype-related subsystem biomarkers placing no restriction on the density or size of

the enumerated subgraphs. The method also does not place any restriction on the number of

networks the resulting biomarker should be present in.

Approach

Given functional association networks of the target organisms for the phenotype of interest,

our method first combines these networks into the proposed orthologous group-pair, organism

bipartite network. This network model is utilized for enumerating the phenotype-related network biomarkers at the subsystem level. Unlike our earlier work, the current state-of-the-art α, β-motif finder, which needs the α and β parameters to identify the subsystem-level network biomarkers, our method is parameter free. In addition, the method can identify biomarkers


biased towards both phenotype-expressing organisms and phenotype non-expressing organisms

using a single run of the method in contrast to the existing method that requires two passes.

Moreover, an identified network biomarker corresponds to a more relaxed representation of a

cellular subsystem in comparison to the highly rigid clique representation used by α, β-motif

finder. These subsystems, as subgraphs, have topological density ranging from 0.1 to 1.0, unlike

cliques. Hence, the method is more robust at handling the inherent noise and missing information in the input networks. We use a statistical method to calculate, for each biomarker, a p-value quantifying its bias towards either the phenotype-expressing organisms or the phenotype non-expressing organisms.
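A minimal Python sketch of the network model and of one possible bias statistic is given below. It rests on assumptions that go beyond the text above: the bipartite network is taken to connect each pair of orthologous groups to the organisms whose network contains an association between those groups, and the bias is scored with a one-sided hypergeometric tail, which is an assumed stand-in for the unspecified statistical method. The names `build_bipartite` and `hypergeom_tail` are hypothetical.

```python
from math import comb


def build_bipartite(networks, gene_to_group):
    """Map each orthologous-group pair to the organisms that contain it.

    networks: dict organism -> iterable of (gene, gene) associations.
    gene_to_group: dict gene -> orthologous group identifier.
    """
    bipartite = {}
    for organism, edges in networks.items():
        for u, v in edges:
            group_u, group_v = gene_to_group.get(u), gene_to_group.get(v)
            if group_u is None or group_v is None or group_u == group_v:
                continue
            bipartite.setdefault(frozenset((group_u, group_v)), set()).add(organism)
    return bipartite


def hypergeom_tail(k, n_draw, n_success, n_total):
    """P[X >= k] when n_draw organisms are drawn from n_total,
    of which n_success express the phenotype."""
    upper = min(n_draw, n_success)
    favourable = sum(
        comb(n_success, i) * comb(n_total - n_success, n_draw - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_draw)


# Hypothetical toy data: four organisms, two of which express the phenotype.
networks = {
    "org1": [("a1", "b1")],
    "org2": [("a2", "b2")],
    "org3": [("a3", "c3")],
    "org4": [],
}
gene_to_group = {"a1": "OG_A", "a2": "OG_A", "a3": "OG_A",
                 "b1": "OG_B", "b2": "OG_B", "c3": "OG_C"}
bipartite = build_bipartite(networks, gene_to_group)

# The pair (OG_A, OG_B) occurs in org1 and org2; if both express the phenotype,
# the chance of seeing 2 of 2 expressing organisms at random is 1/6.
p_expressing = hypergeom_tail(k=2, n_draw=2, n_success=2, n_total=4)
```

Scoring the bias towards non-expressing organisms uses the same function with the roles of the two organism classes swapped, so both directions come from a single enumeration.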

Results and Impact

We developed a method to enumerate phenotype-related subsystem biomarkers taking into

consideration lessons learned from our earlier α, β-motif finder method, which is the current-

state-of-the-art. The methodology was tested on data for a number of phenotypes including

hydrogen production, gram stain, motility, and respiration. The results show that in each case

the method identified subsystems associated with the phenotypes that can be verified by lit-

erature. For example, for dark fermentation, we identified literature verified subsystems either

directly or indirectly responsible for the uptake or production of hydrogen; subsystems that include proteins associated with the synthesis or expression of [NiFe]-hydrogenase, an enzyme

that catalyzes the reversible oxidation of molecular hydrogen, and plays a vital role in anaer-

obic metabolism [159]; the others are involved in nitrogen and iron metabolic pathways that

include proteins like nitrogenase, iron uptake proteins, ammonia assimilation proteins, such as

glutamine synthetase, and proteins involved in electron transfer.

The methodology is not confined to the biology domain alone but has applications in other

areas. For example, we adopted this methodology for anomaly detection in time-evolving climate

networks [25].

This work is published in [129].

1.3 Gene Level: Functional Annotation of Hierarchical Modu-

larity of Network Biomarkers

Computationally enumerated phenotype-related network biomarkers are further evaluated for

biological significance. This analysis helps identify results that are biologically meaningful. Bi-

ological significance analysis includes assigning a functional coherence score to quantify the

significance and a functional annotation denoting the biological function of the gene set con-

stituent of the network biomarker. Functional annotation methods assign a set of functional


terms to the gene set. The terms come from a semantic functional taxonomy that documents all

possible functions of genes in a cell. Arguably, hierarchical modularity has not been explicitly

taken into consideration by most, if not all, current methodologies. As a result, the existing

methods would often fail to assign a statistically significant coherence score to biologically

relevant molecular machines. The goal of this study is to develop a methodology to analyze

gene sets for functional coherence and assign functional annotation by explicitly taking into

consideration the hierarchical modularity principle.

Approach

Given the hierarchical taxonomy of functional concepts (e.g., Gene Ontology) and the associa-

tion of individual genes to these concepts (e.g., GO terms), our method will assign a proposed

Hierarchical Modularity Score (HMS) to each node in the hierarchical structure of the gene set;

the HMS score and its p−value measure functional coherence. While existing methods annotate

each network biomarker with a set of “enriched” functional terms in a bag of genes, our com-

plementary method provides the hierarchical functional annotation of the modules and their

hierarchically organized components. A hierarchical organization of the gene set often comes as

a by-product of cluster analysis of gene expression data or protein interaction data. Otherwise,

our method will automatically build such a hierarchy by directly incorporating the functional

taxonomy information into the hierarchy search process and by allowing multi-functional genes

to be part of more than one component in the hierarchy. In addition, the underlying HMS scoring

metric ensures that functional specificity of the terms across different levels of the hierarchical

taxonomy is properly treated.

Results and Impact

We developed a method to assess biological significance (functional coherence and annotation)

of gene sets taking into consideration their hierarchical modularity. The method shows an

improved performance (on average 15%) in comparison to the current state-of-the-art functional

coherence and annotation analysis methods when evaluated on known functionally coherent

gene sets from KEGG [75, 76, 77] and MIPS [52] and several other computationally derived

and curated datasets [22, 41, 62, 91, 137]. The use of this methodology is not confined to

the biology domain alone but is useful in other areas where such hierarchical annotations are

required. For example, semantic text analysis uses hierarchical taxonomies (e.g., WordNet) to

annotate and analyze text for topical similarity of its components.

This work has been published in [128].


Chapter 2

Crosstalk Level: Systematic

Identification of Network Biomarker

Crosstalks

Given its complex nature, phenotype-expression is likely the result of interactions among several

cellular subsystems and not a function of a single subsystem. This inter-subsystem interaction

is known as subsystem crosstalk. Subsystem crosstalks have been observed across different or-

ganisms. A classic example of microbial crosstalk is the regulation of tricarboxylic acid (TCA)

cycle. Fuel enters the TCA cycle primarily as acetyl-CoA (Acetyl coenzyme A). The generation

of acetyl-CoA is therefore a major control point of the cycle. Several pathways involved in syn-

thesis of Acetyl-CoA may be rate limiting and thus regulate TCA. These subsystems could be

part of carbohydrates metabolism, amino acids metabolism, or fatty acids metabolism. Simi-

larly cellular mechanisms underlying human neurological disorders are complex with crosstalks

between multiple molecular pathways contributing to disease initiation and progression. One

example is the crosstalk among oxidative phosphorylation, p53 signaling pathway, and apopto-

sis [7, 93, 156]. Another example is the crosstalk among MAPK signaling, insulin signaling, and

calcium signaling pathways [102]. The same study also provided evidence of crosstalk among

the pathways involved in the regulation of glycolysis metabolism, the regulation of the actin

cytoskeleton, and apoptosis [102]. The latter crosstalk was also associated with other neurode-

generative disorders such as Huntington disease and amyotrophic lateral sclerosis [102].

The importance of subsystem crosstalks has been previously articulated in scientific liter-

ature. Crosstalks improve robustness of the cellular system through mutual function compen-

sation [11]. For example, there are existing studies that show that the plastidial mevalonate-

independent pathway can compensate the reduced flux through the cytosolic mevalonate path-

way in Arabidopsis thaliana [11]. Certain crosstalks help boost the performance of certain sub-


systems. The Mitogen-activated protein kinase (MAPK) pathway enhances the signaling in the

Integrin pathway by increasing the activation rate of key proteins in the Integrin pathway [155].

MAPK also alters transcription of genes that are important to other metabolic pathways by

altering the levels and activities of transcription factors.

There is no universal categorization of biological mechanisms that are responsible for crosstalks

but some attempts have been made to model certain pathway crosstalks. Mathematical mod-

els have been developed to study signaling pathway crosstalks [32]. The different models are

the signal flow, substrate availability, receptor function, gene expression, and intracellular com-

munication. Signal flow crosstalk between two pathways occurs when the molecules of one

pathway affect the rate of protein activation in another pathway. For example, MAPK pathway

enhances the signaling in the Integrin pathway by increasing the activation rate of key pro-

teins in the Integrin pathway [155]. Substrate availability crosstalk occurs when two pathways

compete for proteins that perform the same function. For example, the hyperosmolar pathway

and pheromone mitogen-activated protein (MAP) kinase pathway in Saccharomyces cerevisiae

share the STE11 protein [111]. Receptor function crosstalk occurs when the pathway receptor

can be activated by factors other than its target ligands. For example, in the absence of the es-

trogen ligand, other signaling pathways can activate the estrogen receptor [80]. Gene expression

crosstalk occurs when a transcription factor (TF) contained in one pathway represses another

TF activated through a second pathway. For example, the transcription factor NR3C1 relocates

to the nucleus upon activation and represses the transcription factor NF-kB that is present in

multiple pathways. Intracellular communication crosstalk occurs when a ligand released by a

pathway activates another pathway. For example, the Bone morphogenetic proteins (BMP) and

WNT pathways reciprocally regulate the production of their ligands [12]. There has also been

a study on the interactions between different biological processes defined by the Gene Ontology

taxonomy in Saccharomyces cerevisiae [33].

Crosstalks can also be potential biomarkers for phenotype prediction or disease prognosis,

also called network biomarker crosstalks. Diseases such as cancer, diabetes, obesity, and asthma

are caused by defects in multiple pathways [69, 23, 99]. Drug developers are focusing on the

interactions between multiple key pathways that are affected as a result of these diseases [102],

because the previous one-target, one-drug approach has not been particularly successful for

diseases such as cancer.

For phenotypes such as Alzheimer’s disease (AD), a variety of features have been tested for disease prognosis: imaging data such as structural magnetic resonance imaging (MRI) and fluorine-18 fluorodeoxyglucose positron emission tomography (PET), and bio-specimen samples such

as cerebrospinal fluid (CSF). Clinical parameters including neurological cognitive tests such

as the Alzheimer’s disease Assessment Scale-Cognitive subscale with 70-point scale (ADAS-

cog) and age, gender, education, etc., have been tested as well. However, to the best of our


knowledge, utilizing biomarker crosstalks as potential features for disease prognostics has not been investigated in depth. (In the following sections the MRI, PET, and CSF features may also be referred to together as AD biomarkers.)

From the computational methodology standpoint we find that the area of network biomarker

crosstalks is still in its infancy. Molecular level interactions (functional associations between

genes or interactions between proteins) can act as “clues” to predict potential crosstalks [33].

Molecular level interactions are better studied in comparison to inter-subsystem crosstalks that

are ubiquitous, yet poorly understood. There are several documented mechanisms for molec-

ular interactions. These mechanisms (clues) include physical evidences such as direct binding,

biochemical evidences such as phosphorylation, and functional evidences such as transcrip-

tional regulation [101]. However, we find that existing computational methods that attempt to

predict crosstalks do not take advantage of the different “clues” available. Existing methods predict crosstalk between known metabolic pathways using physical protein interaction data [102, 122, 192]. Besides improving mechanistic understanding, biomarker crosstalks are likely worth investigating as features for phenotype diagnostics and prognostics.

In this chapter, we provide a three-part methodology for network biomarker crosstalk discovery. The methodology (see Figure 2.1) performs the following steps: (A) given biological networks that represent evidences and a set of biological subsystems, the method first constructs an evidence-specific subsystem crosstalk network for each evidence; (B) these individual evidence-specific crosstalk networks are further combined into a single weighted crosstalk reference map that represents the plethora of all possible subsystem crosstalks captured by the combined evidence information; and (C) the reference map is analyzed in the context of some organism-specific information to identify phenotype-related network biomarker crosstalks that are further incorporated in prognostic models.

2.1 Approach

In this study, we utilize cellular subsystems that model biological pathways. Henceforth in

this chapter we will refer to a cellular subsystem as a pathway. We utilize the human disease phenotype Alzheimer’s disease as our use case. The pipelines specific to our use case are shown in Figures 2.2 and 2.3.

Specifically, our methodology consists of the following steps: (A) identify the potential path-

way crosstalks by using existing gene/protein level data (see Figure 2.2), (B) identify per-patient

pathway crosstalks via SNP information (see Figure 2.3), and (C) identify the significant path-

way crosstalks that will be combined with other imaging (MRI, PET) and bio-specimen (CSF)

biomarkers and clinical parameters for AD prognosis.
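The SNP-to-pathway mapping that underlies step (B) can be sketched as follows. This is a minimal illustration, assuming that a SNP is assigned to a gene when its position falls between the gene start and end on the same chromosome; the function names, the coordinate representation, and the toy identifiers are hypothetical, and the enrichment scoring itself is not shown.

```python
def map_snps_to_genes(snp_positions, gene_regions):
    """Assign SNPs to genes by location.

    snp_positions: dict snp -> (chromosome, position).
    gene_regions: dict gene -> (chromosome, start, end), end inclusive.
    """
    gene_snps = {}
    for gene, (chrom, start, end) in gene_regions.items():
        hits = {
            snp
            for snp, (snp_chrom, pos) in snp_positions.items()
            if snp_chrom == chrom and start <= pos <= end
        }
        if hits:
            gene_snps[gene] = hits
    return gene_snps


def map_snps_to_pathways(pathway_genes, gene_snps):
    """Collect, for every pathway, the SNPs mapped to any of its genes."""
    return {
        pathway: set().union(*(gene_snps.get(gene, set()) for gene in genes))
        for pathway, genes in pathway_genes.items()
    }


# Hypothetical toy data.
snp_positions = {"snp1": ("1", 150), "snp2": ("1", 250), "snp3": ("2", 150)}
gene_regions = {"gene1": ("1", 100, 200), "gene2": ("1", 201, 300), "gene3": ("2", 500, 600)}
pathway_genes = {"pathway_1": {"gene1", "gene2"}, "pathway_2": {"gene3"}}

gene_snps = map_snps_to_genes(snp_positions, gene_regions)
pathway_snps = map_snps_to_pathways(pathway_genes, gene_snps)
```

A pathway whose genes carry no mapped SNPs keeps an empty set, so it remains visible to later steps instead of silently disappearing.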


[Figure 2.1 (diagram): (A) genes grouped into Subsystems A, B, C, and X, with Evidence M yielding a crosstalk map specific to Evidence M; (B) network information fusion F(Evidence M, Evidence N, Evidence O, Evidence P) yielding a crosstalk map specific to the combined evidences; (C) organism-specific maps for phenotype-expressing and phenotype non-expressing organisms (Org 1 to Org 6), built from organism-specific datasets, e.g., gene expression, SNP.]

Figure 2.1: The generic pipeline to identify potential network biomarker crosstalks; (A) given biological networks that represent evidences, and a set of biological subsystems, the method first constructs an evidence-specific subsystem crosstalk network for each evidence, (B) these individual evidence-specific crosstalk networks are further combined into a single weighted crosstalk reference map that represents the plethora of all possible subsystem crosstalks captured by the combined evidence information, and (C) the reference map is analyzed in the context of some organism specific information to identify phenotype-related network biomarker crosstalks


[Figure 2.2 (diagram): five evidence sources (enzymes and metabolites of pathways, transcription regulation information, synthetic lethality, genetic interactions, PPI) are scored for each pathway pair (e.g., hsa0030 with hsa0040, hsa0030 with hsa0050) using overlap similarity or interaction similarity, followed by Monte Carlo p-value calculation; the p-values are fused, f(p-val1…pval5) = X, to build the Pathway Crosstalk Reference Map (PC-RM), whose nodes are pathways and whose edges carry weights (e.g., 0.8).]

Figure 2.2: The pipeline to identify potential pathway crosstalks. The analysis involves (1) using individual clues to score each pathway pair to quantify crosstalk likelihood, (2) obtaining a combined score using information fusion, and (3) building the reference crosstalk map

2.1.1 Identification of Potential Pathway Crosstalks

We quantify via scores how likely it is that a pair of pathways will crosstalk based on each

biological dataset (including physical interaction, genetic interaction, and transcription factors)

that provides evidence for possible crosstalks. We then combine these scores to build one generic

pathway crosstalk reference map analogous to the KEGG pathway reference map [75, 76, 77].

The likelihood of a pathway pair crosstalking can be scored based on two different methods.

One method is based on the presence of “common elements” such as kinases, transcription

factors, and enzymes. The other scoring method is based on the presence of interacting elements

such as physically interacting proteins. In the following sections, we will discuss the different

evidences and the scoring methods used for each of those evidences.

Scoring pathway crosstalks based on common elements

• Shared enzymes and metabolites: The number of enzymes and metabolites shared by a pair of pathways is utilized as one of the evidences to identify potential pathway crosstalks. This is reasonable as evidence for possible crosstalks, because a variation in the concentration of common enzymes or metabolites will affect both pathways.

• Transcriptional regulation: Genes with common transcription factors are likely co-regulated. Co-regulated genes in different pathways provide an avenue for the pathways to crosstalk. For each pathway pair, we find the group of transcription factors that regulate genes in both pathways.

Figure 2.3: The pipeline to identify SNP-enriched pathways and pathway crosstalks. The methodology has three steps: (1) the SNPs are mapped to genes and in turn to pathways using the SNP and gene location information, (2) a genetic model is chosen and a patient-specific pathway SNP enrichment score is calculated using SNPs mapped to pathways and using patient allele information, and (3) the pathway enrichment scores are overlaid on the reference crosstalk map to build patient-specific crosstalk maps

[Figure panels: gene start and end and SNP locations; pathway-gene mapping; gene-SNP mapping; pathway-SNP mapping; patient-SNP allele information (SNP ID, minor allele count); genetic model (homozygous minor, recessive); patient-pathway enrichment scores; instantiated patient-specific pathway crosstalk reference map (crosstalk pathway pairs, edge presence)]

• Phosphorylation: Phosphorylation, performed by protein kinases, is the addition of a phosphate group to a protein that results in a change of the protein function. Co-phosphorylated proteins in different pathways suggest potential pathway crosstalks.

• “Common element” score: The pathway pairs were scored for how likely they are to crosstalk using each of the above evidences. For each pair of pathways, $P_i$ and $P_j$, we define the scoring function as the overlap similarity score

\[
\mathrm{Overlap}_{\mathrm{score}}(P_i, P_j) = \frac{|\Upsilon(P_i) \cap \Upsilon(P_j)|}{\min\left(|\Upsilon(P_i)|, |\Upsilon(P_j)|\right)} \tag{2.1}
\]

where $\Upsilon(P_i)$ is the set of proteins (enzymes or metabolites, transcription factors, or kinases) associated with pathway $P_i$.
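As a minimal illustration of Eq. 2.1, the overlap score could be sketched in Python as below. The function name `overlap_score`, the handling of empty element sets, and the toy transcription-factor sets are assumptions made for this example and are not taken from the thesis implementation.

```python
def overlap_score(elements_i, elements_j):
    """Overlap similarity (Eq. 2.1) between the element sets of two pathways."""
    set_i, set_j = set(elements_i), set(elements_j)
    smaller = min(len(set_i), len(set_j))
    if smaller == 0:
        # Assumption: a pathway without annotated elements yields no evidence.
        return 0.0
    return len(set_i & set_j) / smaller


# Hypothetical transcription-factor sets for two pathways.
tf_pathway_a = {"TF1", "TF2", "TF3"}
tf_pathway_b = {"TF2", "TF3", "TF4", "TF5"}

# Two shared factors divided by the size of the smaller set (3).
score = overlap_score(tf_pathway_a, tf_pathway_b)
print(round(score, 3))
```

Normalizing by the smaller set means that a small pathway fully contained in a larger one receives the maximum score of 1.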


Scoring pathway crosstalks based on interacting elements

• Physical interaction: Protein interaction has previously been used to identify crosstalks [102, 101] in yeast. The physical interaction between the proteins belonging to different pathways provides a way for pathways to crosstalk.

• Genetic interactions: The use of genetic interactions for identifying pathway crosstalk stems from the concept of “between-pathway” interactions. This essentially states that if there is a genetic interaction between pathways, one pathway covers for the defects in the other pathway.

• Protein domain: Protein function is closely related to fundamental units of protein structure called “domains.” In the domain interaction network, a pair of proteins has an edge if they are associated with the same set of protein domains. These edges are taken into consideration to assess for potential pathway crosstalks.

• Synthetically lethal gene pairs: Gene pairs whose simultaneous low- or non-expression can cause the organism to die are called synthetically lethal pairs [57, 168]. The presence of synthetically lethal pairs of genes across two pathways is a possible sign of pathway crosstalk.

• “Interaction-based” score: For each pair of pathways, $P_i$ and $P_j$, we define the scoring function as follows

\[
\mathrm{Interaction}_{\mathrm{score}}(P_i, P_j) = \frac{N_{\mathrm{inter}}(P_i, P_j)}{|\Upsilon(P_i)| \cdot |\Upsilon(P_j)|} \tag{2.2}
\]

where $N_{\mathrm{inter}}(P_i, P_j)$ is the number of interactions (genetic, physical, domain, synthetically lethal) that exist among the proteins associated with pathway $P_i$ and the proteins associated with pathway $P_j$.
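The interaction-based score of Eq. 2.2 could be sketched as follows. This is an illustrative sketch only: the function name `interaction_score`, the representation of an evidence network as a list of undirected protein pairs, and the toy data are assumptions, and each listed pair is counted at most once.

```python
def interaction_score(proteins_i, proteins_j, interactions):
    """Interaction-based score (Eq. 2.2) for one evidence network.

    `interactions` is an iterable of undirected (protein_a, protein_b) pairs.
    """
    set_i, set_j = set(proteins_i), set(proteins_j)
    if not set_i or not set_j:
        return 0.0  # assumption: empty pathways give no evidence
    n_inter = 0
    for a, b in interactions:
        # Count an edge when its endpoints lie in the two different pathways.
        if (a in set_i and b in set_j) or (a in set_j and b in set_i):
            n_inter += 1
    return n_inter / (len(set_i) * len(set_j))


# Hypothetical pathways and physical interactions.
pathway_a = {"A", "B"}
pathway_b = {"C", "D", "E"}
ppi_edges = [("A", "C"), ("D", "B"), ("A", "B"), ("C", "E")]

# Only A-C and D-B connect the two pathways: 2 / (2 * 3).
score = interaction_score(pathway_a, pathway_b, ppi_edges)
print(round(score, 3))
```

The denominator is the number of possible cross-pathway protein pairs, so the score can be read as the fraction of those pairs that are observed to interact.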

Significance estimation of pathway crosstalk scores

We use Monte Carlo p−value estimation to normalize and assess the significance of the scores

obtained for the pathway crosstalks using different evidences. The Monte Carlo method [125] is

a robust statistical significance assessment method. Each pathway is randomized by replacing

a protein from that pathway with a randomly chosen protein from the set of all proteins in

the organism. This pathway randomization step is repeated W times (W ≈ 1000), i.e., we

obtain W sets of randomized pathways. For each of the evidences and each set of randomized

pathways, the evidence-specific score is recalculated for all distinct pathway pairs. We estimate

an empirical evidence-specific p−value as R/W for each pathway pair, where R is the number


of randomized versions of the pathway pair that produce an evidence-specific score greater than

or equal to the original score obtained.
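The Monte Carlo estimate described above could be sketched as follows. This is a simplified sketch under stated assumptions: each randomized pathway keeps its original size and all of its members are redrawn from the full protein set, the overlap score of Eq. 2.1 stands in for the evidence-specific score, and the names `monte_carlo_p_value` and `overlap` as well as the toy proteome are hypothetical.

```python
import random


def overlap(set_i, set_j):
    """Overlap similarity (Eq. 2.1), used here as the evidence-specific score."""
    smaller = min(len(set_i), len(set_j))
    return len(set_i & set_j) / smaller if smaller else 0.0


def monte_carlo_p_value(pathway_i, pathway_j, all_proteins,
                        score_fn=overlap, w=1000, seed=0):
    """Empirical p-value R/W: share of randomized pairs scoring >= the original."""
    rng = random.Random(seed)
    universe = sorted(all_proteins)  # fixed order keeps the run reproducible
    observed = score_fn(set(pathway_i), set(pathway_j))
    r = 0
    for _ in range(w):
        # Assumption: size-preserving randomization from the whole proteome.
        rand_i = set(rng.sample(universe, len(pathway_i)))
        rand_j = set(rng.sample(universe, len(pathway_j)))
        if score_fn(rand_i, rand_j) >= observed:
            r += 1
    return r / w


proteins = {f"prot{k}" for k in range(200)}  # hypothetical proteome
shared = {"prot0", "prot1", "prot2", "prot3", "prot4"}

p_identical = monte_carlo_p_value(shared, shared, proteins)
p_disjoint = monte_carlo_p_value({"prot10", "prot11"}, {"prot20", "prot21"}, proteins)
print(p_identical, p_disjoint)
```

With W = 1000 randomizations the smallest non-zero p-value that can be resolved is 0.001, which is worth keeping in mind when a threshold such as 0.01 is applied later.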

Combining the scores for each pathway crosstalk

For each pathway pair, we combine the p−values obtained from each of the individual evidences.

This gives us a combined estimation for crosstalk likelihood between the pathway pair. To

combine the p−values, we use the q−fast information fusion methodology derived by Bailey

and Gribskov [6], which is based on a theorem defined by Feller [36]. q−fast uses the product

of the individual p−values as a test statistic to calculate the combined p−value. Usage of the

product of p−values as a test statistic has been shown to be a desirable method for information

fusion [6]. One issue to consider is that some pathway pairs may not be scored by some of

the evidences due to missing data. For those cases, we assign a p−value of 1 to denote that

that particular evidence offers no information about those pathways crosstalking. The q−fast

formula to calculate the combined p-value for the n evidences is

\[
\left(\prod_{i=1}^{n} p_i\right) \sum_{i=0}^{n-1} \frac{\left(-\ln \prod_{j=1}^{n} p_j\right)^{i}}{i!} \tag{2.3}
\]

where pi is the p-value obtained for evidence i. This combined p-value is obtained for each

pathway pair. The generic pathway crosstalk reference map is built with nodes as pathways and

edges representing a statistically significant (threshold α = 0.01) crosstalk between a pathway

pair.
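A minimal sketch of this fusion step is given below. It uses the product-of-p-values form described above; the function name `qfast_combined_p`, the handling of a zero product, and the example p-values are assumptions made for illustration.

```python
import math


def qfast_combined_p(p_values):
    """Combine n p-values using their product as the test statistic."""
    n = len(p_values)
    product = math.prod(p_values)
    if product == 0.0:
        # Assumption: an empirical p-value of 0 makes the combined value 0.
        return 0.0
    neg_log = -math.log(product)
    total = sum(neg_log ** i / math.factorial(i) for i in range(n))
    return min(1.0, product * total)


# Hypothetical per-evidence p-values for one pathway pair; the trailing 1.0
# marks an evidence with missing data, as described in the text.
combined = qfast_combined_p([0.05, 0.05, 1.0])
is_edge = combined < 0.01  # threshold alpha = 0.01 from the text
print(round(combined, 4), is_edge)
```

Note that an evidence with p = 1 leaves the product unchanged but still increases n, so it adds one more term to the sum and makes the combined p-value larger than it would be if that evidence were left out.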

2.1.2 Identification of Patient-specific Pathway Crosstalks

The crosstalks identified so far are generic. To analyze their effect on a specific phenotype, in our use case Alzheimer’s disease, we need to identify those crosstalks that are active in individual patients. For that purpose, we make use of an additional data source, single nucleotide polymorphism (SNP) data. SNPs are variations in the DNA sequence at particular locations. SNPs can generate biological variation between people by causing differences in the genetic codes for proteins that are written in genes. These variations can in turn influence phenotypes such as disease proneness or reaction to different drugs. Because DNA is inherited from both parents, a person inherits one version (allele) of each SNP from each parent. Initiatives such as the Alzheimer’s Disease Neuroimaging Initiative (ADNI) collect patient-specific information for a large number of SNPs (620,901 SNPs), and this information is incorporated to transform the generic pathway crosstalk signatures into patient-specific pathway crosstalk signatures.

The pathway-crosstalks for a patient are obtained using the following four steps.


1. Obtain a mapping of SNPs to pathways using genes.

2. Identify the list of SNPs that are active (or present) in a patient.

3. Use the mapping obtained in Step 1 and the patient-specific SNP list in Step 2 to obtain

the pathways that are “SNP-enriched” in the patient.

4. Utilize the “SNP-enriched” pathways from Step 3 to obtain patient-specific pathway

crosstalks.

Obtain a mapping of SNPs to pathways

Every SNP is assigned a chromosome number and a location on the genome, which can be used

to map SNPs to genes and in turn to pathways. Starting with a list of all genes that map to at

least one pathway, we assign a SNP to a gene if it is present within 10 kilo base pairs distance

of that gene. This method has been previously used by Silver et al. [160, 161]. Note that since

SNPs are mapped to all genes within a range of 10 kbp, the same SNP may map to more than

one gene. The set of SNPs assigned to a pathway is the union of all SNPs assigned to the genes

in the pathway.
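The mapping step could be sketched as follows. This is an illustrative sketch: the interpretation of "within 10 kbp" as the gene span extended by 10,000 bp on both sides, the dictionary layouts, the function names, and all identifiers and coordinates are assumptions. The nested loop is fine for toy data; for hundreds of thousands of SNPs a sorted or interval-based lookup would be preferable.

```python
def map_snps_to_genes(snps, genes, window=10_000):
    """Assign each SNP to every gene whose span, padded by `window` bp, contains it.

    snps:  {snp_id: (chromosome, position)}
    genes: {gene_id: (chromosome, start, end)}
    """
    gene_snps = {gene: set() for gene in genes}
    for snp_id, (chrom, pos) in snps.items():
        for gene, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start - window <= pos <= end + window:
                gene_snps[gene].add(snp_id)
    return gene_snps


def map_snps_to_pathways(pathway_genes, gene_snps):
    """A pathway's SNP set is the union of the SNP sets of its genes."""
    return {
        pathway: set().union(*(gene_snps.get(g, set()) for g in members))
        for pathway, members in pathway_genes.items()
    }


# Hypothetical coordinates and identifiers.
genes = {"gene1": ("chr1", 100_000, 120_000), "gene2": ("chr1", 125_000, 140_000)}
snps = {"snp1": ("chr1", 95_000), "snp2": ("chr1", 122_000), "snp3": ("chr2", 122_000)}
pathway_genes = {"pathwayA": ["gene1", "gene2"], "pathwayB": ["gene2"]}

gene_snps = map_snps_to_genes(snps, genes)
pathway_snps = map_snps_to_pathways(pathway_genes, gene_snps)
print(gene_snps, pathway_snps)
```

In this toy input, snp2 lies within the window of both genes, which illustrates how the same SNP may map to more than one gene.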

Identify patient-specific lists of SNPs that are active

A genetic model decides the minor allele count required for a SNP to be active (or present) in

a person. In the current study we fix our genetic model to be the homozygous minor (recessive)

model. This genetic model requires a minor allele count of 2 for a SNP to be considered active,

i.e., the minor allele is inherited from both parents. For each patient, we decide whether each

SNP is active or not. This gives us a per-patient list of SNPs that are active.
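Under the homozygous minor (recessive) model, deciding which SNPs are active reduces to a threshold on the minor allele count. The sketch below assumes allele counts are stored per patient as a dictionary; the function name `active_snps`, the `required_count` parameter, and the toy values are illustrative.

```python
def active_snps(minor_allele_counts, required_count=2):
    """Return the SNPs whose minor allele count meets the genetic model.

    required_count=2 corresponds to the homozygous minor (recessive) model.
    """
    return {snp for snp, count in minor_allele_counts.items()
            if count >= required_count}


# Toy minor allele counts (0, 1, or 2) per patient.
patients = {
    "patient1": {"snp1": 1, "snp2": 0, "snp3": 1},
    "patient2": {"snp1": 1, "snp2": 2, "snp3": 0, "snp4": 1},
    "patient3": {"snp1": 0, "snp2": 2, "snp3": 1, "snp5": 2},
}

active = {pid: active_snps(counts) for pid, counts in patients.items()}
print(active)
```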

Identify patient-specific SNP-enriched pathways

We define the notion of SNP-enriched pathways based on a new scoring function. We first identify a set of SNPs of interest, $\mathit{SNP}_{\mathit{interest}}$. $\mathit{SNP}_{\mathit{interest}}$ can be the union of all SNPs found on the human genome or a list of SNPs from other sources such as scientific literature. Given the set of SNPs assigned to a pathway, $\mathit{SNP}_{\mathit{pathway}}$, and the set of SNPs active (or present) in a patient, $\mathit{SNP}_{\mathit{patient}}$, the enrichment score for this pathway and patient, taking into consideration the SNPs of interest, is calculated as

\[
\mathrm{Enrichment}(\mathit{patient}, \mathit{pathway}) = \frac{|\mathit{SNP}_{\mathit{patient}} \cap \mathit{SNP}_{\mathit{pathway}} \cap \mathit{SNP}_{\mathit{interest}}|}{|\mathit{SNP}_{\mathit{interest}}|} \tag{2.4}
\]


A p-value for the enrichment is calculated using the Monte Carlo method discussed previ-

ously. The pathways with a p-value below a certain threshold for a patient are referred to as

“SNP-enriched” for that patient.
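The enrichment score of Eq. 2.4 is a ratio of set sizes and could be sketched as follows. The function name `enrichment_score`, the treatment of an empty set of interest, and the toy SNP sets are assumptions made for this example; the p-value step would reuse a Monte Carlo procedure like the one described previously.

```python
def enrichment_score(snp_patient, snp_pathway, snp_interest):
    """Enrichment (Eq. 2.4): SNPs that are active in the patient, assigned to
    the pathway, and of interest, relative to the number of SNPs of interest."""
    if not snp_interest:
        return 0.0  # assumption: no SNPs of interest means no enrichment
    return len(snp_patient & snp_pathway & snp_interest) / len(snp_interest)


# Hypothetical SNP sets.
snp_patient = {"s2", "s5", "s9"}
snp_pathway = {"s2", "s3", "s5", "s7"}
snp_interest = {"s1", "s2", "s3", "s5", "s8"}

# s2 and s5 satisfy all three conditions: 2 / 5.
score = enrichment_score(snp_patient, snp_pathway, snp_interest)
print(score)
```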

Identify patient-specific pathway crosstalks that are active

A crosstalking pathway pair is considered active in a patient if both pathways are SNP-enriched, i.e., they both have a significant SNP-enrichment score with respect to the patient. This stringent condition ensures that we do not incorporate noise. Identifying the crosstalking pathway pairs that are active is equivalent to identifying those edges of the pathway crosstalk reference map that are present in this patient. This translates to creating a patient-specific pathway crosstalk map from the generic pathway crosstalk reference map, analogous to obtaining an organism-specific pathway map from the KEGG pathway reference map.
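Instantiating the reference map for one patient could be sketched as an edge filter. The edge-list representation, the function name `patient_crosstalk_map`, and the pathway identifiers below are illustrative assumptions.

```python
def patient_crosstalk_map(reference_edges, enriched_pathways):
    """Mark each reference-map edge as present (1) only when both of its
    pathways are SNP-enriched in the patient, otherwise absent (0)."""
    return {
        (p, q): int(p in enriched_pathways and q in enriched_pathways)
        for p, q in reference_edges
    }


# Hypothetical reference-map edges and one patient's SNP-enriched pathways.
reference_edges = [
    ("hsa0030", "hsa0040"),
    ("hsa0030", "hsa0050"),
    ("hsa0030", "hsa0060"),
    ("hsa0060", "hsa0070"),
]
enriched = {"hsa0030", "hsa0040", "hsa0050", "hsa0060"}

edge_presence = patient_crosstalk_map(reference_edges, enriched)
print(edge_presence)
```

The resulting 0/1 vector over reference-map edges is one natural way to represent a patient as a feature vector for later analysis.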

2.1.3 Identification of Significant Pathway Crosstalks

We calculate the bias of each SNP-enriched pathway and of each active pathway crosstalk towards AD converters and non-converters. The bias is quantified using the hypergeometric test, and a SNP-enriched pathway or active pathway crosstalk is considered statistically significant if the resulting p-value is below 0.01. The significant pathways and significant pathway crosstalks are incorporated as features for disease prognosis. The bias of an active pathway crosstalk towards converters is calculated as:

\[
\phi(n, x, v, w) = \sum_{i=w}^{v} \frac{\binom{v}{i}\binom{n-v}{x-i}}{\binom{n}{x}} \tag{2.5}
\]

where we define

• Population: n is the total number of patients

• Success in population: x is the total number of converters and y is the total number of non-converters

• Sample: v is the total number of patients (both converters and non-converters) in which a pathway crosstalk is active

• Success in sample: w is the number of converters and z is the number of non-converters in which the pathway crosstalk is active


Similarly, the bias of an active pathway crosstalk towards non-converters can be calculated via

φ (n, y, v, z) .
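The hypergeometric bias could be computed with exact integer binomials, as in the sketch below. It evaluates the upper tail, i.e., the probability of observing at least w converters among the v patients in which a crosstalk is active. The function name `hypergeom_bias` and the values of v, w, and z are hypothetical; n = 91, x = 41, and y = 50 follow the cohort sizes reported in this chapter.

```python
from math import comb


def hypergeom_bias(n, x, v, w):
    """Upper-tail hypergeometric probability of at least w successes.

    n: population size, x: successes in the population,
    v: sample size,     w: successes observed in the sample.
    """
    upper = min(v, x)  # the sample cannot contain more successes than exist
    tail = sum(comb(v, i) * comb(n - v, x - i) for i in range(w, upper + 1))
    return tail / comb(n, x)


# Hypothetical crosstalk active in 20 patients, 15 of them converters.
n, x, y = 91, 41, 50
v, w = 20, 15
z = v - w

bias_converters = hypergeom_bias(n, x, v, w)
bias_non_converters = hypergeom_bias(n, y, v, z)
print(bias_converters, bias_non_converters)
```

A small value for one class together with a large value for the other indicates that the crosstalk is concentrated in that class.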

2.2 Results

2.2.1 Datasets

Datasets used were primarily obtained from the Alzheimer’s Disease (AD) Neuroimaging Ini-

tiative (ADNI) database [187] (adni.loni.ucla.edu, www.adni-info.org). ADNI launched in 2004

as a multi-site study with the goal of collecting a wide range of longitudinal data in 200 healthy

elderly control subjects, 400 subjects with MCI, and 200 subjects with AD. The data included a

wide array of neuropsychological tests, genetic data (including SNPs), CSF biomarkers, MRIs,

and PET scans, and subjects were followed up every 6 months for up to 4 years. The measured

outcome of the ADNI study was whether or not a patient converted to AD during this time

period.

In this retrospective study, we look into the specific prognosis problem of patients progressing from Mild Cognitive Impairment (MCI) to Alzheimer’s disease (AD). Patients with MCI have an increased risk of progressing to AD. Thus, it is important to establish how well conversion to AD can be predicted once MCI is detected.

We utilized the dataset from an earlier study [158] based on ADNI. That study identified 97 MCI patients and predicted conversion to AD based on their clinical parameters, MRI results, PET scans, CSF markers (tau, p-tau181P, and β-amyloid1-42), apolipoprotein E (ApoE) genotype, and results from at least one follow-up clinical examination as of October 19, 2010. The data preprocessing is detailed in the publication [158] pertaining to this previous study. Out of the 97 patients from the earlier study, only 91 have corresponding SNP data in the ADNI database. Hence, for the current study we utilized only these 91 patients. However, this reduction in the number of patients did not substantially affect the ratio of converters to non-converters. The original study had 43 converters and 54 non-converters, and the reduced dataset has 41 converters and 50 non-converters. Thus, there is still sufficient representation of the two classes of patients. According to Shaffer et al. [158], four MR imaging and nine FDG PET components were extracted from the raw imaging dataset using independent component analysis (ICA). Notably, the components were not extracted on the basis of the known outcome of the study; rather, they were extracted to account for variance across the entire MCI study population without regard to future conversion status. The MRI and PET components from [158] were not recalculated on the basis of each fold of cross-validation. Due to the lack of the original data, we were unable to perform ICA for each fold. We suggest that the results that include these components be interpreted with caution. As future work, we plan to analyze the effect of per-fold re-computation of ICA components on the performance of the corresponding models.

The pathway subsystems were downloaded from the Kyoto Encyclopedia of Genes and

Genomes database (KEGG) [75, 76, 77]. We obtained evidences for human physical interaction,

genetic interaction, and synthetic lethal gene pairs from BioGRID [165], domain interaction

from GeneMania [186], transcription factors from the FANTOM database [81, 82], and protein phosphorylation data from [100]. We obtained SNPs associated with genes manually curated to Alzheimer’s

disease in the Comparative Toxicogenomics Database [30] and a compilation of genes from lit-

erature that have been identified as likely risk factors for Alzheimer’s from SNPedia [20]. This

information was utilized as our biologically meaningful knowledge priors (Table 2.1).

Table 2.1: Subset of genes from CTD [30] and their association to Alzheimer’s disease

Gene | Evidence

APP (amyloid beta (A4) precursor protein) | Mutations in this gene have been implicated in autosomal dominant Alzheimer’s disease and cerebroarterial amyloidosis (NCBI Entrez Gene)

IL1-B (interleukin 1, beta) | Four new genetic studies underscore the relevance of IL-1 to Alzheimer pathogenesis, showing that homozygosity of a specific polymorphism in the IL-1A gene at least triples Alzheimer risk, especially for an earlier age of onset and in combination with homozygosity for another polymorphism in the IL-1B gene [118]

SOD2 (superoxide dismutase 2) | A polymorphism in SOD2 is associated with development of Alzheimer’s disease [189]

NOS3 (nitric oxide synthase 3) | NOS3 may be a new genetic risk factor for late onset AD [27]

2.2.2 Sample Characteristics

The mean age for all 91 MCI patients was 74.96±7.32 years (mean ± standard deviation). The male-to-female ratio was 2.37, and 96.7% of subjects were white. A total of 36.26% of subjects had a family history of AD, and 54.94% had a positive finding for the ApoE4 genotype. The mean follow-up duration for all subjects was 31.6±10.6 months. Of these, 41 progressed to AD during follow-up (converters) and 50 did not (non-converters), with converters tending to have longer follow-up times by about 4.5 months. Statistically, converters did not differ from non-converters in mean age, sex ratio, education, race, ethnicity, family history of AD, or ApoE prevalence. See Table 2.2 for details. In the table, unless otherwise indicated, data are given as mean ± standard deviation and p-values were calculated using the t-test. The symbol * indicates that the data in parentheses are numbers of patients. The symbol + indicates p-values obtained using χ2-tests.

Table 2.2: Baseline characteristics of MCI study sample

Subjects with MCI (n = 91)

Characteristic | Converters (n = 41) | Non-converters (n = 50) | p-value#

Age (years) | 75.17±7.30 | 74.78±7.44 | 0.8011

Male-to-female ratio*,+ | 2.42 (29/12) | 2.33 (35/15) | 0.9394

Education (years) | 16.40±2.55 | 15.48±3.23 | 0.1169

Percentage with known family history of AD*,+ | 36.39 (15/41) | 36.00 (18/50) | 0.9539

Percentage with ApoE4*,+ | 60.98 (25/41) | 50.00 (25/50) | 0.4036

Average follow-up time (months) | 34.10±9.70 | 29.64±10.94 | 0.0426

2.2.3 SNP-enriched Pathways

Our analysis identified SNP-enriched pathways representing six of the seven KEGG pathway categories. The pie chart (Figure 2.4) shows the breakdown of the number of enriched pathways in each KEGG category identified in the study (left) compared to the total number of pathways in each KEGG category (right). These pathways represent potential disease mechanisms in AD; the variety of pathway categories supports the complicated nature of the disease, which has been attributed to many different biological mechanisms, from amyloid toxicity to metabolic dysfunction to immune dysregulation.

It should also be noted that there is overlap between processes, as a gene can be involved in more than one pathway. For example, the NF-kappa B signaling pathway is categorized under Signal Transduction, in Environmental Information Processing; however, the NF-kappa B gene is listed in many other KEGG pathways, including MAPK signaling, osteoclast differentiation, apoptosis, infectious diseases such as Tuberculosis and Hepatitis C, several types of cancer, and inflammatory bowel disease. This may explain some of the discrepancy in the number of pathways involved in Cellular Processes, Metabolism, and Information Processing, compared to Human Diseases and Organismal Systems, both of which are likely to involve genes represented in other categories as well.

Figure 2.4: Pie chart showing the distribution of types of SNP-enriched pathways identified in the study and comparison to KEGG pathway distribution

Compared to the percent of pathways categorized as metabolic in KEGG (45%), our study

contains a much smaller percentage of enriched metabolic pathways. While this does not yield

any information on the importance of these pathways, it does indicate that out of the many

human metabolic processes cataloged in KEGG only a small number of specific metabolic

pathways may be important in AD. Importantly, the Human Diseases category is a much larger

percentage of the total in our study (31%) compared to the distribution of KEGG pathways

(18%); this suggests that individuals with MCI do have an increased genetic load in disease-

predisposing pathways.

Table 2.3 lists some of the pathways identified and the following section provides some details

on SNP-enriched pathways from this study that were also associated with memory impairment

in another study by Ramanan et al. [139].

Table 2.3: List of literature-verified pathways found in the study [139]

KEGG ID | Name

hsa04510 | Focal adhesion

hsa04020 | Calcium signaling pathway

hsa04940 | Type I diabetes mellitus

hsa04730 | Long-term depression

hsa04720 | Long-term potentiation

hsa05330 | Allograft rejection

hsa05416 | Viral myocarditis

hsa04360 | Axon guidance

Broadly, these pathways fit into several different KEGG categories, including Cellular Pro-

cesses, Environmental Information Processing, Human Diseases, and Organismal Systems, and

represent some of the risk factors and disease mechanisms that have been widely postulated for

AD, including metabolic and immune dysfunction, impaired memory consolidation processes,

and neural cell development, outgrowth, and adhesion. While these topics are addressed in

more depth in the following section, it is important to note that this study supports our re-

sults, and suggests that a number of different processes may contribute to AD susceptibility

and disease-associated memory impairment.

The first and most obviously enriched pathway category to discuss is Human Diseases; in

addition to AD and other neurodegenerative disorders, this KEGG category encompasses can-

cers and infectious, immune, cardiovascular, and endocrine and metabolic diseases. Diabetes,

obesity and heart diseases are well-established risk factors for AD, so much so that AD has


been referred to as type 3 diabetes; so finding SNP-enriched pathways for cardiovascular and

endocrine and metabolic diseases in individuals with MCI is not unexpected [181]. Evidence

has supported an inverse association of the development of cancer and AD, and it is postu-

lated that this may be due to differential regulation of common signaling pathways [63, 144].

Autoimmune pathways have also previously been implicated in AD, particularly the idea that

a compromised blood brain barrier may allow entry of serum proteins, which lead to neuronal

death via autoimmune mechanisms [28, 149]. Chronic infection with pathogens including herpes

simplex virus type 1, Chlamydia pneumoniae, Helicobacter pylori, and spirochetes [64] has been

implicated in AD and cognitive change, and mouse model research suggests that Toxoplasma

gondii infection may be neuroprotective [74], further linking immune and inflammatory pro-

cesses to AD. Finally, certain molecular markers such as alpha-synuclein protein have linked

AD to other neurodegenerative diseases including Parkinson disease and dementia with Lewy

bodies [2]. These disparate pieces of evidence support the identification in this study of enriched

pathways in human diseases in patients with MCI.

Enriched metabolic pathways in this population include metabolism of carbohydrates, lipids,

amino acids, cofactors and vitamins, energy, and xenobiotics. Glycolysis and gluconeogenesis

are pathways involved in glucose metabolism. One of the hallmarks of early AD is deficits in

glucose metabolism in the brain, which can be clinically detected using FDG-PET. One theory

is that this simply reflects neuronal injury and loss, whereas another theory proposes that there

is a primary metabolic energy deficit in the AD brain. The AD brain has also been called “the starving brain.” The pentose phosphate pathway generates NADPH, which supports antioxidant defenses, and oxidative stress is one of the postulated neurodegenerative mechanisms in AD; pentose phosphate activity is increased in the brains of AD patients in response to oxidative stress.

The pentose phosphate pathway is also involved as an alternate to glycolysis. It has been well

documented that oxidative damage and insulin resistance cause metabolic dysfunction in AD;

a temporal three-part sequence occurs with changes in glycolysis, the tricarboxylic acid cycle

(TCA) and oxidative phosphorylation pathways [160]. Given that apolipoprotein E is the largest

genetic risk factor for AD identified to date, we would expect to find an enrichment of lipid

metabolic pathways. Finally, the xenobiotic cytochrome-P450 gene pathway identified, while

most commonly tied to drug metabolism, has also been implicated in neurodegeneration. For

example, variations in the CYP17 P450 gene have been linked to increased AD risk. Studies have also found alterations in constituent cytochrome enzymes in mouse models of AD. Lastly, NADPH CYP450 reductase is induced by the amyloid peptide, which has been identified as mutated in

familial AD, and such a response could be a possible mechanism whereby amyloid beta causes

oxidative stress. Thus, the identification of enriched metabolic pathways in patients with MCI

is also logical.

Organismal systems enriched in this population include the endocrine, excretory, digestive,


immune, nervous, and sensory systems, as well as development processes. The first and most

obvious link here is the identification of the insulin and adipocytokine signaling pathways, which

are logical given that diabetes is a risk factor for AD. Other pathways in the endocrine system

include the PPAR signaling pathway [110], which is known to interact with the apolipoprotein E

functional pathway, as well as several hormone signaling pathways including the gonadotropin-

releasing hormone signaling pathway, which has been shown to play a potential role in diabetes

[9]. As previously discussed, metabolism has been shown to play an important role in AD, so it

is unsurprising that we observe enrichment in digestive and excretory pathways, as malfunction

could presumably result in unbalanced energy metabolism and compromised metabolic function.

Both up- and downregulation of inflammatory processes have been associated with AD, and some

research has suggested that a combination of inflammation and aging, or inflammaging, may

be a prodromal stage of AD, potentially accompanied by age-related immune dysregulation.

This supports the identification of SNP-enriched immune and inflammatory pathways [45, 106,

112]. Since AD is a neurodegenerative disease, we would expect to observe nervous system

pathway enrichment; in particular, enrichment of long-term potentiation and depression and

neurotransmitter signaling pathways all support our findings, given that these pathways have

been previously associated with cognitive impairment and AD [24, 37, 96, 171].

Cell processes include a number of pathways linked to neurodegeneration via their involve-

ment in cell growth and death, including p53 signaling, autophagy, apoptosis, and cell cycle

regulation. There are also several neural cell adhesion pathways that are enriched in individuals

with MCI, which is supported by previous research indicating that these pathways might be

involved in this disease process [139]. Interestingly, there were also a number of enriched ge-

netic information processing pathways, including DNA replication, transcription, translation,

and protein processing. Enriched pathways in cell cycle regulation and DNA replication are

particularly interesting given that previous research has found that neurons in the brains of AD

patients exhibit markers indicating reentry into the cell cycle, and that neuronal cell death may

be preceded by DNA replication [117, 194, 197]. It has been proposed that DNA replication

resulting in tetraploid cells that do not undergo mitosis may lead to neuronal cell death.

Furthermore, there is some evidence that cells from AD patients have a DNA repair deficiency

after treatment with alkylating agents [14]. Genetic variants increasing risk for cell cycle and/or

transcription/translation misregulation could thus play an important part in AD pathology.

The final category of enriched pathways identified in our study is Environmental Information

Processing; all of these pathways involve signal transduction or cell trafficking, and most could

be predicted based on pathways identified in other categories. For example, the TGF-beta,

NF-kappa B, VEGF, HIF-1, PI3K-Akt, Wnt, Notch, mTOR, MAPK, and Hedgehog signaling

pathways are all annotated in the KEGG Pathways in Cancer, as well as their normal functions

in growth and development. Notch is known to interact with the presenilin genes, in which


mutations have been identified in individuals with familial AD, linking this signaling pathway

to the disease. Calcium signaling is important in a number of mechanisms that have been

discussed as important in AD, including long-term potentiation and depression, as well as

apoptosis and metabolism. Several cell receptor interaction pathways are also enriched, which

might be expected given that AD has been posited to have an environmental risk component

[44]. These common signaling pathways are responsible for normal growth and development,

and many have previously been linked to abnormal processes in disease, so it is not surprising

that differential genetic load in these pathways could impact risk of development of AD as well.

All six categories of enriched pathways identified in our study can be linked to AD through

various mechanisms, and support the complex etiology behind this disease. This systematic

investigation of pathways in individuals with MCI provides evidence that many risk factors and

biological mechanisms are potentially important for disease development; the identification of

many pathways that could be impacted by environment, such as memory processes that could

be impacted by education and occupation, also helps to explain the incomplete penetrance of

disease.

2.2.4 SNP-enriched Pathway Crosstalks

The categories with the greatest numbers of crosstalks are Human Diseases and Organismal

Systems, which we would expect given that these categories also contain the greatest numbers of

pathways; however, it is interesting to note that although the two categories contain an approxi-

mately equal percentage of the total number of pathways tested (31% for Human Diseases, 29%

for Organismal Systems), Human Diseases generated almost twice as many crosstalks as Organ-

ismal Systems. This is most likely attributable to the complex etiology of many diseases within

this category, which could potentially involve many different pathways, but also demonstrates

the potential utility of this type of analysis for prioritizing pathways.

The largest categories of interactions appear to be Human Diseases with Environmental Information Processing, other Human Diseases, and Organismal Systems, most likely indicating process interactions for disease risk and development (Environmental Information Processing), risk factors predisposing to multiple diseases (Human Diseases crosstalks such as AD and diabetes), and disease development, immunity, and pathology (Organismal Systems). Interestingly,

Organismal Systems crosstalks form primarily between Environmental Information Processing

and other Organismal Systems, perhaps representing less pathological genetic factors and mech-

anisms.

Metabolism and Genetic Information Processing show an unexpected dearth of crosstalks

compared to that expected based strictly on percentage of total enriched pathways in the cohort,

suggesting that genetic load in these processes may be less important or have less overall impact


than other types of biological mechanisms in this particular cohort. All but one of the enriched

pathways had at least one significant crosstalk; the one carbon pool by folate pathway did not

have any significant crosstalks. This may be because the pathway is small and specific; it is only

linked to rare, serious diseases in KEGG, including several metabolic disorders and spina bifida.

Because our aim was to investigate the genetic load specifically with regard to AD, we also

examined enriched pathway crosstalk specifically relating to this disease process. Sorting out

interactions involving the KEGG AD pathway, we identified 97 AD-related crosstalks. Identified

pathways were grouped according to category; all six KEGG categories were represented (Figure

2.4).

Given the overall findings of crosstalk enrichment, it is not surprising that the largest cat-

egories are once again Human Diseases and Organismal Systems. These categories are equally

represented in AD crosstalk, echoing the percentage of enriched pathways in these categories

and supporting the importance of both processes in AD genetic load. Interestingly, although

the crosstalks in the broad search did not match the percentages of enriched pathways, for

AD-specific crosstalks the percentages do largely reflect the percentages of enriched pathways;

Human Diseases and Organismal Systems are the largest categories (33% each), followed by

Environmental Information Processing (19%), Cellular Processes (10%), Metabolism (3%), and

Genetic Information Processing (2%). As seen in the broad crosstalk analysis, Metabolism and

Genetic Information Processing have fewer crosstalks than might be expected given the per-

centage of pathways involved, suggesting that genetic load in these processes is not as important

to the disease process, at least in this particular cohort.
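The category percentages quoted for the AD-specific crosstalks can be recomputed from the per-category counts shown in the AD pathway crosstalk figure. The short sketch below is only an illustration of that arithmetic; the dictionary name and the rounding to whole percent are assumptions made for the example.

```python
# Category counts for the 97 AD-related crosstalks, taken from the
# AD pathway crosstalk figure in this section.
ad_crosstalks = {
    "Human Diseases": 32,
    "Organismal Systems": 32,
    "Environmental Information Processing": 18,
    "Cellular Processes": 10,
    "Metabolism": 3,
    "Genetic Information Processing": 2,
}

total = sum(ad_crosstalks.values())

# Share of each category, rounded to whole percent (an assumption that
# matches the precision used in the text).
shares = {name: round(100 * n / total) for name, n in ad_crosstalks.items()}

for name, pct in shares.items():
    print(f"{name}: {ad_crosstalks[name]} of {total} ({pct}%)")
```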

Investigation of pathway crosstalk revealed the presence of a large number of SNP-enriched

pathway interactions. These pathway crosstalks were observed to be particularly common for

human disease and organismal system KEGG pathways, which makes sense given that these

pathways are known to involve many biological mechanisms. Interestingly, the large number

of enriched disease and organismal system pathway crosstalks present in the dataset might

suggest a core set of biological mechanisms are central to many different disorders including

infectious diseases, nervous disorders, and cancer. One example of this is the calcium signaling

pathway; this SNP-enriched pathway shows crosstalk with AD, as expected given the method

of identification, but also crosstalks with other nervous system disorders including Huntington’s
disease and amyotrophic lateral sclerosis (ALS), infectious disease pathways, and a number of
different types of cancers, among others. This indicates that the pathway is an important signaling
hub, and that increased SNP load in this pathway could have wide-ranging systemic effects.

Focal cell adhesion is another interesting example of a signaling pathway with wide-ranging

effects; in addition to AD, this pathway shows crosstalk with immune pathways and infectious,

immune, and cardiomyopathic diseases, many types of cancer, nervous system pathways and

diseases such as ALS, diabetes and insulin signaling, and cell growth, development, and signal

transduction pathways.

Alzheimer’s Disease Pathway Crosstalk

Human Diseases (32)
• Cancer (9)
• Cardiovascular diseases (3)
• Endocrine and metabolic diseases (3)
• Immune diseases (3)
• Infectious diseases (11)
• Neurodegenerative diseases (3)

Environmental Information Processing (18)
• Signal transduction (14)
• Signaling molecules and interaction (3)
• Membrane transport (1)

Genetic Information Processing (2)
• Protein folding, sorting, and degradation (1)
• Transcription (1)

Cellular Processes (10)
• Cell communication (2)
• Cell growth and death (4)
• Cell motility (1)
• Transport and catabolism (3)

Metabolism (3)
• Amino acid metabolism (2)
• Energy metabolism (1)

Organismal Systems (32)
• Immune system (12)
• Endocrine system (6)
• Nervous system (6)
• Digestive system (2)
• Excretory system (2)
• Sensory system (2)
• Development (2)

Figure 2.5: Alzheimer’s disease pathway crosstalk: pathways found to have significant crosstalk with the AD pathway fall into KEGG categories including Human Diseases, Environmental Information Processing, Genetic Information Processing, Cellular Processes, Metabolism, and Organismal Systems. Specific KEGG pathway types are listed below each category, and the number of occurrences is listed in parentheses.

Focusing in on the AD pathway, we observe significant crosstalk in all other pathway
categories previously mentioned, including Human Diseases, Organismal Systems, Environmental

Information Processing, Cellular Processes, Metabolism, and Genetic Information Processing,

supporting the complex etiology of this disease (Figure 2.5). These interactions further sup-

port previous research linking many other disease processes and mechanisms to this disease, as

previously discussed. The enriched interactions of immune, inflammatory, cell growth, death,

and signaling, and metabolic pathways in this relatively small cohort suggest that AD is more
likely to result from risk factors and environmental cues impacting many different biological
pathways than from any one specific biological mechanism. Given the interactive nature

of many of these pathways, as demonstrated by the current study, future therapeutic efforts

should concentrate on developing methods to target multiple biological mechanisms.

2.2.5 Prognosis of MCI to AD Conversion

We analyzed the additive power of significant pathways and significant pathway crosstalks by
comparing the performance of prediction models using all imaging (MRI, PET) and
bio-specimen (CSF) biomarkers and clinical parameters alone with models that also included the


SNP-enriched pathways or active pathway crosstalks. We used a support vector machine (SVM)

with a linear kernel to build the classification models that predict conversion from MCI to AD.

We used linear SVMs as our classification model, because they have been used previously to

classify AD and other types of dementia [29, 87]. SVMs also allow for straightforward data inte-

gration of several data types [98]. The validation was performed using 10-fold cross-validation.

The 10-fold cross-validation was repeated 100 times to remove the bias of choosing a good or

bad cross-validation by chance. We reported mean and standard deviation for test accuracy

and percentage of support vectors for each model. For models that incorporated significant

pathways or significant pathway crosstalks, we utilized the bias function to perform feature

selection at each fold.
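The validation protocol described above (10-fold cross-validation repeated 100 times, reporting mean and standard deviation of test accuracy) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in this study: the function names are hypothetical, the folds are plain random partitions, the accuracy of each repetition is pooled over its folds, and a majority-class predictor stands in for the linear SVM so that the example needs only the standard library. The 50/41 class split of the toy labels is inferred from the confusion counts in the result tables.

```python
import random
import statistics


def repeated_kfold_accuracy(samples, labels, fit_predict, k=10, repeats=100, seed=0):
    """Return (mean, standard deviation) of accuracy in % over repeated k-fold CV."""
    rng = random.Random(seed)
    n = len(samples)
    accuracies = []
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)                       # new random partition for each repeat
        folds = [order[i::k] for i in range(k)]  # k folds of nearly equal size
        correct = 0
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            # Any per-fold feature selection would happen inside fit_predict,
            # using the training part only.
            predictions = fit_predict(
                [samples[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [samples[i] for i in test_idx],
            )
            correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
        accuracies.append(100.0 * correct / n)   # pooled accuracy of this repeat
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def majority_class(train_x, train_y, test_x):
    """Placeholder classifier (NOT an SVM): predict the most common training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_x)


# Toy data: 91 patients, 50 converters (1) and 41 non-converters (0).
toy_samples = list(range(91))
toy_labels = [1] * 50 + [0] * 41
mean_acc, sd_acc = repeated_kfold_accuracy(toy_samples, toy_labels, majority_class)
print(f"accuracy: {mean_acc:.2f}% ± {sd_acc:.2f}%")
```

Replacing `majority_class` with a function that trains a linear SVM on the training part and predicts the held-out part would give the kind of estimate reported in this section.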

The current state-of-the-art method by Shaffer et al [158] utilized a combination of PET, CSF,

and MRI biomarkers to produce an accuracy of 71.6%. The clinical parameters in Shaffer et

al [158] included age, education, ADAS-Cog, and ApoE genotype. MRI, PET, and CSF were

defined as AD biomarkers. In this section we will refer to age, education, and ADAS-Cog as

baseline clinical parameters and explicitly mention whenever ApoE is included in the model.

We specify this distinction, because the ApoE4 allele is traditionally an important genetic

biomarker for AD. Hence, we wanted to compare models that included and excluded ApoE to

understand its impact on the 91 patients.

As a reminder, we defined significant pathways as SNP-enriched pathways with a bias below

0.01. Significant pathway crosstalks were defined as active pathway crosstalks with a bias below

0.01.

As a brief summary, we point out that including significant pathways in the model in-

creased the model accuracy and reduced the number of support vectors used in nearly all cases.

Moreover, incorporating significant pathway crosstalks instead of significant pathways further

improved the accuracy and reduced the number of support vectors in most models.

The overall best performance was achieved by the model built with baseline clinical pa-

rameters (age, education, and ADAS-Cog), PET, CSF biomarkers, and significant pathway

crosstalks, which had an accuracy of 82.72%±2.82% (mean ± standard deviation) and

33.42%±0.57% (mean ± standard deviation) of training data points as support vectors. Our sec-

ond best model is based on the best model and further included ApoE. It achieved 82.00%±2.69%

accuracy and 33.43%±0.61% support vectors. The detailed results are discussed in the following

subsections.

SNP-enriched features with baseline clinical parameters

In this section we discuss the results of the models built using baseline clinical parameters

(age, education, and ADAS-Cog, but not ApoE), significant pathways, and significant pathway


crosstalks. The best accuracy was achieved for the model using baseline clinical parameters and

significant pathway crosstalks. See Table 2.4 for all results.

The model with the baseline clinical parameters age, education, and ADAS-Cog produced

an accuracy of 59.17%±2.28% (mean ± standard deviation) with 83.69%±0.25% of training

data points as support vectors. The model built with significant pathways only produced an

accuracy of 56.86%±3.20% with 68.15%±1.85% support vectors. Typically, we expect that a

random guessing model would get 50% of the predictions correct; both of these models only

perform moderately above a random model.

A high percentage of support vectors indicates a model that is overfitted and unlikely to
generalize well. Thus, if two models produce the same accuracy, we pick the model that has
the lower percentage of support vectors. Both models above used 68% or more of the training
data points as support vectors, which indicates highly overfitted models, as reflected in the
poor cross-validation accuracy.

The model that included both baseline clinical parameters and significant pathways
produced an accuracy of 64.48%±3.39% with 63.11%±1.14% support vectors. This combined model
resulted in an approximately 5.31% increase (calculated as the difference between the mean
accuracies of two models, in this case, 64.48% - 59.17% = 5.31%) in accuracy compared to the

model that uses baseline clinical parameters alone and approximately 7.62% increase in accu-

racy compared to the model that uses significant pathways alone. Additionally, the combined

model selected fewer support vectors. Specifically, compared to the model with baseline clinical

parameters, the combined model reduced the percentage of data points as support vectors by

approximately 20.58% (calculated as the difference between the mean percentages of training

data points as support vectors of two models). Likewise, compared to the model using only sig-

nificant pathways, the combined model reduced the percentage of data points as support vectors

by approximately 5.04%. Thus, the model that included both baseline parameters and
significant pathways is likely to generalize better than models built using either clinical parameters

alone or significant pathways alone.
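Because every comparison in this section is a plain difference of means (in percentage points), the figures quoted above can be checked directly against Table 2.4. The sketch below is only an illustration; the dictionary layout and the helper name `difference` are assumptions introduced for the example.

```python
# Mean accuracy and mean percentage of support vectors, copied from Table 2.4.
models = {
    "clinical": {"accuracy": 59.17, "support_vectors": 83.69},
    "pathways": {"accuracy": 56.86, "support_vectors": 68.15},
    "clinical+pathways": {"accuracy": 64.48, "support_vectors": 63.11},
}


def difference(metric, model, reference):
    """Difference of the two means in percentage points, rounded to two decimals."""
    return round(models[model][metric] - models[reference][metric], 2)


# Accuracy gains of the combined model over each single-source model.
acc_vs_clinical = difference("accuracy", "clinical+pathways", "clinical")
acc_vs_pathways = difference("accuracy", "clinical+pathways", "pathways")

# Negative values mean the combined model uses fewer support vectors.
sv_vs_clinical = difference("support_vectors", "clinical+pathways", "clinical")
sv_vs_pathways = difference("support_vectors", "clinical+pathways", "pathways")

print(acc_vs_clinical, acc_vs_pathways, sv_vs_clinical, sv_vs_pathways)
```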

Similarly, the combined model that included both baseline clinical parameters and signifi-

cant pathway crosstalks produced an accuracy of 71.06%±3.23% with 54.37%±0.59% support

vectors whereas the model using only significant pathway crosstalks produced an accuracy of

60.88%±3.03% with 50.68%± 5.01% support vectors. The combined model resulted in approx-

imately 11.89% increase in accuracy compared to the model with baseline clinical parameters

alone and approximately 10.18% increase in accuracy compared to the model that uses signifi-

cant pathway crosstalks alone. Again, the combined model used 29.32% fewer support vectors

compared to the model that used baseline clinical parameters alone. The combined model had

3.69% more support vectors compared to the model that uses significant pathway crosstalks

alone. Also, this combined model resulted in a 6.58% increase in accuracy and an 8.74% decrease


in support vectors when compared to the model that uses both baseline parameters and signif-

icant pathways. Thus, including significant pathway crosstalks, instead of significant pathways,

in a model that also contains baseline clinical parameters produced an improved model.

In all tables of the Results section, unless otherwise noted, support vector machines with

linear kernel functions were used as classification method for MCI to AD conversion; mean and

standard deviation of 100 iterations of 10-fold cross-validation results are shown.


Table 2.4: Performance of models with baseline clinical parameters

Metrics               Clinical: age,   Clinical +    Clinical +    Significant   Significant
                      education, and   significant   significant   pathways      pathway
                      ADAS-Cog         pathways      pathway       (only)        crosstalks
                                                     crosstalks                  (only)
Accuracy in %         59.17±2.28       64.48±3.39    71.06±3.23    56.86±3.20    60.88±3.03
Support Vectors in %  83.69±0.25       63.11±1.14    54.37±0.59    68.15±1.85    50.68±5.01
True Positives        30.66±1.63       33.85±2.57    38.04±1.87    31.02±3.73    39.84±3.54
False Negatives       19.34±1.63       16.15±2.57    11.96±1.87    18.98±3.73    10.16±3.54
False Positives       17.84±1.24       16.19±2.03    14.39±1.84    20.29±2.63    25.48±4.35
True Negatives        23.16±1.24       24.81±2.03    26.61±1.84    20.71±2.63    15.52±4.35
Sensitivity           0.61±0.03        0.68±0.05     0.76±0.04     0.62±0.07     0.80±0.07
Specificity           0.56±0.03        0.61±0.05     0.65±0.04     0.51±0.06     0.38±0.11
Precision             0.63±0.02        0.68±0.03     0.73±0.03     0.60±0.03     0.61±0.03
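The derived rows of Table 2.4 follow from the four confusion counts in the usual way. The sketch below applies the standard definitions to the mean counts of the first column (baseline clinical parameters only). The function name is hypothetical, and because the table reports means over 100 repetitions, ratios of mean counts only approximate the tabulated means.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard metrics from confusion counts (fractions, not percentages)."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),     # positive predictive value
    }


# Mean counts from the first column of Table 2.4 (91 patients in total).
clinical = confusion_metrics(tp=30.66, fn=19.34, fp=17.84, tn=23.16)

for name, value in clinical.items():
    print(f"{name}: {value:.2f}")
```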


SNP-enriched features with baseline clinical parameters and ApoE

This section contains a discussion of the results of the models built using baseline clinical

parameters and ApoE, significant pathways and significant pathway crosstalks. The models

exhibit the same trend as the models in the previous section (see Table 2.5 for details). Again, we

found the addition of SNP-enriched data (significant pathways or significant pathway crosstalks)

improves the accuracy and also considerably brings down the number of support vectors in

models using baseline clinical parameters and ApoE (the better model of the two used significant

pathway crosstalks).

The model with baseline clinical parameters and ApoE produced an accuracy of 56.90%±2.64%

with 83.52%±0.31% support vectors. The model only performed marginally better than a ran-

dom guess. Similar to models that used baseline parameters alone, the model that included

ApoE also had a high percentage of training data points used as support vectors and hence

the model is likely to have overfitted the data. However, there was no noteworthy difference

between the model that used baseline parameters alone and the one that additionally included

ApoE (2.27% difference in accuracy and 0.17% difference in support vectors).

The combined model that included baseline clinical parameters, ApoE, and significant path-

ways gave an accuracy of 63.33%±3.63% with 63.04%±1.14% support vectors. Compared to the

model that uses baseline clinical parameters and ApoE alone, this combined model increased

accuracy by 6.43% and reduced support vectors by 20.48%. Compared to the model that used

significant pathways alone, the combined model had a 6.47% higher accuracy and 5.11% fewer

support vectors. Thus, the model that included baseline parameters, ApoE, and significant

pathways is likely to generalize better than models built using either just significant pathways

or baseline clinical parameters and ApoE.

We found little difference between the model that used baseline parameters and significant

pathways alone and the one that additionally includes ApoE4 (1.15% difference in accuracy

and 0.07% difference in support vectors).

The model that included baseline clinical parameters, ApoE, and significant pathway crosstalks

produced an accuracy of 69.81%±3.13% with 53.78%±0.65% support vectors. It is likely to
generalize better than the model built using just clinical parameters and ApoE, because there
was a 29.74% decrease in support vectors (and a 12.91% increase in accuracy) compared to that
model. Compared to the model that used significant pathway crosstalks alone, it had an 8.93%
increase in accuracy but a 3.10% increase in support vectors (53.78% versus 50.68%).

Including significant pathway crosstalks instead of significant pathways produced an im-

proved model, because we found a 6.48% increase in accuracy and 9.26% decrease of support

vectors when compared to the model that used baseline parameters, ApoE, and significant

pathways.

Table 2.5: Performance of models with baseline clinical parameters with ApoE

Metrics               Clinical + ApoE     Clinical + ApoE     Clinical + ApoE
                                          + significant       + significant
                                          pathways            pathway crosstalks
Accuracy in %         56.90±2.64          63.33±3.63          69.81±3.13
Support Vectors in %  83.52±0.31          63.04±1.14          53.78±0.65
True Positives        32.12±1.89          33.86±2.51          37.71±2.02
False Negatives       17.88±1.89          16.14±2.51          12.29±2.02
False Positives       21.34±1.41          17.24±2.19          15.20±1.68
True Negatives        19.66±1.41          23.76±2.19          25.80±1.68
Sensitivity           0.64±0.04           0.68±0.05           0.75±0.04
Specificity           0.48±0.03           0.58±0.05           0.63±0.04
Precision             0.60±0.02           0.66±0.03           0.71±0.03

SNP-enriched features with baseline clinical parameters and all imaging (MRI,

PET) and bio-specimen (CSF) biomarkers

Shaffer et al [158] built 8 models using logistic regression with clinical parameters and different

subsets of the imaging (MRI, PET) and bio-specimen (CSF) biomarkers to test the predictive

ability of the variables. In this section, we set up their top three models in SVMs instead of

regression models and show the additional value of SNP-enriched features to these models.

Our SNP-enriched features improved the performance of all three top models (clinical
parameters, biomarkers). Adding significant pathway crosstalks gave a higher accuracy than
adding significant pathways, and for each of the three models it increased the average
accuracy by at least 7.96%.

Once again we divided our analysis into models that did not incorporate ApoE explicitly
and models that did. This section discusses the former models and the subsequent section
discusses the latter.

The three models that Shaffer et al [158] found to have the best performance were (1) model

with clinical parameters and all imaging (MRI, PET) and bio-specimen (CSF) biomarkers, (2)

model with clinical parameters and PET biomarkers, and (3) model with clinical parameters,

PET, and CSF biomarkers. The performance of other combinations of baseline parameters and

all imaging (MRI, PET) and bio-specimen (CSF) biomarkers can be found in the supplemental

files.


Model performance with baseline clinical parameters and all imaging (MRI,

PET) and bio-specimen (CSF) biomarkers: We analyzed the model that included all

imaging (MRI, PET) and bio-specimen (CSF) biomarkers along with baseline clinical parame-

ters. The model had an accuracy of 65.48%±2.40% with 54.85%±0.45% support vectors. This

model resulted in a 6.31% increase in accuracy and a 28.84% reduction in support vectors in

comparison to the model that used the baseline clinical parameters alone.

The model that included significant pathways along with all imaging (MRI, PET), bio-

specimen (CSF) biomarkers and baseline clinical parameters (Table 2.6) resulted in an accuracy

of 72.66%±4.04% with 41.17%±0.72% support vectors. Dropping either all imaging (MRI,

PET) and bio-specimen (CSF) biomarkers or significant pathways from this model gave a worse

performance (8.18% decrease in accuracy and 21.94% increase in support vectors when removing

all imaging (MRI, PET) and bio-specimen (CSF) biomarkers; 7.18% decrease in accuracy and

13.68% more support vectors when removing significant pathways).

Similarly, the model built on all imaging (MRI, PET), bio-specimen (CSF) biomarkers and

baseline clinical parameters along with significant pathway crosstalks led to 81.38%±2.79%

accuracy with 35.31%±0.62% support vectors. Removing either all imaging (MRI, PET) and

bio-specimen (CSF) biomarkers or significant pathway crosstalks from the model decreased

performance (10.32% less accuracy and 19.06% more support vectors when dropping all imaging

(MRI, PET) and bio-specimen (CSF) biomarkers; 15.90% less accurate and 19.54% increase in

support vectors when dropping significant pathway crosstalks).

Including significant pathway crosstalks, instead of significant pathways, in a model that also

contained baseline clinical parameters and all imaging (MRI, PET) and bio-specimen (CSF)

biomarkers improved model accuracy by 8.72% and reduced the support vectors by 5.86%.

Model performance with baseline clinical parameters and PET biomarkers: The

model built using PET biomarkers and baseline clinical parameters (Table 2.7) had

68.58%±2.32% accuracy and 57.59%±0.42% support vectors, which was 9.41% more accurate

and had 26.10% fewer support vectors than the model that used only baseline clinical
parameters.

The model that included significant pathways along with PET biomarkers and baseline clin-

ical parameters gave an accuracy of 74.26%±2.71% with 43.91%±0.79% support vectors. Elim-

inating either the PET biomarkers or the significant pathways decreased performance (9.78%

less accurate and 19.20% more support vectors in the model without PET biomarkers; 5.68%

decrease in accuracy and 13.68% more support vectors when removing significant pathways

from our model instead).

The model based on significant pathway crosstalks, PET biomarkers, and baseline clinical

parameters reported an accuracy of 76.54%±2.76% with 38.99%±0.56% support vectors. When
we excluded either PET biomarkers or significant pathway crosstalks, the performance dropped
(the accuracy was reduced by 5.48% and the support vectors increased by 15.38% when
eliminating PET biomarkers; the accuracy was reduced by 7.96% and the support vectors
increased by 18.60% when disregarding significant pathway crosstalks).

Table 2.6: Performance of models with baseline clinical parameters and all imaging (MRI, PET) and bio-specimen (CSF) biomarkers

Metrics               Clinical +          Clinical +          Clinical +
                      MRI, PET, CSF       MRI, PET, CSF       MRI, PET, CSF
                                          + significant       + significant
                                          pathways            pathway crosstalks
Accuracy in %         65.48±2.40          72.66±2.05          81.38±2.79
Support Vectors in %  54.85±0.45          41.17±0.72          35.31±0.62
True Positives        33.59±1.77          37.22±2.33          41.66±1.63
False Negatives       16.41±1.77          12.78±2.33          8.34±1.63
False Positives       14.98±1.41          12.01±2.15          8.60±1.76
True Negatives        26.02±1.41          28.99±2.15          32.40±1.76
Sensitivity           0.67±0.04           0.74±0.05           0.83±0.03
Specificity           0.63±0.03           0.71±0.05           0.79±0.04
Precision             0.69±0.04           0.76±0.04           0.83±0.03

Compared to the model that used baseline clinical parameters, PET biomarkers, and
significant pathways, replacing significant pathways with significant pathway crosstalks resulted
in a 2.28% increase in accuracy and a 4.92% reduction of support vectors.

Table 2.7: Performance of models with baseline clinical parameters and PET biomarkers

Metrics               Clinical + PET      Clinical + PET      Clinical + PET
                                          + significant       + significant
                                          pathways            pathway crosstalks
Accuracy in %         68.58±2.32          74.26±2.71          76.54±2.76
Support Vectors in %  57.59±0.42          43.91±0.79          38.99±0.56
True Positives        34.14±1.43          36.99±1.73          38.70±1.84
False Negatives       15.86±1.43          13.01±1.73          11.30±1.84
False Positives       12.73±1.53          10.41±1.93          10.02±2.03
True Negatives        28.27±1.53          30.59±1.93          30.98±2.03
Sensitivity           0.68±0.03           0.74±0.03           0.77±0.04
Specificity           0.69±0.04           0.75±0.05           0.76±0.05
Precision             0.73±0.03           0.78±0.03           0.80±0.03


Model performance with baseline clinical parameters, PET, and CSF biomark-

ers: The model based on baseline clinical parameters, PET, and CSF biomarkers (Table 2.8)

yielded an accuracy of 67.19%±2.75% with 56.19%±0.47% support vectors. This model had a

slightly lower (1.30%) accuracy but a 27.33% reduction of support vectors in comparison to the

model that used only the baseline clinical parameters and PET. Moreover, this model yielded

10.74% more accuracy and 25.02% fewer support vectors in comparison to the model that used

only the baseline clinical parameters and CSF.

The model incorporating significant pathways along with PET and CSF biomarkers and

baseline clinical parameters resulted in 75.27%±2.82% accuracy and 41.06%±0.76% support

vectors. Compared to models without either CSF biomarkers or PET biomarkers, we were

better off keeping both in the model (1.01% decrease in accuracy and 21.98% increase in support

vectors for the model without CSF biomarkers; 11.96% decrease in accuracy and 19.33% more

support vectors for the model excluding PET biomarkers).

Similarly, the model that included significant pathway crosstalks, baseline clinical param-

eters, PET, and CSF biomarkers gave us an accuracy of 82.72%±2.82% and 33.42%±0.57%

support vectors. Omitting either CSF biomarkers or PET biomarkers gave worse results (6.18%

less accurate and 5.57% more support vectors when dropping CSF biomarkers; 10.61% reduction

in accuracy and 15.49% increase of support vectors when omitting PET biomarkers).

Including significant pathway crosstalks instead of significant pathways in a model that also

contained baseline clinical parameters, PET, and CSF biomarkers produced an improved model

(7.45% increase in accuracy and 7.64% decrease in support vectors).

SNP-enriched features with baseline clinical parameters, ApoE, and all imaging

(MRI, PET) and bio-specimen (CSF) biomarkers

This section contains the top three models from the previous section, extended by ApoE. Again,

we contrast and compare those models with models that also include significant pathways or

significant pathway crosstalks.

Model performance with baseline clinical parameters, ApoE, and all imaging

(MRI, PET) and bio-specimen (CSF) biomarkers: The model built using baseline clinical

parameters, ApoE, and all imaging (MRI, PET) and bio-specimen (CSF) biomarkers (Table

2.9), produced an accuracy of 67.08%±2.78% with 54.05%±0.48% support vectors. This model

had only a slightly higher (1.60%) accuracy and just 0.80% fewer support vectors compared to

dropping ApoE from the model. The addition of significant pathway crosstalks produced the

best accuracy (see Table 2.9 for details).

Similarly, the model containing significant pathways, all imaging (MRI, PET) and bio-

specimen (CSF) biomarkers, baseline clinical parameters, and ApoE yielded an accuracy of

73.93%±3.06% with 41.52%±0.69% support vectors. This model gave us a small increase
(1.27%) in accuracy along with a small increase (0.35%) in support vectors in comparison
to the model without ApoE.

Table 2.8: Performance of models with baseline clinical parameters, PET, and CSF biomarkers

Metrics               Clinical + PET      Clinical + PET      Clinical + PET
                      + CSF               + CSF               + CSF
                                          + significant       + significant
                                          pathways            pathway crosstalks
Accuracy in %         67.19±2.75          75.27±2.82          82.72±2.82
Support Vectors in %  56.19±0.47          41.06±0.76          33.42±0.57
True Positives        36.34±2.08          38.68±2.11          42.52±1.54
False Negatives       13.66±2.08          11.32±2.11          7.48±1.54
False Positives       16.19±1.35          11.17±1.84          8.25±1.86
True Negatives        24.81±1.35          29.83±1.84          32.75±1.86
Sensitivity           0.73±0.04           0.77±0.04           0.85±0.03
Specificity           0.61±0.03           0.73±0.04           0.80±0.05
Precision             0.69±0.02           0.78±0.03           0.84±0.03

The model based on significant pathway crosstalks, all imaging (MRI, PET) and
bio-specimen (CSF) biomarkers, baseline clinical parameters, and ApoE parameters had
80.58%±2.97% accuracy and 35.60%±0.74% support vectors. Removing ApoE caused the
performance to increase a little (0.80% more accurate and 0.29% fewer support vectors in the
model without ApoE). This was the only result where the addition of ApoE to a model resulted
in a marginally better model.

In a model based on all imaging (MRI, PET) and bio-specimen (CSF) biomarkers, baseline

clinical parameters, and ApoE, the additional inclusion of significant pathway crosstalks was

more beneficial than the inclusion of significant pathways (6.65% better accuracy and 5.92%

fewer support vectors).

Model performance with baseline clinical parameters, ApoE, and PET biomark-

ers: The model with baseline clinical parameters, ApoE, and PET biomarkers yielded an accu-

racy of 69.76%±1.94% with 55.79%±0.38% support vectors (see Table 2.10 for details). Again,

incorporating significant pathways improved the accuracy and reduced the number of sup-

port vectors. Including significant pathway crosstalks instead further increased the accuracy to

79.53%±2.63% and reduced the number of support vectors to 37.13%±0.57%.

Table 2.9: Performance of models with baseline clinical parameters and ApoE with all imaging (MRI, PET) and bio-specimen (CSF) biomarkers

Metrics               Clinical + ApoE     Clinical + ApoE     Clinical + ApoE
                      + MRI, PET, CSF     + MRI, PET, CSF     + MRI, PET, CSF
                                          + significant       + significant
                                          pathways            pathway crosstalks
Accuracy in %         67.08±2.78          73.93±3.06          80.58±2.97
Support Vectors in %  54.05±0.48          41.52±0.69          35.60±0.74
True Positives        34.53±2.10          37.87±2.18          41.82±1.76
False Negatives       15.47±2.10          12.13±2.18          8.18±1.76
False Positives       14.47±1.54          11.60±1.81          9.50±1.90
True Negatives        26.53±1.54          29.40±1.81          31.50±1.90
Sensitivity           0.69±0.04           0.76±0.04           0.84±0.04
Specificity           0.65±0.04           0.72±0.04           0.77±0.05
Precision             0.70±0.03           0.77±0.03           0.82±0.03

Model performance with baseline clinical parameters, ApoE, PET, and CSF

biomarkers: Using clinical parameters, ApoE, PET, and CSF biomarkers in a model resulted

in 68.10%±2.35% accuracy and 55.69%±0.45% support vectors (see Table 2.11 for details).

The additional inclusion of significant pathways was beneficial (6.10% better accuracy and
14.35% fewer support vectors), and the inclusion of significant pathway crosstalks even more so
(accuracy of 82.00%±2.69% and 33.43%±0.61% support vectors).

Comparison of model performances of Shaffer et al [158] with our SNP-enriched

features

We compared the logistic regression model of Shaffer et al [158] with our SVM models (see

Tables 2.12 and 2.13). Our models also contained the SNP-enriched features (significant

pathways or significant pathway crosstalks) we built. We noticed that the average accuracy of

the logistic regression model slightly decreased (from 71.56% to 68.55%±2.61%) when we re-

peatedly created random 10-folds instead of using the 10 original folds from Shaffer et al [158].

It decreased further (to 67.57%±2.63%) when we removed the 6 patients that did not have cor-

responding SNP data in the ADNI database. Our method, when incorporating either significant

pathways or significant pathway crosstalks, had a higher average accuracy on 100 randomly
generated 10-folds than Shaffer et al’s [158] method. In particular, the combination of baseline
clinical parameters, ApoE, all imaging (MRI, PET) and bio-specimen (CSF) biomarkers, and
significant pathway crosstalks in our model yielded a remarkably high accuracy of 80.58%±2.97%

38

Table 2.10: Performance of models with clinical parameters and ApoE with PET biomarker

Metrics Clinical Clinical Clinical+ ApoE + ApoE + ApoE+ PET + PET + PET

+ significant + significantpathways pathway

crosstalks

Accuracy in % 69.76±1.94 73.01±2.95 79.53±2.63Support Vectors in % 55.79±0.38 43.24±0.76 37.13±0.57True Positives 35.61±1.29 37.36±1.78 41.36±1.70False Negatives 14.39±1.29 12.64±1.78 8.64±1.70False Positives 13.12±1.17 11.91±2.04 9.96±1.50True Negatives 27.88±1.17 29.09±2.04 31.04±1.50Sensitivity 0.71±0.03 0.75±0.04 0.83±0.03Specificity 0.68±0.03 0.71±0.05 0.76±0.04Precision 0.73±0.02 0.76±0.03 0.81±0.03

Table 2.11: Performance of models with clinical parameters and ApoE with PET and CSF biomarkers

Metrics                 Clinical      Clinical        Clinical
                        + ApoE        + ApoE          + ApoE
                        + PET         + PET           + PET
                        + CSF         + CSF           + CSF
                                      + significant   + significant
                                      pathways        pathway
                                                      crosstalks

Accuracy in %           68.10±2.35    74.20±2.73      82.00±2.69
Support Vectors in %    55.69±0.45    41.34±0.72      33.43±0.61
True Positives          36.22±1.65    38.32±2.09      42.73±1.46
False Negatives         13.78±1.65    11.68±2.09      7.27±1.46
False Positives         15.23±1.38    11.79±1.78      9.11±1.90
True Negatives          25.77±1.38    29.21±1.78      31.89±1.90
Sensitivity             0.72±0.03     0.77±0.04       0.86±0.03
Specificity             0.63±0.03     0.71±0.04       0.78±0.05
Precision               0.70±0.02     0.77±0.03       0.83±0.03


with 35.60%±0.74% support vectors. The original data set of 97 patients from the logistic

regression by Shaffer et al [158] was obtained. Table 2.12 provides the result details for this data set. The

results of the original 10-fold cross-validation on 97 patients are shown in column 2. The mean

and standard deviation of 100 iterations of random 10-fold cross-validation on 97 patients are

shown in column 3. Columns 4-5 show mean and standard deviation of 100 iterations of 10-fold

cross-validation on 91 patients, performed with our method. Table 2.13 provides the result de-

tails for the reduced 91 patient dataset. The mean and standard deviation of 100 iterations of

random 10-fold cross-validation on 91 patients are shown in column 2. Columns 3-4 show mean

and standard deviation of 100 iterations of 10-fold cross-validation on 91 patients, performed

with our method.
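The derived metrics in Tables 2.10 to 2.13 follow from the confusion-matrix counts in the usual way. The sketch below is illustrative only: the function name `classification_metrics` is ours, and the input is the mean counts from the first column of Table 2.10. Because the tables report means over 100 iterations, ratios computed from mean counts may differ from the tabulated values in the last digit.

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard metrics from confusion-matrix counts (counts may be fractional means)."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,      # fraction of all patients classified correctly
        "sensitivity": tp / (tp + fn),      # true positive rate
        "specificity": tn / (tn + fp),      # true negative rate
        "precision": tp / (tp + fp),        # positive predictive value
    }

# Mean counts taken from the first column of Table 2.10 (Clinical + ApoE + PET).
m = classification_metrics(tp=35.61, fn=14.39, fp=13.12, tn=27.88)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

The four counts sum to 91, which matches the reduced 91-patient data set used for our models.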


Table 2.12: Performance of the Shaffer et al [158] model with 97 patients in comparison to our model with 91 patients

Columns (all models use Clinical + ApoE + MRI, PET, CSF):
(2) Shaffer et al paper [158] (logistic regression), 97 patients, original 10-fold cross-validation
(3) Shaffer et al paper [158] (logistic regression), 97 patients, 100 iterations of random 10-fold cross-validation
(4) Our method, 91 patients, 100 iterations of random 10-fold cross-validation, + significant pathways
(5) Our method, 91 patients, 100 iterations of random 10-fold cross-validation, + significant pathway crosstalks

Metrics                 (2)      (3)           (4)           (5)

Accuracy in %           71.56    68.55±2.61    73.93±3.06    80.58±2.97
Support Vectors in %    N/A      N/A           41.52±0.69    35.60±0.74
True Positives          39       36.43±2.19    37.87±2.18    41.82±1.76
False Negatives         14       16.57±2.19    12.13±2.18    8.18±1.76
False Positives         13       12.95±1.29    11.60±1.81    9.50±1.90
True Negatives          30       30.05±1.29    29.40±1.81    31.50±1.90
Sensitivity             0.74     0.69±0.04     0.76±0.04     0.84±0.04
Specificity             0.7      0.70±0.03     0.72±0.04     0.77±0.05
Precision               0.75     0.74±0.02     0.77±0.03     0.82±0.03


Table 2.13: Performance of the Shaffer et al [158] model with 91 patients in comparison to our model with 91 patients

Columns (all models use Clinical + ApoE + MRI, PET, CSF; all results are from 100 iterations of random 10-fold cross-validation):
(2) Shaffer et al paper [158] (logistic regression)
(3) Our method, + significant pathways
(4) Our method, + significant pathway crosstalks

Metrics                 (2)           (3)           (4)

Accuracy in %           67.57±2.63    73.93±3.06    80.58±2.97
Support Vectors in %    N/A           41.52±0.69    35.60±0.74
True Positives          33.17±1.69    37.87±2.18    41.82±1.76
False Negatives         16.83±1.69    12.13±2.18    8.18±1.76
False Positives         12.66±1.57    11.60±1.81    9.50±1.90
True Negatives          28.34±1.57    29.40±1.81    31.50±1.90
Sensitivity             0.66±0.03     0.76±0.04     0.84±0.04
Specificity             0.69±0.04     0.72±0.04     0.77±0.05
Precision               0.72±0.02     0.77±0.03     0.82±0.03


Randomized SNP-enriched features

We wanted to ensure that the choice of knowledge priors utilized in this study has true additive

power and the results are not a random occurrence. To do so, we generated 25 random samples of

knowledge priors with no prior association to Alzheimer’s. This translates to creating random-

ized significant pathway crosstalks. We performed 100 iterations of 10-fold cross-validation for

each of the 25 samples and reported mean and standard deviation. The results clearly show that

the knowledge priors are truly effective in building better models for MCI to AD conversion

(see Table 2.14).

We analyzed models based on this randomized SNP-enriched data and: (1) baseline clinical

parameters, and (2) baseline clinical parameters and all imaging (MRI, PET) and bio-specimen

(CSF) biomarkers.

The model with baseline clinical parameters and randomized significant pathway crosstalks

gave an accuracy of 59.27%±3.66% with 83.47%±1.84% support vectors. This model yields about 12

percentage points lower accuracy and 29.1 percentage points more support vectors than the original model that

uses baseline parameters and significant pathway crosstalks (instead of randomized significant

pathway crosstalks). As expected, our randomly generated knowledge priors (randomized sig-

nificant pathway crosstalks) give us worse performance than significant pathway crosstalks. The

model accuracy is still moderately above a random guessing model, only due to the presence of

the clinical parameters.

Similarly, the model with baseline clinical parameters, all imaging (MRI, PET) and bio-

specimen (CSF) biomarkers, and randomized significant pathway crosstalks gave an accuracy

of 65.62%±3.53% with 54.73%±1.46% support vectors. This model yields over 15 percentage points lower

accuracy and over 19 percentage points more support vectors, in comparison to the original model that uses

baseline parameters, all imaging (MRI, PET) and bio-specimen (CSF) biomarkers, and signif-

icant pathway crosstalks (instead of randomized significant pathway crosstalks). As expected,

our randomly generated knowledge priors (randomized significant pathway crosstalks) give us

worse performance than significant pathway crosstalks. The model accuracy is still above a

random guessing model, only due to the presence of the clinical parameters and all imaging

(MRI, PET) and bio-specimen (CSF) biomarkers. Again, for models using pathway crosstalks

only, no features were being selected as significant to build the model and hence the runs could

not be completed.

A similar trend was seen when investigating models with baseline clinical parameters and all

imaging (MRI, PET) and bio-specimen (CSF) biomarkers to determine the effects of randomized

pathways.


Table 2.14: Performance of models with randomized pathway crosstalk features

Metrics                 Clinical        Clinical        Clinical        Clinical
                        + randomized    + significant   + MRI, PET,     + MRI, PET,
                        pathway         pathway         and CSF         and CSF
                        crosstalks      crosstalks      + randomized    + significant
                                                        pathway         pathway
                                                        crosstalks      crosstalks

Accuracy in %           59.27±3.66      71.06±3.23      65.62±3.53      81.38±2.79
Support Vectors in %    83.47±1.84      54.37±0.59      54.73±1.46      35.31±0.62
True Positives          30.86±1.98      38.04±1.87      33.68±2.14      41.66±1.63
False Negatives         19.14±1.97      11.96±1.87      16.32±2.09      8.34±1.63
False Positives         17.95±1.59      14.39±1.84      14.95±1.80      8.60±1.76
True Negatives          23.05±1.56      26.61±1.84      26.05±1.79      32.40±1.76
Sensitivity             0.62±0.97       0.76±0.04       0.67±1.07       0.83±0.03
Specificity             0.56±1.45       0.65±0.04       0.64±1.21       0.79±0.04
Precision               0.63±0.02       0.73±0.03       0.69±0.02       0.81±2.79

2.3 Conclusion

We have developed a three-step methodology to discover network biomarker crosstalks that

utilizes the combined power of several protein/gene level evidences. To the best of our knowledge,

this is the first systematic study of crosstalk biomarkers. Utilizing crosstalks as additional

features for phenotype prediction significantly improved the prediction accuracy for our Alzheimer’s

disease use-case.


Chapter 3

Subsystem Level: Discovery of Network Biomarker Subsystems

At the subsystem level, the phenotype-related subsystem biomarkers are discovered and

characterized. Biological relationships (e.g., protein functional associations) between proteins are

often modeled as networks (protein functional association networks [70]), where each node is

a protein and every pair of functionally associated proteins is connected with an edge. Func-

tional association between proteins is derived from a number of clues like experimental data [5],

gene-fusion [167], co-occurrence of the corresponding genes on the same operon [31], etc. The

subgraphs of these networks can model the cellular subsystems.

Evolutionary conservation of cellular subsystems across multiple organisms with similar

phenotype can be used as one clue to identify the phenotype-related cellular subsystems [152].

Namely, the cellular subsystems associated with a phenotype are more likely to be conserved

across phenotype-expressing organisms and are less likely to be conserved across phenotype

non-expressing organisms [152].

Our earlier work, the α, β-motif finder [152] algorithm, which is the current state-of-the-art,

identifies phenotype-related subsystems that are modeled as cliques in the organismal functional

association networks. A clique is a subgraph with an edge between every pair of vertices in the

subgraph, i.e., every pair of proteins in the subsystem has to be functionally associated. Density,

the ratio of the number of edges in the subgraph to the total number of possible edges in the

subgraph, is 1.0 for subgraphs modeled as cliques.
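As a small illustration of the density measure just defined, the sketch below computes the ratio of edges present to the number of possible edges for an undirected subgraph. The function name `density` and the toy edge lists are assumptions made for this example.

```python
def density(num_nodes, edges):
    """Edges present divided by the number of possible edges among num_nodes nodes."""
    possible = num_nodes * (num_nodes - 1) // 2
    if possible == 0:
        return 0.0
    # Treat (u, v) and (v, u) as the same undirected edge.
    distinct = {frozenset(e) for e in edges}
    return len(distinct) / possible

clique_edges = [("a", "b"), ("a", "c"), ("b", "c")]   # a triangle, i.e., a clique
path_edges = [("a", "b"), ("b", "c")]                 # connected but not a clique

print(density(3, clique_edges))  # density of the clique
print(density(3, path_edges))    # density of the non-clique subgraph
```

The triangle has density 1.0, while the three-node path has density 2/3, which is the kind of non-clique subsystem that a strict clique model would miss.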

The clique model is too stringent to model all potential phenotype-related cellular subsys-

tems. There are existing studies that show that cellular subsystems can have different topolo-

gies [54]. A detailed study of the densities of existing protein complexes from various sources

[40, 41, 62, 91, 165] has revealed that the density of complexes varies from 0.1 to 1.0 [54]. This is

likely due to the fact that real world biological networks are prone to noise and missing informa-


tion (like missing edges) [127, 193]. Other studies [61, 164] have shown that non-clique clusters

form biologically relevant cellular subsystems. An example of a non-clique subsystem is the

cell-cycle regulation consisting of cyclins (CLB1-4 and CLN2), cyclin-dependent kinases (CKS1

and CDC28), and a nuclear import protein NIP29 identified from Saccharomyces cerevisiae

network [164]. Additionally, Hwang et al [68] showed that maximal clique enumeration based

methods discard over 90% of network nodes when applied to the protein interaction network of

Saccharomyces cerevisiae. Hence, the α, β-motif finder method may not cover the entire spectrum

of phenotype-related cellular subsystems.

As a post processing step, each subsystem identified by the α, β-motif finder can be fed into

methods like our DENSE algorithm [61], a knowledge prior-driven quasi-clique enumeration

algorithm, to identify the subsystems missed by the α, β-motif finder. However, DENSE can extract

subsystems from only a single organismal network, and these subsystems may or may not be

related to the target phenotype. DENSE also requires an input parameter to fix the density

of the subgraphs to be identified. This post processing amounts to running a compute-intensive

quasi-clique enumeration algorithm for every subsystem identified by the α, β-motif finder.

Our previous α, β-motif finder method requires two inputs: the parameter α, the least number

of phenotype-expressing organisms the identified clique has to be present in, and the parameter

β, the number of phenotype non-expressing organisms the identified clique can be

present in. These parameters may be hard to estimate beforehand and, hence, multiple runs

with different parameter values may be required.

The α, β-motif finder works under the assumption that a subsystem’s presence alone can

cause a phenotype. Hence, the method only aims to identify the subsystems that are more

prevalent in phenotype-expressing organismal networks and less prevalent in phenotype

non-expressing organismal networks. However, a subsystem’s absence could also cause phenotype

expression. It was shown that the “couch potato” syndrome could potentially be caused

by a person missing some key genes related to the AMP-activated protein kinase

(AMPK) [179]. It was also shown that one out of every four cases of breast cancer is associated

with the absence of NF1, a gene that negatively regulates an oncogene called RAS. RAS,

when mutated or highly expressed, contributes to a normal cell turning cancerous [183].

We propose to develop a methodology to enumerate phenotype-related cellular subsystems

(functional modules) placing no restriction on the density or size of the enumerated subgraphs.

The method also does not place any restriction on the number of networks the resulting sub-

graph should be present in. Additionally, the method identifies subsystems biased towards the

phenotype expressing organisms and towards phenotype non-expressing organisms from the

results of a single run of the methodology.


3.1 Approach

Figure 3.1 gives the overview of the methodology that will be explained in the subsequent

sections of the thesis.

Figure 3.1: Methodology overview to identify and evaluate phenotype-related network biomarker subsystems. Panel labels: Network Model Construction (gene-gene functional coherence networks; orthologous-group, organism bipartite network model); Enumeration Stage (enumerate conserved edge-sets; extract conserved subsystems); Evaluation Stage; Application to Phenotypes (hydrogen production, motility, respiration, Gram staining).

3.1.1 Network Model Construction

Background

Functional Association Networks

A functional association network of an organism is a graph G = (V,E), where V is the set of all

genes in the organism and E represents the set of existing functional associations between pairs

of genes in the organism. A pair of genes is said to be functionally associated if there is evidence

to show that the genes encode proteins that take part together in some biological cellular

function. The evidence includes gene neighborhood, gene fusion events, published results, data


from experiments, etc.

Orthologous groups

Orthologs, or orthologous genes, are genes from different species that originated by vertical

descent from a single ancestral gene in the last common ancestor of the compared species [88].

In other words, when a species diverges into two separate species, the copies of a single gene in

the two resulting species are said to be orthologous. Orthologs typically retain the same function in the

course of evolution. An orthologous group is a cluster of orthologous genes. Mapping each gene in

each input organism to its corresponding orthologous group allows for a uniform representation

across all input organismal networks. This mapping can be obtained from several existing data

sources [113, 135, 176, 175].

Orthologous Group-pair, Organism Bipartite Network

In order to identify phenotype-related network subsystems from a given set of organismal func-

tional association networks, we need a representation that would help us enumerate the sub-

systems efficiently. In this thesis, we propose the orthologous group-pair, organism bipartite

network that combines the information present in all of the individual organismal protein func-

tional association networks into one single network. This network model is built in three steps.

In the first step, we transform each input organismal network into a representation that would

help us understand the commonalities and differences among the networks. One such

transformation is replacing each gene in every organismal network with its corresponding orthologous

group (see Figure 3.2.1). The most commonly used dataset to obtain the gene-orthologous group

mapping is the manually curated Clusters of Orthologous Groups (COG) [175, 176]. In the Clusters

of Orthologous Groups dataset, a COG ID (simply referred to as a COG) is assigned to each

cluster of orthologous genes.

In the second step, we construct two sets, O and C (see Figure 3.2.2). In C, each element is

a pair (u, v), where both u and v are COGs. In O, each element represents an organism. These

two sets become the two partites of the graph that we will construct in the next step.

In the final step, we construct the orthologous group-pair, organism bipartite network (see

Figure 3.2.3), N = (O,C,E). An edge (a, b) ∈ E, where a ∈ O and b = (u, v) ∈ C, exists if and

only if the COG pair (u, v) is functionally associated in the organism a, i.e., in the organismal

functional association network Ga = (Va, Ea) corresponding to organism a, ∃x, y ∈ Va : gene x

and gene y belong to orthologous cluster groups u and v, respectively, and (x, y) ∈ Ea. As we

make use of COGs, the network N will henceforth be referred to as the COG-pair, organism

bipartite network.
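The three construction steps can be sketched as follows. This is a minimal illustration rather than the implementation used in the thesis: the function name `build_cog_pair_network`, the input dictionaries, and the decision to skip genes without a COG assignment and pairs within a single COG are all assumptions made for the example.

```python
def build_cog_pair_network(org_networks, gene_to_cog):
    """Build N = (O, C, E) from per-organism gene networks.

    org_networks: {organism: iterable of (gene_x, gene_y) functional associations}
    gene_to_cog:  {organism: {gene: COG id}}
    """
    O, C, E = set(), set(), set()
    for org, edges in org_networks.items():
        O.add(org)
        mapping = gene_to_cog[org]
        for x, y in edges:
            # Step 1: replace each gene by its orthologous group.
            u, v = mapping.get(x), mapping.get(y)
            if u is None or v is None or u == v:
                continue  # assumption: ignore unmapped genes and within-COG pairs
            # Step 2: the unordered COG pair becomes a vertex of partite C.
            pair = tuple(sorted((u, v)))
            C.add(pair)
            # Step 3: connect the organism to every COG pair associated in it.
            E.add((org, pair))
    return O, C, E

org_networks = {
    "org1": [("g1", "g2"), ("g2", "g3")],
    "org2": [("h1", "h2")],
}
gene_to_cog = {
    "org1": {"g1": "COG1", "g2": "COG2", "g3": "COG3"},
    "org2": {"h1": "COG2", "h2": "COG1"},
}
O, C, E = build_cog_pair_network(org_networks, gene_to_cog)
print(sorted(C))
print(sorted(E))
```

In this toy input, the COG pair (COG1, COG2) is shared by both organisms even though the underlying genes differ, which is the uniform representation the orthologous-group mapping is meant to provide.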

Figure 3.2: Three-step building of the orthologous-group organism bipartite network. Panels 1 to 3 correspond to the three steps; C and O denote the two partites, with organisms labeled Org #1 to Org #4.

3.1.2 Enumeration Stage

The COG-pair, organism bipartite network model is utilized to enumerate the phenotype-related

subsystems. The methodology consists of two phases. The first phase involves obtaining the

COG edges common to every subset of organisms and the second phase involves enumerating

the potential phenotype-related cellular subsystems from the edge set.


Enumeration of Conserved Edge-sets

In the first phase, we identify for each subset of organisms, the set of COG edges that are

conserved across that subset of organisms. In the COG-pair, organism bipartite network model,

the nodes of one partite of the network model represent the COG edges and the nodes in the

other partite represent organisms and hence, bicliques (see Figure 3.3.A) of the network model

represent the set of edges conserved across a subset of organisms. We aim to enumerate only

the maximal bicliques (see Figure 3.3.B) to identify the largest subset of organisms an edge set

can be conserved in. This allows us to get the correct estimate of subsystem-bias towards both

phenotype-expressing and phenotype non-expressing organisms.

Figure 3.3: Enumeration of conserved edge-sets using maximal biclique subgraph model. (A) Biclique, (B) Maximal biclique, and (C) Each biclique is disintegrated to obtain edge-sets conserved across different subsets of organisms (e.g., Org #1, Org #2, Org #4)

Definition 3.1.1 Given a bipartite graph N = (O,C,E), a subgraph S = (Os, Cs, Es) of N is

a biclique if ∀a ∈ Os and b ∈ Cs, (a, b) ∈ Es.

Definition 3.1.2 A biclique S of N is also maximal if there is no S′ = (O′s, C′s, E′s) such that

Os ⊂ O′s and Cs ⊂ C′s and S′ forms a biclique in N.
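To make the two definitions concrete, the following brute-force sketch enumerates maximal bicliques with non-empty sides on a toy input. It is not Bimax and not the algorithm developed in this chapter; it tries every subset of organisms, so it is only practical for a handful of organisms. The function name `maximal_bicliques` and the input format (organism mapped to its set of COG pairs) are assumptions for illustration.

```python
from itertools import combinations

def maximal_bicliques(neighbors):
    """neighbors: {organism: set of COG pairs present in that organism}.

    Returns a set of (organisms, cog_pairs) pairs, each a maximal biclique.
    """
    orgs = sorted(neighbors)
    found = set()
    for k in range(1, len(orgs) + 1):
        for subset in combinations(orgs, k):
            # COG pairs conserved across every organism in the subset.
            common = set.intersection(*(neighbors[o] for o in subset))
            if not common:
                continue
            # Close the organism side: every organism containing all common pairs.
            closed = frozenset(o for o in orgs if common <= neighbors[o])
            found.add((closed, frozenset(common)))
    return found

neighbors = {
    "org1": {"e1", "e2"},
    "org2": {"e1", "e2", "e3"},
    "org3": {"e3"},
}
bicliques = maximal_bicliques(neighbors)
for organisms, pairs in sorted(bicliques, key=lambda b: sorted(b[0])):
    print(sorted(organisms), sorted(pairs))
```

Closing the organism side guarantees that neither side can be extended, so each reported pair is maximal; for example, the edge set {e1, e2} is reported with both org1 and org2 rather than with org1 alone.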

The problem of identifying maximal bicliques using the binary matrix representation trans-

lates to identifying the maximal biclusters in the matrix. For our preliminary results, we chose

Prelic et al.’s Bimax biclustering algorithm [136]. There are two reasons for this choice: (1) Bi-

max performs on par with the best biclustering techniques [136], and (2) It has also been shown

that Bimax is able to output all the optimal (maximal) biclusters in the given binary matrix


[136]. The algorithm uses a divide-and-conquer approach to enumerate maximal biclusters in

the COG-pair, organism bipartite network.

However, the methodology should be scalable for a large number of organisms because, as we

increase the size of the organism set (O), the corresponding number of COG-edges (vertices

of partite C) could increase, and a significant number of edges will be added to the edge set

E. Thus, the underlying algorithm to mine the bicliques should be scalable and complete its

execution in a practical amount of time. To meet this goal, we consider another representation

of the problem—enumerating maximal bicliques in bipartite graphs.

Most existing methods solve the biclique mining problem by introducing restrictions to im-

prove algorithm efficiency. These include bounding the maximum input degree [174], bounding

the maximum biclique degree [51], bounding the minimum biclique size [103, 114, 184],

bounding the input’s arboricity [35], solving the problem for graphs that satisfy certain conditions

[126], and even assigning weights to the vertices of the graph and identifying the “high confidence”

bicliques [123]. However, algorithms that depend on such restrictions for their efficiency cannot

solve arbitrary bipartite instances. Additionally, if we do utilize existing methods, we would

then have to perform a separate analysis to determine the right parameter values. An existing

algorithm for solving the general bipartite biclique problem is serial [200]. The existing parallel

algorithm for the biclique mining in general graphs [147] places restrictions on the bicliques

mined.

Another approach relies on graph inflation. It is easily observed (see, for example, [109]) that

maximal biclique enumeration on bipartite graphs can be turned into a maximal clique enu-

meration on general graphs, by adding edges between all pairs of vertices belonging to the same

partite. This approach is neither practical nor scalable due to the enormous number of edges

that may be needed and the accompanying increase in problem difficulty that is incurred.

However, it has been suggested in the literature [1] that it is worthwhile adapting existing algorithms

for the general clique problem to mine bicliques.
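The cost of graph inflation can be seen with a short sketch. The function `inflate` below is a hypothetical helper written for this illustration: it adds an edge between every pair of vertices in the same partite and reports how many edges were added. It assumes that vertex labels are sortable and distinct across the two partites.

```python
from itertools import combinations

def inflate(O, C, E):
    """Turn a bipartite graph into a general graph by making each partite a clique."""
    original = {frozenset(e) for e in E}
    added = set()
    for part in (O, C):
        for a, b in combinations(sorted(part), 2):
            added.add(frozenset((a, b)))
    return original | added, len(added)

O = {"o1", "o2", "o3"}
C = {"c1", "c2"}
E = {("o1", "c1"), ("o2", "c1"), ("o2", "c2")}
all_edges, n_added = inflate(O, C, E)
print(len(E), "original edges,", n_added, "added edges")
```

The number of added edges is |O|(|O|-1)/2 + |C|(|C|-1)/2, which grows quadratically in the partite sizes. With many COG pairs in partite C, the added edges can far outnumber the original ones, which is why this route is described as impractical.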

We developed a parallel parameter-free heuristic algorithm to generate all maximal bicliques

in bipartite graphs. The algorithm is a modified version of the widely used, efficient

backtracking method of Bron and Kerbosch (BK) [17] for enumerating maximal cliques. The parallel

framework from our earlier work on maximal clique enumeration [154] has been adopted and

modified to provide a parallel implementation of our algorithm.

The main difference between clique mining in a general graph and biclique mining in a

bipartite graph is the relationship that can exist between a pair of vertices (a, b). In general

graphs, a and b, can either be connected or disconnected and that condition can be checked to

decide if a vertex can be added to a current clique to form a larger clique. A pair of discon-

nected vertices cannot be part of the same clique. In case of bipartite graphs, the concept of

disconnection is more profound. A pair of vertices can be disconnected and belong to the same


clique if they are part of the same partite. This means that, without additional information,

we cannot tell whether a pair of vertices from the same partite can be part of the same

clique. This extremely basic but important concept was kept in mind

while developing the algorithm.

In generic BK, three sets are maintained for every clique: compsub, the vertices currently

part of the clique; cand, the vertices that are connected to all vertices in compsub; and

not, the set of vertices that are connected to all vertices in compsub but would form cliques that have

been previously enumerated. The basic idea is that a vertex a is selected from cand and added

to the set compsub and subsequently the sets cand and not are updated to only contain vertices

adjacent to a. However, in the modified algorithm, we not only keep vertices adjacent to a but

also vertices in the same partite as a (see Algorithm 3.2 - Line 10 and Line 11), since these

vertices could also be potentially added to the current clique.
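For reference, the bookkeeping of the generic (unmodified) BK algorithm on a general graph can be sketched as below. This is the textbook version without pivoting, not the bipartite variant of Algorithm 3.2; the set is named `not_` because `not` is a reserved word in Python, and the toy graph is an assumption for illustration.

```python
def bron_kerbosch(compsub, cand, not_, adj, out):
    """Generic BK: report every maximal clique of the graph given by adj."""
    if not cand:
        if not not_:
            out.append(frozenset(compsub))  # nothing can extend compsub: maximal
        return
    for v in sorted(cand):
        # Keep only vertices adjacent to v in both cand and not_.
        bron_kerbosch(compsub | {v}, cand & adj[v], not_ & adj[v], adj, out)
        cand = cand - {v}   # v has been fully explored
        not_ = not_ | {v}   # cliques containing v are already enumerated

# Toy graph: a triangle a-b-c with a pendant edge c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
print([sorted(c) for c in cliques])
```

The modification described above changes the two intersections: vertices in the same partite as the selected vertex are retained in addition to its neighbors.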

The modified definitions of cand and not require a more sophisticated approach when it

comes to selecting a vertex to be added to the clique. In the general BK algorithm, any vertex

from the set cand could be selected, but for a bipartite graph, we need to make sure that the

selected vertex a is connected to every vertex that is in compsub and not in the same partite

as a. The brute-force way would be, for every vertex a in cand, to loop through the vertices in

compsub to check a’s connectivity. This solution would add significantly to the

clique enumeration time as the clique sizes increase and would defeat the purpose of utilizing an

algorithm like the BK.

Thus, we introduce the concept of partite-representatives. Partite-representatives (part_c and

part_o) are elements of compsub that belong to different partites. A potential candidate

a is added to compsub if it is connected to either one of the representatives (see Algorithm

3.2 - Line 6). Each time a vertex x is added to compsub, cand is updated to contain only

vertices that are connected to or in the same partite as x. The update at the current step

is performed over the updates performed in the previous steps; hence, if the current candidate a

is connected to a vertex x from compsub, we know that a is also connected to every vertex in

the same partite as x, and all other vertices in compsub would belong to the same partite as

a. The first candidate added to an empty compsub becomes the first representative, and the

second representative is the second candidate chosen because it will have to satisfy the criterion

of being connected to the one existing representative.

In the generic BK algorithm, once the selected candidate a is added to compsub, the

algorithm looks for other vertices in cand that are not connected to a, and this would signify

a new branch being added to the search tree. We modify the search condition to check that

vertices that are not connected to a are also not in the same partite as a (see Algorithm 3.2

- Line 16). The BK algorithm with all the modifications is shown in Algorithm 3.2 and an

example is shown in Figure 3.4. The algorithm has been fitted into the parallel framework from


our earlier work [154].

Figure 3.4: Parameter-free biclique enumerator. (A) An example graph, (B) Representative selection, and (C) The search tree for the example graph

Mining cliques in k-partite graphs

An interesting and useful generalization of the problem is mining all cliques from a k-partite

graph, i.e., every pair of vertices in the clique should either be part of the same partite or be

connected. The serial algorithm above can be extended to mine cliques in k-partite graphs and


Algorithm 3.1: Main Function

Input: Bipartite graph N = (O, C, E), where O and C are the two disjoint vertex sets and E is the set of edges

1 compsub ← ∅;
2 cand ← C ∪ O;
3 not ← ∅;
4 part_c ← NULL;
5 part_o ← NULL;
6 BiCliqueEnumerate(compsub, cand, not, part_c, part_o)

by extension, to a parallel clique mining algorithm for k-partite graphs. In the algorithm described, a

candidate a should be connected to one of the representatives in order to be added to the current

compsub set, because connection to one representative means that a is in the same partite as

the other representative. In the case of a k-partite graph, this condition alone cannot work because

a candidate a can belong to a partite different from both representatives. However, that being

the case, we know that a has to be connected to both representatives in order to be added to

compsub. The two partite representatives can be any pair of vertices from compsub that do not

belong to the same partite, i.e., from any distinct pair from the k partites.

Thus, we modify the condition: a has to be connected to one representative and belong to

the same partite as the other representative, or it has to be connected to both representatives. No

matter how large the value of k is, this modified condition will select the appropriate candidate

to be added to compsub.
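The modified candidate condition can be written as a small predicate. The sketch below is illustrative only: the function name `can_add`, the `partite` lookup table, and the toy 3-partite graph are assumptions, and the two representatives are taken to lie in different partites as described above.

```python
def can_add(a, rep1, rep2, partite, adj):
    """Modified k-partite condition for adding candidate a to compsub."""
    conn1 = rep1 in adj[a]
    conn2 = rep2 in adj[a]
    same1 = partite[a] == partite[rep1]
    same2 = partite[a] == partite[rep2]
    # Either connected to one representative and in the other's partite,
    # or connected to both representatives (a lies in a third partite).
    return (conn1 and same2) or (conn2 and same1) or (conn1 and conn2)

# Toy 3-partite graph; p and q are the two representatives.
partite = {"p": 0, "p2": 0, "q": 1, "r": 2, "r2": 2}
adj = {
    "p": {"q", "r", "r2"},
    "q": {"p", "r", "p2"},
    "r": {"p", "q"},
    "p2": {"q"},
    "r2": {"p"},
}
for candidate in ("r", "p2", "r2"):
    print(candidate, can_add(candidate, "p", "q", partite, adj))
```

Here r is accepted because it is connected to both representatives, p2 is accepted because it shares a partite with p and is connected to q, and r2 is rejected because it lies in a third partite but is connected to only one representative.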

Such a generic parallel parameter-free clique mining algorithm on k-partite graphs would

have its uses in several areas, such as analysis of multiple phenotypes simultaneously. One

partite can represent the COG-edges, and each of the other partites can represent organisms

expressing a different phenotype. The cliques from this graph will help identify subsystems

common to different subsets of phenotypes.

Extracting Conserved Cellular Subsystems

Each maximal biclique identified in the previous section represents the set of COG edges con-

served across a specific subset of organisms. However, we cannot consider the COG edge set as

a subsystem as is. A subsystem has to be a connected subgraph of an organismal network as

opposed to a collection of edges.

A connected component is a subgraph with a path between every pair of nodes in the subgraph.

There is no density constraint on the subgraph, and we can model subsystems with varying

topologies (density from 0.1 to 1.0). The connected component subgraphs from the COG edge set


of each maximal bicluster are enumerated (see Figure 3.5). The enumeration can be performed

using depth-first or breadth-first search, both of which are linear time algorithms in terms of the

number of edges.
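A minimal sketch of this extraction step, using breadth-first search over the COG edge set of one maximal biclique, is given below. The function name `connected_components` and the example COG identifiers are assumptions for illustration.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Split a set of COG edges into connected components using BFS."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        queue, component = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

# COG edges conserved across one subset of organisms (toy example).
edge_set = [("COG1", "COG2"), ("COG2", "COG3"), ("COG7", "COG8")]
components = connected_components(edge_set)
print([sorted(c) for c in components])
```

Each vertex and edge is visited a constant number of times, so the cost is linear in the size of the edge set, and each resulting component is a candidate subsystem with no constraint on its density.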

Algorithm 3.2: Parameter-free biclique enumerator

1 BiCliqueEnumerate(compsub, cand, not, part_c, part_o)
2 if cand = ∅ then
3     if not = ∅ then
4         Output compsub;
5 else
6     fixp ← The vertex in cand that is connected to the greatest number of other vertices in cand and connected to either part_c or part_o;
7     cur_v ← fixp;
8     if cur_v ≠ NULL then
9         while cur_v ≠ NULL do
10            new_not ← All vertices in not that are connected to cur_v or in the same partite as cur_v;
11            new_cand ← All vertices in cand that are connected to or in the same partite as cur_v;
12            new_cs ← compsub + cur_v;
13            BiCliqueEnumerate(new_cs, new_cand, new_not, part_o, part_c);
14            not ← not + cur_v;
15            cand ← cand − cur_v;
16            if there is a vertex v in cand that is not connected to fixp and not in the same partite as fixp then
17                cur_v ← v;
18            else
19                cur_v ← NULL;
20 return

3.1.3 Evaluation Stage

Statistical Significance Assessment

The results of the method only guarantee that the subgraphs output are connected components

but there is no clear indication whether the subgraphs could represent subsystems or if their

occurrence was purely random. To determine this statistical significance, the topological density

of each subgraph is compared with the density of randomly obtained subgraphs with the same

number of nodes.


Figure 3.5: Extracting the potential phenotype-related modules from the enumerated maximal bicliques. Panel labels: Maximal Biclique; Extracted Edge, Organism Information (e.g., Org #1, Org #2, Org #4); Generate Connected Components (DFS, BFS); Potential Phenotype-related Subsystems.

The Monte Carlo method [125, 199], a robust statistical significance assessment method, is

utilized. For every connected component subgraph S = (V,E) identified in a subset of organisms

O, we calculate the density γ(S). We randomly sample subsets of |V| COGs each from the set

of all possible COGs. For each random subset of COGs Sr, we calculate the density (γ) of the

subgraph that will be formed by the nodes. To calculate the density, we first need to determine

the number of edges that occurs between the nodes taking into consideration the networks of

the organisms in O. An edge is placed between a pair of nodes (u, v), where u, v ∈ Sr, if and

only if an edge between u and v occurs in θ percent of the networks corresponding to organisms

in O.

We estimate an empirical p-value (pr) as R/W , where W is the total number of random

subsets generated (W ∼ 1000) and R is the number of random subsets that produce a test

statistic γ(Sr) greater than or equal to γ(S). We then use a cutoff (say 0.05) to identify

the statistically significant components.
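The Monte Carlo assessment can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the helper names `conserved_density` and `empirical_p_value` are ours, θ is given as a fraction rather than a percentage, and an edge is counted when it occurs in at least that fraction of the organism networks (the text does not say whether the threshold is exact or a lower bound).

```python
import random
from itertools import combinations

def conserved_density(nodes, org_edge_sets, theta):
    """Density of the subgraph on nodes, keeping an edge only if it occurs in
    at least a fraction theta of the organism networks (assumed lower bound)."""
    nodes = list(nodes)
    possible = len(nodes) * (len(nodes) - 1) // 2
    if possible == 0:
        return 0.0
    kept = 0
    for u, v in combinations(nodes, 2):
        support = sum(frozenset((u, v)) in edges for edges in org_edge_sets)
        if support >= theta * len(org_edge_sets):
            kept += 1
    return kept / possible

def empirical_p_value(observed, all_cogs, size, density_fn, trials=1000, seed=0):
    """p_r = R / W: fraction of random COG subsets at least as dense as observed."""
    rng = random.Random(seed)
    pool = sorted(all_cogs)  # random.sample needs a sequence
    hits = sum(density_fn(rng.sample(pool, size)) >= observed for _ in range(trials))
    return hits / trials

# Toy data: two organism networks over COGs A to D.
org_edge_sets = [
    {frozenset(("A", "B"))},
    {frozenset(("A", "B")), frozenset(("B", "C"))},
]
observed = conserved_density(["A", "B", "C"], org_edge_sets, theta=0.5)
p_r = empirical_p_value(
    observed, {"A", "B", "C", "D"}, 3,
    lambda sample: conserved_density(sample, org_edge_sets, theta=0.5),
)
print(observed, p_r)
```

A component would then be retained when `p_r` falls below the chosen cutoff. The toy universe of four COGs is far too small for a meaningful p-value and serves only to show the mechanics.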

Phenotype-relatedness Assessment

Each cellular subsystem S conserved across a set of organisms O is further analyzed to determine

the strength of its relatedness and relevance to the target phenotype. As discussed in the intro-

duction section of this chapter, the presence or the absence of a subsystem can cause a pheno-


type, hence phenotype-relatedness would refer to quantifying both the subsystem’s bias towards

phenotype expressing and phenotype non-expressing organisms. The phenotype-relatedness is

determined by a pair of p-values, computed using the hypergeometric statistical test, and a set

of rules defined over the p-values. Let n be the total number of organisms (both phenotype-expressing and phenotype

non-expressing). Let x be the total number of phenotype expressing organisms and y be the

number of phenotype non-expressing organisms. Let v be the total number of organisms (both

phenotype-expressing and phenotype non-expressing) the subgraph S is conserved in. Let w be

the number of phenotype expressing and z be the number of phenotype non-expressing organ-

isms the subgraph S is conserved in. The bias towards phenotype expressing (pe) organisms

and phenotype non-expressing (pne) organisms is calculated as follows:

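\[ p_e = \frac{1}{\binom{n}{v}} \times \sum_{i=w}^{v} \binom{x}{i} \binom{n-x}{v-i} \tag{3.1} \]

\[ p_{ne} = \frac{1}{\binom{n}{v}} \times \sum_{i=z}^{v} \binom{y}{i} \binom{n-y}{v-i} \tag{3.2} \]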

To determine if the subgraph S is phenotype-related given a significance threshold τ , we

apply the following rules.

• if pne ≤ τ and pe ≤ τ, then S is not phenotype-related;

• if pne ≤ τ and pe > τ, then S is phenotype-related and biased towards phenotype non-expressing organisms;

• if pne > τ and pe ≤ τ, then S is phenotype-related and biased towards phenotype-expressing organisms; and

• if pne > τ and pe > τ, then S is not phenotype-related.

We apply a p-value cutoff of τ = 0.05 to identify all the phenotype-biased biclusters.
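As an illustration, Equations 3.1 and 3.2 and the four rules can be written out directly with exact binomial coefficients. The sketch below is only a worked example under the definitions above; the function names are hypothetical, and it assumes n = x + y and v = w + z.

```python
from math import comb


def hypergeom_upper_tail(n, group, v, observed):
    """Sum over i = observed..v of C(group, i) * C(n - group, v - i), divided by C(n, v)."""
    numerator = sum(
        comb(group, i) * comb(n - group, v - i) for i in range(observed, v + 1)
    )
    return numerator / comb(n, v)


def phenotype_relatedness(x, y, w, z, tau=0.05):
    """Return (p_e, p_ne, label) for a subgraph conserved in w expressing and
    z non-expressing organisms, out of x expressing and y non-expressing ones."""
    n, v = x + y, w + z
    p_e = hypergeom_upper_tail(n, x, v, w)    # Equation 3.1
    p_ne = hypergeom_upper_tail(n, y, v, z)   # Equation 3.2
    if p_e <= tau < p_ne:
        label = "biased towards phenotype-expressing organisms"
    elif p_ne <= tau < p_e:
        label = "biased towards phenotype non-expressing organisms"
    else:
        # Both p-values significant, or neither: not phenotype-related.
        label = "not phenotype-related"
    return p_e, p_ne, label


# Conserved in 6 of 10 expressing organisms and in none of 10 non-expressing ones.
p_e, p_ne, label = phenotype_relatedness(x=10, y=10, w=6, z=0)
```

For this example, pe = C(10, 6)/C(20, 6) = 210/38760 ≈ 0.0054 and pne = 1, so with τ = 0.05 the subgraph is labeled as biased towards phenotype-expressing organisms.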

3.2 Results

3.2.1 Experimental Setup

We set up experiments with four different phenotypes: hydrogen production (dark and light fermentation), respiration, gram stain, and motility. The organisms for each phenotype were identified through a literature search [79, 163]. The functional association network for each organism was obtained from the STRING [70] database, and the edge score cutoff used was 700 (termed high confidence [70]).


3.2.2 Hydrogen Production

Biological hydrogen production is being explored as an alternative energy source, since a number of microorganisms can utilize different organic substrates to produce hydrogen [15, 78, 124]. Identifying cellular subsystems related to hydrogen production will be useful to genetic engineers aiming to make biological hydrogen production more efficient. Light and dark fermentation are two important sub-phenotypes of hydrogen production, and experiments based on these phenotypes are discussed in this section.

Light fermentation

Initial review of the light fermentation clusters shows the presence of a set of 13 identical COGs

found across all 8 COG clusters. These “core” COGs include genes necessary for synthesis of

hydrogenase complex(es).

Nitrogen fixation is the process in which nitrogenase catalyzes the conversion of nitrogen gas to ammonia, producing hydrogen gas as a byproduct [142, 143]. Two COGs (COG2710 and COG1348), which are associated with the expression of two key proteins, nitrogenase iron protein (NifH) and molybdenum-iron protein [142], were present across all the clusters. Although the presence of these two proteins is essential for nitrogen fixation to be carried out by light fermenting microorganisms, the expression of various genes in other metabolic pathways plays an important role in directly or indirectly regulating the expression of genes encoding NifH proteins. These proteins include ferric iron regulation proteins (sigK, clpB, and fur-related), ammonia ligase (glnA), and nitrogenase [94]. In this study, glutamate ammonia ligase (glnA), a key gene for nitrogenase (NifH), and genes encoding proteins for iron uptake are assembled in the same cluster. In Anabaena, iron uptake proteins and some nitrogen proteins (e.g., Ntc) have been shown to regulate genes encoding glutamine synthetase (glnA) [107]. A review of the role of glutamine synthetase in Anabaena indicates that this enzyme is responsible for regulating nitrogenase activity, thus impacting hydrogen production [107]. The indirect regulation of nitrogenase by iron uptake proteins provides an example of crosstalk between iron and nitrogen-related metabolic pathways.

In addition to nitrogenase, proteins associated with the synthesis or expression of uptake hydrogenase were identified in 11 of the 19 COGs listed in Table 3.1. Hydrogen uptake proteins help remove excess hydrogen to maintain the reducing environment in cells [90]. We also identified a number of proteins (e.g., Hyd and Hyp) involved in the formation of [NiFe]-uptake hydrogenases. The presence of hydrogenase maturation factors (COG0068, COG0298, COG0309) and accessory proteins for the uptake of nickel (COG0378) is consistent with literature reports describing the structure of hydrogenase complexes. The inclusion of hydrogenase proteins in Table 3.1 is likely due to the relationship of hydrogenase proteins with iron uptake genes. For the hydrogenase to function properly, iron is needed to form the NiFe center present in the large hydrogenase subunit (HupL) [180]. As such, hydrogenase maturation depends on crosstalk with iron uptake. In previous studies by López-Gomollón et al. [107], the nitrogen regulator protein NtcA was found to work together with the iron-uptake protein, Fur, to co-regulate genes involved in various metabolic functions. The co-regulated functions include transcriptional regulation proteins and glutamine synthesis [116]. In this study, genes encoding iron uptake regulator proteins (COG0735) were clustered together with genes encoding glutamine synthetase (COG0174). The co-appearance of these two COGs suggests a possible crosstalk between the iron uptake and ammonia assimilation networks. In addition, there is indication that hydrogenase proteins, such as HupUV, are involved in regulating the glutamine synthetase gene, glnAII, in some organisms [142, 159].

Table 3.1: Subsystem associated with light fermentation

COG ID     COG Description
COG0068    Hydrogenase maturation factor
COG0298    Hydrogenase maturation factor
COG0309    Hydrogenase maturation factor
COG0374    Ni,Fe-hydrogenase I large subunit
COG0375    Zn finger protein HypA/HybF (possibly regulating hydrogenase expression)
COG0378    Ni2+-binding GTPase involved in regulation of expression and maturation of urease and hydrogenase
COG0409    Hydrogenase maturation factor
COG0680    Ni,Fe-hydrogenase maturation factor
COG1740    Ni,Fe-hydrogenase I small subunit
COG0174    Glutamine synthetase
COG0535    Predicted Fe-S oxidoreductases
COG0716    Flavodoxins
COG1348    Nitrogenase subunit NifH (ATPase)
COG2082    Precorrin isomerase
COG2710    Nitrogenase molybdenum-iron protein, alpha and beta chains
COG2370    Hydrogenase/urease accessory protein
COG1941    Coenzyme F420-reducing hydrogenase, gamma subunit
COG3259    Coenzyme F420-reducing hydrogenase, alpha subunit
COG0735    Fe2+/Zn2+ uptake regulation proteins

Dark fermentation

Unlike in light fermentation, we did not observe a large set of COGs present across all clusters. For this set of organisms, only two COGs were present across all clusters. This may be partially due to two reasons. First, the selection of species and species diversity have some impact on the types of clusters generated. Second, dark fermentation organisms tend to utilize a greater variety of fermentation pathways, such as the acetate and butyrate fermentation pathways [188]. Greater variation in fermentation routes will not produce as large a “core” set of COGs across all clusters. An example of a COG cluster identified in dark fermentative bacteria is presented in Table 3.2. In this cluster, 13 different COGs are present, consisting of proteins that are directly or indirectly responsible for the uptake or production of hydrogen. Of these COGs, 7 are related to the synthesis or expression of [NiFe]-hydrogenase, an enzyme that catalyzes the reversible oxidation of molecular hydrogen and plays a vital role in anaerobic metabolism [159]; the others are involved in nitrogen and iron metabolic pathways that include proteins such as nitrogenase, iron uptake proteins such as Fur (COG0735), ammonia assimilation proteins such as glutamine synthetase (COG3968), and proteins involved in electron transfer. Previous findings by Butland et al. [19] show that the presence of proteins such as HypE, HypD, HupS, and HupD is typically associated with hydrogen uptake [50, 180]. Based on the other genes (e.g., hybG, hupS) present in the cluster, we can predict that [NiFe]-hydrogenase is associated with hydrogen uptake in this group of organisms.

Table 3.2: Subsystem associated with dark fermentation

COG ID     COG Description
COG0298    Hydrogenase maturation factor
COG0309    Hydrogenase maturation factor
COG0374    Ni,Fe-hydrogenase I large subunit
COG0409    Hydrogenase maturation factor
COG0680    Ni,Fe-hydrogenase maturation factor
COG1740    Ni,Fe-hydrogenase I small subunit
COG0535    Predicted Fe-S oxidoreductases
COG1348    Nitrogenase subunit NifH (ATPase)
COG2710    Nitrogenase molybdenum-iron protein, alpha and beta chains
COG0716    Flavodoxins
COG0735    Fe2+/Zn2+ uptake regulation proteins
COG2082    Precorrin isomerase
COG3968    Uncharacterized protein related to glutamine synthetase

In addition to hydrogenase maturation and expression proteins, Fe-S oxidoreductases were identified. As part of the structure of [NiFe]-hydrogenase, Fe-S metal centers are located on the small subunit of the hydrogenase complex [159, 180]. Thus, the iron uptake pathway is expected to crosstalk with hydrogenase-related pathways. Furthermore, the iron uptake pathway also crosstalks with nitrogen metabolism, in the sense that iron uptake proteins can be involved indirectly in nitrogen metabolism through regulation of nitrogenase and in maintaining the reducing environment in the cell through hydrogen uptake (hydrogenase) [116, 195].

It has been shown that crosstalk between iron uptake and nitrogen metabolism enables regulation of ammonia assimilation [143]; the uncharacterized glutamine synthetase protein in Table 3.2 may be subject to such regulation. In our results, the gene encoding the uncharacterized glutamine synthetase protein was present in only a few species, including Clostridium acetobutylicum and Clostridium beijerinckii, both of which contained nitrogenase and hydrogenase enzymes. It has been demonstrated that, in light fermenting organisms such as Rhodopseudomonas palustris, glutamine synthetase is regulated by hydrogenase accessory proteins (HupUV) [143]. However, to the best of our knowledge, this relationship has not been described in dark fermentation organisms. This knowledge increases the probability that the uncharacterized glutamine synthetase protein may be present in the COG cluster owing to its association with nitrogenase proteins, which may further indicate a possible crosstalk between ammonia assimilation and nitrogen metabolism.

3.2.3 Motility

The motility experiment was set up with a set of 85 motile and 56 non-motile organisms chosen from Slonim et al. [163]. The method identified clusters that contained COG1360, COG1558, COG1157, COG1684, and COG1536 (Table 3.3). All these COGs are related to flagellar proteins, which enable the organisms to move. The method also found COG0643, COG0835, and COG0784, which are related to bacterial chemotaxis. It is well known that chemotaxis controls an organism’s movement with respect to the chemical composition of its environment. For example, it helps the organism move to areas with a high concentration of food [177]. COGs related to the Type III secretion system (COG1766, COG1684, COG1987, COG1338, and COG1886) were also identified. It has been shown that Type III secretion proteins share similarities with flagellar proteins in structure and function [105]. Additionally, we identified COG0835, COG0643, COG1344, COG1291, COG0784, COG1508, and COG1191, which are associated with two-component systems, a signaling pathway that regulates motility [48].

Table 3.3: Subsystem associated with motility

COG ID     COG Description
COG1843    Flagellar hook capping protein
COG1291    Flagellar motor component
COG1344    Flagellin and related hook-associated proteins
COG1256    Flagellar hook-associated protein
COG1338    Flagellar biosynthesis pathway, component FliP
COG4786    Flagellar basal body rod protein
COG1360    Flagellar motor protein
COG1558    Flagellar basal body rod protein
COG1157    Flagellar biosynthesis, type III secretory pathway ATPase
COG1684    Flagellar biosynthesis pathway, component FliR
COG1536    Flagellar motor switch protein
COG1766    Flagellar biosynthesis/type III secretory pathway lipoprotein
COG1684    Flagellar biosynthesis pathway, component FliR
COG1987    Flagellar biosynthesis pathway, component FliQ
COG1338    Flagellar biosynthesis pathway, component FliP
COG1886    Flagellar motor switch/type III secretory pathway protein
COG0643    Chemotaxis protein histidine kinase and related kinases
COG0835    Chemotaxis signal transduction protein
COG0784    FOG: CheY-like receiver
COG0643    Chemotaxis protein histidine kinase and related kinases
COG1508    DNA-directed RNA polymerase specialized sigma subunit, sigma54 homolog
COG1191    DNA-directed RNA polymerase specialized sigma subunit

3.2.4 Respiration

This experiment was set up with a set of 77 aerobic organisms and 57 anaerobic organisms. For aerobic respiration, COGs related to the enzymes of the TCA cycle were identified, including citrate synthase (COG0372), aconitase (COG1048), and malate dehydrogenases (COG0039) (Table 3.4). COGs of the glyoxylate bypass, such as malate synthase (COG2225) and isocitrate lyase (COG2224), were also found. The entire list of TCA-related COGs identified can be found in Table 3.4. There were also other literature-verified COGs (COG0843, COG0109, COG1048, COG1622, COG1845, and COG0372) found by the method described in [172]. For the anaerobic experiment, we found COG1924, COG1592, COG2221, and COG2033 (Table 3.5). COG1924 is related to oxygen-sensitive proteins [173]. The other COGs were identified computationally by other genotype-phenotype methods [172, 173] applied to the anaerobic phenotype. We also identified COGs from arginine and proline metabolism; this could be attributed to L-arginine, which could serve as an energy source for anaerobes.

Table 3.4: Subsystem associated with aerobic respiration

COG ID     COG Description
COG0372    Citrate synthase
COG1048    Aconitase A
COG0045    Succinyl-CoA synthetase, beta subunit
COG0074    Succinyl-CoA synthetase, alpha subunit
COG0479    Succinate dehydrogenase/fumarate reductase, Fe-S protein subunit
COG1053    Succinate dehydrogenase/fumarate reductase, flavoprotein subunit
COG2142    Succinate dehydrogenase, hydrophobic anchor subunit
COG0039    Malate/lactate dehydrogenases
COG2224    Isocitrate lyase
COG2225    Malate synthase
COG2084    3-hydroxyisobutyrate dehydrogenase and related beta-hydroxyacid dehydrogenases
COG2379    Putative glycerate kinase

Table 3.5: Subsystem associated with anaerobic respiration

COG ID     COG Description
COG1924    Activator of 2-hydroxyglutaryl-CoA dehydratase (HSP70-class ATPase domain)
COG1592    Rubrerythrin
COG2221    Dissimilatory sulfite reductase (desulfoviridin), alpha and beta subunits
COG2033    Desulfoferrodoxin

Table 3.6: Subsystem associated with gram-negativity

COG ID     COG Description
COG2877    3-deoxy-D-manno-octulosonic acid (KDO) 8-phosphate synthase
COG2885    Outer membrane protein and related peptidoglycan-associated (lipo)proteins
COG1044    UDP-3-O-[3-hydroxymyristoyl] glucosamine N-acyltransferase
COG1519    3-deoxy-D-manno-octulosonic-acid transferase
COG0763    Lipid A disaccharide synthetase
COG0838    NADH:ubiquinone oxidoreductase subunit 3 (chain A)
COG0337    3-dehydroquinate synthetase
COG0852    NADH:ubiquinone oxidoreductase 27 kD subunit
COG1143    Formate hydrogenlyase subunit 6/NADH:ubiquinone oxidoreductase 23 kD subunit (chain I)
COG0713    NADH:ubiquinone oxidoreductase subunit 11 or 4L (chain K)
COG0649    NADH:ubiquinone oxidoreductase 49 kD subunit 7
COG0382    4-hydroxybenzoate polyprenyltransferase and related prenyltransferases
COG0043    3-polyprenyl-4-hydroxybenzoate decarboxylase and related decarboxylases
COG0163    3-polyprenyl-4-hydroxybenzoate decarboxylase
COG2227    2-polyprenyl-3-methyl-5-hydroxy-6-metoxy-1,4-benzoquinol methylase
COG1008    NADH:ubiquinone oxidoreductase subunit 4 (chain M)
COG1005    NADH:ubiquinone oxidoreductase subunit 1 (chain H)
COG1663    Tetraacyldisaccharide-1-P 4’-kinase
COG0774    UDP-3-O-acyl-N-acetylglucosamine deacetylase
COG1212    CMP-2-keto-3-deoxyoctulosonic acid synthetase
COG0859    ADP-heptose:LPS heptosyltransferase
COG2908    Uncharacterized protein conserved in bacteria
COG2870    ADP-heptose synthase, bifunctional sugar kinase/adenylyltransferase
COG3307    Lipid A core - O-antigen ligase and related enzymes
COG0445    NAD/FAD-utilizing enzyme apparently involved in cell division
COG0848    Biopolymer transport protein

3.2.5 Gram-positive and Gram-negative

This experiment was set up with a set of 61 gram-positive bacteria and 109 gram-negative bacteria. For gram-negativity, COG2877, COG2885, COG1044, COG1519, COG0763, and others related to lipopolysaccharide biosynthesis were found (Table 3.6). This pathway has been shown to be related to gram-negativity [105]. Another set consisting of COG0043, COG0163, COG2227, COG1008, and COG1005 was found. These COGs are associated with the ubiquinone pathway, which has also been shown to be associated with gram-negativity [105]. COG0848, found by the method, has been shown to be associated with the target phenotype [46].

Table 3.7: Subsystem associated with gram-positivity

COG ID     COG Description
COG3764    Sortase
COG3773    Cell wall hydrolyses involved in spore germination
COG0619    ABC-type cobalt transport system, permease component CbiQ and related transporters
COG1122    ABC-type cobalt transport system, ATPase component

For gram-positive bacteria (Table 3.7), the method identified COG3764 and COG3773. COG3764 relates to plasma membrane proteins and was identified by previous research as related to gram-positivity [46]. COG3773 is associated with endospore formation, which usually occurs in gram-positive bacteria when there is a lack of nutrients. We also found COG0619 and COG1122 as a single connected component; these COGs are associated with the uptake of MET and/or MET precursors, which are associated with the regulation of genes involved in amino acid metabolism in gram-positive bacteria [182].

3.3 Conclusion

We have developed a method to identify phenotype-related network biomarkers by addressing some of the issues with the current state-of-the-art methods. The method relaxes constraints on subsystem density and size and is also parameter-free. The method was evaluated on phenotypes such as hydrogen production, gram stain, motility, and respiration. It identified subsystem biomarkers that are consistent with many known phenotype-related cellular systems.

Chapter 4

Gene Level: Functional Annotation of Hierarchical Modularity of Network Biomarkers

The identified network biomarkers are analyzed for biological significance. Biological significance

analysis identifies gene sets that are meaningful in a biological context. Biological significance

can be computationally determined by analyzing the gene set for functional coherence and as-

signing functional annotations. Functional coherence analysis assigns a score quantifying the

gene set’s biological significance. Functional annotation methods identify the possible biolog-

ical functions of the gene set. Biological significance analysis should take into consideration

the principle of hierarchical modularity, i.e., consider the gene set as hierarchically organized

entities.

Hierarchical modularity as a generic principle of system-level cellular organization and function is supported by existing literature [22, 140, 148, 201]. Hierarchical taxonomies of functional terms, as manifested by the Gene Ontology (GO) or by the KEGG knowledgebase, further suggest a possible hierarchical functional organization of the gene set. This kind of functional organization could help with understanding the functioning of the gene set at various levels of functional specificity. For example, the overall function of the target gene set could be chromosome segregation, but at a lower level of the hierarchy, we could find a subset of genes responsible for proper alignment and attachment of chromosomes and another subset responsible for translating the force generated by microtubule depolymerization into movement to facilitate chromosome segregation [22].

Arguably, hierarchical modularity has not been explicitly taken into consideration by most, if

not all, functional annotation systems [3, 145]. As a result, the existing methods would often fail

to assign a statistically significant functional coherence score to biologically relevant molecular

machines (see Table 4.1).

Table 4.1: Statistical significance of protein pairs’ functional coherence in Saccharomyces cerevisiae. For each method, the p-value is reported as pair / module / module size.

Pair 1: SNU13 (RNA binding protein) and DIB1 (17-kDa component of the U4/U6·U5 tri-snRNP); Ref. [166]
    HMS: 0.0/0.0/2    [130, 131]: 0.23    [13]: 0.01/0.01/2    [8]: 0.1/0.1/2    [66, 67]: 0.42/0.42/2

Pair 2: HAP1 (Zinc finger transcription factor involved in the complex regulation of gene expression in response to levels of heme and oxygen) and RPM2 (Protein subunit of mitochondrial RNase P); Ref. [55]
    HMS: 0.0/0.01/3    [130, 131]: 0.214    [13]: 0.1/0.337/3    [8]: 0.02/1.0/3    [66, 67]: 0.51/1.0∗/3

Pair 3: SRB2 (Subunit of the RNA polymerase II mediator complex) and RPB9 (RNA polymerase II subunit B12.6); Ref. [121]
    HMS: 0.0/0.01/58    [130, 131]: 0.48    [13]: 0.12/1.0∗/58    [8]: 0.333/1.0/58    [66, 67]: 0.98/1.0∗/58

Pair 4: NSR1 (Nucleolar protein that binds nuclear localization sequences) and DBP2 (Essential ATP-dependent RNA helicase of the DEAD-box protein family); Ref. [170]
    HMS: 0.0/0.0/2    [130, 131]: 0.44    [13]: 0.1/0.1/2    [8]: 0.13/0.13/2    [66, 67]: 0.74/0.74/2

∗assigned a p-value of 1 because the tool was unable to find a score for the entire module

To address this gap, we developed a methodology for hierarchical functional annotation of gene sets. Given the hierarchical taxonomy of functional concepts (e.g., Gene Ontology) and the association of individual genes or proteins with these concepts (e.g., GO terms), our method assigns a Hierarchical Modularity Score (HMS); the HMS and its p-value measure the functional coherence of the gene set. While existing methods annotate the gene set with a set of “enriched” functional terms, our complementary method provides a hierarchical functional annotation of the gene set.

A hierarchical organization of a gene set often comes as a by-product of cluster analysis of gene expression data or protein interaction data. Otherwise, our method will automatically build such a hierarchy by directly incorporating the functional taxonomy information into the hierarchy search process and by allowing multi-functional genes to be part of more than one component in the hierarchy. In addition, its underlying HMS scoring metric ensures that functional specificity of the terms across different levels of the hierarchical taxonomy is properly treated.

[Figure 4.1 panels: Step 1: Functional Term Specificity; Step 2: Hierarchical Functional Annotation; Step 3: Penalization Factor; Step 4: Hierarchical Functional Coherence]

Figure 4.1: Overview of the methodology to assess functional coherence and assign annotation to hierarchical functional modules

We have evaluated our method using Saccharomyces cerevisiae data from KEGG [75, 76, 77]

and MIPS [52] and several other computationally derived and curated datasets [22, 41, 62, 91,

137]. We compared our method with several biological significance analysis methods [8, 13, 21,

66, 67, 115, 130, 131, 146]. The hierarchical modularity built by our method from a set of genes

in various KEGG pathways produces biologically relevant modules, namely, at various levels of

the hierarchy, the corresponding modules match quite well with the manually-curated hierarchy

of pathways in KEGG. We have obtained similar results for the protein complexes in the MIPS

database.

4.1 Approach

Our method provides two main functionalities:

1. Given a hierarchical module and a hierarchical functional taxonomy, our method can assess

the functional coherence of the module and provide a hierarchical functional annotation.

The overview of this functionality is provided in Figure 4.1.

2. Given a module as a “bag of genes” and a hierarchical functional taxonomy, our method can build the functional hierarchy of the module, i.e., provide a global functional view of the module. The overview of this functionality is provided in Figure 4.2.

[Figure 4.2 labels: τs, μ, stopping criterion, merging factor]

Figure 4.2: Overview of fuzzy reconstruction of hierarchical modularity

In the following subsections we discuss the technical details of the two functionalities.

4.1.1 Hierarchical Taxonomy of Functional Terms (HTFA)

Let A={t0, t1, . . . , tn}, n ∈ N, be a set of functional annotation terms. A functional annotation

term (e.g., lyase activity) describes a function that a gene or a protein can carry out in

the cell. A gene g can be annotated with a subset Ag ⊆ A of functional annotation terms. If

|Ag| > 1, then g is multi-functional. If Ag = ∅, then g is called a hypothetical or unannotated

gene. A functional term ti is more specific than a functional term tj , if it is a subtype of tj . For

example, lyase activity is a subtype of catalytic activity. Moreover, the same term can

be a subtype of multiple terms. To capture functional specificity of terms, we will next define

a hierarchical taxonomy of functional terms (HTFA).

A hierarchical taxonomy Tt0 of functional terms A is a directed tree or a directed acyclic

graph (DAG) with the set V of labeled nodes (see Figure 4.4.A), such that

1. The labeling function l : V (Tt0)→ A is a bijection, i.e., every node v ∈ V (Tt0) is labeled

with only one term t ∈ A, and each term t is assigned to only one node v, and

2. Label t0 is assigned to only one node that is called the root node.

Whenever the context is clear, T and Tt0 will be used interchangeably. Likewise, we will simply

use t to refer to the node v with label t (i.e., l(v) = t).

Due to its hierarchical nature, T can be represented as a level set L(T) = {L0, L1, . . . , LDT} (see Figure 4.4.B), where L0 = {t0}, level Ld ⊆ V is the set of nodes visited at distance d from the root t0 during the depth-first traversal of T, and DT ∈ N is the tree depth. Note that if T is a DAG (e.g., the Gene Ontology [3]), then Li ∩ Lj ≠ ∅ may hold for some i ≠ j. In other words, a node can occur at different levels in the taxonomy.

A pair of nodes ti and tj in Tt0 forms an ancestor relationship (ti ≺ tj) if there is a simple directed path from ti to tj in Tt0. An ancestor relationship between a pair of nodes ti and tj in Tt0 represents a functional specificity relationship; namely, the functional term of a child node is a subtype of the functional term of its parent, grandparent, great-grandparent, and so on. This relationship is transitive, i.e., ti ≺ tj and tj ≺ tk imply ti ≺ tk. Also, a child can have multiple parents, as in a DAG.

Given this fact, we next introduce the functional term specificity score SFTS(t) for a node t ∈ T as follows:

\[ S_{FTS}(t) = \frac{\sum_{L \in \mathcal{L}:\, t \in L} S_{LS}(L)}{\sum_{L \in \mathcal{L}} \delta(t, L)}, \tag{4.1} \]

where SLS(L) (see Figure 4.4.B) is the level specificity score associated with level L ∈ L(T) and defined as

\[ S_{LS}(L_d) = \frac{d}{D_T}, \tag{4.2} \]

and δ() is a term characteristic function that specifies whether the term t occurs at level L:

\[ \delta(t, L) = \begin{cases} 1, & \text{if } t \in L \\ 0, & \text{if } t \notin L \end{cases} \tag{4.3} \]
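To make Equations 4.1 to 4.3 concrete, the following Python sketch computes the level sets and the specificity score on a small DAG. The taxonomy used here is hypothetical (it is not the DAG of Figure 4.4) and is given as a parent-to-children map; the function names are illustrative, and the taxonomy is assumed to have depth of at least one.

```python
from fractions import Fraction


def level_sets(children, root):
    """Level L_d holds every node reachable from the root by a path of length d."""
    levels = [{root}]
    while True:
        nxt = {c for t in levels[-1] for c in children.get(t, ())}
        if not nxt:
            return levels
        levels.append(nxt)


def term_specificity(children, root):
    """S_FTS(t): mean of d / D_T over all levels L_d in which t occurs (Eq. 4.1)."""
    levels = level_sets(children, root)
    depth = len(levels) - 1  # D_T, assumed >= 1
    scores = {}
    for term in set().union(*levels):
        occurrences = [d for d, level in enumerate(levels) if term in level]
        scores[term] = sum(Fraction(d, depth) for d in occurrences) / len(occurrences)
    return scores


# Hypothetical DAG: t3 has two parents (t1 and t4), so it occurs at levels 2 and 3.
CHILDREN = {"t0": ["t1", "t2"], "t1": ["t3"], "t2": ["t4", "t5"], "t4": ["t3"]}
scores = term_specificity(CHILDREN, "t0")
```

In this toy taxonomy DT = 3, the root scores 0, and the term t3, which occurs at levels 2 and 3, scores (2/3 + 3/3)/2 = 5/6.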

A distinct pair (ti, tj) ∈ A × A of functional terms is called related, if the corresponding

nodes in T form an ancestor relationship, i.e., ti ≺ tj . More generally, a set of terms U ⊆ A is

called an unrelated set, or an unrelated term set in T , if no distinct pair (ti, tj), s.t. ti, tj ∈ U ,

is related in T .

Let U be an unrelated functional term set in T. Then, as defined by Equation 4.4, the ancestor functional term set Ū of U is the set of all the functional terms t̂ ∈ T on any simple path from any node t ∈ U to the root node t0 ∈ T:

\[ \bar{U} = U \cup P \cup \{t_0\}, \qquad P = \{\hat{t} \in T : \exists\, t \in U \text{ such that } t \prec \hat{t} \prec t_0\} \tag{4.4} \]

For example, consider an unrelated functional term set Au = {t4, t6} in Figure 4.4. According to Equation 4.4, its ancestor functional term set is Āu = Au ∪ {t0, t2, t3}.
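The definitions of an unrelated term set and of the ancestor functional term set (Equation 4.4) can be illustrated with a short Python sketch. The taxonomy below is hypothetical (a child-to-parents map chosen for illustration, not the DAG of Figure 4.4), and the function names are assumptions.

```python
# Hypothetical taxonomy given as child -> parents; "t0" is the root.
PARENTS = {"t1": ["t0"], "t2": ["t0"], "t3": ["t1", "t4"], "t4": ["t2"], "t5": ["t2"]}
ROOT = "t0"


def ancestors(term, parents=PARENTS):
    """All terms reachable from `term` by following child -> parent edges."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def is_unrelated(terms, parents=PARENTS):
    """True if no distinct pair of terms forms an ancestor relationship."""
    terms = list(terms)
    return all(a not in ancestors(b, parents) for a in terms for b in terms if a != b)


def ancestor_term_set(terms, parents=PARENTS, root=ROOT):
    """Equation 4.4: the terms themselves, everything on a path to the root, and the root."""
    closure = set(terms) | {root}
    for term in terms:
        closure |= ancestors(term, parents)
    return closure


closure_a = ancestor_term_set({"t3", "t5"})
closure_b = ancestor_term_set({"t4", "t5"})
```

In this toy taxonomy, {t4, t5} is unrelated with ancestor set {t0, t2, t4, t5}, whereas {t2, t4} is a related pair because t2 lies on the path from t4 to the root.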

4.1.2 Hierarchical Gene Module (HGM)

Given a set of genes G = {g1, g2, . . . , gm}, a hierarchical gene module (HGM) M is an undirected tree over the set G of leaf nodes. Given the hierarchical taxonomy T of functional terms A, let an unrelated term set Ag, Ag ⊂ A, denote the functional annotation of gene g.

Figure 4.3: Hierarchical functional annotation of a gene module M for a gene set G = {g1, g2, g3, g4} given the taxonomy T in Figure 4.4. (A) Functional annotation of genes in G by unrelated term sets in T, (B) A hierarchical gene module M, and (C) The resulting annotation of the internal (non-leaf) nodes in V(M)

Hierarchical functional annotation

Given an HTFA taxonomy T and an HGM module M with a functional annotation Ag ⊂ A for

each leaf node gene g, hierarchical functional annotation of M is the function h : V (M)→ ℘(A)

that maps each node v in V (M) to the set Av from the power set of A such that:

1. Av is the set of the most specific common functional terms among v’s children, and

2. Av is an unrelated functional term set in T .

Next, we will formally define the first condition, i.e., the set of the most specific common

functional terms among the child nodes of v. Let Cv be the set of child nodes of v. Note that

if Cv = ∅, then v is a leaf node g and Av = Ag. Otherwise, as defined by Equation 4.5, Av is

derived from the intersection of the ancestor functional term sets of v’s children (see Equation

4.4) by maximizing the size of the unrelated functional term set U in the power set of this

intersection:

\[ A_v = \hat{U}, \qquad \hat{U} = \arg\max_{U \in \wp(\bar{A}_{C_v}),\ U \text{ is unrelated}} |U|, \qquad \bar{A}_{C_v} = \bigcap_{w \in C_v} \bar{A}_w \tag{4.5} \]

For the hierarchical functional module M in Figure 4.3 and the taxonomy T in Figure 4.4, consider v ∈ V(M) as an example with Cv = {g1, g2} and ĀCv = {t0, t1, t2, t3, t4, t5} ∩ {t0, t2, t4, t5} = {t0, t2, t4, t5}. Then the maximum-size unrelated term set in the power set of this intersection defines the functional annotation set Av = {t4, t5} for the internal node v.
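The annotation rule of Equation 4.5 can be sketched with a brute-force search over subsets, which is adequate only for very small term sets. The taxonomy below is hypothetical: it was chosen so that the ancestor sets match the values quoted in the example above, and it is not claimed to reproduce Figure 4.4. The leaf annotations for g1 and g2, the function names, and the tie-breaking among equally large unrelated sets (first found in sorted order) are likewise assumptions.

```python
from itertools import combinations

# Hypothetical taxonomy given as child -> parents; "t0" is the root.
PARENTS = {"t1": ["t0"], "t2": ["t0"], "t3": ["t1", "t4"], "t4": ["t2"], "t5": ["t2"]}


def ancestors(term):
    """All terms reachable from `term` by following child -> parent edges."""
    seen, stack = set(), list(PARENTS.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, ()))
    return seen


def ancestor_term_set(terms):
    """Terms plus all of their ancestors (the root is an ancestor of every other term)."""
    return set(terms).union(*(ancestors(t) for t in terms))


def annotate_internal_node(child_annotations):
    """Largest unrelated subset of the intersection of the children's ancestor sets."""
    common = sorted(set.intersection(*(ancestor_term_set(a) for a in child_annotations)))
    for size in range(len(common), 0, -1):
        for candidate in combinations(common, size):
            if all(a not in ancestors(b) and b not in ancestors(a)
                   for a, b in combinations(candidate, 2)):
                return set(candidate)
    return set()


# Assumed leaf annotations: A_g1 = {t3, t5}, A_g2 = {t4, t5}.
A_v = annotate_internal_node([{"t3", "t5"}, {"t4", "t5"}])
```

With these assumed inputs, the intersection of the ancestor sets is {t0, t2, t4, t5} and the search returns Av = {t4, t5}, in agreement with the worked example.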

Figure 4.4: An illustration of a hierarchical taxonomy T over the set of functional annotation terms A = {t0, t1, t2, t3, t4, t5}. (A) A DAG view and (B) A level set view

Functional Coherence

Existing functional coherence analysis techniques analyze the input functional module in its

entirety without considering its hierarchical structure. Additionally, most methods depend on

a reference set by incorporating its annotation distribution and size into their scoring formula

[21, 130, 131]. A reference set is a group of proteins that forms a superset of the functional

module. Khatri et al. discuss the difficulties with selecting the right reference set [83].

Here, we introduce a method that accounts for the hierarchical structure of the module M

when determining its functional coherence. Additionally, the scoring function does not directly

rely on any reference set. More specifically, given the hierarchical gene module M and its func-

tional annotation, the functional coherence score, called hierarchical modularity score (HMS),

SHMS(M) of M is defined by Equation 4.6:

\[ S_{HMS}(M) = \frac{1}{|V(M)|} \times \sum_{v \in V(M)} \left[ \lambda(v) \times \frac{1}{|A_v|} \times \sum_{t \in A_v} S_{FTS}(t) \right], \tag{4.6} \]

where the functional term specificity SFTS(t) is defined by Equation 4.1, and the penalization factor λ(v) is discussed in the following section.
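Equation 4.6 is a plain average over the nodes of the module, which the following Python sketch spells out. It takes the node annotations, the term specificity scores, and the penalization factors as precomputed inputs; all numeric values below are hypothetical. Two conventions are assumptions of this sketch and are not stated in the text: nodes without a listed penalization factor (such as leaves) use λ = 1, and nodes with an empty annotation contribute zero.

```python
def hms_score(annotations, specificity, penalty):
    """Average over all nodes of lambda(v) times the mean S_FTS of the node's terms."""
    total = 0.0
    for node, terms in annotations.items():
        if not terms:
            continue  # assumption: an unannotated node contributes 0
        mean_specificity = sum(specificity[t] for t in terms) / len(terms)
        total += penalty.get(node, 1.0) * mean_specificity  # assumption: default lambda = 1
    return total / len(annotations)


# Hypothetical module with two leaves (g1, g2) and one internal node (v).
annotations = {"g1": {"t3", "t5"}, "g2": {"t4", "t5"}, "v": {"t4", "t5"}}
specificity = {"t3": 5 / 6, "t4": 2 / 3, "t5": 2 / 3}
penalty = {"v": 1 / 1.05}
score = hms_score(annotations, specificity, penalty)
```

Because every SFTS(t) lies in [0, 1] and λ(v) ≤ 1, the resulting score also lies in [0, 1], with higher values indicating more specific shared annotations.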

Penalization Factor (λ)

Consider a hierarchical gene module M with its hierarchical functional annotation (see Figure

4.5.A), as described in Hierarchical functional annotation section. Let p ∈ V (M) be a parent

node with its children Cp. Given the functional annotation term sets, Ap and Ac, for the parent

p and its child c ∈ Cp, respectively, a dissimilarity score ψ(p, c) between p and c is defined by

Equation 4.7 (see Figure 4.5.B):


Figure 4.5: Illustration of penalization factor calculation. (A) Hierarchical annotation of the functional module defined in Figure 7 and (B) Dissimilarity score ψ(p, c) for a parent p = v and a child c = g1

\[ \psi(p, c) = \frac{1}{|A_p|} \times \sum_{t \in A_p} \min_{t' \in A_c} d(t, t'), \tag{4.7} \]

where the distance d(t, t′) is the length of the shortest simple path (t′ ≺ t) from node t′ to t in

T , or d(t, t′) =∞, if t′ and t are not related.
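The dissimilarity score of Equation 4.7 can be sketched as follows. The taxonomy in `PARENTS` is hypothetical, and the sketch makes one interpretive assumption: d(t, t′) is taken as the number of edges on the shortest ancestor-descendant path between the two terms, in whichever direction the relation holds, and as infinity otherwise. The annotation sets in the example are also made up.

```python
import math
from collections import deque

# Hypothetical taxonomy (child -> list of parents); not taken from the thesis figures.
PARENTS = {
    "t0": [],
    "t1": ["t0"],
    "t2": ["t0"],
    "t3": ["t1"],
    "t4": ["t2"],
    "t5": ["t2"],
}

def upward_distance(descendant, ancestor, parents):
    """Breadth-first search upward: edges from `descendant` to `ancestor`, or inf."""
    queue = deque([(descendant, 0)])
    seen = {descendant}
    while queue:
        node, dist = queue.popleft()
        if node == ancestor:
            return dist
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    return math.inf

def term_distance(t, t_prime, parents):
    """Assumed d(t, t'): shortest path if the terms are related, infinity otherwise."""
    return min(upward_distance(t_prime, t, parents),
               upward_distance(t, t_prime, parents))

def dissimilarity(parent_terms, child_terms, parents):
    """Equation 4.7: mean over t in A_p of the distance to the closest t' in A_c."""
    total = sum(min(term_distance(t, tp, parents) for tp in child_terms)
                for t in parent_terms)
    return total / len(parent_terms)

psi_example = dissimilarity({"t1", "t2"}, {"t3", "t2"}, PARENTS)
```

In this toy case t1 is one edge away from t3 and t2 matches itself, so `psi_example` is (1 + 0) / 2 = 0.5. Identical annotation sets give 0, and a parent term with no related child term makes the score infinite.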

Given Equation 4.7, the penalization factor λ(p) for the parent node p ∈ V (M) is then

defined by Equation 4.8:

\[ \lambda(p) = \left[ 1 + \frac{1}{\omega_\lambda} \max_{c \in C_p} \psi(p, c) \right]^{-1} \tag{4.8} \]

For the example in Figure 4.5, λ(v) = 0.95 for ωλ = 10, because the dissimilarity scores

between v and its children g1 and g2 are 0.5 and 0, respectively. Also, Figure 4.6 depicts the

behavior of λ(p) for different values of ωλ in Equation 4.8, as the maximum value of ψ(p, c)

varies from zero to its maximum possible value of the tree depth DT in the taxonomy T . If ωλ

increases from one to 100, then node score’s penalty decreases from 50% (even for immediate

neighbors in ψ()) to 13% (for the largest taxonomy depth DT = 15 in the Gene Ontology [3]).

More information on choosing ωλ values can be found in the Choosing ωλ Value section.
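The numbers quoted above can be checked with a short sketch of Equation 4.8. The function name and argument layout are illustrative assumptions. The sketch covers only nodes with at least one child, since the equation takes a maximum over the children.

```python
def penalization_factor(child_dissimilarities, omega=10.0):
    """Equation 4.8: lambda(p) = 1 / (1 + max_c psi(p, c) / omega)."""
    return 1.0 / (1.0 + max(child_dissimilarities) / omega)

# Example of Figure 4.5: psi(v, g1) = 0.5 and psi(v, g2) = 0 with omega = 10.
lambda_v = penalization_factor([0.5, 0.0], omega=10)

# Penalty (1 - lambda) at the two extremes discussed in the text.
penalty_immediate_neighbor = 1.0 - penalization_factor([1.0], omega=1)
penalty_deepest_term = 1.0 - penalization_factor([15.0], omega=100)
```

Here `lambda_v` is 1/1.05, about 0.95. The penalty is 50% for ψ = 1 with ωλ = 1, and about 13% for ψ = 15 with ωλ = 100, in agreement with the values stated above. An infinite dissimilarity drives λ(p) to zero.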

4.1.3 Assessing Statistical Significance

To provide a robust assessment of statistical significance for SHMS(M), we measure an empirical

p−value for the SHMS(M) score assigned to each hierarchical module M using the Monte Carlo

procedure described in [125]. Specifically, for hierarchical module M over a set of |G| genes from

organism O, we randomly sample N subsets of size |G| from the entire genome of organism O,

build the hierarchy, and compute the SHMS(). Then, we estimate an empirical p−value for

SHMS(M) as p−value=R/N , where N is the total number of random samples (N ∼ 1000) and


Figure 4.6: Comparison between the three penalization factor functions considered

R is the number of the samples that produce a test statistic SHMS() greater than or equal to

the SHMS(M).
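The Monte Carlo estimate p = R/N can be sketched as below. The callable `score_fn` is a hypothetical stand-in for the step "build the hierarchy and compute SHMS()" and is not implemented here. The fixed seed and the toy scoring function are assumptions added only to make the example reproducible.

```python
import random

def empirical_p_value(observed_score, genome, module_size, score_fn,
                      n_samples=1000, seed=0):
    """Estimate p = R / N, where R counts random gene sets scoring >= observed.

    genome:   list of all gene identifiers of the organism
    score_fn: hypothetical callable mapping a gene list to its S_HMS score
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(genome, module_size)  # random subset of size |G|
        if score_fn(sample) >= observed_score:
            hits += 1
    return hits / n_samples

# Toy usage: the score is the fraction of sampled genes in a small "coherent" set.
toy_genome = [f"g{i}" for i in range(100)]
toy_coherent = set(toy_genome[:5])

def toy_score(genes):
    return sum(g in toy_coherent for g in genes) / len(genes)

toy_p = empirical_p_value(0.8, toy_genome, 5, toy_score, n_samples=200)
```

With N samples the smallest non-zero p-value that can be reported is 1/N, so a reported value of 0.0 should be read as "below 1/N".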

4.1.4 Fuzzy Reconstruction of Hierarchical Modularity

In the Hierarchical gene module section, the hierarchical structure for a gene module M was provided

as an input. Based on this structure and the hierarchical taxonomy of functional annotation

terms (Hierarchical taxonomy of functional terms section), we provided means both for infer-

ring M ’s hierarchical functional annotation (Hierarchical functional annotation section) and for

estimating M ’s functional coherency via hierarchical modularity scoring (Functional coherence

section).

In contrast, here we consider a somewhat inverse problem, namely, the reconstruction of a

hierarchical structure of a functional module M defined by its gene set G. G is often referred to as

a “bag of genes.” On the one hand, it seems that any hierarchical clustering method could be

used to reconstruct the hierarchical functional modularity from such a “bag of genes.” On the

other hand, the presence of multi-functional genes suggests that the same gene could belong to

multiple subtrees in the hierarchy—the property that is not often guaranteed by any hierarchical

clustering method. Therefore, we will first need to introduce “fuzziness” into the process of

building a functional hierarchy for G. For example, in Figure 5, a bag of genes containing SPC24,


TID3, NUF2, and FHL1 and a functional annotation taxonomy T are provided as input to the

method. It is known that SPC24, TID3, and NUF2 are functionally related because they are

part of the Ndc80 protein complex, but SPC24 is also transcriptionally regulated by FHL1 [95],

so SPC24 is part of multiple subtrees and, hence, fuzziness is introduced.

Existing fuzzy clustering schemes typically introduce some fuzziness into some known clus-

tering algorithm. C-means [10, 178] is a typical example of this kind. Others are typically

partitional by nature [43, 66, 67]. Agglomerative fuzzy clustering algorithms are not common,

because agglomerative techniques are considered “hard clustering,” i.e., it becomes difficult to

move an element from an existing cluster to a new cluster. Ideally, any fuzziness in a clustering

procedure should be introduced while the hierarchy is being built and not as a post-processing

step.

To meet these requirements, we propose a taxonomy-based, agglomerative, fuzzy inference

(TAFI) of the hierarchical gene module M from a gene set G, provided each gene g ∈ G

is annotated with an unrelated functional term set Ag ⊂ A in a hierarchical taxonomy T of

functional annotation terms A (see Hierarchical taxonomy of functional terms section). The

overview of this method is provided in Figure 5.

Similar to an agglomerative hierarchical clustering (AHC) process, TAFI starts with assign-

ing each gene to its own cluster and proceeds building the hierarchy in an iterative, bottom-up

manner, but it introduces fuzziness by allowing multiple cluster pairs to merge simultaneously

at each iteration. The two user-defined parameters control this fuzziness process at each itera-

tion: (a) the merging factor µ and (b) the stopping criterion τs. The former defines what cluster

pairs get merged at a given iteration it. Namely, unlike traditional AHC, which merges the pair

of clusters with the maximum similarity Smax(it), TAFI allows for clusters with suboptimal

similarity to be merged as well. Suboptimality is defined by the percentage µ of Smax(it). In ad-

dition, TAFI prevents the formation of unrelated clustering modules by stopping the bottom-up

cluster merging process at iteration it as soon as Smax(it) value falls below τs.
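The per-iteration merge selection described above can be sketched as follows. This is an illustrative reading rather than the TAFI implementation: µ is treated as a percentage of Smax(it), the threshold is taken as inclusive, and the dictionary of pairwise similarities is assumed to be computed elsewhere.

```python
def select_merges(similarity, mu, tau_s):
    """Return the cluster pairs to merge at one iteration.

    similarity: dict mapping (cluster_a, cluster_b) -> inter-cluster similarity
    mu:         merging factor, given as a percentage of S_max (assumption)
    tau_s:      stopping criterion; no merges once S_max falls below it
    """
    if not similarity:
        return []
    s_max = max(similarity.values())
    if s_max < tau_s:
        return []  # stop: only unrelated clusters are left
    threshold = (mu / 100.0) * s_max
    return [pair for pair, score in similarity.items() if score >= threshold]

# Hypothetical similarities between three clusters.
toy_similarity = {("A", "B"): 0.50, ("A", "C"): 0.46, ("B", "C"): 0.20}
merges = select_merges(toy_similarity, mu=90, tau_s=0.1)
```

With µ = 90 the threshold is 0.45, so both (A, B) and (A, C) are merged and cluster A ends up in two subtrees, which is the source of fuzziness. Setting µ = 100 recovers the usual single best merge of AHC, apart from ties.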

Note that Horng et al. [65] proposed to merge the cluster pairs whose similarity is greater

than Smax(i) − ∆, unlike TAFI’s way of restricting to the (µ/100) × Smax(i) similarity threshold. The

reason behind our choice of multiplicative factor rather than additive/subtractive factor is the

following. If ∆ = 0.1 and Smax(i) = 0.5, then any cluster pair with inter-cluster similarity

greater than 0.4 would be merged. The value of 0.4 is 80% of 0.5. However, if Smax(i) = 0.2,

then any cluster pair with inter-cluster similarity greater than 0.1 would be merged, but the

value of 0.1 is only 50% of 0.2. The criterion becomes more stringent with a larger value of

Smax() and, conversely, it becomes more lenient, as Smax() gets smaller. In contrast, our choice

of the merging factor allows for resolving this inconsistency issue.
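The arithmetic behind this argument can be reproduced with a few lines of Python. The function names are illustrative, and µ = 80 is chosen here only so that both rules coincide at Smax = 0.5.

```python
def additive_threshold(s_max, delta):
    """Additive rule: merge pairs with similarity above S_max - delta."""
    return s_max - delta

def multiplicative_threshold(s_max, mu):
    """Multiplicative rule: merge pairs above (mu / 100) * S_max."""
    return (mu / 100.0) * s_max

# Relative stringency (threshold as a fraction of S_max) for both rules.
relative_stringency = {}
for s_max in (0.5, 0.2):
    relative_stringency[s_max] = (
        additive_threshold(s_max, delta=0.1) / s_max,
        multiplicative_threshold(s_max, mu=80) / s_max,
    )
```

The additive rule corresponds to about 80% of Smax when Smax = 0.5 but only about 50% when Smax = 0.2, whereas the multiplicative rule stays at about 80% in both cases.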

Also, observe that multiple merges at each iteration can sometimes result in the same subtree

being formed repeatedly. This leads to redundancy. Thus, TAFI employs pruning, where a


merge is allowed only if the merge results in a subtree that has not been already formed.
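One possible way to implement this pruning is to remember the leaf set of every subtree formed so far. The sketch below assumes that two subtrees count as "the same" when they contain the same genes, which is an assumption of this example and not a statement about the TAFI code. Trees are represented as nested tuples of gene names.

```python
def leaf_set(tree):
    """Genes at the leaves of a nested-tuple tree; a leaf is a gene name string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def merge_with_pruning(candidate_pairs, formed):
    """Merge each candidate pair unless a subtree with the same genes exists.

    formed: set of frozensets, updated in place with every new subtree.
    """
    new_trees = []
    for left, right in candidate_pairs:
        key = leaf_set(left) | leaf_set(right)
        if key in formed:
            continue  # redundant subtree: prune this merge
        formed.add(key)
        new_trees.append((left, right))
    return new_trees

# Two candidate merges that would both yield a subtree over the same three genes.
formed_so_far = set()
candidates = [
    (("SPC24", "TID3"), "NUF2"),
    (("SPC24", "NUF2"), "TID3"),
]
kept = merge_with_pruning(candidates, formed_so_far)
```

Only the first candidate is kept, because the second would produce a subtree over the same gene set {SPC24, TID3, NUF2}.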

In addition, we need to make two important decisions in order to apply TAFI: (1) the

inter-cluster similarity measure and (2) the linkage algorithm. For the inter-cluster similarity

measure, we use Equation 4.6 that calculates the hierarchical modularity score SHMS(M) for a

hypothetical module M that could be formed if the two clusters, or hierarchical tree modules M1

and M2, were merged by adding a new root node vnew and making the root nodes v1 ∈ V (M1)

and v2 ∈ V (M2) to be the children of vnew.

It is worth noticing that SHMS(M) is a semi-metric, and this property has direct implications

on our choice of the base clustering algorithm. Since semi-metrics do not adhere to the triangle

inequality principle, we can resort only to an average, single, complete, or centroid linkage algorithm

as our base clustering technique. Other effective clustering techniques, such as Ward’s

method, cannot be used in conjunction with semi-metrics [92].

4.2 Results

4.2.1 Benchmark Data and Tools

To evaluate the performance of our method, we first need to define (1) the model organism;

(2) the benchmark data of known functional annotations for this organism; (3) the hierarchical

taxonomy of functional terms, and (4) the state-of-the-art methods that are most suitable for

our comparative analysis.

Saccharomyces cerevisiae is our model organism. The reason is that its genome annotation

is mostly complete and manually curated by human experts [21]. Apart from annotation qual-

ity, the availability of functional module datasets, both manually curated and experimentally

generated, for S. cerevisiae is advantageous for our method validation purposes.

For benchmark data, we plan to use both metabolic pathways from KEGG database [75,

76, 77] and protein complexes in MIPS database [52] including experimental protein-protein

interaction data and protein complexes derived from this data [22, 41, 62, 91, 137].

For the hierarchical taxonomy of functional terms, we will rely on the commonly-used func-

tional annotation taxonomy provided by the Gene Ontology consortium [3]. As such, we will

limit ourselves to the existing methods that are also based on GO ontology. Namely, we will

compare our method with the methods by Pandey et al. [130, 131] and Chagoyen et al. [21] and with GS2 [146].

The method by Pandey et al. makes use of the lowest common ancestor principle to score functional

coherence for a protein pair; it is based on Jiang and Conrath’s scoring method [71], which is a

normalized version of the scoring method in Resnik et al. [141]. It has been shown that Jiang

and Conrath’s method is the best measure to capture semantic relatedness [18]. The method by

Chagoyen et al. is based on a widely used cosine similarity measure to assess functional coherence


Figure 4.7: Functionally coherent modules from the Chen and Yuan [22] study. (A) Module ID M1 and (B) Module ID M3

for a protein pair, and the authors provide a MATLAB implementation. The GS2 method [146]

uses the overlap similarity measure, and its authors provide a Python implementation.

Additionally, we perform comparisons with methods described in [8, 13, 66, 67, 115, 146].

These methods [13, 66, 67, 115] have web-based implementations. The p-value for our method

is calculated using the Monte Carlo procedure [125] and is discussed in detail in the Methods

section.

We conducted three major types of performance evaluation: (1) at the level of functional co-

herency for protein pairs; (2) at the level of functional coherency for protein functional modules

(with two or more proteins in each); and (3) functional annotation of reconstructed hierarchi-

cal functional modules. Both large-scale comparative analysis and small-scale literature mining

based validation are performed.

4.2.2 Functional Coherency of Protein Functional Modules

Detailed biological analysis of modules from Chen and Yuan [22]

Two functional modules, M1 (Figure 4.7.A) and M3 (Figure 4.7.B), with the same IDs as in [22],

have been reported as insignificant by several existing functional enrichment analysis methods

(see Table 4.2). We used the web-based implementations for the functional enrichment analysis

methods. However, the modules were identified as significant by our HMS method. In the

following paragraphs, we provide biological evidence for the subtrees in the functional hierarchy

of the two modules.

In module M1 (see Figure 4.7.A), GAL1 is the galactose structural gene and GAL3, GAL4,

and GAL80 are transcriptional regulators involved in activation of the GAL genes in response

to galactose; they form a sub-module in the hierarchy. The pair-wise functional associations


Table 4.2: Functional modules evaluated using existing enrichment analysis tools in comparison with HMS. The first two rows show two homogeneous functional modules and the next two rows of the table show heterogeneous functional modules that have coherent submodules. Functional modules have been obtained from the Chen and Yuan [22] study of the Saccharomyces cerevisiae PPI network

                           p-value
Module ID   [13]       [66, 67]   [8]        HMS
M12         1.02E-12   3.4E-19    5.73E-01   0.00
M94         1.05E-07   4.0E-7     6.03E-04   0.00
M3          1.0        0.1        1.0        0.02
M1          1.0        0.18       1.0        0.04

between these genes are well-documented. Transcription of the galactose pathway genes in

Saccharomyces cerevisiae (S. cerevisiae) and Kluyveromyces lactis (K. lactis) is induced by

galactose through the activities of the regulatory proteins, GAL4, GAL80, and GAL3 (S. cere-

visiae) or GAL1 (K. lactis) [73, 150]. GAL4 binds to its binding sites in both the absence

and the presence of galactose [157]; it has the capacity to activate transcription, while GAL80

inhibits GAL4 in the absence of galactose [133]. In the presence of galactose, GAL3 (GAL1 in

K. lactis) binds to GAL80, which alleviates the inhibitory effect of GAL80 upon GAL4 [134].

PMA1 and PMA2 form another sub-module that encodes plasma membrane H+-ATPase

(PM-H+-ATPase), an enzyme with critical physiological roles both in the absence and in the presence

of environmental stress. PMA2, showing 89% identity to PMA1 at the amino acid sequence

level, encodes an H+-ATPase that is functionally interchangeable with the one encoded by

PMA1 [151].

The third sub-module involves DAP1, the damage response protein, and YGP1 induced by

nutrient deprivation-associated growth arrest. DAP1 is required for growth in the presence of

the methylating agent methyl methanesulfonate (MMS). DAP1 is required for cell cycle pro-

gression following damage [56], while YGP1 is induced after exposing cells to nutrient limitation

[56]. It has already been demonstrated that exposure to one kind of stress can activate pro-

tective mechanisms against other different stresses, a phenomenon known as cross-protection

[138]. Since DAP1 and YGP1 both act in the process of stress response, cross-protection might

associate these two genes together.

The same relationship based on cross-protection can be observed in another sub-module that

consists of YBP2, which plays a role in resistance to oxidative stress, and HRT1, which is involved

in stress response. The transcription factor YBP2 and its homologue play central roles in the

determination of resistance to oxidative stress [53], while HRT1 forms a ubiquitin ligase complex

with other scaffold proteins [84]. The critical stress response factor Nrf2 has been shown to be


repressed by the ubiquitin-proteasome system under normal, unstressed conditions, with Nrf2

exploiting ubiquitin ligase complexes [86].

The next sub-module is made up of PMA1 and TPO5, which are involved in excretion of putrescine

and spermidine. TPO5 functions as a suppressor of cell growth by excreting polyamines [169].

PMA1 is a polytopic membrane protein, whose essential physiological function is to pump

protons out of the cell. Both the excretion of putrescine by TPO5 and the delivery of PMA1

to the cell surface rely on the secretory pathway. Furthermore, small portions of TPO5 are co-localized

with PMA1 in the plasma membrane, which indicates possible interactions between these two

proteins [184].

PMA1 also forms a sub-module together with RVS161 that regulates polarization of the

actin cytoskeleton. RVS161 regulates secretory vesicle trafficking [16] as well as cell polarity

[34], actin cytoskeleton polarization [162], and endocytosis [119]. It is already known that the

efficient delivery of PMA1 to the cell surface relies on the secretory pathway [184]. Thus, RVS161 has

a regulatory effect upon PMA1.

Genes in this module have coherent functions: more than half of the proteins in this

module are related to stress response, five out of 20 have regulatory roles in the cell cycle, and

four out of 20 are involved in endocytosis. Stress conditions are likely to cause cell cycle

arrest, as well as endocytosis induction.

For module M3 (see Figure 4.7.B), the vast majority of the genes in the module share oxidative

stress response as the common theme. SOD2 protects cells against oxygen toxicity and TSA2,

responsible for the removal of reactive oxygen, directly protects cells against oxidative stress,

while PXR1 plays a role in negative regulation of telomerase, and YKU80, a subunit of the

telomeric Ku complex, contributes to the maintenance of telomere stability, since oxidative

stress is likely to induce telomere attrition [104].

Meanwhile, proteolysis could also be the result of oxidative stress: YNL311C is part of

a ubiquitin protease complex, DEF1 enables ubiquitination, DMA1 is involved in ubiquitin

ligation, and DMA2 is involved in ubiquitination [49]. Since proteolysis involves many protein

transportation processes, the signal recognition particles are essential to enable transportation:

SRP14, SRP21, SRP54, SRP68, SRP72, and SEC65 are all part of the signal recognition

particle (SRP) subunit, and appear in module M3.

Furthermore, Wu et al. [190] showed that repression of sulfate assimilation is an adap-

tive response of yeast to the oxidative stress of zinc deficiency, while we notice that MET1,

MET10, MET14, MET16, and YPR003C are basic proteins or protein subunits that are re-

quired for sulfate assimilation. Finally, oxidative phosphorylation produces ATP by utilizing

electron transport chains. As a result, the inhibition of the electron transport chain will lead to

oxidative stress [39]. That is probably why ATP3, ATP5, and ATP7 are all part of the enzyme

complex required for ATP synthesis. Also, STI1, ATPase inhibitor activity, and YBT1, ATPase


Figure 4.8: Functional coherence analysis of protein complexes from MIPS-curated [52], Ho [62], Gavin [41], and Krogan [91] as well as metabolic pathways from KEGG. Comparison between our HMS scoring, cosine similarity with different p-value methods from [21], Jaccard similarity with different p-value methods from [21], and GS2 [146] methods. (A) Significant Modules (p-value ≤ 0.05) and (B) Highly Significant Modules (p-value ≤ 0.001)

activity, coupled to transmembrane movement of substances, are part of the module.

Large-scale Analysis of Protein Functional Modules

Protein functional modules predicted by Chen et al. using their betweenness-based network

partitioning algorithm [22] and protein complexes from CYS2008 database [137] are analyzed

as modules for their functional coherency. Table 4.3 summarizes the results obtained by our

HMS scoring method and GS2 [146] method for both significant (p−value≤ 0.05) and highly

significant (p−value≤ 0.001) cut-offs. HMS predicted 96.7% of the CYS complexes and 63.5%

of the modules from the Chen and Yuan [22] study to be significant. GS2 predicted 79.5% of the CYS

complexes and 42.6% of the modules from Chen and Yuan [22] to be significant. The results

can be found in Supplement S1.

HMS comparison with protein-set semantic similarity scoring metric

Protein pairs from the same protein complexes in [41, 52, 62, 91] or the same metabolic pathways

in KEGG [75, 76, 77] are assessed for functional coherency using HMS scoring method, the cosine

similarity metric [21], the Jaccard similarity metric [21], and the GS2 [146] method. We filter

our results as being significant (p−value ≤ 0.05, Figure 4.8.A) and highly significant (p−value

≤ 0.001, Figure 4.8.B).

For MIPS-curated data [52], HMS, cosine, and Jaccard methods predicted nearly 100% of

the protein functional modules as being functionally coherent, while GS2 only predicted 84%

to be significant. For Ho et al. data [62], HMS, on average, provided 30% higher predictions

than other methods. For Krogan et al. data [91], HMS performed 20% better than the other

methods, on average. For KEGG data, our HMS method, on average, performed 8% better than

the other methods. For Gavin et al. data [41], HMS performed about 15% better, on average.

Additionally, Chagoyen et al. [21] mentioned some complexes and pathways that in spite of


Table 4.3: Percentage of significant (p-value ≤ 0.05) and highly significant (p-value ≤ 0.001) functionally coherent modules from Chen and Yuan [22] and CYS2008 [137]

Dataset                                  Method      Significant   Highly significant
CYS2008 protein complex database [137]   HMS         96.7%         82.9%
                                         GS2 [146]   79.5%         40.1%
Chen and Yuan [22]                       HMS         63.5%         46.4%
                                         GS2 [146]   42.6%         29.8%

being functionally related were predicted incoherent. We list some of those modules in Table

4.4 and show that our HMS method is able to predict them as functionally related.

Table 4.4: HMS results for some KEGG metabolic pathways and MIPS protein complexes [21] classified as insignificant by Chagoyen et al. [21]

                                                   Chagoyen et al. [21]           HMS
Pathway or Complex Name              Size   pv1        pv2        pv3        SHMS   p-value
DNA helicases                        2      4.12E-01   5.36E-01   5.36E-01   0.21   0.0
Mitochondrial processing complexes   4      1.05E-01   1.46E-01   2.73E-01   0.35   0.0
Tryptophan metabolism                16     7.67E-02   4.12E-01   5.08E-01   0.34   0.0
Lipoic acid metabolism               3      4.09E-01   4.69E-01   4.69E-01   0.50   0.0
Limonene and pinene degradation      6      2.59E-01   4.13E-02   1.66E-01   0.30   0.0

4.2.3 Functional Coherency of Protein Pairs

HMS comparison with pair-wise semantic similarity metrics

We also calculated the HMS score for 150 functionally-associated protein pairs and compared these

scores with the HMS scores for an equal number of non-functional protein pairs in S. cerevisiae.

The former were obtained from STRING [70] with a strong functional association score of 999

out of 999. The latter were sampled from those pairs that were not scored in STRING (i.e.,

there is no evidence for their functional association). We also performed a similar analysis using

four other pair-wise protein similarity scores: the Pandey et al. [130, 131] metric, the GS2 [146] metric,

overlap score [115], and cosine similarity [21, 115]. The results of the analysis are summarized in


Table 4.5. For all methods, the mean score for the functionally-associated pairs is significantly

different from the mean score for the non-functional pairs, but HMS has the lowest p−value.

Additionally, we calculated the percentage of the total number of pairs whose score is lower

than the maximum score of the non-functional pairs but greater than the minimum score of

the functionally-associated pairs. We found that except for HMS and Pandey et al. [130, 131],

all the other methods have an overlap. This is one of the reasons why we selected the Pandey et

al. [130, 131] method for comparison in the next section.
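The overlap percentage defined above can be sketched as follows. The function name and the toy scores are illustrative assumptions. The inequalities are strict, following the wording "lower than" and "greater than".

```python
def overlap_percentage(functional_scores, nonfunctional_scores):
    """Percentage of all pairs whose score lies strictly between the minimum
    functionally-associated score and the maximum non-functional score."""
    lower = min(functional_scores)
    upper = max(nonfunctional_scores)
    all_scores = list(functional_scores) + list(nonfunctional_scores)
    in_overlap = sum(1 for score in all_scores if lower < score < upper)
    return 100.0 * in_overlap / len(all_scores)

# Made-up scores: one non-functional pair (0.4) falls inside the overlap zone.
toy_overlap = overlap_percentage([0.3, 0.6, 0.7], [0.0, 0.4, 0.5])

# Perfectly separated groups have no overlap.
separated_overlap = overlap_percentage([0.5, 0.6], [0.0, 0.1])
```

When the two score distributions are fully separated, the interval is empty and the overlap is 0%, as reported in Table 4.5 for HMS and for the metric of Pandey et al.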

Table 4.5: Comparison of pair-wise semantic similarity metrics using functionally-associated and non-functional protein pairs

                                 Functionally-associated pairs   Non-functional pairs
Method              Ref.         Mean    Std     Median          Mean    Std     Median   Overlap (%)   p-value
HMS                              0.56    0.06    0.53            0.02    0.03    0.0      0             3.89E-61
GS2                 [146]        0.78    0.18    0.79            0.38    0.13    0.36     63.2          8.42E-74
Pandey              [130, 131]   0.71    0.23    0.73            0.07    0.02    0.06     0             6.53E-29
Overlap Score       [115]        0.91    0.13    1.00            0.41    0.33    0.25     54.8          8.59E-40
Cosine similarity   [21, 115]    0.78    0.17    0.80            0.20    0.08    0.20     6             3.05E-94

We analyzed some functionally-associated protein pairs from STRING that were classified as


functionally coherent and thus biologically relevant (p−value ≤ 0.05) by our method, yet were

assessed as incoherent by the method of Pandey et al. We found literature support for the biological relevance of

these protein pairs. The results are summarized in Table 1. RPB9 and SRB2 proteins are part

of RNA polymerase II holoenzyme in S. cerevisiae [121]. SNU13 and DIB1 proteins have been

shown to be associated with the U4/U6.U5 pre-mRNA splicing small nuclear ribonucleoprotein

(snRNP) complex [166]. HAP1 and RPM2 are related by the fact that RPM2 is required for

repression of the heme activator protein HAP2 in the absence of heme [55]. When NSR1 was

used as a bait in the protein-fragment complementation assay (PCA), the experiment pulled

out DBP2 as one of its prey proteins [170].

4.2.4 Inferred Hierarchy of Functional Modules

To assess the quality of the hierarchy of functional modules derived from a given “bag of

genes” using our HMS scoring metric and the hierarchical modularity inference methodology

described in the Methods section, we assess the consistency between the predicted hierarchy

and the hierarchy of known functional concepts in the KEGG and MIPS databases. Recall that

HMS, by default, uses GO ontology as its hierarchical taxonomy of functional terms.

Consistency Analysis for KEGG Metabolic Pathways

Note that each metabolic pathway is a functional module. We consider the genes from several

metabolic pathways as one “bag of genes” to build the hierarchy of functional modules. If the

constructed hierarchy of functional modularity is biologically relevant, then the genes in each

pathway should form a subtree in the hierarchy and not be “contaminated” by the genes from

the other pathways. We set the fuzziness to null before running the algorithm in order to be

able to use standard clustering validation metrics like the Heidke Score [59], Gerrity Score [42],

and Peirce Score [132].

Since KEGG is organized into a three-level hierarchy, the pathways at the lower levels of

the hierarchy are functionally more coherent. Hence, they should be harder to separate into

different subtrees. This hierarchical specificity of the KEGG knowledgebase provides us with

an opportunity to check both the specificity and the sensitivity of our hierarchical modularity

inference method.

We build contingency tables to provide a mathematically and statistically sound way for

assessing the performance at large-scale. To construct a contingency table, the inferred hierarchy

is first cut at the level that produces s subtrees that are then compared with s pathways used

as input to the algorithm. In the ideal scenario, all the genes in a given pathway (or row in the

contingency table) will end up in the corresponding subtree (or the column in the contingency

table) and vice versa; or the contingency table will form a diagonal matrix with the number


of pathway genes along the diagonal and zeros in the off-diagonal elements of the table. By

completing such a contingency table, we could then utilize various skill metrics, such as Heidke

Score [59], Gerrity Score [42], and Peirce Score [132], to measure the goodness of the predicted

hierarchical modularity.
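For illustration, the Heidke and Peirce skill scores can be computed from such a contingency table with the standard multi-category formulas, as sketched below. The sketch assumes a square table whose rows are pathways (treated as the reference) and whose columns are the matched subtrees, with at least two categories. The Gerrity score requires an additional scoring matrix and is omitted here. The example tables are made up.

```python
def skill_scores(table):
    """Heidke and Peirce skill scores for a square contingency table.

    table[i][j]: number of genes of pathway i that ended up in subtree j.
    Rows are treated as the reference classes (assumption of this sketch).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_marginals = [sum(table[i]) / n for i in range(k)]
    col_marginals = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    agreement = sum(table[i][i] for i in range(k)) / n
    chance = sum(row_marginals[i] * col_marginals[i] for i in range(k))
    heidke = (agreement - chance) / (1.0 - chance)
    peirce = (agreement - chance) / (1.0 - sum(p * p for p in row_marginals))
    return heidke, peirce

# Ideal case: every pathway gene falls into its own subtree (diagonal table).
ideal_heidke, ideal_peirce = skill_scores([[5, 0], [0, 3]])

# One gene of each pathway is "contaminating" the other subtree.
mixed_heidke, mixed_peirce = skill_scores([[4, 1], [1, 4]])
```

A diagonal table yields a score of 1 for both metrics, while the mixed table yields 0.6 for both, since 80% of the genes agree against a chance agreement of 50%.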

We also performed all the experiments by replacing the SHMS scoring metric with the one

proposed by Pandey et al. [130, 131] and compiled the results in Table 4.6. We found that at

“Level 1” in the KEGG hierarchy, both methods had a perfect score of 1.0 for all three metrics,

but as we moved down the hierarchy, we found that our method performed consistently better

than that of Pandey et al. [130, 131]. At “Level 2,” we found that our method performed 6%, 7%,

and 8% better in terms of the Heidke score, the Peirce score, and the Gerrity score, respectively.

At “Level 3,” which is probably the hardest of the three in terms of pathway separability, we

performed about 13%, 6%, and 2% better for the same skill metrics.

Table 4.6: Skill metrics for Saccharomyces cerevisiae KEGG experiments

KEGG      Method       Heidke Score   Peirce Score   Gerrity Score
Level 1   HMS          1±0            1±0            1±0
          [130, 131]   1±0            1±0            1±0
Level 2   HMS          0.97±0.04      0.98±0.05      0.98±0.05
          [130, 131]   0.91±0.1       0.91±0.12      0.90±0.12
Level 3   HMS          0.90±0.03      0.90±0.05      0.90±0.06
          [130, 131]   0.77±0.14      0.84±0.10      0.88±0.07

Consistency Analysis for MIPS Protein Complexes

Protein complexes are functionally coherent modules, and hence experiments similar to the

ones performed using KEGG pathways can be designed. The results can be found in Table 4.7.

We compared the mean score reported for our method and the one proposed by Pandey et al.

We found that at “Level 1” in the MIPS hierarchy, our method performed 12% better than

that of Pandey et al. for both the Heidke and Peirce scores and 13% better for the Gerrity score. At

“Level 2,” our method performed approximately 26% better in terms of the Peirce score and

35% and 26% better in terms of the Heidke and Gerrity scores, respectively. The results can be

found in Supplement S2.


Table 4.7: Skill metrics for Saccharomyces cerevisiae MIPS experiments

MPact-MIPS   Method       Heidke Score   Peirce Score   Gerrity Score
Level 1      HMS          1±0            1±0            1±0
             [130, 131]   0.88±0.25      0.88±0.25      0.87±0.26
Level 2      HMS          0.89±0.14      0.90±0.13      0.90±0.13
             [130, 131]   0.54±0.37      0.64±0.27      0.64±0.27

4.2.5 Effect of Fuzziness

To evaluate the effect of incorporating fuzziness into the reconstruction of hierarchical mod-

ularity, we selected several KEGG pathways with common genes and then reconstructed the

hierarchy with the fuzziness parameter µ = 0.90. For each pathway, we identified the corre-

sponding cluster with the maximum gene overlap (at least 75%). We analyzed multi-pathway

genes in terms of their membership in the corresponding clusters. Table 4.8 summarizes the

results of the analysis for multi-pathway genes. Except for UGA1 gene, which was missed in the

cluster corresponding to the Valine, leucine and isoleucine degradation pathway, all the other

genes were properly identified in their corresponding clusters.
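The matching of pathways to clusters can be sketched as follows. The overlap is assumed here to be the fraction of pathway genes contained in a cluster, which is one plausible reading of "maximum gene overlap (at least 75%)". The gene and cluster names are placeholders.

```python
def best_cluster(pathway_genes, clusters, min_overlap=0.75):
    """Return (cluster_id, overlap) for the cluster covering most pathway genes.

    clusters: dict cluster_id -> set of genes; with fuzziness enabled, a gene
    may appear in several clusters. Returns (None, overlap) below the cutoff.
    """
    best_id, best_fraction = None, 0.0
    for cluster_id, genes in clusters.items():
        fraction = len(pathway_genes & genes) / len(pathway_genes)
        if fraction > best_fraction:
            best_id, best_fraction = cluster_id, fraction
    if best_fraction < min_overlap:
        return None, best_fraction
    return best_id, best_fraction

# Placeholder data: gene g1 belongs to both clusters, as a multi-pathway gene would.
toy_clusters = {"c1": {"g1", "g2", "g3", "g9"}, "c2": {"g1", "g4"}}
toy_match = best_cluster({"g1", "g2", "g3", "g4"}, toy_clusters)
```

In this toy case cluster c1 covers three of the four pathway genes, so it is selected with an overlap of 0.75. Membership of a multi-pathway gene can then be checked in the cluster matched to each of its pathways.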


Table 4.8: Consistency of multi-pathway genes across clusters that enrich the corresponding pathways

Pathways \ Genes                             ALD4   ALD5   ALD6   ERG10   ERG13   SHM1   SHM2   UGA1   POX1
Propanoate metabolism                        1/1    1/1    1/1    1/1     0/0     0/0    0/0    1/1    0/0
Valine, leucine and isoleucine degradation   1/1    1/1    1/1    1/1     1/1     0/0    0/0    1/0    0/0
Cyanoamino acid metabolism                   0/0    0/0    0/0    0/0     0/0     1/1    1/1    0/0    0/0
Methane metabolism                           0/0    0/0    0/0    0/0     0/0     1/1    1/1    0/0    0/0
beta-Alanine degradation                     1/1    1/1    1/1    0/0     0/0     0/0    0/0    1/1    0/0
Synthesis and degradation of ketone bodies   0/0    0/0    0/0    1/1     1/1     0/0    0/0    0/0    0/0
Lysine degradation                           1/1    1/1    1/1    1/1     0/0     0/0    0/0    0/0    0/0
Biosynthesis of unsaturated fatty acids      0/0    0/0    0/0    0/0     0/0     0/0    0/0    0/0    1/1
Fatty acid metabolism                        1/1    1/1    1/1    0/0     0/0     0/0    0/0    0/0    1/1


Figure 4.9: Effect of different values of ωλ on the SHMS() score

4.2.6 Choosing ωλ Value

Our ωλ selection strategy aims to optimize the method performance on a validation set of protein

complexes, which are essentially known functional modules (Figure 4.9). This prior knowledge

is derived from the manually curated set of complexes from the MPact-MIPS [52] database. Starting

with the most conservative value of 1 for the ωλ value, for each ωλ value, we calculate the

accuracy of identifying known protein complexes from the validation set as being statistically

significant. We pick a value that is lenient enough to classify most of the known functional

modules (manually curated protein complexes) as significant, while being stringent enough to

prevent random modules from getting high SHMS scores. Thus, we select the largest ωλ

value (in this case, ωλ = 10) that ensures that at least 95% of the validation protein complex

set is predicted as being statistically significant. The significance of a protein complex score is

calculated using the Monte Carlo method discussed in the Methods section (using a p−value

threshold of 0.05).

4.3 Conclusion

We developed a method to assess biological significance (functional coherence and annotation)

of gene sets taking into consideration their hierarchical modularity. The method shows an


improved performance in comparison to the current state-of-the-art functional coherence and

annotation analysis methods when evaluated on several known functionally coherent gene sets.


Chapter 5

Conclusion and Future Work

The central hypothesis of this thesis is that phenotype-centric, hierarchical-modularity-driven network

biomarker discovery and analysis will enhance the performance of phenotype-related biomarker

identification and improve phenotype diagnostics and prognostics using such biomarkers. The

thesis proposed three methodologies towards this hypothesis. We have presented empirical re-

sults showing the efficacy and applicability of our methods.

The subsystem crosstalk level methodology is, to the best of our knowledge, the first to

systematically identify network biomarker crosstalks, taking into account the multiple biological

clues. The method, when applied to the Alzheimer’s disease phenotype, identifies crosstalks

that are expected to be found and novel crosstalks that require further investigation. Further-

more, empirical results demonstrate the value of crosstalk biomarkers in Alzheimer’s disease

prognosis—improving prognosis accuracy by 15.9% compared to current state-of-the-art.

The subsystem level methodology analyzes hundreds of organismal networks to enumerate

the phenotype-related network biomarker subsystems, avoiding several of the pitfalls of the

current state-of-the-art methodology. The method has been applied to microbial phenotypes

such as acid tolerance, motility, Gram stain, and biohydrogen production. The method captures

biomarkers verified by existing literature as well as novel subsystems whose relationship with

the phenotype can be biologically explained.

The individual gene level methodology evaluates network biomarkers for functional coherence

and assigns a hierarchy of functional annotations by considering the constituent gene sets

as hierarchically organized components. The method demonstrates a 15% to 30% improvement in

performance compared to the current state-of-the-art methods.

However, the thesis does not claim to exhaust all valid or effective approaches to the problem

of hierarchical-modularity-driven, phenotype-related network biomarker identification and

analysis. The following sections describe major directions for future work.

5.1 Causative Network Biomarkers

The network biomarkers identified so far have an “associative” relationship with the phenotype.

Moving towards “causal” relationships is a natural progression. Causal relationships can

be looked at from two perspectives. The first perspective looks into causality between the

biomarker and the phenotype, to decide whether variations in the network biomarker (A), such

as missing edges in a human brain network biomarker, caused the phenotype (B), such as dementia,

or vice versa. The second perspective looks into causal crosstalks to help understand whether the

structural and functional variations in a subsystem A caused a variation in subsystem B (A→B), or vice versa.

The methodology can use the reference map built earlier and evaluate the

edges for directionality, i.e., decide whether the undirected associative edges can be converted into

directed causal edges. Causal relationships are of significant value because the (A→B)-type

causal relationship would provide more specific targets for clinical intervention.

The existing methods [38, 85] in this space primarily look at causality between genes or gene

sets, without taking into consideration the network biomarker structure. There are several

computational challenges in this space. The causal relationships between a pair of biomarkers and

between biomarker-phenotype pairs are likely non-linear. Identifying causal relationships between

overlapping subsystems is non-trivial. Hierarchical modularity adds a layer of complexity; it is

imperative to understand how causality at one level of the hierarchy (say, at the individual gene

level) can be translated into the next level in the hierarchy (say, at the subsystem level). The

final challenge lies in identifying causal relationships that are conserved across organisms, that is,

in analyzing whether strategies used for “associative” relationships would suffice for “causal” relationships.

5.2 Dynamic Network Biomarkers

This thesis discussed methods for comparative network analysis of static biological networks,

i.e., networks without a temporal component. However, the trend in the recent past has been to

collect longitudinal patient-specific data (e.g., the ADNI initiative), i.e., to collect patient variables

at regular time intervals over several years. Dynamic network biomarkers are biomarkers that

are identified at a given timestamp, and then monitored and evaluated at different stages and

time points during the development of a disease. Dynamic network biomarkers will likely

demonstrate changes at various stages of the disease and improve the phenotype specificity of identified

biomarkers [185].

Dynamic biomarkers will help not only in further understanding disease progression but also in

identifying other factors (e.g., environmental factors) that may be contributing to faster

(or slower) disease progression. This kind of analysis, when performed on healthy subjects, can

help cancel out effects or changes that are not related to a specific disease but to the normal

“wear and tear” of the human body. Dynamic network biomarkers can be utilized to compare

patients in different cohorts even if the actual timelines of data collection are different, since the

trajectory of the biomarker, rather than the actual measurements, is compared.

There are several computational challenges in this space. Matching network biomarkers

across two timestamps is non-trivial, since the network could have undergone perturbation between

the two time points. Tracking all network biomarkers across all time points is computationally

intensive. In addition, the method should take into consideration that, at each timestamp, new

biomarkers may appear, some existing ones may disappear, several biomarkers may combine,

and a single biomarker may split. The methodology should also handle updates, i.e., data

collected at future time points should not require complete reruns of the method.
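
To make the matching problem concrete, the following Python sketch links biomarkers between two consecutive time points and labels the events listed above (appear, disappear, combine, split). It is only an illustration under stated assumptions: each biomarker is reduced to a set of gene identifiers, similarity is measured with the Jaccard index, and the cutoff of 0.3 is an arbitrary choice. None of these choices come from the thesis, and network topology is ignored entirely.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two gene sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_biomarkers(
    before: Dict[str, Set[str]],
    after: Dict[str, Set[str]],
    min_overlap: float = 0.3,  # hypothetical cutoff, chosen for illustration
) -> Dict[str, list]:
    """Link biomarkers of two consecutive time points and label events."""
    links: List[Tuple[str, str]] = [
        (p, q)
        for p, genes_p in before.items()
        for q, genes_q in after.items()
        if jaccard(genes_p, genes_q) >= min_overlap
    ]
    successors = {p: [q for x, q in links if x == p] for p in before}
    predecessors = {q: [p for p, y in links if y == q] for q in after}
    return {
        "links": links,
        # No sufficiently similar biomarker at the later time point.
        "disappeared": [p for p, qs in successors.items() if not qs],
        # No sufficiently similar biomarker at the earlier time point.
        "appeared": [q for q, ps in predecessors.items() if not ps],
        # One earlier biomarker linked to several later ones.
        "split": [p for p, qs in successors.items() if len(qs) > 1],
        # Several earlier biomarkers linked to one later biomarker.
        "merged": [q for q, ps in predecessors.items() if len(ps) > 1],
    }


# Toy example with made-up gene identifiers.
t1 = {"A": {"g1", "g2", "g3", "g4"}, "B": {"g5", "g6"}, "C": {"g7", "g8"}}
t2 = {"X": {"g1", "g2"}, "Y": {"g3", "g4"}, "Z": {"g5", "g6", "g7", "g8"}, "W": {"g9"}}
events = match_biomarkers(t1, t2)
```

Because the sketch compares only two consecutive snapshots, data from a new time point would require one additional pairwise comparison rather than a rerun over the full history. The comparison is still quadratic in the number of biomarkers per time point, which reflects the computational cost noted above.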

5.3 Multi-phenotype Network Biomarkers

The current methodologies look at phenotypes one at a time; however, in the real world,

phenotypes do not occur independently but in combination. It is necessary to explore and develop

methodologies that can find network biomarkers that are biased towards two or more traits

simultaneously. For example, diseases such as Alzheimer’s, hypertension, heart problems, and

diabetes often co-occur [120]. These diseases, if analyzed together, may provide useful

information that can be utilized in effective drug discovery.

5.4 Prediction of Missing Values

Prediction of missing values is an important area of research that will complement all the

methods developed in the network biomarker space. Solutions to this problem will likely improve the

stability, utility, and robustness of methods developed in the research areas discussed previously.

In disease prognostics, missing data often leads to omitting patients from the analysis since the

usual strategies of using the mean or mode values are not sufficient. We need more intelligent

and domain-specific techniques to fill in missing values. The most recent papers [191, 196] on

the topic build different prediction models for different subsets of patients and then use the

features with known values for a patient to predict values of missing variables. However, it

would be worthwhile to investigate this problem more systematically by addressing the following

challenges. Different features follow different patterns, and a “one size fits all” strategy

might not necessarily work. For example, the missing value solution for the PET data would

likely be different from that for the MRI data. Another challenge is to decide when to use a methodology

to fill in missing values and when to omit the data point. There is also a need to estimate an

error margin for the predicted missing values. This would also help attach a confidence to a

prognosis that incorporates the predicted missing values.
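
As a hedged illustration of predicting a missing value from a patient's known features while also reporting an error margin, the sketch below uses a simple nearest-neighbour estimate. This is a stand-in technique, not the method of [191, 196]. The record layout, the feature names, the choice of k, and the use of the neighbours' standard deviation as the error margin are all assumptions, and the features are assumed to be on comparable scales already.

```python
import math
from statistics import mean, stdev
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[float]]  # feature name -> value, None if missing


def distance(a: Record, b: Record, features: List[str]) -> float:
    """Root-mean-square difference over the features known in both records."""
    shared = [f for f in features if a.get(f) is not None and b.get(f) is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((a[f] - b[f]) ** 2 for f in shared) / len(shared))


def impute(
    patient: Record, cohort: List[Record], target: str, k: int = 3
) -> Optional[Tuple[float, float]]:
    """Estimate patient[target] from the k most similar records.

    Returns (estimate, error_margin), where the margin is the sample standard
    deviation of the neighbours' values, or None when fewer than two usable
    neighbours exist. A caller could omit the data point in that case.
    """
    features = [f for f, v in patient.items() if f != target and v is not None]
    donors = [
        r
        for r in cohort
        if r.get(target) is not None and distance(patient, r, features) < math.inf
    ]
    donors.sort(key=lambda r: distance(patient, r, features))
    nearest = [r[target] for r in donors[:k]]
    if len(nearest) < 2:
        return None
    return mean(nearest), stdev(nearest)


# Toy cohort with made-up values; "mri" and "pet" are hypothetical feature names.
cohort = [
    {"mri": 1.0, "pet": 10.0},
    {"mri": 1.2, "pet": 12.0},
    {"mri": 5.0, "pet": 50.0},
    {"mri": 0.9, "pet": None},
]
patient = {"mri": 1.1, "pet": None}
result = impute(patient, cohort, "pet", k=2)
```

A caller could compare the returned margin against a feature-specific tolerance to decide between filling in the value and omitting the data point, which mirrors the decision described above.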

REFERENCES

[1] V. Acuna, C.E. Ferreira, A.S. Freire, and E. Moreno. Solving the maximum edge biclique packing problem on unbalanced bipartite graphs. Discrete Appl. Math., in press, 2011.

[2] K. Al-Mansoori, M. Hasan, A. Al-Hayani, and O. El-Agnaf. The role of α-synuclein in neurodegenerative diseases: From molecular pathways in disease to therapeutic approaches. Curr Alzheimer Res., 10(6):559–568, 2013.

[3] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, M.A. Harris, D.P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J.C. Matese, J.E. Richardson, M. Ringwald, G.M. Rubin, and G. Sherlock. Gene Ontology: A tool for the unification of biology. Nat. Genet., 25(1):25–29, 2000.

[4] G. Atluri, K. Padmanabhan, G. Fang, M. Steinbach, J. Petrella, K. Lim, A. MacDonald III, N. Samatova, M. Doraiswamy, and V. Kumar. Complex biomarker discovery in neuroimaging data: Finding a needle in a haystack. NeuroImage: Clinical, 3:123–131, 2013.

[5] G.D. Bader, D. Betel, and C.W.V. Hogue. BIND: The biomolecular interaction network database. Nucleic Acids Res., 31(1):248–250, 2003.

[6] T. Bailey and M. Gribskov. Combining evidence using p-values: Application to sequence homology searches. Bioinformatics, 14(1):48–54, 1998.

[7] C. Ballatore, V. Lee, and J. Trojanowski. Tau-mediated neurodegeneration in Alzheimer’s disease and related disorders. Nat Rev Neurosci., 8:663–672, 2007.

[8] S. Bauer, S. Grossmann, M. Vingron, and P.N. Robinson. Ontologizer 2.0 - a multifunctional tool for GO term enrichment analysis and data exploration. Bioinformatics, 24(14):1650–1651, 2008.

[9] K. Berntorp, A. Frid, R. Alm, G. Fredrikson, K. Sjoberg, and B. Ohlsson. Antibodies against gonadotropin-releasing hormone (GnRH) in patients with diabetes mellitus is associated with lower body weight and autonomic neuropathy. BMC Res Notes., 6:329, 2013.

[10] J.C. Bezdek, R. Ehrlich, and W. Full. FCM: The fuzzy c-means clustering algorithm. Comput. Geosci., 10(2-3):191–203, 1984.

[11] J. Bick and B. Lange. Metabolic cross talk between cytosolic and plastidial pathways of isoprenoid biosynthesis: Unidirectional transport of intermediates across the chloroplast envelope membrane. Mol. Syst. Biol., 415(2):146–154, 2010.

[12] K. Bosscher, W.V. Berghe, and G. Haegeman. Cross-talk between nuclear receptors and nuclear factor kappaB. Oncogene, 25(51):6868–86, 2006.

[13] E.I. Boyle, S. Weng, J. Gollub, H. Jin, D. Botstein, J.M. Cherry, and G. Sherlock. GO::TermFinder - open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes. Bioinformatics, 20(18):3710–3715, 2004.

[14] W. Bradley, R. Polinsky, W. Pendlebury, S. Jones, L. Nee, J. Bartlett, J. Hartshorn, R. Tandan, L. Sweet, and G. Magin. DNA repair deficiency for alkylation damage in cells from Alzheimer’s disease patients. Prog Clin Biol Res., 317:715–732, 1989.

[15] L. Brentner, J. Peccia, and J. Zimmerman. Challenges in developing biohydrogen as a sustainable energy source: Implications for a research agenda. Environ. Sci. Technol., 44:2243–2254, 2010.

[16] A. M. Breton, J. Schaeffer, and M. Aigle. The yeast rvs161 and rvs167 proteins are involved in secretory vesicles targeting the plasma membrane and in cell integrity. Yeast, 18(11):1053–1068, 2001.

[17] C. Bron and J. Kerbosch. Algorithm 457: Finding all cliques of an undirected graph. Commun. ACM, 16(9):575–577, 1973.

[18] A. Budanitsky. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Proceedings of the North American Conference on Chinese Linguistics Workshop on WordNet and other lexical resources: Applications, extensions, and customizations, 2001.

[19] G. Butland, J. Zhang, W. Yang, A. Sheung, P. Wong, J. Greenblatt, A. Emili, and D. Zamble. Interactions of the Escherichia coli hydrogenase biosynthetic proteins: HybG complex formation. FEBS Lett., 580:677–681, 2006.

[20] M. Cariaso and G. Lennon. SNPedia: A wiki supporting personal genome annotation, interpretation and analysis. Nucleic Acids Res, 40:1308–1312, 2012.

[21] M. Chagoyen, J. Carazo, and A. Pascual-Montano. Assessment of protein set coherence using functional annotations. BMC Bioinformatics, 9:444, 2008.

[22] J. Chen and B. Yuan. Detecting functional modules in the yeast protein-protein interaction network. Bioinformatics, 22(18):2283–2290, 2006.

[23] J.Z. Chen. Targeted therapy of obesity-associated colon cancer. Transl. Gastrointest. Cancer, 1(1):44–57, 2012.

[24] Q. Chen, W. Wei, T. Shimahara, and C. Xie. Alzheimer amyloid beta-peptide inhibits the late phase of long-term potentiation through calcineurin-dependent mechanisms in the hippocampal dentate gyrus. Neurobiol Learn Mem., 77(3):354–371, 2002.

[25] Z. Chen, W. Hendrix, and N.F. Samatova. Community-based anomaly detection in evolutionary networks. J. Intell. Inf. Syst., 39(1):59–85, 2012.

[26] Z. Chen, K. Padmanabhan, A. Rocha, Y. Shpanskaya, J. Mihelcic, K. Scott, and N. Samatova. SPICE: Discovery of phenotype-determining component interplays. BMC Sys Biol, 6(1):40, 2012.

[27] M. Dahiyat, A. Cumming, C. Harrington, C. Wischik, J. Xuereb, F. Corrigan, G. Breen, D. Shaw, and D. St Clair. Association between Alzheimer’s disease and the NOS3 gene. Ann Neurol, 46(4):664–667, 1999.

[28] M. D’Andrea. Evidence linking neuronal cell death to autoimmunity in Alzheimer’s disease. Brain Res., 982(1):19–30, 2003.

[29] C. Davatzikos, P. Bhatt, L. Shaw, K. Batmanghelich, and J. Trojanowski. Prediction of MCI to AD conversion, via MRI, CSF biomarkers, and pattern classification. Neurobiol Aging., 32:2322.e19–27, 2011.

[30] A. Davis, C. Murphy, R. Johnson, J. Lay, K. Lennon-Hopkins, C. Saraceni-Richards, D. Sciaky, B. King, M. Rosenstein, T. Wiegers, and C. Mattingly. The comparative toxicogenomics database: Update 2013. Nucleic Acids Res., 41(1):1104–1114, 2013.

[31] P.S. Dehal, M.P. Joachimiak, M.N. Price, J.T. Bates, J.K. Baumohl, D. Chivian, G.D. Friedland, K.H. Huang, K. Keller, P.S. Novichkov, I. Dubchak, E.J. Alm, and A.P. Arkin. MicrobesOnline: An integrated portal for comparative and functional genomics. Nucleic Acids Res, 38:396–400, 2009.

[32] R. Donaldson and M. Calder. Modelling and analysis of biochemical signalling pathway cross-talk. EPTCS, 19:40–54, 2010.

[33] D. Dotan-Cohen and S. Letovsky. Biological process linkage networks. PLoS One, 4(4):e5313, 2009.

[34] P. Durrens, E. Revardel, M. Bonneu, and M. Aigle. Evidence for a branched pathway in the polarized cell division of Saccharomyces cerevisiae. Curr. Genet., 27(3):213–216, 1995.

[35] D. Eppstein. Arboricity and bipartite subgraph listing algorithms. Inform. Process. Lett., 51(4):207–211, 1994.

[36] W. Feller. An introduction to probability theory and its applications. Wiley, New York, 1957.

[37] U. Freo, G. Pizzolato, M. Dam, C. Ori, and L. Battistin. A short review of cognitive and functional neuroimaging studies of cholinergic drugs: Implications for therapeutic potentials. J Neural Transm., 109(5-6):857–870, 2002.

[38] A. Fujita, K. Kojima, A. Patriota, J. Sato, P. Severino, and S. Miyano. A fast and robust statistical test based on likelihood ratio with Bartlett correction to identify Granger causality between gene sets. Bioinformatics, 26(18):2349–2351, 2010.

[39] C. Garcia-Ruiz, A. Colell, A. Morales, N. Kaplowitz, and J.C. Fernandez-Checa. Role of oxidative stress generated from the mitochondrial electron transport chain and mitochondrial glutathione status in loss of mitochondrial function and activation of transcription factor nuclear factor-kappaB: Studies with isolated mitochondria and rat hepatocytes. Mol. Pharmacol., 48:825–834, 1995.

[40] A. Gavin, P. Aloy, P. Grandi, R. Krause, M. Boesche, M. Marzioch, C. Rau, L. Jensen, S. Bastuck, B. Dumpelfeld, A. Edelmann, M. Heurtier, V. Hoffman, C. Hoefert, K. Klein, M. Hudak, A. Michon, M. Schelder, M. Schirle, M. Remor, T. Rudi, S. Hooper, A. Bauer, T. Bouwmeester, G. Casari, G. Drewes, G. Neubauer, J. Rick, B. Kuster, P. Bork, R. Russell, and G. Superti-Furga. Proteome survey reveals modularity of the yeast cell machinery. Nature, 440(7084):631–636, 2006.

[41] A. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J.M. Rick, A. Michon, C. Cruciat, M. Remor, C. Hofert, M. Schelder, M. Brajenovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi, V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M. Heurtier, R.R. Copley, A. Edelmann, E. Querfurth, V. Rybin, G. Drewes, M. Raida, T. Bouwmeester, P. Bork, B. Seraphin, B. Kuster, G. Neubauer, and G. Superti-Furga. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415(6868):141–147, 2002.

[42] J.P. Gerrity. A note on Gandin and Murphy’s equitable skill score. Mon. Weather. Rev., 120:2709–2712, 1992.

[43] A.B. Geva. Hierarchical unsupervised fuzzy clustering. IEEE Trans. Fuzzy Syst., 7(6):723–733, 1999.

[44] N. Ghebranious, B. Mukesh, P. Giampietro, I. Glurich, S. Mickel, S. Waring, and C. McCarty. A pilot study of gene/gene and gene/environment interactions in Alzheimer disease. Clin Med Res., 9(1):17–25, 2011.

[45] B. Giunta, F. Fernandez, W. Nikolic, D. Obregon, E. Rrapo, T. Town, and J. Tan. Inflammaging as a prodrome to Alzheimer’s disease. J Neuroinflammation., 5:51, 2008.

[46] C. Goh, T. Gianoulis, Y. Liu, J. Li, A. Paccanaro, Y. Lussier, and M. Gerstein. Integration of curated databases to identify genotype-phenotype associations. BMC Genomics, 7:257, 2006.

[47] O. Gonzalez and R. Zimmer. Assigning functional linkages to proteins using phylogenetic profiles and continuous phenotypes. Bioinformatics, 24(10):1257–1263, 2008.

[48] R. Gross and D. Beier, editors. Two-Component Systems in Bacteria. Caister Academic Press, Norwich, U.K., 2011.

[49] T. Grune, T. Reinheckel, M. Joshi, and K.J. Davies. Proteolysis in cultured liver epithelial cells during oxidative stress. Role of the multicatalytic proteinase complex, proteasome. J. Biol. Chem., 270:2344–2351, 1995.

[50] O. Guerrini, P. Soucaille, L. Girbal, B. Guigliarelli, C. Leger, B. Burlat, and C. Lger. Characterization of two 2[4Fe4S] ferredoxins from Clostridium acetobutylicum. Curr. Microbiol., 56:261–267, 2008.

[51] L. Guimei, K. Sim, and J. Li. Efficient mining of large maximal bicliques. In Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, 2006.

[52] U. Guldener, M. Munsterkotter, M. Oesterheld, P. Pagel, A. Ruepp, H. Mewes, and V. Stumpflen. MPact: The MIPS protein interaction resource on yeast. Nucleic Acids Res., 34(1):436–441, 2005.

[53] K. Gulshan, S.A. Rovinsky, and W.S. Moye-Rowley. YBP1 and its homologue YBP2/YBH1 influence oxidative-stress tolerance by nonidentical mechanisms in Saccharomyces cerevisiae. Eukaryot. Cell, 3:318–330, 2004.

[54] M. Habibi, C. Eslahchi, and L. Wong. Protein complex prediction based on k-connected subgraphs in protein interaction network. BMC Syst. Biol., 4:129, 2010.

[55] A. Hach, T. Hon, and L. Zhang. A new class of repression modules is critical for heme regulation of the yeast transcriptional activator HAP1. Mol. Cell Biol., 19(6):4324–4333, 1999.

[56] R.A. Hand, N. Jia, M. Bard, and R.J. Craven. Saccharomyces cerevisiae Dap1p, a novel DNA damage response protein related to the mammalian membrane-associated progesterone receptor. Eukaryot. Cell, 2:306–317, 2003.

[57] J. Hartman, B. Garvik, and L. Hartwell. Principles for the buffering of genetic variation. Science, 291:1001–1004, 2001.

[58] L.H. Hartwell, J.J. Hopfield, S. Leibler, and A.W. Murray. From molecular to modular cell biology. Nature, 402(6761):47–52, 1999.

[59] P. Heidke. Berechnung des Erfolges und der Güte der Windstärkevorhersagen im Sturmwarnungsdienst. Geogr. Ann., 8:301–349, 1926.

[60] W. Hendrix, A. Rocha, M. Elmore, J. Trien, and N. Samatova. Discovery of enriched biological motifs using knowledge priors with application to biohydrogen production. In Proceedings of the International Conference on Bioinformatics & Computational Biology, pages 17–23, 2010.

[61] W. Hendrix, A. Rocha, K. Padmanabhan, A. Choudhary, K. Scott, J. Mihelcic, and N.F. Samatova. DENSE: Efficient and prior knowledge-driven discovery of phenotype-associated protein functional modules. BMC Syst. Biol., 5:172, 2011.

[62] Y. Ho, A. Gruhler, A. Heilbut, G. Bader, L. Moore, S. Adams, A. Millar, P. Taylor, K. Bennett, K. Boutilier, L. Yang, C. Wolting, I. Donaldson, S. Schandorff, J. Shewnarane, M. Vo, J. Taggart, M. Goudreault, B. Muskat, C. Alfarano, D. Dewar, Z. Lin, K. Michalickova, A. Willems, H. Sassi, P. Nielsen, K. Rasmussen, J. Andersen, L. Johansen, L. Hansen, H. Jespersen, A. Podtelejnikov, E. Nielsen, J. Crawford, V. Poulsen, B. Sorensen, J. Matthiesen, R. Hendrickson, F. Gleeson, T. Pawson, M. Moran, D. Durocher, M. Mann, C. Hogue, D. Figeys, and M. Tyers. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415(6868):180–183, 2002.

[63] K. Holohan, D. Lahiri, B. Schneider, T. Foroud, and A. Saykin. Functional microRNAs in Alzheimer’s disease and cancer: Differential regulation of common mechanisms and pathways. Front Genet., 3:323, 2012.

[64] K. Honjo, R. van Reekum, and N. Verhoeff. Alzheimer’s disease and infection: Do infectious agents contribute to progression of Alzheimer’s disease? Alzheimers Dement., 5(4):348–360, 2009.

[65] Y. Horng, S. Chen, Y. Chang, and C. Lee. A new method for fuzzy information retrieval based on fuzzy hierarchical clustering and fuzzy inference techniques. IEEE Trans. Fuzzy Syst., 13(2):216–228, 2005.

[66] D.W. Huang, B.T. Sherman, and R.A. Lempicki. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc., 4(1):44–57, 2008.

[67] D.W. Huang, B.T. Sherman, and R.A. Lempicki. Bioinformatics enrichment tools: Paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res., 37(1):1–13, 2009.

[68] W. Hwang, Y. Cho, A. Zhang, and M. Ramanathan. A novel functional module detection algorithm for protein-protein interaction networks. Algorithms Mol. Biol., 1:24, 2006.

[69] T.L. Jackson. A mathematical investigation of the multiple pathways to recurrent prostate cancer: Comparison with experimental data. Neoplasia, 6(6):697–704, 2004.

[70] L. J. Jensen, M. Kuhn, M. Stark, S. Chaffron, C. Creevey, J. Muller, T. Doerks, P. Julien, A. Roth, M. Simonovic, P. Bork, and C. von Mering. STRING 8: A global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res., 37:412–416, 2009.

[71] J.J. Jiang and D.W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, 1997.

[72] K. Jim, K. Parmar, M. Singh, and S. Tavazoie. A cross-genomic approach for systematic mapping of phenotypic traits to genes. Genome Res., 14(1):109–115, 2004.

[73] M. Johnston. A model fungal gene regulatory mechanism: the GAL genes of Saccharomyces cerevisiae. Microbiol Rev., 51(4):458–476, 1987.

[74] B. Jung, K. Pyo, K. Shin, Y. Hwang, H. Lim, S. Lee, J. Moon, S. Lee, Y. Suh, J. Chai, and E. Shin. Toxoplasma gondii infection in the brain inhibits neuronal degeneration and learning and memory impairments in a murine model of Alzheimer’s disease. PLoS ONE, 7(3):e33312, 2012.

[75] M. Kanehisa and S. Goto. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res., 28(1):27–30, 2000.

[76] M. Kanehisa, S. Goto, M. Furumichi, M. Tanabe, and M. Hirakawa. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res., 38(1):355–360, 2010.

[77] M. Kanehisa, S. Goto, M. Hattori, K.F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa. From genomics to chemical genomics: New developments in KEGG. Nucleic Acids Res., 34(1):354–357, 2006.

[78] I. Kapdan and F. Kargi. Bio-hydrogen production from waste materials. Enzyme Microb. Technol., 38(5):569–582, 2006.

[79] G. Kastenmuller, M. Schenk, J. Gasteiger, and H. Mewes. Uncovering metabolic pathways relevant to phenotypic traits of microbial genomes. Gen. Biol., 10(3):28, 2009.

[80] B.S. Katzenellenbogen. Estrogen receptors: Bioactivities and interactions with cell signaling pathways. Biol. Reprod., 54(2):287–293, 1996.

[81] H. Kawaji, J. Severin, M. Lizio, A. Forrest, E. van Nimwegen, M. Rehli, K. Schroder, K. Irvine, H. Suzuki, P. Carninci, Y. Hayashizaki, and C. Daub. Update of the FANTOM web resource: From mammalian transcriptional landscape to its dynamic regulation. Nucleic Acids Res., 39:856–860, 2010.

[82] H. Kawaji, J. Severin, M. Lizio, A. Waterhouse, S. Katayama, K. Irvine, D. Hume, A. Forrest, H. Suzuki, P. Carninci, Y. Hayashizaki, and C. Daub. The FANTOM web resource: From mammalian transcriptional landscape to its dynamic regulation. Genome Biol., 10(4):R40, 2009.

[83] P. Khatri and S. Draghici. Ontological analysis of gene expression data: Current tools, limitations, and open problems. Bioinformatics, 21(18):3587–3595, 2005.

[84] Y. Kikuchi, Y. Oka, M. Kobayashi, Y. Uesono, A. Tohe, and A. Kikuchi. A new yeast gene, HTR1, required for growth at high-temperature, is needed for recovery from mating pheromone-induced G1 arrest. Mol. Gen. Genet., 245:107–116, 1994.

[85] Y. Kim, S. Wuchty, and T. Przytycka. Identifying causal genes and dysregulated pathways in complex diseases. PLoS Comput Biol, 7(3), 2011.

[86] A. Kobayashi, M. Kang, Y. Watai, K.I. Tong, T. Shibata, K. Uchida, and M. Yamamoto. Oxidative and electrophilic stresses activate Nrf2 through inhibition of ubiquitination activity of Keap1. Mol. Cell. Biol., 26(1):221–229, 2006.

[87] O. Kohannim, X. Hua, D. Hibar, S. Lee, Y. Chou, A. Toga, C. Jack, M. Weiner, and P. Thompson. Boosting power for clinical trials using classifiers based on multiple biomarkers. Neurobiol Aging, 31(8):1429–1442, 2010.

[88] E.V. Koonin. Orthologs, paralogs, and evolutionary genomics. Annu. Rev. Genet., 39:309–338, 2005.

[89] J. Korbel, T. Doerks, L. Jensen, C. Perez-Iratxeta, S. Kaczanowski, S. Hooper, M. Andrade, and P. Bork. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol., 3:e134, 2005.

[90] K.L. Kovacs, Z. Bagi, B. Balint, B.D. Fodor, G. Scnadi, R. Csaki, T. Hanczar, A.T. Kovacs, G. Maroti, K. Perei, A. Toth, and G. Rakhely. Novel approaches to exploit microbial hydrogen metabolism. In J. Miyake, Y. Igarashi, and M. Rogner, editors, Biohydrogen III. Elsevier Inc, New York, USA, 2004.

[91] N. Krogan, W. Peng, G. Cagney, M. Robinson, R. Haw, G. Zhong, X. Guo, X. Zhang, V. Canadien, D. Richards, B. Beattie, A. Lalev, W. Zhang, A. Davierwala, S. Mnaimneh, A. Starostine, A. Tikuisis, J. Grigull, N. Datta, J. Bray, T. Hughes, A. Emili, and J. Greenblatt. High-definition macromolecular composition of yeast RNA-processing complexes. Mol. Cell., 13(2):225–239, 2004.

[92] G.N. Lance and W.T. Williams. A general theory of classificatory sorting strategies. Comput. J., 9(4):373–380, 1967.

[93] C. Lanni, D. Uberti, M. Racchi, S. Govoni, and M. Memo. Unfolded p53: A potential biomarker for Alzheimer’s disease. J Alzheimers Dis., 12:93–99, 2007.

[94] W. Lathe, J. Williams, M. Mangan, and D. Karolchik. Genomic data resources: Challenges and promises. Nature Edu., 1(3), 2008.

[95] H. Lavoie, H. Hogues, J. Mallick, A. Sellam, A. Nantel, and M. Whiteway. Evolutionary tinkering with conserved components of a transcriptional regulatory network. PLoS Biol., 8:e1000329, 2010.

[96] A. Lawrence and B. Sahakian. Alzheimer disease, attention, and the cholinergic system. Alzheimer Dis Assoc Disord, 9 Suppl 2:43–49, 1995.

[97] M. Levesque, D. Shasha, W. Kim, M.G. Surette, and P.N. Benfey. Trait-to-Gene: A computational method for predicting the function of uncharacterized genes. Curr. Biol., 13(2):129–133, 2003.

[98] L. Li, B. Rakitsch, and K. Borgwardt. ccSVM: Correcting support vector machines for confounding factors in biological data classification. Bioinformatics, 27(13):i342–i348, 2011.

[99] S. Li and X. Li. Multiple pathways contribute to the pathogenesis of Huntington disease. Mol. Neurodegener., 16:1–19, 2006.

[100] T. Li, P. Du, and N. Xu. Identifying human kinase-specific protein phosphorylation sites by integrating heterogeneous information from various sources. PLoS ONE, 5(11):e15411, 2010.

[101] Y. Li. Pathway crosstalk network. In S. Choi, editor, Syst Biol Sig Net. Springer, New York, USA, 2010.

[102] Y. Li, P. Agarwal, and D. Rajagopalan. A global pathway crosstalk network. Bioinformatics, 24(12):1442–1447, 2008.

[103] J. Liu and W. Wang. OP-cluster: Clustering by tendency in high dimensional space. In Proceedings of the International Conference on Data Mining, 2003.

[104] L. Liu, J.R. Trimarchi, P. Navarro, M.A. Blasco, and D.L. Keefe. Oxidative stress contributes to arsenic-induced telomere attrition, chromosome instability, and apoptosis. J. Biol. Chem., 278(34):31998–32004, 2003.

[105] Y. Liu, J. Li, L. Sam, C. Goh, M. Gerstein, and Y. Lussier. An integrative genomic approach to uncover molecular mechanisms of prokaryotic traits. PLoS Comput. Biol., 2(11):e159, 2006.

[106] Y. Liu, F. Zeng, Y. Wang, H. Zhou, B. Giunta, J. Tan, and Y. Wang. Immunity and Alzheimer’s disease: Immunological perspectives on the development of novel therapies. Drug Discov Today., 2013.

[107] S. Lopez-Gomollon, J. Hernandez, S. Pellicer, V. Angarica, M. Peleato, and M. Fillat. Cross-talk between iron and nitrogen regulatory networks in Anabaena (Nostoc) sp. PCC 7120: Identification of overlapping genes in FurA and NtcA regulons. J. Mol. Biol, 374(1):267–281, 2007.

[108] D.M. Lorenz, A. Jeng, and M.W. Deem. The emergence of modularity in biological systems. Phys. Life Rev., 8(2):129–160, 2011.

[109] K. Makino and T. Uno. New algorithms for enumerating all maximal cliques. In Proceedings of the Scandinavian Workshop Algorithm Theory, 2004.

[110] S. Mandrekar-Colucci, C. Karlo, and G. Landreth. Mechanisms underlying the rapid peroxisome proliferator-activated receptor-γ-mediated amyloid clearance and reversal of cognitive deficits in a murine model of Alzheimer’s disease. J Neurosci., 32(30):10117–10128, 2012.

[111] M.N. McClean, A. Mody, J.R. Broach, and S. Ramanathan. Cross-talk and decision making in MAP kinase pathways. Nat. Genet., 39(3):409–414, 2007.

[112] M. Meraz-Ríos, D. Toral-Rios, D. Franco-Bocanegra, J. Villeda-Hernandez, and V. Campos-Pena. Inflammatory process in Alzheimer’s disease. Front Integr Neurosci., 7:59, 2013.

[113] C. Mering, M. Huynen, D. Jaeggi, S. Schmidt, P. Bork, and B. Snel. STRING: A database of predicted functional associations between proteins. Nucleic Acids Res., 31(1):258–261, 2003.

[114] M. Mirghorbani and P. Krokhmal. On finding k-cliques in k-partite graphs. Optimization Lett., pages 1–11, 2012.

[115] M. Mistry and P. Pavlidis. Gene Ontology term overlap as a measure of gene functional similarity. BMC Bioinformatics, 9(1):327, 2008.

[116] J. Miyake. Biohydrogen. Plenum Press, New York, USA, 1998.

[117] B. Mosch, M. Morawski, A. Mittag, D. Lenz, A. Tarnok, and T. Arendt. Aneuploidy and DNA replication in the normal human brain and Alzheimer’s disease. J Neurosci., 27(26):6859–6867, 2007.

[118] R. Mrak and S. Griffin. Interleukin-1, neuroinflammation, and Alzheimer’s disease. Neurobiol Aging, 22(6):903–908, 2001.

[119] A. L. Munn, B. J. Stevenson, M. I. Geli, and H. Riezman. END5, END6, and END7: Mutations that cause actin delocalization and block the internalization step of endocytosis in Saccharomyces cerevisiae. Mol. Biol. Cell, 6(12):1721–1742, 1995.

[120] I. Murray, J. Proza, F. Sohrabji, and J. Lawler. Vascular and metabolic dysfunction in Alzheimer’s disease: a review. Exp Biol Med (Maywood), 236(7):772–782, 2011.

[121] V.E. Myer and R.A. Young. RNA polymerase II holoenzymes and subcomplexes. J. Biol.Chem., 273(43):27757–27760, 1998.

[122] C.L. Myers, D. Robson, A. Wible, M.A. Hibbs, C. Chiriac, C.L. Theesfeld, K. Dolinski,and O.G. Troyanskaya. Discovery of biological networks from diverse functional genomicdata. Genome Biol., 6(13):114, 2005.

[123] N. Nagarajan and C. Kingsford. GiRaF: Robust, computational identification of influenzareassortments via graph mining. Nucleic Acids Res, 39(6), 2011.

[124] K. Nath and D. Das. Improvement of fermentative hydrogen production: Various ap-proaches. Appl. Microbiol. Biotechnol., 65(5):520–529, 2004.

[125] B. North, D. Curtis, and P. Sham. A note on the calculation of empirical p-values fromMonte Carlo procedures. Am. J. Hum. Genet., 71(2):439–441, 2002.

[126] D. Nussbaum, S. Pu, J. Sack, T. Uno, and H. Zarrabi-Zadeh. Finding maximum edgebicliques in convex bipartite graphs. Algorithmica, 64(2), 2012.

[127] A. Paccanaro, V. Trifonov, H. Yu, and M. Gerstein. Inferring protein-protein interactions using interaction network topologies. In Proceedings of the International Joint Conference on Neural Networks, 2005.

[128] K. Padmanabhan, K. Wang, and N. Samatova. Functional annotation of hierarchical modularity. PLoS ONE, 7(4):e33744, 2012.

[129] K. Padmanabhan, K. Wilson, A. Rocha, K. Wang, J. Mihelcic, and N. Samatova. In-silico identification of phenotype-biased functional modules. Proteome Sci, 10(1):2, 2012.

[130] J. Pandey, M. Koyuturk, and A. Grama. Functional characterization and topological modularity of molecular interaction networks. BMC Bioinformatics, 11(1):35, 2010.

[131] J. Pandey, M. Koyuturk, S. Subramaniam, and A. Grama. Functional coherence in domain interaction networks. Bioinformatics, 24(16):28–34, 2008.

[132] C.S. Peirce. The numerical measure of the success of predictions. Science, 4(93):453–454, 1884.

[133] V. Pilauri, M. Bewley, C. Diep, and J. Hopper. Gal80 dimerization and the yeast GAL gene switch. Genetics, 169(4):1903–1914, 2005.

[134] A. Platt and R.J. Reece. The yeast galactose genetic switch is mediated by the formation of a Gal4p-Gal80p-Gal3p complex. EMBO J., 17(14):4086–4091, 1998.

[135] S. Powell, D. Szklarczyk, K. Trachana, A. Roth, M. Kuhn, J. Muller, R. Arnold, T. Rattei, I. Letunic, T. Doerks, L.J. Jensen, C. von Mering, and P. Bork. eggNOG v3.0: Orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res., 40(1):284–289, 2012.

[136] A. Prelic, S. Bleuler, P. Zimmermann, A. Wille, P. Buhlmann, W. Gruissem, L. Hennig, L. Thiele, and E. Zitzler. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9):1122–1129, 2006.

[137] S. Pu, J. Wong, B. Turner, E. Cho, and S.J. Wodak. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res., 37(3):825–831, 2009.

[138] J. Quinn. Different stress response between model and pathogenic fungi. In S. Avery, M. Stratford, and P. West, editors, Stress in yeasts and filamentous fungi. Academic Press, Massachusetts, USA, 2008.

[139] V. Ramanan, S. Kim, K. Holohan, L. Shen, K. Nho, S. Risacher, T. Foroud, S. Mukherjee, P. Crane, P. Aisen, R. Petersen, M. Weiner, and A. Saykin. Genome-wide pathway analysis of memory impairment in the Alzheimer’s disease neuroimaging initiative (ADNI) cohort implicates gene candidates, canonical pathways, and networks. Brain Imaging Behav., 6(4):634–648, 2012.

[140] E. Ravasz, A.L. Somera, D.A. Mongru, Z.N. Oltvai, and A.L. Barabasi. Hierarchical organization of modularity in metabolic networks. Science, 297(5586):1551–1555, 2002.

[141] P. Resnik. Semantic similarity in a taxonomy: An information based measure and its application to problems of ambiguity in natural language. J. Artif. Intell. Res., 11(1):95–130, 1999.

[142] F. Rey, E. Heiniger, and C. Harwood. Redirection of metabolism for biological hydrogen production. Appl. Environ. Microbiol., 73(5):1665–1671, 2007.

[143] F. Rey, Y. Oda, and C. Harwood. Regulation of uptake hydrogenase and effects of hydrogen utilization on gene expression in Rhodopseudomonas palustris. J. Bacteriol., 188(17):6143–6152, 2006.

[144] C. Roe, M. Behrens, C. Xiong, J. Miller, and J. Morris. Alzheimer disease and cancer. Neurology, 64(5):895–898, 2005.

[145] A. Ruepp, A. Zollner, D. Maier, K. Albermann, J. Hani, M. Mokrejs, I. Tetko, U. Guldener, G. Mannhaupt, M. Munsterkotter, and H.W. Mewes. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res., 32(18):5539–5545, 2004.

[146] T. Ruths, D. Ruths, and L. Nakhleh. GS2: An efficiently computable measure of GO-based similarity of gene sets. Bioinformatics, 25(9):1178–1184, 2009.

[147] R.V. Nataraj and S. Selvan. Parallel mining of large maximal bicliques using order preserving generators. Int. J. Comput., 8(3):105–113, 2009.

[148] F. Sallusto and A. Lanzavecchia. Heterogeneity of CD4+ memory T cells: Functional modules for tailored immunity. Eur. J. Immunol., 39(8):2076–2082, 2009.

[149] F. Sardi, L. Fassina, L. Venturini, M. Inguscio, F. Guerriero, E. Rolfo, and G. Ricevuti. Alzheimer’s disease, autoimmunity and inflammation. The good, the bad and the ugly. Autoimmun Rev., 11(2):149–153, 2011.

[150] R. Schaffrath and K.D. Breunig. Genetics and molecular physiology of the yeast Kluyveromyces lactis. Fungal Genet. Biol., 30:173–190, 2000.

[151] A. Schlesser, S. Ulaszewski, M. Ghislain, and A. Goffeau. A second transport ATPase gene in Saccharomyces cerevisiae. J. Biol. Chem., 263:19480–19487, 1988.

[152] M. Schmidt, A. Rocha, K. Padmanabhan, Z. Chen, K. Scott, J. Mihelcic, and N.F. Samatova. Efficient α, β-motif finder for identification of phenotype-related functional modules. BMC Bioinformatics, 12:440, 2011.

[153] M. Schmidt, A. Rocha, K. Padmanabhan, Y. Shpanskaya, J. Banfield, K. Scott, J. Mihelcic, and N. Samatova. NIBBS-Search for fast and accurate prediction of phenotype-biased metabolic systems. PLoS Comput Biol, 8(5):e1002490, 2012.

[154] M.C. Schmidt, N.F. Samatova, K. Thomas, and B. Park. A scalable, parallel algorithm for maximal clique enumeration. J. Parallel. Distr. Com., 69(4):417–428, 2009.

[155] M.A. Schwartz and V. Baron. Interactions between mitogenic stimuli, or, a thousand and one connections. Curr. Opin. Cell Biol., 11(2):197–202, 1999.

[156] D. Selkoe. Alzheimer’s disease: Genes, proteins, and therapy. Physiol Rev., 81:741–766, 2001.

[157] S.B. Selleck and J.E. Majors. In vivo DNA-binding properties of a yeast transcription activator protein. Mol. Cell Biol., 7(9):3260–3267, 1987.

[158] J. Shaffer, J. Petrella, F. Sheldon, K. Choudhury, V. Calhoun, E. Coleman, and M. Doraiswamy. Predicting cognitive decline in subjects at risk for Alzheimer disease by using combined cerebrospinal fluid, MR imaging, and PET biomarkers. Radiology, 266(2):583–591, 2013.

[159] S. Shima and R. Thauer. A third type of hydrogenase catalyzing H2 activation. Chem. Rec., 7(1):37–46, 2007.

[160] M. Silver, E. Janousova, X. Hua, P. Thompson, and G. Montana. Identification of gene pathways implicated in Alzheimer’s disease using longitudinal imaging phenotypes with sparse regression. NeuroImage, 63(3):1681–1694, 2012.

[161] M. Silver and G. Montana. Fast identification of biological pathways associated with a quantitative trait using group lasso with overlaps. Stat Appl Genet Mol Biol., 11(1):Article 7, 2012.

[162] P. Sivadon, F. Bauer, M. Aigle, and M. Crouzet. Actin cytoskeleton and budding pattern are altered in the yeast RVS161 mutant: The RVS161 protein shares common domains with the brain protein amphiphysin. Mol. Gen. Genet., 246(4):485–495, 1995.

[163] N. Slonim, O. Elemento, and S. Tavazoie. Ab initio genotype-phenotype association reveals intrinsic modularity in genetic networks. Mol. Syst. Biol., 2:2006.0005, 2006.

[164] V. Spirin and L. Mirny. Protein complexes and functional modules in molecular networks. Proc. Natl. Acad. Sci., 100(21):12123–12128, 2003.

[165] C. Stark, B. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers. BioGRID: A general repository for interaction datasets. Nucleic Acids Res., 34:535–539, 2005.

[166] S.W. Stevens and J. Abelson. Purification of the yeast U4/U6-U5 small nuclear ribonucleoprotein particle and identification of its proteins. Proc. Natl. Acad. Sci., 96(13):7226–7231, 1999.

[167] K. Suhre and J. Claverie. FusionDB: A database for in-depth analysis of prokaryotic gene fusion events. Nucleic Acids Res., 32(1):273–276, 2004.

[168] P. Suthers, A. Zomorrodi, and C. Maranas. Genome-scale gene/reaction essentiality and synthetic lethality analysis. Mol Syst Biol., 5, 2009.

[169] K. Tachihara, T. Uemura, K. Kashiwagi, and K. Igarashi. Excretion of putrescine and spermidine by the protein encoded by YKL174C (TPO5) in Saccharomyces cerevisiae. J. Biol. Chem., 280(13):12637–12642, 2005.

[170] S.L. Tai, P. Daran-Lapujade, M.C. Walsh, J.T. Pronk, and J. Daran. Acclimation of Saccharomyces cerevisiae to low temperature: A chemostat-based transcriptome analysis. Mol. Biol. Cell., 18(12):5100–5112, 2007.

[171] A. Tamburri, A. Dudilot, S. Licea, C. Bourgeois, and J. Boehm. NMDA-Receptor activation but not ion flux is required for amyloid-beta induced synaptic depression. PLoS ONE, 8(6):e65350, 2013.

[172] M. Tamura and P. D’haeseleer. Microbial genotype-phenotype mapping by class association rule mining. Bioinformatics, 24(13):1523–1529, 2008.

[173] M. Tamura and P. D’haeseleer. Comparative genome content analysis with respect to basic microbial phenotypes by class association rule mining. In Proceedings of the International Conference on Systems Biology, 2007.

[174] A. Tanay, R. Sharan, and R. Shamir. Discovering statistically significant biclusters in gene expression data. Bioinformatics, 18(1):136–144, 2002.

[175] R.L. Tatusov, E.V. Koonin, and D.J. Lipman. A genomic perspective on protein families. Science, 278(5338):631–637, 1997.

[176] R.L. Tatusov, M. Galperin, D. Natale, and E.V. Koonin. The COG database: A tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res., 28(1):33–36, 2000.

[177] A. Thomaz, L. Pozzo, A. Fontes, D. Almeida, C. Stahl, J. Santos-Mallet, S. Gomes, D. Feder, D. Ayres, S. Giorgio, and C. Cesar. Optical tweezers force measurements to study parasites chemotaxis. In Proceedings of the European Conference on Biomedical Optics, 2009.

[178] G. Tsekouras, H. Sarimveis, E. Kavakli, and G. Bafas. A hierarchical fuzzy-clustering approach to fuzzy modeling. Fuzzy Set. Syst., 150(2):245–266, 2005.

[179] McMaster University. Couch potatoes explained? Missing key genes may be cause for lack of resolve to exercise, researchers find. http://www.sciencedaily.com/releases/2011/09/110905160904.htm, 2011. Accessed: 11/26/2012.

[180] P. Vignais, B. Billoud, and J. Meyer. Classification and phylogeny of hydrogenases. FEMS Microbiol. Rev., 25(4):455–501, 2001.

[181] A. Vignini, A. Giulietti, L. Nanetti, F. Raffaelli, L. Giusti, L. Mazzanti, and L. Provinciali. Alzheimer’s disease and diabetes: New insights and unifying therapies. Curr Diabetes Rev., 9(3):218–227, 2013.

[182] A.G. Vitreschak, A. Mironov, V.A. Lyubetsky, and M. Gelfand. Comparative genomic analysis of T-box regulatory systems in bacteria. RNA, 14(4):717–735, 2008.

[183] M.D. Wallace, A.D. Pfefferle, L. Shen, A.J. McNairn, E.G. Cerami, B.L. Fallon, V.D. Rinaldi, T.L. Southard, C.M. Perou, and J.C. Schimenti. Comparative oncogenomics implicates the Neurofibromin 1 Gene (NF1) as a breast cancer driver. Genetics, 192(2):385–396, 2012.

[184] Q.Q. Wang and A. Chang. Sphingoid base synthesis is required for oligomerization and cell surface stability of the yeast plasma membrane ATPase, PMA1. Proc. Natl. Acad. Sci., 99(20):12853–12858, 2002.

[185] X. Wang and P. Ward. Opportunities and challenges of disease biomarkers: a new section in the Journal of Translational Medicine. J Transl Med, 10(1):220, 2012.

[186] D. Warde-Farley, S. Donaldson, O. Comes, K. Zuberi, R. Badrawi, P. Chao, M. Franz, C. Grouios, F. Kazi, C. Lopes, A. Maitland, S. Mostafavi, J. Montojo, Q. Shao, G. Wright, G. Bader, and Q. Morris. The GeneMANIA prediction server: Biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res., 38(2):W214–W220, 2010.

[187] M. Weiner, D. Veitch, P. Aisen, L. Beckett, N. Cairns, R. Green, D. Harvey, C. Jack, W. Jagust, E. Liu, J. Morris, R. Petersen, A. Saykin, M. Schmidt, L. Shaw, J. Siuciak, H. Soares, A. Toga, and J. Trojanowski. The Alzheimer’s disease neuroimaging initiative: A review of papers published since its inception. Alzheimers Dement, 8(1):1–68, 2012.

[188] D. White. The physiology and biochemistry of prokaryotes. Oxford University Press, Inc., Oxford, UK, 2000.

[189] H. Wiener, R. Perry, Z. Chen, L. Harrell, and R. Go. A polymorphism in SOD2 is associated with development of Alzheimer’s disease. Genes Brain Behav, 6(8):770–776, 2007.

[190] C.Y. Wu, S. Roje, F.J. Sandoval, A.J. Bird, D.R. Winge, and D.J. Eide. Repression of sulfate assimilation is an adaptive response of yeast to the oxidative stress of zinc deficiency. J. Biol. Chem., 284(40):27544–27556, 2009.

[191] S. Xiang, L. Yuan, W. Fan, Y. Wang, P. Thompson, and J. Ye. Bi-level multi-source learning for heterogeneous block-wise missing data. NeuroImage, 13:S1053–8119, 2013.

[192] Y. Xu, W. Hu, Z. Chang, H. DuanMu, S. Zhang, Z. Li, Z. Li, L. Yu, and X. Li. Prediction of human protein-protein interaction by a mixed Bayesian model and its application to exploring underlying cancer-related pathway crosstalk. J. Royal Soc. Interface, 8(57):555–567, 2011.

[193] B. Yan and S. Gregory. Finding missing edges and communities in incomplete networks. J. Phys. A., 44(49):5102, 2011.

[194] Y. Yang, D. Geldmacher, and K. Herrup. DNA replication precedes neuronal cell death in Alzheimer’s disease. J Neurosci., 21(8):2661–2668, 2001.

[195] J. Yu and P. Takahashi. Biophotolysis-based hydrogen production by cyanobacteria and green microalgae. In A. Mendez-Vilas, editor, Communicating Current Research and Educational Topics and Trends in Applied Microbiology. Formatex Research Center, Badajoz, Spain, 2007.

[196] L. Yuan, Y. Wang, P. Thompson, V. Narayan, and J. Ye. Multi-source learning for joint analysis of incomplete multi-modality neuroimaging data. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 1149–1157, 2012.

[197] Y. Yurov, S. Vorsanova, and I. Iourov. The DNA replication stress hypothesis of Alzheimer’s disease. ScientificWorldJournal., 11:11, 2011.

[198] A. Zalesky, A. Fornito, and E. Bullmore. Network-based statistic: Identifying differences in brain networks. NeuroImage, 53(4):1197–1207, 2010.

[199] B. Zhang, B. Park, T. Karpinets, and N.F. Samatova. From pull-down data to protein interaction networks and complexes with biological relevance. Bioinformatics, 24(7):979–986, 2008.

[200] Y. Zhang, E.J. Chesler, and M.A. Langston. On finding bicliques in bipartite graphs: A novel algorithm with application to the integration of diverse biological data types. In Proceedings of the International Conference on System Sciences, 2008.

[201] H. Zhou and R. Lipowsky. The yeast protein-protein interaction map is a highly modular network with a staircase community structure. in press, 2005.
