
Methods in Bioengineering: Systems Analysis of Biological Networks

The Artech House Methods in Bioengineering Series

Series Editors-in-Chief: Martin L. Yarmush, M.D., Ph.D., and Robert S. Langer, Sc.D.

Methods in Bioengineering: Biomicrofabrication and Biomicrofluidics, Jeffrey D. Zahn and Luke P. Lee, editors

Methods in Bioengineering: Microdevices in Biology and Medicine, Yaakov Nahmias and Sangeeta N. Bhatia, editors

Methods in Bioengineering: Nanoscale Bioengineering and Nanomedicine, Kaushal Rege and Igor Medintz, editors

Methods in Bioengineering: Stem Cell Bioengineering, Biju Parekkadan and Martin L. Yarmush, editors

Methods in Bioengineering: Systems Analysis of Biological Networks, Arul Jayaraman and Juergen Hahn, editors

Methods in Bioengineering: Systems Analysis of Biological Networks

Arul Jayaraman, Department of Chemical Engineering, Texas A&M University

Juergen Hahn, Department of Chemical Engineering, Texas A&M University

Editors

artechhouse.com

Library of Congress Cataloging-in-Publication Data: A catalog record for this book is available from the U.S. Library of Congress.

British Library Cataloguing in Publication Data: A catalogue record for this book is available from the British Library.

ISBN-13: 978-1-59693-406-1

Cover design by Yekaterina Ratner

© 2009 Artech House. All rights reserved.

Printed and bound in the United States of America. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher.

All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Artech House cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.


Contents

CHAPTER 1 Quantitative Immunofluorescence for Measuring Spatial Compartmentation of Covalently Modified Signaling Proteins

1.1 Introduction

1.2 Experimental Design

1.3 Materials

1.3.1 Cell culture

1.3.2 Buffers/reagents

1.3.3 Immunofluorescence reagents

1.4 Methods

1.4.1 Cell culture and stimulation for phospho-ERK measurements

1.4.2 Antibody labeling of phosphorylated ERK (ppERK)

1.4.3 Fluorescence microscopy imaging of ppERK and automated image analysis

1.5 Data Acquisition, Anticipated Results, and Interpretation

1.6 Statistical Guidelines

1.7 Discussion and Commentary

1.8 Application Notes

1.9 Summary Points

Acknowledgments

References

CHAPTER 2 Development of Green Fluorescent Protein-Based Reporter Cell Lines for Dynamic Profiling of Transcription Factor and Kinase Activation

2.1 Introduction

2.2 Materials

2.2.1 Cell and bacterial culture

2.2.2 Buffers and reagents

2.2.3 Cloning

2.2.4 Microscopy

2.3 Methods

2.3.1 3T3-L1 cell culture

2.3.2 Transcription factor reporter development

2.3.3 Kinase reporter development

2.4.1 Electroporation of TF reporter plasmids into 3T3-L1 preadipocytes

2.4.2 Monitoring activation of ERK in HepG2 cells

Acknowledgments

References

CHAPTER 3 Comparison of Algorithms for Analyzing Fluorescent Microscopy Images and Computation of Transcription Factor Profiles

3.1 Introduction

3.2 Preliminaries

3.2.1 Principles of GFP reporter systems

3.2.2 Wavelets

3.2.3 K-means clustering

3.2.4 Principal component analysis

3.2.5 Mathematical description of digital images and image analysis

3.3 Methods

3.3.1 Image analysis based on wavelets and a bidirectional search

3.3.2 Image analysis based on K-means clustering and PCA

3.3.3 Determining fluorescence intensity of an image

3.3.4 Comparison of the two image analysis procedures

3.4.1 Developing a model describing the relationship between the transcription factor concentration and the observed fluorescence intensity

3.4.2 Solution of an inverse problem for determining transcription factor concentrations

3.6 Summary and Conclusions

Acknowledgments

References

CHAPTER 4 Data-Driven, Mechanistic Modeling of Biochemical Reaction Networks

4.1 Introduction

4.2 Principles of Data-Driven Modeling

4.2.1 Types of experimental data

4.2.2 Data processing and normalization

4.2.3 Suitability of models used in conjunction with quantitative data

4.2.4 Issues related to parameter specification and estimation

4.3 Examples of Data-Driven Modeling

4.3.1 Example 1: Systematic analysis of crosstalk in the PDGF receptor signaling network

4.3.2 Example 2: Computational analysis of signal specificity in yeast

Acknowledgments

References

CHAPTER 5 Construction of Phenotype-Specific Gene Network by Synergy Analysis

5.1 Introduction

5.3 Materials

5.3.1 Cell culture and reagents

5.3.2 Fatty acid salt treatment

5.4 Methods

5.4.1 Cytotoxicity measurement

5.4.2 Gene expression profiling

5.4.3 Metabolites measurements

5.4.4 Gene selection based on trends of metabolites

5.4.5 Calculation of the synergy scores of gene pairs

5.4.6 Permutation test to evaluate the significance of the synergy

5.4.7 Characterization of the network topology

5.7 Applications Notes

5.7.1 Topological characteristics of the synergy network

5.7.2 Hub genes in the network

Acknowledgments

References

CHAPTER 6 Genome-Scale Analysis of Metabolic Networks

6.1 Introduction

6.2 Materials and Methods

6.2.1 Flux analysis theory

6.2.2 Model development

6.2.3 Objective function

6.2.4 Optimization

6.3.1 Feasible solution determined

6.3.2 No feasible solution determined

Acknowledgments

References

CHAPTER 7 Modeling the Dynamics of Cellular Networks

7.1 Introduction

7.2 Materials

7.2.1 Cell culture

7.2.2 Database

7.3 Methods

7.3.1 Network reconstruction

7.3.2 Network reduction

7.3.3 Kinetic modeling

7.3.4 Parameter estimation

7.4.1 Model network

7.4.2 Dynamic simulation parameters

7.5.1 Modularity

7.5.2 Generalized kinetic expressions

7.5.3 Population heterogeneity

Acknowledgments

References

CHAPTER 8 Steady-State Sensitivity Analysis of Biochemical Reaction Networks: A Brief Review and New Methods

8.2 Considered System Class and Parametric Sensitivity

8.2.1 Example system: reversible covalent modification

8.2.2 Parametric steady-state sensitivity

8.3 Linear Sensitivity Analysis

8.4 Sensitivity Analysis Via Empirical Gramians

8.4.1 Gramians and linear sensitivity analysis

8.4.2 Empirical Gramians for nonlinear systems

8.4.3 A new sensitivity measure based on Gramians

8.4.4 Example: covalent modification system

8.5 Sensitivity Analysis Via Infeasibility Certificates

8.5.1 Feasibility problem and semidefinite relaxation

8.5.2 Infeasibility certificates from the dual problem

8.5.3 Algorithm to bound feasible steady states

8.5.4 Example: covalent modification system

8.6 Discussion and Outlook

References

CHAPTER 9 Determining Metabolite Production Capabilities of Saccharomyces Cerevisiae Using Dynamic Flux Balance Analysis

9.2 Methods

9.2.1 Stoichiometric models of cellular metabolism

9.2.2 Classical flux balance analysis

9.2.3 Dynamic flux balance analysis

9.3 Results and Interpretation

9.3.1 Stoichiometric models of S. cerevisiae metabolism

9.3.2 Dynamic simulation of fed-batch cultures

9.3.3 Dynamic optimization of fed-batch cultures

9.3.4 Identification of ethanol overproduction mutants

9.3.5 Exploration of novel metabolic capabilities

Acknowledgments

References

Related Resources and Supplementary Electronic Information

CHAPTER 10 Experimental Design for Parameter Identifiability in Biological Signal Transduction Modeling

10.1.1 Model structure

10.1.2 Parameter estimation

10.1.3 Identifiability metrics and conditions

10.1.4 Overview of the experimental design procedure

10.2 Methods

10.2.1 Initial perturbation and measurement design

10.2.2 Identifiability analysis

10.2.3 Impact analysis

10.2.4 Design modification and reduction

10.2.5 Design implementation

10.3.1 Step 1: Initial perturbation and measurement design

10.3.2 Step 2: Identifiability analysis

10.3.3 Step 3: Impact analysis

10.3.4 Step 4: Design reduction

10.4.1 Step 1: Initial perturbation and measurement design

10.4.3 Steps 3 to 5: Impact analysis, design reduction, and identifiability analysis

Acknowledgments

References

CHAPTER 11 Parameter Identification with Adaptive Sparse Grid-Based Optimization for Models of Cellular Processes

11.1.1 Adaptive sparse grid interpolation

11.3 Materials

11.4 Methods

11.5.1 Sorted grid points

11.5.2 Unique points

11.5.3 Unstable points

11.5.4 Interpretation and conclusions

11.6 Troubleshooting

11.6.1 Troubleshooting special cases: small and large problems

11.8.1 Comparison of adaptive sparse grid and GA-based optimization

11.8.2 Adaptive sparse grid-based optimization

11.8.3 Genetic algorithm

Acknowledgments

References

Related sources and supplementary information

CHAPTER 12 Reverse Engineering of Biological Networks

12.1 Introduction: Biological Networks and Reverse Engineering

12.1.1 Biological networks

12.1.2 Network representation

12.1.3 Motivation and design principles

12.1.4 Reverse engineering

12.2 Material: Time Series and Omics Data

12.2.1 Metabolomics

12.2.2 Proteomics and protein interaction networks

12.2.3 Transcriptomics

12.3 Approaches for Inference of Biological Networks

12.3.1 Genome-scale metabolic modeling

12.3.2 Boolean networks

12.3.3 Network topology from correlation or hierarchical clustering

12.3.4 Bayesian networks

12.3.5 Ordinary differential equations

12.4 Network Biology: Exploring the Inferred Networks

12.4.1 Graph theory

12.4.2 Motifs and modules

12.4.3 Stoichiometric analysis

12.4.4 Simulation of dynamics, sensitivity analysis, control analysis

12.5 Discussion and Comparison of Approaches

Acknowledgments

References

CHAPTER 13 Transcriptome Analysis of Regulatory Networks

13.2 Methods

13.2.1 Materials

13.2.2 Cell harvesting

13.2.3 RNA purification

13.2.4 Transcriptional profiling using DNA microarrays

13.3.1 Acquisition of DNA microarray data

13.3.2 Normalization

13.3.3 Network Component Analysis (NCA)

References

CHAPTER 14 A Workflow from Time Series Gene Expression to Transcriptional Regulatory Networks

14.2 Materials

14.3 Methods

14.3.1 Identification of differentially expressed genes

14.3.2 Robust clustering of differential gene expression time series data using computational negative control approach

14.3.3 Transcriptional regulatory network analysis using PAINT

14.4.1 Selection of number of clusters

14.4.2 PAINT result interpretation for gene coexpression clusters

14.5.1 Estimation of nondifferentially expressed genes (pi.not value)

14.5.2 Threshold for local false discovery rate analysis

14.5.3 Format of gene identifiers

14.5.4 Cluster size issues

14.5.5 TRANSFAC version issues

14.5.6 Annotation redundancy in the gene list and multiple promoters

14.5.7 Reference Feasnet selection/generation

14.5.8 Multiple testing correction in PAINT

Acknowledgments

References

About the Editors

List of Contributors

Index

CHAPTER 1

Quantitative Immunofluorescence for Measuring Spatial Compartmentation of Covalently Modified Signaling Proteins

Jin-Hong Kim¹ and Anand R. Asthagiri²

¹Division of Engineering and Applied Science and ²Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA; e-mail: [email protected]

Key terms: ERK, Image analysis, Segmentation, Site-specific modification, Spatial localization, Watershed algorithm

Abstract

Intracellular signaling pathways control cell behaviors and multicellular morphodynamics. A quantitative understanding of these pathways will provide design principles for tuning these signals in order to engineer cell behaviors and tissue morphology. The transmission of information in signaling pathways involves both site-specific covalent modifications and spatial localization of signaling proteins. Here, we describe an algorithm for quantifying the spatial localization of covalently modified signaling proteins from images acquired by immunofluorescence (IF) staining. As a case study, we apply the method to quantify the amount of dually phosphorylated extracellular signal-regulated kinase (ERK) in the nucleus. The algorithm presented here provides a general schematic that can be modified and applied more broadly to quantify the spatial compartmentation of other covalently modified signaling proteins.

1.1 Introduction

Signal transduction networks control all aspects of cell behavior, such as metabolism, proliferation, migration, and differentiation [1]. Thus, engineering cell behaviors will hinge on understanding and tuning information flow in these signaling pathways. Intracellular signals transmit information in at least two major ways. First, signaling proteins undergo covalent modifications that alter their intrinsic enzymatic activity and/or their interactions with binding partners. Second, beyond the connectivity of the signal transduction network, signaling proteins are localized spatially. Where a signal is located can influence its accessibility to upstream and downstream factors, and therefore can play a significant role in controlling information flux [2].

Green fluorescent protein (GFP) has provided a powerful way to track the localization of signaling proteins [3]. Variants of GFP spanning a wide range of spectral properties have opened the door to monitoring colocalization of signaling proteins. A key challenge, however, is that quantifying signal propagation must involve not only tracking protein localization, but also the covalent state of that signal.

Sensor platforms that track both spatial localization and covalent state/activity are emerging. Several involve fluorescence resonance energy transfer (FRET), a phenomenon wherein the close proximity of two complementary fluorophores allows one (the donor) to excite the other (the acceptor) [4]. The quenching of the donor and the excitation of the acceptor serve as a FRET signal. One general strategy has been to introduce a chimeric version of the signaling protein. Both the acceptor and donor are placed in the protein, whose folding into an active conformation changes the FRET signal. Examples include the Raichu sensors for the cdc42/Rac/Rho family of GTPases [5]. In another design, the fluorophores have been placed in chimeric pseudosubstrates for tyrosine kinases [6] and caspases [7]. When these signaling enzymes act on the substrate, the refolding or the cleavage of the substrate changes the FRET signal. A final approach is to place one fluorophore on the signaling enzyme and the other fluorophore on a binding partner. When these are recruited to each other, a FRET signal ensues. Examples of this third approach include the Raichu-CRIB sensors for the Rho family of GTPases [8].

A major drawback of these tools, however, is that they are highly tailor-made and do not report on the remarkable diversity of covalent modifications that a single signaling protein undergoes. For example, the PDGF receptor is phosphorylated at multiple tyrosine residues, and each phosphorylation site enables its interaction with distinct downstream targets [9]. Such multisite covalent modifications are prevalent across signaling proteins. New mathematical modeling frameworks are being developed to handle the huge number of states in which a single signaling protein may be found [10]. Proteomic approaches are being developed to quantify site-specific covalent modifications in cell extracts on a large scale [11]. While these proteomic approaches allow large-scale, quantitative analysis of covalent modifications to signaling proteins, they do not gauge subcellular spatial information.

Thus, complementary methods are needed to quantify spatial information on signaling proteins that have undergone site-specific covalent modifications. Classical immunofluorescence (IF) staining provides an excellent starting point. In IF staining, antibodies are used to detect an antigen (e.g., a signaling protein) in fixed cells [12]. These antibodies may be tagged with fluorophores, including quantum dots that have unique advantages over GFP. Furthermore, antibodies for site-specific covalent modifications are widely available commercially. A limiting factor, however, is that images acquired by IF are primarily analyzed qualitatively. Here, we describe image analysis algorithms that may be used to quantify IF images in an automated manner. As a case study, we apply the algorithms to quantify the level of nuclear extracellular signal-regulated kinase (ERK) signaling.

1.2 Experimental Design

In this work, we developed and tested image analysis algorithms to quantify the spatial localization of phosphorylated signaling proteins. We focused on phosphorylated ERK, a signal that localizes to the nucleus and is required for cell proliferation [13]. We performed a dose-dependence assay to gauge how the localized signal responds to different amounts of stimuli. Such dose-response studies provide a well-defined approach to test whether our measurement methodology could discern quantitative changes in signaling.

It is useful to conduct such experiments in systems that have been confirmed to trigger the signal of interest using other experimental assays. Therefore, we chose a stimulus, epidermal growth factor (EGF), that is well known to trigger ERK signaling [14, 15]. We used MCF-10A cells that respond to EGF by triggering ERK phosphorylation as confirmed by Western blotting [16].

1.3 Materials

1.3.1 Cell culture

1. 6-well plate (Corning).

2. Micro cover glass, 18 mm circle (VWR).

3. Dulbecco’s modified Eagle’s medium/Ham’s F-12 containing HEPES and L-glutamine (Gibco).

4. Epidermal Growth Factor (Peprotech).

5. Hydrocortisone (Sigma-Aldrich).

6. Insulin (Sigma-Aldrich).

7. Cholera toxin (Sigma-Aldrich).

8. Bovine serum albumin (Sigma-Aldrich).

9. Trypsin EDTA 0.05% (Gibco).

10. Penicillin/streptomycin (Gibco).

1.3.2 Buffers/reagents

1. Phosphate buffered saline (Gibco).

2. Paraformaldehyde (Sigma-Aldrich).

3. Tween-20 (Sigma-Aldrich).

4. Methanol (EMD).

5. Glycine (Sigma-Aldrich).

6. Triton X-100 (Sigma-Aldrich).

7. Goat serum (Gibco).



8. NP-40 (Sigma-Aldrich).

9. NaCl (Sigma-Aldrich).

10. Na2HPO4 (Sigma-Aldrich).

11. NaH2PO4 (Sigma-Aldrich).

12. NaN3 (Sigma-Aldrich).

13. PD98059 (Calbiochem).

14. Na3VO4 (Sigma-Aldrich).

15. NaF (Sigma-Aldrich).

16. β-glycerophosphate (Sigma-Aldrich).

1.3.3 Immunofluorescence reagents

1. Primary antibodies:

i. Phospho-p44/42 MAPK (Thr202/Tyr204), Polyclonal: #9101 (1:200) and Monoclonal: #4377 (1:50) (Cell Signaling Technology, Inc.).

2. Secondary antibody:

i. Alexa Fluor 488 (1:200) (Molecular Probes).

3. 4’,6-diamidino-2-phenylindole (DAPI) (Sigma-Aldrich).

4. ProLong Gold antifade (Molecular Probes).

1.4 Methods

1.4.1 Cell culture and stimulation for phospho-ERK measurements

1. Culture MCF-10A cells in Dulbecco’s modified Eagle’s medium/Ham’s F-12 containing HEPES and L-glutamine supplemented with 5% (v/v) horse serum, 20 ng/ml EGF, 0.5 μg/ml hydrocortisone, 0.1 μg/ml cholera toxin, 10 μg/ml insulin, and 1% penicillin/streptomycin.

2. Plate cells on sterilized cover glasses placed in 6-well tissue culture plates at 1 × 10^5 cells per well and grow cells in growth medium for 24 hours to allow adhesion.

3. For G0 synchronization, wash cells twice with PBS and culture them for 24 hours in serum-free medium: DMEM/F-12 supplemented with 1% penicillin/streptomycin and 0.1% bovine serum albumin.

4. For EGF stimulation, reconstitute recombinant human EGF in sterile H2O at 100 μg/ml and dilute it in serum-free medium to the designated concentrations.

5. Make sure the EGF-containing medium is warmed to 37°C. Then stimulate cells for 15 minutes by adding 2 ml of EGF-containing medium to each well. Cells incubated in the absence of EGF, or cells treated with PD98059, a pharmacological inhibitor of MEK, can be used as a negative control, while cells treated with 10 ng/ml EGF can serve as a positive control.
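
The dilution in steps 4 and 5 follows the usual relation C1 × V1 = C2 × V2. As a rough illustration, the Python sketch below (standard library only) computes how much 100 μg/ml stock would be needed for the two EGF doses used in this chapter. The function name and the 12 ml batch volume (six wells at 2 ml each) are assumptions made for the example and are not part of the protocol.

```python
def stock_volume_ul(stock_ng_per_ml, target_ng_per_ml, final_volume_ml):
    """Volume of stock (in microliters) needed so that final_volume_ml of
    medium ends up at target_ng_per_ml, from C1 * V1 = C2 * V2."""
    if target_ng_per_ml > stock_ng_per_ml:
        raise ValueError("target concentration exceeds stock concentration")
    volume_ml = target_ng_per_ml * final_volume_ml / stock_ng_per_ml
    return volume_ml * 1000.0  # convert ml to microliters


STOCK_NG_PER_ML = 100 * 1000  # 100 ug/ml stock expressed in ng/ml
BATCH_ML = 12.0               # assumed batch: 6 wells x 2 ml per well

for target in (10.0, 0.01):   # EGF doses used in this chapter, in ng/ml
    ul = stock_volume_ul(STOCK_NG_PER_ML, target, BATCH_ML)
    print(f"{target} ng/ml EGF in {BATCH_ML} ml: {ul:.4f} ul of stock")
```

For the lower dose the computed stock volume is far below what can be pipetted directly, so in practice such a dose would presumably be reached through one or more intermediate dilutions of the stock.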

1.4.2 Antibody labeling of phosphorylated ERK (ppERK)

1. After 15 minutes of EGF stimulation, place the 6-well plates on ice and wash cells twice with ice-cold PBS.

2. Fix cells in freshly prepared 2% paraformaldehyde (pH 7.4) for 20 minutes at room temperature in the presence of phosphatase inhibitors at the following concentrations: 1 mM sodium orthovanadate, 10 mM sodium fluoride, and 10 mM β-glycerophosphate. Rinse three times with a 0.1 mM solution of glycine in PBS.

3. Permeabilize cells in PBS containing 0.5% NP-40 and the phosphatase inhibitors for 10 minutes at 4°C with gentle rocking. Rinse with PBS three times.

4. Dehydrate cells in ice-cold pure methanol for 20 minutes at –20°C. Rinse with PBS three times.

5. Block with IF buffer (130 mM NaCl, 7 mM Na2HPO4, 3.5 mM NaH2PO4, 7.7 mM NaN3, 0.1% bovine serum albumin, 0.2% Triton X-100, 0.05% Tween-20, and 10% goat serum) for 1 hour at room temperature.

6. Incubate with anti-phospho-p44/42 MAPK antibody in IF buffer overnight at 4°C. Rinse three times with IF buffer at room temperature on the rocker for 20 minutes each. This washing step is essential to minimize background staining.

7. Sequentially incubate with Alexa dye-labeled secondary antibodies in IF buffer for 45 minutes at room temperature. Rinse three times with IF buffer at room temperature on the rocker for 20 minutes each. Make sure to protect samples from light.

8. Counterstain nuclei with 0.5 ng/ml DAPI for 15 minutes at room temperature and rinse twice with PBS with gentle rocking for 5 minutes each.

9. Mount with ProLong Gold antifade. Dry overnight in a place where samples are protected from light.

1.4.3 Fluorescence microscopy imaging of ppERK and automated image analysis

1. Acquire fluorescence images using filters for DAPI and FITC. Start with a sample that is expected to give the highest FITC signal (e.g., the positive control, 10 ng/ml EGF). Using this positive control, empirically choose an exposure time so that the highest pixel intensity in a given field is close to the saturation level (generally 255 for 8-bit images). Be sure that the chosen exposure time does not saturate the FITC signal in other fields of the positive control sample. These steps identify an exposure time that maximizes the dynamic range of ppERK signals that may be quantified. The exposure time determined in this way should then be fixed and used to capture images from all other samples.

2. Segment DAPI (nuclei) images using a combination of edge detection and watershed algorithms. The algorithm to process a single image is written in MATLAB (MathWorks) as described below (steps 2, i–v). This algorithm can be iterated to process multiple images in a single execution.

i. Import a DAPI image using the imread function.

DAPI = imread('DAPI image.tif');

ii. The edge function detects the edges of objects using gradients in pixel intensity across the objects and returns a binary image in which the edges of the objects are traced. Different masks are available in the edge function; the 'sobel' and 'canny' methods were used successfully in this study.

[edgeDAPI, thresh] = edge(DAPI, 'sobel');

Optionally, the imdilate and imerode functions can be used together to enhance the results of edge detection.

iii. Fill in the inside of the traced nuclei using the imfill function.

edgefillDAPI = imfill(edgeDAPI, 'holes');

iv. The edge detection method often cannot distinguish cells that are spaced too closely. The watershed algorithm can be used along with the distance transform to separate multiple merged nuclei. The bwdist and watershed functions together generate an image with lines that separate touching cells. Optionally, the imhmin function can be used to prevent over-segmentation, which is a known problem of the watershed algorithm in some cases. Finally, convert the obtained image into a binary image to match the class type.

distDAPI = -bwdist(~edgefillDAPI);   % negated distance transform

distDAPI2 = imhmin(distDAPI, 1);     % suppress shallow minima

ridgeDAPI = watershed(distDAPI2);    % ridge lines are labeled 0

ridgeDAPI2 = im2bw(ridgeDAPI);       % binary image, 0 on ridge lines

v. Merge the two images generated by the edge detection algorithm and the watershed algorithm to create a single nuclear compartment image.

segmentedDAPI = edgefillDAPI & ridgeDAPI2;

3. Additionally, apply size thresholds to the images to exclude noncellular objects. The distribution of nucleus size can be approximated as a normal distribution. Thus, use three standard deviations above and below the mean area of nuclei as the upper and lower cut-off values.

4. Using FITC images, calculate the average fluorescence level of the noncell areas on a per-pixel basis to account for the background level for each image.

5. Using the segmented images (nuclear mask) and the FITC image together, calculate the area of each individual nucleus and sum up the FITC values in this area. Finally, the phospho-protein intensity for each cell can be calculated in the following way: multiply the average background level by the area of the nucleus and subtract this value from the total FITC in the nucleus.

$$\mathrm{ppERK} = \sum_{\mathrm{nucleus}} \mathrm{FITC} \; - \; \mathrm{Background} \times \mathrm{AR}_{\mathrm{nucleus}}$$
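
Steps 3 to 5 can be sketched in a few lines of code. The Python example below (standard library only, independent of the MATLAB steps) labels the objects in a binary nuclear mask, applies the mean ± 3 SD size threshold, estimates the per-pixel background, and evaluates the formula above for each nucleus. All function names are hypothetical, and several choices are assumptions of this sketch rather than details given in the chapter: 4-connectivity for labeling, the population standard deviation for the size cut-off, and treating every pixel outside the nuclear mask as background. The toy image values are illustrative only.

```python
from statistics import mean, pstdev


def label_nuclei(mask):
    """Group 4-connected foreground pixels of a binary mask into objects.
    Returns a list of objects, each a list of (row, col) pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            stack, pixels = [(r, c)], []
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            objects.append(pixels)
    return objects


def size_filter(objects, n_sd=3.0):
    """Keep objects whose area lies within mean +/- n_sd standard deviations."""
    areas = [len(obj) for obj in objects]
    mu, sd = mean(areas), pstdev(areas)
    low, high = mu - n_sd * sd, mu + n_sd * sd
    return [obj for obj in objects if low <= len(obj) <= high]


def nuclear_ppERK(mask, fitc):
    """Background-corrected FITC sum for each retained nucleus."""
    background = mean(fitc[r][c]
                      for r in range(len(mask))
                      for c in range(len(mask[0])) if not mask[r][c])
    results = []
    for obj in size_filter(label_nuclei(mask)):
        total = sum(fitc[r][c] for r, c in obj)
        results.append(total - background * len(obj))  # formula from step 5
    return results


# Toy 4 x 6 field with two "nuclei" (hypothetical values, not real data).
mask = [[1, 1, 0, 0, 0, 0],
        [1, 1, 0, 0, 1, 1],
        [0, 0, 0, 0, 1, 1],
        [0, 0, 0, 0, 0, 0]]
fitc = [[50, 50, 10, 10, 10, 10],
        [50, 50, 10, 10, 30, 30],
        [10, 10, 10, 10, 30, 30],
        [10, 10, 10, 10, 10, 10]]
print(nuclear_ppERK(mask, fitc))
```

With only a handful of objects the standard deviation of the areas is a poor estimate, so the size threshold is meaningful only when it is computed over many nuclei, for example pooled over all fields of a sample.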

1.5 Data Acquisition, Anticipated Results, andInterpretation

We quantified the level of ppERK in the nucleus of MCF-10A cells that were stimulated with 0.01 or 10 ng/ml EGF or left untreated for 15 minutes. At a qualitative level, the dose-dependent phosphorylation of ERK was evident (Figure 1.1). Furthermore, the localization of ppERK to the nucleus was most evident at the highest EGF concentration. The dose-dependent activation of ERK was confirmed using our quantitative image processing algorithms (Figure 1.2). At the highest EGF concentration, the average amount of nuclear ppERK was approximately fivefold above the response when EGF was absent. Meanwhile, a relatively moderate amount of EGF (0.01 ng/ml) induced only a threefold increase in nuclear ppERK.



Since these measurements were conducted at the single-cell level, one can analyze the variation in cell responses across the population. We generated a histogram representing the distribution of nuclear ppERK levels across the population for the three different EGF concentrations (Figure 1.3). In the absence of EGF, most cells fall into a narrow range of low nuclear ppERK intensity. As the EGF concentration was increased, this distribution shifted gradually to the right. These results indicate that the level of nuclear ppERK is a graded response to EGF stimulation at the single-cell level.
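
A histogram comparison of this kind only makes sense if every condition is binned with the same edges. The short Python sketch below illustrates that idea with equal-width bins; the function name, bin settings, and per-cell values are all hypothetical and chosen only to show a distribution shifting to the right with dose.

```python
def histogram(values, bin_width, n_bins):
    """Count values into n_bins equal-width bins starting at 0.
    Negative values go to the first bin; values beyond the last
    edge are collected in the final bin."""
    counts = [0] * n_bins
    for v in values:
        index = min(int(max(v, 0.0) // bin_width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical per-cell nuclear ppERK intensities (arbitrary units).
cells_by_dose = {
    "0 ng/ml": [12, 18, 25, 31, 22],
    "0.01 ng/ml": [40, 65, 72, 88, 55],
    "10 ng/ml": [95, 120, 140, 160, 110],
}

for dose, values in cells_by_dose.items():
    # Identical bin edges for every dose keep the histograms comparable.
    print(dose, histogram(values, bin_width=50, n_bins=4))
```

A graded response appears as the bulk of the counts moving into higher bins as the dose increases, rather than as two fixed peaks whose relative heights change.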

1.6 Statistical Guidelines

A total of three independent trials (n = 3) were conducted to gather statistically meaningful data. In each trial, duplicates were prepared for each condition to minimize errors associated with sample preparation. For each sample, five images were collected from different fields. Altogether, at least 150 cells were analyzed for each condition.

In each trial, the average amount of nuclear ppERK for each condition was expressed relative to the level in the 10 ng/ml EGF sample. Because this reference value is fixed at 1 in every trial, a statistical test was not performed between the 10 ng/ml EGF sample and the other samples. A one-tailed Student’s t-test was performed between the 0 and 0.01 ng/ml EGF samples and indicated that these values were different with a p-value less than 0.01. Error bars represent standard error with n = 3.
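
The normalization and error-bar calculation can be illustrated as follows. This Python sketch (standard library only) expresses each trial relative to its 10 ng/ml sample, computes the mean and standard error over three trials, and compares a pooled-variance t statistic with the one-tailed critical value for p < 0.01 at 4 degrees of freedom (3.747). The trial values are hypothetical, the function names are made up for the example, and the choice of an unpaired, pooled-variance test is an assumption, since the text does not state which variant of Student’s t-test was used.

```python
from math import sqrt
from statistics import mean, stdev


def normalize_to_reference(trial, reference_key):
    """Express every condition in one trial relative to the reference condition."""
    ref = trial[reference_key]
    return {dose: value / ref for dose, value in trial.items()}


def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return stdev(values) / sqrt(len(values))


def pooled_t_statistic(a, b):
    """Two-sample Student's t statistic (b minus a) with pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2
                  + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(b) - mean(a)) / sqrt(pooled_var * (1 / na + 1 / nb))


# Hypothetical mean nuclear ppERK per trial (arbitrary units), n = 3 trials.
trials = [
    {"0": 20.0, "0.01": 62.0, "10": 100.0},
    {"0": 24.0, "0.01": 60.0, "10": 110.0},
    {"0": 17.0, "0.01": 54.0, "10": 90.0},
]
normalized = [normalize_to_reference(t, "10") for t in trials]
untreated = [n["0"] for n in normalized]
low_dose = [n["0.01"] for n in normalized]

print("0 ng/ml:    %.3f +/- %.3f" % (mean(untreated), standard_error(untreated)))
print("0.01 ng/ml: %.3f +/- %.3f" % (mean(low_dose), standard_error(low_dose)))

# One-tailed test at p < 0.01 with 3 + 3 - 2 = 4 degrees of freedom.
T_CRITICAL = 3.747
t = pooled_t_statistic(untreated, low_dose)
print("t = %.2f, exceeds critical value: %s" % (t, t > T_CRITICAL))
```

After normalization the 10 ng/ml value equals 1 in every trial and therefore has zero variance, which is why it cannot enter a test of this kind.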


Figure 1.1 Serum-starved MCF-10A cells were stimulated with 0, 0.01, and 10 ng/ml EGF. Following 15 minutes of stimulation, cells were immunostained against ppERK (FITC) and nuclei were counterstained with DAPI. The scale bar represents 50 μm.

Figure 1.2 Average nuclear ppERK intensities in samples treated with 0, 0.01, and 10 ng/ml EGF. The error bars indicate S.E. (n = 3) with duplicates performed in each experiment. The asterisk denotes p < 0.01 (Student’s t-test).

1.7 Discussion and Commentary

Intracellular signaling pathways control cell behaviors and multicellular morphodynamics. A quantitative understanding of these pathways will provide design principles for tuning these signals in order to engineer cell behaviors and tissue morphology. The transmission of information in these pathways involves both site-specific covalent modifications to signaling proteins and spatial localization of these signals. Here, we describe algorithms for quantifying signal localization from immunofluorescence staining for phosphorylated ERK. Our data reveal that in epithelial cells, ERK exhibits a graded response to EGF not only at the population level, but also at the level of individual nuclei. These results are consistent with other studies that reported graded ERK responses to various stimuli in other mammalian cell systems [17, 18]. The algorithms presented here should facilitate quantitative, high-throughput analysis of images acquired by IF staining.

1.8 Application Notes

The method described in this report would be particularly useful in quantifying the spatiotemporal signaling response at a single-cell level. The algorithm should allow automated and high-throughput quantification of subcellular compartmentation of signaling events in response to multiple combinations and doses of environmental stimuli. It should also prove useful for quantitative studies of cell-to-cell variation in signaling. Such measurements would provide valuable quantitative data for systems-level analysis of signal transduction networks, the regulatory architecture that governs cellular decision-making.

1.9 Summary Points

• Before beginning image acquisition, choose an exposure time that maximizes the dynamic range of signals that may be quantified. The chosen exposure time should be fixed and used to capture images from all the samples.

• Qualitatively verify that the edge detection and watershed algorithms properly segment individual nuclei.

• Size thresholds are often necessary to exclude noncellular objects.

• Account for the background fluorescence level to measure exclusively fluorescence signals from signaling proteins.

• Choose a proper sample size (e.g., the number of cells analyzed in each trial), depending on the degree of cell-to-cell variance of the target proteins.

• Add phosphatase inhibitors at the fixation and permeabilization steps if the target signal molecules are phosphoproteins.

• Rigorous washing after incubation with antibodies is essential to minimize background staining.

Figure 1.3 Histogram representation of the distribution of the nuclear ppERK levels in cell populations treated with 0, 0.01, and 10 ng/ml EGF.

Acknowledgments

The authors thank the members of the Asthagiri Lab for helpful discussions. Funding for this work was provided by The Jacobs Institute for Molecular Engineering for Medicine.


Troubleshooting Table

Problem: Background is too high

Explanation:
• Nonspecific binding of primary or secondary antibody
• Basal ERK activity mediated by autocrine factor

Potential Solutions:
• Make sure to follow the required blocking and washing steps thoroughly
• Perform a negative control using only the secondary antibody and skipping the primary antibody incubation to assess the level of nonspecific binding
• Prepare a sample treated with PD98059 to quench ERK activity altogether

Problem: The number of segmented nuclei is significantly less than the actual number of nuclei

Explanation:
• Failure to detect the edge of some of the nuclei

Potential Solutions:
• Increase the exposure time until DAPI signals at the location of nuclei become saturated. This ensures contrast between the nuclei and the background
• Alternatively, the imadjust or contrast functions can be used in MATLAB to enhance the contrast of a DAPI image before performing nuclear segmentation

Problem: The number of nuclei is significantly over-counted

Explanation:
• Many noncellular objects were considered as nuclei
• Over-segmentation from watershed algorithm

Potential Solutions:
• Rinse and wipe the slides with alcohol to get rid of dried salts and stain
• Avoid air bubbles when mounting the sample with antifade
• Adjust the upper and lower limits of the area threshold of nuclei appropriately to exclude noncellular objects, with qualitative verification
• Use the imhmin function to reduce over-segmentation
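The area-threshold and background-correction remedies in the table can be sketched in a few lines. This is an illustrative Python sketch rather than the MATLAB code used by the authors; it assumes segmentation has already produced, for each object, an area in pixels and a mean intensity, and the names `filter_nuclei`, `min_area`, `max_area`, and `background` are hypothetical.

```python
def filter_nuclei(objects, min_area, max_area, background):
    """Keep objects whose area lies within [min_area, max_area] and
    subtract a background level from their mean intensity.

    objects: list of dicts with keys 'area' (pixels) and 'mean_intensity'.
    The thresholds and background value must be chosen by the user and
    verified qualitatively against the images.
    """
    kept = []
    for obj in objects:
        if min_area <= obj["area"] <= max_area:
            # Clamp at zero so noise below background is not reported as negative signal.
            signal = max(obj["mean_intensity"] - background, 0.0)
            kept.append({"area": obj["area"], "signal": signal})
    return kept
```

Objects that are too small (debris, dried salts) or too large (merged nuclei, bubbles) are dropped before any intensity statistics are computed.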

References

[1] Asthagiri, A.R., and D.A. Lauffenburger, “Bioengineering models of cell signaling,” Annu. Rev. Biomed. Eng., Vol. 2, 2000, pp. 31–53.

[2] Haugh, J.M., “Localization of receptor-mediated signal transduction pathways: the inside story,” Mol. Interv., Vol. 2, No. 5, 2002, pp. 292–307.

[3] Misteli, T., and D.L. Spector, “Applications of the green fluorescent protein in cell biology and biotechnology,” Nat. Biotechnol., Vol. 15, No. 10, 1997, pp. 961–964.

[4] Pollok, B.A., and R. Heim, “Using GFP in FRET-based applications,” Trends Cell. Biol., Vol. 9, No. 2, 1999, pp. 57–60.

[5] Mochizuki, N., et al., “Spatio-temporal images of growth-factor-induced activation of Ras and Rap1,” Nature, Vol. 411, No. 6841, 2001, pp. 1065–1068.

[6] Ting, A.Y., et al., “Genetically encoded fluorescent reporters of protein tyrosine kinase activities in living cells,” Proc. Natl. Acad. Sci. USA, Vol. 98, No. 26, 2001, pp. 15003–15008.

[7] Tyas, L., et al., “Rapid caspase-3 activation during apoptosis revealed using fluorescence-resonance energy transfer,” EMBO Rep., Vol. 1, No. 3, 2000, pp. 266–270.

[8] Graham, D.L., P.N. Lowe, and P.A. Chalk, “A method to measure the interaction of Rac/Cdc42 with their binding partners using fluorescence resonance energy transfer between mutants of green fluorescent protein,” Anal. Biochem., Vol. 296, No. 2, 2001, pp. 208–217.

[9] Claesson-Welsh, L., “Platelet-derived growth factor receptor signals,” J. Biol. Chem., Vol. 269, No. 51, 1994, pp. 32023–32026.

[10] Hlavacek, W.S., et al., “Rules for modeling signal-transduction systems,” Sci. STKE, Vol. 2006, No. 344, 2006, p. RE6.

[11] Wolf-Yadlin, A., et al., “Multiple reaction monitoring for robust quantitative proteomic analysis of cellular signaling networks,” Proc. Natl. Acad. Sci. USA, Vol. 104, No. 14, 2007, pp. 5860–5865.

[12] Giepmans, B.N., et al., “The fluorescent toolbox for assessing protein location and function,” Science, Vol. 312, No. 5771, 2006, pp. 217–224.

[13] Wetzker, R., and F.D. Bohmer, “Transactivation joins multiple tracks to the ERK/MAPK cascade,” Nat. Rev. Mol. Cell. Biol., Vol. 4, No. 8, 2003, pp. 651–657.

[14] Gutkind, J.S., “Regulation of mitogen-activated protein kinase signaling networks by G protein-coupled receptors,” Sci. STKE, Vol. 2000, No. 40, 2000, p. RE1.

[15] Yarden, Y., and M.X. Sliwkowski, “Untangling the ErbB signalling network,” Nat. Rev. Mol. Cell. Biol., Vol. 2, No. 2, 2001, pp. 127–137.

[16] Graham, N.A., and A.R. Asthagiri, “Epidermal growth factor-mediated T-cell factor/lymphoid enhancer factor transcriptional activity is essential but not sufficient for cell cycle progression in nontransformed mammary epithelial cells,” J. Biol. Chem., Vol. 279, No. 22, 2004, pp. 23517–23524.

[17] Mackeigan, J.P., et al., “Graded mitogen-activated protein kinase activity precedes switch-like c-Fos induction in mammalian cells,” Mol. Cell. Biol., Vol. 25, No. 11, 2005, pp. 4676–4682.

[18] Whitehurst, A., M.H. Cobb, and M.A. White, “Stimulus-coupled spatial restriction of extracellular signal-regulated kinase 1/2 activity contributes to the specificity of signal-response pathways,” Mol. Cell. Biol., Vol. 24, No. 23, 2004, pp. 10145–10150.



CHAPTER 2

Development of Green Fluorescent Protein-Based Reporter Cell Lines for Dynamic Profiling of Transcription Factor and Kinase Activation

Colby Moya1 and Arul Jayaraman2

1 Artie McFerrin Department of Chemical Engineering, Texas A&M University, College Station, TX

2 222 Jack E. Brown Engineering, 3122 TAMU, College Station, TX 77843-3122; phone: (979) 845-3306; fax: (979) 845-6446; e-mail: [email protected]


Key terms: Dynamic expression profiling, GFP, Transcription factor, Kinase, FRET, Adipocytes

Abstract

One of the main goals of systems biology is the development of quantitative models for describing and predicting cellular responses on the basis of regulatory molecules such as transcription factors and signaling kinases. The regulation of gene expression by transcription factors and kinases, through different expression and activation dynamics, is integral in governing the expression of specific genes and cellular phenotypes. In this chapter, we present methods to engineer systems suitable for monitoring the dynamic activity of transcription factors and kinases. These methods were used to develop reporter cell lines for the transcription factor PPARγ, as well as a reporter construct for the kinase ERK1/2.

2.1 Introduction

An important requirement for the development of signal transduction models is the ability to quantitatively describe the activation dynamics of these regulatory molecules. However, the activation of transcription factors has been conventionally monitored using protein binding techniques such as electrophoretic mobility shift assay or chromatin immunoprecipitation [1], while kinase activity is typically investigated using enzymatic assays. While these techniques are capable of providing snapshots of activation at a small number of time points, they can yield only qualitative data (e.g., mobility shift assay) and require the use of a separate cell population for each time point at which activity is to be measured. As a result of the limited sampling points and frequencies, the true dynamics of regulatory molecules are not easily captured. Hence, there is a need for methods to investigate time-dependent activation of regulatory molecules in a quantitative manner.

Green fluorescent protein (GFP) reporter systems have been recently developed for the continuous and noninvasive monitoring of transcription factor and kinase activation dynamics. Transcription factor reporter systems involve expressing GFP under the control of a minimal promoter such that GFP expression and fluorescence are observed only when a transcription factor is activated (i.e., when the transcription factor binds to its specific DNA binding sequence and induces expression from a minimal promoter). Since wild-type GFP has a half-life of ~72 hours, short half-life variants of GFP have been used so that profiling the activation and decay of different transcription factors (i.e., dynamics) can be carried out. Prior work from our lab has used GFP-based profiling to continuously monitor activation of a panel of transcription factors underlying the inflammatory response in hepatocytes for 24 hours [2–5], where the dynamics of GFP fluorescence is a quantitative indicator for dynamics of the transcription factor being profiled.
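The need for short half-life GFP variants follows from simple first-order decay. The sketch below compares how much reporter signal would persist 24 hours after a transcription factor switches off. It is an illustration only: the 72-hour value is the wild-type figure quoted above, while the 2-hour half-life used for the destabilized variant is an assumption introduced here, and `fraction_remaining` is a hypothetical helper name.

```python
def fraction_remaining(hours, half_life_hours):
    """Fraction of reporter protein left after `hours`, assuming
    first-order decay and no new synthesis."""
    if half_life_hours <= 0 or hours < 0:
        raise ValueError("half_life_hours must be positive and hours non-negative")
    return 0.5 ** (hours / half_life_hours)

# Wild-type GFP (half-life ~72 h, from the text) versus an assumed 2 h variant.
wild_type_left = fraction_remaining(24, 72)      # most of the signal persists
destabilized_left = fraction_remaining(24, 2)    # signal has essentially decayed
```

With a 72-hour half-life roughly four fifths of the protein remains after a day, so a drop in transcription factor activity would be masked; with a short-lived variant the fluorescence can track both activation and decay.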

Fluorescence resonance energy transfer (FRET) has been used to monitor the dynamics of kinase signaling and activity [6, 7]. Recent advances have utilized FRET as a real-time indicator of the activity of numerous kinases and proteases. An example of this is the use of FRET to monitor the activity of protein kinase C (PKC) in living cells by correlating FRET changes to PKC substrate binding and phosphorylation [6]. FRET occurs from the transfer of energy from a donor fluorophore to an acceptor fluorophore upon excitation. When the donor and acceptor proteins are in close proximity (< 10 nm), energy is transferred and the spectral properties change in such a way that excitation of the donor results in emission in the spectral range of the acceptor [8].
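The steep distance dependence behind the "< 10 nm" criterion can be illustrated with the standard Förster relation, E = 1 / (1 + (r/R0)^6). The sketch below is not taken from the chapter: the Förster distance `r0_nm` of 5 nm is an assumed, order-of-magnitude value for a fluorescent protein pair, and `fret_efficiency` is a hypothetical helper name.

```python
def fret_efficiency(r_nm, r0_nm=5.0):
    """Förster transfer efficiency for donor-acceptor separation r_nm.

    r0_nm is the Förster distance (separation giving 50% transfer);
    the default of 5 nm is an assumption for illustration only.
    """
    if r_nm < 0 or r0_nm <= 0:
        raise ValueError("r_nm must be non-negative and r0_nm positive")
    return 1.0 / (1.0 + (r_nm / r0_nm) ** 6)
```

Under this assumption, efficiency is 50% at 5 nm but falls below 2% at 10 nm, which is why a conformational change that brings the two fluorophores together produces a measurable shift in emission.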

Most studies using GFP or other reporter genes involve transiently introducing the reporter plasmid into cells and monitoring changes in activation. However, the stable insertion of reporter plasmids to generate reporter cell lines is more advantageous, as it results in a relatively homogeneous population (in terms of the number of reporter plasmid copies in each cell and the fraction of the cell population that contains the reporter plasmid), thereby increasing the efficiency of profiling.

In this chapter, we describe methods for developing GFP-based transcription factor and kinase reporter plasmids. PPARγ is used as the model transcription factor while the kinase activation methods are based on ERK1/2; however, these methods are applicable for any transcription factor whose DNA binding sequence is known and any kinase whose substrate and binding partner have been identified. In addition, we also describe methods for generating reporter cell lines with the transcription factor reporter plasmids using 3T3-L1 adipocytes as the model cell line.

2.2 Materials

2.2.1 Cell and bacterial culture

1. Complete growth medium for 3T3-L1 preadipocytes: Dulbecco’s Modified Eagle Medium (DMEM, Hyclone, Logan, Utah) supplemented with 10% adult bovine serum (BS, Hyclone, Logan, Utah), 200 units/ml and 200 μg/ml of penicillin/streptomycin, respectively (Hyclone, Logan, Utah) and glucose (4.5 g/L).

2. Complete growth medium for HepG2 cells: Modified Eagle Medium (MEM, Hyclone, Logan, Utah) supplemented with 10% fetal bovine serum (FBS, Hyclone, Logan, Utah), 200 units/ml and 200 μg/ml of penicillin/streptomycin, respectively (Hyclone, Logan, Utah) and glucose (1 g/L).

3. Cryogenic freezing medium: DMEM supplemented with 20% fetal bovine serum (FBS, Hyclone, Logan, Utah), 10% dimethyl sulfoxide (DMSO, Fisher, Pittsburgh, Pennsylvania), 200 units/ml and 200 μg/ml of penicillin/streptomycin, respectively (Hyclone, Logan, Utah).

4. LB media (10g Bacto-tryptone, 5g yeast extract, 10g NaCl, pH 7.5, in 1L).

5. Kanamycin (Fisher, Pittsburgh, Pennsylvania).

6. LB agar plates supplemented with 30 μg/mL Kanamycin.

7. 10 cm cell culture dish (Corning, Lowell, Massachusetts).

8. 24-well and 6-well cell culture plates (Corning, Lowell, Massachusetts).

9. Cloning cylinder 6 × 8 mm (Corning, Lowell, Massachusetts).

10. E. coli XL1-Blue electrocompetent cells.

11. Lab-Tek chambered coverglass with two wells (Fisher Scientific, Rochester, New York).

12. Petrolatum (Fisher Scientific, Rochester, New York).

2.2.2 Buffers and reagents

1. 10X Annealing Buffer [100 mM Tris HCl (pH 7.5), 1M NaCl, 10 mM EDTA].

2. TE Buffer [10 mM Tris-HCl (pH 7.5), 1 mM EDTA].

3. Restriction enzymes (BglII, HindIII, EcoRI, BamHI, NotI, XhoI) (NEB, Ipswich, Massachusetts).

4. Wizard SV Gel and PCR Clean-Up System (Promega, Madison, Wisconsin).

5. T4 DNA Ligase (NEB, Ipswich, Massachusetts).

6. Antarctic Phosphatase (NEB, Ipswich, Massachusetts).

7. Trypsin (0.05%) supplemented with ethylenediaminetetraacetic acid (EDTA, 0.02 g/L) in Hanks Balanced Salt Solution (HBSS) without calcium or magnesium (Hyclone, Logan, Utah).

8. 1X Phosphate Buffered Saline (PBS) (pH 7.3).

9. GenJet transfection reagent for HepG2 cells (Signagen, Gaithersburg, Maryland).

10. GoTaq PCR mix (Promega, Madison, Wisconsin).


2.2.3 Cloning

1. Plasmid pCEP4CyPet-MAMM (Addgene, Cambridge, Massachusetts).

2. Plasmid pCEP4YPet-MAMM (Addgene, Cambridge, Massachusetts).

3. Plasmid pEYFP-N1 (Clontech, Mountain View, California).

4. GenePulser XCell electroporation system (Bio-Rad, Hercules, California).

5. Electroporation cuvettes; 2 mm and 4 mm gap (Bio-Rad, Hercules, California).

6. Marligen plasmid maxiprep kit (Marligen biosciences, Ijamsville, Maryland).

7. Eppendorf miniprep kit (Eppendorf, Westbury, New York).

2.2.4 Microscopy

1. Zeiss Axiovert 200M inverted fluorescence microscope (Carl Zeiss Microimaging, Inc., Thornwood, New York) (or a similar fluorescence microscope).

2. U0126 (MEK1/2 inhibitor) (Cell Signaling Technologies, Danvers, Massachusetts).

3. PMA (phorbol 12-myristate 13-acetate) (Fisher, Pittsburgh, Pennsylvania).

4. Human recombinant Interleukin-6 (R&D Systems, Minneapolis, Minnesota).

2.3 Methods

2.3.1 3T3-L1 cell culture

1. 3T3-L1 preadipocytes are grown in DMEM supplemented with 10% bovine serum (BS) and 2% penicillin/streptomycin. Once confluency is reached, cells are passaged at a 1:10 dilution for routine propagation. Cells are passaged at least twice prior to the experiment to ensure recovery from cryopreservation.

2. 3T3-L1 preadipocytes can be terminally differentiated into mature adipocytes by culturing in media containing a cocktail of hormones. Confluent preadipocytes are cultured in DMEM supplemented with 10% fetal bovine serum (FBS), 1 μM dexamethasone, 0.5 mM 3-isobutyl-1-methylxanthine (IBMX), 2 nM 3,3’,5-triiodo-L-thyronine (T3), 1 μg/mL insulin, and 2% penicillin/streptomycin for 48 hours. Cells are cultured for another 48 hours in DMEM supplemented with 10% FBS, 2 nM T3, 1 μg/mL insulin, and 2% penicillin/streptomycin. 3T3-L1 adipocytes are then maintained in DMEM supplemented with 10% FBS and 2% penicillin/streptomycin.

3. For long-term storage of 3T3-L1 preadipocytes, add 5 mL of freezing media to a cell pellet of ~5 × 10^6 cells and freeze down 1 mL aliquots in liquid nitrogen.

2.3.2 Transcription factor reporter development

2.3.2.1 Identification of response elements

The first step in the development of a transcription factor (TF) reporter is the identification of the TF response element or binding site (i.e., DNA sequence to which the TF binds to regulate gene expression) (Figure 2.1). Publicly available curated databases such as TRANSFAC can be used to identify response elements for different TFs. We used the TRANSFAC database as well as literature that provided binding sequences for TFs specific to our work. The following is a basic approach using TRANSFAC to identify a TF response element.

1. Access http://www.gene-regulation.com/pub/databases.html and click on the “Search TRANSFAC public” tab.

2. After successfully logging in, click on the “matrix” tab.

3. On the matrix page, enter the full name of the TF of interest or its acronym.

4. Change the “Table field to search in” box to “Factor name” and submit.

5. Click the appropriate returned result.

6. A nucleotide base matrix is generated listing the bases most likely to make up the TF response element.

2.3.2.2 Identification of TF binding sites

We have engineered reporter constructs for several TFs. The following section will use PPARγ as an example.

1. Analyze the generated TRANSFAC matrix for PPAR:

PO    A    C    G    T
01   14   16   13   17   N
02   12   21   14   16   N
03   32    6    9   17   W
04   19    3   45    3   G
05   31    0   41    0   R
06    0    0   72    0   G
07    0    0   71    1   G
08    0    0    0   72   T
09    0   70    2    0   C
10   72    0    0    0   A
11   72    0    0    0   A
12   71    0    1    0   A
13    0    0   72    0   G
14    0    0   70    2   G
15    0    0    0   72   T
16    0   69    1    2   C
17   70    0    2    0   A
18   16   20    2   20   N
19   17   21   15    3   N
20    8   15   13   13   N
21   14    2   16   12   N
XX

From the table above, we can see that the response element is conserved from position 5 through position 17. The corresponding bases (from 5’ to 3’) would be RGGTCAAAGGTCA. The R represents purines (adenine and guanine) and their relatively equal numbers suggest that either one can be used in the design of the reporter construct. In this example, we choose guanine due to its slightly higher prevalence.

Figure 2.1 Illustration of the GFP reporter system. (a) No fluorescence is observed in the absence of TF binding to the DNA response element (RE). (b) Binding of TF results in activation of the promoter (Prom) and transcription of the gfp gene.

2. Synthesize the TF binding element as complementary oligonucleotides. The TF binding element consists of three tandem repeats of the response element separated by a single base, with appropriate restriction enzyme sites for cloning. The sequence designed for PPAR is given here:

5’AGATCTAAGCTTGGGTCAAAGGTCATGGGTCAAAGGTCAAGGGTCAAAGGTCAGAATTC 3’

In this sequence, the first six bases (AGATCT) and the last six bases (GAATTC) are the restriction sites for BglII and EcoRI, respectively (red and blue in the original), while the AAGCTT that follows the BglII site is the restriction site for HindIII (purple). The remaining bases (green) make up the binding site repeats for PPARγ, each separated by a single base (see Section 2.6, #1).
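The two steps above (reading a consensus from the count matrix, then assembling the tandem-repeat oligonucleotide) can be reproduced in a short script. This is a sketch under stated assumptions: the consensus rule used here (a single base if it exceeds 50% of counts and is more than twice the runner-up, otherwise a two-base IUPAC code if the top two bases exceed 75%, otherwise N) is a commonly used convention that happens to reproduce the consensus column of the matrix, not a rule stated in this chapter. The function names are hypothetical; the spacer bases T and A and the restriction sites are copied from the designed sequence.

```python
# Count rows (A, C, G, T) for positions 01-21, copied from the matrix above.
MATRIX = [
    (14, 16, 13, 17), (12, 21, 14, 16), (32, 6, 9, 17), (19, 3, 45, 3),
    (31, 0, 41, 0), (0, 0, 72, 0), (0, 0, 71, 1), (0, 0, 0, 72),
    (0, 70, 2, 0), (72, 0, 0, 0), (72, 0, 0, 0), (71, 0, 1, 0),
    (0, 0, 72, 0), (0, 0, 70, 2), (0, 0, 0, 72), (0, 69, 1, 2),
    (70, 0, 2, 0), (16, 20, 2, 20), (17, 21, 15, 3), (8, 15, 13, 13),
    (14, 2, 16, 12),
]
TWO_BASE_CODES = {"AG": "R", "CT": "Y", "AT": "W", "CG": "S", "AC": "M", "GT": "K"}

def consensus_symbol(counts):
    """Return an IUPAC symbol for one matrix row (assumed consensus rule)."""
    total = sum(counts)
    ranked = sorted(zip(counts, "ACGT"), reverse=True)
    (n1, b1), (n2, b2) = ranked[0], ranked[1]
    if n1 > 0.5 * total and n1 > 2 * n2:
        return b1
    if n1 + n2 > 0.75 * total:
        return TWO_BASE_CODES["".join(sorted(b1 + b2))]
    return "N"

def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

consensus = "".join(consensus_symbol(row) for row in MATRIX)
# Positions 5-17 form the response element; R is resolved to G as in the text.
response_element = consensus[4:17].replace("R", "G")
# BglII + HindIII + three repeats separated by single bases + EcoRI.
sense_oligo = ("AGATCT" + "AAGCTT" + response_element + "T"
               + response_element + "A" + response_element + "GAATTC")
antisense_oligo = reverse_complement(sense_oligo)  # complementary strand, 5' to 3'
```

Running the script yields the same 59-base sense strand given above, and `antisense_oligo` is the complementary strand to order alongside it.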

2.3.2.3 Cloning TF binding elements into reporter plasmids

The TF binding element is cloned upstream of a minimal CMV promoter which controls expression of the EGFP reporter gene. In the absence of TF binding, the minimal promoter is not active and minimal EGFP is detected. When a TF binds to its response element, it activates the promoter leading to transcription of the EGFP gene. In this example, the PPAR binding sequence is cloned into pCMVmin-d2egfp-N1 [4].

1. Reconstitute the complementary oligonucleotides making up the TF binding element in TE buffer to a concentration of 100 μM. Mix the two oligonucleotides to a final concentration of 40 ng/μL each in a 50 μL reaction. Anneal the DNA strands by incubation at 95ºC for 5 minutes, followed by gradual cooling to room temperature at a rate of 0.5ºC/min. Annealing and cooling can be performed using a standard PCR thermocycler.

2. Digest 25 μL of the annealed DNA with 25 units of BglII (or any other appropriate restriction enzyme) in a 50 μL reaction at 37ºC for 16 hours. The vector into which the TF binding element is to be cloned (pCMVmin-d2egfp-N1) is also digested in parallel.

3. Precipitate the digested DNA using 5 μL sodium acetate (3M, pH 5.2) and 150 μL absolute ethanol. Vortex the reactions and incubate at –20ºC for 1 hour. Pellet the DNA by centrifugation at 16,000 × g for 30 minutes. Decant the supernatant (taking care not to disturb the DNA pellet) and resuspend in 30 μL ddH2O.

4. To the single-digested DNA and vector, add 20 units of EcoRI (NEB, Ipswich, Massachusetts), or any other appropriate enzyme, along with appropriate buffer. Make up the volume to 50 μL and incubate at 37ºC for 16 hours.

5. Separate the double-digested DNA on a 1% low melting point agarose gel and purify the DNA fragment using the Wizard SV Gel and PCR Clean-Up System (Promega, Madison, Wisconsin) as per manufacturer’s suggestions. Electrophoresis is carried out at 100 volts for 60 minutes.

6. Treat 1 μg of the digested vector with Antarctic phosphatase (NEB, Ipswich, Massachusetts) for 1 hour at 37ºC to remove the 5’ phosphate group from the vector. Heat inactivate the phosphatase reaction by incubating at 65ºC for 5 minutes.

7. Set up three ligation reactions using the double-digested TF binding element oligonucleotides and the phosphatase-treated vector. As a starting point, use molar ratios of 5:1 (insert:vector), 0:1 (control), and 1:5. Allow the insert and vector to ligate at 16ºC for 1 hour using T4 DNA ligase followed by heat inactivation at 65ºC for 10 minutes.

8. While the ligations are being heat inactivated, prepare three sterile, 2 mm gap electroporation cuvettes by placing them on ice along with electrocompetent cells. Mix 2 μL (10 ng) of the ligation reaction with 40 μL of electrocompetent cells and electroporate (2,500V and 25 μF) with the GenePulser XCell electroporation system (Bio-Rad, Hercules, California) or any other comparable electroporation unit.

9. Immediately add 1 mL of LB media to the cells and allow them to recover at 37ºC for 1 hour with agitation.

10. Collect the electroporated cells by centrifuging at 12,000 × g for 30 seconds. Decant the supernatant and resuspend the cells in the residual media (~50 μL). Plate the cells on LB agar plates containing kanamycin (30 μg/ml) and incubate at 37ºC overnight.

11. Collect kanamycin resistant colonies (~10) from the ligation plates and inoculate overnight in 5 mL of LB media supplemented with 30 μg/mL of kanamycin.

12. Extract plasmid DNA from overnight cultures using the Eppendorf miniprep kit (Eppendorf, Westbury, New York) as per the manufacturer’s protocol.

13. Perform multiple restriction digests to verify the fidelity of the obtained clone. In our scheme, since a HindIII restriction site is present in the vector, we engineered a second HindIII site in the TF binding element. Therefore, plasmids having the TF binding element correctly inserted will have two HindIII sites whereas incorrect clones will have only a single site.

14. Propagate the putative correct clone(s) and extract the plasmid using a plasmid maxi-prep kit (Marligen Biosciences, Ijamsville, Maryland).

15. Sequence the plasmid to verify the fidelity of the inserted binding element.
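A few of the quantities in this protocol are easy to compute ahead of time: the duration of the annealing ramp, the insert mass needed for a given insert:vector molar ratio, and the number of HindIII sites expected in a candidate clone. The sketch below is illustrative only. The vector length, insert length, and room temperature used in any call are values the user must supply (they are not given in the text), and all function names are hypothetical.

```python
def ramp_minutes(start_c, end_c, rate_c_per_min):
    """Time to cool from start_c to end_c at a constant rate."""
    if rate_c_per_min <= 0 or end_c > start_c:
        raise ValueError("rate must be positive and end_c must not exceed start_c")
    return (start_c - end_c) / rate_c_per_min

def insert_mass_ng(vector_ng, vector_bp, insert_bp, insert_to_vector_ratio):
    """Insert mass giving the requested insert:vector molar ratio.

    Molar amounts scale as mass / length, so
    insert_ng = vector_ng * (insert_bp / vector_bp) * ratio.
    """
    if vector_bp <= 0 or insert_bp <= 0:
        raise ValueError("lengths must be positive")
    return vector_ng * (insert_bp / vector_bp) * insert_to_vector_ratio

def count_sites(sequence, site):
    """Count occurrences of a recognition site on one strand.

    For a palindromic site such as HindIII (AAGCTT), scanning one
    strand of a linear sequence is sufficient.
    """
    count, start = 0, sequence.find(site)
    while start != -1:
        count += 1
        start = sequence.find(site, start + 1)
    return count
```

For example, cooling from 95ºC to an assumed room temperature of 25ºC at 0.5ºC/min takes 140 minutes, and a correct clone in the scheme of step 13 should return 2 from `count_sites(plasmid_sequence, "AAGCTT")`.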

2.3.3 Kinase reporter development

The functionality of the FRET-based kinase reporter (Figure 2.2) is primarily based on a linker region that contains a substrate domain (pink), a phosphoamino acid binding domain (red) and a flexible amino acid domain (green) which links the other two domains. Kinases will bind to and phosphorylate specific amino acid residues within the substrate domain. Phosphorylated residues are then recognized and bound by the phosphoamino acid binding domain, which results in a conformational change within the construct. Due to this conformational change, the two fluorescent proteins (CFP and YFP) come into close proximity, resulting in FRET.

Once the substrate and binding domains for a kinase are identified, they can be synthetically developed using oligonucleotides. The acceptor fluorescent protein and donor fluorescent protein (in our example, YPet and CyPet, respectively) can be generated from a plasmid template using PCR, while the flexible linker region can be synthesized as complementary oligonucleotides, annealed together and amplified. Figure 2.3 summarizes the cloning steps involved in the development of a FRET construct.

2.3.3.1 Selection of FRET elements

1. Identify and select a suitable substrate domain. For example, to monitor the activation of extracellular signal regulated kinase (ERK), we selected a domain from Elk1 as the substrate domain because it is a downstream target of ERK [9, 10]. The Elk1 domain contains the phosphoamino acid motifs serine-proline (SP) and threonine-proline (TP) which can be phosphorylated by bound ERK. The Elk1 domain also contains a FQFP motif which is recognized by ERK as a docking domain [11].

2. Identify and select a suitable phosphoamino acid binding domain, which is the next key consideration of the design. Some phosphoamino acid binding domains reported in the literature include 14-3-3 [12], forkhead associated (FHA) domain [13], and several WW domains [14, 15]. We chose the WW domain as our phosphoamino acid binding domain due to its native binding affinity for phosphoserine or phosphothreonine.

3. Engineer a linker region which will join the substrate domain and the phosphoamino acid domain. The primary purpose of this domain is to allow flexibility. Therefore, when selecting the amino acid residues for this region, it is recommended that amino acids that may provide stiffness to the region (i.e., proline) be left out. The linker region in our construct is glycine and serine rich (GSHSGSGKP). Another consideration for the linker region is length. There exists an optimal linker length that will allow for maximal FRET. As seen above, our linker region is nine amino acids long, which should make for a good starting point for most designs.

4. Select the appropriate fluorophores for the construct. We have developed our FRET construct using CyPet and YPet, which are variants of CFP and YFP, respectively [16]. The cDNAs corresponding to the two fluorescent proteins were linked by a 306 base sequence that consists of, in order, the DNA corresponding to a WW domain, a flexible linker, and Elk phosphorylation sites.

Figure 2.2 Schematic illustrating the expected spectral overlap during FRET. When two fluorescent proteins (cyan fluorescent protein and yellow fluorescent protein, CFP and YFP, respectively) are sufficiently distant from one another, they retain their individual spectral properties (i.e., CFP: excitation 433 nm, emission 475 nm). If the two proteins come into close proximity, the spectral properties change such that an excitation at a low wavelength (433 nm) will result in emission at a high wavelength (507 nm). (Diagram labels: Kinase (phosphorylation); Phosphatase (dephosphorylation).)
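The linker considerations in step 3 (length, flexible residue content, stiff residues) can be summarized for any candidate sequence. The sketch below is a simple bookkeeping aid, not a predictor of FRET performance; the function name `linker_summary` and the choice of glycine and serine as "flexible" and proline as "stiff" follow the wording of step 3 and are otherwise assumptions.

```python
def linker_summary(peptide, flexible="GS", stiff="P"):
    """Summarize a candidate linker given as one-letter amino acid codes."""
    n = len(peptide)
    if n == 0:
        raise ValueError("empty linker")
    return {
        "length_aa": n,
        "coding_bases": 3 * n,  # three bases per codon
        "flexible_fraction": sum(peptide.count(aa) for aa in flexible) / n,
        # 1-based positions of residues that may add stiffness
        "stiff_positions": [i + 1 for i, aa in enumerate(peptide) if aa in stiff],
    }

summary = linker_summary("GSHSGSGKP")  # linker sequence quoted in step 3
```

For the linker quoted in step 3, the summary reports nine residues (27 coding bases), two thirds of which are glycine or serine, with a single proline at the final position.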

2.3.3.2 Fluorescent protein PCR

1. Develop forward and reverse primers for PCR amplification of the genes that encode the donor and acceptor fluorescent proteins, CyPet and YPet, from the plasmids pCEP4CyPet-MAMM and pCEP4YPet-MAMM [16]. Each primer should be developed as oligonucleotides containing 18 to 20 bases that are complementary to the template, six to eight bases of the recognition sequence for appropriate restriction enzymes, and an additional 10 to 12 base sequence to facilitate restriction enzyme binding and digestion. Note that the reverse primer of the upstream fluorescent protein (CyPet) must be designed in such a way that the stop codon is eliminated; otherwise, the entire FRET construct will not be translated (see Section 2.6, #2).

2. Amplify the CyPet and YPet genes using PCR in 25 μL reactions using primers at a final concentration of 0.1 μM and 100 ng of template. Perform PCR for 40 cycles with an annealing temperature of 59°C (the annealing temperature should be roughly 10 degrees lower than the calculated melting temperatures of the primer). Any commercially available PCR kit can be used.

3. Remove unincorporated dNTPs and buffers by cleaning the PCR product using a PCR clean-up kit. Elute the PCR product with 35 μL of ddH2O.

Figure 2.3 Cloning scheme involved in the development of the pCyPet-WWElk1-YPet-N1 FRET reporter plasmid.
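The primer layout described in step 1 (extra 5' bases, restriction site, then 18 to 20 template-matching bases, with the stop codon dropped for the upstream gene) can be sketched as follows. This is an illustration, not the authors' primer sequences: the clamp sequence, the toy open reading frame used in any call, and the function names are hypothetical, and the Wallace rule used for the melting temperature estimate is a rough approximation that was not specified in the text.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def wallace_tm(seq):
    """Rough melting temperature (deg C) by the Wallace rule: 2(A+T) + 4(G+C)."""
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

def design_primers(orf, fwd_site, rev_site, clamp, anneal_len=18, keep_stop=False):
    """Build forward and reverse primers (5' to 3') for an open reading frame.

    Each primer is clamp + restriction site + template-matching bases.
    With keep_stop=False a terminal stop codon is excluded from the
    reverse primer so that translation can continue into a downstream fusion.
    """
    body = orf
    if not keep_stop and orf[-3:] in STOP_CODONS:
        body = orf[:-3]
    forward = clamp + fwd_site + body[:anneal_len]
    reverse = clamp + rev_site + reverse_complement(body[-anneal_len:])
    return forward, reverse
```

For the upstream CyPet gene one would call `design_primers` with `keep_stop=False` and the XhoI (CTCGAG) and EcoRI (GAATTC) sites, then check `wallace_tm` on the template-matching portion against the planned annealing temperature.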

2.3.3.3 Fluorescent protein cloning

1. Digest 10 μg of YPet PCR product with 30 units of BamHI and 25 units of NotI (or any other appropriate restriction enzymes) in a 50 μL reaction at 37°C for 16 hours.

2. Digest the CyPet PCR product with 40 units of XhoI and 40 units of EcoRI under the same conditions as the digest in step 1.

3. Digest 10 μg of the vector pEYFP-N1 with 30 units of BamHI and 25 units of NotI as above.

4. Separate the double-digested plasmid and PCR products on a 1% low melting point agarose gel and purify the DNA fragments using the Wizard SV Gel and PCR Clean-Up System (Promega, Madison, Wisconsin) as per manufacturer’s suggestions.

5. Follow steps 6 through 12 of Section 2.3.2.3 for cloning the digested YPet PCR product into the digested and phosphatase-treated pEYFP-N1 vector to generate plasmid pYPet-N1.

6. Perform multiple restriction digests to verify the fidelity of the obtained clones. For example, an enzyme that cuts within the ypet gene should be used along with an enzyme that cuts within the vector backbone. Plasmids which have the ypet gene correctly inserted will then display two appropriately sized bands on an agarose gel.

7. The putative correct clones are further propagated and the plasmid sequenced to verify the fidelity of the inserted gene.

8. Digest the newly engineered pYPet-N1 with 40 units each of EcoRI and XhoI in a 50 μL reaction overnight at 37°C.

9. Gel purify 10 μg of the EcoRI/XhoI digested pYPet-N1 as per step 4 above.

10. Follow steps 6 through 12 of Section 2.3.2.3 to clone the EcoRI/XhoI digested CyPet PCR product (from step 2) into the gel purified and newly phosphatase-treated pYPet-N1. The end result of this step should be pCyPet-YPet-N1, which is a plasmid containing both fluorophores separated by a small arbitrary nucleotide sequence (step 2 of Figure 2.3).

11. Digest the newly engineered pCyPet-YPet-N1 with 45 units each of EcoRI and BamHI in a 50 μL reaction overnight at 37°C.

12. Gel-purify 10 μg of the EcoRI/BamHI digested pCyPet-YPet-N1 as per step 4 above.

13. Phosphatase-treat the purified pCyPet-YPet-N1 as per step 6 of Section 2.3.2.3.



2.3.3.4 Linker oligonucleotide development and annealing

The linker region of the FRET construct contains the phosphoamino acid binding domain, the substrate region and the 9 amino acid flexible motif that links them. After identifying the appropriate domains for the construct, their complementary nucleic acid sequences can be combined to yield a functional FRET protein upon translation. (See Section 2.6, #3.)

1. Identify the DNA corresponding to the amino acid sequences of the three parts of the linker. In our ERK construct, the phosphoamino acid and substrate domains are 105 and 159 bases long, respectively, while the flexible motif is 27 bases in length.

2. Using the sense strand of the DNA sequence as the basis, divide the entire sequence into multiple fragments that are approximately equal in length. The last section of the sense strand may not be exactly the same length as the others.

3. Develop the complementary antisense strand of the first fragment so that it covers approximately 50% of the 5’ end of the sense strand. The second antisense fragment should span the remaining 50% of the previous fragment and 50% of the sense strand of the second fragment. Continue to develop fragments until the entire sequence is covered (Figure 2.4).

4. Synthesize the different fragments as oligonucleotides using any commercial DNA synthesis source.

5. Reconstitute each oligonucleotide to 100 μM with sterile TE buffer.

6. Anneal and amplify the oligonucleotides with GoTaq DNA polymerase or any other suitable polymerase. Add 0.5 μL of each oligonucleotide to the reaction and amplify for 40 cycles with an annealing temperature of 56ºC (see Section 2.6, #4).

7. Perform PCR once more using 1/20th of the reaction from step 6 as the template and the 5’ synthetic sense oligonucleotide as the forward primer (1 μM) and the 5’ synthetic antisense oligonucleotide as the reverse primer (1 μM). Perform PCR for 22 cycles with an annealing temperature of 62ºC. High yield of the linker by PCR can be obtained by performing numerous (4 to 5) reactions and combining them just prior to precipitation.

8. Precipitate the 50 μL linker PCR reaction with 5 μL sodium acetate (3M, pH 5.2) and 150 μL absolute ethanol. Vortex the product and incubate at –20ºC for 1 hour. Vortex once more and spin down at 16,000 × g for 30 minutes. Decant supernatant and resuspend DNA pellet in 30 μL ddH2O.
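The staggered layout of steps 2 and 3 (sense fragments laid end to end, antisense fragments offset by half a fragment so that each bridges two sense fragments) can be generated programmatically. The sketch below is one possible reading of that scheme; `tile_oligos` and `frag_len` are hypothetical names, and the fragment length is a design choice the text leaves open.

```python
def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def tile_oligos(sense, frag_len):
    """Split a sense strand into oligos and build half-offset antisense oligos.

    Sense oligos cover [0, L), [L, 2L), ...; the last may be shorter.
    Antisense oligos are the reverse complements of the windows
    [0, L/2), [L/2, 3L/2), [3L/2, 5L/2), ..., so every internal
    antisense oligo overlaps about half of two adjacent sense oligos.
    """
    if frag_len < 2:
        raise ValueError("frag_len must be at least 2")
    half = frag_len // 2
    sense_oligos = [sense[i:i + frag_len] for i in range(0, len(sense), frag_len)]
    bounds = [0] + list(range(half, len(sense), frag_len)) + [len(sense)]
    antisense_oligos = [reverse_complement(sense[a:b])
                        for a, b in zip(bounds, bounds[1:])]
    return sense_oligos, antisense_oligos
```

For a 291-base linker (105 + 27 + 159 bases, from step 1) and an assumed fragment length of 60, this gives five sense and six antisense oligonucleotides, each written 5' to 3' as it would be ordered.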

Figure 2.4 DNA sequence of the oligonucleotide fragments used for constructing the FRET plasmid linker region. Both sense and antisense strands are shown.

2.3.3.5 Linker region cloning

1. Digest 10 μg of the linker region with 40 units of EcoRI and 40 units of BamHI (or other appropriate restriction enzymes whose recognition sequences were incorporated into the synthetic oligonucleotides).

2. Separate the PCR product on a 2% low melting point agarose gel and purify the fragment using the Wizard SV Gel and PCR Clean-Up System (Promega, Madison, Wisconsin) as per manufacturer’s suggestions.

3. Use EcoRI/BamHI digested and phosphatase-treated pCyPet-YPet-N1 from step 13 of Section 2.3.3.3 as the vector and the gel purified linker for cloning. Follow steps 6 through 13 of Section 2.3.3.3 to clone the linker into pCyPet-YPet-N1. The end result of this step should be the final FRET construct, pCyPet-WWElk1-YPet-N1, which contains both fluorescent proteins separated by all three functional domains in the linker region (step 3 of Figure 2.3).

4. Sequence the plasmid to verify the fidelity of the FRET construct.

2.3.3.6 FRET control plasmid development

The functioning of the FRET construct must be validated using appropriate controls and microscopy measurements. Since FRET occurs, in part, through spectral overlap of one fluorophore (donor, CyPet) with another (acceptor, YPet), the extent of FRET signal is strongly influenced by the distance between the two fluorophores. Therefore, baseline signal values must be established to facilitate quantitative assessment of FRET signal. This can be accomplished by using plasmids that express either the donor (pCyPet-S) or acceptor (pYPet-S) fluorescent protein alone as these will provide the maximal intensity of each fluorophore. Similarly, a chimera between the donor and acceptor (pCyPet-YPet-Chimera, a CyPet-YPet fusion protein similar to CyPet-WWElk1-YPet but without the linker region) should be used as it provides the maximum FRET that can be achieved with the donor and acceptor fluorescent proteins (i.e., the closest distance between the two proteins).

1. Create a new forward primer for PCR of YPet from plasmid pCEP4YPet-MAMM [16].This primer should contain a Kozak initiation start sequence (GCCACC)downstream of a restriction enzyme (BamHI) recognition sequence to aid in proteintranslation. The reverse primer from step 1 of Section 2.3.3.2 should be used as thenew reverse primer for development of pYPet-S.

2. Create a new reverse primer for PCR of CyPet from plasmid pCEP4CyPet-MAMM[16]. This primer must complement the 3’ end of the cypet gene including the stopcodon (previous reverse primer contained a mutation to eliminate the stop codon).Additionally, it must contain a restriction enzyme (NotI) recognition sequence forcloning. Create a new forward primer analogous to the forward primer created instep 1 of Section 2.3.3.2, except that a different restriction enzyme (BamHI) is usedinstead of XhoI.

3. Set up two 25-μL PCR reactions for genes of both fluorescent proteins. Using anycommercially available polymerase kit, add primers at a final concentration of 0.1μM each, along with 100 ng of template. Perform the reaction for 40 cycles with anannealing temperature of 59°C.



4. Purify the reactions with a PCR clean-up kit for removal of the buffers and dNTPs. Elute PCR products with 35 μL of ddH2O.

5. Use BamHI/NotI digested and phosphatase treated pEYFP-N1 as the cloning vector for both CyPet-S and YPet-S.

6. Follow steps 6 through 13 of Section 2.3.3.3 for engineering of the pCyPet-S and pYPet-S constructs.

7. Sequence the plasmid to verify the fidelity of the control constructs.


2.4.1 Electroporation of TF reporter plasmids into 3T3-L1 preadipocytes

We demonstrate applicability of the method described above by generating a reporter cell line for the transcription factor PPARγ in 3T3-L1 preadipocytes. PPARγ is well established as a master regulator of adipocyte differentiation and function [8]. As preadipocytes differentiate into adipocytes in culture, the activity of PPARγ is expected to change continuously.

2.4.1.1 Electroporation of TF reporter plasmids into 3T3-L1 preadipocytes

1. Linearize the TF reporter construct containing the PPAR binding sites (15 μg) with 60 units of ApaLI (NEB, Ipswich, Massachusetts) (or any other enzyme that cuts only in an unnecessary portion of the plasmid such as the ampicillin resistance gene) in 60 μL total volume. Place the reaction in a 37°C water bath for 16 hours.

2. Precipitate the ApaLI digested plasmid with 6 μL sodium acetate (3 M, pH 5.2) and 180 μL absolute ethanol. Vortex the reactions and incubate at –20°C for 1 hour. Vortex once more and spin down the reactions at 16,000g for 30 minutes. Decant the supernatant and resuspend the pellets in 30 μL sterile PBS. It is important that the plasmid be sterile because it will be mixed with preadipocytes for electroporation.

3. Grow 3T3-L1 cells to confluence in a T-25 flask.

4. Wash cells three times with 1X PBS. Add 1 mL of trypsin-EDTA to the flask and incubate in a 37°C incubator for 5 minutes.

5. Remove the cells from the flask by adding 4 mL of complete growth medium. Pipette cells to a centrifuge tube and centrifuge at 800 rpm for 5 minutes at 4°C. While cells are being centrifuged, place a sterile 4-mm gap cuvette on ice.

6. Aspirate supernatant from the centrifuge tube (see Section 2.6, #5). Reconstitute the cell pellet in 400 μL of complete growth medium (see Section 2.6, #6) and pipette into the cold sterile cuvette. To the 400 μL cell suspension add the 30 μL sterile DNA solution from step 2. Pipette gently to mix.

7. Using the Gene Pulser Xcell electroporation system (Bio-Rad, Hercules, California), electroporate the cell/DNA suspension at 240 V and 950 μF with a time constant of ~48 ms (see Section 2.6, #7).

8. Immediately after electroporation, gently add 600 μL of complete growth medium to the cuvette and mix once. Set aside the cuvette at room temperature for 5 minutes to allow cells to recover.



9. Remove cells from the cuvette and place in a 100-mm cell culture dish with 14 mL of complete growth medium. Incubate dish at 37°C.
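As a rough plausibility check on the pulse parameters in step 7, recall that the time constant of an exponential decay pulse is the product of the effective resistance of the discharge circuit and the capacitance (τ = RC). The Python sketch below back-calculates the resistance implied by the reported values and flags time constants outside the 30 to 60 ms window given in Section 2.6, #7. The function names are illustrative assumptions and are not part of any instrument software.

```python
def implied_resistance_ohm(time_constant_ms: float, capacitance_uF: float) -> float:
    """Effective resistance R = tau / C for an exponential decay pulse."""
    tau_s = time_constant_ms * 1e-3   # milliseconds -> seconds
    c_f = capacitance_uF * 1e-6       # microfarads -> farads
    return tau_s / c_f


def time_constant_ok(time_constant_ms: float,
                     low_ms: float = 30.0,
                     high_ms: float = 60.0) -> bool:
    """True if the measured time constant lies inside the acceptable window."""
    return low_ms <= time_constant_ms <= high_ms


if __name__ == "__main__":
    # Values from step 7: 950 uF capacitance, ~48 ms time constant.
    r = implied_resistance_ohm(48.0, 950.0)
    print(f"implied resistance: {r:.1f} ohm")
    print("time constant acceptable:", time_constant_ok(48.0))
```

With the values from step 7, the implied resistance is roughly 50 Ω. A much shorter time constant would suggest a more conductive sample, which is consistent with the caution about residual salt in Section 2.6, #5.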

2.4.1.2 Clonal selection

1. After 48 hours of incubation, change the growth medium and supplement with 800 μg/mL of G418 (see Section 2.6, #8).

2. Change and supplement the medium with G418 every 48 hours. At 7 to 10 days post G418 addition, individual colonies should become visible. Allow the colonies to continue to grow until a single colony contains enough cells to be selected. Stop the culture if the colonies begin to touch each other.

3. Place the cells under the microscope and mark the colonies that look well developed (i.e., proper cell morphology), are separated from other colonies, and are sufficiently large to be isolated. Roughly 10 to 15 colonies should be marked for propagation.

4. Wash the dish twice with sterile PBS. Dip the bottom edge of a 6 × 8-mm cloning cylinder in sterile petrolatum. Using sterile forceps, place the cloning cylinder directly over the marked colony. Repeat this process for all marked colonies.

5. Add 30 μL of trypsin-EDTA to the center of the cloning cylinder. Make sure there are no air bubbles between the colony and the trypsin. Place the cells in a 37°C incubator for 5 minutes.

6. Add 50 μL of medium to all cylinders to stop the reaction. Gently pipette the medium/trypsin mixture in each cylinder multiple times to ensure the cells are detached and not clumped together. Place each colony in a single well of a 24-well plate. Make sure there is no cross-contamination between colonies because the cells are now individual populations. (See Section 2.6, #9.)

7. Culture the cells in the 24-well plate until they become confluent. Confluency may be reached at different times for each clone, so culture each clone accordingly. Typically, 3T3-L1 cells will become confluent 2 to 4 days after passing from the dish.

8. Wash the cells in the 24-well plate with sterile PBS and add 100 μL of trypsin-EDTA. Place the cells in a 37°C incubator for 5 minutes.

9. Add 400 μL of complete growth medium and pipette several times to ensure the cells have been detached from the well. Transfer all 500 μL of cell suspension (including trypsin) to a single well of a 6-well plate. Repeat until all the cells have been transferred to a 6-well plate, and culture cells for another 2 to 4 days (i.e., until confluence).

10. Wash confluent cells in the 6-well plate with sterile PBS and trypsinize with 400 μL of trypsin-EDTA. Incubate the cells in a 37°C incubator for 5 minutes.

11. Add 1.6 mL of complete growth medium to each well, and pipette to break up cell clumps and completely remove cells from the well.

12. Pipette the cell suspension into a T-25 flask and add medium to a total volume of 5 mL. Grow cells to confluency.

13. Once confluency is reached, cells should be frozen down in 1 mL aliquots (~1 × 10^6 cells) as per the storage section above.



2.4.1.3 Clonal screening

In order to obtain a clone with the TF reporter plasmid stably integrated into the genome without altering cell function, it is necessary to screen multiple colonies (see Section 2.6, #10). Typically, it is recommended that 10 to 15 colonies be screened for activation of the TF by monitoring the induction of GFP fluorescence upon exposure to a specific ligand relative to the initial time point (see Section 2.6, #11). Reporter clones that demonstrate significant GFP induction are identified for further screening and purification [5].

1. Identify and grow the TF reporter clones to ~70% confluency in 24-well tissue culture plates. Switch to phenol red free media 16 hours prior to starting the experiment. (See Section 2.6, #12.)

2. Determine the extent of GFP fluorescence by adding a known inducer of the TF being studied. For example, thiazolidinedione (TZD), a well-established PPAR agonist [17], can be added to induce expression of PPARγ.

3. Monitor the temporal change in GFP fluorescence using fluorescence microscopy. Once the maximum GFP signal is observed, trypsinize the cells for flow cytometry-based cell sorting.

4. Using flow cytometry, sort the cell population based on intensity of GFP expression and isolate the population that exhibits the maximum GFP fluorescence. This population contains cells in which the TF of interest is activated upon exposure to a specific ligand (i.e., the responsive population). This sorting step is also called “positive sorting.”

5. Culture the sorted cells in 6-well tissue culture plates. When cells are ~70% confluent, trypsinize them and once again sort using flow cytometry.

6. Using flow cytometry, collect cells that do not exhibit any fluorescence. Because these cells were not stimulated with any ligand, the cells that exhibit the least fluorescence represent the population in which background expression of GFP is minimal (“negative sorting”).

7. Culture the twice-sorted cells and again stimulate with a known ligand. Isolate responsive (GFP expressing) cells by flow cytometry. These represent the population that has the highest signal-to-noise ratio and demonstrates maximum induction of the TF of interest (i.e., can be used for dynamic profiling of TF activation).
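When comparing clones before sorting, one simple way to summarize "induction relative to the initial time point" is a fold-induction ratio. The Python sketch below ranks clones by that ratio. The clone names and intensity values are made up for illustration, and the ratio itself is an assumed summary statistic rather than a metric prescribed by this protocol.

```python
def fold_induction(initial: float, final: float, floor: float = 1e-9) -> float:
    """Ratio of final to initial mean fluorescence; `floor` avoids division by zero."""
    return final / max(initial, floor)


def rank_clones(measurements: dict) -> list:
    """Return clone names sorted from highest to lowest fold induction.

    `measurements` maps a clone name to (initial_intensity, final_intensity).
    """
    return sorted(
        measurements,
        key=lambda name: fold_induction(*measurements[name]),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical mean GFP intensities (arbitrary units) before and after ligand.
    clones = {
        "clone_A": (10.0, 40.0),   # 4-fold
        "clone_B": (5.0, 50.0),    # 10-fold, low background
        "clone_C": (30.0, 45.0),   # 1.5-fold, high background
    }
    print(rank_clones(clones))
```

A clone with high background (such as the hypothetical clone_C) ranks low even if its final intensity is high, which mirrors the signal-to-noise reasoning behind the positive and negative sorting steps.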

2.4.1.4 Monitoring PPARγ activation in 3T3-L1 adipocytes

The above-described procedure was used to isolate a 3T3-L1 preadipocyte reporter cell line for monitoring the activation of PPARγ during adipocyte differentiation and enlargement (Figure 2.5). Preadipocytes were grown to confluence and differentiated into mature adipocytes using the differentiation protocol above, and the fluorescence intensity was monitored every 48 hours. The data in Figure 2.5 show that no fluorescence was detected at the beginning of differentiation, indicating that PPARγ is not active at this time point. The fluorescence intensity increases after day 6 and is significantly higher at days 8 and 10 relative to the initial time point (i.e., at the later stages of adipocyte differentiation).



2.4.2 Monitoring activation of ERK in HepG2 cells

We demonstrate functionality of the ERK reporter construct through induction of the MEK/ERK pathway. Stimulation of cells containing the reporter plasmid by inducers such as phorbol 12-myristate 13-acetate (PMA), epidermal growth factor (EGF), and interleukin-6 (IL-6) will lead to activation of ERK and phosphorylation of downstream targets such as Elk1 [9]. In this example, we use IL-6 as it is a well-known activator of the MAPK pathway.

1. Prepare cell culture coverslips (Fisher Scientific, Rochester, New York) by coating with 2 mL of PBS supplemented with fibronectin (10 μg/mL) per well. Incubate slides at 37°C for 1 hour. Aspirate the PBS/fibronectin; gently wash once with media and set aside.

2. Seed ~8 × 10^5 HepG2 cells per well. Seed two wells per construct to be transfected.

3. Grow HepG2 cells in cell culture coverslips overnight. Transfection should be performed using cells that are 70% to 80% confluent.



Figure 2.5 Fluorescence images of cells from a single 3T3-L1 PPARγ reporter clone from induction of adipocyte differentiation through development of the mature adipocyte phenotype. The initial image was taken immediately after addition of the differentiation medium. All other images were taken at 48-hour intervals through 10 days of culture.

4. Replenish medium in coverslips with 1 mL of fully supplemented medium 1 hour prior to transfection.

5. Mix 1.5 μg of each plasmid (pCyPet-YPet-Chimera, pCyPet-WWElk1-YPet, pCyPet-S, and pYPet-S) in 100 μL of serum- and antibiotic-free medium, using eight separate tubes (two per plasmid).

6. Gently mix GenJet HepG2 transfection reagent (or any other commercial transfection reagent) prior to pipetting. To eight additional tubes containing 100 μL of serum- and antibiotic-free medium, add 4.5 μL GenJet reagent and mix gently by flicking the tubes several times.

7. Immediately add the 100 μL of medium containing the GenJet reagent to the 100 μL of medium containing plasmid, and incubate the DNA-GenJet mixture at room temperature for 15 minutes to form the transfection complex.

8. Add the transfection complex dropwise to cells and gently rock the plate to uniformly disperse the transfection complex. Return the plate to the incubator and incubate for 12 to 18 hours before replacing the media with fresh medium not containing the transfection complex.

9. Continue to grow cells for 18 to 24 hours before stimulation of ERK with IL-6.

10. Stimulate cells with IL-6. The stimulation time and concentration may vary based on the cell line being used; in our example, we used 100 ng/mL of IL-6.

11. Place slide chambers on the stage of a Zeiss Axiovert 200M inverted microscope equipped with two CoolSNAP cameras and a 60X water immersion objective lens.

12. Collect FRET data for pCyPet-S (CFP alone), pYPet-S (YFP alone), pCyPet-WWElk1-YPet (+IL-6), and pCyPet-WWElk1-YPet (−IL-6) using three channels: CFP (donor) channel, YFP (acceptor) channel, and FRET channel. The CFP and YFP channels are configured to use the respective excitation and emission filters (400 nm excitation and 470 nm emission for CFP, 480 nm excitation and 530 nm emission for YFP), while the FRET channel is configured to use CFP excitation and YFP emission.

13. Determine the extent of bleed-over of the donor signal to the FRET channel by measuring the FRET channel signal with the pCyPet-S (CFP only) transfected cells. Similarly, calculate bleed-over of the acceptor signal by measuring the FRET channel signal of the pYPet-S (YFP only) transfected cells.

14. Calculate corrected FRET (FRETc) by using the following equation:

$$\mathrm{FRETc} = \mathrm{FRET} - (Df/Dd)\,[\mathrm{CFP}] - (Df/Da)\,[\mathrm{YFP}]$$

where FRET, [CFP], and [YFP] are the signals visualized through the FRET, CFP, and YFP filter sets, respectively. The constants Df/Dd and Df/Da are the bleed-through constants describing donor emission visible in the FRET channel and direct excitation of the acceptor, respectively [18].
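The correction in steps 13 and 14 can be sketched in a few lines of Python. The sketch assumes that each bleed-through constant is estimated as the mean ratio of FRET-channel signal to own-channel signal in cells expressing a single fluorophore; that estimator, the function names, and the intensity values are illustrative assumptions and should be checked against the imaging software actually used.

```python
from statistics import mean


def bleed_through(fret_channel, own_channel):
    """Mean ratio of FRET-channel to own-channel signal in single-fluorophore cells."""
    return mean(f / o for f, o in zip(fret_channel, own_channel))


def corrected_fret(fret, cfp, yfp, df_dd, df_da):
    """FRETc = FRET - (Df/Dd)[CFP] - (Df/Da)[YFP]."""
    return fret - df_dd * cfp - df_da * yfp


if __name__ == "__main__":
    # Made-up intensities from CFP-only (pCyPet-S) and YFP-only (pYPet-S) cells.
    df_dd = bleed_through([50.0, 60.0], [100.0, 120.0])   # donor bleed-through
    df_da = bleed_through([12.5, 25.0], [100.0, 200.0])   # acceptor cross-excitation
    # Made-up intensities from one cell expressing the FRET construct.
    fretc = corrected_fret(fret=400.0, cfp=300.0, yfp=400.0,
                           df_dd=df_dd, df_da=df_da)
    print(df_dd, df_da, fretc)
```

In practice the same calculation would be applied per cell (or per pixel) and then averaged across cells for each condition.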

Figure 2.6 shows the FRETc signal in control and HepG2 cells stimulated with IL-6 for 2 hours. Based on measurements from approximately 150 cells, it is evident that IL-6 stimulation induces a statistically significant change in ERK activation in HepG2 cells.




For the TF reporter, hormonal cues added to the adipocyte differentiation media (insulin, IBMX, T3, and dexamethasone) are “inputs” that adipocytes respond to and that initiate activation of the TF being investigated. The “output” of the system is the induction of fluorescence through activation of PPARγ at different stages of adipocyte differentiation. We expect PPARγ to be activated differently during adipocyte differentiation, leading to different levels of fluorescence at different stages of adipocyte differentiation (Figure 2.5). The fluorescence data will be interpreted based on the difference in the magnitude of activation or down-regulation between different time points as well as between any time point and the initial fluorescence value. For the FRET-based ERK reporter, specific addition of a ligand such as IL-6 serves as the input for signaling through the MAPK signal transduction pathway, leading to phosphorylation of ERK1/2 and activation of downstream transcription factors. Similarly, activation of ERK will be evaluated based on activation observed with the CyPet-YPet chimera (FRET construct without the flexible linker) and in the presence of MAPK activity inhibitors (e.g., U0126).

In general, for any system, we expect the temporal changes in fluorescence to be correlated to the temporal activation profile of the regulatory molecule being investigated, and when profiling activation of multiple TFs, the data need to be interpreted based on the TF activated at each time point and/or the relative magnitudes of activation.
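The two comparisons described above (each time point against the initial value, and each time point against the previous one) can be illustrated with a short Python sketch. The intensity values are made up and the function names are hypothetical; the sketch only shows the bookkeeping, not a statistical analysis.

```python
def changes_vs_initial(intensities):
    """Difference between each time point and the initial fluorescence value."""
    initial = intensities[0]
    return [x - initial for x in intensities]


def changes_between_points(intensities):
    """Difference between consecutive time points.

    Positive values indicate activation, negative values down-regulation.
    """
    return [later - earlier for earlier, later in zip(intensities, intensities[1:])]


if __name__ == "__main__":
    days = [0, 2, 4, 6, 8, 10]
    # Hypothetical mean fluorescence (arbitrary units) at 48-hour intervals.
    mean_intensity = [5.0, 5.0, 6.0, 9.0, 20.0, 26.0]
    for day, delta in zip(days, changes_vs_initial(mean_intensity)):
        print(f"day {day}: change vs. day 0 = {delta}")
    print("step changes:", changes_between_points(mean_intensity))
```

Differences are used here rather than ratios because the initial fluorescence can be close to zero, as in Figure 2.5, which would make a ratio unstable.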



[Figure 2.6 plot: FRET signal (y-axis, ticks from 1.5 to 1.85) for the Control and IL-6 conditions.]

Figure 2.6 Activation of ERK in HepG2 cells by IL-6. HepG2 cells transfected with the ERK reporter construct were stimulated with 100 ng/mL of IL-6 for 2 hours. Data shown are the average of 150 single-cell measurements captured using the FRET channel (400 nm excitation and 530 nm emission) from two independent cultures. * indicates statistical significance at p < 0.001.


1. Transcription factor binding elements can be designed in several configurations. In the PPAR example, three tandem binding sites were used with a single nucleotide spacer between each. Since the dynamics of the tertiary conformation induced by binding of a TF to its response element are not fully understood, it is advisable to use two to four tandem repeats to mitigate any inhibitory effects induced in the promoter region by the binding of the TF. Similarly, the nucleotide spacer, which works to increase the efficiency of binding, can be increased or decreased. More spacing between the tandem repeats may be needed when the design uses fewer binding sites and vice versa.

2. PCR can be done with any commercially available kit that uses a high fidelity proofreading polymerase. We have had success with Promega’s GoTaq PCR kit and have seen minimal errors in the amplified sequence.

3. The linker region should be developed with several considerations in mind. First, it needs to be composed of amino acids that will provide flexibility to the construct. This is important because the acceptor and donor fluorescent proteins must come into close contact with each other upon kinase phosphorylation. Second, when developing the synthetic oligonucleotides, one or two bases may need to be added to keep the FRET protein “in-frame.” For example, the 3’ end of CyPet (the upstream protein of the construct) will have a mutated stop codon. As a result, the restriction enzyme recognition sequence (e.g., GGA TCC for BamHI) will be transcribed as two additional codons followed by reading of the synthetic linker region. Therefore, it is critical that the sequence of the synthetic linker region be designed such that codons are always in frame.

4. Annealing of the synthetic oligonucleotides can be done in the early stages of PCR using any standard PCR kit. Supplementation with dNTPs will allow for the polymerase to add phosphate groups to the portions of the linker which need to be combined. Annealing and combining of the synthetic oligonucleotides may require the majority of the reaction time, so additional rounds of PCR amplification may need to be performed.

5. It is important to remove all media from the pellet because salt and serum contamination interferes with plasmid electroporation. Also, the volume to be used for electroporation is crucial.

6. Complete growth medium may adversely affect the efficiency of transfection. If this method is used and low transfection efficiency is observed, reconstituting the pellet in DMEM without serum and without antibiotics can be used to increase the efficiency. Additionally, there are commercially available electroporation buffers (e.g., hypoosmolar buffer and iso-osmolar buffer) that are ideal for reconstitution.

7. The optimal voltage and capacitance values for each cell type need to be determined. The time constant for an exponential decay pulse should be around 48 ms. If it is less than 30 ms or greater than 60 ms, the electroporation is likely to have failed. Salt concentration, cell number, plasmid purity, and media composition can all have an effect on electroporation efficiency. Values reported here are specific to the Gene Pulser Xcell unit (Bio-Rad, Hercules, California). However, similar trends will be seen on any other commercially available electroporation unit.



8. Geneticin sulfate salt (G418) is used as the selection antibiotic for the electroporated cells. Unfortunately, the potency of G418 varies significantly from lot to lot. Additionally, some cell types are more susceptible to G418 than others. Therefore, a kill curve is recommended when switching from one lot of G418 to another and when the cell type is changed. If a kill curve is not performed, false positive clones may result or all cells may die, even those that have successfully incorporated the plasmid.

9. Clonal screening must be done prior to experimentation because not all clones will exhibit the same response when stimulated. The site of plasmid incorporation and the number of integrated copies play a major role in the responsiveness of the reporter clone. A high level of background can be seen in clones that incorporated the plasmid in a highly active region of a chromosome.

10. For high electroporation efficiencies, cells plated in a 100-mm dish will result in multiple colonies after selection. Therefore, cells should be grown such that individual colonies can be isolated. If the size of the isolated colony is small, cells should be passed to a single well of a 48-well plate and not a 24-well plate.

11. It is advised that no more than three to six clones be screened at any one time, as the possibility of cross-contamination increases significantly. Screening more than one TF construct at a time is also not advised, as cross-contamination between constructs cannot be easily determined. Generally speaking, an endpoint analysis should be sufficient for screening the clones. The analysis should include fluorescent images of the initial and final time points so that image analysis can be done to determine the change in fluorescence from the initial time point to the final time point.

12. The medium used for this portion of the experiment should not contain phenol red, as phenol red interferes with fluorescence imaging.
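The reading-frame requirement in note 3 lends itself to a quick computational check before ordering oligonucleotides. The Python sketch below tests whether a candidate insert (restriction sites plus linker) has a length that is a multiple of three and contains no stop codon in frame. The example sequences are hypothetical, and the check assumes the sequence begins at the first base after the last codon of the upstream protein.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq: str) -> list:
    """Split a DNA sequence into complete codons, ignoring spaces and case."""
    clean = seq.upper().replace(" ", "")
    usable = len(clean) - len(clean) % 3
    return [clean[i:i + 3] for i in range(0, usable, 3)]


def is_in_frame(seq: str) -> bool:
    """True if the length is a multiple of three and no in-frame stop codon occurs."""
    clean = seq.upper().replace(" ", "")
    if len(clean) % 3 != 0:
        return False
    return not any(codon in STOP_CODONS for codon in codons(clean))


if __name__ == "__main__":
    # Hypothetical insert: BamHI site + short flexible linker + EcoRI site.
    good = "GGATCC GGTGGTTCTGGT GAATTC"
    shifted = "GGATCC GGTGGTTCTGGTA GAATTC"   # one extra base shifts the frame
    print(codons(good))
    print(is_in_frame(good), is_in_frame(shifted))
```

If the check fails because of length, adding one or two bases to the oligonucleotide (as described in note 3) restores the frame, and the check can then be rerun on the revised sequence.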

2.7 Summary Points

The methods detailed in this chapter describe the development of:

1. GFP-based reporter plasmids for dynamically monitoring transcription factor activation;

2. FRET-based reporter plasmids for dynamically monitoring kinase activity;

3. Stable reporter cell lines for dynamic profiling of TF activation.





No significant FRET | Inappropriate design | Optimize linker length and composition
Low transfection efficiency | Serum or antibiotic present | Remove serum and antibiotics from transfection mix
False positive clones | G418 concentration too low | Determine the lowest concentration of G418 that kills all cells of a negative control
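Choosing the G418 concentration from a kill curve (see also Section 2.6, #8) reduces to picking the lowest tested concentration at which no untransfected control cells survive. The Python sketch below illustrates that selection. The concentrations and surviving fractions are made up, and the function name is a hypothetical helper.

```python
def minimal_lethal_concentration(kill_curve: dict):
    """Lowest concentration with no surviving control cells, or None if none is lethal.

    `kill_curve` maps G418 concentration (ug/mL) to the surviving fraction of
    untransfected (negative control) cells at the end of the selection period.
    """
    lethal = [conc for conc, surviving in kill_curve.items() if surviving == 0]
    return min(lethal) if lethal else None


if __name__ == "__main__":
    # Hypothetical kill curve for one G418 lot and one cell type.
    curve = {0: 1.0, 200: 0.6, 400: 0.15, 600: 0.0, 800: 0.0, 1000: 0.0}
    print(minimal_lethal_concentration(curve))
```

If the function returns None, the tested range was too low for that lot and cell type, and the kill curve should be repeated with higher concentrations.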

Acknowledgments

This work was supported by grants from the National Science Foundation (CBET-0651864) and the American Heart Association (AY0755112Y). The authors wish to thank Professor Robert Burghardt and Dr. Roula Barhoumi Mouneimne for help with FRET imaging and analysis.

References

[1] Elnitski, L., V.X. Jin, P.J. Farnham, and S.J. Jones, “Locating mammalian transcription factor binding sites: a survey of computational and experimental techniques,” Genome Res., Vol. 16, 2006, pp. 1455–1464.

[2] King, K.R., S. Wang, A. Jayaraman, M.L. Yarmush, and M. Toner, “Microfluidic flow-encoded switching for parallel control of dynamic cellular microenvironments,” Lab Chip, Vol. 8, 2008, pp. 107–116.

[3] Lu, P.J., X.Z. Zhou, M. Shen, and K.P. Lu, “Function of WW domains as phosphoserine- or phosphothreonine-binding modules,” Science, Vol. 283, 1999, pp. 1325–1328.

[4] Thompson, D.M., K.R. King, K.J. Wieder, M. Toner, M.L. Yarmush, and A. Jayaraman, “Dynamic gene expression profiling using a microfabricated living cell array,” Anal. Chem., Vol. 76, 2004, pp. 4098–4103.

[5] Wieder, K.J., K.R. King, D.M. Thompson, C. Zia, M.L. Yarmush, and A. Jayaraman, “Optimization of reporter cells for expression profiling in a microfluidic device,” Biomed. Microdevices, Vol. 7, 2005, pp. 213–222.

[6] Violin, J.D., J. Zhang, R.Y. Tsien, and A.C. Newton, “A genetically encoded fluorescent reporter reveals oscillatory phosphorylation by protein kinase C,” J. Cell. Biol., Vol. 161, 2003, pp. 899–909.

[7] Zhang, J., Y. Ma, S.S. Taylor, and R.Y. Tsien, “Genetically encoded reporters of protein kinase A activity reveal impact of substrate tethering,” Proc. Natl. Acad. Sci. USA, Vol. 98, 2001, pp. 14997–15002.

[8] Ni, Q., D.V. Titov, and J. Zhang, “Analyzing protein kinase dynamics in living cells with FRET reporters,” Methods, Vol. 40, 2006, pp. 279–286.

[9] Davis, R.J., “Transcriptional regulation by MAP kinases,” Mol. Reprod. Dev., Vol. 42, 1995, pp. 459–467.

[10] King, K.R., S. Wang, D. Irimia, A. Jayaraman, M. Toner, and M.L. Yarmush, “A high-throughput microfluidic real-time gene expression living cell array,” Lab Chip, Vol. 7, 2007, pp. 77–85.

[11] Fantz, D.A., D. Jacobs, D. Glossip, and K. Kornfeld, “Docking sites on substrate proteins direct extracellular signal-regulated kinase to phosphorylate specific residues,” J. Biol. Chem., Vol. 276, 2001, pp. 27256–27265.

[12] Fu, H., R.R. Subramanian, and S.C. Masters, “14-3-3 proteins: structure, function, and regulation,” Annu. Rev. Pharmacol. Toxicol., Vol. 40, 2000, pp. 617–647.

[13] Durocher, D., J. Henckel, A.R. Fersht, and S.P. Jackson, “The FHA domain is a modular phosphopeptide recognition motif,” Mol. Cell., Vol. 4, 1999, pp. 387–394.

[14] Nguyen, A.W., and P.S. Daugherty, “Evolutionary optimization of fluorescent proteins for intracellular FRET,” Nat. Biotechnol., Vol. 23, 2005, pp. 355–360.

[15] Verdecia, M.A., M.E. Bowman, K.P. Lu, T. Hunter, and J.P. Noel, “Structural basis for phosphoserine-proline recognition by group IV WW domains,” Nat. Struct. Biol., Vol. 7, 2000, pp. 639–643.

[16] Shao, D., and M.A. Lazar, “Peroxisome proliferator activated receptor γ, CCAAT/enhancer-binding protein α, and cell cycle status regulate the commitment to adipocyte differentiation,” J. Biol. Chem., Vol. 272, 1997, pp. 21473–21478.

[17] Dumasia, R., et al., “Role of PPAR-gamma agonist thiazolidinediones in treatment of pre-diabetic and diabetic individuals: a cardiovascular perspective,” Curr. Drug. Targets Cardiovasc. Haematol. Disord., Vol. 5, 2005, pp. 377–386.

[18] Sorkin, A., et al., “Interaction of EGF receptor and grb2 in living cells visualized by fluorescence resonance energy transfer,” Curr. Biol., Vol. 10, 2000, pp. 1395–1398.


CHAPTER 3

Comparison of Algorithms for Analyzing Fluorescent Microscopy Images and Computation of Transcription Factor Profiles

Zuyi Huang and Juergen Hahn*
*Artie McFerrin Department of Chemical Engineering, Texas A&M University, College Station, Texas 77843-3122, e-mail: [email protected]


Key terms
Green fluorescent protein (GFP) reporter systems
Fluorescent microscopy images
Image analysis
Inverse problem
Transcription factor profiles
Principal component analysis (PCA)
K-means clustering
Wavelet
TNF-α signaling pathway
Mathematical modeling

Abstract

Obtaining quantitative data about protein concentrations is an important component in systems biology; however, only a few options for generating such data exist. One of these is to use green fluorescent protein (GFP) reporter systems as an indicator of protein concentration; however, the measurements consist of a series of fluorescent microscopy images that need to be analyzed to derive time-dependent quantitative data.

This chapter presents two techniques for determining data from fluorescent microscopy images. The first technique uses wavelets to sharpen the image contrast between cells and a bidirectional search to identify the cell region. The second technique is based on K-means clustering and uses principal component analysis (PCA). A comparison of these two methods is made where the dynamics of NF-κB in the TNF-α signaling pathway is investigated.

3.1 Introduction

Signal transduction plays a key role in systems biology as signal transduction pathways are responsible for relaying cellular information and are involved in the regulation of cellular responses. An understanding of signal transduction mechanisms offers the potential for improved treatment options for diseases; for example, abnormalities of the Jak/STAT signaling pathway have been linked to colon cancer [1] and abnormalities of MAPK signaling have been associated with gastric cancer [2]. One possibility for developing an understanding of the dynamics of signal transduction pathways is the derivation of models describing the pathways. However, deriving an accurate model of a signal transduction pathway is nontrivial as the mechanisms tend to involve many components and the system will have a large degree of uncertainty in both its structure and parameter values (for some examples of signal transduction pathways, see the models in [3–5]). Validation and refinement of any model is a crucial step for modeling signal transduction pathways; however, these steps can only be undertaken if experimental data is available or can easily be derived.

One popular approach for collecting experimental data for signal transduction pathways involves Western blotting (e.g., in [6, 7]). While performing a Western blot is a relatively simple experiment, it does have the drawbacks that: (1) Western blotting is a destructive measurement technique, and (2) the data is semi-quantitative in nature [8, 9]. The first drawback poses a problem for the use of Western blots for experiments where a time series of a concentration profile of a particular protein is to be measured, while the latter results from the limitation of the technique itself (i.e., it is not always possible to determine “how black a Western blot is” and to what protein concentration this level of color corresponds). One promising approach for taking dynamic measurements is to use a green fluorescent protein (GFP) reporter system [10, 11]. This method is based upon the idea that expression of certain genes will also result in the formation of GFP for a cell line that has been modified accordingly. It is then possible to take fluorescent microscopy images which show the fluorescence of the cells, where the degree of fluorescence can be correlated with the concentration of the transcription factor that is present in the nucleus of the cells. Compared with the data from Western blots, the fluorescence intensity profile provides more easily quantifiable data, which can be used to validate or update the mathematical model. GFP reporter systems have been extensively used in clone isolation [12], identification and detection of promoter activity [13–15], the assessment of gene transfer and expression [16, 17], and the study of hepatitis B virus replication [18], to name just a few applications. However, using fluorescent microscopy images of GFP reporter cells is a relatively new approach [19]. An automated image analysis procedure to identify the GFP localization regions with standard MATLAB commands has been presented in [20]; however, the procedure only determines regions of fluorescence and does not provide quantitative data about the fluorescence intensity.

Analyzing fluorescent microscopy images to obtain quantitative information is not a trivial task for several reasons: (1) not all cells will express GFP; (2) fluorescence seen in images can vary over time due to fluctuations occurring during the measurement process as well as other cellular functions; and (3) some of the fluorescence seen in the images may be an artifact of the image. Image analysis algorithms are required in order to address these points. Accordingly, developing algorithms for analyzing fluorescent microscopy images of GFP reporter cells is an important step for obtaining quantitative data of protein concentrations in signal transduction pathways. In this regard, two image analysis methods are presented in this work. The goal of these algorithms is to determine which areas of an image represent cells where fluorescence can be seen and to quantify the amount of fluorescence in a second step. The first method uses wavelets for sharpening image contrast and then searches through the individual pixels of an image in two directions to determine if a pixel corresponds to a cell, the background, or to an artifact of the image and/or measurement. The second technique is based on K-means clustering and PCA. A comparison of these two techniques is provided and a series of images of hepatocytes stimulated with three different concentrations of TNF-α have been analyzed. As the fluorescence intensity is only an indicator of the amount of GFP present in a cell, the data is further analyzed in order to determine the concentration of the transcription factor responsible for transcription of RNA containing code for GFP. Based on the dynamic data from these two image analysis algorithms, the NF-κB dynamics for different stimulation concentrations of TNF-α is obtained from a proposed NF-κB-GFP model.

3.2 Preliminaries

3.2.1 Principles of GFP reporter systems

In GFP reporter systems, a DNA fragment encoding GFP is inserted into the DNA of the cell. Upon stimulation, the transcription factor (TF) is activated and translocates to the nucleus. The transcription factor then binds to the promoter and initiates transcription of the DNA, which also includes the code for green fluorescent protein, as this code has previously been inserted into the DNA. In a next step, the GFP-RNA is translated into GFP which, after post-translational modification, produces the green fluorescence seen in fluorescent microscopy images. Figure 3.1 shows a simple illustration of the principles of a GFP reporter system. No fluorescence can be seen until the transcription factor binds to the promoter; however, fluorescence is easily visible after transcription factor activation. Stronger stimulation will lead to a larger concentration of the transcription factor molecules in the nucleus, which, in turn, results in increased GFP expression and more fluorescence. The dynamics of the transcription factor can therefore be measured indirectly by quantifying the time-series of the fluorescence intensity seen in fluorescent microscopy images.

Figure 3.1 GFP reporter systems. The DNA response element (RE) to which the TF binds is upstream of a minimal promoter that controls GFP expression: (a) before the transcription factor binds to the promoter; and (b) after the transcription factor binds to the promoter.

3.2.2 Wavelets

Equation (3.1) shows the wavelet transformation.

$$
W(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} f(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt \tag{3.1}
$$

where a is a real variable representing the scale or dilation, b is a real variable representing the time shift or translation, ψ(t) is a wavelet function [e.g., $\psi(t) = e^{-t^2/2}\cos\left(\pi\sqrt{2/\ln 2}\,t\right)$ for a Morlet wavelet], * denotes the complex conjugate operator, and f(t) is the processed signal. Wavelet transforms can be considered as a collection of inner products of f(t) and $\psi_{a,0}(t-b)$ at a and b; that is, $W(a,b) = \langle f(t), \psi_{a,0}(t-b) \rangle$. The values of W(a,b) for different a provide the frequency domain information, while the values of W(a,b) for different b provide time domain information.

The availability of both frequency domain and time domain information makes wavelet transformation particularly attractive for denoising of data [20, 21]. The principle of wavelet denoising is that wavelet transforms can decompose an image into multiple scales by dilation and compression and then remove the noise at multiple scales by thresholding [22]. The general procedure for denoising via wavelets consists of three steps [23]: (1) decompose an image into N levels via a wavelet transform, where N is an integer chosen by experience; (2) calculate a threshold [24] and compare the high frequency components of each level from 1 to N to this threshold; (3) reconstruct the image by using the low frequency components of level N and the modified high frequency components of levels 1 to N.
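The three denoising steps (decompose, threshold, reconstruct) can be illustrated with a minimal sketch. It is not the procedure used in this chapter: it swaps in the Haar wavelet on a one-dimensional signal, uses soft thresholding, and takes the threshold `thr` from the caller. The function names are hypothetical, and the signal length is assumed to be divisible by 2 raised to the number of levels.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_decompose(signal, levels):
    """Step 1: split the signal into a coarse approximation and per-level details."""
    approx = list(signal)
    details = []  # details[0] is level 1 (finest scale)
    for _ in range(levels):
        half = len(approx) // 2
        low = [(approx[2 * i] + approx[2 * i + 1]) / SQRT2 for i in range(half)]
        high = [(approx[2 * i] - approx[2 * i + 1]) / SQRT2 for i in range(half)]
        details.append(high)
        approx = low
    return approx, details

def soft_threshold(value, thr):
    """Step 2: shrink a high-frequency coefficient toward zero by thr."""
    return math.copysign(max(abs(value) - thr, 0.0), value)

def haar_reconstruct(approx, details):
    """Step 3: rebuild the signal from level N back to level 1."""
    approx = list(approx)
    for high in reversed(details):
        rebuilt = []
        for low_i, high_i in zip(approx, high):
            rebuilt.append((low_i + high_i) / SQRT2)
            rebuilt.append((low_i - high_i) / SQRT2)
        approx = rebuilt
    return approx

def haar_denoise(signal, levels, thr):
    approx, details = haar_decompose(signal, levels)
    details = [[soft_threshold(c, thr) for c in high] for high in details]
    return haar_reconstruct(approx, details)
```

With `thr = 0` the signal is reproduced unchanged, which is a convenient sanity check. A two-dimensional image could be treated by applying the same idea along rows and columns, though a dedicated wavelet library is the more practical route.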

3.2.3 K-means clustering

K-means clustering is a method for identifying patterns in data and for dividing data into k disjoint clusters [25]. The principle of K-means clustering is to minimize the objective function shown in (3.2) by determining centroids for each of the k clusters:

$$
\min_{\mu} f = \sum_{i=1}^{k} \sum_{x_j \in S_i} \left\| x_j - \mu_i \right\|^2 \tag{3.2}
$$

where $S_i$, i = 1, 2, …, k, represents all points belonging to the ith cluster, $\mu_i$ is the centroid of all the points $x_j \in S_i$, and μ is the collection of all the centroids. $\mu_i$ is calculated by (3.3).

$$
\mu_i = \frac{\sum_{x_j \in S_i} x_j}{N_i} \tag{3.3}
$$

where $N_i$ is the total number of data points in cluster $S_i$.

The procedure to perform K-means clustering consists of the following steps:


1. The initial centroids $\mu_i$, i = 1, 2, …, k, for the k clusters are assigned or randomly sampled from the data points.

2. Each data point $x_j$ is assigned to a cluster m. This decision is made by determining the smallest value $\|x_j - \mu_m\|^2$ among all possible values $\|x_j - \mu_i\|^2$, i = 1, 2, …, k.

3. The function f from (3.2) is evaluated by summing the squared distances over all data points and all clusters.

4. Equation (3.3) is used to update the centroid of each cluster by averaging the data points of the corresponding cluster.

5. Steps 2 through 4 are repeated iteratively until the relative change in the objective function f between iterations is less than a certain threshold. The iterative refinement procedure is known as Lloyd's algorithm [26, 27].

A key factor in K-means clustering is the selection of the initial centroids for the k clusters. A proper choice of the initial centroids will make the clustering algorithm converge faster to the optimal solution.
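The five steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this chapter: the function names, the default tolerance, and the iteration cap are hypothetical, and the initial centroids are supplied by the caller.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centroids, tol=1e-9, max_iter=100):
    """Lloyd's algorithm; returns (centroids, labels, objective)."""
    centroids = [tuple(c) for c in initial_centroids]  # step 1: caller-supplied start
    previous = None
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Step 3: evaluate the objective f from (3.2).
        objective = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
        # Step 4: update each centroid with (3.3); empty clusters keep their centroid.
        for i in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
        # Step 5: stop when the relative change in f is below the tolerance.
        if previous is not None and abs(previous - objective) <= tol * max(previous, 1e-300):
            break
        previous = objective
    return centroids, labels, objective
```

For example, `kmeans([(0.0,), (1.0,), (10.0,), (11.0,)], [(0.0,), (10.0,)])` separates the two low values from the two high values. Starting from poorly placed centroids would typically need more iterations, which is the point made in the preceding paragraph.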

3.2.4 Principal component analysis

Principal component analysis (PCA) [28] is a well-established technique for identifying multivariable patterns in data. A data matrix X can be decomposed as follows using PCA:

$$
X = T P^{T} + E \tag{3.4}
$$

where T is the score matrix, P is the loading matrix, and E is the residual between the actual data and the reconstruction by PCA. The columns of P represent principal components of the data matrix, while the columns of T are the projections of the data matrix onto the principal components [29].

The motivation for using PCA for image analysis comes from the work presented in [30, 31], which shows that clusters in a score plot from PCA are associated with features of an image. Furthermore, combining K-means clustering and PCA has been widely studied for clustering [32, 33].
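As a small illustration of (3.4), the sketch below extracts the first principal component of a data matrix and the corresponding scores. It rests on assumptions not stated in the text: the data are mean-centered first, the loading vector is found by power iteration on the sample covariance matrix, and the function name is hypothetical. Power iteration can fail if the start vector happens to be orthogonal to the leading component.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (mean, loading, scores) for the first principal component."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]
    # Power iteration: repeated multiplication converges to the leading eigenvector.
    loading = [1.0] * dim
    for _ in range(iterations):
        product = [sum(cov[a][b] * loading[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        loading = [x / norm for x in product]
    # Scores: projection of each centered row onto the loading vector (one column of T).
    scores = [sum(c[j] * loading[j] for j in range(dim)) for c in centered]
    return mean, loading, scores
```

Applied to the (i × j) × 3 pixel matrix of an RGB image, the loading vector gives a direction in red-green-blue space, and the scores order the pixels along that direction.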

3.2.5 Mathematical description of digital images and image analysis

The tri-stimulus theory states that any visual color can be represented by overlaying three color information channels. For television and computer graphics, the standard colors used are red, green, and blue [30]. An RGB image can be represented by a three-dimensional tensor

$$
M = \begin{bmatrix}
(r,g,b)_{11} & \cdots & (r,g,b)_{1j} \\
\vdots & \ddots & \vdots \\
(r,g,b)_{i1} & \cdots & (r,g,b)_{ij}
\end{bmatrix} \tag{3.5}
$$

where M is of size i × j × 3. Here i × j is the resolution of the image, which means that there are i rows of pixels in the image and each row has j columns of pixels. Each pixel of the image has three intensity values (i.e., one each for red, green, and blue). M can be rewritten as a two-dimensional matrix X of size (i × j) × 3, as shown in (3.6), by listing the three intensity values of each pixel in a row, such that each row of X represents the red, green, and blue values of a pixel:

$$
X = \begin{bmatrix}
r_1 & g_1 & b_1 \\
r_2 & g_2 & b_2 \\
\vdots & \vdots & \vdots \\
r_{i \times j} & g_{i \times j} & b_{i \times j}
\end{bmatrix} \tag{3.6}
$$

The intensity for each pixel is defined as the sum of the red, green, and blue values.

$$
I = r + g + b \tag{3.7}
$$

Another option is to use the tensor M and calculate the intensity for each pixel according to (3.7). This results in the RGB image M being transformed into a two-dimensional gray image by replacing each pixel by its intensity:

$$
N = \begin{bmatrix}
(r+g+b)_{11} & \cdots & (r+g+b)_{1j} \\
\vdots & \ddots & \vdots \\
(r+g+b)_{i1} & \cdots & (r+g+b)_{ij}
\end{bmatrix} \tag{3.8}
$$

where N is of dimension i × j.

Image analysis extracts information from a time-series of images, represented by the M matrices recorded at different points in time. A common image analysis procedure is to: (1) record images at different time points, (2) analyze the images separately, and (3) combine the image analysis results for the different time points to determine the dynamics of the system. One popular application of such an image analysis procedure is the study of MRI intensity patterns [34].
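The rearrangements of an RGB image in (3.6) and (3.8) can be sketched as follows. The image is assumed to be stored as a nested list of (r, g, b) tuples, one inner list per pixel row; the function names are hypothetical.

```python
def unfold_pixels(image):
    """Build X from (3.6): one row [r, g, b] per pixel, scanning the image row by row."""
    return [list(pixel) for row in image for pixel in row]

def intensity_image(image):
    """Build N from (3.8): replace each pixel by its intensity r + g + b, as in (3.7)."""
    return [[r + g + b for (r, g, b) in row] for row in image]

# A 2 x 2 example image with made-up channel values.
example = [
    [(1, 2, 3), (4, 5, 6)],
    [(7, 8, 9), (10, 11, 12)],
]
X = unfold_pixels(example)    # 4 rows, 3 columns
N = intensity_image(example)  # 2 rows, 2 columns
```

X is the natural input for PCA and K-means clustering, since each row is one observation with three variables, while N keeps the spatial layout needed for wavelet denoising and pixel searches.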

3.3 Methods

The goal of the analysis of fluorescent microscopy images for the purpose of this work is to determine which areas of an image represent cells where fluorescence can be seen and then to quantify the average amount of fluorescence over these cells. An important aspect of this analysis is to distinguish between cells where fluorescence can be seen and the background, which can consist of regions without cells or regions where cells are not producing fluorescent proteins. This section presents two algorithms to determine the fluorescent cell regions in images obtained from fluorescent microscopy. In a next step, calculation of the fluorescence intensity of an image on the basis of the calculated fluorescent cell regions is discussed, and finally a comparison of the results returned by the two image analysis algorithms is presented.

3.3.1 Image analysis based on wavelets and a bidirectional search

The first method is based upon sharpening the contrast in the image using wavelets and performing a bidirectional search to determine bright regions of an image. Once the cell region has been determined, the average fluorescence intensity is computed from the original image. It is important to note here that the transformed image is only used for determining the area representing fluorescent cells and not for determining the average fluorescence, as the wavelet transform will affect the value of the observed fluorescence intensity.

The use of wavelets can increase the contrast of images by denoising. The effect of this is illustrated in Figure 3.2. While it is difficult to identify the fluorescent cell region from the original image shown in Figure 3.2(a), the image processed by wavelets clearly shows the cell region [Figure 3.2(b)]. Figure 3.2(c) compares the signals before and after wavelet denoising along the horizontal red line shown in Figure 3.2(b). It can be seen that the application of wavelets significantly reduces the noise and also sharpens the contrast of the image. A type of wavelet called coiflets is used in this work due to the type of contrast seen in the images to be analyzed.

Once the image contrast has been increased to the point where regions of fluorescent cells are clearly visible, a search algorithm that identifies these regions in the images can be applied. The key idea behind a bidirectional search is that if there are two or more points next to one another in the horizontal or the vertical direction whose intensities are higher than a threshold, then these points are classified as belonging to a fluorescent cell region. The threshold value is calculated on the basis of the largest value and the mean value of the intensities of the entire image:

$$
THR = \frac{Ma - Me}{k} + Me, \qquad Ma = \max(N), \qquad Me = \operatorname{mean}(N) \tag{3.9}
$$

where THR is the threshold, Ma is the maximum intensity found in the image, Me is the mean intensity over the entire image, and k is a constant which can be adjusted to take image contrast into account.

Figure 3.3 illustrates the procedure of the bidirectional search. In this case, a horizontal search is performed first, followed by a search in the vertical direction. The algorithm moves from one pixel of the image to the next and determines if the pixel intensity is above the threshold. If two adjacent pixels are found with intensities above the threshold, then these pixels are classified as belonging to a fluorescent cell region. This is illustrated in Figure 3.3(a), where the algorithm moves from one pixel of the image to the next in the horizontal direction. Once the pixel labeled "A" has been found to have a brightness above the threshold, the algorithm looks at the next horizontal pixel and determines that the pixel labeled "B" also shows a fluorescence intensity above the threshold. Since these two pixels are located next to one another, the region consisting of pixels "A" and "B" is considered a region representing a fluorescent cell. The other three pixels in the picture that have fluorescence intensities above the threshold are not found by the search in the horizontal direction. The reason for this is that these pixels do not have an adjacent pixel (in the horizontal direction) that is also above the threshold. This search algorithm tries to distinguish between cells, which have to consist of many pixels at the chosen level of magnification, and other, smaller bright spots which may represent artifacts of the measurement technique.

Figure 3.2 Improving contrast of images via wavelets: (a) original image; (b) processed image; and (c) intensity in one line of the image before and after processing. [Panel (c) axes: pixel position (0 to 1200) versus intensity (108 to 122); curves: signal before wavelet denoising and signal after wavelet denoising.]

In a following step, the algorithm scans the pixels of the image in the vertical direction. A search in this direction identifies the region consisting of pixels "C" and "D" in Figure 3.3(b). While searching in this direction alone would have missed the cell region consisting of "A" and "B," the bidirectional search ensures that both the regions labeled "A-B" and "C-D" are detected. The pixel labeled "E" is not detected by this algorithm; however, this pixel does not represent a major region as it only consists of one pixel with a fluorescence intensity above the threshold. In practice, the images that have been analyzed have cell regions consisting of dozens to hundreds of adjacent pixels. Therefore, the assumption that two or more adjacent pixels are required for identifying part of an image as a cell is very reasonable. However, it is always recommended to combine the automated search routine of this algorithm with a visual inspection of the images derived from fluorescent microscopy [Figure 3.4(a)] as well as from light microscopy. One illustrative result from the presented algorithm is shown in Figure 3.4(b), where the cell regions from Figure 3.4(a) that have above-average fluorescence intensity have been determined.
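The threshold from (3.9) and the two search passes described above can be sketched as follows. This is an illustrative reading of the text, not a transcription of the flowchart in Figure 3.5: the function names are hypothetical, and the intensity matrix N is assumed to be a nested list of numbers.

```python
def threshold(N, k):
    """THR from (3.9): (Ma - Me) / k + Me, with Ma = max(N) and Me = mean(N)."""
    values = [v for row in N for v in row]
    ma = max(values)
    me = sum(values) / len(values)
    return (ma - me) / k + me

def bidirectional_search(N, thr):
    """Mark pixels that have a horizontal or vertical neighbor also above thr."""
    rows, cols = len(N), len(N[0])
    bright = [[v > thr for v in row] for row in N]
    region = [[False] * cols for _ in range(rows)]
    # Horizontal pass: two adjacent bright pixels in the same row.
    for r in range(rows):
        for c in range(cols - 1):
            if bright[r][c] and bright[r][c + 1]:
                region[r][c] = region[r][c + 1] = True
    # Vertical pass: two adjacent bright pixels in the same column.
    for c in range(cols):
        for r in range(rows - 1):
            if bright[r][c] and bright[r + 1][c]:
                region[r][c] = region[r + 1][c] = True
    return region
```

In a small grid arranged like Figure 3.3, a horizontal pair ("A", "B") and a vertical pair ("C", "D") are both marked, while an isolated bright pixel ("E") is left out, which mirrors the behavior described in the text.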

The individual steps of the algorithm are shown in Figure 3.5. In summary, the algorithm for the image analysis procedure based on wavelets and the bidirectional search can be described as follows:

1. The three-dimensional data matrix M, (3.5), of the fluorescent microscopy image is transformed into the two-dimensional matrix N, (3.8), as wavelet denoising algorithms are only available for one- or two-dimensional matrices.

2. The wavelet "coif3" is used in the 4-level 2D wavelet decomposition with the MATLAB command "wavedec2". The command "wbmpen" is then used to calculate the threshold for 2D denoising. Finally, a denoised image N_denoise is obtained by using the command "wdencmp".

3. The threshold for fluorescent cell regions is calculated from (3.9).

4. The bidirectional search algorithm in Figure 3.5 is implemented to obtain the fluorescent cell regions S from the denoised image N_denoise.

5. On the basis of the fluorescent cell regions S and the original intensity matrix N, the fluorescence intensity for the GFP image is calculated. This step will be discussed in Section 3.3.3.

6. Steps 1 through 5 are implemented for each image of a time-series of images. The intensities for the images at different points in time form the fluorescence intensity profile for the time-series of images.

Figure 3.3 Illustration of the bidirectional search: (a) horizontal search; and (b) vertical search.

3.3.2 Image analysis based on K-means clustering and PCA

Another option for image analysis is to use a procedure based upon K-means clustering and PCA to group pixels of an image with similar brightness. PCA can indicate the variation of a cluster by calculating the distance from a pixel to the first principal component, as illustrated in Figure 3.6.

The image analysis procedure based upon K-means clustering and PCA is described in the following. In a first step, PCA is used to divide the pixels of the image into two clusters. The centroids of these two clusters are used as the initial centroid values for K-means clustering, which then assigns each of the pixels of the image to one of the two clusters. PCA is used to determine the cluster with the higher variability, which is divided in a next step. PCA is used again to compute the initial centroids of the three clusters for K-means clustering, which assigns the pixels of the image to one of these three clusters. The procedure is repeated until a sufficient number of clusters is obtained. For the images investigated in this work, it was found that six clusters are sufficient to make a distinction between fluorescent cells and image background. The first few clusters with higher fluorescence intensity are considered to represent fluorescent cells, while the remaining ones represent the background.

In summary, image analysis based upon K-means clustering and PCA consists of the following steps:

1. The RGB image is brought into the form of X, shown in (3.6).

2. The algorithm based on PCA and K-means clustering is implemented to determine the fluorescent cell regions S. The details of this algorithm are shown in Figure 3.7.

3. The fluorescence intensity is calculated based upon the fluorescent cell regions S and the original intensity matrix N. The exact procedure for this calculation is discussed in Section 3.3.3.

4. Steps 1 through 3 are implemented for each image. The fluorescence intensity profile for the time-series of images is computed by combining the fluorescence intensities for the individual images into a vector.

Figure 3.4 Fluorescence regions determined by bidirectional search: (a) processed fluorescent microscopy image; and (b) regions with fluorescence intensities above threshold.

Figure 3.5 Bidirectional search algorithm.

An example of the results from this procedure is shown in Figure 3.8. Six clusters representing different fluorescence intensity levels are calculated, and the first four clusters with higher fluorescence intensity are considered as the fluorescent cell regions.

3.3.3 Determining fluorescence intensity of an image

The bidirectional search or the search algorithm based on PCA and K-means clustering only determines the regions of an image corresponding to fluorescent cells and not the fluorescence intensity. The fluorescence intensity is computed from the original images by the following formula:

$$
I = \left( \frac{\sum_{k=1}^{N_f} I_{f,k}}{N_f} - \frac{\sum_{k=1}^{N_b} I_{b,k}}{N_b} \right)_{\text{stimulation}} - \left( \frac{\sum_{k=1}^{N_f} I_{f,k}}{N_f} - \frac{\sum_{k=1}^{N_b} I_{b,k}}{N_b} \right)_{\text{control}} \tag{3.10}
$$

where $I_{f,k}$ refers to the fluorescence intensity of the kth pixel in the fluorescent cell region, $I_{b,k}$ refers to the fluorescence intensity of the kth pixel in the background region, $N_f$ is the total number of pixels in the fluorescent cell regions, $N_b$ is the total number of pixels in the background regions, ( )stimulation refers to the intensity for the image with the stimulation, and ( )control refers to the intensity of images of negative control experiments (i.e., experiments where no stimulation is applied). The reason for subtracting the background intensity is to reduce measurement noise due to brightness variation. The reason for subtracting the intensity of (negative) control experiments is to reduce other effects that can cause fluorescence. If no significant changes are seen in the control experiments, then it can be concluded that the changes in the fluorescence intensity are due to stimulation, and it is not required to subtract the control term, as its main effect on the measurements will be in the form of a small noise term.

Figure 3.6 Principal component analysis applied to images to determine pixels with similar features. [Axes: red, green, and blue pixel values; the direction of the first principal component is labeled PC 1.]

Figure 3.7 Image analysis based on K-means clustering and PCA.
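Equation (3.10) can be sketched as follows. The representation is an assumption: the intensity matrix N is a nested list of numbers, and the region S is a matching nested list of booleans (True for fluorescent cell pixels, False for background). The function names are hypothetical, and both regions are assumed to be non-empty.

```python
def region_minus_background(N, region):
    """Mean intensity of the cell pixels minus mean intensity of the background pixels."""
    cell = [v for row, mask in zip(N, region) for v, m in zip(row, mask) if m]
    background = [v for row, mask in zip(N, region) for v, m in zip(row, mask) if not m]
    return sum(cell) / len(cell) - sum(background) / len(background)

def fluorescence_intensity(N_stim, region_stim, N_ctrl=None, region_ctrl=None):
    """I from (3.10); the control term is skipped when no control image is given."""
    value = region_minus_background(N_stim, region_stim)
    if N_ctrl is not None:
        value -= region_minus_background(N_ctrl, region_ctrl)
    return value
```

Making the control image optional reflects the remark above that the control term may be dropped when the control experiments show no significant change.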

3.3.4 Comparison of the two image analysis procedures

Since the two image analysis techniques are based upon different concepts, it is warranted to compare the properties of the two algorithms. A summary of this comparison is shown in Table 3.1. The method based on wavelets and the bidirectional search has processed the images faster in our investigations than the method based on K-means clustering and PCA. The drawback of the method based on wavelets and the bidirectional search is that it cannot provide information about different intensity levels, unless it is modified to look at different threshold values. However, this step can be complicated, as it is nontrivial to determine a good value for even one threshold when the images have low contrast. In comparison, the method based upon K-means clustering and PCA can provide information about different intensity levels. This information can be helpful, as it can be used to remove image artifacts such as artificially bright spots.

From the comparison of these two methods, it can be concluded that the method based on K-means clustering and PCA is generally a better choice, as it can process lower-quality images and provides information about different intensity levels. However, this can come at the cost of an increased computational burden. The bidirectional search technique can be a viable alternative if image quality is good and fast processing times are important. Table 3.2 highlights when each of the two methods may be the better choice to implement.


Figure 3.8 Fluorescent cell regions and clusters calculated by K-means clustering and PCA: (a) original image; (b) cluster 1; (c) cluster 2; (d) cluster 3; (e) cluster 4; (f) cluster 5; (g) cluster 6; and (h) combination of clusters 1, 2, 3, and 4.


3.4 Data Acquisition, Anticipated Results, and Interpretation

A fluorescence intensity profile can be computed by the techniques presented in Section 3.3. The fluorescence intensity can be assumed to be directly proportional to the concentration of green fluorescent protein. However, the purpose of using GFP reporter systems is to measure the transcription factor concentration and not the GFP concentration. Therefore, the transcription factor concentration needs to be computed from the fluorescence intensity profile. The first step of computing the transcription factor concentration from the fluorescence intensity profile consists of developing a dynamic model between these two quantities. An inverse problem can then be solved in a second step to actually compute the transcription factor concentration over time.

3.4.1 Developing a model describing the relationship between the transcription factor concentration and the observed fluorescence intensity

The dynamic model used in this work is based upon the model published by Subramanian and Srienc [10]; however, several modifications are made. Specifically:

• The amount of DNA remains constant in our work as the cells do not proliferate. This results in (3.11), where p represents the concentration of the DNA.

• No growth dilution terms need to be included in the model for the GFP m-RNA, m, balance (3.12), the nonfluorescent protein, n, balance (3.13), or the fluorescent protein, f, balance (3.14).

• The transcription rate needs to be modified so that it depends on the amount of activated transcription factor present in the nucleus. This change results in the Monod kinetics shown in (3.12), $S_m \frac{C_{NF\text{-}\kappa B}}{C + C_{NF\text{-}\kappa B}}\, p$, where $C_{NF\text{-}\kappa B}$ is the concentration of NF-κB, replacing the original term, which was solely based upon the amount of m-RNA present. While it was sufficient for the original model to neglect the transcription factor concentration, this is not the case for the model developed here, as the transcription factor concentration is a crucial element of signal transduction and is regulated inside the cell.

Table 3.1 Comparison of Two Image Analysis Techniques

| Method | Advantages | Drawbacks |
|---|---|---|
| Image analysis based on wavelets and bidirectional search | Computationally inexpensive: for example, a movie with 42 images was processed in ~5 minutes on a desktop computer (Pentium 4 CPU 2.8 GHz, 1 GB memory) | Cannot provide information about different intensity levels. Search is performed in only two directions. Threshold is sensitive to the quality of the images. |
| Image analysis based on K-means clustering and PCA | Can provide information about different intensity levels. Can be used to remove artificial bright spots in the image. Suitable for any shape of cells, also suitable for poor quality images. | Can be computationally expensive: on average required about an order of magnitude more computation time than bidirectional search |

Table 3.2 When to Use Which Image Analysis Method

| Method | Method Should Be Used |
|---|---|
| Image analysis based on wavelets and bidirectional search | If images with good contrast and large bright regions in the images are available. If a threshold value can be easily obtained. For quick evaluation. |
| Image analysis based on K-means clustering and PCA | For a variety of images, even those with low contrast. Information about intensity levels can improve image analysis results. |

The resulting model is given by (3.11) through (3.14).

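$$
\frac{dp}{dt} = 0 \tag{3.11}
$$

$$
\frac{dm}{dt} = S_m \frac{C_{NF\text{-}\kappa B}}{C + C_{NF\text{-}\kappa B}}\, p - D_m m \tag{3.12}
$$

$$
\frac{dn}{dt} = S_n m - D_n n - S_f n \tag{3.13}
$$

$$
\frac{df}{dt} = S_f n - D_n f \tag{3.14}
$$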

where $S_m$ is a reaction constant describing the transcription rate with a value of 373 1/hr; $D_m$ is a constant describing the mRNA degradation rate and is equal to 0.45 1/hr; $S_n$ is a reaction constant for the translation rate with a value of 780 1/hr; $D_n$ is a constant associated with the protein degradation rate and is equal to 0.5 1/hr; $S_f$ is associated with the fluorophore formation rate and has a value of 0.347 1/hr. These values are identical to the ones reported by Subramanian and Srienc [10], with the exception of the values for $D_m$ and $D_n$, which were slightly modified to account for model adjustments as well as to take experimental observations into account. The rationale behind the procedure used for estimation of C, which has a value of 108 nM, will be discussed in Section 3.5. The initial conditions for this system are p(0) = 5, m(0) = 0, n(0) = 0, and f(0) = 0.
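To show how (3.11) through (3.14) behave for a given NF-κB profile, the following sketch integrates the model with a classical fourth-order Runge-Kutta scheme. The parameter values and initial conditions are the ones quoted above; the integrator, the step size, and all function names are assumptions made for illustration, and the NF-κB concentration is passed in as a function of time in nM.

```python
S_M, D_M = 373.0, 0.45   # transcription and mRNA degradation constants, 1/hr
S_N, D_N = 780.0, 0.5    # translation and protein degradation constants, 1/hr
S_F = 0.347              # fluorophore formation constant, 1/hr
C_HALF = 108.0           # the constant C in (3.12), nM
P0 = 5.0                 # p(0)

def derivatives(state, c_nfkb):
    """Right-hand sides of (3.11) through (3.14)."""
    p, m, n, f = state
    u = c_nfkb / (C_HALF + c_nfkb)
    return (0.0,
            S_M * u * p - D_M * m,
            S_N * m - D_N * n - S_F * n,
            S_F * n - D_N * f)

def shifted(state, slope, h):
    return tuple(s + h * k for s, k in zip(state, slope))

def rk4_step(state, t, dt, c_nfkb_of_t):
    k1 = derivatives(state, c_nfkb_of_t(t))
    k2 = derivatives(shifted(state, k1, 0.5 * dt), c_nfkb_of_t(t + 0.5 * dt))
    k3 = derivatives(shifted(state, k2, 0.5 * dt), c_nfkb_of_t(t + 0.5 * dt))
    k4 = derivatives(shifted(state, k3, dt), c_nfkb_of_t(t + dt))
    return tuple(s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def simulate(c_nfkb_of_t, t_end, dt=0.01):
    """Integrate from t = 0 to t_end (hours) and return the final (p, m, n, f)."""
    state = (P0, 0.0, 0.0, 0.0)
    for i in range(int(round(t_end / dt))):
        state = rk4_step(state, i * dt, dt, c_nfkb_of_t)
    return state
```

For a constant NF-κB concentration the model settles at m = S_m u p / D_m, n = S_n m / (D_n + S_f), and f = S_f n / D_n, which follows from setting the derivatives to zero and gives a quick check of any implementation.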

Equations (3.11) through (3.14) describe the relationship between the concentration of the transcription factor and activated GFP, f. The experimental measurements consist of the fluorescence intensity, $\tilde{I}$, from the images, which is directly proportional to the concentration of activated green fluorescent protein:

$$
f = \Delta \tilde{I} \tag{3.15}
$$

where Δ is the ratio between activated GFP and computed fluorescence intensity.

As $\tilde{I}$ can be obtained from the fluorescent microscopy images that have been processed by one of the procedures described in Section 3.3, the dynamics of NF-κB can be computed by solving an inverse problem involving (3.12) through (3.15).

3.4.2 Solution of an inverse problem for determining transcription factor concentrations

Solving an inverse problem is nontrivial, as high frequency noise and measurement errors will be accentuated when the measurements are differentiated [35]. Several common methods for solving inverse problems are: (1) approximating the time-derivative of the measured output by numerical differentiation; (2) approximating the time-derivative of the measured output by a filter, which can remove the noise (for an example see [36]); (3) integrating the differential equation of the measured output to avoid the differentiation operation and approximating the integral with Runge-Kutta methods. However, these techniques can introduce additional noise due to the numerical calculations in addition to the experimental noise, or may introduce a time delay in the data due to aggressive low-pass filtering. One approach to solving inverse problems that is less sensitive to measurement noise is to use a regularization procedure. The technique presented in the following is one type of such a procedure, as the inverse problem is solved by determining an analytical solution and estimating the parameters of this analytical solution [37]:

1. The system of equations can be viewed as a linear system with a static nonlinearity in the input. Accordingly, (3.12) can be rewritten:

$$
\frac{dm}{dt} = S_m p\, u - D_m m \tag{3.16}
$$

$$
u = \frac{C_{NF\text{-}\kappa B}}{C + C_{NF\text{-}\kappa B}} \tag{3.17}
$$

with an alternative input u such that the relationship between u and the fluorescence intensity is linear.

2. Even though the shape of $C_{NF\text{-}\kappa B}$ (and accordingly of u) is not known, as it is the purpose of this algorithm to determine $C_{NF\text{-}\kappa B}$ from data, it is usually possible to make certain assumptions about the shape that the concentration profile might have. For example, data provided in [7] suggest that the concentration profile of $C_{NF\text{-}\kappa B}$ will oscillate for a continuous stimulation with TNF-α. Therefore, it is appropriate to choose the Laplace transform of u as a second order transfer function multiplied by a step input:

$$
u(s) = \frac{\omega_n^2}{s^2 + 2\varepsilon\omega_n s + \omega_n^2} \cdot \frac{T_\alpha}{s} \tag{3.18}
$$

where ε, $\omega_n$, and $T_\alpha$ are the parameters describing the input u.

3. A Laplace transformation is applied to (3.16), (3.13), and (3.14), resulting in

$$
m(s) = \frac{S_m p\, u(s)}{s + D_m} \tag{3.19}
$$

$$
n(s) = \frac{S_n m(s)}{s + D_n + S_f} \tag{3.20}
$$

$$
f(s) = \frac{S_f n(s)}{s + D_n} \tag{3.21}
$$


n(s) and m(s) can be eliminated from (3.19) to (3.21) such that a transfer function between u(s) and f(s) is derived:

$$
f(s) = \frac{S_f}{s + D_n} \cdot \frac{S_n}{s + D_n + S_f} \cdot \frac{S_m p}{s + D_m}\, u(s) \tag{3.22}
$$

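Substituting (3.18) into (3.22) results in

$$
f(s) = \frac{S_f}{s + D_n} \cdot \frac{S_n}{s + D_n + S_f} \cdot \frac{S_m p}{s + D_m} \cdot \frac{\omega_n^2}{s^2 + 2\varepsilon\omega_n s + \omega_n^2} \cdot \frac{T_\alpha}{s} \tag{3.23}
$$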

4. f(t) can be obtained by performing an inverse Laplace transform of (3.23):

$$
f(t) = A_1 + A_2 e^{-D_n t} + A_3 e^{-(D_n + S_f) t} + A_4 e^{-D_m t} + A_5 e^{-\varepsilon\omega_n t} \sin\!\left(\omega_n \sqrt{1 - \varepsilon^2}\, t + \phi\right) \tag{3.24}
$$

where A1, A2, A3, A4, A5, and ϕ are all constants with the following values

( )

( )

AS S S pT

D D D S

AS S p T

D D D D

f n m a

n m n f

n m n a

n m n n

1

2

2

2 2

=+

= −− −

Δ

Δω

ε( )

( )( ) ( )

ω ω

ω

εω

α

n n n

n m n

n f m n f n f n n

D

AS S p T

D S D D S D S D S

+

=+ − − + − +

2

3

2

22Δ ( )

( )( )

f n

f n m n

m n m m n f m n

AS S S p T

D D D D D S D

+⎛⎝⎜

⎞⎠⎟

=− − − −

ω

ω

εω

α

2

4

2

2 2Δ ( )

( )

D

A

AA A

AC ad bd

b

m n

n

n

+

=

+ −−

⎛

⎝⎜⎜

⎞

⎠⎟⎟

=+

ω

εω

ω ε

2

5

72 6 7

2

2

60 1 0

1

Δ

( )

d bd

AC d

bd bd

a

b

d a a b

n

n

12

02

70 1

12

02

2

1 33

1

4

+

= −+

= −

= −

= − +

εω

ω ε

( )( )

+ + + +

= + − + + + +

3 2 4

3 6

32

23

1

04

33

3 22 2

22

a a a a a a b

d b a a a a a a b a a a

( )4

1

12

22

3

2

2

+

= +

= + + +

= + +

a a

a D D S D

a D D S D D D S

a D D

n n f m

n n f n m m f

n m S

C S S S p T

AA A

f

f n m n

n

n

02

72

6 7

1

=

= −−

ω

ϕω ε

εω

α

arctan


The values of the parameters ε, $\omega_n$, and $T_\alpha$ are estimated by fitting f(t) to the experimental data.

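5. u(t) is given by an inverse Laplace transformation of (3.18), which can be used to compute the profile of NF-κB from (3.17):

$$
C_{NF\text{-}\kappa B}(t) = \frac{C T_\alpha \sqrt{1 - \varepsilon^2} - C T_\alpha e^{-\varepsilon\omega_n t} \sin\!\left(\omega_n \sqrt{1 - \varepsilon^2}\, t + \varphi\right)}{(1 - T_\alpha)\sqrt{1 - \varepsilon^2} + T_\alpha e^{-\varepsilon\omega_n t} \sin\!\left(\omega_n \sqrt{1 - \varepsilon^2}\, t + \varphi\right)}, \quad \text{where } \varphi = \arctan\frac{\sqrt{1 - \varepsilon^2}}{\varepsilon} \tag{3.25}
$$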

The parameters ε, $\omega_n$, and $T_\alpha$ have the values determined in step 4.

While this procedure was derived for a transcription factor profile exhibiting damped oscillations, it is possible to derive other expressions for the fluorescence intensity and the transcription factor concentration profiles using the same procedure that was outlined in this section.
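The last step of the procedure can be sketched as follows. The code evaluates the step response u(t) that corresponds to (3.18) for the underdamped case 0 < ε < 1 and then the NF-κB profile from (3.25). The function names are hypothetical, and the sample values for C, ε, ω_n, and T_α in the usage lines are the estimates reported in this chapter for the data from [7].

```python
import math

def step_response(t, eps, omega_n, t_alpha):
    """u(t): inverse Laplace transform of (3.18), valid for 0 < eps < 1."""
    root = math.sqrt(1.0 - eps ** 2)
    phi = math.atan(root / eps)
    decay = math.exp(-eps * omega_n * t)
    return t_alpha * (1.0 - decay * math.sin(omega_n * root * t + phi) / root)

def nfkb_profile(t, c_half, eps, omega_n, t_alpha):
    """C_NF-kB(t) from (3.25), i.e. (3.17) solved for the concentration."""
    root = math.sqrt(1.0 - eps ** 2)
    phi = math.atan(root / eps)
    osc = math.exp(-eps * omega_n * t) * math.sin(omega_n * root * t + phi)
    numerator = c_half * t_alpha * (root - osc)
    denominator = (1.0 - t_alpha) * root + t_alpha * osc
    return numerator / denominator

# Usage with the estimates quoted in this chapter (C in nM).
params = dict(c_half=108.0, eps=0.17, omega_n=4.49, t_alpha=0.27)
profile = [nfkb_profile(0.25 * i, **params) for i in range(61)]
```

Two properties make useful checks: the profile starts at zero, and for large t it approaches C T_α / (1 − T_α). Inserting the computed concentration back into (3.17) should also reproduce u(t).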


To illustrate the implementation of the above two GFP image analysis methods as well as the procedure for computing the transcription factor profiles, time-series of images of hepatocytes constantly stimulated with three different concentrations of TNF-α (i.e., 6 ng/ml, 13 ng/ml, and 19 ng/ml) have been analyzed. Additionally, negative control experiments without TNF-α stimulation were carried out. The experiments were conducted for 15 hours and measurements were taken every 60 minutes. For each concentration of TNF-α, three images were recorded showing different areas of the experiment. Both image analysis techniques were applied to all the images. The mean and one-standard-deviation error bars of each of these time-series are shown in Figure 3.9. It can be concluded that both analysis algorithms are able to correctly capture the trends. The results returned by the method based upon PCA and K-means clustering seem to have slightly smaller error bars; however, the difference is not significant. Furthermore, these results have to be put in the right context by comparing them to the amount of information that could have been captured by other, semi-quantitative measurement techniques. Suffice it to say that both image analysis techniques return comparable results for the investigated images.

The results from the method based on PCA and K-means clustering are used to determine the dynamic profile of NF-κB. The analysis that is described here has also been applied to data generated by the technique based upon wavelets and bidirectional search. However, since the results were found to be very similar, only one set of data is used here due to space constraints. The image analysis procedure returned the profile of the intensity I seen in the fluorescent microscopy images. Before I is used to derive the profile of NF-κB, the parameters C in the GFP model and Δ from (3.15), which links the concentration of activated GFP and the fluorescence intensity seen in an image, are estimated by the following procedure:



1. The C_NF-κB data for wild-type cells stimulated by TNF-α = 10 ng/ml from the paper by Hoffmann et al. [7] are used to identify C, ε, ωn, and Tα in (3.25) with the nonlinear least squares optimization command in MATLAB, lsqnonlin. C, ε, ωn, and Tα are found to be 108 nM, 0.17, 4.49, and 0.27, respectively. Figure 3.10 shows that the output of (3.25) with the estimated parameters C, ε, ωn, and Tα fits the C_NF-κB data from [7] well.

2. The model described by (3.11) through (3.14) is used with the estimated value of C to compute the profile of the GFP. The input of this model is the concentration of NF-κB. The C_NF-κB data for wild-type cells stimulated by TNF-α = 10 ng/ml from the paper by Hoffmann et al. [7] are used as an input to calculate the profile of f. As the C_NF-κB concentrations are given at discrete points, the values between two time points are estimated by linear interpolation.

3. The fluorescence intensity for TNF-α = 10 ng/ml is computed by the described procedure from the experimental results shown in Figure 3.11 (red line). Δ is estimated by the ratio of the steady state value of f computed from the model in step 2 to the steady state fluorescence intensity computed from the experimental data by the image analysis procedure.

4. The estimated value for Δ is 2.5562 × 10^4. A comparison of the experimental data analyzed by the presented analysis algorithm and the fluorescence intensity profile computed from (3.11) through (3.14) for an input of TNF-α = 10 ng/ml is shown in Figure 3.11.
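The interpolation in step 2 and the ratio in step 3 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and taking the "steady state" as the average of the last few samples is an assumption made here, since the text does not say how the steady state values were read off.

```python
from bisect import bisect_right


def interp_linear(t, times, values):
    """Linearly interpolate `values` given at increasing `times`.

    Outside the sampled range the nearest end value is returned
    (an assumption; the text only discusses values between time points).
    """
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    k = bisect_right(times, t)          # times[k-1] <= t < times[k]
    t0, t1 = times[k - 1], times[k]
    w = (t - t0) / (t1 - t0)
    return values[k - 1] + w * (values[k] - values[k - 1])


def estimate_delta(f_model, intensity, tail=3):
    """Ratio of steady-state model output f to steady-state image intensity.

    Steady state is approximated by the mean of the last `tail` samples
    (hypothetical choice for this sketch).
    """
    f_ss = sum(f_model[-tail:]) / tail
    i_ss = sum(intensity[-tail:]) / tail
    return f_ss / i_ss
```

With Δ defined this way, the model output f divided by Δ is directly comparable to the measured intensity, which is how the two curves are overlaid in Figure 3.11.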


Figure 3.9 Image analysis results for the fluorescent microscopy images of NF-κB for cells stimulated with TNF-α: (a) TNF-α = 6 ng/ml; (b) TNF-α = 13 ng/ml; and (c) TNF-α = 19 ng/ml.

Figure 3.10 The output from (3.25) with the estimated parameters and the original C_NF-κB from [7].

After C and Δ have been estimated, their values are used to derive the profile of NF-κB from the fluorescence intensity profile I for TNF-α concentrations other than 10 ng/ml. The procedure for solving the inverse problem is given as follows:

1. The parameters ε, ωn, and Tα are estimated to fit f, given by (3.24), to ΔI using a nonlinear least squares optimization method.

2. The corresponding NF-κB profile is given by (3.25) using the values of the estimated parameters ε, ωn, and Tα.
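The fitting step can be illustrated with a small, self-contained Python sketch. Several things here are assumptions rather than the chapter's method: the model function is the textbook unit-step response of an underdamped second-order system, used as a stand-in because (3.24) is not reproduced here; it has no Tα term; and a brute-force grid search replaces a gradient-based nonlinear least squares solver such as the lsqnonlin command mentioned earlier. All names and numbers are hypothetical.

```python
import math


def step_response(t, eps, wn, gain=1.0):
    """Textbook step response of an underdamped second-order system (0 < eps < 1).

    Stand-in for (3.24); NOT the chapter's expression.
    """
    root = math.sqrt(1.0 - eps ** 2)
    phi = math.atan2(root, eps)
    return gain * (1.0 - math.exp(-eps * wn * t) / root * math.sin(wn * root * t + phi))


def sum_squared_error(eps, wn, times, data):
    """Least-squares objective: squared residuals between model and data."""
    return sum((step_response(t, eps, wn) - y) ** 2 for t, y in zip(times, data))


def grid_fit(times, data, eps_grid, wn_grid):
    """Return the (eps, wn) pair on the grid with the smallest squared error."""
    candidates = [(eps, wn) for eps in eps_grid for wn in wn_grid]
    return min(candidates, key=lambda p: sum_squared_error(p[0], p[1], times, data))


# Synthetic demonstration: generate data from known parameters, then recover them.
demo_times = [0.25 * k for k in range(41)]
demo_data = [step_response(t, 0.2, 4.5) for t in demo_times]
demo_best = grid_fit(demo_times, demo_data,
                     [0.05 * k for k in range(2, 9)],
                     [4.0 + 0.1 * k for k in range(11)])
```

In practice the measured profile would first be scaled by Δ, and a proper optimizer would refine the parameters continuously instead of picking the best grid point.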

Table 3.3 shows the parameters ε, ωn, and Tα estimated for TNF-α concentrations of 6, 13, and 19 ng/ml. The corresponding curves of f/Δ, as predicted by (3.24), are shown in Figure 3.12 together with the experimental data obtained from the image analysis techniques. The three corresponding C_NF-κB profiles are shown in Figure 3.13. After constant stimulation by TNF-α, C_NF-κB increases and reaches its maximum value after approximately 40 minutes. The NF-κB concentrations reach their steady state values after approximately 6 hours. Comparison of the results for these three TNF-α concentrations shows that stimulation with increased levels of TNF-α leads to higher peak values and a larger steady state value of the concentration of NF-κB. These results are reasonable, as larger TNF-α concentrations are able to activate more IKKn (neutral form of IKK kinase), which releases more NF-κB from the complex (IκBα|NF-κB) and then induces more NF-κB in the nucleus [38]. It can be concluded from Figure 3.13 that the image analysis techniques and the solution of the inverse problem presented in this work can obtain quantitative data for the transcription factor NF-κB.


Figure 3.11 The experimental data and the output f/Δ from the identified GFP model for Hoffmann’s NF-κB data.

Table 3.3 Estimated Parameters for Fitting f/Δ to I

TNF-α Concentration    ε       ωn      Tα
6 ng/ml                0.20    4.52    0.26
13 ng/ml               0.20    4.52    0.31
19 ng/ml               0.28    4.61    0.35

3.6 Summary and Conclusions

This work presented techniques for determining dynamic concentration profiles of transcription factors from a series of fluorescent microscopy images. The first image analysis method presented in this work uses wavelets to sharpen the contrast of the images. This sharpening step is followed by a two-directional search which determines if a pixel corresponds to a fluorescent cell. The second image analysis technique uses K-means clustering and principal component analysis (PCA) to cluster the pixels according to their fluorescence intensity levels. It has been found that the first algorithm is simpler to implement and requires less computation time, while the latter technique tends to give better results for low-contrast images. Additionally, the technique involving PCA and K-means clustering is able to determine several regions with the same intensity level whereas the bidirectional search technique only detects regions where the fluorescence intensity is above a certain threshold. The results for the fluorescence intensity profiles obtained from these techniques were similar for all the test cases investigated in this work. A second contribution of this chapter is the introduction of a technique that determines the transcription factor concentration from the fluorescence intensity profile. The procedures are illustrated by determining the NF-κB dynamics in hepatocytes for different stimulation concentrations of TNF-α.

Figure 3.12 The experimental data I and the fitted curve f/Δ for TNF-α at (a) 6 ng/ml, (b) 13 ng/ml, and (c) 19 ng/ml.

Figure 3.13 NF-κB profiles computed via solution of the inverse problem based upon one of the presented image analysis techniques for TNF-α at 6 ng/ml, 10 ng/ml, 13 ng/ml, and 19 ng/ml.

Acknowledgments

The authors gratefully acknowledge partial financial support from the National Science Foundation (Grant CBET# 0706792) and the ACS Petroleum Research Fund (Grant PRF# 48144-AC9). The authors are grateful for the fluorescence microscopy images provided by Dr. Arul Jayaraman and Mr. Fatih Senocak.

References

[1] Corvinus, F.M., C. Orth, R. Morigg, S.A. Tsareva, S. Wagner, E.B. Pfitzner, D. Baus, R. Kaufmann, L.A. Huber, K. Zatloukal, H. Beug, P. Ohlschlager, A. Schutz, K.-J. Halbhuber, and K. Friedrich, “Persistent STAT3 activation in colon cancer is associated with enhanced cell proliferation and tumor growth,” Neoplasia, Vol. 7, No. 6, 2005, pp. 545–555.

[2] Judd, L.M., B.M. Alderman, M. Howlett, A. Shulkes, C. Dow, J. Moverley, D. Grail, B.J. Jenkins, M. Ernst, and A.S. Giraud, “Gastric cancer development in mice lacking the SHP2 binding site on the IL-6 family co-receptor gp130,” Gastroenterology, Vol. 126, No. 1, 2004, pp. 196–207.

[3] Heinrich, P.C., I. Behrmann, S. Haan, and H.M. Hermanns, “Principles of interleukin (IL)-6-type cytokine signaling and its regulation,” Biochem. J., Vol. 374, 2003, pp. 1–20.

[4] Singh, A.K., A. Jayaraman, and J. Hahn, “Modeling regulatory mechanisms in IL-6 signal transduction in hepatocytes,” Biotechnol. Bioeng., Vol. 95, No. 5, 2006, pp. 850–862.

[5] Huang, Z., Y. Chu, F. Senocak, A. Jayaraman, and J. Hahn, “Model update of signal transduction pathways in hepatocytes based upon sensitivity analysis,” Proceedings Foundations of Systems Biology 2007, September 2007, Stuttgart, Germany.

[6] Birtwistle, M.R., M. Hatakeyama, N. Yumoto, B.A. Ogunnaike, J.B. Hoek, and B.N. Kholodenko, “Ligand-dependent responses of the ErbB signaling network: experimental and modeling analyses,” Molecular Systems Biology, Vol. 3, No. 144, 2007, pp. 1–16.

[7] Hoffmann, A., A. Levchenko, M.L. Scott, and D. Baltimore, “The IκB–NF-κB signaling module: temporal control and selective gene activation,” Science, Vol. 298, No. 8, 2002, pp. 1241–1245.

[8] Kurien, B.T., and R.H. Scofield, “Western blotting,” Methods, Vol. 38, No. 4, 2006, pp. 283–293.

[9] Pan, Q., A.L. Saltzman, Y. Ki Kim, C. Misquitta, O. Shai, L.E. Maquat, B.J. Frey, and B.J. Blencowe, “Quantitative microarray profiling provides evidence against widespread coupling of alternative splicing with nonsense-mediated mRNA decay to control gene expression,” Genes Dev., Vol. 20, No. 2, 2006, pp. 153–158.

[10] Subramanian, S., and F. Srienc, “Quantitative analysis of transient gene expression in mammalian cells using the green fluorescent protein,” J. Biotechnol., Vol. 49, 1996, pp. 137–151.

[11] King, K.R., S. Wang, D. Irimia, A. Jayaraman, M. Toner, and M.L. Yarmush, “A high-throughput microfluidic real-time gene expression living cell array,” Lab Chip, Vol. 7, 2007, pp. 77–85.

[12] Choe, J., H.H. Guo, and G.V.D. Engh, “A dual-fluorescence reporter system for high-throughput clone characterization and selection by cell sorting,” Nucleic Acids Research, Vol. 33, No. 5, 2005, e49.

[13] Miksch, G., F. Bettenworth, K. Friehs, E. Flaschel, A. Saalbach, and T.W. Nattkemper, “A rapid reporter system using GFP as a reporter protein for identification and screening of synthetic stationary-phase promoters in Escherichia coli,” Appl. Microbiol. Biotechnol., Vol. 70, 2006, pp. 229–236.

[14] Ducrest, A.L., M. Amacker, J. Lingner, and M. Nabholz, “Detection of promoter activity by flow cytometric analysis of GFP reporter expression,” Nucleic Acids Research, Vol. 30, No. 14, 2002, e65.

[15] Tee, C.S., M. Marziah, C.S. Tan, and M.P. Abdullah, “Evaluation of different promoters driving the GFP reporter gene and selected target tissues for particle bombardment of Dendrobium Sonia 17,” Plant Cell Rep., Vol. 21, 2003, pp. 452–458.

[16] Carroll, J.A., P.E. Stewart, P. Rosa, A.F. Elias, and C.F. Garon, “An enhanced GFP reporter system to monitor gene expression in Borrelia burgdorferi,” Microbiology, Vol. 149, 2003, pp. 1819–1828.

[17] Cheng, L., C. Du, D. Murray, X. Tong, Y. Zhang, B.P. Chen, and R.G. Hawley, “A GFP reporter system to assess gene transfer and expression in human hematopoietic progenitor cells,” Gene Therapy, Vol. 4, 1997, pp. 1013–1022.

[18] Gouskos, T., F. Wightman, S.R. Lewin, and J. Torresi, “Highly reproducible transient transfections for the study of hepatitis B virus replication based on an internal GFP reporter system,” Journal of Virological Methods, Vol. 121, 2004, pp. 65–72.

[19] Hoffman, R.M., “In vivo imaging with fluorescent proteins: the new cell biology,” Acta Histochemica, Vol. 106, 2004, pp. 77–87.



[20] Venkataraman, S., J.L. Morrell-Falvey, M.J. Doktycz, and H. Qi, “Automated image analysis of fluorescence microscopic images to identify protein-protein interactions,” Proc. 27th Annu. Conf. IEEE Engineering in Medicine and Biology, Shanghai, 2005.

[21] Jung, C.K., J.B. Lee, X.H. Wang, and Y.H. Song, “Wavelet based noise cancellation technique for fault location on underground power cables,” Electric Power Systems Research, Vol. 77, 2007, pp. 1349–1362.

[22] Sharma, A., G. Sheoran, Z.A. Jaffery, and Moinuddin, “Improvement of signal-to-noise ratio in digital holography using wavelet transform,” Optics and Lasers in Engineering, Vol. 48, 2008, pp. 42–47.

[23] Donoho, D.L., and I.M. Johnstone, “Ideal de-noising in an orthonormal basis chosen from a library of bases,” C.R.A.S. Paris, t. 319, Ser. I, 1994, pp. 1317–1322.

[24] Donoho, D.L., “De-noising by soft-thresholding,” IEEE Trans. on Inf. Theory, Vol. 41, No. 3, 1995, pp. 613–627.

[25] Kaufman, L., and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley & Sons, 1990.

[26] Lloyd, S.P., “Least squares quantization in PCM,” IEEE Trans. on Inf. Theory, Vol. 28, No. 2, 1982, pp. 129–137.

[27] Sabin, J., and R. Gray, “Global convergence and empirical consistency of the generalized Lloyd algorithm,” IEEE Trans. on Inf. Theory, Vol. 32, No. 2, 1986, pp. 148–155.

[28] Hotelling, H., “Analysis of a complex of statistical variables into principal components,” Journal of Educational Psychology, Vol. 24, 1933, pp. 417–441.

[29] Jackson, J.E., A User’s Guide to Principal Components, New York: John Wiley & Sons, 2003.

[30] Geladi, P., and H. Grahn, Multivariate Image Analysis, New York: John Wiley & Sons, 1996.

[31] Bharati, M.H.M., and J.F. Macgregor, “Multivariate image analysis for real-time process monitoring and control,” Industrial and Engineering Chemistry Research, Vol. 37, 1998, pp. 4715–4724.

[32] Guillemin, F., M.F. Devaux, and F. Guillon, “Evaluation of plant histology by automatic clustering based on individual cell morphological features,” Image Anal. Stereol., Vol. 23, 2004, pp. 13–22.

[33] Ding, C., and X. He, “K-means clustering via principal component analysis,” Proc. of Intl. Conf. Machine Learning (ICML 2004), 2004, pp. 225–232.

[34] Meier, D.S., and C.R.G. Guttmann, “Time-series analysis of MRI intensity patterns in multiple sclerosis,” NeuroImage, Vol. 20, No. 2, 2003, pp. 1193–1209.

[35] Benyon, P.R., “The inversion of dynamic systems,” Mathematics and Computers in Simulation, XXI, 1979, pp. 335–339.

[36] Puebla, H., and J. Alvarez-Ramirez, “Stability of inverse-system approaches in coherent chaotic communication,” IEEE Transactions on Circuits and Systems-I: Fundamental Theory and Application, Vol. 48, No. 12, 1979, pp. 1413–1423.

[37] Huang, Z., F. Senocak, A. Jayaraman, and J. Hahn, “Integrated modeling and experimental approach for determining transcription factor profiles from fluorescent reporter data,” BMC Systems Biology, Vol. 2, No. 64, 2008.

[38] Lipniacki, T., P. Paszek, A.R. Brasier, B. Luxon, and M. Kimmel, “Mathematical model of NF-kB regulatory module,” Journal of Theoretical Biology, Vol. 228, 2004, pp. 195–215.


CHAPTER 4

Data-Driven, Mechanistic Modeling of Biochemical Reaction Networks

Jason M. Haugh1*, Timothy C. Elston2, Murat Cirit1, Chun-Chao Wang1, Nan Hao2, and Necmettin Yildirim3

1Department of Chemical & Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695
2Department of Pharmacology, University of North Carolina, Chapel Hill, NC 27599
3Department of Natural Sciences, New College of Florida, Sarasota, FL 34243
*e-mail: [email protected]

Key terms: Signal transduction, Cell biology, Kinase cascade, Crosstalk, Parameter estimation

Abstract

Mathematical modeling has emerged as a valuable tool for characterizing and predicting the spatiotemporal dynamics of biochemical reaction pathways and networks in living cells; however, the power of such models is currently limited by the availability of quantitative, kinetic data for comparison and validation. In this chapter, we discuss data-driven modeling of intracellular reaction networks, with a focus on signal transduction in eukaryotic cells. Experimental data types and their limitations, approaches for data processing and normalization, types of models and issues related to model simplification, and parameter estimation methods are covered. To illustrate these principles, we offer two recent examples of data-driven modeling, each dealing with signal transduction through mitogen-activated protein kinases (MAPKs): elucidation of crosstalk between phosphoinositide 3-kinase (PI3K)- and Ras-dependent pathways in mammalian cells, and analysis of feedback regulation as a mechanism for signaling specificity in the yeast pheromone response.

4.1 Introduction

At a certain level of abstraction, living cells are picoliter-sized reaction vessels in which thousands of biochemical reactions and intermolecular binding processes take place in a dynamic, coordinated, and highly regulated fashion. This is an energy intensive process. Intracellular enzyme activities are modulated by covalent modifications that are rapidly added and removed in a seemingly futile cycle in order to respond to changes in the cell’s external environment. These reactions are responsible for governing cell function, and their dysregulation and modulation by infectious agents constitute the molecular basis for human disease. From the perspective of chemical kinetics, the inner workings of the cell are fascinating, but we are still a long way from a mechanistic understanding of these reactions and quantitative characterization of their rates. In this chapter, we discuss how mathematical modeling is applied in tandem with biochemical measurements to achieve this goal.

Whether before, during, or after the collection of experimental data, quantitative modeling is a valuable approach for critically assessing and organizing hypotheses that integrate the many processes that might be at play [1]. And, to the extent that a model is trained on a sufficient amount of quantitative data and its mechanistic assumptions are sound, it may be used to predict the outcomes of novel experiments and thus generate new, hypothesis-driven research. Some experiments will inevitably contradict the model predictions, but as with conceptual, “arrow diagram” models, one iteratively refines the model based on new data.

The examples presented here are focused on mechanisms of signal transduction in eukaryotic cells, which are responsible for controlling cell cycle progression, cell motility, responses to stress, programmed cell death, and differentiation of cell function [2, 3]. These reaction pathways transmit information about the cell’s external microenvironment, making them fundamentally distinct from metabolic pathways, which deal in currencies of energy and reducing power. We further narrow our focus on modeling of cell signaling that is both data-driven and rooted in biochemical mechanisms. We distinguish data-driven models from purely theoretical models, where experimental data are either not available or not accessible with current technology, and mechanistic models are distinguished from purely phenomenological and purely statistical/correlative models. To supplement the topics presented here, the reader is referred to a number of reviews on the subject of modeling signal transduction processes [4–8].

Rather than presenting detailed recipes of experimental or modeling techniques, this chapter aims to shed light on the inherent relationship between the two in data-driven modeling. In Section 4.2, we discuss the advantages and shortcomings of different experimental methodologies from the standpoint of modeling and the formulation of models of the appropriate type and level of complexity. Emphasis is placed on the pressing need for model simplification and more systematic approaches for model parameter specification. Then, in Section 4.3, we present two examples of how we have applied those modeling principles to understand specific cell signaling systems.


4.2 Principles of Data-Driven Modeling

4.2.1 Types of experimental data

Depending on one’s point of view, cell biology is currently either in a data-rich or data-deprived state. There is a wealth of genomic and proteomic data that have yielded mostly qualitative information about the connectivity of pathways, yet there is relatively little in the way of measurements characterizing their dynamics. Here, we briefly discuss the various quantitative experimental methodologies that define the current state of the art and weigh their advantages, caveats, and limitations. We choose to classify measurement techniques in three categories: population endpoint, single-cell endpoint, and single-cell kinetic (Table 4.1). An endpoint measurement is one in which the experiment is stopped at a certain time and the sample is prepared for analysis, whereas a kinetic measurement is one in which the biochemical readout is monitored in real time. Important considerations include dynamic range (the range of measured values from the lowest limit of detection to the upper limit of assay linearity), throughput (the number of conditions that can be compared in each independent experiment), the ability to multiplex (measure multiple readouts at once), and the ability to assess subcellular localization.

In population endpoint measurements, a large number of cells (10^3 to 10^8) are subjected to identical experimental conditions, and a lysate of the cell collective is prepared for analysis. Hence, information about individual cells is lost, and information about subcellular localization is at best indirect; depending on the method of lysis, the preparation can be subdivided based on density and/or detergent solubility into fractions representing different subcellular compartments (cytosol, plasma membrane, endosomes and Golgi, nuclei, and so forth). Despite these shortcomings, this approach has several advantages, including potentially high sensitivity and throughput and broad versatility for measuring a variety of molecular readouts; all of these depend critically on the quality of the reagents used. The most common population endpoint measurements, such as immunoblotting, enzyme-linked immunosorbent assays (ELISAs), and in vitro enzymatic assays, involve protein immobilization and the use of antibodies for specific capture or detection. All things being equal, assays that involve an initial separation step (e.g., gel electrophoresis or the use of a capture antibody) tend to be more specific and therefore have a higher dynamic range. There is also a general trade-off between the ability to multiplex and throughput, as exemplified by antibody arrays [9] and especially current mass spectrometry technology [10, 11]. These methods can also be used in conjunction with coimmunoprecipitation to assess protein-protein interactions; however, because of the work-up time involved, this approach is strongly biased to detect only very stable interactions. From the standpoint of experimental data, the inability to measure intracellular protein-protein interactions quantitatively is arguably the most significant limitation for data-driven modeling.

Table 4.1 Capabilities and Limitations of Common Experimental Methodologies

                               Dynamic Range   Throughput   Multiplexing   Spatial Detail
Population Endpoint:
  Immunoblotting*              +++++           +++          +              –
  Dot blot/ELISA               +++             +++++        +              –
  Sandwich ELISA               ++++            ++++         +              –
  In vitro enzymatic assay     ++++            +++          –              –
  Antibody array               +++             ++           +++            –
  Mass spectrometry            +++             +            +++++          –
Single-Cell Endpoint:
  Flow cytometry               ++              ++           ++             –
  Immunofluorescence           ++              +            +              +++
Single-Cell Kinetic:
  Confocal; localization       ++              +            +              +++
  TIRF; localization           +++             +            +              ++
  Spectral shift               +++             ++           –              +
  FRET                         +               +            –              ++

Various measurement techniques are rated according to typical performance in four categories: dynamic range/sensitivity, throughput (the number of conditions that can be compared in each independent experiment), the ability to multiplex (measure multiple readouts at once), and the ability to assess subcellular localization.
*Immunoblotting using enhanced chemiluminescence and a high-sensitivity, cooled charge coupled device camera for imaging; the traditional method using photographic film for imaging gives a sigmoidal response over a much narrower dynamic range (contributing to the false notion that immunoblotting is generally not quantitative).

Single-cell endpoint measurements, which provide information about individual cells, include flow cytometry and immunofluorescence microscopy. Both involve incubation with antibodies, detection of fluorescence, and in the case of intracellular proteins, cell fixation and permeabilization. Flow cytometry offers high throughput in terms of assembling population statistics for each sample and moderate throughput in terms of comparing multiple samples. Immunofluorescence offers information about subcellular localization, but the analysis is tedious and therefore low in throughput.

Single-cell kinetic measurements generally involve microscopic imaging of live cells, in which case information about subcellular localization is obtained. Although this approach suffers from many of the same throughput issues as immunofluorescence, the ability to observe signaling kinetics in real time and in conjunction with cell behavior makes it unique [12, 13]. The basis for the measurement is the introduction of a biosensor, either genetically encoded or microinjected into the cell; genetically encoded biosensors are fusion proteins comprised of a protein or protein domain of interest, to which a fluorescent protein such as enhanced green fluorescent protein is attached. A limited degree of multiplexing is offered through the use of multiple biosensors labeled with different fluorophores. The dynamic range of the measurement is affected by which particular biosensor and microscopy modality [e.g., wide-field fluorescence, confocal fluorescence, or total internal reflection fluorescence (TIRF)] are used, and the basis for the measurement [e.g., a shift in spectral properties of the fluorophore, as in calcium imaging, translocation to a particular membrane or intracellular compartment, or changes in Förster resonance energy transfer (FRET)]. The most significant limitation of this approach is that there are currently only a small number of biosensors that work well for quantitative studies; another caveat of using biosensors is that they might significantly interfere with or otherwise modulate the signaling processes they were meant to detect.

4.2.2 Data processing and normalization

All data require some form(s) of processing prior to any sort of quantitative analysis. Some of these are obvious and routine; for example, the subtraction of assay/image background and the linear rescaling of images for presentation. Typically, quantitative data are also normalized. The purpose of normalization is to adjust for sources of variability, so that the reproducibility of experimentally deduced trends may be compared in a statistically meaningful way. The manner in which this is done varies and is context-dependent (and, in some cases, arbitrary), and hence this topic is worthy of some discussion.

Variability arises because of both the biological system and the assay itself. Biological variability is significant in any measurement involving cells; this is because, no matter how carefully the parameters of the cell culture are controlled, the culture will vary from experiment to experiment. Assay variability arises from heterogeneity within a sample (e.g., from cell to cell in single-cell measurements) and in the preparation of samples, which affects the comparison of conditions within the same experiment, and also from temporal and lot-to-lot changes in the reagents used, which along with biological variability affect the comparison of independent experiments. Sample heterogeneity at the single-cell or population level is generally normalized by dividing the signal by a second measurement that should not be affected by the perturbations being tested. For example, population endpoint measurements are typically normalized by the total amount of cellular protein in the sample or by the amount of an abundant species that should be invariant from sample to sample (e.g., actin or tubulin). This is especially important when comparing samples derived from the same cell line/strain but which have been differentially modified over some period of time; for example, comparing control cells to cells in which over-expression of a wild-type gene or expression of a mutant gene has been introduced. Day-to-day variability of the assay reagents and other assay conditions can be normalized by the measurement of a common standard sample; however, this approach is of little use in the typical case where biological variability is also prominent.

To normalize for biological variability, it is often appropriate to use a negative or positive control sample, acquired in each independent experiment. A pitfall of using a negative control for normalization (e.g., fold-induction) is that it often has the lowest and least reliable signal. For more complex data sets of the sort that is desirable for quantitative modeling, with measurements at multiple time points for a variety of experimental conditions, choosing how to normalize the data by a positive control condition (e.g., maximum stimulation of otherwise unperturbed cells) is subject to some ambiguity. Normalizing by the value at a particular time point is a common practice, but the choice of the time point might be considered arbitrary; normalizing by the maximum (peak) value in each experiment is less arbitrary but nonetheless tends to obscure comparisons between control and noncontrol conditions at time points other than in the vicinity of the peak. For such data sets, we contend that normalizing in a manner that incorporates all of the time-dependent data for the control condition is more appropriate. Examples include normalizing by the mean value of the control time course, its “area under the curve” (e.g., [14]), or by normalization factors that minimize its variance across all experiments (e.g., as assessed by the mean coefficient of variation). The latter approach, which we currently favor, is briefly described here and demonstrated in the two examples presented in Section 4.3.
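The simpler whole-time-course normalizations mentioned above (control mean, control area under the curve, and, for comparison, control peak) can be sketched as follows. The function names are hypothetical, and dividing the area under the curve by the duration of the time course, so that the normalized values stay dimensionless, is a choice made for this sketch rather than something prescribed in the text.

```python
def trapezoid_area(times, values):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return sum(0.5 * (values[k] + values[k + 1]) * (times[k + 1] - times[k])
               for k in range(len(times) - 1))


def normalize_by_control(times, control, sample, method="mean"):
    """Divide a sample time course by a scalar summary of the control time course.

    method = "mean": mean of the control readouts
    method = "auc":  control area under the curve divided by the total duration
    method = "peak": maximum control readout
    """
    if method == "mean":
        scale = sum(control) / len(control)
    elif method == "auc":
        scale = trapezoid_area(times, control) / (times[-1] - times[0])
    elif method == "peak":
        scale = max(control)
    else:
        raise ValueError("unknown normalization method: " + method)
    return [v / scale for v in sample]
```

For evenly spaced time points the mean and the time-averaged area give similar scale factors; they differ mainly when sampling is uneven, in which case the area weights each interval by its length.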

Suppose there are n experiments for which data are collected at m time points. During each of the n experiments the same control is run. Let X_ij denote the experimental readout for the control in the ith experiment at the jth time point. Often the quantity of interest Y_ij (e.g., the concentration of a chemical species) is related to X_ij by an unknown scale factor. That is, Y_ij = α_i X_ij. Under ideal conditions, the control would not vary from experiment to experiment. Therefore, we seek the set of α_i’s that minimize a suitable quantity F, for example

$$F = \sum_{i=1}^{n}\sum_{j=1}^{m}\left(Y_{ij}-\bar{Y}_{j}\right)^{2} \quad\text{or}\quad F = \sum_{j=1}^{m}\frac{1}{\bar{Y}_{j}}\left[\sum_{i=1}^{n}\left(Y_{ij}-\bar{Y}_{j}\right)^{2}\right]^{1/2},$$

where Ȳ_j is the mean value that results for time point j (the average of Y_ij over the n experiments). The minimization is subject to a constraint that eliminates the trivial solution, α_i = 0 for all i. Once the α_i have been found, they are used to scale the experimental time series, allowing the mathematical model to be fit to all the data simultaneously.
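A minimal Python sketch of this scale-factor search is given below. It uses a coefficient-of-variation-type objective, F = Σ_j (1/Ȳ_j) [Σ_i (Y_ij − Ȳ_j)²]^(1/2), with Y_ij = α_i X_ij. Two ingredients are assumptions made for illustration only: the constraint that removes the trivial solution is taken to be α = 1 for the first experiment, and the minimization is done with a simple coordinate pattern search over log α_i. The text specifies neither the constraint nor the optimizer, and the function names are hypothetical.

```python
import math


def mean_cv_objective(alpha, X):
    """F = sum_j (1/Ybar_j) * sqrt(sum_i (Y_ij - Ybar_j)^2), with Y_ij = alpha_i * X_ij.

    X is a list of n control time courses (rows), each with m positive readouts.
    """
    n, m = len(X), len(X[0])
    total = 0.0
    for j in range(m):
        column = [alpha[i] * X[i][j] for i in range(n)]
        ybar = sum(column) / n
        total += math.sqrt(sum((y - ybar) ** 2 for y in column)) / ybar
    return total


def fit_scale_factors(X, step=0.5, tol=1e-10, max_sweeps=100000):
    """Coordinate pattern search over log(alpha_i); alpha of the first row is fixed at 1."""
    n = len(X)
    log_alpha = [0.0] * n
    best = mean_cv_objective([math.exp(v) for v in log_alpha], X)
    for _ in range(max_sweeps):
        if step < tol:
            break
        improved = False
        for i in range(1, n):                     # row 0 is the reference
            for delta in (step, -step):
                trial = list(log_alpha)
                trial[i] += delta
                value = mean_cv_objective([math.exp(v) for v in trial], X)
                if value < best:
                    log_alpha, best, improved = trial, value, True
                    break
        if not improved:
            step *= 0.5                           # refine the search resolution
    return [math.exp(v) for v in log_alpha], best
```

Because this objective is unchanged when all α_i are multiplied by a common factor, fixing one of them only sets the overall scale. Searching in log α_i keeps every scale factor positive.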

4.2.3 Suitability of models used in conjunction with quantitative data

In formulating a suitable mathematical description of a system, it is important to cast the model at an appropriate level of abstraction, which should be weighed carefully along with considerations of computational feasibility. While all models of biochemical processes are expected to include fundamentals of chemical reaction kinetics, they are expected to vary along two axes of increasing complexity: from deterministic to stochastic, and from well-stirred to spatially extended (Figure 4.1). In deterministic models, continuum variables such as species concentrations evolve according to ordinary or partial differential equations (ODEs and PDEs, respectively) and associated initial and boundary constraints, whereas in stochastic models, molecules and molecular complexes are modeled as discrete entities whose states are updated probabilistically [15, 16]. So-called hybrid models incorporate both continuum and discrete variables [17]. On the other axis, well-stirred models assume spatial homogeneity within the domain of interest, and any transport processes in the model (e.g., trafficking between intracellular compartments [18]) are incorporated as reaction terms, whereas spatially extended models account for spatial gradients and therefore describe the underlying transport processes explicitly, according to physicochemical principles [8].

Figure 4.1 Two axes of mechanistic model complexity. Models can be characterized according to whether they are deterministic or stochastic and whether or not they explicitly account for spatial gradients. Roughly speaking, the degree of computational difficulty increases as one moves from the lower left to the upper right quadrant. In each corner, techniques used to implement such models are listed along with, in parentheses, the type of experimental data that might be described. Abbreviations: ODE, ordinary differential equation; PDE, partial differential equation; SDE, stochastic differential equation; BD, Brownian dynamics.
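The deterministic/stochastic distinction drawn above can be illustrated with a minimal sketch. It assumes a hypothetical well-stirred birth-death process (constant synthesis at rate k, first-order degradation at rate g per molecule); the function names and parameter values are illustrative only and are not taken from the chapter. The deterministic description is a single ODE with a closed-form solution, whereas the stochastic description tracks a discrete molecule count that is updated probabilistically, here using a basic form of the Gillespie algorithm of the kind surveyed in [16].

```python
import math
import random

def deterministic(k, g, t):
    """Closed-form solution of the ODE dn/dt = k - g*n with n(0) = 0."""
    return (k / g) * (1.0 - math.exp(-g * t))

def gillespie(k, g, t_end, rng):
    """One stochastic trajectory of the same process; returns the count at t_end."""
    t, n = 0.0, 0
    while True:
        birth, death = k, g * n            # reaction propensities
        total = birth + death
        t += rng.expovariate(total)        # waiting time to the next reaction
        if t > t_end:
            return n
        if rng.random() * total < birth:   # choose which reaction fires
            n += 1
        else:
            n -= 1

rng = random.Random(0)
k, g, t_end = 50.0, 1.0, 10.0              # assumed, illustrative values
samples = [gillespie(k, g, t_end, rng) for _ in range(200)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)
print("ODE value:", deterministic(k, g, t_end))
print("stochastic mean and variance over 200 cells:", mean, var)
```

The average over many simulated "cells" approaches the ODE solution, but each trajectory fluctuates around it. Only single-cell measurements could constrain that cell-to-cell variance, which is the practical reason the choice of model type is tied to the type of data available.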

For data-driven modeling of biochemical systems, the chosen complexity of the model should depend not only on what qualitative information is available in the literature, however reliable, but also in large part on the amount and type of quantitative, experimental data available. For instance, population endpoint measurements tend to be the most versatile and quantitative, yet they do not provide the kind of information that would justify a stochastic or spatially extended description of the model. Therefore, even though more complex models might be formulated, it is most appropriate to cast the model as a set of deterministic ODEs (see Section 4.3, Examples 1 and 2). Data-driven stochastic models generally benefit from single-cell information, which is obtained most quantitatively (albeit without spatial information) from flow cytometry data [19–21], and spatially extended models must be driven almost exclusively by single-cell kinetic (live-cell microscopy) data [22–26].

4.2.4 Issues related to parameter specification and estimation

Another aspect of model complexity that must be carefully considered when making comparisons to data is the amount of molecular detail to include. A comprehensive model, explicitly including all of the “known” biochemistry, comes at the expense of having to identify a large set of parameter values (rate constants and initial concentrations) [27]. Prominent examples of signaling pathway/network models with ~100 or more adjustable parameters have been offered [28–31], and in such cases the parameter values are typically culled from published in vitro measurements using purified components (or assumed to be similar in magnitude to parameters for related interactions where such data are available) or adjusted by hand to reconcile the sparse biochemical data assembled in various cell types and laboratories. Although models using this approach have proven valuable, it must be recognized that there is a great deal of uncertainty associated with such a parameter specification exercise. Formulation of very detailed models also dictates a qualitative assessment, wherein the model is judged by its ability to correctly produce the gross kinetic features seen in a relatively small collection of measurements [1].

The other approach is to simplify the model so as to reduce the number of adjustable parameters, to the point where a more direct, quantitative comparison or fit to the data becomes feasible and adequately constrained. Thus, the degree of model simplification is largely determined by the variety of experimental conditions and biochemical readouts in the data set; this, we contend, is the art of data-driven modeling. Simplification of kinetic models is achieved in a number of ways, including the use of scaled, dimensionless variables and through knowledge or assumptions about fast versus slow rate processes. Another mode of simplification is the lumping of multiple processes into a single step, which is warranted when quantitative data related to that particular step are absent or unattainable, or when its details are poorly characterized.
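As one familiar illustration of simplification based on fast versus slow processes, the sketch below compares a full mass-action enzyme mechanism (S + E <-> C -> P + E) with its lumped Michaelis-Menten form. This textbook example is an assumption introduced here, not a model from the chapter; the rate constants, the hand-written fixed-step Runge-Kutta integrator, and all names are hypothetical. The full mechanism has three rate constants plus the total enzyme level, whereas the lumped step needs only two composite parameters (Vmax and Km).

```python
def rk4(f, y, dt, n_steps):
    """Integrate dy/dt = f(y) with the classical fixed-step Runge-Kutta scheme."""
    for _ in range(n_steps):
        k1 = f(y)
        k2 = f([a + 0.5 * dt * b for a, b in zip(y, k1)])
        k3 = f([a + 0.5 * dt * b for a, b in zip(y, k2)])
        k4 = f([a + dt * b for a, b in zip(y, k3)])
        y = [a + dt * (b + 2 * c + 2 * d + e) / 6.0
             for a, b, c, d, e in zip(y, k1, k2, k3, k4)]
    return y

# Assumed values: binding/unbinding is fast relative to catalysis, enzyme is scarce.
kf, kr, kcat, e_tot, s0 = 100.0, 100.0, 1.0, 0.01, 1.0

def full_model(y):
    """State is [substrate S, complex C]; free enzyme is e_tot - C."""
    s, c = y
    bind = kf * (e_tot - c) * s
    return [-bind + kr * c, bind - (kr + kcat) * c]

vmax, km = kcat * e_tot, (kr + kcat) / kf   # the two lumped parameters

def reduced_model(y):
    """Single lumped step: dS/dt = -Vmax * S / (Km + S)."""
    s = y[0]
    return [-vmax * s / (km + s)]

dt, n_steps = 0.005, 20000                  # integrate to t = 100
s_full, c_full = rk4(full_model, [s0, 0.0], dt, n_steps)
(s_red,) = rk4(reduced_model, [s0], dt, n_steps)
p_full = s0 - s_full - c_full               # product formed, full mechanism
p_reduced = s0 - s_red                      # product formed, lumped step
print("product at t = 100 (full, lumped):", p_full, p_reduced)
```

Under these assumed conditions the two descriptions predict nearly the same amount of product, so data reporting only product formation could not constrain the individual binding and unbinding constants. That is the situation in which lumping is warranted.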

Supposing that a model with an appropriate level of granularity has been tailored for a particular set of measurements, how does one fit the model output to the data? This can be somewhat tricky, because even with appropriate simplification, a pathway/network model is going to have more than a handful of adjustable parameters. Indeed, it is becoming increasingly clear that the values of parameters in models with even modest complexity are not uniquely identifiable, even with near perfect kinetic data [32]. With that said, there are efficient methods for identifying a (nonunique) set of parameters that fit the data optimally well. One approach, which has been used to great effect in the modeling of the cell cycle, is the use of global optimization algorithms such as ODRPACK, which implements the Levenberg-Marquardt method with variable step size [33, 34]. Another strategy, which is gaining in popularity, involves Monte Carlo–based or “genetic” algorithms, wherein all of the parameter values are adjusted randomly, according to distributions centered on the current values, and the resulting parameter set is either accepted or rejected with certain probability or based on specified criteria related to the goodness of fit. The classic example of such an approach is the Metropolis algorithm [35] (Figure 4.2). In this method, parameter sets that improve the goodness of fit are always accepted, whereas sets that yield a poorer fit are accepted with a probability determined by a Boltzmann-like function; the overall error (χ2) is analogous to the energy, which is compared with a user-specified parameter that is analogous to the thermal energy scale or temperature (the lower the “temperature,” the lower the probability of acceptance). A commonly used variation is simulated annealing, in which the “temperature” is steadily reduced with time, making it more efficient for finding a global optimum [36, 37]. Regardless of the method used, it is important to note that the units of the model and those of the measurement are rarely the same, and so a conversion/alignment factor for each data type must usually be assigned or used as a fit parameter.
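The Metropolis procedure described above can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the implementation used in the studies cited: the two-parameter toy model, the synthetic data, the log-normal form of the random perturbation, and the exact acceptance rule exp(−β·ΔSSD) are all hypothetical choices. The names alpha (step size) and beta (stringency, an inverse “temperature”) follow the roles described in the caption of Figure 4.2, and the conversion factor between model and measurement units is obtained here by linear least squares.

```python
import math
import random

def model(params, times):
    """Toy kinetic model (hypothetical): fast activation, slower adaptation."""
    k_on, k_off = params
    return [(1.0 - math.exp(-k_on * t)) * math.exp(-k_off * t) for t in times]

def aligned_ssd(params, times, data):
    """SSD between data and model after rescaling the model into data units."""
    m = model(params, times)
    # Conversion factor from linear least squares: c = sum(d*m) / sum(m*m).
    c = sum(d * y for d, y in zip(data, m)) / sum(y * y for y in m)
    return sum((d - c * y) ** 2 for d, y in zip(data, m))

def metropolis(start, times, data, alpha, beta, n_steps, rng):
    """Return the accepted (parameter set, SSD) pairs, starting point included."""
    current = list(start)
    current_ssd = aligned_ssd(current, times, data)
    accepted = [(tuple(current), current_ssd)]
    for _ in range(n_steps):
        # Perturb every parameter; the log-normal form keeps rate constants positive.
        trial = [p * math.exp(alpha * rng.gauss(0.0, 1.0)) for p in current]
        trial_ssd = aligned_ssd(trial, times, data)
        delta = trial_ssd - current_ssd
        # Better fits are always accepted; poorer fits with Boltzmann-like probability.
        if delta <= 0.0 or rng.random() < math.exp(-beta * delta):
            current, current_ssd = trial, trial_ssd
            accepted.append((tuple(current), current_ssd))
    return accepted

rng = random.Random(1)
times = [float(t) for t in range(0, 32, 2)]          # 0, 2, ..., 30
true_params, true_scale, noise_sd = (1.0, 0.1), 3.0, 0.05
data = [true_scale * y + rng.gauss(0.0, noise_sd) for y in model(true_params, times)]

chain = metropolis((0.5, 0.3), times, data, alpha=0.1, beta=1000.0,
                   n_steps=5000, rng=rng)
best_params, best_ssd = min(chain, key=lambda item: item[1])
print("initial SSD:", chain[0][1])
print("best SSD:", best_ssd, "at", best_params)
```

With a large beta the search is stringent and behaves almost like a downhill walk. Lowering beta admits more uphill moves, and increasing beta gradually over the run would correspond to simulated annealing.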

Faced with the inherent problem of identifying unique parameter values, it might not be fruitful to seek one single, “best” solution to the parameter estimation problem; another approach, which we have demonstrated in Examples 1 and 2 below, is the ensemble or collective fitting approach [32, 38]. In this method, one accumulates a large number (potentially > 1,000) of parameter sets (the ensemble) that fit the data almost equally well. Starting with a single, near-optimal parameter set, the Metropolis algorithm is suitable for collecting the ensemble. At least for ODE models, which are solved with very little computational effort, it is no large task to recompute the model output for each of these parameter sets; the output of the “model,” then, may be taken as the ensemble mean, with its standard deviation yielding a measure of the variability in the model fit or prediction. An advantage of this approach is that one can readily infer whether or not a particular parameter is well constrained by the fit by inspection of the distribution of its values across the ensemble. Arguably, this evaluation is more insightful than the typical sensitivity analysis, which only assesses how the model responds to small changes in the parameter values, made one parameter at a time.
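The ensemble idea can be sketched as follows, again under assumptions that are not part of the chapter: a hypothetical saturating model y = A(1 − exp(−kt)) sampled only at early times, synthetic data, and a relaxed stringency beta set to 1/(2σ²) for an assumed noise level σ. As in the description above, only accepted parameter sets are collected. In this toy case the data constrain mainly the initial slope A·k, so the individual parameters wander widely across the ensemble while the product, and the model output itself, stay tightly constrained.

```python
import math
import random
import statistics

def model(params, times):
    """Hypothetical saturating response y = A * (1 - exp(-k*t))."""
    amplitude, rate = params
    return [amplitude * (1.0 - math.exp(-rate * t)) for t in times]

def ssd(params, times, data):
    return sum((d - y) ** 2 for d, y in zip(data, model(params, times)))

def collect_ensemble(start, times, data, alpha, beta, n_trials, rng):
    """Metropolis walk at relaxed stringency; returns the accepted parameter sets."""
    current, current_ssd = list(start), ssd(start, times, data)
    ensemble = []
    for _ in range(n_trials):
        trial = [p * math.exp(alpha * rng.gauss(0.0, 1.0)) for p in current]
        trial_ssd = ssd(trial, times, data)
        delta = trial_ssd - current_ssd
        if delta <= 0.0 or rng.random() < math.exp(-beta * delta):
            current, current_ssd = trial, trial_ssd
            ensemble.append(tuple(current))
    return ensemble

rng = random.Random(2)
times = [0.1 * i for i in range(1, 11)]            # early times only: t = 0.1 ... 1.0
near_optimal, sigma = (10.0, 0.1), 0.02            # assumed starting set and noise level
data = [y + rng.gauss(0.0, sigma) for y in model(near_optimal, times)]

ensemble = collect_ensemble(near_optimal, times, data, alpha=0.05,
                            beta=1.0 / (2.0 * sigma ** 2), n_trials=20000, rng=rng)

# Ensemble mean and standard deviation of the model output at the last time point.
final_outputs = [model(p, times)[-1] for p in ensemble]
mean_pred = statistics.mean(final_outputs)
sd_pred = statistics.stdev(final_outputs)

# Spread of each parameter (log scale) versus the spread of their product.
spread_amp = statistics.stdev(math.log(a) for a, k in ensemble)
spread_rate = statistics.stdev(math.log(k) for a, k in ensemble)
spread_prod = statistics.stdev(math.log(a * k) for a, k in ensemble)
print("ensemble size:", len(ensemble))
print("prediction at t = 1:", mean_pred, "+/-", sd_pred)
print("log-spread of A, k, A*k:", spread_amp, spread_rate, spread_prod)
```

Inspecting the three spreads shows directly which quantities the fit constrains, information that a one-parameter-at-a-time sensitivity analysis around a single best fit would not reveal.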

4.3 Examples of Data-Driven Modeling

4.3.1 Example 1: Systematic analysis of crosstalk in the PDGF receptor signaling network

Historically, intracellular signal transduction has been characterized in terms of pathways of sequential activation processes, such as the canonical mitogen-activated protein kinase (MAPK) cascades; a prominent example is the Ras → Raf → MEK → extracellular signal-regulated kinase (Erk) pathway in mammals, which is both a master integrator of upstream inputs and a master controller of transcription factors and other effectors [39]. Although our current understanding of signal transduction networks includes more complex interactions, including those between the classically defined pathways (crosstalk) and those responsible for feedback regulation/reinforcement, such interactions have not yet been adequately characterized.

Figure 4.2 Parameter estimation using the Metropolis algorithm. (a) Schematic of the algorithm. The values of all model parameters are adjusted at random, according to distributions centered on the previous values, and the resulting quality of fit determines the probability of accepting each successive parameter set. Alignment of the model output to the data is achieved through the assignment of conversion factors, which may be estimated in a separate subroutine. The performance of the algorithm is tuned by adjusting the values of α, which characterizes how much the parameters change in each step, and β, the stringency of the acceptance criterion. (b) Illustration of the algorithm run in a highly stringent mode, wherein each accepted move almost always results in a better fit (lower SSD), starting from random guesses of the parameter values. (c) After achieving a near-minimum SSD value, the algorithm may be reinitiated with a relaxed stringency, allowing a large number of parameter sets to be collected in an ensemble. The average output of the parameter set ensemble constitutes the output of the model. Quantitative predictions are made through uniform changes (e.g., setting a particular parameter equal to zero) across the ensemble.

In an effort to quantify the relative contributions of classical and crosstalk interactions in a signaling network, population endpoint measurements and computational modeling were systematically combined to study signaling mediated by platelet-derived growth factor (PDGF) receptors in fibroblasts [40] (Figures 4.3 and 4.4). The PDGF receptor signaling network is important in dermal wound healing and embryonic development [41], stimulating directed cell migration, survival, and proliferation through the aforementioned Ras/Erk pathway and exceptionally robust activation of phosphoinositide 3-kinases (PI3Ks), which produce specific lipid second messengers at the plasma membrane [42–44].

Erk phosphorylation and PI3K-dependent Akt phosphorylation in PDGF-stimulated NIH 3T3 fibroblasts were measured by quantitative immunoblotting for an array of 126 experimental conditions, sampling different combinations of PDGF dose, stimulation time, and molecular manipulations; considering biological replicates and parallel determination of total Erk and Akt levels, this set of data comprises 2,772 total measurements. A selected portion of the Erk data shows that blocking the activity of either Ras or PI3K only partially reduces PDGF-stimulated Erk phosphorylation [Figure 4.3(a)], whereas simultaneous inhibition of Ras and PI3K almost completely abolished PDGF-stimulated Erk phosphorylation [Figure 4.3(b)], indicating that Ras and PI3K are responsible for all of the major pathways from PDGF receptors to Erk, and at least one mode of PI3K-dependent crosstalk to Erk is independent of Ras. By comparison, the Akt phosphorylation results showed that the PI3K pathway is not significantly affected by perturbations affecting Ras and Erk; crosstalk is apparently unidirectional, from PI3K to Ras/Erk, in this network [40]. This conceptual model was further refined by additional experiments, which characterized two known negative feedback mechanisms and established that PI3K-dependent crosstalk affects the Erk pathway both downstream and upstream of Ras [Figure 4.3(c)].

Figure 4.3 Data-driven model to characterize crosstalk in the PDGF receptor signaling network. (a) A portion of the quantitative data set shows that inhibition of Ras (by expression of dominant-negative S17N Ras) or PI3K (using the LY compound) affects the dynamics of PDGF-stimulated Erk phosphorylation. (b) Whereas inhibition of Ras or PI3K only partially blocks Erk phosphorylation, the double-inhibition experiment shows that Ras and PI3K account for all of the major pathways from PDGF receptors to Erk. (c) Conceptual model of the PDGF receptor signaling network based on the entire data set. (d) A coarse-grained kinetic model of the network is aligned directly to the data using a variation of the Metropolis algorithm and the parametric ensemble approach. All panels are adapted from [40] (with permission of the authors).

Motivated by the dynamics revealed in this unique data set, a kinetic model of the network was formulated and used to quantify the relative magnitudes of the PI3K-dependent and -independent inputs collaborating to activate Erk. A total of 34 unspecified parameter values were estimated using the ensemble approach described in the previous section; taken together, the data force the model to reconcile time- and PDGF dose-dependent features of the network observed under the various experimental conditions tested [Figure 4.3(d)].

Figure 4.4 Quantification of Ras- and PI3K-dependent MEK phosphorylation pathways in the PDGF receptor signaling network. (a) For each parameter set in the model ensemble, the quantity Cxij is defined as the maximum catalytic efficiency of pathway i (i = 1, Ras-dependent; i = 2, PI3K-dependent) towards site j on MEK divided by the catalytic efficiency of the corresponding phosphatase reaction. On the dashed line, the two pathways are equally potent by this measure. (b) When MEK kinases and phosphatases are far from saturation, the steady-state fractions of MEK in the unphosphorylated, singly phosphorylated, and doubly phosphorylated states are readily calculated. The MEK Activation Comparator (MAC) is a ratio devised to compare the MEK phosphorylation capacity of PI3K-dependent signaling crosstalk to that of the classical Ras-dependent pathway. All panels are adapted from [40] (with permission of the authors).

Analysis of the parameter sets chosen by the algorithm revealed a consistent ratio of PI3K- and Ras-dependent contributions to the dual phosphorylation of MEK, the kinase activity directly upstream of Erk [Figure 4.4(a)]. We formulated a single number, the MEK activation comparator (MAC), which compares the capacities of the two pathways to generate dually phosphorylated MEK. Importantly, the MAC quantifies these inputs in a way that uncouples them from negative feedback effects. This analysis revealed that, whereas the PI3K-dependent MEK activation pathway is predicted to be intrinsically much less potent than the Ras-dependent pathway under maximal PDGF stimulation conditions, feedback regulation of Ras renders the PI3K-dependent pathway somewhat more important [Figure 4.4(b)]. A similar analysis was performed relating the PI3K-dependent and PI3K-independent signaling modes upstream of Ras [40].

The computational approach was also used to generate hypothetical predictions with an eye towards future experiments. Whereas inhibition of PI3K affects crosstalk interactions both upstream and downstream of Ras, the model ensemble predicts unique kinetic signatures that might be expected if either mechanism were silenced selectively [40], which could help validate the point of action of a particular PI3K-dependent pathway on Erk.

4.3.2 Example 2: Computational analysis of signal specificity in yeast

Yeast is well recognized as an excellent model organism for systems-level analysis [45]. Its ability to undergo efficient homologous recombination is particularly useful for studying the functional role of proteins in vivo, through gene disruption or gene replacement. Because of this property, the yeast pheromone response system is arguably the best-characterized signaling pathway of any eukaryote. This pathway bears strong similarities to signaling networks in mammals. In particular, the MAPK components share extensive sequence similarity with their human counterparts [46]. Another feature common to the yeast pheromone response pathway and response pathways of higher organisms is the sharing of signaling proteins among multiple systems. This property makes the pheromone response pathway an excellent system for studying signal specificity.

Depending on specific external cues, yeast cells initiate either a mating response or an invasive growth program. Mating is initiated when haploid cell types a and α secrete and respond to type-specific pheromones, which act through G protein-coupled receptors on cells of the opposite mating type [47]. Alternatively, invasive growth occurs in nutrient-poor conditions [48]. Combined genetic and biochemical studies revealed that both mating and invasive growth require a protein kinase cascade comprised of Ste20 (MAP4K), Ste11 (MAP3K), and Ste7 (MAP2K) [Figure 4.5(a)]. The pathways diverge at the level of the MAP kinase. Whereas deletion of one MAP kinase gene (KSS1) blocks invasive growth, deletion of a second MAP kinase gene (FUS3) impairs pheromone-induced cell-cycle arrest. Deletion of FUS3 leads to enhanced activity of Kss1 [49]. However, the mechanism by which this cross inhibition occurs was unknown.

We recently combined mathematical modeling with experimental analysis to investigate how Fus3 limits the activity of Kss1 [50]. Six mathematical models were developed to describe different hypothetical mechanisms of cross inhibition. All six models were fit to the time courses for Fus3 and Kss1 activation obtained from wild-type cells as well as from strains containing various genetic alterations. The experiments yielded a data set of over 300 measurements. To compare the performance of each of the six potential models, the Monte Carlo approach described above was used. Figure 4.5(b) shows a plot of the sum of the squared differences (SSD) versus the number of accepted realizations in the Monte Carlo optimization process for each of the six models. After 800 accepted realizations the SSD converged for each model. Model I performed the best (minimum SSD) and the next best models were II, III, and V, which roughly performed equally well.

Each of the six models falls into one of two distinct cases [Figure 4.5(c)]: (1) active Fus3 inhibits Kss1 phosphorylation, and (2) active Fus3 increases Kss1 dephosphorylation. Models I and II are mathematically the simplest and demonstrate the key difference between the two hypothetical mechanisms of cross inhibition. To compare the two models, 100 parameter sets were randomly selected from those accepted by the Monte Carlo optimization routine. The model equations were run using these parameter sets to generate a distribution of solutions. Figure 4.6(a) shows comparisons between the models’ output and experimental data for Kss1 activity in WT cells (black circles) and cells in which the MAPK Fus3 has been deleted (red circles). Note that only Model I is able to capture the rapid increase in Kss1 activity seen in the fus3Δ strain. The confidence intervals presented in these plots indicate that this behavior is not a consequence of the specific choice of parameter values, but a general property of the models.

Figure 4.5 Data-driven modeling of signal specificity in yeast. (a) Components of the mating and invasive-growth pathways. Activation steps are indicated with arrows, and inhibition steps are indicated with a T-shaped line. (b) The sum of the squared differences (SSD) between the experimental data and output of the six models versus the number of accepted realizations in the Monte Carlo optimization routine. (c) A simple model that incorporates two mechanisms of cross-inhibition: Fus3 inhibits the rate of Kss1 phosphorylation (red dashed line), and Fus3 increases the rate of Kss1 dephosphorylation (blue dashed line). All panels are adapted from [50] (with permission of the authors).

Motivated by these results, a simplified model of cross-inhibition was developed that captures the two general mechanisms by which Fus3 might regulate Kss1. Analysis of the simple model revealed that the two mechanisms of cross-inhibition have opposite effects on the rate at which the system relaxes to steady state. If Fus3 inhibits Kss1 phosphorylation, the relaxation rate is reduced; if Fus3 increases deactivation, the relaxation rate is increased. Consequently, a mechanism that increases the dephosphorylation rate of Kss1 is incompatible with the experimental data because it cannot simultaneously account for: (1) the large increase in maximum Kss1 activity seen in the fus3Δ strain, and (2) the slow decline in Kss1 activity observed in wild-type cells.

Because the MAP2K Ste7 is feedback phosphorylated by Fus3 [51–53] and directly catalyzes Kss1 activation, this protein was considered to be the most likely target for Fus3-mediated cross-inhibition of Kss1. The sites at which Fus3 phosphorylates Ste7 have been mapped, and a mutant lacking each of these phosphorylated residues has been described (Ste7A7) [54]. Consistent with the results of the computational investigations, Ste7A7 exhibits a significant elevation in the extent of Kss1 phosphorylation compared with wild-type Ste7 (Ste7Wt); furthermore, the mathematical model describing this scenario accurately predicts the extent and duration of the increase in Kss1 activation promoted by Ste7A7 [Figure 4.6(b)] [50].

Figure 4.6 Representative results for the Fus3 cross-inhibition models. (a) Model I, in which Fus3 inhibits the activation of Kss1, is able to capture the rapid increase in Kss1 activity seen in a fus3Δ mutant, whereas Model II, in which Fus3 increases the rate at which Kss1 is deactivated, cannot capture this effect. (b) Model I accurately predicts the results for the Ste7A7 mutant in which feedback phosphorylation has been disrupted. The circles are experimental data points and the lines are model results. Results for the wild-type cells are indicated in black and red indicates results for the mutants. All panels are adapted from [50] (with permission of the authors).
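The relaxation-rate argument used above to discriminate between the two mechanisms can be made concrete with a one-variable caricature. This is not the model of [50]: it assumes, for illustration only, that the active fraction K of Kss1 obeys dK/dt = k_act(1 − K) − k_deact·K with constant rate coefficients, and the numerical values are arbitrary. For this linear equation the steady state is k_act/(k_act + k_deact) and the relaxation rate is k_act + k_deact.

```python
import math

def steady_state_and_relaxation(k_act, k_deact):
    """For dK/dt = k_act*(1 - K) - k_deact*K, return (steady state, relaxation rate)."""
    return k_act / (k_act + k_deact), k_act + k_deact

# Assumed reference case without Fus3 (a stand-in for the fus3-deletion strain).
k_act, k_deact = 1.0, 1.0
no_fus3 = steady_state_and_relaxation(k_act, k_deact)

# Mechanism 1: Fus3 inhibits Kss1 phosphorylation (k_act reduced 4-fold).
inhibit_activation = steady_state_and_relaxation(k_act / 4.0, k_deact)

# Mechanism 2: Fus3 increases Kss1 dephosphorylation (k_deact increased 4-fold).
enhance_deactivation = steady_state_and_relaxation(k_act, 4.0 * k_deact)

for label, (level, rate) in [("no Fus3", no_fus3),
                             ("Fus3 inhibits activation", inhibit_activation),
                             ("Fus3 enhances deactivation", enhance_deactivation)]:
    print(f"{label}: steady state = {level:.2f}, relaxation rate = {rate:.2f}")
```

With these assumed numbers both mechanisms lower the steady-state activity from 0.5 to 0.2, yet the first slows relaxation (rate 2 falls to 1.25) while the second speeds it up (rate 2 rises to 5). Kinetic data can therefore distinguish mechanisms that endpoint levels alone cannot.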

Acknowledgments

This work was supported by National Institutes of Health grants R01-GM067739 and R21-GM074711 to J.M.H. and R01-GM079271 and R01-GM073180 to T.C.E.

References

[1] Mogilner, A., R. Wollman, and W.F. Marshall, “Quantitative modeling in cell biology: what is it good for?” Dev. Cell, Vol. 11, 2006, pp. 279–287.

[2] Hunter, T., “Signaling—2000 and beyond,” Cell, Vol. 100, 2000, pp. 113–127.

[3] Pawson, T., “Specificity in signal transduction: from phosphotyrosine-SH2 domain interactions to complex cellular systems,” Cell, Vol. 116, 2004, pp. 191–203.

[4] Tyson, J.J., K.C. Chen, and B. Novak, “Sniffers, buzzers, toggles and blinkers: dynamics of regulatory and signaling pathways in the cell,” Curr. Opin. Cell Biol., Vol. 15, 2003, pp. 221–231.

[5] Ma’ayan, A., R.D. Blitzer, and R. Iyengar, “Toward predictive models of mammalian cells,” Annu. Rev. Biophys. Biomol. Struct., Vol. 34, 2005, pp. 319–349.

[6] Kholodenko, B.N., “Cell-signalling dynamics in time and space,” Nat. Rev. Mol. Cell. Biol., Vol. 7, 2006, pp. 165–176.

[7] Janes, K.A., and M.B. Yaffe, “Data-driven modelling of signal-transduction networks,” Nat. Rev. Mol. Cell. Biol., Vol. 7, 2006, pp. 820–828.

[8] Haugh, J.M., “Mathematical modeling of biological signaling networks,” in Wiley Encyclopedia of Chemical Biology, New York: John Wiley & Sons, 2008.

[9] Nielsen, U.B., and B.H. Geierstanger, “Multiplexed sandwich assays in microarray format,” J. Immunol. Meth., Vol. 290, 2004, pp. 107–120.

[10] Domon, B., and R. Aebersold, “Mass spectrometry and protein analysis,” Science, Vol. 312, 2006, pp. 212–217.

[11] Huang, P.H., and F.M. White, “Phosphoproteomics: Unraveling the signaling web,” Mol. Cell, Vol. 31, 2008, pp. 777–781.

[12] Meyer, T., and M.N. Teruel, “Fluorescence imaging of signaling networks,” Trends Cell Biol., Vol. 13, 2003, pp. 101–106.

[13] Giepmans, B.N.G., S.R. Adams, M.H. Ellisman, and R.Y. Tsien, “The fluorescent toolbox for assessing protein location and function,” Science, Vol. 312, 2006, pp. 217–224.

[14] Park, C.S., I.C. Schneider, and J.M. Haugh, “Kinetic analysis of platelet-derived growth factor receptor/phosphoinositide 3-kinase/Akt signaling in fibroblasts,” J. Biol. Chem., Vol. 278, 2003, pp. 37064–37072.

[15] Kepler, T.B., and T.C. Elston, “Stochasticity in transcriptional regulation: Origins, consequences, and mathematical representations,” Biophys. J., Vol. 81, 2001, pp. 3116–3136.

[16] Li, H., Y. Cao, L.R. Petzold, and D.T. Gillespie, “Algorithms and software for stochastic simulation of biochemical reacting systems,” Biotechnol. Prog., Vol. 24, 2008, pp. 56–61.

[17] Dallon, J.C., “Numerical aspects of discrete and continuum hybrid models in cell biology,” Appl. Numerical Math., Vol. 32, 2000, pp. 137–159.

[18] Lauffenburger, D.A., and J.J. Linderman, Receptors: Models for Binding, Trafficking, and Signaling, New York: Oxford University Press, 1993.

[19] Pirone, J.R., and T.C. Elston, “Fluctuations in transcription factor binding can explain the graded and binary responses observed in inducible gene expression,” J. Theor. Biol., Vol. 226, 2004, pp. 111–121.

[20] Altan-Bonnet, G., and R.N. Germain, “Modeling T cell antigen discrimination based on feedback control of digital ERK responses,” PLoS Biol., Vol. 3, 2005, article no. e356.

[21] Perez, O.D., and G.P. Nolan, “Phospho-proteomic immune analysis by flow cytometry: from mechanism to translational medicine at the single-cell level,” Immunol. Rev., Vol. 210, 2006, pp. 208–228.



[22] Hirschberg, K., C.M. Miller, J. Ellenberg, J.F. Presley, E.D. Siggia, R.D. Phair, and J. Lippincott-Schwartz, “Kinetic analysis of secretory protein traffic and characterization of Golgi to plasma membrane transport intermediates in living cells,” J. Cell Biol., Vol. 143, 1998, pp. 1485–1503.

[23] Slepchenko, B.M., J.C. Schaff, J.H. Carson, and L.M. Loew, “Computational cell biology: spatiotemporal simulation of cellular events,” Annu. Rev. Biophys. Biomol. Struct., Vol. 31, 2002, pp. 423–441.

[24] Reynolds, A.R., C. Tischer, P.J. Verveer, O. Rocks, and P.I.H. Bastiaens, “EGFR activation coupled to inhibition of tyrosine phosphatases causes lateral signal propagation,” Nat. Cell Biol., Vol. 5, 2003, pp. 447–453.

[25] Janetopoulos, C., L. Ma, P.N. Devreotes, and P.A. Iglesias, “Chemoattractant-induced phosphatidylinositol 3,4,5-trisphosphate accumulation is spatially amplified and adapts, independent of the actin cytoskeleton,” Proc. Natl. Acad. Sci. USA, Vol. 101, 2004, pp. 8951–8956.

[26] Schneider, I.C., and J.M. Haugh, “Quantitative elucidation of a distinct spatial gradient-sensing mechanism in fibroblasts,” J. Cell Biol., Vol. 171, 2005, pp. 883–892.

[27] Weng, G., U.S. Bhalla, and R. Iyengar, “Complexity in biological signaling systems,” Science, Vol. 284, 1999, pp. 92–96.

[28] Bhalla, U.S., P.T. Ram, and R. Iyengar, “MAP kinase phosphatase as a locus of flexibility in a mitogen-activated protein kinase signaling network,” Science, Vol. 297, 2002, pp. 1018–1023.

[29] Schoeberl, B., C. Eichler-Jonsson, E.D. Gilles, and G. Muller, “Computational modeling of the dynamics of the MAP kinase cascade activated by surface and internalized EGF receptors,” Nat. Biotechnol., Vol. 20, 2002, pp. 370–375.

[30] Hatakeyama, M., S. Kimura, T. Naka, T. Kawasaki, N. Yumoto, M. Ichikawa, J. Kim, K. Saito, M. Saeki, M. Shirouzu, S. Yokoyama, and A. Konagaya, “A computational model on the modulation of mitogen-activated protein kinase (MAPK) and Akt pathways in heregulin-induced ErbB signalling,” Biochem. J., Vol. 373, 2003, pp. 451–463.

[31] Kiyatkin, A., E. Aksamitiene, N.I. Markevich, N.M. Borisov, J.B. Hoek, and B.N. Kholodenko, “Scaffolding protein Grb2-associated binder 1 sustains epidermal growth factor-induced mitogenic and survival signaling by multiple positive feedback loops,” J. Biol. Chem., Vol. 281, 2006, pp. 19925–19938.

[32] Gutenkunst, R.N., J.J. Waterfall, F.P. Casey, K.S. Brown, C.R. Myers, and J.P. Sethna, “Universally sloppy parameter sensitivities in systems biology models,” PLoS Comp. Biol., Vol. 3, 2007, article no. e189.

[33] Zwolak, J.W., J.J. Tyson, and L.T. Watson, “Parameter estimation for a mathematical model of the cell cycle in frog eggs,” J. Comput. Biol., Vol. 12, 2005, pp. 48–63.

[34] Sible, J.C., and J.J. Tyson, “Mathematical modeling as a tool for investigating cell cycle control networks,” Methods, Vol. 41, 2007, pp. 238–247.

[35] Metropolis, N., A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller, “Equation of state calculations by fast computing machines,” J. Chem. Phys., Vol. 21, 1953, pp. 1087–1092.

[36] Hansmann, U.H.E., and Y. Okamoto, “New Monte Carlo algorithms for protein folding,” Curr. Opin. Struct. Biol., Vol. 9, 1999, pp. 177–183.

[37] Gonzalez, O.R., C. Kuper, K. Jung, P.C. Naval, and E. Mendoza, “Parameter estimation using simulated annealing for S-system models of biochemical networks,” Bioinformatics, Vol. 23, 2007, pp. 480–486.

[38] Brown, K.S., and J.P. Sethna, “Statistical mechanical approaches to models with many poorly known parameters,” Phys. Rev. E, Vol. 68, 2003, article no. 021904.

[39] Kolch, W., “Meaningful relationships: the regulation of the Ras/Raf/MEK/Erk pathway by protein interactions,” Biochem. J., Vol. 351, 2000, pp. 289–305.

[40] Wang, C.-C., M. Cirit, and J.M. Haugh, “PI3K-dependent crosstalk interactions converge with Ras as quantifiable inputs integrated by Erk,” Mol. Syst. Biol., Vol. 5, 2009, article no. 246.

[41] Heldin, C.-H., and B. Westermark, “Mechanism of action and in vivo role of platelet-derived growth factor,” Physiol. Rev., Vol. 79, 1999, pp. 1283–1316.

[42] Vanhaesebroeck, B., S.J. Leevers, K. Ahmadi, J. Timms, R. Katso, P.C. Driscoll, R. Woscholski, P.J. Parker, and M.D. Waterfield, “Synthesis and function of 3-phosphorylated inositol lipids,” Annu. Rev. Biochem., Vol. 70, 2001, pp. 535–602.

[43] Hawkins, P.T., K.E. Anderson, K. Davidson, and L.R. Stephens, “Signalling through Class I PI3Ks in mammalian cells,” Biochem. Soc. Trans., Vol. 34, 2006, pp. 647–662.

[44] Engelman, J.A., J. Luo, and L.C. Cantley, “The evolution of phosphatidylinositol 3-kinases as regulators of growth and metabolism,” Nat. Rev. Genet., Vol. 7, 2006, pp. 606–619.

[45] Hao, N., M. Behar, T.C. Elston, and H.G. Dohlman, “Systems biology analysis of G protein and MAP kinase signaling in yeast,” Oncogene, Vol. 26, 2007, pp. 3254–3266.

[46] Dohlman, H.G., and J.W. Thorner, “Regulation of G protein-initiated signal transduction in yeast: Paradigms and principles,” Annu. Rev. Biochem., Vol. 70, 2001, pp. 703–754.


[47] Wang, Y.Q., and H.G. Dohlman, “Pheromone signaling mechanisms in yeast: A prototypical sex machine,” Science, Vol. 306, 2004, pp. 1508–1509.

[48] Truckses, D.M., L.S. Garrenton, and J. Thorner, “Jekyll and Hyde in the microbial world,” Science, Vol. 306, 2004, pp. 1509–1511.

[49] Sabbagh, W., L.J. Flatauer, A.J. Bardwell, and L. Bardwell, “Specificity of MAP kinase signaling in yeast differentiation involves transient versus sustained MAPK activation,” Mol. Cell, Vol. 8, 2001, pp. 683–691.

[50] Hao, N., N. Yildirim, S.C. Parnell, M.J. Nagiec, R.H. Shanks, B. Errede, H.G. Dohlman, and T.C. Elston, “A computational analysis of feedback regulation as a mechanism for signaling specificity in yeast,” in revision, 2009.

[51] Errede, B., A. Gartner, Z. Zhou, K. Nasmyth, and G. Ammerer, “MAP kinase-related FUS3 from S. cerevisiae is activated by STE7 in vitro,” Nature, Vol. 362, 1993, pp. 261–264.

[52] Errede, B., and Q.Y. Ge, “Feedback regulation of MAP kinase signal pathways,” Philos. Trans. R. Soc. Lond. B Biol. Sci., Vol. 351, 1996, pp. 143–149.

[53] Zhou, Z., A. Gartner, R. Cade, G. Ammerer, and B. Errede, “Pheromone-induced signal transduction in Saccharomyces cerevisiae requires the sequential function of three protein kinases,” Mol. Cell. Biol., Vol. 13, 1993, pp. 2069–2080.

[54] Maleri, S., Q. Ge, E.A. Hackett, Y. Wang, H.G. Dohlman, and B. Errede, “Persistent activation by constitutive Ste7 promotes Kss1-mediated invasive growth but fails to support Fus3-dependent mating in yeast,” Mol. Cell. Biol., Vol. 24, 2004, pp. 9221–9238.



C H A P T E R 5

Construction of Phenotype-Specific Gene Network by Synergy Analysis

Xuerui Yang1, Xuewei Wang1, Ming Wu2, Ertugrul Dalkic3,and Christina Chan*1,2,3

1Department of Chemical Engineering and Material Science, Michigan State University, East Lansing,MI 488242Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 488243Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824*e-mail: Christina Chan at [email protected]

Key terms: Synergy; Gene network reconstruction; Phenotype-specific; Free fatty acid; Metabolite; Cytotoxicity

Abstract

Complex cellular activities are believed to be coordinately regulated by genes that function in a network. Reconstructing these gene networks can provide insights into the molecular mechanisms of cell physiology and thus represents a fundamental challenge in systems biology. Elucidating the phenotype-gene interaction through the reconstruction of context-specific gene networks remains elusive. In this chapter, we present a methodology that integrates multi-level biological data to infer a cooperative gene network with respect to a specific phenotype. Our method introduces the concept of synergy and builds a network that consists of gene pairs with significant synergistic relations, which implies cooperation. We apply our method to reconstruct a synergistic gene network for saturated free fatty acid (FFA)-induced cytotoxicity, and analyze the properties of the network. Scale-free characteristics and multiple hub genes are found in the network, revealing many important cooperative candidate genes in regulating the FFA-induced cytotoxicity. These candidates are supported by the literature.


5.1 Introduction

The development of diseases can be traced to abnormal activities of the cells in specific tissues or parts of the body. Hepatic disorders, such as steatosis and nonalcoholic steatohepatitis (NASH), are associated with saturated free fatty acid (FFA)-induced cytotoxicity of liver cells [1, 2]. Cellular activities are tuned by regulatory machineries involving genes, proteins, and metabolites. For instance, saturated FFA-induced cytotoxicity is coordinately regulated by a set of genes that interact in a complex network [3]. Therefore, reconstructing the gene networks that give rise to the different phenotypes may provide insights into the cellular mechanisms involved, ultimately, in the development of diseases and disorders [4, 5].

This chapter describes a methodology to reconstruct phenotypic and context-specific gene networks based on the assumption that only a subset of the genes is relevant to the target phenotype. The phenotype addressed in this chapter is saturated FFA-induced cytotoxicity. Methods of gene selection relevant to a phenotype have been based predominantly on fold changes in the genes across different conditions or correlations between the genes and the phenotype, using statistical tests [6] or correlation measures [7], respectively. Statistical tests typically yield too many genes for analysis, while correlation measures only select genes that are statistically correlated with the phenotype, thereby missing potentially relevant genes that are not highly correlated to the phenotype. Incorporating prior information, such as gene set enrichment analysis [8] or trend profiles, into the gene selection methods can help to alleviate these problems; however, the source and quality of the prior knowledge will affect the results.

FFAs modulate intracellular metabolic pathways involved in glucose [9], triglyceride (TG) [10], and amino acid [11] metabolism. Tuned by the gene network [3, 7], some of these alterations are involved in the induction of cytotoxicity by saturated FFAs [9–11]. Therefore, integrating multiple levels of information (i.e., gene expression and metabolite profiles) would better reflect the “multilevel” characteristic of cellular activities, such as saturated FFA-induced cytotoxicity, and thereby aid in the selection of genes that are involved in the observed phenotype, and in the reconstruction of a phenotype-specific gene network.

Various methods, such as correlation [12], mutual information [13, 14], and Bayesian network analysis [15], have been used to construct gene networks. These methods do not directly incorporate the phenotype in identifying the gene interactions. Instead, they typically build a gene network for each of the conditions, and compare the networks to identify the gene interactions that are specific to a condition or phenotype. Consequently, these methods are computationally expensive and sensitive to the quality of data. Since the size (i.e., the number of genes included in the network reconstruction) and the noise level of the samples can affect the networks reconstructed for each of the conditions, it is difficult to determine whether the differences in the networks across the conditions are real changes in the mechanisms or simply an artifact due to the size or noise levels. Alternatively, methods have been developed to select sets of gene pairs relevant to a phenotype based on classification models, such as support vector machine [16, 17], decision tree [18], and probabilistic model [19]. Intuitively, if a phenotype prediction based on a pair of genes performs better than that based on either one of the genes, then the pair of genes is suggested to have cooperative effects on the phenotype. However, these classification methods fail to differentiate the cooperative effects of the gene pairs from the independent contributions of the individual genes [20]. To address this shortcoming, we present a method that distinguishes the cooperative from the individual effects of the genes.

Biological activities are regulated by multiple factors, many of which function cooperatively (i.e., synergistically). The basic idea is that the whole (i.e., the regulatory system) is greater than the sum of the individual parts (i.e., regulators) of a system. Synergy is defined as the “additional” contribution provided by the “whole” as compared to the sum of the contributions of the individual “parts.” An example of synergy can be seen with transcription factors, such as GATA4 and dHAND, which together cooperatively and dramatically up-regulate cardiac (target) gene expression levels more significantly than the sum of the effects from either of the transcription factors alone [21].

The concept of synergy will be used in this chapter to assess the cooperative effect of two genes on a phenotype. In a multivariate system, the synergistic effect of two factors on a phenotype is the gain in the “mutual information” over the sum of the information provided by each factor on a phenotype. A positive synergy denotes that two factors regulate a phenotype, either cooperatively (e.g., co-activating) or antagonistically (e.g., competitive inhibiting). Thus, one can predict the phenotype with a certain confidence from either of the two factors; however, knowing both factors brings additional information, which enhances the confidence of the prediction. Negative synergy denotes redundancy; thus, knowing both factors brings redundant information to the prediction of the phenotype. Zero synergy denotes that at least one of the two factors has no effect on the phenotype, and therefore brings neither additional nor redundant information to the prediction of the phenotype.

Systematic assessment of synergy was first applied in neuroscience, where the goal was to understand the neural code by evaluating the strength of correlations between the neurons upon activation by a stimulus [22, 23]. More recently, the concept of synergy has been applied to the field of systems biology [24–26]. Investigators developed an information theoretic measure of synergy from discretized gene expression data and applied this measure to identify cooperative gene interactions associated with neural interconnectivity [24] and prostate cancer development [25]. Subsequently, the concept of synergy and the information theoretic measure of synergy have been applied directly to continuous gene expression data [20].

In this chapter we introduce an integrative methodology to reconstruct phenotype-specific gene networks based on synergy analysis. First, we select the phenotype-specific genes by integrating the gene expression and metabolite profiles in the context of saturated FFA-induced cytotoxicity. Next, we assess the synergistic effects between the gene pairs. Unlike other computational methods used to identify gene interactions, the fundamental concept of synergy is to identify the cooperative gene interactions responsible for the phenotype, and these cooperative gene interactions may or may not be direct interactions. Finally, with the identified synergistic gene pairs, we build a synergy network. Topological analyses reveal the structural characteristics of the network while the hub genes provide insights into potential mechanism(s) involved in the induction of the phenotype (i.e., saturated FFA-induced cytotoxicity).



Human hepatoblastoma cells (HepG2/C3A) were used for the study. These cells offer the advantages of ease of culture and experimentation, are of human origin, and have been shown to retain many hepatospecific functions; they are therefore suggested as a good model cell system for hepatocellular function, such as lipid metabolism [27] and fatty acid transport [28]. Cell lines offer an advantage over primary hepatocytes in that they are more amenable to genetic manipulation.

Two different types of free fatty acids were employed (i.e., saturated and monounsaturated fatty acids), corresponding to the major dietary fractions. Palmitic acid was chosen as the representative saturated fatty acid and oleic acid as the monounsaturated fatty acid. They are the major fatty acids of their classes found in serum/plasma. While the total concentration of FFAs in plasma may reach the millimolar range under pathological conditions [29], most studies of obese/type 2 diabetic patients have reported fatty acid concentrations of about 0.7 mM. We therefore used 0.7 mM as a standard concentration of palmitate and oleate.

The proposed experiments and methodology for reconstructing the phenotype-specific gene network with synergy analysis, described in a flowchart (Figure 5.1), are as follows.

1. Identify the phenotype of interest: cytotoxicity levels induced by saturated FFA.

2. Obtain metabolite data.

3. Obtain gene-expression data with cDNA microarray.

4. Select the metabolites that are associated with the phenotype (i.e., cytotoxicity).

5. Select the genes by matching the trend of the metabolite and gene profiles.

6. Obtain synergistic gene pairs by calculating their synergy scores.

7. Construct a synergy network with the gene pairs that are significantly synergistic.

Topological analysis of the synergy network revealed the type, structure, and other characteristics of the network. Statistical analysis of the synergistic gene pairs yielded hub genes (i.e., the genes that appeared most frequently in the synergistic gene pairs). The high frequency of the hub genes suggests that the hub genes are potentially important in producing the phenotype.

Figure 5.1 Flowchart of the proposed methodology.

5.3 Materials

5.3.1 Cell culture and reagents

HepG2 cells were cultured in Dulbecco’s Modified Eagle Medium (DMEM) (Invitrogen, Carlsbad, California) with 10% fetal bovine serum (FBS) (Biomeda Corp., Foster City, California) and penicillin-streptomycin (penicillin: 10,000 U/ml, streptomycin: 10,000 μg/ml) (Invitrogen, Carlsbad, California). Freshly trypsinized HepG2 cells were suspended at 5 × 10⁵ cells/ml in standard HepG2 culture medium and seeded at a density of 10⁶ cells per well in standard 6-well tissue culture plates. After seeding, the cells were incubated at 37°C in a 90% air/10% CO2 atmosphere, and 2 ml of fresh medium was supplied every other day to the cultures after removal of the supernatant. The HepG2 cells were cultured in standard medium for 5 to 6 days to achieve 90% confluence before treating with FFAs or other additives. HepG2 cell number was assessed by trypan blue dye exclusion using a hemocytometer.

5.3.2 Fatty acid salt treatment

Sodium salts of palmitate (P9767) and oleate (O7501) were purchased from Sigma-Aldrich. Palmitate or oleate was complexed to 0.7 mM bovine serum albumin (BSA, fatty acid free) dissolved in the media, which mimics the physiological concentration of albumin in human blood (3.5% to 5%, [30]). In all the experiments, the vehicle (0.7 mM BSA) was used as the control. Fatty acid free BSA was purchased from MP Biomedicals (Chillicothe, Ohio).

5.4 Methods

5.4.1 Cytotoxicity measurement

HepG2 cells were cultured in different media for 24 hours and the supernatants collected. Cells were washed with PBS and kept in 1% Triton X-100 in PBS for 24 hours at 37°C. Cell lysate was then collected, vortexed for 15 seconds, and centrifuged at 7,000 rpm for 5 minutes. A cytotoxicity detection kit (Roche Applied Science, Indianapolis, Indiana) was used to measure the LDH levels in the supernatants and in the cell lysates. The LDH released into the medium was normalized to the total LDH (LDH released into the medium + LDH remaining in the cell lysates) [31].

5.4.2 Gene expression profiling

Cells were cultured in 10-cm tissue culture plates until confluence and then exposed to different treatments. RNA was isolated with Trizol reagent. The gene expression profiles were obtained with cDNA microarray. Analyses were done at the Van Andel Institute, Grand Rapids, Michigan. The procedure of the microarray analysis was described previously [5].

5.4.3 Metabolites measurements

The fluxes of the various metabolites were measured according to [32, 33]. The concentrations of glucose, lactate, FFA, and glycerol were measured by enzymatic kits from Sigma-Aldrich, while beta-hydroxybutyrate and triglycerides were measured using enzymatic kits from Stanbio Laboratories. These metabolites were assayed according to the manufacturers’ instructions. The concentration of acetoacetate in the media was measured by an enzymatic fluorimetric assay [34]. Concentrations of Asp, Glu, Gly, NH3, Arg, Thr, Ala, Pro, Tyr, Val, Met, Orn, Lys, Ile, Leu, and Phe were measured by the AccQTag amino acid analysis method (Waters) coupled with fluorescence detection. The concentrations of Ser, Asn, Gln, and His were measured by a modification of the AccQTag method. Cystine concentration in the media and supernatants was measured using HPLC according to a previously published protocol [35]. All the measured fluxes were normalized to total protein in the cell extract, measured with the bicinchoninic acid (BCA) method (Pierce Chemicals, Rockford, Illinois).

5.4.4 Gene selection based on trends of metabolites

The statistical significance of the changes in the metabolite levels across the conditions (i.e., BSA (control), palmitate, and oleate) was assessed using a two-sample t-test for each metabolite. Eleven metabolites differed significantly across the three conditions, and four representative trends were extracted from these metabolites (Figure 5.2). There remained 7,394 genes after removing the EST/hypothetical proteins and ORFs of unknown function from the list of ~20,000 genes. Genes with expression patterns that matched the four representative metabolite trends were selected. A two-sample t-test was applied to each gene to assess the significance of its fold change across the different conditions. Finally, 610 genes were selected from the full list of 7,394 genes. The p-value cutoff was set at 0.05.
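The trend-matching step can be illustrated with a minimal sketch. It assumes that each gene or metabolite is summarized by its replicate values under the three conditions and that a trend is assigned from the ordering of the condition means alone, following the four trends of Figure 5.2. The function names `classify_trend` and `select_genes` and the profile layout are hypothetical, and the t-test significance filter described above is deliberately left out.

```python
from statistics import mean


def classify_trend(bsa, palm, ole):
    """Assign one of the four Figure 5.2 trends from replicate measurements.

    Returns "I", "II", "III", "IV", or None when no trend applies.
    Only the ordering of the condition means is used (an assumption of
    this sketch); statistical significance is not checked here.
    """
    b, p, o = mean(bsa), mean(palm), mean(ole)
    if b < p and p > o:
        return "I"    # BSA < Palm and Palm > Ole
    if b > p and p < o:
        return "II"   # BSA > Palm and Palm < Ole
    if b < p < o:
        return "III"  # BSA < Palm < Ole
    if b > p > o:
        return "IV"   # BSA > Palm > Ole
    return None       # ties or flat profiles match no trend


def select_genes(gene_profiles, metabolite_trends):
    """Keep genes whose trend matches one of the metabolite trends.

    gene_profiles maps a gene name to a (bsa, palm, ole) tuple of
    replicate lists; metabolite_trends is a set of trend labels.
    """
    return sorted(
        name
        for name, (bsa, palm, ole) in gene_profiles.items()
        if classify_trend(bsa, palm, ole) in metabolite_trends
    )
```

In practice the selection would be combined with a significance test on each pairwise comparison, so that only differences passing the p-value cutoff define a trend.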

5.4.5 Calculation of the synergy scores of gene pairs

An information theory–based score was calculated to quantify the synergy between the genes [26]. Given two genes, G1 and G2, and a phenotype P, the synergy score between G1 and G2 with respect to the phenotype P is defined as

\[ \mathrm{Syn}(G1, G2; P) = I(G1, G2; P) - \left[ I(G1; P) + I(G2; P) \right] \]

where I(G1;P) is the mutual information between G1 and P, I(G2;P) is the mutual information between G2 and P, and I(G1,G2;P) is the mutual information between (G1,G2) and P. This equation reflects the definition of synergy, the additional contribution provided by the “whole” as compared to the sum of the contributions of the individual “parts.” Mutual information (I) was calculated using a clustering-based method from continuous data [20].
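The synergy definition can be made concrete with a small sketch. Note the assumption: the chapter estimates mutual information from continuous data with a clustering-based method [20], whereas the code below swaps in a simple plug-in estimator on already discretized values, purely for illustration. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def synergy(g1, g2, phenotype):
    """Syn(G1,G2;P) = I(G1,G2;P) - [I(G1;P) + I(G2;P)]."""
    pair = list(zip(g1, g2))  # treat (G1, G2) as one joint variable
    return mutual_information(pair, phenotype) - (
        mutual_information(g1, phenotype) + mutual_information(g2, phenotype)
    )


# Toy case 1: the phenotype is the XOR of two binary genes. Neither gene
# alone is informative, but the pair determines the phenotype.
print(synergy([0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]))  # 1.0

# Toy case 2: both genes are copies of the phenotype, so the second gene
# adds only redundant information.
print(synergy([0, 1, 0, 1], [0, 1, 0, 1], [0, 1, 0, 1]))  # -1.0
```

The two toy cases hit the extremes of the score for a binary phenotype: fully cooperative genes give +1 bit and fully redundant genes give −1 bit.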



Figure 5.2 Four representative trends of the metabolites. Eleven metabolites differed significantly across the three conditions, and four representative trends were extracted from these metabolites. Trend I: BSA<Palm and Palm>Ole; Trend II: BSA>Palm and Palm<Ole; Trend III: BSA<Palm<Ole; Trend IV: BSA>Palm>Ole.

The synergy scores range from −1 to 1. A positive synergy score indicates that two genes jointly provide additional information on the phenotype, a negative synergy score indicates that the two genes provide redundant information about the phenotype, and a zero score indicates that the two genes provide no additional information about the phenotype. The 610 genes that were selected based on the metabolite trends generated 185,745 gene pairs, of which 436 pairs had significant synergy scores.
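As a quick check of the pair count, the number of unordered pairs that can be formed from 610 genes is:

```latex
\[
\binom{610}{2} = \frac{610 \times 609}{2} = 185{,}745
\]
```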

5.4.6 Permutation test to evaluate the significance of the synergy

A permutation test was performed to assess the statistical significance of the synergy of the gene pairs. The phenotypes (i.e., toxic and nontoxic) were randomly permuted to be uncorrelated with the gene expression profiles. The synergy scores of the gene pairs were then recalculated based on the permuted phenotype. This process was repeated 100 times to calculate the p-value of the synergy score for each gene pair. Finally, the Benjamini-Hochberg false discovery rate procedure [36] was performed to adjust the p-values for all the gene pairs and thereby control the expected proportion of false discoveries. The p-value cutoff was set at 0.05.
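The permutation test and the false discovery rate adjustment can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `score_fn` stands for any synergy scoring function supplied by the caller, the p-value uses an add-one correction (a common convention that the chapter does not specify), and the function names are hypothetical.

```python
import random


def permutation_p_value(score_fn, g1, g2, phenotype, n_perm=100, seed=0):
    """Estimate a one-sided p-value for the score of one gene pair.

    The phenotype labels are shuffled n_perm times; the p-value is the
    share of permuted scores at least as large as the observed score.
    The add-one correction is an assumption of this sketch.
    """
    rng = random.Random(seed)
    observed = score_fn(g1, g2, phenotype)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the gene-phenotype association
        if score_fn(g1, g2, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Gene pairs whose adjusted p-value falls below the chosen cutoff (0.05 in this chapter) would then be kept as significantly synergistic.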

5.4.7 Characterization of the network topology

A synergy network was built with gene pairs that have statistically significant synergy scores. The network was composed of nodes that represented the genes, and edges that represented the synergy of the gene pairs. Graph theoretical (topological) analysis of the reconstructed gene networks was used to assess the generated network and how it compared with other biological networks [37]. We characterized the topology of the synergy network by its degree distribution and shortest path length. Degree distribution provides a distribution of the number of edges associated with the nodes. Shortest path length is the lowest number of edges that connect two nodes and is measured using a breadth-first search algorithm [38].
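The two topological measures can be sketched on an undirected network stored as an edge list. The sketch below is illustrative only: the helper names are hypothetical and the node labels in the usage note are placeholders, not genes from the study.

```python
from collections import Counter, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def degree_distribution(adj):
    """Map each degree k to the number of nodes that have degree k."""
    return Counter(len(neighbors) for neighbors in adj.values())


def shortest_path_lengths(adj, source):
    """Breadth-first search: edge counts from source to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:  # first visit gives the shortest path
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist
```

For example, with the placeholder edges `[("n1", "n2"), ("n1", "n3"), ("n1", "n4"), ("n4", "n5")]`, node `n1` has degree 3 and the shortest path from `n2` to `n5` has length 3. Running the search from every node and pooling the distances gives the distribution of shortest path lengths.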


In the proposed methodology, the phenotype (cytotoxicity), metabolite, and gene expression profiles were collected as “inputs” and integrated as described above. The anticipated results include the representative trends of the metabolites relevant to the phenotype, genes that match the representative trends of the metabolites, and gene pairs with significant synergy scores.

Gene pairs that have statistically significant synergy scores indicate possible membership of the gene pairs in a shared pathway or potential cross-talk between different pathways. Graphical representation of these synergistic gene pairs yields a network of gene-gene interactions that are associated with the phenotype. Topology analysis of the synergy network reveals the characteristics of the network, such as degree distribution, modularity, and centrality. The hub genes, which have the highest number of edges, may be central regulators in the induction of the phenotype.




Phenotype-specific gene network reconstruction is a useful approach to extract gene interaction information from microarray data and to help provide insight into disease mechanisms. Multiple methods (e.g., meta-analysis [39] and pair-wise relevance [40]) have been developed to reconstruct gene networks associated with diseases. The information theoretic measure of synergy provides a convenient method to identify cooperative gene pairs with respect to a phenotype. Therefore, in this chapter, we present an alternative strategy to reconstruct phenotype-specific networks based on the concept of synergy.

A major concern that often arises in network reconstruction using microarray data is the high computational cost. To alleviate this limitation, we preselect a subset of genes by matching the trend of their expression profiles, across the different conditions, to the profiles of the phenotype-relevant metabolites. This step reduces the number of genes to be analyzed. Concomitantly, the trend-based analysis allows the incorporation of prior knowledge of the gene expression patterns. In other words, it permits the inclusion of genes of particular interest that are known to be related to the phenotype, whether or not these gene profiles are statistically correlated with the phenotype.

Gene pairs with statistically significant synergy scores suggest potential combinatorial effects of those genes on the phenotype. However, the scores cannot distinguish between the types of combinatorial effects, such as additive or antagonistic. Nevertheless, this limitation can be addressed by integrating physical interaction data (i.e., protein-protein and protein-DNA interaction) into the analysis.

In addition, the proposed analysis pipeline consists of several steps (see Figure 5.1), including selection of genes relevant to the phenotype, calculation of the synergy scores for each gene pair, evaluation of the statistical significance of the synergy score, and analysis of the network topology to identify biologically relevant genes. The methods for these steps are not limited to those proposed in this chapter; alternative methods for each of the steps in the framework can be used. For example, pattern recognition methods can be used to select the genes, discretization-based entropy estimation can be used to calculate the synergy score, a Bayesian FDR control procedure can be used to evaluate the statistical significance of the synergy score, and so on. A comparison of the different methods for each step could be performed to determine the optimal procedure for each step.

5.7 Application Notes

Based upon the concept of synergy, we reconstructed a synergy network specifically for the phenotype of saturated FFA-induced cytotoxicity. In this application, we first selected the phenotype-relevant genes by integrating the metabolites altered by saturated FFAs [11] with the global gene expression profile and extracting the genes that followed the trends of the metabolites. From the selected genes, the synergy analysis revealed synergistic gene pairs, which were used to build a synergy network. The reconstructed network suggested potential gene targets that may play central roles in the induction of the phenotype.



5.7.1 Topological characteristics of the synergy network

The synergy network, shown in Figure 5.3, is composed of 292 genes with 436 connection edges. The synergy network is characterized by relatively short path lengths, ranging from 2 to 10 (Figure 5.4), while the characteristic path length, or average diameter, of the network is 4.872. The network demonstrates small world characteristics of real networks [41], suggesting that the propagation of communication between the genes is relatively fast.

The degree distribution, P(k), provides the probability that a randomly selected vertex has k links to its neighbors. A power law distribution suggests that P(k) ~ k^(−γ), where k is the degree and γ is the degree exponent, and in most biological, scale-free networks γ ranges between 2 and 3 [41]. The degree distribution of our synergy network is shown in Figure 5.5, and γ ∼ 2, suggesting the synergy network, similar to other biological networks, is scale-free. Therefore, most of the genes are sparsely connected, while a few of the genes (hubs) are connected to many genes and play important roles in sustaining the integrity of the network, which suggests their importance in the biological function or the phenotype. In summary, the topology of the synergy network differs from the bell-like Poisson distribution characteristic of a random and statistically homogeneous network and suggests the existence of hub genes.

| Problem | Possible Cause | Solution |
|---|---|---|
| Too many genes are selected | The criterion for gene selection is too loose | Use stricter criteria in the statistical test (i.e., a lower p-value cutoff); incorporate prior knowledge to remove the genes that are irrelevant to the research target; try different gene selection methods |
| Too few genes are selected | The criterion for gene selection is too strict | Relax the statistical test (i.e., using a higher p-value cutoff); add more genes of interest based on prior knowledge; try different gene selection methods |
| The synergy network is too big for interpretation | False positive: nonsynergistic gene pairs are included | Run more permutation tests to reduce the variance; use a lower p-value cutoff |
| The synergy network is too small | The criterion of the permutation test is too strict | Use a higher p-value cutoff |

Figure 5.3 The synergy network. The synergy network is composed of 292 genes with 436 connection edges. The size of each node indicates its degree.
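The degree exponent can be estimated from a log-log plot of the degree distribution, as in Figure 5.5, where the fitted line is Log(Ndegree) = −1.9 Log(degree) + 2.3005. The sketch below shows one way such a line could be obtained; it assumes an ordinary least-squares fit on base-10 logarithms, which the chapter does not state explicitly, and the function name `fit_power_law` is hypothetical.

```python
from math import log10


def fit_power_law(degree_counts):
    """Fit log10(N_k) = -gamma * log10(k) + intercept by least squares.

    degree_counts maps a degree k to the number of nodes with that
    degree. Returns (gamma, intercept). Degrees or counts that are not
    positive are skipped because their logarithm is undefined.
    """
    points = [
        (log10(k), log10(n))
        for k, n in degree_counts.items()
        if k > 0 and n > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two distinct degrees to fit a line")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, intercept
```

A least-squares fit on binned log-log data is simple but can be sensitive to the sparse high-degree tail; it is used here only to mirror the linear fit reported in the figure.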

5.7.2 Hub genes in the network

The genes in the synergy network are listed and ranked by their degree (http://www.chems.msu.edu/groups/chan/gene pairs and hub genes.xls). Table 5.1 lists the genes with the highest degree, which are therefore “hub genes” in the synergy network. These genes include P4HA1, AHDC1, MACF1, INSIG2, and SH3RF2.

P4HA1, proline 4-hydroxylase, alpha polypeptide, is the alpha subunit of the protein proline 4-hydroxylase (P4H). P4H is a key enzyme involved in the biosynthesis of collagens [42, 43]. Collagens, the most abundant proteins in the extracellular matrix (ECM) and the main proteins of connective tissues, play important roles in regulating cellular activities and are linked to multiple diseases, including cardiovascular diseases and cancers [44–46]. In liver, alteration in the synthesis of collagen is related to liver cell apoptosis, hepatic fibrosis, and cirrhosis [47–49]. Physiologically, the synthesis, processing, secretion, and degradation of collagens are tightly modulated by regulatory factors, including P4HA1. As a critical functional subunit of P4H, P4HA1 is involved in the post-translational modification of procollagen [42, 50]. As shown in Figure 5.6, P4HA1 is located within the endoplasmic reticulum and catalyzes the post-translational formation of 4-hydroxyproline in the -Xaa-Pro-Gly- sequences in procollagens, which is essential for proper folding of the procollagen polypeptide [50]. Inhibiting P4H generates unstable intracellular collagens which cannot be secreted [51]. On the other hand, over-expressing P4HA1 causes excess synthesis of collagen [52]. The deregulation of P4H or P4HA1 in collagen synthesis has been linked to cytotoxicity and apoptosis in various types of cells [53–56]. P4HA1 regulates the ECM components by controlling the synthesis and secretion of procollagens, which modulates fibrosis, cell proliferation, and apoptosis [53–56]. The role of P4HA1 in palmitate-induced lipotoxicity is unclear. In our data, however, the level of P4HA1 is significantly down-regulated in palmitate as compared to the control (p=0.0014) and oleate (p=0.018) samples. Therefore, it is plausible that palmitate-induced cytotoxicity is mediated, in part, by P4HA1 through altered synthesis or improper folding of collagen peptides, although experimental validations are needed to confirm this hypothesis.

Figure 5.4 The distribution of shortest path lengths in the synergy network. (Axes: shortest path length, 1 to 13, versus frequency, 0 to 20,000.)

Figure 5.5 The degree distribution of the synergy network. For a degree value k, Ndegree is the number of genes with degree k in the network. (Axes: Log(degree) versus Log(Ndegree); fitted line: Log(Ndegree) = −1.9 Log(degree) + 2.3005.)

In addition, a number of proteins have been identified or predicted to interact with P4HA1, from publicly available protein-protein interaction databases, such as KEGG [58] and STRING [59, 60]. An interaction network (Figure 5.7) obtained from STRING shows some of the proteins that potentially interact with P4HA1. P4HA1 is centrally positioned and integrates different gene clusters in this network, supporting our result that P4HA1 is a hub gene (with the highest degree, 22) in the synergy network.

Table 5.1 The Hub Genes in the Synergy Network.

| Gene Symbol | Degree | Full Name |
|---|---|---|
| P4HA1 | 22 | procollagen-proline, 2-oxoglutarate 4-dioxygenase (proline 4-hydroxylase), alpha polypeptide |
| AHDC1 | 20 | AT hook, DNA binding motif, containing 1 |
| MACF1 | 19 | microtubule-actin crosslinking factor 1 |
| INSIG2 | 18 | insulin induced gene 2 |
| SH3RF2 | 13 | SH3 domain containing ring finger 2 |
| … | | |

Figure 5.6 P4HA1 catalyzes the formation of 4-hydroxyproline. (Graph adapted from MetaCyc database (http://www.metacyc.org/) [57].) Panel title: P4HA1: prolyl 4-hydroxylase is a key enzyme in collagen synthesis. Panel labels: 2-ketoglutarate and procollagen-L-proline; succinate and procollagen trans 4-hydroxy-L-proline; enzyme: proline 4-hydroxylase.

AHDC1, AT hook, DNA binding motif, containing 1, contains 2 AT hook DNA binding domains and can be phosphorylated upon DNA damage, probably by ATM or ATR [61]. Although the in vivo function or phenotype of this gene has not been identified, it is known that AHDC1 encodes seven different isoforms, some of which contain HMG-I and HMG-Y DNA-binding domains [62]. HMG proteins are involved in nucleosome phasing, 3’ end processing of mRNA transcripts, and transcription of genes close to AT-rich regions, and are thereby related to the pathogenesis of inflammatory and autoimmune diseases [63–65]. Although the function of AHDC1 is currently unknown, it may be involved in DNA damage and inflammatory responses. The significant alteration of this gene by palmitate (p=0.015 for palmitate versus BSA and 0.014 for palmitate versus oleate) and the identification of this gene in the synergy network hint at the possibility that palmitate may affect DNA damage and inflammatory responses through AHDC1.

MACF1, microtubule-actin crosslinking factor 1, also called ACF7 (actin cross-linking factor 7), is a member of the spectraplakin family of cytoskeletal cross-linking proteins that possess actin- and microtubule-binding domains [66, 67]. It may be involved in microtubule dynamics to facilitate actin-microtubule interactions at the cell periphery and in coupling the microtubule network to cellular junctions [62]. Cell-cell contact and cell-surface interactions through the cytoskeleton and ECM are involved in the control and regulation of cell motility, tissue remodeling, gene expression, differentiation, and proliferation [68]. In the literature, a large number of cytoskeletal and ECM genes were found to be down-regulated by palmitate treatment [69, 70]. Therefore, it is possible that our method has identified two central genes (i.e., P4HA1 and MACF1) in mediating the effect of palmitate on the ECM and cytoskeletal structure.

Figure 5.7 The protein-protein interaction network associated with P4HA1. Network was generated from the STRING database (http://string71.embl.de/) [59, 60]. P4HA1, shown in red, is located in the middle of the network.

INSIG2, insulin-induced gene 2, encodes a protein that blocks the processing of sterol regulatory element binding proteins (SREBPs), which regulate human lipogenic and adipocyte metabolism [71]. As shown in Figure 5.8, in the endoplasmic reticulum (ER), SREBP cleavage-activating protein (SCAP) can bind to the regulation domain of SREBP and transfer SREBP into the Golgi apparatus, where SCAP activates protease S1P to cleave the regulation domain and activate the transcription activation/DNA binding domain of SREBP. Activated SREBP then can be transported to the nucleus to bind to the cis-element SRE to promote the expression of a series of enzymes that are involved in lipid synthesis [72]. INSIG2 can bind to SCAP and inhibit its function, thereby blocking lipid synthesis [73]. Indeed, reduced INSIG2 levels in adipocytes resulted in SREBP activation, which increased the expression of genes involved in adipogenesis [74]. In our system of HepG2 cells, microarray analysis found that the gene expression levels of INSIG2 were down-regulated by both palmitate (p=0.011) and oleate (p=0.14). Therefore, the suppression of INSIG2 expression likely contributes to the increased TG synthesis observed in the FFA cultures [10].

SH3RF2, SH3 domain containing ring finger 2, is a putative protein whose function is unknown, but from sequence analysis SH3RF2 contains 3 Src homology 3 (SH3) domains and a RING-type zinc finger domain (Figure 5.9, from the InterPro database). The RING-type zinc finger domain is found in many E3 ubiquitin-protein ligases [75]. E3 ubiquitin-protein ligases determine the substrate specificity for ubiquitination and are therefore involved in targeting proteins for degradation by the Ubiquitin-Proteasome System [76]. During this process, RING fingers, by interacting with E2 ubiquitin-conjugating enzymes, promote ubiquitination [77, 78]. Although the exact function of SH3RF2 is not known, the RING-type zinc finger domain suggests SH3RF2 as a putative



Figure 5.8 INSIG2 plays an important role in lipid synthesis [72, 73]. (Figure adapted from [72].) By binding to SCAP, INSIG2 blocks the processing of SREBPs and therefore suppresses lipid synthesis.

E3 ubiquitin-protein ligase, which may be involved in the protein degradation pathway through the Ubiquitin-Proteasome System. In addition, SH3RF2 contains another type of domain, SH3 domains, which are present in many proteins involved in intracellular signal transduction pathways [79, 80]. SH3 domains recognize and bind to the proline-rich motifs (-X-P-P-X-P-) on the associated proteins. Therefore, SH3 domains are recruited by the signaling proteins to direct protein-protein interactions and thereby specify distinct regulatory pathways mediated by different protein binding domains [81, 82]. Taken together, the SH3 domains of SH3RF2 could potentially serve as a targeting domain that determines the substrate specificity of SH3RF2 as a putative E3 ubiquitin-protein ligase, thereby enabling SH3RF2 to subject certain types of proteins to degradation via the ubiquitin-proteasome system. Interestingly, this gene is significantly up-regulated in the palmitate culture (p=0.019 for palmitate versus BSA and 0.025 for palmitate versus oleate). This result is consistent with the literature suggesting that palmitate induces protein degradation.
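As a minimal sketch of how the proline-rich motif mentioned above could be located in a protein sequence, the following Python snippet scans a sequence for the pattern X-P-P-X-P, where X is assumed to stand for any residue. The function name and the example sequence are hypothetical and are not taken from SH3RF2 or any of its binding partners.

```python
import re

# Pattern for the proline-rich motif X-P-P-X-P; "." stands for any residue (X).
# The lookahead allows overlapping occurrences to be reported as well.
MOTIF = re.compile(r"(?=(.PP.P))")

def find_proline_rich_motifs(sequence):
    """Return (start_index, matched_text) for every X-P-P-X-P occurrence."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(sequence.upper())]

# Hypothetical sequence used only for illustration.
example = "MSAPPLPRKTAPPVPPPAPG"
hits = find_proline_rich_motifs(example)
for start, text in hits:
    print(start, text)
```

A pattern match of this kind only flags candidate binding sites; whether a given SH3 domain actually binds the site would still have to be established experimentally.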

Chronic treatment with palmitate increases the levels of unfolded or misfolded proteins, which induces endoplasmic reticulum (ER) stress [83, 84]. This lends support to our finding that P4HA1 was down-regulated in palmitate, suggesting the possibility that procollagen may be misfolded in the palmitate culture as compared to the oleate and control cultures. The accumulation of unfolded or misfolded proteins activates the ubiquitin-proteasome system. Indeed, recent studies found palmitate strongly enhances ubiquitination by activating E3 ubiquitin ligases, resulting in enhanced protein degradation. Therefore, SH3RF2, a putative ubiquitin-protein ligase, may be potentially involved in palmitate-induced cytotoxicity by triggering ubiquitination. In addition, SH3RF2, composed of 3 SH3 domains and a zinc finger domain, which are all protein-protein interaction domains, could recruit a diverse set of proteins for degradation, supporting its central position in our synergy network. Therefore, the synergy network identified a novel protein, SH3RF2, which potentially plays a central role in palmitate-induced cytotoxicity. The domain knowledge of this protein suggests SH3RF2 may be involved in palmitate-induced cytotoxicity by recruiting unfolded proteins and triggering ubiquitination.

In summary, we built a synergy network specifically for palmitate-induced cytotoxicity. This network is scale-free and has multiple hub genes. The hub genes are related to cellular activities, such as cellular contact, cytotoxicity, metabolic pathways, and protein degradation, which may play important roles in palmitate-induced cytotoxicity. Therefore, these hub genes suggest potential mechanisms involved in palmitate-induced cytotoxicity.
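The topology analysis summarized above can be sketched in a few lines of Python: node degrees are counted from an edge list, the degree distribution P(k) is tabulated, and nodes above a chosen degree threshold are reported as hubs. The edge list, node names, and threshold below are hypothetical placeholders rather than the synergy network of this chapter, and a network this small cannot show the power-law decay of P(k) expected of a scale-free network.

```python
from collections import Counter

def node_degrees(edges):
    """Count how many distinct neighbors each node has in an undirected network."""
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {node: len(nbrs) for node, nbrs in neighbors.items()}

def degree_distribution(degrees):
    """Map each degree k to the fraction of nodes with that degree, P(k)."""
    counts = Counter(degrees.values())
    n = len(degrees)
    return {k: c / n for k, c in sorted(counts.items())}

def hub_nodes(degrees, min_degree):
    """Return nodes whose degree is at least min_degree, highest degree first."""
    hubs = [node for node, k in degrees.items() if k >= min_degree]
    return sorted(hubs, key=lambda node: (-degrees[node], node))

# Hypothetical toy edge list; node names are placeholders, not the chapter's genes.
edges = [("hub1", "g1"), ("hub1", "g2"), ("hub1", "g3"), ("hub1", "g4"),
         ("hub2", "g4"), ("hub2", "g5"), ("hub2", "g6"), ("g1", "g2")]
degrees = node_degrees(edges)
print(degree_distribution(degrees))
print(hub_nodes(degrees, min_degree=3))
```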

5.8 Summary Points

In this chapter, we have achieved the following goals:


Figure 5.9 The protein domains of SH3RF2. Figure modified from the InterPro database. SH3RF2 contains 3 Src homology 3 (SH3) domains and a RING-type zinc finger domain.

1. Integrated the gene and metabolite profiles to identify a select group of genes that may be involved in palmitate-induced cytotoxicity.

2. Reconstructed a phenotype-specific synergy network.

Topology analysis of the synergy network revealed scale-free characteristics and multiple hub genes, which are typical characteristics shared by many biological networks. These hub genes suggest potential mechanisms and may be targets for modulating palmitate-induced cytotoxicity.

Acknowledgments

This research was supported in part by the Michigan State University (MSU) Quantitative Biology and Modeling Initiative Fellowship, the MSU Foundation, the National Science Foundation (BES 0425821 and DBI 0701709), and the National Institutes of Health (R01GM079688-01, R21CA126136-01, R21RR024439, and R21GM075838).

References

[1] Scheen, A.J., and F.H. Luyckx, “Obesity and liver disease,” Best Pract. Res. Clin. Endocrinol. Metab., Vol. 16, No. 4, December 2002, pp. 703–716.

[2] Farrell, G.C., and C.Z. Larter, “Nonalcoholic fatty liver disease: from steatosis to cirrhosis,” Hepatology, Vol. 43, No. 2, Suppl. 1, February 2006, pp. S99–S112.

[3] Li, Z., S. Srivastava, S. Mittal, X. Yang, L. Sheng, and C. Chan, “A Three Stage Integrative Pathway Search (TIPS) framework to identify toxicity relevant genes and pathways,” BMC Bioinformatics, Vol. 8, 2007, p. 202.

[4] Said, M.R., T.J. Begley, A.V. Oppenheim, D.A. Lauffenburger, and L.D. Samson, “Global network analysis of phenotypic effects: protein networks and toxicity modulation in Saccharomyces cerevisiae,” Proc. Natl. Acad. Sci. USA, Vol. 101, No. 52, December 28, 2004, pp. 18006–18011.

[5] Srivastava, S., Z. Li, X. Yang, M. Yedwabnick, S. Shaw, and C. Chan, “Identification of genes that regulate multiple cellular processes/responses in the context of lipotoxicity to hepatoma cells,” BMC Genomics, Vol. 8, 2007, p. 364.

[6] Tusher, V.G., R. Tibshirani, and G. Chu, “Significance analysis of microarrays applied to the ionizing radiation response,” Proc. Natl. Acad. Sci. USA, Vol. 98, No. 9, April 24, 2001, pp. 5116–5121.

[7] Li, Z., and C. Chan, “Integrating gene expression and metabolic profiles,” J. Biol. Chem., Vol. 279, No. 26, June 25, 2004, pp. 27124–27137.

[8] Li, Z., S. Srivastava, X. Yang, S. Mittal, P. Norton, J. Resau, B. Haab, and C. Chan, “A hierarchical approach employing metabolic and gene expression profiles to identify the pathways that confer cytotoxicity in HepG2 cells,” BMC Syst. Biol., Vol. 1, 2007, p. 21.

[9] Lam, T.K., A. Carpentier, G.F. Lewis, G. van de Werve, I.G. Fantus, and A. Giacca, “Mechanisms of the free fatty acid-induced increase in hepatic glucose production,” Am. J. Physiol. Endocrinol. Metab., Vol. 284, No. 5, May 2003, pp. E863–873.

[10] Listenberger, L.L., X. Han, S.E. Lewis, S. Cases, R.V. Farese, Jr., D.S. Ory, and J.E. Schaffer, “Triglyceride accumulation protects against fatty acid-induced lipotoxicity,” Proc. Natl. Acad. Sci. USA, Vol. 100, No. 6, March 18, 2003, pp. 3077–3082.

[11] Li, Z., S. Srivastava, R. Findlan, and C. Chan, “Using dynamic gene module map analysis to identify targets that modulate free fatty acid induced cytotoxicity,” Biotechnol. Prog., Vol. 24, No. 1, January–February 2008, pp. 29–37.

[12] Eisen, M.B., P.T. Spellman, P.O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proc. Natl. Acad. Sci. USA, Vol. 95, No. 25, December 8, 1998, pp. 14863–14868.

[13] Liang, K.C., and X. Wang, “Gene regulatory network reconstruction using conditional mutual information,” EURASIP J. Bioinform. Syst. Biol., Vol. No. 2008, p. 253894.

[14] Basso, K., A.A. Margolin, G. Stolovitzky, U. Klein, R. Dalla-Favera, and A. Califano, “Reverse engineering of regulatory networks in human B cells,” Nat. Genet., Vol. 37, No. 4, April 2005, pp. 382–390.



[15] Pe’er, D., A. Regev, G. Elidan, and N. Friedman, “Inferring subnetworks from perturbed expression profiles,” Bioinformatics, Vol. 17, Suppl. 1, 2001, pp. S215–224.

[16] Furey, T.S., N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler, “Support vector machine classification and validation of cancer tissue samples using microarray expression data,” Bioinformatics, Vol. 16, No. 10, October 2000, pp. 906–914.

[17] Tang, E.K., P.N. Suganthan, and X. Yao, “Gene selection algorithms for microarray data based on least squares support vector machine,” BMC Bioinformatics, Vol. 7, 2006, p. 95.

[18] Diaz-Uriarte, R., and S. Alvarez de Andres, “Gene selection and classification of microarray data using random forest,” BMC Bioinformatics, Vol. 7, 2006, p. 3.

[19] Paul, T.K., and H. Iba, “Gene selection for classification of cancers using probabilistic model building genetic algorithm,” Biosystems, Vol. 82, No. 3, December 2005, pp. 208–225.

[20] Watkinson, J., X. Wang, T. Zheng, and D. Anastassiou, “Identification of gene interactions associated with disease from gene expression data using synergy networks,” BMC Syst. Biol., Vol. 2, 2008, p. 10.

[21] Dai, Y.S., P. Cserjesi, B.E. Markham, and J.D. Molkentin, “The transcription factors GATA4 and dHAND physically interact to synergistically activate cardiac gene expression through a p300-dependent mechanism,” J. Biol. Chem., Vol. 277, No. 27, July 5, 2002, pp. 24390–24398.

[22] Schneidman, E., W. Bialek, and M.J. Berry, 2nd, “Synergy, redundancy, and independence in population codes,” J. Neurosci., Vol. 23, No. 37, December 17, 2003, pp. 11539–11553.

[23] Brenner, N., S.P. Strong, R. Koberle, W. Bialek, and R.R. de Ruyter van Steveninck, “Synergy in a neural code,” Neural Comput., Vol. 12, No. 7, July 2000, pp. 1531–1552.

[24] Varadan, V., D.M. Miller, 3rd, and D. Anastassiou, “Computational inference of the molecular logic for synaptic connectivity in C. elegans,” Bioinformatics, Vol. 22, No. 14, July 15, 2006, pp. e497–e506.

[25] Varadan, V., and D. Anastassiou, “Inference of disease-related molecular logic from systems-based microarray analysis,” PLoS Comput. Biol., Vol. 2, No. 6, June 16, 2006, p. e68.

[26] Anastassiou, D., “Computational analysis of the synergy among multiple interacting genes,” Mol. Syst. Biol., Vol. 3, 2007, p. 83.

[27] Cianflone, K., H. Vu, Z. Zhang, and A.D. Sniderman, “Effects of albumin on lipid synthesis, apoB-100 secretion, and LDL catabolism in HepG2 cells,” Atherosclerosis, Vol. 107, No. 2, June 1994, pp. 125–135.

[28] Guo, W., N. Huang, J. Cai, W. Xie, and J.A. Hamilton, “Fatty acid transport and metabolism in HepG2 cells,” Am. J. Physiol. Gastrointest. Liver Physiol., Vol. 290, No. 3, March 2006, pp. G528–534.

[29] Artwohl, M., M. Roden, W. Waldhausl, A. Freudenthaler, and S.M. Baumgartner-Parzer, “Free fatty acids trigger apoptosis and inhibit cell cycle progression in human vascular endothelial cells,” FASEB J., Vol. 18, No. 1, January 1, 2004, pp. 146–148.

[30] Peters, T., All About Albumin: Biochemistry, Genetics, and Medical Applications, San Diego, CA: Academic Press, 1996.

[31] Srivastava, S., and C. Chan, “Hydrogen peroxide and hydroxyl radicals mediate palmitate-induced cytotoxicity to hepatoma cells: Relation to mitochondrial permeability transition,” Free Radic. Res., Vol. 41, No. 1, January 2006, pp. 38–49.

[32] Chan, C., F. Berthiaume, K. Lee, and M.L. Yarmush, “Metabolic flux analysis of cultured hepatocytes exposed to plasma,” Biotechnol. Bioeng., Vol. 81, No. 1, January 5, 2003, pp. 33–49.

[33] Chan, C., F. Berthiaume, K. Lee, and M.L. Yarmush, “Metabolic flux analysis of hepatocyte function in hormone- and amino acid-supplemented plasma,” Metab. Eng., Vol. 5, No. 1, January 2003, pp. 1–15.

[36] Benjamini, Y., and Y. Hochberg, “Controlling the false discovery rate: A practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society Series B-Methodological, Vol. 57, No. 1, 1995, pp. 289–300.

[37] Christensen, C., A. Gupta, C.D. Maranas, and R. Albert, “Large-scale inference and graph-theoretical analysis of gene-regulatory networks in B-Subtilis,” Physica A-Statistical Mechanics and Its Applications, Vol. 373, January 1, 2007, pp. 796–810.

[38] Cormen, T.H., C.E. Leiserson, and R.L. Rivest, Introduction to Algorithms, Cambridge, MA: MIT Press, 2001.

[39] Rasche, A., H. Al-Hasani, and R. Herwig, “Meta-analysis approach identifies candidate genes and associated molecular networks for type-2 diabetes mellitus,” BMC Genomics, Vol. 9, 2008, p. 310.

[40] Jiang, W., X. Li, S. Rao, L. Wang, L. Du, C. Li, C. Wu, H. Wang, Y. Wang, and B. Yang, “Constructing disease-specific gene networks using pair-wise relevance metric: application to colon cancer identifies interleukin 8, desmin and enolase 1 as the central elements,” BMC Syst. Biol., Vol. 2, 2008, p. 72.

[41] Barabasi, A.L., and Z.N. Oltvai, “Network biology: Understanding the cell’s functional organization,” Nature Reviews Genetics, Vol. 5, No. 2, February 2004, pp. 101–U115.


[42] Chen, L., Y.H. Shen, X. Wang, J. Wang, Y. Gan, N. Chen, S.A. LeMaire, J.S. Coselli, and X.L. Wang, “Human prolyl-4-hydroxylase alpha(I) transcription is mediated by upstream stimulatory factors,” J. Biol. Chem., Vol. 281, No. 16, April 21, 2006, pp. 10849–10855.

[43] Annunen, P., H. Autio-Harmainen, and K.I. Kivirikko, “The novel type II prolyl 4-hydroxylase is the main enzyme form in chondrocytes and capillary endothelial cells, whereas the type I enzyme predominates in most cells,” J. Biol. Chem., Vol. 273, No. 11, March 13, 1998, pp. 5989–5992.

[44] Bedossa, P., and V. Paradis, “Liver extracellular matrix in health and disease,” J. Pathol., Vol. 200, No. 4, July 2003, pp. 504–515.

[45] Rodriguez-Feo, J.A., J.P. Sluijter, D.P. de Kleijn, and G. Pasterkamp, “Modulation of collagen turnover in cardiovascular disease,” Curr. Pharm. Des., Vol. 11, No. 19, 2005, pp. 2501–2514.

[46] Stone, P.J., “Potential use of collagen and elastin degradation markers for monitoring liver fibrosis in schistosomiasis,” Acta Trop., Vol. 77, No. 1, October 23, 2000, pp. 97–99.

[47] Bickel, M., K.H. Baringhaus, M. Gerl, V. Gunzler, J. Kanta, L. Schmidts, M. Stapf, G. Tschank, K. Weidmann, and U. Werner, “Selective inhibition of hepatic collagen accumulation in experimental liver fibrosis in rats by a new prolyl 4-hydroxylase inhibitor,” Hepatology, Vol. 28, No. 2, August 1998, pp. 404–411.

[48] Clement, B., C. Chesne, A.P. Satie, and A. Guillouzo, “Effects of the prolyl 4-hydroxylase proinhibitor HOE 077 on human and rat hepatocytes in primary culture,” J. Hepatol., Vol. 13, Suppl. 3, 1991, pp. S41–47.

[49] Faouzi, S., B. Le Bail, V. Neaud, L. Boussarie, J. Saric, P. Bioulac-Sage, C. Balabaud, and J. Rosenbaum, “Myofibroblasts are responsible for collagen synthesis in the stroma of human hepatocellular carcinoma: an in vivo and in vitro study,” J. Hepatol., Vol. 30, No. 2, February 1999, pp. 275–284.

[50] Myllyharju, J., “Prolyl 4-hydroxylases, the key enzymes of collagen biosynthesis,” Matrix Biol., Vol. 22, No. 1, March 2003, pp. 15–24.

[51] Rocnik, E.F., B.M. Chan, and J.G. Pickering, “Evidence for a role of collagen synthesis in arterial smooth muscle cell migration,” J. Clin. Invest., Vol. 101, No. 9, May 1, 1998, pp. 1889–1898.

[52] John, D.C., R. Watson, A.J. Kind, A.R. Scott, K.E. Kadler, and N.J. Bulleid, “Expression of an engineered form of recombinant procollagen in mouse milk,” Nat. Biotechnol., Vol. 17, No. 4, April 1999, pp. 385–389.

[53] Xia, S.H., J. Wang, and J.X. Kang, “Decreased n-6/n-3 fatty acid ratio reduces the invasive potential of human lung cancer cells by downregulation of cell adhesion/invasion-related genes,” Carcinogenesis, Vol. 26, No. 4, April 2005, pp. 779–784.

[54] Huet, C., P. Monget, C. Pisselet, and D. Monniaux, “Changes in extracellular matrix components and steroidogenic enzymes during growth and atresia of antral ovarian follicles in the sheep,” Biol. Reprod., Vol. 56, No. 4, April 1997, pp. 1025–1034.

[55] Dong, M.S., S.H. Jung, H.J. Kim, J.R. Kim, L.X. Zhao, E.S. Lee, E.J. Lee, J.B. Yi, N. Lee, Y.B. Cho, W.J. Kwak, and Y.I. Park, “Structure-related cytotoxicity and anti-hepatofibric effect of asiatic acid derivatives in rat hepatic stellate cell-line, HSC-T6,” Arch. Pharm. Res., Vol. 27, No. 5, May 2004, pp. 512–517.

[56] Ju, H., J. Hao, S. Zhao, and I.M. Dixon, “Antiproliferative and antifibrotic effects of mimosine on adult cardiac fibroblasts,” Biochim. Biophys. Acta, Vol. 1448, No. 1, November 19, 1998, pp. 51–60.

[57] Caspi, R., H. Foerster, C.A. Fulcher, P. Kaipa, M. Krummenacker, M. Latendresse, S. Paley, S.Y. Rhee, A.G. Shearer, C. Tissier, T.C. Walk, P. Zhang, and P.D. Karp, “The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases,” Nucleic Acids Res., Vol. 36, Database Issue, January 2008, pp. D623–D631.

[58] Aoki, K.F., and M. Kanehisa, “Using the KEGG database resource,” Curr. Protoc. Bioinformatics, Vol. 1, October 2005, pp. Unit 1 12.

[59] von Mering, C., L.J. Jensen, B. Snel, S.D. Hooper, M. Krupp, M. Foglierini, N. Jouffre, M.A. Huynen, and P. Bork, “STRING: known and predicted protein-protein associations, integrated and transferred across organisms,” Nucleic Acids Res., Vol. 33, Database Issue, January 1, 2005, pp. D433–437.

[60] von Mering, C., L.J. Jensen, M. Kuhn, S. Chaffron, T. Doerks, B. Kruger, B. Snel, and P. Bork, “STRING 7: recent developments in the integration and prediction of protein interactions,” Nucleic Acids Res., Vol. 35, Database Issue, January 2007, pp. D358–362.

[61] Matsuoka, S., B.A. Ballif, A. Smogorzewska, E.R. McDonald, 3rd, K.E. Hurov, J. Luo, C.E. Bakalarski, Z. Zhao, N. Solimini, Y. Lerenthal, Y. Shiloh, S.P. Gygi, and S.J. Elledge, “ATM and ATR substrate analysis reveals extensive protein networks responsive to DNA damage,” Science, Vol. 316, No. 5828, May 25, 2007, pp. 1160–1166.

[62] Thierry-Mieg, D., and J. Thierry-Mieg, “AceView: A comprehensive cDNA-supported gene and transcripts annotation,” Genome Biol., Vol. 7, Suppl. 1, 2006, pp. S12 11–14.



[63] Voll, R.E., V. Urbonaviciute, M. Herrmann, and J.R. Kalden, “High mobility group box 1 in the pathogenesis of inflammatory and autoimmune diseases,” Isr. Med. Assoc. J., Vol. 10, No. 1, January 2008, pp. 26–28.

[64] Tesniere, A., T. Panaretakis, O. Kepp, L. Apetoh, F. Ghiringhelli, L. Zitvogel, and G. Kroemer, “Molecular characteristics of immunogenic cancer cell death,” Cell Death Differ., Vol. 15, No. 1, January 2008, pp. 3–12.

[65] Jiang, W., and D.S. Pisetsky, “Mechanisms of Disease: the role of high-mobility group protein 1 in the pathogenesis of inflammatory arthritis,” Nat. Clin. Pract. Rheumatol., Vol. 3, No. 1, January 2007, pp. 52–58.

[66] Kodama, A., I. Karakesisoglou, E. Wong, A. Vaezi, and E. Fuchs, “ACF7: An essential integrator of microtubule dynamics,” Cell, Vol. 115, No. 3, October 31, 2003, pp. 343–354.

[67] Gong, T.W., C.G. Besirli, and M.I. Lomax, “MACF1 gene structure: a hybrid of plectin and dystrophin,” Mamm. Genome, Vol. 12, No. 11, November 2001, pp. 852–861.

[68] Zamir, E., and B. Geiger, “Molecular complexity and dynamics of cell-matrix adhesions,” J. Cell Sci., Vol. 114, No. Pt. 20, October 2001, pp. 3583–3590.

[69] Draghici, S., P. Khatri, A.L. Tarca, K. Amin, A. Done, C. Voichita, C. Georgescu, and R. Romero, “A systems biology approach for pathway level analysis,” Genome Res., Vol. 17, No. 10, October 2007, pp. 1537–1545.

[70] Swagell, C.D., D.C. Henly, and C.P. Morris, “Expression analysis of a human hepatic cell line in response to palmitate,” Biochem. Biophys. Res. Commun., Vol. 328, No. 2, March 11, 2005, pp. 432–441.

[71] Krapivner, S., S. Popov, E. Chernogubova, M.L. Hellenius, R.M. Fisher, A. Hamsten, and F.M. van’t Hooft, “Insulin-induced gene 2 involvement in human adipocyte metabolism and body weight regulation,” J. Clin. Endocrinol. Metab., Vol. 93, No. 5, May 2008, pp. 1995–2001.

[72] Horton, J.D., J.L. Goldstein, and M.S. Brown, “SREBPs: activators of the complete program of cholesterol and fatty acid synthesis in the liver,” J. Clin. Invest., Vol. 109, No. 9, May 2002, pp. 1125–1131.

[73] Yabe, D., M.S. Brown, and J.L. Goldstein, “Insig-2, a second endoplasmic reticulum protein that binds SCAP and blocks export of sterol regulatory element-binding proteins,” Proc. Natl. Acad. Sci. USA, Vol. 99, No. 20, October 1, 2002, pp. 12753–12758.

[74] Rosen, E.D., C.J. Walkey, P. Puigserver, and B.M. Spiegelman, “Transcriptional regulation of adipogenesis,” Genes Dev., Vol. 14, No. 11, June 1, 2000, pp. 1293–1307.

[75] Freemont, P.S., “The RING finger. A novel protein sequence motif related to the zinc finger,” Ann. NY Acad. Sci., Vol. 684, June 11, 1993, pp. 174–192.

[76] Hershko, A., H. Heller, E. Eytan, and Y. Reiss, “The protein substrate binding site of the ubiquitin-protein ligase system,” J. Biol. Chem., Vol. 261, No. 26, September 15, 1986, pp. 11992–11999.

[77] Freemont, P.S., “RING for destruction?” Curr. Biol., Vol. 10, No. 2, January 27, 2000, pp. R84–R87.

[78] Barinaga, M., “A new finger on the protein destruction button,” Science, Vol. 286, No. 5438, October 8, 1999, pp. 223, 225.

[79] Cohen, G.B., R. Ren, and D. Baltimore, “Modular binding domains in signal transduction proteins,” Cell, Vol. 80, No. 2, January 27, 1995, pp. 237–248.

[80] Pawson, T., “Protein modules and signalling networks,” Nature, Vol. 373, No. 6515, February 16, 1995, pp. 573–580.

[81] Ren, R., B.J. Mayer, P. Cicchetti, and D. Baltimore, “Identification of a ten-amino acid proline-rich SH3 binding site,” Science, Vol. 259, No. 5098, February 19, 1993, pp. 1157–1161.

[82] Morton, C.J., and I.D. Campbell, “SH3 domains. Molecular ‘Velcro,’” Curr. Biol., Vol. 4, No. 7, July 1, 1994, pp. 615–617.

[83] Guo, W., S. Wong, W. Xie, T. Lei, and Z. Luo, “Palmitate modulates intracellular signaling, induces endoplasmic reticulum stress, and causes apoptosis in mouse 3T3-L1 and rat primary preadipocytes,” Am. J. Physiol. Endocrinol. Metab., Vol. 293, No. 2, August 2007, pp. E576–586.

[84] Karaskov, E., C. Scott, L. Zhang, T. Teodoro, M. Ravazzola, and A. Volchuk, “Chronic palmitate but not oleate exposure induces endoplasmic reticulum stress, which may contribute to INS-1 pancreatic beta-cell apoptosis,” Endocrinology, Vol. 147, No. 7, July 2006, pp. 3398–3407.


CHAPTER 6

Genome-Scale Analysis of Metabolic Networks

Ranjan Srivastava
Department of Chemical, Materials and Biomolecular Engineering, University of Connecticut, 191 Auditorium Road, U-3222, Storrs, CT 06269


Key terms: Mathematical modeling, Metabolic flux analysis, Flux balance analysis, Genome-scale

Abstract

Metabolic modeling, particularly at the genome scale, can be a useful tool in providing insights regarding metabolic processes for various organisms of interest. As with any other tool, however, it is subject to a number of limitations. If these limitations are kept in mind, metabolic modeling can provide a powerful means to understand and manipulate microorganisms for purposes of fundamental research, as well as for accomplishing practical objectives. These practical objectives may include the efficient production of commercially relevant products or drugs, or they may include better approaches to treating microbial pathogens. A strategy is presented here for developing, implementing, and analyzing metabolic models of prokaryotes. It is assumed that experimental data for such studies may be scarce. For this reason, an optimization-based approach for implementing metabolic models of underdetermined systems is reviewed and discussed.

6.1 Introduction

With the advent of high-throughput processes for studying biological systems, particularly at the subcellular, cellular, and tissue levels, arranging the collected data into a coherent theoretical framework is a nontrivial task. As technology advances, no abatement of this deluge of data is in sight. On the contrary, it is likely that the amount of data generated will only increase. Fortunately, computational biology is well suited to dealing with such high volumes of information. One particular approach that has been gaining popularity in leveraging the increasing amount of genomic and metabolomic data available is metabolic modeling [1–24]. This methodology utilizes computational biology, the theory of reaction kinetics, and applied mathematics to evaluate, analyze, and ultimately engineer metabolic networks.

Metabolic modeling is a powerful tool yielding great benefits for basic research, as well as being useful for applied purposes. From the basic research perspective, metabolic modeling allows one to carry out in silico experiments or computational simulations to address questions regarding the fundamentals of metabolism of various organisms. As a result, it may be used as a method for generating hypotheses or for screening which experiments will yield the most information regarding a specific question. Simulations, however, should not be considered as a substitute for experiments. Rather, the model should be viewed as a tool, similar to a microscope or some other piece of equipment in the lab. A model, by definition, is an approximation of nature. Thus, results should always be confirmed experimentally. However, the modeling approach does provide significant benefits. In particular, it may provide useful insights into what an organism is doing and/or why it is behaving in a given way. Also, by determining which experiments are likely to have the highest impact in helping to understand a particular phenomenon, modeling and simulations may allow efficient direction of scarce resources, both labor and financial, to make sure the most promising experiments are carried out.

Applications of metabolic modeling may be found along the biotechnological spectrum, ranging from the production of commercially important metabolites [1, 25–27] to analysis of biomedically important problems [28–30]. Through the use of metabolic and biochemical engineering, recombinant organisms have become a “workhorse” for the production of key metabolites that have significant commercial importance. Using metabolic modeling, it is possible to help determine how metabolic networks should be reconfigured or engineered to optimize production. From the biomedical perspective, by identifying how metabolic resources are distributed during a given pathology, it may be possible to identify means to treat the disease. Metabolic modeling has particular promise in helping to deal with microbial pathogens. For example, the expression of many virulence genes is regulated by carbon dioxide in pathogenic bacteria [31]; however, the link between this regulatory behavior and the many metabolic reactions in which carbon dioxide is involved has been relatively unexplored. Other potential applications of metabolic modeling include areas such as biofuels production, bioremediation, and engineering of microbial consortia to address a variety of issues facing society.

The term “genome-scale” metabolic modeling refers to the development of a quantitative framework describing the entire metabolic network of an organism. To date, most such models, whether genome-scale or not, have focused on prokaryotic organisms rather than eukaryotic organisms due to issues with intracellular compartmentalization. However, recently that has been changing [32–34]. Regardless of the type of organism used, the general approach taken to develop such a model is to start out with the annotated genome sequence. From there, one may reconstruct the metabolic network. Specifically, genes encoding enzymes involved in metabolic reactions are identified. If a particular enzyme is present, it is inferred that the associated metabolic reaction is present. In this way, the entire metabolic network may be built up from the genome sequence. However, there are several caveats associated with such an approach. For example, if the DNA sequence for an identified gene has partial homology to an enzyme known to catalyze a metabolic reaction, how much sequence identity must it share before the reaction is considered to be present? To some extent, this particular problem may be mitigated by determining whether or not other metabolic reactions related to the pathway in question are present. If the complementary pathways are present, that may argue for inclusion of the pathway suggested by the partially matching gene sequence [35]. Additionally, there may be genes that encode enzymes that have structural homology to a known metabolic enzyme but lack any type of sequence homology. Another possible source of error is the presence of genes whose functions, although currently unknown, interact with and impact metabolism. Finally, errors in annotating genes may also result in the incorrect and/or incomplete reconstruction of metabolic networks.
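The inference step described above (annotated gene, then enzyme, then reaction) can be sketched as a simple lookup. The following Python fragment is only a minimal illustration: the gene identifiers and both tables are hypothetical, the function name is an assumption, and a real reconstruction would draw on curated pathway databases and would need the homology and annotation checks discussed in the text.

```python
# Hypothetical annotation: gene identifier -> EC number assigned during annotation
# (None means that no enzymatic function was assigned).
annotated_genes = {
    "gene_0001": "2.7.1.1",
    "gene_0002": "5.3.1.9",
    "gene_0003": None,
    "gene_0004": "2.7.1.11",
}

# Small illustrative lookup table: EC number -> reaction catalyzed by that enzyme.
reaction_db = {
    "2.7.1.1": "glucose + ATP -> glucose-6-phosphate + ADP",
    "5.3.1.9": "glucose-6-phosphate <-> fructose-6-phosphate",
    "2.7.1.11": "fructose-6-phosphate + ATP -> fructose-1,6-bisphosphate + ADP",
    "4.1.2.13": "fructose-1,6-bisphosphate <-> DHAP + glyceraldehyde-3-phosphate",
}

def reconstruct_network(genes, reactions):
    """Infer reactions from annotated enzymes and list genes that could not be mapped."""
    network, unmapped = {}, []
    for gene, ec in genes.items():
        if ec in reactions:
            entry = network.setdefault(ec, {"reaction": reactions[ec], "genes": []})
            entry["genes"].append(gene)
        else:
            unmapped.append(gene)
    return network, unmapped

network, unmapped = reconstruct_network(annotated_genes, reaction_db)

# Reactions in the lookup table with no supporting gene are candidate gaps that
# may point to missed annotations or to genuinely absent pathway steps.
gaps = sorted(set(reaction_db) - set(network))
print(sorted(network), unmapped, gaps)
```

Listing the unmapped genes and the gaps separately mirrors the caveats above: both are places where the reconstruction should be revisited by hand.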

Although these issues may appear to be daunting, they actually highlight one of the primary benefits of genome-scale modeling. The mathematical model is effectively a quantitative hypothesis. By carrying out simulations and comparing results to what is actually observed experimentally, mismatches between experiments and theory may be identified. Using this information as a foundation, hypotheses regarding the connectivity and functioning of the metabolic network may be revised. New simulations may be carried out and compared to experimental results, and in this way, the model should ideally approach what is observed experimentally. As this iterative process progresses, the model may be used to elucidate what is occurring in reality and furthermore guide future experiments.

Once the metabolic network has been reconstructed, it is possible to convert it into a mathematical model. The conversion is accomplished through the use of the theory of reaction kinetics. Taking advantage of the knowledge of metabolite stoichiometry of the system of reactions, a system of ordinary differential equations (ODEs) describing how each metabolite varies with respect to time may be generated. Through various assumptions described in greater detail in Section 6.2, it is possible to simplify the equations from ODEs to algebraic equations. Generally speaking, at the genome-scale level, there are many more variables than there are equations, resulting in an under-determined system. It is possible to reduce the number of degrees of freedom through appropriate and well-designed experiments. Isotopic carbon labeling of substrate has proven to be a particularly useful approach [6, 10, 34, 36–40]. With sufficient data it is sometimes possible to reduce the number of degrees of freedom to the point where the system is completely determined or even over-determined. For the over-determined system, it is possible to use the extra data to provide a consistency check.
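The counting argument above can be illustrated with a short Python sketch. It assumes a hypothetical toy network with two balanced metabolites (A and B) and four fluxes, and it assumes that each measurement fixes one flux independently of the mass balances; the function names are illustrative only.

```python
import numpy as np

# Hypothetical toy network: v1: -> A (uptake), v2: A -> B,
# v3: A -> by-product (secreted), v4: B -> product (secreted).
# Rows are the balanced metabolites (A, B); columns are the fluxes v1..v4.
S_T = np.array([[1.0, -1.0, -1.0, 0.0],
                [0.0, 1.0, 0.0, -1.0]])

def degrees_of_freedom(stoich, n_measured):
    """Unknown fluxes minus independent balances minus measured fluxes."""
    n_fluxes = stoich.shape[1]
    n_independent_balances = np.linalg.matrix_rank(stoich)
    return int(n_fluxes - n_independent_balances - n_measured)

def classify(dof):
    """Label the system according to its remaining degrees of freedom."""
    if dof > 0:
        return "underdetermined"
    if dof == 0:
        return "determined"
    return "overdetermined"

for n_measured in range(4):
    dof = degrees_of_freedom(S_T, n_measured)
    print(n_measured, dof, classify(dof))
```

In this toy case two measured fluxes are enough to determine the system, and a third measurement supplies the redundancy needed for a consistency check.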

If sufficient data is unavailable to render the system determined, it is still possible to carry out the metabolic analysis using an optimization strategy. The premise of such an approach is that the organism is attempting to optimize some type of objective function. The objective function may be maximization of growth rate, optimization of energy efficiency, or some other biological process deemed appropriate. The optimization calculation may then be used to determine how metabolic resources should be distributed across pathways in order to best achieve the stated objective function. The objective function strategy is justified by assuming that since organisms have evolved under selection pressures, they are by their nature approaching some kind of optimal level. The issue of course is: What has a given organism been optimized for? Should one even be looking at the organismal level or is selection at the species level more appropriate?
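A minimal sketch of such an optimization, posed as a linear program and solved with SciPy's `linprog`, is given below. The four-flux network, the choice of flux v4 as a stand-in for growth, and the uptake limit of 10 (arbitrary units) are all assumptions made for illustration and are not taken from this chapter.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network: v1: -> A (uptake), v2: A -> B,
# v3: A -> by-product (secreted), v4: B -> biomass (stand-in for growth).
# Rows are steady-state balances on A and B; columns are the fluxes v1..v4.
S_T = np.array([[1.0, -1.0, -1.0, 0.0],
                [0.0, 1.0, 0.0, -1.0]])

# Objective: maximize v4. linprog minimizes, so the coefficient is negated.
c = np.array([0.0, 0.0, 0.0, -1.0])

# Assumed bounds: all reactions irreversible, uptake limited to 10 units.
bounds = [(0, 10), (0, None), (0, None), (0, None)]

result = linprog(c, A_eq=S_T, b_eq=np.zeros(2), bounds=bounds, method="highs")
optimal_fluxes = result.x
print(result.status, optimal_fluxes)
```

With these assumptions the optimizer routes all of the substrate to the biomass flux and none to the by-product. Changing the objective or the bounds changes the predicted distribution, which is why the choice of objective function discussed above matters.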

The focus of this work will be to aid the researcher in developing and carrying out simulations of prokaryotic organisms where the system is underdetermined.

6.2 Materials and Methods

6.2.1 Flux analysis theory

Metabolic flux analysis (MFA) is the technique by which flux distributions through metabolic pathways are either determined or predicted [6, 41, 42]. Fluxes are calculated through the development of stoichiometric models of the metabolic reaction network. Generally the theory of reaction kinetics requires that variation of intracellular metabolites over time be described via a system of differential equations such as shown in (6.1).

\[ \frac{dX}{dt} = r - \mu X \tag{6.1} \]

where X is a vector of the metabolites of interest, r is the vector of rate expressions, and μ is the bacterial growth rate. The term μX represents the dilution of the metabolites as the cells grow. Since intracellular metabolite levels tend to be very low and the dilution term is relatively small compared to the other reactions affecting the metabolite, the dilution term is generally assumed to be negligible [6]. If the experimental system can be manipulated to operate at a steady state or if it can be assumed that the response of metabolite pools to perturbations is very rapid, the variation with respect to time may be approximated as zero. The model is then reduced to a system of algebraic equations as illustrated by (6.2).

\[ 0 = r \tag{6.2} \]

It is further possible to write the rate expression in terms of the stoichiometric coefficients and their associated fluxes, such that

\[ r = S^{T}\nu \tag{6.3} \]

where $S^{T}$ is the matrix of stoichiometric coefficients and $\nu$ is the vector of fluxes. Substituting (6.3) into (6.2) results in

\[ 0 = S^{T}\nu \tag{6.4} \]

Knowing the metabolic network along with measured extracellular fluxes, it is possible to use (6.4) to determine metabolic fluxes. MFA allows one to determine a number of other cellular features. As pointed out by Stephanopoulos et al. [6], features such as nodal rigidity, alternative pathways, values of nonmeasurable fluxes, and maximum theoretical yields may be determined. Nodal rigidity refers to how much or how little the flux distribution through a given branch point will change when operating conditions are changed. Alternative pathway analysis may be required when several different pathways appear feasible, but the actual pathway is unknown. Through MFA, it may be possible to show that some of the pathways are actually not used (zero flux) or impossible (negative flux). Such an analysis was used to identify the correct pathway for citric acid fermentation in C. lipolytica [43]. Actual experimental measurement of some fluxes may simply be impossible. However, enough information from other fluxes may be available to uniquely identify what the unknown flux must be. Due to the knowledge of the stoichiometry of the reaction system, MFA also allows one to determine the maximum specific product yield for a given substrate.
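As a minimal sketch of how (6.4) can fix otherwise unmeasured fluxes, consider a hypothetical four-reaction network that is not taken from the text: v1 takes up substrate to form A, v2 converts A to B, v3 exports A as a by-product, and v4 exports B. The network, the names `S_T`, `solve_square`, and `determine_fluxes`, and the measured values are all illustrative assumptions. Two balances and four fluxes leave two degrees of freedom, so measuring two fluxes renders the system exactly determined.

```python
from fractions import Fraction

# Rows of S_T are metabolites, columns are fluxes v1..v4, as in r = S^T v.
REACTIONS = ["v1", "v2", "v3", "v4"]
S_T = [
    [1, -1, -1, 0],   # A: made by v1, consumed by v2 and v3
    [0, 1, 0, -1],    # B: made by v2, consumed by v4
]

def solve_square(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular system: the measurements do not fix all fluxes")
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[-1] for row in M]

def determine_fluxes(S_T, reactions, measured):
    """Move measured fluxes to the right-hand side of 0 = S^T v and solve the rest."""
    unknown = [j for j, name in enumerate(reactions) if name not in measured]
    if len(unknown) != len(S_T):
        raise ValueError("this sketch only handles the exactly determined case")
    A = [[row[j] for j in unknown] for row in S_T]
    b = [-sum(row[j] * Fraction(measured[name])
              for j, name in enumerate(reactions) if name in measured)
         for row in S_T]
    solution = solve_square(A, b)
    fluxes = {name: Fraction(value) for name, value in measured.items()}
    fluxes.update({reactions[j]: x for j, x in zip(unknown, solution)})
    return fluxes

# Assumed measurements: substrate uptake v1 = 10 and product export v4 = 6.
fluxes = determine_fluxes(S_T, REACTIONS, {"v1": 10, "v4": 6})
print({name: float(fluxes[name]) for name in REACTIONS})
```

With more measurements than degrees of freedom, the same balances would become over-determined, and a least-squares fit could replace the exact solve; that case is not covered by this sketch.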

Flux balance analysis (FBA) is a variant of MFA. In FBA, the goal is to find all the feasible flux distributions for an organism under prescribed conditions (rich media, minimal media, and so forth) [5, 18, 41, 44]. The resulting solution is a bounded convex cone made up of all the possible flux distributions in flux space. Experiments may then be carried out to determine where within the flux cone the actual flux distributions lie [17, 19]. It is also possible to perform an in silico analysis to determine the specific flux distribution of an organism by postulating an objective function that the organism is attempting to optimize. Once the objective function is specified, linear programming may be used to determine the flux distribution [17, 19, 44, 45]. Identification of the objective function is not trivial, as is discussed in Section 6.2.3.1, and is dependent upon the environment in which the organism finds itself. The benefit of FBA is that it may be carried out at the genome scale with limited data and still provide insight into how the organism can and will behave [3, 17–20, 45–48].
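Written out, the linear program takes roughly the following form. This is a sketch of a commonly used formulation rather than one stated explicitly in the text: the objective weight vector $c$ and the flux bounds $\nu^{\min}$ and $\nu^{\max}$ are notation assumed here, with the bounds standing in for the prescribed conditions (for example, a cap on substrate uptake).

\[
\begin{aligned}
\max_{\nu} \quad & c^{T}\nu \\
\text{subject to} \quad & S^{T}\nu = 0, \\
& \nu^{\min} \le \nu \le \nu^{\max}
\end{aligned}
\]

Under this notation, a vector $c$ with a single nonzero entry on the biomass-forming flux would correspond to maximizing growth, and changing the sign of $c$ turns the maximization into a minimization.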

6.2.2 Model development

Before a metabolic model may be generated, it is necessary to have a reconstruction of the metabolism for the organism of interest. Metabolic reconstructions for a number of organisms have already been carried out and are readily available on the Web. Two of the best-known repositories are BioCyc (http://www.biocyc.org) and the Kyoto Encyclopedia of Genes and Genomes, more commonly referred to as KEGG (http://www.genome.jp/kegg/). Both Web sites have the metabolic networks of hundreds of different organisms available. Both sites also allow users to download the metabolic networks to their personal computers in a variety of different formats, such as the Systems Biology Markup Language (SBML). With such a copy of the metabolic network available, one can use the information as the source for developing a mathematical model for simulation purposes.

If the metabolic reconstruction for the organism of interest is unavailable, it may still be possible to generate the metabolic network. As long as the annotated genome sequence for the organism is known, the network can be reconstructed using the Pathway Tools software package [49, 50], details of which may be found at http://bioinformatics.ai.sri.com/ptools/ptools-overview.html. The Pathway Tools software can take various genome sequence formats, such as the GenBank Database format (generally denoted by a “.gbk” extension at the end of the sequence file), and generate a Pathway Genome Database (PGDB). The PGDB may be analyzed directly to study the metabolic network. It may also be used to generate a file in SBML or another format more amenable to generating the mathematical model through direct parsing. An example of the E. coli network as generated by the Pathway Tools software is illustrated in Figure 6.1.

6.2.3 Objective function

Given that the systems to be modeled are generally under-determined, one method by which the distribution of metabolic resources may be estimated is through the utilization of optimization theory. In order to use such an approach, it is necessary to postulate an objective function. However, determination of the objective function is far from a trivial matter. Although the objective function approach does have its uses, it is critical to be aware of the issues associated with this method so that results from such modeling exercises will be evaluated in light of these limitations.

6.2.3.1 Objective function choices

Several different objective functions have been utilized for metabolic modeling purposes. Some key ones include:

• Maximization of biomass production;

• Maximization of ATP production rate;

• Minimization of nutrient uptake rate;

• Minimization of redox potential production rate;

• Minimization of ATP production rate.

Maximization of biomass production has been by far the most popular choice as an objective function [5, 17, 19, 44, 45]. The premise for this particular choice is that an organism that can outgrow its competition is the one that will ultimately dominate its niche. Thus the organism that distributes its metabolic resources in such a way as to accomplish this objective will be in the best position to survive.

Maximization of the ATP production rate is justified on the basis that by having excess ATP available, the cell is better able to leverage its existing metabolic resources [51–53]. As a result, such an organism is ultimately able to out-compete other organisms. Recent experimental studies have provided strong support for this approach as being a viable objective function in determining the distribution of metabolic fluxes [52].

Minimization of nutrient uptake rate is based on an efficiency argument [2]. The argument is best illustrated by an analogy comparing the organism of interest to a car. Given two automobiles, the first one requires a certain amount of gasoline to travel a fixed number of miles. The second car only requires half that amount of gasoline to go the same distance. Thus the second car is the better car, because it can travel the same distance using less fuel. Minimization of nutrient uptake is based on the same principle, effectively stating that given two organisms, the one capable of surviving on fewer nutrients is the more optimal one.

The remaining two objective functions are based on an optimization of energy efficiency argument. Minimization of either the redox potential production rate [2, 54] or the ATP production rate [2] follows the same rationale as that of the minimization of the nutrient uptake rate. The cell capable of functioning while requiring less energy is considered the “better” cell in this scenario.


Figure 6.1 The network shown is an example of the E. coli metabolic network as generated by the Pathway Tools software [49, 50]. Each node represents a metabolite, with the shape specifying the type of metabolite. For example, triangles represent amino acids, while squares represent carbohydrates. The lines connecting the nodes represent the metabolic or transport reactions that the metabolites are involved in. Details on how to get and use the Pathway Tools software are available at http://bioinformatics.ai.sri.com/ptools/ptools-overview.html.

Depending upon the scenario (e.g., the environment the cells are in, the resources available, the type of competition faced), other objective functions may be more suitable for use in the modeling analysis. Indeed, if the environmental conditions change during cellular growth (e.g., going from a nutrient-rich environment to one that is nutrient poor), the objective itself may change. However, the above list should provide a fairly comprehensive set of starting points.

6.2.3.2 Objective function determination/evaluation

The choice of an objective function may not be obvious for a given organism. As a result, a researcher might consider several different objective functions. Various methods exist to help evaluate the choice and quality of the various objective functions. These methodologies may be divided into two broad categories. One is the use of an optimization-based approach, while the second involves a probabilistic analysis. Although the two methods are complementary, only a brief description of the optimization approach is provided here. The probabilistic approach is described in more detail.

Note that each of these methods requires a means of carrying out the optimization in order to evaluate the quality of the objective function. Details on how this might be accomplished are provided in Section 6.2.4.

The method developed by Burgard and Maranas [55] utilizes an inverse optimization approach for inferring or disproving various objective functions. A weighted combination of fluxes is maximized, where the weighting factors are referred to as coefficients of importance or CoI. The CoIs are calculated in reference to experimental flux data and are determined in such a way that they sum to one. As a result, by looking at the value of the CoI, it is possible to determine the importance of the contribution of the particular flux being weighted. Since the objective function is ultimately a combination of fluxes, if the CoI is low, then that particular flux is not truly contributing to the objective function. If the CoI is high, then the flux is indeed appropriate for the objective function.
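In symbols, the weighted objective described above might be written as follows. The symbol $c_j$ for the CoI of flux $\nu_j$ and the nonnegativity condition are assumptions made here for illustration; the normalization to one is taken from the description above.

\[ \max_{\nu} \; \sum_j c_j \nu_j, \qquad \sum_j c_j = 1, \qquad c_j \ge 0 \]

Read this way, a flux whose $c_j$ is close to zero contributes little to the inferred objective, while a flux whose $c_j$ is close to one dominates it.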

The probabilistic approach was developed by Knorr et al. [54] based on the work of Stewart, Box, and others [56, 57]. The method involves carrying out a Bayesian-based model discrimination analysis to determine the posterior probabilities of each objective function of interest. To facilitate the approach, the posterior probability of an objective is normalized to the sum of all of the evaluated posterior probabilities in what is referred to as the posterior probability share. The posterior probability shares for each objective function may be compared. The objective function with the highest probability share is the most likely objective function relative to the other objective functions being evaluated. It is critical to note that this approach will always result in a “best” objective function. However, if all of the objective functions are poor choices, then this method will pick the best of the poor choices. It does not change the fact that the final objective function selected may not be a good one if all of the objective functions evaluated are of poor quality. It is therefore very important that all of the objective functions be assessed with a critical eye.

The basis for determining the posterior probability shares for the objective begins with the proportionality described in (6.5):

\[ p(M_j \mid Y) \propto p(M_j)\, 2^{-p_j/2}\, \lvert v_j \rvert^{-\delta/2} \tag{6.5} \]


102

where $M_j$ is the objective function; $p(M_j)$ is the prior probability of $M_j$; Y is the matrix of weighted experimental data, where the weighting is simply the reciprocal of the standard deviation for the appropriate response value, as described by Stewart et al. [57]; $p_j$ is the number of parameters estimated in $M_j$; and δ is the number of available degrees of freedom. Given that the parameters for the flux modeling are essentially the stoichiometric coefficients, and they are already known, there are no further parameters to be estimated [24]. As a result, unless some modified variation of the metabolic analysis is carried out, $p_j$ will have a value of 0.

The matrix $v_j$ represents the products of the deviation of the data from the values predicted by the model for the objective function $M_j$, evaluated at the maximum likelihood of the parameter vector θ. The ikth element may be calculated via (6.6):

\[ v_{ik}(\theta_j) = \sum_{u=1}^{n} \left[ Y_{iu} - F_{ji}(\xi_u, \theta_j) \right] \left[ Y_{ku} - F_{jk}(\xi_u, \theta_j) \right] \tag{6.6} \]

where $F_{ji}$ is the weighted model prediction, described in more detail below, for objective function $M_j$, which is a function of the vector of independent variables and parameters denoted by $\xi_u$ and $\theta_j$, respectively. As mentioned for Y, the weighting for $F_{ji}$ is the reciprocal of the standard deviation for the appropriate response value [57]. Note that the weighting is also calculated for $Y_{iu}$. The subscripts i and k denote specific response values, while u represents the experimental run in which the data was collected.

To calculate the normalized posterior probability share, the individual posterior probability is simply normalized to the sum of all of the calculated posterior probabilities:

\[ \Pi(M_j \mid Y) = \frac{p(M_j \mid Y)}{\sum_k p(M_k \mid Y)} \tag{6.7} \]

The objective function with the largest value of Π is the most likely one given the experimental data set Y. This last point is a critical one. As more data is received or more accurate data is generated, the results of the analysis may change. Thus it is generally a good idea to revisit the assessment of the objective function when new data is obtained.
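The calculation in (6.5) through (6.7) can be sketched in a few lines. The following is an illustration under stated assumptions, not the implementation used in [54]: the inputs are taken to be already weighted, the predictions for each candidate objective function are supplied directly as arrays, the degrees of freedom δ are passed in as `dof`, and the data values, the value `dof=3`, and all function names are hypothetical.

```python
def deviation_matrix(Y, F):
    """Build v_j from (6.6); Y[i][u] and F[i][u] are weighted response i in run u."""
    n_resp, n_runs = len(Y), len(Y[0])
    resid = [[Y[i][u] - F[i][u] for u in range(n_runs)] for i in range(n_resp)]
    return [[sum(resid[i][u] * resid[k][u] for u in range(n_runs))
             for k in range(n_resp)] for i in range(n_resp)]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in matrix]
    n, det = len(M), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            det = -det
        det *= M[col][col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return det

def posterior_shares(Y, predictions, priors, n_params, dof):
    """Unnormalized posteriors from (6.5), then probability shares from (6.7)."""
    unnormalized = []
    for F, prior, p_j in zip(predictions, priors, n_params):
        det_v = determinant(deviation_matrix(Y, F))
        if det_v <= 0.0:
            raise ValueError("|v_j| must be positive; more runs may be needed")
        unnormalized.append(prior * 2.0 ** (-p_j / 2) * det_v ** (-dof / 2))
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical weighted data: two responses (rows) observed in three runs (columns).
Y = [[1.0, 2.0, 3.0], [2.0, 1.0, 0.5]]
F_close = [[1.1, 1.9, 3.2], [2.1, 0.8, 0.6]]   # predictions under one objective
F_far = [[2.0, 1.0, 2.0], [1.0, 1.5, 1.5]]     # predictions under another
shares = posterior_shares(Y, [F_close, F_far], priors=[0.5, 0.5],
                          n_params=[0, 0], dof=3)
print(shares)
```

Because the shares are normalized over the candidates supplied, the objective whose predictions track the data more closely receives nearly all of the probability here, even though nothing in the calculation says whether either candidate is good in absolute terms.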

6.2.3.3 Caveats

The very notion that the organism has a specific objective function has significant biological ramifications. It supposes that the organism is trying to accomplish one goal above and beyond all others. Furthermore, it suggests that the objective function is only being applied at one scale, that of the organism. It neglects the possibility of application of an objective function at a different scale or across multiple scales. For example, imagine a population of organisms where the objective is the survival of the population as a whole, ultimately resulting in enhanced survivability for the individuals. In this scenario it may be that the organisms are operating in a suboptimal fashion at the individual level. This suboptimal operation may be due to the production of metabolites by the organism to help its neighbors survive. However, those resources that the organism is using to help its neighbor survive could have been used for itself. Additionally, the resulting metabolic burden on the organism from aiding its neighbor may slow down its growth rate, which might hinder its ability to compete as effectively for resources. If, despite these setbacks at the individual level, the chances for survival of the population increase, then the population of suboptimal individuals will be selected for evolutionarily. However, under this scenario, selection of any objective function for optimization to model the organism at an individual level is unlikely to provide good estimations of the metabolic distribution.

Even when focusing on the organismal scale is appropriate, there are still critical issues the modeler must be aware of. The choice of one objective function over another may be appropriate at a given time. However, this choice is critically dependent upon the context of the system. In other words, the environment in which the organism finds itself will impact what type of objective function is appropriate. If the environment changes, the objective function may very well change. However, under such circumstances, it is quite likely that there will be a shift in the distribution of metabolic fluxes. As a result, the assumption that the organism is operating at a “steady state” will no longer be valid, undermining the development of (6.2). It is still possible to utilize the metabolic modeling approach described here. However, the analysis would have to be broken up into two phases. The first phase would be prior to the environmental shift or perturbation and would utilize the first objective function. The second phase would begin after enough time had passed since the environmental shift such that the organism had adapted to its new state. At this point, calculations would be based on the use of the second objective function.

Another issue to be aware of is the possibility of the existence of multiple simultaneous objective functions. At the organismal level, it is possible to deal with this problem through appropriate construction/selection of the objective function. However, when dealing with multiple multiscale objective functions, the problem becomes increasingly difficult to the point of intractability.

Without a doubt, the best scenario is one in which sufficient experimental data is available such that the system is determined or overdetermined. Under such circumstances, an optimization approach is unnecessary, and one does not have to speculate as to what the objective function for an organism might be. However, in the scenario where such data is not available, the optimization route can prove useful in providing an approximation of what is occurring within the organism of interest, as long as the above caveats are kept in mind. Furthermore, in addition to clarifying ongoing questions, the metabolic modeling may help determine new questions, generate hypotheses which may be tested experimentally, and provide a previously unknown research thrust. Such results in this age of high throughput biotechnology can prove to be extremely valuable in aiding researchers to wade through the deluge of data being generated.

6.2.4 Optimization

Once the mathematical model is generated and the objective function is chosen, it is possible to carry out the optimization. If the model is generated in accord with the approach described in (6.4) and the objective function is constructed so that it is also linear in nature, the problem may be cast as a linear programming one. A number of fine commercial software packages are available for solving linear programming problems. However, a high-quality free package is also available from the GNU Project and Free Software Foundation that has been used successfully to optimize genome-scale models [54]. Specifically the GNU Linear Programming Kit (GLPK) (http://www.gnu.org/software/glpk/) may be freely downloaded and is capable of running on a variety of major computing platforms. Significant documentation is also available, making the GLPK relatively easy to install and run.

If the model is developed in some other fashion resulting in nonlinear constraints, or if a nonlinear objective function is chosen, the GLPK will not suffice. It will then be necessary to find another optimization software package capable of dealing with nonlinear problems. Once again, many fine commercial software packages are available for carrying out such analysis.
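For the linear case, the model and objective must be written to a file that the solver can read. The following is a minimal sketch, assuming the CPLEX LP text format (one of the input formats GLPK accepts) and a hypothetical four-reaction network. The function names `format_terms` and `to_cplex_lp`, the metabolite and reaction names, and the bounds are illustrative and are not taken from the text.

```python
def format_terms(coeffs, names):
    """Render nonzero coefficients as '+ 1 v1 - 1 v2' for the LP file."""
    parts = []
    for coeff, name in zip(coeffs, names):
        if coeff != 0:
            sign = "+" if coeff > 0 else "-"
            parts.append(f"{sign} {abs(coeff):g} {name}")
    return " ".join(parts)

def to_cplex_lp(S_T, metabolites, reactions, objective, bounds, sense="Maximize"):
    """Return 'optimize c^T v subject to S^T v = 0' as CPLEX LP text."""
    lines = [sense, f" obj: {format_terms(objective, reactions)}", "Subject To"]
    # One equality constraint per metabolite, named after the metabolite.
    for name, row in zip(metabolites, S_T):
        lines.append(f" {name}: {format_terms(row, reactions)} = 0")
    lines.append("Bounds")
    for name, (low, high) in zip(reactions, bounds):
        lines.append(f" {low:g} <= {name} <= {high:g}")
    lines.append("End")
    return "\n".join(lines) + "\n"

# Hypothetical toy network: v1 uptake -> A, v2: A -> B, v3: A exported, v4: B exported.
lp_text = to_cplex_lp(
    S_T=[[1, -1, -1, 0], [0, 1, 0, -1]],
    metabolites=["A", "B"],
    reactions=["v1", "v2", "v3", "v4"],
    objective=[0, 0, 0, 1],                      # maximize export of B
    bounds=[(0, 10), (0, 1000), (0, 1000), (0, 1000)],
)
print(lp_text)
```

Saved to a file such as `toy.lp` (a hypothetical name), the problem could then be passed to the GLPK command-line solver with `glpsol --lp toy.lp -o toy.sol`. Naming each constraint after its metabolite makes it straightforward to trace an infeasible result back to a particular mass balance.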


Once the appropriate objective function is identified and the simulation is run, there are two possible outcomes. The first is that a feasible solution is determined. The second is that no feasible solution can be determined. Both results can be informative regarding the metabolism of the organism in question.

6.3.1 Feasible solution determined

Ideally, upon carrying out the simulations, the value of the resulting objective function will be optimized and a feasible solution will result. In such a situation, the distribution of the metabolic fluxes may be analyzed to determine how metabolic resources are being allocated by the organism. Given these results, the next steps are generally dependent upon the goals of the researcher. If the fundamentals of metabolism are being studied, then hopefully old questions will have been resolved or at least hinted at; new questions will inevitably arise. If the purpose is to engineer the organism to optimize production of a metabolite or protein, then the resulting simulations should provide some insight into what the next steps should be to accomplish the stated goal. Regardless of what is being studied, in all cases it is imperative that the simulation results be verified experimentally. It is especially critical to do so before another round of simulations is carried out. If there is some discrepancy in the first round of simulations that is not identified through experimental analysis, then the second round of simulations will be built upon a flawed foundation. The resulting inaccuracies will propagate through future simulations, leading to erroneous results.

Another point the researcher must be cognizant of is whether the results of the simulation make sense. For example, if there are not very many experimentally determined constraints on the organism, then optimization of the objective function may become trivialized. Specifically, if it is determined that the production of a given metabolite is to be maximized, then, if the pathway exists, the simulation will predict that all of the substrate is converted into that metabolite. In reality, the organism will require distribution of the substrate through other metabolic pathways for growth, energy production, and so on. By including such constraints explicitly, the researcher forces the simulation to account for the distribution of resources along pathways that might be suboptimal for the production of the given metabolite. However, in reality, without utilizing the other metabolic pathways, the organism simply may not survive. Thus it is up to the researcher to evaluate the simulation results with a critical eye.

6.3.2 No feasible solution determined

Oftentimes after carrying out the optimization process, it will not be possible to determine a feasible solution. Such a result may be due to an incomplete or incorrect constraint. For example, assume that based on the metabolic reconstruction, the mass balance constraint for metabolite X was determined to be

\[ X:\quad \nu_1 + \nu_2 + \nu_3 = 0 \tag{6.8} \]

Because there are only source terms and no sink terms, any nonzero value for the fluxes would result in an accumulation of metabolite X, violating the steady state assumption and the resulting constraint. The only possible solution that is consistent with the above constraint is if all of the fluxes are 0. However, if experimental considerations require a nonzero value for any of these fluxes, or if any of them is forced to be nonzero by its contribution to other constraints, the balance cannot be satisfied. Thus the optimization of the system cannot yield a feasible solution. In such a scenario, it is possible the metabolite is participating in a hitherto unknown reaction where the metabolite may be a reactant. The resulting constraint would then have a form similar to the following:

\[ X:\quad \nu_1 + \nu_2 + \nu_3 - \nu_{?} = 0 \tag{6.9} \]

where the flux, $\nu_{?}$, may now act as the sink term.

Such a simulation result actually turns out to be quite useful, as it highlights metabolites which might be participating in reactions not previously known. As a result, it may be possible to design experiments in which these particular metabolites are traced to determine how they are ultimately distributed throughout the cell.
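One simple screen for balances of the kind shown in (6.8) is to look for metabolites whose stoichiometric coefficients all share the same sign. The sketch below assumes that every flux is irreversible (nonnegative), which is what makes such a row satisfiable only by zero fluxes; the function name `unbalanced_metabolites` and the example matrix are hypothetical. It will not catch infeasibilities that arise from the interaction of several constraints.

```python
def unbalanced_metabolites(S_T, metabolites):
    """Flag balances whose nonzero coefficients all share one sign.

    Assumes every flux is irreversible (v >= 0), so such a row can only be
    satisfied when all participating fluxes are zero, as in (6.8).
    """
    flagged = {}
    for name, row in zip(metabolites, S_T):
        nonzero = [c for c in row if c != 0]
        if nonzero and all(c > 0 for c in nonzero):
            flagged[name] = "no sink"
        elif nonzero and all(c < 0 for c in nonzero):
            flagged[name] = "no source"
    return flagged

# Hypothetical example: X mirrors (6.8); W is a balanced metabolite for contrast.
S_T = [
    [1, 1, 1, 0],    # X: produced by v1, v2, v3 and never consumed
    [0, 0, 1, -1],   # W: produced by v3, consumed by v4
]
print(unbalanced_metabolites(S_T, ["X", "W"]))
```

Metabolites flagged in this way are natural first candidates for the tracing experiments described above.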

To identify whether a given metabolic constraint is causing problems, it is only necessary to comment out the constraint from the input file to the GLPK software. It is important to realize that removing the constraint is not the equivalent of removing the metabolite. Removal of the constraint simply means the mass balance is not closed, which provides the flexibility needed to allow for potential participation of the metabolite in other reactions.
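As a minimal sketch of this step, assume the input file is in CPLEX LP format, where a backslash begins a comment, and that each constraint occupies a single line of the form `name: ...`. Other GLPK input formats use different comment syntax, and constraints that wrap across lines would need more care. The helper name `comment_out_constraint` and the example problem are hypothetical.

```python
def comment_out_constraint(lp_text, constraint_name):
    """Prefix the named constraint line with a backslash, the LP-format comment marker."""
    out = []
    for line in lp_text.splitlines():
        if line.strip().startswith(constraint_name + ":"):
            out.append("\\ " + line.strip())
        else:
            out.append(line)
    return "\n".join(out) + "\n"

# Hypothetical input file contents for a two-metabolite toy network.
lp_text = """Maximize
 obj: + 1 v4
Subject To
 A: + 1 v1 - 1 v2 - 1 v3 = 0
 B: + 1 v2 - 1 v4 = 0
Bounds
 0 <= v1 <= 10
 0 <= v2 <= 1000
 0 <= v3 <= 1000
 0 <= v4 <= 1000
End
"""
relaxed = comment_out_constraint(lp_text, "B")
print(relaxed)
```

Only the balance for B is disabled; the fluxes v2 and v4 that involve B remain in the problem, which reflects the distinction drawn above between removing a constraint and removing a metabolite.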


Genome-scale metabolic modeling has a great deal to offer the basic research and metabolic engineering communities. It is a powerful tool that can be used to help elucidate metabolic processes that might not otherwise be easily amenable to experimental studies. Furthermore, based on insights provided by such analysis, it may be possible to address unanswered questions, formulate new hypotheses, and better manipulate organisms for biotechnological purposes, or, if the organism in question is a pathogen, better treat illnesses caused by that microbe.



Clearly there are many significant assumptions underlying the metabolic modeling approach. However, other standard wet lab technologies, such as microscopy and microarrays, are also fraught with limitations. As long as one is aware of these limitations, the results generated via these tools can be extremely insightful. The same is true for metabolic modeling. It is simply another tool available to the researcher. By being aware of the limitations of this approach, it is possible to gain some truly valuable insight into the system that is being studied.

6.5 Summary Points

An attempt has been made to provide a strategy for developing and implementing genome-scale metabolic models. It has been assumed that the system developed will be for modeling of a prokaryotic organism and the model will be an underdetermined one. As a result, an optimization strategy will be required. Based on these assumptions, the following steps summarize the method just described, which may be used by the researcher:

• Acquire or generate metabolic reconstruction for organism of interest:
  • Many organisms are available from BioCyc (http://www.biocyc.org) or KEGG (http://www.genome.jp/kegg/);
  • If metabolic reconstruction is unavailable, but annotated genome sequence is, then one can use Pathway Tools Software to generate metabolic reconstruction (http://bioinformatics.ai.sri.com/ptools/ptools-overview.html).

• Convert metabolic network to metabolic/mathematical model and select objective function:
  • Objective function selection is critical and requires careful thought and consideration; a list of some of the most commonly used objective functions is provided in Section 6.2.3.1;
  • Evaluate objective functions.

• Carry out linear programming. One can use the GNU Linear Programming Kit, available from http://www.gnu.org/software/glpk/. Evaluate the solution:
  • If a feasible solution is generated, make sure the solution results make sense;
  • If no feasible solution can be generated, identify which constraints are causing the problem and mark those metabolites for future experimental study.

It should be emphasized that this approach is an iterative process, and may require several rounds of updating based on what was learned in previous trials, followed by repeated analysis.

Acknowledgments

Support for this work was provided in part by the NIH National Library of Medicine through grant 1R03LM009753-01.


References

[1] Vallino, J.J., and G. Stephanopoulos, “Metabolic flux distributions in Corynebacterium glutamicum during growth and lysine overproduction,” reprinted from Biotechnology and Bioengineering, Vol. 41, 1993, pp. 633–646, in Biotechnol. Bioeng., Vol. 67, No. 6, 2000, pp. 872–885.

[2] Savinell, J.M., and B.O. Palsson, “Network analysis of intermediary metabolism using linear optimization. I. Development of mathematical formalism,” J. Theor. Biol., Vol. 154, 1992, pp. 421–454.

[3] Schilling, C.H., et al., “Genome-scale metabolic model of Helicobacter pylori 26695,” J. Bacteriol., Vol. 184, No. 16, 2002, pp. 4582–4593.

[4] Schilling, C.H., et al., “Combining pathway analysis with flux balance analysis for the comprehensive study of metabolic systems,” Biotechnol. Bioeng., Vol. 71, No. 4, 2000, pp. 286–306.

[5] Schilling, C.H., and B.O. Palsson, “Assessment of the metabolic capabilities of Haemophilus influenzae Rd through a genome-scale pathway analysis,” J. Theor. Biol., Vol. 203, No. 3, 2000, pp. 249–283.

[6] Stephanopoulos, G.N., A.A. Aristidou, and J. Nielsen, Metabolic Engineering: Principles and Methodologies, San Diego, CA: Academic Press, 1998.

[7] Sauer, U., D.C. Cameron, and J.E. Bailey, “Metabolic capacity of Bacillus subtilis for the production of purine nucleosides, riboflavin, and folic acid,” Biotechnol. Bioeng., Vol. 59, No. 2, 1998, pp. 227–238.

[8] Sauer, U., et al., “Physiology and metabolic fluxes of wild-type and riboflavin-producing Bacillus subtilis,” Appl. Environ. Microbiol., Vol. 62, No. 10, 1996, pp. 3687–3696.

[9] Raman, K., P. Rajagopalan, and N. Chandra, “Flux balance analysis of mycolic acid pathway: targets for anti-tubercular drugs,” PLoS Comput. Biol., Vol. 1, No. 5, 2005, p. e46.

[10] Park, S.M., et al., “Metabolite and isotopomer balancing in the analysis of metabolic cycles: II. Applications,” Biotechnol. Bioeng., Vol. 62, No. 4, 1999, pp. 392–401.

[11] Goel, A., et al., “Analysis of metabolic fluxes in batch and continuous cultures of Bacillus subtilis,” Biotechnol. Bioeng., Vol. 42, No. 6, 1993, pp. 686–696.

[12] Goel, A., et al., “Metabolic fluxes, pools, and enzyme measurements suggest a tighter coupling of energetics and biosynthetic reactions associated with reduced pyruvate kinase flux,” Biotechnol. Bioeng., Vol. 64, No. 2, 1999, pp. 129–134.

[13] Hatzimanikatis, V., and J.E. Bailey, “Effects of spatiotemporal variations on metabolic control: approximate analysis using (log)linear kinetic models,” Biotechnol. Bioeng., Vol. 54, No. 2, 1997, pp. 91–104.

[14] Hatzimanikatis, V., et al., “Metabolic networks: enzyme function and metabolite structure,” Curr. Opin. Struct. Biol., Vol. 14, No. 3, 2004, pp. 300–306.

[15] Hatzimanikatis, V., et al., “Exploring the diversity of complex metabolic networks,” Bioinformatics, Vol. 21, No. 8, 2005, pp. 1603–1609.

[16] Holms, H., “Flux analysis and control of the central metabolic pathways in Escherichia coli,” FEMS Microbiol. Rev., Vol. 19, No. 2, 1996, pp. 85–116.

[17] Ibarra, R.U., J.S. Edwards, and B.O. Palsson, “Escherichia coli K-12 undergoes adaptive evolution to achieve in silico predicted optimal growth,” Nature, Vol. 420, No. 6912, 2002, pp. 186–189.

[18] Edwards, J.S., M. Covert, and B. Palsson, “Metabolic modelling of microbes: the flux-balance approach,” Environ. Microbiol., Vol. 4, No. 3, 2002, pp. 133–140.

[19] Edwards, J.S., R.U. Ibarra, and B.O. Palsson, “In silico predictions of Escherichia coli metabolic capabilities are consistent with experimental data,” Nat. Biotechnol., Vol. 19, No. 2, 2001, pp. 125–130.

[20] Edwards, J.S., and B.O. Palsson, “Systems properties of the Haemophilus influenzae Rd metabolic genotype,” J. Biol. Chem., Vol. 274, No. 25, 1999, pp. 17410–17416.

[21] Fell, D.A., and J.R. Small, “Fat synthesis in adipose tissue. An examination of stoichiometric constraints,” Biochem. J., Vol. 238, No. 3, 1986, pp. 781–786.

[22] Fischer, E., and U. Sauer, “Large-scale in vivo flux analysis shows rigidity and suboptimal performance of Bacillus subtilis metabolism,” Nat. Genet., Vol. 37, No. 6, 2005, pp. 636–640.

[23] Bonarius, H.P., et al., “Metabolic flux analysis of hybridoma cells in different culture media using mass balances,” Biotechnol. Bioeng., Vol. 50, No. 3, 1996, pp. 299–318.

[24] Bailey, J.E., “Complex biology with no parameters,” Nat. Biotechnol., Vol. 19, No. 6, 2001, pp. 503–504.

[25] Bailey, J.E., et al., “Inverse metabolic engineering: a strategy for directed genetic engineering of useful phenotypes,” Biotechnol. Bioeng., Vol. 79, No. 5, 2002, pp. 568–579.

[26] Alper, H., et al., “Identifying gene targets for the metabolic engineering of lycopene biosynthesis in Escherichia coli,” Metab. Eng., Vol. 7, No. 3, 2005, pp. 155–164.

[27] Alper, H., K. Miyaoku, and G. Stephanopoulos, “Construction of lycopene-overproducing E. coli strains by combining systematic and combinatorial gene knockout targets,” Nat. Biotechnol., Vol. 23, No. 5, 2005, pp. 612–616.


108

[28] Lee, K., et al., “Metabolic flux analysis of postburn hepatic hypermetabolism,” Metab. Eng., Vol. 2,No. 4, 2000, pp. 312–327.

[29] Lee, K., et al., “Metabolic flux analysis: a powerful tool for monitoring tissue function,” Tissue Eng.,Vol. 5, No. 4, 1999, pp. 347–368.

[30] Lee, K., et al., “Profiling of dynamic changes in hypermetabolic livers,” Biotechnol. Bioeng., Vol. 83,No. 4, 2003, pp. 400–415.

[31] Stretton, S., and A.E. Goodman, “Carbon dioxide as a regulator of gene expression in microorgan-isms,” Antonie Van Leeuwenhoek, Vol. 73, No. 1, 1998, pp. 79–85.

[32] Forster, J., et al., “Genome-scale reconstruction of the Saccharomyces Cerevisiae metabolic net-work,” Genome Res., Vol. 13, No. 2, 2003, pp. 244–253.

[33] Forster, J., et al., “Large-scale evaluation of in silico gene deletions in Saccharomyces Cerevisiae,”Omics, Vol. 7, No. 2, 2003, pp. 193–202.

[34] Gombert, A.K., et al., “Network identification and flux quantification in the central metabolism ofSaccharomyces Cerevisiae under different conditions of glucose repression,” J. Bacteriol., Vol. 183,No. 4, 2001, pp. 1441–1451.

[35] Green, M.L., and P.P.D. Karp, “A Bayesian method for identifying missing enzymes in predictedmetabolic pathway databases,” BMC Bioinformatics, Vol. 5, 2004, p. 76.

[36] Zupke, C., et al., “Numerical isotopomer analysis: estimation of metabolic activity,” Anal.Biochem., Vol. 247, No. 2, 1997, pp. 287–293.

[37] Klapa, M.I., J.C. Aon, and G. Stephanopoulos, “Systematic quantification of complex metabolicflux networks using stable isotopes and mass spectrometry,” Eur. J. Biochem., Vol. 270, No. 17,2003, pp. 3525–3542.

[38] Klapa, M.I., et al., “Metabolite and isotopomer balancing in the analysis of metabolic cycles: I. The-ory,” Biotechnol. Bioeng., Vol. 62, No. 4, 1999, pp. 375–391.

[39] Christensen, B., A.K. Gombert, and J. Nielsen, “Analysis of flux estimates based on (13)C-labellingexperiments,” Eur. J. Biochem., Vol. 269, No. 11, 2002, pp. 2795–2800.

[40] Christensen, B., and J. Nielsen, “Isotopomer analysis using Gc-Ms,” Metab. Eng., Vol. 1, No. 4,1999, pp. 282–290.

[41] Varma, A., and B.O. Palsson, “Metabolic flux balancing—Basic concepts, scientific and practicaluse,” Bio-Technology, Vol. 12, No. 10, 1994, pp. 994–998.

[42] Vallino, J.J., and G. Stephanopoulos, “Metabolic flux distributions in Corynebacterium Glutacmicumduring growth and lysine overproduction,” Biotechnol. Bioeng., Vol. 41, 1993, pp. 633–646.

[43] Aiba, S., and M. Matsuoka, “Identification of metabolic model: citrate production from glucose byCandida Lipolytica,” Biotechnol. Bioeng., Vol. 21, 1979, pp. 1373–1386.

[44] Reed, J.L., and B.O. Palsson, “Thirteen years of building constraint-based in silico models of Esche-richia coli,” J. Bacteriol., Vol. 185, No. 9, 2003, pp. 2692–2699.

[45] Edwards, J.S., and B.O. Palsson, “The Escherichia coli Mg1655 in silico metabolic genotype: its defi-nition, characteristics, and capabilities,” Proc. Natl. Acad. Sci. USA, Vol. 97, No. 10, 2000,pp. 5528–5533.

[46] Reed, J.L., and B.O. Palsson, “Genome-scale in silico models of E. coli have multiple equivalentphenotypic states: assessment of correlated reaction subsets that comprise network states,” GenomeRes., Vol. 14, No. 9, 2004, pp. 1797–1805.

[47] Price, N.D., et al., “Genome-scale microbial in silico models: the constraints-based approach,”Trends Biotechnol., Vol. 21, No. 4, 2003, pp. 162–169.

[48] Price, N.D., et al., “Network-based analysis of metabolic regulation in the human red blood cell,” J.Theor. Biol., Vol. 225, No. 2, 2003, pp. 185–194.

[49] Paley, S.M., and P.D. Karp, “The pathway tools cellular overview diagram and omics viewer,”Nucleic Acids Res., Vol. 34, No. 13, 2006, pp. 3771–3778.

[50] Karp, P.D., S. Paley, and P. Romero, “The pathway tools software,” Bioinformatics, Vol. Vol. 18,Suppl. 1, 2002, pp. S225–S232.

[51] Vo, T.D., H.J. Greenberg, and B.O. Palsson, “Reconstruction and functional characterization of thehuman mitochondrial metabolic network based on proteomic and biochemical data.” J. Biol.Chem., Vol. 279, No. 38, 2004, pp. 39532–39540.

[52] Schuetz, R., L. Kuepfer, and U. Sauer, “Systematic evaluation of objective functions for predictingintracellular fluxes in Escherichia coli,” Mol. Syst. Biol., Vol. 3, 2007, p. 119.

[53] Ramakrishna, R., et al., “Flux-balance analysis of mitochondrial energy metabolism: consequencesof systemic stoichiometric constraints,” Am. J. Physiol. Regul. Integr. Comp. Physiol., Vol. 280, No. 3,2001, pp. R695–R704.

[54] Knorr, A.L., R. Jain, and R. Srivastava, “Bayesian-based selection of metabolic objective functions,”Bioinformatics, Vol. 23, No. 3, 2007, pp. 351–357.

[55] Burgard, A.P., and C.D. Maranas, “Optimization-based framework for inferring and testinghypothesized metabolic objective functions,” Biotechnol. Bioeng., Vol. 82, 2003, pp. 670–677.

Acknowledgments

109

[56] Stewart, W.E., T.L. Henson, and G.E.P. Box, “Model discrimination and criticism with sin-gle-response data,” AIChE Journal, Vol. 42, No. 11, 1996, pp. 3055–3062.

[57] Stewart, W.E., Y. Shon, and G.E.P. Box, “Discrimination and goodness of fit of multiresponsemechanistic models,” AIChE Journal, Vol. 44, No. 6, 1998, pp. 1404–1412.


110

CHAPTER 7

Modeling the Dynamics of Cellular Networks

Ryan Nolan¹,² and Kyongbum Lee¹,*

¹ Department of Chemical and Biological Engineering, Tufts University, Medford, MA 02180
² Wyeth BioPharma, Andover, MA 01810
* 4 Colby Street, Room 142, Medford, MA 02155-6013; phone: 617-627-4323; fax: 617-627-3991; e-mail: [email protected]

Key terms
Cellular dynamics
Metabolic network
Modularity
Metabolic flux analysis
Elementary flux modes
Enzyme kinetics
Parameter estimation
Genetic algorithm
Bayesian network analysis

Abstract

Optimization of tissue or cell function is often difficult due to a limited understanding of the biochemical activity of the system as a whole. A mathematical model simulating the dynamics of cellular biochemical processes would significantly reduce the experimental burden for such optimization.

Insights into system dynamics would also enable fundamental advances in understanding whole-cell regulatory mechanisms. This chapter presents a modeling strategy to simulate the changes in cell density and metabolite concentrations during an unsteady cell culture process. The methodology involves three steps. First, from a genome-scale metabolic reaction network, graph-theoretical analysis is applied to systematically reduce the network to a manageable set of modules. Second, kinetic rate expressions are defined for each module to characterize the initial state of the system during balanced growth. Third, the transition periods following a system perturbation are explained by the generation of metabolically distinct subpopulations. This methodology is illustrated with an application to a batch culture of Chinese hamster ovary cells producing a recombinant therapeutic protein.

7.1 Introduction

The living cell is an exceedingly complex system with very many interacting molecular components including genes and other nucleic acids, enzymes and other proteins, and small molecule metabolites. These molecules are “chemically connected” through their shared participation in cellular reactions and regulatory events, giving rise to a “biochemical network.” Examples of such networks include gene regulatory circuits, signal cascades, and metabolic reaction networks. Advances in genomics, proteomics, and informatics have generated an increasingly vast database of information on the compositions of these biochemical networks. For many unicellular organisms, genome-scale metabolic models have been assembled that catalogue the types of enzymes present in the cell and thereby define the stoichiometric connections between the metabolites and reactions. However, translating such catalogues into dynamic computational models has remained elusive due to the complexity of biochemical networks and incomplete knowledge of the components’ kinetic and regulatory behavior.

Current genome-scale or whole-cell models (based on, for example, flux balance analysis) assume conditions of pseudo-steady state and/or optimality to provide snapshots (global descriptions) of observed or desired overall cellular activity. At the other end of the modeling spectrum are mechanistic models consisting of coupled differential rate equations that describe the time profiles of the systems’ molecular components. The advantage of these kinetic models is that they can lead to powerful insights into the dynamics of the system, for example, offering explanations and predictions on stability, attainable steady states, and responses to time-varying stimuli. On the other hand, the forms and parameters of the rate equations used for model simulations are often culled from varied sources in the literature, because they are not generally available for whole-cell networks. Moreover, published data sometimes reflect isolated in vitro, rather than in vivo, settings and thus lack internal consistency. The limited amount of biological knowledge (e.g., mechanism-based rate equations) and dearth of reliable in vivo data on parameters have set practical limits on the scope and scale of kinetic models. Indeed, there are relatively few examples of kinetic models that do not focus on a particular subsystem, such as a metabolic pathway or signaling subnetwork.

The goal of this chapter is to present a data-driven, multiresolution modeling strategy that can supplement or complement biological knowledge-driven approaches for developing dynamic models of whole-cell networks. The central premise of this modeling strategy is that a biochemical network may be abstracted as an organized ensemble of modules. Each module may be represented by one or more rate equations to varying degrees of detail depending on available knowledge, data, and the overall modeling goal. Modular partitioning and coarse graining can systematically reduce network complexity and afford estimation of a reasonable number of self-consistent parameters from experimental data. The premise is based on recent developments in topological analyses of cellular networks, which present a strong case for modular organization. The illustrative example used in this chapter is a metabolic reaction network. Metabolic networks are large, consisting of several hundred to several thousand component species. There are a number of readily accessible databases with comprehensive, species-specific compositional information on metabolic networks. In contrast, there are no comparable sources of data on the mechanisms of enzyme action and corresponding rate equations. Standard forms of metabolic reaction rate equations are generally nonlinear functions of metabolites and include multiple coefficient parameters. In this regard, the challenges associated with dynamic modeling of metabolic networks are broadly representative.

7.2 Materials

7.2.1 Cell culture

The methodology described herein was applied to an industry-relevant fed-batch process of Chinese hamster ovary (CHO) cells producing a recombinant antibody. Briefly, the basal and feed media used were chemically defined, protein-free, proprietary formulations. Both the cells and media were products of Wyeth BioPharma (Andover, Massachusetts). The cells were seeded at >1 × 10⁶ cells/mL and cultured for approximately 2 weeks while maintaining a measured viability of >85%. Samples were taken twice daily and analyzed for viable cell density, viability, pH, osmolarity, O₂, CO₂, glucose, lactate, ammonia, amino acids, and recombinant antibody. The goal of this chapter is to present a broadly applicable modeling method, so additional CHO cell culture-specific details are not presented here.

7.2.2 Database

We have made extensive use of the KEGG database [1], which provides a species-specific listing of enzymes, reactions, and metabolites for the organism of interest. An especially useful feature of this database is its ftp site (http://www.genome.jp/kegg/download/ftp.html), which allows data downloads in various file formats.

7.3 Methods

7.3.1 Network reconstruction

1. Generate a genome-scale metabolic reaction network from an annotated database such as KEGG.

2. Following the initial assembly, perform additional manual curation steps as necessary to add missing reactions (e.g., within a linear pathway) and remove pathways that are irrelevant to the metabolic phenotype under investigation (e.g., xenobiotic metabolism). Ensure phenotype-consistent directionality of certain pathways (e.g., macromolecule biosynthesis) and prevent irrelevant cycles among cofactor metabolites (e.g., nucleotide recycling) by imposing reaction irreversibility and reaction coupling.

7.3.2 Network reduction

7.3.2.1 Structural reduction

1. Create a directed graph with metabolites as nodes and reactions as edges (Figure 7.1). Directed edges between nodes are established based on reaction involvement. For example, in the reaction A + B → C + D, directed edges are defined from A to C, A to D, B to C, and B to D.

2. Remove noncarbon currency and carbon-shuttle metabolites (Table 7.1) to form a carbon-backbone network (Figure 7.2). The effect of this step should be a significant reduction in graph connectivity (Table 7.2).

3. Classify the graph nodes according to the following criteria: input or output if degree = 1, intermediate if degree ≥ 2, and cycle if there exists a path from itself to itself. Using these definitions, determine graph-paths from inputs to cycles and outputs, and from inputs and cycles to outputs.

Figure 7.1 Genome-scale network (206 metabolites, 197 reactions, 659 links). From the KEGG genome database, enzyme-catalyzed reactions were collected to form a complete metabolic reaction network for the CHO cell. Pathways included were: glycolysis, PPP, TCA cycle, amino acid metabolism, oxidative phosphorylation, and biomass and recombinant protein synthesis. Cell culture experiments were conducted to rule out alternative or parallel pathways. Legend: carbon metabolite, currency metabolite, direct carbon link, direct currency link.

Table 7.1 Noncarbon Currency and Carbon-Shuttle Metabolites

Noncarbon Currency | Carbon-Shuttle*
Nucleotide phosphates | ACP
Nucleoside phosphates | CoA
Orthophosphate | CO
Pyrophosphate | CO₂
NAD(P)⁺/NAD(P)H | HCO₃⁻
FAD/FADH₂ | THF
H⁺ | THF-derivatives
H₂O | L-glutamate/2-oxoglutarate
O₂ | L-glutamate/L-glutamine
NH₃ | Tetrahydrobiopterin/dihydrobiopterin
H₂O₂ | S-adenosyl-L-methionine/S-adenosyl-L-homocysteine
Sulfate |
Sulfite |
Oxidized/reduced ferredoxin |
3’-phosphoadenylyl sulfate/adenosine 3’,5’-bisphosphate |

* Metabolite pairs such as L-glutamate/2-oxoglutarate were removed only when they act as carbon-shuttles. For example, the pair L-glutamate/2-oxoglutarate is removed from: 4-aminobutanoate + 2-oxoglutarate = succinate semialdehyde + L-glutamate, but not from: L-glutamate + NAD⁺ + H₂O = 2-oxoglutarate + NH₃ + NADH + H⁺.

4. Apply elementary flux mode (EFM) analysis to every pair of connected input and output nodes in the graph. The EFM analysis should identify every stoichiometrically conserved EFM between an input and an output node mapped by a graph-path (reaction sequence). In general, one graph-path uniquely maps to one EFM. It should be noted that the EFM algorithm applied to a carbon-backbone network of central carbon metabolism with all external metabolites as inputs and outputs will often result in >40,000 pathways. In contrast, the EFM algorithm applied to the graph-paths will result in far fewer pathways (in our CHO cell network, the number of graph-paths was 36). On occasion, using only one start and one end node as inputs to the EFM algorithm may result in an empty set. When this occurs, an additional node (input, cycle, or output) is required to form a stoichiometrically conserved pathway. In these instances, an additional node is systematically screened as an added input to the EFM algorithm, which then completes an EFM pathway. The EFM algorithm can be implemented in MATLAB (Mathworks, Natick, Massachusetts) using Metatool [2].

Table 7.2 Graph Connectivity

Step | Description | Metabolites | Reactions | Links
1 | Genome-scale network | 206 | 197 | 659
2 | Carbon-backbone network | 188 | 195 | 225
3 | Graph nodes defined | N/A | N/A | N/A
4 | EFM pathways defined | N/A | N/A | N/A
5 | Pseudo-reaction module network | 34 | 36 | N/A
6 | Kinetic model network | 20 | 18 | N/A

Figure 7.2 Carbon-backbone network (188 metabolites, 195 reactions, 225 links). The network size was significantly reduced to a carbon-backbone network by removing noncarbon currency and carbon-shuttle metabolites. From the directed graph, metabolites were then defined as input or output (degree = 1), intermediate (degree ≥ 2), or cycle (there exists a path from and to itself). These distinctions were used to form stoichiometrically conserved pseudo-reaction modules. Legend: input metabolite, intermediate metabolite, cycle metabolite, output metabolite.

5. Finally, for each of the EFM pathways generated, sum the involved reactions, with the noncarbon currency and carbon-shuttle metabolites reintroduced, to form a pseudo-reaction module (Figure 7.3).
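The graph steps above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' MATLAB/Metatool implementation: the three-reaction toy network, the names `build_edges`, `classify_nodes`, and `pseudo_reaction`, and the rule that a degree-1 node with an outgoing edge is an input are all assumptions made here. The EFM computation itself is not reproduced; the pathway (mode) is supplied by hand.

```python
# Toy network (upper glycolysis). Signed stoichiometry: negative = consumed.
REACTIONS = {
    "R1": {"GLC": -1, "ATP": -1, "G6P": 1, "ADP": 1},
    "R2": {"G6P": -1, "F6P": 1},
    "R3": {"F6P": -1, "ATP": -1, "F16P": 1, "ADP": 1},
}
CURRENCY = {"ATP", "ADP"}  # removed to obtain the carbon backbone


def build_edges(reactions, exclude=frozenset()):
    """Directed substrate -> product edges, skipping excluded metabolites."""
    edges = set()
    for stoich in reactions.values():
        subs = [m for m, s in stoich.items() if s < 0 and m not in exclude]
        prods = [m for m, s in stoich.items() if s > 0 and m not in exclude]
        edges.update((a, b) for a in subs for b in prods)
    return edges


def classify_nodes(edges):
    """Label each node as input, output, intermediate, or cycle."""
    succ, degree = {}, {}
    for src, dst in edges:
        succ.setdefault(src, set()).add(dst)
        succ.setdefault(dst, set())
        degree[src] = degree.get(src, 0) + 1
        degree[dst] = degree.get(dst, 0) + 1

    def on_cycle(node):
        # Depth-first search for a path that returns to the start node.
        stack, seen = list(succ[node]), set()
        while stack:
            cur = stack.pop()
            if cur == node:
                return True
            if cur not in seen:
                seen.add(cur)
                stack.extend(succ[cur])
        return False

    labels = {}
    for node in succ:
        if on_cycle(node):
            labels[node] = "cycle"
        elif degree[node] == 1:
            # Assumption: a lone outgoing edge marks an input,
            # a lone incoming edge marks an output.
            labels[node] = "input" if succ[node] else "output"
        else:
            labels[node] = "intermediate"
    return labels


def pseudo_reaction(reactions, mode):
    """Sum the reactions of one pathway, weighted by its mode coefficients."""
    net = {}
    for name, coeff in mode.items():
        for met, stoich in reactions[name].items():
            net[met] = net.get(met, 0) + coeff * stoich
    return {m: s for m, s in net.items() if s != 0}


full_graph = build_edges(REACTIONS)                   # currency included
backbone = build_edges(REACTIONS, exclude=CURRENCY)   # carbon backbone
labels = classify_nodes(backbone)
module = pseudo_reaction(REACTIONS, {"R1": 1, "R2": 1, "R3": 1})
```

In this toy case, removing ATP and ADP reduces the graph from eight links to three, GLC is classified as an input and F16P as an output, and the summed module is GLC + 2 ATP → F16P + 2 ADP, with the currency metabolites reintroduced as in step 5.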

7.3.2.2 Functional reduction

1. To the network of pseudo-reaction modules, add cellular exchange reactions to account for transport of metabolites across the cell membrane, and generate a stoichiometric reaction matrix, S. For comprehensive functional analysis, it is advised to include the significant currency metabolites (i.e., O₂, ATP, NADH, NADPH, FADH₂, and NH₃). In the CHO network, 25 exchange reactions were added to 36 pseudo-reaction modules, resulting in an S matrix of 35 metabolites by 61 reactions.

2. Quantify a steady-state flux distribution (e.g., balanced growth) using metabolic flux analysis (MFA). Reducing the network using pseudo-reaction modules eliminates reaction segments that are not connected to an input or output metabolite. Thus, the MFA problem should be well posed. We recommend a least-squares solution to the MFA problem using constrained optimization. The objective function for this problem is:

$$\text{Minimize: } \sum_{k} \left( v_k - v_k^{\text{obs}} \right)^2, \quad \forall k \in \{\text{external fluxes}\}$$

$$\text{Subject to: } S \cdot v = 0$$

where $v_k$ and $v_k^{\text{obs}}$ are, respectively, predicted and observed external flux components of $v$.

Figure 7.3 Network reduction strategy. From the carbon-backbone network, a set of graph paths was defined between all inputs, cycles, and outputs. For each path, the endpoints served as inputs/outputs in an elementary flux modes (EFM) algorithm applied to the carbon-backbone network (center). The reactions in an EFM were then combined to form a stoichiometrically conserved pseudo-reaction module, which included currency metabolites. Panels: graph path (GLC to 2 PYR), elementary mode (GLC, G6P, F6P, F16P, GP, GAP, G13P, G3P, PEP, PYR), and pseudo-reaction module (GLC to 2 PYR, with ATP, ADP, and NADH).

3. Inequality constraints can be added (if necessary) based on the thermodynamic feasibility of reaction pathways. The rationale for these constraints is somewhat lengthy and is not included in this chapter. A detailed introduction to thermodynamic pathway constraints can be found in [3]. The process is as follows:

i. Estimate Gibbs energies of formation ($G_{f,i}$) for the metabolites in the network using group contribution theory [4].

ii. Calculate a standard Gibbs free energy change for each reaction ($\Delta G_{RXN}^{\circ}$).

iii. Sum the $\Delta G_{RXN}^{\circ}$ values across each stoichiometrically balanced pathway (i.e., EFM) in the network to obtain standard pathway Gibbs free energy change ($\Delta G_{PATH}^{\circ}$) values.

iv. Express these values as inequality constraints in the form $G \cdot v \leq 0$, where $G$ is a pathway-scaled matrix of $\Delta G_{PATH}^{\circ}$ values.

4. From the steady-state flux distribution, remove reactions and pathways that contain a negligible flux. Negligible fluxes can be determined based on a cutoff value set as a fraction (e.g., 1%) of the median flux value. For the CHO network, removal of such reactions resulted in a final reaction network (to be used for subsequent kinetic modeling) consisting of 20 metabolites and 30 reactions (18 pseudo-reactions and 12 cellular exchange reactions).
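As a rough illustration of steps 2 and 4, the sketch below solves the equality-constrained least-squares problem through its Lagrangian (KKT) linear system and then applies the median-based cutoff. It assumes a small hypothetical branched network and made-up flux measurements, handles only the equality constraint S·v = 0 (no thermodynamic inequalities), and uses a plain Gaussian elimination so that it needs no external libraries. The function names are illustrative.

```python
from statistics import median


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: flux problem is not well posed")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def mfa_least_squares(s, observed):
    """Minimize sum_k (v_k - v_k_obs)^2 over measured fluxes subject to S v = 0.

    s        -- stoichiometric matrix (rows: balanced metabolites, columns: fluxes)
    observed -- dict {flux index: measured value} for the external fluxes
    """
    n_met, n_flux = len(s), len(s[0])
    size = n_flux + n_met
    kkt = [[0.0] * size for _ in range(size)]
    rhs = [0.0] * size
    for k, value in observed.items():
        kkt[k][k] = 1.0                      # gradient of the squared-error term
        rhs[k] = value
    for j in range(n_met):
        for i in range(n_flux):
            kkt[i][n_flux + j] = s[j][i]     # S^T times the Lagrange multipliers
            kkt[n_flux + j][i] = s[j][i]     # constraint rows: S v = 0
    return solve_linear(kkt, rhs)[:n_flux]


def keep_significant(fluxes, fraction=0.01):
    """Indices of fluxes whose magnitude exceeds fraction * median |flux|."""
    cutoff = fraction * median(abs(v) for v in fluxes)
    return [i for i, v in enumerate(fluxes) if abs(v) > cutoff]


# Hypothetical network: v0 uptake -> A; v1 A -> B; v2 A -> C; v3, v4 secretion.
S = [
    [1, -1, -1, 0, 0],   # balance on A
    [0, 1, 0, -1, 0],    # balance on B
    [0, 0, 1, 0, -1],    # balance on C
]
measured = {0: 10.0, 3: 6.2, 4: 3.5}         # made-up external flux data
fluxes = mfa_least_squares(S, measured)
kept = keep_significant(fluxes)
```

The measured uptake (10.0) and secretion (6.2 + 3.5 = 9.7) do not balance, so the fit spreads the 0.3 discrepancy evenly and returns fluxes close to 9.9, 6.3, 3.6, 6.3, and 3.6. A genome-derived network would normally be handled with a dedicated constrained optimizer rather than this hand-rolled solver.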

7.3.3 Kinetic modeling

7.3.3.1 Rate equations

Many options exist for saturable enzyme reaction rate expressions. In this chapter, we use Michaelis-Menten type equations, which are the most commonly used kinetic expressions for relating the concentrations of substrates, inhibitors, and activators to the reaction velocities. Depending on the enzymes involved and data available, other types of rate equations may be more appropriate. We will discuss this point further in Section 7.5.2.

1. Define Michaelis-Menten enzymatic rate equations for each reaction. For example, the rate v for the reaction A + B → C + D is defined as

$$v = v_{\max} \cdot X \cdot \frac{[A]}{K_{m,A} + [A]} \cdot \frac{[B]}{K_{m,B} + [B]}$$

where $v$ = reaction rate (mM/day), $v_{\max}$ = maximum balanced growth reaction flux (mmol/10⁹ cells/day), $X$ = viable cell density (10⁶ cells/mL), $K_m$ = Michaelis-Menten constant (mM), and [A] and [B] = concentrations of substrates A and B, respectively (mM). Approximating the $v_{\max}$ values from the balanced growth reaction flux (as determined from the MFA in the previous section) will significantly reduce the number of estimated parameters.

2. Define cell growth with Monod kinetics as follows:

$$\frac{dX}{dt} = \mu \cdot X, \qquad \mu = \mu_{\max} \cdot \frac{[\text{ATP}]}{K_{m,\text{ATP}} + [\text{ATP}]}$$

where $\mu$ = growth rate (1/day), $\mu_{\max}$ = maximum growth rate (1/day), and [ATP] = intracellular concentration of ATP (mM).

3. Define mass balances on the metabolite concentrations as

$$\frac{dA}{dt} = S \cdot v$$

where $A$ = metabolite concentration vector, $S$ = reaction stoichiometric matrix, and $v$ = reaction rate expression vector.

4. Estimate the unknown Km parameters (see Section 7.3.4).
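To make the three definitions above concrete, the following sketch integrates a single toy reaction A + B → C together with Monod growth, using a hand-written fourth-order Runge-Kutta step. All parameter values are invented for illustration, the ATP concentration is held constant (an assumption that makes the growth equation analytically checkable), and the function names are not taken from the chapter.

```python
import math


def mm_rate(vmax, x, concs, kms):
    """Michaelis-Menten rate: vmax * X * product of [S] / (Km + [S])."""
    rate = vmax * x
    for conc, km in zip(concs, kms):
        rate *= conc / (km + conc)
    return rate


def derivatives(state, p):
    """Right-hand side for the toy reaction A + B -> C plus Monod growth."""
    a, b, c, x = state
    v = [mm_rate(p["vmax"], x, (a, b), (p["km_a"], p["km_b"]))]
    mu = p["mu_max"] * p["atp"] / (p["km_atp"] + p["atp"])
    s = [[-1], [-1], [1]]                    # rows: A, B, C; one reaction column
    dconc = [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in s]
    return dconc + [mu * x]                  # dA/dt = S v and dX/dt = mu X


def rk4_step(state, p, dt):
    """One classical Runge-Kutta step of size dt."""
    def shifted(k, h):
        return [y + h * dy for y, dy in zip(state, k)]
    k1 = derivatives(state, p)
    k2 = derivatives(shifted(k1, dt / 2), p)
    k3 = derivatives(shifted(k2, dt / 2), p)
    k4 = derivatives(shifted(k3, dt), p)
    return [y + dt / 6 * (d1 + 2 * d2 + 2 * d3 + d4)
            for y, d1, d2, d3, d4 in zip(state, k1, k2, k3, k4)]


def simulate(state, p, t_end, dt=0.001):
    """Integrate from t = 0 to t_end (days) with a fixed step."""
    for _ in range(round(t_end / dt)):
        state = rk4_step(state, p, dt)
    return state


# Illustrative values only: concentrations in mM, X in 10^6 cells/mL, time in days.
PARAMS = {"vmax": 2.0, "km_a": 0.5, "km_b": 0.2,
          "mu_max": 0.7, "atp": 2.0, "km_atp": 1.0}
a, b, c, x = simulate([5.0, 3.0, 0.0, 1.0], PARAMS, t_end=1.0)

mu = PARAMS["mu_max"] * PARAMS["atp"] / (PARAMS["km_atp"] + PARAMS["atp"])
x_exact = math.exp(mu * 1.0)                 # analytical X(t) for constant [ATP], X(0) = 1
```

Because every unit of C is formed from one unit each of A and B, the sums [A] + [C] and [B] + [C] stay constant during the simulation, which is a convenient sanity check on the stoichiometric matrix. A full model would replace the single reaction column with the reduced network and would typically use a stiff ODE solver.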

7.3.3.2 Dynamic simulations

For a low-density seed batch culture, balanced growth estimation would be sufficient to characterize the dynamics of the system. However, for a high-density, fed-batch culture such as the one used as an example here, a different approach is required to accurately simulate the transition periods that occur throughout the process. This situation is demonstrated in Figure 7.4, where the balanced growth parameters can accurately predict the metabolite profiles through the first 1.5 days. However, at day 1.5 a perturbation to the system occurs, in the form of glutamine depletion, resulting in a significant deviation from the expected balanced growth trajectory. One possible explanation is that in response to this perturbation, a fraction of the cell culture population metabolically adjusts to the depletion by altering the activity of specific enzymes (e.g., reversal of glutamine synthetase and lactate dehydrogenase). The result is a transition period during which a heterogeneous population of cells develops. This is an important modeling assumption, and is further discussed in the commentary section below. The heterogeneous distribution can be described with a simple transition state model (a Markov process, i.e., the probability of transitioning to a future state is dependent only on the current state and independent of any past states), where a fraction of the population responds quickly to the perturbation with a probability k, and another fraction continues to be metabolically active at the same balanced growth rate with a probability 1 – k (Figure 7.5). Over time k → 1 and another steady state is achieved with a new homogeneous population. Such process dynamics can be modeled as follows.

Figure 7.4 Balanced growth simulation. The model for balanced growth (days 0 to 2) included 42 parameters (Km’s), which were fit to experimental data using a genetic algorithm. The graphs depict the measured and simulated data. The x-axis is days and the y-axis is concentration. All metabolites were accurately predicted through day 1. At day 1.5 the culture experienced a perturbation that resulted in a transition to a new metabolic state. Panels: viable cell density, antibody, glucose, lactate, glutamine, asparagine, alanine, and ammonia (days 0 to 3).

1. Define a perturbation event (observed or hypothesized) that will trigger a metabolic transition, as well as the response(s) of the system to compensate. In the CHO example, the event was glutamine depletion and the responses were reversal of the two aforementioned enzymes.

2. For each event, define a new network with the appropriate response variables adjusted, as well as a Markov probability variable k to represent the fraction of the total population that exists in this metabolic state.

3. Assume the balanced growth phase is a baseline state from which the cell deviates, and set the previously estimated kinetic parameters as constants. Estimate the Km’s for the new reactions and the k value using the data only over the transition period. Results of this method applied to the CHO example are shown in Figure 7.6.
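One way to read the steps above is that the culture-level rate of a reversing reaction is a k-weighted mixture of the two subpopulations. The sketch below applies that reading to lactate dehydrogenase; it is an interpretation offered for illustration, not the authors' code, and the parameter names and values are hypothetical.

```python
def mm(vmax, x, conc, km):
    """Single-substrate Michaelis-Menten rate scaled by cell density X."""
    return vmax * x * conc / (km + conc)


def net_lactate_rate(k, x, pyr, lac, p):
    """Population-averaged lactate production rate (mM/day).

    A fraction (1 - k) of the cells remains in state S1 (PYR -> LAC);
    a fraction k has switched to state S2 (LAC -> PYR).
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be a probability between 0 and 1")
    v_s1 = mm(p["vmax_fwd"], x, pyr, p["km_pyr"])   # lactate produced in S1
    v_s2 = mm(p["vmax_rev"], x, lac, p["km_lac"])   # lactate consumed in S2
    return (1.0 - k) * v_s1 - k * v_s2


# Hypothetical parameter values chosen only to make the arithmetic easy to follow.
P = {"vmax_fwd": 4.0, "km_pyr": 0.2, "vmax_rev": 1.0, "km_lac": 5.0}
before = net_lactate_rate(0.0, x=5.0, pyr=0.2, lac=5.0, p=P)   # all cells in S1
after = net_lactate_rate(1.0, x=5.0, pyr=0.2, lac=5.0, p=P)    # all cells in S2
mixed = net_lactate_rate(0.8, x=5.0, pyr=0.2, lac=5.0, p=P)    # 80% switched
```

With these numbers the culture produces lactate at k = 0, consumes it at k = 1, and is close to zero net lactate flux at k = 0.8. In the fitting step, k would be estimated alongside the new Km values while the balanced growth parameters stay fixed.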

Figure 7.5 Markov transition model. At day 1.5 the glutamine concentration approached a low level, resulting in a depletion of some internal metabolites, and a shift to a new (S1 to S2) metabolic steady state to compensate. During this transition, a heterogeneous population developed, with some cells utilizing specific reactions in the forward direction, with a Markov probability = k, and other cells in the reverse direction, with probability 1 – k. Panels: glutamine (GLN/GLU, probabilities 1 – k1 and k1), lactate (PYR/LAC, probabilities 1 – k2 and k2), and serine (SER/GLY, probabilities 1 – k3 and k3), each over days 0 to 3.

7.3.4 Parameter estimation

The parameter estimation problem is generally solved using nonlinear optimization. Given a set of coupled differential equations expressing the reaction rate dependences on metabolite concentrations and kinetic coefficient parameters, the objective of the optimization problem is to minimize the sum-squared differences between the calculated and measured dependent variables (e.g., reaction rates) based on a set of parameter choices. Typical inputs to the problem are the measured or assumed initial values of the independent variables (e.g., metabolite concentrations). Here, we refer to independence in a mathematical, rather than physical, sense. This may be an obvious point, since intracellular metabolite concentrations generally cannot be controlled independently. As with other nonlinear optimization problems, guaranteeing a globally optimal solution is exceedingly difficult, if not impossible. For large-scale problems, the use of gradient-based, local search methods that repeatedly solve the problem with different initial conditions (multistart strategy) generally fails to arrive at satisfactory solutions, often yielding the same local minimum [5]. It is generally agreed that global search methods, while computationally expensive, are likely to yield results that broadly reflect the full range of parameter estimation data. Several such methods have recently been examined, including branch-and-bound [6] and hybrid functional Petri nets [7].

In this chapter, we use a genetic algorithm, which belongs to the class of evolutionary algorithms. The advantage of this global search heuristic is that it offers reasonable (i.e., exact or approximately exact) solutions even when applied to ill-conditioned problems [8]. Briefly, an initial seed population of random individuals is examined for their ability to satisfy a predefined fitness function. The algorithm selects the most promising individuals (elite children) along with a user-defined portion of randomly mutated and recombined (crossover) individuals to seed the next iteration (or generation). The genetic algorithm is implemented as follows.

Figure 7.6 First transition simulation. Setting the initial 42 parameters (from balanced growth) as constant, the transition (days 1.5 to 3) to a new metabolic state was modeled by decreasing the activity of several forward reactions and increasing the activity of the corresponding reversible reactions. The net result was an increase in the depleted internal metabolites; 25 parameters (Km’s and Markov probabilities, k’s) were fit for this phase. Panels: viable cell density, antibody, glucose, lactate, glutamine, asparagine, alanine, and ammonia (days 0 to 3).

1. Define the fitness function as

$$\text{Minimize: } \sum_{j} \left( c_j - c_j^{\text{obs}} \right)^2, \quad \forall j \in \{\text{external metabolites}\}$$

where $c_j$ and $c_j^{\text{obs}}$ are, respectively, the predicted and observed external metabolite concentrations in the culture medium.

2. Initialize the intracellular concentrations and define the bounds on the parameters. Initial concentrations can be estimated from the literature or measured directly. Parameter bounds can be estimated to span one order of magnitude based on the associated metabolite concentration. For example, in a reaction A → B, if the concentration of A is 0.5 mM, the bounds on the Km parameter of the reaction would be 0.1 mM and 1 mM.

3. Define the number of generations to terminate the algorithm, the mutation function, and crossover fraction. In this work, these values were 500, Gaussian, and 0.8, respectively. The choices for these algorithm parameters depend on the software package. This work used the Genetic Algorithm and Direct Search toolbox for MATLAB to integrate all computing routines, including EFM analysis, flux calculation, model simulation, and parameter estimation, into one software environment.
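The three steps above can be illustrated with a small, self-contained genetic algorithm. This is a hand-written sketch, not the MATLAB toolbox used in this work: the selection scheme, population size, elite count, and mutation width are assumptions, and the fitted model is a two-parameter stand-in that predicts reaction rates from fixed substrate concentrations instead of simulating culture concentrations. The helper `decade_bounds` is one possible reading of the one-order-of-magnitude rule in step 2.

```python
import math
import random


def decade_bounds(conc):
    """Decade enclosing a concentration, e.g. 0.5 mM -> (0.1, 1.0) mM."""
    low = 10.0 ** math.floor(math.log10(conc))
    return low, 10.0 * low


def sse(params, model, observed):
    """Fitness: sum of squared differences between predictions and data."""
    return sum((c - c_obs) ** 2 for c, c_obs in zip(model(params), observed))


def genetic_algorithm(model, observed, bounds, generations=500, pop_size=40,
                      elite_count=2, crossover_fraction=0.8, sigma=0.1, seed=0):
    rng = random.Random(seed)
    score = lambda ind: sse(ind, model, observed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    n_cross = round(crossover_fraction * (pop_size - elite_count))
    history = []
    for _ in range(generations):
        pop.sort(key=score)
        history.append(score(pop[0]))                      # best fitness so far
        parents = pop[: pop_size // 2]                     # truncation selection
        children = [ind[:] for ind in pop[:elite_count]]   # elite children
        for i in range(pop_size - elite_count):
            if i < n_cross:                                # uniform crossover
                p1, p2 = rng.sample(parents, 2)
                child = [rng.choice(genes) for genes in zip(p1, p2)]
            else:                                          # Gaussian mutation, clipped
                parent = rng.choice(parents)
                child = [min(hi, max(lo, g + rng.gauss(0.0, sigma * (hi - lo))))
                         for g, (lo, hi) in zip(parent, bounds)]
            children.append(child)
        pop = children
    best = min(pop, key=score)
    return best, score(best), history


# Stand-in model: two-substrate Michaelis-Menten rates at fixed ([A], [B]) in mM.
CONDITIONS = [(0.1, 0.2), (0.5, 0.5), (1.0, 2.0), (2.0, 0.3), (4.0, 4.0)]


def toy_model(params, vmax=1.0):
    km_a, km_b = params
    return [vmax * a / (km_a + a) * b / (km_b + b) for a, b in CONDITIONS]


observed = toy_model((0.5, 0.3))                 # synthetic, noise-free data
bounds = [decade_bounds(0.5), decade_bounds(0.3)]
best, best_sse, history = genetic_algorithm(toy_model, observed, bounds,
                                            generations=100)
```

Because the elite children are copied unchanged into each new generation, the best fitness recorded in `history` can never get worse from one generation to the next. Convergence to the true Km values is not guaranteed by this property alone, which is one reason to repeat the fit with different seeds.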


7.4 Data Acquisition, Anticipated Results, and Interpretation

The focus of this chapter is on model development, rather than metabolite data acquisition. We refer the interested reader to recent publications by Nielsen and coworkers, who have developed excellent assay platforms for high-throughput metabolite analysis [9]. This section will therefore limit the comments to anticipated results and interpretation. These comments will be brief, because Section 7.3 presented many of the important observations and quantitative details regarding the expected results through the CHO cell example.

7.4.1 Model network

The expected final result is a dynamic model simulating the time-dependent metabolic behavior of a cell culture. Key intermediate results are the graph models obtained through the systematic reduction and modularization strategy (Figure 7.3). The number of reaction modules (paths) in the reduced model correlates with the number of rate equations, which in turn determines the size of the parameter space. We found that a tenfold reduction in model size was possible for the genome-scale CHO cell network, which initially consisted of about 200 reactions (Table 7.2). This level of reduction is likely to be typical, because cellular metabolism is generally well conserved across species and cell types.

The dynamic model expresses the time-dependent behavior through a set of coupled differential equations. Following data fitting, the results should include rate coefficients and other equation parameters along with the calculated reaction rates and metabolite concentrations. Figure 7.4 shows representative time trajectories of simulated metabolite concentrations plotted against experimental data. The time scale of the simulation necessarily depends on the cell type and the culture behavior of interest. In the case of the CHO cell culture used here as a test system, the time scale was on the order of days. To clarify, this time scale does not refer to the time for computation, which executes within minutes. The anticipated dynamic range over the course of a simulation is two orders of magnitude for the extracellular metabolite concentrations for a high-density culture exceeding 10⁶ cells/mL. The dynamic range of the intracellular metabolites depends on the enzyme affinity parameter (Km) bounds established by domain (cell type-specific) biological knowledge. We noted that these bounds significantly influence the convergence of the model during parameter optimization. Fortunately, experimental determination of these bounds is possible through initial concentration measurements on the intracellular metabolites.

7.4.2 Dynamic simulation parameters

The modeling strategy of this chapter involves two types of parameters. One set of parameters directly depends on the form of the kinetic expression, and reflects the sensitivity of the reaction rate to the substrate concentrations. Beyond this basic interpretation, further analysis again depends on the specifics of the model. For example, rate equation parameters of the lin-log form express the elasticity of the enzyme with respect to both substrates and nonsubstrate effectors [10]. The second parameter type represents a probability that the culture gives rise to one or more additional subpopulations with qualitative differences in metabolic behavior. When interpreting the parameter values, we recommend a global analysis of all of the metabolite and reaction rate time profiles after performing multiple (on the order of 10) iterations of model training (data fitting). While GA-based nonlinear optimization generally yields robust results, the heuristic nature of the algorithm cannot guarantee convergence to a globally optimal solution.


7.5.1 Modularity

In this chapter, we have outlined steps to develop a dynamic model of cellular metabolism based on annotated genome data and metabolite concentration measurements. The central premise was that cellular networks can be decomposed into recognizable and functionally meaningful modules. In the present case of the CHO cell metabolic network, the modules represented reaction groups or pathways. Our model reduction strategy was largely motivated by several recent developments in graph theoretical analysis of biological network topology. In particular, an emergent theme in this literature is the concept of modular organization [11]. A number of studies have shown that a biochemical network possesses significant patterns of interconnections representing basic structural units similar to other complex natural networks such as the ecological food web. Such units have been labeled motifs when identified through bottom-up searches [12]. The modularity of biochemical networks has also been explored using top-down approaches that successively divide the system into smaller subnetworks [13]. In these earlier studies, modularity has been determined by analyzing patterns of structural (e.g., reaction stoichiometry-based) connectivity. Until recently, less attention has been paid to functional (e.g., reaction flux-based) connectivity [14]. Determining connectivity relationships solely based on structural information has the drawback that every biochemical interaction is treated equally, regardless of the activity level of that interaction. The modeling strategy presented in this chapter examines both structural and functional modularity in reducing model complexity.

In addition to the premise on modularity, two other important assumptions were introduced, which we will discuss in the remainder of this commentary section. The discussion will concentrate on the limitations imposed by the assumptions with respect to generality, as opposed to, for example, particular aspects of simulating CHO cell dynamics. As part of this discussion, we will suggest alternative modeling options as well as future research directions for refining the modeling framework.

7.5.2 Generalized kinetic expressions

In this work, we used Michaelis-Menten type equations, which are hyperbolic functions that reasonably approximate the saturation behavior of many metabolic enzymes, and are thus an appropriate initial choice. However, there are a number of other options, and careful consideration should be given to the choice of the rate equation form.

1. Commonly used alternatives to the Michaelis-Menten equations include generalized mass-action (or S-system) [15], convenience [16], and lin-log kinetics. Each of these alternatives offers particular advantages. For example, convenience kinetics provides a simple and generalized form for rate expressions of random order enzyme mechanisms, and can include thermodynamic dependencies among parameters. When inhibition, activation, and thermodynamic constraints are not considered, the convenience and Michaelis-Menten kinetic expressions are equivalent. Thus, it is not surprising that, like Michaelis-Menten type expressions, the equations are numerically well behaved and suitable for large-scale parameter estimation and optimization. In lin-log kinetics, the reaction rate is proportional to the enzyme level and a linear sum of nonlinear logarithmic substrate concentrations. While this equation form does not reflect a particular enzyme action mechanism, it can very closely approximate the behavior of a Michaelis-Menten kinetic expression with an appropriate set of parameter choices. One challenge in implementing lin-log kinetics is that the expressions include reference state parameters, preferably obtained at a steady state. In many cases, initial conditions (for internal metabolites) reflecting a steady state may not be readily available, since dynamic culture experiments are more commonly conducted in a batch setting. On the other hand, as techniques for intracellular metabolite measurements rapidly mature, lin-log kinetics may become more attractive. A particularly compelling feature of lin-log kinetics is that its rate equation parameters directly reflect the enzyme elasticities with respect to its substrates.



2. Regardless of the form, generalized rate equations may not fully capture kinetic behaviors resulting from the regulatory effects of allosteric modulators. When there is mechanistic knowledge, it is possible to appropriately modify the rate equations on a case-by-case basis, for example by replacing a hyperbolic function with a sigmoidal function via an ultrasensitivity or cooperativity parameter. Unfortunately, such mechanistic information remains unavailable for many enzymes. Thus, the addition of regulatory variables and parameters may rely on decisions based on ad hoc knowledge.

3. One systematic, data-driven approach to determine whether there are regulatory interactions is through network inference. Various promising approaches for network inference have been described in the recent literature, including singular value decomposition (SVD), independent component analysis (ICA), network component analysis (NCA), network component mapping (NCM), and Bayesian network (BN) inference. Most of these approaches have been developed for gene regulatory circuits, partly because of the comparatively earlier advances in technologies for high-throughput gene expression measurements. A notable earlier study on reconstructing metabolic subnetworks was described by Chan and co-workers, who used an information theory–based learning algorithm [17]. While computationally efficient, this approach does not distinguish between candidate models during the structural learning stage. Rather, categorical decisions are made on conditional dependencies to arrive at a unique structure. Model refinement occurs through subsequent introduction of expert knowledge and hypothesis testing. The decisions on the structure of the network rely on thresholds, which can produce errors when the data size is small. Very recently, we have developed an alternative method based on probability theory (unpublished work [18]). This method systematically (although not exhaustively) assesses various candidate models based on their conditional probabilities given the data. The expert or subjective knowledge is introduced as prior probabilities. Therefore, it is not necessary to subsequently set up and test various hypotheses on the conditional independencies within the learned structure. An attractive feature of our method is that it can routinely update both the structure (conditional independencies between components) and parameters of the learned network as additional data become available. The data requirements are information on network stoichiometry and measurements on metabolic (flux) profiles, and thus completely overlap with the dynamic modeling framework.
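The rate-law options discussed in items 1 and 2 can be compared numerically. The following is a minimal sketch, not the implementation used in this chapter: it assumes a single-substrate reaction, the commonly used lin-log form v = v_ref · (e/e_ref) · (1 + ε ln(s/s_ref)), and purely hypothetical parameter values. The elasticity is set to the Michaelis-Menten elasticity at the reference state, Km/(Km + s_ref), so that the two expressions agree near the reference and diverge away from it. A Hill function illustrates the sigmoidal alternative.

```python
import math


def michaelis_menten(s, vmax, km):
    """Hyperbolic Michaelis-Menten rate law."""
    return vmax * s / (km + s)


def hill(s, vmax, k_half, n):
    """Sigmoidal Hill rate law; n = 1 recovers the Michaelis-Menten form."""
    return vmax * s**n / (k_half**n + s**n)


def lin_log(s, v_ref, s_ref, elasticity, e_rel=1.0):
    """Lin-log rate law around a reference state (s_ref, v_ref).

    e_rel is the enzyme level relative to the reference state.
    """
    return v_ref * e_rel * (1.0 + elasticity * math.log(s / s_ref))


# Hypothetical parameter values, chosen only for illustration.
VMAX, KM, S_REF = 1.0, 0.5, 0.5
V_REF = michaelis_menten(S_REF, VMAX, KM)
# Elasticity of the Michaelis-Menten rate at the reference state:
# d ln v / d ln s = Km / (Km + s).
ELASTICITY = KM / (KM + S_REF)

print("     s       MM  lin-log  Hill(n=4)")
for s in (0.1, 0.25, 0.5, 1.0, 2.0):
    print(
        f"{s:6.2f} {michaelis_menten(s, VMAX, KM):8.4f} "
        f"{lin_log(s, V_REF, S_REF, ELASTICITY):8.4f} "
        f"{hill(s, VMAX, KM, 4):10.4f}"
    )
```

The printed table makes the role of the reference state visible: the lin-log rate matches the Michaelis-Menten rate at s_ref and drifts away from it as the substrate concentration moves further from the reference.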

7.5.3 Population heterogeneity

In this work, we modeled the apparent change in the overall behavior of the CHO cell culture to reflect a putative rise in a new subpopulation with a different metabolic phenotype. This assumption was based on the observation that a growing fraction of cells in the aging culture failed to exhibit balanced growth even when there was a sufficient supply of nutrients. An alternative interpretation would have been to consider the changing culture behavior as an adaptation, where the entire population, represented by an “average” cell, progressively takes on a different phenotype. Resolving this type of ambiguity will require additional experiments on population characteristics, for example by measuring the distribution of cell cycle states at various times during the culture process.


In this section, we highlight a few notable features of the CHO cell network simulations.

It is observed that during the transition from balanced growth following glutamine depletion, lactate and ammonia are affected to the greatest extent, while glucose, alanine, asparagine, and other metabolites not shown appear to deviate only slightly. It is hypothesized that the depletion of glutamine results in less carbon entering the TCA cycle via alpha-ketoglutarate, and as a result, less NADH being produced. To compensate, some cells in the population are able to take advantage of the abundant supply of lactate in the culture and reverse the direction of the LDH reaction, resulting in the oxidation of lactate to pyruvate and generating the needed NADH. The deviation of ammonia, on the other hand, is a consequence of the reversal of glutamine synthetase in an attempt to replenish the depleted glutamine, which is necessary for nucleotide synthesis and antibody production, in addition to central energy metabolism.

For this simple case of a single, measurable metabolite being depleted, the rules for determining which reactions would be altered in activity could be determined based on knowledge of the system. Unfortunately, there are many potential perturbations that can occur in a mammalian cell culture system, and often more than one type of perturbation is occurring at a given time. The challenge then becomes how to systematically define which reactions will respond to a specific perturbation. By determining which reactions or metabolites are directly connected to the perturbation, and then which reactions are most adaptable (i.e., those for which reversibility is known or observed), one can begin to define a chain of connectivity for modeling the transition.

The final point worth mentioning is the accuracy of the predicted transition rate. It can be observed in Figure 7.6 that, while the first transition profiles for lactate and asparagine are accurate, those for glucose, glutamine, alanine, and ammonia are not. The reason for this is that the Markov transition probability, k, was modeled as a constant. It is more likely, however, that this transition has a dependency on a particular metabolite concentration, time, or other process parameter. By including this dependency, the transition profiles for these metabolites should be more closely simulated.
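To illustrate the difference between a constant and a concentration-dependent transition probability, the following minimal sketch tracks the fraction of cells that remain in the original phenotype. It is only an illustration under stated assumptions and not the model of this chapter: the discrete-time update, the glutamine profile, the constant value 0.2, and the saturating form k(c) = k_max · K / (K + c) are all hypothetical.

```python
def remaining_fraction(k_of_c, concentrations, f0=1.0):
    """Fraction of cells still in the original phenotype after each step.

    k_of_c maps a metabolite concentration to a per-step transition
    probability; concentrations holds one value per time step.
    """
    fraction, history = f0, []
    for c in concentrations:
        fraction *= 1.0 - k_of_c(c)
        history.append(fraction)
    return history


# Hypothetical glutamine profile (arbitrary units) over eight sampling intervals.
GLUTAMINE = [4.0, 3.0, 2.0, 1.0, 0.5, 0.2, 0.1, 0.05]


def constant_k(c):
    """Constant transition probability (assumed value)."""
    return 0.2


def dependent_k(c, k_max=0.6, half_sat=0.2):
    """Assumed form: the probability rises as glutamine is depleted."""
    return k_max * half_sat / (half_sat + c)


CONSTANT = remaining_fraction(constant_k, GLUTAMINE)
DEPENDENT = remaining_fraction(dependent_k, GLUTAMINE)

for step, (fc, fd) in enumerate(zip(CONSTANT, DEPENDENT), start=1):
    print(f"step {step}: constant k -> {fc:.3f}, dependent k -> {fd:.3f}")
```

With a constant k the original subpopulation decays geometrically from the first step, whereas the dependent form delays the transition until the metabolite is nearly exhausted, which is the qualitative behavior the text suggests may be needed.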




Problem: An empty set is obtained for the EFM analysis.
Potential solution: An additional node is required to form a stoichiometrically conserved pathway. Screen additional input, cycle, and output nodes as an added input to the EFM algorithm, which should result in an EFM pathway.

Problem: Uncertainty or poor estimates in kinetic parameters.
Potential solution: Improve the estimates on the parameter bounds by consulting literature data for kinetic parameters or the associated intracellular metabolite concentration. Even better, obtain measurements for the intracellular concentrations.

Problem: Poor fitting of the transition periods.
Potential solution: Change the dependency of the Markov parameter to a metabolite concentration, time, or process parameter.

7.7 Summary Points

• A systematic methodology based on graph theory and pathway analysis was used to reduce a genome-scale metabolic reaction network, without loss of conservation relationships, to a manageable network for kinetic modeling.

• Metabolite and cell density profiles were simulated using Michaelis-Menten kinetic equations and Monod growth kinetics, respectively.

• Kinetic parameters were estimated using a genetic algorithm.

• Deviations from external perturbations were modeled by assuming the development of heterogeneous subpopulations, each with distinct metabolic activities, the sum of which contributes to the global activity of the culture.

• The probability of a subpopulation transitioning to a new metabolic steady state was modeled with a Markov process.

Acknowledgments

We gratefully acknowledge financial support for RN by Wyeth and a National Science Foundation grant (award # 0829899) to KL.

References

[1] Kanehisa, M., and S. Goto, “KEGG: Kyoto encyclopedia of genes and genomes,” Nucleic Acids Res., Vol. 28, No. 1, 2000, pp. 27–30.

[2] von Kamp, A., and S. Schuster, “Metatool 5.0: fast and flexible elementary modes analysis,” Bioinformatics, Vol. 22, No. 15, 2006, pp. 1930–1931.

[3] Nolan, R.P., A.P. Fenley, and K. Lee, “Identification of distributed metabolic objectives in the hypermetabolic liver by flux and energy balance analysis,” Metab. Eng., Vol. 8, No. 1, 2006, pp. 30–45.

[4] Mavrovouniotis, M.L., “Group contributions for estimating standard Gibbs energies of formation of biochemical compounds in aqueous solution,” Biotechnol. Bioeng., Vol. 36, No. 10, 1990, pp. 1070–1082.

[5] Pardalos, P.M., and R.H. Edwin, Handbook of Global Optimization, Vol. 2, London: Kluwer Academic, 2002.

[6] Polisetty, P.K., E.O. Voit, and E.P. Gatzke, “Identification of metabolic system parameters using global optimization methods,” Theor. Biol. Med. Model, Vol. 3, 2006, p. 4.

[7] Koh, G., H.F. Teong, M.V. Clement, D. Hsu, and P.S. Thiagarajan, “A decompositional approach to parameter estimation in pathway modeling: a case study of the Akt and Mapk pathways and their crosstalk,” Bioinformatics, Vol. 22, No. 14, 2006, pp. e271–e280.

[8] Moles, C.G., P. Mendes, and J.R. Banga, “Parameter estimation in biochemical pathways: a comparison of global optimization methods,” Genome Res., Vol. 13, No. 11, 2003, pp. 2467–2474.

[9] Villas-Boas, S.G., J.F. Moxley, M. Akesson, G. Stephanopoulos, and J. Nielsen, “High-throughput metabolic state analysis: the missing link in integrated functional genomics of yeasts,” Biochem. J., Vol. 388, Pt. 2, 2005, pp. 669–677.

[10] Liebermeister, W., and E. Klipp, “Bringing metabolic networks to life: convenience rate law and thermodynamic constraints,” Theor. Biol. Med. Model, Vol. 3, 2006, p. 41.

[11] Spirin, V., M.S. Gelfand, A.A. Mironov, and L.A. Mirny, “A metabolic network in the evolutionary context: multiscale structure and modularity,” Proc. Natl. Acad. Sci. USA, Vol. 103, No. 23, 2006, pp. 8774–8779.

[12] Milo, R., S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: simple building blocks of complex networks,” Science, Vol. 298, No. 5594, 2002, pp. 824–827.

[13] Ma, H.W., X.M. Zhao, Y.J. Yuan, and A.P. Zeng, “Decomposition of metabolic network into functional modules based on the global connectivity structure of reaction graph,” Bioinformatics, Vol. 20, No. 12, 2004, pp. 1870–1876.

[14] Yoon, J., Y. Si, R. Nolan, and K. Lee, “Modular decomposition of metabolic reaction networks based on flux analysis and pathway projection,” Bioinformatics, Vol. 23, No. 18, 2007, pp. 2433–2440.

[15] Schwacke, J.H., and E. Voit, “Computation and analysis of time-dependent sensitivities in generalized mass action systems,” J. Theor. Biol., Vol. 236, No. 1, 2005, pp. 21–38.

[16] Kresnowati, M.T., W.A. van Winden, and J.J. Heijnen, “Determination of elasticities, concentration and flux control coefficients from transient metabolite data using linlog kinetics,” Metab. Eng., Vol. 7, No. 2, 2005, pp. 142–153.

[17] Li, Z., and C. Chan, “Inferring pathways and networks with a Bayesian framework,” FASEB J, Vol. 18, No. 6, 2004, pp. 746–748.

[18] Yoon, J., “Metabolic network analysis of liver and adipose tissue,” Ph.D. Dissertation, Department of Chemical and Biological Engineering, Tufts University, 2007.


CHAPTER 8

Steady-State Sensitivity Analysis of Biochemical Reaction Networks: A Brief Review and New Methods

Stefan Streif,1 Steffen Waldherr,2 Frank Allgöwer,3 and Rolf Findeisen3

1 Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany
2 Institute for Systems Theory and Automatic Control, Universität Stuttgart, Germany
E-mail: [email protected]
3 Institute for Automation Engineering, Otto-von-Guericke University, Magdeburg, Germany


Key terms: Parametric sensitivity analysis; Steady state; Biochemical reaction networks; Empirical Gramian; Global infeasibility certificate

Abstract

Sensitivity analysis is a valuable tool in the analysis of biological systems. It can be used for many purposes such as drug target identification, model comparison, and model refinement. In this chapter we review steady-state parametric sensitivity analysis for biochemical reaction networks. As shown, local sensitivity analysis methods might lead to wrong conclusions if the considered system is highly variable over the range of possible parameters. To overcome this problem, we outline two new methods for steady-state sensitivity analysis. The first method is based on an input-output consideration of the sensitivity question. The parameters of interest are considered as inputs, and their influence on the “output” is captured by the empirical cross Gramian, an extension of the concept of linear cross Gramians to nonlinear systems. This approach allows one to consider a wider class of systems and to obtain insight into the question of sensitivity based on simulations, and it can be extended to time-varying sensitivity analysis. The second approach is based on a reformulation of the original question as the question of outer approximating the range of possible steady states under uncertainties. Since an outer approximation of the possible steady states is obtained, the method is nonlocal (i.e., global in nature).

8.1 Introduction

Over the past decades significant advances with respect to the mathematical modeling of biological systems have been achieved. Advancements in biological experimental techniques have led to a rapid increase in the size of the mathematical models, as well as in the number of available models (see [1]). However, models are not derived without purpose. They often lay the basis for the analysis and understanding of the underlying (biological) principles and are used to identify the key influencing elements. One of the basic questions in the analysis of biological systems is how the dynamics (e.g., the steady state) change with respect to changes in parameters (or external inputs). Examples of such parameters are reaction constants in biochemical reaction networks, or association constants. Typically such an analysis of the influence of parameter changes on the behavior of the system is denoted as (parametric) sensitivity analysis (see [2]). It might be used for several purposes, such as the identification of targets for the design of drugs and therapies, the identification of limiting steps in a metabolic network to achieve a maximum yield of a product, or model comparison [35] or refinement [36].

We focus here on the sensitivity analysis for biochemical reaction networks, which form an important class of models for biological processes [3–5]. One classical tool to provide insight into the effect that certain parameters have is metabolic control analysis (MCA) [6, 7]. It basically allows one to analyze the influence of parameters on the behavior of the system close to a certain nominal (parameter) operating point. Under such conditions one can safely assume that the behavior of the system depends linearly on the parameters. However, in biochemical reaction networks one usually faces large parameter variations: in genetic engineering, common techniques like gene knock-outs or knock-downs, overexpression, or binding site mutations typically give rise to large parameter variations. In these cases one typically falls back on global sensitivity methods, which are often based on statistical considerations [8–10].

The objective of this contribution is twofold. First, we provide a brief introduction to the issue of parametric steady-state sensitivity analysis for biochemical reaction networks. As shown, existing methods are typically based on local considerations. To overcome the local limitations we introduce two new approaches for parametric sensitivity analysis. The first approach is based on the concept of controllability and observability Gramians and their extension to nonlinear systems. It allows one to consider a wider system class and provides a less local insight into the influence of parameters. The second approach is based on an efficient calculation of an outer approximation of the set of all possible steady states for the set of possible parameters. It is based on a reformulation of the problem as an infeasibility problem.

The remainder of the chapter is structured as follows. In Section 8.2 we briefly introduce the considered system class and the question of parametric steady-state sensitivity analysis. Section 8.3 reviews linear sensitivity analysis. In Sections 8.4 and 8.5 we introduce two new approaches for parametric sensitivity analysis that partly overcome the local limitation of existing methods. Both approaches are exemplified using a reversible covalent modification system. Conclusions and a final outlook are provided in Section 8.6.


8.2 Considered System Class and Parametric Sensitivity

We are interested in the modeling and analysis of biochemical reaction networks, which are typically given by sets of reactions of the form:

$$\alpha_1 [S_1] + \cdots + \alpha_{n_s} [S_{n_s}] \rightarrow \beta_1 [P_1] + \cdots + \beta_{n_p} [P_{n_p}] \tag{8.1}$$

Here $S_i$ denotes the substrates that are transformed into the products $P_i$. The factors $\alpha_i$ and $\beta_i$ denote the stoichiometric coefficients of the reactants. Typically these networks are modeled by systems of differential equations of the form:

$$\frac{dx}{dt} = N v(x, p), \qquad x(t_0) = x_0 \tag{8.2}$$

The rate vector $v$ ($v: \mathbb{R}^n \times \mathbb{R}^l \to \mathbb{R}^m$) depends on the parameters $p \in \mathbb{R}^l$ and on the dependent state variables (concentrations) denoted by $x \in \mathbb{R}^n$. The stoichiometric matrix $N \in \mathbb{R}^{n \times m}$ relates the rate vector to the rate of change of the states. It depends on the coefficients $\alpha_i$, $\beta_i$, and possibly on factors compensating for different units or volumes. $x_0$ denotes the initial concentrations at time $t_0 = 0$. For simplicity we assume in the following that all functions are at least once continuously differentiable with respect to their arguments. Note that this is usually the case for biochemical reaction networks.

There is a large variety of possible reaction models [3] defining the rate vector $v$ and the stoichiometry. Examples are mass-action, power-law, Michaelis-Menten, and Hill kinetics. We do not go into details here and rather refer to [5, 11].

A common feature in biochemical reaction networks is conservation relationships $L_j$ among the $n$ state variables $x$ of the form

$$L_j = \sum_{i=1}^{n} \zeta_i x_i$$

with nonnegative coefficients $\zeta_i$. Usually the system of differential equations is treated in its reduced form with $n - j$ state variables (see [12]).

8.2.1 Example system: reversible covalent modification

One classical example of a biochemical reaction network, used in this work to exemplify our considerations, is the reversible covalent modification system [13]. The reaction scheme for this system is given by:

$$\begin{aligned}
[A] + [E_1] \;&\underset{k_2}{\overset{k_1}{\rightleftharpoons}}\; [C_1] \;\overset{k_3}{\longrightarrow}\; [E_1] + [A^*] \\
[A^*] + [E_2] \;&\underset{k_5}{\overset{k_4}{\rightleftharpoons}}\; [C_2] \;\overset{k_6}{\longrightarrow}\; [E_2] + [A]
\end{aligned} \tag{8.3}$$

Here enzymes $E_1$ and $E_2$ convert the protein between its two states $A$ and $A^*$ with intermediate complexes $C_1$ and $C_2$. Applying mass-action kinetics, the model of the system is given by

$$\frac{d}{dt}
\begin{bmatrix} [A] \\ [A^*] \\ [C_1] \end{bmatrix}
=
\begin{bmatrix}
-1 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & -1 & 1 & 0 \\
1 & -1 & -1 & 0 & 0 & 0
\end{bmatrix}
\cdot
\begin{bmatrix}
k_1 [A] \left( E_{1,tot} - [C_1] \right) \\
k_2 [C_1] \\
k_3 [C_1] \\
k_4 [A^*] \left( E_{2,tot} - A_{tot} + [A] + [A^*] + [C_1] \right) \\
k_5 \left( A_{tot} - [A] - [A^*] - [C_1] \right) \\
k_6 \left( A_{tot} - [A] - [A^*] - [C_1] \right)
\end{bmatrix} \tag{8.4}$$

with the conservation relationships $E_{1,tot} = [E_1] + [C_1]$, $E_{2,tot} = [E_2] + [C_2]$, and $A_{tot} = [A] + [A^*] + [C_1] + [C_2]$.
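The three conservation relationships can be checked directly against the stoichiometry of reaction scheme (8.3): a weighted sum of species is conserved when its coefficient vector, multiplied from the left onto the stoichiometric matrix, gives zero for every reaction. The sketch below is an illustration only. It assumes the species ordering A, A*, C1, C2, E1, E2 and the reaction ordering k1 to k6; the full (unreduced) matrix is written out by hand from (8.3), and its first three rows coincide with the matrix in (8.4).

```python
# Species order (rows): A, A*, C1, C2, E1, E2. Reaction order (columns): k1..k6.
N_FULL = [
    [-1, 1, 0, 0, 0, 1],    # A
    [0, 0, 1, -1, 1, 0],    # A*
    [1, -1, -1, 0, 0, 0],   # C1
    [0, 0, 0, 1, -1, -1],   # C2
    [-1, 1, 1, 0, 0, 0],    # E1
    [0, 0, 0, -1, 1, 1],    # E2
]

# Coefficient vectors zeta of the three conservation relationships.
CONSERVATION = {
    "A_tot": [1, 1, 1, 1, 0, 0],
    "E1_tot": [0, 0, 1, 0, 1, 0],
    "E2_tot": [0, 0, 0, 1, 0, 1],
}


def left_product(zeta, n):
    """Return zeta^T N, the net change of the weighted sum per reaction."""
    return [
        sum(zeta[i] * n[i][j] for i in range(len(n)))
        for j in range(len(n[0]))
    ]


for name, zeta in CONSERVATION.items():
    print(name, left_product(zeta, N_FULL))

# Three relationships among six species leave 6 - 3 = 3 independent states,
# which is why (8.4) is written for [A], [A*], and [C1] only.
REDUCED_STATES = len(N_FULL) - len(CONSERVATION)
```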

For the analyses in the following sections, we use the total concentrations $A_{tot} = 1$ and $E_{1,tot} = E_{2,tot} = 0.01$, and the nominal parameter values $k_1 = 10^5$, $k_4 = 5 \cdot 10^4$, $k_2 = k_5 = 1$, and $k_3 = k_6 = 10^3$.

The system shows an ultrasensitive behavior with respect to the conversion rate of enzyme $E_2$, represented by parameter $k_6$ [see Figure 8.1(a)]: $[A^*]$ is close to either 0 or 1 for most values of $k_6$ and changes rapidly for $k_6 \approx 10^3$. Therefore, the parameter $k_6$ dramatically influences the steady state of the system, and an appropriate choice of $k_6$ changes the steady state completely from almost 1 to almost 0.
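The ultrasensitive response can be reproduced with a few lines of code. The following is a minimal sketch rather than the procedure used to generate Figure 8.1: instead of integrating (8.4) in time, it assumes that the steady state can be obtained by setting the right-hand side of (8.4) to zero, which gives $[C_1] = E_{1,tot}[A]/(K_1 + [A])$ with $K_1 = (k_2 + k_3)/k_1$, $[C_2] = k_3 [C_1]/k_6$, and $[A^*] = K_2 [C_2]/(E_{2,tot} - [C_2])$ with $K_2 = (k_5 + k_6)/k_4$. The remaining unknown $[A]$ is then found by bisection on the conservation relationship for $A_{tot}$. The function names are illustrative.

```python
def steady_state(k, a_tot=1.0, e1_tot=0.01, e2_tot=0.01):
    """Steady state ([A], [A*], [C1]) of the reduced model (8.4)."""
    k1, k2, k3, k4, k5, k6 = k
    km1 = (k2 + k3) / k1
    km2 = (k5 + k6) / k4

    def complexes(a):
        c1 = e1_tot * a / (km1 + a)   # from d[C1]/dt = 0
        c2 = k3 * c1 / k6             # flux balance k3*[C1] = k6*[C2]
        return c1, c2

    lo, hi = 0.0, a_tot
    for _ in range(200):              # bisection on [A]
        mid = 0.5 * (lo + hi)
        c1, c2 = complexes(mid)
        if c2 >= e2_tot:              # more complex than enzyme: [A] too large
            hi = mid
            continue
        a_star = km2 * c2 / (e2_tot - c2)
        if mid + a_star + c1 + c2 > a_tot:
            hi = mid
        else:
            lo = mid
    c1, c2 = complexes(lo)
    return lo, km2 * c2 / (e2_tot - c2), c1


def rhs(state, k, a_tot=1.0, e1_tot=0.01, e2_tot=0.01):
    """Right-hand side of (8.4), used here only to check the steady state."""
    a, a_star, c1 = state
    k1, k2, k3, k4, k5, k6 = k
    c2 = a_tot - a - a_star - c1
    v = [k1 * a * (e1_tot - c1), k2 * c1, k3 * c1,
         k4 * a_star * (e2_tot - c2), k5 * c2, k6 * c2]
    return [-v[0] + v[1] + v[5], v[2] - v[3] + v[4], v[0] - v[1] - v[2]]


def nominal(k6=1e3):
    """Nominal parameter vector of the text with an adjustable k6."""
    return (1e5, 1.0, 1e3, 5e4, 1.0, k6)


for k6 in (1e1, 1e2, 1e3, 1e4, 1e5):
    print(f"k6 = {k6:8.0f}  steady-state [A*] = {steady_state(nominal(k6))[1]:.4f}")
```

The sweep corresponds to reading off a few points of the continuation diagram in Figure 8.1(a); the `rhs` function allows each computed point to be verified as a steady state of (8.4).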

8.2.2 Parametric steady-state sensitivity

Parametric sensitivity analysis in general addresses how the nominal behavior of the biochemical reaction network changes with respect to parameter variations. We refer to “sensitivity” as a property/behavior of a considered model [14]. The model is defined as “sensitive” when the behavior/property is highly affected by small variations in the parameters or state variables. We assume in the following that the behavior/property of interest is defined in terms of a (virtual) output of (8.2) given by

$$y = h(x, p) \tag{8.5}$$

where $h: \mathbb{R}^n \times \mathbb{R}^l \to \mathbb{R}$.

[Figure 8.1: two panels, (a) and (b), each plotting the steady state of $[A^*]$ (vertical axis, 0 to 1) against $k_6$ (horizontal axis, logarithmic, up to $10^5$).]

Figure 8.1 (a) Ultrasensitive steady-state response of the reversible covalent modification system with respect to variations of the parameter $k_6$. (b) Linear extrapolation (dotted lines) of the steady-state response using linear sensitivity analysis around two nominal steady states (circles).

Remark 1 We limit our attention to the sensitivity with respect to properties/behaviors which are directly given as a function of the states and parameters. In general one might be interested in more complicated properties, such as the frequency of an oscillation, an amplitude, or other properties which might be defined in terms of the complete solution of the dynamical system (see [12, 15]).

With respect to the output/behavior of interest $y$ one might be interested, for instance, in the variation of the steady state that a system reaches due to parameter variations. This is referred to as steady-state sensitivity and is defined next.

Definition 1 Steady-State Sensitivity The steady-state sensitivity is the shift $\Delta y_{ss}$ of the output $y = h(x, p)$ due to a perturbation $\Delta p_j$ of parameter $p_j$:

$$\Delta p_j = p_j - p_{j,nom} \;\rightarrow\; \Delta y_{ss} = \lim_{t \to \infty} \left[ y\big(x(t), p_{j,nom} + \Delta p_j\big) - y\big(x(t), p_{j,nom}\big) \right]$$

Next we restrict our attention to variations of the steady state. Another interesting and important point might be how $y$ changes over time due to step-wise or time-varying parameter perturbations. Approaches to this question are briefly discussed later and references to relevant literature are given.

The most straightforward way to perform a sensitivity analysis is simply to simulate the system for different parameter values and to look at the steady-state response. One obtains a continuation diagram [see Figure 8.1(a) for the covalent modification example] that can already lead to useful statements about the parametric influence, steady-state sensitivity, and stability of steady states (bifurcation analysis). However, this is easy only for low-dimensional systems, and is difficult or not feasible for larger biochemical reaction systems where instability and/or multistationarity might occur.

An approximation of the true steady-state shift due to parameter perturbations can be calculated by extrapolation from the linear approximation of the true solution, which is called linear sensitivity analysis and is explained in detail in the next section. Here we rather want to highlight the differences between local and global sensitivity analysis methods.

An extrapolation based on a local measure provides good approximations for perturbations that are close enough to the nominal value [see Figure 8.2(a)]. However, extrapolation of local properties does not necessarily lead to global sensitivity properties due to nonlinearities. Thus, local sensitivity contains only partial information. This becomes evident for large perturbations [see Figure 8.2(b)], where a local approximation does not provide sufficiently good estimates of the steady-state shift. New approaches to sensitivity analysis often try to extend the range of validity of the extrapolations by including second-order or bilinear terms [16, 17]. However, such methods are in general still local. Global sensitivity methods are especially important, for example, in pharmaceutical problems, where the influence of certain parameters has to be known over the complete range of possibilities (e.g., for all possible weights of the patient or for all possible variations in blood pressure).
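The gap between local extrapolation and the true steady-state shift can be made concrete for the covalent modification example. The sketch below rests on several assumptions that are not part of the chapter: the steady state of $[A^*]$ is computed by bisection on the algebraic steady-state conditions of (8.4), the local sensitivity with respect to $k_6$ is estimated by a central finite difference instead of the analytical expressions of Section 8.3, and the step size and the two perturbation sizes are arbitrary illustrative choices.

```python
def a_star_ss(k6, k1=1e5, k2=1.0, k3=1e3, k4=5e4, k5=1.0,
              a_tot=1.0, e1_tot=0.01, e2_tot=0.01):
    """Steady-state [A*] of model (8.4), found by bisection on [A]."""
    km1 = (k2 + k3) / k1
    km2 = (k5 + k6) / k4

    def parts(a):
        c1 = e1_tot * a / (km1 + a)   # from d[C1]/dt = 0
        c2 = k3 * c1 / k6             # flux balance k3*[C1] = k6*[C2]
        return c1, c2

    lo, hi = 0.0, a_tot
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        c1, c2 = parts(mid)
        if c2 >= e2_tot or mid + km2 * c2 / (e2_tot - c2) + c1 + c2 > a_tot:
            hi = mid
        else:
            lo = mid
    c1, c2 = parts(lo)
    return km2 * c2 / (e2_tot - c2)


K6_NOM = 1e3
Y_NOM = a_star_ss(K6_NOM)

# Local sensitivity d[A*]_ss / dk6 at the nominal point (central difference).
STEP = 0.1
S_LOCAL = (a_star_ss(K6_NOM + STEP) - a_star_ss(K6_NOM - STEP)) / (2 * STEP)


def compare(delta_k6):
    """Return (true shift, linearly extrapolated shift) for a perturbation."""
    true_shift = a_star_ss(K6_NOM + delta_k6) - Y_NOM
    return true_shift, S_LOCAL * delta_k6


for delta in (1.0, 9000.0):
    true_shift, linear_shift = compare(delta)
    print(f"delta k6 = {delta:7.1f}: true {true_shift:+.4f}, linear {linear_shift:+.4f}")
```

For the small perturbation the two numbers should nearly coincide, whereas for the large one the linear extrapolation predicts a shift that would drive $[A^*]$ far below zero, which is the failure mode the text describes.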

Global sensitivity analysis methods (such as those introduced in Section 8.5) may partly overcome this limitation. The question we seek to answer is: What are the (minimum and maximum) domains in state space that contain valid solutions due to a range of parameter values?

In the following section the linear sensitivity approach is considered.



8.3 Linear Sensitivity Analysis

In the following two sections we analyze the behavior of system (8.2) locally around its nominal solution. This allows one to make predictions for parameter perturbations that are close enough to their nominal values or for systems in which the influence of the parameter over the range of possible conditions does not change dramatically. The approach is to perform a Taylor series expansion of $N v(x, p)$ in $x$ and $p$ around the nominal solution $(x_{nom}, p_{nom})$:

$$\frac{d}{dt} \Delta x = N v(x_{nom}, p_{nom}) + N \left( \frac{\partial v(x_{nom}, p_{nom})}{\partial x} \Delta x + \frac{\partial v(x_{nom}, p_{nom})}{\partial p} \Delta p \right) + \mathcal{O}\left( \|(\Delta x, \Delta p)\|^2 \right) \tag{8.6}$$

$$\Delta y = \frac{\partial h(x_{nom}, p_{nom})}{\partial x} \Delta x + \frac{\partial h(x_{nom}, p_{nom})}{\partial p} \Delta p + \mathcal{O}\left( \|(\Delta x, \Delta p)\|^2 \right) \tag{8.7}$$

where $\Delta x = x - x_{nom}$, $\Delta y = y - y_{nom}$, and $\Delta p = p - p_{nom}$. One obtains the linearization of the system if terms of order two and higher are neglected:

$$\frac{d}{dt} \Delta x = A \Delta x + B \Delta p \tag{8.8}$$

$$\Delta y = C \Delta x + D \Delta p \tag{8.9}$$

where

$$A = \frac{\partial N v(x_{nom}, p_{nom})}{\partial x}, \quad B = N \frac{\partial v(x_{nom}, p_{nom})}{\partial p}, \quad C = \frac{\partial h(x_{nom}, p_{nom})}{\partial x}, \quad D = \frac{\partial h(x_{nom}, p_{nom})}{\partial p}$$

[Figure 8.2: two panels, (a) and (b), each plotting the output $y$ against the parameter $p_j$ around the nominal point $(p_{j,nom}, y_{nom})$, with the perturbation $\Delta p_j$ and the shifts $\Delta y_{ss}^{L}$ and $\Delta y_{ss}^{G}$ marked.]

Figure 8.2 Extrapolation of the steady-state shift $\Delta y_{ss}$ due to parameter perturbations $\Delta p_j$ around a nominal steady state $(p_{j,nom}, y_{nom})$ using local ($\Delta y_{ss}^{L}$) versus global ($\Delta y_{ss}^{G}$) sensitivity analysis methods. (a) Approximation of the steady-state response using linear sensitivity analysis provides good estimates only locally around the nominal steady state. (b) For larger deviations, the steady-state response approximation by the linear sensitivity analysis deviates significantly from the true response. Global sensitivity analysis methods provide better approximations in this case.

The linearization allows one to investigate the behavior of the nonlinear system (8.2) under small/local parameter perturbations.

For an asymptotically stable steady state, a straightforward calculation shows that a constant parameter deviation $\Delta p$ results in the steady-state shifts

$$\Delta x_{ss} = -A^{-1} B \Delta p, \qquad \Delta y_{ss} = -C A^{-1} B \Delta p \tag{8.10}$$

The expression for $\Delta y_{ss}$ applies when the output does not depend directly on the parameters ($D = 0$); otherwise, by (8.9), the term $D \Delta p$ has to be added. The asymptotic stability of the steady state directly implies that $A$ is invertible and therefore (8.10) is well defined. Conservation relations introduce zero eigenvalues, so the unreduced system is not strictly asymptotically stable. In practice it is therefore necessary to reduce the system before performing the analysis.
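As a worked check of (8.10), the following minimal sketch applies the formula to a small hypothetical network that is not taken from the chapter: a species $x_1$ is produced at constant rate $p_1$, converted to $x_2$ at rate $p_2 x_1$, and $x_2$ is degraded at rate $p_3 x_2$. The steady state is known in closed form ($x_1 = p_1/p_2$, $x_2 = p_1/p_3$), so the matrix $-A^{-1}B$ can be compared with the analytical derivatives. The parameter values are arbitrary.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def inverse_2x2(m):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def linear_sensitivity(p):
    """Return -A^{-1} B of (8.10) for the hypothetical two-species chain."""
    p1, p2, p3 = p
    x1, x2 = p1 / p2, p1 / p3                     # nominal steady state
    n = [[1, -1, 0], [0, 1, -1]]                  # stoichiometric matrix N
    dv_dx = [[0, 0], [p2, 0], [0, p3]]            # dv/dx for v = (p1, p2*x1, p3*x2)
    dv_dp = [[1, 0, 0], [0, x1, 0], [0, 0, x2]]   # dv/dp at the steady state
    a = matmul(n, dv_dx)                          # A = N dv/dx
    b = matmul(n, dv_dp)                          # B = N dv/dp
    return [[-entry for entry in row] for row in matmul(inverse_2x2(a), b)]


P_NOM = (2.0, 4.0, 0.5)
S_X = linear_sensitivity(P_NOM)

# Output y = x2, i.e. C = [0, 1] and D = 0, so the output sensitivity is C * S_X.
S_Y = matmul([[0, 1]], S_X)[0]
print("dx_ss/dp =", S_X)
print("dy_ss/dp =", S_Y)
```

Because this toy system has no conservation relations, $A$ is invertible without any prior reduction; for the covalent modification system the same computation would have to be carried out on the reduced model (8.4).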

Definition 2 Linear Steady-State Sensitivity The linear steady-state sensitivity of output $y(x, p)$ with respect to constant changes of parameter $p$ is defined as

$$
S_L^{y,p} = \lim_{\Delta p \to 0} \frac{\Delta y_{ss}}{\Delta p} = \frac{\partial C x_{ss}}{\partial p} \tag{8.11}
$$

The linear steady-state sensitivity $S_L$ can be used to approximate the steady-state response locally around a nominal steady state, as depicted in Figure 8.1(b).

A classical approach to sensitivity analysis is metabolic control analysis (MCA) (see [18] for a review). Two sensitivities are commonly used in MCA. The first, the concentration response coefficient, is equivalent to the steady-state sensitivity given in Definition 2. The second, the flux response coefficient, measures the linear sensitivity of the rate vector at steady state with respect to the parameters (see [18]).

Linear sensitivity analysis commonly uses relative sensitivities, which measure the impact of a relative change of a parameter. To simplify the presentation, this chapter discusses only the unscaled case.
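For orientation, the usual scaling that turns an unscaled sensitivity into a relative one is sketched below. The symbol $\bar S_L^{y,p}$ is an assumption introduced here for illustration; the formula presumes a scalar output and parameter with positive nominal values.

```latex
\bar S_L^{y,p}
  = \frac{p_{\mathrm{nom}}}{y_{\mathrm{nom}}}\, S_L^{y,p}
  = \frac{p_{\mathrm{nom}}}{y_{\mathrm{nom}}}\,
    \frac{\partial y_{ss}}{\partial p}
  = \frac{\partial \ln y_{ss}}{\partial \ln p}
```

A relative sensitivity of 1 then means that a 1% change of the parameter changes the steady-state output by approximately 1%.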

Remark 2 The analysis presented so far is restricted to the steady-state response or the asymptotic response. However, the transient or early response of a biochemical reaction network might be important when, for example, timing in the network is essential. In such cases, the linear time-varying sensitivity should be considered instead [12]; this allows one to draw conclusions about temporal sensitivity. If parameters are subjected to external time-varying perturbations, a frequency-domain analysis should be taken into account [15, 19].

Linear steady-state sensitivity analysis has become an important tool in the analysis of biochemical reaction networks because it is well known, is easy to apply even to large systems, and has many extensions available.

A disadvantage of this method is that it provides estimates of the steady-state shift only for small parameter perturbations [Figure 8.1(b)]. A further disadvantage is that (8.10) is only applicable to strictly asymptotically stable systems, which requires the elimination of conservation relations of the system before any analysis.

In the next sections we propose two new methods that partially overcome some of the drawbacks of linear steady-state sensitivity analysis.

8.4 Sensitivity Analysis Via Empirical Gramians

The sensitivity measures presented so far explicitly depend on the linearization of the system under consideration. The expression of the steady-state shift according to (8.10) also requires inversion of the Jacobian $A$, which causes problems for systems that are not asymptotically stable. We present a method to compute the steady-state sensitivity of the nonlinear system under consideration directly from transient simulation data. For the proposed method, it is not necessary to linearize the model, to perform a matrix inversion, or to reduce the conservation relations beforehand.

Parametric sensitivity of a dynamic system can be analyzed using notions and methods that deal with input-output systems if one considers the parameters to be variable, possibly time-dependent inputs $p(t)$, and the outputs $y$ to be the variables of interest. A control theoretic notion that deals with the influence of the inputs on the states is controllability. Informally, a system is controllable if all of its states can be changed arbitrarily by its inputs. The notion of observability deals, informally speaking, with the influence of the states on the outputs. The controllability and observability of a linear system can be quantified (in an $L_2$-optimal sense) by the controllability and observability Gramians that are introduced next.

8.4.1 Gramians and linear sensitivity analysis

How easily the input $p(t)$ can influence the states is quantified by the controllability Gramian $W_c$, whereas the observability Gramian $W_o$ measures the state “energy” that is visible in the output. The two Gramians are defined as

$$
W_c = \int_{-\infty}^{0} e^{-A\tau} B B^T e^{-A^T\tau}\, d\tau
\qquad\text{and}\qquad
W_o = \int_{0}^{\infty} e^{A^T\tau} C^T C\, e^{A\tau}\, d\tau
$$

Both Gramians capture the asymptotic behavior (since the integrals extend infinitely far into the past or future) as well as the transient behavior of the system. The Gramians can be used for multiple purposes. For example, the analysis of the eigenvalues and corresponding eigenvectors of the controllability and observability Gramians reveals which directions of the system are best (in an $L_2$-optimal sense) controllable and observable. Eigenvectors corresponding to large eigenvalues indicate the directions that are most sensitive. In particular, states in the kernel of the controllability Gramian are uncontrollable, and states in the kernel of the observability Gramian are unobservable. However, analyzing controllability and observability separately gives distinct views about the importance of directions in the state space. For an input-output analysis such as sensitivity analysis, a combination of both is required. Moore [20] showed that a straightforward combination can be misleading. For example, the least observable states could be very controllable, so that a small input signal could still result in a nonnegligible output signal.
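The last point can be made concrete with a small sketch. It assumes a hypothetical two-state system with diagonal $A$, for which both Gramian integrals have the closed form $-b_i b_j/(a_i + a_j)$ and $-c_i c_j/(a_i + a_j)$; the numbers are chosen for illustration and do not come from this chapter.

```python
def gramians_diagonal(a, b, c):
    """Closed-form Gramians for dx/dt = diag(a) x + b p, y = c x, all a[i] < 0."""
    n = len(a)
    Wc = [[-b[i] * b[j] / (a[i] + a[j]) for j in range(n)] for i in range(n)]
    Wo = [[-c[i] * c[j] / (a[i] + a[j]) for j in range(n)] for i in range(n)]
    return Wc, Wo


a = [-1.0, -2.0]
b = [1.0, 100.0]  # state 2 is strongly driven by the input
c = [1.0, 0.01]   # ... but only weakly visible in the output
Wc, Wo = gramians_diagonal(a, b, c)

# Contribution of each state to the steady-state gain -C A^{-1} B
gain_terms = [-c[i] * b[i] / a[i] for i in range(len(a))]
print("controllability of state 2:", Wc[1][1])
print("observability of state 2:  ", Wo[1][1])
print("gain contributions:        ", gain_terms)
```

Judged by observability alone, state 2 looks negligible, yet it contributes half as much to the steady-state gain as state 1, because its weak observability is offset by strong controllability.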



A combined view of observability and controllability is provided by the cross Gramian [21, 22], which is defined for a single-input single-output system as

$$
W_{co} = \int_{0}^{\infty} e^{A\tau} B C\, e^{A\tau}\, d\tau \tag{8.12}
$$

The cross Gramian is related to the controllability and observability Gramians not only by its similar definition. For single-input single-output systems it also holds [23] that $W_{co}^2 = W_c W_o$, which shows that the cross Gramian contains information from both the controllability and the observability Gramian.

One can show that for a single-input single-output system the steady-state system gain with respect to a step input is proportional to the sum of the eigenvalues of the corresponding cross Gramian [22]. This relationship can be used to derive the following result (proof given in [24]), which relates steady-state linear sensitivity analysis to the cross Gramian and sets the stage for the development of nonlinear extensions in the following section.

Theorem 1 The linear steady-state sensitivity of output $y(x, p) = Cx$ with respect to parameter $p$ is related to an appropriately chosen cross Gramian $W_{co}^{(p,C)}$:

$$
S_L^{y,p} = \frac{\partial C x_{ss}}{\partial p} = 2\,\mathrm{trace}\!\left(W_{co}^{(p,C)}\right)
$$

in which the matrix $W_{co}^{(p,C)}$ is the cross Gramian for the input $p$ and the output $y = Cx$.
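The trace identity behind Theorem 1 can be checked in a few lines for an asymptotically stable single-input single-output system. The following is a sketch of that calculation, not the proof given in [24]; it uses only the cyclic property of the trace and the fact that $e^{2A\tau} \to 0$ as $\tau \to \infty$.

```latex
\begin{aligned}
\operatorname{trace}(W_{co})
  &= \int_0^\infty \operatorname{trace}\!\left(e^{A\tau} B C\, e^{A\tau}\right) d\tau
   = \int_0^\infty C\, e^{2A\tau} B \, d\tau \\
  &= C \left[ \tfrac{1}{2} A^{-1} e^{2A\tau} \right]_0^\infty B
   = -\tfrac{1}{2}\, C A^{-1} B ,
\end{aligned}
\qquad\text{so}\qquad
2 \operatorname{trace}(W_{co}) = -C A^{-1} B .
```

The right-hand side is the steady-state gain $\Delta y_{ss}/\Delta p$ from (8.10).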

With this relation, it is possible to quantify how the steady-state response of a particular output is influenced by a step perturbation of a particular parameter, based on the notions of controllability and observability. Furthermore, it is straightforward to show that the cross Gramian contains as a special case the sensitivity covariance matrix introduced in [25]. The main limitation of the Gramians introduced so far is that they are only defined for, and applicable to, linear systems. In the next section we derive an extension of the Gramians to nonlinear systems that leads to a new sensitivity measure for nonlinear systems motivated by the linear case.

8.4.2 Empirical Gramians for nonlinear systems

As outlined earlier, the main idea for a new sensitivity measure for nonlinear systems is to consider parameters as inputs that lead to changes in the system and thus in the output of interest. This input-output behavior is analyzed using the concept of Gramians. This, however, requires a suitable extension of the linear cross Gramian to the nonlinear case.

In recent years significant progress has been made in deriving Gramians for nonlinear systems [26]. However, calculating the Gramians explicitly is still a difficult task. Empirical controllability and observability Gramians were suggested in [27] to overcome this problem. Here we follow this approach, derive a method to compute the empirical cross Gramian, and adapt it to sensitivity analysis.

The idea is to derive the empirical cross Gramian by averaging the system behaviors that result from systematic parameter perturbations. The perturbations and the averaging are chosen in such a way that, first, the state space and parameter space of interest are probed and sampled, and, second, the result reduces to the usual cross Gramian (8.12) when applied to a linear system. This ensures consistency between the usual Gramian of an asymptotically stable linear system and the empirical Gramian of a nonlinear system in the limit of infinitely small perturbations.

First, we define the perturbations of the parameters and of the state space, the resulting deviations of the behavior, and its temporal mean. Then, we define the empirical cross Gramian that will be used for sensitivity analysis.

Definition 3 Systematic System Perturbations The considered perturbations of the parameter, $p_s(t) = p_{\mathrm{nom}} + d_s\,\delta(t)$, are defined by the set

$$
S_\sigma = \{ d_1, \ldots, d_\sigma;\ d_s \in \mathbb{R},\ d_s > 0,\ s = 1, \ldots, \sigma \}
$$

where $\delta(t)$ denotes the impulsive input (Dirac impulse). Perturbations of the initial conditions, $x_0^{rkj} = x_{0,\mathrm{nom}} + c_k R_r e_j$, are given by the sets

$$
\begin{aligned}
R_\rho &= \{ R_1, \ldots, R_\rho;\ R_r \in \mathbb{R}^{n \times n},\ R_r^T R_r = I,\ r = 1, \ldots, \rho \} \\
K_\kappa &= \{ c_1, \ldots, c_\kappa;\ c_k \in \mathbb{R},\ c_k > 0,\ k = 1, \ldots, \kappa \} \\
E_n &= \{ e_1, \ldots, e_n;\ e_i \in \mathbb{R}^n,\ e_i^T e_j = 1 \text{ if } i = j,\ e_i^T e_j = 0 \text{ if } i \neq j,\ i, j = 1, \ldots, n \}
\end{aligned}
$$

Let $\bar u = \frac{1}{t} \int_0^t u(\tau)\, d\tau$ denote the temporal mean of a function $u(t)$, and let $\Delta u(t) = u(t) - \bar u$ denote the deviation of $u(t)$ from its temporal mean $\bar u$. $\Phi^{p_s}(t)$ denotes the solution (behavior) due to the parameter perturbation $p_s(t)$, and $\Phi^{x_0^{rkj}}(t)$ denotes the solution (behavior) due to the perturbation of the initial condition, that is, for $(p_{\mathrm{nom}}, x_0^{rkj})$.

The set $S_\sigma$ defines scales for the parameter perturbations that are required to investigate the controllability component of the system. Perturbations of the state space are required to account for the observability component. The state space perturbations are parameterized by the sets $R_\rho$ and $K_\kappa$, which define orthogonal coordinate systems and different scales for each perturbed direction in the state space.

The perturbation sets should be chosen such that the states and the parameter domain of interest are covered. Using these sets and the data collected from the perturbation simulations, a construction for empirical controllability and observability Gramians was proposed in [27]. The advantage of this construction is that the empirical Gramians reduce to the usual Gramians when applied to a linear time-invariant system.

Sensitivity analysis requires one to take into account both controllability and observability at once. Next we introduce the empirical cross Gramian, which accounts for both.

8.4.3 A new sensitivity measure based on Gramians

We next show how the perturbations introduced in Definition 3 can be used to construct an empirical cross Gramian from simulation data.



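Definition 4 Empirical Cross Gramian Let $S_\sigma$, $R_\rho$, $K_\kappa$, and $E_n$ be given sets as in Definition 3. For system (8.2) with scalar input and scalar output, define the empirical cross Gramian $\tilde W_{co}$ around the steady state $x_{0,\mathrm{nom}}$ with corresponding nominal input $p_{\mathrm{nom}}$ by

$$
\tilde W_{co} = \sum_{s=1}^{\sigma} \sum_{k=1}^{\kappa} \frac{1}{\sigma \kappa\, d_s c_k}\, \Psi^{srk} \tag{8.14a}
$$

where the entries of the $n \times n$ matrix $\Psi^{srk}$ are given for all $i, j = 1, \ldots, n$ by

$$
\Psi^{srk}_{i,j} = \int_0^t e_i^T R_r^T\, \Delta\Phi^{p_s}(\tau)\; C\, \Delta\Phi^{x_0^{rkj}}(\tau)\, d\tau \tag{8.14b}
$$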

The definition of the empirical cross Gramian and the choice of the impulsive input may seem somewhat arbitrary and nonintuitive. The controllability component ($\Delta\Phi^{p_s}$) of the empirical cross Gramian is due to the different input perturbations, whereas the observability component ($\Delta\Phi^{x_0^{rkj}}$) is due to perturbations of the initial conditions. This again shows that the cross Gramian includes both controllability and observability.

Furthermore, the definition was constructed in such a way that the empirical cross Gramian falls back to the usual cross Gramian when applied to a linear, asymptotically stable system.

Proposition 1 For any nonempty sets $R_\rho$, $K_\kappa$, and $S_\sigma$, the empirical cross Gramian $\tilde W_{co}$ of an asymptotically stable linear system (8.8), (8.9) is equal to the usual cross Gramian $W_{co}$ [see (8.12)] for large integration times $t \to \infty$.

Proof: For the linear system (8.8), (8.9), which is written in deviation variables, the nominal values are $x_{\mathrm{nom}} = 0$ and $p_{\mathrm{nom}} = 0$. Thus, the solutions for the perturbations $p_s(t) = d_s\,\delta(t)$ and $x_0^{rkj} = c_k R_r e_j$ are given by

$$
\Phi^{x_0^{rkj}}(t) = e^{At} c_k R_r e_j
$$

and

$$
\Phi^{p_s}(t) = \int_0^t e^{A(t-\tau)} B\, d_s\, \delta(\tau)\, d\tau = e^{At} B\, d_s
$$

All perturbed trajectories converge to the origin, independently of $s$, $r$, $k$, and $j$. For long integration times the temporal means therefore vanish, so that $\Delta\Phi = \Phi$. Then, (8.14b) simplifies to

$$
\Psi^{srk}_{i,j} = \int_0^\infty e_i^T R_r^T e^{A\tau} B\, d_s\; C\, e^{A\tau} c_k R_r e_j\, d\tau
$$

and since $R_r$ is orthonormal, we obtain the desired result

$$
\Psi^{srk} = d_s c_k \int_0^\infty e^{A\tau} B C\, e^{A\tau}\, d\tau, \qquad
\tilde W_{co} = \frac{1}{\sigma\kappa} \sum_{s,k} \frac{d_s c_k \int_0^\infty e^{A\tau} B C\, e^{A\tau}\, d\tau}{d_s c_k} = W_{co}
$$



In Theorem 1 we have shown that the cross Gramian is directly related to the linear steady-state sensitivity. We employ this theorem to define an empirical sensitivity measure.

Definition 5 Empirical Cross Gramian-Based Sensitivity Measure The empirical sensitivity is given by

$$
S_E^{y,p} = 2\,\mathrm{trace}\!\left(\tilde W_{co}\right)
$$

where $\tilde W_{co}$ is the empirical cross Gramian as defined in Definition 4.

Due to Proposition 1, the empirical sensitivity measure $S_E$ is identical to the linear steady-state sensitivity $S_L$ if the conditions of Proposition 1 hold.
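The construction of Definitions 4 and 5 can be sketched in a few dozen lines for a linear test case, where the result can be compared with the exact gain $-CA^{-1}B$. This is a minimal illustration under stated assumptions, not the implementation used for the results in this chapter: the system matrices are hypothetical, the rotation set is restricted to $R = I$, the impulse $d_s\delta(t)$ is applied by starting the state at $d_s B$ (valid for $\dot x = Ax + Bp$), and a hand-written RK4 integrator replaces a proper ODE solver.

```python
def simulate(A, x0, t_end, dt):
    """Integrate dx/dt = A x with classical RK4 and return all visited states."""
    n = len(x0)

    def f(x):
        return [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]

    x, states = list(x0), [list(x0)]
    for _ in range(int(round(t_end / dt))):
        k1 = f(x)
        k2 = f([x[i] + 0.5 * dt * k1[i] for i in range(n)])
        k3 = f([x[i] + 0.5 * dt * k2[i] for i in range(n)])
        k4 = f([x[i] + dt * k3[i] for i in range(n)])
        x = [x[i] + dt * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]) / 6 for i in range(n)]
        states.append(x)
    return states


def deviation(signal):
    """Subtract the temporal mean, as in Definition 3."""
    mean = sum(signal) / len(signal)
    return [u - mean for u in signal]


def trapezoid(values, dt):
    """Trapezoidal rule on an equidistant grid."""
    return dt * (sum(values) - 0.5 * (values[0] + values[-1]))


def empirical_cross_gramian(A, B, C, d_set, c_set, t_end=100.0, dt=0.02):
    """Empirical cross Gramian (8.14a), (8.14b) with R = I (assumption)."""
    n = len(B)
    W = [[0.0] * n for _ in range(n)]
    weight = 1.0 / (len(d_set) * len(c_set))
    for d in d_set:
        # The impulse d*delta(t) moves the state of dx/dt = A x + B p to d*B at t = 0+.
        traj_p = simulate(A, [d * b for b in B], t_end, dt)
        dphi_p = [deviation([x[i] for x in traj_p]) for i in range(n)]
        for c in c_set:
            for j in range(n):
                x0 = [c if i == j else 0.0 for i in range(n)]  # c_k * e_j
                traj_x = simulate(A, x0, t_end, dt)
                dy = deviation([sum(C[i] * x[i] for i in range(n)) for x in traj_x])
                for i in range(n):
                    psi = trapezoid([u * v for u, v in zip(dphi_p[i], dy)], dt)
                    W[i][j] += weight * psi / (d * c)
    return W


A = [[-1.0, 0.5], [0.0, -2.0]]  # hypothetical stable test system
B = [1.0, 1.0]
C = [1.0, 2.0]
W = empirical_cross_gramian(A, B, C, d_set=[0.5, 1.0], c_set=[1.0])
S_E = 2.0 * (W[0][0] + W[1][1])
print("empirical sensitivity S_E:", S_E, "(exact gain -C A^{-1} B = 2.25)")
```

Because the temporal mean is taken over a finite horizon, the estimate approaches the exact gain only as the integration time grows, in line with the condition $t \to \infty$ in Proposition 1. For a nonlinear model, `simulate` would have to be replaced by an integrator for (8.2), and the impulsive parameter perturbation would need separate treatment.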

The sensitivity measure can be computed directly for nonlinear systems; no linearization is required. It is based on simulation data that are weighted so that the measure is consistent with the linear sensitivity of a linear system. In the limit of small perturbations, equivalence to the linear sensitivity is expected. For larger perturbations, the measure is expected to give better estimates of the steady-state shift, so that less local statements can be derived.

This approach might also be directly applicable to a wider system class than the one that can be treated by linearization. For instance, zero eigenvalues, such as those arising from conservation relations, do not pose any problems for the analysis. Thus, eliminating conservation relations beforehand is not necessary.

8.4.4 Example: covalent modification system

Next we consider the example system (8.4) and analyze the properties of the empirical sensitivity measure.

Figure 8.3 shows the values of the nonlinear empirical cross Gramian-based sensitivity measure for different development points (i.e., nominal parameter sets around which the perturbations are considered). As can be seen, the results obtained with the nonlinear empirical cross Gramian are largely consistent with the results obtained by linearization for all development points considered. This most probably results from the averaging over the various perturbations and initial conditions. While this is somewhat surprising, it does not limit the applicability of the method. The strength of the approach lies rather in its applicability to a wider class of systems, as explained in the previous section.

Capturing the complete range of the influence of parameters does not require a method that averages the behavior over all possible parameter variations. Rather, an explicit bounding should be used to obtain better insight; this is outlined in the next section.



8.5 Sensitivity Analysis Via Infeasibility Certificates

A common use of sensitivity measures is to estimate the steady-state shift that occurs due to parameter variations. However, as pointed out, due to the local nature of most sensitivity measures, the estimated steady-state shift is rarely valid for larger parameter perturbations. The aim of this section is to present an approach that computes reliable outer bounds on the steady states of the biochemical network under a specified parameter uncertainty and thus provides insight into the global influence that parameter changes have on the output or state.

Computing the set of steady states analytically is only possible in very rare cases. Even if an analytical solution for the steady state is known, computing the corresponding set for all possible parameter values may be difficult. Due to this difficulty, nondeterministic approaches are frequently used to solve this problem. A common tool for this kind of analysis is the Monte Carlo method, which is routinely applied in the analysis of uncertain biochemical reaction networks. However, Monte Carlo approaches to the problem at hand typically require that all of the possibly multiple steady states for specific parameter values can be computed explicitly, which is often a difficult task in itself.

The approach presented in this section avoids the direct computation of steady states by making use of the specific properties of biochemical reaction networks as given by (8.2). In particular, it is assumed that fluxes are modeled using the law of mass action, that is, $v(x, p)$ takes the form

$$
v_j(x, p) = p_j \prod_{k=1}^{n} x_k^{\sigma_{jk}} \tag{8.15}
$$

for $j = 1, \ldots, m$. The constants $\sigma_{jk}$ are integers representing the stoichiometric coefficient of the species $k$ taking part in the $j$th reacting complex. Note that an extension to more complicated reaction schemes that are described by rational functions, such as Michaelis-Menten kinetics, is easily possible.
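A minimal sketch of how the mass action fluxes (8.15) can be evaluated is given below. The two-reaction network and its numbers are hypothetical and serve only to show the role of the exponents $\sigma_{jk}$.

```python
def mass_action_fluxes(x, p, sigma):
    """Evaluate v_j = p_j * prod_k x_k**sigma[j][k] for all reactions j, cf. (8.15)."""
    fluxes = []
    for p_j, row in zip(p, sigma):
        v = p_j
        for x_k, s in zip(x, row):
            v *= x_k ** s
        fluxes.append(v)
    return fluxes


# Hypothetical network with species (A, B, C):
# reaction 1: A + B -> C (rate constant 2.0), reaction 2: C -> A + B (rate constant 0.5)
sigma = [[1, 1, 0],
         [0, 0, 1]]
v = mass_action_fluxes(x=[3.0, 4.0, 5.0], p=[2.0, 0.5], sigma=sigma)
print(v)
```

Each row of `sigma` lists, for one reaction, how often every species appears in the reacting complex, so a bimolecular reaction yields a flux that is a product of the rate constant and two concentrations.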

The problem under consideration can be formulated as follows. Given a set $P \subset \mathbb{R}^l$ in parameter space, we aim to compute a set $X_s^* \subset \mathbb{R}^n$ that consists of all steady states that the system (8.2) can attain for parameter values taken from $P$. Mathematically, this is written as

$$
X_s^* = \{ x \in \mathbb{R}^n : \exists\, p \in P,\ Nv(x, p) = 0 \} \tag{8.16}
$$

However, $X_s^*$ can rarely be computed directly. Instead, we are looking for an outer bound $X_s \supset X_s^*$, which should be as tight as possible.

In order to search for sets of steady states for a given parameter set $P$, we need means to test whether a candidate solution $\tilde X_s$ obtained in such a search is actually valid or not. Such a test is readily formulated as a feasibility problem. Moreover, it will turn out that the Lagrangian dual for this feasibility problem allows one to certify given regions in state space as not containing a steady state for any parameter value from the set $P$. This information can be used to implement an algorithm that constructs outer bounds on the region $X_s^*$ of all steady states.

For ease of presentation, only hyper-rectangles in the state and parameter space are considered for the sets $X_s$ and $P$, although the results are readily extended to convex polytopes in general.

[Figure 8.3: (a) table listing $S_L$ and $S_E$ for each nominal point ($k_6$, $[A^*]$); (b) plot of the steady state of $[A^*]$ (0 to 1) against $k_6$ (logarithmic axis from $10^1$ to $10^5$).]

Figure 8.3 Numerical comparison of the empirical cross Gramian based sensitivity measure with linear steady state sensitivity for the covalent modification example (a) and five different development points indicated by circles in (b). Perturbation sets used were $S$: $d_s \in [10^{-4}, 10^{-3}]$, $K$: $c_k \in [10^{-4}, 1]$, $R = I_n$.

8.5.1 Feasibility problem and semidefinite relaxation

The problem of testing whether a given hyper-rectangle $X_s$ in state space contains steady states of the system (8.2), for some parameter values in a given hyper-rectangle $P$ in parameter space, can be formulated as the following feasibility problem [28]:

$$
\begin{aligned}
\text{find} \quad & x \in \mathbb{R}^n,\ p \in \mathbb{R}^l \\
\text{s.t.} \quad & Nv(x, p) = 0 \\
& p_{j,\min} \le p_j \le p_{j,\max}, \quad j = 1, \ldots, l \\
& x_{i,\min} \le x_i \le x_{i,\max}, \quad i = 1, \ldots, n
\end{aligned}
\tag{8.17}
$$

The feasibility problem (8.17) can be relaxed to a semidefinite program as follows. In the first step, construct a vector $\xi$ containing monomials that occur in the reaction flux vector $v(x, p)$ [29]. In the special case where no single reaction has more than two reagents, a starting point for the construction of $\xi$ is

$$
\xi^T = (1, p_1, \ldots, p_l, x_1, \ldots, x_n, p_1 x_1, \ldots, p_j x_i, \ldots, p_l x_n)
$$

which can usually be reduced by eliminating components that are not required to represent the reaction fluxes. Define $k$ such that $\xi \in \mathbb{R}^k$. Note that this approach is not limited to second-order reaction networks. In more general cases, one has to extend the vector $\xi$ by monomials that are products of several state variables.

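Using the vector $\xi$, the elements of the flux vector $v(x, p)$ can be expressed as

$$
v_j(x, p) = \xi^T V_j\, \xi, \quad j = 1, \ldots, m \tag{8.18}
$$

where $V_j \in \mathcal{S}^k$ is a constant symmetric matrix and $\mathcal{S}^k$ denotes the set of symmetric $k \times k$ matrices. Using (8.18), the system (8.2) can be written as

$$
\dot x_i = \xi^T Q_i\, \xi, \quad i = 1, \ldots, n \tag{8.19}
$$

where $Q_i = \sum_{j=1}^{m} N_{ij} V_j \in \mathcal{S}^k$ are constant symmetric matrices and $N_{ij}$ are the entries of the stoichiometric matrix $N$.

The original feasibility problem (8.17) is thus equivalent to the problem

$$
\begin{aligned}
\text{find} \quad & \xi \in \mathbb{R}^k \\
\text{s.t.} \quad & \xi^T Q_i\, \xi = 0, \quad i = 1, \ldots, n \\
& B \xi \ge 0 \\
& \xi_1 = 1
\end{aligned}
\tag{8.20}
$$

where the matrix $B \in \mathbb{R}^{(2k-2) \times k}$ is constructed to cover the inequality constraints in (8.17).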
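The quadratic-form representation (8.18) can be made concrete with a small sketch. It assumes a hypothetical network with one parameter $p_1$ and two species, so that $\xi = (1, p_1, x_1, x_2, p_1 x_1, p_1 x_2)$, and represents the bimolecular flux $v = p_1 x_1 x_2$ as the product of the two components $p_1 x_1$ and $x_2$ of $\xi$. The helper names are illustrative only.

```python
def monomial_vector(p1, x1, x2):
    """xi = (1, p1, x1, x2, p1*x1, p1*x2) for a hypothetical two-species network."""
    return [1.0, p1, x1, x2, p1 * x1, p1 * x2]


def symmetric_matrix(k, i, j):
    """Symmetric k x k matrix V such that xi^T V xi = xi[i] * xi[j]."""
    V = [[0.0] * k for _ in range(k)]
    V[i][j] += 0.5
    V[j][i] += 0.5
    return V


def quadratic_form(xi, V):
    """Evaluate xi^T V xi."""
    k = len(xi)
    return sum(xi[a] * V[a][b] * xi[b] for a in range(k) for b in range(k))


xi = monomial_vector(p1=2.0, x1=3.0, x2=5.0)
V = symmetric_matrix(len(xi), 4, 3)  # flux p1*x1*x2 = (p1*x1) * x2 = xi[4] * xi[3]
print(quadratic_form(xi, V))         # equals p1 * x1 * x2
```

Splitting the coefficient evenly between the entries $(i, j)$ and $(j, i)$ keeps $V$ symmetric, which is what (8.18) requires; the representation is generally not unique, since the same monomial can often be written as a product of different pairs of components of $\xi$.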

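A relaxation to a semidefinite program is found by setting $X = \xi \xi^T$. The resulting nonconvex constraint $\operatorname{rank} X = 1$ is omitted in the relaxation. The relaxed version of the original feasibility problem (8.17) is thus obtained as

$$
\begin{aligned}
\text{find} \quad & X \in \mathcal{S}^k \\
\text{s.t.} \quad & \operatorname{trace}(Q_i X) = 0, \quad i = 1, \ldots, n \\
& \operatorname{trace}(e_1 e_1^T X) = 1 \\
& B X e_1 \ge 0 \\
& B X B^T \ge 0 \\
& X \ \text{positive semidefinite}
\end{aligned}
\tag{8.21}
$$

where $e_1 = (1, 0, \ldots, 0)^T \in \mathbb{R}^k$.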

The basic relationship between the original problem (8.17) and the relaxed problem (8.21) is that if the original problem is feasible, then the relaxed problem is also feasible.
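The reason is that every solution $\xi$ of (8.20) yields a feasible point $X = \xi\xi^T$ of the relaxed problem. A short check is sketched below, assuming the constraints of (8.21) are $\operatorname{trace}(Q_i X) = 0$, $\operatorname{trace}(e_1 e_1^T X) = 1$, $BXe_1 \ge 0$, $BXB^T \ge 0$ (elementwise), and $X$ positive semidefinite.

```latex
\begin{aligned}
\operatorname{trace}(Q_i X) &= \operatorname{trace}(Q_i\, \xi \xi^T) = \xi^T Q_i\, \xi = 0, \\
\operatorname{trace}(e_1 e_1^T X) &= (e_1^T \xi)^2 = \xi_1^2 = 1, \\
B X e_1 &= B \xi\, (\xi^T e_1) = B \xi \ \ge 0, \\
B X B^T &= (B \xi)(B \xi)^T \ \ge 0 \quad \text{(products of nonnegative entries)}, \\
X &= \xi \xi^T \ \text{is positive semidefinite.}
\end{aligned}
```

The converse does not hold in general, because a feasible $X$ of the relaxed problem need not have rank one. Infeasibility of the relaxed problem, however, always implies infeasibility of the original problem, which is the direction exploited in the following.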

8.5.2 Infeasibility certificates from the dual problem

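The Lagrange dual problem can be used to certify infeasibility of the primal problem (8.21). First, the Lagrangian function $L$ is constructed for the primal problem. We obtain

$$
\begin{aligned}
L(X, \lambda_1, \lambda_2, \lambda_3, \nu) = {} & -\lambda_1^T B X e_1 - \operatorname{trace}(\lambda_2^T B X B^T) - \operatorname{trace}(\lambda_3^T X) \\
& + \sum_{i=1}^{n} \nu_i \operatorname{trace}(Q_i X) + \nu_{n+1} \left( \operatorname{trace}(e_1 e_1^T X) - 1 \right)
\end{aligned}
$$

where $\lambda_1 \in \mathbb{R}^{2k-2}$, $\lambda_2 \in \mathcal{S}^{2k-2}$, $\lambda_3 \in \mathcal{S}^{k}$, and $\nu \in \mathbb{R}^{n+1}$.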

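Based on the Lagrangian $L$, the dual problem is obtained as

$$
\max_{\lambda_1, \lambda_2, \lambda_3, \nu}\ \inf_{X \in \mathcal{S}^k} L(X, \lambda_1, \lambda_2, \lambda_3, \nu)
\quad \text{s.t.} \quad \lambda_1 \ge 0,\ \lambda_2 \ge 0,\ \lambda_3 \ \text{positive semidefinite}
$$

which is equivalent to

$$
\begin{aligned}
\max \quad & \nu_{n+1} \\
\text{s.t.} \quad & B^T \lambda_2 B + e_1 \lambda_1^T B + B^T \lambda_1 e_1^T + \lambda_3 + \sum_{i=1}^{n} \nu_i Q_i + \nu_{n+1} e_1 e_1^T = 0 \\
& \lambda_1 \ge 0,\ \lambda_2 \ge 0,\ \lambda_3 \succeq 0
\end{aligned}
\tag{8.22}
$$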

It is a standard procedure in convex optimization to use the dual problem to find a certificate that guarantees infeasibility of the primal problem [30]. For the problem at hand, this principle is formulated in Theorem 2 [31].

Theorem 2 If the dual problem (8.22) has a feasible solution with $\nu_{n+1} > 0$, then the primal problem (8.17) is infeasible.
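The argument behind Theorem 2 can be sketched as follows; this is an outline under stated assumptions, not the proof from [31]. Assume the equality constraint of (8.22) reads $B^T\lambda_2 B + e_1\lambda_1^T B + B^T\lambda_1 e_1^T + \lambda_3 + \sum_i \nu_i Q_i + \nu_{n+1} e_1 e_1^T = 0$ with $\lambda_1, \lambda_2$ elementwise nonnegative and $\lambda_3$ positive semidefinite. Multiplying this constraint by any $X$ that is feasible for the relaxed problem (8.21) and taking the trace gives

```latex
\begin{aligned}
0 = {} & \underbrace{2\, \lambda_1^T B X e_1}_{\ge 0}
   + \underbrace{\operatorname{trace}(\lambda_2 B X B^T)}_{\ge 0}
   + \underbrace{\operatorname{trace}(\lambda_3 X)}_{\ge 0} \\
  & + \sum_{i=1}^{n} \nu_i \underbrace{\operatorname{trace}(Q_i X)}_{=\,0}
   + \nu_{n+1} \underbrace{\operatorname{trace}(e_1 e_1^T X)}_{=\,1}
\quad\Longrightarrow\quad \nu_{n+1} \le 0 .
\end{aligned}
```

A dual-feasible point with $\nu_{n+1} > 0$ therefore cannot coexist with a feasible $X$, so the relaxed problem, and with it the original problem (8.17), is infeasible. Since the constraints of (8.22) are homogeneous in $(\lambda_1, \lambda_2, \lambda_3, \nu)$, such a point can be scaled by any positive factor, which is why the optimal value of (8.22) is unbounded in this case.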

8.5.3 Algorithm to bound feasible steady states

In this section, an algorithm to find outer bounds on the steady state region $X_s$, based on the results obtained in the previous section, is presented. As a basic additional requirement, assume that some upper and lower bounds on steady states are already known by other means. Let these bounds be given by

$$
x_{i,\mathrm{lower}} \le x_i \le x_{i,\mathrm{upper}}, \quad i = 1, \ldots, n \tag{8.23}
$$

In biochemical reaction networks, such bounds typically follow straightforwardly from conservation relationships or from positive invariance of a suitably large region in state space. These bounds may be very loose, though, and the main objective of the presented method is to tighten them as far as possible. To this end, a bisection algorithm is used that finds the maximum ranges $[x_{j,\mathrm{lower}}, x_{j,\min}]$ and $[x_{j,\max}, x_{j,\mathrm{upper}}]$ for which infeasibility can be proven via Theorem 2. The algorithm iterates over $j = 1, \ldots, n$, while the steady state values $x_i$ for $i \neq j$ are assumed to be located within the intervals given by inequality (8.23).

For illustration, pseudo-code for computing the lower bound $x_{1,\min}$ is given. Computation of the upper bound $x_{1,\max}$ works in a very similar way.

Algorithm 1 Lower bound maximization by bisection

up_guess ← x1,upper, lo_guess ← x1,lower
next_x1 ← x1,upper
while (up_guess – lo_guess) ≥ tolerance
    use constraint x1,lower ≤ x1 ≤ next_x1
    solve semidefinite program (8.22)
    if optimal value of (8.22) is infinite
        lo_guess ← next_x1
        increase next_x1 by 1/2(up_guess – next_x1)
    else
        up_guess ← next_x1
        decrease next_x1 by 1/2(next_x1 – lo_guess)
    endif
endwhile
x1,min ← lo_guess
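The bisection logic of Algorithm 1 can be sketched in runnable form as follows. The semidefinite program is not solved here: the callback `certifies_infeasible` is a hypothetical placeholder that should return `True` when the dual problem (8.22), set up for the interval passed to it, is unbounded. The toy oracle at the end only mimics this behavior for a made-up steady state region starting at 0.3.

```python
def maximize_lower_bound(x_lower, x_upper, certifies_infeasible, tolerance=1e-3):
    """Largest x_min (up to tolerance) such that [x_lower, x_min] is certified infeasible."""
    up_guess, lo_guess = x_upper, x_lower
    next_x = x_upper
    while up_guess - lo_guess >= tolerance:
        # Test the candidate region x_lower <= x1 <= next_x.
        if certifies_infeasible(x_lower, next_x):
            lo_guess = next_x                     # region is free of steady states
            next_x += 0.5 * (up_guess - next_x)   # try a larger region
        else:
            up_guess = next_x                     # no certificate found
            next_x -= 0.5 * (next_x - lo_guess)   # try a smaller region
    return lo_guess


def toy_oracle(lo, hi):
    """Stand-in for the SDP: pretend a certificate exists iff the interval lies below 0.3."""
    return hi < 0.3


x_min = maximize_lower_bound(0.0, 1.0, toy_oracle)
print(x_min)
```

Passing the certificate test as a function keeps the bisection independent of the solver, so the same loop can be reused for every state variable and, with mirrored updates, for the upper bounds.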



Due to the availability of efficient solvers for semidefinite programs and the use of bisection to maximize the interval that is certified as infeasible, Algorithm 1 runs quickly on standard desktop computers. Algorithm 1 is run for all state variables to obtain a hyper-rectangle in state space containing all steady states for the assumed parameter ranges. As demonstrated next, this is relevant information for the global sensitivity analysis and allows one to draw conclusions on (parametric) steady-state sensitivity.

The outlined approach for outer bounding the set of all feasible steady states for sets of parameters can be extended to more general system classes, including systems described by discrete variables such as switching genetic parts. Extensions to more general nonlinear systems are also possible; for details see [32]. Furthermore, based on similar considerations, a general framework for model invalidation, parameter and state estimation, as well as input selection for experimental design for nonlinear systems, specifically biochemical reaction networks, can be derived [33, 34]. In this approach the experimental data are allowed to be available as possibly sparse, uncertain, but (set-)bounded measurements of inputs and outputs. In comparison to other approaches, all infeasibility-based approaches have in common that, instead of checking (possibly) many separate points, which might lead to inconclusive answers, they allow one to check whole parameter and state regions for feasibility.

8.5.4 Example: covalent modification system

As an example, let us compute a steady state region for the covalent modification scheme (8.3). From the conservation relations and positive invariance of the positive orthant, we have the steady state bounds $0 \le [A], [A^*] \le [A_{\mathrm{tot}}]$ and $0 \le [C_1] \le [E_{1,\mathrm{tot}}]$, which are valid for any parameter values.

The previously discussed analysis method can be applied to find tighter bounds on possible steady state values under specified parameter uncertainties. First, we consider an example where $k_6$ is uncertain between 100 and a varying maximum value. The resulting lower and upper bounds on $[A^*]$ are shown in Figure 8.4.

[Figure 8.4: plot of the bounds on $[A^*]$ (0 to 1) against the maximum value of $k_6$ (logarithmic axis from $10^2$ to $10^5$); the area between the two bounds is labeled "Region of possible steady states".]

Figure 8.4 Global sensitivity analysis of the covalent modification example. Upper and lower bounds (black lines) on the steady state of $[A^*]$ for the parameter set $100 \le k_6 \le k_{6,\max}$.

As an example for multidimensional parameter uncertainty sets, consider the three uncertainty regions $P_1, P_2, P_3 \subset \mathbb{R}^4$ given by:

• $(k_2, k_3, k_5, k_6) \in P_1 \Leftrightarrow 0.98\, k_{i,\mathrm{nom}} \le k_i \le 1.02\, k_{i,\mathrm{nom}}$;

• $(k_2, k_3, k_5, k_6) \in P_2 \Leftrightarrow 0.9\, k_{i,\mathrm{nom}} \le k_i \le 1.1\, k_{i,\mathrm{nom}}$;

• $(k_2, k_3, k_5, k_6) \in P_3 \Leftrightarrow 0.5\, k_{i,\mathrm{nom}} \le k_i \le 2\, k_{i,\mathrm{nom}}$;

with $i = 2, 3, 5, 6$ in all three cases. Certified upper and lower bounds on $[A^*]$ for this case are given in Table 8.1, together with “inner” bounds on the steady state set obtained by explicit computation of the steady state for randomly chosen parameter samples. As can be seen from the results, our approach is able to find tight intervals for the steady state values in all three cases.
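How such "inner" bounds arise from random sampling can be sketched as follows. The model is a hypothetical stand-in, a first-order interconversion A ⇌ A* with steady-state fraction $k_{\mathrm{on}}/(k_{\mathrm{on}} + k_{\mathrm{off}})$, not the covalent modification system of this chapter; it is chosen because its exact extremes over a parameter box are known and can play the role of the certified outer bounds.

```python
import random


def active_fraction(k_on, k_off):
    """Steady-state fraction of A* for the hypothetical scheme A <-> A*."""
    return k_on / (k_on + k_off)


def monte_carlo_inner_bounds(nominal, rel_lo, rel_hi, samples=1000, seed=0):
    """Smallest and largest steady state found among random parameter samples."""
    rng = random.Random(seed)
    values = []
    for _ in range(samples):
        k_on, k_off = (k * rng.uniform(rel_lo, rel_hi) for k in nominal)
        values.append(active_fraction(k_on, k_off))
    return min(values), max(values)


nominal = (1.0, 3.0)  # hypothetical nominal rate constants
mc_min, mc_max = monte_carlo_inner_bounds(nominal, 0.9, 1.1)

# The fraction increases with k_on and decreases with k_off,
# so the exact extremes are attained at corners of the parameter box.
exact_min = active_fraction(0.9 * nominal[0], 1.1 * nominal[1])
exact_max = active_fraction(1.1 * nominal[0], 0.9 * nominal[1])
print(exact_min, mc_min, mc_max, exact_max)
```

The sampled extremes always lie inside the true range, so they bound the steady state set from the inside, whereas certified bounds enclose it from the outside. A small gap between the two, as reported in Table 8.1, indicates that the outer bounds are tight.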

Note that the absolute bound on the output variation subject to the various parameters easily shows which parameter is the most sensitive.

8.6 Discussion and Outlook

Sensitivity analysis is a useful tool for the analysis of mathematical models of biological systems. Commonly, linear, local methods are used to analyze sensitivity with respect to parameter perturbations. It was shown in this chapter that local sensitivity methods might lead to wrong conclusions if the system is highly nonlinear and if large variations in the parameters are considered.

To overcome the problem of locality, we outlined two new methods for steady-state sensitivity analysis. The first method is based on an input-output consideration of the sensitivity question: parameters are considered as inputs, and variables of interest are considered as outputs. The sensitivity question is then answered by an extension of the concept of linear cross Gramians to nonlinear systems, the empirical cross Gramian approach. This method allows one to consider a wider class of systems and to derive sensitivity statements based on simulations. Further work will focus on the extension of the Gramian-based approach to the question of nonstationary sensitivity analysis.

A reformulation of the original question of parametric sensitivity as the question of outer approximating the range of possible steady states under parameter uncertainties sets the basis for the second approach. Since an outer approximation of the possible steady states is obtained, the method is nonlocal (i.e., global in nature). In the case of mass action kinetics it was shown that one can find rather close outer bounds on the feasible region using infeasibility certificates. Once the region of possible steady states is found, one has direct insight into the most sensitive outputs for the considered parameter uncertainty. Future research focuses on the extension of the global sensitivity method to a richer class of systems and on the application of the ideas to model validation and parameter/state estimation.


Table 8.1 Upper (ub) and Lower (lb) Bounds on [A*] for Several Parameter Uncertainty Sets (as given in the text), with Comparison to Extremal Values Taken from 1,000 Random Parameter Samples (mc)

Set   [A*] lb   [A*] mc, min   [A*] mc, max   [A*] ub
P1    0.356     0.363          0.823          0.838
P2    0.094     0.096          0.936          0.946
P3    0.013     0.013          0.980          0.984

References

[1] Le Novère, N., B. Bornstein, A. Broicher, M. Courtot, M. Donizelli, H. Dharuri, L. Li, H. Sauro, M. Schilstra, B. Shapiro, J.L. Snoep, and M. Hucka, “BioModels Database: a free, centralized database of curated, published, quantitative kinetic models of biochemical and cellular systems,” Nucleic Acids Res., Vol. 34, 2006, pp. D689–D691. See also http://www.ebi.ac.uk/biomodels/, last visited May 25, 2008.

[2] Varma, A., M. Morbidelli, and H. Wu, Parametric Sensitivity in Chemical Systems, Cambridge, U.K.: Cambridge University Press, 1999.

[3] Cornish-Bowden, A., Fundamentals of Enzyme Kinetics, 3rd edition, London, U.K.: Portland Press, 2004.

[4] Saez-Rodriguez, J., A. Kremling, and E.D. Gilles, “Dissecting the puzzle of life: Modularization of signal transduction networks,” Computers and Chemical Engineering, Vol. 29, 2005, pp. 619–629.

[5] Klipp, E., R. Herwig, A. Kowald, C. Wierling, and H. Lehrach, Systems Biology in Practice: Concepts, Implementation and Application, Weinheim: Wiley-VCH, 2005.

[6] Kacser, H., and J.A. Burns, “The control of flux,” Symposia Society for Experimental Biology, Vol. 27, 1973, pp. 65–104.

[7] Fell, D.A., “Metabolic control analysis: a survey of its theoretical and experimental development,” Biochemical J., Vol. 286, 1992, pp. 313–330.

[8] Feng, X.J., S. Hooshangi, D. Chen, G. Li, R. Weiss, and H. Rabitz, “Optimizing genetic circuits by global sensitivity analysis,” Biophys. J., Vol. 87, No. 4, 2004, pp. 2195–2202.

[9] Robert, C.P., and G. Casella, Monte Carlo Statistical Methods, New York: Springer Verlag, 2004.

[10] Alves, R., and M.A. Savageau, “Systemic properties of ensembles of metabolic networks: application of graphical and statistical methods to simple unbranched pathways,” Bioinformatics, Vol. 16, No. 6, 2000, pp. 534–547.

[11] Keener, J., and J. Sneyd, Mathematical Physiology, 2nd edition, volume 8 of Interdisciplinary Applied Mathematics, New York: Springer-Verlag, 2001.

[12] Ingalls, B.P., and H.M. Sauro, “Sensitivity analysis of stoichiometric networks: an extension of metabolic control analysis to non-steady state trajectories,” J. Theor. Biol., Vol. 222, 2003, pp. 23–36.

[13] Goldbeter, A., and D.E. Koshland, “An amplified sensitivity arising from covalent modification in biological systems,” Proc. Natl. Acad. Sci. USA, Vol. 78, No. 11, November 1981, pp. 6840–6844.

[14] Saltelli, A., M. Ratto, S. Tarantola, and F. Campolongo, “Sensitivity analysis practices: Strategies for model-based inference,” Reliability Engineering & System Safety, Vol. 91, 2006, pp. 1109–1125.

[15] Ingalls, B.P., “A frequency domain approach to sensitivity analysis of biochemical systems,” Journal of Physical Chemistry B, Vol. 108, 2004.

[16] Cascante, M., A. Sorribas, R. Franco, and E.I. Canela, “Biochemical systems theory: increasing predictive power by using second-order derivatives measurements,” J. Theor. Biol., Vol. 149, No. 4, April 1991, pp. 521–535.

[17] Streif, S., R. Findeisen, and E. Bullinger, “Sensitivity analysis of biochemical reaction networks by bilinear approximation,” Proc. of the Foundations of Systems Biology in Engineering (FOSBE), Stuttgart, Germany, September 2007, pp. 521–526.

[18] Hofmeyr, J.-H.S., “Metabolic control analysis in a nutshell,” Proc. International Conference on Systems Biology, Pasadena, CA, November 2000, pp. 291–300.

[19] Yi, T.-M., B.W. Andrews, and P.A. Iglesias, “Control analysis of bacterial chemotaxis signaling,” Methods Enzymol., Vol. 422, 2007, pp. 123–140.

[20] Moore, B.C., “Principal component analysis in linear systems: Controllability, observability, and model reduction,” IEEE Trans. Autom. Control, Vol. 26, No. 1, 1981, pp. 17–32.

[21] Fernando, K.V., and H. Nicholson, “Stability assessment of two-dimensional state-space systems,” IEEE Trans. Circ. Syst., Vol. 32, No. 5, 1985.

[22] Fernando, K.V., and H. Nicholson, “On the structure of balanced and other principal representations of SISO systems,” IEEE Trans. Autom. Control, Vol. 28, No. 2, 1983, pp. 228–231.

[23] Laub, A.J., L.M. Silverman, and M. Verma, “A note on cross-Grammians for symmetric realizations,” Proceedings of the IEEE, Vol. 71, No. 7, 1983, pp. 904–905.

[24] Streif, S., R. Findeisen, and E. Bullinger, “Relating cross Gramian and sensitivity analysis in systems biology,” Proc. of the Mathematical Theory of Networks and Systems (MTNS), Kyoto, Japan, 2006, pp. 437–442.

[25] Sun, C., and J. Hahn, “Parameter reduction for stable dynamical systems based on Hankel singularvalues and sensitivity analysis,” Chemical Engineering Science, Vol. 61, No. 16, 2006, pp. 5393–5403.

[26] Fujimoto, K., and J.M.A. Scherpen, “Nonlinear balanced realization based on singular value analy-sis of Hankel operators,” Proc. 42nd IEEE Conference on Decision and Control, Vols. 1-6, 2003,pp. 6072–6077.

8.6 Discussion and Outlook

147

[27] Lall, S., J.E. Marsden, and S. Glavaski, “A subspace approach to balanced truncation for modelreduction of nonlinear control systems,” International Journal of Robust and Nonlinear Control,Vol. 12, 2002, pp. 519–535.

[28] Kuepfer, L., U. Sauer, and P. Parrilo, “Efficient classification of complete parameter regions basedon semidefinite programming,” BMC Bioinformatics, Vol. 8, No. 1, January 12, 2007.

[29] Parrilo, P.A., “Semidefinite programming relaxations for semialgebraic problems,” MathematicalProgramming, Vol. 96, No. 2, May 2003, pp. 293–320.

[30] Boyd, S., and L. Vandenberghe, Convex Optimization, Cambridge, U.K.: Cambridge University Press,2004.

[31] Waldherr, S., R. Findeisen, and F. Allgöwer, “Global sensitivity analysis of biochemical reactionnetworks via semidefinite programming,” Proc. of the 17th IFAC World Congress, Seoul, Korea, 2008,pp. 9701–9706.

[32] Hasenauer, J., P. Rumschinski, S. Waldherr, S. Borchers, F. Allgöwer, and R. Findeisen, “Guaranteedsteady-state bounds for uncertain chemical processes,” Proc. Int. Symp. Adv. Control of Chemical Pro-cesses, ADCHEM’09, 2009.

[33] Borchers, S., P. Rumschinski, S. Bosio, R. Weismantel, and R. Findeisen, “Model discrimination andparameter estimation via infeasibility certificates for dynamical biochemical reaction networks,”Proc. of the 7th MATHMOD Conference, 2009.

[34] Borchers, S., P. Rumschinski, S. Bosio, R. Weismantel, and R. Findeisen, “Model invalidation andsystem identification of biochemical reaction networks,” Proc. 16th IFAC Symposium on Identifica-tion and System Parameter Estimation (SYSID 2009), 2009.

[35] Stelling, J., E.D. Gilles, and F.J. Doyle, “Robustness properties of circadian clock architectures,”Proc. Natl. Acad. Sci. USA, Vol. 101, No. 36, September 2004, pp. 13210–13215.

[36] del Rosario, R.C.H., F.W. Staudinger, S. Streif, F. Pfeiffer, E. Mendoza, and D. Oesterhelt, “Model-ling the Halobacterium salinarum mutant: sensitivity analysis allows choice of parameter to be mod-ified in the phototaxis model,” IET Systems Biology, Vol. 1, No. 4, 2007, pp. 207–221.


148

CHAPTER 9

Determining Metabolite Production Capabilities of Saccharomyces Cerevisiae Using Dynamic Flux Balance Analysis

Jared L. Hjersted and Michael A. Henson¹

¹ Department of Chemical Engineering, University of Massachusetts, Amherst, MA 01003-9303; phone: 413-545-3481; fax: 413-545-1647; e-mail: [email protected]


Key terms: Metabolic models; Flux balance analysis; Dynamic optimization; Metabolic engineering; Batch and fed-batch culture; Saccharomyces cerevisiae

Abstract

Dynamic flux balance analysis (DFBA) is a computational approach for analyzing and engineering cellular behavior in the dynamic culture environments that predominate in batch and fed-batch biochemical reactors. The basic element of DFBA is a dynamic flux balance model that combines stoichiometric mass balances on intracellular metabolites with dynamic mass balances on extracellular species through substrate uptake kinetics and the cellular growth rate. The development of customized computational tools allows DFBA to address a wide variety of problems in metabolic network analysis and design, including the dynamic simulation of batch and fed-batch bioreactors, the dynamic optimization of fed-batch operating policies, and the in silico design of metabolite overproduction mutants for batch and fed-batch cultures. We focus on the development and application of DFBA techniques for the yeast Saccharomyces cerevisiae.

9.1 Introduction

The availability of stoichiometric models of cellular metabolism has enabled the development of computational algorithms for the analysis and design of complex metabolic networks. A stoichiometric model comprises a linear system of flux balance equations that relate metabolic species to their intracellular fluxes through a reaction network [1, 2]. Typically the network contains more unknown fluxes than balanced intracellular species, and the linear system is underdetermined. In flux balance analysis (FBA), the fluxes are resolved by solving a linear programming problem formulated under the assumption that the cell optimally utilizes available resources. FBA has been used extensively for predicting cellular growth and product secretion patterns in microbial systems [3–5]. Extensions of classical FBA allow the redesign of metabolic networks for the overproduction of desired metabolites through gene deletions and insertions, which are implemented by removing or adding intracellular reactions to the network. These computational methods provide metabolic engineering targets that are experimentally testable. In a study with the yeast Saccharomyces cerevisiae, the growth phenotypes of knockout mutants were predicted with a 70%–80% success rate by constraining the fluxes associated with the deleted genes [6]. Several computational studies of gene manipulations for metabolite overproduction also have been presented [7–9].

Large-scale production of biotechnological products often is performed in batch and fed-batch bioreactors. An important advantage of fed-batch operation is that substrate levels can be varied transiently to achieve favorable trade-offs between the cellular growth and metabolite production rates. Fed-batch fermentation of S. cerevisiae is an important technology for producing metabolic products such as ethanol [10–12]. However, classical FBA methods assume time-invariant extracellular conditions and generate steady-state predictions consistent with continuous bioreactor operation. Although batch culture experiments often are used to evaluate FBA predictions, the results are strictly valid only for the balanced growth phase.

An alternative approach is to perform metabolic network analysis and design using dynamic extensions of stoichiometric models and classical FBA. Dynamic flux balance models [13–16] are obtained by combining stoichiometric equations for intracellular metabolism with dynamic mass balances on extracellular substrates and products under the assumption that intracellular metabolite concentrations equilibrate rapidly in response to extracellular perturbations [2]. The intracellular and extracellular descriptions are coupled through the cellular growth rate and substrate uptake kinetics, which can be formulated to include regulatory effects such as product inhibition of growth. Therefore, dynamic flux balance models allow the prediction of cellular behavior as the extracellular environment changes with time. Batch culture simulations with dynamic flux balance models have shown good agreement with experimental data [13–15]. Dynamic flux balance analysis (DFBA) refers to computational algorithms in which dynamic flux balance models are used for metabolic network analysis and design.

Dynamic flux balance modeling offers important advantages over alternative transient modeling frameworks. Because simple unstructured models rely on phenomenological descriptions of cell growth and constant yield coefficients [17], they have limited predictive capability and cannot account for genetic alterations. Metabolic engineering applications of structured kinetic models [18, 19], log-linear kinetic models [20], and cybernetic models [21, 22] are often limited by the lack of parameter values for in vivo enzyme kinetics. Dynamic flux balance modeling provides a practical alternative for incorporating intracellular structure. Given the availability of a steady-state flux balance model, only a small number of additional parameters are needed to account for the substrate uptake kinetics. On the other hand, a well-documented weakness of classical and dynamic FBA is the difficulty associated with incorporating cellular regulation. This problem has been partially addressed by using gene expression data to constrain regulated fluxes within the metabolic network [23, 24]. DFBA offers the additional possibility of formulating substrate uptake kinetics to account for known regulatory processes.

In this chapter, the basic elements of DFBA are presented and illustrated through applications to Saccharomyces cerevisiae dynamic simulation, fed-batch optimization, and in silico metabolic engineering. Following a discussion of stoichiometric modeling, the computational underpinnings of classical and dynamic FBA are discussed. The application of DFBA is illustrated by performing dynamic simulations of fed-batch cultures, dynamic optimization of fed-batch operating policies for ethanol production, and in silico design of ethanol overproduction mutants for pure and mixed substrates.

9.2 Methods

9.2.1 Stoichiometric models of cellular metabolism

Both classical and dynamic flux balance analysis are based on stoichiometric cell models that mathematically represent the biochemical reactions in a metabolic network. A stoichiometric model contains all possible paths from the externally supplied substrates to the biomass constituents and metabolic products [5, 25]. The essential information required to construct a stoichiometric model is a list of participating biochemical species (metabolites), a list of the relevant intracellular reactions involving these species, and the stoichiometric coefficients for every species in each reaction. The intracellular reaction rates are called fluxes and are unknown variables determined from mathematical solution of the stoichiometric model. Also treated as fluxes are the metabolite transport rates across the cell membrane for extracellular substrates (uptake rates) and for secreted metabolic products (secretion rates). Typically substrate uptake rates are known model input variables, while product secretion rates are unknown model output variables calculated along with the intracellular reaction rates.

Unless experimental evidence indicates otherwise, each intracellular metabolite is assumed to exhibit negligible accumulation such that the fluxes producing the metabolite must be balanced by the fluxes consuming the metabolite:

$$a^T v = 0 \qquad (9.1)$$

where $v$ is an n-dimensional column vector containing all the fluxes, with typical units of millimole of metabolite per gram dry weight of biomass per hour (mmol/gDW/h), and $a^T$ is an n-dimensional row vector containing the stoichiometric coefficients of the balanced metabolite for the corresponding reactions. Stoichiometric coefficients are usually positive for reactions that produce the metabolite and negative for reactions that consume the metabolite. Because a single metabolic reaction typically involves a small number of metabolites, most of the coefficients in the $a^T$ vector are zero. The individual stoichiometric equations (9.1) can be gathered to form a matrix equation of the form:


$$A v = 0 \qquad (9.2)$$

where $A$ is the stoichiometric matrix with m rows corresponding to the number of balanced metabolites and n columns corresponding to the number of fluxes. The matrix entry $a_{ij}$ in row i and column j is the stoichiometric coefficient of the ith species participating in the jth reaction. Given a set of known fluxes $v_m$ obtained either by measurement or specification, the matrix $A$ can be partitioned in terms of the remaining unknown fluxes $v_c$ to yield

$$A_c v_c + A_m v_m = 0 \;\rightarrow\; A_c v_c = -A_m v_m \equiv b \qquad (9.3)$$

where $A_c$ and $A_m$ are appropriately dimensioned submatrices of $A$ and $b$ is a known column vector of appropriate dimension. As discussed in the following section, the existence and uniqueness of solutions to (9.3) can be determined from properties of the matrix $A_c$ and the vector $b$.
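As a minimal sketch of the partition in (9.3), consider a hypothetical network with three balanced metabolites and five fluxes. The network, the choice of measured fluxes, and their values are illustrative assumptions and are not taken from the models discussed in this chapter. The code assembles the stoichiometric matrix, splits its columns into known and unknown fluxes, and forms the right-hand side b.

```python
# Hypothetical toy network (illustrative only, not from the chapter):
#   v1: -> M1 (uptake)      v2: M1 -> M2      v3: M1 -> M3
#   v4: M2 -> (secretion)   v5: M3 -> (secretion)
# Rows of A are balanced metabolites, columns are fluxes, as in (9.2).
A = [
    [1, -1, -1,  0,  0],   # M1
    [0,  1,  0, -1,  0],   # M2
    [0,  0,  1,  0, -1],   # M3
]

# Assumed known fluxes v_m (column index -> value in mmol/gDW/h): v1 and v4.
measured = {0: 10, 3: 4}
calc_idx = [j for j in range(len(A[0])) if j not in measured]

# Column partition of A into A_c (unknown fluxes) and A_m (known fluxes).
A_c = [[row[j] for j in calc_idx] for row in A]
A_m = [[row[j] for j in measured] for row in A]
v_m = list(measured.values())

# Right-hand side of (9.3): b = -A_m v_m.
b = [-sum(a * v for a, v in zip(row, v_m)) for row in A_m]

print("A_c =", A_c)
print("b   =", b)
```

For this particular toy network the reduced system has as many independent equations as unknowns, so the remaining fluxes follow directly (v2 = 4, v3 = 6, v5 = 6). Genome-scale models rarely have this property, which motivates the optimization-based approach described next.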

A large number of stoichiometric cell models have been constructed for organisms ranging in complexity from bacteria to mammals [26–28]. These models often are differentiated according to the amount of genomic information utilized in their development. Prior to the wide availability of complete genomic sequences, stoichiometric models were developed from knowledge of metabolic pathways and cellular physiology without regard to the genes involved in the synthesis of enzymes that catalyze the intracellular reactions [29–31]. These small-scale pathway models typically describe primary carbon metabolism and include a lumped description of biomass constituent formation with 100 or fewer metabolites and reactions. More recently, genome-scale stoichiometric models that attempt to account for all known gene-protein-reaction associations have been developed for various organisms [26, 27, 32, 33]. With the increasing availability of genome-scale models, the choice of an appropriate stoichiometric model is often determined by the intended application. We have found that small-scale metabolic models are more computationally efficient when integrated into mathematical programming strategies such as those developed for bioreactor optimization (Section 9.3.3). On the other hand, genome-scale models are more suitable for analysis of metabolic engineering strategies because the gene-protein-reaction associations facilitate the in silico implementation of genetic manipulations such as gene knockouts and gene insertions (Section 9.3.4).

9.2.2 Classical flux balance analysis

The objective of classical flux balance analysis (FBA) is to solve the matrix equation (9.3) for the unknown fluxes $v_c$. Consider the augmented matrix defined by $\tilde{A}_c \equiv [A_c \;\; b]$. At least one solution to (9.3) exists if and only if $A_c$ and $\tilde{A}_c$ have the same matrix rank r, and the solution is unique if and only if $r = n_c$, the dimension of the unknown flux vector $v_c$ [2]. Most stoichiometric models satisfy the existence condition but not the uniqueness condition because the number of unknown fluxes is greater than the number of balanced metabolites, rendering the matrix rank $r < n_c$. In this case, (9.3) has infinitely many solutions corresponding to different flux distributions that satisfy the stoichiometric equations. This degree-of-freedom problem can be resolved by measuring a sufficient number of intracellular and/or transport fluxes such that the number of unknown fluxes is equal to the number of balanced metabolites and the resulting matrix has rank $r = n_c$. This approach typically involves carbon labeling experiments to measure intracellular fluxes [30, 34] and is not well suited for genome-scale models in which rank deficiencies of several hundred are common. More importantly, carbon labeling does not allow a priori prediction of cellular metabolism due to the necessity of collecting data for flux computation.
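The rank conditions above can be checked mechanically. The following sketch is an illustration under stated assumptions: the matrices are small hypothetical examples, and the helper names `rank` and `classify` are introduced here only for this purpose. Exact rational arithmetic is used so that no numerical rank tolerance is needed.

```python
from fractions import Fraction

def rank(M):
    """Rank of a matrix by exact Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in r] for r in M]
    r = 0
    for col in range(len(rows[0])):
        piv = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if piv is None:
            continue  # no pivot in this column
        rows[r], rows[piv] = rows[piv], rows[r]
        for i in range(r + 1, len(rows)):
            f = rows[i][col] / rows[r][col]
            rows[i] = [x - f * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

def classify(A_c, b):
    """Compare rank(A_c) with the rank of the augmented matrix [A_c b]."""
    n_c = len(A_c[0])
    r = rank(A_c)
    r_aug = rank([row + [bi] for row, bi in zip(A_c, b)])
    if r != r_aug:
        return "no solution"
    return "unique" if r == n_c else "underdetermined"

# Hypothetical examples (not taken from the chapter's models).
A_square = [[-1, -1, 0], [1, 0, 0], [0, 1, -1]]   # 3 balances, 3 unknown fluxes
A_wide = [[-1, -1, 0, 0], [1, 0, -1, -1]]         # 2 balances, 4 unknown fluxes

print(classify(A_square, [-10, 4, 0]))   # unique
print(classify(A_wide, [-10, 0]))        # underdetermined
```

The second case, with more unknown fluxes than balances, is the situation that typically arises for stoichiometric models.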

An alternative approach that is the focus of this chapter is to assume the existence of a cellular objective and to solve an optimization problem in which the fluxes are distributed to maximize this objective while simultaneously satisfying the stoichiometric equations. The most common cellular objective is growth rate maximization, although other objectives such as maximal ATP production have been considered [35, 36]. Figure 9.1 depicts this computational framework for a very simple example of three calculated fluxes. Physiochemical constraints such as reaction directionality (reversible or irreversible reaction) can be represented as

$$v_{\min} \le v \le v_{\max} \qquad (9.4)$$

where $v_{\min}$ and $v_{\max}$ are vectors containing known lower and upper bounds on the fluxes, respectively. When combined with the stoichiometric equations, the physiochemical constraints bound the flux solution space but typically fail to yield a unique solution for the flux distribution. In an attempt to resolve the degree-of-freedom problem, the cell is assumed to utilize available substrates to achieve maximal growth. The growth rate μ (h⁻¹) is calculated as the weighted sum of the fluxes, with fluxes corresponding to biomass precursors (amino acids, carbohydrates, ribonucleotides, deoxyribonucleotides, lipids, sterols, phospholipids, fatty acids) weighted according to their contribution to the biomass and the remaining fluxes given weights of zero. Values of the weights $w$ (gDW/mmol) are determined from measurement of the biomass composition and are assumed to remain constant under different conditions, an assumption that has been challenged by experimental data [37].

Figure 9.1 Classical flux balance analysis for a hypothetical stoichiometric model with three unknown fluxes ($v_A$, $v_B$, $v_C$), stoichiometric matrix $A$, physiochemical flux constraints $v_{\min}$ and $v_{\max}$, growth rate μ, and biomass composition weights $w$.

The resulting optimization problem is known as a linear program (LP) because the objective function and the equality and inequality constraints are linear in the unknown fluxes [38]. A variety of numerical algorithms have been developed for rapid and robust solution of large LPs with many thousands of unknown fluxes and stoichiometric equations [39, 40]. Given a set of substrate uptake rates either measured through experiment or specified for analysis, solution of the LP yields the intracellular flux distribution in the network as well as the maximal growth rate and the transport rates of secreted products. The solution obtained is unique unless the LP has alternative optima, in which case the same maximal growth rate can be obtained with different flux distributions [41, 42]. We will revisit this issue in Section 9.3. This approach of formulating an LP to resolve the intracellular fluxes subject to stoichiometric and physiochemical constraints is known as classical flux balance analysis (FBA).
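Written out with the symbols defined in the text, the linear program described above can be stated compactly as follows. This is a restatement assembled from equations (9.2) and (9.4) and the definition of the growth rate as a weighted sum of fluxes; the exact layout used in Figure 9.1 is not reproduced here.

```latex
\begin{aligned}
\max_{v} \quad & \mu = w^T v \\
\text{subject to} \quad & A v = 0 \\
& v_{\min} \le v \le v_{\max}
\end{aligned}
```

One common way to impose a known substrate uptake rate in such a formulation is to set the lower and upper bounds of the corresponding flux to the same specified value.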

9.2.3 Dynamic flux balance analysis

Classical flux balance analysis allows prediction of cellular growth and product secretion rates for fixed values of the substrate uptake rates. As a result, FBA is strictly applicable only to the balanced growth phase in batch cultures and the steady-state growth phase in continuous cultures. Dynamic flux balance analysis (DFBA) is an extension of classical flux balance analysis that accounts for cell culture dynamics and allows prediction of cellular metabolism in batch and fed-batch fermentations. The basic DFBA framework is depicted in Figure 9.2 for the case of fed-batch culture. The growth rate μ, the intracellular fluxes $v$, and the product secretion rates $v_p$ are computed through solution of the classical FBA problem. Rather than specifying constant substrate uptake rates, the extracellular substrate concentrations S and product concentrations P are used to calculate time-varying substrate uptake rates $v_s$ through expressions for the uptake kinetics $f_s$. The extracellular concentrations are computed by numerically solving extracellular balance equations for the liquid volume (V), biomass concentration (X), and the substrate and product concentrations, given the growth and secretion rates obtained from the LP. Consequently, all the intracellular and extracellular variables are time varying.

Figure 9.2 Dynamic flux balance model for a fed-batch bioreactor with substrate concentrations S, product concentrations P, biomass concentration X, reactor liquid volume V, and feed flow rate F. Here $v_s$ is the subset of fluxes for substrate uptakes, $v_p$ is the subset of fluxes for product secretion rates, and $f_s$ is a vector function of substrate uptake kinetics.

As compared to alternative dynamic cell modeling approaches, the primary advantages of DFBA are that the increasing availability of flux balance models is fully leveraged and that very little additional information is required for model construction. The principal challenges associated with DFBA are experimental determination of the substrate uptake kinetics [43–45] and numerical solution of the dynamic model [14, 16, 46]. The substrate uptake expressions typically take the form of saturation kinetics with possible terms for inhibition due to product toxicity or a competing substrate in the case of shared transporters. Two general approaches have been proposed for numerical solution of the DFBA problem. The sequential approach involves time discretization of the model equations such that the substrate uptake kinetics, the FBA LP, and the extracellular balances can be solved separately and sequentially [13, 14, 16]. Given that large LPs can be solved rapidly and robustly, we have developed a simultaneous approach in which the uptake kinetics and the LP are embedded within the extracellular balance solution such that explicit time discretization is avoided and high performance integration codes can be used directly [46]. We demonstrate application of the simultaneous solution approach in Section 9.3.2.
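The structure of the simultaneous approach can be sketched as a single right-hand-side function that evaluates the uptake kinetics, solves the inner flux balance problem, and returns the extracellular derivatives to a standard integrator. The sketch below is illustrative only. The function `inner_fba` is a hypothetical stand-in with a closed-form answer (a real model would call an LP solver at that point), the kinetic constants and yields are invented, the substrate is expressed in mmol/l so that no molar-mass conversion is needed, and a fixed-step fourth-order Runge-Kutta loop replaces the adaptive integration codes mentioned in the text.

```python
def uptake(S, v_max=10.0, K=0.5):
    """Saturation kinetics for the substrate uptake rate (assumed parameters)."""
    return v_max * S / (K + S)

def inner_fba(v_s):
    """Hypothetical stand-in for the FBA linear program: returns the growth
    rate mu and product secretion rate v_p for a given uptake rate v_s."""
    return 0.05 * v_s, 1.5 * v_s

def rhs(t, y, F=0.0, S_f=0.0):
    """Extracellular balances with the inner problem embedded.
    State y = [V, V*X, V*S, V*P], i.e. liquid volume and total amounts."""
    V, VX, VS, VP = y
    S = max(VS / V, 0.0)            # guard against small negative overshoot
    v_s = uptake(S)
    mu, v_p = inner_fba(v_s)        # inner problem solved at every evaluation
    return [F, mu * VX, F * S_f - v_s * VX, v_p * VX]

def rk4(f, y, t0, t1, n):
    """Classical fixed-step Runge-Kutta integration from t0 to t1 in n steps."""
    h = (t1 - t0) / n
    t = t0
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + h / 2, [a + h / 2 * b for a, b in zip(y, k1)])
        k3 = f(t + h / 2, [a + h / 2 * b for a, b in zip(y, k2)])
        k4 = f(t + h, [a + h * b for a, b in zip(y, k3)])
        y = [a + h / 6 * (p + 2 * q + 2 * r + s)
             for a, p, q, r, s in zip(y, k1, k2, k3, k4)]
        t += h
    return y

# Batch example (F = 0): 1 l, 0.1 g biomass, 20 mmol substrate, no product.
V, VX, VS, VP = rk4(rhs, [1.0, 0.1, 20.0, 0.0], 0.0, 5.0, 500)
print(V, VX, VS, VP)
```

Because the integrator only sees `rhs`, swapping the stand-in for a genuine LP solve, or the fixed-step loop for an adaptive integrator, does not change the overall structure.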

9.3 Results and Interpretation

9.3.1 Stoichiometric models of S. cerevisiae metabolism

A variety of stoichiometric models of S. cerevisiae metabolism that differ according to their complexity and intended use have been presented. Three alternative models have been used in our studies to investigate the relationship between model complexity, prediction capability, and computational efficiency: (1) a small-scale pathway model that describes primary carbon metabolism and the formation of cellular biomass [31, 47]; (2) the two-compartment iFF708 genome-scale model with explicit connections between annotated genes and the associated enzyme-catalyzed reactions [32]; and (3) the multicompartment iND750 genome-scale model in which the metabolic reactions are localized to seven intracellular compartments [27]. We refer to the small-scale model as iGH99, with the capital letters “GH” referring to the last names of the primary authors and the number “99” referring to the number of intracellular reactions rather than the number of gene-reaction associations as used in the genome-scale models. The general characteristics of each stoichiometric model are summarized in Table 9.1. The number of fluxes reported includes both the number of intracellular reactions and the number of membrane transport fluxes (e.g., the iGH99 model has 99 intracellular reaction fluxes and 30 transport fluxes).

Table 9.1 Summary of S. cerevisiae Stoichiometric Models

                              iGH99    iFF708   iND750
Genes                         –        708      750
Intracellular compartments    1        2        7
Metabolites                   98       711      1,059
Fluxes                        129      1,176    1,264
Elementally balanced          X        X        √
Charge balanced               X        X        √
Reference                     [31]     [32]     [33]

The iGH99 model has only a single intracellular compartment and accounts for a comparatively small number of metabolites. Because the iGH99 model does not explicitly include annotated genes, we manually analyzed the reaction set to establish the gene-reaction associations necessary to implement selected metabolic engineering strategies (Section 9.3.3). In addition to dividing the intracellular reactions between the cytosol and mitochondria and including gene-reaction associations, the first generation iFF708 genome-scale model contains a much more extensive list of reactions. The primary enhancements in the second generation iND750 genome-scale model are more extensive reaction localization through inclusion of the cytosol, mitochondria, peroxisome, nucleus, endoplasmic reticulum, Golgi apparatus, and vacuole, detailed charge balancing, and full elemental mass balancing with respect to carbon and hydrogen.

While stoichiometric coefficients are fixed by the biochemical reactions, the intracellular models have several adjustable parameters associated with cellular energetics and the biomass composition. Energy related parameters correspond to growth and nongrowth associated maintenance. The small-scale iGH99 model does not distinguish between the two types of maintenance and has a single lumped parameter in the biomass formation flux. The genome-scale models iFF708 and iND750 account for nongrowth associated maintenance with a separate flux where the upper and lower bounds of the flux are set to identical values (mmol ATP/gdw/h). Lumped maintenance in the small-scale model and growth associated maintenance in the genome-scale models were specified by adjusting the stoichiometry of ATP consumption in the biomass formation flux. In the original studies, the energy related parameters for each model were determined for a single metabolic state and then assumed to remain constant under different conditions.

The biomass composition parameters ($w$) determine the relative contribution of each precursor to the biomass formation rate (μ). Despite large differences in their metabolic descriptions, all three models utilize roughly the same level of detail for the biomass precursors. In the original studies for the iGH99 and iFF708 models, the biomass composition parameters were determined for a particular metabolic state and then assumed to remain constant. The second generation genome-scale iND750 model utilizes the same biomass composition as the iFF708 model. While metabolic models are formulated and analyzed under the assumption of constant biomass composition, continuous culture experiments with S. cerevisiae have shown that the relative contributions of proteins and carbohydrates to the biomass composition change significantly with varying dilution rate [37]. Known variations in biomass composition at different culture conditions can be incorporated by manipulating the stoichiometric coefficients of the precursors in the biomass formation rate, but the precursor coefficients must be collectively adjusted so that total biomass is conserved [48].

We have applied classical FBA to the iND750 model to investigate steady-state growth and ethanol production characteristics. The MATLAB interface to the LP code MOSEK was used to solve the linear program given a set of glucose and oxygen uptake rates. Figure 9.3 shows the growth rate and ethanol production rate surfaces obtained when the LP was repeatedly solved over a representative grid of glucose and oxygen uptake rates. Although alternative optimal solutions are a well-known problem with FBA [49], we did not encounter this issue in these computations. The surfaces are nontrivial functions of the substrate uptake rates, as demonstrated by the maximum in the ethanol production rate for microaerobic growth conditions. More complicated behavior is observed when additional metabolic products are considered, as demonstrated by the existence of seven metabolic phenotypes in the iFF708 genome-scale model [50]. Figure 9.3 suggests that the development of a comparable unstructured model [17] would minimally require the specification of different metabolic parameters for anaerobic, microaerobic, and aerobic growth. While conceptually possible, the development of unstructured models that attempt to reproduce multiple metabolic phenotypes quickly becomes unwieldy. Given the development of the highly efficient dynamic simulation and optimization techniques discussed in this chapter, there is little motivation to develop unstructured models when a detailed stoichiometric model is available.

9.3.2 Dynamic simulation of fed-batch cultures

Using the stoichiometric models described in Section 9.3.1, we have developed dynamic flux balance models for combined aerobic and anaerobic fed-batch growth of S. cerevisiae. Each model consists of intracellular steady-state flux balances coupled to dynamic extracellular mass balances through kinetic uptake expressions for the two substrates (glucose and oxygen). The linear program used to resolve the underdetermined flux balances was formulated as shown in Figure 9.2. The uptake kinetics for glucose ($v_g$) and oxygen ($v_o$) were modeled as

$$v_g = v_{g,\max} \, \frac{G}{K_g + G} \, \frac{1}{1 + \dfrac{E}{K_{ie}}} \qquad (9.5)$$

$$v_o = v_{o,\max} \, \frac{O}{K_o + O} \qquad (9.6)$$

where G, E, and O are the glucose, ethanol, and dissolved oxygen concentrations, respectively, $K_g$ and $K_o$ are saturation constants, $v_{g,\max}$ and $v_{o,\max}$ are maximum uptake rates, and $K_{ie}$ is an inhibition constant. The glucose uptake rate follows Michaelis-Menten kinetics with an additional regulatory term to capture growth rate suppression due to high ethanol concentrations [15]. Ethanol uptake was excluded from the model because ethanol consumption is oxidative and only experimentally observed when glucose is nearly exhausted [21], conditions which do not occur in these simulations. The dynamic mass balances on the extracellular environment were posed as

$$\frac{dV}{dt} = F \qquad (9.7)$$

$$\frac{d(VX)}{dt} = \mu V X \qquad (9.8)$$

$$\frac{d(VG)}{dt} = F G_f - v_g V X \qquad (9.9)$$

$$\frac{d(VE)}{dt} = v_e V X \qquad (9.10)$$

where V is the liquid volume, X is the biomass concentration, $G_f$ is the glucose feed concentration, and F is the feed flow rate. The growth rate (μ) and the ethanol secretion flux ($v_e$) were resolved by solution of the inner flux balance model. Although not shown here, analogous equations can be posed for other metabolic byproducts such as glycerol. The dissolved oxygen concentration was treated as an input variable under the assumption that its dynamic profile could be tracked by a suitably designed feedback controller. This simplification was deemed reasonable because anaerobic conditions were used to promote ethanol production during later stages of the batch when high cell densities might limit oxygen mass transfer. Consequently, extracellular oxygen balances were omitted and the dissolved oxygen concentration was simply represented as the percent of saturation, DO = O/$O_{sat}$, where $O_{sat}$ is the saturation concentration.

Figure 9.3 Surfaces for optimal growth rate (μ) and ethanol production rate ($v_e$) obtained by repeatedly solving the iND750 stoichiometric model over a range of glucose ($v_g$) and oxygen ($v_o$) uptake rates.

Nominal model parameter values are listed in Table 9.2. Literature values for a wild-type yeast strain [51] were used for the glucose ($v_{g,\max}$, $K_g$) and oxygen ($v_{o,\max}$, $K_o$) uptake kinetic parameters. The glucose inhibition constant with respect to ethanol ($K_{ie}$) was chosen to give reasonable predictions of experimentally observed glucose, biomass, and ethanol profiles in batch culture with glucose media [21]. The saturation oxygen concentration ($O_{sat}$) was determined from Henry’s law at 1.0 atm and 30°C. The initial and final liquid volumes ($V_0$, $V_f$), initial glucose ($G_0$) and biomass ($X_0$) concentrations, feed flow rate (F), glucose feed concentration ($G_f$), and final batch time ($t_f$) were chosen as representative values for a bench-scale bioreactor available in our laboratory.
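The uptake expressions (9.5) and (9.6) are simple enough to evaluate directly. The sketch below is a minimal illustration, assuming the nominal kinetic values of Table 9.2 with the value-to-symbol assignment inferred from the units and from the order in which the parameters are described in the text; the function and dictionary names are introduced here for illustration only.

```python
# Nominal kinetic values as read from Table 9.2 (assignment to symbols is
# an assumption based on the units and the description in the text).
PARAMS = {
    "vg_max": 20.0,   # mmol/gdw/h, maximum glucose uptake rate
    "Kg": 0.5,        # g/l, glucose saturation constant
    "vo_max": 8.0,    # mmol/gdw/h, maximum oxygen uptake rate
    "Ko": 0.003,      # mmol/l, oxygen saturation constant
    "Kie": 10.0,      # g/l, ethanol inhibition constant
    "Osat": 0.30,     # mmol/l, oxygen saturation concentration
}

def glucose_uptake(G, E, p=PARAMS):
    """Equation (9.5): saturation kinetics in glucose G (g/l) with
    inhibition by ethanol E (g/l). Returns v_g in mmol/gdw/h."""
    return p["vg_max"] * G / (p["Kg"] + G) / (1.0 + E / p["Kie"])

def oxygen_uptake(DO, p=PARAMS):
    """Equation (9.6) with O = DO * Osat, where DO is the dissolved oxygen
    level as a fraction of saturation. Returns v_o in mmol/gdw/h."""
    O = DO * p["Osat"]
    return p["vo_max"] * O / (p["Ko"] + O)

print(glucose_uptake(G=10.0, E=0.0))    # near the maximum rate
print(glucose_uptake(G=10.0, E=10.0))   # halved when E equals Kie
print(oxygen_uptake(DO=0.5))            # aerobic phase
print(oxygen_uptake(DO=0.0))            # anaerobic phase, no oxygen uptake
```

With these values the oxygen uptake is close to saturation at 50% DO because the saturation constant is small relative to the dissolved oxygen concentration.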

Table 9.2 Parameter Values for Glucose Media Dynamic Simulation

Variable   Value            Reference
vg,max     20 mmol/gdw/h    [51]
Kg         0.5 g/l          [51]
vo,max     8 mmol/gdw/h     [51]
Ko         0.003 mmol/l     [51]
Osat       0.30 mmol/l      —
Kie        10 g/l           —
V0         0.5 l            —
G0         10.0 g/l         —
X0         0.05 g/l         —
F          0.044 l/h        —
Gf         100 g/l          —
tf         16.0 hours       —
Vf         1.2 l            —

Dynamic simulations were performed in MATLAB using the code ode23 to integrate the extracellular mass balance equations. The inner linear program was evaluated inside the integration routine along with the dynamic equations using the MATLAB interface to the linear program (LP) code MOSEK. A possible problem with FBA is the presence of multiple optimal solutions, which implies the existence of an infinite number of different flux distributions that produce the same optimal growth rate [42]. Multiple optimal solutions with respect to the ethanol secretion rate were handled by first solving the LP for the maximum growth rate, and then fixing the growth rate at this maximum value and re-solving the LP for maximum ethanol secretion. This approach eliminated variability in the ethanol production rate arising from multiple optima by selecting the theoretical maximum ethanol production consistent with the maximal growth rate.
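The two-stage treatment of multiple optima can be sketched on a toy problem. The example below is only an illustration under stated assumptions: the two-variable "network" in (mu, ve), its coefficients, and the helper `solve_lp_2d` are hypothetical, and a brute-force vertex enumeration stands in for the MOSEK solver used with the genome-scale model.

```python
from itertools import combinations


def solve_lp_2d(c, A, b, tol=1e-7):
    """Maximize c.x subject to A x <= b for x in R^2 by vertex enumeration.

    Only suitable for tiny illustrative problems; a real flux balance model
    needs a proper LP solver.
    """
    rows = list(zip(A, b))
    best_x, best_val = None, float("-inf")
    for (a1, b1), (a2, b2) in combinations(rows, 2):
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if abs(det) < 1e-12:
            continue  # parallel constraints have no unique intersection
        # Intersection of the two constraint lines (Cramer's rule).
        x = ((b1 * a2[1] - a1[1] * b2) / det,
             (a1[0] * b2 - b1 * a2[0]) / det)
        # Keep the point only if it satisfies every constraint.
        if all(a[0] * x[0] + a[1] * x[1] <= bi + tol for a, bi in rows):
            val = c[0] * x[0] + c[1] * x[1]
            if val > best_val:
                best_x, best_val = x, val
    return best_x, best_val


# Hypothetical toy problem in x = (mu, ve):
#   mu <= 0.3            (growth cap)
#   2*mu + ve <= 1.0     (shared substrate budget)
#   mu >= 0, ve >= 0
A = [(1.0, 0.0), (2.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
b = [0.3, 1.0, 0.0, 0.0]

# Stage 1: maximize growth. Any ve in [0, 0.4] gives the same optimal mu,
# so the ethanol flux is not unique at this stage.
_, mu_star = solve_lp_2d((1.0, 0.0), A, b)

# Stage 2: fix growth at its maximum (mu >= mu_star) and maximize ethanol.
A2 = A + [(-1.0, 0.0)]
b2 = b + [-mu_star]
(mu_fixed, ve_star), _ = solve_lp_2d((0.0, 1.0), A2, b2)
```

The second solve removes the ambiguity left by the first: among all flux distributions with maximal growth, it returns the one with the largest ethanol flux.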

Figure 9.4 shows the results of a fed-batch simulation with the second generation iND750 genome-scale stoichiometric model. The glucose feed flow rate was held constant during the batch, while a switch from aerobic (50% DO) to anaerobic (0% DO) growth was implemented at 7.7 hours. A rapid increase in the biomass concentration was observed under aerobic growth conditions. The switch to anaerobic growth resulted in a substantially increased ethanol production rate at the expense of biomass production. The switching time (ts) was chosen such that the glucose was nearly exhausted by the end of the batch. Competition between the byproducts glycerol and ethanol was observed after the switch to anaerobic conditions. Although not shown here, similar results were obtained with the iGH99 small-scale pathway model and the first generation iFF708 genome-scale model. Given that the computation time for the dynamic simulation with the iND750 model was only 9 seconds on a 3.0 GHz Pentium IV workstation, there was little motivation to utilize a simpler and more computationally efficient stoichiometric model.
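The structure of such a fed-batch simulation can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: it assumes that (9.7) is dV/dt = F, that glucose uptake follows Michaelis-Menten kinetics with ethanol inhibition, and that uptake in mmol/gdw/h is converted to g/gdw/h with the molar mass of glucose. Most importantly, the inner flux balance LP is replaced by a hypothetical piecewise-linear surrogate (`surrogate_fluxes`) scaled from the wild-type values in Table 9.4, and a fixed-step Euler scheme stands in for ode23.

```python
MW_GLC = 0.18016  # g/mmol; converts glucose uptake to a mass basis (assumption)


def glucose_uptake(G, E, vg_max=20.0, Kg=0.5, Kie=10.0):
    """Assumed uptake law (mmol/gdw/h) using the Table 9.2 parameter values."""
    return vg_max * G / (Kg + G) / (1.0 + E / Kie)


def surrogate_fluxes(vg, aerobic):
    """Hypothetical stand-in for the inner LP: returns (mu, ve).

    Up to 5 mmol/gdw/h of glucose is assigned the aerobic wild-type yields of
    Table 9.4 when oxygen is available; the rest uses the anaerobic yields.
    """
    v_ox = min(vg, 5.0) if aerobic else 0.0
    v_ferm = vg - v_ox
    mu = (0.339 / 5.0) * v_ox + (0.085 / 5.0) * v_ferm   # 1/h
    ve = MW_GLC * (0.166 * v_ox + 0.424 * v_ferm)        # g ethanol/gdw/h
    return mu, ve


def simulate_fed_batch(ts=7.7, tf=16.0, dt=0.002,
                       V0=0.5, G0=10.0, X0=0.05, F=0.044, Gf=100.0):
    """Integrate the extracellular balances with explicit Euler steps."""
    V, VX, VG, VE = V0, V0 * X0, V0 * G0, 0.0
    for k in range(round(tf / dt)):
        t = k * dt
        G, E = max(VG / V, 0.0), VE / V
        vg = glucose_uptake(G, E)
        mu, ve = surrogate_fluxes(vg, aerobic=(t < ts))
        dV = F                                # assumed form of (9.7)
        dVX = mu * VX                         # (9.8), since VX = V * X
        dVG = F * Gf - MW_GLC * vg * VX       # (9.9) on a mass basis
        dVE = ve * VX                         # (9.10)
        V, VX, VE = V + dt * dV, VX + dt * dVX, VE + dt * dVE
        VG = max(VG + dt * dVG, 0.0)          # guard against overshoot
    return {"V": V, "X": VX / V, "G": VG / V, "E": VE / V}


result = simulate_fed_batch()
```

In the actual dynamic flux balance model, the call to `surrogate_fluxes` is where the LP would be solved at every evaluation of the right-hand side, with the computed uptake rates imposed as flux bounds.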

9.3.3 Dynamic optimization of fed-batch cultures

Figure 9.4 Fed-batch simulation profiles for the iND750 stoichiometric model with constant glucose feed rate and a switch in the dissolved oxygen concentration from 50% DO to 0% DO at 7.7 hours indicated by the vertical line.

The primary operational challenges associated with fed-batch cultures are the determination of the initial substrate concentrations and liquid volume, the feeding policies of the substrates throughout the batch, and the final batch time. Fed-batch performance can be highly sensitive to these variables due to their complex effects on cellular metabolism. Therefore, model-based optimization is an essential tool for determining fed-batch operating strategies. The transient nature of fed-batch fermentation requires that the optimal operating policy be determined by solving a dynamic optimization problem in which a final time objective (e.g., productivity) is maximized subject to constraints imposed by dynamic model equations [52, 53]. The formulation and solution of dynamic optimization problems for maximizing ethanol productivity in fed-batch S. cerevisiae fermentations have been extensively investigated [54–59]. These studies were based on simple unstructured models with phenomenological descriptions of cell growth and constant yield coefficients. Unstructured models cannot be expected to provide accurate predictions over the wide range of transient conditions observed in fed-batch culture.

We have developed dynamic optimization techniques for fed-batch fermentation based on dynamic flux balance models [46]. Our work has focused primarily on the small-scale iGH99 stoichiometric model due to the computationally intensive nature of the optimization problem. The oxygen uptake rate (vo) was modeled as in (9.6), while the glucose uptake rate (vg) followed Michaelis-Menten kinetics with additional inhibitory terms to capture regulatory effects due to high glucose and ethanol concentrations

$$v_g = v_{g,max}\,\frac{G}{K_g + G + \dfrac{G^2}{K_{ig}}}\;\frac{1}{1 + \dfrac{E}{K_{ie}}} \qquad (9.11)$$

where Kig and Kie are inhibition constants that restrict glucose uptake in the presence of high glucose or ethanol concentrations, respectively. Model parameter values used for fed-batch optimization are listed in Tables 9.2 and 9.3. The value of the inhibition constant Kig was chosen to yield reasonable model predictions. A constant glucose feed concentration Gf was used to avoid the trivial situation where the maximum concentration is always selected by the optimizer. A maximum DO concentration DOmax less than 100% was assumed to account for possible oxygen mass transfer limitations during later stages of the batch due to high biomass concentrations. The model parameter tss represents the time required for equipment maintenance between fed-batch runs.
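A short sketch of (9.11) makes the effect of the substrate inhibition term concrete. Setting the derivative of G/(Kg + G + G²/Kig) with respect to G to zero gives an interior maximum at G = sqrt(Kg·Kig), independent of the ethanol level. The numerical value Kig = 10 g/l used below is an assumption made for illustration; Kg, Kie, and vg,max are the Table 9.2 values.

```python
import math


def glucose_uptake(G, E, vg_max=20.0, Kg=0.5, Kig=10.0, Kie=10.0):
    """Glucose uptake rate (mmol/gdw/h) following the form of (9.11).

    Kig = 10 g/l is an assumed illustrative value.
    """
    substrate_term = G / (Kg + G + G * G / Kig)
    ethanol_term = 1.0 / (1.0 + E / Kie)
    return vg_max * substrate_term * ethanol_term


def glucose_at_max_uptake(Kg=0.5, Kig=10.0):
    """Glucose concentration that maximizes uptake: d(vg)/dG = 0."""
    return math.sqrt(Kg * Kig)


G_star = glucose_at_max_uptake()
v_star = glucose_uptake(G_star, E=0.0)
```

With these values the maximum is shallow: moving 0.5 g/l to either side of `G_star` changes the uptake rate by only about one percent, so a feeding policy that holds glucose near this level keeps uptake close to its peak.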

The objective function maximized was the weighted sum of the ethanol productivity and the ethanol yield on glucose. This dual objective allowed the tradeoff between high production rates and efficient substrate usage to be examined. The initial volume V(0) and glucose concentration G(0), the feed flow rate F(t) and dissolved oxygen concentration DO(t) profiles, and the final batch time tf were treated as decision variables. Therefore, the dynamic optimization problem had the form:


Table 9.3 Additional Model Parameter Values for Dynamic Optimization

Variable        Value
[symbol lost]   10 g/l
[symbol lost]   2.53 × 10^-4 mol/l
[symbol lost]   50 g/l
DOmax           50%
tss             6 hours

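$$
\begin{aligned}
\max_{V(0),\,G(0),\,F(t),\,DO(t),\,t_f} \quad & c_p P(t_f) + c_y Y(t_f) \\
\text{subject to:} \quad & \text{extracellular balances (9.7)–(9.10)} \\
& \text{uptake kinetics (9.6) and (9.11)} \\
& \text{flux balance LP} \\
& V(0) \ge 0.5\ \text{l}, \quad V(t_f) \le 1.2\ \text{l} \\
& 0 \le G(0) \le G_f \\
& 0 \le DO(t) \le DO_{max} \\
& F(t) \ge 0\ \text{l/h} \\
& 1\ \text{h} \le t_f \le 36\ \text{h} \\
& X(t),\ G(t),\ E(t) \ge 0\ \text{g/l}
\end{aligned}
\qquad (9.12)
$$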

The ethanol productivity P and the ethanol yield on glucose Y at the final batch time were defined as

$$P(t_f) = \frac{V(t_f)\,E(t_f)}{t_f + t_{ss}} \qquad (9.13)$$

$$Y(t_f) = \frac{V(t_f)\,E(t_f)}{V(0)\,G(0) + G_f \int_0^{t_f} F(t)\,dt} \qquad (9.14)$$

The parameters cp and cy are weights for the productivity and yield objectives, respectively. The bounds on the state variables X(t), G(t), and E(t) were specified to ensure a physically realistic solution. The bounds on the initial and final volumes were chosen for consistency with our experimental system. Lower and upper bounds on the final batch time were included to confine the solution space, but they had no effect on the optimal solutions generated.
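The objective terms in (9.13) and (9.14) are simple to evaluate once a feeding profile and the final state are known. The sketch below is illustrative only: the function names are hypothetical, the feed integral is approximated with the trapezoidal rule, and the numbers in the example are invented rather than taken from the optimization results.

```python
def integrate_trapezoid(t, f):
    """Trapezoidal approximation of the integral of f over the time grid t."""
    return sum(0.5 * (f[i] + f[i + 1]) * (t[i + 1] - t[i])
               for i in range(len(t) - 1))


def ethanol_productivity(V_f, E_f, t_f, t_ss):
    """Equation (9.13): grams of ethanol per hour, including turnaround time."""
    return V_f * E_f / (t_f + t_ss)


def ethanol_yield(V_f, E_f, V_0, G_0, G_feed, t, F):
    """Equation (9.14): grams of ethanol per gram of glucose supplied."""
    glucose_supplied = V_0 * G_0 + G_feed * integrate_trapezoid(t, F)
    return V_f * E_f / glucose_supplied


def weighted_objective(c_p, c_y, P, Y):
    """Objective of (9.12): weighted sum of productivity and yield."""
    return c_p * P + c_y * Y


# Invented example: constant feed of 0.05 l/h for 10 h into 0.5 l of medium.
t_grid = [0.0, 10.0]
F_profile = [0.05, 0.05]
P = ethanol_productivity(V_f=1.0, E_f=12.0, t_f=10.0, t_ss=6.0)
Y = ethanol_yield(V_f=1.0, E_f=12.0, V_0=0.5, G_0=10.0,
                  G_feed=50.0, t=t_grid, F=F_profile)
```

Because tss appears only in the denominator of P, a longer turnaround time lowers the productivity of every candidate policy but penalizes short batches proportionally more.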

A number of computational algorithms have been proposed to solve general dynamic optimization problems [60]. Sequential solution methods involve repeated iterations between a dynamic simulation code that integrates the model equations given a candidate feeding policy and a nonlinear programming code that determines an improved feeding policy given the dynamic simulation results. Simultaneous solution methods based on temporal discretization of the dynamic model equations have proven to be more effective due to their ability to handle state dependent constraints and their applicability to large optimal control problems [61, 62]. Our attempts to solve the dynamic optimization problem (9.12) using a sequential method [16] proved unsuccessful due to problem complexity. Therefore, we employed a simultaneous solution method in which the bilevel dynamic optimization problem (9.12) was reformulated as a single level nonlinear program with only algebraic constraints. The procedure required temporal discretization of the extracellular balances (9.7)–(9.10) and replacement of the inner LP with its associated first-order optimality conditions to generate complementarity constraints [63]. Discretization was performed with Radau collocation on finite elements using a monomial basis representation [61, 64] with 61 finite elements and two internal collocation points per element for a total of 184 discretization points. The linear program was enforced only at the beginning of each finite element to reduce the overall problem size. The decision variables DO(t) and F(t) were restricted to change only at the element boundaries.
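The replacement of the inner LP by its optimality conditions can be written out for a generic flux balance problem. The notation here is illustrative and not taken from the chapter: S is the stoichiometric matrix, v the flux vector, c the objective weights, and v^L, v^U the flux bounds, some of which depend on the extracellular state through the uptake kinetics. Assume the inner problem at a given time point is

```latex
\max_{v}\; c^{T} v
\quad \text{subject to} \quad
S v = 0, \qquad v^{L} \le v \le v^{U}.
```

Because the problem is linear, the first-order (Karush-Kuhn-Tucker) conditions are necessary and sufficient for optimality. With multipliers \lambda for the balances and \alpha^{L}, \alpha^{U} for the bounds, they read

```latex
\begin{aligned}
& -c + S^{T}\lambda - \alpha^{L} + \alpha^{U} = 0, \\
& S v = 0, \qquad v^{L} \le v \le v^{U}, \\
& \alpha^{L} \ge 0, \qquad \alpha^{U} \ge 0, \\
& \alpha^{L}_{i}\,\bigl(v_{i} - v^{L}_{i}\bigr) = 0, \qquad
  \alpha^{U}_{i}\,\bigl(v^{U}_{i} - v_{i}\bigr) = 0 \quad \text{for all } i.
\end{aligned}
```

The last line contains the complementarity constraints: a bound multiplier can be nonzero only when the corresponding flux sits at that bound. Imposing these algebraic conditions at each point where the LP is enforced turns the bilevel problem into a single nonlinear program.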



The dynamic optimization problem was solved through the AMPL interface to the nonlinear program solver CONOPT. AMPL is a mathematical programming language that provides analytic Jacobian and Hessian information to the solver through integrated automatic differentiation [65]. CONOPT is a feasible path, multimethod nonlinear program solver based on the generalized reduced gradient method [66]. The optimization problem consisted of 36,049 decision variables and 30,496 algebraic constraints. The computation time necessary to obtain a converged solution varied from 126 to 221 seconds depending on the initialization and objective function used. Subsequent solutions for small changes in the objective function or constraints required only a small fraction of the initial computation time. All computations were performed on a 3.0 GHz Pentium 4 CPU. Solution of the full dynamic optimization problem (9.12) with a larger stoichiometric model such as the iFF708 or iND750 genome-scale network has proven more challenging and is currently under development. We have used the iND750 model to solve a simpler batch optimization problem for the optimal aerobic-anaerobic switching time within a metabolic engineering context (see Section 9.3.4).

Figure 9.5 shows the optimal control profiles for the feed flow rate and the dissolved oxygen generated from solution of the dynamic optimization problem for maximization of ethanol productivity: cy = 0 in (9.12). The calculated optimal state profiles and the simulated profiles obtained from direct simulation of the optimal control profiles are also displayed. Slight differences between the optimal and simulated profiles originated from the approximation of constant fluxes across finite elements used in the optimization problem. The optimal control policy produced an initial glucose concentration (14.6 g/l) well below its upper bound and no initial glucose feed. The glucose concentration declined until feeding began at t = 7.0 hours. Then the glucose feed flow rate increased over time such that the glucose concentration remained approximately constant until the final volume constraint was encountered at t = 13.4 hours. Analysis of (9.11) revealed that this constant glucose concentration resided very close to the relatively flat maximum in the glucose uptake rate. A sudden switch in the dissolved oxygen from the initial maximum to a final value near zero was observed at t = 8.4 hours. This switch separated an initial aerobic phase of high cell growth from a subsequent microaerobic phase of high ethanol production.

Figure 9.5 Optimal glucose feed (top inset) and dissolved oxygen (bottom inset) profiles for dynamic optimization of ethanol productivity with the iGH99 stoichiometric model and the corresponding simulated and optimal profiles of the biomass, glucose, and ethanol concentrations.

The dynamic optimization problem formulated in (9.12) contains a dual objective for ethanol productivity and ethanol yield. A parametric sensitivity analysis was performed to examine the trade-off between these competing objectives. The analysis involved repeated solution of the dynamic optimization problem with a constant value of the productivity weight (cp = 0.81−1) and a wide range of values for the yield weight (0 ≤ cy ≤ 60). A nonzero value of the productivity weight was used to avoid solutions lying at the maximum bound of the final time constraint and exhibiting a dramatic decline in the productivity for an insignificant increase in the yield. Therefore, this strategy produced optimal policies for maximization of ethanol yield where the overall productivity loss was minimized. Figure 9.6 shows that increasing yields were achieved at the expense of decreasing productivities and longer batch times. The productivity versus yield curve represents the locus of achievable optima for the dual objective where the entire area above the curve is unachievable.
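The weight sweep behind such a trade-off curve can be illustrated with a toy calculation. The sketch below assumes a small, invented set of candidate operating policies, each summarized by a (productivity, yield) pair, and simply picks the best candidate for each yield weight. In the chapter each point instead comes from a full solution of (9.12); the candidate values and the unit productivity weight are hypothetical.

```python
# Invented (productivity in g/h, yield in g/g) pairs for candidate policies.
CANDIDATES = [
    (1.20, 0.300),
    (1.10, 0.360),
    (0.90, 0.400),
    (0.60, 0.420),
    (0.30, 0.428),
]


def best_policy(c_p, c_y, candidates):
    """Return the candidate maximizing the weighted sum c_p*P + c_y*Y."""
    return max(candidates, key=lambda py: c_p * py[0] + c_y * py[1])


def trade_off_curve(c_p, c_y_values, candidates):
    """Sweep the yield weight at a fixed productivity weight."""
    return [best_policy(c_p, c_y, candidates) for c_y in c_y_values]


# Fixed nonzero productivity weight, yield weight swept over 0 <= c_y <= 60.
curve = trade_off_curve(1.0, [0.0, 2.0, 8.0, 30.0, 60.0], CANDIDATES)
productivities = [p for p, _ in curve]
yields = [y for _, y in curve]
```

As the yield weight grows, the selected yield never decreases and the selected productivity never increases, which is the monotone pattern a weighted-sum sweep always produces. Keeping the productivity weight nonzero also breaks ties between policies with nearly equal yield in favor of the more productive one.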

Figure 9.6 Trade-off between ethanol productivity and ethanol yield on glucose (left), and the relationship between ethanol yield and the batch time (right) obtained from dynamic optimization with the iGH99 stoichiometric model. The square and the circle correspond to the optimization results shown in Figures 9.5 and 9.7, respectively.

During calculation of the yield-productivity trade-off curve, the ethanol yield eventually saturated with respect to increasing values of the yield weight (cy). This trend indicated that the yield was at its overall maximum and the productivity was at its maximum with respect to this yield. Figure 9.7 shows the optimal feeding policy that generated this point (circle in Figure 9.6). The calculated optimal state profiles and the simulated profiles obtained from direct simulation of the optimal control profiles are also shown. The results obtained were markedly different from the maximum productivity results (Figure 9.5). As before, the glucose concentration decreased until feeding began, after which a relatively constant glucose concentration was maintained. However, the combined objective produced a lower initial glucose concentration to increase yield and earlier glucose feeding to achieve the glucose concentration that maximized uptake. The dissolved oxygen concentration profile showed that microaerobic growth conditions were utilized throughout the batch.

9.3.4 Identification of ethanol overproduction mutants

Figure 9.7 Optimal glucose feed (top inset) and dissolved oxygen (bottom inset) profiles obtained from dynamic optimization with the iGH99 stoichiometric model for a combined yield-productivity objective where yield was most heavily weighted, and the corresponding simulated and optimal profiles of the biomass, glucose, and ethanol concentrations.

The production of ethanol from recombinant S. cerevisiae strains has received considerable attention for renewable liquid fuel applications. A recent study [9] revealed novel metabolic engineering targets for improved ethanol production from glucose media based on classical FBA of the iFF708 stoichiometric model. In addition to being limited to steady-state culture conditions, this computational analysis failed to explicitly address the synergistic effects of the biomass and ethanol yields. While all the metabolic engineering strategies considered increased both these yields through modification of the cellular redox balance, the most favorable strategy for enhanced ethanol productivity could not be identified due to the complex relation between these two yields and total ethanol production. Moreover, regulatory processes such as ethanol inhibition of growth that are active under dynamic culture conditions can lead to alternative metabolic engineering strategies that are not easily identifiable from steady-state analysis.

We have utilized the iND750 dynamic flux balance model (Section 9.3.2) to develop computational techniques for identifying promising S. cerevisiae mutants for ethanol production in fed-batch culture [67]. The ethanol productivity was defined as the overall rate of ethanol production from the batch:

$$\text{Productivity} = \frac{(VE)\big|_{t=t_f}}{t_f} \qquad (9.15)$$

where tf denotes the final batch time. Dynamic optimization of fed-batch ethanol productivity was performed with the switching time between partially aerobic (50% DO; hereafter referred to as aerobic) and anaerobic (0% DO) conditions treated as the only decision variable to simplify the optimization problem. Additionally, a constant feed flow rate and feed glucose concentration, fixed initial conditions, and a fixed final batch time were utilized. The resulting single variable optimization problem was solved with the MATLAB optimization function fminsearch.
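The single-variable problem can be sketched as a scan over candidate switching times, each evaluated by one fed-batch simulation. This is a rough illustration under several assumptions: a plain grid scan is swapped in for fminsearch, the inner flux balance LP is replaced by a hypothetical piecewise-linear surrogate scaled from the wild-type values in Table 9.4, (9.7) is taken to be dV/dt = F, and the glucose uptake law and unit conversion are assumed as noted in the comments.

```python
MW_GLC = 0.18016  # g/mmol; assumed conversion of uptake to a mass basis


def productivity(ts, tf=16.0, dt=0.002, V0=0.5, G0=10.0, X0=0.05,
                 F=0.044, Gf=100.0):
    """Ethanol productivity (9.15) for an aerobic-to-anaerobic switch at ts."""
    V, VX, VG, VE = V0, V0 * X0, V0 * G0, 0.0
    for k in range(round(tf / dt)):
        G, E = max(VG / V, 0.0), VE / V
        # Assumed uptake law with the Table 9.2 parameter values.
        vg = 20.0 * G / (0.5 + G) / (1.0 + E / 10.0)
        # Hypothetical surrogate for the inner LP (not the iND750 model).
        v_ox = min(vg, 5.0) if k * dt < ts else 0.0
        v_ferm = vg - v_ox
        mu = (0.339 / 5.0) * v_ox + (0.085 / 5.0) * v_ferm
        ve = MW_GLC * (0.166 * v_ox + 0.424 * v_ferm)
        # Explicit Euler update; the right-hand sides use the old state.
        VX, VG, VE, V = (VX + dt * mu * VX,
                         max(VG + dt * (F * Gf - MW_GLC * vg * VX), 0.0),
                         VE + dt * ve * VX,
                         V + dt * F)
    return VE / tf  # (V * E) at t = tf divided by tf


def scan_switching_time(tf=16.0, n_points=33):
    """Evaluate the productivity on a uniform grid of switching times."""
    grid = [tf * i / (n_points - 1) for i in range(n_points)]
    values = [productivity(ts, tf=tf) for ts in grid]
    best = max(range(n_points), key=values.__getitem__)
    return grid[best], values[best], list(zip(grid, values))


best_ts, best_p, curve = scan_switching_time()
```

The list `curve` corresponds to a productivity versus switching time sensitivity plot, and a derivative-free local search such as fminsearch could then refine `best_ts` starting from the best grid point.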

We applied steady-state and dynamic FBA to 10 metabolic engineering strategies that included eight gene insertions and two combination gene deletion/overexpression strategies that were previously predicted to enhance anaerobic biomass and ethanol yields when classical FBA was applied to the iFF708 stoichiometric model [9]. The steady-state analysis was performed with the iND750 model to provide a consistent basis for comparing our DFBA results. The 10 strategies were implemented by the addition of reactions to the metabolic network for gene insertions, by the removal of reactions for gene deletions, and by the removal of bound constraints from reactions for gene overexpressions as discussed in [9]. Each inserted reaction was charge and elementally balanced for consistency with the iND750 model. Table 9.4 shows the FBA results obtained with the iND750 model, where the two combination gene deletion/overexpression strategies are denoted Δgdh1 glt1 gln1 and Δgdh1 gdh2 according to the genes manipulated, and labels for the eight gene insertions correspond to reaction entries in the KEGG LIGAND database (http://www.genome.jp/). As shown previously with the iFF708 model, all ten manipulations produced enhanced ethanol and growth yields for steady-state anaerobic growth. Under aerobic conditions, only one manipulation generated ethanol and biomass yields that differed from the wild type. Enhanced aerobic ethanol production at the expense of reduced growth was predicted for this strategy.

In the original study [9], anaerobic yield enhancements of 4.2–10.4% for ethanol and 5.2–16.5% for biomass were predicted for the 10 manipulations. We found significantly reduced ethanol yield enhancements of 3.4–6.1%. We believe that additional compartmentalization and full charge balancing of the iND750 model used in our study were the primary causes of the discrepancies with the iFF708 model used in the original study. The lower ethanol yield prediction from the iND750 model was more consistent with experimental data for the R01058 mutant, but both models overpredicted the experimentally observed R01058 growth rate [9]. The results in Table 9.4 demonstrate a notable shortcoming of classical FBA. Given different relative enhancements in anaerobic ethanol and biomass yields, the preferred in silico manipulation for anaerobic ethanol production cannot be directly determined. A similar difficulty is encountered for aerobic growth, where the impact of increased ethanol and decreased biomass yields for the Δgdh1 glt1 gln1 mutant cannot be quantitatively compared to the wild type with respect to total ethanol production. Consequently, the preferred manipulation for fed-batch ethanol production in which an aerobic growth phase is followed by an anaerobic growth phase cannot be determined without further analysis.

The ethanol productivity represents a single measure of fed-batch performance that explicitly incorporates the tradeoff between possibly time-varying ethanol and biomass yields throughout the batch. DFBA results for the sensitivity of the ethanol productivity to the aerobic-anaerobic switching time in fed-batch culture are shown in Figure 9.8. These results were generated for each manipulation strategy by repeated fed-batch simulation with different switching times. The maximal productivities, shown as peaks and checked by solving a single variable optimization problem with the switching time as the decision variable, were used to produce an explicit ranking of the manipulation strategies: (1) R00105/R01039/R01058; (2) R00365/R01866/R00112/R00845/R01063; (3) Δgdh1 gdh2; (4) Δgdh1 glt1 gln1; and (5) wild type. The vertical dotted line at the optimal productivity for the wild type demonstrates that the manipulation strategies have different optimal switching times. Consequently, optimal performance is dependent both on the metabolic engineering strategy and the fed-batch operating policy. This result suggests that attempts to separately optimize the cellular design and the fermentation conditions are likely to produce suboptimal performance.

Table 9.4 Steady-State FBA for Mutants in Glucose Media

Strategy                                                                  Ethanol Yield   Biomass Yield   Reaction Flux
                                                                          Increase (%)    Increase (%)    (mmol/g/h)

Anaerobic: glucose uptake, vg = 5.0 mmol g−1 h−1; oxygen uptake, vo = 0.0 mmol g−1 h−1
Deletion of gdh1 and overexpression of glt1 and gln1 (Δgdh1 glt1 gln1)    3.4             5.4             —
Deletion of gdh1 and overexpression of gdh2 (Δgdh1 gdh2)                  3.7             11.0            —
Insertion of NAD dependent glycine dehydrogenase (R00365)                 3.8             18.0            1.18
Insertion of NADP dependent orotate reductase (R01866)                    3.9             18.1            1.20
Insertion of a transhydrogenase (R00112)                                  3.8             18.0            1.18
Insertion of NADH kinase (R00105)                                         6.1             5.6             0.87
Insertion of NADP dependent glycerol dehydrogenase (R01039)               6.1             5.6             0.88
Insertion of NADP dependent glycerol 3-phosphate dehydrogenase (R00845)   3.8             18.0            1.18
Insertion of NADP dependent glyceraldehyde-3-phosphate
  dehydrogenase (R01063)                                                  3.8             18.0            1.18
Insertion of nonphosphorylating NADP dependent
  glyceraldehyde-3-phosphate dehydrogenase (e.g., ) (R01058)              6.1             5.6             0.87
Wild type: growth rate, μ = 0.085 h−1; ethanol yield = 0.424 g/g

Aerobic†: glucose uptake, vg = 5.0 mmol g−1 h−1; oxygen uptake, vo = 7.84 mmol g−1 h−1
Deletion of gdh1 and overexpression of glt1 and gln1 (Δgdh1 glt1 gln1)    9.5             –7.4            —
Wild type: growth rate, μ = 0.339 h−1; ethanol yield = 0.166 g/g

†Aerobic yields differed from the wild type only for the single strategy reported.

We also assembled a library of 357 gene insertion candidates from the KEGG LIGAND database in an attempt to uncover novel metabolic engineering strategies for ethanol overproduction in fed-batch culture. Only reactions involving species present in the cytosol of the iND750 model were considered. We were able to match 517 of the 575 iND750 cytosolic species to compounds in the KEGG database, and 788 reactions involved only these matched species. The iND750 metabolic network already included 431 of these reactions, yielding a reduced set of 357 reactions corresponding to potential gene insertions. All reactions were assumed reversible unless available experimental data or other genome-scale models [1] suggested otherwise. The reactions extracted from the KEGG database were charge and elementally balanced for consistency with the iND750 model.
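The library construction amounts to two set filters: keep database reactions whose species are all matched to the model's cytosolic species, then drop reactions the model already contains (788 − 431 = 357 in the chapter). The sketch below shows the logic on invented data; the reaction and species identifiers are placeholders, not KEGG or iND750 entries.

```python
def candidate_insertions(database_reactions, matched_species, model_reactions):
    """Select reactions that could be inserted into the model.

    database_reactions: dict mapping reaction id -> iterable of species ids
    matched_species:    set of species ids present in the model cytosol
    model_reactions:    set of reaction ids already in the model
    """
    candidates = {}
    for rxn_id, species in database_reactions.items():
        if rxn_id in model_reactions:
            continue  # already part of the metabolic network
        if set(species) <= matched_species:
            candidates[rxn_id] = list(species)  # all species are matched
    return candidates


# Placeholder data for illustration only.
database = {
    "rxn_A": ["s1", "s2"],
    "rxn_B": ["s2", "s3"],
    "rxn_C": ["s1", "s9"],   # s9 has no match in the model cytosol
    "rxn_D": ["s3", "s4"],
}
cytosolic_species = {"s1", "s2", "s3", "s4"}
already_in_model = {"rxn_B"}

library = candidate_insertions(database, cytosolic_species, already_in_model)
```

Each surviving entry would then be added to the stoichiometric matrix as a new column and screened by dynamic simulation.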

The fed-batch performance of each candidate insertion was assessed by optimizing the aerobic-anaerobic switching time to determine maximal ethanol productivity. Figure 9.9 shows the dynamic screening results where the insertions are labeled by their entries in the KEGG LIGAND database and the eight insertions suggested in [9] are indicated by white bars. In addition to the eight previously analyzed insertions, DFBA identified 21 new insertion strategies with productivity enhancements greater than 3% over the wild type value. The insertions could be grouped into three sets, each with the same aerobic-anaerobic switching time and very similar productivities. The switching time varied only slightly between these three groups. The two new candidate insertions with the highest productivities correspond to expression of an NADP-specific 1-pyrroline-5-carboxylate dehydrogenase (R00708) and an NADP-malic enzyme (R00216). An NAD-specific 1-pyrroline-5-carboxylate dehydrogenase and the same NADP-malic enzyme were already expressed in the mitochondria of the iND750 model, so identification of these cytosolic insertions required a compartmentalized metabolic network model. Both of the proposed insertions maintain a favorable redox balance for ethanol production by generating NADPH, and therefore they represent similar design alternatives to those previously proposed [9].

9.3.5 Exploration of novel metabolic capabilities

Figure 9.8 Sensitivity of the ethanol productivity to the aerobic-anaerobic switching time (ts) in fed-batch culture predicted with the iND750 stoichiometric model. The dotted line indicates the optimal switching time for the wild type strain.

The computational studies described in Sections 9.3.1 to 9.3.4 illustrate that DFBA can be used to analyze and engineer the ethanol production capabilities of S. cerevisiae in glucose media. Because glucose is a preferred substrate and ethanol is a naturally secreted metabolite, these studies demonstrate the in silico enhancement of native metabolic capabilities. DFBA also can be used to investigate the engineering of novel metabolic capabilities such as the consumption of new substrates and the production of nonnative metabolic products. The wild-type stoichiometric model is expanded by including the intracellular reactions required to describe new substrate metabolism and/or metabolite synthesis, while the extracellular model is augmented with the corresponding substrate kinetics and extracellular mass balances. The resulting dynamic flux balance model can be used to perform in silico studies of novel substrate utilization and/or product formation behavior to guide experimental efforts.

Figure 9.9 Dynamic screening of a gene insertion library derived from the KEGG database for optimal fed-batch ethanol productivity predicted with the iND750 stoichiometric model. Results are presented as percentage increase in the ethanol productivity relative to the wild-type strain. Insertions proposed by [9] are shown as white bars. The number to the right of each bar indicates the optimal aerobic-anaerobic switching time.

Genetic engineering of xylose fermenting S. cerevisiae strains that can grow on media derived from agricultural products is important for the production of renewable liquid fuels [68–71]. We have developed a dynamic flux balance model that describes S. cerevisiae growth and ethanol production on glucose/xylose substrate mixtures [67]. The dynamic model consists of the iND750 stoichiometric model coupled to dynamic extracellular mass balances through uptake expressions for the three possible substrates (glucose, xylose, oxygen). The publicly available iND750 model [27] includes a mostly complete description of xylose metabolism such that the associated pathways become active only when a xylose uptake rate is specified. The modification needed for simulation of recombinant xylose utilizing strains was the insertion of the reverse reaction for xylitol dehydrogenase, which increased the number of fluxes to 1,265 for mixed-substrate studies (see Table 9.1).

The uptake kinetics for glucose (vg) and oxygen (vo) were modeled as in (9.5) and (9.6), respectively, while the xylose uptake (vz) was chosen as

$$v_z = v_{z,max}\,\frac{Z}{K_z + Z}\;\frac{1}{1 + \dfrac{E}{K_{ie}}}\;\frac{1}{1 + \dfrac{G}{K_{ig}}} \qquad (9.16)$$

where Z is the extracellular xylose concentration, Kz is a saturation constant, vz,max is the maximum uptake rate, and Kie and Kig are inhibition constants. The xylose uptake follows Michaelis-Menten kinetics with additional regulatory terms to capture growth rate suppression due to high ethanol concentrations [15] and inhibited xylose metabolism in the presence of the preferred substrate glucose [71]. For fed-batch operation, the dynamic mass balances on the extracellular environment were posed as in (9.7) to (9.10) with an additional equation for xylose:

$$\frac{d(VZ)}{dt} = F Z_f - v_z V X \qquad (9.17)$$

Although not shown here, analogous equations were posed for other key metabolic byproducts (glycerol and xylitol). Table 9.5 lists parameter values used for dynamic simulation of the xylose utilizing recombinant S. cerevisiae strain RWB 218, including experimentally derived glucose and xylose uptake kinetic parameters [71]. Literature values for a wild-type S. cerevisiae strain [51] were used for the oxygen uptake kinetic parameters. The fermenter operating conditions were chosen as representative values for our experimental system with equal concentrations of glucose and xylose in the media. Glucose and xylose are believed to be transported by the same family of hexose transporters with glucose being the preferred carbon source [71, 72]. The xylose inhibition constant with respect to glucose (Kig) was chosen to capture the effect of repressed xylose uptake in the presence of glucose [71].
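The strength of glucose repression in (9.16) is easy to see numerically. The sketch below evaluates the xylose uptake expression with parameter values read from Table 9.5; the assignment Kie = 10 g/l and Kig = 0.5 g/l is an assumption about which table row belongs to which constant, and the function name is illustrative.

```python
def xylose_uptake(Z, G, E, vz_max=32.0, Kz=14.85, Kie=10.0, Kig=0.5):
    """Xylose uptake rate (mmol/gdw/h) following the form of (9.16).

    Z, G, E are the xylose, glucose, and ethanol concentrations in g/l.
    Kie and Kig values are assumed assignments of Table 9.5 entries.
    """
    saturation = Z / (Kz + Z)
    ethanol_inhibition = 1.0 / (1.0 + E / Kie)
    glucose_inhibition = 1.0 / (1.0 + G / Kig)
    return vz_max * saturation * ethanol_inhibition * glucose_inhibition


# Initial medium composition of Table 9.5: 5 g/l xylose, 5 g/l glucose.
v_without_glucose = xylose_uptake(Z=5.0, G=0.0, E=0.0)
v_with_glucose = xylose_uptake(Z=5.0, G=5.0, E=0.0)
repression_factor = v_without_glucose / v_with_glucose
```

Under these assumed values, 5 g/l of glucose reduces xylose uptake by a factor of 1 + G/Kig = 11, so xylose consumption stays small until glucose is nearly exhausted.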

Figure 9.10 shows the results of a fed-batch simulation with constant feeding of a 50%/50% glucose/xylose mixture. The mixed-substrate results were generated with a batch time of 20 hours, longer than that of the pure glucose media simulation (Figure 9.4), because the xylose utilizing strain had higher saturation constants, a lower maximum glucose uptake rate, and inhibition of xylose uptake in the presence of glucose. Furthermore, a longer aerobic phase was necessary to generate a sufficiently high biomass concentration such that the substrate was mostly consumed by the final batch time. The switch from aerobic to anaerobic conditions at 16 hours was characterized by a significant increase in ethanol production and a sharp decline in biomass production. The xylose concentration increased due to media feeding until decreasing sharply after glucose was nearly exhausted. Glycerol production was insignificant as a result of the limited residual glucose present following the switch to anaerobic conditions. The production rate of the byproduct xylitol was much higher than that of the competing byproduct glycerol in the glucose media case (Figure 9.4), which suggested that metabolic engineering strategies are needed to divert carbon from xylitol to ethanol and/or biomass. These fed-batch predictions are in qualitative agreement with experimental batch profiles presented in [71].

Figure 9.10 Fed-batch simulation profiles for the xylose utilizing S. cerevisiae strain RWB 218 [71] obtained with an extended version of the iND750 stoichiometric model. The glucose/xylose feed was maintained constant and a switch in the dissolved oxygen concentration from 50% DO to 0% DO was implemented at 17.0 hours as indicated by the vertical line.

Table 9.5 Parameter Values for Glucose/Xylose Media Dynamic Simulation

Variable   Value            Reference
vg,max     7.3 mmol/gdw/h   [71]
Kg         1.026 g/l        [71]
vz,max     32 mmol/gdw/h    [71]
Kz         14.85 g/l        [71]
vo,max     8 mmol/gdw/h     [51]
Ko         0.003 mmol/l     [51]
Kie        10 g/l           —
Kig        0.5 g/l          —
Osat       0.30 mmol/l      —
V0         0.5 l            —
F          0.035 l/h        —
Gf         50 g/l           —
Zf         50 g/l           —
Vf         1.2 l            —
tf         20.0 h           —
G0         5 g/l            —
Z0         5 g/l            —
X0         0.05 g/l         —

Steady-state FBA results with glucose/xylose mixed substrates are presented in Table 9.6 for the ten genetic manipulations suggested in [9] and analyzed for glucose media in Section 9.3.4. While each manipulation was predicted to yield a simultaneous increase in the ethanol and biomass yields under anaerobic conditions compared to the wild type, the relative performance of these manipulations could not be determined without DFBA. Compared to glucose media (Table 9.4), higher increases in ethanol yields and smaller increases in biomass yields were predicted. Only the deletion/overexpression Δgdh1 glt1 gln1 mutant differed from the wild type under aerobic conditions (50% DO). The impact of the substantial increase in ethanol yield and the large decrease in biomass yield for aerobic growth was difficult to quantitatively assess, especially when considering fed-batch culture with both aerobic and anaerobic growth phases.

The sensitivity of fed-batch ethanol productivities with mixed substrates to the aerobic-anaerobic switching time is shown in Figure 9.11. The predicted productivities were substantially lower than for glucose media (Figure 9.8) due to reduced substrate uptake rates and significant secretion of xylitol as a competing byproduct. The productivity measure allowed an explicit ranking of the manipulation strategies, with the R00112, R00365, R00845, R01063, and R01866 insertions predicted to yield the best performance. These insertions comprised the second highest ranked group for glucose media, demonstrating that the media should be considered simultaneously with the genetic manipulation and the fed-batch operating policy to achieve optimal performance. Unlike for glucose media, the optimal switching time was relatively insensitive to the manipulation, suggesting that the optimum was most strongly affected by the substrate uptake kinetics. Only the deletion/overexpression Δgdh1 glt1 gln1 required a significantly different switching time, but this manipulation produced a substantially lower productivity due to its reduced aerobic biomass yield. Comparison of these dynamic predictions with the steady-state FBA results (Table 9.6) revealed that manipulations with relatively high biomass yields were most favorable for fed-batch growth on these mixed substrates. Tests with 25%/75% and 75%/25% glucose/xylose mixtures were conducted and similar trends were predicted (not shown).

In an effort to reveal novel metabolic engineering strategies for ethanol production from glucose/xylose media, we performed DFBA for mixed substrates to screen the 357 reactions corresponding to potential gene insertions extracted from the KEGG database



Table 9.6 Steady-State FBA for Mutants in Glucose and Xylose Media

Label              Ethanol Yield    Biomass Yield    Reaction Flux
                   Increase (%)     Increase (%)     (mmol/g/h)

Anaerobic: =2.4, =2.1, =0.0 (mmol/g/h)
Δgdh1 glt1 gln1    9.1              4.1              —
Δgdh1 gdh2         8.1              9.5              —
R00365             12.2             15.2             0.96
R01866             12.3             15.3             0.98
R00112             12.2             15.2             0.96
R00105             9.6              4.1              0.44
R01039             9.6              4.1              0.45
R00845             12.2             15.2             0.96
R01063             12.2             15.2             0.96
R01058             9.6              4.1              0.45
Wild type: μ = 0.073 h^-1, /( ) = 0.406 g/g

Aerobic: =2.4, =2.1, =7.84 (mmol/g/h)
Δgdh1 glt1 gln1    17.7             –7.6             —
Wild type: μ = 0.319 h^-1; /( ) = 0.111 g/g

(see Section 9.3.4). Figure 9.12 shows that DFBA revealed 15 new insertions that matched the performance of the top five insertions from [9]. The top 25 insertions could be divided into two sets according to their optimal switching time and predicted ethanol productivity. These two sets appeared in both glucose and mixed media analysis, but their relative performance was reversed such that the top five insertions in glucose media were surpassed by the set of 20 insertions in the mixed media. This result emphasizes the importance of explicitly considering the media composition when utilizing DFBA to identify mutants for metabolite overproduction.


Large-scale production of many important biochemical products is performed in batch and fed-batch bioreactors in which the assumption of balanced growth implicit in classical flux balance analysis (FBA) does not hold. Dynamic flux balance analysis (DFBA) is an extension of FBA that allows the prediction and engineering of cellular metabolism for dynamic cell culture. The core element of DFBA is a dynamic flux balance model that combines a stoichiometric cell model with dynamic mass balances on extracellular substrates and products through experimentally determined substrate uptake kinetics and the calculated growth rate. Our work has focused on the analysis and engineering of Saccharomyces cerevisiae metabolism for enhanced ethanol production in batch and fed-batch culture. We have successfully applied DFBA to the dynamic simulation of batch and fed-batch fermentation, the dynamic optimization of fed-batch operating policies, the in silico identification of ethanol overproducing mutants in dynamic cell culture, and the in silico introduction of novel metabolic capabilities for xylose consumption.
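As a compact reminder of what such a dynamic flux balance model looks like, a generic batch-culture form can be sketched as follows. The notation is assumed here for illustration and is not taken from the chapter's own equations: A is the stoichiometric matrix, v the flux vector, w the biomass weighting vector, X the biomass concentration, S a substrate concentration, P a product concentration, and the Michaelis-Menten uptake bound is one common choice of uptake kinetics.

```latex
\begin{aligned}
&\max_{v}\; \mu = w^{T} v
\quad \text{s.t.} \quad
A v = 0,\;\;
v_{\min} \le v \le v_{\max},\;\;
v_{s} \le \frac{v_{s,\max}\, S}{K_{s} + S},\\[4pt]
&\frac{dX}{dt} = \mu X,
\qquad
\frac{dS}{dt} = -v_{s} X,
\qquad
\frac{dP}{dt} = v_{p} X .
\end{aligned}
```

The inner linear program supplies the growth rate and exchange fluxes at each instant, and the outer differential equations propagate the extracellular concentrations that in turn set the uptake bound.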



Figure 9.11 Sensitivity of the ethanol productivity to the aerobic-anaerobic switching time (ts) in fed-batch culture with glucose and xylose media predicted with the iND750 stoichiometric model. The dotted line indicates the optimal switching time for the wild-type strain.

Both FBA and DFBA are based on several assumptions that have not been fully validated through experiment. The most essential and controversial assumption is that cell metabolism is regulated to maximize the cellular growth rate or a similar objective. Computational evidence supporting this hypothesis includes the ability to predict growth phenotypes of knockout mutants with 75–90% accuracy [6, 27] and the qualitative reproduction of biomass and extracellular metabolite profiles in batch cultures [13–15]. Although not presented in this chapter, we have unpublished results for S. cerevisiae that show measured gene expression data are largely captured by the maximal growth objective [48] and that dynamic flux balance models can be parameterized to produce quantitative agreement with batch and fed-batch data [45]. Despite these successes, the maximal growth hypothesis appears to be inappropriate for more complex eukaryotic cells in plants and animals and will remain controversial even for the microbes that are the current focus of study. Another key assumption is that the biomass composition remains constant under different growth conditions despite experimental evidence to the contrary [37]. We have unpublished results showing that experimentally determined variations in S. cerevisiae biomass composition can introduce errors approaching 10% in FBA and DFBA predictions [48]. An implicit assumption of flux balance analysis techniques is that metabolic engineering through gene deletions and insertions does not affect the substrate uptake rates, such that the wild-type values can be used. To our knowledge, this assumption has not been experimentally evaluated by direct comparison of wild-type and mutant metabolism.

Figure 9.12 Dynamic screening of a gene insertion library derived from the KEGG database for optimal fed-batch ethanol productivity from glucose and xylose media predicted with the iND750 stoichiometric model. Results are presented as percentage increase in the ethanol productivity relative to the wild-type strain. Insertions proposed by [9] are shown as white bars. The number to the right of each bar indicates the optimal aerobic-anaerobic switching time.

Given the availability of a suitable stoichiometric cell model and the capability to develop the necessary substrate uptake kinetics, the primary challenges associated with the application of DFBA are computational. Dynamic flux balance model simulation for prediction of batch or fed-batch culture dynamics requires simultaneous solution of the linear program (LP) for growth rate maximization and integration of the extracellular mass balance equations. We have found that the simulation problem can be efficiently and robustly solved by embedding the substrate uptake kinetics and the LP within the extracellular balance solution such that high performance integration codes can be utilized. Fed-batch culture optimization requires the solution of a much more demanding bilevel nonlinear programming problem in which the cellular objective is growth rate maximization and the engineering objective is maximal metabolite production. We have used the small-scale iGH99 stoichiometric model to develop a solution strategy based on reformulating the bilevel programming problem as a single level nonlinear program through temporal discretization of the extracellular mass balance equations and replacement of the LP with its associated first-order optimality conditions to generate complementarity constraints. Our initial attempts to implement this method with the genome-scale iND750 stoichiometric model have proven unsuccessful due to greater model complexity and increased problem size. We have used a brute force strategy involving enumeration and evaluation to screen a library of candidate gene insertions for enhanced ethanol production. Because this approach is computationally infeasible for screening large libraries and/or multiple gene insertions, extensions of existing mixed-integer linear programming methods [8, 73] that account for culture dynamics are needed. Ultimately, computational strategies that allow simultaneous optimization of the cellular design, media components, and dynamic operating policies for maximization of metabolite production in batch and fed-batch culture should be developed.
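The idea of embedding the uptake kinetics and the LP inside the extracellular balance solution can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration: the two-flux "network" (respiration and fermentation of glucose), the parameter values, the batch rather than fed-batch setting, the explicit Euler integrator, and the brute-force vertex enumeration that stands in for a real LP solver. It is not the iND750 model or the integration code used in this chapter.

```python
from itertools import combinations


def solve_lp_2d(c, A, b):
    """Maximize c.v subject to A v <= b for two variables by vertex enumeration.

    Brute-force stand-in for an LP solver; only sensible for toy problems.
    """
    best_v, best_val = None, float("-inf")
    for (a1, b1), (a2, b2) in combinations(list(zip(A, b)), 2):
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if abs(det) < 1e-12:
            continue  # parallel constraints do not define a vertex
        # Cramer's rule for the intersection of the two constraint lines
        v = ((b1 * a2[1] - a1[1] * b2) / det,
             (a1[0] * b2 - b1 * a2[0]) / det)
        feasible = all(r[0] * v[0] + r[1] * v[1] <= bi + 1e-9
                       for r, bi in zip(A, b))
        val = c[0] * v[0] + c[1] * v[1]
        if feasible and val > best_val:
            best_v, best_val = v, val
    return best_v, best_val


def simulate_batch(t_switch, t_final=12.0, dt=0.01):
    """Toy batch DFBA with an aerobic-to-anaerobic switch at t_switch (h)."""
    # Illustrative values only, not fitted to any strain.
    v_g_max, K_g = 20.0, 0.5         # glucose uptake (mmol/gdw/h), saturation (g/l)
    v_o_max = 8.0                    # aerobic oxygen uptake bound (mmol/gdw/h)
    y_resp, y_ferm = 0.09, 0.02      # biomass yields (gdw per mmol glucose)
    mw_glc, mw_etoh = 0.180, 0.046   # molecular weights (g/mmol)
    X, G, E = 0.05, 10.0, 0.0        # biomass, glucose, ethanol (g/l)

    for i in range(int(round(t_final / dt))):
        t = i * dt
        # Substrate uptake kinetics set the LP bounds at the current state.
        v_g_bound = v_g_max * G / (K_g + G)
        v_o_bound = v_o_max if t < t_switch else 0.0
        # Fluxes v = (v_resp, v_ferm): glucose routed to respiration/fermentation.
        A = [(1.0, 1.0),    # total glucose uptake <= kinetic bound
             (6.0, 0.0),    # 6 O2 per respired glucose <= oxygen bound
             (-1.0, 0.0),   # v_resp >= 0
             (0.0, -1.0)]   # v_ferm >= 0
        b = [v_g_bound, v_o_bound, 0.0, 0.0]
        (v_resp, v_ferm), mu = solve_lp_2d((y_resp, y_ferm), A, b)

        # Explicit Euler step of the extracellular mass balances.
        X_new = X + dt * mu * X
        G = max(G - dt * (v_resp + v_ferm) * mw_glc * X, 0.0)
        E = E + dt * 2.0 * v_ferm * mw_etoh * X   # 2 ethanol per fermented glucose
        X = X_new
    return X, G, E


aerobic = simulate_batch(t_switch=12.0)    # oxygen available throughout
anaerobic = simulate_batch(t_switch=0.0)   # no oxygen at any time
switched = simulate_batch(t_switch=4.0)    # aerobic growth, then anaerobic
```

In this toy setting the fully anaerobic run ends with more ethanol and the fully aerobic run with more biomass, which is the trade-off that makes the switching time a meaningful decision variable. A genome-scale model would replace `solve_lp_2d` with a proper LP solver and the Euler loop with a stiff-capable integrator.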

The main alternative to dynamic flux balance modeling is full kinetic modeling with the enzyme kinetics specified for each intracellular reaction [18, 21, 22, 74, 75]. Advantages of kinetic models include the lack of an assumed cellular objective, the possibility of including regulation at the enzyme level, and the simultaneous prediction of reaction rates and species concentrations. While kinetic models have been fruitfully utilized for the analysis and engineering of individual metabolic pathways [19, 76], the necessity of including enzyme kinetics has severely restricted their application to comprehensive metabolic modeling. Dynamic flux balance models are well suited for this purpose due to the increasing availability of genome-scale stoichiometric models and the fact that only substrate uptake kinetics are required for model construction. DFBA also offers important computational advantages due to the LP formulation of intracellular metabolism. Dynamic simulation of a hypothetical genome-scale kinetic model would require numerical integration of about 1,000 differential equations. The fed-batch and mutant optimization problems discussed in this chapter would quickly become intractable with such kinetic models. Ultimately, the two dynamic modeling approaches may be combined synergistically with full kinetic equations incorporated for well-characterized primary pathways and stoichiometric equations used for the remaining reactions.

9.5 Summary Points

• Dynamic flux balance analysis (DFBA) is an extension of classic flux balance analysis (FBA) that accounts for cell culture dynamics and allows prediction of cellular metabolism in batch and fed-batch fermentations.

• The scope of DFBA includes the dynamic simulation of batch and fed-batch cultures, the dynamic optimization of fed-batch operating policies, the in silico identification of metabolite overproducing mutants in dynamic cell culture, and the in silico introduction of novel metabolic capabilities such as the consumption of new substrates and the production of nonnative metabolic products.

• Both FBA and DFBA are based on the assumption that substrates are consumed and products are produced to maximize the cellular growth rate.

• Both FBA and DFBA require the availability of a stoichiometric cell model that allows the steady-state prediction of intracellular fluxes to the biomass constituents and metabolic products from uptake rates of the extracellularly supplied substrates.

• The dynamic flux balance model needed for DFBA is developed by combining the stoichiometric cell model with dynamic mass balances on extracellular substrates and products through experimentally determined substrate uptake kinetics and the calculated growth rate.

• As compared to alternative approaches such as enzyme kinetic models, the primary advantages of DFBA are that the increasing availability of stoichiometric cell models is fully leveraged and that very little additional information is required for model construction.

• Batch and fed-batch simulation of a dynamic flux balance model involves simultaneous solution of the linear program for growth rate maximization and integration of the extracellular mass balance equations.

• The use of DFBA for fed-batch culture optimization requires the solution of a bilevel nonlinear programming problem in which the cellular objective is growth rate maximization and the engineering objective is maximal metabolite production.

• Applications of DFBA to Saccharomyces cerevisiae demonstrate that maximization of ethanol production capabilities requires simultaneous optimization of the growth media, the metabolic engineering strategy, and the fed-batch operating policy.

Acknowledgments

Financial support for Jared L. Hjersted from the UMass Center for Process Design and Control is gratefully acknowledged. The authors acknowledge the contributions of Radhakrishnan Mahadevan (University of Toronto) to the in silico metabolic engineering work presented in Sections 9.3.4 and 9.3.5.


References

[1] Reed, J.L., I. Famili, I. Thiele, and B.O. Palsson, “Towards multidimensional genome annotation,” Nature Reviews Genetics, Vol. 7, 2006, pp. 130–141.

[2] Stephanopoulos, G.N., A.A. Aristidou, and J. Nielsen, Metabolic Engineering: Principles and Methodologies, New York: Academic Press, 1998.

[3] Sauer, U., V. Hatzimanikatis, H.P. Hohmann, M. Manneberg, A.P. van Loon, and J.E. Bailey, “Physiology and metabolic fluxes of wild-type and riboflavin-producing Bacillus subtilis,” Appl. Environ. Microbiol., Vol. 62, 1996, pp. 3687–3696.

[4] Segre, D., D. Vitkup, and G.M. Church, “Analysis of optimality in natural and perturbed metabolic networks,” Proc. Natl. Acad. Sci. USA, Vol. 99, 2002, pp. 15112–15117.

[5] Kauffman, K.J., P. Prakash, and J.S. Edwards, “Advances in metabolic flux analysis,” Curr. Opin. Biotechnol., Vol. 14, 2003, pp. 491–496.

[6] Famili, I., J. Forster, J. Nielsen, and B.O. Palsson, “Saccharomyces cerevisiae phenotypes can be predicted by using constraint-based analysis of a genome-scale reconstructed metabolic network,” Proc. Natl. Acad. Sci. USA, Vol. 100, 2003, pp. 13134–13139.

[7] Burgard, A.P., and C.D. Maranas, “Probing the performance limits of the Escherichia coli metabolic network subject to gene additions or deletions,” Biotechnol. Bioeng., Vol. 74, 2001, pp. 364–375.

[8] Pharkya, P., A.P. Burgard, and C.D. Maranas, “OptStrain: A computational framework for redesign of microbial production systems,” Genome Res., Vol. 14, 2004, pp. 2367–2376.

[9] Bro, C., B. Regenberg, J. Forster, and J. Nielsen, “In silico aided metabolic engineering of Saccharomyces cerevisiae for improved bioethanol production,” Metabolic Eng., Vol. 8, 2006, pp. 102–111.

[10] Alfenore, S., X. Cameleyre, L. Benbadis, C. Bideaux, J.-L. Uribelarra, G. Goma, C. Molina-Jouve, and S.E. Guillouet, “Aeration strategy: A need for very high ethanol performance in Saccharomyces cerevisiae fed-batch process,” Appl. Microbiol. Biotechnol., Vol. 63, 2004, pp. 537–542.

[11] Converti, A., S. Arni, S. Sato, J.C. de Carvalho, and E. Aquarone, “Simplified modeling of fed-batch alcoholic fermentation of sugarcane blackstrap molasses,” Biotechnol. Bioeng., Vol. 84, 2003, pp. 88–95.

[12] Nilssen, A., M.J. Taherzadeh, and G. Linden, “Use of dynamic step response for control of fed-batch conversion of lignocellulosic hydrolyzates to ethanol,” J. Biotechnol., Vol. 89, 2001, pp. 41–53.

[13] Varma, A., and B.O. Palsson, “Stoichiometric flux balance models quantitatively predict growth and metabolic by-product secretion in wild-type Escherichia coli,” Appl. Environ. Microbiol., Vol. 60, 1994, pp. 3724–3731.

[14] Mahadevan, R., J.S. Edwards, and F.J. Doyle III, “Dynamic flux balance analysis of diauxic growth in Escherichia coli,” Biophys. J., Vol. 83, 2002, pp. 1331–1340.

[15] Sainz, J., F. Ricardo Perez-Correa, and E. Agosin, “Modeling of yeast metabolism and process dynamics in batch fermentation,” Biotechnol. Bioeng., Vol. 81, 2003, pp. 818–828.

[16] Gadkar, K.P., F.J. Doyle III, J.S. Edwards, and R. Mahadevan, “Estimating optimal profiles of genetic alterations using constraint-based models,” Biotechnol. Bioeng., Vol. 89, 2004, pp. 243–251.

[17] Nielsen, J., and J. Villadsen, Bioreaction Engineering Principles, New York: Plenum Press, 1994.

[18] Steinmeyer, D.E., and M.L. Shuler, “Structured model for Saccharomyces cerevisiae,” Chem. Eng. Sci., Vol. 44, 1989, pp. 2017–2030.

[19] Vaseghi, S., A. Baumeister, M. Rizzi, and M. Reuss, “In vivo dynamics of the pentose phosphate pathway in Saccharomyces cerevisiae,” Metabolic Eng., Vol. 1, 1999, pp. 128–140.

[20] Hatzimanikatis, V., M. Emmerling, U. Sauer, and J.E. Bailey, “Application of mathematical tools for metabolic design of microbial ethanol production,” Biotechnol. Bioeng., Vol. 58, 1998, pp. 154–161.

[21] Jones, K.D., and D.S. Kompala, “Cybernetic modeling of the growth dynamics of Saccharomyces cerevisiae in batch and continuous cultures,” J. Biotechnol., Vol. 71, 1999, pp. 105–131.

[22] Varner, J., and D. Ramkrishna, “Metabolic engineering from a cybernetic perspective. 1. Theoretical preliminaries,” Biotechnol. Prog., Vol. 15, 1999, pp. 407–425.

[23] Covert, M.W., C.H. Schilling, and B.O. Palsson, “Regulation of gene expression in flux balance models of metabolism,” J. Theor. Biol., Vol. 213, 2001, pp. 73–88.

[24] Akesson, M., J. Forster, and J. Nielsen, “Integration of gene expression data into genome-scale metabolic models,” Metabolic Eng., Vol. 6, 2004, pp. 285–293.

[25] Palsson, B.O., Systems Biology: Properties of Reconstructed Networks, New York: Cambridge University Press, 2006.

[26] Reed, J.L., T.D. Vo, C.H. Schilling, and B.O. Palsson, “An expanded genome-scale model of Escherichia coli K-12 (iJR904 GSM/GPR),” Genome Biology, Vol. 4, 2003, pp. R54.1–R54.12.

[27] Duarte, N.C., M.J. Herrgard, and B.O. Palsson, “Reconstruction and validation of Saccharomyces cerevisiae iND750, a fully compartmentalized genome-scale metabolic model,” Genome Res., Vol. 14, 2004, pp. 1298–1309.



[28] Duarte, N.C., S.A. Becker, N. Jamshidi, I. Thiele, M.L. Mo, T.D. Vo, R. Srivas, and B.O. Palsson, “Global reconstruction of the human metabolic network based on genomic and bibliomic data,” Proc. Natl. Acad. Sci. USA, Vol. 104, 2007, pp. 1777–1782.

[29] Varma, A., and B.O. Palsson, “Metabolic capabilities of Escherichia coli. II. Optimal growth patterns,” J. Theor. Biol., Vol. 165, 1993, pp. 503–522.

[30] Nissen, T.L., U. Schulze, J. Nielsen, and J. Villadsen, “Flux distributions in anaerobic, glucose-limited continuous cultures of Saccharomyces cerevisiae,” Microbiology, Vol. 143, 1997, pp. 203–218.

[31] van Gulik, W.M., and J.J. Heijnen, “A metabolic network stoichiometry analysis for microbial growth and product formation,” Biotechnol. Bioeng., Vol. 48, 1995, pp. 681–698.

[32] Forster, J., I. Famili, P. Fu, B.O. Palsson, and J. Nielsen, “Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network,” Genome Res., Vol. 13, 2003, pp. 244–253.

[33] Schilling, C.H., M.W. Covert, I. Famili, G.M. Church, J.S. Edwards, and B.O. Palsson, “Genome-scale metabolic model of Helicobacter pylori 26695,” J. Bacteriol., Vol. 184, 2002, pp. 4582–4593.

[34] Schmidt, K., L.C. Norregaard, B. Pedersen, A. Meissner, J.O. Dussa, J. Nielsen, and J. Villadsen, “Quantification of intracellular metabolic fluxes from fractional enrichment and 13C–13C coupling constants on the isotopomer distribution in labeled biomass components,” Metabolic Eng., Vol. 1, 1999, pp. 166–179.

[35] Majewski, R.A., and M.M. Domach, “Simple constrained optimization view of acetate overflow in Escherichia coli,” Biotechnol. Bioeng., Vol. 35, 1990, pp. 732–738.

[36] Ramakrishna, R., J.S. Edwards, A. McCulloch, and B.O. Palsson, “Flux balance analysis of mitochondrial energy metabolism: Consequences of systemic stoichiometric constraints,” Am. J. Physiol. Regulatory Integrative Comp. Physiol., Vol. 280, 2001, pp. R695–R704.

[37] Lange, H.C., and J.J. Heijnen, “Statistical reconciliation of the elemental and molecular biomass composition of Saccharomyces cerevisiae,” Biotechnol. Bioeng., Vol. 75, 2001, pp. 334–344.

[38] Chvatal, V., Linear Programming, New York: W.H. Freeman and Company, 1983.

[39] Bixby, R.E., “Implementing the simplex method: The initial basis,” ORSA Journal on Computing, Vol. 4, 1992, pp. 267–284.

[40] Roos, C., T. Terlaky, and J.-Ph. Vial, Theory and Algorithms for Linear Optimization: An Interior Point Approach, New York: John Wiley and Sons, 1997.

[41] Lee, S., C. Phalakornkule, M.M. Domach, and I.E. Grossmann, “Recursive MILP model for finding all the alternate optima in LP models for metabolic networks,” Comput. Chem. Eng., Vol. 24, 2000, pp. 711–716.

[42] Mahadevan, R., and C.H. Schilling, “Effects of alternate optima on constraint-based genome-scale metabolic models,” Metabolic Eng., Vol. 5, 2003, pp. 264–276.

[43] Rufleux, P.-A., U. von Stockar, and I.W. Marison, “Measurement of volumetric (OUR) and determination of specific (qO2) oxygen uptake rates in animal cell cultures,” J. Biotechnol., Vol. 63, 1998, pp. 85–95.

[44] Gorgens, J.F., J.H. Knoetze, and W.H. van Zyl, “Reliability of methods for the determination of specific substrate consumption rates in batch culture,” Biochem. Eng. J., Vol. 25, 2005, pp. 109–112.

[45] Hjersted, J.L., and M.A. Henson, “Parameterization and validation of a Saccharomyces cerevisiae dynamic flux balance model with batch and fed-batch experiments,” in preparation.

[46] Hjersted, J.L., and M.A. Henson, “Optimization of fed-batch Saccharomyces cerevisiae fermentation using dynamic flux balance models,” Biotechnol. Prog., Vol. 22, 2006, pp. 1239–1248.

[47] Vanrolleghem, P.A., P. de Jong-Gubbels, W.M. van Gulik, J.T. Pronk, J.P. van Dijken, and S. Heijnen, “Validation of a metabolic network for Saccharomyces cerevisiae using mixed substrate studies,” Biotechnol. Prog., Vol. 12, 1996, pp. 434–448.

[48] Hjersted, J.L., and M.A. Henson, “Steady-state and dynamic flux balance analysis of ethanol production by Saccharomyces cerevisiae,” IET Systems Biology, accepted.

[49] Phalakornkule, C., S. Lee, T. Zhu, R. Koepsel, M.M. Ataai, I.E. Grossman, and M.M. Domach, “A MILP-based flux alternative generation and NMR experimental design strategy for metabolic engineering,” Metabolic Eng., Vol. 3, 2001, pp. 124–137.

[50] Duarte, N.C., B.O. Palsson, and P. Fu, “Integrated analysis of metabolic phenotypes in Saccharomyces cerevisiae,” BMC Genomics, Vol. 5, 2004, pp. 63–73.

[51] Sonnleitner, B., and O. Kappeli, “Growth of Saccharomyces cerevisiae is controlled by its limited respiratory capacity: Formulation and verification of a hypothesis,” Biotechnol. Bioeng., Vol. 28, 1986, pp. 927–937.

[52] Johnson, A., “The control of fed-batch fermentation—A survey,” Automatica, Vol. 23, 1987, pp. 691–705.

[53] Lubbert, A., and S.B. Jorgensen, “Bioreactor performance: A more scientific approach for practice,” J. Biotechnol., Vol. 85, 2001, pp. 187–212.

[54] Banga, J.R., A.A. Alonso, and R.P. Singh, “Stochastic dynamic optimization of batch and semicontinuous bioprocesses,” Biotechnol. Prog., Vol. 13, 1997, pp. 326–335.


[55] Kookos, I.K., “Optimization of batch and fed-batch bioreactors using simulated annealing,” Biotechnol. Prog., Vol. 20, 2004, pp. 1285–1288.

[56] Lee, J.-H., “Comparison of various optimization approaches for fed-batch ethanol production,” Appl. Biochem. Biotechnol., Vol. 81, 1999, pp. 91–106.

[57] Luus, R., “Application of dynamic programming to differential algebraic process systems,” Comput. Chem. Eng., Vol. 17, 1993, pp. 373–377.

[58] Vera, J., P. de Atauri, M. Cascante, and N.V. Torres, “Multicriteria optimization of biochemical systems by linear programming: Application to production of ethanol by Saccharomyces cerevisiae,” Biotechnol. Bioeng., Vol. 83, 2003, pp. 335–343.

[59] Wang, F.S., and C.S. Shyu, “Optimal feed policy for fed-batch fermentation of ethanol production by Zymomonas mobilis,” Bioproc. Eng., Vol. 17, 1997, pp. 63–68.

[60] Biegler, L.T., and I.E. Grossmann, “Retrospective on optimization,” Comput. Chem. Eng., Vol. 28, 2004, pp. 1169–1192.

[61] Biegler, L.T., A.M. Cervantes, and A. Wachter, “Advances in simultaneous strategies for dynamic process optimization,” Chem. Eng. Sci., Vol. 57, 2002, pp. 575–593.

[62] Cuthrell, J.E., and L.T. Biegler, “Simultaneous optimization and solution methods for batch reactor control problems,” Comput. Chem. Eng., Vol. 13, 1987, pp. 49–62.

[63] Raghunathan, A.U., J.R. Perez-Correa, and L.T. Biegler, “Data reconciliation and parameter estimation in flux balance analysis,” Biotechnol. Bioeng., Vol. 84, 2003, pp. 700–709.

[64] Bader, G., and U. Ascher, “A new basis implementation for a mixed order boundary value ODE solver,” SIAM J. Sci. Comp., Vol. 8, 1987, pp. 483–500.

[65] Fourer, R., D.M. Gay, and B.W. Kernighan, “A modeling language for mathematical programming,” Management Science, Vol. 36, 1990, pp. 519–554.

[66] Drud, A.S., “CONOPT—a large scale GRG code,” ORSA Journal on Computing, Vol. 6, 1994, pp. 207–216.

[67] Hjersted, J.L., M.A. Henson, and R. Mahadevan, “Genome-scale analysis of Saccharomyces cerevisiae metabolism and ethanol production in fed-batch culture,” Biotechnol. Bioeng., Vol. 97, 2007, pp. 1190–1204.

[68] Aristidou, A., and M. Penttila, “Metabolic engineering applications to renewable resource utilization,” Curr. Opin. Biotechnol., Vol. 11, 2000, pp. 187–198.

[69] Ostergaard, S., L. Olsson, and J. Nielsen, “Metabolic engineering of Saccharomyces cerevisiae,” Microbiology and Molecular Biology Reviews, Vol. 64, 2000, pp. 34–50.

[70] Jeffries, T.W., and Y.-S. Jin, “Metabolic engineering for improved fermentation of pentoses by yeasts,” Appl. Microbiol. Biotechnol., Vol. 63, 2004, pp. 495–509.

[71] Kuyper, M., M.J. Toirkens, J.A. Diderich, A.A. Winkler, J.P. van Dijken, and J.T. Pronk, “Evolutionary engineering of mixed-sugar utilization by a xylose-fermenting Saccharomyces cerevisiae strain,” FEMS Yeast Research, Vol. 5, 2005, pp. 925–934.

[72] Zaldivar, J., A. Borges, B. Johansson, H.P. Smits, S.G. Villas-Boas, J. Nielsen, and L. Olsson, “Fermentation performance and intracellular metabolite patterns in laboratory and industrial xylose-fermenting Saccharomyces cerevisiae,” Appl. Microbiol. Biotechnol., Vol. 59, 2002, pp. 436–442.

[73] Burgard, A.P., P. Pharkya, and C.D. Maranas, “OptKnock: A bilevel programming framework for identifying gene knockout strategies for microbial strain optimization,” Biotechnol. Bioeng., Vol. 84, 2003, pp. 647–657.

[74] Nielsen, J., and J. Villadsen, “Modelling of microbial kinetics,” Chem. Eng. Sci., Vol. 47, 1992, pp. 4225–4270.

[75] Rizzi, M., M. Baltes, U. Theobald, and M. Reuss, “In vivo analysis of metabolic dynamics in Saccharomyces cerevisiae: II. Mathematical model,” Biotechnol. Bioeng., Vol. 55, 1997, pp. 592–608.

[76] Rizzi, M., U. Theobald, E. Querfurth, T. Rohrhirsch, M. Baltes, and M. Reuss, “In vivo investigations of glucose transport in Saccharomyces cerevisiae,” Biotechnol. Bioeng., Vol. 49, 1996, pp. 316–327.

Related Resources and Supplementary Electronic Information

Center for Microbial Biotechnology (CMB), iFF708 genome-scale metabolic model, http://www.cmb.dtu.dk/Forskning/Software.aspx.

Kyoto Encyclopedia of Genes and Genomes (KEGG), LIGAND database of biochemical reactions for various organisms, http://www.genome.jp/kegg/.

Systems Biology Research Group, University of California at San Diego, genome-scale metabolic models for many organisms (including iND750), http://gcrg.ucsd.edu/.



CHAPTER 10

Experimental Design for Parameter Identifiability in Biological Signal Transduction Modeling

Marc R. Birtwistle¹, Boris N. Kholodenko², and Babatunde A. Ogunnaike¹

¹Department of Chemical Engineering, University of Delaware, Newark, DE 19716
²Department of Pathology, Anatomy, and Cell Biology, Thomas Jefferson University, Philadelphia, PA 19107

Key terms: Signal transduction; Experimental design; Parameter identifiability; Structural identifiability; Impact analysis

Abstract

Predicting how different stimuli elicit distinct cell fate decisions is critical for advancement of bioengineering applications such as stem cell medicine and requires understanding the quantitative, dynamic behavior of cellular signal transduction systems. Mathematical modeling has emerged as a useful tool for obtaining such understanding; however, typical signal transduction models are extremely complex, containing hundreds of nonlinear ordinary differential equations and an even larger number of unknown parameters that must be estimated from experimental data. The sheer size of these models makes it computationally impractical to apply traditional experimental design methods in determining appropriate experimental strategies for estimating the model parameters accurately and precisely. In this chapter, we describe a computationally inexpensive, iterative experimental design procedure that allows one to determine how to perturb the system, what to measure, and when to measure it such that the unknown signal transduction model parameters can be identified to specified tolerances.

10.1 Introduction

Experimental design, in the broadest sense, is a methodology for generating experimental protocols that will maximize the information content in the experimental data sets produced thereby. Whether stated explicitly or not, such statistical experimental design strategies are based on assumed mathematical models of the systems of interest. Standard, “alphabetic” optimal experimental designs (A-optimal, D-optimal, E-optimal, and so forth) result from optimizing an appropriate norm of the Fisher Information Matrix (FIM) [1–3]. These standard optimal designs work well for systems and models with few parameters and experimental design variables [4–8]. However, the FIM is a function of the n-by-p parameter sensitivity matrix, where n is the number of data points and p is the number of parameters, so even modestly sized models consisting of ~10 parameters and experimental decision variables pose a significant computational challenge. For biological signal transduction models that can contain hundreds of parameters and hundreds of experimental decision variables, applying standard experimental design methodology is computationally impractical. In this chapter, we present an experimental design methodology developed specifically for application to signal transduction models with a large number of unknown parameters and experimental decision variables. With modest computational requirements, the technique allows one to determine how to perturb the system, what to measure, and when to measure it such that the unknown model parameters are identifiable (i.e., determinable to within specified precision tolerances). First, we discuss the basic model structure under consideration, present an overview of parameter estimation, and define the parameter identifiability metrics that form the basis of the proposed experimental design procedure before presenting the procedure itself.
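To make the "alphabetic" criteria concrete, the following minimal sketch scores two candidate designs. It assumes unit measurement covariance, so that the FIM reduces to Z^T Z, and the two sensitivity matrices are hypothetical numbers chosen only to contrast a well-spread design with a nearly collinear one. The usual conventions are used: A-optimality minimizes the trace of the inverse FIM, D-optimality maximizes its determinant, and E-optimality maximizes its smallest eigenvalue.

```python
import numpy as np


def alphabetic_criteria(Z):
    """Return A-, D-, and E-optimality scores for an n-by-p sensitivity matrix Z.

    Assumes unit measurement covariance, so the FIM is Z^T Z.
    """
    fim = Z.T @ Z
    eigvals = np.linalg.eigvalsh(fim)        # ascending order; FIM is symmetric
    return {
        "A": float(np.sum(1.0 / eigvals)),   # trace of FIM^-1 (smaller is better)
        "D": float(np.prod(eigvals)),        # determinant of FIM (larger is better)
        "E": float(eigvals[0]),              # smallest eigenvalue (larger is better)
    }


# Two hypothetical three-point designs for a two-parameter model.
Z_spread = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Z_clustered = np.array([[1.0, 0.9], [1.0, 1.0], [1.0, 1.1]])

scores_spread = alphabetic_criteria(Z_spread)
scores_clustered = alphabetic_criteria(Z_clustered)
```

All three criteria prefer the spread design here. The cost that the text refers to comes from the fact that each candidate design requires a new sensitivity matrix, which for an ODE model means additional model simulations for every parameter.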

10.1.1 Model structure

We consider nonlinear, ordinary differential equation (ODE) models of the form

$$\frac{dx}{dt} = f\big(x(t), u(t), \Theta\big), \qquad x(t = t_0) = x_0(u = 0, \Theta) \tag{10.1}$$

where x is an s-dimensional vector of model states, t is time, Θ is a p-dimensional vector of unknown parameters, and u is a c-dimensional vector of inputs. The n-dimensional vector of experimental observations (data), Y, consists of a true but unknown value $\bar{Y}$ and $\varepsilon_e$, an n-dimensional vector of experimental errors; that is,

$$Y = \bar{Y} + \varepsilon_e \tag{10.2}$$

$\hat{Y}$, the vector of model predictions of $\bar{Y}$, is a function of the states x and the $n_t$-dimensional vector of sampling times, $t_s$; that is,

$$\hat{Y} = g(t_s, x) \tag{10.3}$$

It is related to the vector of true values $\bar{Y}$ by

$$\bar{Y} = \hat{Y} + \varepsilon_m \tag{10.4}$$

where $\varepsilon_m$ is an n-dimensional vector of unknown model mismatch errors. Combining (10.2) and (10.4) gives

$$Y = \hat{Y} + \varepsilon_e + \varepsilon_m \tag{10.5}$$

showing that both experimental and model mismatch errors contribute to observed discrepancies between model predictions and experimental data. When experimental errors dominate model mismatch errors, we have

$$Y \approx \hat{Y} + \varepsilon_e \tag{10.6}$$

Thus, under such conditions in which structural model uncertainties are not as significant as measurement noise, the residual vector of differences between model predictions and experimental data, defined as

$$e = Y - \hat{Y} \tag{10.7}$$

will be a reasonable realization of $\varepsilon_e$. Now, given an error model for $\varepsilon_e$, one can test whether e is a valid realization of $\varepsilon_e$, and hence establish the validity of (10.6) and the adequacy of the proposed model for describing the experimental data. This is a standard assumption in classical statistical modeling, and the parameter identifiability metrics and experimental design techniques proposed below are also predicated on the validity of (10.6). Methods for performing such model adequacy tests are widely available [1, 9, 10], of course, but discussing them is outside the scope of this chapter.

10.1.2 Parameter estimation

Given the model described above, the objective in parameter estimation is to determine a set of parameters $\hat{\Theta}$ such that the model predictions $\hat{Y}$ are the “best” possible representation of the experimental observations, $Y$. Of the several criteria by which parameter sets are judged to be “best,” the two most widely used are the maximum likelihood (ML) and weighted least squares (LS) criteria [11]. With weighted LS, the objective function to be minimized is

$$\phi = e^T W e \tag{10.8}$$

where $\phi$ is the weighted sum of squared residuals (or the weighted residual norm) and $W$ is an $n$-by-$n$ weighting matrix.

When the experimental errors follow a zero-mean multivariate normal distribution,

$$\varepsilon_e \sim N(0, V_Y) \tag{10.9}$$

where $V_Y$ is the error covariance matrix, setting $W$ equal to $V_Y^{-1}$ and minimizing the weighted sum of squared residuals leads to minimum variance parameter estimates $\hat{\Theta}$ [1],


$$\min_{\Theta}\left(\phi = e^T V_Y^{-1} e\right) \rightarrow \Theta = \hat{\Theta} \tag{10.10}$$

In particular, when $V_Y$ is diagonal, the LS criterion is equivalent to the ML criterion [1]. To employ this objective function in practice, $V_Y$ must be known, and it can be estimated from the data $Y$ given a sufficient number of replicates.
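As a minimal sketch of the objective in (10.8) and (10.10), the following assumes a diagonal $V_Y$, so that the weighting reduces to dividing each squared residual by the variance of that measurement. The function name and the numbers are illustrative and are not taken from any study.

```python
def weighted_sse(y_obs, y_pred, variances):
    """Return phi = e^T V_Y^{-1} e, assuming V_Y is diagonal.

    For a diagonal V_Y, the weighting matrix W = V_Y^{-1} simply divides
    each squared residual by the variance of that measurement.
    """
    residuals = [yo - yp for yo, yp in zip(y_obs, y_pred)]
    return sum(e * e / v for e, v in zip(residuals, variances))


# Hypothetical numbers: two measurements with variances 4 and 0.2.
y_obs = [10.0, 0.9]
y_pred = [8.0, 1.0]
variances = [4.0, 0.2]
phi = weighted_sse(y_obs, y_pred, variances)
print(round(phi, 6))  # 1.05
```

The noisier first measurement contributes 1.0 to the objective despite its residual of 2, while the second contributes 0.05 from a residual of 0.1.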

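Given (10.9) and (10.10), the parameter covariance matrix, $V_\Theta$, can be approximated by [1]

$$V_\Theta \approx \left(Z^T V_Y^{-1} Z\right)^{-1} \tag{10.11}$$

where $Z$ is the parameter sensitivity matrix, defined by

$$Z_{ij} \equiv -\left.\frac{\partial e_i}{\partial \Theta_j}\right|_{\Theta=\hat{\Theta}} = \left.\frac{\partial \hat{Y}_i}{\partial \Theta_j}\right|_{\Theta=\hat{\Theta}} \tag{10.12}$$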

The approximation in (10.11) becomes more accurate as the variance of the measurements and the residual norm decrease [1, 12]. The inverse of the parameter covariance matrix, $Z^T V_Y^{-1} Z$, is commonly referred to as the FIM.

It is quite common for the parameter sensitivity matrix to be ill-conditioned, which impairs one's ability to obtain reasonable values for $V_\Theta$. To avoid this ill-conditioning, use of scaled quantities (denoted by ∼) is preferred:

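$$\begin{aligned}
\tilde{\Theta}_j &\equiv \frac{\Theta_j}{\hat{\Theta}_j}; \quad
\tilde{Y} \equiv V_Y^{-1/2}\, Y; \quad
\tilde{\hat{Y}} \equiv V_Y^{-1/2}\, \hat{Y}; \\
\tilde{e} &\equiv \tilde{Y} - \tilde{\hat{Y}}; \quad
\tilde{Z}_{ij} \equiv -\frac{\partial \tilde{e}_i}{\partial \tilde{\Theta}_j} = \frac{\partial \tilde{\hat{Y}}_i}{\partial \tilde{\Theta}_j}
\end{aligned} \tag{10.13}$$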

All the experimental data $Y$ are scaled by the inverse of the matrix square root of $V_Y$, which has two desirable effects: (1) the scaled measurement units are dimensionless, and (2) the scaled measurement covariance matrix is the identity matrix,

$$\tilde{V}_Y \equiv \operatorname{var}(\tilde{Y}) = \operatorname{var}\!\left(V_Y^{-1/2}\, Y\right) = V_Y^{-1/2}\, V_Y \left(V_Y^{-1/2}\right)^T = I \tag{10.14}$$

All parameters $\Theta$ are scaled by their best-fit values, which nondimensionalizes the parameter units and scales them all to order one. The resulting scaled parameter covariance matrix is given by

$$\tilde{V}_\Theta \approx \left(\tilde{Z}^T \tilde{V}_Y^{-1} \tilde{Z}\right)^{-1} = \left(\tilde{Z}^T \tilde{Z}\right)^{-1} \tag{10.15}$$

10.1.3 Identifiability metrics and conditions

Minimization of the objective function in (10.10) leads to a best-fit response and the parameter set $\hat{\Theta}$, with covariance matrix $V_\Theta$. In this section we propose metrics and conditions for identifiability that can be used to assess the quality of these parameter estimates. We consider two main classes of identifiability:



1. Structural Identifiability. A model $M(\Theta, Y, x)$ is structurally identifiable if the elements of the parameter set $\Theta$ can be uniquely estimated from noise-free measurements $Y$.

2. Parameter Identifiability. A parameter set $\Theta$ of a model $M(\Theta, Y, x)$ is identifiable if it can be estimated to within a specified precision from an experimental data set $Y$.

These definitions are general in that they can be considered globally (with respect to the entire parameter space) or locally (in a neighborhood around $\hat{\Theta}$). Because we are dealing with nonlinear systems, global identifiability may differ from local identifiability. However, evaluation of global identifiability requires a combinatorial search over the entire parameter space, which is computationally intensive. As the main purpose of this work is to reduce the computational requirements for experimental design, we focus here on local identifiability metrics. We note, however, that any identifiability metric, local or global, is compatible with our experimental design procedure.

10.1.3.1 Local structural identifiability

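Consider a first-order Taylor series expansion of the model predictions around the best-fit parameter values,

$$\Delta\tilde{\hat{Y}} = \left.\tilde{Z}\right|_{\tilde{\Theta}=\tilde{\hat{\Theta}}} \Delta\tilde{\Theta} \tag{10.16}$$

where $\Delta$ denotes a difference from the vector value when $\Theta = \hat{\Theta}$. Solving (10.16) for $\Delta\tilde{\Theta}$ gives

$$\Delta\tilde{\Theta} = \left(\tilde{Z}^T \tilde{Z}\right)^{-1} \tilde{Z}^T \Delta\tilde{\hat{Y}} \tag{10.17}$$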

This equation shows that to find a unique local solution for $\Delta\tilde{\Theta}$, $\tilde{Z}^T\tilde{Z}$ must be invertible, which requires the parameter sensitivity matrix $\tilde{Z}$ to have full column rank. A condition for local structural identifiability is therefore given by

$$\operatorname{rank}(\tilde{Z}) = p \tag{10.18}$$

If the rank of $\tilde{Z}$ is less than $p$, then the $p - \operatorname{rank}(\tilde{Z})$ parameters that have no independent effect on the observables must be held constant in the current estimation problem.
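The rank condition in (10.18) can be illustrated with a small dependency-free sketch. Numerical libraries usually compute rank from a singular value decomposition; Gaussian elimination is used here only to keep the example self-contained, and the matrix and tolerance are hypothetical.

```python
def matrix_rank(rows, tol=1e-10):
    """Rank by Gaussian elimination with partial pivoting (illustrative only)."""
    a = [list(map(float, r)) for r in rows]
    n_rows, n_cols = len(a), len(a[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) <= tol:
            continue  # no usable pivot: column depends on earlier columns
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, n_rows):
            factor = a[r][col] / a[rank][col]
            for c in range(col, n_cols):
                a[r][c] -= factor * a[rank][c]
        rank += 1
    return rank


# Hypothetical scaled sensitivity matrix: 4 measurements, p = 3 parameters.
# The third column is the sum of the first two, so one parameter has no
# independent effect on the observables.
z_scaled = [[1, 0, 1],
            [0, 1, 1],
            [2, 1, 3],
            [1, 1, 2]]
p = 3
rank = matrix_rank(z_scaled)
n_to_fix = p - rank  # number of parameters to hold constant
```

Here the rank is 2, so one of the three parameters would have to be fixed before estimation.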

10.1.3.2 Local parameter identifiability

Given the error model defined by (10.6) and (10.9), a $(1-\alpha)$-level confidence interval for parameter $\tilde{\Theta}_j$ is given by

$$\delta_j = t^{\alpha}_{n-p} \sqrt{\frac{\tilde{e}^T \tilde{e}}{n-p}\, \tilde{V}_{\Theta,jj}} \tag{10.19}$$


where $t^{\alpha}_{n-p}$ is a two-tailed $t$-distribution statistic evaluated with $n-p$ degrees of freedom at confidence level $(1-\alpha)$, and $\delta_j$ is the confidence interval ($\tilde{\Theta}_j = 1 \pm \delta_j$) [13]. Let $\tilde{\kappa}_j$ be the specified precision for parameter $\tilde{\Theta}_j$, such that we desire $\tilde{\Theta}_j = 1 \pm \tilde{\kappa}_j$. A parameter is identifiable if

$$\delta_j \le \tilde{\kappa}_j \tag{10.20}$$

This equation states that a parameter is identifiable if its confidence interval is no larger than the specified parameter tolerance.

10.1.4 Overview of the experimental design procedure

As explained previously, application of conventional experimental design techniques to large signal transduction modeling problems is computationally impractical. As a solution to this problem, we propose an iterative procedure for generating an experimental design whose implementation will yield an experimental data set $Y$ that can be used to identify the model parameter values to specified precisions. While our procedure does not guarantee a unique design, it does guarantee an adequate design that is experimentally feasible to implement.

The proposed procedure for experimental design is illustrated as a flowchart in Figure 10.1, and is described in detail in the methods section (Section 10.2). Over the course of this procedure, an initial experimental design, which is a comprehensive design that encompasses the feasible ranges of all perturbations and all measurements, is trimmed down to the implemented design, which contains only essential experiments needed for parameter identifiability. Parameter identifiability is tested several times during the experimental design process to determine whether more or fewer experiments are needed. Impact analysis, which identifies experiments that are the most valuable for parameter identifiability, is central to the process, as the impact analysis results are used to determine the implemented design.

Figure 10.1 Experimental design for parameter identifiability procedure. See Section 10.2 for a detailed description of the procedure. [Flowchart; box labels: Initial Perturbation and Measurement Design; Identifiability Analysis; Identifiable? (Yes/No); Design Modification; Impact Analysis; Design Reduction; Identifiability Analysis; Identifiable? (Yes/No); Design Modification; Design Implementation; Resources and other constraints.]

Before going into the details of the experimental design procedure in the methods section (Section 10.2), it is useful first to list the characteristics of signal transduction models that dictate what can be done experimentally. In terms of traditional experimental design nomenclature, we must first identify the factors (what perturbations can be made) and the responses (what can be measured). Typical factors and responses are shown in Table 10.1, although this list may change slightly from model to model. Factors and responses can be either continuous or categorical. If a quantity is continuous, it can take any real number value within a particular range. If a quantity is categorical, it can only take a finite number of discrete values.

10.2 Methods

Here, we describe in detail how each step of the experimental design process shown in Figure 10.1 is performed. First, the purpose of each step and how to implement it are described. This description is followed by a simple, numbered list of tasks required to complete each step.

10.2.1 Initial perturbation and measurement design

10.2.1.1 Purpose and implementation

The initial design is intended to explore the experimental design variable space comprehensively, while later steps in the design process identify subsets of the initial design that yield the most informative experiments. Thus, the initial designs are made to be as large as possible, only to be trimmed later.

The initial perturbation design is determined using a factorial design, which requires setting levels for each factor and then permuting over combinations of levels for all the factors [2]. When feasible we recommend using a full factorial design, but it is recognized that in some instances, when there are multiple ligands and/or siRNA targets, this may not be reasonable. For such cases fractional factorial designs are recommended. To set the levels for each factor, first the feasible ranges/values for each factor are determined. The levels for categorical factors are set to each discrete value the variable can take. For continuous factors that span several orders of magnitude (e.g., ligand concentrations), we recommend that levels be determined using logarithmic spacing within the feasible range. For continuous factors that have narrower feasible ranges (e.g., pulse widths), we recommend that levels be determined using uniform spacing within the feasible range. Although the number of levels for both uniform and logarithmic spacing will depend on experimental and computational considerations particular to each model, three levels are recommended as a minimum, as this will allow detection of potentially nonlinear relationships between the factors/responses and their impact. Impact is a quantification of how informative a particular measurement is; high impact means that a particular measurement greatly reduces parameter covariances. Impact metrics are defined in a subsequent section.

Table 10.1 Classes of Factors and Responses for Typical Signal Transduction Model

Class | Name | Type | Description
Factors | Ligand type | Categorical | The different ligands that can be used to perturb the system of interest
Factors | Ligand input sequence | Continuous | The dynamic profile of each ligand's concentration. In this chapter we consider rectangular pulses, which are characterized by a magnitude and a duration
Factors | siRNA^1 | Categorical | Used to knock down the level of a particular protein
Factors | Inhibitor type | Categorical | The different pharmaceutical inhibitors that target species in the system
Factors | Inhibitor concentration | Continuous | The concentration of each pharmaceutical inhibitor
Responses | Total protein abundance | Continuous | Total amount of a particular protein in the system of interest
Responses | Post-translational modification | Continuous | Amount of signaling-related post-translational modification of proteins in the system of interest. Examples include phospho-threonine, phospho-serine, phospho-tyrosine, and ubiquitylation, but there may be many more or fewer depending on the system
Responses | Protein-protein association | Continuous | Amount of signaling-related protein-protein association in the system of interest

1 Although in principle the amount of siRNA-mediated protein knock-down can be adjusted, typical applications use the “all-or-nothing” approach where the protein is knocked down as much as possible. It is in this sense that siRNA is considered categorical, but it can be made continuous if desired.

To determine the initial measurement design, first the measurement technology(ies) is selected. This selection is based both on what technology(ies) is available and on the responses of the system of interest. Given the measurement technology(ies), a set of feasible measurements is constructed, and a sampling frequency upper bound(s) is determined. The initial measurement design consists of making all feasible measurements at the sampling frequency upper bound, in response to every condition in the initial perturbation design.

10.2.1.2 Procedure

1. Identify the factors and set their levels.

2. Identify the available measurement technology(ies).

3. Construct the set of feasible measurements.

4. Determine the sampling frequency upper bound(s).

5. Use a factorial design to construct the initial perturbation and measurement designs.
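The procedure above can be sketched as follows for a full factorial design. All factor names, ranges, responses, and sampling times are hypothetical placeholders; only the spacing rules (logarithmic for wide ranges, uniform for narrow ranges, three levels as a minimum) come from the text.

```python
import itertools
import math


def log_levels(low, high, n):
    """n logarithmically spaced levels between low and high (inclusive)."""
    step = (math.log10(high) - math.log10(low)) / (n - 1)
    return [10 ** (math.log10(low) + i * step) for i in range(n)]


def linear_levels(low, high, n):
    """n uniformly spaced levels between low and high (inclusive)."""
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]


# Step 1: factors and their levels (hypothetical values).
factors = {
    "ligand": ["ligand_A", "ligand_B"],              # categorical
    "pulse_magnitude": log_levels(0.01, 100.0, 3),   # spans orders of magnitude
    "pulse_duration": linear_levels(5.0, 60.0, 3),   # narrower range
    "sirna": ["none", "target_1"],                   # categorical
}

# Steps 2-4: feasible measurements and sampling times (hypothetical).
responses = ["total_protein", "phospho_protein"]
sample_times = [0, 10, 20]

# Step 5: full factorial perturbation design, then every feasible
# measurement at every sampling time for every condition.
names = list(factors)
perturbation_design = [dict(zip(names, combo))
                       for combo in itertools.product(*factors.values())]
measurement_design = list(itertools.product(range(len(perturbation_design)),
                                            responses, sample_times))
```

With these placeholder levels the perturbation design has 2 x 3 x 3 x 2 = 36 conditions and the measurement design has 216 entries, which illustrates how quickly a full factorial design grows and why fractional designs may be needed.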

10.2.2 Identifiability analysis

10.2.2.1 Purpose and implementation


Identifiability analysis is performed at several steps of the design procedure. Its purpose is to determine whether the parameter values are identifiable given: (1) a potential set of experimental data, (2) current parameter values, and (3) parameter tolerances.

The first step in identifiability analysis is to test structural identifiability. To do this, the design (initial or reduced, depending on the stage of the process) is implemented in silico to calculate the model predictions of the experimental data, $\hat{Y}$, and the parameter sensitivity matrix $Z$. The next step is to calculate the scaled parameter sensitivity matrix $\tilde{Z}$; however, to calculate $\tilde{Z}$ one must know the experimental data covariance matrix, $V_Y$, which will not be known yet because the experiments have not been performed. To solve this problem, a conservative (large) estimate for $V_Y$ based on typical values for the measurement technology should be used. Finally, the rank of $\tilde{Z}$ is calculated and compared to the total number of parameters to evaluate whether the model is structurally identifiable.



If the model is not structurally identifiable, then parameters having no independent effects on $\hat{Y}$ should be held constant. In many cases, such parameters are easily identified as they correspond to columns of $Z$ that contain all zeros. In some cases, however, there are no columns in $Z$ containing only zeros, yet the model is not structurally identifiable. In these cases, QR decomposition (MATLAB function qr) of $Z$ can be used to identify the problematic parameters. Parameters corresponding to zero-valued diagonal elements of the resulting upper triangular matrix $R$ are the problematic parameters that should be fixed. Note that the “economy size” QR decomposition (see the MATLAB help file for the function qr), which is computationally less expensive than the full QR decomposition, is sufficient for these purposes.

If the model is structurally identifiable, then parameter identifiability is tested. This requires calculation of the parameter covariance matrix, $V_\Theta$, and the parameter confidence intervals, $\delta$, and specification of the parameter tolerances, $\tilde{\kappa}$. To calculate confidence intervals, one must calculate the confidence interval “pre-factor” $\beta$, where

$$\beta \equiv t^{\alpha}_{n-p} \sqrt{\frac{\tilde{e}^T \tilde{e}}{n-p}} \tag{10.21}$$

However, because the experiments have not yet been carried out, both the residuals, $\tilde{e}$, and the number of measurements, $n$, are unknown, and it is not possible to calculate $\beta$. To address this issue, below we derive an approximation for the expected value of $\beta$ that can be used for identifiability analysis during the experimental design process.

The number of measurements must at least be equal to the number of parameters, and the measurements must be replicated at least three times to estimate $V_Y$. Under these conditions, $n = 3p$, which gives

$$\beta = t^{\alpha}_{3p-p} \sqrt{\frac{\tilde{e}^T \tilde{e}}{3p-p}} = t^{\alpha}_{2p} \sqrt{\frac{\tilde{e}^T \tilde{e}}{2p}} \tag{10.22}$$

The sum of squared residuals, $\tilde{e}^T \tilde{e}$, follows a $\chi$-squared distribution with $3p$ degrees of freedom [1], and therefore has an expected value of $3p$, giving

$$\beta = t^{\alpha}_{2p} \sqrt{\frac{3p}{2p}} = t^{\alpha}_{2p} \sqrt{\frac{3}{2}} \tag{10.23}$$

Because we are dealing with models of high parameter dimension, the number of degrees of freedom for evaluating the $t$-distribution statistic, $2p$, will be high, such that $t$ is near its asymptotic value. Taking a standard 95% confidence level ($\alpha = 0.05$) gives

$$\beta = 1.65 \sqrt{\frac{3}{2}} \approx 2 \tag{10.24}$$

Thus, one can approximate $\beta = 2$ for the purpose of testing parameter identifiability in the experimental design process.

Finally, parameter identifiability is tested by comparing the calculated confidence intervals to the parameter tolerances.
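The arithmetic behind (10.23) and (10.24) can be checked with a short sketch. The generalization to an arbitrary number of replicates is an assumption added here for illustration; the text considers only three replicates. The value 1.65 is taken from the text and is close to the 0.95 quantile of the standard normal distribution, which the $t$-distribution approaches for many degrees of freedom.

```python
from math import sqrt
from statistics import NormalDist


def beta_prefactor(t_stat, replicates=3):
    """Expected pre-factor when n = replicates * p and E[e~^T e~] = n.

    With n = r*p measurements, n - p = (r - 1)*p, so the ratio under the
    square root in (10.21) becomes r / (r - 1); r = 3 gives sqrt(3/2).
    """
    return t_stat * sqrt(replicates / (replicates - 1))


beta_text = beta_prefactor(1.65)               # value used in (10.24)
t_asymptotic = NormalDist().inv_cdf(0.95)      # about 1.645
beta_asymptotic = beta_prefactor(t_asymptotic)
print(round(beta_text, 2))  # 2.02
```

Both values round to 2.0, consistent with the approximation $\beta = 2$.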


10.2.2.2 Procedure

1. Simulate implementation of the proposed experimental design to calculate $\hat{Y}$ and $Z$.

2. Propose a conservative approximation for the experimental data covariance matrix $V_Y$ based on known aspects of the measurement technology.

3. Calculate the scaled parameter sensitivity matrix $\tilde{Z}$.

4. Calculate the rank of $\tilde{Z}$ to test for structural identifiability.

5. If the model is not structurally identifiable, fix parameters that have no independent effects on the observables.

6. Calculate the approximate parameter covariance matrix $V_\Theta$.

7. Calculate the parameter confidence intervals.

8. Specify the parameter tolerances $\tilde{\kappa}$.

9. Test for parameter identifiability.

10.2.3 Impact analysis

10.2.3.1 Purpose and implementation


Impact analysis is the hub of the experimental design procedure, and it is used to determine which experiments have the greatest impact on the parameter variances (i.e., which experiments are the most informative). We propose three distinct impact metrics for this task: absolute sensitivity coefficients, net impacts, and importance coefficients. These metrics give slightly different measures of impact, and which is most useful is situation dependent. Here, we only define these metrics; the results of the case studies provide insight into the pros and cons of choosing experiments based on these different metrics.

Absolute sensitivity coefficients are simply the absolute values of the elements of $\tilde{Z}$,

$$s_{ij} = \left|\tilde{Z}_{ij}\right| \tag{10.25}$$

where $s_{ij}$ is the absolute sensitivity coefficient for potential measurement $i$ and parameter $j$. A high absolute sensitivity coefficient means that the simulated value of measurement $i$ is greatly affected by parameter $j$. Thus, even small changes in parameter $j$ cause large changes in residual $i$. In that sense, measurements with high absolute sensitivities “lock down” parameter values and therefore have high impact.

The net impact and importance coefficient are both based on singular value decomposition of the parameter sensitivity matrix,

$$\tilde{Z} = S \Sigma R^T \tag{10.26}$$

where $S$ and $R$ are unitary matrices and $\Sigma$ is a diagonal matrix of singular values. Substitution of (10.26) into the first-order Taylor series expansion in (10.16) yields

$$\Delta\tilde{\hat{Y}} = S \Sigma R^T \Delta\tilde{\Theta} \tag{10.27}$$

and rearrangement gives



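$$\Delta\tilde{\hat{Y}} = \Omega \Psi; \quad \Psi \equiv R^T \Delta\tilde{\Theta}; \quad \Omega \equiv S\Sigma = \begin{bmatrix} \Omega_1 \\ \vdots \\ \Omega_n \end{bmatrix} \tag{10.28}$$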

where $\Psi$ is an orthogonal “eigenparameter” set, and each row vector $\Omega_i$ composing $\Omega$ gives the strengths with which measurement $i$ affects all the eigenparameter directions. We define the net impact of measurement $i$ as

$$\rho_i \equiv \lVert \Omega_i \rVert^2 = \sum_{j=1}^{p} \left(S_{ij} \Sigma_{jj}\right)^2 \tag{10.29}$$

The net impact of a particular measurement $i$ will be large if it is a major component of high singular value directions. Since the squares of the singular values of $\tilde{Z}$ are equal to the eigenvalues of $\tilde{Z}^T\tilde{Z}$, and the largest eigenvalues of $\tilde{Z}^T\tilde{Z}$ (the FIM) denote the eigenparameter directions that have the smallest variance, high net impact measurements significantly reduce parameter variances.

Now consider a slightly different rearrangement of (10.27),

$$S^T \Delta\tilde{\hat{Y}} = \Sigma \Psi \tag{10.30}$$

The RHS of (10.30) gives the singular value-weighted orthogonal parameter directions, while the LHS of (10.30) describes how each measurement contributes to each singular value-weighted parameter direction. We define the importance coefficient of measurement $i$ for eigenparameter $j$ as

$$\omega_{ij} \equiv \left|S^T_{ji}\right| = \left|S_{ij}\right| \tag{10.31}$$

The importance coefficient measures how much a particular measurement $i$ matters for determining the eigenparameter $j$. As $S$ is a unitary matrix, the norm of each column is equal to 1, and therefore any importance coefficient will be between 0 and 1. Importance coefficients closer to 1 denote higher impact.

These three impact metrics are calculated for each potential measurement in a design. To analyze these impact data, two different methods can be used: rank analysis and main effects analysis. Again, pros and cons of using these different analysis methods will be illustrated in the case studies; here we only provide basics on how to perform the analyses.

In rank analysis, the potential measurements are ordered according to the metric of interest. If the impact metric is the net impact, ranking is straightforward since each potential measurement is described by a single metric. However, if the impact metric is the absolute sensitivity coefficient or the importance coefficient, each experiment/parameter combination has an impact measure, and therefore a single experiment does not have a unique impact. This apparent difficulty, however, actually gives a beneficial flexibility because parameter-specific information can be incorporated into the ranking. To do this, an experiment rank vector for each parameter is constructed, which results in $p$ different ranked impact metric vectors, $\mu_j$ (each a vector of absolute sensitivity coefficients or importance coefficients). Then, the top experiments from every sorted vector are chosen until a fixed fraction $\phi_j$ of each vector's norm is accounted for, such that

$$\frac{\lVert \mu_{cj} \rVert}{\lVert \mu_j \rVert} \ge \phi_j \quad \forall\, j = 1, \ldots, p \tag{10.32}$$

where $j$ denotes a parameter index and $\mu_{cj}$ denotes the impact metric vector for the chosen experiments. By choosing experiments in this way, the dimension of each $\mu_{cj}$ can be different, and therefore more experiments can be allocated to model parameters that are difficult to identify.
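The selection rule in (10.32) can be sketched as below. The text does not specify which vector norm is used; the Euclidean norm is assumed here, and the impact values and function name are hypothetical.

```python
from math import sqrt


def choose_experiments(impact, fractions):
    """Select experiments so that (10.32) holds for every parameter.

    impact[i][j]: impact metric of experiment i for parameter j.
    fractions[j]: required fraction phi_j of the norm of column j.
    Returns the sorted union of the experiments chosen for each parameter.
    """
    n, p = len(impact), len(fractions)
    chosen = set()
    for j in range(p):
        column = [impact[i][j] for i in range(n)]
        total = sqrt(sum(v * v for v in column))  # Euclidean norm (assumed)
        order = sorted(range(n), key=lambda i: column[i], reverse=True)
        partial = 0.0  # squared norm of the experiments chosen for parameter j
        for i in order:
            if total == 0.0 or sqrt(partial) / total >= fractions[j]:
                break
            chosen.add(i)
            partial += column[i] ** 2
    return sorted(chosen)


# Hypothetical impact metrics: 4 candidate experiments, 2 parameters.
impact = [[0.90, 0.10],
          [0.30, 0.20],
          [0.10, 0.80],
          [0.05, 0.50]]
selected = choose_experiments(impact, fractions=[0.9, 0.9])
print(selected)  # [0, 2, 3]
```

Parameter 1 is covered by a single experiment, whereas parameter 2 needs two, which shows how the rule allocates more experiments to parameters that are harder to identify.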

In main effects analysis, the means of the metric of interest for different classes of factors and responses, or main effects, are calculated and then analyzed to identify generally informative experiment characteristics. Factors and responses having the largest main effects have the highest impact, and experiments containing these high-impact factors and responses should be chosen first. There are many possible ways to analyze the main effects; the case studies present some examples.

10.2.3.2 Procedure

1. Calculate the absolute sensitivity coefficients, net impacts, and importance coefficients.

2. Perform rank analysis.

3. Calculate the main effects of each factor and response for each impact metric.

4. Perform main effects analysis.
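Steps 3 and 4 amount to averaging an impact metric over each level of a factor or response. A minimal sketch, with hypothetical impact values and record layout, might look like this:

```python
from collections import defaultdict
from statistics import mean


def main_effects(records, factor, metric="impact"):
    """Mean of the impact metric for each level of one factor or response."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[factor]].append(rec[metric])
    return {level: mean(values) for level, values in groups.items()}


# Hypothetical impact values for four candidate measurements.
records = [
    {"response": "pSTAT", "time": 10, "impact": 0.8},
    {"response": "pSTAT", "time": 50, "impact": 1.2},
    {"response": "tSTAT", "time": 10, "impact": 0.2},
    {"response": "tSTAT", "time": 50, "impact": 0.4},
]
by_response = main_effects(records, "response")
by_time = main_effects(records, "time")
best_response = max(by_response, key=by_response.get)
```

In this made-up data set the response with the largest main effect would be measured first, and later time points would be preferred over early ones.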

10.2.4 Design modification and reduction

10.2.4.1 Purpose and implementation


During the experimental design process, the results of identifiability and impact analysis are used to modify and/or reduce the design. Although this part of the design process is highly situational and model dependent, and is best illustrated through the case studies presented below, there are some generalities that can be discussed regarding the initial design.

It is entirely possible that parameter identifiability issues will arise with the initial design, and there are two options for dealing with such a scenario. One option is to expand the initial design, if possible. This may be done by considering alternative measurement technologies, higher sampling frequencies, or additional levels for factors. Alternatively, one can fix the unidentifiable parameters, excluding them from the experimental design process. Which option to select is highly situation dependent, and is best decided on a case-by-case basis. In many cases the unidentifiable model parameters will not be important (small sensitivity) for controlling the quantities of interest in the particular model, and in this sense fixing these parameters would often be reasonable. We stress, however, that parameter unidentifiability does not imply biological unimportance; rather, an unidentifiable parameter is not important for the measured variables according to the model.



10.2.4.2 Procedure

Based on the results of identifiability and impact analysis, expand or reduce the currently considered experimental design. As this part of the procedure is highly situational and model dependent, we refer the reader to the case studies in subsequent sections for examples of how to perform design reduction and modification.

10.2.5 Design implementation

10.2.5.1 Purpose and implementation


There is a remaining fundamental issue with the proposed experimental design strategy: the design is based upon the current values of the model parameters, which are not yet known. To resolve this issue we propose using an iterative, sequential design and estimation approach, where only a small subset of the reduced experimental design is implemented at each step in the iteration (Figure 10.2) [2]. For such an approach, Atkinson and Donev recommend implementing a number of measurements equal to the square root of the total number of measurements in the design [2]. At each iteration, the cumulative experimental data are used to refine the parameter estimates, and the current parameter estimates are then used to propose the next round of experiments. The process is repeated until the model agrees reasonably with the experimental observations and the unknown parameters are identifiable. The initial parameter estimates, $\Theta_0$, can be obtained by a variety of means, a discussion of which is outside the scope of this chapter. Regardless of how these initial estimates are obtained, however, it is essential to start with some parameter values [14]. The initial experimental data vector, $Y_0$, may consist of literature data and/or preliminary experimental data; however, it is not essential to begin with experimental data. In the case that $Y_0$ is empty, the first parameter estimation step is skipped and the procedure begins with the first experimental design.

It is important to note that the parameter estimation steps are not trivial, and they are also an area of active research [13, 15–19].

Figure 10.2 Sequential parameter estimation/experimental design strategy. [Diagram; labels for Iteration 1: $Y_0$, $\Theta_0$, Parameter Estimation 1, $\Theta_1$, Experimental Design 1, $Y_1$; labels for Iteration 2: $Y_0$, $Y_1$, Parameter Estimation 2, $\Theta_2$, Experimental Design 2, $Y_2$.]

10.2.5.2 Procedure

1. Select a small subset of the proposed experimental design for implementation. The square root of the number of measurements is recommended as a rough guideline, but more or fewer experiments can be selected depending on the available experimental resources.

2. Implement the selected subset of the experimental design.

3. Perform parameter estimation using all available experimental data.

4. Propose a new experimental design based on the updated parameter set.
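Step 1 of the procedure above can be sketched as follows. The ranking used here is a stand-in (it is not an impact ranking computed from any model), and the function name is hypothetical; the design size of 32 corresponds to two responses measured at 16 time points.

```python
from math import sqrt


def next_batch(ranked, done=()):
    """Pick about sqrt(N) of the highest-ranked measurements not yet performed.

    ranked: all N measurements in the design, most informative first.
    done:   measurements already implemented in earlier iterations.
    """
    done = set(done)
    size = max(1, round(sqrt(len(ranked))))
    remaining = [m for m in ranked if m not in done]
    return remaining[:size]


times = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 25, 30, 40, 50, 60]
# Stand-in ranking for illustration only: pSTAT before tSTAT, late times first.
ranked = [(resp, t) for resp in ("pSTAT", "tSTAT") for t in reversed(times)]

batch_1 = next_batch(ranked)                 # iteration 1
batch_2 = next_batch(ranked, done=batch_1)   # iteration 2, after re-estimation
```

With 32 candidate measurements the guideline gives batches of 6. In practice the ranking would be recomputed from the updated parameter estimates before each new batch is selected.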


We present here a detailed, step-by-step application of the experimental design procedure to a previously published model of erythropoietin (Epo)-induced signal transducer and activator of transcription 5 (STAT5) signaling in BaF3-EpoR cells over a 60-minute time course [20]. This model is chosen because it is small (five parameters and four states), and thus model complexity does not obscure the illustration of the experimental design procedure. In the subsequent Application Notes section (Section 10.4), we apply our procedure to a larger, more relevant model of TGF-β signal transduction. Overall, the goal of this case study is to illustrate how, by using the experimental design procedure, Swameye and coworkers could have performed fewer experiments while maintaining parameter identifiability.

The STAT model differential equations are

$$\begin{aligned}
\frac{dx_1}{dt} &= -k_1 x_1 EpoR_A + 2 k_4 x_3(t-\tau) \\
\frac{dx_2}{dt} &= -k_2 x_2^2 + k_1 x_1 EpoR_A \\
\frac{dx_3}{dt} &= -k_3 x_3 + 0.5\, k_2 x_2^2 \\
\frac{dx_4}{dt} &= -k_4 x_3(t-\tau) + k_3 x_3
\end{aligned} \tag{10.33}$$

where $EpoR_A$, the amount of active Epo receptor, is the model input (experimentally determined in [20]); $x_1$, $x_2$, $x_3$, and $x_4$ are the model states, which correspond to different STAT5 species; and $k_1$, $k_2$, $k_3$, $k_4$, and $\tau$ are the unknown parameters to be estimated. The experimental observations of the system, which are made at the time points indicated in Table 10.2, are

$$\begin{aligned}
y_1 &= k_5 (x_2 + 2 x_3) \\
y_2 &= k_6 (x_1 + x_2 + 2 x_3)
\end{aligned} \tag{10.34}$$

where $y_1$ and $y_2$ correspond to cytoplasmic tyrosine phosphorylated STAT5 and total cytoplasmic STAT5, respectively, and $k_5$ and $k_6$ are nuisance parameters that are fixed prior to parameter estimation. These nuisance parameters are unit conversion factors that relate the arbitrary measurement units to a common unit for state concentrations in the model.

We simulate the model in MATLAB using the delay differential equation solver “dde23,” with initial conditions $x_1(0) = 1$ and $x_2(0) = x_3(0) = x_4(0) = 0$. The model input $EpoR_A$ is calculated by linear interpolation of the experimental data reported by Swameye and coworkers (see Table 10.2). For implementation, the input is simulated as an additional model state, with its time derivative equal to the slope of the linear interpolation.



The nuisance parameters $k_5$ and $k_6$ are assumed to be 39 and 0.95, respectively, while $k_1 = 0.021\ \text{min}^{-1}$, $k_2 = 2.46\ \text{min}^{-1}\,\text{mol}^{-1}$, $k_3 = 0.1066\ \text{min}^{-1}$, $k_4 = 0.10658\ \text{min}^{-1}$, and $\tau = 6.4$ min, as reported by [20].
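For readers without access to MATLAB, the simulation can be approximated with the dependency-free sketch below. It is not the dde23 solver used in this chapter: it substitutes fixed-step explicit Euler integration with a stored history for the delayed state, and it assumes $x_3(t) = 0$ for $t < 0$. The step size and function names are illustrative choices; the input data and parameter values are those given in Table 10.2 and above.

```python
# EpoRA time course from Table 10.2: (time in min, value in arbitrary units).
EPO_RA = [(0, 0.00), (2, 8.54), (4, 14.30), (6, 44.70), (8, 58.40),
          (10, 50.30), (12, 45.70), (14, 33.30), (16, 36.30), (18, 19.30),
          (20, 19.80), (25, 18.70), (30, 3.04), (40, 1.45), (50, 0.68),
          (60, 0.99)]


def epo_ra(t):
    """Linearly interpolate the measured EpoRA input at time t (min)."""
    for (t0, v0), (t1, v1) in zip(EPO_RA, EPO_RA[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return EPO_RA[-1][1]  # assumption: hold the last value outside the table


def simulate(k1=0.021, k2=2.46, k3=0.1066, k4=0.10658, tau=6.4,
             k5=39.0, k6=0.95, t_end=60.0, dt=0.01):
    """Integrate (10.33) with explicit Euler steps; return (10.34) outputs."""
    n_steps = round(t_end / dt)
    lag = round(tau / dt)              # delay expressed in whole steps
    sample_steps = {round(t / dt): t for t, _ in EPO_RA}
    x1, x2, x3, x4 = 1.0, 0.0, 0.0, 0.0
    x3_history = [x3]                  # x3_history[i] is x3 at step i
    outputs = {}
    for i in range(n_steps + 1):
        if i in sample_steps:
            y1 = k5 * (x2 + 2.0 * x3)
            y2 = k6 * (x1 + x2 + 2.0 * x3)
            outputs[sample_steps[i]] = (y1, y2)
        if i == n_steps:
            break
        epo = epo_ra(i * dt)
        # Assumption: x3(t) = 0 for t < 0, matching the zero initial condition.
        x3_delayed = x3_history[i - lag] if i >= lag else 0.0
        dx1 = -k1 * x1 * epo + 2.0 * k4 * x3_delayed
        dx2 = -k2 * x2 ** 2 + k1 * x1 * epo
        dx3 = -k3 * x3 + 0.5 * k2 * x2 ** 2
        dx4 = -k4 * x3_delayed + k3 * x3
        x1, x2, x3, x4 = (x1 + dt * dx1, x2 + dt * dx2,
                          x3 + dt * dx3, x4 + dt * dx4)
        x3_history.append(x3)
    return outputs, (x1, x2, x3, x4)


outputs, final_state = simulate()
x1, x2, x3, x4 = final_state
total = x1 + x2 + 2.0 * x3 + 2.0 * x4  # conserved by (10.33)
```

A useful sanity check is that $x_1 + x_2 + 2x_3 + 2x_4$ is conserved by (10.33), so `total` should remain equal to its initial value of 1. An adaptive solver such as dde23 would be preferable for quantitative work.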

10.3.1 Step 1: Initial perturbation and measurement design

The first step in the experimental design process is to propose initial perturbation and measurement designs. In this example, we treat the experimental data set used by Swameye and coworkers as the initial perturbation and measurement design. Translating their experimental data into our nomenclature, for the initial perturbation design they selected a single ligand type (Epo), a single ligand input sequence (step input), and no pharmaceutical inhibitors or siRNA. For the initial measurement design, they chose two responses, cytoplasmic tyrosine phosphorylated STAT5 (pSTAT) and total cytoplasmic STAT5 (tSTAT), to be observed by immunoblotting. Both of these responses are measured every 2 minutes in the first 20 minutes after ligand stimulation, and subsequently at 25, 30, 40, 50, and 60 minutes.

10.3.2 Step 2: Identifiability analysis

The next step in the experimental design process is to perform identifiability analysis using the simulated initial perturbation and measurement design. The first step in performing identifiability analysis is to calculate the scaled parameter sensitivity matrix $\tilde{Z}$. To do this, we need estimates of the parameter sensitivities $Z_{ij}$ and the data covariance matrix $V_Y$. We assume that the data covariance matrix is diagonal, and based on the data reported in [20] we estimate that the variance for pSTAT is 4 and for tSTAT is 0.2 (both in arbitrary measurement units). To calculate the inverse of the matrix square root of $V_Y$, the functions “inv” and “sqrt” in MATLAB are used (for a diagonal $V_Y$, the element-wise square root coincides with the matrix square root).



Table 10.2 Experimental Data Reported by [20]

Time Point (min) | EpoRA | EpoRA Slope | pSTAT | tSTAT
0 | 0.00 | 4.27 | 1.08 | 1.00
2 | 8.54 | 2.88 | 10.01 | 0.93
4 | 14.30 | 15.20 | 24.80 | 0.79
6 | 44.70 | 6.85 | 27.40 | 0.78
8 | 58.40 | –4.05 | 26.50 | 0.70
10 | 50.30 | –2.30 | 23.40 | 0.65
12 | 45.70 | –6.20 | 21.70 | 0.59
14 | 33.30 | 1.50 | 22.10 | 0.59
16 | 36.30 | –8.50 | 24.20 | 0.64
18 | 19.30 | 0.25 | 22.10 | 0.64
20 | 19.80 | –0.22 | 23.00 | 0.69
25 | 18.70 | –3.13 | 22.50 | 0.69
30 | 3.04 | –0.16 | 23.20 | 0.76
40 | 1.45 | –0.08 | 14.40 | 0.81
50 | 0.68 | 0.03 | 8.67 | 0.92
60 | 0.99 | | 7.96 | 0.97

To estimate the parameter sensitivities, we perform six different time course simulations using the same $EpoR_A$ input function: one with the nominal parameter values and one for each parameter with its value increased by 1% of its nominal value. Finite forward differences ($\Delta y_i/\Delta\Theta_j$) are used to calculate the parameter sensitivities, which are then scaled according to (10.13) to obtain $\tilde{Z}$.
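The finite-difference scheme described above (one nominal simulation plus one perturbed simulation per parameter) can be sketched on a toy model. The exponential decay model, its parameter values, and the variances below are hypothetical stand-ins for the STAT5 model, and a diagonal $V_Y$ is assumed for the scaling.

```python
from math import exp, sqrt


def toy_model(theta, times):
    """Hypothetical stand-in model: y(t) = a * exp(-b * t)."""
    a, b = theta
    return [a * exp(-b * t) for t in times]


def scaled_sensitivities(model, theta_hat, times, variances, rel_step=0.01):
    """Forward-difference sensitivities, scaled as in (10.13).

    Runs 1 + p simulations: the nominal case and one per parameter with
    that parameter increased by rel_step (1% by default).
    """
    nominal = model(theta_hat, times)
    columns = []
    for j, value in enumerate(theta_hat):
        perturbed = list(theta_hat)
        perturbed[j] = value * (1.0 + rel_step)
        shifted = model(perturbed, times)
        column = []
        for y_new, y_nom, var in zip(shifted, nominal, variances):
            dy_dtheta = (y_new - y_nom) / (value * rel_step)
            # Scale by theta_hat_j / sigma_i (diagonal V_Y assumed).
            column.append(dy_dtheta * value / sqrt(var))
        columns.append(column)
    return [list(row) for row in zip(*columns)]  # n rows, p columns


times = [0.0, 5.0, 10.0]
theta_hat = [2.0, 0.1]
variances = [0.04, 0.04, 0.04]
z_scaled = scaled_sensitivities(toy_model, theta_hat, times, variances)
```

At $t = 0$ the toy output does not depend on the decay rate, so the corresponding entry of $\tilde{Z}$ is zero, while the entry for the amplitude equals the scaled output itself.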

After calculating the scaled parameter sensitivity matrix, the next step in identifiability analysis is testing for structural identifiability, which involves evaluating the rank of $\tilde{Z}$. Using the function “rank” in MATLAB, we find that the rank of $\tilde{Z}$ is five. Thus, $\tilde{Z}$ is of full rank and the model is structurally identifiable.

Since the model is structurally identifiable, we move to testing parameter identifiability. In their original study, Swameye et al. calculated 1σ parameter confidence intervals using likelihood contours to conclude that their estimated parameter values were identifiable (Table 10.3). To provide a fair comparison of the confidence intervals defined in (10.19) to those of Swameye et al., we used the parameter standard deviations as the confidence intervals, neglecting the first two terms on the RHS of (10.19) (the $t$-distribution statistic and the residual norm). Setting the parameter tolerances $\tilde{\kappa}$ to 0.3 (±30%), our identifiability analysis also indicated that all parameters were identifiable (Table 10.3). Table 10.3 also indicates that there is reasonable agreement between our confidence intervals and those calculated by Swameye et al., despite their being calculated using different methods.
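The identifiability statement above can be checked directly from the initial-design column of Table 10.3 by dividing each confidence interval by its nominal value and comparing the result with the tolerance of 0.3. The variable names are illustrative; the numbers are those listed in Table 10.3.

```python
# Nominal values and initial-design confidence intervals from Table 10.3.
nominal = {"k1": 0.021, "k2": 2.46, "k3": 0.1066, "k4": 0.10658, "tau": 6.4}
conf_initial = {"k1": 0.0023, "k2": 0.3598, "k3": 0.0176,
                "k4": 0.022, "tau": 0.9757}
tolerance = 0.3  # scaled tolerance, i.e. +/- 30% of the nominal value

# Scaled confidence interval delta_j = interval / nominal value.
relative = {name: conf_initial[name] / nominal[name] for name in nominal}
identifiable = {name: rel <= tolerance for name, rel in relative.items()}

for name in nominal:
    print(f"{name}: delta = {relative[name]:.3f}, "
          f"identifiable = {identifiable[name]}")
```

The largest scaled interval is that of $k_4$ (about 0.21), so all five parameters satisfy (10.20) with the 30% tolerance.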

10.3.3 Step 3: Impact analysis

Since the initial design gives desirable parameter identifiability results, the next step in the experimental design procedure is impact analysis. Using the already calculated scaled sensitivity matrix Z̃, we use the MATLAB function “abs” to calculate the absolute sensitivity coefficients and “svd” for the singular value decomposition of Z̃ (required to calculate the net impacts and importance coefficients). Note that the “economy size” svd as described by the MATLAB help files is adequate for our purposes.

Before performing rank-based impact analysis, it is informative to analyze how these different impact metrics vary for the different measurements and measurement time points [Figure 10.3(a–c)]. All metrics show a similar trend: y1, pSTAT, in general has more impact than y2, tSTAT, implying that pSTAT measurements have a much greater effect on parameter variances than tSTAT measurements. However, there are differences between these impact metrics in terms of measurement time points. Long time points (>40 minutes) have high absolute sensitivity coefficients and net impacts, while high importance coefficients are distributed more evenly between short, mid, and long-term measurement time points for different parameters. Furthermore, Figure 10.3(a, b) shows that the impact rankings are parameter dependent: the time points having the highest impacts are different for each parameter. Thus, although these different impact metrics give similar information in terms of which responses have the highest impact, there are differences between them in terms of which time points have more impact.

Table 10.3 Parameter Values and Identifiability for the Swameye STAT Model

Name  Nominal Value       Conf. Interval [20]  Conf. Interval    Conf. Interval    Conf. Interval
                                               (Initial Design)  (Rank-Based       (Main Effects-Based
                                                                 Reduced Design)   Reduced Design)
k1    0.021 min−1         +0.004/−0.003        ±0.0023           ±0.0030           ±0.0023
k2    2.46 min−1 mol−1    +1.7/−1.0            ±0.3598           ±0.4147           ±0.3673
k3    0.1066 min−1        +0.03/−0.022         ±0.0176           ±0.0196           ±0.0183
k4    0.10658 min−1       +0.0016/−0.0024      ±0.022            ±0.0267           ±0.0242
τ     6.4 min             +0.5/−2.6            ±0.9757           ±1.0468           ±0.9825

These time point impact differences are manifested in the results of rank-based impact analysis as shown in Figure 10.3(d), which depicts how the number of identifiable parameters (again to ±30%) depends on the number of measurements chosen using rankings based on the different impact metrics. Figure 10.3(d) reveals that using importance coefficients as the impact metric yields the most efficient design, with all five parameters being identifiable with only 11 of the original 32 measurements. Thus, including short time points in the design, as dictated by importance coefficients, results in improved parameter identifiability. While importance coefficients are clearly the best impact metric to use with this rank-based analysis, absolute sensitivity coefficient-based designs perform nearly as well as importance coefficient-based designs for a small number of measurements. Net impact-based designs are overall the worst performers, being only slightly better than choosing experiments randomly. However, net impacts and absolute sensitivity coefficients perform equally well for identifying all five of the parameter values, both needing 15 of the original 32 measurements. Although there are clear differences in parameter identifiability based on the impact metric used to reduce the design, experimental design based on any of these impact metrics could have saved Swameye et al. from making more than half of their measurements.
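The rank-based selection behind Figure 10.3(d) can be sketched as a simple loop: sort the candidate measurements by impact, then grow the design one measurement at a time until every parameter is identifiable. The Python below is a hedged illustration only. It assumes one impact score per measurement, and `count_identifiable` is a hypothetical callback standing in for the full identifiability analysis; the toy stand-in simply counts which parameters are "informed" by the chosen measurements.

```python
def smallest_ranked_design(impacts, count_identifiable, n_parameters):
    """Add measurements in order of decreasing impact until all
    parameters are identifiable; return the chosen measurement indices.

    impacts:            one impact score per candidate measurement.
    count_identifiable: hypothetical callback; given a list of measurement
                        indices, returns the number of identifiable parameters.
    """
    ranking = sorted(range(len(impacts)), key=lambda m: impacts[m], reverse=True)
    for size in range(1, len(ranking) + 1):
        design = ranking[:size]
        if count_identifiable(design) == n_parameters:
            return design
    return None  # even the full design does not identify every parameter

# Toy stand-in for identifiability analysis (illustrative only):
# measurement m carries information about the parameters in INFORMS[m].
INFORMS = [{0}, {1}, {0, 1}, {2}, set()]
toy_impacts = [0.9, 0.5, 0.7, 0.2, 0.1]

def toy_count(design):
    return len(set().union(*(INFORMS[m] for m in design)))

chosen = smallest_ranked_design(toy_impacts, toy_count, n_parameters=3)
```

Running the same loop once per impact metric, and recording the number of identifiable parameters at each design size, gives curves of the kind compared in Figure 10.3(d).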

The results of main effects-based impact analysis are shown in Figure 10.4 for completeness, although for this simple case study they do not give additional insight into the experimental design problem. Here, we observe trends similar to those described above: pSTAT generally has more impact than tSTAT [Figure 10.4(b, d, f)], absolute sensitivity coefficients and net impacts imply that long time points have more impact than short time points [Figure 10.4(a, e)], and importance coefficients imply that impact is distributed among all time points.

Figure 10.3 Impact metrics and rank-based impact analysis for the Swameye STAT Model. (a) Absolute sensitivity coefficients. (b) Importance coefficients. (c) Net impacts. (d) The number of identifiable parameters versus the number of proposed measurements for rank-based experiment selection. For panels (a) and (c), note the different y-axis scales for y1 and y2.

10.3.4 Step 4: Design reduction

Design reduction using rank-based analysis results is straightforward: choose the smallest number of measurements that will allow us to identify all the model parameter values. Thus, based on Figure 10.3(d) we choose to use importance coefficients as the impact metric and choose the top 11 ranked measurements. These 11 measurements are listed in Table 10.4 and, interestingly, do not include any tSTAT measurements. Importantly, an entire class of measurements, tSTAT, is unnecessary and was eliminated using our procedure.

Design reduction using main effects-based analysis results is more ad hoc, as it requires reduction based on interpretation of the data presented in Figure 10.4. Since, as pointed out above, pSTAT measurements have more impact than tSTAT measurements, it is reasonable to include only pSTAT measurements in the reduced design. Furthermore, since the 40-, 50-, and 60-minute measurement time points have high net impacts and absolute sensitivity coefficients, these points should also be included in the reduced design. Importance coefficients imply that several different time points have large impact on different parameters; however, it is not clear which time points should be excluded since they all have reasonably high impact for different parameters. Therefore, we include all time points in the main effects analysis-based reduced design (Table 10.4).

Figure 10.4 Main effects-based impact analysis of the Swameye STAT model. (a, b) Mean absolute sensitivity coefficients. (c, d) Mean importance coefficients. (e, f) Mean net impacts.

10.3.5 Step 5: Identifiability analysis

After proposing a reduced design, the next step is to perform identifiability analysis on the reduced design. As part of the rank-based analysis, we performed identifiability analysis with the different reduced designs and found that all five parameters are identifiable to ±30%. The confidence intervals are shown in Table 10.3 and, as can be seen, are quite close to the intervals obtained when considering all 32 measurements. Similar results are obtained with the main effects-based reduced design, with all five parameters being identifiable to ±30% with very small increases in confidence intervals from the initial design.

As we will not implement any experiments in this example, this is the final step of the analysis. The results of this case study have clearly shown that experimental design could have saved a significant amount of experimental effort for Swameye and coworkers. For this small model, rank-based analysis using importance coefficients was the most effective experimental design strategy. However, this should not be taken as a generality, and it is strongly recommended that all options (rank-based and main effects-based analysis, importance coefficients, absolute sensitivity coefficients, and net impacts) be explored until this experimental design procedure has been applied to a wide variety of models and general trends are established.


In this section we illustrate the experimental design procedure by applying it to a practically relevant signal transduction modeling problem. The model, whose equations can be found in Supplementary Tables 10.1 and 10.2 at the end of this chapter and which is shown schematically in Figure 10.5, describes how the ligand transforming growth factor β (TGF-β) induces formation of nuclear Smad2-Smad4 protein complexes over an 8-hour time course [21]. The model contains 37 unknown parameters, for which we have preliminary, nominal values (Table 10.5). However, we do not have any preliminary experimental data.


Table 10.4 Reduced Designs for the Swameye STAT Model

              Rank-Based Design                  Main Effects-Based Design
Measurements  pSTAT                              pSTAT
Time Points   4, 8, 10, 12, 14, 18, 20, 25,      0, 2, 4, 6, 8, 10, 12, 14, 16, 18,
              30, 40, 60 minutes                 20, 25, 30, 40, 50, 60 minutes

10.4.1 Step 1: Initial perturbation and measurement design

To construct the initial perturbation design, which is summarized in Table 10.6, we have to identify the factors and then set their levels. There is only one ligand, TGF-β, for which we consider pulse-chase, or rectangular pulse, input sequences, each defined by a magnitude and a duration. For both magnitude and duration we consider three levels uniformly spaced across physiologically relevant scales. We consider siRNA knock-down of either Smad2 or Smad4, but not of the receptors, since receptor knock-down would lead to a trivial signaling response (no signaling). We do not consider any pharmacological inhibitors at the present time; however, it may be of interest to consider inhibiting nuclear export with Leptomycin B in a future round of experimental design.
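A rectangular pulse input of the kind described here is fully specified by its magnitude and duration. The short Python sketch below is one way to write such an input function; the function name, the units in the argument names, and the choice to switch the ligand off exactly at the end of the pulse are assumptions made for illustration.

```python
def rectangular_pulse(t_min, magnitude_ng_ml, duration_hr):
    """Ligand concentration (ng/mL) at time t_min (minutes) for a
    pulse-chase input that starts at t = 0 and lasts duration_hr hours."""
    if 0.0 <= t_min < duration_hr * 60.0:
        return float(magnitude_ng_ml)
    return 0.0  # before stimulation, or after the ligand is chased away

# Example: a 5 ng/mL pulse lasting 1 hour, sampled every 30 minutes.
profile = [rectangular_pulse(t, 5, 1) for t in (0, 30, 60, 90)]
```

Passing such a function to the model's right-hand side gives one simulated time course per combination of magnitude and duration.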

For the initial measurement design, we consider immunoblotting as the measurement technology. Based on commercial availability of antibodies we consider absolute measurements of nine feasible responses that provide reasonable coverage of the model states (Table 10.7). It is important to recognize how the measurements are related to model state variables through an observation function, which is typically not a “one-to-one” relationship, but a sum over several model states. We list these observation functions in the right-hand column of Table 10.7, and encourage the reader to look over these functions in detail. We choose one sample every 5 minutes as the upper bound on sampling frequency. Although a different bound could certainly be chosen, 5 minutes provides a very fine resolution over the 8-hour model time course and as such represents a reasonable compromise. We note that this sampling frequency upper bound and/or the considered responses can be changed in the initial “Design Modification” step of the procedure if necessary.

Figure 10.5 Schematic diagram of the TGF-β induced SMAD signaling model.

Table 10.5 Unknown Parameters in the TGF-β Signaling Model

Index  Parameter  Reaction Step                 Value     Unit                     δID(a)  δID(b)
1      k1a        ligand binding                6.60E–03  molecule−1·min−1         0.0497  0.06
2      k1d        dissociation                  2.98E–01  min−1                    0.1093  0.1616
3      k2a        association (RI-RII*)         6.60E–03  molecule−1·min−1         0.0894  0.1039
4      k2d        dissociation                  2.98E–01  min−1                    0.1315  0.1432
5      k3int      internalization (Rc)          3.95E–01  min−1                    0.0496  0.059
6      k4a        association (Rc-S2)           1.50E–04  molecule−1·min−1         0.0695  0.0615
N/A    k4d        dissociation                  9.71E–01  min−1                    3087    N/A
7      k5cat      turnover (pS2)                4.48E+04  min−1                    0.1676  0.2125
8      k6a        association (pS2-S4)          6.00E–03  molecule−1·min−1         0.3413  0.3908
9      k6d        dissociation                  1.46E+03  min−1                    0.3451  0.4031
10     k7imp      nuclear import (pS2S4)        8.10E–01  min−1                    0.1132  0.1618
11     k8dp       dephosphorylation (pS2S4)     2.52E–02  min−1                    0.0096  0.0334
12     k9d        dissociation (S2-S4)          1.01E–01  min−1                    0.01    0.0112
13     k10imp     nuclear import (S2)           1.62E–01  min−1                    0.1624  0.3125
14     k10exp     nuclear export (S2)           3.48E–01  min−1                    0.1583  0.3039
15     k11imp     nuclear import (S4)           2.01E–02  min−1                    0.0822  0.1803
16     k11exp     nuclear export (S4)           1.74E–01  min−1                    0.0813  0.183
17     k12syn     protein synthesis (RII)       8.00E+00  molecule·min−1·cell−1    0.0165  0.0428
18     k12deg     degradation (RII)             2.80E–02  min−1                    0.0655  0.1127
19     k13syn     protein synthesis (RI)        8.00E+00  molecule·min−1·cell−1    0.0172  0.0455
20     k13deg     degradation (RI)              2.80E–02  min−1                    0.0547  0.0893
21     k14syn     protein synthesis (S2)        2.74E+01  molecule·min−1·cell−1    0.0151  0.0332
22     k14deg     degradation (S2)              6.46E–04  min−1                    0.0107  0.0272
23     k15syn     protein synthesis (S4)        5.00E+01  molecule·min−1·cell−1    0.0158  0.0575
24     k15deg     degradation (S4)              1.20E–03  min−1                    0.0073  0.0286
25     k16deg     constitutive deg (Rc)         2.80E–02  min−1                    0.0667  0.0895
26     k16lid     ligand-induced deg (Rc)       3.95E–01  min−1                    0.538   0.5861
27     k17imp     nuclear import (pS2)          5.03E–01  min−1                    0.0182  0.0362
28     k18a       association (pS2-S4)          1.67E–04  molecule−1·min−1         0.0344  0.0852
29     k18d       dissociation                  9.09E–01  min−1                    0.0279  0.0412
30     k19dp      dephosphorylation (pS2)       2.52E–02  min−1                    0.0079  0.0223
31     k20lid     ligand-induced deg (pS2)      5.40E–03  min−1                    0.0214  0.0505
32     k21int     internalization (RII)         3.95E–01  min−1                    0.0705  0.1087
33     k21rec     recycling (RII)               3.95E–02  min−1                    0.0555  0.068
34     k22int     internalization (RI)          3.95E–01  min−1                    0.0728  0.1011
35     k22rec     recycling (RI)                3.95E–02  min−1                    0.0545  0.066
36     k23rec     recycling (Rc)                3.95E–02  min−1                    0.0028  0.0035

(a) Confidence intervals based on the initial design.
(b) Confidence intervals based on Design D.

Table 10.6 Initial Perturbation Design for the TGF-β Signaling Model

Factor                     Level
Ligand type                TGF-β
Ligand input sequence      Magnitudes: 1, 5, 10 ng/mL
                           Durations: 1, 4, 8 hours
siRNA                      None, Smad2, Smad4
Pharmaceutical inhibitors  None

Using a factorial design gives 27 (3 magnitudes × 3 durations × 3 siRNAs) distinct input perturbation conditions. In response to all of these perturbations we simulate the nine responses at each of the 97 time points. This gives a total of 23,571 simulated measurements (27 × 97 × 9) in the initial design.
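The size of the factorial design can be reproduced with a short Python sketch that enumerates the factor levels of Table 10.6. The variable names are illustrative, and the time grid assumes that both end points of the 8-hour time course (0 and 480 minutes) are sampled, which is what yields 97 time points at 5-minute spacing.

```python
from itertools import product

MAGNITUDES_NG_ML = [1, 5, 10]
DURATIONS_HR = [1, 4, 8]
SIRNA = ["none", "Smad2", "Smad4"]
N_RESPONSES = 9

# Full factorial: every magnitude with every duration and siRNA condition.
perturbations = list(product(MAGNITUDES_NG_ML, DURATIONS_HR, SIRNA))

# 5-minute sampling from 0 to 480 minutes inclusive.
time_points_min = list(range(0, 8 * 60 + 1, 5))

n_measurements = len(perturbations) * len(time_points_min) * N_RESPONSES
print(len(perturbations), len(time_points_min), n_measurements)
```

Each element of `perturbations` is one (magnitude, duration, siRNA) condition to be simulated, so the list doubles as the loop index for the sensitivity calculations.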

10.4.2 Step 2: Identifiability analysis

The next step in the experimental design process is to perform identifiability analysis using the simulated initial perturbation and measurement design, which involves first calculating the scaled parameter sensitivity matrix Z̃. Similar to the STAT model example above, we calculate the parameter sensitivities using forward finite differences based on simulations with 1% bumps to each of the model parameters. Model trajectories are calculated using the MATLAB function “ode15s” for numerical integration of ODE systems. To simulate the effects of siRNA, we reduced the synthesis rate for the species being knocked down (k14syn for Smad2 and k15syn for Smad4) to 10% of its nominal value. We assume that the data covariance matrix VY is diagonal and each measurement’s standard deviation is 20% of its nominal value. Using these considerations to calculate the scaled parameter sensitivity matrix Z̃, we found that the rank of Z̃ is 37, and thus the model is structurally identifiable. Since this is the first experimental design for this model, we consider generous tolerances κ̃ of 0.6 (±60%), and find that one parameter, k4d, is not identifiable (Table 10.5). This parameter, which characterizes dissociation of the active receptor dimer-Smad2 complex, is not even close to being identifiable, with a confidence interval of ~3,000-fold the nominal value. We therefore fix this parameter at the nominal value rather than modifying the initial design, and proceed with the experimental design considering the other 36 parameters.
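Two of the modeling choices above, the siRNA knock-down and the 20% measurement error, are simple enough to sketch directly. The Python below is a hedged illustration with hypothetical helper names; the synthesis rates are the nominal values of k14syn and k15syn from Table 10.5.

```python
def apply_sirna(params, target=None, remaining_fraction=0.10):
    """Return a copy of params with the targeted synthesis rate reduced
    to remaining_fraction of its nominal value (no change if target is None)."""
    synthesis_rate = {"Smad2": "k14syn", "Smad4": "k15syn"}
    modified = dict(params)  # leave the nominal parameter set untouched
    if target is not None:
        modified[synthesis_rate[target]] *= remaining_fraction
    return modified

def measurement_std(y_nominal, fraction=0.20):
    """Standard deviation of each measurement as a fraction of its nominal value."""
    return [fraction * abs(y) for y in y_nominal]

nominal = {"k14syn": 27.4, "k15syn": 50.0}  # molecule/min/cell, Table 10.5
smad2_knockdown = apply_sirna(nominal, "Smad2")
sigma = measurement_std([10.0, 50.0])       # toy response values, illustrative
```

With a diagonal covariance matrix, these standard deviations are all that is needed to weight each row of the sensitivity matrix.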

10.4.3 Steps 3 to 5: Impact analysis, design reduction, and identifiability analysis

The overall goal of steps 3 to 5 is to reduce the initial design such that we retain identifiability of the 36 model parameters, but simultaneously arrive at a relatively inexpensive design. The results of this part of the procedure not only give a proposed reduced design for implementation, but also give insight into the pros and cons of: (1) rank versus main effects analysis and (2) the three different impact metrics.

Table 10.7 Considered Responses for the TGF-β Signaling Model

Response (Abbreviation)                              Observation Function(a)
Nuclear Smad2 (Nuc. Smad2)                           S2S4Nuc + pS2S4Nuc + S2Nuc + pS2Nuc
Cytoplasmic Smad2 (Cyt. Smad2)                       S2Cyt + pS2Cyt + pS2S4Cyt
Phosphorylated Nuclear Smad2 (Nuc. pSmad2)           pS2S4Nuc + pS2Nuc
Phosphorylated Cytoplasmic Smad2 (Cyt. pSmad2)       pS2Cyt + pS2S4Cyt
Nuclear Smad4 (Nuc. Smad4)                           S4Nuc + pS2S4Nuc + S2S4Nuc
Cytoplasmic Smad4 (Cyt. Smad4)                       S4Cyt + pS2S4Cyt
Phosphorylated Smad2-Smad4 Complex (pSmad2-Smad4)    pS2S4Cyt + pS2S4Nuc
Total TGF-β Type 1 Receptor (R1)                     R1 + RC + RCIn + RCIn-S2Cyt + R1In
Total TGF-β Type 2 Receptor (R2)                     R2 + R2TGF + RC + RCIn + RCIn-S2Cyt + R2In

(a) Abbreviations correspond to the nomenclature shown in Supplementary Tables 10.1 and 10.2.
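The observation functions in Table 10.7 are sums over model states, which can be expressed compactly as a lookup from each response to its contributing states. The Python sketch below covers three of the nine responses; the state names follow Table 10.7, while the numerical state values are made up purely for illustration.

```python
# Contributing model states for three of the responses in Table 10.7.
OBSERVATIONS = {
    "Nuc. Smad2": ["S2S4Nuc", "pS2S4Nuc", "S2Nuc", "pS2Nuc"],
    "Cyt. pSmad2": ["pS2Cyt", "pS2S4Cyt"],
    "pSmad2-Smad4": ["pS2S4Cyt", "pS2S4Nuc"],
}

def observe(state, response):
    """Sum the model states that contribute to one measured response."""
    return sum(state[name] for name in OBSERVATIONS[response])

# Illustrative state values (molecules per cell); not model output.
state = {"S2S4Nuc": 10.0, "pS2S4Nuc": 5.0, "S2Nuc": 100.0,
         "pS2Nuc": 20.0, "pS2Cyt": 7.0, "pS2S4Cyt": 3.0}

nuclear_smad2 = observe(state, "Nuc. Smad2")
```

Because several responses share states (pS2S4Nuc appears in two of the three sums here), the sensitivity of a measurement to a parameter is the sum of the sensitivities of its contributing states.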

The results of rank-based impact analysis are shown in Figure 10.6, where the number of identifiable parameters is plotted versus the number of measurements included in reduced designs based on different impact metrics. Overall, most of the parameters can be identified with relatively few measurements (<500), but identifying all the parameter values requires a large number of measurements, regardless of the impact metric. In general, importance coefficient designs outperform absolute sensitivity coefficient and net impact designs; however, absolute sensitivity coefficient designs are slightly better for small (<200 measurements) or large (>2,000 measurements) designs. Net impact designs perform surprisingly badly in all cases, with random measurement choice performing better for designs comprising fewer than ~1,000 measurements.

Although based on this rank analysis one can choose a reduced design comprising approximately 5,000 measurements that allows for identification of the 36 unknown parameters [Figure 10.6(b)], it is also of interest to find an inexpensive design. In general, the two experimental characteristics that will lead to a large experimental cost are a large number of input perturbations and high-frequency measurements. Table 10.8 shows how 500-measurement designs based on the different impact metrics perform in terms of these expensive design characteristics. Although net impact designs contain slightly fewer input perturbations and high-frequency measurements than importance or absolute sensitivity coefficient designs, we see that regardless of the impact metric, these relatively small rank analysis designs contain several high-frequency measurements and input perturbations. Having fewer high-frequency measurements and input perturbations accounts for the poor performance of net impact designs. Choosing rank analysis reduced designs that allow for identification of all 36 model parameters (~5,000 measurements) will only lead to even more high-frequency measurements. Thus, one drawback of choosing reduced designs based on rank analysis is high experimental cost.
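The cost-related characteristics reported in Table 10.8 can be tallied directly from a list of selected measurements. The Python sketch below assumes each measurement is stored as a (perturbation, response, time point) tuple and that "high frequency" means a response is measured at more than 15 distinct time points within the design, which is one reading of the table's footnote; all names are hypothetical.

```python
from collections import defaultdict

def design_characteristics(design, high_frequency_threshold=15):
    """Summarize a design given as (perturbation, response, time) tuples."""
    perturbations = {p for p, _, _ in design}
    times_by_response = defaultdict(set)
    for _, response, t in design:
        times_by_response[response].add(t)
    high_frequency = [r for r, times in times_by_response.items()
                      if len(times) > high_frequency_threshold]
    return {
        "n_perturbations": len(perturbations),
        "n_responses": len(times_by_response),
        "n_high_frequency": len(high_frequency),
    }

# Toy design (illustrative): one response sampled at 20 time points under
# perturbation P1, another sampled at 3 time points under perturbation P2.
toy_design = [("P1", "Cyt. pSmad2", t) for t in range(0, 100, 5)]
toy_design += [("P2", "R2", t) for t in (0, 30, 480)]
summary = design_characteristics(toy_design)
```

Applying such a tally to the top-ranked measurements for each impact metric gives a quick view of how expensive each rank-based design would be to run.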

Figure 10.6 Rank-based impact analysis for the TGF-β model. Both panels (a) and (b) plot the number of identifiable parameters versus the number of measurements in a reduced design. (a) Behavior with a small number of measurements in the reduced design. (b) Behavior with a large number of measurements in the reduced design.

Table 10.8 Rank Analysis-Based Reduced Design Characteristics(a)

                                                      Net Impact  Absolute Sensitivity  Importance
                                                                  Coefficients          Coefficients
Number of input perturbations                         24 of 27    27 of 27              27 of 27
Number of responses measured                          4 of 9      7 of 9                6 of 9
Number of responses measured with high frequency(b)   2 of 9      4 of 9                5 of 9

(a) Based on a 500-measurement design.
(b) A high-frequency measurement is defined as having more than 15 time points.

Figures 10.7 through 10.9 show results of the main effects analysis using as the impact metric either net impacts (Figure 10.7), absolute sensitivity coefficients (Figure 10.8), or importance coefficients (Figure 10.9). For net impact, while low TGF-β dose [Figure 10.7(a)], short TGF-β duration [Figure 10.7(b)], and Smad4 siRNA [Figure 10.7(c)] have the largest main effects and therefore the highest impact, the main effects for other perturbation conditions are nearly as large; for practical purposes, all these perturbation conditions essentially have equivalent net impact. Different response characteristics, however, have drastically different impacts. Total type II receptor (R2) and cytoplasmic pSmad2 are clearly the highest impact measurements [Figure 10.7(d)], and measurements at time zero and at times close to eight hours have much higher impact than measurements at short/mid-times [Figure 10.7(e)].

Figure 10.7 Main effects of factors and responses based on net impact. (a) Ligand concentration. (b) Ligand stimulation duration. (c) siRNA. (d) Response type. (e) Measurement time point.
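The main effects discussed for Figure 10.7 are reported as mean impacts, so a main effect can be sketched as the average impact of all candidate measurements that share one level of a factor. The Python below is a minimal illustration under that assumption; the dictionary keys and the toy impact values are hypothetical.

```python
from collections import defaultdict

def main_effects(measurements, impacts, factor):
    """Mean impact at each level of one factor.

    measurements: list of dicts describing each candidate measurement.
    impacts:      impact value for each measurement, in the same order.
    factor:       dict key to group by (e.g. "dose", "sirna", "response").
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for m, impact in zip(measurements, impacts):
        totals[m[factor]] += impact
        counts[m[factor]] += 1
    return {level: totals[level] / counts[level] for level in totals}

# Toy data (illustrative only): four measurements at two dose levels.
toy_measurements = [
    {"dose": 1, "response": "R2"},
    {"dose": 1, "response": "Cyt. pSmad2"},
    {"dose": 10, "response": "R2"},
    {"dose": 10, "response": "Cyt. pSmad2"},
]
toy_impacts = [2.0, 4.0, 1.0, 3.0]
dose_effects = main_effects(toy_measurements, toy_impacts, "dose")
```

Because each level mean averages over all other factors, this view by construction hides interactions between factors, which is relevant when a design is reduced on main effects alone.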

Comparing Figure 10.8 to Figure 10.9 shows an important advantage of using importance coefficients versus absolute sensitivity coefficients for main effects analysis: since importance coefficients are all scaled between zero and one, it is much easier to visualize the impact trends. Thus, for the sake of the current analysis we use only the importance coefficient results as they are much easier to interpret. However, we note that for this case study a careful analysis of the absolute sensitivity coefficients leads to similar conclusions as those discussed next.

As opposed to the parameter-averaged impact that net impacts quantify, absolute sensitivity and importance coefficients give parameter-specific impacts, which in some instances differ from the general trends. Figure 10.9(a–c) shows that although, as observed before, no perturbation conditions have dominant impact, low TGF-β doses for short durations have significantly higher impact on parameters 14 to 19 than on other parameters. Thus, one might choose to perturb the system with a low TGF-β dose for a short duration. Additionally, since there are no significant impact differences between the siRNA perturbation conditions, one might choose no siRNA, as experimentally, using no siRNA is easier than using siRNA.

Figure 10.8 Main effects of factors and responses based on absolute sensitivity coefficients. Parameter numbers correspond to indices as shown in Table 10.5. (a) Ligand concentration. (b) Ligand stimulation duration. (c) siRNA. (d) Response type. (e) Measurement time point.

Figure 10.9(d, e) shows that the response characteristics with the highest impact are parameter dependent. The high net impact of total type II receptor measurements is attributable to only a few parameters, while the high net impact of cytoplasmic pSmad2 is distributed among many parameters. Thus, one might choose to measure cytoplasmic pSmad2 before total type II receptor despite the fact that total type II receptor has greater net impact. Each response has particular parameters on which it has high impact, even though it may have low net impact: for example, cytoplasmic Smad4 on parameter 9 and nuclear Smad4 on parameter 6. Therefore, based on these results it is difficult to say which responses should be excluded from the reduced design. In terms of time points, Figure 10.9(e) shows that for some parameters it is better to measure at short times, for some it is better to measure at long times, and for yet others it is good to measure both at short and long times. Thus, one might choose only short and long time point measurements, and exclude mid-time measurements.


Figure 10.9 Main effects of factors and responses based on importance coefficients. Parameter numbers correspond to indices as shown in Table 10.5. (a) Ligand concentration. (b) Ligand stimulation duration. (c) siRNA. (d) Response type. (e) Measurement time point.

Based on these main effects analysis results, one might propose the following reduced design: 1 ng/mL TGF-β for 1 hour, no siRNA, and all responses at 0, 5, 15, 30 minutes, 7 hours, and 8 hours (Design A; Table 10.9). Design A, while being simple to implement experimentally and comprising only 64 measurements, unfortunately yields only 13 of 36 identifiable parameters (Table 10.9). This result implies that considering only the main effects for reducing the design is not adequate; interactions between design characteristics are important for parameter identifiability.

How can Reduced Design A be improved? To answer this question we draw insight from the rank analysis reduced designs, which all include a large number of input perturbations and high-frequency measurements. Thus, one might suspect that to improve Design A we need to include more input perturbations and/or high-frequency measurements. Therefore, we investigate two more designs, B and C, that include, respectively, a single input perturbation with all responses measured at high frequency or all input perturbations with all responses measured at low frequency. Table 10.9 shows that, as expected, both of these designs lead to an increase in the number of identifiable parameters; however, not all of the parameters are identifiable for either design. Although Table 10.9 shows that Design C yields more identifiable parameters than does Design B, it also comprises approximately double the number of measurements. These results imply that both of these design features, a multitude of input perturbations and high-frequency measurements, are desirable, and perhaps essential, for parameter identifiability. This leads us to propose Design D, a hybrid of Designs B and C combining all input perturbations, low-frequency system-wide measurements, and high-frequency measurement of the high-impact response cytoplasmic pSmad2. Design D indeed yields 36 identifiable parameters, does so with just over 4,000 measurements, and the parameter confidence intervals are quite close to those of the initial design (Table 10.5). Although Design D comprises a large number of measurements, it does so with only one high-frequency measurement, and as such is much less expensive than a comparable rank analysis reduced design.


Table 10.9 Main Effects Analysis-Based Reduced Designs

                                   Initial  Reduced   Reduced   Reduced   Reduced
                                   Design   Design A  Design B  Design C  Design D
Number of identifiable parameters  36       13        23        33        36
Number of measurements             23,814   64        882       1,701     4,158

A: 1 ng/mL; 1 hour; no siRNA; all responses at 0, 5, 15, 30 minutes, 7 hours, 8 hours.
B: 1 ng/mL; 1 hour; no siRNA; all responses at all time points.
C: All perturbations; all responses at 0, 5, 15, 30 minutes, 7 hours, 8 hours.
D: All perturbations; all responses minus Cyt. pSmad2 at 0, 5, 15, 30 minutes, 7 hours, 8 hours; Cyt. pSmad2 at all time points.

In this chapter we have presented an experimental design strategy for parameter identifiability, which, as opposed to traditional and recently proposed methods, is computationally feasible to apply to large signal transduction models using currently available technology. To obtain this computational feasibility, there were naturally trade-offs with other desired design features: robustness and optimality. By considering only local parameter identifiability, robustness of the design to parameter uncertainty was sacrificed, and since no optimization is carried out over the entire experimental design space, acceptance of suboptimal designs is possible. Although our methods do not produce robust or optimal designs, they do produce adequate and experimentally feasible designs, and thus represent an important practical solution to an otherwise computationally infeasible experimental design problem. Implementing the experimental design in a sequential manner (see Figure 10.2) dampens the potential impact that individual designs in the iterative process have on the final outcome, and as such addresses the robustness and optimality issues. Furthermore, it is important to note that although in this chapter we only consider local parameter identifiability with a particular type of confidence interval, there are a whole host of identifiability tests and methods, all of which are compatible with our proposed experimental design procedure [22, 23].

The parameter identifiability metrics rely on the assumption that the experimental errors are multivariate normally distributed. Although raw data from many quantitative biological experimental techniques are not normally distributed, this does not mean that our methods cannot be used with these data. Rather, modelers must check whether their data are normally distributed, and if they are not, the data must be transformed such that they are. Procedures for such data transformation are well known in microarray analysis [24], and may be adaptable to other forms of experimental data. Along these lines, recent work by Kreutz and coworkers provides an excellent treatment of this problem for immunoblotting data [25]. They show that raw immunoblotting data are not normally distributed, and provide a mixed-effects error model for transformation of the raw data into normally distributed data.

We proposed three different metrics for quantifying the impact of potential measurements: absolute sensitivity coefficients, net impacts, and importance coefficients. An important advantage of absolute sensitivity and importance coefficients is that they provide parameter-specific impact, which can be different from the overall impact which the net impact quantifies. This parameter-specific impact led to sensitivity and importance coefficient-based rank analysis designs having a greater number of identifiable parameters for a particular number of measurements than did net impact-based rank analysis designs. However, the better performance of absolute sensitivity and importance coefficient designs came at higher experimental cost, since they included more high-frequency measurements. In most cases importance coefficient designs outperformed absolute sensitivity coefficient designs, most likely because importance coefficient designs by definition use an orthogonal basis for selecting experiments, which is a universally desirable feature. However, which impact metric is most desirable is situation dependent, and it is not clear whether the results from the case studies in this chapter are generally applicable to signal transduction models. As such, all of the impact metrics should be included in any experimental design analysis until such general understanding has been established.

We also proposed two different methods for impact analysis: rank-based analysis and main effects-based analysis. While rank-based analysis is straightforward to apply, designs based on such analysis result in high experiment cost, combining a multitude of input perturbations with high-frequency measurement of several responses. Although main effects analysis is more ad hoc and difficult to generalize, these designs were not as costly as rank analysis-based designs since they included fewer input perturbations and high-frequency measurements. Importantly, interactions between factors and responses were critical to account for when reducing designs based on main effects. Main effects analysis also revealed that while the impact of perturbation characteristics for particular parameters tended to follow general impact trends, the impact of response characteristics (what to measure and when to measure it) was highly parameter dependent.

The TGF-β model case study showed that although approximately half of the model parameters can be identified with relatively little experimental effort, identification of all the model parameters requires a large amount of experimental data, much more than is typical for such modeling studies. Our results indicated that both high-frequency measurements and diverse combinations of input perturbations are important design features. One of the top designs in terms of the number of identifiable parameters for experiment cost, Design D from Table 10.9, combined a diverse set of input perturbations with system-wide low-frequency measurements and high-frequency measurement of the informative species cytoplasmic pSmad2. Unfortunately, such a design would be extremely laborious, if not impossible, to implement with conventional immunoblotting techniques. Other measurement technologies are better suited to provide these data. Quantitative mass spectrometry is well suited to provide the low-frequency, system-wide measurements [26–28], and live-cell fluorescence is well suited to provide high-frequency measurements of particular proteins [29].

10.6 Summary Points

• Although the proposed procedure for experimental design for parameter identifiability does not produce robust or optimal designs, it does produce adequate and experimentally feasible designs, representing a practical solution to an otherwise computationally impractical experimental design problem. Implementing the experimental design in a sequential manner helps to address the robustness and optimality issues.

• The experimental design methodology is compatible with any nonlinear ordinary differential equation model that does not have significant model-experiment mismatch errors.

• Structural identifiability should always be tested for first, and parameters that have no independent effects on the observables should be held constant. Any parameter identifiability test is compatible with the proposed experimental design procedure.

• Rank-based impact analysis is straightforward to apply, but designs based on rank analysis result in high experiment cost. Although main effects-based impact analysis is more ad hoc and difficult to generalize, these designs typically are not as costly as rank analysis-based designs.

• Only small subsets of proposed experimental designs should be implemented, and the experimental design procedure should be performed iteratively with parameter estimation steps in a sequential manner.


Acknowledgments

MRB acknowledges Erik Welf for numerous helpful discussions and critical reading of this chapter in various stages of its development, and Seung-Wook Chung for providing the TGF-β signaling model.

References

[1] Bard, Y., Nonlinear Parameter Estimation, New York: Academic Press, 1974.

[2] Atkinson, A., and A. Donev, (eds.), Optimum Experimental Designs, Oxford, U.K.: Clarendon Press, 1992.

[3] Draper, N., and W. Hunter, “Design of experiments for parameter estimation in multiresponse situations,” Biometrika, Vol. 53, 1966, pp. 525–533.

[4] Asprey, S., and S. Macchietto, “Statistical tools for optimal dynamic model building,” Computers and Chemical Engineering, Vol. 24, 2000, pp. 1261–1267.

[5] Asprey, S., and S. Macchietto, “Designing robust optimal dynamic experiments,” Journal of Process Control, Vol. 12, 2002, pp. 545–556.

[6] Chen, B., S. Bermingham, A. Neumann, H. Kramer, and A. Asprey, “On the design of optimally informative experiments for dynamic crystallization process modeling,” Ind. Eng. Chem. Res., Vol. 43, 2004, pp. 4889–4902.

[7] Gadkar, K.G., R. Gunawan, and F.J. Doyle, 3rd, “Iterative approach to model identification of biological networks,” BMC Bioinformatics, Vol. 6, 2005, p. 155.

[8] Kutalik, Z., K.H. Cho, and O. Wolkenhauer, “Optimal sampling time selection for parameter estimation in dynamic pathway modeling,” Biosystems, Vol. 75, 2004, pp. 43–55.

[9] Hill, W., W. Hunter, and D. Wichern, “A joint design criterion for the dual problem of model discrimination and parameter estimation,” Technometrics, Vol. 10, 1968, pp. 145–160.

[10] Stewart, W., T. Henson, and G.E.P. Box, “Model discrimination and criticism with single-response data,” AIChE Journal, Vol. 42, 1996, pp. 3055–3062.

[11] van Riel, N.A., “Dynamic modelling and analysis of biochemical networks: mechanism-based models and model-based experiments,” Brief Bioinform., Vol. 7, 2006, pp. 364–374.

[12] Yue, H., et al., “Insights into the behaviour of systems biology models from dynamic sensitivity and identifiability analysis: a case study of an NF-kappaB signalling pathway,” Mol. Biosyst., Vol. 2, 2006, pp. 640–649.

[13] Rodriguez-Fernandez, M., J.A. Egea, and J.R. Banga, “Novel metaheuristic for parameter estimation in nonlinear dynamic biological systems,” BMC Bioinformatics, Vol. 7, 2006, p. 483.

[14] Box, G.E.P., and H.L. Lucas, “Design of Experiments in Non-Linear Situations,” Biometrika, Vol. 46, 1959, pp. 77–90.

[15] Chou, I.C., H. Martens, and E.O. Voit, “Parameter estimation in biochemical systems models with alternating regression,” Theor. Biol. Med. Model, Vol. 3, 2006, p. 25.

[16] Kuepfer, L., U. Sauer, and P.A. Parrilo, “Efficient classification of complete parameter regions based on semidefinite programming,” BMC Bioinformatics, Vol. 8, 2007, p. 12.

[17] Matsubara, Y., S. Kikuchi, M. Sugimoto, and M. Tomita, “Parameter estimation for stiff equations of biosystems using radial basis function networks,” BMC Bioinformatics, Vol. 7, 2006, p. 230.

[18] Moles, C.G., P. Mendes, and J.R. Banga, “Parameter estimation in biochemical pathways: a comparison of global optimization methods,” Genome Res., Vol. 13, 2003, pp. 2467–2474.

[19] Tucker, W., Z. Kutalik, and V. Moulton, “Estimating parameters for generalized mass action models using constraint propagation,” Math. Biosci., Vol. 208, 2007, pp. 607–620.

[20] Swameye, I., T.G. Muller, J. Timmer, O. Sandra, and U. Klingmuller, “Identification of nucleocytoplasmic cycling as a remote sensor in cellular signaling by databased modeling,” Proc. Natl. Acad. Sci. USA, Vol. 100, 2003, pp. 1028–1033.

[21] Chung, S.-W., et al., “Quantitative modeling and analysis of the transforming growth factor beta signaling pathway,” Biophys. J., Vol. 96, No. 5, 2009, pp. 1733–1750.

[22] Antoniewicz, M.R., J.K. Kelleher, and G. Stephanopoulos, “Determination of confidence intervals of metabolic fluxes estimated from stable isotope measurements,” Metab. Eng., Vol. 8, 2006, pp. 324–337.

[23] Hengl, S., C. Kreutz, J. Timmer, and T. Maiwald, “Data-based identifiability analysis of non-linear dynamical models,” Bioinformatics, Vol. 23, 2007, pp. 2612–2618.

[24] Huang, S., and Y. Qu, “The loss in power when the test of differential expression is performed under a wrong scale,” J. Comput. Biol., Vol. 13, 2006, pp. 786–797.

[25] Kreutz, C., et al., “An error model for protein quantification,” Bioinformatics, Vol. 23, 2007, pp. 2747–2753.

[26] Blagoev, B., et al., “A proteomics strategy to elucidate functional protein-protein interactions applied to EGF signaling,” Nat. Biotechnol., Vol. 21, 2003, pp. 315–318.

[27] Blagoev, B., S.E. Ong, I. Kratchmarova, and M. Mann, “Temporal analysis of phosphotyrosine-dependent signaling networks by quantitative proteomics,” Nat. Biotechnol., Vol. 22, 2004, pp. 1139–1145.

[28] Dengjel, J., et al., “Quantitative proteomic assessment of very early cellular signaling events,” Nat. Biotechnol., Vol. 25, 2007, pp. 566–568.

[29] Fujioka, A., et al., “Dynamics of the Ras/ERK MAPK cascade as monitored by fluorescent probes,” J. Biol. Chem., Vol. 281, 2006, pp. 8917–8926.

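Supplementary Table 10.1 Rate Equations for the TGF-β Model

Index   Rate Equation

1   $v_1 = k_1^{a}[\mathrm{TGF}\beta][\mathrm{RII}] - k_1^{d}[\mathrm{TGF}\beta{:}\mathrm{RII}]$

2   $v_2 = k_2^{a}[\mathrm{TGF}\beta{:}\mathrm{RII}][\mathrm{RI}] - k_2^{d}[\mathrm{R}^{C}]$

3   $v_3 = k_3^{int}[\mathrm{R}^{C}]$

4   $v_4 = k_4^{a}[\mathrm{R}^{C}_{in}][\mathrm{S2}_{cyt}] - k_4^{d}[\mathrm{R}^{C}_{in}{:}\mathrm{S2}_{cyt}]$

5   $v_5 = k_5^{cat}[\mathrm{R}^{C}_{in}{:}\mathrm{S2}_{cyt}]$

6   $v_6 = k_6^{a}[\mathrm{pS2}_{cyt}][\mathrm{S4}_{cyt}] - k_6^{d}[\mathrm{pS2S4}_{cyt}]$

7   $v_7 = k_7^{imp}[\mathrm{pS2S4}_{cyt}]$

8   $v_8 = k_8^{dp}[\mathrm{pS2S4}_{nuc}]$

9   $v_9 = k_9^{d}[\mathrm{S2S4}_{nuc}]$

10   $v_{10} = k_{10}^{imp}[\mathrm{S2}_{cyt}] - k_{10}^{exp}[\mathrm{S2}_{nuc}]$

11   $v_{11} = k_{11}^{imp}[\mathrm{S4}_{cyt}] - k_{11}^{exp}[\mathrm{S4}_{nuc}]$

12   $v_{12} = k_{12}^{syn} - k_{12}^{deg}[\mathrm{RII}]$

13   $v_{13} = k_{13}^{syn} - k_{13}^{deg}[\mathrm{RI}]$

14   $v_{14} = k_{14}^{syn} - k_{14}^{deg}[\mathrm{S2}_{cyt}]$

15   $v_{15} = k_{15}^{syn} - k_{15}^{deg}[\mathrm{S4}_{cyt}]$

16   $v_{16} = (k_{16}^{deg} + k_{16}^{lid})[\mathrm{R}^{C}]$

17   $v_{17} = k_{17}^{imp}[\mathrm{pS2}_{cyt}]$

18   $v_{18} = k_{18}^{a}[\mathrm{pS2}_{nuc}][\mathrm{S4}_{nuc}] - k_{18}^{d}[\mathrm{pS2S4}_{nuc}]$

19   $v_{19} = k_{19}^{dp}[\mathrm{pS2}_{nuc}]$

20   $v_{20} = k_{20}^{lid}[\mathrm{pS2}_{nuc}]$

21   $v_{21} = k_{21}^{int}[\mathrm{RII}] - k_{21}^{rec}[\mathrm{RII}_{in}]$

22   $v_{22} = k_{22}^{int}[\mathrm{RI}] - k_{22}^{rec}[\mathrm{RI}_{in}]$

23   $v_{23} = k_{23}^{rec}[\mathrm{R}^{C}_{in}]$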


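Supplementary Table 10.2 Differential Equations for the TGF-β Model

$\frac{d[\mathrm{RII}]}{dt} = -v_1 + v_{12} - v_{21} + v_{23}$

$\frac{d[\mathrm{TGF}\beta{:}\mathrm{RII}]}{dt} = v_1 - v_2$

$\frac{d[\mathrm{RI}]}{dt} = -v_2 + v_{13} - v_{22} + v_{23}$

$\frac{d[\mathrm{R}^{C}]}{dt} = v_2 - v_3 - v_{16}$

$\frac{d[\mathrm{R}^{C}_{in}]}{dt} = v_3 - v_4 + v_5 - v_{23}$

$\frac{d[\mathrm{R}^{C}_{in}{:}\mathrm{S2}_{cyt}]}{dt} = v_4 - v_5$

$\frac{d[\mathrm{S2}_{cyt}]}{dt} = -v_4 - v_{10} + v_{14}$

$\frac{d[\mathrm{pS2}_{cyt}]}{dt} = v_5 - v_6 - v_{17}$

$\frac{d[\mathrm{pS2S4}_{cyt}]}{dt} = v_6 - v_7$

$\frac{d[\mathrm{pS2S4}_{nuc}]}{dt} = v_7 - v_8 + v_{18}$

$\frac{d[\mathrm{S2S4}_{nuc}]}{dt} = v_8 - v_9$

$\frac{d[\mathrm{S2}_{nuc}]}{dt} = v_9 + v_{10} + v_{19}$

$\frac{d[\mathrm{S4}_{nuc}]}{dt} = v_9 + v_{11} - v_{18}$

$\frac{d[\mathrm{S4}_{cyt}]}{dt} = -v_6 - v_{11} + v_{15}$

$\frac{d[\mathrm{pS2}_{nuc}]}{dt} = v_{17} - v_{18} - v_{19} - v_{20}$

$\frac{d[\mathrm{RII}_{in}]}{dt} = v_{21}$

$\frac{d[\mathrm{RI}_{in}]}{dt} = v_{22}$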



CHAPTER 11

Parameter Identification with Adaptive Sparse Grid-Based Optimization for Models of Cellular Processes

Maia M. Donahue¹, Gregery T. Buzzard², and Ann E. Rundell¹

¹Weldon School of Biomedical Engineering, Purdue University
²Department of Mathematics, Purdue University

Key terms: Global optimization; Parameter estimation; MAPK; Systems biology; Genetic algorithm

Abstract

Identifying parameter values in mathematical models of cellular processes is crucial in order to ascertain if the hypotheses reflected in the model structure are consistent with the available experimental data. Due to the uncertainty in the parameter values, partially attributed to the necessary model abstraction of any cellular process, parameters are pragmatically estimated by varying their values to minimize a cost function that represents the difference between the simulated results and available experimental data. Local searches for these parameter values rarely result in an adequate fit of the model to the data since the optimization gets caught in a local minimum near the initial guess. Typically, larger regions of the parameter space must be searched for acceptable parameter values. Most of the global algorithms use stochastic sampling of the parameter space; however, these methods are not computationally efficient and cannot guarantee convergence. Alternatively, adaptive sparse grid-based optimization samples the parameter space in a more systematic manner and employs selective evaluations of the cost function at support nodes to build an error-controlled interpolated approximation of the cost function from basis functions.

11.1 Introduction

Increasingly, mathematical models are being used to provide insight into cellular processes [1, 2]. The construction of these models is hampered by the sheer number of participating chemical species, the uncertainty and complexity of the interconnected signaling networks, and the complicated regulation of the genetic events within a living cell. Out of necessity, the model structure must explicitly represent only the dominant events and processes for a specific application. Determining the dominant events and processes a priori is usually not trivial; hence, the first step of determining if the model structure is suitable for the specific application typically depends on finding model parameters that produce simulations consistent with available experimental data and observations.

Most model parameter values are not known accurately, due to both experimental issues and omission of nonessential process details in the model. Experimentally, it is difficult to measure the concentrations, rates, and diffusion of elements within a living intact cell [3–5]. Enzyme-substrate association constants and kinase activity rates can sometimes be determined in a test tube, but there is no guarantee that these rates are the same inside a crowded cellular environment [6]. Furthermore, the inevitable abstraction of the process being modeled causes the majority of model parameters to incorporate the net effect of a multitude of events. As a result, parameter values typically can be determined only through optimization that minimizes the difference between the simulated model output and the experimental data. This process is straightforward for linear models through linear programming. However, most optimization tools are challenged by models of cellular processes since they can be highly nonlinear.

In rare cases where very good estimates of parameter values are available, a local search can be adequate to find parameter values that minimize the differences between model simulations and experimental data, which are quantified as the value of a cost function. A local search starts from an initial point and finds the direction that allows the largest decrease in model/data mismatch. The search ends at the nearest minimum in that direction, as shown in Figure 11.1. As Figure 11.1 demonstrates, the result of the local search is dependent on the initial point. Two of the three example points are caught in the nearest local minimum, a consequence termed the local minimum trap. If possible, a local search will modify the parameter values from the initial set to improve fitting; however, it cannot move out of the local minimum trap to find a global minimum. In contrast, global searches consider the entire parameter space when locating the global minimum.
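The local minimum trap can be reproduced with a few lines of code. The following is a minimal sketch in Python, not part of the chapter's MATLAB implementation; the one-dimensional cost function, the step size, and the three starting points are hypothetical choices made only for illustration. Plain gradient descent is used as the local search.

```python
import math

def cost(p):
    """Hypothetical 1-D cost function with several minima (illustration only)."""
    return 0.1 * (p - 2.0) ** 2 + math.sin(3.0 * p)

def cost_gradient(p):
    """Analytical derivative of the hypothetical cost function."""
    return 0.2 * (p - 2.0) + 3.0 * math.cos(3.0 * p)

def local_search(p0, step=0.01, iterations=5000):
    """Plain gradient descent: it only ever moves downhill from p0."""
    p = p0
    for _ in range(iterations):
        p -= step * cost_gradient(p)
    return p

# Three hypothetical initial guesses, labeled as in Figure 11.1.
starts = {"A": -1.0, "B": 4.0, "C": 1.0}
results = {name: local_search(p0) for name, p0 in starts.items()}

for name, p in results.items():
    print(f"start {name}: p = {p:.3f}, cost = {cost(p):.3f}")
```

With these assumed values, the searches from A and B stop in shallower minima on either side of the one reached from C, so only the initial guess C leads to the lowest cost.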

Many global optimization methods can be used to solve the parameter identification problem, but these can also suffer from the local minimum trap as well as from poor convergence rates [7] and are computationally costly. For a subset of systems, the problem can be solved using deterministic optimization methods that transform the problem into a convex function or the difference between convex functions [8]. In these systems, achieving the global minimum can be guaranteed [7, 8]. As the applicability of these deterministic strategies is limited for complex, nonlinear models, many researchers resort to global optimization techniques that sample the parameter space in a stochastic manner. Existing popular stochastic methods include simulated annealing, genetic algorithms (GA), and multiple shooting strategies. The GA, for example, uses evolution-based strategies to modify a population of parameter sets, with a higher probability of keeping sets with high fitness (low cost function values) than those with low fitness; the retained parameter sets become parent sets that are randomly combined to create children sets for evaluation in the next iteration [9]. Due to the probabilistic nature of these stochastic methods, the parameter set values and corresponding cost function value can vary considerably from run to run. In addition, these stochastic global optimization methods are computationally expensive and do not necessarily converge to a solution; hence, it can take a long time to discover that no solution exists.
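To make the run-to-run variability concrete, the sketch below implements a deliberately simplified genetic algorithm in Python. It is an illustration under stated assumptions, not the algorithm of [9]: the cost function is hypothetical, the fitter half of the population is always kept (truncation selection rather than probabilistic selection), and children are random blends of two parents plus Gaussian mutation.

```python
import math
import random

def cost(p):
    """Hypothetical 1-D cost function with several minima (illustration only)."""
    return 0.1 * (p - 2.0) ** 2 + math.sin(3.0 * p)

def genetic_search(seed, lower=-2.0, upper=6.0, pop_size=20, generations=30):
    """Simplified GA: keep the fitter half, breed the rest from random parents."""
    rng = random.Random(seed)
    population = [rng.uniform(lower, upper) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)                  # low cost = high fitness
        parents = population[: pop_size // 2]      # assumption: truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)          # random pairing of parents
            w = rng.random()
            child = w * a + (1.0 - w) * b + rng.gauss(0.0, 0.1)  # blend + mutation
            children.append(min(upper, max(lower, child)))       # stay in bounds
        population = parents + children
    best = min(population, key=cost)
    return best, cost(best)

# The outcome depends on the random seed; only a fixed seed is reproducible.
for seed in (1, 2, 3):
    best, value = genetic_search(seed)
    print(f"seed {seed}: p = {best:.4f}, cost = {value:.4f}")
```

Each call performs a fixed number of cost evaluations regardless of whether the population has converged, which is one reason such methods can be expensive when every evaluation requires a model simulation.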

Alternatively, the entire parameter space can be searched using a grid algorithm. Grid algorithms divide the parameter space in a patterned manner and evaluate the cost function at each grid point; see Figure 11.2. Local or global searches can be initiated from one or more of the best grid points to find acceptable parameter estimates. As the dimension of the uncertain parameter space increases, the number of model evaluations required to cover the entire space increases exponentially for optimization with an evenly spaced (full) or pseudo-randomly spaced (Latin hypercube sampling, LHS) grid [11]. Randomly spaced samples are not recommended due to inefficient clustering in some areas and inadequate sampling in others. However, adaptive sparse grid schemes avoid this exponential increase in points by selective positioning of support nodes.
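The growth of a full grid with dimension, and the construction of a Latin hypercube sample, can be sketched as follows. This is an illustrative Python sketch with hypothetical function names; the unit hypercube stands in for a scaled parameter space.

```python
import random

def full_grid_size(points_per_dim, d):
    """Number of cost evaluations for an evenly spaced (full) grid."""
    return points_per_dim ** d

def latin_hypercube(n, d, seed=0):
    """n samples in [0, 1]^d with exactly one sample per stratum in each dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(d):
        strata = list(range(n))
        rng.shuffle(strata)                       # random pairing across dimensions
        columns.append([(s + rng.random()) / n for s in strata])
    return [tuple(column[i] for column in columns) for i in range(n)]

# Ten points per dimension quickly becomes impractical for a full grid.
for d in (2, 4, 6, 8):
    print(f"d = {d}: full grid needs {full_grid_size(10, d)} evaluations")

samples = latin_hypercube(10, 2, seed=1)
print(samples[:3])
```

In this sketch the number of Latin hypercube samples is chosen freely, but each sample still costs one model evaluation, so covering a high-dimensional space densely remains expensive.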

11.1.1 Adaptive sparse grid interpolation

Recently, sparse grid interpolation approaches have been developed that support deterministic global optimization for the minimization of functions with bounded mixed derivatives [12]. These methods are currently being refined for efficiently solving large-dimension problems, with more than 10 uncertain parameters [13]. Sparse grid interpolation techniques were originally developed to reduce the computational cost for multivariate integrals [11, 14, 15]. A thorough review of sparse grid-based interpolation and integration is given in [16].

Figure 11.1 This plot illustrates the concept of a local minimum trap for a one-dimensional parameter space. A local search of this space is initialized at three different points: A, B, and C. Searches starting at points A and B will find a local minimum of the cost function, while a search starting at point C results in the global minimum.

Figure 11.2 Examples of grids sampling a two-dimensional parameter space: (a) Latin hypercube sampling; (b) full uniform grid; and (c) Chebyshev sparse grid, generated with the Sparse Grid toolbox [10]. Both axes of each panel span 0 to 1.

Adaptive sparse grid-based optimization utilizes the error-controlled interpolant as a surrogate of the cost function to search for the minimum. The process for optimizing with sparse grids requires model evaluations at selected grid points (support nodes) strategically positioned within the uncertain parameter space. An interpolated function is created by combining basis functions at the support nodes to approximate the cost function evaluated over the entire uncertain parameter space. The search for the best parameter values is performed on the interpolated function. Typically, a search along a polynomial-based interpolation function is significantly faster than a search involving repeated numerical integrations of a model; see Figure 11.3. Using sparse grid interpolation, the number of actual model evaluations is limited to just the number of support nodes. However, since the sparse grid technique relies on an approximation, the best results are usually obtained if a local search using the actual model is performed starting from the optimal values identified by the interpolation function.
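The surrogate idea can be illustrated in one dimension, where no sparse grid is needed. The Python sketch below is an assumption-laden illustration rather than the toolbox algorithm: the "expensive" cost function is a hypothetical stand-in for a model simulation, the support nodes are 17 Chebyshev extrema, the interpolant is evaluated with the barycentric formula, and a golden-section search plays the role of the final local search on the actual cost.

```python
import math

calls = {"n": 0}  # counts evaluations of the "expensive" cost function

def expensive_cost(p):
    """Hypothetical stand-in for a cost that requires a model simulation."""
    calls["n"] += 1
    u = p - 0.3
    return 1.0 - math.exp(-u * u) + 0.5 * u * u   # minimum at p = 0.3

def chebyshev_nodes(n, a, b):
    """n + 1 Chebyshev extrema mapped to the interval [a, b]."""
    return [0.5 * (a + b) + 0.5 * (b - a) * math.cos(math.pi * j / n)
            for j in range(n + 1)]

def barycentric_interpolant(nodes, values):
    """Polynomial interpolant through Chebyshev extrema (barycentric form)."""
    n = len(nodes) - 1
    weights = [(-1.0) ** j * (0.5 if j in (0, n) else 1.0) for j in range(n + 1)]
    def interp(x):
        num = den = 0.0
        for xj, fj, wj in zip(nodes, values, weights):
            if x == xj:
                return fj
            t = wj / (x - xj)
            num += t * fj
            den += t
        return num / den
    return interp

def golden_section(f, a, b, tol=1e-6):
    """Local search for a minimum of f bracketed by [a, b]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

lower, upper = -1.0, 1.0
nodes = chebyshev_nodes(16, lower, upper)
values = [expensive_cost(x) for x in nodes]       # the only evaluations so far
node_evaluations = calls["n"]
surrogate = barycentric_interpolant(nodes, values)

# Search the cheap surrogate densely, then refine on the actual cost.
dense = [lower + (upper - lower) * i / 2000 for i in range(2001)]
p_surrogate = min(dense, key=surrogate)
p_refined = golden_section(expensive_cost,
                           max(lower, p_surrogate - 0.1),
                           min(upper, p_surrogate + 0.1))
print(node_evaluations, calls["n"], p_surrogate, p_refined)
```

The 2,001 surrogate evaluations cost no model simulations; only the support nodes and the final refinement call the actual cost function.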

This adaptive sparse grid-based optimization approach and its computational efficiency rely heavily upon the construction of the interpolating function and selection of the support nodes. In brief, the construction of the interpolating function is based upon the tensor products of a univariate interpolation of the function, $f$, at the support nodes, $x^{i_k} \in \chi^{i_k}$ for $k \in [1, d]$, with a basis function, $a_{x^{i}}$:

\[
\left(U^{i_1} \otimes \cdots \otimes U^{i_d}\right)(f) = \sum_{x^{i_1} \in \chi^{i_1}} \cdots \sum_{x^{i_d} \in \chi^{i_d}} \left(a_{x^{i_1}} \otimes \cdots \otimes a_{x^{i_d}}\right) \cdot f\left(x^{i_1}, \ldots, x^{i_d}\right)
\]

Figure 11.3 Comparison of meshes created by actual cost function evaluations and from an interpolated cost function for a two-parameter search of a MAPK model [17]. (a) This mapping of the cost function was created from a 100 × 100 evenly spaced grid of parameter sets, for a total of 10,000 model evaluations. A local search on the best mapping point returned the actual parameter values, with an additional 18 model evaluations. (b) The 53 adaptive sparse support nodes used to create the interpolated function, generated by the Sparse Grid toolbox [10]. (c) An evenly spaced 100 × 100 grid of parameter sets was created and evaluated by the interpolated function, creating an identical mapping to (a) that only required the 53 model evaluations used to create the support nodes. A local search from the best support node took an additional 66 model evaluations to return the actual parameter values.

In the sparse grid approach introduced by Smolyak, the interpolating function is obtained by summing a selected set of such tensor products. The computational efficiency of this method results both from the fact that this selected sum requires a relatively small set of support nodes and from the fact that the sets of support nodes are nested as the interpolation depth increases. This nesting property greatly reduces the number of required function evaluations by reusing the support nodes upon increased sampling refinement of the grid for a higher interpolation depth. The interpolation depth is the degree of the polynomial, $k$, which the univariate interpolation function can exactly match.
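The nesting property and the resulting node counts can be checked numerically. The Python sketch below assumes the commonly used nested Chebyshev extrema (Clenshaw-Curtis) node sets, with 1 node at level 1 and $2^{l-1}+1$ nodes at level $l > 1$, and a non-adaptive Smolyak grid; the function names are hypothetical and the counts refer to this assumed construction, not to the adaptive grids of the toolbox.

```python
import math

def cc_nodes(level):
    """Chebyshev extrema on [0, 1] for a 1-D level >= 1 (assumed node family)."""
    if level == 1:
        return [0.5]
    m = 2 ** (level - 1) + 1
    return [0.5 * (1.0 - math.cos(math.pi * j / (m - 1))) for j in range(m)]

def new_nodes_per_level(level):
    """Nodes that appear at this level and not at any lower level."""
    if level == 1:
        return 1
    if level == 2:
        return 2
    return 2 ** (level - 2)

def sparse_grid_size(d, depth):
    """Support nodes of the d-dimensional Smolyak grid with sum(i_k - 1) <= depth."""
    counts = [1] + [0] * depth        # counts[s]: nodes with level excess s so far
    for _ in range(d):
        updated = [0] * (depth + 1)
        for s, c in enumerate(counts):
            for extra in range(depth - s + 1):
                updated[s + extra] += c * new_nodes_per_level(extra + 1)
        counts = updated
    return sum(counts)

def full_grid_size(d, depth):
    """Full tensor grid using the finest 1-D node set in every dimension."""
    return len(cc_nodes(depth + 1)) ** d

# Nesting: every node of one level reappears at the next level.
for level in range(1, 5):
    coarse = {round(x, 12) for x in cc_nodes(level)}
    fine = {round(x, 12) for x in cc_nodes(level + 1)}
    assert coarse <= fine

for d in (2, 5, 10):
    print(d, sparse_grid_size(d, 4), full_grid_size(d, 4))
```

Because the node sets are nested, increasing the depth only requires evaluating the cost function at the nodes counted by `new_nodes_per_level`; all earlier evaluations are reused.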

It has been shown that the error of the interpolating function strongly depends upon the degree of the bounded mixed derivative (smoothness) and is a weak function of the dimension of the problem, $O(N^{-k}(\log N)^{(k+1)(d-1)})$, where $N$ is the number of function evaluations performed on the sparse grid at the support nodes [15]. Hence, these methods are considered nearly optimal (up to a logarithmic factor) [15], and their error bounds are significantly better than those of quasi-Monte Carlo algorithms, $O(N^{-1}(\log N)^{d})$ [18]. A uniform sparse grid cannot avoid a logarithmic dependence of the error on dimension; however, adaptive sparse grids sample most densely along the dimension of greatest importance, as ascertained by the ability of samples in that direction to decrease the estimated interpolation error (Figure 11.4) [18]. This “problem-adjusted refinement” [19] most effectively reduces the computational costs for the optimization on models whose roughness is confined to a subset of the dimensions of the uncertain space, and it does no worse than the uniform sparse grid methods. This adaptive sparse grid-based optimization method is deterministic, so the identified parameter values and the quality of the results will not differ from one run to the next. Furthermore, it is anticipated that the quality of the results will improve with an increased sample size since the error of the interpolant approximation of the cost function mapping on the parameter space decreases as $N$ increases.
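The two asymptotic rates can be compared numerically. The short Python sketch below evaluates only the expressions inside the $O(\cdot)$ bounds, with the unknown constants set to 1 (an assumption), so the numbers indicate scaling with $N$ and are not actual interpolation errors.

```python
import math

def sparse_grid_rate(N, k, d):
    """N^(-k) * (log N)^((k+1)(d-1)), constant factor assumed to be 1."""
    return N ** (-k) * math.log(N) ** ((k + 1) * (d - 1))

def quasi_monte_carlo_rate(N, d):
    """N^(-1) * (log N)^d, constant factor assumed to be 1."""
    return N ** (-1) * math.log(N) ** d

# Hypothetical setting: smoothness k = 3, dimension d = 2.
for N in (1e2, 1e4, 1e6):
    print(f"N = {N:.0e}: sparse grid {sparse_grid_rate(N, 3, 2):.2e}, "
          f"quasi-Monte Carlo {quasi_monte_carlo_rate(N, 2):.2e}")
```

The logarithmic factor grows with dimension, so for large $d$ and small $N$ the sparse grid expression can exceed the quasi-Monte Carlo one; the advantage described in the text is an asymptotic statement for smooth cost functions.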


Figure 11.4 Examples of two-dimensional adaptive Chebyshev sparse grids, with increasing degree of adaptivity (0%, 60%, and 99%) from left to right, generated with the Sparse Grid toolbox [10]. This figure demonstrates that the parameter along the x-axis is more important to decreasing interpolation error than the parameter along the y-axis.

The method described in this chapter will determine, in a relatively efficient manner, the optimal parameter values to fit a model to all available experimental data. A number of factors must be considered to formulate the problem as a parameter identification experiment utilizing sparse grid-based interpolation. These factors involve ascertaining the dimension of the uncertain parameter space, the size of the uncertain parameter space, the form of the cost function, and the selection of the basis functions for the interpolating function. The dimension of the uncertain parameter space must be determined; that is, the number of parameter values to be found needs to be established. It is desirable to keep this dimension as small as possible; initially assign the parameter values for which reasonable estimates exist. For instance, total numbers of molecules or concentrations of certain elements may be experimentally established or obtained from other published models. In cases where there are too many parameters for which there is no good estimate of their values, the dimension of the problem can be reduced by conducting a local or global sensitivity analysis [20] about some initial starting guess to ascertain which parameters should be targeted for fitting the data [21, 22] (see the troubleshooting section, Section 11.6). The initial starting guess values can be roughly estimated by back-of-the-envelope calculations or obtained from published models of similar reactions or processes. Parameters with the lowest sensitivity ranks can be neglected and fixed at these initial guesses. For each remaining parameter, which will be labeled as “uncertain,” an estimated initial search range must be provided. The product of these search ranges defines the span of the uncertain parameter space. In our experience, a search range encompassing an order of magnitude below and above an initial guess will, in most cases, be large enough. The search for the potential parameter values should typically be conducted on the log of the uncertain parameter space to spread the support nodes more evenly over ranges that span orders of magnitude.
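A search range of one order of magnitude on either side of each initial guess, expressed in log space, can be set up as in the following Python sketch. The initial guesses and function names are hypothetical; the mapping from the unit hypercube is included because grid nodes are often generated on a normalized domain.

```python
import math

def log_search_range(initial_guesses, decades=1.0):
    """d x 2 list of [lower, upper] bounds in log10 units around each guess."""
    return [[math.log10(g) - decades, math.log10(g) + decades]
            for g in initial_guesses]

def unit_to_parameters(unit_point, log_bounds):
    """Map a point in [0, 1]^d to parameter values through the log10 bounds."""
    return [10.0 ** (lo + u * (hi - lo))
            for u, (lo, hi) in zip(unit_point, log_bounds)]

# Hypothetical initial guesses: a rate constant and a molecule count.
initial_guesses = [0.05, 200.0]
bounds = log_search_range(initial_guesses)

center = unit_to_parameters([0.5, 0.5], bounds)   # recovers the initial guesses
corner = unit_to_parameters([0.0, 1.0], bounds)   # lowest rate, highest count
print(bounds, center, corner)
```

Searching in log10 units means that evenly spaced nodes cover 0.005 to 0.05 as densely as 0.05 to 0.5, which is the intent of the log transformation described above.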

For parameter identification, the optimization problem typically minimizes a cost function that penalizes differences between the model simulations and the experimental data. The most commonly used cost function, when quantitative experimental data is available, is the weighted least squares error:

\[
F(p) = \log\left( \sum_{j=1}^{q} \sum_{i=1}^{n_j} w_{ij} \left[ y_j(p, t_i) - \tilde{y}_{ij} \right]^2 \right) \tag{11.1}
\]

where $q$ is the number of states with experimental data, $n_j$ is the number of experimental time points for state $j$, $\tilde{y}_{ij}$ is the data for state $j$ at time $i$, $y_j(p, t_i)$ is the simulated model output for state $j$ at time $i$ for parameter set $p$, and $w_{ij}$ is the weight for that point. For this construction of the cost function, it is important that the simulated data, $y_j(p, t_i)$, be converted into the same units as the experimental data, $\tilde{y}_{ij}$ (i.e., numbers of molecules, concentration, percent of total, and so forth). The weights are used to normalize and/or incorporate confidence information in the data points; the confidence in the experimental data is typically taken into account by making the weights the reciprocal of the standard deviation of the experimental data at each time point and state, while the maximum value of the data or simulations for each state is typically used for normalization when the values of the states differ significantly in magnitude. When quantitative data is not available, a qualitative cost function can be constructed that, on a smooth scale, penalizes or rewards attributes of the simulations. It is important that the cost function be continuous in order for the interpolating function to approximate it accurately without large numbers of support nodes. Abrupt jumps due to, for instance, if-else statements will severely increase the interpolation error, as the cost function is interpolated with continuous basis functions. It is also recommended to search over log space; taking the log of the cost function will increase its smoothness.
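Equation (11.1) translates directly into code. The Python sketch below assumes that the simulated outputs have already been sampled at the experimental time points and that data, simulations, and weights are stored as one list per measured state; the function name, the data values, and the small floor that guards against taking the log of zero for a perfect fit are all assumptions of this sketch.

```python
import math

def weighted_log_cost(simulated, measured, weights, floor=1e-300):
    """log( sum_j sum_i w_ij * (y_j(p, t_i) - ytilde_ij)^2 ), as in (11.1).

    simulated[j][i], measured[j][i], and weights[j][i] refer to state j at
    time point i. States may have different numbers of time points.
    """
    total = 0.0
    for y_j, data_j, w_j in zip(simulated, measured, weights):
        for y, data, w in zip(y_j, data_j, w_j):
            total += w * (y - data) ** 2
    return math.log(max(total, floor))   # floor avoids log(0); an assumption

# Hypothetical example: two measured states with two and one time points.
simulated = [[1.0, 2.0], [10.0]]
measured = [[1.5, 2.0], [12.0]]
weights = [[1.0, 1.0], [0.25]]           # e.g., down-weight the larger state
value = weighted_log_cost(simulated, measured, weights)
print(value)
```

In practice the `simulated` values would come from integrating the model for a given parameter set, with the outputs converted to the units of the data before the cost is evaluated.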

A wide variety of different basis functions can be used to support the construction of the interpolation function on the sparse grid, including piecewise linear functions, Chebyshev polynomials, polynomial chaos [23], and multiwavelet formulations [24]. Though the choice of basis function changes the placement of the support nodes, the construction of the interpolant is the same. Barthelmann et al. compared the two most popular basis functions: piecewise linear and Chebyshev polynomial interpolation [15]. They concluded from theory and computation that if the function to be approximated is three (or more) times differentiable, then polynomial basis functions are better in the sense that the interpolation converges to the correct answer more quickly as the number of support nodes increases. If the function to be interpolated is discontinuous, then convergence is slow for both. In general, the authors recommend Chebyshev polynomial interpolation [15]. Therefore, this chapter is written under the assumption that Chebyshev interpolation will be used.
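The difference in convergence between the two basis types can be seen in a one-dimensional experiment. The Python sketch below is an illustration under assumptions, not a reproduction of the comparison in [15]: the smooth test function is hypothetical, piecewise linear interpolation uses equidistant nodes, and polynomial interpolation uses Chebyshev extrema evaluated with the barycentric formula.

```python
import math

def f(x):
    """Hypothetical smooth test function on [-1, 1]."""
    return math.exp(x) * math.sin(3.0 * x)

def piecewise_linear(nodes, values):
    """Linear interpolation between ascending nodes."""
    def interp(x):
        for k in range(len(nodes) - 1):
            if nodes[k] <= x <= nodes[k + 1]:
                t = (x - nodes[k]) / (nodes[k + 1] - nodes[k])
                return (1.0 - t) * values[k] + t * values[k + 1]
        raise ValueError("x lies outside the interpolation range")
    return interp

def chebyshev_interpolant(n):
    """Degree-n polynomial interpolant of f at n + 1 Chebyshev extrema."""
    nodes = [math.cos(math.pi * j / n) for j in range(n + 1)]
    values = [f(x) for x in nodes]
    weights = [(-1.0) ** j * (0.5 if j in (0, n) else 1.0) for j in range(n + 1)]
    def interp(x):
        num = den = 0.0
        for xj, fj, wj in zip(nodes, values, weights):
            if x == xj:
                return fj
            t = wj / (x - xj)
            num += t * fj
            den += t
        return num / den
    return interp

def max_error(interp):
    """Largest absolute error on a dense set of test points."""
    test_points = [-1.0 + 2.0 * i / 1000 for i in range(1001)]
    return max(abs(interp(x) - f(x)) for x in test_points)

linear_error, chebyshev_error = {}, {}
for n in (4, 8, 16):
    equidistant = [-1.0 + 2.0 * j / n for j in range(n + 1)]
    linear = piecewise_linear(equidistant, [f(x) for x in equidistant])
    linear_error[n] = max_error(linear)
    chebyshev_error[n] = max_error(chebyshev_interpolant(n))
    print(n + 1, linear_error[n], chebyshev_error[n])
```

For this smooth function the polynomial error falls far faster than the piecewise linear error as nodes are added, which is the behavior that motivates the recommendation of Chebyshev interpolation; a discontinuous cost function would not show this advantage.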

11.3 Materials

The materials needed to apply this adaptive sparse grid-based optimization method for model parameter identification from available experimental data are described in Table 11.1. The specific implementation discussed in this chapter requires MATLAB and the Sparse Grid toolbox (http://www.ians.uni-stuttgart.de/spinterp) [10]. However, this method can be implemented with alternative coding packages, such as C++. Therefore, the required materials are described generically, with specifics provided in parentheses.

For the examples in this chapter, a published four-state, ordinary differential equation (ODE) model [17] of the mitogen-activated protein kinase (MAPK) cascade was used. For these illustrations, mock experimental data was generated by model simulation. The mock data consisted of seven time points of the simulations for two of the four states.


Table 11.1 Materials Needed for Optimization with Adaptive Sparse Grid Interpolation

Implementation

Hardware: Computer capable of running preferred model simulation software (MATLAB)

Software: Simulation software (MATLAB)

Software: Adaptive sparse grid algorithm (The Sparse Grid toolbox [10], installed, initialized, and modified slightly to store the best grid point for future use: the function spcgsearch was modified by inserting the code
    pb = x;
    cf_pb = fprev;
    save pb pb cf_pb
at line 106, which is immediately after the lines:
    % Determine optimization start point
    [x, fval] = spgetstartpoint(z, xbox, options);
    fprev = fval; )

Software: A local search algorithm such as the conjugate gradient method (fmincon)

Model code: Model and cost function files written in preferred software format (m-files). These files should output the cost function value for the model evaluated at a given set of uncertain parameter values.

Data: Experimental data to fit uncertain model parameters. A local and global sensitivity analysis can help ascertain if the data available is sufficient to identify uncertain model parameters [21, 22]. The data must be accessible by the preferred software (.mat file).

Specifics are provided in parentheses.

11.4 Methods

The general method for parameter identification with sparse grid-based optimization is outlined below with example code provided in Figure 11.5 using MATLAB and the Sparse Grid toolbox [10].

General Procedure

1. The objective of step 1 is to specify the range over which the search will be conductedfor the uncertain parameter value in each dimension. Create a matrix containing thelower bound and upper bound for each uncertain parameter (typically, an order ofmagnitude lower and an order of magnitude higher than the initial point,

respectively). Specific: The matrix should have size × 2 where d is the dimension, thenumber of uncertain parameters.

2. The objective of step 2 is to select the basis function type and establish the desiredgrid size and type. As the support node locations are a function of the basis functionsused to create the interpolating function, the basis function type must be indicated.To constrain the computational time, we recommend setting the maximum numberof grid points to 50 to 500 times the number of uncertain parameters, depending onhow long the model simulations take. One could instead specify a minimuminterpolation error to achieve, but this method could take a significant number ofmodel evaluations, which is unknown a priori. We highly recommend enablingdimension adaptivity since the computational effort required is no worse than thatfor a uniform grid, but for some models it can be significantly more efficient. Thedegree of adaptivity can be modified to ensure a moderate coverage of the uncertainparameter space if desired. Specific: Use spset to set these options.

3. The objective of this step is to evaluate the cost function value at each support node and to use these values to create the interpolating function. This requires an iterative procedure that adds and locates support nodes in the sparse grid, progressively improving the accuracy with which the interpolating function approximates the cost function, until the maximum number of grid points has been reached or the minimum relative or absolute error tolerance has been achieved. In addition, sort the grid points by cost function value (low to high) and determine the number of unique points per parameter. The sorted grid points and number of unique points can be used for further analysis. Specific: Use spvals to construct the grid and the interpolating function from the basis functions, and use the sort function to sort the grid points.

4. The objective of step 4 is to use the interpolated function from the previous step to estimate the “optimal” parameter set. A search is performed on the interpolated function, which serves as a surrogate for the cost function, to find the optimal parameter values that minimize the interpolated estimate of the cost function. Specific: Use the appropriate search function for the basis functions selected: spcgsearch for Chebyshev polynomials. This algorithm will find the grid point with the lowest cost function and run a local search on the interpolated function about that point. The result of this search we will denote as p_i, which has an interpolated cost function of cf_pi.

5. The objective of this step is to refine the sparse grid interpolated “optimal” parameter set, p_i, by searching for the nearest minimum of the actual cost function about the “optimal” point found in step 4. A local search using the original cost function and model is performed about p_i. This will result in a candidate parameter set we will denote as p_il, with a cost function of cf_pil. Specific: run fminunc from p_i, calling the cost function file.

Figure 11.5 Example code for implementing optimization with adaptive sparse grid interpolation using the Sparse Grid toolbox [10].

6. The objective of step 6 is to identify an alternative candidate for the “optimal” parameter set by starting from the support node with the lowest cost function value, which we denote as p_b, which has a cost function of cf_pb. Load the data file containing the best grid point, and, if the point differs from the returned optimal point, run a local search using the original cost function about p_b. This will result in a candidate parameter set we will denote as p_bl with a cost function of cf_pbl. Specific: load the data file pb.mat and run fminunc from the point, p_b, calling the cost function file.

7. The parameter set from these two local searches with the lowest cost function value is considered the optimal parameter set:

$$p^{*} = \begin{cases} p_{il}, & cf_{p_{il}} < cf_{p_{bl}} \\ p_{bl}, & cf_{p_{bl}} < cf_{p_{il}} \end{cases}$$

8. Examine the resulting simulation for consistency and feasibility (i.e., both quantitative and qualitative fit with experimental data). The objective of this final step is to confirm that minimizing the cost function resulted in an acceptable fit to the experimental data. If the fit is acceptable, the optimization process is complete.

9. The objective of step 9 is to search in other areas of the cost function space with low values, besides the one containing the best support node, if step 8 is not successful. Determine the distance between the sorted grid points with the lowest cost function values (for instance, the lowest 1%), where distance can be defined as the sum of the absolute percent change in each parameter over all parameters. Run local searches on the cost function from the points with the farthest distance from the best support node. If one of these searches results in an acceptable fit, the optimization process is complete.

10. If step 9 is unsuccessful, consult the troubleshooting section (Section 11.6) and consider increasing the maximum number of grid points by 2 to 10 times. Save the previous grid as “z,” add “‘PrevResults’, z” to the options, increase the number of ‘MaxPoints’, and return to step 2.
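The bookkeeping in steps 1 and 7 can be sketched independently of the Sparse Grid toolbox. The following is a minimal illustration in Python rather than MATLAB, and it is not the code of Figure 11.5. The function names `make_bounds` and `select_optimal`, the `factor` argument, and the numerical values are hypothetical; the one-order-of-magnitude default follows the suggestion in step 1 and assumes strictly positive parameters.

```python
def make_bounds(initial_point, factor=10.0):
    """Step 1: return a d x 2 list of [lower, upper] bounds, one row per parameter.

    Assumes strictly positive parameters, so that "an order of magnitude
    lower/higher" means dividing/multiplying the initial value by `factor`.
    """
    if any(p <= 0 for p in initial_point):
        raise ValueError("this sketch assumes strictly positive parameters")
    return [[p / factor, p * factor] for p in initial_point]


def select_optimal(p_il, cf_pil, p_bl, cf_pbl):
    """Step 7: keep whichever local-search candidate has the lower cost.

    Ties are resolved in favor of p_il; the text leaves this case open.
    """
    if cf_pil <= cf_pbl:
        return p_il, cf_pil
    return p_bl, cf_pbl


# Hypothetical three-parameter problem.
bounds = make_bounds([0.5, 2.0, 40.0])
print(bounds)

# Hypothetical results of the two local searches in steps 5 and 6.
p_star, cf_star = select_optimal([0.4, 2.2, 35.0], 1.8, [0.6, 1.9, 42.0], 1.2)
print(p_star, cf_star)
```

The bounds matrix has one row per uncertain parameter, matching the d × 2 layout requested in step 1.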


The anticipated result of the method described above is a parameter set that acceptably fits the available experimental data. Whether or not the returned parameter set is adequate, the function spvals of the Sparse Grid toolbox [10] returns a structure containing a significant amount of information that may be helpful for understanding the returned “optimal” set of parameter values as well as information about the cost function values mapped onto the uncertain parameter space. This information includes the number and locations of the grid points, the cost function values at those points, the minimum and maximum cost function values, the degree of adaptivity, estimated errors, and the computational time. From this information, it takes only a few extra steps to extract other useful information, namely the sorted grid points and unique points.



11.5.1 Sorted grid points

Grid points sorted from lowest to highest corresponding cost function value are returned in step 3 of the methods section (Section 11.4) (see Figure 11.5). Reviewing these points as a sorted list can provide some insight. For instance, this method can reveal disparate, equally valid areas of the uncertain parameter space, as shown in Figure 11.6. In this example, three parameters of a MAPK model [17] were fitted to an incomplete data set, consisting only of the MAPK data [dots in Figure 11.6(b)]. The resulting three-dimensional grid of the cost function values on the parameter space indicated two disparate regions with equally low cost functions [circled in Figure 11.6(a)]. Simulations with a set of parameter values from each region [Figure 11.6(b)] were nearly identical for three of the four states; however, the simulation of MAPKK (blue) showed a distinct difference in its peak. This information suggests that in order to determine which, if either, of the parameter sets is valid, experimental data for MAPKK, particularly at the 15-minute time point, is required. The sorted grid points can also be used to determine the size of the parameter space that results in acceptable dynamics, termed the acceptable space, which can reveal properties of the model, such as the amount of confidence that can be placed in the chosen parameter values [26, 27]. In addition, as noted in step 9 of the methods section (Section 11.4), in the event that the returned parameters are not adequate, the sorted grid points can provide alternative starting points for additional local or global optimizations that can refine the solution.
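As a rough illustration of how a sorted list of grid points can supply alternative starting points, the sketch below sorts points by cost and ranks the low-cost points by their distance from the best one, using the distance described in step 9 (sum of absolute percent changes over all parameters). It is written in Python rather than MATLAB; the function names, the `fraction` and `n_starts` arguments, and the example data are hypothetical, and the reference point is assumed to have no zero-valued parameters.

```python
def sort_by_cost(points, costs):
    """Return (points, costs) reordered from lowest to highest cost."""
    order = sorted(range(len(costs)), key=lambda i: costs[i])
    return [points[i] for i in order], [costs[i] for i in order]


def percent_distance(p, ref):
    """Sum over parameters of the absolute percent change relative to ref.

    Assumes every entry of ref is nonzero.
    """
    return sum(abs(a - b) / abs(b) * 100.0 for a, b in zip(p, ref))


def farthest_low_cost_points(points, costs, fraction=0.01, n_starts=2):
    """Among the lowest-cost fraction of grid points, return the n_starts
    points that lie farthest from the best grid point."""
    pts, _ = sort_by_cost(points, costs)
    n_low = max(2, int(len(pts) * fraction))
    best, candidates = pts[0], pts[1:n_low]
    ranked = sorted(candidates, key=lambda p: percent_distance(p, best),
                    reverse=True)
    return ranked[:n_starts]


# Hypothetical two-parameter grid with two separated low-cost regions.
points = [[1.0, 1.0], [1.1, 1.0], [5.0, 0.2], [1.0, 2.0]]
costs = [0.10, 0.12, 0.11, 9.0]
print(farthest_low_cost_points(points, costs, fraction=0.75, n_starts=1))
```

With only four points, a large `fraction` is used so that more than one candidate remains; on a grid of several thousand points the 1% suggested in step 9 would be a more natural choice.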

11.5.2 Unique points

With adaptive sparse grids, the number of unique points is the number of distinct locations of grid points along a parameter direction. This value can be obtained by applying the MATLAB unique function to a parameter’s grid points (see Figure 11.5, step 3). For three or fewer dimensions, the number of unique points can be seen by plotting the grid, as shown in Figure 11.7. Unique points correlate with each parameter’s importance in increasing the accuracy of the interpolant. This information is valuable because it demonstrates which parameters required the highest resolution for the interpolation. A use for unique points in aiding the optimization process is described in the troubleshooting section (Section 11.6.1) and demonstrated in the application notes (Section 11.8).

Figure 11.6 An example of a three-dimensional search of a MAPK model [17] that revealed two parameter sets that fit the mock data equally well, but predicted different dynamics for another model element. (a) Three-dimensional adaptive grid, generated with the Sparse Grid toolbox [10] and color-coded by cost function (red: high; blue: low). The two circled areas have similar cost functions when only the mock MAPK data is fitted. (b) Simulation results with parameter sets from each “optimal” area (solid: center area; dotted: right area), plotted as activated kinase (uM) versus time (min). While three of the four state simulations (red: MAPK; green: Raf; black: Rkip) are similar, different MAPKK (blue) dynamics are predicted, suggesting that MAPKK data would be required to distinguish between the two parameter sets.
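Counting unique points amounts to counting the distinct coordinate values along each parameter direction. A minimal Python sketch of that count follows (the text itself uses the MATLAB unique function); the function name and the small example grid are hypothetical, and coordinates are compared for exact equality, which assumes that repeated grid locations are stored as identical values.

```python
def unique_points_per_parameter(points):
    """Count the distinct coordinate values along each parameter direction."""
    d = len(points[0])
    return [len({p[j] for p in points}) for j in range(d)]


# Hypothetical two-parameter grid refined more finely along the first axis.
grid = [
    (0.50, 0.5), (0.00, 0.5), (1.00, 0.5),
    (0.25, 0.5), (0.75, 0.5),
    (0.50, 0.0), (0.50, 1.0),
]
print(unique_points_per_parameter(grid))  # [5, 3]
```

Here the first parameter has five unique points and the second has three, indicating that the adaptive refinement placed more resolution along the first direction.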

11.5.3 Unstable points

In the process of creating the sparse grid, the algorithm may return, or end with, integration errors when the model is integrated with particular parameter sets. In some cases, this error may be due to improper range setting (see Section 11.6) for certain parameters. For instance, a certain search range could allow a parameter to take a value that results in a division by zero. These unstable points should be carefully evaluated, as they often reveal weaknesses in the model structure that may need revision to ensure that the model is stable over the allowable parameter ranges.
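One way to collect unstable points for later inspection is to wrap the cost function so that every failing parameter set is recorded before the error propagates. The Python sketch below illustrates the idea under stated assumptions: the class name `RecordingCost` and the toy cost function are hypothetical, and a division by zero stands in for an integration error. The error is re-raised rather than replaced by a large cost value, in line with the recommendation in the Troubleshooting Table.

```python
class RecordingCost:
    """Wrap a cost function and log the parameter sets that raise errors."""

    def __init__(self, cost_function):
        self.cost_function = cost_function
        self.failures = []  # list of (parameter set, error description)

    def __call__(self, params):
        try:
            return self.cost_function(params)
        except (ArithmeticError, ValueError) as err:
            self.failures.append((list(params), repr(err)))
            raise  # do not substitute an arbitrarily high cost value


def toy_cost(params):
    """Hypothetical stand-in for a model-based cost; fails when k2 is zero."""
    k1, k2 = params
    return (k1 / k2 - 2.0) ** 2


cost = RecordingCost(toy_cost)
print(cost([4.0, 2.0]))  # 0.0
try:
    cost([1.0, 0.0])     # division by zero, recorded and re-raised
except ZeroDivisionError:
    pass
print(cost.failures)
```

The recorded parameter sets can then be examined to decide whether a range should be altered or the model structure revised.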

11.5.4 Interpretation and conclusions

As stated above, if the optimization process is successful, one can conclude that the parameter values found are adequate for fitting the model to experimental data. However, one cannot conclude that these values are physically correct or even unique. If the process is unsuccessful, one should examine the model structure to determine whether or not it is capable of recreating experimental data. One method for examining model structure is a parameter sensitivity analysis. Conducting a sensitivity analysis not only quantifies the sensitivity of the output with regard to the model parameter values but also provides information for directing parameter fitting [21, 22]. The output of a sensitivity analysis helps to identify dominant processes or elements and recognize events/elements that can be considered negligible, including parameters whose values have little impact on fitting the experimental data [21, 22].

Figure 11.7 This adaptive sparse grid (generated with the Sparse Grid toolbox [10]) of a three-parameter search of a MAPK model [17] demonstrates the concept of unique points. The parameter on the z-axis has three unique points, the y-parameter has five, and the x-parameter has 513. The points are color-coded by cost function (black and red: high; blue: low). Axes: log(Parameter X), log(Parameter Y), and log(Parameter Z).

11.6 Troubleshooting

For the cases where the methods described above do not result in a parameter set that allows the model simulations to acceptably fit the experimental data, the troubleshooting table lists some suggested approaches to deal with common issues. The simplest and first approach is to increase the number of support nodes as described in step 10 of the methods section (Section 11.4). In addition to these general recommendations, there are two special cases that are also considered: small problems (less than or equal to three uncertain parameters) and large problems (10 or more uncertain parameters), since certain troubleshooting techniques are more helpful, or only applicable, to problems with specific dimensions of the uncertain parameter space.

11.6.1 Troubleshooting special cases: small and large problems

Small problem: three or fewer parameters

1. The objective of this step is to make use of the ability to visually inspect the cost function in the parameter space when the space has three or fewer dimensions. The cost function itself can be visualized using a mesh for two-dimensional problems and a plot for one-dimensional problems, and the grid can be visualized for one- to three-dimensional spaces using a scatter plot. Visually examine the sparse grid (in the one-, two-, or three-dimension case) and the cost function plot (in the one- or two-dimension case):

i. Sparse grid: Evaluate the interpolated function at each of the grid points and then plot the results. Specific: use scatter or scatter3, with the color of each point corresponding to the cost function value at the point.

ii. Cost function plot: Create a one- or two-dimensional mapping of the cost function. Specific: use plot or mesh as appropriate.

2. The objective of this step is to analyze the plots from step 1 to determine whether or not an appropriate search space was used. For instance, in the example of a two-parameter search of a MAPK model [17] shown in Figure 11.8(a–c), it can be seen that the optimal point is beyond the lower bounds for both parameters. Both the mesh and the grid suggest that the ranges should be shifted down. In the next search, the ranges were corrected and the results are shown in Figure 11.8(d–f). The improvement in fit can be seen in the simulation results [Figure 11.8(d)]. Therefore, for this step, analyze the plots and update the parameter ranges as needed. If nothing can be concluded from the plots, refer to the Troubleshooting Table.
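The range check in step 2 is done visually in the text, but a simple numerical counterpart is to test whether the returned point sits on a bound and, if so, to shift that parameter's range toward the bound. The Python sketch below is one hypothetical way to do this; the function names, the tolerance, and the shift factor of 10 are assumptions, parameters are assumed to be strictly positive, and whether to shift or simply widen a range remains a judgment call.

```python
import math


def parameters_at_bounds(point, bounds, rel_tol=1e-6):
    """Return {index: 'lower' or 'upper'} for parameters sitting on a bound."""
    hits = {}
    for j, (value, (lo, hi)) in enumerate(zip(point, bounds)):
        if math.isclose(value, lo, rel_tol=rel_tol):
            hits[j] = "lower"
        elif math.isclose(value, hi, rel_tol=rel_tol):
            hits[j] = "upper"
    return hits


def shift_bounds(bounds, hits, factor=10.0):
    """Shift each flagged range by `factor` toward the bound that was hit."""
    new_bounds = [list(b) for b in bounds]
    for j, side in hits.items():
        lo, hi = bounds[j]
        if side == "lower":
            new_bounds[j] = [lo / factor, hi / factor]
        else:
            new_bounds[j] = [lo * factor, hi * factor]
    return new_bounds


# Hypothetical two-parameter search whose best point sits on a lower bound.
bounds = [[0.1, 10.0], [1.0, 100.0]]
best_point = [0.1, 37.0]
hits = parameters_at_bounds(best_point, bounds)
print(hits)                        # {0: 'lower'}
print(shift_bounds(bounds, hits))  # first range moved one decade lower
```

Only the flagged parameter's range is changed; the second range is left as it was.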

Large problem: 10 or more parameters

This troubleshooting solution takes advantage of the information contained in the number of unique points per parameter, which is calculated in the general procedure (see step 3 of the methods section and Figure 11.5). Typically, these extra steps using the number of unique points are not necessary for smaller-sized problems. Optimization issues with smaller problems are more commonly due to an issue described in the Troubleshooting Table. For a definition and description of unique points see Section 11.5. In brief, the number of unique points for a parameter is the number of unique locations of grid points for that parameter and correlates with the importance of the parameter in decreasing the interpolation error. Parameters with the lowest number of unique points are the least important (or can easily be fit with low-dimensional basis functions, such as cubic polynomials).

1. This step assumes that, at a minimum, steps 1 to 9 of the methods section have been completed and have not resulted in an acceptable fit of the model simulations to the data. Therefore, for the parameters with the lowest number of unique points, set their values to the corresponding values of the best grid point, p_b, returned by step 6. With the Sparse Grid toolbox [10], the lowest possible number of unique points is three when using Chebyshev polynomial basis functions, as the lowest interpolation depth allowed is three.

2. Start a new, lower dimension adaptive grid search for the remainder of the parameters, centering the ranges on the corresponding values of p_b and following the steps in the methods section. If the dimension of the new problem is three or fewer parameters, then examine the resulting grid for range appropriateness as in the Small Problem section above (if necessary, repeat the search with adjusted ranges). Save the returned best grid point, which we will denote p_b^new.

3. Create a new initial point by replacing the appropriate values of p_b with the appropriate values of p_b^new; this new initial point we will denote p_b^initial. Perform a local search on the cost function starting from p_b^initial, resulting in the parameter set p_bl with a cost function value of cf_bl.

4. If the fit is acceptable, the optimization is completed. If not, try increasing the maximum number of grid points by two to ten times and repeating the procedure.
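The index bookkeeping behind the large-problem steps (fixing the parameters with the fewest unique points, searching over the rest, and merging the two results into a new initial point) can be sketched as follows. This is a Python illustration, not the authors' MATLAB code; the function names, the `threshold` argument, and the example values are hypothetical, and the adaptive grid search and local search themselves are not shown.

```python
def split_by_unique_points(unique_counts, threshold=3):
    """Return (fixed, free) index lists.

    Parameters with at most `threshold` unique points are fixed at the
    best-grid-point values; the remainder are searched again.
    """
    fixed = [j for j, n in enumerate(unique_counts) if n <= threshold]
    free = [j for j, n in enumerate(unique_counts) if n > threshold]
    return fixed, free


def merge_points(p_b, free, p_b_new):
    """Overwrite the free entries of p_b with the values in p_b_new (step 3)."""
    if len(free) != len(p_b_new):
        raise ValueError("p_b_new must hold one value per free parameter")
    p_initial = list(p_b)
    for j, value in zip(free, p_b_new):
        p_initial[j] = value
    return p_initial


# Hypothetical four-parameter problem.
unique_counts = [3, 9, 3, 5]
p_b = [1.0, 2.0, 3.0, 4.0]   # best grid point from the first search
fixed, free = split_by_unique_points(unique_counts)
p_b_new = [2.5, 3.5]         # best grid point from the reduced search
print(fixed, free)                       # [0, 2] [1, 3]
print(merge_points(p_b, free, p_b_new))  # [1.0, 2.5, 3.0, 3.5]
```

The merged point is the starting point for the local search on the original cost function described in step 3.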



Troubleshooting Table

Issue: Method takes too long
Suggestion(s): Decrease the maximum number of grid points or the number of evaluations allowed by the local search algorithm.

Issue: Method returns integration errors
Suggestion(s): Examine parameter ranges: typically, certain parameters cannot be zero. Have parameter values automatically recorded when integration errors occur and then examine them to determine areas of instability. As appropriate, alter parameter ranges to avoid these areas or modify the model structure to eliminate the problem. Do not artificially set the cost function to an arbitrarily high value when these points occur, as this will interfere with the adaptive algorithm and with the interpolation.

Issue: Parameter sets with low cost function values do not result in a fit to the data
Suggestion(s): Redesign the cost function to more accurately reflect the data. Consider changing the weights of the LSE cost function or adding qualitative goals to the cost function.

Issue: Method returns lower or upper bounds for some parameters
Suggestion(s): Expand the ranges of these parameters beyond the boundaries, if possible.

Issue: Method does not produce an acceptable fit
Suggestion(s): Increase the maximum number of allowable support nodes. Decrease the problem dimension: run a global sensitivity analysis, such as extended FAST [28], fix the least sensitive parameters at the best guess, and search for the remainder. Consider an alternative model structure: the current structure may be incapable of producing the desired dynamics.


Figure 11.8 An example of a small (two-dimensional) parameter search of a MAPK model [17]. In this example, the initial search range did not include the optimal parameter values to match the mock data (blue and red stars). (a) The fit for the returned parameters (solid lines) compared to the mock data (stars) over the initial search range. (b) The mesh of the corresponding cost function. (c) The adaptive sparse grid, generated with the Sparse Grid toolbox [10] and color-coded by cost function (red: high; blue: low). The search range is shifted lower, based on the mesh and grid from the original search (b, c). (d) The fit for the returned parameters (solid lines) compared to the mock data (dots) with the shifted search range. (e) The mesh of the corresponding cost function. (f) The corresponding adaptive sparse grid, generated with the Sparse Grid toolbox [10] and color-coded by cost function (red: high; blue: low).


High dimensional nonlinear models are becoming common in biomedical applications because of their usefulness in understanding biological processes, predicting behaviors, and developing therapies. However, identifying appropriate model parameters is challenging. As parameters are typically unknown, they are most often fitted to limited experimental data. Parameter optimization is a well-researched field, and many algorithms exist; they can be classified as local or global and as stochastic or deterministic. Local algorithms are of little use for nonlinear models since their results are highly dependent on the starting location. Global algorithms are computationally expensive and typically have no guarantee of finding, or even converging to, the global minimum [29]. Exceptions exist: smooth, twice-differentiable functions that are convex/concave, or that can be converted into convex/concave problems, can be solved with global deterministic approaches such as branch and bound [7, 8]. However, it is highly unlikely for a cost function based on numerical integration of large, highly nonlinear models of cellular processes to be of that form. Until recently, the alternatives have been global, stochastic algorithms such as the genetic algorithm, which have no guarantee of convergence, or LHS/full grid initialization of local searches, for which the number of starting points grows exponentially with dimension.
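The exponential growth of a full grid can be made concrete with a short calculation. The Python sketch below compares the size of a full tensor grid with the grid-point budget of 50 to 500 times the number of uncertain parameters recommended in the methods section. The function names are hypothetical, and three points per dimension is an assumed, deliberately coarse resolution.

```python
def full_grid_size(points_per_dimension, d):
    """Number of nodes in a full tensor grid with the same resolution in
    every one of d dimensions."""
    return points_per_dimension ** d


def suggested_max_points(d, per_parameter=(50, 500)):
    """Range of the grid-point budget: 50 to 500 points per uncertain parameter."""
    return per_parameter[0] * d, per_parameter[1] * d


for d in (2, 3, 10, 18):
    print(d, full_grid_size(3, d), suggested_max_points(d))
```

For 18 parameters, even three points per dimension gives 3^18 = 387,420,489 full-grid nodes, whereas the recommended budget is between 900 and 9,000 grid points.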

The alternative presented in this chapter employs an adaptive sparse grid-based search. The adaptive sparse grid is designed to map the cost function onto the uncertain parameter space using interpolation with basis functions (typically polynomials) at support nodes. The error between the interpolating function and the cost function decreases with increased numbers of support nodes. This method has two benefits: grid points are placed in the most important locations (importance being defined by [30] as requiring higher level polynomials to reduce the error of the interpolating function) and the interpolant serves as a surrogate cost function. The former is like an informed LHS/full grid: entire dimensions can be mainly neglected if they are easy to fit with low dimension polynomials, slowing the increase in model evaluations needed with dimension. The latter allows searches without additional model evaluations, which, depending on the interpolation accuracy, can find an optimal point very close to the global minimum. Like the stochastic and local methods, adaptive sparse grid optimization does not guarantee finding the global minimum. However, unlike those methods, it does return valuable information about the uncertain parameter space, as described in Section 11.5. For instance, the unique points give an indication of parameter importance and can be used to improve the adaptive sparse grid search. In addition, as shown in Section 11.8, adaptive sparse grid-based optimization can result in larger, more consistent decreases in cost function values with increased numbers of model evaluations than the GA, even when the GA is followed by a local search.

Complicating factors in optimization searches include parameter correlations (where changes in one parameter can compensate for changes in another) and low parameter sensitivities (where changes in a parameter have little effect on the model output or cost function). As a result, some parameters may not be identifiable from a set of experimental data [22]. The recognition of these parameters can play a key role in parameter identification, as they should be neglected, and fixed at some best guess value, until further information can be obtained. Neglecting parameters decreases the dimension of the search, thereby increasing the likelihood of finding the global minimum with fewer model evaluations. In the authors’ experience (data not shown), parameters with very low sensitivity coefficients (as determined by extended FAST [28]) are more easily fit with low dimension polynomials; therefore, these parameters will have fewer unique points. However, this correlation will not always hold and is certainly not guaranteed for all problems. Future work will explore altering the adaptive scheme for selecting sparse grid points to incorporate information on the parameter sensitivities; this is expected to facilitate parameter identification.


The MAPK model published by Wolkenhauer et al. [17] was used as an example to demonstrate the described methods. Mock data was generated for two (MAPK and MAPKK) of the four elements (the remaining are Raf and Rkip) by simulating the model with the published parameter values and taking seven time points for each element from 0 to 25 minutes. The posed parameter identification problem attempted to identify all 18 model parameters from this mock data set. The results and computational efficiency of the adaptive sparse grid-based optimization method are compared to those of the GA. The sparse grid method, due to symmetry, automatically evaluates the center point of the parameter space. Therefore, in order to avoid biasing the sparse grid towards the actual parameter values, a new center point was created by selecting a random initialization point within an order of magnitude above and below the actual values. The uncertain parameter search range for both the sparse grid and GA was assigned with a lower limit set to an order of magnitude smaller than this initial point and the upper limit set to an order of magnitude larger.
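The construction of the randomized center point and its search range can be illustrated with a short Python sketch. The text does not state how the random point was drawn; sampling uniformly in log space is an assumption made here, as are the function names, the seed, and the three "true" parameter values, which are not taken from the MAPK model.

```python
import random


def random_center(true_values, rng, decades=1.0):
    """Scale each value by 10**u, with u drawn uniformly from
    [-decades, decades] (log-uniform sampling is an assumption)."""
    return [v * 10.0 ** rng.uniform(-decades, decades) for v in true_values]


def search_range(center, decades=1.0):
    """Lower and upper limits one order of magnitude below and above each
    center value (for decades=1)."""
    scale = 10.0 ** decades
    return [[c / scale, c * scale] for c in center]


rng = random.Random(0)             # fixed seed so the sketch is repeatable
true_values = [0.3, 12.0, 450.0]   # hypothetical parameter values
center = random_center(true_values, rng)
bounds = search_range(center)
for v, (lo, hi) in zip(true_values, bounds):
    print(f"true={v:g}  range=[{lo:.4g}, {hi:.4g}]")
```

Because the center lies within one order of magnitude of each actual value and the range extends one order of magnitude on either side of the center, the actual values remain inside the search range without sitting at its center.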

11.8.1 Comparison of adaptive sparse grid and GA-based optimization

The resulting cost function value (the least squared error or LSE) was calculated for the adaptive sparse grid-based optimization method and the GA for increasing numbers of model evaluations. The results are shown in Figure 11.9. For the adaptive sparse grid method, steps 1 to 7 of the methods section were followed to achieve the resulting cost function value (LSE), with the total number of model evaluations being the sum of the number of grid points and the number of evaluations performed by the local search(es). For the GA method, the GA was run at least five times (because of its stochastic nature, each outcome is different), each followed by a local search. The number of model evaluations in this case is the sum of the evaluations used by the GA and the local searches. The results of the local searches were averaged and the error bars in Figure 11.9 represent the standard deviations of the results.
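For readers unfamiliar with a least squared error cost, the sketch below shows one common form: a weighted sum of squared differences between simulated and measured values. It is a generic Python illustration and not necessarily the exact cost function used for Figure 11.9; the function name, the optional weights, and the numbers are hypothetical.

```python
def lse_cost(simulated, measured, weights=None):
    """Weighted sum of squared errors between simulation and data.

    With weights=None every data point contributes equally.
    """
    if len(simulated) != len(measured):
        raise ValueError("simulated and measured must have the same length")
    if weights is None:
        weights = [1.0] * len(measured)
    return sum(w * (s - m) ** 2
               for w, s, m in zip(weights, simulated, measured))


# Hypothetical simulated and measured values at three time points.
simulated = [1.0, 2.0, 4.0]
measured = [1.0, 3.0, 2.0]
print(lse_cost(simulated, measured))                   # 5.0
print(lse_cost(simulated, measured, [1.0, 1.0, 0.5]))  # 3.0
```

Changing the weights, as suggested in the Troubleshooting Table, alters how strongly each data point influences the fit.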

To illustrate the differences between the GA and the adaptive sparse grid performance, an example where the maximum number of points for each method was limited to approximately 5,000 is given. In addition, the use of the number of unique points per parameter, as described under “Large problem” in Section 11.6, is demonstrated.

11.8.2 Adaptive sparse grid-based optimization

An 18-dimensional, 2,017-point grid was created, resulting in an optimal point with a cost function of 1.98E4. Eleven of the 18 parameters were found to have three unique points each; the remaining seven had five to nine each. The parameters that had three unique points were set at the returned values. A second, seven-dimensional grid with 2,031 points was created to search over the remaining parameters. The returned optimal point had a cost function of 1.48E4. The values for the 11 parameters returned by the first grid and the seven parameters returned by the second grid were combined into an initial point for a local search on the actual cost function. This local search, using 532 model evaluations, returned an optimal point with a cost function of 1.20E4. The resulting simulations with this parameter set are shown in Figure 11.10(a), which shows that the simulations are slightly shifted from the mock data but otherwise are quite similar and consistent with the observed trends in the mock data. A total of 4,580 model evaluations were used (2,017 + 2,031 + 532). In order to improve the fit, the next step would be to increase the number of model evaluations by 5 to 10 times.

11.8.3 Genetic algorithm

The GA was run five times and a local search was run from each returned point. With an average of 5,517 model evaluations (5,000 due to the GA and an average of 517 due to the local search), this method returned an average cost function of 8.57E4, with a standard deviation of 1.33E4. The results are shown in Figure 11.10(b). Figure 11.9 suggests that the fitting could be improved by increasing the number of model evaluations, but again, the GA would have to be run multiple times in order to have a greater chance of seeing an improved fit.

This 18-dimensional example demonstrates that the adaptive sparse grid-based optimization improves the fit of the model simulations to the experimental data set with increasing numbers of model evaluations, whereas the GA fitness improved on average but required multiple runs to assure this progress. The example also demonstrates the utility of the unique points identified by the adaptive sparse grid approach; the parameters least sensitive to reducing the error of the interpolant were temporarily fixed while the most important parameters were identified in a subsequent search. This sparse grid process provided information to refine the parameter identification problem to lead to an acceptable solution while the GA failed to identify a reasonable solution even with multiple attempts. This inability of the GA to find acceptable parameter values may be inappropriately interpreted as evidence that the modeled hypotheses are inconsistent with the experimental data. However, with the same number of total model evaluations, the adaptive sparse grid-based optimization was able to find parameter values that supported the modeled hypotheses.

Figure 11.9 For a MAPK model [17], a comparison of the performance, indicated by the least squared error (LSE) between the model simulations and the mock data set, of the adaptive sparse grid-based optimization (blue) and the GA (red). The adaptive sparse grid method consistently performed better than the GA for larger numbers of model simulations. The GA results are the average of at least five runs, with the error bars representing the standard deviation of the results. The adaptive sparse grid method followed steps 1 to 7 of the methods section while the GA was run with an increasing number of allowed generations and/or population sizes, followed by a local search on the result.

11.9 Summary Points

The adaptive sparse grid-based optimization approach described herein has a number of advantages over stochastic global optimization techniques.

• There can be a large variance in the resulting parameter values across multiple runs of a stochastic optimization approach, whereas the adaptive sparse grid-based optimization approach will always return the same parameter values when given the same maximum number of model evaluations.

• With an increased number of model evaluations with the adaptive sparse grid-based optimization, the error of the interpolant approximation of the cost function mapping on the parameter space decreases, so eventually the parameter values returned will minimize the cost function value. In contrast, the probabilistic sampling of the parameter space by stochastic optimization methods does not provide any assurance that the solution will improve with more model evaluations.

• The interpolant mapping of the cost function on the uncertain parameter space and the unique points generated during the adaptive sparse grid search may provide insight to refine and improve the identification process.

• While the GA method can lead to incorrectly discarding a model hypothesis due to its inability to find well-fitting model parameters, adaptive sparse grid-based optimization allows a more thorough examination of the parameter space for a better evaluation of the appropriateness of the model structure.

Figure 11.10 Fitting the 18 parameters of a MAPK model [17] to a mock data set (stars) using approximately 5,000 model evaluations, plotted as concentration (nM) versus time (min). Red: MAPK; blue: MAPKK. (a) Results of adaptive sparse grid-based optimization. (b) Results of five independent GA runs followed by local searches. The inability of the GA to find acceptable parameter values prematurely suggests the modeled hypotheses may be inconsistent with the experimental data.

Acknowledgments

This work was supported in part by a National Science Foundation Graduate Research Fellowship.

References

[1] Dharuri, H., L. Endler, N. Le Novere, R. Machne, B. Shapiro, C. Li, C. Laibe, and N. Rodriguez, “Database of Systems Biology Markup Language Models,” http://www.ebi.ac.uk/biomodels/, 2006–2008. Accessed June 17, 2008.

[2] Orton, R.J., “Compilation of useful modeling links including model databases,” http://www.brc.dcs.gla.ac.uk/projects/bps/links.html. Accessed June 17, 2008.

[3] Aldridge, B.B., J.M. Burke, D.A. Lauffenburger, and P.K. Sorger, “Physicochemical modelling of cell signalling pathways,” Nature Cell Biology, Vol. 8, November 2006, pp. 1195–1203.

[4] Wolkenhauer, O., and M. Mesarovic, “Feedback dynamics and cell function: Why systems biology is called Systems Biology,” Mol. Biosyst., Vol. 1, May 2005, pp. 14–16.

[5] Sachs, K., O. Perez, D. Pe’er, D.A. Lauffenburger, and G.P. Nolan, “Causal protein-signaling networks derived from multiparameter single-cell data,” Science, Vol. 308, April 22, 2005, pp. 523–529.

[6] Balaban, R.S., “Modeling mitochondrial function,” Am. J. Physiol. Cell Physiol., Vol. 291, 2006, pp. 1107–1113.

[7] Eposito, W.R., and P.W. Zandstra, “Global optimization for the parameter estimation of differential algebraic systems,” Ind. Eng. Chem. Res., Vol. 39, 2000, pp. 1291–1310.

[8] Pardalos, P., H.E. Romeijn, and H. Tuy, “Recent developments and trends in global optimization,” Journal of Computational and Applied Mathematics, Vol. 124, 2000, pp. 209–228.

[9] Goldberg, D.E., Genetic Algorithms in Search, Optimization, and Machine Learning, Reading, MA: Addison-Wesley, 1989.

[10] Klimke, A., and B. Wohlmuth, “Algorithm 847: spinterp: Piecewise multilinear hierarchical sparse grid interpolation in MATLAB,” ACM Transactions on Mathematical Software, Vol. 31, 2005.

[11] Bungartz, H.-J., and S. Dirnstorfer, “Higher order quadrature on sparse grids,” ICCS 2004, 2004, pp. 394–401.

[12] Ferenczi, I., “Global Optimization Using Sparse Grids,” Technische Universität München, 2005, p. 140.

[13] Klimke, A., “Sparse grid surrogate functions for nonlinear systems with parameter uncertainty,” Proceedings of the 1st International Conference on Uncertainty in Structural Dynamics, 2007, pp. 159–168.

[14] Gerstner, T., and M. Griebel, “Numerical integration using sparse grids,” Numerical Algorithms, Vol. 18, 1998, pp. 209–232.

[15] Barthelmann, V., E. Novak, and K. Ritter, “High dimensional polynomial interpolation on sparse grids,” Advances in Computational Mathematics, Vol. 12, 2000, pp. 213–288.

[16] Bungartz, H.-J., and M. Griebel, “Sparse grids,” Acta Numerica, Vol. 13, 2004, pp. 147–296.

[17] Wolkenhauer, O., S.N. Sreenath, P. Wellstead, M. Ullah, and K.H. Cho, “A systems- and signal-oriented approach to intracellular dynamics,” Biochem. Soc. Trans., Vol. 33, June 2005, pp. 507–515.

[18] Gerstner, T., and M. Griebel, “Dimension-adaptive tensor-product quadrature,” Computing, Vol. 71, 2003, pp. 65–87.

[19] Klimke, A., “Uncertainty modeling using fuzzy arithmetic and sparse grids,” Universität Stuttgart, 2006.

[20] Zheng, Y., and A. Rundell, “Comparative study of parameter sensitivity analyses of the TCR-activated Erk-MAPK signaling pathway,” IEE Systems Biology, in press, 2006.

[21] Donahue, M.M., W. Zhang, M. Harrison, J. Hu, and A.E. Rundell, “Employing optimization and sensitivity analyses tools to generate and analyze mathematical models of T cell signaling events,” Data Mining, Systems Analysis and Optimization in Biomedicine, Gainesville, FL, 2007, pp. 43–63.

[22] Yue, H., M. Brown, J. Knowles, H. Wang, D.S. Broomhead, and D.B. Kell, “Insights into the behaviour of systems biology models from dynamic sensitivity and identifiability analysis: a case study of an NF-kappaB signalling pathway,” Mol. Biosyst., Vol. 2, December 2006, pp. 640–649.

[23] Xiu, D.B., and J.S. Hesthaven, “High-order collocation methods for differential equations with random inputs,” SIAM Journal on Scientific Computing, Vol. 27, 2005, pp. 1118–1139.

[24] Le Maitre, O.P., H. Najm, P. Pebay, R. Ghanem, and O.M. Knio, “Multi-resolution-analysis for uncertainty quantification in chemical systems,” SIAM J. Sci. Comput., Vol. 19, 2007, pp. 864–889.

[25] Wolkenhauer, O., M. Ullah, P. Wellstead, and K.H. Cho, “The dynamic systems approach to control and regulation of intracellular networks,” FEBS Lett., Vol. 579, March 21, 2005, pp. 1846–1853.

[26] Brown, K.S., and J.P. Sethna, “Statistical mechanical approaches to models with many poorly known parameters,” Physical Review E, Vol. 68, 2003.

[27] Gutenkunst, R.H., F.P. Casey, J.J. Waterfall, C.R. Myers, and J.P. Sethna, “Extracting falsifiable predictions from sloppy models,” Ann. N.Y. Acad. Sci., Vol. 1115, 2007, pp. 203–211.

[28] Saltelli, A., S. Tarantola, and K.P.-S. Chan, “A quantitative model-independent method for global sensitivity analysis of model output,” Technometrics, Vol. 41, 1999.

[29] Moles, C.G., P. Mendes, and J.R. Banga, “Parameter Estimation in Biochemical Pathways: A Comparison of Global Optimization Methods,” Genome Res., Vol. 13, 2003, pp. 2467–2474.

[30] Bungartz, H.-J., and S. Dirnstorfer, “Multivariate quadrature on adaptive sparse grids,” Computing, Vol. 71, 2003, pp. 89–114.

Related sources and supplementary information

The Sparse Grid toolbox for MATLAB used in this chapter is available at http://www.ians.uni-stuttgart.de/spinterp/.



CHAPTER 12

Reverse Engineering of Biological Networks

Heike E. Assmus, Sonja Boldt, and Olaf Wolkenhauer

Systems Biology and Bioinformatics, Department of Computer Science, University of Rostock, Albert Einstein Str. 21, 18051 Rostock, Germany; e-mail: [email protected] [email protected], www.sbi.uni-rostock


Key terms: Reverse engineering, Network inference, Network structure, Network topology, Differential equations, Power law modeling, Parameter estimation, Network biology

Abstract

The function of each single living cell as well as the complexity of any living system is the result of the interactions taking place between networks of biological entities, such as proteins, metabolites, genes, cells, organisms, groups of organisms, and so on. It is one of the foremost aims of biological research to identify and study these networks. This investigation of biological networks means not only to discover what components they are made of but also which of these components actually interact and what kind of interactions they share, and finally, how all this results in the biological processes that we observe, be it in the test tube or in the environment around us. In this chapter, we introduce approaches to infer the structure and functionality of biological networks, starting briefly with some logical and statistical methods but then focusing on those that involve differential equations. Furthermore, the final section highlights some possibilities for the exploration of the inferred networks.

12.1 Introduction: Biological Networks and Reverse Engineering

12.1.1 Biological networks

In biological systems, a variety of networks can be distinguished on different hierarchical and organizational levels. Networks that occur inside the cell, the smallest unit of any organism (and equivalent to the organism in the case of unicellular organisms), are called cellular networks.

Metabolic networks are networks of small biochemical molecules (the metabolites) that are connected by biochemical reactions (i.e., conversions between the metabolites, which are usually catalyzed by enzymes, a special class of proteins). The predominant outcomes of metabolism are energy exchange (ATP production/consumption) and de novo synthesis of the biomolecules that the cell needs for self-maintenance or for communication with other cells. Figure 12.1(a) shows the glycolytic pathway as an example.

Signal transduction networks are networks of signaling proteins. The connections between them are also biochemical reactions, often reversible modifications of the proteins contributing to the signaling, such as phosphorylation and dephosphorylation. Scaffolding proteins help to spatially organize these modifications, and metabolites also play a role in signaling by providing the energy needed to modify the proteins. One of the most comprehensively studied signaling pathways to date is the EGFR-signaling map [1]. Another well-known example is the canonical Wnt-pathway [see Figure 12.1(b)]. It highlights the main features of a signaling pathway, which usually begins with an extracellular signal that is transferred through the cell membrane into the cytosol, where it is relayed and results in the activation or deactivation of a transcription factor.

Figure 12.1 There are different types of cellular networks. (a) Glycolytic pathway as an example of metabolic networks. (b) The canonical Wnt-pathway as an example of signal transduction networks.

The third type of cellular network is the gene regulatory network. A gene regulatory network (GRN) is the structured set of information necessary to specify when and how one or a group of genes is to be expressed. GRNs contain connections that describe the regulatory relationships between genes. Gene regulatory networks include interactions between proteins and DNA [Figure 12.2(a)]. Often, only the genes are shown [Figure 12.2(b)]. Sometimes, the pathways that interconnect them are also included. These pathways consist of proteins that are expressed as the result of switched-on genes (gene products), parts of the signaling network (pathways with few or many steps and with transcription factors at their end), and finally the genes that are regulated by these transcription factors (target genes). A gene product can be a transcription factor and directly regulate the next gene(s), but usually there is indirect regulation (i.e., the gene product is secreted and acts as a signal to a membrane receptor, which starts a new signaling cascade that peaks in the activation and/or translocation to the nucleus of yet another transcription factor).

Established cellular networks are collected in (curated) databases. One widely used database is KEGG (www.genome.jp/kegg/pathway.html). It currently holds about 120 metabolic pathway maps, each consisting of 17 reactions on average. Some other well-known databases for metabolic and signaling pathways are BioCarta (www.biocarta.com), Reactome (www.reactome.org), BioCyc (www.biocyc.org), and MetaCyc (www.metacyc.org). The Pathway Interaction Database (PID; pid.nci.nih.gov) is for signaling pathways only, and it currently contains 87 curated human signaling pathways. For some organisms, there are databases that provide transcription or gene regulatory network information, such as EcoCyc-DB (ecocyc.org) and RegulonDB (regulondb.ccg.unam.mx) for E. coli. Eukaryotic transcription-regulating proteins and sequences are collected in TRANSFAC. The 2008 update of the molecular biology database collection lists over 1,000 databases, and more than 20 of them are for pathways [2]. Pathguide (www.pathguide.org) contains information about 240 biological pathway resources.

The data needed to infer (and populate) networks differs for each network type. For metabolic networks, it comes from metabolomics, enzymology, photospectrometry, mass spectrometry, and so on, whereas gene regulatory networks are often based on data from microarray experiments, and the input for signal transduction networks (i.e., data describing the activation status across a set of proteins) comes from various sources (e.g., Western blots or microarrays). To be sufficient for network inference, the data must reflect the time course of changes in a network (time series data) or show the response of the network to several different perturbations. How such data is experimentally generated is described briefly in Section 12.2, which focuses on large-scale high-throughput experimental techniques.

Figure 12.2 Gene regulatory networks are cellular networks. (a) An example of a gene regulatory network with two components. Protein 1 activates expression of gene 2 by binding to its promoter, and it inhibits its own expression by the same mechanism. Regulatory regions are usually upstream of the genes' coding regions. Besides promoter regions, they also include operators (for more than one gene), enhancers, and silencers (which can also be downstream). (b) In the summarized depiction of the same network, genes are nodes, and regulation is represented by directed sign-labeled arrows.

Besides these cellular networks, there are networks that describe the interactions of cells with each other and networks that describe the interactions between whole populations of microorganisms, plants, or animals. Examples of the former are neuronal networks, which can be depicted in wiring diagrams, such as the synaptic connectivity map published by Hall and Russell in 1991 [3], which shows the overall pattern by which the tail neurons of the nematode (roundworm) C. elegans interact. Ecological networks, also called food webs, are examples of networks describing the interactions between species populations. The components of such noncellular networks are bigger than the molecules that form cellular networks and are therefore more amenable to direct observation. Synaptic connectivity maps can be determined, for example, by microscopy and photography of the neuronal architecture, or by tracking the propagation of a signal through a combination of labeling, microscopy, and voltage-clamp recordings (a method for studying synaptic transmission). Nevertheless, both types of networks may also be inferred by using approaches that are described in Section 12.3 (i.e., by deducing the underlying network structure from its observed output).

This chapter focuses on cellular networks only, although some of the described approaches can equally be applied to the other types of biological networks that were just mentioned.

12.1.2 Network representation

For visualizing biological networks, the network components can be represented by simplified depictions or even reduced to a labeled symbol. It is more complicated to illustrate the interactions between the components of the network, because such an illustration should convey to the viewer not only which components interact but also some qualitative information about the interaction.

Different schematic representations of biological networks are in use (e.g., in Figure 12.1). One of the simplest is the representation of networks as graphs. A graph representation reduces the elements of a network to nodes (vertices, junctions) and their pairwise relationships or interactions to edges (arcs, lines), which connect pairs of nodes [see Figure 12.3(a, b)]. The nodes of cellular systems may be genes, mRNA, proteins, or other molecules. Usually, one node represents one component of the network, and an edge represents the interaction between two of the components. Edges can be undirected (i.e., they simply connect two network components [Figure 12.3(a)]) or directed (i.e., they also imply some sort of causal or other asymmetry between the two connected components [Figure 12.3(b)]). In synaptic wiring diagrams, for example, the edges are always directed, as the synapses that form the connections between two neurons clearly define them as pre- and postsynaptic.

Figure 12.3 Networks can be represented in many ways (e.g., as graphs or matrices). (a) Undirected graph with three nodes and corresponding adjacency matrix. (b) Directed acyclic graph with five nodes and corresponding adjacency matrix.
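The correspondence between a graph and its adjacency matrix can be illustrated with a minimal sketch. The node indices and edge lists below are hypothetical and are not the graphs drawn in Figure 12.3; the convention that entry A[i, j] = 1 denotes an edge from node i to node j is likewise an assumption made for this example.

```python
import numpy as np

def adjacency_matrix(n_nodes, edges, directed=False):
    """Build an n x n 0/1 adjacency matrix from (source, target) index pairs.

    Assumed convention: A[i, j] = 1 means there is an edge from node i to node j.
    """
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    for i, j in edges:
        A[i, j] = 1
        if not directed:
            # An undirected edge is stored in both directions.
            A[j, i] = 1
    return A

# Hypothetical undirected graph with three nodes (0-1, 1-2).
undirected = adjacency_matrix(3, [(0, 1), (1, 2)])

# Hypothetical directed acyclic graph with five nodes.
dag_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
directed = adjacency_matrix(5, dag_edges, directed=True)

print(undirected)
print(directed)
```

The undirected matrix is symmetric, whereas the directed one is not. Because the nodes of this particular acyclic example are numbered in a topological order, its matrix is strictly upper triangular.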

One way of describing gene regulatory networks is to provide a directed graph. In a graph representation of GRNs, the nodes are genes and the edges describe the relations between pairs of genes (see again Figure 12.2). Different types of functions may be associated with these edges to indicate a regulatory relationship between genes and to describe the restrictions controlling the flow of information in the GRN. Sometimes, two different types of nodes are used for the transcription factors and for the genes they regulate.
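As a minimal sketch, the two-component network described in the caption of Figure 12.2 (protein 1 activates gene 2 and inhibits its own expression) can be stored as a signed adjacency matrix. The gene names, the sign encoding (+1 for activation, -1 for inhibition, 0 for no regulation), and the row/column convention are assumptions made for illustration.

```python
import numpy as np

genes = ["gene1", "gene2"]

# (regulator, target, sign): +1 = activation, -1 = inhibition (assumed encoding).
regulations = [
    ("gene1", "gene2", +1),  # product of gene 1 activates gene 2
    ("gene1", "gene1", -1),  # product of gene 1 inhibits its own expression
]

# S[i, j] holds the sign of the regulation exerted by gene i on gene j.
S = np.zeros((len(genes), len(genes)), dtype=int)
for regulator, target, sign in regulations:
    S[genes.index(regulator), genes.index(target)] = sign

print(S)
```

The negative diagonal entry encodes the self-inhibition loop; a row of zeros (here for gene 2) marks a gene that regulates nothing in the network.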

Graphs are not the only way to depict networks of interactions between biological entities. In fact, they are less suitable for metabolic networks, unless one chooses to employ bipartite graphs, which include the enzymes that catalyze the reactions and thereby allow showing reactions with more than one substrate and/or product. Stoichiometry matrices represent metabolic networks in a very concise way. They also allow stoichiometric analysis (see Section 12.4.3). Process diagrams are suited to represent metabolic and signaling networks [4]. Kohn maps are another possibility for schematically representing biological networks [5]. The latter two differ in their illustrative power, and each has certain limitations. The Systems Biology Graphical Notation (SBGN) initiative is working on the standardization of the schematic representation of essential biochemical and cellular processes that are studied in systems biology (www.sbgn.org).
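To illustrate why a stoichiometry matrix handles reactions with several substrates naturally, the following sketch builds one for a hypothetical two-reaction pathway (R1: A → B; R2: B + C → D). The metabolite and reaction names, the flux values, and the layout (rows are metabolites, columns are reactions, negative entries mark consumption) are assumptions for this example. The last lines use the standard mass-balance relation dx/dt = N v, which is not derived in this section.

```python
import numpy as np

metabolites = ["A", "B", "C", "D"]

# Hypothetical reactions given as {metabolite: stoichiometric coefficient};
# negative = consumed, positive = produced.
reactions = {
    "R1": {"A": -1, "B": +1},           # A -> B
    "R2": {"B": -1, "C": -1, "D": +1},  # B + C -> D (two substrates)
}

# Rows: metabolites, columns: reactions.
N = np.zeros((len(metabolites), len(reactions)), dtype=int)
for col, stoich in enumerate(reactions.values()):
    for metabolite, coeff in stoich.items():
        N[metabolites.index(metabolite), col] = coeff

# Assumed reaction rates (fluxes) for R1 and R2, in arbitrary units.
v = np.array([2.0, 0.5])

# Net rate of change of each metabolite concentration.
dxdt = N @ v
print(N)
print(dict(zip(metabolites, dxdt)))
```

Reaction R2 occupies a single column with two negative entries, which is something a simple graph with one edge per node pair cannot express.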

12.1.3 Motivation and design principles

When cellular networks, such as the ones given in the above examples, are published, are included in databases, become accepted knowledge, and are taught to students, they are the consolidated result of a long process that led from numerous empirical observations and measurements to a theoretical model of how their individual components interact. Through the process, the translation of data into predictive models is realized. But why is this network identification so important?

Besides the general curiosity of human beings to learn about the world around them and to investigate how things are related, there is a practical need to establish models of the objects or processes in question, either in our mind, or in a kind of scheme or verbal description, or as a detailed formal or mathematical representation. These theoretical models are necessary in order to subject them to systematic probing and analysis, to find common or unique design principles and learn from these principles, to form and test hypotheses on the model networks, and ultimately, to apply the lessons learned by doing so to the real system (i.e., manipulate the real biological system to achieve a certain desired behavior). This can be in order to prevent or cure diseases (development of vaccines, drugs, therapies), to heal or regenerate (applications in regenerative medicine), or to manipulate organisms or our environment (biotechnological or agricultural applications).

Reverse engineering, as defined and discussed in this chapter, is the process of network inference. It comprises the identification of a network that represents a biological system, the formal description of this network (the examples given in Section 12.1.1 are such formal network descriptions; they are the result of extensive reverse engineering), and the discovery of organizing principles in the networks that bring about the properties and behavior of living cells and enable them to meet the demands for robustness in uncertain environments. Just as an engineer designs a device according to the tasks that it has to perform, in reverse engineering, the goal is the reverse, namely to discover the design that is behind the characteristics of the “device” (i.e., the cell) for which evolution was the engineer. The task of mapping an unknown network is known as reverse engineering [6]; it is illustrated in Figure 12.4.

Metabolic, signaling, and gene regulatory networks are very complex, in that they usually have many loops. In fact, these feed-forward or feedback loops are one of the major design principles by which a cell performs its wealth of biological functions. The biological network topology and design principles will be discussed in more detail in Section 12.4.2.

12.1.4 Reverse engineering

Biological networks have to be deduced from empirical observations, and the amount of experimental data available is growing faster and faster with the advent of high-throughput screening techniques and improvements in the detection and quantification of ever smaller molecules and of changes in their amounts.

There are two sometimes distinct, sometimes complementary tasks to tackle in order to gain an understanding of the functional interactions between genes, proteins, and metabolites: (1) structural identification (i.e., to ascertain network structure or topology), and (2) identification of dynamics to determine interaction details (e.g., transition rules or rate equations and kinetic parameters).

A metabolic network such as the glycolytic pathway shown in Figure 12.1(a) was determined in a tedious step-by-step process. In a classical (or twentieth century) small-scale low-throughput way, each enzyme in the pathway was studied separately by classical enzyme assays. The network was then built by assembling all the individual enzymatic reactions together in one pathway. This is where systems biology entered the stage.

Figure 12.4 Before and after reverse engineering: a network of five components with all possible interactions (left) and the result of network inference (right). (Figure inspired by [7].)

Systems biology is about investigating, and ultimately understanding, the organizing principles of biological systems and how they bring about the observed behavior. It also comes with a change of focus from few components with few interactions to huge networks with many components and many interactions. Systems biology of the twenty-first century takes up the challenge of looking at (whole) networks instead of just collecting components to complete a parts list. Over the last decade, new technologies and experimental methods have been developed that enable the acquisition of large data sets containing genomic, proteomic, and metabolic information describing the state of a cell. With them, more and more of the cellular components (and their concentrations and concentration changes) can be monitored and captured at once. Among these large-scale high-throughput methods are microarrays, time of flight mass spectrometry (MS-TOF), and yeast two-hybrid screening. These modern experimental techniques, in combination with an increase in computing power, have made the analysis of larger networks feasible, and we have long since reached a point where the inferred networks, including their features and behavior, can no longer be grasped as a whole by the human mind alone but need to be investigated with the help of computers.

In the following section, some experimental techniques are introduced briefly. They are used by researchers in the life sciences to generate empirical (cell-)biological data that is the basis for network inference. Section 12.3 deals with theoretical approaches applied to this data to infer biological networks. After the networks have been inferred, they can be analyzed (i.e., their organizing principles uncovered and their behavior simulated). This is discussed in Section 12.4. The final section of the chapter is a summary and comparison of the various approaches.

12.2 Material: Time Series and Omics Data

New technologies enable the acquisition of large data sets containing genomic, proteomic, and metabolic information that describe the state of a cell. The measured quantities are metabolite concentrations, protein concentrations, protein activation status, and mRNA levels, to name the most studied ones. Time series are series of m discrete measurements of n (state) variables. They generally take the following form:

$$D = \left( \tilde{X}_i(t_j) \right)_{\substack{i = 1, \ldots, n \\ j = 1, \ldots, m}}$$

where $\tilde{X}_j$ is the state vector $[\tilde{X}_1(t_j), \ldots, \tilde{X}_n(t_j)]^T$ and contains the quantities of all measured variables at the jth observed time point, and $\tilde{X}_i$ is the time series for just one variable, $[\tilde{X}_i(t_1), \ldots, \tilde{X}_i(t_m)]$.

Instead of various time points, tj, there can also be just one single measurement but several differing conditions in which the system is investigated.
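In practice, such a data set is conveniently held as an n × m array. The following sketch uses randomly generated mock values (not experimental data) and assumed dimensions n = 3 and m = 5 to show how the state vector at one time point and the time series of one variable correspond to a column and a row of the array, respectively. Note that array indices start at 0, whereas the indices i and j in the text start at 1.

```python
import numpy as np

n, m = 3, 5                   # assumed: 3 state variables, 5 time points
t = np.linspace(0.0, 4.0, m)  # hypothetical measurement times t_1, ..., t_m

rng = np.random.default_rng(0)
D = rng.random((n, m))        # mock data: D[i, j] stands for X_i(t_j)

# State vector: all n variables at the second observed time point (j = 2).
state_vector = D[:, 1]

# Time series: the first variable (i = 1) at all m time points.
time_series = D[0, :]

print(D.shape, state_vector.shape, time_series.shape)
```

If the system is measured once under several conditions rather than at several time points, the same layout applies with the columns indexing conditions.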



Applying large-scale and high-throughput methods to determine the metabolite content of cells is called metabolomics, for proteins it is called proteomics, and for mRNA it is called transcriptomics.

12.2.1 Metabolomics

Analysis of the primary and secondary metabolism on a large scale (called metabolomics) includes the complete characterization of the metabolites that occur in a cell (called the metabolome) at a certain time, and the quantification of these metabolites at different time points, or under varying (environmental or stress) conditions, or for different strains/mutants.

Examples of metabolome databases are the HMDB (www.hmdb.ca) for human metabolomics data and the Golm Metabolome Database (GMD; csbdb.mpimp-golm.mpg.de/csdbd/gmd/gmd.html) for plant metabolomics data.

Experimental analysis of the metabolome comprises two main steps: separation and detection. The most common methods for separation are gas chromatography (GC), high performance liquid chromatography (HPLC), and capillary electrophoresis (CE). For detection, the most common methods are mass spectrometry (MS) and nuclear magnetic resonance (NMR) spectroscopy. The most common combination of methods for metabolome analysis is gas chromatography interfaced with mass spectrometry (GC/MS, also the first to be developed), followed by LC/MS and by method combinations with NMR. The concluding identification of the metabolites in a sample is facilitated by the fast-growing libraries of metabolite spectra, such as METLIN (metlin.scripps.edu).

As of now, several hundred metabolites can be measured exactly by these modern techniques, providing a metabolic profile of a cell that gives an instantaneous snapshot of its physiology.

Metabolic profiles have proved to be useful in the optimization and control of fermentation processes and production flows, but they can also be used to infer cellular metabolic networks. The procedure for this task, namely calculating the correlation for each pair of metabolites and using it to construct a metabolic network, is explained in Section 12.3.3.
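The idea of turning pairwise correlations into a network can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of Section 12.3.3: the profile values are made up, the Pearson correlation coefficient is used as the correlation measure, and the cutoff of 0.9 on its absolute value is an arbitrary choice for the example.

```python
import numpy as np

# Mock metabolic profiles: rows are metabolites, columns are samples
# (e.g., time points or conditions). Values are invented for illustration.
profiles = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],   # metabolite 0
    [2.0, 4.0, 6.0, 8.0, 10.5],  # metabolite 1, almost proportional to 0
    [5.0, 3.0, 4.0, 1.0, 2.0],   # metabolite 2
    [1.0, 0.0, 1.0, 0.0, 1.0],   # metabolite 3
])

# Pearson correlation for each pair of metabolites (rows).
corr = np.corrcoef(profiles)

# Assumed rule: connect two metabolites if |r| reaches the cutoff.
threshold = 0.9
n = profiles.shape[0]
edges = [(i, j) for i in range(n) for j in range(i + 1, n)
         if abs(corr[i, j]) >= threshold]

print(np.round(corr, 2))
print(edges)
```

The resulting edge list defines an undirected graph; the choice of cutoff strongly influences how dense the inferred network becomes.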

12.2.2 Proteomics and protein interaction networks

The proteome is the expressed protein complement of a genome, and it can be bigger in size than the number of genes in a genome because of splice variants and post-translational modifications. The human proteome, a catalog of all proteins in the human body, is for many life scientists the next key step after the success of the Human Genome Project. Compared to the transcriptome, the proteome is less dynamic. For detection of the proteome or at least meaningful fractions of it, several proteomic techniques exist, such as HPLC and MALDI-TOF mass spectrometry. ELISA and Western blotting are used for small-scale systems.

One aspect of proteomics is the interactions proteins can undergo. Proteins interact with metabolites, with RNA or DNA, or with other proteins. In the last case, the interaction is called protein-protein interaction (PPI). The whole of the physical interactions between the proteins of a cell forms a cellular network that is the subject of research, experimentally as well as by network inference. One experimental technique for the detection of protein-protein interactions at a large scale is automated yeast two-hybrid (Y2H) screening [8]. Y2H screens together with complementary methods have already provided PPI networks for, among others, E. coli, S. cerevisiae, C. elegans, and D. melanogaster. The human protein interaction network is under reconstruction, and recent progress in identifying all human PPIs can be tracked by accessing protein interaction databases that collect and store available PPI data, for example, IntAct (www.ebi.ac.uk/intact) or HPRD (www.hrpd.org). Some databases include experimentally validated as well as computationally predicted PPIs (e.g., by homology). The database Unified Human Interactome (UniHI; www.mdc-berlin.de/unihi) combines the data from many other databases.

Y2H tests for binary interactions (between two proteins, bait and prey). It produces a yes/no observation. This is an ideal prerequisite for a graph representation of PPI data (with protein-nodes and interaction-edges) and for subjecting it to graph-theoretical analysis (see Section 12.4.1). Unfortunately, PPI sets acquired in a large-scale high-throughput manner still contain a high percentage of false negatives (transient or low-affinity interactions may not be detected) as well as false positives, and thus leave it uncertain whether a given PPI actually occurs in vivo. Besides technical false positives, there are biological false positives: the two proteins found to interact may not encounter each other in vivo, depending on their abundance in the cell as well as on whether they are localized in the same compartments or expressed at the same time (spatial/temporal coexistence). Therefore, the method is usually combined with other methods to validate the PPIs and assign a confidence score, in order to verify the network derived from PPI data. Another challenge is further characterization of the interactions. Y2H detects a physical association between two proteins, but what kind of interaction is it exactly? Is it a permanent or transient complex formation, a substrate-product relationship (one protein modifying another, as in the case of phosphorylation-dephosphorylation, where one interaction partner is a kinase or phosphatase), scaffolding (in which case the scaffold protein has a plethora of interactors), or something else altogether?
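The yes/no character of Y2H observations and the use of a confidence score can be sketched as follows. The protein names, the scores, and the cutoff of 0.5 are hypothetical; how a confidence score is actually derived from complementary experiments is not specified here.

```python
from collections import defaultdict

# Hypothetical Y2H hits: (bait, prey, confidence score between 0 and 1).
hits = [
    ("P1", "P2", 0.90),
    ("P1", "P3", 0.40),  # low confidence, possibly a false positive
    ("P2", "P4", 0.80),
    ("P3", "P4", 0.95),
]

def build_ppi_graph(hits, min_confidence=0.5):
    """Return an undirected PPI graph as {protein: set of interaction partners}.

    Interactions scoring below min_confidence (an assumed cutoff) are dropped.
    """
    graph = defaultdict(set)
    for bait, prey, score in hits:
        if score >= min_confidence:
            graph[bait].add(prey)
            graph[prey].add(bait)
    return dict(graph)

ppi = build_ppi_graph(hits)

# Node degree: the number of interaction partners of each protein.
degrees = {protein: len(partners) for protein, partners in ppi.items()}
print(ppi)
print(degrees)
```

A protein with an unusually high degree in such a graph could, for instance, be a scaffold protein, although the graph alone cannot distinguish this from other kinds of interaction.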

Protein-protein interaction data is, nevertheless, useful as prior or additive knowledge when inferring signaling networks [9].

The domain composition of proteins forms the basis for a reduced form of protein interaction network, namely a domain interaction network that can be built from all known domain-domain interactions (DDIs) [10].

12.2.3 Transcriptomics

One major constituent in the process of gene expression is the messenger RNA or mRNA, a single-stranded RNA molecule that is the transcribed copy of a gene. It carries the sequence information of a gene out of the nucleus into the cytoplasm, where it is translated into the gene's final product, the protein. The entirety of all mRNAs in the cell at a certain time point is called the transcriptome. The composition of the transcriptome is extremely dynamic. It depends strongly on the internal and external (environmental) conditions of the cell. Nutrient availability or physical stresses, like rapidly increasing temperature, are examples of environmental factors that influence the composition of the transcriptome.

Transcriptomics is the research field dealing with the characterization and analysis of the cell's transcriptome. It explores the dynamics of the transcriptome and the mechanisms regulating mRNA production (i.e., how genes are up- and down-regulated in response to external signals or how they interact with transcription factors). The transcriptome of a cell is studied by experimental methods that measure the presence and amount of specific mRNAs at a certain time point. Several different methods for monitoring the expression levels of large gene sets exist, but currently the most popular one is transcriptome detection by DNA microarrays.

Microarrays are a high-throughput technology that exploits the fact that mRNA molecules hybridize specifically to complementary DNA copies. Those DNA copies, representing genes, are attached to an array, each copy forming a so-called spot. After the RNA is hybridized to the array, the expression levels of up to thousands of genes can be measured simultaneously by calculating the amount of mRNA bound to each spot. The measurement of a huge number of genes in a single experiment is called expression profiling. Many different array platforms and formats are available that realize the described array principle. The two most common platforms are the two-color array and the Affymetrix GeneChip technology. In Figure 12.5, the experimental methodology of a two-color array is illustrated.

Another high-throughput approach is the Serial Analysis of Gene Expression (SAGE). The advantage of this method in comparison to the array technology is that it allows the exact measurement of any mRNA, known or unknown. Thus, new mRNAs can be identified.

Two further methods for studying the transcriptome that are not considered high-throughput approaches are Northern blotting and real-time PCR.

12.3 Approaches for Inference of Biological Networks

Some known and well-studied networks, such as the central carbohydrate metabolism (see paragraphs on glycolysis in Sections 12.1.2 and 12.1.4), were the result of a slow and painstaking step-by-step process, in which each biochemical reaction was detected and studied separately. Modern experimental technologies generate data that allows for deducing more comprehensive cellular interaction networks from fewer experiments. An example is the protein interaction networks built from the sets of PPIs detected by Y2H screening (see Section 12.2.2). These are static networks, but what is of even greater interest are networks that also consider the dynamics of interactions, and thus capture the behavior of the biological systems that the networks represent. For inferring such networks from experimental data, several approaches exist. These theoretical approaches use data of the kind described in Section 12.2 but also data from small-scale low-throughput experiments.

Figure 12.5 Workflow of a microarray experiment. Two yeast cultures are grown, one mutant strain and one wild type strain; their mRNA is isolated, transcribed into cDNA, and labeled with different fluorescent dyes. The cDNA is then mixed and hybridized to a microarray. Each spot on the microarray corresponds to a gene; its fluorescence reflects the relative mRNA concentrations. The microarray is scanned and the resulting intensity values are stored in a gene expression matrix. (Figure reproduced with permission from [11].)

Instead of building, or rather assembling, networks from prior knowledge of individual interactions, network inference involves an indirect deduction process. In the following, some approaches are looked at more closely: a genome sequence based approach (Section 12.3.1), a discrete logical method (Section 12.3.2 on Boolean networks), and statistical and probabilistic methods (correlation metrics and Bayesian networks in Sections 12.3.3 and 12.3.4). The focus is then on approaches that use differential equations (Section 12.3.5).

Besides these, further approaches are developed and used, for example, approaches based on fuzzy logic/mathematics (e.g., fuzzy inference engines [12]), difference equations (e.g., LASSO tool [13]), stochastic ODEs, information theory, and numerical modeling with neural networks. They are not discussed in more detail in this chapter, and we refer the reader to the other chapters of this book or to textbooks.

12.3.1 Genome-scale metabolic modeling

This approach provides models of the complete metabolism of a cell. In contrast to the slow or “low-throughput” way of manually assembling a metabolic network from single biochemical reactions studied by enzymology, the development of genome-scale models is a more global approach. It is based on the sequenced genome of an organism, and on the assumption that all biochemical reactions that this organism can perform are encoded in the genome, namely through the genes that code for the enzymes that catalyze metabolic reactions. Analogously, the same is assumed for the transport processes between cell organelles (across organelle membranes) or between intra- and extracellular space (across the cell membrane). This approach, thus, reflects the central dogma of molecular biology.

By taking into account all genes detected on the genome that encode enzymes and transporters, one can build a complete metabolic model. This approach is impaired by the fact that not all genes are annotated or some are annotated incorrectly (most annotations are made by homology to already annotated genes in other organisms’ genomes), and as a result the network may have disconnected subnetworks or “orphan” or “dead-end” metabolites (see [14] for a discussion of problems with this approach).
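As a minimal sketch of how dead-end metabolites can be detected in a draft reconstruction, the following Python fragment scans a small reaction list for metabolites that are only produced or only consumed. The toy network, the reaction names, and the function `dead_end_metabolites` are hypothetical and serve only as an illustration; all reactions are assumed to be irreversible.

```python
from typing import Dict, List

# Hypothetical toy network: reaction name -> {metabolite: stoichiometric coefficient}.
# Negative coefficients mark substrates, positive coefficients mark products.
# All reactions are assumed to be irreversible.
reactions: Dict[str, Dict[str, int]] = {
    "uptake_A": {"A": 1},               # transport step importing A
    "R1": {"A": -1, "B": 1},            # A -> B
    "R2": {"B": -1, "C": 1, "D": 1},    # B -> C + D
    "export_C": {"C": -1},              # transport step exporting C
}


def dead_end_metabolites(rxns: Dict[str, Dict[str, int]]) -> List[str]:
    """Return metabolites that are only produced or only consumed."""
    produced, consumed = set(), set()
    for stoich in rxns.values():
        for metabolite, coeff in stoich.items():
            if coeff > 0:
                produced.add(metabolite)
            elif coeff < 0:
                consumed.add(metabolite)
    # Symmetric difference: in exactly one of the two sets.
    return sorted(produced ^ consumed)


dead_ends = dead_end_metabolites(reactions)
print(dead_ends)
```

In this toy network, metabolite D is produced by R2 but never consumed or exported, which is the kind of gap that a missing or incorrect annotation would leave behind.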

Up to now, genome-scale metabolic models have been published for several unicellular organisms (e.g., the bacteria M. tuberculosis [15] and H. salinarum [16], and the yeast S. cerevisiae [17, 18]). Usually, these models of the entire cellular metabolism comprise several hundred or even more than a thousand biochemical reactions and transport steps (see Figure 12.6). There are initiatives to also provide genome-scale metabolic models for unicellular algae, as a representative of plant metabolism.

So far, these networks are explored mainly by subjecting them to stoichiometric analysis or flux balance analysis (see Section 12.4.3). Nevertheless, the assignment of rate equations to each reaction with appropriate kinetic parameters, the prerequisite for dynamic simulations, will become feasible in the future.



Figure 12.6 Genome-scale metabolic models have been published for various organisms. G: number of genes incorporated in the model; R: number of reactions; M: number of metabolites. (Figure from [19]. Reproduced by permission of The Royal Society of Chemistry.)

Adding to the problem of an incomplete or false annotation, as mentioned earlier, the cells of multicellular organisms often exhibit only part of the metabolism that is coded in their genome, and modeling specific cell types (e.g., human erythrocytes or hepatocytes) thus requires an adjustment of the above approach. (This can be true for microorganisms alike, if they show very different metabolism for different environmental conditions or different developmental states, such as diauxic shift in E. coli or yeast.)

For more details on genome-scale metabolic models, we refer the reader to Chapter 6.

12.3.2 Boolean networks

Boolean networks were first used by Stuart Kauffman in the 1960s as a study of randomly constructed networks consisting of binary state nodes, which represented genes [20]. More than 40 years later, this network type is still used to infer biological networks in the field of reverse engineering. Gene regulatory networks in particular are often reconstructed from gene expression data by employing Boolean networks.

In general, gene regulatory networks modeled as Boolean networks are directed graphs, in which each node represents one gene. Nodes can adopt only two different states, namely 0 and 1. Consequently, one main feature of Boolean networks is the existence of discrete state values. A node with state value 0 represents the inactive form of the gene represented by the node, which means that the gene is currently not expressed. In contrast, a node with state value 1 stands for the active form, indicating that the gene is expressed.

In addition to the binary representation of the genes, each gene of the network influences the behavior or state of one or several other genes. Those interactions, illustrated by directed edges in the Boolean network, are modeled by Boolean (logical) functions. Each node/gene is assigned one of those functions, such that the state of each particular gene (0 or 1) at time point t + 1 depends on the states, at time point t, of the genes regulating that gene. At each time step, all genes are updated synchronously according to their assigned function, which means that all genes transit to a new state. A table consisting of all possible state values before and after transition is called a state transition table (see Figure 12.7).
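The synchronous update rule can be sketched in a few lines of Python. The fragment below uses the three Boolean functions of the example network in Figure 12.7(c); the function names `update` and `trajectory` and the chosen start state are illustrative assumptions, not part of any published tool.

```python
from itertools import product


def update(state):
    """Synchronous update with the Boolean functions of Figure 12.7(c)."""
    x1, x2, x3 = state
    return (
        x2,                        # X1 = X2
        int((not x1) and x3),      # X2 = NOT X1 AND X3
        int(not x1),               # X3 = NOT X1
    )


# State transition table: every possible input state mapped to its successor.
transition_table = {s: update(s) for s in product((1, 0), repeat=3)}
for inputs, outputs in transition_table.items():
    print(inputs, "->", outputs)


def trajectory(state, steps):
    """Follow the network for a number of synchronous time steps."""
    path = [state]
    for _ in range(steps):
        state = update(state)
        path.append(state)
    return path


traj = trajectory((0, 1, 1), 5)
print(traj)
```

Started from the state (0, 1, 1), this example network returns to its start state after five updates, so the trajectory forms a cycle.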

If the network structure is unknown, reverse engineering methods have to be defined that detect Boolean relations from gene expression data sets, measured at two time points at least. The first step of this procedure is always the discretization of the experimentally obtained gene expression profiles into “ON” and “OFF” states by using a numerical threshold. The discrete values are written into the state transition table, which is then used to compute the Boolean functions that describe the expression profiles. Finding the Boolean functions is the most demanding step, for which different algorithms were developed in recent years. Two popular methods that address this problem are the REVEAL [21] and the BOOL-2 [22, 23] algorithms. In order to detect the gene interactions, REVEAL analyzes the mutual information between the input and output states of the measured data. An extension of that algorithm allows constructing multistate models, so that the network states are no longer limited to 0 and 1. The second algorithm, BOOL-2, detects those Boolean functions that explain the influence of input states on corresponding nodes with a probability higher than a certain threshold [24].



One main problem of finding those Boolean functions is that the computational complexity grows exponentially with the number of nodes within the network. For this reason many existing methods are limited by the number of arguments of each function, which means that the number of genes influencing each other is limited [24]. For a number of k arguments, there exist $2^k$ possible input states and altogether $2^{2^k}$ possible Boolean functions.
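To illustrate why limiting the number of arguments k keeps the search tractable, the following sketch infers regulators by brute force from a complete state transition table. It is a simple consistency check under the assumption of noise-free data, not an implementation of REVEAL or BOOL-2; the function `consistent_regulators` and the use of the Figure 12.7 network as a data generator are assumptions made for this example. Gene indices are zero-based (index 0 stands for X1).

```python
from itertools import combinations, product

# Size of the search space per target gene for k arguments.
for k in (1, 2, 3):
    print(k, "arguments:", 2 ** k, "input states,", 2 ** (2 ** k), "Boolean functions")


def step(state):
    """Data generator: the three-node network of Figure 12.7."""
    x1, x2, x3 = state
    return (x2, int((not x1) and x3), int(not x1))


# Noise-free observations of (state at t, state at t + 1).
observations = [(s, step(s)) for s in product((0, 1), repeat=3)]


def consistent_regulators(obs, target, k_max):
    """Smallest regulator sets (size <= k_max) that determine the target gene.

    A regulator set is consistent if equal regulator states never lead to
    different next states of the target. Returns (regulators, truth table) pairs.
    """
    n_genes = len(obs[0][0])
    for k in range(k_max + 1):
        hits = []
        for regs in combinations(range(n_genes), k):
            table = {}
            consistent = True
            for before, after in obs:
                key = tuple(before[i] for i in regs)
                if table.setdefault(key, after[target]) != after[target]:
                    consistent = False
                    break
            if consistent:
                hits.append((regs, table))
        if hits:
            return hits
    return []


inferred = {t: consistent_regulators(observations, t, k_max=2) for t in range(3)}
for target, hits in inferred.items():
    print("gene", target, "<-", [regs for regs, _ in hits])
```

With k limited to two, only seven candidate regulator sets per gene have to be tested here; the recovered truth tables correspond to the functions in Figure 12.7(c).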

As a direct result of the model structure of Boolean networks, there are advantages and disadvantages of using Boolean networks for gene network inference. As described earlier, the discretization of gene states is a central feature of Boolean networks. On the one hand, discretization is tantamount to a loss of information, and this information reduction, together with the need to limit the number of arguments, may lead to falsified network inference results. On the other hand, discretization may be an advantage if one tries to reconstruct a network when only noisy expression data are available.

Two extensions of the basic Boolean network are the probabilistic Boolean networks and the temporal Boolean networks.

In a probabilistic Boolean network, the stochastic nature of gene expression and the noise of experimental data are considered by introducing probabilistic features to the network behavior. Thus, in a probabilistic Boolean network, each node can be assigned more than one Boolean function, and each function has a certain probability of being chosen. The function selected according to these probabilities determines the state of the gene at the next time point. There are several publications that provide further information about probabilistic Boolean networks and their application in gene network inference [25, 26].

In contrast to the above, a temporal Boolean network offers the possibility to model the existence of latency periods between the expression of a gene and the observation of its effect. Therefore, in a temporal Boolean network, the state of a gene at time point t + 1 need not depend only on gene states at time point t. It can be controlled by a Boolean function of gene states at several time points earlier than t. Temporal Boolean networks are described in detail in [27].



(a) [Graph of the three nodes X1, X2, and X3; image not reproduced.]

(b) State transition table

Input states (at time point t)    Output states (at time point t + 1)
X1  X2  X3                        X1  X2  X3
1   1   1                         1   0   0
1   1   0                         1   0   0
1   0   1                         0   0   0
1   0   0                         0   0   0
0   1   1                         1   1   1
0   1   0                         1   0   1
0   0   1                         0   1   1
0   0   0                         0   0   1

(c) Boolean functions

X1 = X2
X2 = NOT X1 AND X3
X3 = NOT X1

Figure 12.7 A Boolean network can be represented in different ways. Shown is a network that consists of only three interacting nodes: (a) graphical illustration of the interactions; (b) state transition table of all possible state values; and (c) Boolean functions, one for each node.

Other types of network can deal with continuous data as well. They are described in Sections 12.3.3 and 12.3.5.

12.3.3 Network topology from correlation or hierarchical clustering

The construction of reaction networks from correlation metrics is a statistical method. It comprises the calculation of the correlation for each pair of metabolites from metabolomics data (see Section 12.2.3), and uses the obtained correlation metrics to construct metabolic pathways. It interprets metabolic profiles in terms of the underlying biochemical network of reactions and regulations. The rationale behind this is that correlated metabolites have a good probability of being functionally related (i.e., being substrate and product, respectively, of one and the same enzyme-catalyzed reaction) or being linked by only a few steps in a metabolic network.

Measurement data for the detection of correlations between different metabolites can be obtained by conducting several experiments with different setups, for example by varying the environmental conditions or by periodically forcing the system by changing one variable over time. The middle panel in Figure 12.8 depicts exemplary metabolite versus metabolite scatter plots, in which each dot corresponds to a simultaneous measurement of two metabolite concentrations (in arbitrary units) within a single sample.

The relationship between metabolites is assessed using the Pearson correlation coefficient $r_{X,Y}$. The formula for an empirical correlation coefficient, given two series of n data points $(X_i, Y_i)$, reads:

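$$
r_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \cdot \sigma_Y}
= \frac{\frac{1}{n}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)\cdot\left(Y_i-\bar{Y}\right)}
{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}\cdot\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2}}
$$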

The correlation coefficients range from −1 to +1, and are close to zero in case of no detectable correlation. They can be visualized by colors in so-called heat maps. Figure 12.8 illustrates the workflow from metabolomics data via correlation metric to correlation network (sometimes also called association network).
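The workflow from profiles to a correlation network can be sketched as follows. The metabolite names, the concentration values, and the threshold of 0.9 are invented for illustration, and the use of the absolute value of the coefficient (so that strong negative correlations also produce an edge) is an assumption of this sketch.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Empirical Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sigma_x * sigma_y)


# Hypothetical metabolite concentrations (arbitrary units) in five samples.
profiles = {
    "M1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "M2": [2.1, 3.9, 6.2, 8.0, 9.8],
    "M3": [5.0, 4.1, 2.9, 2.0, 1.1],
    "M4": [3.0, 1.0, 4.0, 1.0, 3.0],
}

threshold = 0.9  # assumed cutoff on the absolute correlation

# Connect two metabolites if their pair-wise correlation is strong enough.
edges = []
for a, b in combinations(profiles, 2):
    r = pearson(profiles[a], profiles[b])
    if abs(r) >= threshold:
        edges.append((a, b, round(r, 3)))

for edge in edges:
    print(edge)
```

In this toy data set, M1, M2, and M3 end up mutually connected, whereas M4 stays isolated because its profile is essentially uncorrelated with the others.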



Figure 12.8 The workflow for network inference using correlation metrics is shown in a schematic overview. Metabolite correlations are derived from metabolomics data (left and middle panel). Two metabolites are connected in a correlation network if their pair-wise correlation exceeds a given threshold. A more detailed description can be found in [28].

Computational metabolic modeling can do more than capture the already known pathway structure. In a study of Chlamydomonas reinhardtii, it allowed an identification of missing enzymatic links [29]. It can also lead to the proposal of enzymes not yet annotated in this particular organism or the proposal of new, previously unknown connections between intermediates (hypothetical enzymes).

A comparative genomics tool for discovering unknown metabolic pathways in organisms is pathway inference through pattern matching. It is a technique in which known pathways are modeled as biological functionality graphs of gene ontology (GO)-based functions of enzymes (pathway functionality templates). These templates are used to locate frequent functionality patterns, and pattern matching then allows previously unknown pathways in metabolic networks to be inferred [30].

Hierarchical clustering is a similar method for inferring gene networks from gene expression profiles. Relationships among genes are represented by a tree whose branch lengths reflect the degree of similarity between genes, as assessed, for example, by a pair-wise similarity function such as the Pearson correlation coefficient. The rationale behind the use of correlation is that correlated genes may be functionally related. A review of inferring GRNs through clustering of gene expression data can be found in [31]. On the other hand, some claim that clustering is useful for finding coexpressed genes but not for network inference [32]. Clustering expression data into groups of genes that share profiles allows for grouping functionally related genes but does not order pathway components according to physical or regulatory relationships [33].

Network inference through clustering is applied not only to gene expression profiles but also to protein profiles or metabolic profiles. Here, more generally, a profile relates the measured component to the different conditions that were applied. This allows one to apply the method to time series data, for example, of metabolites [34, 35]. Modeling signal transduction networks becomes feasible by integrating protein-protein interaction and gene expression data, as shown in an example in S. cerevisiae [33]. The method was further advanced by [36].

12.3.4 Bayesian networks

A Bayesian network is a probabilistic description of a regulatory network. It is a marriage of probability theory and graph theory in which dependencies between variables are expressed graphically. It is named after Bayes’ theorem, which is used for the calculation of conditional probabilities and reads:

$$
P(X_1 \mid X_2) = \frac{P(X_2 \mid X_1) \cdot P(X_1)}{P(X_2)}
$$

where P(X1|X2) is the conditional probability of X1, given that X2 is true. According to basic probability theory, the joint probability can be factored as a product of conditional probabilities such that

$$
P(X_1, X_2) = P(X_2 \mid X_1) \cdot P(X_1) = P(X_1 \mid X_2) \cdot P(X_2)
$$

A Bayesian network is a graphical model for probabilistic relationships among a set of continuous or discrete random variables Xi. This relationship is encoded by two components.



The first component is a directed acyclic graph G(V, E) consisting of a set V = {X1, …, Xn} of nodes and a set E of the directed edges between these nodes. Xi → Xj means that Xj belongs to the children ch(Xi) of Xi or, in other words, Xi belongs to the parents pa(Xj) of Xj. Thus, Xj is a descendant of Xi if there is a path from Xi to Xj. (In acyclic graphs, no node is a descendant of itself.) The nodes in a Bayesian network represent measured variables of interest (e.g., genes or proteins). The edges represent informational or causal dependencies among the variables. For example, if i is a gene, then Xi will describe the expression level of i.

The second component is the relationship between the variables, which is described by a set P of n conditional probability distributions of the form fi(Xi|pa(Xi)). From the Markov assumption (i.e., each Xi is conditionally independent¹ of its nondescendants given its parents) it follows that the distribution f(X) can be factorized with reference to the graph, and the joint probability distribution can be decomposed into

$$
f(X) = \prod_{i=1}^{n} f_i\left(X_i \mid pa(X_i)\right)
$$

Hence, the common distribution emerges from the relationship of the parents of each random variable as well as the conditional distributions $P = \{f_i(x_i \mid pa(x_i))\}_{i=1,\ldots,n}$ [37].

In the example network shown in Figure 12.3(b), the random variables Xi are the five nodes V = {X1 = 1, X2 = 2, X3 = 3, X4 = 4, X5 = 5}, and the set of edges is E = {(1,3); (1,4); (2,4); (4,5)}. The joint probability distribution of G(V, E), then, has this form:

$$
P(X) = P(1) \times P(2) \times P(4 \mid 1, 2) \times P(3 \mid 1) \times P(5 \mid 4)
$$
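The factorization for this five-node example can be checked numerically. In the following sketch, all variables are assumed to be binary, and the conditional probability values in `p_on` are invented for illustration; only the graph structure (the parent sets) is taken from the example above.

```python
from itertools import product

# Parent sets derived from the edges E = {(1,3); (1,4); (2,4); (4,5)}.
parents = {1: (), 2: (), 3: (1,), 4: (1, 2), 5: (4,)}

# Hypothetical conditional probabilities P(node = 1 | parent states).
p_on = {
    1: {(): 0.3},
    2: {(): 0.6},
    3: {(0,): 0.1, (1,): 0.8},
    4: {(0, 0): 0.05, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
    5: {(0,): 0.2, (1,): 0.7},
}


def joint(assignment):
    """Joint probability as the product of the local conditional probabilities."""
    prob = 1.0
    for node, pa in parents.items():
        p1 = p_on[node][tuple(assignment[q] for q in pa)]
        prob *= p1 if assignment[node] == 1 else 1.0 - p1
    return prob


# P(1=1) * P(2=0) * P(4=1 | 1=1, 2=0) * P(3=1 | 1=1) * P(5=0 | 4=1)
example = joint({1: 1, 2: 0, 3: 1, 4: 1, 5: 0})
print(example)

# Sanity check: the joint distribution sums to one over all 2**5 states.
total = sum(joint(dict(zip(parents, values))) for values in product((0, 1), repeat=5))
print(total)
```

The example needs only 1 + 1 + 2 + 4 + 2 = 10 probability values instead of the 31 that a full joint table over five binary variables would require, which illustrates the economy gained from the factorization.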

In order to reverse-engineer a Bayesian network model of a gene regulatory network, one must find the directed acyclic graph G (i.e., the regulators of each transcript) that best describes the gene expression data D, where D is assumed to be a steady state data set. For this, all possible graphs G must be evaluated for the probability that the data D have been generated by the graph G. If previous knowledge from the biological background is available (e.g., if the classification of genes in functional groups is known), it can be integrated by means of an a priori description, and the search space reduced. To traverse the search space of all possible graphs, heuristics are used. Three algorithms for Bayesian network inference are Variable Elimination, Likelihood Weighting, and Gibbs Sampling. These algorithms are implemented in the Mocapy Toolkit, Bayes Net Toolbox, and Deal [24].

In summary, Bayesian networks are suitable for statistical models with incorrect measurements and minimal parameterization. For the nodes, discrete and continuous variables are possible. The advantage is that samples with missing values and latent variables can be integrated. The disadvantage of Bayesian network modeling is that mutual dependencies (cycles) between variables cannot be modeled. Cycles, especially feedback loops, are a pattern often encountered in biological and other regulation networks, and they are fundamental to their operation.

Bayesian network analysis is often used to infer gene regulatory networks but there are also examples for inferring signaling networks [38].



¹ Here, the conditional independence of random variables Xi and Xj, given a random variable Xk, means that $P(X_i, X_j \mid X_k) = P(X_i \mid X_k) \times P(X_j \mid X_k)$.

The normal Bayesian method works with data sets from experiments with differing conditions or differing cell strains, but not with time series data. An extension for time series data are the dynamic Bayesian methods, which allow cycles and therefore also feedback [39]. Dynamic Bayesian networks thus extend the approach to longitudinal data (observed over time as well as over space) and to feedback modeling. In dynamic Bayesian networks, nodes are allowed to be repeated over time. A dynamic Bayesian network is a general state-space model for describing stochastic dynamic systems. A Bayesian network structure corresponds to a first-order Markov process with states defined by the variables Xt [see Figure 12.9(a)], whereas a dynamic Bayesian network structure corresponds to a second-order Markov process [see Figure 12.9(b)]. Two special cases of dynamic Bayesian networks are hidden Markov models and Kalman filter models. Hidden Markov models are temporal probabilistic models in which the state of the process is described by a single discrete random variable. This is the simplest type of dynamic Bayesian network. Kalman filter models have the same topology as hidden Markov models, but all nodes are assumed to have linear-Gaussian distributions. Kalman filter models are the simplest continuous dynamic Bayesian method.

Finally, a Bayesian network with both static and dynamic nodes is called a partially dynamic Bayesian network, also known as a temporal Bayesian network [39].

The software Banjo is based on Bayesian network formalism and implements both Bayesian networks and dynamic Bayesian networks [40]. It can infer gene networks from steady state gene expression data or from time series gene expression data.

12.3.5 Ordinary differential equations

If determination of the quantitative interactions between the components of a network is an important issue, differential equations come into the picture. In this approach, networks of metabolic reactions, signaling networks, and gene regulatory networks are described by a system of ordinary differential equations (ODEs) of the form:

$$
\frac{dX}{dt} = f\left(X(t), p\right)
$$

where $X = [X_1, \ldots, X_n]^T$ is the state vector containing the amounts/concentrations, activities, or expression levels of all components in the network (with nonnegative values, $X \in \mathbb{R}^n_+$), and where $p = [p_1, \ldots, p_n]^T$ is the parameter vector containing all adjustable parameters of the biological system under consideration, such as rate constants. The function f(X, p) determines the dynamics of the network given the states and parameters. In cases of small molecular concentrations and/or low levels of diffusion, partial differential and stochastic equations may be required, but this is outside the scope of this chapter.

Figure 12.9 Two types of Bayesian network structures: (a) normal Bayesian network structure; and (b) dynamic Bayesian network structure.

Uncovering the differential equations that best describe a biological system directly from observations is a challenging task. When modeling the behavior of biological networks with differential equations, there is a basic formula that describes the rate of change of the amount/concentration of a single variable (also called species) as a (generally nonlinear) function of the state of the variables in the system and of the set of parameters. This basic formula or rate equation reads:

$$
\frac{dX_i}{dt} = \sum_{j} \sigma_{ij} \cdot \gamma_j \cdot \prod_{k} X_k^{g_{jk}}
$$

where the σij are the stoichiometric coefficients (σij ∈ Z), γj is the rate constant (γj ∈ R+), and gjk is the kinetic order (gjk ∈ R). The stoichiometric coefficient σij is nonzero only if there is a direct interaction that relates the states Xi and Xj. The γj and gjk are parameters. Assembly of the dynamic information for all species in the modeled system results in a system of ODEs with usually one equation per species.
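A minimal numerical sketch of this rate equation is given below. The two-species network with a reversible conversion, the parameter values, and the explicit Euler integration are assumptions chosen for brevity; in practice, a dedicated ODE solver would normally be used instead.

```python
def rate_of_change(x, sigma, gamma, g):
    """dX_i/dt = sum_j sigma[i][j] * gamma[j] * prod_k x[k] ** g[j][k]."""
    fluxes = []
    for j in range(len(gamma)):
        v = gamma[j]
        for k, xk in enumerate(x):
            v *= xk ** g[j][k]
        fluxes.append(v)
    return [
        sum(sigma[i][j] * fluxes[j] for j in range(len(gamma)))
        for i in range(len(x))
    ]


# Hypothetical network: process 1 converts X1 into X2, process 2 converts X2 into X1.
sigma = [[-1, 1],    # X1 is consumed by process 1 and formed by process 2
         [1, -1]]    # X2 is formed by process 1 and consumed by process 2
gamma = [1.0, 0.5]   # rate constants
g = [[1.0, 0.0],     # process 1 is first order in X1
     [0.0, 1.0]]     # process 2 is first order in X2

x = [3.0, 0.0]       # initial amounts
dt = 0.01            # Euler step size (assumed small enough for this example)
for _ in range(2000):
    dxdt = rate_of_change(x, sigma, gamma, g)
    x = [xi + dt * di for xi, di in zip(x, dxdt)]

print(x)
```

Because every unit of X1 that is consumed appears as X2, the total amount stays at 3, and the simulation settles where both fluxes balance (X1 close to 1, X2 close to 2). Changing an entry of `g` to a noninteger value turns the same code into a power-law model.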

The rate equation for each species contains terms with positive and terms with negative stoichiometric coefficients. These can be summed to the formation or synthesis flux $v_i^+$ and the degradation flux $v_i^-$, and the rate equation then reads:

$$
\frac{dX_i}{dt} = v_i^+(X) - v_i^-(X)
$$

The rate equations may obey further restrictions with regard to the kinetic orders (see Figure 12.10). Models containing the various types of rate equations are classified into conventional kinetic models and power law models.

Section 12.3.5.1 will focus on network inference on the basis of conventional kinetic models, on a small scale. It is followed by a section on power-law modeling and one on automated reverse engineering, and concludes with a section on parameter estimation.

12.3.5.1 Identification of small-scale biochemical networks

The method for identifying the dynamic interactions between biochemical components within the cell, as proposed by [42], considers the system in the neighborhood of a steady state. It assumes that the system behaves linearly for small variations around this steady state. The interactions in a linear system can be described in the form of an interaction matrix, or Jacobian matrix J. Considering that the system is in the vicinity of a steady state X0, assuming that it behaves linearly around X0, and thus truncating the Taylor expansion of $dX/dt = f(X(t), p)$ after the linear term gives

$$
\frac{d\,\Delta X}{dt} = \left.\frac{\partial f}{\partial X}\right|_{X_0} \Delta X(t) = J\,\Delta X(t)
$$

where ΔX = X(t) − X0. The procedure is to experimentally determine the elements of J by systematic perturbations of the system (perturb concentrations periodically or in pulses, or perturb parameters), and then deduce the network from this matrix. In principle, one perturbation experiment is sufficient to determine J, but in practice it is preferable to perform more than one experiment and to combine the obtained measurement results to estimate the elements of the Jacobian matrix. The method is based on linear least-squares estimation. It can deal with data obtained under perturbations of any system parameter, not only concentrations of specific components. It requires that the number of samples equals at least the number of network components, and hence, the method is restricted to relatively small networks. Another problem may arise because in real experiments, it will be difficult to apply perturbations small enough to remain close to the steady state, as assumed. Realistic experimental settings usually involve a considerable perturbation of the variables (e.g., changes in expression levels of proteins by up- or down-regulation of genes would be more in the order of −50% or +100%, and so forth).

Two more ODE-based approaches for inference of the structure of biomolecular networks are the method described by [43] and the method proposed by [44].

In the Kholodenko method, the network nodes are modules that have only one output and contain at least one intrinsic parameter that can be directly perturbed without the intervention of other nodes or parameters. The approach represents a distinct and fresh approach to the problem of identification of a gene network based on stationary experimental data [45], and it can reveal functional interactions even when the components of the system are not all known.

The Sontag method is complementary to the Kholodenko method. It is based on time series measurements, which is an advantage when steady state data are not available. Here, the basic concept is to analyze the direct effect of a small change in one network node on the activity of another node, while keeping all remaining nodes (variables) “frozen.” This method also enables quantification of the interaction strength, and thus, can also be of use when the network structure is already known [44].



Figure 12.10 Kinetic models can be classified according to the form of their rate equations [41].

Finally, a combination of the two approaches is described in [45]. This approach is based on stationary and/or temporal data obtained from parameter perturbations, and it unifies the previous approaches of Kholodenko and Sontag [43, 44]. It also aims at improving experimental design by giving guidance on which parameters to perturb and to what extent, and by addressing questions of sample rate and sample time selection.

One methodology capable of identifying the correct functional interaction structure with only a few sampling points through relatively simple computations is the one described by [7]. It uses only simple algebra based on the Mean Value Theorem. The authors also provide guidelines for an experimental design capable of supporting this methodology by taking proper measurements of the direct influences among the network nodes.
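The linear least-squares estimation of the Jacobian from perturbation data, as outlined at the beginning of this section, can be sketched for a two-component network as follows. The "true" Jacobian, the sampled deviations, and the assumption that the rates of change are measured exactly and without noise are all invented for illustration; the helper functions are written out by hand to keep the example free of external libraries.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(row) for row in zip(*a)]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical interaction matrix of the system near its steady state.
J_true = [[-1.0, 0.5],
          [0.3, -0.8]]

# Deviations from the steady state; each column is one sample of Delta X.
# Four samples for two components satisfy the minimum sample requirement.
delta_x = [[0.10, -0.05, 0.02, 0.08],
           [0.03, 0.07, -0.04, -0.06]]

# Synthetic "measured" rates of change, d(Delta X)/dt = J * Delta X.
rates = matmul(J_true, delta_x)

# Least-squares estimate: J = (R X^T) (X X^T)^(-1).
x_t = transpose(delta_x)
J_est = matmul(matmul(rates, x_t), inverse_2x2(matmul(delta_x, x_t)))

for row in J_est:
    print([round(value, 6) for value in row])
```

The signs and magnitudes of the estimated entries then indicate which component activates or inhibits which other component. With noisy data or large perturbations, the estimate would deviate from the true matrix, which reflects the limitations discussed above.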

12.3.5.2 Power law modeling

Power law models [46–48] are a type of model in which real numbers are allowed for the kinetic orders (as opposed to conventional kinetic models, which allow only integers; see Figure 12.10). This is seen as an advantage by modelers who prefer this type of model for modeling biological processes, because of the real conditions within a cell.

The cytoplasm, for example, is an extremely inhomogeneous reaction space. Numerous biochemical reactions take place simultaneously and between molecules of very different size, geometry, and complexity. These molecules take up essentially all available space inside the cell, a situation that by no means resembles the conditions assumed for classical reaction kinetics (free diffusion of reactants, and proportionality of the reaction rate to the probability of collision between reacting molecules); this is emphasized by the notion of molecular crowding [49, 50]. Admitting real-valued kinetic orders can cover inhomogeneity, volume exclusion effects, and so on. Thus, it allows accounting for the properties of the cytoplasm (or other compartment) and makes the approach more acceptable and valid.

The power law modeling approach is derived from fractal kinetic theory [51]; high values for kinetic orders correspond to more spatial restriction [Figure 12.11(a)] (i.e., to more molecular crowding).

The most remarkable property of power law models is that the features of a power law term vary from the description of an inhibitory process to the description of cooperativity by just modifying the value of the kinetic orders [41]. Negative values for the kinetic order represent inhibition, while a zero indicates that the variable does not affect the described process [see Figure 12.11(b)]. When positive values are considered for a kinetic order, several alternatives are possible: a kinetic order equal to one means that the system is reproducing a perfectly linear (conventional kinetic) behavior; values between zero and one represent a saturation-like behavior for the rate modeled; finally, with values higher than one the rate equation models cooperative processes. This property allows the modeler to evaluate different hypotheses concerning the nature of interactions without modifying the formal structure of the equations.

This property of power-law models has been exploited in recent times in gene networks where no initial information is available about the nature of the interactions between the compounds of the network or even whether these interactions actually exist [23, 53–55].
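The effect of the kinetic order on a single power-law term can be made concrete with a small sketch. The function names, the rate constant of 1, and the two test concentrations are assumptions chosen for illustration.

```python
def power_law_rate(x, gamma=1.0, g=1.0):
    """Single power-law term: rate = gamma * x ** g."""
    return gamma * x ** g


def interpretation(g):
    """Qualitative meaning of a kinetic order, following the text above."""
    if g < 0:
        return "inhibition"
    if g == 0:
        return "no influence"
    if g < 1:
        return "saturation-like"
    if g == 1:
        return "linear"
    return "cooperative"


# Compare the rate at a fourfold higher concentration for each kinetic order.
for g in (-1.0, 0.0, 0.5, 1.0, 2.0):
    low = power_law_rate(1.0, g=g)
    high = power_law_rate(4.0, g=g)
    print(f"g = {g:4.1f}  {interpretation(g):16s} rate(1) = {low}  rate(4) = {high}")
```

A fourfold increase in concentration lowers the rate for g = −1, leaves it unchanged for g = 0, doubles it for g = 0.5, quadruples it for g = 1, and raises it sixteenfold for g = 2, all with the same formal expression.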



In Kikuchi et al. [53], a conventional parameter estimation genetic algorithm was refined to investigate the structure of unknown gene networks using dynamical data and S-system models [48], a subclass of power-law models. The objective function contains a term that drives to zero those kinetic orders that the method detects to have no actual influence on the dynamics (described by the experimental data) of the other state variables of the model. In this way, the nonexistent interactions are automatically discarded and the actual structure of the network can be elucidated. In summary, a fully connected network is assumed in the beginning, in which each variable influences every other, and in the course of the parameter estimation (i.e., optimizing the set of parameters so that the model reproduces the measured data), links are removed if their kinetic order is estimated to be zero. So, in this approach, both the network structure and the reaction parameters are inferred at the same time. A disadvantage is the high number of parameters that need to be estimated. For a network with n state variables, there are n(n + 2) parameters in the initial network. This may cause a dimensionality problem, where too many network nodes are met by too few available sampling points. One way to reduce the dimensionality problem may be to incorporate prior knowledge (i.e., exclude certain interactions or start with an initial guess of network structure taken from one of the pathway databases) (see Section 12.1.2).

12.3.5.3 Automated reverse engineering

The method of automated reverse engineering allows one to automatically infer the equations for a nonlinear coupled dynamical system directly from time series data of an unknown/hidden biological system [56], assuming that the time series of all variables is observable. Both the structure and the parameters of the ordinary differential equations can be determined. This is achieved by an evolutionary-algorithm-like procedure that includes the evolving of model structures and the evolving of tests to disprove as many of the candidate models as possible [57].

The method applies a basic estimation-exploration algorithm that comprises three phases: (1) in the experimental phase, time series data is obtained from the target system by experiments; (2) in the model phase, a population of symbolic models is generated from basic operators and operands, and the models that satisfy the data and explain the observed behavior are called candidate models; (3) in the test phase, new sets of conditions are generated that induce maximal disagreement in the predictions of the candidate models and so disambiguate competing models; these are called intelligent tests. The target system is then (in a new experiment) perturbed according to the best test in order to extract new behavior from the hidden system. Thus, an iterative cycle of hypothesis formation and experiment is created that is characteristic of the scientific method.

Figure 12.11 The power law modeling approach allows for noninteger kinetic orders. (a) Fractal-like kinetic orders for a single-reactant biomolecular reaction. (Figure adapted from [52].) (b) Values for kinetic order and implications for the associated process.

The equations that form the models are represented as nested subexpressions, as is shown for an example differential equation in Figure 12.12(a). This provides a hierarchical description of the equations that can be visualized (e.g., in a tree [see Figure 12.12(b)]). Through this encoding, the expressions can become very large, and the depth of the trees should be limited by a maximum depth of subexpressions, which then also defines the maximal allowed model complexity.

To initialize the algorithm to a hidden system, the experimenter must specify the number of components (variables) of the system, how they can be observed, the possible number of operators (algebraic, analytic, or other domain-specific functions) and operands (variables and parameter ranges) that can be used to describe the behavior, relationships between these operands, the allowed model complexity (maximum depth of subexpressions), the space of possible tests (the sets of initial conditions of the variables), and the termination criteria (e.g., generations, fitness, or run time).

Candidate models can now be generated by “growing” trees. The trees are grown by randomly selecting symbols from the operators and operands. The probability of selecting from each set depends on the maximal allowed complexity.
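A minimal sketch of such tree growing is given here. The operator and operand sets are hypothetical, and the rule that the operand probability rises linearly with depth is an assumption made for illustration; it is not the selection rule of [56, 57].

```python
import random

OPERATORS = {"+": 2, "-": 2, "*": 2, "sin": 1}  # hypothetical operators with arities
OPERANDS = ["x", "y", 1.0]                      # hypothetical variables and a constant

def grow(max_depth, rng, depth=1):
    """Grow a random expression tree as nested tuples in prefix form."""
    # An operand is chosen with a probability that rises toward the depth
    # limit, and always at the limit itself, so no tree exceeds max_depth.
    if depth >= max_depth or rng.random() < depth / max_depth:
        return rng.choice(OPERANDS)
    op = rng.choice(sorted(OPERATORS))
    children = tuple(grow(max_depth, rng, depth + 1) for _ in range(OPERATORS[op]))
    return (op,) + children

def depth_of(tree):
    """Depth of a tree; a single operand has depth 1."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + max(depth_of(child) for child in tree[1:])

rng = random.Random(0)
models = [grow(4, rng) for _ in range(5)]
for m in models:
    print(depth_of(m), m)
```

Each printed model is a prefix-notation expression whose depth never exceeds the chosen maximum, which is how the complexity bound enters the search.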

Three main features of the algorithm are partitioning, automated probing, and snipping.

Partitioning allows the algorithm to model equations describing each variable separately, even though their behavior may be coupled. The candidate models of several variables are integrated and evolved one equation at a time. The references to other variables are replaced with real system data. This reduces the search time and increases the accuracy.

Automated probing is an algorithm that automatically generates tests that seek out previously unseen behavior of the target system. This forces the models to evolve faster towards the target system. It is a kind of intelligent testing and active learning process.


Figure 12.12 Models are represented as nested subexpressions in the automated reverse engineering approach. (a) Encoding of differential equations as nested subexpressions (prefix notation). (b) Description of the nested subexpression as hierarchical trees. The maximal complexity of a model is determined by the depth of its tree.

Snipping automatically simplifies and restructures models during optimization. As mentioned earlier, the equations are represented as symbolic trees with nested subexpressions. During the evolution of the model population, the subexpressions can become very large, which is a symptom of over-fitting. Snipping replaces a subexpression with a constant value if it reaches a certain maximum depth of nesting. As a result, the range of the subexpression is sufficiently small. It is a kind of Occam’s Razor process that improves the accuracy and human readability of the evolved equations.
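The snipping step can be sketched on nested-tuple expressions as follows. How the replacement constant is chosen is an assumption here (the value of the subexpression at one reference point); the names `evaluate`, `snip`, and the example model are hypothetical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, env):
    """Evaluate a nested-tuple expression such as ("+", "x", ("*", 2.0, "y"))."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, env), evaluate(right, env))
    return env[tree] if isinstance(tree, str) else tree

def snip(tree, max_depth, env, depth=1):
    """Replace any subexpression that starts at the depth limit by a constant."""
    if not isinstance(tree, tuple):
        return tree
    if depth >= max_depth:
        # Assumption: the constant is the subexpression's value at a reference point.
        return evaluate(tree, env)
    op, left, right = tree
    return (op,
            snip(left, max_depth, env, depth + 1),
            snip(right, max_depth, env, depth + 1))

model = ("+", "x", ("*", ("-", "x", "y"), ("+", "y", 1.0)))
snipped = snip(model, 3, {"x": 2.0, "y": 0.5})
print(snipped)
```

In this toy case the two deepest subexpressions collapse to constants, leaving a shorter and more readable expression of the same overall shape.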

The advantage of automated reverse engineering is that both the structure and the parameters of the ordinary differential equations can be determined. Furthermore, it is applicable to any system that can be described using sets of ordinary nonlinear differential equations. The disadvantage is that it assumes that the time series of all variables are observable, or, to put it the other way around, only observed species are included in the inferred model.

Automated reverse engineering could become an important method for discovering hidden interactions in molecular pathways.

12.3.5.4 Parameter estimation

In the power-law approach as well as in automated reverse engineering, the networks are inferred complete with the set of parameters, but in some approaches, network inference only identifies the structure in a first step. In this case, the system's parameters need to be estimated. The estimation of the parameters of a differential equation model, especially the problem of estimating parameters from time series data, can be formulated as an optimization problem: Given a differential equation model $\dot{X}(t) = f(X(t), p)$ and time series data $D = (\tilde{X}_i(t))_{i = 1, \ldots, n;\ t = 1, \ldots, T}$, the task is to estimate values for the parameter vector $p$ by minimizing an objective function $F(p, D)$. A common choice for the objective function is the sum of squared errors between measurement and model prediction.

The problem can in general be formulated as a prediction problem with m observations having outcomes $\tilde{X}(t_1), \tilde{X}(t_2), \ldots, \tilde{X}(t_m)$, and p predictors $X_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, p$.

Established parameter estimation methods are, for example, multiple shooting [58], Markov Chain Monte Carlo, or evolutionary algorithms such as genetic algorithms [59].
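To make the optimization view concrete, the following minimal sketch estimates a single rate parameter by minimizing the sum of squared errors. The one-variable decay model, the synthetic noise-free data, and the golden-section search are all assumptions chosen for brevity; none of the established methods named above is implemented here.

```python
import math

def simulate(p, x0, times, dt=0.01):
    """Integrate dX/dt = -p*X with fixed-step RK4; return X at the given times."""
    f = lambda x: -p * x
    x, t, out = x0, 0.0, []
    for t_target in times:
        for _ in range(round((t_target - t) / dt)):
            k1 = f(x)
            k2 = f(x + 0.5 * dt * k1)
            k3 = f(x + 0.5 * dt * k2)
            k4 = f(x + dt * k3)
            x += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t = t_target
        out.append(x)
    return out

def sse(p, data, x0):
    """Objective F(p, D): sum of squared errors between data and prediction."""
    times = [t for t, _ in data]
    return sum((xm - xp) ** 2 for (_, xm), xp in zip(data, simulate(p, x0, times)))

def golden_section(func, lo, hi, tol=1e-6):
    """Minimize a unimodal scalar function on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if func(c) < func(d):
            b = d
        else:
            a = c
    return (a + b) / 2

# Synthetic measurements generated with p = 0.7 (illustration only).
data = [(t, 2.0 * math.exp(-0.7 * t)) for t in (0.5, 1.0, 1.5, 2.0)]
p_hat = golden_section(lambda p: sse(p, data, x0=2.0), 0.0, 3.0)
print(p_hat)
```

A one-dimensional search suffices only because this toy model has a single parameter; realistic models with many parameters and noisy data need the global or multi-start methods cited in the text.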

For kinetic parameters, sometimes there is an alternative to parameter estimation. It is possible to populate a model with kinetic parameters that are specifically measured in experiments designed for this purpose, or with parameters that are retrieved from databases such as BRENDA (www.brenda-enzymes.org) or SABIO-RK (sabio.villa-bosch.de/SABIORK). Such databases collect the kinetic parameters that are published in the scientific literature. Usually, not all parameters can be obtained in either way, but the use of already-known parameters at least reduces the dimensionality of the parameter estimation problem significantly.

12.4 Network Biology—Exploring the Inferred Networks

Once a network is determined by one of the methods introduced in the previous sections or by any other method, it can be compared to other known networks, analyzed (in its entirety or by decomposition), or its dynamics simulated. This interrogation is done in order to understand the relation between network topology and the ability of the system to exhibit certain kinds of dynamic behavior.

Comparative biological network analysis, applied by contrasting networks of different species or under different conditions, facilitates the validation and interpretation of inferred networks as well as the addressing of questions about network evolution [60]. Some often-applied methods of network analysis are specific to networks that are represented as graphs: calculation of network measures and network classification (discussed in Section 12.4.1); and network decomposition and detection of motifs, modularity, and hierarchical organization (discussed in Section 12.4.2). Also, mainly for metabolic networks, there is structural analysis: if the network stoichiometry is known, and a steady state can be assumed, many interesting conclusions can be drawn from an analysis of the matrix of stoichiometric coefficients (Section 12.4.3). And finally, there is the simulation of network dynamics (Section 12.4.4).

Network properties that are of interest when exploring inferred networks are: basic performance, ability to withstand trauma (i.e., structural robustness), versatility of allowed interconnections, reusability of components and/or modules, and response to changed conditions.

12.4.1 Graph theory

One way to retrieve meaningful information that is encoded in the known biological networks is to apply graph theory. Graph theory is a branch of mathematics concerned with the theoretical analysis and comparison of graphs, and with the topology of the complex networks they represent. It offers a variety of measures for studying the structural properties of already inferred networks and for comparing the architectural features of different networks. All this applies only if the networks can be represented as graphs.

One of the most basic characteristics of a network (and one that is easy to evaluate) is the degree distribution. The degree k of a node specifies how many edges exist from this node to other nodes. Thus, in a biological network, it indicates the number of interaction partners of a protein, gene, or metabolite. The degree distribution P(k) is defined as the probability that a specific node has exactly degree k. It has been shown that simple random graphs lead to a connectivity distribution that follows a Poisson distribution. Random graphs, developed in the 1960s, are built by starting with a certain number of unconnected nodes in the first step, and connecting each pair of nodes with an edge with the same probability in the second step. One basic research result of recent years is that many real networks do not show this Poisson connectivity distribution known from random networks. Instead, P(k) of real networks, including biological networks, is frequently found to follow a scale-free distribution, with $P(k) \sim k^{-\gamma}$. Networks with a scale-free degree distribution are also called scale-free networks. The topology of these networks implies that the network contains a huge number of nodes with a small degree and a small number of nodes with a high degree. The highly linked nodes of the network are called hubs and play a central role in the robustness of a network, an important aspect of real network behavior.
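As a small illustration, the degree distribution can be computed directly from an edge list. The toy network and the function name below are hypothetical; nodes without any edge do not appear in an edge list and are therefore not counted in this sketch.

```python
from collections import Counter

def degree_distribution(edges):
    """Return P(k): the fraction of nodes that have degree k (undirected graph)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    n = len(degree)
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical toy interaction network: the node "hub" has four partners.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]
dist = degree_distribution(edges)
print(dist)
```

Even in this tiny example most nodes have a low degree and a single node has a high one, which is the qualitative pattern described for scale-free networks; a real test of the power law requires far more nodes.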

It has been shown that scale-free networks are highly robust against random failures, such as the removal of randomly selected nodes. If a node of a network is removed randomly, the probability is high that a node with a small degree is chosen. The removal of such a node or several such nodes would probably not affect the network's integrity at all. This means, for instance, that the malfunction of proteins in a protein interaction network that do not play a central role in the network (e.g., non-hubs) will not lead to abnormal behavior of the network.

A further important property of scale-free networks is that they are ultra-small, which means that the average path length of those networks is extremely small. The average path length is the average number of edges along the shortest paths for all possible pairs of nodes and is therefore a measure of the transport efficiency in the network. One effect of this ultra-small property in biological networks could be that local perturbations or signal changes in the network can reach the whole network very quickly. Further examples of biological networks that seem to exhibit a scale-free topology are given in [61].
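The average path length of a small unweighted graph can be obtained with a breadth-first search from every node, as in the following sketch. The adjacency dictionaries are hypothetical examples, and unreachable pairs are simply left out of the average, which is an assumption of this sketch.

```python
from collections import deque

def average_path_length(adj):
    """Mean shortest-path length over all connected ordered node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Two hypothetical four-node graphs: one with a hub, one without.
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
apl_star = average_path_length(star)
apl_chain = average_path_length(chain)
print(apl_star, apl_chain)
```

The graph with a hub has the shorter average path, which illustrates in miniature why hubs shorten the routes a signal has to travel.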

Besides the degree distribution, the clustering coefficient C is another important network measure, which is often used to interrogate and to compare different complex graphs. $C_i$ of a node i is calculated as the ratio of the actual number of edges between its neighbors to the maximal number of possible edges between its neighbors. The mean over the clustering coefficients of all existing nodes is called the average clustering coefficient $\langle C \rangle$. $\langle C \rangle_k$ is defined as the average clustering coefficient of all nodes with degree k. Based on C and $\langle C \rangle_k$, conclusions can be drawn about the tendency of nodes to form highly connected groups of nodes, called clusters. Indeed, it has been shown that real-world networks exhibit a high C compared to random networks. For example, in protein networks the clustering of nodes is often observed. Proteins often interact with each other or form protein complexes in order to fulfill a special cell function in a modular manner. In inferred protein networks such interacting proteins form clusters.
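The definition of the clustering coefficient translates into a few lines of code. In this sketch the adjacency sets are a hypothetical example, and nodes with fewer than two neighbors are assigned a coefficient of 0, which is a common convention but an assumption here.

```python
from itertools import combinations

def clustering(adj, i):
    """C_i: realized edges among the neighbors of i divided by the possible ones."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbors
    realized = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return realized / (k * (k - 1) / 2)

def average_clustering(adj):
    """Mean of C_i over all nodes of the graph."""
    return sum(clustering(adj, i) for i in adj) / len(adj)

# Hypothetical undirected graph: a triangle a-b-c with a pendant node d on a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
c_a = clustering(adj, "a")
c_avg = average_clustering(adj)
print(c_a, c_avg)
```

Node a has three neighbors, of which only one pair is connected, so its coefficient is one third, while the two other triangle members reach the maximum of 1.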

In this section only the most elementary measures are described; for further detailed reading the recent book by Uri Alon [6] is recommended. The graph theoretical analysis of biological networks is one major branch of network biology, and interesting discoveries continue to be made. A more recent one, for example, is the importance of bottlenecks (defined as proteins with a high betweenness centrality, that is, network nodes that have many shortest paths going through them) as the key dynamic components of networks [62].

Besides the topology, which can be analyzed by the mentioned network measures, each real network is characterized by its own set of motifs. The definition of motifs and particularly their meaning in biological networks are discussed in the next section.

12.4.2 Motifs and modules

Each complex network consists of many different subgraphs. A subgraph is a particular pattern of interconnection of two or more nodes. The number of possible subgraphs grows exponentially with the number of nodes that are part of this pattern, and the number of subgraphs with directed edges for a specific number of nodes is higher than the number of subgraphs consisting of undirected edges (see Figure 12.13). Undirected subgraphs are also called graphlets [63].

A special kind of subgraph is the so-called network motif. The term network motif was introduced for patterns of interconnections occurring in complex networks at numbers that are significantly higher than those in randomized networks [64]. Thus, motifs are over-represented subgraphs within a network.

To measure the statistical significance or over-representation, in a first step, the frequency of the considered motif in the real network is determined. Different ways exist for calculating the frequency. On the one hand, one can count all occurrences of the motif in the network; on the other hand, one can count only occurrences that have disjoint edges or even have disjoint edges and nodes. In the second step, a random network has to be generated and the frequency of the motif within the randomized version of the network is determined. Again, there are different randomization methods, depending on which property of the real network, such as the number of nodes and edges or the node degrees, should be preserved in the random network. Finally, there are two very common ways to express the statistical significance, the Z-score and the p-value. The Z-score indicates how many standard deviations an observed frequency is above or below the mean. Thus, for our purpose, the Z-score is calculated as the difference between the motif frequency in the real network and its mean frequency in a set of randomized networks, divided by the standard deviation of the frequencies in the different random networks. In contrast, the p-value represents the probability that a motif occurs in a random network an equal or greater number of times than in the target network. For statistical significance this probability should be lower than 0.01. It has to be mentioned that the statistical significance of a motif depends highly on the way the frequency and the random network are calculated [65].
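Both significance measures can be computed from a motif count in the real network and the counts in a set of randomized networks, as sketched here. The counts are invented for illustration, the randomization itself is not shown, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def motif_significance(real_count, random_counts):
    """Return (Z-score, empirical p-value) of a motif count."""
    # Z-score: distance from the random mean in units of the standard deviation.
    z = (real_count - mean(random_counts)) / stdev(random_counts)
    # Empirical p-value: share of random networks with at least as many motifs.
    p = sum(1 for c in random_counts if c >= real_count) / len(random_counts)
    return z, p

# Hypothetical motif counts: 40 in the real network, about 10 in random ones.
random_counts = [10, 12, 8, 11, 9, 10, 13, 7, 10, 10]
z, p = motif_significance(40, random_counts)
print(z, p)
```

With only ten randomized networks the empirical p-value can take no value between 0 and 0.1, so in practice many more randomizations are needed before a threshold such as 0.01 becomes meaningful.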

Each real network is characterized by its own set of motifs. It has been shown that networks that fulfill similar functions often have similar motif sets [64]. For example, it was found that the gene regulatory networks of E. coli and S. cerevisiae have a similar small set of motifs [66, 67]. Within the E. coli gene regulatory network, three highly significant motifs were found by these researchers: the feed-forward loop (FFL), the single input module (SIM), and the dense overlapping regulon (DOR) (Figure 12.14). The FFL is the most prominent and best studied motif of biological networks, and it was found to be part of almost all biological systems.

Figure 12.13 Subgraphs are particular patterns within the overall network. Shown are all the subgraphs with three or four nodes. (a) All undirected subgraphs with three nodes (top) and with four nodes (bottom). (b) All directed subgraphs with three nodes (top). In the case of four nodes, there are already 199 possible subgraphs [6], so they are not shown here.

In order to understand the function of network motifs, their dynamics are simulated using mathematical modeling. The FFL has been studied this way [64, 67, 68]. One major result is that a coherent FFL produces an output signal only if the input signal is in some way persistent. Consequently, for a transient activity of the input signal, caused for example by noise or fluctuating external signals, the FFL will exhibit no output signal [67].
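This persistence filtering can be reproduced with a deliberately simple Boolean sketch. It assumes AND logic at the output and a delay of one time step on every edge; the cited studies use more detailed kinetic models, so this is an illustration of the principle only.

```python
def ffl_response(x_signal):
    """Output Z of a Boolean coherent FFL: X -> Y, and (X AND Y) -> Z."""
    y, z_trace = 0, []
    for x in x_signal:
        z = x & y  # Z(t+1) = X(t) AND Y(t)
        y = x      # Y(t+1) = X(t)
        z_trace.append(z)
    return z_trace

transient = ffl_response([1, 0, 0, 0])   # input pulse lasting one step
persistent = ffl_response([1, 1, 1, 0])  # input lasting three steps
print(transient, persistent)
```

The one-step pulse never produces an output because Y has not yet been switched on when X is present, whereas the longer input does, which matches the behavior described above.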

Another promising approach for analyzing the function or behavior of a network motif is to determine its dynamic stability. The stability represents the probability that a motif returns to a steady state after small-scale perturbations. More detailed information about modeling the local stability of a network motif can be found in [69].

A common property of real networks seems to be the clustering of motifs into motif clusters or modules. A motif cluster is an aggregation of several motifs into a higher-order structure. Within a cluster all nodes are highly connected with each other, but there are only few edges to other nodes or clusters in the network. One assumes that most of the cellular functions are carried out by clusters or aggregated clusters [61]. The tendency of nodes to form clusters and the scale-free topology of most real networks result in a network topology called a hierarchical network [70].

In recent years, several software tools were developed for detecting and analyzing network motifs. Mfinder [71], MAvisto [72], and FANMOD [73] are programs that fulfill this task.

12.4.3 Stoichiometric analysis

Stoichiometric analysis (also termed structural analysis) can be applied to biochemical reaction networks where the stoichiometries of the reactions are known. The stoichiometry matrix N is then used to derive conservation relationships, enzyme subsets (or reaction correlation coefficient ϕ), elementary modes, and so on. An advantage of structural analysis is that it requires no information on concentrations of species or on volumes of compartments, nor any knowledge of kinetic parameters of enzymatic or nonenzymatic reactions. It can be performed on large-scale metabolic networks and even on the genome-scale networks mentioned (see Section 12.3.1).


Figure 12.14 Network motifs found in the E. coli transcriptional regulation network. Left (feedforward loop): In an FFL motif, a transcription factor X1 regulates a second transcription factor X2, and both jointly regulate an operon X3. Middle (single input module): The SIM motif comprises a single transcription factor X1 that regulates a set of operons X2, ..., Xn. X1 is usually auto-regulatory. All regulations are of the same sign. No other transcription factor regulates the operons. Right (dense overlapping regulons): In the DOR motif, a set of operons Xn+1, ..., Xm are each regulated by a combination of a set of input transcription factors, X1, ..., Xn.

Every model that consists of a list of biochemical reactions is also represented through its stoichiometry matrix N. The elements of N, the stoichiometric coefficients of the reactions, relate the rate of change of the concentrations of each network component to the reaction rates of the reactions that produce or consume the component:

$$\frac{dX}{dt} = N\,v(X, p)$$

Knowledge about the structure of a metabolic² network, reflected by N, and details of its reactions’ reversibility are sufficient to perform a stoichiometric analysis on this reaction network. It is important to realize that the steady state assumption Nv = 0 is the premise for most of the concepts in structural modeling but not for all (e.g., the conservation relationships hold true at every point in time).

By analyzing N, one can determine a variety of model properties that could not be found by any other means:

1. Conserved moieties: Sets of internal metabolites with a fixed total concentration [75]; metabolites that contribute to such a moiety are not free to take on every concentration but are dependent on the concentrations of the other metabolites contributing.

2. Enzyme subsets: Groups of enzymes that operate jointly in fixed flux proportions at steady state [76].

3. Elementary modes: Minimal sets of reactions that can operate at steady state with all irreversible reactions proceeding in the appropriate direction [77]; the concept of elementary modes provides a mathematical tool to define and comprehensively describe all metabolic routes that are stoichiometrically feasible for a certain reaction network; Schuster et al. [78] gave an overview, a calculation algorithm, and an example for this concept.
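The role of N can be illustrated on a toy network. In the sketch below the three-species network, the flux values, and the helper names are hypothetical; a conserved moiety is recognized by a weight vector g whose product with every column of N is zero, so that the weighted sum of concentrations cannot change over time.

```python
def mat_vec(N, v):
    """Compute dX/dt = N v for a stoichiometry matrix N and a flux vector v."""
    return [sum(n_ij * v_j for n_ij, v_j in zip(row, v)) for row in N]

def is_conserved(g, N):
    """True if g^T N = 0, i.e. the weighted concentration sum is constant."""
    rows, cols = range(len(N)), range(len(N[0]))
    return all(sum(g[i] * N[i][j] for i in rows) == 0 for j in cols)

# Hypothetical network: R1: A -> B, R2: B -> A, R3: B -> C, R4: C -> B.
# Rows are the species A, B, C; columns are the reactions R1 to R4.
N = [
    [-1,  1,  0,  0],
    [ 1, -1, -1,  1],
    [ 0,  0,  1, -1],
]
rates = mat_vec(N, [2.0, 0.5, 1.0, 1.0])   # arbitrary fluxes, not a steady state
steady = mat_vec(N, [1.0, 1.0, 1.0, 1.0])  # balanced fluxes, N v = 0
print(rates, steady, is_conserved([1, 1, 1], N))
```

The total A + B + C is conserved for any fluxes, which illustrates why conservation relationships hold at every point in time while N v = 0 holds only at steady state.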

The three concepts introduced above are not the only consequences arising out of the structure of a modeled network. Other contributions of structural analysis have been proposed: connectivity of metabolites [79], metabolic flux analysis [80], minimal cut sets [81], and the reaction correlation coefficient ϕ [82].

Applications to the genome-scale metabolic models (introduced in Section 12.3.1) are discussed in [83], and several more small-scale examples can be found in [84].

Based on a network’s stoichiometric structure, there are other theoretical methods for systems analysis: flux balance analysis (FBA) and energy balance analysis (EBA). In Chapter 9, dynamic flux balance models are explained.

12.4.4 Simulation of dynamics, sensitivity analysis, control analysis

Once the topology of a biological network is established and the transition rules or rate equations are defined, analysis and simulation of the dynamics of the network are possible. This can be done with a discrete time scale or with a continuous one.

Continuous dynamic modeling or conventional biochemical network modeling relies on solving a system of differential equations. Usually, numerical methods and software packages that provide them are needed for this as well as for stability or sensitivity analysis.

² There are attempts to also apply this method to signaling networks [74].

Discrete dynamic modeling includes, for example, the simulation of abstract network flow or information flow, the dynamic simulation of Boolean networks, and Petri nets (not further discussed herein). The concepts of abstract network flow and information flow are very simple and are thus introduced only briefly before the dynamic simulation of Boolean networks is explained.

One of the latest developments in signaling research is to view intracellular signaling as propagation in a complex network rather than as isolated pathways [85]. In line with this are approaches such as the simulation of abstract network flow [86] or of information flow [87]. Here, the connections between network nodes are all considered to be of equal consequence (i.e., no weights), and starting from one seed node, the propagation of a signal in a network is studied. The quantity of interest can be the overall reach after a certain number of time steps, or in the case of hierarchical networks or networks that contain nodes with special features (e.g., TFs in protein-protein interaction networks), it can be the number of time steps it takes to reach nodes in a specific layer or nodes with the special features. The signal can also be split and equally distributed between all nodes that are reached in the next step. In this way, biological networks can be compared to random networks, attractors can be detected, and so on, all with a relatively low computational effort, and the dynamic flow can be simulated even for an organism’s known protein interaction network.

Dynamics in Boolean networks. In general, gene regulatory networks modeled as Boolean networks (BN) are directed graphs, in which each node represents one gene. Nodes can adopt only two different states, namely 0 and 1. Consequently, Boolean networks are characterized by their restriction to discrete state values. A node with state value 0 represents the inactive form of the gene represented by the node, which means that the gene is currently not expressed. In contrast, a node with state value 1 stands for the active form, indicating that the gene is expressed.

In addition to the binary representation of the genes, each gene of the network influences the behavior or state of one or several other genes. Those interactions, illustrated by directed edges in the Boolean network, are modeled by Boolean functions (Boolean variables connected by the logical operators AND, OR, and NOT). Each node/gene is assigned to one of those functions, such that the state of each particular gene (0/1 or off/on) at time point t + 1 depends on the states at time point t of the genes regulating that gene. At each time step, all genes are updated synchronously. In an extended version of Boolean modeling, not only logical operators but other functions, such as the sum of all input states, are allowed. Then the rules for state transitions also define the threshold of the function for transitions from one state (input state) to another state (output state). This approach was applied to dynamically model the cell-cycle regulatory network of budding yeast [88].

Instead of modeling the gene activation status, one can also model the status of the proteins that are the products of the genes. The budding yeast cell-cycle network is a simple dynamic model containing 11 proteins. The protein states $X_i$ (i = 1, …, 11) in the next time step are determined by the protein states in the present time step via the rules given in a state transition table [e.g., in Figure 12.15(b)], where the $a_{ij}$ denote the weights of the edges, with $a_{ij} = 1$ for a green/solid arrow from protein j to protein i and $a_{ij} = -1$ for a red/dashed arrow from j to i (the edges may also be equipped with noninteger weights). Note that in this model the interactions are modeled by a sum function instead of pure logical operators, and the threshold for transitions is 0; only if the function equals 0 is the previous state retained.

The network was simulated for all 2,048 initial states ($2^{11}$), and the results show that most of the simulations converge to one single attractor (i.e., a state vector that always reproduces itself in the next time step), which was then related to the stationary G1 phase of the cell cycle.
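The threshold update rule and the exhaustive search over initial states can be sketched for a toy network. The three-node weight matrix below is hypothetical and is not the yeast cell-cycle network; the entry a[i][j] is the weight of the edge from node j to node i, and the function names are chosen for this illustration.

```python
from itertools import product

def step(state, a):
    """Synchronous update: sum > 0 gives 1, sum < 0 gives 0, sum == 0 keeps the state."""
    new = []
    for i, s_i in enumerate(state):
        total = sum(a[i][j] * s_j for j, s_j in enumerate(state))
        new.append(1 if total > 0 else 0 if total < 0 else s_i)
    return tuple(new)

def basin_sizes(a, max_steps=50):
    """Count how many initial states end in each fixed point."""
    sizes = {}
    for state in product((0, 1), repeat=len(a)):
        for _ in range(max_steps):
            nxt = step(state, a)
            if nxt == state:
                sizes[state] = sizes.get(state, 0) + 1
                break
            state = nxt
    return sizes

# Hypothetical network: node 0 activates 1, node 1 activates 2, node 2 inhibits 0.
a = [
    [0, 0, -1],
    [1, 0,  0],
    [0, 1,  0],
]
sizes = basin_sizes(a)
print(sizes)
```

In this toy case six of the eight initial states converge to the same fixed point, a small-scale analogue of the dominant attractor described above; initial states that run into a cycle instead of a fixed point are simply not counted by this sketch.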

An analogous Boolean model for the biochemical network that controls cell-cycle progression in the fission yeast S. pombe [89], shown in Figure 12.15(a), successfully predicted the time sequence of protein activation along its cell cycle. The authors compare their results with a much more complicated ODE-based model that requires extensive parameter tuning and conclude by encouraging further modeling experiments with the quite minimalistic approach presented here, as it may prove a quick route to predicting biologically relevant dynamical features of genetic and protein networks in the living cell [89].

Sensitivity analysis, control analysis, and simulations. ODE models, once they are put together (i.e., their structure defined and populated with parameter values and with initial values for all variables), can be analyzed in several different ways: by control analysis [90] or sensitivity analysis [91, 92], in order to detect crucial steps or parameters, or by stability analysis [93], in order to evaluate the robustness of the network. For more details regarding sensitivity analysis, we refer the reader to Chapter 8.

Figure 12.15 Discrete dynamical model of the cell-cycle network of fission yeast (S. pombe) [89]. (a) Network of fission yeast cell-cycle regulation, with the nodes Start, SK, Rum1, Wee1/Mik1, Cdc25, Slp1, PP, Ste9, Cdc2/Cdc13*, and Cdc2/Cdc13. The nodes denote threshold functions, representing the switching behavior of regulatory proteins. Arrows represent interactions between proteins, with $a_{ij} = 1$ for an activating interaction (green/solid link) from node j to node i, $a_{ij} = -1$ for an inhibiting (red/dashed) link from node j to node i, and $a_{ij} = 0$ for no interaction at all. (b) State transition table for the fission yeast cell-cycle model. The rules in the table define how the states of the nodes are updated (in parallel) in discrete time steps.

Furthermore, ODE models can be simulated by using software solutions that allow numerical integration. Thus, the time course and behavior of the system can be investigated for different initial or environmental conditions, predictions can be made, and hypotheses about the biological system that is described by the network can be formulated, leading to new experiments to test them.

Well-known examples of modeling and simulation of biochemical networks are the various erythrocyte (red blood cell) metabolic models [94, 95], and models of microbial and plant metabolism [84]. Many metabolic and some signaling network models can be found and even interactively explored (including control and sensitivity analysis) in the model database JWS online (jjj.biochem.sun.ac.za).

12.5 Discussion and Comparison of Approaches

Reverse engineering is an appropriate tool to increase our knowledge of biological systems and it is capable of creating predictive models for biological systems that also capture environmental factors that affect system responses. In this chapter, several methods for network inference and analysis were introduced that enable the identification of network structure and the detection of motifs and modules within these networks. The overall topology of the networks may ensure robustness to component failure by redundancy.

The choice of approach very much depends on the type of cellular network. This is reflected also in the fact that most published reviews of network inference methods focus on one type of network: gene networks [24, 32, 96] or metabolic networks [97]. Also crucial for the choice of approach are the type of data that is available (i.e., which biological entities (molecules) are the network components) and how the data was generated (i.e., time series data or steady state data for different conditions). When designing new experiments, the question must be: What is the (minimal) information required to uncover the network? When or how often should the measurements take place [98], and what should be measured?

Bansal et al. [32] compared several approaches: Bayesian networks, information-theoretic approaches, and ODE-based ones. In the following, the features, advantages, and disadvantages of the various methods described in this chapter are again summarized and compared.

Boolean networks (see Section 12.3.2) are deterministic. For each node a Boolean relation has to be defined that explains the influences of the input states (the states of the nodes that have an edge directed at the node in question) on the node’s state. This is summarized in the so-called state transition pair table. The discretization is a central feature of this method. It is useful when only noisy data is available, but it implies a loss of information. Also, from the computational point of view, complexity grows exponentially with the number of nodes. Boolean network modeling can be applied to data sets measured at two time points at least. In most cases they are used to infer interactions between genes in terms of gene regulatory networks. The inference of regulatory interactions between genes from experimental data collected from microarray experiments is a major challenge. Genome expression analysis involves the use of oligonucleotide or cDNA microarrays to measure, in a parallel fashion, the mRNA levels of as many genes as possible in a genome (see Section 12.2.3). Many techniques are being developed to analyze these experimental measurements in order to disclose the main gene interactions at a given moment.

Bayesian networks (see Section 12.3.4) are directed acyclic graphs that best describe a given set of steady state data and, unlike Boolean networks, they are not limited to discrete variables. A major disadvantage of Bayesian networks is the restriction to acyclic graphs, as loops are one of the main features of biological networks. Also, this network type cannot be used to model time series data. Bayesian network modeling is suitable for statistical models with measurement errors and minimal parameterization. An advantage of Bayesian networks is that samples with missing values and latent variables can be integrated. Dynamic Bayesian networks are an extension of Bayesian networks that allows the use of time series data and the modeling of feedback loops.

ODE modeling can be applied to reconstructing cellular networks from time series of gene expression, signaling, and metabolite data. Time series data should be available, though, as data from experiments with differing conditions also imply different parameters. Power-law models can give a very pragmatic description of a biological network, because noninteger kinetic orders are allowed and can be interpreted in a biological sense (inhibition, activation, and so forth). Their disadvantage is that the problem to be solved has a very high dimensionality, because many parameters need to be estimated in order to obtain even the network structure.
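To make the dimensionality argument concrete, the sketch below simulates a two-variable power-law model in S-system form, dX_i/dt = alpha_i * prod_j X_j^g_ij - beta_i * prod_j X_j^h_ij. The S-system form is one common power-law formalism and is assumed here for illustration. The parameter values, the explicit Euler integrator, and all function names are hypothetical choices, not the method of any cited reference. A negative kinetic order (here g[0][1] = -0.5) encodes inhibition and a positive one encodes activation.

```python
def s_system_rhs(x, alpha, g, beta, h):
    """Right-hand side: alpha_i * prod_j x_j**g_ij - beta_i * prod_j x_j**h_ij."""
    n = len(x)
    dx = []
    for i in range(n):
        production = alpha[i]
        degradation = beta[i]
        for j in range(n):
            production *= x[j] ** g[i][j]
            degradation *= x[j] ** h[i][j]
        dx.append(production - degradation)
    return dx

def simulate(x0, alpha, g, beta, h, dt=0.01, steps=5000):
    """Explicit Euler integration; concentrations are kept strictly positive."""
    x = list(x0)
    for _ in range(steps):
        dx = s_system_rhs(x, alpha, g, beta, h)
        x = [max(xi + dt * dxi, 1e-12) for xi, dxi in zip(x, dx)]
    return x

def parameter_count(n):
    """An n-variable S-system has 2n rate constants and 2n*n kinetic orders."""
    return 2 * n * (n + 1)

# Hypothetical parameters: X2 inhibits production of X1, X1 activates X2.
ALPHA = [2.0, 1.0]
BETA = [1.0, 1.0]
G = [[0.0, -0.5], [0.5, 0.0]]   # kinetic orders of the production terms
H = [[1.0, 0.0], [0.0, 1.0]]    # kinetic orders of the degradation terms

x_final = simulate([1.0, 1.0], ALPHA, G, BETA, H)
print(x_final, parameter_count(2), parameter_count(50))
```

Even this toy model has 12 parameters, and the count grows quadratically with the number of variables, which is why inferring the full structure from data quickly becomes a high-dimensional estimation problem.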

The genome-scale metabolic modeling described in Section 12.3.1 requires that the genome of the organism is available. Also, it provides structural models only.

A priori information is a useful supplement to the standard numerical data coming from an experiment. It reduces the computational complexity problem that can arise if all interactions are to be inferred. If prior knowledge is available, it should be incorporated. An example of prior knowledge is protein-protein interaction data (see Section 12.2.2). Often, prior knowledge is available for metabolic networks and signaling networks, but the structure of gene regulatory networks is usually largely unknown in advance. The confidence in already-established metabolic networks (e.g., for E. coli, yeast, A. thaliana, and so on) is higher, but additional theoretical analysis can lead to deeper insight (see Section 12.3.1).

This chapter’s emphasis has been on gene regulatory and metabolic networks, with less detail on signaling networks. The time scale of signaling events is usually much shorter, and obtaining the necessary data for application of the inference methods is more difficult. One possibility for discovering signaling pathways that has not yet been mentioned in this chapter is alignment with known pathways in other species [99]. For signaling, see also Chapter 10.

Once a network is inferred, there are several ways to analyze it in more detail. Graph theory (see Section 12.4.1) provides many measures with which the overall topology and the robustness of the network can be determined. The detection of network motifs such as feed-forward loops (see Section 12.4.2) is one way to gain a better understanding of the dynamics of the inferred network. A survey of the field of network biology can be found in [65].
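The two kinds of analysis named above can be sketched on a toy directed graph. The fragment below computes in- and out-degrees (a basic graph-theoretic measure) and lists feed-forward loops, taken here to mean triples (x, y, z) with edges x->y, y->z, and x->z. The edge list and function names are hypothetical, and the brute-force search over all node triples is only practical for small networks; dedicated motif-detection tools use more efficient algorithms.

```python
from itertools import permutations

def degrees(edges):
    """Return (in_degree, out_degree) dictionaries for a directed edge list."""
    in_deg, out_deg = {}, {}
    for source, target in edges:
        out_deg[source] = out_deg.get(source, 0) + 1
        in_deg[target] = in_deg.get(target, 0) + 1
        in_deg.setdefault(source, 0)
        out_deg.setdefault(target, 0)
    return in_deg, out_deg

def feed_forward_loops(edges):
    """Return all (x, y, z) with edges x->y, y->z and the shortcut x->z."""
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    return sorted(
        (x, y, z)
        for x, y, z in permutations(nodes, 3)
        if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set
    )

# Hypothetical inferred network: X regulates Y and Z, Y regulates Z, Z regulates W.
EDGES = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "W")]

in_deg, out_deg = degrees(EDGES)
loops = feed_forward_loops(EDGES)
print(in_deg, out_deg, loops)
```

To judge whether a motif is over-represented, its count would normally be compared with counts in randomized networks; that step is omitted here.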

In summary, there are different approaches from which a researcher can choose; the previous sections introduced them and discussed their advantages and disadvantages. The choice of approach depends on the experimental data available, the type of network one wants to look at, and the questions that are to be tackled by analyzing it.

One last comment: As we have seen, studies are usually restricted to one type of cellular network (metabolic, signaling, or gene regulatory), but the outlook for the future is that these networks on the different levels need to be combined into one integrated cellular network [100] in order to gain a true understanding of the inner workings of the cell.

12.6 Summary Points

• Reverse engineering is the task of mapping an unknown network. It comprises either or both of the following tasks:

  • Network inference: process of derivation/assignment of interaction structure;

  • Parameter estimation: qualitative, quantitative, or dynamic description of the interactions (populate the structure with transition rules and rate equations, parameters, and so forth).

• Reverse engineering of biological networks is important because it allows the building of predictive models (as opposed to statistical models) of the studied biological systems, and this is one major goal in view of future applications in medicine or biotechnology.

• There are many methods available for network inference, but the choice depends on the type of network (i.e., the biological molecules that form the network), on the type of data that is available (i.e., from time series or perturbation experiments), and on the amount of data.

• Among the methods are those that use differential equations. The accelerated development of ODE methods in the last decade, together with the ongoing increase in access to computational power in labs, holds promise for the future.

• To improve reverse engineering, there is a need for guidelines for appropriate experimental design; that is, optimization (minimization in number or cost) of the experiments so that different network topologies can be distinguished.

• To understand the dynamics of an inferred network, one has to employ further analysis steps, such as motif detection and mathematical simulation of the inferred network.

Acknowledgments

The authors have received financial support from the German Federal Ministry of Education and Research (BMBF) grant 01GR0475 as part of the National Genome Research Network (NGFN-2). Furthermore, we thank Sylvia Haus for her help in the initial preparation of the manuscript and with the section on Bayesian network modeling, Peter Raasch for assistance in preparing the figures, and Julio Vera Gonzalez for discussions and input on power-law modeling.



References

[1] Oda, K., M. Matsuoka, A. Funahashi, and H. Kitano, “A comprehensive pathway map of epidermal growth factor receptor signaling,” Molecular Systems Biology, Vol. 1, No. 1, May 2005, pp. msb4100014-E1–msb4100014-E17.

[2] Galperin, M.J., “The Molecular Biology Database Collection: 2008 update,” Nucl. Acids Res., Vol. 36, Database issue, November 2007, pp. D2–D4.

[3] Hall, D.H., and R.L. Russell, “The posterior nervous system of the nematode Caenorhabditis elegans: serial reconstruction of identified neurons and complete pattern of synaptic interactions,” Journal of Neuroscience, Vol. 11, 1991, pp. 1–22.

[4] Kitano, H., A. Funahashi, Y. Matsuoka, and K. Oda, “Using process diagrams for the graphical representation of biological networks,” Nature Biotechnology, Vol. 23, No. 8, August 2005, pp. 961–966.

[5] Kohn, K.W., M.I. Aladjem, S. Kim, J.N. Weinstein, and Y. Pommier, “Depicting combinatorial complexity with the molecular interaction map notation,” Mol. Syst. Biol., Vol. 2, No. 51, October 2006.

[6] Alon, U., An Introduction to Systems Biology, New York: Chapman and Hall: CRC Mathematical and Computational Biology Series, 2006.

[7] Cho, K.-H., H.S. Choi, and S.M. Choo, “Unraveling the functional interaction structure of a biomolecular network through alternate perturbation of initial conditions,” J. Biochem. Biophys. Methods, Vol. 70, No. 4, June 2007, pp. 701–707.

[8] Uetz, P., “Two-hybrid arrays,” Curr. Opin. Chem. Biol., Vol. 6, No. 1, February 2008, pp. 57–62.

[9] Ratushny, V., and E.A. Golemis, “Resolving the network of cell signaling pathways using the evolving yeast two-hybrid system,” BioTechniques, Vol. 44, No. 5, April 2008, pp. 655–662.

[10] Schlicker, A., C. Huthmacher, F. Ramirez, T. Lengauer, and M. Albrecht, “Functional evaluation of domain–domain interactions and human protein interaction networks,” Bioinformatics, Vol. 23, No. 7, April 2007, pp. 859–865.

[11] Schlitt, T., and A. Brazma, “Current approaches to gene regulatory network modeling,” BMC Bioinformatics, Vol. 8, Suppl. 9, September 2007.

[12] Wolkenhauer, O., Data Engineering, New York: John Wiley & Sons, 2001.

[13] van Someren, E.P., B.L.T. Vaes, W.T. Steegenga, A.M. Sijbers, K.J. Dechering, and M.J.T. Reinder, “Least absolute regression network analysis of the murine osteoblast differentiation network,” Bioinformatics, Vol. 22, No. 4, February 2006, pp. 477–484.

[14] Poolman, M.G., B.K. Bonde, A. Gevorgyan, H.H. Patel, and D.A. Fell, “Challenges to be faced in the reconstruction of metabolic networks from public databases,” IEE Proc. Syst. Biol., Vol. 153, No. 5, September 2006, pp. 379–384.

[15] Beste, D.J.V., T. Hooper, G. Stewart, B. Bonde, C. Avignone-Rossa, M.E. Bushell, P. Wheeler, S. Klamt, A.M. Kierzek, and J. McFadden, “GSMN-TB: a web-based genome-scale network model of Mycobacterium tuberculosis metabolism,” Genome Biol., Vol. 8, R89, May 2007.

[16] Gonzalez, O., S. Gronau, M. Falb, F. Pfeiffer, E. Mendoza, R. Zimmer, and D. Oesterhelt, “Reconstruction, modeling & analysis of Halobacterium salinarum R-1 metabolism,” Mol. BioSyst., Vol. 4, No. 2, February 2008, pp. 148–159.

[17] Forster, J., I. Famili, P. Fu, B.Ø. Palsson, and J. Nielsen, “Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network,” Genome Research, Vol. 13, No. 2, February 2003, pp. 244–253.

[18] Duarte, N.C., M.J. Herrgard, and B.Ø. Palsson, “Reconstruction and validation of Saccharomyces cerevisiae iND750, a fully compartmentalized genome-scale metabolic model,” Genome Research, Vol. 14, No. 7, July 2004, pp. 1298–1309.

[19] Kim, H.U., T.Y. Kim, and S.Y. Lee, “Metabolic flux analysis and metabolic engineering of microorganisms,” Mol. BioSyst., Vol. 4, No. 2, 2008, pp. 113–120.

[20] Kauffman, S.A., “Metabolic stability and epigenesis in randomly constructed genetic nets,” J. Theoret. Biol., Vol. 22, No. 3, March 1969, pp. 437–467.

[21] Liang, S., S. Fuhrman, and R. Somogyi, “REVEAL: A general reverse engineering algorithm for inference of genetic network architectures,” Pacific Symp. Biocomputing, Vol. 3, 1998, pp. 18–29.

[22] Akutsu, T., S. Miyano, and S. Kuhara, “Identification of genetic networks from a smaller number of gene expression patterns under the Boolean network model,” Pac. Symp. Biocomput., 1999, pp. 17–28.

[23] Akutsu, T., S. Miyano, and S. Kuhara, “Inferring qualitative relations in genetic networks and their inference from gene expression time series,” Bioinformatics, Vol. 16, No. 8, April 2000, pp. 727–734.

[24] Cho, K.-H., S.-M. Choo, S.H. Jung, J.-R. Kim, H.-S. Choi, and J. Kim, “Reverse engineering of gene regulatory networks,” IET Systems Biol., Vol. 1, 2007, pp. 149–163.

[25] Shmulevich, I., E.R. Dougherty, S. Kim, and W. Zhang, “Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks,” Bioinformatics, Vol. 18, No. 2, February 2002, pp. 261–274.


[26] Shmulevich, I., E.R. Dougherty, and W. Zhang, “From Boolean to probabilistic Boolean networks as models of genetic regulatory networks,” Proc. IEEE, Vol. 90, No. 11, November 2002, pp. 1778–1792.

[27] Silvescu, A., and V. Honavar, “Temporal Boolean network models of genetic networks and their inference from gene expression time series,” Complex Systems, Vol. 13, 2001, pp. 54–70.

[28] Steuer, R., “Review: On the analysis and interpretation of correlations in metabolomic data,” Briefings in Bioinformatics, Vol. 7, No. 2, June 2006, pp. 151–158.

[29] May, P., S. Wienkoop, S. Kempa, B. Usadel, N. Christian, J. Rupprecht, J. Weiss, L. Recuenco-Munoz, O. Ebenhöh, W. Weckwerth, and D. Walther, “Metabolomics- and proteomics-assisted genome annotation and analysis of the draft metabolic network of Chlamydomonas reinhardtii,” Genetics, Vol. 179, May 2008, pp. 157–166.

[30] Cakmak, A., and G. Ozsoyoglu, “Mining biological networks for unknown pathways,” Bioinformatics, Vol. 23, No. 20, August 2007, pp. 2775–2783.

[31] D’haeseleer, P., S. Liang, and R. Somogyi, “Genetic network inference: from co-expression clustering to reverse engineering,” Bioinformatics, Vol. 16, No. 8, August 2000, pp. 707–726.

[32] Bansal, M., V. Belcastro, A. Ambesi-Impiombato, and D. di Bernardo, “How to infer gene networks from expression profiles,” Mol. Syst. Biol., Vol. 3, No. 78, February 2007.

[33] Steffen, M., A. Petti, J. Aach, P. D’haeseleer, and G. Church, “Automated modelling of signal transduction networks,” BMC Bioinformatics, Vol. 3, No. 34, November 2002.

[34] Arkin, A., and J. Ross, “Statistical construction of chemical reaction mechanisms from measured time-series,” J. Chem. Phys., Vol. 99, No. 3, January 1995, pp. 970–979.

[35] Arkin, A., P. Shen, and J. Ross, “A test case of correlation metric construction of a reaction pathway from measurements,” Science, Vol. 277, No. 5330, August 1997, pp. 1275–1279.

[36] Scott, J., T. Ideker, R.M. Karp, and R. Sharan, “Efficient algorithms for detecting signaling pathways in protein interaction networks,” J. Comp. Biol., Vol. 13, No. 2, March 2006, pp. 133–144.

[37] de Jong, H., “Modeling and Simulation of Genetic Regulatory Systems: A Literature Review,” J. Comput. Biol., Vol. 9, No. 1, 2002, pp. 67–103.

[38] Sachs, K., O. Perez, D. Pe’er, D.A. Lauffenburger, and G.P. Nolan, “Causal protein-signaling networks derived from multiparameter single-cell data,” Science, Vol. 308, No. 5721, April 2005, pp. 523–529.

[39] Murphy, K., “Dynamic Bayesian networks: representation, inference and learning,” Ph.D. Dissertation, U.C. Berkeley, Computer Science Division, 2002.

[40] Yu, J., V.A. Smith, P.P. Wang, A.J. Hartemink, and E.D. Jarvis, “Advances to Bayesian network inference for generating causal networks from observational biological data,” Bioinformatics, Vol. 20, No. 18, July 2004, pp. 3594–3603.

[41] Vera, J., E. Balsa-Canto, P. Wellstead, J.R. Banga, and O. Wolkenhauer, “Power-law models of signal transduction pathways,” Cellular Signalling, Vol. 19, No. 7, July 2007, pp. 1531–1541.

[42] Schmidt, H., K.-H. Cho, and E.W. Jacobsen, “Identification of small scale biochemical networks based on general type system perturbations,” FEBS J., Vol. 272, No. 9, July 2005, pp. 2141–2151.

[43] Kholodenko, B.N., A. Kiyatkin, F.J. Bruggeman, E. Sontag, H.V. Westerhoff, and J.B. Hoek, “Untangling the wires: a strategy to trace functional interactions in signaling and gene networks,” PNAS, Vol. 99, No. 20, October 2002, pp. 12841–12846.

[44] Sontag, E., A. Kiyatkin, and B.N. Kholodenko, “Inferring dynamic architecture of cellular networks using time series of gene expression, protein and metabolite data,” Bioinformatics, Vol. 20, No. 12, August 2004, pp. 1877–1886.

[45] Cho, K.-H., S.-M. Choo, P. Wellstead, and O. Wolkenhauer, “A unified framework for unraveling the interaction model structure of a biochemical network using stimulus-response data,” FEBS Letters, Vol. 579, No. 20, 2005, pp. 4520–4528.

[46] Savageau, M.A., “Biochemical systems analysis: II. Steady state solutions for an n-pool system using a power-law approximation,” J. Theor. Biol., Vol. 25, December 1969, pp. 370–379.

[47] Savageau, M.A., “Biochemical systems analysis: III. Dynamic solutions using a power-law approximation,” J. Theor. Biol., Vol. 26, February 1970, pp. 215–226.

[48] Voit, E.O., Computational Analysis of Biochemical Systems, A Practical Guide for Biochemists and Molecular Biologists, Cambridge, U.K.: Cambridge University Press, 2000.

[49] Takahashi, K., S.N.V. Arjunan, and M. Tomita, “Space in systems biology of signaling pathways – towards intracellular molecular crowding in silico,” FEBS Lett., Vol. 579, No. 8, March 2005, pp. 1783–1788.

[50] Grima, R., and S. Schnell, “A mesoscopic simulation approach for modeling intracellular reactions,” J. Stat. Phys., Vol. 128, No. 1–2, 2006, pp. 139–164.

[51] Kopelman, R., “Fractal Reaction Kinetics,” Science, Vol. 241, No. 4873, September 1988, pp. 1620–1626.

[52] Savageau, M.A., “Michaelis-Menten mechanism reconsidered: implications of fractal kinetics,” J. Theor. Biol., Vol. 176, September 1995, pp. 115–124.



[53] Kikuchi, S., D. Tominaga, M. Arita, K. Takahashi, and M. Tomita, “Dynamic modeling of genetic networks using genetic algorithm and S-system,” Bioinformatics, Vol. 19, No. 5, March 2003, pp. 643–650.

[54] Veflingstad, S.R., J. Almeida, and E.O. Voit, “Priming nonlinear searches for pathway identification,” Theoretical Biology and Medical Modelling, Vol. 1, September 2004, p. 8.

[55] Kimura, S., K. Ide, A. Kashihara, M. Kano, M. Hatakeyama, R. Masui, N. Nakagawa, S. Yokoyama, S. Kuramitsu, and A. Konagaya, “Inference of S-system models of genetic networks using a cooperative coevolutionary algorithm,” Bioinformatics, Vol. 21, No. 7, April 2005, pp. 1154–1163.

[56] Bongard, J., and H. Lipson, “Automated reverse engineering of nonlinear dynamical systems,” PNAS, Vol. 104, No. 24, 2007, pp. 9943–9948.

[57] Bongard, J., and H. Lipson, “Nonlinear system identification using coevolution of models and tests,” IEEE Trans. Evol. Comput., Vol. 9, No. 4, August 2005, pp. 361–384.

[58] Pfeifer, M., and J. Timmer, “Parameter estimation in ordinary differential equations using the method of multiple shooting,” IET Syst. Biol., Vol. 1, No. 2, March 2007, pp. 78–88.

[59] Liebermeister, W., and E. Klipp, “Biochemical networks with uncertain parameters,” IEE Proc. Syst. Biol., Vol. 152, No. 3, September 2005, pp. 97–107.

[60] Sharan, R., and T. Ideker, “Modeling cellular machinery through biological network comparison,” Nature Biotechnol., Vol. 24, No. 4, April 2006, pp. 427–433.

[61] Barabasi, A.-L., and Z.N. Oltvai, “Network biology: understanding the cell’s functional organization,” Nature Reviews Genetics, Vol. 5, No. 2, February 2004, pp. 101–113.

[62] Yu, H., P.M. Kim, E. Sprecher, V. Trifonov, and M. Gerstein, “The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics,” PLoS Comput. Biol., Vol. 3, No. 4, April 2007, p. e59.

[63] Pržulj, N., D.G. Corneil, and I. Jurisica, “Modeling interactome: scale-free or geometric?” Bioinformatics, Vol. 20, No. 18, July 2004, pp. 3508–3515.

[64] Milo, R., S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: simple building blocks of complex networks,” Science, Vol. 298, No. 5594, October 2002, pp. 824–827.

[65] Junker, B.H., et al., Analysis of Biological Networks, New York: Wiley-Interscience, 2006.

[66] Lee, T.I., N.J. Rinaldi, F. Robert, D.T. Odom, Z. Bar-Joseph, G.K. Gerber, N.M. Hannett, C.T. Harbison, C.M. Thompson, I. Simon, J. Zeitlinger, E.G. Jennings, H.L. Murray, D.B. Gordon, B. Ren, J.J. Wyrick, J.-B. Tagne, T.L. Volkert, E. Fraenkel, D.K. Gifford, and R.A. Young, “Transcriptional regulatory networks in Saccharomyces cerevisiae,” Science, Vol. 298, No. 5594, October 2002, pp. 799–804.

[67] Shen-Orr, S., R. Milo, S. Mangan, and U. Alon, “Network motifs in the transcriptional regulation network of Escherichia coli,” Nature Genetics, Vol. 31, No. 1, May 2002, pp. 64–68.

[68] Mangan, S., and U. Alon, “Structure and function of the feed-forward loop network motif,” Proc. Natl. Acad. Sci., Vol. 100, No. 21, October 2003, pp. 11980–11985.

[69] Prill, R.J., P.A. Iglesias, and A. Levchenko, “Dynamic Properties of Network Motifs Contribute to Biological Network Organization,” PLoS Biology, Vol. 3, No. 11, October 2005, e343.

[70] Clauset, A., C. Moore, and M.E.J. Newman, “Hierarchical structure and the prediction of missing links in networks,” Nature, Vol. 453, No. 7191, 2008, pp. 98–101.

[71] Kashtan, N., S. Itzkovitz, R. Milo, and U. Alon, “Mfinder tool guide,” Technical report, Department of Molecular Cell Biology and Computer Science & Applied Mathematics, Weizmann Institute of Science, 2002.

[72] Schreiber, F., and H. Schwöbbermeyer, “MAVisto: A tool for the exploration of network motifs,” Bioinformatics, Vol. 21, No. 17, September 2005, pp. 3572–3574.

[73] Wernicke, S., and F. Rasche, “FANMOD: A tool for fast network motif detection,” Bioinformatics, Vol. 22, No. 9, May 2006, pp. 1152–1153.

[74] Papin, J.A., and B.Ø. Palsson, “Topological analysis of mass-balanced signaling networks: a framework to obtain network properties including crosstalk,” J. Theoret. Biol., Vol. 227, No. 2, March 2004, pp. 283–297.

[75] Hofmeyr, J.-H. S., “Steady state modelling of metabolic pathways: a guide to the prospective simulator,” Comp. Appl. Biosci., Vol. 2, 1986, pp. 5–11.

[76] Pfeiffer, T., I. Sanchez-Valdenebro, J.C. Nuno, F. Montero, and S. Schuster, “METATOOL: for studying metabolic networks,” Bioinformatics, Vol. 15, No. 3, 1999, pp. 251–257.

[77] Schuster, S., and C. Hilgetag, “On elementary flux modes in biochemical systems at steady state,” J. Biol. Syst., Vol. 2, 1994, pp. 165–182.

[78] Schuster, S., C. Hilgetag, J. Woods, and D. Fell, “Reaction routes in biochemical reaction systems: algebraic properties, validated calculation procedure and example from nucleotide metabolism,” J. Math. Biol., Vol. 45, No. 2, August 2002, pp. 153–181.

[79] Wagner, A., and D.A. Fell, “The small world inside large networks,” Proc. Biol. Sci., Vol. 268, No. 1478, September 2001, pp. 1803–1810.


[80] Avignone-Rossa, A.J., A. White, P. Kuiper, M. Postma, M. Bibb, and M. Teixeira de Mattos, “Carbon flux distribution in antibiotic-producing chemostat cultures of Streptomyces lividans,” Metabolic Engineering, Vol. 4, No. 2, April 2002, pp. 138–150.

[81] Klamt, S., and E. Gilles, “Minimal cut sets in biochemical reaction networks,” Bioinformatics, Vol. 20, No. 2, January 2004, pp. 226–234.

[82] Poolman, M.G., C. Sebu, M.K. Pidcock, and D.A. Fell, “Modular decomposition of metabolic systems via null-space analysis,” J. Theoret. Biol., Vol. 249, No. 4, December 2007, pp. 691–705.

[83] Jamshidi, N., and B.Ø. Palsson, “Formulating genome-scale kinetic models in the post-genome era,” Mol. Syst. Biol., Vol. 4, No. 171, March 2008.

[84] Schuster, S., et al., “Modeling and simulating metabolic networks,” in T. Lengauer, (ed.), Bioinformatics: From Genomes to Therapies, New York: Wiley, 2007.

[85] Friedman, A., and N. Perrimon, “Genetic screening for signal transduction in the era of network biology,” Cell, Vol. 128, No. 2, January 2007, pp. 225–231.

[86] Kaluza, P., M. Ipsen, M. Vingron, and A.S. Mikhailov, “Design and statistical properties of robust functional networks: A model study of biological signal transduction,” Phys. Rev. E, Vol. 75, 2007, R015101.

[87] Stojmirovic, A., and Y.-K. Yu, “Information flow in interaction networks,” J. Comput. Biol., Vol. 14, No. 8, October 2007, pp. 1115–1143.

[88] Li, F., T. Long, Y. Lu, Q. Ouyang, and C. Tang, “The yeast cell-cycle network is robustly designed,” PNAS, Vol. 101, No. 14, April 2004, pp. 4781–4786.

[89] Davidich, M.I., and S. Bornholdt, “Boolean network model predicts cell cycle sequence of fission yeast,” PLoS ONE, Vol. 3, No. 2, 2008, e1672.

[90] Wolkenhauer, O., M. Ullah, P. Wellstead, and K.-H. Cho, “The dynamic systems approach to control and regulation of intracellular networks,” FEBS Lett., Vol. 579, No. 8, 2005, pp. 30–34.

[91] Alves, R., and M. Savageau, “Comparing systemic properties of ensembles of biological networks by graphical and statistical methods,” Bioinformatics, Vol. 16, No. 6, June 2000, pp. 527–533.

[92] Alves, R., and M. Savageau, “Systemic properties of ensembles of metabolic networks: application of graphical and statistical methods to simple unbranched pathways,” Bioinformatics, Vol. 16, No. 6, June 2000, pp. 534–547.

[93] Strogatz, S., Nonlinear Dynamics and Chaos, Boulder, CO: Westview Press, 2000.

[94] Joshi, A., and B.Ø. Palsson, “Metabolic dynamics in the human red-cell. Part I A comprehensive kinetic model,” Journal of Theoretical Biology, Vol. 141, No. 4, 1989, pp. 515–528.

[95] Ni, T.C., and M.A. Savageau, “Application of biochemical systems theory to metabolism in human red blood cells,” J. Biol. Chem., Vol. 271, No. 14, 1996, pp. 7927–7941.

[96] Bornholdt, S., “Systems biology: Less is more in modeling large genetic networks,” Science, Vol. 310, No. 5747, October 2005, pp. 449–451.

[97] Crampin, E.J., S. Schnell, and P.E. McSharry, “Mathematical and computational techniques to deduce complex biochemical reaction mechanisms,” Prog. Biophys. Mol. Biol., Vol. 86, No. 1, September 2004, pp. 77–112.

[98] Kutalik, Z., H.-H. Cho, and O. Wolkenhauer, “Optimal sampling time selection for parameter estimation in dynamic pathway modeling,” BioSystems, Vol. 75, No. 1–3, 2004, pp. 43–55.

[99] Kelley, B.P., R. Sharan, R.M. Karp, T. Sittler, D.E. Root, B.R. Stockwell, and T. Ideker, “Conserved pathways within bacteria and yeast as revealed by global protein network alignment,” PNAS, Vol. 100, No. 20, September 2003, pp. 11394–11399.

[100] Lee, J.M., E.P. Gianchandani, J.A. Eddy, and J.A. Papin, “Dynamic Analysis of Integrated Signaling, Metabolic, and Regulatory Networks,” PLoS Comput. Biol., Vol. 4, No. 5, May 2008, p. e1000086.



CHAPTER 13

Transcriptome Analysis of Regulatory Networks

Katy C. Kao, Linh M. Tran, and James C. Liao

1 Department of Chemical Engineering, Texas A&M University
2 Rosetta Inpharmatics LLC
3 Department of Chemical and Biomolecular Engineering, University of California, Los Angeles


Key terms: Transcription factor activities; Network component analysis; DNA microarrays

Abstract

The coordinated activities of transcription factors are responsible for changes in the gene expression of the cell upon a shift in the environment. The ability of a transcription factor to bind to its DNA targets, and therefore to enact a change in expression, is the result of signal transduction cascades generated in response to environmental stimuli. Thus, the change in transcript abundance is indicative of changes in the cellular state of the organism. With high-throughput genomic tools now readily available, large-scale transcriptome data can be generated in order to obtain “snapshots” of the transcriptional activity of the cell. In simple transcriptional regulatory networks, where one gene is regulated by one transcription factor, one can simply infer the activity of the transcription factor from the expression of the gene it regulates. However, most transcriptional regulatory networks are complex, with multiple connections between transcription factors and genes (Figure 13.1). Thus, systems biology approaches are required for the deconvolution of transcriptome data to infer the activities of transcription factors. The transcription factors identified as being involved under the condition of interest can be used to generate testable hypotheses regarding the cellular or metabolic responses to environmental perturbations, such as drug effects, and may identify new drug targets.

13.1 Introduction

The coordinated expression of genes defines the cellular and metabolic state of the cell. With increasing knowledge of the transcriptional regulatory network, it is now possible to infer the state of the cell via transcriptional profiling. Cells adapt to environmental changes by altering their cellular metabolism via signal transduction cascades, resulting in the ultimate activation or deactivation of a set of DNA-binding proteins called transcription factors. Once activated, transcription factors bind to specific regions of DNA to either positively or negatively regulate transcription. Although some transcription factors are activated by their synthesis, others require post-translational modification or ligand binding, which induce conformational changes, to be able to bind to DNA. The expression of the genetic repertoire of a cell is dictated by the collective activities of these transcription factors, which regulate whether or not specific regions in the genome are transcribed and, if so, to what degree. Each transcript can be regulated by more than one transcription factor, whose combined activities determine when, where, and how much a given gene is expressed. While more than one gene can be regulated by the same transcription factor, the effects of the transcription factor can differ depending on the target gene. A set of genes that is regulated by the same transcription factor is defined as a regulon. Figure 13.1 shows a cartoon of a transcriptional regulatory network.

[Figure 13.1: diagram showing transcription factors TFx, TFy, and TFz, genes 1–8, and rows labeled CS and TFA.]

Figure 13.1 Hypothetical transcriptional regulatory network. Green circles represent transcription factors. Brown circles represent the activity of transcription factors (TFA), which can be high or low depending on the cellular state. Yellow squares are genes. The waves under the genes represent changes in their transcript abundance. The arrows represent the connections between transcription factors and the genes they regulate. The heavier the arrow, the stronger the effect of the transcription factor on the target gene. The color of the arrows represents the effect of the transcription factor, either as a positive or negative regulator, on the target gene.

Until the mid-1990s, the transcriptional expression of only a handful of genes could be assayed at a time, such as through traditional northern blot analyses. In 1995, Patrick Brown’s lab at Stanford University developed the first DNA microarray [1], in which thousands of sequences of interest can be spotted on a single glass microscope slide and assayed to determine relative expression levels. These microarrays revolutionized the molecular biology field, enabling researchers to assay the expression of thousands or tens of thousands of genes in a single experiment. The use of DNA microarrays has thus greatly accelerated our ability to generate data regarding specific transcriptional regulatory networks. Moreover, the recently developed ultra high-throughput sequencing (UHTS) technology allows scientists to sequence cDNA libraries (RNA-seq) at unprecedented levels and generate tens of millions of sequence reads per experiment. In addition, with the recently developed chromatin immunoprecipitation DNA microarray (ChIP-chip) and the more recent chromatin immunoprecipitation sequencing (ChIP-seq) technologies, the DNA binding sites of transcription factors can now be determined in a high-throughput manner. With the available information on the connectivities of the transcriptional regulatory network, and the ability to assay the transcriptome on a genome-wide scale, we can start to use the measured transcriptional profiles to gain further understanding of cellular behavior. Utilizing DNA microarrays or the new RNA-seq methods, one can obtain the expression levels of all known (and novel) transcripts in an organism under any condition of interest.

However, the underlying physiological perturbations that result in the gene expression profiles cannot always be easily determined, due to the complexity of the transcriptional regulatory network. Assays have already been developed for certain signaling molecules, which can thus be used in conjunction with the gene expression profiles to show their specific role(s) in changing the metabolic state of the cell in response to these perturbations. Unfortunately, the activities of the majority of transcription factors cannot be determined experimentally. Thus, systems biology approaches are necessary to determine the cellular perturbations associated with the environmental condition. Since each transcript level is determined by the combined activities of its regulators, it is possible to deconvolute the transcriptional profiles from DNA microarrays using Network Component Analysis (NCA) [2] and obtain the activity profiles of transcription factors. Based on which transcription factors are more active or less active under the conditions of interest, we can gain insight into how specific signaling or cellular pathways are perturbed. This information will help us to, for example, better understand the underlying metabolic and cellular responses to perturbations, such as a drug effect, and may lead to the identification of potential drug targets.
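The need for deconvolution can be illustrated with a small forward calculation. The sketch below assumes, purely for illustration, that the log expression ratio of each gene is a weighted sum of the log activity ratios of its regulators. This log-linear form, the signed connectivity strengths, and all names are assumptions that this passage does not state, and the code only predicts expression from given activities; it does not perform the NCA decomposition itself.

```python
# Hypothetical signed connectivity strengths (gene -> {TF: strength}).
# Positive values stand for activation, negative values for repression.
STRENGTH = {
    "gene1": {"TFx": 1.0},                # regulated by TFx only
    "gene2": {"TFx": 1.0, "TFy": -1.0},   # activated by TFx, repressed by TFy
    "gene3": {"TFy": 0.5},                # weakly activated by TFy
}

def predicted_log_ratios(strength, tfa_log_ratio):
    """Predict each gene's log expression ratio from TF log activity ratios."""
    return {
        gene: sum(weight * tfa_log_ratio[tf] for tf, weight in regulators.items())
        for gene, regulators in strength.items()
    }

# Both transcription factors become more active by the same amount.
tfa = {"TFx": 2.0, "TFy": 2.0}
pred = predicted_log_ratios(STRENGTH, tfa)
for gene, value in pred.items():
    print(gene, value)
```

In this toy case gene1 tracks TFx directly, whereas gene2 shows no net change even though both of its regulators changed. A single transcript level is therefore not enough to read off the activity of a transcription factor when regulators are shared, which is why the whole profile has to be decomposed jointly.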

13.2 Methods

13.2.1 Materials

General equipment

1. Microcentrifuge

2. Floor centrifuge

3. Clinical benchtop centrifuge

4. Fluorometer or spectrophotometer

Resources

1. MATLAB

2. Connectivity information for known transcription factors


13.2.2 Cell harvesting

Grow the cells under the condition of interest and harvest them. The cells must be harvested in a manner that keeps the effect of harvesting on the transcriptional profile to a minimum. A filtration method is described for yeast and E. coli.

13.2.2.1 For yeast

1. Attach a 0.45-μm analytical test filter funnel (Nalgene) to a vacuum manifold.

2. Fill a 50-ml conical tube approximately half full with liquid nitrogen and place it in a bucket of dry ice.

3. Quickly filter the culture through the filter funnel.

4. Snap off the top of the filter funnel, carefully peel off the filter containing the cell cake with a pair of clean tweezers, and immediately place it in the conical tube.

5. Allow the liquid nitrogen to evaporate, then tighten the lid and store at −80°C until ready for use.

13.2.2.2 For E. coli

1. Attach a 0.45-μm analytical test filter funnel (Nalgene) to a vacuum manifold.

2. Fill a 50-ml conical tube with 5 ml of RNAlater.

3. Follow step 3 of the yeast procedure.

4. Follow step 4 of the yeast procedure.

5. Make sure the filter cake is completely submerged in RNAlater and mix to resuspend the cells, then store at −80°C until ready for use.

13.2.3 RNA purification

Reagents

2X RNA loading buffer (recipe from New England Biolabs), stored at –20°C:

0.02% Bromophenol Blue
26% ficoll (w/v)
14 M urea
4 mM EDTA
180 mM Tris-borate
pH 8.3 @ 25°C

Depending on the type of cells used, the RNA isolation method may vary. The methods described below are for yeast and E. coli. Care should be taken whenever working with RNA to ensure the integrity of the sample, as ribonucleases (RNases) are very stable and do not require cofactors to function. Since our skin contains RNases, gloves should be worn whenever working with RNA. Make sure all reagents used are made with molecular biology grade or DEPC-treated water. Use filtered tips that are certified nuclease-free.

Purified RNA should be stored at −80°C.

13.2.3.1 For yeast

Before starting, set the temperature of a water bath to 60°C.

1. Thaw sample on ice.

2. Add 5 ml of RNA buffer and 5 ml of acid phenol to each conical tube.



3. Incubate in the water bath at 60°C for 1 hour, vortexing vigorously every 10 to 20 minutes.

4. Place on ice for 10 minutes.

5. Centrifuge in floor centrifuge at 10,000 rpm for 10 minutes at 4°C.

6. Transfer the aqueous phase (top layer) to new 15-ml conical tube.

7. Add 5 ml of acid phenol to sample and vortex well.

8. Centrifuge in benchtop centrifuge for 10 minutes at 3,500 rpm.

9. Transfer the aqueous phase to new 15-ml conical tube.

10. Add 5 ml of chloroform to sample and vortex well.

11. Centrifuge for 10 minutes at 3,500 rpm.

12. Transfer the aqueous phase to a new 50-ml conical tube (approximately 4.5 ml of aqueous solution should be recovered).

13. Add 450 μl of 3M sodium acetate (pH = 5.2) and 10 ml of 100% ethanol.

14. Mix and precipitate overnight at –20°C.

15. Spin down in floor centrifuge at 10,000 rpm for 20 minutes at 4°C to pellet RNA.

16. Discard the supernatant and rinse the pellet with ice-cold 70% ethanol to remove salt (make sure not to disturb the pellet).

17. Air dry the pellet (removing as much liquid as possible with a sterile pipette tip attached to a vacuum will greatly expedite drying).

18. The pellet should start to clear as it dries; then resuspend the pellet in 200 to 500 μl of nuclease-free water and transfer to a 1.5-ml microcentrifuge tube.

13.2.3.2 For E. coli

1. Thaw sample on ice.

2. Spin down sample at 8,000 rpm at 4°C for 5 minutes.

3. Remove filter and RNAlater solution.

4. Add 800 μl of TE + 1mg/ml lysozyme to dissolve the cell wall.

5. Add 80 μl of 10% SDS to lyse the protoplasts and denature proteins.

6. Heat at 64°C for 1 to 2 minutes to lyse the cells (the solution turns clear when lysis is complete).

7. Add 88 μl of 1M sodium acetate at pH = 5.2 to precipitate out the nucleic acids.

8. Add 960 μl of water-saturated acid phenol at pH = 4.3 to isolate total RNA from other cellular components (low-pH phenol is used because RNA is stable at low pH and DNA moves to the organic phase at low pH).

9. Invert tubes 10 times.

10. Incubate at 64°C for 6 minutes, inverting tube 10 times every 40 seconds.

11. Chill on ice for 5 to 10 minutes.

12. Centrifuge at 14,000 rpm for 10 minutes at 4°C.

13. Recover the aqueous phase (top layer), which contains the RNA, without disturbing the thick white interphase.

14. Add an equal volume of chloroform to the recovered RNA (this removes residual phenol).

15. Mix by inverting 10 times.

16. Centrifuge at maximum speed for 5 minutes at 4°C.

17. Recover aqueous phase (~600 μl) into a 15-ml conical tube.


18. Follow the instructions of the RNeasy Midi kit (Qiagen) to clean up the RNA; elute twice with 200 μl of RNase-free water, ending with 350 μl of total RNA.

13.2.3.3 DNaseI digestion to remove residual DNA

1. Mix the following components in a microcentrifuge tube:

350 μl  Total RNA
39 μl  10X React 3 buffer (New England Biolabs)
0.5 μl  RNaseOUT RNase inhibitor (Invitrogen) (20 units)
0.5 μl  DNaseI (Invitrogen)
390 μl  Total

2. Incubate at 37°C for 20 to 30 minutes.

3. Add 390 μl of acid phenol/chloroform (pH = 4.3) to denature the enzymes.

4. Recover aqueous layer.

5. Add 39 μl of 3M sodium acetate (pH = 5.2).

6. Mix well by vortexing.

7. Add 1,080 μl (2.5X volume) of ice cold 100% ethanol to precipitate out the RNA.

8. Place at −80°C for >30 minutes to precipitate out as much RNA as possible.

9. Spin at 4°C for 15 to 20 minutes.

10. Air dry after removing as much supernatant as possible using a sterile pipette tip attached to a vacuum.

11. Resuspend in 10 to 20 μl of nuclease-free water.

13.2.3.4 Quantify and check RNA quality

1. Quantify by fluorometer with RNA-specific dyes (Qubit by Invitrogen is a good system) or by spectrophotometer.

2. Run 1 μg of total RNA on a 1% agarose gel in 1X TBE:

i. Mix the RNA with an equal volume of RNA loading dye.

ii. Denature at 65°C for 10 minutes, then immediately chill on ice.

iii. Run the gel and make sure the 28S and 18S (yeast) or the 23S and 16S (E. coli) ribosomal RNA bands are sharp and not smeared.

13.2.4 Transcriptional profiling using DNA microarrays

Several different types of DNA microarrays are currently in wide use, including homemade spotted arrays and manufactured arrays such as Affymetrix GeneChips and Agilent and Nimblegen arrays. Different labeling and hybridization protocols are used depending on the array platform. The labeling and hybridization protocol described in the first part is for spotted arrays on amino silane coated glass slides. In this protocol, the fluorescent dyes are incorporated during cDNA synthesis by a reverse transcriptase; the reference is labeled with either cy3 or cy5 and the experimental sample is labeled with the other.

Reagents

1. Labeling dNTP mix:



i. 0.5 mM dATP, dCTP, dGTP

ii. 0.2 mM dTTP

2. Hybridization chambers

3. Staining dishes

4. Microarray scanner

5. Hybridization and washing stations (Affymetrix)

6. Additional reagents and manufacturers are listed under each protocol section

13.2.4.1 Labeling

1. Combine the following in 12 μl total volume:

1.5 μl  Random hexamer (stock solution at 3 μg/μl)
10 μl  Total RNA sample (total 30 μg)

2. Denature RNA at 70°C for 10 minutes.

3. Immediately chill on ice.

4. Add the following together:

11.5 μl  30 μg total RNA + 4.5 μg random hexamer
5 μl  5X first strand buffer
2.5 μl  DTT
2.5 μl  Labeling dNTP mix
2.5 μl  Total of 2.5 nmol of cy3-dUTP or cy5-dUTP
25 μl  Total volume

5. Preheat solution at 42°C for 2 minutes.

6. Add 2 μl of Reverse transcriptase II H- (200 U/μl), mix well.

7. Label at 42°C for 1 to 2 hours in the dark for the cDNA synthesis.

8. Stop the reaction by adding 2.5 μl of EDTA (pH = 8.0), mix well.

9. Hydrolyze the RNA by adding 5 μl of 1N sodium hydroxide, mix well, and incubate at 65°C for 40 minutes.

10. Add 150 μl of TE buffer (pH = 8.0) to each of the cy3- and cy5-labeled cDNA samples.

11. Combine the cy3- and cy5-labeled samples, then remove unincorporated dyes and concentrate to approximately 2 μl using a Microcon-30 column:

i. Apply the sample to Microcon-30 column.

ii. Spin at max speed in a microcentrifuge for 12 minutes at room temperature and discard the flow through.

iii. Add 300 μl of TE buffer to column.

iv. Repeat steps ii. and iii.

v. Spin at max speed for 12 minutes.

vi. Check the color of the flow through; if it is not clear or nearly clear, repeat the washing.

vii. Invert the column into a new microcentrifuge tube and recover the sample by centrifugation for 1 minute (the recovered sample should be a dark purple color).


13.2.4.2 Hybridization


1. Make hybridization solution (for a 22 × 40-mm printed microarray slide):

15 μl  Formamide (kept at 4–8°C)
4.5 μl  20X SSC (filtered through 0.22-μm filter)
3 μl  10% SDS (filtered through 0.22-μm filter)
3 μl  10X Denhardt’s solution (prevents nonspecific hybridization)
4.5 μl  Blocking DNA (1:1 yeast tRNA (10 μg/μl) : salmon sperm DNA (10 μg/μl))

2. Add hybridization solution to labeled sample and denature at 95°C for 3 minutes.

3. Let stand at room temperature for 5 minutes.

4. Collect content by brief centrifugation.

5. Carefully pipette 25 μl of sample onto the active side of the DNA microarray slide (usually the side with the barcode, but this may differ depending on the array manufacturer).

6. Place a glass cover slip over the sample with clean tweezers (be careful not to introduce any bubbles; if bubbles are visible under the cover slip, carefully lift one corner of the cover slip and the bubbles will usually migrate to the edge and be removed).

7. Fill the appropriate holes in the hybridization chamber with 3X SSC or water to keep the sample hydrated during hybridization.

8. Place the slide inside the hybridization chamber and make sure the chamber is securely tightened, then hybridize for 12 to 16 hours (overnight) in a 42°C water bath.

13.2.4.3 Washing and scanning

1. Remove the slide from the hybridization chamber and dip it several times in a staining dish containing 0.2X SSC + 0.1% SDS (filtered with 0.22-μm filter) to allow the cover slip to fall off (never forcefully pry the cover slip off, as this may damage the array).

2. Wash in 0.2X SSC + 0.1% SDS (filtered with 0.22-μm filter) for 5 minutes with occasional agitation (to prevent scratching, be careful not to let the active side of the slide touch anything).

3. Immediately transfer to a new staining dish containing 0.2X SSC and wash for 5 minutes with occasional agitation (do not allow the slide to dry during transfer, as this may result in streaking on the array).

4. Immediately transfer to a new staining dish containing 0.02X SSC and wash for 5 minutes with occasional agitation.

5. Quickly dry in a table top centrifuge at < 2,000 rpm for 5 minutes.

6. Scan using a DNA microarray scanner following the manufacturer’s operating instructions (most scanners allow the user to modify the PMT or the sensitivity of the lasers; to maximize the dynamic range of the array signals, make sure the resulting microarray image contains a few saturated spots).



The second microarray protocol is for Affymetrix GeneChip arrays. These are one-channel arrays, where the samples are labeled with biotin, and the reference and experimental samples are hybridized to separate arrays.

13.2.4.4 For mRNA

1. Enrich poly(A) RNA from 1 mg of total RNA using the Oligotex mRNA Midi kit (Qiagen) following the manufacturer’s instructions.

2. Repeat the poly(A) RNA enrichment for a total of two rounds of enrichment.

3. Remove residual DNA from the sample by treating the poly(A) RNA with the Turbo DNA-free kit (Ambion) following the manufacturer’s instructions. Take care not to carry over any inactivating agent in the last step.

4. Quantify poly(A) RNA concentration using a fluorometer or a spectrophotometer.

5. Mix the following together:

x μl  9 μg of poly(A) RNA
1.5 μl  3 μg/μl random hexamer (total 4.5 μg)
y μl  Molecular biology grade water
125 μl  Total

6. Incubate at 70°C for 10 minutes.


8. Mix on ice the following:

125 μl  Poly(A) RNA and random hexamer
40 μl  5X First strand buffer (Invitrogen)
20 μl  0.1 M DTT (Invitrogen) (0.01 M final)
5 μl  10 mM dNTP mix (0.25 mM final)
10 μl  Superscript II reverse transcriptase (200 U/μl) (Invitrogen) (2,000 U final)
200 μl  Total volume

9. Carry out the reaction at 42°C for 1 hour.

10. Remove RNA by treating with the following:

3 μl  RNase cocktail (Applied Biosystems) (final 15 U RNase A and 60 U RNase T1)
6 μl  RNase H (New England Biolabs) (final 30 U)

11. Incubate for 20 minutes at 37°C.

12. Purify by extraction with 210 μl of buffer-saturated phenol:chloroform (1:1) solution.

13. Mix well.

14. Centrifuge at max speed for 2 minutes.

15. Extract the aqueous phase (top layer).

16. Add 21 μl of 3M sodium acetate (pH 5.2) and mix well.

17. Add 500 μl of 100% ethanol.

18. Mix well and precipitate at –20°C for at least 30 minutes.

19. Spin at 4°C for 10 minutes to pellet the first strand cDNA.

20. Remove the supernatant.

21. Rinse twice with 500 μl 80% ice cold ethanol.

22. Air dry.


23. Resuspend in 25 μl of molecular biology grade water.

24. Quantify using a fluorometer or spectrophotometer (usually obtain 2 to 3 times the amount of cDNA as input total RNA).

25. Take 4.5 μg of cDNA and fragment to 50 to 100 bp using DNaseI (this takes some trial and error, as DNaseI activity may differ slightly depending on the batch and age of the enzyme):

i. Mix together the following in 50 μl total volume on ice:

a. 4.5 μg of cDNA;

b. 5 μl of 10X OnePhorAll buffer (Amersham Pharmacia);

c. 1.5 μl of 50 mM CoCl2;

d. DNase I (start with 0.1 U);

ii. Fragment at 37°C for 5 minutes (best do in a thermal cycler).

iii. Inactivate the DNaseI at 99°C for 15 minutes (best do in a thermal cycler).

iv. Run 5 μl of sample on a 2% agarose gel alongside unfragmented cDNA (usually a smear between approximately 100 and 1,600 bp) and make sure the DNaseI-digested cDNA has a size distribution between 50 and 100 bp.

26. Label the 3’ ends with biotin using Terminal Transferase:

45 μl  Digested cDNA
3.5 μl  1 nmol/μl Biotin-11-ddATP (Perkin Elmer) (0.07 mM final concentration)
1 μl  400 U/μl Terminal Transferase (Roche Applied Science)

27. Label at 37°C for 2 hours.
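The volumes in steps 5 and 8 of the mRNA protocol above follow from simple dilution arithmetic. The sketch below, in Python, computes x and y for step 5 and rechecks the final concentrations quoted in step 8. The poly(A) RNA stock concentration of 0.2 μg/μl and the function names are hypothetical and are used only for illustration.

```python
def volume_for_mass(mass_ug, stock_ug_per_ul):
    """Volume (ul) of stock needed to deliver a given mass (ug)."""
    return mass_ug / stock_ug_per_ul


def final_concentration(stock_conc, added_ul, total_ul):
    """C1 * V1 = C2 * V2, solved for the final concentration C2."""
    return stock_conc * added_ul / total_ul


# Step 5: 9 ug of poly(A) RNA plus 1.5 ul of random hexamer in 125 ul total.
rna_stock = 0.2                                  # ug/ul, hypothetical measurement
x_ul = volume_for_mass(9.0, rna_stock)           # volume of poly(A) RNA
hexamer_ul = 1.5
hexamer_ug = hexamer_ul * 3.0                    # 3 ug/ul stock
y_ul = 125.0 - x_ul - hexamer_ul                 # water needed to reach 125 ul
if y_ul < 0:
    raise ValueError("RNA stock is too dilute to fit in 125 ul; concentrate it first")

# Step 8: final amounts in the 200-ul reverse transcription reaction.
dtt_final_M = final_concentration(0.1, 20.0, 200.0)     # from 0.1 M stock
dntp_final_mM = final_concentration(10.0, 5.0, 200.0)   # from 10 mM stock
rt_units = 10.0 * 200.0                                  # 10 ul at 200 U/ul

print(f"RNA: {x_ul:.1f} ul, water: {y_ul:.1f} ul, hexamer: {hexamer_ug:.1f} ug")
print(f"DTT: {dtt_final_M:.3f} M, dNTP: {dntp_final_mM:.2f} mM, RT: {rt_units:.0f} U")
```

The same two helpers apply to the total RNA protocol by changing the masses and the 120-μl target volume.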

13.2.4.5 For Total RNA

1. Remove residual DNA from the sample by treating the total RNA with the Turbo DNA-free kit (Ambion) following the manufacturer’s instructions. Take care not to carry over any inactivating agent in the last step.

2. Quantify total RNA concentration using a fluorometer or a spectrophotometer.

3. Mix the following together:

x μl  20 μg of total RNA
9 μl  3 μg/μl random hexamer (total 9 μg)
y μl  Molecular biology grade water
120 μl  Total

4. Incubate at 70°C for 10 minutes.


6. Mix on ice the following:

120 μl  Total RNA and random hexamer
40 μl  5X First strand buffer (Invitrogen)
20 μl  0.1 M DTT (Invitrogen) (0.01 M final)
10 μl  10 mM dNTP mix (0.5 mM final)
10 μl  Superscript II reverse transcriptase (200 U/μl) (Invitrogen) (2,000 U final)
200 μl  Total volume

7. Carry out the reaction at 42°C for 1 hour.

8. Remove RNA by following steps 10 to 24 from the mRNA protocol.



9. Fragment 15 μg of cDNA using 0.6 units of DNaseI and end label with terminal transferase following steps 25 to 27 from the mRNA protocol.

13.2.4.6 Hybridization

Hybridize cDNA to GeneChip arrays, wash, and scan following the manufacturer’s instructions.


13.3.1 Acquisition of DNA microarray data

For two-channel arrays, the signal intensities in the cy3 (green) and cy5 (red) channels are recorded for each spot (gene) on the microarray, and the ratio of the two intensities reflects the gene expression ratio between the two samples. There are several software packages available for retrieving the signal intensity data for each spot on the array. Some examples are Imagene (Biodiscovery), GenePix Pro (Molecular Devices, formerly Axon), and free software such as ScanAlyze, developed by Mike Eisen’s group at Berkeley, and Spotfinder by TIGR. Most analysis software has efficient gridding algorithms for finding each spot. However, it is important for each researcher to manually look over the spots identified by the algorithm, as some arrays may have shifted grids or spots that the program is not able to find accurately. Most software will output both the mean and the median signal intensities for each spot and for the background surrounding the spot. It is better to use the median intensity, since it is less sensitive to dust and other small defects in the array. Output the signal and background intensities from the program, and use the array definition files provided by the manufacturer of the array to match the annotation to each spot (usually output as coordinates within the metagrid and the grid). Some image analysis software allows users to input an array definition file, in which case the output will already contain the information for each spot (e.g., gene name, sequence, function). For Affymetrix GeneChips, follow the manufacturer’s instructions for image analysis.
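As a minimal sketch of how the exported intensities might be turned into per-spot ratios, the Python example below subtracts the local background from the median signal in each channel and takes the log2 ratio. Background subtraction, the `floor` cutoff, the function name, and the numbers are assumptions made for illustration; they are not prescribed by any of the packages named above.

```python
import numpy as np


def spot_log_ratios(cy5_median, cy5_background, cy3_median, cy3_background, floor=1.0):
    """Background-subtracted log2(cy5 / cy3) for each spot.

    Spots whose background-subtracted signal is below `floor` in either
    channel are returned as NaN so that they can be filtered out later.
    """
    cy5 = np.asarray(cy5_median, dtype=float) - np.asarray(cy5_background, dtype=float)
    cy3 = np.asarray(cy3_median, dtype=float) - np.asarray(cy3_background, dtype=float)
    usable = (cy5 >= floor) & (cy3 >= floor)
    ratios = np.full(cy5.shape, np.nan)
    ratios[usable] = np.log2(cy5[usable] / cy3[usable])
    return ratios


# Hypothetical median intensities for three spots (arbitrary scanner units).
ratios = spot_log_ratios(
    cy5_median=[1100, 300, 90],
    cy5_background=[100, 100, 110],
    cy3_median=[600, 500, 900],
    cy3_background=[100, 100, 100],
)
print(ratios)  # third spot is NaN because its cy5 signal is below background
```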

13.3.2 Normalization

Due to variations in the amount of starting RNA, cDNA synthesis efficiency, and dye-specific and laser-specific effects, the raw data need to be normalized in order to get an accurate ratio between the two samples. There are a variety of methods for normalization. In general, without external controls, Lowess normalization is considered a reasonable method for normalizing microarray data. Freely distributed software, such as lcDNA, developed as a collaboration between the groups of Wing Wong (Stanford University) and James Liao (UCLA), and MIDAS from TIGR, has options for filtering bad spots and for normalizing the data. For Affymetrix GeneChips, software such as Affymetrix’s GeneChip Operating Software or dChip is used for normalization, quality control, and obtaining normalized ratios of the data. Multiple replicates are necessary to assess the quality of the data and to assign significance to the expression ratios obtained for each gene in each experiment. Significance Analysis of Microarrays (SAM) from the Tibshirani group at Stanford and lcDNA are both good software packages for finding genes that are significantly induced or repressed. With both of these, genes whose log ratios are statistically significantly different from 0 are considered to be either induced or repressed.
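To illustrate the idea behind the Lowess normalization mentioned above, the following Python sketch fits a locally weighted straight line to the log ratio M as a function of the average log intensity A and subtracts the fitted trend. It is a simplified stand-in (tricube weights, no robustness iterations), not the implementation used by lcDNA or MIDAS; the function names, the `frac` value, and the simulated dye bias are assumptions.

```python
import numpy as np


def local_linear_fit(x, y, frac=0.4):
    """Locally weighted linear regression of y on x, evaluated at every x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    k = min(n, max(3, int(np.ceil(frac * n))))     # neighbors used per fit
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        bandwidth = np.sort(dist)[k - 1]
        if bandwidth == 0.0:                       # all neighbors share one x
            fitted[i] = y[dist == 0.0].mean()
            continue
        w = np.clip(1.0 - (dist / bandwidth) ** 3, 0.0, 1.0) ** 3   # tricube
        x_bar = (w * x).sum() / w.sum()
        y_bar = (w * y).sum() / w.sum()
        s_xx = (w * (x - x_bar) ** 2).sum()
        slope = (w * (x - x_bar) * (y - y_bar)).sum() / s_xx if s_xx > 0 else 0.0
        fitted[i] = y_bar + slope * (x[i] - x_bar)
    return fitted


def lowess_normalize(cy5, cy3, frac=0.4):
    """Remove the intensity-dependent trend from log2(cy5 / cy3)."""
    log5 = np.log2(np.asarray(cy5, dtype=float))
    log3 = np.log2(np.asarray(cy3, dtype=float))
    m = log5 - log3                # log ratio
    a = 0.5 * (log5 + log3)        # average log intensity
    return m - local_linear_fit(a, m, frac)


# Simulated array with no true differential expression but an
# intensity-dependent dye bias (hypothetical numbers).
rng = np.random.default_rng(0)
a_true = rng.uniform(6.0, 14.0, size=200)
dye_bias = 0.1 * (a_true - 10.0)
cy5 = 2.0 ** (a_true + dye_bias / 2)
cy3 = 2.0 ** (a_true - dye_bias / 2)
m_raw = np.log2(cy5) - np.log2(cy3)
m_norm = lowess_normalize(cy5, cy3)
print(np.abs(m_raw).max(), np.abs(m_norm).max())
```

In this simulation the bias is exactly linear in A, so the local fit removes it completely; real data also call for spot filtering and, usually, robustness iterations.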

13.3.3 Network Component Analysis (NCA)

NCA utilizes the knowledge of transcription factor (TF)-gene interactions, called connectivity information, within the organism of interest to deconvolute gene expression data and obtain an estimate of the activities of the TFs. The approach works well for well-studied organisms like E. coli and yeast (S. cerevisiae), since extensive connectivity information is required for the success of this type of analysis. The method has recently been applied to mammalian organisms such as mouse (M. musculus) and human (H. sapiens), whose connectivity information is available in databases such as TRANSFAC and the Transcriptional Regulatory Element Database (TRED). NCA is particularly useful for looking at the effects of drugs over a time course or during an environmental switch. The software can be downloaded from http://www.seas.ucla.edu/~liaoj/download.htm. Figure 13.2 summarizes the steps of the approach. The following paragraphs describe the analysis procedure in detail.

1. Preprocess the inputs. The approach requires two inputs: the connectivity information and the gene expression data in log ratio form.

1.1 Connectivity information: The information for E. coli and S. cerevisiae can be downloaded from the above Web site; however, it may not be up to date, so more recent connectivity information may need to be added. In general, the connectivity information is formatted as a table whose first row and first column list the names of the TFs and genes, respectively. If a connection exists between TF j and gene i, the cell corresponding to row i and column j is filled with any nonzero number; otherwise, it is filled with 0. Note that the connectivity information of a gene cannot be described by more than one row.


Figure 13.2 Flowchart of NCA procedure. (Flowchart boxes: preprocess connectivity information and gene expression; match gene connectivity information and expression; NCA-compliant network; NCA; TFAs; perturbed TFs. Permutation branch: random networks; NCA; null distributions of TFAs.)

The connectivity information file can be saved in Excel or tab-delimited text format.

1.2 Gene expression data: Similar to the connectivity information, the gene expression data acquired after normalization are arranged as a table, with the header row and the first column providing the experiment and gene names, respectively. Note that the gene identifiers in the expression data must be consistent with those in the connectivity information. The expression profile of a gene must also be represented by a single row. If a gene has multiple probesets, the input expression profile can be the average of all probesets or of the best correlated probesets. Genes with missing data points should be eliminated from the data. Since NCA deconvolutes gene expression in log ratio form, data from Affymetrix arrays, which are single-channel arrays, must be converted to log ratios. To do so, one sample, such as the wild type or the t = 0 sample, is selected as the reference against which the other samples are compared. The signal intensities are usually already in logarithmic form after normalization by Robust Multi-array Average (RMA) or the Affymetrix GeneChip software, so the log ratios are calculated by subtracting the data of the reference sample from those of the other samples. It is recommended that biological repeat arrays be averaged before calculating the log ratios, to filter out extreme values of log ratios. The resulting log ratios can then be treated like those from two-channel arrays. The gene expression data can be saved in Excel or tab-delimited text format.

Both files are then imported into the MATLAB workspace through the NCA GUI toolbox.

2. Match gene expression and connectivity information on a genome scale. This task can be performed by the GUI toolbox to obtain the network composed of genes that have both connectivity information and gene expression data.

3. Select the NCA-compliant network. The GUI toolbox can select the NCA-compliant network based on all TFs (default) or on specific TFs selected by the user.

4. Deconvolute the network gene expression by NCA to obtain the profiles of TF activities (TFAs). Different initial guesses (e.g., for the A matrix) are used so that the NCA numerical algorithm estimates the TF activities robustly. Using n > 10 different initial guesses is recommended.

5. Run permutation tests to assess the statistical significance of the TFAs (optional). The goal of this step is to determine which TFs are statistically significantly perturbed compared with the reference conditions. In this analysis, the null distributions of the TFAs are built by applying NCA to n (>50) random networks whose gene expression profiles are randomly sampled from the whole-genome transcriptome data, while the connectivity information is kept the same as in the original network. The z-statistic is used to calculate the p-values of the TFAs obtained in step 4.

6. Export the profiles of the TFAs to a text or Excel file for further study. Note that the NCA method cannot identify whether a TF is an activator or a repressor, so the entire activity profile of a TF can be flipped by multiplying it by −1 if the user knows the relevant biological information.
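The data handling in steps 1, 2, and 5 above can be sketched outside the toolbox as follows. This Python example is an illustration under stated assumptions, not the NCA toolbox code: the gene and TF names, the intensities, and the null activities are made up, the helper names are hypothetical, and a two-sided normal p-value is assumed for the z-statistic.

```python
from math import erfc, sqrt

import numpy as np


def connectivity_matrix(pairs, genes, tfs):
    """Genes x TFs matrix with 1 where a TF-gene connection exists, else 0."""
    gene_row = {gene: i for i, gene in enumerate(genes)}
    tf_col = {tf: j for j, tf in enumerate(tfs)}
    matrix = np.zeros((len(genes), len(tfs)))
    for tf, gene in pairs:
        matrix[gene_row[gene], tf_col[tf]] = 1.0
    return matrix


def log_ratios_vs_reference(log_intensity, conditions, reference):
    """Average replicate arrays per condition, then subtract the reference mean."""
    conditions = np.asarray(conditions)
    ref_mean = log_intensity[:, conditions == reference].mean(axis=1)
    others = [c for c in dict.fromkeys(conditions.tolist()) if c != reference]
    ratios = np.column_stack(
        [log_intensity[:, conditions == c].mean(axis=1) - ref_mean for c in others]
    )
    return others, ratios


def z_test_pvalue(observed, null_values):
    """z-statistic and two-sided normal p-value of one TFA against its null."""
    null_values = np.asarray(null_values, dtype=float)
    z = (observed - null_values.mean()) / null_values.std(ddof=1)
    return z, erfc(abs(z) / sqrt(2.0))


# Step 1.1: connectivity table (hypothetical network).
conn_genes, tfs = ["geneA", "geneB", "geneC"], ["TF1", "TF2"]
pairs = [("TF1", "geneA"), ("TF1", "geneB"), ("TF2", "geneB"), ("TF2", "geneC")]
conn = connectivity_matrix(pairs, conn_genes, tfs)

# Step 1.2: log-scale single-channel data, two biological repeats per condition.
expr_genes = ["geneA", "geneB", "geneD"]
arrays = ["t0", "t0", "t1", "t1"]
log_intensity = np.array([
    [8.0, 8.2, 9.0, 9.4],
    [10.0, 10.0, 9.0, 9.0],
    [7.0, np.nan, 7.5, 7.5],      # missing value: this gene is dropped
])
complete = ~np.isnan(log_intensity).any(axis=1)
expr_genes = [g for g, keep in zip(expr_genes, complete) if keep]
samples, ratios = log_ratios_vs_reference(log_intensity[complete], arrays, "t0")

# Step 2: keep only genes present in both tables, in the same order.
shared = [g for g in conn_genes if g in set(expr_genes)]
conn_matched = conn[[conn_genes.index(g) for g in shared]]
ratios_matched = ratios[[expr_genes.index(g) for g in shared]]

# Step 5: compare one estimated activity with activities from random networks.
z, p = z_test_pvalue(0.5, [-0.2, 0.1, 0.0, 0.1, -0.1, 0.1])
print(shared, samples, ratios_matched.ravel(), round(float(z), 2), p)
```

In practice the null distribution would hold at least the 50 values recommended in step 5; six are used here only to keep the example short.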




It is very important to have not only technical repeats of DNA microarray data, but also biological repeats. Technical repeats are the same sample hybridized to separate arrays, which captures slide-to-slide variation and differences in labeling and hybridization efficiencies. Biological repeats, however, are needed to obtain biologically relevant results for the transcriptome. It is important to obtain statistical information on the data instead of using a strict fold cutoff, as the expression of some genes can be statistically significantly perturbed yet still not meet an arbitrary threshold cutoff. In addition, it is important to keep in mind that, depending on the statistical algorithm and the stringency used, different sets of genes can be selected as induced or repressed. If the experiment is looking for a small set of targets, a more stringent analysis (lower false discovery rate or higher confidence level) can be used. However, if the experiment is looking for an overall trend, then a more relaxed analysis can be used.
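The difference between a fold cutoff and a statistical test can be seen in a small Python example. A plain one-sample t-test on the replicate log2 ratios is used here as a simple stand-in for packages such as SAM; the gene names and values are invented, and 3.182 is the two-sided 5% critical value of Student's t with 3 degrees of freedom.

```python
import numpy as np

T_CRIT_DF3 = 3.182   # two-sided 5% critical value, 3 degrees of freedom


def one_sample_t(log_ratios):
    """t-statistic for the null hypothesis that the mean log ratio is 0."""
    values = np.asarray(log_ratios, dtype=float)
    return values.mean() / (values.std(ddof=1) / np.sqrt(values.size))


# Four replicate log2 ratios per gene (hypothetical).
replicates = {
    "gene_small_consistent": [0.60, 0.62, 0.58, 0.61],   # about 1.5-fold
    "gene_large_noisy": [2.5, -0.3, 1.9, 0.2],           # mean above 2-fold
}

calls = {}
for name, values in replicates.items():
    passes_fold = bool(abs(np.mean(values)) >= 1.0)       # 2-fold cutoff
    significant = bool(abs(one_sample_t(values)) > T_CRIT_DF3)
    calls[name] = (passes_fold, significant)
    print(name, "fold cutoff:", passes_fold, "t-test:", significant)
```

The consistently shifted gene fails the 2-fold cutoff but is significant, while the noisy gene passes the cutoff without being significant.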

The NCA toolbox provides three different numerical algorithms for decomposing the transcriptome data. Each has its own advantages and disadvantages because of the tradeoff between computation time and the stability of the solutions. For example, the algorithm using QR factorization is fast, but its solutions are unstable, while the one using Tikhonov regularization is slow, but provides stable solutions. The orthogonal algorithm has medium speed, and its solutions are less stable than those of the regularization algorithm. Therefore, if the data are less noisy and comprise many (>30) arrays, the orthogonal algorithm is the best candidate. The regularization algorithm is preferable if the number of data points is limited or the connectivity information might contain many false positive and false negative connections. In addition, NCA should be applied to data collected from similar environmental conditions or tissues, because the approach assumes that the TF-gene relationships remain constant over the whole dataset. The NCA solutions become less stable when the method is applied to a dataset that combines data collected from different environmental conditions (e.g., rich media and heat shock) or tissues, because the TF-gene interactions are condition and tissue dependent.
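For readers who want a feel for what such a decomposition involves, the Python sketch below alternates two least-squares steps to fit the log-ratio matrix as the product of a connectivity-strength matrix A, restricted to the known connections, and a TF activity matrix P. It is a bare-bones illustration and not one of the three toolbox algorithms. The optional `ridge` term shows one way a Tikhonov penalty can enter; placing it on P is an assumption of this sketch, as are the function name, the column scaling, and the synthetic network.

```python
import numpy as np


def nca_als(expr, pattern, ridge=0.0, n_iter=200, seed=0):
    """Alternating least squares for expr ~ A @ P with A limited to `pattern`.

    expr:    genes x samples matrix of log ratios.
    pattern: genes x TFs boolean matrix, True where a connection exists.
    ridge:   Tikhonov penalty on P; 0 gives plain least squares.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_tfs = pattern.shape
    A = np.where(pattern, rng.uniform(0.5, 1.5, size=pattern.shape), 0.0)
    penalty = np.sqrt(ridge) * np.eye(n_tfs)
    zeros = np.zeros((n_tfs, expr.shape[1]))
    history = []
    for _ in range(n_iter):
        # P-step: minimize ||expr - A P||^2 + ridge ||P||^2 (augmented system).
        P = np.linalg.lstsq(
            np.vstack([A, penalty]), np.vstack([expr, zeros]), rcond=None
        )[0]
        # A-step: refit each gene using only the TFs connected to it.
        for i in range(n_genes):
            connected = np.flatnonzero(pattern[i])
            if connected.size:
                A[i, connected] = np.linalg.lstsq(
                    P[connected].T, expr[i], rcond=None
                )[0]
        # Fix the arbitrary scale of each TF; the product A @ P is unchanged.
        scale = np.abs(A).sum(axis=0) / np.maximum(pattern.sum(axis=0), 1)
        scale[scale == 0] = 1.0
        A /= scale
        P *= scale[:, None]
        history.append(np.linalg.norm(expr - A @ P))
    return A, P, history


# Synthetic noise-free example: 6 genes, 2 TFs, 8 samples (hypothetical).
pattern = np.array(
    [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [1, 1]], dtype=bool
)
rng = np.random.default_rng(1)
A_true = np.where(pattern, rng.uniform(0.5, 2.0, size=pattern.shape), 0.0)
P_true = rng.normal(size=(2, 8))
expr = A_true @ P_true

A_hat, P_hat, history = nca_als(expr, pattern, ridge=0.0)
print("first and last residual norms:", history[0], history[-1])
```

The sign and scale of each recovered activity profile are arbitrary, which is why step 6 of the procedure allows a profile to be multiplied by −1. Rerunning with different `seed` values mimics the use of several initial guesses.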

Since gene expression changes can occur rapidly upon shifting to a new environment (e.g., time course data following the addition of a drug), if the activities of transcription factors known to be transiently involved do not appear to be affected, the sampling interval may need to be shortened to capture any potentially interesting transient signaling. As with the analysis of any computationally predicted outcome, it is important to validate the important conclusions experimentally. For example, if a transcription factor is predicted to be active in a particular condition, then using a strain in which that transcription factor is deleted should result in a corresponding change in the TF activity profile obtained by NCA. Alternatively, if an assay is available for detecting the TF activity, such as quantification of the ligand concentration or of the active form of the transcription factor, then it should be used for validation.


When cells are subjected to environmental or genetic perturbation, it is difficult to determine the coordinated regulatory responses to these perturbations. Network Component Analysis of a series of time course microarray data successfully determined the transcription factors involved when E. coli transitions from utilizing glucose to utilizing acetate as the sole carbon source. The transient activities of key transcription factors elucidated the coordinated regulatory response of the cell during this metabolic switch. NCA helped to identify a key intracellular metabolite responsible for the prolonged growth lag phase during this carbon source transition in a mutant deficient in a gluconeogenic gene [3]. The role of the intracellular metabolite in the observed phenotype of the mutant was confirmed by experimental validation.

13.6 Summary Points

• It is important to have multiple biological and technical replicates of each microarray experiment to ensure the statistical significance of genes called as induced or repressed.

• The connectivity information between transcription factors and the genes they regulate must be available for this analysis.

• This analysis is best used with time course data, as NCA is for determining the transient activities of transcription factors.

• The user can control the sign of the transcription factor activities if the regulator is known to be an activator or repressor.

References

[1] Schena, M., D. Shalon, R.W. Davis, and P.O. Brown, “Quantitative monitoring of gene-expression patterns with a complementary-DNA microarray,” Science, Vol. 270, No. 5235, 1995, pp. 467–470.

[2] Liao, J.C., R. Boscolo, Y.L. Yang, L.M. Tran, C. Sabatti, and V.P. Roychowdhury, “Network component analysis: Reconstruction of regulatory signals in biological systems,” Proc. Natl. Acad. Sci. USA, Vol. 100, No. 26, 2003, pp. 15522–15527.

[3] Kao, K.C., L.M. Tran, and J.C. Liao, “A global regulatory role of gluconeogenic genes in Escherichia coli revealed by transcriptome network analysis,” Journal of Biological Chemistry, Vol. 280, No. 43, 2005, pp. 36079–36087.



Problem | Possible Explanation | Potential Solutions

Degradation of RNA samples | RNase contamination | Use filtered pipette tips; clean the bench and all equipment with an RNase remover, such as RNaseZap (Applied Biosystems, Foster City, California)

Error or warning when running NCA | Missing data points in microarray or connectivity data | Eliminate genes with missing data; fill missing values using an imputation algorithm, such as K nearest neighbor (KNN) imputation

Activity profile nonzero for a disrupted transcription factor | Need to specify in NCA | After step 3 in Section 13.4.3, select “Select TF Constraints,” add the TF and the appropriate microarray datasets to the list of available constraints. Then click “Save Constraints,” and run NCA

CHAPTER 14

A Workflow from Time Series Gene Expression to Transcriptional Regulatory Networks

Rajanikanth Vadigepalli
Daniel Baugh Institute for Functional Genomics and Computational Biology, Department of Pathology, Anatomy, and Cell Biology, Room 381 JAH, Thomas Jefferson University, 1020 Locust Street, Philadelphia, PA 19107; phone: (215) 955-0576; fax: (215) 503-2636; e-mail: [email protected]


Key terms

Gene expression data analysis
Transcriptional regulatory network analysis
Computational negative control
Clustering
Promoter analysis
Gene regulation
Network analysis

Abstract

An integrated approach is presented here to analyze time series microarray gene expression data that extends beyond the list of differentially expressed genes and focuses on the characterization of their transcriptional regulation. In the present approach, the differentially expressed genes are identified through a local false discovery rate and the resultant time series data is analyzed in a robust clustering scheme. The expression clusters are then analyzed using the PAINT bioinformatics tool to uncover shared transcriptional regulation. This integrated approach enables transformation of descriptive data on gene expression to functional mechanisms underlying regulation of the observed dynamic profiles.

14.1 Introduction

In a typical global gene expression profiling study considering the dynamic response of a biological system, microarrays are used to monitor changes in gene expression at different time points under a perturbation as compared to paired controls at each time point. These time points provide information on the course of gene regulatory events during the response. An integrated approach is presented here to analyze time series microarray gene expression data that extends beyond the list of differentially expressed genes and focuses on the characterization of their transcriptional regulation, which is one of the key mechanisms by which protein expression changes are controlled. In this approach, the differentially expressed genes are identified through an ANOVA [1, 2] and local false discovery rate (FDR) based approach [3], and the resultant time series data is analyzed in a robust clustering scheme termed computational negative control [4–6]. The expression clusters are then analyzed using the PAINT bioinformatics tool [1, 2, 6–13] to uncover shared transcriptional regulation potentially shaping the observed dynamical expression patterns.

In a typical clustering approach, a key consideration is that the number of clusters is user-specified (e.g., as in K-means), and hence some genes may be “incorrectly clustered” for a given number of partitions. The computational negative control approach overcomes this limitation by scanning a user-specified range of numbers of clusters and choosing the maximum number of patterns that are well distinguishable from the clustering of randomized data. The original expression time series is permuted to generate a randomized data set that is comparable to the original data set in data range and other overall statistics. The set of clusters of the original gene expression time series that are well distinguishable from the randomized data is chosen for subsequent transcriptional regulatory network analysis using PAINT.

Candidate transcription factors (TFs) responsible for differential expression profiles of the dynamically responsive genes are characterized using the Promoter Analysis and Interaction Network Toolset (PAINT) software available online at http://www.dbi.tju.edu/PAINT [8]. The concept driving the analysis in PAINT is that many coexpressed genes share regulatory elements, typically TF binding sites (or transcriptional regulatory elements; TREs) in their promoters, leading to coregulation. PAINT uses bioinformatics in combination with robust statistical approaches to identify the significantly enriched TREs in the promoters of the genes of interest (e.g., gene groups from cluster analysis of expression data). The TRE enrichment is based on higher than random frequency as determined by Fisher's Exact Test. A key aspect of the analysis is the unbiased approach that considers all known TF binding sites as being equally probable for significance, in order to winnow down the list of TFs from hundreds to a relatively small panel of TFs that could play a role under these experimental conditions.
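The enrichment test described above can be illustrated with a minimal sketch in R. The counts below are hypothetical and the cluster is assumed to be a subset of the reference promoter set; PAINT's internal implementation is not shown in this chapter, so this only demonstrates the general form of a Fisher's Exact Test on a 2 x 2 table for one TRE.

```r
# Hypothetical counts for one TRE (illustrative numbers, not from the chapter).
cluster_with_tre   <- 18    # promoters in the cluster that contain the TRE
cluster_size       <- 40    # promoters in the cluster
reference_with_tre <- 300   # promoters in the reference that contain the TRE
reference_size     <- 5000  # promoters in the reference

# 2 x 2 table: rows = TRE present/absent, columns = cluster/rest of reference.
# Assumption: the cluster is a subset of the reference set.
counts <- matrix(
  c(cluster_with_tre,
    cluster_size - cluster_with_tre,
    reference_with_tre - cluster_with_tre,
    (reference_size - cluster_size) - (reference_with_tre - cluster_with_tre)),
  nrow = 2,
  dimnames = list(TRE = c("present", "absent"),
                  Set = c("cluster", "rest_of_reference"))
)

# One-sided tests for over- and under-representation of the TRE in the cluster.
over  <- fisher.test(counts, alternative = "greater")
under <- fisher.test(counts, alternative = "less")

print(counts)
cat("Over-representation p-value: ", over$p.value, "\n")
cat("Under-representation p-value:", under$p.value, "\n")
```

In a full analysis, one such test would be run for every TRE in every cluster, which is why a multiple testing correction is applied afterwards.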

PAINT can also be used to simultaneously analyze multiple groups of genes (e.g., cluster analysis of multicondition microarray data). In this case, the TRE enrichment analysis is performed for each individual cluster as compared to the specified reference as well as to the entire input list itself (i.e., all clusters combined). This functionality is employed in the workflow detailed below to analyze the clusters of differential gene expression time series data. In PAINT, the clustered data is visualized as a matrix layout with the hierarchical tree structure aligned to the rows and the columns of the Feasnet. The zeros in the matrix are shown in black and the nonzero entries in the Feasnet are colored based on the p-value of the corresponding TRE. The brightest shade of red represents the lowest p-values for over-representation (most significantly over-represented in the Feasnet). Conversely, the brightest shades of cyan represent the lowest p-values for under-representation in the observed Feasnet, indicating the most significantly under-represented TREs. This image can optionally represent the cluster index of each gene, where such cluster indices are generated from other sources such as expression or annotation-based clustering. With such visualization, it is straightforward to explore the relationship between expression-based clusters and those based on cis-regulatory pattern (i.e., the Feasnet).

Different aspects of the integrated workflow presented above have been applied to study microarray gene expression time series data in a number of biological problems. ANOVA has been widely used in the analysis of microarray gene expression data [14, 15]. The computational negative control approach has been used to study gene expression dynamics during erythroid development [5] and rat liver regeneration [4]. PAINT has been employed in studying coordinated gene regulation in a wide range of systems including neuronal differentiation, neuronal adaptation, blood cell development, retinal injury, brain stroke, bladder inflammation, and liver regeneration [4, 5, 9–12, 16–18].

14.2 Materials

1. Normalized microarray gene expression time series data file (user-provided). Organize the normalized microarray data in a tab-delimited plain text file with the first column containing the Gene Identifier (e.g., any one of Accession Number, Affymetrix Probeset ID, Clone ID, and so forth), and the remaining columns corresponding to the normalized data values, one column per sample. The columns should be ordered with samples for each time point grouped together in adjacent columns, and within each time point, data from biological replicates are ordered as treatment and control samples in adjacent columns. Name this file ArrayTimeSeriesData.txt. An example data file based on the cDNA microarray time series data described in [4] is available in the Supplement (exArrayTimeSeriesData.txt). To use the example file, copy it to a new plain text file with the name ArrayTimeSeriesData.txt.

2. Experimental Design Matrix for ANOVA (user-provided). Organize this information in a tab-delimited plain text file with three rows corresponding to three factors (Treatment, TimePoint, and AnimalPair). The first column should contain the row headers Treatment, TimePoint, AnimalPair (one per row), and the remaining columns correspond to the samples organized in the same column order as the normalized microarray data file (Item 1). The values in the table indicate the particular level for each of the three factors (e.g., Table 14.1). Name this file DesignData.txt. An example data file is available in the Supplement (exDesignData.txt). To use the example file, copy it to a new plain text file with the name DesignData.txt.

3. Statistical analysis software. Download and install the R Project for Statistical Computing available at http://www.r-project.org.

4. R scripts for differential gene expression analysis. The required files for differential gene expression analysis using mixed-effects ANOVA and false discovery rate analysis, and the input data files (Items 1 and 2), are available in the Supplement (DiffExpAnalysis.R, localFDRanalysis.R, and IdentifyDiffExpGenes.R). Save these files in the same directory as the input data files.

5. R script for robust cluster analysis. The required R code for performing robust clustering using the computational negative control approach is available in the Supplement (robustClustering.R).

6. Gene Level Identifier Resources. The SOURCE tool for conversion between various gene identifiers can be found at http://source.stanford.edu. Ensembl gene identifiers can be obtained using the BioMart function at http://www.ensembl.org/Multi/martview.

7. Gene List Input Data File. This is a single column list of gene identifiers, one identifier per line, in a plain text file. This file is generated as part of the analysis procedure detailed below, in formatting the statistical analysis results for promoter analysis using PAINT. An example file available in the Supplement (exGeneList.txt) is needed only if the differential gene expression analysis is bypassed to directly start with promoter analysis.

8. Cluster Membership Data File. This is a tab-delimited plain text file with two columns. The first column must contain one gene identifier per row, and the second column must contain a corresponding single word alphanumeric cluster label. This file is generated as part of the analysis procedure detailed below, in formatting the statistical analysis results for promoter analysis using PAINT. An example file available in the Supplement (exClusterInfo.txt) is needed only if the differential gene expression analysis is bypassed to directly start with promoter analysis.

9. Transcription Factor Binding Site Data. The descriptions of transcription factor binding sites for use with PAINT are provided in two forms by Biobase International. A public database is available through http://www.gene-regulation.com. A commercially licensed version is available from http://www.biobase-international.com/. PAINT requires users to obtain an account with either of these resources in order to perform binding site analysis. The professional version of TRANSFAC contains a significantly higher number of TREs and TFs than the public version and hence significantly improves the analysis.

10. PAINT: The latest version of the Promoter Analysis and Interaction Network Toolset is available at http://www.dbi.tju.edu/PAINT [8]. The original version is described in [13] and a subsequent update is detailed in [7].

The example files noted above are based on the cDNA microarray time series data described in [4]. In the Web-based PAINT, existing analyses can be retrieved using a unique job key provided for each analysis. The PAINT results are presented in a hyperlinked report and are available for download as a single compressed file for offline perusal.

Nomenclature for this chapter includes bold italic for on-screen text, bold for buttons, and Courier font for files, folders, and software code for execution in the R program.



Table 14.1 A Typical Experimental Design Matrix Required for Mixed-Effects ANOVA Analysis of Time Series Gene Expression Microarray Data

Treatment    P   C   P   C   P   C   P   C   P   C   P   C
TimePoint    1h  1h  1h  1h  1h  1h  2h  2h  2h  2h  2h  2h
AnimalPair   P1  P1  P2  P2  P3  P3  P4  P4  P5  P5  P6  P6

P and C represent perturbation and control, respectively.
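As an illustration of the file layout in Item 2, the design in Table 14.1 could be generated and checked in R roughly as follows. This is a sketch under stated assumptions: the file is assumed to have no header row, a temporary file stands in for DesignData.txt, and the object names are hypothetical.

```r
# Factor levels per sample, in the column order of Table 14.1.
treatment  <- rep(c("P", "C"), times = 6)
time_point <- rep(c("1h", "2h"), each = 6)
animal     <- rep(paste0("P", 1:6), each = 2)

design <- rbind(Treatment  = treatment,
                TimePoint  = time_point,
                AnimalPair = animal)

# Write with the row header in the first column, one column per sample,
# tab-delimited, and (by assumption) no header row.
# A temporary file is used here; in practice the target would be DesignData.txt.
design_file <- tempfile(fileext = ".txt")
write.table(design, file = design_file, sep = "\t",
            quote = FALSE, col.names = FALSE, row.names = TRUE)

# Read the file back and run basic consistency checks.
check <- read.delim(design_file, header = FALSE, row.names = 1,
                    stringsAsFactors = FALSE)
stopifnot(identical(rownames(check), c("Treatment", "TimePoint", "AnimalPair")))

# Every animal pair should contribute exactly one P and one C sample.
pairs_ok <- tapply(unlist(check["Treatment", ]), unlist(check["AnimalPair", ]),
                   function(x) setequal(x, c("P", "C")))
stopifnot(all(pairs_ok))
print(check)
```

The number of sample columns in this file should equal the number of data columns in ArrayTimeSeriesData.txt, in the same order.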

14.3 Methods

The methods outlined below describe transcriptional regulatory network analysis of gene expression time series data, based on differential gene expression from ANOVA followed by a false discovery rate (FDR) analysis, robust clustering, and promoter analysis of the grouped genes using PAINT. The term Project Directory below refers to the directory where the input data files (Items 1 and 2) as well as the scripts for the differential gene expression and clustering analysis (Items 4 and 5) are located. The workflow detailed below assumes that all of these files are located in the same directory. If such a setup is not preferred, researchers with advanced proficiency in the R program can modify the scripts and the code detailed below to specify the corresponding locations as appropriate.

14.3.1 Identification of differentially expressed genes

14.3.1.1 Mixed-effects ANOVA of the normalized time series data

1. Start the R program for statistical computing.

2. Modify the following line of code to specify the full path to the Project Directory. Paste it at the R prompt and hit Enter.

setwd("C:/Users/username/research/project/")

3. Paste the following line of code at the R prompt and hit Enter.

source("DiffExpAnalysis.R")

4. After the above script runs without errors, a new tab-delimited plain text file named DiffExpRawPvalues.txt containing the raw p-values and differential gene expression time series data is generated in the Project Directory.
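The contents of DiffExpAnalysis.R are not reproduced in this chapter. As a rough sketch of what a per-gene test with the three factors of Table 14.1 might look like, the following uses base R `aov()` with AnimalPair as an error stratum on simulated data for one hypothetical gene. The model formula, the simulated values, and the variable names are all assumptions, not the supplied script.

```r
set.seed(1)

# Experimental design as in Table 14.1 (12 samples, 6 animal pairs).
design <- data.frame(
  Treatment  = factor(rep(c("P", "C"), times = 6)),
  TimePoint  = factor(rep(c("1h", "2h"), each = 6)),
  AnimalPair = factor(rep(paste0("P", 1:6), each = 2))
)

# Simulated normalized expression for one hypothetical gene:
# an animal-pair effect plus a treatment effect of 2 units.
pair_effect <- rnorm(6, sd = 0.5)[as.integer(design$AnimalPair)]
design$expr <- pair_effect + 2 * (design$Treatment == "P") + rnorm(12, sd = 0.2)

# Treatment and TimePoint as fixed effects; AnimalPair as an error stratum,
# so that the treatment effect is assessed within each pair.
fit <- aov(expr ~ Treatment * TimePoint + Error(AnimalPair), data = design)

# Extract the raw p-value of the Treatment term from the within-pair stratum.
within_tab  <- summary(fit)[["Error: Within"]][[1]]
term_names  <- trimws(rownames(within_tab))
p_treatment <- within_tab[term_names == "Treatment", "Pr(>F)"]
cat("Raw p-value for Treatment:", p_treatment, "\n")
```

Repeating such a fit for every gene yields one raw p-value per gene, which is the kind of input the local FDR analysis in the next section works from.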

14.3.1.2 Local false discovery rate analysis

5. Paste the following line of code at the R prompt and hit Enter.

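source("localFDRanalysis.R")

6. Paste the following line of code at the R prompt and hit Enter to display the estimated pi.not value.

pi.not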

7. Proceed to the next step if the pi.not value is greater than 0.7. Otherwise, see Note 14.5.1.

8. The above script generates a plot of the local and overall FDR (e.g., Figure 14.1). The x-axis represents genes in the order of the raw p-values from the most to the least significant. The y-axis represents the FDR values for the two curves shown (local FDR within a window of 50 successive genes and overall FDR).

9. Consider a local FDR threshold of 0.3. Paste the following line of code at the R prompt and hit Enter in order to add a horizontal line to the plot at the threshold value of 0.3.

abline(h=0.3)

10. If the subsequent local FDR values for the next 50 to 100 genes lie above the horizontal line for the chosen local FDR threshold, then skip Step 11 and proceed to Step 12.

11. If the local FDR values for the next 50 to 100 genes lie below the threshold line, increase or decrease the threshold value by 0.01 to 0.05 and repeat Step 9 with the new threshold value. Refer to Note 14.5.2 on how to choose an appropriate threshold.

12. Use the local FDR threshold identified as providing a reasonable opportunity cost to determine the differentially expressed genes in the time series experimental data. Estimate the approximate range of the number of genes (value on the x-axis) corresponding to this threshold (e.g., 300 to 325 in Figure 14.1).

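13. In the following lines of code, replace the number 0.3 with the threshold identified in Step 12 and the numbers 300 and 325 with the approximate range of genes estimated in Step 12, paste the code at the R prompt and hit Enter.

# Replace 0.3 with the threshold value from Step 12.
threshold <- 0.3
# Replace 300 and 325 with the approximate range of genes from Step 12.
approxGenesStart <- 300
approxGenesEnd <- 325
source("IdentifyDiffExpGenes.R")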

14. After the above script runs without errors, a tab-delimited plain text file named DiffExpGeneData.txt containing the differential gene expression time series data, raw and adjusted p-values will be generated based on the chosen local FDR threshold.
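To make the two curves of Figure 14.1 more concrete, the sketch below computes an overall FDR and a windowed local FDR from a vector of raw p-values. The estimators shown are simple textbook-style forms chosen for illustration; the exact estimator in localFDRanalysis.R follows [3] and may differ. The simulated p-values, the fixed `pi0` value, and all variable names are assumptions.

```r
set.seed(2)

# Simulated raw p-values (hypothetical): 300 true effects among 3000 genes.
p_raw <- c(rbeta(300, 0.2, 8), runif(2700))
m     <- length(p_raw)
pi0   <- 0.9   # assumed fraction of nondifferentially expressed genes
w     <- 50    # window of successive genes, as in Step 8

p_sorted <- sort(p_raw)
ranks    <- seq_len(m)

# Overall FDR when calling the top i genes: expected false positives / i.
overall_fdr <- pmin(1, pi0 * m * p_sorted / ranks)

# Windowed local FDR: expected false positives falling inside the window of
# up to w genes ending at rank i, divided by the window width.
p_lag     <- c(rep(0, w), p_sorted)[ranks]  # p-value w ranks earlier (0 before rank w)
width     <- pmin(ranks, w)
local_fdr <- pmin(1, pi0 * m * (p_sorted - p_lag) / width)

# Plot both curves for the top 1200 genes with a threshold line at 0.3.
top <- 1:1200
plot(top, local_fdr[top], type = "l", ylim = c(0, 1),
     xlab = "# Genes", ylab = "False discovery rate")
lines(top, overall_fdr[top], lty = 2)
abline(h = 0.3)
legend("topleft", legend = c("Local FDR", "Overall FDR"), lty = c(1, 2))
```

Because the local estimate is based on spacings between neighboring p-values, it fluctuates from window to window, which is why the threshold heuristics in Note 14.5.2 look at the next 50 to 100 genes rather than at a single crossing.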

14.3.2 Robust clustering of differential gene expression time series data using computational negative control approach


15. Paste the following line of code at the R prompt and hit Enter.

source("robustClustering.R")


[Figure 14.1: plot of false discovery rate (y-axis, 0.0 to 0.4) against number of genes (x-axis, 0 to 1,200), with two curves: local FDR and overall FDR.]

Figure 14.1 Relationship between overall FDR, local FDR, and the number of predicted differentially expressed genes. We chose a 30% local FDR as a threshold resulting in 309 differentially expressed genes (corresponding to a 21.4% overall FDR). Additional genes selected would be at a higher “opportunity cost” as the local FDR is higher than 30% for the next 100 genes. (Reproduced from [4].)

16. The above script generates a plot of the clustering results obtained with Partitioning Around Medoids (also known as K-medoids), using Pearson Correlation as the dissimilarity metric (e.g., Figure 14.2), for between 2 and 10 clusters. Identify the number of clusters that are significantly distinct from randomized data (e.g., 6 clusters in Figure 14.2). Refer to Note 14.4.1 for further information on how to choose an appropriate number of clusters.

17. In the following lines of code, replace the number 6 with the number of clusters identified in Step 16, paste the code at the R prompt and hit Enter.

# Replace 6 with the number of clusters from Step 16.
numClusters <- 6
save.clustering.results(numClusters)

18. After the above code runs without errors, two tab-delimited plain text files named GeneList.txt and ClusterInfo.txt will be generated. These form the input files to the transcriptional regulatory network analysis using PAINT, as detailed below.

14.3.3 Transcriptional regulatory network analysis using PAINT

14.3.3.1 Generation of PAINT-compatible input files

The following two steps are necessary only if the above Steps 1 to 18 in the differential gene expression analysis are skipped to directly proceed to the promoter analysis using PAINT.

[Figure 14.2: (a) silhouette coefficient (y-axis) against number of clusters (x-axis, 2 to 10) for randomized data and actual expression data; (b) difference in SC multiplied by number of clusters (y-axis) against number of clusters (x-axis, 2 to 10).]

Figure 14.2 Assessment of the gene expression clustering results using the computational negative control (CNC) approach [6]. (a) For each specified number of clusters, the cluster quality metric, silhouette coefficient (SC), is evaluated and compared to that from the randomly permuted data. (b) Difference in SC from (a) multiplied by number of clusters shows a marked decrease at more than six clusters, indicating that SC is no longer distinct from the randomized data. (Reproduced from [4].)

19. The starting point to PAINT is a file containing the list of differentially expressed genes. The file should be a single column plain text file, each row listing a gene identifier. All the identifiers in the file need to be of the same type, for example, Genbank Accession Number. The file GeneList.txt generated in Step 18 conforms to this file format. If Steps 1 to 18 were skipped, copy the example file exGeneList.txt to a new plain text file with the name GeneList.txt.

20. The Cluster Membership Data File is a tab-delimited plain text file with two columns. The first column must contain one gene identifier per row and the second column must contain a corresponding single word alphanumeric cluster label. The file ClusterInfo.txt generated in Step 18 conforms to this file format. If Steps 1 to 18 were skipped, copy the example file exClusterInfo.txt to a new plain text file with the name ClusterInfo.txt.
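When the two input files are prepared by hand rather than by Step 18, their format can be checked before uploading. The following is a minimal sketch: the gene identifiers are made up, temporary files stand in for ClusterInfo.txt and GeneList.txt, and the files are assumed to have no header row.

```r
# Hypothetical cluster membership table written to a temporary file;
# in practice this would be ClusterInfo.txt in the Project Directory.
cluster_file <- tempfile(fileext = ".txt")
example <- data.frame(gene    = c("AA111111", "AA222222", "AA333333", "AA444444"),
                      cluster = c("C1", "C1", "C2", "C2"))
write.table(example, cluster_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Check that every line has exactly two tab-delimited fields.
n_fields <- count.fields(cluster_file, sep = "\t")
stopifnot(all(n_fields == 2))

info <- read.delim(cluster_file, header = FALSE, stringsAsFactors = FALSE,
                   col.names = c("gene", "cluster"), colClasses = "character")

# Cluster labels must be single-word alphanumeric; identifiers must be non-empty.
labels_ok    <- all(grepl("^[A-Za-z0-9]+$", info$cluster))
no_empty_ids <- all(nzchar(info$gene))
stopifnot(labels_ok, no_empty_ids)

# The matching gene list is the first column, one identifier per line.
gene_list_file <- tempfile(fileext = ".txt")
writeLines(unique(info$gene), gene_list_file)
print(table(info$cluster))
```

The table of cluster sizes printed at the end is also a convenient place to spot very small clusters before running PAINT.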

14.3.3.2 Identification of over-represented transcription factor binding sites

In the steps below, the GeneList and ClusterInfo from Step 18 (or Steps 19 and 20, as appropriate) will be used to retrieve promoter sequences, analyze them using TRANSFAC Public (or Professional), build a Feasnet, and analyze this matrix as compared to a reference Feasnet to derive hypotheses on over-represented TREs in each of the identified coexpression time series clusters.

21. Use a Web browser to open the Web page http://www.dbi.tju.edu/PAINT.

22. Click on the button Start New Analysis on the PAINT main page.

23. Select the appropriate Organism Name, 2000 for the Desired upstream length, Clone ID for Gene Identifier type, and Gene Identifiers List for Upload text file of type.

24. Click the Browse button to locate and select the file GeneList.txt on the computer from the Project Directory.

25. Select the check box next to TFRetriever.

26. Select MATCH (TRANSFAC Public) for TRE finding program. Refer to Note 14.5.5 on the issues involved in the choice of the TRE finding programs.

27. Enter the username and password for logging into the Web site for TRANSFAC Public http://www.gene-regulation.com (or TRANSFAC Professional at http://www.biobase-international.com).

28. Select Minimize False Positives for the MATCH filter option.

29. Select 1.00 for the Core similarity threshold. Check the box for Find TREs on complementary strand?

30. Click Execute Feasnet Builder at the end of the Feasnet Builder form. A new page will be loaded indicating the status of the analysis. Note the job key at the top of the status page for access at a later time, as the analysis might take considerable time depending on the size of the gene list.

31. After the FeasnetBuilder step is completed without errors, the highlighted status text at the top of the page will be replaced by a link to the ZIP file containing all the results including the status page.

32. After successful completion of FeasnetBuilder, the PAINT status page indicates the number of promoters that were retrieved (refer to Note 14.5.6 on how redundancies in the gene list are handled), the promoter sequences of specified length in the FASTA format, and also a link to a list of genes for which the promoter sequences were not found in the PAINT database. Next, the status page indicates whether the gene list was split into multiple parts for processing using MATCH. Links to the actual HTML output from MATCH are provided next to each of the split sequence files. Lastly, the overall Feasnet corresponding to the input Gene List is given next to the text Feasnet file.

33. After successful completion of the FeasnetBuilder step, the PAINT status page shows a link to the follow-up Feasnet Analysis and Visualization. Click on the link indicated by the text Click here to continue with Feasnet Analysis and Visualization.

34. On the Feasnet Analysis page, the parameters corresponding to the Feasnet, Organism, Upstream sequence length, Gene Identifier type, TRE finding program, Core similarity threshold, and TREs on the complementary strand included? will be automatically set based on the data entered earlier in the FeasnetBuilder step.

35. Under Clustering Options, check both the boxes corresponding to TREs based on the promoters they are present on and genes based on the TREs present on their promoters to hierarchically cluster the PAINT analysis results.

36. Under Select Reference Feasnet(s) for significance analysis of TREs, check the box corresponding to the microarray used in the project. If this information is unknown, select All promoter sequences in PAINT database. If your microarray is not listed, refer to Note 14.5.7 for information on how to choose or generate an appropriate reference Feasnet.

37. Check the box next to Generate filtered Gene-TRE networks based on TRE over-representation. Under this text, select 0.30 for the parameter Only those TREs of FDR-based adjusted p-value <=, and 0.05 for the parameter Only those TREs of raw p-value <=. Refer to Note 14.5.8 for information on these two thresholds employed in the analysis.

38. Click the Browse button for the parameter Gene cluster information file to locate and select the ClusterInfo.txt file in the Project Directory.

Click the Execute Feasnet Analyzer/Viewer button at the end of the form. The job status page will be loaded indicating the status of the analysis. The job key will be the same as earlier, as this is merely a continuation of the PAINT analysis.

39. Once the Analysis and Visualization step is complete, the highlighted text at the top of the status page will be replaced by a link to the ZIP file containing all the results including the status page.

40. The results from the TRE enrichment analysis are under the headings Significance of TRE occurrence (in clusters compared to a reference) and Significance of TRE occurrence (in individual clusters compared to the list). Links to the specific reference used, p-values for over-representation, and the Feasnet images are provided. Under the subheading Hypothesis Gene-TRE network, links are provided to the filtered Feasnet data and images based on the specified p-value thresholds (set in Step 37). Links to the Network image and Graphviz source file are also given. Refer to Note 14.4.2 for information on how to interpret the PAINT results.



14.4.1 Selection of number of clusters

The Silhouette Coefficient (SC) plotted in Step 16 (example in Figure 14.2) indicates the “quality” of the partitioning into multiple clusters [4–6]. These values range from 0 to 1, with the zero value corresponding to no separation between clusters and a value of 1 indicating perfect partitioning. The randomized data partitioning is based on permuting the original data set and applying the same clustering algorithm as with the original data set. The difference in SC between the actual and randomized data set may be highest at a low number of clusters. However, it may be informative to further separate these clusters to derive more specific temporal gene expression patterns. The plot generated in Step 16 [e.g., Figure 14.2(b)] considers the product of the difference in SC and the number of clusters to examine the trade-off over a range of total number of clusters and identify an optimal value. The example results in Figure 14.2(b), based on the results in [4], show a marked decrease at more than six clusters, indicating that SC is no longer distinct from the randomized data.
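The comparison described in this note can be sketched in R with the `cluster` package, which provides `pam()` and silhouette widths. The example below is an illustration on simulated data, not the supplied robustClustering.R: the permutation scheme (shuffling all values of the matrix), the simulated patterns, and the helper `avg_silhouette()` are assumptions.

```r
library(cluster)  # pam() and silhouette information

set.seed(3)

# Simulated differential expression time series (hypothetical):
# 120 genes x 4 time points, drawn from three temporal patterns plus noise.
patterns <- rbind(c(1, 2, 3, 4), c(4, 3, 2, 1), c(1, 3, 3, 1))
expr <- patterns[rep(1:3, each = 40), ] + matrix(rnorm(120 * 4, sd = 0.3), ncol = 4)

# Average silhouette width of a PAM clustering with correlation-based dissimilarity.
avg_silhouette <- function(x, k) {
  d <- as.dist(1 - cor(t(x)))   # 1 - Pearson correlation between genes
  pam(d, k = k)$silinfo$avg.width
}

# Negative control (assumed scheme): shuffle all values of the matrix, which
# keeps the overall data range but destroys the shared temporal patterns.
randomized <- matrix(sample(expr), nrow = nrow(expr))

ks <- 2:10
sc_actual <- sapply(ks, function(k) avg_silhouette(expr, k))
sc_random <- sapply(ks, function(k) avg_silhouette(randomized, k))

# Trade-off metric as in Figure 14.2(b): difference in SC times number of clusters.
score <- (sc_actual - sc_random) * ks
print(data.frame(k = ks, sc_actual, sc_random, score))
```

In the printed table, the number of clusters would be read off where the last column drops markedly, in the same way as Figure 14.2(b) is read.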

14.4.2 PAINT result interpretation for gene coexpression clusters

The hypothesized Gene-TRE network from the enrichment analysis (Step 40) indicates those TREs that are significantly over-represented in the promoters corresponding to each of the gene expression clusters as compared to the promoters in the reference (Figure 14.3). The significantly-above-random nature of occurrence of certain TREs makes these ideal candidates for further experimental validation. When using PAINT to analyze the gene expression time series clusters, the results indicate if any of the binding sites found on promoters of differentially expressed genes are diagnostic for any specific gene coexpression cluster. Therefore, the desired result would be identification of a TRE determined to be statistically enriched in one or a few of the clusters, but not all. The cluster-enriched TREs will appear on the Feasnet image as a vertical collection of red boxes which mirror the limits of the gene list for the group. The enrichment can be more easily visualized by graphing the significance score, –log10(p-value), for each TRE of interest in each gene expression cluster. The enrichment p-values of TREs in each cluster can be obtained from the Significance of TRE occurrence (in clusters compared to a reference) section of the PAINT output, following the link Over-representation in either raw or overall FDR-adjusted p-values. The inference for biological significance is that these specific TREs, and their cognate TFs, are specifically involved in the regulation of the corresponding coexpression clusters of genes.
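The significance score mentioned above can be graphed with a few lines of R. The p-value matrix below is hypothetical and typed in by hand; in practice the values would be taken from the PAINT over-representation output, whose exact file layout is not described here.

```r
# Hypothetical enrichment p-values: rows = TREs, columns = expression clusters.
p_values <- matrix(
  c(0.001, 0.40, 0.75,
    0.60,  0.02, 0.55,
    0.30,  0.45, 0.004),
  nrow = 3, byrow = TRUE,
  dimnames = list(TRE = c("TRE_A", "TRE_B", "TRE_C"),
                  Cluster = c("Cluster1", "Cluster2", "Cluster3"))
)

# Significance score: larger values indicate stronger enrichment.
score <- -log10(p_values)

# Grouped bar chart: one group per cluster, one bar per TRE.
barplot(score, beside = TRUE, legend.text = rownames(score),
        ylab = "-log10(p-value)")
abline(h = -log10(0.05), lty = 2)  # raw p-value threshold of 0.05 from Step 37

# Clusters in which each TRE passes the raw p-value threshold.
print(apply(p_values <= 0.05, 1, function(hit) colnames(p_values)[hit]))
```

A TRE whose bar exceeds the dashed line in only one or a few clusters corresponds to the cluster-specific enrichment pattern described in this note.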


The methodology presented above is motivated by the need to make sense of the coordinated changes in gene expression observed as a function of time in a typical transcriptional profiling experiment. The present approach utilizes multifactorial ANOVA followed by a robust clustering scheme to uncover nonrandom temporal patterns in the differentially expressed gene profiles. These clusters of coregulated genes are analyzed in PAINT for statistically enriched shared binding site incidence on their promoters, yielding hypotheses on transcriptional factors implicated in driving the observed coregulation. Several aspects of the present approach require careful consideration of various choices available in each stage, as discussed below.

14.5.1 Estimation of nondifferentially expressed genes (pi.not value)

The pi.not value represents the fraction of nondifferentially expressed genes in the data [19]. The R scripts presented here include an estimate of pi.not as detailed in [3]. A pi.not value less than 0.7 may indicate issues with normalization of the data, as the assumption in microarray data normalization that “most of the genes [are] unchanging on any given array between treatment and control” may be inappropriate. If this occurs, return to the appropriate normalization procedure in generating the prerequisite data and redo the analysis from Step 1.
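For intuition about what a pi.not estimate measures, the sketch below uses a simple threshold-based estimator on simulated p-values: since p-values of nondifferentially expressed genes are uniformly distributed, the proportion of p-values above a cutoff reflects mostly those genes. This estimator, the cutoff `lambda`, and the simulated data are assumptions for illustration; the supplied scripts estimate pi.not as detailed in [3], which may differ.

```r
set.seed(4)

# Simulated raw p-values (hypothetical): 20% of 5000 genes carry a true effect.
p_raw <- c(rbeta(1000, 0.3, 6), runif(4000))

# Null p-values are uniform, so p-values above lambda come mostly from
# nondifferentially expressed genes.
lambda  <- 0.5
pi0_hat <- min(1, mean(p_raw > lambda) / (1 - lambda))
cat("Estimated fraction of nondifferentially expressed genes:",
    round(pi0_hat, 2), "\n")

# Rule of thumb from Step 7 and Note 14.5.1.
if (pi0_hat < 0.7) {
  message("Estimate below 0.7: revisit normalization before proceeding.")
}
```

With the simulated mixture above, the estimate should land near the true fraction of 0.8, which is comfortably above the 0.7 rule of thumb.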

14.5.2 Threshold for local false discovery rate analysis

The local FDR estimator employed here produces a nonmonotonic result (Figure 14.1). A set of heuristics for choosing a p-value threshold given a local FDR threshold is given in [3]. A high-level summary is presented here. If the local FDR estimate crosses the threshold only once, choose the corresponding number of genes from the x-axis. Otherwise, start at the lowest gene number at which the estimate crosses the threshold, and if the local FDR for most of the next 50 to 100 genes is higher than the threshold, choose the corresponding number of genes from the x-axis. If that is not the case, consider the next intersection point for differentially expressed genes and repeat the above procedure until a satisfactory value is reached, at which most of the selected genes have a local FDR lower than the threshold. The local FDR metric should be used in conjunction with the overall FDR metric shown in the plot (Steps 8 to 12). The local FDR metric gives an upper bound for useful overall FDR threshold values, and if the local FDR threshold identified above corresponds to an overall FDR at an unacceptable level, a more stringent threshold should be considered to yield a smaller number of differentially expressed genes, albeit with a lower estimated number of false positives.

Figure 14.3 Analysis of gene expression time series data from [4]. (a) Cluster analysis of the differential expression temporal profiles. The data was clustered using Partitioning Around Medoids with Pearson Correlation as the distance metric and with k = 6 (optimal number obtained from the results shown in Figure 14.2). Each row corresponds to a gene and each column corresponds to one of the four time points (1, 2, 4, 6 hours post partial hepatectomy). Lines demarcate the cluster boundaries. (b) The six clusters from (a) were analyzed for over-represented TF binding sites in the corresponding promoters using PAINT. The representative interaction matrix is shown. The rows represent the promoters and columns represent TFs. Each binding site for a TF on a promoter is marked red or gray, depending on whether the frequency of that binding site in that cluster is statistically significantly over-represented or not, respectively. Binding sites for several TFs are enriched in distinct expression clusters. Lines indicate the mapping between the gene groups in the expression map and the corresponding promoter sets in the regulatory interaction matrix. (Reproduced from [4].)

14.5.3 Format of gene identifiers

Starting from the microarray gene expression data, the Clone IDs or Genbank Accession numbers should be preferred as the Gene Identifiers. The SOURCE resource noted in the materials section (Section 14.2) may be used to convert the data from alternative identifiers. In addition, almost all of the commercial microarray platforms are associated with annotation software to enable this transformation to appropriate Gene Identifiers. PAINT employs the UniGene database to map the Clone IDs to the corresponding Entrez Gene IDs and then utilizes the Ensembl cross-reference annotation information to obtain the corresponding unique set of Ensembl Gene IDs.

14.5.4 Cluster size issues

Based on the results from multiple studies, it is recommended that the Gene List contain at least 30 genes in each cluster. Otherwise, the results for those clusters are difficult to interpret in a robust fashion. Experience indicates that small inaccuracies (< 10%) in the clustering algorithms do not significantly influence the results.

14.5.5 TRANSFAC version issues

In order to perform transcriptional regulatory network analysis using PAINT, users need to obtain appropriate licensed access to the public or professional version of the TRANSFAC database, neither of which is affiliated with the PAINT development team. The public version is available online at http://www.gene-regulation.com following a free registration process. Access to the professional version is available through http://www.biobase-international.com. The login and password required in the analysis step are only used to interact with the appropriate Web servers. This ensures proper handling of the license management issues while providing an option to PAINT users. The professional version of TRANSFAC contains a significantly higher number of TREs and TFs than the public version and hence its use significantly improves the analysis results.



14.5.6 Annotation redundancy in the gene list and multiple promoters

Often, several gene identifiers in the input Gene List map to the same Ensembl Gene. PAINT builds the entire cross-referenced list of Ensembl Genes that corresponds to the input Gene List and considers the unique Ensembl Gene ID list in subsequent analysis. In addition, due to the nature of the cross-reference in the Entrez Gene and Ensembl databases, it is likely that a few of the gene identifiers in the input Gene List individually map to more than one Ensembl Gene. In these cases, PAINT includes all the mapped Ensembl Genes in the analysis. In the rare cases when an identifier maps to five or more Ensembl Genes, it is recommended that the identifier be excluded from the analysis by removing it from the files GeneList.txt and ClusterInfo.txt.

14.5.7 Reference Feasnet selection/generation

The selection of an appropriate reference set is the key to deriving meaningful hypotheses in PAINT analysis. Comparison of the experiment Feasnet to the entire genome gives erroneous results if the input gene list is obtained from a microarray that does not span the entire genome or is specific to a particular tissue/disease. PAINT contains prebuilt reference files for All Ensembl Promoters and Affymetrix arrays. Other commercial arrays are being added to the prebuilt reference list. In case the reference for the microarray of interest is not listed in the Feasnet Analysis and Visualization form (Step 36), the microarray Gene List needs to be first processed in the Feasnet Builder to obtain a microarray Feasnet (Steps 21 through 32, with the appropriate plain text file named microarrayGeneList.txt prepared as specified in Step 19 and used in Step 24).

14.5.8 Multiple testing correction in PAINT

The raw p-values in each over-representation analysis are corrected for multiple testing using an overall FDR estimate. As a first option, the results from the FDR-based adjusted p-values should be employed in identifying the significantly over-represented TREs. However, in some cases, this particular correction is either inappropriate or conservative (due to correlations among TREs) and may yield few or no results. In such cases, one can use the raw p-value based results in PAINT in a discovery approach. While this alternative may result in a set of hypotheses with a high estimated proportion of false positives, in practice it amounts to prioritizing the validation experiments based on individually enriched TREs. The primary role of the presented computational workflow is to generate a reasonable set of candidates for experimental validation. Hence, when multiple testing correction yields few or no results, the alternative raw p-value based approach is the next best option available.

Troubleshooting Table

Problem: pi.not value is less than 0.7
Explanation: Indicates too many differentially expressed genes. May be caused by errors in normalization of the raw gene expression data.
Potential solution: Revisit the normalization procedure to check for bias in the data.

Problem: Too few genes passing reasonable FDR thresholds
Explanation: High variability in the expression data.
Potential solution: Power analysis to estimate the required number of replicates.

Problem: No meaningful clusters in the data
Explanation: Time series may not be well defined.
Potential solution: Consider adding time points to the experimental design.

Problem: Too few TFs passing reasonable FDR thresholds in PAINT
Explanation: Cluster sizes may not be reasonable for statistically significant results, or the promoter length considered is short.
Potential solution: Consider larger-sized clusters, or increase the promoter length analyzed in PAINT.
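The two options discussed in Section 14.5.8 (FDR-adjusted p-values first, raw p-value ranking as a fallback) can be sketched as follows. The text does not state which FDR procedure PAINT applies, so the Benjamini-Hochberg step-up adjustment is used here purely as a stand-in; the TRE names and p-values are invented for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five TREs in one cluster.
raw = {"TRE_1": 0.001, "TRE_2": 0.008, "TRE_3": 0.039, "TRE_4": 0.041, "TRE_5": 0.60}
names = list(raw)
adj = benjamini_hochberg([raw[name] for name in names])

# First option: TREs passing an FDR threshold on the adjusted values.
fdr_hits = [name for name, q in zip(names, adj) if q <= 0.05]

# Fallback: rank by raw p-value to prioritize validation experiments.
ranked_by_raw = sorted(names, key=raw.get)

print(fdr_hits)
print(ranked_by_raw)
```

The fallback list carries no error-rate guarantee; it only orders candidates so that the most individually enriched TREs are validated first, which matches the discovery-oriented use described in the text.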


The integrated approach presented here, starting from microarray gene expression time series data, has been successfully employed to study blood cell development [5] and liver regeneration [4]. In [4], gene expression data was obtained from regenerating rat liver at 1, 2, 4, and 6 hours following partial hepatectomy. The excised livers from each rat at 0 hours were considered as within-animal controls at each time point. Microarray data was obtained using cDNA arrays with ~9,000 clones spotted on glass slides. Mixed-effects ANOVA and local FDR analysis, as detailed in the methods section (Section 14.2), yielded a total of 309 genes at a local FDR threshold of 0.3, corresponding to an approximately 20% overall FDR (Figure 14.1). The robust clustering using the CNC approach detailed above yielded six clusters that are well distinguishable from randomized data (Figure 14.2). These clusters represent early responsive genes as well as those that are differentially regulated at later time points. Approximately half of the differential regulation consisted of up-regulation of genes at the 6-hour time point [clusters 5 and 6 in Figure 14.3(a)]. The PAINT analysis identified 22 TFs as enriched (overall FDR < 30%) in individual clusters with distinct temporal patterns [Figure 14.3(b)]. Some of these TFs (e.g., NF-κB, HNF-1, CREB, C/EBP, GATA, and ATF) are known from previous studies to be involved in the early phase of liver regeneration, whereas others (e.g., AP2a, LEF1, PAX6) are known to contribute to the regulation of cellular processes related to proliferation and differentiation (refer to [4] for details). Several of these predicted TFs were experimentally validated for differential DNA binding activity dynamics [4]. These results demonstrate that relevant functional information on the transcriptional regulatory processes active in early liver regeneration can be obtained from PAINT analysis of clustered microarray gene expression time series data.

14.7 Summary Points

The methodology detailed in this chapter describes an integrated workflow for microarray gene expression data analysis using:

1. A mixed-effects ANOVA approach to quantify differential expression across multiple time points.

2. A local false discovery rate based approach to choose a suitable threshold for identifying differentially expressed genes.

3. A robust clustering approach termed computational negative control for determining distinct dynamical expression patterns that are well separated from randomized partitioning.

4. A sensitive bioinformatics approach using PAINT software for developing hypotheses on transcriptional regulators potentially shaping the observed gene expression dynamics.



Acknowledgments

This work was supported by National Institutes of Health grants AA016919, HL088283, and HL087361.

References

[1] Pavlidis, P., “Using ANOVA for gene selection from microarray studies of the nervous system,” Methods, Vol. 31, 2003, pp. 282–289.

[2] Scholtens, D., A. Miron, F.M. Merchant, A. Miller, P.L. Miron, J.D. Iglehart, and R. Gentleman, “Analyzing factorial designed microarray experiments,” J. Multivariate Anal., Vol. 90, 2004, pp. 19–43.

[3] Khan, R.L., R. Vadigepalli, G. Gao, and J.S. Schwaber, “A windowed local fdr estimator providing higher resolution and robust thresholds,” arXiv:q-bio/0702044v1, 2007.

[4] Juskeviciute, E., R. Vadigepalli, and J. Hoek, “Temporal and functional profile of the transcriptional regulatory network in the early regenerative response to partial hepatectomy in the rat,” BMC Genomics, Vol. 9, No. 1, 2008, p. 527.

[5] Keller, M.A., S. Addya, R. Vadigepalli, B. Banini, K. Delgrosso, H. Huang, and S. Surrey, “Transcriptional regulatory network analysis of developing human erythroid progenitors reveals patterns of co-regulation and potential transcriptional regulators,” Physiol. Genomics, Vol. 28, No. 1, 2006, pp. 114–128.

[6] Pearson, R.K., T. Zylkin, J.S. Schwaber, and G.E. Gonye, “Analytical evaluation of clustering results using computational negative controls,” Proc. 4th Soc. Indust. Appl. Math. Int. Conf. Data Mining, 2004, pp. 188–199.

[7] Gonye, G.E., P. Chakravarthula, J.S. Schwaber, and R. Vadigepalli, “From promoter analysis to transcriptional regulatory network prediction using PAINT,” in Methods in Molecular Biology: Gene Function Analysis, M. Ochs, (ed.), Totowa, NJ: Humana Press, 2007, pp. 49–68.

[8] PAINT: Promoter Analysis and Interaction Network Toolset, http://www.dbi.tju.edu/PAINT.

[9] Pratt, C.H., R. Vadigepalli, P. Chakravarthula, G.E. Gonye, N.J. Philp, and G.B. Grunwald, “Transcriptional regulatory network analysis during epithelial-mesenchymal transformation of retinal pigment epithelium,” Mol. Vis., Vol. 14, 2008, pp. 1414–1428.

[10] Saban, M.R., H.L. Hellmich, M. Turner, N.B. Nguyen, R. Vadigepalli, D.W. Dyer, R.E. Hurst, M. Centola, and R. Saban, “The inflammatory and normal transcriptome of mouse bladder detrusor and mucosa,” BMC Physiol., Vol. 6, No. 1, 2006, p. 1.

[11] Stevens, S.L., B. Gopalan, M. Minami, C.E.H. Erdmann, C.A. Harrington, W.R. Cannon, R.P. Simon, and M.P. Stenzel-Poore, “LPS preconditioning provides neuroprotection via reprogramming of cellular responses to stroke,” Soc. Neuroscience Abstr., 2004, p. 457.14.

[12] Vadigepalli, R., H. Hao, G.M. Miller, H. Liu, and J.S. Schwaber, “EGFR-induced circadian-time dependent gene regulation in suprachiasmatic nucleus,” Neuroreport, Vol. 17, No. 13, 2006, pp. 1437–1441.

[13] Vadigepalli, R., P. Chakravarthula, D.E. Zak, J.S. Schwaber, and G.E. Gonye, “PAINT: a promoter analysis and interaction network generation tool for genetic regulatory network identification,” Omics, Vol. 7, No. 3, 2003, pp. 235–252.

[14] Churchill, G.A., “Using ANOVA to analyze microarray data,” Biotechniques, Vol. 37, No. 2, 2004, pp. 173–177.

[15] Kerr, M.K., and G.A. Churchill, “Statistical design and the analysis of gene expression microarray data,” Genet. Res., Vol. 77, No. 2, 2001, pp. 123–128.

[16] Addya, S., M.A. Keller, K. Delgrosso, C.M. Ponte, R. Vadigepalli, G.E. Gonye, and S. Surrey, “Erythroid-induced commitment of K562 cells results in clusters of differentially expressed genes enriched for specific transcription regulatory elements,” Physiol. Genomics, Vol. 19, No. 1, 2004, pp. 117–130.

[17] Dozmorov, M.G., K.D. Kyker, R. Saban, N. Knowlton, I. Dozmorov, M. B. Centola, and R.E. Hurst, “Analysis of the interaction of extracellular matrix and phenotype of bladder cancer cells,” BMC Cancer, Vol. 6, 2006, p. 12.

[18] Zak, D.E., H. Hao, R. Vadigepalli, G.M. Miller, B.O. Ogunnaike, and J.S. Schwaber, “Systems analysis of circadian time dependent neuronal epidermal growth factor receptor signaling,” Genome Biol, Vol. 7, No. 6, 2006, p. R48.

[19] Broberg, P., “A comparative review of estimates of the proportion unchanged genes and the false discovery rate,” BMC Bioinformatics, Vol. 6, 2005, p. 199.


About the Editors

Arul Jayaraman is an assistant professor in chemical engineering and biomedical engineering at Texas A&M University. He received a Ph.D. in chemical engineering from the University of California at Irvine in 1998 and did his postdoctoral training at the Center for Engineering in Medicine at Massachusetts General Hospital from 1998 to 2000. Dr. Jayaraman’s research interests are in systems biology of inflammation and interkingdom signaling in host-pathogen interactions.

Juergen Hahn is an associate professor in chemical engineering at Texas A&M University. He received a Ph.D. in chemical engineering from the University of Texas, Austin, and did his postdoctoral training at RWTH Aachen. Dr. Hahn’s research interests include systems biology of signal transduction networks and process modeling and analysis.


List of Contributors

Frank Allgöwer
Institute for Systems Theory and Automatic Control
Universität Stuttgart
Pfaffenwaldring 9
70550 Stuttgart, Germany
e-mail: [email protected]

Anand R. Asthagiri
Division of Chemistry and Chemical Engineering
California Institute of Technology
Mail Code 210-41
Pasadena, CA 91125 USA
e-mail: anand@[email protected]

Heike E. Assmus
Systems Biology and Bioinformatics
Department of Computer Science
University of Rostock
Albert Einstein Str. 21
18051 Rostock, Germany
e-mail: [email protected]

Marc R. Birtwistle
University of Delaware
Department of Chemical Engineering
Newark, DE 19716 USA

Sonja Boldt
Systems Biology and Bioinformatics
Department of Computer Science
University of Rostock
Albert Einstein Str. 21
18051 Rostock, Germany

Gregery T. Buzzard
Department of Mathematics
Purdue University
West Lafayette, IN 47907 USA

Christina Chan
Department of Chemical Engineering & Materials Science
1257 Engineering Building
Michigan State University
East Lansing, MI 48824 USA
e-mail: [email protected]

Murat Cirit
Department of Chemical & Biomolecular Engineering
North Carolina State University
Box 7905, Engineering Building I
911 Partners Way
Raleigh, NC 27695 USA

Ertugrul Dalkic
Department of Biochemistry and Molecular Biology
Michigan State University
East Lansing, MI 48824 USA

Maia M. Donahue
Weldon School of Biomedical Engineering
Purdue University
West Lafayette, IN 47907 USA

Timothy C. Elston
Department of Pharmacology
University of North Carolina
Chapel Hill, NC 27599 USA

Rolf Findeisen
Institute for Automation Engineering
Otto-von-Guericke University
Universitätsplatz 2
D-39106 Magdeburg, Germany
e-mail: [email protected]

Juergen Hahn
Department of Chemical Engineering
Texas A&M University
3122 TAMU
College Station, TX 77843 USA
e-mail: [email protected]

Nan Hao
Department of Pharmacology
University of North Carolina
Chapel Hill, NC 27599 USA

Jason M. Haugh
Department of Chemical & Biomolecular Engineering
North Carolina State University
Box 7905, Engineering Building I
911 Partners Way
Raleigh, NC 27695 USA
e-mail: [email protected]

Michael A. Henson
Department of Chemical Engineering
University of Massachusetts
686 North Pleasant Street
Amherst, MA 01003 USA
e-mail: [email protected]

Jared L. Hjersted
Department of Chemical Engineering
University of Massachusetts
Amherst, MA 01003 USA

Zuyi Huang
Department of Chemical Engineering
Texas A&M University
3122 TAMU
College Station, TX 77843 USA

Arul Jayaraman
Department of Chemical Engineering and Biomedical Engineering
Texas A&M University
3122 TAMU
College Station, TX 77843 USA
e-mail: [email protected]

Katy C. Kao
Department of Chemical Engineering
Texas A&M University
3122 TAMU
College Station, TX 77843 USA
e-mail: [email protected]

Boris N. Kholodenko
Department of Pathology, Anatomy, and Cell Biology
Thomas Jefferson University
Philadelphia, PA 19107 USA

Jin-Hong Kim
Division of Chemistry and Chemical Engineering
California Institute of Technology
Mail Code 210-41
Pasadena, CA 91125 USA


Kyongbum Lee
Department of Chemical & Biological Engineering
Tufts University
Medford, MA 02155 USA
e-mail: [email protected]

James C. Liao
Department of Chemical and Biomolecular Engineering
University of California at Los Angeles
Los Angeles, CA 90095 USA

Colby Moya
Department of Chemical Engineering
Texas A&M University
3122 TAMU
College Station, TX 77843 USA

Ryan Nolan
Wyeth BioPharma
Andover, MA 01810 USA

Babatunde A. Ogunnaike
Department of Chemical Engineering
University of Delaware
Newark, DE 19716 USA
e-mail: [email protected]

Ann E. Rundell
Weldon School of Biomedical Engineering
Purdue University
West Lafayette, IN 47907 USA
e-mail: [email protected]

Ranjan Srivastava
Department of Chemical Engineering
University of Connecticut
191 Auditorium Road, Unit 3222
Storrs, CT 06269 USA
e-mail: [email protected]

Stefan Streif
Max Planck Institute for Dynamics of Complex Technical Systems
Sandtorstr. 1
39106 Magdeburg, Germany

Linh M. Tran
Department of Chemical and Biomolecular Engineering
University of California at Los Angeles
Los Angeles, CA 90095 USA

Rajanikanth Vadigepalli
Daniel Baugh Institute for Functional Genomics
Department of Pathology
Thomas Jefferson University
1020 Locust Street
Philadelphia, PA 19107 [email protected]

Steffen Waldherr
Institute for Systems Theory and Automatic Control
Universität Stuttgart
Pfaffenwaldring 9
70550 Stuttgart, [email protected]

Chun-Chao Wang
Department of Chemical & Biomolecular Engineering
North Carolina State University
Box 7905, Engineering Building I
911 Partners Way
Raleigh, NC 27695 USA

Xuewei Wang
Department of Chemical Engineering & Materials Science
Michigan State University
East Lansing, MI 48824 USA

Olaf Wolkenhauer
Systems Biology and Bioinformatics
Department of Computer Science
University of Rostock
Albert Einstein Str. 21
18051 Rostock, Germany
e-mail: [email protected]

Ming Wu
Department of Computer Science and Engineering
Michigan State University
East Lansing, MI 48824 USA

Xuerui Yang
Department of Chemical Engineering & Materials Science
Michigan State University
East Lansing, MI 48824 USA

Necmettin Yildirim
Department of Pharmacology
University of North Carolina
Chapel Hill, NC 27599 USA


Index

A

Absolute sensitivity coefficients, 188, 203, 206
Activated GFP, 47, 50
Adaptive Chebyshev sparse grids, 215
Adaptive sparse grid, 223
Adaptive sparse grid-based optimization, 211–31
  anticipated results, 221–24
  application notes, 228–30
  computational efficiency, 214
  data acquisition, 221–24
  as deterministic, 215
  discussion and commentary, 227–28
  error-controlled interpolant, 214
  example code, 219–20
  experimental design, 215–17
  GA-based optimization comparison, 228
  general procedure, 218–21
  interpretation, 223–24
  introduction to, 212–15
  materials, 217
  parameter space sampling, 211
  search range, 218
  sorted grid points, 222
  summary points, 230–31
  troubleshooting, 224–26
  troubleshooting table, 225
  unique points, 222–23
  unstable points, 223
Adaptive sparse grid interpolation, 213–15
AHDC1, 87
ANOVA
  Experimental Design Matrix, 289
  multifactorial, 296
  use of, 289
A priori information, 265
Arrow diagram models, 58
Automated probing, 255
Automated reverse engineering, 254–56
  advantage, 256
  candidate models, 255
  defined, 254
  importance, 256
  method, 254–56
  partitioning, 255
  snipping, 256
  See also Reverse engineering

B

Balanced growth simulation, 118
Basis functions, 216
Bayesian networks, 248–50
  analysis, 249
  defined, 248
  as directed acyclic graphs, 265
  dynamic, 250
  as graphical model, 248
  interference, 124
  reverse engineering, 249
  for statistical models, 249
  structure, 250
Bidirectional search
  algorithm illustration, 42
  illustrated, 40
  image analysis based on, 38–41
  procedure, 39–40
Biochemical reaction networks
  bound chemical steady states, 144
  sensitivity analysis, 129–46
Biological networks, 234–36
  approaches for inference, 242–56
    Bayesian, 248–50
    Boolean, 245–47
  comparative analysis, 257
  describing, 237
  design principles, 237–38
  discussion and comparison of approaches, 264–66
  genome-scale metabolic modeling, 243–45
  graph theory, 257–58
  hierarchical, 260
  inferred, 256–64
  material, 239–42
  metabolomics, 240
  motifs and modules, 258–60
  ordinary differential equations, 250–56
  proteomics, 240–41
  representation, 236–37
  reverse engineering, 233–66
  scale-free, 258
  static, 242–43
  stoichiometric analysis, 260–61
  summary points, 266
  topology, 247–48
  transcriptomics, 241–42
  types of, 234–35
  visualizing, 236
Biomass composition, 156
Biomolecular networks, 252
BOOL-2 algorithm, 245
Boolean networks, 245–47
  as deterministic, 264
  directed edges in, 262
  dynamics in, 262
  probabilistic, 246
  reverse engineering and, 245
  temporal, 246
BRENDA, 256

C

Canonical Wnt-pathway, 234
Carbon-backbone network, 115
Carbon-shuttle metabolites, 114
Cell harvesting, 274
Cellular metabolism, stoichiometric models, 151–52
Cellular network modeling, 111–26
  anticipated results, 121–22
  application notes, 125
  cell culture, 113
  data acquisition, 121–22
  database, 113
  dynamic simulation parameters, 122
  generalized kinetic expressions, 123–24
  interpretation, 121–22
  introduction to, 112–13
  kinetic, 117–20
  materials, 113
  methods, 113–21
  model network, 121–22
  modularity, 122–23
  parameter estimation, 120–21
  population heterogeneity, 124–25
  summary points, 126
  troubleshooting table, 125
Cellular networks
  carbon-backbone, 115
  defined, 235
  design principles, 237–38
  functional reduction, 116–17
  genome-scale, 114
  reconstruction, 113
  reduction, 113–17
  structural reduction, 113–16
Chebyshev polynomials, 216, 217, 218
Classical flux balance analysis (FBA), 152–54
Clustering
  gene expression, 292–93
  of motifs, 260
  network inference before, 248
Cluster size, 298
Collective fitting approach, 64
CONOPT, 162
Conserved moieties, 261
Continuous dynamic modeling, 261–62
Control analysis, 263–64
Controllability, 136
Corrected FRET (FRETc), 27
Correlation coefficients, 247
Correlation networks, 247
Cost function, 216
  evaluation of, 218
  interpolant mapping, 230
  searching areas of, 221
  surrogate, 227
Covalent modification system, 145–46
Covariance matrix
  parameter, 182
  scaled measurement, 182
  scaled parameter, 182
Cross Gramian
  defined, 137
  empirical, 137–38, 139
  for input and output, 137
  See also Gramians
Crosstalk
  data-driven model to characterize, 67
  inhibition of PI3K affects, 69
Cytotoxicity management, 79

D

DAPI images, 5
Data-driven modeling, 57–72
  computational analysis of signal specificity, 69–72
  data processing, 60–62
  examples, 64–72
  experimental data types, 59–60
  introduction to, 58
  model complexity and, 63
  normalization, 60–62
  parameter specification and estimation, 63–64
  principles of, 59–64
  with quantitative data, 62–63
  systematic analysis of crosstalk, 64–69
Data processing, 60–62
Degree distribution, 82
Design reduction, 190–91
  experimental design procedure, 196–97
  main effects analysis-based, 205
  procedure, 191
  purpose and implementation, 190
  rank analysis-based, 202
  See also Signal transduction modeling
Deterministic models, 62
Digital images, mathematical description, 37–38
Direct Search toolbox, 121
Discrete dynamic modeling, 262, 263
Discretization, 161
Domain-domain interactions (DDIs), 241
Dual problem, 143–44
Dynamic flux balance analysis (DFBA), 149–75
  advantage, 154
  assumptions, 173, 175
  classical FBA versus, 154–55
  defined, 149, 150
  discussion and commentary, 172–75
  fed-batch cultures, 157–64, 175
  methods, 151–55
  model illustration, 154
  in novel metabolic capabilities, 168
  results and interpretation, 155–72
  for Saccharomyces cerevisiae, 149–75
  scope of, 175
  for sensitivity of ethanol productivity, 166
  stoichiometric models, 151–52
  summary points, 175
Dynamic flux balance model, 150
  alternative, 174
  batch, 175
  fed-batch, 175
  parameterization, 173
Dynamic optimization problem, 162
Dynamic simulations
  fed-batch cultures, 157–59
  kinetic modeling, 118–20
  parameters, 122

E

Electroporation efficiencies, 30
Electroporation of TF reporter plasmids, 23–26
  clonal screening, 25
  clonal selection, 24
  into 3T3-L1 preadipocytes, 23–24
  PPARγ activation monitoring, 25
  See also Transcription factor (TF) reporter
Elementary flux mode (EFM)
  algorithm, 115–16
  analysis, 115
Elementary modes, 261
Empirical cross Gramian, 137–38, 139
  defined, 139
  nonlinear, 140
  sensitivity measure, 140
  See also Gramians
Endoplasmic reticulum (ER) stress, 89
Energy balance analysis (EBA), 261
Enzyme-linked immunosorbent assays (ELISAs), 60
Enzyme subsets, 261
Epidermal growth factor (EGF), 3
  ERK graded response to, 8
  samples, 7
  stimulation, 7
Ethanol productivity, 160, 161
  for aerobic-anaerobic switching time, 167, 172
  DFBA results for sensitivity of, 166
  ethanol yield trade-off, 163
  fed-batch, 171
  overproduction mutants, 164–67
Eukaryotic transcription-regulating proteins, 235
Experimental data types, 59–60
Experimental design methodology, 207
Experimental design procedure
  design reduction, 196–97
  factors, 185
  feasibility, 184
  identifiability analysis, 193–94, 197
  impact analysis, 194–96
  initial perturbation and measurement design, 193
  overview, 184–85
  responses, 185
Express GFP, 34
Extracellular-regulated kinase (ERK), 1
  activation in HepG2 cells, 26–28
  graded response to EGF, 8

F

False discovery rate (FDR), 287
  analysis threshold, 297–98
  Benjamini-Hochberg, 82
  local estimator, 297, 298
FANMOD, 260
Feasibility problem
  defined, 142
  semidefinite relaxation and, 142–43
Feasnet, 299
Fed-batch cultures
  dynamic optimization of, 159–64
  dynamic simulation of, 157–59
Fed-batch ethanol productivities, 171
Fed-batch operating policy, 171
Feed-forward loop (FFL), 259–60
Fibroblasts, 66
Fisher Information Matrix (FIM), 180
Fluorescence intensity, 43, 50
Fluorescence microscopy imaging, 5–6
Fluorescence resonance energy transfer (FRET), 2, 60
  control plasmid development, 22–23
  corrected (FRETc), 27
  element selection, 18–19
  in kinase activity monitoring, 12
  occurrence of, 12
  signals, 2
  spectral overlap during, 18
Fluorescent microscopy image analysis
  anticipated results, 46–50
  application notes, 50–53
  based on K-means clustering and PCA, 41–43
  based on wavelets and bidirectional search, 38–41
  data acquisition, 46–50
  image intensity, 43–45
  interpretation, 46–50
  introduction to, 34–35
  inverse problem solution, 47–50
  methods, 38–46
  method selection, 46
  model development, 46–47
  preliminaries, 35–38
  procedure comparison, 45–46
  summary and conclusions, 53
Fluorescent protein
  cloning, 20–21
  PCR, 19–20
Flux analysis theory, 98–99
Flux balance analysis (FBA), 99, 152–54, 261
  assumptions, 173, 175
  classical, 152–54
  optimization problem, 153, 155
  for stoichiometric model, 153
  See also Dynamic flux balance analysis (DFBA)
Fractal kinetic theory, 253
Free fatty acid (FFA)
  concentration in plasma, 78
  cytotoxicity, 75, 77
  intracellular metabolic pathways, 76
  types of, 78
Fus3 cross-inhibition models, 71

G

Gas chromatography (GC), 240
GeneChip Operating Software, 281
Gene expression
  clustering, 293
  profiles, 79–80, 82, 83
  time series, 287–300
Gene identifiers, 298
Gene pairs, synergy scores, 82, 83
GenePix Pro, 281
Gene regulatory networks, 235
Genetic algorithms (GA), 121, 212, 229–30
  finding parameter values, 229–30
  fitness, 229
  optimization comparison, 228–29
Genome expression analysis, 264–65
Genome-scale models, 243–45
  cellular network, 112
  metabolic, 96, 243–45
Genome-scale network, 114
Gibbs free energy change, 117
Global algorithms, 227
Global sensitivity analysis, 133, 145, 216
Glucose media dynamic simulation, 158
Glycerol production, 169
GNU Linear Programming Kit (GLPK), 105
Gramians
  controllability, 136
  cross, 137
  linear sensitivity analysis and, 136–37
  for nonlinear systems, 137–38
  sensitivity measure based on, 138–40
  uses, 136
Graphs, 237
  connectivity, 115
  directed acyclic, 265
  subgraphs, 258, 259
  theory, 257–58
Green fluorescent protein (GFP), 2
  activated, 47, 50
  express, 34
  formation for cell line, 34
Green fluorescent protein (GFP) reporter systems, 12
  anticipated results, 28
  application notes, 23–28
  buffers and reagents, 13
  cell and bacterial culture, 13
  cells, 34
  cloning, 14
  data acquisition, 28
  discussion and commentary, 29–30
  illustrated, 15
  interpretation, 28
  kinase reporter development, 17–23
  materials, 13–14
  methods, 14–23
  microscopy, 14
  principles of, 35–36
  summary points, 30
  3T3-L1 cell culture, 14
  transcription factor reporter development, 14–17
  troubleshooting table, 30
Growing trees, 255

H

HepG2 cells, 79
Hidden Markov models, 250
Hierarchical clustering, 248
Hierarchical networks, 260
High performance liquid chromatography (HPLC), 240
Hub genes, 82, 85–89
Hybridization, 278
Hybrid models, 62

I

Identifiability
  classes, 182–83
  metrics and conditions, 182–84
  parameter, 183–84, 187
  structural, 183
Identifiability analysis, 186–88, 200
  in experimental design process, 193–94, 197
  purpose, 186–87, 188
  steps, 186
Image analysis
  based on K-means clustering and PCA, 41–43
  based on wavelets and bidirectional search, 38–41
  for fluorescent microscopy, 51
  goal, 38
  mathematical description of, 37–38
Image contrast, wavelets and, 39
Imagene, 281
Immunoblotting, 198
Immunofluorescence (IF) staining, 2
Impact
  metrics, 189, 195
  net, 206
  parameter-specific, 206
Impact analysis, 188–90, 200–201
  defined, 189
  effects-based, 196
  experimental design procedure, 194–96
  main effects-based, 206
  methods, 206
  procedure, 190
  purpose and implementation, 188–90
  rank-based, 195, 201, 206, 207
  See also Signal transduction modeling
Importance coefficients, 203, 204, 206
Independent component analysis (ICA), 124
Infeasibility certificates
  from dual problem, 143–44
  sensitivity analysis via, 141–46
Inferred networks, 256–64, 266
Information theory, 243
Information theory-based scores, 80
Initial perturbation measurement design
  constructing, 198
  experimental design process, 193
  immunoblotting, 198
  procedure, 186
  purpose and implementation, 185–86
  See also Signal transduction modeling
INSIG2, 88
Intracellular signaling pathways, 1
Inverse Laplace transform, 49–50
Inverse problem
  for determining TF concentrations, 47–50
  solving, 47–50

K

KEGG database, 166, 167
  defined, 235
  gene insertion library from, 168
  LIGAND, 167
Kholodenko method, 252
Kinase reporter
  development, 17–23
  fluorescent protein cloning, 20–21
  fluorescent protein PCR, 19–20
  FRET-based, 17
  FRET control plasmid development, 22–23
  FRET element selection, 18–19
  functionality, 17
  linker oligonucleotide development/annealing, 21
  linker region cloning, 22
  See also Green fluorescent protein (GFP) reporter systems
Kinases, 11
Kinetic modeling, 117–20, 252
  dynamic simulations, 118–20
  rate equations, 117–18
  See also Cellular network modeling
K-means clustering, 36–37
  in dynamic profile determination, 50
  fluorescent cell regions/clusters by, 45
  image analysis based on, 41–43
  key idea, 37
  principle, 36
  procedure steps, 36–37
K-medians, 293

L

Labeling, 277
Lagrange dual problem, 143–44
Laplace transformation, 48–49
  application, 48–49
  inverse, 49–50
LASSO tool, 243
Levenberg-Marquardt method, 64
Linear program (LP), 153
Linear sensitivity analysis, 134–36
  defined, 135
  disadvantages, 135–36
  Gramians and, 136–37
  relative sensitivities, 135
  See also Sensitivity analysis
Linker
  oligonucleotide development/annealing, 21
  region cloning, 22
Local parameter identifiability, 183–84
Local sensitivity analysis, 216
Local structural identifiability, 183

M

MACF1, 87–88
Main effects-based impact analysis, 206, 207
MAPKs. See mitogen-activated protein kinases
Markov Chain Monte Carlo, 256
Markov transition model, 119
Mass spectrometry (MS), 240
Mathematical modeling, 57, 179, 212
MATLAB
  delay differential equation solver, 192
  EFM algorithm implementation in, 116
  optimal search function, 165
MAvisto, 260
Mean Value Theorem, 253
Mechanistic model complexity, 62
MEK activation comparator (MAC), 69
Meshes, comparison of, 214
Metabolic control analysis (MCA), 130
Metabolic flux analysis (MFA), 98–99, 116
Metabolic modeling, 95–107
  anticipated results, 105–6
  applications of, 96
  benefits, 96
  data acquisition, 105–6
  discussion and commentary, 106–7
  feasible solution determined, 105–6
  flux analysis theory, 98–99
  genome-scale, 96
  implementation, 95
  interpretation, 105–6
  introduction to, 96–98
  materials and methods, 98–105
  model development, 99–100
  no feasible solution determined, 106
  objective function, 100–104
  optimization, 104–5
  summary points, 107
  uses, 95
Metabolic networks
  defined, 234
  reconstructed, 97
Metabolic profiles, 240
Metabolites
  carbon shuttle, 114
  gene selection and, 80
  integrating, 83
  mass balances, 118
  measurements, 80
  trends, 80, 81
Metabolomics, 240
Metatool, 116
METLIN, 240
Metropolis algorithm, 65–66
Mfinder, 260
Michaelis-Menten enzymatic rate equations, 117
  alternatives, 123
  as hyperbolic functions, 123
Michaelis-Menten kinetics, 169
Microarrays, 242
  cDNA, 264
  data acquisition, 281
  experiment workflows, 242
  transcriptional profiling with, 276–81
Mitogen-activated protein kinases (MAPKs), 57, 214
  cascades, 64
  components, 69
  experimental data, 222
  Fus3, 70
  ODE model, 217
  two-parameter search, 224, 226
Modularity, cellular network modeling, 122–23
Monod kinetics, 117
Monte Carlo optimization, 70
MOSEK, 159
Motifs, 258–59
  clustering of, 260
  defined, 258
  dynamic stability, 260
  in E. coli transcriptional regulation network, 260
  network characterization, 259
mRNAs, 242, 279–80
Multiple shooting strategies, 212
Multiwavelet formulations, 216

N

Negative control experiments, 50
Net impacts, 206
Network component analysis (NCA), 124, 273, 282–83
  defined, 282
  procedure, 282–83
  solutions, 284
  toolbox, 284
  uses, 282
Network component mapping (NCM), 124
Network reconstruction, 113
Nonalcoholic steatohepatitis (NASH), 76
Normalization, 60–62, 281–82
  biological variability, 61
  population endpoint measurements, 61
  purpose, 60–61
Nuclear magnetic resonance (NMR), 240

O

Objective functions, 100–104
  across multiple scales, 103
  choices, 100–102
  determination, 102–3
  evaluation, 102–3
  with largest value, 103
  maximization of ATP production rate, 100
  maximization of biomass production, 100
  minimization of, 182
  minimization of ATP production rate, 100
  minimization of nutrient uptake rate, 100
  minimization of redox potential production rate, 100
  multiple simultaneous, 104
  steady state flux distribution, 116
  See also Metabolic modeling
Observability, 136
ODRPACK, 64
Optimization
  adaptive sparse grid-based, 211–31
  of fed-batch cultures, 159–64, 174
  global methods, 212
  searches, 227
  tissue and cell function, 111
Ordinary differential equations (ODEs), 97, 217, 250–56
  automated reverse engineering, 254–56
  control analysis simulation, 261–64
  defined, 250
  dynamics simulation, 261–64
  form, 250
  parameter estimation, 256
  power law modeling, 253–54
  sensitivity analysis simulation, 261–64
  small-scale biochemical network identification, 251–53
  stochastic, 243
  system of, 251
Organismal scale, 104
Over-representation, 296

P

PAINT, 287availability, 290clustered data, 288cross-referenced list, 299defined, 288in gene regulation, 289multiple testing correction in, 299–300prebuilt reference files, 299results, 290results interpretation, 296transcriptional regulatory network

analysis with, 293–95uses, 288

Parameter estimationcellular network modeling, 120–21signal transduction modeling, 181–82

Parameter identifiabilitydefined, 183local, 183–84metrics, 206testing, 187See also Identifiability

Parameterscovariance matrix, 182identification, 211sensitivity matrix, 180, 182

Parametric sensitivity analysis, 132–33
Partitioning, 255
Pathway Genome Database (PGDB), 99
Pathway Interaction Database (PID), 235
Pearson Correlation, 293
Permutation tests, in synergy significance evaluation, 82
Phenotype, 77, 83
Phenotype-specific gene network, 75–90
    anticipated results, 82
    application notes, 83–89
    cell culture and reagents, 79
    cytotoxicity management, 79
    data acquisition, 82
    discussion and commentary, 83
    experimental design, 78–79
    fatty acid salt treatment, 79
    gene expression profiling, 79–80
    gene selection based on metabolite trends, 80
    hub genes in network, 85–89
    interpretation, 82
    introduction, 76–77
    materials, 79
    metabolites measurements, 80
    methods, 79–82
    network topology evaluation, 82
    reconstruction, 83
    summary points, 89–90
    synergy network characteristics, 84–85
    synergy scores calculation, 80–82
    synergy significance evaluation, 82
    topological analysis, 78
    troubleshooting table, 84

Phosphoinositide 3-kinase (PI3K), 57, 66
Phosphorylated ERK (ppERK)
    antibody labeling of, 4–5
    average nuclear intensities, 7
    fluorescence microscopy imaging of, 5–6
    measurements, stimulation for, 4
    nuclear, 6, 7

Phosphorylated STAT5 (pSTAT), 193, 194, 196
Platelet-derived growth factor (PDGF)
    dose, 68
    independent signaling modes, 69
    receptors, 66, 68
    stimulation, 69

Polynomial chaos, 216
Power law modeling, 253–54
    from fractal kinetic theory, 253
    illustrated, 254
    properties, 253

Principal component analysis (PCA), 33
    application illustration, 43
    defined, 37
    in dynamic profile determination, 50
    fluorescent cell regions/clusters by, 45
    image analysis based on, 41–43
    motivation for using, 37
Probabilistic Boolean networks, 246
Process diagrams, 237
Protein kinase C (PKC), 12
Protein-protein interaction (PPI), 240
    data analysis, 241
    network, 87
    validating, 241

Proteomics, 240–41

Q

Quantitative immunofluorescence, 1–9
    anticipated results, 6–7
    data acquisition, 6–7
    discussion and commentary, 8
    experimental design, 3
    interpretation, 6–7
    introduction to, 2–3
    materials, 3–4
    methods, 4–6
    statistical guidelines, 7
    summary points, 8–9
    troubleshooting table, 9
Quantitative mass spectrometry, 207
Quasi-Monte Carlo algorithm, 215

R

Rank analysis, 189
Rank-based impact analysis, 195, 201, 206, 207
Ras-dependent pathways, 57
Ras/Erk pathway, 66
Rate equations
    generalized, 123–24
    kinetic modeling, 117–18
    Michaelis-Menten, 117, 123

Relative sensitivities, 135
REVEAL algorithm, 245
Reverse engineering, 238–39, 266
    automated, 254–56
    Bayesian network, 249
    before/after, 238
    Boolean networks and, 245
    defined, 238
    improving, 266
Reversible covalent modification, 131–32
RNA
    purification, 274–76
    transcriptional profiling for, 280–81

S

SABIO-RK, 256
Saccharomyces cerevisiae
    dynamic simulation, 151
    fed-batch cultures, 157–64
    fed-batch fermentation, 150
    growth phenotypes of knockout mutants, 150
    for renewable liquid fuel applications, 164
    steady-state FBA, 171
    steady-state FBA mutants, 166
    stoichiometric models of, 155–57

Scaffolding proteins, 234
Scale-free networks, 258
Semidefinite relaxation, 142–43
Sensitivity analysis, 129–46
    discussion and outlook, 146
    via empirical Gramians, 136–41
    global, 133, 145, 216
    via infeasibility certificates, 141–46
    introduction to, 130
    linear, 134–36
    local, 216
    parametric steady-state sensitivity, 132–33
    purposes, 130
    reversible covalent modification, 131–32
    simulation of, 263–64
    system class and, 131–34
    uses, 129

Serial Analysis of Gene Expression (SAGE), 242
SH3RF2, 88–89
Shortest path length, 82
Signal specificity in yeast, 69–72
Signal transducer and activator of transcription 5 (STAT5) signaling, 192

Signal transduction, 34
    cascades, 271
    in eukaryotic cells, 58
    networks, 2, 234
Signal transduction modeling, 179–207
    anticipated results, 192–97
    application notes, 197–205
    classes of factors and responses, 185
    data acquisition, 192–97
    design implementation, 191–92
    design modification and reduction, 190–91
    discussion and commentary, 205–7
    experimental design procedure, 184–85
    identifiability analysis, 186–88
    identifiability metrics/conditions, 182–84
    impact analysis, 188–90
    initial perturbation and measurement design, 185–86
    interpretation, 192–97
    introduction to, 180–85
    methods, 185–92
    parameter estimation, 181–82
    structure, 180–81
    summary points, 207

Silhouette Coefficient (SC), 296
Simulated annealing, 212
Single-cell endpoint measurements, 60
Single-cell kinetic measurements, 60
Single input module (SIM), 259
Singular value decomposition (SVD), 124
Snipping, 256
Sontag method, 252
Sorted grid points, 222
Source tool, 290
Sparse Grid toolbox, 217, 223
Static networks, 243
Steady-state flux distribution, 116, 117
Steady states
    bound feasible, 144–45
    computing, 141
    shift, 134

Steady-state sensitivity
    defined, 133
    parametric, 132–33
    See also Sensitivity analysis
Stochastic sampling, 211
Stoichiometric analysis
    biological networks, 260–61
    defined, 260
    properties found by, 261
Stoichiometric models, 150
    of cellular metabolism, 151–52
    classical FBA for, 153
    of S. cerevisiae metabolism, 155–57
    wild-type, 168, 169
Structural identifiability
    defined, 183
    local, 183
    parameters and, 187
    testing, 207
    See also Identifiability

Subgraphs, 258, 259
Subpopulation fractions, 111
Sum of squared differences (SSD), 70
Synergy
    analysis, 75–90
    defined, 80
    significance evaluation, 82
Synergy networks
    degree distribution of, 85
    distribution of shortest path lengths in, 85
    hub genes in, 85–89
    topographical characteristics, 84–85
    topology analysis, 90
Synergy scores
    calculation of, 80–82
    gene pairs, 82, 83
    range, 82

Systematic system perturbations, 138
Systems biology, 239
Systems Biology Graphical Notation (SBGN) initiative, 237
Systems Biology Markup Language (SBML), 99

T

Temporal Boolean networks, 246
3T3-L1 cell culture, 14

Time of flight mass spectrometry (MS-TOF), 239
Time series gene expression, 287–300
    analysis, 297
    anticipated results, 296
    application notes, 300
    data acquisition, 296
    differentially expressed gene identification, 291–92
    discussion and commentary, 296–300
    interpretation, 296
    introduction to, 288–89
    materials, 289–90
    methods, 291–95
    normalized, 289
    number of clusters, 296
    robust clustering, 292–93
    summary points, 300
    transcriptional regulatory network analysis with PAINT, 293–95
Total cytoplasmic STAT5 (tSTAT), 193, 194, 196
Total internal reflection fluorescence (TIRF), 60
Transcriptional profiling, 276–81
    with DNA microarrays, 276–81
    hybridization, 278, 281
    labeling, 277
    for mRNA, 279–80
    for total RNA, 280–81
    washing and scanning, 278–79

Transcriptional Regulatory Element Database (TRED), 282
Transcriptional regulatory networks
    analysis with PAINT, 293–95
    application notes, 284–85
    cell harvesting, 274
    discussion and commentary, 284
    DNA microarray data acquisition, 281
    illustrated, 272
    introduction to, 272–73
    materials, 273
    methods, 273–81
    NCA, 282–83
    normalization, 281–82
    profiling with DNA microarrays, 276–81
    RNA purification, 274–76
    summary points, 285
    troubleshooting table, 285

Transcription factors (TFs), 11
    activities (TFAs), 283
    binding elements, cloning, 16–17
    binding states, identification of, 15–16
    candidate, 288
    concentrations, inverse problem for determining, 47–50
    DNA binding sites, 273
    GFP-based, 12
    identification, 271
    master controllers of, 64
    response element, 14
Transcription factor (TF) profiles
    computation of, 33–53
    concentration, 50
    damped oscillation, 50

Transcription factor (TF) reporter
    cloning TF binding elements into, 16–17
    defined, 14
    development, 14–17
    electroporation of plasmids, 23–26
    See also Green fluorescent protein (GFP) reporter systems
Transcriptome analysis, 271–85
Transcriptomics, 241–42
    defined, 241–42
    methods for studying, 242

TRANSFAC database, 298
Troubleshooting tables
    adaptive sparse grid-based optimization, 225
    cellular network modeling, 125
    green fluorescent protein (GFP) reporter systems, 30
    phenotype-specific gene network, 84
    quantitative immunofluorescence, 9
    transcriptional regulatory networks, 285

U

Unique points, 222–23
Unstable points, 223

W

Washing and scanning, 278–79
Wavelets, 36
    image analysis based on, 38–41
    in image contrast, 39

Whole-cell models, 112

Z

Z-score, 259
