
Mathematical optimization of biological systems

by

Laurence Yang

A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy

Graduate Department of Chemical Engineering
University of Toronto

Copyright © 2012 by Laurence Yang

Abstract

Mathematical optimization of biological systems

Laurence Yang

Doctor of Philosophy

Graduate Department of Chemical Engineering

University of Toronto

2012

System-level design and optimization of cell metabolism is becoming increasingly important for the renewable production of fuels, chemicals, and pharmaceuticals. Mathematical models of the metabolism of biological systems are improving in terms of their accuracy and scope of predictions, but are also growing in complexity. Consequently, efficient and scalable algorithms are needed for performing simulations and metabolic system design using these models. Such algorithms are being actively developed and are used in industry today. However, many of the existing algorithms scale poorly to the genome-scale, due to an exponential increase in computational effort with model size or design scope. Therefore, it is difficult to apply these algorithms to the identification of more complex designs using detailed models. This thesis is aimed at meeting these challenges. First, we present EMILiO, a strain design algorithm that identifies individually fine-tuned flux levels with unprecedented speed via successive linear programming. To test the algorithm, we efficiently generate over 100 strain designs for several industrially important biochemicals. We then develop a framework to assess the robustness of strain designs to industrially relevant perturbations and uncertainties. We then explore how metabolomics, an emerging technology for high-throughput measurement of many metabolites, can be used to improve model precision, despite the high variability typically found in these data sets. Accordingly, we develop an algorithm to randomly sample both fluxes and concentrations and use the algorithm to design a sequence of experiments, in which high-variance metabolomics data are used to identify a subset of metabolites needing more precise measurements. Finally, we evaluate some approaches for extending the methods developed in this thesis for strain design to the identification of optimal enzyme manipulations using nonlinear kinetic models of cell metabolism. The methods developed in this thesis should aid metabolic engineers in the efficient design of robust microbial strains.

Acknowledgements

My doctoral program at the University of Toronto has been rewarding in large part due to many individuals. First and foremost, my sincerest gratitude goes to my two supervisors, Professor Cluett and Professor Mahadevan. In a unique synergy, my mentors enabled me to explore problems in science to my heart’s content while providing valuable guidance. I also thank the members of my reading committee. Professor Edwards raised my awareness of biological tractability and the importance of maintaining cohesiveness. Professor Frances provided valuable insight and fundamental inquiries on the optimization techniques that are so integral to this thesis. I also thank my colleagues in the Laboratory for Metabolic Systems Engineering and the Process Control Group. In particular, Nik Anesiadis, with whom I have shared the office for the past seven years, has been a valuable friend and colleague.

Many talented and interesting individuals at the University have enriched my PhD program. Dan Tomchyshyn, despite his challenging schedule as Head of IT in the department, was always generous in sharing his vast knowledge of networking, file systems, and all things IT. Without his help, I would not have become the avid Linux user I am today. Paul Jowlabar is irreplaceable in the department due to his unmatched experience in and dedication towards the proper education of young engineers. Glenn Wilson provided me with valuable input on industrial challenges for process control. Fred, with both humor and professionalism, has been an important part of my stay at the University.

I also extend gratitude to the friends and family outside of the lab. Yaser provided inspiration and information on matters of science and the world over expensive, and sometimes exotic, meals. Virgil helped me to think about scientific advancement within the broader context of socioeconomic, political, and legal systems. The talented Andrei introduced me to practical issues in the computer industry and to various programming languages. Charlie, with his singular intellect, unwavering loyalty, and unparalleled aptitude for enjoying life, will continue to be a source of inspiration to me. I owe my parents many thanks for their unwavering trust in all of my endeavors and for enabling me to graduate debt-free.

Finally, I gratefully acknowledge financial support from the Natural Sciences and Engineering Research Council of Canada, Genome Canada, and the University of Toronto.

Contents

1 Introduction
1.1 Motivation
1.2 Challenges and objectives
1.3 Contributions
1.3.1 EMILiO: a fast algorithm for genome-scale strain design
1.3.2 Robust strain design
1.3.3 Experiment design using noisy metabolomics data
1.3.4 Additional contributions

2 Literature Review
2.1 Constraint-based modeling
2.1.1 Fundamentals
2.1.2 Extensions and applications of flux balance analysis
2.1.3 Opportunities for advancement
2.2 Computer-aided strain design
2.2.1 Bilevel optimization-based strain design
2.2.2 Extensions of the bilevel optimization framework
2.2.3 Alternative approaches
2.2.4 Opportunities for advancement
2.3 Simulation and design using kinetic models of metabolism
2.3.1 Optimization approaches to metabolic engineering using kinetic models
2.3.2 Stability of kinetic models

2.3.3 Mechanistic versus generalized rate equations
2.3.4 Opportunities for advancement
2.4 Synthesis and summary of the literature
2.4.1 Constraint-based modeling
2.4.2 Computer-aided strain design
2.4.3 Simulation and design using kinetic models of metabolism
2.5 On the chapters to follow
2.5.1 A Unifying Theme of this Thesis
2.5.2 Outline of the remainder of the thesis
2.5.3 Types of models used in the thesis

3 EMILiO: A fast algorithm for genome-scale strain design
3.1 Abstract
3.2 Introduction
3.3 Materials and Methods
3.3.1 Flux balance analysis, model reduction, and in silico strain design verification
3.3.2 The formulation of EMILiO
3.3.3 Solution of the MPCC using ILP
3.3.4 Pruning the Design Using LP
3.3.5 Minimal and Alternate Optimal Designs Using MILP
3.3.6 Modified OptReg and Local Search
3.3.7 Local search implementation of modified OptReg
3.3.8 Determining minimum flux magnitudes
3.4 Results and Discussion
3.4.1 Comparison of the strain design algorithms
3.4.2 Large-scale exploration of the strain design space
3.4.3 Increasing production beyond knockout strains
3.5 Conclusions

4 Genome-scale robust strain design
4.1 Abstract
4.2 Introduction
4.3 Robust strain design
4.4 Materials and Methods
4.4.1 Flux balance analysis, model reduction, and in silico strain design verification
4.4.2 EMILiO
4.4.3 Strain design using EMILiO
4.4.4 Escaping from local optima
4.4.5 Generating alternate strain designs
4.4.6 Sensitivity analysis of a strain design
4.4.7 Determining the perturbation size
4.4.8 Sensitivity of succinate strains without aerobic fumarate reductase activity
4.4.9 Modeling the metabolic response to osmotic stress
4.4.10 Modeling byproduct secretion and re-consumption with molecular crowding and membrane occupancy constraints
4.4.11 Mean-variance portfolio optimization
4.5 Results and Discussion
4.5.1 Computational strain design
4.5.2 Pathway diversification improves robustness against flux perturbations
4.5.3 Diversity increases sensitivity to small perturbations
4.5.4 Enhanced robustness of L-serine production via low-yield pathways
4.5.5 Assessing robustness against industrially relevant perturbations
4.6 Conclusions

5 Designing experiments using noisy metabolomics data
5.1 Abstract
5.2 Introduction

5.3 Preliminaries
5.3.1 Constraint-Based Modeling
5.3.2 Randomly Sampling the Solution Space
5.4 Sampling the non-convex solution space
5.5 Identifying Important Metabolites
5.6 Results
5.6.1 Sampling the Non-Convex Solution Space
5.6.2 Computational Performance of the Sampling Algorithm
5.6.3 Example: Simplified Model of E. coli Central Metabolism
5.7 Conclusions

6 Scalable methods for strain design using kinetic models
6.1 Abstract
6.2 Introduction
6.3 Design of optimal enzyme manipulations using approximative kinetic models
6.4 Methods
6.4.1 Solution using successive linear programming
6.4.2 Escaping local optima with convex relaxations
6.5 Result: serine synthesis in E. coli
6.6 Conclusions

7 Conclusions

8 Recommendations for Future Work

Bibliography

A The Robust Strain Design Algorithm
A.1 Succinate overproduction strains
A.2 Simple example of the portfolio effect

B Simulation and Design using Kinetic Models of Metabolism
B.1 Reference state and elasticity matrix

C Strain design for balanced yield, titer, and productivity
C.1 Introduction
C.2 Results
C.2.1 Succinate strains using GDLS
C.2.2 Butanediol strains using GDLS
C.3 Methods

List of Tables

1.1 Global renewable chemicals market sizes by year ($ millions)

2.1 Comparison of some of the existing strain design algorithms

2.2 Models used in this thesis. GAR: gene-associated reactions (if genes are not present in the model, GAR refers to metabolic reactions excluding transport and biomass synthesis), NGAR: non-gene-associated reactions.

3.1 Modified bound definitions for OptReg’

3.2 Reactions whose minimum flux magnitude (see Section 3.3.8) deviated from that of the wild-type. Reference is made to experimental evidence.

4.1 Perturbations and model uncertainties investigated

4.2 Mean and maximum succinate yields through three controlled pathways based on 1,000 random samples

4.3 Covariance matrix for the three controlled pathway fluxes based on 1,000 random samples

4.4 Critical perturbation size, δ*(n), indicating the perturbation size at which robustness of diversified strains (with n pathways) exceeds that of the most efficient strain.

B.1 Elasticity matrix at the reference state in sparse format. The full elasticity matrix can be constructed by creating an n × m matrix (n = number of fluxes and m = number of metabolites) of zeros and filling in the non-zero entries at the row (reaction) and column (metabolite) indices specified in the table below.

B.2 Reference flux for the model of E. coli central metabolism (Chassagnole et al., 2002)

B.3 Reference concentrations for the model of E. coli central metabolism (Chassagnole et al., 2002)

C.1 Knockout strategies for succinate overproduction identified using GDLS

C.2 Knockout strategies for BDO overproduction identified using GDLS

List of Figures

3.1 Schematic of the definition of up- or down-regulation in OptReg’, based on modified flux bounds.

3.2 Comparison of succinate production strains identified by EMILiO, OptReg’LS, and OptReg’. Succinate production envelopes for OptReg’, OptReg’LS, and EMILiO using the iAF1260 genome-scale model of E. coli metabolism (top). CPU times for strain design using EMILiO, OptReg’LS, and OptReg’ (bottom). OptReg’LS converged in two iterations. CPU time is shown in log scale.

3.3 Summary of strategies (i.e., the individual reactions being modified) identified by EMILiO for succinate production and comparison to existing literature. While many strategies are supported by previous experimental and/or computational literature, many more unvalidated predictions have been generated in this work. Strategies were identified for aerobic, anaerobic, or both conditions. Some of the frequently used strategies are annotated. Nodes are linked if the strategies are used together frequently.

3.4 The landscape of strategies for succinate production. Squares indicate modifications having a large impact on strain performance. Diamonds indicate modifications identified frequently in the 234 alternate strain designs.

3.5 The 234 strains grouped into 15 clusters using affinity propagation. (A) Clusters are formed based on the deviation of minimum flux magnitudes, relative to those of the wild-type. These deviations represent changes in physiology of each strain. Larger rectangles represent clusters with a larger number of strain design members. (B) The fluxes that deviate consistently across the 15 strains are shown in yellow, while those fluxes distinguishing cluster 5 from cluster 1 are shown in magenta.

4.1 Nominal and mean succinate yields of the 98 strains generated using EMILiO. (A) Succinate yield of each strain when no perturbations are present (i.e., the nominal yields). Dashed red line denotes the maximal (nominal) yield at a growth rate of 0.1 h⁻¹, the minimum required growth rate for the strain designs. The red vertical bars are used to indicate the three succinate strains referred to as strain I, II, and III in the main text. (B) Succinate yield of each strain when gene expression noise is present, based on 1,000 random samples for each strain (see Section 4.4.6 for the procedure). Blue dots show the mean of the 1,000 samples of succinate yield for each strain, while the red line shows the median. Black lines show the minimum and maximum succinate yield for each strain, while the minimum and maximum values in the green area correspond to the 25th and 75th percentiles of succinate yield, for each strain. Strains are sorted in order of descending mean yield (in (A) as well). (C) Histogram of succinate yields across the 98 strains when no perturbations are present. (D) Histogram of mean succinate yields across the 98 strains when gene expression noise is present. 52% of the 98 strains achieved a nominal yield above 99% of the maximum succinate yield. In contrast, only 1% of strains achieved a mean yield above 99% of the highest mean yield, which was 88% of the maximal nominal succinate yield.

4.2 Robustness of three succinate strains. (A) Histograms of succinate yield, relative to glucose uptake flux, for strains I to III. (B-D) Histograms of controlled fluxes, relative to glucose uptake flux. (E) Strains I to III use one to three alternative routes to succinate production, respectively: the reductive branch of the citric acid (TCA) cycle (1), the glyoxylate shunt (2), and the oxidative branch of the TCA cycle (3). (F) Mean succinate yield. (G) Standard deviation of succinate yield. (H) Robustness, R, of succinate yield, calculated according to Eq. 4.1. The simultaneous use of a large number of pathways improves robustness against variations in the controlled fluxes. FRD2: fumarate reductase, MALS: malate synthase, AKGDH: α-ketoglutarate dehydrogenase.

4.3 Example of portfolio optimization for three succinate strains I, II, and III. Based on 1,000 random samples, we calculated the mean flux (Table 4.2) through each of the three succinate producing pathways (reductive TCA, glyoxylate shunt, and oxidative TCA). Based on the random samples, we determined the covariance matrix (Table 4.3) between these three pathways. Due to mass balance constraints and the topological arrangement of the three pathways, the covariance matrix has negative elements. Therefore, the weighted combination of the three pathways can have a smaller variance than that of individual pathways. A quadratic program is formulated to identify the optimal fluxes through the pathways to maximize the mean yield for a specified variance of succinate yield, or risk (see Section 4.4.11). Strain I uses only the highest-yield pathway, so its risk (standard deviation of yield) and return (mean succinate yield) are the highest of the three strains. Strain II uses two pathways, so flux through each pathway can be adjusted to achieve a lower risk than any individual pathway, albeit for an intermediate level of return. Strain III uses three pathways, all of them showing a weak negative correlation, so it is possible to achieve an even lower risk for an intermediate return. Additionally, strain III achieves a higher return than strain II for the same level of risk.

4.4 Robustness of three succinate strains as functions of perturbation size. (A) Mean product yield versus perturbation size. Error bars represent one standard deviation. (B) Standard deviation of product yield versus perturbation size. (C) Robustness (R) versus perturbation size. Critical perturbation sizes for strains II (δ*(2) = 0.395) and III (δ*(3) = 0.415) are indicated by dotted lines. Strains I, II, and III use one, two, and three succinate production pathways, respectively. Strain I uses only the highest-yield pathway; therefore, its mean yield is highest when perturbations are small. However, the robustness of strain I deteriorates rapidly as perturbation size increases, while strain III is the most robust. Strain II is the most robust for only a narrow range of perturbation sizes (i.e., for 0.395 ≤ δ ≤ 0.415).

4.5 L-serine production pathways and strains. (A) Two pathways are available for L-serine production: (1) the PSP route and (2) the GHMT route. (B) We designed three strains (strains I, II, and III), using one or both of these pathways. In addition, strain III inhibits NDPK3 and CTPS2 fluxes.

4.6 Robustness of three L-serine strains as functions of perturbation size. (A) Mean yields of three L-serine strains as functions of perturbation size. Error bars represent one standard deviation. (B) Standard deviation of L-serine yield for the three strains. (C) Robustness values of the three L-serine strains as functions of perturbation size. Strain I uses one L-serine synthesis pathway, while strains II and III use two pathways. Strain III inhibits two additional reactions, compared to strain II, which results in improved nominal yield but decreased robustness.

4.7 Histograms showing the simulated response of succinate strains to industrially-relevant perturbations. All controlled fluxes are perturbed due to gene expression noise. Industrially-relevant perturbations include variations in glucose uptake rate (a-b), oxygen uptake rate (c-d), osmotic stress (e-f), byproduct secretion due to overflow metabolism (g-k), and re-consumption of byproducts (l-p). While simulating byproduct secretion, membrane occupancy coefficients were subjected to parameter uncertainty (g). While simulating byproduct consumption, molecular crowding coefficients were subjected to parameter uncertainty (l). For oxygen and substrates (glucose, acetate, formate, and ethanol), negative fluxes correspond to uptake while positive fluxes correspond to secretion. ATPM: non-growth-associated ATP maintenance, kMemFRD: membrane crowding coefficient of fumarate reductase, kVol: molecular crowding coefficient.

4.8 Respiration and succinate production. (1) Reductive branch of the citric acid (TCA) cycle. (2) Glyoxylate shunt. (3) Oxidative branch of the TCA cycle. When fumarate reductase (FRD) is repressed (A), the quinol-dependent NADH dehydrogenase activity dominates and oxygen is the terminal electron acceptor. In contrast, when FRD is activated (B), fumarate is available as an additional terminal electron acceptor. Accordingly, the production of succinate becomes insensitive to fluctuations in oxygen availability.

4.9 Nominal and mean succinate yield of 98 strains without aerobic fumarate reductase (FRD) and anaerobic pyruvate dehydrogenase (PDH) activities. (A) Succinate yield of each strain when no perturbations are present. All yields were calculated without aerobic FRD and anaerobic PDH activities. However, to easily compare results with Fig. 4.1, the dashed red line denotes the maximal yield at a growth rate of 0.1 h⁻¹ when aerobic FRD and anaerobic PDH activities are enabled. (B) Succinate yield of each strain when gene expression noise is present, based on 1,000 random samples for each strain. Blue dots show the mean of the 1,000 samples of succinate yield for each strain, while the red line shows the median. Black lines show the minimum and maximum succinate yield for each strain, while the minimum and maximum values in the green area correspond to the 25th and 75th percentiles of succinate yield, for each strain. Strains are sorted in order of descending mean yield (in (A) as well). (C) Histogram of succinate yield across the 98 strains when no perturbations are present. (D) Histogram of mean succinate yield across the 98 strains when gene expression noise is present. Mean succinate yields ranged from 0% to 66% of the maximal yield, and had a median of 42% of the maximal yield.

4.10 Correlation between succinate production and oxygen uptake for strain III. Colors are proportional to growth rate as shown in the colorbar. When fumarate reductase (FRD) is active under aerobic conditions, maximum succinate flux is insensitive to changes in oxygen uptake flux due to the availability of fumarate respiration (A). When FRD is inactive under aerobic conditions, maximum succinate flux is affected by oxygen uptake rate (B).

5.1 Metabolomics data serve as the launchpad for iterative model refinement. Our computational algorithm, outlined in Section 5.5, allows researchers to identify metabolites needing more precise concentration measurements to make precise predictions of the output variables of interest.

5.2 The flux and concentration space of a toy reaction cycle. Random samples and reduction of solution space with (A) no measurements, (B) high-variance measurements, and (C) precise measurements. Four representative pair-wise scatterplot patterns: disjoint flux and ΔrG′ regions (v < 0 & ΔrG′ > 0, and v > 0 & ΔrG′ < 0) (D), relation between ΔrG′ and metabolite concentrations due to Equation (5.4) (E), correlation between fully coupled fluxes (Burgard et al., 2004) (F), and non-convex regions between fluxes constrained by thermodynamics (G). The layout of scatterplots is inspired by the COBRA Toolbox (Becker et al., 2007).

5.3 Comparison of the computational speed of non-convex sampling on the simplified model of E. coli central metabolism on the CPU and GPU. Parallelized code was more efficient than a single long chain on the CPU. For the largest number of samples, parallel code on the GPU was faster than that on the CPU by more than 20×.

5.4 Determining the metabolite concentrations needing precise measurements. The global sensitivity of the variability of each output prediction was assessed relative to each metabolite concentration. Without experimental data (top two figures), several metabolite concentrations require measurements to reduce output variability. Once high-variance data are provided for metabolites 5, 7, and 10, other metabolite measurements become important for reducing output variability (bottom two figures).

5.5 Comparison of model prediction error when, in addition to a partial set of noisy data, precise metabolite measurements are unavailable (top), chosen randomly (middle), and chosen by design using our algorithm (bottom). The relative error in model predictions is reduced by more than 10× using the designed experiment compared to the purely random experiment.

6.1 Dynamic and steady-state simulations of E. coli central metabolism subject to optimal enzyme manipulations. (A) Optimal enzyme fold-changes identified using the design algorithm. (B–C) Dynamic profiles of SERS flux (B) and concentrations of the 18 metabolites (C), both relative to reference values. The profiles are based on dynamic simulations of the full kinetic model (Chassagnole et al., 2002) where enzyme levels are fixed to the optimal levels identified by the algorithm at the start of the simulation (i.e., Time=0). Initial concentrations are the reference concentrations, and initial fluxes are perturbed from the reference values due to the enzyme perturbations at Time=0.

A.1 Simple demonstration of the portfolio effect.

Chapter 1

Introduction

1.1 Motivation

Chemicals have been largely derived from petroleum for the past 150 years. With the increas-

ing volatility of oil prices, expanding efforts towards environmental sustenance, and increasing

global demand for greener industries, the renewable chemicals market has been steadily increas-

ing (Table 1.1). While the market may indeed be growing, most renewable chemical building

Table 1.1: Global renewable chemicals market sizes by year ($ millions)

Product            2007      2008      2009      2014
Alcohols         40,819    43,125    45,586    58,894
Organic acids        56        60        65        94
Ketones              12        13        14        21
Polymers             73        81        91       152
Others               11        12        14        22
Total            40,971    43,291    45,770    59,183

Source: Chemicals Market Research Report. MarketsandMarkets (2009), p17.

blocks are currently too expensive to directly replace their conventional counterparts. According to the US Department of Energy (Top Value Added Chemicals from Biomass, 2004), a minimum volumetric productivity of 2.5 g/L/hr would be required for certain renewable chemicals to be economically competitive with existing petroleum-derived counterparts. Considering


Chapter 1. Introduction 2

that this number is valid when inexpensive media are used for culturing the organisms (e.g.,

minimal media), existing production pipelines that use expensive media components like yeast

extract would require higher productivity. Furthermore, fluctuating or high feedstock costs make product yield an important consideration. Product titer factors into overall production

costs as it affects the costs for separating and concentrating the final product.

Metabolic engineering, as well as molecular and synthetic biology, are important technologies

for lowering costs and adding value to renewable chemicals. Such technologies, however, are not

critical to the production of all renewable chemicals. For example, the DOE Top Value Added

Chemicals from Biomass (2004) report highlights twelve building block chemicals that have the

greatest potential to penetrate into large or diverse markets. Glycerol, sorbitol, xylitol and

arabinose are produced efficiently using chemical transformations with few technical barriers.

Meanwhile, succinic, fumaric, and malic acids, 3-hydroxypropionic acid (3-HPA), glutamic acid,

and itaconic acid are produced more efficiently using biotransformation routes. However, their

overall production costs must decrease for them to be directly competitive with conventional,

petroleum-based chemicals. This demonstrates that certain chemicals may experience greater

acceleration in cost-competitiveness and value as the techniques of metabolic engineering con-

tinue to develop.

An important component of metabolic engineering, and the main focus of this thesis, is computational modeling and model-based design of microbial metabolic networks.

Computational methods are important for metabolic engineering in part due to the great com-

plexity of biological systems coupled with the need to systematically design and optimize the

organisms that serve as chemical production platforms. Currently, genome-scale models of cell

metabolism are being constructed for various organisms with increasing ease; in fact, they are now constructed almost automatically (Henry et al., 2010a). These constraint-based models

(CBM) contain information on the reaction stoichiometry of an organism’s metabolic network,

based on the annotated genome sequence (Edwards et al., 2002). Typically, the rates, or fluxes, of more than a thousand reactions involving hundreds of metabolites are simulated, sometimes with reasonable accuracy under specific growth conditions (Edwards et al., 2001; Ibarra et al.,

2002). While this great level of detail and complexity enables us to understand cell metabolism


at the systems-level, it also presents difficulties when we wish to (re-)design and (re-)optimize

these systems for engineering objectives.

Accordingly, computational algorithms have been developed to aid in the systematic design

of engineered strains with the use of constraint-based models. Early algorithms used mixed-

integer linear programming (MILP) to design knockout strains (Burgard et al., 2003). In some

cases, experimental observations showed a close agreement with predicted strain behavior (Fong

et al., 2005). Later algorithms also included up- and down-regulation of gene expression to ar-

bitrary levels (Pharkya and Maranas, 2006), as well as the inclusion of heterologous pathways

(Pharkya et al., 2004). An increasing number of experiments have demonstrated, however, that

in addition to knockouts, fine-tuned gene expression levels are necessary to optimize production

(Alper et al., 2005; Lee et al., 2007). The computational problem of designing strains with fine-

tuned gene expression levels, however, is significantly more complex than designing knockout

or even up- and down-regulation strategies. Certainly, the prevalent methods involving integer

optimization could not be efficiently applied to the problem of exploring the continuous spec-

trum of gene expression levels due to the inherent computational complexity. Accordingly, this

thesis will address a number of challenges facing the development of computational methods

for metabolic engineering, as outlined in the next section.

1.2 Challenges and objectives

As stated above, a pressing challenge for metabolic engineering is to overcome the computa-

tional complexity inherent in existing computational algorithms for designing optimal genetic

manipulations to maximize microbial production of biochemicals. This challenge is addressed in

Chapter 3. A closely related challenge is to not only design microbial strains that are optimal

under controlled environments, but to also design strains that are robust against both genetic

and environmental perturbations that are encountered in industrial settings. A computational

algorithm is developed to address this problem in Chapter 4.

Next, the thesis addresses a fundamental problem in model-based design of microbial strains:


the practical utility of a design depends on the precision of the model. A model that makes

precise predictions can be used to generate a focused set of strain designs that are predicted to

behave in a precise fashion. This focused set of designs can then be tested experimentally, and

if there is discrepancy between the predicted and observed cell behaviors, this discrepancy can

be used constructively to improve the accuracy of the original model. One source of impreci-

sion, or variability, in model predictions is the presence of parameters that are themselves not

precisely defined (i.e., the parameters involve uncertainty). One method for improving model

precision when such uncertain parameters are present is to perform sensitivity analysis on the

parameters. Furthermore, the results of the sensitivity analysis can be incorporated in an algo-

rithm for efficiently improving model precision by identifying a subset of the parameters that

needs to be measured more precisely. In Chapter 5, we address this problem using a model

of metabolism that describes both reaction fluxes and metabolite concentrations. The methods

developed are expected to be particularly useful for interpreting metabolomics data sets which

include measurements for a large number of metabolites, but typically involve much variability

in the measurements.

Finally, in Chapter 6, the methods developed in this thesis for the optimal design of microbial strains are extended to kinetic models of metabolism, which incorporate kinetic rate equations.

The use of kinetic models for strain design is challenging, especially for large-scale kinetic mod-

els, since the kinetic rate equations typically involve complex nonlinear terms. In alignment

with the direction of this thesis, we develop an efficient algorithm for strain design using kinetic

models, which has the potential to be scalable to large-scale kinetic models.

1.3 Contributions

As outlined above, this thesis addresses four main challenges, one in each of Chapters 3 to 6. In this section, the contributions stemming from the work presented in each chapter are outlined.


1.3.1 EMILiO: a fast algorithm for genome-scale strain design

To address the need for an efficient, computational algorithm to design strains having fine-tuned

reaction fluxes, the EMILiO (Enhancing Metabolism with Iterative Linear Optimization) algo-

rithm was developed (Chapter 3). We used a different formulation of the bilevel optimization-

based strain design problem, thereby largely avoiding the exponential increase in computational

effort with increasing model size and design scope that is typical of strain design algorithms.

Work relating to Chapter 3 has been published or presented in the journals and conferences

listed below:

• Yang, L., Cluett, W.R. and Mahadevan, R. (2011) EMILiO: a fast algorithm for genome-scale strain design. Metab Eng. 13:272–281. Copyright permission to reuse the full article

in this thesis in both print and electronic form has been granted by Elsevier.

• Yang, L., Cluett, W.R. and Mahadevan, R. (2010) Rapid design of system-wide metabolic

network modifications using iterative linear programming. In: Proceedings of the 9th

International Symposium on Dynamics and Control of Process Systems, pp. 377-382.

• Yang, L., Cluett, W.R. and Mahadevan, R. “EMILiO: a faster algorithm for genome-

scale strain design,” Society for Industrial Microbiology Annual Meeting, New Orleans,

LA, July 24–28, 2011 (Oral presentation).

• Yang, L., Cluett, W.R. and Mahadevan, R. “Efficient Redesign of Metabolism for Bio-

chemicals,” 2011 IBE Annual Conference, Atlanta, Georgia, March 3–5, 2011 (Oral pre-

sentation).

• Yang, L., Cluett, W.R. and Mahadevan, R. “Rapid design of system-wide metabolic

network modifications using iterative linear programming,” 9th International Symposium

on Dynamics and Control of Process Systems, Leuven, Belgium, July 5–7, 2010 (Keynote

oral presentation).

• Yang, L., Cluett, W.R. and Mahadevan, R. “Scalable and highly efficient computational

algorithm for metabolic engineering,” Metabolic Engineering VIII, Jeju Island, South

Korea, June 13–17, 2010 (Poster presentation).


1.3.2 Robust strain design

Strains designed using EMILiO, or any algorithm that identifies inhibition and activation targets, face important challenges with respect to experimental implementation. The primary concern is that the optimal performance of the strain designs is predicted according to a model having no uncertain parameters, and without accounting for random perturbations to gene expression or environmental disturbances. In the field of robust process control of engineered systems, accounting for model uncertainty in the design process has been a primary concern for the past 30 years. Accordingly, we constructed a framework to assess the

robust performance of alternative strain designs, subject to industrially-relevant genetic and

environmental perturbations (Chapter 4).

Work relating to Chapter 4 has been presented as indicated below:

• Yang, L., Cluett, W.R. and Mahadevan, R. “Genome-scale robust strain design,” Bio-

chemical and Molecular Engineering XVII, Seattle, Washington, June 26–30, 2011 (Poster

presentation).

1.3.3 Experiment design using noisy metabolomics data

The problem of predicting strain performance with an uncertain model raises additional con-

cerns. The genome-scale models used for strain design are typically insufficiently constrained.

That is, model predictions improve as additional constraints are incorporated, accounting for

growth conditions, regulatory rules and flux capacity constraints (Yang et al., 2008).

The problem of sensitivity analysis and model refinement extends to metabolite concentra-

tions, in addition to reaction fluxes. Metabolites are the products of metabolic reactions and

correspond to nodes in a metabolic network graph. Thus, sensitivity analysis of metabolite

concentrations can be critical to the overall refinement of model predictions. In Chapter 5, we

develop a computational framework to assess sensitivity of model predictions to uncertainties

in metabolite concentrations, as well as methods to design subsequent experiments to efficiently

improve model precision.

Work relating to Chapter 5 has been published or presented in the journals and conferences


listed below:

• Yang, L., Mahadevan, R. and Cluett, W.R. (2010) Designing experiments from noisy

metabolomics data to refine constraint-based models. In: Proceedings of the American

Control Conference, pp. 5143–5148. (Oral presentation: Best presentation in session

award)

• Yang, L., Mahadevan, R. and Cluett, W.R. “Monte Carlo sampling of metabolite turnover

rates using constraint-based models of metabolism,” 2008 AIChE Annual Meeting, Philadel-

phia, PA, November 16-21, 2008 (Poster presentation).

• Yang, L., Mahadevan, R. and Cluett, W.R. “Investigating metabolite turnover rates using

constraint-based models of metabolism,” Sixteenth International Conference on Intelligent

Systems for Molecular Biology, Toronto, ON, July 19-23, 2008 (Poster presentation).

1.3.4 Additional contributions

In addition to the contributions listed above, this thesis explores two further problems.

In Chapter 6, the methods developed in Chapter 3 were extended to the efficient identifica-

tion of optimal enzyme manipulations using kinetic models of metabolism. A manuscript is in

preparation for journal publication based on this work.

Additionally, in Appendix C, the problem of designing knockout strains for balancing the

engineering objectives of yield, titer, and productivity is investigated. A manuscript for journal

publication is being prepared, entitled “DySScO: an efficient strain design algorithm for bal-

anced yield, titer, and productivity.” The manuscript is co-authored with Kai Zhuang, a PhD

candidate in the Department of Chemical Engineering & Applied Chemistry at the University

of Toronto.

Chapter 2

Literature Review

2.1 Constraint-based modeling

In the broadest sense, constraint-based modeling (CBM) is a mathematical framework for sim-

ulating the metabolic state of one or multiple organisms. In general, both dynamic and steady-

state simulations are possible. CBM is typically applied to the prediction of metabolic states

using genome-scale reconstructions of cell metabolism. These reconstructions include all of the

known metabolic reactions of an organism, the enzymes that catalyze them, and the corre-

sponding genes. Thus, these models describe the fluxes through over a thousand biochemical

reactions that convert hundreds of metabolites. Currently, there are reconstructions for 35

organisms (Orth et al., 2010). Additionally, novel methods have been developed to speed up

the process of metabolic network reconstruction (Henry et al., 2010a). In this section, a brief

review of CBM is presented, with particular emphasis on the potential for applying CBM for

the development of novel algorithms to simulate and design microorganisms for engineering

goals.

2.1.1 Fundamentals

Transient changes in intracellular metabolite concentrations due to their consumption and production by metabolic reactions and dilution are described as follows:

\[
\frac{1}{\rho}\frac{dc(t)}{dt} = S\,v(t) - \frac{1}{\rho}\,\mu(t)\,c(t), \tag{2.1}
\]


Chapter 2. Literature Review 9

where c is the vector of intracellular metabolite concentrations (mM), v is the vector of reaction

rates, or fluxes (mmol/gDW/hr), S is the matrix of reaction network stoichiometry, µ is the

specific growth rate (hr⁻¹), and ρ is the cell density (gDW/L). Note that all of the variables

are functions of time, except for S (we also assume constant cell density, ρ). Here, gDW is a

unit denoting the dry weight of biomass in grams. As discussed in the previous section, S is

constant due to the specificity of enzymes regarding the stoichiometry of associated substrates,

products, and cofactors.

So far, the most popular use of Eq. (2.1) has been to obtain steady-state solutions (i.e.,

dc/dt = 0). Typically, the effects of dilution (µ · c) are considered to be negligible, although

recent studies have shown that dilution may have a significant effect in some cases (Benyamini

et al., 2010). Ignoring the effects of dilution, the steady state distribution of metabolic fluxes

is described as follows:

Sv = 0. (2.2)

Typically, the system above will be underdetermined, and more than one flux distribution is

possible. The most popular method for determining a physiologically relevant flux distribution

is flux balance analysis (FBA), which is formulated as the following linear program (LP):

\[
\begin{aligned}
\max \quad & f^T v \\
\text{s.t.} \quad & S v = 0 \\
& v^L \le v \le v^U
\end{aligned}
\]

where $f \in \mathbb{R}^n$ is the vector of objective coefficients, and $v^L$ and $v^U$ are the lower and upper flux

bounds, respectively. The objective vector is chosen to simulate cell behavior. A commonly

used objective is the maximization of growth yield, subject to finite uptake rates of carbon,

energy, and nutrients. For maximization of growth yield, f consists of 1 for the reaction index

corresponding to a biomass synthesis reaction and zero otherwise. Studies have shown that

this objective accurately describes the growth of prokaryotes like Escherichia coli under certain

conditions, such as carbon-limited growth in minimal media, especially after adaptive evolution

(Ibarra et al., 2002).
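As an illustration of the linear program above, the following is a minimal sketch that solves FBA for a small hypothetical network using `scipy.optimize.linprog`. The four-reaction network, its flux bounds, and the function name `fba` are assumptions made for this example; they are not taken from any genome-scale reconstruction.

```python
import numpy as np
from scipy.optimize import linprog


def fba(S, f, v_lower, v_upper):
    """Solve max f^T v subject to S v = 0 and v_lower <= v <= v_upper."""
    res = linprog(
        c=-np.asarray(f, dtype=float),      # linprog minimizes, so negate f
        A_eq=S,
        b_eq=np.zeros(S.shape[0]),          # steady-state mass balance
        bounds=list(zip(v_lower, v_upper)),
        method="highs",
    )
    if res.status != 0:
        raise ValueError(res.message)
    return res.x, -res.fun


# Hypothetical toy network (columns = reactions):
# 0: -> A (uptake), 1: A -> B, 2: B -> (biomass), 3: A -> (byproduct)
S = np.array([[1.0, -1.0,  0.0, -1.0],    # metabolite A
              [0.0,  1.0, -1.0,  0.0]])   # metabolite B
f = [0.0, 0.0, 1.0, 0.0]                  # maximize the biomass reaction
v_lower = [0.0, 0.0, 0.0, 0.0]
v_upper = [10.0, 1000.0, 1000.0, 1000.0]  # finite uptake rate

v_opt, growth = fba(S, f, v_lower, v_upper)
print("fluxes:", v_opt, "growth:", growth)
```

In this toy network the finite uptake bound is the only active limit, so the optimal biomass flux equals the maximum uptake rate and no flux is diverted to the byproduct.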


2.1.2 Extensions and applications of flux balance analysis

The addition of physiologically meaningful constraints to the FBA formulation is one way of

improving the predictive capabilities of constraint-based modeling. Here, a number of recent

extensions to FBA are reviewed.

Incorporating biophysical constraints

FBA with molecular crowding (FBAwMC) is a method for improving the accuracy of metabolic

flux predictions by accounting for the crowding of enzymes in the cytoplasm (Beg et al., 2007).

FBAwMC has been shown to predict the growth rates of wild-type and mutant strains of E.

coli with higher accuracy than FBA. Furthermore, FBAwMC accurately predicts the sequence

and mode of substrate uptake in dynamic simulations of growth on a complex medium.

The molecular crowding constraints are formulated as follows:

\[
\sum_{j \in \mathrm{CYTO}} \alpha_j v_j \le 1, \tag{2.3}
\]

where $v_j$ and $\alpha_j$ are the flux and crowding coefficient of reaction $j$, respectively, and CYTO is the set of enzyme-catalyzed reactions occurring in the cytoplasm. Each $\alpha_j$ is a function of the cytoplasmic density, the molar volume of enzyme $j$, and the concentration of enzyme $j$. In practice, a single representative crowding coefficient, $\langle \alpha \rangle$, is used for all reactions. The value of $\langle \alpha \rangle$ is determined by minimizing the error between predicted and measured growth rates.
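The effect of the crowding constraint can be sketched by adding one inequality row to a flux balance LP. The example below is only illustrative: the toy network, the choice of which reactions count as cytoplasmic, and the value 0.2 for the uniform crowding coefficient are all assumptions, and `scipy.optimize.linprog` stands in for whatever LP solver is actually used.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (columns = reactions):
# 0: -> A (uptake), 1: A -> B, 2: B -> (biomass), 3: A -> (byproduct)
S = np.array([[1.0, -1.0,  0.0, -1.0],
              [0.0,  1.0, -1.0,  0.0]])
bounds = [(0.0, 10.0), (0.0, 1000.0), (0.0, 1000.0), (0.0, 1000.0)]
f = np.array([0.0, 0.0, 1.0, 0.0])       # maximize biomass flux

alpha = 0.2                               # assumed uniform crowding coefficient
cyto = [1, 3]                             # assumed cytoplasmic reactions
crowding_row = np.zeros((1, S.shape[1]))
crowding_row[0, cyto] = alpha             # sum over CYTO of alpha * v_j <= 1

# FBA without the crowding constraint.
plain = linprog(-f, A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")

# FBA with the crowding constraint added as a single inequality row.
crowded = linprog(-f, A_ub=crowding_row, b_ub=[1.0],
                  A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")

growth_plain = -plain.fun
growth_crowded = -crowded.fun
print("growth without crowding:", growth_plain)
print("growth with crowding:   ", growth_crowded)
```

With these assumed numbers, the crowding row caps the total cytoplasmic flux at 1/0.2 = 5, so the predicted growth is limited by crowding rather than by substrate uptake.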

FBA with membrane occupancy (FBAwMO) is a method for improving the accuracy of metabolic flux predictions by accounting for the crowding of membrane-bound enzymes on the cell membrane

(Zhuang et al., 2011). FBAwMO accurately predicts respiro-fermentation, differential utiliza-

tion of cytochromes, and glucose uptake rates in E. coli.

Not all membrane-bound enzymes are expected to contribute to membrane crowding. Thus,

the membrane crowding coefficient of each crowded membrane-bound enzyme is determined

separately. The coefficient values are determined from experiments that are designed such that

the crowding of the membrane-bound enzyme of interest is expected to actively limit the ob-

served phenotype (i.e., growth rate).

The molecular crowding and membrane occupancy constraints represent distinct biophysical


constraints. The former represents intracellular crowding of cytosolic enzymes, while the latter

represents crowding of membrane-bound enzymes. Thus, the two constraints are complemen-

tary and may be used together.

Incorporating regulatory constraints

Probabilistic regulation of metabolism (PROM) is a method for improving the accuracy of

metabolic flux predictions by including the effects of the transcriptional regulatory network as

additional constraints (Chandrasekaran and Price, 2010). Unlike previous approaches in which

Boolean rules were used to model transcriptional regulation (Covert et al., 2001, 2004), PROM

implements regulatory constraints as quantitative bounds on fluxes. These bounds are deter-

mined using a statistical model of interactions between and among transcription factors and

enzyme-encoding genes, and microarray datasets. Another feature of PROM is that the regula-

tory constraints are not imposed as hard constraints on the fluxes. Rather, fluxes are allowed to

violate the regulatory constraints but with a penalty. A flux distribution is predicted by mini-

mizing the largest violation of regulatory constraints by solving a linear program. Thus, given

sufficient microarray data, PROM is a promising approach for incorporating transcriptional

regulatory constraints into algorithms that use constraint-based models.

Incorporating thermodynamic constraints

Thermodynamics-based metabolic flux analysis (TMFA) is a method for improving the accuracy

of metabolic flux predictions by including thermodynamic constraints on all reactions having a

known or estimated standard Gibbs free energy change ($\Delta G^0$) (Henry et al., 2007). All fluxes predicted by TMFA operate in thermodynamically feasible directions. Assuming, without loss of generality, that $\Delta G^0$ is known for all $n$ reactions, the feasible reaction directions are determined by the reaction Gibbs free energy change, $\Delta G$, as follows:

\[
\Delta G = \Delta G^0 + RT\,S^T \ln(x),
\]

where $S \in \mathbb{R}^{m \times n}$ is the stoichiometric matrix, $(\cdot)^T$ denotes the transpose operator, $\Delta G \in \mathbb{R}^n$ and $\Delta G^0 \in \mathbb{R}^n$ are the vectors of reaction and standard Gibbs free energy, respectively, $\ln(x)$


is the vector of the natural log of metabolite concentrations, R is the universal gas constant,

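and $T$ is the intracellular temperature. For a reaction $j$, $\Delta G_j$ determines reaction direction as follows:

\[
\begin{aligned}
\text{if } \Delta G_j < 0 &\text{ then } v_j \ge 0, \\
\text{if } \Delta G_j > 0 &\text{ then } v_j \le 0.
\end{aligned}
\]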

These logical constraints are implemented as integer constraints in the constraint-based model.

Accordingly, TMFA is formulated as a mixed-integer linear program (MILP).
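The direction rule above can be evaluated directly once concentrations are fixed. The following minimal sketch computes $\Delta G$ for a hypothetical three-metabolite, two-reaction network and reports the thermodynamically feasible direction of each reaction. The stoichiometry, the $\Delta G^0$ values, the concentrations (taken in molar units relative to a 1 M standard state), and the function names are assumptions for illustration only. In TMFA itself the concentrations are decision variables and the rule is enforced with integer constraints rather than evaluated after the fact.

```python
import math

R = 8.314e-3  # universal gas constant in kJ/(mol K)


def reaction_gibbs(S, dG0, conc, T=310.15):
    """Return dG_j = dG0_j + R*T * sum_i S[i][j] * ln(conc[i]) for each reaction j."""
    n_met, n_rxn = len(S), len(S[0])
    ln_x = [math.log(x) for x in conc]
    return [
        dG0[j] + R * T * sum(S[i][j] * ln_x[i] for i in range(n_met))
        for j in range(n_rxn)
    ]


def feasible_direction(dG):
    """Map the sign of dG to the allowed flux direction."""
    if dG < 0:
        return "forward"      # v_j >= 0
    if dG > 0:
        return "reverse"      # v_j <= 0
    return "equilibrium"


# Hypothetical network: reaction 0 is A -> B, reaction 1 is B -> C.
S = [[-1,  0],   # A
     [ 1, -1],   # B
     [ 0,  1]]   # C
dG0 = [2.0, -1.0]            # assumed standard Gibbs energies (kJ/mol)
conc = [1e-2, 1e-4, 1e-2]    # assumed concentrations of A, B, C (M)

dG = reaction_gibbs(S, dG0, conc)
directions = [feasible_direction(g) for g in dG]
print(list(zip(dG, directions)))
```

Note that reaction 0 has a positive $\Delta G^0$ yet runs forward because the product concentration is low, while reaction 1 has a negative $\Delta G^0$ yet is restricted to the reverse direction; this is why concentration uncertainty translates into uncertainty in the feasible directions.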

TMFA improves prediction accuracy since all fluxes with known $\Delta G^0$ operate in thermodynamically feasible directions. When measurements of $\Delta G^0$ are not available, they are commonly

estimated using the group contribution method (Henry et al., 2007). Therefore, thermodynamic

constraints can be applied to a majority of the reactions in a metabolic network. One challenge

with TMFA is that intracellular concentration measurements may be relatively scarce, which

leads to large degrees of uncertainty on each reaction’s ∆G estimate. Furthermore, TMFA

does not describe the quantitative relationship between fluxes and concentrations. This lim-

itation is to be expected: TMFA is formulated to identify thermodynamically feasible fluxes,

not to describe enzyme kinetics. Thus, to model fluxes and concentrations quantitatively, the reactions should be described using kinetic rate equations. While the

development of kinetic models of metabolism has a long history, the incorporation of kinetic

rate equations into constraint-based models for simulation and design is a recent development.

Furthermore, the construction of genome-scale kinetic models still faces significant challenges

(Costa et al., 2011). Kinetic models of metabolism and opportunities for advancement, espe-

cially for strain design, are reviewed in greater detail in Chapter 6.

2.1.3 Opportunities for advancement

One of the attractive features of CBM is its flexibility. Model predictions are refined by the

incorporation of additional constraints, which represent biophysical assumptions, biochemical

mechanisms, and physiological phenomena. Accordingly, a significant number of extensions to

CBM have been developed. Nonetheless, a number of major challenges still remain.


For example, random sampling is a method for characterizing the solution spaces determined

by the constraints reviewed above. Although efficient methods have been developed for models

that include stoichiometric constraints and other linear constraints, they have not been devel-

oped for models that include thermodynamic constraints. Accordingly, this thesis develops a

method for randomly sampling both fluxes and concentrations, subject to both stoichiometric

and thermodynamic constraints (Chapter 5).

Another challenge is the use of high-throughput data for both model refinement and metabolic

engineering. One concern with high-throughput datasets is that they are often quite noisy. The

large uncertainty associated with datasets must be dealt with by computational models. In this

thesis, a new computational method is developed in Chapter 5, which uses noisy metabolomics

data to identify a subset of metabolites whose precise measurements would improve model

precision. This method uses thermodynamically constrained models of cell metabolism and

random sampling of both fluxes and concentrations.

2.2 Computer-aided strain design

2.2.1 Bilevel optimization-based strain design

OptKnock is the first bilevel optimization algorithm for in silico strain design (Burgard et al.,

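2003). It is capable of using genome-scale constraint-based models of metabolism. The formulation of OptKnock is as follows:

\[
\begin{aligned}
\max_{v,\,y} \quad & c_p^T v \\
\text{s.t.} \quad & \max_{v} \quad c^T v \\
& \quad \text{s.t.} \quad S v = b \\
& \quad\quad\quad\; v_j^L \le v_j \le v_j^U, \quad j \in \mathrm{CANTKO} \\
& \quad\quad\quad\; v_j^L (1 - y_i) \le v_j \le v_j^U (1 - y_i), \quad i = 1, \dots, n_{KO}, \; j \in \mathrm{CANKO} \\
& \sum_{i=1}^{n_{KO}} y_i \le K \\
& v_{bio} \ge v_{bio}^{min} \\
& y \in \{0, 1\},
\end{aligned}
\tag{2.4}
\]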

where $v_{bio}^{min}$ is the minimum required growth rate, $c_p^T$ is the objective vector that maximizes the product flux, $c^T$ is the objective vector that maximizes growth (biomass) yield (i.e., $c^T v = v_{bio}$), $y_i$ are the integer variables used to implement knockouts, CANTKO and CANKO are the sets of reactions that cannot and can be knocked out, respectively, $n_{KO}$ is the number of reactions allowed to be knocked out (i.e., the size of the CANKO set), and $K$ is the maximum number of

knockouts to be identified. Since the complexity of the MILP depends strongly on the number

of integer variables, it is crucial to keep the set, CANKO as small as possible. In practice,

CANKO is reduced, for example by excluding reactions that are not associated with known

genes, lethal single deletions and reactions in certain subsystems that are expected to adversely

impact cell physiology (e.g., cell envelope biosynthesis) (Feist et al., 2010).
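To make the bilevel structure concrete, the sketch below solves a toy instance by exhaustive enumeration rather than by an MILP: for every knockout set of size at most K, the inner growth-maximization LP is solved, and the best product flux attainable at that growth optimum is recorded. The five-reaction network, the function names, the mass balance with b = 0, and the use of `scipy.optimize.linprog` are assumptions for illustration. Enumeration is practical only for very small CANKO sets and is not how OptKnock itself is solved.

```python
from itertools import combinations

import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (columns = reactions):
# 0: -> A (uptake), 1: A -> B, 2: 2A -> B + P, 3: B -> (biomass), 4: P -> (product)
S = np.array([[1.0, -1.0, -2.0,  0.0,  0.0],   # A
              [0.0,  1.0,  1.0, -1.0,  0.0],   # B
              [0.0,  0.0,  1.0,  0.0, -1.0]])  # P
bounds = [(0.0, 10.0)] + [(0.0, 1000.0)] * 4
BIO, PROD = 3, 4


def product_at_max_growth(S, bounds, bio, prod, tol=1e-6):
    """Inner level: maximize growth. Then maximize product with growth held at its optimum."""
    n = S.shape[1]
    zeros = np.zeros(S.shape[0])
    c_bio = np.zeros(n)
    c_bio[bio] = -1.0                          # linprog minimizes, so negate
    inner = linprog(c_bio, A_eq=S, b_eq=zeros, bounds=bounds, method="highs")
    if inner.status != 0:
        return None
    growth = -inner.fun
    c_prod = np.zeros(n)
    c_prod[prod] = -1.0
    # The row -v_bio <= -(growth - tol) keeps growth at its optimum within tol.
    outer = linprog(c_prod, A_ub=c_bio.reshape(1, -1), b_ub=[-(growth - tol)],
                    A_eq=S, b_eq=zeros, bounds=bounds, method="highs")
    if outer.status != 0:
        return None
    return growth, -outer.fun


def enumerate_knockouts(S, bounds, bio, prod, can_ko, K, v_bio_min):
    """Outer level by brute force: try every knockout set with at most K reactions."""
    best = None
    for k in range(K + 1):
        for kos in combinations(can_ko, k):
            # A knockout (y_i = 1) forces both flux bounds of the reaction to zero.
            ko_bounds = [(0.0, 0.0) if j in kos else b for j, b in enumerate(bounds)]
            result = product_at_max_growth(S, ko_bounds, bio, prod)
            if result is None or result[0] < v_bio_min:
                continue                        # violates the minimum growth requirement
            if best is None or result[1] > best[2] + 1e-9:
                best = (kos, result[0], result[1])
    return best


best = enumerate_knockouts(S, bounds, BIO, PROD, can_ko=[1, 2], K=1, v_bio_min=1.0)
print("knockouts, growth, product:", best)
```

In this toy network the wild type grows fastest through reaction 1 and makes no product. Removing reaction 1 forces flux through the product-forming route, which lowers growth but couples product formation to it. The number of LPs grows combinatorially with the size of CANKO and K, which is the motivation for keeping CANKO small.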

Using the strong duality theorem of linear programming, this bilevel optimization problem is

reformulated into a single-level MILP (Burgard et al., 2003) as follows:

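\begin{align}
\max_{v,\,y,\,w_S,\,w_{vl},\,w_{vu},\,w_{KO}} \quad & c_p^T v \tag{2.5} \\
& w_{vu} v^U - w_{vl} v^L = c^T v \tag{2.6} \\
& w_S S + w_{vu} - w_{vl} + w_{KO} = c \tag{2.7} \\
& -M y_i \le w_{KO,i} \le M y_i, \quad i = 1, \dots, n_{KO} \tag{2.8} \\
& 0 \le w_{vu,j} \le M (1 - y_i), \quad i = 1, \dots, n_{KO}, \; j \in \mathrm{CANKO} \tag{2.9} \\
& 0 \le w_{vl,j} \le M (1 - y_i), \quad i = 1, \dots, n_{KO}, \; j \in \mathrm{CANKO} \tag{2.10} \\
& v_j^L \le v_j \le v_j^U, \quad j \in \mathrm{CANTKO} \tag{2.11} \\
& v_j^L (1 - y_i) \le v_j \le v_j^U (1 - y_i), \quad i = 1, \dots, n_{KO}, \; j \in \mathrm{CANKO} \tag{2.12} \\
& v_{bio} \ge v_{bio}^{min} \tag{2.13} \\
& w_{vl}, w_{vu} \ge 0 \tag{2.14} \\
& w_S, w_{KO} \in \mathbb{R} \tag{2.15} \\
& \sum_{i=1}^{n_{KO}} y_i \le K \tag{2.16} \\
& y \in \{0, 1\}, \tag{2.17}
\end{align}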

where $M$ is a large positive number, $w_{vl} \in \mathbb{R}^n$ and $w_{vu} \in \mathbb{R}^n$ are dual variables for lower and upper flux bound constraints, respectively, $w_S \in \mathbb{R}^m$ is the vector of dual variables for mass balance constraints, and $w_{KO} \in \mathbb{R}^{n_{KO}}$ is the vector of dual variables for knockout constraints.

The single-level formulation above is based on that of GDLS (Lun et al., 2009). Excluding

gene-protein relations and the local search constraints of GDLS, the formulation is equivalent to that of the original OptKnock (Burgard et al., 2003). A subtle point worth noting is that when the strong duality theorem is used to reformulate a bilevel problem as a single-level problem,

one may encounter products of binary variables (corresponding to knockout constraints) and

continuous variables (dual variables corresponding to flux bounds). One way to resolve this

apparent nonlinearity is to reformulate the product of binary and continuous variables (Glover,

1975). This reformulation would yield, for each product of binary (y) and continuous variables

(say, $w_{vl}$ for duals corresponding to lower bounds), a new continuous variable, $z_{vl} = w_{vl} y \ge 0$ and two constraints, $w_{vl}^L y \le z_{vl} \le w_{vl}^U y$. A simpler and more intuitive approach is to simply separate the knockout constraints and flux bound constraints and to assign dual variables to each. Thus, $-My \le w_{KO} \le My$ becomes equivalent to $-My \le z_{vu} - z_{vl} \le My$, where $z_{vu} \ge 0$

is the new variable corresponding to upper bound constraints. In both cases, the constraints

(2.9) and (2.10) ensure that the dual variables corresponding to wild-type flux bounds are only

non-zero if the corresponding reaction is not knocked-out.
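The product linearization can be checked numerically. The sketch below assumes the standard four-inequality form of the linearization for a binary $y$ and a continuous $w \in [w^L, w^U]$, in which the two constraints quoted above are complemented by $w - w^U(1 - y) \le z \le w - w^L(1 - y)$. The function name and the test values are hypothetical.

```python
def glover_interval(w, y, w_lo, w_hi):
    """Return the interval of z allowed by the linearized product constraints.

    Constraints (y binary, w_lo <= w <= w_hi):
        w_lo * y           <= z <= w_hi * y
        w - w_hi * (1 - y) <= z <= w - w_lo * (1 - y)
    """
    lower = max(w_lo * y, w - w_hi * (1 - y))
    upper = min(w_hi * y, w - w_lo * (1 - y))
    return lower, upper


# For every tested (w, y) pair the interval collapses to the single point w * y,
# so the linear constraints reproduce the bilinear product exactly.
checks = []
for w in (0.0, 2.5, 10.0):
    for y in (0, 1):
        lower, upper = glover_interval(w, y, w_lo=0.0, w_hi=10.0)
        checks.append((w, y, lower, upper, lower == upper == w * y))

for row in checks:
    print(row)
```

When $y = 0$ the first pair of inequalities pins $z$ to zero, and when $y = 1$ the second pair pins $z$ to $w$, which is the behavior the big-M constraints (2.8) to (2.10) achieve by assigning separate dual variables instead.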

Prior to OptKnock, mathematical models of cell metabolism served mostly as a simulation tool.

OptKnock allowed metabolic engineers to formalize the problem of identifying optimal genetic


manipulations into the rigorous language of mathematical optimization, which offered a mature

set of tools for solving complex and large-scale problems.

OptKnock does have several limitations. First, how accurately the predicted design reflects

experimental implementation is an important question. This problem arises due to limitations

of the model, not of the algorithm. Nonetheless, experiments have shown that strains designed

by OptKnock behaved as predicted, after adaptive evolution for increased growth yield (Fong

et al., 2005).

The more important limitation of OptKnock is computational tractability. That is, the Opt-

Knock problem grows exponentially in complexity with the number of genetic manipulations or

the size of the model. Therefore, most practical implementations of OptKnock place a limit on

the number of knockouts or limit the amount of time spent by the solver. The latter approach

implies that the obtained solution is not guaranteed to be globally optimal or even feasible.

To partially overcome the computational complexity of OptKnock, a straightforward but effec-

tive extension was developed, called Genetic Design through Local Search (GDLS) (Lun et al.,

2009).

The formulation of GDLS is similar to OptKnock (2.5)–(2.17), but it includes additional con-

straints and an iterative solution scheme. At iteration, t, the local search constraint is as

follows:

\[
\sum_{i \in \mathrm{NOTKO}(t-1)} y_i + \sum_{i \in \mathrm{KO}(t-1)} (1 - y_i) \le k \tag{2.18}
\]

where $k$ is the neighborhood size, and NOTKO(t−1) and KO(t−1) are the sets of reactions that are not knocked out and knocked out, respectively, at iteration t−1.
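The left-hand side of the local search constraint counts how many knockout decisions differ from the previous iterate, i.e., it is the Hamming distance between the current and previous knockout vectors. The following minimal sketch, with hypothetical function names and a made-up four-reaction knockout vector, evaluates that left-hand side and enumerates the neighborhood that the constraint admits.

```python
from itertools import combinations


def local_search_lhs(y, y_prev):
    """Left-hand side of the local search constraint for candidate y given y_prev."""
    newly_knocked_out = sum(yi for yi, yp in zip(y, y_prev) if yp == 0)
    restored = sum(1 - yi for yi, yp in zip(y, y_prev) if yp == 1)
    return newly_knocked_out + restored


def neighborhood(y_prev, k):
    """Yield every binary vector that differs from y_prev in at most k positions."""
    n = len(y_prev)
    for r in range(k + 1):
        for flips in combinations(range(n), r):
            y = list(y_prev)
            for i in flips:
                y[i] = 1 - y[i]        # toggle the knockout decision
            yield tuple(y)


y_prev = (1, 0, 0, 1)                  # assumed knockout vector from iteration t-1
neighbors = list(neighborhood(y_prev, k=1))
print(len(neighbors), "candidates:", neighbors)
```

With a neighborhood size of k, each iteration searches on the order of n^k candidates instead of all 2^n knockout combinations, which is the source of the reduced effort per MILP solve; the price is that the final design is only locally optimal with respect to this neighborhood.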

2.2.2 Extensions of the bilevel optimization framework

Identification of activation and inhibition targets

OptReg is a bilevel optimization-based algorithm that identifies knockout, inhibition and acti-

vation reaction targets to maximize production of a target metabolite (Pharkya and Maranas,

2006). Similar to OptKnock, OptReg is formulated as an MILP. In fact, OptKnock solutions


can be identified using OptReg by limiting the number of activation and inhibition targets to

zero. One shortcoming of OptReg is the need to determine the levels of inhibition and activation

prior to the optimization. These arbitrary levels of regulation are defined relative to a reference

flux distribution. Nonetheless, OptReg represents an important advancement in computational

strain design, in which gene deletion, inhibition and activation strategies are jointly evaluated

using the bilevel optimization approach and the MILP formulation.

More recently, OptForce was developed, in order to identify modified reaction fluxes for max-

imizing production of a target metabolite (Ranganathan et al., 2010). OptForce identifies

reaction modification targets relative to a wild-type flux solution space. That is, the feasible

ranges of all fluxes are identified for the wild-type, subject to stoichiometry, enzyme capacity,

thermodynamics, and intracellular flux measurements. Subsequently, feasible flux ranges are

identified subject to maximum product flux and all of the aforementioned constraints, excluding

the wild-type flux measurements, to determine the modified flux ranges in the designed strain.

At this stage, additional design constraints, such as enforcing a minimum biomass formation

rate, may be imposed. By comparing the flux ranges of the wild-type and designed strain, a

subset of reactions is identified, which must be modified for the strain to achieve the desired

product yield. However, not all of these reactions must be modified individually, as they may

be related through stoichiometric constraints or flux bounds. Thus, an MILP is formulated to

identify the minimal combination of the modified fluxes that results in maximum production of

the target metabolite. Unlike previous methods, OptForce uses intracellular flux measurements

to predict the wild-type flux distribution, rather than the assumption of maximum growth yield.

Unlike OptReg, OptForce identifies quantitative flux modification values, instead of arbitrary

levels of inhibition and activation. One limitation with OptForce is that, as with previous MILP

approaches, the computational effort increases exponentially with the scope of the design (i.e.,

the number of allowed modifications). Nonetheless, OptForce represents an important advance-

ment in computational strain design, as quantitative flux modifications could be identified to

achieve product yields at the theoretical maximum.
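
The comparison of flux ranges described above can be sketched as follows. The example assumes that the wild-type and designed-strain flux ranges have already been computed (for instance, by flux variability analysis), and it classifies a reaction only when the two ranges do not overlap. The reaction labels and numerical values are hypothetical.

```python
def classify_must_sets(wild_type_ranges, design_ranges):
    """Find reactions whose flux must change to reach the design objective.

    Both arguments map a reaction name to a (lower, upper) flux range.
    Returns (must_increase, must_decrease) as lists of reaction names.
    """
    must_increase, must_decrease = [], []
    for rxn, (wt_low, wt_high) in wild_type_ranges.items():
        design_low, design_high = design_ranges[rxn]
        if design_low > wt_high:
            # Entire design range lies above the wild-type range.
            must_increase.append(rxn)
        elif design_high < wt_low:
            # Entire design range lies below the wild-type range.
            must_decrease.append(rxn)
        # Overlapping ranges: no modification is strictly required.
    return must_increase, must_decrease


wild_type = {"R1": (2.0, 4.0), "R2": (5.0, 7.0), "R3": (1.0, 3.0)}
design = {"R1": (5.0, 6.0), "R2": (6.0, 8.0), "R3": (0.0, 0.5)}
must_up, must_down = classify_must_sets(wild_type, design)
```

This single-reaction comparison is only the first step; as noted above, a subsequent MILP is needed to select a minimal combination of these candidates.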


Design of transcriptional regulatory and metabolic networks

OptORF is a bilevel optimization algorithm for identifying knockout and expression targets

of metabolic genes, as well as deletion targets of transcription factors (Kim and Reed, 2010).

Gene deletion strategies identified based on only a metabolic model can be nullified through

transcriptional regulation. OptORF is able to predict the integrated effects of metabolic and

regulatory networks, and is able to identify gene deletion and overexpression targets that are

consistent with both networks. OptORF models transcriptional regulation using Boolean con-

straints. Although an approximation of transcriptional regulation, the Boolean formulation has

been shown to improve model accuracy under both batch and continuous culturing conditions

(Covert et al., 2004). OptORF represents an important advancement in the field of in silico

strain design accounting for integrated metabolic and regulatory networks.
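
As an illustration of why regulation can nullify a design based on metabolism alone, the sketch below evaluates a toy Boolean regulatory rule together with toy gene-reaction associations. The genes, rules, and reactions are hypothetical, and the single-layer rule evaluation is a simplification assumed for this example rather than the OptORF formulation.

```python
def expressed_genes(all_genes, deleted, rules):
    """Return the genes that are present and whose Boolean rule holds.

    rules maps a gene to a function of the set of genes still present;
    genes without a rule are treated as constitutively expressed.
    """
    present = {g for g in all_genes if g not in deleted}
    always_on = lambda present_genes: True
    return {g for g in present if rules.get(g, always_on)(present)}


GENES = {"tfA", "geneB", "geneC"}
# Hypothetical rule: transcription factor tfA represses geneB.
RULES = {"geneB": lambda present: "tfA" not in present}
# Hypothetical gene-reaction associations.
GPR = {
    "R1": lambda expressed: "geneB" in expressed,
    "R2": lambda expressed: "geneB" in expressed or "geneC" in expressed,
}


def active_reactions(deleted):
    expressed = expressed_genes(GENES, deleted, RULES)
    return sorted(r for r, rule in GPR.items() if rule(expressed))


wild_type = active_reactions(set())
tf_deleted = active_reactions({"tfA"})
isozyme_deleted = active_reactions({"geneC"})
```

In this toy network, deleting geneC removes R2 only because geneB is repressed; a purely metabolic model would predict that the isozyme encoded by geneB keeps R2 active.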

2.2.3 Alternative approaches

A number of alternative approaches for computational strain design have been developed. For

example, in OptGene, evolutionary programming (EP) was used as an alternative to an MILP formulation to

identify gene knockouts to maximize product formation (Patil et al., 2005). The EP formulation

allows the optimization of nonlinear objective functions and, although it does not guarantee

global optimality, it may be more computationally efficient than the MILP formulation. In

conjunction with OptKnock, OptGene has been shown to identify strains having a large number

of knockouts (e.g., ten knockouts) using genome-scale models (Feist et al., 2010).
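
The evolutionary search described above can be sketched, in a deliberately simplified form, as a loop that mutates a knockout set and keeps the mutant when the objective does not worsen. This is a single-individual variant written for illustration, not the published algorithm; the fitness function stands in for a model simulation and is entirely hypothetical.

```python
import random


def evolve_knockouts(fitness, n_reactions, max_knockouts,
                     generations=200, seed=0):
    """Simplified evolutionary search over knockout sets.

    fitness: function mapping a frozenset of reaction indices to a score;
             it may be nonlinear, since it is only ever evaluated.
    """
    rng = random.Random(seed)
    best = frozenset()
    best_score = fitness(best)
    for _ in range(generations):
        child = set(best)
        # Mutation: toggle the knockout status of one random reaction.
        child.symmetric_difference_update({rng.randrange(n_reactions)})
        if len(child) > max_knockouts:
            continue
        score = fitness(frozenset(child))
        if score >= best_score:
            best, best_score = frozenset(child), score
    return best, best_score


def toy_fitness(knockouts):
    # Hypothetical nonlinear objective: two beneficial knockouts and a
    # quadratic penalty on the number of knockouts.
    gain = 3.0 * (1 in knockouts) + 2.0 * (4 in knockouts)
    return gain - 0.5 * len(knockouts) ** 2


best_design, best_value = evolve_knockouts(toy_fitness, n_reactions=6,
                                           max_knockouts=3)
```

The search never evaluates all combinations and offers no optimality guarantee, which mirrors the tradeoff stated above.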

Another approach to computational strain design involves identifying deletion, inhibition, and

activation targets based on the correlations of elementary modes with the target flux (Melzer

et al., 2009). The main bottleneck in this approach lies in the enumeration of elementary modes,

which is still a computationally challenging problem for genome-scale networks. Consequently,

this algorithm has been applied to smaller versions of the original genome-scale models (Melzer

et al., 2009).


2.2.4 Opportunities for advancement

Many computational strain design algorithms have been developed since OptKnock, addressing

different limitations and opportunities. Nonetheless, several significant challenges remain in the

field. First, many genome-scale in silico strain design algorithms suffer from an exponential in-

crease in computational effort with increasing design scope and model size. In the case of MILP

formulations, computational effort is determined by the number of allowable combinations of

integer variables. Accordingly, these algorithms are typically limited to the identification of de-

signs with limited scope (i.e., limited number of genetic manipulations). In conjunction, various

procedures are employed to minimize the number of integer variables, based on physiological

knowledge (Feist et al., 2010), or algorithmic methods, such as in OptForce. Another approach

is to identify locally optimal solutions using local search constraints in an MILP formulation,

as in GDLS (Lun et al., 2009). While solutions identified by GDLS are not guaranteed to

be globally optimal, they are still superior to globally optimal designs of smaller scope. One

opportunity for advancement is to apply the local search constraints to the identification of

not only knockout, but also inhibition and activation strategies. Accordingly, the local search

implementation of OptReg is developed in this thesis (Chapter 3). Furthermore, a novel strain

design algorithm is developed for identifying optimal flux values for maximum product yield

in Chapter 3. Compared to previous methods, the new strain design algorithm shows sig-

nificantly improved scalability, and the ability to efficiently identify optimal flux values for

metabolite overproduction. Table 2.1 lists some of the relevant bilevel optimization algorithms

that formed the foundations upon which EMILiO was constructed. In particular, emphasis is

placed on the practical implementation issues that the author of this thesis encountered while

using personally-coded implementations of these algorithms.
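
The combinatorial growth that limits MILP-based designs can be illustrated by counting candidate knockout sets. The sketch below is only a counting argument under the assumption that every reaction is a knockout candidate; an MILP solver prunes most of this space, so the counts indicate worst-case search space size rather than actual solution time.

```python
from math import comb


def count_designs(n_reactions, max_changes):
    """Number of designs that differ from a starting design in at most
    max_changes reactions (for an empty starting design, this is the
    number of knockout sets with at most max_changes knockouts)."""
    return sum(comb(n_reactions, j) for j in range(max_changes + 1))


# Hypothetical model size of 1000 candidate reactions.
global_space = count_designs(1000, 10)  # designs with up to 10 knockouts
local_space = count_designs(1000, 3)    # one local search step with k = 3
```

The same count applies to a local search neighborhood of size k around any current design, which is why restricting k keeps each local search iteration small even when the final design contains many knockouts.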

Although optimal flux manipulations can be identified, a major challenge still remains: how

robust is the performance of a strain against deviations of the modified fluxes from their optimal

values? Furthermore, how robust is the strain against gene expression noise and environmental

perturbations? As more complex strain designs are identified, which include not only gene

knockouts but also finely-tuned gene expression levels, strain robustness will become increas-

ingly important. Accordingly, this thesis develops a computational framework for assessing the


robustness of in silico strains against perturbations to modified fluxes, as well as a wide range of

industrially relevant perturbations (Chapter 4). The most robust in silico strains are expected

to be of greater practical value for metabolic engineers.


Table 2.1: Comparison of some of the existing strain design algorithms

| Algorithm | Formulation | Design scope | Implementation considerations | Consequences |
|---|---|---|---|---|
| OptKnock (Burgard et al., 2003) | MILP | Knockout | Limit number of knockouts (e.g., ≤ 10 knockouts) | May miss better solutions |
| | | | Limit execution time of MILP solver (e.g., solve for 4 days) | May not converge to global optimum |
| GDLS (Lun et al., 2009) | MILP (iterative) | Knockout | Limit neighborhood size (e.g., ≤ 3) | May fail to improve production due to limited local search space |
| | | | Limit execution time of each MILP local search (e.g., ≤ 1 hour) | May not converge to global optimum at each local search iteration |
| OptReg (Pharkya and Maranas, 2006) | MILP | Knockout, activation, inhibition | Limit number of genetic manipulations | May miss better solutions |
| | | | Must define level of activation/inhibition relative to reference fluxes | Difficult to determine exact level of activation/inhibition prior to identifying the set of modified reactions |
| | | | Limit execution time of MILP solver (e.g., solve for 4 days) | May not converge to global optimum |
| OptReg’LS (Yang et al., 2011) (Section 3.3.7) | MILP (iterative) | Knockout, activation, inhibition | Limit number of genetic manipulations | May miss better solutions |
| | | | Must define level of activation/inhibition relative to reference fluxes | Difficult to determine exact level of activation/inhibition prior to identifying the set of modified reactions |
| | | | Limit neighborhood size | May fail to improve production due to limited local search space |
| | | | Limit execution time of each MILP local search (e.g., ≤ 1 hour) | May not converge to global optimum at each local search iteration |
| EMILiO (Yang et al., 2011) (Section 3.3.2) | SLP, LP, MILP | Optimal fluxes (including knockout, activation, and inhibition) | Parameter tuning required for SLP stage | Sometimes difficult to determine SLP parameter values |
| | | | SLP is not a global optimization solver | Algorithm may not converge to global optimum |
| | | | SLP does not include integer variables | Difficult to limit number and type of genetic manipulations at the initial SLP stage |

2.3 Simulation and design using kinetic models of metabolism

Even prior to the widespread availability of genome-scale stoichiometric models of cell metabolism,

kinetic models had been developed, often built up part-by-part, based on in vitro kinetic stud-

ies to derive reaction mechanisms and parameter values. These models were then used for

metabolic engineering. Metabolic control analysis (MCA) (Kacser and Burns, 1973) has been

seminal in establishing mathematical models as a useful tool for metabolic engineering. MCA

is a mathematical framework enabling systematic quantification of the important enzymes,


metabolites, and fluxes that one must control to affect a target flux. Elasticity coefficients

and flux control coefficients of MCA are highly relevant to kinetic models today. Specifically,

elasticity coefficients are used directly in the lin-log kinetic rate equation, which is a simplified

rate law that was used to construct the latest genome-scale kinetic model (see Section 2.3.3).
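
For reference, the two MCA quantities mentioned above are commonly defined as scaled sensitivities. The notation below is a standard textbook form and is an assumption of this note rather than notation taken from this thesis: v_i is the rate of reaction i, x_j a metabolite concentration, p_i the level of enzyme i, and J a steady-state flux.

```latex
% Elasticity coefficient: local sensitivity of a single reaction rate
% to a metabolite concentration, with all other quantities held fixed.
\varepsilon^{v_i}_{x_j} = \frac{x_j}{v_i}\,\frac{\partial v_i}{\partial x_j}

% Flux control coefficient: systemic sensitivity of a steady-state flux
% to the level of one enzyme.
C^{J}_{p_i} = \frac{p_i}{J}\,\frac{\mathrm{d} J}{\mathrm{d} p_i}

% Summation theorem: control over a flux is shared among all enzymes.
\sum_i C^{J}_{p_i} = 1
```

Elasticities are properties of individual rate equations, whereas control coefficients are properties of the whole network, which is why elasticities can be inserted directly into a rate law.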

In addition to MCA, different approaches have been used to identify optimal engineering strate-

gies using kinetic models of metabolism. The formulation of constrained optimization problems

to identify optimal enzyme levels has emerged as a promising but also challenging approach.

2.3.1 Optimization approaches to metabolic engineering using kinetic mod-

els

As a fairly early example of constrained optimization approaches to metabolic engineering,

Dean and Dervakos (1998) formulated a mixed-integer nonlinear program (MINLP) to identify

optimal enzyme levels to minimize carbon dioxide production from the citric acid (TCA) cycle

of Dictyostelium discoideum. The authors solved the MINLP using the DICOPT++ solver

through the GAMS modeling software. The authors demonstrated that even for this relatively

small model, individual enzyme manipulations may not incrementally improve the objective;

therefore, the optimization formulation should consider a wide scope of simultaneous enzyme

manipulations for best results.

We note that around this time, Mendes and Kell (1998) developed the Gepasi software, now

known as Copasi (Hoops et al., 2006), which provides a user-friendly interface for viewing,

constructing, simulating and optimizing kinetic models. In this work, we have used Copasi to

load SBML files containing kinetic rate equations, simulate steady states, and calculate

elasticities and control coefficients.

Visser et al. (2004) formulated a nonlinear program (NLP) to identify optimal enzyme levels

for maximizing glucose uptake or serine production using a kinetic model of Escherichia coli.

The kinetic model, developed by Chassagnole et al. (2002), consisted of 30 enzymes and 17

intracellular metabolites. The NLP was solved by a gradient-based method.

Similarly, Schmid et al. (2004) formulated an NLP to optimize tryptophan synthesis using the

same kinetic model of E. coli (Chassagnole et al., 2002). The authors solved this NLP with


gradient-based methods and simulated annealing. They found that the strategies obtained by

constrained optimization sometimes contradicted those suggested by flux control coefficients of

MCA. Nonetheless, the authors found that flux control coefficients could also indicate which

enzymes should be optimized.

Vital-Lopez et al. (2006a) developed an optimization framework based on a novel, general lin-

earization of kinetic models. This linearization uses Lagrange expansions (as opposed to Taylor

expansions), according to an arbitrary basis function. Thus, a broad class of approximate

rate equations conform to their general formulation, including lin-log, thermokinetics, GMA,

etc. Upon linearization, optimal enzyme levels are identified through iterative solution of a

mixed-integer linear program (MILP). This formulation allows the user to limit manipulations

to knockouts and/or enzyme level modulations. Again, the kinetic model of E. coli central

carbon metabolism by Chassagnole et al. (2002) was used.

Recently, Nikolaev (2010) formulated an MINLP to optimize both enzyme levels and enzyme

regulatory structures. In addition to metabolite homeostasis and total enzyme level constraints

found in previous works (Schmid et al., 2004; Vital-Lopez et al., 2006a), the author introduced

a novel, local stability constraint. This constraint explicitly constrains the eigenvalues of the

Jacobian matrix to have negative real parts, thereby ensuring that the optimal enzyme manipulations result

in stable steady states. The stability of solutions is an important issue that will be discussed

further in Section 2.3.2. The kinetic model of E. coli central carbon metabolism by Chassagnole

et al. (2002) was used, once again.

Pozo et al. (2011) developed a customized spatial branch-and-bound algorithm to globally op-

timize metabolite production by manipulating a limited number of enzymes using a generalized

mass action (GMA) kinetics model. The authors found that their framework outperformed

the commercial MINLP solver, BARON (Tawarmalani and Sahinidis, 2005), due to their customized,

tight relaxations from MINLP to MILP. The authors identified optimal concentrations

for up to five predesignated enzymes for a model of citric acid production in Aspergillus niger

consisting of 60 reactions. Because the set of modifiable enzymes was already short-listed, the

problem is representative of the refining stage of metabolic engineering, rather than a global

search of the entire network. Therefore, the scalability of this method for a global search of a


larger number of enzymes is unknown.

2.3.2 Stability of kinetic models

Unlike flux balance analysis (FBA), where the system is at steady state by assumption, kinetic

models describe dynamic behavior; therefore, whether the system reaches a steady state for a

given initial condition must be assessed. If a steady state is reached, then the characteristics of

this steady state, such as whether bifurcations are possible, become important for metabolic

engineering because they may constrain enzyme manipulations to a subset of all enzymes or to

smaller levels of modulation. For example, Stephanopoulos and Simpson (1997) observed,

using a kinetic model of aromatic amino acid biosynthesis in Saccharomyces cerevisiae, that

amplifying the phosphofructokinase enzyme by more than 11% induced a bifurcation in the

concentration of the metabolite chorismate. Biological consequences that have been attributed

to the mathematical presence of bifurcations include the secretion of metabolites, induction

of degradation pathways, and large changes in product profiles (Stephanopoulos and Simpson,

1997). Consequently, metabolic engineering strategies are typically designed to avoid large

changes in metabolite concentrations.

Some of the issues associated with optimal enzyme manipulations despite potential bifurcations

were investigated by Vital-Lopez et al. (2006b). The authors constructed bifurcation diagrams

for enzymes previously identified to maximize serine in the model of E. coli by Chassagnole

et al. (2002). The authors found that for a 68% change in enzyme levels from the optimal

levels, the system exhibited Hopf bifurcations and/or limit points. Based on this observation,

Nikolaev (2010) formulated an optimization problem for identifying optimal enzyme

manipulations, which explicitly constrains solutions to exhibit local stability.
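
A local stability requirement of this kind can be illustrated for a system with two metabolites, where the condition that both eigenvalues of the Jacobian have negative real parts reduces to a negative trace and a positive determinant. The sketch below covers only this two-dimensional case; the Jacobian entries are hypothetical, and larger models would require a numerical eigenvalue computation instead.

```python
def is_locally_stable_2x2(jacobian):
    """Local stability test for a steady state of a 2-metabolite system.

    jacobian: ((a, b), (c, d)), the Jacobian of the mass balance equations
    evaluated at the steady state. For a real 2x2 matrix, both eigenvalues
    have negative real parts exactly when trace < 0 and determinant > 0.
    """
    (a, b), (c, d) = jacobian
    trace = a + d
    determinant = a * d - b * c
    return trace < 0 and determinant > 0


stable_node = is_locally_stable_2x2(((-1.0, 0.0), (0.0, -2.0)))
stable_spiral = is_locally_stable_2x2(((-1.0, 2.0), (-2.0, -1.0)))
unstable = is_locally_stable_2x2(((0.5, -1.0), (1.0, -0.2)))
```

In an optimization setting, a condition of this type would be imposed as a constraint on the candidate enzyme levels, so that designs producing an unstable steady state are excluded.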

2.3.3 Mechanistic versus generalized rate equations

Kinetic rate equations describe reaction rates as functions of metabolite concentrations, enzyme

levels, and kinetic parameters. These rate equations may be based on known enzyme mech-

anisms, in which case they are called mechanistic rate equations. Mechanistic rate equations


typically involve complex nonlinear terms and many parameters; however, their accuracy does

not deteriorate as the state deviates from a reference state. Generalized (alternatively, approx-

imative or phenomenological) rate equations are not based on reaction mechanisms; rather,

they are empirical models. They also typically involve fewer parameters and nonlinear terms.

Some contain only linear relationships. While simpler, generalized rate equations typically lose

accuracy as the state deviates from a reference state, where the parameter values were deter-

mined. Thus, the choice between mechanistic versus generalized rate equations will depend on

the purpose of the model, and the complexity of its application. In this section, both types of

rate equations are briefly reviewed.

Mechanistic rate equations

Mechanistic rate equations are based on the assumed mechanism of the enzyme-catalyzed re-

action, and the parameter values are typically determined using in vitro studies with purified

enzymes (Chassagnole et al., 2002). For example, Michaelis-Menten kinetics describes reactions

involving one substrate and one product, and it assumes that the rate of product formation is

much slower than the rate at which the enzyme binds to the substrate.
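
For concreteness, the irreversible textbook form of the Michaelis-Menten rate equation is shown below. The symbols are standard but are assumptions of this note: [S] is the substrate concentration, [E]_T the total enzyme concentration, k_cat the catalytic rate constant, and K_m the Michaelis constant.

```latex
% Reaction scheme: fast, reversible binding followed by slow catalysis.
E + S \;\underset{k_{-1}}{\overset{k_{1}}{\rightleftharpoons}}\; ES
      \;\overset{k_{cat}}{\longrightarrow}\; E + P

% Resulting rate equation.
v = \frac{V_{max}\,[S]}{K_m + [S]}, \qquad V_{max} = k_{cat}\,[E]_T

% Under the rapid-equilibrium assumption (k_{cat} \ll k_{-1}), the
% Michaelis constant approaches the dissociation constant of ES.
K_m \approx \frac{k_{-1}}{k_{1}}
```

Even this simplest mechanism requires two parameters per reaction and is nonlinear in [S], which indicates how quickly parameter requirements grow for multi-substrate mechanisms.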

Enzyme-catalyzed reactions involving more than one substrate may operate according to a

number of different mechanisms. These include random sequential, ordered ternary complex

sequential, ordered binary complex sequential, ping pong, and iso mechanisms (Purich, 2010).

Mechanistic rate equations are typically nonlinear and require the estimation of many param-

eters. Hence, describing every reaction in a large metabolic network by mechanistic rate equa-

tions remains challenging. Recent studies have investigated large-scale models in which some or

all of the reactions are described by approximative rate equations (Bulik et al., 2009; Smallbone

et al., 2010). These models represent one approach for balancing the tradeoff between model

scale, accuracy, and scope.

Approximative rate equations

Approximative rate equations (alternatively called generalized or phenomenological rate equations)

are empirical descriptions of reaction rates. Thus, these rate equations do not have

a mechanistic basis. A number of generalized rate equations are currently used in the

study of cell metabolism, each having its own characteristics and utility. We describe below the

lin-log rate equation as one example of generalized rate equations.

Lin-log kinetic modeling of cell metabolism

The lin-log kinetic model uses linear-in-log approximations of the original nonlinear kinetic

rate equations (Visser and Heijnen, 2003). This phenomenological rate law is accurate close

to a reference state of fluxes and concentrations. In fact, lin-log models have been shown to

outperform other phenomenological rate laws including GMA, S-systems, and thermokinetics

(Heijnen, 2005).

The lin-log kinetic rate equation is the following:

\[
v = \operatorname{diag}(v_0)\,\frac{p}{p_0}\left(1 + E \cdot \ln\frac{x}{x_0}\right) \tag{2.19}
\]

where v, p, x are the vectors of fluxes, enzyme levels, and metabolite concentrations, respec-

tively; v0, p0, x0 are the vectors of reference states for fluxes, enzyme levels, and metabolite

concentrations, respectively, and E is the matrix of elasticities. The elasticity matrix quantifies

changes in flux due to small deviations in concentrations from a steady state.
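
A minimal sketch of evaluating the lin-log rate equation (2.19) is given below. It assumes that the ratios p/p0 and x/x0 are taken element by element, and that row i of the elasticity matrix holds the elasticities of reaction i with respect to each metabolite. The function name and the numerical values are hypothetical.

```python
import math


def linlog_fluxes(v0, p, p0, x, x0, elasticities):
    """Evaluate the lin-log rate equation reaction by reaction.

    v0, p, p0: per-reaction reference fluxes, enzyme levels, and
               reference enzyme levels.
    x, x0: per-metabolite concentrations and reference concentrations.
    elasticities[i][j]: elasticity of reaction i w.r.t. metabolite j.
    """
    log_ratios = [math.log(xj / x0j) for xj, x0j in zip(x, x0)]
    fluxes = []
    for v0_i, p_i, p0_i, row in zip(v0, p, p0, elasticities):
        # 1 + sum_j E_ij * ln(x_j / x0_j)
        driving_term = 1.0 + sum(e * lr for e, lr in zip(row, log_ratios))
        fluxes.append(v0_i * (p_i / p0_i) * driving_term)
    return fluxes


v0 = [1.0, 2.0]
p0 = [1.0, 1.0]
x0 = [0.5, 2.0]
E = [[0.8, -0.3],
     [0.0, 0.6]]

at_reference = linlog_fluxes(v0, p0, p0, x0, x0, E)
doubled_enzyme = linlog_fluxes(v0, [2.0, 1.0], p0, x0, x0, E)
```

At the reference state the logarithmic terms vanish and the fluxes equal v0, while a change in enzyme level scales the corresponding flux proportionally when concentrations are held fixed.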

The lin-log rate equation has been used recently to construct a genome-scale model of S. cere-

visiae by Smallbone et al. (2010), in which the authors determined elasticities from other models

in the BioModels online database, as well as the method of tendency modeling (Visser et al.,

2000). The model is based on the recently constructed consensus network model of S. cere-

visiae (Herrgard et al., 2008), which included 1761 metabolic reactions and 1168 metabolites

distributed across 15 compartments. In contrast, the lin-log genome-scale model includes 956

metabolic reactions and 820 metabolites. Also, the 15 compartments have been simplified to

just two: intracellular or extracellular space. The authors tested the capabilities

of the model using MCA (Kacser and Burns, 1973). In particular, the authors identified

the metabolic reactions exerting the greatest control over biomass synthesis by calculating flux

control coefficients. Clearly, the values of the flux control coefficients and the elasticities de-


pend on the reference state; therefore, the validity of the model is limited to states near to the

reference. Nonetheless, the model developed by Smallbone et al. (2010) represents one of the

first attempts to investigate the complex interactions between metabolites that are connected

by kinetic rate equations, at the genome-scale. Particularly interesting is the fact that available

software platforms for simulating kinetic models are not capable of handling kinetic models of

this size (Smallbone et al., 2010). Accordingly, a practical challenge for the use of large-scale

kinetic models for metabolic engineering is to develop appropriate software platforms.

Generalized linearization of nonlinear rate equations

In some cases, an accurate, nonlinear model of enzyme kinetics may be available for the

metabolic network under study. However, if the size of the network and the number of ge-

netic manipulations are large, or many of the rate equations are nonlinear, it may be difficult

to directly use the mechanistic model for optimal strain design. One approach for overcoming

this computational difficulty is to linearize the original model near a reference state, which is

then used to formulate a simpler optimization problem. In this case, the state (i.e., concentra-

tions, enzyme levels, and fluxes) is typically constrained to remain near the reference to ensure

that the linear representation remains accurate. One method of linearization involves the use

of Lagrange expansion according to arbitrary basis functions for the metabolite concentrations

and enzyme levels (Vital-Lopez et al., 2006a). Vital-Lopez et al. (2006a) have shown that many

approximative rate equations can be described by this generalized linearization by appropriate

selection of basis functions. This formulation should be valuable for future development of

strain design algorithms that use large-scale kinetic models that include nonlinear rate equa-

tions.
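
The need to keep the state near the reference can be illustrated by comparing a mechanistic rate law with an approximation built at a reference point. The sketch below uses a lin-log approximation of Michaelis-Menten kinetics as a simple stand-in; it is not the Lagrange expansion of Vital-Lopez et al. (2006a), and the parameter values are hypothetical.

```python
import math


def michaelis_menten(s, vmax, km):
    """Mechanistic rate for substrate concentration s."""
    return vmax * s / (km + s)


def linlog_approximation(s, s0, vmax, km):
    """Lin-log approximation of the rate, built at reference s0."""
    v0 = michaelis_menten(s0, vmax, km)
    # Elasticity of the Michaelis-Menten rate at s0: d ln v / d ln s.
    elasticity = km / (km + s0)
    return v0 * (1.0 + elasticity * math.log(s / s0))


def relative_error(s, s0=1.0, vmax=10.0, km=0.5):
    exact = michaelis_menten(s, vmax, km)
    return abs(linlog_approximation(s, s0, vmax, km) - exact) / exact


error_at_reference = relative_error(1.0)
error_near = relative_error(1.1)
error_far = relative_error(5.0)
```

The approximation is exact at the reference concentration and degrades as the concentration moves away from it, which motivates bounding the state in the linearized optimization problem.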

2.3.4 Opportunities for advancement

As reviewed above, the field of optimization-based strain design using kinetic models of metabolism

is an active field of research with many challenges remaining. Of particular interest in this thesis

is to develop algorithms that are scalable to larger models of cell metabolism. Scalability will


become increasingly important as kinetic models are continuing to increase in size (Smallbone

et al., 2010). In Chapter 6, an efficient algorithm is developed for identifying optimal enzyme

manipulation strategies using kinetic models of metabolism. This algorithm extends the opti-

mization techniques used in Chapter 3 to kinetic models, and it has the potential for scalability

to larger kinetic models.

2.4 Synthesis and summary of the literature

2.4.1 Constraint-based modeling

One of the attractive features of CBM is its flexibility. Model predictions are refined by the

incorporation of additional constraints, which represent biophysical assumptions, biochemical

mechanisms, and physiological phenomena. Accordingly, a significant number of extensions to

CBM have been developed. Some of the important characteristics of and challenges for CBM

are summarized below:

• In the CBM framework, biophysical, physiological, and environmental constraints are

modeled in the form of mathematical constraints. The ability of CBM to accurately

model cell behavior depends directly on the accuracy of the constraints used.

• To improve the accuracy of CBM, two approaches exist: the identification of appropriate

objective functions (Schuetz et al., 2007), or the addition of appropriate constraints. The

two approaches are additive in contributing to the improvement of model accuracy.

• The identification of objective functions and the estimation of parameters for constraints

both rely on efficient methods for interpreting high-throughput data.

• While nonlinear constraints or objective functions may be required to maximize model

accuracy, the computational cost of simulation, and especially of design, will increase.

Therefore, a tradeoff is inherent between model accuracy and computational tractability.


2.4.2 Computer-aided strain design

Mathematical optimization-based in silico strain design is an active field of research with prac-

tical applications for metabolic engineering. Some of the important characteristics of and

challenges for optimization-based strain design are summarized below:

• The scalability of in silico strain design algorithms to larger models and more complex

design strategies is in continued development.

• Extension of optimization-based strain design to models including constraints other than

stoichiometry will require further research. For example, OptORF has been developed

for optimal knockout or expression of metabolic genes and transcription factors. To ex-

tend the approach to more quantitative models of transcriptional regulation (e.g., PROM

(Chandrasekaran and Price, 2010)) or more complex strain design strategies (e.g., optimal

gene expression levels) will require the development of novel methods.

• There is a lack of studies reporting the design of microbial strains that are robust against

both model parameter uncertainties and perturbations, whether genetic or environmental.

• The most efficient optimization-based algorithms for strain design using constraint-based

modeling will require the understanding and exploitation of the structure of each specific

problem and adopting appropriate techniques from the field of mathematical optimization.

2.4.3 Simulation and design using kinetic models of metabolism

The literature is rich with studies on constrained optimization approaches to strain design using

kinetic models. Some important characteristics and challenges are summarized below:

• Optimization problems for strain design using kinetic models are difficult, often involving

MINLP formulations, so scalability to larger models is uncertain.

• In the studies reviewed here (Section 2.3), kinetic models with up to 60 reactions have been

used; thus, a remaining challenge is to assess whether constrained optimization approaches

to metabolic engineering can be performed using genome-scale kinetic models.


• An important set of constraints has been identified by the community that are crucial for

any future study in this area: homeostasis, total enzyme capacity, and stability of steady

states (see Section 2.3.2). An important challenge is to identify additional constraints for

improving model accuracy.

• Kinetic models typically require the estimation of many more parameters than models

based on stoichiometry alone. These parameters will involve uncertainty, which may

make the identified design infeasible to implement or result in suboptimal performance in

reality. Therefore, future algorithms should adopt the methods of robust optimization to

rigorously account for model uncertainty.

2.5 On the chapters to follow

2.5.1 A Unifying Theme of this Thesis

The chapters that follow (Chapters 3 to 6) contain the main contributions of this thesis. In every

one of these chapters, a different and novel computational algorithm or framework is developed

in order to address some of the challenges stated in the previous section. Emphasis is placed on

the application of mathematical optimization for improving the predictive capability of models

of metabolism, to generate new hypotheses based on data and systems-level simulation, and

to accelerate metabolic engineering through the generation of novel strategies for strain design

that would be difficult to formulate without the aid of large-scale models and mathematical

optimization techniques.

Amidst all of these different algorithms that address different, albeit related, issues in metabolic

engineering and systems biology, a uniting theme emerges: fast and scalable methods

for analysis and design can be developed even for large and complex problems in metabolic

engineering and systems biology through four steps (from a modeler’s point of view): (i) rig-

orously formulate the biological problem in a mathematical form, (ii) understand the proper-

ties and characteristics of the mathematical formulation, (iii) study the literature to identify

mathematical methods that are appropriate for solving the mathematical formulation, and (iv)

judiciously learn and apply the mathematical methods to solve the problem at hand.


Completion of these steps will enable a researcher to begin to understand the biological prob-

lem in greater depth. In other words, iterative application of these four steps, together with

analysis of the mathematical solution and its biological implications, is required to progres-

sively improve one’s understanding of the problem. Certain chapters in this thesis represent

only one iteration of the four steps; i.e., a novel framework has been developed and tested, but

additional contributions may arise from more in-depth analysis of the solutions. Chapter 5 is

one such example. On the other hand, Chapter 4 represents the second iteration of the four

steps, building upon one iteration already completed in Chapter 3. While the first iteration

(Chapter 3) successfully produced a fast and scalable algorithm that can be applied broadly

(e.g., see Appendix C), fundamental insights, in this case into the mechanisms of biological

robustness and the potential implications for design, were gained only in the second iteration

(Chapter 4). Chapter 6 also represents an additional iteration based on Chapter 3, but in a

different direction from Chapter 4, in which the mathematical methods identified in the former

iteration were extended to a more complex but also more descriptive model of cell metabolism.

Interestingly, the interdisciplinary nature of systems biology becomes evident when pursuing

step (iii), in that a broad range of disciplines is inevitably visited in the process of identifying

a suitable mathematical method. Also, at step (iv), one may find that no suitable method

exists and may determine that a novel method must be developed. While this situation may

certainly arise, the author’s own experience from preparing this thesis, which focuses on the

area of mathematical optimization, suggests that many more contributions in systems biology

will likely stem from the novel application and combination of existing techniques developed by

experts in the field of mathematical methods (optimization) to challenging problems in biology.

Finally, greater knowledge and deeper insights are expected to be gained from the iterative

application of the four steps above to a certain problem. On the other hand, applying the four

steps to many different problems will allow the researcher to become exposed to a variety of

interesting applications of systems biology and optimization, while increasing awareness of the

fact that apparently different problems in different systems are often similar in mathematical

form.

2.5.2 Outline of the remainder of the thesis

The remainder of the thesis is organized as follows. Chapters 3 to 6 contain material that is

already published or in preparation for publication. In the former case, the publications have

been reproduced verbatim for the most part (although we have used the author-year citation

style in this thesis, which may differ from the original publication). Therefore, at the beginning

of each of these chapters, we make reference to the relevant citation, and we comment on any

noteworthy changes from the original publication.

In Chapter 3, we develop a fast strain design algorithm to address the computational complex-

ity inherent in existing computational algorithms for designing optimal genetic manipulations

for maximizing microbial production of biochemicals. This chapter contains material published

in Yang et al. (2011).

In Chapter 4, we develop a computational framework for designing microbial strains that

are robust against both genetic and environmental perturbations that may be encountered in

industrial-scale bioreactors. Material in this chapter is being prepared for submission.

In Chapter 5, we develop a computational algorithm for identifying metabolite concentrations

that need precise measurements in order to reduce the variability of model predictions. Material

in this chapter has been published in Yang et al. (2010b).

Finally, in Chapter 6, we develop an efficient algorithm for identifying optimal enzyme level

manipulations. The algorithm may potentially be scalable to large-scale kinetic models. Mate-

rial in this chapter is being prepared for submission.

2.5.3 Types of models used in the thesis

In this thesis, a number of models are used to simulate cell metabolism. These models are then

used to develop strain design algorithms, to design experiments for improving model precision,

and for simulating the dynamic response of metabolism to changes in enzyme levels. Table 2.2

summarizes the types of models used and their properties.

Table 2.2: Models used in this thesis. GAR: gene-associated reactions (if genes are not present in the model, GAR refers to metabolic reactions excluding transport and biomass synthesis), NGAR: non-gene-associated reactions.

| Property | Toy | iAF1260 | Chassagnole |
|---|---|---|---|
| Organism | E. coli | E. coli | E. coli |
| Reactions: Total | 20 | 2382 (including biomass synthesis) | 48 |
| Reactions: GAR | 12 | 1944 | 30 |
| Reactions: NGAR | 8 | 438 | 18 |
| Metabolites | 11 | 1668 | 18 |
| Compartments | Intracellular, extracellular | Cytosolic, periplasmic, extracellular | Intracellular, extracellular |
| Constraints | Stoichiometry, thermodynamics, flux bounds, concentration bounds, ∆Gr bounds | Stoichiometry, flux bounds | Stoichiometry, rate equations, flux bounds, concentration bounds |
| Reference | Covert et al. (2001). We added thermodynamic constraints in this thesis. | Feist et al. (2007) | Chassagnole et al. (2002) |
| Used in chapter(s) | Chapter 5 | Chapters 3 & 4 | Chapter 6 |

Chapter 3

EMILiO: A fast algorithm for genome-scale strain design

This chapter contains material from our publication (Yang et al., 2011):

“Yang, L., Cluett, W.R. and Mahadevan, R. (2011) EMILiO: a fast algorithm for genome-scale
strain design. Metab Eng. 13:272–281.”

This chapter consists of a combination of both the main manuscript and the Supporting Infor-

mation from the citation above. Furthermore, Eq. (6.10) has been updated in this chapter to

reflect the latest implementation of the algorithm since publication of the article above. Re-

production of the material above in this thesis is a right that has been granted by Elsevier (the

publisher) to the authors of the manuscript.

3.1 Abstract

Systems-level design and optimization of cell metabolism is becoming increasingly important

for the renewable production of fuels, chemicals, and pharmaceuticals. Mathematical models of

the metabolism of biological systems are improving in terms of their accuracy and scope of pre-

dictions, but are also growing in complexity. Consequently, efficient and scalable algorithms are

increasingly important for strain design. Previous algorithms helped to consolidate the utility

of computational modeling in this field. However, their combinatorial nature is hindering their

application to more complex strain designs. Here, we present EMILiO, a new algorithm that

increases the scope of strain design to individually fine-tuned fluxes. Unlike existing approaches

that would experience an explosion in complexity to solve this problem, we efficiently gener-

ated numerous alternate strain designs producing succinate, L-glutamate and L-serine. This

was enabled by successive linear programming, a technique new to the area of computational

strain design. Our methods should help spur the development of new, scalable algorithms for

metabolic engineering.

3.2 Introduction

Microbial cell factories are becoming increasingly important for the sustainable production

of chemicals and fuels. The system-wide effects of genetic manipulations employed in
engineered microbial strains can be difficult to elucidate without the aid of computational models.

Constraint-based modeling (CBM) (Edwards et al., 2002) has been successfully used to accu-

rately predict cell physiology by integrating multiple types of high-throughput data, especially

for industrially important microorganisms (Joyce and Palsson, 2006; Mahadevan et al., 2005).

Consequently, a number of computational algorithms have been developed to identify network

manipulation strategies while predicting their system-wide effects. OptKnock (Burgard et al.,

2003) was the first computational algorithm for systematically designing knockout strains that

couple enhanced biochemical production with maximal growth rate. This coupling of product

formation and growth rate has been successfully validated in several studies (Fong et al., 2005;

Hua et al., 2006). In addition to gene knockouts, the design of strains involving overexpression
(Jin and Stephanopoulos, 2007) and down-regulation (Nakamura and Whited, 2003) has
been shown to enhance biochemical production. OptReg (Pharkya and Maranas, 2006) is an

MILP-based algorithm that identifies such strain designs but suffers from significantly increased

computational burden arising from additional binary variables and constraints compared to Op-

tKnock.

Globally optimal solutions to OptKnock and OptReg typically require prohibitively long
computational times for more than a few modifications (Feist et al., 2010). However, in some

cases, several more modifications may be required for effective coupling given the redundancy

in metabolic networks. Recently, Lun et al. (2009) developed Genetic Design through Local

Search (GDLS) to quickly obtain locally optimal solutions to OptKnock. The authors also

showed that the MILP-based GDLS predicted complex designs with higher in silico produc-

tion rates than methods based on evolutionary algorithms.

GDLS still suffers from an exponential increase in complexity with increasing scope of each local
search. This limitation became apparent when we applied GDLS to the OptReg problem in
this work: smaller local search scopes proved insufficient for escaping local optima. Recently,
alternatives to optimization-based algorithms for identifying knockout, overexpression and
down-regulation targets have been developed (Melzer et al., 2009). However, the computational
burden of computing elementary modes has limited their application to reduced models of metabolism.

In addition to poor computational scalability, another limitation of existing algorithms is that

they identify only discrete levels of target enzyme activities: elimination, overexpression or

down-regulation. In contrast, several studies have shown that fine-tuning the expression levels
of certain genes is required to maximize metabolite production. For example, Alper et al.

(2005) showed that lycopene production by a recombinant strain of E. coli was maximized

when expression of the gene, dxs, coding for deoxy-xylulose-P synthase was fine-tuned to an

optimal, intermediate level. Both positive and negative deviations from this optimal expression

level lead to decreased lycopene production. Similarly, Lee et al. (2007) showed that optimal

expression of a key enzyme in central metabolism, namely PEP carboxylase (PPC), maximized

L-threonine production by an engineered strain of E. coli.

In vivo fluxes can be fine-tuned using either promoter libraries (Alper et al., 2005) or novel

approaches such as automated design of synthetic ribosome binding sites (Salis et al., 2009).

However, quantitative relations between gene expression level and reaction flux are currently

not adequately described by CBM. Hence, the experimental effort to deduce optimal expression

levels to achieve the fine-tuned metabolic flux will increase combinatorially with the number of

modified fluxes in a strain design. Here, we developed a novel computational algorithm, termed

Enhancing Metabolism with Iterative Linear Optimization (EMILiO) to serve two purposes:

(1) identify a subset of reactions with the potential to improve growth-coupled biochemical

production after fine-tuning, and (2) quantitatively predict the fine-tuned flux ranges that op-

timize production. EMILiO generates complex strain designs using genome-scale models with

unprecedented speed. This is due mainly to the use of successive linear programming (SLP),

which has been developed and applied in the petrochemical industry for at least half a century

(Baker and Lasdon, 1985). Here, we use EMILiO to generate over 200 alternate strain designs

for succinate production using the latest genome-scale model of E. coli metabolism (Feist et al.,

2007). We demonstrate the robustness of our algorithm by also generating some strain designs

for L-glutamate and L-serine production. These amino acids were chosen because computa-

tional strains could not be identified using strictly knockout mutants in a previous study (Feist

et al., 2010).

3.3 Materials and Methods

3.3.1 Flux balance analysis, model reduction, and in silico strain design

verification

The distribution of metabolic reaction fluxes was simulated using Flux Balance Analysis (FBA) (Varma and Palsson, 1994). In FBA, the reaction network stoichiometry is defined in a matrix, $S \in \mathbb{R}^{M \times N}$, where the $M$ rows correspond to metabolites and the $N$ columns correspond to fluxes. The rank, $r$, of $S$ is less than $M$; hence, we can separate the free and pivot variables in the reduced row echelon form of $S$ and formulate a reduced FBA problem as below:

\begin{align}
\max_{v_f} \quad & c^{\top} T v_f = v_{bio} - \varepsilon \cdot v_{prod} \tag{3.1a} \\
\text{s.t.} \quad & v^L \le T v_f \le v^U, \tag{3.1b}
\end{align}

where $v_f \in \mathbb{R}^{N-r}$ are the free flux variables, $v^L \in \mathbb{R}^N$ and $v^U \in \mathbb{R}^N$ are the vectors of minimum and maximum fluxes, respectively, $T \in \mathbb{R}^{N \times (N-r)}$ is defined such that $v = T v_f$, and $c$ is the objective vector. Here, we add a small weighted minimization of the product flux ($\varepsilon \cdot v_{prod}$) because alternate optima in the solution of this linear program (LP) might lead to a range of product flux when growth rate ($v_{bio}$) is maximized. We implemented this reduced FBA in EMILiO, OptReg’ and OptReg’LS.
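The reduced FBA problem (3.1) can be illustrated with a minimal Python sketch. Several things here are assumptions made for illustration and are not taken from the thesis: the four-reaction toy network, the function name `reduced_fba`, and the use of an orthonormal null-space basis from SciPy as the matrix T (the thesis instead separates free and pivot variables in the reduced row echelon form; any T whose columns span the null space of S gives the same fluxes v = T v_f).

```python
import numpy as np
from scipy.linalg import null_space
from scipy.optimize import linprog

def reduced_fba(S, v_lo, v_hi, bio, prod, eps=1e-3):
    """Maximize v_bio - eps * v_prod over the free variables v_f, with v = T v_f."""
    T = null_space(S)                      # columns span {v : S v = 0} (assumed basis choice)
    c = np.zeros(S.shape[1])
    c[bio], c[prod] = 1.0, -eps            # objective vector of Eq. (3.1a)
    A_ub = np.vstack([T, -T])              # T v_f <= v^U and -T v_f <= -v^L, Eq. (3.1b)
    b_ub = np.concatenate([np.asarray(v_hi, float), -np.asarray(v_lo, float)])
    res = linprog(-(c @ T), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * T.shape[1], method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return T @ res.x                       # recover the full flux vector

# Hypothetical toy network: uptake -> A, A -> B, B -> biomass, A -> product
S = np.array([[1.0, -1.0, 0.0, -1.0],
              [0.0, 1.0, -1.0, 0.0]])
v = reduced_fba(S, v_lo=[0, 0, 0, 0], v_hi=[10, 1000, 1000, 1000], bio=2, prod=3)
```

In this toy case all substrate is routed to biomass, and the small penalty on the product flux selects the solution with zero product secretion.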

The “biomass iAF1260 core” reaction in the iAF1260 model was used to simulate cell growth.

All simulations were run with a maximum uptake rate of 20 mmol/gDW/h for both glucose and oxygen. We computed the maximum succinate production rate, $v_{prod}^{max} = 32.25$ mmol/gDW/h, by maximizing succinate flux subject to these uptake constraints and a minimum required growth rate of 0.1 h$^{-1}$.

We reduced the number of target reactions for modification to both reduce the number of

binary variables for OptReg’ and OptReg’LS and to eliminate target reactions suspected not

to be experimentally implementable. First, non-gene associated reactions were excluded from

the target reactions based on the gene-protein-reaction mappings in the iAF1260 model. We

also removed additional reactions as described by Feist et al. (2010). These reactions were

involved in cell envelope biosynthesis, glycerophospholipid metabolism, inorganic ion transport

and metabolism, lipopolysaccharide biosynthesis and recycling, membrane lipid metabolism,

murein biosynthesis, murein recycling, inner membrane transport, outer membrane transport,

and outer membrane porin transport. We used this reduced model for all algorithms. The

reduction of target reactions was crucial for improving the computational efficiency of OptReg’
and OptReg’LS as the number of binary variables was greatly reduced.

Each strain design identified by the three algorithms was verified, in silico, by implementing the

strategies into an FBA simulation. This was to ensure that numerical difficulties associated with

solving the large-scale MILP problems did not lead to solutions that violated the constraints of

the optimization problems.

All code was implemented in MATLAB (The Mathworks, Inc., Natick, MA). CPLEX 11.2 was

used to solve the LPs and MILPs using the CPLEXINT MATLAB interface. All simulations

were run on Intel Xeon 3.2 GHz processors.

3.3.2 The formulation of EMILiO

EMILiO is a computational algorithm to couple biochemical production to growth by quan-

titatively fine-tuning a set of target fluxes. EMILiO is formulated as the following bilevel

optimization problem:

\begin{equation}
\begin{aligned}
\max_{v^L,\, v^U} \quad & c_p^{\top} T v_f \\
\text{s.t.} \quad & \max_{v_f} \quad c^{\top} T v_f - \varepsilon \cdot c_p^{\top} T v_f \\
& \quad \text{s.t.} \quad v^L \le T v_f \le v^U \\
& v_{bio} \ge v_{bio}^{min},
\end{aligned}
\tag{3.2}
\end{equation}

where $v_{bio}^{min}$ is the minimum required growth rate, and the inner optimization is the reduced FBA formulation (3.1) with the additional objective of minimizing production rate. Hence, our algorithm identifies manipulation strategies having a high minimal production rate when growth rate is optimal. Here, $\varepsilon = 0.001$ is chosen so that the maximum growth rate is not affected by minimization of production. Using the Karush-Kuhn-Tucker (KKT) conditions, this bilevel optimization problem is reformulated into a single-level mathematical program with complementarity constraints (MPCC) (Yang et al., 2008) as follows:

\begin{align}
\max_{x} \quad & c_p^{\top} T v_f \tag{3.3a} \\
\text{s.t.} \quad & w_i^L \mu_i^L + w_i^U \mu_i^U = 0, \quad i = 1, \dots, N \tag{3.3b} \\
& T v_f + \mu^U = v^U \tag{3.3c} \\
& T v_f - \mu^L = v^L \tag{3.3d} \\
& (w^U)^{\top} T - (w^L)^{\top} T = c^{\top} T - \varepsilon \cdot c_p^{\top} T \tag{3.3e} \\
& v_{bio} \ge v_{bio}^{min} \tag{3.3f} \\
& w^L, w^U, \mu^L, \mu^U \ge 0 \tag{3.3g}
\end{align}

where $\mu^L \in \mathbb{R}^N$ and $\mu^U \in \mathbb{R}^N$ are slack variables for the lower and upper bounds, respectively, $w^L \in \mathbb{R}^N$ and $w^U \in \mathbb{R}^N$ are dual variables for the lower and upper bound constraints, respectively, and $x = [v_f, v^U, v^L, \mu^U, \mu^L, w^U, w^L]^{\top}$. The reduced FBA formulation has removed the need to include dual variables for $Sv = 0$, resulting in fewer variables. This MPCC is solved in three stages: an iterative linear program (ILP) (Bullard and Biegler, 1991) is used to identify an initial set of optimal flux bounds, a recursive LP-based algorithm is applied to the set of optimal bounds to generate subsets of optimal bounds, and an MILP is formulated to identify the minimal and alternate optimal sets of flux bounds. Each of these stages is described in detail in the sections that follow.
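As a hedged aside, the stationarity and complementarity conditions of the MPCC can be traced back to the inner LP of the bilevel problem. The sketch below assumes the usual Lagrangian sign convention for a maximization problem with inequality constraints; the symbol $\mathcal{L}$ is introduced here only for illustration and does not appear in the original formulation.

```latex
\begin{align*}
\mathcal{L}(v_f, w^U, w^L)
  &= (c - \varepsilon c_p)^{\top} T v_f
     - (w^U)^{\top} (T v_f - v^U)
     + (w^L)^{\top} (T v_f - v^L), \\
\nabla_{v_f} \mathcal{L} = 0
  \;&\Rightarrow\;
  (w^U)^{\top} T - (w^L)^{\top} T = c^{\top} T - \varepsilon\, c_p^{\top} T, \\
\mu^U = v^U - T v_f \ge 0, \quad
  &\mu^L = T v_f - v^L \ge 0, \\
w_i^U \mu_i^U = 0 \ \text{and}\ w_i^L \mu_i^L = 0
  \;&\Leftrightarrow\;
  w_i^L \mu_i^L + w_i^U \mu_i^U = 0
  \quad (\text{because } w^L, w^U, \mu^L, \mu^U \ge 0).
\end{align*}
```

Because every term in the summed complementarity condition is non-negative, the single equality per flux enforces both complementarity pairs at once, which is why one bilinear constraint per reaction suffices.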

3.3.3 Solution of the MPCC using ILP

In Yang et al. (2008), the authors solved a similar MPCC by expressing the bilinear constraints

(3.3b) as a penalty function and solving the resulting NLP using off-the-shelf NLP solvers.

Here, we solve the above MPCC by formulating an iterative linear program (ILP), or successive

linear program (SLP).

Iterative linear programming was developed to solve a general nonlinear system of equations

subject to nonlinear inequality constraints and variable bounds (Bullard and Biegler, 1991). The

ILP converges to a feasible solution by iteratively generating search directions based on local

linearization of the nonlinear equations and inequalities. In our algorithm, an ILP is formulated

to satisfy the bilinear constraints (3.3b), while also maximizing product formation. Thus, at

each iteration, k, we move the current solution, xk, which violates the bilinear constraints

but satisfies (3.3c)–(3.3g), by computing an optimal direction, u, and updating the solution,

xk+1 = xk + u.

For simplicity of notation, we define matrices $E \in \mathbb{R}^{2N \times N_x}$ and $F \in \mathbb{R}^{2N \times N_x}$, where $N_x$ is the length of the vector $x$, such that $E \cdot x = [w^U, w^L]^{\top}$ and $F \cdot x = [\mu^U, \mu^L]^{\top}$. Furthermore, we define $g_i(x^k) = (e_i x^k)(f_i x^k)$, where $e_i$ and $f_i$ denote the $i$-th rows of $E$ and $F$, respectively. The bilinear constraints (3.3b) at iteration $k+1$ are expressed as $g_i(x^k + u) = 0$. We now construct a merit function, $Z(x^k)$, similar to Bullard and Biegler (1991) but with the added objective of maximizing production rate:

\begin{equation}
Z(x^k) = \sum_{i=1}^{2N} g_i(x^k) - K_p \cdot c_p^{\top} T v_f, \tag{3.4}
\end{equation}

where $K_p$ is a constant that controls the emphasis placed on maximizing production rate, relative to minimizing violation of the bilinear constraints. All results were obtained with $K_p = 1000$, but a dynamic $K_p^k$ is also possible.

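We can linearize $g_i(x^k + u)$ about $x^k$ as $g_i(x^k) + \nabla g_i(x^k) u$, where $\nabla g_i(x^k) u = (e_i x^k)(f_i u) + (f_i x^k)(e_i u)$ is the directional derivative of $g_i$ at $x^k$ in the direction $u$. We thus formulate the following LP to compute the optimal direction to minimize $Z(x^{k+1}) = Z(x^k + u)$:

\begin{align}
\min_{u,\, s} \quad & \sum_{i=1}^{N} s_i - K_p \cdot c_p^{\top} T \Delta v_f \tag{3.5a} \\
\text{s.t.} \quad & g_i(x^k) + \nabla g_i(x^k) u \le s_i \tag{3.5b} \\
& T (v_f + \Delta v_f) + (\mu^U + \Delta \mu^U) = (v^U + \Delta v^U) \tag{3.5c} \\
& T (v_f + \Delta v_f) - (\mu^L + \Delta \mu^L) = (v^L + \Delta v^L) \tag{3.5d} \\
& (w^U + \Delta w^U)^{\top} T - (w^L + \Delta w^L)^{\top} T = c^{\top} T - \varepsilon \cdot c_p^{\top} T \tag{3.5e} \\
& v_{bio} + \Delta v_{bio} \ge v_{bio}^{min} \tag{3.5f} \\
& w^L + \Delta w^L \ge 0 \tag{3.5g} \\
& w^U + \Delta w^U \ge 0 \tag{3.5h} \\
& \mu^L + \Delta \mu^L \ge 0 \tag{3.5i} \\
& \mu^U + \Delta \mu^U \ge 0 \tag{3.5j} \\
& s \ge 0, \tag{3.5k}
\end{align}

where $u = [\Delta v_f, \Delta v^U, \Delta v^L, \Delta \mu^U, \Delta \mu^L, \Delta w^U, \Delta w^L]^{\top} = x^{k+1} - x^k$ is the direction vector, and $s \in \mathbb{R}^N$ are auxiliary variables used to minimize the bilinear constraints to 0.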
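The first-order linearization of a single bilinear term can be checked numerically with a short sketch. The vectors `e` and `f` below are hypothetical selector rows standing in for one row of E and F, and the random point and step are arbitrary; the only claim illustrated is that the linearization error of a bilinear term equals the second-order product (e u)(f u).

```python
import numpy as np

def g(x, e, f):
    """Bilinear term g(x) = (e x)(f x), e.g. dual variable times slack."""
    return (e @ x) * (f @ x)

def g_linearized(x, u, e, f):
    """First-order model of g(x + u) about x, as used in the direction-finding LP."""
    return g(x, e, f) + (e @ x) * (f @ u) + (f @ x) * (e @ u)

rng = np.random.default_rng(0)
x = rng.random(6)                      # arbitrary current iterate (illustrative)
u = 1e-3 * rng.standard_normal(6)      # arbitrary small step (illustrative)
e = np.zeros(6); e[4] = 1.0            # hypothetical row selecting a dual variable
f = np.zeros(6); f[2] = 1.0            # hypothetical row selecting a slack variable

exact = g(x + u, e, f)
approx = g_linearized(x, u, e, f)
remainder = (e @ u) * (f @ u)          # the neglected second-order term
```

Since the neglected term grows quadratically with the step length, a full step along the LP direction can overshoot, which motivates a step-size search along that direction.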

Solution of the LP above generates an optimal direction, $u^*$, to determine the new values of $x$ at the next iteration, $k+1$. A full step in this direction is not guaranteed to improve the objective, because the optimal step direction is determined based on a linear approximation of the bilinear constraints. Accordingly, we move the current solution in the optimal direction only by a step size, $\lambda$, such that $x^{k+1} = x^k + \lambda u^*$. Furthermore, we use a line search procedure to determine the optimal step size,

\begin{equation}
\lambda^* = \operatorname*{arg\,min}_{\lambda \in [0,1]} \left( \sum_{i=1}^{2N} e_i (x^k + \lambda u^*) \, f_i (x^k + \lambda u^*) - K_p \cdot c_p^{\top} T (\lambda \Delta v_f^*) \right). \tag{3.6}
\end{equation}

To determine $\lambda^*$, we generate a number of trial step sizes and evaluate Eq. (3.6) for each trial. The trial step size that minimizes Eq. (3.6) is chosen to be $\lambda^*$. We note that since Eq. (3.6) is quadratic in the single variable, $\lambda$, the optimal step size, $\lambda^*$, can be found analytically. On the other hand, if we wish to minimize the bilinear constraint violation using a different function, then the line search equation may not be so simple. Accordingly, to maintain greater generality of the ILP stage of EMILiO, we determined the optimal step size by simply evaluating Eq. (3.6) for many trial values of $\lambda$, with negligible computational effort. In this work, we used 100 trial steps, evenly distributed between 0 and 1. If $\lambda^* = 0$, then the SLP has converged since no further improvement of the objective is possible. In Chapter 4, Section 4.4.4 of this thesis, a sub-procedure is developed for improving convergence to a global optimum, despite convergence of the ILP.
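The grid-based line search over Eq. (3.6) can be sketched as follows. The five-variable iterate, the direction `u`, the selector matrices and the production vector `p` (standing in for the row vector c_p^T T padded to the length of x) are hypothetical values chosen so the merit function is easy to verify by hand. Including both endpoints in the 100 trial values is an assumption, since the text only states that the trials are evenly distributed between 0 and 1. The analytic variant fits the quadratic from three evaluations, as one possible way to realize the remark that the step can be found in closed form.

```python
import numpy as np

def merit(lam, x, u, E, F, Kp, p):
    """Merit of Eq. (3.6): bilinear violation minus weighted production gain."""
    z = x + lam * u
    return float(np.sum((E @ z) * (F @ z)) - Kp * lam * (p @ u))

def grid_line_search(x, u, E, F, Kp, p, n_trials=100):
    """Pick the trial step in [0, 1] with the smallest merit value."""
    lams = np.linspace(0.0, 1.0, n_trials)
    vals = [merit(lam, x, u, E, F, Kp, p) for lam in lams]
    return float(lams[int(np.argmin(vals))])

def analytic_step(x, u, E, F, Kp, p):
    """Exact minimizer of the quadratic merit a*lam^2 + b*lam + c on [0, 1]."""
    z0, zh, z1 = (merit(lam, x, u, E, F, Kp, p) for lam in (0.0, 0.5, 1.0))
    a = 2.0 * (z1 + z0 - 2.0 * zh)
    b = z1 - z0 - a
    if a > 1e-12:                                  # strictly convex: clip the vertex
        return float(np.clip(-b / (2.0 * a), 0.0, 1.0))
    return 0.0 if z0 <= z1 else 1.0                # otherwise the best endpoint

# Hypothetical iterate x = [w1, w2, mu1, mu2, v_prod] and direction u
x = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
u = np.array([-1.0, 1.0, 0.0, 1.0, 2e-4])
E = np.array([[1.0, 0, 0, 0, 0], [0, 1.0, 0, 0, 0]])   # selects w1, w2
F = np.array([[0, 0, 1.0, 0, 0], [0, 0, 0, 1.0, 0]])   # selects mu1, mu2
p = np.array([0, 0, 0, 0, 1.0])                        # selects the production flux
lam_grid = grid_line_search(x, u, E, F, 1000.0, p)
lam_exact = analytic_step(x, u, E, F, 1000.0, p)
```

For these values the merit is λ² − 1.2λ + 1, so the exact minimizer is 0.6 and the grid search returns the nearest trial value.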

3.3.4 Pruning the Design Using LP

The solution of the ILP in Section 3.3.3 generates modified lower and upper bounds $v^L$ and $v^U$. We define the design sets, $Design_L$ and $Design_U$, as the $N_L$ lower and $N_U$ upper bounds that are different from the original bounds and whose corresponding dual variables are strictly positive. Due to network redundancy, many of these constraints may not be active simultaneously. Hence, smaller subsets of active constraints may exist. We extract such subsets by recursively solving the following LP:

\begin{equation}
\begin{aligned}
\min_{v} \quad & c_p^{\top} v \\
\text{s.t.} \quad & S v = 0 \\
& v_i^L \le v_i, \quad \forall i \in Design_L \\
& v_i \le v_i^U, \quad \forall i \in Design_U \\
& v_i^L \le v_i, \quad \forall i \in \{1, \dots, N\} \text{ and } i \notin Design_L \\
& v_i \le v_i^U, \quad \forall i \in \{1, \dots, N\} \text{ and } i \notin Design_U \\
& v_{bio} \ge v_{bio}^{min}.
\end{aligned}
\tag{LPR}
\end{equation}

The solution to (LPR) is the minimum production rate, $v^*_{prod}$, subject to the modified bounds and minimal growth rate. We first determine if this minimum production rate is acceptable, say $v^*_{prod} \ge 0.5 \times v^{max}_{prod}$. We identify the set of active bound constraints and define it as a subset strain design. We remove these active constraints from $Design_L$ and $Design_U$ and solve (LPR) again, with the remaining modified bounds. We then define another strain design if the resulting production rate is still acceptable. We recursively apply this procedure to all strain designs and their subset strain designs. We terminate the procedure when no strain design yields a subset design that is smaller in size, or if all of these subset designs exhibit lower production rate than the defined tolerance of $0.5 \times v^{max}_{prod}$.
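One pass of the pruning step can be sketched on a toy network. Everything concrete below is assumed for illustration: the four-reaction network, the design bounds, the helper names, and the tolerance used to decide whether a bound is active. The minimum growth requirement is encoded simply as a lower bound on the biomass flux, and the recursion over subset designs is omitted.

```python
import numpy as np
from scipy.optimize import linprog

def min_production(S, v_lo, v_hi, prod):
    """Solve the pruning LP: minimize the product flux subject to S v = 0 and bounds."""
    cost = np.zeros(S.shape[1])
    cost[prod] = 1.0
    res = linprog(cost, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=list(zip(v_lo, v_hi)), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun, res.x

def active_design_bounds(v, v_lo, v_hi, design_L, design_U, tol=1e-7):
    """Return the modified lower/upper bounds that the LP solution sits on."""
    act_L = [i for i in design_L if abs(v[i] - v_lo[i]) <= tol]
    act_U = [i for i in design_U if abs(v[i] - v_hi[i]) <= tol]
    return act_L, act_U

# Hypothetical toy network: uptake -> A, A -> B, B -> biomass, A -> product
S = np.array([[1.0, -1.0, 0.0, -1.0],
              [0.0, 1.0, -1.0, 0.0]])
orig_lo, orig_hi = [0.0, 0.0, 1.0, 0.0], [10.0, 1000.0, 1000.0, 1000.0]
# Assumed design: product flux >= 6 (index 3) and A -> B flux <= 8 (index 1)
mod_lo, mod_hi = [0.0, 0.0, 1.0, 6.0], [10.0, 8.0, 1000.0, 1000.0]

prod_full, v_full = min_production(S, mod_lo, mod_hi, prod=3)
act_L, act_U = active_design_bounds(v_full, mod_lo, mod_hi, design_L=[3], design_U=[1])

# Relax the active bound back to its original value and re-solve with what remains
relaxed_lo = list(mod_lo)
relaxed_lo[3] = orig_lo[3]
prod_rest, _ = min_production(S, relaxed_lo, mod_hi, prod=3)
```

In this toy case only the lower bound on the product flux is active, so the subset design contains a single modification, and the remaining bound alone no longer guarantees any production.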

3.3.5 Minimal and Alternate Optimal Designs Using MILP

The recursive pruning phase in Section 3.3.4 may produce alternate strain designs that are more parsimonious than the single initial set generated in Section 3.3.3. The LP in this pruning stage, however, has not been formulated to generate the strain design with the minimal number of modifications. We thus formulate a final processing phase as an MILP, with binary variables, $y^L \in \mathbb{Z}^{N_L}$ and $y^U \in \mathbb{Z}^{N_U}$, to identify the minimal set of reaction modifications to achieve a desired production rate of $v_p^{min}$ as follows:

\begin{equation}
\begin{aligned}
\min_{y^L,\, y^U} \quad & \sum_{i=1}^{N_L} y_i^L + \sum_{i=1}^{N_U} y_i^U \\
\text{s.t.} \quad & \max_{v} \quad c_{bio}^{\top} v - \varepsilon \cdot c_p^{\top} v \\
& \quad \text{s.t.} \quad S v = 0 \\
& \quad \phantom{\text{s.t.}} \quad v^L \le v \le v^U \\
& \quad \phantom{\text{s.t.}} \quad v_i^L y_i^L + v^L_{DL,i} (1 - y_i^L) \le v_{DL,i}, \quad i = 1, \dots, N_L \\
& \quad \phantom{\text{s.t.}} \quad v_{DU,i} \le v_i^U y_i^U + v^U_{DU,i} (1 - y_i^U), \quad i = 1, \dots, N_U \\
& c_p^{\top} v \ge v_p^{min} \\
& y_i^L \in \{0, 1\}, \quad i = 1, \dots, N_L \\
& y_i^U \in \{0, 1\}, \quad i = 1, \dots, N_U,
\end{aligned}
\tag{3.7}
\end{equation}

where $v_{DL} = \{v_i : \forall i \in Design_L\}$, $v_{DU} = \{v_i : \forall i \in Design_U\}$, $v^L_{DL} = \{v_i^L : \forall i \in Design_L\}$, and $v^U_{DU} = \{v_i^U : \forall i \in Design_U\}$. This bilevel optimization problem is reformulated to a single-level MILP as follows:

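\begin{equation}
\begin{aligned}
\min_{\substack{v,\, w^S,\, w^L,\, w^U, \\ \eta^L,\, \eta^U,\, y^L,\, y^U}} \quad & \sum_{i=1}^{N_L} y_i^L + \sum_{i=1}^{N_U} y_i^U \\
\text{s.t.} \quad & S v = 0 \\
& v^L \le v \le v^U \\
& v_i^L y_i^L + v^L_{DL,i} (1 - y_i^L) \le v_{DL,i}, \quad i = 1, \dots, N_L \\
& v_{DU,i} \le v_i^U y_i^U + v^U_{DU,i} (1 - y_i^U), \quad i = 1, \dots, N_U \\
& (w^S)^{\top} S + w^L - w^U + \eta^L - \eta^U = c_{bio}^{\top} - \varepsilon \cdot c_p^{\top} \\
& \sum_{i=1}^{N_L} \eta_i^L v_i^L + \sum_{i=1}^{N} w_i^L v_i^L - \sum_{i=1}^{N_U} \eta_i^U v_i^U - \sum_{i=1}^{N} w_i^U v_i^U - c_{bio}^{\top} v + \varepsilon \cdot c_p^{\top} v = 0 \\
& 0 \le \eta_i^L \le K y_i^L, \quad i = 1, \dots, N_L \\
& 0 \le \eta_i^U \le K y_i^U, \quad i = 1, \dots, N_U \\
& 0 \le w^L_{DL,i} \le K (1 - y_i^L), \quad i = 1, \dots, N_L \\
& 0 \le w^U_{DU,i} \le K (1 - y_i^U), \quad i = 1, \dots, N_U \\
& c_p^{\top} v \ge v_p^{min} \\
& w^L, w^U, w^L_{DL}, w^U_{DU} \ge 0 \\
& y_i^L \in \{0, 1\}, \quad i = 1, \dots, N_L \\
& y_i^U \in \{0, 1\}, \quad i = 1, \dots, N_U,
\end{aligned}
\tag{3.8}
\end{equation}

where $w^L \in \mathbb{R}^N$ and $w^U \in \mathbb{R}^N$ are dual variables for lower and upper bounds, respectively, $w^L_{DL} = \{w_i^L : \forall i \in Design_L\}$, $w^U_{DU} = \{w_i^U : \forall i \in Design_U\}$, $\eta^L \in \mathbb{R}^{N_L}$ and $\eta^U \in \mathbb{R}^{N_U}$ are dual variables for the modified lower and upper bounds, respectively, and $K = 100$. Critical to note here (for practical application of the algorithm) is that the combinatorial solution space of this MILP is much smaller than that of OptKnock or OptReg because we limit modifications to only those included in each strain design generated in Section 3.3.4. With this MILP formulation, we can also identify alternate optimal strain designs via integer cuts.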
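The text does not spell out the form of the integer cuts, so the sketch below assumes the standard cut for binary programs: after a design y* has been found, a linear inequality is added that excludes exactly that binary vector and nothing else. The function name and the four-variable example are hypothetical.

```python
from itertools import product

def integer_cut(y_star):
    """Return (coeffs, rhs) of the cut sum_i coeffs[i] * y[i] <= rhs.

    Assumed standard form: sum_{i: y*_i = 1} y_i - sum_{i: y*_i = 0} y_i <= |y*| - 1,
    which is violated only by y = y_star among all binary vectors.
    """
    coeffs = [1 if yi == 1 else -1 for yi in y_star]
    rhs = sum(y_star) - 1
    return coeffs, rhs

# Hypothetical previous design: modifications 0 and 2 selected
y_star = (1, 0, 1, 0)
coeffs, rhs = integer_cut(y_star)

# Enumerate all binary vectors and list those that the cut removes
excluded = [y for y in product((0, 1), repeat=len(y_star))
            if sum(c * yi for c, yi in zip(coeffs, y)) > rhs]
```

Re-solving the MILP with one such cut per design already found would return the next-best design, which is one way alternate optima with the same number of modifications can be enumerated.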

Chapter 3. EMILiO: A fast algorithm for genome-scale strain design 46

3.3.6 Modified OptReg and Local Search

We formulated a modified OptReg (Pharkya and Maranas, 2006), referred to as OptReg’,

to address the following issues: (1) the definition of up- or down-regulation requires splitting

fluxes into forward and reverse components and the strain designs generated by OptReg require

careful interpretation (Pharkya and Maranas, 2006), (2) down-regulation of a reversible reaction

should limit the magnitude of flux in both the forward and reverse directions, since the total

enzyme concentration is decreased, and (3) the definition of up- or down-regulations relative to a reference flux might not be physiologically realistic and may unnecessarily limit strain design. We thus modified the definition of up- and down-regulation and eliminated the need to split reversible reactions into forward and reverse fluxes. A reference flux distribution becomes irrelevant because we assume that the target reactions chosen by OptReg’ will be fine-tuned further, regardless of their basal values. These modified definitions also reduced the number of binary variables required for the algorithm. For a maximum of θ total genetic modifications, OptReg’ is formulated as follows:

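\begin{equation}
\begin{aligned}
\max_{\substack{v_f,\, w^{KO},\, w^L,\, w^U, \\ \eta^{DF},\, \eta^{DR},\, \eta^{UF},\, \eta^{UR}, \\ y^{KO},\, y^{D},\, y^{UF},\, y^{UR}}} \quad & c_p^{\top} T v_f \\
\text{s.t.} \quad & v_i^L \le T_i v_f \le v_i^U, \quad \forall i \in Unmod, \\
& v_i^L y_j^{KO} \le T_i v_f \le v_i^U y_j^{KO}, \quad \forall i \in KO,\ j = 1, \dots, N_{KO}, \\
& T_i v_f \le v_j^{UD} y_j^{D} + v_i^U (1 - y_j^{D}), \quad \forall i \in Forward,\ j = 1, \dots, N_{For}, \\
& v_j^{LD} y_j^{D} + v_i^L (1 - y_j^{D}) \le T_i v_f, \quad \forall i \in Reverse,\ j = 1, \dots, N_{Rev}, \\
& T_i v_f \le v_j^{UU} y_j^{UR} + v_i^U (1 - y_j^{UR}), \quad \forall i \in Reverse,\ j = 1, \dots, N_{Rev}, \\
& v_j^{LU} y_j^{UF} + v_i^L (1 - y_j^{UF}) \le T_i v_f, \quad \forall i \in Forward,\ j = 1, \dots, N_{For}, \\
& \sum_{i=1}^{N_{KO}} w_i^{KO} T_i^{KO} + \sum_{i=1}^{N} w_i^U T_i - \sum_{i=1}^{N} w_i^L T_i + \sum_{i=1}^{N_{For}} \eta_i^{DF} T_i^{For} - \sum_{i=1}^{N_{Rev}} \eta_i^{DR} T_i^{Rev} \\
& \qquad + \sum_{i=1}^{N_{Rev}} \eta_i^{UR} T_i^{Rev} - \sum_{i=1}^{N_{For}} \eta_i^{UF} T_i^{For} = (c_{bio}^{\top} - \varepsilon \cdot c_p^{\top}) T, \\
& \sum_{i=1}^{N} w_i^U v_i^U - \sum_{i=1}^{N} w_i^L v_i^L + \sum_{i=1}^{N_{DF}} \eta_i^{DF} v_i^{UD} - \sum_{i=1}^{N_{DR}} \eta_i^{DR} v_i^{LD} \\
& \qquad + \sum_{i=1}^{N_{UR}} \eta_i^{UR} v_i^{UU} - \sum_{i=1}^{N_{UF}} \eta_i^{UF} v_i^{LU} - (c_{bio}^{\top} - \varepsilon \cdot c_p^{\top}) T v_f = 0, \\
& \sum_{i=1}^{N_{KO}} y_i^{KO} + \sum_{i=1}^{N_{D}} y_i^{D} + \sum_{i=1}^{N_{UR}} y_i^{UR} + \sum_{i=1}^{N_{UF}} y_i^{UF} \le \theta, \\
& -K y_i^{KO} \le w_i^{KO} \le K y_i^{KO}, \quad i = 1, \dots, N_{KO}, \\
& 0 \le \eta_i^{DF} \le K y_i^{DF}, \quad i = 1, \dots, N_{For}, \\
& 0 \le \eta_i^{DR} \le K y_i^{DR}, \quad i = 1, \dots, N_{Rev}, \\
& 0 \le \eta_i^{UR} \le K y_i^{UR}, \quad i = 1, \dots, N_{Rev}, \\
& 0 \le \eta_i^{UF} \le K y_i^{UF}, \quad i = 1, \dots, N_{For}, \\
& 0 \le w^U_{DF,i} \le K (1 - y_i^{DF}), \quad i = 1, \dots, N_{For}, \\
& 0 \le w^L_{DR,i} \le K (1 - y_i^{DR}), \quad i = 1, \dots, N_{Rev}, \\
& 0 \le w^U_{UR,i} \le K (1 - y_i^{UR}), \quad i = 1, \dots, N_{Rev}, \\
& 0 \le w^L_{UF,i} \le K (1 - y_i^{UF}), \quad i = 1, \dots, N_{For}, \\
& c_{bio}^{\top} v \ge v_{bio}^{min}, \\
& w^L, w^U, \eta^{DF}, \eta^{DR}, \eta^{UR}, \eta^{UF} \ge 0, \\
& y_i^{KO} \in \{0, 1\}, \quad i = 1, \dots, N_{KO}, \\
& y_i^{D} \in \{0, 1\}, \quad i = 1, \dots, N_{D}, \\
& y_i^{UR} \in \{0, 1\}, \quad i = 1, \dots, N_{Rev}, \\
& y_i^{UF} \in \{0, 1\}, \quad i = 1, \dots, N_{For},
\end{aligned}
\tag{OptReg’}
\end{equation}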

where $w^{KO}$, $w^U$, $w^L$, $\eta^{DF}$, $\eta^{DR}$, $\eta^{UR}$, $\eta^{UF}$ are dual variables for the constraints corresponding to knockouts, unmodified upper and lower bounds, down-regulation of forward fluxes, down-regulation of reverse fluxes, upregulation of reverse fluxes, and upregulation of forward fluxes, respectively; $y^{KO}$, $y^{D}$, $y^{UR}$, $y^{UF}$ are binary variables for the constraints corresponding to knockouts, down-regulation, upregulation of reverse fluxes, and upregulation of forward fluxes, respectively; and $T^{KO}$, $T^{For}$, $T^{Rev}$ are the rows of $T$ corresponding to fluxes in the sets $KO$, $Forward$, and $Reverse$, respectively. These sets are defined as follows:

\begin{align*}
Unmod &= \{ i = 1, \dots, N : \text{flux } i \text{ cannot be modified} \}, \\
KO &= \{ i \notin Unmod : \text{flux } i \text{ can be knocked out} \}, \\
Forward &= \{ i \notin Unmod : v_i^{min} \ge 0,\ v_i^{max} > 0 \}, \\
Reverse &= \{ i \notin Unmod : v_i^{max} \le 0,\ v_i^{min} < 0 \},
\end{align*}

where $v_i^{min}$ and $v_i^{max}$ are the minimum and maximum values of flux $i$ found using flux variability analysis (FVA) (Mahadevan and Schilling, 2003). The sets $KO$, $Forward$ and $Reverse$ have $N_{KO}$, $N_{For}$ and $N_{Rev}$ members, respectively. The modified bounds for up- or down-regulating forward or reverse fluxes are defined in Table 3.1 and schematically described in Figure 3.1.
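The four modified-bound formulas of Table 3.1 can be written as a small helper for experimentation. The function name, the dictionary keys and the example values are hypothetical; the parameter C is taken directly from Table 3.1, but its value and admissible range are not stated in this section, so C = 0.5 below is purely illustrative.

```python
def modified_bounds(v_lo, v_hi, C):
    """Evaluate the Table 3.1 formulas for one flux with original bounds [v_lo, v_hi]."""
    return {
        # Down-regulation of a forward flux: tightened upper bound v^UD
        "down_forward_ub": 0.5 * ((1 + C) * max(v_lo, 0.0) + (1 - C) * v_hi),
        # Down-regulation of a reverse flux: tightened lower bound v^LD
        "down_reverse_lb": 0.5 * ((1 + C) * min(v_hi, 0.0) + (1 - C) * v_lo),
        # Up-regulation of a reverse flux: tightened upper bound v^UU
        "up_reverse_ub": v_lo + 0.5 * (1 - C) * (min(v_hi, 0.0) - v_lo),
        # Up-regulation of a forward flux: raised lower bound v^LU
        "up_forward_lb": v_hi + 0.5 * (1 - C) * (max(v_lo, 0.0) - v_hi),
    }

# Illustrative forward flux in [0, 10] and reverse flux in [-10, 0], with assumed C = 0.5
fwd = modified_bounds(0.0, 10.0, C=0.5)
rev = modified_bounds(-10.0, 0.0, C=0.5)
```

With these illustrative numbers, down-regulating the forward flux caps it at 2.5 and up-regulating it forces at least 7.5, while the reverse flux receives the mirrored bounds of -2.5 and -7.5.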

3.3.7 Local search implementation of modified OptReg

In order to obtain locally optimal solutions within reasonable computational time, we also developed a local search version of OptReg’, referred to as OptReg’LS. The local search method is based on GDLS, which was recently developed by Lun et al. (2009) to quickly obtain locally optimal solutions to OptKnock using genome-scale models of metabolism.

Table 3.1: Modified bound definitions for OptReg’

| Regulation | Direction | Modified bound |
|---|---|---|
| Down | Forward | $v^{UD} = 0.5[(1 + C) \max(v^L, 0) + (1 - C) v^U]$ |
| Down | Reverse | $v^{LD} = 0.5[(1 + C) \min(v^U, 0) + (1 - C) v^L]$ |
| Up | Reverse | $v^{UU} = v^L + 0.5(1 - C)[\min(v^U, 0) - v^L]$ |
| Up | Forward | $v^{LU} = v^U + 0.5(1 - C)[\max(v^L, 0) - v^U]$ |

Figure 3.1: Schematic of the definition of up- or down-regulation in OptReg’, based on modified flux bounds.

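To implement OptReg’LS, we add the following constraint to (OptReg’):

\begin{equation}
\begin{aligned}
& \sum_{i : y_i^{KO} = 0} y_i^{KO} + \sum_{i : y_i^{KO} = 1} (1 - y_i^{KO}) \\
& + \sum_{i : y_i^{D} = 0} y_i^{D} + \sum_{i : y_i^{D} = 1} (1 - y_i^{D}) \\
& + \sum_{i : y_i^{UR} = 0} y_i^{UR} + \sum_{i : y_i^{UR} = 1} (1 - y_i^{UR}) \\
& + \sum_{i : y_i^{UF} = 0} y_i^{UF} + \sum_{i : y_i^{UF} = 1} (1 - y_i^{UF}) \le \delta,
\end{aligned}
\tag{3.9}
\end{equation}

where the summation index sets are evaluated at the current strain design, and δ is the neighborhood size, which limits the number of changes allowed to the strain design at each iteration.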

We observed that at any iteration, $k$, the algorithm might cycle between two solutions. To prevent cycling, we added the following constraint:

\begin{equation}
\begin{aligned}
& \sum_{i : y_{k-1,i}^{KO} = 0} y_i^{KO} + \sum_{i : y_{k-1,i}^{KO} = 1} (1 - y_i^{KO}) \\
& + \sum_{i : y_{k-1,i}^{D} = 0} y_i^{D} + \sum_{i : y_{k-1,i}^{D} = 1} (1 - y_i^{D}) \\
& + \sum_{i : y_{k-1,i}^{UR} = 0} y_i^{UR} + \sum_{i : y_{k-1,i}^{UR} = 1} (1 - y_i^{UR}) \\
& + \sum_{i : y_{k-1,i}^{UF} = 0} y_i^{UF} + \sum_{i : y_{k-1,i}^{UF} = 1} (1 - y_i^{UF}) \ge 1
\end{aligned}
\tag{3.10}
\end{equation}

This prevents the new solution identified for iteration $k+1$ from returning to the solution found previously at iteration $k-1$.

At any iteration, the algorithm might terminate if no solution that improves the objective can

be found, subject to the neighborhood size constraint. Hence, for δ = 1, if the MILP solver

cannot find a single change to the current strain design that would improve the objective,

the solver might return the current solution and the algorithm would converge and terminate.

This situation arises when δ is small and the strain design at the current iteration can only be

improved via multiple simultaneous modifications or by first backtracking. Hence, to prevent

premature convergence for small values of δ, we added the following constraint:

∑i:yKO

i =0

yKOi +∑

i:yKOi =1

(1− yKOi )

+∑

i:yDi =0

yDi +∑

i:yDi =1

(1− yDi )

+∑

i:yURi =0

yURi +∑

i:yURi =1

(1− yURi )

+∑

i:yUFi =0

yUFi +∑

i:yUFi =1

(1− yUFi ) ≥ 1 (3.11)

This forces the MILP solver to make at least one change to the current strain design, whose binary values define the index sets in (3.11). Hence, constraints (3.10) and (3.11) force a different solution to be identified at each iteration, without returning to the previous solution. If constraints (3.10) and (3.11) make the MILP problem infeasible, or the new solution has a worse objective, this indicates that no change within the neighborhood size can be made to improve the current solution; hence, the algorithm terminates.

We note that these constraints do not prevent cycles spanning more iterations, such as cycling

back to a solution from two iterations ago. However, we have not experienced such longer cycles

during our simulations.
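The left-hand sides of constraints (3.10) and (3.11) simply count how many binary variables differ from a reference design. The following is a minimal sketch of that counting logic, not the MILP implementation itself; the function names, the dictionary layout, and the three-reaction example values are hypothetical and chosen only for illustration.

```python
KINDS = ("KO", "D", "UR", "UF")  # labels follow the superscripts in (3.10)-(3.11)


def changes_from(reference, candidate):
    """Left-hand side of (3.10)/(3.11): number of binary variables in
    `candidate` that differ from their values in `reference`."""
    total = 0
    for kind in KINDS:
        for y_ref, y_new in zip(reference[kind], candidate[kind]):
            # A variable that was 0 contributes y_new;
            # a variable that was 1 contributes (1 - y_new).
            total += y_new if y_ref == 0 else 1 - y_new
    return total


def is_admissible(candidate, current, previous):
    """True if `candidate` differs from the previous design (3.10) and
    from the current design (3.11) in at least one binary variable."""
    return (changes_from(previous, candidate) >= 1
            and changes_from(current, candidate) >= 1)


# Hypothetical three-reaction example (illustrative values only).
previous = {"KO": [0, 0, 0], "D": [0, 0, 0], "UR": [0, 0, 0], "UF": [0, 0, 0]}
current = {"KO": [1, 0, 0], "D": [0, 0, 0], "UR": [0, 0, 0], "UF": [0, 0, 0]}
step_back = previous  # undoing the knockout would cycle back
step_on = {"KO": [1, 0, 0], "D": [0, 0, 0], "UR": [0, 1, 0], "UF": [0, 0, 0]}
```

In this sketch, `step_back` is rejected because it reproduces the previous design, staying at `current` is rejected because it changes nothing, and `step_on` is accepted. As noted above, a design from three or more iterations ago would still pass both checks.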

3.3.8 Determining minimum flux magnitudes

Using EMILiO, we generated almost 200 strains for aerobic or anaerobic succinate production.

To compare the physiology of these strains to each other and to the wild-type, we determined

the minimum flux magnitude of each flux i as follows:

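\[
\begin{aligned}
\min_{v,\,r} \quad & r && \text{(3.12a)} \\
\text{s.t.} \quad & Sv = 0, && \text{(3.12b)} \\
& v_i - r \le 0, && \text{(3.12c)} \\
& -v_i - r \le 0, && \text{(3.12d)} \\
& v^{L} \le v \le v^{U}, && \text{(3.12e)} \\
& r \ge 0, && \text{(3.12f)}
\end{aligned}
\]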

where r is a non-negative variable equal to the minimum absolute value of flux, vi, at optimality.

We iteratively solved this linear program for all N fluxes for each mutant and the wild-type.

For anaerobic conditions, the upper and lower bounds of the oxygen uptake flux were set to

zero.
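The linear program (3.12) can be sketched on a toy network as follows. This is an illustrative sketch only: it uses SciPy's `linprog` rather than the CPLEX setup used in this work, and the four-reaction network, its bounds, and the function name `min_flux_magnitudes` are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog


def min_flux_magnitudes(S, v_lb, v_ub):
    """Solve (3.12) once per flux i; return the minimum |v_i| for each i."""
    m, n = S.shape
    cost = np.zeros(n + 1)
    cost[-1] = 1.0                                # minimize r, the last variable
    A_eq = np.hstack([S, np.zeros((m, 1))])       # S v = 0 (r does not enter)
    b_eq = np.zeros(m)
    bounds = list(zip(v_lb, v_ub)) + [(0.0, None)]  # v_L <= v <= v_U, r >= 0
    magnitudes = []
    for i in range(n):
        A_ub = np.zeros((2, n + 1))
        A_ub[0, i], A_ub[0, -1] = 1.0, -1.0       #  v_i - r <= 0
        A_ub[1, i], A_ub[1, -1] = -1.0, -1.0      # -v_i - r <= 0
        res = linprog(cost, A_ub=A_ub, b_ub=np.zeros(2),
                      A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
        if not res.success:
            raise RuntimeError(res.message)
        magnitudes.append(res.fun)
    return np.array(magnitudes)


# Toy network (hypothetical): uptake -> A, A -> B, B -> biomass, A -> product.
S = np.array([[1.0, -1.0, 0.0, -1.0],    # metabolite A
              [0.0, 1.0, -1.0, 0.0]])    # metabolite B
v_lb = np.array([0.0, 0.0, 1.0, 0.0])    # biomass flux must be at least 1
v_ub = np.array([10.0, 1000.0, 1000.0, 1000.0])
mags = min_flux_magnitudes(S, v_lb, v_ub)
```

In this toy case, the minimum growth requirement forces the uptake, conversion, and biomass fluxes to carry at least one unit of flux, while the product secretion flux can be zero. Comparing such vectors between a mutant and the wild-type is the basis of the analysis that follows.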

3.4 Results and Discussion

3.4.1 Comparison of the strain design algorithms

We designed succinate-producing E. coli strains grown aerobically on glucose using three algo-

rithms: EMILiO, OptReg’, and OptReg’LS. OptReg’ and OptReg’LS are the global and local

search implementations of a modified OptReg. We modified the definition of up- and down-

regulation such that unbiased exploration of the strain designs could be performed, without the

need for a reference flux distribution (Materials and Methods). We also developed OptReg’LS,


a local search implementation of OptReg’ based on GDLS.

Figure 3.2: Comparison of succinate production strains identified by EMILiO, OptReg’LS, and OptReg’. Succinate production envelopes for OptReg’, OptReg’LS, and EMILiO using the iAF1260 genome-scale model of E. coli metabolism (top). CPU times for strain design using EMILiO, OptReg’LS, and OptReg’ (bottom). OptReg’LS converged in two iterations. CPU time is shown in log scale. (Image not reproduced. Envelope panel axes: succinate production (mmol/gDW/h) and growth rate (h−1), with curves for the wild-type, EMILiO, OptReg’LS, and OptReg’ and the annotation "(Minimum growth rate, Maximal succinate production)". CPU time panel axes: percent of maximal succinate production (%) and CPU time (min).)

EMILiO was able to identify a strain having 100% of the maximal succinate production, fully coupled to its maximum growth rate, in 2 minutes (Fig. 3.2). The strain design involved a total of three modifications: deletion of succinate dehydrogenase (SUCDi) and up-regulation of fumarate reductase (FRD2) and aconitase (see Yang et al., 2011). We then examined the network-wide changes due to these modifications. We first calculated the minimal absolute flux of all reactions


under the wild-type genetic background, subject to a minimum growth rate of 0.1 h−1 (Section

3.3.8). This step was implemented to prevent ambiguity arising from alternate optimal flux

distributions. Similarly, we calculated the minimum flux magnitudes for the designed strain

and compared these results with those of the wild-type. We thus identified 75 reactions that

were forced to carry more flux in the designed strain compared to the wild-type. These reactions include isocitrate lyase (ICL), malate synthase (MALS), citrate synthase (CS), isocitrate dehydrogenase (ICDH), malate dehydrogenase (MDH), and phosphoenolpyruvate carboxylase (PPC). Incidentally, all of these reactions were shown to have increased activity in succinate-producing strains of E. coli involving SUCDi knockout, grown aerobically on glucose in chemostats (Lin et al., 2005).

We next ran OptReg’ and OptReg’LS with a regulatory strength parameter, C = 0.5, which

determines the flux value of reactions that are up- or down-regulated. First, we terminated

OptReg’ after four days and obtained a solution, which was not proven to be globally optimal.

The strain designed by OptReg’, nonetheless, produced succinate at 83.26% of the maximal

rate (Fig. 3.2). The strain involved three modifications: acetate kinase knockout, and over-

expression of PPC and fumarate reductase (FRD3). We investigated why OptReg’, which

was allowed to identify up to three modifications, did not find the superior three-modification

strain found by EMILiO. Upon inspection, we found that a strategy overexpressing fumarate

reductase and aconitase to values determined by C = 0.5 and deleting SUCDi violated the

stoichiometric constraints (Materials and Methods). This result demonstrated that potentially

better solutions might have been missed by OptReg’ and OptReg’LS due to the difficulty of

choosing an appropriate C for all reactions, prior to running the algorithms.

We then ran the local search implementation, OptReg’LS, to quickly find locally optimal solutions. Initially, OptReg’LS converged to a solution in three iterations, taking ∼4 hours (Fig.

3.2). The identified strain produced 82.13% of maximal production. Thus, OptReg’LS was able

to identify a strain having only 1.4% less production than the global search in four hours rather

than four days. This strain involved only the overexpression of FRD3, which was one of the

three modifications identified by OptReg’.
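The 1.4% figure appears to be the relative difference between the two production levels. Using only the values quoted above, the arithmetic is:

```latex
\frac{83.26 - 82.13}{83.26} = \frac{1.13}{83.26} \approx 0.0136 \approx 1.4\%
```

In absolute terms, the gap is 1.13 percentage points of the maximal production rate.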

We investigated why OptReg’LS was unable to identify the three-modification strain designed


by OptReg’, since only two more modifications were required. For this, we began with a strain

having only FRD3 overexpression and added each of the two modifications, separately. This

procedure reflects how OptReg’LS was run with a neighborhood size of δ = 1, so only single

changes could be made from the initial strain. We found that adding each modification individually could not improve production; only by adding both could production improve. In fact,

adding PPC overexpression to FRD3 overexpression slightly decreased (by 0.035%) succinate

production. This shows that the interactions amongst all possible flux modifications are nonlinear and that a local search method with small neighborhood size may fail to find improved

solutions. Increasing the neighborhood size may overcome this problem; however, we noticed

that a neighborhood size of even δ = 2 made each iteration of OptReg’LS prohibitively long,

thereby undermining the reason for using local search.
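The effect of neighborhood size can be illustrated with a toy greedy search. This is a sketch under stated assumptions, not OptReg’LS: the modification labels `m1` to `m3`, the `score` function, and its numerical values are hypothetical, constructed only so that two modifications help when applied together but not separately.

```python
from itertools import combinations

MODS = ("m1", "m2", "m3")  # hypothetical modification labels


def score(design):
    """Toy objective: m2 and m3 improve the score only in combination."""
    value = 0.0
    if "m1" in design:
        value += 82.0
    if "m2" in design and "m3" in design:
        value += 1.0
    elif "m2" in design:
        value -= 0.03      # m2 alone is slightly detrimental
    return value


def neighbors(design, delta):
    """All designs that differ from `design` in 1..delta modifications."""
    for k in range(1, delta + 1):
        for flips in combinations(MODS, k):
            yield design.symmetric_difference(flips)


def local_search(start, delta):
    """Greedy search: move to the best neighbor until none improves."""
    current = frozenset(start)
    while True:
        best = max(neighbors(current, delta), key=score)
        if score(best) <= score(current):
            return current
        current = best


stuck = local_search({"m1"}, delta=1)    # no single change improves the score
escaped = local_search({"m1"}, delta=2)  # two simultaneous changes are allowed
```

With a neighborhood size of 1 the search stays at the single-modification design, whereas a neighborhood size of 2 reaches the three-modification design. The number of neighbors grows combinatorially with the neighborhood size, which is consistent with the long iteration times noted above.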

3.4.2 Large-scale exploration of the strain design space

The computational efficiency of EMILiO allowed us to use it as an engine for large-scale ex-

ploration of numerous alternate strain designs for succinate production. A substantial body of

literature already exists for succinate overproduction strains. The genetic manipulation strate-

gies in the literature can be categorized as (1) experimentally constructed, (2) computationally

predicted, and (3) computationally predicted and experimentally validated. Here, we have

surveyed a number of strain designs identified by recent computational algorithms, and also a

portion of the experimental literature. We found that while the existing literature on computa-

tional strain designs covered a wide variety of strain designs, some regions of the design space

had not been previously explored. Furthermore, genetically defined experimental strains have

been confined to a small region of the design space (Fig. 3.3).

EMILiO identified distinctly different strain designs for anaerobic and aerobic conditions. Aerobically, knockout or inhibition of succinate dehydrogenase (SUCDi) and overexpression of fumarate reductase were predicted to be necessary for achieving 100% of the maximal production. Without these two strategies, up to ∼84% maximal production could be achieved. Anaerobically, however, SUCDi knockout or inhibition was not an important strategy. Fumarate reductase was an important strategy for an initial anaerobic strain. However, we found that strain designs not using fumarate reductase had equivalent succinate production. In these strains, fumarase or malate dehydrogenase overexpression was the most important modification. Another strain having 85% maximal anaerobic succinate production was found, for which a slight induction of malate synthase was most important. Such relationships amongst the reaction modifications are mapped in Fig. 3.4.

Figure 3.3: Summary of strategies (i.e., the individual reactions being modified) identified by EMILiO for succinate production and comparison to existing literature. While many strategies are supported by previous experimental and/or computational literature, many more unvalidated predictions have been generated in this work. Strategies were identified for aerobic, anaerobic, or both conditions. Some of the frequently used strategies are annotated. Nodes are linked if the strategies are used together frequently. (Image not reproduced. Annotated strategies: PFL, PPPGO, SUCDi, FRD2, ICL, MALS, FUM, PPCSCT, SUCOAS. Node categories: aerobic, anaerobic, both. Groups compared: EMILiO simulations; literature, experimental; literature, computational.)

Figure 3.4: The landscape of strategies for succinate production. Squares indicate modifications having a large impact on strain performance. Diamonds indicate modifications identified frequently in the 234 alternate strain designs.

For each of the 234 strains designed using EMILiO and also the wild-type, we calculated the minimum magnitude of each flux, again to avoid ambiguity due to alternate optimal flux distributions. For each strain, we then obtained an n-dimensional vector (n is the number of fluxes)

defined as the deviation in these minimum flux magnitudes, relative to those of the wild-type.

These deviation vectors can thus be used to differentiate amongst the different strain designs.

Also, the set of fluxes whose minimum magnitudes are different from those of the wild-type is similar to the MUST sets of Ranganathan et al. (2010). We then clustered the 234 alternate

strain designs based on the similarity of their minimum flux magnitude vectors, using affinity

propagation (AP) (Frey and Dueck, 2007) with damping factor of 0.9. We thus identified 15

distinct clusters of varying sizes (Fig. 3.5). The largest two clusters produced 100% of the

maximal succinate flux, and the cluster centers differed only by one modification: knockout

of methionine adenosyltransferase (METAT) versus increased reverse activity of succinyl-CoA

synthetase (SUCOAS). This resulted in significantly higher fluxes through acetate-CoA ligase

(ACCOAL), propanoyl-CoA: succinate CoA-transferase (PPCSCT), and SUCOAS in the sec-

ond cluster of strains.

We identified flux magnitudes that consistently deviated from those of the wild-type across

many of the 15 clusters (Fig. 3.5). Some of these have been experimentally validated in the

literature (Table 3.2). We also found that, in some cases, a small number of fluxes were suffi-

cient to clearly differentiate one cluster from another. For example, cluster 5 had significantly

higher deviations in glucose-1-phosphate adenylyltransferase (GLGC) and polyphosphate ki-

nase (PPKr) fluxes, compared to those of cluster 1 (Fig. 3.5). PPKr activity was not directly

modified, but the increased production of inorganic diphosphate due to increased GLGC activ-

ity led to high reverse activity of PPKr. The increased GLGC activity more tightly coupled

succinate production to growth. This cluster represented a group of anaerobic strains producing

succinate at 75.72% maximal production.

Figure 3.5: The 234 strains grouped into 15 clusters using affinity propagation. (A) Clusters are formed based on the deviation of minimum flux magnitudes, relative to those of the wild-type. These deviations represent changes in physiology of each strain. Larger rectangles represent clusters with a larger number of strain design members. (B) The fluxes that deviate consistently across the 15 strains are shown in yellow, while those fluxes distinguishing cluster 5 from cluster 1 are shown in magenta.

Strain clusters 8 and 10 provide a case study of the potential utility of the map of strain design space generated by EMILiO. The respective cluster center strains produced 89.17% and 84.75% maximal succinate. The absolute flux deviations of these clusters, relative to wild-type, show patterns distinctly different from those of the others (Fig. 3.5). In particular, both clusters have increased activities of acetyl-CoA synthetase (ACS), acetate kinase and phosphotransacetylase (ACK-PTA). Lin et al. (2006) showed that ACS overexpression could reduce

acetate accumulation during excess glucose fermentation. They also showed that under aerobic

conditions, ACS overexpression could increase the acetyl-CoA pool, and the authors hypothe-

sized that this could potentially improve product formation. Our independent computational

exploration agrees with these experimental results. Cluster 10 represents aerobic succinate

production strains with increased ACS activity. Cluster 8 represents anaerobic strains with

similarly increased ACS activity. Both strains exhibited increased PPC activity. The anaer-

obic cluster is thus consistent with experimental literature, in which the acetyl-CoA pool was

increased, together with PPC overexpression, to improve anaerobic succinate production (Lin

et al., 2004). This cluster also exhibited glyoxylate shunt activity, together with increased ACK-

PTA activity. These strategies have been experimentally implemented to improve anaerobic

succinate production by Sanchez et al. (2005). Finally, although EMILiO predicted that both ACS and ACK-PTA activities could be implemented simultaneously, ACK-PTA has a higher Km than ACS, which may limit its flux (Lin et al., 2006). Therefore, mechanistic understanding of the succinate production might be improved by incorporating detailed kinetic constraints for these and related reactions.

Table 3.2: Reactions whose minimum flux magnitude (see Section 3.3.8) deviated from that of the wild-type. Reference is made to experimental evidence.

Reaction                                        Reference(s)
Malic enzyme, NAD (ME1)                         (Jantama et al., 2008; Stols and Donnelly, 1997)
Methylglyoxal synthase (MGSA)                   (Jantama et al., 2008)
Propionate kinase (PPAKr)                       (Jantama et al., 2008)
Phosphoenolpyruvate carboxykinase (PPCK)        (Millard et al., 1996)
Acetaldehyde dehydrogenase (ACALD)              (Jantama et al., 2008; Sanchez et al., 2006; Yun et al., 2005)
Acetate kinase (ACKr)                           (Jantama et al., 2008; Lin et al., 2005; Sanchez et al., 2006; Yun et al., 2005)
Alcohol dehydrogenase, ethanol (ALCD2x)         (Jantama et al., 2008; Sanchez et al., 2006; Yun et al., 2005)
Aspartate transaminase (ASPTA)                  (Jantama et al., 2008)
Isocitrate lyase (ICL)                          (Lin et al., 2005; Sanchez et al., 2006)
D-lactate dehydrogenase (LDH-D)                 (Chatterjee et al., 2001; Jantama et al., 2008; Millard et al., 1996; Sanchez et al., 2006; Stols and Donnelly, 1997)
Malate synthase (MALS)                          (Sanchez et al., 2006)
Pyruvate formate lyase (PFL)                    (Chatterjee et al., 2001; Jantama et al., 2008; Stols and Donnelly, 1997)
Phosphoenolpyruvate carboxylase (PPC)           (Lin et al., 2005; Millard et al., 1996)
Phosphotransacetylase (PTAr)                    (Jantama et al., 2008; Sanchez et al., 2006; Yun et al., 2005)
Succinate dehydrogenase (SUCDi)                 (Lin et al., 2005)
CO2 uptake (EX-co2(e))                          (Zeikus et al., 1999)
PEP:Pyr phosphotransferase system (GLCptspp)    (Chatterjee et al., 2001; Lin et al., 2005)
NADH dehydrogenase (NADH16pp)                   (Yun et al., 2005)

3.4.3 Increasing production beyond knockout strains

Previously, Pharkya et al. (2003) explored knockout strains for amino acid production using

OptKnock. Amino acid secretion was coupled to growth but this required fixing certain ex-

change fluxes (e.g., oxygen, ammonia, etc.), in addition to knockouts. This study demonstrated

the difficulty of fundamentally coupling secretion of amino acids to growth using only gene

knockouts. More recently, Feist et al. (2010) computationally explored strains having up to 10 knockouts with the OptKnock and OptGene algorithms and the latest genome-scale model

of E. coli metabolism. Strains with high yield were found for certain product-substrate pairs;

however, some products, including L-glutamate and L-serine, could not be coupled to growth.

These studies demonstrated that strictly knockout strategies may be insufficient for growth-

coupled production of certain products. We thus investigated if strains involving fine-tuned

fluxes could be engineered to produce L-glutamate and L-serine.

An initial run of EMILiO generated a strain secreting L-glutamate at 100% of the maximal

flux. This strain design included knockout of glutamate decarboxylase to prevent conversion

of L-glutamate to 4-aminobutanoate and knockout of α-ketoglutarate dehydrogenase to direct

carbon flux towards L-glutamate production. The latter strategy has been experimentally val-

idated (Shirai et al., 2005). The strain also increased reverse activity of glutamate dehydroge-

nase (GLUDy), which would convert AKG to L-glutamate. Computationally, this strategy was

identified because increased reverse activity of GLUDy would directly increase L-glutamate

production. However, increasing in vivo reverse activity of glutamate dehydrogenase would

require a high ratio of AKG to L-glutamate concentrations, which is difficult to directly manip-

ulate. Hence, we ran EMILiO again with GLUDy removed from the list of target reactions, to

encourage the identification of strain designs incorporating a broader scope of manipulations.

The second strain identified by EMILiO secreted L-glutamate at 94% of the maximal rate.

This strain involved two knockouts, four down-regulation and three up-regulation strategies.

In contrast with the first strain, GLUDy flux was decreased, while pentose phosphate pathway

flux significantly increased. This demonstrated that at least two distinct modes of metabolism


could be used to overproduce L-glutamate. One of the non-intuitive, necessary modifications

was the removal of inorganic diphosphatase activity (PPA), which catalyzed the conversion of

inorganic diphosphate to inorganic phosphate. This strategy was used in conjunction with over-

expression of glucose-1-phosphate adenylyltransferase, which produced inorganic diphosphate.

The removal of PPA led to increased reverse polyphosphate kinase (PPKr) activity to consume

inorganic diphosphate while also generating ATP. Hence, the network-wide effects of targeting

the balance of currency metabolites was captured by the algorithm.

Similarly, we generated strains for L-serine production using EMILiO. The first included exper-

imental strategies used by Peters-Wendisch et al. (2005) for C. glutamicum, which were based

on targeted metabolic engineering. These included L-serine dehydratase knockout and over-

expression of 3-phosphoglycerate dehydrogenase. We then looked for alternate strain designs

involving a broader scope of manipulations that would be more difficult to conceive using a

targeted metabolic engineering approach. We thus ran EMILiO again with the reactions close

to L-serine production (PGCD, PSERT, PSER-L, MTHFC, and MTHFD) removed from the

list of target reactions. EMILiO identified an alternate strain that produced 99.65% maximal

L-serine, suggesting that non-intuitive strategies could be identified using the algorithm.

The search for L-glutamate and L-serine production strains demonstrated that EMILiO could

generate both experimentally validated and potentially novel strategies for amino acids, in ad-

dition to central metabolism intermediates.

3.5 Conclusions

We have used a novel computational strain design algorithm (EMILiO) for production of succinate, L-glutamate, and L-serine using the iAF1260 genome-scale model of E. coli metabolism. EMILiO was shown to be computationally efficient and capable of generating almost two hundred alternate strain designs with high succinate production (≥ 83% maximal production) using both parallel and orthogonal sequential search methods. Using EMILiO, we rapidly identified strains producing L-glutamate or L-serine at 100% of the respective maximal rates.


Strains coupling the production of these amino acids to growth could not be identified in a

previous study that investigated up to 10 knockouts using OptKnock and OptGene algorithms

(Feist et al., 2010). This shows that while knockout strategies alone can be insufficient to couple

secretion of some products to growth under certain conditions, fine-tuning fluxes may enable

such coupling. Using previous algorithms, the identification of fine-tuned flux strategies would

be significantly more complicated than solely knockout strategies; however, EMILiO was shown

to generate such strain designs with ease using a genome-scale model of metabolism.

We used EMILiO as an efficient engine for exploring the strain design space of growth-coupled

succinate production using the latest genome-scale model of E. coli metabolism. The resulting

map elucidated interactions amongst over 100 genetic manipulation strategies for succinate pro-

duction. We intend to bring such large-scale, predictive maps of the strain design space closer to

the workbench of metabolic engineers. Owing to the speed of EMILiO, the entire map can then

be re-drawn in the future with incorporation of new experimental data. This can accelerate

both model refinement and elucidation of mechanisms relevant for product formation.

Chapter 4

Genome-scale robust strain design

4.1 Abstract

Cell metabolism is an important platform for sustainable biofuel, chemical and pharmaceutical

production but its complexity presents a major challenge for scientists and engineers. Although

in silico strains have been designed in the past with predicted performances near the theoretical

maximum, their real-world performances are often sub-optimal. We argue that model-based,

genome-scale designs need to consider realistic perturbations for improved performance. Here

we demonstrate, using ∼100 in silico succinate overproduction strains, that predicted yields

vary widely when intracellular and environmental perturbations are included in the model.

We show that mechanisms for improving robustness of naturally evolved organisms can be

identified to help design robust engineered strains. Furthermore, we find that redundancy, a

robustness-enhancing strategy ubiquitous in complex systems, may either improve or undermine

robustness, depending on the magnitude of perturbations. With a deeper understanding and a

more fruitful exploitation of robustness, we believe that more robust strain designs are possible.

4.2 Introduction

Naturally evolved systems exhibit robustness against a variety of perturbations, from intracel-

lular noise in protein translation rates (Becskei and Serrano, 2000) to temporal variations in the

weather (Tilman et al., 2006). By virtue of robust design principles (Morari and Zafiriou, 1989),


Chapter 4. Genome-scale robust strain design 64

engineered systems display comparable robustness (Csete and Doyle, 2002; Kitano, 2004). The

finding that biological systems naturally acquire the same robustness-enhancing mechanisms as

those utilized in engineered systems (Yi et al., 2000) suggests that it may be possible to system-

atically design robustness in engineered cells. In particular, robust microbial and mammalian

strains are required for the environmentally sustainable and economically viable production of

chemicals, fuels, and pharmaceuticals.

As with engineered systems, predictive models can accelerate the design of robust biological

systems. Currently, computational models of cell metabolism (Orth et al., 2010), and strain

design algorithms (Burgard et al., 2003; Ranganathan et al., 2010; Kim and Reed, 2010; Yang

et al., 2011) are being developed actively, in order to alleviate the bottlenecks encountered in

traditional approaches to strain design (Chen et al., 2010). However, the prediction of strain

performance in response to perturbations in various environmental and intracellular processes

has been explored only to a limited extent for knockout strains (Cox et al., 2006; Tepper and

Shlomi, 2010). Such studies are even more lacking for strains involving optimal target fluxes,

in which perturbations to intracellular (e.g., transcriptional or post-translational regulation) or

environmental (e.g., substrate and oxygen concentrations) processes can cause in vivo flux of

targeted reactions to deviate from their predicted, optimal levels.

Here, we present a novel computational framework to incorporate the effects of genetic and en-

vironmental perturbations, as well as model parameter uncertainties into computational strain

design. We anticipate that the results of our analysis will generate a new category of in silico

strains: those optimized for balanced performance and robustness against specific perturbations

and uncertainties.

4.3 Robust strain design

We now present our computational framework for designing robust overproduction strains. We

model production rate, or flux, of the target metabolite as a random variable with mean µ and standard deviation σ. We desire a high µ and low σ. Thus, we define robustness, R, as

\[
R = 1 - \frac{\sigma}{\mu}. \tag{4.1}
\]


R is defined only for non-zero µ. It has a maximum value of 1 under nominal conditions (σ = 0),

and it has an unbounded minimum. The coefficient of variation (σ/µ) contained in this defi-

nition is also termed the stability coefficient in ecology (Tilman et al., 2006), while its inverse

(µ/σ) is termed the signal-to-noise ratio in imaging (McGibney and Smith, 1993). In addition,

σ/µ is related to sensitivity measures used to estimate the change in a variable in response to a

change in a parameter (Saltelli et al., 2000). The simple definition of robustness adopted here

is useful for engineering applications. For an alternative definition of robustness in biological systems, see Kitano (2007).
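As a small worked example of definition (4.1), robustness can be estimated from a set of sampled production fluxes. This is a minimal sketch: the function name, the sample values, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def robustness(production_fluxes):
    """R = 1 - sigma/mu from (4.1), estimated from sampled production fluxes.

    Assumption: sigma is taken as the population standard deviation.
    """
    mu = mean(production_fluxes)
    if mu == 0:
        raise ValueError("R is defined only for non-zero mean production")
    sigma = pstdev(production_fluxes)
    return 1 - sigma / mu


# Hypothetical samples (mmol/gDW/h), for illustration only.
r_nominal = robustness([10.0, 10.0, 10.0])  # no variation, so sigma = 0
r_perturbed = robustness([8.0, 12.0])       # mu = 10, sigma = 2
```

The first call returns the maximum value of 1, and the second returns 0.8, since σ/µ = 2/10.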

A system that is robust to one perturbation may not be robust, or may be fragile, to oth-

ers. Thus, we consider the perturbations and uncertainties listed in Table 4.1. To maximize

metabolite production, metabolic engineering often requires targeted genes to be expressed

at optimal levels to control flux through key pathways (Alper et al., 2005; Lee et al., 2007).

However, these fluxes are perturbed by various factors and are known to deviate from their

optimal values, in vivo. At a minimum, gene expression noise results in random perturbations

to these fluxes (Wang and Zhang, 2011); therefore, we have included the random perturbation

of controlled fluxes in all of our simulations. In addition, we consider variations in glucose and

oxygen uptake fluxes, the secretion of byproducts due to these variations, the re-consumption

of these byproducts, and osmotic stress response. These are some of the major perturbations

encountered by engineered strains in industrial-scale bioreactors (Enfors et al., 2001).

Table 4.1: Perturbations and model uncertainties investigated

Perturbations
- Flux variations due to gene expression noise
- Variation in substrate and oxygen uptake fluxes
- Re-consumption of overflow byproducts
- Osmotic stress response

Parameter uncertainties
- Parameters involved in the molecular crowding constraint (Beg et al., 2007)
- Parameters involved in the membrane occupancy constraint (Zhuang et al., 2011)

Recently, the constraint-based modeling framework has been extended to include additional metabolic constraints, including a limit on the total enzyme concentration in the cell (molecular crowding (Beg et al., 2007)), and a limit on the total concentration of membrane-bound enzymes

(membrane occupancy (Zhuang et al., 2011)). These constraints have been shown to affect key

physiological features including catabolite repression and overflow metabolism. Although these

constraints improve the accuracy of model predictions, they require the estimation of many

additional parameters. Therefore, we assessed the sensitivity of strain design to uncertainty in

these model parameters.

4.4 Materials and Methods

4.4.1 Flux balance analysis, model reduction, and in silico strain design

verification

The distribution of metabolic reaction fluxes was simulated using Flux Balance Analysis (FBA)

(Varma and Palsson, 1994). In FBA, the reaction network stoichiometry is defined in a matrix,

$S \in \mathbb{R}^{m \times n}$, where the m rows correspond to metabolites and the n columns correspond to fluxes. The rank, r, of S is less than m; hence, we can separate the free and pivot variables in the reduced row echelon form of S and formulate a reduced FBA problem as below:

\[
\begin{aligned}
\max_{v_f} \quad & c^{T} T v_f = v_{bio} - \varepsilon\, v_{prod} && \text{(4.2a)} \\
\text{s.t.} \quad & v^{L} \le T v_f \le v^{U}, && \text{(4.2b)}
\end{aligned}
\]

where $v_f \in \mathbb{R}^{N-r}$ are the free flux variables (N = n is the number of fluxes), $v^{L} \in \mathbb{R}^{N}$ and $v^{U} \in \mathbb{R}^{N}$ are the vectors of minimum and maximum fluxes, respectively, $T \in \mathbb{R}^{N \times (N-r)}$ is defined such that $v = T v_f$, and c is the objective vector. Here, ε = 0.001 is used to add a small weighted minimization of the product

flux because alternate optima in the solution of this linear program (LP) might lead to a range

of product flux when growth rate (vbio) is maximized. We implemented this reduced FBA in

EMILiO, as described in (Yang et al., 2010a, 2011).
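The reduced FBA problem (4.2) can be sketched on a toy network as follows. This is an illustration under stated assumptions, not the EMILiO implementation: T is obtained here as an orthonormal null-space basis of S from SciPy, rather than from the reduced row echelon form described above, and the four-reaction network, its bounds, and the function name `reduced_fba` are hypothetical.

```python
import numpy as np
from scipy.linalg import null_space
from scipy.optimize import linprog


def reduced_fba(S, v_lb, v_ub, c):
    """Maximize c^T T v_f subject to v_L <= T v_f <= v_U; return v = T v_f."""
    T = null_space(S)                    # columns span {v : S v = 0}
    k = T.shape[1]                       # number of free variables (N - r)
    A_ub = np.vstack([T, -T])            #  T v_f <= v_U  and  -T v_f <= -v_L
    b_ub = np.concatenate([v_ub, -v_lb])
    res = linprog(-(c @ T), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * k, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return T @ res.x


# Toy network (hypothetical): uptake -> A, A -> B, B -> biomass, A -> product.
S = np.array([[1.0, -1.0, 0.0, -1.0],    # metabolite A
              [0.0, 1.0, -1.0, 0.0]])    # metabolite B
v_lb = np.zeros(4)
v_ub = np.array([10.0, 1000.0, 1000.0, 1000.0])
eps = 0.001
c = np.array([0.0, 0.0, 1.0, -eps])      # v_bio - eps * v_prod
v = reduced_fba(S, v_lb, v_ub, c)
```

Because every v of the form T v_f satisfies Sv = 0 by construction, the mass balance equalities do not need to appear in the linear program. In this toy case all of the substrate is routed to biomass, and the small penalty on the product flux selects the solution with zero product secretion.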

The core biomass reaction in the iAF1260 model was used to simulate cell growth. We defined

the nominal model to have an uptake rate of 20 mmol/gDW/h for both glucose and oxygen, to

reflect experimentally observed uptake rates (Varma et al., 1993). We computed the nominal

maximum succinate production rate, $v_{prod}^{max}$ = 32.25 mmol/gDW/h, by maximizing succinate flux


subject to these uptake constraints, and a minimum required growth rate of 0.1 h−1. Similarly,

the nominal maximum L-serine production rate was 39.52 mmol/gDW/h.

We reduced the number of target reactions for modification to eliminate target reactions sus-

pected not to be experimentally implementable. Non-gene associated reactions were excluded

from the target reactions based on the gene-protein-reaction mappings in the iAF1260 model.

We also removed reactions that either were essential for growth or significantly reduced growth. These

reactions are described in (Feist et al., 2010) and include reactions involved in cell enve-

lope biosynthesis, glycerophospholipid metabolism, inorganic ion transport and metabolism,

lipopolysaccharide biosynthesis and recycling, membrane lipid metabolism, murein biosynthe-

sis, murein recycling, inner membrane transport, outer membrane transport, and outer mem-

brane porin transport.

Each strain design identified by EMILiO was verified, in silico, by implementing the strategies

into an FBA simulation. This step was implemented to ensure that numerical difficulties asso-

ciated with solving the large-scale mixed-integer linear program (MILP) problems did not lead

to solutions that violated the constraints of the optimization problems.

All code was implemented in MATLAB (The Mathworks, Inc., Natick, MA). CPLEX 12.1 was

used to solve the LPs and MILPs using the CPLEXINT MATLAB interface. All simulations

were run on AMD Opteron 2.4 GHz processors.

4.4.2 EMILiO

EMILiO is a computational algorithm that couples biochemical production to growth by quan-

titatively optimizing a set of target fluxes (Yang et al., 2010a, 2011). EMILiO is formulated as

the following bilevel optimization problem:

\[
\begin{aligned}
\max_{v^{L},\,v^{U}} \quad & c_p^{T} \cdot T v_f \\
\text{s.t.} \quad & \max_{v_f} \quad c^{T} \cdot T v_f - \varepsilon \cdot c_p^{T} \cdot T v_f \\
& \quad \text{s.t.} \quad v^{L} \le T v_f \le v^{U} \\
& v_{bio} \ge v_{bio}^{min},
\end{aligned}
\tag{4.3}
\]

where $v_{bio}^{min}$ is the minimum required growth rate, and the inner optimization is the reduced FBA formulation (4.2) with the additional objective of minimizing production rate. $\varepsilon = 0.001$ was chosen so that the maximum growth rate was not affected by minimization of production.

Using the Karush-Kuhn-Tucker (KKT) conditions, this bilevel optimization problem is refor-

mulated into a single-level mathematical program with complementarity constraints (MPCC)

(Yang et al., 2008) as follows:

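\[
\begin{aligned}
\max_{x} \quad & c_p^{T} \cdot T v_f && \text{(4.4a)} \\
\text{s.t.} \quad & w_i^{L} \mu_i^{L} + w_i^{U} \mu_i^{U} = 0, \quad i = 1, \ldots, N && \text{(4.4b)} \\
& T v_f + \mu^{U} = v^{U} && \text{(4.4c)} \\
& T v_f - \mu^{L} = v^{L} && \text{(4.4d)} \\
& {w^{U}}^{T} - {w^{L}}^{T} = c^{T} \cdot T - \varepsilon \cdot c_p^{T} \cdot T && \text{(4.4e)} \\
& v_{bio} \ge v_{bio}^{min} && \text{(4.4f)} \\
& w^{L},\, w^{U},\, \mu^{L},\, \mu^{U} \ge 0 && \text{(4.4g)}
\end{aligned}
\]

where $\mu^{L} \in \mathbb{R}^{N}$ and $\mu^{U} \in \mathbb{R}^{N}$ are slack variables for the lower and upper bounds, respectively, and $x = [v_f, v^{U}, v^{L}, \mu^{U}, \mu^{L}, w^{U}, w^{L}]^{T}$. This MPCC is solved in three stages as described in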

(Yang et al., 2010a, 2011). Briefly, a successive linear program (SLP), or iterative linear program

(ILP) (Baker and Lasdon, 1985; Bullard and Biegler, 1991) is formulated to identify a large set

of reaction modifications. This set is then recursively reduced to subsets using LP. Finally, an

MILP is applied to each of the resulting subsets to find alternate minimal sets.
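As a concrete reading of constraints (4.4b) to (4.4d) and (4.4g), the following minimal sketch computes the bound slacks for a candidate flux vector and checks complementarity. The function names, the tolerance, and the toy numbers are illustrative assumptions and are not part of EMILiO; the stationarity condition (4.4e) is not checked here.

```python
def complementarity_residuals(Tv, v_lo, v_up, w_lo, w_up):
    """Slacks and per-flux residuals of (4.4b) for fluxes Tv = T * v_f."""
    mu_up = [u - v for v, u in zip(Tv, v_up)]   # (4.4c): T v_f + mu^U = v^U
    mu_lo = [v - l for v, l in zip(Tv, v_lo)]   # (4.4d): T v_f - mu^L = v^L
    resid = [wl * ml + wu * mu
             for wl, ml, wu, mu in zip(w_lo, mu_lo, w_up, mu_up)]
    return mu_lo, mu_up, resid


def satisfies_complementarity(Tv, v_lo, v_up, w_lo, w_up, tol=1e-9):
    """True if slacks and multipliers are non-negative and (4.4b) holds."""
    mu_lo, mu_up, resid = complementarity_residuals(Tv, v_lo, v_up, w_lo, w_up)
    nonneg = all(x >= -tol for x in mu_lo + mu_up + list(w_lo) + list(w_up))
    return nonneg and all(abs(r) <= tol for r in resid)


# Toy example: flux 1 sits at its upper bound, so only w^U_1 may be non-zero.
ok = satisfies_complementarity([2.0, 5.0], [0.0, 0.0], [2.0, 10.0],
                               [0.0, 0.0], [0.7, 0.0])
# A multiplier on an inactive bound (flux 2) violates (4.4b).
bad = satisfies_complementarity([2.0, 5.0], [0.0, 0.0], [2.0, 10.0],
                                [0.0, 0.0], [0.7, 0.3])
```

In this toy case `ok` is true and `bad` is false, which mirrors the role of (4.4b): a multiplier can be non-zero only when the corresponding bound is active.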

Once the MPCC above was solved, we implemented a number of post-processing steps to ensure

that the strains were not affected by numerical error. First, EMILiO sometimes found strategies

that optimized fluxes to very low levels. While many of these were valid inhibition strategies,

we asked whether some could be replaced by a knockout instead without reducing production.

If an inhibition can indeed be replaced by a knockout without affecting production, then the

inhibition and knockout are alternate optimal strategies. If the knockout actually improves

production, then we can conclude that EMILiO had converged to a local optimum, whereas if

production decreases, then we must quantify the decrease in production. Then, we must assess

whether the greater ease of genetic manipulation through implementing a knockout instead of

an inhibition justifies the decrease in production. Accordingly, for every strain, we replaced each

manipulation that optimized flux to less than 0.1 mmol/gDW/h with a knockout. If production

increased or remained the same, we kept the knockout modification.

Second, manipulations were sometimes found that increased production by only a small amount.

We thus removed all manipulations from each strain that increased production by less than

0.001% of the maximal flux.
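The two post-processing rules above can be sketched as follows. This is an illustration under stated assumptions, not the thesis code: `production` is a hypothetical callback that returns the product flux of a design (for example, from an FBA run), a design is represented as a dictionary from reaction name to target flux with 0.0 meaning knockout, and the toy evaluator uses made-up numbers.

```python
KNOCKOUT_THRESHOLD = 0.1          # mmol/gDW/h, threshold quoted in the text
MIN_GAIN_FRACTION = 0.001 / 100   # 0.001% of the maximal flux


def postprocess(design, production, max_flux):
    """Apply the knockout-replacement and negligible-gain rules to a design."""
    design = dict(design)
    # Rule 1: replace near-zero inhibitions by knockouts if production
    # increases or stays the same.
    for rxn, flux in list(design.items()):
        if 0.0 < abs(flux) < KNOCKOUT_THRESHOLD:
            trial = dict(design)
            trial[rxn] = 0.0
            if production(trial) >= production(design):
                design = trial
    # Rule 2: drop manipulations that add less than 0.001% of the maximal flux.
    for rxn in list(design):
        without = {r: f for r, f in design.items() if r != rxn}
        gain = production(design) - production(without)
        if gain < MIN_GAIN_FRACTION * max_flux:
            design = without
    return design


def toy_production(design):
    """Made-up evaluator: R1 matters, R2 contributes almost nothing."""
    total = 10.0
    if "R1" in design:
        total += 20.0 - design["R1"]   # lower R1 flux gives more product
    if "R2" in design:
        total += 1e-6
    return total


cleaned = postprocess({"R1": 0.05, "R2": 3.0}, toy_production, max_flux=32.25)
```

Here the R1 inhibition at 0.05 mmol/gDW/h is replaced by a knockout because production does not drop, and R2 is removed because its contribution is below the threshold, leaving `{"R1": 0.0}`.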

4.4.3 Strain design using EMILiO

Using the EMILiO algorithm (Yang et al., 2010a, 2011), we started by generating a total of

112 alternative strains under nominal conditions (i.e., glucose and oxygen uptake rates of 20

mmol/gDW/h). Initially, 73% of the 112 strains achieved at least 99% maximal succinate yield

under nominal conditions. EMILiO, being a local optimization algorithm, does not guarantee

global optimality. Hence, we implemented an additional step to try and improve the nominal

performance of the thirty strains that did not achieve 99% nominal performance. We used the

sensitivity analysis procedure (see Section 4.4.6), with glucose

and oxygen uptake rates fixed to their nominal values. We sampled 1,000 feasible random fluxes

for each controlled reaction in each of the thirty strains with nominal performance below 99%

maximal yield.

The results of random sampling showed that despite being a local optimization algorithm,

EMILiO very often found the globally optimal fine-tuning levels, in addition to identifying the

optimal set of manipulated reactions. Only two of the 30 strains showed improved succinate

production. One strain improved by 3 percentage points, from 89% to 92% of maximal succinate flux. This was achieved by replacing a succinate dehydrogenase (SUCDi) inhibition with a knockout and adjusting the fine-tuned levels of other reactions appropriately. Another strain improved by 15 percentage points (about 18% in relative terms), from 82% to 97% of maximal succinate flux. This strain already had a SUCDi knockout and fine-tuning of menaquinone-dependent fumarate reductase (FRD2) and malate synthase (MALS).

The large improvement in succinate flux was achieved solely by further adjusting fine-tuned

levels of FRD2 and MALS. The sensitivity analysis further showed that at maximal nominal

performance, MALS fine-tuning became irrelevant because it was not an active constraint. We then

inquired if MALS fine-tuning, which did not improve nominal performance, could be used to

improve robust performance. We thus constructed another strain consisting only of SUCDi

knockout and FRD2 fine-tuning. After these additional steps, we had a total of 114 alternative

strains with nominal performances ranging from 75% to 100% of the maximum nominal per-

formance.

We found that amongst these strains, some included the inhibition of SUCDi, rather than

its deletion. Initial trials of sensitivity analysis indicated that perturbations to SUCDi levels

severely decreased robustness. Therefore, in all instances of SUCDi inhibition, we replaced the

modification with deletion of SUCDi. We then re-optimized the other controlled fluxes to max-

imize succinate production. After this step, we removed strains that were equivalent. Finally,

we had 98 unique strains. We note that if two strains control different fluxes that are part

of the same set of fully coupled fluxes (Burgard et al., 2004), the strains may be functionally

equivalent.

4.4.4 Escaping from local optima

The first stage of EMILiO involves solution of an SLP, which converges to a solution quickly,

but does not guarantee global optimality. We thus developed a procedure, described below, to

search for potentially better local optima in the vicinity of the solution identified by the SLP.

This procedure is initiated if the SLP converges to a solution that does not satisfy either the

KKT conditions or the metabolite production threshold levels.

First, each bilinear term in (4.4b) is replaced by the McCormick relaxation (McCormick, 1976).

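This procedure is achieved by introducing a new variable, say, $z_i^{L} = w_i^{L} \mu_i^{L}$, and constraining $z_i^{L}$ as follows:

\[
\begin{aligned}
z_i^{L} &\ge (w_i^{L})^{L} \mu_i^{L} + (\mu_i^{L})^{L} w_i^{L} - (w_i^{L})^{L} (\mu_i^{L})^{L}, \\
z_i^{L} &\ge (w_i^{L})^{U} \mu_i^{L} + (\mu_i^{L})^{U} w_i^{L} - (w_i^{L})^{U} (\mu_i^{L})^{U}, \\
z_i^{L} &\le (w_i^{L})^{U} \mu_i^{L} + (\mu_i^{L})^{L} w_i^{L} - (w_i^{L})^{U} (\mu_i^{L})^{L}, \\
z_i^{L} &\le (w_i^{L})^{L} \mu_i^{L} + (\mu_i^{L})^{U} w_i^{L} - (w_i^{L})^{L} (\mu_i^{L})^{U},
\end{aligned}
\tag{4.5}
\]

where $(w_i^{L})^{L}$, $(w_i^{L})^{U}$, $(\mu_i^{L})^{L}$, $(\mu_i^{L})^{U}$ are the lower and upper bounds of $w_i^{L}$ and $\mu_i^{L}$, respectively.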

Accordingly, the relaxation is a function of the lower and upper bounds on each of the variables.

For different bounds, the optimum of the convex relaxation may differ. Hence, we generated a

set of relaxed problems for each local optimum. Each problem involves different bounds for the

relaxed bilinear constraints. For example,

\[
\begin{aligned}
(w^{L})^{L}_{j} &= w^{L}_{k} - \phi_j \left( w^{L}_{k} - (w^{L})^{min} \right), \\
(w^{L})^{U}_{j} &= w^{L}_{k} + \phi_j \left( (w^{L})^{max} - w^{L}_{k} \right),
\end{aligned}
\]

where $w^{L}_{k}$ is the value of $w^{L}$ at the local optimum at iteration $k$, and $(w^{L})^{min}$ and $(w^{L})^{max}$ are the minimum and maximum values for $w^{L}$, respectively, calculated using Flux Variability Analysis (FVA) (Mahadevan and Schilling, 2003). The vector $\phi$ can be of any length and can be a random or deterministic sequence of numbers between 0 and 1. In this work, we chose $\phi = \{0.1, 0.3, 0.5, 0.7, 0.9\}$. This deterministic sequence was chosen to ensure the reproducibility of our solutions.
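To make the dependence of the relaxation on the bounds concrete, the sketch below builds the box around a local optimum for each value of the sequence and evaluates the interval that the four inequalities in (4.5) allow for the bilinear term. The function names and the numerical ranges are illustrative assumptions; in the procedure described above the ranges come from FVA.

```python
PHI = [0.1, 0.3, 0.5, 0.7, 0.9]   # sequence used in the text


def box_around(x_k, x_min, x_max, phi):
    """Bounds around the local optimum x_k, scaled by phi in [0, 1]."""
    lower = x_k - phi * (x_k - x_min)
    upper = x_k + phi * (x_max - x_k)
    return lower, upper


def mccormick_interval(w, mu, w_box, mu_box):
    """Interval allowed for z = w * mu by the McCormick inequalities (4.5)."""
    (w_lo, w_hi), (mu_lo, mu_hi) = w_box, mu_box
    under = max(w_lo * mu + mu_lo * w - w_lo * mu_lo,
                w_hi * mu + mu_hi * w - w_hi * mu_hi)
    over = min(w_hi * mu + mu_lo * w - w_hi * mu_lo,
               w_lo * mu + mu_hi * w - w_lo * mu_hi)
    return under, over


# Toy local optimum (w_k, mu_k) = (1, 2) with assumed ranges [0, 4] and [0, 5].
widths = []
for phi in PHI:
    w_box = box_around(1.0, 0.0, 4.0, phi)
    mu_box = box_around(2.0, 0.0, 5.0, phi)
    under, over = mccormick_interval(1.0, 2.0, w_box, mu_box)
    widths.append(over - under)   # the true product 1 * 2 lies in [under, over]
```

Smaller values in the sequence give tighter boxes and therefore a tighter envelope around the true product, while larger values let the relaxed problem explore points farther from the current local optimum.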

If an improved solution is found, then this procedure is repeated from that solution until the

termination criterion is satisfied. Overall, the procedure terminates under two conditions: (1)

when a solution satisfying KKT and production requirements is found, or (2) when a solution

with a better objective value cannot be found. In the latter case, we conclude that no strain

can be found for the given conditions and terminate this run of EMILiO.

4.4.5 Generating alternate strain designs

We generated alternate strain designs using the following procedure:

1. Obtain the initial solution, which is the optimum to the convex relaxation (4.5) subject

to wild-type flux bounds.

2. Run EMILiO starting from the initial solution found in Step 1 and a set of reactions that

are allowed to be manipulated. This set is initially defined by the user, but automatically

changes in subsequent iterations, as described below.

3. If EMILiO identifies a strain design that meets the production criterion and the KKT

conditions within a tolerance level, save the design and continue. Otherwise, quit the

procedure.

4. Rank each reaction manipulation in the strain design according to its contribution to

production. That is, the reactions that result in a greater reduction in production when

removed from the set of manipulated reactions are ranked higher.

5. Remove the highest ranking reaction manipulation (as determined at step 4), from the

set of reactions available for manipulation.

6. Return to step 1 and continue if the number of iterations has not exceeded the user-defined

maximum.
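The six steps above can be sketched as a loop. The callbacks `initial_solution`, `run_emilio`, and `contribution` are hypothetical stand-ins for the relaxation solve, the EMILiO run, and the production-loss evaluation; the toy scores are made up and only exercise the control flow.

```python
def generate_alternate_designs(initial_solution, allowed, run_emilio,
                               contribution, max_iter=10):
    """Collect alternate designs by excluding the top-ranked manipulation."""
    allowed = set(allowed)
    designs = []
    for _ in range(max_iter):                      # Step 6: iteration cap
        start = initial_solution()                 # Step 1
        design = run_emilio(start, allowed)        # Step 2
        if design is None:                         # Step 3: no valid design
            break
        designs.append(design)
        # Step 4: rank by loss in production when a manipulation is removed.
        ranked = sorted(design, key=lambda r: contribution(design, r),
                        reverse=True)
        allowed.discard(ranked[0])                 # Step 5
    return designs


# Toy stand-ins (made-up scores, not model results).
SCORES = {"R1": 3.0, "R2": 2.0, "R3": 1.0}


def toy_emilio(start, allowed):
    """Pretend a valid design needs the two best allowed reactions."""
    picks = sorted(allowed, key=SCORES.get, reverse=True)[:2]
    return frozenset(picks) if len(picks) == 2 else None


designs = generate_alternate_designs(lambda: None, SCORES, toy_emilio,
                                     lambda design, rxn: SCORES[rxn])
```

With these toy inputs the loop returns two designs, {R1, R2} and then {R2, R3}, and stops when too few reactions remain. Excluding the highest-ranked manipulation at each pass is what forces later runs toward different designs.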

This procedure was used to efficiently generate 98 different succinate production strains, as

well as three L-serine strains. Of the 98 succinate strains, three were chosen for further anal-

ysis. These three strains consisted of (following the reaction notation in the iAF1260 model)

SUCDi deletion, and one to three additional controlled fluxes. These fluxes were FRD2, MALS,

and AKGDH, which controlled flux through the reductive TCA cycle, glyoxylate shunt, and

oxidative TCA cycle, respectively. The exact flux values for the nominal condition (i.e., no

perturbations) are defined in Appendix A.1, which lists all succinate strain definitions. See

strains 12, 83, and 95, which are referred to as succinate strains I, II, and III, respectively in

this chapter.

4.4.6 Sensitivity analysis of a strain design

Here, we define a robust strain as one that maintains a high production rate despite random

perturbations arising from gene expression noise, industrially-relevant perturbations, and uncer-

tainties in model parameters (Table 4.1). We assume that deleted reactions are not perturbed

since they carry no flux, while the other controlled fluxes (i.e., activated or inhibited) are per-

turbed by gene expression noise.

To assess the sensitivity of production to flux perturbations and model uncertainties, we perform

the following sensitivity analysis for Nsamples random samples:

1. Determine feasible flux ranges for the set of perturbed flux bounds (lower or upper bounds)

by applying flux variability analysis (Mahadevan and Schilling, 2003) to the corresponding

reactions. If robustness against model parameter uncertainty is being assessed, define the

ranges for the uncertain parameters.

2. Set Nfeas = 0.

3. Generate a random vector of the perturbed flux bounds from a uniform random distribu-

tion within the feasible ranges determined at Step 1.

4. If robustness against model parameter uncertainty is being assessed, generate a random

vector of parameter values from a uniform random distribution within the defined range

(determined at Step 1) of parameter values.

5. Define an FBA problem that is subject to the perturbed flux bounds. If a perturbed

flux bound is a lower bound (i.e., for activated forward flux, inhibited reverse flux, or

limitation on nutrient uptake) then fix the lower bound to the randomly sampled value.

If a perturbed flux bound is an upper bound (i.e., for inhibited forward flux, or activated

reverse flux), then fix the upper bound to the randomly sampled value.

6. If robustness against model parameter uncertainty is being assessed, add the appropriate

constraints (i.e., molecular crowding or membrane occupancy) to the FBA problem defined

above. Fix the uncertain parameter values to the random values determined at Step 4.

7. Solve the FBA problem defined above to maximize biomass synthesis flux. Subsequently,

minimize product flux subject to the maximum biomass synthesis flux.

8. If the FBA problem is feasible, keep the solution and set Nfeas = Nfeas + 1. If the FBA

problem is infeasible, then reject the sample.

9. Repeat Steps 3-8 until the desired number of samples is collected (i.e., Nfeas = Nsamples).
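The sampling loop in Steps 2 to 9 can be sketched as a rejection sampler. The callback `solve_fba` is a hypothetical wrapper around Steps 5 to 7 that returns the product flux at maximum growth, or `None` when the problem is infeasible; the `max_tries` guard and the toy feasibility rule are assumptions added for illustration.

```python
import random


def sensitivity_samples(ranges, solve_fba, n_samples, seed=0,
                        max_tries=100000):
    """ranges: dict name -> (low, high) feasible range, as from Step 1."""
    rng = random.Random(seed)
    kept = []                                    # Step 2: N_feas = 0
    tries = 0
    while len(kept) < n_samples:                 # Step 9
        if tries >= max_tries:
            raise RuntimeError("too many infeasible samples")
        tries += 1
        bounds = {name: rng.uniform(lo, hi)      # Steps 3-4: uniform draws
                  for name, (lo, hi) in ranges.items()}
        product = solve_fba(bounds)              # Steps 5-7
        if product is not None:                  # Step 8: keep feasible only
            kept.append(product)
    return kept


def toy_fba(bounds):
    """Made-up stand-in: infeasible when the two bounds are jointly too large."""
    if bounds["R1"] + bounds["R2"] > 15.0:
        return None
    return bounds["R1"]


samples = sensitivity_samples({"R1": (0.0, 10.0), "R2": (0.0, 10.0)},
                              toy_fba, n_samples=200)
```

Because infeasible draws are rejected, the retained samples are not uniform over the full box of bounds, which is consistent with the sampled space being a subspace of the one defined by stoichiometry and flux bounds alone.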

The solution space that we sampled is a subspace of the convex space defined only by stoichiom-

etry and flux bounds. The additional constraint of optimal growth and random variations in

the flux bounds themselves make this solution space nonlinear and furthermore, non-convex.

Sampling could thus not be performed using artificial centering hit-and-run (ACHR) (Kaufman and Smith, 1998), a popular choice for sampling convex solution spaces in constraint-based modeling (Schellenberger and Palsson, 2009).

4.4.7 Determining the perturbation size

We define the solution space in which controlled reactions can take on any flux within their feasible ranges as $V = \{v \in \mathbb{R}^{n} : Sv = 0,\ v^{L} \le v \le v^{U}\}$. We also define $\phi(\varepsilon) = \{v \in V : v_i^{*} - \varepsilon(v_i^{*} - v_i^{L}) \le v_i \le v_i^{*} + \varepsilon(v_i^{U} - v_i^{*}),\ i \in MOD\}$, where $MOD$ is the set of fluxes that are controlled. Thus, $\phi(\varepsilon)$ represents the solution space in which the controlled fluxes deviate from their optimal values, $v_i^{*}$, by a fraction, $\varepsilon$. To assess how robust performance changed as a function of the perturbation size, we defined a metric of perturbation size, $\delta(\varepsilon)$, as follows:

\[
\delta(\varepsilon) = \frac{\mathrm{vol}(\phi(\varepsilon))}{\mathrm{vol}(V)}. \tag{4.6}
\]

Perturbation size is thus normalized to the most conservative description of perturbation, V ,

where controlled reactions have no bias towards their optimal fluxes. The vol(·) operation cal-

culates the volume. We calculated the volume by randomly sampling the optimal solution space

(with maximizing growth rate as the objective function) and counting the number of feasible

points.
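A minimal sketch of the counting estimate of (4.6) follows. It assumes that points are drawn uniformly from the box of flux bounds and that membership in V is decided by a hypothetical `feasible` callback (in practice this would involve the stoichiometric and optimal-growth constraints); the toy case uses a plain box, for which the ratio is known exactly.

```python
import random


def perturbation_size(v_opt, v_lo, v_up, eps, feasible, n=20000, seed=0):
    """Estimate delta(eps) = vol(phi(eps)) / vol(V) by counting samples."""
    rng = random.Random(seed)
    lo = [s - eps * (s - l) for s, l in zip(v_opt, v_lo)]
    hi = [s + eps * (u - s) for s, u in zip(v_opt, v_up)]
    in_v = in_phi = 0
    for _ in range(n):
        v = [rng.uniform(l, u) for l, u in zip(v_lo, v_up)]
        if not feasible(v):
            continue                      # point lies outside V
        in_v += 1
        if all(a <= x <= b for a, x, b in zip(lo, v, hi)):
            in_phi += 1                   # point also lies inside phi(eps)
    if in_v == 0:
        raise ValueError("no feasible samples; cannot estimate the ratio")
    return in_phi / in_v


# Toy case: V is a 2-D box, so the exact ratio is eps ** 2.
delta_half = perturbation_size([3.0, 4.0], [0.0, 0.0], [10.0, 10.0],
                               eps=0.5, feasible=lambda v: True)
```

For this two-dimensional box the estimate should be close to 0.25, and the same function returns 1 for a fraction of one and 0 for a fraction of zero, matching the limiting cases of the metric.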

4.4.8 Sensitivity of succinate strains without aerobic fumarate reductase

activity

To account for the inactivation of FRD under aerobic conditions, we calculated nominal perfor-

mances of the 98 strains with inactive FRD. These performances were calculated by removing

FRD activity from the original 98 strains and re-optimizing the fluxes of controlled reactions

to maximize succinate production. We note that, in addition to FRD, we also inactivated

pyruvate dehydrogenase (PDH) in anaerobic strains, as it is normally inhibited by the elevated

NADH levels found in anaerobic conditions (Wang et al., 2010). Anaerobic PDH activity can

be achieved by a mutant PDH that is resistant to NADH inhibition under anaerobic conditions

(Wang et al., 2010).

We then evaluated the performances of the sets of strains with and without aerobic FRD ac-

tivity subject to both intracellular and environmental perturbations. Genetic perturbations

involved deviations of controlled fluxes from their optimal levels, as in previous sections. En-

vironmental perturbations involved deviations of glucose and oxygen uptake rates from their

nominal values of 20 mmol/gDW/h. Both glucose and oxygen uptake were varied between 10

and 20 mmol/gDW/h.

When FRD was inactivated under aerobic conditions, nominal performances of the 98 strains

ranged between 0% and 89% maximal yield, with median of 78% (Fig. 4.9A, C). Robust perfor-

mances had minimum, maximum, and median yields of 0%, 66%, and 42% maximal yield (Fig.

4.9B, D). Additionally, succinate production was correlated with oxygen uptake flux when FRD

was inactive (Fig. 4.10A), while production was insensitive to oxygen uptake when FRD was

active (Fig. 4.10B).

4.4.9 Modeling the metabolic response to osmotic stress

Osmotic stress has a number of physiological consequences, including an increase in ATP main-

tenance (Varela et al., 2004). Thus, the metabolic response to osmotic stress can be modeled

partially by imposing a high ATP drain. We perturbed the non-growth associated maintenance

requirement (NGAM) up to ten times its basal value. Such a large increase has been shown to

be necessary to account for observed reductions in growth rate when modeling osmotic stress

response solely by an ATP drain (Metris et al., 2011). Experimentally, an increase in NGAM of

up to five-fold has been observed (Varela et al., 2004), indicating that additional mechanisms

exist. Although evaluating the detailed mechanisms for modeling osmotic stress response is

beyond the scope of this chapter, detailed models of osmotic stress response can be readily incorporated into our framework as they become available.

4.4.10 Modeling byproduct secretion and re-consumption with molecular

crowding and membrane occupancy constraints

To simulate the secretion of by-products under glucose and oxygen variations, we incorporated

the membrane crowding constraint in our FBA simulations (Zhuang et al., 2011). We imposed

the membrane crowding constraint on fumarate reductase because it is membrane-bound and

any limitations to its activity directly impact the performance of all three succinate overproduction strains, as defined in Section 4.4.5. We used a nominal, normalized crowding coefficient

value of kFRD = 0.033, which is quantitatively equivalent to the inverse of the maximum FRD

flux predicted by FBA for a ∆sdhAB strain at a growth rate of 0.1 h−1. We assumed parameter

uncertainty of ±50% of the nominal value.
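As a worked illustration of what these numbers imply, assume that the normalized constraint acting on fumarate reductase alone takes the form $k_{FRD}\, v_{FRD} \le 1$. This single-enzyme form is an assumption made here for illustration, consistent with the statement that the coefficient equals the inverse of the maximum FRD flux.

```latex
\[
v_{FRD} \le \frac{1}{k_{FRD}} = \frac{1}{0.033} \approx 30.3\ \text{mmol/gDW/h},
\]
\[
k_{FRD} \in [0.0165,\ 0.0495]
\quad\Longrightarrow\quad
\frac{1}{k_{FRD}} \in [20.2,\ 60.6]\ \text{mmol/gDW/h}.
\]
```

Under this assumption, the ±50% parameter uncertainty translates into an asymmetric range for the FRD flux cap, because the cap varies with the inverse of the coefficient.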

To model co-consumption of the by-products, we incorporated the normalized molecular crowd-

ing constraint (Beg et al., 2007). We used a crowding coefficient of 0.0031, consistent with (Beg

et al., 2007). Simulations were performed with ±50% uncertainty on this value.

4.4.11 Mean-variance portfolio optimization

The optimal combination of fluxes through a collection of metabolic pathways to maximize mean

production for a specified variance (or, to minimize variance for a specified mean production)

can be predicted using mean-variance portfolio optimization. This problem is formulated as a

quadratic program, as below:

\[
\max_{w \in \mathbb{R}^{n}} \quad r^{T} w - w^{T} \Sigma w \tag{4.7}
\]
\[
\text{s.t.} \quad \sum_{i=1}^{n} w_i = 1 \tag{4.8}
\]
\[
w \ge 0 \tag{4.9}
\]

where w is the vector of weights, r is the vector of mean returns, and Σ is the covariance matrix.

We include the constraint, w ≥ 0, which prevents short-selling in financial portfolios, since the

concept of short-selling is not applicable when modeling cell metabolism. The mean returns

and covariance matrix used in this work were calculated from the 1,000 random samples of

succinate strain III, which uses all three pathways. The values are shown in Tables 4.2 and 4.3,

respectively, and the results of the portfolio optimization are shown in Fig. 4.3.

Table 4.2: Mean and maximum succinate yields through three controlled pathways based on 1,000 random samples

Pathway                 Mean yield    Maximum succinate yield (mol/mol glucose)
Reductive TCA (A)       0.443         1.66
Glyoxylate shunt (B)    0.434         1.50
Oxidative TCA (C)       0.117         1.29

Table 4.3: Covariance matrix for the three controlled pathway fluxes based on 1,000 random samples

Pathway                 A         B         C
Reductive TCA (A)       42.2      -9.36     -10.2
Glyoxylate shunt (B)    -9.36     25.1      -9.11
Oxidative TCA (C)       -10.2     -9.11     25.5
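For illustration, the quadratic program (4.7) to (4.9) can be evaluated directly with the values in Tables 4.2 and 4.3. The sketch below replaces a quadratic programming solver with a plain grid search over the weight simplex, which is a substitution made here for simplicity and is not the method used in this work; the grid resolution is an arbitrary choice.

```python
R = [0.443, 0.434, 0.117]            # mean yields, Table 4.2
SIGMA = [[42.2, -9.36, -10.2],       # covariance matrix, Table 4.3
         [-9.36, 25.1, -9.11],
         [-10.2, -9.11, 25.5]]


def objective(w):
    """Objective (4.7): r^T w - w^T Sigma w."""
    mean = sum(r * x for r, x in zip(R, w))
    var = sum(w[i] * SIGMA[i][j] * w[j]
              for i in range(3) for j in range(3))
    return mean - var


def grid_search(steps=100):
    """Enumerate weights with w >= 0 and sum(w) = 1 on a regular grid."""
    best_w, best_val = None, float("-inf")
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            w = (a / steps, b / steps, (steps - a - b) / steps)
            val = objective(w)
            if val > best_val:
                best_w, best_val = w, val
    return best_w, best_val


weights, value = grid_search()
```

The grid includes the three single-pathway portfolios, so the returned weights are at least as good as using any one pathway alone under this objective. A dedicated solver would be preferable for more than a handful of pathways, since the grid grows quickly with dimension.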

4.5 Results and Discussion

4.5.1 Computational strain design

In order to investigate the effects of perturbations on strain performance, we first generated

succinate overproduction strains using the iAF1260 genome-scale model of Escherichia coli

(Feist et al., 2007) that performed optimally with no perturbations or parameter uncertainty

present. We chose succinate as it is used in the food, pharmaceutical and agricultural industries,

and has the potential to be used as a substrate for the sustainable production of plastics,

solvents, and commodity chemicals (Zeikus et al., 1999; McKinlay et al., 2007).

To generate these strain designs we used the EMILiO algorithm (Yang et al., 2011), as described

in Section 3.3.2. Briefly, EMILiO is a bilevel optimization algorithm that identifies optimal flux

values for a minimal set of controlled reactions to maximize production of a target metabolite. A

major concern with strain designs involving optimally controlled fluxes, such as those generated

by EMILiO and similar algorithms (Ranganathan et al., 2010), is the potential sensitivity of

strain performance to perturbations to the optimal flux values.

In total, we generated 98 different strains with yields ranging between 76% and 100% maximal

yield and median yield of 99% maximal yield (Fig. 4.1). These yields are referred to as the

nominal yields, which are the yields predicted in the absence of perturbations and parameter

uncertainty. Detailed procedures for generating alternative strains using EMILiO, as well as

the selection and refinement steps are outlined in Section 4.4.5.

Figure 4.1: Nominal and mean succinate yields of the 98 strains generated using EMILiO.

(A) Succinate yield of each strain when no perturbations are present (i.e., the nominal yields).

Dashed red line denotes the maximal (nominal) yield at a growth rate of 0.1 h−1, the minimum

required growth rate for the strain designs. The red vertical bars are used to indicate the three

succinate strains referred to as strain I, II, and III in the main text. (B) Succinate yield of

each strain when gene expression noise is present, based on 1,000 random samples for each

strain (see Section 4.4.6 for the procedure). Blue dots show the mean of the 1,000 samples

of succinate yield for each strain, while the red line shows the median. Black lines show the

minimum and maximum succinate yield for each strain, while the minimum and maximum

values in the green area correspond to the 25th and 75th percentiles of succinate yield, for each

strain. Strains are sorted in order of descending mean yield (in (A) as well). (C) Histogram

of succinate yields across the 98 strains when no perturbations are present. (D) Histogram of

mean succinate yields across the 98 strains when gene expression noise is present. 52% of the

98 strains achieved a nominal yield above 99% of the maximum succinate yield. In contrast,

only 1% of strains achieved a mean yield above 99% of the highest mean yield, which was 88%

of the maximal nominal succinate yield.

4.5.2 Pathway diversification improves robustness against flux perturbations

When we perturbed the controlled fluxes of the 98 succinate-producing strains described in the

previous section, we found that some strains clearly outperformed others. One of the most

robust strain designs (strain III) consisted of a knockout (succinate dehydrogenase) and three

optimized pathway fluxes: the reductive branch of the citric acid (TCA) cycle, the glyoxylate

shunt, and the oxidative TCA branch (1, 2, and 3, respectively in Fig. 4.2E). Prior to perturbing

the strains, the use of all three pathways seemed redundant, since we found two other strains,

strains I and II, that performed similarly in terms of their nominal yields, but required only one

(i.e., reductive TCA) and two pathways (i.e., reductive TCA and glyoxylate shunt), respectively

(Fig. 4.2E). Under perturbations with the largest size (see Section 4.4.7 for calculation of

perturbation size), strain III was the most robust, based on the robustness metric, Eq. 4.1

(R = 0.752). Strain II was less robust (R = 0.669), and strain I was the least robust (R = 0.412)

(Fig. 4.2H). Depending on one’s perspective, the apparent robustness of strain III is either

counter-intuitive or obvious: on the one hand, controlling a larger number of fluxes introduces

additional perturbations and should worsen performance. On the other hand, the reduction in

variability resulting from the addition of many independent random variables is a well-known

phenomenon in finance, ecology, and the physical sciences. This phenomenon is termed the

statistical averaging, or “portfolio” effect, and it is responsible for the robustness of many

natural and engineered systems (Vlad et al., 2007).

The effects of pathway diversification are evident in the distribution of succinate yield (Fig.

4.2A), and the controlled fluxes (Fig. 4.2B-D). In strain I, controlled flux variations translate

directly to variations in product yield. Meanwhile, controlled flux variations are mitigated in

strains II and III; therefore, the strain using a larger number of pathways has a higher mean and

lower standard deviation of succinate yield (Fig. 4.2F-G), which results in higher robustness

(Fig. 4.2H).

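[Figure 4.2 image not reproduced. Recoverable labels: panel D histogram with x-axis AKGDH (mol/mol glc) and y-axis relative frequency for Strain I; panel E pathway map with nodes Glycolysis, pep, oaa, cit, icit, akg, succoa, succ, fum, mal, and glx, and routes 1, 2, and 3.]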

Figure 4.2: Robustness of three succinate strains. (A) Histograms of succinate yield, relative to

glucose uptake flux, for strains I to III. (B-D) histograms of controlled fluxes, relative to glucose

uptake flux. (E) Strains I to III use one to three alternative routes to succinate production,

respectively: the reductive branch of the citric acid (TCA) cycle (1), the glyoxylate shunt

(2), and the oxidative branch of the TCA cycle (3). (F) Mean succinate yield. (G) Standard

deviation of succinate yield. (H) Robustness, R, of succinate yield, calculated according to Eq.

4.1. The simultaneous use of a large number of pathways improves robustness against variations

in the controlled fluxes. FRD2: fumarate reductase, MALS: malate synthase, AKGDH: α-

ketoglutarate dehydrogenase.

While the portfolio effect applies to independent random variables, the sum of negatively

correlated random variables leads to an even more pronounced reduction in variance. Inciden-

tally, the three pathways for succinate production show a weak negative correlation due to the

steady-state mass balance constraints and the fact that they are branching pathways. As with

the optimization of financial portfolios, negatively correlated assets (i.e., metabolic pathways)

can be combined in an optimal manner to maximize return (product yield) for a specified level

of risk (variability) (Fig. 4.3). An important consideration in portfolio optimization is that one

must make a tradeoff between risk and return. Kitano (2010) explored such tradeoffs

in the natural evolution of microbes. Our results suggest that a diversified set of metabolic

pathways leads to more robust strain designs. In the next section, we assess whether improved

robustness is a general consequence of diversification, or if this benefit arises only under specific

conditions.
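The variance reduction discussed above can be made explicit with a standard identity. As a simplified sketch, assume $n$ pathway contributions $X_1, \ldots, X_n$ with a common variance $\sigma^2$ and a common pairwise correlation $\rho$; these equal-variance and equal-correlation assumptions are made for illustration only and are not properties of the sampled fluxes.

```latex
\[
\operatorname{Var}\!\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right)
= \frac{1}{n^{2}} \left[ n \sigma^{2} + n(n-1) \rho \sigma^{2} \right]
= \frac{\sigma^{2}}{n} \left[ 1 + (n-1) \rho \right],
\qquad -\frac{1}{n-1} \le \rho \le 1 .
\]
```

For independent contributions ($\rho = 0$) the variance falls as $\sigma^2 / n$, which is the portfolio effect. For negatively correlated contributions ($\rho < 0$) the bracketed factor is below one, so the reduction is stronger still.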

[Figure 4.3 image not reproduced. Axes: standard deviation of succinate yield (mol/mol glucose), 0 to 0.4, on the x-axis; mean succinate yield (mol/mol glucose), 0.6 to 0.8, on the y-axis. Legend: Strain I, Strain II, Strain III.]

Figure 4.3: Example of portfolio optimization for three succinate strains I, II, and III. Based

on 1,000 random samples, we calculated the mean flux (Table 4.2) through each of the three

succinate producing pathways (reductive TCA, glyoxylate shunt, and oxidative TCA). Based

on the random samples, we determined the covariance matrix (Table 4.3) between these three

pathways. Due to mass balance constraints and the topological arrangement of the three

pathways, the covariance matrix has negative elements. Therefore, the weighted combination of

the three pathways can have a smaller variance than that of individual pathways. A quadratic

program is formulated to identify the optimal fluxes through the pathways to maximize the

mean yield for a specified variance of succinate yield, or risk (see Section 4.4.11). Strain I

uses only the highest-yield pathway, so its risk (standard deviation of yield) and return

(mean succinate yield) are the highest of the three strains. Strain II uses two pathways, so flux

through each pathway can be adjusted to achieve a lower risk than any individual pathway,

albeit for an intermediate level of return. Strain III uses three pathways, all of them showing

a weak negative correlation, so it is possible to achieve an even lower risk for an intermediate

return. Additionally, strain III achieves a higher return than strain II for the same level of risk.

4.5.3 Diversity increases sensitivity to small perturbations

Here, we consider perturbations of varying magnitudes and define a metric of perturbation size,

δ (see Section 4.4.7). δ = 1 indicates that every controlled flux is allowed to vary within the full

range of feasible values, as in the previous section. δ < 1 indicates that every controlled flux

remains closer to its nominal value. When controlled fluxes are equal to their nominal values,

then δ = 0.

In the previous section, in which δ = 1, strain III was the most robust to perturbations, due to

pathway diversification, while strain I was the least robust, since it controlled only one flux. In

contrast, when perturbations are small (δ < 0.395), strain I is the most robust, while strain III is

the least robust (Fig. 4.4C). Therefore, pathway diversification appears to improve robustness

only when perturbations are of a certain magnitude, which we now explain how to determine.

[Figure 4.4 image not reproduced. Panels A, B, and C plot mean yield (mol/mol glucose), standard deviation of yield (mol/mol glucose), and robustness (R), respectively, against perturbation size (δ) from 0 to 1 for Strains I, II, and III; δ*(2) and δ*(3) are marked in panel C.]

Figure 4.4: Robustness of three succinate strains as functions of perturbation size. (A) Mean

product yield versus perturbation size. Error bars represent one standard deviation. (B) Stan-

dard deviation of product yield versus perturbation size. (C) Robustness (R) versus perturba-

tion size. Critical perturbation sizes for strains II (δ∗(2) = 0.395) and III (δ∗(3) = 0.415) are

indicated by dotted lines. Strains I, II, and III use one, two, and three succinate production pathways, respectively. Strain I uses only the highest-yield pathway; therefore, its mean

yield is highest when perturbations are small. However, the robustness of strain I deteriorates

rapidly as perturbation size increases, while strain III is the most robust. Strain II is the most

robust for only a narrow range of perturbation sizes (i.e., for 0.395 ≤ δ ≤ 0.415).

To quantitatively compare robustness between strains at different perturbation sizes, we

introduce the critical perturbation size metric, δ∗(n), with integer n > 1, defined as the perturbation size above which a strain using n pathways is more robust (based on the metric, R) than the strain using only one pathway. Thus, δ∗(n) represents the perturbation size at which

diversification improves robustness. Furthermore, a small δ∗(n) indicates that diversification

is useful for a wider range of perturbations, while a large δ∗(n) indicates that robustness is

improved only for large perturbations when n pathways are used. Based on the critical per-

turbation size, strain I is the most robust for δ < 0.395 (Fig. 4.4C). Within a narrow interval

of perturbation sizes (0.395 ≤ δ < 0.415), strain II is the most robust. For larger perturba-

tions, δ > 0.415, strain III is the most robust. Thus, the critical perturbation size provides a

metric to quantitatively determine the number of redundant pathways to use for an expected

perturbation size. In this case, strain I or III should be used for small or large perturbations,

respectively. In the following section, we will apply our findings to the study of robust L-serine

overproduction strains.
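The critical perturbation size defined in this section can be estimated numerically once robustness has been evaluated on a grid of perturbation sizes. The following is a minimal sketch, not the procedure used in this thesis: the function name `critical_perturbation_size`, the linear interpolation between grid points, and the synthetic robustness curves are all assumptions made for illustration.

```python
def critical_perturbation_size(delta, r_single, r_multi):
    """Estimate the smallest perturbation size at which the diversified strain
    becomes more robust than the single-pathway strain.

    delta    : increasing grid of perturbation sizes
    r_single : robustness of the single-pathway strain at each grid point
    r_multi  : robustness of the n-pathway strain at each grid point

    Returns 0.0 if the diversified strain is already more robust at delta[0],
    and None if it is never more robust on the grid.
    """
    if not delta or not (len(delta) == len(r_single) == len(r_multi)):
        raise ValueError("inputs must be non-empty and of equal length")
    diff = [m - s for m, s in zip(r_multi, r_single)]
    if diff[0] > 0:
        return 0.0
    for k in range(1, len(delta)):
        if diff[k] > 0:
            # diff[k-1] <= 0 < diff[k]: interpolate the zero crossing linearly.
            frac = -diff[k - 1] / (diff[k] - diff[k - 1])
            return delta[k - 1] + frac * (delta[k] - delta[k - 1])
    return None


# Synthetic (hypothetical) robustness curves, for illustration only.
delta_grid = [k / 10 for k in range(11)]
r_one = [1.0 - 0.8 * d for d in delta_grid]   # robust when delta is small
r_two = [0.9 - 0.5 * d for d in delta_grid]   # degrades more slowly
delta_star = critical_perturbation_size(delta_grid, r_one, r_two)
```

For these two straight lines the curves cross at δ = 1/3, so `delta_star` recovers that value up to floating-point error. With sampled (noisy) robustness estimates, a finer grid or smoothing would likely be needed before locating the crossing.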

4.5.4 Enhanced robustness of L-serine production via low-yield pathways

L-serine is an industrially important amino acid that is used in cosmetics, pharmaceuticals, and

as a precursor for a variety of other chemicals (Peters-Wendisch et al., 2005; Stolz et al., 2007).

In this section, we investigate whether robust L-serine overproduction strains can be designed

using pathway diversification.

In E. coli, two pathways are available for L-serine synthesis (Fig. 4.5A): the phosphoserine

phosphatase (PSP) route, and the glycine hydroxymethyltransferase (GHMT) route. Flux

balance analysis (FBA) simulations show that GHMT yields less L-serine than PSP (1.15 versus

2.0 mol L-serine/mol glucose). Therefore, to maximize nominal yield, the PSP route should

be utilized exclusively. However, to maximize robust production under large perturbations, the

GHMT route, despite its low yield, is shown to play an important role.


Figure 4.5: L-serine production pathways and strains. (A) Two pathways are available for L-

serine production: (1) the PSP route and (2) the GHMT route. (B) We designed three strains

(strains I, II, and III), using one or both of these pathways. In addition, strain III inhibits

NDPK3 and CTPS2 fluxes.


To demonstrate, consider three L-serine strains (Fig. 4.5B). Strain I utilizes only the high-

yield PSP pathway, while strains II and III use both PSP and GHMT. Strain I consistently

had the highest mean yield across all perturbation sizes (Fig. 4.6A). However, its standard

deviation was also the highest (Fig. 4.6B) when perturbations were large, due to the lack of

alternative production routes. Thus, for large perturbations, strains II and III were more robust

than strain I (Fig. 4.6C). This result indicates that even low-yield pathways can be combined

with high-yield ones to improve robustness against large perturbations.

[Figure 4.6, panels A–C. x-axis: perturbation size (δ). y-axes: (A) mean yield (mol/mol glucose); (B) standard deviation of yield (mol/mol glucose); (C) robustness (R). Legend: Strain I, Strain II, Strain III.]

Figure 4.6: Robustness of three L-serine strains as functions of perturbation size. (A) Mean

yields of three L-serine strains as functions of perturbation size. Error bars represent one stan-

dard deviation. (B) Standard deviation of L-serine yield for the three strains. (C) Robustness

values of the three L-serine strains as functions of perturbation size. Strain I uses one L-serine

synthesis pathway, while strains II and III use two pathways. Strain III inhibits two addi-

tional reactions, compared to strain II, which results in improved nominal yield but decreased

robustness.


Under small perturbations, strain II was more robust than strain III. Compared with strain

II, strain III involves two additional controlled fluxes: inhibition of nucleoside-diphosphate

kinase (NDPK3) and CTP synthase (CTPS2). Under nominal conditions, these inhibitions

increased yield by 2% over strain II. However, the minute increase in nominal yield does not

appear to justify the decrease in robustness against small perturbations. In general, the opti-

mal tradeoff between product yield and variability will depend on the yield and variability of

individual controlled fluxes, as well as the size of expected perturbations. In silico design, as

described here, should help to accelerate the systematic identification of an optimal tradeoff.

A potential concern with the experimental feasibility of the proposed strategy is whether PSP

and GHMT can simultaneously produce L-serine since GHMT typically consumes L-serine when

PSP is active (Stolz et al., 2007). Possible approaches for simultaneous activity include provid-

ing glycine as a nitrogen source (Newman et al., 1976) or using resting cell systems to reduce

serine degradation (Shen et al., 2010).

4.5.5 Assessing robustness against industrially relevant perturbations

In the previous sections, we showed that robustness can be improved through the control of

redundant pathways. Yet, it may be argued that under a stable environment, such as lab-scale

cultures, efficient strains that use only high-yield pathways are superior to diversified strains

that use redundant pathways, which may have lower yields.

We hypothesize that diversified strains are more practical than efficient ones because industrial-

scale bioreactors introduce a wide range of environmental perturbations that are not typically

encountered at the lab-scale (Enfors et al., 2001). To test this hypothesis, we assessed the

robustness of the three succinate strains (discussed in previous sections) against representative

environmental perturbations: variations in glucose and oxygen uptake rates, osmotic stress,

secretion of byproducts due to overflow metabolism, and re-consumption of these byproducts.

These perturbations are difficult to control in large-scale bioreactors and deteriorate bioprocess

performance (Enfors et al., 2001).

Controlled flux variation arising from expression noise was included in all of our simulations,

since this intracellular perturbation is inherent to any cell. When glucose uptake rate was also


perturbed, average product yields decreased, since glucose is the sole substrate (Fig. 4.7a–b).

Osmotic stress, which was modeled as increased ATP maintenance requirements (see Section

4.4.9), had similar consequences (Fig. 4.7e–f). Both results were as expected, as both glucose

availability and ATP drain impact production capacity.

[Figure 4.7, panels a–p. Histograms of relative frequency for Strains I, II, and III. The first panel of each row shows the perturbed quantity and the remaining panels show the predicted flux responses (mmol/gDW/h). Panels: (a) glucose, (b) succinate; (c) oxygen, (d) succinate; (e) ATPM, (f) succinate; (g) kMemFRD, (h) succinate, (i) acetate, (j) formate, (k) ethanol; (l) kVol, (m) succinate, (n) acetate, (o) formate, (p) ethanol.]

Figure 4.7: Histograms showing the simulated response of succinate strains to industrially-

relevant perturbations. All controlled fluxes are perturbed due to gene expression noise.

Industrially-relevant perturbations include variations in glucose uptake rate (a-b), oxygen up-

take rate (c-d), osmotic stress (e-f), byproduct secretion due to overflow metabolism (g-k), and

re-consumption of byproducts (l-p). While simulating byproduct secretion, membrane occu-

pancy coefficients were subjected to parameter uncertainty (g). While simulating byproduct

consumption, molecular crowding coefficients were subjected to parameter uncertainty (l). For

oxygen and substrates (glucose, acetate, formate, and ethanol), negative fluxes correspond to

uptake while positive fluxes correspond to secretion. ATPM: non-growth-associated ATP main-

tenance, kMemFRD: membrane crowding coefficient of fumarate reductase, kVol: molecular

crowding coefficient.


In contrast, we found that all three strains were robust against variations in the oxygen

uptake rate (Fig. 4.7c–d). This robustness was due to aerobic FRD activity, which enables

fumarate to be respired, in addition to oxygen (Fig. 4.8). In the absence of aerobic FRD activity,

the maximum aerobic succinate yield became more sensitive to oxygen uptake rates (Fig. 4.9,

4.10). Although FRD is normally repressed under aerobic conditions, aerobic FRD activity can

be achieved through several methods: the regulatory gene fnr can be overexpressed to activate

frdABCD (Shaw and Guest, 1982); the frd operon copy number can be increased (Cole and

Guest, 1979); or FRD enzymes can be mutated to decrease their sensitivity to oxygen (Iuchi

et al., 1986). In addition, Portnoy et al. (2010) adaptively evolved cytochrome

oxidase mutants of E. coli, which exhibited anaerobic physiology and fumarate respiration under

aerobic conditions. These studies suggest that fumarate respiration via aerobic FRD activity

may be a viable strategy for robust succinate production.


Figure 4.8: Respiration and succinate production. (1) Reductive branch of the citric acid (TCA)

cycle. (2) Glyoxylate shunt. (3) Oxidative branch of the TCA cycle. When fumarate reductase

(FRD) is repressed (A), the quinol-dependent NADH dehydrogenase activity dominates and

oxygen is the terminal electron acceptor. In contrast, when FRD is activated (B), fumarate is

available as an additional terminal electron acceptor. Accordingly, the production of succinate

becomes insensitive to fluctuations in oxygen availability.

[Figure 4.9, panels A–D. (A) Succinate yield (mol/mol glucose) versus strain number, annotated with the maximal nominal performance. (B) Succinate yield (mol/mol glucose) versus strain number; legend: mean, median, 25th & 75th percentiles, min & max. (C) Frequency versus yield (mol/mol glucose). (D) Frequency versus mean yield (mol/mol glucose).]

Figure 4.9: Nominal and mean succinate yield of 98 strains without aerobic fumarate reductase

(FRD) and anaerobic pyruvate dehydrogenase (PDH) activities. (A) Succinate yield of each

strain when no perturbations are present. All yields were calculated without aerobic FRD and

anaerobic PDH activities. However, to easily compare results with Fig. 1, the dashed red

line denotes the maximal yield at a growth rate of 0.1 h−1 when aerobic FRD and anaerobic

PDH activities are enabled. (B) Succinate yield of each strain when gene expression noise

is present, based on 1,000 random samples for each strain. Blue dots show the mean of the

1,000 samples of succinate yield for each strain, while the red line shows the median. Black

lines show the minimum and maximum succinate yield for each strain, while the minimum and

maximum values in the green area correspond to the 25th and 75th percentiles of succinate

yield, for each strain. Strains are sorted in order of descending mean yield (in (A) as well).

(C) Histogram of succinate yield across the 98 strains when no perturbations are present. (D)

Histogram of mean succinate yield across the 98 strains when gene expression noise is present.

Mean succinate yields ranged from 0% to 66% of the maximal yield, and had a median of 42%

of the maximal yield.


Figure 4.10: Correlation between succinate production and oxygen uptake for strain III. Colors

are proportional to growth rate as shown in the colorbar. When fumarate reductase (FRD)

is active under aerobic conditions, maximum succinate flux is insensitive to changes in oxygen

uptake flux due to the availability of fumarate respiration (A). When FRD is inactive under

aerobic conditions, maximum succinate flux is affected by oxygen uptake rate (B).


To accurately model byproduct secretion and re-consumption in response to variation in

glucose and oxygen uptake rates, we incorporated the molecular crowding (Beg et al., 2007)

and membrane occupancy (Zhuang et al., 2011) constraints (see Section 4.4.10). Simulations

showed that formate, acetate, and ethanol were major byproducts (Fig. 4.7i–k); the formate and

acetate results are consistent with experimental observations (Kirkpatrick et al., 2001; Wang et al., 2011).

Byproduct secretion led to a general decrease in succinate yield (Fig. 4.7h). Re-consumption

of these products, however, increased succinate flux, as additional substrates became available

(Fig. 4.7m–p).

In addition to these metabolic responses, we assessed the sensitivity of strains to uncertainty

in the parameters introduced by the molecular crowding and membrane occupancy constraints.

Specifically, we considered ±50% uncertainty on the membrane crowding coefficient of FRD

(kFRD), and similar uncertainty on the single molecular crowding coefficient. As expected, pre-

dicted performance of strain I was the most sensitive to uncertainty in kFRD since this strain

uses only FRD to produce succinate (Fig. 4.7g–h). Parameter uncertainty skewed the succi-

nate distribution, which was originally uniformly distributed. This transition is characteristic

of the “anti-portfolio” effect, arising in situations in which a variable is the product of multiple

random factors (Vlad et al., 2007). Indeed, succinate flux is influenced by the product of kFRD

uncertainty and perturbations to FRD flux. The molecular crowding constraints resulted in

decreased production for all three strains (Fig. 4.7l–m). For both the molecular crowding and

membrane occupancy constraints, pathway diversity improved robustness against the combined

effects of parameter uncertainty and byproduct re-consumption.

We then calculated δ∗(n) to quantitatively assess the effect of pathway diversification in the

context of the industrially relevant perturbations (Table 4.4). Compared to the other pertur-

bations, diversification provided the least amount of benefit when expression noise was the sole

perturbation (δ∗(2) = 0.395 and δ∗(3) = 0.415). This scenario is comparable to the environ-

ments of endosymbionts. Incidentally, endosymbionts lose a great deal of metabolic redundancy

and studies have hypothesized that this is due to the lack of a need for robustness in stable

environments (Moran, 2002; Tamas et al., 2002; Mendonca et al., 2011).

When glucose perturbations were added to expression noise, the use of two and three pathways


was more beneficial, as reflected by reductions in δ∗(2) and δ∗(3) of 56% and 54%, respectively.

Addition of oxygen perturbations to expression noise similarly increased the effective range of

diversification.

Osmotic stress, byproduct secretion and re-consumption all increased the benefits of diversifi-

cation greatly. In fact, the more diversified strains were always more robust than the efficient

strain (i.e., δ∗(n) = 0), even for small perturbations. These results suggest that pathway redun-

dancy is more important for improving robustness against environmental perturbations than

against expression noise alone.

Table 4.4: Critical perturbation size, δ∗(n), indicating the perturbation size at which robustness

of diversified strains (with n pathways) exceeds that of the most efficient strain.

Perturbation              δ∗(2)    δ∗(3)
Expression noise          0.395    0.415
Glucose variation         0.173    0.193
Oxygen variation          0.243    0.183
Osmotic stress            0        0.011
By-product secretion      0        0
By-product consumption    0        0

δ∗ = 0 indicates that the diversified strain is more robust than the simple strain for all perturbation sizes.
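The relative reductions in critical perturbation size discussed above can be recomputed directly from Table 4.4. The short sketch below does this arithmetic; the names `TABLE_4_4`, `relative_reduction`, and `reductions_vs_expression_noise` are illustrative only. Because the table entries are rounded to three decimal places, the computed percentages may differ by about one percentage point from those quoted in the text.

```python
# Critical perturbation sizes copied from Table 4.4: (delta*(2), delta*(3)).
TABLE_4_4 = {
    "Expression noise": (0.395, 0.415),
    "Glucose variation": (0.173, 0.193),
    "Oxygen variation": (0.243, 0.183),
    "Osmotic stress": (0.0, 0.011),
    "By-product secretion": (0.0, 0.0),
    "By-product consumption": (0.0, 0.0),
}


def relative_reduction(baseline, value):
    """Fractional decrease of `value` relative to `baseline`."""
    return (baseline - value) / baseline


def reductions_vs_expression_noise(table):
    """Reduction of each critical size relative to the expression-noise-only case."""
    base2, base3 = table["Expression noise"]
    return {
        name: (relative_reduction(base2, d2), relative_reduction(base3, d3))
        for name, (d2, d3) in table.items()
        if name != "Expression noise"
    }


REDUCTIONS = reductions_vs_expression_noise(TABLE_4_4)
for name, (r2, r3) in REDUCTIONS.items():
    print(f"{name}: delta*(2) reduced by {100 * r2:.1f}%, delta*(3) by {100 * r3:.1f}%")
```

A reduction of 100% corresponds to δ∗ = 0, that is, the diversified strain is more robust at every perturbation size considered.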

4.6 Conclusions

In this work, we have developed a procedure for robust strain design that can be employed

immediately using available genome-scale constraint-based models (CBM) of cell metabolism.

First, the CBM is modified to simulate an engineered strain, based on known genetic modifica-

tions, or using a strain design algorithm (Burgard et al., 2003; Ranganathan et al., 2010; Kim

and Reed, 2010; Yang et al., 2011) to identify knockout targets and controlled fluxes. Second,


a set of random perturbations is identified (e.g., see Table 4.1). The relative sizes of these

perturbations are then estimated (see Section 4.5.3). Next, the designed strains are subjected

to these random perturbations and the robustness of each strain is calculated (Eq. 4.1). Strains

that are robust against specific or many perturbations are retained while the sensitive strains

are discarded. Robustness-enhancing strategies are identified by examining the most robust

strains and these strategies are incorporated into the next iteration of strain design to further

improve strain robustness.

In this work, we found that pathway diversification improved robustness against a wide variety

of industrially relevant random perturbations, including variation in substrate and oxygen up-

take rates, osmotic stress, byproduct secretion, and re-consumption of these byproducts (Fig.

4.7). Although pathway diversification improved robustness against large perturbations, it also

increased sensitivity to small perturbations (Fig. 4.4). This tradeoff may have implications

for strain construction. For example, two representative methods for controlling target fluxes

are inducible expression systems and libraries of constitutive promoters (De Mey et al., 2007).

Inducible systems typically exhibit greater variability in the level of protein expression across

individual cells than that of constitutive promoter libraries (Alper et al., 2005). Therefore,

pathway diversification may have a greater effect on improving robustness when using inducible

systems than when using promoter libraries.

Diversification is adopted as a strategy to improve robustness against perturbations by a variety

of artificial and natural systems, ranging from financial portfolios to prairie grasslands (Tilman

et al., 2006). Under constant environments, efficiency may take precedence over robustness. For

example, endosymbionts and parasites reside in stable environments, and their genomes reflect

a significant loss of redundant genes and pathways (Moran, 2002; Tamas et al., 2002; Mendonca

et al., 2011). On a faster time-scale, deletion of key enzymes leads to large intracellular pertur-

bations. These perturbations include re-routing of fluxes, amplification of existing pathways,

and the activation of latent pathways, which are eventually deactivated as intracellular condi-

tions stabilize over the course of adaptive evolution (Fong et al., 2006; Cornelius et al., 2011).

Thus, the transient activation and deactivation of latent pathways may be related to changes

in the size of intracellular perturbations over time. Although the mechanisms underlying these


phenomena are not fully understood, they are consistent with the presence of a tradeoff between

robustness against large perturbations and sensitivity to small perturbations exhibited

by the diversification strategy. Therefore, the characterization of random perturbations during

latent pathway activation and natural selection may help to explain why robustness is acquired

or lost as conditions change over time.

Experimental data related to our results can be found in the literature. For example, in (Son-

ntag et al., 1993), L-lysine is synthesized by Corynebacterium glutamicum via two pathways

that diverge from a common precursor. One pathway involves a single, ammonium-dependent

reaction that contributes 72% to 0% of total L-lysine production. The actual contribution de-

pends strongly on the availability of ammonium, which can vary temporally and spatially in

the bioreactor. The presence of two L-lysine synthesis pathways in C. glutamicum has raised

fundamental questions on their relative functions, since other microbes, including E. coli, Bacil-

lus subtilis, and Bacillus sphaericus use only one of three possible L-lysine synthesis pathways

(Schrumpf et al., 1991). For the purpose of robust strain design, the results in this thesis point

to the use of both pathways as a viable strategy for improving robustness against ammonium

fluctuations.

To provide experimental verification of the design ideas presented here, one could extend the

experiments in (Sonntag et al., 1993) as follows. First, an experimental apparatus is designed

such that perturbations to ammonium or substrate availability are introduced in at least two

different amplitudes. Fluctuations can be introduced in lab-scale culture, as in (Pekkonen et al.,

2011; Suiter et al., 2003; Picket and Bazin, 1980). Second, a set of strains is constructed, based

on the robust strain design framework. The set should include at least one diversified strain,

having diverse pathways towards product formation, and an efficient strain, having only the

highest-yield pathway. Third, the set of strains is cultured under the two different perturbations.

By measuring the mean and standard deviation of L-lysine production of the different

strains, the hypothesis that greater diversification improves robustness against large perturbations,

at the cost of increased sensitivity to small perturbations, would be tested. Additionally,

the experiment would verify whether in silico robust strain design can indeed lead to the de-

velopment of microbial strains that are robust against industrially-relevant perturbations.


The computational framework described here is general for use with any constraint-based model.

Future work may include the incorporation of integrated metabolic and regulatory network

models (Chandrasekaran and Price, 2010) to assess the potential for genetic and pathway re-

dundancy (Mahadevan and Lovley, 2008) and engineering of regulatory networks (Kafri et al.,

2006, 2009) for robust strain design. Furthermore, the predictive design of robust strains using

simple strategies like diversification, with quantifiable effects under a variety of perturbations

(Fig. 4.7) may become important when designing engineered cells from the bottom up, using

cells with minimal metabolic networks (Glass et al., 2006; Henry et al., 2010b) as a platform.

One of the main contributions of this work is to place these considerations within a systematic

framework that is tangible to the designer, rather than leaving the issue of robust performance

to chance, trial, and error.

Chapter 5

Designing Experiments from Noisy Metabolomics Data to Refine Constraint-Based Models

This chapter contains material originally published in the conference proceedings below, with

permission from the publisher, the American Automatic Control Council (AACC):

Yang, L., Mahadevan, R. and Cluett, W.R. (2010b) Designing experiments from noisy metabolomics

data to refine constraint-based models. In: Proceedings of the American Control Conference,

pp. 5143–5148.

5.1 Abstract

Metabolomics is an emerging technology for making high-throughput measurements of metabo-

lites and is useful for the discovery of novel biomarkers of genetic diseases and for metabolic

engineering. The system-wide data can be used to refine predictions made by constraint-based

models of cell metabolism. However, the predictions of important output variables may still

suffer from high variability due to high variance in the data itself, or from suboptimal choice of

measurements in the metabolomics experiment. Here, we present a computational algorithm


that uses initial metabolomics data to identify a smaller set of metabolites whose precise mea-

surement most reduces variability of model predictions. We first randomly sample fluxes and

concentrations using a new non-convex sampling algorithm that differs from previous approaches

in its ability to sample across disjoint regions of the space and in its parallel implementation.

We then demonstrate our algorithm’s ability to identify a sequence of experiments that succes-

sively refines model predictions using a simplified model of Escherichia coli central metabolism.

5.2 Introduction

Cell metabolism is a complex network consisting of hundreds of biochemical species, or metabo-

lites, interacting through over a thousand chemical reactions. Accurately modeling this sys-

tem is an important challenge for metabolic engineers and health scientists. Metabolomics is

an emerging high-throughput technology to make system-wide concentration measurements of

hundreds of metabolites and has important applications for identifying novel biomarkers of ge-

netic diseases (Buescher et al., 2009; Shlomi et al., 2009). Constraint-based modeling (CBM)

is used to make systems-level predictions of reaction rate, or flux, distributions throughout the

metabolic network (Becker et al., 2007). Recent advances in this field include the develop-

ment of algorithms that can predict both concentration and flux ranges using thermodynamic

information on the free energies of reactions (Henry et al., 2007). Model predictions can be

refined using metabolomics data to constrain the concentrations of metabolites (Bennett et al.,

2009; Mo et al., 2009). Also, the quality of experimental data can be assessed by examining

thermodynamic feasibility of the data within a constraint-based model (Zamboni et al., 2008).

However, important output variables may still suffer from increased uncertainty due to high

variance in the data, or from suboptimal choice of measurements.

The problem of identifiability in metabolic networks has been the subject of several stud-

ies including the identification of optimal flux measurement sets to completely characterize

flux distributions using isotopic metabolic flux analysis experiments (Chang et al., 2008). In

(Savinell and Palsson, 1992), the authors wanted to completely determine flux configurations


by measuring some fluxes and computing the others based on network stoichiometry and mass-

balance equations. By estimating the sensitivity of calculated fluxes relative to the uncertainty

in measured fluxes, the fluxes needing precise measurements could be determined. The system

considered by the authors was defined by linear constraints. Hence, they could use properties

of matrix norms to define upper bounds on flux sensitivities. Furthermore, they estimated

actual sensitivities by generating random experimental flux measurements and experimental

uncertainties.
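The kind of norm-based bound described above can be illustrated with a generic linear-algebra sketch. This is not a reproduction of the derivation in (Savinell and Palsson, 1992): the partition of the stoichiometric matrix into columns for calculated fluxes ($S_c$) and measured fluxes ($S_m$), and the assumption that $S_c$ has full column rank so that its pseudo-inverse $S_c^{+} = (S_c^T S_c)^{-1} S_c^T$ exists, are introduced here for illustration only.

\begin{align*}
S_c v_c + S_m v_m &= 0 \\
v_c &= -S_c^{+} S_m v_m \\
\delta v_c &= -S_c^{+} S_m \, \delta v_m \\
\| \delta v_c \| &\le \| S_c^{+} S_m \| \, \| \delta v_m \|
\end{align*}

Under these assumptions, the induced matrix norm $\| S_c^{+} S_m \|$ bounds how strongly an error $\delta v_m$ in the measured fluxes can be amplified in the calculated fluxes. The bound is a worst case, which is consistent with the authors also estimating actual sensitivities by random sampling.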

In this thesis, we build upon this idea by estimating the sensitivities of calculated fluxes and

metabolite concentrations, relative to experimental uncertainties of measured concentrations.

Unlike the system in (Savinell and Palsson, 1992), here we consider both fluxes and concen-

trations, which are non-linearly related; therefore, novel methods are developed. For our algo-

rithm we begin with an initial metabolomics dataset consisting of many metabolite concentra-

tions with high variability. This dataset is used to place loose bounds on concentrations in a

constraint-based model, resulting in a reduced solution space relative to the case where arbi-

trarily wide bounds are used. We then generate random samples from the space using a new

non-convex sampling algorithm. We then use these samples to assess sensitivities of calculated

variables to measured concentrations. A schematic of the overall framework is shown in Fig.

5.1. We present the necessary preliminaries in Section 5.3, describe the algorithm for sampling

the non-convex concentration space in Section 5.4, and present our algorithm for identifying

important metabolites in Section 5.5 with an example using a simplified model of Escherichia

coli central metabolism. We present our results in Section 5.6 and conclusions in Section 5.7.

5.3 Preliminaries

5.3.1 Constraint-Based Modeling

Cell metabolism is modeled as a network of biochemical species, or metabolites, that are in-

terconnected through enzyme-catalyzed reactions with defined stoichiometry. The variables of

this system are reaction rates, or fluxes, and metabolite concentrations. Fluxes are defined by
the following constraints:

\begin{align}
S v &= \frac{dx}{dt}, \tag{5.1}\\
v^L \le v &\le v^U, \tag{5.2}
\end{align}

where $v \in \mathbb{R}^N$ is the vector of fluxes, $x \in \mathbb{R}^M$ is the vector of metabolite concentrations, and $v^L$ and $v^U$ are the lower and upper bound vectors of the fluxes. $S$ is the matrix defining network stoichiometry, with $M$ rows corresponding to metabolites and $N$ columns corresponding to fluxes.

Figure 5.1: Metabolomics data serve as the launchpad for iterative model refinement. Our computational algorithm, outlined in Section 5.5, allows researchers to identify metabolites needing more precise concentration measurements to make precise predictions of the output variables of interest.

In Flux Balance Analysis (FBA) (Becker et al., 2007), we assume that metabolic reactions occur

much faster than environmental changes. Hence, we assume a quasi-steady state for metabolite
concentrations, so that $dx/dt = 0$. Consequently, flux configurations are calculated by solving the
following linear program (LP):

\begin{align}
\max_{v} \quad & c^T v \tag{5.3a}\\
\text{s.t.} \quad & S v = 0, \tag{5.3b}\\
& v^L \le v \le v^U, \tag{5.3c}
\end{align}

where $c \in \mathbb{R}^N$ is the vector of flux weights in the objective function, chosen to reflect cell behavior under its growth condition (e.g., maximize growth yield).
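To make the FBA linear program concrete, the sketch below solves (5.3a)-(5.3c) for a hypothetical four-reaction toy network using `scipy.optimize.linprog`. The network, its bounds, and the helper name `solve_fba` are assumptions made for illustration and are not taken from the models used in this thesis. Since `linprog` minimizes its objective, the weights are negated to maximize $c^T v$.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (rows: metabolites A, B; columns: reactions).
#   v1: -> A (uptake)    v2: A -> B    v3: B -> (product)    v4: A -> (byproduct)
S = np.array([
    [1.0, -1.0, 0.0, -1.0],   # mass balance of A
    [0.0, 1.0, -1.0, 0.0],    # mass balance of B
])
v_lower = np.zeros(4)
v_upper = np.array([10.0, 1000.0, 1000.0, 1000.0])  # uptake limited to 10
c = np.array([0.0, 0.0, 1.0, 0.0])                  # maximize product flux v3


def solve_fba(S, c, v_lower, v_upper):
    """Solve max c^T v subject to S v = 0 and v_lower <= v <= v_upper."""
    result = linprog(
        -c,                              # linprog minimizes, so negate c
        A_eq=S,
        b_eq=np.zeros(S.shape[0]),       # steady state: S v = 0
        bounds=list(zip(v_lower, v_upper)),
        method="highs",
    )
    if not result.success:
        raise RuntimeError(result.message)
    return result.x


v_opt = solve_fba(S, c, v_lower, v_upper)
```

In this toy case all of the substrate is routed to the product, so the product flux is limited only by the uptake bound. Genome-scale models use the same formulation with far larger $S$.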

In Thermodynamics-based Metabolic Flux Analysis (TMFA) (Henry et al., 2007), both fluxes and concentrations are predicted by solving the following mixed-integer linear program:

\[
\begin{aligned}
\max_{v,\,x,\,\Delta_r G'} \quad & c^T v && \text{(5.4a)} \\
\text{s.t.} \quad & S v = 0, && \text{(5.4b)} \\
& 0 \le v_j \le z_j v_j^{\max}, \quad j = 1, \dots, N, && \text{(5.4c)} \\
& \Delta_r G'_j - K(1 - z_j) < 0, \quad \{ j = 1, \dots, N \mid \Delta_r G'^{\circ}_j \text{ is known} \}, && \text{(5.4d)} \\
& \Delta_r G'^{\circ}_j + RT \sum_{i=1}^{M} s_{i,j} \ln(x_i) = \Delta_r G'_j, \quad \{ j = 1, \dots, N \mid \Delta_r G'^{\circ}_j \text{ is known} \}, && \text{(5.4e)} \\
& x^L \le x \le x^U, && \text{(5.4f)} \\
& v_j \ge 0, \quad j = 1, \dots, N, && \text{(5.4g)} \\
& z_j \in \{0, 1\}, && \text{(5.4h)}
\end{aligned}
\]

where $\Delta_r G'_j$ is the Gibbs free energy change of reaction $j$, $\Delta_r G'^{\circ}_j$ is the standard Gibbs free energy change, $R$ is the gas constant, $T$ is the temperature, $K$ is a sufficiently large positive constant, and $z_j$ is a binary variable equal to 1 when the $\Delta_r G'_j$ of reaction $j$ is negative, thereby allowing flux, and equal to 0 otherwise. Reactions are split into forward and reverse directions so that all fluxes are non-negative, $v_j^{\max}$ denotes the maximum flux through reaction $j$, $x^L$ and $x^U$ are lower and upper concentration bounds, and $s_{i,j}$ denotes the element of $S$ in the $i$-th row and $j$-th column.
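To make the thermodynamic coupling concrete, the following sketch evaluates the reaction Gibbs free energy from concentrations, as in constraint (5.4e), and derives the direction indicator that the binary variable encodes. The reaction, its standard free energy, the temperature, and the concentrations are hypothetical values chosen for illustration.

```python
import math

R = 8.314e-3   # gas constant in kJ/(mol K)
T = 298.15     # assumed temperature in K

def reaction_gibbs(dG0, stoich, conc):
    """Evaluate dG' = dG'0 + R*T*sum_i s_ij*ln(x_i), as in constraint (5.4e)."""
    return dG0 + R * T * sum(s * math.log(conc[m]) for m, s in stoich.items())

def direction_allowed(dG):
    """Return 1 if flux is thermodynamically allowed (dG' < 0), else 0."""
    return 1 if dG < 0.0 else 0

# Hypothetical reaction A -> B with an assumed standard free energy of +5 kJ/mol.
stoich = {"A": -1, "B": 1}
dG0 = 5.0

# High substrate and low product concentration pull dG' below zero ...
dG_fwd = reaction_gibbs(dG0, stoich, {"A": 1e-2, "B": 1e-5})
# ... while the opposite concentration pattern blocks the forward direction.
dG_blocked = reaction_gibbs(dG0, stoich, {"A": 1e-5, "B": 1e-2})

z_fwd = direction_allowed(dG_fwd)
z_blocked = direction_allowed(dG_blocked)
print(dG_fwd, z_fwd, dG_blocked, z_blocked)
```

The example shows that, in this formulation, concentrations decide only whether a direction is open, not how large the flux is.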


5.3.2 Randomly Sampling the Solution Space

Optimization-based approaches like FBA and TMFA are effective at predicting flux distributions for prokaryotes growing in nutrient-limiting conditions where suitable cellular objective

functions have been validated. When an appropriate objective function is not known, or an

unbiased exploration of the solution space is desired, random sampling approaches are used

(Schellenberger and Palsson, 2009).

To sample points uniformly distributed over the solution space $X \subset \mathbb{R}^N$ defined by constraints (5.3b)-(5.3c), we can use artificial centering hit-and-run (ACHR) (Kaufman and Smith, 1998). In ACHR, we first generate a set of $N_w$ warmup points $W = \{w_a : w_a \in X,\ a = 1, \dots, N_w\}$ using hit-and-run sampling. Subsequently, we do the following:

1. Initialize the starting point $X_0 \in X$, the center point $\bar{X} = X_0$, and set $t = 0$.

2. Choose a random warmup point $w_a$ from $W$, and set the random direction vector $d_t = (w_a - \bar{X}) / \lVert w_a - \bar{X} \rVert_2$, where $\lVert \cdot \rVert_2$ is the L2 norm.

3. Select a random step size $\lambda_t$, and hence a new candidate point, from the line set $Y_t = \{\lambda_t \in \mathbb{R} \mid X_t + \lambda_t d_t \in X\}$.

4. If the set $Y_t$ is empty, then go to Step 2.

5. Set $X_{t+1} = X_t + \lambda_t d_t$ and $t = t + 1$.

6. Set $\bar{X} = (t \bar{X} + X_t)/(t + 1)$ and go to Step 2.
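The six steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation used in the thesis: the solution space is assumed to be a bounded polytope written as `A x <= b`, the unit square stands in for the flux space, and the warmup points are simply drawn uniformly from the box rather than by a preliminary hit-and-run stage. The function names are hypothetical.

```python
import numpy as np

def line_interval(A, b, x, d, eps=1e-12):
    """Step 3 helper: range of step sizes lam with A @ (x + lam*d) <= b."""
    lo, hi = -np.inf, np.inf
    for ad, slack in zip(A @ d, b - A @ x):
        if ad > eps:
            hi = min(hi, slack / ad)
        elif ad < -eps:
            lo = max(lo, slack / ad)
    return lo, hi

def achr(A, b, x0, warmup, n_samples, rng):
    x = np.array(x0, dtype=float)       # Step 1: starting point
    center = x.copy()                   # Step 1: center point
    t = 0
    samples = []
    while len(samples) < n_samples:
        w = warmup[rng.integers(len(warmup))]
        d = w - center                  # Step 2: direction toward a warmup point
        norm = np.linalg.norm(d)
        if norm < 1e-12:
            continue
        d /= norm
        lo, hi = line_interval(A, b, x, d)
        if not hi > lo:                 # Step 4: empty line set, pick a new direction
            continue
        lam = rng.uniform(lo, hi)       # Step 3: random step size on the line set
        x = x + lam * d                 # Step 5: move to the new point
        t += 1
        center = (t * center + x) / (t + 1)   # Step 6: update the running center
        samples.append(x.copy())
    return np.array(samples)

# Toy solution space: the unit square 0 <= x <= 1 (an assumption for illustration).
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])
rng = np.random.default_rng(0)
warmup = rng.uniform(0.0, 1.0, size=(20, 2))
samples = achr(A, b, [0.5, 0.5], warmup, 500, rng)
```

Every accepted point stays inside the polytope because the step size is drawn from the feasible segment of the line.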

The ACHR algorithm differs from previous hit-and-run algorithms in that the direction choice

at each iteration is adaptively chosen to improve convergence. Because each direction choice

depends on previous sample points (i.e., warmup points), the sequence does not form a Markov

Chain and the convergence theorems for Markov Chain Monte Carlo do not apply (Kaufman

and Smith, 1998). Nonetheless, the high empirical convergence rate of ACHR has made it a

popular choice for random sampling in the CBM community.


5.4 Sampling the non-convex solution space

In this work, we require uniformly distributed sample points from the thermodynamically feasible solution space. Disjoint regions arise in this space due to reversible fluxes and their corresponding $\Delta_r G'$, which are functions of concentrations (see Fig. 5.2). Schellenberger et al. (2007) developed a method to sample concentrations by first defining flux directions based on stoichiometry, environmental constraints, and concentration data. However, this method does not fully explore the combined concentration and flux space when reversible reactions are present, because reversible reactions create disjoint regions in the thermodynamically feasible solution space that cannot be fully sampled by a convex sampler such as ACHR, which those authors used. Here, we use a simple extension to ACHR to sample the non-convex solution space that includes reversible reactions. This method also eliminates the need to identify thermodynamically infeasible reaction cycles (steady-state flux through network loops without a thermodynamic driving force) a priori, as in Price et al. (2006), since the thermodynamic constraints checked at each iteration include a test for the presence of such cycles. We sample the non-convex solution space, $T \subseteq X \subset \mathbb{R}^N$, defined by constraints (5.4b)-(5.4h) as follows, for each parallel sampling chain $i$:

1. Initialize the starting point $X^i_0 \in T$, the center point $\bar{X}^i = X^i_0$, and set $t = 1$.

2. Set the random direction vector $d^i_t = (X^j_{t-1} - \bar{X}^i) / \lVert X^j_{t-1} - \bar{X}^i \rVert_2$ based on a random parallel chain $j$.

3. Generate a set of $K$ step sizes, $\Lambda^i_t = \{\lambda^{i(k)}_t \in \mathbb{R} \mid X^i_t + \lambda^{i(k)}_t d^i_t \in X,\ k = 1, \dots, K\}$.

4. Choose a step size $\lambda^i_t$ from the set of thermodynamically feasible (satisfying constraint (5.4d)) step sizes, $\Theta^i_t = \{\lambda^i_t \in \Lambda^i_t \mid X^i_t + \lambda^i_t d^i_t \in T\}$.

5. If $\Theta^i_t$ is empty, then choose a feasible point $X^i_t = X^j_{t-1}$ from a random parallel chain $j$ and go to Step 2.

6. Set $X^i_{t+1} = X^i_t + \lambda^i_t d^i_t$ and $t = t + 1$.

7. Set $\bar{X}^i = (t \bar{X}^i + X^i_t)/(t + 1)$ and go to Step 2.
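The procedure above can be illustrated on a two-dimensional toy problem. In this sketch, which is an assumption-laden illustration rather than the GPU code used in this work, the convex space is the box $[-1, 1]^2$ over a flux `v` and a free energy `g`, and the thermodynamically feasible set is the pair of disjoint quadrants where `v * g < 0`. One simplification is made relative to Step 5: when no trial step is feasible, the chain adopts the borrowed point and moves on to the next iteration instead of retrying immediately. All function names are hypothetical.

```python
import numpy as np

def in_T(point):
    """Toy thermodynamic test: flux and free energy must have opposite signs."""
    v, g = point
    return v * g < 0.0

def box_interval(x, d, lo=-1.0, hi=1.0):
    """Range of step sizes keeping x + lam*d inside the box [lo, hi]^n."""
    lam_lo, lam_hi = -np.inf, np.inf
    for xi, di in zip(x, d):
        if abs(di) > 1e-12:
            a, b = (lo - xi) / di, (hi - xi) / di
            lam_lo = max(lam_lo, min(a, b))
            lam_hi = min(lam_hi, max(a, b))
    return lam_lo, lam_hi

def sample_nonconvex(n_chains=8, n_iter=200, K=10, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: start chains in both disjoint regions.
    X = np.array([[0.5, -0.5] if i % 2 == 0 else [-0.5, 0.5]
                  for i in range(n_chains)])
    centers = X.copy()
    out = []
    for t in range(1, n_iter + 1):
        prev = X.copy()                           # points of all chains at t-1
        for i in range(n_chains):
            j = rng.integers(n_chains)
            d = prev[j] - centers[i]              # Step 2: direction from chain j
            if np.linalg.norm(d) < 1e-12:
                d = rng.normal(size=2)            # fallback when j's point is the center
            d = d / np.linalg.norm(d)
            lo, hi = box_interval(X[i], d)
            lams = rng.uniform(lo, hi, size=K)    # Step 3: K trial steps inside X
            feas = [l for l in lams if in_T(X[i] + l * d)]   # Step 4: keep steps in T
            if feas:
                X[i] = X[i] + feas[rng.integers(len(feas))] * d   # Step 6
            else:
                X[i] = prev[rng.integers(n_chains)]               # Step 5 (simplified)
            centers[i] = (t * centers[i] + X[i]) / (t + 1)        # Step 7
            out.append(X[i].copy())
    return np.array(out)

points = sample_nonconvex()
```

Because trial steps span the whole line segment inside the box, a chain can jump across the infeasible quadrants, which is how disjoint regions become reachable.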


The sampling algorithm visits disjoint regions of the solution space by generating many candidate points for each parallel search direction and assessing thermodynamic feasibility for each

candidate point. We increase the chance of finding feasible points by allowing communication

between the parallel chains in two ways: (a) the direction of a chain at an iteration is based on

the previous feasible points of all parallel chains, and (b) if a chain fails to find a feasible point

at an iteration, it randomly chooses a feasible point from the other chains for that iteration. In

this way, the success rate of finding feasible points in the non-convex solution space is increased.

We performed parallel computations on an Nvidia GeForce GTX 295 graphics processing unit (GPU), using the Jacket (AccelerEyes, LLC, Austell, GA) interface to MATLAB (The MathWorks, Inc., Natick, MA).

5.5 Identifying Important Metabolites

Our objective is to identify the metabolites needing more precise concentration measurements,

using metabolomics data as the starting point. For our purposes, additional measurements are

of little value if they do not affect the variability of outputs that we predict using the model.

Hence, we identify important metabolites using an approach inspired by the Derivative-based

Global Sensitivity Measures (Kucherenko et al., 2009). Below is an outline of our method

to estimate the change in variability of output yj (either a flux or concentration) relative to

uncertainty in metabolite concentration measurement xi:

1. Generate uniform random samples from the thermodynamically feasible solution space.

2. For each measurable metabolite $i$, generate $r = 1, \dots, N_{rand}$ random concentrations $x_{ir}$ within their feasible concentration bounds.

3. For random concentration $r$, define small concentration deviations, $x_{ir} - \Delta x$ and $x_{ir} + \Delta x$, for a positive $\Delta x$.

4. Obtain $K$ sample points within each of the concentration intervals $[x_{ir} - \Delta x,\ x_{ir}]$ and $[x_{ir},\ x_{ir} + \Delta x]$, and denote the $k$-th sampled value of output $j$ in each interval as $y^a_k$ and $y^b_k$, respectively.

5. Calculate variances of the samples of output $j$ with respect to metabolite $i$ within each interval as

\[ \sigma^j_a = \frac{1}{K-1} \sum_{k=1}^{K} \left( y^a_k - \bar{y}^a \right)^2 \quad \text{and} \quad \sigma^j_b = \frac{1}{K-1} \sum_{k=1}^{K} \left( y^b_k - \bar{y}^b \right)^2, \]

where $\bar{y}^a$ and $\bar{y}^b$ are the means of the output samples $y^a$ and $y^b$, respectively.

6. Calculate the gradient of variability in output $j$ with respect to metabolite $i$ at random concentration $x_{ir}$ as

\[ \gamma^j_r = \frac{\left| \sigma^j_b - \sigma^j_a \right|}{2 \Delta x}. \]

7. Repeat Steps 2-6 for all $N_{rand}$ random concentrations of metabolite $i$.

8. Define the mean sensitivity of output $j$ with respect to metabolite $i$ as

\[ \gamma^j_i = \frac{1}{N_{rand}} \sum_{r=1}^{N_{rand}} \gamma^j_r. \]
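Steps 2 through 8 can be sketched as follows. The callable `sample_output` is a hypothetical stand-in for "sample the thermodynamically feasible space with the metabolite concentration restricted to an interval and read off output $j$"; here it is replaced by two toy models, one whose output spread grows with the concentration and one that ignores it. The bounds, $\Delta x$, $K$, and $N_{rand}$ are arbitrary illustrative values.

```python
import numpy as np

def mean_sensitivity(sample_output, x_bounds, dx, n_rand, K, rng):
    """Mean gradient of output variance with respect to one metabolite."""
    lo, hi = x_bounds
    grads = []
    for _ in range(n_rand):
        x_r = rng.uniform(lo + dx, hi - dx)              # Step 2: random concentration
        xa = rng.uniform(x_r - dx, x_r, size=K)          # Steps 3-4: lower interval
        xb = rng.uniform(x_r, x_r + dx, size=K)          # Steps 3-4: upper interval
        var_a = np.var(sample_output(xa, rng), ddof=1)   # Step 5: sample variances
        var_b = np.var(sample_output(xb, rng), ddof=1)
        grads.append(abs(var_b - var_a) / (2.0 * dx))    # Step 6: variability gradient
    return float(np.mean(grads))                         # Step 8: mean over N_rand

# Toy stand-ins for the model (assumptions for illustration only):
def coupled(x, rng):
    """Output whose spread scales with the concentration."""
    return x * rng.uniform(0.0, 1.0, size=x.shape)

def uncoupled(x, rng):
    """Output that does not depend on the concentration at all."""
    return rng.uniform(0.0, 1.0, size=x.shape)

rng = np.random.default_rng(1)
gamma_coupled = mean_sensitivity(coupled, (1.0, 2.0), 0.2, 20, 2000, rng)
gamma_uncoupled = mean_sensitivity(uncoupled, (1.0, 2.0), 0.2, 20, 2000, rng)
print(gamma_coupled, gamma_uncoupled)
```

A metabolite whose concentration changes the spread of an output receives a larger score than one that only contributes sampling noise, which is the ranking the method relies on.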

The algorithm produces $\gamma^j_i$, which can be used to assess the metabolites that should be measured

if we wish to minimize variability when predicting fluxes or concentrations. Results in this thesis

were generated using the variance of samples as a measure of output variability at Step 5 but

we can also use other measures of variability. For example, if the range of samples is used,

Step 5 is equivalent to finding the minimum and maximum values of the output within the

concentration intervals. Alternatively, if the distribution of samples is highly skewed, we can

use the interquartile range.

5.6 Results

5.6.1 Sampling the Non-Convex Solution Space

Figure 5.2: The flux and concentration space of a toy reaction cycle. Random samples and reduction of solution space with (A) no measurements, (B) high-variance measurements, and (C) precise measurements. Four representative pair-wise scatterplot patterns: (D) disjoint flux and $\Delta_r G'$ regions ($v < 0$ and $\Delta_r G' > 0$, or $v > 0$ and $\Delta_r G' < 0$), (E) relation between $\Delta_r G'$ and metabolite concentrations due to Equation (5.4e), (F) correlation between fully coupled fluxes (Burgard et al., 2004), and (G) non-convex regions between fluxes constrained by thermodynamics. The layout of scatterplots is inspired by the COBRA Toolbox (Becker et al., 2007).

We assessed how metabolomics data of varying degrees of variability could refine the thermodynamically feasible solution space (Fig. 5.2). While the amount of refinement in the solution space was clearly greatest using precise measurements (Fig. 5.2C), even metabolomics data

with high variability could considerably refine the solution space (Fig. 5.2B).

Pair-wise cross-correlation scatterplots visualize relations between two variables (Fig. 5.2A-G).

First, we see that thermodynamically infeasible reaction cycles (Price et al., 2006) are eliminated. This is evident in Fig. 5.2A: when the inflow and outflow to the network (fluxes R1 and R3) are zero, so are the fluxes through the “cycle” (fluxes R2, R4, and R5). We also see disjoint regions formed between a flux and its reaction Gibbs free energy due to thermodynamic reversibility constraints (5.4d) (Fig. 5.2D). Also, the reaction Gibbs free energies are related to concentrations by (5.4e) (Fig. 5.2E). Two fluxes that are fully coupled, as defined by Burgard et al. (2004), show a correlation on the scatterplots (Fig. 5.2F). In Fig. 5.2G, we see a

non-convex space formed by two reversible fluxes. This pattern arises in this simple example

due to the elimination of infeasible reaction cycles as described above. This eliminates flux

distributions in which the inflow and outflow fluxes are less than the internal (R2, R4, and R5)

fluxes and also distributions in which the flux directions between R1 and R2 are opposite.

Finally, while concentrations and ∆rG′ are clearly related according to (5.4e), concentrations

and fluxes do not show strongly quantitative relationships in the scatterplots. This is because,

unlike kinetic models as in (Famili et al., 2005), in TMFA, concentrations affect only flux directions but not their magnitudes. Nonetheless, all variables, including concentrations, do exhibit

multi-modality in their marginal probability densities. This indicates that the algorithm was

capable of sampling from the non-convex solution space.

5.6.2 Computational Performance of the Sampling Algorithm

We observed a >20X speedup on the GPU over the CPU for the non-convex sampling algorithm (Fig. 5.3). The performance gain on the GPU increased with the number of samples. This indicated that, for the models investigated here, the GPU resources were not used to their full potential; hence, larger models can potentially be studied using our algorithm on the GPU. The GPU-specific code was run using half of the processing units of an Nvidia GeForce GTX 295, while the CPU code was run on an Intel Xeon 3.2 GHz processor.

5.6.3 Example: Simplified Model of E. coli Central Metabolism

We illustrate our algorithm using a simplified network model of E. coli central metabolism and

artificially generated metabolite concentration data. The network model consists of 20 reactions

and 11 metabolites as described in (Yang et al., 2008).

Figure 5.3: Comparison of the computational speed of non-convex sampling on the simplified model of E. coli central metabolism on the CPU and GPU. Parallelized code was more efficient than a single long chain on the CPU. For the largest number of samples, parallel code on the GPU was faster than that on the CPU by >20X. [Plot title: "Hit and Run sampling of central model (short chains are of length 100)"; x-axis: number of samples (x 10^4); y-axis: time (sec); series: long CPU chain, parallel CPU chains, parallel GPU chains.]

We used the algorithm described in Section 5.5 to estimate the sensitivity of output variability relative to uncertainty in metabolite concentrations. For each metabolite, we first estimated

the mean output sensitivity ($\gamma^j_i$) of all unmeasured concentrations and fluxes, initially in the absence of measurements (Fig. 5.4A). We then summed the sensitivities of all outputs for each metabolite (Fig. 5.4B). We identified three metabolites whose concentration measurements most affected overall output variability (metabolites 5, 7, and 10). We then provided high-variance concentration data for these three metabolites and again used our algorithm to re-assess sensitivities (Fig. 5.4C-D). Overall, two (metabolites 5 and 7) of the three measured metabolites were now less sensitive, indicating that despite high variability, the measurements sufficiently constrained these concentrations. Metabolite 10 was still considered sensitive overall because, while some outputs exhibited less variability, others showed increased variability as the solution space was more narrowly defined by the additional measurements. This result indicates that the sensitivities depend on the region of the solution space that the physical system resides in.

Figure 5.4: Determining the metabolite concentrations needing precise measurements. The global sensitivity of the variability of each output prediction was assessed relative to each metabolite concentration. Without experimental data (top two panels, A and B), several metabolite concentrations require measurements to reduce output variability. Once high-variance data are provided for metabolites 5, 7, and 10, other metabolite measurements become important for reducing output variability (bottom two panels, C and D). [Panels A and C: average gradient of output standard deviation, by measured metabolite index and output variable index. Panels B and D: summed average gradient of output standard deviation, by measured metabolite index.]

We then assessed the potential of using our algorithm to design experiments to refine model predictions. We used the simplified model as above and generated 5,000 random points from the

thermodynamically feasible solution space. Each point represents a viable cell phenotype. We then injected 10% noise to simulate high-precision data, where uncertainty would be primarily due to biological variability. We then generated a more realistic metabolomics dataset by injecting 20% noise into the original points. This uncertainty represents the inclusion of experimental error. We then ran 10 random in silico experiments, each time choosing a different partial set of the “omics” dataset. In each experiment, we used our algorithm to identify the measurable metabolites needing precise measurements. We then provided up to three precise measurements from the ideal dataset (10% variability) for the sensitive metabolites and assessed the variability of output predictions (Fig. 5.5, bottom). We compared this experiment design to a purely random approach, where precise measurements were provided for three metabolites chosen at random (Fig. 5.5, middle). The designed experiment significantly reduces output prediction variability (over 10X compared to the random approach), while the random approach barely reduces variability for most outputs, compared to predictions based solely on initial high-variance data (Fig. 5.5, top).

5.7 Conclusions

In this work, we have presented a computational algorithm to estimate the sensitivity of a

calculated flux or concentration relative to the uncertainty in a measured concentration, within

the context of constraint-based modeling. This study builds upon previous work assessing

sensitivity of calculated fluxes to uncertainty in measured fluxes (Savinell and Palsson, 1992).

The system studied here includes non-linear relations between fluxes and concentrations, which

were randomly sampled using a new sampling algorithm (Section 5.4). Our algorithm was able

to estimate sensitivities of fluxes and concentrations in a simplified model of E. coli central

metabolism. We found that these sensitivities depended on the operating region of the system;

therefore, even metabolomics data with high variability served as a valuable first step in the

iterative process of precisely defining cell behaviour within the solution space.

Figure 5.5: Comparison of model prediction error when, in addition to a partial set of noisy data, precise measurements are unavailable (top), chosen randomly (middle), or chosen by design using our algorithm (bottom). The relative error in model predictions is reduced over 10X using the designed experiment compared to the purely random experiment. [Panel titles: "Only partial noisy dataset" (top), "Partial noisy + few random precise data" (middle), "Partial noisy + few designed precise data" (bottom); x-axis: measured state index; y-axis: relative error (x 10^4).]

Metabolomics experiments provide crucial information on the end-products of metabolism, but they are still resource-intensive and can involve much experimental uncertainty. In Section 5.5, we developed a novel algorithm to address this issue and demonstrated its potential applicability using simulated data on a simplified model of E. coli central metabolism. The results of

our case study demonstrating experiment design (Section 5.6.3) were consistent with our goal

of directing targeted experiments based on initial, noisy metabolomics data. The next step

in this work is to validate the algorithm using actual experimental data on a more detailed

model of cell metabolism. Although applied to a simplified system, the marked improvement

in model precision by incorporating targeted measurements directed by our algorithm (Fig.

5.5) provides the motivation needed to face the challenge of adapting our algorithm to more

complex models and datasets. Furthermore, the significant improvement in computational efficiency gained by using GPU technology (Fig. 5.3) suggests a promising avenue for scaling up our analysis. Finally, the results of sampling the thermodynamically feasible flux-concentration

space (Fig. 5.2) demonstrated that the thermodynamic constraints on reaction direction only

loosely coupled fluxes and concentrations. Hence, models that more quantitatively capture the

flux-concentration relations may further improve the utility of our algorithm. For example, the

methods developed here could be extended to kinetic models of metabolism such as the k-cone

analysis (Famili et al., 2005).

Chapter 6

Scalable methods for optimal strain

design using kinetic models

6.1 Abstract

Kinetic models of cell metabolism quantitatively describe the relationship between reaction

fluxes, metabolite concentrations, and enzyme levels through kinetic rate equations. Hence,

these models are potentially more accurate than those based solely on stoichiometry and enable

design strategies that target the enzymes directly; however, optimal strain design algorithms

are more difficult to develop using kinetic models due to the presence of complex nonlinear

terms in the rate equations. While a number of optimal design algorithms have been developed

in the past, their scalability to larger kinetic models may be hindered by a large increase in

complexity with model size. Here, we present an alternative approach that is potentially faster

and more scalable to larger kinetic models. This scalability and computational efficiency was

achieved at the cost of a reduced specificity in the design. We make recommendations on how

to extend the current algorithm to improve design specificity.


Chapter 6. Scalable methods for strain design using kinetic models 119

6.2 Introduction

This thesis has focused on the design of microbial strains using genome-scale models of cell

metabolism. One limitation of these models is the lack of a quantitative relationship between

fluxes, metabolite concentrations, and enzyme levels. Kinetic models, consisting of kinetic rate

equations, quantitatively describe these relationships. However, they often require the estima-

tion of many more parameters than a genome-scale model that includes only stoichiometric

and thermodynamic constraints. Furthermore, they often introduce additional nonlinear con-

straints in an optimization problem. Therefore, the development of efficient modeling methods

for large-scale networks of kinetic rate equations, and strain design algorithms that utilize these

models, is useful but challenging. This chapter attempts to extend the optimization methods

developed for strain design in previous chapters to kinetic models of metabolism.

A significant number of optimization-based methods are currently available for the identification

of optimal genetic manipulations to maximize production using kinetic models of metabolism

(Pozo et al., 2011; Nikolaev, 2010; Vital-Lopez et al., 2006a; Visser et al., 2004; Schmid et al.,

2004; Dean and Dervakos, 1998). Here, we wish to explore whether more computationally efficient methods can be developed, albeit while incurring certain costs (e.g., reduced control of the design scope). Our goal is to develop computationally efficient methods that are potentially scalable to large-scale kinetic models of metabolism. Accordingly, we apply the techniques developed by Yang et al. (2011) to develop an efficient algorithm for identifying optimal enzyme manipulations using kinetic models of metabolism.

Sections 6.3–6.4.2 describe the algorithm. In Section 6.5, we test the algorithm using a kinetic

model of E. coli metabolism (Chassagnole et al., 2002) for the production of serine. A discussion

of the major findings, and recommendations for future work follow in Section 6.6.

6.3 Design of optimal enzyme manipulations using approximative kinetic models

Our algorithm involves constructing an approximative model from the original kinetic model

at the reference state, followed by identification of optimal enzyme manipulations using the approximative model. The optimal enzyme manipulations are then implemented in the original kinetic model to accurately predict the improvement in production. Conceptually, this procedure is inspired by the work of Vital-Lopez et al. (2006a). In that study, the approximative model was derived using a generalized linearization, which yielded a linear model. Here, we instead approximate the original model using the nonlinear form of the lin-log rate equations (Section 2.3.3). Therefore, the approximative model used here involves bilinear terms (i.e., the product of enzyme variables and concentration variables). Thus, our algorithm requires the solution of a nonlinear program that involves bilinear terms. We have excluded integer variables from the optimization problem for simplicity. Consequently, the total number of enzymes

that are manipulated is not constrained. We discuss the consequences of this characteristic in

Section 6.6.

The nonlinear program for identifying the optimal enzyme levels for maximizing production

using the lin-log model is as follows, for n reactions and m metabolites:

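\[
\begin{aligned}
\max_{v,\,\ln x',\,p,\,\gamma} \quad & c_p^T v \\
\text{s.t.} \quad & S v = 0 \\
& v_j = v^0_j\, p_j\, \gamma_j, \quad j = 1, \dots, n \\
& \gamma_j = 1 + \sum_{i=1}^{m} E_{j,i} \ln x'_i, \quad j = 1, \dots, n \\
& v^L \le v \le v^U \\
& \ln x'^L \le \ln x' \le \ln x'^U \\
& p^L \le p \le p^U
\end{aligned}
\tag{6.1}
\]

where $S \in \mathbb{R}^{m \times n}$ is the stoichiometric matrix, $c_p$ is the objective vector for maximizing production, $\gamma$ is introduced to concisely define the rate equation, $\ln x' = \ln(x/x^0) \in \mathbb{R}^m$ is the vector of natural logarithms of fold-changes in concentrations from the reference concentrations ($x^0$), $v \in \mathbb{R}^n$ is the vector of fluxes, $v^0$ is the vector of reference fluxes, $p \in \mathbb{R}^n$ is the vector of enzyme fold-changes from the reference, bounded by $p^L$ and $p^U$, and $E \in \mathbb{R}^{n \times m}$ is the matrix of elasticities, as described by Smallbone et al. (2010).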
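The lin-log rate law inside (6.1) is easy to evaluate directly, which helps show where the bilinearity comes from. The sketch below uses a hypothetical two-reaction, two-metabolite system; the reference fluxes and elasticities are made-up numbers, not the values listed in Appendix B.1.

```python
import numpy as np

def linlog_flux(v0, p, E, ln_x_rel):
    """Lin-log rate: v_j = v0_j * p_j * gamma_j, with gamma = 1 + E @ ln(x/x0)."""
    gamma = 1.0 + E @ ln_x_rel
    return v0 * p * gamma

# Hypothetical reference fluxes and elasticity matrix (n = 2 reactions, m = 2 metabolites).
v0 = np.array([1.0, 2.0])
E = np.array([[0.5, -0.2],
              [0.0, 0.8]])

# At the reference state (p = 1, x = x0) the model reproduces the reference fluxes.
v_ref = linlog_flux(v0, np.ones(2), E, np.zeros(2))

# Double the first metabolite concentration, then also double the first enzyme level.
ln_x = np.log(np.array([2.0, 1.0]))
v_a = linlog_flux(v0, np.array([1.0, 1.0]), E, ln_x)
v_b = linlog_flux(v0, np.array([2.0, 1.0]), E, ln_x)
print(v_ref, v_a, v_b)
```

With concentrations held fixed, the flux is linear in the enzyme level; with the enzyme level fixed, it is linear in the log-concentrations. The product of the two is what makes (6.1) bilinear.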

We solved (6.1) using a customized algorithm that combines successive linear programming (SLP) with a sub-routine that makes use of convex relaxations. These methods are described in the following sections.

6.4 Methods

6.4.1 Solution using successive linear programming

The problem (6.1) involves bilinear terms that give rise to non-convexity, making it challenging to solve at large scale. To solve the problem, we used successive linear programming (SLP) (Baker and Lasdon, 1985; Bullard and Biegler, 1991; Yang et al., 2011), similar to Section 3.3.2. The SLP formulation is as follows, for iteration $k$:

\[
\begin{aligned}
\min_{\Delta v,\,\Delta \ln x',\,\Delta p,\,\Delta \gamma} \quad & K_a \sum_{j=1}^{n} s_j - K_p\, c_p^T \Delta v && \text{(6.2)} \\
\text{s.t.} \quad & S \Delta v = 0 && \text{(6.3)} \\
& -s_j \le v^k_j + \Delta v_j - v^0_j \left( p^k_j \gamma^k_j + p^k_j \Delta \gamma_j + \gamma^k_j \Delta p_j \right) \le s_j, \quad j = 1, \dots, n && \text{(6.4)} \\
& \Delta \gamma_j = \sum_{i=1}^{m} E_{j,i}\, \Delta \ln x'_i, \quad j = 1, \dots, n && \text{(6.5)} \\
& v^L \le v^k + \Delta v \le v^U && \text{(6.6)} \\
& \ln x'^L \le \ln x'^k + \Delta \ln x' \le \ln x'^U && \text{(6.7)} \\
& p^L \le p^k + \Delta p \le p^U && \text{(6.8)} \\
& s \ge 0 && \text{(6.9)}
\end{aligned}
\]

where $K_a$ and $K_p$ are weights for emphasizing minimization of bilinear constraint violations or production, respectively, $s_j \ge 0$ are auxiliary variables for minimizing bilinear constraint violations, and $\Delta v = v^{k+1} - v^k$, $\Delta p = p^{k+1} - p^k$, and $\Delta \ln x' = \ln x'^{k+1} - \ln x'^k$ are deviations of the fluxes, enzyme levels, and (logarithm of) metabolite concentrations, respectively, from their values at iteration $k$. Note that in constraint (6.4), the reference flux ($v^k_j = v^0_j p^k_j \gamma^k_j$) cancels out.

Solution of (6.2)–(6.9) generates an optimal step direction to determine the new values of v, p,

lnx′ and γ at the next iteration, k+1. A full step in this direction is not guaranteed to improve

the objective, because the optimal step direction is determined based on a linear approximation

of the bilinear constraints. Accordingly, a line search procedure is used to determine the optimal

step size,

\[
\lambda^* = \arg\min_{\lambda \in [0,1]} \; K_a \sum_{j=1}^{n} \left| v^k_j + \lambda \Delta v_j - v^0_j \left( p^k_j + \lambda \Delta p_j \right) \left( \gamma^k_j + \lambda \Delta \gamma_j \right) \right| - K_p\, c_p^T \left( v^k + \lambda \Delta v \right), \tag{6.10}
\]

where $|\cdot|$ denotes the absolute value. To determine $\lambda^*$, we generate a number of trial step sizes and evaluate the objective of Eq. (6.10) for each trial. The trial step size that minimizes this objective is chosen to be $\lambda^*$. In this work, we used 100 trial steps, evenly distributed between 0 and 1. If $\lambda^* = 0$, then the SLP has converged since no further improvement of the objective is possible.

If the SLP has converged to a solution that does not satisfy the user-defined tolerances for the

bilinear constraint violations and the production level, then either the SLP must be restarted

at a different initial solution, or a sub-procedure for escaping sub-optimal solutions is initiated, as described in Section 6.4.2. The sub-procedure involves the solution of a linear relaxation of the original problem, which is formulated using the convex hull of each bilinear term (McCormick, 1976). We note that the initial solution to the SLP is also obtained by solving this relaxed problem.
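The grid-based line search of Eq. (6.10) can be sketched as below. The single-reaction example is hypothetical: the step $(\Delta v, \Delta p, \Delta \gamma)$ is chosen so that it satisfies the linearized constraint exactly, which leaves a bilinear violation that grows with the square of the step size. The weights $K_a = 5$ and $K_p = 1$ are arbitrary illustrative values.

```python
import numpy as np

def merit(lam, v_k, dv, p_k, dp, g_k, dg, v0, c_p, Ka, Kp):
    """Objective of Eq. (6.10): weighted bilinear violation minus weighted production."""
    v = v_k + lam * dv
    violation = np.abs(v - v0 * (p_k + lam * dp) * (g_k + lam * dg)).sum()
    return Ka * violation - Kp * (c_p @ v)

def line_search(n_trials=100, **step):
    """Evaluate the merit on evenly spaced trial steps in [0, 1] and keep the best."""
    trials = np.linspace(0.0, 1.0, n_trials)
    values = [merit(lam, **step) for lam in trials]
    return trials[int(np.argmin(values))]

# Hypothetical one-reaction step: the linearization predicts dv = v0*(p_k*dg + g_k*dp) = 1.5.
step = dict(
    v_k=np.array([1.0]), dv=np.array([1.5]),
    p_k=np.array([1.0]), dp=np.array([1.0]),
    g_k=np.array([1.0]), dg=np.array([0.5]),
    v0=np.array([1.0]), c_p=np.array([1.0]),
    Ka=5.0, Kp=1.0,
)
lam_star = line_search(**step)
print(lam_star)
```

For this toy step the merit reduces to $2.5\lambda^2 - 1.5\lambda - 1$, so the grid search returns the trial point closest to $\lambda = 0.3$: a full step would gain production but at too high a cost in constraint violation.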

6.4.2 Escaping local optima with convex relaxations

The first stage of the algorithm involves solution of an SLP, which converges to a solution

quickly, but does not guarantee global optimality. We thus developed a procedure, described

below, to search for potentially better local optima in the vicinity of the solution identified by

the SLP. This procedure is initiated if the SLP converges to a solution that does not satisfy

either the nonlinear violation tolerance or the metabolite production threshold.

First, each bilinear term is replaced by the McCormick relaxation (McCormick, 1976). This is

achieved by introducing a new variable, say, $z_i = p_i \gamma_i$, and constraining $z_i$ as follows:

\[
\begin{aligned}
z_i &\ge p_i^L \gamma_i + \gamma_i^L p_i - p_i^L \gamma_i^L, \\
z_i &\ge p_i^U \gamma_i + \gamma_i^U p_i - p_i^U \gamma_i^U, \\
z_i &\le p_i^U \gamma_i + \gamma_i^L p_i - p_i^U \gamma_i^L, \\
z_i &\le p_i^L \gamma_i + \gamma_i^U p_i - p_i^L \gamma_i^U,
\end{aligned}
\tag{6.11}
\]

where $p_i^L$, $p_i^U$, $\gamma_i^L$, and $\gamma_i^U$ are the lower and upper bounds of $p_i$ and $\gamma_i$, respectively. Accordingly, the relaxation is a function of the lower and upper bounds on each of the variables.
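A quick numerical check illustrates how the four inequalities in (6.11) bracket the bilinear product. The sketch below computes the tightest lower and upper values implied by the relaxation at a given point and confirms that the true product lies between them. The bounds on $\gamma$ are hypothetical; the bounds on $p$ mirror a 0.5- to 2-fold range.

```python
import numpy as np

def mccormick_bounds(p, g, pL, pU, gL, gU):
    """Tightest lower/upper values for z = p*g implied by the McCormick inequalities."""
    lower = max(pL * g + gL * p - pL * gL,
                pU * g + gU * p - pU * gU)
    upper = min(pU * g + gL * p - pU * gL,
                pL * g + gU * p - pL * gU)
    return lower, upper

pL, pU = 0.5, 2.0    # enzyme fold-change bounds
gL, gU = 0.2, 1.8    # hypothetical bounds on gamma

rng = np.random.default_rng(0)
encloses = True
for _ in range(1000):
    p = rng.uniform(pL, pU)
    g = rng.uniform(gL, gU)
    lo, hi = mccormick_bounds(p, g, pL, pU, gL, gU)
    encloses = encloses and (lo - 1e-12 <= p * g <= hi + 1e-12)

# At a corner of the box the relaxation is exact: both bounds equal p*g.
corner_lo, corner_hi = mccormick_bounds(pL, gU, pL, pU, gL, gU)
print(encloses, corner_lo, corner_hi)
```

The gap between the lower and upper values shrinks as the box shrinks, which is why the choice of variable bounds matters for the quality of the relaxation.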

For different bounds, the optimum of the convex relaxation may differ. Hence, we generated a

series of relaxed “trial” problems for each local optimum. Each trial problem involves different


bounds for the relaxed bilinear constraints. In this work, the bounds of p for trial j are defined

as follows:

\[
\begin{aligned}
(p)^L_j &= p^k - \phi_j \left( p^k - (p)^{\min} \right), \\
(p)^U_j &= p^k + \phi_j \left( (p)^{\max} - p^k \right),
\end{aligned}
\]

where $p^k$ is the value of $p$ at the local optimum at iteration $k$, and $(p)^{\min}$ and $(p)^{\max}$ are the minimum and maximum values for $p$, respectively, calculated using Flux Variability Analysis (FVA) (Mahadevan and Schilling, 2003). The vector $\phi$ can be of any length and can be any random or deterministic sequence of numbers between 0 and 1. In this work, we chose to deterministically explore convex relaxations near the local optimum, with additional relaxations on bounds of random width. Thus, for $N_s$ sequences, $\phi = \{x, r \in \mathbb{R}^{N_s/2} : x_i = 0.01 + 0.04\,\frac{i-1}{N_s/2},\ i = 1, \dots, N_s/2,\ 0 \le r \le 1\}$, where $r$ is a random number between 0 and 1.
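The construction of the trial bounds can be sketched as follows. The deterministic half of $\phi$ is read here as $0.01 + 0.04\,(i-1)/(N_s/2)$, which is an interpretation of the formula above, and the random half is drawn uniformly from $[0, 1]$. The values of `p_k`, `p_min`, and `p_max` are hypothetical.

```python
import numpy as np

def trial_bounds(p_k, p_min, p_max, n_s, rng):
    """Build phi and the per-trial bounds (p)^L_j, (p)^U_j around a local optimum p_k."""
    half = n_s // 2
    phi_det = 0.01 + 0.04 * np.arange(half) / half   # narrow, deterministic widths
    phi_rand = rng.uniform(0.0, 1.0, size=half)      # widths of random size
    phi = np.concatenate([phi_det, phi_rand])
    lower = p_k - phi[:, None] * (p_k - p_min)       # one row of bounds per trial j
    upper = p_k + phi[:, None] * (p_max - p_k)
    return phi, lower, upper

# Hypothetical local optimum for three enzymes with 0.5- to 2-fold limits.
p_k = np.array([1.0, 1.0, 2.0])
p_min = np.full(3, 0.5)
p_max = np.full(3, 2.0)
rng = np.random.default_rng(0)
phi, lower, upper = trial_bounds(p_k, p_min, p_max, 10, rng)
print(phi[:5])
```

Small values of $\phi_j$ give tight boxes around the local optimum, where the relaxation is nearly exact; values near 1 recover the full range and allow larger moves away from it.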

6.5 Result: serine synthesis in E. coli

We tested the algorithm by identifying optimal enzyme manipulation strategies to maximize

serine synthesis (SERS) flux in the kinetic model of E. coli central metabolism developed by Chassagnole et al. (2002). To identify the enzyme manipulations, we first

constructed the lin-log, approximative model at the reference state. Accordingly, we calculated

the elasticity matrix using automatic differentiation at the reference state. The values of the

elasticity matrix, reference fluxes, and reference concentrations used in this work are listed in

Appendix B.1.

To limit the discrepancy between the approximative and original kinetic models, we constrained

enzyme manipulations to between 0.5- and 2-fold changes, and metabolite concentrations to between 0.1 times the smallest concentration (among all metabolites) and 10 times the largest concentration.

Once the optimal enzyme manipulations were identified using the algorithm, we determined the

SERS flux by performing a dynamic simulation using the original kinetic model, subject to the

optimal enzyme manipulations.

The optimal levels of the 30 enzymes in the model are shown in Fig. 6.1. Essentially, all

Chapter 6. Scalable methods for strain design using kinetic models 124

[Figure 6.1, three panels. (A) Enzyme fold-change versus enzyme number. (B) v_SERS / v_SERS^ref versus time (s). (C) x / x^ref versus time (s); legend: cpep, cglcex, cg6p, cpyr, cf6p, cg1p, cpg, cfdp, csed7p, cgap, ce4p, cxyl5p, crib5p, cdhap, cpgp, cpg3, cpg2, cribu5p.]

Figure 6.1: Dynamic and steady-state simulations of E. coli central metabolism subject to

optimal enzyme manipulations. (A) Optimal enzyme fold-changes identified using the design

algorithm. (B–C) Dynamic profiles of SERS flux (B) and concentrations of the 18 metabolites

(C), both relative to reference values. The profiles are based on dynamic simulations of the

full kinetic model (Chassagnole et al., 2002) where enzyme levels are fixed to the optimal levels

identified by the algorithm at the start of the simulation (i.e., Time=0). Initial concentrations

are the reference concentrations, and initial fluxes are perturbed from the reference values due

to the enzyme perturbations at Time=0.

enzyme levels were either increased to the maximum (i.e., two-fold increase) or decreased to

the minimum (i.e., halved). This enzyme manipulation strategy is similar to the bang-bang

optimal control strategy of chemical processes, where the control variable is set equal to the


lower or upper bound (San and Stephanopoulos, 1983). Bang-bang control is an optimal control

strategy when certain conditions are satisfied by the problem (San and Stephanopoulos, 1983).

An interesting direction for future work is to investigate when, if ever, these conditions are

satisfied in kinetic models, and whether this can lead to simpler optimization problems for

strain design.

The optimal enzyme levels were then implemented in the original kinetic model, and dynamic
simulations were performed. These simulations indicated that the optimal enzyme manipulations
resulted in a 148% increase in SERS flux (to 2.48-fold of the reference flux). As expected, the
optimal enzyme levels include the maximum (i.e., two-fold) increase of the SERS enzyme. A
two-fold increase in SERS alone results in an 89% increase in SERS flux (to 1.89-fold of the
reference), as determined by a dynamic simulation of the full kinetic model. Thus, the optimal
levels of the other 29 enzymes account for the remaining 59 percentage points of the increase in SERS flux.

This result indicates that there is value in developing optimization algorithms for identifying

complex enzyme manipulation strategies to maximize production.

6.6 Conclusions

In this chapter, a computationally efficient algorithm was developed for the identification of op-

timal enzyme manipulation strategies using kinetic models of metabolism. A nonlinear kinetic

model (Chassagnole et al., 2002) was approximated around the reference state by representing

the reactions by lin-log rate equations. The elasticity matrix, which forms the only set of kinetic

parameters in the lin-log model, was calculated at the reference state using automatic differ-

entiation of the original nonlinear model. The resulting lin-log kinetic model, while simpler

than the original model, contained bilinear terms. Thus, to use the lin-log model for design-

ing optimal enzyme manipulations, a nonlinear optimization algorithm was developed. This

algorithm uses successive linear programming (SLP) to rapidly identify a local optimum, while

convex relaxations (McCormick relaxation for bilinear terms (McCormick, 1976)) are used to

improve the solution once the SLP has converged. The algorithm was able to identify opti-

mal enzyme manipulation strategies within a minute for a model containing 48 reactions (30


enzyme-catalyzed) and 18 metabolites (17 intracellular, one extracellular) (Chassagnole et al.,

2002). The optimal enzyme manipulations were implemented in the original nonlinear model

in order to accurately predict the effects of the enzyme manipulations via dynamic simulations.

The nonlinear model indicated a 2.48-fold increase in the steady state flux of the target reac-

tion, relative to the reference.
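The McCormick relaxation mentioned above can be illustrated with a short sketch. It assumes a single bilinear term w = x·y, for example a relative enzyme level multiplied by a log-concentration term, which is how bilinearity typically arises in lin-log rate equations. The function names and the numerical bounds are hypothetical and chosen only for illustration.

```python
def mccormick_bounds(x, y, x_lo, x_hi, y_lo, y_hi):
    """Return (under, over): McCormick bounds on w = x * y at the point (x, y)."""
    # Convex underestimators, from (x - x_lo)(y - y_lo) >= 0 and (x_hi - x)(y_hi - y) >= 0
    under = max(x_lo * y + x * y_lo - x_lo * y_lo,
                x_hi * y + x * y_hi - x_hi * y_hi)
    # Concave overestimators, from (x_hi - x)(y - y_lo) >= 0 and (x - x_lo)(y_hi - y) >= 0
    over = min(x_hi * y + x * y_lo - x_hi * y_lo,
               x_lo * y + x * y_hi - x_lo * y_hi)
    return under, over


def envelope_holds(x_lo, x_hi, y_lo, y_hi, steps=20, tol=1e-12):
    """Check on a grid that the true product lies between the two bounds."""
    for i in range(steps + 1):
        x = x_lo + (x_hi - x_lo) * i / steps
        for j in range(steps + 1):
            y = y_lo + (y_hi - y_lo) * j / steps
            under, over = mccormick_bounds(x, y, x_lo, x_hi, y_lo, y_hi)
            if not (under - tol <= x * y <= over + tol):
                return False
    return True


# Example: enzyme fold-change in [0.5, 2], log-concentration term in [-1, 1]
valid = envelope_holds(0.5, 2.0, -1.0, 1.0)
```

Replacing each bilinear term by a new variable w constrained by these four linear inequalities yields a linear relaxation of the problem. The relaxation is exact at the corners of the box and becomes tighter as the variable bounds are narrowed.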

These results suggest that this and similar algorithms may be scalable to larger and more
complex kinetic models of cell metabolism. For example, the algorithm can be applied directly
to the genome-scale kinetic model developed by Smallbone et al. (2010), which uses lin-log
rate equations. On the other hand, when we applied the algorithm (SLP and convex
relaxations) directly to the original kinetic model of Chassagnole et al. (2002), it
had difficulty identifying satisfactory solutions. This difficulty may be attributed to several

factors. First, in the case of bilinear constraints, the convex hull is given by the McCormick

relaxations. However, for general nonlinear constraints such as those found in the original ki-

netic model, other convex relaxations must be used. The ability of the algorithm to identify

globally optimal solutions is affected by the availability of tight convex relaxations. Currently,

commercial software such as BARON (Tawarmalani and Sahinidis, 2005) is available for global

optimization of MINLPs involving general nonlinear constraints. Alternatively, methods that

formulate tight relaxations that are customized for each type of rate equation may be more

efficient (Pozo et al., 2011).

A limitation that is inherent in our algorithm is the need to constrain the deviation of variables

from the reference such that the approximate model remains valid. Thus, the approximative

model leads to a simpler optimization problem, but the optimal enzyme manipulations may

only be valid for small changes in enzyme levels, fluxes and concentrations. This tradeoff

between computational tractability and model accuracy is inherent in the use of any
approximative model. In future work, it may be possible to employ an iterative procedure in which
(i) constrained enzyme manipulations are identified, (ii) dynamic simulations are performed using
the original model to determine the new steady state, (iii) the approximate model is
reconstructed at that steady state, and (iv) the optimization is repeated. This procedure is
similar to the techniques employed for nonlinear model predictive control of chemical processes. We note


that the approximate model need not be limited to the lin-log rate equation, which we used in

this work. In particular, the generalized linearization that is based on arbitrary basis functions

(Vital-Lopez et al., 2006a) may be used at each iteration.
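The iterative procedure proposed above can be sketched on a toy problem. The sketch below is not the algorithm of this chapter. It assumes a single enzyme level, uses a hypothetical scalar function `full_model_flux` in place of a dynamic simulation of the full kinetic model, and uses a finite-difference slope in place of reconstructing the approximate model. The rule that shrinks the allowed fold-change after a rejected step is an added assumption, included so that the loop terminates.

```python
def full_model_flux(e):
    """Hypothetical stand-in for the steady-state target flux of the full model."""
    return 4.0 * e / (1.0 + e) - 0.5 * e


def iterate_design(e0=1.0, fold=2.0, max_iter=200, tol=1e-9):
    """Alternate between re-linearizing and re-optimizing within fold-change bounds."""
    e, v = e0, full_model_flux(e0)
    for _ in range(max_iter):
        # Rebuild the local linear approximation at the current state
        # (a central finite difference stands in for re-deriving the model).
        h = 1e-6 * e
        slope = (full_model_flux(e + h) - full_model_flux(e - h)) / (2.0 * h)
        # A linear objective over an interval is maximized at an end point.
        candidate = e * fold if slope > 0 else e / fold
        # Evaluate the candidate with the full (nonlinear) model.
        v_new = full_model_flux(candidate)
        if v_new > v + tol:
            e, v = candidate, v_new  # accept, then re-linearize at the new state
        else:
            fold = 1.0 + 0.5 * (fold - 1.0)  # assumption: shrink the allowed step
            if fold - 1.0 < 1e-6:
                break
    return e, v


e_opt, v_opt = iterate_design()
```

In this toy setting each linearized subproblem pushes the enzyme level to one of its current bounds, and the full model is used only to accept or reject the step. In a realistic setting the acceptance test would require a dynamic simulation to the new steady state, which is the expensive part of each iteration.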

In addition, we note that the algorithm developed here is computationally efficient, but at the

cost of decreased control over the design scope. Specifically, while MILP-based methods (Niko-

laev, 2010; Vital-Lopez et al., 2006a) allow the user to limit the number and type of genetic

manipulations through the use of integer variables, our approach does not offer a straightforward

method for such fine control of the design scope without sacrificing computational efficiency.

The tradeoff between computational efficiency and specificity of the design scope will continue

to be challenging as this approach is developed further. One potential extension of this algo-

rithm is to develop subsequent procedures for choosing optimal subsets of the optimal enzyme

manipulations identified using the algorithm. For example, an MILP can be developed to choose

subsets of the enzyme manipulations subject to user-defined thresholds for the production and

number of enzyme manipulations, or to identify (alternative) minimal subsets. Essentially, the

MILP-based procedure is analogous to the third phase employed in the EMILiO algorithm,

which is described in Section 3.3.2. In conclusion, the most practical approach for in silico

strain design is to become familiar with all of the complementary techniques available and to

assess whether one is better suited than another for each individual problem.

Chapter 7

Conclusions

This thesis has explored a number of problems that are central for accurately predicting the

behavior of biological systems, and for effectively engineering them for the cost-effective pro-

duction of chemicals, biofuels, and pharmaceuticals. The major contributions of this thesis are

summarized below:

• Simulation of metabolic networks: Optimization is used to simulate both steady-

state and dynamic cell behavior. Simulation by optimization is appropriate if the cell is

assumed to behave according to a cellular objective, or if unmeasured states are estimated

while minimizing discrepancy with the subset of states that are measured. Linear, mixed-

integer, and nonlinear objectives and constraints have been explored in the literature.

This thesis has made contributions to the simulation of metabolic states through the

development of a method for randomly sampling thermodynamically feasible reaction

fluxes and metabolite concentrations (Chapter 5). This non-convex solution space is

difficult to sample due to an exponential increase in problem size with model dimension.

Hence, the sampling algorithm was implemented on the graphics processing unit (GPU)

to utilize its parallel processing capabilities. The GPU showed a ten-fold improvement in

processing speed over the CPU.

• In silico strain design: Optimal design of cell metabolism for metabolite overproduc-

tion is a rapidly growing area of research. Various linear, mixed-integer, and nonlinear


optimization problems have been formulated for this purpose. These problems are chal-

lenging as they typically become exponentially larger as the model becomes more detailed.

The design scope or model size is often reduced to make the design problem tractable.

This thesis has made contributions to the problem of optimal strain design by developing

an efficient strain design algorithm called EMILiO (Chapter 3). EMILiO is based on a

bilevel optimization problem that is reformulated as an MPCC and efficiently solved (to a

local optimum) using successive linear programming (SLP). Subsequent steps in EMILiO

ensure that both minimal and alternative designs are systematically and efficiently iden-

tified.

• Assessing robustness of a strain design: This thesis has explored the potential for

in silico design of strains that are robust against gene expression noise, environmental

perturbations, and model parameter uncertainty (Chapter 4). Diversification of assets

(i.e., metabolic pathways) was shown to be an effective strategy for improving robustness

against all of these perturbations and model uncertainties. A larger number of diversified,

or redundant, pathways improved robustness against large perturbations; however, sen-

sitivity to small perturbations was also increased. Therefore, metabolic engineers should

be mindful of the trade-offs inherent in robust design. Furthermore, future robust strain

design efforts will require accurate characterization of the nature of environmental
perturbations, including, but not limited to, their magnitudes.

• Experiment design using noisy metabolomics data: Experiment design can be

aided by mathematical models to improve the efficiency of time and resource allocation

while maximizing the value of measurements. This thesis has explored the potential

impact of model-based experiment design using metabolomics data (Chapter 5). Accord-

ingly, a method was developed to assess the sensitivity of reaction flux and metabolite

concentration simulations to uncertainty in measurements in a subset of metabolites. This

sensitivity information, which is based on noisy metabolomics data sets, could be used to


efficiently choose a set of metabolite concentrations needing precise measurements.

• Designing optimal enzyme manipulations using kinetic models of metabolism:

The identification of optimal enzyme manipulation strategies using mathematical opti-

mization and kinetic models of metabolism is a challenging problem with practical ap-

plications. A number of optimization methods have been developed for this purpose, as

reviewed in Chapter 2. In Chapter 6, the optimization techniques used in the development

of EMILiO were successfully extended to the design of optimal enzyme manipulations us-

ing kinetic models of metabolism. The algorithm was tested using a kinetic model of E.

coli central metabolism (Chassagnole et al., 2002). The model is used widely for testing

strain design algorithms. It includes 17 intracellular metabolites and 30 enzyme-catalyzed

reactions whose rates are defined using nonlinear kinetic rate equations. The algorithm

developed in Chapter 6 was able to identify optimal enzyme manipulation strategies that

increased serine synthesis flux by 148% (relative to the reference flux) in less than one

minute of CPU time. We found that this computational efficiency came at the cost
of reduced specificity of the design: i.e., the number and types of enzyme manipulations
could not be easily controlled. A similar tradeoff was encountered in Chapter 3, and
potential avenues for improvement are discussed in Chapter 8.

Chapter 8

Recommendations for Future Work

• Include additional constraints in the bilevel optimization framework for strain

design: The bilevel optimization framework for strain design (Chapter 3) is not limited

to models of metabolism that include stoichiometric constraints alone. For example,

the optimization techniques used in EMILiO (Chapter 3) may be extended to models of

metabolism that include regulatory constraints. Specifically, the techniques may be used

to extend existing algorithms that identify optimal gene knockout or expression strategies

(Kim and Reed, 2010), to the identification of optimal gene expression levels in mod-

els that describe both metabolic and transcriptional regulatory networks. Furthermore,

strategies for identifying knockout and optimal gene expression levels may be applied

to models that describe transcriptional regulation using quantitative flux bounds, rather

than Boolean rules, such as the PROM model (Chandrasekaran and Price, 2010).

A special case of additional constraints in the constraint-based modeling framework is the

addition of rate equation constraints. These constraints define the reaction rates as func-

tions of concentrations, enzyme levels, and kinetic parameters. We refer to these classes of

models as kinetic models of metabolism, and we found that optimization-based methods

could be applied efficiently for identifying optimal enzyme manipulations (Chapter 6).

Therefore, we recommend continued research in mathematical optimization-based design

using kinetic models of metabolism.

• Continued development of scalable techniques for optimal strain design using
kinetic models: In Chapter 6, efficient optimization techniques were employed for

strain design using kinetic models of metabolism. The nonlinear kinetic model of E. coli

central metabolism developed by Chassagnole et al. (2002) was approximated as a lin-log

kinetic model (Section 2.3.3). An optimization problem was formulated using this lin-log

model to identify optimal enzyme manipulations for maximizing serine synthesis (Section

6.3). This nonlinear optimization problem was successfully solved using the optimization

techniques developed in Chapter 3.

In future work, this method may be applied directly to large-scale models that use lin-log

rate equations. For example, Smallbone et al. (2010) developed a genome-scale model of

S. cerevisiae, in which reaction rates are described using lin-log rate equations. For the

model of central metabolism used in Chapter 6 (having 30 enzymes and 17 intracellular

metabolites), the optimization problem was solved in under a minute. Accordingly, it

would be interesting to investigate whether the methods developed here are indeed scal-

able to genome-scale models.

• Investigate the tradeoff between efficiency and specificity of designing a strain:

In Chapter 6, optimal enzyme manipulation strategies were found efficiently, but this

efficiency was achieved at the cost of lower specificity of the design scope. That is, control

over the number and types of enzyme manipulations allowed was decreased, and this can

directly influence the practical value of in silico designs. Thus, one avenue for improving

the algorithm is to develop efficient methods for restricting the design scope. In fact,

MILP-based (Vital-Lopez et al., 2006a) or MINLP-based (Nikolaev, 2010) methods have

been developed, which enable finer control over the scope of design. In future work, an

optimal tradeoff may be achieved by extending the current algorithm. For example, an

MILP can be developed to choose subsets of the enzyme manipulations subject to user-

defined thresholds for the production and number of enzyme manipulations, or to identify

(alternative) minimal subsets. Essentially, the MILP-based procedure is analogous to the

third phase employed in the EMILiO algorithm, which is described in Section 3.3.2.


• Investigate special conditions for optimal strain design: In Chapter 6, we iden-

tified an optimal enzyme manipulation strategy to maximize serine synthesis using the

kinetic model of E. coli central metabolism (Chassagnole et al., 2002). Essentially, all

enzyme levels were either increased to the maximum (i.e., two-fold increase) or decreased

to the minimum (i.e., halved). This enzyme manipulation strategy is similar to the bang-

bang optimal control strategy of chemical processes, where the control variable is set

equal to the lower or upper bound (San and Stephanopoulos, 1983). Bang-bang control

is an optimal control strategy when certain conditions are satisfied by the problem (San

and Stephanopoulos, 1983). Accordingly, an interesting direction for future work is to

investigate when, if ever, these conditions are satisfied in kinetic models, and whether

this can lead to simpler optimization problems for strain design.

• Optimal strain designs that are robust against model parameter uncertainty:

Kinetic models typically require the estimation of many parameters. These parameters

involve uncertainty (Miskovic and Hatzimanikatis, 2011), which may affect the feasibility

of the optimal strategies identified. Therefore, future work may be directed at adopting

the techniques of robust optimization (Ben-Tal and Nemirovski, 1998) for the identifica-

tion of enzyme manipulation strategies that are robust to model uncertainty.

• Investigate the use of alternative kinetic models: The optimization techniques

developed in this thesis are not limited to only the lin-log rate equations. Alternatively,

other forms of approximate rate equations (Vital-Lopez et al., 2006a), mechanistic rate

equations (Chassagnole et al., 2002), or hybrid models (Bulik et al., 2009) can be used. An-

other promising direction of research is to use large-scale models that describe metabolic

and regulatory interactions as mass action kinetics, as described by Jamshidi and Palsson

(2010). This modeling approach involves a large number of interactions, but the rate
expressions will share a consistent form of nonlinearity. Specifically, in the formulation of an

optimal enzyme manipulation problem, we can expect trilinear terms arising from the


product of two substrates and an enzyme level for each mass action rate equation.

Accordingly, the extension of the scalable optimization methods developed in this thesis

to alternative kinetic models will likely require the concurrent development or adoption

of optimization techniques.

• Investigate improved optimization techniques

In this thesis, the problem of identifying optimal enzyme levels using kinetic models

(Problem (6.1)) was solved through a straightforward application of the optimization

techniques developed for solving the EMILiO problem (in Chapter 3). For the reference

condition considered here, we could find an optimal solution that was comparable in the

rate of serine synthesis with previous studies (Vital-Lopez et al., 2006a), albeit with a

larger number of enzyme manipulations.

In future research, improved optimization techniques will need to be developed. This

conclusion was reached when we applied the optimization method directly to the original,

nonlinear kinetic model developed by Chassagnole et al. (2002). We found that the

optimization technique that worked well in this thesis did not converge to a satisfactory

solution when the original kinetic model was used. This obstacle was likely due to two

reasons.

First, the solution of (6.1) relies on the availability of tight convex underestimators, both

for the identification of good starting solutions, as well as to improve convergence to a

global optimum. For bilinear constraints, the McCormick relaxations are adequate since

they form the convex hull (McCormick, 1976). However, for general nonlinear constraints,

the McCormick relaxations no longer apply and other convex relaxations must be used.

In the area of convex relaxations, interested researchers are referred to the work
of Tawarmalani and Sahinidis (2005).

Second, the optimization method can be improved. We used a successive (iterative) linear

program with line search (Bullard and Biegler, 1991) in this thesis (Eq. 6.1). While this

technique worked well for problems involving bilinear constraints, it did not work as

well when applied to the original kinetic model that involved more complex, nonlinear

constraints. One direction for improvement is to use a trust-region method rather than


the line search (Biegler, 2010). The extra flexibility present in the trust-region method

may enable better approximations of the complex, nonlinear terms present in mechanistic

kinetic models of metabolism.

• Develop scalable software platforms for large-scale kinetic models

An eventual goal of developing scalable algorithms for design using kinetic models is to

apply the algorithms to genome-scale models such as that developed by Smallbone et al.

(2010). An immediate challenge to the use of genome-scale kinetic models is the lack

of software platforms that are capable of handling such large models (Smallbone et al.,

2010). Accordingly, simulation and visualization of genome-scale kinetic models is diffi-

cult. Indeed, while the development of scalable design algorithms may continue to progress

through the development of optimization techniques, there may nonetheless be a lack of

software platforms to simulate and interpret the optimal designs that are generated by

the algorithm. Therefore, a recommendation for future research is to collaborate with

software developers or computer scientists in order to accelerate the development of soft-

ware platforms that are capable of interpreting the results of, and potentially integrating,

optimal strain designs based on large-scale kinetic models. A potentially useful advance

in computational techniques is the use of graphics processing units (GPUs) for low-cost,

parallel computing (see Chapter 5 for the author’s experience with GPUs for modeling

metabolic networks).

• Experimentally validate computationally-generated hypotheses

This thesis has generated a number of hypotheses that may be experimentally validated

in future work. For example, a large number of strategies for succinate production have

been suggested in Chapter 3. Also, in Chapter 4, we have outlined an experimental

procedure based on the experiments of Sonntag et al. (1993) for testing our hypothesis

that diversification improves robustness against large perturbations at the cost of increased

sensitivity to small perturbations. In Chapter 5, we developed a method for designing

experiments to specify a small set of metabolites needing precise measurements in order

to improve model precision. Thus, in future work, the experiment design methodology


should be tested using an initial metabolomics dataset, followed by measurement of only

a few important metabolite concentrations. In Chapter 6, we hypothesized that certain

enzymes should be upregulated while others should be inhibited in order to maximize L-

serine production. In future work, a number of these hypotheses can be tested, although

we recommend first reducing the set of enzyme manipulations using pruning methods,

such as those developed in Chapter 3.

• Additional directions for future research

Finally, while not discussed in detail in this thesis, the optimization techniques used for

the design of optimal microbial strains may also be extended to higher organisms. For

example, industrial production of chemicals and therapeutics requires systematic methods

for designing superior and cost-effective growth media. Unlike the media used for the microbes
studied in this thesis, the growth media for mammalian cells typically contain a combination of

many substrates and nutrients. The development of accurate models of mammalian cell

metabolism may enable the use of mathematical optimization for cell medium design.

An immediate challenge for the application of CBM to modeling mammalian cells is the

identification of an appropriate objective function, since maximization of growth yield is

typically inaccurate for these systems. Alternatively, kinetic models may be developed.

In this case, effective methods for parameter estimation will be required. An interesting

problem is to design experiments that involve an optimal combination of experimental

data (i.e., metabolomics, fluxomics, proteomics, transcriptomics, steady-state data, or

transient data) with minimal cost in resources and time. To solve these important and

practical problems, the collaboration between experts in mathematical optimization and

experts in experimental techniques will be necessary.

Bibliography

Alper, H., Fischer, C., Nevoigt, E., and Stephanopoulos, G. (2005). Tuning genetic control

through promoter engineering. Proc. Natl. Acad. Sci. USA, 102:12678–12683.

Baker, T. E. and Lasdon, L. S. (1985). Successive linear programming at Exxon. Management
Science, 31:264–274.

Becker, S. A., Feist, A. M., Mo, M. L., et al. (2007). Quantitative prediction of cellular

metabolism with constraint-based models: the COBRA Toolbox. Nature Protocols, 2:727–

738.

Becskei, A. and Serrano, L. (2000). Engineering stability in gene networks by autoregulation.

Nature, 405:590–593.

Beg, Q. K., Vazquez, A., Ernst, J., et al. (2007). Intracellular crowding defines the mode and

sequence of substrate uptake by Escherichia coli and constrains its metabolic activity. Proc.

Natl. Acad. Sci. USA, 104:12663–12668.

Ben-Tal, A. and Nemirovski, A. (1998). Robust convex optimization. Mathematics of Operations

Research, 23:769–805.

Bennett, B. D., Kimball, E. H., Gao, M., et al. (2009). Absolute metabolite concentrations

and implied enzyme active site occupancy in Escherichia coli. Nature Chemical Biology,

5:593–599.

Benyamini, T., Folger, O., Ruppin, E., and Shlomi, T. (2010). Flux balance analysis accounting
for metabolite dilution. Genome Biol., 11:R43.


Biegler, L. T. (2010). Nonlinear programming: concepts, algorithms, and applications to chem-

ical processes. Society for Industrial and Applied Mathematics, Philadelphia, PA.

Buescher, J. M., Czernik, D., Ewald, J. C., Sauer, U., and Zamboni, N. (2009). Cross-Platform

Comparison of Methods for Quantitative Metabolomics of Primary Metabolism. Analytical

Chemistry, 81:2135–2143.

Bulik, S., Grimbs, S., Huthmacher, C., Selbig, J., and Holzhutter, H. G. (2009). Kinetic hybrid

models composed of mechanistic and simplified enzymatic rate laws - a promising method for

speeding up the kinetic modelling of complex metabolic networks. FEBS Journal, 276:410–

424.

Bullard, L. G. and Biegler, L. T. (1991). Iterative linear programming strategies for constrained

simulation. Computers and Chemical Engineering, 15:239–254.

Burgard, A. P., Nikolaev, E. V., Schilling, C. H., and Maranas, C. D. (2004). Flux coupling

analysis of genome-scale metabolic network reconstructions. Genome Research, 14:301–312.

Burgard, A. P., Pharkya, P., and Maranas, C. D. (2003). OptKnock: A bilevel program-

ming framework for identifying gene knockout strategies for microbial strain optimization.

Biotechnol. Bioeng., 84:647–657.

Chandrasekaran, S. and Price, N. D. (2010). Probabilistic integrative modeling of genome-scale
metabolic and regulatory networks in Escherichia coli and Mycobacterium tuberculosis. Proc.
Natl. Acad. Sci. USA, 107:17845–17850.

Chang, Y., Suthers, P. F., and Maranas, C. D. (2008). Identification of optimal measurement

sets for complete flux elucidation in metabolic flux analysis experiments. Biotechnology and

Bioengineering, 100:1039–1049.

Chassagnole, C., Noisommit-Rizzi, N., Schmid, J. W., Mauch, K., and Reuss, M. (2002). Dy-

namic modeling of the central carbon metabolism of Escherichia coli. Biotechnol. Bioeng.,

79:53–73.


Chatterjee, R., Millard, C. S., Champion, K., Clark, D. P., and Donnelly, M. I. (2001). Mutation
of the ptsG gene results in increased production of succinate in fermentation of glucose by
Escherichia coli. Appl. Environ. Microbiol., 67:148–154.

Chen, Z., Wilmanns, M., and Zeng, A. P. (2010). Structural synthetic biotechnology: from

molecular structure to predictable design for industrial strain development. Trends Biotech-

nol., 28:534–542.

Cole, S. T. and Guest, J. R. (1979). Amplification and aerobic synthesis of fumarate reductase
in ampicillin-resistant mutants of Escherichia coli K-12. FEMS Microbiol. Lett., 5:65–67.

Cornelius, S. P., Lee, J. S., and Motter, A. E. (2011). Dispensability of Escherichia coli's latent
pathways. Proc. Natl. Acad. Sci. USA.

Costa, R. S., Machado, D., Rocha, I., and Ferreira, E. C. (2011). Critical perspective on the

consequences of the limited availability of kinetic data in metabolic modeling. IET Systems

Biology, 5:157–163.

Covert, M. W., Knight, E. M., Reed, J. L., Herrgard, M. J., and Palsson, B. O. (2004).

Integrating high-throughput and computational data elucidates bacterial networks. Nature,

429:92–96.

Covert, M. W., Schilling, C. H., and Palsson, B. (2001). Regulation of gene expression in flux

balance models of metabolism. Journal of Theoretical Biology, 213:73–88.

Cox, S. J., Levanon, S. S., Sanchez, A., et al. (2006). Development of a metabolic network

design and optimization framework incorporating implementation constraints: a succinate

production case study. Metab. Eng., 8:46–57.

Csete, M. E. and Doyle, J. C. (2002). Reverse engineering of biological complexity. Science,

295:1664–1669.

De Mey, M., Maertens, J., Lequeux, G. J., Soetaert, W. K., and Vandamme, E. J. (2007).
Construction and model-based analysis of a promoter library for E. coli: an indispensable
tool for metabolic engineering. BMC Biotechnol., 7:34.


Dean, J. P. and Dervakos, G. A. (1998). Redesigning metabolic networks using mathematical

programming. Biotechnol. Bioeng., 58:267–271.

Edwards, J., Covert, M., and Palsson, B. (2002). Metabolic modelling of microbes: the flux-

balance approach. Environmental Microbiology, 4:133–140.

Edwards, J. S., Ibarra, R. U., and Palsson, B. O. (2001). In silico predictions of Escherichia coli

metabolic capabilities are consistent with experimental data. Nat. Biotechnol., 19:125–130.

Enfors, S.-O., Jahic, M., Rozkov, A., et al. (2001). Physiological responses to mixing in large
scale bioreactors. Journal of Biotechnology, 85:175–185.

Famili, I., Mahadevan, R., and Palsson, B. O. (2005). k-cone analysis: Determining all candidate

values for kinetic parameters on a network scale. Biophysical Journal, 88:1616–1625.

Feist, A. M., Henry, C. S., Reed, J. L., et al. (2007). A genome-scale metabolic reconstruction for

Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information.

Mol. Syst. Biol., 3:121.

Feist, A. M., Zielinski, D. C., Orth, J. D., et al. (2010). Model-driven evaluation of the

production potential for growth-coupled products of Escherichia coli. Metab. Eng., 12:173–186.

Fong, S., Nanchen, A., Palsson, B. O., and Sauer, U. (2006). Latent pathway activation and

increased pathway capacity enable Escherichia coli adaptation to loss of key metabolic

enzymes. Journal of Biological Chemistry, 281:8024–8033.

Fong, S. S., Burgard, A. P., Herring, C. D., et al. (2005). In silico design and adaptive evolution

of Escherichia coli for production of lactic acid. Biotechnol. Bioeng., 91:643–648.

Frey, B. J. and Dueck, D. (2007). Clustering by passing messages between data points. Science,

315:972–976.

Glass, J. I., Assad-Garcia, N., Alperovich, N., et al. (2006). Essential genes of a minimal

bacterium. Proc. Natl. Acad. Sci. USA, 103:425–430.

Glover, F. (1975). Improved linear integer programming formulations of nonlinear integer

problems. Manage. Sci., 22:445.

Heijnen, J. J. (2005). Approximative kinetic formats used in metabolic network modeling.

Biotechnol. Bioeng., 91:534–545.

Henry, C. S., Broadbelt, L. J., and Hatzimanikatis, V. (2007). Thermodynamics-Based

Metabolic Flux Analysis. Biophys. J., 92:1792–1805.

Henry, C. S., DeJongh, M., Best, A. A., et al. (2010a). High-throughput generation,

optimization and analysis of genome-scale metabolic models. Nature Biotechnology, 28:977–982.

Henry, C. S., Overbeek, R., and Stevens, R. L. (2010b). Building the blueprint of life. Biotechnol.

J., 5:695–704.

Herrgard, M. J., Swainston, N., Dobson, P., et al. (2008). A consensus yeast metabolic network

reconstruction obtained from a community approach to systems biology. Nature

Biotechnology, 26:1155–1160.

Hoops, S., Sahle, S., Gauges, R., et al. (2006). COPASI–a COmplex PAthway Simulator.

Bioinformatics, 22:3067–3074.

Hua, Q., Joyce, A. R., Fong, S. S., and Palsson, B. O. (2006). Metabolic analysis of adaptive

evolution for in silico-designed lactate-producing strains. Biotechnol. Bioeng., 95:992–1002.

Ibarra, R. U., Edwards, J. S., and Palsson, B. O. (2002). Escherichia coli K-12 undergoes

adaptive evolution to achieve in silico predicted optimal growth. Nature, 420:186–189.

Iuchi, S., Kuritzkes, D. R., and Lin, E. C. C. (1986). Three classes of Escherichia coli mutants

selected for aerobic expression of fumarate reductase. J Bacteriol., 168:1415–1421.

Jamshidi, N. and Palsson, B. O. (2010). Mass action stoichiometric simulation models:

incorporating kinetics and regulation into stoichiometric models. Biophysical Journal, 98:175–185.

Jantama, K., Haupt, M. J., Svoronos, S. A., et al. (2008). Combining metabolic engineering

and metabolic evolution to develop nonrecombinant strains of Escherichia coli C that produce

succinate and malate. Biotechnol. Bioeng., 99:1140–1153.

Jin, Y. S. and Stephanopoulos, G. (2007). Multi-dimensional gene target search for improving

lycopene biosynthesis in Escherichia coli. Metab. Eng., 9:337–347.

Joyce, A. R. and Palsson, B. O. (2006). The model organism as a system: integrating ’omics’

data sets. Nature Reviews Molecular Cell Biology, 7:198–210.

Kacser, H. and Burns, J. A. (1973). The control of flux. Symp. Soc. Exp. Biol., 27:65–104.

Kafri, R., Levy, M., and Pilpel, Y. (2006). The regulatory utilization of genetic redundancy

through responsive backup circuits. Proc. Natl. Acad. Sci. USA, 103(31):11653–11658.

Kafri, R., Springer, M., and Pilpel, Y. (2009). Genetic Redundancy: New Tricks for Old Genes.

Cell, 136(3):389–392.

Kaufman, D. E. and Smith, R. L. (1998). Direction choice for accelerated convergence in

hit-and-run sampling. Operations Research, 46:84–95.

Kim, J. and Reed, J. L. (2010). OptORF: Optimal metabolic and regulatory

perturbations for metabolic engineering of microbial strains. BMC Syst. Biol., 4:53.

Kirkpatrick, C., Maurer, L., Oyelakin, N., et al. (2001). Acetate and formate stress: Opposite

responses in the proteome of Escherichia coli. J. Bacteriol., 183(21):6466–6477.

Kitano, H. (2004). Biological robustness. Nat. Rev. Genet., 5:826–837.

Kitano, H. (2007). Towards a theory of biological robustness. Mol. Syst. Biol., 3:137.

Kitano, H. (2010). Violations of robustness trade-offs. Mol. Syst. Biol., 6:384.

Kucherenko, S., Rodriguez-Fernandez, M., Pantelides, C., and Shah, N. (2009). Monte Carlo

evaluation of derivative-based global sensitivity measures. Reliability Engineering & System

Safety, 94:1135–1148.

Lee, K. H., Park, J. H., Kim, T. Y., Kim, H. U., and Lee, S. Y. (2007). Systems metabolic

engineering of Escherichia coli for L-threonine production. Mol. Syst. Biol., 3:149.

Lin, H., Bennett, G. N., and San, K. Y. (2005). Chemostat culture characterization of

Escherichia coli mutant strains metabolically engineered for aerobic succinate production: a

study of the modified metabolic network based on metabolic profile, enzyme activity, and

gene expression profile. Metab. Eng., 7:337–352.

Lin, H., Castro, N. M., Bennett, G. N., and San, K. Y. (2006). Acetyl-CoA synthetase

overexpression in Escherichia coli demonstrates more efficient acetate assimilation and lower

acetate accumulation: a potential tool in metabolic engineering. Applied Microbiology and

Biotechnology, 71:870–874.

Lin, H., Vadali, R. V., Bennett, G. N., and San, K. Y. (2004). Increasing the acetyl-CoA pool

in the presence of overexpressed phosphoenolpyruvate carboxylase or pyruvate carboxylase

enhances succinate production in Escherichia coli. Biotechnol. Prog., 20:1599–1604.

Lun, D. S., Rockwell, G., Guido, N. J., et al. (2009). Large-scale identification of genetic design

strategies using local search. Mol. Syst. Biol., 5:296.

Mahadevan, R., Burgard, A. P., Famili, I., Van Dien, S., and Schilling, C. H. (2005).

Applications of metabolic modeling to drive bioprocess development for the production of value-added

chemicals. Biotechnol. Bioprocess Eng., 10:408–417.

Mahadevan, R., Edwards, J. S., and Doyle, F. J. (2002). Dynamic flux balance analysis of

diauxic growth in Escherichia coli. Biophysical Journal, 83:1331–1340.

Mahadevan, R. and Lovley, D. R. (2008). The degree of redundancy in metabolic genes is linked

to mode of metabolism. Biophys. J., 94:1216–1220.

Mahadevan, R. and Schilling, C. H. (2003). The effects of alternate optimal solutions in

constraint-based genome-scale metabolic models. Metab. Eng., 5:264–276.

McCormick, G. P. (1976). Computability of global solutions to factorable nonconvex programs:

Part I–convex underestimating problems. Math. Program., 10:147–175.

McGibney, G. and Smith, M. (1993). An unbiased signal-to-noise ratio measure for magnetic-

resonance images. Medical Physics, 20(4):1077–1078.

McKinlay, J. B., Vieille, C., and Zeikus, G. J. (2007). Prospects for a bio-based succinate

industry. Appl. Microbiol. Biotechnol., 76:727–740.

Melzer, G., Esfandabadi, M. E., Franco-Lara, E., and Wittmann, C. (2009). Flux design: In

silico design of cell factories based on correlation of pathway fluxes to desired properties.

BMC Systems Biology, 3:120.

Mendes, P. and Kell, D. B. (1998). Non-linear optimization of biochemical pathways:

application to metabolic engineering and parameter estimation. Bioinformatics, 14:869–883.

Mendonca, A. G., Alves, R. J., and Pereira-Leal, J. B. (2011). Loss of genetic redundancy in

reductive genome evolution. PLoS Comput. Biol., 7:e1001082.

Metris, A., George, S., and Baranyi, J. (2011). Modelling osmotic stress by flux balance analysis

at the genomic scale. Journal of Biotechnology, 104:77–85.

Millard, C. S., Chao, Y. P., Liao, J. C., and Donnelly, M. (1996). Enhanced production

of succinic acid by overexpression of phosphoenolpyruvate carboxylase in Escherichia coli.

Appl. Environ. Microbiol., 62:1808–1810.

Miskovic, L. and Hatzimanikatis, V. (2011). Modeling of uncertainties in biochemical reactions.

Biotechnol. Bioeng., 108:413–423.

Mo, M. L., Palsson, B. O., and Herrgard, M. J. (2009). Connecting extracellular metabolomic

measurements to intracellular flux states in yeast. BMC Systems Biology, 3.

Moran, N. A. (2002). Microbial minimalism: genome reduction in bacterial pathogens. Cell,

108:583–586.

Morari, M. and Zafiriou, E. (1989). Robust Process Control. Prentice Hall, Englewood Cliffs,

New Jersey.

Nakamura, C. E. and Whited, G. M. (2003). Metabolic engineering for the microbial production

of 1,3-propanediol. Current Opinion in Biotechnology, 14:454–459.

Newman, E. B., Batist, G., Fraser, J., et al. (1976). Use of glycine as nitrogen source by

Escherichia coli K12. Biochimica et Biophysica Acta, 421(1):97–105.

Nikolaev, E. V. (2010). The elucidation of metabolic pathways and their improvements using

stable optimization of large-scale kinetic models of cellular systems. Metab. Eng., 12:26–38.

Orth, J. D., Thiele, I., and Palsson, B. O. (2010). What is flux balance analysis? Nat.

Biotechnol., 28:245–248.

Patil, K. R., Rocha, I., Forster, J., and Nielsen, J. (2005). Evolutionary programming as a

platform for in silico metabolic engineering. BMC Bioinformatics, 6:308.

Pekkonen, M., Korhonen, J., and Laakso, J. (2011). Increased survival during famine improves

fitness of bacteria in a pulsed-resource environment. Evol. Ecol. Res., 13:1–18.

Peters-Wendisch, P., Stolz, M., Etterich, H., et al. (2005). Metabolic engineering of

Corynebacterium glutamicum for L-serine production. Appl. Environ. Microbiol., 71:7139–7144.

Pharkya, P., Burgard, A. P., and Maranas, C. D. (2003). Exploring the overproduction of amino

acids using the bilevel optimization framework OptKnock. Biotechnol. Bioeng., 84:887–899.

Pharkya, P., Burgard, A. P., and Maranas, C. D. (2004). OptStrain: a computational framework

for redesign of microbial production systems. Genome Research, 14:2367–2376.

Pharkya, P. and Maranas, C. D. (2006). An optimization framework for identifying reaction

activation/inhibition or elimination candidates for overproduction in microbial systems. Metab.

Eng., 8:1–13.

Pickett, A. M. and Bazin, M. J. (1980). Growth and composition of Escherichia coli subjected

to square-wave perturbations in nutrient supply: effect of varying amplitudes. Biotechnol.

Bioeng., 22:1213–1224.

Portnoy, V. A., Scott, D. A., Lewis, N. E., et al. (2010). Deletion of genes encoding cytochrome

oxidases and quinol monooxygenase blocks the aerobic-anaerobic shift in Escherichia coli

K-12 MG1655. Appl. Environ. Microbiol., 76:6529–6540.

Pozo, C., Guillen-Gosalbez, G., Sorribas, A., and Jimenez, L. (2011). A spatial branch-and-

bound framework for the global optimization of kinetic models of metabolic networks. Ind.

Eng. Chem. Res., 50:5225–5238.

Price, N., Thiele, I., and Palsson, B. (2006). Candidate states of Helicobacter pylori’s genome-

scale metabolic network upon application of “loop law” thermodynamic constraints.

Biophysical Journal, 90:3919–3928.

Purich, D. L. (2010). Enzyme kinetics: catalysis and control. Elsevier/Academic Press,

Amsterdam, Netherlands; Boston, MA.

Ranganathan, S., Suthers, P. F., and Maranas, C. D. (2010). OptForce: an optimization

procedure for identifying all genetic manipulations leading to targeted overproductions. PLoS

Comput. Biol., 6:e1000744.

Salis, H. M., Mirsky, E. A., and Voigt, C. A. (2009). Automated design of synthetic ribosome

binding sites to control protein expression. Nat. Biotechnol., 27:946–950.

Saltelli, A., Tarantola, S., and Campolongo, F. (2000). Sensitivity analysis as an ingredient of

modeling. Statistical Science, 15:377–395.

San, K. Y. and Stephanopoulos, G. (1983). Optimal-control policy for substrate inhibited

kinetics with enzyme deactivation in an isothermal CSTR. AIChE Journal, 29:417–424.

Sanchez, A. M., Bennett, G. N., and San, K. Y. (2005). Novel pathway engineering design of

the anaerobic central metabolic pathway in Escherichia coli to increase succinate yield and

productivity. Metab. Eng., 7:229–239.

Sanchez, A. M., Bennett, G. N., and San, K. Y. (2006). Batch culture characterization and

metabolic flux analysis of succinate-producing Escherichia coli strains. Metab. Eng.,

8:209–226.

Savinell, J. M. and Palsson, B. O. (1992). Optimal selection of metabolic fluxes for in vivo

measurement. 1. Development of mathematical methods. Journal of Theoretical Biology,

155:201–214.

Schellenberger, J. and Palsson, B. O. (2009). The use of randomized sampling for analysis of

metabolic networks. Journal of Biological Chemistry, 284:5457–5461.

Schellenberger, J., Tsai, E. A., and Palsson, B. O. (2007). Exploring the concentration space

of genome scale metabolic networks. In Eighth International Conference on Systems Biology,

page H20.

Schmid, J. W., Mauch, K., Reuss, M., Gilles, E. D., and Kremling, A. (2004). Metabolic design

based on a coupled gene expression-metabolic network model of tryptophan production in

Escherichia coli. Metab. Eng., 6:364–377.

Schrumpf, B., Schwarzer, A., Kalinowski, J., et al. (1991). A functionally split pathway for

lysine synthesis in Corynebacterium glutamicum. J. Bacteriol., 173:4510–4516.

Schuetz, R., Kuepfer, L., and Sauer, U. (2007). Systematic evaluation of objective functions

for predicting intracellular fluxes in Escherichia coli. Mol. Syst. Biol., 3:119.

Shaw, D. J. and Guest, J. R. (1982). Amplification and product identification of the fnr gene

of Escherichia coli. J. Gen. Microbiol., 128:2221–2228.

Shen, P., Chao, H., Jiang, C., et al. (2010). Enhancing Production of L-Serine by Increasing

the glyA Gene Expression in Methylobacterium sp. MB200. Appl. Biochem. Biotechnol.,

160(3):740–750.

Shirai, T., Nakato, A., Izutani, N., et al. (2005). Comparative study of flux redistribution

of metabolic pathway in glutamate production by two coryneform bacteria. Metab. Eng.,

7:59–69.

Shlomi, T., Cabili, M. N., and Ruppin, E. (2009). Predicting metabolic biomarkers of human

inborn errors of metabolism. Molecular Systems Biology, 5.

Smallbone, K., Simeonidis, E., Swainston, N., and Mendes, P. (2010). Towards a genome-scale

kinetic model of cellular metabolism. BMC Syst. Biol., 4:6.

Sonntag, K., Eggeling, L., De Graaf, A. A., and Sahm, H. (1993). Flux partitioning in the

split pathway of lysine synthesis in Corynebacterium glutamicum: quantification by 13C- and

1H-NMR spectroscopy. Eur. J. Biochem., 213:1325–1331.

Stephanopoulos, G. and Simpson, T. W. (1997). Flux amplification in complex metabolic

networks. Chem. Eng. Sci., 52:2607–2627.

Stolz, M., Peters-Wendisch, P., Etterich, H., et al. (2007). Reduced folate supply as a key to

enhanced L-serine production by Corynebacterium glutamicum. Appl. Environ. Microbiol.,

73:750–755.

Stols, L. and Donnelly, M. I. (1997). Production of succinic acid through overexpression of

NAD+-dependent malic enzyme in an Escherichia coli mutant. Appl. Environ. Microbiol.,

63:2695–2701.

Stolz, M., Peters-Wendisch, P., Etterich, H., et al. (2007). Reduced folate supply as a key to

enhanced L-serine production by Corynebacterium glutamicum. Appl. Environ. Microbiol.,

73(3):750–755.

Suiter, A. M., Banziger, O., and Dean, A. M. (2003). Fitness consequences of a regulatory

polymorphism in a seasonal environment. Proc. Natl. Acad. Sci. USA, 100:12782–12786.

Tamas, I., Klasson, L., Canback, B., et al. (2002). 50 million years of genomic stasis in

endosymbiotic bacteria. Science, 296:2376–2379.

Tawarmalani, M. and Sahinidis, N. V. (2005). A polyhedral branch-and-cut approach to global

optimization. Mathematical Programming, 103:225–249.

Tepper, N. and Shlomi, T. (2010). Predicting metabolic engineering knockout strategies for

chemical production: accounting for competing pathways. Bioinformatics, 26:536–543.

Tilman, D., Reich, P. B., and Knops, J. M. H. (2006). Biodiversity and ecosystem stability in

a decade-long grassland experiment. Nature, 441:629–632.

Varela, C. A., Baez, M. E., and Agosin, E. (2004). Osmotic stress response: quantification of

cell maintenance and metabolic fluxes in a lysine-overproducing strain of Corynebacterium

glutamicum. Appl. Environ. Microbiol., 70:4222–4229.

Varma, A., Boesch, B. W., and Palsson, B. O. (1993). Stoichiometric interpretation of

Escherichia coli glucose catabolism under various oxygenation rates. Appl. Environ. Microbiol.,

59:2465–2473.

Varma, A. and Palsson, B. O. (1994). Stoichiometric flux balance models quantitatively predict

growth and metabolic by-product secretion in wild-type Escherichia coli W3110. Appl.

Environ. Microbiol., 60:3724–3731.

Visser, D., van der Heijden, R., Mauch, K., Reuss, M., and Heijnen, S. (2000). Tendency

modeling: a new approach to obtain simplified kinetic models of metabolism applied to

Saccharomyces cerevisiae. Metab. Eng., 2:252–275.

Visser, D. and Heijnen, J. J. (2003). Dynamic simulation and metabolic re-design of a branched

pathway using linlog kinetics. Metab. Eng., 5:164–176.

Visser, D., Schmid, J. W., Mauch, K., Reuss, M., and Heijnen, J. J. (2004). Optimal re-design

of primary metabolism in Escherichia coli using linlog kinetics. Metab. Eng., 6:378–390.

Vital-Lopez, F. G., Armaou, A., Nikolaev, E. V., and Maranas, C. D. (2006a). A

computational procedure for optimal engineering interventions using kinetic models of metabolism.

Biotechnol. Prog., 22:1507–1517.

Vital-Lopez, F. G., Maranas, C. D., and Armaou, A. (2006b). Bifurcation analysis of the

metabolism of E. coli at optimal enzyme levels. In Proceedings of the 2006 American Control

Conference, pages 3439–3444.

Vlad, M. O., Corlan, A. D., Popa, V. T., and Ross, J. (2007). On anti-portfolio effects in

science and technology with application to reaction kinetics, chemical synthesis, and molecular

biology. Proc. Natl. Acad. Sci. USA, 104:18398–18403.

Wang, J., Zhu, J., Bennett, G. N., and San, K.-Y. (2011). Succinate production from different

carbon sources under anaerobic conditions by metabolic engineered Escherichia coli strains.

Metab. Eng., 13:328–335.

Wang, Q., Ou, M. S., Kim, Y., Ingram, L. O., and Shanmugam, K. T. (2010). Metabolic

Flux Control at the Pyruvate Node in an Anaerobic Escherichia coli Strain with an Active

Pyruvate Dehydrogenase. Appl. Environ. Microbiol., 76:2107–2114.

Wang, Z. and Zhang, J. (2011). Impact of gene expression noise on organismal fitness and the

efficacy of natural selection. Proc. Natl. Acad. Sci. USA, 108:E67–E76.

Yang, L., Cluett, W. R., and Mahadevan, R. (2010a). Rapid design of system-wide metabolic

network modifications using iterative linear programming. In Proceedings of the 9th

International Symposium on Dynamics and Control of Process Systems, pages 377–382.

Yang, L., Cluett, W. R., and Mahadevan, R. (2011). EMILiO: A fast algorithm for genome-scale

strain design. Metab. Eng., 13:272–281.

Yang, L., Mahadevan, R., and Cluett, W. R. (2008). A bilevel optimization algorithm to identify

enzymatic capacity constraints in metabolic networks. Computers and Chemical Engineering,

32:2072–2085.

Yang, L., Mahadevan, R., and Cluett, W. R. (2010b). Designing experiments from noisy

metabolomics data to refine constraint-based models. In Proceedings of the 2010 American

Control Conference, pages 5143–5148.

Yi, T. M., Huang, Y., Simon, M. I., and Doyle, J. (2000). Robust perfect adaptation in bacterial

chemotaxis through integral feedback control. Proc. Natl. Acad. Sci. USA, 97:4649–4653.

Yim, H., Haselbeck, R., Niu, W., et al. (2011). Metabolic engineering of Escherichia coli for

direct production of 1,4-butanediol. Nature Chemical Biology, 7:445–452.

Yu, C., Cao, Y., Zou, H., and Xian, M. (2011). Metabolic engineering of Escherichia coli for

biotechnological production of high-value organic acids and alcohols. Applied Microbiology

and Biotechnology, 89:573–583.

Yun, N. R., San, K. Y., and Bennett, G. N. (2005). Enhancement of lactate and succinate

formation in adhE or pta-ackA mutants of NADH dehydrogenase-deficient Escherichia coli.

Journal of Applied Microbiology, 99:1404–1412.

Zamboni, N., Kuemmel, A., and Heinemann, M. (2008). anNET: a tool for network-embedded

thermodynamic analysis of quantitative metabolome data. BMC Bioinformatics, 9.

Zeikus, J. G., Jain, M. K., and Elankovan, P. (1999). Biotechnology of succinic acid production

and markets for derived industrial products. Appl. Microbiol. Biotechnol., 51:545–552.

Zhuang, K., Goutham, V. N., and Mahadevan, R. (2011). Economics of membrane occupancy

and respiro-fermentation. Mol. Syst. Biol., 7:500.

Appendix A

The Robust Strain Design Algorithm

A.1 Succinate overproduction strains

The 98 succinate overproduction strains are defined below.
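
Each design in this listing has up to three parts: reactions that are knocked out, reactions with a modified lower flux bound, and reactions with a modified upper flux bound. The following is a minimal sketch, not the thesis implementation, of how one such design could be applied to a table of flux bounds. The names StrainDesign and apply_design and the wild-type bounds are hypothetical and chosen only for illustration. The knockout and bound values are those listed for Strain 1. A knockout is assumed to mean that both bounds are fixed at zero.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Flux bounds per reaction: {reaction_id: (lower_bound, upper_bound)}
Bounds = Dict[str, Tuple[float, float]]


@dataclass
class StrainDesign:
    """One design: knockouts plus modified lower and upper flux bounds."""
    knockouts: List[str] = field(default_factory=list)
    lower_bounds: Dict[str, float] = field(default_factory=dict)
    upper_bounds: Dict[str, float] = field(default_factory=dict)


def apply_design(bounds: Bounds, design: StrainDesign) -> Bounds:
    """Return a copy of `bounds` with the design applied; the input is not modified."""
    new_bounds = dict(bounds)
    # Assumption: a knockout fixes the reaction flux at zero.
    for rxn in design.knockouts:
        new_bounds[rxn] = (0.0, 0.0)
    # Replace only the lower bound, keeping the existing upper bound.
    for rxn, lb in design.lower_bounds.items():
        new_bounds[rxn] = (lb, new_bounds[rxn][1])
    # Replace only the upper bound, keeping the existing lower bound.
    for rxn, ub in design.upper_bounds.items():
        new_bounds[rxn] = (new_bounds[rxn][0], ub)
    # Reject designs whose bounds cross, since no flux could satisfy them.
    for rxn, (lb, ub) in new_bounds.items():
        if lb > ub:
            raise ValueError(f"infeasible bounds for {rxn}: {lb} > {ub}")
    return new_bounds


# Hypothetical wild-type bounds, for illustration only.
wild_type: Bounds = {
    "SUCDi": (0.0, 100.0),
    "ACONTa": (-100.0, 100.0),
    "FRD2": (0.0, 100.0),
}

# Strain 1 as listed in this appendix.
strain_1 = StrainDesign(
    knockouts=["SUCDi"],
    lower_bounds={"ACONTa": 6.35178, "FRD2": 26.0044},
)

designed = apply_design(wild_type, strain_1)
```

Returning a new mapping instead of editing in place keeps the wild-type bounds available, so many designs can be evaluated against the same starting point.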

Strain 1:

Knockout:

SUCDi

Modified lower bound Lower bound value

ACONTa 6.35178

FRD2 26.0044

Strain 2:

Knockout:

PPCSCT

SUCDi

XAND

Modified lower bound Lower bound value

FRD2 26.0044

MALS 4.78481

Modified upper bound Upper bound value

SUCOAS -1.40702

Strain 3:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 31.2654

ICL 0.0626559

Strain 4:

Knockout:

PPCSCT

SUCDi

UGLYCH

Modified lower bound Lower bound value

FRD2 26.0046

MALS 4.78481

Modified upper bound Upper bound value

SUCOAS -1.40702

Strain 5:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 26.0044

MDH -21.1158

Modified upper bound Upper bound value

ACCOAL 98.593

FUM -25.9006

SUCOAS -100

Strain 6:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 31.2654

ICDHyr 0.16873

Strain 7:

Knockout:

SUCDi

TRDR

Modified lower bound Lower bound value

FRD2 26.0044

ICDHyr 1.56704

MALS 4.78481

Strain 8:

Knockout:

ALLTAMH

SUCDi

Modified lower bound Lower bound value

FRD2 26.0044

MALS 4.78481

Modified upper bound Upper bound value

PPCSCT 98.593

SUCOAS -100

Strain 9:

Knockout:

SUCDi

Modified lower bound Lower bound value

AKGDH 1.45951

FRD2 26.0044

MDH -21.1158

Modified upper bound Upper bound value

FUM -25.9006

Strain 10:

Knockout:

SUCDi

XAND

Modified lower bound Lower bound value

FRD2 26.0044

ICDHyr 1.56704

MALS 4.78481

Strain 11:

Knockout:

SUCDi

PPCSCT

UGLYCH

Modified lower bound Lower bound value

FRD3 25.7367

MALS 0.0985183

FRD2 0.0439508

Modified upper bound Upper bound value

SUCOAS -5.9933

Strain 12:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 31.3763

Strain 13:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 26.0044

ICDHyr 1.56704

MDH -21.1158

Modified upper bound Upper bound value

FUM -25.9006

Strain 14:

Knockout:

METOX1s

METSOXR2

SUCDi

THIORDXi

Modified lower bound Lower bound value

FRD2 26.0044

MALS 4.78481

Modified upper bound Upper bound value

PPCSCT 98.593

SUCOAS -100

Strain 15:

Knockout:

PPCSCT

SUCDi

Modified lower bound Lower bound value

FRD2 26.0044

MDH -21.1158

Modified upper bound Upper bound value

FUM -25.9006

SUCOAS -1.40702

Strain 16:

Knockout:

SUCDi

XAND

Modified lower bound Lower bound value

FRD3 25.9197

MALS 6.27375

Strain 17:

Knockout:

SUCDi

XAND

Modified lower bound Lower bound value

FRD2 26.0044

MALS 4.78481

Modified upper bound Upper bound value

PPCSCT 98.593

SUCOAS -100

Strain 18:

Knockout:

PFL

SUCDi

Modified lower bound Lower bound value

FRD2 26.0044

MALS 4.78481

Modified upper bound Upper bound value

PDH 11.4641

PSERT 0.171463

TKT1B -100

Strain 19:

Knockout:

SUCDi

UGLYCH

Modified lower bound Lower bound value

FRD2 31.2654

MALS 0.0626855

Strain 20:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 26.0044

MALS 4.78481

Modified upper bound Upper bound value

GLYCL 0.00539765

PPCSCT 98.593

SUCOAS -100

Strain 21:

Knockout:

SUCDi

UGLYCH

Modified lower bound Lower bound value

FRD2 26.0046

MALS 4.78481

Modified upper bound Upper bound value

ACCOAL 98.593

SUCOAS -100

Strain 22:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 26.0044

ICL 4.78474

Modified upper bound Upper bound value

PPCSCT 98.593

SUCOAS -100

Strain 23:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 26.0044

ICL 4.78474

PPAKr 100

Modified upper bound Upper bound value

SUCOAS -1.40702

Strain 24:

Knockout:

PFL

SUCDi

Modified lower bound Lower bound value

FRD2 26.0044

MALS 4.78481

Modified upper bound Upper bound value

GLYCL 0.00539765

PDH 11.4641

TKT1 99.98

TKT1B -100

Strain 25:

Knockout:

SUCDi

THIORDXi

METSOXR1

METSOXR2

Modified lower bound Lower bound value

FRD2 31.2654

MALS 0.0626855

Strain 26:

Knockout:

SUCDi

XAND

Modified lower bound Lower bound value

AKGDH 1.45951

FRD2 26.0044

MALS 4.78481

Strain 27:

Knockout:

SUCDi

PPCSCT

UGLYCH

Modified lower bound Lower bound value

FRD3 25.0746

MALS 2.74481

Modified upper bound Upper bound value

SUCOAS -3.92675

Strain 28:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 26.0046

MDH -21.1158

PTA2 100

Modified upper bound Upper bound value

FUM -25.9006

SUCOAS -1.40702

Strain 29:

Knockout:

PPCSCT

SUCDi

Modified lower bound Lower bound value

FRD2 26.0044

ICL 4.78474

Modified upper bound Upper bound value

SUCOAS -1.40702

Strain 30:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 26.0044

ICDHyr 1.56704

MALS 4.78481

Modified upper bound Upper bound value

MTHFC 0

Strain 31:

Knockout:

ACCOAL

SUCDi

Modified lower bound Lower bound value

FRD2 26.0044

ICL 4.78474

Modified upper bound Upper bound value

SUCOAS -1.40702

Strain 32:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD3 25.9197

ICDHyr 6.31332

Strain 33:

Knockout:

METOX1s

METSOXR2

PPCSCT

SUCDi

THIORDXi

Modified lower bound Lower bound value

FRD2 26.0044

MALS 4.78481

Modified upper bound Upper bound value

SUCOAS -1.40702

Strain 34:

Knockout:

SUCDi

Modified lower bound Lower bound value

AKGDH 1.45951

FRD2 26.0044

ICL 4.78474

Strain 35:

Knockout:

SUCDi

Modified lower bound Lower bound value

ACONTb 6.35178

FRD2 26.0044

Strain 36:

Knockout:

SUCDi

PPCSCT

Modified lower bound Lower bound value

FRD3 25.7367

ICL 0.098784

FRD2 0.0436759

Modified upper bound Upper bound value

SUCOAS -5.99483

Strain 37:

Knockout:

ASPT

SUCDi

Modified lower bound Lower bound value

FRD2 26.0046

MDH -21.1158

PPC 27.7601

Modified upper bound Upper bound value

MTHFC 0.0986143

TKT1 99.98

TKT1B -100

Strain 38:

Knockout:

SUCDi

XAND

Modified lower bound Lower bound value

FRD2 26.0044

MALS 4.78481

PTA2 100

Modified upper bound Upper bound value

SUCOAS -1.40702

Strain 39:

Knockout:

CITL

SUCDi

Modified lower bound Lower bound value

CS 6.35178

FRD2 26.0044

Strain 40:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 26.0046

MDH -21.1158

Modified upper bound Upper bound value

FUM -25.9006

PPCSCT 98.593

SUCOAS -100

Strain 41:

Knockout:

ACCOAL

SUCDi

Modified lower bound Lower bound value

FRD2 25.5693

Modified upper bound Upper bound value

SUCOAS -6.40932

Strain 42:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 26.0044

ICDHyr 1.56704

ICL 4.78474

Strain 43:

Knockout:

GLYCL

SUCDi

Modified lower bound Lower bound value

FRD2 26.0044

ICDHyr 1.56704

MALS 4.78481

Strain 44:

Knockout:

PPCSCT

SUCDi

TRDR

Modified lower bound Lower bound value

FRD2 26.0044

MALS 4.78481

Modified upper bound Upper bound value

SUCOAS -1.40702

Strain 45:

Knockout:

ASPT

SUCDi

UGLYCH

Modified lower bound Lower bound value

FRD2 26.0046

MDH -21.1158

Modified upper bound Upper bound value

ACCOAL 98.593

ADSS 0.029821

SUCOAS -100

Strain 46:

Knockout:

METOX1s

METOX2s

SUCDi

THIORDXi

Modified lower bound Lower bound value

FRD2 26.0044

MALS 4.78481

Modified upper bound Upper bound value

PPCSCT 98.593

SUCOAS -100

Strain 47:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 26.0046

MALS 4.78481

Modified upper bound Upper bound value

FUM -25.9006

PPCSCT 98.593

SUCOAS -100

Strain 48:

Knockout:

ALLTAMH

SUCDi

Modified lower bound Lower bound value

FRD2 26.0044

MALS 4.78481

PPAKr 100

Modified upper bound Upper bound value

SUCOAS -1.40702

Strain 49:

Knockout:

SUCDi

ACCOAL

UGLYCH

Modified lower bound Lower bound value

FRD2 25.0746

MALS 2.74481

Modified upper bound Upper bound value

SUCOAS -3.92675

Strain 50:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD3 14.096

FRD2 12.6775

PPAKr 24.1691

Modified upper bound Upper bound value

SUCOAS -80.2984

Strain 51:

Knockout:

G6PDH2r

SUCDi

Modified lower bound Lower bound value

FRD2 26.0046

ICDHyr 1.56704

MALS 4.78481

Strain 52:

Knockout:

SUCDi

UGLYCH

Modified lower bound Lower bound value

FRD2 26.0046

MALS 4.78481

Modified upper bound Upper bound value

PPCSCT 98.593

SUCOAS -100

Strain 53:

Knockout:

ALLTAMH

SUCDi

Modified lower bound Lower bound value

FRD2 26.0044

MALS 4.78481

Modified upper bound Upper bound value

ACCOAL 98.593

SUCOAS -100

Strain 54:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD3 31.2654

ICL 0.0626559

Strain 55:

Knockout:

METOX2s

METSOXR1

SUCDi

THIORDXi

Modified lower bound Lower bound value

FRD2 26.0044

ICDHyr 1.56704

MALS 4.78481

Strain 56:

Knockout:

SUCDi

ALLTAMH

Modified lower bound Lower bound value

FRD2 31.2654

MALS 0.0626855

Strain 57:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 17.8535

Modified upper bound Upper bound value

SUCOAS -9.15938

PPCSCT 1.19021

Strain 58:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 28.7877

ICDHyr 0.649545

MALS 2.35202

Strain 59:

Knockout:

ALLTN

PPCSCT

SUCDi

Modified lower bound Lower bound value

FRD2 26.0043

MALS 4.78481

Modified upper bound Upper bound value

SUCOAS -1.40702

Strain 60:

Knockout:

SUCDi

TRDR

Modified lower bound Lower bound value

FRD2 31.2654

MALS 0.0626855

Strain 61:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 26.0046

ICL 4.78474

Modified upper bound Upper bound value

ACCOAL 98.593

SUCOAS -100

Strain 62:

Knockout:

PFL

SUCDi

Modified lower bound Lower bound value

FRD2 26.0046

ICL 4.78474

PGI 20

XYLI2 0

Modified upper bound Upper bound value

MTHFD 0

PDH 11.4641

Strain 63:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD3 31.2654

Strain 64:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 26.0046

MALS 4.78481

PGI 20

PPC 27.7601

XYLI2 0

Modified upper bound Upper bound value

GLYAT 0

GLYCL 0.00539765

PDH 11.4641

Strain 65:

Knockout:

FTHFD

SUCDi

Modified lower bound Lower bound value

FRD2 29.7161

PGI 3.34713

XYLI2 3.33587

Strain 66:

Knockout:

CITL

Modified lower bound Lower bound value

CS 11.26

DHAPT 2.97171

Modified upper bound Upper bound value

FUM -16.0841

PUNP1 -67.6209

Strain 67:

Modified upper bound Upper bound value

MDH -66.212

PPC 27.6366

PPCSCT 85.8866

SUCOAS -100

Strain 68:

Modified lower bound Lower bound value

FRD2 56.188

Modified upper bound Upper bound value

PPCSCT 88.9

SUCOAS -100

Strain 69:

Knockout:

PAPSR

PFL

Modified lower bound Lower bound value

ENO 28.4368

GND 33.4575

HSDy -0.0698123

Modified upper bound Upper bound value

DHDPS 0.037098

GRXR 0.0246302

Strain 70:

Knockout:

GRXR

PFL

Modified lower bound Lower bound value

MALS 11.1001

PYAM5PO 14.9525

TPI 19.8992

Modified upper bound Upper bound value

ASPK 0.10691

SUCOAS -100

TRDR 14.9772

Strain 71:

Knockout:

ME1

ME2

OAADC

PPA

PPCK

PPCSCT

Modified lower bound Lower bound value

HSDy -0.0698123

PDH 11.5876

PPC 27.6367

PTAr 99.626

Modified upper bound Upper bound value

SUCOAS -11.1

Strain 72:

Knockout:

ALR2

LALDO2x

PFL

Modified lower bound Lower bound value

AKGDH 11.324

ENO 36.7979

HSDy -0.412738

MGSA 2.96296

PPS 11.2628

Modified upper bound Upper bound value

ADK1 11.4935

ADK3 -100

DHDPRy 0.037098

Strain 73:

Knockout:

CITL

Modified lower bound Lower bound value

CS 11.26

PDH 22.6876

Modified upper bound Upper bound value

FUM -16.0841

PUNP1 -67.6209

Strain 74:

Knockout:

CITL

Modified lower bound Lower bound value

CS 11.26

Modified upper bound Upper bound value

FUM -16.0841

ICDHyr 0.160022

PUNP1 -67.6209

Strain 75:

Knockout:

CITL

Modified lower bound Lower bound value

CS 11.26

Modified upper bound Upper bound value

FUM -16.0841

PPC 16.5366

PUNP1 -67.6209

Strain 76:

Knockout:

PAPSR

PFL

Modified lower bound Lower bound value

ENO 28.4368

GND 33.4575

HSDy -0.0698123

Modified upper bound Upper bound value

DHDPS 0.037098

PAPSR2 0.0246302

Strain 77:

Knockout:

PFL

TRDR

Modified lower bound Lower bound value

ENO 28.4368

GND 33.4575

HSDy -0.0698123

Modified upper bound Upper bound value

DHDPS 0.037098

PAPSR2 0.0246302

Strain 78:

Knockout:

GLDBRAN2

PFL

Modified lower bound Lower bound value

GLBRAN2 100

MALS 11.1001

PYAM5PO 14.9525

TPI 19.8992

Modified upper bound Upper bound value

ASPK 0.10691

SUCOAS -100

Strain 79:

Knockout:

GLDBRAN2

PFL

Modified lower bound Lower bound value

GLBRAN2 100

ICDHyr 0.160022

MALS 11.1001

TPI 19.8992

Modified upper bound Upper bound value

ASPK 0.10691

PPA 0.315937

Strain 80:

Knockout:

GLDBRAN2

PFL

Modified lower bound Lower bound value

GLBRAN2 100

MALS 11.1001

TPI 19.8992

Modified upper bound Upper bound value

ASPK 0.10691

PPA 0.315937

SUCOAS -100

Strain 81:

Knockout:

GTHOr

ME1

ME2

OAADC

PPCK

PPCSCT

Modified lower bound Lower bound value

HSDy -0.0698123

PDH 11.5876

PPC 27.6367

Modified upper bound Upper bound value

SUCOAS -11.1

TRDR 0.0246302

Strain 82:

Knockout:

ALR2

GTHOr

LALDO2x

PFL

Modified lower bound Lower bound value

AKGDH 11.324

ENO 36.7979

HSDy -0.412738

MGSA 2.96296

Modified upper bound Upper bound value

DHDPRy 0.037098

TRDR 0.0299492

Strain 83:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 31.3763

MALS 0.00365797

Strain 84:

Knockout:

EX-o2-e-

Modified lower bound Lower bound value

FRD2 26.5417

ICL 0.754562

Modified upper bound Upper bound value

PPCSCT 98.593

SUCOAS -100

Strain 85:

Knockout:

EX-o2-e-

Modified lower bound Lower bound value

MALS 0.754629

NDPK1 92.8447

PPC 28.2974

XYLI2 8.78917

Modified upper bound Upper bound value

FUM -26.4379

PRPPS -99.9068

Strain 86:

Knockout:

EX-o2-e-

Modified lower bound Lower bound value

MALS 0.754629

NDPK1 92.8447

PPC 28.2974

TPI 19.8992

Modified upper bound Upper bound value

FUM -26.4379

PRPPS -99.9068

Strain 87:

Knockout:

EX-o2-e-

Modified lower bound Lower bound value

MALS 0.754629

NDPK1 92.8447

PPC 28.2974

Modified upper bound Upper bound value

FUM -26.4379

PDH 10.9267

PRPPS -99.9068

Strain 88:

Knockout:

EX-o2-e-

Modified lower bound Lower bound value

MALS 0.754629

NDPK1 92.8447

TPI 19.8992

XYLI2 8.78917

Modified upper bound Upper bound value

FUM -26.4379

PDH 10.9267

PRPPS -99.9068

Strain 89:

Knockout:

EX-o2-e-

Modified lower bound Lower bound value

MALS 0.754629

NDPK1 92.8447

PGI 11.2108

XYLI2 8.78917

Modified upper bound Upper bound value

FUM -26.4379

PDH 10.9267

PRPPS -99.9068

Strain 90:

Knockout:

EX-o2-e-

PFL

Modified lower bound Lower bound value

ICDHyr 1.56704

LDH-D 0

MALS 0.754629

Modified upper bound Upper bound value

MDH -25.6833

PDH 10.9267

Strain 91:

Knockout:

EX-o2-e-

PFL

PPCSCT

Modified lower bound Lower bound value

LDH-D 0

MALS 0.754629

Modified upper bound Upper bound value

MDH -25.6833

PDH 10.9267

SUCOAS -1.40702

Strain 92:

Knockout:

EX-o2-e-

PFL

PPCSCT

Modified lower bound Lower bound value

GLGC 99.684

MALS 0.754629

Modified upper bound Upper bound value

ACKr -7.4649

MDH -25.6833

SUCOAS -1.40702

Strain 93:

Knockout:

DRPA

EX-o2-e-

MGSA

PFL

Modified lower bound Lower bound value

GHMT2r 0.105953

GLGC 99.684

MALS 4.00779

Strain 94:

Knockout:

DRPA

EX-o2-e-

GRXR

MGSA

PFL

Modified lower bound Lower bound value

GHMT2r 0.105953

MALS 4.00779

Modified upper bound Upper bound value

TRDR 0.0272472

Strain 95:

Knockout:

SUCDi

Modified lower bound Lower bound value

AKGDH 5.89437

FRD2 26.0795

MALS 0.0440389

Strain 96:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 26.0046

MALS 4.78481

Modified upper bound Upper bound value

FUM -25.9006

SUCOAS -100

Strain 97:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 26.0046

MALS 4.78481

Modified upper bound Upper bound value

SUCOAS -100

Strain 98:

Knockout:

SUCDi

Modified lower bound Lower bound value

FRD2 26.0046

MALS 4.78481

Modified upper bound Upper bound value

PPCSCT 98.593

SUCOAS -100
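For readers who wish to reproduce a design from the listing above, the following minimal sketch shows one way a design could be encoded and applied to a table of flux bounds. The function name `apply_design`, the dictionary layout, and the default bounds of ±100 are illustrative assumptions and are not taken from the implementation used in this thesis; the reaction names and values are those of Strain 98.

```python
def apply_design(bounds, knockouts=(), lower=None, upper=None):
    """Return a copy of `bounds` (reaction -> (lb, ub)) with a design applied."""
    new_bounds = dict(bounds)
    # A knockout forces the reaction flux to zero.
    for rxn in knockouts:
        new_bounds[rxn] = (0.0, 0.0)
    # A modified lower bound keeps the existing upper bound.
    for rxn, lb in (lower or {}).items():
        new_bounds[rxn] = (lb, new_bounds[rxn][1])
    # A modified upper bound keeps the existing lower bound.
    for rxn, ub in (upper or {}).items():
        new_bounds[rxn] = (new_bounds[rxn][0], ub)
    # Reject designs that leave a reaction with an empty flux range.
    for rxn, (lb, ub) in new_bounds.items():
        if lb > ub:
            raise ValueError(f"infeasible bounds for {rxn}: {lb} > {ub}")
    return new_bounds


# Assumed default bounds of +/-100 for every reaction (hypothetical).
wild_type = {rxn: (-100.0, 100.0)
             for rxn in ("SUCDi", "FRD2", "MALS", "PPCSCT", "SUCOAS")}

# Strain 98 from the listing above.
strain_98 = apply_design(
    wild_type,
    knockouts=["SUCDi"],
    lower={"FRD2": 26.0046, "MALS": 4.78481},
    upper={"PPCSCT": 98.593, "SUCOAS": -100.0},
)
```

The input dictionary is copied rather than modified, so several designs can be derived from the same set of wild-type bounds.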

A.2 Simple example of the portfolio effect

We demonstrate the portfolio effect in metabolic networks using a simple example. In Fig. A.1, three networks are shown, having one, two, and three pathways, respectively. The total flux through all pathways is X. Now, assume that the flux X is a random variable with mean μ and standard deviation s. In the case of one pathway (Fig. A.1A), the standard deviation is assumed to have a value σ. In the case of two pathways (Fig. A.1B), the mean value of the total flux remains the same. However, the standard deviation is now σ/√2. For three pathways (Fig. A.1C), the mean of X is again constant; however, the standard deviation is σ/√3. Thus, in this simple demonstration of the portfolio effect (Tilman et al., 2006; Vlad et al., 2007), the presence of a larger number of alternative pathways reduces the variability of total flux.
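The 1/√m scaling follows if the m pathway fluxes are independent and each has mean μ/m and standard deviation σ/m, so that Var(X) = m(σ/m)² = σ²/m. This is an assumption not stated explicitly in the text. The short Python sketch below checks the scaling numerically; the function names and the choice of normally distributed pathway fluxes are illustrative only.

```python
import math
import random
import statistics


def total_flux_std(sigma, m):
    """Analytic std of the total flux for m independent pathways,
    each with standard deviation sigma / m (assumed)."""
    return math.sqrt(m * (sigma / m) ** 2)


def robustness(mu, sigma, m):
    """Robustness r = mean / standard deviation of the total flux."""
    return mu / total_flux_std(sigma, m)


def simulate_total_flux(mu, sigma, m, n_samples=20000, seed=0):
    """Monte Carlo estimate of the mean and std of the total flux."""
    rng = random.Random(seed)
    totals = [
        sum(rng.gauss(mu / m, sigma / m) for _ in range(m))
        for _ in range(n_samples)
    ]
    return statistics.mean(totals), statistics.stdev(totals)


if __name__ == "__main__":
    for m in (1, 2, 3):
        mean, std = simulate_total_flux(mu=10.0, sigma=2.0, m=m)
        print(m, round(mean, 2), round(std, 2),
              round(total_flux_std(2.0, m), 2))
```

Under these assumptions the sampled standard deviation should approach σ/√m as the number of samples grows, while the mean stays at μ.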

[Figure A.1 panels: (A) one pathway, m = 1, carrying flux X; (B) two parallel pathways, m = 2, each carrying X/2; (C) three parallel pathways, m = 3, each carrying X/3. Each panel shows the frequency distribution of X with mean E[X] = μ. Standard deviation: s = σ (A), σ/√2 (B), σ/√3 (C). Robustness: r = μ/σ (A), √2 μ/σ (B), √3 μ/σ (C).]

Figure A.1: Simple demonstration of the portfolio effect.

Appendix B

Simulation and Design using Kinetic

Models of Metabolism

B.1 Reference state and elasticity matrix

In Chapter 6 we tested our design algorithm using the kinetic model of E. coli central metabolism of Chassagnole et al. (2002). For a given reference state, consisting of n fluxes and m metabolite concentrations, the elasticity matrix $E \in \mathbb{R}^{n \times m}$ is defined as follows:

$$E(i,j) = \frac{\partial v_i}{\partial x_j} \cdot \frac{x_j}{v_i}, \quad i = 1, \ldots, n, \quad j = 1, \ldots, m, \tag{B.1}$$

where $v_i$ are fluxes and $x_j$ are concentrations.

We calculated E using automatic differentiation in MATLAB. The values of E are listed below

in Table B.1. The reference fluxes and concentrations used to generate the results in Chapter

6, as well as to calculate E are listed in Tables B.2 and B.3.
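As a minimal illustration of Eq. B.1, the Python sketch below computes scaled elasticities for a hypothetical two-reaction system. It uses central finite differences as a simple stand-in for the automatic differentiation used in this work. The rate laws, parameter values, and function names are assumptions chosen for illustration and are unrelated to the Chassagnole model.

```python
def scaled_elasticities(rate_fn, x_ref, rel_step=1e-6):
    """Approximate E(i, j) = (dv_i/dx_j) * x_j / v_i at x_ref
    using central finite differences."""
    v_ref = rate_fn(x_ref)
    n, m = len(v_ref), len(x_ref)
    E = [[0.0] * m for _ in range(n)]
    for j in range(m):
        h = rel_step * max(abs(x_ref[j]), 1.0)
        x_plus = list(x_ref)
        x_minus = list(x_ref)
        x_plus[j] += h
        x_minus[j] -= h
        v_plus = rate_fn(x_plus)
        v_minus = rate_fn(x_minus)
        for i in range(n):
            dv_dx = (v_plus[i] - v_minus[i]) / (2.0 * h)
            E[i][j] = dv_dx * x_ref[j] / v_ref[i]
    return E


def toy_rates(x):
    """Hypothetical rate laws: v1 is Michaelis-Menten in s,
    v2 is first order in p."""
    s, p = x
    v1 = 2.0 * s / (0.5 + s)
    v2 = 1.5 * p
    return [v1, v2]


E_toy = scaled_elasticities(toy_rates, [1.0, 2.0])
```

For the Michaelis-Menten rate law the scaled elasticity is Km/(Km + s), which is 1/3 at s = 1 with Km = 0.5, and a first-order rate law has a scaled elasticity of exactly 1, so `E_toy` should be close to [[1/3, 0], [0, 1]].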

Table B.1: Elasticity matrix at the reference state in sparse format.

The full elasticity matrix can be constructed by creating an n×m

matrix (n = number of fluxes and m = number of metabolites) of

zeros and filling in the non-zero entries at the row (reaction) and

column (metabolite) indices specified in the table below.

Reaction Metabolite Elasticity

PTS cpep 9.999815e-01

PTS cglcex 9.964954e-01

PTS cg6p -3.564447e+00

PTS cpyr -9.999815e-01

PGI cg6p 1.481332e+03

PGI cf6p -1.480573e+03

PGI cpg -5.539836e-01

PGM cg6p 2.008147e+01

PGM cg1p -2.006130e+01

G6PDH cg6p 8.129455e-01

PFK cpep -2.057819e+00

PFK cf6p 5.488340e+00

TA cf6p -1.551946e+01

TA csed7p 1.651946e+01

TA cgap 1.651946e+01

TA ce4p -1.551946e+01

TKA csed7p -1.199361e+01

TKA cgap -1.199361e+01

TKA cxyl5p 1.299361e+01

TKA crib5p 1.299361e+01

TKB cf6p -3.802860e+01

TKB cgap -3.802860e+01

TKB ce4p 3.902860e+01

TKB cxyl5p 3.902860e+01

ALDO cfdp 1.565349e+01

ALDO cgap -1.498767e+01

ALDO cdhap -1.492655e+01

GAPDH cgap 1.003187e+00

GAPDH cpgp -1.001987e+00

TIS cgap -1.618848e+01

TIS cdhap 1.672123e+01

G3PDH cdhap 8.431321e-01

PGK cpgp 1.097577e+02

PGK cpg3 -1.095900e+02

sersynth cpg3 3.067314e-01

rpGluMu cpg3 2.431108e+02

rpGluMu cpg2 -2.430364e+02

ENO cpep -1.737547e+02

ENO cpg2 1.737929e+02

PK cpep 9.888856e-02

PK cfdp 5.507438e-05

pepCxylase cpep 5.897203e-01

pepCxylase cfdp 1.818458e-01

Synth1 cpep 2.609891e-01

Synth2 cpyr 2.724525e-01

DAHPS cpep 1.180329e-03

DAHPS ce4p 2.410126e+00

PDH cpyr 3.565752e+00

PGDH cpg 9.792591e-01

R5PI crib5p -9.568694e+00

Ru5P cxyl5p -9.307892e+00

PPK crib5p 2.024370e-01

G1PAT cg1p 8.383144e-01

G1PAT cfdp 9.313728e-01

G6P cg6p 1

f6P cf6p 1

fdP cfdp 1.000000e+00

GAP cgap 1.000000e+00

DHAP cdhap 1

PGP cpgp 1

PG3 cpg3 1

pg2 cpg2 1.000000e+00

PEP cpep 1.000000e+00

RIB5P crib5p 1

XYL5P cxyl5p 1.000000e+00

SED7P csed7p 1.000000e+00

pyr cpyr 1

PG cpg 1.000000e+00

E4P ce4p 1

GLP cg1p 1

EXTER cglcex -3.967127e-04
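The construction of the full matrix from the sparse listing in Table B.1 can be sketched as follows. Only five entries of the table are used, and the ordering of reactions and metabolites along the rows and columns is a hypothetical choice made for this example; the function name is likewise illustrative.

```python
def build_elasticity_matrix(entries, reactions, metabolites):
    """Fill an n x m matrix of zeros from (reaction, metabolite, value) triplets."""
    row = {name: i for i, name in enumerate(reactions)}
    col = {name: j for j, name in enumerate(metabolites)}
    E = [[0.0] * len(metabolites) for _ in reactions]
    for rxn, met, value in entries:
        E[row[rxn]][col[met]] = value
    return E


# A few entries copied from Table B.1.
entries = [
    ("PTS", "cpep", 9.999815e-01),
    ("PTS", "cglcex", 9.964954e-01),
    ("PTS", "cg6p", -3.564447e+00),
    ("PTS", "cpyr", -9.999815e-01),
    ("G6PDH", "cg6p", 8.129455e-01),
]

# Hypothetical row and column ordering for this example.
reactions = ["PTS", "G6PDH"]
metabolites = ["cglcex", "cg6p", "cpep", "cpyr"]

E_sub = build_elasticity_matrix(entries, reactions, metabolites)
```

Any reaction and metabolite pair that does not appear in the table keeps an elasticity of zero.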

Table B.2: Reference flux for the model of E. coli central metabolism (Chassagnole et al., 2002)

vref =[ 3.083759e-03, 7.638286e-02, 2.648023e-03, 1.350352e-01, 9.706682e-02, 3.960809e-02, 3.961367e-02,

3.177509e-02, 4.371100e-04, 1.468660e-01, 3.307102e-01, 1.450466e-01, 1.037000e-03, 1.813584e-03, -7.664090e-02,

1.785526e-02, 3.068128e-01, 3.068026e-01, 3.810352e-02, 4.594254e-02, 1.446010e-02, 5.356195e-02, 7.829968e-03,

1.881657e-01, 2.262700e-03, 1.383618e-01, 4.991527e-02, 7.139249e-02, 1.029097e-02, 2.652691e-03, 9.250426e-05,

1.594627e-05, 9.214241e-06, 6.718378e-06, 5.141066e-06, 2.389480e-07, 6.317756e-05, 1.182871e-05, 7.914956e-05,

3.027655e-06, 1.096343e-05, 3.826403e-06, 6.915745e-06, 7.424102e-05, 2.215697e-05, 2.873981e-06, 1.724254e-05,

3.083453e-03]

Table B.3: Reference concentrations for the model of E. coli central metabolism (Chassagnole

et al., 2002)

xref =[ 2.847107e+00, 4.443915e-02, 3.327491e+00, 2.670540e+00, 5.736069e-01, 6.202351e-01, 7.970132e-01,

3.314475e-01, 2.487678e-01, 2.416683e-01, 1.033806e-01, 1.376404e-01, 3.943680e-01, 1.849304e-01, 8.595251e-03,

2.272574e+00, 4.254930e-01, 1.089085e-01]

Appendix C

Strain design for balanced yield,

titer, and productivity

The strain design algorithms developed in this thesis (Chapters 3 and 4) have used product

yield as the engineering target. However, industrial bioprocesses typically consider additional

objectives; namely, titer and (volumetric) productivity. Therefore, a practical consideration for

in silico strain design is to develop efficient methods that can quantify the tradeoff between the

three objectives (yield, titer, and productivity).

To address the need to design strains that balance yield, titer, and productivity, a novel com-

putational method was developed, called Dynamic Strain Scanning Optimization (DySScO).

Briefly, DySScO involves sampling the production envelope, assuming maximum product flux,

in order to identify the growth rates for which growth-coupled production would best balance

titer, yield, and productivity. These objectives are estimated using dynamic simulations of

a bioreactor that is coupled to flux balance analysis simulations (i.e., the dFBA framework

(Mahadevan et al., 2002)). Then, a strain design algorithm is used to design multiple strains

having growth rates within the range identified in the previous step. If the product yield is not

at the theoretical maximum for the defined growth rate, as is typically the case with knockout

mutants, then the yield, titer, and productivity of these strains are re-assessed. Finally, the

best strains are selected. The DySScO method is compatible with any method of dynamic

simulation and strain design. To maximize the efficiency of DySScO, an efficient strain design

algorithm is desired, in order to rapidly generate a large set of strains for screening. Accord-

ingly, we used the GDLS (Lun et al., 2009) algorithm to efficiently identify knockout strains.

This chapter includes the author’s contributions to the development and testing of DySScO,

which was carried out in collaboration with another Doctoral candidate, Kai Zhuang, in the

Department of Chemical Engineering at the University of Toronto.
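The tradeoff between the three objectives can be illustrated with a deliberately simplified batch calculation. The sketch below assumes constant specific growth, substrate uptake, and product secretion rates until the substrate is exhausted, which admits a closed-form solution. It is not the dFBA procedure used in DySScO, and all names and numbers are hypothetical.

```python
import math


def batch_metrics(mu, q_s, q_p, s0, x0):
    """Yield, titer, and productivity of a batch run with constant rates.

    mu  : specific growth rate (1/h), must be > 0
    q_s : specific substrate uptake rate (mmol/gDW/h)
    q_p : specific product secretion rate (mmol/gDW/h)
    s0  : initial substrate concentration (mmol/L)
    x0  : initial biomass concentration (gDW/L)
    """
    # Biomass grows as x0 * exp(mu * t); the substrate runs out at t_end.
    t_end = math.log(1.0 + mu * s0 / (q_s * x0)) / mu
    product_yield = q_p / q_s        # mol product per mol substrate
    titer = product_yield * s0       # mmol/L at t_end
    productivity = titer / t_end     # mmol/L/h
    return product_yield, titer, productivity


# Two hypothetical strains: one favours yield, the other growth.
high_yield = batch_metrics(mu=0.10, q_s=10.0, q_p=12.0, s0=100.0, x0=0.1)
fast_growth = batch_metrics(mu=0.25, q_s=10.0, q_p=9.0, s0=100.0, x0=0.1)
```

In this toy comparison the faster-growing strain has the lower yield and titer but the higher volumetric productivity, because it exhausts the substrate sooner.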

C.1 Introduction

A large number of computational strain design algorithms have been developed for identifying optimal metabolic network manipulation strategies using constraint-based models of metabolism. OptKnock (Burgard et al., 2003) was the first computational algorithm for systematically designing knockout strains for growth-coupled production of a biochemical. Growth-coupled production has been shown to be effective in certain conditions, such as in strains that are adaptively evolved for maximum growth yield (Fong et al., 2005; Hua et al., 2006). In addition to gene knockouts, the activation (Jin and Stephanopoulos, 2007) and inhibition (Nakamura and Whited, 2003) of reactions have been shown to enhance biochemical production. OptReg (Pharkya and Maranas, 2006) is a Mixed Integer Linear Program (MILP)-based algorithm that identifies activation and inhibition targets. Its limitations include the need to define activation and inhibition levels for all reactions prior to the identification of the optimal set of manipulated reactions, and a computational burden that typically exceeds that of OptKnock.

The computational difficulty of identifying globally optimal solutions to OptKnock has motivated the development of more efficient algorithms. Lun et al. (2009) developed Genetic Design through Local Search (GDLS) to efficiently obtain locally optimal solutions to OptKnock. The local search constraint is generally applicable to any MILP. Thus, Yang et al. (2011) developed OptReg’LS, a local search implementation of OptReg, and demonstrated that the locally optimal strains performed similarly to those identified by the global OptReg problem, but in only a fraction of the time. Nonetheless, GDLS and OptReg’LS still suffer from an exponential increase in complexity with increasing scope of each local search. This is especially problematic if the product yield cannot be improved through sequential changes in a small number of reactions.

More recent advances include OptForce, which maximizes product yield by identifying knock-

out, inhibition, and activation targets, relative to a wild-type flux distribution (Ranganathan

et al., 2010); and EMILiO, which rapidly identifies the optimal set of modified reactions and

their optimal fluxes using a successive linear programming procedure (Yang et al., 2011). Alter-

natives to the bilevel optimization-based strain design have also been developed. For example,

evolutionary programming enables the optimization of nonlinear objectives, and, in some cases,

it has been shown to be more efficient for identifying higher-order knockout strategies than

MILP-based formulations (Patil et al., 2005). Other studies used the enumeration of elemen-

tary modes to identify flux modification targets based on their correlation with the product

flux (Melzer et al., 2009). However, this method was applied to condensed versions of the orig-

inal genome-scale models, since the enumeration of elementary modes is still computationally

expensive. Thus, the field of computational strain design is clearly active, and more efficient algorithms to maximize the yield of overproducing strains are in continued development.

C.2 Results

We tested the capabilities of DySScO using two case studies: the design of succinate and 1,4-butanediol (BDO) overproduction strains. We used the iAF1260 genome-scale model of E. coli metabolism for both case studies. To design BDO overproduction strains, the BDO biosynthesis pathways described by Yim et al. (2011) were added to the iAF1260 model.

C.2.1 Succinate strains using GDLS

Many of the strategies for succinate overproduction identified in this work (Table C.1) over-

lapped with those found in the literature. For example, the knockout of competing fermenta-

tion products like formate, ethanol, and lactate is a common experimental strategy (Yu et al.,

2011) and is consistent with the in silico knockout of PFL, ALCD2x, and LDH D. In addition,

knockout of the NADP-dependent malic enzyme (ME2) or glucose-6-phosphate dehydrogenase

(G6PDH2r) is consistent with previously identified in silico strategies (Feist et al., 2010). Es-

sentially, the individual knockout strategies identified in this work have also been identified in

previous computational studies, or have previously been experimentally implemented. The ma-

jor contribution of this work has been to add new value to these well-known knockout strategies

for the model-based improvement of yield, titer and productivity.

Table C.1: Knockout strategies for succinate overproduction identified using GDLS

Succinate strains              YZ1      YZ2      YZ3
Growth rate (hr⁻¹)             0.16     0.24     0.21
Product yield (mol/mol glc)    1.27     0.89     0.92
Knockouts                      ALCD2x   F6PA     ACALD
                               GLUDy    G6PDH2r  F6PA
                               LDH D    ME2      G6PDH2r
                               PFL      MTHFD    GLUDy
                               PPKr     PFL      ME2
                               TKT2     PYK      PFL
                                                 PYK

C.2.2 Butanediol strains using GDLS

The BDO strains identified in this work are listed in Table C.2. The predicted flux distributions of the two BDO strains YZ4 and YZ5 were similar to that of YIM1260. Namely, all three strains used pyruvate dehydrogenase and the oxidative TCA cycle, and secreted acetate, all at similar levels. However, unlike YIM1260, in which malate dehydrogenase (MDH) is deleted, YZ4 and YZ5 utilize MDH in the reverse direction. Knockout of MDH in YZ5 reduces BDO yield by 97% while increasing reverse lactate dehydrogenase (LDH) activity. The additional knockout of LDH leads to YIM1260. Thus, deletion of MDH and LDH eliminates NADH-consuming reactions, such that excess NADH is channeled to the NADH-consuming BDO synthesis reactions, SSALcoax, 4HBDH, 4HBTALDDH, and BTDP2. While maximizing the channeling of NADH to BDO synthesis improves BDO yield, this is achieved at the cost of lowered growth rate. Strain YZ5, through reverse MDH activity, increases growth rate at the cost of lowered product yield. The DySScO strategy identified YZ5 as the strain that achieves the best tradeoff between yield and volumetric productivity, whereas previous strain design methods may have discarded YZ5 due to its lower yield.

Table C.2: Knockout strategies for BDO overproduction identified using GDLS

BDO strains                    YZ4      YZ5      YIM1260
Growth rate (hr⁻¹)             0.30     0.35     0.30
Product yield (mol/mol glc)    0.52     0.51     0.52
Knockouts                      ALCD2x   ALCD2x   ALCD2x
                               PFL      PFL      PFL
                               PGI               MDH
                               TKT2              LDH D

C.3 Methods

The GDLS (Genetic Design through Local Search) algorithm (Lun et al., 2009) was used to

identify knockout strategies for succinate and BDO overproduction in E. coli, using the iAF1260

genome-scale model. For each iteration of GDLS, we used a neighborhood size of 2, and a single

search path. In addition, we implemented constraints to prevent the local search from cycling

back to the previous solution. Each MILP problem (i.e., local search iteration) was given a

timeout threshold of 1,800 seconds. If the MILP problem did reach the timeout threshold, then

GDLS was continued only if a feasible, but not necessarily optimal, solution was identified. In

this work, every local search MILP indeed found a feasible solution even if the timeout threshold

was met.
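The role of the neighborhood size can be illustrated with a toy local search over knockout sets. In the sketch below each neighborhood is enumerated exhaustively and scored with a made-up function, whereas GDLS solves an MILP at each iteration; the function names, candidate reactions, and scores are all hypothetical.

```python
from itertools import combinations


def local_search(candidates, score, neighborhood=2, max_iter=50):
    """Greedy local search over knockout sets.

    At each iteration, every set that differs from the current one by at
    most `neighborhood` reactions is scored, and the best strictly
    improving set becomes the new current set.
    """
    current = frozenset()
    best_score = score(current)
    # Previously accepted sets are excluded from later searches, which
    # mirrors a constraint against cycling back to earlier solutions.
    visited = {current}
    for _ in range(max_iter):
        best_neighbor = None
        for k in range(1, neighborhood + 1):
            for flips in combinations(candidates, k):
                neighbor = current.symmetric_difference(flips)
                if neighbor in visited:
                    continue
                s = score(neighbor)
                if s > best_score:
                    best_score, best_neighbor = s, neighbor
        if best_neighbor is None:
            break  # no improving neighbor: locally optimal
        current = best_neighbor
        visited.add(current)
    return current, best_score


def toy_score(knockouts):
    """Hypothetical objective: A and B only help when removed together."""
    value = 0.5 if "C" in knockouts else 0.0
    if {"A", "B"} <= knockouts:
        value += 1.0
    return value - 0.1 * len(knockouts)


wide, wide_score = local_search("ABCD", toy_score, neighborhood=2)
narrow, narrow_score = local_search("ABCD", toy_score, neighborhood=1)
```

With a neighborhood of 2 the search finds the synergistic pair and ends at {A, B, C}, whereas a neighborhood of 1 stops at {C}, since no single additional knockout improves the score.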

The MILPs were solved with CPLEX 12.1 through the CPLEXINT interface, using up to 8 parallel threads on 2.4 GHz AMD Opteron processors.