
Representing, Analyzing, and Synthesizing Biochemical Pathways

Peter D. Karp
Artificial Intelligence Center
SRI International
333 Ravenswood Ave.
Menlo Park, CA 94025
voice: 415-859-6375
fax: [email protected]

L. Mavrovouniotis
Chemical Engineering Dept.
Northwestern University
2145 Sheridan Road
Evanston, IL 60208-3120
voice: 708-491-7043
fax: [email protected]

4, 1994

Keywords: biotechnology computing, bioinformatics, biochemistry, computational biology, metabolism

1 Abstract

Living cells are complex systems whose growth and existence depend on thousands of biochemical reactions. A subset of these reactions, the metabolism, interconverts small molecules. A variety of computational problems arise in representing knowledge of the metabolism in electronic form, in analyzing that knowledge to gain deeper insights into the complexities of the metabolism, and in using such knowledge in biology, biotechnology, and health applications. These problems provide a rich set of opportunities for exploiting existing AI techniques, and challenges for developing new and improved techniques. This article describes challenges and opportunities for addressing computational problems in the metabolism with techniques from knowledge representation, planning, integration of heterogeneous databases, qualitative reasoning, knowledge acquisition, and machine learning. The computational problems include construction of large shared knowledge bases of biochemical pathways, knowledge acquisition from the biochemical literature, qualitative simulation of metabolic pathways, thermodynamic estimation, synthesis of metabolic pathways, and scientific hypothesis formation.

2 The Domain of Biochemical Systems

Living cells are complex systems whose growth and existence depend on thousands of biochemical reactions. A subset of these reactions, the metabolism, interconverts hundreds of small molecules. Some molecules are nutrients from which energy is extracted; other molecules are building blocks for cellular structures; still others are wastes that must be excreted. The reactions are orchestrated by enzymes, which are proteins that allow the cell to accelerate and modulate reaction rates. The metabolism is extremely complex, and biochemists have amassed a vast amount of information about the metabolism of different organisms.

This article describes computational problems that arise in representing knowledge of the metabolism in electronic form, in analyzing that knowledge to gain deeper insights into the complexities of the metabolism, and in using such knowledge in biology, biotechnology, and health applications. These problems provide a rich set of opportunities for exploiting existing AI techniques, and challenges for developing new and improved techniques. This article describes challenges in knowledge representation, planning, integration of heterogeneous databases, qualitative reasoning, knowledge acquisition, and machine learning. Many of these problems are analogous to computer science research in computer-aided design and manufacturing, where the challenges include building databases and knowledge bases that describe the structures and functions of engineered devices, simulating and analyzing the behaviors of those devices, and designing new devices or redesigning existing ones to obtain desired behaviors. In the metabolic domain, the parts are biochemical compounds, enzymes, genes, and reactions; the devices are entire cells.

We begin with a brief introduction to biochemistry and the metabolism [1, 2]. The metabolism of a cell has many conceptual similarities to a circuit of electrical components, a plumbing system, or a transportation network. It is even more similar to a complex chemical plant, such as a petrochemical refinery, complete with measurements, control signals and algorithms, and control valves. These systems all consist of a number of distinct components configured in a network and interacting through specific local processes. Some components convey material (or electrical charge), others act as holding tanks or storage elements, and still others exert a selective controlling action based on the measurements and status of a particular subsystem. In the metabolism, the components are macromolecules and smaller molecules, and the modes of interaction include bioreactions, formation of intermolecular complexes, and regulatory effects of one molecule on the activity of another. These systems are all built from a large number of simple components (from just a few component classes) that, once connected in a network, give rise to very complex behaviors. We urge the reader to think about the analogies between biochemical systems and the other domains, but for the sake of brevity we will not invoke these analogies again.

Every biological organism (except viruses) consists of cells. Simple organisms, such as bacteria or yeast, consist of only one cell. A bacterial cell has a multilayer wall but lacks internal organelles; it is more like a soup. A yeast cell, on the other hand, has physically distinct internal structures, each containing a different soup.

[Figure 1 diagram: structural formulas of citrate, fumarate, tryptophan, and malate.]

Figure 1: Chemical compounds.

2.1 Biochemical Compounds

This biological soup contains a very large number of chemical substances, which continuously interact with each other in all conceivable combinations, as well as with the walls of the soup-can, which are also built from chemical substances. A molecule is a set of atoms that are connected by bonds, which are electron clouds forming the glue of molecules (Figure 1). The size of small molecules, or metabolites, varies within 2 orders of magnitude. By contrast, macromolecules are 2 to 3 orders of magnitude larger and consist of long chains of small molecules. These chains twist and fold to adopt intricate 3-dimensional shapes.

The molecules float in the soup, with many water molecules loosely attached to them. When they are near each other, there are forces between their atoms and bonds. These forces may change the orientation of nearby molecules, causing them to stay in proximity by forming a loose complex, or to form a long-lived complex that floats in the soup as one unit.

2.2 Biochemical Reactions

The intermolecular interactions can also cause a chemical reaction, which is the dissolution of some bonds and the formation of other bonds. Each molecule's structure is rearranged, and some of the atoms of one molecule may be transferred to the other. These interactions can involve any number of molecules: Two molecules can first form a bimolecular complex, which in turn interacts with a third molecule before a chemical reaction takes place. After the reaction, the complex might dissociate into individual molecules or smaller complexes. Figure 2 shows a biochemical reaction that transforms fumarate to malate. Throughout this

article, the term reaction will always refer to a biochemical reaction, also termed bioreaction or enzymatic reaction.

Figure 2: A biochemical reaction. Fumarate and water are converted to malate.

2.3 Enzymes

One of the interacting molecules in a reaction might act only as a catalyst: It facilitates the association of several molecules to form a complex, and it lowers the energy barrier required for the bond rearrangements that constitute a reaction. But in the end the molecule itself dissociates from the complex and returns to its original state. The biochemical soup contains hundreds or thousands of such molecules, called enzymes. Under biological conditions, most cellular reactions could not overcome these energy barriers without enzymes.

Enzymes are proteins, i.e., macromolecules consisting of a long sequence of compounds called amino acids. The structure of each enzyme is encoded in the cell's genome by a sequence of DNA: a gene. DNA is another type of macromolecule, and it serves as the central blueprint of the cell's operation. In order for an enzyme to catalyze a reaction, the molecules on which it acts, called substrates, must have just the right structures and orientations to interact with the intricate 3-dimensional shape of the enzyme. The enzyme is like a key (Figure 3): With a lock (substrate compound) of the right shape, the key (enzyme) fits in the lock (forms the right complex, part (b)); the key then turns in the lock (chemical bonds breaking and forming, part (c)); and the key can be taken out, unharmed, in the end, leaving the lock in a different configuration (substrate compounds converted into product compounds, part (d)).

Many enzymes carry out exactly one such transformation. But some enzymes act like master keys: They can catalyze a whole family of similar bioreactions. Other enzymes can catalyze unrelated bioreactions; they are more like small key-chains than single keys. Many different enzymes may catalyze the same bioreaction.
They may have similar active sites (serrated parts of the keys) or apparently different sites that happen to work on the same lock. In general, for a given bioreaction occurring in many different kinds of cells, each kind of cell may possess and use a different enzyme.

Figure 3: The interaction of an enzyme with its substrates. Note that the enzyme (key) is typically much larger than the substrates (lock).

The differences might be superficial, not

affecting at all the 3-dimensional conformation of the enzyme or the shape and operation of the active site. But they might be substantial enough to alter the activity of the enzyme: its rate and the way it is influenced by other molecules.

One confusing aspect of biochemical terminology is that names that appear to refer to individual enzymes actually refer to enzyme classes. The term malate dehydrogenase, for example, denotes the class of enzymes that carry out a particular bioreaction, but does not denote a specific enzyme unless we specify the source (the kind of cell and the gene) from which the enzyme carrying out this reaction was obtained. This distinction is crucial: The knowledge-base developer who finds in the literature data on "malate dehydrogenase" must realize that the data refer to only one of the enormous number of different proteins that catalyze the same bioreaction. These proteins can differ in their size, shape, rate, and every other aspect; their commonality is only their role as catalysts of the specific bioreaction. Lehninger estimates that in nature there are 10^11 different kinds of proteins, a substantial fraction of which are enzymes [1].

2.4 Pathways

The metabolism of a cell is the set of bioreactions that its enzymes can catalyze. Often several biochemical reactions act together in a sequence to transform a set of initial substrates into products with very different structures. Such a sequence of reactions is called a pathway. But we should emphasize that the definition of a biochemical pathway is not exact. There are always interactions among pathways. A pathway's substrates are usually the products of another pathway, and there are junctions where pathways meet or cross (a cell, after all, would have no use for a completely isolated pathway). Figure 4 shows a cyclic pathway involved in energy metabolism.

2.5 Regulation

The cell must vary the flow of metabolites through different pathways in response to changes in its environment and its activities. These variations and the overall coordination of the metabolism are achieved in part through regulation of the concentrations (amounts) of enzymes. Genetic control causes the production rate of an enzyme, i.e., the expression of the corresponding gene, to be affected by the presence or absence of certain molecules. The activity of enzymes, after they are produced, can also be regulated by specific molecules. The activity can be partly inhibited by a substrate or product of the bioreaction. A more sophisticated form of regulation is allosteric control, in which the activity is further affected by the presence or absence of other small molecules that bind to the enzyme and affect its active site. Figure 5 shows control points in the TCA cycle.

One of the features used in identifying and distinguishing a pathway is that regulatory interactions are largely contained within the pathway. For example, the genes of many

enzymes in a pathway may be in physical proximity in the genome; they may be expressed together and affected by the same regulators.

[Figure 4 diagram. Compounds on the cycle: oxaloacetate, citrate, isocitrate, 2-ketoglutarate, succinyl CoA, succinate, fumarate, malate. Entry: pyruvate, acetyl CoA. Other reactants shown: CoA-SH, CO2, NAD+, NADH + H+, ADP, ATP, FAD, FADH2, H2O. Enzymes: pyruvate dehydrogenase, citrate synthase, aconitase, isocitrate dehydrogenase, alpha-ketoglutarate dehydrogenase, succinate thiokinase, succinate dehydrogenase, fumarase, malate dehydrogenase.]

Figure 4: The TCA cycle converts pyruvate into carbon dioxide and into charged cofactors called NADH and FADH2; those cofactors produce energy in the form of ATP in a separate set of reactions. The central compounds of the pathway lie along the circumference of the circle. Other reactants are shown entering and leaving the circle. The names of the enzymes that catalyze each reaction are shown in bold face.

3 Application Areas

This section briefly outlines the application areas that give rise to computational problems related to the metabolism. Each area involves one or more reasoning problems for which significant expertise exists in the AI community, such as simulation, planning, redesign, diagnosis, and learning. Among the inputs to each reasoning problem is a base of metabolic knowledge. Therefore, management of metabolic information is also a central problem for every application area.
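To make the idea of "a base of metabolic knowledge" concrete, the following minimal sketch encodes the main-line reactions of the TCA cycle from Figure 4 as a list of records and checks that they form a closed cycle. The `Reaction` type and the `is_linked` and `is_cyclic` helpers are hypothetical names introduced only for illustration; they are not part of any system described in this article, and side reactants such as CO2 and NAD+ are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """One main-line step of a pathway (illustrative, not a real KB schema)."""
    substrate: str      # main-line compound consumed
    product: str        # main-line compound produced
    enzyme_class: str   # name of the enzyme class, as labeled in Figure 4


# Main-line steps of the TCA cycle, in the order drawn in Figure 4.
TCA_CYCLE = [
    Reaction("oxaloacetate", "citrate", "citrate synthase"),
    Reaction("citrate", "isocitrate", "aconitase"),
    Reaction("isocitrate", "2-ketoglutarate", "isocitrate dehydrogenase"),
    Reaction("2-ketoglutarate", "succinyl CoA", "alpha-ketoglutarate dehydrogenase"),
    Reaction("succinyl CoA", "succinate", "succinate thiokinase"),
    Reaction("succinate", "fumarate", "succinate dehydrogenase"),
    Reaction("fumarate", "malate", "fumarase"),
    Reaction("malate", "oxaloacetate", "malate dehydrogenase"),
]


def is_linked(pathway):
    """True if each step's product is the next step's substrate."""
    return all(a.product == b.substrate for a, b in zip(pathway, pathway[1:]))


def is_cyclic(pathway):
    """True if the steps are linked and the last product feeds the first step."""
    return (bool(pathway) and is_linked(pathway)
            and pathway[-1].product == pathway[0].substrate)
```

Even this toy encoding supports simple structural queries: the full list is cyclic, while dropping the malate dehydrogenase step leaves a linear pathway that is still linked but no longer closed.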

[Figure 5 diagram: the TCA cycle of Figure 4, with + and − marks placed on individual enzymes next to the regulating compounds NADH, acetyl CoA, ATP, NAD+, ADP, and succinyl CoA.]

Figure 5: Regulation of the TCA cycle. The activities of enzymes in the TCA cycle are modulated by several metabolic compounds. For example, the activity of the enzyme isocitrate dehydrogenase is increased by the presence of ADP, and decreased by the presence of NADH.

3.1 Bioprocess Engineering

Many useful chemical products, ranging from simple substances like ethanol to complex pharmaceuticals, can be produced by bioprocesses in which substrates are converted by cellular pathways into the useful product. Cellular processes are often preferred over traditional chemical conversion because they implement a long chain of chemical transformations in a single processing step with great specificity. In a bioprocess, the pathway of the conversion normally crosses and interacts with the rest of the cell's metabolism. Therefore, a central question in the improvement of a bioprocess is the analysis of the behavior of the metabolism and the identification of the location and nature of key interactions.

Novel bioprocesses, or improvements in the productivity of existing bioprocesses, are achieved through alterations in the genetic material of the cell. These alterations can drastically change the quantity or activity of an enzyme present in the cell, block a step in a pathway that consumes a desired product, or prevent the regulation of the pathway that produces the desired product and so allow higher rates or yields. For example, a slow enzyme that limits a pathway can be replaced by a faster one from another source, or a missing reaction can be inserted. Bioprocess engineering therefore requires design of novel metabolic pathways,

and simulation and estimation of their properties, which in turn benefit from knowledge of the metabolism of diverse organisms.

3.2 Studies of the Metabolism

Because current understanding of the metabolism is incomplete, a knowledge base of the metabolism could be mined for regularities that will provide a deeper understanding of the underlying principles of metabolism and of molecular biology. For example, consider the evolution of the metabolism. Morowitz argues [3] that metabolism recapitulates biogenesis, meaning that the organization of the interconnections among metabolic pathways reflects the order in which those pathways evolved. That hypothesis could be evaluated computationally with respect to a knowledge base of the metabolism. Morowitz's hypothesis resulted from his observation of several patterns within biochemical pathways, such as the existence of a core set of pathways in which all compounds lack nitrogen. If machine learning algorithms are applied to metabolic networks, other patterns and regularities might be found that contribute to our understanding both of how these networks function and of how they evolved. Similar insights might be derived from metabolic simulations and from searches for novel routes through metabolic networks.

Another example is that the function of an enzyme is determined by the three-dimensional shape of that protein. A complete understanding of the structure-function relationship would allow us to engineer new enzymes with desired characteristics, and to modify the function of existing enzymes. In coordination with other biological databases, a metabolic KB could drive a systematic study of enzymes that catalyze similar reactions and interact with similar substrate molecules, to correlate those functional similarities with similarities in 3-D shape or primary sequence.

3.3 Health-Related Applications

Computational representation of, and reasoning about, the metabolism will contribute to a better understanding of defects in human metabolism. The extent to which potentially toxic intermediates can accumulate because of a defective pathway depends on the presence, properties (such as reversibility), and interactions of bioreactions. Reasoning about the metabolism can thus be applied to the development of diagnostic tests for a metabolic defect (based on detection of intermediates), and to treatments in which toxic intermediates are catabolized or deactivated. Biochemical reasoning can identify the cases where the reactions preceding a defective step are reversible, so that it is effective to transform one of the intermediates lying upstream from the defect.

A significant issue in both diagnosis and treatment is the influence of the metabolism that surrounds a defective pathway. It is important to know whether the surrounding metabolism can provide routes that bypass the defective step or simply drain off the accumulating

intermediates. The identification of pathways that can fulfill such functions is another problem discussed in this article. Theoretical and computational tools may assist experts by providing them with qualitative conclusions that can guide an experimental investigation. Related problems include the design of drugs that intervene in the metabolism, and the study of the metabolic pathways of human nutrition.

4 Construction of a Knowledge Base of Biochemical Pathways

The volume of scientific knowledge of the metabolism is exploding. Scientists in a variety of fields will benefit from quick electronic access to information about metabolic enzymes, reactions, compounds, pathways, and genes. But access to that information is problematic because it is scattered throughout the voluminous biological literature. Therefore a need exists for a knowledge base from which scientists and reasoning programs can retrieve information about many aspects of metabolism. This section describes the computational problems encountered in the course of developing such a knowledge base.

4.1 Existing Databases of the Metabolism

Metabolic DBs have been constructed by a number of researchers; most of these DBs contain a relatively small number of metabolic reactions from a number of different organisms. A DB constructed by Mavrovouniotis for various pathway-synthesis and estimation problems describes 245 reactions [4]. The ENZYME DB constructed by Bairoch [5] is comprehensive in listing all reactions defined by the International Union of Biochemistry and Molecular Biology, which is the standard of enzyme nomenclature (2800 reactions) [6]. A DB constructed by Selkov et al. contains a vast array of information on enzymes from a variety of species [7].

The existing databases have a number of limitations. They do not contain comprehensive information about metabolic compounds, such as compound structures. Most DBs do not provide information about the physical properties of the enzymes that they list, such as their subunit structures, molecular weights, activators, inhibitors, or cross references to protein-sequence DBs (an exception is Selkov's DB). The DBs do not contain information about the genes encoding the listed enzymes, and they therefore provide no data on the underlying genetic basis of metabolism. Most DBs provide no information on the kinetics (rates) or mechanisms of individual reactions (Selkov's work is again an exception).

Another limitation of most existing metabolic databases is that they describe enzyme classes, not individual enzymes. These classes are species-invariant, whereas, as proteins, the

enzymes that catalyze metabolic reactions vary among different species. Different enzymes have different subunit structures, different substrate specificities, and different regulatory properties; they also catalyze reactions at different rates. Furthermore, the exact complement of metabolic enzymes varies among species. Therefore, despite its name, a database such as ENZYME lists the union of all metabolic reactions for which some species contains a catalyzing enzyme. Those DBs are of limited utility to researchers who are interested in a particular species, because the relevance of DB entries to a particular species is never made explicit.

A recent effort towards a metabolic knowledge base that overcomes these limitations is the EcoCyc project, a collaborative effort by Karp, Riley (of the Marine Biological Laboratory), and Rudd (of the National Library of Medicine). The EcoCyc project collects information about the genes and metabolism of a single organism (the bacterium E. coli), and it will attain a level of detail and comprehensiveness that is not present in any of the existing databases. The KB will describe each bioreaction of E. coli metabolism, the enzyme that carries out each bioreaction (including cofactors, activators, and inhibitors of the enzyme), and the subunit structure of the enzyme. When known, the genes encoding the subunits of an enzyme will be listed, as well as the position of each gene on the E. coli chromosome. In addition, the KB will describe every chemical compound involved in each bioreaction, listing synonyms for the compound name, the molecular weight of the compound, and in many cases its chemical structure. The current contents of the KB consist of 1,000 metabolic compounds [8], 15 pathways, and the approximately 100 reactions and enzymes involved in those pathways.
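The kinds of objects listed above (compounds with synonyms, enzymes tied to a specific organism, genes with map positions, and reactions that connect them) can be pictured with a small schema sketch. This is only an illustration under our own assumptions: the class and field names below are hypothetical and are not taken from the actual EcoCyc schema, and unknown values such as molecular weights and gene positions are left empty rather than guessed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Compound:
    name: str
    synonyms: List[str] = field(default_factory=list)
    molecular_weight: Optional[float] = None   # left unset when not known
    structure: Optional[str] = None            # encoding of the structure is left open


@dataclass
class Gene:
    name: str
    map_position: Optional[float] = None       # chromosome position; units left open


@dataclass
class Enzyme:
    name: str
    organism: str                              # a specific enzyme, not an enzyme class
    subunit_genes: List[Gene] = field(default_factory=list)
    cofactors: List[str] = field(default_factory=list)
    activators: List[str] = field(default_factory=list)
    inhibitors: List[str] = field(default_factory=list)


@dataclass
class Reaction:
    substrates: List[Compound]
    products: List[Compound]
    enzymes: List[Enzyme] = field(default_factory=list)


def find_compound(compounds, name):
    """Look a compound up by primary name or by synonym, ignoring case."""
    wanted = name.lower()
    for compound in compounds:
        known = [compound.name.lower()] + [s.lower() for s in compound.synonyms]
        if wanted in known:
            return compound
    return None


# Illustrative entries for the fumarate-to-malate reaction of Figure 2.
fumarate = Compound("fumarate", synonyms=["trans-butenedioate"])
water = Compound("water", synonyms=["H2O"])
malate = Compound("malate")
fumarase = Enzyme("fumarase", organism="E. coli")
hydration = Reaction([fumarate, water], [malate], [fumarase])
```

Recording the organism on each `Enzyme` is the design choice that matters here: it keeps entries about a specific protein separate from statements about the enzyme class that shares its name.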
We will explore other aspects of EcoCyc later in the paper.

4.2 Knowledge Acquisition

Past research in knowledge acquisition (KA) has generally assumed that information is to be acquired from the minds of a small number of human experts. In contrast, knowledge acquisition for the metabolism (and for scientific databases in general) draws on a variety of media, and would benefit from tools that aid human experts in locating, extracting, encoding, and citing information.

Knowledge sources for the metabolism include laboratory instruments, the primary and secondary literature, textbooks, existing databases, and the minds of expert biologists. The goal is a knowledge pipeline that carries knowledge from each of these sources to a metabolic KB. Knowledge acquisition tools should therefore be tailored for each of these sources, of which the biological literature may be the most challenging. More and more of that literature is becoming available online: The Medline database lists title, author, abstract, and keywords for most recent biomedical journal articles. A number of electronic library projects promise to bring the text of entire publications online.

One challenge to KA researchers is to develop sophisticated searching tools (based on natural language analysis, for example) that allow experts to find the information they wish to extract. Another challenge is to develop tools that streamline the process of extracting information from the literature and inserting it in a KB, ranging from cut-and-paste operations to more intelligent mechanisms that parse and comprehend a text region of interest and formulate an encoding of that information based on an understanding of the KB ontology. Such tools should also automatically generate citations from the KB back to the knowledge source. Citation is a novel aspect of scientific knowledge acquisition. The existence of a citation for a particular datum may influence a scientist's confidence in the datum, and will allow her to determine how the datum was derived and how it may be used. Therefore, items within a scientific KB should be tagged with citations whenever their source is not obvious. A very sophisticated citation maker would also extract and encode information about the experimental technique used to derive particular information.

Dr. Riley begins the KA process for EcoCyc by finding textbook descriptions of a metabolic pathway and its component reactions. For each reaction as it occurs in E. coli, she performs a literature search using bibliographic services such as Medline. The articles are then perused for the information specified in the EcoCyc schema. The schema describes what information is to be gathered for each class of object in the knowledge base, such as enzymes, reactions, and pathways. The composition of the schema is based on the uses foreseen for the EcoCyc KB, the desire to represent metabolic information as accurately as possible, and our understanding of what information can be found reliably and consistently in the literature. For example, the literature contains a large volume of data describing the kinetics of individual reactions, but the measurements are performed under different experimental conditions, and are therefore inconsistent. Because the task of gathering this huge amount of quantitative kinetic data is enormous, and the utility of these inconsistent data is questionable, this information is not included in the EcoCyc KB.
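The suggestion that items in a scientific KB carry citations can be sketched as follows. This is a minimal illustration, assuming a frame is simply a mapping from slot names to values; the `Citation` and `Datum` types, the `uncited_slots` helper, and the reference label are all hypothetical placeholders rather than anything taken from EcoCyc.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Citation:
    source: str            # e.g. a textbook, article, or database identifier
    technique: str = ""    # optional note on the experimental technique


@dataclass
class Datum:
    """A slot value together with the citations that support it."""
    value: Any
    citations: List[Citation] = field(default_factory=list)


def uncited_slots(frame: Dict[str, Datum]) -> List[str]:
    """Return the names of slots whose values carry no citation."""
    return sorted(slot for slot, datum in frame.items() if not datum.citations)


# Illustrative frame; "placeholder-reference" stands in for a real source.
isocitrate_dehydrogenase = {
    "activators": Datum(["ADP"], [Citation("placeholder-reference")]),
    "inhibitors": Datum(["NADH"]),
}
```

A curator could run a check like `uncited_slots` over every frame to list the data whose provenance still needs to be recorded.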
In addition to the literature, existing electronic databases are also consulted. The EcoCyc KB is managed using a frame knowledge representation system.

5 Knowledge Representation Challenges

A KB of the metabolism places many demands on the underlying information management system because of the many tasks that the system must support over the lifetime of the KB. Those tasks include knowledge acquisition; development and maintenance of the data, including subtasks such as validation, internal consistency checking, correction of errors, and cross-referencing of the data with other databases; and delivery of the data to end users for applications such as graphical browsers, simulation, computer-aided instruction, and machine learning. The information management system must be expressive enough to encode the knowledge and to support the inferences and query processing required by the applications. It also must address database issues such as concurrent multiuser access and high-speed access to large volumes of information.

Frame knowledge representation systems (FRSs) are good candidates for managing metabolic KBs, because of their expressive power, their support for inference, and the ease with which they support schema evolution. However, FRSs have several drawbacks:

they are unable to handle large knowledge bases or to support multiuser access to shared knowledge bases, and they are foreign to much of the computational biology community (the end users). This section suggests two ways of addressing these problems: by extending the capabilities of FRSs, and by employing different information-management technologies for different applications of a KB.

5.1 Multiple Information Management Systems

If no single information-management technology is optimal for all of the tasks required to support the development of a metabolic KB, one approach is to employ different technologies for different tasks. In EcoCyc, for example, an ASCII "data template" is used for the knowledge acquisition task; a frame knowledge representation system called THEO is used for internal development and maintenance; and final delivery to end users will be in three forms: a THEO knowledge base, the ASN.1 international data exchange standard, and ASCII table dump files that can be loaded into a relational database management system.

Distribution is of particular concern for most scientific database projects, because their central goal is usually to make information widely accessible to many different people in many different organizations. Since these users often have expertise in different types of information management systems, we can increase the accessibility of a scientific database by distributing that database in multiple formats, in the hope that each potential user will be familiar with at least one of the distributed formats. In the case of EcoCyc, the translation of the KB among a number of different representations is useful not only for final delivery, but also for other phases of the project.
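As a rough picture of what translating among representations involves, the sketch below flattens a set of frames into a tab-delimited table of the sort a relational system could load. It assumes frames are plain dictionaries mapping slot names to lists of strings; the function name, the frame contents, and the choice of ";" to join multiple values are our own assumptions, and the code does not reproduce THEO, ASN.1, or any actual EcoCyc file format.

```python
import csv
import io


def frames_to_table(frames, slots):
    """Render frames as tab-delimited text, one row per frame.

    Multi-valued slots are joined with ';'; missing slots become empty cells.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["frame"] + list(slots))
    for name in sorted(frames):
        row = [name]
        for slot in slots:
            row.append(";".join(frames[name].get(slot, [])))
        writer.writerow(row)
    return buffer.getvalue()


# Illustrative frames; slot names are hypothetical.
frames = {
    "fumarase": {"catalyzes": ["fumarate-hydration"]},
    "isocitrate dehydrogenase": {"activators": ["ADP"], "inhibitors": ["NADH"]},
}
table = frames_to_table(frames, ["catalyzes", "activators", "inhibitors"])
```

The flattening is lossy in an instructive way: a frame may have any slots, while a table fixes its columns in advance, which is one reason translation tools need to know the schema.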
This overall approach would clearly benefit from tools that automatically translate among the different representations that are employed; such tools are under development in the EcoCyc project.

5.2 Extending Frame Knowledge Representation Systems

Even a relatively simple metabolic knowledge base for a single organism will contain tens of thousands of objects, and it will have a size of tens of megabytes. As knowledge bases grow in size, existing knowledge representation systems will become more and more unwieldy. First, existing systems require that before a knowledge base can be accessed in any way, the entire knowledge base must be loaded into virtual memory. Second, they require that whenever a knowledge base is updated, it must be saved to disk in its entirety. Therefore, the time required to access a knowledge base initially, and to save knowledge base updates, is proportional to the size of the knowledge base. We would prefer that the time required to access the knowledge base be proportional to the amount of information actually accessed, and that the time required to save updates be proportional to the number of updates.

Ideally, a successful metabolic knowledge base will be in constant use by many scientists concurrently. The scientists will be distributed throughout the United States and abroad.
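The access-time goal stated earlier in this section (load only what is read, save only what changed) can be sketched with an on-demand frame store. This is an illustration under our own assumptions, not a description of any existing FRS: frames are stored as JSON text in an SQLite table, the class and method names are hypothetical, and the sketch does nothing to control concurrent updates.

```python
import json
import sqlite3


class LazyFrameStore:
    """Loads frames on first access and writes back only modified frames."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS frames (name TEXT PRIMARY KEY, body TEXT)"
        )
        self.cache = {}     # frames loaded so far
        self.dirty = set()  # names of frames changed since the last save

    def get(self, name):
        """Return a frame, reading it from disk only if it is not cached."""
        if name not in self.cache:
            row = self.db.execute(
                "SELECT body FROM frames WHERE name = ?", (name,)
            ).fetchone()
            self.cache[name] = json.loads(row[0]) if row else {}
        return self.cache[name]

    def put(self, name, slot, value):
        """Set one slot of a frame and mark the frame as modified."""
        self.get(name)[slot] = value
        self.dirty.add(name)

    def save(self):
        """Write modified frames only; return how many were written."""
        written = len(self.dirty)
        for name in self.dirty:
            self.db.execute(
                "INSERT OR REPLACE INTO frames VALUES (?, ?)",
                (name, json.dumps(self.cache[name])),
            )
        self.db.commit()
        self.dirty.clear()
        return written
```

With this arrangement the cost of opening the store does not depend on its size, and the cost of `save` grows with the number of modified frames rather than with the size of the knowledge base.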

They might be browsing through the knowledge base to look up individual facts, or they might be performing pathway design computations or simulations. By accessing the knowledge base remotely over a network, they would be guaranteed up-to-date information. New information might be entered by scientific curators who oversee small regions of the knowledge base that reflect their expertise. For example, one scientist might be updating information about energy metabolism, while another scientist is updating the knowledge base with newly obtained information about gene locations. Existing knowledge representation systems provide no mechanisms for controlling concurrent updates to a knowledge base.

5.3 Representational Challenges

The problem of encoding knowledge of the metabolism is a daunting one because of the complexity of this domain [9] and the elastic nature of the concepts that biologists employ. The complexity includes several types of hierarchical relationships, such as those among a compound and its constituent groups and atoms, an enzyme and its subunits and active sites, and a pathway and its reactions and subreactions. Many enzymes accept a wider range of substrate molecules than those present under physiological conditions; scientists have varied knowledge of this substrate specificity.

Biologists will wish to compose queries to metabolic databases using familiar terminology, but this terminology tends to change over time, and different biologists give different meanings to the same terms [10]. Consider a scientist who asks for all acidic proteins that synthesize a precursor of pyruvate and do not require a metal ion as a cofactor. The definitions of concepts such as precursor and cofactor are complex and variable.
Biologists should be able to choose among multiple alternative definitions of these terms.

6 Interconnection of Heterogeneous Databases

Both in-depth studies of biochemical pathways and applications that utilize metabolic information benefit from the availability of information from other areas of molecular biology. In the last 5 years, there has been an explosive growth in the number of biological databases available, and in the variety of database management technologies used to construct these resources. The most recent release of the LIMB database [11], which is a catalog of molecular biology databases, lists 110 different databases. These databases are maintained as flat files, in relational database management systems, in object-oriented database management systems, in custom database management systems, and in knowledge representation systems.

The ability to issue powerful queries that span multiple databases will allow scientists to answer complex questions (about the metabolism or other areas of biology) that are laborious to tackle today. As an example, suppose that we wish to undertake a comparative study of a biochemical pathway in a number of organisms to learn about the evolution of that pathway.

The GenBank database might contain the nucleotide sequences of the genes that code for the enzymes involved in that pathway, in a variety of organisms. The PIR database might contain the amino acid sequences of the enzymes, and the PDB database might contain 3-D structures for several of the enzymes. Medline might contain references to the literature on the enzymes. To access all this information now requires knowledge of a diverse set of DBMSs, query languages, graphical user interfaces, and operating systems. In many cases, the databases are not available for network access, i.e., they must actually be installed at the user's site. Answering one question by navigating through all of these information sources is difficult and time-consuming; to program a computation that performs a comparative study of several enzymes in a variety of organisms would be a nightmare.

Multidatabase systems are becoming a critical new area of computer science research, and biological databases provide a fertile testbed for new ideas. Various approaches to solving the multidatabase problem have been proposed in the literature, ranging from complete integration of the databases to be interconnected, to no integration at all. The choice depends critically on the nature of the applications to be supported. To determine the needs of a biological database application, its requirements have to be investigated in areas such as autonomy, consistency, data representation, language, access pattern, and semantic mismatch.

7 Simulation and Analysis of the Metabolism

The long-term objective of simulation of the metabolism is the development of general computational and theoretical techniques that can determine the behavior patterns and dominant mechanisms of a biochemical system. The computational treatment should guide experimental analysis by limiting the number of possibilities and testing alternative scenarios.
Simulation and analysis must be based on biochemical data, biological principles and regularities, as well as laws from the physical sciences (such as conservation principles and thermodynamics). One basic goal of theoretical and computational analysis of a biological system is to find out what possible processes (such as expression of specific genes, enzymatic activities, and metabolic pathways) are functioning or could be activated, and what their overall effect is, in terms of observable cell behavior. These questions are normally addressed on a case-by-case basis through extensive experimentation, but experiments are difficult and time-consuming, and there are often too many alternative hypotheses to investigate experimentally. If theoretical and computational work could at least narrow down the field of possibilities, experimentation would be far more effective.

The very general problem just defined is extremely difficult because the available knowledge is usually too sparse, uncertain, and qualitative. One can be easily deceived by the volume of available information, for example, on enzyme kinetics or gene regulation. But when one considers the uncertain or unknown parameters and phenomena, it becomes apparent that the available information is only a small fraction of that which is needed for reliable predictions.

7.1 Qualitative and Order-of-Magnitude Analysis

The current limited state of knowledge in the field dictates the use of qualitative descriptions, capable of supporting qualitative conclusions from knowledge that is commonly available. Note that even if detailed information is available for a pathway, a qualitative result may be more useful because it pertains to a whole class of similar pathways. For example, establishing qualitatively that a certain pathway operates over a whole range of conditions is often more useful than predicting the actual rate of the pathway under one specific set of conditions.

Qualitative reasoning about biological systems can adopt any level of abstraction or degree of quantitativeness. In a boolean representation, one might indicate the existence of constraints and interactions among parameters, without information on the direction or magnitude of such effects. An example is information indicating what metabolites participate in each bioreaction. One might choose to additionally represent the signs of variables and the direction in which each variable affects other variables, using one of the several approaches to qualitative reasoning.

A third approach that may be well suited to biochemical systems is Order of Magnitude reasoning [12], in which the rough relative magnitudes of parameters come into play. Domain experts often examine the relative orders of magnitude of parameters and use verbal descriptions denoting approximate relations among parameters.
It is known, for example, that in an exponentially growing culture of bacteria, the concentration of the molecule ATP (written as [ATP]) is much higher than [ADP]; when [ATP] becomes approximately equal to [ADP], bacteria die due to lack of energy; for anaerobic growth, the rates through the biosynthetic part of the metabolism are much lower than the rates through the energy metabolism. Similar rough relations can be produced for other compounds, pathways, and parameters. The relations are frequently derived from other approximate relations, algebraic equations, inequalities, or empirical rules. If, for example, it is known that [ATP] is much higher than [ADP], but approximately equal to [GTP], it can be concluded that [GTP] is much higher than [ADP]. Order of Magnitude reasoning is based on the representation of relations codifying the relative orders of magnitude of the parameters in a system. These relations formalize semiquantitative statements of the type "A is much smaller than B," or "C is of the same order as D" and allow inferences on such statements.

Formal Order of Magnitude reasoning provides a vocabulary for the description of previously informal notions and methods. It also allows the analysis of complex biochemical systems. However, Order of Magnitude reasoning assumes that relations are meaningful and valid regardless of context: if an expert states that A is much larger than B, having in mind only a particular permitted use of this statement (and regarding other uses as prohibited), then the statement cannot be represented as an isolated fact, and reasoning with it is risky. Nevertheless, Order of Magnitude knowledge is relevant in the analysis of biochemical systems because quantitative knowledge is harder to obtain, whereas qualitative knowledge is used extensively by the experts.

7.2 Thermodynamic Estimation and Evaluation

A number of concepts from the physical sciences can be exploited in developing methods for qualitative predictions. One such concept is the analysis of the feasibility and reversibility of a bioreaction, based on thermodynamic properties that can be estimated from the molecular structures of compounds.

Many of the bioreactions encountered in the literature, especially in collections such as the Enzyme Nomenclature [6], are given in a nominal direction (motivated by nomenclature conventions), unrelated to the direction in which they actually occur. Furthermore, some (but not all) reactions are reversible, i.e., can occur in either direction, depending on the amounts of the metabolites. A similar view can be taken for whole metabolic pathways when only their overall net effect, rather than their sequence of steps, is examined. Clearly, the role of a biotransformation in the metabolism is entirely different in its two possible directions. A pathway that appears to synthesize an amino acid may represent exclusively biosynthesis of that amino acid, exclusively biodegradation, or both biosynthesis and biodegradation, depending on other conditions, such as the energy and substrates available to the cell. The question of the direction of a given biochemical reaction or biotransformation is therefore important.

A thermodynamic criterion of the feasibility and reversibility of a transformation is the standard Gibbs energy, which is a function of the molecular structures of the compounds involved in the transformation.
A group-contribution method [4] relates this property to structure by providing a set of chemical substructures or groups, which serve as the building blocks for the compounds of interest. To estimate the Gibbs energy of a particular compound, one combines the contributions of the groups that are present in the structure (Figure 6 (a) and (b)). The Gibbs energy of a bioreaction is then derived from those of the participating compounds.

One of the difficulties encountered in developing the group-contribution method was that many biochemical compounds cannot be accurately described by just one arrangement of bonds in a structure [4]. A compound of this type is viewed as a hybrid of a number of conjugates, each conjugate being an alternative formal arrangement of the bonds (Figure 6 (a) and (c)). If the compound cannot be represented by a single structural formula, it cannot be properly decomposed into groups. One solution is to generate, computationally, important alternative conjugates and incorporate them into the thermodynamic estimation. Preliminary studies have demonstrated that this kind of approach performs remarkably well for simple compounds and can reduce the number of groups that are needed.

Figure 6: The estimation of thermodynamic properties that are important for determining the feasibility of a biotransformation can be carried out by decomposing the structure into a set of substructures called groups. For tryptophan (a), one such decomposition is shown in (b). Starting with the conjugate (a), the arrows in (a) indicate a rearrangement of bonds that leads to another conjugate (c).

Both group-contribution and conjugation-based methods require the molecular structures of metabolites. A complete computer-based application of group-contribution techniques requires algorithms that identify the groups present in a compound. Such pattern-matching algorithms have not yet been developed. Since different basis sets of groups are possible, machine learning techniques may enable a software system to improve its performance by altering the set of groups over time. In the case of conjugation, the generation, comparison, and analysis of conjugate forms implies computer-based manipulation of molecular structures, and this has been accomplished only for simple compounds. The efficient generation of the most influential conjugates of complex biochemical structures remains an open problem. This is a search problem in which operators produce one conjugate from another, constrained by rules for eliminating excessively unstable conjugates.

The preceding problems involving group decomposition and estimation of Gibbs free energy are pattern-matching problems for chemical compounds. Kazic discusses related problems, such as that of encoding compound structures hierarchically, and of inferring atom correspondences and patterns of bond breakage and creation in a bioreaction [13].
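
The additive idea behind the group-contribution estimate described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model is taken to be purely additive, the group names G1 to G3 and their numerical contributions are invented placeholders (not values from [4]), and the function names are hypothetical. The compounds are assumed to have been decomposed into groups already, which is exactly the pattern-matching step the text identifies as unsolved.

```python
# Hypothetical group contributions (kJ/mol); placeholder values for
# illustration only, not taken from the group-contribution method of [4].
GROUP_CONTRIBUTION = {"G1": -10.0, "G2": 5.0, "G3": -40.0}


def compound_energy(group_counts):
    """Sum the contributions of the groups present in one compound."""
    return sum(GROUP_CONTRIBUTION[g] * n for g, n in group_counts.items())


def reaction_energy(substrates, products):
    """Gibbs energy change of a reaction: products minus substrates.

    Each side is a list of (stoichiometric coefficient, group_counts) pairs.
    """
    def side_total(side):
        return sum(coef * compound_energy(groups) for coef, groups in side)

    return side_total(products) - side_total(substrates)


# Two hypothetical compounds, each already decomposed into groups.
compound_x = {"G1": 2, "G2": 1}
compound_y = {"G1": 1, "G3": 1}

delta_g = reaction_energy([(1, compound_x)], [(1, compound_y)])
print(delta_g)  # negative with these placeholder values
```

With real contribution values, the sign and magnitude of the result would indicate whether the reaction is feasible in the written direction under standard conditions. A conjugate-aware version would need several group decompositions per compound, which this sketch does not attempt.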

7.3 Open Problems

A particularly important open problem is the generation of models suitable for qualitative or quantitative simulation of the metabolism. This may seem surprising given that the most central phenomena in the metabolism are chemical reactions, whose modeling should not be complicated. A first issue is, as we mentioned earlier, the incompleteness of the data usually available; the task is the derivation of a model that is not just physically acceptable, but also avoids unrealistic expectations on the data necessary for simulation and analysis. This problem plagues many quantitative analysis methods.

Other difficulties arise when one considers that the activity of the enzyme catalyzing a bioreaction is modulated by other compounds and enzymes. Modeling the interaction and convolution of these effects at the most fundamental level would require accurate analysis of every factor's influence on the enzyme's 3D structure, a task that is not yet within reach.

Finally, a central problem is the derivation of models at a coarser level (such as models for a whole pathway, complete with its regulation) from finer-level models. Because the more detailed representation is characterized by many more components and variables, this process of model aggregation or reduction requires mathematical and computational techniques that go beyond the current state of qualitative reasoning and automated model generation.

In the thermodynamic estimation problem, as in many other problems of reasoning about the metabolism, we observe a hierarchy of solution complexities and accuracies: one may resort to the simplest possible groups, which are easy to identify computationally but provide only limited predictive capability. More complicated groups, which are harder to define and identify, improve the accuracy. For predictions that are more accurate still, the use of alternative forms of the compound adds a whole new dimension to the reasoning.
None of the above approaches fully takes into account 3-dimensional steric effects; 3-dimensional molecular modeling would thus be the next level of finer detail. This organization of the computational/chemical views of the problem eases the initiation of research efforts in this and other areas of the metabolism.

8 Synthesis of Metabolic Pathways

A pathway does not operate in isolation from other pathways, and thus cannot be considered a distinct physical entity. Rather, pathways are abstractions of sets of enzymatic reactions: they are substructures that are partitions of the metabolism. They are essential for understanding cell behavior because the metabolism is a large and intricate network of reactions.

There are often substantially different pathways that can accomplish the same metabolic function. This section is concerned with the problem of constructing alternative metabolic pathways, from individual bioreactions, that can accomplish a given metabolic function. A metabolic function describes an overall transformation of some metabolites to others which the biological system must accomplish. A metabolic function might be a transformation that is essential for the survival and growth of a microorganism; for example, many microorganisms need to derive glutamate from alpha-ketoglutarate and ammonia, in order to produce other amino acids. Thus, this transformation can be thought of as a distinct, identifiable goal for the microorganism. In other cases, a given overall transformation may describe derivation of the desired valuable products of a bioprocess from the available substrates that are the feedstock of the process.

The role of individual enzymes and intermediates for the metabolism can often be understood in terms of the pathways in which they participate and the metabolic functions these pathways fulfill. Common characteristics shared by many pathways, such as fixed intermediate metabolites or bioreactions, allow the identification of fundamental features in the metabolism. For example, if an enzyme occurs in all variants of pathways that accomplish a given metabolic function, then it is essential for accomplishing the transformation.

Although biochemistry textbooks discuss standard pathways for the biosynthesis of key building blocks and the catabolism of key substrates, those pathways are by no means unique or exhaustive of all the possibilities within the metabolism; they merely represent the pathways whose presence and coordination have been clearly established. Other pathways could be activated either spontaneously or by external intervention, especially in abnormal situations (mutations or genetic deficiencies).
Thus, in order to reason about metabolic functions and the pathways that can fulfill them, we need a computational method that constructs all pathways that can accomplish a given metabolic function, i.e., the transformation of a given set of substrates to specified products.

Although pathways are often viewed informally as sets of bioreactions, many distinct pathways can be constructed to include the same bioreactions but achieve different transformations. For example, the reactions A → B + C and 2B + C → D can form the pathways 2A →→ D + C and A + B →→ D, depending on whether the rates of the reactions are in 2:1 or 1:1 proportions. Thus, a fully specified pathway must include a coefficient for each bioreaction, to indicate the proportions at which the stoichiometries are combined. This gives the pathway synthesis problem a partially quantitative character.

Pathway construction can be thought of as a planning problem. Bioreactions are operators, and constraints are imposed by the targeted metabolic function as well as by the desired regulatory interactions between metabolites and enzymes. For example, the regulation of gene expression provides preconditions under which each enzyme (bioreaction operator) is available. Other constraints may stem from kinetic properties of enzymes, which determine the enzymes' ability to sustain the rate expected of the pathway.
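
The role of the per-reaction coefficients can be checked with a small sketch. It assumes a representation that is not prescribed by the text: each reaction is a dictionary from metabolite name to signed stoichiometry (negative for consumed, positive for produced), and the helper name net_transformation is hypothetical. It reproduces the two net transformations of the A, B, C, D example above.

```python
def net_transformation(reactions, coefficients):
    """Combine reactions in the given proportions and cancel intermediates.

    Each reaction maps metabolite -> signed stoichiometry
    (negative = consumed, positive = produced).
    """
    net = {}
    for reaction, coef in zip(reactions, coefficients):
        for metabolite, stoich in reaction.items():
            net[metabolite] = net.get(metabolite, 0) + coef * stoich
    # Metabolites whose production and consumption cancel drop out.
    return {m: s for m, s in net.items() if s != 0}


r1 = {"A": -1, "B": 1, "C": 1}   # A -> B + C
r2 = {"B": -2, "C": -1, "D": 1}  # 2B + C -> D

print(net_transformation([r1, r2], [2, 1]))  # {'A': -2, 'C': 1, 'D': 1}
print(net_transformation([r1, r2], [1, 1]))  # {'A': -1, 'B': -1, 'D': 1}
```

The first result reads 2A →→ D + C and the second A + B →→ D, so the same pair of reactions yields different pathways depending only on the coefficients.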

8.1 Pathway Synthesis Algorithm

A first approach for building pathways that accomplish a specified biotransformation might be the following. Begin with the metabolites that are the starting materials for the transformation, and consider all bioreactions that can transform some of these starting metabolites into other intermediate metabolites; then look for other bioreactions that can transform the new intermediates further, and so on, until the target products are reached. This approach has a number of disadvantages, the most important of which is that it can fail to construct pathways that contain a "loop," a rather common occurrence in the metabolism.

A complete algorithm that synthesizes all pathways that can accomplish a given transformation [14] is based on eliminating those metabolites that should be absent from the desired transformation. If, for example, we wish to transform compounds A and B into X and Y, then all other compounds besides A, B, X, and Y would be considered metabolites to eliminate. The algorithm does not actually construct the complete set of solutions directly; instead, it constructs a basis set out of which any other solution can be obtained as a simple combination. The algorithm examines, one at a time, those metabolites that must be absent from the target transformation, by altering a set of partial pathways. At the beginning, this set of partial pathways is the same as the set of available bioreactions. At each pathway expansion iteration, one of the undesired metabolites is chosen, and the set of active pathways is modified to eliminate the metabolite.
Elimination is accomplished by combining a pathway that consumes the metabolite with a pathway that produces it; by canceling out production with consumption, the metabolite is eliminated from the overall transformation. All the metabolites that must not be present in the target transformation are processed in this way, and the resulting pathways accomplish either the desired transformation or at least a portion of that transformation.

This algorithm was used in a study of the biosynthesis of the amino acid lysine. It constructed many non-obvious pathways for lysine, and determined that a particular metabolite, oxaloacetate, is an intermediate in all of them even though no single enzyme in the vicinity of oxaloacetate is essential. The algorithm also constructed pathways that bypassed enzymes (including pyruvate kinase and aspartate glutamate transaminase) that would at first appear essential.

8.2 Open Problems

The algorithm's practical applicability is limited by a fundamental difficulty: exponentially many pathways can result from the basis set that is constructed. Even though many of these pathways are very similar, the algorithm cannot take the similarities into account to limit its results. It is also very difficult for a user to comprehend a result consisting of several thousand pathways; as it stands, the algorithm is not able to organize its answers in any way.

The weakness of the algorithm is that it does not look for patterns in the basis set of pathways it synthesizes. Thus, if there is one central pathway with possible small variations here and there, a large number of very similar pathways will be produced by the algorithm. The number of pathways that satisfy a set of stoichiometric constraints is, in the worst case, exponential in the number of reactions. Consider the reactions depicted in Figure 7. For each diamond (numbered as D1, D2, etc.) consisting of two parallel branches, a pathway can follow either the upper or the lower branch. If there are n diamonds (and 4n reactions), there are n junctions where these choices occur. Thus, there are 2^n distinct pathways (no two of them involve the same set of enzymes).

Figure 7: A set of reactions giving rise to an exponential number of pathways. Metabolite A1 can be converted to metabolite A4 through either of two possible routes, one of which is the sequence of reactions that converts A1 to A3, and A3 to A4.

To cure this deficiency, one must devise a way to represent a set of similar pathways without enumerating them. In addition to its benefits in computational complexity, this type of organization of related pathways is essential to permit the human expert to "make sense" out of a set of entangled pathways. In the example of Figure 7, the description of the pathway set as a sequence of n substeps, with two alternatives for each step, is much easier to comprehend than the enumeration of all possible pathways. It is interesting to note that this description even makes apparent the role of the enzymes and the intermediates of the pathway; it is clear that A4 is always an intermediate in the biotransformation, while no single enzyme is essential. However, Figure 7 represents an idealized situation, and the definition of an appropriate compact description in the general case is an open problem.
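
The counting argument for the idealized chain of diamonds can be made concrete with a short sketch. It assumes, from the "4n reactions" figure, that each branch of a diamond consists of two reactions; the reaction labels such as "D1_upper_1" and all function names are hypothetical. The compact description stores only n steps with two alternatives each, while explicit enumeration grows as 2^n.

```python
from itertools import product
from math import prod


def diamond_chain(n):
    """Compact description: n steps, each with two alternative branches.

    Every branch is a pair of hypothetical reaction labels.
    """
    return [
        [(f"D{i}_upper_1", f"D{i}_upper_2"), (f"D{i}_lower_1", f"D{i}_lower_2")]
        for i in range(1, n + 1)
    ]


def count_pathways(steps):
    """Number of pathways, computed without enumerating them."""
    return prod(len(alternatives) for alternatives in steps)


def enumerate_pathways(steps):
    """Explicit enumeration: one set of reactions per pathway."""
    for choice in product(*steps):
        yield frozenset(r for branch in choice for r in branch)


steps = diamond_chain(10)
print(sum(len(branch) for alts in steps for branch in alts))  # 40
print(count_pathways(steps))  # 1024
```

For ten diamonds the compact description holds 40 reactions, yet it stands for 1024 pathways. Real metabolic networks will not factor this cleanly, which is the open problem the text describes.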
The hierarchical description of complex systems in other domains, such as electrical circuits and chemical plants, is related to this issue and may provide some insights.

In general, an algorithm for pathway synthesis would construct (a) a basis set of pathways and (b) a set of simple operations for the derivation of any solution from the basis. In the algorithm discussed earlier, the size of the basis is too large, and the operations used for the derivation of any solution are simple. A better algorithm should strike a balance by reducing the size of the basis and making the operations more complex, but only moderately so. With an appropriate representation, parallel processing will be of great value here. In fact, parallelism can be exploited in the existing algorithm, since many metabolites can be processed simultaneously.

Another open problem is the development of methods for ranking the synthesized pathways according to various criteria. If, for example, the pathways can be roughly ordered by the rates they can achieve, the user interested only in very fast pathways would only consider a small portion of the list.

9 Machine Learning

The metabolism offers several challenges for machine learning techniques. Machine learning offers the potential for organizing metabolic knowledge in new ways, for discovering new patterns and regularities within the metabolism, and for applying metabolic knowledge to scientific theory formation tasks.

9.1 Classification

For each biochemical component or function (e.g., enzymes and reactions), many detailed, multi-level classifications can be created to provide alternative views of a biochemical knowledge base. Consider the enzyme nomenclature standard, which is a classification system for 2800 enzymes that is maintained and published in book form [6]. The authors of this compendium note that the question of which dimension to classify enzymes along is not a trivial one. The authors are forced to choose a single method of classification because alternative classifications are extremely tedious to develop, query, or view without computers.

For metabolites, classification could be based on their chemical functionality, i.e., the substructures present in their molecular structures. Substructures can be included in, or be special cases of, other substructures; this pattern would give rise to a multi-level classification.
The overall metabolic role of a compound could also serve as a classifying feature: some metabolites serve primarily as sources of energy, others as building blocks for macromolecules, others as signalling molecules, and others as cofactors that activate enzymes.

In many cases, a classification could be generated computationally. The presence of chemical substructures, for example, can be defined formally and determined very precisely by algorithms. On the other hand, the classification of metabolites based on metabolic role can be automated only through heuristics, some of which are dependent on the context of use. If, for example, we classify a bioreaction as a growth reaction, or a metabolite as essential, we are actually making a statement about whether growth pathways that involve these objects are present. Other parts of the classification depend not on basic biological facts but on the technological application area. For example, classes such as "raw material," "final product," or "byproduct" for metabolites are only meaningful in the context of industrial bioprocess engineering. The identification of novel definitions (along with the accompanying classifications) amounts to discovery of new biological regularities. Note that many biological definitions that are used in classifications are not rigorous. The formalization of such definitions is an essential step in the classification process.

Both machine learning techniques and the classifiers of terminological knowledge representation systems could be applied to these classification tasks.

9.2 Scientific Hypothesis Formation

A hypothesis-formation problem exists when the outcome of an experiment that is predicted by a scientific theory does not agree with the observed outcome of that experiment. A hypothesis is a proposed modification to the theory, or to the assumed initial conditions of the experiment, that restores consistency between prediction and observation. Hypothesis formation led to our current knowledge of the metabolism, and continues to expand that knowledge.

The goal of solving hypothesis-formation problems by computer is a significant challenge for machine learning research; some philosophers of science believe that machines will never be able to duplicate (much less surpass) the creative thinking of human scientists. However, the study of biochemical pathways and their genetic regulation provides an excellent testbed for developing new machine learning methods for hypothesis formation. Recent projects provide significant evidence that machines are capable of emulating the reasoning that human scientists used to solve hypothesis-formation problems. Specifically, Kulkarni reproduced the reasoning by which Krebs discovered the Ornithine Cycle, a pathway for removing excess nitrogen from the cell [15].
Valdes-Perez generates alternative reaction pathways that account for the observed interconversion of a set of reactants to a set of products [16]. An example of this problem is to propose the reactions in the TCA pathway in Figure 4 given knowledge of only a few of the chemical intermediates involved. Finally, Karp reproduced elements of the discovery of the genetic regulation of the pathway for tryptophan biosynthesis [17]. Karp's HYPGENE program treats hypothesis formation as a design problem. HYPGENE generates hypotheses whose goal is to eliminate the difference between the observed and predicted outcomes of an experiment (the prediction error). A planner reasons backward from the prediction error to determine what modifications to the reaction theory or to the initial experimental conditions will eliminate the prediction error. Design operators then modify the theory or the initial conditions in a goal-directed fashion. HYPGENE solved only a handful of the hypothesis-formation problems identified in the historical reconstruction of the gene regulation mechanism called attenuation. The unsolved problems constitute a well-documented reservoir of challenges to machine-learning researchers.

The accelerating growth of biological KBs and DBs will increase both the power of, and the demand for, machines that formulate scientific hypotheses. Human scientists spend many years studying their field of inquiry before they are judged to have acquired sufficient knowledge to begin their own program of scientific research. Hypothesis formation is a knowledge-intensive process, and hypothesis-formation programs that expect to compete with or aid human scientists must be able to draw from large bases of scientific knowledge. As biological knowledge bases proliferate, they will dwarf the knowledge of individual human scientists. Although scientists can browse these knowledge bases manually in the course of formulating new hypotheses, they are likely to find that hypothesis-formation programs provide faster, more thorough searches of a large hypothesis space.

10 Summary

The metabolism offers an exciting range of difficult real-world problems to AI researchers in diverse areas including planning, qualitative reasoning, machine learning, knowledge representation, design, and database integration. The problems are hard enough to push the limits of existing AI techniques, but are not overwhelming. Workers in this field have the rare opportunity to contribute to two sciences simultaneously: computer science and biological science.

Acknowledgments

This article benefited from many discussions with Monica Riley and with Michael Liebman. This work was supported by grants 5-R29-LM-05278-02 and R29-LM-05413-01A1 from the National Library of Medicine, and by grant 1-R01-RR07861-01 from the National Center for Research Resources. The contents of this article are solely the responsibility of the authors and do not necessarily represent the official views of the National Institutes of Health.

References

[1] A. L. Lehninger, Principles of Biochemistry. New York: Worth Publishers, 1982.
[2] M. Dixon and E. Webb, Enzymes. New York: Academic Press, 1979.
[3] H. Morowitz, Metabolism Recapitulates Biogenesis. 1991.
[4] M. L. Mavrovouniotis, "Estimation of standard Gibbs energy changes of biotransformations," Journal of Biological Chemistry, vol. 266, pp. 14440-14445, 1991.

[5] A. Bairoch, "ENZYME database." Unpublished computer database, Centre Medical Universitaire, Geneva, 1992.
[6] E. C. Webb, Enzyme Nomenclature, 1992: Recommendations of the nomenclature committee of the International Union of Biochemistry and Molecular Biology on the nomenclature and classification of enzymes. Academic Press, 1992.
[7] E. Selkov, I. Goryanin, N. Kaimatchnikov, E. Shevelev, and I. Yunus, "Factographic data bank on enzymes and metabolic pathways," Studia Biophysica, vol. 129, no. 2-3, pp. 155-164, 1989.
[8] P. Karp, "A knowledge base of the chemical compounds of intermediary metabolism," Computer Applications in the Biosciences, vol. 8, no. 4, pp. 347-357, 1992.
[9] P. Karp and M. Riley, "Representations of metabolic knowledge," in Proc. of First International Conference on Intelligent Systems for Molecular Biology, pp. 207-215, Morgan Kaufmann Publishers, 1993.
[10] T. Kazic, "Representation, reasoning, and the intermediary metabolism of Escherichia coli," in Proc. of the 26th Annual Hawaii International Conference on System Sciences, vol. I, pp. 853-862, IEEE Computer Society Press, 1993.
[11] G. Keen, G. Redgrave, J. Lawton, M. Cinkowsky, S. Mishra, J. Fickett, and C. Burks, "Access to molecular biology databases," Mathl. Comput. Modelling, vol. 16, pp. 93-101, 1992.
[12] M. L. Mavrovouniotis, G. Stephanopoulos, and G. Stephanopoulos, "Formal modelling of approximate relations in biochemical systems," Biotechnology and Bioengineering, vol. 34, pp. 196-206, 1989.
[13] T. Kazic, "Reasoning about biochemical compounds and processes," in Second International Conference on Bioinformatics, Supercomputing and the Human Genome Project, 1993. In press.
[14] M. L. Mavrovouniotis, "Identification of qualitatively feasible metabolic pathways," in Artificial Intelligence and Molecular Biology (L. Hunter, ed.), AAAI Press / MIT Press, 1993.
[15] D. Kulkarni, The Processes of Scientific Research: The Strategy of Experimentation. PhD thesis, Carnegie Mellon University Computer Science Department, December 1988. CMU School of Computer Science Technical report 88-207.
[16] R. Valdes-Perez, Machine Discovery of Chemical Reaction Pathways. PhD thesis, Carnegie Mellon University Computer Science Department, 1990. CMU School of Computer Science Technical report CMU-CS-90-191.
[17] P. Karp, "Design methods for scientific hypothesis formation and their application to molecular biology," Machine Learning, vol. 12, pp. 89-116, 1993.