The Pathway Tools Ontology and Inferencing Layer
Peter D. Karp, Ph.D.
SRI International
SRI International Bioinformatics
Overview
Definitions
Ontologies ultimately exciting because of the inferences/computations they enable:
Where are the ontology killer apps?
Adding more facets to an ontology increases inferences that can be made with it
Pathway Tools ontology and associated applications
Terminology
Model Organism Database (MOD) – DB describing genome and other information about an organism
Pathway/Genome Database (PGDB) – MOD that combines information about
  Pathways, reactions, substrates
  Enzymes, transporters
  Genes, replicons
  Transcription factors, promoters, operons, DNA binding sites
BioCyc – Collection of 15 PGDBs at BioCyc.org
  EcoCyc, AgroCyc, YeastCyc
Terminology – Pathway Tools Software
PathoLogic
  Prediction of metabolic network from genome
  Computational creation of new Pathway/Genome Databases
Pathway/Genome Editors
  Distributed curation of PGDBs
  Distributed object database system, interactive editing tools
Pathway/Genome Navigator
  WWW publishing of PGDBs
  Querying, visualization of pathways, chromosomes, operons
  Analysis operations
    Pathway visualization of gene-expression data
    Global comparisons of metabolic networks
Bioinformatics 18:S225 2002
Pathway Tools Ontology: Terms and Taxonomy
Pathway Tools ontology contains 916 classes
Define datatypes
  Replicons, Genes, Operons, Promoters, Transcription Factor Binding Sites
  Proteins: Enzymes, Transporters, Transcription Factors
  Small molecule compounds
  Reactions, pathways
Define taxonomies
  Taxonomy of chemical compounds
  Riley’s gene ontology
  Taxonomy of metabolic pathways
  EC system
Bioinformatics 16:269 2000
Operations Enabled by Controlled Vocabulary
Equality testing:
  Is the function of gene X in organism A the same as the function of gene Y in organism B?
  Is location L1 in organism A the same as location L2 in organism B?
Operations Enabled by Taxonomy
Counting / Pie charts
  How many genes of category “small molecule metabolism” are in organism A?
Intersecting sets
  How many of these up-regulated genes are in class “cell cycle”?
User search via drill down
Applying rules
  If the substrate of X is an amino acid, then XXX
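The counting and intersecting operations above can be sketched in a few lines of Python. This is only an illustration, not Pathway Tools code: the taxonomy fragment, the gene-to-class assignments, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy taxonomy: child class -> parent class.
PARENT = {
    "amino acid biosynthesis": "small molecule metabolism",
    "carbon utilization": "small molecule metabolism",
    "cell cycle": "cell processes",
}

# Hypothetical gene annotations: gene -> most specific class.
GENE_CLASS = {
    "trpA": "amino acid biosynthesis",
    "lacZ": "carbon utilization",
    "ftsZ": "cell cycle",
    "minC": "cell cycle",
}


def ancestors(cls):
    """Yield cls and every class above it in the taxonomy."""
    while cls is not None:
        yield cls
        cls = PARENT.get(cls)


def genes_in_class(target):
    """Return the genes whose class is target or any descendant of it."""
    return {g for g, c in GENE_CLASS.items() if target in ancestors(c)}


def class_counts():
    """Count genes under every class, the numbers behind a pie chart."""
    counts = defaultdict(int)
    for cls in GENE_CLASS.values():
        for a in ancestors(cls):
            counts[a] += 1
    return dict(counts)


# Counting: genes under "small molecule metabolism".
small_mol = genes_in_class("small molecule metabolism")

# Intersecting: which up-regulated genes fall in class "cell cycle"?
up_regulated = {"ftsZ", "lacZ"}
cell_cycle_hits = up_regulated & genes_in_class("cell cycle")
```

Because membership is resolved by walking up the taxonomy, a gene annotated only with a specific class is still counted under every broader class, which is what makes drill-down search possible.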
Pathway Tools Ontology: Slots
Pathway Tools ontology contains 199 slots
Categories of slots:
  Meta-data: Creator, Creation-Date
  Textual data: Common-Name, Synonyms, Comment, Citations
  Attributes: Molecular-Weight, pI
  Relationships: Gene, Catalyzes, In-Reaction
Give stats on how many slots in each of these classes
Pathway Tools Ontology: Slots
Slots introduced at appropriate place in taxonomy
  Child classes inherit the slot; parent classes do not
Examples:
  Proteins: pI, MolWt, Component-Of
    Polypeptides: Gene
    Protein-Complexes: Components
  Reactions: Left, Right, Keq, In-Pathway
  Pathways: Reaction-List, Predecessor-List
  Transcription Units: Components
  Genes: Product, Component-Of
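A minimal Python sketch of slot inheritance, using the protein slots listed above. It assumes that Polypeptides and Protein-Complexes are child classes of Proteins; the dictionaries and the function name are hypothetical and do not reflect how Ocelot stores its schema.

```python
# Assumed fragment of the class taxonomy: class -> parent class.
PARENT = {
    "Polypeptides": "Proteins",
    "Protein-Complexes": "Proteins",
}

# Slots introduced at each class, taken from the examples above.
INTRODUCED = {
    "Proteins": ["pI", "MolWt", "Component-Of"],
    "Polypeptides": ["Gene"],
    "Protein-Complexes": ["Components"],
}


def slots_of(cls):
    """Return the slots visible at cls: inherited ones first, then its own."""
    parent = PARENT.get(cls)
    inherited = slots_of(parent) if parent else []
    return inherited + INTRODUCED.get(cls, [])


polypeptide_slots = slots_of("Polypeptides")
protein_slots = slots_of("Proteins")
```

Here `polypeptide_slots` includes pI, MolWt, and Component-Of inherited from Proteins, while `protein_slots` does not include Gene, since a slot is never visible above the class that introduces it.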
Operations Enabled by Slots
Store/retrieve attributes of an entity
  Get pI of protein
  Get citations associated with pathway
Traverse network of semantic relationships
  Find all substrates of all reactions in pathway X
  Find all genes that encode an enzyme that catalyzes a reaction in pathway X
  Find all regulons encoding multiple metabolic pathways
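As a rough illustration of the first traversal query, the sketch below treats each frame as a dictionary of slot values and collects the substrates of every reaction in a pathway. The slot names Reaction-List, Left, and Right come from the slides; the frame names, the `KB` dictionary, and the helper functions are made up for this example.

```python
# Hypothetical frames stored as dicts of slot -> list of values.
KB = {
    "pathway-X": {"Reaction-List": ["rxn-1", "rxn-2"]},
    "rxn-1": {"Left": ["A", "B"], "Right": ["C"]},
    "rxn-2": {"Left": ["C"], "Right": ["D", "B"]},
}


def slot_values(frame, slot):
    """Return the values of a slot, or an empty list if the slot is unset."""
    return KB.get(frame, {}).get(slot, [])


def substrates_of_pathway(pathway):
    """Collect every compound on either side of every reaction in a pathway."""
    found = set()
    for rxn in slot_values(pathway, "Reaction-List"):
        found.update(slot_values(rxn, "Left"))
        found.update(slot_values(rxn, "Right"))
    return found


substrates = substrates_of_pathway("pathway-X")
```

The same two helpers cover the "store/retrieve" case as well: `slot_values("rxn-1", "Keq")` simply returns an empty list when no value has been recorded.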
Pathway Tools Ontology: Constraints
Every Pathway Tools slot has associated meta data:
  Class(es) to which it pertains
    Keq pertains to Reactions
  Data type (number, string, frame, etc.)
    Keq data type is number
  Collection type (list, bag)
    Keq is not a collection
  Documentation string
  Cardinality constraints
    At most one Keq value
  Range constraints
  Taxonomy constraints
    Values of Left slot of Reactions must be Chemicals
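The slot meta data above is what a consistency checker reads. The following Python sketch shows the idea under simplifying assumptions: `SlotSpec`, its field names, and the sample frames are hypothetical, and class membership is tested by exact match rather than by walking the taxonomy as a real checker would.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotSpec:
    """Hypothetical slot meta data, mirroring the facets listed above."""
    domain: str                         # class to which the slot pertains
    datatype: Optional[type] = None     # Python type for primitive values
    value_class: Optional[str] = None   # required class for frame values
    max_card: Optional[int] = None      # cardinality constraint (at most N)


SPECS = {
    "Keq": SlotSpec(domain="Reactions", datatype=float, max_card=1),
    "Left": SlotSpec(domain="Reactions", value_class="Chemicals"),
}

# Hypothetical instance data: frame -> class, and frame -> slot values.
CLASS_OF = {"rxn-1": "Reactions", "succinate": "Chemicals", "sdhA": "Genes"}
VALUES = {"rxn-1": {"Keq": [1.5, 2.0], "Left": ["succinate", "sdhA"]}}


def check(frame):
    """Return human-readable constraint violations for one frame."""
    problems = []
    for slot, values in VALUES.get(frame, {}).items():
        spec = SPECS[slot]
        if CLASS_OF[frame] != spec.domain:
            problems.append(f"{slot} does not pertain to {CLASS_OF[frame]}")
        if spec.max_card is not None and len(values) > spec.max_card:
            problems.append(f"{slot} has more than {spec.max_card} value(s)")
        for v in values:
            if spec.datatype is not None and not isinstance(v, spec.datatype):
                problems.append(f"{slot} value {v!r} is not a {spec.datatype.__name__}")
            if spec.value_class is not None and CLASS_OF.get(v) != spec.value_class:
                problems.append(f"{slot} value {v!r} is not in {spec.value_class}")
    return problems


violations = check("rxn-1")
```

In this toy data the reaction carries two Keq values and a gene in its Left slot, so `check` reports one cardinality violation and one taxonomy violation.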
Operations Enabled by Constraints
Constraints make a system “intelligent” because they encode definitions in a machine-understandable fashion
Automated DB consistency checkers (batch or interactive)
Schema-driven data input tools
Subsumption – Compare two concept definitions
Pathway Tools Inference Layer
Commonly used queries implemented as stored procedures
Infer what is implicitly recorded in the KB
Compute Transitive Relationships
[Figure: chain of relationships linking the sdh genes to the TCA Cycle]
  Chrom
  Genes: sdhA, sdhB, sdhC, sdhD
  product: Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2
  component-of: Succinate dehydrogenase
  catalyzes: Enzymatic-reaction
  reaction: succinate + FAD = fumarate + FADH2
    left: succinate, FAD
    right: fumarate, FADH2
  in-pathway: TCA Cycle
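The succinate dehydrogenase example can be turned into a small Python sketch of a transitive query, "which genes participate in a pathway". The frame and slot names are the labels from the figure; the direction assigned to each slot, the `KB` dictionary, and the function names are assumptions of this sketch, not the actual Pathway Tools stored procedures.

```python
RXN = "succinate + FAD = fumarate + FADH2"

# Frames from the figure as slot -> values dicts (slot directions assumed).
KB = {
    "sdhA": {"product": ["Sdh-flavo"]},
    "sdhB": {"product": ["Sdh-Fe-S"]},
    "sdhC": {"product": ["Sdh-membrane-1"]},
    "sdhD": {"product": ["Sdh-membrane-2"]},
    "Sdh-flavo": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-Fe-S": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-membrane-1": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-membrane-2": {"component-of": ["Succinate dehydrogenase"]},
    "Succinate dehydrogenase": {"catalyzes": ["Enzymatic-reaction"]},
    "Enzymatic-reaction": {"reaction": [RXN]},
    RXN: {
        "in-pathway": ["TCA Cycle"],
        "left": ["succinate", "FAD"],
        "right": ["fumarate", "FADH2"],
    },
}

# Slot chain that leads from a gene to the pathways it takes part in.
CHAIN = ["product", "component-of", "catalyzes", "reaction", "in-pathway"]


def follow(frame, slots):
    """Follow a chain of slots from frame and return the frames reached."""
    frontier = {frame}
    for slot in slots:
        frontier = {v for f in frontier for v in KB.get(f, {}).get(slot, [])}
    return frontier


def genes_of_pathway(pathway):
    """Return genes whose products, via a complex, catalyze a pathway step."""
    genes = [f for f, slots in KB.items() if "product" in slots]
    return sorted(g for g in genes if pathway in follow(g, CHAIN))


tca_genes = genes_of_pathway("TCA Cycle")
```

None of the four genes is linked to the TCA Cycle directly; the relationship is recorded only implicitly and is recovered by composing five slots. This fixed chain handles only proteins that sit inside a complex, so a monomeric enzyme would need a shorter chain without the component-of step.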
Pathway Tools Inference Layer
Enumerate reactions given alternative definitions of a reaction: all, enzyme, transport, small-mol, smm
All substrates, all cofactors, all transported chemicals
Protein tests: Is X a transcription factor, enzyme, transporter
  Rather than force user to manually assign physiological roles, compute when possible from biochemical function
Transcription-unit-binding-sites
Compute in parts hierarchy: monomers-of-protein, components-of-protein, genes-of-protein, modified-forms
Complex: regulon-of-protein, regulator-proteins-of-transcription-unit
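The protein tests can be pictured as predicates computed from what a protein does rather than from a hand-assigned role. The Python sketch below is one hypothetical way to do that; the rules it uses (an enzyme catalyzes at least one non-transport reaction, a transporter catalyzes a transport reaction, a transcription factor binds a DNA site) and all data structures are assumptions, not the definitions used inside Pathway Tools.

```python
# Hypothetical frames: protein -> catalyzed reactions; reaction -> its class.
CATALYZES = {
    "Succinate dehydrogenase": ["rxn-sdh"],
    "LacY": ["rxn-lactose-transport"],
    "LacI": [],
}
REACTION_CLASS = {
    "rxn-sdh": "small-mol",
    "rxn-lactose-transport": "transport",
}

# Hypothetical record of which proteins bind DNA regulatory sites.
BINDS_SITE = {"LacI": ["lac-operator"]}


def is_transporter(protein):
    """Assumed rule: the protein catalyzes at least one transport reaction."""
    return any(REACTION_CLASS[r] == "transport"
               for r in CATALYZES.get(protein, []))


def is_enzyme(protein):
    """Assumed rule: the protein catalyzes a non-transport reaction."""
    return any(REACTION_CLASS[r] != "transport"
               for r in CATALYZES.get(protein, []))


def is_transcription_factor(protein):
    """Assumed rule: the protein binds at least one DNA site."""
    return bool(BINDS_SITE.get(protein))


roles = {
    p: (is_enzyme(p), is_transporter(p), is_transcription_factor(p))
    for p in CATALYZES
}
```

Since the roles are derived on demand, adding a new catalyzed reaction to a protein changes its answer to these tests without any separate role annotation to keep in sync.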
What Killer Apps Have Ontologies Enabled?
What comes after pie charts and drill-down interfaces?
BioCyc Collection of Pathway/Genome DBs
Literature-based Datasets:
  MetaCyc
  Escherichia coli (EcoCyc)
Computationally Derived Datasets:
  Agrobacterium tumefaciens
  Caulobacter crescentus
  Chlamydia trachomatis
  Bacillus subtilis
  Helicobacter pylori
  Haemophilus influenzae
  Mycobacterium tuberculosis H37Rv
  Mycobacterium tuberculosis CDC1551
  Mycoplasma pneumoniae
  Pseudomonas aeruginosa
  Saccharomyces cerevisiae
  Treponema pallidum
  Vibrio cholerae
Yellow Underlined = Open Database
http://BioCyc.org/
Pathway/Genome DBs Created by External Users
Plasmodium falciparum, Stanford University
  plasmocyc.stanford.edu
Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington
  Arabidopsis.org:1555
Methanococcus jannaschii, EBI
  Maine.ebi.ac.uk:1555
Other PGDBs in progress by 20 other users
Software freely available
Each PGDB owned by its creator
Ontology Reuse
A holy grail in AI since “ontology” became a buzz-word
Decrease knowledge acquisition bottleneck
GO qualifies as a large success in ontology reuse
Pathway Tools ontology reused across 18 PGDBs
Pathway Tools algorithms portable across all PGDBs
Pathway Tools Algorithms
Visualization and editing tools for the following datatypes
Full Metabolic Map
  Paint gene expression data on metabolic network; compare metabolic networks
Pathways
  Pathway prediction
Reactions
  Balance checker
Compounds
  Chemical substructure comparison
Enzymes, Transporters, Transcription Factors
Genes
Chromosomes
Operons
  Operon prediction; visualize genetic network
Inference of Metabolic Pathways
[Figure: PathoLogic data flow]
Inputs:
  Annotated Genomic Sequence: Genes/ORFs, Gene Products, DNA Sequences
  Multi-organism Pathway Database (MetaCyc): Reactions, Pathways, Compounds
PathoLogic Software
  Integrates genome and pathway data to identify putative metabolic networks
Output:
  Pathway/Genome Database: Genomic Map, Genes, Gene Products, Reactions, Pathways, Compounds
PathoLogic Analysis Phases
Trial parsing of input data files [few days]
Initialize schema of new PGDB [3 min]
Create DB objects for replicons, genes, proteins [5 min]
Assign enzymes to reactions they catalyze [10 min / 1 week]
  ferrochelatase
  glutamate 1-semialdehyde 2,1-aminomutase
  porphobilinogen deaminase
[Figure: pathway diagram with compounds A through G and enzymes E1, E2]
PathoLogic Analysis Phases
From assigned reactions, infer what pathways are present [5 min / few days]
Define metabolic overview diagram [1 day]
Define protein complexes [few days]
Killer App: Global Consistency Checking of Biochemical Network
Given:
  A PGDB for an organism
  A set of initial metabolites
Infer:
  What set of products can be synthesized by the small-molecule metabolism of the organism
Can known growth medium yield known essential compounds?
Pacific Symposium on Biocomputing p471 2001
Algorithm: Forward Propagation
[Figure: forward propagation loop. Labels: Nutrient set, Transport, Metabolite set, “Fire” reactions, Reactants, Products, PGDB reaction pool]
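A minimal Python sketch of forward propagation as the figure labels suggest it: start from the nutrient set, repeatedly "fire" any reaction in the pool whose reactants are all in the metabolite set, and add its products. The function name and data layout are hypothetical, and the sketch ignores transport, reversibility, and stoichiometry, so it should not be read as the published algorithm.

```python
def forward_propagate(nutrients, reactions):
    """Return every compound reachable from the nutrient set.

    reactions is a list of (reactants, products) pairs of sets,
    standing in for the PGDB reaction pool.
    """
    metabolites = set(nutrients)
    pool = list(reactions)
    changed = True
    while changed:
        changed = False
        remaining = []
        for reactants, products in pool:
            if reactants <= metabolites:
                # All reactants are available, so the reaction fires.
                if not products <= metabolites:
                    metabolites |= products
                    changed = True
            else:
                # Not yet fireable; keep it in the pool for the next pass.
                remaining.append((reactants, products))
        pool = remaining
    return metabolites


# Toy network with made-up compounds A-F.
reactions = [
    ({"C"}, {"D"}),
    ({"A", "B"}, {"C"}),
    ({"E"}, {"F"}),  # never fires: E is neither a nutrient nor a product
]
reachable = forward_propagate({"A", "B"}, reactions)

# Consistency check: which essential compounds cannot be produced?
essential = {"D", "F"}
missing = essential - reachable
```

A non-empty `missing` set is the interesting outcome: it points either at a gap in the nutrient set or at an error or omission in the database.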
Results
Phase I: Forward propagation
  21 initial compounds yielded only half of 38 essential compounds for E. coli
Phase II: Manually identify
  Bugs in EcoCyc (e.g., two objects for tryptophan)
  Missing initial protein substrates (e.g., ACP)
  Missing pathways in EcoCyc
Phase III: Forward propagation with 11 more initial metabolites
  Yielded all 38 essential compounds
Aggregate Properties of the E. coli Metabolic Network
EcoCyc is not a complete picture of E. coli metabolism
30% of E. coli genes remain unidentified
Analysis pertains to pathways of small-molecule metabolism
Computed with respect to EcoCyc v4.5 (Sep-1998)
Joint work with Christos Ouzounis of EBI
Genome Research 10:268 2001
Enzymes
4391 genes in E. coli genome
4288 code for proteins
676 (15%) gene products form 607 enzymes
Of the 607 enzymes, 296 are monomers, 311 are multimers
90% of genes for heteromultimers are linked
Reactions
744 reactions of small-molecule metabolism
  582 assigned to at least one pathway
Compounds
791 substrates in the 744 reactions
Each reaction contains 4.0 substrates on average
Each substrate appears in 2.1 reactions
Enzyme Modulation
805 enzymatic-reaction objects in EcoCyc
  80 have physiological inhibitors
  22 have physiological activators
  17 have both
  43% have a modulator
327 require a cofactor or prosthetic group
Enzyme-Reaction Associations
585 reactions catalyzed by 1 enzyme
55 reactions catalyzed by 2 enzymes
12 reactions catalyzed by 3 enzymes
1 reaction catalyzed by 4 enzymes
483 reactions belong to a single pathway
99 reactions belong to multiple pathways
100 of the 607 E. coli enzymes are multifunctional
Pathway Tools Implementation
Allegro Common Lisp
Sun and PC platforms
Run as window application or WWW server
Ocelot object database
250,000 lines of code
Lisp-based WWW server at BioCyc.org
  Lisp process reads URLs from the network and generates GIF+HTML from PGDBs
  Manages 15 PGDBs
Ocelot Knowledge Server Architecture
Frame data model
  Classes, instances, inheritance
Persistent storage via disk files, Oracle DBMS
  Concurrent development: Oracle
  Single-user development: disk files
  Read-only delivery: bundle data into binary program
Transaction logging facility
Schema evolution
Local disk cache to improve Internet performance
J. Intelligent Information Systems 1:155-94 1999
GKB Editor
Browser and editor for KBs and ontologies
Three editing tools:
  Taxonomy editor
  Frame editor
  Relationships editor
All operations are schema driven
http://www.ai.sri.com/~gkb/user-man.html
The Common Lisp Programming Environment
Gat studied Lisp and Java implementations of 16 programs by 14 programmers (Intelligence 11:21 2000)
Peter Norvig’s Solution
“I wrote my version in Lisp. It took me about 2 hours (compared to a range of 2-8.5 hours for the other Lisp programmers in the study, 3-25 for C/C++ and 4-63 for Java) and I ended up with 45 non-comment non-blank lines (compared with a range of 51-182 for Lisp, and 107-614 for the other languages). (That means that some Java programmer was spending 13 lines and 84 minutes to provide the functionality of each line of my Lisp program.)”
http://www.norvig.com/java-lisp.html
Common Lisp Programming Environment
Interpreted and/or compiled execution
Fabulous debugging environment
High-level language
Interactive data exploration
Extensive built-in libraries
Dynamic redefinition
Find out more!
  ALU.org -- Association of Lisp Users
  BioLisp.org
Pathway Exchange Ontology
BioPathways group developing ontology and format for exchange of pathway data
  Metabolic pathways
  Signaling pathways
  Protein interactions
Moving upwards from chemicals, proteins, to reactions and pathways
Working to extend CML
Draft ontology at
http://www.ai.sri.com/pkarp/misc/interactions.html
Summary
Pathway Tools apps:
  Predict pathways and generate PGDBs
  Visualization and editing tools
  Paint gene expression data; compare entire pathway maps
  Global consistency checking of metabolic network
  Characterize metabolic and genetic networks
New killer apps:
  Interoperability
  Text mining
  Bake-off for genome annotation pipelines
BioCyc and Pathway Tools Availability
WWW BioCyc freely available to all
  BioCyc.org
Six BioCyc DBs openly available to all
BioCyc DBs freely available to non-profits
  Flatfiles downloadable from BioCyc.org
  Binary executable:
    Sun UltraSparc-170 w/ 64MB memory
    PC, 400MHz CPU, 64MB memory, Windows-98 or newer
  PerlCyc API
Pathway Tools freely available to non-profits
Acknowledgements
SRI
  Suzanne Paley, Pedro Romero, John Pick, Cindy Krieger, Martha Arnaud
EcoCyc Project
  Julio Collado-Vides, Ian Paulsen, Monica Riley, Milton Saier
MetaCyc Project
  Sue Rhee, Lukas Mueller, Peifen Zhang, Chris Somerville
Stanford
  Gary Schoolnik, Harley McAdams, Lucy Shapiro, Russ Altman, Iwei Yeh
Funding sources:
  NIH National Center for Research Resources
  NIH National Institute of General Medical Sciences
  NIH National Human Genome Research Institute
  Department of Energy Microbial Cell Project
  DARPA BioSpice, UPC
BioCyc.org