PathoLogic Pathway Predictor

Upload: elinor-clementine-webster

Post on 24-Dec-2015

TRANSCRIPT

Page 1: PathoLogic Pathway Predictor. SRI International Bioinformatics Inference of Metabolic Pathways Pathway/Genome Database Annotated Genomic Sequence Genes/ORFs

PathoLogic Pathway Predictor

SRI International Bioinformatics

Inference of Metabolic Pathways

[Diagram]
Inputs:
  Annotated Genomic Sequence: Genes/ORFs, Gene Products, DNA Sequences
  Multi-organism Pathway Database (MetaCyc): Reactions, Pathways, Compounds

PathoLogic Software: integrates genome and pathway data to identify putative metabolic networks

Output:
  Pathway/Genome Database: Genomic Map, Genes, Gene Products, Reactions, Pathways, Compounds

PathoLogic Functionality

Initialize schema for new PGDB
Transform existing genome to PGDB form
Infer metabolic pathways and store in PGDB
Infer operons and store in PGDB
Assemble Overview diagram
Assist user with manual tasks
  Assign enzymes to reactions they catalyze
  Identify false-positive pathway predictions
  Build protein complexes from monomers
  Infer transport reactions
  Fill pathway holes

PathoLogic Input/Output

Inputs:
  List of all genetic elements
    Enter using GUI or provide a file
  Files containing annotation for each genetic element
  Files containing DNA sequence for each genetic element
  MetaCyc database

Output:
  Pathway/genome database for the subject organism
  Reports that summarize:
    Evidence in the input genome for the presence of reference pathways
    Reactions missing from inferred pathways

File Naming Conventions

One pair of sequence and annotation files for each genetic element

Sequence files: FASTA format, suffix .fsa or .fna

Annotation files:
  GenBank format: suffix .gbk
  PathoLogic format: suffix .pf
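
The suffix conventions above can be checked before a build with a few lines of Python. This is only a sketch: the function name `classify_input_file` is hypothetical, and it assumes suffixes are matched exactly as written on the slide (lowercase, with a leading dot).

```python
from pathlib import Path

# Suffixes taken from the naming conventions above.
SEQUENCE_SUFFIXES = {".fsa", ".fna"}
ANNOTATION_FORMATS = {".gbk": "GenBank", ".pf": "PathoLogic"}


def classify_input_file(filename):
    """Describe an input file by its suffix, or return None if unrecognized."""
    suffix = Path(filename).suffix
    if suffix in SEQUENCE_SUFFIXES:
        return "sequence (FASTA)"
    if suffix in ANNOTATION_FORMATS:
        return f"annotation ({ANNOTATION_FORMATS[suffix]})"
    return None
```

Running this over the files listed for each genetic element gives a quick way to spot a misnamed file before the trial parse.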

Typical Problems Using GenBank Files With PathoLogic

Wrong qualifier names used: read PathoLogic documentation!

Extraneous information in a given qualifier

Check results of trial parse carefully

GenBank File Format

Accepted feature types: CDS, tRNA, rRNA, misc_RNA

Accepted qualifiers:
  /locus_tag        Unique ID [recm]
  /gene             Gene name [req]
  /product          [req]
  /EC_number        [recm]
  /product_comment  [opt]
  /gene_comment     [opt]
  /alt_name         Synonyms [opt]
  /pseudo           Gene is a pseudogene [opt]
  /db_xref          DB:AccessionID [opt]
  /go_component, /go_function, /go_process  GO terms [opt]

For multifunctional proteins, put each function in a separate /product line
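
The qualifier list above lends itself to a simple pre-flight check. The sketch below is not part of PathoLogic. It assumes [req], [recm] and [opt] mean required, recommended and optional, and `check_feature` is a hypothetical helper that takes a feature type and the qualifier names found on that feature (without the leading slash).

```python
ACCEPTED_FEATURE_TYPES = {"CDS", "tRNA", "rRNA", "misc_RNA"}

# Qualifier name -> status as marked on the slide.
QUALIFIER_STATUS = {
    "locus_tag": "recm",
    "gene": "req",
    "product": "req",
    "EC_number": "recm",
    "product_comment": "opt",
    "gene_comment": "opt",
    "alt_name": "opt",
    "pseudo": "opt",
    "db_xref": "opt",
    "go_component": "opt",
    "go_function": "opt",
    "go_process": "opt",
}


def check_feature(feature_type, qualifier_names):
    """Return a list of human-readable problems for one GenBank feature."""
    problems = []
    if feature_type not in ACCEPTED_FEATURE_TYPES:
        problems.append(f"feature type {feature_type!r} is not in the accepted list")
    present = set(qualifier_names)
    # Report qualifiers that the slide does not list as accepted.
    for name in sorted(present - set(QUALIFIER_STATUS)):
        problems.append(f"qualifier /{name} is not in the accepted list")
    # Report qualifiers marked [req] that are absent.
    for name, status in QUALIFIER_STATUS.items():
        if status == "req" and name not in present:
            problems.append(f"required qualifier /{name} is missing")
    return problems
```

How PathoLogic itself treats an unlisted qualifier is not stated on the slide, so the check only reports it rather than rejecting the feature.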

PathoLogic File Format

Each record starts with a line containing an ID attribute
Tab delimited
Each record ends with a line containing //
One attribute-value pair is allowed per line
  Use multiple FUNCTION lines for multifunctional proteins
Lines starting with ‘;’ are comment lines
Valid attributes are:
  ID, NAME, SYNONYM
  STARTBASE, ENDBASE, GENE-COMMENT
  FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
  DBLINK
  GO
  INTRON

PathoLogic File Format

ID	TP0734
NAME	deoD
STARTBASE	799084
ENDBASE	799785
FUNCTION	purine nucleoside phosphorylase
DBLINK	PID:g3323039
PRODUCT-TYPE	P
GENE-COMMENT	similar to GP:1638807 percent identity: 57.51; identified by sequence similarity; putative
//
ID	TP0735
NAME	gltA
STARTBASE	799867
ENDBASE	801423
FUNCTION	glutamate synthase
DBLINK	PID:g3323040
PRODUCT-TYPE	P
GO	glutamate synthase (NADPH) activity [goid 0004355] [evidence IDA] [pmid 4565085]
//
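
The format rules on the previous slide (tab-delimited attribute-value lines, ';' comment lines, records closed by //) are enough to sketch a small reader. This is an illustration, not the PathoLogic parser. The name `parse_pf` is hypothetical, and the sketch assumes every value sits on a single line.

```python
from collections import defaultdict


def parse_pf(text):
    """Parse PathoLogic-format text into a list of records.

    Each record is a dict mapping attribute name -> list of values, so
    repeated attributes (for example several FUNCTION lines) are all kept.
    """
    records = []
    current = defaultdict(list)
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        if line == "//":
            if current:
                records.append(dict(current))
            current = defaultdict(list)
            continue
        attr, sep, value = line.partition("\t")
        if not sep:
            raise ValueError(f"expected TAB between attribute and value: {line!r}")
        current[attr.strip()].append(value.strip())
    if current:
        records.append(dict(current))  # tolerate a missing final //
    return records
```

Keeping every value in a list means a multifunctional protein with two FUNCTION lines comes back as `record["FUNCTION"] == [first, second]` rather than losing one of them.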

Before you start: What to do when an error occurs

Most Navigator errors are automatically trapped; debugging information is saved to the error.tmp file.
All other errors (including most PathoLogic errors) will cause the software to drop into the Lisp debugger.
  Unix: the error message will show up in the original terminal window from which you started Pathway Tools.
  Windows: the error message will show up in the Lisp console. The Lisp console usually starts out iconified; its icon is a blue bust of Franz Liszt.
Two goals when an error occurs:
  Try to continue working
  Obtain enough information for a bug report to send to the Pathway Tools support team

The Lisp Debugger

Sample error (details and number of restart actions differ for each case):

Error: Received signal number 2 (Keyboard interrupt)
Restart actions (select using :continue):
 0: continue computation
 1: Return to command level
 2: Pathway Tools version 10.0 top level
 3: Exit Pathway Tools version 10.0
[1c] EC(2):

To generate debugging information (stack backtrace):
  :zoom :count :all

To continue from error, find a restart that takes you to the top level (in this case, number 2):
  :cont 2

To exit Pathway Tools:
  :exit

How to report an error

Determine if the problem is reproducible, and how to reproduce it (make sure you have all the latest patches installed)

Send email to [email protected] containing:
  Pathway Tools version number and platform
  Description of exactly what you were doing (which command you invoked, what you typed, etc.) or instructions for how to reproduce the problem
  error.tmp file, if one was generated
  If the software breaks into the Lisp debugger, the complete error message and stack backtrace (obtained using the command :zoom :count :all, as described on the previous slide)

Using the PPP GUI to Create a Pathway/Genome Database

Input Project Information
  Organism -> Create New
  Creates directory structure for new PGDB
  Creates and saves empty PGDB, populated only with objects common to all PGDBs (schema classes, elements, etc.) and data you entered in the form
  Offers to invoke Replicon Editor

Input Project Information

Enter Replicon Information

For each replicon:
  Name
  Type: chromosome, plasmid, etc.
  Circular?
  Annotation file
  Sequence file (optional)
  Contigs (optional)
  Links to other DBs (optional)

GUI-based entry
  Build -> Specify Replicons

File-based entry
  Create genetic-elements.dat file using template provided

GUI-Based Replicon Entry

Batch Entry of Replicon Info

File /<orgid>cyc/<version>/input/genetic-elements.dat:

ID	TEST-CHROM-1
NAME	Chromosome 1
TYPE	:CHRSM
CIRCULAR?	N
ANNOT-FILE	chrom1.pf
SEQ-FILE	chrom1.fsa
//
ID	TEST-CHROM-2
NAME	Chromosome 2
CIRCULAR?	N
ANNOT-FILE	/mydata/chrom2.gbk
SEQ-FILE	/mydata/chrom2.fna
//
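
For genomes with many replicons it can be easier to generate genetic-elements.dat than to type it. The sketch below builds one record in the layout of the example above. The helper name `genetic_element_record` is hypothetical, the TAB separator is an assumption carried over from the PathoLogic file format, and "Y" as the value for a circular replicon is also an assumption (the slide shows only N).

```python
def genetic_element_record(elem_id, name, annot_file, seq_file=None,
                           elem_type=None, circular=False):
    """Return the text of one genetic-elements.dat record."""
    lines = [f"ID\t{elem_id}", f"NAME\t{name}"]
    if elem_type:
        lines.append(f"TYPE\t{elem_type}")  # e.g. ":CHRSM" as in the example
    # "Y" for circular replicons is an assumption; only "N" appears on the slide.
    lines.append(f"CIRCULAR?\t{'Y' if circular else 'N'}")
    lines.append(f"ANNOT-FILE\t{annot_file}")
    if seq_file:
        lines.append(f"SEQ-FILE\t{seq_file}")  # the sequence file is optional
    lines.append("//")
    return "\n".join(lines) + "\n"
```

Concatenating the returned strings for each replicon yields the full file. Starting from the template shipped with Pathway Tools remains the safer route when in doubt about an attribute.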

Specify Reference PGDB(s)

This step is optional, and most users will omit it
MetaCyc is always the primary reference PGDB
Specify an additional reference PGDB if you have your own curated PGDB which has:
  Pathways and/or reactions that are not in MetaCyc
  Manual functional assignments, with names similar to current genome
There is no point specifying any of our PGDBs as references, only your own curated PGDBs.

Building the PGDB

Trial Parse
  Build -> Trial Parse
  Check output to ensure numbers “look right”
    Same number of gene start positions, end positions, names
    Did my file contain EC numbers? Were they detected?
    Did my file contain RNAs? Were they detected?
  Fix any errors in input files
Build pathway/genome database
  Build -> Automated Build

PathoLogic Parser Output

Automated Build

Parses input files
Creates objects for every gene and gene product
Uses EC numbers, GO annotations and name matcher to match enzymes to reactions in MetaCyc
  Imports catalyzed reactions and compounds from MetaCyc
  Generates list of likely enzymes that couldn’t be assigned
Infers pathways likely to be present
Generates Cellular Overview Diagram (first pass)
Generates reports

Matching Enzymes to Reactions

Matches on full EC number (partial ECs ignored)
Matches on Molecular Function GO terms
  If definition of GO term includes cross-reference either to an EC number or to a MetaCyc reaction
Matches on full enzyme name
  Match is case-insensitive and removes the punctuation characters “ -_(){}',:”
  Also matches after removal of prefixes and suffixes such as:
    “Putative”, “Hypothetical”, etc.
    alpha|beta|…|catalytic|inducible chain|subunit|component
    Parenthetical gene name
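
The name-matching rules above amount to comparing normalized keys. The sketch below is a rough approximation, not the PathoLogic name matcher. The prefix and suffix lists contain only the examples shown on the slide (which says "such as" and elides others), and the function names and regular expressions are my own.

```python
import re

PUNCTUATION = " -_(){}',:"               # characters removed before comparison
PREFIXES = ("putative", "hypothetical")  # partial list: the slide says "etc."
SUFFIX_RE = re.compile(
    r"\s*,?\s*(alpha|beta|catalytic|inducible)?\s*(chain|subunit|component)\s*$",
    re.IGNORECASE,
)
# Assumption: a trailing "(word)" is treated as a parenthetical gene name.
PAREN_GENE_RE = re.compile(r"\s*\([A-Za-z0-9]+\)\s*$")


def strip_punct(name):
    """Lowercase the name and drop the punctuation characters listed above."""
    return "".join(ch for ch in name.lower() if ch not in PUNCTUATION)


def candidate_keys(name):
    """Return comparison keys: the full name first, then the reduced name."""
    keys = [strip_punct(name)]
    reduced = PAREN_GENE_RE.sub("", name.strip())
    for prefix in PREFIXES:
        if reduced.lower().startswith(prefix + " "):
            reduced = reduced[len(prefix) + 1:]
    reduced = SUFFIX_RE.sub("", reduced)
    key = strip_punct(reduced)
    if key not in keys:
        keys.append(key)
    return keys
```

Two names would then be considered a match when any of their keys coincide, for example an annotated "Putative glutamate synthase, alpha subunit" against a reference name "glutamate synthase".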

Enzyme Name Matcher

For names that do not match, software identifies probable metabolic enzymes as those
  Containing “ase”
  Not containing keywords such as
    “sensor kinase”
    “topoisomerase”
    “protein kinase”
    “peptidase”
    Etc.
User should research unknown enzymes
  MetaCyc, Swiss-Prot, PubMed
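
The probable-enzyme heuristic above is easy to express directly. A minimal sketch, assuming a case-insensitive substring test. The keyword list holds only the four examples named on the slide, whose "Etc." implies the real list is longer, and `is_probable_enzyme` is a hypothetical name.

```python
# Only the keywords shown on the slide; the actual exclusion list is longer.
EXCLUDED_KEYWORDS = ("sensor kinase", "topoisomerase", "protein kinase", "peptidase")


def is_probable_enzyme(name):
    """Flag an unmatched product name as a probable metabolic enzyme."""
    lowered = name.lower()
    if "ase" not in lowered:
        return False
    return not any(keyword in lowered for keyword in EXCLUDED_KEYWORDS)
```

A plain substring test also fires on words that merely contain "ase", which is one reason the flagged names still need the manual research the slide recommends.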

Stored in ORGIDcyc/VERSION/reports/name-matching-report.txt

Automated Pathway Inference

All pathways in MetaCyc for which there is at least one enzyme identified in the target organism are considered for possible inclusion.

Algorithm errs on side of inclusivity – easier to manually delete a pathway from an organism than to find a pathway that should have been predicted but wasn’t.

Considerations taken into account when deciding whether or not a pathway should be inferred:
  Is there a unique enzyme (an enzyme not involved in any other pathway)?
  Does the organism fall in the expected taxonomic domain of the pathway?
  Is this pathway part of a variant set, and, if so, is there more evidence for some other variant?
  If there is no unique enzyme:
    Is there evidence for more than one enzyme?
    If a biosynthetic pathway, is there evidence for final reaction(s)?
    If a degradation pathway, is there evidence for initial reaction(s)?
    If an energy metabolism pathway, is there evidence for more than half the reactions?
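
The slide lists the questions but not how PathoLogic weighs them, so the sketch below should be read as one possible way to chain them, not as the actual inference algorithm. The class `PathwayEvidence`, its fields, and the order of the checks in `keep_pathway` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PathwayEvidence:
    """Answers to the questions above for one candidate pathway (hypothetical)."""
    n_reactions: int
    n_reactions_with_evidence: int
    n_enzymes_with_evidence: int
    has_unique_enzyme: bool = False
    in_expected_taxa: bool = True
    better_variant_exists: bool = False
    pathway_class: str = "other"  # "biosynthesis", "degradation", "energy" or "other"
    final_reactions_present: bool = False
    initial_reactions_present: bool = False


def keep_pathway(ev):
    """Combine the considerations into a yes/no answer (assumed ordering)."""
    if not ev.in_expected_taxa or ev.better_variant_exists:
        return False
    if ev.has_unique_enzyme:
        return True
    # No unique enzyme: fall back to the secondary questions.
    if ev.n_enzymes_with_evidence <= 1:
        return False
    if ev.pathway_class == "biosynthesis":
        return ev.final_reactions_present
    if ev.pathway_class == "degradation":
        return ev.initial_reactions_present
    if ev.pathway_class == "energy":
        return 2 * ev.n_reactions_with_evidence > ev.n_reactions
    return True
```

Treating the taxonomic and variant checks as hard vetoes is stricter than an algorithm that "errs on the side of inclusivity" would likely be; a scoring scheme that weighs the answers is an equally plausible reading.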

Assigning Evidence Scores to Predicted Pathways

X|Y|Z denotes the score for pathway P in organism O, where:
  X = total number of reactions in P
  Y = number of reactions in P for which there is enzyme evidence in O
  Z = number of Y reactions that are used in other pathways in O
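
As a worked illustration of the X|Y|Z score, the sketch below computes it from plain collections of reaction identifiers. The function name and argument layout are hypothetical, and it assumes each reaction appears once in the pathway's reaction list.

```python
def evidence_score(pathway_reactions, reactions_with_evidence, other_pathways):
    """Return the X|Y|Z score string for one pathway.

    pathway_reactions: reaction IDs in pathway P (no duplicates assumed)
    reactions_with_evidence: reaction IDs with enzyme evidence in organism O
    other_pathways: iterable of reaction-ID collections, one per other pathway in O
    """
    x = len(pathway_reactions)
    present = [r for r in pathway_reactions if r in reactions_with_evidence]
    y = len(present)
    used_elsewhere = set().union(*other_pathways)
    z = sum(1 for r in present if r in used_elsewhere)
    return f"{x}|{y}|{z}"
```

For a four-reaction pathway with evidence for three reactions, one of which also occurs in another pathway, the call returns "4|3|1". A high Z relative to Y suggests the evidence may really belong to the other pathways.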

Pathway Evidence Report

On Organism Summary Page in Navigator, button “Generate Pathway Evidence Report”
Report saved as HTML file, view in browser
Hierarchical listing of all inferred pathways
“Pathway Glyph” shows evidence graphically
  Steps with/without enzymes (green/black)
  Steps that are unique to pathway (orange)
  Steps filled by Pathway Hole Filler (blue)
Counts reactions in pathway, with evidence, in other pathways
Lists other pathways that share reactions
Link to pathway in MetaCyc

Manual Pruning of Pathways

Use pathway evidence report
  Coloring scheme aids in assessing pathway evidence
Phase I: Prune extra variant pathways
Rescore pathways, re-generate pathway evidence report
Phase II: Prune pathways unlikely to be present
  No/few unique enzymes
  Most pathway steps present because they are used in another pathway
  Pathway very unlikely to be present in this organism
  Nonspecific enzyme name assigned to a pathway step

Caveats

Cannot predict pathways not present in MetaCyc

Evidence for short pathways is hard to interpret

Since many reactions occur in multiple pathways, some false positives

Next generation pathway inference algorithm is work currently in progress!

Output from PPP

Pathway/genome database
Summary pages
  Pathway evidence page
    Click “Summary of Organisms”, then click organism name, then click “Pathway Evidence”, then click “Save Pathway Report”
  Missing enzymes report
Directory tree containing sequence files, reports, etc.

Resulting Directory Structure

ROOT/ptools-local/pgdbs/user/ORGIDcyc/VERSION/
  input
    organism.dat
    organism-init.dat
    genetic-elements.dat
    annotation files
    sequence files
  reports
    name-matching-report.txt
    trial-parse-report.txt
  kb
    ORGIDbase.ocelot
  data
    overview.graph
released -> VERSION
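
After a build, the layout above can be verified with a short script. This is a sketch under assumptions: ROOT, ORGID and VERSION are placeholders on the slide, so they are passed in as arguments, the helper names are hypothetical, and the ORGID is used in file names exactly as given (any case conventions Pathway Tools applies are not covered here).

```python
from pathlib import Path


def expected_pgdb_paths(root, orgid, version):
    """List the fixed-name files expected under ROOT/ptools-local/pgdbs/user/ORGIDcyc/VERSION."""
    base = Path(root) / "ptools-local" / "pgdbs" / "user" / f"{orgid}cyc" / version
    return [
        base / "input" / "organism.dat",
        base / "input" / "organism-init.dat",
        base / "input" / "genetic-elements.dat",
        base / "reports" / "name-matching-report.txt",
        base / "reports" / "trial-parse-report.txt",
        base / "kb" / f"{orgid}base.ocelot",
        base / "data" / "overview.graph",
    ]


def missing_pgdb_files(root, orgid, version):
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_pgdb_paths(root, orgid, version) if not p.exists()]
```

Annotation and sequence files are left out because their names vary per project. They can be checked against the entries in genetic-elements.dat instead.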

Manual Polishing

Refine -> Assign Probable Enzymes
  Do this first
Refine -> Rescore Pathways
  Redo after assigning enzymes
Refine -> Create Protein Complexes
  Can be done at any time
Refine -> Assign Modified Proteins
  Can be done at any time
Refine -> Transport Identification Parser
  Can be done at any time
Refine -> Pathway Hole Filler
Refine -> Predict Transcription Units
Refine -> Update Overview
  Do this last, and repeat after any material changes to PGDB

Assign Probable Enzymes

How to find reactions for probable enzymes

First, verify that enzyme name describes a specific, metabolic function
Search for fragment of name in MetaCyc; you may be able to find a match that PathoLogic missed
Look up protein in UniProt or other DBs
Search for gene name in PGDB for related organism (bear in mind that gene names are not reliable indicators of function, so check carefully)
Search for function name in PubMed
Other…