CISC 841 - Bioinformatics

Post on 12-Jan-2016

DESCRIPTION

CISC 841 - Bioinformatics. Gene Annotation and Network Inference by Phylogenetic Profiling. Authors: Jie Wu, Zhenjun Hu and Charles DeLisi, Boston University. Presented by Rajesh Ponnurangam. Motivation: the need for an effective decision rule for using profile correlation.

TRANSCRIPT

  • CISC 841 - BIOINFORMATICS: GENE ANNOTATION AND NETWORK INFERENCE BY PHYLOGENETIC PROFILING. Authors: Jie Wu, Zhenjun Hu and Charles DeLisi, Boston University. Presented by Rajesh Ponnurangam.

  • MOTIVATION
    The need for an effective decision rule for using profile correlation
    Inefficiencies of current methods
    Effectiveness of phylogenetic analysis
    Need for improved performance at various levels of resolution
    A new decision rule: Correlation Enrichment

  • OVERVIEW
    Introduction
    Concepts
    What's wrong with existing decision-making methods?
    Comparison of decision rules
    Comparison with other published methods
    Identifying functional and evolutionary modules
    Standard Guilt by Association (SGA)
    Correlation Enrichment (CE)
    How does Correlation Enrichment (CE) prove to be more effective?

  • INTRODUCTION: CONCEPTS
    Gene annotation
    Network inference
    Phylogenetic profiling
    Correlation Enrichment (CE)
    Standard Guilt by Association (SGA)
    KEGG pathways
    COG ontology

  • INTRODUCTION: CONCEPTS
    Gene annotation: the process of identifying elements on the genome (gene finding) and attaching biological information to these elements.
    Network inference: determining the topology of a biological network, such as a transcriptional regulatory network or a metabolite network.
    Phylogenetic profiling: used to infer the function of a gene by finding another gene of known function with an identical pattern of presence and absence across a set of distributed genomes.

  • INTRODUCTION: CONCEPTS
    Correlation Enrichment (CE): a new decision rule for assigning genes to functional categories at various levels of resolution.
    Standard Guilt by Association (SGA): a simple decision rule that assigns an unannotated gene to all known categories of an annotated gene if the correlation between their phylogenetic profiles exceeds a specified threshold.
    KEGG: Kyoto Encyclopedia of Genes and Genomes, which connects known information on molecular interaction networks.
    COG ontology: Clusters of Orthologous Groups, a source of conserved-domain data.

  • EXISTING METHOD AND PHYLOGENETIC PROFILING
    Current methods such as SGA perform at a level well below what is possible, largely because the performance of the decision rule that uses the correlate deteriorates rapidly as coverage increases.
    Restricted phylogenetic profiling, which requires full profile identity, is accurate but has low coverage.
    The phylogenetic profile of a gene is a binary string: 1 for presence, 0 for absence.

  • PHYLOGENETIC PROFILING
    Let N be the number of genomes over which profiles are defined, with gene X occurring in x genomes, gene Y occurring in y genomes, and both occurring in z genomes. Eq. (1) gives the probability of observing z co-occurrences purely by chance, given N, x and y (the equation image is not preserved in this transcript).
    The mutual information MI(X,Y) is defined in terms of p(i,j), (i = 0,1; j = 0,1), the fraction of genomes in which gene X is in state i and gene Y is in state j:
    p(1,1): fraction of genomes in which both are present
    p(1,0): fraction of genomes in which X is present and Y is absent
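The quantities on this slide can be sketched in Python. This is an illustrative sketch, not the authors' code. It assumes that eq. (1) is the standard hypergeometric probability, which is what the verbal description (z co-occurrences by chance given N, x and y) corresponds to, and that MI is the usual mutual information of two binary variables. The function names and the choice of log base 2 are assumptions.

```python
from math import comb, log2

def cooccurrence_probability(N, x, y, z):
    """Probability of exactly z co-occurrences by chance (hypergeometric form).

    Assumed form: C(x, z) * C(N - x, y - z) / C(N, y).
    """
    # z must lie in the feasible range for the given marginals.
    if z < max(0, x + y - N) or z > min(x, y):
        return 0.0
    return comb(x, z) * comb(N - x, y - z) / comb(N, y)

def mutual_information(profile_x, profile_y):
    """Mutual information (in bits) of two equal-length 0/1 profiles."""
    N = len(profile_x)
    # counts[(i, j)] = number of genomes with X in state i and Y in state j
    counts = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for a, b in zip(profile_x, profile_y):
        counts[(a, b)] += 1
    mi = 0.0
    for (i, j), n in counts.items():
        if n == 0:
            continue  # a zero cell contributes nothing to the sum
        p_ij = n / N
        p_i = (counts[(i, 0)] + counts[(i, 1)]) / N  # marginal of X
        p_j = (counts[(0, j)] + counts[(1, j)]) / N  # marginal of Y
        mi += p_ij * log2(p_ij / (p_i * p_j))
    return mi
```

Two identical, balanced profiles give the maximum of 1 bit, while two profiles whose states are statistically independent give 0.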

  • PHYLOGENETIC PROFILING
    The slide then states the relation between MI and eq. (1); the equations are not preserved in this transcript.
    The paper defines a new measure of correlation C between two binary strings, with 0 ≤ C ≤ 1 (eq. 3b).

  • COMPARISON OF SGA & CE
    SGA assigns an unannotated gene to all known categories of an annotated gene if the correlation between their profiles exceeds some threshold.
    CE assigns an unannotated gene by ranking each category (pathway) with a score reflecting (1) the number of annotated genes within the category whose profile correlation with the unannotated gene exceeds a pre-specified threshold, and (2) the magnitude of these correlations.
    CE substantially outperforms SGA in allocating genes to functional categories.
    SGA, for C* = 0.35, links 1025 of 2918 unannotated orthologs to one pathway.
    CE was able to assign all 2918 KEGG unannotated orthologs to pathways and all COG unannotated orthologs to COG categories.
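The SGA rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function name, the dictionary-based inputs and the use of `>=` for "meets the threshold" are assumptions.

```python
def sga_assign(correlations, annotations, threshold):
    """Standard guilt by association for one unannotated gene.

    correlations: dict mapping annotated gene -> profile correlation C
                  with the unannotated gene (hypothetical input format)
    annotations:  dict mapping annotated gene -> set of its categories
    threshold:    the correlation threshold C*
    Returns the union of all categories of every gene meeting the threshold.
    """
    assigned = set()
    for gene, c in correlations.items():
        if c >= threshold:
            # The unannotated gene inherits every category of the linked gene.
            assigned |= annotations.get(gene, set())
    return assigned


# Toy data, made up for illustration only.
toy_correlations = {"a": 0.5, "b": 0.3, "c": 0.9}
toy_annotations = {"a": {"P1", "P2"}, "b": {"P3"}, "c": {"P2", "P4"}}
toy_assigned = sga_assign(toy_correlations, toy_annotations, threshold=0.35)
```

Because SGA copies every category of every linked gene, lowering the threshold raises coverage but also brings in more unrelated categories, which is the weakness CE addresses by ranking categories instead.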

  • PATHWAY ALLOCATION PERFORMANCE

  • COMPARISON OF DECISION RULES

  • COMPARISON OF DECISION RULES
    SGA assignment based on profile identity: for inferences based on identity, only 5.4% of unannotated orthologs are assignable to KEGG pathways.
    Lowering the threshold to C* = 0.2 to achieve a coverage of 90% requires accepting a PPV of 6%.
    For inferences based on CE, PPV is markedly increased at high coverage, exceeding its SGA value approximately 6-fold.
    The two decision rules perform similarly at coverages below 20%.
    PPV estimates are conservative.
    CE outperforms SGA.

  • COMPARISON OF DECISION RULES At C* = 0.4, where the SGA and CE curves for PPV have reached about half their maximum divergence, CE performs substantially better than SGA across GO specificity levels.

  • COMPARISON WITH OTHER PUBLISHED MODELS
    Methods for drawing functional inferences, such as majority vote and Markov random fields, can assign function based on the network context of unannotated genes.
    Predictive reliability can be increased by combining them within a statistical framework such as support vector machines, Bayesian inference or Markov random fields.
    Using SGA to assign genes to GO categories, with coverage fixed at 40%, the fraction of genes assigned to at least one category decreases from 0.98 to ~0.10 as functional specificity increases.
    Using CE, the fraction correctly assigned to at least one category is 0.95 at the lowest specificity level and remains 0.78 at all specificity levels.

  • INFERENCES BASED ON COG ONTOLOGY
    COG functional categories provide a low-resolution but fully resolved annotation: a one-gene-to-one-functional-category mapping.
    Profiling by CE of the full set of 4826 genes at C* = 0.55 returns 926 genes linked to at least one annotated gene.
    Each of the 926 genes, including 249 unannotated ones, is assignable to a COG category.
    Performance estimate: 68% (463/677).

  • INFERENCES BASED ON COG ONTOLOGY

  • INFERENCES BASED ON COG ONTOLOGY A more detailed view of the category H TP set reveals two strikingly dense clusters, one with 7 orthologs, the other with 11.

  • PHYLOGENETIC PROFILES OF THE 11-MEMBER CLUSTER Phylogenetic profiles of the 11-member cluster of orthologs across 66 genomes uncovered by CE. Green represents absence and red presence of an ortholog.

  • CLIQUES, CLUSTERS & INFERENCE QUALITY As the threshold decreases from its most stringent value (C* = 0.91), the number of clusters containing more than 3 nodes increases, peaks at C* = 0.66, and then declines as the nodes coalesce into increasingly larger clusters.
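The rise and fall of the cluster count can be illustrated with a small sketch. It assumes that a "cluster" is a connected component of the graph formed by linking genes whose correlation meets the threshold; the slide does not state the clustering method, so this choice, the function names and the toy data are all assumptions.

```python
def clusters_at_threshold(genes, pair_correlations, threshold):
    """Connected components of the graph with an edge wherever C >= threshold."""
    parent = {g: g for g in genes}

    def find(g):
        # Union-find root lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), c in pair_correlations.items():
        if c >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b  # merge the two components

    components = {}
    for g in genes:
        components.setdefault(find(g), set()).add(g)
    return list(components.values())

def count_large_clusters(genes, pair_correlations, threshold, min_size=4):
    """Number of clusters with more than 3 nodes (min_size=4)."""
    found = clusters_at_threshold(genes, pair_correlations, threshold)
    return sum(1 for cluster in found if len(cluster) >= min_size)


# Toy data, made up for illustration: two tight groups joined by a weak link.
toy_genes = list("abcdefgh")
toy_pairs = {
    ("a", "b"): 0.9, ("b", "c"): 0.9, ("c", "d"): 0.9,
    ("e", "f"): 0.7, ("f", "g"): 0.7, ("g", "h"): 0.7,
    ("d", "e"): 0.5,
}
toy_counts = [count_large_clusters(toy_genes, toy_pairs, t) for t in (0.8, 0.6, 0.4)]
```

On the toy data the count goes from one large cluster at the strictest threshold to two at the intermediate one, then back to one once the weak link merges both groups, mirroring the peak-then-decline behaviour described on the slide.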

  • CLIQUES, CLUSTERS & INFERENCE QUALITY

  • METHODS
    Dataset: COG database. Accuracy is evaluated against KEGG.
    Assessment: positive predictive value (PPV). By definition, the population-averaged positive predictive value is (equation not preserved in this transcript)

  • METHODS
    PPV as a product of two factors
    Related metrics: SPE-ACC SEN-A0

  • STANDARD GUILT BY ASSOCIATION
    Let i be the number of categories that contain the gene I.
    Let J(I, J) be the set of categories that contain a gene J whose profile correlation with I meets the threshold C*; j(I, J) is its size.
    Let K(I, J) denote the set of common categories and k(I, J) its size, where 0 ≤ k(I, J) ≤ min(i, j).
    The unannotated gene is therefore correctly assigned to TP = k categories and incorrectly assigned to the remaining FP = j - k categories. Also TN = T - i - j + k and FN = i - k, where T = 133 is the total number of pathways.
    Consequently, the PPV_I(J) with which gene I is assigned using linked gene J is PPV_I(J) = TP / (TP + FP) = k / j.
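The counts on this slide can be checked with a short sketch. It follows the slide's definitions (TP = k, FP = j - k, FN = i - k, TN = T - i - j + k) and takes PPV as TP / (TP + FP); the function name and the set-based inputs are assumptions made for illustration.

```python
def sga_confusion_counts(categories_of_I, categories_of_J, total_categories=133):
    """Confusion counts when gene I inherits all categories of linked gene J.

    categories_of_I: set of categories that truly contain gene I
    categories_of_J: set of categories of the linked gene J
    total_categories: T, the total number of pathways (133 on the slide)
    """
    i = len(categories_of_I)
    j = len(categories_of_J)
    k = len(categories_of_I & categories_of_J)  # common categories
    tp = k
    fp = j - k
    fn = i - k
    tn = total_categories - i - j + k
    ppv = tp / (tp + fp) if j > 0 else None  # undefined when J has no categories
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn, "PPV": ppv}


# Toy example with made-up pathway labels.
toy_result = sga_confusion_counts({"P1", "P2", "P3"}, {"P2", "P3", "P4", "P5"})
```

The four counts always sum to T, and since k can never exceed min(i, j), the PPV is bounded above by min(i, j) / j.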

  • STANDARD GUILT BY ASSOCIATION The maximum PPV_I(J) is not necessarily 1, but min(i, j)/j. For j

  • CORRELATION ENRICHMENT
    Suppose an unannotated gene I is correlated with a total of g other genes (C > C*) from r categories, and let m1, m2, ..., mr be the number of correlated genes in categories k1, k2, ..., kr, where r ≤ g, the equality holding only when each gene is in one category.
    Further, let (symbol missing in this transcript) denote the categories the gene is in.
    For each of the r categories that have 1 or more genes meeting the correlation threshold with I, define a weighted sum score S (equation not preserved in this transcript), where v is a positive adjustable integer that gives disproportionately high weights to strong correlations.
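A possible form of the CE score can be sketched as follows. The score equation is missing from the transcript, so this sketch assumes that the score of a category is the sum of C raised to the power v over the correlated genes in that category. That form is consistent with the description (it grows with both the number of correlated genes and the size of their correlations, and a larger v favours strong correlations), but it is an assumption, as are the function names and input format.

```python
def ce_scores(correlations, annotations, threshold, v=2):
    """Assumed CE score per category: sum of C**v over genes with C >= threshold.

    correlations: dict mapping annotated gene -> correlation C with gene I
    annotations:  dict mapping annotated gene -> set of its categories
    v:            positive integer exponent weighting strong correlations
    """
    scores = {}
    for gene, c in correlations.items():
        if c < threshold:
            continue  # only genes meeting the threshold contribute
        for category in annotations.get(gene, ()):
            scores[category] = scores.get(category, 0.0) + c ** v
    return scores

def ce_rank(scores):
    """Categories ordered from highest to lowest score (ties broken by name)."""
    return sorted(scores, key=lambda category: (-scores[category], category))


# Toy data, made up for illustration only.
toy_correlations = {"a": 0.9, "b": 0.5, "c": 0.5, "d": 0.2}
toy_annotations = {"a": {"P1"}, "b": {"P2"}, "c": {"P2"}, "d": {"P3"}}
rank_v1 = ce_rank(ce_scores(toy_correlations, toy_annotations, threshold=0.4, v=1))
rank_v2 = ce_rank(ce_scores(toy_correlations, toy_annotations, threshold=0.4, v=2))
```

With v = 1 the category with two moderate links ranks first; with v = 2 the category with one strong link overtakes it, which illustrates how v shifts weight toward strong correlations.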

  • CORRELATION ENRICHMENT
    FP = r0 - TP
    FN = T1 - TP
    TN = T - r0 - T1 + TP
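These identities imply r0 = TP + FP and T1 = TP + FN, so r0 can be read as the number of categories assigned to the gene and T1 as the number of categories the gene truly belongs to. The sketch below relies on that reading, which is inferred from the identities rather than stated on the slide. It also assumes the gene is assigned to the r0 top-ranked categories; the function name and the toy labels are hypothetical.

```python
def ce_confusion_counts(ranked_categories, true_categories, r0, total_categories):
    """Confusion counts when the gene is assigned to the r0 top-ranked categories.

    ranked_categories: categories ordered from highest to lowest CE score
    true_categories:   set of categories the gene actually belongs to (size T1)
    r0:                number of top-ranked categories assigned (assumed meaning)
    total_categories:  T, the total number of categories
    """
    assigned = set(ranked_categories[:r0])
    t1 = len(true_categories)
    tp = len(assigned & true_categories)
    fp = len(assigned) - tp
    fn = t1 - tp
    tn = total_categories - len(assigned) - t1 + tp
    ppv = tp / (tp + fp) if assigned else None  # undefined with no assignment
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn, "PPV": ppv}


# Toy ranking with made-up pathway labels.
toy_ranked = ["P2", "P1", "P4", "P5"]
toy_true = {"P2", "P4"}
top1 = ce_confusion_counts(toy_ranked, toy_true, r0=1, total_categories=133)
top3 = ce_confusion_counts(toy_ranked, toy_true, r0=3, total_categories=133)
```

Assigning only the top-ranked category gives a PPV of 1 on the toy data but misses one true category; extending to the top three recovers it at the cost of one false positive.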

  • REFERENCES
    Wu J, Kasif S, DeLisi C: Identification of functional links between genes using phylogenetic profiles. Bioinformatics 2003, 19(12):1524-1530.

    Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 1999, 96(8):4285-4288.

    Aravind L: Guilt by association: contextual information in genome analysis. Genome Res 2000, 10(8):1074-1077.

    Nariai N, Tamada Y, Imoto S, Miyano S: Estimating gene regulatory networks and protein-protein interactions of Saccharomyces cerevisiae from multiple genome-wide data. Bioinformatics 2005, 21 Suppl 2:ii206-ii212.