
Basic Bioinformatics: Homology and Databases

Rafael Dias Mesquita

Bioinformatics Laboratory

Department of Biochemistry, Institute of Chemistry - UFRJ

Definitions

• Homologous: genes that have a common ancestor. Homology cannot be measured: two sequences either are homologous or are not.
• Orthologous: the same gene in different species. It is the result of speciation (common ancestor).
• Paralogous: related genes (already diverged) in the same species. It is the result of genomic rearrangements or duplication.

Strategies to find homology

• Reciprocal best-hit
  • Simply BLAST Seq1 from species 1 against species 2, then BLAST the best hit, Seq2, back against species 1.
  • If Seq1 and Seq2 are reciprocal best hits, they are inferred to be homologous.
• OMA strategy
• MolFuncs strategy
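The reciprocal best-hit test can be sketched in a few lines of Python. This is a minimal illustration, not the course's own script: the input format (a dictionary mapping each query to a list of `(subject, bit score)` hits, such as one might parse from tabular BLAST output) and all function names are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical input format (not from the slides): for each query ID,
# a list of (subject ID, bit score) hits.
Hits = Dict[str, List[Tuple[str, float]]]


def best_hit(hits: Hits, query: str) -> Optional[str]:
    """Return the highest-scoring subject for `query`, or None if it has no hits."""
    candidates = hits.get(query, [])
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])[0]


def reciprocal_best_hits(a_vs_b: Hits, b_vs_a: Hits) -> List[Tuple[str, str]]:
    """Return pairs (seq_a, seq_b) in which each sequence is the other's best hit."""
    pairs = []
    for seq_a in sorted(a_vs_b):
        seq_b = best_hit(a_vs_b, seq_a)
        if seq_b is not None and best_hit(b_vs_a, seq_b) == seq_a:
            pairs.append((seq_a, seq_b))
    return pairs


# Toy data: A1 and B1 pick each other; A2's best hit (B1) prefers A1.
species1_vs_species2 = {"A1": [("B1", 250.0), ("B2", 90.0)], "A2": [("B1", 120.0)]}
species2_vs_species1 = {"B1": [("A1", 248.0), ("A2", 118.0)], "B2": [("A1", 88.0)]}
rbh_pairs = reciprocal_best_hits(species1_vs_species2, species2_vs_species1)
print(rbh_pairs)
```

In the toy data only A1 and B1 form a reciprocal pair; A2 is left without a partner because its best hit points back to a different sequence.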

OMA strategy

• Using slides from Dr. Manuel Gil (only those related to the OMA presentation). Original presentation: http://www.cbrg.ethz.ch/education/BioInf2/BioInf2_Orthology.pdf

Two Phases of Pairwise Approach

• Orthology inference
• Clustering from pairs

Basic Idea

• Orthologs are closer than paralogs
• Closer genes usually have a higher pairwise alignment score → species-specific top-scoring hit
• Corresponding orthologs may be missing → "bidirectional best hit" (BBH)

Refinements of the Basic Idea

• Use distance instead of score
• Take into account the variance of the distance estimates
• Relax the top/smallest requirement to include more than one ortholog
• Detect differential gene losses
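One way to read the first three refinements is as a tolerance rule: instead of keeping only the single closest gene, keep every gene whose distance is not significantly larger than the minimum. The sketch below is a simplification under stated assumptions, not the OMA implementation: it treats distance estimates as independent, and the factor `k = 1.96` and all names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def near_minimum_candidates(
    dists: Dict[str, Tuple[float, float]], k: float = 1.96
) -> List[str]:
    """Keep genes whose distance is within k standard deviations of the minimum.

    `dists` maps a gene ID to (distance estimate, variance of that estimate).
    Assumption: estimates are independent, so the variance of a difference
    is the sum of the two variances.
    """
    if not dists:
        return []
    closest = min(dists, key=lambda gene: dists[gene][0])
    d_min, var_min = dists[closest]
    kept = []
    for gene, (dist, var) in sorted(dists.items()):
        if dist - d_min <= k * math.sqrt(var + var_min):
            kept.append(gene)
    return kept


# Toy data: z1 and z2 are about equally close to the query gene, z3 is far.
example = {"z1": (0.30, 0.0004), "z2": (0.33, 0.0004), "z3": (0.90, 0.0009)}
close_genes = near_minimum_candidates(example)
print(close_genes)
```

With a strict "smallest distance only" rule, z2 would be discarded even though its distance is statistically indistinguishable from that of z1; the tolerance keeps both as candidates.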

[Figure: gene tree with gene duplication and speciation events; genes a, b1, b2, c1, c2; species S1 and S2; "stable pairs"]

Detect Gene Losses

[Figure series: gene trees with duplication, speciation, and loss events]

Dessimoz, Boeckmann, et al., Nucleic Acids Res, 2006

• (x1, z3) and (y2, z4) are stable pairs
• d(x1, z3) < d(x1, z4)
• d(y2, z4) < d(y2, z3)
• d(x1, z4) = d(y2, z3)
• All relations take the variance of the distance estimates into account
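The distance conditions listed for detecting gene losses can be written as a small predicate. This is only a sketch of the conditions as stated on the slide: the single tolerance `tol` is an assumed stand-in for "considering variance of distance estimates", the function and variable names are hypothetical, and the stable-pair status of (x1, z3) and (y2, z4) is taken as given rather than checked.

```python
from typing import Dict, FrozenSet

# Hypothetical container: symmetric pairwise distances keyed by the gene pair.
Distances = Dict[FrozenSet[str], float]


def d(dist: Distances, a: str, b: str) -> float:
    """Look up the distance between two genes regardless of argument order."""
    return dist[frozenset((a, b))]


def satisfies_loss_conditions(
    dist: Distances, x1: str, y2: str, z3: str, z4: str, tol: float = 0.05
) -> bool:
    """Check the three distance conditions, using `tol` in place of a variance test."""
    x1_closer_to_z3 = d(dist, x1, z4) - d(dist, x1, z3) > tol
    y2_closer_to_z4 = d(dist, y2, z3) - d(dist, y2, z4) > tol
    cross_equal = abs(d(dist, x1, z4) - d(dist, y2, z3)) <= tol
    return x1_closer_to_z3 and y2_closer_to_z4 and cross_equal


toy = {
    frozenset(("x1", "z3")): 0.20,
    frozenset(("x1", "z4")): 0.80,
    frozenset(("y2", "z4")): 0.25,
    frozenset(("y2", "z3")): 0.82,
}
loss_detected = satisfies_loss_conditions(toy, "x1", "y2", "z3", "z4")

# Counter-example: x1 is about equally far from z3 and z4.
toy_flat = dict(toy)
toy_flat[frozenset(("x1", "z4"))] = 0.22
loss_rejected = satisfies_loss_conditions(toy_flat, "x1", "y2", "z3", "z4")
print(loss_detected, loss_rejected)
```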

Grouping of Orthologs

• If interested in gene x: → all genes orthologous to x
• COG database: → "triangles" of orthologs; merge triangles with a common face (Tatusov et al., Science, 1997)
• OMA Groups: → all pairs in the group are orthologs
• Hierarchical: → orthologs and "in-paralogs" with respect to a taxonomic range

Orthology Graph

[Figure: panels titled Species Tree, Orthology Graph, Gene Tree, and Hierarchical Groups; gene-tree node labels n_i, L(n_i), S(n_i)]

OMA Groups

• Complete cliques in the orthology graph

[Figure: orthology graph with genes w1, x1, y2, z1, z2]
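To make "complete cliques in the orthology graph" concrete, the sketch below enumerates maximal cliques with the basic Bron-Kerbosch recursion. This is a generic textbook algorithm swapped in for illustration; it is not claimed to be how OMA itself selects its groups, and the toy edges between the gene names are invented.

```python
from typing import Dict, List, Set


def maximal_cliques(graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Enumerate maximal cliques of an undirected graph (basic Bron-Kerbosch)."""
    cliques: List[Set[str]] = []

    def expand(current: Set[str], candidates: Set[str], excluded: Set[str]) -> None:
        if not candidates and not excluded:
            cliques.append(set(current))  # nothing can extend it: maximal
            return
        for gene in sorted(candidates):
            expand(current | {gene}, candidates & graph[gene], excluded & graph[gene])
            candidates = candidates - {gene}
            excluded = excluded | {gene}

    expand(set(), set(graph), set())
    return cliques


# Toy orthology graph (edges are invented): w1, x1, y2, z1 are all pairwise
# orthologs, while z2 is connected to y2 only.
orthology_graph = {
    "w1": {"x1", "y2", "z1"},
    "x1": {"w1", "y2", "z1"},
    "y2": {"w1", "x1", "z1", "z2"},
    "z1": {"w1", "x1", "y2"},
    "z2": {"y2"},
}
groups = sorted(sorted(clique) for clique in maximal_cliques(orthology_graph))
print(groups)
```

In the toy graph, z2 cannot join the group of four because it is not orthologous to w1, x1, or z1, which is exactly the "all pairs in the group are orthologs" requirement.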

Hierarchical Groups

[Figure: Species Tree, Orthology Graph, Gene Tree, and Hierarchical Groups panels; gene-tree node labels n_i, L(n_i), S(n_i)]

Hierarchical Groups

[Figure: induced orthology subgraph and induced forest of gene trees at node n_i]

• Induced orthology subgraph
• Induced forest of gene trees
• Connected components in the induced subgraph
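The idea of taking connected components in the subgraph induced by a taxonomic range can be sketched as follows. The data layout (an adjacency dictionary plus a gene-to-species map), the function name, and the toy genes are all assumptions made for the example.

```python
from typing import Dict, List, Set


def hierarchical_groups(
    graph: Dict[str, Set[str]], species_of: Dict[str, str], clade: Set[str]
) -> List[Set[str]]:
    """Connected components of the orthology graph restricted to species in `clade`."""
    nodes = {gene for gene in graph if species_of[gene] in clade}
    seen: Set[str] = set()
    groups: List[Set[str]] = []
    for start in sorted(nodes):
        if start in seen:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:
            gene = stack.pop()
            if gene in component:
                continue
            component.add(gene)
            # Follow only edges that stay inside the induced subgraph.
            stack.extend((graph[gene] & nodes) - component)
        seen |= component
        groups.append(component)
    return groups


# Toy data: two gene lineages that are only linked through species D.
toy_graph = {
    "a1": {"b1"}, "b1": {"a1", "c1"}, "c1": {"b1", "d1"},
    "a2": {"c2"}, "c2": {"a2", "d1"}, "d1": {"c1", "c2"},
}
toy_species = {"a1": "A", "a2": "A", "b1": "B", "c1": "C", "c2": "C", "d1": "D"}
groups_abc = sorted(sorted(g) for g in hierarchical_groups(toy_graph, toy_species, {"A", "B", "C"}))
groups_abcd = hierarchical_groups(toy_graph, toy_species, {"A", "B", "C", "D"})
print(groups_abc, len(groups_abcd))
```

Restricting to species A, B, and C yields two groups; widening the range to include D merges them into one, which is the sense in which the groups are hierarchical.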

Putting it Together: OMA algorithm

Input: all protein sequences from full genomes

Pairs                  Evolutionary relation
All Pairs (AP)         Any
Candidate Pairs (CP)   Homologs
Stable Pairs (SP)      Orthologs, pseudo-orthologs
Verified Pairs (VP)    Orthologs
Group Pairs (GP)       Close orthologs
Broken Pairs (BP)      Paralogs; BP = SP \ VP

[Figure: sets AP, CP, SP, VP, GP; legend: orthologs, paralogs]

Roth et al., BMC Bioinformatics, 2008
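The relation BP = SP \ VP is a plain set difference, which a short Python example makes explicit. The pairs below are invented toy data, chosen to echo the gene names used in the gene-loss slide.

```python
# Toy pairs (invented): three stable pairs, two of which pass verification.
stable_pairs = {("x1", "y2"), ("x1", "z3"), ("y2", "z4")}
verified_pairs = {("x1", "z3"), ("y2", "z4")}

# Broken pairs are the stable pairs that failed verification: BP = SP \ VP.
broken_pairs = stable_pairs - verified_pairs
print(broken_pairs)
```

Under the table's reading, the pair left in `broken_pairs` would be classified as paralogs, while the verified pairs remain orthologs.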

MolFuncs

• Most Likely Functional Counterparts
• Uses network theory
• The user defines the high-confidence homologs ⇒ reliability
• Uses protein sequences
• BLAST- and HMMER-based

Functional Equivalency Inferred from "Authoritative Sources" in Networks of Homologous Proteins

Shreedhar Natarajan1¤, Eric Jakobsson1,2,3*

1 Biophysics and Computational Biology, University of Illinois, Urbana-Champaign, Illinois, United States of America, 2 National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, Illinois, United States of America, 3 Department of Molecular and Integrative Physiology, University of Illinois, Urbana-Champaign, Illinois, United States of America

Abstract

A one-on-one mapping of protein functionality across different species is a critical component of comparative analysis. This paper presents a heuristic algorithm for discovering the Most Likely Functional Counterparts (MoLFunCs) of a protein, based on simple concepts from network theory. A key feature of our algorithm is utilization of the user's knowledge to assign high confidence to selected functional identification. We show use of the algorithm to retrieve functional equivalents for 7 membrane proteins, from an exploration of almost 40 genomes from multiple online resources. We verify the functional equivalency of our dataset through a series of tests that include sequence, structure and function comparisons. Comparison is made to the OMA methodology, which also identifies one-on-one mapping between proteins from different species. Based on that comparison, we believe that incorporation of user's knowledge as a key aspect of the technique adds value to purely statistical formal methods.

Citation: Natarajan S, Jakobsson E (2009) Functional Equivalency Inferred from "Authoritative Sources" in Networks of Homologous Proteins. PLoS ONE 4(6): e5898. doi:10.1371/journal.pone.0005898

Editor: Olaf Sporns, Indiana University, United States of America

Received July 9, 2008; Accepted April 29, 2009; Published June 12, 2009

Copyright: © 2009 Natarajan et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Funding by the National Science Foundation and the University of Illinois. The sponsors had no role in the design and conduct of the study, in the collection, analysis, and interpretation of the data, and in the preparation, review, or approval of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

¤ Current address: Department of Biology, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America

Introduction

The current spate of genome sequencing projects [1] has resulted in large amounts of sequence information from all kingdoms of life. Experimental techniques to characterize and annotate these sequences have not yet kept pace with the generation of data, and it is not foreseeable that they ever will, because sequencing is inherently faster than all present or foreseeable methods of experimental functional determination. Therefore, comparative genomic analysis is being increasingly employed for functional annotation. The basis of most comparative techniques is the notion of homology or common evolutionary origin of the gene/protein sets being investigated. The multiplicity of evolutionary scenarios necessitates a more fine-grained description of homology in terms of orthologs, in-paralogs and out-paralogs [2]. Orthologs are genes from different species that have a common ancestor. Traditionally, orthologous genes from different species were thought of as having similar functions. However, gene duplication can result in functional divergence within a species and give rise to paralogs. In-paralogs and out-paralogs are defined based on the relative order of duplication and speciation events. Depending on the degree of divergence, paralogs can retain a significant portion of the sequence features of the original gene. Since duplication of a gene can still satisfy the constraint of common ancestor with genes from other species, multiple pairs of orthologous genes in two species can have arisen from a single ancestor prior to the duplication.

Our explorations were motivated by a desire to predict protein interaction networks using the evolutionary correlation method [3]. This method is based on the premise that proteins that interact would have correlated substitution patterns across species. Application of the evolutionary correlation method requires a protocol to identify corresponding proteins for the comparison. It is desirable that the full repertoire of functional capabilities of each protein - both in terms of its physiological roles, as well as the mechanisms of regulation - be as similar as possible across the species set considered. Imposing this constraint will also likely ensure that the protein pair from each species interacts with each other. In the absence of prior knowledge on the multiplicity of pairings between the two protein sets, it is necessary that the protein representatives be unique for each species. In our work, we refer to such a sample as the most likely functional counterpart (MoLFunC) of each other.

A pair of "MoLFunCs" is similar to a pair of orthologous proteins, but the concept is slightly different. The strict definition of orthology is in terms of descent. The root definition of orthology is in terms of genes, and the application to proteins is derived from the application to genes. The definition of "MoLFunC" is specific to proteins, and implies an attribution of a common function. Note that in the definition of MoLFunCs, different splice variants of orthologous genes may not be MoLFunCs of each other.

The most common tool used for sequence similarity is BLAST (Basic Local Alignment Search Tool) [4]. It often happens that the result of bi-directional BLAST searches between two genomes is asymmetric. If protein PA in species A picks up protein PB in



MolFuncs

• Start with one protein from one species (the query)
• Reciprocal BLAST against the other species
• All reciprocal best hits become authority seeds
• All authority seeds are used to BLAST against all species
  ⇒ hits that were reciprocal to the query become seeds
  ⇒ iterate until no new protein is found
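The seed-collection loop above is a fixed-point iteration, which can be sketched as a simple worklist. This is a simplified reading of the slide, not the published MoLFunCs code: the BLAST searches are hidden behind a hypothetical callback `rbh_partners` that returns the reciprocal best hits of a protein in the other species, and the toy table stands in for real search results.

```python
from typing import Callable, Dict, Set


def collect_seeds(query: str, rbh_partners: Callable[[str], Set[str]]) -> Set[str]:
    """Expand from the query until no new reciprocal best hit appears."""
    seeds: Set[str] = set()
    visited = {query}
    frontier = [query]
    while frontier:
        current = frontier.pop()
        for partner in rbh_partners(current):
            if partner not in visited:
                visited.add(partner)
                seeds.add(partner)
                frontier.append(partner)  # new seeds are searched in turn
    return seeds


# Toy stand-in for BLAST results: C is reachable only through seed A,
# and D has no reciprocal best hit at all.
toy_rbh: Dict[str, Set[str]] = {
    "Q": {"A", "B"},
    "A": {"Q", "C"},
    "B": {"Q"},
    "C": {"A"},
    "D": set(),
}
authority_seeds = collect_seeds("Q", lambda protein: toy_rbh.get(protein, set()))
print(sorted(authority_seeds))
```

The loop terminates because each protein is visited at most once, which corresponds to the slide's "iterate until no new protein is found".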

MolFuncs

[Figure: reciprocal-best-hit matrix]

• "1": reciprocal best hit found
• "0": no reciprocal best hit

Color code:
• Query protein: pink
• Authority seeds: reliable (query related)
• Medium (query unrelated)
• Putative

MolFuncs

• Authority core = query + authority seeds + all other unique sequences showing the same vector as the query in the matrix (the same "1" and "0" results when compared to all other sequences)
• HMM constructed to clean duplicates
• Rebuild authority core and HMM

Cleaning the worst sequences (next slide)
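The "same vector as the query" rule for building the authority core can be illustrated with a small 0/1 matrix. This sketch covers only the vector comparison, not the HMM steps. It assumes the matrix is stored as nested dictionaries and that two sequences are compared on their entries against all other sequences (ignoring the two being compared); that convention, the names, and the toy values are assumptions, not details from the slide.

```python
from typing import Dict, Iterable, Set

# matrix[a][b] is 1 if a and b are reciprocal best hits, else 0 (toy layout).
Matrix = Dict[str, Dict[str, int]]


def same_vector(matrix: Matrix, a: str, b: str) -> bool:
    """True if a and b have identical 0/1 results against every other sequence."""
    others = [seq for seq in matrix if seq not in (a, b)]
    return all(matrix[a][seq] == matrix[b][seq] for seq in others)


def authority_core(matrix: Matrix, query: str, seeds: Iterable[str]) -> Set[str]:
    """Query + authority seeds + every other sequence matching the query's vector."""
    core = {query} | set(seeds)
    for seq in matrix:
        if seq not in core and same_vector(matrix, query, seq):
            core.add(seq)
    return core


toy_matrix = {
    "Q":  {"S1": 1, "P1": 1, "P2": 0},
    "S1": {"Q": 1, "P1": 1, "P2": 0},
    "P1": {"Q": 1, "S1": 1, "P2": 0},
    "P2": {"Q": 0, "S1": 0, "P1": 0},
}
core = authority_core(toy_matrix, "Q", ["S1"])
print(sorted(core))
```

Here P1 joins the core because it agrees with the query on every other sequence, while P2 is left out.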

MolFuncs: Filtering by vote

[Figure: vote matrix with labels "FRESHER CORE SEQUENCES" and "NON CORE SEQUENCES"]

• Remove a sequence if its sum is less than ½ of the maximum sum (beige color)
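The vote filter reduces to one threshold comparison per sequence. In the sketch below, each sequence is assumed to carry a list of 0/1 votes (for example, one per core sequence); the slide does not spell out what is being summed, so that interpretation, the function name, and the toy numbers are all assumptions.

```python
from typing import Dict, List


def filter_by_vote(votes: Dict[str, List[int]]) -> Dict[str, int]:
    """Keep sequences whose vote sum is at least half of the maximum sum."""
    totals = {seq: sum(row) for seq, row in votes.items()}
    if not totals:
        return {}
    threshold = max(totals.values()) / 2
    # "Less than half of the maximum" is removed, so exactly half is kept.
    return {seq: total for seq, total in totals.items() if total >= threshold}


toy_votes = {"P1": [1, 1, 1, 1], "P2": [1, 1, 0, 0], "P3": [1, 0, 0, 0]}
kept = filter_by_vote(toy_votes)
print(kept)
```

With a maximum sum of 4, the cutoff is 2: P2 sits exactly on the threshold and survives, while P3 is removed.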

MolFuncs

• Last HMM used to clean ambiguities

MolFuncs

• Result is one sequence per organism → Most Likely Functional Counterpart
• No web database available

Homolog databases

• OMA: http://omabrowser.org/cgi-bin/gateway.pl
• KEGG: home: http://www.kegg.jp/kegg/ overview: http://www.kegg.jp/kegg/kegg1a.html
• GO (Gene Ontology): http://www.geneontology.org/
• HomoloGene: http://www.ncbi.nlm.nih.gov/homologene
• OrthoMCL: home: http://www.orthomcl.org/cgi-bin/OrthoMclWeb.cgi
• KOG (What happened? Where are you?)
• COG (Clusters of Orthologous Genes; no web server, download only): http://www.ncbi.nlm.nih.gov/COG/
