evolution of multidomain proteins cs 374 – lecture 10 wissam kazan

Post on 22-Dec-2015


Evolution of Multidomain Proteins

CS 374 – Lecture 10

Wissam Kazan

Reference Papers

• C. Chothia, J. Gough, C. Vogel, S. A. Teichmann, “Evolution of the Protein Repertoire”, www.sciencemag.com, Science VOL 300, 13 June 2003

• T. Przytycka, G. Davis, N. Song, D. Durand, “Graph Theoretical Insights into Evolution of Multidomain Proteins”, RECOMB 2005, LNBI 3500, pp. 311-325, 2005

Proteins

• Large Organic Compounds made of amino acids

• Fold into specific Structures, unique to each protein.

Three possible representations of the three-dimensional structure of the protein triose phosphate isomerase.

Protein Functions

• Chief Actors in the cell

• Proteins bind to other molecules specifically and tightly at the binding site

• Act as enzymes to catalyze chemical reactions

• Antibodies are proteins that bind to antigens and target them for destruction

Protein Domains

• Primary Constituent of Proteins

• A domain is a conserved evolutionary structural unit:

– Assumed to fold independently

– Observed in different proteins in the context of different neighboring domains

– Whose coding sequence can be duplicated and/or undergo recombination

Domains – cont’d

• Small proteins contain just one domain

• Large proteins are formed by combination of domains

Cartoon representation of the protein Zif268 (blue) containing three zinc finger domains in complex with DNA (orange). The coordinating amino acid residues of the middle zinc ion (green) are highlighted.

Domains – cont’d (2)

• Often each domain has a separate function to perform for the protein

• Domain lengths typically range from 100 to 250 amino acid residues

Binding Domain

Domain Family

• A domain family is a collection of small proteins and/or parts of larger ones that descend from a common ancestor.

PR domain family members

Increase in Protein Repertoire

• The dominant mechanisms are:

– Duplications of sequences coding for one or more domain

– Divergence of duplicated sequences by mutations, deletions and insertions producing modified structures that may have useful new properties

– Recombination of genes that results in new arrangements of domains

Family relationships

• Difficult to detect distant relationships by direct comparisons of sequences

• Presence/Absence of domains and family relationships can be determined if the 3D structures are known

• We only know family relationships and domain structures of proteins of known structures or proteins homologous to proteins of known structures

C2 domain family

Domain Family Sizes

• In individual genomes, the number of members in the different families fits a Pareto distribution:

– Few families have many members

– Many families have few members

• This is mainly the result of selection for useful functions:

– Some families have properties that lend themselves to a wide variety of molecular functions

– Example: the P-loop nucleotide family has members functioning as kinases with different specificities and as different kinds of motor proteins

Analysis of Evolution

• 50% of sequences in the currently known genomes are homologous to proteins of known structure

• Based on that half, we have a detailed picture of protein evolution, explained in the following slides

SCOP DB

• Relationships of domains in proteins of known structure are described in the Structural Classification of Proteins (SCOP) Database

http://scop.berkeley.edu/data/scop.b.html

Families and Species

• Vertebrates, ~750 different families, with 50 members per family on average

• Invertebrates, ~670 different families, with 20 members per family on average

• Yeast and bacteria with large genomes, ~550 different families, with 8 members per family on average

• Parasitic bacteria, ~220 different families, with 2 members per family on average

Protein Repertoire

• The larger domain families make up the bulk of the protein repertoire in each genome and are widely distributed across genomes

• 429 families occur in all of the 14 known eukaryote genomes. Their members form:

– 80% of domains in animals

– 90% of domains in fungi and plants

Contribution of common families to the protein repertoire

Domain Combinations

• Many proteins formed by combinations of two or more domains

• Domains from some families appear together with domains from several families

• Multidomain proteins constitute four-fifths of eukaryote proteins and two-thirds of prokaryote proteins

• Phenomenon called Domain Accretion

Known Combinations

• 1100 families of proteins of known structure in total

• 1100² = 1,210,000 different possible pairwise combinations.

• Only useful combinations will be present in genomes

• Studies showed that only 2500 pairwise combinations were found in 85 different genomes
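The arithmetic on this slide can be checked in a few lines. This is a minimal sketch using only the two figures quoted above; the variable names are illustrative, and it assumes that 1100² counts ordered pairs, including a family paired with itself.

```python
# Figures quoted on the slide; variable names are illustrative.
n_families = 1100
observed_pairs = 2500

# Assumption: ordered pairs (A then B differs from B then A), self-pairs included.
possible_pairs = n_families ** 2

fraction_observed = observed_pairs / possible_pairs

print(f"possible pairwise combinations: {possible_pairs:,}")
print(f"fraction actually observed:     {fraction_observed:.2%}")
```

Only about 0.2% of the possible pairs are observed.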

Combination Properties

• Few families have members present in many different combinations

• Many families combine with just one or two others.

• Power Law (Again!)

• Sequential Order

Supradomains

• Two-domain and Three-domain combinations recurring in different protein contexts with different partner domains

• Have a particular functional and spatial relationship

• Larger than individual domains

Supradomain

Metabolic Pathway Formation

• Proteins in a pathway do not function by themselves

• A metabolic pathway is a series of chemical reactions occurring within a cell, catalyzed by enzymes, resulting in the formation of a metabolic product to be used or stored by the cell.

Problem

How does the duplication, divergence and recombination process of the proteins fit into the formation or extension of pathways?

First Solution

• Substrates in pathways retain some similarities

• Enzymes evolve by gene duplications:

– Catalytic mechanisms change

– Some aspects of their recognition properties are retained

Second Solution

• Enzymes recruited across pathways

• Duplicated Enzymes conserve their catalytic functions while evolving different substrate specificities

Multidomain Protein Mystery

• Are new domains acquired infrequently, or often enough that the same combinations of domains will be repeated through independent events?

Multidomain Protein Mystery

• Once domain architectures are created, do they persist?

• If a domain is present in ancestral proteins, is it likely to be observed in current proteins?

Protein Family Analysis

• One traditional method:

– A tree modeling gene family evolution, based on multiple sequence alignments

• It is unclear how to build the model for families with heterogeneous domains

• Solution Proposed: Analyze a graph structure to study multidomain protein evolution

Parsimony Model

• Assume a phylogenetic tree, with each node described by a set of characters (one per domain)

• Focus on binary characters:

– 1: presence of a domain in the node

– 0: absence of a domain in the node

• Perfect Phylogeny: Each character state change occurs at most once

• State Change:

– 0 → 1: Gain

– 1 → 0: Loss

Dollo Parsimony

• A character may change state from zero to one *only* once, but from one to zero *multiple* times

• Appropriate for complex characters that are hard to gain but relatively easy to lose
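To make the Dollo rule concrete, the sketch below counts losses for one binary character on a fixed rooted tree. It is an illustration, not code from the reference papers. It assumes the tree is given as nested tuples with leaf names as strings, and the function names (`count_ones`, `dollo_losses`) are hypothetical. The single gain is placed on the branch above the smallest subtree holding every leaf in state one. Each maximal subtree inside it with no leaf in state one costs one loss.

```python
def count_ones(tree, state):
    """Number of leaves under `tree` whose character state is 1."""
    if isinstance(tree, str):
        return state[tree]
    return sum(count_ones(child, state) for child in tree)


def dollo_losses(tree, state):
    """Count 1 -> 0 losses for one binary character under Dollo parsimony."""
    total = count_ones(tree, state)
    if total == 0:
        return 0  # the character is never gained in this tree

    # Descend to the smallest subtree that holds every 1-leaf;
    # the single 0 -> 1 gain sits on the branch above it.
    node = tree
    while not isinstance(node, str):
        holder = [c for c in node if count_ones(c, state) == total]
        if not holder:
            break
        node = holder[0]

    def losses(sub):
        if count_ones(sub, state) == 0:
            return 1  # one loss on the branch above clears this subtree
        if isinstance(sub, str):
            return 0
        return sum(losses(c) for c in sub)

    return losses(node)


# Hypothetical five-leaf tree and two hypothetical characters.
tree = ((("A", "B"), "C"), ("D", "E"))
clustered = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 0}
scattered = {"A": 1, "B": 0, "C": 0, "D": 1, "E": 0}

print(dollo_losses(tree, clustered))  # 1 loss (on the branch to B)
print(dollo_losses(tree, scattered))  # 3 losses (B, C and E)
```

The scattered character forces the gain back to the root, so every lineage lacking the domain must be explained by a separate loss.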

Maximum Parsimony Example

• Unrelated to the work presented but useful to explain the concept of parsimony

• We want to find a model such that we minimize the total number of insertions and deletions

• Find a tree that requires the least number of evolutionary changes.

Example

• We have four sequences:

• Find the tree that can explain the observed sequences with a minimal number of substitutions

Sequence  D1  D2  D3
S1        1   1   0
S2        1   1   1
S3        0   0   1
S4        1   0   1

Try Different Trees

[Figure: three candidate tree topologies for the four sequences, with total costs 3, 4 and 4.]
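The three costs quoted for the candidate trees can be reproduced with a short script. This sketch uses Fitch's small-parsimony algorithm, chosen here for illustration (the slides do not name an algorithm), to count state changes for each of the three ways of pairing up four leaves. The labels `s1` to `s4` for the four rows of the table are assumptions.

```python
def fitch(tree, leaf_states):
    """Return (candidate state set, change count) for one character."""
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # Disjoint candidate sets: one state change is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_cost(tree, sequences):
    """Sum the Fitch cost over all character columns."""
    n_chars = len(next(iter(sequences.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in sequences.items()})[1]
        for i in range(n_chars)
    )


# Rows of the table (columns D1, D2, D3); labels s1..s4 are assumed.
sequences = {
    "s1": (1, 1, 0),
    "s2": (1, 1, 1),
    "s3": (0, 0, 1),
    "s4": (1, 0, 1),
}

topologies = {
    "(s1,s2),(s3,s4)": (("s1", "s2"), ("s3", "s4")),
    "(s1,s3),(s2,s4)": (("s1", "s3"), ("s2", "s4")),
    "(s1,s4),(s2,s3)": (("s1", "s4"), ("s2", "s3")),
}

costs = {name: parsimony_cost(t, sequences) for name, t in topologies.items()}
for name, cost in costs.items():
    print(name, cost)  # costs: 3, 4, 4
```

Grouping `s1` with `s2` gives the cheapest tree, with one change per column.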

Domain Architectures

Phylogenetic tree of the protein tyrosine kinase family, constructed from a multiple sequence alignment (MSA) of the kinase domain.

Note that the tree is not optimal with respect to a parsimony criterion minimizing the total number of insertions and deletions. For example, if architectures INSR and EGFR were siblings (the only two architectures containing the Furin-like cysteine rich and Receptor ligand binding domains), the number of insertions and deletions would be smaller.

Evolution of Multidomain Proteins

• Multidomain proteins are formed by:

– Gene Fusion

– Domain Shuffling

– Retrotransposition of Exons

• Represent those by:

– Domain Merge

– Domain Deletion

Domain Merge

• Any process that unites two or more previously separate domains in a single protein

Domain Deletion

• Any process in which a protein loses one or more domains

Protein Overlap Graph

• Vertices are proteins

• If two proteins share a domain, the two corresponding nodes are connected by an edge

Domain Overlap Graph

• Vertices are protein domains

• Two domains are connected by an edge if there is a protein containing both domains
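Both overlap graphs follow directly from a table of domain architectures. The sketch below is an illustration under stated assumptions: architectures are given as a dict from protein name to a list of domain names, graphs are stored as adjacency sets, and the protein and domain names (`P1`, `A`, and so on) are made up.

```python
from itertools import combinations


def domain_overlap_graph(architectures):
    """Domains are vertices; an edge joins two domains found in one protein."""
    graph = {}
    for domains in architectures.values():
        for d in domains:
            graph.setdefault(d, set())
        for a, b in combinations(sorted(set(domains)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def protein_overlap_graph(architectures):
    """Proteins are vertices; an edge joins two proteins sharing a domain."""
    graph = {p: set() for p in architectures}
    for p, q in combinations(architectures, 2):
        if set(architectures[p]) & set(architectures[q]):
            graph[p].add(q)
            graph[q].add(p)
    return graph


# Hypothetical architectures, for illustration only.
architectures = {
    "P1": ["A", "B"],
    "P2": ["B", "C"],
    "P3": ["D"],
}

domain_graph = domain_overlap_graph(architectures)
protein_graph = protein_overlap_graph(architectures)
print(domain_graph)
print(protein_graph)
```

Here `A` and `C` are not adjacent in the domain overlap graph because no single protein contains both, while `P1` and `P2` are adjacent in the protein overlap graph because they share `B`.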

Domain Overlap Graph

Static Dollo Parsimony

• For any ancestral node, the set of characters in state one in this node is a subset of the set of characters in state one in some leaf node.

• Consistent with a history in which no ancestor contains a domain not seen in a leaf node

Conservative Dollo Parsimony

• For any ancestral node and any pair of characters that appear in state one in this node, there exists a leaf node where these two characters are also in state one

• Consistent with a history in which every instance of a domain pair came from a single merge event

• If domains acting in concert offer a selective advantage, it is unlikely that the pair once formed would later separate
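The static and conservative conditions can be checked for a single proposed ancestral node. This is a minimal sketch, assuming each node is described by the set of domains in state one. The function names and the example domain sets are hypothetical.

```python
from itertools import combinations


def is_static(ancestor, leaves):
    """Static: the ancestor's domain set is contained in some leaf's set."""
    return any(ancestor <= leaf for leaf in leaves)


def is_conservative(ancestor, leaves):
    """Conservative: every domain pair in the ancestor co-occurs in some leaf."""
    return all(
        any({a, b} <= leaf for leaf in leaves)
        for a, b in combinations(sorted(ancestor), 2)
    )


# Hypothetical leaf architectures, as sets of domain names.
leaves = [{"A", "B"}, {"B", "C"}, {"A", "C"}]

small_ancestor = {"A", "B"}
large_ancestor = {"A", "B", "C"}

print(is_static(small_ancestor, leaves), is_conservative(small_ancestor, leaves))
print(is_static(large_ancestor, leaves), is_conservative(large_ancestor, leaves))
```

The second ancestor is conservative but not static: every pair of its domains is seen together in some leaf, yet no leaf contains all three. Any static ancestor is automatically conservative.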

Why all this?

• If we can show that, for a family, a conservative Dollo parsimony tree does not exist, then either:

– The single insertion assumption is false, or

– The conservative assumption is too strong

Example

Domain Overlap Graph

Dollo Parsimony

Analyzing the Graph

1. Check whether the domain overlap graph is chordal

2. Check whether the domain overlap graph satisfies the Helly property

3. Conclude

Chordal Graph

• A Chord is any edge connecting two non-consecutive vertices of a cycle

• A Chordal Graph is a graph which does not contain chordless cycles of length greater than three

Chords

Theorem 1

• There exists a conservative Dollo parsimony tree for a given set of multidomain architectures, iff the domain overlap graph for this set is chordal
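Theorem 1 reduces the existence question to a chordality test on the domain overlap graph. The sketch below is one simple way to run that test, not necessarily the procedure used in the reference paper. It relies on the standard fact that a graph is chordal exactly when its vertices can be removed one at a time, each being simplicial (its neighbours forming a clique) at the moment of removal. The adjacency-set representation and the example graphs are assumptions.

```python
def is_chordal(graph):
    """graph: dict mapping each vertex to the set of its neighbours."""
    adj = {v: set(ns) for v, ns in graph.items()}  # work on a copy
    while adj:
        # Look for a simplicial vertex: all its neighbours are adjacent.
        simplicial = next(
            (
                v
                for v, ns in adj.items()
                if all(b in adj[a] for a in ns for b in ns if a != b)
            ),
            None,
        )
        if simplicial is None:
            return False  # no simplicial vertex left: a chordless cycle exists
        for n in adj[simplicial]:
            adj[n].discard(simplicial)
        del adj[simplicial]
    return True


# Hypothetical domain overlap graphs.
four_cycle = {
    "A": {"B", "D"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"A", "C"},
}
with_chord = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
}

print(is_chordal(four_cycle))  # False: A-B-C-D-A has no chord
print(is_chordal(with_chord))  # True: the edge A-C is a chord
```

This version favours brevity over speed. Faster approaches exist for large graphs.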

Helly Property

• A set S of sets Si has the Helly property if for every subset T of S the following hold: if the elements of T pairwise intersect, then the intersection of all elements of T is also non-empty.

A family $\{T_i \mid i \in I\}$ of subsets of a set $T$ is said to satisfy the Helly property if, for any collection of sets from this family, $\{T_j \mid j \in J \subseteq I\}$, we have $\bigcap_{j \in J} T_j \neq \emptyset$ whenever $T_j \cap T_k \neq \emptyset$ for all $j, k \in J$.
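The definition can be tested directly on small families of sets. The sketch below is a brute-force check written for illustration. It enumerates every subfamily, so it is exponential in the number of sets and only suitable for tiny examples. The function name and the example families are hypothetical.

```python
from itertools import combinations


def has_helly_property(sets):
    """True if every pairwise-intersecting subfamily has a common element."""
    family = [frozenset(s) for s in sets]
    for size in range(2, len(family) + 1):
        for group in combinations(family, size):
            pairwise = all(a & b for a, b in combinations(group, 2))
            if pairwise and not frozenset.intersection(*group):
                return False
    return True


# Three sets that pairwise intersect but share no common element.
triangle = [{1, 2}, {2, 3}, {1, 3}]
# Three sets that all contain the element 2.
star = [{1, 2}, {2, 3}, {2, 4}]

print(has_helly_property(triangle))  # False
print(has_helly_property(star))      # True
```

The first family is the smallest kind of counterexample: each pair overlaps, yet the three-way intersection is empty.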

Example

The picture on the left doesn’t satisfy the Helly property but the picture on the right does.

Theorem 2

• There exists a static Dollo parsimony tree for a set of multidomain proteins, iff the domain overlap graph for this set is chordal and satisfies the Helly property

Questions raised

• Is independent merging of the same pair of domains a rare event?

– Yes, for a vast majority of small and medium size superfamilies

– No, for large complex superfamilies

Second Question

• Do domain architectures persist through evolution?

– Yes, for a vast majority of small and medium size superfamilies

– No, for large complex superfamilies

Thank You!

• Questions?

Experimental Results

• Superfamily: Set of proteins sharing one particular domain

• Complex Superfamily: Superfamily sharing more than one domain with another superfamily.

• Restricting the dataset to complex superfamilies, the authors check the conservative Dollo parsimony (CDP), static Dollo parsimony (SDP) and perfect phylogeny (PP) criteria
