evolution of multidomain proteins cs 374 – lecture 10 wissam kazan


Post on 22-Dec-2015


Evolution of Multidomain Proteins

CS 374 – Lecture 10

Wissam Kazan


Reference Papers

• C. Chothia, J. Gough, C. Vogel, S. A. Teichmann, “Evolution of the Protein Repertoire”, www.sciencemag.com, Science VOL 300, 13 June 2003

• T. Przytycka, G. Davis, N. Song, D. Durand, “Graph Theoretical Insights into Evolution of Multidomain Proteins”, RECOMB 2005, LNBI 3500, pp. 311-325, 2005


Proteins

• Large Organic Compounds made of amino acids

• Fold into specific Structures, unique to each protein.

Three possible representations of the three-dimensional structure of the protein triose phosphate isomerase.


Protein Functions

• Chief Actors in the cell

• Proteins bind to other molecules specifically and tightly at the binding site

• Act as enzymes to catalyze chemical reactions

• Antibodies are proteins that bind to antigens and target them for destruction


Protein Domains

• Primary constituent of proteins

• A domain is a conserved evolutionary and structural unit:

– Assumed to fold independently

– Observed in different proteins in the context of different neighboring domains

– Whose coding sequence can be duplicated and/or undergo recombination


Domains – cont’d

• Small proteins contain just one domain

• Large proteins are formed by combination of domains

Cartoon representation of the protein Zif268 (blue), which contains three zinc finger domains, in complex with DNA (orange). The coordinating amino acid residues of the middle zinc ion (green) are highlighted.


Domain – cont’d (2)

• Often each domain has a separate function to perform for the protein

• Domain lengths typically range from 100 to 250 amino acid residues


Binding Domain


Domain Family

• A domain family is a collection of small proteins and/or parts of larger ones that descend from a common ancestor.


PR domain family members


Increase in Protein Repertoire

• The dominant mechanisms are:

– Duplication of sequences coding for one or more domains

– Divergence of duplicated sequences by mutations, deletions and insertions producing modified structures that may have useful new properties

– Recombination of genes that results in new arrangements of domains


Family relationships

• Difficult to detect distant relationships by direct comparisons of sequences

• Presence/Absence of domains and family relationships can be determined if the 3D structures are known

• We only know family relationships and domain structures of proteins of known structures or proteins homologous to proteins of known structures


C2 domain family


Domain Family Sizes

• In individual genomes, the number of members in the different families fits a Pareto distribution:

– Few families have many members

– Many families have few members

• It is mainly the result of selection for useful functions

– Some families have properties that lend themselves to a wide variety of molecular functions:

• P-loop nucleotide family has members functioning as kinases with different specificities, as different kinds of motor proteins


Analysis of Evolution

• About 50% of the sequences in the currently known genomes are homologous to proteins of known structure

• That half gives a detailed picture of protein evolution, which the following slides explain


SCOP DB

• Relationships of domains in proteins of known structure are described in the Structural Classification of Proteins (SCOP) database

http://scop.berkeley.edu/data/scop.b.html


Families and Species

• Vertebrates, ~750 different families, with 50 members per family on average

• Invertebrates, ~670 different families, with 20 members per family on average

• Yeast and bacteria with large genomes, ~550 different families, with 8 members per family on average

• Parasitic bacteria, ~220 different families, with 2 members per family on average


Protein Repertoire

• The larger domain families make up the bulk of the protein repertoire in each genome and are widely distributed across genomes

• 429 families occur in all of the 14 known eukaryote genomes:

– Members form 80% of domains in animals

– 90% of domains in fungi and plants


Contribution of common families to the protein repertoire


Domain Combinations

• Many proteins are formed by combinations of two or more domains

• Domains from some families appear together with domains from several other families

• Multidomain proteins constitute about 4/5 of eukaryote proteins and 2/3 of prokaryote proteins

• This phenomenon is called Domain Accretion


Known Combinations

• 1100 families of proteins of known structure in total

• 1100² = 1,210,000 different possible pairwise combinations.

• Only useful combinations will be present in genomes

• Studies showed that only 2500 pairwise combinations were found in 85 different genomes
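The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: it assumes, as the 1100² figure implies, that ordered pairs are counted and that a family may pair with itself. The variable names are illustrative.

```python
# Figures quoted on the slide
n_families = 1100        # families of proteins of known structure
observed_pairs = 2500    # pairwise combinations found in 85 genomes

# Ordered pairs with repetition allowed (assumption implied by 1100^2)
possible_pairs = n_families ** 2
fraction = observed_pairs / possible_pairs

print(possible_pairs)        # 1210000
print(f"{fraction:.2%}")     # 0.21%
```

Under these assumptions, only about two in every thousand conceivable pairings are actually observed.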


Combination Properties

• Few families have members present in many different combinations

• Many families combine with just one or two others.

• Power Law (Again!)

• Sequential order: a given pair of domains usually occurs in only one order (A-B or B-A, rarely both)


Supradomains

• Two-domain and Three-domain combinations recurring in different protein contexts with different partner domains

• Have a particular functional and spatial relationship

• Larger than individual domains


Supradomain


Metabolic Pathway Formation

• Proteins in a pathway do not function by themselves

• A metabolic pathway is a series of chemical reactions occurring within a cell, catalyzed by enzymes, resulting in the formation of a metabolic product to be used or stored by the cell.

Problem

How does the duplication, divergence and recombination process of the proteins fit into the formation or extension of pathways?


First Solution

• Substrates in pathways retain some similarities

• Enzymes evolve by gene duplications:

– Catalytic mechanisms change

– Some aspects of their recognition properties are retained


Second Solution

• Enzymes are recruited across pathways

• Duplicated enzymes conserve their catalytic functions while evolving different substrate specificities


Multidomain Protein Mystery

• Are new domains acquired infrequently, or often enough that the same combinations of domains will be repeated through independent events?


Multidomain Protein Mystery

• Once domain architectures are created, do they persist?

• If a domain is present in ancestral proteins, is it likely to be observed in current proteins?


Protein Family Analysis

• One traditional method:

– Tree modeling gene family evolution based on multiple sequence alignments

• Unclear how to build the model for families with heterogeneous domains

• Solution Proposed: Analyze a graph structure to study multidomain protein evolution


Parsimony Model

• Assume a phylogenetic tree, with each node described by a set of characters (one per domain)

• Focus on binary characters:

– 1: presence of a domain in the node

– 0: absence of a domain in the node

• Perfect Phylogeny: Each character state change occurs at most once

• State Change:

– 0 → 1: Gain

– 1 → 0: Loss
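To make the gain/loss bookkeeping concrete, the sketch below counts state changes on the edges of a small labeled tree. The tree, the node names, and the `count_changes` helper are all hypothetical and chosen only for illustration. Each node carries one 0/1 value per domain.

```python
def count_changes(parent, state):
    """Return (gains, losses) per character, summed over all tree edges.

    parent: maps each non-root node to its parent node.
    state:  maps each node to a tuple of 0/1 values, one per domain.
    """
    n_chars = len(next(iter(state.values())))
    gains = [0] * n_chars
    losses = [0] * n_chars
    for child, par in parent.items():
        for i, (before, after) in enumerate(zip(state[par], state[child])):
            if before == 0 and after == 1:
                gains[i] += 1      # 0 -> 1: gain
            elif before == 1 and after == 0:
                losses[i] += 1     # 1 -> 0: loss
    return gains, losses


# Hypothetical toy tree: root has children u and c; u has children a and b.
parent = {"u": "root", "c": "root", "a": "u", "b": "u"}
state = {
    "root": (1, 0),
    "u": (1, 1),
    "a": (1, 1),
    "b": (0, 1),
    "c": (1, 0),
}

gains, losses = count_changes(parent, state)
print(gains, losses)  # [0, 1] [1, 0]
```

Here the second domain is gained once (on the edge root to u) and the first domain is lost once (on the edge u to b).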


Dollo Parsimony

• A character may change state from zero to one *only* once, but from one to zero *multiple* times

• Appropriate for complex characters that are hard to gain but relatively easy to lose
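The Dollo condition can be tested on a fully labeled tree by counting gains per character. The following is a minimal sketch, assuming ancestral states are already assigned. The `is_dollo` helper and the example trees are hypothetical.

```python
def is_dollo(parent, state):
    """True if every character is gained (0 -> 1) on at most one edge.

    Losses (1 -> 0) are not restricted, matching the Dollo assumption.
    """
    n_chars = len(next(iter(state.values())))
    gains = [0] * n_chars
    for child, par in parent.items():
        for i in range(n_chars):
            if state[par][i] == 0 and state[child][i] == 1:
                gains[i] += 1
    return all(g <= 1 for g in gains)


# Hypothetical tree: root -> m; m -> x, y; x -> a, b; y -> c, d
parent = {"m": "root", "x": "m", "y": "m",
          "a": "x", "b": "x", "c": "y", "d": "y"}

# One gain (root -> m), then two independent losses (at a and c): allowed.
one_gain = {"root": (0,), "m": (1,), "x": (1,), "y": (1,),
            "a": (0,), "b": (1,), "c": (0,), "d": (1,)}

# The same domain gained separately on m -> x and m -> y: not allowed.
two_gains = {"root": (0,), "m": (0,), "x": (1,), "y": (1,),
             "a": (1,), "b": (1,), "c": (1,), "d": (1,)}

print(is_dollo(parent, one_gain))   # True
print(is_dollo(parent, two_gains))  # False
```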


Maximum Parsimony Example

• Unrelated to the work presented but useful to explain the concept of parsimony

• We want to find a model such that we minimize the total number of insertions and deletions

• Find a tree that requires the least number of evolutionary changes.


Example

• We have four sequences, described by three binary characters D1–D3:

D1  D2  D3
1   1   0
1   1   1
0   0   1
1   0   1

• Find the tree that can explain the observed sequences with a minimal number of substitutions


Try Different Trees

[Figure: three candidate trees annotated with per-branch change counts. Total costs: 3, 4, and 4]
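The three total costs can be reproduced with Fitch's small-parsimony algorithm, which is one standard way to score a fixed tree (the slides do not say which method was used). The sketch below scores the three possible groupings of the four example sequences. The labels s1 to s4 and the nested-tuple tree encoding are assumptions made for illustration.

```python
def _fitch_one(node, leaves, i):
    """Return (candidate state set, change count) for character i below node."""
    if isinstance(node, str):               # leaf: its observed state, no changes
        return {leaves[node][i]}, 0
    (lset, lcost), (rset, rcost) = (_fitch_one(child, leaves, i) for child in node)
    common = lset & rset
    if common:                              # children agree: no extra change
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1   # children disagree: one change


def fitch_cost(tree, leaves):
    """Minimum number of state changes needed to explain the leaves on this tree."""
    n_chars = len(next(iter(leaves.values())))
    return sum(_fitch_one(tree, leaves, i)[1] for i in range(n_chars))


# The four sequences from the example, in the order given (labels are hypothetical)
leaves = {"s1": (1, 1, 0), "s2": (1, 1, 1), "s3": (0, 0, 1), "s4": (1, 0, 1)}

trees = [
    (("s1", "s2"), ("s3", "s4")),
    (("s1", "s3"), ("s2", "s4")),
    (("s1", "s4"), ("s2", "s3")),
]
costs = [fitch_cost(t, leaves) for t in trees]
print(costs)  # [3, 4, 4]
```

The most parsimonious grouping pairs s1 with s2 and s3 with s4, at a cost of 3 changes. The other two groupings cost 4 each, which agrees with the totals listed on the slide.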


Domain Architectures

Phylogenetic tree of the protein tyrosine kinase family, constructed from a multiple sequence alignment (MSA) of the kinase domain

Note that the tree is not optimal with respect to a parsimony criterion minimizing the total number of insertions and deletions. For example, if architectures INSR and EGFR were siblings (the only two architectures containing the Furin-like cysteine rich and Receptor ligand binding domains), the number of insertions and deletions would be smaller.


Evolution of Multidomain Proteins

• Multidomain proteins are formed by:

– Gene Fusion

– Domain Shuffling

– Retrotransposition of Exons

• Represent those by:

– Domain Merge

– Domain Deletion


Domain Merge

• Any process that unites two or more previously separate domains in a single protein


Domain Deletion

• Any process in which a protein loses one or more domains


Protein Overlap Graph

• Vertices are proteins

• If two proteins share a domain, the two corresponding nodes are connected by an edge
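A protein overlap graph can be built directly from this definition. The sketch below assumes each protein is given as a set of domain names. The protein and domain labels are hypothetical.

```python
from itertools import combinations


def protein_overlap_graph(architectures):
    """Return the edge set of the protein overlap graph.

    architectures: maps a protein name to the set of domains it contains.
    Two proteins are joined by an edge when they share at least one domain.
    """
    edges = set()
    for p, q in combinations(sorted(architectures), 2):
        if architectures[p] & architectures[q]:
            edges.add((p, q))
    return edges


# Hypothetical architectures
architectures = {
    "P1": {"A", "B"},
    "P2": {"B", "C"},
    "P3": {"D"},
}
print(sorted(protein_overlap_graph(architectures)))  # [('P1', 'P2')]
```

P1 and P2 share domain B, so they are connected. P3 shares nothing and stays isolated.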


Domain Overlap Graph

• Vertices are protein domains

• Two domains are connected by an edge if there is a protein containing both domains
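The domain overlap graph is the dual construction: every protein contributes an edge between each pair of its domains. A minimal sketch, again with hypothetical protein and domain labels:

```python
from itertools import combinations


def domain_overlap_graph(architectures):
    """Return the edge set of the domain overlap graph.

    architectures: maps a protein name to the set of domains it contains.
    Two domains are joined by an edge when some protein contains both.
    """
    edges = set()
    for domains in architectures.values():
        for a, b in combinations(sorted(domains), 2):
            edges.add((a, b))
    return edges


# Hypothetical architectures
architectures = {
    "P1": {"A", "B", "C"},
    "P2": {"A", "D"},
}
print(sorted(domain_overlap_graph(architectures)))
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C')]
```

Note that each protein induces a clique on its own domains, which is why cliques play a central role in the later analysis of this graph.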


Domain Overlap Graph


Static Dollo Parsimony

• For any ancestral node, the set of characters in state one in this node is a subset of the set of characters in state one in some leaf node.

• Consistent with a history in which no ancestor contains a domain not seen in a leaf node
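For a single ancestral node, the static condition is a plain subset test against the leaves. The sketch below represents each node by the set of domains in state one. The helper name and the example sets are hypothetical.

```python
def satisfies_static(ancestor, leaves):
    """True if the ancestor's domain set is contained in some leaf's domain set."""
    return any(ancestor <= leaf for leaf in leaves)


# Hypothetical leaf architectures
leaves = [{"A", "B"}, {"B", "C"}, {"A", "C"}]

print(satisfies_static({"A", "B"}, leaves))       # True
print(satisfies_static({"A", "B", "C"}, leaves))  # False
```

The ancestor {A, B, C} fails because no single leaf contains all three domains, even though every one of them appears in some leaf.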


Conservative Dollo Parsimony

• For any ancestral node and any pair of characters that appear in state one in this node, there exists a leaf node where these two characters are also in state one

• Consistent with a history in which every instance of a domain pair came from a single merge event

• If domains acting in concert offer a selective advantage, it is unlikely that the pair once formed would later separate
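The conservative condition weakens the subset test to pairs of domains. A minimal sketch for one ancestral node, with nodes represented by their sets of present domains (the helper name and example data are hypothetical):

```python
from itertools import combinations


def satisfies_conservative(ancestor, leaves):
    """True if every pair of domains in the ancestor co-occurs in some leaf."""
    return all(
        any({a, b} <= leaf for leaf in leaves)
        for a, b in combinations(sorted(ancestor), 2)
    )


# Hypothetical leaf architectures
leaves = [{"A", "B"}, {"B", "C"}, {"A", "C"}]

print(satisfies_conservative({"A", "B", "C"}, leaves))  # True
print(satisfies_conservative({"A", "D"}, leaves))       # False
```

The ancestor {A, B, C} passes here because each of its three pairs occurs in some leaf. It would not pass the static condition, since no single leaf contains all three domains, which illustrates that the conservative criterion is the weaker of the two.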


Why all this?

• If we can show that, for a family, a conservative Dollo parsimony tree does not exist, then either:

– the Single Insertion Assumption is false, or

– the Conservative Assumption is too strong


Example

[Figure panels: Domain Overlap Graph; Dollo Parsimony]


Analyzing the Graph

1. Check whether the domain overlap graph is chordal

2. Check whether the domain overlap graph satisfies the Helly property

3. Conclude


Chordal Graph

• A Chord is any edge connecting two non-consecutive vertices of a cycle

• A Chordal Graph is a graph which does not contain chordless cycles of length greater than three

Chords
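One simple way to test chordality relies on a standard fact: a graph is chordal exactly when its vertices can be removed one at a time, each time choosing a simplicial vertex (one whose remaining neighbours are pairwise adjacent). The sketch below uses that greedy elimination. It is not necessarily the procedure used in the referenced paper, and faster linear-time algorithms exist. The adjacency-dict format is an assumption.

```python
def is_chordal(adj):
    """Chordality test by repeatedly removing a simplicial vertex.

    adj: maps each vertex to the set of its neighbours (undirected, no loops).
    """
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    while adj:
        # A vertex is simplicial if its neighbours form a clique.
        simplicial = next(
            (v for v, ns in adj.items()
             if all(b in adj[a] for a in ns for b in ns if a != b)),
            None,
        )
        if simplicial is None:
            return False  # a chordless cycle of length > 3 remains
        for n in adj.pop(simplicial):
            adj[n].discard(simplicial)
    return True


# A 4-cycle a-b-c-d-a has no chord, so it is not chordal.
square = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}

# Adding the chord a-c splits it into two triangles.
square_with_chord = {"a": {"b", "c", "d"}, "b": {"a", "c"},
                     "c": {"a", "b", "d"}, "d": {"a", "c"}}

print(is_chordal(square))             # False
print(is_chordal(square_with_chord))  # True
```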


Theorem 1

• There exists a conservative Dollo parsimony tree for a given set of multidomain architectures, iff the domain overlap graph for this set is chordal


Helly Property

• A set S of sets Si has the Helly property if for every subset T of S the following holds: if the elements of T pairwise intersect, then the intersection of all elements of T is also non-empty.

A family $\{T_i \mid i \in I\}$ of subsets of a set $T$ is said to satisfy the Helly property if, for any collection of sets from this family, $\{T_j \mid j \in J \subseteq I\}$, $\bigcap_{j \in J} T_j \neq \emptyset$ whenever $T_j \cap T_k \neq \emptyset$ for all $j, k \in J$.
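The definition can be checked by brute force on small families of sets. The sketch below enumerates every sub-collection, so it is exponential in the number of sets and only meant to illustrate the definition. The function name and the example families are hypothetical.

```python
from itertools import combinations


def has_helly_property(sets):
    """Brute-force check: every pairwise-intersecting sub-collection
    must have a non-empty common intersection."""
    sets = [frozenset(s) for s in sets]
    for size in range(2, len(sets) + 1):
        for group in combinations(sets, size):
            pairwise = all(a & b for a, b in combinations(group, 2))
            if pairwise and not frozenset.intersection(*group):
                return False
    return True


# Each pair overlaps, but no element lies in all three sets.
triangle = [{1, 2}, {2, 3}, {1, 3}]

# Each pair overlaps and element 2 lies in all three sets.
star = [{1, 2}, {2, 3}, {2, 4}]

print(has_helly_property(triangle))  # False
print(has_helly_property(star))      # True
```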


Example

The picture on the left doesn’t satisfy the Helly property, but the picture on the right does.


Theorem 2

• There exists a static Dollo parsimony tree for a set of multidomain proteins, iff the domain overlap graph for this set is chordal and satisfies the Helly property


Questions raised

• Is independent merging of the same pair of domains a rare event?

– Yes, for a vast majority of small and medium size superfamilies

– No, for large complex superfamilies


Second Question

• Do domain architectures persist through evolution?

– Yes, for a vast majority of small and medium size superfamilies

– No, for large complex superfamilies


Thank You!

• Questions?


Experimental Results

• Superfamily: Set of proteins sharing one particular domain

• Complex Superfamily: Superfamily sharing more than one domain with another superfamily.

• Restricting the dataset to complex superfamilies, the authors check the conservative Dollo parsimony (CDP), static Dollo parsimony (SDP) and perfect phylogeny (PP) criteria


Experimental Results