Post on 20-Dec-2015
CS262 Lecture 9, Win07, Batzoglou
Phylogeny Tree Reconstruction
[Figure: example phylogenetic trees over leaves 1 to 5]
Phylogenetic Trees
• Nodes: species
• Edges: time of independent evolution
• Edge length represents evolution time
AKA genetic distance
Not necessarily chronological time
Parsimony – direct method not using distances
• One of the most popular methods:
GIVEN multiple alignment
FIND tree & history of substitutions explaining alignment
Idea:
Find the tree that explains the observed sequences with a minimal number of substitutions
Two computational subproblems:
1. Find the parsimony cost of a given tree (easy)
2. Search through all tree topologies (hard)
Example: Parsimony cost of one column
Leaves: A B A A
Leaf sets: {A} {B} {A} {A}
Internal node sets: {A, B} (cost C += 1), {A}, {A}
Final cost C = 1
Parsimony Scoring
Given a tree, and an alignment column u
Label internal nodes to minimize the number of required substitutions
Initialization:
Set cost C = 0; node k = 2N – 1 (the root)
Iteration:
If k is a leaf, set $R_k = \{ x_k[u] \}$ // $R_k$ is simply the character of the kth species
If k is not a leaf,
Let i, j be the daughter nodes;
Set $R_k = R_i \cap R_j$ if the intersection is nonempty
Set $R_k = R_i \cup R_j$, and C += 1, if the intersection is empty
Termination:
Minimal cost of tree for column u = C
Example
[Figure: worked example of the scoring pass. Leaf labels shown: A A A B and B A B A; node sets shown: {A}, {B}, {A,B}]
Traceback to find ancestral nucleotides
Traceback:
1. Choose an arbitrary nucleotide from $R_{2N-1}$ for the root
2. Having chosen nucleotide r for parent k,
If $r \in R_i$, choose r for daughter i
Else, choose an arbitrary nucleotide from $R_i$
Easy to see that this traceback produces some assignment of cost C
Example
[Figure: tree with leaves A B A B, leaf sets {A} {B} {A} {B}, and internal sets {A, B}, {A, B}, {A}. Three internal labelings are shown (A A A; A B A; B B B), each with two substitutions marked x. Captions: "Admissible with Traceback" and "Still optimal, but inadmissible with Traceback"]
Multiple Sequence Alignments
Evolution at the DNA level
…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…
SEQUENCE EDITS: Mutation, Deletion
REARRANGEMENTS: Inversion, Translocation, Duplication
Protein Phylogenies
• Proteins evolve by both duplication and species divergence
Orthology and Paralogy
[Figure: gene tree with leaves HB Human, WB Worm, HA1 Human, HA2 Human, Yeast, WA Worm]
Orthologs: Derived by speciation
Paralogs: Everything else
Orthology, Paralogy, Inparalogs, Outparalogs
Definition
• Given N sequences x1, x2, …, xN: Insert gaps (-) in each sequence xi, such that
  • All sequences have the same length L
  • Score of the global map is maximum
• A faint similarity between two sequences becomes significant if present in many
• Multiple alignments reveal elements that are conserved among a class of organisms and therefore important in their common biology
• The patterns of conservation can help us infer the function of the element
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG
y: ACGC-GAG; z: GCCGC-GAG; z: GCCGCGAG
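The induced pairwise alignments in the example can be reproduced mechanically: project the multiple alignment onto two rows and drop the columns where both rows have a gap. The sketch below also sums a pairwise score over all induced pairs; the scoring values in `pair_score` (+1 match, -1 mismatch, -2 gap) are placeholders chosen for illustration, not values from the lecture.

```python
from itertools import combinations

def induced_pair(a, b, gap="-"):
    """Project two rows of a multiple alignment, dropping gap-gap columns."""
    cols = [(p, q) for p, q in zip(a, b) if not (p == gap and q == gap)]
    return "".join(p for p, _ in cols), "".join(q for _, q in cols)

def pair_score(a, b, match=1, mismatch=-1, gap_pen=-2, gap="-"):
    """Hypothetical linear score of a pairwise alignment (placeholder values)."""
    score = 0
    for p, q in zip(a, b):
        if p == gap or q == gap:
            score += gap_pen
        elif p == q:
            score += match
        else:
            score += mismatch
    return score

def sum_of_pairs(rows):
    """Sum the scores of all induced pairwise alignments."""
    return sum(pair_score(*induced_pair(a, b)) for a, b in combinations(rows, 2))

x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pair(x, y))
print(induced_pair(y, z))
print(sum_of_pairs([x, y, z]))
```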
Sum Of Pairs (cont’d)
• Heuristic way to incorporate evolution tree:
[Figure: tree over Human, Mouse, Chicken, Duck]
• Weighted SOP:
$S(m) = \sum_{k<l} w_{kl} \, s(m^k, m^l)$
A Profile Representation
• Given a multiple alignment M = m1…mn
Replace each column mi with profile entry pi
• Frequency of each letter in the alphabet
• # gaps
• Optional: # gap openings, extensions, closings
Can think of this as a “likelihood” of each letter in each position
Alignment:
- A G G C T A T C A C C T G
T A G - C T A C C A - - - G
C A G - C T A C C A - - - G
C A G - C T A T C A C - G G
C A G - C T A T C G C - G G
Profile (frequency of each symbol per column; empty entries written as 0):
    1   2   3   4   5   6   7   8   9   10  11  12  13  14
A   0   1   0   0   0   0   1   0   0   .8  0   0   0   0
C   .6  0   0   0   1   0   0   .4  1   0   .6  .2  0   0
G   0   0   1   .2  0   0   0   0   0   .2  0   0   .4  1
T   .2  0   0   0   0   1   0   .6  0   0   0   0   .2  0
-   .2  0   0   .8  0   0   0   0   0   0   .4  .8  .4  0
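A profile of this kind can be computed directly from the alignment rows. The sketch below is illustrative only: the function name `profile` and the dictionary-per-column layout are assumptions, and the gap symbol is counted like any other letter.

```python
from collections import Counter

def profile(rows, alphabet="ACGT-"):
    """Per-column frequency of each symbol, gaps included."""
    n = len(rows)
    columns = []
    for column in zip(*rows):
        counts = Counter(column)
        columns.append({a: counts[a] / n for a in alphabet})
    return columns

rows = ["-AGGCTATCACCTG",
        "TAG-CTACCA---G",
        "CAG-CTACCA---G",
        "CAG-CTATCAC-GG",
        "CAG-CTATCGC-GG"]

p = profile(rows)
print(p[0])   # first column: mix of C, T and gap
print(p[9])   # tenth column: mostly A, some G
```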
Multiple Sequence Alignments
Algorithms
Multidimensional DP
Generalization of Needleman-Wunsch:
$S(m) = \sum_i S(m_i)$
(sum of column scores)
$F(i_1, i_2, \dots, i_N)$: Optimal alignment up to $(i_1, \dots, i_N)$
$F(i_1, i_2, \dots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$
Multidimensional DP
• Example: in 3D (three sequences):
• 7 neighbors/cell
F(i,j,k) = max{ F(i – 1, j – 1, k – 1) + S(xi, xj, xk),
F(i – 1, j – 1, k ) + S(xi, xj, - ),
F(i – 1, j , k – 1) + S(xi, -, xk),
F(i – 1, j , k ) + S(xi, -, - ),
F(i , j – 1, k – 1) + S( -, xj, xk),
F(i , j – 1, k ) + S( -, xj, - ),
F(i , j , k – 1) + S( -, -, xk) }
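The three-sequence recurrence can be written as a short, unoptimized sketch. The column score `sp_column` is an assumed sum-of-pairs score with placeholder values (+1 match, -1 mismatch, -2 letter against gap, 0 for gap against gap); the lecture leaves S unspecified.

```python
from itertools import product

def sp_column(a, b, c, match=1, mismatch=-1, gap_pen=-2):
    """Hypothetical sum-of-pairs score of one column; gap-gap pairs score 0."""
    total = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue
        total += gap_pen if "-" in (p, q) else (match if p == q else mismatch)
    return total

def align3_score(x, y, z):
    """Optimal score of a global alignment of three sequences."""
    # The 7 neighbors: every 0/1 combination except (0, 0, 0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    cells = product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1))
    for i, j, k in cells:                # lexicographic order: predecessors first
        if (i, j, k) == (0, 0, 0):
            continue
        best = float("-inf")
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if min(pi, pj, pk) < 0:
                continue
            column = (x[pi] if di else "-",
                      y[pj] if dj else "-",
                      z[pk] if dk else "-")
            best = max(best, F[pi, pj, pk] + sp_column(*column))
        F[i, j, k] = best
    return F[len(x), len(y), len(z)]

print(align3_score("GAAC", "GAC", "GTAC"))
```

Only the score is returned here; recovering the alignment itself would need back-pointers, as in the pairwise case.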
Multidimensional DP
Running Time:
1. Size of matrix: $L^N$;
Where L = length of each sequence
N = number of sequences
2. Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$
• How do gap states generalize?
• VERY badly! Require $2^N - 1$ states, one per combination of gapped/ungapped sequences
Running time: $O(2^N \cdot 2^N \cdot L^N) = O(4^N L^N)$
[Figure: the 7 states for three sequences: X, Y, Z, XY, XZ, YZ, XYZ]
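To make the counts concrete, the following sketch enumerates the $2^N - 1$ moves (equivalently, the gapped/ungapped combinations) and multiplies out the rough operation counts. The function names are made up for this example and the counts ignore constant factors.

```python
from itertools import product

def dp_moves(n):
    """Predecessor offsets of a cell: each sequence advances (1) or is gapped (0),
    excluding the all-gap combination."""
    return [m for m in product((0, 1), repeat=n) if any(m)]

def cost_estimate(n, length):
    """Rough operation counts: (plain DP, DP with one state per gap combination)."""
    cells = length ** n
    moves = len(dp_moves(n))             # 2^n - 1
    return cells * moves, cells * moves * moves

print(dp_moves(2))
print(len(dp_moves(3)))
print(cost_estimate(3, 100))
```

Even for three sequences of length 100 the second count is in the tens of millions, which is why the exact method is not used for more than a handful of sequences.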
Progressive Alignment
• When evolutionary tree is known:
Align closest first, in the order of the tree
In each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presult
Weighted version:
Tree edges have weights, proportional to the divergence in that edge
New profile is a weighted average of two old profiles
[Figure: guide tree over x, y, z, w with internal profiles pxy, pzw, pxyzw]
Example
Profile: (A, C, G, T, -)
px = (0.8, 0.2, 0, 0, 0)
py = (0.6, 0, 0, 0, 0.4)
s(px, py) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)
Result: pxy = (0.7, 0.1, 0, 0, 0.2)
s(px, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)
Result: px- = (0.4, 0.1, 0, 0, 0.5)
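The example's arithmetic can be checked with a small sketch. The letter score `s` below uses placeholder values (+1 match, -1 mismatch, -2 against a gap), since the lecture does not fix them; `profile_score` and `merge` are names invented for this illustration, and `merge` defaults to the unweighted average used in the example.

```python
ALPHABET = "ACGT-"

def s(a, b):
    """Hypothetical letter score: +1 match, -1 mismatch, -2 against a gap."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1

def profile_score(px, py):
    """Expected score of aligning two profile columns."""
    return sum(px[i] * py[j] * s(a, b)
               for i, a in enumerate(ALPHABET)
               for j, b in enumerate(ALPHABET))

def merge(px, py, wx=0.5, wy=0.5):
    """Weighted average of two profile columns."""
    return tuple(wx * a + wy * b for a, b in zip(px, py))

px = (0.8, 0.2, 0, 0, 0)
py = (0.6, 0, 0, 0, 0.4)
gap = (0, 0, 0, 0, 1)       # a column that is all gaps

print(profile_score(px, py), merge(px, py))
print(profile_score(px, gap), merge(px, gap))
```

For the weighted version described on the slide, `wx` and `wy` would be derived from the tree edge weights instead of being fixed at 0.5.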
Progressive Alignment
• When evolutionary tree is unknown:
Perform all pairwise alignments
Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment
Construct a tree (UPGMA / Neighbor Joining / Other methods)
Align on the tree
[Figure: sequences x, y, z, w with an unknown tree]
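One of the tree-building options named above, UPGMA, can be sketched as follows. The input format (a dictionary of pairwise distances keyed by name pairs) and the distance values in the usage lines are invented for illustration; in practice D(x, y) would come from the pairwise alignments.

```python
from itertools import combinations

def upgma(names, dist):
    """Build a guide tree as nested pairs by repeatedly merging the two
    closest clusters; cluster distances are size-weighted averages."""
    size = {name: 1 for name in names}                 # cluster -> number of leaves
    d = {frozenset(pair): v for pair, v in dist.items()}
    while len(size) > 1:
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        for c in size:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))

# Made-up distances: x is close to y, z is close to w.
D = {("x", "y"): 2, ("z", "w"): 2,
     ("x", "z"): 6, ("x", "w"): 6, ("y", "z"): 6, ("y", "w"): 6}
print(upgma(["x", "y", "z", "w"], D))
```

The nested pairs give the order in which a progressive aligner would merge sequences and profiles: innermost pairs first.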
Heuristics to improve alignments
• Iterative refinement schemes
• A*-based search
• Consistency
• Simulated Annealing
• …
Iterative Refinement
One problem of progressive alignment:
• Initial alignments are “frozen” even when new evidence comes
Example:
x: GAAGTT
y: GAC-TT   (frozen!)
z: GAACTG
w: GTACTG
Now it is clear that the correct alignment is y = GA-CTT
Iterative Refinement
Algorithm (Barton-Sternberg):
1. For j = 1 to N, remove xj, and realign it to x1…xj-1xj+1…xN
2. Repeat step 1 until convergence
[Figure: x and z kept as a fixed projection while y is allowed to vary]
Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA
After realigning y:
x: GAAGTTA
y: G-ACTTA   (+ 3 matches)
z: GAACTGA
w: GTACTGA
Iterative Refinement
Example not handled well:
x: GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA
z: GAACTGA
w: GTACTGA
Realigning any single yi changes nothing
Consistency
[Figure: sequences x, y, z with positions xi, yj, yj’, and zk]
Consistency
Basic method for applying consistency
• Compute all pairs of alignments xy, xz, yz, …
• When aligning x, y during progressive alignment,
For each (xi, yj), let s(xi, yj) = function_of(xi, yj, axz, ayz)
Align x and y with DP using the modified s(.,.) function
Real-world protein aligners
• MUSCLE
High throughput
One of the best in accuracy
• ProbCons
High accuracy
Reasonable speed
MUSCLE at a glance
1. Fast measurement of all pairwise distances between sequences
• $D_{DRAFT}(x, y)$ defined in terms of # common k-mers (k~3) – $O(N^2 L \log L)$ time
2. Build tree $T_{DRAFT}$ based on those distances, with UPGMA
3. Progressive alignment over $T_{DRAFT}$, resulting in multiple alignment $M_{DRAFT}$
4. Measure new Kimura-based distances D(x, y) based on $M_{DRAFT}$
5. Build tree T based on D
6. Progressive alignment over T, to build M
• Only perform alignment steps for the parts of the tree that have changed
7. Iterative refinement; for many rounds, do:
• Tree Partitioning: Split M on one branch and realign the two resulting profiles
• If new alignment M’ has better sum-of-pairs score than previous one, accept
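Step 1 relies on a distance that needs no alignment at all. The sketch below shows one plausible way to turn shared k-mer counts into a distance; the exact formula MUSCLE uses is not given in these slides, so `kmer_distance` here is a hypothetical stand-in, not MUSCLE's definition.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Multiset of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(x, y, k=3):
    """Assumed distance: 1 - (shared k-mers / k-mers in the shorter sequence)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum((cx & cy).values())      # multiset intersection
    possible = min(len(x), len(y)) - k + 1
    if possible <= 0:
        return 1.0
    return 1.0 - shared / possible

print(kmer_distance("GAACTG", "GAACTG"))
print(kmer_distance("GAACTG", "GTACTG"))
print(kmer_distance("AAAA", "CCCC"))
```

Because no dynamic programming is involved, all pairwise distances for the draft tree can be computed quickly, which is the point of this first pass.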
PROBCONS at a glance
1. Computation of all posterior matrices $M_{xy}$: $M_{xy}(i, j) = \mathrm{Prob}(x_i \sim y_j)$, using an HMM
2. Re-estimation of posterior matrices $M'_{xy}$ with probabilistic consistency
• $M'_{xy}(i, j) = \frac{1}{N} \sum_{\text{sequence } z} \sum_k M_{xz}(i, k) \, M_{yz}(j, k)$; $M'_{xy} = \mathrm{Avg}_z(M_{xz} M_{zy})$
3. Compute for every pair x, y, the maximum expected accuracy alignment
• $A_{xy}$: alignment that maximizes $\sum_{\text{aligned } (i, j) \text{ in } A} M'_{xy}(i, j)$
• Define $E(x, y) = \sum_{\text{aligned } (i, j) \text{ in } A_{xy}} M'_{xy}(i, j)$
4. Build tree T with hierarchical clustering using similarity measure E(x, y)
5. Progressive alignment on T to maximize E(.,.)
6. Iterative refinement; for many rounds, do:
• Randomized Partitioning: Split sequences in M in two subsets by flipping a coin for each sequence and realign the two resulting profiles
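The consistency re-estimation in step 2 is an average of matrix products, which the following pure-Python sketch spells out. Two assumptions are made here and are not stated on the slide: the sum over z includes x and y themselves, with $M_{xx}$ taken to be the identity, and a missing matrix $M_{ab}$ is obtained as the transpose of $M_{ba}$. All names are invented for this example.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def posterior(M, lengths, a, b):
    """Look up M_ab; identity for a == b, transpose of M_ba if only that is stored."""
    if a == b:
        n = lengths[a]
        return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    if (a, b) in M:
        return M[(a, b)]
    return [list(col) for col in zip(*M[(b, a)])]

def consistency(M, lengths, x, y):
    """M'_xy = average over all sequences z of M_xz * M_zy."""
    total = [[0.0] * lengths[y] for _ in range(lengths[x])]
    for z in lengths:
        prod = matmul(posterior(M, lengths, x, z), posterior(M, lengths, z, y))
        for i, row in enumerate(prod):
            for j, v in enumerate(row):
                total[i][j] += v
    return [[v / len(lengths) for v in row] for row in total]

# Toy input: three one-letter sequences; x and y each align confidently to z.
lengths = {"x": 1, "y": 1, "z": 1}
M = {("x", "y"): [[0.2]], ("x", "z"): [[1.0]], ("y", "z"): [[1.0]]}
print(consistency(M, lengths, "x", "y"))
```

In the toy input the direct evidence for aligning x with y is weak, but both align strongly to z, so the re-estimated value rises above the original 0.2.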
Some Resources
Genome Resources
Annotation and alignment genome browser at UCSC: http://genome.ucsc.edu/cgi-bin/hgGateway
Specialized VISTA alignment browser at LBNL: http://pipeline.lbl.gov/cgi-bin/gateway2
ABC, a nice Stanford tool for browsing alignments: http://encode.stanford.edu/~asimenos/ABC/
Protein Multiple Aligners
CLUSTALW (most widely used): http://www.ebi.ac.uk/clustalw/
MUSCLE (most scalable): http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
PROBCONS (most accurate): http://probcons.stanford.edu/