sequence analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • t-matrix • least ... 1 2...

23
Sequence Analysis '18 lecture 12 T-matrix Least squares Maximum likelihood Bootstrapping

Upload: others

Post on 16-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

Sequence Analysis '18lecture 12

• T-matrix

• Least squares

• Maximum likelihood

• Bootstrapping

Page 2: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

The T-matrix

b1A

B

C

Db2

b3

b4

b5

Tb = ddAB

dAC

dAD

dBC

dBD

dCD

=

1 1 0 0 01 0 1 0 11 0 0 1 10 1 1 0 10 1 0 1 10 0 1 1 0

b1

b2

b3

b4

b5

The T matrix expresses the branches needed to sum the tree distance. Branch distances b are found by squaring and inverting the T matrix and multiplying by the sequence

distances d. New tree distances Δ are found by summing b, using T.

dAB = b1 + b2 dAC = b1 + b3 + b5

etc

Page 3: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

Least squares branch lengths using the T-matrix

S = Σwij( dij - Δij)2i>j

b1A

B

C

Db2

b3

b4

b5

(Tb = Δ) = d

dAB

dAC

dAD

dBC

dBD

dCD

=

1 1 0 0 01 0 1 0 11 0 0 1 10 1 1 0 10 1 0 1 10 0 1 1 0

b1

b2

b3

b4

b5

b = (TTT)-1(TTd)Solving for b we get:

empirical distances patristic distances

We want to minimize S

taxa

Page 4: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

Least squares step 1: squaring the T-matrix

b1A

B

C

Db2

b3

b4

b5

Tb = d

=

1 1 0 0 01 0 1 0 11 0 0 1 10 1 1 0 10 1 0 1 10 0 1 1 0

b1

b2

b3

b4

b5

b = (TTT)-1(TTd)

357463

=

1 1 0 0 01 0 1 0 11 0 0 1 10 1 1 0 10 1 0 1 10 0 1 1 0

=

3 1 1 1 21 3 1 1 21 1 3 1 21 1 1 3 22 2 2 2 4

(TTT)

dAB

dAC

dAD

dBC

dBD

dCD

1 1 1 0 0 01 0 0 1 1 00 1 0 1 0 10 0 1 0 1 1 0 1 1 1 1 0

T-matrix • Branches = Distances

Solving for branches:TT T

Page 5: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

b = (TTT)-1(TTd)=

TTT =

3 1 1 1 21 3 1 1 21 1 3 1 21 1 1 3 22 2 2 2 3

(TTT)-1 = 1/2 0 0 0 -1/40 1/2 0 0 -1/40 0 1/2 0 -1/40 0 0 1/2 -1/4

-1/4 -1/4 -1/4 -1/4 3/4

357463

(TTd)= =

1513121622

1/2 0 0 0 -1/40 1/2 0 0 -1/40 0 1/2 0 -1/40 0 0 1/2 -1/4

-1/4 -1/4 -1/4 -1/4 3/4

1513121622

=2.01.00.52.52.5

*To get the inverse, I used this server http://wims.unice.fr/wims/en_home.html

*

Least squares step 2: inverting, solving for b

1 1 1 0 0 01 0 0 1 1 00 1 0 1 0 10 0 1 0 1 1 0 1 1 1 1 0

Page 6: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

b = (TTT)-1(TTd)=

b1

b2

b3

b4

b5

2.0A

B

C

D1.0

0.5

2.5

2.5

ΔAB

ΔAC

ΔAD

ΔBC

ΔBD

ΔCD

357463

=

dAB

dAC

dAD

dBC

dBD

dCD

3.05.07.04.06.03.0

1 1 0 0 01 0 1 0 11 0 0 1 10 1 1 0 10 1 0 1 10 0 1 1 0

= =

Sequence distances Least squares patristic distances

S = Σ( dij - Δij)2 = 0.0i>j

2.01.00.52.52.5

2.01.00.52.52.5

How similar are sequence distances and least squares distances?

Page 7: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

How good are least squares distances for the wrong tree

b1A B

C Db2

b3

b4

b5

b = (TTT)-1(TTd) =1/2 0 0 0 -1/40 1/2 0 0 -1/40 0 1/2 0 -1/40 0 0 1/2 -1/4

-1/4 -1/4 -1/4 -1/4 3/4

dAC

dAB

dAD

dBC

dCD

dBD

=

1 1 0 0 01 0 1 0 11 0 0 1 10 1 1 0 10 1 0 1 10 0 1 1 0

b1

b2

b3

b4

b5

=

537436

537436

= 1716131215

4.754.252.752.25-3.25

=

Least-squares produces a negative distance for b5

S = Σ( dij - Δij)2 = 58.5i>j

1/2 0 0 0 -1/40 1/2 0 0 -1/40 0 1/2 0 -1/40 0 0 1/2 -1/4

-1/4 -1/4 -1/4 -1/4 3/4

ΔAC

ΔAB

ΔAD

ΔBC

ΔCD

ΔBD

9.07.57.07.06.55.0

1 1 0 0 01 0 1 0 11 0 0 1 10 1 1 0 10 1 0 1 10 0 1 1 0

= = 4.754.252.752.250.0

set negative distance to zerovery bad

1 1 1 0 0 01 0 0 1 1 00 1 0 1 0 10 0 1 0 1 1 0 1 1 1 1 0

Page 8: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

Remember Max Unweighted Parsimony?

A ATGGCTATTCTTATAGTACG B ATGGCTAGTCTTATATTACA C TTCACTAGACCTGTGGTCCA D TTGACCAGACCTGTGGTCCG E TTGACCAGTTCTCTAGTTCG TOTALS

1 0

0

0

0

2

1

1

....|....0....|....0

1

1

1

1

all trees

...

Page 9: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

Maximum likelihood

Given a tree with branch lengths and MSA, sum the probabilities of the branches over all possible ancestor bases (or amino acids).

Probabilities may be J-C or Kimura for nucleotides, or PAM or BLOSUM for proteins.

May be used in combination with NNI, SPR, TBR to evaluate tree topology, with or without branch lengths (if no branch lengths, assume all branch length are equal).

May be used to predict ancestral sequences.

A C

A C? ?

Sum the total likelihood of all branches over all

possible bases at two ? positions.

is maximum weighted parsimony.

Page 10: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

Scoring quartets using Parsimony

Find quartet AHJP in this tree.

Score each tree for AHJP

J

H

P

A

AB

CD E F G H I J K L

M

NP

Q

RS

T

J

H

P

A

J

H P

A

A TTCJ TTTH GTCP GGT

taxon 1 2 3

A T T C

J T T TH G T CP G G T mutations *Prob

x T T C 4y G T Cx G T C

5y G T Tx T T C 4y T T T

1

2

3

x y

y

y

x

x

1

2

3

sequence position

*Maximum likelihood is parsimony taken to the next level. Instead of picking the most parsimonious ancestor sequence, we sum the probabilities of all options. We choose the tree that maximizes this sum.

Given this sequence data which quartet is most likely?

Jukes-Cantor P(mutation)

Page 11: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

Maximum likelihood

L = Σ P( distr_x, distr_y)all branches x,y

J

H

P

Ax y

Max Likelihood Algorithm:For all candidate trees:

Iterate until convergenceFor each ancestor node x

Calculate new distr_xcheck for convergence

Calculate L (see above)

ML = max(L)all trees

Page 12: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

Quartet puzzling. Tree based on ML quartets

1. Identify all choose-4 subsets of MSA => “quartets”2. For each quartet, find the best 4-taxa tree (3 possibilities) using least-

squares, Fitch-Margoliash, Parsimony, or Maximum Likelihood.3. Start with a randomly selected quartet, ((A,B),(C,D))4. Choose one new taxon, E.5. For every branch, try adding the E to the branch.

1. For all quartets containing E and three of the other taxa, ask whether the split is correct. If not, add 1 to the penalty.

6. Add E to the branch with the lowest penalty.7. Continue at step 4 with a new taxon.

b1A

B

C

Db2

b3

b4

b5

E

penalty for adding E to branch

quartet b1 b2 b3 b4 b5

AB,CD - - - - -

AE,BC 0 1 1 1 1

AE,BD 0 1 1 1 1

AE,CD 0 0 1 1 0

BE,CD 0 0 1 1 0

QP can be repeated, starting with different quartet, to generate any number of trees.

true tree

Page 13: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

Expert tree strategy• Get sequence distances.• QP with ML -- Use ML to choose topology for

each quartet. Use QP to build trees from quartets.• NNI, SPR, TBR with L-S, F-M -- Modify the

tree, calculate branch lengths, calculate patristic distances, calculate difference distance score S.

• Consensus tree -- if more than one tree has same best score, merge branches.

• Confidence — For each branch, assign a “bootstrap” value.

Page 14: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

“Boot strap analysis”• A method to validate the significance of a

phylogenetic tree, branchpoint by branchpoint. • Requires a means to generate independent

trees. (For example using distances generated from different regions of the mitochondrial genome or from random subsets of sequences or random subsets of columns.)

• Choose the representative tree as the ‘parent’. Calculate the following:

For each branchpoint in the parent tree, For each tree, ask

Is there a branchpoint having the same subclade contents (i.e. same taxa, any order)

Bootstrap value = number of trees having the branchpoint / total trees.

Page 15: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

Comparing branchpoints

A B C D E E DBCA AB C D E E DCAB

B A C D EE DABBB A A EE CCC DD

bootstrap value= P((A,B),C) = 5/8For each branchpoint in the parent tree, For each tree, ask Is there a branchpoint having the same subclade contents (i.e. same taxa, any order) Bootstrap value = number of trees having the branchpoint / total trees.

Page 16: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

0.625

Comparing branchpoints

A B C D E E DBCA AB C D E E DCAB

B A C D EE DABBB A A EE CCC DD

bootstrap value = P((A,B,C),(D,E))=6/8

Page 17: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

Bootstrap values for this data

A B C D E E DBCA AB C D E E DCAB

B A C D EE DABBB A A EE CCC DD

.75

.75

.625

.875

Page 18: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

Compare lineages instead of branchpoints.

A B C D E E DBCA AB C D E E DCAB

B A C D EE DABBB A A EE CCC DD

= P(A,B) = 6/8

For unrooted trees or rooted trees. •Treat any lineage as the root.•Ask how often the root branching is conserved.

Page 19: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

!19

Review• What are patristic distances? • How can you tell if a tree is non-additive? What does non-

additivity imply? • How do you compare two trees that have different

topology? • How do you score a set of distances against a tree? • Assuming an initial tree, what ways exist for exploring other

tree topologies? • Describe nearest neighbor interchange. • What kind of program is quartet puzzling? • ... Fitch-Margoliash? • Compare and contrast three ways to get branch lengths

from distances: UPGMA, Fitch-Margoliash, Least-squares.

Page 20: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

A

B

CD

E

F

G

H

J

K

JG

D

A

E

B

H

K

FC

1.: What is the split distance?Exercise 9

2.: Draw these quartets, both trees.ACDKJFKHGABC

Page 21: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

3: Write the T-matrix

b1

b2

b3 b4

b5

b6 b7A

B C

D

Eb1b2b3b4b5

b6

b7

dAB

dAC

dAD

dAE

dBC

dBD

dBE

dCD

dCE

dDE

=

4: Square the T-matrix

E

=

Page 22: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

5: Invert the squared T-matrix

6: Convert distances to TTD

=

-1

39

111210121312137

=

TT

D

TTD

TTT-1

Exercise 10 (continuation of Exercise 9)

Page 23: Sequence Analysis '18 lecture 12 · 2018. 10. 8. · lecture 12 • T-matrix • Least ... 1 2 3 sequence position *Maximum likelihood is parsimony taken to the next level. Instead

!23

7: Solve for branches b.

TTDTTT-1

b1

b2

b3

b4

b5

b6

b7

=

8: Draw unrooted phylogram.

9: Write Newick format, with distances.