new multiple sequence alignment methods

73
New Multiple Sequence Alignment Methods Tandy Warnow The Department of Computer Science The University of Texas at Austin

Upload: arty

Post on 16-Feb-2016

58 views

Category:

Documents


0 download

DESCRIPTION

New Multiple Sequence Alignment Methods. Tandy Warnow The Department of Computer Science The University of Texas at Austin. The “Tree of Life”. Avian Phylogenomics Project. Erich Jarvis, HHMI . G Zhang, BGI. MTP Gilbert, Copenhagen. T. Warnow UT-Austin. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: New Multiple Sequence Alignment Methods

New Multiple Sequence Alignment Methods

Tandy WarnowThe Department of Computer Science

The University of Texas at Austin

Page 2: New Multiple Sequence Alignment Methods

The “Tree of Life”

Page 3: New Multiple Sequence Alignment Methods

Avian Phylogenomics ProjectG Zhang, BGI

• Approx. 50 species, whole genomes• 8000+ genes, UCEs• Gene sequence alignments and trees computed using SATé

MTP Gilbert,Copenhagen

S. Mirarab Md. S. Bayzid UT-Austin UT-Austin

T. WarnowUT-Austin

Plus many many other people…

Erich Jarvis,HHMI

Challenges: Maximum likelihood tree estimation on multi-million-site

sequence alignmentsMassive gene tree incongruence

Page 4: New Multiple Sequence Alignment Methods

1kp: Thousand Transcriptome Project

Plant Tree of Life based on transcriptomes of ~1200 species More than 13,000 gene families (most not single copy)Gene Tree Incongruence

G. Ka-Shu WongU Alberta

N. WickettNorthwestern

J. Leebens-MackU Georgia

N. MatasciiPlant

T. Warnow, S. Mirarab, N. Nguyen, Md. S.BayzidUT-Austin UT-Austin UT-Austin UT-Austin

Challenge: Alignment of datasets with > 100,000 sequencesGene tree incongruence

Plus many many other people…

Page 5: New Multiple Sequence Alignment Methods

Multiple Sequence Alignment (MSA): another grand challenge1

S1 = -AGGCTATCACCTGACCTCCAS2 = TAG-CTATCAC--GACCGC--S3 = TAG-CT-------GACCGC--…Sn = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGC …Sn = TCACGACCGACA

Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation

1 Frontiers in Massive Data Analysis, National Academies Press, 2013

Page 6: New Multiple Sequence Alignment Methods

DNA Sequence Evolution

AAGACTT

TGGACTTAAGGCCT

-3 mil yrs

-2 mil yrs

-1 mil yrs

today

AGGGCAT TAGCCCT AGCACTT

AAGGCCT TGGACTT

TAGCCCA TAGACTT AGCGCTTAGCACAAAGGGCAT

AGGGCAT TAGCCCT AGCACTT

AAGACTT

TGGACTTAAGGCCT

AGGGCAT TAGCCCT AGCACTT

AAGGCCT TGGACTT

AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT

Page 7: New Multiple Sequence Alignment Methods

Phylogeny Problem

TAGCCCA TAGACTT TGCACAA TGCGCTTAGGGCAT

U V W X Y

U

V W

X

Y

Page 8: New Multiple Sequence Alignment Methods

AGAT TAGACTT TGCACAA TGCGCTTAGGGCATGA

U V W X Y

U

V W

X

Y

The “real” problem

Page 9: New Multiple Sequence Alignment Methods

…ACGGTGCAGTTACCA…

MutationDeletion

…ACCAGTCACCA…

Indels (insertions and deletions)

Page 10: New Multiple Sequence Alignment Methods

…ACGGTGCAGTTACC-A……AC----CAGTCACCTA…

• The true multiple alignment – Reflects historical substitution, insertion, and deletion events– Defined using transitive closure of pairwise alignments computed on

edges of the true tree

…ACGGTGCAGTTACCA…

SubstitutionDeletion

…ACCAGTCACCTA…

Insertion

Page 11: New Multiple Sequence Alignment Methods

Input: unaligned sequences

S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA

Page 12: New Multiple Sequence Alignment Methods

Phase 1: Alignment

S1 = -AGGCTATCACCTGACCTCCAS2 = TAG-CTATCAC--GACCGC--S3 = TAG-CT-------GACCGC--S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA

Page 13: New Multiple Sequence Alignment Methods

Phase 2: Construct tree

S1 = -AGGCTATCACCTGACCTCCAS2 = TAG-CTATCAC--GACCGC--S3 = TAG-CT-------GACCGC--S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA

S1

S4

S2

S3

Page 14: New Multiple Sequence Alignment Methods

Simulation Studies

S1 S2

S3S4

S1 = -AGGCTATCACCTGACCTCCA

S2 = TAG-CTATCAC--GACCGC--S3 = TAG-CT-------GACCGC--S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA

S1 = -AGGCTATCACCTGACCTCCA

S2 = TAG-CTATCAC--GACCGC--S3 = TAG-C--T-----GACCGC--S4 = T---C-A-CGACCGA----CA

Compare

True tree and alignment

S1 S4

S3S2

Estimated tree and alignment

Unaligned Sequences

Page 15: New Multiple Sequence Alignment Methods

Quantifying Error

FN: false negative (missing edge)FP: false positive (incorrect edge)

50% error rate

FN

FP

Page 16: New Multiple Sequence Alignment Methods

Two-phase estimationAlignment methods• Clustal• POY (and POY*)• Probcons (and Probtree)• Probalign• MAFFT• Muscle• Di-align• T-Coffee • Prank (PNAS 2005, Science 2008)• Opal (ISMB and Bioinf. 2007)• FSA (PLoS Comp. Bio. 2009)• Infernal (Bioinf. 2009)• Etc.

Phylogeny methods• Bayesian MCMC • Maximum

parsimony • Maximum likelihood • Neighbor joining• FastME• UPGMA• Quartet puzzling• Etc.RAxML: heuristic for large-scale ML optimization

Page 17: New Multiple Sequence Alignment Methods

1000-taxon models, ordered by difficulty (Liu et al., 2009)

Page 18: New Multiple Sequence Alignment Methods

Problems with the two-phase approach

• Current alignment methods fail to return reasonable alignments on large datasets with high rates of indels and substitutions.

• Manual alignment is time consuming and subjective.

• Systematists discard potentially useful markers if they are difficult to align.

This issues seriously impact large-scale phylogeny estimation (and Tree of Life projects)

Page 19: New Multiple Sequence Alignment Methods

SATé

SATé (Simultaneous Alignment and Tree Estimation)• Liu et al., Science 2009• Liu et al., Systematic Biology 2012• Public distribution (open source software) and

user-friendly GUI

Page 20: New Multiple Sequence Alignment Methods

1000-taxon models, ordered by difficulty (Liu et al., 2009)

Page 21: New Multiple Sequence Alignment Methods

Re-aligning on a treeA

B D

C

Merge sub-

alignments

Estimate ML tree on merged

alignment

Decompose dataset

A BC D Align

subproblems

A BC D

ABCD

Page 22: New Multiple Sequence Alignment Methods

SATé Algorithm

Tree

Obtain initial alignment and estimated ML tree

Page 23: New Multiple Sequence Alignment Methods

SATé Algorithm

Tree

Obtain initial alignment and estimated ML tree

Use tree to compute new alignment

Alignment

Page 24: New Multiple Sequence Alignment Methods

SATé Algorithm

Estimate ML tree on new alignment

Tree

Obtain initial alignment and estimated ML tree

Use tree to compute new alignment

Alignment

Page 25: New Multiple Sequence Alignment Methods

SATé Algorithm

Estimate ML tree on new alignment

Tree

Obtain initial alignment and estimated ML tree

Use tree to compute new alignment

Alignment

If new alignment/tree pair has worse ML score, realign using a different decomposition

Repeat until termination condition (typically, 24 hours)

Page 26: New Multiple Sequence Alignment Methods

1000-taxon models, ordered by difficulty (Liu et al., 2009)

Page 27: New Multiple Sequence Alignment Methods

1000 taxon models, ordered by difficulty

24 hour SATé analysis, on desktop machines

(Similar improvements for biological datasets)

Page 28: New Multiple Sequence Alignment Methods

1000 taxon models ranked by difficulty

Page 29: New Multiple Sequence Alignment Methods

LimitationsA

B D

C

Merge sub-

alignments

Estimate ML tree on merged

alignment

Decompose dataset

A BC D Align

subproblems

A BC D

ABCD

Page 30: New Multiple Sequence Alignment Methods

LimitationsA

B D

C

Mergesub-

alignments

Estimate ML tree on merged

alignment

Decompose dataset

A BC D Align

subproblems

A BC D

ABCD

Page 31: New Multiple Sequence Alignment Methods

PASTA Algorithm (Improving SATé)

Estimate ML tree on new alignment

Tree

Obtain initial alignment and estimated ML tree

New Re-Alignment Technique

Alignment

If new alignment/tree pair has worse ML score, realign using a different decomposition

Repeat until termination condition (typically, 24 hours)

Page 32: New Multiple Sequence Alignment Methods

Each PASTA iterationA

B D

C

New technique

for merging alignments

Estimate ML tree on merged

alignment

Decompose dataset

A BC D Align

subproblems

A BC D

ABCD

Page 33: New Multiple Sequence Alignment Methods

PASTA vs. SATé: better alignments and trees, better scalability

Analyses of Gutell’s16S datasets with curatedstructural alignments andreference trees usingmaximum likelihoodwith bootstrapping.

PASTA can analyzedatasets with up to200,000 sequencesefficiently.

SATé maxes out around50,000 sequences.

Page 34: New Multiple Sequence Alignment Methods

1KP dataset: Large and Highly Fragmentary!

Cytochrome dataset sequence length distribution

>100,000 AA sequencesTranscriptome data

Page 35: New Multiple Sequence Alignment Methods

UPP: basic idea

Input: set S of unaligned sequencesOutput: alignment on S

• Select random subset X of S• Estimate “backbone” alignment A and tree T on X• Independently align each sequence in S-X to A• Use transitivity to produce multiple sequence

alignment A* for entire set S

Page 36: New Multiple Sequence Alignment Methods

Input: Unaligned Sequences

S1 = AGGCTATCACCTGACCTCCAATS2 = TAGCTATCACGACCGCGCTS3 = TAGCTGACCGCGCTS4 = TACTCACGACCGACAGCTS5 = TAGGTACAACCTAGATCS6 = AGATACGTCGACATATC

Page 37: New Multiple Sequence Alignment Methods

Step 1: Pick random subset(backbone)

S1 = AGGCTATCACCTGACCTCCAATS2 = TAGCTATCACGACCGCGCTS3 = TAGCTGACCGCGCTS4 = TACTCACGACCGACAGCTS5 = TAGGTACAACCTAGATCS6 = AGATACGTCGACATATC

Page 38: New Multiple Sequence Alignment Methods

Step 2: Compute backbone alignment

S1 = -AGGCTATCACCTGACCTCCA-ATS2 = TAG-CTATCAC--GACCGC--GCTS3 = TAG-CT-------GACCGC--GCTS4 = TAC----TCAC—-GACCGACAGCTS5 = TAGGTAAAACCTAGATCS6 = AGATAAAACTACATATC

Page 39: New Multiple Sequence Alignment Methods

Step 3: Align each remaining sequence to backbone

S1 = -AGGCTATCACCTGACCTCCA-AT-S2 = TAG-CTATCAC--GACCGC--GCT-S3 = TAG-CT-------GACCGC—-GCT-S4 = TAC----TCAC--GACCGACAGCT-S5 = TAGG---T-A—CAA-CCTA--GATC

First we add S5 to the backbone alignment

Page 40: New Multiple Sequence Alignment Methods

Step 3: Align each remaining sequence to backbone

S1 = -AGGCTATCACCTGACCTCCA-AT-S2 = TAG-CTATCAC--GACCGC--GCT-S3 = TAG-CT-------GACCGC--GCT-S4 = TAC----TCAC—-GACCGACAGCT-S6 = -AG---AT-A-CGTC--GACATATC

Then we add S6 to the backbone alignment

Page 41: New Multiple Sequence Alignment Methods

Step 4: Use transitivity to obtain MSA on entire set

S1 = -AGGCTATCACCTGACCTCCA-AT--S2 = TAG-CTATCAC--GACCGC--GCT--S3 = TAG-CT-------GACCGC--GCT--S4 = TAC----TCAC--GACCGACAGCT--S5 = TAGG---T-A—CAA-CCTA--GATC-S6 = -AG---AT-A-CGTC--GACATAT-C

Page 42: New Multiple Sequence Alignment Methods

UPP: details

Input: set S of unaligned sequencesOutput: alignment on S

• Select random subset X of S• Estimate “backbone” alignment A and tree T on X• Independently align each sequence in S-X to A• Use transitivity to produce multiple sequence

alignment A* for entire set S

Page 43: New Multiple Sequence Alignment Methods

UPP: details

Input: set S of unaligned sequencesOutput: alignment on S

• Select random subset X of S• Estimate “backbone” alignment A and tree T on X• Independently align each sequence in S-X to A• Use transitivity to produce multiple sequence

alignment A* for entire set S

Page 44: New Multiple Sequence Alignment Methods

How to align sequences to a backbone alignment?

Standard machine learning technique: Build HMM (Hidden Markov Model) for backbone alignment, and use it to align remaining sequences

We use HMMER (Eddy, HHMI) for this purpose

Page 45: New Multiple Sequence Alignment Methods

1: Build Hidden Markov Model (HMM) for backbone MSA2: Use HMM to align remaining sequences (one by one)3: Use transitivity to infer MSA on entire dataset

Page 46: New Multiple Sequence Alignment Methods

Using HMMER

Using HMMER works well…• …except when the dataset has a high

evolutionary diameter.

Page 47: New Multiple Sequence Alignment Methods

Using HMMER

Using HMMER works well…except when the dataset is big!

Page 48: New Multiple Sequence Alignment Methods

One Hidden Markov Model for the entire alignment?

Page 49: New Multiple Sequence Alignment Methods

Or 2 HMMs?

Page 50: New Multiple Sequence Alignment Methods

Or 4 HMMs?(or 8 or 16 or…)

Page 51: New Multiple Sequence Alignment Methods

Or what about many different HMMs(HMMs on subsets of size 10, 20, 40, …, n)?

Page 52: New Multiple Sequence Alignment Methods

Evaluation

• Simulated datasets (some have fragmentary sequences):– 10K to 1,000,000 sequences in RNASim (Sheng Guo, Li-San

Wang, and Junhyong Kim, arxiv)– 1000-sequence nucleotide datasets from SATe papers– 5000-sequence AA datasets (from FastTree paper)– 10,000-sequence Indelible nucleotide simulation

• Biological datasets:– Proteins: largest BaliBASE and HomFam – RNA: 3 CRW (Gutell) datasets up to 28,000 sequences

Page 53: New Multiple Sequence Alignment Methods

RNASim: Tree error

Maximum Likelihoodtrees computedusing FastTree (Price)

Page 54: New Multiple Sequence Alignment Methods

One Million Sequences: Tree Error

UPP(100,100): 1.6 days using 12 processors

UPP(100,10): 7 days using 12 processors

Note: UPP decomposition improves accuracy

Page 55: New Multiple Sequence Alignment Methods

UPP vs. MAFFT-profile Running Time

Page 56: New Multiple Sequence Alignment Methods

AA Sequence Alignment Error (13 HomFam datasets)

Homfam: Sievers et al., MSB 2011 10,000 to 46,000 sequences per dataset; 5-14 sequences in each seed alignmentSequence lengths from 46-364 AA

Page 57: New Multiple Sequence Alignment Methods

1KP Cytochrome dataset sequence length distribution (>100K seqs)

Page 58: New Multiple Sequence Alignment Methods

Performance on fragmentary 1000M2 with varying degrees of fragmentation.

Page 59: New Multiple Sequence Alignment Methods

Impact of rate of evolution on performance on fragmentary sequence data

Page 60: New Multiple Sequence Alignment Methods

Impact of rate of evolution on performance on fragmentary sequence data

Page 61: New Multiple Sequence Alignment Methods

MSA and Phylogeny Estimation

• MSA estimation impacts phylogenetic accuracy, and many datasets are hard to align:

– Large datasets– Datasets with high rates of evolution– Datasets with fragmentary sequences

• SATé (Liu et al., Science 2009 and Systematic Biology 2012) provides very good alignments and trees even for high rates of evolution and large datasets

• PASTA (RECOMB 2014) is a direct improvement on SATé (matches or improves accuracy on small datasets, better accuracy on large datasets, can analyze very large datasets – up to 200,000 sequences so far).

• UPP – different approach, highly robust to fragmentary sequences, more accurate alignments than PASTA (but in some cases less accurate trees), can analyze even larger datasets, highly parallelizable. (Not yet available).

Page 62: New Multiple Sequence Alignment Methods

UPP Algorithmic Parameters

• Decompose or not? Use a Family of HMMs or just a single HMM?

• Use a small (100 sequence) or large (1000 sequence) backbone?

• Others:– How to compute backbone alignment and tree– How to select backbone dataset

Page 63: New Multiple Sequence Alignment Methods
Page 64: New Multiple Sequence Alignment Methods

Observations

• Decomposing (using a family of HMMs) improves tree accuracy, but has little impact on alignment accuracy.

• Using large backbones rather than small improves alignment accuracy but has only a small impact on tree accuracy.

• Alignment error metrics and tree error metrics are not that well correlated.

Page 65: New Multiple Sequence Alignment Methods

SEPP, TIPP, and UPP

• SEPP: SATe-enabled Phylogenetic Placement (Mirarab, Nugyen and Warnow, PSB 2012)

• TIPP: Taxon identification and phylogenetic profiling (Nguyen, Mirarab, Liu, Pop, and Warnow, submitted)

• UPP: Ultra-large multiple sequence alignment using SEPP (Nguyen, Mirarab, Kumar, Wang, Guo, Kim, and Warnow, in preparation)

Page 66: New Multiple Sequence Alignment Methods

Future Work

• Extending TIPP to non-marker genes• Using the new HMM Family technique in SEPP

and UPP• Using external seed alignments in SEPP, TIPP,

and UPP• Boosting statistical co-estimation methods by

using them in UPP for the backbone alignment and tree

Page 67: New Multiple Sequence Alignment Methods

Warnow Laboratory

PhD students: Siavash Mirarab*, Nam Nguyen, and Md. S. Bayzid**Undergrad: Keerthana KumarLab Website: http://www.cs.utexas.edu/users/phylo

Funding: Guggenheim Foundation, Packard, NSF, Microsoft Research New England, David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Computing Center), and the University of Alberta (Canada)

TACC and UTCS computational resources* Supported by HHMI Predoctoral Fellowship** Supported by Fulbright Foundation Predoctoral Fellowship

Page 68: New Multiple Sequence Alignment Methods

Indelible 10K: Alignment error

Note: ClustalO and Mafft-Profile(Default)fail to generate an alignmentunder this model condition.

Page 69: New Multiple Sequence Alignment Methods

Indelible 10K : Tree error

Note: ClustalO and Mafft-Profile(Default)fail to generate an alignmentunder this model condition.

Page 70: New Multiple Sequence Alignment Methods

FastTree AA: Alignment error

Page 71: New Multiple Sequence Alignment Methods

FastTree AA: Tree error

Page 72: New Multiple Sequence Alignment Methods

Large fragmentary: Tree error

Page 73: New Multiple Sequence Alignment Methods

Large fragmentary: Alignment error

10000-sequence dataset from RNASim16S.3 and 16S.T from Gutell’s Comparative Ribosomal Website (CRW)