The use of short-read next generation sequences to recover the evolutionary histories in multi-individual samples. Systematic biology presentation, Yuantong Ding, Dec. 6


Page 1

The use of short-read next generation sequences to recover the evolutionary histories in multi-individual samples

Systematic biology presentation

Yuantong Ding, Dec. 6

Page 2

Outline

• Background

• Workflow

• Sequence comparison

• Tree comparison

• Summary & future work

Page 3

Can short-reads successfully recover phylogeny?

• Next generation sequencing (NGS)

• Low-cost

• High-throughput

• Short-read

Multi-individual sample → Short-reads → Reconstructed sequences → Phylogeny?

Background | Workflow | Sequence comparison | Tree comparison | Summary

Page 4

Simulation process

• Original genealogy and original haplotypes: simulated by SerialSimCoal with a coalescent model

• Consensus sequence and short-reads: short-reads simulated by MetaSim with a 454 error model

• Mapping: alignment built by SHRiMP and SSAHA

• Reconstructed haplotypes: reconstructed by ShoRAH

• NJ trees: built by PAUP* for the original and the reconstructed haplotypes

• Comparisons: tree topology, and number and similarity of haplotypes
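To make the short-read step concrete, here is a toy sketch of drawing reads from a set of haplotypes. It is not MetaSim and has no 454 error model: it assumes error-free reads of fixed length sampled uniformly, and the function name `sample_short_reads` and the example haplotypes are hypothetical.

```python
import random


def sample_short_reads(haplotypes, n_reads, read_len, seed=0):
    """Draw n_reads error-free substrings of length read_len.

    Assumption: each read comes from a haplotype chosen uniformly at random,
    at a start position chosen uniformly at random. No sequencing errors.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        hap = rng.choice(haplotypes)
        start = rng.randrange(len(hap) - read_len + 1)
        reads.append(hap[start:start + read_len])
    return reads


# Hypothetical toy haplotypes (the real ones came from SerialSimCoal).
haplotypes = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGAACGTACGTACGT",
    "ACGTTCGTACGTACGTACCT",
]
reads = sample_short_reads(haplotypes, n_reads=8, read_len=6)
```

In the presentation's terms, `n_reads` plays the role of Sr_N and `read_len` the role of Sr_l.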


Page 5

6 parameters used

• Effective population size N

• Sample size n

• Mutation rate μ

• Sequence length l

• Number of short-reads Sr_N

• Length of short-reads Sr_l

Values tested for each parameter (each column is independent of the others):

N      n   μ         l     Sr_N   Sr_l
3000   10  5.00E-05  1200  5000   200
5000   20  1.00E-05  2000  10000  400
10000  40  5.00E-06  5000  30000  -

All 486 combinations of these parameters (3 × 3 × 3 × 3 × 3 × 2) were simulated
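The count of 486 follows from taking every combination of the listed values. A minimal sketch that enumerates the grid, using the values from the table (the dictionary keys are just labels chosen here):

```python
from itertools import product

# Parameter values as listed on the slide; "mu" stands for the mutation rate μ.
grid = {
    "N": [3000, 5000, 10000],
    "n": [10, 20, 40],
    "mu": [5e-05, 1e-05, 5e-06],
    "l": [1200, 2000, 5000],
    "Sr_N": [5000, 10000, 30000],
    "Sr_l": [200, 400],
}

# One dict per simulated parameter combination.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combinations))  # 486
```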

Page 6

Different numbers of haplotypes


Page 7

Similar sequences


Page 8

Can reconstructed haplotypes still capture some phylogenetic information?

• With a different number of haplotypes, it is impossible to recover the true phylogenetic trees

Assuming the true number of haplotypes in the sample is known:

• Select the most similar reconstructed sequences to build a phylogenetic tree, then calculate the symmetric difference

• Cluster (k-means) the reconstructed haplotypes into n groups, build a tree from the consensus sequence of each group, then calculate tree balance statistics
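The slides do not say how similarity was measured or how consensus sequences were formed. As an illustration only, the sketch below assumes equal-length aligned sequences, Hamming distance as the similarity measure, and a per-column majority vote for the consensus. All function names are hypothetical, and the k-means grouping itself is taken as given rather than implemented.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def most_similar(originals, reconstructed):
    """For each original haplotype, pick the closest reconstructed one.

    Assumption: ties go to the first candidate, and the same reconstructed
    haplotype may be picked more than once.
    """
    return [min(reconstructed, key=lambda r: hamming(o, r)) for o in originals]


def consensus(group):
    """Majority-vote consensus of a group of aligned sequences."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*group))


# Hypothetical toy data.
originals = ["AAAA", "CCCC"]
reconstructed = ["AAAT", "CCGC", "GGGG"]
selected = most_similar(originals, reconstructed)
group_consensus = consensus(["AAAT", "AAAA", "AACA"])
```

Here `selected` holds one reconstructed sequence per original haplotype, and `group_consensus` is the sequence that would represent one cluster in the tree.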

Page 9

Method for tree comparison

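Symmetric difference for rooted and labeled trees

Tree 1 (tips A, B, C): clades (BC), (ABC)

Tree 2 (tips B, A, C): clades (AC), (ABC)

Symmetric difference = 2, because (BC) and (AC) each occur in only one of the two trees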
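The three-tip example can be checked with a short sketch. It assumes rooted trees written as nested tuples with string tip labels, which is a representation chosen here for illustration, not something taken from the slides.

```python
def clades(tree):
    """Return the set of clades of a rooted tree given as nested tuples.

    Each clade is a frozenset of the tip labels below one internal node.
    """
    found = set()

    def tips(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)
        return members

    tips(tree)
    return found


def symmetric_difference(tree1, tree2):
    """Count the clades that occur in exactly one of the two trees."""
    return len(clades(tree1) ^ clades(tree2))


tree1 = ("A", ("B", "C"))  # clades (BC), (ABC)
tree2 = ("B", ("A", "C"))  # clades (AC), (ABC)
print(symmetric_difference(tree1, tree2))  # 2
```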

Tree balance statistics for rooted and unlabeled trees

$N_i$ is the number of internal nodes between tip $i$ and the root

e.g. for $i = A$, $N_A = 2$; averaging over the five tips, $\bar{N} = (2+2+2+3+3)/5 = 2.4$
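The worked example can be reproduced with a small sketch. The tree figure is not in the transcript, so the five-tip shape below is an assumption, chosen because it gives tip depths 2, 2, 2, 3, 3 when the root is counted as one of the internal nodes.

```python
def tip_depths(tree, depth=0):
    """Yield (tip, N_i) pairs for a rooted tree given as nested tuples.

    N_i counts the internal nodes on the path from tip i up to and
    including the root.
    """
    if isinstance(tree, str):
        yield tree, depth
    else:
        for child in tree:
            yield from tip_depths(child, depth + 1)


def mean_depth(tree):
    """Mean of N_i over all tips (N_bar)."""
    depths = [d for _, d in tip_depths(tree)]
    return sum(depths) / len(depths)


# Assumed tree shape; the original slide figure is not available.
tree = (("A", "B"), ("C", ("D", "E")))
print(dict(tip_depths(tree)))  # {'A': 2, 'B': 2, 'C': 2, 'D': 3, 'E': 3}
print(mean_depth(tree))        # 2.4
```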

Page 10

Different topology of most similar sequence tree


Page 11

Different balance statistics of the k-means cluster tree


n    N_bar                  I_c
     org   rec   P          org   rec   P
10   4.8   4.7   0.002      0.74  0.67  0.0004
20   7.5   6.9   9.2e-09    0.57  0.47  1.52e-10
40   10.6  9.6   1.2e-08    0.40  0.33  1.94e-09

(org: original, rec: reconstructed)
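The slides do not define I_c. The sketch below assumes it denotes Colless' imbalance index normalized to the range 0 to 1, which fits the reported values but is not confirmed by the source. It works on strictly binary rooted trees written as nested tuples, and the function names are hypothetical.

```python
def colless(tree):
    """Return (number of tips, unnormalized Colless sum) for a binary tree.

    The sum adds |tips on the left - tips on the right| over internal nodes.
    """
    if isinstance(tree, str):
        return 1, 0
    left, right = tree
    n_left, c_left = colless(left)
    n_right, c_right = colless(right)
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)


def normalized_colless(tree):
    """Colless sum divided by its maximum, (n - 1)(n - 2) / 2; needs n >= 3."""
    n, total = colless(tree)
    return 2 * total / ((n - 1) * (n - 2))


caterpillar = ("A", ("B", ("C", ("D", "E"))))  # maximally unbalanced
balanced = (("A", "B"), ("C", "D"))            # perfectly balanced
print(normalized_colless(caterpillar))  # 1.0
print(normalized_colless(balanced))     # 0.0
```

Under this assumption, lower values in the rec columns would mean the reconstructed trees are more balanced than the original ones.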

Page 12

Summary & future work

• Haplotype reconstruction typically failed to estimate the correct number of haplotypes.

• Consequently, it was not possible to recover the true phylogenetic trees.

• Even assuming we know the true haplotype number, the chance of recovering the true tree topology is still small.

• Future work: other reconstruction methods, use of multiple reference sequences when mapping…

Page 13

References

• Anderson, C.N.K., Ramakrishnan, U. et al. 2005. Serial SimCoal: A population genetic model for data from multiple populations and points in time. Bioinformatics 21, 1733-1734.

• Johnson, P.L., Slatkin, M. 2006. Inference of population genetic parameters in metagenomics: a clean look at messy data. Genome Res 16, 1320-1327.

• Richter, D.C., Ott, F. et al. 2008. MetaSim—A Sequencing Simulator for Genomics and Metagenomics. PLoS ONE 3, e3373.

• Suzuki, S., Ono, N., Furusawa, C., Ying, B.-W., Yomo, T. 2011. Comparison of Sequence Reads Obtained from Three Next-Generation Sequencing Platforms. PLoS ONE 6, e19534.

• Zagordi, O., Bhattacharya, A. et al. 2011. ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data. BMC Bioinformatics 12, 119.

• David, M., Dzamba, M. et al. 2011. SHRiMP2: Sensitive yet Practical Short Read Mapping. Bioinformatics 27(7), 1011-1012.

• Ning, Z., Cox, A.J., Mullikin, J.C. 2001. SSAHA: a fast search method for large DNA databases. Genome Res 11, 1725-1729.