Phyloinformatics or How to analyze LOTS of sequences

Heath Blackmon
University of Texas at Arlington
Bioinformatics – Spring 2014



Phyloinformatic workflow

Retrieve Sequences
• Phylota
• Genbank

Align
• MAFFT

Evaluate Alignment
• LAST
• Gblocks / Guidance

Infer Phylogeny


www.phylota.net



Select and Download Data

• Find a sequence cluster with:
  > 500 sequences
  < 2000 base pairs

  - Tetrapoda
  - Teleostei
  - eudicotyledons
  - arthropoda

Download the example file of 18S sequences from the class google drive: 18S.fa
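A minimal sketch for checking a downloaded cluster against the two criteria on this slide. It assumes the cluster is a plain FASTA file and that "< 2000 base pairs" applies to every sequence; the function names `read_fasta` and `cluster_ok` and the toy records are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of {header: sequence}.

    Records that share a header overwrite each other in this simple sketch.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def cluster_ok(records, min_seqs=500, max_len=2000):
    """True if there are more than min_seqs sequences, each shorter than max_len."""
    return len(records) > min_seqs and all(
        len(seq) < max_len for seq in records.values()
    )


# Toy data for illustration only; a real cluster would be read from a file,
# e.g. read_fasta(open("18S.fa").read()).
demo = ">taxon_a\nACGT\nACGT\n>taxon_b\nACG\n"
records = read_fasta(demo)
print(len(records), "sequences; longest:", max(len(s) for s in records.values()))
print("passes slide criteria:", cluster_ok(records))
```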


Alignment Programs

• ProbCons
• T-Coffee
• Clustal
• Clustal Omega
• Muscle
• Kalign
• PRRN
• DIALIGN-T
• MAFFT
• Bali-Phy
• DECIPHER


Balance Between Scalability & Accuracy

Method (version)     Score    CPU time (s)

Consistency based methods
MAFFT 5.662          86.91           6,000
ProbCons 1.10        87.25          43,000
T-Coffee 2.46        84.56         210,000

Iterative refinement methods
Muscle 3.52          81.67           3,400
PRRN 3.11            82.61         250,000
MAFFT 3.89           82.16           3,600
ClustalW 2.0         76.67          58,000

Progressive methods
Kalign 1.0           80.25             480
MAFFT 5.662          78.63             140
Muscle 3.52          77.63             160
ClustalW 1.83        75.34           2,000
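To make the trade-off concrete, the sketch below compares the most accurate and the fastest entries in the table above. The numbers are copied from the slide; the tuple layout and variable names are only one way to hold them.

```python
# (category, method and version, score, CPU time in seconds), from the table above
results = [
    ("consistency", "MAFFT 5.662", 86.91, 6000),
    ("consistency", "ProbCons 1.10", 87.25, 43000),
    ("consistency", "T-Coffee 2.46", 84.56, 210000),
    ("iterative", "Muscle 3.52", 81.67, 3400),
    ("iterative", "PRRN 3.11", 82.61, 250000),
    ("iterative", "MAFFT 3.89", 82.16, 3600),
    ("iterative", "ClustalW 2.0", 76.67, 58000),
    ("progressive", "Kalign 1.0", 80.25, 480),
    ("progressive", "MAFFT 5.662", 78.63, 140),
    ("progressive", "Muscle 3.52", 77.63, 160),
    ("progressive", "ClustalW 1.83", 75.34, 2000),
]

best = max(results, key=lambda r: r[2])      # highest score
fastest = min(results, key=lambda r: r[3])   # lowest CPU time

score_gap = best[2] - fastest[2]
slowdown = best[3] / fastest[3]

print(f"Most accurate: {best[1]} ({best[0]}), score {best[2]}")
print(f"Fastest: {fastest[1]} ({fastest[0]}), {fastest[3]} s")
print(f"Score difference: {score_gap:.2f}; CPU time ratio: {slowdown:.0f}x")
```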


MAFFT

• Align 1,000s of sequences in minutes/hours
• Progressive and iterative methods supported
• Multiple scoring schemes
• Install locally or run on the CBRC servers


Go ahead and try aligning the 18S.fa file that you downloaded from the class google drive.
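If MAFFT is installed locally rather than run on the CBRC servers, the alignment can also be launched from a script. This is a sketch, assuming the `mafft` executable is on the PATH; `--auto` asks MAFFT to choose a strategy from the data size, and the output name `18S_aligned.fa` is a hypothetical choice.

```python
import shutil
import subprocess


def mafft_command(fasta_path, auto=True):
    """Build the argument list for a basic MAFFT run."""
    cmd = ["mafft"]
    if auto:
        cmd.append("--auto")
    cmd.append(fasta_path)
    return cmd


def align(fasta_path, out_path):
    """Run MAFFT and save the alignment; returns False if mafft is not installed.

    MAFFT writes the alignment to standard output, so it is redirected to out_path.
    """
    if shutil.which("mafft") is None:
        return False
    with open(out_path, "w") as out:
        subprocess.run(mafft_command(fasta_path), stdout=out, check=True)
    return True


# Show the command that would be run; call align("18S.fa", "18S_aligned.fa")
# once 18S.fa is in the working directory.
print(" ".join(mafft_command("18S.fa")))
```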


Dot Plot

[Figure: dot plot. Vertical axis (top to bottom): ACAATACGAGCATAAATC. Horizontal axis: CTAAATACGAGCATAACA.]
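A dot plot marks every cell where the base on one axis equals the base on the other. The sketch below draws one as text for the slide's sequence compared with itself (the vertical axis, read bottom to top, spells the same sequence as the horizontal axis). The function names are hypothetical, and real tools such as LAST match longer words rather than single bases.

```python
def dot_plot(seq_x, seq_y):
    """Return rows of booleans: cell [i][j] is True when seq_y[i] == seq_x[j]."""
    return [[y == x for x in seq_x] for y in seq_y]


def render(seq_x, seq_y):
    """Draw the plot as text, with seq_y running from bottom to top."""
    grid = dot_plot(seq_x, seq_y)
    lines = []
    for base, row in reversed(list(zip(seq_y, grid))):
        lines.append(base + " " + " ".join("*" if hit else "." for hit in row))
    lines.append("  " + " ".join(seq_x))
    return "\n".join(lines)


seq = "CTAAATACGAGCATAACA"
print(render(seq, seq))
```

With identical sequences the matches form an unbroken diagonal from bottom left to top right; off-diagonal marks are chance single-base matches.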


DELETION / INSERTION

[Figure: dot plot. Vertical axis (top to bottom): ACAATACGAGCATAAATC. Horizontal axis: CTAAATACGCATAACA.]


INVERSION

[Figure: dot plot. Vertical axis (top to bottom): ACAATACGAGCATAAATC. Horizontal axis: CTAAATACACAATACGAG.]


INVERSION

[Figure: dot plot. Vertical axis (top to bottom): ACAATACGAGCATAAATC. Horizontal axis: CTAAATACTGTTATGCTC. Legend: matches between same strand; matches between opposite strand.]
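The two kinds of match can be told apart by comparing a sequence against both the reference and its reverse complement. In this sketch the sequences are taken from the slide axes; the word size `k=6` and the function names are arbitrary, hypothetical choices.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand of seq, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def strand_hits(query, reference, k=6):
    """Count the k-mers of query found on the same and on the opposite strand."""
    opposite_strand = reverse_complement(reference)
    kmers = [query[i:i + k] for i in range(len(query) - k + 1)]
    same = sum(kmer in reference for kmer in kmers)
    opposite = sum(kmer in opposite_strand for kmer in kmers)
    return same, opposite


reference = "CTAAATACGAGCATAACA"   # sequence from the first dot plot
inverted = "CTAAATACTGTTATGCTC"    # horizontal axis of this slide

# The last 10 bases of the inverted sequence are the reverse complement
# of the last 10 bases of the reference.
print(reverse_complement(reference[8:]) == inverted[8:])
print(strand_hits(inverted, reference))
```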


Evaluating the 18S alignment

• Look at your dot plots first. What is wrong with the sequences?

• How would you fix/prevent this problem?


Evaluating Sites in an Alignment

• Bootstrapping - Guidance
• Identify regions with strong support - Gblocks


Gblocks

[Figure annotations: 9 W residues; 6 I residues; 8 F residues]
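Gblocks starts by counting how many sequences share the most common residue at each position. The sketch below classifies a single column that way. It makes two loud assumptions: that the three counts on the slide belong to one column of 23 sequences, and that the thresholds (more than half of the sequences for "conserved", 85% for "highly conserved", any gap meaning "nonconserved") match the Gblocks defaults. The later step of assembling positions into blocks is not shown.

```python
from collections import Counter


def classify_column(column, conserved_frac=0.5, high_frac=0.85):
    """Classify one alignment column by the count of its most common residue.

    Thresholds are assumptions meant to mirror Gblocks defaults.
    """
    n = len(column)
    if "-" in column:
        return "nonconserved"            # assumed: gapped positions are rejected
    top_count = Counter(column).most_common(1)[0][1]
    if top_count >= high_frac * n:
        return "highly conserved"
    if top_count >= int(conserved_frac * n) + 1:
        return "conserved"
    return "nonconserved"


# Hypothetical column built from the counts on the slide: 9 W, 6 I, 8 F.
slide_column = "W" * 9 + "I" * 6 + "F" * 8
print(Counter(slide_column).most_common())
print(classify_column(slide_column))
```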


Bootstrapping

The scores across the bottom, scaled between 0 and 1, report the proportion of alignments that agree with the assignment of nucleotides in the original MSA.
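A simplified sketch of such a score: for each column of the original MSA, the fraction of alternative alignments that contain exactly the same column. This illustrates the idea only and is not Guidance's actual scoring; the function names and toy alignments are hypothetical.

```python
def columns(alignment):
    """Describe each column by which residue of each sequence it holds.

    A column becomes a tuple with one entry per sequence: the index of the
    residue placed there, or None for a gap.
    """
    counters = [0] * len(alignment)
    cols = []
    for j in range(len(alignment[0])):
        col = []
        for i, row in enumerate(alignment):
            if row[j] == "-":
                col.append(None)
            else:
                col.append(counters[i])
                counters[i] += 1
        cols.append(tuple(col))
    return cols


def column_scores(base, alternatives):
    """Per-column proportion of alternative alignments that reproduce the column."""
    alt_sets = [set(columns(alt)) for alt in alternatives]
    return [
        sum(col in alt for alt in alt_sets) / len(alt_sets)
        for col in columns(base)
    ]


# Toy alignments of the same two sequences (ACGT and ACAGT).
base = ["AC-GT", "ACAGT"]
alternatives = [["AC-GT", "ACAGT"], ["ACG-T", "ACAGT"]]
print(column_scores(base, alternatives))
```

Columns that every alternative alignment reproduces score 1; the two columns around the ambiguous gap placement score lower.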


Try The Data You Downloaded

• Make an alignment
• Check the dot plots
• Use Gblocks to remove uncertain sites
  – How many sites in initial alignment?
  – How many sites in filtered alignment?
  – Did you lose any taxa?
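The three questions can be answered with a short comparison of the two alignments. This sketch assumes each alignment has already been loaded as a dict of {taxon: aligned sequence}, and it counts a taxon as lost if it is missing from the filtered alignment or is left with only gaps; the taxon names and sequences are made up.

```python
def n_sites(alignment):
    """Number of columns in an alignment (0 if it is empty)."""
    return len(next(iter(alignment.values()))) if alignment else 0


def summarize(initial, filtered):
    """Compare site counts and list taxa lost after filtering."""
    lost = sorted(
        taxon for taxon in initial
        if taxon not in filtered or set(filtered[taxon]) <= {"-"}
    )
    return {
        "initial_sites": n_sites(initial),
        "filtered_sites": n_sites(filtered),
        "lost_taxa": lost,
    }


# Hypothetical toy alignments for illustration.
initial = {"taxon_a": "ACG-TTA", "taxon_b": "ACGATTA", "taxon_c": "---AT--"}
filtered = {"taxon_a": "ACGTA", "taxon_b": "ACGTA", "taxon_c": "-----"}
print(summarize(initial, filtered))
```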


• Treat your alignment as a model parameter!

• Bali-Phy: estimates phylogenetic trees across all possible alignments, without conditioning on any single alignment being “true”

• Thanks for listening to me!