Bioinformática Básica (Basic Bioinformatics): Multiple Sequence Alignment

Rafael Dias Mesquita (rdmesquita@iq.ufrj.br)

Laboratório de Bioinformática, Departamento de Bioquímica, Instituto de Química - UFRJ

Why do we do alignments?

• Correspondence: find out which parts “do the same thing”
  - Similar genes are conserved across widely divergent species, often performing similar functions
• Structure prediction
  - Use knowledge of the structure of one or more members of a protein MSA to predict the structure of other members
  - Structure is more conserved than sequence
• Create “profiles” for protein families
  - Allow us to search for other members of the family
• Genome assembly: automated reconstruction of “contig” maps of genomic fragments such as ESTs
• MSA is the starting point for phylogenetic analysis

An example of Multiple Alignment

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Global vs. Local

• Global: both sequences are aligned along their entire lengths
• Local: the best subsequence alignment is found
• A global alignment of two genomic sequences may not align exons
• A local alignment would only pick out the maximum-scoring exon

Global Alignment

• Based on the Needleman-Wunsch algorithm
• An example: align COELACANTH and PELICAN
• Scoring scheme: +1 if letters match, -1 for mismatches, -1 for gaps

Two alignments with the same score (0):

COELACANTH
P-ELICAN--

COELACANTH
-PELICAN--
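To check by hand that both alignments score the same under the scheme above, the column-by-column sum can be sketched in Python. The function name `score_alignment` and its parameter names are illustrative choices, not something from the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-1):
    """Sum the column scores of two already-aligned strings ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap        # a letter paired with a gap
        elif x == y:
            total += match      # identical letters
        else:
            total += mismatch   # different letters
    return total


first = score_alignment("COELACANTH", "P-ELICAN--")
second = score_alignment("COELACANTH", "-PELICAN--")
print(first, second)  # both alignments score 0
```

Each alignment has five matches, two mismatches and three gaps, or five matches, one mismatch and four gap-like columns arranged differently; in both cases the +1 and -1 terms cancel to 0.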

Needleman-Wunsch Details

• Two-dimensional matrix
• Diagonal step when two letters align
• Horizontal (or vertical) step when a letter is paired with a gap

[Figure: alignment path through the COELACANTH × PELICAN matrix]

Needleman-Wunsch

• In reality, each cell of the matrix contains a score and a pointer
• The score is derived from the scoring scheme (-1 or +1 in our example)
• The pointer is an arrow that points up, left, or diagonally
• After initializing the matrix, compute the score and arrow for each cell

Algorithm

• For each cell, compute:
  - Match score: score of the preceding diagonal cell plus the score of aligning the two letters (+1 if match, -1 if no match)
  - Horizontal gap score: score of the cell to the left plus the gap score (-1)
  - Vertical gap score: score of the cell above plus the gap score (-1)
• Choose the highest of the three scores and point the arrow towards the cell it came from
• When you finish, trace the arrows back from the lower-right cell to get the alignment
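The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `needleman_wunsch` is hypothetical, the edges are filled with cumulative gap penalties (the usual convention, which the slides do not spell out), and ties between the three candidate scores are broken in the arbitrary order diagonal, left, up.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a (columns) and b (rows).

    Returns (score, aligned_a, aligned_b).
    """
    rows, cols = len(b) + 1, len(a) + 1
    score = [[0] * cols for _ in range(rows)]
    arrow = [[None] * cols for _ in range(rows)]

    # Edges: a prefix aligned entirely against gaps (assumed convention).
    for j in range(1, cols):
        score[0][j], arrow[0][j] = j * gap, "left"
    for i in range(1, rows):
        score[i][0], arrow[i][0] = i * gap, "up"

    # Fill: each cell takes the best of the three candidate scores.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[j - 1] == b[i - 1] else mismatch
            options = [
                (score[i - 1][j - 1] + s, "diag"),  # align two letters
                (score[i][j - 1] + gap, "left"),    # horizontal gap
                (score[i - 1][j] + gap, "up"),      # vertical gap
            ]
            score[i][j], arrow[i][j] = max(options, key=lambda o: o[0])

    # Trace the arrows back from the lower-right cell.
    i, j = rows - 1, cols - 1
    out_a, out_b = [], []
    while i > 0 or j > 0:
        move = arrow[i][j]
        if move == "diag":
            out_a.append(a[j - 1])
            out_b.append(b[i - 1])
            i, j = i - 1, j - 1
        elif move == "left":
            out_a.append(a[j - 1])
            out_b.append("-")
            j -= 1
        else:
            out_a.append("-")
            out_b.append(b[i - 1])
            i -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("COELACANTH", "PELICAN")
print(total)   # 0 for this pair under the +1/-1/-1 scheme
print(top)
print(bottom)
```

Which of the equally scoring alignments is printed depends on the tie-breaking order; a different order can return a different alignment with the same score.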

Algorithm: Matrix initialization

     C   O   E   L   A   C   A   N   T   H
P   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
E   -1  -1  +1  -1  -1  -1  -1  -1  -1  -1
L   -1  -1  -1  +1  -1  -1  -1  -1  -1  -1
I   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
C   +1  -1  -1  -1  -1  +1  -1  -1  -1  -1
A   -1  -1  -1  -1  +1  -1  +1  -1  -1  -1
N   -1  -1  -1  -1  -1  -1  -1  +1  -1  -1
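The match/mismatch matrix on this slide can be rebuilt with a few lines of Python. This is only a sketch; `match_matrix` is an illustrative name, and the rows and columns follow the slide layout (PELICAN down the side, COELACANTH across the top).

```python
def match_matrix(cols, rows, match=1, mismatch=-1):
    """One row per letter of `rows`, one column per letter of `cols`."""
    return [[match if r == c else mismatch for c in cols] for r in rows]


matrix = match_matrix("COELACANTH", "PELICAN")
for letter, row in zip("PELICAN", matrix):
    # Print each row with signed, right-aligned entries.
    print(letter, " ".join(f"{value:+d}" for value in row))
```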

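Algorithm: Computing sum of scores

Scores along the alignment path (COELACANTH letter / PELICAN letter):

Pair     Step score   Running sum
C / P       -1           -1
O / -       -1           -2
E / E       +1           -1
L / L       +1            0
A / I       -1           -1
C / C       +1            0
A / A       +1           +1
N / N       +1           +2
T / -       -1           +1
H / -       -1            0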

Algorithm: Finding alignment

Tracing the arrows back from the lower-right cell (final score 0) gives two equally good alignments:

COELACANTH
P-ELICAN--

COELACANTH
-PELICAN--

Local Alignment

• Smith-Waterman algorithm
• Modification of Needleman-Wunsch:
  - Edges of the matrix are initialized to 0
  - The score of a cell is never less than 0
  - No pointer unless the score is greater than 0
  - Trace-back starts at the highest score (rather than the lower right) and ends at 0
• How do these changes affect the algorithm?
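A minimal Python sketch of these four modifications is shown below, assuming the same +1/-1/-1 scoring scheme used for the global example. The function name `smith_waterman` and the tie-breaking order (diagonal, left, up) are illustrative choices rather than anything prescribed by the slides.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment of a (columns) and b (rows).

    Returns (best_score, aligned_a, aligned_b).
    """
    rows, cols = len(b) + 1, len(a) + 1
    # Edges (and every other cell) start at 0.
    score = [[0] * cols for _ in range(rows)]
    arrow = [[None] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[j - 1] == b[i - 1] else mismatch
            options = [
                (score[i - 1][j - 1] + s, "diag"),
                (score[i][j - 1] + gap, "left"),
                (score[i - 1][j] + gap, "up"),
            ]
            value, move = max(options, key=lambda o: o[0])
            if value > 0:
                # Keep a pointer only when the score is greater than 0.
                score[i][j], arrow[i][j] = value, move
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest score until a 0 cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while score[i][j] > 0:
        move = arrow[i][j]
        if move == "diag":
            out_a.append(a[j - 1])
            out_b.append(b[i - 1])
            i, j = i - 1, j - 1
        elif move == "left":
            out_a.append(a[j - 1])
            out_b.append("-")
            j -= 1
        else:
            out_a.append("-")
            out_b.append(b[i - 1])
            i -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("COELACANTH", "PELICAN"))  # (4, 'ELACAN', 'ELICAN')
```

Because negative running scores are reset to 0, the poorly matching ends (C, O, T, H against P and gaps) drop out and only the best-scoring stretch is reported.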

Smith-Waterman

• An example: align COELACANTH and PELICAN
• Scoring scheme: +1 if letters match, -1 for mismatches, -1 for gaps

ELACAN
ELICAN

Algorithm: Matrix initialization

0    C   O   E   L   A   C   A   N   T   H
P   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
E   -1  -1  +1  -1  -1  -1  -1  -1  -1  -1
L   -1  -1  -1  +1  -1  -1  -1  -1  -1  -1
I   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
C   +1  -1  -1  -1  -1  +1  -1  -1  -1  -1
A   -1  -1  -1  -1  +1  -1  +1  -1  -1  -1
N   -1  -1  -1  -1  -1  -1  -1  +1  -1  -1

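Algorithm: Computing sum of scores

Scores along the path (COELACANTH letter / PELICAN letter); the running score starts at 0 and never drops below 0:

Pair     Step score   Running score
C / P       -1            0
O / -       -1            0
E / E       +1           +1
L / L       +1           +2
A / I       -1           +1
C / C       +1           +2
A / A       +1           +3
N / N       +1           +4
T / -       -1           +3
H / -       -1           +2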

Algorithm: Finding alignment

Trace-back starts at the highest score (+4, at N / N) and ends at 0:

ELACAN
ELICAN

Multiple Sequence Alignment: Approaches

• Optimal global alignments: find the alignment that maximizes a score function
• Global progressive alignments: match closely related sequences first, using a guide tree
• Global consistency-based alignments: progressive approach that uses pairwise information to optimize the alignment
• Global iterative alignments: multiple rebuilding attempts to find the best alignment
• Structure- and HMM-based alignments
  - Praline-web, HMMalign

Optimal Global Alignments

• Usually use dynamic programming
• Generalization of Needleman-Wunsch
• Find the alignment that maximizes a score function
• 2 sequences => 2-dimensional matrix
• n sequences => n-dimensional matrix
• Computationally expensive: time grows as the product of the sequence lengths, so roughly L^n for n sequences of length L
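To get a feel for how quickly the n-dimensional matrix grows, the number of cells can be counted directly. The sketch below assumes one extra position per axis for the gap edge, as in the pairwise case; the helper names `dp_cells` and `predecessors_per_cell` are illustrative.

```python
from math import prod


def dp_cells(lengths):
    """Cells in the n-dimensional matrix: product of (length + 1) per sequence."""
    return prod(n + 1 for n in lengths)


def predecessors_per_cell(n_sequences):
    """Each cell looks back along every non-empty subset of axes: 2^n - 1 neighbours."""
    return 2 ** n_sequences - 1


print(dp_cells([10, 7]))            # 88 cells for COELACANTH vs PELICAN
print(dp_cells([100, 100, 100]))    # 1030301 cells for three sequences of length 100
print(predecessors_per_cell(2))     # 3: diagonal, left, up
print(predecessors_per_cell(3))     # 7
```

Adding one more sequence of length 100 multiplies the cell count by about 100, which is why exact multiple alignment is only practical for a handful of short sequences.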

Optimal Global Alignments: Examples for Pairwise Alignment

• NWalign
  http://zhanglab.ccmb.med.umich.edu/NW-align/
• FOGSAA (uses a branch-and-bound approach; faster; no web service yet)
  http://www.nature.com/srep/2013/130429/srep01746/full/srep01746.html

Progressive Multiple Alignment Method

• Compare all sequences pairwise.
• Perform cluster analysis on the pairwise data to generate a hierarchy for alignment. This may be in the form of a binary tree or a simple ordering.
• Build the multiple alignment by first aligning the most similar pair of sequences, then the next most similar pair, and so on. Once an alignment of two sequences has been made, it is fixed. For example, for sequences A, B, C, D, once A has been aligned with C and B with D, the alignment of all four is obtained by comparing the A-C alignment with the B-D alignment, using averaged scores at each aligned position.
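The first two steps (pairwise comparison and clustering into a hierarchy) can be sketched as below. Several things here are assumptions made for illustration: `difflib.SequenceMatcher.ratio()` stands in for a real pairwise alignment score, clusters are joined by average similarity, and the names `similarity` and `guide_tree` are hypothetical. Real tools use their own distance measures and tree-building methods.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in pairwise similarity in [0, 1] (assumption: not an alignment score)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def guide_tree(seqs):
    """Join the most similar clusters first; returns nested tuples of sequence names."""
    # Each cluster is (tree, member_names).
    clusters = [(name, [name]) for name in seqs]
    while len(clusters) > 1:

        def average_similarity(pair):
            members_a = clusters[pair[0]][1]
            members_b = clusters[pair[1]][1]
            values = [similarity(seqs[x], seqs[y]) for x in members_a for y in members_b]
            return sum(values) / len(values)

        # Pick the pair of clusters with the highest average similarity.
        i, j = max(combinations(range(len(clusters)), 2), key=average_similarity)
        (tree_a, members_a), (tree_b, members_b) = clusters[i], clusters[j]
        merged = ((tree_a, tree_b), members_a + members_b)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Toy sequences chosen so that A resembles C and B resembles D.
toy = {"A": "AAAAAAAA", "C": "AAAAAAAT", "B": "GGGGGGGG", "D": "GGGGGGGC"}
print(guide_tree(toy))
```

The nested tuples give the order of alignment: the innermost pairs are aligned first, then the resulting alignments are aligned to each other.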

Progressive Multiple Alignment Method: Steps in alignment

[Figure: steps in alignment]

Progressive Multiple Alignment Method: Examples

• ClustalW
  http://www.clustal.org/clustal2/
  http://www.ebi.ac.uk/Tools/msa/clustalw2/
• Kalign: useful for large datasets
  http://msa.sbc.su.se/cgi-bin/msa.cgi
  http://www.ebi.ac.uk/Tools/msa/kalign/

Global Consistency-based Alignments

• Use a progressive strategy
• During progression the base alignment can change; the method is not limited to adding sequences
• Can incorporate multiple-sequence information when scoring pairwise alignments
• Can consider additional information about pairwise alignments when constructing multiple alignments
• Can use probabilities

Global Consistency-based Alignments: Examples

• ProbCons: uses a scoring system that includes probabilities
  http://probcons.stanford.edu/
• T-Coffee: slow, useful for small datasets
  http://www.tcoffee.org/
  http://www.ebi.ac.uk/Tools/msa/tcoffee/

Global Iterative Alignments

• Can construct the distance matrix from k-mer distances (related sequences have more k-mers in common) and build the guide tree using UPGMA
• Use a progressive alignment strategy
• Iterations refine the multiple alignment by cutting it into two parts and realigning the profiles of the two parts
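The idea that related sequences share more k-mers can be illustrated with a simple distance. The formula below (one minus the fraction of shared k-mers, relative to the shorter sequence) is an assumption chosen for clarity; actual tools define their k-mer distances in their own ways. The names `kmer_counts` and `kmer_distance` are illustrative.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """0.0 when all k-mers of the shorter sequence are shared, 1.0 when none are."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    smaller = min(sum(counts_a.values()), sum(counts_b.values()))
    if smaller == 0:
        raise ValueError("both sequences must be at least k letters long")
    # Counter intersection keeps the minimum count of each shared k-mer.
    shared = sum((counts_a & counts_b).values())
    return 1.0 - shared / smaller


# COELACANTH and PELICAN share the 2-mers EL, CA and AN (3 of PELICAN's 6).
print(kmer_distance("COELACANTH", "PELICAN", k=2))  # 0.5
```

No alignment is needed to compute this distance, which is why it is a cheap way to build a first guide tree for many sequences.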

Global Iterative Alignments: Examples

• MAFFT: useful for small to large datasets (can perform progressive, consistency-based or iterative alignments)
  http://mafft.cbrc.jp/alignment/server/
  http://www.ebi.ac.uk/Tools/msa/mafft/
• Muscle
  http://www.drive5.com/muscle/
  http://www.ebi.ac.uk/Tools/msa/muscle/

Structure, HMM based alignments

• Based on a progressive strategy
• Use secondary structure information
• Use information about transmembrane regions
• Pre-profile processing using PSI-BLAST: homologous sequences help guide the alignment

Structure, HMM based alignments

• Praline-web
  http://www.ibi.vu.nl/programs/pralinewww/

Quality evaluation

• MUMSA (compares different multiple alignments to evaluate quality)
  http://msa.sbc.su.se/cgi-bin/msa.cgi
• Core-COFFE (indicates reliable regions with colors)
  http://tcoffee.crg.cat/apps/tcoffee/do:core
• iRMSD-COFFE (evaluates a multiple sequence alignment using structural information)
  http://tcoffee.crg.cat/apps/tcoffee/do:irmsd
• Leon
  http://bips.u-strasbg.fr/PipeAlign/jump_to.cgi?Leon+noid

Quality evaluation

• FACET (no web server available; read the paper for a review)
  http://facet.cs.arizona.edu/