bioinformática - universidade da beira interiorpandre/bioinformatica/iacb_t2.pdf ·...




Sara C. Madeira

Universidade da Beira Interior

(Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Bioinformática

Sara C. Madeira 1

Sequence Alignment Pairwise Sequence Alignment

16/03/2009 & 23/03/2009 27/04/2009


Outline

•  Introduction
•  What is pairwise sequence alignment?
•  Scoring Sequence Alignments
•  Pairwise Sequence Alignment Algorithms
    •  Needleman-Wunsch Algorithm (global alignment)
    •  Smith-Waterman Algorithm (local alignment)
    •  Extension to repeated matches (multiple local alignments)
•  Heuristic Pairwise Sequence Alignment Algorithms
    •  BLAST
    •  FASTA


Introduction

•  Advances in molecular biology allow increasingly rapid sequencing of genomes --> exponential growth in GenBank.
•  François Jacob (1977) [Evolution and tinkering, Science 196:1161-1166]: “Nature is a tinkerer and not an inventor”
•  Eric Wieschaus (1995) [Associated Press, 9 October 1995]: “We didn't know it at the time, but we found out everything in life is so similar, that the same genes that work in flies are the ones that work in humans.”


•  New sequences are adapted from pre-existing sequences rather than invented de novo.
•  Sequence similarity is an indicator of homology.
•  Sequence similarity has several other uses:
    •  Database queries
    •  Comparative genomics
    •  ...


What is Pairwise Sequence Alignment?

•  The problem of deciding if a pair of sequences are evolutionarily related or not.

•  Two biological sequences are similar ⇔ Two strings are similar

•  Sequences accumulate insertions, deletions, and substitutions.


Distance Between DNA Sequences

•  Hamming distance is not typically used to compare DNA or protein sequences: it is only defined for strings of equal length and counts substitutions only, so it cannot account for insertions and deletions.
•  Levenshtein distance allows one to compare strings of different lengths.
•  Edit distance

Definition: The edit distance between two strings is defined as the minimum number of edit operations (insertions, deletions and substitutions) needed to transform the first string into the second. Matches are not counted.
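The definition above can be turned into a short dynamic-programming routine. The following is a minimal sketch, not part of the original slides; the function name `edit_distance` and the example strings are illustrative assumptions.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    # prev[j] is the distance between the previous prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1  # matches are not counted
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, which is enough when the distance itself is needed and the sequence of edits is not.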


String Alignment

•  The concept of an alignment is crucial.
•  Global Alignment

Definition: A (global) alignment of two strings S1 and S2 is obtained by first inserting chosen spaces (or dashes), either into or at the ends of S1 and S2, and then placing the two resulting strings one above the other so that every character or space (dash) in either string is opposite a unique character or a unique space (dash) in the other string.
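As a rough illustration of this definition, the sketch below checks whether two dashed rows form a global alignment of two given strings: the rows must have equal length, and removing the dashes must give back S1 and S2. The function name `is_global_alignment` is a hypothetical helper; the sequences are the pair WEAGAWGHEE / PAWHEAE used in this deck's example.

```python
def is_global_alignment(row1: str, row2: str, s1: str, s2: str, gap: str = "-") -> bool:
    """Check that (row1, row2) is a global alignment of s1 and s2."""
    same_length = len(row1) == len(row2)       # every position faces exactly one position
    spells_s1 = row1.replace(gap, "") == s1    # only spaces were inserted into s1
    spells_s2 = row2.replace(gap, "") == s2    # only spaces were inserted into s2
    return same_length and spells_s1 and spells_s2


S1, S2 = "WEAGAWGHEE", "PAWHEAE"
print(is_global_alignment("WEAGAWGHE-E", "--P-AW-HEAE", S1, S2))  # True
print(is_global_alignment("WEAGAWGHEE", "PAWHEAE", S1, S2))       # False
```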


Gaps

•  Gaps help create alignments that better conform to underlying biological models.
•  Mechanisms that make long insertions or deletions in DNA include:
    •  unequal crossing-over in meiosis; DNA slippage during replication; insertion of transposable elements into a DNA string; insertions of DNA by retroviruses; etc.

Definition: A gap is any maximal, consecutive run of spaces (or dashes) in a single string of a given alignment.
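To make the difference between spaces and gaps concrete, here is a small sketch that lists the gaps of one row of an alignment as (start position, length) pairs. The helper name `gaps` is an assumption made for illustration; the rows are taken from this deck's example alignment.

```python
import re


def gaps(aligned_row: str, gap_char: str = "-") -> list[tuple[int, int]]:
    """Return (start, length) for each maximal run of gap characters in the row."""
    pattern = re.escape(gap_char) + "+"  # "+" is greedy, so each run is maximal
    return [(m.start(), len(m.group())) for m in re.finditer(pattern, aligned_row)]


print(gaps("--P-AW-HEAE"))  # [(0, 2), (3, 1), (6, 1)]
print(gaps("WEAGAWGHE-E"))  # [(9, 1)]
```

The first row contains four spaces but only three gaps, which matters once opening a gap and extending it are penalised differently.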


Example

S1 = WEAGAWGHEE
S2 = PAWHEAE

Alignment 1:

WEAGAWGHE-E
P-A--W-HEAE

Alignment 2:

WEAGAWGHE-E
--P-AW-HEAE

(Callouts on the slide mark a match, a mismatch and a gap in these alignments.)

•  More than one possible alignment!
•  Which one is better?
•  Is it a true or a spurious alignment?


How to Score an Alignment?

•  Find the best alignment between two strings under some scoring scheme.
•  Use a scoring model that quantifies evolutionary preferences.
    •  Substitution matrices: sets of values quantifying the likelihood of one residue being substituted by another in an alignment.
        •  Matches and mismatches
    •  Gap penalty
        •  Initiating a gap
    •  Gap extension penalty
        •  Extending a gap


The Scoring Model

•  The total score is a sum of terms for each aligned pair of residues, plus terms for each gap.
•  Identities and conservative substitutions are expected to be more frequent in real alignments than expected by chance, so they contribute positive score terms.
•  Non-conservative changes are expected to be less frequent in real alignments than expected by chance, so they contribute negative score terms.


•  The score assigned to an alignment is computed using this function:

$$S = \sum_i s\big(s_1(i), s_2(i)\big) + G(g)$$

where s(s1(i), s2(i)) is the score for each aligned pair of residues (given by a scoring matrix) and G(g) are the gap penalties (given a priori).

•  Scores s(.,.) and gap penalties G(g) can be computed using different models (scoring matrices, probabilistic models, ...).


Example

•  Gap penalty: -8
•  Gap extension penalty: -8

WEAGAWGHE-E
--P-AW-HEAE

(-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1

Alignment scores:

      W    H    G    E    A
W    15   -3   -3   -3   -3
P    -4   -2   -2   -1   -1
H    -3   10   -2    0   -2
E    -3    0   -3    6   -1
A    -3   -2    0   -1    5

Example

Exercise: What is the score of the following alignment?

WEAGAWGHE-E
P-A--W-HEAE

•  Gap penalty: -8
•  Gap extension penalty: -8
•  Alignment scores: the same W/P/H/E/A by W/H/G/E/A values as in the previous example.

Solution:

WEAGAWGHE-E
P-A--W-HEAE

(-4) + (-8) + 5 + (-8) + (-8) + 15 + (-8) + 10 + 6 + (-8) + 6 = -2
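The two worked examples can be checked with a few lines of code. This is a minimal sketch: the names `SUBST` and `alignment_score` are illustrative, the substitution values are the 5x5 excerpt shown in the example, and the convention that the first space of a gap costs `gap_open` while each further space costs `gap_extend` is an assumption (with both set to -8, as in the example, the choice makes no difference).

```python
ROWS, COLS = "WPHEA", "WHGEA"
TABLE = [
    [15, -3, -3, -3, -3],  # W
    [-4, -2, -2, -1, -1],  # P
    [-3, 10, -2, 0, -2],   # H
    [-3, 0, -3, 6, -1],    # E
    [-3, -2, 0, -1, 5],    # A
]
SUBST = {}
for r, row in zip(ROWS, TABLE):
    for c, value in zip(COLS, row):
        SUBST[(r, c)] = value
        SUBST[(c, r)] = value  # substitution scores are symmetric


def alignment_score(row1, row2, gap_open=-8, gap_extend=-8):
    """Sum the substitution scores of aligned pairs plus the gap penalties."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    open_side = None  # which row (0 or 1) holds the gap currently being extended
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            side = 0 if x == "-" else 1
            total += gap_extend if open_side == side else gap_open
            open_side = side
        else:
            total += SUBST[(x, y)]
            open_side = None
    return total


print(alignment_score("WEAGAWGHE-E", "--P-AW-HEAE"))  # 1
print(alignment_score("WEAGAWGHE-E", "P-A--W-HEAE"))  # -2
```

Under this scoring scheme the alignment starting with two spaces scores higher (1) than the one pairing W with P (-2).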


Scoring Matrices

•  Families of matrices listing the likelihood of change from one residue to another during evolution.
•  Amino acid substitution matrices
    •  PAM (Point Accepted Mutation)
    •  BLOSUM (BLOcks of Amino Acid SUbstitution Matrix)
•  DNA substitution matrices
    •  DNA sequences are less conserved than protein sequences.
    •  Comparing coding regions at the nucleotide level is less effective.


DNA Substitution Matrices

•  Scoring matrices for nucleotide sequences are relatively simple.
    •  A positive value or a high score is given for a match, and a negative value or a low positive score is given for a mismatch.
•  This assignment is based on the assumption that the frequencies of mutation are equal for all bases.
•  However, this assumption may not be realistic!
    •  Observations show that transitions (substitutions within the same base class: purine <-> purine, A<->G, or pyrimidine <-> pyrimidine, C<->T) occur more frequently than transversions (substitutions between a purine and a pyrimidine, e.g. A<->C or T<->G).
•  Therefore, a more sophisticated statistical model with different probability values for the two types of mutation is needed.
    •  Several nucleotide substitution models exist (example: Kimura model).
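A nucleotide scoring matrix that separates the two mismatch types could be sketched as follows. The score values (+1, -1, -2) are hypothetical placeholders chosen only to penalise transversions more than transitions; they are not taken from the slides or from any published model.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x: str, y: str) -> str:
    """Classify a pair of bases as a match, a transition or a transversion."""
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Hypothetical scores: transitions are more frequent, so they are penalised less.
SCORES = {"match": 1, "transition": -1, "transversion": -2}

DNA_MATRIX = {(x, y): SCORES[substitution_type(x, y)] for x in "ACGT" for y in "ACGT"}

print(substitution_type("A", "G"))  # transition
print(substitution_type("A", "C"))  # transversion
print(DNA_MATRIX[("C", "T")])       # -1
```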


Amino acid substitution matrices

PAM Matrices (Dayhoff, 1978)

•  Encode and summarize expected evolutionary change at the amino acid level.
•  Each matrix is designed to be used to compare pairs of sequences that have diverged by a specific number of PAM units.
•  1 PAM unit corresponds to an average of 1 accepted point mutation per 100 residues.


•  After 100 PAMs of evolution, not every residue will have changed:
    •  Some residues may have mutated several times.
    •  Some residues may have returned to their original state.
    •  Some residues may not have changed at all.
•  Construction of the PAM matrices started with hypothetical phylogenetic trees relating the sequences in 71 families, where each pair of sequences differed by no more than 15% of their residues.
•  For each amino acid pair, Ai and Aj, count the number of times that Ai aligns opposite Aj, and divide that number by the total number of pairs in all the aligned data.


PAM Matrices

•  Let F(i,j) denote the resulting frequency.
•  Let F(i) and F(j) be the frequencies with which amino acids Ai and Aj appear in the sequences.
•  The (i,j) entry for the ideal PAMn matrix is:

$$\log \frac{F(i,j)}{F(i)\,F(j)}$$
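The log-odds entry can be illustrated on a toy set of aligned residue pairs. This sketch covers only the final ratio, not the full PAM construction from phylogenetic trees; the function name `log_odds`, the toy data, the natural logarithm, and the choice to count every pair in both orders (so that the result is symmetric) are all assumptions.

```python
import math
from collections import Counter


def log_odds(pairs):
    """Return log(F(i,j) / (F(i) * F(j))) for every observed residue pair."""
    joint = Counter()
    for a, b in pairs:
        joint[(a, b)] += 1
        joint[(b, a)] += 1  # count both orders so that F(i,j) == F(j,i)
    total = sum(joint.values())
    freq = {pair: n / total for pair, n in joint.items()}  # F(i,j)
    single = Counter()                                     # F(i), the marginal of F(i,j)
    for (a, _), f in freq.items():
        single[a] += f
    return {(a, b): math.log(f / (single[a] * single[b])) for (a, b), f in freq.items()}


# Toy data: mostly identities, a few A/G substitutions.
toy_pairs = [("A", "A")] * 8 + [("G", "G")] * 8 + [("A", "G")] * 2
scores = log_odds(toy_pairs)
print(scores[("A", "A")] > 0)  # True: more frequent than expected by chance
print(scores[("A", "G")] < 0)  # True: less frequent than expected by chance
```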

Amino acid substitution matrices

Observed difference (%)    Evolutionary distance (PAM)
80                         250
70                         159
60                         120
50                         80
40                         56
30                         38
20                         23
10                         11
1                          1

•  Most widely used PAM matrix: PAM250


BLOSUM Matrices (Henikoff, 1992)

•  Substitution matrices derived using probabilistic models.
•  Matrices derived from a much larger dataset: the protein families BLOCKS database.
•  Sequences are clustered whenever their percentage of identical residues exceeds some level L%.
•  BLOSUM50 and BLOSUM62 are widely used.
•  BLOSUM observes significantly more replacements than PAM, even for infrequent pairs.
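The clustering step can be sketched roughly as follows, assuming ungapped segments of equal length (as in an aligned block) and single-linkage grouping, where a segment joins a cluster if it exceeds the identity level with at least one member. The function names and toy segments are hypothetical, and the sketch leaves out the later weighting and counting steps of the BLOSUM construction.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length segments agree."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster(segments, level):
    """Group segments whose identity exceeds `level` percent (single linkage)."""
    clusters = []
    for seg in segments:
        # Every existing cluster with at least one member close enough to seg.
        linked = [c for c in clusters
                  if any(percent_identity(seg, other) > level for other in c)]
        merged = [seg]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters


groups = cluster(["AAAA", "AAAT", "GGGG", "AATT"], level=70)
print(sorted(sorted(g) for g in groups))  # [['AAAA', 'AAAT', 'AATT'], ['GGGG']]
```

AATT is only 50% identical to AAAA, but both are 75% identical to AAAT, so single linkage places all three in one cluster.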


BLOSUM50

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   5  -2  -1  -2  -1  -1  -1   0  -2  -1  -2  -1  -1  -3  -1   1   0  -3  -2   0
R  -2   7  -1  -2  -4   1   0  -3   0  -4  -3   3  -2  -3  -3  -1  -1  -3  -1  -3
N  -1  -1   7   2  -2   0   0   0   1  -3  -4   0  -2  -4  -2   1   0  -4  -2  -3
D  -2  -2   2   8  -4   0   2  -1  -1  -4  -4  -1  -4  -5  -1   0  -1  -5  -3  -4
C  -1  -4  -2  -4  13  -3  -3  -3  -3  -2  -2  -3  -2  -2  -4  -1  -1  -5  -3  -1
Q  -1   1   0   0  -3   7   2  -2   1  -3  -2   2   0  -4  -1   0  -1  -1  -1  -3
E  -1   0   0   2  -3   2   6  -3   0  -4  -3   1  -2  -3  -1  -1  -1  -3  -2  -3
G   0  -3   0  -1  -3  -2  -3   8  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
H  -2   0   1  -1  -3   1   0  -2  10  -4  -3   0  -1  -1  -2  -1  -2  -3   2  -4
I  -1  -4  -3  -4  -2  -3  -4  -4  -4   5   2  -3   2   0  -3  -3  -1  -3  -1   4
L  -2  -3  -4  -4  -2  -2  -3  -4  -3   2   5  -3   3   1  -4  -3  -1  -2  -1   1
K  -1   3   0  -1  -3   2   1  -2   0  -3  -3   6  -2  -4  -1   0  -1  -3  -2  -3
M  -1  -2  -2  -4  -2   0  -2  -3  -1   2   3  -2   7   0  -3  -2  -1  -1   0   1
F  -3  -3  -4  -5  -2  -4  -3  -4  -1   0   1  -4   0   8  -4  -3  -2   1   4  -1
P  -1  -3  -2  -1  -4  -1  -1  -2  -2  -3  -4  -1  -3  -4  10  -1  -1  -4  -3  -3
S   1  -1   1   0  -1   0  -1   0  -1  -3  -3   0  -2  -3  -1   5   2  -4  -2  -2
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   2   5  -3  -2   0
W  -3  -3  -4  -5  -5  -1  -3  -3  -3  -3  -2  -3  -1   1  -4  -4  -3  15   2  -3
Y  -2  -1  -2  -3  -3  -1  -2  -3   2  -1  -1  -2   0   4  -3  -2  -2   2   8  -1
V   0  -3  -3  -4  -1  -3  -3  -4  -4   4   1  -3   1  -1  -3  -2   0  -3  -1   5


PAM Matrices vs BLOSUM Matrices

•  The PAM model is designed to track the evolutionary origin of proteins.
•  The BLOSUM model is designed to find conserved domains of proteins.

Rules of Thumb

•  Lower PAMs and higher BLOSUMs find short local alignments of highly similar sequences.
•  Higher PAMs and lower BLOSUMs find longer, weaker local alignments.