sequence analysis methods

CZ5225: Modeling and Simulation in Biology. Lecture 3: Sequence analysis methods. Prof. Chen Yu Zong. Tel: 6874-6877. Email: [email protected] http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, National University of Singapore


Post on 15-Jan-2016



TRANSCRIPT

Page 1: Sequence Analysis Methods

CZ5225: Modeling and Simulation in Biology

Lecture 3: Sequence analysis methods

Prof. Chen Yu Zong

Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg

Room 07-24, level 7, SOC1, National University of Singapore

Page 2: Sequence Analysis Methods


Sequence Analysis Methods

Page 3: Sequence Analysis Methods

Gene and Protein Sequence Alignment as a Mathematical Problem:

Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC

Best alignment (one gap):
         ATTCTTGC
    ATCCTATTCTAGC

Bad alignment (two gaps): AT [gap] TCTT [gap] GC
                          ATCCTATTCTAGC

What is a good alignment?

Page 4: Sequence Analysis Methods

How to rate an alignment?

• Match: +8 (w(x, y) = 8, if x = y)

• Mismatch: -5 (w(x, y) = -5, if x ≠ y)

• Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
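
The scoring rule above can be written as a small function. This is a minimal sketch in Python; the function name `w` follows the slide's notation, while the constant `GAP` and the keyword defaults are illustrative choices rather than part of the lecture.

```python
GAP = "-"  # symbol used for a gap in an aligned row (illustrative choice)

def w(x: str, y: str, match: int = 8, mismatch: int = -5, gap: int = -3) -> int:
    """Score one alignment column under the match/mismatch/gap scheme."""
    if x == GAP or y == GAP:
        return gap                      # w(-, x) = w(x, -) = -3
    return match if x == y else mismatch

print(w("A", "A"), w("A", "C"), w("-", "T"))  # 8 -5 -3
```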

Page 5: Sequence Analysis Methods

Pairwise Alignment

Sequence a: CTTAACT
Sequence b: CGGATCAT

An alignment of a and b:

C---TTAACT
CGGATCA--T

[Figure labels: insertion gap, match, mismatch, deletion gap]

Page 6: Sequence Analysis Methods

Alignment Graph

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph with sequence b (C G G A T C A T) along the top and sequence a (C T T A A C T) down the side; insertion and deletion gaps are marked]

C---TTAACT
CGGATCA--T

Page 7: Sequence Analysis Methods

Graphic representation of an alignment

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph, first step of the path (C aligned with C)]

C---TTAACT
CGGATCA--T

Page 8: Sequence Analysis Methods

Graphic representation of an alignment

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph, path drawn across C G G A of sequence b and C of sequence a]

C---TTAACT
CGGATCA--T

Page 9: Sequence Analysis Methods

Graphic representation of an alignment

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph, path drawn across C G G A T of sequence b and C T of sequence a]

C---TTAACT
CGGATCA--T

Page 10: Sequence Analysis Methods

Graphic representation of an alignment

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph, path drawn across C G G A T C A of sequence b and C T T A A C of sequence a]

C---TTAACT
CGGATCA--T

Page 11: Sequence Analysis Methods

Graphic representation of an alignment

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph, complete path across C G G A T C A T of sequence b and C T T A A C T of sequence a]

C---TTAACT
CGGATCA--T

Page 12: Sequence Analysis Methods

Pathway of an alignment

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph with the pathway of the alignment below]

C---TTAACT
CGGATCA--T

Page 13: Sequence Analysis Methods

Graphic representation of an alignment

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph for the alignment below]

CTTAACT-
CGGATCAT

Page 14: Sequence Analysis Methods

Pathway of an alignment

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph with the pathway of the alignment below]

CTTAACT-
CGGATCAT

Page 15: Sequence Analysis Methods

Use of graph to generate alignments

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph with the pathway that generates the alignment below]

-CTTAACT
CGGATCAT

Page 16: Sequence Analysis Methods

Use of graph to generate alignments

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph with the pathway that generates the alignment below]

-C--TTAACT
CGGATC-AT-

Page 17: Sequence Analysis Methods

Use of graph to generate alignments

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph with the pathway that generates the alignment below]

CTTAACT---
--CGGATCAT

Page 18: Sequence Analysis Methods

Which pathway is better?

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph with several alternative pathways]

Multiple pathways are possible.

Each pathway corresponds to one alignment and has its own score.

Page 19: Sequence Analysis Methods

Alignment Score

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph with the running score marked along the path]

Running score: 8

C---TTAACT
CGGATCA--T

Page 20: Sequence Analysis Methods

Alignment Score

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph with the running score marked along the path]

Running score: 8, 8-3 = 5

C---TTAACT
CGGATCA--T

Page 21: Sequence Analysis Methods

Alignment Score

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph with the running score marked along the path]

Running score: 8, 8-3 = 5, 5-3 = 2, 2-3 = -1

C---TTAACT
CGGATCA--T

Page 22: Sequence Analysis Methods

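Alignment Score

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: alignment graph with the running score marked along the path]

C---TTAACT
CGGATCA--T

Running score, column by column (the sixth column, T against C, is a mismatch and scores -5):

8, 8-3 = 5, 5-3 = 2, 2-3 = -1, -1+8 = 7, 7-5 = 2, 2+8 = 10, 10-3 = 7, 7-3 = 4, 4+8 = 12

Alignment score: 12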

Page 23: Sequence Analysis Methods

An optimal alignment: the alignment of maximum score

• Let A = a1a2…am and B = b1b2…bn.

• Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj

• With proper initializations, Si,j can be computed as follows.

$$
S_{i,j} = \max \begin{cases}
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$
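
The recurrence can be sketched as a short dynamic-programming routine. The version below is an illustration, not the lecture's own code: the function name `global_score_matrix` is hypothetical, and it assumes the first row and column are initialized with cumulative gap penalties (0, -3, -6, ...).

```python
def global_score_matrix(a, b, match=8, mismatch=-5, gap=-3):
    """Fill S[i][j] = best score for aligning a[:i] with b[:j]."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Assumed initialization: a prefix aligned against nothing costs one gap per symbol.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j] + gap,      # a_i against a gap
                S[i][j - 1] + gap,      # gap against b_j
                S[i - 1][j - 1] + sub,  # a_i against b_j
            )
    return S

S = global_score_matrix("CTTAACT", "CGGATCAT")
print(S[1][1], S[2][2], S[-1][-1])  # 8 3 14
```

The bottom-right cell holds the score of an optimal global alignment of the two full sequences.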

Page 24: Sequence Analysis Methods

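Computing Si,j

[Figure: cell (i, j) of the score matrix is reached from (i-1, j) with w(ai, -), from (i, j-1) with w(-, bj), and from (i-1, j-1) with w(ai, bj); the final score is Sm,n]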

Page 25: Sequence Analysis Methods

Initializations

S0,0 = 0

S0,1 = -3, S0,2 = -6, S0,3 = -9, S0,4 = -12, S0,5 = -15, S0,6 = -18, S0,7 = -21, S0,8 = -24

S1,0 = -3, S2,0 = -6, S3,0 = -9, S4,0 = -12, S5,0 = -15, S6,0 = -18, S7,0 = -21

         C   G   G   A   T   C   A   T
     0  -3  -6  -9 -12 -15 -18 -21 -24
C   -3
T   -6
T   -9
A  -12
A  -15
C  -18
T  -21

Gap symbol: -3

Page 26: Sequence Analysis Methods

S1,1 = ?

Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8

Option 2: S1,1 = S0,1 + w(a1, -) = -3 - 3 = -6

Option 3: S1,1 = S1,0 + w(-, b1) = -3 - 3 = -6

Optimal: S1,1 = 8

         C   G   G   A   T   C   A   T
     0  -3  -6  -9 -12 -15 -18 -21 -24
C   -3   ?
T   -6
T   -9
A  -12
A  -15
C  -18
T  -21

Match: 8, Mismatch: -5, Gap symbol: -3

Page 27: Sequence Analysis Methods

S1,2 = ?

Option 1: S1,2 = S0,1 + w(a1, b2) = -3 - 5 = -8

Option 2: S1,2 = S0,2 + w(a1, -) = -6 - 3 = -9

Option 3: S1,2 = S1,1 + w(-, b2) = 8 - 3 = 5

Optimal: S1,2 = 5

         C   G   G   A   T   C   A   T
     0  -3  -6  -9 -12 -15 -18 -21 -24
C   -3   8   ?
T   -6
T   -9
A  -12
A  -15
C  -18
T  -21

Match: 8, Mismatch: -5, Gap symbol: -3

Page 28: Sequence Analysis Methods

S2,1 = ?

Option 1: S2,1 = S1,0 + w(a2, b1) = -3 - 5 = -8

Option 2: S2,1 = S1,1 + w(a2, -) = 8 - 3 = 5

Option 3: S2,1 = S2,0 + w(-, b1) = -6 - 3 = -9

Optimal: S2,1 = 5

         C   G   G   A   T   C   A   T
     0  -3  -6  -9 -12 -15 -18 -21 -24
C   -3   8   5
T   -6   ?
T   -9
A  -12
A  -15
C  -18
T  -21

Match: 8, Mismatch: -5, Gap symbol: -3

Page 29: Sequence Analysis Methods

S2,2 = ?

Option 1: S2,2 = S1,1 + w(a2, b2) = 8 - 5 = 3

Option 2: S2,2 = S1,2 + w(a2, -) = 5 - 3 = 2

Option 3: S2,2 = S2,1 + w(-, b2) = 5 - 3 = 2

Optimal: S2,2 = 3

         C   G   G   A   T   C   A   T
     0  -3  -6  -9 -12 -15 -18 -21 -24
C   -3   8   5
T   -6   5   ?
T   -9
A  -12
A  -15
C  -18
T  -21

Match: 8, Mismatch: -5, Gap symbol: -3

Page 30: Sequence Analysis Methods

S3,5 = ?

         C   G   G   A   T   C   A   T
     0  -3  -6  -9 -12 -15 -18 -21 -24
C   -3   8   5   2  -1  -4  -7 -10 -13
T   -6   5   3   0  -3   7   4   1  -2
T   -9   2   0  -2  -5   ?
A  -12
A  -15
C  -18
T  -21

Page 31: Sequence Analysis Methods

S3,5 = 5; the completed matrix:

         C   G   G   A   T   C   A   T
     0  -3  -6  -9 -12 -15 -18 -21 -24
C   -3   8   5   2  -1  -4  -7 -10 -13
T   -6   5   3   0  -3   7   4   1  -2
T   -9   2   0  -2  -5   5   2  -1   9
A  -12  -1  -3  -5   6   3   0  10   7
A  -15  -4  -6  -8   3   1  -2   8   5
C  -18  -7  -9 -11   0  -2   9   6   3
T  -21 -10 -12 -14  -3   8   6   4  14

Optimal score: S7,8 = 14 (bottom-right cell)

Page 32: Sequence Analysis Methods

Optimal alignment, traced back from S7,8:

C T T A A C - T
C G G A T C A T

         C   G   G   A   T   C   A   T
     0  -3  -6  -9 -12 -15 -18 -21 -24
C   -3   8   5   2  -1  -4  -7 -10 -13
T   -6   5   3   0  -3   7   4   1  -2
T   -9   2   0  -2  -5   5   2  -1   9
A  -12  -1  -3  -5   6   3   0  10   7
A  -15  -4  -6  -8   3   1  -2   8   5
C  -18  -7  -9 -11   0  -2   9   6   3
T  -21 -10 -12 -14  -3   8   6   4  14

8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
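
The column-by-column sum above can be checked mechanically. The following is a small sketch; the helper name `column_scores` is hypothetical, and it assumes the two aligned rows are given as equal-length strings with "-" for a gap.

```python
def column_scores(row_a, row_b, match=8, mismatch=-5, gap=-3):
    """Return the score of each column of a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    scores = []
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            scores.append(gap)
        elif x == y:
            scores.append(match)
        else:
            scores.append(mismatch)
    return scores

cols = column_scores("CTTAAC-T", "CGGATCAT")
print(cols, sum(cols))  # [8, -5, -5, 8, -5, 8, -3, 8] 14
```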

Page 33: Sequence Analysis Methods

Local vs. Global Sequence Alignment:

Example:

DNA sequence a: ATTCTTGC

DNA sequence b: ATCCTATTCTAGC

Local alignment (gaps ignored in local alignments):
         ATTCTTGC
    ATCCTATTCTAGC

Global alignment (gaps counted in global alignments): AT [gap] TCTT [gap] GC
                                                      ATCCTATTCTAGC

Page 34: Sequence Analysis Methods

Global Alignment vs. Local Alignment

• global alignment: all sections are counted

• local alignment: only local sections (normally separated by gaps) are counted

Page 35: Sequence Analysis Methods

An optimal local alignment

• Si,j: the score of an optimal local alignment ending at ai and bj

• With proper initializations, Si,j can be computed as follows.

$$
S_{i,j} = \max \begin{cases}
0 \\
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$
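
A minimal sketch of the local recurrence follows. The function name `local_score_matrix` is hypothetical, and the code assumes the first row and column are all zeros; the best local score is the largest entry anywhere in the matrix, not necessarily the bottom-right cell.

```python
def local_score_matrix(a, b, match=8, mismatch=-5, gap=-3):
    """Fill the local-alignment matrix and track its largest entry."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # assumed zero initialization
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                0,                      # start a new local alignment here
                S[i - 1][j] + gap,
                S[i][j - 1] + gap,
                S[i - 1][j - 1] + sub,
            )
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    return S, best, best_pos

S, best, pos = local_score_matrix("CTTAACT", "CGGATCAT")
print(best, pos)  # 18 (7, 8)
```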

Page 36: Sequence Analysis Methods

Initializations

         C   G   G   A   T   C   A   T
     0   0   0   0   0   0   0   0   0
C    0
T    0
T    0
A    0
A    0
C    0
T    0

Match: 8, Mismatch: -5, Gap symbol: -3

Page 37: Sequence Analysis Methods

S1,1 = ?

Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8

Option 2: S1,1 = S0,1 + w(a1, -) = 0 - 3 = -3

Option 3: S1,1 = S1,0 + w(-, b1) = 0 - 3 = -3

Option 4: S1,1 = 0

Optimal: S1,1 = 8

         C   G   G   A   T   C   A   T
     0   0   0   0   0   0   0   0   0
C    0   ?
T    0
T    0
A    0
A    0
C    0
T    0

Match: 8, Mismatch: -5, Gap symbol: -3

Page 38: Sequence Analysis Methods

Local alignment

         C   G   G   A   T   C   A   T
     0   0   0   0   0   0   0   0   0
C    0   8   5   2   0   0   8   5   2
T    0   5   3   0   0   8   5   3  13
T    0   2   0   0   0   8   5   2  11
A    0   0   0   0   8   5   3   ?
A    0
C    0
T    0

Match: 8, Mismatch: -5, Gap symbol: -3

Page 39: Sequence Analysis Methods

Local alignment

         C   G   G   A   T   C   A   T
     0   0   0   0   0   0   0   0   0
C    0   8   5   2   0   0   8   5   2
T    0   5   3   0   0   8   5   3  13
T    0   2   0   0   0   8   5   2  11
A    0   0   0   0   8   5   3  13  10
A    0   0   0   0   8   5   2  11   8
C    0   8   5   2   5   3  13  10   7
T    0   5   3   0   2  13  10   8  18

The best score: 18 (bottom-right cell)

A - C - T
A T C A T

8 - 3 + 8 - 3 + 8 = 18
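
To recover the local alignment itself, one can trace back from the highest-scoring cell until a zero is reached. The sketch below assumes the same scoring scheme and zero initialization; the function name `local_align` and the preference for the diagonal move when several moves tie are illustrative choices, not something the slides specify.

```python
def local_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best score, aligned part of a, aligned part of b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0, S[i - 1][j - 1] + sub,
                          S[i - 1][j] + gap, S[i][j - 1] + gap)
            if S[i][j] > best:
                best, bi, bj = S[i][j], i, j
    # Trace back from the best cell until the score drops to zero.
    row_a, row_b = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and S[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:      # diagonal: a_i against b_j
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:        # a_i against a gap
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:                                     # gap against b_j
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))

print(local_align("CTTAACT", "CGGATCAT"))  # (18, 'A-C-T', 'ATCAT')
```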

Page 40: Sequence Analysis Methods

BLAST: Basic Local Alignment Search Tool

Procedure:

• Divide all sequences into overlapping constituent words (size k)

• Build the hash table for Sequence a.

• Scan Sequence b for hits.

• Extend hits.
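
The first three steps of the procedure can be sketched as follows. This is a simplified illustration that only finds exact word matches; the function names `build_word_table` and `find_hits` are hypothetical, and the example reuses sequences a and b from the start of the lecture with an assumed word size k = 3.

```python
from collections import defaultdict

def build_word_table(seq, k):
    """Map every overlapping k-letter word of seq to its start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table

def find_hits(table, seq_b, k):
    """Scan seq_b and return (i, j) pairs where a word of a matches a word of b."""
    hits = []
    for j in range(len(seq_b) - k + 1):
        for i in table.get(seq_b[j:j + k], ()):
            hits.append((i, j))
    return hits

table = build_word_table("ATTCTTGC", 3)
print(find_hits(table, "ATCCTATTCTAGC", 3))  # [(0, 5), (1, 6), (2, 7)]
```

All three hits lie on the same diagonal (j - i = 5), which is where the extension step would start.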

Page 41: Sequence Analysis Methods

BLAST: Basic Local Alignment Search Tool

Step 1: Hash table for sequence a

Page 42: Sequence Analysis Methods

Amino acid similarity matrix PAM 120

Instead of using the simple values +8 and -5 for matches and mismatches, this statistically derived score matrix is used to rank the level of similarity between two amino acids.

Page 43: Sequence Analysis Methods

Amino acid similarity matrix PAM 250

This is a more widely used score matrix for ranking the level of similarity of two amino acids. It is derived from more diverse sets of data and a larger number of statistical steps.

Page 44: Sequence Analysis Methods

Amino acid similarity matrix BLOSUM 45

The BLOSUM matrices were calculated using data from the BLOCKS database, which contains alignments of more distantly related proteins. In principle, BLOSUM matrices should be more realistic for comparing distantly related proteins, but may introduce error for conventional proteins.

Page 45: Sequence Analysis Methods

BLAST: Basic Local Alignment Search Tool

Page 46: Sequence Analysis Methods

BLAST: Basic Local Alignment Search Tool

Step 2:

Use all of the 2-letter words in the query sequence to scan against the database sequence and mark those with score > 8

LN:LN = 9
NF:NY = 8
GW:PW = 10

Note:

Marked points can be on the diagonal and off-diagonal

Page 47: Sequence Analysis Methods

BLAST

Step 2: Scan sequence b for hits.

Page 48: Sequence Analysis Methods

BLAST

Step 2: Scan sequence b for hits.

Step 3: Extend hits.

[Figure: a hit being extended in both directions]

Terminate if the score of the extension fades away.

BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
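
One way to picture "terminate if the score fades away" is an ungapped extension that stops once the running score falls too far below the best score seen so far. The sketch below is an illustration under assumptions, not BLAST's actual implementation: the function name `extend_hit`, the simple +8/-5 scoring, and the drop-off threshold `drop=10` are all hypothetical choices.

```python
def extend_hit(a, b, i, j, k, match=8, mismatch=-5, drop=10):
    """Extend the exact word hit a[i:i+k] == b[j:j+k] without gaps.

    Returns (score, start in a, start in b, length) of the best extension.
    """
    def walk(pa, pb, step):
        score = best = best_len = length = 0
        while 0 <= pa < len(a) and 0 <= pb < len(b):
            score += match if a[pa] == b[pb] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > drop:   # score has faded: stop extending
                break
            pa += step
            pb += step
        return best, best_len

    seed = k * match                     # the word itself matches exactly
    right, r_len = walk(i + k, j + k, 1)
    left, l_len = walk(i - 1, j - 1, -1)
    return seed + left + right, i - l_len, j - l_len, k + l_len + r_len

print(extend_hit("ATTCTTGC", "ATCCTATTCTAGC", 0, 5, 3))  # (51, 0, 5, 8)
```

Here the hit ATT is extended to ATTCTTGC against ATTCTAGC: seven matches and one mismatch, 7 × 8 - 5 = 51.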

Page 49: Sequence Analysis Methods

Multiple sequence alignment (MSA)

• The multiple sequence alignment problem is to simultaneously align more than two sequences.

Seq1: GCTC
Seq2: AC
Seq3: GATC

Aligned:

GC-TC
A---C
G-ATC

Page 50: Sequence Analysis Methods

Multiple sequence alignment (MSA)

Page 51: Sequence Analysis Methods

How to score an MSA?

• Sum-of-Pairs (SP-score)

The score of the three-row alignment

GC-TC
A---C
G-ATC

is the sum of the scores of its three pairwise alignments:

Score(GC-TC / A---C) + Score(GC-TC / G-ATC) + Score(A---C / G-ATC)

Page 52: Sequence Analysis Methods

How to score an MSA?

• Sum-of-Pairs (SP-score)

GC-TC
A---C
G-ATC

Score(GC-TC / A---C) = -5 - 3 + 8 - 3 + 8 = 5

Score(GC-TC / G-ATC) = 8 - 3 - 3 + 8 + 8 = 18

Score(A---C / G-ATC) = -5 + 8 - 3 - 3 + 8 = 5

SP-score = 5 + 18 + 5 = 28
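
The SP-score can be sketched in a few lines. Note one assumption taken from the slide's arithmetic: a column where both rows have a gap is scored +8, like any other pair of identical symbols (many formulations score such a column 0 instead). The function names `pair_column` and `sp_score` are hypothetical.

```python
from itertools import combinations

def pair_column(x, y, match=8, mismatch=-5, gap=-3):
    """Score one column of a pairwise projection of the MSA."""
    if x == y:
        return match        # includes gap against gap, as in the slide's sums
    if x == "-" or y == "-":
        return gap
    return mismatch

def sp_score(rows):
    """Sum the pairwise alignment scores over all pairs of rows."""
    return sum(
        sum(pair_column(x, y) for x, y in zip(r1, r2))
        for r1, r2 in combinations(rows, 2)
    )

print(sp_score(["GC-TC", "A---C", "G-ATC"]))  # 28
```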

Page 53: Sequence Analysis Methods

Position Specific Iterated BLAST

• PSI-BLAST is a rather permissive alignment tool, and it can find more distantly related sequences than FASTA or BLAST.

• In many cases, it is much more sensitive to weak but biologically relevant sequence similarities.

Page 54: Sequence Analysis Methods

Position Specific Iterated BLAST

PSI-BLAST is used for:

• Distant homology detection
• Fold assignment: profile-profile comparison
• Domain identification
• Evolutionary analysis (e.g. tree building)
• Sequence annotation / function assignment
• Profile export to other programs
• Sequence clustering
• Structural genomics target selection

Page 55: Sequence Analysis Methods

Position Specific Iterated BLAST

• Collect all database sequence segments that have been aligned with the query sequence with an E-value below a set threshold (default 0.001)

• Construct a position specific scoring matrix for the collected sequences. Rough idea:
  - Align all sequences to the query sequence as the template.
  - Assign weights to the sequences
  - Construct position specific scoring matrix

• Iterate
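
The matrix-construction step can be illustrated with a toy log-odds profile. This is only a sketch under strong simplifying assumptions: equal sequence weights by default, a flat pseudocount, and a uniform background distribution, none of which is how PSI-BLAST itself weights sequences or estimates frequencies. The function name `build_pssm` is hypothetical, and the example uses the small DNA alignment from the MSA slides rather than protein data.

```python
import math
from collections import Counter

def build_pssm(rows, alphabet, weights=None, pseudocount=1.0):
    """Return one {symbol: log-odds score} dict per alignment column."""
    if weights is None:
        weights = [1.0] * len(rows)          # assumption: equal weights
    background = 1.0 / len(alphabet)         # assumption: uniform background
    pssm = []
    for column in zip(*rows):
        counts = Counter()
        for symbol, wt in zip(column, weights):
            if symbol in alphabet:           # gaps are skipped
                counts[symbol] += wt
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            s: math.log2(((counts[s] + pseudocount) / total) / background)
            for s in alphabet
        })
    return pssm

pssm = build_pssm(["GC-TC", "A---C", "G-ATC"], "ACGT")
print(max(pssm[4], key=pssm[4].get))  # C
```

Positive entries mark symbols that are over-represented at a position relative to the background; in the iterative scheme, such a matrix would replace the query sequence in the next database search.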

Page 56: Sequence Analysis Methods

How does PSI-BLAST work?

• Take a sequence:
  MGLLTREIF--ILQQ

• Search for similar sequences in a full sequence database.

• Sequences are multiply aligned:
  FGLGRT-I-T-YMTN
  -GLVRT-I---LGLE
  FGLLRT-I---YMTQ
  MGLLTREIF--ILQQ

• Construct a profile, and represent conservation in each position numerically:
  A 029001100003200
  C 000070000000000
  ..
  Y 002000080202000

• Profile holds more information than a single sequence: use the profile to retrieve additional sequences.

• New sequences in the multiple alignment:
  FGLLRT-I-T-YMTN
  -RLTRD-I---LGLY
  FGLLRT-I---FMTS

• Construct a new profile:
  A 027005101003200
  C 000070000000000
  ..
  Y 202000060202000

After several iterations of this procedure we have:

• Sequence information, including links to annotation

• Several sets of multiple alignments.

• Profiles, derived by us or by PSI-BLAST

• Threshold information (alignment statistics)

Page 57: Sequence Analysis Methods

Consensus sequence

• A sequence where each position is defined by majority vote based on a multiple sequence alignment. Use the consensus sequence for database search.

PEAINYGRFTPFS I KSDVW
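
A majority-vote consensus can be sketched as below. The function name `consensus` is hypothetical, and two details are assumptions rather than anything stated on the slide: a gap counts as a symbol in the vote, and ties are broken by taking the alphabetically smallest symbol. The example uses the small alignment from the MSA slides.

```python
from collections import Counter

def consensus(rows):
    """Pick the most frequent symbol in each column of an alignment."""
    result = []
    for column in zip(*rows):
        counts = Counter(column)
        top = max(counts.values())
        # Assumed tie-break: alphabetically smallest of the most frequent symbols.
        result.append(min(s for s, c in counts.items() if c == top))
    return "".join(result)

print(consensus(["GC-TC", "A---C", "G-ATC"]))  # G--TC
```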

Page 58: Sequence Analysis Methods

Flow chart of PSI-BLAST

• Take a sequence:
  MGLLTREIF--ILQQ

• Search for similar sequences in a full sequence database.

• Sequences are multiply aligned:
  FGLGRT-I-T-YMTN
  -GLVRT-I---LGLE
  FGLLRT-I---YMTQ
  MGLLTREIF--ILQQ

• Construct a profile, and represent conservation in each position numerically:
  A 029001100003200
  C 000070000000000
  ..
  Y 002000080202000

• Profile holds more information than a single sequence: use the profile to search for similar sequences in a full sequence database and retrieve additional sequences.

• New sequences in the multiple alignments:
  FGLLRT-I-T-YMTN
  -RLTRD-I---LGLY
  FGLLRT-I---FMTS

• Construct a new profile:
  A 027005101003200
  C 000070000000000
  ..
  Y 202000060202000

• New iteration: repeat the search with the new profile.

Page 59: Sequence Analysis Methods

PSI-BLAST

NCBI PSI-BLAST tutorial:

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html

Page 67: Sequence Analysis Methods

Use of PSI-BLAST to probe the function of a viral protein

PEAINYGRFTPFS I KSDVW

Page 68: Sequence Analysis Methods

Summary of Today's lecture

• Sequence alignment methods revisited:
  - Pair-wise alignment
  - Multiple sequence alignment
  - BLAST
  - PSI-BLAST

• Use of PSI-BLAST to probe protein function