Post on 05-Jan-2016

26 views

Category:

Documents


0 download

DESCRIPTION

Homology Search Tools. Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National Taiwan University, Taiwan WWW: http://www.csie.ntu.edu.tw/~kmchao. Homology Search Tools. Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987) - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Homology Search Tools

Homology Search Tools

Kun-Mao Chao (趙坤茂)

Department of Computer Science and Information Engineering

National Taiwan University, Taiwan

WWW: http://www.csie.ntu.edu.tw/~kmchao

Page 2: Homology Search Tools

Homology Search Tools

• Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987)

• FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985)

• BLAST (Altschul et al., 1990; Altschul et al., 1997)

• BLAT (Kent, 2002)

• PatternHunter (Li et al., 2004)

Page 3: Homology Search Tools


Finding Exact Word Matches

• Hash Tables

• Suffix Trees

• Suffix Arrays

Page 4: Homology Search Tools

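Hash Tables

Sequence: GATCCATCTT (positions 1 to 10)

[Figure: hash table of the 3-letter words of the sequence, encoded with 2 bits per base (A = 00, C = 01, G = 10, T = 11). Each word points to its start positions in the sequence.]

Word  Code          Positions
AAA   000000 (0)
…
ATC   001101 (13)   2, 6
…
CAT   010011 (19)   5
CCA   010100 (20)   4
…
CTT   011111 (31)   8
…
GAT   100011 (35)   1
…
TCC   110101 (53)   3
TCG   110110 (54)
TCT   110111 (55)   7
…
TTT   111111 (63)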
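The table above can be reproduced with a few lines of code. This is a minimal sketch, assuming the 2-bit encoding read off the slide (A = 00, C = 01, G = 10, T = 11); the function names `encode` and `build_index` are hypothetical.

```python
from collections import defaultdict

# 2-bit code per base, matching the codes shown on the slide.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(word):
    """Map a DNA word to its integer hash code (base-4 number)."""
    value = 0
    for base in word:
        value = value * 4 + CODE[base]
    return value

def build_index(seq, k=3):
    """Return {hash code: [1-based start positions]} for every k-letter word."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[encode(seq[i:i + k])].append(i + 1)
    return index

index = build_index("GATCCATCTT")
print(encode("ATC"), index[encode("ATC")])
```

Looking up a query word is then a single dictionary access on its code, which is what makes hash tables attractive for finding exact word matches.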

Page 5: Homology Search Tools

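Suffix Trees (I)

Sequence: GATCCATCTT (positions 1 to 10)

[Figure: suffix tree of the sequence. Edges carry substring labels (for example ATC, CATCTT, TT, and GATCCATCTT), and the numbers 1 to 10 mark the start position of each suffix.]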

Page 6: Homology Search Tools

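Suffix Trees (II)

Sequence: GATCCATCTT$ (positions 1 to 11, with the terminator $ at position 11)

[Figure: suffix tree of the sequence with the terminator $ appended. Edge labels now end in $ (for example CATCTT$, TT$, and T$), and every suffix, including $ itself at position 11, ends at its own leaf.]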
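To illustrate how such a structure answers exact word queries, here is a minimal sketch. It builds an uncompressed suffix trie in quadratic time rather than a true (compressed, linear-time) suffix tree, so it is a simplification and not the construction the slides depict; the names `build_suffix_trie`, `collect`, and `find` are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a nested-dict trie.

    Each leaf stores the 1-based start position of its suffix under the
    key 'leaf' (a multi-character key, so it cannot clash with an edge).
    """
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root

def collect(node):
    """Gather all suffix start positions stored below a node."""
    positions = []
    for key, child in node.items():
        if key == "leaf":
            positions.append(child)
        else:
            positions.extend(collect(child))
    return positions

def find(root, word):
    """Return the sorted 1-based positions where word occurs."""
    node = root
    for ch in word:
        if ch not in node:
            return []
        node = node[ch]
    return sorted(collect(node))

trie = build_suffix_trie("GATCCATCTT")
print(find(trie, "ATC"))
```

Every occurrence of a word is a prefix of some suffix, so walking the word from the root and collecting the leaves below reports all of its positions.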

Page 7: Homology Search Tools

Suffix Arrays

Sequence: GATCCATCTT (positions 1 to 10)

Sorted suffix   Start position
ATCCATCTT       2
ATCTT           6
CATCTT          5
CCATCTT         4
CTT             8
GATCCATCTT      1
T               10
TCCATCTT        3
TCTT            7
TT              9
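The sorted list above can be built and queried as follows. This is a minimal sketch that sorts the suffixes directly, which is fine for short sequences but not how large suffix arrays are built in practice; the names `suffix_array` and `find` are hypothetical.

```python
def suffix_array(text):
    """Return 1-based suffix start positions in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])

def find(text, sa, word):
    """Binary-search the suffix array for all occurrences of word."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        start = sa[mid] - 1
        # Compare only the first len(word) letters of the suffix.
        if text[start:start + len(word)] < word:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # Suffixes that start with word are contiguous in the array.
    while lo < len(sa) and text.startswith(word, sa[lo] - 1):
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)

text = "GATCCATCTT"
sa = suffix_array(text)
print(sa)
print(find(text, sa, "ATC"))
```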

Page 8: Homology Search Tools


FASTA

1) Find runs of identities, and identify regions with the highest density of identities.

2) Re-score using a PAM matrix, and keep the top-scoring segments.

3) Eliminate segments that are unlikely to be part of the alignment.

4) Optimize the alignment in a band.

Page 9: Homology Search Tools

FASTA

Step 1: Find runs of identities, and identify regions with the highest density of identities.

[Figure: Sequence A plotted against Sequence B]
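Step 1 can be sketched by looking up short words (k-tuples) of Sequence B in a table built from Sequence A and counting the matches on each diagonal i - j; diagonals with many matches mark regions with a high density of identities. This is a simplified illustration, not the FASTA implementation; the function name `diagonal_hit_counts` and the example sequences are assumptions.

```python
from collections import Counter, defaultdict

def diagonal_hit_counts(seq_a, seq_b, k=3):
    """Count k-tuple matches per diagonal (diagonal = i - j, 0-based)."""
    positions = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        positions[seq_a[i:i + k]].append(i)
    counts = Counter()
    for j in range(len(seq_b) - k + 1):
        # .get avoids inserting empty entries for words absent from seq_a.
        for i in positions.get(seq_b[j:j + k], ()):
            counts[i - j] += 1
    return counts

counts = diagonal_hit_counts("GATCCATCTT", "ATCCAT")
print(counts.most_common(2))
```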

Page 10: Homology Search Tools

FASTA

Step 2: Re-score using a PAM matrix, and keep the top-scoring segments.

Page 11: Homology Search Tools

FASTA

Step 3: Eliminate segments that are unlikely to be part of the alignment.

Page 12: Homology Search Tools


FASTA

Step 4: Optimize the alignment in a band.
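A banded alignment fills only the dynamic-programming cells near a chosen diagonal. The sketch below computes a local alignment score inside such a band. It assumes a linear gap penalty of -6 and the +5/-4 DNA scores (all illustrative choices, whereas FASTA would score proteins with a PAM matrix), and the names `banded_local_score`, `center`, and `width` are hypothetical.

```python
NEG = float("-inf")

def banded_local_score(a, b, center, width, match=5, mismatch=-4, gap=-6):
    """Best local alignment score using only cells (i, j), 1-based,
    with |(i - j) - center| <= width."""
    prev = {}   # scores of row i - 1, keyed by column j
    best = 0
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i - center - width)
        hi = min(len(b), i - center + width)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                0,                           # start a new local alignment
                prev.get(j - 1, 0) + s,      # diagonal move (0 on the border)
                prev.get(j, NEG) + gap,      # gap in b; outside the band = -inf
                cur.get(j - 1, NEG) + gap,   # gap in a; outside the band = -inf
            )
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best

print(banded_local_score("ACGTACGT", "ACGTTACGT", center=0, width=1))
```

With `width=0` the band collapses to a single diagonal and no gap can be used; widening it to 1 lets the alignment step over the extra T in the second sequence.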

Page 13: Homology Search Tools

BLAST

Basic Local Alignment Search Tool (by Altschul, Gish, Miller, Myers and Lipman)

The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.

Page 14: Homology Search Tools

The maximal segment pair measure

A maximal segment pair (MSP) is defined to be the highest-scoring pair of identical-length segments chosen from two sequences. (For DNA: identities +5; mismatches -4.)

[Figure: the highest-scoring pair]

• The MSP score may be computed in time proportional to the product of the lengths of the two sequences. (How?) An exact procedure is too time-consuming.

• BLAST heuristically attempts to calculate the MSP score.
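One way to answer the "How?" above: a segment pair has no gaps, so it lies on a single diagonal, and the best segment on each diagonal is a maximum-sum run of +5/-4 values. Scanning every diagonal once visits each cell once, giving time proportional to the product of the lengths. The following is a minimal sketch under that reading; the function name `msp_score` and the example sequences are assumptions.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Return (score, start_a, start_b, length), 0-based, of a
    highest-scoring ungapped segment pair."""
    best = (0, 0, 0, 0)
    # Diagonal d holds the cells with i - j == d.
    for d in range(-(len(b) - 1), len(a)):
        i, j = max(d, 0), max(-d, 0)
        run, run_start = 0, (i, j)
        while i < len(a) and j < len(b):
            if run <= 0:
                # A non-positive prefix never helps: restart the run here.
                run, run_start = 0, (i, j)
            run += match if a[i] == b[j] else mismatch
            if run > best[0]:
                best = (run, run_start[0], run_start[1], i - run_start[0] + 1)
            i += 1
            j += 1
    return best

print(msp_score("AGATCGAT", "TTGATCGAAC"))
```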

Page 15: Homology Search Tools

A matrix of similarity scores

[Figure: matrix of similarity scores between two DNA sequences. Each cell holds +5 where the two bases are identical and -4 where they differ.]

Page 16: Homology Search Tools

A maximum-scoring segment

[Figure: the same matrix of +5/-4 similarity scores, with a maximum-scoring segment marked along one diagonal.]

Page 17: Homology Search Tools


BLAST

1) Build the hash table for Sequence A.

2) Scan Sequence B for hits.

3) Extend hits.

Page 18: Homology Search Tools

BLAST

Step 1: Build the hash table for Sequence A. (3-tuple example)

For DNA sequences:

Seq. A = AGATCGAT (positions 1 to 8)

Word  Positions
AAA
AAC
..
AGA   1
..
ATC   3
..
CGA   5
..
GAT   2, 6
..
TCG   4
..
TTT

For protein sequences:

Seq. A = ELVIS

Add xyz to the hash table if Score(xyz, ELV) ≥ T;
Add xyz to the hash table if Score(xyz, LVI) ≥ T;
Add xyz to the hash table if Score(xyz, VIS) ≥ T.
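The protein rule above can be sketched as a brute-force enumeration of all 3-letter words. Real BLAST scores words with a substitution matrix such as PAM or BLOSUM; to stay self-contained, this sketch uses a made-up toy score (identity +4, same hypothetical residue group +1, otherwise -2). The groups, the scores, and the names `toy_score` and `neighborhood_table` are all assumptions for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical similarity groups; they only make the toy score non-trivial.
GROUPS = ["ILVM", "FWY", "KRH", "DE", "STNQ", "AG", "C", "P"]
GROUP_OF = {aa: g for g, members in enumerate(GROUPS) for aa in members}

def toy_score(x, y):
    """Toy residue score (NOT a real substitution matrix)."""
    if x == y:
        return 4
    return 1 if GROUP_OF[x] == GROUP_OF[y] else -2

def word_score(u, v):
    return sum(toy_score(x, y) for x, y in zip(u, v))

def neighborhood_table(seq, w=3, threshold=9):
    """Map each word xyz with Score(xyz, query word) >= threshold
    to the 1-based positions of the query words it neighbors."""
    table = {}
    for i in range(len(seq) - w + 1):
        query_word = seq[i:i + w]
        for letters in product(AMINO_ACIDS, repeat=w):
            word = "".join(letters)
            if word_score(word, query_word) >= threshold:
                table.setdefault(word, []).append(i + 1)
    return table

table = neighborhood_table("ELVIS")
print(sorted(w for w, pos in table.items() if 1 in pos))
```

Raising the threshold T shrinks the table toward the exact words only, while lowering it admits more neighbors and therefore more hits to extend.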

Page 19: Homology Search Tools

BLAST

Step 2: Scan Sequence B for hits.
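Scanning amounts to sliding a window over Sequence B and looking each word up in the table built from Sequence A. The sketch below uses exact DNA words and the Seq. A example AGATCGAT; Sequence B and the names `build_word_table` and `scan_for_hits` are assumptions.

```python
from collections import defaultdict

def build_word_table(seq_a, w=3):
    """Map each w-letter word of Sequence A to its 1-based positions."""
    table = defaultdict(list)
    for i in range(len(seq_a) - w + 1):
        table[seq_a[i:i + w]].append(i + 1)
    return table

def scan_for_hits(seq_a, seq_b, w=3):
    """Return hits as (position in A, position in B), both 1-based."""
    table = build_word_table(seq_a, w)
    hits = []
    for j in range(len(seq_b) - w + 1):
        for i in table.get(seq_b[j:j + w], ()):
            hits.append((i, j + 1))
    return hits

print(scan_for_hits("AGATCGAT", "TTGATCGAAC"))
```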

Page 20: Homology Search Tools

BLAST

Step 2: Scan Sequence B for hits.

Step 3: Extend hits.

[Figure: a hit being extended in both directions]

Terminate if the score of the extension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.)

BLAST 2.0 saves the time spent in extension, and considers gapped alignments.

Page 21: Homology Search Tools

Gapped BLAST (I)

The two-hit method
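In Gapped BLAST (Altschul et al., 1997), an extension is triggered only when two non-overlapping word hits lie on the same diagonal within a fixed distance of each other. A minimal sketch of that filter follows; the names `two_hit_seeds` and `window`, and the example hits, are assumptions.

```python
def two_hit_seeds(hits, w, window):
    """Keep hits that follow an earlier, non-overlapping hit on the
    same diagonal within `window` positions.

    hits: iterable of (i, j) word-hit start positions in A and B.
    """
    last_on_diagonal = {}
    seeds = []
    for i, j in sorted(hits, key=lambda h: h[1]):   # scan along B
        d = i - j
        prev = last_on_diagonal.get(d)
        if prev is not None:
            distance = j - prev
            if distance < w:
                continue            # overlaps the most recent hit: ignore it
            if distance <= window:
                seeds.append((i, j))
        last_on_diagonal[d] = j
    return seeds

hits = [(0, 0), (1, 1), (5, 5), (20, 3), (40, 40)]
print(two_hit_seeds(hits, w=3, window=10))
```

Requiring two hits lets the word threshold be lowered for sensitivity while still cutting the number of extensions, which is where the time saving comes from.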

Page 22: Homology Search Tools

Gapped BLAST (II)

Confining the dynamic programming

[Figure labels: HSP with score at least Sq; seed residue pair; region confined by Xq]

Page 23: Homology Search Tools

BLAT

[Figure labels: database, index, query]

Page 24: Homology Search Tools


PatternHunter (I)

Page 25: Homology Search Tools

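PatternHunter (II)

Sequence: GATCCATCTT (positions 1 to 10)

[Figure: hash table of 3-letter words sampled from the sequence, encoded with 2 bits per base (A = 00, C = 01, G = 10, T = 11). Each word points to its start position in the sequence.]

Word  Code          Position
AAA   000000 (0)
…
ATC   001101 (13)   2
ATG   001110 (14)
ATT   001111 (15)   6
…
CAC   010001 (17)   5
…
CCT   010111 (23)   4
…
GAC   100001 (33)   1
…
TCA   110100 (52)   3
…
TCT   110111 (55)   7
…
TTT   111111 (63)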
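The words in this table are not contiguous substrings of the sequence; they are consistent with sampling each window of four letters through the spaced pattern 1101, that is, keeping letters 1, 2, and 4 and skipping letter 3. That pattern is inferred from the listed words rather than stated on the slide. A minimal sketch under that assumption, with the hypothetical function name `spaced_index`:

```python
from collections import defaultdict

def spaced_index(seq, seed="1101"):
    """Index a sequence with a spaced seed.

    '1' marks a position that must match, '0' a position that is skipped.
    Returns {sampled word: [1-based window start positions]}.
    """
    care = [k for k, flag in enumerate(seed) if flag == "1"]
    index = defaultdict(list)
    for i in range(len(seq) - len(seed) + 1):
        word = "".join(seq[i + k] for k in care)
        index[word].append(i + 1)
    return index

index = spaced_index("GATCCATCTT")
for word in sorted(index):
    print(word, index[word])
```

A hit under this seed tolerates a mismatch at the skipped position, which a contiguous 3-letter word of the same weight would not.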

Page 26: Homology Search Tools


Remarks

• Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments.

• The idea of filtration was used in FASTA, BLAST, BLAT, and PatternHunter.