Post on 11-Jan-2016

Page 1: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

Local and Global Alignment, Database Searching With Blast

Original Presentations:

Hugues Sicotte

National Center for Biotechnology Information

[email protected]

Adaptation:

Alan Durham

University of São Paulo

[email protected]

Page 2: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

Alignment definition and types:

Alignment: each base is used at most once.

Global alignment: all bases are aligned with another base or with a gap (written “-” or sometimes “.”).

Local alignments: do not need to align all the bases in all sequences.

Align BILLGATESLIKESCHEESE and GRATEDCHEESE

G-ATESLIKESCHEESE    or    G-ATES & CHEESE
GRATED-----CHEESE          GRATED & CHEESE

Page 3: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

C O M P A R A T I V E A N A L Y S I S

GATTATACCA
GATTA---CA

Insertions and deletions (‘indels’) are represented by gaps in alignments. Here there is one gap of length 3.

Page 4: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

Score and Statistics

G-ATESLIKESCHEESE    and/or    G-ATES & CHEESE
GRATED-----CHEESE              GRATED & CHEESE

Percent identity: can be misleading.

Score: a simple quality measure is the “score”. The score assigns points for each aligned base (or gap) of the alignment:

identical bases: “match” score

mismatching bases: “mismatch” score

gaps (optional): “gap opening” penalty for starting a gap, plus a “gap extension” penalty for each gap symbol

Example: match = +1, mismatch = -1, gap opening = -5, gap extension = -1

The long alignment on the left has 10 matches, 1 mismatch, one gap of length 1 and one gap of length 5:

Score = 10*(+1) + 1*(-1) + (-5 + 1*(-1)) + (-5 + 5*(-1)) = -7
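The scoring rule above can be sketched in a few lines of Python. This is only an illustration under the slide's example parameters; the function name `alignment_score` and its argument names are made up for this sketch, and adjacent gap symbols are treated as one gap regardless of which sequence they fall in.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two gapped strings of equal length, column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    in_gap = False  # True while we are inside a run of gap columns
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            if not in_gap:
                score += gap_open   # charged once, when the gap starts
            score += gap_extend     # charged for every gap symbol
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


print(alignment_score("G-ATESLIKESCHEESE", "GRATED-----CHEESE"))  # -7
```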

Page 5: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

S C O R I N G S Y S T E M S

Which alignment is “better”?

Alignment 1: 3 mismatches, 1 gap

GCTACTAGTT------CGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC

Alignment 2: 0 mismatches, 5 gaps

GCTACTAG-T-T--CGC-T-TAGC
GCTACTAGCTCTAGCGCGTATAGC

Page 6: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

S C O R I N G S Y S T E M S

High penalty for “opening” a gap (e.g. G = 5)

Lower penalty for “extending” a gap (e.g. L = 1)

GCTACTAGTT------CGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC

1 gap, 6 gap symbols: Penalty = 1G + 6L = 11

GCTACTAG-T-T--CGC-T-TAGC
GCTACTAGCTCTAGCGCGTATAGC

5 gaps, 6 gap symbols: Penalty = 5G + 6L = 31
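As a quick check of the two penalties, the gap runs can be counted directly. The helper name `gap_penalty` is hypothetical; G and L are passed as positive penalties, as on the slide.

```python
import re


def gap_penalty(aligned, gap_open=5, gap_extend=1):
    """Penalty = G * (number of gaps) + L * (number of gap symbols)."""
    runs = re.findall(r"-+", aligned)  # each run of '-' is one gap
    return gap_open * len(runs) + gap_extend * sum(len(run) for run in runs)


print(gap_penalty("GCTACTAGTT------CGCTTAGC"))  # 1 gap, 6 symbols -> 11
print(gap_penalty("GCTACTAG-T-T--CGC-T-TAGC"))  # 5 gaps, 6 symbols -> 31
```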

Page 7: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

L O C A L S I M I L A R I T Y

Figure 7.3: Protein modules in coagulation factor XII (F12) and tissue plasminogen activator (PLAT)

F12:   F2  E  F1  E  K  Catalytic
PLAT:  F1  E  K  K  Catalytic

F1, F2: fibronectin repeats
E: EGF similarity domain
K: kringle domain
Catalytic: serine protease activity

Mix-and-match protein modules confound alignment algorithms

Page 8: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

L O C A L S I M I L A R I T Y

Figure 7.3 (same diagram): modules in reverse order

Page 9: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

L O C A L S I M I L A R I T Y

Figure 7.3 (same diagram): repeated modules

Page 10: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D O T P L O T S

Figure 7.4

Dot plot: Fitch, Biochem. Genet. (1969) 3, 99-108

    C G T A C C G T
A   0 0 0 1 0 0 0 0
C   1 0 0 0 1 1 0 0
G   0 1 0 0 0 0 1 0
T   0 0 1 0 0 0 0 1

Horizontal axis is coordinates for one sequence

Vertical axis is coordinates for the other
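A dot matrix of this kind takes one line to compute. The sketch below uses the slide's two short sequences (CGTACCGT on the horizontal axis, ACGT on the vertical axis); the function name `dot_matrix` is an illustrative choice.

```python
def dot_matrix(horizontal, vertical):
    """Return rows of 0/1: 1 where the two letters are identical."""
    return [[1 if h == v else 0 for h in horizontal] for v in vertical]


for letter, row in zip("ACGT", dot_matrix("CGTACCGT", "ACGT")):
    print(letter, *row)
```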

Page 11: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D O T P L O T S

Figure 7.4b

Dot plot: Fitch, Biochem. Genet. (1969) 3, 99-108

Can also score not one position at a time, but in a sliding window. For example, a window of 3 nucleotides, where we score 1 for identical triplets and 0 for all other combinations, yields:

       CGT GTA TAC ACC CCG CGT
ACG     0   0   0   0   0   0
CGT     1   0   0   0   0   1

Horizontal axis is coordinates for one sequence

Vertical axis is coordinates for the other
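The windowed version compares overlapping words instead of single letters. A minimal sketch, assuming exact-match scoring of whole windows as described on the slide (the name `window_dot_matrix` is hypothetical):

```python
def window_dot_matrix(horizontal, vertical, window=3):
    """Score 1 where two windows of the given size are identical, else 0."""
    h_words = [horizontal[i:i + window] for i in range(len(horizontal) - window + 1)]
    v_words = [vertical[j:j + window] for j in range(len(vertical) - window + 1)]
    return [[1 if h == v else 0 for h in h_words] for v in v_words]


for row in window_dot_matrix("CGTACCGT", "ACGT"):
    print(*row)
```

With a window of 3 the isolated single-letter matches disappear and only the two CGT words remain.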

Page 12: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D O T P L O T S

[Figure 7.4: dot plot of coagulation factor XII (F12, horizontal axis) against tissue plasminogen activator (PLAT, vertical axis)]

Horizontal axis is coordinates for one sequence

Vertical axis is coordinates for the other

Page 13: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D O T P L O T S

[Figure 7.4: dot plot of coagulation factor XII (F12, horizontal axis) against tissue plasminogen activator (PLAT, vertical axis), with the modules F1, F2, E, K and Catalytic marked along both axes]

Plot dots for high similarity within a short window

Adjacent dots merge to form diagonal segments

Page 14: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D O T P L O T S

[Figure 7.4: dot plot of coagulation factor XII (F12, horizontal axis) against tissue plasminogen activator (PLAT, vertical axis), with the modules F1, F2, E, K and Catalytic marked along both axes]

Repeated domains show a characteristic pattern

Page 15: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

P A T H G R A P H S

Figure 7.5

[Figure 7.5: path graph with PLAU residues 90-137 on one axis and PLAT residues 23-72 on the other]

PLAU 90 EPKKVKDHCSKHSPCQKGGTCVNMP--SGPH-CLCPQHLTGNHCQKEK---CFE 137
PLAT 23 ELHQVPSNCD----CLNGGTCVSNKYFSNIHWCNCPKKFGGQHCEIDKSKTCYE 72

EGF similarity domains of urokinase plasminogen activator (PLAU) and tissue plasminogen activator (PLAT)

Dot plots suggest paths through the alignment space

Path graphs are more explicit representations

Each path is a unique alignment

Page 16: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G

Dynamic Programming Example

Construct an optimal alignment of these two sequences:

GATACTA
GATTACCA

Using these scoring rules:

Match: +1
Mismatch: -1
Gap: -1

Page 17: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G

[Figure: lattice with GATACTA along one axis and GATTACCA along the other]

Arrange the sequence residues along a two-dimensional lattice

Vertices of the lattice fall between letters

Page 18: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G

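[Figure: lattice with GATACTA along one axis and GATTACCA along the other]

The goal is to find the optimal path from one corner of the lattice (before the first letter of both sequences) to the opposite corner (after the last letter of both sequences)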

Page 19: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G

[Figure: several alternative paths through the lattice for GATACTA and GATTACCA]

Each path corresponds to a unique alignment

Which one is optimal?

Page 20: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G

[Figure: lattice for GATACTA and GATTACCA with one diagonal edge highlighted]

The score for a path is the sum of its incremental edge scores

A aligned with A: Match = +1

Page 21: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G

[Figure: lattice for GATACTA and GATTACCA with one diagonal edge highlighted]

The score for a path is the sum of its incremental edge scores

A aligned with T: Mismatch = -1

Page 22: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G

[Figure: lattice for GATACTA and GATTACCA with one horizontal and one vertical edge highlighted]

The score for a path is the sum of its incremental edge scores

T aligned with NULL: Gap = -1

NULL aligned with T: Gap = -1

Page 23: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G

[Figure: lattice for GATACTA and GATTACCA with the scores of the first vertices filled in, starting from 0 in the corner]

Incrementally extend the path

Page 24: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G

[Figure: lattice for GATACTA and GATTACCA with a few more vertex scores filled in]

Incrementally extend the path

Remember the best sub-path leading to each point on the lattice

Page 25: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G

[Figure: lattice for GATACTA and GATTACCA with a few more vertex scores filled in]

Incrementally extend the path

Remember the best sub-path leading to each point on the lattice

Page 26: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G

[Figure: lattice for GATACTA and GATTACCA with a few more vertex scores filled in]

Incrementally extend the path

Remember the best sub-path leading to each point on the lattice

Page 27: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G

[Figure: lattice for GATACTA and GATTACCA with the scores near the starting corner filled in]

Incrementally extend the path

Remember the best sub-path leading to each point on the lattice

Page 28: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G

[Figure: lattice for GATACTA and GATTACCA with roughly half of the vertex scores filled in]

Incrementally extend the path

Remember the best sub-path leading to each point on the lattice

Page 29: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G

Incrementally extend the path

Remember the best sub-path leading to each point on the lattice

Completed score lattice (GATACTA across the top, GATTACCA down the side; each entry is the best score of any path reaching that vertex):

         G   A   T   A   C   T   A
     0  -1  -2  -3  -4  -5  -6  -7
G   -1  +1   0  -1  -2  -3  -4  -5
A   -2   0  +2  +1   0  -1  -2  -3
T   -3  -1  +1  +3  +2  +1   0  -1
T   -4  -2   0  +2  +2  +1  +2  +1
A   -5  -3  -1  +1  +3  +2  +1  +3
C   -6  -4  -2   0  +2  +4  +3  +2
C   -7  -5  -3  -1  +1  +3  +3  +2
A   -8  -6  -4  -2   0  +2  +2  +4

Page 30: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G

[Figure: completed score lattice for GATACTA and GATTACCA with the optimal path highlighted]

Trace-back to get optimal path and alignment

Page 31: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G

Print out the alignment

GA-TACTA
GATTACCA
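The whole walkthrough (fill the lattice, remember the best sub-path score at each vertex, trace back) can be sketched as follows. This is a minimal illustration with the slide's scoring rules (match +1, mismatch -1, gap -1), not a reference implementation; the function name and the tie-breaking order in the trace-back are choices made for this sketch, so when several alignments are equally good it may print a different one than the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment: return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: a aligned with gaps only
        score[i][0] = i * gap
    for j in range(1, cols):          # first row: b aligned with gaps only
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each vertex keeps the best score of any path reaching it.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the far corner to the origin.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, top, bottom = needleman_wunsch("GATACTA", "GATTACCA")
print(best)
print(top)
print(bottom)
```

For the two example sequences the optimal score is +4 (six matches, one mismatch, one gap).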

Page 32: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

Two different types of alignment

Global alignment methods:

Needleman & Wunsch, J. Mol. Biol. (1970) 48, 443-453: the problem of finding the best path. Revelation: any partial sub-path that ends at a point along the true optimal path must itself be the optimal path leading to that point. This provides a method to build a matrix of path “scores”, where each entry is the score of the best path leading to that point. Trace the optimal path from one end of the two sequences to the other.

Local alignment methods:

Smith & Waterman, J. Mol. Biol. (1981) 147, 195-197: use Needleman & Wunsch, but report all non-overlapping paths, starting at the highest-scoring points in the path graph.

FASTP: Lipman & Pearson, Science (1985) 227, 1435-1441.

BLAST: Altschul et al., J. Mol. Biol. (1990) 215, 403-410: do not report all overlapping paths; only attempt to find paths where there are high-scoring words. This speeds up the alignments considerably.

Page 33: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

G L O B A L & L O C A L S I M I L A R I T Y

Implementations of dynamic programming for global and local similarities

Optimal global alignment

Needleman & Wunsch (1970)

Sequences align essentially from end to end

Optimal local alignment

Smith & Waterman (1981)

Sequences align only in small, isolated regions

Page 34: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

Local and Global Alignments: computing the matrix

global alignment: as shown in previous slides

local alignment: change the computation so that negative values are never stored

– when the value of a cell would be negative, set it to zero (this means starting another path)

– the best local alignment comes from the entry in the matrix with the maximum value

semi-global alignment:

– good for assembly

– ignore gaps at the end and at the beginning of sequences

– to ignore gaps at the beginning of the alignment: zeroes in first column and first row

– to ignore gaps at the end of the alignment: pick the maximum value of the last row and last column
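The three variants differ only in how the matrix is initialised, whether cells are clamped at zero, and where the final score is read. A score-only sketch under the same simple scoring rules as before (match +1, mismatch -1, gap -1); the function name `best_score` and the `mode` strings are hypothetical:

```python
def best_score(a, b, mode="global", match=1, mismatch=-1, gap=-1):
    """Best alignment score for mode 'global', 'local' or 'semiglobal'."""
    rows, cols = len(a) + 1, len(b) + 1
    s = [[0] * cols for _ in range(rows)]
    if mode == "global":
        # Leading gaps are penalised only in the global case.
        for i in range(1, rows):
            s[i][0] = i * gap
        for j in range(1, cols):
            s[0][j] = j * gap
    best_cell = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = s[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, s[i - 1][j] + gap, s[i][j - 1] + gap)
            if mode == "local":
                cell = max(cell, 0)   # a negative value means: start a new path
            s[i][j] = cell
            best_cell = max(best_cell, cell)
    if mode == "local":
        return best_cell              # maximum entry anywhere in the matrix
    if mode == "semiglobal":
        # Trailing gaps are free: best value in the last row or last column.
        return max(max(s[-1]), max(row[-1] for row in s))
    return s[-1][-1]                  # global: far corner of the matrix


print(best_score("GATACTA", "GATTACCA", "global"))
print(best_score("GATACT", "GATTACCA", "semiglobal"))
```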

Page 35: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D Y N A M I C P R O G R A M M I N G: semi-global alignment

Sequences: GATACT and GATTACCA (we eliminated the last symbol from one of the sequences)

• fill first row and first column with zeroes

• choose the best score from the scores in the last row and last column

• in this problem, 2 solutions

[Figure: score lattice for GATACT and GATTACCA with zeroes in the first row and first column]

Page 36: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

Match/mismatch scores and Statistics

• Some amino acid mutations do not affect structure/function very much. Amino acids with similar physico-chemical and steric properties can often replace each other.

• A scoring system should therefore not penalize mutations to similar amino acids very much.

• PAM matrices: Point Accepted Mutation. Defined in PAM units, where 1 PAM corresponds to a divergence of 1 percent (one accepted point mutation per 100 residues). For distant sequences use PAM250, while for closer sequences use PAM100. Some sites accumulate mutations and others don’t, so using the PAM100 matrix doesn’t mean that the sequences compared were 100% mutated.

• BLOSUM: BLOCKS substitution matrices. Built from the BLOCKS database of multiple alignments, using only sufficiently distant sequences. BLOSUM62 means that the proteins compared were never closer than 62% identity. BLOSUM50 matrices were built from alignments of more distant sequences. BLOSUM matrices (BLOSUM62) are recommended for most protein alignments.

Page 38: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

Alignment methods

[Figure: dot plot with the query sequence on the horizontal axis and the subject sequence on the vertical axis]

Sequence alignment representation using a dot plot.

For a query of N letters against a subject sequence of M letters, it requires M x N comparisons.

Page 39: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

H A S H I N G M E T H O D S

Hashing is a common method for accelerating database searches

Query sequence: MLIIKRDELVISWASHERE

All overlapping words of size 3:

MLI LII IIK IKR KRD RDE DEL ELV LVI VIS ISW SWA WAS ASH SHE HER ERE

Compile a “dictionary” of words from the query sequence. Put each word in a look-up table that points to its original position in the sequence. Thus, given one word, you can know whether it is in the query in a single operation.
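A Python dictionary is a convenient stand-in for the look-up table. This is a sketch of the idea only; the function name `build_word_table` is hypothetical and positions are counted from 0.

```python
from collections import defaultdict


def build_word_table(query, word_size=3):
    """Map every overlapping word of the query to the positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(query) - word_size + 1):
        table[query[pos:pos + word_size]].append(pos)
    return table


table = build_word_table("MLIIKRDELVISWASHERE")
print(len(table))        # number of distinct words in the query
print(table["KRD"])      # start position(s) of KRD in the query
print("QQQ" in table)    # a single look-up tells us the word is absent
```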

Page 40: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

Index lookup

Each word is assigned a unique integer.

E.g. for a word of 3 letters made up of an alphabet of 20 letters:

1. Assign a code to each letter: Code(l), from 0 to 19

2. For a word of 3 letters L1 L2 L3 the code is

index = Code(L1)*20^2 + Code(L2)*20^1 + Code(L3)

3. Have an array, indexed by that integer, with a list of the positions in the query sequence that have that word.

[Figure: array indexed by word code (0, 1, 2, 3, ...), each entry pointing to the positions of that word in the query sequence]
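The index formula is just a base-20 number. In the sketch below the letter ordering in `AMINO_ACIDS` is an arbitrary assumption (any fixed assignment of codes 0 to 19 works), and `word_index` is a name chosen for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # assumed ordering; any fixed order works
CODE = {letter: i for i, letter in enumerate(AMINO_ACIDS)}


def word_index(word):
    """Read the word as a base-20 number: Code(L1)*20^2 + Code(L2)*20 + Code(L3)."""
    index = 0
    for letter in word:
        index = index * len(AMINO_ACIDS) + CODE[letter]
    return index


print(word_index("AAA"))   # smallest index: 0
print(word_index("YYY"))   # largest index: 20**3 - 1 = 7999
print(word_index("MLI"))
```

Because every 3-letter word maps to a distinct integer below 8000, a plain array of 8000 position lists can replace the dictionary.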

Page 41: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

H A S H I N G M E T H O D S

Building the dictionary for the query sequence requires (N-2) operations.

Query sequence: MLIIKRDELVISWASHERE

All overlapping words of size 3:

MLI LII IIK IKR KRD RDE DEL ELV LVI VIS ISW SWA WAS ASH SHE HER ERE

The database contains (M-2) words, and it takes only one operation to see whether a word was in the query.

Page 42: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

H A S H I N G M E T H O D S

[Figure: dot plot with the query sequence on the horizontal axis and the subject sequence on the vertical axis, showing word hits]

Scan the subject, looking up words in the dictionary

Use word hits to determine where to search for alignments

This fills the dynamic programming matrix in (N-2)+(M-2) operations instead of M x N.
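Scanning the subject against the query dictionary might look like the sketch below. The subject string `AAKRDELAA` is invented for the example, and recording the diagonal as subject position minus query position is a convention chosen here; hits on the same diagonal belong to the same ungapped alignment.

```python
def word_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos, diagonal) for every shared word."""
    table = {}
    for q in range(len(query) - word_size + 1):          # (N-2) insertions
        table.setdefault(query[q:q + word_size], []).append(q)
    hits = []
    for s in range(len(subject) - word_size + 1):        # (M-2) look-ups
        for q in table.get(subject[s:s + word_size], []):
            hits.append((q, s, s - q))
    return hits


for hit in word_hits("MLIIKRDELVISWASHERE", "AAKRDELAA"):
    print(hit)
```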

Page 43: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

Blast: extending good hits

blast pre-processes the target sequence set

lists of hits for each possible word

– 3-tuples for proteins: 20^3 = 8000 different words

– for each word, find which ones have a “good match”

• score threshold of 13 in the old version, 11 in the new version

for the “good ones”, get the list of sequences in the database that have it

Blast (old): extend the match both ways while the score is increasing

Blast2 (new):

– when two words are found in the same “diagonal” within a “short” distance, extend an un-gapped alignment

– continue the extension like the old blast

get local alignments with score greater than a cutoff score

perform SW (Smith & Waterman) on the best candidates
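The un-gapped extension step can be illustrated with a toy version. This is a sketch under loud assumptions, not BLAST's actual code: it uses plain match/mismatch scores instead of a substitution matrix, and it stops when the running score falls more than `drop` below the best score seen so far (the slide only says "while score is increasing"; the `drop` tolerance is an assumption added so that a single mismatch does not end the extension). The subject string is invented for the example.

```python
def extend_hit(query, subject, q_pos, s_pos, word_size=3,
               match=1, mismatch=-1, drop=2):
    """Extend a word hit in both directions without gaps.

    Returns (best_score, query_start, query_end) with a half-open range.
    """
    def pair(i, j):
        return match if query[i] == subject[j] else mismatch

    best = sum(pair(q_pos + k, s_pos + k) for k in range(word_size))
    q_start, q_end = q_pos, q_pos + word_size

    # Extend to the right of the word.
    running = best
    i, j = q_pos + word_size, s_pos + word_size
    while i < len(query) and j < len(subject) and best - running <= drop:
        running += pair(i, j)
        i += 1
        j += 1
        if running > best:
            best, q_end = running, i

    # Extend to the left of the word, starting from the best score so far.
    running = best
    i, j = q_pos - 1, s_pos - 1
    while i >= 0 and j >= 0 and best - running <= drop:
        running += pair(i, j)
        if running > best:
            best, q_start = running, i
        i -= 1
        j -= 1
    return best, q_start, q_end


query = "MLIIKRDELVISWASHERE"
score, start, end = extend_hit(query, "AAKRDELAA", q_pos=4, s_pos=2)
print(score, query[start:end])
```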

Page 44: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

H A S H I N G M E T H O D S

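[Figure: dot plot with the query sequence on the horizontal axis and the database sequence on the vertical axis, showing word hits and their extensions]

Scan the database, looking up words in the dictionary

Use word hits to determine where to search for alignments

BLAST extends from word hits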

Page 45: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

H A S H I N G M E T H O D S

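[Figure: dot plot with the query sequence on the horizontal axis and the database sequence on the vertical axis, showing pairs of word hits on the same diagonal]

Scan the database, looking up words in the dictionary

Use word hits to determine where to search for alignments

BLAST2 extends pairs in the same diagonal first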

Page 46: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

Database Search Space

[Figure: query sequence on the horizontal axis, concatenated database sequences on the vertical axis]

The simplest database search can be viewed as one large dynamic programming problem, with all the database sequences concatenated one after another.

Page 47: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

Database Search Space

[Figure: query sequence on the horizontal axis, concatenated database sequences on the vertical axis, with several candidate alignments]

Which alignment is more significant?

Page 48: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

Database Search Space

[Figure: query sequence on the horizontal axis, concatenated database sequences on the vertical axis]

The score can be used to judge alignments, but a score’s absolute value is a function of the scoring parameters.

Match = +1, Mismatch = -1, Gap_open = 5, Gap_extend = 1

yields the same alignments as

Match = +10, Mismatch = -10, Gap_open = 50, Gap_extend = 10

Scores are useful for relative ranking.

Page 49: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

Database Search Space

[Figure: query sequence on the horizontal axis, concatenated database sequences on the vertical axis]

To judge the relevance of an alignment, we need to judge whether the match is significant.

E-value = Expect(S) is a function of the score, the database size and composition, and the query size.

It is the number of alignments with scores >= S expected if the query were a random sequence, given the database size and composition.

An Expect of 0.0 means a very good match, unlikely to be random.

Page 50: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

Aligning sequences in databases: evaluating significance

we can align a sequence with any other one

we want good alignments that are statistically significant

when searching databases, statistical relevance needs to be computed too

E-value: number of hits a random sequence of the same size would get in a database of the same size

Page 51: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

E-value

“Hits” can be sorted according to their E-value or their score.

The E-value is also known as the EXPECT value and is a function of the score, the database size and the query sequence length.

E-value: number of alignments with a score >= S that you expect to find if the database were a collection of random letters.

E.g. a score of 1 only requires 1 match, so there should be an enormous number of such alignments. One expects to find fewer alignments with a score of 5, and so on. Eventually, when the score is big enough, one expects to find only an insignificant number of alignments that could be due to chance.

E-values of less than 1e-6 (1 x 10^-6 in scientific notation) are usually very good and, for proteins, E < 1e-2 is usually considered significant. It is still possible for a hit with E > 1 to be biologically meaningful, but more analysis is required to confirm that.

Even for VERY good hits, it is possible that the hit is due to a biological artifact (sequencing/cloning vector, repeats, low-complexity sequence…)
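The slides do not give a formula for the E-value. One widely used form is the Karlin-Altschul estimate E = K * m * n * exp(-lambda * S), where m is the query length and n the database length. The sketch below uses that form; the default values of `K` and `lam` are arbitrary placeholders for illustration only, since the real constants depend on the scoring system and on sequence composition.

```python
import math


def expect_value(score, query_len, db_len, K=0.1, lam=0.3):
    """Karlin-Altschul style estimate: E = K * m * n * exp(-lambda * S).

    K and lam are placeholder values, not constants for any real matrix.
    """
    return K * query_len * db_len * math.exp(-lam * score)


# The same score becomes less significant as the database grows,
# and E drops exponentially as the score increases.
for s in (20, 40, 80):
    print(s, expect_value(s, query_len=150, db_len=30_000_000))
```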

Page 52: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D A T A B A S E S E A R C H I N G

The “hit list” gives titles and scores for matched sequences

> fasta myquery swissprot -ktup 2
The best scores are:  initn init1 opt z-sc E(77110)
gi|1706794|sp|P49789|FHIT_HUMAN BIS(5'-ADENOSYL)-  996 996 996 1262.1 0
gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)  412 382 395 507.6 1.4e-21
gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEI  238 133 316 407.4 5.4e-16
gi|3915958|sp|Q58276|Y866_METJA HYPOTHETICAL HIT-  153 98 190 253.1 2.1e-07
gi|3916020|sp|Q11066|YHIT_MYCTU HYPOTHETICAL 15.7  163 163 184 244.8 6.1e-07
gi|3023940|sp|O07513|HIT_BACSU HIT PROTEIN  164 164 170 227.2 5.8e-06
gi|2506515|sp|Q04344|HNT1_YEAST HIT FAMILY PROTEI  130 91 157 210.3 5.1e-05
gi|2495235|sp|P75504|YHIT_MYCPN HYPOTHETICAL 16.1  125 125 148 199.7 0.0002
gi|418447|sp|P32084|YHIT_SYNP7 HYPOTHETICAL 12.4  42 42 140 191.3 0.00058
gi|3025190|sp|P94252|YHIT_BORBU HYPOTHETICAL 15.9  128 73 139 188.7 0.00082
gi|1351828|sp|P47378|YHIT_MYCGE HYPOTHETICAL HIT-  76 76 133 181.0 0.0022
gi|418446|sp|P32083|YHIT_MYCHR HYPOTHETICAL 13.1  27 27 119 165.2 0.017
gi|1708543|sp|P49773|IPK1_HUMAN HINT PROTEIN (PRO  66 66 118 163.0 0.022
gi|2495231|sp|P70349|IPK1_MOUSE HINT PROTEIN (PRO  65 65 116 160.5 0.03
gi|1724020|sp|P49774|YHIT_MYCLE HYPOTHETICAL HIT-  52 52 117 160.3 0.031
gi|1170581|sp|P16436|IPK1_BOVIN HINT PROTEIN (PRO  66 66 115 159.3 0.035
gi|2495232|sp|P80912|IPK1_RABIT HINT PROTEIN (PRO  66 66 112 155.5 0.057
gi|1177047|sp|P42856|ZB14_MAIZE 14 KD ZINC-BINDIN  73 73 112 155.4 0.058
gi|1177046|sp|P42855|ZB14_BRAJU 14 KD ZINC-BINDIN  76 76 110 153.8 0.072
gi|1169825|sp|P31764|GAL7_HAEIN GALACTOSE-1-PHOSP  58 58 104 138.5 0.51
gi|113999|sp|P16550|APA1_YEAST 5',5'''-P-1,P-4-TE  47 47 103 137.8 0.56
gi|1351948|sp|P49348|APA2_KLULA 5',5'''-P-1,P-4-T  63 63 98 131.3 1.3
gi|123331|sp|P23228|HMCS_CHICK HYDROXYMETHYLGLUTA  58 58 99 129.4 1.6
gi|1170899|sp|P06994|MDH_ECOLI MALATE DEHYDROGENA  70 48 91 122.9 3.7
gi|3915666|sp|Q10798|DXR_MYCTU 1-DEOXY-D-XYLULOSE  75 50 92 121.9 4.3
gi|124341|sp|P05113|IL5_HUMAN INTERLEUKIN-5 PRECU  36 36 85 121.3 4.7
gi|1170538|sp|P46685|IL5_CERTO INTERLEUKIN-5 PREC  36 36 84 120.0 5.5
gi|121369|sp|P15124|GLNA_METCA GLUTAMINE SYNTHETA  45 45 90 118.9 6.3
gi|2506868|sp|P33937|NAPA_ECOLI PERIPLASMIC NITRA  48 48 92 117.4 7.6
gi|119377|sp|P10403|ENV1_DROME RETROVIRUS-RELATED  59 59 89 117.0 8
gi|1351041|sp|P48415|SC16_YEAST MULTIDOMAIN VESIC  48 48 97 117.0 8
gi|4033418|sp|O67501|IPYR_AQUAE INORGANIC PYROPHO  38 38 83 116.8 8.3

Page 53: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

D A T A B A S E S E A R C H I N G

Detailed alignments are shown farther down in the output

> fasta myquery swissprot -ktup 2

>>gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)-TETR (182 aa)
initn: 412 init1: 382 opt: 395 z-score: 507.6 E(): 1.4e-21
Smith-Waterman score: 395; 52.3% identity in 109 aa overlap

       10 20 30 40 50
gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRPVERFHDLRPDEVADLF
       : X: .:.:: :.:: ::..:::::: : : : :..:: :.:..:::
gi|170 MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLVIPQRAVPRLKDLTPSELTDLF
       10 20 30 40 50 60

       60 70 80 90 100 110
gi|170 QTTQRVGTVVEKHFHGTSLTFSMQDGPEAGQTVKHVHVHVLPRKAGDFHRNDSIYEELQK
       ....: :.:: : ... ....::: .::::: :::::..::: .:: .:: .: :X.:
gi|170 TSVRKVQQVIEKVFSASASNIGIQDGVDAGQTVPHVHVHIIPRKKADFSENDLVYSELEK
       70 80 90 100 110 120

       120 130 140
gi|170 HDKEDFPASWRSEEEMAAEAAALRVYFQ
       ..
gi|170 NEGNLASLYLTGNERYAGDERPPTSMRQAIPKDEDRKPRTLEEMEKEAQWLKGYFSEEQE
       130 140 150 160 170 180

>>gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEIN 2 (217 aa)
initn: 238 init1: 133 opt: 316 z-score: 407.4 E(): 5.4e-16
Smith-Waterman score: 316; 37.4% identity in 131 aa overlap

       10 20 30 40
gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRP-VER
       :.. :. .v^: :.. ..:::: ::.::::::. ::X :

Page 54: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

Database Search Space

[Figure: query sequence on the horizontal axis, concatenated database sequences on the vertical axis]

Some matches are not meaningful because they occur VERY often in the database, e.g.:

nucleotide AAA (from polyA)

biological repeated elements (retroposons, ALU)

low-complexity repeated patterns (CAGCAG, QQQ, KKK, …)

These elements should be FILTERED or MASKED to avoid generating false ‘hits’. It is ‘OK’ to align through them if they are near meaningful diagonal ‘hits’.
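To make the idea of masking concrete, here is a toy filter that replaces runs of a single repeated letter (such as QQQ, KKK or a polyA tail) with a mask character. It is only a sketch: the function name, the `min_run` threshold and the mask character are assumptions for this example, it does not catch multi-letter repeats such as CAGCAG, and real low-complexity filters use more elaborate statistics.

```python
import re


def mask_homopolymers(seq, min_run=3, mask_char="X"):
    """Replace every run of at least min_run identical letters with mask_char."""
    # (.) captures one letter; \1{n,} requires at least n further copies of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


print(mask_homopolymers("MKQQQQLA"))                  # MKXXXXLA
print(mask_homopolymers("ACGTAAAAAAAA", mask_char="N"))  # ACGTNNNNNNNN
```

Masked positions then never produce word hits, which is one way to avoid seeding alignments in such regions.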

Page 55: Local and Global Alignment, Database Searching With Blast

Sequence Database Searching

H A S H I N G M E T H O D S

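[Figure: dot plot with the query sequence on the horizontal axis and the subject sequence on the vertical axis, with a diagonal band marked around the word hits]

Scan the database, looking up words in the dictionary

Use word hits to determine where to search for alignments

FASTA searches in a band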