CAP5510 – Bioinformatics Database Searches for Biological Sequences

Tamer Kahveci, CISE Department, University of Florida


Post on 12-Jan-2016


Page 1: CAP5510 – Bioinformatics Database Searches for Biological Sequences

1

CAP5510 – Bioinformatics

Database Searches for Biological Sequences

Tamer Kahveci

CISE Department

University of Florida


2

Goals

• Understand how major heuristic methods for sequence comparison work
  – FASTA
  – BLAST

• Understand how search results are evaluated


3

What is Database Search?

[Figure: two panels, each showing a query searched against a database, labeled "Many long sequences" and "One giant sequence"]


4

What is Database Search?

[Figure: comparison labeled "Two giant sequences"]


5

What is Database Search?

• Find a particular (usually short) sequence in a database of sequences (or in one huge sequence).
• The problem is identical to local sequence alignment, but on a much larger scale.
• We must also have some idea of the significance of a database hit.
  – Databases always return some kind of hit. How much attention should be paid to the result?
• A similar problem is the global alignment of two large sequences.
• General idea: good alignments contain high-scoring regions.


6

Database Search Issues

• How can we search a massive space quickly?

• How can we evaluate the significance of the result?


7

Database Search Methods

• Hash table based methods
  – FASTA family
    • FASTP, FASTA, TFASTA, FASTAX, FASTAY
  – BLAST family
    • BLASTP, BLASTN, TBLAST, BLASTX, BLAT, BLASTZ, MegaBLAST, PsiBLAST, PhiBLAST
  – Others
    • FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
• Suffix tree based methods
  – Mummer, AVID, Reputer, MGA, QUASAR


8

Hash Table


9

Hash Table

• K-gram = subsequence of length K
• A^K entries
  – A is the alphabet size
• Linear time construction
• Constant lookup time
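The k-gram table described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `build_kgram_index` is made up here, and a Python dictionary stores only the k-grams that actually occur rather than reserving all A^K slots.

```python
from collections import defaultdict

def build_kgram_index(sequence, k):
    """Map each k-gram of `sequence` to the 0-based positions where it starts."""
    index = defaultdict(list)
    # One pass over the sequence: linear-time construction.
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

index = build_kgram_index("TGAGTGCGA", 2)
# Expected constant-time lookup of all start positions of a 2-gram.
print(index.get("TG"))  # [0, 4]
print(index.get("GA"))  # [1, 7]
```

A directly addressed array with A^K slots gives the same lookups without hashing, at the cost of allocating every slot up front.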


10

FASTP

Lipman & Pearson, 1985


11

FASTP

• Three-phase algorithm
  1. Find short good matches using k-grams
     – K = 1 or 2
  2. Find start and end positions for good matches
  3. Use DP to align good matches


12

FASTP: Phase 1 (1)

position    1  2  3  4  5  6  7  8  9 10 11
protein 1   n  c  s  p  t  a  .  .  .  .  .
protein 2   .  .  .  .  .  a  c  s  p  r  k

             position in            offset
amino acid   protein A  protein B   pos A - pos B
-------------------------------------------------
a            6          6            0
c            2          7           -5
k            -          11
n            1          -
p            4          9           -5
r            -          10
s            3          8           -5
t            5          -
-------------------------------------------------

Note the common offset for the 3 amino acids c, s and p. A possible alignment can be quickly found:

protein 1   n c s p t a
              | | |
protein 2   a c s p r k
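The offset table above can be reproduced with a small lookup-table sketch. The function name `offset_counts` and the convention of treating "." as padding are assumptions made for this illustration; positions are 1-based to match the slide.

```python
from collections import Counter, defaultdict

def offset_counts(seq_a, seq_b, k=1, pad="."):
    """Count k-gram matches per offset (pos A - pos B), using 1-based positions."""
    # Lookup table: k-gram -> positions in sequence A (padding is skipped).
    positions_a = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        word = seq_a[i:i + k]
        if pad not in word:
            positions_a[word].append(i + 1)
    # Scan sequence B and record the offset of every shared k-gram.
    counts = Counter()
    for j in range(len(seq_b) - k + 1):
        word = seq_b[j:j + k]
        if pad in word:
            continue
        for pos_a in positions_a.get(word, []):
            counts[pos_a - (j + 1)] += 1
    return counts

counts = offset_counts("ncspta.....", ".....acsprk")
print(counts.most_common())  # offset -5 is shared by c, s and p; offset 0 by a
```

The offset with the highest count points at the diagonal where a good alignment is most likely to lie.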


13

FASTP: Phase 1 (2)

• Similar to dot plot
• Offsets range from 1 - m to n - 1
• Each offset is scored as
  – # matches - # mismatches
• Diagonals (offsets) with large score show local similarities
• How does it depend on k?
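A brute-force sketch of the diagonal scoring, assuming n and m are the lengths of the first and second sequence and k = 1. It visits every cell of the dot plot, so it only illustrates the score definition; the lookup table is what avoids this quadratic work in practice. The function name `diagonal_scores` is hypothetical.

```python
def diagonal_scores(seq_a, seq_b):
    """Score every diagonal d = i - j as (# matches) - (# mismatches)."""
    n, m = len(seq_a), len(seq_b)
    scores = {}
    for d in range(1 - m, n):           # offsets 1 - m ... n - 1
        matches = mismatches = 0
        for j in range(m):
            i = j + d                   # cell (i, j) lies on diagonal d
            if 0 <= i < n:
                if seq_a[i] == seq_b[j]:
                    matches += 1
                else:
                    mismatches += 1
        scores[d] = matches - mismatches
    return scores

scores = diagonal_scores("ncspta", "cspr")
best = max(scores, key=scores.get)
print(best, scores[best])  # diagonal 1 holds the c-s-p run
```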


14

FASTP: Phase 2

• 5 best diagonal runs are found

• Rescore these 5 regions using PAM250.
  – Initial score

• Indels are not considered yet


15

FASTP: Phase 3

• Sort the aligned regions in descending score

• Optimize these alignments using Needleman-Wunsch

• Report the results
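For reference, a minimal Needleman-Wunsch scoring sketch. It assumes simple match/mismatch/gap scores of +1/-1/-1 instead of PAM250 and fills the full DP table, so it shows the recurrence only, not how FASTP restricts or scores the alignment.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b, keeping one DP row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]      # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]                              # align a[:i] against nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap                        # gap in b
            left = curr[j - 1] + gap                  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]

print(needleman_wunsch_score("acsp", "acsp"))  # identical sequences
print(needleman_wunsch_score("ac", "a"))       # one match plus one gap
```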


16

FASTP - Discussion

• Results are not optimal. Why?
• How does performance compare to Smith-Waterman?
• What is the impact of k?
• How does this idea work for DNA?
  – K = 4 or 6 for DNA


17

FASTA – Improvement Over FASTP

Pearson 1995


18

FASTA (1)

• Phase 2: Choose 10 best diagonal runs instead of 5


19

FASTA (2)

• Phase 2.5
  – Eliminate diagonals that score less than some given threshold.
  – Combine matches to find longer matches. Each join incurs a penalty similar to a gap penalty.


20

BLAST

Altschul, Gish, Miller, Myers, Lipman, 1990


21

BLAST (or BLASTP)

• BLAST – Basic Local Alignment Search Tool

• An approximation of Smith-Waterman

• Designed for database searches
  – Short query sequence against a long database sequence or a database of many sequences

• Sacrifices search sensitivity for speed


22

BLAST Algorithm (1)

• Eliminate low complexity regions from the query sequence.
  – Replace them with X (protein) or N (DNA)
• Hash table on query sequence.
  – K = 3 for proteins

[Figure: query MCGPFILGTYC and its 3-grams MCG, CGP, ...]


23

BLAST Algorithm (2)

• For each k-gram, find all k-grams that align with score at least cutoff T using BLOSUM62
  – 20^k candidates
  – ~50 on average per k-gram
  – ~50n for the entire query
• Build hash table

[Figure: query PQGMCGPFILGTYC and its 3-grams PQG, QGM, ...; neighborhood of PQG]

Word  Score
PQG   18
PEG   15
PRG   14
PSG   13
PQA   12

T = 13
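The neighborhood filter for the word PQG can be sketched as follows. The table holds only the handful of substitution scores needed to reproduce the numbers on the slide; it is not a full BLOSUM62 matrix, and the names `SCORES`, `pair_score` and `word_score` are invented for this example.

```python
# Partial substitution table: just the pairs needed for this example.
SCORES = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "S"): 0, ("G", "A"): 0,
}

def pair_score(x, y):
    """Symmetric lookup; raises KeyError for pairs outside the partial table."""
    return SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]

def word_score(word, candidate):
    """Ungapped score of two equal-length words, position by position."""
    return sum(pair_score(x, y) for x, y in zip(word, candidate))

T = 13
candidates = ["PQG", "PEG", "PRG", "PSG", "PQA"]
scored = {c: word_score("PQG", c) for c in candidates}
neighbors = [c for c in candidates if scored[c] >= T]
print(scored)     # PQG 18, PEG 15, PRG 14, PSG 13, PQA 12
print(neighbors)  # PQA falls below the cutoff
```

A real implementation would enumerate or prune all 20^k candidate words instead of scoring a hand-picked list.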


24

BLAST Algorithm (3)

• Sequentially scan the database and locate each k-gram in the hash table

• Each match is a seed for an ungapped alignment.
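A sketch of the scan step, under the simplifying assumption that only exact k-gram matches are looked up (no neighborhood words). The database string is made up for the example, and `find_seeds` is a hypothetical helper name.

```python
from collections import defaultdict

def find_seeds(query, database, k):
    """Return (query_pos, database_pos) pairs, 0-based, for every shared k-gram."""
    # Hash table on the query.
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    # Sequential scan of the database.
    seeds = []
    for j in range(len(database) - k + 1):
        for i in table.get(database[j:j + k], ()):
            seeds.append((i, j))
    return seeds

seeds = find_seeds("MCGPFILGTYC", "AAMCGPWWILGTAA", 3)
print(seeds)
# Seeds on the same diagonal (equal i - j) are candidates for one ungapped alignment.
print(sorted({i - j for i, j in seeds}))
```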


25

BLAST Algorithm (4)

• HSP (High-scoring Segment Pair) = a match between a query word and the database
• Find a “hit”: two non-overlapping HSPs on a diagonal within distance A
• Extend the hit until the score falls more than a threshold value, X, below the best score seen so far
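A sketch of ungapped extension to the right of a seed, reading X as a drop-off: extension stops once the running score is more than X below the best score seen. Match/mismatch scores of +1/-1 stand in for a substitution matrix, and `extend_right` is a hypothetical name; extension to the left would be symmetric.

```python
def extend_right(query, subject, q_start, s_start, k, x_drop, match=1, mismatch=-1):
    """Extend an exact seed of length k to the right without gaps.

    Returns (best_score, length_of_best_extension).
    """
    score = best = k * match        # the seed itself matches exactly
    length = best_len = k
    i, j = q_start + k, s_start + k
    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break                   # dropped too far below the best score
        i += 1
        j += 1
    return best, best_len

print(extend_right("ACGTACGTTT", "ACGTACGAAA", 0, 0, k=4, x_drop=2))
```

A larger X lets the extension ride through a longer run of mismatches before giving up, at the cost of more work.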


26

BLAST Algorithm (5)

• Keep only the extended matches that have a score at least S.

• Determine the statistical significance of the result


27

What is Statistical Significance?

[Figure: two games, each with the score 13 : 15]

• Two one-on-one games, two scores.
• Which result is more significant?
• Expected result: may well be due to chance.
• Unexpected result: significant, and may be meaningful.


28

Statistical Significance

• E-value: the expected number of matches with score at least S
  – $E = K m n \, e^{-\lambda S}$
  – m, n: sequence lengths
  – S: alignment score
  – K, $\lambda$: normalization parameters
• P-value: the probability of having at least one match with score at least S
  – $P = 1 - e^{-E}$
• The smaller these values are, the more significant the result

• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
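The two formulas translate directly into code. The values of K and lambda below are placeholders chosen only for illustration; real values depend on the scoring system and are not given on the slide.

```python
import math

def e_value(score, m, n, K, lam):
    """Expected number of matches with score at least `score`: K*m*n*exp(-lam*score)."""
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one such match: 1 - exp(-E)."""
    return -math.expm1(-e)   # numerically stable form of 1 - exp(-e)

# Hypothetical parameters: query length, database length, K and lambda.
m, n, K, lam = 300, 1_000_000, 0.1, 0.3
for s in (40, 60, 80):
    e = e_value(s, m, n, K, lam)
    print(s, e, p_value(e))
```

For small E the P-value is nearly equal to E, which is why the two are often used interchangeably for strong hits.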


29

BLAST - Analysis

• K (k-gram)
  – Lower: more sensitive. Slower.
• T (neighbor cutoff)
  – Lower: finds distant neighbors. Introduces noise.
• X (extension cutoff)
  – Higher: lower chance of getting stuck in a local minimum. Slower.


30

Sample Query

• http://www.ncbi.nlm.nih.gov/BLAST/

I D R A M S A A R G V F E R G D W S L S S P A K R K A V L N K L A D L M E A H A E E L A L L E T L D T G K P I R H S L R D D I P G A A R A I R W Y A E A I D K V Y G E V A T T S S H E L A M I V R E P V G V I A A I V P W N F P L L L T C W K L G P A L A A G N S V I L K P S E K S P L S A I R L A G L A K E A G L P D G V L N V V T G F G H E A G Q A L S R H N D I D A I A F T G S T R T G K Q L L K D A G D S N M K R V W L E A G G K S A N I V F A D C P D L Q Q A A S A T A A G I F Y N Q G Q V C I A G T R L L L E E S I A D E F L A L L K Q Q A Q N W Q P G H P L D P A T T M G T L I D C A H A D S V H S F I R E G E S K G Q L L L D G R N A G L A A A I G P T I F V D V D P N A S L S R E E I F G P V L V V T R F T S E E Q A L Q L A N D S Q Y G L G A A V W T R D L S R A H R M S R R L K A G S V F V N N Y N D G D M T V P F G G Y K Q S G N G R D K S L H A L E K F T E L K T I W I

Dhal_ecoli


31

BLASTN

• BLAST for nucleic acids
• K = 11
• Exact match instead of neighborhood search.


32

BLAST Variations

Program   Query          Target         Type
BLASTP    Protein        Protein        Gapped
BLASTN    Nucleic acid   Nucleic acid   Gapped
BLASTX    Nucleic acid   Protein        Gapped
TBLASTN   Protein        Nucleic acid   Gapped
TBLASTX   Nucleic acid   Nucleic acid   Ungapped


33

Even More Variations

– PsiBLAST (iterative)
– BLAT, BLASTZ, MegaBLAST
– FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
– Main differences are
  • Seed choice (k, gapped seeds)
  • Additional data structures


34

Suffix Trees


35

Suffix Tree

• Tree structure that contains all suffixes of the input sequence

• TGAGTGCGA
• GAGTGCGA
• AGTGCGA
• GTGCGA
• TGCGA
• GCGA
• CGA
• GA
• A


36

Suffix Tree Example


37

Suffix Tree Analysis

• O(n) space and construction time
  – 10n to 70n space usage reported
• O(m) search time for m-letter sequence
• Good for
  – Small data
  – Exact matches


38

Suffix Array

• 5 bytes per letter
• O(m log n) search time
• Better space usage
• Slower search
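A small suffix array sketch showing where the O(m log n) search time comes from: two binary searches over the sorted suffix positions, each comparing at most m letters per step. Construction here simply sorts the suffixes, which is easy to read but far from the fastest method; the function names are illustrative.

```python
def build_suffix_array(s):
    """Start positions of all suffixes of s, in lexicographic order (naive sort)."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_occurrences(s, sa, pattern):
    """All 0-based start positions of pattern in s, via binary search on sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                       # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "TGAGTGCGA"
sa = build_suffix_array(text)
print(sa)
print(find_occurrences(text, sa, "GA"), find_occurrences(text, sa, "TG"))
```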


39

Mummer


40

Other Sequence Comparison Tools

• Reputer, MGA, AVID
• QUASAR (suffix array)