Database Similarity Searching

Upload: helen-powers

Post on 03-Jan-2016



TRANSCRIPT

Page 1: Database Similarity Searching

Database Similarity Searching

Page 2: Database Similarity Searching

BLAST

Global alignment: an alignment of a pair of seqs. in which all residues from both seqs. are included.

BLAST – local alignment
Interpreting BLAST output
- The Smith and Waterman algorithm is guaranteed to find the best local alignment of two seqs.
- It is too slow in practice for searching whole databases!
- BLAST is a heuristic search method that is not guaranteed to find the best local alignment, but has been especially effective in practice
- e.g. S45649 (from a fossilized insect)

>gi|256517|gb|S45649.1| 16S rRNA [Mastotermes electrodominicus=termites, amber-preserved fossil, Mitochondrial, 94 nt]
AATAAAATTTTAATAAATATAAAGATTTATAGGGTCTTCTCGGCCTTTAAAAATATTTTAGCCTTTTGAC
AAAAAAAAAAAAATCTACAAAAAA
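
To make the contrast with BLAST concrete, here is a minimal sketch of the Smith and Waterman dynamic programming recurrence. The scoring values (match = 2, mismatch = -1, linear gap = -2) and the function name `smith_waterman` are assumptions chosen for illustration, not parameters taken from the slides. The table has one cell per pair of residues, which is why the exact method becomes slow when a query is compared against an entire database.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions; the gap penalty is linear.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until the score drops to 0
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = H[i][j]
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

The work grows with the product of the two sequence lengths, so the sketch is practical for a pair of short sequences but not for a database scan.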

Page 3: Database Similarity Searching

BLAST

http://www.ncbi.nlm.nih.gov/BLAST/

Hits are ordered by E-value, with the most significant hits listed first.
The E-value is the number of hits with the same level of similarity that you would expect by chance.
A hit with E = 0.01 is expected to occur by chance once in every 100 searches, even when there is no true match in the database.
The E-value is similar in spirit to the p-value of statistical hypothesis tests.
The p-value is the probability of finding a seq. similarity at least as strong as the observed match if there were really no true matches in the database.
E-value ≠ p-value
E-value ~ p-value when it is small (say < 0.1)
Since we are interested in unusual hits, it is safe to interchange E-value with p-value.

E-value – the lower the better the alignment; matches above 0.001 are often close to the twilight zone (not significant)

Score (bits) – the higher the better the alignment; scores below 50 are unreliable
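
The claim that the E-value and the p-value nearly coincide when both are small can be checked numerically. The sketch below assumes the usual Poisson model for chance hits, under which P = 1 - e^(-E); that relation and the function name `p_value_from_e_value` are assumptions made for this illustration.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, assuming chance hits are
    Poisson-distributed with mean E (an assumption of this sketch)."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for e in (10.0, 1.0, 0.1, 0.01, 0.001):
        print(f"E = {e:<6g}  P = {p_value_from_e_value(e):.6f}")
```

For large E the p-value saturates near 1 while E keeps growing, which is why the two are only interchangeable for small values.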

Page 4: Database Similarity Searching

BLAST

The BLAST output may not be the same every time, because several components are upgraded over time: the database, the BLAST program, and the default parameters of the server.

E-value, similarity and homology
Protein: similarity > 25%, length > 100 a.a., E-value < 10^-4
DNA: similarity > 70%, length > 100 bp, E-value < 10^-4

Gap penalties
- constant penalty, independent of the length of the gap: A
- proportional penalty, proportional to the length L of the gap: BL
- affine (in mathematics, "affine"; in chemistry, "having affinity") gap penalty, gap-opening penalty + gap-extension penalty = A + BL

Remark
• Prediction using similarity is a powerful idea in bioinformatics
• Homologous seqs. evolved by divergence from a common ancestor; therefore to say two seqs. share 50% homology is nonsense; to say two seqs. share 50% similarity, and that this indicates possible homology, is the correct usage of the terms
• Similarity does NOT necessarily imply homology
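
The three gap penalty schemes can be written as small functions of the gap length L. This is only a sketch: the default values of A and B are arbitrary placeholders, and the function names are hypothetical. Note that some programs charge the extension penalty for L - 1 residues rather than L; the version here follows the A + BL form given on the slide.

```python
def constant_gap(length, A=5.0):
    """Constant penalty: the same cost A for any gap, whatever its length."""
    return A if length > 0 else 0.0


def proportional_gap(length, B=1.0):
    """Proportional penalty: cost B for every gapped position, B * L in total."""
    return B * length


def affine_gap(length, A=5.0, B=1.0):
    """Affine penalty: gap-opening cost A plus extension cost B * L."""
    return A + B * length if length > 0 else 0.0


if __name__ == "__main__":
    for L in (1, 2, 5, 10):
        print(L, constant_gap(L), proportional_gap(L), affine_gap(L))
```

With an affine penalty, one long gap costs less than several short gaps of the same total length, because the opening cost A is paid only once.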

Page 5: Database Similarity Searching

BLAST (choosing the parameters)

BLAST - most highly cited paper, cited > 12000 times
Alternative methods: seeds + dynamic programming
- speed-up: faster
- not guaranteed to find the best alignment: less accurate
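
The seeding step can be sketched as an exact word lookup: index every word of length k in the query, then scan the subject for the same words. The function name `find_seeds` and the word size k = 3 are assumptions for illustration. The real program is more elaborate (for proteins it also accepts similar, not only identical, words) and then extends each seed with dynamic programming, which this sketch omits.

```python
from collections import defaultdict


def find_seeds(query, subject, k=3):
    """Return sorted (query_pos, subject_pos) pairs where a word of
    length k occurs identically in both sequences (0-based positions)."""
    # Index every k-letter word of the query by its start positions
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    # Scan the subject once and look each of its words up in the index
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            seeds.append((i, j))
    return sorted(seeds)


if __name__ == "__main__":
    print(find_seeds("ACGTT", "GGACGA"))
```

Only the neighbourhood of each seed is then examined, which is where the speed-up comes from, and also why an alignment that contains no seed can be missed.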

Page 6: Database Similarity Searching

BLAST (Sequence filters)

http://www.ncbi.nlm.nih.gov/BLAST/

Page 7: Database Similarity Searching

BLAST

What is a coiled-coil?

Coiled-coil domains are characterized by a heptad (a group of seven) repeat pattern in which residues in the first and fourth positions are hydrophobic, and residues in the fifth and seventh positions are predominantly charged or polar. This pattern can be used by computational methods, such as MultiCoil (MIT) or SOCKET (University of Sussex), to predict coiled-coil domains in amino acid sequences.
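
As a toy illustration of how the heptad pattern could be tested on a sequence, the sketch below scores one window of seven residues against the rule stated above. It is not how MultiCoil or SOCKET work; the residue groupings in `HYDROPHOBIC` and `CHARGED_OR_POLAR` and the function name `heptad_score` are assumptions made for this example.

```python
# Residue classes are assumptions of this sketch; other groupings are in use
HYDROPHOBIC = set("AILMFVWY")
CHARGED_OR_POLAR = set("DEKRHNQST")


def heptad_score(window):
    """Fraction (0.0 to 1.0) of the four key heptad positions that fit the
    pattern: positions 1 and 4 hydrophobic, positions 5 and 7 charged or polar."""
    if len(window) != 7:
        raise ValueError("a heptad window must contain exactly 7 residues")
    window = window.upper()
    hits = sum(window[p] in HYDROPHOBIC for p in (0, 3))        # 1st and 4th
    hits += sum(window[p] in CHARGED_OR_POLAR for p in (4, 6))  # 5th and 7th
    return hits / 4


if __name__ == "__main__":
    print(heptad_score("LKKLEKE"))
```

A real predictor would slide the window along the sequence, consider all seven possible registers, and use statistically derived residue preferences instead of a yes/no rule.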

Page 8: Database Similarity Searching

BLAST programs

Page 9: Database Similarity Searching

BLASTing DNA sequences

Page 10: Database Similarity Searching

Use of BLASTx to find ORF

AE008569

Page 11: Database Similarity Searching

Use of BLASTx to find ORF

Frame = -2

Frame = +1
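
The frame labels refer to the six possible reading frames of a DNA query: three on the given strand (+1, +2, +3) and three on the reverse complement (-1, -2, -3). The sketch below translates a sequence in one chosen frame using the standard genetic code. The function names are hypothetical, and the numbering of the negative frames (counted from the start of the reverse complement) is an assumption that should be checked against the BLAST documentation.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate_frame(dna, frame):
    """Translate dna in reading frame +1, +2, +3, -1, -2 or -3.

    Stop codons appear as '*'; codons with unknown letters appear as 'X'.
    """
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of +1, +2, +3, -1, -2, -3")
    seq = dna.upper() if frame > 0 else reverse_complement(dna)
    start = abs(frame) - 1
    # Step through complete codons only; a trailing partial codon is ignored
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


if __name__ == "__main__":
    for frame in (1, 2, 3, -1, -2, -3):
        print(f"Frame = {frame:+d}: {translate_frame('ATGGCCTAAGGT', frame)}")
```

A long stretch without '*' in one frame is a candidate ORF; BLASTx goes further by comparing each translation with a protein database, so the frame reported with a hit indicates where a coding region lies.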

Page 12: Database Similarity Searching

Use of BLASTx to find ORF

Page 13: Database Similarity Searching

Use of BLASTx to find ORF

Page 14: Database Similarity Searching

Use of BLASTx to find ORF

Page 15: Database Similarity Searching
Page 16: Database Similarity Searching

BLAST procedures

Page 17: Database Similarity Searching

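BLAST

• The E-value of a BLAST hit is given by

$$E = k\,m\,n\,e^{-\lambda s}$$

• where k (which depends on the scoring matrix and gap penalty combination) and λ are constants, m and n denote the lengths of the seqs., s is the alignment score, and λ acts as the scaling factor for the scoring matrix used

• Alignment scores follow a Gumbel extreme value distribution, so the probability of obtaining a score of at least x by chance is

$$P = 1 - e^{-k\,m\,n\,e^{-\lambda x}}$$

http://www.itl.nist.gov/div898/handbook/eda/section3/eda366g.htm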
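
A short numerical sketch of the two formulas, E = k m n e^(-λs) and P = 1 - e^(-E), shows how quickly the E-value falls as the score rises. The values of k and λ used below are placeholders chosen for illustration, not parameters of any particular scoring matrix, and the function names are hypothetical.

```python
import math


def blast_e_value(score, m, n, k, lam):
    """Expected number of chance hits scoring at least `score`:
    E = k * m * n * exp(-lam * score)."""
    return k * m * n * math.exp(-lam * score)


def blast_p_value(score, m, n, k, lam):
    """Probability of at least one chance hit scoring at least `score`:
    P = 1 - exp(-E), the Gumbel extreme value tail."""
    return -math.expm1(-blast_e_value(score, m, n, k, lam))


if __name__ == "__main__":
    # Placeholder inputs: a 250-residue query against 1e8 database residues
    m, n, k, lam = 250, 1e8, 0.1, 0.3
    for s in (40, 60, 80, 100):
        e = blast_e_value(s, m, n, k, lam)
        p = blast_p_value(s, m, n, k, lam)
        print(f"score = {s:3d}  E = {e:.3g}  P = {p:.3g}")
```

Because the score sits in the exponent, each fixed increase in score divides the E-value by a constant factor, while doubling the database size n only doubles it.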

Page 18: Database Similarity Searching

Position-Specific Iterated BLAST (PSI-BLAST)

Page 19: Database Similarity Searching

Position-Specific Iterated BLAST (PSI-BLAST)

Query sequence – human hemoglobin

>gi|57013850|sp|P69905|HBA_HUMAN Hemoglobin alpha subunit (Hemoglobin alpha chain) (Alpha-globin)
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

0 ≤ E-value < 10^-40

Page 20: Database Similarity Searching

Position-Specific Iterated BLAST (PSI-BLAST)

Query sequence – human hemoglobin

>gi|57013850|sp|P69905|HBA_HUMAN Hemoglobin alpha subunit (Hemoglobin alpha chain) (Alpha-globin)
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHA
HKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

Gene or Structure information

Page 21: Database Similarity Searching

Position-Specific Iterated BLAST (PSI-BLAST)

More seqs. are identified than in Iteration 1

Page 22: Database Similarity Searching

Position-Specific Iterated BLAST (PSI-BLAST)

Add or remove the hits that seem to be relevant or irrelevant (non-human seqs.)

Page 23: Database Similarity Searching

Position-Specific Iterated BLAST (PSI-BLAST)

B ~ C