blast 2.0 details the filter option: –process of hiding regions of (nucleic acid or amino acid)...

Upload: naomi-merritt

Post on 17-Jan-2018

TRANSCRIPT

Page 1: Blast 2.0 Details The Filter Option: –process of hiding regions of (nucleic acid or amino acid) sequence having characteristics

Blast 2.0 Details

• The Filter Option:

– process of hiding regions of (nucleic acid or amino acid) sequence having characteristics that frequently lead to spurious high scores

– typically involves masking repeated or low complexity regions (LCRs)

– The SEG program is used to mask or filter LCRs in amino acid queries.

– The DUST program is used to mask or filter LCRs in nucleic acid queries.

– More than half of the proteins in the database contain at least one low complexity region.

SEG Filter Example

The default filtering option in BLAST 2.0 automatically converts low complexity sequences into X's, which can be seen in the query line of the alignments.
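To make the masking idea concrete, here is a minimal Python sketch that replaces low-entropy windows with X's. It is not the SEG algorithm itself: it uses plain Shannon entropy over a fixed window, and the function names, window size and threshold are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits per residue) of the residues in a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace every residue covered by a low-entropy window with 'X'.

    window and threshold are illustrative assumptions for this sketch.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Hypothetical query with a poly-glutamine stretch in the middle.
query = "MKVLAAGIVGL" + "Q" * 16 + "WERTYIPASDFGHKLCVNM"
print(mask_low_complexity(query))
```

Residues flanking the repeat can also be masked, because any window that dips below the threshold is masked in full; a real filter refines the boundaries more carefully.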

PSI-Blast

• Position Specific Iterated BLAST

• an automated, easy-to-use version of a "profile" search

– a sensitive way to look for sequence homologues

• Intuition: substitution matrices should be specific to a particular site. For example, penalize an alanine-to-glycine substitution more heavily in a helix.

PSI-Blast: Outline

• Algorithm:

– First perform a gapped BLAST database search.

– PSI-BLAST uses information from significant alignments to construct a position-specific score matrix (PSSM).

– The PSSM replaces the query sequence for the next round of database searching.

– PSI-BLAST is iterated until no new significant alignments are found.

• Details:

– Set initial thresholds high. Inspect each iteration's result for suspicious sequences.

– Do several iterations (~5), or until no new sequences are found.

– Even if only looking for a small set of sequences, make the initial search very broad.

  • First, use NR with up to 5 iterations to set the PSSM.

  • Then use that PSSM to search in the restricted domain.
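The iteration described above can be sketched as a small control loop. This is only a skeleton under stated assumptions: `iterate_search`, `toy_search` and `toy_build_pssm` are hypothetical names, and the two toy functions stand in for a real database search and a real PSSM construction, neither of which is implemented here.

```python
from typing import Callable, Set, Tuple


def iterate_search(
    query: str,
    search: Callable[[object], Set[str]],
    build_pssm: Callable[[str, Set[str]], object],
    max_iterations: int = 5,
) -> Tuple[Set[str], int]:
    """Repeat search -> rebuild PSSM until no new significant hits appear."""
    model: object = query              # round 1 searches with the plain query
    included: Set[str] = set()
    for iteration in range(1, max_iterations + 1):
        hits = search(model)           # IDs of hits passing the threshold
        new_hits = hits - included
        if not new_hits:
            return included, iteration  # converged: nothing new was found
        included |= new_hits
        model = build_pssm(query, included)
    return included, max_iterations


def toy_search(model: object) -> Set[str]:
    # Toy stand-in: the plain query finds two hits, any "PSSM" finds one more.
    if isinstance(model, str):
        return {"seqA", "seqB"}
    return set(model) | {"seqC"}


def toy_build_pssm(query: str, included: Set[str]) -> object:
    # Toy stand-in: the "PSSM" is just the frozen set of included IDs.
    return frozenset(included)


hits, rounds = iterate_search("MKVLAAGIVG", toy_search, toy_build_pssm)
print(sorted(hits), rounds)
```

In practice the manual inspection step recommended above would sit between `search` and `build_pssm`, so that suspicious sequences can be excluded before they enter the PSSM.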

PSI-Blast: Details

• To calculate the profile for position 108, only the shaded regions are used.

• To calculate the profile at position i, pseudo-counts are used.
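As an illustration of why pseudo-counts matter, the Python sketch below scores one alignment column as log-odds against a background distribution. It uses simple add-one pseudo-counts, which is an assumption for illustration and not PSI-BLAST's actual data-dependent scheme; the four-letter alphabet, uniform background and function name are also hypothetical.

```python
import math
from collections import Counter
from typing import Dict


def column_scores(
    column: str, background: Dict[str, float], pseudocount: float = 1.0
) -> Dict[str, float]:
    """Log-odds score (in bits) for each residue at one alignment column."""
    counts = Counter(column)
    # Each residue in the alphabet receives `pseudocount` extra observations.
    total = len(column) + pseudocount * len(background)
    scores = {}
    for residue, bg_freq in background.items():
        freq = (counts[residue] + pseudocount) / total
        scores[residue] = math.log2(freq / bg_freq)
    return scores


# Toy alphabet with a uniform background (an assumption for illustration).
background = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}
scores = column_scores("AAAC", background)
for residue, score in scores.items():
    print(residue, round(score, 3))
```

Without pseudo-counts, the unobserved residues D and E would have frequency zero and a log-odds score of negative infinity; with them, they receive a finite penalty instead.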

PSI-BLAST Caveats

• Good:

– Increased ability to find distant homologues

– If the sequences used to construct PSSMs are all homologous, the sensitivity at a given specificity improves significantly.

• Bad:

– If non-homologous sequences are included in the PSSMs, they are “corrupted.” They then pull in more non-homologous sequences and become worse than a generic substitution matrix.

• Advice:

– Take special care to prevent non-homologous sequences from being included in the PSSM calculation.

  • When in doubt, leave it out!

  • Examine sequences with moderate similarity carefully.

– Be particularly cautious about matches to sequences with highly biased amino acid content.

Database Homology Search

• Homology search

– For genes/RNAs which do not encode proteins

  • nucleotide-level searches are relatively inefficient at identifying highly diverged sequences

– For genes which encode proteins

  • protein-protein searches are significantly better

– (two mRNA sequences might only be ~40% identical at the nucleotide level, but could be 70% similar in the proteins they encode)

• Rules of thumb:

– 80% similarity implies same structure and function

– highly diverged homologs could have down to 25% similarity

– the "twilight zone" in the range of 20%: judgement about significant similarity is quite difficult

– distantly related homologs may lack significant similarity

Database Homology Search

• E-values:

– expected number of sequences in the database which would achieve a given score by chance

– are more useful than the raw or bit scores or percentage identity

– An E-value of 0.001 is a standard threshold (unless the sequence is biased, e.g. low complexity)

– E-values below 10^-50 are highly significant.

• Caveats with low E-values:

– while the evolutionary relationship is highly likely, it does not necessarily imply identical function (multi-domain proteins)

– if the E-value is extremely low AND the alignment covers the length of both sequences, then they likely share a related function
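To show how an E-value relates to a bit score, the Python sketch below applies the Karlin-Altschul approximation E = m * n * 2^(-S'), where m is the query length, n the total database length and S' the bit score. The function name and the example sizes are hypothetical, and the sketch ignores the effective-length corrections a real BLAST run applies.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value for a bit score: E = m * n * 2^(-S').

    Assumes raw lengths; real BLAST uses edge-corrected effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 250-residue query against 10^9 database residues.
for bits in (30, 40, 50, 60):
    print(bits, expect_value(bits, query_len=250, db_len=10**9))
```

Each additional bit halves the E-value, and the same bit score becomes less significant as the database grows, which is one reason E-values are easier to compare than raw scores.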

Profiles

• Rather than identifying only the “consensus” (i.e. most common) amino acid at a particular location, we can assign a probability to each amino acid in each position of the domain.

• Like a PSSM, but just for the domain.

      1     2     3
A    .1    .5    .25
C    .3    .1    .25
D    .2    .2    .25
E    .4    .2    .25

Applying a Profile

• Calculate score (probability of match) for a profile at each position in a sequence by multiplying individual probabilities.

• Use “Sliding window”:

• Can transform probability to significance given random distribution assumption

      1     2     3
A    .1    .5    .25
C    .3    .1    .25
D    .2    .2    .25
E    .4    .2    .25

For sequence EACDC:

EAC = .4 * .5 * .25 = .05
ACD = .1 * .1 * .25 = .0025
CDC = .3 * .2 * .25 = .015
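The sliding-window calculation for EACDC can be reproduced with a short Python sketch. The profile values are taken from the table above; the function and variable names are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Profile from the slide: probability of each residue at positions 1 to 3.
profile: Dict[str, List[float]] = {
    "A": [0.1, 0.5, 0.25],
    "C": [0.3, 0.1, 0.25],
    "D": [0.2, 0.2, 0.25],
    "E": [0.4, 0.2, 0.25],
}


def window_scores(seq: str, profile: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Score every window of profile width by multiplying per-position probabilities."""
    width = len(next(iter(profile.values())))
    results = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        probability = 1.0
        for position, residue in enumerate(window):
            probability *= profile[residue][position]
        results.append((window, probability))
    return results


for window, probability in window_scores("EACDC", profile):
    print(window, round(probability, 4))
```

For longer profiles the product of many small probabilities can underflow, so summing log-probabilities is a common alternative to multiplying them directly.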

Sequence Logos
