blast nd fasta
Post on 10-Apr-2018
TRANSCRIPT
-
8/8/2019 BLAST ND FASTA
1/28
BLOSUM 62
The Blast and FastA algorithms
-
Global alignments that do not include gaps: a matrix of 200 PAMs,
for sequences that are thought to be related.
Unknown sequences: a 120 PAM matrix was the best compromise.
Local alignment methods: PAM40, PAM120 and PAM250. The lower PAM matrices
(40-120) find short alignments of highly similar sequences, while the higher
PAM matrices (120-250) find longer, weaker local alignments.
-
Standard BLAST: overall, the BLOSUM 62 matrix is the most
effective.
Nevertheless, every other substitution matrix performs better than
BLOSUM 62 for some proportion of the families.
-
Algorithms
Comparing sequences by dot matrix display, or by any other standard method
of sequence comparison, is a very slow process. Therefore:
Most commonly, the BLAST and FASTA approximation algorithms are used.
-
BLAST and FASTA create alignments
In an optimal alignment, non-identical characters and gaps are placed so as to
bring as many identical or similar characters as possible into columns.
Two types of sequence alignment are used:
global and local
-
In global alignment,
an attempt is made to align the entire sequences, using as many characters as
possible. The alignment is stretched over the entire sequence lengths to include
as many matching amino acids as possible, up to and including the sequence ends.
Although there is an obvious region of identity in this example (the sequence
FGKG), a global alignment may not align such regions, in order to favour
matching more amino acids along the entire sequence lengths.
LGPSTKQFGKGSSSRIWDN
|      ||||   |  |     global alignment
LNQIERSFGKGAIMRLGDA
-
Local alignment.
The alignment tends to stop at the ends of regions of identity or strong
similarity. A much higher priority is given to finding these local regions than
to extending the alignment to include more neighbouring amino acid pairs.
Dashes indicate sequence not included in the alignment. This type of alignment
favours finding conserved amino acid motifs in related protein sequences.
-------FGKG--------
       ||||            local alignment
-------FGKG--------
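The local alignment above can be reproduced with a small Smith-Waterman sketch. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration; they are not taken from the slides, and real searches would use a substitution matrix such as BLOSUM 62.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment of a and b under a simple (assumed) scoring scheme.

    Returns (score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("LGPSTKQFGKGSSSRIWDN", "LNQIERSFGKGAIMRLGDA")
print(score, top, bottom)  # 4 FGKG FGKG
```

Because every cell is clamped at zero, the alignment is free to start and stop wherever the score is highest, which is why only the conserved FGKG motif is reported.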
-
Global alignment is appropriate for sequences that are
known to share similarity over their whole length.
-
Algorithm FASTA
Step 1 Preprocessing
finds regions of similarity by making an index showing all of the amino acid
positions for each sequence, i.e. a C at position 1, an S at position 2, etc.
Step 2 Heuristic searching
these indexes are used to find whether a row of the same characters is found
in the same order in the two sequences being compared.
If these rows are long enough, the sequences are similar.
An alignment is shown with the best matched sequences in the database.
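The two steps can be sketched as a position index followed by a count of hits per diagonal. This is a minimal illustration, not the FASTA implementation: the function names and the default `ktup=2` are assumptions (with `ktup=1` the index is the per-residue index described above).

```python
from collections import Counter, defaultdict


def build_index(seq, ktup=2):
    """Step 1: map every k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - ktup + 1):
        index[seq[pos:pos + ktup]].append(pos)
    return index


def diagonal_hits(query, subject, ktup=2):
    """Step 2: count shared k-tuples per diagonal (query position - subject position).

    Many hits on one diagonal mean a row of the same characters occurs in the
    same order in both sequences.
    """
    index = build_index(query, ktup)
    hits = Counter()
    for s_pos in range(len(subject) - ktup + 1):
        for q_pos in index.get(subject[s_pos:s_pos + ktup], ()):
            hits[q_pos - s_pos] += 1
    return hits


hits = diagonal_hits("LGPSTKQFGKGSSSRIWDN", "LNQIERSFGKGAIMRLGDA")
print(hits.most_common(1))  # [(0, 3)]: FG, GK and KG all lie on diagonal 0
```

Only the index lookups touch the database sequence, so no full comparison matrix is ever built; that is where the speed comes from.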
-
FastA
[Figure: FASTA output with the PAM250 matrix, top 10 sequences]
init1: scores used to rank the database sequences
initn: sum of init1 scores minus a penalty for gaps (20)
opt: NW opt score
-
Characteristics of FASTA:
Local alignments: FASTA tries to find patches of regional similarity, rather
than trying to find the best alignment between your entire query and an entire
database sequence.
Gapped alignments: alignments generated with FASTA can contain gaps.
Rapid.
Heuristic: FASTA is not guaranteed to find the best alignment between your
query and the database; it may miss matches. This is because it uses a strategy
which is expected to find most matches, but sacrifices complete sensitivity in
order to gain speed.
-
Scores
initn = init1 = opt indicates 100% identity over the matched stretch.
initn > init1 indicates that there is more than one matching region in the
database sequence, with poorly matching separating region(s).
opt > initn shows that the matching regions are greatly improved by the
addition of gaps in one or both of the sequences. Such differences in score are
indicative of non-homologous sequences.
opt < initn: FASTA only optimizes within a narrow band along the same diagonal
as the init1 region (the best single region of match). If any of the other
(n-1) regions lie outside the band, they are excluded from the optimized score,
i.e. there is too large a separation between the good scoring regions for FASTA
to join them.
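As a reading aid, the score relations on this slide can be written as a small lookup. The function name, the wording of the notes and the example score values are hypothetical; FASTA itself does not provide such a function.

```python
def interpret_fasta_scores(init1, initn, opt):
    """Return the interpretations from the slide that apply to one score triple."""
    notes = []
    if init1 == initn == opt:
        notes.append("single region, identical over the matched stretch")
    if initn > init1:
        notes.append("more than one matching region, separated by poor matches")
    if opt > initn:
        notes.append("adding gaps greatly improves the matching regions")
    if opt < initn:
        notes.append("some regions lie outside the band and were not joined")
    return notes


# Hypothetical scores: two good regions that are too far apart to be joined.
for note in interpret_fasta_scores(init1=80, initn=150, opt=110):
    print(note)
```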
-
With the BLAST algorithm, a substitution matrix is used during all phases of
protein searches (BLASTP, BLASTX, TBLASTN).
FASTA uses a substitution matrix only for the extension phase. This is in
contrast to BLAST, which uses a matrix for both phases. To reduce the penalty
of using a substitution matrix for only the second phase, set the k-tuple
parameter to a low value (1). However, this comes with a significant speed
penalty (for you).
Finding a local alignment: BLAST algorithm
-
Algorithms BLAST
BLAST makes an index of the query sequence showing the positions of each
possible amino acid triplet, i.e. CCC occurs at position 1, YTL at position 23,
etc.
Triplets are ordered according to how often they will occur by chance in two
related proteins, the most rarely found being the most significant.
A matrix (for instance BLOSUM62) is used to determine these significances.
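A rough sketch of the triplet index and its ordering follows. As a simplifying assumption, each triplet is ranked only by the sum of the BLOSUM62 self-match (diagonal) scores of its residues, so that rare residues such as W and C rank highest; BLAST itself scores neighbouring words against the full matrix, which is not reproduced here. The function names and the example query are made up.

```python
from collections import defaultdict

# Diagonal (self-match) scores of the BLOSUM62 matrix.
BLOSUM62_SELF = {
    "A": 4, "R": 5, "N": 6, "D": 6, "C": 9, "Q": 5, "E": 5, "G": 6, "H": 8, "I": 4,
    "L": 4, "K": 5, "M": 5, "F": 6, "P": 7, "S": 4, "T": 5, "W": 11, "Y": 7, "V": 4,
}


def index_triplets(query, w=3):
    """Map each word of length w in the query to its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(query) - w + 1):
        index[query[pos:pos + w]].append(pos)
    return index


def rank_triplets(index):
    """Order words from most to least significant (assumed: by self-match score)."""
    scored = [(word, sum(BLOSUM62_SELF[aa] for aa in word), positions)
              for word, positions in index.items()]
    return sorted(scored, key=lambda item: -item[1])


for word, score, positions in rank_triplets(index_triplets("CCCAYTL")):
    print(word, score, positions)
# First line printed: CCC 27 [0]
```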
-
Each database sequence is searched for these unusual triplets first.
An alignment is shown with the best matched sequences in the database
This is a heuristic (tried-and-true) method which usually works well.
-
BLAST (Basic Local Alignment Search Tool).
Characteristics:
Local alignments: BLAST tries to find patches of regional similarity, rather
than trying to find the best alignment between your entire query and an entire
database sequence.
Ungapped alignments: alignments generated with the original BLAST algorithm do
not contain gaps (later versions, starting with gapped BLAST, allow them).
BLAST's speed and statistical model depend on this, but in theory it reduces
sensitivity. However, BLAST will report multiple local alignments between your
query and a database sequence.
-
Rapid: BLAST is extremely fast.
Heuristic: BLAST is not guaranteed to find the best alignment between your
query and the database; it may miss matches. This is because it uses a strategy
which is expected to find most matches, but sacrifices complete sensitivity in
order to gain speed.
However, in practice few biologically significant matches that can be found
with other sequence search programs are missed by BLAST.
BLAST searches the database in two phases. First it looks for short
subsequences which are likely to produce significant matches, and then it tries
to extend these subsequences.
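The two phases can be sketched as exact word seeding followed by ungapped extension. This is a simplified illustration under stated assumptions: seeds are exact 3-letter matches, scoring is +1/-1 instead of a substitution matrix, and extension stops once the running score falls `x_drop` below the best score seen. All names and parameter values are hypothetical.

```python
def find_seeds(query, subject, w=3):
    """Phase 1: positions (q, s) where a word of length w occurs in both sequences."""
    index = {}
    for q in range(len(query) - w + 1):
        index.setdefault(query[q:q + w], []).append(q)
    return [(q, s)
            for s in range(len(subject) - w + 1)
            for q in index.get(subject[s:s + w], [])]


def extend_ungapped(query, subject, q, s, w=3, match=1, mismatch=-1, x_drop=2):
    """Phase 2: extend a seed in both directions without gaps.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    def walk(qi, si, step):
        best = run = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            run += match if query[qi] == subject[si] else mismatch
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run >= x_drop:
                break  # score dropped too far below the best: stop extending
            qi += step
            si += step
        return best, best_len

    right_score, right_len = walk(q + w, s + w, +1)
    left_score, left_len = walk(q - 1, s - 1, -1)
    return w * match + left_score + right_score, q - left_len, q + w + right_len


query, subject = "LGPSTKQFGKGSSSRIWDN", "LNQIERSFGKGAIMRLGDA"
for q, s in find_seeds(query, subject):
    score, start, end = extend_ungapped(query, subject, q, s)
    print((q, s), score, query[start:end])
```

Both seeds (FGK and GKG) extend to the same FGKG segment, which shows why overlapping hits on one diagonal are usually merged before reporting.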
-
BLASTP: searches a protein sequence against a protein database.
BLASTN: searches a nucleotide sequence against a nucleotide database.
TBLASTN: searches a protein sequence against a nucleotide database, by
translating each database nucleotide sequence in all 6 reading frames.
BLASTX: searches a nucleotide sequence against a protein database, by first
translating the query nucleotide sequence in all 6 reading frames.
Especially good for EST databases.
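The six-frame translation used by TBLASTN and BLASTX can be sketched as follows. The sketch assumes the standard genetic code and an uppercase or lowercase A/C/G/T alphabet; the function names are illustrative, and codons containing other symbols are translated to X.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the translations of frames +1..+3 and -1..-3, keyed by frame name."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frames("ATGGCC"))
# {'+1': 'MA', '+2': 'W', '+3': 'G', '-1': 'GH', '-2': 'A', '-3': 'P'}
```

Each of the six protein strings is then searched like an ordinary protein sequence, which is why translated searches cost roughly six times as much as a single protein comparison.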
-
Finally, some rules of thumb: Homology
Protein sequence comparisons typically double the evolutionary
look-back time over DNA sequence comparisons.
The requirement for a common folded structure in homologous
proteins usually causes these proteins to be similar over the
entire length of the gene product (or domain). Therefore, most
sequences that share statistically significant similarity throughout
their entire lengths are homologous.
Matches that are more than 50% identical in a 20-40 amino acid region occur
frequently by chance.
-
Distantly related homologs may lack significant similarity. Two or more
homologous sequences may have very few absolutely conserved residues.
If homology has been inferred due to significant similarity scores between
two proteins, A and B, that align over their entire lengths, and between protein
B and a third protein, C, then proteins A and C must also be homologous, even
if they share no significant similarity.
Low complexity regions, transmembrane regions and coiled-coil regions
frequently display significant similarity in the absence of homology. Low
complexity regions can be filtered out using the default parameters of BLAST.
Transmembrane and coiled-coil regions should be identified and masked (by
eliminating these regions from the query) by the user.
-
Significance
Results of searches using different scoring systems may be
compared directly using normalized scores.
If S is the (raw) score for a local alignment, the normalized score S'
(in bits) is calculated by the formula S' = (lambda*S - ln K) / ln 2, where
lambda and K are parameters associated with a given scoring system.
A normalized score S' is statistically significant at E value E
if it exceeds log2(N/E), where N is the size of the search space. As the
evolutionary distance between two sequences increases, the length
of a local alignment required to achieve a statistically significant
score also increases.
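The normalization can be checked numerically with a short sketch. The function names are illustrative, and the values of lambda, K, the raw score and the search space in the usage lines are assumed example numbers, not parameters taken from the slides.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalized score in bits: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value(bits, search_space):
    """Expected number of chance hits with at least this bit score: E = N / 2**S'."""
    return search_space / 2 ** bits


def significance_threshold(search_space, e):
    """Bit score that must be exceeded to reach E value e: log2(N / E)."""
    return math.log2(search_space / e)


# Assumed example values for lambda, K, raw score and search space.
bits = bit_score(raw_score=100, lam=0.267, K=0.041)
print(round(bits, 1), e_value(bits, search_space=1e9))
print(significance_threshold(search_space=1e9, e=0.01))
```

Because lambda and K are folded into S', two bit scores can be compared even when the searches used different matrices or gap penalties.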
-
Summary of previous
Global alignment is appropriate for sequences that are
known to share similarity over their whole length.
Local alignment is appropriate when the sequences may
show isolated regions of similarity, for example multiple
domains or repeats.
Local alignment is best applied when scanning a database to
find similarities or when there is no a priori knowledge that
the protein sequences are similar.
-
Database artifacts and Low complexity filters
-
Database Artifacts
Vector sequences: a number of authors have identified and
catalogued the contamination of sequence databases with vectors.
Among the studies are:
Claverie. Genomics 12:838. 1992.
Lamperti et al. Nucleic Acids Res. 20:2741. 1992.
Of particular note in this paper is the finding of short
apparent vector sequences in the middle of non-vector sequence.
The authors speculate that these may be due to errors in
the editing of sequences or to rearranged plasmids.
Lopez, Kristensen, & Prydz. Nature 355:211. 1992.
Kristensen, Lopez, & Prydz. An estimate of the sequencing
error frequency in the DNA sequence databases. DNA Seq
2:343. 1989.
-
Heterologous sequences
White, O. et al. Nucl. Acids Res. 21:2829.
Describes a statistical method to compare sequence sets (but not individual
sequences). Shows that several sets of cDNAs show bulk properties
different from human cDNAs. Sequence comparisons are used to show that
this is due to contamination of the anomalous libraries with yeast and
bacterial sequences.
Rearranged & deleted sequences
Repetitive element contamination
cDNA cloning methods may sometimes capture retroelements such as Alus.
In some cases, chimaeras between cellular transcripts and Alus may form.
Derived protein sequences which appear to contain Alu-derived sequences
were catalogued by Claverie (Genomics 12:838).
Sequencing errors / Natural polymorphisms
-
Sequence Pre-Filters
Reducing matches due to biased amino acid composition
Many amino acid sequences are highly repetitive in nature,
especially naive translations of genomic DNA. Matches
between such segments are more likely to be due to these
local amino acid composition biases than to common
descent. Filters have been developed to mask out regions
showing highly biased local composition.
SEG (Wootton & Federhen, Computers & Chemistry 17:149. 1993)
XNU (Claverie & States, Computers & Chemistry 17:191. 1993)
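The idea of masking biased regions can be illustrated with a sliding-window Shannon entropy filter. This is a deliberately simplified stand-in and not the SEG or XNU algorithm; the window size, the entropy threshold and the function names are assumptions picked for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of one window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every residue covered by a low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)


# A diverse stretch followed by a serine run: the run and the windows
# dominated by it are masked, while the start of the diverse stretch is kept.
print(mask_low_complexity("ACDEFGHIKLMN" + "S" * 12))
```

Masked residues no longer seed or score matches, so hits that rest only on composition bias drop out of the results.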
-
The end
Thank you for your attention