BIOS477/877 L13 - 1
Spring 2020
BIOS 477/877
Bioinformatics and Molecular Evolution
Lecture 13
BIOS477/877 L13 - 2
TODAY'S TOPICS
➢ Assignment 4 Review
➢ BLASTP & BLASTN outputs
➢ BLAST & FASTA statistics
BIOS477/877 L13 - 3
blastp Similarity Search: Result Page
BIOS477/877 L13 - 4
blastp Similarity Search: Result Page
Phylogeny based on pairwise distances from BLAST pairwise alignments.
➜ Approximate tree. For a more accurate phylogeny, distances need to be estimated from a multiple alignment.
BIOS477/877 L13 - 5
blastp Similarity Search: Result Page
Download the BLAST result:
- BLAST search result in text format
- Sequences and alignments in FASTA format
- BLAST hit statistics in “Hit Table (csv)” [can be imported into any spreadsheet program, e.g., Excel]
BIOS477/877 L13 - 6
BLASTP results
Query coverage: proportion of the query aligned
Bit scores
E-value
BIOS477/877 L13 - 7
Nucleotide Similarity Search
megablast*: w=28 (16~256)
*This is the default search method
discontiguous megablast: w=11 (or 12); allows some mismatches
blastn: w=11 (7~15); w=7 for a short sequence
Default DB: nr/nt
BIOS477/877 L13 - 8
Discontiguous megablast
If discontiguous megablast is chosen:
Word matching based on a discontiguous pattern (template):
e.g., for coding: 1101101101101101 (w=11, t=16)
➜ mismatches are allowed at '0' positions
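To make the template idea concrete, here is a minimal sketch of how a 16-position window could be compared under the coding template from the slide. The function name `template_match` and the two example sequences are made up for illustration; this is not how BLAST implements word lookup internally.

```python
# Coding template from the slide: 11 '1' positions (w=11) over 16 columns (t=16).
CODING_TEMPLATE = "1101101101101101"


def template_match(a: str, b: str, template: str = CODING_TEMPLATE) -> bool:
    """Return True if a and b agree at every '1' position of the template.

    Positions marked '0' are ignored, so mismatches there are tolerated.
    """
    if len(a) != len(template) or len(b) != len(template):
        raise ValueError("both windows must be as long as the template")
    return all(x == y for x, y, t in zip(a, b, template) if t == "1")


# Hypothetical windows: 'subject' differs from 'query' only at '0' positions
# (every third base, which is where synonymous changes tend to fall).
query = "ATGGCTAAAGGTCTGA"
subject = "ATAGCCAAGGGACTTA"
bad_subject = "TTGGCTAAAGGTCTGA"  # differs at a '1' position

print(template_match(query, subject))      # seed accepted
print(template_match(query, bad_subject))  # seed rejected
```

The '0' positions fall on every third base, which is why this template is suggested for coding sequences.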
BIOS477/877 L13 - 9
BLASTN results
megablast (only 4 hits, all E<7e-91)
discontiguous megablast (12 hits, E<3e-111)
blastn (137 hits, E<5.3)
BIOS477/877 L13 - 10
BLASTN/BLASTX results
[blastn]
[blastx] (translated query vs. protein db)
Low-complexity regions are masked (shown in lowercase)
6 possible reading frames
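The six frames come from three codon offsets on each strand. The sketch below illustrates this, assuming the standard genetic code; the helper names and the short test sequence are hypothetical, and blastx itself does considerably more (masking, scoring, alternative code tables).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: NCBI table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    """Complement each base, then reverse the string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict:
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATGGCCTAA")  # hypothetical 9-nt query
for name, protein in frames.items():
    print(name, protein)
```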
BIOS477/877 L13 - 11
[blastp]
BLASTP results
Positive (+) scoring AA pairs (similar AA pairs)
BIOS477/877 L13 - 12
BLAST results
Click to see the BLAST search statistics
BIOS477/877 L13 - 13
BLAST results
Used to calculate the scores for alignments with gaps
BIOS477/877 L13 - 14
BLAST Statistics [blastp]
λ and K are scoring-system specific (for gapped alignments)
Normalized Score or Bit Score (S'bit):
$S'_{bit} = (\lambda S - \ln K) / \ln 2$, [$S'_{nat} = \lambda S - \ln K$]
λ = 0.267, K = 0.041: $S'_{bit} = \{0.267 \times 795 - \ln(0.041)\} / \ln 2 = 310.8$
Raw Score (S): simply based on pairwise scores & gap penalties
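The bit-score arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the formula as written above; the function names are ours, and the inputs are the λ, K, and raw score quoted on the slide.

```python
import math


def nat_score(raw_score: float, lam: float, K: float) -> float:
    """Normalized score in nats: lambda * S - ln K."""
    return lam * raw_score - math.log(K)


def bit_score(raw_score: float, lam: float, K: float) -> float:
    """Normalized score in bits: (lambda * S - ln K) / ln 2."""
    return nat_score(raw_score, lam, K) / math.log(2)


# Values from the slide: lambda = 0.267, K = 0.041, raw score S = 795.
s_bit = bit_score(795, lam=0.267, K=0.041)
print(f"{s_bit:.1f}")  # 310.8
```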
BIOS477/877 L13 - 15
BLAST Statistics[blastp]
Raw scores (S) depend on the scoring system; they cannot be compared across scoring systems.
Bit scores (S'bit) are normalized using λ and K;
→ independent of the scoring system; can be compared
BIOS477/877 L13 - 16
Pairwise alignment vs. database searching
[For a pairwise alignment]
➢ Karlin-Altschul equation (Karlin & Altschul, 1990)
$P(S \ge x) = 1 - \exp[-Kmn e^{-\lambda x}] \approx Kmn e^{-\lambda x}$
Probability of getting the alignment score S ≥ x by chance
E = NP [N: number of random alignments; used in PRSS and LALIGN]
[Figure: P(S≥x) vs. x]
[For database searching]
➢ Multiple pairwise alignments: multiple testing problem
• P(S≥x): Probability of getting the alignment score (S) ≥ x by chance from one pairwise alignment
• If P(S≥x) = 0.05, 1-P(S≥x) = 0.95
➜ 0.95 is the probability of having S<x by chance for one pairwise alignment
• For 10 alignments, $0.95^{10} \approx 0.60$ is the probability that all 10 alignments have S<x
➜ 1-0.60 = 0.40 is the probability of having S≥x by chance for at least one alignment
• For 100 alignments, $0.95^{100} \approx 0.006$ is the probability that all 100 have S<x
➜ 1-0.006 ≈ 0.99 is the probability of having S≥x by chance for at least one alignment
P = 0.05 as the significance level is not good enough if many alignments need to be tested!
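The numbers on this slide can be reproduced with a short calculation. The sketch below assumes the alignments are independent, which is the same simplification the slide makes; the function name is ours.

```python
def prob_at_least_one(p_single: float, n_alignments: int) -> float:
    """Probability that at least one of n independent alignments reaches
    S >= x by chance, when each does so with probability p_single."""
    return 1.0 - (1.0 - p_single) ** n_alignments


for n in (1, 10, 100):
    print(n, round(prob_at_least_one(0.05, n), 3))
```

With a per-alignment threshold of 0.05, the chance of at least one false positive is already about 0.40 for 10 alignments and about 0.99 for 100.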
BIOS477/877 L13 - 17
Bonferroni correction
➢ Multiple comparison correction
Instead of using Prob = α, use Prob = α/N (for N comparisons) as the threshold
• For 10 alignments, use P(S≥x) = 0.05/10 = 0.005 (instead of 0.05)
➜ $(1-0.005)^{10} \approx 0.95$ is the probability of having S<x by chance for all 10 alignments
➜ 1-0.95 = 0.05 is the probability of having S≥x by chance for at least one alignment
• For 100 alignments, use P(S≥x) = 0.05/100 = 0.0005 (instead of 0.05)
➜ $(1-0.0005)^{100} \approx 0.95$ is the probability of having S<x by chance for all 100 alignments
➜ 1-0.95 = 0.05 is the probability of having S≥x by chance for at least one alignment
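A quick numerical check of the Bonferroni correction, again assuming independent alignments. The function names are illustrative only.

```python
def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-comparison threshold alpha / N."""
    return alpha / n_comparisons


def familywise_error(per_test: float, n_comparisons: int) -> float:
    """Chance of at least one false positive among N independent tests."""
    return 1.0 - (1.0 - per_test) ** n_comparisons


alpha = 0.05
for n in (10, 100):
    threshold = bonferroni_threshold(alpha, n)
    print(n, threshold, round(familywise_error(threshold, n), 4))
```

In both cases the overall chance of a false positive stays just under 0.05, which is the point of dividing the threshold by N.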
BIOS477/877 L13 - 18
Bonferroni correction in database searching
➢ Multiple comparison correction
Instead of using Prob = α, use α' = α/N (for N comparisons) as the threshold
➜ α = N × α'
E = N × Prob
➜ E = α can be used as the threshold for multiple comparisons
• For database searching, N is the database size (the number of entries) ➜ the number of alignments
The E-value threshold can be considered a P-value threshold corrected for multiple comparisons in database searching.
BIOS477/877 L13 - 19
BLAST Statistics
➢ Karlin-Altschul equation (Karlin & Altschul, 1990)
[For a pairwise alignment]
$P = Kmn e^{-\lambda S}$ (Lec 11 slide 13)
m, n: lengths of the sequences compared
➜ m × n: search space
[For database similarity searching]
$E = Kmn e^{-\lambda S}$ (NOTE: E=NP is not used in BLAST)
m: length of the query
n: length of the database (total number of residues)
Consider a database as a single very long sequence
[Figure: search space spanned by two sequences, TTAGACGCGTA (horizontal) and ACAGAGCTA (vertical)]
E-value: the expected number of HSPs with scores ≥ S
$P = 1 - e^{-E}$ (P ≈ E if E < 0.01)
➜ the probability of having at least one HSP with its score ≥ S
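As a rough illustration of how the E-value behaves, the sketch below evaluates E = Kmn·exp(-λS) for a few raw scores. K and λ are the blastp values quoted elsewhere in this lecture; the query length and database size are hypothetical round numbers chosen only for the example.

```python
import math


def evalue(K: float, lam: float, m: int, n: int, raw_score: float) -> float:
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


K, LAM = 0.041, 0.267       # from the lecture's blastp example
M, N = 250, 1_000_000_000   # hypothetical query and database lengths

for s in (60, 80, 100, 120):
    print(s, f"{evalue(K, LAM, M, N, s):.3g}")
```

E falls exponentially with the score: raising S by ln 2 / λ (about 2.6 raw-score units here) halves the E-value, while E grows only linearly with the search space m × n.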
BIOS477/877 L13 - 20
BLAST Statistics
➢ Karlin-Altschul equation (Karlin & Altschul, 1990)
$E = K m' n' e^{-\lambda S}$
m': effective length of the query
n': effective length of the database
$m' = m - l$
$n' = n - l \times$ (number of sequences in the database)
l: length adjustment
➜ correction for edge effects
• HSPs cannot occur too close to the search space edges.
• HSPs need to be a certain length.
[Figure: an HSP cannot start too close to the edge]
Note: Calculation methods for the length adjustment (l) and m'n' have been changed based on a new finite-size correction (FSC). See Park et al. (2012, BMC Research Notes).
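The effective-length formulas above translate directly into code. This sketch follows the slide's older definitions (m' = m - l, n' = n - l × number of sequences), not the newer finite-size correction. The length adjustment and the number of database sequences used in the example are made-up values, since the lecture does not give them.

```python
def effective_search_space(m: int, n: int, num_seqs: int, length_adj: int) -> int:
    """Effective search space m' * n' with
    m' = m - l and n' = n - l * (number of database sequences)."""
    m_eff = m - length_adj
    n_eff = n - length_adj * num_seqs
    if m_eff <= 0 or n_eff <= 0:
        raise ValueError("length adjustment is too large for these lengths")
    return m_eff * n_eff


m, n = 570, 94_578_689_328   # query and database lengths from the lecture
num_seqs = 250_000_000       # hypothetical number of database entries
length_adj = 130             # hypothetical length adjustment l

raw_space = m * n
eff_space = effective_search_space(m, n, num_seqs, length_adj)
print(raw_space, eff_space, round(eff_space / raw_space, 3))
```

Because the effective search space is smaller than m × n, an E-value computed with m' and n' is smaller than one computed from the raw lengths.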
BIOS477/877 L13 - 21
P-value, E-value, and database search
➢ P-value for pairwise alignment: $P = 1 - \exp[-Kmn e^{-\lambda x}] \approx Kmn e^{-\lambda x}$
➜ Probability of getting an alignment score ≥ x from a random pairwise comparison (m and n are the lengths of the 2 sequences compared)
➢ E-value: $E = Kmn e^{-\lambda S}$
➜ Number of alignments with a score ≥ S expected by chance from a database search
m: length of the query (or effective length, m')
n: length of the database (or effective length, n')
➢ P-value for a database search (Bonferroni corrected)
➜ the probability of having at least one HSP with its score ≥ S
➜ $P = 1 - e^{-E}$ or $E = -\ln(1-P)$
0 < P < 1; 0 < E < N (N: number of random comparisons)
(P = E/N or E = PN is used in FASTA; N: database size)
Altschul et al. (1994) & BLAST Statistics Tutorial
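The two conversions P = 1 - exp(-E) and E = -ln(1 - P) can be sketched as below. Using `math.expm1` and `math.log1p` is our choice, made so that very small E-values do not round to zero; the function names are illustrative.

```python
import math


def p_from_e(e: float) -> float:
    """P = 1 - exp(-E), computed stably for very small E."""
    return -math.expm1(-e)


def e_from_p(p: float) -> float:
    """E = -ln(1 - P), computed stably for very small P."""
    return -math.log1p(-p)


for e in (10.0, 1.0, 0.01, 1e-80):
    print(e, p_from_e(e))
```

For E below about 0.01 the two values are nearly identical, while for large E the P-value saturates toward 1. A naive `1 - math.exp(-1e-80)` would return exactly 0.0 in double precision, which is why the stable forms are used here.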
BIOS477/877 L13 - 22
BLAST Statistics [blastp HSP]
λ = 0.267, K = 0.041, S = 795: $S'_{bit} = \{0.267 \times 795 - \ln(0.041)\} / \ln 2 = 310.8$
Expect: $E = K m' n' e^{-\lambda S}$ or $m' n' e^{-S'_{nat}}$ or $m' n' 2^{-S'_{bit}}$
$E = 0.041 \times m' \times n' \times e^{-0.267 \times 795}$ [from the raw score]
$E = m' \times n' \times 2^{-310.8}$ [from the bit score]
m' × n': effective search space
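The three forms of E are the same quantity written in different units. A short derivation, using only the definitions of the normalized scores given earlier in this lecture:

```latex
\begin{aligned}
S'_{nat} &= \lambda S - \ln K
  &&\Rightarrow\quad e^{-S'_{nat}} = K\, e^{-\lambda S} \\
S'_{bit} &= \frac{S'_{nat}}{\ln 2}
  &&\Rightarrow\quad 2^{-S'_{bit}} = e^{-S'_{bit}\ln 2} = e^{-S'_{nat}} \\
E &= K\, m' n'\, e^{-\lambda S} = m' n'\, e^{-S'_{nat}} = m' n'\, 2^{-S'_{bit}}
\end{aligned}
```

K and λ are absorbed into the normalized score, so an E-value can be computed from a bit score and the search space alone.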
BIOS477/877 L13 - 23
BLAST search summary statistics
Scoring matrix & gap penalties
Word size (W)
Neighborhood threshold (T)
Length separating two HSPs to trigger extension (A: two-hit method)
λ, K, and H are pre-estimated for each combination of scoring matrix and gap penalties for gapped alignment
Query: P45897.1
Query length: 570 amino acids
BIOS477/877 L13 - 24
BLAST search summary statistics
Query: P45897.1
Query length: 570 amino acids
m = 570 (length of query)
n: length of database
BIOS477/877 L13 - 25
BLAST Statistics [blastp HSP]
λ = 0.267, K = 0.041, S = 795: $S'_{bit} = \{0.267 \times 795 - \ln(0.041)\} / \ln 2 = 310.8$
Expect: $E = K m' n' e^{-\lambda S}$ or $m' n' e^{-S'_{nat}}$ or $m' n' 2^{-S'_{bit}}$
$E = 0.041 \times m' \times n' \times e^{-0.267 \times 795}$ [from the raw score]
$E = m' \times n' \times 2^{-310.8}$ [from the bit score]
E = 1.44E-80 ($1.44 \times 10^{-80}$)
Without length adjustment: m = 570, n = 94,578,689,328
$E = 0.041 \times 570 \times 94{,}578{,}689{,}328 \times e^{-0.267 \times 795} = 1.44 \times 10^{-80}$
$E = 570 \times 94{,}578{,}689{,}328 \times 2^{-310.8} = 1.48 \times 10^{-80}$
$P = 1 - e^{-E} = 1 - \exp(-1.44 \times 10^{-80}) \approx 0$ (P ≈ E if E < 0.01)
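The two E-value calculations on this slide can be reproduced numerically. The sketch below uses the slide's values without length adjustment. It also suggests that the small gap between 1.44E-80 and 1.48E-80 comes from rounding the bit score to 310.8, since the unrounded bit score gives the same E as the raw score.

```python
import math

K, LAM = 0.041, 0.267   # Karlin-Altschul parameters from the slide
RAW_SCORE = 795         # raw score S
M = 570                 # query length
N = 94_578_689_328      # database length (total residues)

# E from the raw score: K * m * n * exp(-lambda * S)
e_raw = K * M * N * math.exp(-LAM * RAW_SCORE)

# E from the bit score: m * n * 2^(-S'bit)
s_bit = (LAM * RAW_SCORE - math.log(K)) / math.log(2)
e_bit_exact = M * N * 2.0 ** (-s_bit)
e_bit_rounded = M * N * 2.0 ** (-310.8)

# P = 1 - exp(-E); expm1 keeps precision for tiny E
p_value = -math.expm1(-e_raw)

print(f"{e_raw:.2e} {e_bit_exact:.2e} {e_bit_rounded:.2e} {p_value:.2e}")
```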
BIOS477/877 L13 - 26
BLAST search set vs. format option
[Before search] Restrict a search to the selected organism
Search space will be limited
$E = Kmn e^{-\lambda S}$
➜ E-values become smaller
[After search] Restrict the results shown to a selected organism
Search space is not affected
BIOS477/877 L13 - 27
BLAST search size and format options
Search is not limited; results are filtered to show "Archaea" sequences
Search space is not affected
BIOS477/877 L13 - 28
BLAST search size and format options
Search is limited to "Archaea" sequences
Archaea (taxid:2157)
BIOS477/877 L13 - 29
BLAST search size and format options
Search is limited
Search is NOT limited; results are filtered
(Database size is 72 times larger)
(E-value is 100 times larger)
[Two hit tables with columns: Score, Query cov, E-value, % ident]
$E = Kmn e^{-\lambda S}$
E-value is affected by the database size!
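A small sketch of why the same hit gets a different E-value in the two searches. Under the simplified formula E = Kmn·exp(-λS), with the score held fixed, E scales in direct proportion to the database length n. The database sizes below are hypothetical; only the 72-fold ratio is taken from the slide.

```python
import math


def evalue(K: float, lam: float, m: int, n: int, raw_score: float) -> float:
    """E = K * m * n * exp(-lambda * S), with no length adjustment."""
    return K * m * n * math.exp(-lam * raw_score)


K, LAM = 0.041, 0.267            # blastp parameters quoted in this lecture
M, SCORE = 570, 100              # query length from the lecture; hypothetical score
N_RESTRICTED = 1_000_000_000     # hypothetical size of the restricted database
N_FULL = 72 * N_RESTRICTED       # 72 times larger, as on the slide

e_restricted = evalue(K, LAM, M, N_RESTRICTED, SCORE)
e_full = evalue(K, LAM, M, N_FULL, SCORE)
print(f"{e_restricted:.3g} {e_full:.3g} ratio={e_full / e_restricted:.0f}")
```

In this simplified model the E-value ratio equals the database size ratio exactly. The figures on the slide come from real searches, where effective lengths may also differ, so they need not match to the same precision.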
BIOS477/877 L13 - 30
FASTA
http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml (also includes SSEARCH)
http://www.ebi.ac.uk/Tools/sss/fasta/ (also includes SSEARCH)
With graphic output
Results can be obtained through email
http://fasta.genome.jp/