![Page 1: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/1.jpg)
Bioinformatics Group, Institute of Biotechnology, University of Helsinki
Swapan ‘Shop’ Mallick
Significance in protein analysis
![Page 2: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/2.jpg)
Overview
The need for statistics
Example: BLOSUM
What do the scores mean?
How can you compare two scores?
Example: BLAST
Problems with BLAST
Review of Distributions
Distribution of random BLAST results
P-values and e-values
Statistics of BLAST
Summary and Conclusion
Exercise
![Page 3: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/3.jpg)
The need for statistics
• Statistics is very important for bioinformatics.
– It is very easy to have a computer analyze the data and give you back a result.
– The problem is to decide whether the answer the computer gives you is any good at all.
• Questions:
– How statistically significant is the answer?
– What is the probability that this answer could have been obtained by chance? What does this depend on?
![Page 4: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/4.jpg)
Basics
[Figure: a population (size N) and a sample (size n), with the labels X and S.]
![Page 5: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/5.jpg)
Basics
[Figure: a population (size N) and a sample (size n), with the label X and the annotations "Descriptive statistics" and "Probability".]
![Page 6: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/6.jpg)
Example: BLOSUM
The BLOSUM matrix assigns a probability score for each residue pair in an alignment based on:
the frequency with which that pairing is known to occur within conserved blocks of related proteins.
Simple since size of population = size of sample
BLOSUM matrices are constructed from observations which lead to observed probabilities
![Page 7: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/7.jpg)
BLOSUM substitution matrices
BLOSUM matrices are used in ‘log-odds’ form based on actually observed substitutions.
This is because:
Ease of use: scores can simply be added (the raw probabilities would have to be multiplied).
Ease of interpretation:
S = 0 : the substitution is observed exactly as often as expected by chance
S < 0 : the substitution is observed less often than expected by chance
S > 0 : the substitution is observed more often than expected by chance
![Page 8: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/8.jpg)
Substitution matrices
$$ S(a,b) = \frac{1}{\lambda} \log \frac{p_{ab}}{f_a f_b} $$
Lambda is a scaling factor equal to 0.347, set so that the scores can be rounded off to sensible integers
p_ab is the observed frequency with which residues a and b are aligned to each other because of homology
fafb is the expected frequency of seeing residues a and b paired together, which is just the product of the frequency of residue a multiplied by the frequency of residue b
Source: Where did the BLOSUM62 alignment score matrix come from? Eddy S., Nat. Biotech. 22 Aug 2004
Score of amino acid a with amino acid b
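As a rough illustration of the log-odds definition on this slide, S(a,b) = (1/λ) log(p_ab / (f_a f_b)), the sketch below computes one score. The function name `log_odds_score` and the three frequencies are hypothetical and chosen only for illustration; they are not taken from BLOSUM62. The natural logarithm is assumed, to match λ = 0.347.

```python
import math

LAMBDA = 0.347  # scaling factor quoted on the slide

def log_odds_score(p_ab: float, f_a: float, f_b: float, lam: float = LAMBDA) -> float:
    """Return S(a,b) = (1/lambda) * ln(p_ab / (f_a * f_b))."""
    return math.log(p_ab / (f_a * f_b)) / lam

# Hypothetical frequencies, for illustration only (not real BLOSUM data).
p_ab = 0.02            # observed frequency of the aligned pair (a, b)
f_a, f_b = 0.05, 0.10  # background frequencies of residues a and b

raw = log_odds_score(p_ab, f_a, f_b)
print(f"observed/expected ratio: {p_ab / (f_a * f_b):.2f}")
print(f"unrounded score: {raw:.2f}, rounded to integer: {round(raw)}")
```

With these made-up numbers the pair is seen about four times more often than chance predicts, so the score comes out positive.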
![Page 9: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/9.jpg)
Substitution matrices
$$ \frac{p_{ab}}{f_a f_b} = e^{\lambda S} $$
![Page 10: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/10.jpg)
![Page 11: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/11.jpg)
i) S = 0 : O/E ratio = 1
ii) Compare S = 5 and S = 10. The ratio is an exponential function of the score: O/E = 5.7 for S = 5 and 32.1 for S = 10.
iii) S = -10 : O/E ratio = 0.031 ≈ 1/32.
iv) Ratio of scores S1, S2 in terms of probabilities of observed/random:
$$ \frac{e^{\lambda S_1}}{e^{\lambda S_2}} = e^{\lambda (S_1 - S_2)} $$
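The figures on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming the relation O/E = e^(λS) with λ = 0.347 from the substitution-matrix slides; the function name `odds_ratio` is made up for the example.

```python
import math

LAMBDA = 0.347  # scaling factor from the substitution-matrix slides

def odds_ratio(score: float, lam: float = LAMBDA) -> float:
    """Observed/expected ratio implied by a log-odds score: exp(lambda * S)."""
    return math.exp(lam * score)

for s in (0, 5, 10, -10):
    print(f"S = {s:>3}: O/E = {odds_ratio(s):.3f}")

# Comparing two scores: the ratio of their odds depends only on the difference.
s1, s2 = 10, 5
print(odds_ratio(s1) / odds_ratio(s2), odds_ratio(s1 - s2))
```

Doubling a score therefore squares the odds ratio rather than doubling it, which is why S = 10 is far more than "twice as good" as S = 5.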
![Page 15: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/15.jpg)
Example: BLAST
Motivations
Exact algorithms are exhaustive but computationally expensive.
Exact algorithms are impractical for comparing a query sequence to millions of other sequences in a database (database scanning), so database scanning requires a heuristic alignment algorithm (at the cost of optimality).
![Page 16: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/16.jpg)
Interpret BLAST results - Description
ID (GI #, refseq #, DB-specific ID #) Click to access the record in GenBank
Bit score: the higher, the better. Click to access the pairwise alignment
Expect value: the lower, the better. It indicates how likely it is that this is a random hit (the number of hits scoring at least this well that are expected by chance)
Gene/sequence Definition
Links
![Page 17: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/17.jpg)
Problems with BLAST
Why do results change?
How can you compare results from different BLAST tools which may report different types of values?
How are results (e.g. the E-value) affected by the query?
There are _many_ values reported in the output – what do they mean?
![Page 18: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/18.jpg)
Example: Importance of Blast statistics
But, first a review.
![Page 19: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/19.jpg)
Review
What is a distribution?
A plot showing the frequency of a given variable or observation.
![Page 20: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/20.jpg)
![Page 21: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/21.jpg)
Features of a Normal Distribution
μ = mean
Symmetric distribution
Has an average or mean value at the centre
Has a characteristic width called the standard deviation (S.D. = σ)
Most common type of distribution known
![Page 22: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/22.jpg)
Standard Deviations (Z-score)
![Page 23: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/23.jpg)
Mean, Median & Mode
[Figure: a distribution with the mode, median and mean marked.]
![Page 24: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/24.jpg)
Mean, Median, Mode
In a Normal Distribution the mean, mode and median are all equal
In skewed distributions they are unequal
Mean - average value, affected by extreme values in the distribution
Median - the “middlemost” value; in a skewed unimodal distribution it usually lies between the mode and the mean
Mode - most common value
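To make the three measures concrete, here is a small sketch using Python's standard `statistics` module. The sample values are hypothetical and were picked to be right-skewed, so that the mode, median and mean come out in that order.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, one is extreme.
scores = [20, 22, 22, 22, 24, 25, 27, 30, 38, 70]

mean = statistics.mean(scores)      # average, pulled upward by the value 70
median = statistics.median(scores)  # middle of the sorted values
mode = statistics.mode(scores)      # most common value

print(f"mode = {mode}, median = {median}, mean = {mean}")

# Dropping the extreme value moves the mean much more than the median.
trimmed = scores[:-1]
print(f"without 70: median = {statistics.median(trimmed)}, "
      f"mean = {statistics.mean(trimmed):.2f}")
```

The single extreme value shifts the mean noticeably while the median barely moves, which is the sensitivity to extreme values noted above.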
![Page 25: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/25.jpg)
Different Distributions
[Figure: a unimodal and a bimodal distribution.]
![Page 26: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/26.jpg)
Other Distributions
Binomial Distribution
Poisson Distribution
Extreme Value Distribution
![Page 27: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/27.jpg)
Binomial Distribution
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
$$ P(x) = (p + q)^n $$
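The link between Pascal's triangle and the expansion of (p + q)^n can be sketched as follows. The helper name `binomial_pmf` is an assumption for this example; `math.comb` is the standard-library binomial coefficient.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n trials: C(n, x) * p^x * q^(n-x)."""
    q = 1.0 - p
    return comb(n, x) * p**x * q**(n - x)

n = 5
# The coefficients C(n, x) reproduce the last row of the triangle above.
coefficients = [comb(n, x) for x in range(n + 1)]
print(coefficients)

# The individual terms of (p + q)^n are the probabilities; they sum to 1.
probabilities = [binomial_pmf(x, n, 0.5) for x in range(n + 1)]
print(probabilities, sum(probabilities))
```

Each term of the expansion is the probability of one particular number of successes, so the terms must add up to (p + q)^n = 1.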
![Page 28: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/28.jpg)
Poisson Distribution
[Figure: Poisson distributions for μ = 0.1, 1, 2, 3 and 10; x-axis: x, y-axis: P(x), the proportion of samples.]
$$ P(x) = \frac{\mu^x e^{-\mu}}{x!} $$
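A short sketch of the Poisson formula, P(x) = μ^x e^(-μ) / x!, evaluated for the parameter values shown in the figure. Writing the mean as `mu` and the function name `poisson_pmf` are assumptions made for this example.

```python
import math

def poisson_pmf(x: int, mu: float) -> float:
    """Probability of observing exactly x events when the mean count is mu."""
    return mu**x * math.exp(-mu) / math.factorial(x)

# Tabulate P(0)..P(5) for the mean values used in the figure.
for mu in (0.1, 1, 2, 3, 10):
    row = " ".join(f"{poisson_pmf(x, mu):.3f}" for x in range(6))
    print(f"mu = {mu:>4}: {row}")
```

For small means nearly all the probability sits at x = 0; as the mean grows the peak moves to the right and the curve becomes more symmetric.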
![Page 29: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/29.jpg)
Review
What is a distribution?
A plot showing the frequency of a given variable or observation.
What is a null hypothesis?
A statistician’s way of characterizing “chance.”
Generally, a mathematical model of randomness with respect to a particular set of observations.
The purpose of most statistical tests is to determine whether the observed data can be explained by the null hypothesis.
![Page 30: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/30.jpg)
![Page 31: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/31.jpg)
Review
Examples of null hypotheses:
Sequence comparison using shuffled sequences.
A normal distribution of log ratios from a microarray experiment.
LOD scores from genetic linkage analysis when the relevant loci are randomly sprinkled throughout the genome.
![Page 32: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/32.jpg)
Empirical score distribution
The picture shows a distribution of scores from a real database search using BLAST.
This distribution contains scores from non-homologous and homologous pairs.
The high scores come from homologous pairs.
![Page 33: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/33.jpg)
Empirical null score distribution
This distribution is similar to the previous one, but generated using a randomized sequence database.
![Page 34: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/34.jpg)
![Page 35: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/35.jpg)
Review
What is a p-value?
The probability of observing an effect as strong as, or stronger than, the one you observed, given the null hypothesis. I.e., “How likely is this effect to occur by chance?”
Pr(x ≥ S | null)
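When a null distribution is available only as a list of scores (for example from a search against shuffled sequences), the p-value can be estimated by counting. The sketch below is one common way to do this, not necessarily the method used in the lecture; the function name and the toy null scores are hypothetical.

```python
def empirical_p_value(observed: float, null_scores: list[float]) -> float:
    """Estimate Pr(x >= observed | null) from scores sampled under the null."""
    hits = sum(1 for s in null_scores if s >= observed)
    # Add-one correction: the estimate is never exactly zero, since a finite
    # sample cannot show that an event is impossible.
    return (hits + 1) / (len(null_scores) + 1)

# Toy null distribution: the integers 0..99 stand in for null scores.
null_scores = list(range(100))

print(empirical_p_value(95, null_scores))    # five null scores are >= 95
print(empirical_p_value(1000, null_scores))  # no null score is this high
```

The smallest p-value this estimate can report is 1 / (number of null scores + 1), so very small p-values need either a very large null sample or an analytical model of the tail.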
![Page 36: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/36.jpg)
Review
![Page 37: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/37.jpg)
Review
What is the name of the distribution created by sequence similarity scores, and what does it look like?
Extreme value distribution, or Gumbel distribution.
It looks similar to a normal distribution, but it has a larger tail on the right.
[Figure: histogram of scores; x-axis: score bins from <20 to >120, y-axis: counts from 0 to 8000.]
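The longer right tail can be seen numerically by drawing samples from a Gumbel distribution. This is a minimal sketch using inverse-CDF sampling; the location and scale values (40 and 8) are arbitrary assumptions, not parameters fitted to any BLAST search.

```python
import math
import random
import statistics

def sample_gumbel(loc: float, scale: float, rng: random.Random) -> float:
    """Draw one value from a Gumbel (extreme value) distribution by inverting its CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() can return 0.0 but never 1.0
        u = rng.random()
    return loc - scale * math.log(-math.log(u))

rng = random.Random(42)    # fixed seed so the run is repeatable
LOC, SCALE = 40.0, 8.0     # hypothetical location (the mode) and scale
samples = [sample_gumbel(LOC, SCALE, rng) for _ in range(20000)]

print(f"mode (location) = {LOC}")
print(f"median = {statistics.median(samples):.2f}")
print(f"mean   = {statistics.mean(samples):.2f}")
```

The mean lands above the median, which in turn lands above the mode, the signature of a distribution skewed to the right.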
![Page 38: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/38.jpg)
Statistics
Scores from BLAST (and from other local alignment methods, i.e. Smith-Waterman and BLAT) between random, unrelated sequences follow the Gumbel extreme value distribution (EVD):
$$ \Pr(s > S) = 1 - \exp\left(-K m n\, e^{-\lambda S}\right) $$
This is the probability of randomly encountering a score greater than S.
S: alignment score
m, n: query sequence length and database length, respectively
K, λ: parameters depending on the scoring scheme and the sequence composition
Bit score: $S' = \dfrac{\lambda S - \ln K}{\ln 2}$
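The quantities on this slide can be tied together in a few lines. The sketch assumes the standard relations S' = (λS - ln K) / ln 2, E = K m n e^(-λS) and Pr(s > S) = 1 - e^(-E). The function names are made up for the example, and the values of λ, K, m, n and S are illustrative assumptions only; in practice λ and K are taken from the BLAST output.

```python
import math

def bit_score(raw: float, lam: float, k: float) -> float:
    """Normalised score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)

def e_value_from_raw(raw: float, lam: float, k: float, m: float, n: float) -> float:
    """Expected number of chance hits scoring at least S: E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw)

def e_value_from_bits(bits: float, m: float, n: float) -> float:
    """The same E-value expressed through the bit score: E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)

def p_value(e: float) -> float:
    """Pr(s > S) = 1 - exp(-E); expm1 keeps precision when E is tiny."""
    return -math.expm1(-e)

# Illustrative values only (assumptions, not taken from a real search).
lam, k = 0.267, 0.041
m, n = 250, 5_000_000   # query length, database length
raw = 100               # raw alignment score S

bits = bit_score(raw, lam, k)
print(f"bit score S' = {bits:.1f}")
print(f"E from raw score  = {e_value_from_raw(raw, lam, k, m, n):.3e}")
print(f"E from bit score  = {e_value_from_bits(bits, m, n):.3e}")
print(f"p-value           = {p_value(e_value_from_bits(bits, m, n)):.3e}")
```

Both routes give the same E-value, which is the point of the bit score: once λ and K are folded in, results from different scoring schemes become directly comparable.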
![Page 39: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/39.jpg)
BLAST output revisited
[Figure: annotated BLAST output showing where K, S′, S, E, n and m are reported.]
From: Expasy BLAST
![Page 40: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/40.jpg)
Review
EVD for random blast
Upper tail behaviour: $\Pr(s > S) \approx K m n\, e^{-\lambda S}$
[Figure: histogram of scores from a random BLAST search; x-axis: score bins from <20 to >120, y-axis: counts from 0 to 8000.]
This is the EXPECT value (E-value): $E = K m n\, e^{-\lambda S}$
![Page 41: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/41.jpg)
Summary
We want to be able to compare scores across sequences of different compositions and across different scoring schemes.
Score: S = sum(match) - sum(gap costs)
Bit score:
$$ S' = \frac{\lambda S - \ln K}{\ln 2} $$
E-value from the bit score:
$$ E = m n\, 2^{-S'} $$
Score and bit score grow linearly with the length of the alignment.
The E-value shrinks exponentially as the bit score grows: each additional bit halves it.
The E-value grows linearly with the product of target and query sizes.
Doubling the target set size and doubling the query length have the same effect on the E-value.
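The scaling rules in this summary follow directly from E = m n 2^(-S'), and a short sketch makes them easy to check. The function names and the example numbers are hypothetical. The rescaling helper assumes the same query, scoring scheme and bit score, and ignores the length corrections that real BLAST implementations apply.

```python
def e_value(bits: float, m: float, n: float) -> float:
    """E = m * n * 2^(-S') for query length m and database length n."""
    return m * n * 2.0 ** (-bits)

def rescale_e_value(e: float, old_n: float, new_n: float) -> float:
    """Convert an E-value to a database of a different size (same query, same bit score)."""
    return e * new_n / old_n

bits, m, n = 50.0, 200, 1_000_000  # hypothetical search
base = e_value(bits, m, n)

print(f"baseline E         = {base:.3e}")
print(f"one more bit       = {e_value(bits + 1, m, n):.3e}")  # half the baseline
print(f"database twice big = {e_value(bits, m, 2 * n):.3e}")  # double the baseline
print(f"query twice long   = {e_value(bits, 2 * m, n):.3e}")  # also double
print(f"rescaled to 10x db = {rescale_e_value(base, n, 10 * n):.3e}")
```

This is also why the same hit can get a different E-value from one day to the next: if the database has grown, n is larger and E grows with it, even though the alignment and its bit score are unchanged.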
![Page 48: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/48.jpg)
Conclusion
You should now be able to compare BLAST results from different databases, converting values if they are reported differently (which happens frequently)
You should now know why BLAST results might change from one day to the next, even on the same server
You should also understand how the E-value depends on query length.
Statistical rankings are reported for (almost) every database search tool. When making comparisons between databases or between sequences, it is useful to know how the statistics are derived, so that you can judge whether the comparisons are meaningful.
![Page 49: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/49.jpg)
THE END
![Page 50: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/50.jpg)
Supplemental Section
![Page 51: Bioinformatics Group Institute of Biotechnology University of Helsinki Swapan ‘Shop’ Mallick Significance in protein analysis](https://reader035.vdocuments.mx/reader035/viewer/2022062515/56649f425503460f94c62b23/html5/thumbnails/51.jpg)
Look through: Patterns in sequences (Searching for information within sequences) - Some common problems and their solutions:
http://lepo.it.da.ut.ee./~mremm/kurs/pattern.htm
What is the structure of my sequence?
http://speedy.embl-heidelberg.de/gtsp/flowchart2.html (clickable!)