bioinformatics group institute of biotechnology university of helsinki swapan ‘shop’ mallick...

Post on 14-Dec-2015


Bioinformatics Group, Institute of Biotechnology, University of Helsinki

Swapan ‘Shop’ Mallick

Significance in protein analysis

Overview

The need for statistics

Example: BLOSUM

What do the scores mean?

How can you compare two scores?

Example: BLAST

Problems with BLAST

Review of Distributions

Distribution of random BLAST results

P-values and e-values

Statistics of BLAST

Summary and Conclusion

Exercise

The need for statistics

• Statistics is very important for bioinformatics.
  – It is very easy to have a computer analyze the data and give you back a result.
  – The problem is deciding whether the answer the computer gives you is any good at all.
• Questions:
  – How statistically significant is the answer?
  – What is the probability that this answer could have been obtained by chance? What does this depend on?

Basics

[Figure: population (size N) and sample (size n), with labels X and S and the captions "Descriptive statistics" and "Probability"]

Example: BLOSUM

The BLOSUM matrix assigns a probability score to each residue pair in an alignment, based on the frequency with which that pairing is known to occur within conserved blocks of related proteins.

Simple since size of population = size of sample

BLOSUM matrices are constructed from observations which lead to observed probabilities

BLOSUM substitution matrices

BLOSUM matrices are used in ‘log-odds’ form based on actually observed substitutions.

This is because:

Ease of use: ‘scores’ can simply be added (the raw probabilities would have to be multiplied)

Ease of interpretation:

S=0 : the substitution is observed as often as expected by chance

S<0 : the substitution is observed less often than expected by chance

S>0 : the substitution is observed more often than expected by chance
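The "ease of use" point can be checked numerically. The sketch below uses made-up observed/expected ratios for three aligned residue pairs (they are not taken from any real matrix) and shows that summing the log-odds scores gives the same answer as multiplying the raw ratios.

```python
import math

def alignment_log_odds(odds_ratios):
    """Sum of per-pair log-odds scores (natural log, no scaling factor)."""
    return sum(math.log(r) for r in odds_ratios)

# Hypothetical observed/expected ratios for three aligned residue pairs.
example_ratios = [4.0, 0.5, 2.0]

# Route 1: multiply the raw ratios.
product_of_ratios = math.prod(example_ratios)

# Route 2: add the log-odds scores, then exponentiate once at the end.
summed_score = alignment_log_odds(example_ratios)

print(product_of_ratios)
print(math.exp(summed_score))
```

Both printed values describe the same overall odds; the additive form is simply easier to accumulate along an alignment.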

Substitution matrices

$$S(a,b) = \frac{1}{\lambda} \ln \frac{p_{ab}}{f_a f_b}$$

Lambda is a scaling factor equal to 0.347, set so that the scores can be rounded off to sensible integers

p_ab is the observed frequency with which residues a and b are correlated because of homology

f_a f_b is the expected frequency of seeing residues a and b paired together by chance, which is just the frequency of residue a multiplied by the frequency of residue b

Source: Where did the BLOSUM62 alignment score matrix come from? Eddy S., Nat. Biotech. 22 Aug 2004

Score of amino acid a with amino acid b

Substitution matrices

$$\frac{p_{ab}}{f_a f_b} = e^{\lambda S}$$

i) S=0 : O/E ratio = 1

ii) Compare S=5 and S=10 : O/E ratio = 5.7 and 32.1 respectively, because the ratio is an exponential function of the score

iii) S=-10 : O/E ratio = 0.031 ≈ 1/32

iv) Ratio of scores S1, S2 in terms of probabilities of observed/random =

$$e^{\lambda S_1} / e^{\lambda S_2} = e^{\lambda (S_1 - S_2)}$$
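The observed/expected ratios quoted above can be reproduced in a few lines. This is a minimal sketch that assumes natural logarithms and the scaling factor 0.347 given on the slide; the function name `odds_ratio` is only illustrative.

```python
import math

LAMBDA = 0.347  # scaling factor quoted on the slide for BLOSUM62

def odds_ratio(score, lam=LAMBDA):
    """Observed/expected ratio implied by a log-odds score: e^(lambda * S)."""
    return math.exp(lam * score)

for s in (0, 5, 10, -10):
    print(f"S = {s:>3}: O/E = {odds_ratio(s):.3f}")

# Comparing two scores: the ratio depends only on their difference.
print(odds_ratio(10) / odds_ratio(5), odds_ratio(10 - 5))
```

A difference of 5 score units always corresponds to the same factor (about 5.7 here), whatever the absolute scores are.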

Example: BLAST

Motivations

Exact algorithms are exhaustive but computationally expensive.

Exact algorithms are impractical for comparing a query sequence to millions of other sequences in a database (database scanning), so database scanning requires a heuristic alignment algorithm (at the cost of optimality).

Interpret BLAST results - Description

ID (GI #, refseq #, DB-specific ID #) Click to access the record in GenBank

Bit score – higher, better. Click to access the pairwise alignment

Expect value – lower, better. It gives the number of hits with a score at least this good that are expected by chance

Gene/sequence Definition

Links

Problems with BLAST

Why do results change?

How can you compare results from different BLAST tools which may report different types of values?

How are results (e.g. E-value) affected by the query?

There are _many_ values reported in the output – what do they mean?

Example: Importance of Blast statistics

But, first a review.

Review

What is a distribution?

A plot showing the frequency of a given variable or observation.


Features of a Normal Distribution

μ = mean

Symmetric distribution

Has an average or mean value at the centre

Has a characteristic width called the standard deviation (S.D. = σ)

Most common type of distribution known

Standard Deviations (Z-score)

Mean, Median & Mode

[Figure: distribution curve with the mode, median and mean marked]

Mean, Median, Mode

In a Normal Distribution the mean, mode and median are all equal

In skewed distributions they are unequal

Mean - average value, affected by extreme values in the distribution

Median - the “middlemost” value, usually lying between the mode and the mean

Mode - most common value
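A small example makes the difference between the three measures concrete. The sample below is invented for illustration and is deliberately right-skewed, so the few large values pull the mean above the median and the mode.

```python
import statistics

# Hypothetical right-skewed sample (not real data).
scores = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]

mean = statistics.mean(scores)      # average value, pulled up by 9 and 20
median = statistics.median(scores)  # middle value of the sorted sample
mode = statistics.mode(scores)      # most common value

print(f"mode = {mode}, median = {median}, mean = {mean}")
```

In a symmetric sample the three values would coincide or nearly so; here they separate in the order mode, median, mean.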

Different Distributions

Unimodal Bimodal

Other Distributions

Binomial Distribution

Poisson Distribution

Extreme Value Distribution

Binomial Distribution

1

1 1

1 2 1

1 3 3 1

1 4 6 4 1

1 5 10 10 5 1

$$P(x) = (p + q)^n$$
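The triangle above lists the coefficients obtained when (p + q)^n is expanded. As a rough sketch, assuming p is the probability of success in a single trial and q = 1 - p, the rows and the resulting binomial probabilities can be generated as follows (function names are illustrative).

```python
from math import comb

def pascal_row(n):
    """Coefficients of the expansion of (p + q)^n."""
    return [comb(n, k) for k in range(n + 1)]

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)

for n in range(6):
    print(pascal_row(n))

# The terms of the expansion sum to (p + q)^n = 1.
print(sum(binomial_pmf(k, 5, 0.3) for k in range(6)))
```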

Poisson Distribution

[Figure: Poisson distributions P(x) against x for μ = 0.1, 1, 2, 3 and 10; y-axis: proportion of samples]

$$P(x) = \frac{\mu^x e^{-\mu}}{x!}$$
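The Poisson formula is easy to evaluate directly. The sketch below assumes the usual parameterisation with a single mean μ (written `mu`) and prints the first few probabilities for the mean values shown in the figure.

```python
import math

def poisson_pmf(x, mu):
    """Probability of observing exactly x events when the mean is mu."""
    return mu**x * math.exp(-mu) / math.factorial(x)

for mu in (0.1, 1.0, 2.0, 3.0, 10.0):
    row = ", ".join(f"{poisson_pmf(x, mu):.3f}" for x in range(6))
    print(f"mu = {mu:>4}: P(0..5) = {row}")
```

For small means almost all of the probability sits at x = 0; as the mean grows the distribution spreads out and becomes more symmetric.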

Review

What is a distribution?

A plot showing the frequency of a given variable or observation.

What is a null hypothesis?

A statistician’s way of characterizing “chance.”

Generally, a mathematical model of randomness with respect to a particular set of observations.

The purpose of most statistical tests is to determine whether the observed data can be explained by the null hypothesis.


Review

Examples of null hypotheses:

Sequence comparison using shuffled sequences.

A normal distribution of log ratios from a microarray experiment.

LOD scores from genetic linkage analysis when the relevant loci are randomly sprinkled throughout the genome.

Empirical score distribution

The picture shows a distribution of scores from a real database search using BLAST.

This distribution contains scores from non-homologous and homologous pairs.

High scores from homology.

Empirical null score distribution

This distribution is similar to the previous one, but generated using a randomized sequence database.


Review

What is a p-value?

The probability of observing an effect as strong or stronger than you observed, given the null hypothesis. I.e., “How likely is this effect to occur by chance?”

Pr(x ≥ S | null)
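When the null distribution is only available empirically, for example as scores from a shuffled or randomized database, the p-value can be estimated by counting. This is a minimal sketch with invented null scores; in practice many more null scores are needed, and some authors add 1 to both numerator and denominator to avoid reporting exactly zero.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of null scores that are as strong or stronger than the observed score."""
    as_strong = sum(1 for s in null_scores if s >= observed)
    return as_strong / len(null_scores)

# Hypothetical scores obtained against randomized sequences.
null_scores = [22, 25, 27, 30, 31, 33, 35, 38, 41, 52]

print(empirical_p_value(40, null_scores))
print(empirical_p_value(60, null_scores))
```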


Review

What is the name of the distribution created by sequence similarity scores, and what does it look like?

Extreme value distribution, or Gumbel distribution.

It looks similar to a normal distribution, but it has a larger tail on the right.

[Figure: histogram with x-axis bins from <20 to >120 and y-axis from 0 to 8000]

Statistics

BLAST scores (and other local alignment scores, e.g. Smith-Waterman and BLAT) between random, unrelated sequences follow the Gumbel Extreme Value Distribution (EVD)

$$\Pr(s > S) = 1 - \exp\left(-K m n\, e^{-\lambda S}\right)$$

This is the probability of randomly encountering a score greater than S.

S: alignment score

m, n: length of the query sequence and length of the database, respectively

K, λ: parameters depending on the scoring scheme and sequence composition

Bit score:

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
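The relationship between raw score, bit score and expected number of chance hits can be sketched in code. The values of `lam` and `K` below are placeholders chosen for illustration only; the real values depend on the scoring scheme and should be read from the BLAST output. The function names are likewise hypothetical.

```python
import math

def bit_score(raw_score, lam, K):
    """Normalised score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value(raw_score, m, n, lam, K):
    """Expected number of chance hits scoring at least raw_score: K*m*n*e^(-lambda*S)."""
    return K * m * n * math.exp(-lam * raw_score)

def e_value_from_bits(bits, m, n):
    """The same quantity written in terms of the bit score: m*n*2^(-S')."""
    return m * n * 2.0 ** (-bits)

# Placeholder parameters (assumptions, not taken from the slides).
lam, K = 0.267, 0.041
m, n = 250, 50_000_000   # query length and database length
S = 120                  # raw alignment score

bits = bit_score(S, lam, K)
print(f"bit score = {bits:.1f}")
print(f"E-value from raw score = {e_value(S, m, n, lam, K):.3g}")
print(f"E-value from bit score = {e_value_from_bits(bits, m, n):.3g}")
```

The two E-value routes agree because the bit score absorbs λ and K, which is what makes bit scores comparable across scoring schemes.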

BLAST output revisited

[Figure: BLAST output annotated with the labels K, S', S, E, n and m]

From: Expasy BLAST

Review

EVD for random blast

Upper tail behaviour:

$$\Pr(s > S) \approx K m n\, e^{-\lambda S}$$

[Figure: histogram with x-axis bins from <20 to >120 and y-axis from 0 to 8000]

This quantity, K m n e^{-λS}, is the Expect value (E-value)
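The upper-tail approximation means that for small values the P-value and the E-value are nearly identical, while for large E-values the P-value saturates at 1. A short sketch, assuming the relation P = 1 - exp(-E) implied by the EVD formula:

```python
import math

def p_value_from_e(e):
    """P = 1 - exp(-E), computed with expm1 for accuracy at small E."""
    return -math.expm1(-e)

for e in (1e-10, 1e-3, 0.1, 1.0, 10.0):
    print(f"E = {e:<8g} P = {p_value_from_e(e):.6g}")
```

This is why E-values and P-values can be used interchangeably for strong hits, but not for weak ones.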

Summary

Want to be able to compare scores in sequences of different compositions or different scoring schemes

Score: S = sum(match) – sum(gap costs)

Bit score

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

E-value of bit score

$$E = m n\, 2^{-S'}$$

Score and bit score grow linearly with the length of the alignment

E-value shrinks exponentially as the bit score grows

E-value grows linearly with the product of target and query sizes.

Doubling target set size and doubling query length have the same effect on E-value
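These scaling properties follow directly from E = m n 2^(-S') and can be verified with a few lines. The bit score and sequence sizes below are arbitrary illustrative numbers.

```python
def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S') for bit score S', query length m, target size n."""
    return m * n * 2.0 ** (-bits)

base = e_value_from_bits(40.0, m=300, n=1_000_000)

one_more_bit = e_value_from_bits(41.0, m=300, n=1_000_000)
double_query = e_value_from_bits(40.0, m=600, n=1_000_000)
double_target = e_value_from_bits(40.0, m=300, n=2_000_000)

print(f"base E-value:          {base:.3g}")
print(f"one extra bit:         {one_more_bit:.3g}  (half of base)")
print(f"doubled query length:  {double_query:.3g}  (twice base)")
print(f"doubled target size:   {double_target:.3g}  (twice base)")
```

The same hit can therefore receive a different E-value simply because the database has grown, even though its bit score is unchanged.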

Conclusion

You should now be able to compare BLAST results from different databases, converting values if they are reported differently (which happens frequently)

You should now know why BLAST results might change from one day to the next, even on the same server

You should also understand how the E-value depends on query length.

Statistical rankings are reported for (almost) every database search tool. When making comparisons between databases or between sequences, it is useful to know how the statistics are derived in order to judge whether the comparisons are meaningful.

THE END

Supplemental Section

Look through: Patterns in sequences (Searching for information within sequences) - Some common problems and their solutions:

http://lepo.it.da.ut.ee./~mremm/kurs/pattern.htm

What is the structure of my sequence?

http://speedy.embl-heidelberg.de/gtsp/flowchart2.html (clickable!)
