Page 1: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Statistics for Biologists

Using Microsoft Excel

Luisa Cutillo

cutillo@tigem.it

http://bioinformatics.tigem.it

Page 2: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Class Topics:

Basic concepts and practice in excel

• Hypothesis Testing Methodology

• p-Value Approach to Hypothesis Testing

• Comparative Statistics examples (T-Test, Chi squared)

• Multiple Hypothesis Testing (FDR)

• Descriptive and association statistics

Page 3: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

What is a Hypothesis of a Test?

• A hypothesis is an assumption about a population parameter.
– A parameter is a characteristic of the population, like its mean or variance.
– The parameter must be identified before analysis.

I assume the mean AGE of this class is 50!!!

Am I correct? TEST IT!

Page 4: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

The Null Hypothesis, H0

• States the (numerical) assumption to be tested, e.g. our class mean age is 50 (H0: µ = 50)

• Begin with the assumption that the null hypothesis is TRUE (similar to the notion of innocent until proven guilty)

The null hypothesis may or may not be rejected, but our aim is to REJECT the null hypothesis!

Page 5: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

The Alternative Hypothesis, H1

• Is the opposite of the null hypothesis, e.g. the average age of our class is different from 50 (H1: µ ≠ 50)

• Is generally the hypothesis that is believed to be true by the researcher!

Page 6: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Identify the Problem

• Steps:
– State the Null Hypothesis
– State its opposite, the Alternative Hypothesis

• Hypotheses are mutually exclusive & exhaustive

• Sometimes it is easier to form the alternative hypothesis first.

Page 7: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Hypothesis Testing Process

[Diagram: assume the population mean age is 50 (null hypothesis). Draw a sample from the population: the sample mean is 20. Is that likely if the null hypothesis is true? No, not likely! So REJECT the null hypothesis.]

Page 8: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

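Reason for Rejecting H0

[Figure: sampling distribution of the sample mean under H0, centred on the hypothesized population mean µ = 50, with the observed sample mean (20) marked far out in the tail.]

Our sample mean (20) falls in the tails! It's not likely that we would observe it if µ = 50, so we reject the null hypothesis that µ = 50.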
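As a rough illustration of the reasoning above, the sketch below computes a one-sample t statistic for the class-age example. The ages are invented for the illustration, and the critical value 2.262 is the two-tailed 5% value for 9 degrees of freedom read from a standard t table.

```python
import math
import statistics

# Hypothetical ages for a class of 10 students (illustrative data only).
ages = [19, 21, 20, 22, 18, 20, 21, 19, 20, 20]
mu0 = 50  # population mean assumed under H0

n = len(ages)
sample_mean = statistics.mean(ages)
sample_sd = statistics.stdev(ages)       # sample standard deviation (n-1 denominator)
standard_error = sample_sd / math.sqrt(n)

# t statistic: distance of the sample mean from mu0, in standard errors.
t_stat = (sample_mean - mu0) / standard_error

# Two-tailed critical value for alpha = 0.05 and n-1 = 9 degrees of freedom,
# taken from a standard t table.
t_critical = 2.262
reject_h0 = abs(t_stat) > t_critical

print(f"mean = {sample_mean}, t = {t_stat:.1f}, reject H0: {reject_h0}")
```

With data like these the sample mean lies dozens of standard errors away from 50, which is what "falls in the tails" means in practice.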

Page 9: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Level of Significance, α

• Defines the rejection region: α is the “area” of the rejection region under the sampling distribution

• A typical value of α is 0.05

• It provides the critical value(s) of the test

Page 10: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

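Level of Significance, α and the Rejection Region

One tail (left) test: H0: µ ≥ 0, H1: µ < 0. The rejection region is the left tail, with area α below the critical value.

One tail (right) test: H0: µ ≤ 0, H1: µ > 0. The rejection region is the right tail, with area α above the critical value.

Two tails test: H0: µ = 0, H1: µ ≠ 0. The rejection regions are both tails, each with area α/2 beyond the critical values.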
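The three rejection regions can be sketched numerically. The example below assumes a test statistic that follows a standard normal distribution under H0 (a z test, chosen only because its critical values are available in the Python standard library); the function name is hypothetical.

```python
from statistics import NormalDist

alpha = 0.05
standard_normal = NormalDist()  # mean 0, standard deviation 1

# Critical values that cut off the rejection regions.
left_critical = standard_normal.inv_cdf(alpha)             # area alpha in the left tail
right_critical = standard_normal.inv_cdf(1 - alpha)        # area alpha in the right tail
two_tail_critical = standard_normal.inv_cdf(1 - alpha / 2)  # area alpha/2 in each tail


def in_rejection_region(statistic, tail):
    """Return True if the statistic falls in the rejection region for the given test."""
    if tail == "left":
        return statistic < left_critical
    if tail == "right":
        return statistic > right_critical
    if tail == "two":
        return abs(statistic) > two_tail_critical
    raise ValueError("tail must be 'left', 'right' or 'two'")


print(round(left_critical, 3), round(right_critical, 3), round(two_tail_critical, 3))
print(in_rejection_region(1.8, "right"), in_rejection_region(1.8, "two"))
```

The same observed statistic can be significant in a one-tailed test and not in a two-tailed one, because the two-tailed test splits α between both tails.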

Page 11: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Errors in Making Decisions

• Type I Error
– Reject the null hypothesis when it is true (“False Positive”)
– Has serious consequences
– Probability of Type I error is α, called the level of significance

• Type II Error
– Do not reject the null hypothesis when it is false (“False Negative”)
– Probability of Type II error is β (power = 1 - β)

Page 12: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

α & β Have an Inverse Relationship

Reduce the probability of one error and the other one goes up.

One possibility to reduce both: increase the sample size!!!!
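A small numerical sketch of this trade-off, under assumptions that are not in the slides: a right-tailed z test with known standard deviation sigma, where the true mean exceeds the hypothesized one by delta. The function name and the values of delta and sigma are illustrative.

```python
import math
from statistics import NormalDist


def power_right_tail_z(n, alpha, delta, sigma):
    """Power (1 - beta) of a right-tailed z test when the true mean is mu0 + delta."""
    standard_normal = NormalDist()
    critical = standard_normal.inv_cdf(1 - alpha)
    # Under the alternative, the z statistic is shifted by delta * sqrt(n) / sigma.
    return 1 - standard_normal.cdf(critical - delta * math.sqrt(n) / sigma)


delta, sigma = 1.0, 2.0  # hypothetical effect size and standard deviation
for n in (10, 40):
    for alpha in (0.05, 0.01):
        beta = 1 - power_right_tail_z(n, alpha, delta, sigma)
        print(f"n={n:3d}  alpha={alpha:.2f}  beta={beta:.3f}")
```

For a fixed sample size, lowering α raises β; increasing n lowers β at the same α.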

Page 13: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

What is the p Value and how to use it in a Test?

• The p-value is the probability, under H0, of obtaining a test statistic at least as extreme (≤ or ≥) as the observed sample value

• Used to make the rejection decision
– If p-value < α: reject H0 (SUCCESS)
– If p-value ≥ α: do not reject H0 (FAILURE)

[Figure: one tail test; the p-value is the tail area beyond the observed sample value.]
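The decision rule can be sketched as follows, again assuming a z statistic (standard normal under H0) so that tail areas come from the standard library. The observed value 2.1 and the function names are made up for the example.

```python
from statistics import NormalDist


def p_value_right_tail(z_observed):
    """Area under the standard normal curve to the right of the observed statistic."""
    return 1 - NormalDist().cdf(z_observed)


def p_value_two_tails(z_observed):
    """Area in both tails beyond |z_observed|."""
    return 2 * (1 - NormalDist().cdf(abs(z_observed)))


alpha = 0.05
z_observed = 2.1  # hypothetical observed test statistic

p_one = p_value_right_tail(z_observed)
p_two = p_value_two_tails(z_observed)

for label, p in (("one tail", p_one), ("two tails", p_two)):
    decision = "reject H0" if p < alpha else "do not reject H0"
    print(f"{label}: p = {p:.4f} -> {decision}")
```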

Page 14: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Random variables: am I observing continuous or discrete data???

Roughly speaking, a “random” variable is a quantity whose values are “random” and to which a probability distribution is assigned (e.g. each outcome of a fair die has the same chance of coming up at each throw).

THE DIFFERENCE BETWEEN CONTINUOUS AND DISCRETE VARIABLES IS FUNDAMENTAL IN CHOOSING THE KIND OF TEST STATISTIC!

Page 15: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Discrete R.V.

If the values of the r.v. X belong to a finite (or countable) set {x1, x2, …, xn}, then X is called DISCRETE (usually counts).

For example, the flip of a coin, the number of red cells counted in an image, the number of successes in 100 trials… are observations of a discrete variable!

Page 16: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Continuous R.V.

A continuous random variable is a r.v. which can take any value within an interval, so it has an infinite number of possible values.

Continuous random variables are usually measurements. Examples include height, weight, the amount of sugar in an orange, the time required to run a mile, the fluorescence intensity in a microarray, etc.

(For a continuous random variable, probability is not assigned to specific values but to intervals of values)

Page 17: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Which test to use?

First of all you should choose a summary SAMPLE STATISTIC!

For example:

SAMPLE MEAN

SAMPLE VARIANCE

SAMPLE COVARIANCE

SAMPLE CORRELATION

T-test!

Page 18: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Paired t-Test: σ Unknown (e.g. right and left eye)

• Assumptions
– Population is normally distributed
– If not normal, only slightly skewed & a large sample taken (central limit theorem applies)

• Parametric test procedure (the sample statistic is the sample mean!)

• t test statistic, with n-1 degrees of freedom
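A minimal sketch of the paired t statistic for the right eye / left eye situation. The measurements are invented; the critical value 2.571 is the two-tailed 5% value for 5 degrees of freedom from a standard t table.

```python
import math
import statistics

# Hypothetical paired measurements on the same 6 subjects (illustrative data only).
right_eye = [12.1, 11.8, 13.0, 12.5, 11.9, 12.7]
left_eye = [11.6, 11.9, 12.4, 12.0, 11.5, 12.1]

# A paired test works on the within-subject differences.
differences = [r - l for r, l in zip(right_eye, left_eye)]
n = len(differences)

mean_difference = statistics.mean(differences)
standard_error = statistics.stdev(differences) / math.sqrt(n)

t_stat = mean_difference / standard_error  # H0: mean difference is 0
degrees_of_freedom = n - 1

t_critical = 2.571  # two-tailed, alpha = 0.05, 5 degrees of freedom (t table)
print(f"t = {t_stat:.2f} with {degrees_of_freedom} df, reject H0: {abs(t_stat) > t_critical}")
```

In Excel the corresponding p-value would come from =TTEST(range1,range2,2,1), where the last argument selects the paired test.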

Page 19: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Rejection Region (one tail)

Left tail: H0: µ ≥ 0, H1: µ < 0. Reject H0 when t is significantly below 0.

Right tail: H0: µ ≤ 0, H1: µ > 0. Reject H0 when t is significantly above 0. Small values don’t contradict H0: don’t reject H0!

Page 20: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Unpaired T-test

• The two sample observations are not coupled

• Not necessarily equal sample sizes

• You may distinguish between equal and unequal sample variance
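The distinction between equal and unequal variance can be sketched with two small helper functions (hypothetical names, invented data). The first pools the two sample variances; the second, usually called Welch's t test, keeps them separate and uses an approximate number of degrees of freedom.

```python
import math
import statistics


def pooled_t(a, b):
    """Unpaired t statistic assuming equal variances; returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_variance = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    standard_error = math.sqrt(pooled_variance * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error, na + nb - 2


def welch_t(a, b):
    """Unpaired t statistic without the equal-variance assumption (Welch)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a) / na, statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical expression values; the two samples have different sizes.
wild_type = [5.1, 4.9, 5.6, 5.3, 5.0]
mutant = [6.2, 6.8, 5.9, 6.5]

t_pooled, df_pooled = pooled_t(wild_type, mutant)
t_welch, df_welch = welch_t(wild_type, mutant)
print(f"pooled: t = {t_pooled:.2f}, df = {df_pooled}")
print(f"Welch:  t = {t_welch:.2f}, df = {df_welch:.1f}")
```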

Page 21: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

The other tests, in a few words:

• If you want to compare more than two population means when you observe 1 characteristic: one-way ANOVA test

• If you want to compare more than two population means when you observe 2 characteristics: two-way ANOVA test

• If you want to compare two population variances: F-test

• If you want to compare two population proportions: Chi-square test

Page 22: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Remark

• If you have counts…or few data YOU ARE NOT ALLOWED TO USE T-TEST!!!

• Any test is built upon assumptions about the shape of the null distribution… again, if you have few data or any doubt… please contact us!

• If you just want to have a summary of your data, then use the descriptive statistics Excel sheet

Page 23: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

PRACTICALS: Handy Guide

In Excel

Luisa Cutillo

cutillo@tigem.it

http://bioinformatics.tigem.it

Page 24: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

HOW IN EXCEL….

• DESCRIPTIVE STATISTICS

• ASSOCIATION STATISTICS

• COMPARATIVE STATISTICS

• STATISTICS FOR FREQUENCY DATA

• FDR (for Multiple Hypothesis Testing)

Page 25: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

REMARK

• Statistical formulae and tables can look mysterious and confusing

• You don’t really need to make calculations yourself

• Excel has most of the common statistical tests built in

• EXCEL, HOWEVER, IS NOT STATISTICAL SOFTWARE! But it can be used for basic analysis.

Page 26: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

DESCRIPTIVE STATISTICS

How to summarise the collected measurements (time, length, temperature, expression level...)?

Excel provides 3 measures of the centre of a distribution of replicates:

Arithmetic mean: =AVERAGE(range) most appropriate for normal approximation!

Median (Pr(X < median) = Pr(X > median) = 0.5): =MEDIAN(range)

Mode (most frequent value): =MODE(range)

Page 27: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

DESCRIPTIVE STATISTICS: DESCRIPTIVE_STAT_toy.xls

The mean has no meaning without some measure of spread or variation:

Arithmetic mean: =AVERAGE(range) most appropriate for normal approximation!

The range: =MAX(range)-MIN(range)

The variance: =VAR(range)

Standard deviation: =STDEV(range)

Standard error of the mean: =STDEV(range)/SQRT(COUNT(range))

Confidence interval: =CONFIDENCE(0.05,STDEV(range),COUNT(range))
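For readers who want to cross-check a spreadsheet, the sketch below reproduces the same summaries with the Python standard library. The measurements are invented. The last line mirrors Excel's =CONFIDENCE, which returns the half-width of a confidence interval based on the normal distribution.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical replicate measurements (illustrative data only).
values = [4.1, 3.9, 4.4, 4.0, 4.1, 4.6, 3.8]

mean = statistics.mean(values)          # =AVERAGE(range)
median = statistics.median(values)      # =MEDIAN(range)
mode = statistics.mode(values)          # =MODE(range)
value_range = max(values) - min(values)  # =MAX(range)-MIN(range)
variance = statistics.variance(values)  # =VAR(range), n-1 denominator
sd = statistics.stdev(values)           # =STDEV(range)
sem = sd / math.sqrt(len(values))       # =STDEV(range)/SQRT(COUNT(range))

# =CONFIDENCE(0.05, STDEV(range), COUNT(range)): normal-based half-width,
# so the interval is mean - half_width to mean + half_width.
half_width = NormalDist().inv_cdf(1 - 0.05 / 2) * sem

print(f"mean = {mean:.3f} +/- {half_width:.3f} (95% CI), median = {median}, mode = {mode}")
```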

Page 28: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

ASSOCIATION STATISTICS

Task: investigate an association between two variables (e.g. the expression values of two genes).

Correlation: to see if two variables vary together, i.e. when one goes up, the other goes up (or goes down) [excel]

Regression: to see how one variable affects another [contact us!]

The most common tests for correlation are:

Pearson coefficient for normally distributed data (parametric) [excel]:

=CORREL(range 1, range 2) or =PEARSON(range 1, range 2)

Spearman rank-order correlation coefficient (non parametric) [contact us!]

Both vary from +1 (perfect correlation) through 0 (no correlation) to -1 (perfect anti-correlation)

Page 29: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Two types of correlation coefficient. The data are the lengths of a leg bone (in mm) in penguin mating pairs. The Pearson coefficient r can be calculated directly from the data, but the Spearman coefficient rs must be calculated from the ranks of the data. The ranks can either be entered by hand or calculated using Excel’s =RANK formula.

Ex1.xls

correlation_covar_toy.xls
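The relation between the two coefficients can be sketched in a few lines: Spearman's coefficient is Pearson's coefficient computed on the ranks. The bone lengths below are invented, and tied values receive the average of their ranks, which is an assumption of this sketch (Excel's plain =RANK gives tied values the same rank without averaging).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient r, as returned by =CORREL or =PEARSON."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 (smallest) upwards; ties get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order coefficient rs: Pearson's r applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical leg bone lengths (mm) in six mating pairs.
male = [17.1, 18.5, 19.7, 16.2, 21.3, 19.6]
female = [16.5, 17.4, 17.3, 16.8, 19.5, 18.3]

print(f"r = {pearson(male, female):.3f}, rs = {spearman(male, female):.3f}")
```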

Page 30: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

COMPARATIVE STATISTICS: test_toy.xls

Task: compare two or more sets of data to determine whether they are basically the same or significantly different.

Final result: the P value, i.e. the probability of observing a difference at least this large if the null hypothesis of no difference is true.

In biology we usually say that there is a significant difference if P < 5%. The most common test for normally distributed data is the T-TEST:

=TTEST(range1,range2,tails,type) which directly returns the P value.

tails:
1 for one tailed test
2 for two tailed test (most used in biology, tests for differences regardless of the sign)

type:
1 for paired data (one sample, dependent data)
2 for unpaired data (two samples, equal variance)
3 for two samples, unequal variance (Never use it!)

Page 31: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

MICROARRAY Hypothesis Testing

We want to compare two biologically different samples (ex. Wild Type vs Mutant) through the identification of

differentially expressed genes

We have to simultaneously test, for each gene, the null hypothesis: gene expression has not changed.

Page 32: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

For each gene j the test is expressed in terms of a statistic and a p-value

Null Hypothesis H0: µj(WT) = µj(KO)

Which is the test to use in this case?

Page 33: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

For each gene the test is expressed in terms of a statistic and a p-value

Null Hypothesis

H0: µj(WT) = µj(KO)

T-statistic on gene j --> p-value

We reject H0 for gene j when its p-value < α.

Reminder: the p-value is the probability, when H0 is true, of obtaining a test statistic at least as extreme as the one observed. Rejecting H0 whenever p < α fixes the probability of a false positive (type I error) at α, that is, the probability of calling a gene differentially expressed when it actually is not!!!

Ex. If α = 0.01 and we reject whenever p < α, then 0.01 is the probability that a gene which is not differentially expressed is nevertheless detected (a false positive).

Page 34: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Problems in controlling the errors…

Assume that a chip experiment reveals the expression level of m = 20,000 genes under two different biological conditions.

We want to test, simultaneously for each gene, the null hypothesis that the gene is not differentially expressed against the alternative that it is.

If we test each of the m hypotheses at level α = 0.01, then even if no gene is differentially expressed we would expect about 20,000 × 0.01 = 200 false positives!!!
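The figure of about 200 can be checked with a small simulation. It relies on one standard fact: when H0 is true, the p-value of a continuous test statistic is uniformly distributed between 0 and 1. The seed is arbitrary and only makes the run repeatable.

```python
import random

random.seed(0)  # arbitrary seed, for repeatability only

m = 20_000    # number of genes, none of them differentially expressed
alpha = 0.01  # per-gene significance level

# Under H0 each p-value is a uniform random number between 0 and 1.
null_p_values = [random.random() for _ in range(m)]

false_positives = sum(p < alpha for p in null_p_values)
expected_false_positives = m * alpha

print(f"expected about {expected_false_positives:.0f}, observed {false_positives}")
```

Every gene reported by this run is a false positive, which is why the per-gene threshold needs a correction when many genes are tested at once.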

Page 35: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Multiple error controlling procedures: Bonferroni

Bonferroni Correction (FWER)

Each gene is tested against the threshold Tr = α/m: pj < α/m ----> pj*m < α ----> Pbonf < α

In practice, for each gene you have to compute a new p-value Pbonf = pj*m, and you should retrieve all the genes for which Pbonf < α
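A minimal sketch of the Bonferroni computation described above. The function name and the five p-values are made up. Capping the corrected value at 1 is a common convention added here; it does not change which genes are retrieved.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (corrected p-values, significance flags) using Pbonf = p * m."""
    m = len(p_values)
    corrected = [min(1.0, p * m) for p in p_values]  # cap at 1, a probability
    significant = [p_bonf < alpha for p_bonf in corrected]
    return corrected, significant


# Hypothetical p-values for m = 5 genes.
p_values = [0.001, 0.008, 0.012, 0.041, 0.60]
corrected, significant = bonferroni(p_values, alpha=0.05)

for p, p_bonf, keep in zip(p_values, corrected, significant):
    print(f"p = {p:<6} Pbonf = {p_bonf:<6.3f} significant: {keep}")
```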

Page 36: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Multiple error controlling procedures: Benjamini-Hochberg (FDR)

Consider the p-values sorted in ascending order:

p(1) ≤ p(2) ≤ ... ≤ p(m)

For the j-th gene the new pBH is p(j)*m/j

So you have to detect all the genes whose sorted p-value is s.t.

p(j)*m/j < α

In practice, for the j-th gene you have to compute a new p-value

Pcorrect(j) = p(j)*m/j

and you should retrieve all the genes for which Pcorrect < α

MicroarrayFdr.xls
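The computation above can be sketched as follows. The function name and p-values are illustrative. One step goes beyond the slide: after computing p(j)*m/j, the sketch takes a running minimum from the largest p-value downwards, which is the usual way to make the corrected values non-decreasing and corresponds to the standard Benjamini-Hochberg step-up rule. This is stated here as an assumption about what the spreadsheet is meant to reproduce.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return (corrected p-values, significance flags) in the original gene order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # gene indices by ascending p

    # Slide formula: Pcorrect(j) = p(j) * m / j for the j-th smallest p-value.
    corrected_sorted = [p_values[idx] * m / j for j, idx in enumerate(order, start=1)]

    # Extra step (not on the slide): running minimum from the largest p-value
    # downwards, so corrected values never decrease as p increases.
    for k in range(m - 2, -1, -1):
        corrected_sorted[k] = min(corrected_sorted[k], corrected_sorted[k + 1])

    corrected = [0.0] * m
    for position, idx in enumerate(order):
        corrected[idx] = min(1.0, corrected_sorted[position])
    significant = [p_corr < alpha for p_corr in corrected]
    return corrected, significant


# Hypothetical p-values for m = 5 genes, deliberately unsorted.
p_values = [0.60, 0.012, 0.001, 0.041, 0.008]
corrected, significant = benjamini_hochberg(p_values, alpha=0.05)

for p, p_corr, keep in zip(p_values, corrected, significant):
    print(f"p = {p:<6} Pcorrect = {p_corr:<8.5f} significant: {keep}")
```

On these five values the procedure retrieves three genes, whereas multiplying every p-value by m (Bonferroni) would retrieve only two.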

Page 37: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Statistics for frequency data

Sometimes in biology results are not measurements but counts (or frequencies)! E.g. counts of different phenotypes, counts of cell types ...

Task: compare frequency data in different categories with some expected data

You are NOT ALLOWED to perform a t test! Instead you do a Chi-squared test:

=CHITEST(observed range,expected range) which directly returns the P value (the probability of seeing a discrepancy at least this large between the observed and the expected counts if the null hypothesis of no difference is true).

Page 38: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Statistics for frequency data

Three different uses:

• Expected calculated from theory: you test if your observed data agree with the theory. E.g. Mendel's theory can be used to predict frequencies of different phenotypes: we expect a genetic cross to give a 3:1 ratio of red and white flowers. (P>5%: data agree with theory)

• Expected calculated assuming that the counts in all the categories should be the same: you test whether there is a difference between the observed sets. (P<5%: data significantly different from each other)

• Investigate association between frequency data in two separate groups. Expected calculated assuming counts in one group are not affected by counts in the other. (P<5%: there is a significant association). Data are set in a contingency table. For each cell the expected count is:

E = (column total x row total)/grand total
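The expected-count formula can be sketched for a 2 x 2 contingency table. The counts are invented. The p-value line uses the fact that, for 1 degree of freedom (the 2 x 2 case only), the chi-squared tail area equals erfc(sqrt(chi2/2)); larger tables need a general chi-squared distribution function, which is what =CHITEST provides.

```python
import math


def expected_counts(observed):
    """E = (row total x column total) / grand total, for every cell of the table."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]


def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum(
        (o - e) ** 2 / e
        for observed_row, expected_row in zip(observed, expected)
        for o, e in zip(observed_row, expected_row)
    )


# Hypothetical 2 x 2 table: rows are two groups, columns are two phenotypes.
observed = [[30, 10],
            [20, 40]]

expected = expected_counts(observed)
chi2 = chi_squared_statistic(observed, expected)

# 2 x 2 table -> (2-1)*(2-1) = 1 degree of freedom, where this identity holds.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(expected)
print(f"chi2 = {chi2:.2f}, P = {p_value:.2g}")
```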

Page 39: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Statistics for frequency data

Two kinds of chi-squared test. Top: expected values from theory, calculated assuming 3/4 of the flowers should be red and 1/4 should be white.

Bottom: expected values assuming equal distribution.

Ex2.xls
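Both kinds of test can be sketched on the same made-up flower counts. With two categories there is 1 degree of freedom, so the p-value can again be obtained as erfc(sqrt(chi2/2)); this shortcut is valid only for 1 degree of freedom, and the function name is hypothetical.

```python
import math


def chi_squared_two_categories(observed, expected):
    """Return (chi2, P) for a two-category goodness-of-fit test (1 degree of freedom)."""
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts of red and white flowers from a genetic cross.
observed = [290, 110]
total = sum(observed)

expected_from_theory = [total * 3 / 4, total * 1 / 4]  # 3:1 ratio
expected_if_equal = [total / 2, total / 2]             # equal distribution

chi2_theory, p_theory = chi_squared_two_categories(observed, expected_from_theory)
chi2_equal, p_equal = chi_squared_two_categories(observed, expected_if_equal)

print(f"3:1 theory: chi2 = {chi2_theory:.2f}, P = {p_theory:.3f}")
print(f"equal:      chi2 = {chi2_equal:.2f}, P = {p_equal:.2g}")
```

Here P is above 5% against the 3:1 expectation, so the data are consistent with the theory, and far below 5% against equal counts, so the two colours are significantly different in frequency.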

Page 40: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Statistics for frequency data

The chi squared test for association. The observed data were entered in the upper table, and the expected data in the lower table were calculated from the sums for each column and row. Only some examples of the formulae used are shown.

Ex2.xls

Page 41: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

References:

• Biology statistics made simple using Excel, Millar

Page 42: Statistics for Biologists Using Microsoft Excel Luisa Cutillo cutillo@tigem.it

Now ...”test” your lunch!!!