1
Advances in Statistics
Or, what you might find if you picked up a current issue of a Biological Journal
2
Advances in Statistics
• Extensions to the ANOVA
• Computer-intensive methods
• Maximum likelihood
3
Extensions to ANOVA
• One-way ANOVA
– This works for a single explanatory variable
– Simplest possible design
• Two-way ANOVA
– Two categorical explanatory variables
– Factorial design
4
ANOVA Tables
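Source of variation   Sum of squares        df      Mean squares   F ratio   P
Treatment             SS_group              k - 1   MS_group       F         *
Error                 SS_error              N - k   MS_error
Total                 SS_group + SS_error   N - 1

where
$SS_{group} = \sum_i n_i (\bar{Y}_i - \bar{Y})^2$
$SS_{error} = \sum_i s_i^2 (n_i - 1)$
$MS_{group} = SS_{group} / df_{group}$
$MS_{error} = SS_{error} / df_{error}$
$F = MS_{group} / MS_{error}$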
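As a rough illustration of how the entries of a one-way ANOVA table are filled in, the sketch below computes the sums of squares, degrees of freedom, mean squares, and F ratio in plain Python. The three groups of measurements are hypothetical (they are not from the slides), and the P column is left out because it requires the F distribution, which the Python standard library does not provide.

```python
# Hypothetical measurements for k = 3 treatment groups (not from the slides).
groups = [
    [4.0, 5.0, 6.0],
    [6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0],
]


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def one_way_anova(groups):
    """Return the entries of a one-way ANOVA table as a dictionary."""
    k = len(groups)                        # number of groups
    n_total = sum(len(g) for g in groups)  # N, the total sample size
    grand_mean = sum(sum(g) for g in groups) / n_total

    # SS_group = sum over groups of n_i * (group mean - grand mean)^2
    ss_group = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # SS_error = sum over groups of s_i^2 * (n_i - 1)
    ss_error = sum(sample_variance(g) * (len(g) - 1) for g in groups)

    df_group, df_error = k - 1, n_total - k
    ms_group = ss_group / df_group
    ms_error = ss_error / df_error
    return {
        "ss_group": ss_group, "ss_error": ss_error,
        "ss_total": ss_group + ss_error,
        "df_group": df_group, "df_error": df_error, "df_total": n_total - 1,
        "ms_group": ms_group, "ms_error": ms_error,
        "F": ms_group / ms_error,
    }


table = one_way_anova(groups)
for name, value in table.items():
    print(f"{name}: {value:.4g}")
```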
5
Two-factor ANOVA Table
Source of variation         Sum of squares   df                  Mean square                     F ratio       P
Treatment 1                 SS1              k1 - 1              SS1 / (k1 - 1)                  MS1 / MSE
Treatment 2                 SS2              k2 - 1              SS2 / (k2 - 1)                  MS2 / MSE
Treatment 1 * Treatment 2   SS1*2            (k1 - 1)*(k2 - 1)   SS1*2 / ((k1 - 1)*(k2 - 1))     MS1*2 / MSE
Error                       SSerror          XXX                 SSerror / XXX
Total                       SStotal          N - 1
6
Two-factor ANOVA Table
Two categorical explanatory variables
7
General Linear Models
• Used to analyze variation in Y when there is more than one explanatory variable
• Explanatory variables can be categorical or numerical
8
General Linear Models
• First step: formulate a model statement
• Example:
Y = μ + TREATMENT
9
General Linear Models
• First step: formulate a model statement
• Example:
Y = μ + TREATMENT
μ: overall mean
TREATMENT: treatment effect
10
General Linear Models
• Second step: Make an ANOVA table
• Example:
Source of variation   Sum of squares        df      Mean squares   F ratio   P
Treatment             SS_group              k - 1   MS_group       F         *
Error                 SS_error              N - k   MS_error
Total                 SS_group + SS_error   N - 1

where
$SS_{group} = \sum_i n_i (\bar{Y}_i - \bar{Y})^2$
$SS_{error} = \sum_i s_i^2 (n_i - 1)$
$MS_{group} = SS_{group} / df_{group}$
$MS_{error} = SS_{error} / df_{error}$
$F = MS_{group} / MS_{error}$
11
General Linear Models
• Second step: Make an ANOVA table
This is the same as a one-way ANOVA!
12
General Linear Models
• If there is only one explanatory variable, these are exactly equivalent to things we’ve already done
– One categorical variable: ANOVA
– One numerical variable: regression
• Great for more complicated situations
13
Example 1: Experiment with blocking
• Fish experiment: sensitivity of goldfish to light
• Fish are randomly selected from the population
• Four different light treatments are applied to each fish
14
Randomized Block Design
Blocks (fish)
Treatments (light wavelengths)
15
Randomized Block Design
16
Step 1: Make a model statement
Y = μ + BLOCK + TREATMENT
17
Step 2: Make an ANOVA table
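A minimal sketch of how the ANOVA table for the model Y = μ + BLOCK + TREATMENT could be computed when every treatment is applied once to every block. The numbers are hypothetical (rows stand for fish, columns for light treatments) and are not the data from the goldfish experiment. The error degrees of freedom used here, (b - 1)(t - 1), are the standard value for a randomized block design with one observation per cell.

```python
# Hypothetical responses: one row per block (fish), one column per treatment.
data = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0, 8.0],
]


def randomized_block_anova(data):
    b = len(data)        # number of blocks
    t = len(data[0])     # number of treatments
    grand_mean = sum(sum(row) for row in data) / (b * t)

    block_means = [sum(row) / t for row in data]
    treatment_means = [sum(row[j] for row in data) / b for j in range(t)]

    ss_total = sum((y - grand_mean) ** 2 for row in data for y in row)
    ss_block = t * sum((m - grand_mean) ** 2 for m in block_means)
    ss_treatment = b * sum((m - grand_mean) ** 2 for m in treatment_means)
    # Whatever variation is not explained by blocks or treatments is error.
    ss_error = ss_total - ss_block - ss_treatment

    df_block, df_treatment = b - 1, t - 1
    df_error = (b - 1) * (t - 1)
    ms_error = ss_error / df_error
    return {
        "ss_block": ss_block, "ss_treatment": ss_treatment,
        "ss_error": ss_error, "ss_total": ss_total,
        "df_block": df_block, "df_treatment": df_treatment,
        "df_error": df_error,
        "F_block": (ss_block / df_block) / ms_error,
        "F_treatment": (ss_treatment / df_treatment) / ms_error,
    }


block_table = randomized_block_anova(data)
for name, value in block_table.items():
    print(f"{name}: {value:.4g}")
```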
18
Another Example: Mole Rats
• Are there lazy mole rats?
• Two variables:
– Worker type: categorical (“frequent workers” and “infrequent workers”)
– Body mass (ln-transformed): numerical
19
20
Step 1: Make a model statement
Y = μ + CASTE + LNMASS + CASTE * LNMASS
21
Step 2: Make an ANOVA table
22
Step 2: Make an ANOVA table
23
Step 1: Make a model statement
Y = μ + CASTE + LNMASS
24
Step 2: Make an ANOVA table
25
Step 2: Make an ANOVA table
Also called ANCOVA (analysis of covariance)
26
General Linear Models
• Can handle any number of predictor variables
• Each can be categorical or numerical
• Tables have the same basic structure
• Same assumptions as ANOVA
27
General Linear Models
• Don’t run out of degrees of freedom!
• Sometimes, the F-statistics will have DIFFERENT denominators - see book for an example
28
Computer-intensive methods
• Hypothesis testing:
– Simulation
– Randomization
• Confidence intervals:
– Bootstrap
29
Simulation
• Simulates the sampling process on a computer many times; the estimates calculated from the simulated data form the null distribution
• The computer generates the data assuming the null hypothesis is true
30
Example: Social spider sex ratios
Social spiders live in groups
31
Example: Social spider sex ratios
• Groups are mostly females
• Hypothesis: Groups have just enough males to allow reproduction
• Test: Whether distribution of number of males is as predicted by chance
• Problem: Groups are of many different sizes
• Binomial distribution therefore doesn’t apply
32
Simulation:
• For each group, the number of spiders is known. The overall proportion of males, pm, is known.
• For each group, the computer draws as many simulated spiders as the real group contains, and each has probability pm of being male.
• This is done for all groups, and the variance in the proportion of males among groups is calculated.
• This is repeated a large number of times.
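The steps above can be sketched in a few lines of Python. The group sizes and male counts below are hypothetical stand-ins, since the spider data are not given in the slides. Counting simulated variances that are as small as or smaller than the observed one is also an assumption; it matches the idea that groups with "just enough" males would vary less than chance predicts.

```python
import random
from statistics import variance

# Hypothetical data: size of each group and number of males in it.
group_sizes = [10, 25, 40, 15, 60, 30]
male_counts = [1, 3, 4, 2, 6, 3]


def simulate_variance(group_sizes, p_male, rng):
    """Simulate every group once and return the variance in proportion male."""
    proportions = []
    for n in group_sizes:
        # Each of the n spiders is male with probability p_male.
        males = sum(1 for _ in range(n) if rng.random() < p_male)
        proportions.append(males / n)
    return variance(proportions)


def simulation_p_value(group_sizes, male_counts, n_sims=1000, seed=1):
    rng = random.Random(seed)
    # Overall proportion of males, pooled across all groups.
    p_male = sum(male_counts) / sum(group_sizes)
    observed = variance([m / n for m, n in zip(male_counts, group_sizes)])
    # Null distribution: variances from data simulated under the null.
    null = [simulate_variance(group_sizes, p_male, rng) for _ in range(n_sims)]
    # Assumed lower-tail test: simulated variance as small as the observed.
    as_extreme = sum(1 for v in null if v <= observed)
    return observed, as_extreme / n_sims


observed_variance, p_value = simulation_p_value(group_sizes, male_counts)
print(f"observed variance = {observed_variance:.5f}, P = {p_value:.3f}")
```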
33
[Figure: histogram of the simulated variance in proportion of males (pseudo-values) against frequency, with the actual observed value (0.44) marked]
The observed value (0.44), or something more extreme, is observed in only 4.9% of the simulations. Therefore P = 0.049.
34
Randomization
• Used for hypothesis testing
• Mixes the real data randomly
• Variable 1 from an individual is paired with variable 2 data from a randomly chosen individual. This is done for all individuals.
• The estimate is made on the randomized data.
• The whole process is repeated numerous times. The distribution of the randomized estimates is the null distribution.
35
Without replacement
• Randomization is done without replacement.
• In other words, all data points are used exactly once in each randomized data set.
36
Randomization can be done for any test of association between two variables
37
Example: Sage crickets
Sage cricket males sometimes offer their hind-wings to females to eat during mating.
Do females who eat hind-wings wait longer to re-mate?
38
Table 12.3A. Waiting time to remating in sage cricket females after initial mating with either a wingless or winged male (presented in ln(days))
Male wingless   Male winged
0               1.4
0.7             1.6
0.7             1.9
1.4             2.3
1.6             2.6
1.8             2.8
1.9             2.8
1.9             2.8
1.9             3.1
2.2             3.8
2.1             3.9
2.1             4.5
                4.7
39
[Figure panels: ln(Time to remating), first mate had no wings; ln(Time to remating), first mate had intact wings]
Problems: Unequal variance, non-normal distributions
40
Real data:                      Randomized data:
Male wingless   Male winged     Male wingless   Male winged
0               1.4             0.7             2.8
0.7             1.6             2.3             1.9
0.7             1.9             1.9             2.1
1.4             2.3             1.8             1.6
1.6             2.6             3.8             0
1.8             2.8             1.4             1.4
1.9             2.8             1.9             2.2
1.9             2.8             3.9             2.1
1.9             3.1             4.7             1.6
2.2             3.8             2.6             4.5
2.1             3.9             1.9             2.8
2.1             4.5             2.8             0.7
                4.7                             3.1
Real data: $\bar{Y}_1 - \bar{Y}_2 = -1.41$
Randomized data: $\bar{Y}_1 - \bar{Y}_2 = 0.41$
41
Note that each data point was only used once
42
1000 randomizations
P < 0.001
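A minimal sketch of a randomization test on the sage cricket data from Table 12.3A, using the difference between group means as the test statistic. The two-tailed comparison, the seed, and the function name are choices made for this illustration. The P-value it prints depends on the random seed and on the choice of tails, so it need not match the slide exactly.

```python
import random

# ln(days) to remating, from Table 12.3A.
wingless = [0, 0.7, 0.7, 1.4, 1.6, 1.8, 1.9, 1.9, 1.9, 2.2, 2.1, 2.1]
winged = [1.4, 1.6, 1.9, 2.3, 2.6, 2.8, 2.8, 2.8, 3.1, 3.8, 3.9, 4.5, 4.7]


def mean(values):
    return sum(values) / len(values)


def randomization_test(group1, group2, n_randomizations=1000, seed=1):
    rng = random.Random(seed)
    observed = mean(group1) - mean(group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    as_extreme = 0
    for _ in range(n_randomizations):
        # Shuffling reassigns values to groups without replacement:
        # every data point is used exactly once in each randomized set.
        rng.shuffle(pooled)
        diff = mean(pooled[:n1]) - mean(pooled[n1:])
        # Two-tailed: count differences at least as large in magnitude.
        if abs(diff) >= abs(observed):
            as_extreme += 1
    return observed, as_extreme / n_randomizations


observed_diff, p_value = randomization_test(wingless, winged)
print(f"observed difference = {observed_diff:.2f}, P = {p_value:.3f}")
```

With only 1000 randomizations, the smallest nonzero P-value that can be resolved is 0.001; more randomizations give finer resolution.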
43
Randomization: Other questions
Q: Is this periodic?
(yes)
44
Bootstrap
• Method for estimation (and confidence intervals)
• Often used for hypothesis testing too
• "Picking yourself up by your own bootstraps"
45
Bootstrap
• For each group, randomly pick with replacement an equal number of data points, from the data of that group
• With this bootstrap dataset, calculate the estimate; this is a bootstrap replicate estimate
46
Real data:                      Bootstrap data:
Male wingless   Male winged     Male wingless   Male winged
0               1.4             0.7             1.4
0.7             1.6             0.7             1.4
0.7             1.9             1.4             2.8
1.4             2.3             1.4             2.8
1.6             2.6             1.8             2.8
1.8             2.8             1.8             3.1
1.9             2.8             1.8             3.1
1.9             2.8             1.9             3.9
1.9             3.1             1.9             4.5
2.2             3.8             2.1             4.7
2.1             3.9             2.1             4.7
2.1             4.5             2.1             4.7
                4.7                             4.7
Real data: $\bar{Y}_1 - \bar{Y}_2 = -1.41$
Bootstrap data: $\bar{Y}_1 - \bar{Y}_2 = -1.78$
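The resampling step can be sketched as follows, again with the sage cricket data. Repeating the resampling 1000 times and taking the 2.5th and 97.5th percentiles of the replicate estimates as an approximate 95% confidence interval is a common convention assumed here; the slides do not say which interval method is intended.

```python
import random
from statistics import stdev

# ln(days) to remating, from Table 12.3A.
wingless = [0, 0.7, 0.7, 1.4, 1.6, 1.8, 1.9, 1.9, 1.9, 2.2, 2.1, 2.1]
winged = [1.4, 1.6, 1.9, 2.3, 2.6, 2.8, 2.8, 2.8, 3.1, 3.8, 3.9, 4.5, 4.7]


def mean(values):
    return sum(values) / len(values)


def bootstrap_differences(group1, group2, n_replicates=1000, seed=1):
    """Return bootstrap replicate estimates of the difference in means."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_replicates):
        # Resample each group WITH replacement, keeping its sample size.
        boot1 = rng.choices(group1, k=len(group1))
        boot2 = rng.choices(group2, k=len(group2))
        replicates.append(mean(boot1) - mean(boot2))
    return replicates


replicates = sorted(bootstrap_differences(wingless, winged))
n = len(replicates)
# Assumed percentile method for an approximate 95% confidence interval.
lower = replicates[int(0.025 * n)]
upper = replicates[int(0.975 * n) - 1]
standard_error = stdev(replicates)
print(f"95% CI: {lower:.2f} to {upper:.2f}, SE = {standard_error:.2f}")
```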
47
48
Bootstraps are often used in evolutionary trees
49
Likelihood
$L(\text{hypothesis A} \mid \text{data}) = P[\text{data} \mid \text{hypothesis A}]$
Likelihood considers many possible hypotheses, not just one
50
Law of likelihood
A particular data set supports one hypothesis better than another if the likelihood of that hypothesis is higher than the likelihood of the other hypothesis.
Therefore we try to find the hypothesis with the maximum likelihood.
51
All estimates we have learned so far are also maximum likelihood estimates.
52
"Simple" example
• Using likelihood to estimate a proportion
• Data: 3 out of 8 individuals are male.
• Question: What is the maximum likelihood estimate of the proportion of males?
53
Likelihood
$L(p = x) = P[\text{3 males out of 8} \mid p = x]$
where x is a hypothesized value of the proportion of males.
e.g., L(p=0.5) is the likelihood of the hypothesis that the proportion of males is 0.5.
54
For this example only...
The probability of getting 3 males out of 8 independent trials is given by the binomial distribution.
$L(p = x) = \Pr[\text{data} \mid p = x] = \Pr[\text{3 out of 8} \mid p = x] = \binom{8}{3} x^3 (1 - x)^{8-3}$
55
How to find maximum likelihood hypothesis
1. Calculus
or
2. Computer calculations
56
By calculus...
• Maximum value of L(p=x) is found when x = 3/8.
• Note that this is the same value we would have gotten by methods we already learned.
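One way to fill in the calculus step, assuming the binomial likelihood for 3 males out of 8, is to maximize the log-likelihood, which has its maximum at the same x as the likelihood itself:

```latex
\begin{align*}
\ln L(p = x) &= \ln\binom{8}{3} + 3\ln x + 5\ln(1 - x) \\
\frac{d}{dx}\ln L(p = x) &= \frac{3}{x} - \frac{5}{1 - x} = 0 \\
3(1 - x) &= 5x \\
x &= \frac{3}{8}
\end{align*}
```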
57
By computer calculation...
[Figure: plot of L(p=x) against x, with the maximum marked at x = 3/8]
Input likelihood formula to computer, plot the value of L for each value of x, and find the largest L.
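A small sketch of the computer approach: evaluate the binomial likelihood for 3 males out of 8 on a grid of candidate values of x and keep the value with the largest L. The grid spacing of 0.001 is an arbitrary choice for this illustration.

```python
from math import comb


def likelihood(x, males=3, n=8):
    """Binomial likelihood of proportion x given `males` males out of n."""
    return comb(n, males) * x ** males * (1 - x) ** (n - males)


# Candidate values of x from 0 to 1 in steps of 0.001.
grid = [i / 1000 for i in range(1001)]
best_x = max(grid, key=likelihood)
print(f"maximum likelihood estimate: x = {best_x}, L = {likelihood(best_x):.4f}")
```

A grid search only finds the best value among the candidates it tries, so the precision of the estimate is limited by the grid spacing.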
58
Finding genes for corn yield:
Corn Chromosome 5
59
Hypothesis testing by likelihood
• Compares the likelihood of the maximum likelihood estimate to the likelihood of a null hypothesis
$\text{Log-likelihood ratio} = \ln\left[\frac{\text{Likelihood}[\text{Maximum likelihood hypothesis}]}{\text{Likelihood}[\text{Null hypothesis}]}\right]$
60
Test statistic
$\chi^2 = 2 \times (\text{log-likelihood ratio})$
with df equal to the number of variables fixed to make the null hypothesis
61
Example: 3 males out of 8 individuals
• H0: 50% are male
• Maximum likelihood estimate: $\hat{p} = 3/8$
$L[p = 3/8] = \binom{8}{3} (3/8)^3 (1 - 3/8)^5 = 0.2816$
62
Likelihood of null hypothesis
$L[p = 0.5] = \binom{8}{3} (0.5)^3 (1 - 0.5)^5 = 0.21875$
63
Log likelihood ratio
$\ln\left[\frac{L[p = 3/8]}{L[p = 0.5]}\right] = \ln\left[\frac{0.2816}{0.21875}\right] = 0.2526$
$\chi^2 = 2(0.2526) = 0.5051$
We fixed one variable in the null hypothesis (p), so the test has df = 1.
$\chi^2_{0.05,1} = 3.84$; because 0.5051 < 3.84, we do not reject H0.
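The whole test can be checked numerically with a short sketch. The critical value 3.84 is taken from the slide. The P-value line is an addition for illustration: for one degree of freedom, the upper tail of the chi-square distribution can be written with the complementary error function.

```python
from math import comb, erfc, log, sqrt


def binomial_likelihood(p, males=3, n=8):
    return comb(n, males) * p ** males * (1 - p) ** (n - males)


l_mle = binomial_likelihood(3 / 8)   # likelihood at the ML estimate
l_null = binomial_likelihood(0.5)    # likelihood under H0: p = 0.5

log_likelihood_ratio = log(l_mle / l_null)
chi2 = 2 * log_likelihood_ratio

critical_value = 3.84                # chi-square, alpha = 0.05, df = 1
reject_h0 = chi2 > critical_value

# For df = 1 only: P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = erfc(sqrt(chi2 / 2))

print(f"log-likelihood ratio = {log_likelihood_ratio:.4f}")
print(f"chi-square = {chi2:.4f}, reject H0: {reject_h0}")
```

Small differences in the last decimal place relative to the slide come from rounding: the slide rounds the likelihood to 0.2816 before taking the logarithm.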