basic statistical concepts - courses.pbsci.ucsc.edu
Basic Statistical Concepts
Statistical Population
• The entire underlying set of observations from which samples are drawn.
  – Philosophical meaning: all observations that could ever be taken for the range of inference
    • e.g. all barnacle populations that have ever existed, that exist or that will exist
  – Practical meaning: all observations within a reasonable range of inference
    • e.g. barnacle populations on that stretch of coast
Statistical Sample
• A representative subset of a population.
  – What counts as being representative?
    • Unbiased and hopefully precise
Strategies
• Define survey objectives: what is the goal of the survey or experiment? What are your hypotheses?
• Define population parameters to estimate (e.g. number of individuals, growth, color, etc.).
• Implement sampling strategy
  – measure every individual (think of the implications in terms of cost, time and practicality, especially if sampling is destructive)
  – measure a representative portion of the population (a sample)
Sampling
• Goal:
  – Every unit and combination of units in the population (of interest) has an equal chance of selection.
    • This is a fundamental assumption in all estimation procedures
• How:
  – Many ways if underlying distribution is not uniform
    » In the absence of information about the underlying distribution, the only safe strategy is random sampling
    » Costs: sometimes difficult, and may lead to its own source of bias (if sample size is low). Much more about this later
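A minimal sketch of simple random sampling, assuming a hypothetical sampling frame of 100 numbered quadrats (the frame, the sample size of 10 and the seed are illustrative, not from the slides). Sampling without replacement from the full list gives every unit, and every combination of units, the same chance of selection.

```python
import random

# Hypothetical sampling frame: 100 numbered quadrats along a stretch of coast.
population_units = list(range(1, 101))

# Fixed seed so the draw can be repeated; any seed would do.
rng = random.Random(42)

# Simple random sample without replacement: each unit, and each
# combination of 10 units, is equally likely to be chosen.
sample_units = rng.sample(population_units, k=10)

print(sorted(sample_units))
```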
Sampling Objectives
• To obtain an unbiased estimate of a population mean
• To assess the precision of the estimate (i.e. calculate the standard error of the mean)
• To obtain as precise an estimate of the parameters as possible for time, effort and money spent
Measures of location
• Population mean ($\mu$) - the average value
  • Sample mean ($\bar{y}$) estimates $\mu$
• Population median - the middle value
  • Sample median estimates population median
• In a normal distribution the mean = median (also the mode); this is not ensured in other distributions
[Figure: two frequency distributions of y. In the first the mean and median coincide; in the second the mean and median differ.]
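A small illustration of the last point, using two made-up data sets (both hypothetical): in the symmetric set the mean and median agree, while a single large value in the skewed set pulls the mean away from the median.

```python
import statistics

symmetric = [2, 3, 4, 5, 6, 7, 8]   # hypothetical symmetric data
skewed = [2, 3, 3, 4, 4, 5, 21]     # hypothetical right-skewed data

for name, values in [("symmetric", symmetric), ("skewed", skewed)]:
    # The mean uses every value; the median only the middle of the ranking.
    print(name, "mean =", statistics.mean(values), "median =", statistics.median(values))
```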
Measures of dispersion
• Population variance ($\sigma^2$) - average of the squared deviations from the mean
• Measured sample variance ($s^2$) estimates population variance
• Standard deviation ($s$)
  – square root of variance
  – same units as original variable
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
Measures (statistics) of Dispersion
Population Sum of Squares: $\sum (x_i - \mu)^2$
Sample Sum of Squares: $SS = \sum (x_i - \bar{x})^2$
Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$
• Note, units are squared
• Denominator is $n$
Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
• Note, units are squared
• Denominator is $n - 1$
Sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
• Note, units are not squared
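The formulas above can be sketched directly. The five measurements are hypothetical, and the same numbers are treated both as a whole population and as a sample purely to show the effect of the denominator (n versus n - 1).

```python
import math
import statistics

values = [4.0, 7.0, 6.0, 3.0, 5.0]   # hypothetical measurements
n = len(values)
mean = sum(values) / n

# Sum of squares: squared deviations from the mean, added up.
ss = sum((x - mean) ** 2 for x in values)

pop_var = ss / n             # population variance, denominator n
samp_var = ss / (n - 1)      # sample variance, denominator n - 1
samp_sd = math.sqrt(samp_var)  # back in the original units

# Cross-check against the standard library implementations.
assert math.isclose(pop_var, statistics.pvariance(values))
assert math.isclose(samp_var, statistics.variance(values))

print(ss, pop_var, samp_var, round(samp_sd, 3))
```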
More Statistics of Dispersion
Standard error of the mean: $s_{\bar{x}} = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}}$
• This is also the standard deviation of the sample means
Coefficient of variation: $CV = \frac{s}{\bar{x}} \times 100$
• Measurement of variation independent of units
• Expressed as a percentage of the mean
Covariance: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
• Measure of how two variables covary
• Range is between $-\infty$ and $+\infty$
• Value depends in part on range in data
  – bigger numbers yield bigger values of covariance
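A short sketch of these three statistics on hypothetical paired data (the values and the helper name `covariance` are illustrative). The last lines rescale both variables by 10 to show that covariance depends on the size of the numbers.

```python
import math
import statistics

x = [2.0, 4.0, 6.0, 8.0, 10.0]   # hypothetical variable x
y = [1.0, 3.0, 2.0, 5.0, 4.0]    # hypothetical variable y, paired with x


def covariance(a, b):
    """Sample covariance with the n - 1 denominator."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    products = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    return products / (len(a) - 1)


n = len(x)
s_x = statistics.stdev(x)
sem_x = s_x / math.sqrt(n)              # standard error of the mean
cv_x = 100 * s_x / statistics.mean(x)   # coefficient of variation, in percent
cov_xy = covariance(x, y)

# Multiplying both variables by 10 multiplies the covariance by 100.
cov_scaled = covariance([10 * v for v in x], [10 * v for v in y])

print(round(sem_x, 3), round(cv_x, 1), cov_xy, cov_scaled)
```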
Types of estimates
• Point estimate
  – Single value estimate of the parameter, e.g. $\bar{y}$ is a point estimate of $\mu$, $s$ is a point estimate of $\sigma$
• Interval estimate
  – Range within which the parameter lies, known with some degree of confidence, e.g. a 95% confidence interval is an interval estimate of $\mu$
Sampling distribution
The frequency (or probability) distribution of a statistic (e.g. sample mean):
• Many samples (size n) from population
• Calculate all the sample means
• Plot frequency distribution of sample means (sampling distribution)
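The three steps above can be sketched as a small simulation. The population (about normal, mean 25, SD 8), the sample size and the number of samples are all assumptions chosen for illustration, and the "plot" is a crude text histogram.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population: 10,000 values, roughly normal with mean 25 and SD 8.
population = [rng.gauss(25, 8) for _ in range(10_000)]

n = 10               # size of each sample
num_samples = 1000   # number of repeated samples

# Draw many samples of size n and calculate each sample mean.
sample_means = [
    statistics.mean(rng.sample(population, n)) for _ in range(num_samples)
]

# Tabulate the frequency distribution of the sample means in bins of width 2.
frequency = {}
for m in sample_means:
    bin_start = int(m // 2) * 2
    frequency[bin_start] = frequency.get(bin_start, 0) + 1

# One '#' per 10 sample means falling in the bin.
for bin_start in sorted(frequency):
    print(f"{bin_start:>3} to {bin_start + 2:<3} {'#' * (frequency[bin_start] // 10)}")
```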
Sampling distribution of sample means
Multiple samples - multiple sample means
[Figure: repeated samples drawn from a population with true mean = 25; two example samples have means of 21.5 and 25.8. Histogram of the sample means: number of cases against estimate of mean (axis from 10 to 40).]
Means: 21.5, 22.3, 23.0, 23.9, 24.9, 25.1, 25.8, 26.5, 27.8, 29.9
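As a quick check, the ten sample means listed on this slide can be summarised directly. This is only a sketch: ten means is far too few to describe a sampling distribution well, but their average (about 25.07) already sits close to the true mean of 25.

```python
import statistics

# The ten sample means listed on the slide.
sample_means = [21.5, 22.3, 23.0, 23.9, 24.9, 25.1, 25.8, 26.5, 27.8, 29.9]

mean_of_means = statistics.mean(sample_means)
# Spread of the sample means, using the N - 1 denominator.
sd_of_means = statistics.stdev(sample_means)

print(round(mean_of_means, 2), round(sd_of_means, 2))
```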
Sampling distribution of mean
• The sampling distribution of the sample mean approaches a normal distribution as n gets larger - Central Limit Theorem.
• The mean of this sampling distribution is $\mu$, the mean of the original population.
[Figure: histogram of estimates of the mean from a large number of samples (number of cases and proportion per bar against estimate of mean, 15 to 35), alongside the probability distribution of the estimate of the mean ($\bar{x}$).]
Sampling distribution of mean
• The standard deviation of this sampling distribution is approximated by $s/\sqrt{n}$, the standard deviation of any given sample divided by the square root of the sample size - the standard error of the mean.
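[Figure: sampling distribution of the estimate of the mean ($\bar{x}$), probability on the vertical axis. About 2 $s_{\bar{x}}$ either side of the centre leaves 2.5% in each tail.]
Standard deviation can be calculated for any distribution
The standard deviation of the distribution of sample means can be calculated the same as for a given sample:
$$s_{\bar{x}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N - 1}}$$
Where:
1. $x_i$ = each sample mean and $\bar{x}$ = mean of the means
2. $N$ = number of means used in the distribution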
[Figure: the same sampling distribution of the estimate of the mean ($\bar{x}$), with about 2 SEM either side of the centre and 2.5% in each tail.]
However: calculating $s_{\bar{x}}$ directly from many sample means would require an immense sampling effort, hence an approximation is used:
$$s_{\bar{x}} \approx \mathrm{SEM} = \frac{s}{\sqrt{n}}$$
Where: $s$ = sample standard deviation and $n$ = number of replicates in the sample
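A simulation sketch of why the approximation is useful, under assumed values (a hypothetical population with mean 25 and SD 8, samples of n = 20). The "immense effort" route takes 2000 samples and computes the standard deviation of their means; the shortcut uses one sample and s/√n.

```python
import math
import random
import statistics

rng = random.Random(7)

# Hypothetical population, roughly normal with mean 25 and SD 8.
population = [rng.gauss(25, 8) for _ in range(20_000)]
n = 20

# Direct route: standard deviation of many sample means.
sample_means = [statistics.mean(rng.sample(population, n)) for _ in range(2000)]
direct_sd_of_means = statistics.stdev(sample_means)

# Shortcut: one sample, SEM = s / sqrt(n).
one_sample = rng.sample(population, n)
sem = statistics.stdev(one_sample) / math.sqrt(n)

print(round(direct_sd_of_means, 2), round(sem, 2))
```

The two numbers will not match exactly, because the SEM from a single sample carries that sample's own error in s, but both estimate the same quantity (here about 8/√20).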
Standard error of mean
• standard deviation of the sampling distribution of the mean is estimated by the sample SE: $s/\sqrt{n}$
• measures precision of sample mean
• how close sample mean is likely to be to true population mean
Standard error of mean
• If SE is low:
  – repeated samples would produce similar sample means
  – therefore, any single sample mean is likely to be close to the population mean
• If SE is high:
  – repeated samples would produce very different sample means
  – therefore, any single sample mean may not be close to the population mean
Effect of Standard error on estimate of $\mu$ (assume df = large)
[Figure: two sampling distributions, probability (0.00 to 0.30) against estimate of mean (0 to 40), one with 1 SEM = 2 and one with 1 SEM = 5. In each, about 2 SEM either side of the centre leaves 2.5% in each tail.]
Worked example
Lovett et al. (2000) measured the concentration of $SO_4^{2-}$ in 39 North American forested streams (qk2002, Box 2.2)

Stream          SO4 2- (mmol.L-1)
Santa Cruz      50.6
Colgate         55.4
Halsey          56.5
Batavia Hill    57.5

Statistic          Value
Sample mean        61.92
Sample median      62.10
Sample variance    27.46
Sample SD          5.24
SE of mean         0.84
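The summary statistics in this example are linked by the formulas given earlier, which can be checked from the reported SD and sample size alone (a sketch; the full data set is not reproduced in the slides).

```python
import math

n = 39             # number of streams
sample_sd = 5.24   # reported sample standard deviation

sample_variance = sample_sd ** 2      # variance is the SD squared
se_mean = sample_sd / math.sqrt(n)    # standard error of the mean, s / sqrt(n)

print(round(sample_variance, 2), round(se_mean, 2))
```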
Interval estimate
• How confident are we in a single sample estimate of $\mu$, i.e. how close do we think our sample mean is to the unknown population mean?
• Remember $\mu$ is a fixed, but unknown, value.
• Interval (range of values) within which we are 95% (for example) sure $\mu$ occurs - a confidence interval
Distribution of sample means
Calculate the proportion of sample means within a range of values.
Transform distribution of means to a distribution with mean = 0 and standard deviation = 1
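[Figure: distribution of sample means, $P(\bar{y})$ against $\bar{y}$, with the central 95% and 99% ranges marked.]
t statistic
$$t = \frac{\bar{y} - \mu}{s / \sqrt{n}}$$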
[Figure: null distribution of $t = (\bar{y} - \mu)/(s/\sqrt{n})$; probability (0.0 to 0.4) against t from -5 to 5.]
t statistic – interpretation and units
• The deviation between the sample and population mean is expressed in terms of standard errors (i.e. standard deviations of the sampling distribution)
• Hence values of t are in standard errors
• For example, t = 2 indicates that the deviation ($\bar{y} - \mu$) is equal to 2 x the standard error
$$t = \frac{\bar{y} - \mu}{s / \sqrt{n}}$$
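A minimal sketch of the calculation, with made-up numbers chosen so the arithmetic is easy to follow (the function name `t_statistic` and all values are illustrative): a deviation of 4 with a standard error of 2 gives t = 2.

```python
import math


def t_statistic(sample_mean, mu, s, n):
    """Deviation of the sample mean from mu, in units of standard errors."""
    standard_error = s / math.sqrt(n)
    return (sample_mean - mu) / standard_error


# Illustrative values: s = 6 and n = 9 give a standard error of 2,
# and the sample mean is 4 units above mu.
t = t_statistic(sample_mean=29.0, mu=25.0, s=6.0, n=9)
print(t)
```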
The t statistic
• This t statistic follows a t-distribution, which has a mathematical formula.
• Approximately the same as the normal distribution for n > 30; otherwise flatter and more spread out than the normal distribution.
• Different t distributions for different sample sizes < 30 (actually for different df, which is n - 1).
[Figure: null distributions of $t = (\bar{y} - \mu)/(s/\sqrt{n})$ for N = 30 and N = 3; probability (0.0 to 0.4) against t from -5 to 5.]
Two tailed t-values
Probabilities of occurring outside the range $-t_{df}$ to $+t_{df}$

Degrees of Freedom   Probability
                     .01     .02     .05     .10     .20
1                    63.66   31.82   12.71   6.314   3.078
2                    9.925   6.965   4.303   2.920   1.886
3                    5.841   4.541   3.182   2.353   1.638
4                    4.604   3.747   2.776   2.132   1.533
5                    4.032   3.365   2.571   2.015   1.476
10                   3.169   2.764   2.228   1.812   1.372
15                   2.947   2.602   2.132   1.753   1.341
20                   2.845   2.528   2.086   1.725   1.325
25                   2.787   2.485   2.060   1.708   1.316
z                    2.575   2.326   1.960   1.645   1.282
[Figure: t distribution for 4 df, $t = (\bar{y} - \mu)/(s/\sqrt{n})$ from -5 to 5; 95% of t values lie between -2.78 and +2.78.]
Degrees of Freedom   Probability (1 tailed / 2 tailed)
                     .005/.01   .01/.02   .025/.05   .05/.10   .10/.20
1                    63.66      31.82     12.71      6.314     3.078
2                    9.925      6.965     4.303      2.920     1.886
3                    5.841      4.541     3.182      2.353     1.638
4                    4.604      3.747     2.776      2.132     1.533
5                    4.032      3.365     2.571      2.015     1.476
10                   3.169      2.764     2.228      1.812     1.372
15                   2.947      2.602     2.132      1.753     1.341
20                   2.845      2.528     2.086      1.725     1.325
25                   2.787      2.485     2.060      1.708     1.316
z                    2.575      2.326     1.960      1.645     1.282
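Reading the table can be sketched as a lookup. The dictionary below copies a few rows of the table; the function names `critical_t` and `critical_z` are illustrative. A one tailed probability p uses the same column as a two tailed probability of 2p, and the z row can be recovered from the standard normal distribution in Python's `statistics` module.

```python
import math
from statistics import NormalDist

# Two tailed probabilities heading the table columns.
TWO_TAILED_P = (0.01, 0.02, 0.05, 0.10, 0.20)

# A few rows copied from the table (df -> critical t values).
T_TABLE = {
    1: (63.66, 31.82, 12.71, 6.314, 3.078),
    4: (4.604, 3.747, 2.776, 2.132, 1.533),
    10: (3.169, 2.764, 2.228, 1.812, 1.372),
    25: (2.787, 2.485, 2.060, 1.708, 1.316),
}


def critical_t(df, p, tails=2):
    """Look up the tabulated critical t for a one or two tailed probability p."""
    two_tailed_p = p if tails == 2 else 2 * p
    for column, tabulated_p in enumerate(TWO_TAILED_P):
        if math.isclose(two_tailed_p, tabulated_p):
            return T_TABLE[df][column]
    raise ValueError("probability is not tabulated")


def critical_z(p):
    """Two tailed critical value from the standard normal distribution."""
    return NormalDist().inv_cdf(1 - p / 2)


print(critical_t(4, 0.05))            # two tailed, 4 df
print(critical_t(4, 0.05, tails=1))   # one tailed, 4 df
print(round(critical_z(0.05), 3))     # z row, two tailed .05
```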
One and two tailed t-values (df 4)
[Figure: three t distributions for 4 df, $t = (\bar{y} - \mu)/(s/\sqrt{n})$ from -5 to 5. 2 tailed: 95% of t values lie between -2.78 and +2.78. 1 tailed: 95% lie below +2.132. 1 tailed: 95% lie above -2.132.]
The t statistic
• The proportions of t values between particular t values yield a confidence estimate (the likelihood that the true mean is in the range)
For n = 5 (df = 4), 95% of all t values occur between t = -2.78 and t = +2.78
[Figure: Pr(t) against t for 4 df, with 95% of the area between -2.78 and +2.78.]
• Probability is 95% that t is between -2.78 and +2.78
• Probability is 95% that $(\bar{y} - \mu)/(s/\sqrt{n})$ is between -2.78 and +2.78
• Rearrange equation to solve for $\mu$
Rearrange to solve for $\mu$
1. $t = \frac{\bar{y} - \mu}{s/\sqrt{n}}$
2. $t\,(s/\sqrt{n}) = \bar{y} - \mu$
3. $\mu = \bar{y} - t\,(s/\sqrt{n})$ and $\mu = \bar{y} + t\,(s/\sqrt{n})$
For two tailed test
Solve for $\mu$ (using df):
1. Calculated t values
2. Desired confidence level
(to determine range in values that are likely to contain $\mu$)
$$\Pr[\bar{y} - t\,(s/\sqrt{n}) \le \mu \le \bar{y} + t\,(s/\sqrt{n})]$$
For 95% CI, use the t value between which 95% of all t values occur, for specific df (n-1):
$$P[\bar{y} - t\,(s/\sqrt{n}) \le \mu \le \bar{y} + t\,(s/\sqrt{n})] = 0.95$$
This is a confidence interval.
• If CIs were calculated from repeated samples of size n, 95% of the CIs would contain $\mu$ and 5% wouldn't.
• 95% probability that this interval includes the true population mean.
Worked example (Lovett et al. 2000)
Sample mean 61.92, Sample SD 5.24, SE 0.84
• The t value (95%, 38 df) = 2.02 (from a t-table)
• 2.5% of t values are greater than 2.02
• 2.5% of t values are less than -2.02
• 95% of t values are between -2.02 and +2.02
$P\{61.92 - 2.02\,(5.24/\sqrt{39}) < \mu < 61.92 + 2.02\,(5.24/\sqrt{39})\} = 0.95$
$P\{60.22 < \mu < 63.62\} = 0.95$
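The arithmetic of this worked example can be sketched as follows. The standard error is rounded to 0.84 before multiplying, as the slides appear to do; carrying the unrounded value shifts each limit by under 0.01.

```python
import math

sample_mean = 61.92
sample_sd = 5.24
n = 39
t_crit = 2.02   # 95%, 38 df, from the t-table

# Standard error of the mean, rounded to two decimals as on the slide.
se = round(sample_sd / math.sqrt(n), 2)

half_width = t_crit * se
lower = sample_mean - half_width
upper = sample_mean + half_width

print(f"{lower:.2f} < mu < {upper:.2f}")
```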
t-table row for this example (two tailed probabilities):
Degrees of Freedom   .01     .02     .05     .10     .20
38                   2.705   2.426   2.020   1.685   1.302
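Confidence Interval (2 tailed) assume 95% CI is desired
Lovett et al. (2000): Sample mean 61.92, SEM 0.84, DF 38
$\Pr[\bar{y} - t\,(s/\sqrt{n}) \le \mu \le \bar{y} + t\,(s/\sqrt{n})]$
Lower limit: $\bar{y} - t\,(s/\sqrt{n})$ = 61.92 – 2.02(0.84) = 60.22
Upper limit: $\bar{y} + t\,(s/\sqrt{n})$ = 61.92 + 2.02(0.84) = 63.62
60.22 < $\mu$ < 63.62
[Figure: probability distribution centred on 61.92 with the central 95% between 60.22 and 63.62 (38 df).]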
• The interval 60.22 – 63.62 will contain $\mu$ 95% of the time.
• We are 95% confident that the interval 60.22 – 63.62 contains $\mu$.
Effect on Confidence Interval
Case         Mean    Sample size (SS)   SD      SE      Probability (%)   Lower CL   Upper CL
Reference    61.92   39                 5.24    0.839   95%               60.22      63.62
Double SD    61.92   39                 10.48   1.68    95%               58.53      65.31
Reduce SS    61.92   20                 5.24    1.17    95%               59.47      64.37
Increase %   61.92   39                 5.24    0.839   99%               59.65      64.20
SD = standard deviation, SE = standard error ($SD/\sqrt{SS}$), CL = confidence limit
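The four cases can be recomputed with one small function (a sketch; the function name is illustrative). The t values are assumptions taken from t-tables: 2.02 and 2.705 are the 38 df values given earlier in these slides, and 2.093 is the standard two tailed 95% value for 19 df, which the slides do not list. Results agree with the table to within rounding.

```python
import math


def confidence_interval(mean, sd, n, t_crit):
    """Two tailed confidence interval: mean +/- t * sd / sqrt(n)."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# (mean, SD, sample size, critical t)
cases = {
    "Reference": (61.92, 5.24, 39, 2.02),
    "Double SD": (61.92, 10.48, 39, 2.02),
    "Reduce SS": (61.92, 5.24, 20, 2.093),   # 95%, 19 df (standard table value)
    "Increase %": (61.92, 5.24, 39, 2.705),  # 99%, 38 df (value from the slides)
}

results = {name: confidence_interval(*args) for name, args in cases.items()}

for name, (lower, upper) in results.items():
    print(f"{name:<11} {lower:.2f} to {upper:.2f}  (width {upper - lower:.2f})")
```

Doubling the SD, reducing the sample size, or asking for more confidence each widen the interval.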
Estimating other parameters
• Logic of interval estimation of population mean using t-distribution can be extended to resampling
  – For example: confidence interval of the mean
Confidence Interval – using resampling vs t-test
• CI from t distribution is based on creation of a distribution from the mean and standard deviation calculated from sample data
• CI from resampling is based on the sample data themselves
• For example, assume we have the following observations and want to determine if the mean is different from 10
  – 9, 8, 9, 10, 9, 8, 9, 7, 11, 11
Confidence Interval – using t distribution
[Figure: distribution of y, Prob(y) against y from 5 to 15, with $\bar{y}$ = 9.14 and s = 1.30.]
Sample mean 9.139, Sample SD 1.30, SE 0.412
• The t value (95%, 9 df) = 2.26 (from a t-table)
• 2.5% of t values are greater than 2.26
• 2.5% of t values are less than -2.26
• 95% of t values are between -2.26 and +2.26
$P\{9.14 - 2.26\,(1.30/\sqrt{10}) < \mu < 9.14 + 2.26\,(1.30/\sqrt{10})\} = 0.95$
$P\{8.22 < \mu < 10.07\} = 0.95$
[Figure: t distribution, Prob(t) against t from -5 to 5, used to evaluate Ho: µ = 10.]
Use t distribution
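The same calculation can be sketched from the ten observations listed earlier. Note that those ten values have a mean of 9.1 and an SD of about 1.29, not the 9.139 and 1.30 shown on the slide, presumably because the listed values are rounded (that explanation is an assumption). The interval therefore differs slightly from 8.22 to 10.07, but the conclusion is the same: 10 falls inside it.

```python
import math
import statistics

observations = [9, 8, 9, 10, 9, 8, 9, 7, 11, 11]
n = len(observations)

mean = statistics.mean(observations)
s = statistics.stdev(observations)   # sample SD, n - 1 denominator
se = s / math.sqrt(n)

t_crit = 2.26   # 95%, 9 df, from the t-table

lower = mean - t_crit * se
upper = mean + t_crit * se
contains_10 = lower < 10 < upper     # is Ho: mu = 10 inside the interval?

print(f"{lower:.2f} < mu < {upper:.2f}; contains 10: {contains_10}")
```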
Resampling
Confidence Interval – using resampling
• Resample many times, with replacement, each with 10 observations
• Calculate means of all samples
• Generate distribution of means and determine empirical confidence interval
[Figure: histogram of the estimates of the mean; count (0 to 150) and proportion per bar (0.00 to 0.14) against mean value (8 to 11).]
95.0% Confidence Interval for Mean
Variable    Mean     Lower    Upper
y           9.142    8.453    9.912
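A minimal sketch of the resampling steps above, using a percentile interval. The number of resamples (10,000) and the seed are assumptions, and the slides do not say which software or interval method produced their output, so the limits here will be close to, but not identical with, 8.453 and 9.912.

```python
import random
import statistics

observations = [9, 8, 9, 10, 9, 8, 9, 7, 11, 11]

rng = random.Random(0)   # fixed seed so the result is repeatable
num_resamples = 10_000

# Resample with replacement, 10 observations each time, and keep each mean.
boot_means = sorted(
    statistics.fmean(rng.choices(observations, k=len(observations)))
    for _ in range(num_resamples)
)

# Empirical 95% interval: cut off 2.5% of the resampled means at each end.
lower = boot_means[num_resamples * 25 // 1000]
upper = boot_means[num_resamples * 975 // 1000 - 1]

print(f"{lower:.2f} < mu < {upper:.2f}; contains 10: {lower < 10 < upper}")
```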
Compare approaches
Statistic                                   Using t-distribution   Using resampling
Mean                                        9.139                  9.142
Upper Confidence Limit                      10.07                  9.91
Lower Confidence Limit                      8.22                   8.45
Accept Ho: µ = 10 (is 10 within 95% CI?)    YES                    NO
Confidence Intervals using resampling
• The same technique may be used to set confidence limits to any statistic, e.g. the median, the average (absolute) deviation, standard deviation (s), coefficient of variation, or skewness.
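To illustrate, the resampling step can be wrapped in a function that accepts any statistic. This is a sketch: the function name `bootstrap_ci`, its defaults and the percentile method are assumptions, not the procedure used to produce the slides.

```python
import random
import statistics


def bootstrap_ci(data, statistic, num_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for any statistic of the data."""
    rng = random.Random(seed)
    # Apply the statistic to each resample drawn with replacement.
    estimates = sorted(
        statistic(rng.choices(data, k=len(data))) for _ in range(num_resamples)
    )
    tail = (1 - confidence) / 2
    lower_index = int(tail * num_resamples)
    upper_index = int((1 - tail) * num_resamples) - 1
    return estimates[lower_index], estimates[upper_index]


def coefficient_of_variation(values):
    """Sample SD as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.fmean(values)


observations = [9, 8, 9, 10, 9, 8, 9, 7, 11, 11]

median_ci = bootstrap_ci(observations, statistics.median)
sd_ci = bootstrap_ci(observations, statistics.stdev)
cv_ci = bootstrap_ci(observations, coefficient_of_variation)

print("median:", median_ci)
print("SD:", sd_ci)
print("CV (%):", cv_ci)
```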