Statistics for Molecular Medicine -- Probability & Diagnostic Testing --
Barbara Kollerits & Claudia Lamina, Medical University of Innsbruck, Division of Genetic Epidemiology
Molekulare Medizin, SS 2016
Probability:
Definition and Calculation rules
Introduction
Descriptive statistics: pure description of your observed data.
But: you cannot draw conclusions about the underlying population.
Example: It is not enough to know that a medication is effective in one specific study. Is it effective in all patients for whom the medication was meant?
Was the result a chance finding?
With what certainty can the result be transferred to the population?
Definitions
Random experiment
• A situation involving chance that leads to mutually exclusive events.
• The result of the experiment is not known beforehand.
• The experiment can be uncontrolled (observational studies) or controlled, in which case it can be repeated under the same conditions.
Outcome
• All possible realizations of an experiment.
• Example: the set of all possible realizations of the experiment “rolling a die” is 1, 2, 3, 4, 5, 6.
• Let us denote the set of all possible outcomes by Ω.
Event / Realization
• One specific realization of a single trial of an experiment.
• Example: in one single throw, you get “2”.
Probability
• The measure of how likely an event is.
• The probability of rolling a “2” in one throw of a die is 1/6.
Calculation rules for probabilities
A and B are sets of possible realizations:
0 ≤ P(A) ≤ 1 (for each A ⊆ Ω)
P(Ω) = 1
P(Ø) = 0
P(Ā) = 1 - P(A), with Ā = Ω\A = „Not-A“
„B is part of A“: B ⊆ A
P(B) ≤ P(A), if B ⊆ A
Set difference A\B = „A without B“ = A ∩ B̄
(Figure: Venn diagrams in Ω illustrating A ∩ B, Ā, and A\B.)
Calculation rules for probabilities
Intersection of A and B: A ∩ B („A and B“)
Union of A and B: A ∪ B („A or B“)
If A and B are not disjoint: P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
If A and B are disjoint: P(A ∪ B) = P(A) + P(B)
More generally, if all Ai are pairwise disjoint:
P(A1 ∪ A2 ∪ … ∪ Ak) = P(A1) + P(A2) + … + P(Ak)
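The addition rule can be checked by enumerating the outcomes of one die roll. A minimal Python sketch; the events "even number" and "greater than 3" are illustrative choices, not from the slides:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # all outcomes of one die roll
A = {2, 4, 6}                # event "even number"
B = {4, 5, 6}                # event "greater than 3"

def prob(event):
    # counting rule: favourable outcomes / all possible outcomes
    return Fraction(len(event), len(omega))

# addition rule for non-disjoint events: P(A or B) = P(A) + P(B) - P(A and B)
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
print(lhs, rhs)  # 2/3 2/3
```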
Calculation rules for probabilities
„Counting rule“ for probability:
P(A) = (number of ways event A can occur) / (total number of possible outcomes)
Definition of odds:
Odds(A) = P(A) / (1 - P(A)) = (number of ways event A can occur) / (number of ways event A cannot occur)
Example: Experiment = tossing a coin once
P(„Head“) = 0.5
Odds(„Head“) = 0.5 / (1 - 0.5) = 1:1
Example: The probability of getting an even number (2, 4, 6) when rolling a die once (6 possible outcomes): P(even number) = 3/6 = 1/2
Probabilities and relative frequencies: the relative frequency of an event approximates the probability if the number of trials is high enough.
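The counting rule and the odds definition can be written down directly; a short sketch using exact fractions:

```python
from fractions import Fraction

outcomes = 6        # rolling one die
ways_even = 3       # the even numbers {2, 4, 6}

p = Fraction(ways_even, outcomes)   # counting rule: 3/6 = 1/2
odds = p / (1 - p)                  # Odds(A) = P(A) / (1 - P(A))
print(p, odds)  # 1/2 1
```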
(Figure: a simulated „experiment“ showing the approximation of a theoretical probability by a relative frequency; the relative frequency of the event „even number“ is plotted against the number of trials of rolling the die.)
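The simulated experiment in the figure can be reproduced in a few lines of Python; the seed and the number of trials are arbitrary choices for this sketch:

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

n_trials = 10_000
# count how often "even number" occurs in repeated die rolls
hits = sum(1 for _ in range(n_trials) if random.randint(1, 6) % 2 == 0)
rel_freq = hits / n_trials
print(rel_freq)  # close to the theoretical probability 0.5
```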
Calculation rules for probabilities
Example: In a practical seminar, students (n = 250) determine their blood types and get the following results:

Blood type   Absolute frequencies   Relative frequencies
A            106                    0.424
B            34                     0.136
AB           15                     0.06
0            95                     0.38

Rhesus factor
+            210                    0.84
-            40                     0.16

Relative frequencies as approximation of probabilities.
Calculation rules for probabilities: Exercise
Relative frequencies as approximation of the probabilities:
Blood groups: P(A) = 0.424, P(B) = 0.136, P(AB) = 0.06, P(0) = 0.38
Rhesus factor: P(R+) = 0.84, P(R-) = 0.16
Applying the calculation rules:
P(A or B) =
(a person can only have either blood group A or B, so A and B are disjoint)
P(not AB) =
Conditional probabilities
Conditional probability P(A|B) = „probability of A given B“:
the probability of the event A if it is known that B has already occurred.
P(A|B) = P(A ∩ B) / P(B), or equivalently: P(A ∩ B) = P(A|B)*P(B)
P(A ∩ B) = P(A|B)*P(B) = P(B|A)*P(A)
If the events A and B are independent: P(A|B) = P(A) and P(B|A) = P(B)
For the occurrence of event A it is irrelevant whether B occurred or not.
A and B independent: P(A ∩ B) = P(A)*P(B)
Example: Two subsequent dice rolls do not depend on each other
P(getting a 6 in two subsequent trials) = 1/6 * 1/6 = 1/36
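The independence rule for the two dice rolls can be verified both by the product formula and by brute-force enumeration of all 36 equally likely pairs:

```python
from fractions import Fraction

p_six = Fraction(1, 6)

# independent events: P(A and B) = P(A) * P(B)
p_two_sixes = p_six * p_six
print(p_two_sixes)  # 1/36

# cross-check by enumerating all 36 equally likely pairs of rolls
pairs = [(i, j) for i in range(1, 7) for j in range(1, 7)]
count = sum(1 for i, j in pairs if i == 6 and j == 6)
print(Fraction(count, len(pairs)))  # 1/36
```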
Calculation rules for probabilities: Exercise
Example for independence:
Relative frequencies as approximation of the probabilities:
Blood groups: P(A) = 0.424, P(B) = 0.136, P(AB) = 0.06, P(0) = 0.38
Rhesus factor: P(R+) = 0.84, P(R-) = 0.16
Applying the calculation rules:
P(A and R+) =
Calculation rules for probabilities
Example for dependence:
P(Diabetes|male) = 0.07; P(Diabetes|female) = 0.02
The probability of getting type 2 diabetes depends on the gender.
How high is the probability that one randomly chosen person (gender unknown) has diabetes?
P(Diabetes) = ?
Probability tree

male     P(m) = 0.5
   Diabetes      P(D|m) = 0.07       P(male and Diabetes) = 0.5*0.07
   No Diabetes   P(not D|m) = 0.93
female   P(f) = 0.5
   Diabetes      P(D|f) = 0.02       P(female and Diabetes) = 0.5*0.02
   No Diabetes   P(not D|f) = 0.98

(Right-hand column: joint event probability.)
P(Diabetes) = 0.07*0.5 + 0.02*0.5   (total probability rule)
Calculation rules for probabilities
Rule of total probability (A1, …, Ak partition Ω):
P(B) = P(B|A1)*P(A1) + … + P(B|Ak)*P(Ak)
Example:
P(Diabetes) = P(D|m)*P(m) + P(D|f)*P(f) = 0.07*0.5 + 0.02*0.5
Bayes' theorem:
P(A|B) = P(A ∩ B) / P(B) = P(B|A)*P(A) / (Σi P(B|Ai)*P(Ai))
Important for evaluating the validity of diagnostic tests.
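The diabetes example can be computed directly. The reverse question, P(male|Diabetes), is an added illustration of Bayes' theorem and not part of the slides:

```python
p_male, p_female = 0.5, 0.5
p_d_male, p_d_female = 0.07, 0.02

# total probability rule: weight each conditional probability by its group share
p_d = p_d_male * p_male + p_d_female * p_female
print(p_d)  # 0.045

# Bayes' theorem: probability that a diabetic person is male
p_male_given_d = p_d_male * p_male / p_d
print(round(p_male_given_d, 3))  # 0.778
```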
Calculation rules for probabilities: Exercise
On a cruise ship with 70% female and 30% male passengers, a mysterious infectious disease D has broken out. The infection rate is 5% in women and 10% in men.
How high is the probability that a randomly selected passenger is infected?
Solution with the total probability rule:
Solution with a probability tree:
Diagnostic Testing
Intention: to find a marker / specific test that discriminates patients into diseased and non-diseased.
How accurate is a test? Can it detect all currently diseased and exclude all non-diseased persons (e.g. ELISA test, Western blot, PCR etc.)? How likely is it that a person is actually diseased/non-diseased if the test is positive/negative?
Make predictions for individuals who will get the disease in the future:
Example 1: predict the progression of a chronic disease using a biomarker
Example 2: predict the probability of getting a disease in the future based on genetic variants
Example: Ideal situation – perfect test
Example: In a study, there are 200 individuals, 100 of them are diseased and 100 are healthy. A scientist claims that he has found a new test which perfectly detects those who are ill and those who are healthy.
The “perfect” test would yield such a table:
Test result   „Truth“
              Disease     No disease
Positive      100 (a)     0 (b)        100 (a + b)
Negative      0 (c)       100 (d)      100 (c + d)
              100 (a+c)   100 (b+d)    200
Sensitivity and Specificity
Generally used with diagnostic tests
Sensitivity: percentage of persons with a positive test result (a) among all diseased persons (a + c) = a / (a + c)
Specificity: percentage of persons with a negative test result (d) among all non-diseased persons (b + d) = d / (b + d)
What would be the “perfect” test?
→ one with no false negative results (sensitivity would be 1) and no false positive results (specificity would be 1)
Usually, when the sensitivity increases, the specificity decreases (and vice versa)
Sensitivity and Specificity
Sensitivity = true positives / (true positives + false negatives) x 100 = a / (a+c) x 100
Specificity = true negatives / (true negatives + false positives) x 100 = d / (b+d) x 100

Test result   „Truth“: Disease                                        „Truth“: No disease
Positive      diseased and positive test result = true positive (a)   not diseased but positive test result = false positive (b)
Negative      diseased but negative test result = false negative (c)  not diseased and negative test result = true negative (d)
Example: Ideal situation – perfect test
The “perfect test” revisited: In a study, there are 200 individuals; 100 of them are diseased and 100 are healthy. A scientist claims that he has found a new test which perfectly detects those who are ill and those who are healthy.

Test result   „Truth“
              Disease     No disease
Positive      100 (a)     0 (b)        100 (a + b)
Negative      0 (c)       100 (d)      100 (c + d)
              100 (a+c)   100 (b+d)    200

Sensitivity = a / (a+c) x 100 = (100 / 100) x 100 = 100%
Specificity = d / (b+d) x 100 = (100 / 100) x 100 = 100%
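The two formulas translate directly into code using the cell labels a, b, c, d of the 2x2 table; a short sketch, checked against the perfect test and against the FGF23 example from a later slide:

```python
def sensitivity(a, c):
    # true positives / (true positives + false negatives)
    return a / (a + c)

def specificity(d, b):
    # true negatives / (true negatives + false positives)
    return d / (b + d)

# the "perfect" test: a=100, b=0, c=0, d=100
print(sensitivity(100, 0), specificity(100, 0))  # 1.0 1.0

# FGF23 example from a later slide: a=52, b=34, c=11, d=76
print(round(sensitivity(52, 11) * 100, 1))  # 82.5
print(round(specificity(76, 34) * 100, 1))  # 69.1
```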
Sensitivity and Specificity: Example 2
Real example: predict the progression of renal disease in patients with chronic kidney disease (CKD)
Cohort study of 227 nondiabetic patients with CKD
Fibroblast growth factor 23 (FGF23) plasma concentrations (involved in calcium-phosphate metabolism)
177 of the patients were followed prospectively for up to 7 years to assess progression of renal disease

Test result           „Truth“
                      Progression   No progression
FGF23 above median    52 (a)        34 (b)
FGF23 below median    11 (c)        76 (d)
                      63 (a+c)      110 (b+d)

Sensitivity = a / (a+c) x 100 = 52 / (52 + 11) x 100 = 82.5%
Specificity = d / (b+d) x 100 = 76 / (76 + 34) x 100 = 69.1%

Kollerits et al. Fibroblast Growth Factor 23 (FGF23) Predicts Progression of Chronic Kidney Disease: The Mild to Moderate Kidney Disease (MMKD) Study. J Am Soc Nephrol 18: 2601–2608, 2007
Positive and negative predictive value
Sensitivity and specificity describe the accuracy of a test.
Estimation of the probability of the presence or absence of disease: positive predictive value (PV+) and negative predictive value (PV-)
Now we are interested in how likely it is for a patient to have the disease if the test is positive:
PV+ = percentage of persons who actually have the disease among those with a positive test result = a / (a + b)
PV- = percentage of persons who do not have the disease among those with a negative test result = d / (c + d)
Probability tree (conditional probabilities of composed events)
State: sick or healthy; Test: positive or negative.

sick       P(sick)
   Test positive   P(T+|sick) = sensitivity            sick and test positive
   Test negative   P(T-|sick) = 1 - sensitivity        sick and test negative = false negative
not sick   P(not sick)
   Test positive   P(T+|not sick) = 1 - specificity    not sick and test positive = false positive
   Test negative   P(T-|not sick) = specificity        not sick and test negative

(Right-hand column: joint event probability.)
Example for calculating positive and negative predictive value
Example: “ELISA AIDS test”
1989: screening of all civil service applicants of the German state of Bavaria was planned
The result of the ELISA AIDS test gives the information whether HIV antibodies are in the blood or not
The ELISA AIDS test has a very high sensitivity and specificity (99.9% and 99.5%, respectively)
There are the following assumptions (writing A+/A- for antibody status and T+/T- for the test result):
A+: there are antibodies in the blood
A-: there are no antibodies in the blood
T+: the test is positive for antibodies
T-: the test is negative for antibodies
Sensitivity: P(T+|A+) = 0.999
Specificity: P(T-|A-) = 0.995
Prevalence of HIV in the whole population: P(A+) = 0.001
Question: If the ELISA AIDS test is positive, how high is the probability of really having HIV antibodies in the blood?
Probability tree (conditional probabilities of composed events)

HIV      P(HIV) = 0.001
   Test positive   P(T+|HIV) = 0.999 (sensitivity)                          joint probability: 0.999 x 0.001
   Test negative
No HIV   P(no HIV) = 0.999
   Test positive   P(T+|no HIV) = 0.005 (1 - specificity: false positive)   joint probability: 0.005 x 0.999
   Test negative

Formula based on Bayes' theorem:
PV+ = P(HIV|T+) = (0.999 x 0.001) / (0.999 x 0.001 + 0.005 x 0.999) ≈ 0.17
Positive predictive value: only 17% of persons with a positive ELISA AIDS test result actually have HIV antibodies in the blood; 83% are falsely diagnosed.
Probability tree (conditional probabilities of composed events)

HIV      P(HIV) = 0.001
   Test positive
   Test negative   P(T-|HIV) = 0.001 (1 - sensitivity: false negative)
No HIV   P(no HIV) = 0.999
   Test positive
   Test negative   P(T-|no HIV) = 0.995 (specificity)

PV- = P(no HIV|T-) = (0.995 x 0.999) / (0.995 x 0.999 + 0.001 x 0.001) ≈ 0.999999
Negative predictive value: more than 99% of persons with a negative ELISA AIDS test result actually have no HIV antibodies in the blood.
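Both predictive values follow from Bayes' theorem given sensitivity, specificity and prevalence. A sketch (the function name `predictive_values` is a choice made here), which also shows how PV+ rises when the same test is applied in a high-prevalence group:

```python
def predictive_values(sens, spec, prev):
    # Bayes' theorem applied to a diagnostic test
    p_pos = sens * prev + (1 - spec) * (1 - prev)   # total probability of a positive test
    ppv = sens * prev / p_pos
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# ELISA example: sensitivity 99.9%, specificity 99.5%, prevalence 0.1%
ppv, npv = predictive_values(sens=0.999, spec=0.995, prev=0.001)
print(round(ppv, 2))  # 0.17

# same test in a hypothetical high-risk group with 10% prevalence
print(round(predictive_values(0.999, 0.995, 0.10)[0], 2))  # 0.96
```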
Relation between positive and negative predictive value
Dependence of PV+ / PV- on the prevalence / incidence of a disease:
Here, sensitivity = 99.9% and specificity = 99.5%
Even if a test has very high sensitivity and specificity, screening the whole population is not appropriate for rare diseases
→ define high-risk populations first
Example: screening for prostate cancer only in men above a certain age (in Austria > 45 years)
Sensitivity / Specificity / PPV: Exercise
You want to validate a risk marker for the progression of a degenerative disease and get the following results in your study after several years of follow-up:

Categorized risk marker   Patients with progression   Stable patients
Minimum-60                0                           10
61-80                     5                           25
81-100                    25                          50
101-120                   20                          10
121-Maximum               10                          5
Sum Σ                     60                          100

A clinician claims that values greater than 100 can predict the progression of disease. If you use this value as a clinical test for defining progression, how high are the sensitivity and specificity of this test?
Sensitivity =
Specificity =
What can you do to increase the sensitivity?
What happens then with the specificity?
Sensitivity / Specificity / PPV: Exercise
Sensitivity =
Specificity =
It is known that the incidence of the disease progression (D) is 1%.
Fill in all the parameters that you already know into this probability tree:

D
   Test positive
   Test negative
not D
   Test positive
   Test negative

Based on this tree, how high is the PPV of this test?
Probability distributions
Definitions
Random variable X: the values of a random variable X are the outcomes of a random experiment. A number is assigned to all possible realizations of a random experiment.
Realization of a random variable X in an experiment: xi
Discrete random variable: qualitative variables, like gender, disease status etc.
Continuous random variable: quantitative variables, like age, cholesterol levels etc.
Why are gender, disease status, age etc. „random“?
They are outcomes of the random experiment „drawing one person from the population“.
Discrete random variables
A random variable is discrete if it can only take a finite (or countably infinite) number of realizations.
To each realization, a specific probability can be assigned:
f(xi) = P(X = xi) = pi   (probability function)
Distribution function:
F(x) = P(X ≤ x): the probability that X takes the value x or a smaller value
A distribution function increases monotonically and is the sum of the probabilities of all realizations ≤ x.
Discrete random variables
Example: „Rolling a die“ once
The random variable is X = “the number of pips when rolling a die once”
Each realization has the probability 1/6: f(x1) = … = f(x6) = 1/6
F(2) = P(X ≤ 2) = f(x1) + f(x2) = 2/6
F(3) = 3/6 etc.

Realization xi   Probability f(xi) = pi   Distribution function F(xi)
1                1/6                      1/6
2                1/6                      2/6
3                1/6                      3/6
4                1/6                      4/6
5                1/6                      5/6
6                1/6                      6/6
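The table above can be generated by summing the probability function; a minimal sketch with exact fractions:

```python
from fractions import Fraction

# probability function of a fair die
f = {x: Fraction(1, 6) for x in range(1, 7)}

def F(x):
    # distribution function: sum of probabilities of all realizations <= x
    return sum(p for xi, p in f.items() if xi <= x)

print(F(2), F(3), F(6))  # 1/3 1/2 1
```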
Discrete random variables
Example of a uniform distribution (figure: histogram of the probabilities = probability function, and the corresponding distribution function).
Characteristics of probability distributions
Expectation E(X):
Measures of location for a sample: mean or median
Measure of location for the underlying population: E(X) or μ
For a discrete random variable with k realizations: E(X) = x1*p1 + … + xk*pk
Example: X = “the number of pips when rolling a die once”
E(X) = 1*1/6 + 2*1/6 + 3*1/6 + 4*1/6 + 5*1/6 + 6*1/6 = 3.5
Expected average number of pips when rolling the die many times
Expectation of a sum: E(X1 + X2 + … + Xn) = E(X1) + E(X2) + … + E(Xn)
Example: Z = “the sum of the pips when rolling two dice”
E(Z) = E(Xdice1) + E(Xdice2) = 3.5 + 3.5 = 7
Characteristics of probability distributions
Variance Var(X):
Measure of dispersion for a sample: sample variance
Measure of dispersion for the underlying population: Var(X) or σ²
Var(X) = E[(X - μ)²] = E(X²) - μ²
For a discrete random variable: Var(X) = (x1 - μ)²*p1 + … + (xk - μ)²*pk
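Both formulas can be checked on the die example; a short sketch with exact fractions, verifying that the two forms of the variance agree:

```python
from fractions import Fraction

xs = range(1, 7)
p = Fraction(1, 6)

mu = sum(x * p for x in xs)                 # E(X) = sum of x_i * p_i
var = sum((x - mu) ** 2 * p for x in xs)    # Var(X) = E[(X - mu)^2]
print(mu, var)  # 7/2 35/12

# the shortcut form E(X^2) - mu^2 gives the same result
assert sum(x ** 2 * p for x in xs) - mu ** 2 == var
```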
The Binomial distribution
Bernoulli experiment: the realizations of a Bernoulli experiment can take exactly two different values.
The probability is described as follows: P(X=1) = p; P(X=0) = 1 - p = q
Examples:

Experiment                               Possible realizations   Probability P(X)
Tossing a coin                           Head / Tail             0.5 / 0.5
Birth of a child                         Female / Male           0.5 / 0.5
Getting a disease up to a specific age   Disease / No disease    0.1 (for example) / 0.9

A Binomial distribution describes a process where a Bernoulli experiment is repeated several times, independently of each other.
Binomially distributed random variable: X = X1 + X2 + … + Xn (n independent Bernoulli experiments)
The Binomial distribution
A Binomially distributed random variable X is described unambiguously by:
n: the number of times the Bernoulli experiment is repeated
p = P(X=1)
X ~ B(n,p): X is Binomially distributed with parameters n and p
E(X) = np
Var(X) = np(1-p)
P(X=k) = C(n,k) * p^k * (1-p)^(n-k)
with the binomial coefficient C(n,k) = (1*2*…*n) / ((1*…*k) * (1*…*(n-k)))
Simplified version, e.g.: C(10,3) = (10*9*8) / (1*2*3)
The Binomial distribution
Example: passing a multiple-choice exam just by guessing
5 answers per question → p = 0.2
10 questions → X ~ B(10, 0.2)
E(X) = n*p = 2 → you can expect to have 2 answers right
Var(X) = n*p*(1-p) = 1.6
The exam is passed if 6 or more answers are correct → P(X ≥ 6)?

k    P(X = k)                       P(X ≤ k)
0    1*0.2^0*0.8^10 = 0.1074        0.1074
1    10*0.2^1*0.8^9 = 0.2684        0.3758
2    45*0.2^2*0.8^8 = 0.3020        0.6778
3    120*0.2^3*0.8^7 = 0.2013       0.8791
4    210*0.2^4*0.8^6 = 0.0881       0.9672
5    252*0.2^5*0.8^5 = 0.0264       0.9936
6    210*0.2^6*0.8^4 = 0.0055       0.9991
7    120*0.2^7*0.8^3 = 0.0008       0.99992
8    45*0.2^8*0.8^2 = 0.00007       0.999999
9    10*0.2^9*0.8^1 = 0.000004      ~1
10   1*0.2^10*0.8^0 = 0.0000001     ~1

P(X ≥ 6) = 1 - P(X ≤ 5) = 1 - 0.9936 = 0.0064
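The whole table and the final result can be reproduced with the binomial formula; a short sketch using Python's built-in `math.comb` for the binomial coefficient:

```python
from math import comb

def binom_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    # P(X <= k): sum the probability function up to k
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

n, p = 10, 0.2                            # 10 questions, guessing probability 0.2
print(round(binom_pmf(2, n, p), 4))       # 0.302
print(round(binom_cdf(5, n, p), 4))       # 0.9936
print(round(1 - binom_cdf(5, n, p), 4))   # 0.0064 -> P(X >= 6)
```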
The Binomial distribution
(Figure: histogram of the probabilities = probability function, and the distribution function of B(10, 0.2), with the table of P(X = k) and P(X ≤ k) values repeated.)
Distribution function: P(X ≤ 5) = ?
The Binomial distribution
P(X ≤ 5) = F(5) = 0.9936
(Figure: the area under the probability histogram up to k = 5 equals 0.9936.)
The Binomial distribution: Exercise
In a laboratory experiment with 30 students, a virus is set free. The infection probability is 20% for each student → Binomially distributed.
How many infections are expected (expectation E(X))?
Determine the following probabilities using the formula P(X=k) = C(n,k) * p^k * (1-p)^(n-k):
a) no student will be infected
b) exactly one student will be infected
The Binomial distribution: Exercise
Now the distribution function of this probability distribution is given (figure: distribution function plotted against the number of students). Determine the following probabilities using this figure:
a) at most 3 students will be infected
b) more than 8 students will be infected
Continuous random variables
A variable is continuous if each value between two values a and b (with a < b) can possibly be realized.
How can a probability be assigned to P(X ≤ x)?
For discrete random variables → sum of the probabilities of all values ≤ x
For continuous random variables → “area under the curve”
(Figures: density f(x) with the area up to x0 shaded, and distribution function F(x) with the value F(x0).)
Continuous random variables
Distribution function of a continuous variable:
F(x) = P(X ≤ x) = ∫_{-∞}^{x} f(t) dt
The total area between the x-axis and the density function f(x) is 1.
P(a ≤ X ≤ b) = F(b) - F(a) and P(X ≥ a) = 1 - F(a)
(Figures: density f(x) with the shaded area F(x0), and the distribution function F(x) rising from 0 to 1.)
The normal distribution
The normal distribution is the most important distribution in statistics.
It can be described unambiguously by μ and σ²: X ~ N(μ, σ²)
The densities of the following normal distributions are given:
N(0, 1): standard normal distribution
N(0, 4): higher variance
N(1, 1.44): slightly higher variance and “shifted” to the right
The normal distribution
Distribution of X: normal distribution with μ = 4, σ² = 4, i.e. X ~ N(4, 4)
Distribution of Z: standard normal distribution with μ = 0, σ² = 1, i.e. Z ~ N(0, 1)
Standardization: Z = (X - μ) / σ
In this example: Z = (X - 4) / 2
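Python's `math.erf` gives the distribution function of the standard normal distribution, so the standardization step can be checked numerically (`phi` and `normal_cdf` are helper names chosen for this sketch):

```python
from math import erf, sqrt

def phi(z):
    # distribution function of the standard normal distribution N(0, 1)
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_cdf(x, mu, sigma):
    # standardize first: Z = (X - mu) / sigma, then look up phi
    return phi((x - mu) / sigma)

print(phi(0.0))                        # 0.5
print(round(phi(1.96), 3))             # 0.975
print(round(normal_cdf(6, 4, 2), 3))   # 0.841, same as phi(1.0)
```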
An illustration of the standard normal distribution with quantiles (figure: 20%, 50% and 5% of the data marked as areas under the density):

p             0.2     0.5    0.8    0.95   0.975
Quantile zp   -0.85   0      0.85   1.64   1.96

Because of symmetry: zp = -z(1-p)
The normal distribution
Idea of a quantile-quantile plot (QQ plot):
Plot the quantiles of two distributions against each other (e.g. the 20%, 50% and 97.5% quantiles).
If the two distributions are the same, all points lie on one line!
If a QQ plot is used for comparison with the normal distribution, it is often also called a normal-quantile plot (NQ plot).
The normal distribution
Do our observed data follow a normal distribution?
Histogram of the observed data with the estimated density function.
Compare with the normal distribution (blue line): a normal distribution with mean = 53.25 and stdev = 9.19 (as estimated from the data).
27
NQ-Plot Histogram with normal curve
Example:
Test the distribution of the variable age in observed data with the normal distribution:
The tails of the distribution do not fit !
The normal distribution
Quantiles of the normal distribution: Exercise
Fasting blood glucose is a normally distributed variable with μ = 90 and σ = 10. What is the probability for a randomly drawn person to have a blood glucose level of
a) ≤ 75 mg/dl   b) > 100 mg/dl   c) > 85 & ≤ 95
Indicate the probabilities as areas under the curve (figure: density function of N(90, 100)).
Quantiles of the normal distribution: Exercise
Calculate the probabilities using quantiles of the standard normal distribution:
a) ≤ 75 mg/dl
b) > 100 mg/dl
c) > 85 & ≤ 95
Steps (using the distribution function of the standard normal distribution):
1. Standardize to get quantiles (z) of a standard normal distribution
2. Derive the probability (graphically) from the distribution function
Other continuous distributions
t-distribution: X ~ t(df)
The t-distribution only depends on the degrees of freedom df = n - 1.
The t-distribution approaches N(0, 1) with increasing n.
Other continuous distributions
F-distribution: X ~ F(df1, df2): it depends on two „degrees of freedom“, df1 and df2.
Chi-square distribution: X ~ χ²(df): it depends on one „degree of freedom“, df.
Point and Confidence Estimates
Point and confidence estimates
Intention: conclude from the sample on the underlying population.
The complete population of interest (e.g. all Austrians, all patients with previous myocardial infarctions etc.) cannot be observed → samples are drawn from the population.
Samples should be chosen to be representative of the population.
(Figure: several samples, Sample 1 to Sample 3, drawn from one population.)
Point and confidence estimates
Descriptive measures in the study sample → conclude on the unknown parameter / characteristic of the underlying population.
For example: the arithmetic mean X̄ in the study sample is the estimate of μ, the expectation of the underlying population.
Point and confidence estimates
Example:
Population of interest = all patients with a previous MI
„Parameter“ of interest: blood pressure
Study sample: representative sample of all patients with previous MI
μ: expected value of the blood pressure in the underlying population, which cannot be observed
X̄: arithmetic mean of blood pressure in the study sample
X̄ is the estimate of μ
Point and confidence estimates
The precision of the estimate depends on the quality of the sample (representativeness) and the sample size.
Properties of a good estimator:
Unbiasedness: there is no systematic under- or overestimation.
Consistency: increasing the sample size increases the probability of the estimator being close to the population parameter → small variance.
Bias-variance tradeoff: find an estimator T which reduces both, i.e. reduces the mean squared error.
(Figure: four bullseye panels combining large/small bias with large/small variation.)
The estimator with small bias and small variation is the most efficient.
Point and confidence estimates
Examples of unbiased, consistent and most efficient estimators:

Estimator             Parameter estimated
Arithmetic mean X̄     expected value μ
Sample variance S²    variance σ²
Relative frequency    proportion p of a dichotomous trait
Point and confidence estimates
There is uncertainty in parameter estimation because it is based on a random sample of finite size from the population of interest.
Construct an interval that includes the population parameter with a given certainty: confidence interval (CI).
The measure of certainty is given by the error probability α:
α = 5%: 95% CI; α = 1%: 99% CI etc.
Assumptions on the underlying distribution have to be made, e.g. it is assumed that the values X1, X2, …, Xn are measurements of a normally distributed variable X (the variable is N(μ, σ²)-distributed).
Confidence intervals for all parametric models can be described in this way:
parameter estimate ± (1 - α/2)-quantile of the respective probability distribution * standard error of the estimate
Standard error of the mean
The standard error is the standard deviation of the sampling distribution of a statistical estimate (e.g. the mean).
Standard error of the mean: SE = σ/√n (estimated by S/√n)
How accurate is the mean? How much would the mean change if different samples were taken?
The standard error gets smaller the bigger the sample size (→ the estimate is more accurate).
Difference to the standard deviation: the standard deviation describes the variability of the single observations in the sample → you cannot conclude on the population.
Confidence Interval
Parameter estimate ± (1 - α/2)-quantile of the respective probability distribution * standard error of the estimate
Example: CI for the mean:
Mean ± quantile of the standard normal distribution * standard error of the mean
0.025-quantile: z = -1.96; 0.975-quantile: z = +1.96
Confidence Interval
95% CI for the mean: X̄ ± z(1-α/2) * σ/√n
with σ = standard deviation and z(1-α/2) = (1 - α/2)-quantile of the standard normal distribution
Interpretation of the confidence interval in general:
If the study is repeated 100 times on 100 different samples, the computed intervals will contain the true population parameter in (1 - α)*100 percent of cases.
For a 95% confidence interval:
α = 5% → z0.975 ≈ 1.96
If σ is not known and the sample size is large (> 30, better > 120): take S (the sample standard deviation).
Example: 95% confidence interval for the mean of total cholesterol within a sample of n = 1475 individuals.
If this study is repeated 100 times, it is expected that the resulting intervals contain the true mean of total cholesterol about 95 times.
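The CI formula for the mean translates into a small helper; a sketch using illustrative numbers (mean 182 cm, σ = 7, n = 100, matching the body-height simulation in this deck):

```python
from math import sqrt

def mean_ci(mean, sd, n, z=1.96):
    # mean +/- z * standard error of the mean (SE = sd / sqrt(n))
    se = sd / sqrt(n)
    return mean - z * se, mean + z * se

low, high = mean_ci(182, 7, 100)
print(round(low, 2), round(high, 2))  # 180.63 183.37
```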
Confidence Interval
Simulation example: 95% CI for the mean of body height in men born 1993
Assumption: we know the „truth“: mean = 182 cm; σ = 7
A different sample is drawn 100 times (either with n = 100 or n = 20)
→ in about 5 out of 100 experiments, the true expected mean value is not included
If you only conduct an experiment once (as usual), you do not know if the true value is included (green bars) or not (red bars).

Confidence Interval: Exercise
Quantiles of the standard normal distribution:
p    0.95   0.975   0.995
zp   1.64   1.96    2.58

Determine the 95% CI for the following parameters:
Mean = 100, S = 15, n = 25
95% CI =
If you repeat an experiment 100 times with the assumption that these parameters are true, how many times can you expect to observe a mean that is lower or higher than these confidence limits?
Confidence Interval: Exercise
Quantiles of the standard normal distribution:
p    0.95   0.975   0.995
zp   1.64   1.96    2.58

The manufacturer of a laboratory measurement device claims that one measurement takes 5 minutes on average. You want to test that statement with 10 measurements.
You get the following estimates: mean = 5.3, standard deviation = 0.3
Determine a 99% confidence interval (CI):
99% CI =
Interpretation: