Introduction to Statistical Analysis
Cancer Research UK – 9th of November 2021
D.-L. Couturier (Bioinformatics core)
M. Fernandes (Craik-Marshall)
Grand Picture of Statistics
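Population: parameters $\mu$, $\sigma^2$, $\pi$.
Sample: data $(x_1, x_2, \ldots, x_n)$, summarised by statistics $\hat{\mu}$, $\hat{\sigma}^2$, $\hat{\pi}$ that estimate the parameters.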
Data Types
Variable                        x1      x2         x3         ...   xn
Cancer status                   C       $\bar{C}$  $\bar{C}$  ...   C
Nucleic acid sequence           C       T          T          ...   A
5-level pain score              3       1          5          ...   4
# of daily admissions at A&E    16      23         12         ...   17
Gene expression intensity       882.1   379.5      528.3      ...   120.9
Summary statistics and plots for qualitative data
5-level answers of 21 patients to the question "How much did pain due to your ureteric stones interfere with your day-to-day activities?":
3, 1, 5, 3, 1, 1, 1, 5, 1, 3, 4, 1, 1, 4, 5, 5, 5, 5, 5, 4, 4,
where
- 1 = "Not at all",
- 2 = "A little bit",
- 3 = "Somewhat",
- 4 = "Quite a bit",
- 5 = "Very much".
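As a minimal sketch (Python, standard library only; the variable names are illustrative and not part of the course material), the counts and relative frequencies summarising these answers could be tabulated as follows:

```python
from collections import Counter

# Answers of the 21 patients, as listed on the slide.
answers = [3, 1, 5, 3, 1, 1, 1, 5, 1, 3, 4, 1, 1, 4, 5, 5, 5, 5, 5, 4, 4]
labels = {1: "Not at all", 2: "A little bit", 3: "Somewhat",
          4: "Quite a bit", 5: "Very much"}

counts = Counter(answers)
n = len(answers)

# Relative frequency of each level, including levels nobody chose.
rel_freq = {level: counts.get(level, 0) / n for level in labels}

for level, label in labels.items():
    print(f"{level} ({label}): count = {counts.get(level, 0)}, "
          f"relative frequency = {rel_freq[level]:.3f}")
```

For qualitative (here ordinal) data, such a frequency table is the natural numerical summary, and a bar plot is its graphical counterpart.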
[Figure: bar plot of the relative frequency (y-axis, 0.0 to 0.4) of each answer, from "Not at all" to "Very much".]
Summary statistics and plots for quantitative data
Gene expression values of gene "CCND3 Cyclin D3" from 27 patients diagnosed with acute lymphoblastic leukaemia, sorted in increasing order:
x(1)   x(2)   x(3)   x(4)   x(5)   x(6)   x(7)   x(8)   x(9)
0.46   1.11   1.28   1.33   1.37   1.52   1.78   1.81   1.82
x(10)  x(11)  x(12)  x(13)  x(14)  x(15)  x(16)  x(17)  x(18)
1.83   1.83   1.85   1.9    1.93   1.96   1.99   2.00   2.07
x(19)  x(20)  x(21)  x(22)  x(23)  x(24)  x(25)  x(26)  x(27)
2.11   2.18   2.18   2.31   2.34   2.37   2.45   2.59   2.77
[Figure: dot plot of the 27 expression values on an axis from 0.0 to 3.5.]
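A minimal sketch of the usual numerical summaries for these 27 values, using Python's standard `statistics` module. The values are the rounded figures printed on the slide, so results may differ slightly from those obtained on the unrounded data, and the "inclusive" quartile rule is one convention among several:

```python
import statistics

# CCND3 Cyclin D3 expression values (ALL patients), as printed on the slide.
ccnd3 = [0.46, 1.11, 1.28, 1.33, 1.37, 1.52, 1.78, 1.81, 1.82,
         1.83, 1.83, 1.85, 1.90, 1.93, 1.96, 1.99, 2.00, 2.07,
         2.11, 2.18, 2.18, 2.31, 2.34, 2.37, 2.45, 2.59, 2.77]

n = len(ccnd3)
mean = statistics.mean(ccnd3)
median = statistics.median(ccnd3)
sd = statistics.stdev(ccnd3)  # sample standard deviation (n - 1 denominator)
q1, q2, q3 = statistics.quantiles(ccnd3, n=4, method="inclusive")

print(f"n = {n}, mean = {mean:.3f}, median = {median:.2f}, sd = {sd:.3f}")
print(f"quartiles: {q1:.3f}, {q2:.3f}, {q3:.3f}; IQR = {q3 - q1:.3f}")
```

The mean and standard deviation are sensitive to the isolated low value 0.46, whereas the median and quartiles are not.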
Two-sample case: independent versus paired samples
Independent samples: permeability constants of a placental membrane at term (X) and between 12 and 26 weeks gestational age (Y).
     1     2     3     4     5     6     7     8     9     10
X    0.80  0.83  1.89  1.04  1.45  1.38  1.91  1.64  0.73  1.46
Y    1.15  0.88  0.90  0.74  1.21
Paired samples: Hamilton depression scale factor measurements in 9 patients with mixed anxiety and depression, taken at the first (X) and second (Y) visit after initiation of a therapy (administration of a tranquilizer).
       1      2      3      4      5      6      7      8      9
X      1.83   0.50   1.62   2.48   1.68   1.88   1.55   3.06   1.30
Y      0.88   0.65   0.60   2.05   1.06   1.29   1.06   3.14   1.29
Y−X   −0.95   0.15  −1.02  −0.43  −0.62  −0.59  −0.49   0.08  −0.01
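With paired data, the analysis is carried out on the within-patient differences, which reduces the problem to a one-sample case. A small sketch (Python; variable names are illustrative) reproducing the Y−X row for the Hamilton data:

```python
import statistics

# Hamilton depression scale factor, first (X) and second (Y) visit, 9 patients.
first_visit = [1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30]
second_visit = [0.88, 0.65, 0.60, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29]

# Within-patient differences Y - X, rounded to the 2 decimals of the data.
differences = [round(y - x, 2) for x, y in zip(first_visit, second_visit)]
mean_difference = statistics.mean(differences)

print(differences)
print(f"mean difference = {mean_difference:.3f}")
```

The placental data cannot be treated this way: the two samples have different sizes (10 and 5) and no natural pairing.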
Quiz Time: Sections 1 to 4
On http://bioinformatics-core-shared-training.github.io/IntroductionToStats/, select "Online quiz" under "Course Materials".
Statistical distributions
"In probability theory and statistics, a statistical distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment" [Wikipedia].
For a given cancer, mutation of the nucleic acid located at position 790 of Exon 20 is assumed to occur with a probability of approximately 5%. What is the probability of observing y patients with this mutation out of n = 50 cancer patients?
$$P(Y = y \mid n, \pi) = \frac{n!}{(n-y)!\,y!}\,\pi^{y}(1-\pi)^{n-y}.$$
[Figure: bar plot of the probability (y-axis, 0.00 to 0.25) of the number of successes out of 50 experiments (x-axis).]
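The probabilities in this example can be evaluated directly from the formula. The following is a small sketch in Python (the function name `binomial_pmf` is illustrative), using n = 50 and π = 0.05 as on the slide:

```python
from math import comb

def binomial_pmf(y, n, pi):
    """P(Y = y | n, pi) = n! / ((n - y)! y!) * pi^y * (1 - pi)^(n - y)."""
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

n, pi = 50, 0.05
pmf = [binomial_pmf(y, n, pi) for y in range(n + 1)]

# The most likely number of mutated patients, and the first few probabilities.
mode = max(range(n + 1), key=lambda y: pmf[y])
for y in range(6):
    print(f"P(Y = {y}) = {pmf[y]:.4f}")
print(f"most likely value: y = {mode}")
```

The probabilities sum to one over y = 0, ..., 50, and almost all of the mass lies on small values of y because π is small.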
Some parametric distributions: Binomial distribution
- The number of successes out of n trials (experiments), $Y = \sum_{i=1}^{n} X_i$, follows a binomial distribution with parameters n and $\pi$:
$$Y \sim \mathrm{Bin}(n, \pi), \qquad P(Y = y \mid n, \pi) = \frac{n!}{(n-y)!\,y!}\,\pi^{y}(1-\pi)^{n-y},$$
IF
- the n experiments are independent,
- the outcome of each experiment is dichotomous (success/failure),
- the probability of success $\pi$ is the same for all experiments.
[Figure: bar plot of the probability (y-axis, 0.00 to 0.25) of the number of successes out of 50 experiments (x-axis).]
Some parametric distributions: Poisson distribution
- The number of events occurring in a fixed time interval or in a given area, X, may be modelled by means of a Poisson distribution with parameter $\lambda$:
$$X \sim \mathrm{Poisson}(\lambda), \qquad P(X = x \mid \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!},$$
IF, during a time interval or in a given area,
- events occur independently,
- at the same rate,
- and the probability of an event occurring in a small interval (area) is proportional to the length of the interval (size of the area).
[Figure: bar plot of the probability (y-axis, 0.00 to 0.35) of the number of chronic conditions per patient (x-axis, 0 to 11), US National Medical Expenditure Survey.]
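A short sketch of the Poisson probabilities in Python. The rate λ = 1.55 is an assumption made for illustration only (it is the mean quoted elsewhere in these slides for the chronic conditions data); whether a Poisson model actually fits those data is a separate question:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x | lambda) = lambda^x * exp(-lambda) / x!."""
    return lam**x * exp(-lam) / factorial(x)

lam = 1.55  # illustrative rate, assumed for this sketch
pmf = [poisson_pmf(x, lam) for x in range(12)]  # x = 0, ..., 11 as in the plot

for x, p in enumerate(pmf):
    print(f"P(X = {x}) = {p:.4f}")
```

Unlike the binomial distribution, the Poisson distribution has no upper bound on x, so the probabilities listed for x = 0 to 11 sum to slightly less than one.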
Some parametric distributions: Continuous distributions
[Figure: density curves (y-axis, 0.0 to 0.3) of the ex-Gaussian, skew t, Gamma, Inverse Gamma and Gaussian distributions on an x-axis from −10 to 20.]
Some parametric distributions: Normal distribution
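$$X \sim N(\mu, \sigma^2), \qquad f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$
$$E[X] = \mu, \qquad \mathrm{Var}[X] = \sigma^2,$$
$$Z = \frac{X-\mu}{\sigma} \sim N(0, 1), \qquad f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}.$$
Probability density function, $f_Z(z)$, of a standard normal:
[Figure: bell-shaped density curve (y-axis, 0.0 to 0.4) for z between −4 and 4.]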
For a normal distribution, the probability of falling within one, two and three standard deviations of the mean is:
- 68.27% between $\mu - \sigma$ and $\mu + \sigma$,
- 95.45% between $\mu - 2\sigma$ and $\mu + 2\sigma$,
- 99.73% between $\mu - 3\sigma$ and $\mu + 3\sigma$.
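These three percentages follow from the standard normal distribution function, since $P(|Z| < k) = \mathrm{erf}(k/\sqrt{2})$. A quick check in Python (the helper name `coverage` is illustrative):

```python
from math import erf, sqrt

def coverage(k):
    """P(mu - k*sigma < X < mu + k*sigma) for a normal X, i.e. P(|Z| < k)."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * coverage(k):.2f}%")
```

Because the result depends only on k, the same percentages apply to any normal distribution, whatever its mean and variance.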
(i) Suitable model for many variables, for example:
- gene expression intensities [Figure: CCND3 Cyclin D3 expression values of the 27 patients (x-axis, 0.0 to 3.5; y-axis, 0.0 to 1.0)],
- IQ, with $\mu = 100$ and $\sigma = 15$: 68.27% of values lie between 85 and 115, 95.45% between 70 and 130, and 99.73% between 55 and 145 [Figure: normal density of IQ, y-axis from 0.0000 to 0.0266].
(ii) Central limit theorem (Lindeberg-Levy CLT)
- Let $(X_1, \ldots, X_n)$ be n independent and identically distributed (iid) random variables drawn from a distribution with expected value $\mu$ and finite variance $\sigma^2$,
- then
$$\hat{\mu} = \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \xrightarrow{d} N\left(\mu, \frac{\sigma^2}{n}\right).$$
If $X_i \sim N(\mu, \sigma^2)$, this result is true for all sample sizes.
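The theorem can be illustrated by simulation. The sketch below is only one possible illustration, with choices that are not from the slides: a Uniform(0, 1) parent distribution ($\mu = 0.5$, $\sigma^2 = 1/12$), n = 30 observations per sample, 2000 replicates, and a fixed seed for reproducibility:

```python
import random
import statistics

random.seed(1)          # fixed seed so that the run is reproducible
n = 30                  # size of each sample (illustrative choice)
replicates = 2000       # number of simulated samples (illustrative choice)

# Parent distribution: Uniform(0, 1), clearly non-normal.
mu, sigma2 = 0.5, 1 / 12

sample_means = [
    statistics.fmean(random.random() for _ in range(n))
    for _ in range(replicates)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.variance(sample_means)

print(f"mean of sample means:     {mean_of_means:.4f} (theory: {mu})")
print(f"variance of sample means: {var_of_means:.5f} (theory: {sigma2 / n:.5f})")
```

The simulated sample means centre on $\mu$ with variance close to $\sigma^2/n$; a histogram of `sample_means` would look approximately bell-shaped even though the parent distribution is flat.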
Central limit theorem shiny app: Distribution of the mean
https://bioinformatics.cruk.cam.ac.uk/apps/stats/central-limit-theorem/
https://pauljudge.shinyapps.io/central-limit-theorem-master/
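95% Confidence interval for $\mu$, the population mean, when $X_i \sim N(\mu, \sigma^2)$
- if $X_i \sim N(\mu, \sigma^2)$, then $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$,
- so that $Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0, 1)$,
- if $\sigma$ is unknown and estimated by $s$, then $T = \frac{\bar{X}-\mu}{s/\sqrt{n}} \sim St_{n-1}$.
Hence
$$P\left(\bar{X} - t_{0.975,\,n-1}\frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{0.975,\,n-1}\frac{s}{\sqrt{n}}\right) = 0.95.$$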
Central 95% intervals of the standard normal and of Student's t distributions with 30, 10 and 2 degrees of freedom:
- $X \sim N(0, 1)$: [−1.96, 1.96],
- $T \sim St_{30}$: [−2.042, 2.042],
- $T \sim St_{10}$: [−2.228, 2.228],
- $T \sim St_{2}$: [−4.303, 4.303],
where the density of Student's t distribution with n − 1 degrees of freedom is
$$f_T(t \mid n-1) = \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)[(n-1)\pi]^{1/2}} \left(1 + \frac{t^2}{n-1}\right)^{-\frac{n}{2}}.$$
[Figure: the four density curves (y-axis, 0.0 to 0.4) for values between −5 and 5; the fewer the degrees of freedom, the heavier the tails.]
[Figure: CCND3 Cyclin D3 expression values of the 27 patients (x-axis, 0.0 to 3.5; y-axis, 0.0 to 1.0).]
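As a sketch of how such an interval is computed, the code below applies $\bar{X} \pm t_{0.975,\,n-1}\, s/\sqrt{n}$ to the 27 CCND3 values printed earlier in these slides. Two points are assumptions of this sketch: the data are the rounded slide values, and the quantile $t_{0.975,\,26} \approx 2.0555$ is hard-coded from standard t tables because the Python standard library has no t-distribution function:

```python
import statistics
from math import sqrt

# CCND3 Cyclin D3 expression values (ALL patients), rounded as on the slide.
ccnd3 = [0.46, 1.11, 1.28, 1.33, 1.37, 1.52, 1.78, 1.81, 1.82,
         1.83, 1.83, 1.85, 1.90, 1.93, 1.96, 1.99, 2.00, 2.07,
         2.11, 2.18, 2.18, 2.31, 2.34, 2.37, 2.45, 2.59, 2.77]

n = len(ccnd3)
mean = statistics.mean(ccnd3)
standard_error = statistics.stdev(ccnd3) / sqrt(n)

# 0.975 quantile of Student's t with n - 1 = 26 df (value taken from t tables).
t_quantile = 2.0555

ci = (mean - t_quantile * standard_error, mean + t_quantile * standard_error)
print(f"mean = {mean:.3f}, 95% CI = [{ci[0]:.3f}; {ci[1]:.3f}]")
```

Replacing `t_quantile` by 1.96 gives the normal-based interval, which is slightly narrower; the difference shrinks as n grows.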
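95% Confidence interval for $\mu$, the population mean, when $X_i \sim iid(\mu, \sigma^2)$
- CLT: $\bar{X} \xrightarrow{d} N\left(\mu, \frac{\sigma^2}{n}\right)$,
- so that, approximately, $Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0, 1)$,
- if $\sigma$ is unknown and estimated by $s$, then, approximately, $T = \frac{\bar{X}-\mu}{s/\sqrt{n}} \sim St_{n-1}$.
[Figure: bar plot of the probability (y-axis, 0.00 to 0.35) of the number of chronic conditions per patient (x-axis, 0 to 11); US National Medical Expenditure Survey: n = 4406, µ = 1.55.]
95% confidence intervals for $\mu$ obtained with four methods:
- $CI_{Normal}(\mu, 0.95)$ = [1.5021; 1.5818],
- $CI_{Student}(\mu, 0.95)$ = [1.5021; 1.5819],
- $CI_{Bootstrap}(\mu, 0.95)$ = [1.5020; 1.5819],
- $CI_{NB-GLM}(\mu, 0.95)$ = [1.5027; 1.5823].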
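95% Confidence interval for $\mu_Y - \mu_X$, the difference between population means
If we have
- $X_i \sim iid(\mu_X, \sigma_X^2)$, $i = 1, \ldots, n_X$,
- $Y_i \sim iid(\mu_Y, \sigma_Y^2)$, $i = 1, \ldots, n_Y$,
then, with $\alpha = 0.05$,
- if $\sigma_X^2 = \sigma_Y^2$ [Student's t-test equation],
$$CI(\mu_Y - \mu_X, 0.95) = (\bar{Y} - \bar{X}) \pm t_{1-\frac{\alpha}{2},\,n_X+n_Y-2}\; s_p \sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}, \quad \text{where } s_p^2 = \frac{(n_X-1)s_X^2 + (n_Y-1)s_Y^2}{n_X+n_Y-2},$$
- if $\sigma_X^2 \neq \sigma_Y^2$ [Welch-Satterthwaite's t-test equation],
$$CI(\mu_Y - \mu_X, 0.95) = (\bar{Y} - \bar{X}) \pm t_{1-\frac{\alpha}{2},\,df} \sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}, \quad \text{where } df = \frac{\left(\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}\right)^2}{\frac{\left(s_X^2/n_X\right)^2}{n_X-1} + \frac{\left(s_Y^2/n_Y\right)^2}{n_Y-1}}.$$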
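The ingredients of both intervals can be computed in a few lines. The sketch below uses the placental permeability data shown earlier in these slides (X at term, Y at 12 to 26 weeks) purely as an illustration; it stops short of the final interval because the t quantile would have to be looked up in tables or obtained from a statistical package:

```python
import statistics
from math import sqrt

# Placental membrane permeability constants (independent samples).
x = [0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91, 1.64, 0.73, 1.46]  # at term
y = [1.15, 0.88, 0.90, 0.74, 1.21]                                # 12-26 weeks

n_x, n_y = len(x), len(y)
var_x, var_y = statistics.variance(x), statistics.variance(y)
difference = statistics.mean(y) - statistics.mean(x)

# Student: pooled standard deviation and n_x + n_y - 2 degrees of freedom.
s_pooled = sqrt(((n_x - 1) * var_x + (n_y - 1) * var_y) / (n_x + n_y - 2))
se_student = s_pooled * sqrt(1 / n_x + 1 / n_y)
df_student = n_x + n_y - 2

# Welch-Satterthwaite: separate variances and approximate degrees of freedom.
se_welch = sqrt(var_x / n_x + var_y / n_y)
df_welch = se_welch**4 / ((var_x / n_x)**2 / (n_x - 1)
                          + (var_y / n_y)**2 / (n_y - 1))

print(f"difference in means = {difference:.3f}")
print(f"Student: se = {se_student:.4f}, df = {df_student}")
print(f"Welch:   se = {se_welch:.4f}, df = {df_welch:.2f}")
```

Each interval is then the difference plus or minus the corresponding t quantile times the standard error. The Welch degrees of freedom always lie between min(n_X, n_Y) − 1 and n_X + n_Y − 2.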
Central limit theorem shiny app: Coverage of Student's asymptotic confidence intervals
https://bioinformatics.cruk.cam.ac.uk/apps/stats/central-limit-theorem/
https://pauljudge.shinyapps.io/central-limit-theorem-master/
Quiz Time: Practical 1
https://bioinformatics-core-shared-training.github.io/IntroductionToStats/practical.html
PART II: Parametric and non-parametric one-sample location tests
Statistical hypothesis testing
A hypothesis test describes a phenomenon by means of two non-overlapping idealised models/descriptions:
- the null hypothesis H0, "generally assumed to be true until evidence indicates otherwise",
- the alternative hypothesis H1.
The aim of the test is to reject the null hypothesis in favour of the alternative hypothesis, and conclude, with a probability α of being wrong, that the idealised model/description of H1 is true.
Theory 1: Dieters lose more fat than the exercisers
Theory 2: There is no majority for Brexit now
Theory 3: Serum vitamin C is reduced in patients
Statistical hypothesis testing
Several-step process:
- Define H0 and H1 according to a theory,
- Set α, the probability of rejecting H0 when it is true (type I error),
- Determine the test statistic to be used,
- Define n, the sample size, allowing you to reject H0 when H1 is true with a probability 1 − β (power),
- Collect the data,
- Perform the statistical test, compute the p-value, and reject (or not) the null hypothesis.
Statistical hypothesis testing
Many options:
- One-sided versus two-sided tests,
- Exact versus asymptotic tests,
- Parametric versus non-parametric tests.
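Parametric location test: (one-sided) t-test
We test H0: $\mu_{IQ} = 100$ against H1: $\mu_{IQ} > 100$.
We have $X_i \sim N(\mu, \sigma^2)$, $i = 1, \ldots, n$.
We know
- $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$,
- $Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0, 1)$.
Thus, if H0 is true, we have:
- $Z = \frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$.
Define the p-value:
- p-value $= P(T > T_{obs})$, where T denotes the test statistic and $T_{obs}$ its observed value.
[Figure: standard normal density (y-axis, 0.0 to 0.4) for values between −4 and 4.]
[Figure: normal density of IQ (y-axis, 0.0000 to 0.0266) with 68.27% of values between 85 and 115, 95.45% between 70 and 130, and 99.73% between 55 and 145.]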
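For the case where σ is treated as known, the one-sided p-value is the upper tail of the standard normal beyond the observed statistic. The sketch below is illustrative only: σ = 15 is read off the IQ figure (85 and 115 being one standard deviation from 100), while the sample mean of 104 and the sample size of 36 are hypothetical numbers chosen for the example:

```python
from math import erfc, sqrt

def upper_tail_z_test(sample_mean, mu0, sigma, n):
    """One-sided test of H0: mu = mu0 against H1: mu > mu0, sigma known.

    Returns the observed statistic z and the p-value P(Z > z).
    """
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_value = 0.5 * erfc(z / sqrt(2))  # upper tail of the standard normal
    return z, p_value

# Hypothetical sample: 36 people with a mean IQ of 104.
z_obs, p_value = upper_tail_z_test(sample_mean=104.0, mu0=100.0, sigma=15.0, n=36)
print(f"z = {z_obs:.2f}, one-sided p-value = {p_value:.4f}")
```

With these made-up numbers the p-value is slightly above 0.05, so H0 would not be rejected at the 5% level. When σ is unknown, s replaces σ and the reference distribution becomes Student's t with n − 1 degrees of freedom.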
Statistical tests: 4 possible outcomes
Conclude:
- if p-value > α → do not reject H0,
- if p-value < α → reject H0 in favour of H1.

                             Test outcome
                             H0 not rejected    H1 accepted
Unknown truth    H0 true     1 − α [TN]         α [FP]
                 H1 true     β [FN]             1 − β [TP]

where
- α is the type I error rate, the probability of rejecting H0 when H0 is correct,
- β is the type II error rate, the probability of not rejecting H0 when H1 is correct.
Warnings
- 'absence of evidence is not evidence of absence',
- design may help minimise FP and FN (i.e., maximise TN and TP).
Parametric location test: Student's test
A location model is assumed for $X_i$, $i = 1, \ldots, n$:
$$X_i = \mu + e_i, \quad \text{where } e_i \sim N(\mu_e = 0, \sigma_e^2), \text{ a symmetrical distribution.}$$
Interest for H0: $\mu = \mu_0$ against H1: $\mu < \mu_0$ or $\mu \neq \mu_0$ or $\mu > \mu_0$.
Test statistic: $T = \frac{\bar{X}-\mu_0}{s/\sqrt{n}}$.
Distribution of T under H0: $T \sim \mathrm{Student}(df = n-1)$.
R output:
One Sample t-test
data: golub[1042, gol.fac == "ALL"]
t = 4.172, df = 26, p-value = 0.0002982
alternative hypothesis: true mean is not equal to 1.5
95 percent confidence interval:
1.699817 2.087948
sample estimates:
mean of x
1.893883
[Figure: CCND3 Cyclin D3 expression values of the 27 patients (x-axis, 0.0 to 3.5; y-axis, 0.0 to 1.0).]
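The test statistic in the output above can be recomputed by hand. This sketch uses the 27 CCND3 values as printed (rounded to two decimals) earlier in these slides, so the result agrees with the reported t = 4.172 only up to rounding; the p-value is not computed because it requires the Student t distribution function, which the Python standard library does not provide:

```python
import statistics
from math import sqrt

# CCND3 Cyclin D3 expression values (ALL patients), rounded as on the slide.
ccnd3 = [0.46, 1.11, 1.28, 1.33, 1.37, 1.52, 1.78, 1.81, 1.82,
         1.83, 1.83, 1.85, 1.90, 1.93, 1.96, 1.99, 2.00, 2.07,
         2.11, 2.18, 2.18, 2.31, 2.34, 2.37, 2.45, 2.59, 2.77]

mu0 = 1.5                      # value of the mean under H0
n = len(ccnd3)
mean = statistics.mean(ccnd3)
s = statistics.stdev(ccnd3)    # sample standard deviation

t_obs = (mean - mu0) / (s / sqrt(n))
df = n - 1
print(f"t = {t_obs:.3f}, df = {df}")
```

The observed statistic lies far beyond 2.056, the 0.975 quantile of Student's t with 26 degrees of freedom, which is consistent with the small two-sided p-value reported.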
Non-parametric location test: Wilcoxon signed-rank test
A location model is assumed for $X_i$, $i = 1, \ldots, n$:
$$X_i = \theta + e_i, \quad \text{where } e_i \sim iid(\mu_e = 0, \sigma_e^2), \text{ a symmetrical distribution.}$$
Interest for H0: $\theta = \theta_0$ against H1: $\theta < \theta_0$ or $\theta \neq \theta_0$ or $\theta > \theta_0$.
Test statistic: $W^+ = \sum_{i=1}^{n} \iota(X_i - \theta_0 > 0)\, \mathrm{Rank}(|X_i - \theta_0|)$, where $\iota(\cdot)$ is the indicator function.
Distribution of $W^+$ under H0: $W^+$ has no closed-form distribution.
R output:
Wilcoxon signed rank exact test
data: golub[1042, gol.fac == "ALL"]
V = 333, p-value = 0.0002363
alternative hypothesis: true location is not equal to 1.5
95 percent confidence interval:
1.73868 2.09106
sample estimates:
(pseudo)median
1.926475
[Figure: CCND3 Cyclin D3 expression values of the 27 patients (x-axis, 0.0 to 3.5; y-axis, 0.0 to 1.0).]
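The statistic $W^+$ is simple to compute: rank the absolute deviations from $\theta_0$ and add up the ranks of the positive deviations. The sketch below is one possible implementation (the helper names are illustrative, and tied values receive the average of their ranks, a common convention); it uses the rounded CCND3 values printed earlier in these slides with $\theta_0 = 1.5$:

```python
def midranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks

def signed_rank_statistic(x, theta0):
    """W+ : sum of the ranks of |x_i - theta0| over observations above theta0."""
    deviations = [xi - theta0 for xi in x if xi != theta0]
    ranks = midranks([abs(d) for d in deviations])
    return sum(r for d, r in zip(deviations, ranks) if d > 0)

ccnd3 = [0.46, 1.11, 1.28, 1.33, 1.37, 1.52, 1.78, 1.81, 1.82,
         1.83, 1.83, 1.85, 1.90, 1.93, 1.96, 1.99, 2.00, 2.07,
         2.11, 2.18, 2.18, 2.31, 2.34, 2.37, 2.45, 2.59, 2.77]

w_plus = signed_rank_statistic(ccnd3, theta0=1.5)
print(f"W+ = {w_plus}")
```

On these values the sketch gives the same statistic as the R output (V = 333). The exact p-value comes from enumerating the distribution of $W^+$ under H0, which is what statistical packages do for small samples.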
Parametric or non-parametric ?
[Table: appropriateness of the t-test, by sample size (rows: Small, Medium, Large) and by whether the outcome(s) are normally distributed (columns: Yes, Mildly, No).]
Situations which may suggest the use of non-parametric statistics:
- When there is a small sample size or very unequal groups,
- When the data have notable outliers,
- When one outcome has a distribution other than normal,
- When the data are ordered with many ties or are rank ordered.
Non-parametric does not mean assumption free
PART III: Parametric and non-parametric two-sample location tests
Two-sample case
Many options:
- One-sided versus two-sided tests,
- Exact versus asymptotic tests,
- Parametric versus non-parametric tests,
- Tests for paired versus independent data.
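Parametric two-sample location test: two-sample two-sided Student's & Welch's t-tests
[Figure: intensity expression of gene 'CCND3 Cyclin D3' (x-axis, −0.5 to 2.5) for acute lymphoblastic leukemia (ALL, n = 27) and acute myeloid leukemia (AML, n = 11).]
We test H0: $\mu_Y - \mu_X = 0$ against H1: $\mu_Y - \mu_X \neq 0$.
We know:
- Student's t-test [assume $\sigma_X^2 = \sigma_Y^2$]:
$$\frac{(\bar{Y}-\bar{X}) - (\mu_Y - \mu_X)}{s_p\sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}} \sim t_{n_X+n_Y-2}, \quad \text{with critical value } t_{1-\frac{\alpha}{2},\,n_X+n_Y-2},$$
- Welch's t-test [assume $\sigma_X^2 \neq \sigma_Y^2$]:
$$\frac{(\bar{Y}-\bar{X}) - (\mu_Y - \mu_X)}{\sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}} \sim t_{df} \text{ (approximately)}, \quad \text{with critical value } t_{1-\frac{\alpha}{2},\,df}.$$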
R output, Student's two-sample t-test (equal variances assumed):
Two Sample t-test
data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"]
t = 6.7983, df = 36, p-value = 6.046e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.8829143 1.6336690
sample estimates:
mean of x  mean of y
1.8938826  0.6355909
R output, Welch's two-sample t-test (unequal variances):
Welch Two Sample t-test
data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"]
t = 6.3186, df = 16.118, p-value = 9.871e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.8363826 1.6802008
sample estimates:
mean of x  mean of y
1.8938826  0.6355909
Non-parametric two-sample location test: Mann-Whitney-Wilcoxon test
Let
- $X_i \sim iid(\mu_X, \sigma^2)$, $i = 1, \ldots, n_X$,
- $Y_i \sim iid(\mu_X + \delta, \sigma^2)$, $i = 1, \ldots, n_Y$.
Interest for H0: $\delta = \delta_0$ against H1: $\delta < \delta_0$ or $\delta \neq \delta_0$ or $\delta > \delta_0$.
Standardised test statistic:
$$z = \frac{\sum_{i=1}^{n_Y} R(Y_i) - [n_Y(n_X+n_Y+1)/2]}{\sqrt{n_X n_Y (n_X+n_Y+1)/12}},$$
where $R(Y_i)$ denotes the rank of $Y_i$ amongst the combined samples, i.e., amongst $(X_1, \ldots, X_{n_X}, Y_1, \ldots, Y_{n_Y})$.
Distribution of Z under H0: $Z \sim N(0, 1)$ (asymptotically).
Implementation 1:
statistic = -4.361334, p-value = 1.292716e-05
Implementation 2:
W = 284, p-value = 6.15e-07
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
0.89647 1.57023
sample estimates:
difference in location
1.21951
[Figure: intensity expression of gene 'CCND3 Cyclin D3' (x-axis, −0.5 to 2.5) for acute lymphoblastic leukemia (ALL, n = 27) and acute myeloid leukemia (AML, n = 11).]
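The standardised statistic can be sketched in a few lines. Since the AML expression values are not listed in these slides, the example below uses the placental permeability data shown earlier (X at term, Y at 12 to 26 weeks) as a stand-in, with $\delta_0 = 0$; the helper name `midranks` is illustrative and no tie correction is applied to the variance:

```python
from math import sqrt

def midranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks

# Placental membrane permeability constants (independent samples).
x = [0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91, 1.64, 0.73, 1.46]
y = [1.15, 0.88, 0.90, 0.74, 1.21]
n_x, n_y = len(x), len(y)

ranks = midranks(x + y)              # ranks within the combined sample
rank_sum_y = sum(ranks[n_x:])        # sum of the ranks of the Y values

expected = n_y * (n_x + n_y + 1) / 2         # mean of the rank sum under H0
variance = n_x * n_y * (n_x + n_y + 1) / 12  # its variance under H0 (no ties)
z = (rank_sum_y - expected) / sqrt(variance)

print(f"rank sum of Y = {rank_sum_y}, z = {z:.4f}")
```

With samples this small the normal approximation is rough, which is one reason why two implementations of the same test (asymptotic versus exact) can report different p-values.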
F-test of equality of variances
[Figure: intensity expression of gene 'CCND3 Cyclin D3' (x-axis, −0.5 to 2.5) for acute lymphoblastic leukemia (ALL, n = 27) and acute myeloid leukemia (AML, n = 11).]
We test H0: $\sigma_Y^2 = \sigma_X^2$ against H1: $\sigma_Y^2 \neq \sigma_X^2$.
We know:
- F-test [assume $X_i \sim N(\mu_X, \sigma_X^2)$ and $Y_i \sim N(\mu_Y, \sigma_Y^2)$]: under H0,
$$\frac{s_Y^2}{s_X^2} \sim F_{n_Y-1,\,n_X-1}.$$
R output:
F test to compare two variances
data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"]
F = 0.71164, num df = 26, denom df = 10, p-value = 0.4652
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2127735 1.8428387
sample estimates:
ratio of variances
0.7116441
Warning: Multiplicity correction
For each test, the probability of rejecting H0 (and accepting H1) when H0 is true equals α.
For k independent tests, the probability of rejecting H0 (and accepting H1) at least once when H0 is true, $\alpha_k$, is given by
$$\alpha_k = 1 - (1-\alpha)^k.$$
Thus, for α = 0.05,
- if k = 1, $\alpha_1 = 1 - (1-\alpha)^1 = 0.05$,
- if k = 2, $\alpha_2 = 1 - (1-\alpha)^2 = 0.0975$,
- if k = 10, $\alpha_{10} = 1 - (1-\alpha)^{10} = 0.4013$.
Idea: change the level of each test so that $\alpha_k = 0.05$:
- Bonferroni correction: $\alpha = \frac{\alpha_k}{k}$,
- Dunn-Sidak correction: $\alpha = 1 - (1-\alpha_k)^{1/k}$.
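These formulas are easy to check numerically. A small sketch in Python (function names are illustrative), assuming as above that the k tests are independent:

```python
def familywise_error(alpha, k):
    """Probability of at least one false rejection among k independent tests."""
    return 1 - (1 - alpha) ** k

def bonferroni_level(alpha_k, k):
    """Per-test level under the Bonferroni correction."""
    return alpha_k / k

def sidak_level(alpha_k, k):
    """Per-test level under the Dunn-Sidak correction."""
    return 1 - (1 - alpha_k) ** (1 / k)

for k in (1, 2, 10):
    print(f"k = {k:2d}: alpha_k = {familywise_error(0.05, k):.4f}, "
          f"Bonferroni = {bonferroni_level(0.05, k):.5f}, "
          f"Dunn-Sidak = {sidak_level(0.05, k):.5f}")
```

The Dunn-Sidak level brings the family-wise error back to exactly 0.05 for independent tests, while the Bonferroni level is slightly smaller and therefore slightly conservative.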
Warning: Non-parametric is not assumption free: Type I error
Simulate 2500 samples with
- $X_i \sim \mathrm{Uniform}(1.5, 2.5)$, $i = 1, \ldots, n_X$,
- $Y_i \sim \mathrm{Uniform}(0, 4)$, $i = 1, \ldots, n_Y$,
so that $E[X_i] = E[Y_i] = 2$ (i.e., same mean, same median).
Assume
- $X_i \sim iid(\mu_X, \sigma^2)$, $i = 1, \ldots, n_X$,
- $Y_i \sim iid(\mu_X + \delta, \sigma^2)$, $i = 1, \ldots, n_Y$.
Test H0: $\delta = \delta_0$ against H1: $\delta \neq \delta_0$, at the 5% level, by means of
- Mann-Whitney-Wilcoxon test (MWW),
- Student's t-test,
- Welch's test.
Observed type I error rate α, by test and sample size:
Sample size             MWW      Student's t-test    Welch's test
nX = 200, nY = 70       0.145    0.202               0.055
nX = 20, nY = 7         0.148    0.240               0.062