Experimental Research Methodology
– Statistical Tests –
Fernando Brito e Abreu ([email protected])
Universidade Nova de Lisboa (http://www.unl.pt)
QUASAR Research Group (http://ctp.di.fct.unl.pt/QUASAR)
Data analysis taxonomy
Number of independent variables (aka factors):
One-factorial analysis – one independent variable
Multifactorial analysis – several independent variables
Number of dependent variables:
Univariate analysis – one dependent variable
Multivariate analysis – several dependent variables
Data analysis methods
Proportion testing
Inference tests for categorical and continuous data
Parametric testing
Non-parametric testing
Regression analysis
Linear regression modeling
Nonlinear regression modeling (e.g. logistic regression analysis)
Multivariate data analysis
Factor analysis
Cluster analysis
Discriminant analysis
More on scale types
Categorical (discrete) data
Nominal scale
Ordinal scale
Absolute scale
Continuous data
Interval scale
Ratio scale
Variable types
Independent variables
(aka factors or explanatory variables)
Are those that are manipulated in experimental research
Ex: Programming language, Development environment, Design size,
Practitioner expertise
Dependent variables
(aka outcome variables, measures, criteria)
Are those on which we want to measure the effect of the
independent variables in experimental research
Ex: Effort to produce a given deliverable, Project schedule, Defects
found in code inspection, System faults in operation (e.g. MTBF, MTTR)
Exercise: Identify the independent and dependent
variables …
[Diagram residue: the exercise shows practitioner expertise (developer and reviewer), component internal complexity, component interface complexity, assembly complexity, defects detected during inspection, and defects detected during component integration]
Degrees of freedom (df) of an estimate
Is the number of independent pieces of information on
which the estimate is based.
Is the number of values in the final calculation of a statistic that
are free to vary (df = number of different treatments – 1)
Why is the "Normal distribution" important?
… because in most cases, it approximates well the function that represents the relationship between "magnitude" and "significance" of relations between two variables, depending on the sample size
The distribution of many test statistics is normal or follows some form that can be derived from the normal distribution
Many frequently used statistical tests make the assumption that the data come from a normal distribution
Distribution adherence
The distribution type conditions the kind of statistical tests we can apply
Therefore we want to know if a variable follows (adheres to) a given statistical distribution
Often we are interested in how well the distribution can be approximated by the normal distribution
We can take several, increasingly powerful, approaches:
Use descriptive statistics
Use plots
Use distribution adherence tests
Testing distribution adherence – Most common normality tests
Kolmogorov-Smirnov one-sample test
Lilliefors test (correction upon the previous)
Shapiro-Wilk's W test
Royston test (correction upon the previous)
These tests are also known as goodness-of-fit ones since they
test whether the observations could reasonably have come from
the specified distribution
Testing distribution adherence – Kolmogorov-Smirnov one-sample test
The Kolmogorov-Smirnov one-sample test for normality is based on the maximum difference between the sample cumulative distribution and the hypothesized cumulative distribution.
H0: X ~ N(μ; σ)
H1: ¬(X ~ N(μ; σ))
Notes: For many software programs, the probability values that are reported are
based on those tabulated by Massey (1951); those probability values are valid when the mean and standard deviation of the normal distribution are known a-priori and not estimated from the data
This test can be used to verify goodness of fit for other distributions (e.g. uniform, Poisson, exponential)
Testing distribution adherence – Kolmogorov-Smirnov one-sample test
Interpretation:
If the Z statistic is significant, then the hypothesis that the respective distribution is normal (H0) should be rejected
"Significant" means that the statistical significance p of the result does not exceed the test significance α (required level)
Example: Consider the test significance α = 0.05
Probability of Type I error = 0.05 * 100% = 5%
(probability of rejecting H0, the null hypothesis, when it is true)
If p ≤ α (significant Z statistic):
Reject H0 and accept H1 (the sample cannot come from a Normal population)
If p > α (not significant Z statistic):
Accept H0 and therefore reject H1 (the sample may come from a Normal population)
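The decision rule above can be sketched in Python with scipy; the skewed "effort" sample is synthetic, standing in for real project data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical, heavily skewed "effort" sample (software data is rarely Normal)
effort = rng.lognormal(mean=3.0, sigma=1.0, size=500)

# One-sample KS test against N(mu, sigma); note that estimating mu and sigma
# from the sample makes these p-values optimistic (see the Lilliefors test)
stat, p = stats.kstest(effort, 'norm', args=(effort.mean(), effort.std(ddof=1)))
print(f"D = {stat:.3f}, p = {p:.2g}")
if p <= 0.05:
    print("Reject H0: the sample is unlikely to come from a Normal population")
```

The D statistic printed here is the maximum difference between the empirical and hypothesized cumulative distributions, exactly as defined on the slide.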
One-Sample Kolmogorov-Smirnov Test

                                   FunctionalSize   NormalisedWorkEffort
N                                        3310              4180
Normal Parameters(a,b)   Mean          444.17           5951.71
                         Std. Dev.    926.623         20567.461
Most Extreme             Absolute        .317              .386
Differences              Positive        .262              .302
                         Negative       -.317             -.386
Kolmogorov-Smirnov Z                   18.238            24.970
Asymp. Sig. (2-tailed)                   .000              .000

a. Test distribution is Normal.
b. Calculated from data.
Example:
Even for a test significance α = 0.01 (99% confidence
interval), since p = 0.000 ≤ α (significant Z statistic):
We reject H0 and accept H1 (neither Size nor Effort can come
from a Normal population)
SPSS:
Analyze
Nonparametric Tests
1-Sample K-S…
Testing distribution adherence – Lilliefors test
This test is basically a correction to the Kolmogorov-Smirnov test, applicable when the parameters of the hypothesized normal distribution are estimated from the sample data
Interpretation:
If the Z statistic is significant, then the hypothesis that the respective distribution is normal should be rejected (same as for the KS test)
Notes:
In a Kolmogorov-Smirnov test for normality when the mean and standard deviation of the hypothesized normal distribution are not known (i.e., they are estimated from the sample data), the probability values tabulated by Massey (1951) are not valid. In that case, the test for normality involves a complex conditional hypothesis ("how likely is it to obtain a D statistic of this magnitude or greater, contingent upon the mean and standard deviation computed from the data"), and the Lilliefors probabilities should be used instead (Lilliefors, 1967)
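As a sketch of why the correction matters, a Monte Carlo version of the Lilliefors idea can be built on top of the plain KS statistic: the null distribution of D is simulated instead of looked up in Massey's table (statsmodels also ships a ready-made `lilliefors` function). All data here are synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def lilliefors_pvalue(x, n_sim=2000):
    """Monte Carlo Lilliefors-style test: KS statistic against a normal whose
    mean/std are estimated from the sample, with the null distribution of D
    obtained by simulation (so Massey's table is not needed)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d_obs = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1))).statistic
    d_sim = np.empty(n_sim)
    for i in range(n_sim):
        s = rng.standard_normal(n)  # draw from the null (normal) model
        d_sim[i] = stats.kstest(s, 'norm', args=(s.mean(), s.std(ddof=1))).statistic
    return d_obs, (d_sim >= d_obs).mean()

normal_sample = rng.normal(10, 2, size=80)
skewed_sample = rng.exponential(scale=5, size=80)
d1, p1 = lilliefors_pvalue(normal_sample)
d2, p2 = lilliefors_pvalue(skewed_sample)
print(f"normal sample: D = {d1:.3f}, p = {p1:.3f}")
print(f"skewed sample: D = {d2:.3f}, p = {p2:.3f}")
```

The skewed (exponential) sample should be rejected; the truly normal one usually should not.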
Testing distribution adherence – Shapiro-Wilk's W test
This test is the preferred test of normality because of its good
power properties as compared to a wide range of alternative tests
Interpretation:
If the W statistic is significant (i.e. p ≤ α), then the hypothesis that
the respective distribution is normal should be rejected
Notes:
Some software programs implement an extension to the test
described by Royston (1982), which allows it to be applied to large
samples (with up to 5000 observations)
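A minimal sketch with scipy's `shapiro`; both samples are synthetic, with the skewed one mimicking typical effort distributions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
normal_data = rng.normal(50, 5, size=200)     # should pass the test
skewed_data = rng.lognormal(3, 1, size=200)   # should fail the test

w1, p1 = stats.shapiro(normal_data)
w2, p2 = stats.shapiro(skewed_data)
print(f"normal data: W = {w1:.3f}, p = {p1:.3f}")   # p likely above alpha
print(f"skewed data: W = {w2:.3f}, p = {p2:.3g}")   # p far below alpha
```

W close to 1 indicates good agreement with normality; a markedly smaller W (as for the lognormal sample) leads to rejection.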
Statistical significance (p-value) of a result
The p-value represents the probability of error that is involved in accepting our observed result as valid, that is, as "representative" of the population
A p-value of 5% (i.e., 1/20) indicates that there is a 5% probability that the relation between the variables found in our sample is a "fluke" (stroke of luck)
For adherence tests, the p-value is the probability that the observed difference between the sample cumulative distribution and the hypothesized cumulative distribution occurred by pure chance ("luck of the draw")
In other words, that in the population from which the sample was drawn, no such difference exists
Common p-values
(conventions in many research areas)
Borderline statistically significant
p-value = 5% (1/20)
Statistically significant
p-value = 1% (1/100)
Highly statistically significant
p-value = 0.5% (1/200) or even 0.1% (1/1000)
Hypothesis testing
Suppose that a CIO is interested in showing that in his software house the projects have an average defect density (ADD) below 5 [KLOC⁻¹]. This question, in statistical terms: "Is ADD < 5?"
STEP 1: State as a "statistical null hypothesis" (hypothesis H0) something that is the logical opposite of what you believe. H0: ADD ≥ 5
STEP 2: Collect data (build a sample)
STEP 3: Using statistical theory, show from the data that it is likely H0 is false, and should be rejected. By rejecting H0, you support what you actually believe.
This kind of situation, which is typical in many fields of research, is called "Reject-Support testing," (RS testing) because rejecting the null hypothesis supports the experimenter's theory.
Hypothesis testing
Two kinds of errors
α – Type I error rate, must be kept at or below .05
β – Type II error rate, must be kept low as well (the conventions are
much more rigid with respect to α than with respect to β)
The "statistical power" (1-β) must be kept high
Ideally, power should be at least .80 to detect a reasonable departure
from the null hypothesis
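The α and power conventions above can be checked empirically by simulation; this sketch (synthetic data, two-sample t-test) estimates the Type I error rate under a true H0 and the power under a one-standard-deviation difference:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, n_sim = 0.05, 30, 4000

# Type I error rate: draw both groups from the same population (H0 true)
alpha_hat = np.mean([
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue <= alpha
    for _ in range(n_sim)])

# Power (1 - beta): group means differ by one standard deviation (H0 false)
power_hat = np.mean([
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(1, 1, n)).pvalue <= alpha
    for _ in range(n_sim)])

print(f"estimated Type I error rate ~ {alpha_hat:.3f}")  # should be near 0.05
print(f"estimated power (1 - beta) ~ {power_hat:.3f}")   # should exceed 0.80
```

With n = 30 per group and a one-standard-deviation effect, the estimated power comfortably exceeds the conventional .80 threshold.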
                         State of the World
                      H0 true                 H1 true
Decision   H0         Correct H0 acceptance   Type II error (β)
                      (1 - α)
           H1         Type I error (α)        Correct H0 rejection
                                              (1 - β)

The null hypothesis is either true or false
The statistical decision should be set up so that no "ties" occur
The null hypothesis is either rejected or not rejected
Hypothesis testing (expanded)

                        STATE OF THE WORLD
DECISION                H0 is true / H1 is false       H0 is false / H1 is true
Accept H0 / Reject H1   OK: correct H0 acceptance,     Type II error: incorrect H0
                        correct H1 rejection           acceptance, incorrect H1
                        (probability = 1 - α)          rejection (probability = β)
Reject H0 / Accept H1   Type I error: incorrect H0     OK: correct H0 rejection,
                        rejection, incorrect H1        correct H1 acceptance
                        acceptance (probability = α)   (probability = 1 - β)
Statistical Tests
Parametric tests
Assure stronger validity than the non-parametric
counterparts
Their statistical power is greater
Non-parametric tests
Weaker validity than the parametric counterparts
Their statistical power is smaller
Statistical Tests for Scales
Measurement scale of the variable under consideration:
Nominal or Ordinal → non-parametric methods
Interval or Ratio → does the variable follow a Normal distribution?
No → non-parametric methods
Yes → parametric methods
Parametric tests (between groups)

Name                                Factors/Treat.   Outcome scale          Null hypothesis
t-Student (one sample)              N/A              Numeric (absolute,     The mean of a variable is equal to
                                                     interval or ratio)     a specified constant
t-Student (2 independent samples)   1/2              Numeric (absolute,     The means of a variable on each
                                                     interval or ratio)     group (treatment) are the same
One-Way ANOVA                       1/2+             Numeric (absolute,     The means of a variable on each
                                                     interval or ratio)     group (treatment) are the same
Factorial ANOVA                     2+/2+            Numeric (absolute,     i) The means of a variable on each
                                                     interval or ratio)     group (treatment) are the same
                                                                            ii) There is no interaction among
                                                                            the factors
Nonparametric tests (between groups)

Name                               Factors/Treat.   Outcome scale       Null hypothesis
Binomial test                      1/2              N/A                 The expected proportions are the
(test of proportions)                                                  ones being tested
Chi-Square                         1/2+             N/A                 The expected proportions in the
(test of proportions)                                                  groups are similar
Mann-Whitney test (aka U-test)     1/2              At least ordinal    The two groups have similar central
                                                                       tendency
Kruskal-Wallis test (aka H-test)   1/2+             At least ordinal    The several groups have a similar
                                                                       location parameter
Nonparametric Factorial ANOVA      2+/2+            At least ordinal    i) The several groups have a similar
                                                                       location parameter
                                                                       ii) There is no interaction among
                                                                       the factors
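The Mann-Whitney U-test and Kruskal-Wallis H-test from the table can be sketched with scipy; the defect counts below are synthetic and purely illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical defect counts (ordinal-friendly outcome) for three treatments;
# the third group is drawn from a clearly shifted distribution
g1 = rng.poisson(4, size=40)
g2 = rng.poisson(4, size=40)
g3 = rng.poisson(9, size=40)

u = stats.mannwhitneyu(g1, g3)   # U-test: 1 factor, 2 groups
h = stats.kruskal(g1, g2, g3)    # H-test: 1 factor, 2+ groups
print(f"Mann-Whitney U = {u.statistic:.1f}, p = {u.pvalue:.3g}")
print(f"Kruskal-Wallis H = {h.statistic:.2f}, p = {h.pvalue:.3g}")
```

Both tests work on ranks, so they only require at least an ordinal outcome scale, matching the table above.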
PARAMETRIC TESTS
T-Student test (one sample)
One sample T-Student test – Applicability
This procedure tests whether the mean of a quantitative
variable differs from a hypothesized test value
The test value is a specified constant
Design
N/A
Scales
Factor (grouping) variable: none
Outcome variable: numeric (absolute, interval or ratio)
One sample T-Student test – Assumptions
This test assumes that the data from the outcome
variable are normally distributed; however, it is fairly
robust to departures from normality.
One sample T-Student test – Hypotheses being tested
H0: μ = k
The mean of the variable does not differ significantly from a
specified value k
H1: μ ≠ k
The mean of the variable differs significantly from the
specified value k
This test uses the T statistic that has a t-Student
distribution with (n-1) degrees of freedom
One sample T-Student test – Test decision
For n ≤ 30:
We reject H0, for a given level of significance α, if:
|Tcalc| > t1-α/2(n-1)
The critical values can be obtained from a t-Student table.
For n > 30: the t-Student distribution becomes ~ N(0;1)
Therefore, we can then use the test significance:
If p ≤ α (significant T statistic):
Reject H0 and accept H1 (the mean differs from the specified value k)
If p > α (not significant T statistic):
Accept H0 and therefore reject H1 (the mean does not differ from k)
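The same decision rule can be sketched with scipy's `ttest_1samp`, on a synthetic effort sample whose true mean is set to 16 man-hours (mirroring the example that follows):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical effort per adjusted FP sample (man-hours); true mean set to 16
effort_per_fp = rng.normal(16, 4, size=1000)

t15 = stats.ttest_1samp(effort_per_fp, popmean=15)
t16 = stats.ttest_1samp(effort_per_fp, popmean=16)
print(f"H0: mu = 15 -> t = {t15.statistic:.3f}, p = {t15.pvalue:.4f}")
print(f"H0: mu = 16 -> t = {t16.statistic:.3f}, p = {t16.pvalue:.4f}")
# With alpha = 0.05, reject H0 whenever p <= alpha
```

Testing against 15 should be rejected while testing against the true mean of 16 typically is not, reproducing the logic of the SPSS example.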
Example (1/3)
Problem:
Is the population mean
effort per adjusted
function point equal to
15 or 16 man.hours?
First we have to
compute that effort …
SPSS:
Transform
Compute Variable
One-Sample Kolmogorov-Smirnov Test

                                   Adjusted Function Points
N                                         2839
Normal Parameters(a,b)   Mean           428.94
                         Std. Dev.      829.349
Most Extreme             Absolute         .304
Differences              Positive         .256
                         Negative        -.304
Kolmogorov-Smirnov Z                    16.186
Asymp. Sig. (2-tailed)                    .000

a. Test distribution is Normal.
b. Calculated from data.
Example (2/3)
Assumption: is the effort per adjusted function point
Normally distributed?
The effort is not Normally distributed,
but since this test is robust to non-
Normal data, we will still use it!
SPSS:
Analyze
Nonparametric Tests
1 Sample K-S
Example (3/3)
SPSS:
Analyze
Compare Means
One-Sample T-Test…
One-Sample Statistics

                          N      Mean      Std. Deviation   Std. Error Mean
Effort per Adjusted FP   2839   16.5820       25.00000          .46920

One-Sample Test (Test Value = 15)

                           t      df    Sig. (2-tailed)   Mean Difference   95% CI Lower   95% CI Upper
Effort per Adjusted FP   3.372   2838        .001             1.58204          .6620          2.5020

One-Sample Test (Test Value = 16)

                           t      df    Sig. (2-tailed)   Mean Difference   95% CI Lower   95% CI Upper
Effort per Adjusted FP   1.240   2838        .215              .58204         -.3380         1.5020
H0 cannot be rejected, even with a 90%
confidence interval. The population mean
value of the variable is 16!
H0 is rejected. The population mean
value of the variable is not 15!
Conclusion:
The expected value of the population mean for
FP counts using the IFPUG rules is 16.
T-Student test (2 samples)
Two samples T-Student test – Applicability
This test allows inferring the equality of the means in the
populations from two samples (groups)
Design
1 factor, 2 treatments, independent samples
Scales
Factor (grouping) variable: categorical or cut-point
defined upon a numeric variable (e.g. setting a cut point
on team size of 10 persons allows splitting projects on
two groups, according to that variable)
Outcome variable: numeric (absolute, interval or ratio)
Two samples T-Student test – Assumptions
The subjects should be randomly assigned to two groups, so that any difference in response is due to the treatments and not to other factors.
This test assumes that the data from the outcome variable are normally distributed; however, it is fairly robust to departures from normality.
This test uses different statistics depending on the outcome variable having homogeneous or non-homogeneous variances on the two groups. This homogeneity can be assessed with the Levene test.
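The Levene-then-t procedure can be sketched with scipy: run Levene's test first, then pick the pooled or the Welch t statistic accordingly. The two groups below are synthetic, loosely modeled on the IFPUG example that follows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Hypothetical effort samples for two groups with clearly different spreads
group_a = rng.normal(18, 28, size=300)
group_b = rng.normal(14, 14, size=150)

# Levene's test for homogeneity of variances
lev = stats.levene(group_a, group_b)
print(f"Levene: W = {lev.statistic:.3f}, p = {lev.pvalue:.4f}")

# Choose the t statistic accordingly (Welch's t when variances differ)
equal_var = lev.pvalue > 0.05
t = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
print(f"t = {t.statistic:.3f}, p = {t.pvalue:.4f} (equal_var={equal_var})")
```

This mirrors SPSS's Independent Samples T-Test output, which reports both the "equal variances assumed" and "not assumed" rows alongside Levene's test.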
Two samples T-Student test – Hypotheses being tested
H0: μA = μB
The means of the variable on each group (treatment)
are the same
H1: μA ≠ μB
The means of the variable on each group
(treatment) are not the same
Two samples T-Student test – Test decision
For n ≤ 30:
We reject H0, for a given level of significance α, if:
|Tcalc| > t1-α/2(nA+nB-2)
The critical values can be obtained from a t-Student table.
For n > 30: the t-Student distribution becomes ~ N(0;1)
Therefore, we can then use the test significance:
If p ≤ α (significant T statistic):
Reject H0 and accept H1 (means are different)
If p > α (not significant T statistic):
Accept H0 and therefore reject H1 (means are similar)
Example (1/3)
Problem:
Is the mean number of adjusted Function Points the same when using IFPUG counting or any other counting (e.g. COSMIC, FiSMA, Feature Points)?
First we have to create a new factor variable (isIFPUG)
IFPUG projects will be coded "1" and non-IFPUG "0"
SPSS:
Transform
Compute Variable
Example (2/3)
SPSS:
Analyze
Compare Means
Independent Samples T-Test
Group Statistics

                         Is FP counting IFPUG?    N      Mean      Std. Deviation   Std. Error Mean
Effort per Adjusted FP   1                       2839   17.9005       28.46928          .53431
                         0                        148   13.7710       14.25176         1.17149
Independent Samples Test (Effort per Adjusted FP)

Levene's Test for Equality of Variances:  F = 4.960, Sig. = .026

t-test for Equality of Means:
                               t       df       Sig.        Mean         Std. Error   95% CI Lower   95% CI Upper
                                              (2-tailed)    Difference   Difference
Equal variances assumed      1.753    2985       .080        4.12954      2.35567       -.48937        8.74845
Equal variances not assumed  3.207   214.040     .002        4.12954      1.28758       1.59157        6.66751
Example (3/3)
For a confidence interval of 95% (equal variances not assumed), we can say that the means between the two groups differ significantly!
For a confidence interval of 99% (equal variances assumed), we cannot say that the means between the two groups differ significantly!
Conclusion:
The FP counting rules other than the IFPUG
ones do not seem to differ significantly from
the latter.
Setting the value of the confidence interval
can change results interpretation in border-
line situations!
For a confidence interval of 95% (p < α), H0 can be rejected, therefore sample variances cannot be considered homogeneous.
For a confidence interval of 99% (p > α), H0 cannot be rejected, therefore sample variances can be considered homogeneous.
One-Way ANOVA (One-factorial ANalysis Of VAriance)
One-Way ANOVA – Applicability
This procedure is used to test the hypothesis that the
means among several groups (determined by a factor
variable) are equal. Therefore, it allows testing if there is
a variance on the outcome variable that is due to the
factor. This is an extension of the two-sample t test.
Design
1 factor, 2+ treatments, independent samples
Scales
Factor (grouping) variable: categorical (recoded into
numeric)
Outcome variable: numeric (absolute, interval or ratio)
One-Way ANOVA Assumptions
Each group is an independent random sample from a
normal population. One-Way ANOVA is robust to
departures from normality, although the data should be
symmetric.
The groups should come from populations with equal
(homogeneous) variances. To test this assumption, use
Levene's homogeneity-of-variance test.
One-Way ANOVA – The groups
Let us consider that we have k groups (each group is a
sample), each one corresponding to a given treatment
(factor level)
sample 1 with n1 elements: X1 = {X11, X21, ..., Xn1 1}
…
sample k with nk elements: Xk = {X1k, X2k, ..., Xnk k}
where Xij is the value observed for subject i, belonging to sample j
One-Way ANOVA – Calculating the variance

Let SST be the sum of the squares of the deviations of the observed values around the global mean X̄:

SST = Σ(j=1..k) Σ(i=1..nj) (Xij − X̄)²

Let SSW be the sum of the squares of the deviations within groups (around each group mean X̄j):

SSW = Σ(j=1..k) Σ(i=1..nj) (Xij − X̄j)²

Let SSB be the sum of the squares of the deviations between groups:

SSB = Σ(j=1..k) nj (X̄j − X̄)²

These satisfy SST = SSW + SSB.
One-Way ANOVA – The T statistic

The ANOVA compares the sum of the squares of the deviations between groups (difference between groups) with the sum of the squares within groups.
The null hypothesis is tested using the following test statistic:

T = [SSB / (k − 1)] / [SSW / (n − k)]

where n = number of cases and k = number of groups.
Under the null hypothesis, the T statistic follows an F (Snedecor) distribution with (k−1, n−k) degrees of freedom, i.e.,

T ~ F(k−1, n−k)
One-Way ANOVA – Hypotheses being tested

H0: The means of the outcome variable for each group (treatment) are all the same
∀ i,j: μi = μj (i ≠ j)
H1: The means for each group are not all the same
∃ i,j: μi ≠ μj (i ≠ j)
Test decision:
We reject H0, for a given level of significance α, if:
Fcalc > F1-α(k-1, n-k)
Take the critical values from the tables in http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm
Note: k-1 is the numerator and n-k is the denominator (see previous slide)
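The SST = SSW + SSB decomposition and the F test above can be verified numerically on synthetic groups; scipy's built-in `f_oneway` is used as a cross-check:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
# Hypothetical effort per FP for three languages; the group means differ
groups = [rng.normal(mu, 5, size=60) for mu in (23, 16, 29)]

# Manual decomposition: SST = SSW + SSB
all_x = np.concatenate(groups)
grand_mean = all_x.mean()
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
sst = ((all_x - grand_mean) ** 2).sum()

n, k = len(all_x), len(groups)
f_calc = (ssb / (k - 1)) / (ssw / (n - k))
p = stats.f.sf(f_calc, k - 1, n - k)
print(f"SST = {sst:.1f}, SSB + SSW = {ssb + ssw:.1f}")  # identical up to rounding
print(f"F({k-1},{n-k}) = {f_calc:.2f}, p = {p:.2g}")

# Cross-check against scipy's built-in one-way ANOVA
f2, p2 = stats.f_oneway(*groups)
```

The hand-computed F statistic matches `f_oneway`, and with clearly separated group means H0 is rejected.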
One-Way ANOVA Which groups differ?
In addition to determining that differences exist among the means, you may want to know which means differ
There are two types of tests for comparing means:
a priori contrasts are tests set up before running the experiment
post hoc tests are run after the experiment has been conducted
Example (1/4)
Problem:
Is the effort per adjusted function point the same
across 4 well-known languages (Cobol, Visual
Basic, C++ and Java)?
Verifying assumptions:
Is the outcome variable (the effort) normally distributed?
From previous slides we have seen this is not true, but since
ANOVA is robust to departures from normality, we still use it …
Do the groups corresponding to each of the
programming languages have equal variances?
Example (2/4)
We need to recode
the programming
languages of interest
SPSS:
Transform
Recode into Different Variables
Test of Homogeneity of Variances

Effort per Adjusted FP
Levene Statistic   df1    df2    Sig.
     3.519          3     1090   .015
Example (3/4)
SPSS:
Analyze
Compare Means
One-Way ANOVA
Verifying another precondition:
With a confidence interval of 99% we
cannot reject the null hypothesis that
the variances are homogeneous
The plot gives us a qualitative
perspective of the phenomenon;
the mean effort seems to depend
on the language!
Descriptives
Effort per Adjusted FP

                 N     Mean      Std. Dev.    Std. Error   95% CI Lower   95% CI Upper    Min     Max
Cobol           509   23.5086    34.84305      1.54439       20.4744        26.5427        .24   424.87
Visual Basic    265   16.0208    23.46494      1.44144       13.1826        18.8590        .13   256.13
C++             116   28.8261    34.36263      3.19049       22.5064        35.1458        .99   211.77
Java            204   17.2658    29.09480      2.03704       13.2493        21.2823        .90   259.71
Total          1094   21.0945    31.57117       .95451       19.2217        22.9674        .13   424.87
Model (Fixed Effects)            31.32726       .94714       19.2361        22.9530
Model (Random Effects)                         2.85458       12.0100        30.1791   (Between-Component Variance = 22.57916)
Example (4/4)
ANOVA
Effort per Adjusted FP

                 Sum of Squares    df    Mean Square     F      Sig.
Between Groups      19712.581        3     6570.860    6.695    .000
Within Groups      1069723        1090      981.397
Total              1089435        1093
n (number of cases) = 1094
k (number of groups) = 4
The upper critical value of the F distribution can be found in a table. Notice
that the critical value for (k-1, n-k) = (3, 1090) is bounded above by the
critical value for (3, 100). For α = 1% we get an upper bound of 3.984. Since
Fcalc = 6.695 > 3.984, we reject the null hypothesis!
Conclusion:
The average effort per function point is
significantly dependent on the language
Factorial ANOVA (Multi-factorial ANalysis Of VAriance)
Factorial ANOVA – Applicability
This procedure is used to test if a given set of factors has a
significant effect on a given variable
Allows determining the effect of each factor
Allows assessing the interaction among factors (aka moderation)
This is a particular case of a multivariate regression
analysis methodology called “General Linear Model” (GLM)
In GLM, both balanced and unbalanced models can be tested. A
design is balanced if each cell (a treatment) in the model contains
the same number of cases
Factorial ANOVA – Design & Scales
Design
2+ factors, 2+ levels per factor, independent samples
If we only have 2 factors, this is called Two-way ANOVA
In Factorial ANOVA a treatment corresponds to a combination
(tuple) of factor levels such as (Java; Eclipse) if the factors are
programming language and development environment.
In Factorial ANOVA, a treatment is often called a model cell.
Scales
Factor (grouping) variable: categorical (recoded)
Outcome variable: numeric (absolute, interval or ratio)
Factorial ANOVA – Main and interaction effects
Consider that you have three factors F1, F2 and F3
Main effects
These are the effects on the outcome variable caused by each
factor alone, as we did with One-way ANOVA
These are represented by F1, F2, F3
Interaction effects
These are the cross-factor effects caused by the combined action
of all combinations of factors, which are:
F1*F2, F1*F3, F2*F3, F1*F2*F3
Overall effect representation in the GLM
I + F1 + F2 + F3 + F1*F2 + F1*F3 + F2*F3 + F1*F2*F3
Where I is an intercept term (similar to that used in linear regression)
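A minimal sketch of a balanced two-way (2×2) factorial ANOVA, computed from the sums of squares directly; the cell means are made up and deliberately include an interaction. In practice one would use SPSS's GLM (or a statistics library) rather than hand-rolled sums:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
# Balanced 2x2 design: factors A and B with r cases per cell.
# Cell (1, 1) is shifted upwards, so there is an A*B interaction.
r = 30
cell_means = {(0, 0): 15.0, (0, 1): 15.0,
              (1, 0): 15.0, (1, 1): 25.0}
data = {cell: rng.normal(mu, 4, size=r) for cell, mu in cell_means.items()}

grand = np.concatenate(list(data.values())).mean()
a_mean = {a: np.concatenate([data[(a, b)] for b in (0, 1)]).mean() for a in (0, 1)}
b_mean = {b: np.concatenate([data[(a, b)] for a in (0, 1)]).mean() for b in (0, 1)}

# Sums of squares for main effects, interaction and error (balanced case)
ss_a = 2 * r * sum((a_mean[a] - grand) ** 2 for a in (0, 1))
ss_b = 2 * r * sum((b_mean[b] - grand) ** 2 for b in (0, 1))
ss_ab = r * sum((data[(a, b)].mean() - a_mean[a] - b_mean[b] + grand) ** 2
                for a in (0, 1) for b in (0, 1))
ss_err = sum(((x - x.mean()) ** 2).sum() for x in data.values())

df_err = 4 * (r - 1)
results = {}
for name, ss, df in (("A", ss_a, 1), ("B", ss_b, 1), ("A*B", ss_ab, 1)):
    f = (ss / df) / (ss_err / df_err)
    results[name] = (f, stats.f.sf(f, df, df_err))
    print(f"{name}: F = {f:.2f}, p = {results[name][1]:.3g}")
```

Because only one cell is shifted, the A*B interaction term comes out significant, which is exactly the situation where the main effects should not be interpreted on their own (see the profile-plot slide below).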
Factorial ANOVA – Hypotheses being tested (one for each factor)
H0: The expected means (in the population) of the outcome variable for each group (treatment) are all the same
μ1 = ... = μk
H1: At least one of the expected means is different
∃ i,j: μi ≠ μj (i ≠ j)
Test decision:
We reject H0, for a given level of significance α, if:
Fcalc > F1-α(k-1, n-k)
Take the critical values from the tables in http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm
Note: k-1 is the numerator and n-k is the denominator (see a previous slide)
Factorial ANOVA – Hypotheses being tested (one for each interaction)
H0: There is no interaction among the factors
∀ i,j: (αβ)ij = 0 (i ≠ j)
H1: There is interaction between at least two factors
∃ i,j: (αβ)ij ≠ 0 (i ≠ j)
Test decision:
We reject H0, for a given level of significance α, if:
Fcalc > F1-α(k-1, n-k)
Take the critical values from the tables in http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm
Note: k-1 is the numerator and n-k is the denominator (see a previous slide)
Factorial ANOVA Assumptions
For increased test power, the populations from which each cell's
data were taken should be normal and have homogeneous
variances. However:
Factorial ANOVA is robust to departures from normality, although the data
should be symmetric
Regarding variance, there are alternatives for using Factorial ANOVA
when variance homogeneity is not assumed
To check assumptions, we can use homogeneity of variances
tests (e.g. Levene test) and spread-versus-level plots. We can
also examine residuals and residual plots.
Factorial ANOVA – Differences among specific treatments
The overall F statistic allows
testing that at least one group
corresponding to a given
treatment has a mean on the
outcome variable that is
different from the other groups
If an overall F test has shown
significance, we use post hoc
tests to evaluate differences
among specific means.
Some of those post hoc tests
are applicable when equal
variances are assumed and
others when they are not
Use the Scheffé or the Tamhane tests, depending on variance homogeneity, since those two tests are more conservative (safer) than others, which means that a larger difference between means is required for significance.
Factorial ANOVA Profile (interaction plots)
If the interaction effects are not significant, we should
consider each of the main effects separately, as we
did for one-way ANOVA
When interaction effects are significant (rejected
interaction effect null hypotheses), we do not consider
the corresponding main effects. Therefore, we should
center our attention on the study of interactions
Profile plots allow visualizing the interactions among the
factors!
Example (1/6)
Problem:
Is the effort per adjusted function
point dependent on the language,
development type and software
architecture?
We know already that the effort
per adjusted function point does
not have a Normal distribution
(see previous slides), but since
the Factorial ANOVA is robust to
non-normality we still use it.
Between-Subjects Factors

                              Value Label                      N
Language            1         Cobol                           508
                    2         Visual Basic                    264
                    3         C++                             116
                    4         Java                            204
Development type    0         New development                 445
                    1         Enhancement                     622
                    2         Re-development                   25
Architecture_       1         Stand-alone                     720
                    2         Client-server                   298
                    3         Multi-tier with web interface    74

Levene's Test of Equality of Error Variances(a)
Dependent Variable: Effort per Adjusted FP

   F      df1    df2    Sig.
 5.049     20    1071   .000

Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a. Design: Intercept + Language + DevType + Architecture_ + Language*DevType + Language*Architecture_ + DevType*Architecture_ + Language*DevType*Architecture_
Example (2/6)
SPSS:
Analyze
General Linear Model
Univariate
Verifying variance homogeneity:
With a confidence interval of 99% we
reject the null hypothesis that the
variances are homogeneous.
Notice the interaction terms
Tests of Between-Subjects Effects
Dependent Variable: Effort per Adjusted FP

Source                                Type III SS     df    Mean Square     F       Sig.   Partial Eta²   Noncent. Param.   Observed Power(a)
Corrected Model                        80088.097(b)    20     4004.405     3.877    .000      .068            77.532             1.000
Intercept                              37040.756        1    37040.756    35.859    .000      .032            35.859             1.000
Language                                 802.606        3      267.535      .259    .855      .001              .777              .100
DevType                                 7598.601        2     3799.300     3.678    .026      .007             7.356              .677
Architecture_                           8014.872        2     4007.436     3.880    .021      .007             7.759              .702
Language * DevType                       265.799        4       66.450      .064    .992      .000              .257              .063
Language * Architecture_                1294.776        3      431.592      .418    .740      .001             1.253              .134
DevType * Architecture_                 8740.139        3     2913.380     2.820    .038      .008             8.461              .680
Language * DevType * Architecture_      7952.667        3     2650.889     2.566    .053      .007             7.699              .634
Error                                1106304.428     1071     1032.964
Total                                1741086.067     1092
Corrected Total                      1186392.525     1091

a. Computed using alpha = .05
b. R Squared = .068 (Adjusted R Squared = .050)
Example (3/6)
With a confidence interval of 95%, the DevType
* Architecture interaction is significant.
Therefore, we should consider this interaction
effect instead of the main effects
When the test power 1-β is low (below 80%), as happens here, especially for all terms including the Language, we should be careful since the Type II error rate is high. Recall that β is the probability of incorrect H0 acceptance (incorrect H1 rejection) when H0 is false.
Conclusion:
The average effort per function point is significantly
dependent on the combined action of development
type and software architecture, although care should
be taken since the test power is limited
Multiple Comparisons (Tamhane)
Dependent Variable: Effort per Adjusted FP

(I) Architecture_   (J) Architecture_                Mean Diff. (I-J)   Std. Error   Sig.    95% CI Lower   95% CI Upper
Stand-alone         Client-server                         6.0732*         2.04680    .009       1.1751        10.9714
                    Multi-tier with web interface        19.0322*         1.37305    .000      15.7464        22.3179
Client-server       Stand-alone                          -6.0732*         2.04680    .009     -10.9714        -1.1751
                    Multi-tier with web interface        12.9589*         1.55147    .000       9.2343        16.6836
Multi-tier with     Stand-alone                         -19.0322*         1.37305    .000     -22.3179       -15.7464
web interface       Client-server                       -12.9589*         1.55147    .000     -16.6836        -9.2343

Based on observed means.
*. The mean difference is significant at the .05 level.
Multiple Comparisons (Tamhane)
Dependent Variable: Effort per Adjusted FP

(I) Development type   (J) Development type   Mean Diff. (I-J)   Std. Error   Sig.    95% CI Lower   95% CI Upper
New development        Enhancement               -10.4831*         1.82443    .000     -14.8474        -6.1188
                       Re-development             -6.6862          3.02944    .103     -14.3758         1.0034
Enhancement            New development            10.4831*         1.82443    .000       6.1188        14.8474
                       Re-development              3.7969          3.33127    .596      -4.4952        12.0891
Re-development         New development             6.6862          3.02944    .103      -1.0034        14.3758
                       Enhancement                -3.7969          3.33127    .596     -12.0891         4.4952

Based on observed means.
*. The mean difference is significant at the .05 level.
Example (4/6)
With a confidence interval of 95% we can only say that software enhancement requires, on average, an effort per adjusted FP around 10.5 hours larger than for new development. Your interpretation?
Scenario 1:
Main effects are important
Attention: this scenario is
not true in our case study!
With a confidence interval of 95% we can only say that there is an increasing order of magnitude in the average effort per adjusted FP from multi-tier with web interface to stand-alone. Your interpretation?
Example (5/6) – Significant interactions
The effect of the development type on the effort is partly moderated by the architecture (and vice-versa)
This moderation effect manifests itself in crossing lines.
Scenario 2:
Interaction effects are important
Attention: this scenario is the correct
one in our case study!
Example (6/6) – Non-significant interactions
When the interactions are not significant, the lines do not cross or only cross slightly
NON-PARAMETRIC TESTS
(Between Groups)
Binomial
(test of proportions)
Binomial test – Applicability
This test compares the proportion of occurrence of one of
the two possible values of a dichotomous variable, over the
total number of cases, against a hypothesized proportion
Design
1 factor, 1 treatment, independent samples
Scales
Factor (grouping) variable: categorical (dichotomous)
Outcome variable: N/A
Binomial test – The proportions to be tested
If px and py are the proportions of the two possible
values of the factor, then px + py = 1
We test the hypothesis that the expected proportions in
the population are of a given value (p0, 1-p0), as for
instance:
p0 = 25% -> px = 25%, py = 75%
p0 = 50% -> px = 50%, py = 50%
p0 = 60% -> px = 60%, py = 40%
Binomial test
H0: px = p0, py = 1 - p0 (the expected proportions in the population are the ones being tested)
H1: px ≠ p0, py ≠ 1 - p0 (the expected proportions in the population are significantly different from the tested ones)
Test decision:
If p ≤ α (significant Z statistic): reject H0 and accept H1
If p > α (not significant Z statistic): accept H0 and therefore reject H1
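The same decision rule can be cross-checked outside SPSS with SciPy's exact binomial test (`scipy.stats.binomtest`, available in SciPy ≥ 1.7). This is a sketch; the counts are taken from the CASE-tool example on the following slides:

```python
from scipy.stats import binomtest

# Counts from the slides' CASE-tool example: 1254 "No" out of 1900 projects
res_even = binomtest(k=1254, n=1900, p=0.5, alternative='two-sided')
print(res_even.pvalue)   # far below any usual alpha -> reject H0: px = 0.5

# One-sided variant mirroring the slides' second run: is the "No" proportion < 2/3?
res_twice = binomtest(k=1254, n=1900, p=2/3, alternative='less')
print(res_twice.pvalue)  # well above 0.05 -> cannot reject H0: px = 2/3
```

The two p-values reproduce (up to the Z approximation SPSS uses) the two SPSS tables shown in the example.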
Example (1/3)
Objective: assess if the proportions of CASE tool usage / non-usage are even
CASE tool usage is represented by a dichotomous variable that splits subjects into 2 samples (groups)
One group corresponds to the projects using CASE tools (label “Yes”) and the other to the projects that aren’t using them (label “No”)
SPSS:
Analyze
Nonparametric Tests
Binomial
Example (2/3)
Binomial Test

                            Category      N    Observed Prop.   Test Prop.   Asymp. Sig. (2-tailed)
Case tool usage  Group 1    No         1254         ,66            ,50             ,000(a)
                 Group 2    Yes         646         ,34
                 Total                 1900        1,00
a. Based on Z Approximation.
Conclusion:
There is a statistically significant difference
between the proportion of projects that
use CASE tools and those that don’t.
With a confidence level
greater than 99,99% we can
reject the null hypothesis
We can enter a test proportion for the first
group. The probability for the second
group will be 1 minus the specified
probability for the first group.
Binomial Test

                            Category      N    Observed Prop.   Test Prop.   Asymp. Sig. (1-tailed)
Case tool usage  Group 1    No         1254        ,660           ,666             ,297(a,b)
                 Group 2    Yes         646        ,340
                 Total                 1900       1,000
a. Alternative hypothesis states that the proportion of cases in the first group < ,666.
b. Based on Z Approximation.
Example (3/3)
Conclusion:
We accept that the proportion of projects
not using CASE tools is twice as large as that
of those that do use them!
Even with a confidence level
of 90% we cannot reject the
null hypothesis
Here we are testing if the proportion of
projects not using CASE tools is twice as
large as that of those using them.
Chi-Square
(test of proportions)
Chi-Square test – Applicability
This goodness-of-fit test compares the observed and
expected frequencies in each category to test that all
categories contain the same proportion of values
Can also test if each category contains a user-specified
proportion of values
It can be used to test if 2 or more independent samples
(groups) differ regarding a given factor
Chi-Square test – Design & Scales
Design
1 factor, 2 or more treatments, independent samples
Scales
Factor (grouping) variable: categorical
Outcome variable: N/A
Chi-square test – Hypotheses being tested
H0: the expected proportions of the groups (in the population) are similar
Groups do not differ significantly in size
The effect of the factor is negligible
H1: the proportions in the groups are different
Groups differ significantly in size
The effect of the factor is not negligible
Test decision:
If p ≤ α (significant χ² statistic): reject H0 and accept H1 (proportions are different)
If p > α (not significant χ² statistic): accept H0 and therefore reject H1 (proportions are similar)
Chi-square test – Preconditions
The Chi-square operates on a contingency table
Rows and columns represent the categories of the two variables
Each cell contains the number of observations for a given pair of values
(factor, outcome variable)
The Chi-Square preconditions are:
The sample must be large enough (n > 20)
All expected counts must be > 1
At least 80% of the expected counts must be > 5
Primary Programming Language * Case tool usage Crosstabulation
Count

                                Case tool usage
Primary Programming Language      No      Yes     Total
Cobol                            279      121      400
Visual Basic                     229       24      253
C++                               26       42       68
Java                              61       36       97
Total                            595      223      818
This is a contingency table where all the preconditions are met
Example (1/3)
Objective: Is the adoption of CASE tools dependent on the
programming language used?
If there is some sort of dependence, then the proportions in the groups will not be similar.
SPSS:
Analyze
Descriptive Statistics
Crosstabs
Example (2/3)
Example (3/3)
Conclusion:
The adoption of CASE tools depends
on the programming language
With a confidence level of 99% we can reject the null hypothesis
Case Processing Summary

                                            Cases
                                 Valid            Missing           Total
                               N    Percent     N      Percent    N      Percent
Primary Programming
Language * Case tool usage    639   22,5%     2201     77,5%     2840    100,0%
Primary Programming Language * Case tool usage Crosstabulation

                                                Case tool usage
Primary Programming Language                      No      Yes     Total
Cobol           Count                            222      106      328
                Expected Count                 243,3     84,7    328,0
Visual Basic    Count                            192       17      209
                Expected Count                 155,0     54,0    209,0
C++             Count                             11       28       39
                Expected Count                  28,9     10,1     39,0
Java            Count                             49       14       63
                Expected Count                  46,7     16,3     63,0
Total           Count                            474      165      639
                Expected Count                 474,0    165,0    639,0
Chi-Square Tests

                                Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square             84,822(a)   3          ,000
Likelihood Ratio               86,160      3          ,000
Linear-by-Linear Association     ,565      1          ,452
N of Valid Cases                  639
a. 0 cells (,0%) have expected count less than 5. The minimum expected count is 10,07.
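The crosstab above can be checked programmatically. A sketch with SciPy's `chi2_contingency`, fed with the observed counts of the valid cases; the expected counts, Pearson statistic and df should match the SPSS output up to rounding:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts (rows: Cobol, Visual Basic, C++, Java; columns: No, Yes)
observed = np.array([[222, 106],
                     [192,  17],
                     [ 11,  28],
                     [ 49,  14]])

stat, p, df, expected = chi2_contingency(observed)
print(f"Pearson chi-square = {stat:.3f}, df = {df}, p = {p:.4f}")
print(expected.round(1))  # matches the SPSS "Expected Count" rows

# Precondition check (see the preconditions slide): all expected counts > 1,
# and at least 80% of them > 5
assert expected.min() > 1 and (expected > 5).mean() >= 0.8
```

Since df = 3 here, no Yates continuity correction is applied (SciPy only applies it to 2x2 tables), so the statistic is the plain Pearson chi-square.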
Mann-Whitney test
(aka U-test)
Mann-Whitney test – Applicability
This is the nonparametric analog of the t-test
Instead of comparing the averages of the 2 samples, it
compares their central tendencies to detect differences
It can be used to test if 2 samples differ regarding
a given factor
Mann-Whitney test – Design & Scales
Design
1 factor, 2 treatments, independent samples, random samples (between groups)
Scales
Factor (grouping) variable: categorical
Outcome variable: at least ordinal scale
Assumptions
The two tested samples should be similar in shape
Mann-Whitney test – Hypotheses being tested
H0: The two populations from which the samples for the
two groups were taken, have similar central tendency
The groups are not affected by the factor variable
H1: The two samples do not have similar central
tendency
The groups are affected by the factor variable
U statistic
This statistic is used to test the above hypotheses
Example (1/3)
Objective: assess if the effort per development phase differs between two languages (Cobol and Java)
Each independent sample (group) corresponds to the projects (cases) that use the same programming language (PL)
Let c and j be indexes identifying Cobol and Java, respectively. Then, the underlying hypotheses for this test are the following:
H0: Ec ~ Ej
H1: ¬ (Ec ~ Ej)
If we reject the null hypothesis that the samples do not differ on the criterion (factor or grouping) variable (the PL), then we can sustain that the statistical distributions of the efforts per phase for each group of projects (corresponding to a PL) are different.
In other words, we would accept the alternative hypothesis that the PL has an influence on the effort per phase.
Notice that since we have several phases, we have to perform one test for each phase
Test Statistics(a)

                                  Effort   Effort    Effort   Effort   Effort   Effort
                                  Plan     Specify   Design   Build    Test     Implement
Most Extreme       Absolute        ,215     ,130      ,550     ,215     ,225     ,222
Differences        Positive        ,215     ,130      ,200     ,215     ,225     ,000
                   Negative       -,088    -,050     -,550    -,014    -,031    -,222
Kolmogorov-Smirnov Z               ,919     ,628      ,977    1,128    1,162     ,998
Asymp. Sig. (2-tailed)             ,367     ,825      ,295     ,157     ,134     ,272
a. Grouping Variable: Primary Programming Language
Example (2/3) – 1 is Cobol and 4 is Java
First we must verify, for each effort kind, whether the groups
corresponding to different languages have distributions with
similar shapes, using the Kolmogorov-Smirnov Z test.
H0 – The statistical distribution of the effort is similar in both
programming languages
H1 – The statistical distribution of the effort is significantly
different in both programming languages
SPSS:
Analyze
Nonparametric Tests
2 Independent Samples
We accept that the 2 groups have similar statistical distributions (with
a confidence level of 99%) for all efforts being tested
Example (3/3)

Ranks

                   Primary Programming
                   Language               N    Mean Rank   Sum of Ranks
Effort Plan        Cobol                 77      49,22        3790,00
                   Java                  24      56,71        1361,00
                   Total                101
Effort Specify     Cobol                108      68,27        7373,50
                   Java                  30      73,92        2217,50
                   Total                138
Effort Design      Cobol                  4       8,00          32,00
                   Java                  15      10,53         158,00
                   Total                 19
Effort Build       Cobol                142      85,42       12130,00
                   Java                  34     101,35        3446,00
                   Total                176
Effort Test        Cobol                160      93,84       15014,00
                   Java                  32     109,81        3514,00
                   Total                192
Effort Implement   Cobol                106      69,46        7363,00
                   Java                  25      51,32        1283,00
                   Total                131
Test Statistics(b)

                                  Effort    Effort     Effort    Effort      Effort      Effort
                                  Plan      Specify    Design    Build       Test        Implement
Mann-Whitney U                   787,000   1487,500    22,000    1977,000    2134,000     958,000
Wilcoxon W                      3790,000   7373,500    32,000   12130,000   15014,000    1283,000
Z                                 -1,093      -,684     -,800      -1,638      -1,485      -2,150
Asymp. Sig. (2-tailed)              ,274       ,494      ,424        ,102        ,138        ,032
Exact Sig. [2*(1-tailed Sig.)]                          ,469(a)
a. Not corrected for ties.
b. Grouping Variable: Primary Programming Language
The efforts to plan, specify and design do not differ significantly between Cobol and Java.
The efforts to build and implement may be considered significantly different with a confidence level of 90%.
SPSS:
Analyze
Nonparametric Tests
2 Independent Samples
H0 – the two samples have similar central tendency on the effort
H1 – the two samples do not have similar central tendency on the effort
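The two-step workflow above (a Kolmogorov-Smirnov check of the shapes, then the U test on the central tendencies) can be sketched outside SPSS with SciPy. The effort values below are made up for illustration:

```python
from scipy.stats import ks_2samp, mannwhitneyu

# Hypothetical build-phase efforts (person-hours) for two language groups
effort_cobol = [10, 12, 13, 15, 16, 18, 19, 21, 22, 24]
effort_java  = [18, 20, 21, 23, 24, 26, 27, 29, 30, 32]

# Shape check, mirroring the slides' use of the Kolmogorov-Smirnov test
# (with this strong synthetic shift K-S may also flag a difference;
#  on the slides' real data it did not)
ks_stat, ks_p = ks_2samp(effort_cobol, effort_java)

# Mann-Whitney U test on the central tendencies
u_stat, u_p = mannwhitneyu(effort_cobol, effort_java, alternative='two-sided')
print(f"U = {u_stat}, p = {u_p:.4f}")  # p below 0.05: central tendencies differ
```

As on the slides, one such test would be run per phase; here a single phase is shown.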
Kruskal-Wallis H test
(one-way analysis of variance)
Kruskal-Wallis test – Applicability
Is an extension of the Mann-Whitney U test
Is the nonparametric equivalent to the One-Way ANOVA
Assesses whether several independent samples have a
common localization parameter
Each sample is a group of subjects corresponding to the
application of a given treatment (a level of the factor variable)
Kruskal-Wallis test – Design & Scales
Design
1 factor with more than 2 treatments, independent, random samples (between groups)
Scales
Factor (grouping) variable: categorical
Outcome variable: at least ordinal scale
Assumptions
The tested samples should be similar in shape
Kruskal-Wallis test – Hypotheses being tested
H0: The distribution of the populations from where each
group was extracted have the same localization
parameter
Groups do not differ significantly
The effect of the factor is negligible
H1: At least one of the distributions has a localization
parameter that is smaller or greater than the others
At least one sample (group) differs significantly
The effect of the factor is not negligible
Kruskal-Wallis test – Test decision
The calculated H test statistic is distributed
approximately as chi-square
From a chi-square table with the given df (degrees of
freedom) and for a stipulated significance α (probability
of a Type I error) we obtain a critical value of the chi-
square to be compared with the calculated H statistic
Test decision:
We reject H0, for a given level of significance α, if:
Hcalc > χ²(1-α; df)
df 90% 95% 97.5% 99% 99.5% 99.9%
1 2.706 3.841 5.024 6.635 7.879 10.827
2 4.605 5.991 7.378 9.210 10.597 13.815
3 6.251 7.815 9.348 11.345 12.838 16.268
4 7.779 9.488 11.143 13.277 14.860 18.465
5 9.236 11.070 12.832 15.086 16.750 20.517
6 10.645 12.592 14.449 16.812 18.548 22.457
7 12.017 14.067 16.013 18.475 20.278 24.322
8 13.362 15.507 17.535 20.090 21.955 26.125
9 14.684 16.919 19.023 21.666 23.589 27.877
10 15.987 18.307 20.483 23.209 25.188 29.588
11 17.275 19.675 21.920 24.725 26.757 31.264
12 18.549 21.026 23.337 26.217 28.300 32.909
13 19.812 22.362 24.736 27.688 29.819 34.528
14 21.064 23.685 26.119 29.141 31.319 36.123
15 22.307 24.996 27.488 30.578 32.801 37.697
16 23.542 26.296 28.845 32.000 34.267 39.252
17 24.769 27.587 30.191 33.409 35.718 40.790
18 25.989 28.869 31.526 34.805 37.156 42.312
19 27.204 30.144 32.852 36.191 38.582 43.820
20 28.412 31.410 34.170 37.566 39.997 45.315
21 29.615 32.671 35.479 38.932 41.401 46.797
22 30.813 33.924 36.781 40.289 42.796 48.268
23 32.007 35.172 38.076 41.638 44.181 49.728
24 33.196 36.415 39.364 42.980 45.558 51.179
25 34.382 37.652 40.646 44.314 46.928 52.620
26 35.563 38.885 41.923 45.642 48.290 54.052
27 36.741 40.113 43.194 46.963 49.645 55.476
28 37.916 41.337 44.461 48.278 50.993 56.893
29 39.087 42.557 45.722 49.588 52.336 58.302
30 40.256 43.773 46.979 50.892 53.672 59.703
Chi-square
distribution table
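The critical values in the table above can also be computed on demand. A sketch using SciPy's chi-square quantile function:

```python
from scipy.stats import chi2

# Critical values chi2(1 - alpha; df), reproducing the row df = 3 of the table
for conf in (0.90, 0.95, 0.975, 0.99, 0.995, 0.999):
    print(f"{conf:.1%}: {chi2.ppf(conf, df=3):.3f}")

# Example test decision: with df = 3 and alpha = 0.05, reject H0 if Hcalc > 7.815
critical = chi2.ppf(0.95, df=3)
```

This replaces the table lookup in the Kruskal-Wallis test decision rule Hcalc > χ²(1-α; df).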
Example (1/2)
Objective: assess the impact of the adopted programming language (PL) on the normalized work effort (E)
Each independent sample (group) corresponds to the projects (cases) that use the same PL
Let i and j be two different PLs. Then, the underlying hypotheses for this test are the following:
H0: ∀ i,j : Ei ~ Ej
H1: ¬ ∀ i,j : Ei ~ Ej
If we reject the null hypothesis that the samples do not differ on the criterion (factor or grouping) variable (the PL), then we can sustain that the statistical distributions of E in the groups of projects corresponding to each of the PLs are different.
In other words, we would accept the alternative hypothesis that the PL has an influence on E.
Example (2/2)
SPSS:
Analyze
Nonparametric Tests
K Independent Samples
Ranks

              Language        N     Mean Rank
Effort per    Cobol          509      606,70
Adjusted FP   Visual Basic   265      454,29
              C++            116      646,58
              Java           204      464,53
              Total         1094

Test Statistics(a,b)

              Effort per Adjusted FP
Chi-Square           66,405
df                        3
Asymp. Sig.            ,000
a. Kruskal Wallis Test
b. Grouping Variable: Language
Even for a confidence level of 99.9% we have
χ²calc = 66.405 > χ²(3; 0.001) = 16.268, so we can reject
the null hypothesis. Therefore the effect of the
language on the effort per FP is not negligible
Ranks give us an indication of the relative
influence of each language on the effort per FP.
Notice that C++ is the language requiring the most
effort and Visual Basic the least!
Extract of Chi-Square table:
df 90% 95% 97.5% 99% 99.5% 99.9%
3 6.251 7.815 9.348 11.345 12.838 16.268
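The same test can be sketched outside SPSS with SciPy's `kruskal`. The effort-per-FP samples below are made up (the real dataset's values are not reproduced here), but they mirror the example's finding that C++ ranks highest and Visual Basic lowest:

```python
from scipy.stats import kruskal

# Hypothetical effort-per-adjusted-FP samples for four language groups
cobol        = [12, 14, 15, 17, 18, 20, 21, 23]
visual_basic = [ 6,  7,  8,  9, 10, 11, 12, 13]
cpp          = [14, 16, 18, 19, 21, 23, 25, 27]
java         = [ 7,  8,  9, 10, 11, 12, 13, 14]

h, p = kruskal(cobol, visual_basic, cpp, java)
print(f"H = {h:.3f}, df = 3, p = {p:.5f}")
# Compare H against chi2(1 - alpha; 3), e.g. 16.268 for alpha = 0.001
```

With groups this clearly separated, H exceeds the df = 3 critical value even at the 99.9% confidence level, matching the reasoning on this slide.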
Nonparametric Factorial ANOVA
Nonparametric Factorial ANOVA – Applicability
This procedure is used to test if a given set of factors has a
significant effect on a given variable
Allows determining the effect of each factor
Allows assessing the interaction among factors (aka moderation)
This procedure is similar to the (parametric) Factorial
ANOVA, but the H statistic is calculated based upon the
ranks of cases within each group
Remember that in the parametric version we used the F statistic
that is calculated upon the values of the outcome variable itself
Nonparametric Factorial ANOVA – How to perform?
The basic distribution of SPSS does not support the
Nonparametric Factorial ANOVA (not even the Two-Way)
There are several alternatives to perform this test:
Use another tool instead of SPSS (R has this procedure for free)
Get an advanced SPSS module that supports nonparametric
ANOVA (may be expensive)
Program this procedure in the SPSS syntax language (VB-like) or
find in the Internet someone who has done it already
Transform the outcome variable into a normally distributed one
and, if successful, use the parametric Factorial ANOVA
Use Excel to implement the test statistic H and then use a Chi-
Square table to make the test decision.
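As an illustration of the last two options, the rank-based statistic can be coded directly. This is a minimal sketch of the Scheirer-Ray-Hare procedure (a commonly used rank extension of Kruskal-Wallis to two factors); the factor names and data are made up, and ties are handled only through midranks, without the tie-correction factor:

```python
import numpy as np
from scipy.stats import rankdata, chi2

def scheirer_ray_hare(values, factor_a, factor_b):
    """Scheirer-Ray-Hare test: a rank-based (Kruskal-Wallis style) two-way ANOVA.
    Returns {effect: (H, df, p)} for factor A, factor B and the A*B interaction."""
    values = np.asarray(values, dtype=float)
    a, b = np.asarray(factor_a), np.asarray(factor_b)
    n = len(values)
    r = rankdata(values)               # rank all cases together
    ms_total = r.var(ddof=1)           # total mean square of the ranks
    grand = r.sum() ** 2 / n

    def ss(groups):
        # ANOVA-style sum of squares computed on the rank sums of each group
        return sum(r[groups == g].sum() ** 2 / (groups == g).sum()
                   for g in np.unique(groups)) - grand

    cells = np.array([f"{x}|{y}" for x, y in zip(a, b)])  # one label per model cell
    ss_a, ss_b = ss(a), ss(b)
    ss_ab = ss(cells) - ss_a - ss_b    # interaction = cells minus main effects
    df_a, df_b = len(np.unique(a)) - 1, len(np.unique(b)) - 1
    effects = {"A": (ss_a / ms_total, df_a),
               "B": (ss_b / ms_total, df_b),
               "A*B": (ss_ab / ms_total, df_a * df_b)}
    # Each H is compared against a chi-square with the effect's df
    return {k: (h, df, chi2.sf(h, df)) for k, (h, df) in effects.items()}

# Hypothetical balanced design: 2 languages x 2 CASE-usage levels, 4 cases per cell
language = ["Java"] * 8 + ["Cobol"] * 8
case_use = ["Yes", "No"] * 8
effort   = [30, 32, 31, 33, 35, 34, 36, 37,   # Java projects
            10, 12, 11, 13, 15, 14, 16, 17]   # Cobol projects
for effect, (h, df, p) in scheirer_ray_hare(effort, language, case_use).items():
    print(f"{effect}: H = {h:.3f}, df = {df}, p = {p:.4f}")
```

In this synthetic data only the language main effect is significant; the CASE-usage effect and the interaction are negligible, so only the "A" row would lead to rejecting H0.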
Nonparametric Factorial ANOVA – Design & Scales
Design
2+ factors, 2+ levels per factor, independent samples
If we only have 2 factors, this is called Nonparametric
two-way ANOVA
A treatment corresponds to a combination (tuple) of factor
levels such as (Java; Eclipse) if the factors are programming
language and development environment.
A treatment is often called a model cell.
Scales
Factor (grouping) variable: categorical (recoded)
Outcome variable: at least ordinal
Nonparametric Factorial ANOVA – Main and interaction effects
Consider that you have three factors F1, F2 and F3
Main effects
These are the effects on the outcome variable caused by each
factor alone, as we did with the Kruskal-Wallis test
These are represented by F1, F2, F3
Interaction effects
These are the cross-factor effects caused by the combined action
of the factors, for all combinations of factors, which are:
F1*F2, F1*F3, F2*F3, F1*F2*F3
Overall effect representation in the GLM
I + F1 + F2 + F3 + F1*F2 + F1*F3 + F2*F3 + F1*F2*F3
Where I is an intercept term (similar to that used in linear regression)
Nonparametric Factorial ANOVA – Main effects hypotheses (one for each factor)
H0: The distribution of the populations from where each
group was extracted have the same localization
parameter
Groups do not differ significantly
The effect of the factor is negligible
H1: At least one of the distributions has a localization
parameter that is smaller or greater than the others
At least one sample (group) differs significantly
The effect of the factor is not negligible
Nonparametric Factorial ANOVA – Main effects test decision (one for each factor)
As seen in the Kruskal-Wallis test, the calculated H test
statistic is distributed approximately as chi-square
Test decision:
We reject H0, for a given level of significance α, if:
Hcalc > χ²(1-α; df)
Get the critical value χ²(1-α; df) from the Chi-Square
table presented with the Kruskal-Wallis test
Nonparametric Factorial ANOVA – Interaction effects hypotheses (one for each interaction)
H0: There is no interaction among the factors
∀ i,j : Fi*Fj = 0 (i ≠ j)
H1: There is interaction between at least two factors
∃ i,j : Fi*Fj ≠ 0 (i ≠ j)
Nonparametric Factorial ANOVA – Interaction effects test decision (one for each interaction)
We reject H0, for a given level of significance α, if:
Hcalc > χ²(1-α; df)
Get the critical value χ²(1-α; df) from the Chi-Square
table presented with the Kruskal-Wallis test