
Page 1: Experimental Research Methodology Statistical Tests


Experimental Research Methodology

– Statistical Tests –

Fernando Brito e Abreu ([email protected])

Universidade Nova de Lisboa (http://www.unl.pt)

QUASAR Research Group (http://ctp.di.fct.unl.pt/QUASAR)

Data analysis taxonomy

Number of independent variables (aka factors):

One-factorial analysis – one independent variable

Multifactorial analysis – several independent variables

Number of dependent variables:

Univariate analysis – one dependent variable

Multivariate analysis – several dependent variables

Page 2

Data analysis methods

Proportion testing

Inference tests for categorical and continuous data:
  Parametric testing
  Non-parametric testing

Regression analysis:
  Linear regression modeling
  Nonlinear regression modeling (e.g. logistic regression analysis)

Multivariate data analysis:
  Factor analysis
  Cluster analysis
  Discriminant analysis

More on scale types

Categorical (discrete) data

Nominal scale

Ordinal scale

Absolute scale

Continuous data

Interval scale

Ratio scale

Page 3

Variable types

Independent variables

(aka factors or explanatory variables)

Are those that are manipulated in experimental research

Ex: Programming language, Development environment, Design size,

Practitioner expertise

Dependent variables

(aka outcome variables, measures, criteria)

Are those on which we want to measure the effect of the independent ones in experimental research

Ex: Effort to produce a given deliverable, Project schedule, Defects found in code inspection, System faults in operation (e.g. MTBF, MTTR)

Exercise: Identify the independent and dependent variables …

[Diagram: a Practitioner (Developer or Reviewer) has an Expertise; a Component, produced by a Developer, has Internal complexity and Interface complexity, with Defects detected during inspection; a Component Assembly has Assembly complexity, with Defects detected during component integration]

Page 4

Degree of freedom (df) of an estimate

Is the number of independent pieces of information on

which the estimate is based.

Is the number of values in the final calculation of a statistic that

are free to vary (df = number of different treatments – 1)

Why is the "Normal distribution" important?

… because in most cases it approximates well the function that represents the relationship between the "magnitude" and the "significance" of relations between two variables, depending on the sample size

The distribution of many test statistics is normal, or follows some form that can be derived from the normal distribution

Many frequently used statistical tests make the assumption that the data come from a normal distribution

Page 5

Distribution adherence

The distribution type conditions the kind of statistical tests we can apply

Therefore we want to know if a variable follows (adheres to) a given statistical distribution

Often we are interested in how well the distribution can be approximated by the normal distribution

We can take several, increasingly powerful, approaches:
  Use descriptive statistics
  Use plots
  Use distribution adherence tests

Testing distribution adherence – Most common normality tests

Kolmogorov-Smirnov one-sample test

Lilliefors test (correction upon the previous)

Shapiro-Wilk's W test

Royston test (correction upon the previous)

These tests are also known as goodness-of-fit tests, since they test whether the observations could reasonably have come from the specified distribution

Page 6

Testing distribution adherence – Kolmogorov-Smirnov one-sample test

The Kolmogorov-Smirnov one-sample test for normality is based on the maximum difference between the sample cumulative distribution and the hypothesized cumulative distribution.

H0: X ~ N(μ; σ)

H1: ¬ (X ~ N(μ; σ))

Notes: For many software programs, the probability values that are reported are based on those tabulated by Massey (1951); those probability values are valid when the mean and standard deviation of the normal distribution are known a priori and not estimated from the data

This test can also be used to verify goodness of fit for other distributions (e.g. uniform, Poisson, exponential)

Testing distribution adherence – Kolmogorov-Smirnov one-sample test

Interpretation:

If the Z statistic is significant, then the hypothesis that the respective distribution is normal (H0) should be rejected

"Significant" means that the statistical significance p of the result does not exceed the test significance α (the required level)

Example: Consider the test significance α = 0.05

Probability of Type I error = 0.05 * 100% = 5%

(probability of rejecting H0, the null hypothesis, when it is true)

If p ≤ α (significant Z statistic):

Reject H0 and accept H1 (sample cannot come from a Normal population)

If p > α (not significant Z statistic):

Accept H0 and therefore reject H1 (sample may come from a Normal population)
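Outside SPSS, the same test is available in SciPy; a minimal sketch on synthetic data (not the slides' dataset) follows. Note that the hypothesized mean and standard deviation are fixed a priori, which is exactly the case where Massey's tabulated probabilities are valid:

```python
# Hedged sketch: one-sample Kolmogorov-Smirnov normality test with SciPy,
# on synthetic data (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=200)
skewed_sample = rng.exponential(scale=1.0, size=200)

# Mean and std (0, 1) are known a priori, not estimated from the data.
d_norm, p_norm = stats.kstest(normal_sample, "norm", args=(0.0, 1.0))
d_skew, p_skew = stats.kstest(skewed_sample, "norm", args=(0.0, 1.0))

alpha = 0.05
print(f"normal sample: D={d_norm:.3f}, p={p_norm:.3f}")
print(f"skewed sample: D={d_skew:.3f}, p={p_skew:.3g}")
# Decision rule from the slide: reject H0 (normality) when p <= alpha.
```

The D statistic is the maximum distance between the empirical and hypothesized cumulative distributions; for the exponential sample it is large and p is essentially zero, so H0 is rejected.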

Page 7

One-Sample Kolmogorov-Smirnov Test (SPSS: Analyze > Nonparametric Tests > 1-Sample K-S…)

                           Functional Size   Normalised Work Effort
N                                     3310                     4180
Normal Parameters a,b:
  Mean                              444.17                  5951.71
  Std. Deviation                   926.623                20567.461
Most Extreme Differences:
  Absolute                            .317                     .386
  Positive                            .262                     .302
  Negative                           -.317                    -.386
Kolmogorov-Smirnov Z                18.238                   24.970
Asymp. Sig. (2-tailed)                .000                     .000

a. Test distribution is Normal.  b. Calculated from data.

Example:

Even for a test significance α = 0.01 (99% confidence interval), since p = 0.000 ≤ α (significant Z statistic):

We reject H0 and accept H1 (neither Size nor Effort can come from a Normal population)

Testing distribution adherence – Lilliefors test

This test is basically a correction to the Kolmogorov-Smirnov test, applicable when the parameters of the hypothesized normal distribution are estimated from the sample data

Interpretation:

If the Z statistic is significant, then the hypothesis that the respective distribution is normal should be rejected (same as for the KS test)

Notes:

In a Kolmogorov-Smirnov test for normality when the mean and standard deviation of the hypothesized normal distribution are not known (i.e., they are estimated from the sample data), the probability values tabulated by Massey (1951) are not valid. In that case, the test for normality involves a complex conditional hypothesis ("how likely is it to obtain a D statistic of this magnitude or greater, contingent upon the mean and standard deviation computed from the data"), and the Lilliefors probabilities should be used instead (Lilliefors, 1967)

Page 8

Testing distribution adherence – Shapiro-Wilk's W test

This test is the preferred test of normality because of its good power properties as compared to a wide range of alternative tests

Interpretation:

If the W statistic is significant (i.e. p ≤ α), then the hypothesis that the respective distribution is normal should be rejected

Notes:

Some software programs implement an extension to the test described by Royston (1982), which allows it to be applied to large samples (with up to 5000 observations)
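As a runnable sketch of the W test, here is SciPy's implementation applied to synthetic data (a W statistic close to 1 indicates a good fit to the normal distribution):

```python
# Hedged sketch: Shapiro-Wilk W test with SciPy, on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=10.0, scale=2.0, size=50)
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=50)

w_norm, p_norm = stats.shapiro(normal_sample)
w_skew, p_skew = stats.shapiro(skewed_sample)
print(f"normal sample: W={w_norm:.3f}, p={p_norm:.3f}")
print(f"skewed sample: W={w_skew:.3f}, p={p_skew:.3g}")
# Decision rule: reject H0 (normality) when p <= alpha.
```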

Statistical significance (p-value) of a result

The p-value represents the probability of error involved in accepting our observed result as valid, that is, as "representative" of the population

A p-value of 5% (i.e. 1/20) indicates that there is a 5% probability that the relation between the variables found in our sample is a "fluke" (stroke of luck)

For adherence tests, the p-value is the probability that the observed difference between the sample cumulative distribution and the hypothesized cumulative distribution occurred by pure chance ("luck of the draw"); in other words, that in the population from which the sample was drawn, no such difference exists

Page 9

Common p-values

(conventions in many research areas)

Borderline statistically significant: p-value = 5% (1/20)

Statistically significant: p-value = 1% (1/100)

Highly statistically significant: p-value = 0.5% (1/200) or even 0.1% (1/1000)

Hypothesis testing

Suppose that a CIO is interested in showing that in his software house the projects have an average defect density (ADD) below 5 [KLOC⁻¹]. This question, in statistical terms: "Is ADD < 5?"

STEP 1: State as a "statistical null hypothesis" (hypothesis H0) something that is the logical opposite of what you believe. H0: ADD ≥ 5

STEP 2: Collect data (build a sample)

STEP 3: Using statistical theory, show from the data that it is likely H0 is false, and should be rejected. By rejecting H0, you support what you actually believe.

This kind of situation, which is typical in many fields of research, is called "Reject-Support testing" (RS testing), because rejecting the null hypothesis supports the experimenter's theory.
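The CIO's reject-support test can be sketched as a one-sided one-sample t test; the project data below are made up for illustration, and `alternative="less"` encodes H1: ADD < 5:

```python
# Hedged sketch of the reject-support test above, with hypothetical data.
import numpy as np
from scipy import stats

# Hypothetical average defect densities (defects per KLOC) of 12 projects.
add = np.array([3.1, 4.2, 2.8, 4.9, 3.5, 4.1, 2.2, 3.9, 4.4, 3.0, 3.7, 4.6])

# H0: mean ADD >= 5 (the logical opposite of what the CIO believes)
# H1: mean ADD < 5  (rejecting H0 supports the CIO's claim)
t_stat, p_value = stats.ttest_1samp(add, popmean=5.0, alternative="less")
print(f"t = {t_stat:.3f}, one-sided p = {p_value:.4f}")
if p_value <= 0.05:
    print("Reject H0: the data support ADD < 5")
```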

Page 10

Hypothesis testing

Two kinds of errors:

α – Type I error rate, must be kept at or below .05

β – Type II error rate, must be kept low as well (the conventions are much more rigid with respect to α than with respect to β)

The "statistical power", (1-β), must be kept high

Ideally, power should be at least .80 to detect a reasonable departure from the null hypothesis

                          State of the World
                   H0 is true               H1 is true
Decision
  Accept H0        Correct H0 acceptance    Type II error (β)
                   (probability 1-α)
  Reject H0        Type I error (α)         Correct H0 rejection
                                            (probability 1-β)

The null hypothesis is either true or false

The statistical decision should be set up so that no "ties" occur

The null hypothesis is either rejected or not rejected

Hypothesis testing (expanded)

                          STATE OF THE WORLD
DECISION                  H0 is true / H1 is false        H0 is false / H1 is true

Accept H0 / Reject H1     OK: correct H0 acceptance,      Type II error: incorrect H0
                          correct H1 rejection            acceptance, incorrect H1
                          (probability = 1 - α)           rejection (probability = β)

Reject H0 / Accept H1     Type I error: incorrect H0      OK: correct H0 rejection,
                          rejection, incorrect H1         correct H1 acceptance
                          acceptance (probability = α)    (probability = 1 - β)
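Statistical power can be made concrete with a small Monte Carlo sketch: simulate many experiments in which H0 is false, and count how often it is correctly rejected. The effect size and sample size below are illustrative assumptions:

```python
# Hedged sketch: estimating power (1 - beta) of a two-sample t test by simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, n, delta, reps = 0.05, 20, 1.0, 2000  # true mean difference = 1 sd

rejections = 0
for _ in range(reps):
    group_a = rng.normal(0.0, 1.0, n)
    group_b = rng.normal(delta, 1.0, n)
    _, p = stats.ttest_ind(group_a, group_b)
    rejections += (p <= alpha)

power = rejections / reps
print(f"estimated power = {power:.2f}")  # theory predicts roughly 0.87 here
```

With these settings the estimated power lands near the textbook value of about 0.87, comfortably above the .80 convention mentioned above.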

Page 11

Statistical Tests

Parametric tests

Provide stronger validity than their non-parametric counterparts

Their statistical power is greater

Non-parametric tests

Weaker validity than their parametric counterparts

Their statistical power is smaller

Statistical Tests for Scales

The choice depends on the measurement scale of the variable under consideration (Nominal, Ordinal, Interval, Ratio):

Nominal or Ordinal scale: non-parametric methods

Interval or Ratio scale: parametric methods if the variable follows a Normal distribution; non-parametric methods otherwise

Page 12

Parametric tests (between groups)

Name: t-Student (one sample)
  Factors / Treatments: NA
  Outcome scale: numeric (absolute, interval or ratio)
  Null hypothesis: the mean of a variable is equal to a specified constant

Name: t-Student (2 independent samples)
  Factors / Treatments: 1 / 2
  Outcome scale: numeric (absolute, interval or ratio)
  Null hypothesis: the means of a variable on each group (treatment) are the same

Name: One-Way ANOVA
  Factors / Treatments: 1 / 2+
  Outcome scale: numeric (absolute, interval or ratio)
  Null hypothesis: the means of a variable on each group (treatment) are the same

Name: Factorial ANOVA
  Factors / Treatments: 2+ / 2+
  Outcome scale: numeric (absolute, interval or ratio)
  Null hypotheses: i) the means of a variable on each group (treatment) are the same; ii) there is no interaction among the factors

Nonparametric tests (between groups)

Name: Binomial test (test of proportions)
  Factors / Treatments: 1 / 2
  Outcome scale: NA
  Null hypothesis: the expected proportions are the ones being tested

Name: Chi-Square (test of proportions)
  Factors / Treatments: 1 / 2+
  Outcome scale: NA
  Null hypothesis: the expected proportions in the groups are similar

Name: Mann-Whitney test (aka U-test)
  Factors / Treatments: 1 / 2
  Outcome scale: at least ordinal scale
  Null hypothesis: the two groups have similar central tendency

Name: Kruskal-Wallis test (aka H-test)
  Factors / Treatments: 1 / 2+
  Outcome scale: at least ordinal scale
  Null hypothesis: the several groups have a similar localization parameter

Name: Nonparametric Factorial ANOVA
  Factors / Treatments: 2+ / 2+
  Outcome scale: at least ordinal scale
  Null hypotheses: i) the several groups have a similar localization parameter; ii) there is no interaction among the factors
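The two rank-based tests in the table can be sketched with SciPy on synthetic groups (one group is deliberately shifted, so both tests should reject):

```python
# Hedged sketch: Mann-Whitney U and Kruskal-Wallis H tests on synthetic groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, 40)
group_b = rng.normal(1.5, 1.0, 40)  # clearly shifted central tendency
group_c = rng.normal(0.1, 1.0, 40)

u_stat, p_u = stats.mannwhitneyu(group_a, group_b)      # 1 factor / 2 treatments
h_stat, p_h = stats.kruskal(group_a, group_b, group_c)  # 1 factor / 2+ treatments
print(f"Mann-Whitney: U={u_stat:.1f}, p={p_u:.3g}")
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_h:.3g}")
# Small p-values reject H0 that the groups share the same central tendency.
```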

Page 13

PARAMETRIC TESTS

T-Student test (One sample)

Page 14

One sample T-Student test – Applicability

This procedure tests whether the mean of a quantitative variable differs from a hypothesized test value

The test value is a specified constant

Design

N/A

Scales

Factor (grouping) variable: none

Outcome variable: numeric (absolute, interval or ratio)

One sample T-Student test – Assumptions

This test assumes that the data from the outcome variable are normally distributed; however, it is fairly robust to departures from normality.

Page 15

One sample T-Student test – Hypotheses being tested

H0: μ = k

The mean of the variable does not differ significantly from a specified value k

H1: μ ≠ k

The mean of the variable differs significantly from the specified value k

This test uses the T statistic, which has a t-Student distribution with (n-1) degrees of freedom

One sample T-Student test – Test decision

For n ≤ 30:

We reject H0, for a given level of significance α, if:

|Tcalc| > t_{1-α/2}(n-1)

The critical values can be obtained from a t-Student table.

For n > 30: the t-Student distribution becomes ~ N(0;1)

Therefore, we can then use the test significance:

If p ≤ α (significant statistic):

Reject H0 and accept H1 (the mean differs from k)

If p > α (not significant statistic):

Accept H0 and therefore reject H1 (the mean does not differ from k)
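The decision rule can be sketched with SciPy on made-up effort-per-FP values (not the slides' dataset); the two test values anticipate the kind of comparison done in the example that follows:

```python
# Hedged sketch: one-sample t test against two candidate test values,
# using hypothetical effort-per-FP values.
import numpy as np
from scipy import stats

effort = np.array([14.2, 16.8, 15.5, 17.1, 15.9, 16.2, 14.8, 16.5, 15.1, 16.0])

for k in (15.0, 16.0):
    t_stat, p = stats.ttest_1samp(effort, popmean=k)
    decision = "reject H0" if p <= 0.05 else "do not reject H0"
    print(f"test value {k}: t={t_stat:.3f}, p={p:.3f} -> {decision}")
```

With these illustrative numbers the sample mean is about 15.8, so H0 is rejected for k = 15 but not for k = 16.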

Page 16

Example (1/3)

Problem:

Is the population mean effort per adjusted function point equal to 15 or 16 man·hours?

First we have to compute that effort …

SPSS: Transform > Compute Variable

One-Sample Kolmogorov-Smirnov Test (SPSS: Analyze > Nonparametric Tests > 1 Sample K-S)

                           Adjusted Function Points
N                                              2839
Normal Parameters a,b:
  Mean                                       428.94
  Std. Deviation                            829.349
Most Extreme Differences:
  Absolute                                     .304
  Positive                                     .256
  Negative                                    -.304
Kolmogorov-Smirnov Z                         16.186
Asymp. Sig. (2-tailed)                         .000

a. Test distribution is Normal.  b. Calculated from data.

Example (2/3)

Assumption: is the effort per adjusted function point Normally distributed?

The effort is not Normally distributed, but since this test is robust to non-Normal data, we will still use it!

Page 17

Example (3/3)

SPSS: Analyze > Compare Means > One-Sample T-Test…

One-Sample Statistics
                           N      Mean   Std. Deviation   Std. Error Mean
Effort per Adjusted FP   2839   16.5820        25.00000            .46920

One-Sample Test (Test Value = 15)
                            t     df   Sig. (2-tailed)   Mean Difference   95% CI Lower   95% CI Upper
Effort per Adjusted FP  3.372   2838              .001           1.58204          .6620         2.5020

One-Sample Test (Test Value = 16)
                            t     df   Sig. (2-tailed)   Mean Difference   95% CI Lower   95% CI Upper
Effort per Adjusted FP  1.240   2838              .215            .58204         -.3380         1.5020

For test value 15: H0 is rejected. The population mean value of the variable is not 15!

For test value 16: H0 cannot be rejected, even with a 90% confidence interval. The population mean value of the variable may well be 16!

Conclusion:

The expected value of the population mean for FP countings using the IFPUG rules is 16.

T-Student test (2 samples)

Page 18

Two samples T-Student test – Applicability

This test allows inferring the equality of the means in the populations from two samples (groups)

Design

1 factor, 2 treatments, independent samples

Scales

Factor (grouping) variable: categorical, or a cut-point defined upon a numeric variable (e.g. setting a cut point on team size of 10 persons allows splitting projects into two groups, according to that variable)

Outcome variable: numeric (absolute, interval or ratio)

Two samples T-Student test – Assumptions

The subjects should be randomly assigned to two groups, so that any difference in response is due to the treatments and not to other factors.

This test assumes that the data from the outcome variable are normally distributed; however, it is fairly robust to departures from normality.

This test uses different statistics depending on the outcome variable having homogeneous or non-homogeneous variances on the two groups

This homogeneity can be assessed with the Levene test

Page 19

Two samples T-Student test – Hypotheses being tested

H0: μA = μB

The means of the variable on each group (treatment) are the same

H1: μA ≠ μB

The means of the variable on each group (treatment) are not the same

Two samples T-Student test – Test decision

For n ≤ 30:

We reject H0, for a given level of significance α, if:

|Tcalc| > t_{1-α/2}(nA+nB-2)

The critical values can be obtained from a t-Student table.

For n > 30: the t-Student distribution becomes ~ N(0;1)

Therefore, we can then use the test significance:

If p ≤ α (significant statistic):

Reject H0 and accept H1 (means are different)

If p > α (not significant statistic):

Accept H0 and therefore reject H1 (means are similar)
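The full procedure (Levene's test first, then the matching t statistic) can be sketched with SciPy; the group sizes and moments below only loosely mimic the IFPUG example that follows, and are synthetic:

```python
# Hedged sketch: Levene's test chooses between the equal-variance and
# Welch variants of the two-sample t test. Data are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_1 = rng.normal(17.9, 28.5, 300)  # loosely mimics the IFPUG group
group_0 = rng.normal(13.8, 14.3, 50)   # loosely mimics the non-IFPUG group

_, p_levene = stats.levene(group_1, group_0)
equal_var = p_levene > 0.05            # homogeneity of variances not rejected?
t_stat, p_t = stats.ttest_ind(group_1, group_0, equal_var=equal_var)
print(f"Levene p={p_levene:.3f} -> equal_var={bool(equal_var)}")
print(f"t={t_stat:.3f}, p={p_t:.3f}")
```

Passing `equal_var=False` makes SciPy use the Welch statistic, which is SPSS's "equal variances not assumed" row.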

Page 20

Example (1/3)

Problem:

Is the mean effort per adjusted Function Point the same when using IFPUG counting or any other counting (e.g. COSMIC, FiSMA, Feature Points)?

First we have to create a new factor variable (isIFPUG): IFPUG projects will be coded "1" and non-IFPUG "0"

SPSS: Transform > Compute Variable

Example (2/3)

SPSS: Analyze > Compare Means > Independent Samples T-Test

Group Statistics (Effort per Adjusted FP)
Is FP counting IFPUG?      N      Mean   Std. Deviation   Std. Error Mean
1                       2839   17.9005        28.46928            .53431
0                        148   13.7710        14.25176           1.17149

Page 21

Independent Samples Test (Effort per Adjusted FP)

Levene's Test for Equality of Variances: F = 4.960, Sig. = .026

t-test for Equality of Means
                               t        df   Sig. (2-tailed)   Mean Diff.   Std. Error Diff.   95% CI Lower   95% CI Upper
Equal variances assumed      1.753     2985             .080      4.12954            2.35567        -.48937        8.74845
Equal variances not assumed  3.207  214.040             .002      4.12954            1.28758        1.59157        6.66751

Example (3/3)

For a confidence interval of 95% (equal variances not assumed), we can say that the means of the two groups differ significantly!

For a confidence interval of 99% (equal variances assumed), we cannot say that the means of the two groups differ significantly!

Conclusion:

The FP counting rules other than the IFPUG ones do not seem to differ significantly from the latter.

Setting the value of the confidence interval can change result interpretation in borderline situations!

For a confidence interval of 95% (p < α), H0 of Levene's test can be rejected, therefore sample variances cannot be considered homogeneous.

For a confidence interval of 99% (p > α), H0 cannot be rejected, therefore sample variances can be considered homogeneous.

One-Way ANOVA (One-factorial ANalysis Of VAriance)

Page 22

One-Way ANOVA – Applicability

This procedure is used to test the hypothesis that the means among several groups (determined by a factor variable) are equal. Therefore, it allows testing if there is variance on the outcome variable that is due to the factor. This is an extension of the two-sample t test.

Design

1 factor, 2+ treatments, independent samples

Scales

Factor (grouping) variable: categorical (recoded into numeric)

Outcome variable: numeric (absolute, interval or ratio)

One-Way ANOVA – Assumptions

Each group is an independent random sample from a normal population. One-Way ANOVA is robust to departures from normality, although the data should be symmetric.

The groups should come from populations with equal (homogeneous) variances. To test this assumption, use Levene's homogeneity-of-variance test.

Page 23

One-Way ANOVA – The groups

Let us consider that we have k groups (each group is a sample), each one corresponding to a given treatment (factor level):

sample 1 with n1 elements: X1 = {X11, X21, …, Xn11}

…

sample k with nk elements: Xk = {X1k, X2k, …, Xnkk}

where Xij is the value observed for subject i, belonging to sample j

One-Way ANOVA – Calculating the variance

Let SST be the sum of the squares of the deviations of observed values around the global mean X̄:

SST = Σ_{j=1..k} Σ_{i=1..nj} (Xij - X̄)²

Let SSW be the sum of the squares of the deviations within groups (around each group mean X̄j):

SSW = Σ_{j=1..k} Σ_{i=1..nj} (Xij - X̄j)²

Let SSB be the sum of the squares of the deviations between groups:

SSB = Σ_{j=1..k} nj (X̄j - X̄)²

where SST = SSW + SSB
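The decomposition SST = SSW + SSB is an algebraic identity, which can be checked numerically on a toy example (any group values work):

```python
# Numeric check of SST = SSW + SSB for k = 3 small groups.
import numpy as np

groups = [np.array([2.0, 3.0, 4.0]),
          np.array([5.0, 6.0, 7.0, 8.0]),
          np.array([1.0, 2.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

sst = ((all_values - grand_mean) ** 2).sum()
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

print(f"SST = {sst:.4f}, SSW + SSB = {ssw + ssb:.4f}")  # identical
```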

Page 24

One-Way ANOVA – The T statistic

The ANOVA compares the sum of the squares of the deviations between groups (difference between groups) with the sum of the squares within groups.

The null hypothesis is tested using the following test statistic:

T = [SSB / (k-1)] / [SSW / (n-k)]

where n = number of cases and k = number of groups.

Under the null hypothesis, the T statistic follows an F (Snedecor) distribution with (k-1, n-k) degrees of freedom, i.e.,

T ~ F(k-1, n-k)

One-Way ANOVA – Hypotheses being tested

H0: The means of the outcome variable for each group (treatment) are all the same

∀ i, j: μi = μj (i ≠ j)

H1: The means for each group are not all the same

∃ i, j: μi ≠ μj (i ≠ j)

Test decision:

We reject H0, for a given level of significance α, if:

Fcalc > F_{1-α}(k-1, n-k)

Take the critical values from the tables in http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm

Note: k-1 is the numerator and n-k is the denominator (see previous slide)
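As a sketch with SciPy's `f_oneway` on synthetic groups (group means and sd are illustrative), we can also check that the p-value decision coincides with the critical-value rule above:

```python
# Hedged sketch: one-way ANOVA with SciPy, on synthetic groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
group_1 = rng.normal(23.5, 5.0, 30)  # illustrative group means and sd
group_2 = rng.normal(16.0, 5.0, 30)
group_3 = rng.normal(17.3, 5.0, 30)

f_stat, p_value = stats.f_oneway(group_1, group_2, group_3)

# Equivalent critical-value rule: reject H0 if Fcalc > F_{1-alpha}(k-1, n-k).
k, n, alpha = 3, 90, 0.05
f_crit = stats.f.ppf(1 - alpha, k - 1, n - k)
print(f"F={f_stat:.2f}, p={p_value:.3g}, F_crit={f_crit:.3f}")
print((f_stat > f_crit) == (p_value < alpha))  # the two rules agree
```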

Page 25

One-Way ANOVA – Which groups differ?

In addition to determining that differences exist among the means, you may want to know which means differ

There are two types of tests for comparing means:

a priori contrasts are tests set up before running the experiment

post hoc tests are run after the experiment has been conducted
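As a runnable illustration of the post hoc idea, here is Tukey's HSD, a widely used post hoc test (different from the Scheffé and Tamhane tests mentioned later in the deck), assuming a SciPy version that provides `stats.tukey_hsd` (added in SciPy 1.8):

```python
# Hedged sketch: Tukey HSD post hoc comparisons on synthetic groups
# (assumes scipy.stats.tukey_hsd is available, SciPy >= 1.8).
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
g1 = rng.normal(10.0, 2.0, 25)
g2 = rng.normal(10.2, 2.0, 25)   # close to g1
g3 = rng.normal(15.0, 2.0, 25)   # clearly different

result = stats.tukey_hsd(g1, g2, g3)
print(result.pvalue.round(3))  # pairwise p-value matrix (3 x 3)
```

The pairwise p-value matrix should flag the g1 vs g3 and g2 vs g3 comparisons as significant, while g1 vs g2 is not.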

Example (1/4)

Problem:

Is the effort per adjusted function point the same across 4 well-known languages (Cobol, Visual Basic, C++ and Java)?

Verifying assumptions:

Is the outcome variable (the effort) normally distributed?

From previous slides we have seen this is not true, but since ANOVA is robust to departures from normality, we still use it …

Do the groups corresponding to each of the programming languages have equal variances?

Page 26

Example (2/4)

We need to recode the programming languages of interest

SPSS: Transform > Recode into Different Variables

Test of Homogeneity of Variances (Effort per Adjusted FP)
Levene Statistic   df1    df2   Sig.
         3.519       3   1090   .015

Example (3/4)

SPSS: Analyze > Compare Means > One-Way ANOVA

Verifying another precondition:

With a confidence interval of 99% we cannot reject the null hypothesis that the variances are homogeneous

The plot gives us a qualitative perspective of the phenomenon; the mean effort seems to depend on the language!

Page 27

Descriptives (Effort per Adjusted FP)
                  N      Mean   Std. Deviation   Std. Error   95% CI Lower   95% CI Upper   Minimum   Maximum
Cobol           509   23.5086         34.84305      1.54439        20.4744        26.5427       .24    424.87
Visual Basic    265   16.0208         23.46494      1.44144        13.1826        18.8590       .13    256.13
C++             116   28.8261         34.36263      3.19049        22.5064        35.1458       .99    211.77
Java            204   17.2658         29.09480      2.03704        13.2493        21.2823       .90    259.71
Total          1094   21.0945         31.57117       .95451        19.2217        22.9674       .13    424.87

Model: Fixed Effects: Std. Deviation 31.32726, Std. Error .94714, 95% CI [19.2361, 22.9530]; Random Effects: Std. Error 2.85458, 95% CI [12.0100, 30.1791], Between-Component Variance 22.57916

Example (4/4)

ANOVA (Effort per Adjusted FP)
                 Sum of Squares     df   Mean Square       F    Sig.
Between Groups        19712.581      3      6570.860   6.695    .000
Within Groups           1069723   1090       981.397
Total                   1089435   1093

n (number of cases) = 1094

k (number of groups) = 4

The upper critical value of the F distribution can be found in a table. Notice that the critical value for (k-1, n-k) = (3, 1090) can be bounded above by the critical value for (3, 100). For α = 1% we get an upper bound of 3.984. Since Fcalc = 6.695 > 3.984 we reject the null hypothesis!

Conclusion:

The average effort per function point is significantly dependent on the language

Factorial ANOVA (Multi-factorial ANalysis Of VAriance)

Page 28

Factorial ANOVA – Applicability

This procedure is used to test if a given set of factors has a significant effect on a given variable

Allows determining the effect of each factor

Allows assessing the interaction among factors (aka moderation)

This is a particular case of a multivariate regression analysis methodology called "General Linear Model" (GLM)

In GLM, both balanced and unbalanced models can be tested. A design is balanced if each cell (a treatment) in the model contains the same number of cases

Factorial ANOVA – Design & Scales

Design

2+ factors, 2+ levels per factor, independent samples

If we only have 2 factors, this is called Two-way ANOVA

In Factorial ANOVA a treatment corresponds to a combination (tuple) of factor levels, such as (Java; Eclipse) if the factors are programming language and development environment.

In Factorial ANOVA, a treatment is often called a model cell.

Scales

Factor (grouping) variable: categorical (recoded)

Outcome variable: numeric (absolute, interval or ratio)

Page 29

Factorial ANOVA – Main and interaction effects

Consider that you have three factors F1, F2 and F3

Main effects

These are the effects on the outcome variable caused by each factor alone, as we did with One-way ANOVA

These are represented by F1, F2, F3

Interaction effects

These are the cross-factor effects caused by the combined action of all combinations of factors, which are:

F1*F2, F1*F3, F2*F3, F1*F2*F3

Overall effect representation in the GLM:

I + F1 + F2 + F3 + F1*F2 + F1*F3 + F2*F3 + F1*F2*F3

where I is an intercept term (similar to that used in linear regression)
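To make the interaction term concrete, here is a hand-rolled balanced two-way ANOVA sketch; the factor names, effect sizes and cell count are all made up, and the F test for the F1*F2 interaction is computed directly from the sums of squares:

```python
# Hedged sketch: balanced 2x2 factorial ANOVA, interaction term tested by hand.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 10  # replicates per cell (balanced design)
a_eff = [0.0, 2.0]                   # main effect of F1 (illustrative)
b_eff = [0.0, 1.0]                   # main effect of F2 (illustrative)
ab_eff = [[0.0, 0.0], [0.0, 3.0]]    # F1*F2 interaction placed in one cell

data = np.empty((2, 2, n))
for i in range(2):
    for j in range(2):
        data[i, j] = rng.normal(a_eff[i] + b_eff[j] + ab_eff[i][j], 1.0, n)

grand = data.mean()
a_means = data.mean(axis=(1, 2))
b_means = data.mean(axis=(0, 2))
cell_means = data.mean(axis=2)

ss_a = 2 * n * ((a_means - grand) ** 2).sum()
ss_b = 2 * n * ((b_means - grand) ** 2).sum()
ss_ab = n * ((cell_means - a_means[:, None] - b_means[None, :] + grand) ** 2).sum()
ss_err = ((data - cell_means[:, :, None]) ** 2).sum()

df_ab, df_err = 1, 4 * n - 4
f_ab = (ss_ab / df_ab) / (ss_err / df_err)
p_ab = stats.f.sf(f_ab, df_ab, df_err)
print(f"interaction F={f_ab:.2f}, p={p_ab:.4f}")  # expected significant here
```

For a balanced design the total sum of squares splits exactly into SS(F1) + SS(F2) + SS(F1*F2) + SS(error), which mirrors the GLM effect representation above.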

Factorial ANOVA – Hypotheses being tested (one for each factor)

H0: The expected means (in the population) of the outcome variable for each group (treatment) are all the same

μ1 = … = μk

H1: At least one of the expected means is different

∃ i, j: μi ≠ μj (i ≠ j)

Test decision:

We reject H0, for a given level of significance α, if:

Fcalc > F_{1-α}(k-1, n-k)

Take the critical values from the tables in http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm

Note: k-1 is the numerator and n-k is the denominator (see a previous slide)

Page 30

Factorial ANOVA – Hypotheses being tested (one for each interaction)

H0: There is no interaction among the factors

∀ i, j: γij = 0 (i ≠ j), where γij denotes the interaction term between factors i and j

H1: There is interaction between at least two factors

∃ i, j: γij ≠ 0 (i ≠ j)

Test decision:

We reject H0, for a given level of significance α, if:

Fcalc > F_{1-α}(k-1, n-k)

Take the critical values from the tables in http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm

Note: k-1 is the numerator and n-k is the denominator (see a previous slide)

Factorial ANOVA – Assumptions

For increased test power, the populations from where each cell's data were taken should be normal and with homogeneous variances. However:

Factorial ANOVA is robust to departures from normality, although the data should be symmetric

Regarding variance, there are alternatives for using Factorial ANOVA when variance homogeneity is not assumed

To check assumptions, we can use homogeneity-of-variances tests (e.g. Levene test) and spread-versus-level plots. We can also examine residuals and residual plots.

Page 31

Factorial ANOVA – Differences among specific treatments

The overall F statistic allows testing whether at least one group, corresponding to a given treatment, has a mean on the outcome variable that is different from the other groups

If an overall F test has shown significance, we use post hoc tests to evaluate differences among specific means.

Some of those post hoc tests are applicable when equal variances are assumed, and some others when they are not

Use the Scheffé or the Tamhane's tests, depending on variance homogeneity, since those two tests are more conservative (safer) than others, which means that a larger difference between means is required for significance.

Factorial ANOVA – Profile (interaction) plots

If the interaction effects are not significant, we should consider each of the main effects separately, as we did for one-way ANOVA

When interaction effects are significant (rejected interaction-effect null hypotheses), we do not consider the corresponding main effects. Therefore, we should center our attention on the study of interactions

Profile plots allow visualizing the interactions among the factors!

Page 32

Example (1/6)

Problem:

Is the effort per adjusted function point dependent on the language, development type and software architecture?

We know already that the effort per adjusted function point does not have a Normal distribution (see previous slides), but since the Factorial ANOVA is robust to non-normality we still use it.

Between-Subjects Factors
                    Value   Value Label                        N
Language                1   Cobol                            508
                        2   Visual Basic                     264
                        3   C++                              116
                        4   Java                             204
Development type        0   New development                  445
                        1   Enhancement                      622
                        2   Re-development                    25
Architecture_           1   Stand-alone                      720
                        2   Client-server                    298
                        3   Multi-tier with web interface     74

Example (2/6)

SPSS: Analyze > General Linear Model > Univariate

Levene's Test of Equality of Error Variances (a)
Dependent Variable: Effort per Adjusted FP
     F   df1    df2   Sig.
 5.049    20   1071   .000

Tests the null hypothesis that the error variance of the dependent variable is equal across groups.

a. Design: Intercept + Language + DevType + Architecture_ + Language*DevType + Language*Architecture_ + DevType*Architecture_ + Language*DevType*Architecture_

Notice the interaction terms

Verifying variance homogeneity:

With a confidence interval of 99% we reject the null hypothesis that the variances are homogeneous.

Page 33

Tests of Between-Subjects Effects
Dependent Variable: Effort per Adjusted FP

Source                              Type III SS    df    Mean Square  F       Sig.  Partial Eta Sq.  Noncent. Param.  Obs. Power(a)
Corrected Model                     80088,097(b)   20    4004,405     3,877   ,000  ,068             77,532           1,000
Intercept                           37040,756      1     37040,756    35,859  ,000  ,032             35,859           1,000
Language                            802,606        3     267,535      ,259    ,855  ,001             ,777             ,100
DevType                             7598,601       2     3799,300     3,678   ,026  ,007             7,356            ,677
Architecture_                       8014,872       2     4007,436     3,880   ,021  ,007             7,759            ,702
Language * DevType                  265,799        4     66,450       ,064    ,992  ,000             ,257             ,063
Language * Architecture_            1294,776       3     431,592      ,418    ,740  ,001             1,253            ,134
DevType * Architecture_             8740,139       3     2913,380     2,820   ,038  ,008             8,461            ,680
Language * DevType * Architecture_  7952,667       3     2650,889     2,566   ,053  ,007             7,699            ,634
Error                               1106304,428    1071  1032,964
Total                               1741086,067    1092
Corrected Total                     1186392,525    1091

a. Computed using alpha = ,05
b. R Squared = ,068 (Adjusted R Squared = ,050)

Example (3/6)

With a confidence level of 95%, the DevType * Architecture interaction is significant. Therefore, we should consider this interaction effect instead of the main effects.

When the test power 1-β is low (below 80%), as happens here, especially for all terms involving Language, we should be careful, since the probability of a Type II Error (β) is high. Recall that β is the probability of incorrectly accepting H0 (incorrectly rejecting H1) when H0 is false.

Conclusion: the average effort per function point is significantly dependent on the combined action of development type and software architecture, although care should be taken since the test power is limited.

Multiple Comparisons
Dependent Variable: Effort per Adjusted FP
Tamhane

(I) Architecture_              (J) Architecture_              Mean Diff. (I-J)  Std. Error  Sig.   95% CI Lower  95% CI Upper
Stand-alone                    Client-server                    6,0732*         2,04680     ,009    1,1751       10,9714
Stand-alone                    Multi-tier with web interface   19,0322*         1,37305     ,000   15,7464       22,3179
Client-server                  Stand-alone                     -6,0732*         2,04680     ,009  -10,9714       -1,1751
Client-server                  Multi-tier with web interface   12,9589*         1,55147     ,000    9,2343       16,6836
Multi-tier with web interface  Stand-alone                    -19,0322*         1,37305     ,000  -22,3179      -15,7464
Multi-tier with web interface  Client-server                  -12,9589*         1,55147     ,000  -16,6836       -9,2343

Based on observed means.
*. The mean difference is significant at the ,05 level.

Multiple Comparisons
Dependent Variable: Effort per Adjusted FP
Tamhane

(I) Development type  (J) Development type  Mean Diff. (I-J)  Std. Error  Sig.   95% CI Lower  95% CI Upper
New development       Enhancement           -10,4831*         1,82443     ,000  -14,8474       -6,1188
New development       Re-development         -6,6862          3,02944     ,103  -14,3758        1,0034
Enhancement           New development        10,4831*         1,82443     ,000    6,1188       14,8474
Enhancement           Re-development          3,7969          3,33127     ,596   -4,4952       12,0891
Re-development        New development         6,6862          3,02944     ,103   -1,0034       14,3758
Re-development        Enhancement            -3,7969          3,33127     ,596  -12,0891        4,4952

Based on observed means.
*. The mean difference is significant at the ,05 level.

Example (4/6)

With a confidence level of 95% we can only say that software enhancement requires, on average, an effort per adjusted FP around 10,5 hours larger than new development. Your interpretation?

With a confidence level of 95% we can only say that there is an increasing order of magnitude in the average effort per adjusted FP from multi-tier with web interface to stand-alone. Your interpretation?

Scenario 1: main effects are important
Attention: this scenario is not true in our case study!
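Tamhane's T2, used in the SPSS tables above, is not available in SciPy. As a rough, hedged stand-in, pairwise Welch t-tests (which, like T2, do not assume equal variances) with a Bonferroni correction give conservative post-hoc comparisons. The data and group names below are illustrative:

```python
# Sketch: unequal-variance pairwise post-hoc comparisons with a
# Bonferroni correction, as a conservative stand-in for Tamhane's T2.
from itertools import combinations
from scipy import stats

groups = {
    "standalone":    [30, 35, 40, 45, 50, 55],
    "client_server": [25, 28, 31, 34, 37, 40],
    "web":           [10, 14, 18, 22, 26, 30],
}
pairs = list(combinations(groups, 2))
results = {}
for g1, g2 in pairs:
    # equal_var=False selects the Welch t-test
    t, p = stats.ttest_ind(groups[g1], groups[g2], equal_var=False)
    results[(g1, g2)] = min(p * len(pairs), 1.0)  # Bonferroni-adjusted
for pair, p_adj in results.items():
    print(pair, f"adjusted p = {p_adj:.3f}")
```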

Page 34: Experimental Research Methodology Statistical Tests


Example (5/6) – Significant interactions

The effect of the development type on the effort is partly moderated by the architecture (and vice-versa). This moderation effect manifests itself in crossing lines.

Scenario 2: interaction effects are important
Attention: this scenario is the correct one in our case study!

Example (6/6) – Non-significant interactions

When the interactions are not significant, the lines do not cross or only cross slightly.

Page 35: Experimental Research Methodology Statistical Tests


NON-PARAMETRIC TESTS

(Between Groups)

Binomial

(test of proportions)

Page 36: Experimental Research Methodology Statistical Tests


Binomial test – Applicability

This test compares the proportion of occurrences of one of the two possible values of a dichotomous variable against the total number of cases.

Design

1 factor, 1 treatment, independent samples

Scales

Factor (grouping) variable: categorical (dichotomous)
Outcome variable: N/A

Binomial test – The proportions to be tested

If px and py are the proportions of the two possible values of the factor, then px + py = 1.

We test the hypothesis that the expected proportions in the population are of a given value (p0, 1-p0), as for instance:

p0 = 25% -> px = 25%, py = 75%
p0 = 50% -> px = 50%, py = 50%
p0 = 60% -> px = 60%, py = 40%

Page 37: Experimental Research Methodology Statistical Tests


Binomial test – Hypotheses and test decision

H0: px = p0, py = 1 - p0 – the expected proportions in the population are the ones being tested

H1: px ≠ p0, py ≠ 1 - p0 – the expected proportions in the population are significantly different from the tested ones

Test decision:
If p ≤ α (significant Z statistic): reject H0 and accept H1
If p > α (not significant Z statistic): accept H0 and therefore reject H1

Example (1/3)

Objective: assess if the proportions of CASE tool usage / non-usage are even.

CASE tool usage is represented by a dichotomous variable that splits subjects in 2 samples (groups). One group corresponds to the projects using CASE tools (label “Yes”) and the other to those projects that aren’t using them (label “No”).

SPSS:

Analyze

Nonparametric Tests

Binomial

Page 38: Experimental Research Methodology Statistical Tests


Example (2/3)

Binomial Test

Case tool usage  Category  N     Observed Prop.  Test Prop.  Asymp. Sig. (2-tailed)
Group 1          No        1254  ,66             ,50         ,000(a)
Group 2          Yes       646   ,34
Total                      1900  1,00

a. Based on Z Approximation.

We can enter a test proportion for the first group. The probability for the second group will be 1 minus the specified probability for the first group.

Conclusion: there is a statistically significant difference between the proportion of projects that use CASE tools and those that don’t. With a confidence level greater than 99,99% we can reject the null hypothesis.

Example (3/3)

Here we are testing if the proportion of projects not using CASE tools is twice as large as that of those using them.

Binomial Test

Case tool usage  Category  N     Observed Prop.  Test Prop.  Asymp. Sig. (1-tailed)
Group 1          No        1254  ,660            ,666        ,297(a,b)
Group 2          Yes       646   ,340
Total                      1900  1,000

a. Alternative hypothesis states that the proportion of cases in the first group < ,666.
b. Based on Z Approximation.

Conclusion: we accept that the proportion of projects not using CASE tools is twice as large as that of those that do use them! Even with a confidence level of 90% we cannot reject the null hypothesis.
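Both binomial tests in this example can be reproduced with SciPy's `binomtest`, using the counts from the SPSS output (1254 "No", 646 "Yes"). Note SciPy computes an exact binomial p-value, whereas SPSS used a Z approximation here, so the figures will differ slightly:

```python
# Sketch: the two binomial tests from this example in SciPy.
from scipy.stats import binomtest

# Test 1: are the proportions even (p0 = 0.5)? Two-sided.
res_even = binomtest(1254, n=1900, p=0.5, alternative="two-sided")

# Test 2: is the "No" proportion 2/3, i.e. twice the "Yes" one?
# 'less' mirrors the one-tailed alternative in the SPSS footnote.
res_two_thirds = binomtest(1254, n=1900, p=2/3, alternative="less")

print(res_even.pvalue)        # far below 0.01: reject H0
print(res_two_thirds.pvalue)  # well above 0.10: cannot reject H0
```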

Page 39: Experimental Research Methodology Statistical Tests


Chi-Square

(test of proportions)

Chi-Square test – Applicability

This goodness-of-fit test compares the observed and expected frequencies in each category, to test that all categories contain the same proportion of values.

It can also test if each category contains a user-specified proportion of values.

It can be used to test if 2 or more independent samples (groups) differ regarding a given factor.

Page 40: Experimental Research Methodology Statistical Tests


Chi-Square test – Design & Scales

Design

1 factor, 2 or more treatments, independent samples

Scales

Factor (grouping) variable: categorical
Outcome variable: N/A

Chi-square test – Hypotheses being tested

H0: the expected proportions of the groups (in the population) are similar
Groups do not differ significantly in size
The effect of the factor is negligible

H1: the proportions in the groups are different
Groups differ significantly in size
The effect of the factor is not negligible

Test decision:
If p ≤ α (significant Chi-Square statistic): reject H0 and accept H1 (proportions are different)
If p > α (not significant Chi-Square statistic): accept H0 and therefore reject H1 (proportions are similar)

Page 41: Experimental Research Methodology Statistical Tests


Chi-square test – Preconditions

The Chi-Square test operates on a contingency table:
Rows and columns represent the categories of the two variables
Each cell contains the number of observations for a given pair of values (factor, outcome variable)

The Chi-Square preconditions are:
The sample must be large enough (n > 20)
All contingency values must be > 1
At least 80% of the contingency values must be > 5

Primary Programming Language * Case tool usage Crosstabulation
Count

                              Case tool usage
Primary Programming Language  No    Yes   Total
Cobol                         279   121   400
Visual Basic                  229   24    253
C++                           26    42    68
Java                          61    36    97
Total                         595   223   818

This is a contingency table where all the preconditions are met.
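The precondition checks above can be automated. A minimal sketch that computes the expected counts for this crosstab with SciPy and verifies the three rules (n > 20, all expected counts > 1, at least 80% of them > 5):

```python
# Sketch: checking the Chi-Square preconditions for the crosstab above.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Cobol, Visual Basic, C++, Java; columns: No, Yes (CASE usage)
observed = np.array([[279, 121],
                     [229,  24],
                     [ 26,  42],
                     [ 61,  36]])
expected = chi2_contingency(observed)[3]  # expected frequencies

n = observed.sum()
ok = (n > 20
      and (expected > 1).all()
      and (expected > 5).mean() >= 0.80)
print("Preconditions met:", ok)  # True for this table
```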

Example (1/3)

Objective: is the adoption of CASE tools dependent on the programming language used?

If there is some sort of dependence, then the proportions in the groups will not be similar.

SPSS:

Analyze

Descriptive Statistics

Crosstabs

Page 42: Experimental Research Methodology Statistical Tests


Example (2/3)

Example (3/3)

Conclusion: the adoption of CASE tools depends on the programming language. With a confidence level of 99% we can reject the null hypothesis.

Case Processing Summary

                                            Cases
                              Valid           Missing          Total
                              N     Percent   N      Percent   N     Percent
Primary Programming Language
* Case tool usage             639   22,5%     2201   77,5%     2840  100,0%

Primary Programming Language * Case tool usage Crosstabulation

                                               Case tool usage
                                               No      Yes     Total
Cobol          Count                           222     106     328
               Expected Count                  243,3   84,7    328,0
Visual Basic   Count                           192     17      209
               Expected Count                  155,0   54,0    209,0
C++            Count                           11      28      39
               Expected Count                  28,9    10,1    39,0
Java           Count                           49      14      63
               Expected Count                  46,7    16,3    63,0
Total          Count                           474     165     639
               Expected Count                  474,0   165,0   639,0

Chi-Square Tests

                              Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square            84,822(a)  3    ,000
Likelihood Ratio              86,160     3    ,000
Linear-by-Linear Association  ,565       1    ,452
N of Valid Cases              639

a. 0 cells (,0%) have expected count less than 5. The minimum expected count is 10,07.
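The Pearson Chi-Square above can be reproduced with SciPy from the observed counts in the crosstabulation:

```python
# Sketch: the Pearson Chi-Square of Example (2/3) with SciPy.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[222, 106],   # Cobol
                     [192,  17],   # Visual Basic
                     [ 11,  28],   # C++
                     [ 49,  14]])  # Java
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p:.5f}")
# chi2 is about 84.8 with df = 3, matching the SPSS table, so
# H0 (similar proportions across languages) is rejected.
```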

Page 43: Experimental Research Methodology Statistical Tests


Mann-Whitney test

(aka U-test)

Mann-Whitney test – Applicability

This is the non-parametric analog of the t-test.

Instead of comparing the averages of the 2 samples, it compares their central tendency to detect differences.

It can be used to test if 2 samples differ regarding a given factor.

Page 44: Experimental Research Methodology Statistical Tests


Mann-Whitney test – Design & Scales

Design

1 factor, 2 treatments, independent samples, random samples (between groups)

Scales

Factor (grouping) variable: categorical
Outcome variable: at least ordinal scale

Assumptions

The two tested samples should be similar in shape

Mann-Whitney test – Hypotheses being tested

H0: the two populations from which the samples for the two groups were taken have similar central tendency
The groups are not affected by the factor variable

H1: the two samples do not have similar central tendency
The groups are affected by the factor variable

U statistic

This statistic is used to test the above hypotheses

Page 45: Experimental Research Methodology Statistical Tests


Example (1/3)

Objective: assess if the effort per development phase is different between two languages (Cobol and Java).

Each independent sample (group) corresponds to the projects (cases) that use the same programming language (PL).

Let c and j be indexes identifying Cobol and Java, respectively. Then, for each phase, the underlying hypotheses for this test are the following:

H0: Ec ~ Ej
H1: ¬(Ec ~ Ej)

If we reject the null hypothesis that the samples do not differ on the criterion (factor or grouping) variable (the PL), then we can sustain that the statistical distributions of the efforts per phase for each group of projects (corresponding to a PL) are different. In other words, we would accept the alternative hypothesis that the PL has an influence on the effort per phase.

Notice that since we have several phases, we have to perform one test for each phase.

Test Statistics(a)

                                Effort  Effort   Effort  Effort  Effort  Effort
                                Plan    Specify  Design  Build   Test    Implement
Most Extreme     Absolute       ,215    ,130     ,550    ,215    ,225    ,222
Differences      Positive       ,215    ,130     ,200    ,215    ,225    ,000
                 Negative       -,088   -,050    -,550   -,014   -,031   -,222
Kolmogorov-Smirnov Z            ,919    ,628     ,977    1,128   1,162   ,998
Asymp. Sig. (2-tailed)          ,367    ,825     ,295    ,157    ,134    ,272

a. Grouping Variable: Primary Programming Language

Example (2/3) – 1 is Cobol and 4 is Java

First we must verify, for each effort kind, if the groups corresponding to different languages have distributions with similar shapes, by using the Kolmogorov-Smirnov Z test.

H0 – the statistical distribution of the effort is similar in both programming languages
H1 – the statistical distribution of the effort is significantly different in both programming languages

SPSS:
Analyze
Nonparametric Tests
2 Independent Samples

We accept that the 2 groups have similar statistical distributions (with a confidence level of 99%) for all efforts being tested.
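The shape check can be reproduced with SciPy's two-sample Kolmogorov-Smirnov test. The samples below are illustrative stand-ins for the Cobol and Java efforts of one phase:

```python
# Sketch: two-sample K-S test for similarity of distribution shapes,
# as done on this slide before applying the Mann-Whitney test.
from scipy.stats import ks_2samp

effort_cobol = [12, 15, 18, 20, 22, 25, 28, 30]
effort_java  = [14, 16, 19, 21, 24, 26, 29, 31]
stat, p = ks_2samp(effort_cobol, effort_java)
print(f"KS statistic = {stat:.3f}, p = {p:.3f}")
# A large p-value: no evidence that the shapes differ.
```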

Page 46: Experimental Research Methodology Statistical Tests


Example (3/3) – Ranks

                   Primary Programming
                   Language             N     Mean Rank  Sum of Ranks
Effort Plan        Cobol                77    49,22      3790,00
                   Java                 24    56,71      1361,00
                   Total                101
Effort Specify     Cobol                108   68,27      7373,50
                   Java                 30    73,92      2217,50
                   Total                138
Effort Design      Cobol                4     8,00       32,00
                   Java                 15    10,53      158,00
                   Total                19
Effort Build       Cobol                142   85,42      12130,00
                   Java                 34    101,35     3446,00
                   Total                176
Effort Test        Cobol                160   93,84      15014,00
                   Java                 32    109,81     3514,00
                   Total                192
Effort Implement   Cobol                106   69,46      7363,00
                   Java                 25    51,32      1283,00
                   Total                131

Test Statistics(b)

                                Effort    Effort    Effort   Effort     Effort     Effort
                                Plan      Specify   Design   Build      Test       Implement
Mann-Whitney U                  787,000   1487,500  22,000   1977,000   2134,000   958,000
Wilcoxon W                      3790,000  7373,500  32,000   12130,000  15014,000  1283,000
Z                               -1,093    -,684     -,800    -1,638     -1,485     -2,150
Asymp. Sig. (2-tailed)          ,274      ,494      ,424     ,102       ,138       ,032
Exact Sig. [2*(1-tailed Sig.)]                      ,469(a)

a. Not corrected for ties.
b. Grouping Variable: Primary Programming Language

H0 – the two samples have similar central tendency on the effort
H1 – the two samples do not have similar central tendency on the effort

SPSS:
Analyze
Nonparametric Tests
2 Independent Samples

The effort to plan, specify and design does not differ significantly between Cobol and Java. The effort to build and implement may be considered significantly different with a confidence level of 90%.
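One of these per-phase tests can be sketched with SciPy's `mannwhitneyu`. The samples are illustrative, not the real per-phase efforts:

```python
# Sketch: the Mann-Whitney U test in SciPy. The real analysis above
# runs one such test per development phase.
from scipy.stats import mannwhitneyu

effort_cobol = [12, 15, 18, 20, 22, 25, 28, 30]
effort_java  = [35, 38, 41, 44, 47, 50, 53, 56]
u, p = mannwhitneyu(effort_cobol, effort_java, alternative="two-sided")
print(f"U = {u}, p = {p:.4f}")
# Every Java value exceeds every Cobol value, so U = 0 and the
# central tendencies differ significantly (p < 0.05).
```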

Kruskal-Wallis H test

(one-way analysis of variance)

Page 47: Experimental Research Methodology Statistical Tests


Kruskal-Wallis test – Applicability

It is an extension of the Mann-Whitney U test and the nonparametric equivalent to the One-Way ANOVA.

It assesses whether several independent samples have a common localization parameter.

Each sample is a group of subjects corresponding to the application of a given treatment (a level of the factor variable).

Kruskal-Wallis test – Design & Scales

Design

1 factor with more than 2 treatments, independent, random samples (between groups)

Scales

Factor (grouping) variable: categorical
Outcome variable: at least ordinal

Assumptions

The tested samples should be similar in shape

Page 48: Experimental Research Methodology Statistical Tests


Kruskal-Wallis test – Hypotheses being tested

H0: the distributions of the populations from where each group was extracted have the same localization parameter
Groups do not differ significantly
The effect of the factor is negligible

H1: at least one of the distributions has a localization parameter that is smaller or greater than the others
At least one sample (group) differs significantly
The effect of the factor is not negligible

Kruskal-Wallis test – Test decision

The calculated H test statistic is distributed approximately as chi-square.

From a chi-square table, with the given df (degrees of freedom) and for a stipulated significance α (probability of a Type I error), we obtain a critical value of the chi-square to be compared with the calculated H statistic.

Test decision:
We reject H0, for a given level of significance α, if:
Hcalc > χ²(1-α; df)
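The critical value can be obtained programmatically instead of reading the printed table. A minimal sketch with SciPy (the H value is hypothetical, for illustration only):

```python
# Sketch: looking up the chi-square critical value for the
# Kruskal-Wallis test decision.
from scipy.stats import chi2

alpha = 0.05           # stipulated significance (Type I error prob.)
df = 3                 # number of groups minus 1
critical = chi2.ppf(1 - alpha, df)
print(f"critical value = {critical:.3f}")  # 7.815, as in the table row df=3

h_calc = 9.2           # hypothetical computed H statistic
reject_h0 = h_calc > critical
print("Reject H0:", reject_h0)
```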

Page 49: Experimental Research Methodology Statistical Tests


df 90% 95% 97.5% 99% 99.5% 99.9%

1 2.706 3.841 5.024 6.635 7.879 10.827

2 4.605 5.991 7.378 9.210 10.597 13.815

3 6.251 7.815 9.348 11.345 12.838 16.268

4 7.779 9.488 11.143 13.277 14.860 18.465

5 9.236 11.070 12.832 15.086 16.750 20.517

6 10.645 12.592 14.449 16.812 18.548 22.457

7 12.017 14.067 16.013 18.475 20.278 24.322

8 13.362 15.507 17.535 20.090 21.955 26.125

9 14.684 16.919 19.023 21.666 23.589 27.877

10 15.987 18.307 20.483 23.209 25.188 29.588

11 17.275 19.675 21.920 24.725 26.757 31.264

12 18.549 21.026 23.337 26.217 28.300 32.909

13 19.812 22.362 24.736 27.688 29.819 34.528

14 21.064 23.685 26.119 29.141 31.319 36.123

15 22.307 24.996 27.488 30.578 32.801 37.697

16 23.542 26.296 28.845 32.000 34.267 39.252

17 24.769 27.587 30.191 33.409 35.718 40.790

18 25.989 28.869 31.526 34.805 37.156 42.312

19 27.204 30.144 32.852 36.191 38.582 43.820

20 28.412 31.410 34.170 37.566 39.997 45.315

21 29.615 32.671 35.479 38.932 41.401 46.797

22 30.813 33.924 36.781 40.289 42.796 48.268

23 32.007 35.172 38.076 41.638 44.181 49.728

24 33.196 36.415 39.364 42.980 45.558 51.179

25 34.382 37.652 40.646 44.314 46.928 52.620

26 35.563 38.885 41.923 45.642 48.290 54.052

27 36.741 40.113 43.194 46.963 49.645 55.476

28 37.916 41.337 44.461 48.278 50.993 56.893

29 39.087 42.557 45.722 49.588 52.336 58.302

30 40.256 43.773 46.979 50.892 53.672 59.703

Chi-square distribution table

Example (1/2)

Objective: assess the impact of the adopted programming language (PL) on the normalized work effort (E).

Each independent sample (group) corresponds to the projects (cases) that use the same PL.

Let i and j be two different PLs. Then, the underlying hypotheses for this test are the following:

H0: ∀ i,j : Ei ~ Ej
H1: ¬ ∀ i,j : Ei ~ Ej

If we reject the null hypothesis that the samples do not differ on the criterion (factor or grouping) variable (the PL), then we can sustain that the statistical distributions of the groups of projects' normalized work effort, corresponding to each of the PLs, are different. In other words, we would accept the alternative hypothesis that the PL has an influence on E.

Page 50: Experimental Research Methodology Statistical Tests


Example (2/2)

SPSS:

Analyze

Nonparametric Tests

K Independent Samples

Ranks

Effort per Adjusted FP   Language      N     Mean Rank
                         Cobol         509   606,70
                         Visual Basic  265   454,29
                         C++           116   646,58
                         Java          204   464,53
                         Total         1094

Test Statistics(a,b)

             Effort per Adjusted FP
Chi-Square   66,405
df           3
Asymp. Sig.  ,000

a. Kruskal Wallis Test
b. Grouping Variable: Language

Even for a confidence level of 99.9% we have Chi-SquareCALC > χ²(3; 0.001), so we can reject the null hypothesis. Therefore the effect of the language on the effort per FP is not negligible.

Ranks give us an indication of the relative influence of each language on the effort per FP. Notice that C++ is the language requiring the most effort and Visual Basic the least!

Extract of the Chi-Square table:

df 90% 95% 97.5% 99% 99.5% 99.9%
3 6.251 7.815 9.348 11.345 12.838 16.268
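The same H-versus-critical-value decision can be reproduced with SciPy's `kruskal`. The per-language effort samples below are illustrative, not the ISBSG data summarized above:

```python
# Sketch: the Kruskal-Wallis H test in SciPy, with the test decision
# made against the chi-square critical value, as on these slides.
from scipy.stats import kruskal, chi2

# Illustrative effort-per-FP samples for four languages.
cobol        = [40, 42, 45, 48, 50, 52]
visual_basic = [20, 22, 25, 27, 30, 32]
cpp          = [55, 58, 60, 62, 65, 68]
java         = [21, 24, 26, 29, 31, 33]

h, p = kruskal(cobol, visual_basic, cpp, java)
critical = chi2.ppf(0.999, 3)  # 99.9% point for df = 3, as in the table
print(f"H = {h:.3f}, p = {p:.5f}")
print("Reject H0 at 99.9%:", h > critical)
```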

Nonparametric Factorial ANOVA

Page 51: Experimental Research Methodology Statistical Tests


Nonparametric Factorial ANOVA – Applicability

This procedure is used to test if a given set of factors has a significant effect on a given variable:
It allows determining the effect of each factor
It allows assessing the interaction among factors (aka moderation)

This procedure is similar to the (parametric) Factorial ANOVA, but the H statistic is calculated based upon the ranks of cases within each group.

Remember that in the parametric version we used the F statistic, which is calculated upon the values of the outcome variable itself.

Nonparametric Factorial ANOVA – How to perform?

The basic distribution of SPSS does not support the Nonparametric Factorial ANOVA (not even the Two-Way).

There are several alternatives to perform this test:
Use another tool instead of SPSS (R has this procedure for free)
Get an advanced SPSS module that supports nonparametric ANOVA (may be expensive)
Program this procedure in the SPSS syntax language (VB-like) or find on the Internet someone who has done it already
Transform the outcome variable into a Normally distributed one and, if successful, use the parametric Factorial ANOVA
Use Excel to implement the test statistic H and then use a Chi-Square table to make the test decision.

Page 52: Experimental Research Methodology Statistical Tests


Nonparametric Factorial ANOVA – Design & Scales

Design

2+ factors, 2+ levels per factor, independent samples

If we only have 2 factors, this is called Nonparametric Two-Way ANOVA.

A treatment corresponds to a combination (tuple) of factor levels, such as (Java; Eclipse) if the factors are programming language and development environment. A treatment is often called a model cell.

Scales

Factor (grouping) variable: categorical (recoded)
Outcome variable: at least ordinal

Nonparametric Factorial ANOVA – Main and interaction effects

Consider that you have three factors F1, F2 and F3.

Main effects
These are the effects on the outcome variable caused by each factor alone, as we did with the Kruskal-Wallis test.
These are represented by F1, F2, F3.

Interaction effects
These are the cross-factor effects caused by the combined action of all combinations of factors, which are:
F1*F2, F1*F3, F2*F3, F1*F2*F3

Overall effect representation in the GLM:
I + F1 + F2 + F3 + F1*F2 + F1*F3 + F2*F3 + F1*F2*F3
where I is an intercept term (similar to that used in linear regression).

Page 53: Experimental Research Methodology Statistical Tests


Nonparametric Factorial ANOVA – Main effects hypotheses (one for each factor)

H0: the distributions of the populations from where each group was extracted have the same localization parameter
Groups do not differ significantly
The effect of the factor is negligible

H1: at least one of the distributions has a localization parameter that is smaller or greater than the others
At least one sample (group) differs significantly
The effect of the factor is not negligible

Nonparametric Factorial ANOVA – Main effects test decision (one for each factor)

As seen in the Kruskal-Wallis test, the calculated H test statistic is distributed approximately as chi-square.

Test decision:
We reject H0, for a given level of significance α, if:
Hcalc > χ²(1-α; df)

Get the critical value χ²(1-α; df) from the Chi-Square table presented on the Kruskal-Wallis test.

Page 54: Experimental Research Methodology Statistical Tests


Nonparametric Factorial ANOVA – Interaction effects hypotheses (one for each interaction)

H0: there is no interaction among the factors
∀ i,j (i ≠ j): γij = 0
H1: there is interaction between at least two factors
∃ i,j (i ≠ j): γij ≠ 0
(where γij denotes the interaction term between factors i and j)

Nonparametric Factorial ANOVA – Interaction effects test decision (one for each interaction)

We reject H0, for a given level of significance α, if:
Hcalc > χ²(1-α; df)

Get the critical value χ²(1-α; df) from the Chi-Square table presented on the Kruskal-Wallis test.