lecture 20 cell means model - purdue universityghobbs/stat_512/lecture_notes/ano… · 20-1 lecture...

Lecture 20

Cell Means Model

STAT 512

Spring 2011

Background Reading

KNNL: 16.3-16.6

Topic Overview

• ANOVA as Regression

• Cell Means Model for Single Factor ANOVA

• Sums of Squares & Degrees of Freedom

• Cash Offers example

Cash Offers Example (Pr16.10)

• Goal: Determine if age of owner affects the cash offer made by a dealer for a used car.

• Experiment: SAME car was taken by 36 different people (12 young, 12 middle-aged, and 12 elderly) to 36 different dealerships for an offer.

• Notes: “Owners” were randomized to dealerships. Offers given in hundreds of dollars.

• SAS code: cashoffers.sas

Aligned Box Plots

• Informative way to plot the data and can be done easily in SAS

symbol1 v=dot c=purple;
proc boxplot data=cash;
   plot offer*age / cboxes=purple cboxfill=yellow;
run;

• Numerous different options can be used in the plot statement (see SAS help files).

Aligned Box Plots (Cash Offers)

Interpreting Box Plots

Modeling by Regression

• Uses indicator variables

◦ YNG = 1 if YOUNG

◦ MID = 1 if MIDDLE

◦ (Both are 0 if ELDERLY)

• Model: $Y_{ij} = \beta_0 + \beta_{yng} X_{yng} + \beta_{mid} X_{mid} + \varepsilon_{ij}$

◦ $\beta_0$ is the mean offer for ELDERLY

◦ $\beta_0 + \beta_{yng}$ is the mean offer for YOUNG

◦ $\beta_0 + \beta_{mid}$ is the mean offer for MIDDLE

SAS Code

*Create indicator variables for regression;
data cash;
   set cash;
   if age='Young' then yng = 1; else yng = 0;
   if age='Middle' then mid = 1; else mid = 0;
proc print;
run;

*Use Regression to analyze the indicator variables;
proc reg data=cash;
   model offer=yng mid / clm alpha=0.01667;
   id yng mid;
run;

Regression Model

Source   DF      SS       MS    F Value   Pr > F
Model     2   316.7    158.4     63.60    <.0001
Error    33    82.2     2.49
Total    35   398.9

• Two DF in model since two indicator variables needed

• F-test indicates that there is a difference due to age (but doesn’t tell us what exactly is different)

Regression Model (2)

Variable    DF      Est      SE   t Value   Pr>|t|
Intercept    1   21.417   0.456     47.02   <.0001
yng          1    0.083   0.644      0.13   0.8979
mid          1    6.333   0.644      9.83   <.0001

• Estimated mean for Elderly is: $2142.

• Estimated mean for Young is: 2142+8=$2150.

• Estimated mean for Middle is: 2142+633=$2775.

• No significant difference between Elderly and Young (p = 0.8979); significant difference between Elderly and Middle (p < .0001).
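The arithmetic behind the three estimated means can be sketched in a few lines of Python. This is only an illustration of the indicator coding, not part of cashoffers.sas; the function names `indicators` and `estimated_mean` are made up for this sketch, and the coefficients are copied from the output above.

```python
# Coefficient estimates copied from the regression output (hundreds of dollars).
b0, b_yng, b_mid = 21.417, 0.083, 6.333


def indicators(age):
    """Return (yng, mid) indicator values; Elderly is the reference level."""
    return (1 if age == "Young" else 0, 1 if age == "Middle" else 0)


def estimated_mean(age):
    """Fitted mean offer, in hundreds of dollars, for one age group."""
    yng, mid = indicators(age)
    return b0 + b_yng * yng + b_mid * mid


for age in ("Elderly", "Young", "Middle"):
    mean = estimated_mean(age)
    print(f"{age:8s} {mean:7.3f}  (about ${100 * mean:.0f})")
```

Because Elderly has both indicators equal to 0, its fitted mean is just the intercept; the other two groups add their own coefficient to it.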

Regression Model (3)

• Can get confidence intervals by taking appropriate combinations / using CLM

• CLM gives CI’s for all 36 points, but they will be the same for each group of 12

• Use alpha = 0.01667 (why?)

◦ YOUNG: (20.35, 22.65)

◦ MIDDLE: (26.60, 28.90)

◦ ELDERLY: (20.27, 22.57)
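One reading of the alpha value, offered here as an interpretation rather than something the slide states, is a Bonferroni adjustment: three intervals at an assumed family level of 0.05 give 0.05/3 per interval. The interval for each group mean then has the usual form, and the standard error agrees with the one reported for the intercept:

```latex
\alpha = \frac{0.05}{3} \approx 0.01667, \qquad
\bar{Y}_{i\cdot} \pm t\!\left(1 - \tfrac{\alpha}{2};\; n_T - r\right)\sqrt{\frac{MSE}{n_i}}, \qquad
\sqrt{\frac{MSE}{n_i}} = \sqrt{\frac{2.49}{12}} \approx 0.456
```

Each reported interval has half-width 1.15, which corresponds to a t multiplier of roughly 1.15 / 0.456 ≈ 2.52 on 33 error degrees of freedom.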

Big Picture

• We could do everything for categorical variables using these indicator variables.

• Internally, SAS does this! But from an analytical viewpoint, there are other ways of modeling that are a bit easier to understand.

Cell Means Model

$Y_{ij} = \mu_i + \varepsilon_{ij}$

• $Y_{ij}$ is the value of the response variable in the $j$th trial for the $i$th factor level.

• $\mu_i$ is the (unknown) theoretical mean for all of the observations at level $i$.

• $\varepsilon_{ij}$ are independent normal errors with mean 0 and variance $\sigma^2$.

• Since the $\varepsilon_{ij}$ are normal random variables, the $Y_{ij}$ are also normal random variables, with means $\mu_i$ and variance $\sigma^2$.
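To see what the model says about the data, one can simulate from it. The sketch below uses made-up parameter values (the means, sigma, the sample size, and the seed are all assumptions chosen for illustration) and checks that each simulated group has a sample mean near its cell mean and a sample variance near the common variance.

```python
import random
import statistics

# Hypothetical parameter values, chosen only for illustration.
mu = {"level 1": 20.0, "level 2": 25.0, "level 3": 30.0}  # cell means mu_i
sigma = 2.0   # common error standard deviation
n = 5000      # trials per factor level

rng = random.Random(512)  # fixed seed so the run is repeatable

# Y_ij = mu_i + eps_ij, with eps_ij drawn independently from N(0, sigma^2).
y = {
    level: [m + rng.gauss(0.0, sigma) for _ in range(n)]
    for level, m in mu.items()
}

for level, values in y.items():
    print(level,
          round(statistics.mean(values), 2),
          round(statistics.variance(values), 2))
```

The only thing that differs between groups is the mean; the spread is the same everywhere, which is the constant variance assumption of the model.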

Comparison to Regression

• $\mu_{eld} = \beta_0$ is the mean offer for ELDERLY

• $\mu_{yng} = \beta_0 + \beta_{yng}$ is the mean offer for YOUNG

• $\mu_{mid} = \beta_0 + \beta_{mid}$ is the mean offer for MIDDLE

• Note that the number of parameters involved is the same: 3 in each case. If I estimate the $\mu$'s, I can get the $\beta$'s, and vice versa.
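Going in the other direction (from cell means to regression coefficients) is just subtraction of the reference-level mean. As a worked check, using the estimated group means for the cash offers data (21.4167, 21.5000, and 27.7500, in hundreds of dollars):

```latex
\begin{aligned}
\beta_0       &= \mu_{eld} = 21.4167 \\
\beta_{yng}   &= \mu_{yng} - \mu_{eld} = 21.5000 - 21.4167 = 0.0833 \\
\beta_{mid}   &= \mu_{mid} - \mu_{eld} = 27.7500 - 21.4167 = 6.3333
\end{aligned}
```

These agree with the regression estimates 21.417, 0.083, and 6.333.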

Parameters in ANOVA

• Need to estimate all of the cell means $\mu_1, \mu_2, \ldots, \mu_r$ and also $\sigma^2$

• F-test answers the question of whether $\mu_i$ depends on $i$. That is, we test the null hypothesis $H_0: \mu_1 = \mu_2 = \cdots = \mu_r$ against the alternative that not all the means are the same.

Notation

• “DOT” indicates to sum over that index, “BAR” indicates to take the average.

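• Overall or grand mean is $\bar{Y}_{\cdot\cdot} = \frac{1}{n_T} \sum_i \sum_j Y_{ij}$, where $n_T = \sum_i n_i$ is the total number of observations

• Mean for factor level $i$ is $\bar{Y}_{i\cdot} = \frac{1}{n_i} \sum_j Y_{ij}$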

Estimates

• Each group mean is estimated by the mean of the observations within that group:

$\hat{\mu}_i = \bar{Y}_{i\cdot} = \frac{1}{n_i} \sum_j Y_{ij}$

• Cell variances are estimated by

$s_i^2 = \frac{1}{n_i - 1} \sum_j \left( Y_{ij} - \bar{Y}_{i\cdot} \right)^2$

• Note: $n_i$ is the number of obs. in cell $i$. If all the same, we usually just write $n$.
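A small numerical sketch of these estimates follows. The observations are invented for illustration (they are not the cash offers data) and the group sizes are deliberately unequal so that the role of each n_i is visible.

```python
import statistics

# Hypothetical observations for r = 3 factor levels (not the cash offers data).
data = {
    "A": [10.0, 12.0, 11.0, 13.0],
    "B": [15.0, 14.0, 16.0],
    "C": [9.0, 8.0, 10.0, 9.0, 9.0],
}

n_T = sum(len(values) for values in data.values())
grand_mean = sum(sum(values) for values in data.values()) / n_T  # Y-bar-dot-dot

for level, values in data.items():
    n_i = len(values)
    mean_i = statistics.mean(values)     # estimate of mu_i (Y-bar-i-dot)
    var_i = statistics.variance(values)  # s_i^2, divisor n_i - 1
    print(level, n_i, mean_i, round(var_i, 4))

print("grand mean:", round(grand_mean, 4))
```

Note that with unequal group sizes the grand mean is the average of all observations, which is generally not the simple average of the group means.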

Pooled Variance Estimate

• The variances are assumed equal, so we pool the cell variances to get an overall variance estimate:

$s^2 = \frac{\sum_i (n_i - 1) s_i^2}{n_T - r} = \frac{\sum_i \sum_j \left( Y_{ij} - \bar{Y}_{i\cdot} \right)^2}{n_T - r} = MSE$

• Pooling is weighted according to the number of observations in each group.

SAS Coding for ANOVA

proc glm data=cash;
   class age;
   model offer=age;
   means age;
run;

• Class statement causes AGE to be treated as a classification (categorical) variable.

• Means statement produces table of means and standard deviations.

ANOVA Output

Source   DF      SS       MS    F Value   Pr > F
Model     2   316.7    158.4     63.60    <.0001
Error    33    82.2     2.49
Total    35   398.9

• Exactly the same as the regression model

• 2 DF in the model since 3 levels for AGE

• F-statistic indicates model significance; there is some difference in the age groups

ANOVA Output (2)

Level of         ---------offer---------
age         N       Mean       Std Dev
Elderly    12    21.4167       1.67649
Middle     12    27.7500       1.28806
Young      12    21.5000       1.73205

• Note these results are the same as in the regression approach

• It seems apparent here where the difference in age groups is, but important to do statistical tests to obtain “groupings” (More in a later topic).
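The pooled variance estimate can be recovered from this table alone. The sketch below plugs the reported group sizes and standard deviations into the pooling formula; it is a check on the output, not a replacement for PROC GLM.

```python
# Summary statistics copied from the MEANS output: (n_i, mean, std dev).
groups = {
    "Elderly": (12, 21.4167, 1.67649),
    "Middle": (12, 27.7500, 1.28806),
    "Young": (12, 21.5000, 1.73205),
}

r = len(groups)
n_T = sum(n for n, _, _ in groups.values())

# SSE = sum of (n_i - 1) * s_i^2, and MSE = SSE / (n_T - r).
sse = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())
mse = sse / (n_T - r)

print(f"SSE = {sse:.1f} on {n_T - r} df, MSE = {mse:.2f}")
```

With equal group sizes the pooled variance is simply the average of the three cell variances, and the result matches the Error line of the ANOVA table (SS 82.2, MS 2.49).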

Partitioning Variation

• Break down difference between observation and grand mean into two parts:

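$Y_{ij} - \bar{Y}_{\cdot\cdot} = \left( \bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot} \right) + \left( Y_{ij} - \bar{Y}_{i\cdot} \right)$

◦ Total deviation: $Y_{ij} - \bar{Y}_{\cdot\cdot}$

◦ BETWEEN (deviation of estimated factor level mean around grand mean): $\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}$

◦ WITHIN (deviation around estimated factor level mean): $Y_{ij} - \bar{Y}_{i\cdot}$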

Sums of Squares

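• If we square both sides of the equation on the previous slide (Partitioning Variation) and sum over all observations, the cross-terms $\left( \bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot} \right)\left( Y_{ij} - \bar{Y}_{i\cdot} \right)$ sum to zero and the equation works out nicely to:

$\underbrace{\sum_i \sum_j \left( Y_{ij} - \bar{Y}_{\cdot\cdot} \right)^2}_{SSTO} = \underbrace{\sum_i \sum_j \left( \bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot} \right)^2}_{SSTrt} + \underbrace{\sum_i \sum_j \left( Y_{ij} - \bar{Y}_{i\cdot} \right)^2}_{SSE}$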
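The decomposition is easy to verify numerically. The sketch below uses invented observations (not the cash offers data) with unequal group sizes, computes the three sums of squares directly from their definitions, and also sums the cross-terms to confirm that they vanish.

```python
# Hypothetical observations (not the cash offers data), unequal group sizes.
data = {
    "A": [10.0, 12.0, 11.0, 13.0],
    "B": [15.0, 14.0, 16.0],
    "C": [9.0, 8.0, 10.0, 9.0, 9.0],
}

all_values = [y for values in data.values() for y in values]
grand_mean = sum(all_values) / len(all_values)
means = {level: sum(values) / len(values) for level, values in data.items()}

# Total, treatment (between), and error (within) sums of squares.
ssto = sum((y - grand_mean) ** 2 for y in all_values)
sstr = sum(len(values) * (means[level] - grand_mean) ** 2
           for level, values in data.items())
sse = sum((y - means[level]) ** 2
          for level, values in data.items() for y in values)

# Sum of the cross-terms; it should be zero up to rounding error.
cross = sum((means[level] - grand_mean) * (y - means[level])
            for level, values in data.items() for y in values)

print(f"SSTO = {ssto:.4f}, SSTrt + SSE = {sstr + sse:.4f}, cross = {cross:.2e}")
```

The cross-terms vanish because, within each group, the deviations from the group mean sum to zero, so each group contributes a constant times zero.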

Analysis of Variance Table

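Source       DF          SS                                                                    MS
Model/Trt    $r - 1$     $\sum_i n_i \left( \bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot} \right)^2$     $SSTrt / (r - 1)$
Error        $n_T - r$   $\sum_i \sum_j \left( Y_{ij} - \bar{Y}_{i\cdot} \right)^2$                $SSE / (n_T - r)$
Total        $n_T - 1$   $\sum_i \sum_j \left( Y_{ij} - \bar{Y}_{\cdot\cdot} \right)^2$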

Sources of Variation

• MODEL line represents variation BETWEEN groups

• ERROR line represents variation WITHIN groups

• Ratio of Model to Error Mean Squares yields F-test as usual.
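As a check on how the table is assembled, the sketch below rebuilds the cash offers ANOVA table from the group sizes, means, and standard deviations reported by the MEANS statement. No raw data are needed: the Model line comes from the group means and the Error line from the group standard deviations.

```python
# Summary statistics from the MEANS output: (n_i, mean, std dev).
groups = {
    "Elderly": (12, 21.4167, 1.67649),
    "Middle": (12, 27.7500, 1.28806),
    "Young": (12, 21.5000, 1.73205),
}

r = len(groups)
n_T = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * m for n, m, _ in groups.values()) / n_T

# BETWEEN-group variation uses the means; WITHIN-group uses the std devs.
sstr = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups.values())
sse = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())

mstr = sstr / (r - 1)
mse = sse / (n_T - r)
f_stat = mstr / mse

print(f"{'Source':10s}{'DF':>4s}{'SS':>9s}{'MS':>9s}{'F':>8s}")
print(f"{'Model':10s}{r - 1:4d}{sstr:9.1f}{mstr:9.1f}{f_stat:8.2f}")
print(f"{'Error':10s}{n_T - r:4d}{sse:9.1f}{mse:9.2f}")
print(f"{'Total':10s}{n_T - 1:4d}{sstr + sse:9.1f}")
```

The values agree, to the rounding shown, with the SAS output: SS 316.7 and 82.2, MS 158.4 and 2.49, and F = 63.60.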

Expected Mean Squares

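• Can show that

$E(MSTR) = \sigma^2 + \frac{1}{r - 1} \sum_i n_i \left( \mu_i - \mu_{\cdot} \right)^2$

where $\mu_{\cdot} = \sum_i n_i \mu_i / n_T$ is the (weighted) grand mean.

• $E(MSE) = \sigma^2$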

• Ratio MSTR / MSE will be close to 1 if there is no treatment effect and will tend to be bigger than 1 if there is a treatment effect.

• See page 696 for how to find the expected mean squares.
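The behavior of the ratio follows directly from the two expected mean squares. Writing $\mu_{\cdot}$ for the grand mean of the $\mu_i$, the two cases are:

```latex
\begin{aligned}
\text{Under } H_0:\ \mu_1 = \cdots = \mu_r = \mu_{\cdot}
  &\;\Rightarrow\; \sum_i n_i (\mu_i - \mu_{\cdot})^2 = 0
  \;\Rightarrow\; E(MSTR) = \sigma^2 = E(MSE) \\
\text{Otherwise}
  &\;\Rightarrow\; \sum_i n_i (\mu_i - \mu_{\cdot})^2 > 0
  \;\Rightarrow\; E(MSTR) > \sigma^2 = E(MSE)
\end{aligned}
```

So MSE estimates $\sigma^2$ whether or not the means differ, while MSTR is inflated only when they do.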

F-test

• $H_0: \mu_1 = \mu_2 = \cdots = \mu_r$

• $H_a$: Not all the $\mu_i$ are equal

• $F = MSTR / MSE$

• Under $H_0$, $F \sim F(r - 1, n_T - r)$, so reject $H_0$ if $F$ is bigger than the critical value at significance level $\alpha$

• SAS reports the p-value for this test in the ANOVA table. Reject $H_0$ if the p-value is less than the significance level $\alpha$.
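For three groups the numerator degrees of freedom are 2, and in that special case the F distribution has a closed-form tail probability, P(F > f) = (1 + 2f/d)^(-d/2) with d the error degrees of freedom. The sketch below uses that formula to get a critical value and a p-value without any statistics library. The helper names are made up for this sketch, alpha = 0.05 is an assumed level, and the formula does not apply when r is not 3.

```python
def f_sf_num2(f, d2):
    """P(F > f) for an F(2, d2) distribution (numerator df must be 2)."""
    return (1.0 + 2.0 * f / d2) ** (-d2 / 2.0)


def f_crit_num2(alpha, d2):
    """Upper-alpha critical value of F(2, d2), found by inverting f_sf_num2."""
    return (d2 / 2.0) * (alpha ** (-2.0 / d2) - 1.0)


f_stat = 63.60   # F statistic from the cash offers ANOVA table
df_error = 33    # n_T - r = 36 - 3
alpha = 0.05     # assumed significance level

critical = f_crit_num2(alpha, df_error)
p_value = f_sf_num2(f_stat, df_error)

print(f"critical value = {critical:.3f}")
print(f"p-value = {p_value:.3g}")
print("reject H0" if f_stat > critical else "fail to reject H0")
```

Both decision rules give the same answer here: the F statistic is far above the critical value, and the p-value is far below 0.0001, consistent with the "<.0001" that SAS prints.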

Example (Cash Offers)

Source   DF      SS       MS    F Value   Pr > F
Model     2   316.7    158.4     63.60    <.0001
Error    33    82.2     2.49
Total    35   398.9

• P-value < 0.0001 so there is some difference among the age groups

• We can see where the difference is from the plots and from the output of the MEANS statement we saw earlier, but we will check with a formal test later.

Assumptions

• Constant variance of errors

• Residual plot: Residuals vs. X (Age Group)

• No obvious problems with the variance.

Assumptions (2)

• Normality of errors

• No major violations of normality

• If minor violations, generally ok for ANOVA

Assumptions (3)

• There are some slight differences in how these are assessed for ANOVA, and also in how we fix problems if they exist.

• Will discuss diagnostics/remedial measures in greater detail later.

Upcoming in Lecture 21...

• Factor Effects Model

• Power/Sample Size Planning

• Sections 16.7, 16.10-11
