20-1
Lecture 20
Cell Means Model
STAT 512, Purdue University
Spring 2011
Background Reading
KNNL: 16.3-16.6
20-2
Topic Overview
• ANOVA as Regression
• Cell Means Model for Single Factor ANOVA
• Sums of Squares & Degrees of Freedom
• Cash Offers example
20-3
Cash Offers Example (Pr16.10)
• Goal: Determine if age of owner affects the cash offer made by a dealer for a used car.
• Experiment: SAME car was taken by 36 different people (12 young, 12 middle-aged, and 12 elderly) to 36 different dealerships for an offer.
• Notes: “Owners” were randomized to dealerships. Offers given in hundreds of dollars.
• SAS code: cashoffers.sas
20-4
Aligned Box Plots
• Informative way to plot the data and can be done easily in SAS
symbol1 v=dot c=purple;
proc boxplot data=cash;
   plot offer*age / cboxes=purple cboxfill=yellow;
run;
• Numerous different options can be used in the plot statement (see SAS help files).
20-5
Aligned Box Plots (Cash Offers)
20-6
Interpreting Box Plots
20-7
Modeling by Regression
• Uses indicator variables
◦ YNG = 1 if YOUNG
◦ MID = 1 if MIDDLE
◦ (Both are 0 if ELDERLY)
• Model: $Y_{ij} = \beta_0 + \beta_{yng} X_{yng} + \beta_{mid} X_{mid} + \varepsilon_{ij}$
◦ $\beta_0$ is the mean offer for ELDERLY
◦ $\beta_0 + \beta_{yng}$ is the mean offer for YOUNG
◦ $\beta_0 + \beta_{mid}$ is the mean offer for MIDDLE
20-8
SAS Code
*Create indicator variables for regression;
data cash;
   set cash;
   if age='Young' then yng=1; else yng=0;
   if age='Middle' then mid=1; else mid=0;
run;
proc print data=cash;
run;

*Use Regression to analyze the indicator variables;
proc reg data=cash;
   model offer=yng mid / clm alpha=0.01667;
   id yng mid;
run;
20-9
Regression Model
Source DF SS MS F Value Pr > F
Model 2 316.7 158.4 63.60 <.0001
Error 33 82.2 2.49
Total 35 398.9
• Two DF in model since two indicator variables needed
• F-test indicates that there is a difference due to age (but doesn’t tell us which groups differ)
20-10
Regression Model (2)
Variable DF Est SE t Value Pr>|t|
Intercept 1 21.417 0.456 47.02 <.0001
yng 1 0.083 0.644 0.13 0.8979
mid 1 6.333 0.644 9.83 <.0001
• Estimated mean for Elderly is: $2142.
• Estimated mean for Young is: 2142+8=$2150.
• Estimated mean for Middle is: 2142+633=$2775.
• No significant difference between Elderly/Young (p = 0.8979); significant difference between Elderly/Middle (p < .0001).
20-11
Regression Model (3)
• Can get confidence intervals by taking appropriate combinations / using CLM
• CLM gives CI’s for all 36 points, but they will be the same for each group of 12
• Use alpha = 0.01667 (why?)
◦ YOUNG: (20.35, 22.65)
◦ MIDDLE: (26.60, 28.90)
◦ ELDERLY: (20.27, 22.57)
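The intervals above can be checked by hand. The following is a sketch, assuming that alpha = 0.01667 is a Bonferroni adjustment (0.05 split across the three group means); the slide poses this as a question and does not state the answer. The values 2.49, 33 and 12 are the MSE, error DF and group size from the regression output.

```latex
\hat{\mu}_i \;\pm\; t\!\left(1 - \tfrac{\alpha}{2};\, 33\right)\sqrt{\frac{MSE}{n}},
\qquad \alpha = \frac{0.05}{3} \approx 0.01667,
\qquad \sqrt{\frac{MSE}{n}} = \sqrt{\frac{2.49}{12}} \approx 0.456
```

The standard error 0.456 agrees with the intercept SE in the regression output. Each reported interval has half-width 1.15 (for example, (22.65 - 20.35)/2), which corresponds to a t multiplier of roughly 2.5.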
20-12
Big Picture
• We could do everything for categorical variables using these indicator variables.
• Internally, SAS does this! But from an analytical viewpoint, there are other ways of modeling that are a bit easier to understand.
20-13
Cell Means Model
$$Y_{ij} = \mu_i + \varepsilon_{ij}$$
• $Y_{ij}$ is the value of the response variable in the $j$th trial for the $i$th factor level.
• $\mu_i$ is the (unknown) theoretical mean for all of the observations at level $i$.
• $\varepsilon_{ij}$ are independent normal errors with mean 0 and variance $\sigma^2$.
• Since the $\varepsilon_{ij}$ are normal RVs, the $Y_{ij}$ are also normal RVs, with means $\mu_i$ and variance $\sigma^2$.
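To make the model concrete, here is a minimal Python sketch (not the SAS used in the lecture) that draws data from a cell means model. The function name `simulate_cell_means`, the means, and the value of sigma are all hypothetical and chosen only for illustration.

```python
import random

def simulate_cell_means(mu, sigma, n, seed=0):
    """Draw n observations per level from Y_ij = mu_i + eps_ij,
    where the eps_ij are independent N(0, sigma^2) errors."""
    rng = random.Random(seed)
    return {level: [m + rng.gauss(0.0, sigma) for _ in range(n)]
            for level, m in mu.items()}

# Hypothetical parameter values (hundreds of dollars), not estimates from the data.
mu = {"Young": 21.5, "Middle": 27.75, "Elderly": 21.4}
sample = simulate_cell_means(mu, sigma=1.6, n=12)

for level, ys in sample.items():
    # Each sample mean should land near its mu_i, up to noise of order sigma/sqrt(n).
    print(level, round(sum(ys) / len(ys), 2))
```

Every level shares the same sigma, which is the constant-variance assumption built into the model.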
20-14
Comparison to Regression
• $\mu_{eld} = \beta_0$ is the mean offer for ELDERLY
• $\mu_{yng} = \beta_0 + \beta_{yng}$ is the mean offer for YOUNG
• $\mu_{mid} = \beta_0 + \beta_{mid}$ is the mean offer for MIDDLE
• Note that the number of parameters involved is the same: 3 in each case. If I estimate the $\mu$'s, I can get the $\beta$'s, and vice versa.
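The one-to-one mapping between the two parameterizations can be sketched in a few lines of Python (not the lecture's SAS). The helper names `means_from_betas` and `betas_from_means` are hypothetical; the coefficient values are the estimates from the regression output.

```python
def means_from_betas(b0, b_yng, b_mid):
    """Regression coefficients -> cell means (ELDERLY is the baseline)."""
    return {"Elderly": b0, "Young": b0 + b_yng, "Middle": b0 + b_mid}

def betas_from_means(means):
    """Cell means -> regression coefficients (b0, b_yng, b_mid)."""
    b0 = means["Elderly"]
    return b0, means["Young"] - b0, means["Middle"] - b0

# Estimates from the regression output (hundreds of dollars).
means = means_from_betas(21.417, 0.083, 6.333)
print({level: round(m, 3) for level, m in means.items()})

# Going back recovers the original coefficients (up to floating-point error).
print(tuple(round(b, 3) for b in betas_from_means(means)))
```

The Young and Middle means come out to 21.5 and 27.75, i.e. the $2150 and $2775 quoted for the regression fit.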
20-15
Parameters in ANOVA
• Need to estimate all of the cell means $\mu_1, \mu_2, \ldots, \mu_r$ and also $\sigma^2$
• F-test answers the question of whether $\mu_i$ depends on $i$. That is, we test the null hypothesis $H_0: \mu_1 = \mu_2 = \cdots = \mu_r$ against the alternative that not all the means are the same.
20-16
Notation
• “DOT” indicates to sum over that index, “BAR” indicates to take the average.
• Overall or grand mean is
$$\bar{Y}_{\cdot\cdot} = \frac{1}{n_T} \sum_i \sum_j Y_{ij}$$
• Mean for factor level $i$ is
$$\bar{Y}_{i\cdot} = \frac{1}{n_i} \sum_j Y_{ij}$$
20-17
Estimates
• Each group mean is estimated by the mean of the observations within that group:
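$$\hat{\mu}_i = \bar{Y}_{i\cdot} = \frac{1}{n_i} \sum_j Y_{ij}$$
• Cell variances are estimated by
$$s_i^2 = \frac{1}{n_i - 1} \sum_j \left(Y_{ij} - \bar{Y}_{i\cdot}\right)^2$$
• Note: $n_i$ is the number of obs. in cell $i$. If all the $n_i$ are the same, we usually just write $n$.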
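As an illustration of these estimates, the following Python sketch (not the lecture's SAS) computes the sample mean and sample variance for each cell. The data are hypothetical and are NOT the Cash Offers data; the group sizes differ on purpose to show the role of $n_i$.

```python
from statistics import mean, variance

# Hypothetical offers in hundreds of dollars (made up for illustration).
offers = {
    "Young": [21, 23, 19, 22],
    "Middle": [28, 27, 29, 26, 30],
    "Elderly": [20, 22, 21],
}

estimates = {}
for level, ys in offers.items():
    n_i = len(ys)        # number of observations in cell i
    mu_hat = mean(ys)    # estimate of mu_i: the cell sample mean
    s2_i = variance(ys)  # sample variance, divides by n_i - 1
    estimates[level] = (n_i, mu_hat, s2_i)
    print(level, n_i, mu_hat, round(s2_i, 3))
```

`statistics.variance` uses the $n_i - 1$ divisor, matching the formula for $s_i^2$.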
20-18
Pooled Variance Estimate
• The variances are assumed to be the same, so we pool the cell variances to get an overall variance estimate:
$$MSE = \frac{\sum_i (n_i - 1)\, s_i^2}{\sum_i (n_i - 1)} = \frac{\sum_i \sum_j \left(Y_{ij} - \bar{Y}_{i\cdot}\right)^2}{n_T - r}$$
• Pooling is weighted according to the number of observations in each group.
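The pooling formula can be checked against the Cash Offers output. This Python sketch (not the lecture's SAS) uses only the group sizes and standard deviations reported by the MEANS statement elsewhere in these notes; the variable names are illustrative.

```python
# (n_i, s_i) for each age group, as reported by the MEANS statement.
cells = {
    "Elderly": (12, 1.67649),
    "Middle": (12, 1.28806),
    "Young": (12, 1.73205),
}

# Numerator: sum of (n_i - 1) * s_i^2, which equals SSE.
sse = sum((n - 1) * s ** 2 for n, s in cells.values())
# Denominator: sum of (n_i - 1), which equals n_T - r.
df_error = sum(n - 1 for n, _ in cells.values())
mse = sse / df_error

print(round(sse, 1), df_error, round(mse, 2))
```

Up to rounding, this reproduces the Error line of the ANOVA table (SS 82.2, DF 33, MS 2.49).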
20-19
SAS Coding for ANOVA
proc glm data=cash;
   class age;
   model offer=age;
   means age;
run;
• Class statement causes AGE to be treated as a classification (categorical) variable.
• Means statement produces table of means and standard deviations.
20-20
ANOVA Output
Source DF SS MS F Value Pr > F
Model 2 316.7 158.4 63.60 <.0001
Error 33 82.2 2.49
Total 35 398.9
• Exactly the same as the regression model
• 2 DF in the model since 3 levels for AGE
• F-statistic indicates model significance; there is some difference in the age groups
20-21
ANOVA Output (2)
Level of ---------offer---------
age N Mean Std Dev
Elderly 12 21.4167 1.67649
Middle 12 27.7500 1.28806
Young 12 21.5000 1.73205
• Note these results are the same as in the regression approach
• It seems apparent here where the difference in age groups is, but it is important to do statistical tests to obtain “groupings” (more in a later topic).
20-22
Partitioning Variation
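• Break down the difference between an observation and the grand mean into two parts:
$$\underbrace{Y_{ij} - \bar{Y}_{\cdot\cdot}}_{\text{Total deviation}} = \underbrace{\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}}_{\substack{\text{BETWEEN: deviation of estimated factor} \\ \text{level mean around grand mean}}} + \underbrace{Y_{ij} - \bar{Y}_{i\cdot}}_{\substack{\text{WITHIN: deviation around} \\ \text{estimated factor level mean}}}$$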
20-23
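[Figure: diagram marking the three deviations $Y_{ij} - \bar{Y}_{i\cdot}$, $\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}$, and $Y_{ij} - \bar{Y}_{\cdot\cdot}$]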
20-24
Sums of Squares
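• If we square both sides of the equation on Slide 22 and sum over all observations, the cross-terms in $(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})(Y_{ij} - \bar{Y}_{i\cdot})$ sum to zero and the equation works out nicely to:
$$\underbrace{\sum_{i,j} \left(Y_{ij} - \bar{Y}_{\cdot\cdot}\right)^2}_{SSTO} = \underbrace{\sum_{i,j} \left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)^2}_{SSTrt} + \underbrace{\sum_{i,j} \left(Y_{ij} - \bar{Y}_{i\cdot}\right)^2}_{SSE}$$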
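The decomposition can be verified numerically. The Python sketch below (not the lecture's SAS) uses small hypothetical groups of unequal size, not the Cash Offers data, and checks both that the cross-term sums to zero and that SSTO = SSTrt + SSE.

```python
# Hypothetical data, made up for illustration.
groups = {
    "A": [21.0, 23.0, 19.0, 22.0],
    "B": [28.0, 27.0, 29.0, 26.0, 30.0],
    "C": [20.0, 22.0, 21.0],
}

all_y = [y for ys in groups.values() for y in ys]
grand = sum(all_y) / len(all_y)                              # grand mean
means = {g: sum(ys) / len(ys) for g, ys in groups.items()}   # factor level means

ssto = sum((y - grand) ** 2 for y in all_y)
sstrt = sum(len(ys) * (means[g] - grand) ** 2 for g, ys in groups.items())
sse = sum((y - means[g]) ** 2 for g, ys in groups.items() for y in ys)

# Cross-term: sum over i, j of (Ybar_i. - Ybar_..) * (Y_ij - Ybar_i.)
cross = sum((means[g] - grand) * (y - means[g])
            for g, ys in groups.items() for y in ys)

print(ssto, sstrt + sse, cross)
```

The cross-term vanishes because, within each group, the deviations from the group mean sum to zero.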
20-25
Analysis of Variance Table
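Source      DF          SS                                                              MS
Model/Trt   $r - 1$     $\sum_i n_i \left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)^2$     $SSTrt / df_{Trt}$
Error       $n_T - r$   $\sum_{i,j} \left(Y_{ij} - \bar{Y}_{i\cdot}\right)^2$                   $SSE / df_E$
Total       $n_T - 1$   $\sum_{i,j} \left(Y_{ij} - \bar{Y}_{\cdot\cdot}\right)^2$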
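As a worked instance of this table, the Python sketch below (not the lecture's SAS) rebuilds the Cash Offers ANOVA table from the group sizes, means, and standard deviations reported by the MEANS statement. It relies on the identity $SSE = \sum_i (n_i - 1) s_i^2$; the variable names are illustrative.

```python
# (n_i, mean_i, sd_i) for each age group, from the MEANS output.
cells = {
    "Elderly": (12, 21.4167, 1.67649),
    "Middle": (12, 27.7500, 1.28806),
    "Young": (12, 21.5000, 1.73205),
}

n_T = sum(n for n, _, _ in cells.values())
r = len(cells)
grand = sum(n * m for n, m, _ in cells.values()) / n_T

sstrt = sum(n * (m - grand) ** 2 for n, m, _ in cells.values())  # between groups
sse = sum((n - 1) * s ** 2 for n, _, s in cells.values())        # within groups
df_trt, df_err = r - 1, n_T - r
mstr, mse = sstrt / df_trt, sse / df_err
f_stat = mstr / mse

print(f"{'Source':<8}{'DF':>4}{'SS':>9}{'MS':>9}{'F':>8}")
print(f"{'Model':<8}{df_trt:>4}{sstrt:>9.1f}{mstr:>9.1f}{f_stat:>8.2f}")
print(f"{'Error':<8}{df_err:>4}{sse:>9.1f}{mse:>9.2f}")
print(f"{'Total':<8}{n_T - 1:>4}{sstrt + sse:>9.1f}")
```

Up to rounding in the reported summary statistics, the result agrees with the SAS output (Model SS 316.7, Error SS 82.2, F = 63.60).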
20-26
Sources of Variation
• MODEL line represents variation BETWEEN groups
• ERROR line represents variation WITHIN groups
• Ratio of Model to Error Mean Squares yields F-test as usual.
20-27
Expected Mean Squares
• Can show that
$$E(MSTR) = \sigma^2 + \frac{1}{r - 1} \sum_i n_i \left(\mu_i - \mu_{\cdot}\right)^2$$
where $\mu_{\cdot}$ is the grand mean.
• $E(MSE) = \sigma^2$
• The ratio MSTR / MSE will tend to be near 1 if there is no treatment effect and will tend to be bigger than 1 if there is a treatment effect.
• See page 696 for how to find the expected mean squares.
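The expected mean squares can also be checked by simulation. This is a Monte Carlo sketch in Python (not the lecture's SAS); the means, sigma, group size, number of replications, and function names are all hypothetical choices for illustration. With equal group sizes, the grand mean is simply the average of the $\mu_i$.

```python
import random

def mean_squares(groups):
    """Return (MSTR, MSE) for a list of groups of observations."""
    n_T = sum(len(ys) for ys in groups)
    r = len(groups)
    means = [sum(ys) / len(ys) for ys in groups]
    grand = sum(sum(ys) for ys in groups) / n_T
    sstr = sum(len(ys) * (m - grand) ** 2 for ys, m in zip(groups, means))
    sse = sum((y - m) ** 2 for ys, m in zip(groups, means) for y in ys)
    return sstr / (r - 1), sse / (n_T - r)

def average_mean_squares(mu, sigma, n, reps=4000, seed=1):
    """Average MSTR and MSE over many data sets simulated from the cell means model."""
    rng = random.Random(seed)
    total_mstr = total_mse = 0.0
    for _ in range(reps):
        groups = [[m + rng.gauss(0.0, sigma) for _ in range(n)] for m in mu]
        mstr, mse = mean_squares(groups)
        total_mstr += mstr
        total_mse += mse
    return total_mstr / reps, total_mse / reps

# Hypothetical parameters: three levels, equal group sizes.
mu, sigma, n = [10.0, 10.5, 11.0], 1.0, 12
avg_mstr, avg_mse = average_mean_squares(mu, sigma, n)

mu_dot = sum(mu) / len(mu)
expected_mstr = sigma ** 2 + n * sum((m - mu_dot) ** 2 for m in mu) / (len(mu) - 1)

print(round(avg_mstr, 2), expected_mstr)   # simulated average vs. formula
print(round(avg_mse, 2), sigma ** 2)
```

Setting all entries of `mu` equal makes the second term vanish, so both mean squares then average to about $\sigma^2$.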
20-28
F-test
• $H_0: \mu_1 = \mu_2 = \cdots = \mu_r$
• $H_a$: Not all the $\mu_i$ are equal
• F = MSTR / MSE
• Under $H_0$, $F \sim F(r - 1,\, n_T - r)$, so reject $H_0$ if F is bigger than the critical value at significance level $\alpha$
• SAS reports the p-value for this test in the ANOVA table. Reject $H_0$ if the p-value is less than the significance level $\alpha$.
20-29
Example (Cash Offers)
Source DF SS MS F Value Pr > F
Model 2 316.7 158.4 63.60 <.0001
Error 33 82.2 2.49
Total 35 398.9
• P-value < 0.0001 so there is some difference among the age groups
• We can see where the difference is from the plots and from the output of the MEANS statement shown earlier, but we will check with a formal test later.
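SAS only reports the p-value as < .0001. For this particular example it can be computed exactly without statistical software. The Python sketch below is not from the lecture and rests on a standard property of the F distribution: when the numerator DF is exactly 2 (as here, with three age groups), $P(F > f) = \left(\frac{\nu_2}{\nu_2 + 2f}\right)^{\nu_2/2}$, where $\nu_2$ is the error DF. The function names are hypothetical, and the formula does not apply for other numerator DF.

```python
def f_pvalue_df1_2(f, df2):
    """Upper-tail probability P(F > f) for F(2, df2).
    Closed form valid ONLY when the numerator DF is 2."""
    return (df2 / (df2 + 2.0 * f)) ** (df2 / 2.0)

def f_critical_df1_2(alpha, df2):
    """Critical value f with P(F > f) = alpha for F(2, df2) (inverse of the above)."""
    return (df2 / 2.0) * (alpha ** (-2.0 / df2) - 1.0)

# F statistic and error DF from the Cash Offers ANOVA table.
p_value = f_pvalue_df1_2(63.60, 33)
critical = f_critical_df1_2(0.05, 33)

print(f"p-value = {p_value:.3g}")
print(f"critical value at alpha = 0.05: {critical:.2f}")
```

The observed F of 63.60 is far beyond the 5% critical value, and the computed p-value is many orders of magnitude below 0.0001, in line with the reported "<.0001".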
20-30
Assumptions
• Constant variance of errors
• Residual plot: Residuals vs. X (Age Group)
• No obvious problems with the variance.
20-31
Assumptions (2)
• Normality of errors
• No major violations of normality
• If minor violations, generally ok for ANOVA
20-32
Assumptions (3)
• There are some slight differences in how these are assessed for ANOVA, and also in how we fix problems if they exist.
• Will discuss diagnostics/remedial measures in greater detail later.
20-33
Upcoming in Lecture 21...
• Factor Effects Model
• Power/Sample Size Planning
• Sections 16.7, 16.10-11