upc notes for theory sessions smde-miri-fiblmontero/lmm_tm/smde - (us... · (n ) yt ,y = 1 be a...

UPC Notes for theory sessions SMDE-MIRI-FIB: Statistical Modeling: Normal response data – General Linear

Models

Statistical Modeling and Design of Experiments- MIRI - FIB - UPC

Prof. Lídia Montero © Term 2. 01 3- 2. 01 4 Q1

TABLE OF CONTENTS

1 READINGS _______________________________________________________________________________________________________________ 3 2 INTRODUCTION TO LINEAR MODELS _____________________________________________________________________________________ 4 3 LEAST SQUARES ESTIMATION IN MULTIPLE REGRESSION _______________________________________________________________ 12 3.1 GEOMETRIC PROPERTIES __________________________________________________________________________________________________ 14 4 LEAST SQUARES ESTIMATION: INFERENCE ______________________________________________________________________________ 16 4.1 BASIC INFERENCE PROPERTIES _____________________________________________________________________________________________ 16 5 HYPOTHESIS TESTS IN MULTIPLE REGRESSION __________________________________________________________________________ 20

5.1 TESTING IN R ____________________________________________________________________________________________________________ 22

5.2 CONFIDENCE INTERVAL FOR MODEL PARAMETERS _____________________________________________________________________________ 23 6 GOODNESS OF FIT: MULTIPLE CORRELATION COEFFICIENT _____________________________________________________________ 24 6.1 PROPERTIES OF THE MULTIPLE CORRELATION COEFFICIENT ____________________________________________________________________ 25 6.2 R2-ADJUSTED ___________________________________________________________________________________________________________ 25 7 GLOBAL TEST FOR REGRESSION. ANOVA TABLE _________________________________________________________________________ 26 7.1 EXAMPLE: R OUTPUT ON R2, R2-ADJUSTED AND GLOBAL TEST __________________________________________________________________ 27 8 MODEL VALIDATION ____________________________________________________________________________________________________ 28 8.1 A PRIORI INFLUENTIAL OBSERVATIONS ______________________________________________________________________________________ 38 8.2 A POSTERIORI INFLUENTIAL DATA __________________________________________________________________________________________ 40 9 BEST MODEL SELECTION _______________________________________________________________________________________________ 42 9.1 STEPWISE REGRESSION ___________________________________________________________________________________________________ 44 10 INTRODUCTION TO GENERAL LINEAR MODEL __________________________________________________________________________ 49 10.1 EXAMPLE: PRESTIGE OF CANADIAN OCCUPATIONS IN DATA.FRAME PRESTIGE IN CAR LIBRARY FOR R (FOX AND WEISBER 2011) ___________ 52 11 INTRODUCTION TO GENERAL LINEAR MODEL: TWO-WAY ANOVA ______________________________________________________ 57 11.1 EXAMPLE: PRESTIGE OF CANADIAN OCCUPATIONS OF PROFESSION VS TYPE AND FEMENINE FACTOR (FOX AND WEISBER 2011) ____________ 63 12 MORE COMPLEX ANOVA MODELS ______________________________________________________________________________________ 68 13 ANALYSIS OF COVARIANCE (ANCOVA MODELS) ________________________________________________________________________ 69 13.1 EXAMPLE: PRESTIGE OF CANADIAN OCCUPATIONS IN DATA.FRAME PRESTIGE IN CAR LIBRARY FOR R (FOX AND WEISBER 2011) ___________ 72



1 READINGS

Basic References:

Fox, J. Applied Regression Analysis and Generalized Linear Models. Sage Publications, 2nd Edition 2008. James H. Stock and Mark W. Watson, Introduction to Econometrics, Prentice Hall, 2007 Seber, G.A.F.: Linear Regression Analysis. Wiley, 1977. Fox, J. An R Companion to Applied Regression. Sage Publications, 2nd Edition 2010. Peña, D.: Estadística. Modelos y métodos. Vol. 2, Modelos lineales y series temporales. Alianza Universidad

Textos, 1989.



2 INTRODUCTION TO LINEAR MODELS

Let ( )nT yy ,,y 1= be a vector of n observations considered a draw of the random vector

( )nT YY ,,Y 1= , whose variables are statistically independent and distributed with expectation

( )nT µµ ,,1 =µ :

In linear models, the Random Component distribution for ( )nT YY ,,Y 1= is assumed normal with

constant variance 2σ and expectation [ ] µ=YΕ .

So, the response variable is modeled as normal distributed, thus negative or positive values, arbitrary small or large can be encountered as data for the response and prediction.

The Systematic component of the model consists on specifying a vector called the linear predictor, notated η , of the same length as the response, dimension n, obtained from the linear combination

of regressors (explicative variables) . In vector notation, parameters are ( )pT ββ ,,1=β and

regressors are ( )pT XX ,,X 1= and thus βη X= where η is nx1, X is nxp and β is px1.

Vector µ is directly the linear predictor η , thus the link function is µη = .




Empirical problem: What do data say about class sizes and test scores according in The California Test Score Data Set

All K-6 and K-8 California school districts (n = 420) Variables:

• 5th grade test scores (Stanford-9 achievement test, combined math and reading), district average

• Student-teacher ratio (STR) = no. of students in the district divided by no. full-time equivalent teachers

Policy question: What is the effect of reducing class size by one student per class? by 8 students/class? Do districts with smaller classes (lower STR) have higher test scores?




Stock and Watson (2007)



2 SOME NOTATION AND TERMINOLOGY

• The population regression line is

Test Score = β1 + β2STR

β1 is the intercept β2 is the slope of population regression line

= Test score

STR∆

∆ = change in test score for a unit change in STR

• Why are β1 and β2 “population” parameters? • We would like to know the population value of β2.

• We don’t know β2, so must estimate it using data. • How can we estimate β1 and β2 from data?

We will focus on the least squares (“ordinary least squares” or “OLS”) estimator of the unknown parameters β1 and β2, which solves,

( ) ( ) ( ) ( ) ( )∑ −−=−⋅−==

kk21k

Tββ xββYSMin

21

2, XβYXβYββ



2 SOME NOTATION AND TERMINOLOGY: EXAMPLE

The OLS estimator minimizes the average squared difference between the actual values of Yi and the prediction (predicted value) based on the estimated line.

Application to the California Test Score – Class Size data

Estimated slope = = – 2.28

Estimated intercept = = 698.9

Estimated regression line: = 698.9 – 2.28 STR

Interpretation of the estimated slope and intercept:

• Districts with one more student per teacher on average have test scores that are 2.28 points lower.

• The intercept (taken literally) means that, according to this estimated line, districts

with zero students per teacher would have a (predicted) test score of 698.9. It makes no sense, since it extrapolates outside the range of the data.



2 SOME NOTATION AND TERMINOLOGY: EXAMPLE

Predicted values & residuals:

One of the districts in the data set for which STR = 25 and Test Score = 621

predicted value: = 698.9 – 2.28*25 = 641.9

residual: = 621 – 641.9 = -20.9

The OLS regression line is an estimate, computed using our sample of data; a different sample would have given a

different value of 2β .

κy

ky


Prof. Lídia Montero © Page 1 0 Term 2. 01 3- 2. 01 4 Q1

2 SOME NOTATION AND TERMINOLOGY

How can we:

quantify the sampling uncertainty associated with 2β ? use 2β to test hypotheses such as 2β = 0? construct a confidence interval for 2β ?

We are going to proceed in four steps:

• The probability framework for linear regression • Estimation • Hypothesis Testing • Confidence intervals

A vector-matrix notation for regression elements will be considered since it simplifies the mathematical framework when dealing with several explicative variables (regressors) .




Classification of statistical tools for analysis and modeling Explicative Variables

Response Variable Dicothomic or

Binary Polythomic Counts

(discrete) Continuous

Normal Time between events

Dicothomic Contingency tables Logistic regression Log-linear models

Contingency tables Log-linear models

Log-linear models

Tests for 2 subpopulation means: t.test

Survival Analysis

Polythomic Contingency tables Logistic regression Log-linear models

Contingency tables Log-linear models

Log-linear models

ONEWAY, ANOVA

Survival Analysis

Continuous (covariates)

Logistic regression * Log-linear models

Multiple regression

Survival Analysis

Factors and covariates

Logistic regression * Log-linear models

Covariance Analysis

Survival Analysis

Random Effects

Mixed models Mixed models Mixed models

Mixed models Mixed models



3 LEAST SQUARES ESTIMATION IN MULTIPLE REGRESSION

Assume a linear model without any distribution hypothesis,

εXβεμY +=+= , where Y is nx1, X is the design matrix nxp and β is the vector parameters px1

Let Y be a numeric response variable, and ε be the model error

+

==+=

=

−−−

===

pp

npnn

pnnn

p

pj

p

jj

n

xxxxxx

xxxxxx

y

y

ε

εε

β

ββ

2

1

2

1

32

13121

22322

1

3

13

2

12

1

11

11

βXY

1

ε

The Ordinary Least Squares estimation of the model parameters β can be written in the general case as,

( ) ( ) ( ) ( ) YX2XXYYxXYXY1

2 TTTTT

ni

Tkk

T YSMin ββββββββ −+=−=−⋅−= ∑=

The OLS estimator minimizes the average squared difference between the actual values of Yi and the prediction (predicted value) based on the estimated line. The matrix-vector notation is used along these notes.



3 LEAST SQUARES ESTIMATION

First order condition for a mínimum of ( )βS are:

( ) ( ) pjSSj

,,10β

0 ==∂∂

↔=∇βββ

Once derivatives are computed, the so-called normal equations appear,

( ) ( ) ( ) YXXXˆb0YX2XX20 1 TTTTSS −

==→=−=∂

∂↔=∇ ββ

ββββ

• The solution of the normal equations is the least square estimator β of the β parameters.

• If the design matrix is not singular, i.e, column rang of X is p, then β the solution to normal equations is unique.

• Perfect multicollinearity is when one of the regressors is an exact linear function of the other regressors.

• If X is not full-rang and this happens where there is perfect multicollinearity, then infinite solutions

exist, but all of them give βXμy ˆˆˆ == .




The rang of XXT is equal to the rang of X. This attribute leads to two criteria that must be met in

order to ensure XXTis non singular and thus obtain a unique solution:

1. At least as many observations as there are coefficients in the model are needed.

2. The columns of X must not be perfectly linearly related, but even near collinearity can cause statistical problems.

3.1 Geometric properties

Let ( )Xℜ be the linear variation expanded by the columns of X,

( ) { } np ℜ⊂ℜ∈==ℜ ββμμ XX .

And βXμy ˆˆˆ == the predictions once computed the least squared estimator of model parameters,




Then it can be shown that μ is the ortogonal projection of Y and it is unique being the projection

operator defined as, ( ) TT XXXXH −= , and called the hat matrix since applied to Y provide the

fitted values or predictions tipically noted as Y ,

( ) TT XXXXH −= puts a ^ to Y: ( ) HYYXXXXXbY TT ===

−ˆ .

Graphically,

Properties of the Hat Matrix:

• It depends solely on the predictor variables X

• It is square, symmetric and idempotent: HH = H

• Finally, the trace of H is the degrees of freedom for the model

iy

iy

iii yye ˆ−=



4 LEAST SQUARES ESTIMATION: INFERENCE

4.1 Basic Inference Properties A continuous response linear model is assummed where,

εXβεμY +=+= or Yi = β1 + β2X2i + … + βpXpi + εi, i = 1,…,n

where Y ,µ are nx1, X nxp of rang p and β px1 and the conditional distribution of unbiased

errors are i.i.d. of constant variance and normal ( )2σ0,N| ≈Xε - equivalent to ( )nn I,XNY 2σβ≈ .

…Then, the minimum variance unbiased estimator of β is ( ) YXXXˆ TT 1−=β , is the ordinary

least square estimator and is equal to the maximum likelihood estimator MVβ .

If normal assumption is not met then OLS estimators are not efficient (do not have mínimum variance as ML estimator).

If the hypothesis hold then the unbiased estimator of 2σ , noted

2s is efficient (mínimum variance),

( ) ( )pn

RSSpnpn

s−

=−

−⋅−=

−⋅

=βXYβXYee

TT ˆˆ2

Residual Sum of Squares




Theoretical optimal conditions (left) vs heterocedasticity in a case study




What we knew before presenting the statistical properties for OLS was about a sample (in particular, our sample) but now we know something about a larger set of observations from which this sample is drawn.

We consider the statistical properties of ( ) YXXXˆ TT 1−=β , an estimator of something, specifically

the population parameter vector β and the former theorem says to us that OLS estimator is linear and unbiased. It is the best, the most efficient (with smaller variance) than any other estimator and has a normal sampling distribution.

Any individual coefficient jβ is distributed normally with expectation jβ and sampling variance

( ) ( )TjjXXV Tj

2ˆ σ=β and we can test the simple hypothesis (i.e., make some inference):

00 : jjH ββ = with ( )

)1,0(ˆ 0

0 NZTjj

jj ≈−

=XXTσ

ββ

But since it does not help so much since jβ and 2σ are unknown an unbiased estimator of

2σ is proposed

based on the standard error of regression 2s and to estimate the sample variance of jβ , ( )jV βˆ .




( ) ( ) ( ) ( )TjjTjj sSEs XXXXV T

jT

j =→= ββ ˆˆˆ 2and the ratio of jβ and ( )jβSE is distributed as a

Student t with n-p degrees of freedom (since jβ and 2s are independent)

( ) pnTjj

jj tStudents

t −≈−

=XXT

0

0

ββcan be defined and thus ( ) ( )00 ttPHP pn >= − computed or a

bilateral confidence interval at ( )%100 1 α− for ( )jpnjj SEt βββ α ˆˆ 2−±∈

Inference for Multiple Coefficient will be presented further by F-test



5 HYPOTHESIS TESTS IN MULTIPLE REGRESSION

Let a statistical model for multiple regression for a given data set be,

εβεµ +=+= XY , where Y nx1, X nxp of rang p and β px1

Where errors are unbiased and iid normally distributed ( )nn I20,N σ≈ε . Ordinary least

square estimators are noted as β .

Let a simpler model of multiple regression for the same set of data be the former one with a set of linear restrictions applying to the model parameters (a nested model); i.e,

εβεµ +=+= XY , where Y nx1, X nxp of rang p and β px1

subject to linear constraints cA =β that define a linear hypothesis to contrast (to make

inference) that we call H, A is a qxp matrix of rang q < p. Ordinary least square estimates subject

to constraints are Hβ . Residual Sum of Squares is noted as RSSH.




• If the hypothesis H is true than it can be shown that,

( )( )

( ) ( )( ) ( )pnq

TTT

H FsqpnRSS

qRSSRSSF −

−−

→−−

=−

−= ,2

11 ˆXXˆ cAAAcA ββ

Example: Suppose we have a model with a set of parameters

( )4321 ββββ=Tβ

and we want to test the hypothesis

=−=−+

024

:21

421

βββββ

H then

=

−

−→=

02

00114011

4

3

2

1

ββββ

β cA




5.1 Testing in R anova()

• For nested linear regression models, method anova(fullmodel, restrictedmodel) implements F-test. Hypothesis testing in this subject will be suggested by this method.

• Inconvenient: The two models have to be previously computed

Linear.hypothesis() in car package: following the previous generic example

library(car)

linearHypothesis(model,

hypothesis.matrix=matrix(c(1,1,0,-4,1,-1,0,0),nrow=2,ncol=4,byrow=TRUE),

rhs=as.vector(c(2,0)) )

For glm() object performs a Wald test

It requires the estimation of one model only (compared to anova() method)




5.2 Confidence interval for model parameters

Individual confidence interval for iβ in OLS resumes:

pnii tt

i

−≈−

=β

σββ

ˆˆ

ˆ → i

pni tβ

α σβ ˆˆˆ 2−± donde ( ) 1−

= iiT XXs

iβσ ˆˆ

y pnRSSs−

== σ

2αpnt − is the t de Student for bilateral confidence interval 1-α . Degrees of freedon are (n-p) and

correspond to the standard error of regression.

Estimates of iβ parameters are statistically dependent and individual confidence intervals might give a wrong idea of the jointly distributed values.

There is a large literature about how to build confidence regions or perform simultaneous test of several hypothesis: it is out of the scope of this material, some particular suggestions will be presented in the practical sessions.



6 GOODNESS OF FIT: MULTIPLE CORRELATION COEFFICIENT

Multiple correlation coefficient R, is a goodness of fit measured of a regression model defined as the Pearson correlation

coefficient between fitted values ky and observations ky :

The squared of the multiple correlation coefficient R2 is called the coefficient of determination...

1. TSS=( )∑ −

kk yy 2

where ∑=k

kyn

y 1 is the mean of the observed response data.

2. ESS= ( )∑ −k

k yy 2ˆ and RSS= ( )∑ −k

kk yy 2ˆ .

3. TSS=ESS+RSS, ( )∑ −k

k yy 2= ( )∑ −

kk yy 2ˆ + ( )∑ −

kkk yy 2ˆ ,

)ˆ,( yycorR =

( )( ) TSS

RSSTSSESS

yy

yyR

kk

kk

−==−

−=∑∑

1ˆˆ

2

2

2



6 MULTIPLE CORRELATION COEFFICIENT …

Or equivalently ( )TSSRRSS 21−= .

6.1 Properties of the Multiple Correlation Coefficient

1. 1≤R and if |R|=1 indicates a perfect linear relation between response data and regressors.

2. 2100R represents the fraction of response data variability explained by the current model.

6.2 R2-adjusted Since this correlation cannot go down when variables are added, an adjustment must be

made to the R2 in order to increase only when really significative regressors are added to the model. This is the adjusted R-Squared:

( )( ) ( )

−−

−−=−−

−=pn

nRnSCT

pnSCRRa111

11 22

Adjusted R-Square is always less to the original R-Square and might be negative.



7 GLOBAL TEST FOR REGRESSION. ANOVA TABLE

The global test of regression is a particular case of a multiple contrast of hypothesis where all parameters related to explicative variables are tested to be simultaneously zero.

H: 002 == pββ ,, . ( )

( )( ) ( )

( )( )( ) ( ) ,,121

11pnp

H Fsp

ESSpnTSS

pESSpnRSSpRSSTSS

pnRSSqRSSRSSF −−≈

−=

−−

=−

−−=

−−

=



7.1 Example: R Output on R2, R2-adjusted and Global Test

> summary(m5) Call: lm(formula = sl ~ sx + f.doctor * yd, data = sous) Residuals: Min 1Q Median 3Q Max -5667.7 -1988.7 -442.5 1466.5 8810.9 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 20759.6 2429.0 8.547 3.94e-11 *** sxFemale -1809.7 1139.4 -1.588 0.118908 f.doctorYes -4423.1 2703.0 -1.636 0.108450 yd 175.3 101.1 1.734 0.089546 . f.doctorYes:yd 437.5 123.1 3.555 0.000874 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3532 on 47 degrees of freedom Multiple R-squared: 0.6716, Adjusted R-squared: 0.6436 F-statistic: 24.03 on 4 and 47 DF, p-value: 7.256e-11



8 MODEL VALIDATION

Residual analysis constitutes a practical tool for graphically assessing model fitting and satisfaction of optimal hypothesis for OLS estimates:

Let a model a continuous response be,

εβεµ +=+= XY , where Y nx1, X is nxp of column range p and β is px1

And errors are unbiased, uncorrelated and normally distributed with constant variance; i.e.: ( )nn I0,N 2σ≈ε or equivalently ( )nn I,XNY 2σβ≈ .

Residuals are the difference between observed response values and fitted values :

( )YHIYYe −=−= ˆ . Do not confuse errors and residuals; residuals are the observed errors when the proposed model is correct.



8 MODEL VALITION: RESIDUAL ANALYSIS

In order to interpret residuals, some transformed residuals are defined:

1. Scaled residual ic defined as i residual divided by the standard error of regression

estimate for the model, s, sec i

i = . It is not so bad when leverages show no big changes,

since [ ] ( )iii heV −= 12σ .

2. Standarized residual id is defined as the residual divided by its standard error:

ii

ii hs

ed−

=1 .

3. Studentized residual ir is defined as ( ) iii

ii hs

er−

=1 where ( )

( ) ( )1122

2

−−−−−

=pn

hespns iiii

• Outliers for ir can be detected using t.Student lower and upper bounds for sample size or

by univariate descriptive graphical tools as a boxplot.



8 RESIDUAL ANALYSIS: DIAGNOSTIC PLOTS

Residual analysis - usual plots included in default plot( model )

1. Histogram residuals: a normal density is checked. hist(rstudent(model), freq=F)

curve(dt(x, model $df),col=2,add=T)

2. Boxplot residuals to identify outliers of regression.

3. Normal Probability Plot or Quantile Plot of standardized or studentized residuals.

In R the confidence envelope for studentized residuals is shown loading car package and method,

library(car)

qqPlot( model, simulate=T, labels=F)

Histogram of rstudent(an

rstudent(anscombe.lmA)

Den

sity

-3 -2 -1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

-2 -1 0 1 2

-2-1

01

Empirical quantiles of St

t Quantiles

Stu

dent

ized

Res

idua

ls(a

nsco

mbe

.lmA

)




When normality assumption is not fully met than the constrast of hypothesis based on Student o F Test are approximate and estimates are not efficient.

Tests for normality assessment can be used. Shapiro-Wilk is one of my favorites, but there are a lot of tests depending on sample size. Package nortest in R contains some common possibilities. Even on non normal errors, residuals tend to normality (due to Central Limit Theorem) on medium and large samples.

Residuals are correlated with observations iY , but not with fitted values iY , then scatterplots with fitted values on X axis are suggested.

4. Scatterplots: ii Yvse ˆ , or ii Yvsd ˆ o ii Yvsr ˆ . Failing of linearity can be detected (transformations or addition of regressors might be needed) or heterocedasticity (transformations required). Unusual observations might difficult the interpretation.



5 6 7 8 9 10

-2

-1

0

1

2

Fitted Value

Sta

ndar

dize

d R

esid

ual

Residuals Versus the Fitted Values(response is YA)


1098765

1

0

-1

-2

Fitted Value

Sta

ndar

dize

d R

esid

u al

Residuals Versus the Fitted Values(response is YB)




5. Scatterplots of Residuals vs each regressor (except intercept).

Horizontal band indicates linearity satisfaction and homocedasticity.

Homocedasticity Breusch-Pagan test in package lmtest might be of interest. > library(lmtest)

> bptest(model)

6. Residual vs time/order or any omitted variable in the model suspected to affect hypothesis

Such as, autocorrelation function for residuals (method in R, acf(rstudent(model))

of order k, ( )∑∑ +=

i k

i kii

eee

kr 2 and contrats

for simultaneous hypothesis r(j)=0 for j>k.

Or Durbin-Watson test (tables very difficult to interpret) also for autocorrelation testing.

0 10 20 30 40 50 60 70 80 90 100

-2

-1

0

1

2

3

EDUCATION

SR

ES

1



8 MODEL VALIDATION: UNUSUAL AND INFLUENTIAL DATA

It is easy to find examples for regression analysis that show the existence of a few observations that strongly affect the estimation of model parameters in OLS: one percent of data might have a weight in parameter estimation greater than 99% data.

Influential data affects model prediction and it seems natural that predicted values should be supported by 99% of data and not seriously affected by the 1% left.

Classifying an observation as a priori influential is related to robustness of the design of data collection.

We have a technical case study (Anscombe data) presented in one of the lab sessions to further discuss.



8 UNUSUAL AND INFLUENTIAL DATA

Outlying Observations can cause us to misinterpret patterns in plots

• Temporarily removing them can sometimes help see patterns that we otherwise would not have

• Transformations can also spread out clustered observations and bring in the outliers • More importantly, separated points can have a strong influence on statistical models -

removing outliers from a regression model can sometimes give completely different results

Unusual cases can substantially influence the fit of the OLS model

• Cases that are both outliers and high leverage exert influence on both the slopes and intercept of the model

• Outliers may also indicate that our model fails to capture important characteristics of the data




Types of Unusual Observations

• A Regression Outlier is an observation that has an unusual value of the outcome variable Y, conditional on its value of the explanatory variable X o An observation that is unconditionally unusual in either its Y or X value is called a

univariate outlier, but it is not necessarily a regression outlier

o In other words, for a regression outlier, neither the X nor the Y value is necessarily unusual on its own.

o Regression outliers often have large residuals but do not necessarily affect the regression slope coefficient. Also sometimes referred to as vertical outliers.

o Outliers could increase standard errors of the estimated parameters (or fitted values) since the standard error of regression is computed from residuals:

( ) ( )pn

RSSpnpn

s−

=−

−⋅−=

−⋅

=βXYβXYee

TT ˆˆ2

.




Observation with Leverage (diagonal element of the hat matrix) o An observation that has an unusual X value - i.e., it is far from the mean of X - has

leverage on the regression line o The further the outlier sits from the mean of X (either in a positive or negative

direction), the more leverage it has o High leverage does not necessarily mean that it influences the regression

coefficients, it is possible to have a high leverage and yet follow straight in line with the pattern of the rest of the data. Such cases are sometimes called “good” leverage points because they help the precision of the estimates.

Influential Observations o An observation with high leverage that is also a regression outlier will strongly

influence the regression line o In other words, it must have an unusual X-value with an unusual Y-value given its X-

value o In such cases both the intercept and slope are affected, as the line chases the

observation

Discrepancy × Leverage = Influence




8.1 A priori influential observations

Simple regression: An observation that has an unusual X value - i.e., it is far from the mean of X - has leverage on the regression line The further the outlier lays from the mean of X (either in a positive or negative direction), the more leverage it has.

Multiple regression: we have to think in a cloud of points defined by regressors in X (each column in an axis) and center of gravity of those points.

Points x (pℜ∈x ) heterogenous regarding the cloud of X points and their center of gravity

identify a priori influential data.

The most common measure of leverage is the hat − value, hi, the name hat − values

results from their calculation based on the fitted values ( jy ): Leverages ih measure distance from the point ix to the center of gravity of the whole set of observation data.

And thus the average value for the leverage is np

nh

h i ii == ∑ .

Leverage cut-off: if obs i has hhii 2> or hhii 3> then is an unusual data.




According to Fox (figure 11.3 in Fox, 1997), the diagram to the left shows elliptical contours of hat values for two explanatory variables.

As the contours suggest, hat values in multiple regression take into consideration the correlational and variational structure of the X’s

As a result, outliers in multi-dimensional X-space are high leverage observations - i.e., the outcome variable values are irrelevant in calculating hi.




8.2 A posteriori influential data An influential observation implies that the inclussion of the data in OLS:

1. Modifies the vector of estimated parameter β .

2. Modifies the fitted values Y .

3. If the i-th observation is influential data when it is removed its fitted value is bad, leading to a high value of the absolute residual.

An influential observation is one that combines discrepancy with leverage•

4. The most direct approach to assessing influence is to assess how the regression coefficients change if outliers are omitted from the model. We can use Di j (often termed DFBetasi j) to do so:

( )( )jijjijD βσββ ˆˆˆˆ −=

i = 1, . . . , n; j = 1, . . . , p

Where the β are the coefficients for all the data and the ( )iβ are the coefficients for the same model with the ith observation removed.

A standard cut-off for an influential data is: Di j ≥ 2/√n (be careful in small samples !!!)




A problem with DFBetas is that each observation has several measures of influence -one for each coefficient np different measures.

Cook’s D overcomes the problem by presenting a single summary measure for each bservation iD :

( )( ) ( )( )pnp

ii

ii

ii

iiii F

pp

ppse

spD −≈

−

−=

−−= ,

2

2 11

1

ˆˆˆˆ ββββ XXTT

where ( )iβ are the coefficients for the same model with the ith observation removed.

• Cook’s D measures the ‘squared distance’ between β are the coefficients for all the data and the ( )iβ

are the coefficients for the same model with the ith observation removed by calculating an F-test for

the hypothesis that ( )iββ ˆ: =iH

• There is no significance test for Di but a commonly used cut-off is the Chatterjee y Hadi (88) that justifies an influential observation for those that satisfy Di >4/(n-p).

• For large samples, Chatterjee-Hadi cut-off does not work and as a rule of thumb Di >0.5 are suspected to be influential data and Di>1 are qualified as being influential (R criteria).



9 BEST MODEL SELECTION

The best regression equation for Y given the regressors ( p1 x,,x ) might contain dummy variables, transformations of the original variables and terms related to polynomial

regression (higher order tan linear for covariate variable) for the original variables (q1 z,,z ) .

Model selection should satisfy trade-off between simplicity and goodness of fit, often called parsimony criteria.

It is not practical to built all possible regression models and choose the best one according to some balance criteria.

A good model should show consistence to theoretical properties in residual analysis. Neither influential nor unsual data should be included.




Available elements to assess the quality of a particular multiple regression (goodness of fit) model are:

1. Determination coefficient, 2R . 2. Stabilibity on the standard error of regression estímate. Estimation of 𝜎2 by s2 on

underfitting is biased and greater than the true value. Stability on s2confirms or at least points to goodness of fit.

3. Residual analysis. 4. Unusual and influent data analysis.

5. And a new element, Mallows pC . Related to Akaike Information Criteria (AIC)

( )( )pAIC +−= y,β2 . Models with lower values of Cp or AIC indicator are preferred.

Some authors strongly recommend BIC (Bayesian Information Criteria) Schwartz

criteria ( ) npBIC logˆ2 +−= y,β where extra parameters are penalized. In R,AIC(model) for AIC on model objects for which a log-likelihood value can be obtained and AIC(model, k=log(dim(data.frame)[1]) ) for BIC.




9.1 Stepwise Regression Backward Elimination is an heuristic strategy to select the best model given a number of regressor and a maximal model built from them. It is a robust method that supressed non significative terms from the maximal model to the point that all mantained terms are statistically significative and can not be removed. It has been proven to be very effective for polynomial regression.

Forward inclusión is an heuristic strategy to select the best model given a set of regressor from the null model is iteratively adding terms and regressor in the target set. It is not a robust procedure and it is not recommended as an authomatic procedure to find the best model to a data set and regressor terms.

Stepwise Regression is a strategy that is forward increasing the starting model, but at each iteration regressor terms are checked for statistical significancy.

Criteria for adding/removing regressor terms varies in the different statistical packages, but F tests or AIC are commonly used. Partial correlation between Y and each Xj once some subset of regressors is already in the model has been proved succesful in the selection of regressors to increase the current model.




R software has a sophisticated implementation of these heuristics in the method step(model, target model) based on AIC criteria for model selection at each step. > step(duncan1.lm0, ~income+education, direction="forward",data=duncan1) #AIC direction "forward" Start: AIC=311.52 prestige ~ 1 Df Sum of Sq RSS AIC + education 1 31707 11981 255.30 + income 1 30665 13023 259.05 <none> 43688 311.52 Step: AIC=255.3 prestige ~ education Df Sum of Sq RSS AIC + income 1 4474.2 7506.7 236.26 <none> 11980.9 255.30 Step: AIC=236.26 prestige ~ education + income Call: lm(formula = prestige ~ education + income, data = duncan1)



Coefficients: (Intercept) education income -6.0647 0.5458 0.5987 > step(duncan1.lm2,data=duncan1) # Without scope direction is "backward" using AIC Start: AIC=236.26 prestige ~ income + education Df Sum of Sq RSS AIC <none> 7506.7 236.26 - income 1 4474.2 11980.9 255.30 - education 1 5516.1 13022.8 259.05 Call: lm(formula = prestige ~ income + education, data = duncan1) Coefficients: (Intercept) income education -6.0647 0.5987 0.5458 > step(duncan1.lm2,k=log(dim(duncan1)[1]),data=duncan1) # Without scope direction is "backward" using BIC Start: AIC=241.68 prestige ~ income + education Df Sum of Sq RSS AIC <none> 7506.7 241.68 - income 1 4474.2 11980.9 258.91 - education 1 5516.1 13022.8 262.66



Call: lm(formula = prestige ~ income + education, data = duncan1) Coefficients: (Intercept) income education -6.0647 0.5987 0.5458 > step(duncan1.lm0, ~income+education, data=duncan1) #AIC direction "both" Start: AIC=311.52 prestige ~ 1 Df Sum of Sq RSS AIC + education 1 31707 11981 255.30 + income 1 30665 13023 259.05 <none> 43688 311.52 Step: AIC=255.3 prestige ~ education Df Sum of Sq RSS AIC + income 1 4474 7507 236.26 <none> 11981 255.30 - education 1 31707 43688 311.52 Step: AIC=236.26 prestige ~ education + income Df Sum of Sq RSS AIC <none> 7506.7 236.26



- income 1 4474.2 11980.9 255.30 - education 1 5516.1 13022.8 259.05 Call: lm(formula = prestige ~ education + income, data = duncan1) Coefficients: (Intercept) education income -6.0647 0.5458 0.5987 >



10 INTRODUCTION TO GENERAL LINEAR MODEL

General Linear Model is an extension of multiple regression models to deal with explicative variables being factors (nominal or qualitative variables) and interaction between covariates (numeric variables) and factors.

Linear regression can be extended to accommodate categorical variables (factors) using dummy variable regressors (or indicator variables).

ANOVA models might have 1, 2 or more factors in the linear predictor leading to oneway ANOVA, two-way ANOVA and general ANOVA. When factors appear additive effect or first order, second order etc interaction effects may appear.

ANCOVA models combine factors (either main effects and/or interactions), covariates and interactions between factors (groups of factors) and covariates.

We are going to present:

1. 1 Way ANOVA: 1 factor and main effects only. 2. 2 Way ANOVA: main effects and interaction effects occur. 3. K Way ANOVA: extension to multiple factors and high order interactions 4. ANCOVA models: 1 factor and 1 covariate. 5. Extension to multiple factors, covariates and interactions: general linear model.



10 INTRODUCTION TO GENERAL LINEAR MODEL: ONE-WAY ANOVA

• Does the relationship between weight on height depend on gender? • Does profession prestige in Duncan data depend on the type of profession? And after controlling for

income and education?

Both height and prestige are numeric response data and a first random component stated as normal may be assumed leading to OLS estimator.

• Gender is a dicothomic factor (two levels, Male and Female) • Type of Profession is a polythomic factor consisting in three levels “Blue collar” “White collar” and

“professional”.

How to interpret the R2 ? Exactly as we did in multiple regression

• A high R2 means that the regressors explain the variation in Y. • A high R2 does not mean that you have eliminated omitted variable bias. • A high R2 does not mean that the included variables are statistically significant – this must be

determined using hypotheses tests.




The ANOVA model of a factor (generically with I levels)- setting the ideas:

• Formulation and construction of the design matrix for the models of regression, • Interpretation of its parameters • Discussion of inference

Group 1 111211 ,,, nyyy Mean 1y

Group 2 222221 ,,, nyyy Mean 2y

... ... ...

Group I IInII yyy ,,, 21 Mean Iy

(1) ijiijY εµ += , I parameters ( )I0N 2,σn≈ε .

(2) ijiijY εαµ ++= , µ is the overall expected mean and iα effect for i level, I+1 parameters.

The usual null hypothesis is that there are no differences between the expected mean of the groups and can be written according to the formulations as:

(1) µµµ === I1:0H versus :1H µµ =∃ i .

(2) 0: 1 === Iαα 0H versus 0: ≠∃ iα1H .




10.1 Example: Prestige of Canadian Occupations in data.frame Prestige in car library for R (Fox and Weisber 2011)

• Description: The Prestige data frame has 102 rows and 6 columns. The observations are occupations. This data frame contains the following columns:

Education Average education of occupational incumbents, years, in 1971.

Income Average income of incumbents, dollars, in 1971.

Women Percentage of incumbents who are women.

Prestige Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.

Census Canadian Census occupational code.

Type Type of occupation. A factor with levels (note: out of order): bc, Blue Collar; prof, Professional, Managerial, and Technical; wc, White Collar.

Source

Canada (1971) Census of Canada. Vol. 3, Part 6. Statistics Canada [pp. 19-1–19-21].



10 ONE-WAY ANOVA: EXAMPLE

First of all, a summary of data: there are 4 missing values in factor - type

Default R graphic plot to inspection the relation between a numeric variable (prestige) and a factor (type) works

nice: > plot(prestige~type, main="prestige vs type",col=3)

> summary(Prestige)

education income women prestige census type

Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80 Min. :1113 bc :44

1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23 1st Qu.:3120 prof:31

Median :10.540 Median : 5930 Median :13.600 Median :43.60 Median :5135 wc :23

Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83 Mean :5402 NA's: 4

3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27 3rd Qu.:8312

Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20 Max. :9517

> plot(prestige~type)




• Group descriptive statistics and standard procedure for 1 way ANOVA: > tapply(prestige,type,summary) $bc Min. 1st Qu. Median Mean 3rd Qu. Max. 17.30 27.10 35.90 35.53 42.60 54.90 $prof Min. 1st Qu. Median Mean 3rd Qu. Max. 53.80 61.00 68.40 67.85 72.95 87.20 $wc Min. 1st Qu. Median Mean 3rd Qu. Max. 26.50 35.90 41.50 42.24 47.50 67.50 > oneway.test(prestige~type) One-way analysis of means (not assuming equal variances) F = 117.2471, num df = 2.000, denom df = 54.979, p-value < 2.2e-16 > oneway.test(prestige~type,var=TRUE)# Corresponds to F-Test One-way analysis of means F = 109.5916, num df = 2, denom df = 95, p-value < 2.2e-16 > kruskal.test(prestige~type)# Non Parametric version for One-way means test Kruskal-Wallis rank sum test Kruskal-Wallis chi-squared = 63.3965, df = 2, p-value = 1.713e-14




> model<-lm(prestige~type, data=Prestige[!is.na(Prestige$type),], + contrasts=list(type="contr.treatment")) > summary(model) Call: lm(formula = prestige ~ type, data = Prestige[!is.na(Prestige$type), ], contrasts = list(type = "contr.treatment")) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 35.527 1.432 24.810 < 2e-16 *** typeprof 32.321 2.227 14.511 < 2e-16 *** typewc 6.716 2.444 2.748 0.00718 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 9.499 on 95 degrees of freedom Multiple R-squared: 0.6976, Adjusted R-squared: 0.6913 F-statistic: 109.6 on 2 and 95 DF, p-value: < 2.2e-16

244.42716.6527.35ˆˆˆ848.67321.32527.35ˆˆˆ

527.350527.35ˆˆˆ

32

1ˆˆˆ

333

222

111

0ˆ1 =+=+===+=+==

=+=+==

≡=≡=≡=

→+== αµ

αµαµ

αµα yy

yyyy

wciprofibci

y

j

j

j

iij




> model<-lm(prestige~type, data=Prestige[!is.na(Prestige$type),], + contrasts=list(type="contr.sum")) > summary(model) Call: lm(formula = prestige ~ type, data = Prestige[!is.na(Prestige$type), ], contrasts = list(type = "contr.sum")) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 48.5397 0.9935 48.86 <2e-16 *** type1 -13.0124 1.2925 -10.07 <2e-16 *** type2 19.3087 1.3990 13.80 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 9.499 on 95 degrees of freedom Multiple R-squared: 0.6976, Adjusted R-squared: 0.6913 F-statistic: 109.6 on 2 and 95 DF, p-value: < 2.2e-16

( ) 244.42309.19013.13540.48ˆˆˆ849.67309.19540.48ˆˆˆ527.35013.13540.48ˆˆˆ

32

1ˆˆˆ

333

222

111

0ˆˆˆ 321 =−+=+===+=+===−=+==

≡=≡=≡=

→+==++ αµ

αµαµ

αµααα yy

yyyy

wciprofibci

y

j

j

j

iij



11 INTRODUCTION TO GENERAL LINEAR MODEL: TWO-WAY ANOVA

Motivation: Prestige of professions (Y response) is related with profession type (Factor A) and a new factor indicating if there are mostly women professions (women percentage greater than 50%) (Factor B)?

> plot.design(prestige~type+f.femenin)

> summary(Prestige)

education income women prestige census type f.femenin

Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80 Min. :1113 bc :44 No :75

1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23 1st Qu.:3120 prof:31 Yes:27




Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20 Max. :9517

> plot(prestige~type*f.femenin)




The analysis of variance of 2 factors examines the relationship between a quantitative response variable and two variables qualitative explanatory.

The inclusion of the second factor allows the modelling and standardisation of dependence relations and introduce interactions.

Assuming in Two-way ANOVA that population means for each cell in the combinations of the levels of the factors patterns of usual relationship appear clearly.

1 .... J

1 11µ .... J1µ •1µ

I 1Iµ .... IJµ •Iµ

1•µ .... J•µ

If Factors A and B do not interact, then the partial relationship between each factor and the variable of response does not depend on the level of the other factor, that is, the difference between levels is constant. It is supposed I = 4 and J = 2 in the following diagrams.

A



3.2-3 TWO-WAY ANOVA

1 2

543210

8

7

6

5

4

3

FACTOR A

mu_

ij

FACTOR B

Factors A and B are significant.

No interactions between A and B factors are present

Factor A is significant.

Factor B is not significant.


1 2

0 1 2 3 4 5

3

4

5

6

7

8

mu_

ij

FACTOR A

FACTOR B




Factor A is not significant.

Factor B is significant.


1 2

1 2 3 4

2

3

4

5

6

mu_

ij

FACTOR A

FACTOR B

1 2

543210

8

7

6

5

4

3

mu_

ij

FACTOR A

FACTOR B

Factor A is significant.

Factor B is significant.

Interactions between A and B factors are present




Two-way ANOVA models: (M 0) Null Model: ijkijkY εµ += (M 1) Two ANOVA with interactions: ijkijjiijkY εγβαµ ++++= (M 2) Additive Two ANOVA: ijkjiijkY εβαµ +++=

(M 3) One-way ANOVA of A: ijkiijkY εαµ ++=

(M 4) One-way ANOVA of B: ijkjijkY εβµ ++=

Most common hypothesis tested in Two Way ANOVA: only RSS for models needed • H1: No interactions presents between both factors.

F-Test : H: Model (M1) and (M2) are equivalent >anova(m2,m1) • H2: Net effect of Factor A once Factor B is included in the model (or Factor B | A) not significant

F-Test : H: Model (M4) and (M2) are equivalent >anova(m4,m2) F-Test : H: Model (M3) and (M2) are equivalent >anova(m3,m2) (for Factor B | A)




• H3: Gross effect of Factor A (or Factor B).

F-Test : H: Model (M0) and (M3) are equivalent >anova(m0,m3) F-Test : H: Model (M0) and (M4) are equivalent >anova(m0,m4)



11 TWO-WAY ANOVA: EXAMPLE

11.1 Example: Prestige of Canadian Occupations of profession vs type and femenine factor (Fox and Weisber 2011)

> options(contrasts=c("contr.treatment","contr.treatment")) > m0<-lm(prestige~1,data=Prestige[!is.na(Prestige$type),]) > m1<-lm(prestige~type*f.femenin,data=Prestige[!is.na(Prestige$type),]) > m2<-lm(prestige~type+f.femenin,data=Prestige[!is.na(Prestige$type),]) > m3<-lm(prestige~type,data=Prestige[!is.na(Prestige$type),]) > m4<-lm(prestige~f.femenin,data=Prestige[!is.na(Prestige$type),]) > anova(m2,m1) # Interaction Test Analysis of Variance Table Model 1: prestige ~ type + f.femenin Model 2: prestige ~ type * f.femenin Res.Df RSS Df Sum of Sq F Pr(>F) 1 94 8382.2 2 92 8194.6 2 187.61 1.0531 0.353 > anova(m3,m2) # Net f.femenin effect Analysis of Variance Table Model 1: prestige ~ type Model 2: prestige ~ type + f.femenin Res.Df RSS Df Sum of Sq F Pr(>F) 1 95 8571.3 2 94 8382.2 1 189.07 2.1203 0.1487




> anova(m4,m2) # Net type effect Analysis of Variance Table Model 1: prestige ~ f.femenin Model 2: prestige ~ type + f.femenin Res.Df RSS Df Sum of Sq F Pr(>F) 1 96 28001.6 2 94 8382.2 2 19619 110.01 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > anova(m0,m3) # Gross f.femenin effect Analysis of Variance Table Model 1: prestige ~ 1 Model 2: prestige ~ type Res.Df RSS Df Sum of Sq F Pr(>F) 1 97 28346.9 2 95 8571.3 2 19776 109.59 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1




> anova(m0,m4) # Gross type effect Analysis of Variance Table Model 1: prestige ~ 1 Model 2: prestige ~ f.femenin Res.Df RSS Df Sum of Sq F Pr(>F) 1 97 28347 2 96 28002 1 345.31 1.1838 0.2793 > summary(m1) Call: lm(formula = prestige ~ type*f.femenin, data=Prestige[!is.na(Prestige$type), ]) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 36.227 1.552 23.349 <2e-16 *** typeprof 33.033 2.443 13.519 <2e-16 *** typewc 5.463 3.364 1.624 0.108 f.femeninYes -4.398 3.890 -1.131 0.261 typeprof:f.femeninYes -2.895 5.791 -0.500 0.618 typewc:f.femeninYes 5.378 5.558 0.968 0.336 Residual standard error: 9.438 on 92 degrees of freedom Multiple R-squared: 0.7109, Adjusted R-squared: 0.6952 F-statistic: 45.25 on 5 and 92 DF, p-value: < 2.2e-16




> summary(m2) Call: lm(formula = prestige ~ type + f.femenin,data=Prestige[!is.na(Prestige$type),]) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 36.068 1.471 24.516 < 2e-16 *** typeprof 32.438 2.216 14.640 < 2e-16 *** typewc 8.096 2.608 3.104 0.00252 ** f.femeninYes -3.398 2.333 -1.456 0.14869 Residual standard error: 9.443 on 94 degrees of freedom Multiple R-squared: 0.7043, Adjusted R-squared: 0.6949 F-statistic: 74.63 on 3 and 94 DF, p-value: < 2.2e-16 >




> options(contrasts=c("contr.sum","contr.sum")) > m2<-lm(prestige~type+f.femenin,data=Prestige[!is.na(Prestige$type),]) > summary(m2) Call: lm(formula = prestige ~ type + f.femenin,data=Prestige[!is.na(Prestige$type),]) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 47.880 1.087 44.066 <2e-16 *** type1 -13.511 1.330 -10.160 <2e-16 *** type2 18.927 1.415 13.372 <2e-16 *** f.femenin1 1.699 1.167 1.456 0.149 --- Residual standard error: 9.443 on 94 degrees of freedom Multiple R-squared: 0.7043, Adjusted R-squared: 0.6949 F-statistic: 74.63 on 3 and 94 DF, p-value: < 2.2e-16



12 MORE COMPLEX ANOVA MODELS

The extension of the formulation is straightforward to more complex ANOVA models, for example when increasing the number of factors in the experimental designs. When the number of factors increases to factors A, B and C, then higher order interactions appear, order 3 A*B*C and 3 interactions of order 2, A*B, A*C, B*C and the main effects. Interpretation of model coefficients gets complex.

In the designs of real experiments the factors can be crossed or nested or a mix of both: all they can be easily incorporated.



13 ANALYSIS OF COVARIANCE (ANCOVA MODELS)

ANCOVA models of analysis of covariance are mixed models in which appear dummy variables that represent levels of factors or interactions and continuous variables or covariates.

We are going to present ANCOVA models through an example without data, of sociological character and very intuitive, inspired in the proposal of Fox (84): relation between the income (Y) and the level of education (X) between the white population, oriental and black of the U.S. (Factor A, I=3 )

1 2 3

1 2 3 4 5 6 7 8

1

2

3

4

5

6

7

8

9

X

Y

Bl

Bl

Bl

Bl

OrOr

OrOr

Ne Ne Ne Ne

FACTOR A

1 2 3

0 1 2 3 4 5 6 7 8 9

0

1

2

3

4

5

6

7

8

9

X

Y

Bl

Bl

Bl

Bl

OrOr

OrOr

Ne Ne Ne Ne

FACTOR A

Model (M1): Factor –covariate interaction

ikikiiik xY εθηαµ ++++= )(




Model (M2): No factor –covariate interaction , no correlation between race and education

Model (M2): No factor –covariate interaction , correlation between race and education

1 2 3

9876543210

9

8

7

6

5

4

3

2

1

0

X

Y

NeNe

NeNe

OrOr

OrOr

BlBl

BlBl

FACTOR A

1 2 3

87654321

7

6

5

4

3

2

X

Y

Ne

Ne

Ne

Ne

Or

Or

Or

Or

Bl

Bl

Bl

Bl

FACTOR A

ikikiik xY εηαµ +++=




Model (M4) Income and Education associated but no affected by race

Model (M4) Nul Model

1 2 3

3 4 5 6

3

4

5

6

X

Y

Bl

Bl

Bl

Bl

Or

Or

Or

Or

Ne

Ne

Ne

NeFACTOR A

1 2 3

3 4 5 6

3

4

5

6

X

Y

Bl

Bl

Bl

BlOr

Or

Or

Or

Ne

Ne

Ne

Ne

ikikik xY εηµ ++=

ikikY εµ +=




13.1 Example: Prestige of Canadian Occupations in data.frame Prestige in car library for R (Fox and Weisber 2011)

An ANCOVA model where income and type of profession is tested for the prediction of prestige illustrates how to estimate and contrast hypothesis (using F Test) in R. Interpretation is also considered

> summary(Prestige)

education income women prestige census type f.femenin

Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80 Min. :1113 bc :44 No :75

1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23 1st Qu.:3120 prof:31 Yes:27




Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20 Max. :9517 > options(contrasts=c("contr.treatment","contr.treatment")) > m0<-lm(prestige~1,data=Prestige[!is.na(Prestige$type),]) > m1<-lm(prestige~type*income,data=Prestige[!is.na(Prestige$type),]) > m2<-lm(prestige~type+income,data=Prestige[!is.na(Prestige$type),]) > m3<-lm(prestige~type,data=Prestige[!is.na(Prestige$type),]) > m4<-lm(prestige~income,data=Prestige[!is.na(Prestige$type),]) >




> anova(m2,m1) # Interaction Test Analysis of Variance Table Model 1: prestige ~ type + income Model 2: prestige ~ type * income Res.Df RSS Df Sum of Sq F Pr(>F) 1 94 6336.7 2 92 4859.2 2 1477.5 13.987 4.969e-06 *** --- > anova(m3,m2) # Net income-covariate effect Analysis of Variance Table Model 1: prestige ~ type Model 2: prestige ~ type + income Res.Df RSS Df Sum of Sq F Pr(>F) 1 95 8571.3 2 94 6336.7 1 2234.5 33.147 1.068e-07 *** --- > anova(m4,m2) # Net type effect Analysis of Variance Table Model 1: prestige ~ income Model 2: prestige ~ type + income Res.Df RSS Df Sum of Sq F Pr(>F) 1 96 14325.3 2 94 6336.7 2 7988.5 59.251 < 2.2e-16 *** ---




> anova(m0,m3) # Gross income-covariate effect Analysis of Variance Table Model 1: prestige ~ 1 Model 2: prestige ~ type Res.Df RSS Df Sum of Sq F Pr(>F) 1 97 28346.9 2 95 8571.3 2 19776 109.59 < 2.2e-16 *** --- > anova(m0,m4) # Gross type effect Analysis of Variance Table Model 1: prestige ~ 1 Model 2: prestige ~ income Res.Df RSS Df Sum of Sq F Pr(>F) 1 97 28347 2 96 14325 1 14022 93.965 6.773e-16 *** --- > summary(m1) Call:lm(formula =prestige~type*income, data=Prestige[!is.na(Prestige$type),])




Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 13.9045168 3.1671787 4.390 3.02e-05 ***

typeprof 45.0190221 4.2907398 10.492 < 2e-16 ***

typewc 18.9807386 5.3421020 3.553 0.000603 ***

income 0.0040235 0.0005530 7.276 1.12e-10 ***

typeprof:income -0.0031783 0.0006047 -5.256 9.48e-07 ***

typewc:income -0.0021712 0.0009700 -2.238 0.027603 *

--- Residual standard error: 7.268 on 92 degrees of freedom Multiple R-squared: 0.8286, Adjusted R-squared: 0.8193 F-statistic: 88.94 on 5 and 92 DF, p-value: < 2.2e-16




> options(contrasts=c("contr.sum","contr.sum")) > m1<-lm(prestige~type*income,data=Prestige[!is.na(Prestige$type),]) > summary(m1) Call:lm(formula =prestige~type*income, data=Prestige[!is.na(Prestige$type),]) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.524e+01 2.025e+00 17.399 < 2e-16 *** type1 -2.133e+01 2.729e+00 -7.818 8.59e-12 *** type2 2.369e+01 2.626e+00 9.020 2.63e-14 *** income 2.240e-03 3.334e-04 6.719 1.50e-09 *** type1:income 1.783e-03 4.616e-04 3.863 0.000208 *** type2:income -1.395e-03 3.621e-04 -3.852 0.000216 *** --- Residual standard error: 7.268 on 92 degrees of freedom Multiple R-squared: 0.8286, Adjusted R-squared: 0.8193 F-statistic: 88.94 on 5 and 92 DF, p-value: < 2.2e-16 >

upc notes for theory sessions smde-miri-fiblmontero/lmm_tm/smde - (us... · (n ) yt ,y = 1 be a...

Documents