
Ordinary linear regression

Marco Baroni

Linear models in R


Outline

Linear models

Ordinary linear regression

Ordinary linear regression in R


The general setting

- One response/dependent variable
- One or more explanatory/independent variables (continuous or discrete)
- We build a model that produces a value for the dependent variable for each setting of the independent variables
- Why?
  - Hypothesis testing angle: if the model is tenable (in a sense to be made precise below), this is evidence that (some) independent variables are having an effect on the response (and of how they affect it)
  - Machine learning/data mining angle: predict the response when only values of the independent variables are available

Stages of statistical modeling

- Collect a sample/training data set (vectors of independent variables paired with the response)
- Choose the general form of the model
- Estimate the parameters of the model based on the sample data to determine the specific form of the relationship between independent variables and response (aka: fit the model, train the model)
- Look at the estimated model: is it plausible? are there simpler models that are equally plausible? what does it tell us about the phenomenon we are trying to model?
- (Do something with the model, like prediction of the response for data points where we only know the values of the independent variables)

The dependent/response variable

- The most straightforward case: it can take any continuous value (e.g., standardized RTs)
- Other possibilities:
  - Counts (e.g., how many mice visit this corner per minute?)
  - Binary responses (acceptable/not acceptable; phenomenon takes/does not take place)
  - Multiple categorical responses (which product does the subject buy?)
  - ...

Independent variables

- Can be binary, ordinal, real, or a mixture thereof
- From observation of (more or less) naturally occurring phenomena: no control over the distribution of the values of the independent variables
- From experiments, where the levels of the independent variables, and possibly the number of sampled observations for each combination of the values of the independent variables, are controlled

Linear models

- A variable related to the response (e.g., the response itself) is modeled by the sum of the values of the independent variables, each multiplied by a constant β that must be estimated from the sample data
- In the simplest case (continuous response, plus some assumptions to be introduced below), the response itself is modeled as a weighted combination of the independent variables:

  y = β0 + β1x1 + β2x2 + · · · + βnxn

- (More precisely, what is modeled is (a function of) the expected (or mean) value of the response for each setting of the independent variables)

What is linear?

- The independent variables might be generated by non-linear functions, see e.g. polynomial regression:

  y = β0x⁰ + β1x¹ + β2x² + · · · + βnxⁿ

- The relation between the modeled variable and the response need not be linear; in the linear model known as logistic regression it is:

  y = e^(β0+β1x1+β2x2+...+βnxn) / (1 + e^(β0+β1x1+β2x2+...+βnxn))

- What is linear is the combination of the independent variables (which might have been derived via a non-linear function, as in the first example above) to produce the values of the modeled variable (i.e., the modeled variable is given by a β-weighted sum of the independent variables, treated as constants); the following, where the βs appear as bases of exponents rather than as multiplicative weights, is not a linear model:

  y = β1^x1 + β2^x2 + · · · + βn^xn

This is a linear model: y = β0 + β1x

[Figure: scatterplot of y against x with the fitted straight line]

This is also a linear model: y = β0 + β1x1 + β2x2 + β3x3

[Figure: plot of y against x]

And so is this one! y = e^(β0+β1x1) / (1 + e^(β0+β1x1))

[Figure: S-shaped curve of y against x, with y ranging between 0 and 1]

Why linear models

- Well-understood and powerful formalism
- Shared algorithms for estimation
- Many classic analyses (ANOVA, multiple regression) are special cases of the generalized linear model

Outline

Linear models

Ordinary linear regression
  Constructing the model
  Estimation
  Looking at the fitted model
  Prediction and validation

Ordinary linear regression in R

The general setting

- In many, many research contexts, you have a number of measurements (variables) taken on the same units
- You want to find out whether the distribution of a certain variable (the response, or dependent variable) can be, to a certain extent, predicted by a combination of the others (the explanatory, or independent variables), and how the latter affect the former
- We look first at the case in which the response is continuous (or you can reasonably pretend it is)
- A simple but extremely effective model for such data is based on the assumption that the response is given by a weighted sum of the explanatory variables, plus some random noise (the error term)
- We must look for a good setting of the weights, and at how well the weighted sums match the observed response distribution (the fit of the model)

Modeling

- For each combination of values of the independent variables, assume that their weighted sum gives the expectation (or mean value) of the response given that setting of the independent variables
- Actual responses are given by the model prediction, plus a component of normally distributed random noise with mean 0 and variance σ² (the same variance for each setting of the independent variables)
- Equivalent to assuming that, for each setting of the independent variables, the corresponding responses are normally distributed with the prediction of the linear model as mean and variance σ²
- As is customary, we add an extra constant independent variable (a column of 1s); the corresponding β0 weight will be the predicted response value when all explanatory variables are set to 0
- When there is only one true independent variable, β0 is the intercept and β1 the slope of the straight line representing E(y) (the expected y value) as a function of x

Modeling

- Systematic component for specific values of the independent variables:

  E(yi) = β0 + β1xi1 + β2xi2 + · · · + βnxin

- Plus a component of noise/error randomly sampled from a normal distribution with mean 0 and variance σ²:

  yi = E(yi) + εi

- Putting the components together (see the simulation sketch below):

  yi = β0 + β1xi1 + β2xi2 + · · · + βnxin + εi
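To make the notation concrete, here is a minimal R sketch (all parameter values made up) that generates data according to exactly this model, with a single independent variable:

  # simulate m observations from yi = b0 + b1*xi + ei,
  # with ei ~ Normal(0, sigma^2)
  set.seed(1)
  m <- 100
  b0 <- 1; b1 <- 0.5; sigma <- 0.7
  x <- runif(m, -2, 2)                       # the independent variable
  Ey <- b0 + b1 * x                          # systematic component E(yi)
  y <- Ey + rnorm(m, mean = 0, sd = sigma)   # add the error component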

The error component

- Our sample might contain two or more observations for the same configuration of the independent variables (i.e., two or more observations with the same values of the independent variables)
- In such cases, the expected response predicted by the model is the same; variation in the observed response is attributed to the noise/error component
- The errors (the εi terms) should be normally distributed with mean 0 and variance σ²
- NB: neither the independent variables, nor their weighted combination, nor the expected or observed responses are assumed normal!

The linear model

  y1 = β0 + β1x11 + β2x12 + · · · + βnx1n + ε1
  y2 = β0 + β1x21 + β2x22 + · · · + βnx2n + ε2
  ...
  ym = β0 + β1xm1 + β2xm2 + · · · + βnxmn + εm

The matrix-by-vector multiplication view

~y = X~β + ~ε

  [ y1 ]   [ 1 x11 x12 · · · x1n ]   [ β0 ]   [ ε1 ]
  [ y2 ] = [ 1 x21 x22 · · · x2n ] × [ β1 ] + [ ε2 ]
  [ ·· ]   [ ··  ··  ··  ··  ··  ]   [ ·· ]   [ ·· ]
  [ ym ]   [ 1 xm1 xm2 · · · xmn ]   [ βn ]   [ εm ]

The matrix-by-vector multiplication view

~y = X~β + ~ε

  [ y1 ]        [ 1 ]        [ x11 ]        [ x12 ]                [ x1n ]   [ ε1 ]
  [ y2 ] = β0 × [ 1 ] + β1 × [ x21 ] + β2 × [ x22 ] + · · · + βn × [ x2n ] + [ ε2 ]
  [ ·· ]        [ · ]        [ ··  ]        [ ··  ]                [ ··  ]   [ ·· ]
  [ ym ]        [ 1 ]        [ xm1 ]        [ xm2 ]                [ xmn ]   [ εm ]

Outline

Linear models

Ordinary linear regression
  Constructing the model
  Estimation
  Looking at the fitted model
  Prediction and validation

Ordinary linear regression in R

Choosing the independent variables

- Typically, you will have one or more variables that are of interest for your research, plus a number of “nuisance” variables you should take into account
- E.g., you might be really interested in the effect of colour and shape on speed of image recognition, but you might also want to include age and sight of subject and familiarity of image among the independent variables that might have an effect
- General advice: it is better to include nuisance independent variables than to try to build artificially “balanced” data sets
  - In many domains, it is easier and more sensible to introduce more independent variables in the model than to try to control for them in an artificially dichotomized design
  - Free yourself from stiff ANOVA designs!
  - As usual, with moderation and common sense

Choosing the independent variables

- Measure the correlation/association of the independent variables, and avoid highly correlated/associated variables
- Intuitively, if two independent variables are perfectly correlated, there is an infinity of weight assignments that lead to exactly the same response predictions
- More generally, if two variables are nearly interchangeable you cannot assess their effects separately
- Even if no pair of independent variables is strongly correlated, one variable might be highly correlated with a linear combination of all the others (“collinearity”)
  - In the language of linear algebra, the independent variables are not linearly independent
- With perfect collinearity, the fitting routine will die

Complex models tend to overfit the training data

- How many independent variables can you get away with?
- If you have as many independent variables as data points, you are in serious trouble: either the independent variables are not linearly independent (and you are back to the collinearity problem), or you can model any response vector perfectly (m linearly independent m-dimensional vectors span the full m-dimensional space)
- More generally, too many independent variables are not a good idea
- The more independent variables you feed it, the more room the model has to fit accidental characteristics of the data set, and it will not generalize well (overfitting)

(Over)fitting simulated data
Too simple: y = β0

[Figure: simulated x-y data with a horizontal fitted line]

(Over)fitting simulated data
About right: y = β0 + β1x

[Figure: the same data with a fitted straight line]

(Over)fitting simulated data
Too complex: y = β0 + β1x¹ + · · · + β20x²⁰

[Figure: the same data with a wiggly degree-20 polynomial fit]

Predicting new values generated by the same function
Straight line generalizes better than polynomial (reproduced in the sketch below)

[Figure: new data points from the same function, with the straight-line and polynomial fits]
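The same contrast is easy to reproduce; a minimal simulation in the same spirit (all values made up):

  # noisy linear data: compare a straight line with a degree-20 polynomial
  set.seed(1)
  x <- runif(50, -3, 2); y <- 1 + 0.8 * x + rnorm(50)
  line <- lm(y ~ x)
  poly20 <- lm(y ~ poly(x, 20))
  # the polynomial fits the training sample better...
  summary(line)$r.squared; summary(poly20)$r.squared
  # ...but will typically predict new data from the same function worse
  xnew <- runif(50, -3, 2); ynew <- 1 + 0.8 * xnew + rnorm(50)
  mean((ynew - predict(line, data.frame(x = xnew)))^2)
  mean((ynew - predict(poly20, data.frame(x = xnew)))^2)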

Avoiding overfitting

- Don't just add more independent variables to get a better training data fit!
- Special estimation routines that disfavour models with large betas (regularized regression)
- Validation on held-out data (we will explore this route with bootstrap validation later)

Dummy coding of categorical variables

- Categorical variables with 2 values are coded by a single 0/1 term
- E.g., the male/female distinction might be coded by a term that is 0 for male subjects and 1 for females
- The weight of this term expresses the (additive) difference in response for female subjects
- E.g., if the response is reaction time in milliseconds and the weight of the term that is set to 1 for female subjects is -10, the model predicts that, all else being equal, female subjects take 10 fewer milliseconds than males to respond
- Multi-level categorical factors are split into n − 1 binary (0/1) variables (an n-th variable would be a linear combination of the intercept and the other binary variables)
- E.g., from the 3-valued “concept class” variable (animal, plant, tool) to:
  - is animal? (animal=1; plant=0; tool=0)
  - is plant? (animal=0; plant=1; tool=0)
- Often, choosing a sensible “default” level (the one mapped to 0 for all binary variables) can greatly improve the qualitative analysis of the results (R's default coding is sketched below)
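In R this recoding happens automatically when a factor enters a model; a minimal sketch, with a made-up 3-level factor, of the dummy variables R would create:

  # R's default "treatment" contrasts turn a 3-level factor into
  # n-1 = 2 binary columns; the first level (here "animal",
  # alphabetically) is the default, mapped to 0 on both columns
  concept <- factor(c("animal", "plant", "tool", "plant"))
  model.matrix(~ concept)   # columns: (Intercept), conceptplant, concepttool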

Interactions

- Suppose we are testing recognition of animals vs. tools in males and females, and we suspect men recognize tools faster than women
- We then need a male-tool interaction term (equivalently, but less intuitively: female-tool, female-animal, male-animal), created by entering a separate weight for the product of the male and tool dummy variables:

  y = β0 + β1 × male + β2 × tool + β3 × (male × tool) + ε

- Here, β3 will be added only in cases in which a male subject sees a tool (both the male and tool variables are set to 1) and will account for any differential effect present when these two properties co-occur
- Categorical variable interactions are the easiest to interpret, but you can also introduce interactions between a categorical and a continuous variable, or between two continuous variables (the R notation is sketched below)
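In R's formula notation the product term is written with a colon; a minimal sketch with hypothetical 0/1 dummies and made-up reaction times, just to show the notation:

  male <- c(1, 1, 0, 0, 1, 0)
  tool <- c(1, 0, 1, 0, 0, 1)
  rt <- c(510, 540, 560, 535, 545, 550)   # made-up responses
  lm(rt ~ male + tool + male:tool)        # same as lm(rt ~ male * tool)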

Pre-processing

- Lots of potentially useful transformations I will skip
- E.g., take the logarithm of the response and/or of some explanatory variables
- Center a variable so that its mean value will be 0, or scale it (these operations will not affect the fit of the model, but they might make the results easier to interpret)
- Look at the documentation for R's scale() function (a minimal example below)
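For instance, a minimal sketch of centering and scaling with made-up values:

  x <- c(2, 4, 6, 8)
  scale(x, center = TRUE, scale = FALSE)   # centering only: mean becomes 0
  scale(x)                                 # center and scale: mean 0, sd 1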

Outline

Linear models

Ordinary linear regression
  Constructing the model
  Estimation
  Looking at the fitted model
  Prediction and validation

Ordinary linear regression in R

Estimation (model fitting/training)

- The linear model:

  y = β0 + β1x1 + β2x2 + ... + βnxn + ε

- ε is normally distributed with mean 0
- We need to estimate (assign values to) the β weights and the variance σ² of the normally distributed ε variable
- Our criterion will be to look for βs that minimize the error terms
- Intuitively, the smaller the ε terms, the better the model

Big and small ε's

[Figures: some (unrealistically neat) x-y data; a line giving a bad fit, with large ε's; a line giving a good fit, with (relatively) small ε's]

Minimizing the sum of squared errors
The method of least squares

- We minimize the sum of squared errors Σ(i=1..m) εi² = ~ε^T ~ε (by squaring, we account for both over- and underestimation)
- Rewrite:

  y = β0 + β1x1 + β2x2 + ... + βnxn + ε

  as:

  ε = y − (β0 + β1x1 + β2x2 + ... + βnxn)

- Across observations, in matrix form:

  ~ε = ~y − X~β

- Then the sum of squared errors can be expressed as:

  ~ε^T ~ε = (~y − X~β)^T (~y − X~β)

- To find the minimum, we take the derivative of this expression with respect to ~β and set it to 0
- This leads to the following formula for the optimal βs:

  ~β = (X^T X)^-1 X^T ~y

Minimizing the sum of squared errors

~β = (X^T X)^-1 X^T ~y

- This is a closed-form expression for the β estimates that minimize the sum of squared errors
- This means we have an easy-to-compute formula to find the optimal βs: no need to look for them by computationally costly and often unreliable trial-and-error methods

Minimizing the sum of squared errors

~β = (X^T X)^-1 X^T ~y

- For the formula to work, (X^T X) must be invertible
- It can be shown that (X^T X) is invertible if and only if the columns of X are linearly independent
- If X has more columns than rows (it is a “short and fat” matrix), its columns cannot be linearly independent (you cannot have more than m linearly independent m-dimensional vectors)
- This explains why estimation fails if you have more independent variables (columns of X) than observations (rows of X) in the sample
- And estimation becomes less stable as X approaches singularity (see the sketch below)
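A minimal R sketch (made-up data) of what goes wrong with perfectly collinear columns:

  # x2 is just 2*x1, so the columns of X are linearly dependent
  x1 <- c(1, 2, 3, 4); x2 <- 2 * x1
  X <- cbind(1, x1, x2)
  # solve(t(X) %*% X)   # fails: the matrix is singular
  # note that R's lm() does not die outright here: it drops the
  # aliased variable and reports NA for its coefficient
  y <- c(1.1, 1.9, 3.2, 3.9)
  coef(lm(y ~ x1 + x2))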

Estimated variance

- Once we have estimated the βs, an obvious estimate for the variance parameter σ² of the error term is:

  σ² = (1/m) Σ(i=1..m) (yi − E(yi))² = (1/m) (~y − X~β)^T (~y − X~β)

  where m is the number of observations in the sample used for estimation

- The one above can be shown to be a biased estimator of σ² (E(σ²) ≠ σ²), and thus we sometimes use an alternative estimate that corrects for the bias by dividing by m − n, where n is the number of estimated β coefficients (the number of independent variables plus 1):

  σ² = (1/(m − n)) Σ(i=1..m) (yi − E(yi))² = (1/(m − n)) (~y − X~β)^T (~y − X~β)

- Intuitively, think of this as an estimate of variance that penalizes models with more parameters (both versions are computed in the sketch below)
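Both versions are easy to compute from a fitted model; a minimal R sketch on made-up data:

  x <- rnorm(50); y <- 1 + 2 * x + rnorm(50)
  fit <- lm(y ~ x)
  m <- length(y); n <- length(coef(fit))
  sum(residuals(fit)^2) / m         # biased estimate of sigma^2
  sum(residuals(fit)^2) / (m - n)   # unbiased; equals summary(fit)$sigma^2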

Residuals and sum of squares

- The difference between an observed response and the prediction of the estimated model

  yi − E(yi) = yi − ~xi ~β

  is called a residual, and it is (not surprisingly) an important quantity in assessing the quality of a fitted model

- The numerator of the estimated variance of the error term, called the sum of squared residuals, or sum of squares, will also appear in model assessment statistics:

  SS = Σ(i=1..m) (yi − E(yi))² = (~y − X~β)^T (~y − X~β)

Minimizing the sum of squared errors = maximizing the likelihood

- We assume that the yi values derive from the value predicted by the model for the specific setting of the independent variables (~xi ~β) plus a normally distributed error component with mean 0
- This is the same as saying that yi comes from a normal distribution with mean E(yi) = ~xi ~β, where ~xi (a row vector) is the specific configuration of independent variables for the i-th observation in the sample
- The probability of sampling a value from a normal distribution decreases symmetrically as we move away from the mean of the distribution
- Thus, a model that tends to have low values for (yi − E(yi))² is a model that tends to assign high probabilities to the observed responses

Minimizing the sum of squared errors = maximizing the likelihood

- A model that tends to have low values for (yi − E(yi))² is a model that tends to assign high probabilities to the observed responses
- The maximum likelihood estimate (i.e., the set of βs that assigns the highest probability to the observed response vector ~y) is found by minimizing the sum of the squared differences between the observed responses and the model predictions across training examples, i.e.:

  Σ(i=1..m) (yi − E(yi))² = (~y − E(~y))^T (~y − E(~y))
                          = (~y − X~β)^T (~y − X~β)
                          = ~ε^T ~ε

Minimizing the sum of squared errors = maximizing the likelihood

  Σ(i=1..m) (yi − E(yi))² = (~y − E(~y))^T (~y − E(~y))
                          = (~y − X~β)^T (~y − X~β)
                          = ~ε^T ~ε

- In an ordinary regression model, maximizing the likelihood is the same as minimizing the squared error
- The maximum likelihood criterion is also used to fit other kinds of linear models, where the least squares method would not work

Distribution of the β estimates

- Based on our model and assumptions about how ~y was generated, and some involved inferential statistics, we can derive various properties of the distribution of the estimated β coefficients
- Most importantly, it can be shown that if the true underlying population parameter βi = 0 (i.e., the independent variable xi has no effect on the response), then the following quantity has a t distribution with m − n degrees of freedom:

  βi / (σ √vi)

  - m: observations in the sample
  - n: estimated parameters (independent variables + 1)
  - vi: the i-th diagonal element of (X^T X)^-1
- The quantity σ√vi is the standard error of βi, a measure of the margin of uncertainty of our estimate

Distribution of the β estimates

- The quantity

  βi / (σ √vi)

  has a t distribution with m − n degrees of freedom
- Intuitively, the larger |βi|, the smaller σ, and the larger the variance in xi, the less likely it will be that the true value of βi is 0

Outline

Linear models

Ordinary linear regression
  Constructing the model
  Estimation
  Looking at the fitted model
  Prediction and validation

Ordinary linear regression in R

Sanity checks after fitting

- There are plenty of sanity checks we might want to perform after estimating a model
- See, e.g., Baayen, Fox, Gelman/Hill, Dalgaard, ...

Residual plots

- A simple and potentially revealing diagnostic looks at the distribution of residuals over the range of fitted values (a minimal residual plot is sketched below)
  - Fitted values: the y's predicted by the model for the data points in the set
  - Residuals: the attested y's minus the predicted y's, i.e., the empirical error terms given the estimated model
- If the residuals do not look random and symmetrically centered at 0 across fitted values, our modeling assumption that the response can be explained by a weighted sum of the explanatory variables plus 0-mean normally distributed noise does not hold
- What to do in such a case?
  - Ignore the problem (a common strategy)
  - Transform the response (e.g., take the logarithm, square root...) or the independent variables (e.g., square one of the variables)
  - Try other forms of analysis (e.g., could the data be more appropriate for an ordinal logistic regression?)
  - etc.
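In R, the basic residual plot takes a couple of lines from any fitted lm object; a minimal sketch on made-up data:

  x <- rnorm(50); y <- 1 + 2 * x + rnorm(50)
  fit <- lm(y ~ x)
  plot(fitted(fit), residuals(fit),
       xlab = "fitted values", ylab = "residuals")
  abline(h = 0, lty = 2)   # residuals should scatter randomly around this line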

Good-looking residuals

[Figure: residuals vs. fitted values, scattered randomly and symmetrically around 0]

Very suspicious-looking residuals
How could you try to fix this?

[Figure: residuals vs. fitted values, showing a clearly non-random pattern]

Looking at the overall goodness of fit: the R² statistic

- Compare the estimated error variance (the variance of the residuals) for the model we built to the variance in the unmodeled y response (equivalently: a trivial model with only an intercept parameter β0)
- Variance of the unmodeled y's:

  s²y = Σi (yi − µy)² / m

- The ratio of the modeled and unmodeled variances (σ²/s²y) tells us by how much adding the independent variables to the model reduces the variance, i.e., makes our predictions about the values of specific y's more precise
  - Note that the m denominators of the two variances cancel out, making this a ratio of sums of squares
  - Note also that σ²/s²y ≤ 1, or else we should have picked the intercept-only model as the least squares estimate
- The goodness-of-fit statistic R² is given by 1 minus the ratio of variances, and it is said to be the amount of variance “explained” by the model
- R² ranges from 0 (no variance explained) to 1 (perfect fit)

The R² statistic: adjusted or not

- Recall that in order to obtain an unbiased estimate of σ² we divide by m − n instead of m
- If we do this (and compute s²y dividing by m − 1), we obtain what R (the software package) calls the adjusted R², which penalizes models with many parameters
- As (m − 1)/(m − n) approaches 1 (because m, the number of data points, is high, and/or n, the number of weights, is low), the unadjusted and adjusted R² become almost identical
- A rule of thumb I've read somewhere on the Web: if the unadjusted and adjusted R² differ by more than 5%, you should worry that you are using too many independent variables for the amount of data you've got

Interpreting the βs

- Positive βs correspond to positive effects on the dependent variable, and vice versa for negative βs; the higher |βi|/s.e.(βi), the stronger the corresponding effect
- As we said above, if a βi coefficient is 0 (i.e., the corresponding explanatory variable has no effect whatsoever on the response), then βi/s.e.(βi) has a t distribution with m − n degrees of freedom (m: number of data points; n: number of weights to estimate)
- In “quick and dirty” terms, if the interval βi ± 2 × s.e.(βi) crosses 0, the corresponding explanatory variable (or interaction) is not having a significant effect on the response at α = 0.05, given the current model (sketched below)
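A sketch of this quick check in R (fit is any lm object, here fitted to made-up data):

  x <- rnorm(50); y <- 1 + 2 * x + rnorm(50)
  fit <- lm(y ~ x)
  est <- coef(summary(fit))[, "Estimate"]
  se <- coef(summary(fit))[, "Std. Error"]
  cbind(lower = est - 2 * se, upper = est + 2 * se)   # do the intervals cross 0?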

Comparing nested models

- Does it help to include variable xi (which might be an interaction), when we have already included x0, . . . , xi−1? (the “subset” model is said to be nested in the model with the extra variable(s), because it is a special case of it in which βi is 0)
- Adding a variable will typically improve the fit (just because we have an extra weight to play around with) – but is the fit improvement good enough to justify complicating the model? Or is it just in line with what we could expect by chance?

Comparing nested models

- We assess the improvement due to the extra parameters by computing the decrease in the sum of squared residuals per extra parameter of the larger model, normalized by the (unbiased) estimate of the error variance of the larger model:

  F = ((SSsmaller − SSlarger) / (nlarger − nsmaller)) / σ²larger
    = ((SSsmaller − SSlarger) / (nlarger − nsmaller)) / (SSlarger / (m − nlarger))

- The significance of the F value can be looked up in the F table with nlarger − nsmaller and m − nlarger degrees of freedom that you might be familiar with from ANOVA (not by chance: in R we will perform this sort of comparison with the anova() command)

Comparing nested models

  F = ((SSsmaller − SSlarger) / (nlarger − nsmaller)) / σ²larger

- The larger the difference in squared residuals between the smaller and the larger model, the stronger the impact of the extra terms
- The more extra terms we added, the less surprising it is that the difference in squared residuals increased

Comparing nested models

- Multiple F-tests comparing increasingly larger models (and/or increasingly smaller models) can be used for incremental model building
- However, it is all too easy to get carried away with this
- Don't forget that the model is indeed a “model” of the phenomenon you are studying: it should make sense from a qualitative point of view
- Some practical advice:
  - Start with a model with the constant (intercept) term and the important nuisance terms: keep these in any case
  - When you add an interaction, always keep the corresponding single-variable terms
  - If a set of independent variables constitutes a meaningful cluster (e.g., the variables that measure different aspects of colour: hue, saturation and brightness), keep them all in the model

Relation to ANOVA

- When all the independent variables are categorical, an F-test such as the one we described above, comparing the full model (or subsets thereof) to an “empty” model predicting the dependent variable mean, is equivalent to the traditional ANOVA test
- It is instructive to compare the properties of the objects created by lm() (the linear model fitting function) and aov() (the ANOVA function) in R (a minimal example below)
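A quick way to see the equivalence, sketched on a made-up categorical predictor:

  g <- factor(rep(c("a", "b", "c"), each = 10))
  y <- rnorm(30) + as.numeric(g)
  summary(lm(y ~ g))    # the F-statistic in the last line...
  summary(aov(y ~ g))   # ...is the same F test reported in the ANOVA table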

Model assessment statistics: summary

- While experimenting with nested models, look at the F value of increasingly complex models to decide whether the increase in complexity is justified
- Once you settle on a model, look at:
  - the (adjusted) R² for the overall fit of the model
  - the βs and their standard errors (and the corresponding t values) to assess the separate impact of the explanatory variables in the final model
- NB: it is easy to get confused between F and t values: the former compare different regression models; the latter look at the impact of a single variable in a specific model

Outline

Linear models

Ordinary linear regression
  Constructing the model
  Estimation
  Looking at the fitted model
  Prediction and validation

Ordinary linear regression in R

Prediction

- Ultimately, we hope to have a model that will predict responses in the “population” of interest, not only in our sample
- E.g., we want to draw conclusions about how various factors influence image perception in women and men in general, not only in the data collected in our experiment
- The more parameters we have, and/or the less data we have, and/or the less truly random our “random” sample is...
- ...the more we run the risk of overfitting: estimating a model that is adapted to idiosyncratic aspects of our data that we would not encounter in other samples from the population we are interested in

Overfitting

[Figure: simulated x-y data with an overfitted (high-degree polynomial) model]

Validation

- While statistical inference will give us an idea of how stable our results should be across similar samples, it does not shield us from the risk that there is something really idiosyncratic about some of our data, and that that is what our model “learned” to simulate
- Statistical inference simply tells us how well our model would generalize to more data with the same idiosyncrasies!
- The ideal test for the model would be how well it predicts data that have not been used for estimation
- This is the standard division between “training” and “test” data in machine learning, where the emphasis shifts almost entirely to prediction of unseen data
- Modern validation techniques such as bootstrapping or n-fold cross-validation estimate the model on a subset of the data and use the remaining part to evaluate it, iterating the process with different data used as estimation and evaluation sets
- Only a small example of what you can do with computation-intensive simulation techniques!

Bootstrap validation

- In bootstrap validation, we randomly resample m data points from our original m data points with replacement
- I.e., at each step all data points have the same probability of being sampled, including those that we have already sampled in a previous step
- The estimation set has the same number of data points m as the original data set, but contains repeated data points and misses some of the original data points
- We fit the model to the estimation set, but measure its goodness of fit on the original data set (which will contain data points not used for estimation)
- We repeat this process many times, and we use averages of R² (or other statistics) across the estimation and full sets to assess how well the model generalizes (a hand-rolled version is sketched below)
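The procedure is easy to sketch by hand in base R (made-up data; the validate() function used later automates and refines this):

  x <- rnorm(100); y <- 1 + 2 * x + rnorm(100)
  d <- data.frame(x, y)
  boot_r2 <- replicate(1000, {
    b <- d[sample(nrow(d), replace = TRUE), ]   # resample with replacement
    fit <- lm(y ~ x, data = b)
    train <- summary(fit)$r.squared             # R^2 on the bootstrap sample
    test <- 1 - sum((d$y - predict(fit, d))^2) /
                sum((d$y - mean(d$y))^2)        # R^2 on the full data set
    c(train = train, test = test)
  })
  rowMeans(boot_r2)   # the train-test gap estimates the "optimism"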

Outline

Linear models

Ordinary linear regression

Ordinary linear regression in R
  Preliminaries
  Model building and estimation
  Exploring the model
  Practice


The english data

- Load Baayen's languageR library, which in turn loads various useful packages and data sets:

  > library(languageR)

- We load and attach Baayen's english data set:

  > data(english)
  > ?english
  > summary(english)
  > attach(english)

The english data

- We will only look at a few variables from this data set, namely Familiarity, WordCategory, and WrittenFrequency
- Suppose that our theory of familiarity (Familiarity) predicts that nouns feel in general more familiar than verbs (WordCategory); we want to test this while taking into account the obvious fact that familiarity is affected by the frequency of usage of the word (WrittenFrequency)
- We will thus use only two independent variables (and their interaction), but in a real-life regression you should look at the potential impact of many more variables

Preliminary checks

- The dependent variable looks reasonably “continuous”:

  > summary(Familiarity)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.100   3.000   3.700   3.796   4.570   6.970

- A look at the distribution of WrittenFrequency (or at the documentation!) reveals that these are log frequencies:

  > summary(WrittenFrequency)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.000   3.761   4.832   5.021   6.247  11.360

- And we have this many nouns and verbs:

  > summary(WordCategory)
       N    V
    2904 1664

Preliminary checks

- A sanity t-test suggests that WrittenFrequency and WordCategory are not so strongly related as to worry about using them as independent variables within the same regression model:

  > t.test(WrittenFrequency[WordCategory=="V"],
           WrittenFrequency[WordCategory=="N"])
  ...
  t = 1.0129, df = 3175.109, p-value = 0.3112
  ...
  sample estimates:
  mean of x mean of y
   5.058693  4.999629

Outline

Linear models

Ordinary linear regression

Ordinary linear regression in R
  Preliminaries
  Model building and estimation
  Exploring the model
  Practice

The linear model in R

- We estimate the model using the R “formula” notation: dependent ~ indep1 + indep2 + ...
- In our case:

  > modelFrequencyCategory<-lm(Familiarity~WrittenFrequency+WordCategory)

- lm() (for linear model) returns an R object that contains various information about the fitted model
- Other functions, such as t.test(), also return an object, except that in that case we rarely need to store the information for further post-processing and analysis
- We store the object produced by fitting the regression in the modelFrequencyCategory variable

A look at the estimated model with summary()

> summary(modelFrequencyCategory)

Call:
lm(formula = Familiarity ~ WrittenFrequency + WordCategory)

Residuals:
     Min       1Q   Median       3Q      Max
-2.43148 -0.45094 -0.04207  0.41231  2.71238

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      1.225880   0.030547   40.13   <2e-16
WrittenFrequency 0.492205   0.005546   88.75   <2e-16
WordCategoryV    0.269772   0.021243   12.70   <2e-16

Residual standard error: 0.6909 on 4565 degrees of freedom
Multiple R-Squared: 0.6388, Adjusted R-squared: 0.6387
F-statistic: 4037 on 2 and 4565 DF, p-value: < 2.2e-16

The summary line by line

- The model we specified (a useful reminder when we start having many variables containing lm objects):

  Call:
  lm(formula = Familiarity ~ WrittenFrequency + WordCategory)

- The distribution of the residuals should be normal with mean 0 (and thus the median should not be far from 0):

  Residuals:
       Min       1Q   Median       3Q      Max
  -2.43148 -0.45094 -0.04207  0.41231  2.71238

- This looks reasonably symmetrical around 0

The summary line by line

- The β coefficient estimates and their standard errors:

  Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
  (Intercept)      1.225880   0.030547   40.13   <2e-16
  WrittenFrequency 0.492205   0.005546   88.75   <2e-16
  WordCategoryV    0.269772   0.021243   12.70   <2e-16

- The intercept (β0) is significantly different from 0, which is not a particularly interesting fact
- A noun with log frequency 0 (frequency 1) should get on average a 1.22588 familiarity rating
- Not surprisingly, frequency has a significantly positive impact on familiarity
- Notice that 0.492205/0.005546 = 88.75 (the t value is the estimate divided by its standard error)

The summary line by line

- The β coefficient estimates and their standard errors:

  Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
  (Intercept)      1.225880   0.030547   40.13   <2e-16
  WrittenFrequency 0.492205   0.005546   88.75   <2e-16
  WordCategoryV    0.269772   0.021243   12.70   <2e-16

- WordCategory has been recoded as the dummy variable WordCategoryV (1 for verbs, 0 for nouns), thus we get a β for the verbs; nouns are the “default” level
- Interestingly, WordCategoryV has a significantly positive effect, i.e., all else being equal, verbs “feel” more familiar than nouns

Least squares estimation: do it yourself!

- We need to compute the following in R:

  ~β = (X^T X)^-1 X^T ~y

- Matrix manipulation:
  - transpose: t()
  - inverse: solve()
  - multiply: %*%
- Dummy coding of WordCategory as WordCategoryV:

  > WordCategoryV<-as.numeric(WordCategory)-1

Least squares estimation: do it yourself!

~β = (X^T X)^-1 X^T ~y

- Constructing the independent variable matrix (with a constant term for the intercept β0):

  > X<-cbind(rep(1,length(WrittenFrequency)),
             WrittenFrequency,WordCategoryV)

- Finding the least squares βs:

  > solve(t(X) %*% X) %*% t(X) %*% Familiarity
                        [,1]
                   1.2258800
  WrittenFrequency 0.4922048
  WordCategoryV    0.2697716

- Comfortingly, these are the same estimates as with lm()!

Changing the default level

- In this case, with only two levels of WordCategory, there is no particular reason to pick nouns or verbs as the default, but in other cases picking the right default level can simplify the interpretation of the results
- Levels are ordered alphabetically by default; use relevel() to pick a different default. Try:

  > levels(WordCategory)
  > WordCategoryBis<-relevel(WordCategory,"V")
  > levels(WordCategoryBis)
  > summary(lm(Familiarity~WrittenFrequency+WordCategoryBis))

- What changed?

The summary line by line

- The next line gives the standard error of the residuals (the square root of the unbiased estimate of the variance of the estimated error):

  Residual standard error: 0.6909 on 4565 degrees of freedom

- And then we get the R² and adjusted R² values:

  Multiple R-Squared: 0.6388, Adjusted R-squared: 0.6387

R² and adjusted R²

- The R² value is 1 minus the ratio of the variance of the residuals to the variance of the response variable (note the residuals() function):

  > unadjustedRsquared<-1-
    (var(residuals(modelFrequencyCategory))/var(Familiarity))
  > unadjustedRsquared
  [1] 0.6388438

- NB: var() divides by m − 1, but this term cancels out in the ratio

R² and adjusted R²

- The adjusted R² corrects the variance estimates for the number of parameters (weights) that have been estimated:

  > correctedResVar<-
    (sum((residuals(modelFrequencyCategory)-
    mean(residuals(modelFrequencyCategory)))^2)
    /(length(Familiarity)-3))
  > 1-(correctedResVar/var(Familiarity))
  [1] 0.6386856

- Relation between the unadjusted and adjusted R²:

  > m<-length(Familiarity)
  > n<-3
  > 1-((1-unadjustedRsquared)*((m-1)/(m-n)))
  [1] 0.6386856

The summary line by line

- Finally, we see an F value:

  F-statistic: 4037 on 2 and 4565 DF, p-value: < 2.2e-16

- This is a test of the hypothesis that our model is significantly better at explaining variance than an “empty” model that predicts the dependent variable mean for all data points: not a terribly interesting result
- This is different from the F-test we use in step-wise model building, as discussed above and illustrated below
- But it is the same as the ANOVA test with categorical independent variables; try:

  > summary(lm(Familiarity~WordCategory))
  > summary(aov(Familiarity~WordCategory))

Interactions

- Is there an interaction between the effect of word frequency and that of word category?

  > modelInteraction<-lm(Familiarity~WrittenFrequency
    +WordCategory+WrittenFrequency:WordCategory)

- Equivalent shorthand notation I don't particularly like:

  > modelInteraction<-lm(Familiarity~WrittenFrequency*WordCategory)

- What does the summary() tell us?

Comparing nested models with anova()

- Incremental model construction confirms what we found out by looking at the significant β coefficients in the largest model, i.e., we should keep both the frequency and category terms in the model, whereas we don't have strong evidence for an effect of their interaction:

  > modelFrequency<-lm(Familiarity~WrittenFrequency)
  > modelCategory<-lm(Familiarity~WordCategory)
  > anova(modelFrequency,modelFrequencyCategory)
  > anova(modelCategory,modelFrequencyCategory)
  > anova(modelFrequencyCategory,modelInteraction)

Comparing nested models with anova()

- Recall that we compare models by computing the F statistic as follows:

  F = ((SSsmaller − SSlarger) / (nlarger − nsmaller)) / (SSlarger / (m − nlarger))

  with nlarger − nsmaller and m − nlarger degrees of freedom

- Let us compute the relevant values for the comparison of modelFrequencyCategory and modelInteraction

Comparing nested models with anova()

- Sums of squared residuals for the smaller and larger models:

  > ss_smaller<-sum(residuals(modelFrequencyCategory)^2)
  > ss_larger<-sum(residuals(modelInteraction)^2)

- Total data points, parameters estimated in the two models:

  > m<-length(Familiarity)
  > n_smaller<-length(modelFrequencyCategory$coefficients)
  > n_larger<-length(modelInteraction$coefficients)

- The F statistic:

  > myF<-((ss_smaller-ss_larger)/(n_larger-n_smaller))
    /(ss_larger/(m-n_larger))

- The corresponding p-value:

  > 1-pf(myF,n_larger-n_smaller,m-n_larger)

- Compare the results (and also the various quantities we computed along the way) with the output of anova()

What to report

- After you have picked the model you want to focus on, done all the relevant checks, given some serious thought to your statistical model in the light of your theoretical model, etc., you might consider reporting the following results in your paper:
  - The number of data points used to estimate the model
  - The model (mention also whether you explored other smaller or larger models via F-tests for nested models)
  - The β values with their standard errors (marking the significant ones)
  - The R² or adjusted R² value (assuming they are reasonably similar)

Outline

Linear models

Ordinary linear regression

Ordinary linear regression in R
  Preliminaries
  Model building and estimation
  Exploring the model
  Practice

A further look at the lm object

- The lm object stores many goodies – you can see the full list in the Value section of the lm() documentation
- Among other things, you can extract coefficients (weights), residuals and fitted values:

  > coefficients(modelFrequencyCategory)
  > modelFrequencyCategory$coefficients
  > modelFrequencyCategory$coefficients[1]
  > head(fitted.values(modelFrequencyCategory))
  > head(modelFrequencyCategory$fitted.values)
  > head(residuals(modelFrequencyCategory))
  > head(modelFrequencyCategory$residuals)

- Use these data to make a plot of residuals by fitted values
- How does it look?

Practice with the lm object

- Use the fitted values and Familiarity to create a residuals vector; check that it is (nearly) identical to residuals(modelFrequencyCategory) (small differences might be due to rounding errors)
- Use the coefficients from the model and the WrittenFrequency and WordCategoryV vectors to recreate the fitted values; then compare them with the fitted values stored in the lm object

The fitted values by matrix multiplication

  > modelFrequencyCategory<-lm(Familiarity~WrittenFrequency
    +WordCategory,x=T,y=T)
  > myFitted<-modelFrequencyCategory$x %*%
    modelFrequencyCategory$coefficients

Plotting the model lines

- This is relatively straightforward here, since we have only one continuous and one two-level discrete independent variable:

  # plot familiarity as a function of frequency, separately
  # for nouns and verbs
  > plot(WrittenFrequency[WordCategory=="N"],
         Familiarity[WordCategory=="N"],col="red",pch=1,cex=.5)
  > points(WrittenFrequency[WordCategory=="V"],
           Familiarity[WordCategory=="V"],col="green",pch=1,cex=.5)
  # the noun line: beta0 + beta1*frequency
  > abline(a=modelFrequencyCategory$coefficients[1],
           b=modelFrequencyCategory$coefficients[2],col="red")
  # the verb line: beta0 + beta2 + beta1*frequency
  > vIntercept<-modelFrequencyCategory$coefficients[1]+
    modelFrequencyCategory$coefficients[3]
  > abline(a=vIntercept,
           b=modelFrequencyCategory$coefficients[2],col="green")

Plotting the lines with the interaction

  # plot familiarity as a function of frequency, separately
  # for nouns and verbs
  > plot(WrittenFrequency[WordCategory=="N"],
         Familiarity[WordCategory=="N"],col="red",pch=1,cex=.5)
  > points(WrittenFrequency[WordCategory=="V"],
           Familiarity[WordCategory=="V"],col="green",pch=1,cex=.5)
  # the noun line: beta0 + beta1*frequency
  > abline(a=modelInteraction$coefficients[1],
           b=modelInteraction$coefficients[2],col="red")
  # the verb line: beta0 + beta2 + (beta1+beta3)*frequency
  > vIntercept<-modelInteraction$coefficients[1]+
    modelInteraction$coefficients[3]
  > vSlope<-modelInteraction$coefficients[2]+
    modelInteraction$coefficients[4]
  > abline(a=vIntercept,b=vSlope,col="green")

Bootstrap validation

- We need to use a different function (part of the Design library, to be loaded) to fit a linear model that we can then use with the validate() function (also part of the Design library):

  > library(Design)
  > modelFrequencyCategory<-ols(Familiarity~WrittenFrequency
    +WordCategory,x=TRUE,y=TRUE)

- The x and y arguments tell the function to store the original independent variable matrix (x) and the dependent variable values (y) in the resulting object; these are necessary to perform the validation experiments
- We repeat the bootstrapping procedure 1,000 times:

  > validate(modelFrequencyCategory,B=1000)

Bootstrap validation

- We only look at the bootstrap estimates of R²
- Different simulations will produce slightly different results, but the training column reports the average R² computed on the same bootstrapped sets used for estimation, whereas test reports the average R² of the bootstrapped models as computed on the whole data set
- The optimism column is the difference between these two values, and index.corrected is the original R² minus this optimism score
- The larger the optimism (or, equivalently, the larger the difference between the original and corrected R²), the more the model is overfitting the training data, and the worse it will be at predicting unseen data

Outline

Linear models

Ordinary linear regression

Ordinary linear regression in R
  Preliminaries
  Model building and estimation
  Exploring the model
  Practice

Roberta Sellaro's telescoping data

- From ongoing work by Roberta Sellaro, Roberto Cubelli and Lorella Fiorino
- Telescoping in Roberta's words: in dating tasks, events tend to be judged more recent than they really are. Various factors might affect this “telescoping” effect, including how familiar an event is, how far it is in time, etc.

Fields in the telescoping.txt file

  subj.id          unique subject identifier
  sex              sex of subject
  age              age of subject
  item.id          unique event identifier
  event            event description (in Italian)
  date.error       difference (in years) between subject-assigned and real date
                   (negative: event is perceived as older than it really is;
                   positive: event is perceived as more recent)
  knowledge        self-reported degree of knowledge of the event on a 0-9 scale
  emo.involvement  self-reported degree of emotional involvement in the event
                   on a 1-7 scale
  event.markers    self-reported assessment of the presence of later related
                   events (e.g., trial for a murder) on a 0-5 scale
  time             1 for remote, 2 for recent events

Analyze the telescoping data

- Try various linear models with date.error as the dependent variable
- Look at the models' fit, look at the βs and think about what they mean, explore some interactions, compare nested models
- Try some variable transformations: time as a categorical dummy-coded variable, abs(date.error) as an alternative dependent variable (how can we interpret it?), etc.
- Inspect the distribution of residuals
- Run a bootstrap validation

Elena Nava's cochlear implant data

- Fields in the cochlear.txt file:

  source    mono or bilateral
  session   1st, 2nd, 3rd or 4th meeting with the subject
  stimulus  position of the stimulus wrt the subject, in degrees
  response  response, in degrees
  error     absolute difference, in degrees, between stimulus and response

- For stimulus and response, possible values are 30, 60, 120, 150 (source on the right, degrees measured clockwise from 0 in front of the subject) and -30, -60, -120, -150 (source on the left, negative degrees measured counterclockwise from 0 in front of the subject)
- Possible error values: 0, 30, 60, 90, 120, 150, 180
- It is not entirely appropriate to treat this as a continuous dependent variable!

Decomposing the stimulus

- Possible stimulus values are 30, 60, 120, 150 (source on the right, degrees measured clockwise from 0 in front of the subject) and -30, -60, -120, -150 (source on the left, negative degrees measured counterclockwise from 0 in front of the subject)
- The ears are at -90 and 90 degrees
- We can break down the stimulus into more primitive variables:
  - Is the stimulus presented from the left (< 0) or the right (> 0)?
  - Is the stimulus in front of (|stim| < 90) or behind (|stim| > 90) the subject?
  - Is the stimulus “nearer” to (60 ≤ |stim| ≤ 120) or further away from the ears?

Decomposing the stimulus

[Figure: diagram of stimulus positions around the subject, partitioned into left/right, front/back, and near/far regions]

Decomposing the stimulus

  > stimlr<-sapply(stimulus,function(x)
      if (x<0) "left" else "right")
  > stimfb<-sapply(stimulus,function(x)
      if (abs(x)<90) "front" else "back")
  > stimdist<-sapply(stimulus,function(x)
      if ((abs(x)==60) || (abs(x)==120)) "near" else "far")
  # we also scale the dependent variable to the 0-6 range
  > normError<-error/30

Analyze!

- Try various linear models of normError using (a selection of) the following independent variables: source, session, stimlr, stimdist, stimfb
- You might get a few “variable 'X' converted to a factor” warnings: to avoid them, wrap the variable-creation commands above in the as.factor() function
- Look at the models' fit, look at the βs and think about what they mean, explore some interactions, compare nested models, relevel some variables (you will have to use as.factor() to be able to relevel the variables created above)
- Inspect the distribution of residuals (not very reassuring)
- Run a bootstrap validation