
Page 1:

Basics of regression analysis I

• Purpose of linear models

• Least-squares solution for linear models

• Analysis of diagnostics

Page 2:

Reason for linear models

The purpose of regression is to reveal statistical relations between input and output variables. Statistics cannot reveal the functional relationship itself; that is the task of other scientific studies. Statistics can, however, help to validate various proposed functional relationships (models). Let us assume that we suspect the functional relationship has the form:

where β is a vector of unknown parameters, x = (x1, x2, ..., xp) is a vector of controllable parameters, y is the output, and ε is the error associated with the experiment. We can then set up experiments for various values of x and obtain the output (or response) for each of them. If the number of experiments is n, we will have n output values; denote them as a vector y = (y1, y2, ..., yn). The purpose of statistics is to estimate the parameter vector using the input and output values. If the function f is linear in the parameters and the errors are additive, then we are dealing with a linear model. For this model we can write:

A linear model is linear in the parameters but not necessarily in the input variables. For example, the second equation below is a linear model, but the third is not.

y = f (x,β ,ε)

yi = β1 xi1 + β2 xi2 + ... + βp xip + εi

y = β0 + β1 x1 + β2 x1 x2 + ε

y = β0 + β1 x1 + β1^2 x2 + ε

Page 3:

Assumptions

Basic assumptions for analysis of linear model are:

1) the model is linear in parameters

2) the error structure is additive

3) Random errors have 0 mean, equal variances and they are uncorrelated.

These assumptions are sufficient to deal with linear models. The assumption of uncorrelated errors with equal variances (number 3) can be relaxed, but then the treatment becomes a little more complicated.

Note that the assumption of normally distributed errors is not used for the general solution; it is needed only to design test statistics. If this assumption does not hold, we can use the bootstrap to construct test statistics.

These assumptions can be written in a vector form:

where y, ε and 0 are vectors and X is a matrix, called the design matrix (or input matrix). I is the n×n identity matrix.

y = Xβ + ε,   E(ε) = 0,   V(ε) = σ^2 I

Page 4:

One parameter case

When we have one predictor variable (x) and one response variable, the problem becomes considerably simpler. In this case we have:

Now let us assume that we have observations for n values of x, (x1, ..., xn), and that the response vector is (y1, ..., yn). If we assume that the errors are independent, have equal variances and are normally distributed, then maximum likelihood reduces to least squares:

Solving this minimisation problem gives the following estimates of the coefficients:

y = β0 + β1 x + ε

Σ_{i=1..n} εi^2 = Σ_{i=1..n} (yi − (β0 + β1 xi))^2 → min

β̂0 = (Σyi Σxi^2 − Σxi Σxi yi) / (n Σxi^2 − (Σxi)^2)

β̂1 = (n Σxi yi − Σyi Σxi) / (n Σxi^2 − (Σxi)^2)
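As a quick numerical check, these closed-form estimates can be computed directly in R and compared with the output of lm(); a minimal sketch with simulated data:

# closed-form least-squares estimates vs lm(), using simulated data
set.seed(1)
x <- runif(20, 0, 10)
y <- 2 + 0.5 * x + rnorm(20, sd = 0.3)
n <- length(x)
beta1 <- (n * sum(x * y) - sum(y) * sum(x)) / (n * sum(x^2) - sum(x)^2)
beta0 <- (sum(y) * sum(x^2) - sum(x) * sum(x * y)) / (n * sum(x^2) - sum(x)^2)
c(beta0, beta1)
coef(lm(y ~ x))      # should give the same two values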

Page 5:

One parameter case

If we divide the numerator and the denominator by n^2 and use the definitions of correlation and standard deviation, then the second equation can be written as:

The slope of the line is the correlation between the input and output variables multiplied by the ratio of their standard deviations.

If the slope is equal to 0 (i.e. the input and output variables are uncorrelated), then the intercept becomes the mean value of the observations.

β̂1 = ρ(x, y) s(y) / s(x)
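This relation is easy to verify numerically; a small sketch with simulated data:

set.seed(1)
x <- runif(20, 0, 10)
y <- 2 + 0.5 * x + rnorm(20, sd = 0.3)
cor(x, y) * sd(y) / sd(x)    # rho(x, y) * s(y) / s(x)
coef(lm(y ~ x))[2]           # equals the least-squares slope estimate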

Page 6:

Solution for the general case

The least-squares solution for the linear model under the given assumptions is:

Let us show this. Using the form of the model, we write the least-squares criterion (since we want the solution with the minimum least-squares error):

Taking the first derivative with respect to β and solving the resulting equation shows that this solution is correct.

If we use the formula for the solution and the expression of y then we can write:

So the solution is unbiased. The variance of the estimator is:

Here we used the form of the solution and assumption number 3.

β̂ = (X^T X)^(-1) X^T y

S = ε^T ε = (y − Xβ)^T (y − Xβ) → min

∂S/∂β = 2(X^T Xβ − X^T y) = 0  ⇒  β̂ = (X^T X)^(-1) X^T y

E(β̂) = E((X^T X)^(-1) X^T y) = E((X^T X)^(-1) X^T (Xβ + ε)) = E((X^T X)^(-1) X^T Xβ + (X^T X)^(-1) X^T ε) = E(β) + (X^T X)^(-1) X^T E(ε) = β

V(β̂) = E((β̂ − β)(β̂ − β)^T) = E((X^T X)^(-1) X^T ε ε^T X (X^T X)^(-1)) = (X^T X)^(-1) X^T E(ε ε^T) X (X^T X)^(-1) = σ^2 (X^T X)^(-1)
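A small numerical illustration of the normal-equations solution; a sketch with simulated data (note that lm() itself uses a QR decomposition rather than forming X^T X explicitly):

# normal-equations solution compared with lm(), using simulated data
set.seed(2)
n <- 30
X <- cbind(1, runif(n), runif(n))            # design matrix including an intercept column
y <- drop(X %*% c(1, 2, -1)) + rnorm(n, sd = 0.2)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)    # (X^T X)^(-1) X^T y
drop(beta_hat)
coef(lm(y ~ X[, 2] + X[, 3]))                # the same estimates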

Page 7:

Variance

To calculate the covariance matrix of the estimated parameters we need to be able to estimate σ^2. Since it is the variance of the error term, we can find it using the form of the solution. For the estimated errors (denoted by r) we can write:

If we use:

It gives

Since the matrix M is idempotent and symmetric, i.e. M^2 = M = M^T, we can write:

where n is the number of observations and p is the number of fitted parameters. Then for an unbiased estimator of the variance of the residuals we can write:

r = y − Xβ̂ = y − X(X^T X)^(-1) X^T y = (I − X(X^T X)^(-1) X^T) y

y = Xβ + ε

r = (I − X(X^T X)^(-1) X^T) ε = Mε

E(r^T r) = E(ε^T Mε) = E(tr(ε^T Mε)) = E(tr(M ε ε^T)) = tr(M E(ε ε^T)) = σ^2 tr(M) = σ^2 (tr(I) − tr(X(X^T X)^(-1) X^T)) = σ^2 (n − tr((X^T X)^(-1) X^T X)) = σ^2 (n − p)

σ̂^2 = (y − Xβ̂)^T (y − Xβ̂) / (n − p)
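In R, this unbiased estimate corresponds to the square of the residual standard error reported by summary(); a small sketch using the built-in dataset women:

fit <- lm(weight ~ height, data = women)
n <- nrow(women)
p <- length(coef(fit))
sum(residuals(fit)^2) / (n - p)    # unbiased estimate of sigma^2
summary(fit)$sigma^2               # the same value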

Page 8:

Singular case: SVD and pseudoinversion

The above form of the solution is valid only if the matrices X and X^T X are non-singular, i.e. the rank of the matrix X is equal to the number of parameters. If this is not true, then either singular value decomposition or eigenvalue filtering techniques are used. Fortunately, most of the good properties of the linear model remain.

Singular value decomposition (SVD): any n×p matrix can be decomposed in the form:

where U is an n×n and V a p×p orthogonal matrix (the inverse is equal to the transpose), and D is an n×p diagonal matrix of the singular values. If X is singular, then the number of non-zero diagonal elements of D is less than p. Then for X^T X we can write:

D^T D is a p×p diagonal matrix. If this matrix is non-singular then we can write:

Since D^T D is a diagonal matrix, its inverse is also a diagonal matrix. The main trick used in the SVD technique for solving the equations is that when diagonal elements are 0 (or close to 0), zero is used instead of their inverse, i.e. a pseudoinverse is calculated using:

X = UDV

X^T X = (UDV)^T UDV = V^T D^T U^T U D V = V^T D^T D V

(X^T X)^(-1) = V^T (D^T D)^(-1) V

((D^T D)^(-1))*_ii = 0, if the corresponding diagonal element of D^T D is 0
((D^T D)^(-1))*_ii = 1/(D^T D)_ii, otherwise
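A minimal sketch of this idea in R, with an artificially rank-deficient design and an assumed tolerance for treating singular values as zero:

# pseudoinverse solution via SVD, zeroing the (near-)zero singular values
set.seed(3)
X <- cbind(1, runif(20), runif(20))
X <- cbind(X, X[, 2] + X[, 3])              # fourth column duplicates information: X is rank-deficient
y <- drop(X %*% c(1, 2, -1, 0)) + rnorm(20, sd = 0.1)
s <- svd(X)                                 # R returns X = U diag(d) V^T
tol <- 1e-8                                 # assumed tolerance for "zero" singular values
d_inv <- ifelse(s$d > tol, 1 / s$d, 0)      # invert only the non-zero singular values
beta_hat <- s$v %*% (d_inv * (t(s$u) %*% y))
drop(beta_hat)                              # one particular (minimum-norm) solution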

Page 9:

Singular case: Ridge regression

Another technique to deal with the singular case is ridge regression. In this technique a constant value is added to the diagonal terms of X^T X before inverting it.

Mathematically it is equivalent to Tikhonov regularisation for ill-posed problems.

One of the problems in this technique is how to find the regularisation parameter δ. A value for this parameter can be found using cross-validation; in practice this is usually done by trial and error.

β̂ = (X^T X + δI)^(-1) X^T y
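A sketch of the ridge estimator for the same kind of rank-deficient design as in the SVD example; the value of δ here is an arbitrary choice for illustration:

# ridge estimate for a rank-deficient design
set.seed(3)
X <- cbind(1, runif(20), runif(20))
X <- cbind(X, X[, 2] + X[, 3])              # rank-deficient: X^T X is singular
y <- drop(X %*% c(1, 2, -1, 0)) + rnorm(20, sd = 0.1)
delta <- 0.1                                # assumed regularisation parameter
beta_ridge <- solve(t(X) %*% X + delta * diag(ncol(X)), t(X) %*% y)
drop(beta_ridge)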

Page 10:

Standard diagnostics

Before starting to model:

1) Visualisation of data: plotting predictors vs observations. These plots may give a clue about the relationship and reveal outliers.

2) Smoothers.

After modelling and fitting:

3) Fitted values vs residuals. This may help to identify outliers and check the correctness of the model.

4) Normal Q-Q plot of residuals. This may help to check the distributional assumptions.

5) Cook's distance: reveals outliers and checks the correctness of the model.

6) Model assumptions: t tests given by the summary of lm.

Checking the model and designing tests:

7) Cross-validation. If you have a choice of models, cross-validation may help to choose the "best" model.

8) Bootstrap. The validity of the model can be checked if the distribution of the statistic of interest is available; otherwise these distributions can be generated using the bootstrap.

Page 11:

Visualisation prior to modelling

Different types of dataset may require different visualisation tools. For simple visualisation either plot(data) or pairs(data, panel=panel.smooth) can be used. Visualisation prior to modelling may help to propose a model (the form of the functional relationship between input and output, the probability distribution of the observations, etc.).

For example, consider the dataset women, where weights and heights for 15 cases have been measured. The plot and pairs commands produce the plots shown on the slide.
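For reference, the commands mentioned above applied to this dataset:

data(women)                           # built-in dataset: heights and weights for 15 cases
plot(women)
pairs(women, panel = panel.smooth)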

Page 12:

After modelling: linear models

After modelling, the results should be analysed. For example:

attach(women)

lm1 = lm(weight~height)

This means that we want a linear model (we believe that the dependence of weight on height is linear):

weight = β0 + β1*height

The results can be viewed using:

lm1

summary(lm1)

The last command will also produce significance levels for the various coefficients. Significance levels produced by summary should be considered carefully: if there are many coefficients, then the chance that at least one "significant" effect is observed by chance is very high.

Page 13:

After modelling: linear models

It is a good idea to plot the data, the fitted model and the differences between the fitted and observed values on the same graph. For linear models with one predictor this can be done using:

plot(height, weight)

abline(lm1)

segments(height, fitted(lm1), height, weight)

This plot already shows some systematic differences, which is an indication that the model may need to be revised.

Page 14:

Checking validity of the model: standard tools

Plotting fitted values vs residuals, the Q-Q plot and Cook's distance can give some insight into the model and how to improve it. Some of these plots can be produced using:

plot(lm1)
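To show the standard diagnostic plots together, the plotting region can be split first (a common idiom; the layout choice here is just an illustration):

par(mfrow = c(2, 2))    # 2 x 2 grid for the standard diagnostic plots
plot(lm1)
par(mfrow = c(1, 1))    # restore the default layout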

Page 15:

Confidence and prediction bands

Confidence bands are bands within which the true line lies with probability 1−α:

Prediction bands are bands within which a future observation will fall with probability 1−α:

Prediction bands are wider than confidence bands.

P( f(β̂, x) − c1(x) < f(β, x) < f(β̂, x) + c1(x) ) = 1 − α

P( f(β̂, x) − c2(x) < y_future < f(β̂, x) + c2(x) ) = 1 − α

Page 16:

Prediction and confidence bands

lm1 = lm(height~weight)

pp = predict(lm1,interval='p')

pc = predict(lm1,interval='c')

plot(weight,height,ylim=range(height,pp))

n1=order(weight)

matlines(weight[n1],pp[n1,],lty=c(1,2,2),col='red')

matlines(weight[n1],pc[n1,],lty=c(1,3,3),col='red')

These commands produce two sets of bands, narrow and wide. The narrow band is the confidence band and the wide band is the prediction band.

Page 17:

Bootstrap

The simplest application of the bootstrap to this problem is as follows:

1) Calculate the residuals using

2) Sample with replacement from the residual vector; denote the resampled vector r_random

3) Design new “observations” using

4) Estimate the parameters

5) Repeat steps 2, 3 and 4

6) Compute the bootstrap estimates, variances, covariance matrix or the distribution

Another technique for bootstrapping is to resample the observations together with the corresponding rows of the design matrix, i.e. (yi, x1i, x2i, ..., xpi), i = 1,...,n. This is meant to be less sensitive to misspecified models.

Note that for some samples the matrix may become singular and the problem may become ill defined. If this happens, then ridge regression or similar techniques need to be used.

r = y − Xβ̂

y = Xβ̂ + r_random
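A minimal sketch of this residual bootstrap in R for the women data; the number of replicates and the object names are illustrative choices:

# residual bootstrap for a simple linear model
fit <- lm(weight ~ height, data = women)
r <- residuals(fit)
B <- 1000                                    # number of bootstrap replicates (arbitrary)
boot_coef <- matrix(NA, B, 2)
for (b in 1:B) {
  y_star <- fitted(fit) + sample(r, replace = TRUE)   # new "observations"
  boot_coef[b, ] <- coef(lm(y_star ~ women$height))
}
apply(boot_coef, 2, sd)                      # bootstrap standard errors of the coefficients
cov(boot_coef)                               # bootstrap covariance matrix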

Page 18:

Bootstrap prediction lines

The bootstrap can be used to plot the distribution of the fitted model. A function for this is available at:

http://www.ysbl.york.ac.uk/~garib/mres_course/2008/boot_lm_lines.r

If we apply this function to the dataset women, we get the lines shown in the figure on the slide.

boot_lm(women,flm0,1000)

Functions boot_lm and flm0 are available from the course’s website

Page 19:

Analysis of diagnostics

Residuals and hat matrix: residuals are the differences between the observed and fitted values:

H is called the hat matrix. Its diagonal terms hi are the leverages of the observations. If one of these values is close to one, then the corresponding fitted value is determined almost entirely by that observation. Sometimes hi' = hi/(1 − hi) is used to enhance high leverages.

A Q-Q plot can be used to check the normality assumption. If the assumption about the distribution is correct, then this plot should be nearly linear. If the distribution is normal, then tests designed for normal distributions can be used; otherwise the bootstrap may be more useful for deriving the desired distributions.

r = y − ŷ = y − Xβ̂ = y − X(X^T X)^(-1) X^T y = (I − X(X^T X)^(-1) X^T) y = (I − H) y
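In R the leverages can be extracted directly from a fitted model; a short sketch for the women example:

fit <- lm(weight ~ height, data = women)
h <- hatvalues(fit)                 # leverages: diagonal of the hat matrix H
h / (1 - h)                         # "enhanced" leverages hi' = hi/(1 - hi)
qqnorm(residuals(fit)); qqline(residuals(fit))    # Q-Q plot of the residuals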

Page 20:

Analysis of diagnostics: cont.

Other analysis tools include:

where hi is the leverage, hi' is the enhanced leverage, s^2 is the unbiased estimator of σ^2, and si^2 is the unbiased estimator of σ^2 after removal of the i-th observation.

ri' = ri / (s (1 − hi)^(1/2))        standardised (studentised) residual
ri* = ri / (si (1 − hi)^(1/2))       externally standardised residual
Ci = (hi')^(1/2) ri*                 DFFITS
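These quantities are available directly in R; a sketch for the women example:

fit <- lm(weight ~ height, data = women)
rstandard(fit)        # internally standardised (studentised) residuals ri'
rstudent(fit)         # externally standardised residuals ri*
dffits(fit)           # DFFITS
cooks.distance(fit)   # Cook's distance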

Page 21:

Page 22:

R-squared

One of the statistics used for goodness of fit is R-squared. It is defined as:

Adjusted R-squared:

n - the number of observations, p - the number of parameters

R^2 = 1 − SSe/SSt,  where SSe = Σ(yi − ŷi)^2, SSt = Σ(yi − ȳ)^2, and ŷi are the fitted values.

R^2_adj = 1 − (1 − R^2)(n − 1)/(n − p − 1)
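Both quantities are reported by summary(); they can also be computed from their definitions as a check, for example for the women fit:

fit <- lm(weight ~ height, data = women)
summary(fit)$r.squared
summary(fit)$adj.r.squared
# R-squared from its definition
SSe <- sum(residuals(fit)^2)
SSt <- sum((women$weight - mean(women$weight))^2)
1 - SSe / SSt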

Page 23:

Several potential problems with linear models and simple regression analysis

1) Interpretation: regression analysis can tell us that there is a relationship between input and output; it does not say that one causes the other.

2) Outliers: use robust M-estimators.

3) Singular or near-singular cases: ridge regression or SVD. In underdetermined systems partial least squares may be helpful.

4) Distribution assumptions: generalised linear models, random or mixed effects models, data transformation.

5) Non-linear models: Box-Cox transformation, or non-linear model fitting, for example using the R command nlm.

Page 24:

References

1. Berthold, M. and Hand, D.J. (2003). Intelligent Data Analysis.

2. Stuart, A., Ord, J.K. and Arnold, S. (1991). Kendall's Advanced Theory of Statistics, Volume 2A: Classical Inference and the Linear Model. Arnold, London.

3. Dalgaard, P. (2002). Introductory Statistics with R. Springer.