Regression
By Jugal Kalita, based on Chapter 17 of Chapra and Canale, Numerical Methods for Engineers
Copyright © 2006 The McGraw-Hill Companies, Inc. Permission required for reproduction or display.


TRANSCRIPT

Page 1:

Regression

By Jugal Kalita, based on Chapter 17 of Chapra and Canale, Numerical Methods for Engineers

Page 2:

Regression

• Describes techniques to fit curves (curve fitting) to discrete data to obtain intermediate estimates.

• There are two general approaches to curve fitting:
  – Data exhibit a significant degree of scatter. The strategy is to derive a single curve that represents the general trend of the data.
  – Data are very precise. The strategy is to pass a curve or a series of curves through each of the points.

• Two types of applications:
  – Trend analysis. Predicting values of the dependent variable; may include extrapolation beyond data points or interpolation between data points.
  – Hypothesis testing. Comparing an existing mathematical model with measured data.

Page 3:

Page 4:

Mathematical Background: Simple Statistics

• In the course of a study, if several measurements are made of a particular quantity, additional insight can be gained by summarizing the data in one or more well-chosen statistics that convey as much information as possible about specific characteristics of the data set.

• These descriptive statistics are most often selected to represent:
  – the location of the center of the distribution of the data,
  – the degree of spread of the data.

Page 5:

• Arithmetic mean. The sum of the individual data points (y_i) divided by the number of points (n):

  ȳ = (Σ y_i) / n,   i = 1, …, n

• Standard deviation. The most common measure of spread for a sample:

  s_y = sqrt( S_t / (n − 1) ),   where S_t = Σ (y_i − ȳ)²

Page 6:

• Variance. Representation of spread by the square of the standard deviation:

  s_y² = Σ (y_i − ȳ)² / (n − 1)

• Coefficient of variation. Quantifies the spread of the data relative to its central value:

  c.v. = (s_y / ȳ) × 100%

Note: the n − 1 in the denominator is the number of degrees of freedom.
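The statistics on the last two slides can be sketched in a few lines of Python (the sample values below are illustrative, not from the slides):

```python
import math

y = [6.495, 6.595, 6.615, 6.635, 6.485, 6.555]   # illustrative sample
n = len(y)

ybar = sum(y) / n                                # arithmetic mean
St = sum((yi - ybar) ** 2 for yi in y)           # total sum of squares about the mean
sy = math.sqrt(St / (n - 1))                     # standard deviation (n - 1 dof)
var = St / (n - 1)                               # variance = sy**2
cv = sy / ybar * 100                             # coefficient of variation, percent
```

Dividing by n − 1 rather than n reflects the degrees of freedom noted above: one degree is consumed by estimating ȳ from the same data.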

Page 7:

Page 8:

1 standard deviation on either side of the mean includes about 68% of the data; 2 standard deviations, about 95.5%; 3 standard deviations, about 99.7% of the data.

Page 9:

Least Squares Regression

Linear Regression

• Fitting a straight line to a set of paired observations: (x1, y1), (x2, y2), …, (xn, yn):

  y = a0 + a1 x + e

  where a1 is the slope, a0 is the intercept, and e is the error, or residual, between the model and the observations.

Page 10:

Criteria for a “Best” Fit

• Minimize the sum of the residual errors for all available data:

  Σ_{i=1..n} e_i = Σ_{i=1..n} (y_i − a0 − a1 x_i)

  where n = total number of points.

• However, this is an inadequate criterion, and so is the sum of the absolute values:

  Σ_{i=1..n} |e_i| = Σ_{i=1..n} |y_i − a0 − a1 x_i|

Page 11:

Figure 17.2

a) Minimize the sum of residuals, b) minimize the sum of absolute values of residuals, c) minimize the maximum error of any individual point.

Page 12:

• Best strategy is to minimize the sum of the squares of the residuals between the measured y and the y calculated with the linear model:

  S_r = Σ_{i=1..n} e_i² = Σ_{i=1..n} (y_i,measured − y_i,model)² = Σ_{i=1..n} (y_i − a0 − a1 x_i)²

• Yields a unique line for a given set of data.

Page 13:

Least-Squares Fit of a Straight Line

Setting the partial derivatives of S_r with respect to each coefficient to zero:

  ∂S_r/∂a0 = −2 Σ (y_i − a0 − a1 x_i) = 0
  ∂S_r/∂a1 = −2 Σ [(y_i − a0 − a1 x_i) x_i] = 0

gives the normal equations, which can be solved simultaneously:

  n a0 + (Σ x_i) a1 = Σ y_i
  (Σ x_i) a0 + (Σ x_i²) a1 = Σ x_i y_i

with solution

  a1 = (n Σ x_i y_i − Σ x_i Σ y_i) / (n Σ x_i² − (Σ x_i)²)
  a0 = ȳ − a1 x̄   (mean values)
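The closed-form solution above translates directly into code. A minimal Python sketch (the data points are illustrative, not from the slides):

```python
def linear_fit(x, y):
    """Least-squares straight line via the closed-form normal-equation solution."""
    n = len(x)
    sx, sy_ = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    a1 = (n * sxy - sx * sy_) / (n * sxx - sx ** 2)   # slope
    a0 = sy_ / n - a1 * sx / n                        # intercept = ybar - a1*xbar
    return a0, a1

# Data lying exactly on y = 2 + 3x should be recovered exactly.
a0, a1 = linear_fit([0.0, 1.0, 2.0, 3.0], [2.0, 5.0, 8.0, 11.0])
```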

Page 14:

Figure 17.3

The residual in linear regression represents the vertical distance between a data point and the straight line.

Page 15:

Figure 17.4

a) Shows the spread of the data about the mean point. b) Shows the spread of the data around the best-fit line. The reduction in spread in going from a) to b) indicates that the best-fit line describes the tendency of the data better.

Page 16:

Figure 17.5

Linear regression with a) small and b) large residual errors.

Page 17:

“Goodness” of Our Fit

If
• the total sum of the squares around the mean for the dependent variable, y, is S_t, and
• the sum of the squares of residuals around the regression line is S_r,

then S_t − S_r quantifies the improvement, or error reduction, due to describing the data in terms of a straight line rather than as an average value:

  r² = (S_t − S_r) / S_t

where r² is the coefficient of determination, and sqrt(r²) = r is the correlation coefficient.

Page 18:

•  For a perfect fit Sr=0 and r=r2=1, signifying that the line explains 100 percent of the variability of the data.

•  For r=r2=0, Sr=St, the fit represents no improvement.

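The definition r² = (S_t − S_r)/S_t can be sketched directly (illustrative data, not from the slides):

```python
def r_squared(x, y, a0, a1):
    """Coefficient of determination for the line y = a0 + a1*x."""
    ybar = sum(y) / len(y)
    St = sum((yi - ybar) ** 2 for yi in y)                          # spread about the mean
    Sr = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))    # spread about the line
    return (St - Sr) / St

# A perfect fit (Sr = 0) gives r^2 = 1, matching the slide.
r2 = r_squared([0.0, 1.0, 2.0], [1.0, 3.0, 5.0], a0=1.0, a1=2.0)
```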

Page 19:

Linearization of Nonlinear Relationships

• It is possible to linearize some nonlinear relationships.

• Figure 17.9: a) exponential equation, b) power equation, c) saturation-growth-rate equation.

Page 20:

Common Nonlinear Relations (cont'd):

(2) Power-like curve:

  y = a2 x^b2,   linearized as ln(y) = ln(a2) + b2 ln(x)

(3) Saturation-growth-rate curve (e.g., population growth under limiting conditions):

  y = a3 x / (b3 + x),   linearized as 1/y = 1/a3 + (b3/a3) (1/x)

Be careful about the implied distribution of the errors: always use the untransformed values for error analysis.

(Plot: saturation-growth-rate curves for a3 = 5, b3 = 1..10.)
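As a sketch of the power-curve case (2), one can regress ln(y) on ln(x) and recover a2 and b2; the data below are synthetic (y = 2 x^1.5, not from the slides). Per the caution above, the fit is then judged with residuals in the original y units:

```python
import math

x = [1.0, 2.0, 4.0, 8.0]
y = [2.0 * xi ** 1.5 for xi in x]          # synthetic power-law data

# Linearize: ln(y) = ln(a2) + b2 * ln(x), then fit a straight line.
X = [math.log(xi) for xi in x]
Y = [math.log(yi) for yi in y]
n = len(X)
b2 = (n * sum(a * b for a, b in zip(X, Y)) - sum(X) * sum(Y)) / \
     (n * sum(a * a for a in X) - sum(X) ** 2)
a2 = math.exp(sum(Y) / n - b2 * sum(X) / n)

# Error analysis on untransformed values, as the slide advises.
Sr = sum((yi - a2 * xi ** b2) ** 2 for xi, yi in zip(x, y))
```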

Page 21:

Polynomial Regression

• Some data are poorly represented by a straight line. For these cases a curve is better suited to fit the data. The least-squares method can readily be extended to fit the data with higher-order polynomials (Sec. 17.2).

Page 22:

General Linear Least Squares

  y = a0 z0 + a1 z1 + a2 z2 + … + am zm + e

where z0, z1, …, zm are m + 1 basis functions. In matrix form:

  {Y} = [Z]{A} + {E}

where [Z] is the matrix of calculated values of the basis functions at the measured values of the independent variable, {Y} is the observed values of the dependent variable, {A} is the unknown coefficients, and {E} is the residuals. The sum of the squares of the residuals,

  S_r = Σ_{i=1..n} ( y_i − Σ_{j=0..m} a_j z_ji )²

is minimized by taking its partial derivative with respect to each of the coefficients and setting the resulting equation equal to zero.

Page 23:

Least-Squares Regression
Given: n data points (x1, y1), (x2, y2), …, (xn, yn)
Obtain: "best fit" curve:

  f(x) = a0 Z0(x) + a1 Z1(x) + a2 Z2(x) + … + am Zm(x)

The ai's are unknown parameters of the model; the Zi's are known functions of x.

We will focus on two of the many possible types of regression models:

  Simple linear regression: Z0(x) = 1 and Z1(x) = x
  General polynomial regression: Z0(x) = 1, Z1(x) = x, Z2(x) = x², …, Zm(x) = x^m

Page 24:

Least Squares Regression (cont'd):

General procedure: for the ith data point (xi, yi) we find the set of coefficients for which:

  yi = a0 Z0(xi) + a1 Z1(xi) + … + am Zm(xi) + ei

where ei is the residual error, i.e., the difference between the reported value and the model:

  ei = yi − a0 Z0(xi) − a1 Z1(xi) − … − am Zm(xi)

Our "best fit" will minimize the total sum of the squares of the residuals:

  S_r = Σ_{i=1..n} e_i²

Page 25:

Our "best fit" will be the function which minimizes the sum of squares of the residuals:

  S_r = Σ_{i=1..n} e_i² = Σ_{i=1..n} ( y_i − Σ_{j=0..m} a_j Z_j(x_i) )²
      = Σ_{i=1..n} ( y_i − a0 Z0(x_i) − a1 Z1(x_i) − a2 Z2(x_i) − … − am Zm(x_i) )²

(Figure: a data point (x_i, y_i) with its measured value, the modeled value on the fitted curve, and the residual e_i as the vertical distance between them.)

Page 26:

Least Squares Regression (cont'd):

  S_r = Σ_{i=1..n} e_i² = Σ_{i=1..n} ( y_i − a0 Z0(x_i) − … − am Zm(x_i) )²

To minimize this expression with respect to the unknowns a0, a1, …, am, take derivatives of S_r and set them to zero:

  ∂S_r/∂a0 = Σ_{i=1..n} −2 Z0(x_i) (y_i − a0 Z0(x_i) − … − am Zm(x_i)) = 0
  ∂S_r/∂a1 = Σ_{i=1..n} −2 Z1(x_i) (y_i − a0 Z0(x_i) − … − am Zm(x_i)) = 0
  ⋮
  ∂S_r/∂am = Σ_{i=1..n} −2 Zm(x_i) (y_i − a0 Z0(x_i) − … − am Zm(x_i)) = 0

Page 27:

Least Squares Regression (cont'd): In linear algebra form:

  {Y} = [Z]{A} + {E}   or   {E} = {Y} − [Z]{A}

where: {E} and {Y} are n × 1, [Z] is n × (m+1), {A} is (m+1) × 1;
n = # points, (m+1) = # unknowns.

  {E}ᵀ = [e1 e2 … en],  {Y}ᵀ = [y1 y2 … yn],  {A}ᵀ = [a0 a1 a2 … am]

  [Z] = | Z0(x1)  Z1(x1)  …  Zm(x1) |
        | Z0(x2)  Z1(x2)  …  Zm(x2) |
        |   ⋮       ⋮     ⋱    ⋮    |
        | Z0(xn)  Z1(xn)  …  Zm(xn) |

Page 28:

Least Squares Regression (cont'd): {E} = {Y} − [Z]{A}

Then
  S_r = {E}ᵀ{E} = ({Y} − [Z]{A})ᵀ ({Y} − [Z]{A})
      = {Y}ᵀ{Y} − {A}ᵀ[Z]ᵀ{Y} − {Y}ᵀ[Z]{A} + {A}ᵀ[Z]ᵀ[Z]{A}
      = {Y}ᵀ{Y} − 2{A}ᵀ[Z]ᵀ{Y} + {A}ᵀ[Z]ᵀ[Z]{A}

Setting ∂S_r/∂a_i = 0 for i = 0, …, m yields:

  0 = 2[Z]ᵀ[Z]{A} − 2[Z]ᵀ{Y}

or [Z]ᵀ[Z]{A} = [Z]ᵀ{Y}

Page 29:

Least Squares Regression (cont'd):

  [Z]ᵀ[Z]{A} = [Z]ᵀ{Y}   (C&C Eq. 17.25)

This is the general form of the normal equations. They provide (m+1) equations in (m+1) unknowns. (Note that we end up with a system of linear equations.)
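As a sketch (illustrative data, not from the slides), the normal equations can be formed for polynomial regression with Z_j(x) = x^j and solved with a small Gaussian elimination:

```python
def poly_fit(x, y, m):
    """Fit an m-th degree polynomial by solving [Z]^T[Z]{A} = [Z]^T{Y}."""
    n1 = m + 1
    Z = [[xi ** j for j in range(n1)] for xi in x]              # n x (m+1) basis matrix
    ZtZ = [[sum(Z[i][r] * Z[i][c] for i in range(len(x)))
            for c in range(n1)] for r in range(n1)]
    ZtY = [sum(Z[i][r] * y[i] for i in range(len(x))) for r in range(n1)]
    # Gaussian elimination with partial pivoting on the augmented system [ZtZ | ZtY]
    M = [row[:] + [b] for row, b in zip(ZtZ, ZtY)]
    for k in range(n1):
        p = max(range(k, n1), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n1):
            f = M[r][k] / M[k][k]
            for c in range(k, n1 + 1):
                M[r][c] -= f * M[k][c]
    a = [0.0] * n1
    for k in reversed(range(n1)):
        a[k] = (M[k][n1] - sum(M[k][c] * a[c] for c in range(k + 1, n1))) / M[k][k]
    return a

# Data lying exactly on y = 1 + 2x - 0.5x^2 should be recovered.
coeffs = poly_fit([0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 2.5, 3.0, 2.5, 1.0], m=2)
```

In practice [Z]ᵀ[Z] can be ill-conditioned for high-degree polynomials, so production code typically uses a QR- or SVD-based solver instead; the direct solve above is only to make the normal equations concrete.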

Page 30:

Simple Linear Regression (m = 1):
Given: n data points (x1, y1), (x2, y2), …, (xn, yn), with n > 2
Obtain: "best fit" curve f(x) = a0 + a1x from the n equations:

  y1 = a0 + a1 x1 + e1
  y2 = a0 + a1 x2 + e2
  ⋮
  yn = a0 + a1 xn + en

Or, in matrix form, [Z]ᵀ[Z]{A} = [Z]ᵀ{Y}, with:

  [Z] = | 1  x1 |     {A} = {a0}     {Y} = {y1}
        | 1  x2 |           {a1}           {y2}
        | ⋮   ⋮ |                           ⋮
        | 1  xn |                          {yn}

Page 31:

Simple Linear Regression (m = 1): Normal Equations

[Z]ᵀ[Z]{A} = [Z]ᵀ{Y}, upon multiplying the matrices, becomes:

  | n      Σ x_i  | {a0}   { Σ y_i     }
  | Σ x_i  Σ x_i² | {a1} = { Σ x_i y_i }

Normal equations for linear regression, C&C Eqs. (17.4-5). (This form works well for spreadsheets.)

Page 32:

Simple Linear Regression (m = 1): [Z]ᵀ[Z]{A} = [Z]ᵀ{Y}

Solving for {A} gives C&C equations (17.6) and (17.7):

  a1 = (n Σ x_i y_i − Σ x_i Σ y_i) / (n Σ x_i² − (Σ x_i)²)

  a0 = (1/n) Σ y_i − a1 (1/n) Σ x_i = ȳ − a1 x̄

Page 33:

Simple Linear Regression (m = 1): [Z]ᵀ[Z]{A} = [Z]ᵀ{Y}

A better version of the first normal equation is

  a1 = Σ_{i=1..n} (y_i − ȳ)(x_i − x̄) / Σ_{i=1..n} (x_i − x̄)²

which is easier and numerically more stable, but the second equation remains the same:

  a0 = ȳ − a1 x̄
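The two slope formulas are algebraically equivalent, which a short sketch can confirm on illustrative data (not from the slides):

```python
def slope_centered(x, y):
    """Numerically stable slope: sums of products of centered deviations."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y))
    den = sum((xi - xbar) ** 2 for xi in x)
    return num / den

def slope_raw(x, y):
    """Raw-moment slope from C&C Eq. (17.6)."""
    n = len(x)
    return (n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)) / \
           (n * sum(a * a for a in x) - sum(x) ** 2)

x = [10.0, 11.0, 12.0, 13.0]
y = [3.1, 3.9, 5.2, 6.0]
```

The centered version avoids the cancellation between two large quantities (n Σ x_i y_i and Σ x_i Σ y_i) that makes the raw form lose precision when the x values are large relative to their spread.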

Page 34:

Common Nonlinear Relations:
Objective: use linear equations for simplicity.
Remedy: transform data into linear form and perform regressions.
Given: data which appear as:

(1) Exponential-like curve:

  y = a1 e^(b1 x)

  (e.g., population growth, radioactive decay, attenuation of a transmission line)

  Can also use: ln(y) = ln(a1) + b1 x

Page 35:

Major Points in Least-Squares Regression:

1. In all regression models one is solving an overdetermined system of equations, i.e., more equations than unknowns.

2. How good is the fit? Often based on the coefficient of determination, r².

Page 36:

Engineering Computation: Standard Error of the Estimate

Precision: if the spread of the points around the line is of similar magnitude along the entire range of the data, then one can use

  s_y/x = sqrt( S_r / (n − (m + 1)) )

to describe the precision of the regression estimate, in which m + 1 is the number of coefficients calculated for the fit (e.g., m + 1 = 2 for linear regression). s_y/x is the standard error of the estimate (the standard deviation in y about the regression line).
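A minimal sketch of the standard error of the estimate for a straight-line fit (m + 1 = 2 coefficients); the data and fitted line below are illustrative, not from the slides:

```python
import math

def std_error(x, y, a0, a1, m=1):
    """Standard error of the estimate s_{y/x} = sqrt(Sr / (n - (m+1)))."""
    n = len(x)
    Sr = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(Sr / (n - (m + 1)))

# Residuals are +-0.1, so Sr = 0.04 and s_{y/x} = sqrt(0.04 / 2).
s_yx = std_error([0.0, 1.0, 2.0, 3.0], [0.1, 0.9, 2.1, 2.9], a0=0.0, a1=1.0)
```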