Non-linear regression


Upload: sophia-long

Post on 31-Dec-2015


Page 1: Slide 1 Non-linear regression All regression analyses are for finding the relationship between a dependent variable (y) and one or more independent variables

Slide 1

Non-linear regression
• All regression analyses are for finding the relationship between a dependent variable (y) and one or more independent variables (x), by estimating the parameters that define the relationship.
• Functional form known
  – Non-linear relationships whose parameters can be estimated by linear regression after transformation, e.g., $y = ax^b$, $y = ab^x$, $y = ae^{bx}$
  – Non-linear relationships whose parameters can be estimated only by non-linear regression, e.g., the logistic growth curve $N_t = N_0 K / [N_0 + (K - N_0)e^{-rt}]$ fitted with nls later in these slides
• Functional form unknown: lowess/loess. While lowess and loess are often treated as synonyms, some people do insist that they are different, as described below:
  – lowess: a locally weighted linear least squares regression, generally involving a single IV
  – loess: a locally weighted linear or quadratic least squares regression, involving one or more IVs
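The first group of models can be fitted with ordinary linear regression because taking logarithms turns them into straight lines: $\log y = \log a + b \log x$ for $y = ax^b$, $\log y = \log a + x \log b$ for $y = ab^x$, and $\log y = \log a + bx$ for $y = ae^{bx}$. A minimal sketch for the last case follows; the data are simulated here purely for illustration (true a = 2 and b = 0.3 are assumptions, not values from the slides).

```r
# Simulated data (an assumption for illustration): y = 2 * exp(0.3 * x)
# with multiplicative noise, so that log(y) is linear in x
set.seed(1)
x <- 1:20
y <- 2 * exp(0.3 * x) * exp(rnorm(length(x), sd = 0.05))

# y = a * exp(b * x)  =>  log(y) = log(a) + b * x
fit <- lm(log(y) ~ x)

# Back-transform the intercept to recover a; the slope is b itself
a_hat <- exp(coef(fit)[["(Intercept)"]])
b_hat <- coef(fit)[["x"]]
print(c(a = a_hat, b = b_hat))
```

Note that fitting on the log scale minimizes squared deviations of log(y), not of y, so the estimates can differ somewhat from a direct non-linear fit.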


Rationale of nonlinear regression
• Both linear and non-linear regression aim to find the parameter values that minimize the residual sum of squares, $RSS = \sum [y - E(y)]^2$.
• For linear regression, a closed-form solution exists for the intercept (a) and slope (b); for non-linear regression, such a solution often does not exist and we need to try various combinations of parameter values.
• Let us first pretend that we do not know the solution for a and b in linear regression and try different a and b to find the parameter estimates that minimize RSS.
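For reference, the closed-form solution mentioned above is the standard least-squares result for a straight line $E(y) = a + bx$. It is written here with the usual textbook notation ($\bar{x}$ and $\bar{y}$ are sample means), which the slides themselves do not spell out:

```latex
\begin{align*}
RSS(a, b) &= \sum_{i=1}^{n} (y_i - a - b x_i)^2 \\
\frac{\partial RSS}{\partial a} = 0,\ \frac{\partial RSS}{\partial b} = 0
&\implies
b = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i} (x_i - \bar{x})^2},
\qquad a = \bar{y} - b\,\bar{x}
\end{align*}
```

For most non-linear models the corresponding equations cannot be solved algebraically, which is why an iterative search is needed.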

Xuhua Xia Slide 3


Get slope and intercept the hard way

Xuhua Xia Slide 4

• The data set has been used before in our first lecture on regression. X is humidity and Y is weight loss. Double-click it and copy to an EXCEL sheet.
• We will try different combinations of intercept (a) and slope (b) to find the combination that minimizes RSS.
• From the plot we can guess that a ≈ 9 and b ≈ -0.06.
• The 3rd column is the predicted value: E(Y) = a – bx, with b entered as the positive value 0.06.
• The 4th column is the squared deviation: $[Y - E(Y)]^2$.
• You may first try different a and b values. Better ones will make RSS smaller.
• Now use EXCEL solver to automate this process.
• You may do an ordinary linear regression to check the parameter estimates.
• Summary:
  – Guesstimate parameter values
  – Try different parameter values to minimize RSS
• EXCEL solver will try parameter values from 0 up. If a parameter is negative, as the slope is in our case, express the predicted value E(Y) as a – bx.

Parameters: a = 9, b = 0.06

X      Y     Pred  SS
0.00   8.98  9     0.0004
12.00  8.14  8.28  0.0196
29.50  6.67  7.23  0.3136
43.00  6.08  6.42  0.1156
53.00  5.90  5.82  0.0064
62.50  5.83  5.25  0.3364
75.50  4.68  4.47  0.0441
85.00  4.20  3.9   0.09
93.00  3.72  3.42  0.09

RSS = 1.0161
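The same "hard way" can be sketched in R, with optim() playing the role of the EXCEL solver. This is only an illustration, not part of the original slides: the data are transcribed from the table above, the slope is allowed to be negative directly (so the model is written as a + b*x), and the default Nelder-Mead search is an assumption about which optimizer to use.

```r
# Data transcribed from the table above (X = humidity, Y = weight loss)
x <- c(0, 12, 29.5, 43, 53, 62.5, 75.5, 85, 93)
y <- c(8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72)

# RSS for a candidate intercept p[1] and slope p[2], with E(Y) = a + b * x
rss <- function(p) sum((y - (p[1] + p[2] * x))^2)

rss_guess <- rss(c(9, -0.06))      # RSS at the guess read off the plot
sol <- optim(c(9, -0.06), rss)     # iterative search starting from the guess
ols <- coef(lm(y ~ x))             # closed-form least-squares estimates

print(rss_guess)
print(sol$par)
print(ols)
```

The search and the closed-form regression should agree closely, which is the point of the exercise: the iterative approach gives the same answer where a formula exists, and still works where it does not.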

[Figure: scatter plot of Y (weight loss) against X (humidity)]

By using nls

Xuhua Xia Slide 5

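Time  N
0.5   20
1     42
1.5   75
2     149
2.5   278
3     515
3.5   1018
4     2372
4.5   4416
5     6533
5.5   13068
6     19624
6.5   32663
7     57079
7.5   66230
8     87369
8.5   95274
9     109380
9.5   99875
10    129872

Logistic growth model:
$$N_t = \frac{N_0 K}{N_0 + (K - N_0)e^{-rt}}$$

[Figure: scatter plot of N against Time]

Initial values of the parameters to estimate:
K (carrying capacity): 200000?
N0: 10?
r: 1.35?

Solving the model for r at two data points, with N0 = 10 and K = 200000:
$$20 = \frac{10 \times 200000}{10 + (200000 - 10)e^{-0.5r}} \implies r = 1.3864$$
$$278 = \frac{10 \times 200000}{10 + (200000 - 10)e^{-2.5r}} \implies r = 1.3306$$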
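The two values of r above come from rearranging the logistic equation at a single data point: $e^{-rt} = (N_0 K / N - N_0)/(K - N_0)$. A small sketch of that arithmetic in R, where the helper name solve_r is hypothetical and N0 = 10 and K = 200000 are the guesses from the slide:

```r
# Solve N = N0*K / (N0 + (K - N0)*exp(-r*t)) for r, given one observation (t, N).
# solve_r is an illustrative helper, not a function from the slides.
solve_r <- function(t, N, N0 = 10, K = 200000) {
  -log((N0 * K / N - N0) / (K - N0)) / t
}

r1 <- solve_r(0.5, 20)    # from the first data point
r2 <- solve_r(2.5, 278)   # from the fifth data point
print(round(c(r1, r2), 4))
```

Both values are close to each other, so something around 1.35 is a reasonable starting guess for r.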


Use EXCEL solver to do estimates

Xuhua Xia Slide 6

Parameters: K = 200000, N0 = 10, r = 1.35

Time  N       Pred         SS
0.5   20      19.63938311  0.130044542
1     42      38.56874494  11.77351128
1.5   75      75.73620696  0.54200069
2     149     148.6941255  0.093559197
2.5   278     291.8310018  191.2966097
3     515     572.3605864  3290.236869
3.5   1018    1121.042253  10617.70596
4     2372    2189.930426  33149.32971
4.5   4416    4256.168202  25546.20355
5     6533    8191.208515  2749655.48
5.5   13068   15476.73605  5802009.336
6     19624   28286.6258   75041085.77
6.5   32663   48889.91211  263312676.5
7     57079   77708.75379  425586741.5
7.5   66230   111033.0251  2007311061
8     87369   142048.5146  2989849314
8.5   95274   165601.2467  4945921632
9     109380  180870.7213  5110923225
9.5   99875   189780.4221  8082984924
10    129872  194662.7729  4197844257

RSS = 28107399387

These (K, N0, and r) are our guesstimates. Now refine them by using EXCEL solver (or by hand if you so wish).
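The Pred and SS columns of the spreadsheet can be reproduced in a few lines of R. This is a sketch with the data typed in from the table; the function name logistic is an illustrative choice, not something defined in the slides.

```r
# Data transcribed from the table above
time <- seq(0.5, 10, by = 0.5)
N <- c(20, 42, 75, 149, 278, 515, 1018, 2372, 4416, 6533,
       13068, 19624, 32663, 57079, 66230, 87369, 95274,
       109380, 99875, 129872)

# Logistic growth curve (illustrative helper)
logistic <- function(t, K, N0, r) N0 * K / (N0 + (K - N0) * exp(-r * t))

# Predicted values and squared deviations at the guesstimates
pred <- logistic(time, K = 200000, N0 = 10, r = 1.35)
ss <- (N - pred)^2
rss <- sum(ss)

print(data.frame(Time = time, N = N, Pred = pred, SS = ss))
print(rss)
```

Changing K, N0, or r by hand and rerunning shows how RSS responds, which is what the solver automates.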


nls output

Xuhua Xia Slide 7

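md <- read.table("nlinLogistic.txt", header = TRUE)
# Fit the logistic growth model by non-linear least squares
fit <- nls(N ~ N0 * K / (N0 + (K - N0) * exp(-r * Time)), data = md,
           start = c(K = 150000, N0 = 10, r = 1.35))
summary(fit)   # prints the parameter table shown below
plot(md$Time, md$N, xlab = "Time", ylab = "N")
lines(md$Time, fitted(fit))

[Figure: N against Time with the fitted logistic curve]

Parameters:
    Estimate   Std. Error  t value  Pr(>|t|)
K   1.232e+05  5.412e+03   22.759   3.59e-14
N0  2.708e+01  2.186e+01    1.239   0.232
r   1.151e+00  1.181e-01    9.753   2.23e-08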
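Because nlinLogistic.txt is not included with the slides, the sketch below types the data in from the earlier table (on the assumption that the file holds the same values) and compares the RSS at the guesstimates with the RSS after the nls fit. deviance() returns the residual sum of squares for an nls object.

```r
# Data typed in from the earlier table; assumed to match nlinLogistic.txt
md <- data.frame(
  Time = seq(0.5, 10, by = 0.5),
  N = c(20, 42, 75, 149, 278, 515, 1018, 2372, 4416, 6533,
        13068, 19624, 32663, 57079, 66230, 87369, 95274,
        109380, 99875, 129872)
)

# RSS at the guesstimates K = 200000, N0 = 10, r = 1.35
pred_guess <- 10 * 200000 / (10 + (200000 - 10) * exp(-1.35 * md$Time))
rss_guess <- sum((md$N - pred_guess)^2)

# Non-linear least squares fit, same start values as in the slide
fit <- nls(N ~ N0 * K / (N0 + (K - N0) * exp(-r * Time)), data = md,
           start = c(K = 150000, N0 = 10, r = 1.35))

print(coef(fit))
print(c(guess = rss_guess, fitted = deviance(fit)))
```

The large standard error of N0 in the output above is worth noticing: the early part of the curve contributes very little to RSS, so N0 is poorly determined by these data.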


Xuhua Xia Slide 8

Fitting another equation
• In rapidly replicating unicellular eukaryotes such as yeast, highly expressed intron-containing genes require more efficient splicing sites than lowly expressed genes. GE: gene expression
• Natural selection will operate on the mutations at the splicing sites to optimize splicing efficiency (SE).
• Observation: SE increases with GE non-linearly, then levels off and appears to have reached a maximum.

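GE  SE
1   0.46
2   0.47
3   0.57
4   0.61
5   0.62
6   0.68
7   0.69
8   0.78
9   0.7
10  0.74
11  0.77
12  0.78
13  0.74
13  0.8
15  0.8
16  0.78

(The Solver table that follows uses SE on a tenfold scale, e.g., 4.6 for 0.46.)

[Figure: scatter plot of SE against GE]

$$SE = \frac{\alpha + \beta\,GE}{1 + \gamma\,GE}$$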

[Figure: scatter plot of SE against GE]

Xuhua Xia Slide 9

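Guesstimate initial values

$$E(SE) = \frac{\alpha + \beta\,GE}{1 + \gamma\,GE}$$

The minimum of E(SE) is α, when GE = 0: α ≈ 4

The maximum of E(SE) is β/γ (when GE is large, e.g., 15): β/γ ≈ 8, i.e., β ≈ 8γ

The relationship is almost linear when GE is small. When GE = 6, SE ≈ 6.5:

$$6.5 = \frac{4 + (8\gamma) \times 6}{1 + 6\gamma} \implies \gamma \approx 0.278$$

β ≈ 8γ = 8 × 0.278 ≈ 2.22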


Using EXCEL Solver

Xuhua Xia Slide 10

Parameters: Alpha = 4, Beta = 2.22, Gamma = 0.278

GE  SE   Pred      SS
1   4.6  4.86698   0.071278
2   4.7  5.424165  0.524414
3   5.7  5.812432  0.012641
4   6.1  6.098485  2.3E-06
5   6.2  6.317992  0.013922
6   6.8  6.491754  0.095016
7   6.9  6.632722  0.071437
8   7.8  6.74938   1.103803
9   7    6.847516  0.023251
10  7.4  6.931217  0.219758
11  7.7  7.00345   0.485182
12  7.8  7.066421  0.538139
13  7.4  7.121803  0.077393
13  8    7.121803  0.77123
15  8    7.2147    0.616696
16  7.8  7.254038  0.298074

RSS = 4.922236

These (α, β, and γ) are our guesstimates. Now refine them by using EXCEL solver (or by hand if you so wish).
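The guesstimates and the RSS in the Solver table can be reproduced in R as a check. This is a sketch: the data are typed in from the Solver table, the variable names a0, b0, and g0 are illustrative, and the rounding mirrors the rounded values shown in the table.

```r
# Data transcribed from the Solver table above
GE <- c(1:13, 13, 15, 16)
SE <- c(4.6, 4.7, 5.7, 6.1, 6.2, 6.8, 6.9, 7.8, 7.0, 7.4,
        7.7, 7.8, 7.4, 8.0, 8.0, 7.8)

# alpha: value of E(SE) at GE = 0
a0 <- 4
# gamma: from 6.5 = (a0 + 8*g*6) / (1 + 6*g), using beta = 8 * gamma
g0 <- round((6.5 - a0) / (8 * 6 - 6.5 * 6), 3)
# beta: plateau beta/gamma = 8
b0 <- round(8 * g0, 2)

# Predicted values and RSS at the guesstimates
pred <- (a0 + b0 * GE) / (1 + g0 * GE)
rss <- sum((SE - pred)^2)
print(c(alpha = a0, beta = b0, gamma = g0, RSS = rss))
```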


R functions and output

Xuhua Xia Slide 11

md <- read.table("nlinGESE.txt", header = TRUE)
attach(md)
# Fit SE = (a + b*GE) / (1 + g*GE), starting from the guesstimates
fit <- nls(SE ~ (a + b * GE) / (1 + g * GE),
           start = c(a = 4, b = 2.22, g = 0.278))
summary(fit)
plot(GE, SE)
lines(GE, fitted(fit))

Parameters:
   Estimate  SE      t      P
a  2.6668    0.9741  2.738  0.0169
b  1.9694    0.8687  2.267  0.0411
g  0.2036    0.1043  1.951  0.0729

$$SE = \frac{2.6668 + 1.9694\,GE}{1 + 0.2036\,GE}$$

[Figure: SE against GE with the fitted curve]


Xuhua Xia Slide 12

A general approach
• Sometimes we do not know the functional form, so here is a general approach.
• Same problem as before, but we are not sure of the exact relationship between SE and GE.

[Table and figure: the same GE and SE data as in "Fitting another equation", with a scatter plot of SE against GE]


A general approach

Xuhua Xia Slide 13

$$E(SE) = \begin{cases} \alpha + \beta\,GE + \gamma\,GE^2 & \text{if } GE < GE_0 \\ c & \text{if } GE \ge GE_0 \end{cases}$$

1. y increases with x at a decreasing rate: use a polynomial to approximate, e.g., $y = a + bx + cx^2$ when $x < x_0$

2. When x reaches a certain level ($x_0$), y reaches its maximum and does not increase any more: $y = y_{max}$ for $x \ge x_0$

[Figure: scatter plot of SE against GE]

[Figure: scatter plot of SE against GE]

Xuhua Xia Slide 14

Guesstimate initial values

$$E(SE) = \begin{cases} \alpha + \beta\,GE + \gamma\,GE^2 & \text{if } GE < GE_0 \\ c & \text{if } GE \ge GE_0 \end{cases}$$

When GE = 0, SE = α, so α ≈ 4.

For a short segment of GE, the relationship between SE and GE is approximately linear, i.e., SE ≈ α + β·GE. When GE increases from 2 to 8, SE increases from 4.7 to 7.5, so β ≈ (7.5 - 4.7)/(8 - 2) ≈ 0.47.

Given the linear approximation with α ≈ 4 and β ≈ 0.47, SE for GE = 12 should be 4 + 0.47 × 12 ≈ 9.6, but the actual SE is only about 7.7. This must be due to the quadratic term γ·GE², i.e.,

(7.7 – 9.6) = γ × 12², so γ ≈ -0.013. A rough starting value of -0.02 is used in the R code that follows.
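The starting values can be checked with a few lines of arithmetic in R; the names a0, b0, g0, and lin12 are illustrative. Note that the calculation as described gives a quadratic coefficient of about -0.013, while the slides use -0.02 as the starting value; for a starting guess only the sign and order of magnitude matter.

```r
a0 <- 4                        # SE at GE = 0
b0 <- (7.5 - 4.7) / (8 - 2)    # slope over the roughly linear segment
lin12 <- a0 + b0 * 12          # what a straight line would predict at GE = 12
g0 <- (7.7 - lin12) / 12^2     # quadratic coefficient needed to bring it down to 7.7
print(round(c(alpha = a0, beta = b0, linear_at_12 = lin12, gamma = g0), 4))
```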


Xuhua Xia Slide 15

A few more twists

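$$E(SE) = \begin{cases} \alpha + \beta\,GE + \gamma\,GE^2 & \text{if } GE < GE_0 \\ c & \text{if } GE \ge GE_0 \end{cases}$$

The continuity condition requires that
$$E(SE) = c = \alpha + \beta\,GE_0 + \gamma\,GE_0^2 \quad \text{when } GE = GE_0$$

The smoothness condition requires that
$$\left.\frac{\partial E(SE)}{\partial GE}\right|_{GE = GE_0} = \beta + 2\gamma\,GE_0 = 0$$

The two conditions imply that
$$GE_0 = -\frac{\beta}{2\gamma}, \qquad c = \alpha + \beta\,GE_0 + \gamma\,GE_0^2 = \alpha - \frac{\beta^2}{4\gamma}$$

We will find α, β, and γ that minimize $RSS = \sum [SE - E(SE)]^2$.

We tell R to substitute various values for α, β, and γ, and find the set of values that minimizes RSS.

Note that GE0 and c are not free parameters because they are functions of α, β, and γ.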
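The step from the two conditions to the expression for c is a short substitution, written out here for completeness using the symbols of the slide:

```latex
\begin{align*}
\beta + 2\gamma\,GE_0 = 0 &\implies GE_0 = -\frac{\beta}{2\gamma} \\
c = \alpha + \beta\,GE_0 + \gamma\,GE_0^2
  &= \alpha + \beta\left(-\frac{\beta}{2\gamma}\right) + \gamma\left(\frac{\beta^2}{4\gamma^2}\right) \\
  &= \alpha - \frac{\beta^2}{2\gamma} + \frac{\beta^2}{4\gamma}
   = \alpha - \frac{\beta^2}{4\gamma}
\end{align*}
```

With γ < 0 and β > 0, $GE_0$ is positive and c exceeds α, as expected for a curve that rises to a plateau.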


R statements to do the job

md <- read.table("nlinGESE.txt", header = TRUE)
attach(md)
# Function returning RSS for a given set of parameter values; optim() minimizes it
# a: alpha, b: beta, g: gamma, x0: GE0
myF <- function(x) {
  a <- x[1]
  b <- x[2]
  g <- x[3]
  x0 <- -b / 2 / g          # join point GE0
  c <- a - b^2 / 4 / g      # plateau value
  seg1Data <- subset(md, subset = (md$GE < x0))
  EY <- a + b * seg1Data$GE + g * seg1Data$GE * seg1Data$GE
  sumD2 <- sum((seg1Data$SE - EY)^2)
  seg2Data <- subset(md, subset = (md$GE >= x0))
  sumD2 <- sumD2 + sum((seg2Data$SE - c)^2)
  sumD2                     # return RSS explicitly
}
# Obtain the solution by supplying the initial values for a, b, g, and the function
sol <- optim(c(4, 0.47, -0.02), myF)
a <- sol$par[1]
b <- sol$par[2]
g <- sol$par[3]
x0 <- -b / 2 / g
c <- a - b^2 / 4 / g
seg1Data <- subset(md, subset = (md$GE < x0))
EY1 <- a + b * seg1Data$GE + g * seg1Data$GE * seg1Data$GE
# Predicted values: quadratic segment, then the plateau (assumes rows sorted by GE)
PredY <- c(EY1, rep(c, length(GE) - length(seg1Data$GE)))
plot(GE, SE)
lines(GE, PredY, col = "red")
abline(v = x0)


Output

$par
[1]  3.49320527  0.64625314 -0.02431488     (α, β, and γ)

$value
[1] 1.180377     (RSS)

$counts
function gradient
     150       NA

$convergence
[1] 0     (0 means success)

c
[1] 7.787315

x0
[1] 13.28925

[Figure: SE against GE with the fitted piecewise curve in red and a vertical line at x0]
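As a check, the reported c and x0 can be recomputed from the three estimates in $par, and the quadratic segment evaluated at x0 should equal the plateau (the continuity condition). The variable names par_hat, cMax, and quad_at_x0 below are illustrative.

```r
# Estimates copied from $par above
par_hat <- c(a = 3.49320527, b = 0.64625314, g = -0.02431488)
a <- par_hat[["a"]]
b <- par_hat[["b"]]
g <- par_hat[["g"]]

x0 <- -b / (2 * g)                      # join point GE0
cMax <- a - b^2 / (4 * g)               # plateau value c
quad_at_x0 <- a + b * x0 + g * x0^2     # quadratic segment evaluated at x0

print(c(x0 = x0, c = cMax, quadratic_at_x0 = quad_at_x0))
```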


Xuhua Xia Slide 18

Robust regression
• LOWESS: robust local regression between Y and X, with linear fitting
• LOESS: robust local regression between Y and one or more Xs, with linear or quadratic fitting
• Used with relations that cannot be expressed in functional forms
• SAS: proc loess
• Data:
  – Data set: monthly averaged atmospheric pressure differences between Easter Island, Chile and Darwin, Australia for a period of 168 months (NIST, 1998), suspected to exhibit 12-month (annual), 42-month (El Nino), and 25-month (Southern Oscillation) cycles (from Robert Cohen of SAS Institute)


lowess in R

Xuhua Xia Slide 19

md <- read.table("nlinGESE.txt", header = TRUE)
attach(md)
# span: smoothing parameter α (proportion of data points used in each local fit);
#       larger = smoother, default = 0.75
# degree: 1 = locally linear, 2 = locally quadratic (the default in R's loess is 2)
fit <- loess(SE ~ GE, span = 0.75, degree = 1)
summary(fit)
# Predict at the observed GE values, or at selected values with predict(fit, c(3, 6), se = TRUE)
pred <- predict(fit, GE, se = TRUE)
plot(GE, SE)
lines(GE, pred$fit, col = "red")

# Compare spans from 0.4 to 0.9 in a 2 x 3 grid of plots
par(mfrow = c(2, 3))
for (span in seq(0.4, 0.9, 0.1)) {
  fit <- loess(SE ~ GE, span = span)
  pred <- predict(fit, GE)
  sTitle <- paste0("span = ", span)
  plot(GE, SE, main = sTitle)
  lines(GE, pred, col = "red")
}

Weighting is tricubic: the weight of a point is proportional to (1 - (dist/maxdist)^3)^3.

How would I know which span value to use?
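The slides leave this question open. One common way to approach it, offered here only as a sketch and not as the author's answer, is leave-one-out cross-validation: refit the curve without each point in turn and pick the span with the smallest mean squared prediction error. The data below are typed in from the Solver table (an assumption about the contents of nlinGESE.txt), the helper name loo_mse is hypothetical, and degree = 1 is chosen for simplicity. surface = "direct" is used so that predict() can extrapolate when the left-out point lies at the edge of the GE range.

```r
# Data transcribed from the Solver table (assumed to approximate nlinGESE.txt)
md <- data.frame(
  GE = c(1:13, 13, 15, 16),
  SE = c(4.6, 4.7, 5.7, 6.1, 6.2, 6.8, 6.9, 7.8, 7.0, 7.4,
         7.7, 7.8, 7.4, 8.0, 8.0, 7.8)
)
spans <- seq(0.4, 0.9, by = 0.1)

# Leave-one-out mean squared prediction error for one span (illustrative helper)
loo_mse <- function(span) {
  err <- sapply(seq_len(nrow(md)), function(i) {
    fit <- loess(SE ~ GE, data = md[-i, ], span = span, degree = 1,
                 control = loess.control(surface = "direct"))
    md$SE[i] - predict(fit, newdata = md[i, , drop = FALSE])
  })
  mean(err^2)
}

cv <- sapply(spans, loo_mse)
names(cv) <- spans
print(round(cv, 4))

best <- spans[which.min(cv)]
print(best)
```

With only 16 points the cross-validation error is itself noisy, so the result is best read alongside the visual comparison of the six panels rather than as a definitive choice.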

[Figure: six panels of SE against GE with the loess fit in red, for span = 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9]

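Plotting the fitted values

> fit <- loess(SE ~ GE, span = 0.8)
> pred <- predict(fit, GE, se = TRUE)
> pred
$fit
 [1] 4.445761 ...

$se.fit
 [1] 0.2785894 ...

$residual.scale
[1] 0.3273702

$df
[1] 10.77648

# 95% confidence band around the fitted curve
tCrit <- qt(0.975, pred$df)
ub <- pred$fit + tCrit * pred$se.fit
lb <- pred$fit - tCrit * pred$se.fit
plot(GE, SE)
lines(GE, pred$fit)
lines(GE, lb, col = "red")
lines(GE, ub, col = "red")

# Widen the y-axis so that the whole band is visible
plot(GE, SE, ylim = c(min(lb), max(ub)))
...

[Figures: SE against GE with the loess fit and its 95% confidence band in red, drawn with the default y-axis and with the widened y-axis]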