Non-Linear & Logistic Regression - University of Alberta

“If the statistics are boring, then you've got the wrong numbers.” Edward R. Tufte (Statistics Professor, Yale University)


Regression Analyses: When do we use these?

PART 1: find a relationship between a response variable (Y) and a predictor variable (X) (e.g. Y~X)

PART 2: use that relationship to predict Y from X

Simple linear regression: y = b + m*x, written equivalently as y = β0 + β1*x1

Multiple linear regression: y = β0 + β1*x1 + β2*x2 + … + βn*xn

Non-linear regression: when a straight line just doesn’t fit our data

Logistic regression: when our response variable is binary (recorded as 0 or 1)


Non-linear Regression: Curvilinear relationship between response and predictor variables

• The right type of non-linear model is usually determined conceptually, based on biological considerations

• As a starting point, we can plot the relationship between the 2 variables and “visually check” which model might be a good option

• There are MANY curves you can generate to try and fit your data


Exponential Curve: Non-linear regression option #1

• Rapid increasing/decreasing change in Y or X for a change in the other
Ex: bacteria growth/decay, human population growth, infection rates (humans, trees, etc.)

Exponential: $y = a + b\,c^{x}$

[Figure: four panels of response (y) against predictor (x) showing the exponential curve for +b and -b, each with 0 < c < 1 and c > 1; the parameter a is marked in every panel.]

Logarithmic Curve: Non-linear regression option #2

• Rapid increasing/decreasing change in Y or X for a change in the other
Ex: survival thresholds, resource optimization

Logarithmic: $y = a + b\,x^{c}$

[Figure: four panels of response (y) against predictor (x) showing the logarithmic curve for each combination of +b or -b with +c or -c; the parameter a is marked in every panel.]

Hyperbolic Curve: Non-linear regression option #3

• Rapid increasing/decreasing change in Y or X for a change in the other
Ex: survival as a function of population

• Similar to the exponential and logarithmic curves, but now we have 2 asymptotes

Hyperbolic: $y = a + \dfrac{b}{x + c}$

[Figure: two panels of response (y) against predictor (x) showing the hyperbolic curve for +b and -b; the parameters a and c are marked in each panel.]

Parabolic Curve: Non-linear regression option #4

• Rapid increasing/decreasing change in Y or X for a change in the other, followed by the reverse trend
Ex: survival as a function of an environmental variable

Parabolic: $y = a + b\,(x - c)^{2}$

[Figure: two panels of response (y) against predictor (x), “Upward Parabolic” (+b) and “Downward Parabolic” (-b); the parameters a and c are marked in each panel.]

Gaussian Curve: Non-linear regression option #5

• Resembles a normal distribution
Ex: survival as a function of an environmental variable

• Where 0 < b < 1

Gaussian: $y = a\,b^{(x - c)^{2}}$

[Figure: response (y) against predictor (x) showing the Gaussian curve, with the parameters a, b and c marked.]

Sigmoidal Curve: Non-linear regression option #6

• Stability in Y followed by rapid increase, then stability again
Ex: restricted growth, learning response, a threshold has to occur for a response effect

• Where b > 1 and c > 1

Sigmoidal: $y = \dfrac{a}{1 + b^{x - c}} + d$

[Figure: response (y) against predictor (x) showing the sigmoidal curve, with the parameters a, b, c and d marked.]

Michaelis-Menten Curve: Non-linear regression option #7

• Rapid increasing/decreasing change in Y or X for a change in the other
Ex: biological process as a function of resource availability

• Similar to the exponential and logarithmic curves, but now we have only 2 parameters – this model comes from kinetics/physiology

Michaelis-Menten: $y = \dfrac{a\,x}{b + x}$

[Figure: two panels of response (y) against predictor (x) showing the Michaelis-Menten curve for +b and -b, with the parameter a and the level (1/2)a marked.]
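The seven candidate curves can be written as small R functions, which is handy both for sketching shapes and for building model formulas later. This is only a sketch: the function names are made up for illustration, and the sigmoidal form assumes the offset d sits outside the fraction.

```r
# Candidate curves, with parameter names following the slide equations.
# Function names are hypothetical helpers, not part of any package.
exponential      <- function(x, a, b, c) a + b * c^x
logarithmic      <- function(x, a, b, c) a + b * x^c        # as labelled on the slides
hyperbolic       <- function(x, a, b, c) a + b / (x + c)
parabolic        <- function(x, a, b, c) a + b * (x - c)^2
gaussian         <- function(x, a, b, c) a * b^((x - c)^2)   # intended for 0 < b < 1
sigmoidal        <- function(x, a, b, c, d) a / (1 + b^(x - c)) + d  # assumed reading
michaelis_menten <- function(x, a, b) a * x / (b + x)

# Evaluate one curve over a range of predictor values
x <- seq(0, 10, by = 2)
mm_values <- michaelis_menten(x, a = 10, b = 2)
mm_values
```

Evaluating a function over a grid of x values like this, for a few trial parameter sets, is one way to judge by eye which curve family resembles the data.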


Non-Linear Regression: Curve Fitting

Procedure:

1. Plot your variables to visualize the relationship
a. What curve does the pattern resemble?
b. What might alternative options be?

2. Decide on the curves you want to compare and run a non-linear regression curve fit for each
a. You will have to estimate rough parameter values from your plot to use as starting values for the curve-fitting function

3. Once you have fitted parameters for your curves, compare the models with AIC

4. Plot the model with the lowest AIC over your point data to visualize the fit

Non-linear regression curve fitting in R:

install.packages("minpack.lm")
library(minpack.lm)

nlsLM(responseY~MODEL, start=list(starting values for model parameters))
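As a concrete illustration of step 2, the sketch below fits a Michaelis-Menten curve to simulated data. The data, the true values (a = 10, b = 2) and the starting values are all assumptions made up for the example. The code uses nlsLM() when minpack.lm is installed and otherwise falls back to base R's nls(), which accepts the same formula, data and start arguments.

```r
set.seed(1)

# Simulated (hypothetical) data: Michaelis-Menten curve with a = 10, b = 2 plus noise
x <- seq(0.5, 20, length.out = 40)
dat <- data.frame(x = x, y = 10 * x / (2 + x) + rnorm(length(x), sd = 0.3))

# Prefer nlsLM() if minpack.lm is available; base nls() is a stand-in otherwise
fit_fun <- if (requireNamespace("minpack.lm", quietly = TRUE)) minpack.lm::nlsLM else nls

# Starting values read off a plot of the data:
#   a ~ the plateau the response levels off at
#   b ~ the x value at which the response reaches about half the plateau
fit <- fit_fun(y ~ a * x / (b + x), data = dat,
               start = list(a = max(dat$y), b = 3))

coef(fit)  # fitted parameter estimates
```

Starting values only need to be roughly right; poor ones are the most common reason a non-linear fit fails to converge.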


Non-Linear Regression: Output from R

The printed output shows:

• The non-linear model that we fit (in the slide's example, a simplified logarithmic with slope=0)

• Estimates of model parameters

• Residual sum-of-squares for your non-linear model

• Number of iterations needed to estimate the parameters


Akaike’s Information Criterion (AIC): How do we decide which model is best?

• AIC considers both the fit of the model and the model complexity

• Complexity is measured as the number of parameters or the use of higher-order polynomials

• Allows us to balance over- and under-fitting in our modelled relationships
– We want a model that is as simple as possible, but no simpler
– A reasonable amount of explanatory power is traded off against model complexity
– AIC measures this balance for us

Hirotugu Akaike, 1927-2009: in the 1970s he used information theory to build a numerical equivalent of Occam's razor

Occam’s razor: All else being equal, the simplest explanation is the best one
• For model selection, this means the simplest model is preferred to a more complex one
• Of course, this needs to be weighed against the ability of the model to actually predict anything


Akaike’s Information Criterion (AIC): AIC in R

Akaike’s Information Criterion in R to determine the best model:

AIC(nlsLM(responseY~MODEL1, start=list(starting values)))

AIC(nlsLM(responseY~MODEL2, start=list(starting values)))

AIC(nlsLM(responseY~MODEL3, start=list(starting values)))

• AIC is useful because it can be calculated for any kind of likelihood-based model, allowing comparisons across different modelling approaches and model fitting techniques, provided the models are fitted to the same response data

• The model with the lowest AIC value is the preferred model: it gives the best balance between fit (small residuals) and complexity
– Output from R is a single AIC value
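A minimal sketch of the comparison, assuming simulated data that truly follow a Michaelis-Menten curve: two candidate models are fitted to the same data and their AIC values compared. The data and starting values are invented for illustration; nls() stands in for nlsLM() when minpack.lm is not installed.

```r
set.seed(2)

# Simulated (hypothetical) data from a saturating Michaelis-Menten relationship
x <- seq(0.5, 20, length.out = 40)
dat <- data.frame(x = x, y = 10 * x / (2 + x) + rnorm(length(x), sd = 0.3))

fit_fun <- if (requireNamespace("minpack.lm", quietly = TRUE)) minpack.lm::nlsLM else nls

# Candidate 1: straight line; candidate 2: Michaelis-Menten curve
fit_line <- fit_fun(y ~ a + b * x,       data = dat, start = list(a = 1, b = 1))
fit_mm   <- fit_fun(y ~ a * x / (b + x), data = dat, start = list(a = 10, b = 3))

# One AIC value per model; the lower value is preferred
aic <- c(line = AIC(fit_line), michaelis_menten = AIC(fit_mm))
aic
names(which.min(aic))
```

Both candidates here have two parameters, so the AIC difference is driven by fit alone; with models of different size, the penalty for extra parameters also comes into play.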


Non-Linear Regression: Curve fitting

• Use the parameter estimates returned by nlsLM() to generate the curve for plotting
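One way to do this, sketched here with simulated data and a Michaelis-Menten model (both assumptions for the example), is to pull the estimates out with coef(), evaluate the model equation over a fine grid of x values, and draw the result over the points.

```r
set.seed(3)

# Simulated (hypothetical) data and fit; nls() stands in for nlsLM() if needed
x <- seq(0.5, 20, length.out = 40)
dat <- data.frame(x = x, y = 10 * x / (2 + x) + rnorm(length(x), sd = 0.3))
fit_fun <- if (requireNamespace("minpack.lm", quietly = TRUE)) minpack.lm::nlsLM else nls
fit <- fit_fun(y ~ a * x / (b + x), data = dat, start = list(a = 10, b = 3))

# Evaluate the fitted equation on a fine grid using the estimated parameters
p <- coef(fit)
x_new <- seq(min(dat$x), max(dat$x), length.out = 200)
y_curve <- p[["a"]] * x_new / (p[["b"]] + x_new)

# Points first, then the fitted curve on top
plot(dat$x, dat$y, xlab = "predictor (x)", ylab = "response (y)")
lines(x_new, y_curve, lwd = 2)
```

Calling predict(fit, newdata = data.frame(x = x_new)) gives the same curve without retyping the equation.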


Non-Linear Regression: Assumptions

• NLR makes no assumptions about normality, equal variances, or outliers

• However, the assumptions of independence (spatial & temporal) and design considerations (randomization, sufficient replicates, no pseudoreplication) still apply

• We don’t have to worry about statistical power here because we are fitting relationships
– All we care about is if or how well we can model the relationship between our response and predictor variables


Non-Linear Regression: R2 for “goodness of fit”

• Calculating an R2 is NOT APPROPRIATE for non-linear regression

• Why?
– For linear models, the sums of squares always add up in a specific manner: $SS_{Regression} + SS_{Error} = SS_{Total}$
– Therefore $R^2 = \dfrac{SS_{Regression}}{SS_{Total}}$, which mathematically must produce a value between 0 and 100%
– But in non-linear regression $SS_{Regression} + SS_{Error} \neq SS_{Total}$
– Therefore the ratio used to construct R2 is biased in non-linear regression

• Best to use the AIC value and the residual sum-of-squares to pick the best model, then plot the curve to visualize the fit
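The sum-of-squares argument can be checked numerically. The sketch below, on simulated data invented for the purpose, computes the three sums of squares for a straight-line fit and for a Michaelis-Menten fit. Base nls() is used here for simplicity; nlsLM() minimizes the same residual sum-of-squares.

```r
set.seed(42)

# Simulated (hypothetical) data from a Michaelis-Menten relationship
x <- seq(1, 30, length.out = 30)
y <- 10 * x / (5 + x) + rnorm(30, sd = 0.5)

# Regression, error and total sums of squares for any fitted model
sums_of_squares <- function(fit) {
  y_hat <- as.numeric(fitted(fit))
  c(regression = sum((y_hat - mean(y))^2),
    error      = sum((y - y_hat)^2),
    total      = sum((y - mean(y))^2))
}

ss_linear    <- sums_of_squares(lm(y ~ x))
ss_nonlinear <- sums_of_squares(nls(y ~ a * x / (b + x), start = list(a = 10, b = 5)))

# Gap between (regression + error) and total
gap_linear    <- ss_linear[["regression"]] + ss_linear[["error"]] - ss_linear[["total"]]
gap_nonlinear <- ss_nonlinear[["regression"]] + ss_nonlinear[["error"]] - ss_nonlinear[["total"]]

gap_linear     # zero up to rounding error
gap_nonlinear  # generally not zero
```

Because the two pieces no longer add up to the total, a ratio of regression to total sum-of-squares loses its "proportion of variance explained" meaning for the non-linear fit.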


Logistic Regression (a.k.a. logit regression)

Relationship between a binary response variable and predictor variables

• Binary response variable can be considered a class (1 or 0)
• Yes or No
• Present or Absent

• The linear part of the logistic regression equation is used to find the probability of being in a category based on the combination of predictors

• Predictor variables are usually (but not necessarily) continuous
• But it is harder to make inferences from regression outputs that use discrete or categorical variables

Logistic model: $y = \dfrac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n}}$

Logit model (the linear part): $\ln\dfrac{y}{1 - y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$
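To see how the linear part turns into a probability, the sketch below evaluates the standard logistic form, with 1 + e^(…) in the denominator, for a single predictor. The coefficient values are invented for illustration; plogis() is base R's built-in version of the same function.

```r
# Logistic function: maps any linear predictor value to a probability in (0, 1)
logistic <- function(eta) exp(eta) / (1 + exp(eta))

# Hypothetical coefficients for a single predictor
b0 <- -3
b1 <- 0.5

x   <- c(0, 6, 12)
eta <- b0 + b1 * x     # linear part: -3, 0, 3
p   <- logistic(eta)   # probability that y = 1 at each x
p

# The logit (log-odds) transformation recovers the linear part
logit_p <- log(p / (1 - p))
logit_p
```

A linear predictor of 0 corresponds to a probability of exactly 0.5, and equal steps up or down from 0 give probabilities symmetric about 0.5.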


Binomial distribution vs Normal distribution

• Key difference: values are continuous (Normal) vs discrete (Binomial)

• As sample size increases, the binomial distribution comes to resemble the normal distribution

• The binomial distribution is a family of distributions because its shape depends on both the number of observations and the probability of “getting a success” (a value of 1)

“What is the probability of x successes in n independent and identically distributed Bernoulli trials?”

• Bernoulli trial (or binomial trial) - a random experiment with exactly two possible outcomes, "success" and "failure", in which the probability of success is the same every time the experiment is conducted
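The quoted question can be answered directly with base R's dbinom(). The numbers below (10 trials, success probability 0.5, 3 successes) are arbitrary choices for illustration.

```r
n <- 10    # number of Bernoulli trials
p <- 0.5   # probability of success in each trial

# Probability of exactly 3 successes in 10 trials
prob_3 <- dbinom(3, size = n, prob = p)
prob_3

# Same value from the binomial formula: choose(n, x) * p^x * (1 - p)^(n - x)
prob_3_manual <- choose(n, 3) * p^3 * (1 - p)^(n - 3)

# The distribution is discrete: probabilities over 0..n successes sum to 1
total_prob <- sum(dbinom(0:n, size = n, prob = p))

# Ten simulated Bernoulli trials (each result is 0 or 1)
trials <- rbinom(10, size = 1, prob = p)
```

Changing n or p changes the shape of the distribution, which is what is meant by a family of distributions.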


Logistic Regression vs Linear Regression

• Linear regression
- references the Gaussian (normal) distribution
- uses ordinary least squares to find a best-fitting line, which estimates parameters that predict the change in the dependent variable for a change in the independent variable

• Logistic regression
- references the Binomial distribution
- estimates the probability (p) of an event occurring (y=1) rather than not occurring (y=0) from a knowledge of relevant independent variables (our data)
- regression coefficients are estimated using maximum likelihood estimation (iterative process)


Maximum likelihood estimation: How coefficients are estimated for logistic regression

• Complex iterative process to find the coefficient values that maximize the likelihood function

Likelihood function - probability of the occurrence of an observed set of values X and Y given a function with defined parameters

Process:

1. Begin with a tentative solution for each coefficient
2. Revise it slightly to see if the likelihood function can be improved
3. Repeat this revision until the improvement is minute, at which point the process is said to have converged
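The iterative idea can be sketched by hand: write the (negative) log-likelihood for a logistic model and let a general-purpose optimizer revise the coefficients until it converges. This is an illustration on simulated data, not how glm() works internally; glm() uses its own iterative algorithm (iteratively reweighted least squares), but both should arrive at essentially the same estimates.

```r
set.seed(7)

# Simulated (hypothetical) binary data with true intercept -0.5 and slope 1.2
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Negative log-likelihood of the data for a given pair of coefficients
neg_log_lik <- function(beta) {
  p <- plogis(beta[1] + beta[2] * x)
  -sum(dbinom(y, size = 1, prob = p, log = TRUE))
}

# Start from a tentative solution (0, 0) and iterate until convergence
mle <- optim(par = c(0, 0), fn = neg_log_lik, method = "BFGS")

mle$par          # hand-rolled maximum likelihood estimates
mle$convergence  # 0 indicates the optimizer reports convergence

# Reference fit from R's built-in logistic regression
glm_coef <- coef(glm(y ~ x, family = binomial))
glm_coef
```

Minimizing the negative log-likelihood is equivalent to maximizing the likelihood; the log is used because it is numerically better behaved.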


Simple Logistic Regression in R:

glm(response~predictor, family="binomial")

summary(glm(response~predictor, family="binomial"))

Multiple Logistic Regression in R:

glm(response~predictor1+predictor2+…+predictorN, family="binomial")

summary(glm(response~predictor1+predictor2+…+predictorN, family="binomial"))
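A short worked sketch on simulated data, using glm(), the base R fitting function that accepts a family argument. The data frame, the column names response and predictor, and the true coefficients are assumptions for the example.

```r
set.seed(3)

# Simulated (hypothetical) data: the probability of a 1 rises with the predictor
dat <- data.frame(predictor = runif(150, min = 0, max = 10))
dat$response <- rbinom(150, size = 1, prob = plogis(-4 + 0.8 * dat$predictor))

# Fit the logistic regression and inspect the output
mod <- glm(response ~ predictor, data = dat, family = "binomial")
summary(mod)

# Predicted probability of response = 1 at new predictor values
new_dat <- data.frame(predictor = c(2, 5, 8))
pred <- predict(mod, newdata = new_dat, type = "response")
pred
```

With type = "response", predict() returns probabilities; the default returns values on the logit (log-odds) scale.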


Logistic Regression (a.k.a. logit regression): Output from R

• Estimate of model parameters (intercept and slope)

• Standard error of estimates

• Test of the null hypothesis that the coefficient is equal to zero (no effect)
– A predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable
– A large p-value suggests that changes in the predictor are not associated with changes in the response

• AIC value for the model
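These pieces of the printed summary can also be pulled out programmatically. The sketch below uses simulated data (an assumption for the example) and extracts the coefficient table and the AIC from a fitted glm object.

```r
set.seed(11)

# Simulated (hypothetical) binary response with one predictor
x <- rnorm(120)
y <- rbinom(120, size = 1, prob = plogis(0.3 + 1.5 * x))
mod <- glm(y ~ x, family = "binomial")

# Coefficient table: one row per parameter, with columns
# "Estimate", "Std. Error", "z value" and "Pr(>|z|)"
tab <- coef(summary(mod))
tab

tab["x", "Estimate"]   # slope estimate
tab["x", "Pr(>|z|)"]   # p-value for H0: slope = 0

# AIC value reported at the bottom of the summary
mod_aic <- AIC(mod)
mod_aic
```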


Logistic Regression (a.k.a. logit regression): Pseudo R2 for “goodness of fit”

• In linear regression, the relationship between the dependent and the independent variables is linear

• However, this assumption is not made in logistic regression, so we cannot use the calculation $R^2 = \dfrac{SS_{Regression}}{SS_{Total}}$
- REMEMBER we are not using sum-of-squares to estimate our parameters – we are using maximum likelihood estimation

• We can, however, calculate a pseudo R2
- Lots of options on how to do this, but the best for logistic regression appears to be McFadden's calculation

$R^2 = 1 - \dfrac{\ln L(M_{FULL})}{\ln L(M_{intercept})}$, where $L$ = estimated likelihood

Estimating McFadden’s pseudo R2 in R:

mod=glm(response~predictor,family="binomial")

mcF.r2=1-mod$deviance/mod$null.deviance

NOTE: Pseudo R2 will be MUCH lower than R2 values!
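For a 0/1 response, the deviance shortcut and the log-likelihood formula give the same number, which the sketch below checks on simulated data (the data and coefficient values are assumptions for the example). The intercept-only model is fitted explicitly so that both log-likelihoods are visible.

```r
set.seed(5)

# Simulated (hypothetical) binary response with one predictor
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.2 + x))

full_mod      <- glm(y ~ x, family = "binomial")  # model with the predictor
intercept_mod <- glm(y ~ 1, family = "binomial")  # intercept-only model

# McFadden's pseudo R2 from the two log-likelihoods
r2_loglik <- 1 - as.numeric(logLik(full_mod)) / as.numeric(logLik(intercept_mod))

# Same quantity from the deviances stored in the fitted model
r2_deviance <- 1 - full_mod$deviance / full_mod$null.deviance

c(from_loglik = r2_loglik, from_deviance = r2_deviance)
```

The two agree here because, for ungrouped 0/1 data, the deviance is simply -2 times the log-likelihood.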


Logistic Regression (a.k.a. logit regression): Assumptions

• Logistic regression makes no assumptions about normality, equal variances, or outliers

• However, the assumptions of independence (spatial & temporal) and design considerations (randomization, sufficient replicates, no pseudoreplication) still apply

• Logistic regression assumes the response variable is binary (0 & 1)

• We don’t have to worry about statistical power here because we are fitting relationships
– All we care about is if or how well we can model the relationship between our response and predictor variables


Important to Remember

A non-linear or logistic relationship DOES NOT imply causation!

AIC or pseudo $R^2$ indicates how well a relationship is modelled, not that one or more factors cause the value of another factor.

Be careful with your interpretations!