Math 344: Regression Analysis
Chapter 1 — Linear Regression with One Predictor Variable
Wenge Guo
January 19, 2012
Why regression?
- Want to model a functional relationship between a “predictor variable” (input, independent variable, etc.) and a “response variable” (output, dependent variable, etc.)
- Examples?
- But the real world is noisy; there is no exact law like f = ma
  - Observation noise
  - Process noise
- Two distinct goals
  - (Estimation) Understanding the relationship between predictor variables and response variables
  - (Prediction) Predicting the future response given newly observed predictors
History
- Sir Francis Galton, 19th century
  - Studied the relation between heights of parents and children and noted that the children “regressed” to the population mean
- “Regression” stuck as the term to describe statistical relations between variables
Example Applications
Trend lines, e.g. Google over 6 months.
Others
- Epidemiology
  - Relating lifespan to obesity, smoking habits, etc.
- Science and engineering
  - Relating physical inputs to physical outputs in complex systems
- Brain
Aims for the course
- Given something you would like to predict and some number of covariates
  - What kind of model should you use?
  - Which variables should you include?
  - Which transformations of variables and interaction terms should you use?
- Given a model and some data
  - How do you fit the model to the data?
  - How do you express confidence in the values of the model parameters?
  - How do you regularize the model to avoid over-fitting and other related issues?
Data for Regression Analysis
- Observational data
  Example: relation between age of employee (X) and number of days of illness last year (Y). Cannot be controlled!
- Experimental data
  Example: an insurance company wishes to study the relation between productivity of its analysts in processing claims (Y) and length of training (X).
  - Treatment: the length of training
  - Experimental units: the analysts included in the study
- Completely randomized design: the most basic type of statistical design
  Example: same setting, but every experimental unit has an equal chance to receive any one of the treatments.
Simple Linear Regression
- Want to find parameters for a function of the form

  Yi = β0 + β1 Xi + εi

- The distribution of the error random variable is not specified
Quick Example - Scatter Plot
[Scatter plot omitted: simulated points with predictor a on the horizontal axis (−2 to 2) and response b on the vertical axis (−2 to 8), showing an apparent linear trend.]
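A plot like this one can be reproduced with a short simulation. Below is a minimal Python sketch; the intercept, slope, and noise level are illustrative assumptions chosen only to roughly match the axis ranges above, not values from the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 100
a = rng.uniform(-2, 2, n)        # predictor, matching the plot's x-range
eps = rng.normal(0, 1, n)        # observation noise, eps_i ~ N(0, sigma^2)
b = 3 + 2 * a + eps              # assumed beta0 = 3, beta1 = 2 (illustrative only)

plt.scatter(a, b)
plt.xlabel("a")
plt.ylabel("b")
plt.show()
```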
Formal Statement of Model
Yi = β0 + β1 Xi + εi

- Yi is the value of the response variable in the i-th trial
- β0 and β1 are parameters
- Xi is a known constant, the value of the predictor variable in the i-th trial
- εi is a random error term with mean E(εi) = 0 and variance Var(εi) = σ²
- i = 1, . . . , n
Properties
- The response Yi is the sum of two components
  - Constant term β0 + β1 Xi
  - Random term εi
- The expected response is

  E(Yi) = E(β0 + β1 Xi + εi)
        = β0 + β1 Xi + E(εi)
        = β0 + β1 Xi
Expectation Review
- Definition

  E(X) = ∫ X P(X) dX,  X ∈ ℝ

- Linearity property

  E(aX) = a E(X)
  E(aX + bY) = a E(X) + b E(Y)

- Obvious from the definition
Example Expectation Derivation
P(X) = 2X,  0 ≤ X ≤ 1

Expectation:

  E(X) = ∫₀¹ X P(X) dX
       = ∫₀¹ 2X² dX
       = [2X³/3]₀¹
       = 2/3
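A derivation like this is easy to verify symbolically. A quick sketch using sympy (the tool choice is mine, not the lecture's):

```python
import sympy as sp

X = sp.symbols("X")
pdf = 2 * X                            # P(X) = 2X on [0, 1]
EX = sp.integrate(X * pdf, (X, 0, 1))  # E(X) = integral of X * P(X)
print(EX)                              # prints 2/3
```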
Expectation of a Product of Random Variables
If X, Y are random variables with joint distribution P(X, Y), then the expectation of the product is given by

  E(XY) = ∬ X Y P(X, Y) dX dY.
Expectation of a Product of Random Variables
What if X and Y are independent? If X and Y are independent with density functions f and g respectively, then

  E(XY) = ∬ X Y f(X) g(Y) dX dY
        = ∫ X f(X) [ ∫ Y g(Y) dY ] dX
        = ∫ X f(X) E(Y) dX
        = E(X) E(Y)
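A quick Monte Carlo check of this identity; the two distributions below are arbitrary illustrative choices with E(X) = 2 and E(Y) = 3:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(2.0, 1.0, 1_000_000)      # independent of Y
Y = rng.exponential(3.0, 1_000_000)

print(np.mean(X * Y))                    # ~ 6.0
print(np.mean(X) * np.mean(Y))           # ~ 6.0, i.e. E(XY) = E(X)E(Y)
```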
Regression Function
- The response Yi comes from a probability distribution with mean

  E(Yi) = β0 + β1 Xi

- This means the regression function is

  E(Y) = β0 + β1 X,

  since the regression function relates the mean of the probability distribution of Y for a given X to the level of X
Error Terms
- The response Yi in the i-th trial exceeds or falls short of the value of the regression function by the error term amount εi
- The error terms εi are assumed to have constant variance σ²
Response Variance
The responses Yi have the same constant variance:

  Var(Yi) = Var(β0 + β1 Xi + εi)
          = Var(εi)    (since β0 + β1 Xi is a constant)
          = σ²
Variance (2nd central moment) Review
- Continuous distribution

  Var(X) = E((X − E(X))²) = ∫ (X − E(X))² P(X) dX,  X ∈ ℝ

- Discrete distribution

  Var(X) = E((X − E(X))²) = ∑ᵢ (Xᵢ − E(X))² P(Xᵢ),  X ∈ ℤ
Alternative Form for Variance
Var(X) = E((X − E(X))²)
       = E(X² − 2X E(X) + E(X)²)
       = E(X²) − 2 E(X) E(X) + E(X)²
       = E(X²) − 2 E(X)² + E(X)²
       = E(X²) − E(X)².
Example Variance Derivation
P(X) = 2X,  0 ≤ X ≤ 1

  Var(X) = E((X − E(X))²) = E(X²) − E(X)²
         = ∫₀¹ 2X · X² dX − (2/3)²
         = [2X⁴/4]₀¹ − 4/9
         = 1/2 − 4/9
         = 1/18
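Extending the earlier sympy sketch to check the variance:

```python
import sympy as sp

X = sp.symbols("X")
pdf = 2 * X                                # P(X) = 2X on [0, 1]
EX = sp.integrate(X * pdf, (X, 0, 1))      # 2/3
EX2 = sp.integrate(X**2 * pdf, (X, 0, 1))  # E(X^2) = 1/2
print(EX2 - EX**2)                         # prints 1/18
```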
Variance Properties
Var(aX) = a² Var(X)
Var(aX + bY) = a² Var(X) + b² Var(Y)  if X ⊥ Y
Var(a + cX) = c² Var(X)  if a and c are both constants

More generally,

  Var(∑ᵢ aᵢ Xᵢ) = ∑ᵢ ∑ⱼ aᵢ aⱼ Cov(Xᵢ, Xⱼ)
Covariance
- The covariance between two real-valued random variables X and Y, with expected values E(X) = µ and E(Y) = ν, is defined as

  Cov(X, Y) = E((X − µ)(Y − ν))

- Which can be rewritten as

  Cov(X, Y) = E(XY − νX − µY + µν)
            = E(XY) − ν E(X) − µ E(Y) + µν
            = E(XY) − µν.
Covariance of Independent Variables
If X and Y are independent, then their covariance is zero. This follows because under independence

  E(XY) = E(X) E(Y) = µν,

and then

  Cov(X, Y) = µν − µν = 0.
Formal Statement of Model
Recall the model:

  Yi = β0 + β1 Xi + εi

- Yi is the value of the response variable in the i-th trial
- β0 and β1 are parameters
- Xi is a known constant, the value of the predictor variable in the i-th trial
- εi is a random error term with mean E(εi) = 0 and variance Var(εi) = σ²
- i = 1, . . . , n
Example
An experimenter gave three subjects a very difficult task. Data on the age of the subject (X) and on the number of attempts to accomplish the task before giving up (Y) follow:

Table:

  Subject i               1    2    3
  Age Xi                 20   55   30
  Number of attempts Yi   5   12   10
Least Squares Linear Regression
- Seek to minimize

  Q = ∑ᵢ₌₁ⁿ (Yi − (β0 + β1 Xi))²

- Choose b0 and b1 as estimators for β0 and β1. b0 and b1 will minimize the criterion Q for the given sample observations (X1, Y1), (X2, Y2), . . . , (Xn, Yn).
Guess #1 / Guess #2

[Figures omitted: two candidate regression lines drawn through the data, motivating the search for the best-fitting line.]
Function maximization
- Important technique to remember!
  - Take the derivative
  - Set the result equal to zero and solve
  - Test the second derivative at that point
- Question: does this always give you the maximum?
- Going further: multiple variables, convex optimization
Function Maximization
Find the value of x that maximizes the function

  f(x) = −x² + ln(x),  x > 0

1. Take the derivative and set it to zero:

  d/dx (−x² + ln(x)) = 0        (1)
  −2x + 1/x = 0                 (2)
  2x² = 1                      (3)
  x² = 1/2                     (4)
  x = √2/2                     (5)

2. Check the second derivative:

  d²/dx² (−x² + ln(x)) = d/dx (−2x + 1/x)   (6)
                       = −2 − 1/x²           (7)
                       < 0                   (8)

Then we have found the maximum: x* = √2/2, f(x*) = −(1/2)[1 + ln(2)].
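The same steps can be checked symbolically; a sketch with sympy:

```python
import sympy as sp

x = sp.symbols("x", positive=True)
f = -x**2 + sp.log(x)

crit = sp.solve(sp.diff(f, x), x)         # [sqrt(2)/2]
print(crit)
print(sp.diff(f, x, 2).subs(x, crit[0]))  # -4 < 0, so this is a maximum
print(sp.simplify(f.subs(x, crit[0])))    # -1/2 - log(2)/2
```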
Least Squares Max(min)imization
- Function to minimize w.r.t. b0 and b1, where b0 and b1 are called point estimators of β0 and β1 respectively:

  Q = ∑ᵢ₌₁ⁿ (Yi − (b0 + b1 Xi))²

- Minimize this by maximizing −Q
- Either way, find the partial derivatives and set both equal to zero:

  ∂Q/∂b0 = 0
  ∂Q/∂b1 = 0
Normal Equations
- The results of this maximization step are called the normal equations. b0 and b1 are called point estimators of β0 and β1 respectively:

  ∑ Yi = n b0 + b1 ∑ Xi
  ∑ Xi Yi = b0 ∑ Xi + b1 ∑ Xi²

- This is a system of two equations in two unknowns. The solution is given by. . .
Solution to Normal Equations
After a lot of algebra one arrives at

  b1 = ∑ (Xi − X̄)(Yi − Ȳ) / ∑ (Xi − X̄)²
  b0 = Ȳ − b1 X̄

where

  X̄ = (∑ Xi)/n,  Ȳ = (∑ Yi)/n
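A minimal sketch of these formulas in Python, applied to the toy age/attempts data from the earlier example:

```python
import numpy as np

# Toy data from the earlier example: subject age X, number of attempts Y
X = np.array([20.0, 55.0, 30.0])
Y = np.array([5.0, 12.0, 10.0])

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
print(b0, b1)   # b0 ~ 2.808, b1 ~ 0.177
```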
Least Squares Fit

[Figure omitted: the fitted regression line over the data.]
Questions to Ask
- Is the relationship really linear?
- What is the distribution of the “errors”?
- Is the fit good?
- How much of the variability of the response is accounted for by including the predictor variable?
- Is the chosen predictor variable the best one?
Goals for First Half of Course
- How to do linear regression
  - Familiarize yourself with software tools
- How to interpret standard linear regression results
- How to derive tests
- How to assess and address deficiencies in regression models
Estimators for β0, β1, σ²
- We want to establish properties of estimators for β0, β1, and σ² so that we can construct hypothesis tests and so forth
- We will start by establishing some properties of the regression solution.
Properties of Solution
- The i-th residual is defined to be

  ei = Yi − Ŷi

- The sum of the residuals is zero:

  ∑ᵢ ei = ∑ (Yi − b0 − b1 Xi)
        = ∑ Yi − n b0 − b1 ∑ Xi
        = ∑ Yi − n(Ȳ − b1 X̄) − b1 ∑ Xi
        = 0
Properties of Solution
The sum of the observed values Yi equals the sum of the fitted values Ŷi:

  ∑ᵢ Ŷi = ∑ᵢ (b1 Xi + b0)
        = ∑ᵢ (b1 Xi + Ȳ − b1 X̄)
        = b1 ∑ᵢ Xi + n Ȳ − b1 n X̄
        = b1 n X̄ + ∑ᵢ Yi − b1 n X̄
        = ∑ᵢ Yi
Properties of Solution
The sum of the weighted residuals is zero when the residual in the i-th trial is weighted by the level of the predictor variable in the i-th trial:

  ∑ᵢ Xi ei = ∑ Xi (Yi − b0 − b1 Xi)
           = ∑ᵢ Xi Yi − b0 ∑ Xi − b1 ∑ Xi²
           = 0

The last step is due to the second normal equation.
Properties of Solution
The sum of the weighted residuals is zero when the residual in the i-th trial is weighted by the fitted value of the response variable in the i-th trial:

  ∑ᵢ Ŷi ei = ∑ᵢ (b0 + b1 Xi) ei
           = b0 ∑ᵢ ei + b1 ∑ᵢ Xi ei
           = 0
Properties of Solution
The regression line always goes through the point (X̄, Ȳ).

Using the alternative form of the linear regression model

  Yi = β0* + β1 (Xi − X̄) + εi,

the least squares estimator b1 for β1 remains the same as before, and the least squares estimator for β0* = β0 + β1 X̄ becomes

  b0* = b0 + b1 X̄ = (Ȳ − b1 X̄) + b1 X̄ = Ȳ

Hence the estimated regression function is

  Ŷ = Ȳ + b1 (X − X̄)

Thus the regression line always goes through the point (X̄, Ȳ).
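These identities are easy to verify numerically. A sketch continuing the toy-data fit from above:

```python
import numpy as np

X = np.array([20.0, 55.0, 30.0])
Y = np.array([5.0, 12.0, 10.0])
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

Yhat = b0 + b1 * X
e = Y - Yhat
print(np.isclose(e.sum(), 0))                    # residuals sum to zero
print(np.isclose((X * e).sum(), 0))              # X-weighted residuals sum to zero
print(np.isclose((Yhat * e).sum(), 0))           # fitted-value-weighted residuals sum to zero
print(np.isclose(b0 + b1 * X.mean(), Y.mean()))  # line passes through (X-bar, Y-bar)
```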
Estimating Error Term Variance σ²
- Review estimation in a non-regression setting
- Show estimation results for the regression setting
Estimation Review
- An estimator is a rule that tells how to calculate the value of an estimate based on the measurements contained in a sample
- E.g., the sample mean

  Ȳ = (1/n) ∑ᵢ₌₁ⁿ Yi
Point Estimators and Bias
- Point estimator

  θ̂ = f({Y1, . . . , Yn})

- Unknown quantity / parameter

  θ

- Definition: bias of an estimator

  B(θ̂) = E(θ̂) − θ
Example
- Samples from a Normal(µ, σ²) distribution

  Yi ∼ Normal(µ, σ²)

- Estimate the population mean

  θ = µ,  θ̂ = Ȳ = (1/n) ∑ᵢ₌₁ⁿ Yi
Sampling Distribution of the Estimator
- First moment

  E(θ̂) = E((1/n) ∑ᵢ₌₁ⁿ Yi)
        = (1/n) ∑ᵢ₌₁ⁿ E(Yi)
        = nµ/n = θ

- This is an example of an unbiased estimator:

  B(θ̂) = E(θ̂) − θ = 0
Variance of Estimator
- Definition: variance of an estimator

  Var(θ̂) = E([θ̂ − E(θ̂)]²)

- Remember:

  Var(cY) = c² Var(Y)
  Var(∑ᵢ₌₁ⁿ Yi) = ∑ᵢ₌₁ⁿ Var(Yi)

  only if the Yi are independent with finite variance
Example Estimator Variance
- For the Normal(µ, σ²) mean estimator above

  Var(θ̂) = Var((1/n) ∑ᵢ₌₁ⁿ Yi)
          = (1/n²) ∑ᵢ₌₁ⁿ Var(Yi)
          = nσ²/n² = σ²/n

- Note the assumptions (independence, finite variance); see the simulation sketch below
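A quick simulation consistent with this result; µ = 5, σ = 2, n = 25 are arbitrary illustrative values, so Var(θ̂) should come out near σ²/n = 0.16:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 5.0, 2.0, 25

# 100,000 replications of the sample mean of n observations
means = rng.normal(mu, sigma, (100_000, n)).mean(axis=1)
print(means.mean())   # ~ mu       (unbiased)
print(means.var())    # ~ 0.16 = sigma^2 / n
```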
Bias Variance Trade-off
- The mean squared error of an estimator is

  MSE(θ̂) = E([θ̂ − θ]²)

- It can be re-expressed as

  MSE(θ̂) = Var(θ̂) + B(θ̂)²

  MSE = VAR + BIAS²
Proof
MSE(θ̂) = E((θ̂ − θ)²)
        = E(([θ̂ − E(θ̂)] + [E(θ̂) − θ])²)
        = E([θ̂ − E(θ̂)]²) + 2 E([E(θ̂) − θ][θ̂ − E(θ̂)]) + E([E(θ̂) − θ]²)
        = Var(θ̂) + 2 (E(θ̂) − θ) E(θ̂ − E(θ̂)) + B(θ̂)²    (E(θ̂) − θ is a constant)
        = Var(θ̂) + 2 · 0 + B(θ̂)²
        = Var(θ̂) + B(θ̂)²

since E(θ̂ − E(θ̂)) = E(θ̂) − E(θ̂) = 0.
Trade-off
- Think of variance as confidence and bias as correctness
  - Intuitions (largely) apply
- Sometimes choosing a biased estimator can result in an overall lower MSE if it exhibits lower variance; see the simulation sketch below
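A simulation sketch of this effect for normal data, comparing the unbiased sample variance (divisor n − 1) with the biased ML-style variance (divisor n); the biased version trades a small bias for lower variance and, here, a lower overall MSE:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, n = 4.0, 10
samples = rng.normal(0.0, np.sqrt(sigma2), (200_000, n))

s2 = samples.var(axis=1, ddof=1)   # unbiased estimator (divides by n - 1)
ml = samples.var(axis=1, ddof=0)   # biased estimator (divides by n)

for name, est in (("n-1", s2), ("n  ", ml)):
    bias = est.mean() - sigma2
    mse = np.mean((est - sigma2) ** 2)
    print(f"{name}: bias={bias:+.3f}  var={est.var():.3f}  mse={mse:.3f}")
```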
s² estimator for σ² for a single population
Sum of squares:

  ∑ᵢ₌₁ⁿ (Yi − Ȳ)²

Sample variance estimator:

  s² = ∑ᵢ₌₁ⁿ (Yi − Ȳ)² / (n − 1)

- s² is an unbiased estimator of σ².
- This sum of squares has n − 1 “degrees of freedom” associated with it; one degree of freedom is lost by using Ȳ as an estimate of the unknown population mean µ.
Estimating Error Term Variance σ² for Regression Model
- Regression model
- The variance of each observation Yi is σ² (the same as for the error term εi)
- Each Yi comes from a different probability distribution, with a different mean that depends on the level Xi
- The deviation of an observation Yi must therefore be calculated around its own estimated mean
s² estimator for σ²

  s² = MSE = SSE/(n − 2) = ∑ (Yi − Ŷi)²/(n − 2) = ∑ ei²/(n − 2)

- MSE is an unbiased estimator of σ²:

  E(MSE) = σ²

- The sum of squares SSE has n − 2 “degrees of freedom” associated with it (see the sketch after this list)
- Cochran’s theorem (later in the course) tells us where degrees of freedom come from and how to calculate them
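Continuing the toy example, a sketch of this computation (with n = 3 there is only n − 2 = 1 degree of freedom, so the estimate is very noisy):

```python
import numpy as np

X = np.array([20.0, 55.0, 30.0])
Y = np.array([5.0, 12.0, 10.0])
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

e = Y - (b0 + b1 * X)
n = len(X)
s2 = np.sum(e**2) / (n - 2)   # MSE = SSE / (n - 2)
print(s2)
```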
Normal Error Regression Model
- No matter how the error terms εi are distributed, the least squares method provides unbiased point estimators of β0 and β1
  - that also have minimum variance among all unbiased linear estimators
- To set up interval estimates and make tests we need to specify the distribution of the εi
- We will assume that the εi are normally distributed
Normal Error Regression Model
Yi = β0 + β1 Xi + εi

- Yi is the value of the response variable in the i-th trial
- β0 and β1 are parameters
- Xi is a known constant, the value of the predictor variable in the i-th trial
- εi ∼ iid N(0, σ²)
  Note this is different: now we know the distribution
- i = 1, . . . , n
Notational Convention
- When you see εi ∼ iid N(0, σ²)
- It is read as: the εi are independently and identically distributed according to a normal distribution with mean 0 and variance σ²
- Examples
  - θ ∼ Poisson(λ)
  - z ∼ G(θ)
Maximum Likelihood Principle
The method of maximum likelihood chooses as estimates thosevalues of the parameters that are most consistent with the sampledata.
Likelihood Function
If

  Xi ∼ F(Θ),  i = 1, . . . , n,

then the likelihood function is

  L({Xi}ᵢ₌₁ⁿ, Θ) = ∏ᵢ₌₁ⁿ F(Xi; Θ)
Maximum Likelihood Estimation
- The likelihood function can be maximized w.r.t. the parameter(s) Θ; doing this, one can arrive at estimators for the parameters as well.

  L({Xi}ᵢ₌₁ⁿ, Θ) = ∏ᵢ₌₁ⁿ F(Xi; Θ)

- To do this, find solutions to (analytically or by following the gradient)

  dL({Xi}ᵢ₌₁ⁿ, Θ)/dΘ = 0
Important Trick
(Almost) never maximize the likelihood function directly; maximize the log likelihood function instead:

  log(L({Xi}ᵢ₌₁ⁿ, Θ)) = log(∏ᵢ₌₁ⁿ F(Xi; Θ))
                      = ∑ᵢ₌₁ⁿ log(F(Xi; Θ))

Quite often the log of the density is easier to work with mathematically.
ML Normal Regression
Likelihood function:

  L(β0, β1, σ²) = ∏ᵢ₌₁ⁿ (1/(2πσ²)^(1/2)) · exp(−(1/(2σ²)) (Yi − β0 − β1 Xi)²)
                = (1/(2πσ²)^(n/2)) · exp(−(1/(2σ²)) ∑ᵢ₌₁ⁿ (Yi − β0 − β1 Xi)²)

which, if you maximize (how?) w.r.t. the parameters, gives you. . .
Maximum Likelihood Estimator(s)
- β̂0 = b0, same as in the least squares case
- β̂1 = b1, same as in the least squares case
- σ̂² = ∑ᵢ (Yi − Ŷi)² / n
- Note that the ML estimator σ̂² is biased, as s² is unbiased and

  s² = MSE = (n/(n − 2)) σ̂²

  (checked numerically below)
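The maximization can also be done numerically. A sketch using scipy.optimize (my tool choice, not the lecture's) that recovers the least squares b0 and b1 and the biased σ̂² for the toy data:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([20.0, 55.0, 30.0])
Y = np.array([5.0, 12.0, 10.0])
n = len(X)

def neg_log_lik(params):
    b0, b1, log_s2 = params          # optimize log(sigma^2) so sigma^2 stays positive
    s2 = np.exp(log_s2)
    resid = Y - b0 - b1 * X
    return 0.5 * n * np.log(2 * np.pi * s2) + np.sum(resid**2) / (2 * s2)

res = minimize(neg_log_lik, x0=[0.0, 0.0, 0.0])
b0, b1, log_s2 = res.x
print(b0, b1, np.exp(log_s2))        # b0, b1 match least squares; sigma^2_hat = SSE / n
```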
Comments
- Least squares minimizes the squared error between the prediction and the true output
- The normal distribution is fully characterized by its first two central moments (mean and variance)