Nonlinear Regression (Part 1)
Christof Seiler
Stanford University, Spring 2016, STATS 205
Overview
- Smoothing or estimating curves
- Density estimation
- Nonlinear regression
- Rank-based linear regression
Curve Estimation
- A curve of interest can be a probability density function $f$
- In density estimation, we observe $X_1, \dots, X_n$ from some unknown cdf $F$ with density $f$:

  $$X_1, \dots, X_n \sim f$$

- The goal is to estimate the density $f$
Density Estimation

[Two figures: density estimates of the rating data; x-axis: rating, y-axis: density]
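The slides contain no code, but a density estimate like the ones plotted above can be sketched with a Gaussian kernel. The bimodal sample below is a hypothetical stand-in for the rating data, not the actual dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the "rating" data: a bimodal sample.
data = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(2.5, 0.7, 200)])

def kde(grid, data, h):
    # Gaussian kernel density estimate: f_hat(x) = (1/(n h)) * sum_i K((x - X_i)/h)
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

grid = np.linspace(-4.0, 6.0, 500)
f_hat = kde(grid, data, h=0.4)

# A density estimate is nonnegative and integrates to about 1.
area = f_hat.sum() * (grid[1] - grid[0])
```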
Nonlinear Regression
- A curve of interest can be a regression function $r$
- In regression, we observe pairs $(x_1, Y_1), \dots, (x_n, Y_n)$ that are related as

  $$Y_i = r(x_i) + \varepsilon_i, \quad E(\varepsilon_i) = 0$$

- The goal is to estimate the regression function $r$
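As a sketch (not from the slides): simulate data from this model with a hypothetical $r$ and estimate it by local averaging, one of the simplest smoothers. The sine function and window width are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.sort(rng.uniform(0.0, 1.0, n))

def r(t):
    # Hypothetical true regression function, chosen only for illustration.
    return np.sin(2 * np.pi * t)

y = r(x) + rng.normal(0.0, 0.3, n)   # Y_i = r(x_i) + eps_i, E(eps_i) = 0

def local_average(x0, x, y, h):
    # Estimate r(x0) by averaging the Y_i whose x_i fall within h of x0.
    window = np.abs(x - x0) <= h
    return y[window].mean()

r_hat = np.array([local_average(t, x, y, h=0.05) for t in x])
```

The smoother's squared error against the true curve should be well below the raw noise variance of 0.09.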
Nonlinear Regression

[Four figures: Bone Mineral Density Data; x-axis: Age (10 to 25), y-axis: Change in BMD, grouped by gender (female/male)]
The Bias–Variance Tradeoff
- Let $\hat f_n(x)$ be an estimate of a function $f(x)$
- Define the squared error loss function as

  $$\text{Loss} = L(f(x), \hat f_n(x)) = (f(x) - \hat f_n(x))^2$$

- Define the average of this loss as the risk, or mean squared error (MSE):

  $$\text{MSE} = R(f(x), \hat f_n(x)) = E(\text{Loss})$$

- The expectation is taken with respect to $\hat f_n$, which is random
- The MSE can be decomposed into a bias term and a variance term:

  $$\text{MSE} = \text{Bias}^2 + \text{Var}$$

- The decomposition is easy to show
The Bias–Variance Tradeoff
- Expand:

  $$E((f - \hat f)^2) = E(f^2 + \hat f^2 - 2 f \hat f) = E(f^2) + E(\hat f^2) - E(2 f \hat f)$$

- Use $\text{Var}(X) = E(X^2) - E(X)^2$:

  $$E((f - \hat f)^2) = \text{Var}(f) + E(f)^2 + \text{Var}(\hat f) + E(\hat f)^2 - E(2 f \hat f)$$

- Use $E(f) = f$ and $\text{Var}(f) = 0$:

  $$E((f - \hat f)^2) = f^2 + \text{Var}(\hat f) + E(\hat f)^2 - 2 f E(\hat f)$$

- Use $(E(\hat f) - f)^2 = f^2 + E(\hat f)^2 - 2 f E(\hat f)$:

  $$E((f - \hat f)^2) = (E(\hat f) - f)^2 + \text{Var}(\hat f) = \text{Bias}^2 + \text{Var}$$
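A quick Monte Carlo check of the decomposition (a sketch, not from the slides; the sample-mean estimator and its 0.1 offset are made up so that both bias and variance are nonzero):

```python
import numpy as np

rng = np.random.default_rng(2)
f_true = 2.0                 # the fixed quantity f(x) being estimated
n, reps, offset = 25, 200_000, 0.1

# Hypothetical estimator: sample mean of n noisy observations plus a
# deliberate systematic offset, so it has both bias and variance.
est = rng.normal(f_true, 1.0, size=(reps, n)).mean(axis=1) + offset

mse = np.mean((est - f_true) ** 2)
bias = est.mean() - f_true
var = est.var()
# The identity MSE = Bias^2 + Var holds exactly for these sample moments.
```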
The Bias–Variance Tradeoff
- This described the risk at one point
- To summarize the risk for density problems, we integrate:

  $$R(f, \hat f_n) = \int R(f(x), \hat f_n(x)) \, dx$$

- For regression problems, we average over the design points:

  $$R(r, \hat r_n) = \frac{1}{n} \sum_{i=1}^n R(r(x_i), \hat r_n(x_i))$$
The Bias–Variance Tradeoff
- Consider the regression model

  $$Y_i = r(x_i) + \varepsilon_i$$

- Suppose we draw a new observation $Y_i^* = r(x_i) + \varepsilon_i^*$ at each $x_i$
- If we predict $Y_i^*$ with $\hat r_n(x_i)$, then the squared prediction error is

  $$(Y_i^* - \hat r_n(x_i))^2 = (r(x_i) + \varepsilon_i^* - \hat r_n(x_i))^2$$

- Define the predictive risk as

  $$E\left(\frac{1}{n} \sum_{i=1}^n (Y_i^* - \hat r_n(x_i))^2\right)$$
The Bias–Variance Tradeoff
- Up to a constant, the average risk and the predictive risk are the same:

  $$E\left(\frac{1}{n} \sum_{i=1}^n (Y_i^* - \hat r_n(x_i))^2\right) = R(r, \hat r_n) + \frac{1}{n} \sum_{i=1}^n E((\varepsilon_i^*)^2)$$

- In particular, if the error $\varepsilon_i$ has variance $\sigma^2$, then

  $$E\left(\frac{1}{n} \sum_{i=1}^n (Y_i^* - \hat r_n(x_i))^2\right) = R(r, \hat r_n) + \sigma^2$$
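A simulation sketch of this identity (assumed setup, not from the slides; a deliberately crude constant estimator keeps the risk term visibly nonzero):

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, reps = 20, 0.5, 20_000
x = np.linspace(0.0, 1.0, n)
r = 2.0 * x                                  # hypothetical true regression function

pred_err = np.empty(reps)
sq_err = np.empty(reps)
for k in range(reps):
    y = r + rng.normal(0.0, sigma, n)        # training responses
    r_hat = np.full(n, y.mean())             # crude estimator: fit a constant
    y_star = r + rng.normal(0.0, sigma, n)   # fresh responses at the same x_i
    pred_err[k] = np.mean((y_star - r_hat) ** 2)
    sq_err[k] = np.mean((r - r_hat) ** 2)

predictive_risk = pred_err.mean()
R = sq_err.mean()   # average risk: (1/n) sum_i E((r(x_i) - r_hat(x_i))^2)
# predictive_risk should be close to R + sigma^2
```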
The Bias–Variance Tradeoff
- The challenge in smoothing is to determine how much smoothing to do
- When the data are oversmoothed, the bias term is large and the variance is small
- When the data are undersmoothed, the opposite is true
- This is called the bias–variance tradeoff
- Minimizing risk corresponds to balancing bias and variance
The Bias–Variance Tradeoff
[Figure illustrating the bias–variance tradeoff. Source: Wasserman (2006)]
The Bias–Variance Tradeoff (Example)
- Let $f$ be a pdf
- Consider estimating $f(0)$
- Let $h$ be a small, positive number
- Define

  $$p_h := P\left(-\frac{h}{2} < X < \frac{h}{2}\right) = \int_{-h/2}^{h/2} f(x) \, dx \approx h f(0)$$

- Hence

  $$f(0) \approx \frac{p_h}{h}$$
The Bias–Variance Tradeoff (Example)
- Let $X$ be the number of observations in the interval $(-h/2, h/2)$
- Then $X \sim \text{Binom}(n, p_h)$
- An estimate of $p_h$ is $\hat p_h = X/n$, and an estimate of $f(0)$ is

  $$\hat f_n(0) = \frac{\hat p_h}{h} = \frac{X}{nh}$$

- We now show that the MSE of $\hat f_n(0)$ is (for some constants $A$ and $B$)

  $$\text{MSE} = A h^4 + \frac{B}{nh} = \text{Bias}^2 + \text{Variance}$$
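This estimator is easy to try numerically (a sketch, assuming standard normal data so that $f(0) = 1/\sqrt{2\pi} \approx 0.3989$; the slides do not fix a distribution):

```python
import numpy as np

rng = np.random.default_rng(4)
n, h = 100_000, 0.2
x = rng.normal(0.0, 1.0, n)            # sample from f = standard normal density

X = np.sum(np.abs(x) < h / 2)          # count of observations in (-h/2, h/2)
f0_hat = X / (n * h)                   # f_hat_n(0) = p_hat_h / h = X / (n h)

f0_true = 1.0 / np.sqrt(2.0 * np.pi)   # f(0) ≈ 0.3989
```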
The Bias–Variance Tradeoff (Example)
- Taylor expand around 0:

  $$f(x) \approx f(0) + x f'(0) + \frac{x^2}{2} f''(0)$$

- Plug in:

  $$p_h = \int_{-h/2}^{h/2} f(x) \, dx \approx \int_{-h/2}^{h/2} \left(f(0) + x f'(0) + \frac{x^2}{2} f''(0)\right) dx = h f(0) + \frac{f''(0) h^3}{24}$$
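Checking this approximation numerically for the standard normal, where $f''(x) = (x^2 - 1) f(x)$ gives $f''(0) = -f(0)$ (a sketch, not in the slides):

```python
from math import erf, sqrt, pi

f0 = 1.0 / sqrt(2.0 * pi)        # f(0) for the standard normal
f2_0 = -f0                       # f''(x) = (x^2 - 1) f(x), so f''(0) = -f(0)

h = 0.2
p_exact = erf((h / 2) / sqrt(2))          # Phi(h/2) - Phi(-h/2), exact
p_taylor = h * f0 + f2_0 * h**3 / 24      # second-order Taylor approximation
p_crude = h * f0                          # zeroth-order approximation
```

The Taylor-corrected value lands much closer to the exact probability than $h f(0)$ alone.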
The Bias–Variance Tradeoff (Example)
- Since $X$ is binomial, we have $E(X) = n p_h$
- Use the Taylor approximation $p_h \approx h f(0) + \frac{f''(0) h^3}{24}$ and combine:

  $$E(\hat f_n(0)) = \frac{E(X)}{nh} = \frac{p_h}{h} \approx f(0) + \frac{f''(0) h^2}{24}$$

- After rearranging, the bias is

  $$\text{Bias} = E(\hat f_n(0)) - f(0) \approx \frac{f''(0) h^2}{24}$$
The Bias–Variance Tradeoff (Example)
- For the variance term, note that $\text{Var}(X) = n p_h (1 - p_h)$, so

  $$\text{Var}(\hat f_n(0)) = \frac{\text{Var}(X)}{n^2 h^2} = \frac{p_h (1 - p_h)}{n h^2}$$

- Use $1 - p_h \approx 1$, since $h$ is small:

  $$\text{Var}(\hat f_n(0)) \approx \frac{p_h}{n h^2}$$

- Combine with the Taylor expansion:

  $$\text{Var}(\hat f_n(0)) \approx \frac{h f(0) + \frac{f''(0) h^3}{24}}{n h^2} = \frac{f(0)}{nh} + \frac{f''(0) h}{24 n} \approx \frac{f(0)}{nh}$$
The Bias–Variance Tradeoff (Example)
- Combining both terms:

  $$\text{MSE} = \text{Bias}^2 + \text{Var}(\hat f_n(0)) = \frac{(f''(0))^2 h^4}{576} + \frac{f(0)}{nh} \equiv A h^4 + \frac{B}{nh}$$

- As we smooth more (increase $h$), the bias term increases and the variance term decreases
- As we smooth less (decrease $h$), the bias term decreases and the variance term increases
- This is a typical bias–variance analysis
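Minimizing $A h^4 + B/(nh)$ over $h$ makes the tradeoff concrete: setting the derivative $4A h^3 - B/(n h^2)$ to zero gives $h^* = (B/(4An))^{1/5}$, so the optimal bandwidth shrinks like $n^{-1/5}$ and the optimal MSE like $n^{-4/5}$. A numerical sketch (standard normal constants assumed, not fixed by the slides):

```python
import numpy as np

f0 = 1.0 / np.sqrt(2.0 * np.pi)   # assume f = standard normal: f(0), f''(0) = -f(0)
A = f0**2 / 576                   # A = (f''(0))^2 / 576
B = f0                            # B = f(0)
n = 10_000

def mse(h):
    return A * h**4 + B / (n * h)

# Grid search for the minimizer, compared with the closed form
# h* = (B / (4 A n))^(1/5) from setting d(MSE)/dh = 4 A h^3 - B/(n h^2) = 0.
hs = np.linspace(0.01, 2.0, 100_000)
h_grid = hs[np.argmin(mse(hs))]
h_star = (B / (4 * A * n)) ** 0.2
```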
The Curse of Dimensionality

- A problem with smoothing is the curse of dimensionality
- Estimation gets harder as the dimension of the observations increases
- Computational: the computational burden increases exponentially with dimension
- Statistical: if the data have dimension $d$, then we need the sample size $n$ to grow exponentially with $d$
- The MSE of any nonparametric estimator of a smooth curve has the form (for some $c > 0$)

  $$\text{MSE} \approx c \, n^{-4/(4+d)}$$

- If we want a fixed MSE equal to some small number $\delta$, then solving for $n$ gives

  $$n \propto \left(\frac{c}{\delta}\right)^{d/4}$$
The Curse of Dimensionality
- We see that $n \propto (c/\delta)^{d/4}$ grows exponentially with the dimension $d$
- The reason is that smoothing involves estimating a function in a local neighborhood
- But in high-dimensional problems the data are very sparse, so local neighborhoods contain very few points
The Curse of Dimensionality (Example)

- Suppose $n$ data points are uniformly distributed on the interval $[0, 1]$
- How many data points will we find in the interval $[0, 0.1]$? The answer: about $n/10$ points
- Now suppose $n$ points are uniformly distributed in the 10-dimensional unit cube $[0, 1]^{10}$
- How many data points will we find in $[0, 0.1]^{10}$? The answer: about

  $$n \times \left(\frac{0.1}{1}\right)^{10} = \frac{n}{10{,}000{,}000{,}000}$$

- Thus, $n$ has to be huge to ensure that small neighborhoods have any data in them
- Smoothing methods can in principle be used in high-dimensional problems
- But the estimator won't be accurate, and the confidence interval around the estimate will be distressingly large
The Curse of Dimensionality (Example)
[Figure. Source: Hastie, Tibshirani, and Friedman (2009)]

- In ten dimensions, we need 80% of the range of each coordinate to capture 10% of the data
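The 80% figure follows from solving $\text{edge}^d = r$ for the edge length of a subcube capturing a fraction $r$ of uniform data (a sketch, not from the slides):

```python
# To capture a fraction r of uniformly distributed data in [0, 1]^d with a
# subcube, the subcube needs volume r, i.e. edge length r ** (1 / d).
def edge_length(r, d):
    return r ** (1.0 / d)

e1 = edge_length(0.10, 1)     # 0.10: 10% of the range in one dimension
e10 = edge_length(0.10, 10)   # about 0.794: ~80% of the range in ten dimensions
```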
References
- Wasserman (2006). All of Nonparametric Statistics
- Hastie, Tibshirani, and Friedman (2009). The Elements of Statistical Learning